# 🚀 Training Llama 3.2 3B for a Form Generator

This notebook fine-tunes a Llama 3.2 3B model with LoRA to generate JSON form structures.

**Recommended configuration:**
- Runtime: Python 3
- Hardware accelerator: GPU (T4, V100, or A100)
- RAM: High-RAM if available

**Estimated time:** 1-3 hours depending on the GPU
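
The GPU choice above mostly comes down to memory. As a rough, back-of-the-envelope sketch, the weights alone need the amounts of memory printed below at different precisions. The figure of about 3.2 billion parameters is an assumption, since the exact count is not stated in this notebook. Activations, gradients, and optimizer state come on top of these numbers.

```python
# Rough estimate of the memory taken by the model weights alone.
# ASSUMPTION: ~3.2e9 parameters; the exact count is not given in this notebook.
N_PARAMS = 3.2e9

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}

def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Return the weight memory in GB (1 GB = 1e9 bytes, as in the GPU check cell)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weight_memory_gb(N_PARAMS, dtype):.1f} GB")
```

Comparing these figures with the memory reported by the GPU check cell gives a first idea of how much headroom is left for training.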

## 📋 Step 1: Environment setup

Install the required libraries.

In [None]:
# Install dependencies
!pip install -q -U transformers datasets accelerate peft bitsandbytes sentencepiece
!pip install -q torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

print("✅ Installation complete!")

In [None]:
# Check the GPU
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
    print(f"CUDA version: {torch.version.cuda}")
else:
    print("⚠️ Warning: no GPU available. Training will be very slow.")
    print("Go to Runtime > Change runtime type > Hardware accelerator > GPU")

## 💾 Step 2: Connect to Google Drive

Drive is used to save the trained model.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Create the backup folder
!mkdir -p /content/drive/MyDrive/llama3-form-generator

print("✅ Google Drive mounted!")

## 🔐 Step 3: Hugging Face authentication

Llama 3.2 3B is a gated model: you must accept its license and supply a Hugging Face token.

**Instructions:**
1. Go to https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct
2. Click "Agree and access repository"
3. Create a token at https://huggingface.co/settings/tokens
4. Paste it below

In [None]:
from huggingface_hub import login
from getpass import getpass

# Enter your Hugging Face token (the input is hidden)
token = getpass("Enter your Hugging Face token: ")
login(token=token)

print("✅ Authentication successful!")

## 📁 Step 4: Data preparation

Two options:
- **Option A**: Upload your existing dataset
- **Option B**: Clone the GitHub repository

In [None]:
# Option A: manual dataset upload
from google.colab import files

print("Upload your training_dataset.jsonl file:")
uploaded = files.upload()

print("✅ Dataset uploaded!")

In [None]:
# Option B: clone the repository (uncomment if you use this option,
# and replace VOTRE_USERNAME with your GitHub username)
# !git clone https://github.com/VOTRE_USERNAME/FORM_EDITOR_LLM.git
# %cd FORM_EDITOR_LLM
# !python prepare_dataset.py
# !cp training_dataset.jsonl /content/

In [None]:
# Check the dataset
import json
import os

dataset_path = "/content/training_dataset.jsonl"

if os.path.exists(dataset_path):
    with open(dataset_path, 'r', encoding='utf-8') as f:
        # Skip blank lines so the count matches the number of records
        lines = [line for line in f if line.strip()]
    print(f"✅ Dataset found: {len(lines)} examples")

    # Show one example
    if lines:
        example = json.loads(lines[0])
        print("\n📋 Example:")
        print(f"Instruction: {example['instruction'][:100]}...")
        print(f"Output: {example['output'][:100]}...")
else:
    print("❌ Dataset not found. Please upload training_dataset.jsonl")
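
The check above only inspects the first record. Since the training cell reads the `instruction` and `output` fields of every line, a fuller validation pass can catch malformed records before training starts. The sketch below is one possible approach. `validate_jsonl` is a hypothetical helper, and the sample lines are invented for illustration.

```python
import json

REQUIRED_KEYS = ("instruction", "output")

def validate_jsonl(lines):
    """Return a list of (line_number, problem) tuples for invalid records."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as error:
            problems.append((number, f"invalid JSON: {error.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "record is not a JSON object"))
            continue
        for key in REQUIRED_KEYS:
            value = record.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append((number, f"missing or empty field '{key}'"))
    return problems

# Tiny in-memory sample (invented). In the notebook, pass the lines read
# from /content/training_dataset.jsonl instead.
sample = [
    '{"instruction": "Create a contact form", "output": "{}"}',
    '{"instruction": "No output here"}',
    'not json at all',
]
for number, problem in validate_jsonl(sample):
    print(f"line {number}: {problem}")
```

An empty result means every record has both fields as non-empty strings, which is what the prompt template in the training cell expects.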

## 🏋️ Step 5: Training the model

This cell fine-tunes Llama 3.2 3B.
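
The training cell attaches LoRA adapters of rank `r=16` to seven projection matrices in every decoder layer. For a linear layer with input size `d_in` and output size `d_out`, LoRA adds two small matrices holding `r * (d_in + d_out)` trainable weights in total. The sketch below works through that arithmetic. The layer dimensions are assumptions (they are not stated in this notebook), so treat the result as an estimate to compare against the output of `model.print_trainable_parameters()`.

```python
# Count the trainable weights LoRA adds: r * (d_in + d_out) per adapted layer.
# ASSUMPTION: the dimensions below are believed to match Llama 3.2 3B
# (hidden size 3072, MLP size 8192, 28 layers, key/value projection width 1024).
# They are not stated in this notebook; check model.config before relying on them.
HIDDEN = 3072
KV_DIM = 1024
MLP = 8192
NUM_LAYERS = 28
RANK = 16  # r in LoraConfig

# (d_in, d_out) for each module listed in target_modules
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Weights in the A (rank x d_in) and B (d_out x rank) matrices."""
    return rank * (d_in + d_out)

per_layer = sum(lora_params(d_in, d_out, RANK) for d_in, d_out in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"LoRA weights per layer: {per_layer:,}")
print(f"LoRA weights in total:  {total:,}")
```

Doubling `r` doubles the adapter size, which is the main lever if the adapters turn out to be too small or too large for the dataset.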

In [None]:
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset

print("🚀 Starting training setup...\n")

# Configuration
model_name = "meta-llama/Llama-3.2-3B-Instruct"
output_dir = "/content/llama3-form-generator"

# Load the tokenizer
print("📥 Loading the tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True
)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id

# Load the model
print("📥 Loading the model (this may take a few minutes)...")
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    low_cpu_mem_usage=True,
    trust_remote_code=True
)

# Prepare the model for LoRA training
model = prepare_model_for_kbit_training(model)

# LoRA configuration
print("🔧 Configuring LoRA...")
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj"
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Load the dataset
print("\n📚 Loading the dataset...")
dataset = load_dataset('json', data_files="/content/training_dataset.jsonl", split='train')
print(f"Dataset size: {len(dataset)} examples")

# Prompt format for Llama 3.2 Instruct.
# The system prompt is kept in French; it must match the one used at inference time.
def format_and_tokenize(example):
    text = f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Tu es un assistant spécialisé dans la génération de structures de formulaires JSON.<|eot_id|><|start_header_id|>user<|end_header_id|>

{example['instruction']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{example['output']}<|eot_id|>"""

    return tokenizer(
        text,
        # The template already starts with <|begin_of_text|>, so do not let
        # the tokenizer prepend a second BOS token.
        add_special_tokens=False,
        padding="max_length",
        truncation=True,
        max_length=2048
    )

print("🔄 Tokenizing the dataset...")
dataset = dataset.map(
    format_and_tokenize,
    remove_columns=dataset.column_names,
    desc="Tokenizing"
)
dataset = dataset.train_test_split(test_size=0.1)

print(f"Train: {len(dataset['train'])} | Test: {len(dataset['test'])}")

# Training configuration
training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=3,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=100,
    save_total_limit=2,
    fp16=True,
    optim="paged_adamw_32bit",
    dataloader_num_workers=2,
    report_to="none",
    load_best_model_at_end=True,
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
)

print("\n🏋️ Training started...")
print("=" * 60)

# Train
trainer.train()

print("\n✅ Training complete!")
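
The `TrainingArguments` above combine a per-device batch size of 2 with 4 gradient-accumulation steps, so each optimizer update sees 8 examples on a single GPU. The sketch below derives the resulting number of optimizer steps and warmup steps. `training_schedule` is a hypothetical helper, the training-set size of 800 examples is a made-up figure, and the rounding only approximates what the `Trainer` computes internally.

```python
import math

def training_schedule(n_train, batch_size=2, grad_accum=4, epochs=3,
                      warmup_ratio=0.1, n_gpus=1):
    """Approximate the step counts implied by the notebook's hyperparameters."""
    # Examples consumed per optimizer update
    effective_batch = batch_size * grad_accum * n_gpus
    steps_per_epoch = math.ceil(n_train / effective_batch)
    total_steps = steps_per_epoch * epochs
    # Steps spent ramping the learning rate up before the cosine decay
    warmup_steps = round(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }

# HYPOTHETICAL training-set size, for illustration only
schedule = training_schedule(n_train=800)
for name, value in schedule.items():
    print(f"{name}: {value}")
```

Knowing the total step count helps judge whether `eval_steps=50` and `save_steps=100` fire often enough for a given dataset size.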

## 💾 Step 6: Saving the model

In [None]:
# Save the model locally
print("💾 Saving the model...")
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)

print(f"✅ Model saved to {output_dir}")

# Copy to Google Drive
print("\n📤 Copying to Google Drive...")
!cp -r {output_dir} /content/drive/MyDrive/

print("✅ Model saved to Google Drive!")
print("📁 Path: /content/drive/MyDrive/llama3-form-generator")

## 🧪 Step 7: Testing the model

Let's try the model on an example.
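
The training cell and the test cell both spell out the Llama 3 chat template by hand, and the two copies must stay identical up to the assistant header. One way to avoid drift is a single helper, sketched below. `build_prompt`, `_turn`, and `SYSTEM_PROMPT` are hypothetical names introduced for illustration; the special tokens and the system prompt are copied from the cells in this notebook.

```python
from typing import Optional

# System prompt copied from the notebook (kept in French)
SYSTEM_PROMPT = (
    "Tu es un assistant spécialisé dans la génération "
    "de structures de formulaires JSON."
)

def _turn(role: str, content: str) -> str:
    """Format one complete chat turn in the Llama 3 template."""
    return f"<|start_header_id|>{role}<|end_header_id|>\n\n{content}<|eot_id|>"

def build_prompt(instruction: str, output: Optional[str] = None) -> str:
    """Build a training example (output given) or an inference prompt (output=None)."""
    prompt = "<|begin_of_text|>" + _turn("system", SYSTEM_PROMPT) + _turn("user", instruction)
    if output is None:
        # Inference: leave the assistant turn open for the model to complete
        return prompt + "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt + _turn("assistant", output)

inference_prompt = build_prompt("Create a login form")
training_text = build_prompt("Create a login form", '{"fields": []}')
# The training text is the inference prompt plus the answer and its end marker
print(training_text.startswith(inference_prompt))
```

Because both strings come from the same function, a change to the system prompt automatically applies to training and inference alike.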

In [None]:
import json

def test_model(instruction):
    # The system prompt is kept in French; it must match the one used for training.
    prompt = f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Tu es un assistant spécialisé dans la génération de structures de formulaires JSON.<|eot_id|><|start_header_id|>user<|end_header_id|>

{instruction}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""

    # The template already contains <|begin_of_text|>, so skip the automatic BOS token.
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)

    model.eval()  # switch off dropout for generation
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=1024,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )

    # Decode only the newly generated tokens. Splitting the full text on
    # "assistant" would cut inside the system prompt, which contains that word.
    prompt_length = inputs["input_ids"].shape[1]
    return tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()

# Test
print("🧪 Testing the model:\n")
# Example instruction (in French): "Create a registration form with
# last name, first name, email and phone"
instruction = "Crée un formulaire d'inscription avec nom, prénom, email et téléphone"
print(f"Instruction: {instruction}\n")

result = test_model(instruction)
print(f"Result:\n{result}")
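
Because the goal is a JSON form structure, it helps to check that the generated text actually parses. The sketch below is one possible approach: it first tries to parse the whole response, then falls back to the span between the first `{` and the last `}` in case the model wrapped the JSON in extra text. `extract_json` is a hypothetical helper, and the sample response is invented for illustration.

```python
import json

def extract_json(response: str):
    """Return the parsed JSON value found in response, or None if nothing parses."""
    try:
        return json.loads(response)
    except json.JSONDecodeError:
        pass
    # Fallback: keep only the outermost {...} span
    start, end = response.find("{"), response.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(response[start:end + 1])
    except json.JSONDecodeError:
        return None

# Invented response, standing in for the text returned by test_model
sample_response = (
    'Here is the form: {"title": "Sign-up", '
    '"fields": [{"name": "email", "type": "email"}]}'
)
form = extract_json(sample_response)
print(form["fields"][0]["name"] if form else "no valid JSON found")
```

In the notebook, `extract_json(result)` could be called on the value returned by `test_model`; a `None` result signals that the generation needs another attempt or more training.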

## 📥 Step 8: Download the model (optional)

Run this cell if you want a local copy of the model.

In [None]:
# Compress the model
!zip -r llama3-form-generator.zip {output_dir}

# Download
from google.colab import files
files.download('llama3-form-generator.zip')

print("✅ Download started!")

## 📊 Summary

### What you did:
1. ✅ Set up the Colab environment
2. ✅ Authenticated with Hugging Face
3. ✅ Loaded the training dataset
4. ✅ Fine-tuned Llama 3.2 3B with LoRA
5. ✅ Saved the model to Google Drive
6. ✅ Tested the model

### Next steps:
- Use the `inference_colab.ipynb` notebook to generate forms
- The model is available on your Drive: `/content/drive/MyDrive/llama3-form-generator`
- You can refine the training by adjusting the hyperparameters

### Resources:
- [Llama 3.2 documentation](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct)
- [LoRA documentation](https://huggingface.co/docs/peft/main/en/conceptual_guides/lora)
- [Transformers documentation](https://huggingface.co/docs/transformers/)