<a href="https://colab.research.google.com/github/Slebbon/TextGeneration_Projet_PSL_EnC/blob/main/FLAN_T2T_10_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **FLAN MODEL:** Text 2 Text

### **Importing the required libraries**

In [1]:
! pip install -U accelerate
! pip install -U transformers
! pip install -U torch
! pip install datasets



In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

### **Loading the dataset**

In [3]:
# Load the tokenizer and the pre-trained Flan model
model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)



In [4]:
from datasets import load_dataset

In [37]:
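# Load the CSV files
# Note: data_dir is defined but not used below; the file names are resolved
# relative to the current working directory.
data_dir = r"C:\Users\marco\OneDrive\Desktop\TXTgeneration\Dataset\Divided_Dataset"

datasets = load_dataset('csv', data_files={
    'train': 'train_10.csv',
    'validation': 'validation_10.csv',
    'test': 'test_10.csv'
})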


In [38]:
# Tokenization function: the text is the model input, the author name is the target
def tokenize_function(examples):
    model_inputs = tokenizer(examples["Text"], padding="max_length", truncation=True, max_length=512)
    model_inputs["labels"] = tokenizer(examples["Author"], padding="max_length", truncation=True, max_length=512)["input_ids"]
    return model_inputs

# Tokenize the train, validation, and test splits
tokenized_datasets = datasets.map(tokenize_function, batched=True)

Map:   0%|          | 0/966 [00:00<?, ? examples/s]

Map:   0%|          | 0/293 [00:00<?, ? examples/s]

Map:   0%|          | 0/290 [00:00<?, ? examples/s]
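The tokenization cell pads the labels to 512 tokens, and those padding ids are counted in the training and validation loss unless they are masked. A common convention with Hugging Face seq2seq models is to replace label padding with `-100`, the index that PyTorch's cross-entropy loss ignores by default. The sketch below is not part of the original notebook: `mask_pad_labels` is a hypothetical helper, and the token ids are toy values chosen for illustration.

```python
def mask_pad_labels(label_ids, pad_token_id, ignore_index=-100):
    """Replace padding ids in each label sequence with ignore_index."""
    return [
        [tok if tok != pad_token_id else ignore_index for tok in seq]
        for seq in label_ids
    ]

# Toy label ids (hypothetical): 0 stands in for the pad id, 1 for end-of-sequence.
example_labels = [[8571, 1, 0, 0], [3, 9, 1, 0]]
masked_labels = mask_pad_labels(example_labels, pad_token_id=0)
print(masked_labels)  # [[8571, 1, -100, -100], [3, 9, 1, -100]]
```

Inside `tokenize_function`, the helper could be applied to the label ids with `pad_token_id=tokenizer.pad_token_id`. Doing so would change the loss values, so they would no longer be comparable with the ones recorded in this notebook.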

### **FLAN MODEL:** Training

In [7]:
# Define the training parameters
training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # renamed to `eval_strategy` in recent transformers releases
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True
)




In [8]:
# Create the trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer
)

In [9]:
# Run the fine-tuning
trainer.train()

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | No log        | 0.237229        |
| 2     | No log        | 0.022209        |
| 3     | 5.101100      | 0.010542        |


TrainOutput(global_step=726, training_loss=3.5229919317996865, metrics={'train_runtime': 2375.8483, 'train_samples_per_second': 1.22, 'train_steps_per_second': 0.306, 'total_flos': 1984426807394304.0, 'train_loss': 3.5229919317996865, 'epoch': 3.0})
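The step count and the "No log" entries can be checked with a little arithmetic. The sketch below assumes that the 966 examples shown in the `Map` progress output are the training split, that training ran on a single device without gradient accumulation, and that `logging_steps` was left at the library default of 500. None of these is stated explicitly in the notebook.

```python
import math

num_train_examples = 966  # assumed: first Map progress bar is the train split
per_device_batch = 4      # per_device_train_batch_size in training_args
num_epochs = 3            # num_train_epochs in training_args
logging_steps = 500       # assumed: library default, not overridden

# Each epoch needs one optimizer step per (possibly partial) batch.
steps_per_epoch = math.ceil(num_train_examples / per_device_batch)
total_steps = steps_per_epoch * num_epochs

# Global step at which each epoch ends.
epoch_end_steps = [steps_per_epoch * e for e in range(1, num_epochs + 1)]

# First epoch whose end lies at or beyond the first logging step.
first_logged_epoch = next(
    i + 1 for i, end in enumerate(epoch_end_steps) if end >= logging_steps
)

print(steps_per_epoch, total_steps, epoch_end_steps, first_logged_epoch)
# 242 726 [242, 484, 726] 3
```

Under these assumptions the total matches `global_step=726`, and the first training-loss log at step 500 falls inside epoch 3, which is consistent with the first two epochs reporting no training loss.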

In [10]:
path_to_save_fine_tuned_model = r"C:\Users\marco\OneDrive\Desktop\TXTgeneration"

In [11]:
# Save the model and the tokenizer
model.save_pretrained(path_to_save_fine_tuned_model)
tokenizer.save_pretrained(path_to_save_fine_tuned_model)

('C:\\Users\\marco\\OneDrive\\Desktop\\TXTgeneration/tokenizer_config.json',
 'C:\\Users\\marco\\OneDrive\\Desktop\\TXTgeneration/special_tokens_map.json',
 'C:\\Users\\marco\\OneDrive\\Desktop\\TXTgeneration/spiece.model',
 'C:\\Users\\marco\\OneDrive\\Desktop\\TXTgeneration/added_tokens.json',
 'C:\\Users\\marco\\OneDrive\\Desktop\\TXTgeneration/tokenizer.json')

### **FLAN:** Output

In [12]:
# Load the fine-tuned model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(path_to_save_fine_tuned_model)
tokenizer = AutoTokenizer.from_pretrained(path_to_save_fine_tuned_model)

In [13]:
# Function that generates text in the desired style
def generate_text(prompt, max_length=100):
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=max_length,
        num_return_sequences=1,
        no_repeat_ngram_size=2,
        early_stopping=True  # only has an effect with beam search (num_beams > 1)
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


In [27]:
# Usage example
prompt_mixed = "tell another one"
generated_text_mixed = generate_text(prompt_mixed)

print("Generated text in mixed style:")
print(generated_text_mixed)

Generated text in mixed style:
Shakespeare


In [28]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [29]:
torch.save(model.state_dict(), '/content/drive/MyDrive/model10.pth')

### **FLAN:** Evaluation

In [30]:
!pip install -U sacrebleu rouge_score


In [32]:
from datasets import load_metric

In [33]:
bleu = load_metric("sacrebleu")
rouge = load_metric("rouge")


In [39]:
def evaluate_model(model, tokenizer, dataset, max_length=100):
    """Generate one prediction per example and score it with BLEU and ROUGE."""
    model.eval()
    preds, labels = [], []

    for example in dataset:
        inputs = tokenizer(example["Text"], return_tensors="pt", padding=True, truncation=True)
        outputs = model.generate(
            inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_length=max_length,
            num_return_sequences=1,
            no_repeat_ngram_size=2,
            early_stopping=True,  # only has an effect with beam search
        )
        pred = tokenizer.decode(outputs[0], skip_special_tokens=True)
        label = example["Author"]

        preds.append(pred)
        labels.append(label)

    # BLEU expects a list of references for each prediction
    bleu_score = bleu.compute(predictions=preds, references=[[label] for label in labels])

    # ROUGE takes one reference string per prediction
    rouge_score = rouge.compute(predictions=preds, references=labels)

    return bleu_score, rouge_score


In [40]:
test_dataset = tokenized_datasets["test"]
bleu_score, rouge_score = evaluate_model(model, tokenizer, test_dataset)

print(f"BLEU Score: {bleu_score}")
print(f"ROUGE Score: {rouge_score}")



BLEU Score: {'score': 0.4502766987366023, 'counts': [238, 0, 0, 0], 'totals': [580, 290, 249, 216], 'precisions': [41.03448275862069, 0.1724137931034483, 0.10040160642570281, 0.05787037037037037], 'bp': 1.0, 'sys_len': 580, 'ref_len': 290}
ROUGE Score: {'rouge1': AggregateScore(low=Score(precision=0.7793103448275862, recall=0.7793103448275862, fmeasure=0.7793103448275862), mid=Score(precision=0.8206896551724138, recall=0.8206896551724138, fmeasure=0.8206896551724138), high=Score(precision=0.8655172413793103, recall=0.8655172413793103, fmeasure=0.8655172413793103)), 'rouge2': AggregateScore(low=Score(precision=0.0, recall=0.0, fmeasure=0.0), mid=Score(precision=0.0, recall=0.0, fmeasure=0.0), high=Score(precision=0.0, recall=0.0, fmeasure=0.0)), 'rougeL': AggregateScore(low=Score(precision=0.7758620689655172, recall=0.7758620689655172, fmeasure=0.7758620689655172), mid=Score(precision=0.8206896551724138, recall=0.8206896551724138, fmeasure=0.8206896551724138), high=Score(precision=0.862
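The `counts` field of the BLEU output lists the number of matched n-grams per order, here `[238, 0, 0, 0]`. The sketch below shows how clipped n-gram matching behaves when the reference is a single token, as the author labels appear to be (`ref_len` is 290). The helper names and the two-token prediction are hypothetical, and the code illustrates the counting idea only; it is not sacreBLEU's implementation.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_matches(hyp_tokens, ref_tokens, n):
    """Count hypothesis n-grams that also occur in the reference (clipped)."""
    hyp_counts = Counter(ngrams(hyp_tokens, n))
    ref_counts = Counter(ngrams(ref_tokens, n))
    return sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())

hyp = ["Shakespeare", "."]  # hypothetical two-token prediction
ref = ["Shakespeare"]       # single-token reference label

unigram_matches = clipped_matches(hyp, ref, 1)
bigram_matches = clipped_matches(hyp, ref, 2)
print(unigram_matches, bigram_matches)  # 1 0
```

A one-token reference contains no bigrams at all, so the higher-order match counts stay at zero whatever the model predicts.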

**Overall interpretation**

- **BLEU**: The overall score is very low (about 0.45 on sacreBLEU's 0 to 100 scale), even though unigram precision is about 41% (238 matches out of 580 predicted tokens). The match counts for bigrams, trigrams, and 4-grams are all zero; the small nonzero precisions shown for those orders come from sacreBLEU's smoothing. Since `ref_len` is 290 for the 290 test examples, each reference is a single-token author label, so no n-gram longer than one token can ever match. The zeros therefore reflect the label format and do not show whether the model captures the longer, more complex phrase structures of Shakespearean style.
- **ROUGE**: ROUGE-1 and ROUGE-L are high (about 0.82 at the midpoint, which equals 238/290), indicating that the model outputs the correct author label, such as Trump or Shakespeare, for most test examples. The zero ROUGE-2 scores follow from the same cause: single-word references contain no bigrams, so this metric says nothing here about cohesion between consecutive words or stylistic fluency.
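Because the target for each example is a short author label, exact-match accuracy is a natural complement to BLEU and ROUGE. The following is a minimal sketch, assuming `preds` and `labels` are lists of strings like the ones built inside `evaluate_model`. The helper name, the case and whitespace normalization, and the toy data are assumptions, not part of the original notebook.

```python
def exact_match_accuracy(preds, labels):
    """Fraction of predictions equal to their label, ignoring case and outer whitespace."""
    if len(preds) != len(labels):
        raise ValueError("preds and labels must have the same length")
    if not labels:
        return 0.0
    hits = sum(
        pred.strip().lower() == label.strip().lower()
        for pred, label in zip(preds, labels)
    )
    return hits / len(labels)

# Toy data (hypothetical), not taken from the notebook's test set.
toy_preds = ["Shakespeare", "Trump", "Shakespeare", "trump "]
toy_labels = ["Shakespeare", "Trump", "Trump", "Trump"]
toy_accuracy = exact_match_accuracy(toy_preds, toy_labels)
print(toy_accuracy)  # 0.75
```

To use it on the real data, `evaluate_model` would need to return `preds` and `labels` alongside the metric dictionaries.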