<a href="https://colab.research.google.com/github/GeraudBourdin/llm-scripts/blob/main/Trainning_Gpt2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Installation

In [2]:
## Model used throughout this notebook.
## https://huggingface.co/openai-community

MODEL = "gpt2"                ## openai-community/gpt2
#MODEL = "gpt2-medium"        ## openai-community/gpt2-medium
#MODEL = "gpt2-large"         ## openai-community/gpt2-large
#MODEL = "gpt2-xl"            ## openai-community/gpt2-xl


In [None]:
## Install the libraries.
## This cell may need to be run twice for everything to be picked up.
## Remember to restart the VM afterwards via "Runtime > Restart session".
!pip install transformers --use-deprecated=legacy-resolver
!pip install datasets
!pip install GPUtil
!pip install accelerate -U
!pip install transformers[torch]

In [3]:
## Library imports

import torch
import pandas as pd
from numba import cuda
from datasets import load_dataset
from GPUtil import showUtilization as gpu_usage
from transformers import GPT2LMHeadModel, GPT2Tokenizer, TrainingArguments, Trainer, default_data_collator

In [9]:
tokenizer = GPT2Tokenizer.from_pretrained(MODEL)
model = GPT2LMHeadModel.from_pretrained(MODEL,
                                        pad_token_id=tokenizer.eos_token_id)

# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# model.to(device)  # optional

# Explanation: Tokenizer
This section is not needed for our end goal.

In [None]:
## Some information about the loaded model
print("The max model length is {} for this model".format(tokenizer.model_max_length))
print("The default maximum supported input lengths are {}".format(tokenizer.max_model_input_sizes))
print("The beginning of sequence token {} token has the id {}".format(tokenizer.convert_ids_to_tokens(tokenizer.bos_token_id), tokenizer.bos_token_id))
print("The end of sequence token {} has the id {}".format(tokenizer.convert_ids_to_tokens(tokenizer.eos_token_id), tokenizer.eos_token_id))
tokenizer.max_model_input_sizes

In [None]:
## Tokenizer details
print(tokenizer)

In [None]:
## Encode a sentence into a sequence of token ids
sentence = 'I am a PHP Developer'
input_ids  = tokenizer.encode(sentence,
                              return_tensors = 'pt')
print('encoded : ')
print(input_ids)


print("\r\n")
## Decode a single token (the one at index 3):
print('decoded : ')
print(tokenizer.decode(input_ids[0][3]))

# Explanation: Text generation

**Greedy Search**: at each step, the next token is the one with the highest probability.

**Beam Search**: instead of a single candidate, the `num_beams` most probable partial sequences (for example `num_beams=5`) are kept at each step, and the sequence with the best overall score is returned.

**Random sampling**: the next token is drawn at random from the model's probability distribution, enabled with `do_sample=True`. This can produce incoherent text, so the `temperature` parameter is used to adjust the balance between high- and low-probability tokens: values below 1 favor the most likely tokens.

**Top-K Sampling**: sampling is restricted to the K most probable next tokens; all lower-probability tokens are discarded.

**Top-P Sampling**: sampling is restricted to the smallest set of tokens whose cumulative probability exceeds a given threshold p. To use Top-P sampling, set `top_k` to 0 and specify a value for `top_p`.
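
To make the strategies above more concrete, here is a minimal sketch in plain Python of temperature scaling, Top-K filtering and Top-P filtering on a toy distribution. The vocabulary, the scores and the helper names (`softmax`, `top_k_filter`, `top_p_filter`) are made up for illustration; this is not how `transformers` implements `generate`, only the idea behind the parameters.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; temperature < 1 sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable entries and renormalize them."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in keep)
    return {i: probs[i] / mass for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set whose cumulative probability reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    mass = sum(probs[i] for i in keep)
    return {i: probs[i] / mass for i in keep}

# Hypothetical 5-word vocabulary and scores, for illustration only.
vocab = ["the", "a", "PHP", "cat", "zebra"]
logits = [2.0, 1.5, 1.0, 0.2, -1.0]

probs = softmax(logits)
sharp = softmax(logits, temperature=0.5)  # lower temperature: more mass on "the"

greedy_word = vocab[max(range(len(probs)), key=lambda i: probs[i])]
top_k_words = [vocab[i] for i in top_k_filter(probs, k=2)]
top_p_words = [vocab[i] for i in top_p_filter(probs, p=0.8)]

print(greedy_word)   # the
print(top_k_words)   # ['the', 'a']
print(top_p_words)   # ['the', 'a', 'PHP']
```

With these scores the two most likely words only cover about 73% of the probability mass, so Top-P with `p=0.8` keeps a third word while Top-K with `k=2` does not: the Top-P set size adapts to the shape of the distribution.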

In [15]:
# Greedy Search (the next token is the one with the highest probability)
greedy_output = model.generate(input_ids,
                               max_length=100,
                               no_repeat_ngram_size=2)

for i, output in enumerate(greedy_output):
    print("{}: {}...".format(i, tokenizer.decode(output, skip_special_tokens=True)))
    print('')


0: I am a PHP Developer and I am very passionate about the PHP language. I have been working on PHP for over 10 years and have worked on many different projects.

I have a passion for the web and want to make it better. So I decided to start my own company. It is called "The PHP Project".
.php
 (The name is a reference to the popular PHP programming language, PHP. The PHP project is based on the original PHP, which was developed by the...



In [14]:
## Beam Search:
## - "beams" are kept at each step: the n most likely candidate continuations, called hypotheses.
## - The more hypotheses to expand at each step, the more computation is needed.
## - A balance must be found between output quality and generation cost.
beam_output = model.generate(input_ids,
                             max_length = 100,
                             num_beams=5,
                             num_return_sequences=3, ## the number of returned sequences must be <= num_beams
                             no_repeat_ngram_size=2,
                             early_stopping=True)

for i, output in enumerate(beam_output):
    print("{}: {}...".format(i, tokenizer.decode(output, skip_special_tokens=True)))
    print('')

0: I am a PHP Developer and I have been working on this project for a few years now.

In this post I am going to show you how to use PHP 5.5 to build your own PHP application. I hope you will enjoy this tutorial as much as I enjoyed writing it. If you have any questions or comments, feel free to leave them in the comments below....

1: I am a PHP Developer and I have been working on this project for a few years now.

In this post I am going to show you how to use PHP 5.5 to build your own PHP application. I hope you will enjoy this tutorial as much as I enjoyed writing it. If you have any questions or comments, feel free to leave them in the comments section below....

2: I am a PHP Developer and I have been working on this project for a few years now.

In this post I am going to show you how to use PHP 5.5 to build your own PHP application. I hope you will enjoy this tutorial as much as I enjoyed writing it. If you have any questions or comments, feel free to leave them in the comment
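
The three beams above are almost identical, which is typical: beam search returns the highest-scoring hypotheses, and they often differ only in their last tokens. As a rough sketch of the mechanism, here is a toy beam search in plain Python over a hypothetical next-word table (`NEXT` and its probabilities are invented for illustration, they do not come from GPT-2). It shows a case where keeping two hypotheses finds a more probable sentence than greedy search.

```python
import math

# Hypothetical next-word model: last word -> {next word: probability}.
NEXT = {
    "<s>": {"I": 0.6, "We": 0.4},
    "I": {"code": 0.55, "am": 0.45},
    "We": {"code": 0.9, "am": 0.1},
    "code": {"</s>": 1.0},
    "am": {"</s>": 1.0},
}

def beam_search(num_beams, max_steps=3):
    """Return the surviving hypotheses, best first, as (words, log-probability)."""
    beams = [(["<s>"], 0.0)]
    for _ in range(max_steps):
        candidates = []
        for words, score in beams:
            if words[-1] == "</s>":
                # Finished hypotheses are carried over unchanged.
                candidates.append((words, score))
                continue
            for word, p in NEXT[words[-1]].items():
                candidates.append((words + [word], score + math.log(p)))
        # Keep only the num_beams best hypotheses for the next step.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams

greedy_best = beam_search(num_beams=1)[0]  # num_beams=1 is greedy search
beam_best = beam_search(num_beams=2)[0]

print(" ".join(greedy_best[0]), round(math.exp(greedy_best[1]), 2))
print(" ".join(beam_best[0]), round(math.exp(beam_best[1]), 2))
```

Greedy search commits to "I" because it is the most likely first word, while the two-beam search keeps "We" alive and ends with a sentence of higher total probability (0.36 against 0.33).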

In [16]:
## Random sampling:
## the next token is drawn at random.
## do_sample=True can make the text incoherent.
## To limit this, "temperature" adjusts the balance between high- and low-probability next tokens.

random_output = model.generate(input_ids,
                               do_sample=True,
                               max_length=100,
                               top_k=0,
                               temperature=0.8)

for i, output in enumerate(random_output):
    print("{}: {}...".format(i, tokenizer.decode(output,
                                                 skip_special_tokens=True)))
    print('')

0: I am a PHP Developer. I am a PHP Developer. I have worked on topics such as Web Apps, Websockets, Java and more. I am a PHP Developer. I have worked on topics such as Web Apps, Websockets, Java and more.

Jan 13, 2014, 07:56 PM #48 Vitor Reffered Quote Select Post

Select Post Deselect Post

Deselect Post Link to Post

Link to Post Member Give Gift

...



In [17]:
## Top-K Sampling
## Only the K most probable next tokens are considered; all low-probability tokens are discarded.
## Compared with plain random sampling, this cuts off the low-probability tail of the distribution.
## top_k therefore sets how many of the top tokens are kept in the conditional probability distribution.

top_k_output = model.generate(input_ids,
                              do_sample=True,
                              max_length=100,
                              top_k=50)

for i, output in enumerate(top_k_output):
    print("{}: {}...".format(i, tokenizer.decode(output,
                                                 skip_special_tokens=True)))
    print('')




0: I am a PHP Developer, I don't care for my work in PHP and always will be, but I take PHP seriously and I do my best to do my best and my employees deserve better. If there is one thing we can learn in this industry, IT, I hope some of you know it already. There are many more great people and amazing people in the business who could benefit from what we will be doing and, if not, how can we make it better?

I am...



In [18]:
## Top-P Sampling: instead of selecting the k most probable tokens,
## it selects the smallest set of tokens whose
## cumulative probability exceeds a given threshold, noted "p".
## The whole probability mass is then redistributed over the tokens in this set.
## The main difference between Top-K and Top-P sampling is the flexibility of the set size:
## with Top-K the set size is fixed, whereas with Top-P it can vary.
## To use Top-P sampling, set "top_k" to 0 and specify a value for "top_p".


top_p_output = model.generate(input_ids,
                              do_sample=True,
                              max_length=100,
                              top_p=0.8,
                              top_k=0)

for i, output in enumerate(top_p_output):
    print("{}: {}...".format(i, tokenizer.decode(output,
                                                 skip_special_tokens=True)))
    print('')

0: I am a PHP Developer with more than 50 years of experience and developed new functional programming languages, but are just now seeing an audience at a broader level. Our goal is to help people find web development as a meaningful job. We want to enable people to solve their own problems in an online environment, with natural engineering, about twenty years of experience. We want to get the population interested, with our advice and support. And we want to provide a forum where our friends, family, and colleagues can...



We can also combine **Top-K** and **Top-P** sampling. This combination limits the inclusion of unusual or low-probability tokens while still allowing a dynamic selection size. To do so, simply specify values for both **top_k** and **top_p**. The **temperature** parameter introduced earlier can also be added if desired.

In [19]:
top_k_p_outputs = model.generate(input_ids,
                                 do_sample=True,
                                 max_length=2*100,
                                 top_k=50,
                                 top_p=0.85,
                                 num_return_sequences=5)


for i, output in enumerate(top_k_p_outputs):
    print("{}: {}...".format(i, tokenizer.decode(output,
                                                 skip_special_tokens=True)))
    print('')

0: I am a PHP Developer. I am very happy to work with developers to help improve their PHP and MySQL environments.


I am also a PHP developer with a keen interest in web development. I was born in the USA. I started coding when I was about 3.


I am a PHP developer and I am very proud to share my experience. I hope you'll enjoy the job and I am looking forward to your help in the future.


My PHP experience started with a job in PHP development. In 2013, I went to work in a PHP lab. After that, I started writing web code using Ruby on Rails. In 2014, I started a PHP company. After that, I started a business selling open source projects. I am happy to have joined these projects and I am excited to bring your knowledge to the world of PHP and MySQL.

I am currently an administrator for the WordPress community. In 2012, I started writing PHP code on my WordPress website. In...

1: I am a PHP Developer and a Web Developer.

My work involves developing and implementing Web apps and service

# Using a dataset

In [20]:
## Define the dataset we want to use
dataset_name = "tiny_shakespeare" ## https://huggingface.co/datasets/tiny_shakespeare contains a single row per split, so it is fast to process.


cache_dir = "lm_dataset/"
datasets = load_dataset(dataset_name, cache_dir=cache_dir)


print(datasets)

Downloading data:   0%|          | 0.00/634k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/35.8k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/36.5k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 1
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 1
    })
    test: Dataset({
        features: ['text'],
        num_rows: 1
    })
})


In [21]:
## Apply the tokenizer to the dataset.
## The dataset has 3 splits: train / validation / test.
## The train split contains the "text" column.

column_names = datasets["train"].column_names
## Use the "text" column if it exists, otherwise fall back to the first column.
text_column_name = "text" if "text" in column_names else column_names[0]

def tokenize_function(examples):
    output = tokenizer(examples[text_column_name])
    return output

tokenized_datasets = datasets.map(
    tokenize_function,
    batched=True,
    remove_columns=column_names,
    desc="Running tokenizer on dataset")

Running tokenizer on dataset:   0%|          | 0/1 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (301966 > 1024). Running this sequence through the model will result in indexing errors


Running tokenizer on dataset:   0%|          | 0/1 [00:00<?, ? examples/s]

Running tokenizer on dataset:   0%|          | 0/1 [00:00<?, ? examples/s]

In [22]:
## With large inputs, the data must be split into "chunks" sized according to
## the maximum number of tokens the model accepts as input context.

block_size = tokenizer.model_max_length
## If the context could hold more than 1024 tokens we still cap it at 1024;
## otherwise we use the model's maximum.
if block_size > 1024:
    block_size = 1024

## Concatenate all tokenized texts and split them into blocks of block_size tokens
def group_texts(examples):
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()}
    result["labels"] = result["input_ids"].copy()
    return result


lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    desc=f"Grouping texts in chunks of {block_size}")

Grouping texts in chunks of 1024:   0%|          | 0/1 [00:00<?, ? examples/s]

Grouping texts in chunks of 1024:   0%|          | 0/1 [00:00<?, ? examples/s]

Grouping texts in chunks of 1024:   0%|          | 0/1 [00:00<?, ? examples/s]
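
To see what the chunking step does without loading the dataset, the sketch below runs the same logic on a tiny hand-made batch. The toy batch and the `block_size=4` value are assumptions chosen for readability; the function body mirrors `group_texts` above, with `block_size` passed as an argument so the example is self-contained.

```python
def group_texts(examples, block_size):
    """Concatenate the lists of each column, then cut them into fixed-size blocks."""
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the tail that does not fill a whole block.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal language modeling the labels are a copy of the inputs.
    result["labels"] = result["input_ids"].copy()
    return result

# Hypothetical toy batch: two tokenized texts, 10 tokens in total.
toy = {
    "input_ids": [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10]],
    "attention_mask": [[1] * 6, [1] * 4],
}
chunks = group_texts(toy, block_size=4)
print(chunks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]  (tokens 9 and 10 are dropped)

# Same arithmetic for the 301966-token sequence reported by the tokenizer warning,
# assuming that sequence is the train split.
full_blocks, dropped = divmod(301966, 1024)
print(full_blocks, dropped)  # 294 910
```

Under that assumption the train split yields 294 blocks of 1024 tokens, and the last 910 tokens are discarded.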

In [23]:
train_dataset = lm_datasets["train"]
eval_dataset = lm_datasets["validation"]

training_args = TrainingArguments(output_dir = "output/",
                                  per_device_train_batch_size=1,
                                  num_train_epochs=50,
                                  save_total_limit=1)
trainer = Trainer(model=model,
                  args=training_args,
                  train_dataset=train_dataset,
                  eval_dataset=eval_dataset,
                  tokenizer=tokenizer,
                  data_collator=default_data_collator)



##### If you get this error: Using the `Trainer` with `PyTorch` requires `accelerate>=0.20.1`
##### the environment must be restarted (Runtime > Restart session, then run all cells)

In [24]:
## Show GPU usage and free the cached GPU memory

gpu_usage()
torch.cuda.empty_cache()




| ID | GPU | MEM |
------------------
|  0 |  0% |  8% |


In [25]:
## Start training

train_result = trainer.train()

| Step | Training Loss |
|------|---------------|
| 500  | 3.4841 |
| 1000 | 3.1624 |
| 1500 | 2.9458 |
| 2000 | 2.729  |
| 2500 | 2.5432 |
| 3000 | 2.3771 |
| 3500 | 2.2031 |
| 4000 | 2.0438 |
| 4500 | 1.9073 |
| 5000 | 1.7834 |


In [26]:
trainer.save_model()

metrics = train_result.metrics
trainer.log_metrics("train", metrics)
trainer.save_metrics("train", metrics)

trainer.save_state()

***** train metrics *****
  epoch                    =       50.0
  total_flos               =  7154406GF
  train_loss               =     1.5456
  train_runtime            = 1:42:04.29
  train_samples_per_second =        2.4
  train_steps_per_second   =        2.4
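
As a quick sanity check on these metrics, the sketch below converts the reported loss into a perplexity and re-derives the step count and throughput. It assumes a single device and 294 training blocks (301966 tokens cut into blocks of 1024), which is an inference from the tokenizer warning and not something the notebook prints. Note that `train_loss` is averaged over the whole run on training data, so the resulting perplexity says nothing about held-out text.

```python
import math

train_loss = 1.5456                      # from the train metrics
perplexity = math.exp(train_loss)        # exp of the mean cross-entropy loss

# Assumptions: 294 training blocks, batch size 1, 50 epochs, one device.
blocks = 301966 // 1024
epochs = 50
batch_size = 1
total_steps = math.ceil(blocks / batch_size) * epochs

runtime_s = 1 * 3600 + 42 * 60 + 4.29    # train_runtime of 1:42:04.29 in seconds
steps_per_second = total_steps / runtime_s

print(round(perplexity, 2))              # 4.69
print(total_steps)                       # 14700
print(round(steps_per_second, 1))        # 2.4
```

The derived 2.4 steps per second matches `train_steps_per_second`, and 14700 total steps is consistent with a last checkpoint saved at step 14500 when checkpoints are written every 500 steps.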


In [27]:
torch.manual_seed(2)

## Put the prompt on the same device as the model
ids = tokenizer.encode('One does not simply walk into',
                       return_tensors='pt').to(model.device)



################## GREEDY
print('################## GREEDY')
greedy_output = model.generate(ids,
                               max_length=100,
                               no_repeat_ngram_size=2)
for i, output in enumerate(greedy_output):
    print("{}: {}...".format(i, tokenizer.decode(output,
                                                 skip_special_tokens=True)))
    print('')



################## BEAM
print('################## BEAM')
beam_output = model.generate(ids,
                             max_length = 100,
                             num_beams=5,
                             num_return_sequences=5,
                             no_repeat_ngram_size=2,
                             early_stopping=True)

for i, output in enumerate(beam_output):
    print("{}: {}...".format(i, tokenizer.decode(output,
                                                 skip_special_tokens=True)))
    print('')

################## Random Sampling
print('################## Random Sampling')
random_output = model.generate(ids,
                               do_sample=True,
                               max_length=100,
                               top_k=0,
                               temperature=0.8)

for i, output in enumerate(random_output):
    print("{}: {}...".format(i, tokenizer.decode(output,
                                                 skip_special_tokens=True)))
    print('')


################## Top-K Sampling
print('################## Top-K Sampling')

top_k_output = model.generate(ids,
                              do_sample=True,
                              max_length=100,
                              top_k=50)

for i, output in enumerate(top_k_output):
    print("{}: {}...".format(i, tokenizer.decode(output,
                                                 skip_special_tokens=True)))
    print('')


################## Top-P Sampling
print('################## Top-P Sampling')

top_p_output = model.generate(ids,
                              do_sample=True,
                              max_length=100,
                              top_p=0.8,
                              top_k=0)

for i, output in enumerate(top_p_output):
    print("{}: {}...".format(i, tokenizer.decode(output,
                                                 skip_special_tokens=True)))
    print('')


################## Top-K and Top-P Sampling
print('################## Top-K and Top-P Sampling')

top_k_p_outputs = model.generate(ids,
                                 do_sample=True,
                                 max_length=2*100,
                                 top_k=50,
                                 top_p=0.85,
                                 num_return_sequences=5)

for i, output in enumerate(top_k_p_outputs):
    print("{}: {}...".format(i, tokenizer.decode(output,
                                                 skip_special_tokens=True)))
    print('')

################## GREEDY
0: One does not simply walk into a room and yields himself to
the unseen hand; but his action,
receiving the hand that he hath in his hand, and falling
down, obeys the throw, swoons the prince.

First Senator:
He durst not lift a finger, nor did not kneel for the
drum that was used to make him grow. He
knew the office of the state, the revenue
and the directing earldom:...

################## BEAM
0: One does not simply walk into a bower and wail
As 'twere toasting, as 'tis another thing
That art burthen'd with being.

VOLUMNIA:
Thou wert born to a good husband, and thou hast
Became a Gentleman of the Cupid's Son: thou art a
Great man; and I have read many fair Gentleman's
Beserserkings; but thou'rt not born so,
And...

1: One does not simply walk into a bower and wail
As 'twere toasting, as 'tis another thing
That art burthen'd with being.

VOLUMNIA:
Thou wert born to a good husband, and thou hast
Became a Gentleman of the Cupid's Son: thou art a
Great man; a

# Downloading the files

In [33]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [34]:
!zip -r /content/output.zip /content/output

  adding: content/output/ (stored 0%)
  adding: content/output/model.safetensors (deflated 7%)
  adding: content/output/all_results.json (deflated 40%)
  adding: content/output/training_args.bin (deflated 51%)
  adding: content/output/checkpoint-14500/ (stored 0%)
  adding: content/output/checkpoint-14500/rng_state.pth (deflated 25%)
  adding: content/output/checkpoint-14500/model.safetensors (deflated 7%)
  adding: content/output/checkpoint-14500/training_args.bin (deflated 51%)
  adding: content/output/checkpoint-14500/config.json (deflated 52%)
  adding: content/output/checkpoint-14500/generation_config.json (deflated 34%)
  adding: content/output/checkpoint-14500/optimizer.pt (deflated 8%)
  adding: content/output/checkpoint-14500/vocab.json (deflated 68%)
  adding: content/output/checkpoint-14500/special_tokens_map.json (deflated 74%)
  adding: content/output/checkpoint-14500/scheduler.pt (deflated 55%)
  adding: content/output/checkpoint-14500/trainer_state.json (deflated 79%)
  

In [None]:
from google.colab import files
files.download('output.zip')

In [36]:
!ls -lah

total 1.8G
drwxr-xr-x 1 root root 4.0K Feb 12 11:56 .
drwxr-xr-x 1 root root 4.0K Feb 12 09:29 ..
drwxr-xr-x 4 root root 4.0K Feb  8 14:20 .config
drwxr-xr-x 4 root root 4.0K Feb 12 09:39 lm_dataset
drwxr-xr-x 4 root root 4.0K Feb 12 11:34 output
-rw-r--r-- 1 root root 1.8G Feb 12 11:56 output.zip
drwxr-xr-x 1 root root 4.0K Feb  8 14:21 sample_data
