<a href="https://colab.research.google.com/github/GeraudBourdin/llm-scripts/blob/main/Trainning_Gpt2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Installation

In [None]:
## Model used for the whole run of this notebook.
## https://huggingface.co/openai-community

MODEL = "gpt2"           ## openai-community/gpt2
#MODEL = "gpt2-medium"   ## openai-community/gpt2-medium
#MODEL = "gpt2-large"    ## openai-community/gpt2-large
#MODEL = "gpt2-xl"       ## openai-community/gpt2-xl


In [None]:
## Install the libraries
!pip install transformers --use-deprecated=legacy-resolver
!pip install datasets
!pip install GPUtil
!pip install accelerate -U
!pip install transformers[torch]

In [None]:
## Import the libraries

import torch
import pandas as pd
from numba import cuda
from datasets import load_dataset
from GPUtil import showUtilization as gpu_usage
from transformers import GPT2LMHeadModel, GPT2Tokenizer, TrainingArguments, Trainer, default_data_collator

In [None]:
tokenizer = GPT2Tokenizer.from_pretrained(MODEL)
model = GPT2LMHeadModel.from_pretrained(MODEL,
                                        pad_token_id=tokenizer.eos_token_id)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)  # optional here; inputs passed to generate() must be on the same device as the model

# Explanation: Tokenizer
This section is informational only; it is not needed for our end goal.

In [None]:
## Some information about the loaded model
print("The max model length is {} for this model".format(tokenizer.model_max_length))
print("The maximum supported input sizes per checkpoint are {}".format(tokenizer.max_model_input_sizes))
print("The beginning of sequence token {} has the id {}".format(tokenizer.convert_ids_to_tokens(tokenizer.bos_token_id), tokenizer.bos_token_id))
print("The end of sequence token {} has the id {}".format(tokenizer.convert_ids_to_tokens(tokenizer.eos_token_id), tokenizer.eos_token_id))
tokenizer.max_model_input_sizes

In [None]:
## Tokenizer details
print(tokenizer)

In [None]:
## Encode a sentence into a sequence of token ids
sentence = 'I am a PHP Developer'
input_ids = tokenizer.encode(sentence,
                             return_tensors='pt').to(device)  ## keep the inputs on the same device as the model
print('encoded : ')
print(input_ids)


print("\r\n")
## decode a single token (the fourth one, index 3):
print('decoded : ')
print(tokenizer.decode(input_ids[0][3]))

# Explanation: Text generation

**Greedy Search**: the next word is the one with the highest probability.

**Beam Search**: several candidate continuations are tracked in parallel, and the model keeps the most probable overall sequence ( `num_beams=5` ).

**Random sampling**: the next word is drawn at random from the predicted distribution, enabled with `do_sample=True`. This can produce incoherent text, so the `temperature` parameter is used to balance high- and low-probability words: values below 1 favor likely words, values above 1 flatten the distribution.

**Top-K Sampling**: only the K most probable next words are kept; all lower-probability words are discarded.

**Top-P Sampling**: keeps the smallest set of words whose cumulative probability exceeds a given threshold. To implement Top-P sampling, set `top_k` to 0 and specify a value for `top_p`.
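
To make the effect of `temperature` concrete, here is a minimal sketch in plain Python. The vocabulary and scores are invented for illustration and do not come from GPT-2; the `softmax` helper is a hypothetical stand-in for what the model does internally before sampling.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; temperature rescales the scores first."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical next-token scores, invented for illustration.
vocab = ["code", "and", "who", "banana"]
logits = [2.0, 1.0, 0.5, -1.0]

# Greedy search: always take the highest-scoring token.
greedy_token = vocab[logits.index(max(logits))]

# Sampling draws from the softmax distribution; temperature reshapes it.
sharp = softmax(logits, temperature=0.5)    # low temperature: mass concentrates on the top token
neutral = softmax(logits, temperature=1.0)
flat = softmax(logits, temperature=2.0)     # high temperature: closer to uniform

print(greedy_token)
for name, probs in [("T=0.5", sharp), ("T=1.0", neutral), ("T=2.0", flat)]:
    print(name, [round(p, 3) for p in probs])
```

The lower the temperature, the closer sampling gets to greedy search; the higher it is, the more often unlikely words are drawn.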

In [None]:
# Greedy Search (the next word is the one with the highest probability)
greedy_output = model.generate(input_ids,
                               max_length=100,
                               no_repeat_ngram_size=2)

for i, output in enumerate(greedy_output):
    print("{}: {}...".format(i, tokenizer.decode(output, skip_special_tokens=True)))
    print('')


0: I am a PHP Developer and I am very passionate about the PHP language. I have been working on PHP for over 10 years and have worked on many different projects.

I have a passion for the web and want to make it better. So I decided to start my own company. It is called "The PHP Project".
.php
 (The name is a reference to the popular PHP programming language, PHP. The PHP project is based on the original PHP, which was developed by the...



In [None]:
## Beam Search:
## - "beams" are produced, each corresponding to one of n candidate continuations called hypotheses.
## - The more hypotheses to predict at each step, the more computation is needed.
## - The goal is to find the right trade-off between relevance and generation cost.
beam_output = model.generate(input_ids,
                             max_length = 100,
                             num_beams=5,
                             num_return_sequences=3, ## the number of returned sequences must be <= num_beams
                             no_repeat_ngram_size=2,
                             early_stopping=True)

for i, output in enumerate(beam_output):
    print("{}: {}...".format(i, tokenizer.decode(output, skip_special_tokens=True)))
    print('')

0: I am a PHP Developer and I have been working on this project for a few years now.

In this post I am going to show you how to use PHP 5.5 to build your own PHP application. I hope you will enjoy this tutorial as much as I enjoyed writing it. If you have any questions or comments, feel free to leave them in the comments below....

1: I am a PHP Developer and I have been working on this project for a few years now.

In this post I am going to show you how to use PHP 5.5 to build your own PHP application. I hope you will enjoy this tutorial as much as I enjoyed writing it. If you have any questions or comments, feel free to leave them in the comments section below....

2: I am a PHP Developer and I have been working on this project for a few years now.

In this post I am going to show you how to use PHP 5.5 to build your own PHP application. I hope you will enjoy this tutorial as much as I enjoyed writing it. If you have any questions or comments, feel free to leave them in the comment
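
The idea behind beam search can be sketched on a toy problem. The transition table below is invented for illustration (it is not GPT-2's distribution), and `beam_search` is a hypothetical helper, not the implementation used by `model.generate`. It shows a case where keeping two hypotheses finds a more probable sentence than the greedy choice.

```python
import math

# Hypothetical next-token probabilities keyed by the previous token (invented for illustration).
NEXT = {
    "<s>":   {"I": 0.6, "We": 0.4},
    "I":     {"code": 0.5, "write": 0.5},
    "We":    {"code": 0.9, "write": 0.1},
    "code":  {"</s>": 1.0},
    "write": {"</s>": 1.0},
}

def beam_search(num_beams, steps):
    """Keep the num_beams best partial sequences (hypotheses) at each step."""
    beams = [(["<s>"], 0.0)]                 # (tokens, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            for tok, p in NEXT.get(tokens[-1], {}).items():
                candidates.append((tokens + [tok], score + math.log(p)))
        if not candidates:
            break
        # More beams means more candidates to score at every step.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams

greedy = beam_search(num_beams=1, steps=2)[0]       # one beam behaves like greedy search
best_of_two = beam_search(num_beams=2, steps=2)[0]  # two beams keep "We" alive after step 1
print(greedy[0], round(math.exp(greedy[1]), 2))
print(best_of_two[0], round(math.exp(best_of_two[1]), 2))
```

With a single beam the search commits to "I" because it is the best first word, while two beams also keep "We" and discover that "We code" is more probable overall.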

In [None]:
## Random sampling:
## the next word is drawn at random from the predicted distribution.
## do_sample=True can make the text incoherent.
## To limit this, "temperature" balances high- and low-probability next words (values below 1 favor likely words).

random_output = model.generate(input_ids,
                               do_sample=True,
                               max_length=100,
                               top_k=0,
                               temperature=0.8)

for i, output in enumerate(random_output):
    print("{}: {}...".format(i, tokenizer.decode(output,
                                                 skip_special_tokens=True)))
    print('')

0: I am a PHP Developer, so it was easy to figure out the minds behind all of my projects. However, I never got around to implementing the frameworks that I was studying for. So for example, the first time I worked on a PHP project I didn't have the skills to deploy to a web server. When I did that, I was at a loss how I had to implement it.

After a lot of work, I managed to figure out how to make a remote web server...



In [None]:
## Top-K Sampling
## Only the K most probable next words are kept; all low-probability words are discarded.
## Compared with plain random sampling, this cuts off the low-probability tail of the distribution.
## top_k is therefore the number of top words kept in the conditional probability distribution.

top_k_output = model.generate(input_ids,
                              do_sample=True,
                              max_length=100,
                              top_k=50)

for i, output in enumerate(top_k_output):
    print("{}: {}...".format(i, tokenizer.decode(output,
                                                 skip_special_tokens=True)))
    print('')




0: I am a PHP Developer and I know all about PHP and that's what I do at Microsoft. I know that PHP on Windows and OS X is a very modern OS which means that the PHP you are writing will probably become much smaller. It also means that PHP 3.8 is less and less likely to use the same language. PHP is not only one of the most widely used programming languages, but also because it is written in PHP and it is fast!

So how do you know...



In [None]:
## Top-P Sampling: instead of selecting the k most probable words,
## it keeps the smallest set of words for which
## the cumulative probability exceeds a given threshold, noted "p".
## The whole probability mass is then redistributed over the words of this set.
## The main difference between Top-K and Top-P sampling is the flexibility of the set size.
## With Top-K sampling the set size is fixed, whereas with Top-P sampling it can vary.
## To implement Top-P sampling, set "top_k" to 0 and specify a value for "top_p".


top_p_output = model.generate(input_ids,
                              do_sample=True,
                              max_length=100,
                              top_p=0.8,
                              top_k=0)

for i, output in enumerate(top_p_output):
    print("{}: {}...".format(i, tokenizer.decode(output,
                                                 skip_special_tokens=True)))
    print('')

0: I am a PHP Developer, I'm passionate about creating performance driven apps. However, I am also a PHP-dev, which is why I need to get to know the complex programming language and how it can help me learn.

A big part of that programming knowledge comes from mastering PHP. If you have been playing with PHP for a long time and have been using it for some time, then it is very easy to learn and can be used in a variety of ways.

Now...



The **Top-K** and **Top-P** sampling techniques can also be used together. This combination limits the inclusion of unusual or low-probability words while still allowing a dynamic selection size. To implement it, we only need to specify values for both "**top_k**" and "**top_p**". If we wish, we can also add the **temperature** parameter introduced earlier.
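
A minimal sketch of the two filters applied one after the other, in plain Python. The distribution is invented for illustration, and `top_k_filter` / `top_p_filter` are hypothetical helpers written for this example, not functions from `transformers`; the sketch assumes top-k is applied first and top-p second.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens (k <= 0 disables the filter), then renormalize."""
    if k <= 0 or k >= len(probs):
        return dict(probs)
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

# Hypothetical next-token distribution, invented for illustration.
probs = {"code": 0.5, "and": 0.25, "who": 0.125, "banana": 0.0625, "zebra": 0.0625}

after_k = top_k_filter(probs, k=4)          # fixed-size cut: the fifth token is dropped
after_k_p = top_p_filter(after_k, p=0.85)   # dynamic cut on what is left
print(after_k)
print(after_k_p)
```

Top-K always keeps the same number of candidates, whereas the number kept by Top-P depends on how peaked the distribution is; sampling then draws from the renormalized survivors.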

In [None]:
top_k_p_outputs = model.generate(input_ids,
                                 do_sample=True,
                                 max_length=2*100,
                                 top_k=50,
                                 top_p=0.85,
                                 num_return_sequences=5)


for i, output in enumerate(top_k_p_outputs):
    print("{}: {}...".format(i, tokenizer.decode(output,
                                                 skip_special_tokens=True)))
    print('')

0: I am a PHP Developer and I am the CEO of a company which makes a lot of amazing products. We are a community driven organization. I am happy to have your support.

The people who make and sell this website. The people that support us, the people that make and sell our products, the people who make and sell our content. This is what we need to build something special.

How can I help?

We are working hard on our brand new web site. If you're interested in joining us, there's no better way to do it than to sign up for our mailing list here....

1: I am a PHP Developer and Developer and I am a Certified Composer. If you are not a PHP Developer, please do not read this guide because I want to help you understand the PHP concepts.


There is no real magic trick or trick to the PHP programming language.


PHP is simply a language that is used to develop applications. It does not have to be the same to do programming. You can create apps, read documents, edit documents, build websites and 

# Using a dataset

In [None]:
## Define the dataset we want to use
dataset_name = "tiny_shakespeare" ## https://huggingface.co/datasets/tiny_shakespeare contains a single row, so it is fast.


cache_dir = "lm_dataset/"
datasets = load_dataset(dataset_name, cache_dir=cache_dir)


print(datasets)

In [None]:
## Apply the tokenizer to the dataset
## the dataset has 3 splits: train / validation / test
## train contains the "text" column

column_names = datasets["train"].column_names
## Simple rule: use the "text" column if it exists, otherwise fall back to the first column
text_column_name = "text" if "text" in column_names else column_names[0]

def tokenize_function(examples):
    output = tokenizer(examples[text_column_name])
    return output

tokenized_datasets = datasets.map(
    tokenize_function,
    batched=True,
    remove_columns=column_names,
    desc="Running tokenizer on dataset")

In [None]:
## When working with large data, split it into "chunks" sized to
## the maximum number of tokens the model accepts as input context.

block_size = tokenizer.model_max_length
## If the context can exceed 1024 tokens we still cap it at 1024;
## otherwise we respect the model's maximum.
if block_size > 1024:
    block_size = 1024

## Define group_texts: concatenate the tokenized texts and cut them into blocks of block_size tokens
def group_texts(examples):
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()}
    result["labels"] = result["input_ids"].copy()
    return result


lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    desc=f"Grouping texts in chunks of {block_size}")
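
To see what `group_texts` does, here is the same logic run on a tiny hand-made batch. The token ids are invented for illustration and `block_size` is set to 4 only to keep the output readable.

```python
block_size = 4  # tiny value for illustration; the notebook caps it at 1024

def group_texts(examples):
    # Concatenate the lists of every column into one long list per column.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the remainder so that every chunk has exactly block_size tokens.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal language modeling the labels are a copy of the inputs.
    result["labels"] = result["input_ids"].copy()
    return result

# Hypothetical tokenized batch: two short "documents" (token ids invented).
batch = {
    "input_ids": [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]],
    "attention_mask": [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1]],
}
chunks = group_texts(batch)
print(chunks["input_ids"])
print(chunks["labels"])
```

The ten tokens are merged across document boundaries and cut into two blocks of four; the last two tokens do not fill a block and are discarded.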

In [None]:
train_dataset = lm_datasets["train"]
eval_dataset = lm_datasets["validation"]

training_args = TrainingArguments(output_dir = "output/",
                                  per_device_train_batch_size=1,
                                  num_train_epochs=50,
                                  save_total_limit=1)
trainer = Trainer(model=model,
                  args=training_args,
                  train_dataset=train_dataset,
                  eval_dataset=eval_dataset,
                  tokenizer=tokenizer,
                  data_collator=default_data_collator)



##### If this error appears: Using the `Trainer` with `PyTorch` requires `accelerate>=0.20.1`
##### the environment must be restarted (Runtime > Restart session and run all)

In [None]:
## Check GPU usage and free the cached memory

gpu_usage()
torch.cuda.empty_cache()




| ID | GPU | MEM |
------------------


In [None]:
## Start training

train_result = trainer.train()

Step,Training Loss


In [None]:
trainer.save_model()

metrics = train_result.metrics
trainer.log_metrics("train", metrics)
trainer.save_metrics("train", metrics)

trainer.save_state()

In [None]:
torch.manual_seed(2)

ids = tokenizer.encode('One does not simply walk into',
                       return_tensors='pt').to(model.device)  ## follows the model, so it also works without a GPU



################## GREEDY
print('################## GREEDY')
greedy_output = model.generate(ids,
                               max_length=100,
                               no_repeat_ngram_size=2)
for i, output in enumerate(greedy_output):
    print("{}: {}...".format(i, tokenizer.decode(output,
                                                 skip_special_tokens=True)))
    print('')



################## BEAM
print('################## BEAM')
beam_output = model.generate(ids,
                             max_length = 100,
                             num_beams=5,
                             num_return_sequences=5,
                             no_repeat_ngram_size=2,
                             early_stopping=True)

for i, output in enumerate(beam_output):
    print("{}: {}...".format(i, tokenizer.decode(output,
                                                 skip_special_tokens=True)))
    print('')

################## Random Sampling
print('################## Random Sampling')
random_output = model.generate(ids,
                               do_sample=True,
                               max_length=100,
                               top_k=0,
                               temperature=0.8)

for i, output in enumerate(random_output):
    print("{}: {}...".format(i, tokenizer.decode(output,
                                                 skip_special_tokens=True)))
    print('')


################## Top-K Sampling
print('################## Top-K Sampling')

top_k_output = model.generate(ids,
                              do_sample=True,
                              max_length=100,
                              top_k=50)

for i, output in enumerate(top_k_output):
    print("{}: {}...".format(i, tokenizer.decode(output,
                                                 skip_special_tokens=True)))
    print('')


################## Top-P Sampling
print('################## Top-P Sampling')

top_p_output = model.generate(ids,
                              do_sample=True,
                              max_length=100,
                              top_p=0.8,
                              top_k=0)

for i, output in enumerate(top_p_output):
    print("{}: {}...".format(i, tokenizer.decode(output,
                                                 skip_special_tokens=True)))
    print('')


################## Top-K and Top-P Sampling
print('################## Top-K and Top-P Sampling')

top_k_p_outputs = model.generate(ids,
                                 do_sample=True,
                                 max_length=2*100,
                                 top_k=50,
                                 top_p=0.85,
                                 num_return_sequences=5)

for i, output in enumerate(top_k_p_outputs):
    print("{}: {}...".format(i, tokenizer.decode(output,
                                                 skip_special_tokens=True)))
    print('')