
In [None]:
!pip install transformers



Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch

# Use the pretrained GPT-2 model for French text generation


In [None]:
model = GPT2LMHeadModel.from_pretrained("asi/gpt-fr-cased-base")
tokenizer = GPT2Tokenizer.from_pretrained("asi/gpt-fr-cased-base")
model.eval()

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50000, 1792)
    (wpe): Embedding(1024, 1792)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0): GPT2Block(
        (ln_1): LayerNorm((1792,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((1792,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1): GPT2Block(
        (ln_1): LayerNorm((1792,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout)

### **Greedy Search**

Greedy search simply selects the word with the highest probability as its next word: $w_t = \operatorname{argmax}_{w} P(w | w_{1:t-1})$ at each timestep $t$. The following sketch shows greedy search.

![Greedy Search](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/greedy_search.png)

Starting from the word $\text{"Longtemps"}$, the algorithm greedily chooses the next word of highest probability, $\text{"je"}$, and so on, so that the final generated word sequence is $(\text{"Longtemps", "je", "me"})$, with an overall probability of $0.5 \times 0.4 = 0.2$.
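
To make the arithmetic concrete, here is a minimal sketch of greedy search over a toy next-word table. `TOY_MODEL` and `greedy_search` are hypothetical names, and every probability except the 0.5 and 0.4 from the example is invented for illustration; this is not how `transformers` implements decoding.

```python
# Hypothetical next-word probabilities, keyed by the words generated so far.
# Only the 0.5 and 0.4 values come from the example; the rest are invented.
TOY_MODEL = {
    ("Longtemps",): {"je": 0.5, "il": 0.4, "nous": 0.1},
    ("Longtemps", "je"): {"me": 0.4, "suis": 0.3, "ne": 0.3},
    ("Longtemps", "il"): {"a": 0.9, "est": 0.05, "fut": 0.05},
    ("Longtemps", "nous"): {"avons": 0.6, "sommes": 0.4},
}


def greedy_search(prefix, steps):
    """Extend `prefix` by `steps` words, always taking the most probable one."""
    sequence = list(prefix)
    probability = 1.0
    for _ in range(steps):
        next_word_probs = TOY_MODEL[tuple(sequence)]
        # argmax over the conditional distribution P(w | w_1 .. w_{t-1})
        best_word = max(next_word_probs, key=next_word_probs.get)
        probability *= next_word_probs[best_word]
        sequence.append(best_word)
    return sequence, probability


greedy_sequence, greedy_probability = greedy_search(["Longtemps"], steps=2)
print(greedy_sequence, round(greedy_probability, 2))
```

In this invented table, the sequence "Longtemps il a" has probability $0.4 \times 0.9 = 0.36$, which greedy search never sees because "il" loses to "je" at the first step.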

In the following, we will generate word sequences using GPT-2 on the context $(\text{"Longtemps", "je", "me", "suis", "couché", "de", "bonne", "heure."})$. Let's see how greedy search can be used in `transformers`:


In [None]:
input_sentence = "Longtemps je me suis couché de bonne heure."
input_ids = tokenizer.encode(input_sentence, return_tensors='pt')

greedy_outputs = model.generate(
    input_ids, 
    max_length=100
)

print(50 * '-'+"Output:\n" + 50 * '-')
print(tokenizer.decode(greedy_outputs[0], skip_special_tokens=True))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
Longtemps je me suis couché de bonne heure. J'avais une bonne raison de dormir. Je me suis levé à cinq heures, j'ai pris mon petit déjeuner, j'ai déjeuné, j'ai dîné, j'ai dîné, j'ai dîné, j'ai dîné, j'ai dîné, j'ai dîné, j'ai dîné, j'ai dîné, j'ai dîné, j'ai dîné


### **Beam Search**

Beam search reduces the risk of missing hidden high-probability word sequences by keeping the `num_beams` most likely hypotheses at each time step and eventually choosing the hypothesis with the highest overall probability. The `generate` call in this section uses `num_beams=5`.
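
As a rough illustration of the idea, the sketch below runs beam search with two beams over a toy next-word table. `TOY_MODEL` and `beam_search` are hypothetical, and the probabilities are invented. Real implementations typically work with log-probabilities and handle end-of-sequence tokens, which this sketch leaves out.

```python
# Hypothetical next-word probabilities, keyed by the words generated so far.
TOY_MODEL = {
    ("Longtemps",): {"je": 0.5, "il": 0.4, "nous": 0.1},
    ("Longtemps", "je"): {"me": 0.4, "suis": 0.3, "ne": 0.3},
    ("Longtemps", "il"): {"a": 0.9, "est": 0.05, "fut": 0.05},
    ("Longtemps", "nous"): {"avons": 0.6, "sommes": 0.4},
}


def beam_search(prefix, steps, num_beams):
    """Keep the `num_beams` most probable hypotheses at every step."""
    beams = [(list(prefix), 1.0)]
    for _ in range(steps):
        candidates = []
        for sequence, probability in beams:
            # Expand every surviving hypothesis by every possible next word.
            for word, p in TOY_MODEL[tuple(sequence)].items():
                candidates.append((sequence + [word], probability * p))
        # Rank all expansions and keep only the best `num_beams` of them.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:num_beams]
    return beams


best_sequence, best_probability = beam_search(["Longtemps"], steps=2, num_beams=2)[0]
print(best_sequence, round(best_probability, 2))
```

With two beams, the hypothesis starting with "il" survives the first step, so the search can reach "Longtemps il a" (probability 0.36 in this toy table). With `num_beams=1` the same function reduces to greedy search.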

In [None]:
input_sentence = "Longtemps je me suis couché de bonne heure."
input_ids = tokenizer.encode(input_sentence, return_tensors='pt')

beam_outputs = model.generate(
    input_ids, 
    max_length=50, 
    num_beams=5, 
    early_stopping=True
)

print(50 * '-'+"Output:\n" + 50 * '-')
print(tokenizer.decode(beam_outputs[0], skip_special_tokens=True))



Output:
----------------------------------------------------------------------------------------------------
Longtemps je me suis couché de bonne heure. J’avais pris l’habitude de m’étendre sur mon lit, de me recroqueviller sur moi-même, de me recroqueviller sur moi-


The output still repeats itself (*de me recroqueviller sur moi-même* starts a second time). A simple remedy is an n-gram penalty: setting `no_repeat_ngram_size=2` ensures that no 2-gram appears twice in the generated text. Nevertheless, n-gram penalties have to be used with care. An article generated about the city of New York should not use a 2-gram penalty; otherwise, the name of the city would appear only once in the whole text!
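
The New York caveat can be checked with a small sketch of the n-gram rule. `banned_next_tokens` is a hypothetical helper written for illustration, not the library's implementation: it lists the tokens that would complete an n-gram already present in the text.

```python
def banned_next_tokens(tokens, ngram_size):
    """Return the tokens that would repeat an already seen n-gram if appended."""
    if ngram_size < 1 or len(tokens) < ngram_size:
        return set()
    # The last (n - 1) tokens are the prefix of the n-gram being built.
    prefix = tuple(tokens[len(tokens) - (ngram_size - 1):])
    banned = set()
    for start in range(len(tokens) - ngram_size + 1):
        ngram = tuple(tokens[start:start + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


tokens = ["New", "York", "is", "big", ".", "New"]
banned = banned_next_tokens(tokens, ngram_size=2)
print(banned)
```

Because the 2-gram ("New", "York") already occurred, "York" is banned right after the second "New", which is exactly the side effect described in the paragraph.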

Another important feature of beam search is that we can compare the top beams after generation and choose the generated beam that best fits our purpose.

In `transformers`, we simply set the parameter `num_return_sequences` to the number of highest-scoring beams that should be returned. Make sure, though, that `num_return_sequences <= num_beams`!


In [None]:
beam_outputs = model.generate(
    input_ids, 
    max_length=50, 
    num_beams=5, 
    no_repeat_ngram_size=2, 
    num_return_sequences=5, 
    early_stopping=True
)
# now we have 5 output sequences
print(50 * '-'+"Output:" + 50 * '-')
for i, beam_output in enumerate(beam_outputs):
  print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))



--------------------------------------------------Output:
--------------------------------------------------
0: je vient de l'ecode, maintenant je dois dire que j'ai eu un peu de mal au début, mais je me suis vite rendu compte que c'etait pas si mal que ca. J'espere que vous allez
1: je vient de l'ecode, maintenant je dois dire que j'ai eu un peu de mal au début, mais je me suis vite rendu compte que c'etait pas si mal que ca. J'espere qu'il
2: je vient de l'ecode, maintenant je dois dire que j'ai eu un peu de mal au début, mais je me suis vite rendu compte que c'etait pas si mal que ca. J'espere que tu vas
3: je vient de l'ecode, maintenant je dois dire que j'ai eu un peu de mal au début, mais je me suis vite rendu compte que c'etait pas si mal que ca. J'espere qu'un
4: je vient de l'ecode, maintenant je dois dire que j'ai eu un peu de mal au début, mais je me suis vite rendu compte que c'etait pas si mal que ca. J'espere que vous avez


### **Sampling**

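In its most basic form, sampling means randomly picking the next word $w_t$ according to its conditional probability distribution: $w_t \sim P(w | w_{1:t-1})$. In `transformers`, sampling is enabled by passing `do_sample=True` to `generate`.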
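
A minimal sketch of what drawing the next word from a conditional distribution looks like, using only the standard library. The distribution `next_word_probs` and the helper `sample_next_word` are hypothetical, and the probabilities are invented.

```python
import random

# Hypothetical conditional distribution P(w | "Longtemps"); values are invented.
next_word_probs = {"je": 0.5, "il": 0.4, "nous": 0.1}


def sample_next_word(probs, rng):
    """Draw one word at random, weighted by its probability."""
    words = list(probs)
    weights = [probs[word] for word in words]
    return rng.choices(words, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
draws = [sample_next_word(next_word_probs, rng) for _ in range(1000)]
counts = {word: draws.count(word) for word in next_word_probs}
print(counts)
```

Unlike greedy search, repeated draws do not always return "je"; over many draws the counts should roughly follow the 0.5 / 0.4 / 0.1 weights.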

In [None]:
# Generate a sample of text
input_sentence = "je vient de l'ecode, maintenant je dois"
input_ids = tokenizer.encode(input_sentence, return_tensors='pt')

sampling_outputs = model.generate(
    input_ids, 
    max_length=70, 
    do_sample=True
)

print(50 * '-'+"Output:" + 50 * '-')
print(tokenizer.decode(sampling_outputs[0], skip_special_tokens=True))



--------------------------------------------------Output:--------------------------------------------------
je vient de l'ecode, maintenant je dois me proteger, et mes deux beaux parents sont partis chez le veto, ma tante et moi, j'aime pas aller a l'ecole, j'ai dit bonne chance l'an dernier, merci beaucoup de faire de belles rencontres, de vous etes vraiment très genereuse,


### **Top-K Sampling**

[Fan et al. (2018)](https://arxiv.org/pdf/1805.04833.pdf) introduced a simple but very powerful sampling scheme called ***Top-K*** sampling. In *Top-K* sampling, only the *K* most likely next words are kept and the probability mass is redistributed among those *K* words.
GPT2 adopted this sampling scheme, which was one of the reasons for its success in story generation.

The sketch below uses a pool of 10 candidate words at each of two sampling steps to illustrate *Top-K* sampling.

![top_k_sampling](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/top_k_sampling.png)

Having set $K = 6$, in both sampling steps we limit our sampling pool to 6 words. While the 6 most likely words, defined as $V_{\text{top-K}}$, encompass only *ca.* two-thirds of the whole probability mass in the first step, they include almost all of the probability mass in the second step. Nevertheless, we see that it successfully eliminates the rather weird candidates $(\text{"not", "the", "small", "told"})$ in the second sampling step.
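
The filtering and redistribution step can be sketched in a few lines of plain Python. `top_k_filter` is a hypothetical helper and the distribution is invented; it only illustrates the idea and is not the code `transformers` runs.

```python
def top_k_filter(probs, k):
    """Keep the k most probable words and renormalise their probabilities."""
    top_words = sorted(probs, key=probs.get, reverse=True)[:k]
    total = sum(probs[word] for word in top_words)
    return {word: probs[word] / total for word in top_words}


# Hypothetical next-word distribution; values are invented.
probs = {"je": 0.4, "il": 0.3, "nous": 0.15, "on": 0.1, "zut": 0.05}
filtered = top_k_filter(probs, k=2)
print(filtered)
```

After filtering with `k=2`, only "je" and "il" remain, and their probabilities are rescaled so that they sum to 1 again (4/7 and 3/7 here).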


Let's see how *Top-K* can be used in the library by setting `top_k=50`. Note that the cell below also sets `top_p=0.95`, so Top-K is combined with Top-p filtering:

In [None]:
# Generate a sample of text
input_sentence = "Longtemps je me suis couché de bonne heure."
input_ids = tokenizer.encode(input_sentence, return_tensors='pt')

topK_outputs = model.generate(
    input_ids, 
    max_length=90, 
    do_sample=True,   
    top_k=50, 
    top_p=0.95, 
    num_return_sequences=1
)

print(50 * '-'+"Output:" + 50 * '-')
print(tokenizer.decode(topK_outputs[0], skip_special_tokens=True))



--------------------------------------------------Output:--------------------------------------------------
Longtemps je me suis couché de bonne heure. Pour ne pas me réveiller chaque soir je me mettais toujours sur le ventre, et je me sentais fatigué si le coeur me battait ou s'il voulait me faire sentir malade. Quand je me réveillais je m'appelais François et je m'endormais de bonne heure. Je n'avais pas un sommeil bien agité, je n'avais rien à dire. Mais quelquefois quand je me


### **Top-p (nucleus) sampling**

Instead of sampling only from the most likely *K* words, *Top-p* sampling chooses from the smallest possible set of words whose cumulative probability exceeds the probability *p*. The probability mass is then redistributed among this set of words. This way, the size of the set (*i.e.* the number of words in it) can dynamically increase and decrease according to the next word's probability distribution. Ok, that was very wordy, let's visualize.

![top_p_sampling](https://github.com/patrickvonplaten/scientific_images/blob/master/top_p_sampling.png?raw=true)

Having set $p=0.92$, *Top-p* sampling picks the *minimum* number of words to exceed together $p=92\%$ of the probability mass, defined as $V_{\text{top-p}}$. In the first example, this included the 9 most likely words, whereas it only has to pick the top 3 words in the second example to exceed 92%. Quite simple actually! It can be seen that it keeps a wide range of words where the next word is arguably less predictable, *e.g.* $P(w | \text{"The"})$, and only a few words when the next word seems more predictable, *e.g.* $P(w | \text{"The", "car"})$.
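
The dynamic set size can be reproduced with a short sketch. `top_p_filter` is a hypothetical helper and both distributions are invented; the threshold $p=0.92$ is the one used in the text.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most probable words whose total mass reaches p."""
    kept = {}
    cumulative = 0.0
    for word in sorted(probs, key=probs.get, reverse=True):
        kept[word] = probs[word]
        cumulative += probs[word]
        if cumulative >= p:
            break
    # Redistribute the probability mass among the kept words.
    return {word: value / cumulative for word, value in kept.items()}


# Hypothetical distributions: one fairly flat, one sharply peaked.
flat = {"je": 0.4, "il": 0.3, "nous": 0.15, "on": 0.1, "zut": 0.05}
peaked = {"je": 0.8, "il": 0.15, "nous": 0.03, "on": 0.01, "zut": 0.01}

flat_kept = top_p_filter(flat, p=0.92)
peaked_kept = top_p_filter(peaked, p=0.92)
print(len(flat_kept), len(peaked_kept))
```

The flat distribution needs four words to reach 92% of the mass, while the peaked one needs only two, mirroring the behaviour described for the two examples in the figure.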

Alright, time to check it out in `transformers`!
We activate *Top-p* sampling by setting `0 < top_p < 1`:

In [None]:
# Generate a sample of text
input_sentence = "Longtemps je me suis couché de bonne heure."
input_ids = tokenizer.encode(input_sentence, return_tensors='pt')

nucleus_outputs = model.generate(
    input_ids,
    max_length=90,
    do_sample=True,
    top_k=40,
    top_p=0.95,
    num_return_sequences=1
)

print(50 * '-'+"Output:" + 50 * '-')
print(tokenizer.decode(nucleus_outputs[0], skip_special_tokens=True))




--------------------------------------------------Output:--------------------------------------------------
Longtemps je me suis couché de bonne heure. Aujourd’hui, vers six heures du matin, comme le soleil se couchait encore, je me suis levé pour regarder sous le bassin de l’entrée. J’ai fermé la porte à clef derrière moi et je suis monté à l’échelle. J’avais un pantalon déchiré à la taille et j’avais l’air d’un homme qui


# Here is the final model with all the decoding methods mentioned above

   * Greedy Search
   * Top-P (nucleus) sampling
   * Top-K Sampling
   * Sampling
   * Beam search 

The French GPT-2 model used here is **asi/gpt-fr-cased-base**.


  GPT-fr 🇫🇷 is a GPT model for French developed by Quantmetry and the Laboratoire de Linguistique Formelle (LLF). Its authors trained the model on a very large and heterogeneous French corpus and released the weights for several configurations; this notebook uses the `base` one.

In [None]:
# Generate a sample of text
input_sentence = "Le laboureur et ses enfants travailler prener de la peine c'est"
input_ids = tokenizer.encode(input_sentence, return_tensors='pt')

outputs = model.generate(
    input_ids, 
    max_length=100, 
    do_sample=True,   
    top_k=50, 
    top_p=0.95, 
    num_return_sequences=1
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(outputs[0], skip_special_tokens=True))




Output:
----------------------------------------------------------------------------------------------------
Le laboureur et ses enfants travailler prener de la peine c'est parcequ'il a eleve ses enfants plus que ceux qui n'ont jamais travaille de leur vie. c'est le travail qui leur coute.il doit donc bien travailler, c'est de lui qu'il faut le remercier, et pour cela, il faut lui rendre des comptes, lui dire qu'il doit faire le travail qu'il a fait, le remercier, lui donner ses benedictions
