## Copyright 2021 Antoine Simoulin.

<i>Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at [https://www.apache.org/licenses/LICENSE-2.0](https://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Icons made by <a href="https://www.flaticon.com/authors/freepik" title="Freepik">Freepik</a>, <a href="https://www.flaticon.com/authors/pixel-perfect" title="Pixel perfect">Pixel perfect</a>, <a href="https://www.flaticon.com/authors/becris" title="Becris">Becris</a>, <a href="https://www.flaticon.com/authors/smashicons" title="Smashicons">Smashicons</a>, <a href="https://www.flaticon.com/authors/srip" title="srip">srip</a>, <a href="https://www.flaticon.com/authors/adib-sulthon" title="Adib">Adib</a>, <a href="https://www.flaticon.com/authors/flat-icons" title="Flat Icons">Flat Icons</a> and <a href="https://www.flaticon.com/authors/dinosoftlabs" title="Pixel perfect">DinosoftLabs</a> from <a href="https://www.flaticon.com/" title="Flaticon"> www.flaticon.com</a></i>

# Exercise: Language Models

In [1]:
%%capture

# Check environment
if 'google.colab' in str(get_ipython()):
  IN_COLAB = True
else:
  IN_COLAB = False

if IN_COLAB:
  # ⚠️ Execute only if running in Colab
  !pip install -q transformers==3.1.0
  !pip install -q tensorflow==2.0.0
  # then restart runtime environment

In [40]:
from collections import Counter
import numpy as np
import pandas as pd

## Exercise 1: N-gram language models

In this first exercise, we will implement an N-gram language model to build an **auto-completion system**. Google uses this kind of system to suggest completions for search queries, and messaging apps use it to suggest the next word while you type.

<img src = "autocomplete.png" style="width:500px"/>

As we saw in class, n-gram language models estimate the conditional probability of the word $w_t$ at position $t$ in the sentence given the $n$ preceding words $w_{t-1}, w_{t-2} \cdots w_{t-n}$:

$$ P(w_t | w_{t-1}\dots w_{t-n}) \tag{1}$$

This probability is estimated by $\hat{P}$, obtained by counting the occurrences of word sequences in the training data:

$$ \hat{P}(w_t | w_{t-1}\dots w_{t-n}) = \frac{C(w_{t-1}\dots w_{t-n}, w_t)}{C(w_{t-1}\dots w_{t-n})} \tag{2} $$

Here $C(\cdots)$ is the number of occurrences of a given word sequence. In practice, the denominator can be zero when the context never occurs in the training data. We therefore add a smoothing parameter: a constant $k$ is added to the numerator and $k \times |V|$ to the denominator, where $|V|$ is the vocabulary size. This gives:

$$ \hat{P}(w_t | w_{t-1}\dots w_{t-n}) = \frac{C(w_{t-1}\dots w_{t-n}, w_t) + k}{C(w_{t-1}\dots w_{t-n}) + k|V|} \tag{3} $$

If the context never appears in the training data, both counts are zero and equation (3) reduces to $\frac{k}{k|V|} = \frac{1}{|V|}$.
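To make equation (3) concrete, here is a minimal sketch of add-k smoothing applied to raw counts. The function name `add_k_probability` and its argument names are illustrative and not part of the notebook; the counts in the usage lines are chosen by hand.

```python
def add_k_probability(count_context_and_word, count_context, vocabulary_size, k=1.0):
    """Add-k smoothed estimate of P(word | context), as in equation (3)."""
    return (count_context_and_word + k) / (count_context + k * vocabulary_size)


# Seen context: C(context, word) = 2, C(context) = 2, |V| = 9, k = 1
p_seen = add_k_probability(2, 2, vocabulary_size=9)
# Unseen context: both counts are zero, so the estimate falls back to 1 / |V|
p_unseen = add_k_probability(0, 0, vocabulary_size=9)

print(f"{p_seen:.3f}")    # 3 / 11
print(f"{p_unseen:.3f}")  # 1 / 9
```

With $k = 1$ this is Laplace (add-one) smoothing; smaller values of $k$ move less probability mass towards unseen words.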

In [41]:
sentences = [['je', 'suis', 'en', 'vacances'],
             ['je', 'vais', 'partir', 'à', 'la', 'réunion'],
             ['je', 'suis', 'en', 'réunion'],
             ['je', 'vais', 'partir', 'en', 'vacances']]

In [42]:
unique_words = list(set(sentences[0] + sentences[1] + sentences[2] + sentences[3]))
unique_words

['je', 'réunion', 'vacances', 'à', 'vais', 'suis', 'en', 'partir', 'la']

<hr>
<div class="alert alert-info" role="alert">
    <p><b>📝 Exercise:</b> Write a function that generates all the n-grams of a sentence, where n is a parameter of the function.</p>
</div>
<hr>

In [67]:
# %load solutions/ngrams.py

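def sentence_2_n_grams(sentences, n=3, start_token='<s>', end_token='</s>'):
    """Count the n-grams of a list of tokenized sentences."""
    ngrams = []
    for s in sentences:
        # Surround each sentence with the start and end markers
        tokens = [start_token] + s + [end_token]
        # Zip n shifted copies of the tokens to get every window of n tokens
        ngrams += zip(*[tokens[i:] for i in range(n)])
    return Counter([" ".join(ngram) for ngram in ngrams])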

In [69]:
unigram_counts = sentence_2_n_grams(sentences, 1)
print("Uni-gram:")
print(unigram_counts)

bigram_counts = sentence_2_n_grams(sentences, 2)
print("\nBi-gram:")
print(bigram_counts)

trigram_counts = sentence_2_n_grams(sentences, 3)
print("\nTri-gram:")
print(trigram_counts)

Uni-gram:
Counter({'<s>': 4, 'je': 4, '</s>': 4, 'en': 3, 'suis': 2, 'vacances': 2, 'vais': 2, 'partir': 2, 'réunion': 2, 'à': 1, 'la': 1})

Bi-gram:
Counter({'<s> je': 4, 'je suis': 2, 'suis en': 2, 'en vacances': 2, 'vacances </s>': 2, 'je vais': 2, 'vais partir': 2, 'réunion </s>': 2, 'partir à': 1, 'à la': 1, 'la réunion': 1, 'en réunion': 1, 'partir en': 1})

Tri-gram:
Counter({'<s> je suis': 2, 'je suis en': 2, 'en vacances </s>': 2, '<s> je vais': 2, 'je vais partir': 2, 'suis en vacances': 1, 'vais partir à': 1, 'partir à la': 1, 'à la réunion': 1, 'la réunion </s>': 1, 'suis en réunion': 1, 'en réunion </s>': 1, 'vais partir en': 1, 'partir en vacances': 1})
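The solution relies on the `zip(*[tokens[i:] for i in range(n)])` idiom, which may be easier to follow on a single sentence. The sketch below builds the n shifted copies of the token list and zips them, so that each tuple is one window of n consecutive tokens. The helper name `sliding_ngrams` is hypothetical and only used for illustration.

```python
def sliding_ngrams(tokens, n):
    """Return the list of n-grams (as tuples) of a token list."""
    shifted = [tokens[i:] for i in range(n)]  # n copies, each shifted by one more token
    return list(zip(*shifted))                # zip stops at the shortest copy


tokens = ['<s>', 'je', 'suis', 'en', 'vacances', '</s>']
bigrams = sliding_ngrams(tokens, 2)
trigrams = sliding_ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

A list of N tokens yields N - n + 1 n-grams, which is why the six tokens above give five bigrams and four trigrams.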


<hr>
<div class="alert alert-info" role="alert">
    <p><b>📝 Exercise:</b> Write a function that computes the probability of a word given the preceding n-gram.</p>
</div>
<hr>

In [None]:
# %load solutions/estimate_proba.py
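def estimate_probability(word, previous_n_gram,
                         n_gram_counts, n_plus1_gram_counts, vocabulary_size, k=1.0):
    """Add-k smoothed estimate of P(word | previous_n_gram), following equation (3)."""
    # Count of the context, plus k for each word of the vocabulary
    denominator = n_gram_counts.get(previous_n_gram, 0)
    denominator += k * vocabulary_size

    # Count of the context followed by the word, plus k
    numerator = n_plus1_gram_counts.get(previous_n_gram + ' ' + word, 0)
    numerator += k

    probability = numerator / denominator

    return probability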

In [80]:
word_1 = "je vais"
word_2 = "partir"
tmp_prob = estimate_probability(word_2, word_1, bigram_counts, trigram_counts, len(unique_words), k=1)

print("The probability of the word '{}' given the previous n-gram '{}' is: {:.3f}."
      .format(word_2, word_1, tmp_prob))

The probability of the word 'partir' given the previous n-gram 'je vais' is: 0.273.


In [81]:
def estimate_probabilities(previous_n_gram, n_gram_counts, n_plus1_gram_counts, vocabulary, k=1.0,
                           start_token='<s>', end_token='</s>', unk_token='<unk>'):
    
    # Add end_token and unk_token to the vocabulary
    # start_token cannot appear as a next word, so there is no need to add it
    vocabulary = vocabulary + [end_token, unk_token]
    vocabulary_size = len(vocabulary)
    
    probabilities = {}
    for word in vocabulary:
        probability = estimate_probability(word, previous_n_gram, 
                                           n_gram_counts, n_plus1_gram_counts, 
                                           vocabulary_size, k=k)
        probabilities[word] = probability

    return probabilities

In [82]:
next_word_proba = estimate_probabilities("je", unigram_counts, bigram_counts, unique_words, k=1)

for w, p in next_word_proba.items():
    print("The probability of the word '{}' given the previous n-gram '{}' is: {:.3f}."
          .format(w, 'je', p))

The probability of the word 'je' given the previous n-gram 'je' is: 0.067.
The probability of the word 'réunion' given the previous n-gram 'je' is: 0.067.
The probability of the word 'vacances' given the previous n-gram 'je' is: 0.067.
The probability of the word 'à' given the previous n-gram 'je' is: 0.067.
The probability of the word 'vais' given the previous n-gram 'je' is: 0.200.
The probability of the word 'suis' given the previous n-gram 'je' is: 0.200.
The probability of the word 'en' given the previous n-gram 'je' is: 0.067.
The probability of the word 'partir' given the previous n-gram 'je' is: 0.067.
The probability of the word 'la' given the previous n-gram 'je' is: 0.067.
The probability of the word '</s>' given the previous n-gram 'je' is: 0.067.
The probability of the word '<unk>' given the previous n-gram 'je' is: 0.067.
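A quick sanity check on smoothed estimates is that they form a probability distribution over the candidate next words. The sketch below rebuilds the unigram and bigram counts of the toy corpus on its own (the variable names are illustrative, and the vocabulary is sorted, so its order differs from `unique_words`) and checks that the add-k probabilities for the context `'je'` sum to 1.

```python
from collections import Counter

corpus = [['je', 'suis', 'en', 'vacances'],
          ['je', 'vais', 'partir', 'à', 'la', 'réunion'],
          ['je', 'suis', 'en', 'réunion'],
          ['je', 'vais', 'partir', 'en', 'vacances']]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = ['<s>'] + sentence + ['</s>']
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

# Candidate next words: corpus words plus the end and unknown tokens
vocabulary = sorted({w for s in corpus for w in s}) + ['</s>', '<unk>']
k = 1.0
context = 'je'
distribution = {
    word: (bigrams[(context, word)] + k) / (unigrams[context] + k * len(vocabulary))
    for word in vocabulary
}
total = sum(distribution.values())
print(f"{total:.6f}")
```

The sum is exactly 1 here because every occurrence of `'je'` in the corpus is followed by a word of the vocabulary. For a context such as `'</s>'`, which is never followed by anything, the smoothed values would sum to less than 1.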


In [83]:
next_word_proba = estimate_probabilities("en", unigram_counts, bigram_counts, unique_words, k=1)

for w, p in next_word_proba.items():
    print("The probability of the word '{}' given the previous n-gram '{}' is: {:.3f}."
          .format(w, 'en', p))

The probability of the word 'je' given the previous n-gram 'en' is: 0.071.
The probability of the word 'réunion' given the previous n-gram 'en' is: 0.143.
The probability of the word 'vacances' given the previous n-gram 'en' is: 0.214.
The probability of the word 'à' given the previous n-gram 'en' is: 0.071.
The probability of the word 'vais' given the previous n-gram 'en' is: 0.071.
The probability of the word 'suis' given the previous n-gram 'en' is: 0.071.
The probability of the word 'en' given the previous n-gram 'en' is: 0.071.
The probability of the word 'partir' given the previous n-gram 'en' is: 0.071.
The probability of the word 'la' given the previous n-gram 'en' is: 0.071.
The probability of the word '</s>' given the previous n-gram 'en' is: 0.071.
The probability of the word '<unk>' given the previous n-gram 'en' is: 0.071.


In [84]:
def make_count_matrix(n_plus1_gram_counts, vocabulary,
                      start_token='<s>', end_token='</s>', unk_token='<unk>'):
    # Candidate next words: the vocabulary plus the end and unknown tokens
    vocabulary = vocabulary + [end_token, unk_token]

    # Unique previous n-grams (every token but the last), in order of first appearance
    n_grams = list(dict.fromkeys(' '.join(ng.split()[:-1]) for ng in n_plus1_gram_counts))

    row_index = {n_gram: i for i, n_gram in enumerate(n_grams)}
    col_index = {word: j for j, word in enumerate(vocabulary)}

    count_matrix = np.zeros((len(n_grams), len(vocabulary)))

    for n_plus1_gram, count in n_plus1_gram_counts.items():
        *previous_tokens, word = n_plus1_gram.split()
        if word not in col_index:
            continue
        i = row_index[' '.join(previous_tokens)]
        j = col_index[word]
        count_matrix[i, j] = count

    return pd.DataFrame(count_matrix, index=n_grams, columns=vocabulary)

In [85]:
sentences = [['je', 'suis', 'en', 'vacances'],
             ['je', 'vais', 'partir', 'à', 'la', 'réunion'],
             ['je', 'suis', 'en', 'réunion'],
             ['je', 'vais', 'partir', 'en', 'vacances']]

display(make_count_matrix(bigram_counts, unique_words))

Unnamed: 0,je,réunion,vacances,à,vais,suis,en,partir,la,</s>,<unk>
<s>,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
je,0.0,0.0,0.0,0.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0
suis,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0
en,0.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
vacances,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0
vais,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0
partir,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
à,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
la,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
réunion,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0


In [86]:
# Show trigram counts
display(make_count_matrix(trigram_counts, unique_words))

Unnamed: 0,je,réunion,vacances,à,vais,suis,en,partir,la,</s>,<unk>
<s> je,0.0,0.0,0.0,0.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0
je suis,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0
suis en,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
en vacances,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0
je vais,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0
vais partir,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
partir à,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
à la,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
la réunion,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
en réunion,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
partir en,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [87]:
def make_probability_matrix(n_plus1_gram_counts, vocabulary, k):
    count_matrix = make_count_matrix(n_plus1_gram_counts, vocabulary)
    # Add-k smoothing, then normalize each row so that it sums to 1
    count_matrix += k
    prob_matrix = count_matrix.div(count_matrix.sum(axis=1), axis=0)
    return prob_matrix

In [88]:
display(make_probability_matrix(bigram_counts, unique_words, k=1))

Unnamed: 0,je,réunion,vacances,à,vais,suis,en,partir,la,</s>,<unk>
<s>,0.333333,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667
je,0.066667,0.066667,0.066667,0.066667,0.200000,0.200000,0.066667,0.066667,0.066667,0.066667,0.066667
suis,0.076923,0.076923,0.076923,0.076923,0.076923,0.076923,0.230769,0.076923,0.076923,0.076923,0.076923
en,0.071429,0.142857,0.214286,0.071429,0.071429,0.071429,0.071429,0.071429,0.071429,0.071429,0.071429
vacances,0.076923,0.076923,0.076923,0.076923,0.076923,0.076923,0.076923,0.076923,0.076923,0.230769,0.076923
vais,0.076923,0.076923,0.076923,0.076923,0.076923,0.076923,0.076923,0.230769,0.076923,0.076923,0.076923
partir,0.076923,0.076923,0.076923,0.153846,0.076923,0.076923,0.153846,0.076923,0.076923,0.076923,0.076923
à,0.083333,0.083333,0.083333,0.083333,0.083333,0.083333,0.083333,0.083333,0.166667,0.083333,0.083333
la,0.083333,0.166667,0.083333,0.083333,0.083333,0.083333,0.083333,0.083333,0.083333,0.083333,0.083333
réunion,0.076923,0.076923,0.076923,0.076923,0.076923,0.076923,0.076923,0.076923,0.076923,0.230769,0.076923


In [89]:
display(make_probability_matrix(trigram_counts, unique_words, k=1))

Unnamed: 0,je,réunion,vacances,à,vais,suis,en,partir,la,</s>,<unk>
<s> je,0.066667,0.066667,0.066667,0.066667,0.200000,0.200000,0.066667,0.066667,0.066667,0.066667,0.066667
je suis,0.076923,0.076923,0.076923,0.076923,0.076923,0.076923,0.230769,0.076923,0.076923,0.076923,0.076923
suis en,0.076923,0.153846,0.153846,0.076923,0.076923,0.076923,0.076923,0.076923,0.076923,0.076923,0.076923
en vacances,0.076923,0.076923,0.076923,0.076923,0.076923,0.076923,0.076923,0.076923,0.076923,0.230769,0.076923
je vais,0.076923,0.076923,0.076923,0.076923,0.076923,0.076923,0.076923,0.230769,0.076923,0.076923,0.076923
vais partir,0.076923,0.076923,0.076923,0.153846,0.076923,0.076923,0.153846,0.076923,0.076923,0.076923,0.076923
partir à,0.083333,0.083333,0.083333,0.083333,0.083333,0.083333,0.083333,0.083333,0.166667,0.083333,0.083333
à la,0.083333,0.166667,0.083333,0.083333,0.083333,0.083333,0.083333,0.083333,0.083333,0.083333,0.083333
la réunion,0.083333,0.083333,0.083333,0.083333,0.083333,0.083333,0.083333,0.083333,0.083333,0.166667,0.083333
en réunion,0.083333,0.083333,0.083333,0.083333,0.083333,0.083333,0.083333,0.083333,0.083333,0.166667,0.083333
partir en,0.083333,0.083333,0.166667,0.083333,0.083333,0.083333,0.083333,0.083333,0.083333,0.083333,0.083333


In [90]:
def suggest_a_word(previous_tokens, n_gram_counts, n_plus1_gram_counts, vocabulary, k=1.0):
    
    n = len(list(n_gram_counts.keys())[0].split()) 
    previous_n_gram = ' '.join(previous_tokens.split()[-n:])
    probabilities = estimate_probabilities(previous_n_gram,
                                           n_gram_counts, n_plus1_gram_counts,
                                           vocabulary, k=k)

    suggestion = None
    max_prob = 0
    for word, prob in probabilities.items(): 
        if prob > max_prob: 
            suggestion = word            
            max_prob = prob    
    return suggestion, max_prob

In [91]:
previous_tokens = "je vais"
tmp_suggest1 = suggest_a_word(previous_tokens, unigram_counts, bigram_counts, unique_words, k=1.0)
print(f"For the tokens '{previous_tokens}', the suggestion is the word '{tmp_suggest1[0]}' with a probability of {tmp_suggest1[1]:.4f}.")

For the tokens 'je vais', the suggestion is the word 'partir' with a probability of 0.2308.
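An auto-completion interface usually proposes several candidates rather than a single word. As a minimal sketch, assuming a dictionary of next-word probabilities shaped like the one returned by `estimate_probabilities`, the candidates can be ranked and truncated. The helper name `top_k_suggestions` and the toy distribution are hypothetical.

```python
def top_k_suggestions(probabilities, top_k=3):
    """Return the top_k (word, probability) pairs, most probable first."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Hypothetical next-word distribution, for illustration only
toy_probabilities = {'partir': 0.5, 'suis': 0.3, 'en': 0.15, '<unk>': 0.05}
suggestions = top_k_suggestions(toy_probabilities, top_k=2)
print(suggestions)
```

Python's sort is stable, so words with equal probability keep the order they had in the dictionary.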


We can compute the perplexity to evaluate the model. It is given by:

$$ PP(W) =\sqrt[N]{ \prod_{t=n+1}^N \frac{1}{P(w_t | w_{t-n} \cdots w_{t-1})} } \tag{4}$$

Here $N$ is the length of the sentence and $n$ the number of preceding words used as context (for example 1 for a bigram model). The lower the perplexity, the better the model predicts the sentence, so we try to minimize it.

In [92]:
def calculate_perplexity(sentence, n_gram_counts, n_plus1_gram_counts, vocabulary_size, k=1.0,
                         start_token='<s>', end_token='</s>', unk_token='<unk>'):
    
    # Context length, read off the first n-gram key
    n = len(list(n_gram_counts.keys())[0].split()) 
    tokens = [start_token] + sentence + [end_token]
    N = len(tokens)
    
    product_pi = 1.0
  
    for t in range(n, N): 
        n_gram = tokens[t-n:t]
        word = tokens[t]
        
        # Smoothed probability of the word given the n previous tokens
        probability = estimate_probability(word, ' '.join(n_gram), 
                                           n_gram_counts, n_plus1_gram_counts, 
                                           vocabulary_size, k=k)
        product_pi *= 1 / probability

    perplexity = product_pi**(1/float(N))
    
    return perplexity

In [93]:
perplexity_train1 = calculate_perplexity(sentences[0],
                                         unigram_counts, bigram_counts,
                                         len(unique_words), k=1.0)
print(f"The perplexity of the first sentence of the corpus is: {perplexity_train1:.4f}.")


perplexity_test = calculate_perplexity(['Tu' ,'pars', 'ou', 'en', 'vacances', '?'],
                                       unigram_counts, bigram_counts,
                                       len(unique_words), k=1.0)
print(f"The perplexity of the test sentence is: {perplexity_test:.4f}.")

The perplexity of the first sentence of the corpus is: 2.9089.
The perplexity of the test sentence is: 6.6343.
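Equation (4) multiplies the inverses of many small probabilities, which can overflow on long sentences. A common alternative, sketched here with illustrative names, computes the same quantity in log space as $\exp(-\frac{1}{N}\sum_t \log P_t)$. The probabilities below are the smoothed bigram estimates of `<s> je suis en vacances </s>` with $k = 1$ and $|V| = 9$, worked out by hand from the counts above.

```python
import math


def perplexity_from_probabilities(probabilities, sequence_length):
    """Perplexity computed as exp(-sum(log p) / N) instead of a raw product."""
    log_sum = sum(math.log(p) for p in probabilities)
    return math.exp(-log_sum / sequence_length)


# P(je|<s>), P(suis|je), P(en|suis), P(vacances|en), P(</s>|vacances)
probabilities = [5 / 13, 3 / 13, 3 / 11, 3 / 12, 3 / 11]
perplexity = perplexity_from_probabilities(probabilities, sequence_length=6)
print(f"{perplexity:.4f}")
```

The result agrees with the value printed for the first sentence of the corpus, since the two formulas are mathematically equivalent.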


## Exercise 2: neural network language models: GPT-2


I trained a GPT-2 model <span class="badge badge-secondary">([Radford et al., 2019](#radford-2019))</span> on 50M sentences extracted from the OSCAR corpus <span class="badge badge-secondary">([Suárez et al., 2019](#suarez-2019))</span>. The model was then fine-tuned (that is, training was continued) on Book 2 of Harry Potter. GPT-2 is a well-known neural network architecture for language modelling, and it can generate fairly realistic text. We will use the `transformers` library to run the model. **You need to download the model weights from Moodle (file french-gpt2-hp)**.

In [33]:
from transformers import GPT2Tokenizer, TFGPT2LMHeadModel

In [34]:
tokenizer = GPT2Tokenizer.from_pretrained("./french-gpt2-hp")
model = TFGPT2LMHeadModel.from_pretrained("./french-gpt2-hp", 
                                          pad_token_id=tokenizer.eos_token_id, from_pt=True)

All PyTorch model weights were used when initializing TFGPT2LMHeadModel.

Some weights or buffers of the PyTorch model TFGPT2LMHeadModel were not initialized from the TF 2.0 model and are newly initialized: ['transformer.h.10.attn.masked_bias', 'transformer.h.4.attn.bias', 'transformer.h.7.attn.masked_bias', 'transformer.h.0.attn.masked_bias', 'transformer.h.10.attn.bias', 'transformer.h.2.attn.masked_bias', 'transformer.h.11.attn.bias', 'transformer.h.6.attn.masked_bias', 'transformer.h.3.attn.bias', 'transformer.h.8.attn.masked_bias', 'transformer.h.6.attn.bias', 'transformer.h.9.attn.masked_bias', 'transformer.h.1.attn.masked_bias', 'transformer.h.8.attn.bias', 'lm_head.weight', 'transformer.h.2.attn.bias', 'transformer.h.11.attn.masked_bias', 'transformer.h.3.attn.masked_bias', 'transformer.h.4.attn.masked_bias', 'transformer.h.5.attn.masked_bias', 'transformer.h.5.attn.bias', 'transformer.h.9.attn.bias', 'transformer.h.7.attn.bias', 'transformer.h.1.attn.bias', 'transformer.h.0.at

In [35]:
input_ids = tokenizer.encode(
    "Dans son mouvement, la queue du Basilic lui avait jeté le Choixpeau magique à la tête. ",
    return_tensors='tf')

In [36]:
# generate text until the output length (which includes the context length) reaches 50
greedy_output = model.generate(input_ids, max_length=50)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

Setting `pad_token_id` to 2 (first `eos_token_id`) to generate sequence


Output:
----------------------------------------------------------------------------------------------------
Dans son mouvement, la queue du Basilic lui avait jeté le Choixpeau magique à la tête.  —Qu'est-ce que tu fais là?
demanda Harry.
—Qu'est-ce que tu fais là
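The repetition in the output above is typical of greedy search, which always appends the single most probable next token. The toy sketch below illustrates the idea on a hand-written table of next-token probabilities; the table and the function `greedy_decode` are hypothetical and unrelated to the `transformers` implementation.

```python
def greedy_decode(next_token_probs, start_token, max_length):
    """Repeatedly append the most probable next token (greedy search)."""
    tokens = [start_token]
    while len(tokens) < max_length and tokens[-1] in next_token_probs:
        candidates = next_token_probs[tokens[-1]]
        tokens.append(max(candidates, key=candidates.get))
    return tokens


# Hypothetical bigram table: each token maps to a distribution over the next token
toy_table = {
    'que': {'tu': 0.6, 'je': 0.4},
    'tu': {'fais': 0.7, 'vas': 0.3},
    'fais': {'là': 0.8, 'que': 0.2},
    'là': {'que': 0.9, 'tu': 0.1},
}
decoded = greedy_decode(toy_table, 'que', max_length=9)
print(' '.join(decoded))
```

Because the most probable continuation of each token leads back to `'que'`, the sketch cycles through the same four tokens until `max_length` is reached.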


In [37]:
# set no_repeat_ngram_size to 2
beam_output = model.generate(
    input_ids, 
    max_length=200, 
    num_beams=5, 
    no_repeat_ngram_size=2, 
    early_stopping=True
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

Setting `pad_token_id` to 2 (first `eos_token_id`) to generate sequence


Output:
----------------------------------------------------------------------------------------------------
Dans son mouvement, la queue du Basilic lui avait jeté le Choixpeau magique à la tête.  —Qu'est-ce qui se passe?
demanda-t-il d'une voix aiguë.
—Quoi?—–—–demarre aussitôt de la foule, les yeux fixés sur le visage de Malefoy qui avait l'air de plus en plus livide, comme s'il n'avait pas eu le temps de prononcer le moindre mot...  Harry se précipita dans la salle commune des Gryffondor, à côté de Ron et de Hermione, mais il ne fut pas surpris de voir que le professeur McGonagall était en train de dire quelque chose sur la Chambre des Secrets et qu'elle ne semblait pas convaincue que c'était la meilleure chose à faire...
Il y eut un long silence, puis il se tourna vers Ron, le regard perdu dans ses pensées, et le silence qui régnait autour de lui se répercut en
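The `no_repeat_ngram_size=2` argument used above prevents any 2-gram from appearing twice in the generated text. The sketch below is a simplified illustration of that constraint rather than the library's actual implementation: given the tokens generated so far, it lists the next tokens that would complete an n-gram already present. The function name `banned_next_tokens` is hypothetical.

```python
def banned_next_tokens(generated, no_repeat_ngram_size):
    """Tokens that would complete an n-gram already present in `generated`."""
    n = no_repeat_ngram_size
    # The last n-1 tokens form the prefix of the n-gram being built
    prefix = tuple(generated[len(generated) - (n - 1):])
    banned = set()
    for i in range(len(generated) - n + 1):
        ngram = tuple(generated[i:i + n])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


generated = ['que', 'tu', 'fais', 'là', 'que']
banned = banned_next_tokens(generated, no_repeat_ngram_size=2)
print(banned)
```

Here `'tu'` is banned because the 2-gram `que tu` already occurs; during generation, the probability of banned tokens would be set to zero before the next token is chosen.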


In [38]:
input_ids = tokenizer.encode(
    "Assis un peu plus loin, Harry reconnut Gilderoy Lockhart, vêtu d'une robe de sorcier bleu-vert.",
    return_tensors='tf')

In [39]:
# set no_repeat_ngram_size to 2
beam_output = model.generate(
    input_ids, 
    max_length=200, 
    num_beams=5, 
    no_repeat_ngram_size=2, 
    early_stopping=True
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

Setting `pad_token_id` to 2 (first `eos_token_id`) to generate sequence


Output:
----------------------------------------------------------------------------------------------------
Assis un peu plus loin, Harry reconnut Gilderoy Lockhart, vêtu d'une robe de sorcier bleu-vert.
—Qu'est-ce qu'il y a?
demanda-t-il en s'efforçant de ne pas faire de bruit, mais il n'eut pas le temps de prononcer le moindre mot, et il se laissa tomber sur le sol humide et humide de la salle commune de Gryffondor, à côté de Ron et de Hermione qui le regardaient avec des yeux ronds et des cheveux bouclés qui lui tombaient sur les yeux. Mais il ne fut pas surpris de voir que le professeur McGonagall était le seul à l'avoir vu, alors que les autres élèves de Serpentard étaient assis côte à côte sur un banc de Quidditch, au fond duquel était écrit en lettres noires :  —Viens, dit Harry, je t'ai dit que j'étais le plus grand sorcier de tous les temps


## 📚 References

> <div id="radford-2019">Alec Radford, Jeffrey Wu, Rewon Child, David Luan and Dario Amodei. <a href=https://openai.com/blog/better-language-models/> Better Language Models and Their Implications.</a></div>

> <div id="suarez-2019">Suárez, Pedro Javier Ortiz, Benoît Sagot, and Laurent Romary. <a href=https://hal.inria.fr/hal-02148693> Asynchronous pipeline for processing huge corpora on medium to low resource infrastructures.</a> 7th Workshop on the Challenges in the Management of Large Corpora (CMLC-7). Leibniz-Institut für Deutsche Sprache, 2019.</div>