# Generating Text

To generate text, the user supplies a starting phrase. The program then repeatedly picks a probable next word, given the previous word or the previous couple of words.

In [2]:
import os
from random import randint
import pandas as pd
import seaborn as sns

In [3]:
folder = os.path.join(os.getcwd(),'..','data','korpus')
vanilla = os.path.join(folder,'ngram','vanilla')
unk = os.path.join(folder,'ngram','unk')

In [4]:
df = pd.read_csv(os.path.join(folder,'norm_korpus_clean.csv'))
words = df['Word'].to_numpy()

###### Algorithm Specifications

If the ngram supplied is a unigram, then text generation doesn't need to look at the previous word and it essentially becomes a random guess from the top 50 words. If the ngram is not a unigram, then we find all entries in the ngram that start with the n-1 previous words. Then we choose the word to generate at random from the 50 most probable of these entries.
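
The candidate selection described above can be sketched on toy data. This is a plain-Python illustration rather than the pandas code used in this notebook; the helper name `top_k_candidates`, the toy table and its probabilities are all made up, and the `;` separator is assumed from the way the n-gram strings are joined.

```python
import heapq
import random

# Toy bigram table: (n-gram, probability). Values are invented for illustration.
toy_bigrams = [
    ("kont;naħseb", 0.5),
    ("kont;għadni", 0.3),
    ("kont;li", 0.2),
    ("kontra;l-", 0.9),
]

def top_k_candidates(table, history, k=50):
    """Return the k most probable n-grams that continue `history` (hypothetical helper)."""
    # The trailing separator keeps the history "kont" from matching "kontra;l-".
    prefix = history + ";"
    matches = [row for row in table if row[0].startswith(prefix)]
    return heapq.nlargest(k, matches, key=lambda row: row[1])

candidates = top_k_candidates(toy_bigrams, "kont", k=2)
# Pick uniformly among the top-k candidates and keep only the last word.
next_word = random.choice(candidates)[0].split(";")[-1]
print(candidates, next_word)
```

Note that the pick among the top candidates is uniform: the probabilities only decide which entries make it into the top k, not how often each one is chosen.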

If the model supplied is Vanilla, the program may break. This is because if the n-1 previous words never occur in the ngram, then there are no candidate words to choose from and the program has nothing to generate.
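
A small sketch of why an empty candidate list is fatal when the pick is made with `randint(0, len(candidates) - 1)`, together with one possible guard. `safe_pick` is a hypothetical helper, not something defined in this notebook.

```python
from random import randint

def safe_pick(candidates):
    """Hypothetical helper: return a random candidate, or None if there are none."""
    if not candidates:
        return None
    return candidates[randint(0, len(candidates) - 1)]

candidates = []  # no n-gram starts with the given history

try:
    # Unguarded pick: randint(0, -1) is an empty range, so ValueError is raised.
    word = candidates[randint(0, len(candidates) - 1)]
    failed = False
except ValueError:
    failed = True

print(failed, safe_pick(candidates), safe_pick(["familja"]))
```

How a caller should react to `None` (stop the sentence, back off to a shorter history, and so on) is a design choice that this sketch leaves open.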

If the model supplied is Laplace and the n-1 previous words do not appear in the ngram, Laplace smoothing treats every word as having followed that history once. Hence in this case, Laplace smoothing essentially chooses a random word, since all the resulting probabilities are the same.
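
To see why an unseen history gives equal probabilities, here is a minimal sketch that assumes the standard add-one estimate P(w | h) = (C(h, w) + 1) / (C(h) + |V|). The function name, counts and vocabulary below are toy assumptions, not the values used to build the models in this notebook.

```python
from collections import Counter

def laplace_prob(word, history, bigram_counts, history_counts, vocab):
    """Add-one estimate of P(word | history); missing counts default to 0."""
    return (bigram_counts[(history, word)] + 1) / (history_counts[history] + len(vocab))

vocab = ["jiena", "kont", "naħseb", "</s>"]
bigram_counts = Counter({("jiena", "kont"): 3, ("kont", "naħseb"): 2})
history_counts = Counter({"jiena": 3, "kont": 2})

# Seen history: the observed continuation gets a larger share.
seen = [laplace_prob(w, "kont", bigram_counts, history_counts, vocab) for w in vocab]
# Unseen history: every word gets 1 / |V|, so the choice is uniform.
unseen = [laplace_prob(w, "vapur", bigram_counts, history_counts, vocab) for w in vocab]
print(seen)
print(unseen)
```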

If the model supplied is UNK and the n-1 previous words do not appear in the ngram, the unknown words are replaced with the \<UNK> token. The program then re-attempts to find an instance of an ngram that starts with the modified history. If an instance is still not found, then a random word is taken from the corpus, since the \<UNK> model was also Laplace smoothed.
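
The replacement step can be sketched as follows. `replace_unknown` and the toy vocabulary are illustrative assumptions. A `set` is used for the vocabulary because membership tests on it are fast.

```python
def replace_unknown(history, vocabulary, sep=";", unk="<UNK>"):
    """Replace every out-of-vocabulary token in a separator-joined history."""
    tokens = history.split(sep)
    return sep.join(t if t in vocabulary else unk for t in tokens)

vocabulary = {"jiena", "kont", "għadni"}  # toy vocabulary, made up for illustration

print(replace_unknown("jiena;philodendron", vocabulary))
print(replace_unknown("jiena;kont", vocabulary))
```

A history made up only of known words comes back unchanged, so the second lookup only differs from the first when at least one word was actually unknown.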

In [5]:
def generate_text(phrase:str, model:str, n:int):
    
    if model == 'vanilla' or model == 'laplace':
        model_path = vanilla
    elif model == 'unk': 
        model_path = unk
    else: raise Exception('Model does not exist!')
    
    print("Loading Models: ",end='')
    if n in [1,2,3]:
        xgrams = ['unigram.csv','bigram.csv','trigram.csv']
        ngram_path = xgrams[n-1]
        
        xgrams_types = ['Unigram','Bigram','Trigram']
        ngram_type = xgrams_types[n-1]
        
        df_ngram = pd.read_csv(os.path.join(model_path, ngram_path))
        
        ngrams = df_ngram[ngram_type].unique()
        
        if n != 1:
            prev_df_gram = pd.read_csv(os.path.join(model_path,xgrams[n-2]))
            prev_ngrams = prev_df_gram[xgrams_types[n-2]].unique()
            
    else: raise Exception('Choose Unigram, Bigram or Trigram!')
    print('[OK]')
    print('Generating Sentence...')
    
    generated_word = ""
    
    if n == 1:
        top_words = df_ngram['Probability'].astype(float).nlargest(50).index
    
    while generated_word != '</s>':
        phrase += ' '
        if n == 1:
            generated_word = df_ngram.iat[top_words[randint(0,49)],0]
            phrase += generated_word
        
        else:
            # Get the previous n-1 words (the trailing space leaves an empty last token)
            tokens = phrase.split(' ')
            history = ';'.join(tokens[len(tokens)-n:len(tokens)-1])
            
            # Find the most probable words that follow the previous words.
            # The trailing ';' stops e.g. 'kont' from also matching 'kontra;...'.
            if model == 'vanilla' or history in prev_ngrams:
                top_words = df_ngram[df_ngram[ngram_type].str.startswith(history + ';')]['Probability'].astype(float).nlargest(50).index
            
            # If there is no match for the history, then Laplace smoothing comes in.
            # Laplace smoothing ensures that the history occurs at least once
            # with every other word. Hence the next word essentially becomes a random guess.
            else:
                if model == 'unk':
                    # Replace unknown words with <UNK> and rebuild the history
                    new_history = ['<UNK>' if w not in words else w for w in history.split(';')]
                    history = ';'.join(new_history)
                    
                    # Attempt to find any ngrams with the <UNK>-modified history
                    if history in prev_ngrams:
                        top_words = df_ngram[df_ngram[ngram_type].str.startswith(history + ';')]['Probability'].astype(float).nlargest(50).index
                    
                    # If there is still no match with the <UNK> words, take a random guess (Laplace smoothing)
                    else: top_words = [randint(0,len(df_ngram)-1)]
                    
                else: top_words = [randint(0,len(df_ngram)-1)]
            
            # Pick a random word from the top 50.
            # For 'vanilla', an empty candidate list makes randint raise a ValueError here.
            generated_word = df_ngram.iat[top_words[randint(0,len(top_words)-1)],0].split(';')[-1]
            # Add to the current phrase
            phrase += generated_word
            
    return phrase


##### Evaluation

Let's use the same starting phrase <i>Jiena kont</i> to generate text with each of the models.

In [5]:
%%time
generate_text('Jiena kont', 'vanilla', 1)

Loading Models: [OK]
Generating Sentence...
Wall time: 132 ms


"Jiena kont dawn kif li L- xi L- kien l- kif jkun qed qed għall- għal jew l- fl- fuq ma' id- kull se fuq minn is- xi ma l- aktar lil din wara ma' oħra b' biex fuq L- biex se dawn hemm L- il- kien is- jkun id- il- it- hemm jew sena ir- biex lill- lil hemm L- din oħra tiegħu hemm jkun tal- dawn lill- lill- dan ma fl- jkun sena u dwar aktar xi sena ma se ir- ħafna <s> b' dwar fuq minn b' xi għal lill- għall- sena f' fuq oħra mill- lil 1 <s> li kien jew din il- ħafna xi u se xi tiegħu oħra minn meta 1 għal <s> sena meta dan dan f' jew jkun L- it- dwar biex kull dawn L- ta' qed kull biex li sena biex tal- kull il- f' għal tiegħu se L- kien għal u f' jew tiegħu tiegħu is- is- ħafna dawn fil- fuq għall- lil b' ir- hemm it- meta ma f' kif f' jew sena ir- fil- meta ħafna minn kif lill- fuq tal- dawn b' fuq jkun <s> is- <s> fuq biex xi fuq tiegħu fil- ma' is- b' aktar it- ta' ir- 1 l- Il- aktar lill- fl- Il- lill- dwar minn tal- f' tal- f' dak kif meta dan jew lill- fl- ma' ta' <s> se u ir- jkun

As we can see from the vanilla unigram generation, the generated text is gibberish. The majority of the words are articles and prepositions, which makes sense since these make up most of the 50 most common words in the corpus. We should also realise that if the end-of-sentence token were not among the top 50 words, then the generation would never stop!
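
The termination concern above could be guarded against with a cap on the number of generated words. A minimal sketch, where `generate_unigram`, `max_words` and the toy word list are hypothetical and not part of this notebook:

```python
import random

def generate_unigram(phrase, top_words, max_words=30, end_token="</s>", seed=None):
    """Append random words from `top_words` until `end_token` or `max_words` is reached."""
    rng = random.Random(seed)
    tokens = phrase.split()
    for _ in range(max_words):
        word = rng.choice(top_words)
        tokens.append(word)
        if word == end_token:
            break
    return " ".join(tokens)

# Toy top-word list that deliberately lacks the end-of-sentence token.
top_words = ["li", "l-", "ta'", "fuq"]
sentence = generate_unigram("Jiena kont", top_words, max_words=5, seed=0)
print(sentence)
```

With the cap in place the loop stops after `max_words` additions even when `</s>` can never be drawn.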

In [6]:
%%time
generate_text('Jiena kont', 'vanilla', 2)

Loading Models: [OK]
Generating Sentence...
Wall time: 5.12 s


'Jiena kont li Mark Tulius Cicerus 106 kif jiġi mill- familja li </s>'

Switching to a bigram model, we can immediately see some improvements. There are words that are not among the top 50, such as <b>familja</b> and <b>Mark</b>, which is a name. Hence we can confirm that the program is not just taking random guesses from the most common words.

The bigram <b>mill- familja</b> makes sense. In the corpus you would expect nouns to follow articles, and not anything else. The bigram <b>jiġi mill-</b> also makes sense, as <b>jiġi</b> is often used to mean <b>he is related to</b> rather than <b>he came</b>.

However, the sentence as a whole doesn't seem to have any particular flow.

In [15]:
%%time
generate_text('Jiena kont', 'vanilla', 3)

Loading Models: [OK]
Generating Sentence...
Jiena;kont
kont;għadni
għadni;skolastiku
skolastiku;hu
hu;kien
kien;jagħmel
Wall time: 8.72 s


'Jiena kont għadni skolastiku hu kien jagħmel </s>'

This sentence, although still not completely coherent, has the best flow so far. Most notable is the word <b>għadni</b>. Given that the current context is the first person (<b>Jiena</b>), it makes sense to see a first-person form here rather than third-person forms such as <b>għadu</b> or <b>għadhom</b>.

This implies that the trigram is working since <b>kont</b> doesn't encode any information on the speaker, unlike <b>Jiena</b>.

The same reasoning goes for <b>jagħmel</b>. The model conditions on the previous 2 words, <b>hu kien</b>, which are in the masculine third person.
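
The trace printed with the output above shows the two-word history used at each step. A minimal sketch of how such a history can be taken from the phrase so far, with `last_history` as a hypothetical helper and `;` assumed as the separator:

```python
def last_history(phrase, n, sep=";"):
    """Return the last n-1 words of `phrase`, joined by `sep` (empty for a unigram)."""
    if n <= 1:
        return ""  # a unigram model uses no context
    tokens = phrase.split()
    return sep.join(tokens[-(n - 1):])

print(last_history("Jiena kont", 3))
print(last_history("Jiena kont għadni", 3))
print(last_history("Jiena kont għadni", 2))
```

Anything earlier than the last n-1 words is invisible to the model, which is why agreement holds locally but the sentence can still drift in topic.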


In [8]:
%%time
generate_text('Jiena kont', 'laplace', 1)

Loading Models: [OK]
Generating Sentence...
Wall time: 142 ms


'Jiena kont sena jew din meta se u se dawn li oħra dak lil is- minn </s>'

For the Laplace unigram, the output isn't very different from that of the vanilla unigram.

In [9]:
%%time
generate_text('Jiena kont', 'laplace', 2)

Loading Models: [OK]
Generating Sentence...
Wall time: 5.94 s


"Jiena kont f' pajjiżi mhux talli qatt x- xewqa sempliċi ikunu determinati f' isem philodendron </s>"

Once again, with the bigram we see that the words have a more coherent structure, most notably in the bigram <b>x- xewqa</b>. The article dictates that the following noun should start with an <b>x</b>, which it does. Similarly, the preposition <b>f'</b> is always followed by a noun.

In [10]:
%%time
generate_text('Jiena kont', 'laplace', 3)

Loading Models: [OK]
Generating Sentence...
Wall time: 49.1 s


"Jiena kont naħseb jien faqqiegħ li qed jiġri biex jitla' fuq \xad vapur irid juri bil- fatti ta' dak hawn Malta kontra l- Iskozja ssir membru tal- unjoni monetarja u ekonomika il- fatt dwar kif għandu jitqies bħala persuna b' saħħitha f' dan 1 għan tiġi l- Ħadd li għadda Malta għelbet lill- goalkeeper avversarju għal skor ta' 5 sena ma ukoll ma bħala mistenni </s>"

In [11]:
%%time
generate_text('Jiena kont', 'unk', 1)

Loading Models: [OK]
Generating Sentence...
Wall time: 47.6 ms


"Jiena kont dwar dak u għall- fil- lil ma' meta L- il- fil- minn fuq wara ma wara dwar dak meta jkun dak dan </s>"

The UNK unigram gives much the same kind of result as the other unigram models.

In [12]:
%%time
generate_text('Jiena kont', 'unk', 2)

Loading Models: [OK]
Generating Sentence...
Wall time: 7.95 s


'Jiena kont għaliex jekk huwa parti fejn dawn ġew <UNK> of l- ħajja normali stabbilit għall- finanzi illi ruħha kull u għal bl- ATT XX </s>'

Once again we see that prepositions are always followed by nouns. Moreover, we see the generation of the \<UNK> token, proving that the \<UNK> model is working.

In [13]:
%%time
generate_text('Jiena kont', 'unk', 3)

Loading Models: [OK]
Generating Sentence...
Wall time: 24.3 s


"Jiena kont għadni żgħir niġri isfel stess lit- tarbija għaċ- ċajt Offi mifhum mozzjoni Abu trattament ażjenda men mhux ikunu ma' grupp żgħażagħ jorganizzaw numru ta' pazjenti bl- iskizofrenija kif ukoll fl- Istitut Mediterranju u fid- dekasteru tagħha </s>"

Once again we see that <b>għadni</b> respects the context of the previous 2 words. Consonants that are "<b>xemxin</b>" are also respected in the articles <b>lit- tarbija</b> and <b>għaċ- ċajt</b>. You would expect to find the phrase <b>grupp żgħażagħ jorganizzaw numru ta'</b> while reading an article about a charity event for example. However since the trigram only considers up to 2 words for context, it seems that the word <b>pazjenti</b> was taken from a different article about mental health institutions such as Mount Carmel. You would also expect to find <b>numru ta' pazjenti bl- iskizofrenija</b> in an article talking about hospital patients.

Let's explore some more examples from the UNK model, since it seems to give the most accurate results.

In [27]:
generate_text('Il- Laburisti', 'unk', 3)

Loading Models: [OK]
Generating Sentence...


"Il- Laburisti wkoll riedu l- permess mill- Ministru ta' qabel fis- 26 u 27 ta' Diċembru l- ġenituri kienu ġew trasferiti lill- Awtorità dik l- era </s>"

The generated sentence above gives insight into the pronunciation of numbers in the Maltese language. <b>fis- 26</b> shows that even though the written numeral doesn't start with the letter <b>s</b>, the article still agrees with its spoken form (<i>sitta u għoxrin</i>), so speech information is encoded within the corpus.

We should also note that <b>fis- 26</b> indicates that <b>26</b> represents a date, and hence the phrase continued as <b>fis- 26 u 27 ta' Diċembru</b>.

In [24]:
generate_text('In- Nazzjonalisti', 'unk', 3)

Loading Models: [OK]
Generating Sentence...


"In- Nazzjonalisti qatt ma ltqajt ma' qaddisa ferħana Chiara kienet mistiedna għall- festival tal- inbid </s>"

A notable phrase is <b>Chiara kienet mistiedna għall- festival tal- inbid</b>. The ngram encodes the information that Chiara is a female name and hence the words <b>kienet mistiedna</b> follow.

In [34]:
generate_text('Nhar it-', 'unk', 3)

Loading Models: [OK]
Generating Sentence...


"Nhar it- Tnejn 10 ta' Lulju 1994 li jistabbillixxi </s>"

Following on from the date idea, the phrase <b>Nhar it-</b> leads the text generation to produce a date.

In [11]:
generate_text('Għawdex huwa', 'unk', 3)

Loading Models: [OK]
Generating Sentence...


'Għawdex huwa servizz skond in- nomeklatura tal- Komunità għandu jiġi sottomess il- parteċipant </s>'