# Generating the N-Gram Models

An n-gram model is an enumeration of sequences of n consecutive words, together with the frequency and probability of each sequence. The n-grams are stored as CSV files so that they can be loaded directly with Pandas and do not have to be recomputed each time we wish to use them.
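As a minimal sketch of that storage format, the snippet below builds a toy unigram table and round-trips it through CSV. The token list is hypothetical, and an in-memory buffer stands in for the files on disk; only the column names are taken from this notebook.

```python
import io
from collections import Counter

import pandas as pd

# Hypothetical toy token list; the notebook's real tokens come from korpus.csv.
toy_tokens = ['<s>', 'il-', 'kelb', 'u', 'il-', 'qattus', '</s>']

# Count each word and store the counts with the notebook's column names.
counts = Counter(toy_tokens)
toy_unigram = pd.DataFrame(counts.items(), columns=['Unigram', 'Frequency'])
toy_unigram['Probability'] = toy_unigram['Frequency'] / len(toy_tokens)

# Write to an in-memory buffer (instead of a file on disk), then read it back.
buffer = io.StringIO()
toy_unigram.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
```

Reloading the table this way avoids recounting the corpus, which is the motivation given above.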

There are 3 models in total: Vanilla, Laplace and UNK. The Laplace model would include an occurrence of every possible combination of n words known in the corpus, so saving it would be slow and wasteful. Instead, during text generation and perplexity calculation the Laplace model makes use of the vanilla model, and Laplace smoothing is performed when an unknown sequence of words is encountered.
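One way such on-the-fly smoothing could look is sketched below for bigrams. This is an assumption rather than the notebook's actual implementation, which is not shown here: the function name, its parameters and the toy counts are all hypothetical, and the sketch applies the standard add-one formula to every lookup, seen or unseen.

```python
def laplace_bigram_probability(first, second, bigram_counts, unigram_counts, vocab_size):
    """Add-one estimate of P(second | first) computed from unsmoothed counts."""
    # Unseen bigrams and unseen contexts simply contribute a count of zero.
    bigram_count = bigram_counts.get(f'{first};{second}', 0)
    context_count = unigram_counts.get(first, 0)
    return (bigram_count + 1) / (context_count + vocab_size)


# Hypothetical toy counts, for illustration only.
toy_unigram_counts = {'il-': 2, 'kelb': 1, 'qattus': 1}
toy_bigram_counts = {'il-;kelb': 1, 'il-;qattus': 1}
vocab_size = len(toy_unigram_counts)

# A bigram that appears in the counts, and one that does not.
seen = laplace_bigram_probability('il-', 'kelb', toy_bigram_counts, toy_unigram_counts, vocab_size)
unseen = laplace_bigram_probability('kelb', 'qattus', toy_bigram_counts, toy_unigram_counts, vocab_size)
```

Because the unseen case is computed on demand, only the vanilla counts need to be stored on disk.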

In [1]:
import os
import pandas as pd
import seaborn as sns

In [2]:
folder = os.path.join(os.getcwd(),'..','data','korpus')

vanilla = os.path.join(os.getcwd(),'..','data','korpus','ngram','vanilla')
laplace = os.path.join(os.getcwd(),'..','data','korpus','ngram','laplace')
unk = os.path.join(os.getcwd(),'..','data','korpus','ngram','unk')

### Generating Vanilla Language Model

The vanilla models are built by first counting the frequency of each n-gram (a sequence of n words) within the corpus. The probabilities are then calculated from these counts.
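Written out, these are the usual maximum-likelihood estimates, which match the divisions performed in the code of this section. The symbols are notation introduced here for illustration: $C(\cdot)$ is the frequency of an n-gram and $N$ is the total number of words in the corpus.

$$
P(w_i) = \frac{C(w_i)}{N}, \qquad
P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\, w_i)}{C(w_{i-1})}, \qquad
P(w_i \mid w_{i-2}\, w_{i-1}) = \frac{C(w_{i-2}\, w_{i-1}\, w_i)}{C(w_{i-2}\, w_{i-1})}
$$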

Bigrams and trigrams make use of the previous n-gram model (unigram and bigram respectively). Hence, once the unigram or bigram model is generated, a dictionary is created from it with Pandas' `df.to_dict`. This gives O(1) average lookup time when searching for a count.
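The shape of that dictionary can be seen on a small example. The table below is hypothetical toy data, but the conversion call is the same one used later in this notebook: each word maps to a list whose first element is the frequency and whose second is the probability.

```python
import pandas as pd

# Hypothetical toy unigram table with the same columns the notebook uses.
toy_unigram = pd.DataFrame({
    'Unigram': ['il-', 'kelb', 'qattus'],
    'Frequency': [2, 1, 1],
    'Probability': [0.5, 0.25, 0.25],
})

# Each word becomes a key that maps to a [Frequency, Probability] list.
toy_dict = toy_unigram.set_index('Unigram').T.to_dict('list')

# A single hash lookup replaces a scan over the DataFrame rows.
frequency, probability = toy_dict['il-']
```

This is why the bigram and trigram code reads the count as element `[0]` of the looked-up value.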

In [4]:
df = pd.read_csv(os.path.join(folder,'korpus.csv'))

word_count = pd.read_csv(os.path.join(folder,'korpus_frequency.csv'))['Frequency'].sum()
words = df['Word'].to_numpy()


#### Unigram

In [5]:
%%time

#Calculating frequencies

def calculate_unigram(words, word_count: int):
    unigram = {}

    #Calculating frequency.
    for word in words:
        #Keys are always strings, so a missing value (NaN) becomes 'nan'.
        key = f'{word}'
        unigram[key] = unigram.get(key, 0) + 1

    df_unigram = pd.DataFrame(unigram.items(), columns=['Unigram', 'Frequency'])

    #Calculating probability.
    df_unigram['Probability'] = df_unigram['Frequency']/word_count
    df_unigram = df_unigram.dropna()
    print('Finished!')
    return df_unigram


df_unigram = calculate_unigram(words, word_count)

Finished!
Wall time: 43.5 s


In [6]:
%%time
#Saving unigram
dict_unigram = df_unigram.set_index('Unigram').T.to_dict('list')
df_unigram.to_csv(os.path.join(vanilla,'unigram.csv'), index=False)

Wall time: 18.9 s


#### Bigram

In [8]:
#Bigram
def calculate_bigram(words, unigram: dict):
    bigram = {}

    #Calculate frequency.
    for i in range(len(words)-1):
        key = f'{words[i]};{words[i+1]}'
        bigram[key] = bigram.get(key, 0) + 1

    df_bigram = pd.DataFrame(bigram.items(), columns=['Bigram', 'Frequency'])

    #Calculate probability: bigram frequency divided by the frequency of its first word.
    probability = []
    for bi,bi_freq in bigram.items():
        try:
            probability.append(bi_freq/unigram[bi.split(';')[0]][0])
        except KeyError:
            #The first word is missing from the unigram dictionary, e.g. because
            #the word itself contains ';'. Fall back to a probability of 1.
            probability.append(1)

    df_bigram['Probability'] = probability

    print('Finished!')
    return df_bigram

df_bigram = calculate_bigram(words, dict_unigram)


Finished!


In [9]:
df_bigram

| | Bigram | Frequency | Probability |
|---|---|---|---|
| 0 | `<s>;L-` | 277539 | 5.795826e-02 |
| 1 | `L-;għan` | 2976 | 9.152558e-03 |
| 2 | `għan;prinċipali` | 396 | 1.182513e-02 |
| 3 | `prinċipali;ta'` | 1319 | 1.077526e-01 |
| 4 | `ta';Conectando` | 1 | 3.734804e-07 |
| ... | ... | ... | ... |
| 8654882 | `Personalment;irrid` | 1 | 8.912656e-04 |
| 8654883 | `se;nsellmu` | 1 | 2.835801e-06 |
| 8654884 | `xulxin;jibgħat` | 1 | 7.336757e-05 |
| 8654885 | `";Inselli` | 1 | 3.789817e-06 |


In [None]:
#Saving bigram

dict_bigram = df_bigram.set_index('Bigram').T.to_dict('list')
df_bigram.to_csv(os.path.join(vanilla,'bigram.csv'), index=False)

#### Trigram

In [None]:
%%time

def calculate_trigram(words, bigram: dict): 
    trigram = {}

    #Calculate frequency.
    for i in range(len(words)-2):
        first  = words[i]
        second = words[i+1]
        third = words[i+2]

        if f'{first};{second};{third}' in trigram:
            trigram[f'{first};{second};{third}'] += 1
        else:
            trigram[f'{first};{second};{third}'] = 1
            
            
    df_trigram = pd.DataFrame(trigram.items(), columns=['Trigram', 'Frequency'])
    
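    #Calculate probability: trigram frequency divided by the frequency of its first two words.
    probability = []

    for tri,tri_freq in trigram.items():
        first,second = tri.split(';')[:2]
        #Use the dictionary passed in as an argument rather than the global dict_bigram.
        bi_freq = bigram[f'{first};{second}'][0]
        probability.append(tri_freq/bi_freq)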

    df_trigram['Probability'] = probability
    
    print('Finished!')
    return df_trigram

df_trigram = calculate_trigram(words, dict_bigram)

In [None]:
df_trigram

In [None]:
#Saving trigram
df_trigram.to_csv(os.path.join(vanilla,'trigram.csv'), index=False)

### Generating UNK Model

To generate the UNK model, we first remove all words in the vanilla unigram model that have a frequency of 2 or less. We then create an extra token called \<UNK> that represents all the removed words. The frequency of \<UNK> is the sum of the frequencies of the removed words.

#### Unigram

In [None]:
#Sum the frequencies of all words with a frequency of less than 3.

#Load vanilla unigram.
df_unigram = pd.read_csv(os.path.join(vanilla,'unigram.csv'))

#Calculate Frequency and Probability.
condition = df_unigram['Frequency'] < 3

unk_frequency = df_unigram[condition].Frequency.sum()
unk_probability = unk_frequency/word_count

#Remove the words that occur less than 3 times.
df_unigram = df_unigram.drop(df_unigram[condition].index)

#Add the <UNK> token.
df_unk = pd.DataFrame({'Unigram': '<UNK>', 'Frequency': unk_frequency, 'Probability': unk_probability}
                   ,index=[0])
df_unigram = pd.concat([df_unk,df_unigram], ignore_index = True)

df_unigram = df_unigram.dropna()

#Save model.
df_unigram.to_csv(os.path.join(unk,'unigram.csv'), index=False)
dict_unigram = df_unigram.set_index('Unigram').T.to_dict('list')

#Replace low frequency words with <UNK>

for i in range(len(words)):
    #If the word is not in dict_unigram then it means it was removed and set to <UNK>
    if words[i] not in dict_unigram:
        words[i] = '<UNK>'


df_unigram

Then we recalculate the bigram and trigram models based on this \<UNK> unigram model.
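The effect of the replacement on the higher-order counts can be sketched on a toy sequence. The words and the kept vocabulary below are hypothetical; the `first;second` key format follows the bigram code in this notebook.

```python
from collections import Counter

# Hypothetical toy data: 'kelb' and 'qattus' are assumed to be rare words.
toy_words = ['<s>', 'il-', 'kelb', 'u', 'il-', 'qattus', '</s>']
kept_vocabulary = {'<s>', 'il-', 'u', '</s>', '<UNK>'}

# Replace every out-of-vocabulary word with the <UNK> token.
replaced = [w if w in kept_vocabulary else '<UNK>' for w in toy_words]

# Count bigrams over the replaced sequence, using 'first;second' keys.
toy_bigrams = Counter(f'{a};{b}' for a, b in zip(replaced, replaced[1:]))
```

In this toy case the two distinct bigrams `il-;kelb` and `il-;qattus` collapse into a single `il-;<UNK>` entry, so the counts of rare contexts are pooled.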

#### Bigram

In [None]:
df_bigram = calculate_bigram(words, dict_unigram)

In [None]:
df_bigram.to_csv(os.path.join(unk,'bigram.csv'), index=False)
dict_bigram = df_bigram.set_index('Bigram').T.to_dict('list')

df_bigram

#### Trigram

In [None]:
%%time

df_trigram = calculate_trigram(words, dict_bigram)

In [None]:
df_trigram.to_csv(os.path.join(unk,'trigram.csv'), index=False)

df_trigram