In this second part of the project, the objective is to build an n-gram language model, which is defined by the conditional probabilities of the n-grams observed in the corpus. Under the assumption that each token depends only on the n-1 preceding tokens (the prefix), the model is fully specified by the probability of each token given its prefix.
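
For reference, the assumption can be written out as below. The notation is introduced here for illustration and does not appear in the notebook: $w_i$ is the i-th token, $m$ the number of tokens in a text, and $C(\cdot)$ a count in the training corpus. Tokens before the start of a text are assumed to be handled by padding.

```latex
P(w_1, \dots, w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \dots, w_{i-1})

% Trigram case (n = 3), with the usual relative-frequency estimate:
P(w_i \mid w_{i-2}, w_{i-1}) = \frac{C(w_{i-2}, w_{i-1}, w_i)}{C(w_{i-2}, w_{i-1})}
```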

In [19]:
import collections
import numpy as np
import operator
import pandas as pd
import re

from nltk.util import ngrams

The data are imported; the dataset was cleaned beforehand in the notebook Clean_dataset.ipynb.

In [2]:
data = pd.read_csv('cleaned_dataset.csv')

In [3]:
data.shape

(779478, 6)

In [4]:
data.head

<bound method NDFrame.head of         post_id  parent_id  comment_id  \
0             1        NaN         NaN   
1             3        NaN         NaN   
2             4        NaN         NaN   
3             6        NaN         NaN   
4             7        NaN         NaN   
...         ...        ...         ...   
779473   279994        NaN    536471.0   
779474   279998        NaN    536439.0   
779475   279998        NaN    536514.0   
779476   279999        NaN    536802.0   
779477   279999        NaN    542550.0   

                                                     text category  \
0                           Eliciting priors from experts    title   
1       What are some valuable Statistical Analysis op...    title   
2       Assessing the significance of differences in d...    title   
3       The Two Cultures: statistics vs. machine learn...    title   
4                  Locating freely available data samples    title   
...                                          

Split the data into a training set and a test set: titles form the test set, while posts and comments form the training set.

In [5]:
test_set = data[data['category'] == "title"]
train_set = data[data['category'].isin(["comment", "post"])]

In [6]:
train_set.shape

(693040, 6)

Each text is padded with a start symbol (<s>) and an end symbol (</s>).

In [7]:
texts = train_set['cleaned_texts']
texts = [f"<s> {text} </s>" for text in texts]

texts[0:3]

['<s> how should i elicit prior distributions from experts when fitting a bayesian model ? </s>',
 '<s> in many different statistical methods there is an assumption of normality . what is normality and how do i know if there is normality ? </s>',
 '<s> what are some valuable statistical analysis open source projects available right now ? edit : as pointed out by sharpie , valuable could mean helping you get things done faster or more cheaply . </s>']

In [9]:
# Split each padded text into trigrams, then flatten them into a single list
n = 3
tri_grams = [list(ngrams(sentence.split(), n)) for sentence in texts]
tri_grams = [tri_gram for trigram_list in tri_grams for tri_gram in trigram_list]
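
To make the splitting step concrete, here is a minimal sketch on a single made-up text. It uses a plain `zip` over shifted token lists, which should yield the same tuples as `ngrams(tokens, 3)` without padding. The names `toy_text` and `toy_trigrams` are hypothetical and not part of the notebook.

```python
# Toy illustration; `toy_text` is a made-up example, not taken from the dataset.
toy_text = "<s> what is normality ? </s>"
tokens = toy_text.split()

# Slide a window of three tokens over the sequence (pure-Python equivalent
# of ngrams(tokens, 3) without padding).
toy_trigrams = list(zip(tokens, tokens[1:], tokens[2:]))

for trigram in toy_trigrams:
    print(trigram)
```

A text with k tokens yields k - 2 trigrams, so a text with fewer than three tokens contributes none.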

Build a nested count dictionary: for each bigram prefix, count how often each third token follows it.

In [10]:
count_dict = collections.defaultdict(lambda: collections.defaultdict(int))

for first, second, third in tri_grams:
    count_dict[(first, second)][third] += 1

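Build a probability dictionary by normalizing the counts over each prefix, which gives the relative frequency of each third token given its bigram prefix.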

In [11]:
prob_dict = collections.defaultdict(lambda: collections.defaultdict(float))

for bi_key, value in count_dict.items():
    sum_values = sum(value.values())
    for third, count in value.items():
        prob_dict[bi_key][third] = count / sum_values
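
The counting and normalizing steps can be checked by hand on a tiny example. The sketch below uses four made-up trigrams. All names (`toy_trigrams`, `toy_counts`, `toy_probs`) are hypothetical, and a plain dict is used for the probabilities as one possible choice, so that lookups do not silently add keys.

```python
import collections

# Hypothetical toy trigrams, chosen only to make the arithmetic easy to follow.
toy_trigrams = [
    ("the", "data", "are"),
    ("the", "data", "are"),
    ("the", "data", "is"),
    ("the", "model", "is"),
]

# Count how often each third token follows each bigram prefix.
toy_counts = collections.defaultdict(lambda: collections.defaultdict(int))
for first, second, third in toy_trigrams:
    toy_counts[(first, second)][third] += 1

# Normalize the counts of each prefix so that they sum to one.
toy_probs = {}
for prefix, followers in toy_counts.items():
    total = sum(followers.values())
    toy_probs[prefix] = {token: count / total for token, count in followers.items()}

print(toy_probs[("the", "data")])
print(toy_probs[("the", "model")])
```

Here the prefix ("the", "data") was seen three times, twice followed by "are" and once by "is", so the estimates are 2/3 and 1/3.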
    

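The function generate_text takes the prob_dict and a bigram as input and returns the most probable next token. Calling it repeatedly on the last two tokens generates text greedily, one token at a time.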

In [25]:
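def generate_text(prob_dict, bi_gram):
    # Distribution over next tokens for this prefix (empty if the prefix was never seen)
    bi_gram_dict = prob_dict[bi_gram]
    # Greedy choice: the token with the highest conditional probability
    most_probable = max(bi_gram_dict.items(), key=operator.itemgetter(1))[0]
    
    return most_probable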

In [26]:
start_of_text_list = ['shall', 'we']

while len(start_of_text_list) < 100:
    
    input_tuple = (start_of_text_list[-2], start_of_text_list[-1])
    new_word = generate_text(prob_dict, input_tuple)
    start_of_text_list.append(new_word)
    
    if new_word == '</s>':
        break
        
print(start_of_text_list)

['shall', 'we', 'presume', 'that', 'the', 'data', '.', '</s>']
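
Because the most probable token is always chosen, the same start always produces the same continuation. A common alternative, not used in this notebook, is to sample the next token in proportion to its probability. The following is a minimal sketch under that assumption. The function `sample_next_token` and the dictionary `toy_prob_dict` are hypothetical stand-ins.

```python
import random

def sample_next_token(prob_dict, bi_gram, rng=random):
    """Draw the next token at random, weighted by its conditional probability."""
    followers = prob_dict.get(bi_gram)
    if not followers:
        # Unseen prefix: nothing to sample from.
        return None
    tokens = list(followers.keys())
    weights = list(followers.values())
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical toy distribution, standing in for the notebook's prob_dict.
toy_prob_dict = {
    ("shall", "we"): {"presume": 0.5, "assume": 0.5},
    ("we", "presume"): {"</s>": 1.0},
    ("we", "assume"): {"</s>": 1.0},
}

rng = random.Random(0)  # fixed seed so the run is repeatable
generated = ["shall", "we"]
while len(generated) < 100:
    new_word = sample_next_token(toy_prob_dict, (generated[-2], generated[-1]), rng)
    if new_word is None:
        break
    generated.append(new_word)
    if new_word == "</s>":
        break

print(generated)
```

Using `.get` instead of indexing avoids adding empty entries to a defaultdict when a prefix has never been seen.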


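The function sentence_prob calculates the probability of a whole sentence as the product of its trigram probabilities, accumulated in log space. The sentence is not padded here, and a trigram that never occurred in the training set makes the whole probability zero.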

In [28]:
def sentence_prob(prob_dict, sentence, n):
    tri_grams = list(ngrams(sentence.split(), n))
    
    # Sum log-probabilities instead of multiplying probabilities to limit underflow
    sentence_probability = 0
    for first, second, third in tri_grams:
        tri_gram_prob = prob_dict[(first, second)][third]
        sentence_probability += np.log(tri_gram_prob)
        
    # Convert the accumulated log-probability back to a probability
    sentence_probability = np.exp(sentence_probability)
    
    return sentence_probability

test_sen = "this does not go as it should go ."
sentence_prob(prob_dict, test_sen, n)
    
    

1.2065397568632802e-12
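
An unseen trigram has probability zero under this unsmoothed model, and taking the logarithm of zero gives minus infinity. The sketch below shows one way to make that case explicit while staying in log space. It is an illustration under stated assumptions, not the notebook's method: `sentence_log_prob` and `toy_prob_dict` are hypothetical names, and the probabilities are made up.

```python
import math

def sentence_log_prob(prob_dict, sentence):
    """Return the natural-log probability of a sentence under a trigram model."""
    tokens = sentence.split()
    log_prob = 0.0
    for first, second, third in zip(tokens, tokens[1:], tokens[2:]):
        # .get avoids inserting new keys into a defaultdict during lookup.
        prob = prob_dict.get((first, second), {}).get(third, 0.0)
        if prob == 0.0:
            # Unseen trigram: the unsmoothed model assigns probability zero.
            return float("-inf")
        log_prob += math.log(prob)
    return log_prob

# Hypothetical toy probabilities, not taken from the notebook's prob_dict.
toy_prob_dict = {
    ("this", "does"): {"not": 0.5},
    ("does", "not"): {"work": 0.25},
}

print(sentence_log_prob(toy_prob_dict, "this does not work"))
print(sentence_log_prob(toy_prob_dict, "this does not go"))
```

The first sentence consists only of seen trigrams and gets a finite log-probability, while the second contains an unseen trigram and gets minus infinity. Smoothing is the usual remedy for this, but it is not applied in this notebook.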

In [21]:
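def sentence_perplexity(sentence):
    # Normalize by the number of tokens in the (unpadded) sentence
    sentence_length = len(sentence.split())
    
    # Relies on the global prob_dict and n defined above
    prob_of_sentence = sentence_prob(prob_dict, sentence, n)
    # Perplexity: inverse probability, normalized by the sentence length
    perplexity = prob_of_sentence**(-1/sentence_length)
    
    return perplexity

sentence_perplexity(test_sen)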
    

21.099547981469296
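
The perplexity P ** (-1 / N) can also be computed as exp(-log P / N), which avoids converting a very small probability back from log space for long sentences. The sketch below checks that both forms agree on the values reported above for test_sen. The function `perplexity_from_log_prob` is a hypothetical helper and not part of the notebook.

```python
import math

def perplexity_from_log_prob(log_prob, num_tokens):
    """Perplexity as exp(-log P / N), equivalent to P ** (-1 / N)."""
    return math.exp(-log_prob / num_tokens)

# Values reported above for test_sen: its probability and its token count.
reported_prob = 1.2065397568632802e-12
num_tokens = len("this does not go as it should go .".split())

print(perplexity_from_log_prob(math.log(reported_prob), num_tokens))
print(reported_prob ** (-1 / num_tokens))
```

Both printed values should agree with the perplexity reported above up to floating-point rounding.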