# N-Gram Language Models #

Language models are one of the most important concepts in Natural Language Processing (NLP). They assign probabilities to sequences of words, such as sentences or phrases. Their applications extend beyond NLP: for example, language models improve speech recognition, optical character recognition (OCR), and information retrieval. Given a sentence $S$ consisting of the words $w_i$, $i=1,2,...,n$, a language model defines the probability of the sentence as the probability of observing that particular sequence of words:  
$$ \Large p(S)=p(w_1, w_2, w_3, ... , w_n) $$  

They also let us predict a word from the words that precede it, that is, estimate the conditional probability of a word $w_i$ given the $k$ previous words in the sentence:  

$$ \Large p(w_i | w_{i-k}, ..., w_{i-2}, w_{i-1}) $$  

The probability of a sentence is computed using the chain rule:  

$$ \Large p(S)=p(w_1, w_2, w_3, ... , w_n) = \prod_{i=1}^n {p(w_i|w_1,w_2,...,w_{i-1})} $$  
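
As a worked instance of the chain rule, take the three-word sentence "the river laughed". The sentence is chosen here purely for illustration and is not taken from a particular text. Its probability factors into one term per word, each conditioned on everything that came before it:

$$ \Large p(\text{the}, \text{river}, \text{laughed}) = p(\text{the}) \cdot p(\text{river}|\text{the}) \cdot p(\text{laughed}|\text{the}, \text{river}) $$  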

Conditioning on the full history quickly becomes impractical, so in practice we apply the Markov assumption, which simplifies the computation:  

$$ \Large p(S)=p(w_1, w_2, w_3, ... , w_n) \approx \prod_{i=1}^n {p(w_i|w_{i-k},w_{i-(k-1)},...,w_{i-1})} $$  

The Markov assumption states that the conditional probability of a word $w_i$, given all of the previous words in the sentence, can be approximated by considering only the $k$ previous words.
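
The effect of the assumption on the conditioning context can be sketched in a few lines of Python. The helper name `markov_contexts` and the example sentence are illustrative assumptions, not part of any library. For each position the helper keeps at most the `k` preceding tokens, so early words simply get a shorter context:

```python
def markov_contexts(tokens, k):
    """Pair each token with the (at most) k tokens that precede it."""
    pairs = []
    for i, token in enumerate(tokens):
        # Only the last k tokens before position i are kept as context
        context = tuple(tokens[max(0, i - k):i])
        pairs.append((context, token))
    return pairs


sentence = ["the", "river", "laughed", "at", "him"]
for context, token in markov_contexts(sentence, k=2):
    print("p({} | {})".format(token, ", ".join(context)))
```

With `k=2` this lists the factors of a trigram-style approximation; `k=1` would give bigram factors and `k=0` unigram factors.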


### N-Gram Models ###

#### Unigram Model ####
The unigram language model is the simplest of the n-gram models. Under this model the probability of the sentence $S$ is simply a product of the probabilities of each individual word in the sentence:  

$$ \Large p(S) \approx \prod_{i=1}^n {p(w_i)} $$  

N-gram probabilities are computed using Maximum Likelihood Estimates (MLE). For the unigram language model, the MLE is the number of times word $w_i$ occurs in the collection divided by the total number of word occurrences in the collection. The total is obtained by summing the counts of all $v$ distinct words:
$$ \Large p(w_i) = \frac{count(w_i)}{\sum_{j=1}^{v}{count(w_j)}} $$  
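
A minimal from-scratch sketch of this estimate, using only the standard library, might look as follows. The function name `unigram_mle` and the toy token list are assumptions made for illustration:

```python
from collections import Counter


def unigram_mle(tokens):
    """Return {word: count(word) / total number of tokens}."""
    counts = Counter(tokens)
    total = sum(counts.values())  # equals len(tokens)
    return {word: count / total for word, count in counts.items()}


toy_tokens = "the river flows and the river sings".split()
unigram = unigram_mle(toy_tokens)
print(unigram["river"])  # "river" occurs 2 times out of 7 tokens
```

Because every token contributes to exactly one count, the estimated probabilities sum to 1 over the vocabulary.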

#### Bigram Model ####
In the bigram model the probability of a word is conditioned only on the previous word:  
$$ \Large p(S) \approx \prod_{i=1}^n {p(w_i|w_{i-1})} $$  

The MLE for the bigram LM is the number of times the words $w_{i-1}$ and $w_i$ occur together in the collection, in that order, divided by the number of occurrences of the previous word $w_{i-1}$, $count(w_{i-1})$:

$$ \Large p(w_i|w_{i-1}) = \frac{count(w_{i-1},w_i)}{count(w_{i-1})} $$  
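
The same estimate can be sketched without any library. The name `bigram_mle` and the toy tokens are illustrative assumptions. One design choice is worth flagging: the denominator counts a word only where it has a successor, so the very last token of the list does not contribute to it. This keeps each conditional distribution summing to 1:

```python
from collections import Counter


def bigram_mle(tokens):
    """Return {(previous, word): p(word | previous)} estimated by MLE."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    # Count each word only in positions where it is followed by another word
    context_counts = Counter(tokens[:-1])
    return {
        (previous, word): count / context_counts[previous]
        for (previous, word), count in pair_counts.items()
    }


toy_tokens = "the river flows and the river sings".split()
bigram = bigram_mle(toy_tokens)
print(bigram[("the", "river")])    # "the" is always followed by "river"
print(bigram[("river", "flows")])  # "river" is followed by "flows" half the time
```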

#### Trigram Model ####
In the trigram model the probability of a word is conditioned on the previous two words:  
$$ \Large p(S) \approx \prod_{i=1}^n {p(w_i|w_{i-2},w_{i-1})} $$  

The MLE for the trigram LM is the number of times the words $w_{i-2}$, $w_{i-1}$, $w_i$ occur together in the collection, in that order, divided by the number of times $w_{i-2}$ and $w_{i-1}$ occur together, $count(w_{i-2},w_{i-1})$:

$$ \Large p(w_i|w_{i-2},w_{i-1}) = \frac{count(w_{i-2},w_{i-1},w_i)}{count(w_{i-2},w_{i-1})} $$  
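
The unigram, bigram and trigram estimates follow one pattern, which the sketch below generalizes to any order `n`. The function name `ngram_mle`, its nested-dictionary return shape and the toy tokens are assumptions chosen for illustration. The context is the tuple of the `n - 1` preceding words, so for `n = 1` it is the empty tuple:

```python
from collections import Counter, defaultdict


def ngram_mle(tokens, n):
    """Return {context: {word: p(word | context)}} for n-grams of order n >= 1."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])  # the n - 1 preceding words
        word = tokens[i + n - 1]
        counts[context][word] += 1
    model = {}
    for context, followers in counts.items():
        total = sum(followers.values())
        model[context] = {word: count / total for word, count in followers.items()}
    return model


toy_tokens = "he saw the river and he saw the sky".split()
trigram = ngram_mle(toy_tokens, 3)
print(trigram[("saw", "the")])  # "saw the" is followed once by "river" and once by "sky"
```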

## Example ##
In this task we are going to compute bigram and trigram language models using a novel. Rather than implementing language models on our own we are going to use the nltk package.  

Look in the books folder and choose one of the ten books to compute language models over. The example below uses the novel "Siddhartha" by Hermann Hesse. Load the book and extract its words and sentences:

In [7]:
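import itertools

import nltk

# Read the whole book into a single string
with open('./books/hesse_siddhartha.txt', 'r') as book_file:
	book = book_file.read()

# Flat list of lowercased tokens for the whole book
words = [word.lower() for word in nltk.word_tokenize(book)]

# One list of lowercased tokens per sentence
sentences = nltk.sent_tokenize(book)
tokenized_sentences = []
for sentence in sentences:
	# Use a separate name here so the book-level `words` list is not overwritten
	sentence_words = [word.lower() for word in nltk.word_tokenize(sentence)]
	tokenized_sentences.append(sentence_words)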
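
Keeping both a flat token list and per-sentence token lists matters for n-gram counting: bigrams taken from the flat list also pair the last token of one sentence with the first token of the next. The plain-Python sketch below illustrates this on two made-up sentences. The helper `adjacent_pairs` is a hypothetical stand-in for a bigram extractor, defined here so the example has no dependencies:

```python
def adjacent_pairs(tokens):
    """All pairs of neighbouring tokens, in order."""
    return list(zip(tokens, tokens[1:]))


toy_sentences = [["he", "smiled", "."], ["the", "river", "laughed", "."]]
flat_tokens = [token for sentence in toy_sentences for token in sentence]

flat_bigrams = adjacent_pairs(flat_tokens)
sentence_bigrams = [pair for sentence in toy_sentences for pair in adjacent_pairs(sentence)]

# Pairs that exist only because the flat list ignores sentence boundaries
crossing = [pair for pair in flat_bigrams if pair not in sentence_bigrams]
print(crossing)
```

The only extra pair is the one that joins the final full stop of the first sentence to the first word of the second.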

Compute bigram language model:

In [6]:
# First compute the bigrams (i.e. the pairs of adjacent words) over the whole book
bigram_model = nltk.bigrams(words)
# Then count, for each word, how often each other word follows it
bigram_frequency_word = nltk.ConditionalFreqDist(bigram_model)
# Normalize the counts to obtain the MLE bigram probabilities
bigram_probability_word = nltk.ConditionalProbDist(bigram_frequency_word, nltk.MLEProbDist)

# Now compute the bigrams within each sentence, so that no bigram spans a sentence boundary
bigram_frequency_sent = nltk.ConditionalFreqDist(itertools.chain.from_iterable(nltk.bigrams(sentence) for sentence in tokenized_sentences))
bigram_probability_sent = nltk.ConditionalProbDist(bigram_frequency_sent, nltk.MLEProbDist)

# Print the bigram probabilities computed over sentences
bigram_probability = bigram_probability_sent
bigram_frequency = bigram_frequency_sent

all_bigrams2 = {}
for source_word in bigram_probability:
	all_bigrams2[source_word] = {}
	for target_word in bigram_probability[source_word].samples():
		prob = bigram_probability[source_word].prob(target_word)
		all_bigrams2[source_word][target_word] = prob
		print("p({}|{})={:.4f}".format(target_word, source_word, prob))

p(by|effected)=1.0000
p(,|willingness)=0.3333
p(to|willingness)=0.3333
p(delights|willingness)=0.3333
p(hermit|forlorn)=1.0000
p(disciples|gotama's)=0.3333
p(sermon|gotama's)=0.3333
p(favourite|gotama's)=0.3333
p(.|area)=1.0000
p(,|pleased)=1.0000
p(,|folded)=0.7500
p(.|folded)=0.2500
p(.|rumours)=1.0000
p(the|informed)=1.0000
p(golden|shone)=0.2500
p(,|shone)=0.2500
p(or|shone)=0.2500
p(into|shone)=0.2500
p(siddhartha|follow)=0.1250
p(,|follow)=0.1250
p(his|follow)=0.1250
p(him|follow)=0.3750
p(that|follow)=0.1250
p(you|follow)=0.1250
p(--|brahmans.)=1.0000
p(which|features)=0.3333
p(of|features)=0.3333
p(,|features)=0.3333
p(of|readiness)=1.0000
p(up|woke)=1.0000
p(buddha|alleged)=1.0000
p(effected|is)=0.0031
p(where|is)=0.0062
p(full|is)=0.0031
p(n't|is)=0.0278
p(searching|is)=0.0031
p(hard|is)=0.0031
p(taking|is)=0.0031
p(shaking|is)=0.0031
p(good|is)=0.0216
p(useful|is)=0.0031
p(his|is)=0.0062
p(easy|is)=0.0031
p(now|is)=0.0123
p(one|is)=0.0093
p(harming|is)=0.0031
p(alive|is)=0.0

In [3]:
# We can also look up the bigram probability for a specific pair of words,
# here p(was | suffering):
query = bigram_probability["suffering"].prob("was")
print(query)

0.06818181818181818
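
Individual bigram probabilities like this one can be combined into the probability of a whole sentence, following the bigram product formula. The sketch below assumes a nested dictionary of the form `probs[previous_word][word]`, the same shape as `all_bigrams2`. The function name `sentence_log_prob` and the toy probabilities are illustrative, not taken from the book. It sums logarithms rather than multiplying probabilities, which avoids numerical underflow on long sentences, and it does not score the first word because no predecessor is available:

```python
import math


def sentence_log_prob(tokens, bigram_probs):
    """Sum log p(w_i | w_{i-1}) over all adjacent word pairs in tokens."""
    total = 0.0
    for previous, word in zip(tokens, tokens[1:]):
        prob = bigram_probs.get(previous, {}).get(word, 0.0)
        if prob == 0.0:
            # Under MLE an unseen bigram makes the whole sentence probability 0
            return float("-inf")
        total += math.log(prob)
    return total


toy_probs = {"he": {"smiled": 0.5, "spoke": 0.5}, "smiled": {".": 1.0}}
print(math.exp(sentence_log_prob(["he", "smiled", "."], toy_probs)))
print(sentence_log_prob(["he", "laughed", "."], toy_probs))
```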


Compute trigram language model:

In [4]:
# First compute the trigrams over the whole book
trigram_model = nltk.trigrams(words)
# Reshape each trigram into ((first word, second word), third word) so the pair acts as the condition
trigram_words = (((word[0], word[1]), word[2]) for word in trigram_model)
trigram_frequency_word = nltk.ConditionalFreqDist(trigram_words)
# Normalize the counts to obtain the MLE trigram probabilities
trigram_probability_word = nltk.ConditionalProbDist(trigram_frequency_word, nltk.MLEProbDist)
# Now compute the trigrams within each sentence, so that no trigram spans a sentence boundary
trigram_frequency_sent = nltk.ConditionalFreqDist(((word[0], word[1]), word[2]) for sentence in tokenized_sentences for word in nltk.trigrams(sentence))
trigram_probability_sent = nltk.ConditionalProbDist(trigram_frequency_sent, nltk.MLEProbDist)

# Print the trigram probabilities computed over sentences
trigram_probability = trigram_probability_sent
trigram_frequency = trigram_frequency_sent
all_trigrams2 = {}
for source_word in trigram_probability:
	all_trigrams2[source_word] = {}
	for target_word in trigram_probability[source_word].samples():
		prob = trigram_probability[source_word].prob(target_word)
		all_trigrams2[source_word][target_word] = prob
		print("p({}|{},{})={:.4f}".format(target_word, source_word[0], source_word[1], prob))

p(all|for,them)=0.5000
p(.|for,them)=0.5000
p(friendly|waited,,)=0.5000
p(watching|waited,,)=0.5000
p(a|,,such)=0.5000
p(is|,,such)=0.5000
p(his|doubt,in)=1.0000
p(all|talking,to)=0.3333
p(himself|talking,to)=0.3333
p(you|talking,to)=0.3333
p(alone|samanas,slept)=1.0000
p(''|food,:)=1.0000
p(would|'',we)=1.0000
p(shadow|light,and)=0.5000
p(peace|light,and)=0.5000
p(me|happened,to)=0.2500
p(slip|happened,to)=0.1250
p(meet|happened,to)=0.1250
p(glance|happened,to)=0.1250
p(him|happened,to)=0.2500
p(you|happened,to)=0.1250
p(father|chiefly,his)=1.0000
p(without|alone,,)=0.3333
p(he|alone,,)=0.3333
p(i|alone,,)=0.3333
p(runaway|following,the)=1.0000
p(down|fall,straight)=1.0000
p(of|the,foolishness)=1.0000
p(,|talked,about)=0.3333
p(what|talked,about)=0.3333
p(it|talked,about)=0.3333
p(so|was,right)=1.0000
p(,|not,full)=1.0000
p(made|offerings,were)=1.0000
p(evil|to,avoid)=1.0000
p(me|mystery,of)=0.5000
p(what|mystery,of)=0.5000
p(!|such,feats)=1.0000
p(from|give,away)=1.0000
p(.|of,truth)
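
A quick sanity check on output like the above is that, for every context, the probabilities of all possible next words sum to 1. The sketch below is one way to test this on a nested dictionary shaped like `all_trigrams2`. The function name `unnormalized_contexts` is an assumption. In the toy data, the first entry reuses the values printed above for the context `happened,to`, while the second entry is deliberately left incomplete to show a failing case:

```python
def unnormalized_contexts(nested_probs, tolerance=1e-9):
    """Return the contexts whose probabilities do not sum to 1 within tolerance."""
    bad = []
    for context, distribution in nested_probs.items():
        if abs(sum(distribution.values()) - 1.0) > tolerance:
            bad.append(context)
    return bad


toy_trigrams = {
    ("happened", "to"): {"me": 0.25, "slip": 0.125, "meet": 0.125,
                         "glance": 0.125, "him": 0.25, "you": 0.125},
    # Deliberately incomplete: one of the three continuations is missing
    ("talking", "to"): {"all": 0.3333, "himself": 0.3333},
}
print(unnormalized_contexts(toy_trigrams))
```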

**[Assignment 1]** What about the unigram language model? Using the example code above for computing bigram and trigram language models, see if you can implement the unigram language model.

**[Solution 1]**

In [5]:
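# `words` must hold the tokens of the whole book here. The probabilities recorded
# below are all multiples of 1/43, which suggests that this cell was run while
# `words` held only the 43 tokens of a single sentence.
unigram_frequency = nltk.FreqDist(words)
# A FreqDist has no conditions, so it is wrapped directly in MLEProbDist
unigram_probability = nltk.MLEProbDist(unigram_frequency)
denom = len(words)
all_unigrams = {}   # word -> probability
all_unigrams2 = {}  # "p(word)" -> probability
for word in unigram_frequency:
	freq = unigram_frequency[word]
	prob2 = (1.0)*freq/denom
	all_unigrams[word] = prob2
	all_unigrams2["p("+word+")"] = prob2
	print("p({})={:.4f}".format(word, prob2))

query = all_unigrams["he"]
print(query)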

p(the)=0.0233
p(to)=0.0233
p(and)=0.0233
p(everything)=0.0233
p(loved)=0.0233
p(in)=0.0465
p(smile)=0.0233
p(life)=0.0465
p(been)=0.0233
p(,)=0.1163
p(ground)=0.0233
p(his)=0.0465
p(him)=0.0698
p(touching)=0.0233
p(was)=0.0233
p(.)=0.0233
p(deeply)=0.0233
p(who)=0.0233
p(motionlessly)=0.0233
p(holy)=0.0233
p(reminded)=0.0233
p(what)=0.0233
p(sitting)=0.0233
p(valuable)=0.0233
p(of)=0.0233
p(ever)=0.0465
p(bowed)=0.0233
p(before)=0.0233
p(had)=0.0465
p(he)=0.0465
p(whose)=0.0233
0.046511627906976744
