# Topic
Next word predictor using a bigram language model.

Reference:
1. https://stackoverflow.com/questions/36797539/how-can-i-sort-mle-probability-according-to-the-value


Steps to follow:

1) Build the bigram LM:

    1.A) Use nltk to compile all the unique bigrams from the corpus you used for the previous assignment.

    1.B) Compute the probability of each bigram using MLE: P(w2 | w1) = count(w1 w2) / count(w1).

2) Next word prediction using the above bigram LM:

    2.A) Get an input word from the user, inpW.

    2.B) Use the above bigram LM to find all the bigrams where the input word, inpW, is w1. Display all possible next words from these bigrams and their corresponding probabilities.
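The steps above can be sketched without nltk, using only the standard library. This is a minimal illustration rather than the notebook's implementation: the helper names `bigram_mle` and `next_words` and the toy sentence are assumptions made for this example.

```python
from collections import Counter

def bigram_mle(tokens):
    """Return {(w1, w2): count(w1 w2) / count(w1)} for an ordered token list."""
    unigram_counts = Counter(tokens)
    # Pair each token with its successor; tokens must be an ordered sequence.
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    return {
        (w1, w2): count / unigram_counts[w1]
        for (w1, w2), count in bigram_counts.items()
    }

def next_words(model, inp_w):
    """Return {w2: P(w2 | inp_w)} for every bigram whose first word is inp_w."""
    return {w2: p for (w1, w2), p in model.items() if w1 == inp_w}

tokens = "the cat sat on the mat".split()  # toy corpus, hypothetical
model = bigram_mle(tokens)
print(next_words(model, "the"))  # {'cat': 0.5, 'mat': 0.5}
```

Because "the" occurs twice and is followed once by "cat" and once by "mat", each candidate gets probability 1/2.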

## Get unique bigrams
A. Use nltk to compile all the unique bigrams

In [1]:
import nltk

def getBigrams(text):
    """Return the bigram sequence plus token and bigram frequency distributions."""
    if isinstance(text, str):                  # input: raw string
        tokens = nltk.word_tokenize(text)
    elif isinstance(text, nltk.text.Text):     # input: NLTK book text
        tokens = list(text)                    # keep order and repeats; a set would lose both
    else:
        raise TypeError("text must be a str or an nltk.text.Text")
    bigrm = tuple(nltk.bigrams(tokens))
    token_freq = nltk.FreqDist(tokens)
    bi_freq = nltk.FreqDist(bigrm)
    return bigrm, token_freq, bi_freq

In [2]:
# test the getBigrams function
text = "I will go to California to meet my friend"
bigrm, token_freq, bi_freq = getBigrams(text)
print("The bigrams are: ", bigrm)
print("The frequency of tokens: ", token_freq.most_common(1000))
print("The frequency of bigrams: ", bi_freq.most_common(1000))

The bigrams are:  (('I', 'will'), ('will', 'go'), ('go', 'to'), ('to', 'California'), ('California', 'to'), ('to', 'meet'), ('meet', 'my'), ('my', 'friend'))
The frequency of tokens:  [('to', 2), ('I', 1), ('will', 1), ('go', 1), ('California', 1), ('meet', 1), ('my', 1), ('friend', 1)]
The frequency of bigrams:  [(('I', 'will'), 1), (('will', 'go'), 1), (('go', 'to'), 1), (('to', 'California'), 1), (('California', 'to'), 1), (('to', 'meet'), 1), (('meet', 'my'), 1), (('my', 'friend'), 1)]


In [3]:
def getProbability(bigrm, token_freq, bi_freq):
    """Map each bigram (w1, w2) to its MLE probability count(w1 w2) / count(w1)."""
    prob_dic = dict()
    for item in bigrm:
        w1 = item[0]
        # MLE: count(w1 w2) / count(w1)
        prob_dic[item] = bi_freq[item] / token_freq[w1]
    return prob_dic

In [4]:
# test function of getProbability
prob_dic = getProbability(bigrm, token_freq, bi_freq)
print(prob_dic)

{('I', 'will'): 1.0, ('will', 'go'): 1.0, ('go', 'to'): 1.0, ('to', 'California'): 0.5, ('California', 'to'): 1.0, ('to', 'meet'): 0.5, ('meet', 'my'): 1.0, ('my', 'friend'): 1.0}
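A quick sanity check on these values: for a fixed w1, the MLE probabilities of all its next words should sum to 1. The sketch below copies the dictionary printed above; the helper name `check_normalized` is an assumption for this example. One caveat: a word whose last occurrence is the final token of the corpus sums to less than 1, because that occurrence has no successor.

```python
from collections import defaultdict
import math

# Probabilities copied from the output above.
prob_dic = {
    ('I', 'will'): 1.0, ('will', 'go'): 1.0, ('go', 'to'): 1.0,
    ('to', 'California'): 0.5, ('California', 'to'): 1.0,
    ('to', 'meet'): 0.5, ('meet', 'my'): 1.0, ('my', 'friend'): 1.0,
}

def check_normalized(prob_dic):
    """Return {w1: total} for every w1 whose next-word probabilities do not sum to 1."""
    totals = defaultdict(float)
    for (w1, _w2), p in prob_dic.items():
        totals[w1] += p
    return {w1: t for w1, t in totals.items() if not math.isclose(t, 1.0)}

bad = check_normalized(prob_dic)
print(bad)  # {}
```

An empty result means every first word in the toy sentence has a properly normalized distribution.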


## Predict next word

In [5]:
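def inputWord(input_str, prob_dic, token_freq, bigrm):
    """Return the possible next words for input_str with their MLE probabilities."""
    if input_str not in token_freq:
        return "Sorry, '" + input_str + "' was not found in the corpus."
    # bigrm is unused but kept so existing calls keep working
    pairs = [str(key[1]) + ": " + str(val)
             for key, val in prob_dic.items() if key[0] == input_str]
    if not pairs:   # e.g. the word only occurs as the last token
        return "No next word found for '" + input_str + "'."
    return "Possible next words: " + ", ".join(pairs)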

In [6]:
# test inputWord function
test_string = "to"
print(inputWord(test_string, prob_dic, token_freq, bigrm))

Possible next words: California: 0.5, meet: 0.5
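On a larger corpus the candidate list can get long, so it may help to rank candidates by probability, as the referenced Stack Overflow question asks. The following is a sketch under stated assumptions: `build_index` and `top_next_words` are hypothetical helpers, and the probabilities in the sample dictionary are made up for illustration.

```python
from collections import defaultdict

def build_index(prob_dic):
    """Group {(w1, w2): p} into {w1: [(w2, p), ...]} so lookups avoid a full scan."""
    index = defaultdict(list)
    for (w1, w2), p in prob_dic.items():
        index[w1].append((w2, p))
    return index

def top_next_words(index, word, k=3):
    """Return up to k (w2, p) pairs, highest probability first, ties broken alphabetically."""
    candidates = index.get(word, [])
    return sorted(candidates, key=lambda pair: (-pair[1], pair[0]))[:k]

# Hypothetical probabilities, not taken from any corpus.
sample = {
    ('to', 'meet'): 0.25, ('to', 'California'): 0.5,
    ('to', 'be'): 0.25, ('go', 'to'): 1.0,
}
index = build_index(sample)
print(top_next_words(index, "to", k=2))  # [('California', 0.5), ('be', 0.25)]
```

Grouping by w1 once means each prediction only touches that word's own candidates instead of every bigram in the model.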


## Test in a large corpus from text book

In [7]:
from nltk.book import *
bigrm, token_freq, bi_freq = getBigrams(text3)
prob_dic = getProbability(bigrm, token_freq, bi_freq)

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [8]:
test1 = "He"
ans1 = inputWord(test1, prob_dic, token_freq, bigrm)
print(ans1)

print("----------")
# test2
test2 = "is"
ans2 = inputWord(test2, prob_dic, token_freq, bigrm)
print(ans2)

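Possible next words: uppermost: 1.0
----------
Possible next words: fill: 1.0

Each word gets exactly one candidate with probability 1.0 here because this run built the tokens for `text3` with `set(text)`. A set drops repeated tokens and does not preserve word order, so every count(w1) is 1 and the pairs are not real adjacent words from Genesis. Building the tokens with `list(text)` instead keeps the corpus order and gives each word its full distribution of next words.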
