# Next word predictor using a bigram language model
Created by: Khalid Mehtab Khan
Date Modified: 02/28/2024

Given an input word, we can predict a list of words that may follow it, restricted to the vocabulary of the corpus.

- Required imports: nltk (nltk.corpus, nltk.bigrams) and collections.Counter

In [7]:
import nltk
from nltk import bigrams

# Importing the Counter class to count the frequencies
from collections import Counter

In [8]:
# Importing the webtext corpus from nltk
# file used 'pirates.txt'

nltk.download('webtext')
from nltk.corpus import webtext

[nltk_data] Downloading package webtext to
[nltk_data]     /Users/khalidkhan/nltk_data...
[nltk_data]   Package webtext is already up-to-date!


## Building a bigram vocab

In [9]:
# doing some preprocessing, as the script contains certain repetitive symbols
text = webtext.raw('pirates.txt').replace("[","").replace("]","").replace("*","").replace(":","").replace(".","").replace("!","").replace(",","").replace("-"," ")

# converting the text to lower case so that the same word
# matches regardless of its original casing
text = text.lower()

# printing the text
print(text)

pirates of the carribean dead man's chest by ted elliott & terry rossio
view looking straight down at rolling swells sound of wind and thunder then a low heartbeat
scene port royal
teacups on a table in the rain
sheet music on music stands in the rain
bouquet of white orchids elizabeth sitting in the rain holding the bouquet
men rowing men on horseback to the sound of thunder
eitc logo on flag blowing in the wind
many rowboats are entering the harbor
elizabeth sitting alone at a distance
marines running kick a door in 
a mule is seen on the left in the barn where the marines enter
liz looking over her shoulder
elizabeth drops her bouquet
will is in manacles being escorted by red coats
elizabeth swann will
elizabeth runs to will
elizabeth swann why is this happening? 
will turner i don't know you look beautiful
elizabeth swann i think it's bad luck for the groom to see the bride before the wedding
marines cross their long axes to bar governor from entering
beckett in white hair and curl
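
The chained `replace` calls above can also be written as a single translation table. This is a minimal sketch, not the notebook's own code: the names `REMOVE`, `TO_SPACE`, `TABLE` and `clean` are made up for illustration, and the character set simply mirrors the symbols stripped above (so `?` is still left in place, as in the notebook output).

```python
# Characters the notebook deletes outright, and the one it turns into a space.
REMOVE = "[]*:.!,"
TO_SPACE = "-"

# str.maketrans(x, y, z): map x[i] -> y[i], and delete every character in z.
TABLE = str.maketrans(TO_SPACE, " ", REMOVE)


def clean(raw: str) -> str:
    """Strip the listed symbols, turn hyphens into spaces, and lower-case."""
    return raw.translate(TABLE).lower()


sample = "[Scene: Port-Royal] *Elizabeth* runs, to Will!"
print(clean(sample))  # scene port royal elizabeth runs to will
```

With a table like this, adding another symbol to strip (for example `?`) would be a one-character change to `REMOVE`.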

In [10]:
# Total words

words = text.split()
print("Words: ",len(words),words)

# computing the frequency of each word
# in other words unigram frequency
word_freq = Counter(words)

# computing all bigrams
bigramslist = list(bigrams(words))
unique_bigrams = set(bigramslist)



# counting frequencies for bigrams
bigram_freq = Counter(bigramslist)


print('Total bigrams:',len(bigramslist))
print('Unique bigrams:',len(unique_bigrams))


# printing the frequency of each bigram
print('Bigram : Frequency')
for bigram in unique_bigrams:
    print(f'{bigram} : {bigram_freq[bigram]}')


Words:  16703 ['pirates', 'of', 'the', 'carribean', 'dead', "man's", 'chest', 'by', 'ted', 'elliott', '&', 'terry', 'rossio', 'view', 'looking', 'straight', 'down', 'at', 'rolling', 'swells', 'sound', 'of', 'wind', 'and', 'thunder', 'then', 'a', 'low', 'heartbeat', 'scene', 'port', 'royal', 'teacups', 'on', 'a', 'table', 'in', 'the', 'rain', 'sheet', 'music', 'on', 'music', 'stands', 'in', 'the', 'rain', 'bouquet', 'of', 'white', 'orchids', 'elizabeth', 'sitting', 'in', 'the', 'rain', 'holding', 'the', 'bouquet', 'men', 'rowing', 'men', 'on', 'horseback', 'to', 'the', 'sound', 'of', 'thunder', 'eitc', 'logo', 'on', 'flag', 'blowing', 'in', 'the', 'wind', 'many', 'rowboats', 'are', 'entering', 'the', 'harbor', 'elizabeth', 'sitting', 'alone', 'at', 'a', 'distance', 'marines', 'running', 'kick', 'a', 'door', 'in', 'a', 'mule', 'is', 'seen', 'on', 'the', 'left', 'in', 'the', 'barn', 'where', 'the', 'marines', 'enter', 'liz', 'looking', 'over', 'her', 'shoulder', 'elizabeth', 'drops', 'her

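# Calculating probability
- Using maximum likelihood estimation (MLE) to calculate the probability of each bigram pair
- $P(w_2 \mid w_1) = \dfrac{\text{count}(w_1, w_2)}{\text{count}(w_1)}$, where $w_1$ is the first word of the bigram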
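
To make the formula concrete, here is a small worked sketch on a toy sentence rather than the pirates script. The token list and the helper name `mle` are assumptions made for illustration; `zip(tokens, tokens[1:])` stands in for `nltk.bigrams` so the example needs only the standard library.

```python
from collections import Counter

# Toy corpus (an assumption for illustration, not the notebook's data).
tokens = "the rain in the wind in the rain".split()

# Unigram and bigram counts, built the same way as word_freq and bigram_freq.
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))


def mle(w1: str, w2: str) -> float:
    """P(w2 | w1) = count(w1, w2) / count(w1); 0.0 if w1 was never seen."""
    if unigram_counts[w1] == 0:
        return 0.0
    return bigram_counts[(w1, w2)] / unigram_counts[w1]


print(mle("the", "rain"))  # "the" occurs 3 times, "the rain" twice -> 2/3
print(mle("in", "the"))    # "in" occurs twice, both followed by "the" -> 1.0
```

One consequence of dividing by the unigram count: the last token of the corpus is counted as a word but starts no bigram, so the probabilities of its followers sum to slightly less than 1.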


# Output for step1

In [11]:
# Calculating the probability of each bigram from the bigram freq and word freq (unigram freq)
# Note: this overwrites the counts in bigram_freq with probabilities, so run this cell only once

for bigram in bigram_freq:
  w1, w2 = bigram
  bigram_freq[bigram] = bigram_freq[bigram] / word_freq[w1]

print("Bigram Model & Probabilities:")

for bigram in bigram_freq:
  print(f'P({bigram[1]}|{bigram[0]}) \t= {bigram_freq[bigram]}')

Bigram Model & Probabilities:
P(of|pirates) 	= 0.25
P(the|of) 	= 0.3508771929824561
P(carribean|the) 	= 0.0009328358208955224
P(dead|carribean) 	= 1.0
P(man's|dead) 	= 0.2857142857142857
P(chest|man's) 	= 0.4
P(by|chest) 	= 0.020833333333333332
P(ted|by) 	= 0.023255813953488372
P(elliott|ted) 	= 1.0
P(&|elliott) 	= 1.0
P(terry|&) 	= 0.08333333333333333
P(rossio|terry) 	= 1.0
P(view|rossio) 	= 1.0
P(looking|view) 	= 0.07692307692307693
P(straight|looking) 	= 0.1111111111111111
P(down|straight) 	= 1.0
P(at|down) 	= 0.09302325581395349
P(rolling|at) 	= 0.010869565217391304
P(swells|rolling) 	= 0.2
P(sound|swells) 	= 1.0
P(of|sound) 	= 0.26666666666666666
P(wind|of) 	= 0.0035087719298245615
P(and|wind) 	= 0.2
P(thunder|and) 	= 0.0038910505836575876
P(then|thunder) 	= 0.5
P(a|then) 	= 0.030303030303030304
P(low|a) 	= 0.004629629629629629
P(heartbeat|low) 	= 0.5
P(scene|heartbeat) 	= 0.5
P(port|scene) 	= 0.07142857142857142
P(royal|port) 	= 0.55
P(teacups|royal) 	= 0.08333333333333333
P(on|t

## Taking user input and checking it against the vocabulary
### If the word is found, the possible next words are listed with their probabilities

### If the word is not found, the user is asked to try a different word


In [12]:
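# User input (stripped and lower-cased to match the lower-cased corpus)
inpW = input("Enter a word: ").strip().lower()

# Finding the possible words that can come after the user input
# (the loop variable is named bigram so it does not shadow nltk's bigrams function)
possibilities = []
for bigram in bigram_freq:
  w1, w2 = bigram
  prob = bigram_freq[bigram]
  if w1 == inpW:
    possibilities.append((w2, prob))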


# Sorting the possibilities by probability, highest first

if possibilities:
  possibilities = sorted(possibilities, key=lambda x: x[1], reverse=True)

  # Printing the possible words
  print(f"User Input:  {inpW}\n")
  print("Possible words")
  print("Phrase\t\t", "Probability")
  for word, prob in possibilities:
    print(f'{inpW} {word}:\t {prob}')

# If the word is not found
else:
  print("Word not found, try a different word")


User Input:  running

Possible words
Phrase		 Probability
running off:	 0.16666666666666666
running down:	 0.16666666666666666
running kick:	 0.08333333333333333
running run:	 0.08333333333333333
running across:	 0.08333333333333333
running with:	 0.08333333333333333
running motion:	 0.08333333333333333
running from?:	 0.08333333333333333
running behind:	 0.08333333333333333
running alone:	 0.08333333333333333
