<a href="https://colab.research.google.com/github/MarinaOhm/NLP-using-N-grams/blob/main/NLP_using_N_grams.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Language modelling

In [None]:
!pip install nltk



In [None]:
# importing packages needed for the project
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('gutenberg')

from nltk.corpus import gutenberg
from nltk import bigrams, trigrams
from collections import defaultdict
from nltk.corpus import stopwords
from nltk.util import ngrams
from nltk.tokenize import word_tokenize
from collections import Counter
import random

stop_words = set(stopwords.words('english'))


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


In [None]:
# Choosing the text 'Emma' by Jane Austen
emma = gutenberg.words('austen-emma.txt')

# Exploring the text corpus
print('The chosen text corpus contains: ' + str(len(emma)) + ' words')

emma_sentences=gutenberg.sents('austen-emma.txt')
print('The chosen text corpus contains: ' + str(len(emma_sentences)) + ' sentences')


The chosen text corpus contains: 192427 words
The chosen text corpus contains: 7752 sentences


### Test corpus

The following word sequences will be used throughout the assignment to test our uni-, bi-, and trigram models (`None` marks sentence padding):
1. never, did, she
2. None, None, She
3. None, She, did
4. She, did, unknownword

## Trigram model

In [None]:
# First, we will build the trigram model

# Initiating our dictionary used for storing our trigrams
emma_model_trigram = defaultdict(lambda: defaultdict(lambda: 0))

# Looping through the sentences of the text
for sentence in emma_sentences:
  # Converting each sentence into trigrams. Padding is enabled so that None markers are added at the beginning and end of each sentence
  for w1, w2, w3 in trigrams(sentence, pad_right=True, pad_left=True):
    # Counting one each time word 1 and word 2 are followed by word 3.
    emma_model_trigram[(w1, w2)][w3] += 1

# Testing cases
print('Count: ' + str(emma_model_trigram["never", "did"]["she"])) 
print('Count: ' + str(emma_model_trigram[None, None]["She"]))
print('Count: ' + str(emma_model_trigram[None, "She"]["did"]))
print('Count: ' + str(emma_model_trigram["She", "did"]["unknownword"]))
print('-----------------------------------')

# Transforming the counts into probabilities
for w1_w2 in emma_model_trigram:
    total_count = float(sum(emma_model_trigram[w1_w2].values()))
    for w3 in emma_model_trigram[w1_w2]:
        emma_model_trigram[w1_w2][w3] /= total_count

# Testing cases
print('Probability: ' + str(emma_model_trigram["never", "did"]["she"])) 
print('Probability: ' + str(emma_model_trigram[None, None]["She"]))
print('Probability: ' + str(emma_model_trigram[None, "She"]["did"]))
print('Probability: ' + str(emma_model_trigram["She", "did"]["unknownword"]))

Count: 1
Count: 460
Count: 10
Count: 0
-----------------------------------
Probability: 0.2
Probability: 0.05933952528379773
Probability: 0.021739130434782608
Probability: 0.0
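
To make the padding and the count-to-probability step concrete, here is a minimal sketch on a made-up two-sentence corpus. The helper `padded_trigrams`, the function `trigram_prob` and the toy sentences are assumptions for illustration only; the padding is written by hand so the example runs without NLTK, and it is meant to mirror what `trigrams(..., pad_left=True, pad_right=True)` yields.

```python
from collections import Counter, defaultdict

def padded_trigrams(sentence):
    """Yield trigrams with two None markers on each side of the sentence."""
    padded = [None, None] + list(sentence) + [None, None]
    for i in range(len(padded) - 2):
        yield tuple(padded[i:i + 3])

# Hypothetical two-sentence corpus, only for illustration
toy_sentences = [["She", "did", "not"], ["She", "was", "not"]]

# counts[(w1, w2)][w3] = how often w3 follows the context (w1, w2)
counts = defaultdict(Counter)
for sentence in toy_sentences:
    for w1, w2, w3 in padded_trigrams(sentence):
        counts[(w1, w2)][w3] += 1

def trigram_prob(w1, w2, w3):
    """Relative frequency of w3 among all words seen after (w1, w2)."""
    total = sum(counts[(w1, w2)].values())
    return counts[(w1, w2)][w3] / total if total else 0.0

print(list(padded_trigrams(["She", "did", "not"])))
print(trigram_prob(None, None, "She"))           # both toy sentences start with "She"
print(trigram_prob(None, "She", "did"))          # "did" follows in one of two cases
print(trigram_prob("She", "did", "unknownword")) # unseen continuation
```

An unseen continuation gets probability 0 here, just as `unknownword` does in the test cases above.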


## Bigram model

In [None]:
# Building the bigram model similar to the trigram model, however this time we will only base it on one previous word

emma_model_bigram = defaultdict(lambda: defaultdict(lambda: 0))

for sentence in emma_sentences:
  # Instead of w1, w2, w3 we now only use two words w1, w2 and the function bigrams() to convert sentences into bigrams
  for w1, w2 in bigrams(sentence, pad_right=True, pad_left=True):
    emma_model_bigram[w1][w2] += 1

# Testing cases
print('Count: ' + str(emma_model_bigram["did"]["she"])) 
print('Count: ' + str(emma_model_bigram[None]["She"]))
print('Count: ' + str(emma_model_bigram["She"]["did"]))
print('Count: ' + str(emma_model_bigram["did"]["unknownword"]))
print('-----------------------------------')

# Transforming the counts into probabilities
for w1 in emma_model_bigram:
    total_count = float(sum(emma_model_bigram[w1].values()))
    for w2 in emma_model_bigram[w1]:
        emma_model_bigram[w1][w2] /= total_count


# Testing cases
print('Probability: ' + str(emma_model_bigram["did"]["she"])) 
print('Probability: ' + str(emma_model_bigram[None]["She"]))
print('Probability: ' + str(emma_model_bigram["She"]["did"]))
print('Probability: ' + str(emma_model_bigram["did"]["unknownword"]))

Count: 6
Count: 460
Count: 15
Count: 0
-----------------------------------
Probability: 0.01791044776119403
Probability: 0.05933952528379773
Probability: 0.026690391459074734
Probability: 0.0
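
The second test case can be checked by hand. Assuming every sentence in the corpus is non-empty, left padding produces exactly one sentence-initial context per sentence, so the denominator equals the number of sentences reported earlier (7752). Writing $\langle s \rangle$ for the `None` padding:

```latex
P(\text{She} \mid \langle s \rangle)
  = \frac{C(\langle s \rangle, \text{She})}{C(\langle s \rangle)}
  = \frac{460}{7752}
  \approx 0.0593
```

The same reasoning applies to the trigram context `(None, None)`, which also occurs once per sentence, so both models assign the same probability to a sentence starting with "She".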


## Unigram model

In [None]:
# Counting how often each token occurs; the Counter maps each token to its frequency
counts = Counter(gutenberg.words('austen-emma.txt'))

# Counting the total number of words in the text
total_count = len((gutenberg.words('austen-emma.txt')))

def unigram_model(word):

  # Calculating the probability as the number of times a given word occurs divided by the total number of words in the corpus
  prob = counts[word] / float(total_count)

  print('The word ' + word + ' appears ' + str(counts[word]) + ' times in the text. Meaning a probability of: ' + str(prob))

In [None]:
unigram_model('She')
unigram_model('unknownword')
unigram_model('did')

The word She appears 562 times in the text. Meaning a probability of: 0.002920588067163132
The word unknownword appears 0 times in the text. Meaning a probability of: 0.0
The word did appears 335 times in the text. Meaning a probability of: 0.0017409199332733972


## Interpolation model

In order to create the interpolation model we turn the three models into functions. 

In [None]:
def unigram_model(unigram):
  # Relative frequency of the given word in the whole corpus
  counts = Counter(gutenberg.words('austen-emma.txt'))
  total_count = len((gutenberg.words('austen-emma.txt')))
  prob_uni = counts[unigram] / float(total_count)
  return prob_uni


def bigram_model(bigram):
  w1=bigram[0]
  w2=bigram[1]
  emma_model_bigram = defaultdict(lambda: defaultdict(lambda: 0))
  # Loop variables get their own names so they do not overwrite the query words w1, w2
  for sentence in emma_sentences:
    for a, b in bigrams(sentence, pad_right=True, pad_left=True):
      emma_model_bigram[a][b] += 1
  # Normalising only the requested context; an unseen context gets probability 0
  total_count = float(sum(emma_model_bigram[w1].values()))
  if total_count == 0:
    return 0.0
  return emma_model_bigram[w1][w2] / total_count

def trigram_model(trigram):
  [w1,w2]=[trigram[0],trigram[1]]
  w3=trigram[2]
  emma_model_trigram = defaultdict(lambda: defaultdict(lambda: 0))
  # Loop variables get their own names so they do not overwrite the query words w1, w2, w3
  for sentence in emma_sentences:
    for a, b, c in trigrams(sentence, pad_right=True, pad_left=True):
      emma_model_trigram[(a, b)][c] += 1
  # Normalising only the requested context; an unseen context gets probability 0
  total_count = float(sum(emma_model_trigram[(w1, w2)].values()))
  if total_count == 0:
    return 0.0
  return emma_model_trigram[(w1, w2)][w3] / total_count

The interpolation model will draw upon the three models created above. 

The input must be a trigram, thus a list of length three. 

Additionally, the user must assign the lambda values, which should be non-negative and sum to 1 so that the result remains a probability.
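
Written out, the linear interpolation computed by the function below is the following weighted sum, where $\lambda_1$, $\lambda_2$ and $\lambda_3$ correspond to the arguments `lambda1` (unigram), `lambda2` (bigram) and `lambda3` (trigram):

```latex
\hat{P}(w_3 \mid w_1, w_2)
  = \lambda_1 \, P(w_3)
  + \lambda_2 \, P(w_3 \mid w_2)
  + \lambda_3 \, P(w_3 \mid w_1, w_2),
\qquad \lambda_1 + \lambda_2 + \lambda_3 = 1
```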

In [None]:
def interpolation(trigram, lambda1, lambda2,lambda3):

  assert len(trigram)==3

  unigram_input=trigram[2]
  bigram_input=[trigram[1],trigram[2]]
  trigram_input=[trigram[0],trigram[1],trigram[2]]
  
  prob=(lambda1 * unigram_model(unigram_input)) + (lambda2 * bigram_model(bigram_input))+(lambda3 * trigram_model(trigram_input))
  
  print(prob)


In [None]:
# Testing the interpolation model
interpolation_input=['never','did','she']
interpolation(interpolation_input, lambda1=1/3,lambda2=1/3,lambda3=1/3)

0.0010397567902647156


## Different Lambda values 

In [None]:
# Different lambda values for maximizing probability

lambdas=[(0.2,0.4,0.4),(0.4,0.3,0.3),(0.2,0.3,0.5), (0.1,0.2,0.7)]

In [None]:
# Looping through each tuple in the lambdas list.

for x, y, z in lambdas:
  print('Interpolation model using lambdas: ' + str(x) + ', ' + str(y) + ', ' + str(z))

  # Inserting the lambda values in the interpolation() function
  interpolation(interpolation_input, lambda1=x, lambda2=y,lambda3=z)
  
  print('-----------------------------------------------------')
  

Interpolation model using lambdas: 0.2, 0.4, 0.4
0.0012466687931336148
-----------------------------------------------------
Interpolation model using lambdas: 0.4, 0.3, 0.3
0.0009363007888302659
-----------------------------------------------------
Interpolation model using lambdas: 0.2, 0.3, 0.5
0.0009610612272478733
-----------------------------------------------------
Interpolation model using lambdas: 0.1, 0.2, 0.7
0.0006878338805709354
-----------------------------------------------------


From the above we find that, among the four settings tried, the lambdas 0.2, 0.4 and 0.4 give the highest probability for the given test trigram.
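
The comparison can also be done programmatically. The following is a minimal sketch, assuming a hypothetical helper `interpolate` that returns the value instead of printing it, and made-up component probabilities that are not taken from the corpus:

```python
import math

def interpolate(p_uni, p_bi, p_tri, lambdas):
    """Weighted sum of the three component probabilities."""
    lambda1, lambda2, lambda3 = lambdas
    # The weights must sum to 1 for the result to stay a probability
    assert math.isclose(lambda1 + lambda2 + lambda3, 1.0)
    return lambda1 * p_uni + lambda2 * p_bi + lambda3 * p_tri

# Hypothetical component probabilities for one test trigram (illustration only)
p_uni, p_bi, p_tri = 0.01, 0.02, 0.2

candidates = [(0.2, 0.4, 0.4), (0.4, 0.3, 0.3), (0.2, 0.3, 0.5), (0.1, 0.2, 0.7)]

# Score every candidate and keep the one with the highest interpolated value
scores = {lams: interpolate(p_uni, p_bi, p_tri, lams) for lams in candidates}
best = max(scores, key=scores.get)

for lams, score in scores.items():
    print(lams, round(score, 4))
print('Best setting for these toy values:', best)
```

Because the interpolated value is linear in the weights, which setting wins depends on which component probability is largest for the trigram in question; with these toy values that is the trigram component.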

## Random sentences

In the following we will generate some random text using the trigram probabilities stored in `emma_model_trigram`.

In [None]:
def generate_random_sent():

  text = [None,None]
  
  # Variable used to break the while loop when turning True
  end_sentence = False
  
  while not end_sentence:
      rand_num = random.random()
      acc = .0
      
      # Sampling the next word: probabilities are accumulated word by word, and the first word whose running total reaches the random number is appended
      for word in emma_model_trigram[tuple(text[-2:])].keys():
          acc += emma_model_trigram[tuple(text[-2:])][word]
          if acc >= rand_num:
              text.append(word)
              break
      # Two None values in a row are the right-hand padding, i.e. the sentence has ended
      if text[-2:] == [None, None]:
          end_sentence = True

  print('The random sentence generated from our trigram model is: \n >> ', ' '.join([t for t in text if t]))

In [None]:
generate_random_sent()
print('\n')
generate_random_sent()
print('\n')
generate_random_sent()
print('\n')
generate_random_sent()

The random sentence generated from our trigram model is: 
 >>  He had his influence , such very good opinion of herself , in reply , Mr . Weston , if not towards William Larkins is such a worshipping wife , " must have been so thoroughly from the saddle of mutton for dinner , as my brother , whose prospects were closing , while their two fathers were engaged .-- It was known that you were here ?


The random sentence generated from our trigram model is: 
 >>  " You should not have believed it !


The random sentence generated from our trigram model is: 
 >>  He had met her before , passed suspiciously through Emma ' s difference -- which one should expect to marry well , had been alone !-- Such a change !


The random sentence generated from our trigram model is: 
 >>  " Yes , I do not want ; consequence I do not commit yourself ."


From the above sentences we find that the model in general creates random sentences that make sense grammatically, for example by not placing several nouns, pronouns or verbs right after each other.
A weakness is the punctuation: marks such as ; and ! and " sometimes appear at seemingly random places throughout the sentences.
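
The word-picking step inside `generate_random_sent()` can be looked at in isolation. Below is a minimal sketch, assuming a hypothetical helper `sample_next` and a made-up next-word distribution; the random number is passed in as an argument so that the behaviour is easy to follow:

```python
import random

def sample_next(dist, rand_num):
    """Return the first word whose cumulative probability reaches rand_num."""
    acc = 0.0
    word = None
    for word, prob in dist.items():
        acc += prob
        if acc >= rand_num:
            return word
    # Fallback in case rounding leaves the total slightly below rand_num
    return word

# Hypothetical next-word distribution for one two-word context
toy_dist = {"not": 0.5, "it": 0.3, "so": 0.2}

print(sample_next(toy_dist, 0.10))  # 0.10 falls in the first 0.5 of the range, so "not"
print(sample_next(toy_dist, 0.65))  # 0.65 falls between 0.5 and 0.8, so "it"
print(sample_next(toy_dist, random.random()))  # an actual random draw
```

Each word owns a slice of the interval from 0 to 1 that is as wide as its probability, so more likely words are picked more often.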

# Question 2

### Question 2 is answered using pen and paper (see word document). 