
Your task is to train *character-level* language models.
You will train unigram, bigram, and trigram character-level models on a collection of books from Project Gutenberg. You will then use these trained English language models to distinguish English documents from Brazilian Portuguese documents in the test set.

In [0]:
import pandas as pd
import httpimport
import math
from sklearn.model_selection import ParameterGrid
from decimal import Decimal
import sys

with httpimport.remote_repo(['lm_helper'], 'https://raw.githubusercontent.com/jasoriya/CS6120-PS2-support/master/utils/'):
  from lm_helper import get_train_data, get_test_data

This code loads the training and test data. Each dataset is a list of books. Each book contains a list of sentences, and each sentence contains a list of words. For building a character language model, you should join the words of a sentence together with a space character.
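
As a minimal sketch of the joining step described above, assuming a book really is a list of sentences and each sentence a list of word strings (the toy `book` and the helper name `sentences_to_strings` are made up for illustration):

```python
# Toy stand-in for one book: a list of sentences, each a list of words.
book = [["The", "cat", "sat", "."], ["It", "purred", "."]]

def sentences_to_strings(book):
    """Join the words of each sentence with single spaces."""
    return [" ".join(sentence) for sentence in book]

sentence_strings = sentences_to_strings(book)
print(sentence_strings)
```

Whether to add explicit sentence boundary markers around each joined sentence is a separate modeling choice.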

In [0]:
# get the train and test data
train = get_train_data()
test, test_files = get_test_data()

## 1.1
Collect statistics on the unigram, bigram, and trigram character counts.

If your machine takes a long time to perform this computation, you may save these counts to files in your github repository and load them on request. This is not necessary, however.
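
One possible way to collect and cache the counts is sketched below. It uses `collections.Counter` and JSON files; the helper names, the toy string, and the file name `trigram_counts.json` are assumptions for illustration only.

```python
import json
import os
import tempfile
from collections import Counter

def count_ngrams(text, n):
    """Count every length-n character window in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def save_counts(counts, path):
    """Write a Counter to disk as a JSON object."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(counts, f, ensure_ascii=False)

def load_counts(path):
    """Read counts back from disk into a Counter."""
    with open(path, encoding="utf-8") as f:
        return Counter(json.load(f))

# Toy text; '#' plays the role of a sentence boundary marker.
text = "#the cat #"
counts = {n: count_ngrams(text, n) for n in (1, 2, 3)}

# Round-trip the trigram counts through a temporary file.
path = os.path.join(tempfile.gettempdir(), "trigram_counts.json")
save_counts(counts[3], path)
restored = load_counts(path)
```

JSON keeps the cached files human-readable, which is convenient if they are committed to a repository.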

In [0]:
# Your code here
# Hold out four of the 18 training books for tuning the
# linear interpolation lambdas in 1.2:
#   training: 14 books (~80%)
#   held-out:  4 books (~20%)
# Each pop() shifts the remaining indices, so the books removed
# are not the ones originally at positions 3, 6, 9, and 12.
# Comment out the line below to train on all of the books instead.
held_out = [train.pop(3), train.pop(6), train.pop(9), train.pop(12)]


"""Find counts for all unigrams, bigrams, trigrams"""
# Create data structure to save counts
data_dict = {}
data_dict['unigrams'] = {}
data_dict['bigrams'] = {}
data_dict['trigrams'] = {}

# Turn training data into one string with all words and punctuation
sentence_string = ""
for example in train:
  for sentence in example:
    sentence_string += "#" # Add sentence begin character
    for word in sentence:
      sentence_string += word + " "
    sentence_string += "#" # Add sentence end character

# Unigram counts
for i in range(len(sentence_string)):
  if sentence_string[i: i+1] not in data_dict['unigrams']:
    data_dict['unigrams'][sentence_string[i: i+1]] = 1
  else:
    data_dict['unigrams'][sentence_string[i: i+1]] += 1
# Bigram counts
for i in range(len(sentence_string) - 1):
  if sentence_string[i: i+2] not in data_dict['bigrams']:
    data_dict['bigrams'][sentence_string[i: i+2]] = 1
  else:
    data_dict['bigrams'][sentence_string[i: i+2]] += 1
# Trigram counts
for i in range(len(sentence_string) -2):
  if sentence_string[i: i+3] not in data_dict['trigrams']:
    data_dict['trigrams'][sentence_string[i: i+3]] = 1
  else:
    data_dict['trigrams'][sentence_string[i: i+3]] += 1      


"""Find probabilities for all unigrams, bigrams, trigrams"""
unigram_probabilities = {}
bigram_probabilities = {}
trigram_probabilities = {}

# Unigrams probabilities
unigram_total_count = 0
for key, value in data_dict['unigrams'].items():
  unigram_total_count += value

for key, value in data_dict['unigrams'].items():
  unigram_probabilities[key] = (value/unigram_total_count)

# Bigrams probabilities
for key, value in data_dict['bigrams'].items():
  bigram_probabilities[key] = (value/data_dict['unigrams'][key[0]])
  
# Trigrams probabilities
for key, value in data_dict['trigrams'].items():
  trigram_probabilities[key] = (value/data_dict['bigrams'][key[0:2]])       


## 1.2
Calculate the perplexity for each document in the test set using linear interpolation smoothing. To determine the λs for linear interpolation, you can divide the training data into a new training set (80%) and a held-out set (20%), then run a grid search on the held-out data over about 10 candidate values of λ.
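
The grid search can be sketched as follows. This is a minimal illustration, not a reference solution: it assumes the unigram, bigram, and trigram probabilities for each held-out token have already been looked up, and the names `lambda_grid`, `perplexity`, and `best_lambdas` are hypothetical. Enumerating the weights as integer tenths means each triple sums to 1 by construction, so no floating-point tolerance check is needed.

```python
import math

def lambda_grid(steps=10):
    """Yield (l1, l2, l3) in multiples of 1/steps whose numerators sum to steps."""
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            k = steps - i - j
            yield (i / steps, j / steps, k / steps)

def perplexity(probabilities):
    """Perplexity from the probability assigned to each token."""
    log_sum = sum(math.log2(p) for p in probabilities)
    return 2 ** (-log_sum / len(probabilities))

def best_lambdas(component_probs, steps=10):
    """component_probs: one (p_unigram, p_bigram, p_trigram) tuple per held-out token.

    Returns (lowest perplexity, (l1, l2, l3)).
    """
    best = None
    for l1, l2, l3 in lambda_grid(steps):
        mixed = [l1 * p1 + l2 * p2 + l3 * p3 for p1, p2, p3 in component_probs]
        if min(mixed) <= 0:
            continue  # a zero probability would give infinite perplexity
        score = perplexity(mixed)
        if best is None or score < best[0]:
            best = (score, (l1, l2, l3))
    return best

# Made-up component probabilities for three held-out tokens.
toy = [(0.10, 0.20, 0.50), (0.10, 0.30, 0.00), (0.05, 0.40, 0.60)]
score, lambdas = best_lambdas(toy)
print(score, lambdas)
```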

Some documents in the test set are in Brazilian Portuguese. Identify them as follows: 
  - Sort by perplexity and set a cut-off threshold. All the documents above this threshold score should be categorized as Brazilian Portuguese. 
  - Print the file names (from `test_files`) and perplexities of the documents above the threshold

    ```
        file name, score
        file name, score
        . . .
        file name, score
    ```

  - Copy this list of filenames and manually annotate them as being correctly or incorrectly labeled as Portuguese.
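
One possible heuristic for choosing the cut-off, shown here only as a sketch, is to sort the scores and split at the largest gap between neighbouring perplexities. The function name `split_at_largest_gap`, the file names, and the scores are all made up; a threshold chosen by inspecting the sorted list is just as valid.

```python
def split_at_largest_gap(perplexities):
    """perplexities: dict of file name -> perplexity (needs at least two entries).

    Returns (threshold, flagged), where flagged lists the (name, score)
    pairs above the threshold in ascending order of score.
    """
    ranked = sorted(perplexities.items(), key=lambda item: item[1])
    # Gap between each document and the next one in the ranking.
    gaps = [(ranked[i + 1][1] - ranked[i][1], i) for i in range(len(ranked) - 1)]
    _, i = max(gaps)
    # Put the threshold midway across the widest gap.
    threshold = (ranked[i][1] + ranked[i + 1][1]) / 2
    flagged = [(name, score) for name, score in ranked if score > threshold]
    return threshold, flagged

toy = {"a.txt": 6.1, "b.txt": 7.4, "c.txt": 13.2, "d.txt": 14.8}  # made-up scores
threshold, flagged = split_at_largest_gap(toy)
for name, score in flagged:
    print(f"{name}, {score:.2f}")
```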




In [18]:
#Your code here

#Uncomment commented block below to run lambda training
# """Find the best lambda values on the held out set"""
# # Taken from https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ParameterGrid.html
# # Set up grid search
# param_grid = {'L1': [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1], 'L2': [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1], 'L3': [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]}
# parameters = list(ParameterGrid(param_grid))

# # Remove all parameter sets that do not sum to 1
# valid_parameters = []
# for iteration in parameters:
#   iteration_sum = 0
#   for key, value in iteration.items():
#     iteration_sum += value
#   valid = abs(iteration_sum - float(1)) < 0.0000000000001
#   if valid == True:
#     valid_parameters.append(iteration)

# """Find the best parameter set"""

# # Variables to hold best perplexity and parameter set
# min_perplexity = sys.maxsize
# best_parameters = valid_parameters[0]

# # Loop through all valid parameter sets
# for parameter_set in valid_parameters:
#   #Set lambda values for this iteration
#   l1 = parameter_set["L1"]
#   l2 = parameter_set["L2"]
#   l3 = parameter_set["L3"]

#   # Set entropy, and token count variables for this iteration
#   token_count = 0
#   held_out_entropy = 0

#   # Create string containing all words in the held out set
#   sentence_string = ""
#   for example in held_out:
#     for sentence in example:
#       sentence_string += "#" # Add sentence begin character
#       for word in sentence:
#         sentence_string += word + " "
#       sentence_string += "#" # Add sentence end character  

#   # Loop through all of the trigrams in the validation string
#   for j in range(len(sentence_string) -2):
#     trigram = sentence_string[j: j+3]
#     token_count += 1

#     # A better solution for UNK tokens may exist. This simple one is taken from https://www.usna.edu/Users/cs/nchamber/courses/nlp/f13/slides/set3-LMs.pdf

#     # Calculate the trigram probability
#     trigram_probability = 0
#     # If the trigram probability is known add it, else add the replacement probability
#     if trigram in trigram_probabilities:
#       trigram_probability += (l3 * trigram_probabilities[trigram])
#     else:
#       trigram_probability += (l3 * (1/len(trigram_probabilities)))

#     # If the bigram probability is known add it, else add the replacement probability
#     if trigram[0:2] in bigram_probabilities:
#       trigram_probability += (l2 * bigram_probabilities[trigram[0:2]])
#     else:
#       trigram_probability += (l2 * (1/len(bigram_probabilities)))

#     # If the unigram probability is known add it, else add the replacement probability
#     if trigram[0] in unigram_probabilities:
#       trigram_probability += (l1 * unigram_probabilities[trigram[0]])
#     else:
#       trigram_probability += (l1 * (1/len(unigram_probabilities)))  

#     # Add the log probability to the held out total entropy
#     held_out_entropy += math.log2(trigram_probability)
    
#   # Calculate the held out set perplexity
#   held_out_set_perplexity = 2**((held_out_entropy * -1)/token_count)
#   # print(held_out_set_perplexity)

#   # Update the best perplexity and parameter set if it is better
#   if held_out_set_perplexity < min_perplexity:
#     min_perplexity = held_out_set_perplexity
#     best_parameters = parameter_set

# print(min_perplexity)
# print(best_parameters)      
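# Best on the held-out set: {'L1': 0, 'L2': 0.3, 'L3': 0.7}
# Used below so that unigram probabilities are included: {'L1': 0.1, 'L2': 0.3, 'L3': 0.6}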

#Comment below block if running lambda training
"""Find the perplexities of the test documents"""
# Set the lambdas
l1 = 0.1
l2 = 0.3
l3 = 0.6
# Data structure to hold perplexities of each test document
test_perplexities = {}

# Loop through all of the documents in the test set, paired with their file names
for file_name, example in zip(test_files, test):
  # Reset the entropy and token count for this document
  entropy = 0
  token_count = 0
  # Create string containing all words in the document
  sentence_string = ""
  for sentence in example:
    sentence_string += "#" # Add sentence begin character
    for word in sentence:
      sentence_string += word + " "
    sentence_string += "#" # Add sentence end character

  # Loop through all of the trigrams in the document string
  for j in range(len(sentence_string) -2):
    trigram = sentence_string[j: j+3]
    token_count += 1

    # Calculate the interpolated trigram probability
    trigram_probability = 0

    # If the trigram probability is known add it, else add the replacement probability
    if trigram in trigram_probabilities:
      trigram_probability += (l3 * trigram_probabilities[trigram])
    else:
      trigram_probability += (l3 * (1/len(trigram_probabilities)))

    # Bigram term (first two characters of the trigram): add it if known, else add the replacement probability
    if trigram[0:2] in bigram_probabilities:
      trigram_probability += (l2 * bigram_probabilities[trigram[0:2]])
    else:
      trigram_probability += (l2 * (1/len(bigram_probabilities)))

    # Unigram term (first character of the trigram): add it if known, else add the replacement probability
    if trigram[0] in unigram_probabilities:
      trigram_probability += (l1 * unigram_probabilities[trigram[0]])
    else:
      trigram_probability += (l1 * (1/len(unigram_probabilities)))

    # Add the log2 probability to the document total
    entropy += math.log2(trigram_probability)

  # Calculate the test file perplexity
  test_file_perplexity = 2**(-entropy/token_count)
  # print(test_file_perplexity)

  # Save the test file perplexity under its file name
  test_perplexities[file_name] = test_file_perplexity

# Threshold set at 10 based on the observed perplexities
# Loop through the test file perplexities and print the ones above the threshold
for key, value in test_perplexities.items():
  if value > 10:
    print("File Name: %s, Perplexity: %f" % (key, value))


5.149289978007447
{'L1': 0, 'L2': 0.3, 'L3': 0.7}


## 1.3
Build a trigram language model with add-λ smoothing (use λ = 0.1).
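
For reference, textbook add-λ smoothing adds λ to every trigram count and λ·V to the context count, where V is the size of the character vocabulary. The sketch below is a minimal illustration under that definition and is not taken from this notebook; the helper names (`context_counts_from`, `add_lambda_prob`) and the toy string are made up for the example.

```python
from collections import Counter

def context_counts_from(trigram_counts):
    """Count how often each two-character context starts a trigram."""
    contexts = Counter()
    for trigram, count in trigram_counts.items():
        contexts[trigram[:2]] += count
    return contexts

def add_lambda_prob(trigram, trigram_counts, context_counts, vocab_size, lam=0.1):
    """P(c3 | c1 c2) = (count(c1 c2 c3) + lam) / (count(c1 c2) + lam * V)."""
    numerator = trigram_counts.get(trigram, 0) + lam
    denominator = context_counts.get(trigram[:2], 0) + lam * vocab_size
    return numerator / denominator

# Toy training text; '#' stands in for a sentence boundary marker.
text = "#abab#"
trigram_counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
context_counts = context_counts_from(trigram_counts)
vocab = sorted(set(text))

# The smoothed distribution over the next character sums to 1 for any context.
total = sum(
    add_lambda_prob("ab" + c, trigram_counts, context_counts, len(vocab))
    for c in vocab
)
print(total)
```

With this form an unseen trigram in an unseen context receives probability 1/V, so no separate fallback for unknown trigrams is needed.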

Sort the test documents by perplexity and perform a check for Brazilian Portuguese documents as above:

  - Observe the perplexity scores and set a cut-off threshold. All the documents above this threshold score should be categorized as Brazilian Portuguese. 
  - Print the file names and perplexities of the documents above the threshold

  ```
      file name, score
      file name, score
      . . .
      file name, score
  ```

  - Copy this list of filenames and manually annotate them for correctness.

In [19]:
# Your code here

"""Find the perplexities of the test documents"""

# Set the lambda (applied below as a multiplicative weight on each trigram probability)
l3 = 0.1

# Data structure to hold perplexities of each test document
test_perplexities = {}

# Loop through all of the documents in the test set, paired with their file names
for file_name, example in zip(test_files, test):
  # Reset the entropy and token count for this document
  entropy = 0
  token_count = 0
  # Create string containing all words in the document
  sentence_string = ""
  for sentence in example:
    sentence_string += "#" # Add sentence begin character
    for word in sentence:
      sentence_string += word + " "
    sentence_string += "#" # Add sentence end character

  # Loop through all of the trigrams in the document string
  for j in range(len(sentence_string) -2):
    trigram = sentence_string[j: j+3]
    token_count += 1

    # Calculate the trigram probability
    trigram_probability = 0

    # If the trigram probability is known add it, else add the replacement probability
    if trigram in trigram_probabilities:
      trigram_probability += (l3 * trigram_probabilities[trigram])
    else:
      trigram_probability += (l3 * (1/len(trigram_probabilities)))

    # Add the log2 probability to the document total
    entropy += math.log2(trigram_probability)

  # Calculate the test file perplexity
  test_file_perplexity = 2**(-entropy/token_count)
  # print(test_file_perplexity)

  # Save the test file perplexity under its file name
  test_perplexities[file_name] = test_file_perplexity


# Threshold set at 200 based on the observed perplexities
# Loop through the test file perplexities and print the ones above the threshold
for key, value in test_perplexities.items():
  if value > 200:
    print("File Name: %s, Perplexity: %f" % (key, value))
    

File Name: br94fe1.txt, Perplexity: 393.809458
File Name: ag94ju07.txt, Perplexity: 404.608064
File Name: br94de01.txt, Perplexity: 374.623320
File Name: ag94ag02.txt, Perplexity: 397.831267
File Name: ag94se06.txt, Perplexity: 421.017679
File Name: ag94mr1.txt, Perplexity: 392.869303
File Name: ag94ab12.txt, Perplexity: 397.907121
File Name: ag94ou04.txt, Perplexity: 385.807471
File Name: ag94fe1.txt, Perplexity: 362.680538
File Name: ag94ma03.txt, Perplexity: 411.135439
File Name: br94ab02.txt, Perplexity: 368.537224
File Name: ag94de06.txt, Perplexity: 391.560078
File Name: br94ju01.txt, Perplexity: 375.132363
File Name: br94ma01.txt, Perplexity: 395.006771
File Name: br94jl01.txt, Perplexity: 377.504859
File Name: br94ag01.txt, Perplexity: 400.657841
File Name: br94ja04.txt, Perplexity: 385.280533
File Name: ag94jl12.txt, Perplexity: 392.513347
File Name: ag94no01.txt, Perplexity: 391.713588
File Name: ag94ja11.txt, Perplexity: 371.802422


## 1.4
Based on your observation from above questions, compare linear interpolation and add-λ smoothing by listing out their pros and cons.

Run instructions:

Currently, running all sections does the following:

1.1:

Collects the counts of all unigrams, bigrams, and trigrams.
Calculates empirical probabilities for all unigrams, bigrams, and trigrams.

1.2:

Calculates the perplexity of each test document using linear interpolation smoothing and reports the files that are above the threshold.

1.3:

Calculates the perplexity of each test document using the trigram model with lambda smoothing and reports the files that are above the threshold.

To see how the lambdas are determined, first make sure the held-out set creation at the top of section 1.1 is uncommented. Second, uncomment the first large commented code block at the top of 1.2 and comment out the large code block in the second half of 1.2. To determine which test files are Portuguese, reverse these steps. Note that the best parameters on the validation set were {'L1': 0, 'L2': 0.3, 'L3': 0.7}, where L1 corresponds to unigrams, L2 to bigrams, and L3 to trigrams. However, {'L1': 0.1, 'L2': 0.3, 'L3': 0.6} was used instead so that unigram probabilities would be included.



Observations:

Both methods showed a perplexity difference between the English and Portuguese files, but at different scales. Linear interpolation smoothing gave the English documents perplexities below 10 and the Portuguese documents perplexities between 13 and 15, a clear separation. The lambda smoothing trigram model reported much higher perplexities, around 180 for the English documents and around 400 for the Portuguese documents, which is also a clear separation. Neither method performed as well when an UNK token probability was used for unrecognized tokens: the perplexities were much smaller overall and barely separated enough to tell the documents apart. Therefore, a uniform distribution was used to provide a probability for unrecognized tokens during the calculations.
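
Part of the scale gap between the two methods can be read off the perplexity formula that both cells compute, where $p_i$ is the probability assigned to the $i$-th trigram and $N$ is the number of trigrams in the document:

$$\mathrm{PP} = 2^{-\frac{1}{N}\sum_{i=1}^{N}\log_2 p_i}$$

If every $p_i$ has the form $\lambda q_i$ for a constant $\lambda$, which is how the 1.3 cell appears to apply $\lambda = 0.1$ (this reading of the code is an assumption), the factor comes out of the sum:

$$\mathrm{PP} = 2^{-\log_2 \lambda}\cdot 2^{-\frac{1}{N}\sum_{i=1}^{N}\log_2 q_i} = \frac{1}{\lambda}\,2^{-\frac{1}{N}\sum_{i=1}^{N}\log_2 q_i}$$

With $\lambda = 0.1$ this is a factor of 10, so under that assumption perplexities near 180 and 400 would correspond to roughly 18 and 40 without the scaling.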

Based on these observations, I found that linear interpolation smoothing works much better for character-level language modeling. It can use all of the information in a trigram, because it also recognizes the shorter character sequences inside it, which helps it identify language it has seen before.


Linear Interpolation Smoothing:

Pros:

  - Can fall back on bigram and unigram evidence when a trigram is not recognized
  - Uses all of the knowledge contained in a trigram, including the trigram's subsets of characters
  - Lower perplexities, from using the model's complete knowledge

Cons:

  - More computation required
  - More information to gather
  - Requires finding the lambda weights


Add Lambda Smoothing:

Pros:

  - Simple approach that still accomplishes the task
  - Less computation

Cons:

  - Does not use all available information
  - Misses the opportunity to recognize smaller bits of information



