# Data Loading/Preprocessing

Upload `A1_DATASET.zip` from eLearning to the notebook's files before running the cells below.

## Preprocessing Explanation

- Keep alphabetic characters, converted to lowercase
- Keep `'` (apostrophes) for contractions: the data contains many tokens such as `n't` and `'ll`, which can matter for context (i.e. `did` followed by `n't` vs. `did` alone)
- Replace `/` (forward slashes) with ` ` (spaces), since forward slashes appear in two cases:
    - Between numbers in fractions, where the context can be implied using bigrams (e.g. 1 followed by 3)
    - Between words as a stand-in for "or" (e.g. clean/friendly), which isn't that important for context
- Remove all other characters
- Trim the remaining strings and filter out those that are empty
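
The rules above can be sketched with a single regular expression. This is a minimal illustration rather than the notebook's own implementation: `clean_line` and `clean_lines` are hypothetical names, and the pattern assumes that "alphabetic" means ASCII `a-z` after lowercasing.

```python
import re

def clean_line(line: str) -> list[str]:
    """Apply the listed rules to one line and return its tokens."""
    line = line.lower().replace("/", " ")
    # Keep letters, apostrophes, and whitespace; drop everything else
    line = re.sub(r"[^a-z'\s]", "", line)
    return line.split()

def clean_lines(lines: list[str]) -> list[list[str]]:
    """Clean every line and drop the ones left with no tokens."""
    token_lists = (clean_line(line) for line in lines)
    return [tokens for tokens in token_lists if tokens]

sample = ["The staff was clean/friendly , but I did n't sleep !", " 1/3 ... "]
result = clean_lines(sample)
print(result)
```

The second sample line contains no letters, so it is filtered out entirely, while `n't` survives as its own token.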


In [None]:
!unzip A1_DATASET.zip

Archive:  A1_DATASET.zip
   creating: A1_DATASET/
  inflating: A1_DATASET/.DS_Store    
  inflating: __MACOSX/A1_DATASET/._.DS_Store  
  inflating: A1_DATASET/train.txt    
  inflating: __MACOSX/A1_DATASET/._train.txt  
  inflating: A1_DATASET/val.txt      
  inflating: __MACOSX/A1_DATASET/._val.txt  


In [None]:
def read_file(path) -> list[str]:
    """Read in a file and return a list of lines."""
    with open(path, 'r') as f:
        lines = [line.rstrip() for line in f]
    return lines

input_train = read_file('A1_DATASET/train.txt')
input_val = read_file('A1_DATASET/val.txt')
input_train[:3]

['I booked two rooms four months in advance at the Talbott . We were placed on the top floor next to the elevators , which are used all night long . When speaking to the front desk , I was told that they were simply honoring my request for an upper floor , which I had requested for a better view . I am looking at a brick wall , and getting no sleep . He also told me that they had received complaints before from guests on the 16th floor , and were aware of the noise problem . Why then did they place us on this floor when the hotel is not totally booked ? A request for an upper floor does not constitute placing someone on the TOP floor and using that request to justify this . If you decide to stay here , request a room on a lower floor and away from the elevator ! I spoke at length when booking my two rooms about my preferences . This is simply poor treatment of a guest whom they believed would not complain .',
 "I LOVED this hotel . The room was so chic and trendy , the bed was comforta

In [None]:
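def clean_str(text: str) -> str:
    """Lowercase, turn '/' into a space, and keep only letters and spaces."""
    text = text.lower().replace('/', ' ')
    # NOTE: apostrophes are dropped here; use `char in " '"` to keep contractions as described above
    return ''.join(char for char in text if char.isalpha() or char == ' ').strip()


def preprocess(lines: list[str]) -> list[list[str]]:
    """Clean each line, split it into tokens, and drop lines with no tokens."""
    tokenized = (clean_str(line).split() for line in lines)
    # An empty token list is falsy, so lines that cleaned to nothing are filtered out
    return [tokens for tokens in tokenized if tokens]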

In [None]:
train = preprocess(input_train)
val = preprocess(input_val)
train[0][:20], val[0][:20]

(['i',
  'booked',
  'two',
  'rooms',
  'four',
  'months',
  'in',
  'advance',
  'at',
  'the',
  'talbott',
  'we',
  'were',
  'placed',
  'on',
  'the',
  'top',
  'floor',
  'next',
  'to'],
 ['i',
  'stayed',
  'for',
  'four',
  'nights',
  'while',
  'attending',
  'a',
  'conference',
  'the',
  'hotel',
  'is',
  'in',
  'a',
  'great',
  'spot',
  'easy',
  'walk',
  'to',
  'michigan'])

# Unsmoothed N-grams

We calculate the unigram and bigram counts of the training set. Unigram counts are stored in a dictionary of the form `{word: count}`, and bigram counts in a dictionary of the form `{(word1, word2): count}`.

We then turn these counts into unsmoothed unigram and bigram probability models.
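
For reference, the unsmoothed estimates can be written as follows. The notation is introduced here for illustration: $C(\cdot)$ is a count in the training set and $N$ is the total number of training tokens.

$$
P(w) = \frac{C(w)}{N},
\qquad
P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i)}{C(w_{i-1})}
$$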

In [None]:
unigram_counts = {}
bigram_counts = {}

# Compute unigram counts
for line in train:
    for word in line:
        if word in unigram_counts:
            unigram_counts[word] += 1
        else:
            unigram_counts[word] = 1

# Compute bigram counts
for line in train:
    for i in range(len(line) - 1):
        bigram = (line[i], line[i + 1])
        if bigram in bigram_counts:
            bigram_counts[bigram] += 1
        else:
            bigram_counts[bigram] = 1

# Compute the unigram and bigram probability models
# For unigrams, it's simply the count of the word / total number of words
# For bigrams, it's Count(A, B) / Count(A)
total_words = sum(unigram_counts.values())  # computed once instead of once per word
unigram_probs = {word: count / total_words for word, count in unigram_counts.items()}
bigram_probs = {bigram: count / unigram_counts[bigram[0]] for bigram, count in bigram_counts.items()}

In [None]:
print(unigram_probs['the'])
print(bigram_probs[('the', 'hotel')])

# Print the 10 most probable entries in both models
print(sorted(unigram_probs.items(), key=lambda x: x[1], reverse=True)[:10])
print(sorted(bigram_probs.items(), key=lambda x: x[1], reverse=True)[:10])

0.06739116619002225
0.07808374198415692
[('the', 0.06739116619002225), ('and', 0.03298379408960915), ('a', 0.028573244359707657), ('to', 0.026564982523037815), ('was', 0.023209405783285668), ('i', 0.021773117254528122), ('in', 0.01604067365745154), ('we', 0.014197648554178583), ('of', 0.013295201779472514), ('hotel', 0.013180807117890055)]
[(('honoring', 'my'), 1.0), (('constitute', 'placing'), 1.0), (('placing', 'someone'), 1.0), (('justify', 'this'), 1.0), (('decide', 'to'), 1.0), (('preferences', 'this'), 1.0), (('believed', 'would'), 1.0), (('keihl', 's'), 1.0), (('junior', 'suite'), 1.0), (('lawry', 's'), 1.0)]
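
The top bigram entries above all have probability 1.0, which happens whenever the first word is always followed by the same word in the training set; this is most likely for words that occur only a handful of times. One way to get a more informative ranking is to require a minimum count for the first word. The sketch below uses toy counts, and `top_bigrams` and `min_count` are hypothetical names, not part of the notebook.

```python
def top_bigrams(bigram_counts, unigram_counts, min_count=5, n=10):
    """Rank bigrams by P(second | first), skipping rare first words."""
    probs = {
        bigram: count / unigram_counts[bigram[0]]
        for bigram, count in bigram_counts.items()
        if unigram_counts[bigram[0]] >= min_count
    }
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:n]

# Toy counts for illustration only
toy_unigrams = {"the": 10, "hotel": 4, "junior": 1, "suite": 1, "was": 5}
toy_bigrams = {("the", "hotel"): 4, ("junior", "suite"): 1, ("the", "was"): 1}

ranked = top_bigrams(toy_bigrams, toy_unigrams, min_count=5, n=2)
print(ranked)
```

Here `('junior', 'suite')` would have probability 1.0, but it is excluded because `junior` appears only once.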


# Smoothing

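To handle unknown words, we add an `<UNKNOWN>` token to the vocabulary with a count of zero, so that smoothing assigns it a non-zero probability.

We apply Laplace (add-1) smoothing to the bigram model, and additionally +2 smoothing. Each version computes a probability for every pair of vocabulary words, so the Laplace cell takes ~52 secs to run.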
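
Both variants are instances of add-$k$ smoothing, with $k = 1$ for Laplace and $k = 2$ for +2 smoothing. In the notation used here (an assumption for illustration), $C(\cdot)$ is a training count and $|V|$ is the vocabulary size including `<UNKNOWN>`:

$$
P_k(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i) + k}{C(w_{i-1}) + k\,|V|}
$$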

In [None]:
# Add <UNKNOWN> keyword
unigram_counts['<UNKNOWN>'] = 0
bigram_counts[('<UNKNOWN>', '<UNKNOWN>')] = 0

In [None]:
# Build a dictionary of bigram probabilities with Laplace Smoothing
bigram_probs_smoothed = {}

unigram_list = list(unigram_counts.keys())

for v in unigram_list:
    for k in unigram_list:
        bigram_probs_smoothed[(v, k)] = (bigram_counts.get((v, k), 0) + 1) / (unigram_counts.get(v, 0) + len(unigram_counts))

In [None]:
# Build a dictionary of bigram probabilities with +2 Smoothing
bigram_probs_smoothed_2 = {}

unigram_list = list(unigram_counts.keys())

for v in unigram_list:
    for k in unigram_list:
        bigram_probs_smoothed_2[(v, k)] = (bigram_counts.get((v, k), 0) + 2) / (unigram_counts.get(v, 0) + 2 * len(unigram_counts))

In [None]:
print(bigram_probs_smoothed[('the', 'hotel')])
print(bigram_probs_smoothed_2[('the', 'hotel')])

0.03714311286136221
0.024407416099507157
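
The cells above store a probability for every pair of vocabulary words, which is probably where most of the running time goes. An alternative is to compute a smoothed probability only when it is needed. The sketch below assumes the same count dictionaries as above but runs on toy data; `add_k_prob` is a hypothetical helper, not part of the notebook.

```python
def add_k_prob(prev_word, word, bigram_counts, unigram_counts, k=1):
    """Add-k smoothed P(word | prev_word), computed on demand."""
    vocab_size = len(unigram_counts)
    numerator = bigram_counts.get((prev_word, word), 0) + k
    denominator = unigram_counts.get(prev_word, 0) + k * vocab_size
    return numerator / denominator

# Toy counts for illustration only
toy_unigrams = {"the": 2, "hotel": 2, "was": 1, "<UNKNOWN>": 0}
toy_bigrams = {("the", "hotel"): 2, ("hotel", "was"): 1}

p_seen = add_k_prob("the", "hotel", toy_bigrams, toy_unigrams, k=1)
p_unseen = add_k_prob("the", "was", toy_bigrams, toy_unigrams, k=1)
p_seen_k2 = add_k_prob("the", "hotel", toy_bigrams, toy_unigrams, k=2)
print(p_seen, p_unseen, p_seen_k2)
```

Raising `k` moves more probability mass from seen bigrams to unseen ones, which is consistent with `('the', 'hotel')` scoring lower under +2 smoothing than under Laplace smoothing in the output above.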


# Perplexity Calculation

In [None]:
import math

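def calculate_perplexity(probabilities, validation_set, is_bigram=False):
    """Return exp(average negative log-probability per token) over validation_set."""
    total_neg_log_prob = 0
    token_count = 0

    for line in validation_set:
        for i, word in enumerate(line):
            if is_bigram and i > 0:
                # For bigram model: unseen pairs fall back to the
                # ('<UNKNOWN>', '<UNKNOWN>') entry, or to 1e-10 if there is none
                prev_word = line[i-1]
                prob = probabilities.get((prev_word, word), probabilities.get(('<UNKNOWN>', '<UNKNOWN>'), 1e-10))
            else:
                # For unigram model. The first word of each line in a bigram
                # model also lands here; a bigram table has no single-word
                # keys, so that token gets the 1e-10 fallback
                prob = probabilities.get(word, probabilities.get('<UNKNOWN>', 1e-10))

            total_neg_log_prob += -math.log(prob)
            token_count += 1

    average_neg_log_prob = total_neg_log_prob / token_count
    perplexity = math.exp(average_neg_log_prob)
    return perplexity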

# Calculate perplexity for unigram model
unigram_perplexity = calculate_perplexity(unigram_probs, val)

# Calculate perplexity for bigram model
bigram_perplexity = calculate_perplexity(bigram_probs, val, is_bigram=True)

# Calculate perplexity for Laplace-smoothed bigram model
bigram_smoothed_perplexity = calculate_perplexity(bigram_probs_smoothed, val, is_bigram=True)

# Calculate perplexity for Add-2-smoothed bigram model
bigram_smoothed2_perplexity = calculate_perplexity(bigram_probs_smoothed_2, val, is_bigram=True)


print(f"Unigram Model Perplexity: {unigram_perplexity}")
print(f"Bigram Model Perplexity: {bigram_perplexity}")
print(f"Bigram (Laplace Smoothing) Model Perplexity: {bigram_smoothed_perplexity}")
print(f"Bigram (+2 Smoothing) Model Perplexity: {bigram_smoothed2_perplexity}")

Unigram Model Perplexity: 729.0068112574407
Bigram Model Perplexity: 30355.48594886052
Bigram (Laplace Smoothing) Model Perplexity: 1384.1037358070637
Bigram (+2 Smoothing) Model Perplexity: 1876.745092369044
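
For comparison, here is a sketch of a different way to score the smoothed bigram model: out-of-vocabulary words are mapped to `<UNKNOWN>` before lookup, and only tokens that have a preceding word are scored. It is not the method used above and would not reproduce those numbers. `bigram_perplexity_add_k` is a hypothetical name, the data is a toy example, and the function assumes `<UNKNOWN>` is already a key of the unigram counts.

```python
import math

def bigram_perplexity_add_k(lines, bigram_counts, unigram_counts, k=1, unk="<UNKNOWN>"):
    """Perplexity of an add-k bigram model over every token after the first in each line."""
    vocab_size = len(unigram_counts)
    neg_log_prob = 0.0
    scored = 0
    for line in lines:
        # Map out-of-vocabulary words to the unknown token before any lookup
        tokens = [w if w in unigram_counts else unk for w in line]
        for prev_word, word in zip(tokens, tokens[1:]):
            numerator = bigram_counts.get((prev_word, word), 0) + k
            denominator = unigram_counts[prev_word] + k * vocab_size
            neg_log_prob -= math.log(numerator / denominator)
            scored += 1
    # With no scorable tokens the perplexity is undefined; report infinity
    return math.exp(neg_log_prob / scored) if scored else float("inf")

# Toy counts for illustration only
toy_unigrams = {"the": 2, "hotel": 2, "was": 1, "<UNKNOWN>": 0}
toy_bigrams = {("the", "hotel"): 2, ("hotel", "was"): 1}

ppl = bigram_perplexity_add_k([["the", "hotel", "was", "great"]], toy_bigrams, toy_unigrams, k=1)
print(ppl)
```

In the toy line, `great` is not in the vocabulary, so the last bigram is scored as `('was', '<UNKNOWN>')` rather than falling back to a fixed constant.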
