# Assignment 1
You should submit the **UniversityNumber.ipynb** file and your final prediction file **UniversityNumber.test.out** to Moodle. Make sure your code does not use your local files and that the results are reproducible. Before submitting, please **run your notebook and keep all running logs** so that we can check.

## 1 $n$-gram Language Model
**Q1**: Expand the above definition of $ p(\vec{w})$ using naive estimates of the parameters, such as $  p(w_4 \mid w_2, w_3) {=}  \frac{C(w_2~w_3~w_4)}{C(w_2~w_3)} $ where \( C(w_2 w_3 w_4) \) denotes the count of times the trigram $ w_2 w_3 w_4 $ was observed in a training corpus.

**Write your answer:**

$ p(\vec{w})$ =  $ p(w_1) ⋅ p(w_2 \mid w_1) ⋅ p(w_3 \mid w_1, w_2) ⋅(w_4 \mid w_2, w_3) ... p(w_n \mid w_{n-2}, w_{n-1})$ \\
$ = \dfrac{C(w_1)}{C(*)} \dfrac{C(w_1~w_2)}{C(w_1)} \dfrac{C(w_1~w_2~w_3)}{C(w_1~w_2)} \dfrac{C(w_2~w_3~w_4)}{C(w_2~w_3)} ... \dfrac{C(w_{n-2} ~w_{n-1} ~w_n)}{C(w_{n-2}~w_{n-1})}$




**Q2**: One could also define a kind of reversed trigram language model $p_{reversed}$ that instead assumed the words were generated in reverse order (from right to left):
\begin{align} p_{reversed}(\vec{w}) \stackrel{\tiny{\mbox{def}}}{=}&p(w_n) \cdot p(w_{n-1} \mid w_n) \cdot p(w_{n-2} \mid w_{n-1} w_n) \cdot p(w_{n-3} \mid w_{n-2} w_{n-1}) \\ &\cdots p(w_2 \mid w_3 w_4) \cdot p(w_1 \mid w_2 w_3) \end{align}
By manipulating the notation, show that the two models are identical, i.e., $ p(\vec{w}) = p_{reversed}(\vec{w}) $ for any $ \vec{w} $ provided that both models use MLE parameters estimated from the same training data (see Q1 above).

**Write your answer:**

The MLE of $ p(\vec{w})$ is $ C(w_1) \dfrac{C(w_1~w_2)}{C(w_1)} \dfrac{C(w_1~w_2~w_3)}{C(w_1~w_2)} \dfrac{C(w_2~w_3~w_4)}{C(w_2~w_3)} ... \dfrac{C(w_{n-2} ~w_{n-1} ~w_n)}{C(w_{n-2}~w_{n-1})}$ \\

Which can be canceled to: \\

$ C(w_1~w_2~w_3) \dfrac{C(w_2~w_3~w_4)}{C(w_2~w_3)} ... \dfrac{C(w_{n-2} ~w_{n-1} ~w_n)}{C(w_{n-2}~w_{n-1})}$ \\

Similarly, we can write the MLE of $ p_{reversed}(\vec{w}) $ to: $ C(w_n) \dfrac{C(w_{n-1}~w_n)}{C(w_n)} \dfrac{C(w_{n-2} ~w_{n-1} ~w_n)}{C(w_{n-1}~w_n)} \dfrac{C(w_{n-3}~w_{n-2}~w_{n-1})}{C(w_{n-2}~w_{n-1})} ... \dfrac{C(w_2~w_3~w_4)}{C(w_2~w_3)} \dfrac{C(w_1~w_2~w_3)}{C(w_1~w_2)} $ \\

Which can also be canceled to: \\

$ {C(w_{n-2} ~w_{n-1} ~w_n)} \dfrac{C(w_{n-3}~w_{n-2}~w_{n-1})}{C(w_{n-2}~w_{n-1})} ... \dfrac{C(w_2~w_3~w_4)}{C(w_2~w_3)} \dfrac{C(w_1~w_2~w_3)}{C(w_1~w_2)} $ \\

This is equavalent to the answer above, if we multiply all the denominators and numerators respectively, the composition are equavalent.




## 2 $N$-gram Language Model Implementation

In [210]:
!wget -O train.txt https://raw.githubusercontent.com/qtli/COMP7607-Fall2023/master/assignments/A1/data/lm/train.txt
!wget -O dev.txt https://raw.githubusercontent.com/qtli/COMP7607-Fall2023/master/assignments/A1/data/lm/dev.txt
!wget -O test.txt https://raw.githubusercontent.com/qtli/COMP7607-Fall2023/master/assignments/A1/data/lm/test.txt

--2023-10-22 14:34:56--  https://raw.githubusercontent.com/qtli/COMP7607-Fall2023/master/assignments/A1/data/lm/train.txt
正在解析主机 raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.111.133, ...
正在连接 raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... 已连接。
已发出 HTTP 请求，正在等待回应... 200 OK
长度：6640478 (6.3M) [text/plain]
正在保存至: “train.txt”


2023-10-22 14:34:57 (7.26 MB/s) - 已保存 “train.txt” [6640478/6640478])

--2023-10-22 14:34:57--  https://raw.githubusercontent.com/qtli/COMP7607-Fall2023/master/assignments/A1/data/lm/dev.txt
正在解析主机 raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
正在连接 raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... 已连接。
已发出 HTTP 请求，正在等待回应... 200 OK
长度：872910 (852K) [text/plain]
正在保存至: “dev.txt”


2023-10-22 14:34:58 (2.44 MB/s) - 已保存 “dev.txt” [872910/872910])

--2023-10-22 14:34:58--  https://raw.githubuserco

### 2.1 Building vocabulary

**Code**

In [211]:
import nltk
import re
import string
from nltk.lm import Vocabulary
from nltk.lm.preprocessing import pad_both_ends

def preprocess_file_nltk(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        content = f.readlines()

    # Tokenize and preprocess the sentences
    sentences = []
    unpaded_sentences = []
    for line in content:
        line = re.sub(f"[{string.punctuation}]", "", line)
        tokens = nltk.word_tokenize(line.strip())
        unpaded_sentences.append(tokens)
        tokens = list(pad_both_ends(tokens, n=2))  # Pad the sentence with <s> and </s>
        sentences.append(tokens)
    # print(sentences[:10])
    # Create a frequency dictionary using nltk's FreqDist
    freq_dict = nltk.FreqDist(token for tokens in sentences for token in tokens)
    # print(sorted(freq_dict))
    # only keep tokens that appear at least 3 times in the file first 
    vocab = Vocabulary(freq_dict, unk_cutoff=3)
    return sentences, unpaded_sentences, vocab

train_file = './train.txt'
dev_file = './dev.txt'
test_file = './test.txt'

train_sentences_noUNK, train_sentences_unpad_noUNK, train_vocab = preprocess_file_nltk(train_file)
train_sentences = [[word if word in train_vocab else "<UNK>" for word in sentence] for sentence in train_sentences_noUNK]
train_sentences_unpad = [[word if word in train_vocab else "<UNK>" for word in sentence] for sentence in train_sentences_unpad_noUNK]

dev_sentences_noUNK, dev_sentences_unpad, dev_vocab = preprocess_file_nltk(dev_file)
dev_sentences = [[word if word in train_vocab else "<UNK>" for word in sentence] for sentence in dev_sentences_noUNK]
test_sentences_noUNK, test_sentences_unpad, test_vocab = preprocess_file_nltk(test_file)
test_sentences = [[word if word in train_vocab else "<UNK>" for word in sentence] for sentence in test_sentences_noUNK]
print(train_sentences[:10])
print(dev_sentences[:10])

print(f"Train vocabulary size: {len(train_vocab)}")
print(f"Dev vocabulary size: {len(dev_vocab)}")
print(f"Test vocabulary size: {len(test_vocab)}")

[['<s>', 'facebook', 'has', 'released', 'a', 'report', 'that', 'shows', 'what', 'content', 'was', 'most', 'widely', 'viewed', 'by', 'americans', 'between', 'april', 'and', 'june', '</s>'], ['<s>', 'it', 'contains', 'sections', 'showing', 'the', 'top', '20', 'domains', 'links', 'pages', 'and', 'posts', 'in', 'terms', 'of', 'views', '</s>'], ['<s>', 'a', 'companion', 'guide', 'that', 'describes', 'how', 'the', 'data', 'was', 'gathered', 'and', 'analyzed', 'was', 'also', 'released', '</s>'], ['<s>', 'facebook', 'released', 'the', 'data', 'in', 'response', 'to', 'reports', 'that', 'posts', 'from', 'rightwing', 'sources', 'had', 'the', 'most', 'interaction', '</s>'], ['<s>', 'the', 'top', 'posts', 'only', 'account', 'for', 'less', 'than', '0', '</s>'], ['<s>', '1', 'percent', 'of', 'the', 'content', 'viewed', 'by', 'us', 'users', 'and', 'the', 'data', 'only', 'accounts', 'for', 'public', 'posts', 'not', 'posts', 'made', 'in', 'private', 'groups', '</s>'], ['<s>', 'a', 'link', 'to', 'the', '

**Discussion**
Given the vocabulary size above, the number of parameters in a n-gram model is simply vocab size to its nth times, i.e. |V| ^ n. Since the number of parameters of a ngram model refers to the number of possible n grams that can be generated from it.

### 2.2 $N$-gram Language Modeling

**Code**

In [212]:
from nltk import everygrams, ngrams
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends, flatten

def train_language_model(n, train_data, vocab):
    # Create an n-gram generator with padding
    train_data, padded_sents = padded_everygram_pipeline(n, train_data)
    # Train the language model
    lm = MLE(n, vocabulary=vocab)
    lm.fit(train_data, padded_sents)
    print(lm.vocab)
    
    return lm

def compute_perplexity(n, lm, test_data):
    # Preprocess the test data
    test_ngrams = [ngram for sent in test_data for ngram in ngrams(sent, n)]
    # print(test_ngrams[:10])
    return lm.perplexity(test_ngrams)

# Train models and calculate perplexity
unigram_lm = train_language_model(1, train_sentences, train_vocab)
bigram_lm = train_language_model(2, train_sentences, train_vocab)

train_unigram_perplexity = compute_perplexity(1, unigram_lm, train_sentences)
train_bigram_perplexity = compute_perplexity(2, bigram_lm, train_sentences)

dev_unigram_perplexity = compute_perplexity(1, unigram_lm, dev_sentences)
dev_bigram_perplexity = compute_perplexity(2, bigram_lm, dev_sentences)

# test_unigram_perplexity = compute_perplexity(1, unigram_lm, test_sentences)
# test_bigram_perplexity = compute_perplexity(2, bigram_lm, test_sentences)

print("Train Unigram Model Perplexity:", train_unigram_perplexity)
print("Train Bigram Model Perplexity:", train_bigram_perplexity)

print("Dev Unigram Model Perplexity:", dev_unigram_perplexity)
print("Dev Bigram Model Perplexity:", dev_bigram_perplexity)

# print("Test Unigram Model Perplexity:", test_unigram_perplexity)
# print("Test Bigram Model Perplexity:", test_bigram_perplexity)

<Vocabulary with cutoff=3 unk_label='<UNK>' and 17683 items>
<Vocabulary with cutoff=3 unk_label='<UNK>' and 17683 items>
Train Unigram Model Perplexity: 968.4033991354514
Train Bigram Model Perplexity: 72.38090357599297
Dev Unigram Model Perplexity: 967.7749548528787
Dev Bigram Model Perplexity: inf


**Discussion**

For training sets, since the language model is trained based on training data, a lower bigram perplexity is expected as bigram captures context information that unigram does not.
However, the infinity from Bigram model comes from a 0 numerator, and it explodes the perplexity since the formula of perplexity inserts probability of guesses to a division's denominator.



### 2.3 Smoothing

#### 2.3.1 Add-one (Laplace) smoothing

**Code**

In [213]:
from nltk.lm import Laplace

def train_laplace_language_model(n, train_data, vocab):
    # Create an n-gram generator with padding
    train_data, padded_sents = padded_everygram_pipeline(n, train_data)
    # Train the language model with Laplace smoothing
    lm = Laplace(n, vocabulary=vocab)
    lm.fit(train_data, padded_sents)

    return lm

# Train the models with Laplace smoothing
unigram_lm = train_laplace_language_model(1, train_sentences, train_vocab)
bigram_lm = train_laplace_language_model(2, train_sentences, train_vocab)

train_unigram_perplexity = compute_perplexity(1, unigram_lm, train_sentences)
train_bigram_perplexity = compute_perplexity(2, bigram_lm, train_sentences)

dev_unigram_perplexity = compute_perplexity(1, unigram_lm, dev_sentences)
dev_bigram_perplexity = compute_perplexity(2, bigram_lm, dev_sentences)

# test_unigram_perplexity = compute_perplexity(1, unigram_lm, test_sentences)
# test_bigram_perplexity = compute_perplexity(2, bigram_lm, test_sentences)

print("Train Unigram Perplexity:", train_unigram_perplexity)
print("Train Bigram Perplexity:", train_bigram_perplexity)

print("Dev Unigram Perplexity:", dev_unigram_perplexity)
print("Dev Bigram Perplexity:", dev_bigram_perplexity)

# print("Test Unigram Perplexity:", test_unigram_perplexity)
# print("Test Bigram Perplexity:", test_bigram_perplexity)

Train Unigram Perplexity: 969.230602654412
Train Bigram Perplexity: 1402.4365114712714
Dev Unigram Perplexity: 969.1493297099746
Dev Bigram Perplexity: 1639.0737296018713


**Discussion**

After adding 1 to numerator and V to demoninator for each guess, the issue of exploding perplexity is controled through the introduction of non-zero terms in a division
Notice that bigram perplexity has increased a lot for training data. This is because for each guess P(w_i | w_i-1, w_i-2), the behavior was already good on training data (pretty straightforward, since we are fitting the model on it). However, the added 1 and |V| to numerator and denominator in perplexity's conponents were flipped, and the size of V is overwhelming, even if we take a squareroot.

#### 2.3.2: Add-$k$ smoothing

**Code**

In [214]:
from nltk.lm import Lidstone
from nltk.lm.preprocessing import padded_everygram_pipeline

def train_lidstone_language_model(n, train_data, vocab, lmda):
    # Create an n-gram generator with padding
    train_data, padded_sents = padded_everygram_pipeline(n, train_data)
    # Train the language model with Lidstone smoothing
    lm = Lidstone(lmda, order=n,vocabulary=vocab)
    lm.fit(train_data, padded_sents)
    
    return lm

# Train the models with Lidstone smoothing, which is just k smoothing.
lmda = 5e-3

unigram_lm = train_lidstone_language_model(1, train_sentences, train_vocab, lmda)
bigram_lm = train_lidstone_language_model(2, train_sentences, train_vocab, lmda)

train_unigram_perplexity = compute_perplexity(1, unigram_lm, train_sentences)
train_bigram_perplexity = compute_perplexity(2, bigram_lm, train_sentences)

dev_unigram_perplexity = compute_perplexity(1, unigram_lm, dev_sentences)
dev_bigram_perplexity = compute_perplexity(2, bigram_lm, dev_sentences)

# test_unigram_perplexity = compute_perplexity(1, unigram_lm, test_sentences)
# test_bigram_perplexity = compute_perplexity(2, bigram_lm, test_sentences)

print("Train Unigram Perplexity:", train_unigram_perplexity)
print("Train Bigram Perplexity:", train_bigram_perplexity)

print("Dev Unigram Perplexity:", dev_unigram_perplexity)
print("Dev Bigram Perplexity:", dev_bigram_perplexity)

# print("Test Unigram Perplexity:", test_unigram_perplexity)
# print("Test Bigram Perplexity:", test_bigram_perplexity)

Train Unigram Perplexity: 968.4034233091518
Train Bigram Perplexity: 113.10241373999663
Dev Unigram Perplexity: 967.7781169242493
Dev Bigram Perplexity: 378.2551205514041


**Discussion**

We can see that the perplexity of bigram is well controlled. Since add K smoothing is basically add 1 smoothing with coefficients k that we can define, we can use a relatively small k to control the unbalanced scale of 1 and V introduced by add 1 smoothing while also preventing zero division.
I have tried the following K values: 0.01, 5e-3, 1e-3, 1e-4, 5e-4, 1e-5. and 5e-3 provided a relatively good performance among all parameters, and will be used for further model development on the same training set.


#### 2.3.3 Linear Interpolation

**Code**

\# todo



In [215]:
import numpy as np
import tqdm
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

def train_mle_language_model(n, train_data, vocab, lidstone_lambda=5e-3):
    train_data, padded_sents = padded_everygram_pipeline(n, train_data)
    # lm = MLE(n, vocabulary=vocab)
    lm = Lidstone(lidstone_lambda, order=n,vocabulary=vocab)
    lm.fit(train_data, padded_sents)
    return lm

unigram_lm = train_mle_language_model(1, train_sentences, train_vocab)
bigram_lm = train_mle_language_model(2, train_sentences, train_vocab)
trigram_lm = train_mle_language_model(3, train_sentences, train_vocab)
print("Finished training language models")

Finished training language models


In [206]:

def perplexity_of_interpolated_model(lambdas, unigram_lm, bigram_lm, trigram_lm, sentences):
    perplexity = 0
    total_words = 0
    
    for sentence in sentences:
        sentence_perplexity = 1
        for i in range(len(sentence)):
            # calculate the probability for each n-gram model
            unigram_prob = unigram_lm.score(sentence[i])
            bigram_prob = bigram_lm.score(sentence[i], sentence[max(i - 1, 0):i])
            trigram_prob = trigram_lm.score(sentence[i], sentence[max(i - 2, 0):i])
            # interpolate the probabilities
            interpolated_prob = lambdas[0] * unigram_prob + lambdas[1] * bigram_prob + lambdas[2] * trigram_prob
            sentence_perplexity *= (1 / interpolated_prob)
        
        total_words += len(sentence)
        perplexity += np.log(sentence_perplexity)
    
    return np.exp(perplexity / total_words)

# Find the coefficients that minimize the perplexity
best_lambdas = None
best_perplexity = float('inf')

# print(np.arange(0, 1.1, 0.1))
for lambda1 in np.arange(0, 1.1, 0.05):
    for lambda2 in np.arange(0, 1.1 - lambda1, 0.05):
        lambda3 = 1 - lambda1 - lambda2
        lambdas = [lambda1, lambda2, lambda3]

        current_perplexity = perplexity_of_interpolated_model(lambdas, unigram_lm, bigram_lm, trigram_lm, dev_sentences)
        
        if current_perplexity < best_perplexity:
            best_perplexity = current_perplexity
            best_lambdas = lambdas

print("Best coefficients:", best_lambdas)
print("Corresponding dev minimum perplexity:", best_perplexity)

test_perplexity = perplexity_of_interpolated_model(best_lambdas, unigram_lm, bigram_lm, trigram_lm, test_sentences)
print("Test perplexity:", test_perplexity)

  perplexity += np.log(sentence_perplexity)


Best coefficients: [0.2, 0.7000000000000001, 0.09999999999999998]
Corresponding dev minimum perplexity: 247.2466910211572
Test perplexity: 244.09669707439983


**Discussion**

Since add k smoothing (which is Lidstone model fron nltk) provides the best result among all smoothing attempts above, we build 1~3gram models that applies add k smoothing with the same k value from above. Then, I tried to use the stepsize of 0.1 and 0.05 for probing the best combination of n-gram combinations. The perplexity was lowered the most when we assign a relatively large weight to the bigram model on the validation and test sets.
Moreover, although nltk has perplexity function, the behavior looks weird, leading to very unbalanced weight distribution. Thus, I have computed the perplexity sentence by sentence.

##### **Optimization**:

\# todo

## 3 Preposition Prediction

In [220]:
!wget -O dev.in https://raw.githubusercontent.com/qtli/COMP7607-Fall2023/master/assignments/A1/data/prep/dev.in
!wget -O dev.out https://raw.githubusercontent.com/qtli/COMP7607-Fall2023/master/assignments/A1/data/prep/dev.out
!wget -O test.in https://raw.githubusercontent.com/qtli/COMP7607-Fall2023/master/assignments/A1/data/prep/test.in

--2023-10-22 14:46:02--  https://raw.githubusercontent.com/qtli/COMP7607-Fall2023/master/assignments/A1/data/prep/dev.in
正在解析主机 raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
正在连接 raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... 已连接。
已发出 HTTP 请求，正在等待回应... 200 OK
长度：210427 (205K) [text/plain]
正在保存至: “dev.in”


2023-10-22 14:46:03 (1.51 MB/s) - 已保存 “dev.in” [210427/210427])

--2023-10-22 14:46:03--  https://raw.githubusercontent.com/qtli/COMP7607-Fall2023/master/assignments/A1/data/prep/dev.out
正在解析主机 raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.108.133, ...
正在连接 raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... 已连接。
已发出 HTTP 请求，正在等待回应... 200 OK
长度：10018 (9.8K) [text/plain]
正在保存至: “dev.out”


2023-10-22 14:46:03 (14.3 MB/s) - 已保存 “dev.out” [10018/10018])

--2023-10-22 14:46:03--  https://raw.githubusercontent.com/q

### 3.1 Using n gram models trained above

In [174]:
import nltk
import re
import string
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.lm import Vocabulary

def train_prep_language_model(n, train_data, vocab, lidstone_lambda=5e-3):
    train_data, padded_sents = padded_everygram_pipeline(n, train_data)
    # lm = MLE(n, vocabulary=vocab)
    lm = Lidstone(lidstone_lambda, order=n,vocabulary=vocab)
    lm.fit(train_data, padded_sents)
    return lm

def process_PREP(tokens_unmerged):
    tokens = []
    i = 0
    while i < len(tokens_unmerged):
        if tokens_unmerged[i:i+3] == ['<', 'PREP', '>']:
            tokens.append('<PREP>')
            i += 3
        else:
            tokens.append(tokens_unmerged[i])
            i += 1
    return tokens

def preprocess_train_file_preposition(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        content = f.readlines()
    # Tokenize and preprocess the sentences
    sentences = []
    unpaded_sentences = []
    for line in content:
        line = re.sub(f"[{string.punctuation}]", "", line)
        tokens = nltk.word_tokenize(line.strip())
        tokens = list(pad_both_ends(tokens, n=2))  # Pad the sentence with <s> and </s>
        sentences.append(tokens)
    # Create a frequency dictionary using nltk's FreqDist
    freq_dict = nltk.FreqDist(token for tokens in sentences for token in tokens)
    # print(sorted(freq_dict))
    vocab = Vocabulary(freq_dict, unk_cutoff=3)
    return sentences, vocab

def preprocess_file_preposition(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        content = f.readlines()
    # Tokenize and preprocess the sentences
    sentences = []
    unpaded_sentences = []
    for line in content:
        # punctuation_to_remove = "".join([ch for ch in string.punctuation if ch not in "<>"])
        # line = re.sub(f"[{punctuation_to_remove}]", "", line)
        tokens = nltk.word_tokenize(line.strip())
        tokens = list(pad_both_ends(tokens, n=2))  # Pad the sentence with <s> and </s>
        sentences.append(tokens)
    # Create a frequency dictionary using nltk's FreqDist
    freq_dict = nltk.FreqDist(token for tokens in sentences for token in tokens)
    # print(sorted(freq_dict))
    vocab = Vocabulary(freq_dict, unk_cutoff=3)
    return sentences, vocab

In [221]:
def evaluate_PREP_acc(sentences, answers, unigram_lm, bigram_lm, trigram_lm):
    prepositions = ["on", "in", "at", "for", "of"]
    # Initialize variables
    total_masked = 0
    correct_predictions = 0

    lambdas = [0.15, 0.7, 0.15] # best coefficients found in previous section

    for i, sentence in enumerate(sentences):
        tokens = process_PREP(sentence)
        masked_indices = [i for i, token in enumerate(tokens) if token == "<PREP>"]
        guess_index = 0
        for masked_index in masked_indices:
            # Predict masked word using ngram score
            max_prob = -1
            prediction = None
            for prep in prepositions:
                # calculate the probability for each n-gram model that uses K smoothing
                unigram_prob = unigram_lm.score(prep)
                bigram_prob = bigram_lm.score(prep, tokens[max(masked_index - 1, 0):masked_index]) + bigram_lm.score(prep, tokens[masked_index:min(masked_index + 1, len(tokens))])
                trigram_prob = trigram_lm.score(prep, tokens[max(masked_index - 2, 0):masked_index]) + trigram_lm.score(prep, tokens[masked_index:min(masked_index + 2, len(tokens))])
                # interpolate the probabilities
                prob = lambdas[0] * unigram_prob + lambdas[1] * bigram_prob + lambdas[2] * trigram_prob
                if prob > max_prob:
                    max_prob = prob
                    prediction = prep
            
            # Compare prediction with the answer
            total_masked += 1
            answer = answers[i].split()
            # print(f"Prediction for <PREP> at index {masked_index}:", prediction)
            # print(f"Answer for <PREP> at index {masked_index}:", answer[guess_index])
            if prediction == answer[guess_index]:
                correct_predictions += 1
            guess_index += 1

    # Calculate accuracy
    accuracy = correct_predictions / total_masked
    return accuracy

def save_prediction(sentences, unigram_lm, bigram_lm, trigram_lm):
    prepositions = ["on", "in", "at", "for", "of"]
    lambdas = [0.15, 0.7, 0.15] # best coefficients found in previous section

    # Open the output file
    with open('test.out', 'w') as outfile:
        for i, sentence in enumerate(sentences):
            tokens = process_PREP(sentence)
            masked_indices = [i for i, token in enumerate(tokens) if token == "<PREP>"]
            predictions = []
            for masked_index in masked_indices:
                # Predict masked word using ngram score
                max_prob = -1
                prediction = None
                for prep in prepositions:
                    # calculate the probability for each n-gram model that uses K smoothing
                    unigram_prob = unigram_lm.score(prep)
                    bigram_prob = bigram_lm.score(prep, tokens[max(masked_index - 1, 0):masked_index]) + bigram_lm.score(prep, tokens[masked_index:min(masked_index + 1, len(tokens))])
                    trigram_prob = trigram_lm.score(prep, tokens[max(masked_index - 2, 0):masked_index]) + trigram_lm.score(prep, tokens[masked_index:min(masked_index + 2, len(tokens))])
                    # interpolate the probabilities
                    prob = lambdas[0] * unigram_prob + lambdas[1] * bigram_prob + lambdas[2] * trigram_prob
                    if prob > max_prob:
                        max_prob = prob
                        prediction = prep

                # Store the prediction
                predictions.append(prediction)
            # Write the predictions to the output file
            outfile.write(' '.join(predictions) + '\n')
    # Calculate accuracy

# train_source = './data/prep/validate.in' # this is the same as dev.in except that it has answers inserted
dev_source = './dev.in'
dev_answer = './dev.out'
test_source = './test.in'
# test_answer = './data/prep/test.out'

# Load and tokenize source sentences
# train_sentences, train_vocab = preprocess_train_file_preposition(train_source) # unused
dev_sentences, dev_vocab = preprocess_file_preposition(dev_source)  
test_sentences, test_vocab = preprocess_file_preposition(test_source)

# Load the answers
with open(dev_answer, "r") as f:
    dev_answers = f.read().splitlines()
# with open(test_answer, "r") as f:
#     test_answers = f.read().splitlines()

print("Dev set accuracy:", evaluate_PREP_acc(dev_sentences, dev_answers, unigram_lm, bigram_lm, trigram_lm))
# print("Test set accuracy:", evaluate_PREP_acc(test_sentences, test_answers, unigram_lm, bigram_lm, trigram_lm))
save_prediction(test_sentences, unigram_lm, bigram_lm, trigram_lm)


Dev set accuracy: 0.5701262272089762


### Discussion

for this task, I have applied both k smoothing and interpolation to the 3 ngram models. Moreover, adding the reversed context helped improve the accuracy by a little bit, since guessing the preposition while also using reversed context provides more information about the preposition. Unlike the equal condition while guessing the whole sentences, it do help improve the accuracy when we are guessing single word.

### 3.2 RoBERTa Attempt

This is just playing for fun, testing if transformers can provide better results. If the local files are not found, just ignore these code blocks.

In [1]:

def preprocess_input_file(file_path):
    with open(file_path, 'r') as file:
        text = file.read()
    text = text.replace('<PREP>', '<mask>')
    with open("./data/prep/dev_mask.in", 'w') as file:
        file.write(text)

input_file = "./data/prep/dev.in"
preprocessed_file = "./data/prep/dev_mask.in"
preprocess_input_file(input_file)

In [2]:
from transformers import RobertaTokenizer, LineByLineTextDataset, DataCollatorForLanguageModeling, RobertaForMaskedLM, Trainer, TrainingArguments, pipeline

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path=preprocessed_file,
    block_size=128,
)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=1.0
)
model = RobertaForMaskedLM.from_pretrained('roberta-base')

training_args = TrainingArguments(
    output_dir="./output",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=16,
    save_steps=10_000,
    save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

trainer.train()
model.save_pretrained("./fine_tuned_roberta")

  from .autonotebook import tqdm as notebook_tqdm
  return torch._C._cuda_getDeviceCount() > 0


Step,Training Loss


In [15]:
# Load the fine-tuned model
fine_tuned_model = RobertaForMaskedLM.from_pretrained("../A1/fine_tuned_roberta")
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

# Create a fill-mask pipeline
fill_mask = pipeline(
    "fill-mask",
    model=fine_tuned_model,
    tokenizer=tokenizer
)

# Predict masked words
sentence = "palestinian leader yasser arafat <mask> wednesday welcomed the resumption <mask> israeli-syrian peace talks , which were due to begin later <mask> the day <mask> the united states."
masked_sentence = sentence.replace('<PREP>', '<mask>')
predictions = fill_mask(masked_sentence)
# print(predictions)

# Filter and print preposition predictions
prepositions = ["at", "in", "on", "for", "of"]

preposition_predictions = []
for mask_preds in predictions:
    for pred in mask_preds:
        token = pred['token_str'].strip()
        if token in prepositions:
            preposition_predictions.append(token)
            break

print(preposition_predictions)

['on', 'of', 'in', 'in']


In [29]:
with open("./data/prep/test.in", "r", encoding="utf-8") as f:
    input = f.readlines()
    test_sentences = []
    for line in input:
        test_sentences.append(line.replace('<PREP>', '<mask>'))
        
with open("./data/prep/test.out", "r", encoding="utf-8") as f:
    test_answer = f.readlines()

# print(test_sentences)
# print(test_out_lines)

def get_predictions(sentence):
    preposition_predictions = []
    predictions = fill_mask(sentence)
    for pred in predictions:
        if type(pred) is dict:
            token = pred['token_str'].strip()
            if token in prepositions:
                preposition_predictions.append(token)
                break
        else:
            for candidate in pred:
                token = candidate['token_str'].strip()
                if token in prepositions:
                    preposition_predictions.append(token)
                    break
    return preposition_predictions

# Evaluate the model's performance
correct_count = 0
total_count = 0

for test_in_line, test_out_line in zip(test_sentences, test_answer):
    predictions = get_predictions(test_in_line)
    correct_answers = test_out_line.strip().split()
    total_count += len(correct_answers)

    for pred, correct in zip(predictions, correct_answers):
        if pred == correct:
            correct_count += 1

# Calculate and print the correction rate
correction_rate = correct_count / total_count
print(f"Correction rate: {correction_rate:.2%}")

Correction rate: 76.90%
