# Assignment 1
You should submit the **UniversityNumber.ipynb** file and your final prediction file **UniversityNumber.test.out** to Moodle. Make sure your code does not use your local files and that the results are reproducible. Before submitting, please **run your notebook and keep all running logs** so that we can check.

## 1 $n$-gram Language Model
**Q1**: Expand the above definition of $p(\vec{w})$ using naive estimates of the parameters, such as $p(w_4 \mid w_2, w_3) = \frac{C(w_2~w_3~w_4)}{C(w_2~w_3)}$, where $C(w_2~w_3~w_4)$ denotes the number of times the trigram $w_2~w_3~w_4$ was observed in a training corpus.

**Write your answer:**

\begin{align} p(\vec{w}) &= p(w_1) \cdot p(w_2 \mid w_1) \cdot p(w_3 \mid w_1, w_2) \cdot p(w_4 \mid w_2, w_3) \cdots p(w_n \mid w_{n-2}, w_{n-1}) \\ &= \dfrac{C(w_1)}{C(*)} \cdot \dfrac{C(w_1~w_2)}{C(w_1)} \cdot \dfrac{C(w_1~w_2~w_3)}{C(w_1~w_2)} \cdot \dfrac{C(w_2~w_3~w_4)}{C(w_2~w_3)} \cdots \dfrac{C(w_{n-2}~w_{n-1}~w_n)}{C(w_{n-2}~w_{n-1})} \end{align}

Here $C(*)$ denotes the total number of tokens in the training corpus.
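As a sanity check on the naive estimate, the following minimal sketch computes a trigram probability from raw counts. The toy corpus and the helper names `count_ngrams` and `mle_trigram` are assumptions made for illustration; they are not part of the assignment data.

```python
from collections import Counter

# Toy corpus (hypothetical); each sentence is a list of tokens.
corpus = [
    ["the", "cat", "sat", "down"],
    ["the", "cat", "sat", "up"],
    ["the", "cat", "ran", "away"],
]

def count_ngrams(sentences, n):
    """Count every contiguous n-gram in the corpus."""
    counts = Counter()
    for sent in sentences:
        for i in range(len(sent) - n + 1):
            counts[tuple(sent[i:i + n])] += 1
    return counts

bigrams = count_ngrams(corpus, 2)
trigrams = count_ngrams(corpus, 3)

def mle_trigram(w1, w2, w3):
    """Naive estimate p(w3 | w1, w2) = C(w1 w2 w3) / C(w1 w2)."""
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

# "the cat" occurs three times and is followed by "sat" twice.
p_sat = mle_trigram("the", "cat", "sat")
print(p_sat)
```

The estimate is undefined when the context bigram was never observed, which is one reason the smoothing methods in Section 2.3 are needed.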




**Q2**: One could also define a kind of reversed trigram language model $p_{reversed}$ that instead assumed the words were generated in reverse order (from right to left):
\begin{align} p_{reversed}(\vec{w}) \stackrel{\tiny{\mbox{def}}}{=}&p(w_n) \cdot p(w_{n-1} \mid w_n) \cdot p(w_{n-2} \mid w_{n-1} w_n) \cdot p(w_{n-3} \mid w_{n-2} w_{n-1}) \\ &\cdots p(w_2 \mid w_3 w_4) \cdot p(w_1 \mid w_2 w_3) \end{align}
By manipulating the notation, show that the two models are identical, i.e., $ p(\vec{w}) = p_{reversed}(\vec{w}) $ for any $ \vec{w} $ provided that both models use MLE parameters estimated from the same training data (see Q1 above).

**Write your answer:**

Substituting the MLE estimates from Q1, $ p(\vec{w}) $ is

$ \dfrac{C(w_1)}{C(*)} \dfrac{C(w_1~w_2)}{C(w_1)} \dfrac{C(w_1~w_2~w_3)}{C(w_1~w_2)} \dfrac{C(w_2~w_3~w_4)}{C(w_2~w_3)} \cdots \dfrac{C(w_{n-2}~w_{n-1}~w_n)}{C(w_{n-2}~w_{n-1})} $

The first three factors telescope, which leaves

$ \dfrac{C(w_1~w_2~w_3)}{C(*)} \dfrac{C(w_2~w_3~w_4)}{C(w_2~w_3)} \cdots \dfrac{C(w_{n-2}~w_{n-1}~w_n)}{C(w_{n-2}~w_{n-1})} $

Similarly, the MLE of $ p_{reversed}(\vec{w}) $ is

$ \dfrac{C(w_n)}{C(*)} \dfrac{C(w_{n-1}~w_n)}{C(w_n)} \dfrac{C(w_{n-2}~w_{n-1}~w_n)}{C(w_{n-1}~w_n)} \dfrac{C(w_{n-3}~w_{n-2}~w_{n-1})}{C(w_{n-2}~w_{n-1})} \cdots \dfrac{C(w_2~w_3~w_4)}{C(w_3~w_4)} \dfrac{C(w_1~w_2~w_3)}{C(w_2~w_3)} $

which cancels in the same way to

$ \dfrac{C(w_{n-2}~w_{n-1}~w_n)}{C(*)} \dfrac{C(w_{n-3}~w_{n-2}~w_{n-1})}{C(w_{n-2}~w_{n-1})} \cdots \dfrac{C(w_2~w_3~w_4)}{C(w_3~w_4)} \dfrac{C(w_1~w_2~w_3)}{C(w_2~w_3)} $

The two results are equivalent. Multiplying the numerators and the denominators separately, both expressions equal

$ \dfrac{1}{C(*)} \cdot \dfrac{\prod_{i=1}^{n-2} C(w_i~w_{i+1}~w_{i+2})}{\prod_{i=2}^{n-2} C(w_i~w_{i+1})} $

The numerator contains every trigram of $ \vec{w} $ exactly once, and the denominator contains every interior bigram $ w_2~w_3, \ldots, w_{n-2}~w_{n-1} $ exactly once, so $ p(\vec{w}) = p_{reversed}(\vec{w}) $.
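The identity can also be checked numerically. The sketch below is only an illustration under stated assumptions: the toy corpus is invented, the helper names are hypothetical, and the test sentence is chosen so that all of its n-grams occur in the corpus (otherwise a count would be zero and the ratios undefined).

```python
from collections import Counter
import math

# Toy training corpus (hypothetical).
corpus = [
    ["a", "b", "c", "d", "e"],
    ["a", "b", "c", "b", "c", "d"],
    ["c", "d", "e", "a", "b"],
]

def counts_of_order(sentences, n):
    """Count all contiguous n-grams of a fixed order."""
    counts = Counter()
    for sent in sentences:
        for i in range(len(sent) - n + 1):
            counts[tuple(sent[i:i + n])] += 1
    return counts

C1, C2, C3 = (counts_of_order(corpus, n) for n in (1, 2, 3))
N = sum(C1.values())  # C(*), the total number of tokens

def p_forward(w):
    """p(w_1) p(w_2 | w_1) prod_i p(w_i | w_{i-2} w_{i-1}) with MLE counts."""
    p = (C1[(w[0],)] / N) * (C2[tuple(w[:2])] / C1[(w[0],)])
    for i in range(2, len(w)):
        p *= C3[tuple(w[i - 2:i + 1])] / C2[tuple(w[i - 2:i])]
    return p

def p_reversed(w):
    """p(w_n) p(w_{n-1} | w_n) prod_i p(w_i | w_{i+1} w_{i+2}) with MLE counts."""
    p = (C1[(w[-1],)] / N) * (C2[tuple(w[-2:])] / C1[(w[-1],)])
    for i in range(len(w) - 3, -1, -1):
        p *= C3[tuple(w[i:i + 3])] / C2[tuple(w[i + 1:i + 3])]
    return p

w = ["a", "b", "c", "d", "e"]
print(p_forward(w), p_reversed(w))
```

Both functions return the same value up to floating-point rounding, since each reduces to the product of trigram counts over interior bigram counts, divided by $C(*)$.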




## 2 $N$-gram Language Model Implementation

In [None]:
!mkdir -p ./data/lm
!wget -O ./data/lm/train.txt https://raw.githubusercontent.com/qtli/COMP7607-Fall2023/master/assignments/A1/data/lm/train.txt
!wget -O ./data/lm/dev.txt https://raw.githubusercontent.com/qtli/COMP7607-Fall2023/master/assignments/A1/data/lm/dev.txt
!wget -O ./data/lm/test.txt https://raw.githubusercontent.com/qtli/COMP7607-Fall2023/master/assignments/A1/data/lm/test.txt

### 2.1 Building vocabulary

**Code**

In [18]:
import nltk
from nltk.lm import Vocabulary
from nltk.lm.preprocessing import pad_both_ends

def preprocess_file_nltk(file_path):
    # Note: nltk.word_tokenize requires the NLTK tokenizer data to be available.
    with open(file_path, 'r', encoding='utf-8') as f:
        content = f.readlines()

    # Tokenize each line; keep both an unpadded and a padded copy
    sentences = []
    unpadded_sentences = []
    for line in content:
        tokens = nltk.word_tokenize(line.strip())
        unpadded_sentences.append(tokens)
        tokens = list(pad_both_ends(tokens, n=2))  # Pad the sentence with <s> and </s>
        sentences.append(tokens)
    # Count token frequencies over the padded sentences
    freq_dict = nltk.FreqDist(token for tokens in sentences for token in tokens)
    # Keep only tokens that appear at least 3 times; rarer tokens map to <UNK>
    vocab = Vocabulary(freq_dict, unk_cutoff=3)
    return sentences, unpadded_sentences, vocab

train_file = './data/lm/train.txt'
dev_file = './data/lm/dev.txt'
test_file = './data/lm/test.txt'

train_sentences_noUNK, train_sentences_unpad_noUNK, train_vocab = preprocess_file_nltk(train_file)
train_sentences = [[word if word in train_vocab else "<UNK>" for word in sentence] for sentence in train_sentences_noUNK]
train_sentences_unpad = [[word if word in train_vocab else "<UNK>" for word in sentence] for sentence in train_sentences_unpad_noUNK]

dev_sentences, dev_sentences_unpad, dev_vocab = preprocess_file_nltk(dev_file)
test_sentences, test_sentences_unpad, test_vocab = preprocess_file_nltk(test_file)
print(train_sentences[:10])

print(f"Train vocabulary size: {len(train_vocab)}")
print(f"Dev vocabulary size: {len(dev_vocab)}")
print(f"Test vocabulary size: {len(test_vocab)}")

[['<s>', 'facebook', 'has', 'released', 'a', 'report', 'that', 'shows', 'what', 'content', 'was', 'most', 'widely', 'viewed', 'by', 'americans', 'between', 'april', 'and', 'june', '.', '</s>'], ['<s>', 'it', 'contains', 'sections', 'showing', 'the', 'top', '20', 'domains', ',', 'links', ',', 'pages', ',', 'and', 'posts', 'in', 'terms', 'of', 'views', '.', '</s>'], ['<s>', 'a', 'companion', 'guide', 'that', 'describes', 'how', 'the', 'data', 'was', 'gathered', 'and', 'analyzed', 'was', 'also', 'released', '.', '</s>'], ['<s>', 'facebook', 'released', 'the', 'data', 'in', 'response', 'to', 'reports', 'that', 'posts', 'from', 'right-wing', 'sources', 'had', 'the', 'most', 'interaction', '.', '</s>'], ['<s>', 'the', 'top', 'posts', 'only', 'account', 'for', 'less', 'than', '0', '.', '</s>'], ['<s>', '1', 'percent', 'of', 'the', 'content', 'viewed', 'by', 'us', 'users', ',', 'and', 'the', 'data', 'only', 'accounts', 'for', 'public', 'posts', ',', 'not', 'posts', 'made', 'in', 'private', 'gr
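The effect of the frequency cutoff can be illustrated without NLTK. This is a minimal sketch, assuming the same rule as `unk_cutoff=3` (tokens seen at least three times are kept); the function name `apply_cutoff` and the toy sentences are hypothetical.

```python
from collections import Counter

def apply_cutoff(sentences, cutoff=3, unk="<UNK>"):
    """Keep tokens with count >= cutoff and map all other tokens to unk."""
    counts = Counter(tok for sent in sentences for tok in sent)
    vocab = {tok for tok, c in counts.items() if c >= cutoff} | {unk}
    mapped = [[tok if tok in vocab else unk for tok in sent] for sent in sentences]
    return mapped, vocab

# "a" and "b" occur three times each; "c" and "d" occur once.
toy = [["a", "b", "a"], ["a", "c", "b"], ["b", "d"]]
mapped, vocab = apply_cutoff(toy)
print(mapped)
print(sorted(vocab))
```

Only the training vocabulary should drive this mapping; dev and test tokens outside it are treated as `<UNK>` at scoring time.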

**Discussion**
Given the vocabulary size $|V|$ reported above, an $n$-gram model has up to $|V|^n$ parameters, one probability for each possible $n$-gram, so the parameter table grows exponentially with $n$.
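To make the growth concrete, the sketch below evaluates this bound for small $n$. It assumes a vocabulary of 17658 items, the size NLTK reports for the training vocabulary in Section 2.2; the function name `ngram_table_size` is hypothetical.

```python
def ngram_table_size(vocab_size: int, n: int) -> int:
    """Upper bound on the number of n-gram probabilities: |V| ** n."""
    return vocab_size ** n

# Assumption: 17658 is the training vocabulary size reported by NLTK in Section 2.2.
V = 17658
for n in (1, 2, 3):
    print(n, ngram_table_size(V, n))
```

In practice most of these entries are never observed, so a count-based model stores only the $n$-grams that actually occur in the training data.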

### 2.2 $N$-gram Language Modeling

**Code**

In [19]:
import math
from nltk import everygrams
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends, flatten

def train_language_model(n, train_data, vocab):
    # Build padded everygrams (orders 1..n) and the flattened token stream.
    # Note: train_data was already padded with <s> and </s> in 2.1, so for
    # n >= 2 this pipeline adds a second layer of padding symbols.
    train_data, padded_sents = padded_everygram_pipeline(n, train_data)
    
    # Train the language model
    lm = MLE(n, vocabulary=vocab)
    lm.fit(train_data, padded_sents)
    print(lm.vocab)
    
    return lm

def compute_perplexity(n, lm, test_data):
    # test_data is expected to be padded already (see preprocess_file_nltk).
    # everygrams yields all n-grams of orders 1..n, so for n >= 2 the reported
    # perplexity mixes lower-order scores with the n-th order scores.
    test_ngrams = [ngram for sent in test_data for ngram in everygrams(sent, max_len=n)]
    # Calculate perplexity
    return lm.perplexity(test_ngrams)

# Train models and calculate perplexity
unigram_lm = train_language_model(1, train_sentences, train_vocab)
bigram_lm = train_language_model(2, train_sentences, train_vocab)

unigram_perplexity = compute_perplexity(1, unigram_lm, dev_sentences)
bigram_perplexity = compute_perplexity(2, bigram_lm, dev_sentences)

print("Unigram Model Perplexity:", unigram_perplexity)
print("Bigram Model Perplexity:", bigram_perplexity)

<Vocabulary with cutoff=3 unk_label='<UNK>' and 17658 items>
<Vocabulary with cutoff=3 unk_label='<UNK>' and 17658 items>
Unigram Model Perplexity: 835.112106023236
Bigram Model Perplexity: inf


**Discussion**

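The bigram perplexity is infinite because the MLE model assigns probability 0 to any dev-set bigram whose count in the training data is 0 (a zero numerator). The log of that probability is $-\infty$, so the perplexity diverges.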
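A single unseen bigram is enough to produce this result, as the following minimal sketch shows. The toy sentences and the helper names are hypothetical, and the out-of-vocabulary word is left as-is here rather than mapped to `<UNK>` as in the notebook.

```python
import math
from collections import Counter

train = [["<s>", "i", "like", "tea", "</s>"],
         ["<s>", "i", "like", "milk", "</s>"]]
dev = ["<s>", "i", "like", "coffee", "</s>"]  # "like coffee" never occurs in train

unigrams = Counter(w for s in train for w in s)
bigrams = Counter(pair for s in train for pair in zip(s, s[1:]))

def mle_bigram(prev, word):
    """MLE estimate C(prev word) / C(prev); zero when the context is unseen."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

def perplexity(sentence):
    """Bigram perplexity of one padded sentence under the MLE model."""
    pairs = list(zip(sentence, sentence[1:]))
    log_sum = 0.0
    for prev, word in pairs:
        p = mle_bigram(prev, word)
        if p == 0.0:
            return math.inf  # log(0) is -inf, so the perplexity diverges
        log_sum += math.log2(p)
    return 2 ** (-log_sum / len(pairs))

print(perplexity(train[0]))  # finite: every bigram was seen in training
print(perplexity(dev))       # infinite: one bigram has zero count
```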



### 2.3 Smoothing

#### 2.3.1 Add-one (Laplace) smoothing

**Code**

In [21]:
from nltk.lm import Laplace

def train_laplace_language_model(n, train_data, vocab):
    # Create an n-gram generator with padding
    train_data, padded_sents = padded_everygram_pipeline(n, train_data)
    
    # Train the language model with Laplace smoothing
    lm = Laplace(n, vocabulary=vocab)
    lm.fit(train_data, padded_sents)
    
    return lm

# Train the models with Laplace smoothing
unigram_lm = train_laplace_language_model(1, train_sentences, train_vocab)
bigram_lm = train_laplace_language_model(2, train_sentences, train_vocab)

# Calculate perplexity
unigram_perplexity = compute_perplexity(1, unigram_lm, dev_sentences)
bigram_perplexity = compute_perplexity(2, bigram_lm, dev_sentences)

print("Unigram Perplexity:", unigram_perplexity)
print("Bigram Perplexity:", bigram_perplexity)

Unigram Perplexity: 836.0150916297277
Bigram Perplexity: 1006.1865977200891


**Discussion**

\# todo



#### 2.3.2 Add-$k$ smoothing

**Code**

In [6]:
from nltk.lm import Lidstone
from nltk.lm.preprocessing import padded_everygram_pipeline

def train_lidstone_language_model(n, train_data, vocab, lmda):
    # Create an n-gram generator with padding
    train_data, padded_sents = padded_everygram_pipeline(n, train_data)
    
    # Train the language model with Lidstone smoothing
    lm = Lidstone(lmda, order=n,vocabulary=vocab)
    lm.fit(train_data, padded_sents)
    
    return lm

# Smoothing constant k for add-k (Lidstone) smoothing; 5e-3 is the value used here
lmda = 5e-3
# Train the models with Lidstone smoothing
unigram_lm = train_lidstone_language_model(1, train_sentences, train_vocab, lmda)
bigram_lm = train_lidstone_language_model(2, train_sentences, train_vocab, lmda)

# Calculate perplexity
unigram_perplexity = compute_perplexity(1, unigram_lm, dev_sentences)
bigram_perplexity = compute_perplexity(2, bigram_lm, dev_sentences)

print("Unigram Perplexity:", unigram_perplexity)
print("Bigram Perplexity:", bigram_perplexity)

Unigram Perplexity: 835.1135795103045
Bigram Perplexity: 495.6881461545059
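For reference, the add-$k$ estimate is $(C(w_{i-1}~w_i) + k) / (C(w_{i-1}) + k\,|V|)$, with add-one smoothing as the special case $k = 1$. The sketch below is a plain-Python illustration on an invented toy corpus (the names `add_k_bigram`, `train`, and `vocab` are hypothetical); it shows how a smaller $k$ moves less probability mass from seen to unseen bigrams.

```python
from collections import Counter

train = [["<s>", "i", "like", "tea", "</s>"],
         ["<s>", "i", "like", "milk", "</s>"]]
vocab = sorted({w for s in train for w in s} | {"<UNK>"})

unigrams = Counter(w for s in train for w in s)
bigrams = Counter(pair for s in train for pair in zip(s, s[1:]))

def add_k_bigram(prev, word, k):
    """Add-k estimate: (C(prev word) + k) / (C(prev) + k * |V|)."""
    return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * len(vocab))

for k in (1.0, 0.005):
    seen = add_k_bigram("like", "tea", k)
    unseen = add_k_bigram("like", "<UNK>", k)
    total = sum(add_k_bigram("like", w, k) for w in vocab)
    print(f"k={k}: seen={seen:.4f} unseen={unseen:.4f} sum={total:.4f}")
```

With $k = 1$ the seen bigram keeps far less of its MLE probability than with $k = 0.005$, which is consistent with the lower bigram perplexity reported above for the smaller constant.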


**Discussion**

\# todo



#### 2.3.3 Linear Interpolation

**Code**

\# todo



In [None]:


# Train MLE models with different n-grams


# def interpolated_probability(ngram, lambda1, lambda2, lambda3, unigram_lm, bigram_lm, trigram_lm):
#     unigram_prob = unigram_lm.score(ngram[-1])
#     bigram_prob = bigram_lm.score(ngram[-2:], ngram[:-1])
#     trigram_prob = trigram_lm.score(ngram, ngram[:-1])

#     return lambda1 * unigram_prob + lambda2 * bigram_prob + lambda3 * trigram_prob

# def interpolated_perplexity(lambda1, lambda2, lambda3, unigram_lm, bigram_lm, trigram_lm, dataset):
#     log_prob_sum = 0
#     token_count = 0
#     epsilon = 1e-10

#     for sentence in dataset:
#         for i in range(2, len(sentence)):
#             ngram = tuple(sentence[i-2:i+1])
#             prob = interpolated_probability(ngram, lambda1, lambda2, lambda3, unigram_lm, bigram_lm, trigram_lm)
#             log_prob_sum += math.log(prob + epsilon) 
#             token_count += 1

#     return math.exp(-log_prob_sum / token_count)

# Train unigram, bigram, and trigram models


# models = [unigram_lm, bigram_lm, trigram_lm]

# # in interpolation, we always mix results of all models with trained lambda values from dev set
# # Optimize hyperparameters (lambdas) on the dev set
# best_lambdas = [0.0, 0.0, 0.0]
# print(np.arange(0, 1.1, 0.1))
# best_dev_perplexity = float('inf')
# for lambda1 in np.arange(0, 1.1, 0.1):
#     for lambda2 in np.arange(0, 1.1 - lambda1, 0.1):
#         lambda3 = 1 - lambda1 - lambda2

#         perplexity = interpolated_perplexity(lambda1, lambda2, lambda3, unigram_lm, bigram_lm, trigram_lm, dev_sentences)
#         if perplexity < best_dev_perplexity:
#             best_dev_perplexity = perplexity
#             best_lambdas[0], best_lambdas[1], best_lambdas[2] = lambda1, lambda2, lambda3
#             print("Best lambda1:", lambda1, "Best lambda2:", lambda2, "Best lambda3:", lambda3, "Best perplexity:", perplexity)

# print("Finished training hyperparameters")

# # Report perplexity on the training and dev sets
# # train_perplexity = compute_interpolated_perplexity(models, best_lambdas, train_sentences)
# train_perplexity = interpolated_perplexity(best_lambdas[0], best_lambdas[1], best_lambdas[2], unigram_lm, bigram_lm, trigram_lm, train_sentences)
# print("Best Lambdas:", best_lambdas)
# print("Training Perplexity:", train_perplexity)
# print("Dev Perplexity:", best_dev_perplexity)

# # Report perplexity on the test set
# test_perplexity = interpolated_perplexity(best_lambdas[0], best_lambdas[1], best_lambdas[2], unigram_lm, bigram_lm, trigram_lm, test_sentences)
# print("Test Perplexity:", test_perplexity)

In [None]:
import numpy as np
from nltk.lm import Lidstone
from nltk.lm.preprocessing import padded_everygram_pipeline

def train_mle_language_model(n, train_data, vocab, lidstone_lambda=5e-3):
    # Despite the name, this trains a Lidstone (add-k) smoothed model, so that
    # every component of the interpolation assigns non-zero probability.
    train_data, padded_sents = padded_everygram_pipeline(n, train_data)
    lm = Lidstone(lidstone_lambda, order=n, vocabulary=vocab)
    lm.fit(train_data, padded_sents)
    return lm

unigram_lm = train_mle_language_model(1, train_sentences, train_vocab)
bigram_lm = train_mle_language_model(2, train_sentences, train_vocab)
trigram_lm = train_mle_language_model(3, train_sentences, train_vocab)
print("Finished training language models")

In [23]:

def perplexity_of_interpolated_model(lambdas, unigram_lm, bigram_lm, trigram_lm, dev_sentences):
    # Accumulate the negative log-probability over all tokens. Summing logs
    # avoids the overflow that a running product of 1/p can hit on long sentences.
    neg_log_prob = 0.0
    total_words = 0
    
    for sentence in dev_sentences:
        for i in range(len(sentence)):
            # Contexts are truncated at the sentence start, so the first
            # tokens are scored with shorter histories.
            unigram_prob = unigram_lm.score(sentence[i])
            bigram_prob = bigram_lm.score(sentence[i], sentence[max(i - 1, 0):i])
            trigram_prob = trigram_lm.score(sentence[i], sentence[max(i - 2, 0):i])

            interpolated_prob = lambdas[0] * unigram_prob + lambdas[1] * bigram_prob + lambdas[2] * trigram_prob
            neg_log_prob -= np.log(interpolated_prob)
        
        total_words += len(sentence)
    
    return np.exp(neg_log_prob / total_words)

# Find the coefficients that minimize the perplexity
best_lambdas = None
best_perplexity = float('inf')

for lambda1 in np.arange(0, 1.1, 0.1):
    for lambda2 in np.arange(0, 1.1 - lambda1, 0.1):
        lambda3 = 1 - lambda1 - lambda2
        # Floating-point steps in np.arange can overshoot and make lambda3
        # negative; skip such points and clamp rounding noise to zero.
        if lambda3 < -1e-9:
            continue
        lambdas = [lambda1, lambda2, max(lambda3, 0.0)]

        current_perplexity = perplexity_of_interpolated_model(lambdas, unigram_lm, bigram_lm, trigram_lm, dev_sentences)
        
        if current_perplexity < best_perplexity:
            best_perplexity = current_perplexity
            best_lambdas = lambdas

print("Best coefficients:", best_lambdas)
print("Minimum perplexity:", best_perplexity)


Best coefficients: [0.2, 0.7000000000000001, 0.09999999999999998]
Minimum perplexity: 194.88369622296648
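The grid search can be made robust to floating-point drift by stepping over integers and working in log space. The following is a minimal sketch under stated assumptions: `simplex_grid` and `interpolated_perplexity` are hypothetical helpers, and `token_probs` holds invented per-token probabilities standing in for the scores of the three trained models.

```python
import math

def simplex_grid(steps=10):
    """Yield (l1, l2, l3) with non-negative weights that sum to 1."""
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            yield (i / steps, j / steps, (steps - i - j) / steps)

def interpolated_perplexity(lambdas, token_probs):
    """token_probs: one (p_unigram, p_bigram, p_trigram) triple per token."""
    log_sum = 0.0
    for probs in token_probs:
        p = sum(l * q for l, q in zip(lambdas, probs))
        log_sum += math.log(p)  # summing logs avoids overflow of a running product
    return math.exp(-log_sum / len(token_probs))

# Hypothetical per-token probabilities; all are positive, so log(p) is defined.
token_probs = [(0.01, 0.20, 0.05), (0.02, 0.10, 0.30), (0.05, 0.01, 0.001)] * 200

best = min(simplex_grid(), key=lambda lam: interpolated_perplexity(lam, token_probs))
print(best, interpolated_perplexity(best, token_probs))
```

Deriving the third weight from integer steps guarantees that it is never negative, and the log-space sum stays finite even for the 600 tokens used here, where a product of reciprocals would overflow.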


**Discussion**

\# todo



##### **Optimization**:

\# todo

## 3 Preposition Prediction

In [None]:
!mkdir -p ./data/prep
!wget -O ./data/prep/dev.in https://raw.githubusercontent.com/qtli/COMP7607-Fall2023/master/assignments/A1/data/prep/dev.in
!wget -O ./data/prep/dev.out https://raw.githubusercontent.com/qtli/COMP7607-Fall2023/master/assignments/A1/data/prep/dev.out

### 3.1 RoBERTa Attempt

In [1]:

def preprocess_input_file(file_path, output_path):
    # Replace the <PREP> placeholder with RoBERTa's mask token
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()
    text = text.replace('<PREP>', '<mask>')
    with open(output_path, 'w', encoding='utf-8') as file:
        file.write(text)

input_file = "./data/prep/dev.in"
preprocessed_file = "./data/prep/dev_mask.in"
preprocess_input_file(input_file, preprocessed_file)

In [2]:
from transformers import RobertaTokenizer, LineByLineTextDataset, DataCollatorForLanguageModeling, RobertaForMaskedLM, Trainer, TrainingArguments, pipeline

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path=preprocessed_file,
    block_size=128,
)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=1.0
)
model = RobertaForMaskedLM.from_pretrained('roberta-base')

training_args = TrainingArguments(
    output_dir="./output",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=16,
    save_steps=10_000,
    save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

trainer.train()
model.save_pretrained("./fine_tuned_roberta")



Step,Training Loss


In [15]:
# Load the fine-tuned model
fine_tuned_model = RobertaForMaskedLM.from_pretrained("./fine_tuned_roberta")
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

# Create a fill-mask pipeline
fill_mask = pipeline(
    "fill-mask",
    model=fine_tuned_model,
    tokenizer=tokenizer
)

# Predict masked words (the sentence already uses <mask>, so no replacement is needed)
sentence = "palestinian leader yasser arafat <mask> wednesday welcomed the resumption <mask> israeli-syrian peace talks , which were due to begin later <mask> the day <mask> the united states."
predictions = fill_mask(sentence)
# print(predictions)

# Filter and print preposition predictions
prepositions = ["at", "in", "on", "for", "of"]

preposition_predictions = []
for mask_preds in predictions:
    for pred in mask_preds:
        token = pred['token_str'].strip()
        if token in prepositions:
            preposition_predictions.append(token)
            break

print(preposition_predictions)

['on', 'of', 'in', 'in']


In [29]:
with open("./data/prep/test.in", "r", encoding="utf-8") as f:
    # Replace each <PREP> placeholder with RoBERTa's mask token
    test_sentences = [line.replace('<PREP>', '<mask>') for line in f]
        
with open("./data/prep/test.out", "r", encoding="utf-8") as f:
    test_answer = f.readlines()

# print(test_sentences)
# print(test_out_lines)

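def get_predictions(sentence):
    # For each mask, keep the highest-ranked candidate that is one of the
    # target prepositions. A mask with no such candidate contributes nothing,
    # so the result can be shorter than the number of masks.
    preposition_predictions = []
    predictions = fill_mask(sentence)
    for pred in predictions:
        if isinstance(pred, dict):
            # Single-mask input: predictions is a flat list of candidates
            token = pred['token_str'].strip()
            if token in prepositions:
                preposition_predictions.append(token)
                break
        else:
            # Multi-mask input: one candidate list per mask
            for candidate in pred:
                token = candidate['token_str'].strip()
                if token in prepositions:
                    preposition_predictions.append(token)
                    break
    return preposition_predictions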

# Evaluate the model's performance
correct_count = 0
total_count = 0

for test_in_line, test_out_line in zip(test_sentences, test_answer):
    predictions = get_predictions(test_in_line)
    correct_answers = test_out_line.strip().split()
    total_count += len(correct_answers)

    for pred, correct in zip(predictions, correct_answers):
        if pred == correct:
            correct_count += 1

# Calculate and print the correction rate
correction_rate = correct_count / total_count
print(f"Correction rate: {correction_rate:.2%}")

Correction rate: 76.90%
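Because the evaluation pairs predictions with answers by position, a mask for which no preposition appears among the returned candidates shifts every later prediction. One way to keep the two lists aligned is sketched below. This is an assumption-laden illustration, not the method used for the reported score: `pick_prepositions` is a hypothetical helper, the fallback `"of"` is an arbitrary choice, and `fake` only imitates the shape of the pipeline output handled in the code above.

```python
PREPOSITIONS = ("at", "in", "on", "for", "of")

def pick_prepositions(predictions, fallback="of"):
    """Return exactly one preposition per mask, in mask order."""
    # A single-mask result is a flat list of dicts; wrap it so both shapes match.
    if predictions and isinstance(predictions[0], dict):
        predictions = [predictions]
    chosen = []
    for candidates in predictions:
        token = next(
            (c["token_str"].strip() for c in candidates
             if c["token_str"].strip() in PREPOSITIONS),
            fallback,  # keeps positions aligned when no candidate qualifies
        )
        chosen.append(token)
    return chosen

# Hypothetical pipeline-shaped output for a sentence with two masks.
fake = [
    [{"token_str": " the", "score": 0.6}, {"token_str": " on", "score": 0.3}],
    [{"token_str": " and", "score": 0.7}, {"token_str": " but", "score": 0.2}],
]
print(pick_prepositions(fake))
```

The second mask has no preposition among its candidates, so the fallback fills that slot and the output still has one entry per mask.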


### 3.2 LSTM Attempt

In [5]:
# create validation files according to correct answer
def read_file(file_path):
    with open(file_path, 'r') as f:
        content = f.read()
    return content

def write_file(file_path, content):
    with open(file_path, 'w') as f:
        f.write(content)

def fill_prep_words(input_file, output_file, result_file):
    input_content = read_file(input_file)
    output_content = read_file(output_file)

    input_sentences = input_content.strip().split('\n')
    output_preps = output_content.strip().split()

    output_index = 0
    result_sentences = []

    for sentence in input_sentences:
        words = sentence.split()
        filled_words = []

        for word in words:
            if word == '<PREP>':
                filled_words.append(output_preps[output_index])
                output_index += 1
            else:
                filled_words.append(word)

        result_sentences.append(' '.join(filled_words))

    result_content = '\n'.join(result_sentences)
    write_file(result_file, result_content)

# Main code execution
input_file = './data/prep/dev.in'
output_file = './data/prep/dev.out'
result_file = './data/prep/validate.in'
fill_prep_words(input_file, output_file, result_file)

input_file = './data/prep/test.in'
output_file = './data/prep/test.out'
result_file = './data/prep/test_validate.in'
fill_prep_words(input_file, output_file, result_file)

In [24]:
import numpy as np
import tensorflow as tf
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense
from keras.utils import to_categorical


# Replace these with the paths to your training and validation files
train_data_path = './data/prep/dev.in'
val_data_path = './data/prep/validate.in'

with open(train_data_path) as f:
    train_data = f.read().splitlines()

with open(val_data_path) as f:
    val_data = f.read().splitlines()

# Tokenize the text data
tokenizer = Tokenizer(filters='', lower=False)
tokenizer.fit_on_texts(train_data + val_data)
total_words = len(tokenizer.word_index) + 1

# Generate sequences and labels
def generate_sequences(data):
    input_sequences, labels = [], []
    for line in data:
        token_list = tokenizer.texts_to_sequences([line])[0]
        for i in range(1, len(token_list)):
            n_gram_sequence = token_list[:i+1]
            input_sequences.append(n_gram_sequence[:-1])
            labels.append(n_gram_sequence[-1])
    return input_sequences, labels

train_sequences, train_labels = generate_sequences(train_data)
val_sequences, val_labels = generate_sequences(val_data)

# Pad sequences
max_sequence_len = max([len(seq) for seq in train_sequences + val_sequences])
train_sequences = pad_sequences(train_sequences, maxlen=max_sequence_len, padding='pre')
val_sequences = pad_sequences(val_sequences, maxlen=max_sequence_len, padding='pre')

# One-hot encode labels
train_labels = to_categorical(train_labels, num_classes=total_words)
val_labels = to_categorical(val_labels, num_classes=total_words)
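The prefix expansion performed by `generate_sequences` and the left padding can be seen on a single short sentence. The sketch below re-implements both steps in plain Python for illustration only; `prefix_examples`, `pad_left`, and the token ids are hypothetical.

```python
def prefix_examples(token_ids):
    """Expand one sentence into (left-context, next-token) training pairs."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]

def pad_left(seq, length, pad_id=0):
    """Left-pad with pad_id, mirroring padding='pre'."""
    return [pad_id] * (length - len(seq)) + seq

ids = [4, 7, 2, 9]  # hypothetical token ids for a four-token sentence
pairs = prefix_examples(ids)
max_len = max(len(ctx) for ctx, _ in pairs)
for ctx, label in pairs:
    print(pad_left(ctx, max_len), "->", label)
```

A sentence of $m$ tokens therefore yields $m - 1$ training examples, each predicting the next token from everything to its left.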


In [None]:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

model = Sequential()
model.add(Embedding(total_words, 100, input_length=max_sequence_len))
model.add(LSTM(256, return_sequences=True))
model.add(LSTM(256))
model.add(Dense(total_words, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(train_sequences, train_labels, epochs=100, validation_data=(val_sequences, val_labels))
model.save('lstm_masked_word_prediction.h5')
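Since the network predicts a distribution over the whole vocabulary, a preposition slot could be filled by comparing only the five candidate prepositions. The sketch below illustrates that selection step under stated assumptions: `best_preposition` is a hypothetical helper, and `word_index` and `probs` are invented stand-ins for the tokenizer's index and one row of the model's softmax output.

```python
PREPOSITIONS = ("at", "in", "on", "for", "of")

def best_preposition(next_word_probs, word_index):
    """Pick the candidate preposition with the highest next-word probability."""
    scored = {
        prep: next_word_probs[word_index[prep]]
        for prep in PREPOSITIONS
        if prep in word_index  # skip candidates the tokenizer never saw
    }
    return max(scored, key=scored.get)

# Hypothetical word index and softmax row (index 0 is reserved for padding).
word_index = {"the": 1, "of": 2, "in": 3, "on": 4, "at": 5, "for": 6}
probs = [0.0, 0.40, 0.10, 0.25, 0.05, 0.15, 0.05]
print(best_preposition(probs, word_index))
```

Here the most likely word overall is "the", but restricting the comparison to the candidate set selects "in". This uses only the left context, which is a limitation of a left-to-right model for this task.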