# Parallel Corpora and LMs

You receive a small parallel corpus extracted from EuroParl, a very influential corpus in Machine Translation which contains speeches delivered at the European Parliament translated in a variety of languages: you'll work with English, Dutch, and Italian. The corpus comes as a .json file containing a dictionary mapping IDs to sentences in the three target languages, indicated as 'en', 'it', and 'nl'. Sentences under each language come as a list of strings, and sentences have been tokenized already, with tokens separated by white spaces.

You should carry out the following tasks:

1. read the input .json file, then:
	- pre-process sentences in all three languages by
        * lowercasing everything (1pt)
        * replacing each digit with the capital letter D (1pt)
        * removing all characters that aren't letters (mind accented letters!) or white spaces (2pts)
	- split the sentence IDs in a training set (80%) and a test set (20%) with random seed 4242 - remember to sort the dictionary keys before sampling, or you will sample different IDs than I did. (1pt)
    
> 5 points available, assigned as indicated above if the step is carried out correctly (everything is lowercased, all numbers replaced, only letters and white spaces, correct splitting). Try to replicate your code twice to make sure the seed is working as intended.


2. train a total of four character-level Statistical Language Models using the LM class provided in Notebook04 - make sure the resulting object has attributes counts, vocab, and vocab_size), with add-k smoothing and k=0.01:
	- a bigram model (predicting the current character based on the two previous characters)
	- a tetragram model (predicting the current character based on the four previous characters)
   given the following inputs:
        * the sentences in the English training set, after getting rid of all white spaces
    	* the word types extracted from the English training set

!!! Set any character which occurs less than 20 times in the English training sentences to the 'UNK' string.
!!! Remember that language models can only be compared if they have the same vocabulary: make sure that all models are trained using the vocabulary of the models trained on sentences, not words.
!!! Get inspiration from the Corpus and LM classes introduced in class, but edit them to fit the task.
    
At the end of task 2 you should have four LMs all having the same vocabulary:
* a character-level bigram language model trained on full English sentences without white spaces
* a character-level bigram language model trained on word types from the English training sentences
* a character-level tetragram language model trained on full English sentences without white spaces
* a character-level tetragram language model trained on word types from the English training sentences

You should submit a .pkl file for each LM, dumping the LMs to .pkl files and naming files using the template Name(Initial)Surname_[words|sents]_[2gr|4gr]_en.pkl ( the | symbol means OR ). Therefore, John K. Doe should name his bigram model trained on sentences JohnKDoe_sents_2gr_en.pkl. If you don't have a middle name, just use NameSurname. If you have multiple surnames, add them as NameSurname1Surname2, with no intervening spaces. The notebook contains the code backbone to save a .pkl file, you need to edit it to choose the correct object to save and the appropriate file name given your name and surname.

> 4 points available: we will automatically check whether 5 random transition counts and the vocabulary of your models check out with ours. For each model where the check succeeds, you will receive one point.


3. Compute the perplexity of all four models on:
	- the English training sentences
	- the English training word types
	- the English test set (sentences)
	- the Dutch test set (sentences)
	- the Italian test set (sentences)

You should submit a .csv file with the following structure, column names, and values ([2/4] means either 2 for bigram models or 4 for tetragram models, the options under test_data indicate the five sets to be used to compute perplexity):

|ngram_size|training_data|test_data|perplexity|
|---|---|---|---|
|[2/4]|[words/sents]|[ITtest/NLtest/ENtest/ENtrain_sents/ENtrain_words]|float (rounded at 4 decimal places)|
|---|---|---|---|

The file should be named according to the template _Name(Initial)Surname_perplexities.csv_

> 5 points available: you get 1 point if all four LMs yield the correct perplexity scores for a test_dataset.


4. Out of all Italian and Dutch word types longer than 4 characters and with at least 5 occurrences in the Italian/Dutch sentences, find:
	- the word with the lowest perplexity according to each of the four models
	- the word with the highest perplexity according to each of four models

You should submit two .csv files (one for the lowest perplexities, one for highest perplexities) with the following structure, column names, and values ([2/4] means either 2 for bigram models or 4 for tetragram models, str indicates that the word should appear as a string):

| lang | word | ngram_size | training_data | perplexity |
|---|---|---|---|---|
|[it/nl]|str|[2/4]|[words/sents]|float (rounded at 4 decimal places)|
|---|---|---|---|

The files should be named according to the template _Name(Initial)Surname_perplexities_[max|min].csv_, so Jane Smith should submit a file named _JaneSmith_perplexities_max.csv_ containing 8 rows each storing the word with the highest perplexity according to each of the four LMs per language.

> 4 points available: you get 0.25 points for each correct word identified


5. Answer the following questions:
	- a. compare LMs' perplexity on the English training sets, sentences and words, then explain the differences in perplexity considering what changes between the two training set-ups. (5 pts, 150 words)
	- b. which LM trained on sentences generalizes better to unseen sentences in the same language, bigram or tetragram? explain why this is the case. (5 pts, 150 words)
	- c. compare LMs trained on English in their ability to fit Italian and Dutch sentences: which factor between ngram size and training corpus (words or sentences) affects perplexity the most? Explain why we observe this pattern. (4 pts, 100 words)
	- f. what patterns can you identify in the words with the lowest perplexity in Dutch and Italian? (4 pts, 100 words)
	- g. what patterns can you identify in the words with the highest perplexity in Dutch and Italian? (4 pts, 100 words)
    
> 22 points in total, see specifications next to each question


Summing up, you will have to submit 8 files:
- 1 python notebook in .ipynb format
- 4 .pkl files each containing a language model with attributes count, vocab, and vocab_size (you can of course add multiple attributes if it helps you, but these three have to be there, with those exact names!)
- 3 .csv files, one storing the perplexity for each model on the five possible test sets (Italian, Dutch, English sentences from the test set; English sentences from training set, and English word types from the training set.

## Task1

In [53]:
import json
import pickle as pkl
import re
import random
from sklearn.model_selection import train_test_split
import numpy as np
from collections import defaultdict,Counter
import nltk
import pandas as pd


In [54]:
corpus = json.load(open('./parallel_sentences_en-nl-it.json', 'r'))

In [55]:
def preprocess_sentences(sentences):
    preprocessed = []
    for sentence in sentences:

        sentence = sentence.lower()
        sentence = re.sub(r'\d', 'D', sentence)
        sentence = re.sub(r'[^a-z\s]', '', sentence)
        
        preprocessed.append(sentence)
    return preprocessed

for entry in corpus.values():
    for lang in entry:
        entry[lang] = preprocess_sentences(entry[lang])

sorted_keys = sorted(corpus.keys())
train_ids, test_ids = train_test_split(sorted_keys, test_size=0.2, random_state=42)

sample_preprocessed_sentence = corpus[sorted_keys[0]]['en'][0]

print("Sample preprocessed sentence:")
print(sample_preprocessed_sentence)
print("Train IDs:")
print(train_ids)
print("Test IDs:")
print(test_ids)

Sample preprocessed sentence:
mr president  i would like to begin by joining those who congratulated the rapporteur on this work 
Train IDs:
['03-03-27_53', '00-07-05_82', '10-05-18-007_089', '09-12-16-011_240', '01-04-05_197', '01-05-15_270', '10-11-24-008_999', '07-06-19-009_213', '08-12-16-006_197', '04-03-10_282', '01-05-15_274', '09-12-16-012_305', '08-11-18-005_093', '11-02-17-007_284', '11-05-09-020_158', '04-01-15_21', '98-04-01_234', '97-04-08_8', '99-10-05_186', '07-11-14-008_244', '07-09-06-010-02_136', '08-03-12-012_259', '10-05-06-004_052', '08-02-18-024_150', '01-02-12_5', '97-04-08_140', '00-12-14_60', '09-03-24-005_966', '08-12-16-004_148', '03-11-06_26', '09-03-24-003_060', '96-12-11_29', '10-12-15-006_110', '99-07-20_26', '07-11-15-003_019', '05-10-25_8', '01-12-13_123', '03-09-03_148', '97-10-01_32', '09-11-24-008_249', '98-03-10_170', '98-04-01_83', '97-07-16_213', '09-12-15-012_117', '08-09-03-010_197', '06-05-18_133', '08-12-16-010_293', '11-05-10-012_235', '08-01

## Task2

In [56]:
class Corpus(object):

    """
    This class creates a corpus object read off a .json file consisting of a list of lists,
    where each inner list is a sentence encoded as a list of strings.
    """

    def __init__(self, t, n, corpus=None, bos_eos=True, vocab=None):

        """
        A Corpus object has the following attributes:
         - vocab: set or None (default). If a set is passed, words in the input .json file not
                         found in the set are replaced with the UNK string
         - path: str, the path to the .json file used to build the corpus object
         - t: int, words with a lower frequency count than t are replaced with the UNK string
         - ngram_size: int, 2 for bigrams, 3 for trigrams, and so on.
         - bos_eos: bool, default to True. If False, bos and eos symbols are not prepended and appended to sentences.
         - sentences: list of lists, containing the input sentences after lowercasing and
                         splitting at the white space
         - frequencies: Counter, mapping tokens to their frequency count in the corpus
        """

        self.vocab = vocab
        self.corpus = corpus
        self.t = t
        self.ngram_size = n
        self.bos_eos = bos_eos

        # input --> [['I am home.'], ['You went to the park.'], ...]
        self.sentences = self.read()
        # output --> [['i', 'am', 'home' '.'], ['you', 'went', 'to', 'the', 'park', '.'], ...]

        self.frequencies = self.freq_distr()
        # output --> Counter('the': 485099, 'of': 301877, 'i': 286549, ...)
        # the numbers are made up, they aren't the actual frequency counts

        if self.t or self.vocab:
            # output --> [['i', 'am', 'home' '.'], ['you', 'went', 'to', 'the', 'park', '.'], ...]
            self.sentences = self.filter_words()
            # output --> [['i', 'am', 'home' '.'], ['you', 'went', 'to', 'the', 'UNK', '.'], ...]
            # supposing that park wasn't frequent enough or was outside of the training vocabulary, it gets
            # replaced by the UNK string

        if self.bos_eos:
            # output --> [['i', 'am', 'home' '.'], ['you', 'went', 'to', 'the', 'park', '.'], ...]
            self.sentences = self.add_bos_eos()
            # output --> [['eos', i', 'am', 'home' '.', 'bos'],
            #             ['eos', you', 'went', 'to', 'the', 'park', '.', 'bos'], ...]

    def read(self):

        """
        Reads the sentences off the .json file, replaces quotes, lowercases strings and splits
        at the white space. Returns a list of lists.
        """

        if isinstance(self.corpus, str):
            if self.corpus.endswith('.json'):
                sentences = json.load(open(self.corpus, 'r'))
            else:
                sentences = []
                with open(self.corpus, 'r', encoding='latin-1') as f:
                    for line in f:
                        # first strip away newline symbols and the like, then replace ' and " with the empty
                        # string and get rid of possible remaining trailing spaces
                        line = line.strip().translate({ord(i): None for i in '"\'\\'}).strip(' ')
                        # lowercase and split at the white space (the corpus has ben previously tokenized)
                        sentences.append(line.lower().split(' '))

        elif isinstance(self.corpus, list):
            sentences = []
            for l in self.corpus:
                sentence = []
                for w in l:
                    sentence.append(w.lower().translate({ord(i): None for i in '"\'\\'}).strip(' '))
                sentences.append(sentence)

        else:
            raise ValueError("""Unrecognized input corpus: either pass a valid file path or
                             a list of lists of strings to the argument corpus""")

        return sentences

    def freq_distr(self):

        """
        Creates a counter mapping tokens to frequency counts

        count = Counter()
        for sentence in self.sentences:
            for word in sentence:
                count[w] += 1

        """

        return Counter([word for sentence in self.sentences for word in sentence])


    def filter_words(self):

        """
        Replaces illegal tokens with the UNK string. A token is illegal if its frequency count
        is lower than the given threshold and/or if it falls outside the specified vocabulary.
        The two filters can be both active at the same time but don't have to be. To exclude the
        frequency filter, set t=0 in the class call.
        """

        filtered_sentences = []
        for sentence in self.sentences:
            filtered_sentence = []
            for word in sentence:
                if self.t and self.vocab:
                    # check that the word is frequent enough and occurs in the vocabulary
                    filtered_sentence.append(
                        word if self.frequencies[word] > self.t and word in self.vocab else 'UNK'
                    )
                else:
                    if self.t:
                        # check that the word is frequent enough
                        filtered_sentence.append(word if self.frequencies[word] > self.t else 'UNK')
                    else:
                        # check if the word occurs in the vocabulary
                        filtered_sentence.append(word if word in self.vocab else 'UNK')

            if len(filtered_sentence) > 1:
                # make sure that the sentence contains more than 1 token
                filtered_sentences.append(filtered_sentence)

        return filtered_sentences

    def add_bos_eos(self):

        """
        Adds the necessary number of BOS symbols and one EOS symbol.

        In a bigram model, you need on bos and one eos; in a trigram model you need two bos and one eos, and so on...
        """

        r = 1 if self.ngram_size == 1 else self.ngram_size - 1
        padded_sentences = []
        for sentence in self.sentences:
            padded_sentence = ['#bos#']*r + sentence + ['#eos#']
            padded_sentences.append(padded_sentence)

        return padded_sentences

In [57]:
class CharCorpus(Corpus):
    def read(self):
        if isinstance(self.corpus, str):
            if self.corpus.endswith('.json'):
                sentences = json.load(open(self.corpus, 'r'))
            else:
                sentences = []
                with open(self.corpus, 'r', encoding='latin-1') as f:
                    for line in f:
                        line = line.strip().translate({ord(i): None for i in '"\'\\'}).strip(' ')
                        sentences.append(list(line.lower()))
        elif isinstance(self.corpus, list):
            sentences = []
            for l in self.corpus:
                sentence = []
                for w in l:
                    sentence.extend(list(w.lower().translate({ord(i): None for i in '"\'\\'}).strip(' ')))
                sentences.append(sentence)
        else:
            raise ValueError("""Unrecognized input corpus: either pass a valid file path or
                             a list of lists of strings to the argument corpus""")
        return sentences

    def freq_distr(self):
        return Counter([char for sentence in self.sentences for char in sentence])

    def filter_words(self):
        filtered_sentences = []
        for sentence in self.sentences:
            filtered_sentence = []
            for char in sentence:
                if self.t and self.vocab:
                    filtered_sentence.append(
                        char if self.frequencies[char] > self.t and char in self.vocab else 'UNK'
                    )
                else:
                    if self.t:
                        filtered_sentence.append(char if self.frequencies[char] > self.t else 'UNK')
                    else:
                        filtered_sentence.append(char if char in self.vocab else 'UNK')
            if len(filtered_sentence) > 1:
                filtered_sentences.append(filtered_sentence)
        return filtered_sentences

    def add_bos_eos(self):
        r = 1 if self.ngram_size == 1 else self.ngram_size - 1
        padded_sentences = []
        for sentence in self.sentences:
            padded_sentence = ['#bos#']*r + sentence + ['#eos#']
            padded_sentences.append(padded_sentence)
        return padded_sentences

In [58]:
class LM(object):
    def __init__(self, n, vocab=None, smoother='Laplace', k=1, lambdas=None):

        self.vocab = vocab
        self.vocab_size = 0  # Will be updated after training
        # make sure that no fixed quantity is added to counts when the smoother is not Laplace
        self.k = k if smoother=='Laplace' else 0
        self.ngram_size = n
        self.counts = defaultdict(lambda: defaultdict(int))
        self.smoother = smoother
        self.lambdas = lambdas if lambdas else {i+1: 1/n for i in range(n)}

    def get_ngram(self, sentence, i, n):

        if n == 1:
            return sentence[i]
        else:
            ngram = sentence[i-(n-1):i+1]
            history = tuple(ngram[:-1])
            target = ngram[-1]
            return (history, target)


    def update_counts(self, corpus, n):

        if self.ngram_size != corpus.ngram_size:
            raise ValueError("The corpus was pre-processed considering an ngram size of {} while the "
                             "language model was created with an ngram size of {}. \n"
                             "Please choose the same ngram size for pre-processing the corpus and fitting "
                             "the model.".format(corpus.ngram_size, self.ngram_size))

        self.counts = defaultdict(dict)
        # if the interpolation smoother is selected, then estimate transition counts for all possible ngram_sizes
        # smaller than the given one, otherwise stick with the input ngram_size
        ngram_sizes = [n] if self.smoother != 'Interpolation' else range(1,n+1)
        for ngram_size in ngram_sizes:
            self.counts[ngram_size] = defaultdict(dict) if ngram_size > 1 else Counter()
        for sentence in corpus.sentences:
            for ngram_size in ngram_sizes:
                for idx in range(n-1, len(sentence)):
                    ngram = self.get_ngram(sentence, idx, ngram_size)
                    if ngram_size == 1:
                        self.counts[ngram_size][ngram] += 1
                    else:
                        # it's faster to try to do something and catch an exception than to use an if statement to
                        # check whether a condition is met beforehand. The if is checked everytime, the exception
                        # is only catched the first time, after that everything runs smoothly
                        try:
                            self.counts[ngram_size][ngram[0]][ngram[1]] += 1
                        except KeyError:
                            self.counts[ngram_size][ngram[0]][ngram[1]] = 1

        # first loop through the sentences in the corpus, than loop through each word in a sentence
        self.vocab = {word for sentence in corpus.sentences for word in sentence}
        self.vocab_size = len(self.vocab)

    def get_laplace_ngram_probability(self, history, target):

        try:
            ngram_tot = np.sum(list(self.counts[self.ngram_size][history].values())) + (self.vocab_size*self.k)
            try:
                transition_count = self.counts[self.ngram_size][history][target] + self.k
            except KeyError:
                transition_count = self.k
        except KeyError:
            transition_count = self.k
            ngram_tot = self.vocab_size*self.k

        return transition_count/ngram_tot

    def perplexity(self, test_corpus):
        probs = []
        for sentence in test_corpus.sentences:
            for idx in range(self.ngram_size-1, len(sentence)):
                ngram = self.get_ngram(sentence, idx, self.ngram_size)
                if self.ngram_size == 1:
                    probs.append(self.get_unigram_probability(ngram))
                else:
                    if self.smoother == 'Laplace':
                        probs.append(self.get_laplace_ngram_probability(ngram[0], ngram[1]))
                    elif self.smoother == 'Interpolation':
                        probs.append(self.get_interpolated_ngram_probability(ngram[0], ngram[1]))

        entropy = np.log2(probs)

        # Check if entropy list is not empty
        if len(entropy) > 0:
            avg_entropy = -1 * (np.sum(entropy) / len(entropy))
        else:
            avg_entropy = float('inf')

        # Return perplexity based on average entropy
        return pow(2.0, avg_entropy)


    def get_ngrams(self):
        ngrams = []
        for sentence in self.corpus:
            for i in range(len(sentence) - self.ngram_size + 1):
                ngrams.append(sentence[i:i+self.ngram_size])
        return ngrams

    def get_next_char(self, prev_chars):
        possible_chars = [ngram[-1] for ngram in self.ngrams if ngram[:-1] == prev_chars]
        return random.choice(possible_chars) if possible_chars else None

    def generate(self, start_chars, length):
        result = start_chars
        for _ in range(length):
            next_char = self.get_next_char(result[-self.ngram_size+1:])
            if next_char is None:
                break
            result += next_char
        return result

In [59]:

def replace_infrequent_chars(sentences):
    all_chars = ''.join(sentences)
    char_counts = Counter(all_chars)
    return [''.join('UNK' if char_counts[char] < 20 else char for char in sentence) for sentence in sentences]

english_sentences = [corpus[id]['en'] for id in train_ids]

english_sentences_flat = [sentence for sublist in english_sentences for sentence in sublist]

english_sentences_no_spaces = [''.join(sentence.split()) for sentence in english_sentences_flat]

word_types = list(set(word for sentence in english_sentences_flat for word in sentence.split()))

english_sentences_no_spaces_unk = replace_infrequent_chars(english_sentences_no_spaces)
word_types_unk = replace_infrequent_chars(word_types)

bigram_model_sentences = LM(3, smoother='Laplace', k=0.01)
bigram_model_sentences.update_counts(CharCorpus(t=0, n=3, corpus=english_sentences_no_spaces_unk), 3)

bigram_model_word_types = LM(3, smoother='Laplace', k=0.01)
bigram_model_word_types.update_counts(CharCorpus(t=0, n=3, corpus=word_types_unk), 3)

tetragram_model_sentences = LM(5, smoother='Laplace', k=0.01)
tetragram_model_sentences.update_counts(CharCorpus(t=0, n=5, corpus=english_sentences_no_spaces_unk), 5)

tetragram_model_word_types = LM(5, smoother='Laplace', k=0.01)
tetragram_model_word_types.update_counts(CharCorpus(t=0, n=5, corpus=word_types_unk), 5)

english_test_sentences = [corpus[id]['en'] for id in test_ids]

english_test_sentences_flat = [sentence for sublist in english_test_sentences for sentence in sublist]

test_sentences_no_spaces = [''.join(sentence.split()) for sentence in english_test_sentences_flat]

test_sentences_no_spaces_unk = replace_infrequent_chars(test_sentences_no_spaces)

test_corpus_sentences_bigram = CharCorpus(t=0, n=3, corpus=test_sentences_no_spaces_unk, bos_eos=True, vocab=None)
test_corpus_word_types_bigram = CharCorpus(t=0, n=3, corpus=word_types_unk, bos_eos=True, vocab=None)

test_corpus_sentences_tetragram = CharCorpus(t=0, n=5, corpus=test_sentences_no_spaces_unk, bos_eos=True, vocab=None)
test_corpus_word_types_tetragram = CharCorpus(t=0, n=5, corpus=word_types_unk, bos_eos=True, vocab=None)


print("Bigram model sentences perplexity: ", bigram_model_sentences.perplexity(test_corpus_sentences_bigram))
print("Bigram model word types perplexity: ", bigram_model_word_types.perplexity(test_corpus_word_types_bigram))
print("Tetragram model sentences perplexity: ", tetragram_model_sentences.perplexity(test_corpus_sentences_tetragram))
print("Tetragram model word types perplexity: ", tetragram_model_word_types.perplexity(test_corpus_word_types_tetragram))


Bigram model sentences perplexity:  8.284316121400012
Bigram model word types perplexity:  8.956682831745013
Tetragram model sentences perplexity:  4.112991622075985
Tetragram model word types perplexity:  4.174197192108753


In [60]:

with open('ZhivkoParapanov_sents_2gr_en.pkl', 'wb') as f_out:
    pkl.dump(bigram_model_sentences, f_out)

with open('ZhivkoParapanov_words_2gr_en.pkl', 'wb') as f_out:
    pkl.dump(bigram_model_word_types, f_out)

with open('ZhivkoParapanov_sents_4gr_en.pkl', 'wb') as f_out:
    pkl.dump(tetragram_model_sentences, f_out)
    
with open('ZhivkoParapanov_words_4gr_en.pkl', 'wb') as f_out:
    pkl.dump(tetragram_model_word_types, f_out)

## Task 3

3. Compute the perplexity of all four models on:
	- the English training sentences
	- the English training word types
	- the English test set (sentences)
	- the Dutch test set (sentences)
	- the Italian test set (sentences)

You should submit a .csv file with the following structure, column names, and values ([2/4] means either 2 for bigram models or 4 for tetragram models, the options under test_data indicate the five sets to be used to compute perplexity):

|ngram_size|training_data|test_data|perplexity|
|---|---|---|---|
|[2/4]|[words/sents]|[ITtest/NLtest/ENtest/ENtrain_sents/ENtrain_words]|float (rounded at 4 decimal places)|
|---|---|---|---|

The file should be named according to the template _Name(Initial)Surname_perplexities.csv_

> 5 points available: you get 1 point if all four LMs yield the correct perplexity scores for a test_dataset.


In [61]:
EN_test_ids = [corpus[id]['en'] for id in test_ids if 'en' in corpus[id]]
IT_test_ids = [corpus[id]['it'] for id in test_ids if 'it' in corpus[id]]
NL_test_ids = [corpus[id]['nl'] for id in test_ids if 'nl' in corpus[id]]

In [62]:
def preprocess_test_data(test_data):
    if test_data and isinstance(test_data[0], list):
        test_data = [' '.join(sentence) if isinstance(sentence, list) else sentence for sentence in test_data]
    
    test_data_no_spaces = [''.join(sentence.split()) for sentence in test_data]
    test_data_no_spaces_unk = replace_infrequent_chars(test_data_no_spaces)
    return test_data_no_spaces_unk

def create_char_corpus(test_data, n, vocab):
    return CharCorpus(t=0, n=n, corpus=test_data, bos_eos=True, vocab=vocab)

# Preprocess all test datasets
EN_test_processed = preprocess_test_data(EN_test_ids)
print(EN_test_processed[0])
NL_test_processed = preprocess_test_data(NL_test_ids)
IT_test_processed = preprocess_test_data(IT_test_ids)

ithankthepresidencyverymuchforthatreplythecaseofmadeleinemccannhasarousedalotofinterestandindeedcontroversyiamnotgoingtogointothedetailsofthatcasebutwhatconcernsmehereiswhatlessonswecanlearngenerallyabouttheadequacyofeuropeanactioninthecaseofmissingchildreniwanttoaskaboutthreeissuesthefirstisamissingchildrenshotlineyesterdaycommissionerfrattinitoldusthathewasnotatallsatisfiedbymemberstatesactiontoimplementthecouncildecisionfromfebruaryforasinglephonenumberformissingchildrenwhichshouldhavebeeninplaceinaugustonlyfourmemberstateshavechosenaserviceproviderthreememberstateshavefailedtorespondtoarequestforinformationatallthatisnotveryimpressivewillyouchaseuptheothermemberstatessecondlyafewweeksagojusticeandhomeaffairsministerscalledforaneudatabaseonmissingchildrenibelievethatsomeprivateattemptshavebeenmadeincooperationwithyoutubeandmadeleinemccannsparentswilltheeusupporthavingaproperdatabasethirdlyyoutalkedaboutworkontheexchangeofinformationonsexoffendersbutwhenarewegoingtohaveacomputerisedd

In [63]:

models = {
    bigram_model_sentences: ('bigram_sentences', 3),
    bigram_model_word_types: ('bigram_word_types', 3),
    tetragram_model_sentences: ('tetragram_sentences', 5),
    tetragram_model_word_types: ('tetragram_word_types', 5)
}

test_datasets = {
    'ENtest': EN_test_processed,
    'NLtest': NL_test_processed,
    'ITtest': IT_test_processed,
}

results = []
for model, (model_name, ngram_size) in models.items():
    for test_name, test_data in test_datasets.items():
        test_corpus = create_char_corpus(test_data, ngram_size, model.vocab)
        perplexity = model.perplexity(test_corpus)
        results.append([ngram_size, model_name, test_name, round(perplexity, 4)])

df = pd.DataFrame(results, columns=['ngram_size', 'training_data', 'test_data', 'perplexity'])

In [64]:

df.to_csv(
     'ZhivkoParapanov_perplexities.csv',
      encoding='utf-8',
      index=False
 )

## Task4

4. Out of all Italian and Dutch word types longer than 4 characters and with at least 5 occurrences in the Italian/Dutch sentences, find:
	- the word with the lowest perplexity according to each of the four models
	- the word with the highest perplexity according to each of four models

You should submit two .csv files (one for the lowest perplexities, one for highest perplexities) with the following structure, column names, and values ([2/4] means either 2 for bigram models or 4 for tetragram models, str indicates that the word should appear as a string):

| lang | word | ngram_size | training_data | perplexity |
|---|---|---|---|---|
|[it/nl]|str|[2/4]|[words/sents]|float (rounded at 4 decimal places)|
|---|---|---|---|

The files should be named according to the template _Name(Initial)Surname_perplexities_[max|min].csv_, so Jane Smith should submit a file named _JaneSmith_perplexities_max.csv_ containing 8 rows each storing the word with the highest perplexity according to each of the four LMs per language.

> 4 points available: you get 0.25 points for each correct word identified

In [65]:
def preprocess_test_data(test_data):
    if test_data and isinstance(test_data[0], list):
        test_data = [' '.join(sentence) for sentence in test_data]

    processed_sentences = [' '.join(re.sub(r'\d', 'D', sentence).lower().split()) for sentence in test_data]
    
    words = [word for sentence in processed_sentences for word in sentence.split()]

    processed_words = replace_infrequent_chars(words)

    return processed_words

def get_words_from_sentences(sentences):
    return [word for sentence in sentences for word in sentence.split()]

def filter_words_by_frequency_and_length(words):
    word_freq = Counter(words)
    return [word for word in word_freq if len(word) > 4 and word_freq[word] >= 5]

def compute_word_perplexity(model, word):
    bos = '<s>'
    eos = '</s>'
    word = bos + word + eos
    
    prob = 1.0
    for i in range(len(word) - model.ngram_size + 1):
        ngram = word[i:i + model.ngram_size]
        if len(ngram) < model.ngram_size:
            break
        history, char = ngram[:-1], ngram[-1]

        prob *= model.get_laplace_ngram_probability(history, char)

    if prob == 0:
        prob = 1e-10
    
    perplexity = 1 / prob ** (1 / len(word))
    return perplexity

model_details = {
    bigram_model_sentences: ('3', 'sents'),
    bigram_model_word_types: ('3', 'words'),
    tetragram_model_sentences: ('5', 'sents'),
    tetragram_model_word_types: ('5', 'words')
}


IT_test_words = preprocess_test_data(IT_test_ids)
NL_test_words = preprocess_test_data(NL_test_ids)

IT_filtered_words = filter_words_by_frequency_and_length(IT_test_words)
NL_filtered_words = filter_words_by_frequency_and_length(NL_test_words)

def preprocess_words(test_data_sentences):
    preprocessed_words = []
    for sentence in test_data_sentences:
        words = re.sub(r'[^a-zA-Z\s]', '', sentence).lower().split()
        preprocessed_words.extend([re.sub(r'\d', 'D', word) for word in words])
    return preprocessed_words

def replace_infrequent_words(words, threshold=5):
    freq = Counter(words)
    return [word if freq[word] >= threshold else 'UNK' for word in words]

IT_processed_words = replace_infrequent_words(preprocess_words(IT_test_words))
NL_processed_words = replace_infrequent_words(preprocess_words(NL_test_words))

IT_frequent_words = filter_words_by_frequency_and_length(IT_processed_words)
NL_frequent_words = filter_words_by_frequency_and_length(NL_processed_words)


perplexities_data = []

for model, (ngram_size, training_data) in model_details.items():
    for lang, words in [('it', IT_filtered_words), ('nl', NL_filtered_words)]:
        for word in words:
            perplexity = compute_word_perplexity(model, word)
            perplexities_data.append({
                'lang': lang,
                'word': word,
                'ngram_size': ngram_size,  
                'training_data': training_data, 
                'perplexity': round(perplexity, 4)
            })


df_perplexities = pd.DataFrame(perplexities_data)

highest_perplexity = df_perplexities.loc[df_perplexities.groupby(['ngram_size', 'training_data', 'lang'])['perplexity'].idxmax()]
lowest_perplexity = df_perplexities.loc[df_perplexities.groupby(['ngram_size', 'training_data', 'lang'])['perplexity'].idxmin()]


In [66]:

highest_perplexity.to_csv(
    'ZhivkoParapanov_perplexities_max.csv',
     encoding='utf-8',
     index=False
)

lowest_perplexity.to_csv(
    'ZhivkoParapanov_perplexities_min.csv',
     encoding='utf-8',
     index=False
)

## Task 5

Answer questions in the separate markdown blocks below.

#### 5a
- a. compare LMs' perplexity on the English training sets, sentences and words, then explain the differences in perplexity considering what changes between the two training set-ups. (5 pts, 150 words)

Models trained on sentences instead of just random words, it gets better at predicting how English actually flows. This is probably because it learns how letters and word endings fit together in real sentences.

Longer models are even better at predicting normal English sentences. But when we train them on random words, they get way worse. This likely means longer models need more context to work well

#### 5b
- b. which LM trained on sentences generalizes better to unseen sentences in the same language, bigram or tetragram? explain why this is the case. (5 pts, 150 words)

The tetragram model trained on sentences generalizes better to unseen English sentences than the bigram model, as indicated by its significantly lower perplexity of 4.1747 compared to the bigram model's 8.2832. This lower perplexity shows that the tetragram model more accurately predicts the sequence of characters in new sentences.

The tetragram model considers four characters at a time, allowing it to capture more complex patterns and dependencies within the language. These longer contexts provide a more detailed understanding of the language structure, 

#### 5c
- c. compare LMs trained on English in their ability to fit Italian and Dutch sentences: which factor between ngram size and training corpus (words or sentences) affects perplexity the most? Explain why we observe this pattern. (4 pts, 100 words)

The data reveals that n-gram size affects perplexity more than the training corpus type when English-trained language models are applied to Italian and Dutch sentences. Tetragram models exhibit higher perplexity due to their sensitivity to English-specific syntactic and lexical features, which become mismatches in other languages. Bigram models, with shorter context windows, adapt better, showing lower perplexity. This pattern underscores how longer context lengths in models increase sensitivity to language-specific characteristics, impacting their performance on texts in different languages.

#### 5d
- d. what patterns can you identify in the words with the lowest perplexity in Dutch and Italian? (4 pts, 100 words)

The words "molto" in Italian and "geval" in Dutch, which exhibit the lowest perplexities across different models and training setups, suggest a pattern where common words with frequent usage in both languages are better predicted by the language models. These words likely occur with consistent patterns in the training data, making them easier for both bigram and tetragram models to predict accurately, regardless of whether the models were trained on words or sentences. This consistency in low perplexity scores, evident across both n-gram sizes and training types, suggests that frequently used words with stable usage contexts are universally easier for models to handle, reflecting their effective learning and generalization capabilities.

#### 5e
 - e. what patterns can you identify in the words with the highest perplexity in Dutch and Italian? (4 pts, 100 words)

The words with the highest perplexity in both Italian ("cristianodemocratici") and Dutch ("verkiezingswaarnemingsmissie") are notably long and complex, which contributes to their higher perplexity across all models. These terms, likely specific to political or formal contexts, are less frequent in everyday language, making them harder for the models to predict accurately. This is evidenced by their consistently high perplexity, regardless of whether the model is trained on sentences or words, or uses a bigram or tetragram approach. The complexity and specialized nature of these words challenge the models' ability to effectively learn and predict their patterns, highlighting the models' limitations with less common, context-specific vocabulary.