## (a) Warmup
As a warmup, write code to collect statistics about word frequencies in the two languages. Print the 10 most frequent words in each language.

If you're working with Python, using a `Counter` is probably the easiest solution.
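As a minimal sketch of how a `Counter` can be used for this task, the snippet below counts tokens in a tiny made-up token list (the data is hypothetical and only stands in for the real corpus). It shows that the relative-frequency estimate divides by the total number of tokens, and that an unseen word simply gets a count of 0.

```python
from collections import Counter

# Toy token list standing in for the corpus (hypothetical data).
tokens = "the house is the big house .".split()

counts = Counter(tokens)
total = sum(counts.values())        # total number of tokens, not distinct words

top_two = counts.most_common(2)     # the two most frequent tokens with their counts
p_the = counts["the"] / total       # relative-frequency estimate for a seen word
p_zebra = counts["zebra"] / total   # unseen word: count is 0, so the estimate is 0.0
```

Note that looking up a missing key in a `Counter` returns 0 rather than raising a `KeyError`, which is what makes the last line safe.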

Let's assume that we pick a word completely at random from the European Parliament proceedings. According to your estimate, what is the probability that it is *speaker*? What is the probability that it is *zebra*?

In [None]:
import numpy as np
import pandas as pd
import re
from collections import Counter, defaultdict
import random

In [None]:
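def extract_sent(file_path):
    # Despite its name, this returns a flat list of whitespace-separated
    # tokens (words and punctuation), not a list of sentences.
    tokens = []
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            # str.split() with no argument splits on any whitespace
            # and drops empty strings.
            tokens.extend(line.split())
    return tokens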

In [None]:
file_path_swe = "datasets/europarl-v7.sv-en.lc.sv"
swe_sentance = extract_sent(file_path_swe)
file_path_eng = "datasets/europarl-v7.sv-en.lc.en"
eng_sentance = extract_sent(file_path_eng)

In [None]:
counter_swe = Counter(swe_sentance)
most_common_swe = counter_swe.most_common(10)
for word, count in most_common_swe:
   print(f"{word}: {count}")

In [None]:
counter_eng = Counter(eng_sentance)
most_common_eng = counter_eng.most_common(10)
for word, count in most_common_eng:
   print(f"{word}: {count}")

In [None]:
total_words_eng = sum(counter_eng.values())  # total tokens, not distinct word types
speaker_counts = counter_eng['speaker']
zebra_counts = counter_eng['zebra']
the_counts = counter_eng['the']

prob_speaker = speaker_counts / total_words_eng
prob_zebra = zebra_counts / total_words_eng
prob_the = the_counts / total_words_eng

print("Probability of picking 'speaker' randomly:", prob_speaker)
print("Probability of picking 'zebra' randomly:", prob_zebra)
print("Probability of picking 'the' randomly:", prob_the)

## (b) Language modeling
We will now define a language model that lets us compute probabilities for individual English sentences.

Implement a bigram language model as described in the lecture, and use it to compute the probability of a short sentence.
What happens if you try to compute the probability of a sentence that contains a word that did not appear in the training texts? And what happens if your sentence is very long (e.g. 100 words or more)? Optionally, change your code so that it can handle these challenges.
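
For reference, a bigram model is commonly written as follows. This is the textbook form and an assumption about what the lecture uses; the lecture's version may differ, for example by adding sentence start and end markers. Here $c(\cdot)$ denotes a count in the training text.

$$
P(w_1, \dots, w_n) \approx P(w_1) \prod_{i=2}^{n} P(w_i \mid w_{i-1}),
\qquad
P(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}
$$

Both questions in the task follow from this form: a single bigram with $c(w_{i-1}, w_i) = 0$ makes the whole product 0, and a product of many factors below 1 shrinks towards 0 as $n$ grows.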

## Solution
1. If a bigram in the sentence never occurred in the training text (for example because one of its words is unseen), its estimated probability is 0, which makes the probability of the whole sentence 0. This does not reflect how likely the sentence is in real language use. There are several ways to address this; we use add-1 smoothing, which adds 1 to every bigram count before computing probabilities: P(w2|w1) = (c(w1, w2) + 1) / (c(w1) + V), where V is the vocabulary size. This ensures that even unseen word combinations get a non-zero probability.
2. For very long sentences, the product of many probabilities becomes extremely small and can underflow to 0 in floating-point arithmetic. To handle this, we can sum log probabilities instead of multiplying the probabilities directly, which keeps the computation numerically stable.
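
The two points above can be illustrated on a toy example. This is a sketch under stated assumptions, separate from the model class in this notebook: the training sentences are made up, and `add_one_prob` and `sentence_logprob` are hypothetical helper names.

```python
import math
from collections import Counter

# Toy training data (hypothetical); each sentence is a list of tokens.
training = [["i", "declare", "the", "session", "open"],
            ["i", "declare", "the", "vote", "closed"]]

unigrams = Counter(w for s in training for w in s)
bigrams = Counter(pair for s in training for pair in zip(s, s[1:]))
vocab_size = len(unigrams)

def add_one_prob(prev, word):
    # Add-1 estimate: (c(prev, word) + 1) / (c(prev) + V)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

def sentence_logprob(words):
    # Sum of logs instead of a product of probabilities.
    return sum(math.log(add_one_prob(a, b)) for a, b in zip(words, words[1:]))

seen = sentence_logprob(["i", "declare", "the", "session"])
unseen = sentence_logprob(["i", "declare", "the", "zebra"])  # 'zebra' never seen

# A 2000-word sentence underflows as a plain product but not as a log sum.
long_sentence = ["i", "declare"] * 1000
product = math.prod(add_one_prob(a, b)
                    for a, b in zip(long_sentence, long_sentence[1:]))
log_sum = sentence_logprob(long_sentence)
```

With smoothing, the sentence containing the unseen word gets a lower but still finite log probability. For the long sentence, `product` collapses to `0.0` while `log_sum` remains a usable finite number.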

In [None]:
class bigramModel:
    def __init__(self, corpus) -> None:
        self.corpus = corpus
        self.vocab = set()
        self.bigram_counts = Counter()
        self.unigram_counts = Counter()
        self.bigram_prob = {}
        self.build_model()

    def build_model(self):
        sentences = []
        sentence = []

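        # The corpus is a flat token list; '.' marks the end of a sentence.
        for word in self.corpus:
            if word == '.':
                sentences.append(sentence)
                sentence = []
            else:
                sentence.append(word)
                self.vocab.add(word)
                self.unigram_counts[word] += 1
        if sentence:  # keep a trailing sentence that has no final '.'
            sentences.append(sentence)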

        for sentence in sentences:
            for i in range(len(sentence) - 1):
                bigram = tuple(sentence[i:i+2])
                self.bigram_counts[bigram] += 1

        for bigram, count in self.bigram_counts.items():
            self.bigram_prob[bigram] = count / self.unigram_counts[bigram[0]]

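    def sentence_prob(self, sentence):
        # Exponentiate the log probability. Very long sentences can still
        # underflow to 0.0 here, so prefer sentence_logprob for those.
        return float(np.exp(self.sentence_logprob(sentence)))

    def sentence_logprob(self, sentence):
        # Add-1 smoothing on every bigram: (c(w1, w2) + 1) / (c(w1) + V).
        # Summing logs avoids the underflow of a long product.
        words = sentence.split()
        vocab_size = len(self.vocab)
        log_prob = 0.0
        for w1, w2 in zip(words, words[1:]):
            smoothed = (self.bigram_counts[(w1, w2)] + 1) / (self.unigram_counts[w1] + vocab_size)
            log_prob += np.log(smoothed)
        return float(log_prob)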

In [None]:
model = bigramModel(eng_sentance)
sentence = "i declare"
probability = model.sentence_prob(sentence)
print("Probability of sentence '{}': {}".format(sentence, probability))

## (c) Translation modeling
We will now estimate the parameters of the translation model P(f|e).

Self-check: if our goal is to translate from some language into English, why does our conditional probability seem to be written backwards? Why don't we estimate P(e|f) instead?
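
One common way to approach this self-check is the noisy-channel argument, sketched here as an assumption about the intended answer rather than as the official solution. For a fixed foreign sentence $f$, Bayes' rule gives

$$
\hat{e} = \arg\max_{e} P(e \mid f)
        = \arg\max_{e} \frac{P(f \mid e)\, P(e)}{P(f)}
        = \arg\max_{e} P(f \mid e)\, P(e)
$$

The denominator $P(f)$ does not depend on $e$ and can be dropped. The search for the best English sentence then combines the translation model $P(f \mid e)$ with the language model $P(e)$ from part (b).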

Write code that implements the estimation algorithm for IBM model 1. Then print, for either Swedish, German, or French, the 10 words that the English word european is most likely to be translated into, according to your estimate. It can be interesting to look at this list of 10 words and see how it changes during the EM iterations.
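
Before running on the full corpus, it can help to check the estimation loop on a tiny example. The following is a minimal sketch of EM for IBM model 1 on a made-up three-sentence corpus; the data and the names `pairs`, `t` and `top_translations` are hypothetical, and the NULL word is left out for brevity.

```python
from collections import defaultdict

# Toy sentence-aligned corpus (hypothetical data): (English, foreign) pairs.
pairs = [("the house".split(), "das haus".split()),
         ("the book".split(), "das buch".split()),
         ("a book".split(), "ein buch".split())]

foreign_vocab = {f for _, fs in pairs for f in fs}
# t[(f, e)] approximates P(f|e); start uniform over the foreign vocabulary.
t = defaultdict(lambda: 1.0 / len(foreign_vocab))

for _ in range(20):
    count = defaultdict(float)  # expected number of times f is aligned to e
    total = defaultdict(float)  # expected number of alignments involving e
    # E-step: share each foreign word among the English words of its
    # sentence, in proportion to the current estimate of P(f|e).
    for es, fs in pairs:
        for f in fs:
            norm = sum(t[(f, e)] for e in es)
            for e in es:
                delta = t[(f, e)] / norm
                count[(f, e)] += delta
                total[e] += delta
    # M-step: re-estimate P(f|e) from the expected counts.
    for (f, e), c in count.items():
        t[(f, e)] = c / total[e]

def top_translations(e, n=10):
    # Foreign words ranked by the current estimate of P(f|e).
    scored = [(f, p) for (f, e2), p in t.items() if e2 == e]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:n]

best_for_book = top_translations("book")[0][0]
```

Printing `top_translations("book")` inside the loop shows how the ranking sharpens over the EM iterations, which is the same effect the task suggests observing for *european*.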


In [None]:
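class IBMModel1:
    def __init__(self, english_corpus, foreign_corpus, num_iterations=10):
        # Both corpora are lists of token lists, aligned sentence by sentence.
        self.english_corpus = english_corpus
        self.foreign_corpus = foreign_corpus
        self.num_iterations = num_iterations
        # translation_probs[e][f] holds the current estimate of P(f|e).
        self.translation_probs = defaultdict(dict)
        self.initialize_translation_probs()

    def initialize_translation_probs(self):
        # Give every co-occurring pair the same starting weight; the first
        # E-step normalizes, so the absolute value does not matter.
        for foreign_words, english_words in zip(self.foreign_corpus, self.english_corpus):
            for english_word in english_words:
                for foreign_word in foreign_words:
                    self.translation_probs[english_word][foreign_word] = 1.0

    def train(self):
        for _ in range(self.num_iterations):
            pair_counts = defaultdict(lambda: defaultdict(float))
            # E-step: share each foreign word among the English words of its
            # sentence, in proportion to the current estimate of P(f|e).
            for foreign_words, english_words in zip(self.foreign_corpus, self.english_corpus):
                for f in foreign_words:
                    norm = sum(self.translation_probs[e][f] for e in english_words)
                    for e in english_words:
                        pair_counts[e][f] += self.translation_probs[e][f] / norm
            # M-step: renormalize the expected counts for each English word.
            for e, counts in pair_counts.items():
                total = sum(counts.values())
                for f, c in counts.items():
                    self.translation_probs[e][f] = c / total

    def top_translations(self, english_word, n=10):
        candidates = self.translation_probs.get(english_word, {})
        return sorted(candidates.items(), key=lambda item: item[1], reverse=True)[:n]


def read_sentences(file_path):
    # Unlike extract_sent, keep one token list per line so that line i of the
    # English file stays aligned with line i of the Swedish file.
    with open(file_path, 'r', encoding='utf-8') as file:
        return [line.split() for line in file]

# Preprocess English and foreign text: skip pairs where either side is empty
# and prepend NULL to each English sentence, so that foreign words without an
# English counterpart have something to align to.
# Training on all sentence pairs is slow; a subset may be enough to experiment.
english_corpus_preprocessed = []
foreign_corpus_preprocessed = []
for english_words, foreign_words in zip(read_sentences(file_path_eng), read_sentences(file_path_swe)):
    if english_words and foreign_words:
        english_corpus_preprocessed.append(['NULL'] + english_words)
        foreign_corpus_preprocessed.append(foreign_words)

# Initialize and train the IBM model
ibm = IBMModel1(english_corpus_preprocessed, foreign_corpus_preprocessed)
ibm.train()
for word, prob in ibm.top_translations('european'):
    print(f"{word}: {prob:.4f}")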