## (a) Warmup
As a warmup, write code to collect statistics about word frequencies in the two languages. Print the 10 most frequent words in each language.

If you're working with Python, using a `Counter` is probably the easiest solution.

Let's assume that we pick a word completely at random from the European Parliament proceedings. According to your estimate, what is the probability that it is *speaker*? What is the probability that it is *zebra*?

In [1]:
import numpy as np
import pandas as pd
import re
from collections import Counter

In [2]:
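def extract_sent(file_path):
    # Returns a flat list of whitespace-separated tokens (words and
    # punctuation) from the whole file, not a list of sentences.
    sent = []
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            tokens = re.split(r'\s+', line.strip())
            for token in tokens:
                if token:  # skip the empty string produced by a blank line
                    sent.append(token)

    return sent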

In [3]:
file_path_swe = "datasets/europarl-v7.sv-en.lc.sv"
swe_sentance = extract_sent(file_path_swe)
file_path_eng = "datasets/europarl-v7.sv-en.lc.en"
eng_sentance = extract_sent(file_path_eng)

In [4]:
counter_swe = Counter(swe_sentance)
most_common_swe = counter_swe.most_common(10)
for word, count in most_common_swe:
    print(f"{word}: {count}")

.: 9648
att: 9181
,: 8876
och: 7038
i: 5949
det: 5687
som: 5028
för: 4959
av: 4013
är: 3840


In [5]:
counter_eng = Counter(eng_sentance)
most_common_eng = counter_eng.most_common(10)
for word, count in most_common_eng:
    print(f"{word}: {count}")

the: 19322
,: 13514
.: 9774
of: 9312
to: 8801
and: 6946
in: 6090
is: 4400
that: 4357
a: 4269


In [6]:
# Total number of tokens in the corpus. len(counter_eng) would give the
# number of distinct words instead, and the ratios below could exceed 1.
total_words_eng = sum(counter_eng.values())
speaker_counts = counter_eng['speaker']
zebra_counts = counter_eng['zebra']  # 'zebra' never occurs, so its estimate is 0
the_counts = counter_eng['the']

prob_speaker = speaker_counts / total_words_eng
prob_zebra = zebra_counts / total_words_eng
prob_the = the_counts / total_words_eng

print("Probability of picking 'speaker' randomly:", prob_speaker)
print("Probability of picking 'zebra' randomly:", prob_zebra)
print("Probability of picking 'the' randomly:", prob_the)
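
As a small illustration of the relative-frequency estimate, here is a minimal sketch on a toy corpus. The token list and the helper name `unigram_prob` are made up for this example and are not part of the Europarl data. The point it shows is that the denominator is the total number of tokens, so the estimates over the vocabulary sum to 1.

```python
from collections import Counter

# Toy corpus (hypothetical, not the Europarl data).
tokens = "the speaker said that the house is the place".split()

counts = Counter(tokens)
total_tokens = sum(counts.values())  # number of tokens, not distinct words


def unigram_prob(word):
    """Relative-frequency estimate: count(word) / total number of tokens."""
    return counts[word] / total_tokens


print(unigram_prob("the"))    # 'the' is 3 of the 9 tokens
print(unigram_prob("zebra"))  # a word that never occurs gets probability 0
print(sum(unigram_prob(w) for w in counts))  # estimates sum to 1 (up to rounding)
```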


## (b) Language modeling
We will now define a language model that lets us compute probabilities for individual English sentences.

Implement a bigram language model as described in the lecture, and use it to compute the probability of a short sentence.
What happens if you try to compute the probability of a sentence that contains a word that did not appear in the training texts? And what happens if your sentence is very long (e.g. 100 words or more)? Optionally, change your code so that it can handle these challenges.

## Solution
1. If a sentence contains a word that did not appear in the training text, the corresponding bigram probability is 0, and so is the probability of the whole sentence. This does not reflect how likely the sentence is in real language use. There are several ways to solve this; we will use Laplace (add-one) smoothing. Instead of assigning probability 0 to an unseen bigram, we add a pseudo-count to each possible event, ensuring that even unseen bigrams have a non-zero probability.
2. For very long sentences, the product of many probabilities becomes extremely small and can underflow to 0 in floating-point arithmetic. To handle this, we will sum log probabilities instead of multiplying probabilities directly, which keeps the computation numerically stable.
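
The underflow problem in point 2 can be seen in a few lines. This is a minimal sketch in which the per-bigram probability `p` and the sentence length `n` are made-up values chosen only to trigger the effect.

```python
import math

p = 0.01  # hypothetical probability of every bigram in the sentence
n = 200   # hypothetical number of bigrams in a long sentence

product = 1.0
log_sum = 0.0
for _ in range(n):
    product *= p            # multiplying probabilities directly
    log_sum += math.log(p)  # summing log probabilities instead

print(product)  # 0.0: the true value 1e-400 is too small for a float
print(log_sum)  # a finite negative number that can still be compared
```

Because the logarithm is monotonic, comparing two sentences by their summed log probabilities gives the same ranking as comparing their (unrepresentable) products.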

In [12]:
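def train_translation_model(sentences):
    # Note: this trains the bigram language model of part (b);
    # `sentences` is a flat list of tokens.

    # count all unigrams and bigrams from the text
    unigram_counts = Counter(sentences)
    bigram_counts = Counter(zip(sentences, sentences[1:]))

    # Laplace-smoothed log probabilities: log((c(w1, w2) + 1) / (c(w1) + V))
    bigram_prob = {}
    size = len(unigram_counts)  # vocabulary size V
    for bigram, count in bigram_counts.items():
        prev_word = bigram[0]
        # look up c(w1) in the Counter rather than rescanning the list with list.count()
        bigram_prob[bigram] = np.log((count + 1) / (unigram_counts[prev_word] + size))

    return bigram_prob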

In [11]:
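def sentance_log_prob(sentance, bigram_prob):
    # sentance: list of tokens; bigram_prob: dict mapping bigrams to log probabilities
    log_prob = 0.0
    for i in range(len(sentance) - 1):
        bigram = (sentance[i], sentance[i + 1])
        # bigrams not seen in training fall back to a fixed floor of log(1e-10)
        log_prob += bigram_prob.get(bigram, np.log(1e-10))
    return log_prob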
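
The scoring function above uses a fixed floor for bigrams that were not seen in training. An alternative, shown here only as a sketch, is to apply the add-one formula at lookup time so that seen and unseen bigrams are treated consistently. The function name `laplace_log_prob`, its parameters, and the toy training text are assumptions made for this example. Note that a word outside the training vocabulary is not counted in `vocab_size` here; a fuller treatment would reserve a slot for unknown words.

```python
import math
from collections import Counter


def laplace_log_prob(tokens, unigram_counts, bigram_counts, vocab_size):
    """Sum of add-one smoothed log P(word | previous word) over a token list."""
    log_prob = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        numerator = bigram_counts[(prev, word)] + 1      # Counter gives 0 if unseen
        denominator = unigram_counts[prev] + vocab_size  # c(prev) + V
        log_prob += math.log(numerator / denominator)
    return log_prob


# Toy training text (hypothetical).
train = "the house is red . the house is big .".split()
unigrams = Counter(train)
bigrams = Counter(zip(train, train[1:]))
vocab = len(unigrams)

seen = laplace_log_prob("the house is red".split(), unigrams, bigrams, vocab)
unseen = laplace_log_prob("the zebra is red".split(), unigrams, bigrams, vocab)
print(seen, unseen)  # both finite; the sentence with unseen bigrams scores lower
```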


In [15]:
bigram_probabilities = train_translation_model(eng_sentance)

## (c) Translation modeling
We will now estimate the parameters of the translation model P(f|e).

Self-check: if our goal is to translate from some language into English, why does our conditional probability seem to be written backwards? Why don't we estimate P(e|f) instead?
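
One way to approach the self-check is through Bayes' rule, in the noisy-channel view of translation. The notation $\hat{e}$ for the chosen English sentence is an assumption introduced here; P(e) is the language model from part (b) and P(f|e) is the translation model.

$$
\hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} \frac{P(f \mid e)\, P(e)}{P(f)} = \arg\max_{e} P(f \mid e)\, P(e)
$$

The denominator P(f) does not depend on e, so it can be dropped from the maximization. Splitting the problem this way lets the language model take care of fluent English while the translation model only has to relate words across the two languages.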

Write code that implements the estimation algorithm for IBM model 1. Then print, for either Swedish, German, or French, the 10 words that the English word european is most likely to be translated into, according to your estimate. It can be interesting to look at this list of 10 words and see how it changes during the EM iterations.
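
For orientation, here is a minimal sketch of the EM estimation for IBM Model 1 on a three-sentence toy corpus. It is not the notebook's own implementation: the function name `ibm1_sketch`, the `pairs` format, and the toy sentences are assumptions, the translation table is initialized uniformly, and the NULL word is left out to keep the example short.

```python
from collections import defaultdict


def ibm1_sketch(pairs, iterations=10):
    """Estimate t[(f, e)] ~ P(f | e) from (foreign_tokens, english_tokens) pairs."""
    f_vocab = {f for f_sent, _ in pairs for f in f_sent}
    # Uniform initialization of t(f | e).
    t = defaultdict(lambda: 1.0 / len(f_vocab))

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of f being aligned to e
        total = defaultdict(float)  # expected count of e being aligned at all

        # E-step: distribute each foreign word over the English words
        # of its sentence, in proportion to the current t(f | e).
        for f_sent, e_sent in pairs:
            for f in f_sent:
                norm = sum(t[(f, e)] for e in e_sent)
                for e in e_sent:
                    delta = t[(f, e)] / norm
                    count[(f, e)] += delta
                    total[e] += delta

        # M-step: renormalize the expected counts per English word.
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t


# Toy parallel corpus (hypothetical).
pairs = [
    ("das haus".split(), "the house".split()),
    ("das buch".split(), "the book".split()),
    ("ein buch".split(), "a book".split()),
]
t = ibm1_sketch(pairs, iterations=10)

# Most likely translations of one English word, highest probability first.
best = sorted(((p, f) for (f, e), p in t.items() if e == "house"), reverse=True)
print(best[:10])
```

Calling the function with increasing values of `iterations` shows how the ranking sharpens as EM proceeds.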


In [None]:
def initialize(source, target):
    return {(f, e): np.random.rand() for f in source for e in target}

In [None]:
def ibm_model(source, target, iterations):
    