# Group 5 - Module 5: Natural Language Processing

***
### Group Members:
* **Nils Dunlop, 20010127-2359, Applied Data Science, e-mail: gusdunlni@student.gu.se (16 hours)**
* **Francisco Erazo, 19930613-9214, Applied Data Science, e-mail: guserafr@student.gu.se (16 hours)**

#### **We hereby declare that we have both actively participated in solving every exercise. All solutions are entirely our own work, without having taken part of other solutions. (This is independent and additional to any declaration that you may encounter in the electronic submission system.)**

# Assignment 5
***

## Problem 1: Reading and Reflection
***
### **(A) AI Problems with a Wide Range of Approaches:**

Several AI problems have seen a similarly wide array of approaches, reflecting the many computational techniques utilized over the decades. Some notable examples include:
- **Speech Recognition:** From early phonetic-based systems to modern deep learning approaches, speech recognition has evolved significantly. Early systems relied on simple pattern matching and hand-written rules to understand spoken language. Over time, statistical models like Hidden Markov Models (HMMs) played a crucial role, and deep neural networks, including recurrent neural networks (RNNs), convolutional neural networks (CNNs), and more recently Transformer-based models, now dominate the field.
   
- **Computer Vision:** The evolution of computer vision techniques from simple edge detection algorithms and feature extraction methods to sophisticated deep learning models is another example. Techniques have ranged from geometric model fitting and template matching to the use of deep convolutional neural networks (CNNs) for tasks such as image classification, object detection, and semantic segmentation.
   
- **Game Playing:** The development of AI for game playing, from chess and checkers to complex video games, has seen strategies evolve from brute-force search algorithms and heuristic-based approaches to the use of machine learning techniques, including reinforcement learning and deep learning for strategy optimization.

### **(B) Similarities Between Rule-based Translation Systems and Neural Systems:**

Despite their differences, there are notable similarities between some aspects of rule-based translation systems and state-of-the-art neural systems:

- **Structured Knowledge:** Rule-based systems are explicitly programmed with linguistic rules and dictionaries. Neural systems, particularly those employing attention mechanisms, implicitly learn to encode structured knowledge about languages through training on large datasets, creating internal representations that mirror some aspects of linguistic rules.

- **Context Handling:** Both systems attempt to handle context in translation, albeit in radically different ways. Rule-based systems may use context rules to choose the correct translation of a word based on surrounding words. Neural systems, especially those using mechanisms like attention, are able to consider broader context in a more fluid and dynamic manner to understand the appropriate meaning.
   
- **Error Correction:** Both systems have mechanisms for error correction, although neural systems do this implicitly. Rule-based systems may use post-processing rules to correct common mistakes, while neural systems learn error patterns and their corrections during training.

### **(C) Situations Favoring Rule-based Solutions Over Modern Approaches:**

There are scenarios where a rule-based solution might be preferable to a modern neural or statistical approach:

- **Limited Data Scenarios:** For languages or specialized domains where there is a scarcity of training data, rule-based systems can be more practical and effective because they do not require large datasets to perform reasonably well.
   
- **Predictability and Transparency:** In situations where predictability, transparency, and the ability to audit or explain decisions are crucial (e.g., in some legal or regulatory contexts), rule-based systems offer clear advantages. Their decisions are based on explicit rules that can be examined, understood, and modified by humans.

- **Real-time Constraints:** In cases where computational resources are limited or real-time performance is essential, the relative simplicity and lower computational requirements of rule-based systems can be an advantage over more resource-intensive neural models.

## Problem 2: Implementation
***

### (A) Warmup: Word Frequencies
***

In [1]:
import pandas as pd
from collections import Counter
import re
import numpy as np
from collections import defaultdict
import time

# Load datasets into DataFrames.
# Each file holds one sentence per line, so the lines are read directly;
# recent pandas versions no longer accept sep='\n' or squeeze= in pd.read_csv.
def read_lines(path):
    with open(path, encoding='utf-8') as f:
        return [line.rstrip('\n') for line in f]

de_en_df = pd.DataFrame({
    'German': read_lines("dat410_europarl/europarl-v7.de-en.lc.de"),
    'English_DE': read_lines("dat410_europarl/europarl-v7.de-en.lc.en")
})

fr_en_df = pd.DataFrame({
    'French': read_lines("dat410_europarl/europarl-v7.fr-en.lc.fr"),
    'English_FR': read_lines("dat410_europarl/europarl-v7.fr-en.lc.en")
})

sv_en_df = pd.DataFrame({
    'Swedish': read_lines("dat410_europarl/europarl-v7.sv-en.lc.sv"),
    'English_SV': read_lines("dat410_europarl/europarl-v7.sv-en.lc.en")
})

# Display the first few rows of the German-English dataset
print(de_en_df.head())

                                              German  \
0  ich erkläre die am freitag , dem 17. dezember ...   
1  wie sie feststellen konnten , ist der gefürcht...   
2  im parlament besteht der wunsch nach einer aus...   
3  heute möchte ich sie bitten - das ist auch der...   
4  wie sie sicher aus der presse und dem fernsehe...   

                                          English_DE  
0  i declare resumed the session of the european ...  
1  although , as you will have seen , the dreaded...  
2  you have requested a debate on this subject in...  
3  in the meantime , i should like to observe a m...  
4  you will be aware from the press and televisio...  


In [2]:
# Display the first few rows of the French-English dataset
print(fr_en_df.head())

                                              French  \
0  je déclare reprise la session du parlement eur...   
1  comme vous avez pu le constater , le grand &qu...   
2  vous avez souhaité un débat à ce sujet dans le...   
3  en attendant , je souhaiterais , comme un cert...   
4  je vous invite à vous lever pour cette minute ...   

                                          English_FR  
0  i declare resumed the session of the european ...  
1  although , as you will have seen , the dreaded...  
2  you have requested a debate on this subject in...  
3  in the meantime , i should like to observe a m...  
4  please rise , then , for this minute &apos; s ...  


In [3]:
# Display the first few rows of the Swedish-English dataset
print(sv_en_df.head())

                                             Swedish  \
0  jag förklarar europaparlamentets session återu...   
1  som ni kunnat konstatera ägde &quot; den stora...   
2  ni har begärt en debatt i ämnet under sammantr...   
3  till dess vill jag att vi , som ett antal koll...   
4             jag ber er resa er för en tyst minut .   

                                          English_SV  
0  i declare resumed the session of the european ...  
1  although , as you will have seen , the dreaded...  
2  you have requested a debate on this subject in...  
3  in the meantime , i should like to observe a m...  
4  please rise , then , for this minute &apos; s ...  


In [4]:
# Function to calculate word frequencies
def calculate_word_frequencies(df_column):
    word_counts = Counter()
    for text in df_column:
        # Keep only runs of word characters, dropping punctuation.
        # HTML entities such as &apos; are not decoded first, so they yield tokens like 'apos'.
        words = re.findall(r'\b\w+\b', text.lower())
        word_counts.update(words)
    return word_counts

In [5]:
# Calculate word frequencies for each language
de_word_counts = calculate_word_frequencies(de_en_df['German'])
en_de_word_counts = calculate_word_frequencies(de_en_df['English_DE'])

fr_word_counts = calculate_word_frequencies(fr_en_df['French'])
en_fr_word_counts = calculate_word_frequencies(fr_en_df['English_FR'])

sv_word_counts = calculate_word_frequencies(sv_en_df['Swedish'])
en_sv_word_counts = calculate_word_frequencies(sv_en_df['English_SV'])

In [6]:
# Function to print most common words
def print_most_common(word_counts, language, num=10):
    print(f"Most common words in {language}:")
    for word, count in word_counts.most_common(num):
        print(f"{word}: {count}")
    print("\n")

In [7]:
# Print the 10 most frequent words for each language
print_most_common(de_word_counts, "German")
print_most_common(en_de_word_counts, "English (DE)")

print_most_common(fr_word_counts, "French")
print_most_common(en_fr_word_counts, "English (FR)")

print_most_common(sv_word_counts, "Swedish")
print_most_common(en_sv_word_counts, "English (SV)")

Most common words in German:
die: 10521
der: 9374
und: 7028
in: 4175
zu: 3169
den: 2976
wir: 2863
daß: 2738
ich: 2670
das: 2669


Most common words in English (DE):
the: 19853
of: 9633
to: 9069
and: 7307
in: 6278
is: 4478
that: 4441
a: 4438
we: 3372
this: 3362


Most common words in French:
apos: 16729
de: 14528
la: 9746
et: 6620
l: 6536
le: 6177
à: 5588
les: 5587
des: 5232
que: 4797


Most common words in English (FR):
the: 19627
of: 9534
to: 8992
and: 7214
in: 6197
is: 4453
that: 4421
a: 4388
we: 3341
this: 3332


Most common words in Swedish:
att: 9181
och: 7038
i: 5954
det: 5687
som: 5028
för: 4959
av: 4013
är: 3840
en: 3724
vi: 3211


Most common words in English (SV):
the: 19327
of: 9344
to: 8814
and: 6949
in: 6124
is: 4400
that: 4357
a: 4271
we: 3223
this: 3222
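
The top French "word", `apos`, is not a word: the corpus encodes apostrophes and quotes as HTML entities (`&apos;`, `&quot;`), and the regular expression keeps the entity name as a token. A minimal sketch of one way to avoid this, assuming the only markup in the files is standard HTML entities, is to decode them before tokenizing. The names `tokenize` and `calculate_word_frequencies_unescaped` are introduced here for illustration and are not part of the notebook above.

```python
import html
import re
from collections import Counter

def tokenize(text):
    """Decode HTML entities, then extract lowercase word tokens."""
    decoded = html.unescape(text)  # "&apos;" becomes "'", "&quot;" becomes '"'
    return re.findall(r"\b\w+\b", decoded.lower())

def calculate_word_frequencies_unescaped(sentences):
    """Count word tokens over an iterable of sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(tokenize(sentence))
    return counts

# Two made-up lines in the style of the corpus
sample = ["l &apos; union européenne", "le grand &quot; bogue &quot;"]
print(calculate_word_frequencies_unescaped(sample).most_common(3))
```

With this change the entity names no longer appear in the counts, so the French ranking would be led by real words. The counts printed above were produced without this step.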



In [8]:
# Calculate and print probabilities for 'speaker' and 'zebra' in each language
def print_probabilities(word_counts, language, total_words):
    for word in ['speaker', 'zebra']:
        prob = word_counts[word] / total_words if word in word_counts else 0
        print(f"Probability of '{word}' in {language}: {prob:.6f}")

total_de_words = sum(de_word_counts.values())
total_en_de_words = sum(en_de_word_counts.values())
total_fr_words = sum(fr_word_counts.values())
total_en_fr_words = sum(en_fr_word_counts.values())
total_sv_words = sum(sv_word_counts.values())
total_en_sv_words = sum(en_sv_word_counts.values())

print_probabilities(de_word_counts, "German", total_de_words)
print_probabilities(en_de_word_counts, "English (DE)", total_en_de_words)
print_probabilities(fr_word_counts, "French", total_fr_words)
print_probabilities(en_fr_word_counts, "English (FR)", total_en_fr_words)
print_probabilities(sv_word_counts, "Swedish", total_sv_words)
print_probabilities(en_sv_word_counts, "English (SV)", total_en_sv_words)

Probability of 'speaker' in German: 0.000000
Probability of 'zebra' in German: 0.000000
Probability of 'speaker' in English (DE): 0.000042
Probability of 'zebra' in English (DE): 0.000000
Probability of 'speaker' in French: 0.000000
Probability of 'zebra' in French: 0.000000
Probability of 'speaker' in English (FR): 0.000046
Probability of 'zebra' in English (FR): 0.000000
Probability of 'speaker' in Swedish: 0.000000
Probability of 'zebra' in Swedish: 0.000000
Probability of 'speaker' in English (SV): 0.000039
Probability of 'zebra' in English (SV): 0.000000


### (B) Language Modeling
***

In [9]:
# Tokenize the text and add start and end tokens
tokenized_text_de = [('<START>',) + tuple(re.findall(r'\b\w+\b', sentence.lower())) + ('<END>',) for sentence in de_en_df['English_DE']]
tokenized_text_fr = [('<START>',) + tuple(re.findall(r'\b\w+\b', sentence.lower())) + ('<END>',) for sentence in fr_en_df['English_FR']]
tokenized_text_sv = [('<START>',) + tuple(re.findall(r'\b\w+\b', sentence.lower())) + ('<END>',) for sentence in sv_en_df['English_SV']]

# Count individual words and bigrams
word_counts_de = Counter()
bigram_counts_de = Counter()
word_counts_fr = Counter()
bigram_counts_fr = Counter()
word_counts_sv = Counter()
bigram_counts_sv = Counter()

for sentence in tokenized_text_de:
    word_counts_de.update(sentence)
    bigram_counts_de.update(zip(sentence[:-1], sentence[1:]))

for sentence in tokenized_text_fr:
    word_counts_fr.update(sentence)
    bigram_counts_fr.update(zip(sentence[:-1], sentence[1:]))

for sentence in tokenized_text_sv:
    word_counts_sv.update(sentence)
    bigram_counts_sv.update(zip(sentence[:-1], sentence[1:]))

In [10]:
# Compute the probability of a sentence using a bigram model with Laplace (add-one) smoothing.
# vocab_size is the number of distinct tokens (the calls below pass len(word_counts)), not the total token count.
def bigram_sentence_probability(sentence, word_counts, bigram_counts, vocab_size, smoothing=1):
    sentence = ('<START>',) + tuple(re.findall(r'\b\w+\b', sentence.lower())) + ('<END>',)
    bigram_probs = []

    for first_word, second_word in zip(sentence[:-1], sentence[1:]):
        bigram_count = bigram_counts[(first_word, second_word)]
        word_count = word_counts[first_word]
        # Laplace smoothing; the extra +1 reserves probability mass for unseen (OOV) words
        prob = (bigram_count + smoothing) / (word_count + smoothing * (vocab_size + 1))
        bigram_probs.append(prob)

    # Sum log-probabilities instead of multiplying raw probabilities
    log_probs = np.log(bigram_probs)
    log_prob_sentence = np.sum(log_probs)

    # Exponentiating gives an ordinary probability, which can still underflow for very long sentences
    return np.exp(log_prob_sentence)

# Test the bigram_sentence_probability for the German-English dataset
de_sentence = "The economic impact of the legislation was significant."
de_prob = bigram_sentence_probability(de_sentence, word_counts_de, bigram_counts_de, len(word_counts_de))
print(f"Probability of the sentence '{de_sentence}' in German-English data: {de_prob}")

# Test the bigram_sentence_probability for the French-English dataset
fr_sentence = "Diplomatic efforts were intensified to resolve the conflict."
fr_prob = bigram_sentence_probability(fr_sentence, word_counts_fr, bigram_counts_fr, len(word_counts_fr))
print(f"Probability of the sentence '{fr_sentence}' in French-English data: {fr_prob}")

# Test the bigram_sentence_probability for the Swedish-English dataset
sv_sentence = "Environmental policies have become increasingly important."
sv_prob = bigram_sentence_probability(sv_sentence, word_counts_sv, bigram_counts_sv, len(word_counts_sv))
print(f"Probability of the sentence '{sv_sentence}' in Swedish-English data: {sv_prob}")

# OOV word test
oov_sentence = "this is a quixotic test sentence"
oov_prob_de = bigram_sentence_probability(oov_sentence, word_counts_de, bigram_counts_de, len(word_counts_de))
oov_prob_fr = bigram_sentence_probability(oov_sentence, word_counts_fr, bigram_counts_fr, len(word_counts_fr))
oov_prob_sv = bigram_sentence_probability(oov_sentence, word_counts_sv, bigram_counts_sv, len(word_counts_sv))
print(f"Probability of the sentence with an OOV word '{oov_sentence}' in German-English data: {oov_prob_de}")
print(f"Probability of the sentence with an OOV word '{oov_sentence}' in French-English data: {oov_prob_fr}")
print(f"Probability of the sentence with an OOV word '{oov_sentence}' in Swedish-English data: {oov_prob_sv}")

Probability of the sentence 'The economic impact of the legislation was significant.' in German-English data: 5.452363580933155e-28
Probability of the sentence 'Diplomatic efforts were intensified to resolve the conflict.' in French-English data: 5.274458597849841e-34
Probability of the sentence 'Environmental policies have become increasingly important.' in Swedish-English data: 1.381206394015978e-26
Probability of the sentence with an OOV word 'this is a quixotic test sentence' in German-English data: 1.0607721443308697e-21
Probability of the sentence with an OOV word 'this is a quixotic test sentence' in French-English data: 1.0477026424612885e-21
Probability of the sentence with an OOV word 'this is a quixotic test sentence' in Swedish-English data: 1.0820221311118906e-21
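
The raw probabilities above are hard to compare because shorter sentences score higher almost by construction: the OOV sentence has fewer bigrams than the other test sentences and ends up with the largest probability. A possible alternative, sketched here on a made-up toy corpus, is to stay in log space and also report the average log-probability per bigram. The function `bigram_log_probability` and the toy data are assumptions for illustration; the smoothing formula is the same as in the cell above.

```python
import math
import re
from collections import Counter

def tokenize(sentence):
    """Lowercase word tokens wrapped in sentence boundary markers."""
    return ('<START>',) + tuple(re.findall(r'\b\w+\b', sentence.lower())) + ('<END>',)

def bigram_log_probability(sentence, word_counts, bigram_counts, vocab_size, smoothing=1):
    """Return (log P(sentence), number of bigrams) under add-k smoothing."""
    tokens = tokenize(sentence)
    log_prob = 0.0
    for first, second in zip(tokens[:-1], tokens[1:]):
        numerator = bigram_counts[(first, second)] + smoothing
        denominator = word_counts[first] + smoothing * (vocab_size + 1)
        log_prob += math.log(numerator / denominator)
    return log_prob, len(tokens) - 1

# Toy corpus, invented for this example
corpus = ["the house is small", "the house is big", "the garden is big"]
word_counts, bigram_counts = Counter(), Counter()
for s in corpus:
    tokens = tokenize(s)
    word_counts.update(tokens)
    bigram_counts.update(zip(tokens[:-1], tokens[1:]))

for s in ["the house is big", "big is house the"]:
    lp, n = bigram_log_probability(s, word_counts, bigram_counts, len(word_counts))
    print(f"{s!r}: log P = {lp:.3f}, per-bigram = {lp / n:.3f}")
```

Returning the log-probability avoids the final `np.exp`, so very long sentences cannot underflow to zero, and dividing by the number of bigrams gives a length-normalized score.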


### (C) Translation Model
***

In [11]:
def create_parallel_sentences(df_en, df_foreign):
    return list(zip(df_en, df_foreign))

# Create parallel corpora
de_parallel_sentences = create_parallel_sentences(de_en_df['English_DE'], de_en_df['German'])
fr_parallel_sentences = create_parallel_sentences(fr_en_df['English_FR'], fr_en_df['French'])
sv_parallel_sentences = create_parallel_sentences(sv_en_df['English_SV'], sv_en_df['Swedish'])

# Select which language pair to use for the IBM Model 1
parallel_sentences = de_parallel_sentences

# Initialize translation probabilities uniformly
def initialize_translation_probabilities(parallel_sentences):
    all_en_words = set(word for en, _ in parallel_sentences for word in re.findall(r'\b\w+\b', en.lower()))
    all_foreign_words = set(word for _, foreign in parallel_sentences for word in re.findall(r'\b\w+\b', foreign.lower()))

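    # Add the <NULL> token so that a foreign word can align to no English word
    all_en_words.add('<NULL>')

    # Any uniform starting value works here, since the E-step normalises per foreign word
    initial_prob = 1 / (len(all_en_words))  # all_en_words now includes <NULL>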
    translation_probs = {f_word: {e_word: initial_prob for e_word in all_en_words} for f_word in all_foreign_words}

    return translation_probs

def em_algorithm(parallel_sentences, num_iterations=5):
    print("Initializing translation probabilities...")
    start_time = time.time()
    translation_probs = initialize_translation_probabilities(parallel_sentences)
    initialization_time = time.time() - start_time
    print(f"Initialization completed in {initialization_time:.2f} seconds.")

    for i in range(num_iterations):
        iteration_start_time = time.time()
        print(f"Starting iteration {i+1}/{num_iterations}...")

        count_e_f = defaultdict(lambda: defaultdict(float))
        total_e = defaultdict(float)

        # E-step
        print("  E-step...")
        for en_sentence, foreign_sentence in parallel_sentences:
            en_words = ['<NULL>'] + re.findall(r'\b\w+\b', en_sentence.lower())
            foreign_words = re.findall(r'\b\w+\b', foreign_sentence.lower())
            for f_word in foreign_words:
                denom_c = sum(translation_probs[f_word][e_word] for e_word in en_words)
                for e_word in en_words:
                    count_e_f[f_word][e_word] += translation_probs[f_word][e_word] / denom_c
                    total_e[e_word] += translation_probs[f_word][e_word] / denom_c

        # M-step
        print("  M-step...")
        for f_word in translation_probs:
            for e_word in translation_probs[f_word]:
                translation_probs[f_word][e_word] = count_e_f[f_word][e_word] / total_e[e_word]

        iteration_end_time = time.time() - iteration_start_time
        print(f"Completed iteration {i+1}/{num_iterations} in {iteration_end_time:.2f} seconds.")

    return translation_probs

In [12]:
# Before running the EM algorithm
print("Starting the EM algorithm...")
translation_probs = em_algorithm(parallel_sentences)
print("EM algorithm completed.")

# Print the 10 most likely translations for the English word "european"
english_word = "european"
likely_translations = [(f_word, translation_probs[f_word][english_word]) for f_word in translation_probs if english_word in translation_probs[f_word]]
likely_translations.sort(key=lambda x: x[1], reverse=True)
print("\nTop 10 likely translations for 'european':")
for foreign_word, prob in likely_translations[:10]:
    print(f"{foreign_word}: {prob}")

Starting the EM algorithm...
Initializing translation probabilities...
Initialization completed in 25.01 seconds.
Starting iteration 1/5...
  E-step...
  M-step...
Completed iteration 1/5 in 186.56 seconds.
Starting iteration 2/5...
  E-step...
  M-step...
Completed iteration 2/5 in 478.42 seconds.
Starting iteration 3/5...
  E-step...
  M-step...
Completed iteration 3/5 in 461.66 seconds.
Starting iteration 4/5...
  E-step...
  M-step...
Completed iteration 4/5 in 557.25 seconds.
Starting iteration 5/5...
  E-step...
  M-step...
Completed iteration 5/5 in 541.42 seconds.
EM algorithm completed.
Top 10 likely translations for 'european':
europäischen: 0.5632761859565995
europäische: 0.2475383085188334
der: 0.042232841286649406
die: 0.027021545239374084
union: 0.020943386322625646
und: 0.01038648858589231
in: 0.009981265254037574
den: 0.00832870844471093
das: 0.005713162414687346
für: 0.004639426042403674
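
The implementation above stores a probability for every (foreign word, English word) combination, including pairs that never occur in the same sentence pair. That may be part of why initialization and the later iterations take so long. The following is a minimal sketch of the same EM procedure that only stores pairs seen together, run on a made-up three-sentence corpus. The function name `train_ibm_model1`, the whitespace tokenization, and the toy data are assumptions for illustration, not the code that produced the results above.

```python
from collections import defaultdict

def train_ibm_model1(sentence_pairs, num_iterations=10):
    """Estimate t(f|e) with EM, keeping only co-occurring (f, e) pairs."""
    # <NULL> lets a foreign word align to no English word
    pairs = [(['<NULL>'] + en.split(), fo.split()) for en, fo in sentence_pairs]
    foreign_vocab = {f for _, fo_words in pairs for f in fo_words}
    uniform = 1.0 / len(foreign_vocab)
    t = defaultdict(lambda: uniform)  # t[(f, e)] approximates P(f | e)

    for _ in range(num_iterations):
        count_fe = defaultdict(float)
        total_e = defaultdict(float)
        # E-step: distribute each foreign word over the English words in its sentence
        for en_words, fo_words in pairs:
            for f in fo_words:
                norm = sum(t[(f, e)] for e in en_words)
                for e in en_words:
                    delta = t[(f, e)] / norm
                    count_fe[(f, e)] += delta
                    total_e[e] += delta
        # M-step: renormalise the expected counts per English word
        t = defaultdict(float, {(f, e): c / total_e[e] for (f, e), c in count_fe.items()})
    return t

# Toy parallel corpus (English, German), invented for this example
toy_pairs = [("the house", "das haus"), ("the book", "das buch"), ("a book", "ein buch")]
t = train_ibm_model1(toy_pairs)

candidates = {f: p for (f, e), p in t.items() if e == "house"}
for f, p in sorted(candidates.items(), key=lambda item: item[1], reverse=True):
    print(f"t({f} | house) = {p:.3f}")
```

Because unseen pairs are never stored, memory grows with the number of co-occurring word pairs rather than with the product of the two vocabulary sizes.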


#### Self-check: if our goal is to translate from some language into English, why does our conditional probability seem to be written backwards? Why don't we estimate P(e|f) instead?
At first glance the conditional probability looks backwards. The reason is the noisy-channel formulation: by Bayes' rule, `P(e|f)` is proportional to `P(f|e) · P(e)`, because `P(f)` is constant for a given foreign sentence. The translation model from (C) supplies `P(f|e)`, so it models how the foreign text is generated from the English text, while the language model from (B) supplies `P(e)`. Splitting the problem this way lets the language model take care of producing fluent English, so the translation model only has to capture word correspondences and can afford to be as crude as IBM Model 1. Estimating `P(e|f)` directly would require a single model to handle both adequacy and fluency.
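
In noisy-channel terms, the decoding objective for a foreign sentence $f$ can be written with Bayes' rule as follows, where $\hat{e}$ denotes the chosen English sentence:

$$
\hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} \frac{P(f \mid e)\,P(e)}{P(f)} = \arg\max_{e} P(f \mid e)\,P(e)
$$

The denominator $P(f)$ does not depend on $e$, so it can be dropped from the maximization. This leaves the translation model $P(f \mid e)$ and the language model $P(e)$ as the two components to estimate.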

### (D) Decoding

In [13]:
def greedy_decoder(foreign_sentence, translation_probs):
    # Tokenize the foreign sentence
    # The regular expression \b\w+\b matches whole words
    foreign_words = re.findall(r'\b\w+\b', foreign_sentence.lower())
    translated_sentence = []

    # For each foreign word find the English word with the highest translation probability
    for f_word in foreign_words:
        best_prob = 0
        best_english_word = None
        for e_word, prob in translation_probs.get(f_word, {}).items():
            if prob > best_prob:
                best_prob = prob
                best_english_word = e_word
        # If no translation is found use the foreign word
        translated_sentence.append(best_english_word if best_english_word else f_word)

    return ' '.join(translated_sentence)

foreign_sentence = "das Haus"
english_translation = greedy_decoder(foreign_sentence, translation_probs)
print(f"Translated sentence: {english_translation}")

Translated sentence: reasonably house


#### Simplifying Assumptions:
- **Word-by-Word Translation**: This approach assumes that a sentence can be translated word by word, which is rarely the case given the differences in grammar and sentence structure between languages.
- **Ignoring Word Order**: The translation keeps the foreign word order and makes no attempt to produce the correct word order in English.
- **No Context Consideration**: Each word is translated independently of its context, which can lead to incorrect translations for words with multiple meanings.
- **Out-of-Vocabulary Words**: Words not seen in the training data cannot be translated and are copied through unchanged.
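
The output "reasonably house" points to a further weakness. The decoder compares `t(f|e)` across different English words, and those values are normalized per English word, so a rare English word that happened to co-occur with "das" can score higher than "the". One possible refinement, sketched here, is to weight each candidate by a unigram prior `P(e)`, which is a word-level version of the noisy-channel idea. The function `prior_weighted_decoder` and all numbers in the toy tables are hypothetical and chosen only to illustrate the effect; they are not values from the trained model.

```python
def prior_weighted_decoder(foreign_sentence, translation_probs, english_unigram_probs):
    """Pick, for each foreign word, the English word maximising t(f|e) * P(e)."""
    translated = []
    for f_word in foreign_sentence.lower().split():
        candidates = translation_probs.get(f_word, {})
        best_word, best_score = None, 0.0
        for e_word, t_fe in candidates.items():
            if e_word == '<NULL>':
                continue  # <NULL> is an alignment device, not an output word
            score = t_fe * english_unigram_probs.get(e_word, 0.0)
            if score > best_score:
                best_word, best_score = e_word, score
        # Fall back to the foreign word when no candidate has a positive score
        translated.append(best_word if best_word is not None else f_word)
    return ' '.join(translated)

# Hypothetical tables: translation_probs[f][e] stands for t(f|e)
toy_translation_probs = {
    'das': {'the': 0.30, 'reasonably': 0.45, '<NULL>': 0.10},
    'haus': {'house': 0.60, 'the': 0.001},
}
toy_unigram_probs = {'the': 0.06, 'reasonably': 0.00005, 'house': 0.001}

print(prior_weighted_decoder("das Haus", toy_translation_probs, toy_unigram_probs))
# prints "the house"
```

With these toy numbers, "reasonably" has the larger `t(das|e)` but loses once the prior is applied. The other assumptions in the list above still hold, since each word is translated in isolation and the word order is unchanged.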

## Problem 3: Discussion
***

#### (A) Propose a number of different evaluation protocols for machine translation systems and discuss their advantages and disadvantages. What does it mean for a translation to be "good"? Minimally, you should think of one manual and one automatic procedure. (The point here is not that you should search the web but that you should try to come up with your own ideas.)


#### (B) The following example shows a number of sentences automatically translated from Estonian into English. In Estonian, ta means either "he" or "she", depending on whom we're talking about. Please comment on the translated sentences: what do you think are the technical reasons we see this effect? Do you consider this to be a bug or a feature?

#### (C) Below, we consider three sentences that include the English word bat and their automatic translation into Swedish by Google Translate. Why do you think the translation system has been able to select the correct translation of bat in the first two cases? What might be the reason that it has invented a new nonsense word in the third case?


## References
***

## Self Check
***
- Have you answered all questions to the best of your ability?
Yes, we have.
- Is all the required information on the front page, is the file name correct etc.?
Indeed, all the required information on the front page has been included.
- Anything else you can easily check? (details, terminology, arguments, clearly stated answers etc.?)
We have checked, and everything looks good.