# MMI_2024_NLP - Week 1

#Lab 1: Part 1

# Introduction

Before we start, please change the name of the notebook to the following format : **Firstname_LASTNAME_Lab1_A_naive_bayes.ipynb**


In some cells and files you will see code blocks that look like this:

```python
##############################################################################
#                    TODO: Write the equation for a line                     #
##############################################################################
pass
##############################################################################
#                              END OF YOUR CODE                              #
##############################################################################
```

You should replace the `pass` statement with your own code and leave the blocks intact, like this:

```python
##############################################################################
#                    TODO: Write the equation for a line                     #
##############################################################################
y = m * x + b
##############################################################################
#                              END OF YOUR CODE                              #
##############################################################################
```

# (A) Naive Bayes model

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [12]:
ls

[0m[01;34mdrive[0m/  [01;34msample_data[0m/


In [15]:
cd '/content/drive/MyDrive/NLP_Week1_PS/Lab1'

/content/drive/MyDrive/NLP_Week1_PS/Lab1


In this lab, we will implement a language identifier (LID).

Our first model will be based on Naive Bayes.

In [16]:
import io, sys, math, re
from collections import defaultdict
from typing import List, Tuple, Dict

The next function is used to load the data. Each line of the data consist of a label (corresponding to a language), followed by some text, written in that language. Here is an example of data:

```__label__de Zur Namensdeutung gibt es mehrere Varianten.```


In [17]:
def load_data(filename:str)->List[Tuple]:
    fin = io.open(filename, 'r', encoding='utf-8')
    data = []
    for line in fin:
        tokens = line.split()
        data.append((tokens[0], tokens[1:]))
    return data

You can now try loading the first dataset `train1.txt` and look what examples look like.

In [18]:
data = load_data("train1.txt")
print(data[0])

('__label__de', ['Ich', 'würde', 'alles', 'tun,', 'um', 'dich', 'zu', 'beschützen.'])


Next, we will start implementing the Naive Bayes method. This technique is based on word counts, and we thus need to start by implementing a function to count the words and labels of our training set.

`n_examples` is the total number of examples

`n_words_per_label` is the total number of words for a given label

`label_counts` is the number of times a given label appears in the training data

`word_counts` is the number of times a word appears with a given label

In [19]:
def count_words(data:str)->Dict:
    n_examples = 0
    n_words_per_label = defaultdict(lambda: 0)
    label_counts = defaultdict(lambda: 0)
    word_counts = defaultdict(lambda: defaultdict(lambda: 0.0))

    for example in data:
        label, sentence = example
        ##########################################################################
        #                      TODO: Implement this function                     #
        ##########################################################################
        # Replace "pass" statement with your code
        n_examples += 1
        label_counts[label] += 1

        for word in sentence:
          n_words_per_label[label] += 1
          word_counts[label][word] += 1

        ##########################################################################
        #                            END OF YOUR CODE                            #
        ##########################################################################

    return {'label_counts': label_counts,
            'word_counts': word_counts,
            'n_examples': n_examples,
            'n_words_per_label': n_words_per_label}

Next, using the word and label counts from the previous function, we can implement the prediction function.

Here, `mu` is a regularization parameter (Laplace smoothing), and `sentence` is the list of words corresponding to the test example.

In [20]:
def predict(sentence:List, mu:float, label_counts:Dict, word_counts:Dict, n_examples:int, n_words_per_label:Dict)->str:
    best_label = None
    best_score = float('-inf')

    for label in word_counts.keys():
        score = 0.0
        prior = label_counts[label] / sum(label_counts.values())
        #P(Class | Word) = P(Class) * P(word | Class)
        ##########################################################################
        #                      TODO: Implement this function                     #
        ##########################################################################
        # Replace "pass" statement with your code
        score += math.log(prior)
        for word in sentence:
          # P(w∣C)= count(w,C)+μ / ( count(C)+μ⋅∣V∣ )
          likelihood = (word_counts[label][word] + mu) / ( n_words_per_label[label] + mu * len(word_counts[label]))  # Number of occurrences of a word in the label over the total number of words in this label
          score += math.log(likelihood)

          if score > best_score :
            best_score = score
            best_label = label

        ##########################################################################
        #                            END OF YOUR CODE                            #
        ##########################################################################

    return best_label

The next function will be used to evaluate the Naive Bayes model on a validation set. It computes the accuracy for a particular regularization parameter `mu`.

In [21]:
def compute_accuracy(valid_data:str, mu:float, counts:Dict)->float:
    accuracy = 0.0
    correct_preds = 0

    for label, sentence in valid_data:
      ##########################################################################
      #                      TODO: Implement this function                     #
      ##########################################################################
      # Replace "pass" statement with your code
      predicted_label = predict( sentence, mu, counts['label_counts'], counts['word_counts'], counts['n_examples'], counts['n_words_per_label'] )
      if predicted_label == label:
          correct_preds += 1

    accuracy = correct_preds / len(valid_data)

      ##########################################################################
      #                            END OF YOUR CODE                            #
      ##########################################################################

    return accuracy # Replace "..." statement with your code

In [22]:
print("")
print("** Naive Bayes **")
print("")

mu = 1.0
train_data = load_data("train1.txt")
valid_data = load_data("valid1.txt")
counts = count_words(train_data)

print("Validation accuracy: %.3f" % compute_accuracy(valid_data, mu, counts))
print("")


** Naive Bayes **

Validation accuracy: 0.740



# Now, it is your turn, try to do it with train2.txt and valid2.txt.


In [23]:
#Write your code here.

print("")
print("** Naive Bayes **")
print("")

mu = 1.0
train_data = load_data("train2.txt")
valid_data = load_data("valid2.txt")
counts = count_words(train_data)

print("Validation accuracy: %.3f" % compute_accuracy(valid_data, mu, counts))
print("")


** Naive Bayes **

Validation accuracy: 0.809

