The goal of this lab is to implement a language identifier (LID).

Our first model will be based on Naive Bayes.

In [None]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [None]:
import io, sys, math, re
from collections import defaultdict

The next function loads the data. Each line of the data consists of a label (corresponding to a language), followed by some text written in that language. Here is an example line:

```__label__de Zur Namensdeutung gibt es mehrere Varianten.```
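
To make the format concrete, the sketch below splits the example line into its label and its tokens. The names `LABEL_PREFIX` and `parse_line` are hypothetical helpers introduced only for this illustration; the sketch assumes every label starts with `__label__`.

```python
LABEL_PREFIX = "__label__"  # assumption: every label carries this prefix

def parse_line(line):
    """Split one line into (language code, list of whitespace tokens)."""
    tokens = line.split()
    label, words = tokens[0], tokens[1:]
    # Strip the prefix for display only; the lab keeps the full label string.
    if label.startswith(LABEL_PREFIX):
        language = label[len(LABEL_PREFIX):]
    else:
        language = label
    return language, words

language, words = parse_line("__label__de Zur Namensdeutung gibt es mehrere Varianten.")
print(language, words)
```

Note that splitting on whitespace leaves punctuation attached to the neighbouring word, so `Varianten.` is a different token from `Varianten`.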


In [None]:
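def load_data(filename):
    data = []
    # Use a context manager so the file is closed after reading.
    with io.open(filename, 'r', encoding='utf-8') as fin:
        for line in fin:
            tokens = line.split()
            if not tokens:  # skip blank lines
                continue
            # The first token is the label; the rest is the tokenized sentence.
            data.append((tokens[0], tokens[1:]))
    return data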

You can now load the first dataset, `train1.txt`, and look at a few examples.

In [None]:
data = load_data("/content/drive/MyDrive/NLPWEEK1/train1.txt")
print(data[0])

('__label__de', ['Ich', 'würde', 'alles', 'tun,', 'um', 'dich', 'zu', 'beschützen.'])


In [None]:
data[4:7]

[('__label__ru', ['У', 'меня', 'есть', 'яблоко.']),
 ('__label__it', ['Non', 'possiamo', 'lasciarle', 'lì.']),
 ('__label__ru',
  ['Том',
   'считает,',
   'что',
   'школа',
   '—',
   'это',
   'пустая',
   'трата',
   'времени.'])]

Next, we will start implementing the Naive Bayes method. This technique is based on word counts, and we thus need to start by implementing a function to count the words and labels of our training set.

`n_examples` is the total number of examples

`n_words_per_label` is the total number of words for a given label

`label_counts` is the number of times a given label appears in the training data

`word_counts` is the number of times a word appears with a given label
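
As a rough illustration of these four quantities, the sketch below computes them on a tiny made-up dataset. It uses `collections.Counter` rather than the lab's nested `defaultdict`, purely for brevity, and `toy_data` is invented for this example, not drawn from `train1.txt`.

```python
from collections import Counter, defaultdict

# Made-up data in the same (label, tokens) shape that load_data returns.
toy_data = [
    ("__label__en", ["the", "cat", "sat"]),
    ("__label__en", ["the", "dog"]),
    ("__label__fr", ["le", "chat"]),
]

label_counts = Counter()            # number of examples per label
n_words_per_label = Counter()       # total number of tokens per label
word_counts = defaultdict(Counter)  # per-label count of each word

for label, sentence in toy_data:
    label_counts[label] += 1
    n_words_per_label[label] += len(sentence)
    word_counts[label].update(sentence)

n_examples = len(toy_data)

print(n_examples, dict(label_counts), dict(n_words_per_label))
print(word_counts["__label__en"]["the"])
```

Two consistency checks follow from the definitions: the values of `label_counts` sum to `n_examples`, and for each label the values of `word_counts[label]` sum to `n_words_per_label[label]`.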

In [None]:
dummy_data=data[:2]

In [None]:
ans=count_words(dummy_data)

In [None]:
ans

{'label_counts': defaultdict(<function __main__.count_words.<locals>.<lambda>>,
             {'__label__de': 2}),
 'n_examples': 2,
 'n_words_per_label': defaultdict(<function __main__.count_words.<locals>.<lambda>>,
             {'__label__de': 14}),
 'word_counts': defaultdict(<function __main__.count_words.<locals>.<lambda>>,
             {'__label__de': defaultdict(<function __main__.count_words.<locals>.<lambda>.<locals>.<lambda>>,
                          {'Ich': 1.0,
                           'Kunst': 1.0,
                           'Tom': 1.0,
                           'alles': 1.0,
                           'an': 1.0,
                           'beschützen.': 1.0,
                           'dich': 1.0,
                           'ist': 1.0,
                           'tun,': 1.0,
                           'um': 1.0,
                           'uninteressiert.': 1.0,
                           'völlig': 1.0,
                           'würde': 1.0,
                       

In [None]:
dummy_data

[('__label__de',
  ['Ich', 'würde', 'alles', 'tun,', 'um', 'dich', 'zu', 'beschützen.']),
 ('__label__de', ['Tom', 'ist', 'an', 'Kunst', 'völlig', 'uninteressiert.'])]

In [None]:
def count_words(data):
    n_examples = 0
    n_words_per_label = defaultdict(lambda: 0)
    label_counts = defaultdict(lambda: 0)
    word_counts = defaultdict(lambda: defaultdict(lambda: 0.0))

    for example in data:
        label, sentence = example
        # One more training example for this label.
        label_counts[label] += 1
        n_examples += 1
        # Count every word occurrence under this label.
        for w in sentence:
            word_counts[label][w] += 1
            n_words_per_label[label] += 1

    return {'label_counts': label_counts, 
            'word_counts': word_counts, 
            'n_examples': n_examples, 
            'n_words_per_label': n_words_per_label}

Next, using the word and label counts from the previous function, we can implement the prediction function.

Here, `mu` is the smoothing parameter, which acts as a regularizer (additive smoothing; `mu = 1` gives Laplace smoothing), and `sentence` is the list of words of the test example.
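
The score of a label can be worked through by hand on toy counts. The sketch below adds the log prior of the label to the sum of smoothed log word probabilities, each of the form `(count + mu) / (n_words + mu * vocab_size)`. All counts are hypothetical, and taking `vocab_size` as the number of distinct words over all labels is an assumption of this sketch. Working in log space turns a product of small probabilities into a sum, which avoids numerical underflow on long sentences.

```python
import math

# Hypothetical toy counts, not taken from train1.txt.
word_counts = {
    "en": {"the": 3, "cat": 1},
    "fr": {"le": 2, "chat": 2},
}
label_counts = {"en": 2, "fr": 2}
n_examples = 4
vocab_size = 4  # assumption: distinct words over all labels

def log_score(sentence, label, mu):
    """Log prior plus the sum of smoothed log word probabilities."""
    n_words = sum(word_counts[label].values())
    score = math.log(label_counts[label] / n_examples)
    for w in sentence:
        count = word_counts[label].get(w, 0)  # unseen words count as 0
        score += math.log((count + mu) / (n_words + mu * vocab_size))
    return score

sentence = ["the", "chat"]
scores = {label: log_score(sentence, label, mu=1.0) for label in word_counts}
best = max(scores, key=scores.get)
print(best, scores)
```

With `mu = 0`, a word never seen with a label gets probability zero and `math.log` raises a `ValueError`; any positive `mu` keeps every score finite.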

In [None]:
import numpy as np

In [None]:
def predict(sentence, mu, label_counts, word_counts, n_examples, n_words_per_label):
    best_label = None
    best_score = float('-inf')

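    for label in word_counts.keys():
        # Start from the log prior, log P(label).
        score = np.log(label_counts[label] / n_examples)
        # Smoothing denominator, using the number of distinct words seen with this label.
        vocab_size = len(word_counts[label])
        total_words = n_words_per_label[label] + mu * vocab_size
        for w in sentence:
            # .get avoids inserting unseen words into the defaultdict.
            word_cnt = word_counts[label].get(w, 0.0) + mu
            score += np.log(word_cnt / total_words)
        if score > best_score:
            best_score = score
            best_label = label
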
    return best_label

The next function will be used to evaluate the Naive Bayes model on a validation set. It computes the accuracy for a particular regularization parameter `mu`.

In [None]:
def compute_accuracy(valid_data, mu, counts):
    correct_preds = 0
    for label, sentence in valid_data:
        prediction = predict(sentence, mu, **counts)
        if prediction == label:
            correct_preds += 1
    # Guard against an empty validation set.
    if not valid_data:
        return 0.0
    return correct_preds / len(valid_data)
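
Since the accuracy depends on `mu`, one natural use of this function is to compare a few candidate values on the validation set. The sketch below shows one way to do that; `select_mu` and `fake_accuracy` are hypothetical names, and `fake_accuracy` is a stand-in that returns made-up scores so the block runs on its own.

```python
def select_mu(candidates, score_fn):
    """Return (best_mu, best_score); ties keep the earlier candidate."""
    best_mu, best_score = None, float("-inf")
    for mu in candidates:
        score = score_fn(mu)
        if score > best_score:
            best_mu, best_score = mu, score
    return best_mu, best_score

# Stand-in scoring function (not real accuracies), peaking at mu = 0.1.
def fake_accuracy(mu):
    return 1.0 - abs(mu - 0.1)

best_mu, best_score = select_mu([0.01, 0.1, 1.0, 10.0], fake_accuracy)
print(best_mu, best_score)
```

In this notebook, `score_fn` could be `lambda mu: compute_accuracy(valid_data, mu, counts)`, since `counts` does not depend on `mu` and only needs to be computed once.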

In [None]:
print("")
print("** Naive Bayes **")
print("")

mu = 1.0
train_data = load_data("/content/drive/MyDrive/NLPWEEK1/train1.txt")
valid_data = load_data("/content/drive/MyDrive/NLPWEEK1/valid1.txt")
counts = count_words(train_data)

print("Validation accuracy: %.3f" % compute_accuracy(valid_data, mu, counts))
print("")


** Naive Bayes **

Validation accuracy: 0.915

