
The goal of this lab is to implement a language identifier (LID).

Our first model will be based on Naive Bayes.

In [None]:
import io, sys, math, re
from collections import defaultdict

The next function loads the data. Each line of the data consists of a label (corresponding to a language) followed by some text written in that language. Here is an example:

```__label__de Zur Namensdeutung gibt es mehrere Varianten.```
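
As a quick illustration of this format, the sketch below splits one such line into its label and its tokens. The helper name `parse_line` is hypothetical and not part of the lab code; it only assumes that the label comes first and that tokens are separated by whitespace.

```python
def parse_line(line):
    """Split a '__label__xx some text' line into (label, tokens)."""
    tokens = line.split()
    if not tokens:
        return None  # blank line: nothing to parse
    # First token is the label, the remaining tokens are the sentence.
    return tokens[0], tokens[1:]


example = "__label__de Zur Namensdeutung gibt es mehrere Varianten."
label, words = parse_line(example)
print(label)  # __label__de
print(words)  # ['Zur', 'Namensdeutung', 'gibt', 'es', 'mehrere', 'Varianten.']
```

Note that plain whitespace splitting keeps punctuation attached to words (for example `'Varianten.'`).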


In [None]:
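def load_data(filename):
    data = []
    # The context manager closes the file even if parsing fails.
    with io.open(filename, 'r', encoding='utf-8') as fin:
        for line in fin:
            tokens = line.split()
            if not tokens:
                continue  # skip blank lines
            # First token is the label; the rest is the sentence.
            data.append((tokens[0], tokens[1:]))
    return data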

You can now load the first dataset, `train1.txt`, and inspect one of its examples.

In [None]:
data = load_data("/content/train1.txt")
print(data[0])

('__label__de', ['Ich', 'würde', 'alles', 'tun,', 'um', 'dich', 'zu', 'beschützen.'])


Next, we will start implementing the Naive Bayes method. This technique is based on word counts, so we first implement a function that counts the words and labels in the training set.

`n_examples` is the total number of examples

`n_words_per_label` is the total number of words for a given label

`label_counts` is the number of distinct labels in the training data (the function below stores this single integer, not a per-label count)

`word_counts` is the number of times a word appears with a given label
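
To make these quantities concrete, here is a small sketch that computes them by hand on three toy examples. The toy data and every `toy_`-prefixed name are illustrative assumptions, not part of the lab. The sketch tracks both the number of examples per label and the number of distinct labels, so the two are easy to tell apart.

```python
from collections import Counter, defaultdict

# Hypothetical toy data in the same (label, tokens) shape as load_data returns.
toy_data = [
    ("__label__en", ["the", "cat", "sat"]),
    ("__label__de", ["die", "Katze"]),
    ("__label__en", ["the", "dog"]),
]

toy_n_examples = len(toy_data)
toy_examples_per_label = Counter(label for label, _ in toy_data)
toy_n_words_per_label = Counter()
toy_word_counts = defaultdict(Counter)

for label, sentence in toy_data:
    toy_n_words_per_label[label] += len(sentence)
    for word in sentence:
        toy_word_counts[word][label] += 1

print(toy_n_examples)                 # 3
print(dict(toy_examples_per_label))   # {'__label__en': 2, '__label__de': 1}
print(len(toy_examples_per_label))    # 2
print(dict(toy_n_words_per_label))    # {'__label__en': 5, '__label__de': 2}
print(dict(toy_word_counts["the"]))   # {'__label__en': 2}
```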

In [None]:
def count_words(data):
    n_examples = len(data)
    n_words_per_label = {}
    word_counts = {}

    for label, sentence in data:
        # Running total of words seen for this label.
        n_words_per_label[label] = n_words_per_label.get(label, 0) + len(sentence)

        for word in sentence:
            # word_counts[word][label] = occurrences of `word` in sentences with `label`.
            per_label = word_counts.setdefault(word, {})
            per_label[label] = per_label.get(label, 0) + 1

    # Number of distinct labels seen in the training data
    # (a single integer, not a per-label count).
    label_counts = len(n_words_per_label)

    return {'label_counts': label_counts,
            'word_counts': word_counts,
            'n_examples': n_examples,
            'n_words_per_label': n_words_per_label}

In [None]:
info = count_words(data)
info

{'label_counts': 10,
 'n_examples': 10000,
 'n_words_per_label': {'__label__de': 6630,
  '__label__en': 16444,
  '__label__eo': 7647,
  '__label__es': 3927,
  '__label__fr': 4718,
  '__label__hu': 2271,
  '__label__it': 7759,
  '__label__pt': 4044,
  '__label__ru': 7387,
  '__label__tr': 6026},
 'word_counts': {'Ich': {'__label__de': 140},
  'würde': {'__label__de': 9},
  'alles': {'__label__de': 6},
  'tun,': {'__label__de': 1},
  'um': {'__label__de': 22, '__label__pt': 51},
  'dich': {'__label__de': 15},
  'zu': {'__label__de': 78},
  'beschützen.': {'__label__de': 1},
  'Tom': {'__label__de': 126,
   '__label__en': 616,
   '__label__eo': 20,
   '__label__es': 39,
   '__label__fr': 36,
   '__label__hu': 28,
   '__label__it': 136,
   '__label__pt': 69,
   '__label__tr': 175},
  'ist': {'__label__de': 98},
  'an': {'__label__de': 30, '__label__en': 36},
  'Kunst': {'__label__de': 1},
  'völlig': {'__label__de': 2},
  'uninteressiert.': {'__label__de': 1},
  'Végeztem': {'__label__hu':

In [None]:
info['word_counts']['Ich']

{'__label__de': 140}

In [None]:
info['n_words_per_label'].values()

dict_values([6630, 2271, 7387, 7759, 16444, 3927, 6026, 7647, 4044, 4718])

In [None]:
sum(info['n_words_per_label'].values())

66853

Next, using the word and label counts from the previous function, we can implement the prediction function.

Here, `mu` is a regularization parameter (Laplace smoothing), and `sentence` is the list of words corresponding to the test example.

In [None]:
def predict(sentence, mu, label_counts, word_counts, n_examples, n_words_per_label):
    # NOTE: this implementation is a per-word majority vote, not a smoothed
    # Naive Bayes score: `mu`, `label_counts`, `n_examples` and
    # `n_words_per_label` are accepted but not used.
    votes = {}

    for word in sentence:
        words_labels = word_counts.get(word)
        if not words_labels:
            # Word never seen in training: it casts no vote.
            continue

        # The word votes for the label it co-occurs with most often.
        top_label = max(words_labels, key=words_labels.get)
        votes[top_label] = votes.get(top_label, 0) + 1

    if not votes:
        return None

    # Return the label with the most votes (ties go to the first one seen).
    return max(votes, key=votes.get)

In [None]:
sample_sentence = ['Ich', 'würde', 'alles', 'tun,', 'um', 'dich', 'zu', 'beschützen.']
total_labels = info['label_counts']        # 10
total_words = info['word_counts']          # dict: word -> {label: count}
total_examples = info['n_examples']        # 10000
words_p_label = info['n_words_per_label']  # dict: label -> total word count

prediction = predict(sample_sentence, 1.0, total_labels, total_words, total_examples, words_p_label)
prediction

'__label__de'
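
The `predict` cell above picks a label by letting each known word vote, and it leaves `mu` unused. For comparison, a multinomial Naive Bayes scorer with Laplace smoothing could look like the sketch below. This is a minimal illustration under stated assumptions, not the lab's reference solution: the names `nb_train` and `nb_predict` and the toy data are hypothetical, `mu` must be strictly positive, and the vocabulary size is taken to be the number of distinct training words.

```python
import math
from collections import Counter, defaultdict


def nb_train(data):
    """Collect the counts a multinomial Naive Bayes model needs."""
    examples_per_label = Counter()
    words_per_label = Counter()
    word_label_counts = defaultdict(Counter)
    for label, sentence in data:
        examples_per_label[label] += 1
        words_per_label[label] += len(sentence)
        for word in sentence:
            word_label_counts[word][label] += 1
    return examples_per_label, words_per_label, word_label_counts


def nb_predict(sentence, mu, examples_per_label, words_per_label, word_label_counts):
    """Return the label maximising log P(label) + sum of log P(word | label)."""
    n_examples = sum(examples_per_label.values())
    vocab_size = len(word_label_counts)
    best_label, best_score = None, float("-inf")
    for label, n_label in examples_per_label.items():
        score = math.log(n_label / n_examples)  # log prior
        denom = words_per_label[label] + mu * vocab_size
        for word in sentence:
            # Guard the lookup so unseen words do not add keys to the defaultdict.
            count = word_label_counts[word][label] if word in word_label_counts else 0
            score += math.log((count + mu) / denom)  # smoothed log likelihood
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set.
toy = [
    ("__label__en", ["the", "cat", "sat"]),
    ("__label__en", ["the", "dog", "ran"]),
    ("__label__de", ["die", "Katze", "sitzt"]),
    ("__label__de", ["der", "Hund", "rennt"]),
]
model = nb_train(toy)
print(nb_predict(["the", "cat"], 1.0, *model))   # __label__en
print(nb_predict(["der", "Hund"], 1.0, *model))  # __label__de
```

Working in log space avoids numerical underflow on long sentences, and the `mu` term keeps the probability of an unseen word above zero so the logarithm stays defined.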

The next function will be used to evaluate the Naive Bayes model on a validation set. It computes the accuracy for a particular regularization parameter `mu`.

In [None]:
def compute_accuracy(valid_data, mu, counts):
    correct_predictions = 0
    for label, sentence in valid_data:
        # Use the `mu` and `counts` passed in rather than notebook globals.
        predicted_label = predict(sentence, mu, **counts)
        if predicted_label == label:
            correct_predictions += 1
    return correct_predictions / len(valid_data)

In [None]:
print("")
print("** Naive Bayes **")
print("")

mu = 1.0
train_data = load_data("train1.txt")
valid_data = load_data("valid1.txt")
counts = count_words(train_data)

print("Validation accuracy: %.3f" % compute_accuracy(valid_data, mu, counts))
print("")


** Naive Bayes **

Validation accuracy: 0.930
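
Since the accuracy is computed for one value of `mu` at a time, a natural follow-up is to compare several values on the validation set. The sketch below shows one way to do that. The helper `sweep_mu` is hypothetical, and `stand_in_score` is a synthetic function used only to make the example runnable; it is not a real accuracy.

```python
def sweep_mu(candidates, accuracy_fn):
    """Return (best_mu, best_accuracy); the first best value wins ties."""
    best_mu, best_acc = None, float("-inf")
    for mu in candidates:
        acc = accuracy_fn(mu)
        if acc > best_acc:
            best_mu, best_acc = mu, acc
    return best_mu, best_acc


# Synthetic stand-in that peaks at mu = 1.0; it is NOT a real accuracy.
def stand_in_score(mu):
    return 1.0 - (mu - 1.0) ** 2


print(sweep_mu([0.01, 0.1, 1.0, 10.0], stand_in_score))  # (1.0, 1.0)
```

In the notebook, the second argument would instead be something like `lambda mu: compute_accuracy(valid_data, mu, counts)`. As long as `predict` does not use `mu`, every candidate yields the same accuracy.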

