
Implementing Naive Bayes from scratch

In [None]:
import io, sys, math, re
from collections import defaultdict

The next function loads the data. Each line of the data consists of a label (corresponding to a language), followed by some text written in that language. Here is an example line:

```__label__de Zur Namensdeutung gibt es mehrere Varianten.```
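
As a minimal sketch of how one such line breaks down, the snippet below splits the example on whitespace and separates the label from the words. The `language` variable and the prefix stripping are illustrative assumptions, not part of the notebook's own code.

```python
# A sample line in the format described above: a label, then whitespace-separated words.
line = "__label__de Zur Namensdeutung gibt es mehrere Varianten."

tokens = line.split()
label, words = tokens[0], tokens[1:]

# Hypothetical convenience step (not in the notebook): strip the "__label__" prefix.
language = label[len("__label__"):]

print(label)
print(words)
print(language)
```

Note that punctuation stays attached to the neighbouring word (`'Varianten.'`), since the split is on whitespace only.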


In [None]:
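def load_data(filename):
    # Each line has the form "<label> <word> <word> ...".
    data = []
    # The context manager closes the file once all lines are read.
    with io.open(filename, 'r', encoding='utf-8') as fin:
        for line in fin:
            tokens = line.split()
            if not tokens:  # skip blank lines
                continue
            data.append((tokens[0], tokens[1:]))
    return data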

You can now load the first dataset, `train1.txt`, and look at what an example looks like.

In [None]:
data = load_data("train1.txt")
print(data[0])

('__label__de', ['Ich', 'würde', 'alles', 'tun,', 'um', 'dich', 'zu', 'beschützen.'])


Next, we will start implementing the Naive Bayes method. This technique is based on word counts, and we thus need to start by implementing a function to count the words and labels of our training set.

`n_examples` is the total number of examples

`n_words_per_label` is the total number of words for a given label

`label_counts` is the number of times a given label appears in the training data

`word_counts` is the number of times a word appears with a given label
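
To make these four quantities concrete, here is a small sketch on a made-up dataset. The `toy_data` examples are hypothetical, and the sketch uses `collections.Counter` for brevity rather than the nested `defaultdict` structure used in this notebook.

```python
from collections import Counter, defaultdict

# Hypothetical toy dataset in the same (label, words) form that load_data returns.
toy_data = [
    ("__label__de", ["das", "ist", "gut"]),
    ("__label__de", ["das", "Haus"]),
    ("__label__fr", ["le", "chat"]),
]

n_examples = len(toy_data)
label_counts = Counter(label for label, _ in toy_data)
n_words_per_label = Counter()
word_counts = defaultdict(Counter)
for label, words in toy_data:
    n_words_per_label[label] += len(words)
    word_counts[label].update(words)

print(n_examples)                          # 3
print(label_counts["__label__de"])         # 2
print(n_words_per_label["__label__de"])    # 5
print(word_counts["__label__de"]["das"])   # 2
```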

In [None]:
def count_words(data):
    n_examples = 0
    n_words_per_label = defaultdict(lambda: 0)
    label_counts = defaultdict(lambda: 0)
    word_counts = defaultdict(lambda: defaultdict(lambda: 0.0))
    for label, sentence in data:
        n_examples += 1
        # The defaultdicts start every missing key at zero,
        # so no membership checks are needed before incrementing.
        label_counts[label] += 1
        n_words_per_label[label] += len(sentence)
        for word in sentence:
            word_counts[label][word] += 1.0

    return {'label_counts': label_counts,
            'word_counts': word_counts,
            'n_examples': n_examples,
            'n_words_per_label': n_words_per_label}

In [None]:
count_words(data)['word_counts']

defaultdict(<function __main__.count_words.<locals>.<lambda>>,
            {'__label__de': defaultdict(<function __main__.count_words.<locals>.<lambda>.<locals>.<lambda>>,
                         {'Ich': 140,
                          'würde': 9.0,
                          'alles': 6.0,
                          'tun,': 1.0,
                          'um': 22.0,
                          'dich': 15.0,
                          'zu': 78.0,
                          'beschützen.': 1.0,
                          'Tom': 126.0,
                          'ist': 98.0,
                          'an': 30.0,
                          'Kunst': 1.0,
                          'völlig': 2.0,
                          'uninteressiert.': 1.0,
                          '„Wird': 1.0,
                          'das': 62.0,
                          'in': 57.0,
                          'der': 93.0,
                          'Werkstatt': 1.0,
                          'gemacht?“': 1.0,
                 

Next, using the word and label counts from the previous function, we can implement the prediction function.

Here, `mu` is a regularization parameter (Laplace smoothing), and `sentence` is the list of words corresponding to the test example.
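
For reference, the score computed for a label $y$ and a sentence $w_1, \dots, w_n$ can be written as below. The symbols are introduced here for illustration only: $N$ is `n_examples`, $N_y$ is `label_counts[y]`, $c(w, y)$ is `word_counts[y][w]`, $T_y$ is `n_words_per_label[y]`, $\mu$ is `mu`, and $V_y$ is the vocabulary size used for smoothing, which the code in this notebook takes to be the number of distinct words seen with label $y$.

$$
\mathrm{score}(y) = \log \frac{N_y}{N} + \sum_{i=1}^{n} \log \frac{c(w_i, y) + \mu}{T_y + \mu V_y}
$$

The predicted label is the one with the highest score. Working in log space turns the product of probabilities into a sum, which avoids numerical underflow on long sentences.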

In [None]:
import math

def predict(sentence, mu, label_counts, word_counts, n_examples, n_words_per_label):
    best_label = None
    best_score = float('-inf')
    for label in word_counts:
        n_words = n_words_per_label[label]
        # Number of distinct words seen with this label,
        # used as the vocabulary size in the smoothing denominator.
        v = len(word_counts[label])
        # Start from the log prior of the label.
        score = math.log(label_counts[label] / n_examples)
        for word in sentence:
            # .get() avoids inserting unseen words into the defaultdict,
            # which would silently grow v during prediction.
            count = word_counts[label].get(word, 0.0)
            score += math.log((count + mu) / (n_words + mu * v))
        # Keep the label with the highest log score.
        if score > best_score:
            best_score = score
            best_label = label

    return best_label

In [None]:
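# predict expects a list of words, the smoothing parameter, and the counts.
counts = count_words(data)
predict('hi guys how are you'.split(), 1.0, **counts)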

The next function will be used to evaluate the Naive Bayes model on a validation set. It computes the accuracy for a particular regularization parameter `mu`.

In [None]:
def compute_accuracy(valid_data, mu, counts):
    n_correct = 0
    for label, sentence in valid_data:
        best_label = predict(sentence, mu, counts['label_counts'], counts['word_counts'],
                             counts['n_examples'], counts['n_words_per_label'])
        if best_label == label:
            n_correct += 1

    # Fraction of validation examples whose predicted label matches the true label.
    return n_correct / len(valid_data)

In [None]:
print("")
print("** Naive Bayes **")
print("")

mu = 1.0
train_data = load_data("train1.txt")
valid_data = load_data("valid1.txt")
counts = count_words(train_data)

print("Validation accuracy: %.3f" % compute_accuracy(valid_data, mu, counts))
print("")