The goal of this lab is to implement a language identifier (LID).

Our first model will be based on Naive Bayes.

In [41]:
import io, sys, math, re
from collections import defaultdict

The next function is used to load the data. Each line of the data consist of a label (corresponding to a language), followed by some text, written in that language. Here is an example of data:

```__label__de Zur Namensdeutung gibt es mehrere Varianten.```


In [42]:
def load_data(filename):
    fin = io.open(filename, 'r', encoding='utf-8')
    data = []
    for line in fin:
        tokens = line.split()
        data.append((tokens[0], tokens[1:]))
    return data

You can now try loading the first dataset `train1.txt` and look what examples look like.

In [43]:
data = load_data("train1.txt")
print(data[0])

('__label__de', ['Ich', 'würde', 'alles', 'tun,', 'um', 'dich', 'zu', 'beschützen.'])


Next, we will start implementing the Naive Bayes method. This technique is based on word counts, and we thus need to start by implementing a function to count the words and labels of our training set.

`n_examples` is the total number of examples

`n_words_per_label` is the total number of words for a given label

`label_counts` is the number of times a given label appears in the training data

`word_counts` is the number of times a word appears with a given label

In [44]:
def count_words(data):
    n_examples = 0
    n_words_per_label = defaultdict(lambda: 0)
    label_counts = defaultdict(lambda: 0)
    word_counts = defaultdict(lambda: defaultdict(lambda: 0.0))

    for example in data:
        label, sentence = example
        ## FILL CODE
        #updating number of examples
        n_examples +=1
        #update of labels 
        label_counts[label]+=1
        # Loop in the sentence to find words and labels in it
        for word in sentence:
            #update number of label per sentence
            n_words_per_label[label] +=1
            #update number of word per sentence
            word_counts[label][word] +=1
             
        
        

    return {'label_counts': label_counts, 
            'word_counts': word_counts, 
            'n_examples': n_examples, 
            'n_words_per_label': n_words_per_label}

Next, using the word and label counts from the previous function, we can implement the prediction function.

Here, `mu` is a regularization parameter (Laplace smoothing), and `sentence` is the list of words corresponding to the test example.

In [45]:
def predict(sentence, mu, label_counts, word_counts, n_examples, n_words_per_label):
    best_label = None
    best_score = float('-inf')

    for label in word_counts.keys():
        score = 0.0
        ## FILE CODE
        length_of_vocabulary =len(word_counts[label])
        for word in sentence:
            # add mu to number of time word appeared in a certain sentence to against the zero prediction always
            number_word=word_counts[label][word]+mu
            #compute denominator as mu* total count of word in sentence +number of words in label
            tot_word_appearence=n_words_per_label[label]+mu*length_of_vocabulary
            #update the score
            score +=math.log(number_word/tot_word_appearence)
            
            if score> best_score:
                #Update the best score
                best_score=score
                #update the label based on high score
                best_label=label

    return best_label

The next function will be used to evaluate the Naive Bayes model on a validation set. It computes the accuracy for a particular regularization parameter `mu`.

In [46]:
def compute_accuracy(valid_data, mu, counts):
    accuracy = 0.0
    for label, sentence in valid_data:
        ## FILL CODE
        
        #Compute prediction
        prediction=predict(sentence, mu, **counts)
        # check whether predicted label is the same as true label
        if prediction == label:
            #Once the condition is true update the accuracy
            accuracy +=1.0 
            
     #express accuracy in percentage by multiplying 100
    return accuracy*100.00/float(len(valid_data))

Here, I used the file train2.txt and valid2.txt to check the accuracy, with mu=1.0

In [68]:
print("")
print("** Naive Bayes **")
print("")
# By setting mu=1.0
mu =1.0

train_data = load_data("train2.txt")
valid_data = load_data("valid2.txt")
counts = count_words(train_data)

print("Validation accuracy: %.3f" % compute_accuracy(valid_data, mu, counts))
print("")


** Naive Bayes **

Validation accuracy: 77.500



Here the task is the same as above but mu has been changed to 0.000002

In [67]:
print("")
print("** Naive Bayes **")
print("")
#set mu to 0.000002 to check its effect on the prediction accuracy
mu = 0.000002

train_data = load_data("train2.txt")
valid_data = load_data("valid2.txt")
counts = count_words(train_data)

print("Validation accuracy: %.3f" % compute_accuracy(valid_data, mu, counts))
print("")


** Naive Bayes **

Validation accuracy: 80.200



The comments from observation is that once the mu decreases, the accuracy increases. This is related to the formula:     $$p(class_i|\, word_i)=\frac{\mu+ count(word_i,\,class_i)}{\mu*V +count(word_i,\,all\_classes)}$$