<a href="https://colab.research.google.com/github/Aba17/NLP-Naive-Bayes-and-Logistic-Regression/blob/main/NLP_Ababacar_Dioukhan%C3%A9_naive_bayes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The goal of this lab is to implement a language identifier (LID).

Our first model will be based on Naive Bayes.

In [1]:
import io, sys, math, re
from collections import defaultdict

The next function is used to load the data. Each line of the data consist of a label (corresponding to a language), followed by some text, written in that language. Here is an example of data:

```__label__de Zur Namensdeutung gibt es mehrere Varianten.```


In [2]:
def load_data(filename):
    fin = io.open(filename, 'r', encoding='utf-8')
    data = []
    for line in fin:
        tokens = line.split()
        data.append((tokens[0], tokens[1:]))
    return data

In [7]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


You can now try loading the first dataset `train1.txt` and look what examples look like.

In [6]:
data = load_data("/content/drive/MyDrive/NLP_Week_1_Labs_2022/session1/train1.txt")
print(data[0])

('__label__de', ['Ich', 'würde', 'alles', 'tun,', 'um', 'dich', 'zu', 'beschützen.'])


Next, we will start implementing the Naive Bayes method. This technique is based on word counts, and we thus need to start by implementing a function to count the words and labels of our training set.

`n_examples` is the total number of examples

`n_words_per_label` is the total number of words for a given label

`label_counts` is the number of times a given label appears in the training data

`word_counts` is the number of times a word appears with a given label

In [13]:
def count_words(data):
    n_examples = 0
    n_words_per_label = defaultdict(lambda: 0)
    label_counts = defaultdict(lambda: 0)
    word_counts = defaultdict(lambda: defaultdict(lambda: 0.0))

    for example in data:
        label, sentence = example
        if label not in label_counts:
          label_counts[label]=0
        else:
          label_counts[label]+=1
        
        if label not in n_words_per_label:
          n_words_per_label[label]=len(sentence)
        else:
          n_words_per_label[label]+=len(sentence)
        for word in sentence:
          if word_counts[word] and word_counts[word][label]:
            word_counts[word][label]+=1
          else:
            word_counts[word][label]=1
        
        n_examples+=1

    return {'label_counts': label_counts, 
            'word_counts': word_counts, 
            'n_examples': n_examples, 
            'n_words_per_label': n_words_per_label}

In [30]:
par=count_words(data)

In [60]:
par['label_counts']

defaultdict(<function __main__.count_words.<locals>.<lambda>>,
            {'__label__de': 827,
             '__label__en': 2136,
             '__label__eo': 1019,
             '__label__es': 563,
             '__label__fr': 649,
             '__label__hu': 431,
             '__label__it': 1326,
             '__label__pt': 577,
             '__label__ru': 1270,
             '__label__tr': 1192})

Next, using the word and label counts from the previous function, we can implement the prediction function.

Here, `mu` is a regularization parameter (Laplace smoothing), and `sentence` is the list of words corresponding to the test example.

In [62]:
def predict(sentence, mu, label_counts, word_counts, n_examples, n_words_per_label,V):
    best_label = None
    best_score = float('-inf')
    #Prediction
    for label in label_counts.copy():
      score=0.0
      # score=label_counts[label]/n_examples
      for word in sentence:
        score +=((mu+word_counts[word][label])/(mu*V+n_words_per_label[label]))
      if score>best_score:
        best_score=score
        best_label=label
    # print(label)
    return best_label

The next function will be used to evaluate the Naive Bayes model on a validation set. It computes the accuracy for a particular regularization parameter `mu`.

In [65]:
def compute_accuracy(valid_data, mu, counts):
    accuracy = 0.0
    
    label_counts=counts['label_counts'] 
    word_counts=counts['word_counts']
    n_examples=counts[ 'n_examples']
    n_words_per_label=counts['n_words_per_label']
    V=0
    # The length of the vocabulary
    for word in word_counts:
      for label in word_counts[word]:
        V+=word_counts[word][label]
    
    for label, sentence in valid_data:
        predict_label=predict(sentence, mu, label_counts, word_counts, n_examples, n_words_per_label,V)
        # print(f"Pred : {predict_label}, True : {label}")
        if predict_label==label:
          accuracy+=1
    return (100*accuracy)/len(valid_data)

In [66]:
print("")
print("** Naive Bayes **")
print("")

mu = 1.0
train_data = load_data("/content/drive/MyDrive/NLP_Week_1_Labs_2022/session1/train1.txt")
valid_data = load_data("/content/drive/MyDrive/NLP_Week_1_Labs_2022/session1/valid1.txt")
counts = count_words(train_data)

print("Validation accuracy: %.3f" % compute_accuracy(valid_data, mu, counts))
print("")


** Naive Bayes **

Validation accuracy: 82.400

