
In this second part of the lab, we will implement a language identifier trained on the same data, but using Logistic Regression instead of Naive Bayes.

In [27]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [28]:
import io, sys, math
import numpy as np
from collections import defaultdict

The `build_dict` function builds the dictionary, or vocabulary: a mapping from strings (words) to integers (indices). This mapping lets us build vector representations of documents. Words that occur `threshold` times or fewer are left out of the vocabulary.
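As a small illustration of this mapping, here is a minimal sketch that builds a vocabulary from a few in-memory lines instead of a file. The helper name `build_vocab_from_lines`, the labels `en` and `fr`, and the toy sentences are assumptions made up for this example. Only the counting and thresholding logic follows `build_dict`.

```python
from collections import Counter

def build_vocab_from_lines(lines, threshold=1):
    """Hypothetical helper: map words seen more than `threshold` times to indices."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        # The first token is the label; the remaining tokens are the words.
        counts.update(tokens[1:])
    vocab = {}
    for word, count in counts.items():
        if count > threshold:
            vocab[word] = len(vocab)
    return vocab

# Toy data (made up): one example per line, label first.
toy_lines = [
    "en the cat sat",
    "en the dog sat",
    "fr le chat",
]
toy_vocab = build_vocab_from_lines(toy_lines, threshold=1)
print(toy_vocab)
```

With `threshold=1`, only `the` and `sat` are kept, since every other word occurs once. Rare words therefore never get an index and are ignored when documents are vectorized.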

In [29]:
# Path to the training file on the mounted Google Drive
data = "/content/drive/MyDrive/NLP_Week_1_Labs_2022/session1/train1.txt"

In [30]:
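def build_dict(filename, threshold=1):
    word_dict, label_dict = {}, {}
    counts = defaultdict(int)
    # The context manager closes the file once it has been read
    with io.open(filename, 'r', encoding='utf-8') as fin:
        for line in fin:
            tokens = line.split()
            if not tokens:  # skip blank lines
                continue
            label = tokens[0]

            # Assign the next free index to each new label
            if label not in label_dict:
                label_dict[label] = len(label_dict)

            for w in tokens[1:]:
                counts[w] += 1

    # Keep only the words that occur more than `threshold` times
    for k, v in counts.items():
        if v > threshold:
            word_dict[k] = len(word_dict)
    return word_dict, label_dict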

In [9]:
build_dict("/content/drive/MyDrive/NLP_Week_1_Labs_2022/session1/train1.txt")

({'Ich': 0,
  'würde': 1,
  'alles': 2,
  'um': 3,
  'dich': 4,
  'zu': 5,
  'Tom': 6,
  'ist': 7,
  'an': 8,
  'völlig': 9,
  'das': 10,
  'in': 11,
  'der': 12,
  '–': 13,
  'muss': 14,
  'Ort': 15,
  'und': 16,
  'У': 17,
  'меня': 18,
  'есть': 19,
  'Non': 20,
  'possiamo': 21,
  'lì.': 22,
  'Том': 23,
  'что': 24,
  '—': 25,
  'это': 26,
  'пустая': 27,
  'трата': 28,
  'времени.': 29,
  'My': 30,
  "don't": 31,
  'speak': 32,
  'El': 33,
  'niño': 34,
  'no': 35,
  'sabe': 36,
  'cómo': 37,
  'Она': 38,
  'думала,': 39,
  'он': 40,
  'у': 41,
  'неё.': 42,
  'neden': 43,
  'üstünde': 44,
  'Lo': 45,
  'È': 46,
  'un': 47,
  'Mi': 48,
  'volas': 49,
  'novan': 50,
  'Hice': 51,
  'mi': 52,
  'trabajo.': 53,
  'Me': 54,
  'mil': 55,
  'para': 56,
  'is': 57,
  'the': 58,
  'Quando': 59,
  'está': 60,
  'frio': 61,
  'говорит,': 62,
  'жизнь': 63,
  'в': 64,
  'Австралии.': 65,
  'Bu': 66,
  'ke': 67,
  'la': 68,
  'estas': 69,
  'You': 70,
  "can't": 71,
  'help': 72,
  'but': 73

The `load_data` function loads a dataset and builds vector representations of its examples. Each document or sentence is represented as a bag of words: an example corresponds to a vector `x` of dimension `V`, where `V` is the size of the vocabulary, and element `j` of `x` is the number of times word `j` appears in the document. Most entries of `x` are zero, so the vector is sparse, although the code stores it as a dense NumPy array.
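To make the representation concrete, here is a minimal sketch that vectorizes a single sentence. The vocabulary `toy_word_dict`, the helper name `bag_of_words`, and the sentence are made up for this illustration. The counting follows the description above.

```python
import numpy as np

# Toy vocabulary (an assumption for this example): word -> index.
toy_word_dict = {"the": 0, "cat": 1, "sat": 2}

def bag_of_words(sentence, word_dict):
    """Hypothetical helper: count how often each vocabulary word occurs."""
    x = np.zeros(len(word_dict))
    for word in sentence.split():
        if word in word_dict:  # out-of-vocabulary words are ignored
            x[word_dict[word]] += 1.0
    return x

x = bag_of_words("the cat saw the dog", toy_word_dict)
print(x)
```

Here `the` occurs twice and `cat` once, while `saw` and `dog` are not in the vocabulary and contribute nothing. Word order is lost: only the counts remain.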

In [31]:
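def load_data(filename, word_dict, label_dict):
    data = []
    dim = len(word_dict)
    # The context manager closes the file once it has been read
    with io.open(filename, 'r', encoding='utf-8') as fin:
        for line in fin:
            tokens = line.split()
            if not tokens:  # skip blank lines
                continue
            label = tokens[0]

            yi = label_dict[label]
            # Bag-of-words vector: xi[j] counts the occurrences of word j
            xi = np.zeros(dim)
            for word in tokens[1:]:
                if word in word_dict:
                    wid = word_dict[word]
                    xi[wid] += 1.0
            data.append((yi, xi))
    return data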

First, let's implement the softmax function. Don't forget numerical stability!
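To see why stability matters, the following sketch compares the direct formula with a version that subtracts the maximum score before exponentiating. The names `naive_softmax` and `stable_softmax` and the score values are made up for this illustration.

```python
import numpy as np

def naive_softmax(x):
    # Direct formula: np.exp overflows when the entries of x are large.
    e = np.exp(x)
    return e / np.sum(e)

def stable_softmax(x):
    # Subtracting the maximum leaves the result unchanged mathematically,
    # but every exponent is now <= 0, so np.exp stays in (0, 1].
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

scores = np.array([1000.0, 1001.0, 1002.0])

# Silence the overflow warnings that the naive version triggers.
with np.errstate(over="ignore", invalid="ignore"):
    naive = naive_softmax(scores)

stable = stable_softmax(scores)
print(naive)   # every entry is nan, because inf / inf is undefined
print(stable)  # a valid probability vector
```

The shift works because multiplying the numerator and the denominator by the same constant `exp(-max(x))` does not change their ratio.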

In [32]:
def softmax(x):
    # Subtract the maximum before exponentiating: the result is unchanged,
    # but np.exp can no longer overflow
    shifted = np.exp(x - np.max(x))
    return shifted / np.sum(shifted)

Now, let's implement the main training loop, by using stochastic gradient descent. The function will iterate over the examples of the training set. For each example, we will first compute the loss, before computing the gradient and performing the update.
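The update rule can be written out as follows. This is a sketch that assumes the per-example loss is the negative log-likelihood of the true label, which is what the `np.log(pred[label])` term in `sgd` suggests. The symbols `K` (number of labels), `e_y` (one-hot vector of the true label `y`) and `\eta` (learning rate) are notation introduced here, not taken from the lab.

```latex
\begin{aligned}
s &= w\,x, \qquad p_k = \frac{\exp(s_k)}{\sum_{j=1}^{K} \exp(s_j)}, \\
\ell(w) &= -\log p_y, \\
\frac{\partial \ell}{\partial s_k} &= p_k - \mathbb{1}[k = y], \\
\nabla_w \ell &= (p - e_y)\, x^{\top}, \\
w &\leftarrow w - \eta\, (p - e_y)\, x^{\top}.
\end{aligned}
```

In the code, `w` has shape `(nlabels, dim)`, so `K` corresponds to `nlabels`, `V` to `dim`, and the learning rate to `lr`.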

In [34]:
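def sgd(w, data, niter):
    lr = 0.2  # learning rate
    nlabels, dim = w.shape
    for epoch in range(niter):
        loss = 0.0
        # Visit the examples in a new random order at every epoch
        np.random.shuffle(data)
        for label, x in data:
            pred = softmax(w @ x)
            # Accumulate the negative log-likelihood of the true label
            # (useful for monitoring convergence)
            loss -= np.log(pred[label])
            # Gradient step on the weights
            grad = compute_gradient(x, pred, label)
            w -= lr * grad

    return w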

The helper `compute_gradient` below returns the gradient of the per-example loss with respect to `w`, given the example `x`, the predicted probabilities and the true label. `sgd` calls it for every update.

In [11]:
def compute_loss(w, x, y_true):
    # Negative log-likelihood of the true label for a single example
    y_pred = softmax(w @ x)
    return -np.log(y_pred[y_true])

In [23]:
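def compute_gradient(x, pred, y_true):
    # Gradient of -log(pred[y_true]) w.r.t. w: (pred - one_hot(y_true)) x^T.
    # Work on a copy so that the caller's `pred` is not modified in place.
    p = pred.copy()
    p[y_true] -= 1
    return np.outer(p, x)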
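One way to gain confidence in a gradient like this is to compare it with a finite-difference estimate. The sketch below is self-contained and uses hypothetical names (`loss`, `analytic_gradient`, `numerical_gradient`) together with random toy weights. It assumes the loss is the negative log-likelihood of the true label.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def loss(w, x, y):
    # Negative log-likelihood of the true label y for one example.
    return -np.log(softmax(w @ x)[y])

def analytic_gradient(w, x, y):
    # (p - one_hot(y)) x^T, computed on a copy of the prediction.
    p = softmax(w @ x).copy()
    p[y] -= 1.0
    return np.outer(p, x)

def numerical_gradient(w, x, y, eps=1e-6):
    # Central finite differences, perturbing one weight at a time.
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[idx] += eps
        w_minus[idx] -= eps
        grad[idx] = (loss(w_plus, x, y) - loss(w_minus, x, y)) / (2 * eps)
    return grad

# Toy problem (made up): 3 labels, 4 vocabulary words.
rng = np.random.default_rng(0)
w = rng.normal(size=(3, 4))
x = np.array([1.0, 0.0, 2.0, 1.0])
y = 2

max_diff = np.max(np.abs(analytic_gradient(w, x, y) - numerical_gradient(w, x, y)))
print(max_diff)
```

If the analytic formula is right, `max_diff` should be tiny, many orders of magnitude below the gradient entries themselves. A large value usually points to a sign error or a swapped dimension.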

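The `predict` function returns the most probable label for an example `x`, given the trained classifier `w`. Finally, `compute_accuracy` computes the accuracy of a trained classifier `w` on a validation set, as a percentage.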

In [35]:
def predict(w, x):
    # Most probable label: index of the largest softmax probability
    pred = softmax(w @ x)
    return np.argmax(pred)

In [36]:
def compute_accuracy(w, valid_data):
    # Percentage of validation examples whose predicted label is correct
    correct = 0
    for label, x in valid_data:
        if predict(w, x) == label:
            correct += 1
    return 100.0 * correct / len(valid_data)

In [37]:
print("")
print("** Logistic Regression **")
print("")

word_dict, label_dict = build_dict("/content/drive/MyDrive/NLP_Week_1_Labs_2022/session1/train1.txt")
train_data = load_data("/content/drive/MyDrive/NLP_Week_1_Labs_2022/session1/train1.txt", word_dict, label_dict)
valid_data = load_data("/content/drive/MyDrive/NLP_Week_1_Labs_2022/session1/valid1.txt", word_dict, label_dict)

nlabels = len(label_dict)
dim = len(word_dict)
w = np.zeros([nlabels, dim])
w = sgd(w, train_data, 5)
print("")
print("Validation accuracy: %.3f" % compute_accuracy(w, valid_data))
print("")


** Logistic Regression **


Validation accuracy: 93.000

