In this second part of the lab, we will implement a language identifier trained on the same data, but using Logistic Regression instead of Naive Bayes.

In [1]:
import io, sys, math
import numpy as np
from collections import defaultdict

In [14]:
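# Peek at the first few lines of the training file to check its format.
with open('train1.txt', 'r', encoding='utf-8') as f:
    for _ in range(4):
        print(f.readline().rstrip())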

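This function builds the dictionary, or vocabulary, which is a mapping from strings (words) to integers (indices). It also builds a second mapping from labels to indices. These mappings will allow us to build vector representations of documents. Words that appear `threshold` times or fewer are left out of the vocabulary.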

In [6]:
def build_dict(filename, threshold=1):
    word_dict, label_dict = {}, {}
    counts = defaultdict(int)
    # First pass: index the labels and count word occurrences.
    with io.open(filename, 'r', encoding='utf-8') as fin:
        for line in fin:
            tokens = line.split()
            if not tokens:
                continue  # skip blank lines
            label = tokens[0]

            if label not in label_dict:
                label_dict[label] = len(label_dict)

            for w in tokens[1:]:
                counts[w] += 1

    # Keep only the words seen more than `threshold` times.
    for k, v in counts.items():
        if v > threshold:
            word_dict[k] = len(word_dict)
    return word_dict, label_dict
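
To make the two mappings concrete, here is a minimal sketch that applies the same counting logic to a few in-memory lines. The helper name `build_dict_from_lines` and the toy sentences are assumptions made for illustration; the only thing taken from the lab is the "label first, then tokens" line layout.

```python
from collections import defaultdict

def build_dict_from_lines(lines, threshold=1):
    # Hypothetical helper: same logic as build_dict, but reads a list of strings.
    word_dict, label_dict = {}, {}
    counts = defaultdict(int)
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        label = tokens[0]
        if label not in label_dict:
            label_dict[label] = len(label_dict)
        for w in tokens[1:]:
            counts[w] += 1
    # Keep only the words seen more than `threshold` times.
    for k, v in counts.items():
        if v > threshold:
            word_dict[k] = len(word_dict)
    return word_dict, label_dict

# Toy data (not from train1.txt).
toy_lines = [
    "en the cat sat on the mat",
    "fr le chat est sur le tapis",
    "en the dog sat",
]
toy_word_dict, toy_label_dict = build_dict_from_lines(toy_lines, threshold=1)
print(toy_word_dict)   # {'the': 0, 'sat': 1, 'le': 2}
print(toy_label_dict)  # {'en': 0, 'fr': 1}
```

With `threshold=1`, every word that occurs exactly once is dropped, so only `the`, `sat` and `le` receive an index.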

This function loads a dataset and builds vector representations of its examples. Each document or sentence is represented as a bag of words: each example corresponds to a vector `x` of dimension `V`, where `V` is the size of the vocabulary. Element `j` of the vector `x` is the number of times word `j` appears in the document. Most entries are zero, so the vector is sparse in content, although it is stored here as a dense NumPy array. Words that are not in the vocabulary are ignored.

In [7]:
def load_data(filename, word_dict, label_dict):
    data = []
    dim = len(word_dict)
    with io.open(filename, 'r', encoding='utf-8') as fin:
        for line in fin:
            tokens = line.split()
            if not tokens:
                continue  # skip blank lines
            label = tokens[0]

            yi = label_dict[label]
            # Bag-of-words vector: xi[j] counts occurrences of word j.
            xi = np.zeros(dim)
            for word in tokens[1:]:
                if word in word_dict:
                    wid = word_dict[word]
                    xi[wid] += 1.0
            data.append((yi, xi))
    return data
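
As an illustration of the bag-of-words representation, the sketch below turns one tokenized sentence into a count vector. The vocabulary `toy_vocab` and the helper `bag_of_words` are hypothetical; in the lab the vocabulary comes from `build_dict`.

```python
import numpy as np

# Hypothetical toy vocabulary (word -> index).
toy_vocab = {"the": 0, "sat": 1, "le": 2}

def bag_of_words(tokens, word_dict):
    # x[j] is the number of times word j occurs; unknown words are skipped.
    x = np.zeros(len(word_dict))
    for word in tokens:
        wid = word_dict.get(word)
        if wid is not None:
            x[wid] += 1.0
    return x

x_toy = bag_of_words("the cat sat on the mat".split(), toy_vocab)
print(x_toy)  # [2. 1. 0.]
```

Here `the` occurs twice and `sat` once, while `cat`, `on` and `mat` are not in the vocabulary and contribute nothing.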

First, let's implement the softmax function. Don't forget numerical stability!

In [183]:
def softmax(x):
    ### FILL CODE
    # For numerical stability, subtract max(x) before exponentiating.
    # The shift cancels in the ratio, so the result is unchanged.
    exps = np.exp(x - np.max(x))
    # Normalize so that the probabilities sum to one.
    return exps / np.sum(exps)
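
The following sketch shows why the shift by the maximum matters. The names `naive_softmax` and `stable_softmax` and the score values are illustrative assumptions; large scores like these are simply chosen to trigger the overflow.

```python
import numpy as np

def naive_softmax(x):
    # Exponentiates the raw scores; overflows for large inputs.
    e = np.exp(x)
    return e / e.sum()

def stable_softmax(x):
    # Shifting by the maximum keeps every exponent at or below zero.
    e = np.exp(x - np.max(x))
    return e / e.sum()

scores = np.array([1000.0, 1001.0, 1002.0])

# Silence the overflow and invalid-value warnings the naive version triggers.
with np.errstate(over="ignore", invalid="ignore"):
    naive = naive_softmax(scores)
stable = stable_softmax(scores)

print(naive)   # every entry is nan, because inf / inf is undefined
print(stable)  # a valid probability vector
```

Since softmax is invariant to adding a constant to all scores, the stable version returns the same probabilities as it would for the scores `[0, 1, 2]`.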

Now, let's implement the main training loop using stochastic gradient descent. The function iterates over the examples of the training set. For each example, we first compute the loss (the negative log-probability of the true label), then compute the gradient and perform the update.

In [195]:
def sgd(w, data, niter, lr=0.09):
    nlabels, dim = w.shape

    for it in range(niter):
        loss = 0.0
        # Each training example is a (label, bag-of-words vector) pair.
        for y, x in data:
            # One-hot column vector of shape (nlabels, 1) for the true class.
            true_label = np.zeros((nlabels, 1))
            true_label[y] = 1.0
            # Reshape x into a column vector of shape (dim, 1).
            x = x.reshape(dim, 1)

            # Predicted class probabilities, shape (nlabels, 1).
            pred = softmax(w @ x)

            # Cross-entropy loss: negative log-probability of the true class.
            loss += -np.log(pred[y, 0])

            # Gradient of the loss with respect to w: (pred - true_label) x^T.
            loss_grad = (pred - true_label) @ x.T
            # Gradient step with learning rate lr.
            w = w - lr * loss_grad
    return w
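
The update relies on the gradient of the per-example loss `-log softmax(w x)[y]` with respect to `w`, which is `(p - e_y) x^T`, where `p` is the vector of predicted probabilities and `e_y` is the one-hot vector of the true label. A minimal sketch of a finite-difference check of that formula follows. All names here (`softmax_1d`, `example_loss`, `analytic_grad`, `numeric_grad`) and the random toy inputs are assumptions for illustration, not part of the lab code.

```python
import numpy as np

def softmax_1d(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def example_loss(w, x, y):
    # Negative log-probability of the true class y.
    return -np.log(softmax_1d(w @ x)[y])

def analytic_grad(w, x, y):
    # (p - e_y) x^T, written as an outer product.
    p = softmax_1d(w @ x)
    p[y] -= 1.0
    return np.outer(p, x)

def numeric_grad(w, x, y, eps=1e-6):
    # Central finite differences, one weight at a time.
    g = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        w_plus = w.copy()
        w_plus[idx] += eps
        w_minus = w.copy()
        w_minus[idx] -= eps
        g[idx] = (example_loss(w_plus, x, y) - example_loss(w_minus, x, y)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
w_chk = rng.normal(size=(3, 4))   # 3 labels, 4 features (toy sizes)
x_chk = rng.normal(size=4)
y_chk = 1

max_diff = np.max(np.abs(analytic_grad(w_chk, x_chk, y_chk)
                         - numeric_grad(w_chk, x_chk, y_chk)))
print(max_diff)  # should be very small
```

A check like this is only practical on tiny matrices, but it is a quick way to catch a sign error in the gradient before training on the full vocabulary.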

The next function will predict the most probable label corresponding to example `x`, given the trained classifier `w`.

In [196]:
def predict(w, x):
    ## FILL CODE

    # Class probabilities for example x.
    y_pred = softmax(w @ x)
    # Return the index of the most probable class.
    return np.argmax(y_pred)

Finally, this function will compute the accuracy of a trained classifier `w` on a validation set.

In [197]:
def compute_accuracy(w, valid_data):
    ## FILL CODE

    # Count the examples whose predicted label matches the true label.
    correct_pred = 0
    # Each element of valid_data is a (label, vector) tuple.
    for y_valid, x_valid in valid_data:
        if predict(w, x_valid) == y_valid:
            correct_pred += 1
    # Return the accuracy as a percentage.
    return 100.0 * correct_pred / len(valid_data)
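
If the validation set fits in memory as one matrix, the same accuracy can be computed without a Python loop over examples. The sketch below is one possible vectorized variant, not the lab's reference solution; `compute_accuracy_vectorized` and the toy inputs are hypothetical. It skips the softmax because softmax is monotonic, so the largest score and the largest probability share the same index.

```python
import numpy as np

def compute_accuracy_vectorized(w, data):
    # data is a list of (label, vector) tuples, as returned by load_data.
    ys = np.array([y for y, _ in data])
    X = np.stack([x for _, x in data])   # shape (n_examples, dim)
    # Row i of X @ w.T holds the class scores for example i.
    preds = np.argmax(X @ w.T, axis=1)
    return 100.0 * np.mean(preds == ys)

# Toy classifier and data (2 labels, 2 features).
toy_w = np.array([[1.0, 0.0],
                  [0.0, 1.0]])
toy_data = [
    (0, np.array([2.0, 0.0])),
    (1, np.array([0.0, 3.0])),
    (1, np.array([1.0, 0.0])),   # misclassified as label 0
]
acc_toy = compute_accuracy_vectorized(toy_w, toy_data)
print("%.3f" % acc_toy)  # 66.667
```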

In [198]:
print("")
print("** Logistic Regression **")
print("")

word_dict, label_dict = build_dict("train1.txt")
train_data = load_data("train1.txt", word_dict, label_dict)
valid_data = load_data("valid1.txt", word_dict, label_dict)


nlabels = len(label_dict)
dim = len(word_dict)
w = np.zeros([nlabels, dim])
w = sgd(w, train_data, 5)
print("")
print("Validation accuracy: %.3f" % compute_accuracy(w, valid_data))
print("")


** Logistic Regression **


Validation accuracy: 92.100



# Observation: a learning rate of 0.09 gave the best validation accuracy.