# Fenosoa Randrianjatovo


In this second part of the lab, we will implement a language identifier trained on the same data, but using Logistic Regression instead of Naive Bayes.

In [45]:
import io, sys, math
from tqdm import tqdm
import numpy as np
from collections import defaultdict

This function builds the dictionary, or vocabulary, which maps strings (words) to integers (indices). This will allow us to build vector representations of documents. Only words that occur more than `threshold` times in the file are kept.

In [3]:
def build_dict(filename, threshold=1):
    word_dict, label_dict = {}, {}
    counts = defaultdict(int)
    # Each line has the form "<label> <word> <word> ...".
    with io.open(filename, 'r', encoding='utf-8') as fin:
        for line in fin:
            tokens = line.split()
            label = tokens[0]

            # Assign the next free index to each new label.
            if label not in label_dict:
                label_dict[label] = len(label_dict)

            for w in tokens[1:]:
                counts[w] += 1

    # Keep only the words seen strictly more than `threshold` times.
    for k, v in counts.items():
        if v > threshold:
            word_dict[k] = len(word_dict)
    return word_dict, label_dict
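
To make the mapping concrete, here is a minimal sketch on a made-up four-line corpus. The helper `build_dict_from_lines` and the toy sentences are assumptions for illustration only: the helper applies the same counting and thresholding as `build_dict`, but reads from a list of strings instead of a file.

```python
from collections import defaultdict

# Hypothetical toy corpus: each line is "<label> <word> <word> ...".
toy_lines = [
    "en the cat sat",
    "en the dog sat",
    "fr le chat dort",
    "fr le chien dort",
]

def build_dict_from_lines(lines, threshold=1):
    """In-memory variant of build_dict (illustrative helper, not part of the lab)."""
    word_dict, label_dict = {}, {}
    counts = defaultdict(int)
    for line in lines:
        tokens = line.split()
        # A new label receives the next free index; a known label is left alone.
        label_dict.setdefault(tokens[0], len(label_dict))
        for w in tokens[1:]:
            counts[w] += 1
    # Keep only the words seen strictly more than `threshold` times.
    for word, count in counts.items():
        if count > threshold:
            word_dict[word] = len(word_dict)
    return word_dict, label_dict

toy_word_dict, toy_label_dict = build_dict_from_lines(toy_lines)
print(toy_word_dict)   # {'the': 0, 'sat': 1, 'le': 2, 'dort': 3}
print(toy_label_dict)  # {'en': 0, 'fr': 1}
```

With the default `threshold=1`, the words that appear only once (`cat`, `dog`, `chat`, `chien`) are dropped from the vocabulary.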

This function loads the training dataset and builds vector representations of the training examples. In particular, a document or sentence is represented as a bag of words. Each example corresponds to a sparse vector `x` of dimension `V`, where `V` is the size of the vocabulary. Element `j` of `x` is the number of times word `j` appears in the document. Words that are not in the vocabulary are ignored.

In [46]:
def load_data(filename, word_dict, label_dict):
    data = []
    dim = len(word_dict)
    with io.open(filename, 'r', encoding='utf-8') as fin:
        for line in fin:
            tokens = line.split()
            label = tokens[0]

            yi = label_dict[label]
            # Bag-of-words count vector (stored as a dense array).
            xi = np.zeros(dim)
            for word in tokens[1:]:
                # Out-of-vocabulary words are skipped.
                if word in word_dict:
                    wid = word_dict[word]
                    xi[wid] += 1.0
            data.append((yi, xi))
    return data
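
As a small illustration of the bag-of-words representation, the sketch below vectorizes one sentence against a made-up vocabulary. The names `toy_vocab` and `bag_of_words` are hypothetical and only mirror the inner loop of `load_data`.

```python
import numpy as np

# Hypothetical vocabulary; in the lab this would come from build_dict.
toy_vocab = {"the": 0, "sat": 1, "le": 2, "dort": 3}

def bag_of_words(sentence, vocab):
    """Count vector of dimension len(vocab); out-of-vocabulary words are skipped."""
    x = np.zeros(len(vocab))
    for word in sentence.split():
        if word in vocab:
            x[vocab[word]] += 1.0
    return x

toy_x = bag_of_words("the cat sat on the mat", toy_vocab)
print(toy_x)  # [2. 1. 0. 0.]
```

The word order is lost: only the counts of `the` (2) and `sat` (1) survive, and `cat`, `on`, and `mat` contribute nothing because they are not in the vocabulary.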

First, let's implement the softmax function. Don't forget numerical stability!

In [33]:
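def softmax(x):
    ### FILL CODE
    # Subtract the maximum before exponentiating to avoid overflow;
    # this shift does not change the result.
    x_max = x.max()
    y = np.exp(x - x_max)
    return y / np.sum(y)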
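
The following sketch shows why the shift matters. It compares a direct implementation of the formula with the shifted one on deliberately large scores. The function names and the score values are made up for this illustration.

```python
import numpy as np

def naive_softmax(x):
    # Direct formula: np.exp overflows to inf for large inputs.
    e = np.exp(x)
    return e / np.sum(e)

def stable_softmax(x):
    # Same shift by the maximum as in softmax above.
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

scores = np.array([1000.0, 1001.0, 1002.0])

# Silence the overflow warnings that the naive version triggers.
with np.errstate(over="ignore", invalid="ignore"):
    naive = naive_softmax(scores)

stable = stable_softmax(scores)
print(naive)   # every entry is nan, because inf / inf is undefined
print(stable)  # a valid probability vector
```

Subtracting a constant from every score multiplies the numerator and the denominator by the same factor, so the shifted version returns the same probabilities as the exact formula would.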

Now, let's implement the main training loop, by using stochastic gradient descent. The function will iterate over the examples of the training set. For each example, we will first compute the loss, before computing the gradient and performing the update.
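
For reference, the per-example loss and gradient can be written as follows. The notation is an assumption introduced here and not taken from the lab: $W$ is the $K \times V$ weight matrix, $K$ is the number of labels, $e_y$ is the one-hot vector of the true label $y$, and $\eta$ is the learning rate.

```latex
\begin{align*}
p &= \mathrm{softmax}(Wx), &
p_k &= \frac{\exp\big((Wx)_k\big)}{\sum_{k'=1}^{K} \exp\big((Wx)_{k'}\big)} \\
\ell(W) &= -\log p_y \\
\nabla_W \ell &= (p - e_y)\, x^\top \\
W &\leftarrow W - \eta\, \nabla_W \ell \;=\; W + \eta\, (e_y - p)\, x^\top
\end{align*}
```

The last line is a descent step on the loss, written equivalently as an ascent step on the log-likelihood $\log p_y$.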

In [207]:


word_dict, label_dict = build_dict("./data/train1.txt")
train_data = load_data("./data/train1.txt", word_dict, label_dict)
valid_data = load_data("./data/valid1.txt", word_dict, label_dict)

nlabels = len(label_dict)
dim = len(word_dict)
w = np.zeros([nlabels, dim])

for Y,X in train_data:
    print(softmax(w@X))
    print("\n")
    print(Y,X)
    print("\n")
    break 


[0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1]


0 [1. 1. 1. ... 0. 0. 0.]




In [80]:
w = np.zeros([nlabels, dim])
w.shape

(10, 5826)

In [202]:
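def sgd(w, data, niter):
    nlabels, dim = w.shape
    lr = 0.5
    for epoch in tqdm(range(niter)):
        ### FILL CODE
        ema_loss = None
        # Note: this accumulates the log-likelihood log p(y | x), i.e. the
        # negative of the cross-entropy loss, so the reported values are negative.
        training_loss = 0.0
        for y, x in data:
            pred = softmax(w @ x)
            training_loss += np.log(pred[y])

            # One-hot encoding of the true label.
            target_one = np.zeros_like(pred)
            target_one[y] = 1.0

            # Gradient of the log-likelihood with respect to w: (one_hot - pred) x^T.
            grad = np.outer(target_one - pred, x)

            # Gradient ascent on the log-likelihood (descent on the loss).
            w += lr * grad

            # Exponential moving average of the running sum accumulated so far.
            if ema_loss is None:
                ema_loss = training_loss
            else:
                ema_loss += (training_loss - ema_loss) * 0.01

        # Print progress at the end of each epoch.
        print("Train Epoch: {} \t train Loss: {:.6f}".format(epoch + 1, ema_loss))

    return w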


if __name__ == "__main__":
    sgd(w, train_data, 2)
    

 50%|██████████████████████████████                              | 1/2 [00:02<00:02,  2.13s/it]

Train Epoch: 1 	 train Loss: -712.612723


100%|████████████████████████████████████████████████████████████| 2/2 [00:04<00:00,  2.12s/it]

Train Epoch: 2 	 train Loss: -690.856818
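
One way to gain confidence in the gradient used by `sgd` is a finite-difference check on a tiny random problem. The sketch below is self-contained. The names `_softmax`, `toy_loss`, and `analytic_grad`, as well as the 3-by-4 problem size, are assumptions for illustration only.

```python
import numpy as np

def _softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def toy_loss(w, x, y):
    # Cross-entropy loss for one example: -log p(y | x).
    return -np.log(_softmax(w @ x)[y])

def analytic_grad(w, x, y):
    # Gradient of the loss with respect to w: (pred - one_hot) x^T.
    pred = _softmax(w @ x)
    pred[y] -= 1.0
    return np.outer(pred, x)

rng = np.random.default_rng(0)
w_toy = rng.normal(size=(3, 4))  # 3 labels, 4 features
x_toy = rng.normal(size=4)
y_toy = 1

# Central finite differences, one weight at a time.
eps = 1e-6
numeric = np.zeros_like(w_toy)
for i in range(w_toy.shape[0]):
    for j in range(w_toy.shape[1]):
        w_plus, w_minus = w_toy.copy(), w_toy.copy()
        w_plus[i, j] += eps
        w_minus[i, j] -= eps
        numeric[i, j] = (toy_loss(w_plus, x_toy, y_toy)
                         - toy_loss(w_minus, x_toy, y_toy)) / (2 * eps)

max_diff = np.max(np.abs(numeric - analytic_grad(w_toy, x_toy, y_toy)))
print(max_diff)  # should be tiny if the analytic gradient is right
```

The update in `sgd` adds `lr * (target_one - pred) x^T`, which is the negative of `analytic_grad` scaled by the learning rate, so a small `max_diff` here supports the sign and shape of that update.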





The next function will predict the most probable label corresponding to example `x`, given the trained classifier `w`.

In [208]:
def predict(w, x):
    ## FILL CODE
    pred = softmax(w @ x)
    return np.argmax(pred)
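
Because softmax preserves the ordering of the scores, the most probable label is also the label with the largest raw score. The sketch below illustrates this on a made-up 3-label, 2-feature problem. The name `predict_from_scores` and the toy arrays are hypothetical.

```python
import numpy as np

def predict_from_scores(w, x):
    # Hypothetical variant of predict: take the argmax directly on w @ x,
    # which gives the same label as the argmax of softmax(w @ x).
    return int(np.argmax(w @ x))

toy_w = np.array([[1.0, 0.0],
                  [0.0, 2.0],
                  [0.5, 0.5]])
toy_x = np.array([1.0, 1.0])

toy_label = predict_from_scores(toy_w, toy_x)
print(toy_label)  # 1, since the scores are [1.0, 2.0, 1.0]
```

Keeping the softmax call, as `predict` does, is still useful when the probabilities themselves are needed.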

Finally, this function will compute the accuracy of a trained classifier `w` on a validation set.

In [156]:
def compute_accuracy(w, valid_data):
    ## FILL CODE
    # Returns the accuracy as a percentage (0 to 100).
    accuracy = 0.0
    for y, x in valid_data:
        y_pred = predict(w, x)
        if y == y_pred:
            accuracy += 1.0
    return (accuracy / len(valid_data)) * 100

In [206]:
print("")
print("** Logistic Regression **")
print("")

word_dict, label_dict = build_dict("./data/train1.txt")
train_data = load_data("./data/train1.txt", word_dict, label_dict)
valid_data = load_data("./data/valid1.txt", word_dict, label_dict)

nlabels = len(label_dict)
dim = len(word_dict)
w = np.zeros([nlabels, dim])
w = sgd(w, train_data, 10)
print("")
print("Validation accuracy: %.2f%s" %(compute_accuracy(w, valid_data), "%"))
print("")


** Logistic Regression **



 10%|█████▉                                                     | 1/10 [00:01<00:17,  1.98s/it]

Train Epoch: 1 	 train Loss: -4877.992477


 20%|███████████▊                                               | 2/10 [00:04<00:16,  2.04s/it]

Train Epoch: 2 	 train Loss: -2034.937700


 30%|█████████████████▋                                         | 3/10 [00:06<00:14,  2.08s/it]

Train Epoch: 3 	 train Loss: -1544.218800


 40%|███████████████████████▌                                   | 4/10 [00:08<00:12,  2.12s/it]

Train Epoch: 4 	 train Loss: -1302.662100


 50%|█████████████████████████████▌                             | 5/10 [00:10<00:11,  2.27s/it]

Train Epoch: 5 	 train Loss: -1157.190879


 60%|███████████████████████████████████▍                       | 6/10 [00:13<00:08,  2.22s/it]

Train Epoch: 6 	 train Loss: -1060.389978


 70%|█████████████████████████████████████████▎                 | 7/10 [00:15<00:06,  2.19s/it]

Train Epoch: 7 	 train Loss: -992.017649


 80%|███████████████████████████████████████████████▏           | 8/10 [00:17<00:04,  2.14s/it]

Train Epoch: 8 	 train Loss: -941.585208


 90%|█████████████████████████████████████████████████████      | 9/10 [00:19<00:02,  2.16s/it]

Train Epoch: 9 	 train Loss: -903.073523


100%|██████████████████████████████████████████████████████████| 10/10 [00:21<00:00,  2.15s/it]

Train Epoch: 10 	 train Loss: -872.807390

Validation accuracy: 93.30%






In [209]:
print("")
print("** Logistic Regression **")
print("")

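# Note: the vocabulary and label set are still built from train1.txt,
# while the training and validation examples come from train2.txt and valid2.txt.
word_dict, label_dict = build_dict("./data/train1.txt")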
train_data = load_data("./data/train2.txt", word_dict, label_dict)
valid_data = load_data("./data/valid2.txt", word_dict, label_dict)

nlabels = len(label_dict)
dim = len(word_dict)
w = np.zeros([nlabels, dim])
w = sgd(w, train_data, 10)
print("")
print("Validation accuracy: %.2f%s" %(compute_accuracy(w, valid_data), "%"))
print("")


** Logistic Regression **



 10%|█████▉                                                     | 1/10 [00:24<03:36, 24.01s/it]

Train Epoch: 1 	 train Loss: -24139.380184


 20%|███████████▊                                               | 2/10 [00:49<03:19, 24.96s/it]

Train Epoch: 2 	 train Loss: -15586.643506


 30%|█████████████████▋                                         | 3/10 [01:14<02:53, 24.76s/it]

Train Epoch: 3 	 train Loss: -14219.622659


 40%|███████████████████████▌                                   | 4/10 [01:38<02:26, 24.46s/it]

Train Epoch: 4 	 train Loss: -13555.419423


 50%|█████████████████████████████▌                             | 5/10 [02:05<02:07, 25.42s/it]

Train Epoch: 5 	 train Loss: -13157.904336


 60%|███████████████████████████████████▍                       | 6/10 [02:30<01:40, 25.25s/it]

Train Epoch: 6 	 train Loss: -12895.436810


 70%|█████████████████████████████████████████▎                 | 7/10 [02:59<01:19, 26.49s/it]

Train Epoch: 7 	 train Loss: -12711.779095


 80%|███████████████████████████████████████████████▏           | 8/10 [03:29<00:55, 27.55s/it]

Train Epoch: 8 	 train Loss: -12576.269581


 90%|█████████████████████████████████████████████████████      | 9/10 [03:57<00:27, 27.98s/it]

Train Epoch: 9 	 train Loss: -12472.919024


100%|██████████████████████████████████████████████████████████| 10/10 [04:23<00:00, 26.34s/it]

Train Epoch: 10 	 train Loss: -12391.810220

Validation accuracy: 93.90%






