In [None]:
# Licensing Information:  You are free to use or extend this project for
# educational purposes provided that (1) you do not distribute or publish
# solutions, (2) you retain this notice, and (3) you provide clear
# attribution to The Georgia Institute of Technology, including a link to https://aritter.github.io/CS-7650/

# Attribution Information: This assignment was developed at The Georgia Institute of Technology
# by Alan Ritter (alan.ritter@cc.gatech.edu)

In [None]:
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('To enable a high-RAM runtime, select the Runtime > "Change runtime type"')
  print('menu, and then select High-RAM in the Runtime shape dropdown. Then, ')
  print('re-execute this cell.')
else:
  print('You are using a high-RAM runtime!')

Your runtime has 13.6 gigabytes of available RAM

To enable a high-RAM runtime, select the Runtime > "Change runtime type"
menu, and then select High-RAM in the Runtime shape dropdown. Then, 
re-execute this cell.


# Project \#1: Text Classification

In this assignment, you will implement the perceptron algorithm, and a simple, but competitive neural bag-of-words model, as described in [this paper](https://www.aclweb.org/anthology/P15-1162.pdf) for text classification.  You will train your models on a (provided) dataset of positive and negative movie reviews and report accuracy on a test set.

In this notebook, we provide you with starter code to read in the data and evaluate the performance of your models.  After completing the instructions below, please follow the instructions at the end to submit your notebook and other files to Gradescope.

Make sure to make a copy of this notebook, so your changes are saved.

# Download the dataset

First you will need to download the IMDB dataset - to do this, simply run the cell below.  We have prepared a small version of the ACL IMDB dataset for you to use to help make your experiments faster.  The full dataset is available [here](https://ai.stanford.edu/~amaas/data/sentiment/), in case you are interested, but there is no need to use this for the assignment.

Note that files downloaded in Colab are only saved temporarily: if your session reconnects, you will need to re-download the files.  In case you need persistent storage, you can mount your Google Drive folder like so:

```
from google.colab import drive
drive.mount('/content/drive')
```

You can also open a command line prompt by clicking on the shell icon on the left hand side of the page, and upload/download files from your local machine by clicking on the file icon.

In [None]:
#Download the data

!wget https://github.com/aritter/5525_sentiment/raw/master/aclImdb_small.tgz
!tar xvzf aclImdb_small.tgz > /dev/null

--2022-04-11 02:36:47--  https://github.com/aritter/5525_sentiment/raw/master/aclImdb_small.tgz
Resolving github.com (github.com)... 140.82.121.3
Connecting to github.com (github.com)|140.82.121.3|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2022-04-11 02:36:47 ERROR 404: Not Found.

tar (child): aclImdb_small.tgz: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now


# Converting text to numbers

Below is some code we are providing you to read in the IMDB dataset, perform tokenization (using `nltk`), and convert words into indices.  Please don't modify this code in your submitted version.  We will provide example usage below.
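
As a rough illustration of what "converting words into indices" means, here is a minimal sketch that is independent of the provided classes. It assumes whitespace tokenization instead of `nltk`, and the names `word2id` and `get_id` are hypothetical stand-ins for the idea behind `Vocab.GetID`.

```python
from collections import Counter

# Hypothetical stand-in for the idea behind Vocab.GetID: assign the next
# free integer to every word that has not been seen before.
word2id = {}

def get_id(word):
    if word not in word2id:
        word2id[word] = len(word2id)
    return word2id[word]

# Whitespace tokenization is an assumption made to keep the sketch
# self-contained; the provided code uses nltk's word_tokenize.
doc = "the movie was great , the acting was great"
ids = [get_id(w) for w in doc.lower().split()]

# Bag-of-words counts: one (word ID, count) pair per distinct word.
counts = Counter(ids)

print(ids)
print(dict(counts))
```

Each `(word ID, count)` pair becomes one non-zero entry in the document's row of the sparse matrix.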

In [None]:
import os
import sys

import nltk
from nltk import word_tokenize
nltk.download('punkt')
import torch

#Sparse matrix implementation
from scipy.sparse import csr_matrix
import numpy as np
from collections import Counter

np.random.seed(1)

class Vocab:
    def __init__(self, vocabFile=None):
        self.locked = False
        self.nextId = 0
        self.word2id = {}
        self.id2word = {}
        if vocabFile:
            for line in open(vocabFile):
                line = line.rstrip('\n')
                (word, wid) = line.split('\t')
                self.word2id[word] = int(wid)
                self.id2word[int(wid)] = word  #Key by int so GetWord(id) works after loading
                self.nextId = max(self.nextId, int(wid) + 1)

    def GetID(self, word):
        if not word in self.word2id:
            if self.locked:
                return -1        #UNK token is -1.
            else:
                self.word2id[word] = self.nextId
                self.id2word[self.word2id[word]] = word
                self.nextId += 1
        return self.word2id[word]

    def HasWord(self, word):
        return word in self.word2id   #dict.has_key() was removed in Python 3

    def HasId(self, wid):
        return wid in self.id2word

    def GetWord(self, wid):
        return self.id2word[wid]

    def SaveVocab(self, vocabFile):
        with open(vocabFile, 'w') as fOut:  #Close the file so the vocabulary is flushed to disk
            for word in self.word2id.keys():
                fOut.write("%s\t%s\n" % (word, self.word2id[word]))

    def GetVocabSize(self):
        #return self.nextId-1
        return self.nextId

    def GetWords(self):
        return self.word2id.keys()

    def Lock(self):
        self.locked = True

class IMDBdata:
    def __init__(self, directory, vocab=None):
        """ Reads in data into sparse matrix format """
        pFiles = os.listdir("%s/pos" % directory)
        nFiles = os.listdir("%s/neg" % directory)

        if not vocab:
            self.vocab = Vocab()
        else:
            self.vocab = vocab

        #For csr_matrix (see http://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.sparse.csr_matrix.html#scipy.sparse.csr_matrix)
        X_values = []
        X_row_indices = []
        X_col_indices = []
        Y = []

        XwordList = []
        XfileList = []

        #Read positive files
        for i in range(len(pFiles)):
            f = pFiles[i]
            for line in open("%s/pos/%s" % (directory, f)):
                wordList   = [self.vocab.GetID(w.lower()) for w in word_tokenize(line) if self.vocab.GetID(w.lower()) >= 0]
                XwordList.append(wordList)
                XfileList.append(f)
                wordCounts = Counter(wordList)
                for (wordId, count) in wordCounts.items():
                    if wordId >= 0:
                        X_row_indices.append(i)
                        X_col_indices.append(wordId)
                        X_values.append(count)
            Y.append(+1.0)

        #Read negative files
        for i in range(len(nFiles)):
            f = nFiles[i]
            for line in open("%s/neg/%s" % (directory, f)):
                wordList   = [self.vocab.GetID(w.lower()) for w in word_tokenize(line) if self.vocab.GetID(w.lower()) >= 0]
                XwordList.append(wordList)
                XfileList.append(f)
                wordCounts = Counter(wordList)
                for (wordId, count) in wordCounts.items():
                    if wordId >= 0:
                        X_row_indices.append(len(pFiles)+i)
                        X_col_indices.append(wordId)
                        X_values.append(count)
            Y.append(-1.0)
            
        self.vocab.Lock()

        #Create a sparse matrix in csr format
        self.X = csr_matrix((X_values, (X_row_indices, X_col_indices)), shape=(max(X_row_indices)+1, self.vocab.GetVocabSize()))
        self.Y = np.asarray(Y)

        #Randomly shuffle
        index = np.arange(self.X.shape[0])
        np.random.shuffle(index)
        self.X = self.X[index,:]
        self.XwordList = [torch.LongTensor(XwordList[i]) for i in index]  #Two different sparse formats, csr and lists of IDs (XwordList).
        self.XfileList = [XfileList[i] for i in index]
        self.Y = self.Y[index]

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


The code below reads the train, dev, and test data into memory.  This will take a minute or so.

In [None]:
train = IMDBdata("aclImdb_small/train")
train.vocab.Lock()
dev  = IMDBdata("aclImdb_small/dev", vocab=train.vocab)
test  = IMDBdata("aclImdb_small/test", vocab=train.vocab)

FileNotFoundError: ignored

# Exploring the data

Below are a few examples to help you get familiar with how the data is represented.  $X \in \mathbb{R}^{M \times N}$ contains the design matrix, with one row for each of the $M$ documents and one column for each of the $N$ unique words in the dataset, and $Y \in \{+1,-1\}^M$ contains the labels.  Because this representation has a large number of features, we are using a sparse matrix representation (SciPy's [csr_matrix](http://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.sparse.csr_matrix.html#scipy.sparse.csr_matrix)).  PyTorch doesn't have great support for sparse matrices...
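
To make the sparse layout concrete, here is a small sketch that builds a CSR matrix from `(value, (row, column))` triples, the same constructor form the loading code uses. The two toy documents and the four-word vocabulary are made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy data (an assumption for illustration): 2 documents, 4 vocabulary words.
# Document 0 contains word 0 twice and word 2 once; document 1 contains
# word 1 three times.
values = [2, 1, 3]
rows = [0, 0, 1]
cols = [0, 2, 1]

X = csr_matrix((values, (rows, cols)), shape=(2, 4))

# Densify a single row only when needed; the full matrix stays sparse.
first_doc = X[0].toarray()

# Column sums give the total count of each word across all documents.
col_totals = np.asarray(X.sum(axis=0)).flatten()

print(first_doc)
print(col_totals)
```

Only the three non-zero counts are stored, which is what keeps a matrix with tens of thousands of columns manageable.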


In [None]:
# X contains the design matrix representing the training data.
print(f"Train.X has {train.X.shape[0]} rows and {train.X.shape[1]} columns.")

NameError: ignored

In [None]:
# take a glimpse at one sample, the element value should be the count for each unique word
print(train.X.todense()[0,:20])

NameError: ignored

In [None]:
# Labels
train.Y

NameError: ignored

In [None]:
print(f"Train.Y has {train.Y.shape[0]} rows and 1 columns.")

Train.Y has 7222 rows and 1 columns.


In [None]:
# Let's count the frequency of every word appearing in the documents with negative sentiment:
# sum along each column, which gives the total count for each word
word_counts = np.array(train.X[train.Y == -1,:].sum(axis=0)).flatten()
word_counts

array([   69, 20900,   313, ...,     1,     1,     1], dtype=int64)

In [None]:
np.argsort(word_counts) # the indices to make the array ascending

array([12141, 30438, 13013, ...,    11,    27,    16])

In [None]:
# Now, let's sort the words by frequency: (descending values)
sorted_words = list(reversed(np.argsort(word_counts)))
sorted_words[0:10]

[16, 27, 11, 33, 1, 51, 67, 132, 6, 131]

In [None]:
# What is the index of the most frequent word?
sorted_words[0]

16

In [None]:
# Let's see what word that is:
train.vocab.GetWord(sorted_words[0])

'the'

In [None]:
# What are the 10 most frequent words?
[train.vocab.GetWord(sorted_words[x]) for x in range(10)]

['the', ',', '.', 'a', 'and', 'to', 'of', '>', '<', '/']

# Now it's time to write some code!

# Basic Perceptron (10 points)

Implement the perceptron classification algorithm (fill in the missing code marked with `TODO:` below).
The only hyperparameter to tune is the number of iterations.
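
For reference, the textbook mistake-driven perceptron update can be sketched on a tiny dense dataset. This is a generic illustration under assumed toy data, not the required implementation for the assignment; the four 2-D points and their labels are made up.

```python
import numpy as np

# Toy, linearly separable data (an assumption for illustration).
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = np.zeros(2)   # weight vector
b = 0.0           # bias term

for epoch in range(5):                      # number of iterations
    for x_i, y_i in zip(X, y):
        pred = 1.0 if w @ x_i + b >= 0 else -1.0
        if pred != y_i:                     # update only on mistakes
            w += y_i * x_i
            b += y_i

preds = np.where(X @ w + b >= 0, 1.0, -1.0)
print(w, b, preds)
```

On separable data like this the loop stops changing the parameters once every example is classified correctly, so extra iterations are harmless here.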

In [None]:
class Eval:
    def __init__(self, pred, gold):
        self.pred = np.squeeze(pred)
        self.gold = np.squeeze(gold)
        
    def Accuracy(self):
        return np.sum(np.equal(self.pred, self.gold)) / float(len(self.gold))

class Perceptron:
    def __init__(self, X, Y, N_ITERATIONS):
        #TODO: Initialize parameters
        self.lr = 1e-3
        self.N_epochs = N_ITERATIONS
        self.weights = np.zeros((X.shape[1],1))
        self.w_sum = np.zeros((X.shape[1],1))
        self.bias = 0
        self.b_sum = 0
        self.cnt = 1
        self.Train(X,Y)

    def ComputeAverageParameters(self):
        #TODO: Compute average parameters (do this part last)
        self.weights = self.weights - (self.w_sum / float(self.cnt))
        self.bias = self.bias - (self.b_sum / float(self.cnt))
        return

    def Train(self, X, Y):
        #TODO: Estimate perceptron parameters
        for _ in range(self.N_epochs):
          for inputs, label in zip(X, Y):
            prediction = self.Predict(inputs)
            self.weights += self.lr * ((label - prediction) * inputs).reshape(-1,1)
            self.w_sum += self.cnt * self.lr * ((label - prediction) * inputs).reshape(-1,1)
            self.bias += self.lr * (label - prediction)
            self.b_sum += self.cnt * self.lr * (label - prediction)
            
            self.cnt += 1

        return

    def Predict(self, X):
        #TODO: Implement perceptron classification
        out = np.dot(X.toarray(), self.weights) + self.bias
        return np.asarray([1 if out[i]>= 0.0 else -1 for i in range(X.shape[0])])

    def SavePredictions(self, data, outFile):
        Y_pred = self.Predict(data.X)
        with open(outFile, 'w') as fOut:  #Close the file so all predictions are flushed to disk
            for i in range(len(data.XfileList)):
                fOut.write(f"{data.XfileList[i]}\t{Y_pred[i]}\n")

    def Eval(self, X_test, Y_test):
        Y_pred = self.Predict(X_test)
        ev = Eval(Y_pred, Y_test)
        return ev.Accuracy()

ptron = Perceptron(train.X, train.Y, 5)
print('Simple perceptron on dev set\n')
print(ptron.Eval(dev.X, dev.Y))

ptron.SavePredictions(test, 'test_pred_perceptron.txt')

Simple perceptron on dev set

0.8276327632763276


The simple perceptron exceeds the 80% accuracy requirement on the dev set.

In [None]:
ptron.ComputeAverageParameters()
print('Averaged perceptron on dev set\n')
print(ptron.Eval(dev.X, dev.Y))

ptron.SavePredictions(test, 'test_pred_perceptron_avg.txt')

Averaged perceptron on dev set

0.8406840684068407


The averaged perceptron exceeds the 82% accuracy requirement on the dev set.

In [None]:
print('in-sample accuracy', ptron.Eval(train.X, train.Y))

in-sample accuracy 0.9084741068955968


# Parameter Averaging (5 points)

Once your basic perceptron implementation is working, modify the code to implement parameter averaging.  Instead of using the parameters from the final iteration, $w_N$, to classify test examples, 
use the average of the parameters from every iteration, $\frac{1}{N}\sum_{i=1}^N w_i$.  A nice trick for doing this efficiently is described in section 2.1.1 of [Hal Daume's thesis](http://www.umiacs.umd.edu/~hal/docs/daume06thesis.pdf).  When you are done, save a copy of your predictions on the test set to turn in at the end (click on the file icon on the left side of the notebook).
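
The averaging trick can be checked numerically. The sketch below, on made-up random data, keeps a counter-weighted sum of updates `u` alongside the weights `w` and compares `w - u / c` against a naively accumulated running sum. One caveat of this formulation: with the counter `c` starting at 1, the divisor ends up being the number of examples seen plus one, so the result matches the naive sum divided by `c` rather than by the exact number of steps. The difference is only a positive scale factor and does not change the predicted signs.

```python
import numpy as np

# Made-up separable data (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = np.where(X @ true_w >= 0, 1.0, -1.0)

w = np.zeros(5)       # current weights
u = np.zeros(5)       # counter-weighted sum of updates (the "trick")
w_sum = np.zeros(5)   # naive running sum of weights, kept only for comparison
c = 1                 # example counter, incremented after every example

for epoch in range(3):
    for x_i, y_i in zip(X, y):
        if y_i * (w @ x_i) <= 0:      # mistake: apply the perceptron update
            w += y_i * x_i
            u += c * y_i * x_i
        w_sum += w                    # O(#features) work per example (naive)
        c += 1

w_avg_trick = w - u / c               # no per-example summation needed
w_avg_naive = w_sum / c

print(np.allclose(w_avg_trick, w_avg_naive))
```

The practical benefit is that `u` only changes on mistakes, so with sparse inputs each update touches just the non-zero features instead of the whole weight vector.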

# Inspect Feature Weights (5 points)

Modify the code block above to print out the 20 most positive and 20 most negative words in the vocabulary, sorted by their learned weights according to your model.  This requires a bit of thought, because the words in each document have been converted to IDs (see the examples above).
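
The general pattern (sort the weight indices, then map IDs back to words) can be sketched on a toy example. The four weights and the `id2word` dictionary below are made up; in the notebook the mapping would come from the vocabulary object.

```python
import numpy as np

# Made-up weights and vocabulary (assumptions for illustration).
weights = np.array([0.5, -1.2, 2.0, -0.3])
id2word = {0: "good", 1: "bad", 2: "great", 3: "meh"}

order = np.argsort(weights)           # indices from most negative to most positive

most_negative = [id2word[int(i)] for i in order[:2]]
most_positive = [id2word[int(i)] for i in order[::-1][:2]]

print(most_negative)   # ['bad', 'meh']
print(most_positive)   # ['great', 'good']
```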

In [None]:
#TODO: Print the 20 most positive and 20 most negative words

sorted_weights = list(np.argsort(ptron.weights.squeeze()))
print('20 most negative indices\n',sorted_weights[:20])
print('20 most negative words\n',[train.vocab.GetWord(sorted_weights[x]) for x in range(20)])
print('20 most positive indices\n',sorted_weights[-20:])
print('20 most positive words\n',[train.vocab.GetWord(sorted_weights[x]) for x in range(-1,-21,-1)])

20 most negative indices
 [3656, 1427, 5443, 3871, 4532, 114, 794, 1860, 3757, 4733, 792, 1154, 1324, 653, 194, 551, 37, 3306, 490, 381]
20 most negative words
 ['worst', 'bad', 'waste', 'awful', 'poor', 'no', 'boring', 'terrible', 'horrible', 'worse', 'nothing', 'plot', '?', 'thing', 'minutes', 'even', 'could', 'annoying', 'either', 'instead']
20 most positive indices
 [1829, 1067, 887, 761, 1704, 355, 1678, 1610, 1807, 273, 275, 808, 1243, 1227, 3, 22, 363, 1675, 911, 61]
20 most positive words
 ['great', 'excellent', 'fun', 'best', 'still', 'loved', 'highly', 'life', 'perfect', 'own', 'also', 'definitely', 'our', 'bit', 'love', 'today', 'wonderful', 'real', 'arthur', 'amazing']


# Evaluate on the test set

Once you are happy with your performance on the dev set, run the cell below to evaluate your performance on the test set.  Make sure to download the predictions of your model in `test_pred_perceptron.txt`.

In [None]:
print(ptron.Eval(test.X, test.Y))

ptron.SavePredictions(test, 'test_pred_perceptron.txt')

0.8395181390196621


The final (averaged) perceptron exceeds the 82% accuracy requirement on the test set.

# PyTorch Introduction

Take a look at the code below, which provides an example of how a simple neural network is capable of learning the XOR function (note that the perceptron cannot do this).  Refer to the PyTorch documentation [here](https://pytorch.org/docs/stable/index.html) for more information.

In [None]:
import torch
import torch.nn as nn
from torch import optim
import random
import numpy as np

#Define the computation graph; one layer hidden network
class FFNN(nn.Module):
    def __init__(self, dim_i, dim_h, dim_o):
        super(FFNN, self).__init__()
        self.V = nn.Linear(dim_i, dim_h)
        self.g = nn.Tanh()
        self.W = nn.Linear(dim_h, dim_o)
        self.logSoftmax = nn.LogSoftmax(dim=0)

    def forward(self, x):
        out = self.W(self.g(self.V(x)))
        #print(out.shape)
        out = self.logSoftmax(out)
        #print(out.shape)
        return out

train_X = np.array([[0,0], [0,1], [1,0], [1,1]])
train_Y = np.array([0,1,1,0])

num_classes  = 2
num_hidden   = 10
num_features = 2

ffnn = FFNN(num_features, num_hidden, num_classes)
optimizer = optim.Adam(ffnn.parameters(), lr=0.1)

for epoch in range(100):
    total_loss = 0.0
    #Randomly shuffle examples in each epoch
    shuffled_i = list(range(0,len(train_Y)))
    random.shuffle(shuffled_i)
    for i in shuffled_i:
        x        = torch.from_numpy(train_X[i]).float()
        y_onehot = torch.zeros(num_classes)
        y_onehot[train_Y[i]] = 1

        ffnn.zero_grad()
        logProbs = ffnn.forward(x)

        #print(logProbs.shape, y_onehot.shape)
        loss = torch.neg(logProbs).dot(y_onehot)
        total_loss += loss
        
        loss.backward()
        optimizer.step()
    if epoch % 10 == 0:    
      print("loss on epoch %i: %f" % (epoch, total_loss))

#Evaluate on the training set:
num_errors = 0
for i in range(len(train_Y)):
    x = torch.from_numpy(train_X[i]).float()
    y = train_Y[i]
    logProbs = ffnn.forward(x)
    prediction = torch.argmax(logProbs)
    if y != prediction:
        num_errors += 1
print("number of errors: %d" % num_errors)

loss on epoch 0: 4.169152
loss on epoch 10: 2.375379
loss on epoch 20: 0.537421
loss on epoch 30: 0.145159
loss on epoch 40: 0.080864
loss on epoch 50: 0.054076
loss on epoch 60: 0.039837
loss on epoch 70: 0.030248
loss on epoch 80: 0.023713
loss on epoch 90: 0.019187
number of errors: 0


# Neural Bag-of-Words (10 points)

Your next task is to implement a neural bag-of-words baseline, NBOW-RAND, as described in [this paper](https://www.aclweb.org/anthology/P15-1162.pdf).  See section 2.1.  Note that the dataset we are using is a sample of the full IMDB dataset to make developing your solution easier, so your performance will be a bit lower than the numbers reported in the paper for the full dataset.
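
The core idea of a bag-of-words neural model is that a document of any length is reduced to one fixed-size vector by averaging its word embeddings. The sketch below shows only that pooling step, with an assumed toy vocabulary of 10 words and 4-dimensional embeddings; it is not a full model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes (assumptions for illustration): 10 words, 4-dimensional embeddings.
emb = nn.Embedding(num_embeddings=10, embedding_dim=4)

short_doc = torch.LongTensor([1, 2, 3])
long_doc = torch.LongTensor([4, 5, 6, 7, 8, 9])

# emb(doc) has shape (num_tokens, 4); averaging over dim 0 removes the
# dependence on document length.
short_vec = emb(short_doc).mean(dim=0)
long_vec = emb(long_doc).mean(dim=0)

print(short_vec.shape, long_vec.shape)
```

Because both documents map to vectors of the same size, the same classification layer can be applied to either one.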

# GPUs

You probably want to use GPUs here to enable faster training with larger word embeddings and hidden layers (the paper listed above uses 300 dimensions).

Colab provides free access to GPUs, which will be useful for the rest of the assignment.  To access GPUs, select `Runtime` from the menu at the top of the page, and then choose `Change Runtime Type`; under the `Hardware Accelerator` dropdown select `GPU`.  Note that the free version of Colab does have quotas on how much GPU time can be used. You won't need to use GPUs for the first part of the assignment (implementing the Perceptron algorithm).
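
Inside PyTorch code, a common pattern is to pick the device once and fall back to the CPU when no GPU is present, so the same cell runs in either runtime. A minimal sketch (the variable names are illustrative):

```python
import torch

# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Tensors (and models, via .to(device)) are then created on or moved to
# that device explicitly.
x = torch.zeros(3, device=device)

print(device, x.device)
```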

You can check to make sure a GPU is available using the code below:

In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Select the Runtime > "Change runtime type" menu to enable a GPU accelerator, ')
  print('and then re-execute this cell.')
else:
  print(gpu_info)

Mon Mar  8 18:13:12 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.56       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0    23W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# Data format

In addition to reading in the data in sparse matrix (SciPy CSR) format, which you used in your implementation of the perceptron algorithm, the data loading code above also reads it in another format, stored in `train.XwordList`.  This contains a list of PyTorch tensors consisting of sequences of word IDs corresponding to the content of each document.  The documents in `train.XwordList` are presented in the same order as the labels (`train.Y`).  For the next part of the assignment, you should work with the data in this new format, which is fairly standard for text data in neural networks.  See below for a few examples of how to access the data in this new format.

In [None]:
# Example of using XwordList

print("Here is the first document:\n")
print(train.XwordList[0].tolist())

print("After converting from IDs to words:\n")
print([train.vocab.GetWord(x) for x in train.XwordList[0].tolist()])

print("Here is the label:\n")
print(train.Y[0])

Here is the first document:

[67, 150, 16, 592, 80, 149, 402, 27, 3, 569, 13843, 59, 157, 16, 938, 5, 6780, 1903, 27, 562, 154, 3579, 1, 33, 171, 1381, 3813, 12, 239, 15617, 2585, 938, 11, 9, 12, 48, 19350, 27, 16, 225, 48, 862, 1, 16, 187, 4904, 42, 6298, 1795, 11, 67, 412, 39, 198, 51, 9757, 128, 67, 3, 157, 16, 2664, 3401, 13, 2694, 11, 16, 2664, 16, 356, 54, 1677, 512, 48, 6035, 40733, 27, 1, 4611, 11, 40734, 699, 33, 26255, 6484, 32, 34, 9894, 13, 37, 7294, 27, 154, 48, 249, 3277, 157, 25, 4434, 11, 186, 1768, 3487, 78, 663, 51, 100, 157, 553, 1038, 11, 16, 721, 226, 8630, 1803, 144, 5927, 639, 40, 1675, 51, 448, 33, 786, 32, 33, 18865, 11, 90, 5897, 16, 1390, 1486, 51, 33, 1058, 4307, 49, 981, 13250, 11, 569, 584, 2346, 51, 1220, 1, 16, 366, 864, 5592, 11, 365, 16, 1390, 49, 13, 1738, 212, 42, 249, 1345, 21335, 27, 154, 49, 733, 97, 42, 11, 133, 155, 49, 367, 432, 49, 1277, 25, 51, 3946, 27, 154, 374, 723, 40735, 11, 16, 753, 2710, 5, 319, 9, 6097, 3, 27, 133, 9894, 2496, 51, 867

# Suggestions for getting started

We recommend following a structure similar to the XOR example above, starting with an [Embedding Layer](https://pytorch.org/docs/stable/nn.html#embedding), [ReLU nonlinearity](https://pytorch.org/docs/stable/nn.html#relu), a single hidden layer and the [Adam optimizer](https://pytorch.org/docs/stable/optim.html#torch.optim.Adam).
Please refer to the PyTorch documentation for more information as necessary.  You can use a batch size of one for this assignment, to simplify your implementation.
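
With a batch size of one, the loss used in the XOR example (the negated log-probabilities dotted with a one-hot label) is the negative log-likelihood of the correct class. The sketch below checks that it agrees with PyTorch's built-in `F.nll_loss`; the logits and label are made-up values.

```python
import torch
import torch.nn.functional as F

# Made-up logits for a 2-class problem and a gold label (assumptions).
logits = torch.tensor([1.5, -0.5])
label = 1

log_probs = F.log_softmax(logits, dim=0)

# Loss written as in the XOR example: -log p(correct class) via a one-hot dot product.
y_onehot = torch.zeros(2)
y_onehot[label] = 1
loss_onehot = torch.neg(log_probs).dot(y_onehot)

# Built-in equivalent; nll_loss expects a batch dimension, hence unsqueeze(0).
loss_nll = F.nll_loss(log_probs.unsqueeze(0), torch.tensor([label]))

print(torch.allclose(loss_onehot, loss_nll))
```

Either form can be used for training; the built-in version avoids constructing the one-hot vector.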


In [None]:
import tqdm.notebook  #Explicitly load the submodule used by Train() below

class NBOW(nn.Module):
    def __init__(self, VOCAB_SIZE, DIM_EMB=300, NUM_CLASSES=2):
        super(NBOW, self).__init__()
        self.NUM_CLASSES=NUM_CLASSES
        #TODO: Initialize parameters.
        self.emb = nn.Embedding(VOCAB_SIZE, DIM_EMB)
        self.actv = nn.ReLU()
        self.lr = nn.Linear(DIM_EMB, self.NUM_CLASSES)
        self.logSoftmax = nn.LogSoftmax(dim=0)

    def forward(self, X):
        #TODO: Implement forward computation.
        out = torch.mean(self.emb(X), 0, True)
        out = self.actv(out)
        out = self.lr(out)
        out = self.logSoftmax(torch.squeeze(out))
        return out


def EvalNet(data, net):
    num_correct = 0
    Y = (data.Y + 1.0) / 2.0
    X = data.XwordList
    for i in range(len(X)):
        logProbs = net.forward(X[i])
        pred = torch.argmax(logProbs)
        if pred == Y[i]:
            num_correct += 1
    print("Accuracy: %s" % (float(num_correct) / float(len(X))))

def SavePredictions(data, outFile, net):
    with open(outFile, 'w') as fOut:  #Close the file so all predictions are flushed to disk
        for i in range(len(data.XwordList)):
            logProbs = net.forward(data.XwordList[i])
            pred = torch.argmax(logProbs)
            fOut.write(f"{data.XfileList[i]}\t{pred}\n")

def Train(net, X, Y, n_iter, dev):
    
    print("Start Training!")
    #TODO: initialize optimizer.
    optimizer = optim.Adam(net.parameters(), lr=0.0001)
    # 1e-3 for nbow, 1e-4 for cnn
    # cross-entropy loss function
    
    num_classes = len(set(Y))

    for epoch in range(n_iter):
        num_correct = 0
        total_loss = 0.0
        net.train()   #Put the network into training mode
        for i in tqdm.notebook.tqdm(range(len(X))):
          #pass
          #TODO: compute gradients, do parameter update, compute loss.
          y_onehot = torch.zeros(net.NUM_CLASSES)
          y_onehot[int(Y[i])] = 1

          logProbs = net.forward(X[i])
          #print(logProbs)
          loss = torch.neg(torch.squeeze(logProbs)).dot(y_onehot)
          total_loss += loss

          optimizer.zero_grad()
          loss.backward()
          optimizer.step()

        net.eval()    #Switch to eval mode
        print(f"loss on epoch {epoch} = {total_loss}")
        EvalNet(dev, net)


Because of difficulty accessing a GPU, I ran only 3 epochs on the CPU; with a tuned learning rate, the result still satisfies the requirement.

In [None]:

nbow = NBOW(train.vocab.GetVocabSize())
Train(nbow, train.XwordList, (train.Y + 1.0) / 2.0, 3, dev)

Start Training!


HBox(children=(FloatProgress(value=0.0, max=7222.0), HTML(value='')))


loss on epoch 0 = 3379.13037109375
Accuracy: 0.8478847884788479


HBox(children=(FloatProgress(value=0.0, max=7222.0), HTML(value='')))


loss on epoch 1 = 1344.42138671875
Accuracy: 0.8685868586858686


HBox(children=(FloatProgress(value=0.0, max=7222.0), HTML(value='')))


loss on epoch 2 = 561.0531005859375
Accuracy: 0.8663366336633663


# Evaluate on the test set

Once you are happy with your performance on the dev set, run the cell below to evaluate your performance on the test set.  Make sure to download the predictions of your model in `test_pred_nbow.txt`.

In [None]:
EvalNet(test, nbow)
SavePredictions(test, 'test_pred_nbow.txt', nbow)

Accuracy: 0.8647189144281362


The NBOW model exceeds the 83% accuracy requirement on the test set.

# The same NBOW procedure on GPU

In [None]:
class NBOW(nn.Module):
    def __init__(self, VOCAB_SIZE, DIM_EMB=300, NUM_CLASSES=2):
        super(NBOW, self).__init__()
        self.NUM_CLASSES=NUM_CLASSES
        #TODO: Initialize parameters.
        self.emb = nn.Embedding(VOCAB_SIZE, DIM_EMB)
        self.actv = nn.ReLU()
        self.lr = nn.Linear(DIM_EMB, self.NUM_CLASSES)
        self.logSoftmax = nn.LogSoftmax(dim=0)

    def forward(self, X):
        #TODO: Implement forward computation.
        out = torch.mean(self.emb(X), 0, True)
        out = self.actv(out)
        out = self.lr(out)
        out = self.logSoftmax(torch.squeeze(out))
        return out


def EvalNet(data, net):
    num_correct = 0
    Y = (data.Y + 1.0) / 2.0
    X = data.XwordList
    device = next(net.parameters()).device  #Use whichever device (CPU or GPU) holds the model
    for i in range(len(X)):
        logProbs = net.forward(X[i].to(device))
        pred = torch.argmax(logProbs)
        if pred == Y[i]:
            num_correct += 1
    print("Accuracy: %s" % (float(num_correct) / float(len(X))))

def SavePredictions(data, outFile, net):
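    device = next(net.parameters()).device  #Inputs must live on the same device as the model
    with open(outFile, 'w') as fOut:  #Close the file so all predictions are flushed to disk
        for i in range(len(data.XwordList)):
            logProbs = net.forward(data.XwordList[i].to(device))
            pred = torch.argmax(logProbs)
            fOut.write(f"{data.XfileList[i]}\t{pred}\n")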

def Train(net, X, Y, n_iter, dev):
    
    print("Start Training!")
    #TODO: initialize optimizer.
    optimizer = optim.Adam(net.parameters(), lr=0.0001)
    # 1e-3 for nbow, 1e-4 for cnn
    # cross-entropy loss function
    
    num_classes = len(set(Y))

    for epoch in range(n_iter):
        num_correct = 0
        total_loss = 0.0
        net.train()   #Put the network into training mode
        device = next(net.parameters()).device  #Use whichever device (CPU or GPU) holds the model
        for i in tqdm.notebook.tqdm(range(len(X))):
          #pass
          #TODO: compute gradients, do parameter update, compute loss.
          y_onehot = torch.zeros(net.NUM_CLASSES, device=device)
          y_onehot[int(Y[i])] = 1

          logProbs = net.forward(X[i].to(device))
          #print(logProbs)
          loss = torch.neg(torch.squeeze(logProbs)).dot(y_onehot)
          total_loss += loss

          optimizer.zero_grad()
          loss.backward()
          optimizer.step()

        net.eval()    #Switch to eval mode
        print(f"loss on epoch {epoch} = {total_loss}")
        EvalNet(dev, net)

nbow = NBOW(train.vocab.GetVocabSize()).cuda()
Train(nbow, train.XwordList, (train.Y + 1.0) / 2.0, 9, dev)      

Start Training!


HBox(children=(FloatProgress(value=0.0, max=7222.0), HTML(value='')))


loss on epoch 0 = 4874.962890625
Accuracy: 0.6426642664266426


HBox(children=(FloatProgress(value=0.0, max=7222.0), HTML(value='')))


loss on epoch 1 = 4514.0439453125
Accuracy: 0.6962196219621962


HBox(children=(FloatProgress(value=0.0, max=7222.0), HTML(value='')))


loss on epoch 2 = 4012.455810546875
Accuracy: 0.7448244824482448


HBox(children=(FloatProgress(value=0.0, max=7222.0), HTML(value='')))


loss on epoch 3 = 3465.116455078125
Accuracy: 0.7772277227722773


HBox(children=(FloatProgress(value=0.0, max=7222.0), HTML(value='')))


loss on epoch 4 = 2959.93896484375
Accuracy: 0.8037803780378038


HBox(children=(FloatProgress(value=0.0, max=7222.0), HTML(value='')))


loss on epoch 5 = 2536.58154296875
Accuracy: 0.81998199819982


HBox(children=(FloatProgress(value=0.0, max=7222.0), HTML(value='')))


loss on epoch 6 = 2193.4091796875
Accuracy: 0.833033303330333


HBox(children=(FloatProgress(value=0.0, max=7222.0), HTML(value='')))


loss on epoch 7 = 1914.1561279296875
Accuracy: 0.8438343834383438


HBox(children=(FloatProgress(value=0.0, max=7222.0), HTML(value='')))


loss on epoch 8 = 1682.439453125
Accuracy: 0.846984698469847


In [None]:
EvalNet(test, nbow) # for completeness

Accuracy: 0.8519800609249515


# 1-D Convolutional Neural Networks (2 points - optional extra credit)

The final task for this assignment is to implement a convolutional neural network for text classification similar to the CNN-rand baseline described by [Kim (2014)](https://www.aclweb.org/anthology/D14-1181.pdf).  We haven't covered CNNs in lecture yet, so this part is optional.  You should use the same data format as the NBOW model above, starting out with an [Embedding](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html) layer, followed by a [1-D convolution](https://pytorch.org/docs/stable/generated/torch.nn.Conv1d.html), a [max-pooling layer](https://pytorch.org/docs/stable/generated/torch.nn.MaxPool1d.html#torch.nn.MaxPool1d) and a [Dropout layer](https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html).  You should be able to use the same `Train()` function you wrote above.
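
The main practical hurdle with `Conv1d` on text is getting the tensor shapes right, since it expects input of shape `(batch, channels, length)` with the embedding dimension acting as channels. The sketch below traces the shapes for a single document; all sizes (50-word vocabulary, 8-dimensional embeddings, 6 filters of width 3) are made up for illustration and are not the settings from the paper.

```python
import torch
import torch.nn as nn

# Toy sizes (assumptions for illustration).
emb = nn.Embedding(50, 8)
conv = nn.Conv1d(in_channels=8, out_channels=6, kernel_size=3)

doc = torch.LongTensor([3, 14, 15, 9, 2, 6, 5])    # one document, 7 tokens

x = emb(doc)                          # (7, 8): tokens x embedding dim
x = x.permute(1, 0).unsqueeze(0)      # (1, 8, 7): batch x channels x length
feature_maps = conv(x)                # (1, 6, 5): length shrinks to 7 - 3 + 1
pooled = feature_maps.max(dim=2).values   # (1, 6): max over time, one value per filter

print(feature_maps.shape, pooled.shape)
```

Note that a document shorter than the kernel width would make the convolution fail, so very short inputs may need padding.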

In [None]:
import torch.nn.functional as F

class CNN(nn.Module):
    def __init__(self, VOCAB_SIZE, DIM_EMB=300, NUM_CLASSES=2):
        super(CNN, self).__init__()
        self.NUM_CLASSES=NUM_CLASSES
        #TODO: Initialize parameters.
        self.emb = nn.Embedding(VOCAB_SIZE, DIM_EMB)
        self.conv = nn.Conv1d(DIM_EMB, 200, kernel_size=4, stride=1)
        self.pool = nn.AdaptiveMaxPool1d(1)
        self.lr = nn.Linear(200, self.NUM_CLASSES)
        self.drop = nn.Dropout(p=0.2)
        self.logSoftmax = nn.LogSoftmax(dim=0)


    def forward(self, X):
        #TODO: Implement forward computation.
        out = self.emb(X).permute(1,0).unsqueeze(0)
        out = self.conv(out)
        out = self.pool(out).squeeze(2)
        out = self.lr(out)
        out = self.drop(out)
        out = self.logSoftmax(torch.squeeze(out))
        return out



In [None]:
cnn = CNN(train.vocab.GetVocabSize())
Train(cnn, train.XwordList, (train.Y + 1.0) / 2.0, 3, dev)

Start Training!


HBox(children=(FloatProgress(value=0.0, max=7222.0), HTML(value='')))


loss on epoch 0 = 4012.194580078125
Accuracy: 0.7916291629162916


HBox(children=(FloatProgress(value=0.0, max=7222.0), HTML(value='')))


loss on epoch 1 = 2552.740966796875
Accuracy: 0.806030603060306


HBox(children=(FloatProgress(value=0.0, max=7222.0), HTML(value='')))


loss on epoch 2 = 1502.7093505859375
Accuracy: 0.8267326732673267


# Evaluate on the test set

Once you are happy with your performance on the dev set, run the cell below to evaluate your performance on the test set.  Make sure to download the predictions of your model in `test_pred_cnn.txt`.

In [None]:
EvalNet(test, cnn)
SavePredictions(test, 'test_pred_cnn.txt', cnn)

Accuracy: 0.8327333148712268


The CNN model exceeds the 82% accuracy requirement on the test set.

# Gradescope

Gradescope allows you to add multiple files to your submission. Please submit this notebook along with the test set predictions:
* test_pred_perceptron.txt
* test_pred_nbow.txt
* test_pred_cnn.txt (optional)
* TextClassification_solution.ipynb

To download this notebook, go to `File > Download.ipynb`. You can download the predictions from Colab by clicking the folder icon on the left and finding them under Files. 

Please make sure that you name the files as specified above. You will be able to see the test set accuracy for your predictions. However, the final score will be assigned later based on accuracy and implementation. 

When submitting the .ipynb notebook, please make sure that all the cells run when executed in order starting from a fresh session. If the code doesn't take too long to run, you can re-run everything with `Runtime -> Restart and run all`.

You can submit multiple times before the deadline and choose the submission that you want graded by going to `Submission History` on Gradescope.

