# HW 1 Classification

In this homework you will be building several varieties of text classifiers.

## Goal

We ask that you construct the following models in PyTorch:

1. A naive Bayes unigram classifer (follow Wang and Manning http://www.aclweb.org/anthology/P/P12/P12-2.pdf#page=118: you should only implement Naive Bayes, not the combined classifer with SVM).
2. A logistic regression model over word types (you can implement this as $y = \sigma(\sum_i W x_i + b)$) 
3. A continuous bag-of-word neural network with embeddings (similar to CBOW in Mikolov et al https://arxiv.org/pdf/1301.3781.pdf).
4. A simple convolutional neural network (any variant of CNN as described in Kim http://aclweb.org/anthology/D/D14/D14-1181.pdf).
5. Your own extensions to these models...

Consult the papers provided for hyperparameters. 


## Setup

This notebook provides a working definition of the setup of the problem itself. You may construct your models inline or use an external setup (preferred) to build your system.

In [1]:
!pip install torchtext



In [5]:
# Text text processing library and methods for pretrained word embeddings
import torchtext
import numpy as np
import torch as t
from torchtext.vocab import Vectors, GloVe

In [6]:
def variable(array, requires_grad=False):
    if isinstance(array, np.ndarray):
        return t.autograd.Variable(t.from_numpy(array), requires_grad=requires_grad)
    elif isinstance(array, list) or isinstance(array,tuple):
        return t.autograd.Variable(t.from_numpy(np.array(array)), requires_grad=requires_grad)
    elif isinstance(array, float) or isinstance(array, int):
        return t.autograd.Variable(t.from_numpy(np.array([array])), requires_grad=requires_grad)
    elif isinstance(array, t.Tensor):
        return t.autograd.Variable(array, requires_grad=requires_grad)
    else: raise ValueError

The dataset we will use of this problem is known as the Stanford Sentiment Treebank (https://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf). It is a variant of a standard sentiment classification task. For simplicity, we will use the most basic form. Classifying a sentence as positive or negative in sentiment. 

To start, `torchtext` requires that we define a mapping from the raw text data to featurized indices. These fields make it easy to map back and forth between readable data and math, which helps for debugging.

In [47]:
# Our input $x$
TEXT = torchtext.data.Field()

# Our labels $y$
LABEL = torchtext.data.Field(sequential=False)

Next we input our data. Here we will use the standard SST train split, and tell it the fields.

In [48]:
train_dataset, val_dataset, test_dataset = torchtext.datasets.SST.splits(
    TEXT, LABEL,
    filter_pred=lambda ex: ex.label != 'neutral')

Let's look at this data. It's still in its original form, we can see that each example consists of a label and the original words.

In [49]:
print('len(train)', len(train_dataset))
print('vars(train[0])', vars(train_dataset[0]))

len(train) 6920
vars(train[0]) {'text': ['The', 'Rock', 'is', 'destined', 'to', 'be', 'the', '21st', 'Century', "'s", 'new', '``', 'Conan', "''", 'and', 'that', 'he', "'s", 'going', 'to', 'make', 'a', 'splash', 'even', 'greater', 'than', 'Arnold', 'Schwarzenegger', ',', 'Jean-Claud', 'Van', 'Damme', 'or', 'Steven', 'Segal', '.'], 'label': 'positive'}


In order to map this data to features, we need to assign an index to each word an label. The function build vocab allows us to do this and provides useful options that we will need in future assignments.

In [50]:
TEXT.build_vocab(train_dataset)
LABEL.build_vocab(train_dataset)
print('len(TEXT.vocab)', len(TEXT.vocab))
print('len(LABEL.vocab)', len(LABEL.vocab))

len(TEXT.vocab) 16286
len(LABEL.vocab) 3


Finally we are ready to create batches of our training data that can be used for training and validating the model. This function produces 3 iterators that will let us go through the train, val and test data. 

In [51]:
train_iter, val_iter, test_iter = torchtext.data.BucketIterator.splits(
    (train_dataset, val_dataset, test_dataset), batch_size=10, device=-1)

Let's look at a single batch from one of these iterators. The library automatically converts the underlying words into indices. It then produces tensors for batches of x and y. In this case it will consist of the number of words of the longest sentence (with padding) followed by the number of batches. We can use the vocabulary dictionary to convert back from these indices to words.

In [52]:
batch = next(iter(train_iter))
print("Size of text batch [max sent length, batch size]", batch.text.size())
print("Second in batch", batch.text[:, 0])
print("Converted back to string: ", " ".join([TEXT.vocab.itos[i] for i in batch.text[:, 0].data]))

Size of text batch [max sent length, batch size] torch.Size([30, 10])
Second in batch Variable containing:
   363
     4
  1247
 10242
    32
  8128
  5369
  3895
    10
   994
    67
   415
    27
 13588
    20
    23
   896
   636
   206
    32
    96
    10
   114
   611
     5
   239
   757
    35
   422
     2
[torch.LongTensor of size 30]

Converted back to string:  Despite the surface attractions -- Conrad L. Hall 's cinematography will likely be nominated for an Oscar next year -- there 's something impressive and yet lacking about everything .


Similarly it produces a vector for each of the labels in the batch. 

In [53]:
print("Size of label batch [batch size]", batch.label.size())
print("Second in batch", batch.label[0])
print("Converted back to string: ", LABEL.vocab.itos[batch.label.data[0]])

Size of label batch [batch size] torch.Size([10])
Second in batch Variable containing:
 1
[torch.LongTensor of size 1]

Converted back to string:  positive


Finally the Vocab object can be used to map pretrained word vectors to the indices in the vocabulary. This will be very useful for part 3 and 4 of the problem.  

In [54]:
# Build the vocabulary with word embeddings
url = 'https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.simple.vec'
TEXT.vocab.load_vectors(vectors=Vectors('wiki.simple.vec', url=url))

print("Word embeddings size ", TEXT.vocab.vectors.size())
print("Word embedding of 'follows', first 10 dim ", TEXT.vocab.vectors[TEXT.vocab.stoi['follows']][:10])

Word embeddings size  torch.Size([16286, 300])
Word embedding of 'follows', first 10 dim  
 0.3925
-0.4770
 0.1754
-0.0845
 0.1396
 0.3722
-0.0878
-0.2398
 0.0367
 0.2800
[torch.FloatTensor of size 10]



## Assignment

Now it is your turn to build the models described at the top of the assignment. 

Using the data given by this iterator, you should construct 4 different torch models that take in batch.text and produce a distribution over labels. 

When a model is trained, use the following test function to produce predictions, and then upload to the kaggle competition:  https://www.kaggle.com/c/harvard-cs281-hw1

In [12]:
def test(model):
    "All models should be able to be run with following command."
    upload = []
    # Update: for kaggle the bucket iterator needs to have batch_size 10
#     test_iter = torchtext.data.BucketIterator(test_dataset, train=False, batch_size=10)
    for batch in test_iter:
        # Your prediction data here (don't cheat!)
        probs = NB(batch.text).long()
        upload += list(probs.data)

    with open("predictions.txt", "w") as f:
        for u in upload:
            f.write(str(u) + "\n")

In addition, you should put up a (short) write-up following the template provided in the repository:  https://github.com/harvard-ml-courses/cs287-s18/blob/master/template/

# First model: NB

In [13]:
def embed_sentence(batch, vec_dim=300, sentence_length=16):
    """Convert integer-encoded sentence to word vector representation"""
    return t.cat([TEXT.vocab.vectors[batch.text.data.long()[:,i]].view(1,sentence_length,vec_dim) for i in range(batch.batch_size)])

In [14]:
# the NB weights. Then you classify with t.sign(t.sum(W(text)) + bias)
# You can do that because the features are the indicators variables of the word occurences
W = t.nn.Embedding(len(TEXT.vocab), 1)  

In [15]:
from collections import Counter
positive_counts = Counter()
negative_counts = Counter()

In [16]:
train_iter.batch_size = 10
len(train_iter)

692

In [17]:
from tqdm import tqdm
i = 0
pos = 0
neg = 0

# count the occurences of each word (classwise)
for b in tqdm(train_iter):
    i += 1
    pos_tmp = t.nonzero((b.label==1).data.long()).numpy().flatten().shape[0]
    neg_tmp = t.nonzero((b.label==2).data.long()).numpy().flatten().shape[0]
    pos += pos_tmp
    neg += neg_tmp
    if neg_tmp < 10:
        positive_counts += Counter(b.text.transpose(0,1).index_select(0, t.nonzero((b.label==1).data.long()).squeeze()).data.numpy().flatten().tolist())
    if pos_tmp < 10:
        negative_counts += Counter(b.text.transpose(0,1).index_select(0, t.nonzero((b.label==2).data.long()).squeeze()).data.numpy().flatten().tolist())
    if i >= 692:
        break

 95%|██████████████████████████████████████████████████████████████████████████▊    | 655/692 [00:01<00:00, 543.67it/s]


In [18]:
for k in range(len(TEXT.vocab)):  # pseudo counts
    positive_counts[k] += 1
    negative_counts[k] += 1

In [19]:
scale_pos = sum(list(positive_counts.values()))
scale_neg = sum(list(negative_counts.values()))
positive_prop = {k: v/scale_pos for k,v in positive_counts.items()}
negative_prop = {k: v/scale_neg for k,v in negative_counts.items()}

In [20]:
r = {k: np.log(positive_prop[k] / negative_prop[k]) for k in range(len(TEXT.vocab))}
W.weight.data = t.from_numpy(np.array([r[k] for k in range(len(TEXT.vocab))]))

In [21]:
bias = np.log(pos/neg)

In [22]:
def NB(text):
    """sign(Wx + b)"""
    return t.sign(t.cat([t.sum(W(text.transpose(0,1)[i])) for i in range(text.data.numpy().shape[1])]) + bias).long()

In [23]:
upload = []
true = []
for batch in test_iter:
    # Your prediction data here (don't cheat!)
    probs = NB(batch.text).long()
    upload += list(probs.data)
    true += batch.label.data.numpy().tolist()
true = [x if x == 1 else -1 for x in true]

  This is separate from the ipykernel package so we can avoid doing imports until


In [24]:
sum([(x*y == 1) for x,y in zip(upload,true)])/ len(upload)

0.8237232289950577

# Model 2: logistic regression

In [107]:
W = t.nn.Embedding(len(TEXT.vocab), 1)
b = variable(0., True)

In [108]:
# loss and optimizer
nll = t.nn.NLLLoss(size_average=True)

learning_rate = 1e-2
optimizer = t.optim.RMSprop([b, W.weight], lr=learning_rate)
sig = t.nn.Sigmoid()

In [109]:
len(train_iter), train_iter.batch_size

(692, 10)

In [110]:
import torch as t

def eval_perf(iterator):
    count = 0
    bs = iterator.batch_size * 1
    iterator.batch_size = 1
    for i, batch in enumerate(iterator):
        # get data
        y_pred = (sig(t.cat([W(batch.text.transpose(0,1)[i]).sum() for i in range(batch.text.data.numpy().shape[1])]) + b.float()) > 0.5).long()
        y = batch.label.long()*(-1) + 2

        count += t.sum((y == y_pred).long())
        if i >= len(iterator) - 1:
            break
    iterator.batch_size = bs
    return (count.float() / len(iterator)).data.numpy()[0]

In [112]:
import torch as t
n_epochs = 20

for _ in range(n_epochs):
    for i, batch in enumerate(train_iter):
        # get data
        y_pred = sig(t.cat([W(batch.text.transpose(0,1)[i]).sum() for i in range(batch.text.data.numpy().shape[1])]) + b.float()).unsqueeze(1)
        y = batch.label.long()*(-1) + 2
        
        # initialize gradients
        optimizer.zero_grad()
        
        # loss
        y_pred = t.cat([1-y_pred, y_pred], 1).float()  # nll needs two inputs: the prediction for the negative/positive classes

        loss = nll.forward(y_pred, y)

        # compute gradients
        loss.backward()

        # update weights
        optimizer.step()
                
        if i >= len(train_iter) - 1:
            break
    train_iter.init_epoch()
    print("Validation accuracy after %d epochs: %.2f" % (_, eval_perf(val_iter)))

Validation accuracy after 0 epochs: 0.65
Validation accuracy after 1 epochs: 0.70
Validation accuracy after 2 epochs: 0.73
Validation accuracy after 3 epochs: 0.73
Validation accuracy after 4 epochs: 0.74
Validation accuracy after 5 epochs: 0.74
Validation accuracy after 6 epochs: 0.75
Validation accuracy after 7 epochs: 0.76
Validation accuracy after 8 epochs: 0.77
Validation accuracy after 9 epochs: 0.77
Validation accuracy after 10 epochs: 0.78
Validation accuracy after 11 epochs: 0.78
Validation accuracy after 12 epochs: 0.78
Validation accuracy after 13 epochs: 0.78
Validation accuracy after 14 epochs: 0.78
Validation accuracy after 15 epochs: 0.78
Validation accuracy after 16 epochs: 0.78
Validation accuracy after 17 epochs: 0.77
Validation accuracy after 18 epochs: 0.77
Validation accuracy after 19 epochs: 0.78


# Model 3: CBOW

In [42]:
def vectorize(text, vdim=300):
    length, batch_size = text.data.numpy().shape
    return t.mean(t.cat([TEXT.vocab.vectors[text.long().data.transpose(0,1)[i]].view(1,length,vdim) for i in range(batch_size)]), 1)

In [249]:
W = variable(np.random.normal(0, .1, (300,)), True)
b = variable(0., True)

In [250]:
# loss and optimizer
nll = t.nn.NLLLoss(size_average=True)
learning_rate = 1e-2
optimizer = t.optim.RMSprop([b, W], lr=learning_rate)
sig = t.nn.Sigmoid()

In [253]:
import torch as t

def eval_perf(iterator):
    count = 0
    bs = iterator.batch_size * 1
    iterator.batch_size = 1
    for i, batch in enumerate(iterator):
        # get data
        text_ = batch.text
        length = text_.data.numpy().shape[0]
        y_pred = (sig(t.mm(variable(vectorize(text_)),W.float().resize(300,1)).squeeze() + b.float().squeeze()) > 0.5).long()
        y = batch.label.long()*(-1) + 2

        count += t.sum((y == y_pred).long())
        if i >= len(iterator) - 1:
            break
    iterator.batch_size = bs
    return (count.float() / len(iterator)).data.numpy()[0]

In [254]:
import torch as t
n_epochs = 20

for _ in range(n_epochs):
    for i, batch in enumerate(train_iter):
        # get data
        text_ = batch.text
        length = text_.data.numpy().shape[0]
        y_pred = sig(t.mm(variable(vectorize(text_)),W.float().resize(300,1)).squeeze() + b.float().squeeze()).unsqueeze(1)
        y = batch.label.long()*(-1) + 2
        
        # initialize gradients
        optimizer.zero_grad()
        
        # loss
        y_pred = t.cat([1-y_pred, y_pred], 1).float()  # nll needs two inputs: the prediction for the negative/positive classes

        loss = nll.forward(y_pred, y)

        # compute gradients
        loss.backward()

        # update weights
        optimizer.step()
                
        if i >= len(train_iter) - 1:
            break
    train_iter.init_epoch()
    print("Validation accuracy after %d epochs: %.2f" % (_, eval_perf(val_iter)))

Validation accuracy after 0 epochs: 0.71
Validation accuracy after 1 epochs: 0.70
Validation accuracy after 2 epochs: 0.70
Validation accuracy after 3 epochs: 0.71
Validation accuracy after 4 epochs: 0.72
Validation accuracy after 5 epochs: 0.71
Validation accuracy after 6 epochs: 0.69
Validation accuracy after 7 epochs: 0.71
Validation accuracy after 8 epochs: 0.71
Validation accuracy after 9 epochs: 0.71
Validation accuracy after 10 epochs: 0.71
Validation accuracy after 11 epochs: 0.71
Validation accuracy after 12 epochs: 0.71


KeyboardInterrupt: 

# Model 4: convnet on word vectors

In [229]:
from torch.nn import Conv1d as conv, MaxPool1d as maxpool, Linear as fc, Softmax, ReLU, Dropout, Tanh, BatchNorm1d as BN

In [265]:
layer1 = conv(300, 100, 2)
layer2 = fc(100, 100)
layer3 = fc(100, 2)
softmax = Softmax()
dropout = Dropout()
relu = ReLU()
tanh = Tanh()


In [272]:
class Convnet(t.nn.Module):
    
    def __init__(self):
        super(Convnet, self).__init__()
        self.conv1 = conv(300, 100, 5, padding=2)
#         self.conv2 = conv(100, 100, 3, padding=1)
        self.fc1 = fc(100, 100)
        self.fc2 = fc(100, 2)
        
    def forward(self, x):
        dropout = Dropout(.2)
        xx = self.conv1(x)
#         xx = tanh(xx)
#         xx = self.conv2(xx)
        xx = tanh(t.max(xx, -1)[0])
        xx = self.fc1(xx)
#         xx = dropout(xx)
        xx = tanh(xx)
        xx = self.fc2(xx)
        xx = dropout(xx)
        return softmax(xx)

In [273]:
def convnet(x):
    xx = layer1(x)
    xx = tanh(t.max(xx, -1)[0])
    xx = layer2(xx)
    xx = tanh(xx)
    xx = layer3(xx)
    return softmax(xx)

In [274]:
# loss and optimizer
nll = t.nn.NLLLoss(size_average=True)
learning_rate = 1e-4
convnet = Convnet()
optimizer = t.optim.Adam(convnet.parameters(), lr=learning_rate, weight_decay=1e-3)

In [275]:
def vectorize(text, vdim=300):
    length, batch_size = text.data.numpy().shape
    return t.cat([TEXT.vocab.vectors[text.long().data.transpose(0,1)[i]].view(1,length,vdim) for i in range(batch_size)]).permute(0,2,1)

In [276]:
import torch as t

def eval_perf(iterator):
    count = 0
    bs = iterator.batch_size * 1
    iterator.batch_size = 1
    for i, batch in enumerate(iterator):
        # get data
        text = batch.text
        y_pred = (convnet(variable(vectorize(text)))[:, 1] > 0.5).long()
        y = batch.label.long()*(-1) + 2

        count += t.sum((y == y_pred).long())
        if i >= len(iterator) - 1:
            break
    iterator.batch_size = bs
    return (count.float() / len(iterator)).data.numpy()[0]

In [277]:
import torch as t
n_epochs = 25

for _ in range(n_epochs):
    for i, batch in enumerate(train_iter):
        # get data
        text = batch.text
        y_pred = convnet(variable(vectorize(text)))
        y = batch.label.long()*(-1) + 2
        
        # initialize gradients
        optimizer.zero_grad()
        
        # loss
        loss = nll.forward(y_pred, y)

        # compute gradients
        loss.backward()

        # update weights
        optimizer.step()
                
        if i >= len(train_iter) - 1:
            break
    train_iter.init_epoch()
    model.eval()
    print("Validation accuracy after %d epochs: %.2f" % (_, eval_perf(val_iter)))
    model.train()

Validation accuracy after 0 epochs: 0.60
Validation accuracy after 1 epochs: 0.72
Validation accuracy after 2 epochs: 0.74
Validation accuracy after 3 epochs: 0.74
Validation accuracy after 4 epochs: 0.75
Validation accuracy after 5 epochs: 0.76
Validation accuracy after 6 epochs: 0.73
Validation accuracy after 7 epochs: 0.76
Validation accuracy after 8 epochs: 0.75
Validation accuracy after 9 epochs: 0.75
Validation accuracy after 10 epochs: 0.75
Validation accuracy after 11 epochs: 0.75
Validation accuracy after 12 epochs: 0.75
Validation accuracy after 13 epochs: 0.75
Validation accuracy after 14 epochs: 0.75
Validation accuracy after 15 epochs: 0.75
Validation accuracy after 16 epochs: 0.75


KeyboardInterrupt: 

In [183]:
convnet(variable(vectorize(text)))

Variable containing:
 1.0000e+00  2.3268e-21
 1.0000e+00  4.0393e-23
 2.1157e-20  1.0000e+00
 1.0000e+00  4.6889e-25
 1.0000e+00  1.6023e-11
 1.0000e+00  6.1515e-29
 2.4236e-25  1.0000e+00
 1.0000e+00  7.2904e-25
 7.1563e-14  1.0000e+00
 3.4280e-26  1.0000e+00
[torch.FloatTensor of size 10x2]

In [184]:
batch.label

Variable containing:
 2
 2
 1
 2
 1
 1
 1
 2
 1
 1
[torch.LongTensor of size 10]