In [1]:
import re
import numpy as np
import torch as th
import torch.autograd as ag
import torch.nn.functional as F
import torch.nn as nn

# Deep Learning for NLP - lab exercise 1

In this first lab exercise we will implement a simple bag-of-word
classifier, i.e. a classifier that ignores the sequential structure of
the sentence, and a classifier based on a convolutional neural network
(CNN). The goal is to predict if a sentence is a positive or negative
review of a movie. We will use a dataset constructed from IMDB.

1.  Load and clean the data
2.  Preprocess the data for the NN
3.  Module definition
4.  Train the network!

We will implement this model with Pytorch, the most popular deep
learning framework for Natural Language Processing. You can use the
following links for help:

-   tutorials: <http://pytorch.org/tutorials/>
-   documentation: <http://pytorch.org/docs/master/>

## Data

The data can be downloaded here: <http://caio-corro.fr/dl4nlp/imdb.zip>

There are two files: one with positive reviews (imdb.pos) and one with
negative reviews (imdb.neg). Each file contains 300000 reviews, one per
line.

The following functions can be used to load and clean the data.



In [2]:
# Tokenize a sentence
def clean_str(string, tolower=True):
    """
    Tokenization/string cleaning.
    Original taken from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py
    """
    string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string)
    string = re.sub(r"\'s", " \'s", string)
    string = re.sub(r"\'ve", " \'ve", string)
    string = re.sub(r"n\'t", " n\'t", string)
    string = re.sub(r"\'re", " \'re", string)
    string = re.sub(r"\'d", " \'d", string)
    string = re.sub(r"\'ll", " \'ll", string)
    string = re.sub(r",", " , ", string)
    string = re.sub(r"!", " ! ", string)
    # use plain replacement strings: "\(" in a replacement would insert a literal backslash
    string = re.sub(r"\(", " ( ", string)
    string = re.sub(r"\)", " ) ", string)
    string = re.sub(r"\?", " ? ", string)
    string = re.sub(r"\s{2,}", " ", string)
    if tolower:
        string = string.lower()
    return string.strip()


# reads the content of the file passed as an argument.
# if limit > 0, this function will return only the first "limit" sentences in the file.
def loadTexts(filename, limit=-1):
    dataset=[]
    with open(filename) as f:
        line = f.readline()
        cpt=1
        skip=0
        while line :
            # clean the line that was just read (reading again here would skip every other line)
            cleanline = clean_str(line).split()
            if cleanline: 
                dataset.append(cleanline)
            else: 
                line = f.readline()
                skip+=1
                continue
            if limit > 0 and cpt >= limit: 
                break
            line = f.readline()
            cpt+=1        

        print("Load ", cpt, " lines from ", filename , " / ", skip ," lines discarded")
    return dataset

The following cell loads the first 5000 sentences in each review set.

In [3]:
LIM = 5000
txtfile = "./imdb/imdb.pos"
postxt = loadTexts(txtfile,limit=LIM)

txtfile = "./imdb/imdb.neg"
negtxt = loadTexts(txtfile,limit=LIM)

Load  5000  lines from  ./imdb/imdb.pos  /  1  lines discarded
Load  5000  lines from  ./imdb/imdb.neg  /  1  lines discarded


Split the data between train / dev / test, for example by creating lists
txt_train, label_train, txt_dev, ... You should take care to keep a
50/50 ratio between positive and negative instances in each set.

In [4]:
# split into train / dev / test
train_pos_indices = np.random.choice(len(postxt), size=int(0.6*LIM), replace=False)
# create dev excluding train
dev_pos_indices = np.random.choice(list(set(range(len(postxt))) - set(train_pos_indices)), size=int(0.2*LIM), replace=False)
# create test excluding train and dev
test_pos_indices = list(set(range(len(postxt))) - set(train_pos_indices) - set(dev_pos_indices))

train_neg_indices = np.random.choice(len(negtxt), size=int(0.6*LIM), replace=False)
# create dev excluding train
dev_neg_indices = np.random.choice(list(set(range(len(negtxt))) - set(train_neg_indices)), size=int(0.2*LIM), replace=False)
# create test excluding train and dev
test_neg_indices = list(set(range(len(negtxt))) - set(train_neg_indices) - set(dev_neg_indices))

train_pos = [postxt[i] for i in train_pos_indices]
dev_pos = [postxt[i] for i in dev_pos_indices]
test_pos = [postxt[i] for i in test_pos_indices]

train_neg = [negtxt[i] for i in train_neg_indices]
dev_neg = [negtxt[i] for i in dev_neg_indices]
test_neg = [negtxt[i] for i in test_neg_indices]

# create train / dev / test sets
train = [(x,1) for x in train_pos] + [(x,0) for x in train_neg]
dev = [(x,1) for x in dev_pos] + [(x,0) for x in dev_neg]
test = [(x,1) for x in test_pos] + [(x,0) for x in test_neg]
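
The split above draws new random indices on every run. As a minimal sketch (not part of the original lab), a seeded permutation gives the same 60/20/20 split each time; applying it separately to the positive and negative reviews keeps the 50/50 ratio. The helper name `split_indices` and the seed value are assumptions made for illustration.

```python
import numpy as np

def split_indices(n, train_frac=0.6, dev_frac=0.2, seed=0):
    """Return disjoint train/dev/test index arrays that together cover range(n)."""
    rng = np.random.default_rng(seed)  # fixed seed -> reproducible split
    perm = rng.permutation(n)
    n_train = int(train_frac * n)
    n_dev = int(dev_frac * n)
    return perm[:n_train], perm[n_train:n_train + n_dev], perm[n_train + n_dev:]

# Call once per class (positive / negative) so every set stays balanced.
train_idx, dev_idx, test_idx = split_indices(5000)
print(len(train_idx), len(dev_idx), len(test_idx))
```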

# Converting data to Pytorch tensors

We will first convert data to Pytorch tensors so they can be used in a
neural network. To do that, you must first create a dictionary that
will map words to integers. Add to the dictionary only words that are
in the training set (be sure to understand why we do that!).

Then, you can convert the data to tensors:

-   use tensors of longs: both the sentence and the label will be
    represented as integers, not floats!
-   these tensors do not require a gradient

A tensor representing a sentence is composed of the integer
representation of each word, e.g. \[10, 256, 3, 4\]. Note that some
words in the dev and test sets may not be in the dictionary (i.e.
unknown words)! You can just skip them, even if this is a bad idea in
general.

In [5]:
# make a dictionary of all words in the training set
word_dict = {}
for sent, _ in train:
    for word in sent:
        if word not in word_dict:
            word_dict[word] = len(word_dict)

def sent2tensor(sent, word_dict):
    # convert sentence to list of indices, if a word is not in the dictionary, skip it
    idxs = [word_dict[word] if word in word_dict else -1 for word in sent]
    # remove words not in dictionary
    idxs = [idx for idx in idxs if idx >= 0]
    if idxs == []:
        return None
    return th.LongTensor(idxs)

train_data = [(sent2tensor(sent, word_dict), label) for sent, label in train]
dev_data = [(sent2tensor(sent, word_dict), label) for sent, label in dev]
test_data = [(sent2tensor(sent, word_dict), label) for sent, label in test]

# remove empty sentences
train_data = [x for x in train_data if x[0] is not None]
dev_data = [x for x in dev_data if x[0] is not None]
test_data = [x for x in test_data if x[0] is not None]
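
Skipping unknown words is the simplest option. A common alternative, sketched here under the assumption that index 0 is reserved for a hypothetical `<unk>` token (this is not what the cell above does), is to map every out-of-vocabulary word to that shared index so that no sentence ever becomes empty.

```python
import torch as th

UNK = "<unk>"  # hypothetical reserved token, not used in the original notebook

def build_vocab(sentences):
    """Map each training word to an integer; index 0 is kept for unknown words."""
    vocab = {UNK: 0}
    for sent in sentences:
        for word in sent:
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def sent2tensor_unk(sent, vocab):
    """Convert a tokenized sentence to a LongTensor, sending unseen words to <unk>."""
    return th.tensor([vocab.get(w, vocab[UNK]) for w in sent], dtype=th.long)

toy_train = [["a", "good", "movie"], ["a", "bad", "movie"]]
vocab = build_vocab(toy_train)
print(sent2tensor_unk(["a", "terrible", "movie"], vocab))  # "terrible" -> index 0
```

Note that the `<unk>` embedding is only trained if some training words are mapped to it as well, for example very rare ones.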

# Neural network definition

You need to implement two networks:

-   a simple bag of word model (note: it may be better to take the mean
    of input embeddings than the sum)
-   a simple CNN as described in the course

To simplify code, you can assume the input will always be a single
sentence first, and then implement batched inputs. In the case of
batched inputs, give to the forward function a (python) list of tensors.

The bag of word neural network should be defined as follows:

-   take as input a tensor that is a sequence of integers indexing word
    embeddings
-   retrieve the word embeddings from an embedding table
-   construct the "input" of the MLP by summing (or computing the mean)
    over all embeddings (i.e. bag-of-word model)
-   build a hidden representation using a MLP (1 layer? 2 layers?
    experiment! but maybe first try without any hidden layer...)
-   project the hidden representation to the output space: it is a
    binary classification task, so the output space is a scalar where a
    negative (resp. positive) value means the review is negative (resp.
    positive).

The CNN is a little bit more tricky to implement. The goal is that you
implement the one presented in the first lecture. Importantly, you
should add "padding" tokens before and after the sentence so you can
have a convolution even when there is a single word in the input. For
example, if your input sentence is \["word"\], you want to instead
consider the sentence \["\<BOS>", "word", "\<EOS>"\] if your window is
of size 2 or 3. You can do this either directly when you load the data,
or you can do that in the neural network module.
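
As a small sketch of the padding step done at data-loading time, the helper below wraps a tokenized sentence in boundary tokens. The function name `pad_sentence` and its `n_pad` parameter are assumptions for illustration; the padding tokens also need their own entries in the word dictionary.

```python
def pad_sentence(tokens, n_pad=1, bos="<BOS>", eos="<EOS>"):
    """Add n_pad boundary tokens on each side of a tokenized sentence."""
    return [bos] * n_pad + list(tokens) + [eos] * n_pad

print(pad_sentence(["word"]))           # ['<BOS>', 'word', '<EOS>']
print(pad_sentence(["word"], n_pad=2))  # enough context for a window of size 5
```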

First: Let's do the linear classifier!

In [6]:
# BAG of word classifier
class CBOW_classifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, linear_dim):
        super(CBOW_classifier, self).__init__()
        # create embedding table
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # create linear layer
        if type(linear_dim) == int:
            # assignment (=), not comparison (==), so the layers are actually registered
            self.linear = nn.Sequential(nn.Linear(embedding_dim, linear_dim), nn.ReLU(), nn.Linear(linear_dim, 1))
        elif type(linear_dim) in [list, tuple]:
            layers = [nn.Linear(embedding_dim, linear_dim[0]), nn.ReLU()]
            for i in range(len(linear_dim)-1):
                layers.append(nn.Linear(linear_dim[i], linear_dim[i+1]))
                layers.append(nn.ReLU())
            layers.append(nn.Linear(linear_dim[-1], 1))
            self.linear = nn.Sequential(*layers)
        else:
            raise ValueError("linear_dim must be an int, list or tuple")
        
        
    def forward(self, inputs):
        # get embeddings
        embeds = self.embedding(inputs)
        # sum embeddings and average
        embeds = th.sum(embeds, dim=0) / embeds.shape[0]
        # linear layer
        out = self.linear(embeds)
        # sigmoid
        out = F.sigmoid(out)
        return out
    
    def get_embeddings(self, inputs):
        # get embeddings
        embeds = self.embedding(inputs)
        # sum embeddings and average
        embeds = th.sum(embeds, dim=0) / embeds.shape[0]
        return embeds

## Loss function

Create a loss function builder.

-   Pytorch loss functions are documented here:
    <https://pytorch.org/docs/stable/nn.html#loss-functions>
-   In our case, we are interested in *BCELoss* and *BCEWithLogitsLoss*.
    Read their documentation and choose the one that fits with your
    network output
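
To make the difference concrete, here is a minimal check on made-up values: *BCELoss* expects probabilities (the output of a sigmoid), whereas *BCEWithLogitsLoss* expects raw scores and applies the sigmoid internally. Both give the same loss value when used consistently.

```python
import torch as th
import torch.nn as nn

logits = th.tensor([-2.0, 0.5, 3.0])   # raw network scores (toy values)
targets = th.tensor([0.0, 1.0, 1.0])   # gold labels as floats

# Option 1: the network ends with a sigmoid, the loss takes probabilities
loss_from_probs = nn.BCELoss()(th.sigmoid(logits), targets)
# Option 2: the network returns raw scores, the loss applies the sigmoid itself
loss_from_logits = nn.BCEWithLogitsLoss()(logits, targets)

print(loss_from_probs.item(), loss_from_logits.item())
```

The second option is documented as the more numerically stable one, but it requires removing the sigmoid from `forward`.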

In [29]:
# define loss function
loss_fn = nn.BCELoss()

embedding_size = 100
linear_size = (100,50)
# define model
model = CBOW_classifier(len(word_dict), embedding_size, linear_size)

# define optimizer
optim = th.optim.Adam(model.parameters(), lr=0.001)


## Training loop

Write your training loop!

-   parameterizable number of epochs
-   at each epoch, print the mean loss and the dev accuracy

In [30]:
# training loop
for epoch in range(10):
    # shuffle training data
    np.random.shuffle(train_data)
    # set model to train mode
    model.train()

    #compute train accuracy
    correct = 0

    # loop over training data
    for sent, label in train_data:
        # zero gradients
        optim.zero_grad()
        # forward pass
        out = model(sent)
        # compute loss
        loss = loss_fn(out, th.FloatTensor([label]))
        # backward pass
        loss.backward()
        # update parameters
        optim.step()

        #compute train accuracy
        pred = 1 if out > 0.5 else 0
        if pred == label:
            correct += 1
    # compute accuracy
    train_acc = correct / len(train_data)
    


    # set model to eval mode
    model.eval()
    # compute accuracy on dev set
    correct = 0
    for sent, label in dev_data:
        # forward pass
        out = model(sent)
        # get prediction
        pred = 1 if out > 0.5 else 0
        # check if prediction is correct
        if pred == label:
            correct += 1
    # compute accuracy
    acc = correct / len(dev_data)
    print("Epoch: {}, Train Acc: {}, Dev Acc: {}".format(epoch, train_acc, acc))

Epoch: 0, Train Acc: 0.687, Dev Acc: 0.7598784194528876
Epoch: 1, Train Acc: 0.8143333333333334, Dev Acc: 0.7776089159067883
Epoch: 2, Train Acc: 0.871, Dev Acc: 0.7877406281661601
Epoch: 3, Train Acc: 0.9138333333333334, Dev Acc: 0.7897669706180345
Epoch: 4, Train Acc: 0.9476666666666667, Dev Acc: 0.786727456940223
Epoch: 5, Train Acc: 0.9661666666666666, Dev Acc: 0.7806484295845998
Epoch: 6, Train Acc: 0.9791666666666666, Dev Acc: 0.7700101317122594
Epoch: 7, Train Acc: 0.9851666666666666, Dev Acc: 0.7771023302938197
Epoch: 8, Train Acc: 0.989, Dev Acc: 0.7695035460992907
Epoch: 9, Train Acc: 0.9905, Dev Acc: 0.7771023302938197
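
The loop above prints accuracies only, and its dev pass is repeated in every later training cell. A possible refactoring, sketched here with a toy stand-in model (`ToyBow` and `evaluate` are hypothetical names, not part of the lab), is an evaluation helper that also returns the mean loss and disables gradient tracking.

```python
import torch as th
import torch.nn as nn

def evaluate(model, data, loss_fn):
    """Return (mean loss, accuracy) of `model` over a list of (tensor, label) pairs."""
    model.eval()
    total_loss, correct = 0.0, 0
    with th.no_grad():  # no gradients are needed for evaluation
        for sent, label in data:
            out = model(sent)
            total_loss += loss_fn(out, th.tensor([float(label)])).item()
            correct += int((out.item() > 0.5) == bool(label))
    return total_loss / len(data), correct / len(data)

# Toy stand-in for the classifier: mean of embeddings -> linear -> sigmoid
class ToyBow(nn.Module):
    def __init__(self, vocab_size=10, dim=4):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.out = nn.Linear(dim, 1)

    def forward(self, x):
        return th.sigmoid(self.out(self.emb(x).mean(dim=0)))

toy_data = [(th.tensor([1, 2, 3]), 1), (th.tensor([4, 5]), 0)]
mean_loss, acc = evaluate(ToyBow(), toy_data, nn.BCELoss())
print("mean loss:", mean_loss, "accuracy:", acc)
```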


In [9]:
#compute distance between two sentences in the embedding space of model
def compute_distance(sent1, sent2, model, word_dict):
    # get embeddings
    embeds1 = model.get_embeddings(sent2tensor(sent1, word_dict))
    embeds2 = model.get_embeddings(sent2tensor(sent2, word_dict))
    # compute distance
    return th.dist(embeds1, embeds2)

In [10]:
good_bad = compute_distance(["good"], ["bad"], model, word_dict)
bad_terrible = compute_distance(["bad"], ["terrible"], model, word_dict)
good_amazing = compute_distance(["good"], ["amazing"], model, word_dict)
sentence_movie = compute_distance(["this", "movie", "has", 'been', "amazing"], ["movie"], model, word_dict)
print("Distance between good and bad: {}".format(good_bad))
print("Distance between bad and terrible: {}".format(bad_terrible))
print("Distance between good and amazing: {}".format(good_amazing))
print("Distance between 'this movie has been amazing' and movie: {}".format(sentence_movie))
sentence_1 = "this movie makes me hate".split()
sentence_2 = "this movie is literally the best in the world".split()
sent1_sent2 = (compute_distance(sentence_1, sentence_2, model, word_dict))
print("Distance between '{}' and '{}': {}".format(sentence_1, sentence_2, sent1_sent2))

linear_embedder_model = model

Distance between good and bad: 16.094451904296875
Distance between bad and terrible: 15.493986129760742
Distance between good and amazing: 14.684762954711914
Distance between 'this movie has been amazing' and movie: 9.262959480285645
Distance between '['this', 'movie', 'makes', 'me', 'hate']' and '['this', 'movie', 'is', 'literally', 'the', 'best', 'in', 'the', 'world']': 4.809334754943848


It appears that the embeddings we have learnt are not clustered appropriately in the feature space: synonyms end up at similar distances as antonyms. That is fine, since our goal was to train a bag-of-words classifier on top of word embeddings, which we have reasonably achieved.
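
One caveat when reading these numbers: Euclidean distance depends on the norm of the vectors, not only on their direction. Cosine similarity is a common alternative for comparing embeddings. The toy vectors below are made up to illustrate the difference and are not taken from the trained model.

```python
import torch as th
import torch.nn.functional as F

a = th.tensor([1.0, 2.0, 3.0])
b = 10 * a                           # same direction as a, much larger norm
c = th.tensor([-1.0, -2.0, -3.0])    # opposite direction, same norm as a

print(th.dist(a, b))                      # large, driven only by the norm gap
print(F.cosine_similarity(a, b, dim=0))   # close to 1: same direction
print(F.cosine_similarity(a, c, dim=0))   # close to -1: opposite direction
```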

Next up: CNNs for the same task

In [31]:
# build a convolutional bag of words classifier
class ConvCBOX(nn.Module):
    def __init__(self, vocab_size, embedding_dim, kernel_size, out_channels, linear_dim):
        super(ConvCBOX, self).__init__()
        # create embedding table
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # create convolutional layer
        self.conv = nn.Conv1d(embedding_dim, out_channels, kernel_size, padding = 'same')
        # create linear layer
        if type(linear_dim) == int:
            self.linear = nn.Sequential(nn.Linear(out_channels, linear_dim), nn.ReLU(), nn.Linear(linear_dim, 1))
        elif type(linear_dim) in [list, tuple]:
            layers = [nn.Linear(out_channels, linear_dim[0]), nn.ReLU()]
            for i in range(len(linear_dim)-1):
                layers.append(nn.Linear(linear_dim[i], linear_dim[i+1]))
                layers.append(nn.ReLU())
            layers.append(nn.Linear(linear_dim[-1], 1))
            self.linear = nn.Sequential(*layers)
        else:
            raise ValueError("linear_dim must be an int, list or tuple")
        
        
    def forward(self, inputs):
        # get embeddings
        embeds = self.embedding(inputs)
        # transpose embeddings
        embeds = embeds.transpose(1,2)
        # convolve
        conv_out = self.conv(embeds)
        # max pool
        pool_out = F.max_pool1d(conv_out, conv_out.shape[2])
        # linear layer
        out = self.linear(pool_out.squeeze())
        # sigmoid
        out = F.sigmoid(out)
        return out
    
    def get_embeddings(self, inputs):
        # get embeddings
        embeds = self.embedding(inputs)
        # transpose embeddings
        embeds = embeds.transpose(1,2)
        # convolve
        conv_out = self.conv(embeds)
        # max pool
        pool_out = F.max_pool1d(conv_out, conv_out.shape[2])
        return pool_out.squeeze()
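
The shapes in `forward` can be checked on a toy example. The sizes below (vocabulary of 50, embedding dimension 8, 16 output channels) are arbitrary values chosen for the sketch; `padding="same"` requires a reasonably recent PyTorch version, as in the class above.

```python
import torch as th
import torch.nn as nn
import torch.nn.functional as F

emb = nn.Embedding(50, 8)
conv = nn.Conv1d(in_channels=8, out_channels=16, kernel_size=3, padding="same")

sent = th.tensor([[3, 7, 1, 9, 4]])     # (batch=1, length=5)
x = emb(sent)                           # (1, 5, 8): one vector per word
x = x.transpose(1, 2)                   # (1, 8, 5): Conv1d expects channels first
h = conv(x)                             # (1, 16, 5): "same" padding keeps the length
pooled = F.max_pool1d(h, h.shape[2])    # (1, 16, 1): max over all positions
features = pooled.squeeze(2)            # (1, 16): drop only the length axis
print(x.shape, h.shape, features.shape)
```

Calling `squeeze(2)` rather than `squeeze()` keeps the batch dimension even when the batch contains a single sentence, which matters once inputs are batched.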

In [32]:
# define model
embedding_size = 100
kernel_size = 3
out_channels = 100
linear_size = (100,50)
model = ConvCBOX(len(word_dict), embedding_size, kernel_size, out_channels, linear_size)

# define optimizer
optim = th.optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.BCELoss()

In [33]:
# training loop
for epoch in range(10):
    # shuffle training data
    np.random.shuffle(train_data)
    # set model to train mode
    model.train()

    #compute train accuracy
    correct = 0

    # loop over training data
    for sent, label in train_data:
        # zero gradients
        optim.zero_grad()
        # forward pass
        out = model(sent.unsqueeze(0))
        # compute loss
        loss = loss_fn(out, th.FloatTensor([label]))
        # backward pass
        loss.backward()
        # update parameters
        optim.step()

        #compute train accuracy
        pred = 1 if out > 0.5 else 0
        if pred == label:
            correct += 1
    # compute accuracy
    train_acc = correct / len(train_data)
    


    # set model to eval mode
    model.eval()
    # compute accuracy on dev set
    correct = 0
    for sent, label in dev_data:
        # forward pass
        out = model(sent.unsqueeze(0))
        # get prediction
        pred = 1 if out > 0.5 else 0
        # check if prediction is correct
        if pred == label:
            correct += 1
    # compute accuracy
    acc = correct / len(dev_data)
    print("Epoch: {}, Train Acc: {}, Dev Acc: {}".format(epoch, train_acc, acc))

Epoch: 0, Train Acc: 0.6771666666666667, Dev Acc: 0.7467071935157041
Epoch: 1, Train Acc: 0.8173333333333334, Dev Acc: 0.7624113475177305
Epoch: 2, Train Acc: 0.8965, Dev Acc: 0.770516717325228
Epoch: 3, Train Acc: 0.9455, Dev Acc: 0.7755825734549139
Epoch: 4, Train Acc: 0.9691666666666666, Dev Acc: 0.7725430597771024
Epoch: 5, Train Acc: 0.978, Dev Acc: 0.7831813576494427
Epoch: 6, Train Acc: 0.984, Dev Acc: 0.7781155015197568
Epoch: 7, Train Acc: 0.9863333333333333, Dev Acc: 0.7720364741641338
Epoch: 8, Train Acc: 0.9898333333333333, Dev Acc: 0.756838905775076
Epoch: 9, Train Acc: 0.9908333333333333, Dev Acc: 0.7740628166160081


In [14]:
#compute distance between two sentences in the embedding space of model
def compute_distance(sent1, sent2, model, word_dict):
    # get embeddings
    embeds1 = model.get_embeddings(sent2tensor(sent1, word_dict).unsqueeze(0))
    embeds2 = model.get_embeddings(sent2tensor(sent2, word_dict).unsqueeze(0))
    # compute distance
    return th.dist(embeds1, embeds2)

# compute distances between  various words again
with th.no_grad():
    good_bad = compute_distance(["good"], ["bad"], model, word_dict)
    bad_terrible = compute_distance(["bad"], ["terrible"], model, word_dict)
    good_amazing = compute_distance(["good"], ["amazing"], model, word_dict)
    sentence_movie = compute_distance(["this", "movie", "has", 'been', "amazing"], ["movie"], model, word_dict)
print("Distance between good and bad: ", good_bad)
print("Distance between bad and terrible: ", bad_terrible)
print("Distance between good and amazing: ", good_amazing)
print("Distance between 'this movie has been amazing' and 'movie': ", sentence_movie)

conv_embedder_model = model

Distance between good and bad:  tensor(61.5394)
Distance between bad and terrible:  tensor(53.7923)
Distance between good and amazing:  tensor(35.2654)
Distance between 'this movie has been amazing' and 'movie':  tensor(56.5417)


Judging by these distances, the two models seem to learn broadly similar embedding spaces, and their dev accuracy is similar too (about 0.77 after ten epochs). As a downside, the CNN needs a lot more computation for training. This might be because we train the embedding table at the same time and also compute the backward pass through a convolutional layer, which is expensive in itself. Let us try to use a pretrained embedder.
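
One rough way to see where the trainable weights sit is to count parameters per layer. The sketch below uses a hypothetical vocabulary of 20,000 words (the real size is `len(word_dict)`), so the numbers are illustrative only.

```python
import torch.nn as nn

def count_parameters(module):
    """Number of trainable scalars in a module."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

embedding = nn.Embedding(20000, 100)   # hypothetical vocabulary size
conv = nn.Conv1d(100, 100, 3)

print(count_parameters(embedding))  # 20000 * 100 = 2000000
print(count_parameters(conv))       # 100 * 100 * 3 weights + 100 biases = 30100
```

Parameter count is not the same thing as running time, but it shows that the embedding table dominates what has to be learnt from scratch.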

In [15]:
import gensim
from nltk.data import find

word2vec_sample = str(find('models/word2vec_sample/pruned.word2vec.txt'))
embedding_model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_sample, binary=False)

This time, we will build the dataset in the already vectorized space.

In [16]:
# create embedding for all datapoints
train_data_test = [[[embedding_model[word] for word in sent if word in embedding_model], label] for sent, label in train]
dev_data_test = [[[embedding_model[word] for word in sent if word in embedding_model], label] for sent, label in dev]
test_data_test = [[[embedding_model[word] for word in sent if word in embedding_model], label] for sent, label in test]

# remove empty sentences
train_data_test = [x for x in train_data_test if x[0] != []]
dev_data_test = [x for x in dev_data_test if x[0] != []]
test_data_test = [x for x in test_data_test if x[0] != []]

# convert to tensors
train_data_test = [(th.FloatTensor([x[0]]), x[1]) for x in train_data_test]
dev_data_test = [(th.FloatTensor([x[0]]), x[1]) for x in dev_data_test]
test_data_test = [(th.FloatTensor([x[0]]), x[1]) for x in test_data_test]
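
Building a tensor directly from a list of NumPy arrays is slow, and PyTorch emits a warning about it. A possible alternative, sketched here on random stand-in vectors instead of the real word2vec ones, is to stack the arrays first and then convert once.

```python
import numpy as np
import torch as th

# Stand-ins for the 300-dimensional word2vec vectors of a 4-word sentence
word_vectors = [np.random.rand(300).astype(np.float32) for _ in range(4)]

# Stack into one (4, 300) array, convert, then add the batch dimension
sent_tensor = th.from_numpy(np.stack(word_vectors)).unsqueeze(0)
print(sent_tensor.shape, sent_tensor.dtype)
```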



To save the lifespan of our mouse wheel, we will rewrite the class in a modular way, to permit various architectures to be developed based on the provided parameters (mainly the use of pretrained embeddings and convolutional layers).

In [17]:
# create new class that is modular; optional embeddings (or use pre-trained embeddings), optional convolutions, linear layers

class MCBOW(nn.Module):
    def __init__(self, embedding_data, conv_data, linear_data):
        super(MCBOW, self).__init__()
        if type(embedding_data) == tuple and len(embedding_data) == 2:
            # we have embedding_data[0] words, each with embedding_data[1] dimensions
            self.embedding = nn.Embedding(embedding_data[0], embedding_data[1])
        elif embedding_data is None:
            # the input will already be the embeddings
            self.embedding = None
        else:
            raise ValueError("embedding_data must be a tuple of 2 ints or None")

        if type(conv_data) == tuple and len(conv_data) == 3:
            # we have conv_data[0] input channels, conv_data[1] output channels, conv_data[2] kernel size
            self.conv = nn.Conv1d(conv_data[0], conv_data[1], conv_data[2], padding = 'same')
        elif conv_data is None:
            # no convolutions
            self.conv = None
        else:
            raise ValueError("conv_data must be a tuple of 3 ints or None")
        
        if type(linear_data) == tuple:
            # we have linear_data[0] input channels, and the rest are hidden layers
            layers = []
            for i in range(len(linear_data)-1):
                layers.append(nn.Linear(linear_data[i], linear_data[i+1]))
                layers.append(nn.ReLU())
            layers.append(nn.Linear(linear_data[-1], 1))
            self.linear = nn.Sequential(*layers)
        else:
            raise ValueError("linear_data must be a tuple of ints (input size, then hidden layer sizes)")
        
    def forward(self, inputs):
        # get embeddings
        if self.embedding is not None:
            embeds = self.embedding(inputs)
        else:
            embeds = inputs
        if self.conv is not None:
            # transpose embeddings
            embeds = embeds.transpose(1,2)
            # convolve
            conv_out = self.conv(embeds)
            # max pool
            pool_out = F.max_pool1d(conv_out, conv_out.shape[2])
            #pool_out = F.avg_pool1d(conv_out, conv_out.shape[2])
        else:
            # average embeddings
            pool_out = th.sum(embeds, dim=1) / embeds.shape[1]
        # linear layer
        out = self.linear(pool_out.squeeze())
        # sigmoid
        out = F.sigmoid(out)
        return out

Train a CNN classifier using word2vec embeddings.

In [18]:
# define model, we use the pre-trained word2vec embeddings
embedding_data = None
conv_data = (300, 100, 3)
linear_data = (100,)
model = MCBOW(embedding_data, conv_data, linear_data)

# define optimizer
optim = th.optim.Adam(model.parameters(), lr=0.0001)
loss_fn = nn.BCELoss()


In [19]:
# training loop
for epoch in range(10):
    # shuffle training data
    np.random.shuffle(train_data_test)
    # set model to train mode
    model.train()

    #compute train accuracy
    correct = 0

    # loop over training data
    for sent, label in train_data_test:
        # zero gradients
        optim.zero_grad()
        # forward pass
        out = model(th.FloatTensor(sent))
        # compute loss
        loss = loss_fn(out, th.FloatTensor([label]))
        # backward pass
        loss.backward()
        # update parameters
        optim.step()

        #compute train accuracy
        pred = 1 if out > 0.5 else 0
        if pred == label:
            correct += 1
    # compute accuracy
    train_acc = correct / len(train_data_test)
    


    # set model to eval mode
    model.eval()
    # compute accuracy on dev set
    correct = 0
    for sent, label in dev_data_test:
        # forward pass
        out = model(th.FloatTensor(sent))
        # get prediction
        pred = 1 if out > 0.5 else 0
        # check if prediction is correct
        if pred == label:
            correct += 1
    # compute accuracy
    acc = correct / len(dev_data_test)
    print("Epoch: {}, Train Acc: {}, Dev Acc: {}".format(epoch, train_acc, acc))

conv_word2vec_model = model

Epoch: 0, Train Acc: 0.7483870967741936, Dev Acc: 0.7996934082779765
Epoch: 1, Train Acc: 0.8061120543293718, Dev Acc: 0.8083801737353091
Epoch: 2, Train Acc: 0.8247877758913412, Dev Acc: 0.8094021461420542
Epoch: 3, Train Acc: 0.8415959252971138, Dev Acc: 0.8134900357690342
Epoch: 4, Train Acc: 0.8572156196943973, Dev Acc: 0.8109351047521717
Epoch: 5, Train Acc: 0.8643463497453311, Dev Acc: 0.8160449667858968
Epoch: 6, Train Acc: 0.8764006791171477, Dev Acc: 0.8175779253960143
Epoch: 7, Train Acc: 0.8850594227504245, Dev Acc: 0.8175779253960143
Epoch: 8, Train Acc: 0.899151103565365, Dev Acc: 0.8170669391926418
Epoch: 9, Train Acc: 0.9100169779286927, Dev Acc: 0.8037812979049566


Train a linear classifier using word2vec embeddings.

In [20]:
# define model, we use the pre-trained word2vec embeddings
embedding_data = None
conv_data = None
linear_data = (300,64,64,)
model = MCBOW(embedding_data, conv_data, linear_data)

# define optimizer
optim = th.optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.BCELoss()


In [21]:
# training loop
for epoch in range(10):
    # shuffle training data
    np.random.shuffle(train_data_test)
    # set model to train mode
    model.train()

    #compute train accuracy
    correct = 0

    # loop over training data
    for sent, label in train_data_test:
        # zero gradients
        optim.zero_grad()
        # forward pass
        out = model(th.FloatTensor(sent))
        # compute loss
        loss = loss_fn(out, th.FloatTensor([label]))
        # backward pass
        loss.backward()
        # update parameters
        optim.step()

        #compute train accuracy
        pred = 1 if out > 0.5 else 0
        if pred == label:
            correct += 1
    # compute accuracy
    train_acc = correct / len(train_data_test)
    


    # set model to eval mode
    model.eval()
    # compute accuracy on dev set
    correct = 0
    for sent, label in dev_data_test:
        # forward pass
        out = model(th.FloatTensor(sent))
        # get prediction
        pred = 1 if out > 0.5 else 0
        # check if prediction is correct
        if pred == label:
            correct += 1
    # compute accuracy
    acc = correct / len(dev_data_test)
    print("Epoch: {}, Train Acc: {}, Dev Acc: {}".format(epoch, train_acc, acc))

Epoch: 0, Train Acc: 0.766383701188455, Dev Acc: 0.7935615738375064
Epoch: 1, Train Acc: 0.7918505942275043, Dev Acc: 0.7961165048543689
Epoch: 2, Train Acc: 0.8103565365025467, Dev Acc: 0.7853857945835463
Epoch: 3, Train Acc: 0.8161290322580645, Dev Acc: 0.7940725600408789
Epoch: 4, Train Acc: 0.8224108658743633, Dev Acc: 0.7945835462442514
Epoch: 5, Train Acc: 0.8285229202037352, Dev Acc: 0.8002043944813491
Epoch: 6, Train Acc: 0.8424448217317487, Dev Acc: 0.7976494634644865
Epoch: 7, Train Acc: 0.8541595925297114, Dev Acc: 0.8002043944813491
Epoch: 8, Train Acc: 0.86553480475382, Dev Acc: 0.7874297393970363
Epoch: 9, Train Acc: 0.8772495755517827, Dev Acc: 0.7828308635666837


We see that in this experiment the CNN reaches a dev accuracy about 2 points higher than the classifier that simply averages the word vectors (roughly 0.82 vs 0.80 at the best epoch). In addition, not training our own embedding table seems to reduce overfitting: training accuracy drops from about 0.99 to a more reasonable 0.88 to 0.91 for the models built on word2vec embeddings. Let us evaluate the CNN on the test set to make sure the score is similar!
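
Dev accuracy peaks before the last epoch in these runs, so one option (not used in this notebook) is to keep the weights from the best dev epoch. The sketch below replays the CNN dev accuracies logged above, rounded to four decimals, with a stand-in `nn.Linear` model; in a real loop, `dev_acc` would be computed right after training each epoch.

```python
import copy
import torch.nn as nn

model = nn.Linear(4, 1)  # stand-in for the trained classifier

# Dev accuracies of the word2vec CNN run above, rounded to four decimals
dev_accs = [0.7997, 0.8084, 0.8094, 0.8135, 0.8109,
            0.8160, 0.8176, 0.8176, 0.8171, 0.8038]

best_acc, best_epoch, best_state = -1.0, None, None
for epoch, dev_acc in enumerate(dev_accs):
    # ... one epoch of training would happen here ...
    if dev_acc > best_acc:  # strict comparison keeps the earliest best epoch
        best_acc, best_epoch = dev_acc, epoch
        best_state = copy.deepcopy(model.state_dict())

model.load_state_dict(best_state)  # restore the best checkpoint before testing
print(best_epoch, best_acc)
```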

In [23]:
#get final score on the test dataset
correct = 0
for sent, label in test_data_test:
    # forward pass
    out = conv_word2vec_model(th.FloatTensor(sent))
    # get prediction
    pred = 1 if out > 0.5 else 0
    # check if prediction is correct
    if pred == label:
        correct += 1
# compute accuracy
acc = correct / len(test_data_test)
print("Test Acc: {}".format(acc))

Test Acc: 0.7978615071283096


Accuracy for word2vec CNN is reasonably similar for the test and dev sets! The end.