In [2]:
import re
import numpy as np
import torch as th
import torch.autograd as ag
import torch.nn.functional as F
import torch.nn as nn
import random
import math
import pickle
import gzip


The goal is to set up a simple classifier for text and sentiment analysis. 

The goal of this lab session is to implement the model proposed by Yoon Kim, published in 2014. This model is a sentence classifier based on convolution. The original paper can be found [here](https://www.aclweb.org/anthology/D14-1181). It was later adapted to DNA sequence classification in [this paper](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1878-3). Of course, PyTorch and TensorFlow implementations exist on the web, and they are more or less correct and efficient. However, here it is important to do it yourself: the goal is to better understand PyTorch and convolution. 

The task is the binary classification of movie reviews. The dataset is a part of the IMDb dataset. You can find the original dataset on the IMDb website or a version of it on the Kaggle website. For this lab session, we will use a preprocessed and reduced version. 

The road map is to: 
- Load the data
- Run the computation step by step (to debug the dimensions)
- Create a model that wraps the convolution and the pooling 



# Data loading 


Load the data : 

In [3]:
# Open the compressed pickle and make sure the file is closed afterwards
with gzip.open('imdb.pck.gz','rb') as fp:
    texts , labels, lexicon  = pickle.load(fp) 

print(type(texts), type(labels), type(lexicon))
print(texts[0])
print("nb examples : ", len(texts))
VOCAB_SIZE = len(lexicon)
print("Vocab size: ", VOCAB_SIZE)


<class 'list'> <class 'torch.Tensor'> <class 'dict'>
tensor([ 36,  25, 381,  10,  58,  21,  83])
nb examples :  30000
Vocab size:  5002


You get 3 objects: 
- *texts*: a list of tensors, where each tensor represents a word sequence to classify. 
- *labels*: the class, positive or negative, of the corresponding text.
- *lexicon*: a dictionary that maps integers to real words.

Note that only a reduced number of words are selected to build the vocabulary. The less frequent words are discarded and replaced by a specific form (*unk* for unknown).
To read the text you can use, for example, the following code: 

In [42]:
def idx2wordlist(idx_array,lexicon): 
    l = []
    for i in idx_array: 
        l.append(lexicon[i.item()])
    return l
print(texts[0].shape)
for i in range(5): 
    print(idx2wordlist(texts[i+50],lexicon))
print("------------")
for i in range(5): 
    print(idx2wordlist(texts[-i-2000],lexicon))
    


torch.Size([7])
['strong', 'drama']
['please', 'remake', 'this', 'movie']
['very', 'funny', '!']
['great', 'series']
['fun', 'movie']
------------
['absolute', 'waste', 'of', 'time']
['the', 'worst', 'movie', 'ever', 'made']
['slow', 'motion', 'picture', 'that', 'did', "n't", 'get', 'to', 'the', 'point']
['there', 'are', 'good', 'bad', 'movies', 'and', 'there', 'are', 'bad', 'bad', 'movies', 'this', 'one', 'is', 'a', 'real', 'stinker']
['<unk>', 'so', 'bad', 'its', 'funny']


# Embeddings and Convolution layers

Unfortunately, an important part of the work consists of juggling tensor dimensions. This is true for PyTorch as well as for TensorFlow. Here the sequence of operations is: 
- Embedding
- Convolution (1D)
- Pooling
- Linear

Moreover, things can be tricky if we want our model to work properly with mini-batch. 


A quick reminder on the Embedding layer: 

In [6]:
h1 = 4 # dimension of embeddings, the input size for convolution
h2 = 2 # output dimension (number of filters, i.e. out_channels) for the convolution
embLayer = th.nn.Embedding(num_embeddings=len(lexicon), embedding_dim=h1)

In [9]:
# Don't play with the first sentence, it's only one word ! 
embs = embLayer(texts[1])
print("the length of the sequence : ", len(texts[1]))
print(embs.shape)


the length of the sequence :  3
torch.Size([3, 4])




Look at the documentation of the Conv1d layer. Read it carefully and try to fully understand the following code. A convolution layer expects an input tensor with dimensions *(B, D, L)*: 
- B: the batch size, i.e. the number of examples (here, the number of sequences). For the moment we consider *B=1* (only one sequence).
- D: the dimension of the vector at each time step (the number of input channels).
- L: the length of the input sequence (the number of time steps).

We must therefore modify the dimensions of the tensor generated by the embedding layer accordingly. 

A first solution could be: 

In [10]:
tmp = embs.view(1,h1,-1)
print(tmp.shape)

torch.Size([1, 4, 3])


The shape is correct, but it is safer to check the consistency: the first time step should be the embedding of the first word of the sequence. Is that correct ? 

In [11]:
#### TODO : 
print(tmp[0,:,0]) # the embedding of the first time step
#### 


tensor([1.5888, 1.1686, 0.6045, 0.3685], grad_fn=<SelectBackward>)


Find the right way to transform embs so that the result is consistent. 

In [None]:
## TODO 
tmp = None ## <--Find the right way
print(tmp.shape)
print(tmp[0,:,0])
## while  the expected value is : 
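
One possible way to see the difference, as a minimal sketch: `view` only re-reads the same memory with a new shape, whereas a transpose really swaps the two axes. The tensor below is a toy stand-in for the embedding output (an assumption, chosen so that the values are easy to follow), and the names `wrong` and `right` are only illustrative.

```python
import torch as th

# Toy stand-in for the embedding output: L=3 time steps, D=4 dimensions.
embs = th.arange(12.0).view(3, 4)

# view keeps the memory order and only changes the shape, so rows get mixed.
wrong = embs.view(1, 4, -1)
# t() swaps the two axes, then unsqueeze(0) adds the batch axis (B=1).
right = embs.t().unsqueeze(0)

print(wrong.shape, right.shape)  # same shape for both tensors
print(embs[0])                   # embedding of the first word
print(wrong[0, :, 0])            # mixes values from several words
print(right[0, :, 0])            # matches the first embedding
```

Both tensors have the shape expected by Conv1d, which is why checking the shape alone is not enough.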



Now we have a tensor to feed the convolution layer: 

In [12]:
conv1 = th.nn.Conv1d(in_channels=4,out_channels=2,kernel_size=3)

res = conv1(tmp)
print("embs : ",embs.shape)
print("tmp  : ",tmp.shape)
print("conv : ",res.shape)


embs :  torch.Size([3, 4])
tmp  :  torch.Size([1, 4, 3])
conv :  torch.Size([1, 2, 1])


Draw what happens to better understand the obtained dimensions. 

Now add another parameter, *padding*, set to 1. What do you observe? 
Play a bit with *kernel_size* along with *padding* to understand how they interact, trying the following (kernel_size, padding) pairs: 
- (3,1) and (4,1)
- (5,1) and (5,2)
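
To check your observations, here is a small sketch that compares the output length of Conv1d with the usual formula for the default stride and dilation (both 1): L_out = L + 2 * padding - kernel_size + 1. The input is a random stand-in for the embedded sequence, and `expected_length` is a hypothetical helper written for this illustration.

```python
import torch as th

h1, h2, L = 4, 2, 3          # same toy sizes as in the cells above
x = th.randn(1, h1, L)       # random stand-in for the embedded sequence

def expected_length(L, kernel_size, padding):
    # Valid for the default stride=1 and dilation=1.
    return L + 2 * padding - kernel_size + 1

lengths = {}
for k, p in [(3, 0), (3, 1), (4, 1), (5, 1), (5, 2)]:
    conv = th.nn.Conv1d(in_channels=h1, out_channels=h2, kernel_size=k, padding=p)
    out = conv(x)
    lengths[(k, p)] = out.shape[2]
    print(f"kernel_size={k}, padding={p}: output {tuple(out.shape)}, "
          f"formula gives {expected_length(L, k, p)}")
```

With an odd kernel size k, choosing padding = (k - 1) / 2 keeps the output as long as the input.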

In [None]:
conv1 = th.nn.Conv1d(in_channels=h1,out_channels=h2,kernel_size=3,padding=1)
tmp = embs.t().unsqueeze(0)  # (L, D) -> (1, D, L): transpose, then add the batch axis
res = conv1(tmp)
print(embs.shape)
print(tmp.shape)
print(res.shape)


What do you propose for pooling ? 
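
One common answer, and the one used in Kim's paper, is max-over-time pooling: for each filter, keep only the largest value along the time axis, so that the result no longer depends on the sequence length. A minimal sketch, assuming a random tensor as a stand-in for the convolution output:

```python
import torch as th
import torch.nn.functional as F

res = th.randn(1, 2, 5)      # stand-in for a convolution output (B, h2, L)

# Option 1: maximum along the time axis (dim=2); max also returns the argmax.
pooled, _ = res.max(dim=2)   # shape (B, h2)

# Option 2: the same result with a pooling window covering the whole sequence.
pooled_f = F.max_pool1d(res, kernel_size=res.shape[2]).squeeze(2)

print(pooled.shape)
print(th.equal(pooled, pooled_f))
```

Average pooling would be another possibility; the choice here is only an illustration.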


# A first model

First we want to create a model with an embedding layer of size 20, followed by a convolution layer with: 
- a feature size (number of output channels) of 10, 
- a kernel size of 3 (for trigrams),
- padding set to 1. 

All these values must be parameters of the model's constructor.


## Interlude: object programming 

To write our own module, we need to write a class that inherits from Module. A class is a kind of data structure associated with functions, called methods, that manipulate the data. A class allows you to build objects, which are instances of the class. You can see a class as a special type that comes with specific tools (methods) to handle instances of this type. In the definition of a class, the name **self** is, by convention, a reference to the current object. 

Below is a simple example of a class, just for illustration purpose. 

In [29]:
# Defining a dummy class for a point in 2D
class Point2D: 
    # The constructor is the method used to create an object of this class. 
    # It ensures that the object (the structure) is properly created, 
    # with everything we need. A constructor does not return anything. 
    def __init__(self, a=0 , b=0): 
        self.abs = a # defines an attribute for the abscissa
        self.ord = b # defines an attribute for the ordinate
        # The default values of the abscissa and the ordinate are 0 in this case
    # The method called by print
    # It returns a string
    def __str__(self): 
        return "A point : ("+str(self.abs)+","+str(self.ord)+")"
    
    
    # A regular method: the distance to another point
    def distance(self, p2): 
        return math.sqrt((self.abs-p2.abs)**2 + (self.ord-p2.ord)**2)
    
p1 = Point2D(4,2)
print(p1)
p2 = Point2D(b=-1)
print(p2)
print(p1.distance(p2), " == ",p2.distance(p1))
p3 = Point2D()
print(p3)

A point : (4,2)
A point : (0,-1)
5.0  ==  5.0
A point : (0,0)


## A class for our model

The goal now is to write a class that implements the model with embeddings, convolution and pooling. Writing this class allows you to wrap up what you have seen so far. To debug the model, you can first go step by step through each layer to make sure the dimensions are right (this was done earlier). Then, write the class and run the training to evaluate the result (this is what we have to do now). 

The class inherits from an existing PyTorch class: *Module*. This means that *Conv1d_classifier* is a *Module* to which we add some specifics. For that purpose we 
can fill in the following code: 

In [None]:
class Conv1d_classifier(nn.Module):
    '''A text classifier:
    - input = a list of word indices
    - output = probability associated to a binary classification task
    - vocab_size: the number of words in the vocabulary we want to embed
    - embedding_dim: size of the word vectors
    '''
    def __init__(self, vocab_size, embedding_dim, feat_size=10, kernel_size=3,lmax=35):
        super(Conv1d_classifier, self).__init__()
        self.emb_dim = embedding_dim 
        # in the previous line, 
        # store the value of the parameter embedding_dim
        # TODO : write the end of the constructor
        # It is important to create all the layers of the network here. 
        # All layers that have parameters should be attributes. 
        # For example: 
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        # Go on and add the rest. 
        # TODO ... 
        
        
            
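    def forward(self, input):
        # TODO
        # if you need to run forward with the embedding layer, 
        # you can call it with self.embeddings 
        # TODO ... 
        # Placeholder so that the class definition parses until forward is written
        raise NotImplementedError("forward is left as an exercise")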
        

In [None]:
# Test the class: is everything in place?
# A first classifier is built like this: 
classifier = Conv1d_classifier(vocab_size=VOCAB_SIZE,embedding_dim=10)
# The parameters of the classifier are randomly initialized, but we 
# can already use it on a sequence: 
out = classifier(texts[0])  # calling the module runs forward()
print(out.shape) # the output has 2 dimensions 
print(out)

# Is it correct? If not, fix the class to get the expected result. 
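
For reference, here is one possible way to fill in such a class, given as a sketch rather than as the expected solution. The class name `Conv1dClassifierSketch`, the ReLU non-linearity, the `padding` argument and the final sigmoid are assumptions made for this illustration; the `lmax` argument of the skeleton above is left out because its intended use is not specified.

```python
import torch as th
import torch.nn as nn

class Conv1dClassifierSketch(nn.Module):
    """Hypothetical classifier: embedding, convolution, max pooling, linear."""
    def __init__(self, vocab_size, embedding_dim, feat_size=10, kernel_size=3, padding=1):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.conv = nn.Conv1d(embedding_dim, feat_size, kernel_size, padding=padding)
        self.out = nn.Linear(feat_size, 1)

    def forward(self, input):
        if input.dim() == 1:            # a single sequence: add the batch axis
            input = input.unsqueeze(0)
        x = self.embeddings(input)      # (B, L, D)
        x = x.transpose(1, 2)           # (B, D, L), as Conv1d expects
        x = th.relu(self.conv(x))       # (B, feat_size, L')
        x, _ = x.max(dim=2)             # max over time: (B, feat_size)
        return th.sigmoid(self.out(x))  # (B, 1), a probability usable with BCELoss

# Vocabulary size and word indices taken from the outputs shown earlier.
model = Conv1dClassifierSketch(vocab_size=5002, embedding_dim=20)
out = model(th.tensor([36, 25, 381, 10, 58, 21, 83]))
print(out.shape)
```

Returning a probability (after the sigmoid) matches the use of `nn.BCELoss`; returning the raw score together with `nn.BCEWithLogitsLoss` would be an equally valid design.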


## Training the model

To train the model, we need to define a loss function and an optimizer. For the moment we will rely on an online learning algorithm: online stochastic gradient descent. As in the previous lab session, for each step: 
- we pick one training example
- compute the loss
- back-propagate the gradient 
- update the parameters


At the end of each epoch, we evaluate the model on the validation set. 
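
The steps above can be sketched as follows on toy data. Everything here is an assumption made to keep the example self-contained: the random sequences, the alternating float labels, the tiny model `TinyConvClassifier` and the 0.5 decision threshold. With the real data, the labels may need to be converted to float before being passed to `nn.BCELoss`.

```python
import random
import torch as th
import torch.nn as nn

th.manual_seed(0)
random.seed(0)

# Toy stand-ins: sequences of word indices and float labels in {0., 1.}.
toy_texts = [th.randint(0, 50, (random.randint(3, 8),)) for _ in range(40)]
toy_labels = th.tensor([float(i % 2) for i in range(40)])

class TinyConvClassifier(nn.Module):
    """Hypothetical minimal model: embedding, conv, max pooling, linear, sigmoid."""
    def __init__(self, vocab_size=50, emb_dim=8, feat_size=4):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, feat_size, kernel_size=3, padding=1)
        self.out = nn.Linear(feat_size, 1)

    def forward(self, idx):
        x = self.emb(idx.unsqueeze(0)).transpose(1, 2)  # (1, D, L)
        x, _ = th.relu(self.conv(x)).max(dim=2)         # (1, feat_size)
        return th.sigmoid(self.out(x)).view(-1)         # shape (1,)

model = TinyConvClassifier()
loss_function = nn.BCELoss()
optimizer = th.optim.Adam(model.parameters(), lr=0.01)

idx = list(range(len(toy_texts)))
trainidx, valididx = idx[:30], idx[30:]

for epoch in range(2):
    model.train()
    random.shuffle(trainidx)
    total_loss = 0.0
    for i in trainidx:
        optimizer.zero_grad()                  # reset the accumulated gradients
        prob = model(toy_texts[i])             # forward pass on one example
        loss = loss_function(prob, toy_labels[i].view(1))
        loss.backward()                        # back-propagation
        optimizer.step()                       # parameter update
        total_loss += loss.item()
    # Validation accuracy, computed without tracking gradients.
    model.eval()
    with th.no_grad():
        correct = sum(
            int((model(toy_texts[i]) > 0.5).item() == bool(toy_labels[i].item()))
            for i in valididx
        )
    accuracy = correct / len(valididx)
    print(f"epoch {epoch}: mean loss {total_loss / len(trainidx):.3f}, "
          f"validation accuracy {accuracy:.2f}")
```

Since the toy labels are unrelated to the toy sequences, no real learning is expected here; the sketch only illustrates the order of the calls.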


In [None]:
# Define the training loss 
loss_function = nn.BCELoss()
# The optimizer 
optimizer = th.optim.Adam(classifier.parameters(), lr=0.01)
# Handle the randomization of the training data 
total = len(texts)
ntrain = 20000  # the number of texts for training 
assert(total > ntrain) # be sure it is correct
## 
randomidx = list(range(total))
random.shuffle(randomidx)
## random selection of training examples 
trainidx  = randomidx[:ntrain]
## and for validation 
valididx  = randomidx[ntrain:]
## 
Nepoch = 10 # the number of training epochs 
for e in range(Nepoch): 
    # shuffle the training set at each epoch 
    random.shuffle(trainidx)
    for i in trainidx:
        # TODO : training
        pass  # placeholder so that the cell parses until the TODO is filled in
    ## validation score 
    
    

# State of the art model

To get a better model, we should add convolution layers with different kernel sizes, as in Yoon Kim's 2014 paper. 
We can use kernels of size 3, 5, and 7 for instance. Create a new class for this model. 



In [31]:
# TODO 

Finally, add dropout on the last hidden layer. 
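
A possible sketch of such a model is given below, with one convolution per kernel size, max-over-time pooling on each, concatenation, dropout and a final linear layer. The class name `MultiKernelClassifierSketch`, the ReLU non-linearity, the padding of `k // 2` and the dropout rate of 0.5 are assumptions for this illustration.

```python
import torch as th
import torch.nn as nn

class MultiKernelClassifierSketch(nn.Module):
    """Hypothetical multi-kernel variant: one Conv1d per kernel size."""
    def __init__(self, vocab_size, embedding_dim, feat_size=10,
                 kernel_sizes=(3, 5, 7), dropout=0.5):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        # ModuleList registers the parameters of every convolution.
        # padding=k//2 keeps the output length equal to the input length for odd k.
        self.convs = nn.ModuleList(
            [nn.Conv1d(embedding_dim, feat_size, k, padding=k // 2) for k in kernel_sizes]
        )
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(feat_size * len(kernel_sizes), 1)

    def forward(self, input):
        if input.dim() == 1:                          # single sequence: add batch axis
            input = input.unsqueeze(0)
        x = self.embeddings(input).transpose(1, 2)    # (B, D, L)
        # Max over time for each kernel size: each element is (B, feat_size).
        pooled = [th.relu(conv(x)).max(dim=2)[0] for conv in self.convs]
        hidden = self.dropout(th.cat(pooled, dim=1))  # (B, feat_size * n_kernels)
        return th.sigmoid(self.out(hidden))           # (B, 1)

model = MultiKernelClassifierSketch(vocab_size=5002, embedding_dim=20)
model.eval()  # disables dropout, so the forward pass is deterministic
out = model(th.tensor([36, 25, 381]))
print(out.shape)
```

Remember to call `model.train()` during training and `model.eval()` during validation, since dropout behaves differently in the two modes.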

# Mini-batch training

Training is much faster with mini-batches. The issue is that the input sequences do not all have the same length. As a workaround, we can write a function that creates the tensor for a mini-batch. This function needs: 
- a reference to the data (here *texts*)
- the maximum length of a sequence in the mini-batch
- a list of the indices of the sequences we want to put in the mini-batch

The function creates a tensor and fills it with the selected sequences, but: 
- if a sequence is shorter than the maximum length, we pad it with zeros (fill the empty slots)
- if a sequence is longer, we simply truncate it. 

This function returns a tensor of dimensions (B,Lmax) that serves as the input of the embedding layer of our model. 




In [32]:
# TODO 
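
A minimal sketch of such a function, assuming the three arguments described above; the name `make_batch` and the toy sequences are hypothetical. Note that padding with zeros implicitly treats index 0 as a padding symbol, which is worth checking against the lexicon.

```python
import torch as th

def make_batch(data, lmax, indices):
    """Build a (B, lmax) LongTensor from the selected sequences."""
    batch = th.zeros(len(indices), lmax, dtype=th.long)  # zero-padded by default
    for row, i in enumerate(indices):
        seq = data[i][:lmax]                # truncate sequences longer than lmax
        batch[row, :len(seq)] = seq         # shorter ones keep their trailing zeros
    return batch

toy_texts = [
    th.tensor([36, 25, 381, 10, 58, 21, 83]),  # longer than lmax: truncated
    th.tensor([4, 9]),                         # shorter than lmax: padded
    th.tensor([7, 7, 7, 7]),                   # exactly lmax
]
batch = make_batch(toy_texts, lmax=4, indices=[0, 1, 2])
print(batch)
```

The resulting tensor can be fed directly to the embedding layer, which then returns a tensor of shape (B, Lmax, D).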