The goal of this lab session is to implement the model proposed by Yoon Kim in 2014. The original paper can be found [here](https://www.aclweb.org/anthology/D14-1181).
PyTorch and TensorFlow implementations of course exist on the web, and they are more or less correct and efficient. Here, however, it is important to write it yourself: the goal is to better understand PyTorch and the convolution operation.

The road-map is to: 
- Implement the convolution and pooling 
- Add dropout on the last layer

To start, it is useful to explore convolution layers. In this lab, we consider the 1-dimensional convolution, followed by the appropriate max pooling.


We use the same dataset as before: imdb. The first few cells below are the same as in the previous lab session on this dataset (load the data, build the vocabulary, and prepare the data for the model).


# Data loading 


In [1]:
import re
import numpy as np
import torch as th
import torch.autograd as ag
import torch.nn.functional as F
import torch.nn as nn
import random

th.manual_seed(1) # set the seed 


def clean_str(string, tolower=True):
    """
    Tokenization/string cleaning.
    Original taken from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py
    """
    string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string)
    string = re.sub(r"\'s", " \'s", string)
    string = re.sub(r"\'ve", " \'ve", string)
    string = re.sub(r"n\'t", " n\'t", string)
    string = re.sub(r"\'re", " \'re", string)
    string = re.sub(r"\'d", " \'d", string)
    string = re.sub(r"\'ll", " \'ll", string)
    string = re.sub(r",", " , ", string)
    string = re.sub(r"!", " ! ", string)
    # Plain replacement strings: in Python 3, " \( " would insert a literal backslash
    string = re.sub(r"\(", " ( ", string)
    string = re.sub(r"\)", " ) ", string)
    string = re.sub(r"\?", " ? ", string)
    string = re.sub(r"\s{2,}", " ", string)
    if tolower:
        string = string.lower()
    return string.strip()


def loadTexts(filename, limit=-1):
    """
    Texts loader for imdb.
    If limit is set to -1, the whole dataset is loaded, otherwise limit is the number of lines kept
    """
    dataset = []
    skip = 0
    # The context manager closes the file even if cleaning raises an error
    with open(filename) as f:
        for line in f:
            cleanline = clean_str(line).split()
            if not cleanline:
                skip += 1  # nothing left after cleaning: discard the line
                continue
            dataset.append(cleanline)
            if limit > 0 and len(dataset) >= limit:
                break
    print("Load ", len(dataset), " lines from ", filename, " / ", skip, " lines discarded")
    return dataset



Load the data 

In [2]:
LIM=5000
txtfile="/users/Dorian/Documents/M2_MASH/NLP/imdb/imdb.pos"
postxt = loadTexts(txtfile,limit=LIM)
print(postxt[0:10])
print (len(postxt), " pos sentences")

txtfile="/users/Dorian/Documents/M2_MASH/NLP/imdb/imdb.neg"
negtxt = loadTexts(txtfile,limit=LIM)
print(negtxt[0:10])

print (len(negtxt), " neg sentences")
 

## vocabulary selection
w2idx = {}


maxlength = 0 
txtidx = []
for sent in postxt+negtxt: 
    isent = []
    maxlength = max(maxlength,len(sent))
    for w in sent: 
        widx=len(w2idx)
        if w in w2idx:
            widx=w2idx[w]
        else :
            w2idx[w]=widx
        isent.append(widx)
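    # Keep the word order and the repetitions: a convolution over n-grams needs
    # the real sequence, not the set of indices used for a bag-of-words model
    txtidx.append(th.LongTensor(isent))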
    
print(len(w2idx), " words in the vocab")
print(len(txtidx), " sentences")
print(maxlength, " is maximum sentence length")
print(txtidx[0])

### For the labels
labels = th.ones([2*LIM])
labels[0:LIM] = 0


Load  5000  lines from  /users/Dorian/Documents/M2_MASH/NLP/imdb/imdb.pos  /  0  lines discarded
[['excellent'], ['do', "n't", 'miss', 'it', 'if', 'you', 'can'], ['a', 'great', 'parody'], ['dreams', 'of', 'a', 'young', 'girl'], ['tromendous', 'piece', 'of', 'art'], ['funny', 'funny', 'movie', '!'], ['need', 'more', 'scifi', 'like', 'this'], ['pride', 'and', 'prejudice', 'is', 'absolutely', 'amazing', '!', '!'], ['scott', 'pilgrim', 'vs', 'the', 'world'], ['quirky', 'and', 'effective']]
5000  pos sentences
Load  5000  lines from  /users/Dorian/Documents/M2_MASH/NLP/imdb/imdb.neg  /  1  lines discarded
[['typical', 'movie', 'where', 'best', 'parts', 'are', 'in', 'the', 'preview'], ['not', 'for', 'the', 'squeamish'], ['cool', 'when', 'i', 'was', 'kid'], ['i', 'appreciate', 'the', 'effort', ',', 'but'], ['pretty', 'bad'], ['much', 'ado', 'about', 'nothing'], ['series', 'of', 'unlikely', 'events'], ['april', 'is', 'the', 'cruelest', 'month'], ['great', 'idea', ',', 'but'], ['and', 'people',

# Embeddings and Convolution layers

Unfortunately, a large part of the work consists in getting the tensor dimensions right. This is true for PyTorch as well as TensorFlow. Here the sequence of operations is:
- Embedding
- Convolution (1D)
- Pooling
- Linear

Moreover, things get tricky if we want the model to work properly with mini-batches.
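
As a rough illustration of the mini-batch issue, here is a minimal sketch that pads sentences of different lengths and follows the shapes through the four operations. The toy indices, the vocabulary size of 10 and the choice of index 0 as padding symbol are assumptions made for the example (the `w2idx` built above does not reserve a padding index).

```python
import torch as th
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence

PAD = 0  # assumption: index 0 is reserved for padding in this toy vocabulary
sents = [th.LongTensor([5, 2, 7]),
         th.LongTensor([3]),
         th.LongTensor([4, 6, 1, 8, 2])]

# Pad every sentence to the length of the longest one: (B, L)
batch = pad_sequence(sents, batch_first=True, padding_value=PAD)

emb = nn.Embedding(num_embeddings=10, embedding_dim=4, padding_idx=PAD)
conv = nn.Conv1d(in_channels=4, out_channels=2, kernel_size=3, padding=1)
lin = nn.Linear(2, 1)

x = emb(batch)            # (B, L, E)
x = x.transpose(1, 2)     # (B, E, L): Conv1d expects channels before length
h = conv(x)               # (B, F, L) with kernel_size=3 and padding=1
pooled = h.max(dim=2)[0]  # (B, F): max over the length dimension
out = lin(pooled)         # (B, 1)
print(batch.shape, x.shape, h.shape, pooled.shape, out.shape)
```

Note that the padded positions still go through the convolution (their embeddings are zero, but the bias is added), so they can in principle influence the max. Whether this matters in practice is something to check on the real data.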


A quick reminder on the Embedding layer:

In [3]:
h1 = 4 # dimension of embeddings, the input size for convolution
h2 = 2 # output dimension (number of filters) for the convolution
embLayer = th.nn.Embedding(num_embeddings=len(w2idx), embedding_dim=h1)

In [10]:
# Don't play with the first sentence, it's only one word ! 
embs = embLayer(txtidx[1])
print(len(txtidx[1]))
print(embs.shape)

7
torch.Size([7, 4])




Look at the documentation of the Conv1d layer. Read it carefully and try to completely understand the following code. 
For that purpose, we can look at the shapes. 

In [11]:
print(embs)
print(embs.view(1,h1,-1))

tensor([[-0.1002, -0.6092, -0.9798, -1.6091],
        [-0.7121,  0.3037, -0.7773, -0.2515],
        [-0.2223,  1.6871,  0.2284,  0.4676],
        [-0.6970, -1.1608,  0.6995,  0.1991],
        [ 0.8657,  0.2444, -0.6629,  0.8073],
        [ 1.1017, -0.1759, -2.2456, -1.4465],
        [ 0.0612, -0.6177, -0.7981, -0.1316]], grad_fn=<EmbeddingBackward>)
tensor([[[-0.1002, -0.6092, -0.9798, -1.6091, -0.7121,  0.3037, -0.7773],
         [-0.2515, -0.2223,  1.6871,  0.2284,  0.4676, -0.6970, -1.1608],
         [ 0.6995,  0.1991,  0.8657,  0.2444, -0.6629,  0.8073,  1.1017],
         [-0.1759, -2.2456, -1.4465,  0.0612, -0.6177, -0.7981, -0.1316]]],
       grad_fn=<ViewBackward>)
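
One thing worth noticing in the output above: the first row of the viewed tensor contains the four values of the first word followed by the first three values of the second word. `view` only reinterprets the memory row by row, it does not transpose. The sketch below, on a small integer tensor chosen for readability, contrasts `view` with a real transposition.

```python
import torch as th

x = th.arange(12).view(3, 4)      # 3 "words", embedding dimension 4
viewed = x.view(1, 4, -1)         # same memory order, cut into rows of 3
transposed = x.t().unsqueeze(0)   # (1, 4, 3): one row per embedding dimension

print(viewed)
print(transposed)
# In `transposed`, column t holds the full embedding of word t
print(transposed[0, :, 1], x[1])
```

Both tensors have the shape expected by `Conv1d`, which is why the mistake is easy to miss when looking only at shapes.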


In [12]:
conv1 = th.nn.Conv1d(in_channels=4,out_channels=2,kernel_size=3)
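# Conv1d expects (batch, channels, length): transpose rather than .view(),
# which would mix values coming from different words
tmp=embs.t().unsqueeze(0)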
res = conv1(tmp)
print("embs : ",embs.shape)
print("tmp  : ",tmp.shape)
print("conv : ",res.shape)


embs :  torch.Size([7, 4])
tmp  :  torch.Size([1, 4, 7])
conv :  torch.Size([1, 2, 5])


Draw what happens to better understand the obtained dimensions. 
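
To complement the drawing, the following sketch recomputes one output position by hand and compares it with the layer. The random input and the chosen position `t` are arbitrary. It relies on `conv.weight` having shape `(out_channels, in_channels, kernel_size)`.

```python
import torch as th

th.manual_seed(0)
conv = th.nn.Conv1d(in_channels=4, out_channels=2, kernel_size=3)
x = th.randn(1, 4, 7)   # (batch, channels, length)
res = conv(x)           # (1, 2, 5): 7 - 3 + 1 = 5 window positions

t = 2                              # any position between 0 and 4
window = x[0, :, t:t + 3]          # (4, 3): 3 consecutive words, all channels
# Each filter multiplies the whole window elementwise, sums, and adds its bias
manual = (conv.weight * window).sum(dim=(1, 2)) + conv.bias   # (2,)
print(manual)
print(res[0, :, t])
```

Each of the 2 filters therefore looks at a trigram of 4-dimensional embeddings, and sliding the window over 7 words without padding gives 5 positions.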

Now add another parameter, the padding (set to 1). What do you observe?
Play a bit with *kernel_size* along with *padding* to understand the interaction:
- try (kernel_size, padding) = (3,1) and (4,1)
- then (5,1) and (5,2)

In [None]:
conv1 = th.nn.Conv1d(in_channels=h1,out_channels=h2,kernel_size=3,padding=1)
# Transpose to (batch, channels, length) instead of reshaping with .view()
tmp=embs.t().unsqueeze(0)
res = conv1(tmp)
print(embs.shape)
print(tmp.shape)
print(res.shape)
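
With stride 1 and no dilation, the output length follows `L_out = L_in + 2*padding - kernel_size + 1`. The short loop below checks this relation on a random input of length 7 for the four suggested settings. It is only a sketch to verify the formula, not a replacement for the experiments above.

```python
import torch as th

L_in = 7
x = th.randn(1, 4, L_in)   # (batch, channels, length)
lengths = {}
for k, p in [(3, 1), (4, 1), (5, 1), (5, 2)]:
    conv = th.nn.Conv1d(in_channels=4, out_channels=2, kernel_size=k, padding=p)
    L_out = conv(x).shape[2]
    # Stride 1, no dilation: each side gains p positions, the kernel removes k - 1
    assert L_out == L_in + 2 * p - k + 1
    lengths[(k, p)] = L_out
print(lengths)   # {(3, 1): 7, (4, 1): 6, (5, 1): 5, (5, 2): 7}
```

For an odd kernel size `k`, setting `padding = k // 2` keeps the output length equal to the input length.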


What do you propose for pooling?
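
One possible answer, given here as a sketch rather than as the expected solution, is to take the maximum over the whole length dimension ("max over time"), which yields one value per filter whatever the sentence length. The random tensor stands in for the output of a convolution.

```python
import torch as th
import torch.nn.functional as F

res = th.randn(1, 2, 5)   # (batch, features, length), like a Conv1d output

# Option 1: max over the length dimension; [0] keeps the values, not the argmax
pooled_a = res.max(dim=2)[0]                                        # (1, 2)
# Option 2: max pooling with a window covering the whole sequence
pooled_b = F.max_pool1d(res, kernel_size=res.shape[2]).squeeze(2)   # (1, 2)
# Option 3: adaptive pooling to a single output position
pooled_c = F.adaptive_max_pool1d(res, 1).squeeze(2)                 # (1, 2)

print(pooled_a.shape, th.equal(pooled_a, pooled_b), th.equal(pooled_a, pooled_c))
```

The first and third options do not need to know the sequence length in advance, which is convenient when sentences have different lengths.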


# A first model

First we want to create a model with an embedding layer of size 20, followed by a convolution layer: 
- feature size of 10, 
- kernel size of 3 (for trigram),
- padding set to 1 

All these values must of course be parameters.

Writing this class lets you wrap up what you have seen so far. As in the previous lab session, to debug the model you can first go step by step through each layer to make sure the dimensions are right. Then write the class and run the training to evaluate the result.

In [None]:
# A lot to do here

In [None]:
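class Conv1d_classifier(nn.Module):
    '''A text classifier:
    - input = a list of word indices
    - output = probability associated to a binary classification task
    - vocab_size: the number of words in the vocabulary we want to embed
    - embedding_dim: size of the word vectors
    - feat_size: number of filters (output channels) of the convolution
    - kernel_size, padding: width of the filters and zero-padding on each side
    '''
    def __init__(self, vocab_size, embedding_dim, feat_size=10, kernel_size=3,lmax=35, padding=1):
        super(Conv1d_classifier, self).__init__()
        self.emb = th.nn.Embedding(num_embeddings = vocab_size, embedding_dim = embedding_dim)
        self.conv = th.nn.Conv1d(in_channels=embedding_dim, out_channels=feat_size,
                                 kernel_size=kernel_size, padding=padding)


    def forward(self, input):
        # TODO: embed, convolve, pool over the length, then classify
        raise NotImplementedError
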
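
For reference, here is one way the pieces could fit together, as a minimal sketch and not as the reference solution. The class name `Conv1dClassifierSketch`, the ReLU non-linearity, the final linear layer, the toy sentence and the Adam settings are all assumptions made for the example. The model returns a logit, and applying a sigmoid to it gives the probability.

```python
import torch as th
import torch.nn as nn

class Conv1dClassifierSketch(nn.Module):
    """Hypothetical first model: embedding -> Conv1d -> max over time -> linear."""
    def __init__(self, vocab_size, embedding_dim=20, feat_size=10, kernel_size=3, padding=1):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, embedding_dim)
        self.conv = nn.Conv1d(embedding_dim, feat_size, kernel_size, padding=padding)
        self.out = nn.Linear(feat_size, 1)

    def forward(self, input):
        # input: 1D LongTensor of word indices (a single sentence)
        x = self.emb(input).t().unsqueeze(0)   # (1, E, L)
        h = th.relu(self.conv(x))              # (1, F, L)
        pooled = h.max(dim=2)[0]               # (1, F)
        return self.out(pooled).view(-1)       # (1,): a logit, not a probability

th.manual_seed(0)
model = Conv1dClassifierSketch(vocab_size=50)
loss_fn = nn.BCEWithLogitsLoss()               # applies the sigmoid internally
optim = th.optim.Adam(model.parameters(), lr=0.01)

sent = th.LongTensor([3, 14, 15, 9, 2])        # toy sentence
label = th.ones(1)
losses = []
for step in range(20):
    optim.zero_grad()
    loss = loss_fn(model(sent), label)
    loss.backward()
    optim.step()
    losses.append(loss.item())
print(losses[0], losses[-1])
```

With `kernel_size=3` and `padding=1`, even a one-word sentence produces a valid output, so the very short sentences of the dataset do not need special handling in this configuration.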

# State of the art model

To get a better model, we should use several convolution layers with different kernel sizes, as in the paper of Yoon Kim (2014).
We can use kernels of size 3, 5, and 7 for instance.



In [None]:
# TODO 
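
A possible structure, sketched under assumptions: the convolutions are applied in parallel to the same embeddings, each output is max-pooled over time, and the pooled vectors are concatenated before the final linear layer. The class name `MultiKernelCNNSketch`, the default sizes and the choice `padding = k // 2` are illustrative, not taken from the paper.

```python
import torch as th
import torch.nn as nn

class MultiKernelCNNSketch(nn.Module):
    """Hypothetical model with one Conv1d per kernel size, applied in parallel."""
    def __init__(self, vocab_size, embedding_dim=20, feat_size=10, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, embedding_dim)
        # ModuleList registers the parameters of every convolution
        self.convs = nn.ModuleList([
            nn.Conv1d(embedding_dim, feat_size, k, padding=k // 2) for k in kernel_sizes
        ])
        self.out = nn.Linear(feat_size * len(kernel_sizes), 1)

    def forward(self, input):
        x = self.emb(input).t().unsqueeze(0)                          # (1, E, L)
        # Max over time makes every branch (1, F), whatever its output length
        pooled = [th.relu(conv(x)).max(dim=2)[0] for conv in self.convs]
        feats = th.cat(pooled, dim=1)                                 # (1, F * n_kernels)
        return self.out(feats).view(-1)                               # (1,) logit

model = MultiKernelCNNSketch(vocab_size=50)
print(model(th.LongTensor([3, 14, 15, 9, 2])).shape)
print(model(th.LongTensor([7])).shape)   # a one-word sentence also works
```

Using a plain Python list instead of `nn.ModuleList` would hide the convolution weights from `model.parameters()`, so the optimizer would never update them.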

And finally, add dropout on the last hidden layer.
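
As a reminder of how `nn.Dropout` behaves, the sketch below applies it to a stand-in feature vector (a tensor of ones of arbitrary size 30) in training and in evaluation mode. The rate `p=0.5` is the value reported in the paper, used here only as an example.

```python
import torch as th
import torch.nn as nn

th.manual_seed(0)
drop = nn.Dropout(p=0.5)
feats = th.ones(1, 30)   # stand-in for the pooled features before the last linear layer

drop.train()
train_out = drop(feats)  # entries are either zeroed or scaled by 1 / (1 - p) = 2
drop.eval()
eval_out = drop(feats)   # identity in evaluation mode

print(train_out)
print(th.equal(eval_out, feats))
```

In a model, the layer would be created in `__init__` and applied in `forward` just before the final linear layer. Calling `model.train()` and `model.eval()` then switches it on and off, so evaluation must be done in `eval` mode.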