# Exam session #3
## Neural Classification

#### Please enter your first name and last name below.

- First name: **Mahamadi**
- Last name: **NIKIEMA**

In this assignment, we develop a model for sentiment analysis, the task of predicting the sentiment label of a text. We use a small corpus of **tweets**, each annotated with a **target** (apple, microsoft, google...) and a **sentiment label**.

**Data file : sanders-twitter-sentiment.csv**

The assignment includes 4 main steps.

1. Loading labels, targets and tweets into a Pandas dataframe (4 points)
2. Converting tweets and labels to integers (14 points)
3. Creating Training and Validation Data (6 points)
4. Inspecting results of RNN classifier (14 points)

# 1. Loading tweets and labels into a Pandas Dataframe (4 points)

####   Exercise 1.1 (2 points)

* Load the sanders-twitter-sentiment.csv file into a pandas dataframe
* Create a new dataframe called `data` that contains the three columns `label`, `target` and `tweet`, with their headers and content

In [2]:
import pandas as pd
df = pd.read_csv("sanders-twitter-sentiment.csv",sep=',')

In [3]:
df.head()

| Unnamed: 0 | id_1 | id_2 | date | tweet | label | target |
|---|---|---|---|---|---|---|
| 0 | 1044 | 125667332931596290 | 2011-10-16 20:20:01 | &quot;3 principal global players will be activ... | neutral | apple |
| 1 | 71 | 126384526925639681 | 2011-10-18 19:49:53 | If you've been struggling to get hold of me, I... | neutral | apple |
| 2 | 278 | 126281019476291585 | 2011-10-18 12:58:35 | @azee1v1 @apple @umber Proper consolidation, p... | negative | apple |
| 3 | 5743 | 126862268725080065 | 2011-10-20 03:28:16 | me acabo de dar cuenta q tengo mas seguidores ... | irrelevant | twitter |
| 4 | 510 | 126054725727698944 | 2011-10-17 21:59:23 | With Siri, Apple Could Eventually Build A Real... | neutral | apple |


In [4]:
data = df[["label", "tweet", "target"]]

#### Exercise 1.2 - Print out some examples (2 points)

* Print out the number of rows and columns
* Print out the 1st and 2nd columns of the 1st row

In [5]:
print(f"Number of rows: {data.shape[0]}\nNumber of columns: {data.shape[1]}")

Number of rows: 5513
Number of columns: 3


In [6]:
data.iloc[0, :2]

label                                               neutral
tweet     &quot;3 principal global players will be activ...
Name: 0, dtype: object

In [7]:
data['label'].value_counts()

neutral       2503
irrelevant    1786
negative       654
positive       570
Name: label, dtype: int64

# 2. Converting tweets and labels to integers (14 points)

#### Exercise 2.1 -  Store tweets and labels into list (2 points)

*  Write a function which returns two lists `texts` and `labels` where
   - texts is the list of tweets  with each tweet prefixed with its target surrounded by < and >   
   E.g., `<apple>` Wow I am loving this new @apple update for my touch. #coolness Well done
   - labels is the corresponding list of labels

* Print out 
   - the number of tweets
   - an example tweet and its corresponding label

In [153]:
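def tweets(df):
    """Return the target-prefixed tweets and their labels as two parallel lists."""
    texts = []
    labels = []
    for txt, target, label in zip(df["tweet"], df["target"], df["label"]):
        # Each entry is a one-element list; later cells index it with [0]
        texts.append([f"<{target}> " + txt])
        labels.append([label])
    return texts, labels

In [154]:
texts, labels = tweets(data)
print(len(texts))           # number of tweets
print(texts[0], labels[0])  # an example tweet and its label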

#### Exercise 2.2  - Create the vocabulary (2 points)
   
* Extract the corpus vocabulary from `texts`, the list of tweets created in Exercise 2.1
   The vocabulary is the set of distinct tokens occurring in the tweets
* Print out the size of the vocabulary (=the number of distinct tokens in the corpus of tweets)

In [10]:
from nltk import word_tokenize
# Lowercase the raw tweets, join them into one string, and keep the distinct tokens
tweet = data["tweet"].str.lower()
vocab = set(word_tokenize(tweet.str.cat(sep=' ')))

In [137]:
len(vocab)

17134

#### Exercise 2.3 - Convert tweets  to integers (2 points)
   
* Define a dictionary `token2int` which assigns 0 to the `<eos>` symbol and which maps each token in the corpus to an integer. Each token (including the `<eos>` symbol) should be mapped to a different integer and none of the vocabulary tokens should be mapped to 0 (since 0 is the index of the `<eos>` symbol).
* Define the reverse dictionary `int2token`

Example

Input Texts: ["The woman put the book on the table","The woman reads"]   
Created vocabulary: {the, woman, put, book, on, table, reads}    
token2int: {the:1,woman:2,put:3,book:4,on:5,table:6,reads:7}   
int2token: {1:the,2:woman,3:put,4:book,5:on,6:table,7:reads}
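
The example above can be reproduced with a few lines of plain Python. This is only a minimal sketch: the helper name `build_vocab` and the `toy_*` variables are hypothetical, and it assumes whitespace tokenization with lowercasing.

```python
def build_vocab(texts, eos="<eos>"):
    """Map each distinct lowercased token to an integer, reserving 0 for eos."""
    mapping = {eos: 0}
    for text in texts:
        for token in text.lower().split():
            if token not in mapping:
                # The next free index is the current size of the mapping
                mapping[token] = len(mapping)
    reverse = {idx: token for token, idx in mapping.items()}
    return mapping, reverse


toy_texts = ["The woman put the book on the table", "The woman reads"]
toy_token2int, toy_int2token = build_vocab(toy_texts)
print(toy_token2int)
```

Because indices are handed out in order of first appearance, `the` receives 1 and `reads` receives 7, as in the example, while 0 stays reserved for `<eos>`.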

In [155]:
from collections import defaultdict
# Unseen tokens get the next free index on first lookup
token2int = defaultdict(lambda: len(token2int))
token2int['<eos>'] = 0  # index 0 is reserved for <eos>
for l in texts:
    for token in l[0].lower().split():
        token2int[token]

In [156]:
int2token = {idx: token for token, idx in token2int.items()}

#### Exercise 2.4  - Converting labels to integers (2 points)
 
* Define a dictionary `label2int` which maps each distinct label in labels to a distinct integer. 
* Define the reverse dictionary `int2label`
* Print out the length of your label2int dictionary
* Print out the set of labels used to annotate the sentiment of a tweet

In [19]:
label2int = defaultdict(lambda: len(label2int))
for lab in labels:
    label2int[lab[0]]  # assigns the next free index to each new label
print(len(label2int))
print(set(label2int))

In [21]:
int2label = {idx: label for label, idx in label2int.items()}

#### Exercise 2.5  - Sanity check (4 points)

* Define a function `check_converting` which   
   - converts a string to a list of tokens
   - converts this list of tokens to the corresponding list of integers using `token2int` (cf. Ex. 2.3)
   - prints this list of integers
   - converts this list of integers back to a list of tokens using `int2token` (cf. Ex. 2.3)
   - converts this list to a string (use `join`)
   - prints out the resulting string
* Apply this function to the 5th tweet in the list `texts` created in Ex. 2.1.

**Hint** The input and output of this function should be identical

In [157]:
def check_converting(text):
    # `text` is a one-element list, as produced by tweets() in Ex. 2.1
    tokens = text[0].lower().split()  # same tokenization as in Ex. 2.3
    ints = [token2int[token] for token in tokens]
    print(ints)
    rev_tokens = [int2token[i] for i in ints]
    result = ' '.join(rev_tokens)
    print(result)
    return result

In [None]:
check_converting(texts[4])

#### Exercise 2.6 - Converting the lists of tweets and labels to their indices (2 points)
* Convert the list of tweets created in Exercise 2.1 to the corresponding list of lists of indices using `token2int` (created in Exercise 2.3)
* Convert the list of labels created in Exercise 2.1 to the corresponding list of indices using `label2int` (created in Exercise 2.4)

In [99]:
# Lowercase before lookup so the tokens match the keys built in Ex. 2.3
texts_ind = [[token2int[token] for token in s[0].lower().split()] for s in texts]
labels_ind = [label2int[lab[0].strip()] for lab in labels]

In [None]:
texts_ind

#  3. Creating Training and Validation Data (6 points)

In [101]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable

max_len = 16
batch_size = 64
embed_size = 128
hidden_size = 128 

#### Exercise 3.1 - Creating tensors (2 points)

* Create a torch tensor of dimension `(nb of tweets, max_len)` whose values are 0; call this tensor X
* Populate this tensor with the integer representation of the tweets created in Ex.2.6
* Create a tensor Y of dimension `(nb of tweets,)` and populate it with the integer representation of the corresponding sentiment label created in Ex.2.6

In [121]:
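text_int = []
for sent in texts_ind:
    sent = sent[:max_len]  # truncate tweets longer than max_len
    text_int.append(sent + [0] * (max_len - len(sent)))  # pad shorter ones with 0 (<eos>)

In [None]:
# Long (integer) tensors, as expected by nn.Embedding and nn.CrossEntropyLoss
X = torch.zeros(len(text_int), max_len, dtype=torch.long)
X[:] = torch.tensor(text_int)
Y = torch.tensor(labels_ind)

In [None]:
X_train, X_valid = X[:5000], X[5000:]
Y_train, Y_valid = Y[:5000], Y[5000:]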

#### Exercise 3.2 - Divide X  into  X_train (the first 5000 tweets), X_valid (the remaining tweets) and similarly for Y  (2 points)

#### Exercise 3.3 - Use pytorch TensorDataset and DataLoader to shuffle the data and the labels  (2 points)
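
One possible way to do this is sketched below. It runs on random stand-in tensors so that it is self-contained; the `*_demo` names and the sizes 200 and 150 are assumptions made for illustration, not values from the notebook.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

max_len, batch_size = 16, 64

# Stand-in data with the same layout as X and Y (assumed shapes and values)
X_demo = torch.randint(0, 100, (200, max_len))  # token indices
Y_demo = torch.randint(0, 4, (200,))            # label indices

demo_train_set = TensorDataset(X_demo[:150], Y_demo[:150])
demo_valid_set = TensorDataset(X_demo[150:], Y_demo[150:])

# shuffle=True reorders the (tweet, label) pairs together at every epoch
demo_train_loader = DataLoader(demo_train_set, batch_size=batch_size, shuffle=True)
demo_valid_loader = DataLoader(demo_valid_set, batch_size=batch_size)

x_batch, y_batch = next(iter(demo_train_loader))
print(x_batch.shape, y_batch.shape)
```

In the notebook, the same two `DataLoader` calls applied to the real training and validation tensors would define the `train_loader` and `valid_loader` that `fit` and `perf` expect.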

# 4. Neural Model (14 points)

We use an RNN to classify the tweets. The code for training and testing is provided. 

In [None]:
# Computing the loss
def perf(model, loader):
    criterion = nn.CrossEntropyLoss()
    model.eval()
    total_loss = correct = num = 0
    for x, y in loader:
        with torch.no_grad():
            y_scores = model(x)
            loss = criterion(y_scores, y)
            y_pred = torch.max(y_scores, 1)[1]
            correct += torch.sum(y_pred.data == y).item()
            total_loss += loss.item()
            num += len(y)
    return total_loss / num, correct / num

In [None]:
# The training loop
def fit(model, epochs):
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters())
    for epoch in range(epochs):
        model.train()
        total_loss = num = 0
        for x, y in train_loader:
            optimizer.zero_grad()
            y_scores = model(x)
            loss = criterion(y_scores, y)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
            num += len(y)
        print(epoch, total_loss / num, *perf(model, valid_loader))

In [None]:
# The network
class RNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(len(token2int), embed_size, padding_idx=token2int['<eos>'])
        self.rnn = nn.GRU(embed_size, hidden_size, num_layers=1, bidirectional=False, batch_first=True)
        self.decision = nn.Linear(hidden_size * 1 * 1, len(label2int))
        
    def forward(self, x):
        embed = self.embed(x)
        output, hidden = self.rnn(embed)
        return self.decision(hidden.transpose(0, 1).contiguous().view(x.size(0), -1))

rnn_model = RNN()
rnn_model

#### Exercise 4.1  - Testing dimensions (4 points)

* Call the model on the first 3 tweets. Explain the dimension of the output: what does it correspond to?
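
The shapes involved can be traced with a stand-alone sketch that mirrors the layers of the `RNN` class above. The sizes `vocab_size` and `num_labels` are stand-ins for `len(token2int)` and `len(label2int)` (assumed values), and the input is random rather than real tweets.

```python
import torch
import torch.nn as nn

vocab_size, num_labels = 1000, 4  # stand-ins for len(token2int), len(label2int)
embed_size, hidden_size, max_len = 128, 128, 16

embed = nn.Embedding(vocab_size, embed_size, padding_idx=0)
gru = nn.GRU(embed_size, hidden_size, num_layers=1, batch_first=True)
decision = nn.Linear(hidden_size, num_labels)

x = torch.randint(1, vocab_size, (3, max_len))  # 3 "tweets" of max_len indices
embedded = embed(x)                             # (3, max_len, embed_size)
output, hidden = gru(embedded)                  # hidden: (1, 3, hidden_size)
scores = decision(hidden.transpose(0, 1).contiguous().view(x.size(0), -1))
print(embedded.shape, hidden.shape, scores.shape)
```

The final tensor has one row per tweet and one column per label: each value is an unnormalized score for that label, and the index of the largest value in a row is the predicted label. In the notebook itself, the equivalent call would be `rnn_model(X[:3])`, provided `X` holds integer (long) indices.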

#### Exercise 4.2  - Train the model for 5 epochs  (2 points)

In [None]:
fit(rnn_model, epochs=5)

#### Exercise 4.3 - Print out the accuracy of the model on the training set and on the validation set  (4 points)
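
The provided `perf` function already returns a `(loss, accuracy)` pair for a given loader. The sketch below isolates the accuracy computation it performs; the helper name `accuracy_from_scores` and the small demo tensors are hypothetical and only illustrate the arithmetic.

```python
import torch


def accuracy_from_scores(y_scores, y_true):
    """Fraction of rows whose highest-scoring label equals the gold label."""
    y_pred = torch.max(y_scores, 1)[1]  # index of the max score in each row
    return (y_pred == y_true).sum().item() / len(y_true)


# Made-up scores for 4 examples and 3 labels, for illustration only
demo_scores = torch.tensor([[2.0, 0.1, 0.3],
                            [0.2, 1.5, 0.1],
                            [0.9, 0.8, 0.1],
                            [0.1, 0.2, 3.0]])
demo_gold = torch.tensor([0, 1, 1, 2])
print(accuracy_from_scores(demo_scores, demo_gold))  # 3 of 4 correct: 0.75
```

In the notebook, calling `perf(rnn_model, train_loader)` and `perf(rnn_model, valid_loader)` and reading the second element of each result would give the two accuracies, assuming both loaders have been defined in Exercise 3.3.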

#### Exercise 4.4 - Plot the training and validation loss (2 points)

- X is the number of epochs
- Y is the loss

#### Exercise 4.5 - Plot the accuracy on the validation set (2 points)

- X is the number of epochs
- Y is the accuracy at each epoch
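
The provided `fit` function only prints the losses and the accuracy. The sketch below assumes these values have instead been collected into three Python lists, one entry per epoch (an assumption, since the original loop does not store them). `plot_history` is a hypothetical helper built on matplotlib, and it covers both Exercise 4.4 and Exercise 4.5.

```python
import matplotlib.pyplot as plt


def plot_history(train_loss, valid_loss, valid_acc):
    """Plot per-epoch losses (left) and validation accuracy (right)."""
    epochs = range(1, len(train_loss) + 1)
    fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(10, 4))

    ax_loss.plot(epochs, train_loss, label="training loss")
    ax_loss.plot(epochs, valid_loss, label="validation loss")
    ax_loss.set_xlabel("epoch")
    ax_loss.set_ylabel("loss")
    ax_loss.legend()

    ax_acc.plot(epochs, valid_acc, label="validation accuracy")
    ax_acc.set_xlabel("epoch")
    ax_acc.set_ylabel("accuracy")
    ax_acc.legend()
    return fig


# Placeholder numbers only, to show the call; they are not results from this model
fig = plot_history([0.9, 0.7, 0.5], [1.0, 0.8, 0.7], [0.5, 0.6, 0.65])
```

A notebook displays the returned figure automatically; in a script, `plt.show()` would be needed after the call.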