<img align="center" style="max-width: 1000px" src="banner.png">

<img align="right" style="max-width: 200px; height: auto" src="hsg_logo.png">

##  Lab 7: Sentiment analysis using long short-term memory networks

GSERM Summer School 2022, Deep Learning: Fundamentals and Applications, University of St. Gallen

The lab environment of the "Deep Learning: Fundamentals and Applications" GSERM course at the University of St. Gallen (HSG) is based on Jupyter Notebooks (https://jupyter.org), which allow to perform a variety of statistical evaluations and data analyses.

We will use the [kaggle Rotten Tomates](https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews) dataset for this exercise; you may need to register to download the [data](https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data) (no worries, it's free).
The dataset is comprised of tab-separated files with phrases from the Rotten Tomatoes dataset.
The train/test split has been preserved for the purposes of benchmarking, but the sentences have been shuffled from their original order.
Each Sentence has been parsed into many phrases (chunks) using the Stanford parser ([learn more](https://nlp.stanford.edu/software/lex-parser.shtml)).
Each phrase has a `PhraseId`, each sentence a `SentenceId`; phrases that are repeated (such as short or common words) are only included once in the data.

- `train.tsv` contains the phrases and their associated sentiment labels; we will further split this into training and validation partitions to train our model and optimize the hyperparameters
- `test.tsv` contains just phrases; use your model to assign a sentiment label to each phrase (homework)

Feel free to browse through the data to familiarize yourself with the task.
The sentiment labels are:

* 0 - negative
* 1 - somewhat negative
* 2 - neutral
* 3 - somewhat positive
* 4 - positive


## Instructions

As in the other labs, we will be using pytorch and a few related modules.

- <https://pytorch.org/docs/stable/generated/torch.nn.Linear.html>
- <https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html>
- `torchtext`'s [GloVe embeddings](https://torchtext.readthedocs.io/en/latest/vocab.html#glove) as initalization to our embedding layer
- We'll be computing the evaluation metrics using the [scikit learn metrics](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics) package.

Work your way through the notebook.
We've deliberately chosen very moderate hyperparameters to speed up the computation.
Go through each of the sections and familiarize yourself with the data preparation, model and traing setup as well as evaluation.
Homework assignments are listed at the end of the notebook.


In [1]:
import copy
import time
import re
import string

import torch
import torchtext
import pandas as pd

from sklearn.metrics import f1_score, accuracy_score, recall_score, precision_score
from torch.utils.data import DataLoader, Dataset
from torch.nn.utils.rnn import pack_padded_sequence, pad_sequence, pad_packed_sequence
from torch import nn


In [2]:
def load_sentiment_data(path='res/train.tsv'):
    df = pd.read_csv(path, sep='\t', header=0)
    
    # columns = ['PhraseId' 'SentenceId', 'Phrase', 'Sentiment']
    def process_phrase(phrase):
        remove_punct = str.maketrans(string.punctuation, ' '*len(string.punctuation))
        remove_digits = str.maketrans(string.digits, ' '*len(string.digits))
        phrase = phrase.translate(remove_digits)
        phrase = phrase.translate(remove_punct)
        phrase = re.sub(' {2,}', ' ', phrase)
        return phrase.lower()
    
    # apply to all phrases
    df['Phrase'] = df['Phrase'].apply(lambda x: process_phrase(x))
    
    # filter out empty phrases
    df = df[df['Phrase'].str.len() > 1]
    df = df.reset_index(drop=True)

    return df


In [3]:

df = load_sentiment_data()

# split the train.csv in train and test
df_train = df.sample(frac=0.8, random_state=42)
df_vali = df.drop(df_train.index)

# list some stats on the label distribution
print(df.Sentiment.value_counts().sort_index())
print(df_train.Sentiment.value_counts().sort_index())
print(df_vali.Sentiment.value_counts().sort_index())


0     7072
1    27271
2    79410
3    32921
4     9206
Name: Sentiment, dtype: int64
0     5680
1    21759
2    63597
3    26317
4     7351
Name: Sentiment, dtype: int64
0     1392
1     5512
2    15813
3     6604
4     1855
Name: Sentiment, dtype: int64


In [4]:
# The first time you run this will download a ~823MB file
glove = torchtext.vocab.GloVe(name="6B", dim=50)

# We'll use the stoi (string to id) and itos (id to string) methods to convert
# between word token and id.


In [5]:

# Model the RT dataset
class RottenTomatoesDataset(Dataset):
    def __init__(self, df, glove_vocab, label_col='Sentiment', unk='<unk>'):
        super().__init__()
        self.df = df
        self.labels = self.df[label_col].values
        self.glove = glove_vocab
        self.vocab_size = len(glove_vocab)
        self.data = []

        # map the tokens
        for p in self.df['Phrase'].values:
            self.data.append(torch.stack(
                [torch.LongTensor([glove.stoi.get(w, glove.stoi.get(unk))]) for w in p.split()]))

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        return self.data[idx], self.labels[idx]

# when processing sequences of different lengths in batches, we need to pad the
# shorter ones to match the length of the longest in the batch
class SequencePadder():
    def __init__(self, symbol) -> None:
        self.symbol = symbol

    def __call__(self, batch):
        sorted_batch = sorted(batch, key=lambda x: x[0].size(0), reverse=True)
        sequences = [x[0] for x in sorted_batch]
        labels = [x[1] for x in sorted_batch]
        padded = pad_sequence(sequences, padding_value=self.symbol)
        lengths = torch.LongTensor([len(x) for x in sequences])
        return padded, torch.LongTensor(labels), lengths


In [6]:
def get_metrics(model, data_loader, device):
    # we do this with batch size 1, so that padding doesn't affect the prediction
    with torch.set_grad_enabled(False):
        model.eval()
        model.to(device)

        y_pred, y_true = [], []
        sentences = []
        
        for inputs, labels, lengths in data_loader:
            inputs = inputs.to(device)
            labels = labels.to(device)
            out, weights = model(inputs, lengths)

            _, preds = torch.max(out, 1)

            y_pred.append(preds.item())
            y_true.append(labels.item())

        return {
            'f1': f1_score(y_true, y_pred, average='micro'),
            'prec': precision_score(y_true, y_pred, average='micro'),
            'recall': recall_score(y_true, y_pred, average='micro'),
            'acc': accuracy_score(y_true, y_pred),
        }


In [7]:
class LstmClassifierGloveEmbeddings(nn.Module):
    def __init__(self,
                hidden_size,
                output_size, # number of classes
                glove=None,
                num_layers=1,
                bidirectional=False):

        super(LstmClassifierGloveEmbeddings, self).__init__()

        self.input_size = len(glove) # vocabulary size
        self.embedding_size = glove.dim
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.num_layers = num_layers

        # we use nn.Embedding with the pre-trained GloVe vectors (and also don't update them in training)
        self.embedding = nn.Embedding.from_pretrained(glove.vectors, freeze=True)

        self.lstm = nn.LSTM(
                        input_size=self.embedding_size,
                        hidden_size=hidden_size,
                        num_layers=self.num_layers,
                        dropout=0.2 if num_layers > 1 else 0,
                        bidirectional=bidirectional,
        )

        self.num_directions = 2 if bidirectional else 1

        fc_size = self.hidden_size * self.num_directions
        self.fc = nn.Linear(fc_size, output_size)

    # h_n is the previous hidden state tuple as (h_n, c_n)
    def forward(self, x, lengths, h_n=None):
        if h_n is None:
            h_n, c_n = self.init_hidden(x.size(1))
        else:
            h_n =  h_n[0]
            c_n =  h_n[1]
        
        # 1: apply the embedding layer
        embed = self.embedding(x).squeeze(2)
        
        # packed squence helps avoid unneccsary computation, with the length it marks out irrelvant/ padded sequence
        # elements, this allows the efficient computation of sequences of different lengths inside the same batch
        packed_seq = pack_padded_sequence(embed, lengths)
        
        # 2: ...and feed the output to the LSTM
        out, (h_n, c_n) = self.lstm(packed_seq, (h_n, c_n))

        # out contains the output features (h_t) from the last layer of the LSTM, for each timestep
        # h_n contains the final hidden state for each element in the batch.
        # c_n contains the final cell state for each element in the batch.

        # 3 reccurrent layer, hidden dimension = 100, bidirectional, batch_size 2
         
        # h_n.shape = (6, 2, 100)
        # h_n.shape = (num_layers * directions, batch_size, hidden_dimension)
        # 
        # according to the docs, use the following view to address the per-layer hiddens
        # (num_layers, num_directions, batch_size, hidden_dim)
        # h_n.view(self.num_layers, self.num_directions, x.size(1), -1).shape = [3, 2, 2, 100]
        # ... and we'll want the top-most of those, and concatted, if bi-directional

        # get the top/last hidden of the LSTM stack
        h_n_top = h_n.view(self.num_layers, self.num_directions, x.size(1), -1)[-1]
        if self.num_directions == 2:
            # for bi-directional models, we combine/concat both hidden states
            h_n_top = torch.cat([h_n_top[0], h_n_top[1]], 1)
        
        # since shape is now [[..]], drop the outer dim
        h_n_top = h_n_top.squeeze(0)
        logits = self.fc(h_n_top)

        return logits, (h_n, c_n) # only hidden state for the last layer is needed for loss calculation

    def init_hidden(self, batch_size=1):
        device = torch.device('cuda:0') if torch.cuda.is_available() else torch.device('cpu')
        
        # h_0 of shape (num_layers * num_directions, batch, hidden_size)
        h_dim_0 = self.num_layers * self.num_directions
        hidden = (torch.zeros(h_dim_0, batch_size, self.hidden_size, device=device),
                  torch.zeros(h_dim_0, batch_size, self.hidden_size, device=device))

        return hidden



In [8]:
def train_rnn_model(model, dl_train, dl_vali, criterion, optimizer, device, num_epochs=25):
    # this should already be done
    # model.to(device)
    
    since = time.time()

    print(model)
    
    for epoch in range(1, num_epochs + 1):
        print('Epoch {}/{}'.format(epoch, num_epochs))
        print('-' * 10)

        # phase 1: training
        model.train()
        
        train_loss = 0.0
        train_correct = 0

        # go through all the data
        for inputs, labels, lens in dl_train:
            inputs = inputs.to(device)
            labels = labels.to(device)
            
            # zero the parameter gradients
            with torch.set_grad_enabled(True):
                optimizer.zero_grad()

                out, _ = model(inputs, lens) 
                
                # take only the last output, compute loss
                _, preds = torch.max(out, 1)
                loss = criterion(out, labels)

                # do backprop
                loss.backward()
                optimizer.step()

            # statistics
            train_loss += loss.item() * inputs.size(1)
            train_correct += torch.sum(preds == labels.data)
        
        # print stats for the training pass
        print('training: loss={:.4f} acc={:.4f}'.format(train_loss / len(dl_train.dataset), train_correct.double() / len(dl_train.dataset)))

        # phase 2: validate...
        model.eval()

        vali_loss = 0.0
        vali_correct = 0

        for inputs, labels, lens in dl_vali:
            inputs = inputs.to(device)
            labels = labels.to(device)
            
            # forward; no gradient needed now
            with torch.set_grad_enabled(False):
                out, _ = model(inputs, lens) 
                
                # take only the last output
                _, preds = torch.max(out, 1)
                loss = criterion(out, labels)

            # statistics
            vali_loss += loss.item() * inputs.size(1)
            vali_correct += torch.sum(preds == labels.data)
        
        # print stats for vali pass
        print('validation: loss={:.4f} acc={:.4f}'.format(vali_loss / len(dl_train.dataset), vali_correct.double() / len(dl_vali.dataset)))

    time_elapsed = time.time() - since
    print('Training complete in {:.0f}m {:.0f}s'.format(time_elapsed // 60, time_elapsed % 60))

    return model


In [9]:
# we need to add the padding and unknown symbols to our glove embedding
def append_special(glove, special, vec=None):
    glove.itos.append(special)
    glove.stoi[special] = glove.itos.index(special)
    if vec is None:
        vec = torch.zeros(1, glove.vectors.size(1))
    glove.vectors = torch.cat((glove.vectors, vec))
    return glove

pad_sym = '<pad>'
unk = '<unk>'

# GloVe has been loaded all the way at the top already; now we append to it
glove = append_special(glove, unk)
glove = append_special(glove, pad_sym)


In [10]:

# LSTM parameters
hidden_size = 32
n_layers = 1
bi_direct = False

# training parameters
n_epochs = 10
batch_size = 16
lr = 0.0001
shuffle = True


In [11]:

# set up the model and training mechanics
model = LstmClassifierGloveEmbeddings(hidden_size,
                                      output_size=len(df['Sentiment'].value_counts()),
                                      glove=glove,
                                      num_layers=n_layers,
                                      bidirectional=bi_direct,)

optimizer = torch.optim.Adam(model.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()

device = torch.device('cuda:0') if torch.cuda.is_available() else torch.device('cpu')

ds_train = RottenTomatoesDataset(df_train, glove)
ds_vali = RottenTomatoesDataset(df_vali, glove)

dl_train = DataLoader(ds_train,
                      batch_size=batch_size,
                      shuffle=shuffle,
                      collate_fn=SequencePadder(glove.stoi[pad_sym]),
                      drop_last=True)

dl_vali = DataLoader(ds_vali,
                     batch_size=batch_size, 
                     shuffle=False,
                     collate_fn=SequencePadder(glove.stoi[pad_sym]),
                     drop_last=True)

# train model returns the best model for the current run
model = train_rnn_model(model, 
                        dl_train, dl_vali, 
                        criterion, 
                        optimizer,
                        device, 
                        num_epochs=n_epochs)


LstmClassifierGloveEmbeddings(
  (embedding): Embedding(400002, 50)
  (lstm): LSTM(50, 32)
  (fc): Linear(in_features=32, out_features=5, bias=True)
)
Epoch 1/10
----------
training: loss=1.1833 acc=0.5374
validation: loss=0.2695 acc=0.5605
Epoch 2/10
----------
training: loss=1.0483 acc=0.5701
validation: loss=0.2591 acc=0.5708
Epoch 3/10
----------
training: loss=1.0181 acc=0.5791
validation: loss=0.2536 acc=0.5784
Epoch 4/10
----------
training: loss=0.9994 acc=0.5849
validation: loss=0.2497 acc=0.5867
Epoch 5/10
----------
training: loss=0.9866 acc=0.5894
validation: loss=0.2473 acc=0.5891
Epoch 6/10
----------
training: loss=0.9767 acc=0.5935
validation: loss=0.2455 acc=0.5934
Epoch 7/10
----------
training: loss=0.9685 acc=0.5966
validation: loss=0.2443 acc=0.5937
Epoch 8/10
----------
training: loss=0.9617 acc=0.5998
validation: loss=0.2425 acc=0.5982
Epoch 9/10
----------
training: loss=0.9559 acc=0.6026
validation: loss=0.2414 acc=0.6002
Epoch 10/10
----------
training: loss=0

In [12]:

dl_vali2 = DataLoader(ds_vali,
                     batch_size=1, 
                     shuffle=False,
                     collate_fn=SequencePadder(glove.stoi[pad_sym]),
                     drop_last=True)

# run eval for best model and save for this split
scores = get_metrics(model,
                      dl_vali2,
                      device)

print(scores)

{'f1': 0.6022581472927894, 'prec': 0.6022581472927894, 'recall': 0.6022581472927894, 'acc': 0.6022581472927894}


# Homework

1. Train with different parameters (number of hidden units and layers, bi-directional, batch sizes, learning rates) to improve your classification metrics
2. For your best result, assemble a kaggle submission (test.csv)
3. Optional assignment: extend the LSTM model by using `nn.MultiheadAttention` on the outputs and measure the improvement