# Deep learning tips and tricks

In this notebook we cover a few tips and tricks for tuning a neural text classifier. We use an LSTM model for the experiments, with torch version 1.4.
The code is inspired by [this](https://github.com/lukysummer/Movie-Review-Sentiment-Analysis-LSTM-Pytorch/blob/master/sentiment_analysis_LSTM.py) repository.

In [None]:
import random
import torch
import torch.nn as nn
import numpy as np


seed = 0

random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.backends.cudnn.deterministic = True

### 1. LOAD THE TRAINING TEXT

In [None]:
from sklearn.datasets import load_files

In [None]:
from nltk import download

In [None]:
download("movie_reviews", download_dir="data")

[nltk_data] Downloading package movie_reviews to data...
[nltk_data]   Package movie_reviews is already up-to-date!


True

In [None]:
movies = load_files("./data/corpora/movie_reviews")

In [None]:
reviews, encoded_labels = [review.decode() for review in movies.data], movies.target

In [None]:
import random

c = list(zip(reviews, encoded_labels))

random.shuffle(c)

reviews, encoded_labels = zip(*c)

In [None]:
reviews[1]

"vannesa kensington : `austin , do you smoke after sex ? ' austin powers : `i don't know baby , i've never looked ! ' \nand so begins our journey into the most anticipated sequel of the summer season . \naustin powers 2 , the sequel to the sleeper hit of 1997 , is filled to the brim with uproarious sight gags and lurid toilet jokes that will make you keel over with hilarity . \nthe mind of mike myers is obviously a very bizarre place . \nmyers returns as the swinging 60's spy and his arch-nemesis , the bald headed dr . evil , who is given much of the spotlight here . \nthere's an early scene in which dr . evil and his son scott ( seth green ) appear on a jerry springer segment entitled `my dad is evil and wants to take over the world' , hosted by springer himself . \nmost of these talk-show gags , spoofing everything from oprah to regis and kathie lee , are no longer as funny as they once were . \nhappily , this is an exception , especially when a fight breaks out between dr . evil and

In [None]:
encoded_labels[1]

1

### 2. TEXT PRE-PROCESSING

In [None]:
from string import punctuation
import re

word_re = re.compile(r"\b[a-z]{2,}\b")

def tokenize(text):
    processed_text = "".join(ch for ch in text.lower() if ch not in punctuation)
    processed_text = processed_text.replace("\n", " ")
    return word_re.findall(processed_text)

def flatten(tokenized_texts):
    return [word for text in tokenized_texts for word in text]

In [None]:
all_reviews = list(map(tokenize, reviews))
all_words = flatten(all_reviews)

### 3. CREATE DICTIONARIES & ENCODE REVIEWS

In [None]:
from collections import Counter

word_counts = Counter(all_words)
word_list = sorted(word_counts, key=lambda k: word_counts[k], reverse = True)
vocab_to_int = {word:idx+1 for idx, word in enumerate(word_list)}
int_to_vocab = {idx:word for word, idx in vocab_to_int.items()}
encoded_reviews = [[vocab_to_int[word] for word in review] for review in all_reviews]
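
Because the indices above start at 1, index 0 stays free for padding, and the largest index equals the vocabulary size. Any lookup table over these indices therefore needs one row more than there are words. A minimal sketch on a made-up toy corpus (the `toy_*` names are illustrative and not part of the notebook):

```python
from collections import Counter

# Made-up, already tokenized reviews (an assumption for illustration only)
toy_reviews = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]

toy_counts = Counter(word for review in toy_reviews for word in review)
toy_words = sorted(toy_counts, key=lambda k: toy_counts[k], reverse=True)
toy_vocab = {word: idx + 1 for idx, word in enumerate(toy_words)}

# Index 0 is never assigned to a word, so it can act as the padding symbol
largest_index = max(toy_vocab.values())
rows_needed = largest_index + 1  # rows 0..largest_index

print(toy_vocab, rows_needed)
```

The most frequent word receives index 1, which keeps small indices for common words.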

### 4. CHECK LABELS

In [None]:
assert len(encoded_reviews) == len(encoded_labels), "# of encoded reviews & encoded labels must be the same!"

### 5. GET RID OF LENGTH-0 REVIEWS

In [None]:
import numpy as np
import torch

encoded_labels = np.array([label for idx, label in enumerate(encoded_labels) if len(encoded_reviews[idx]) > 0])
encoded_reviews = [review for review in encoded_reviews if len(review) > 0]

### 6. MAKE ALL REVIEWS SAME LENGTH

In [None]:
def pad_text(encoded_reviews, seq_length):

    reviews = []

    for review in encoded_reviews:
        if len(review) >= seq_length:
            reviews.append(review[:seq_length])
        else:
            reviews.append([0] * (seq_length - len(review)) + review)

    return np.array(reviews)

padded_reviews = pad_text(encoded_reviews, seq_length=200)

In [None]:
padded_reviews[42]

array([   82,   221,  1655,   183,    46,    58,    74,   559,  4335,
        6955,   820,  8195,     6,  3156,     4,   157,    16,   851,
         319,  2865,  6644,  1740,   538,   101,    49,   725,   850,
          13,    24,  4636,  5349,  1868,  8125,  4364,    17,   355,
           2,   327,  1128,     3,  6335,  8597,    19,  8671,     6,
           1,   886,  1868,  1555,   232,   477,    44,     4,   538,
           1,  7707,  2096,   139,    34,    62, 21117,     2,  4775,
         616,     1,   486,   690,  1669,     1,  1393,     5,    17,
          43,    83,     4,  2239,     1,   355,     9,    81,    36,
          34,    18,   424,   454,    17,    61,   554,   759,   116,
           5,     1,    96,     3,     1,  2406,    65,   682,     3,
        2865,  6645,  3403,   343,     9,     1,  3169,  8671, 12042,
          26,     5,  1978,    27,    32,  2503,   261,    10,    32,
         731,  2288,   182,    10,  2295, 21118,    19,  6336,     6,
         475,     4,

### 7. SPLIT DATA & GET (REVIEW, LABEL) DATALOADER

In [None]:
train_ratio = 0.8
valid_ratio = (1 - train_ratio)/2
total = padded_reviews.shape[0]
train_cutoff = int(total * train_ratio)
valid_cutoff = int(total * (1 - valid_ratio))

train_x, train_y = torch.from_numpy(padded_reviews[:train_cutoff]), torch.from_numpy(encoded_labels[:train_cutoff])
valid_x, valid_y = torch.from_numpy(padded_reviews[train_cutoff:valid_cutoff]), torch.from_numpy(encoded_labels[train_cutoff:valid_cutoff])
test_x, test_y = torch.from_numpy(padded_reviews[valid_cutoff:]), torch.from_numpy(encoded_labels[valid_cutoff:])

In [None]:
from torch.utils.data import TensorDataset, DataLoader

train_data = TensorDataset(train_x, train_y)
valid_data = TensorDataset(valid_x, valid_y)
test_data = TensorDataset(test_x, test_y)

BATCH_SIZE = 50
train_loader = DataLoader(train_data, batch_size = BATCH_SIZE, shuffle = True)
# Shuffling is only needed for training; evaluation order does not affect the metrics
valid_loader = DataLoader(valid_data, batch_size = BATCH_SIZE, shuffle = False)
test_loader = DataLoader(test_data, batch_size = BATCH_SIZE, shuffle = False)

### 8. DEFINE THE LSTM MODEL

During the model definition step, we can re-implement the weight initialisation and apply other tricks, such as adding various types of dropout to selected layers. There is a noteworthy [discussion](https://stackoverflow.com/questions/49433936/how-to-initialize-weights-in-pytorch) on whether one should initialize weights manually and, if so, how. The functions that implement the various initialisation methods are located in the `torch.nn.init` module.
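
As a rough illustration of manual initialisation with `torch.nn.init`, the sketch below applies one possible scheme: Xavier-uniform for input weights, orthogonal for recurrent weights, zeros for biases. The helper name `init_weights` and the choice of scheme are assumptions made for this example; the notebook itself relies on PyTorch's defaults.

```python
import torch
import torch.nn as nn


def init_weights(module):
    """Hypothetical helper: re-initialise Linear and LSTM parameters in place."""
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
        nn.init.zeros_(module.bias)
    elif isinstance(module, nn.LSTM):
        for name, param in module.named_parameters():
            if "weight_ih" in name:      # input-to-hidden weights
                nn.init.xavier_uniform_(param)
            elif "weight_hh" in name:    # hidden-to-hidden weights
                nn.init.orthogonal_(param)
            elif "bias" in name:
                nn.init.zeros_(param)


torch.manual_seed(0)
lstm = nn.LSTM(input_size=100, hidden_size=64, batch_first=True)
fc = nn.Linear(in_features=64, out_features=1)

# Module.apply calls the function on every submodule and on the module itself
lstm.apply(init_weights)
fc.apply(init_weights)
```

For a full model, a single `model.apply(init_weights)` call would reach every submodule in the same way.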


![](https://www.researchgate.net/publication/334268507/figure/fig8/AS:788364231987201@1564972088814/The-structure-of-the-Long-Short-Term-Memory-LSTM-neural-network-Reproduced-from-Yan.png)

In [None]:
class SentimentLSTM(nn.Module):

    def __init__(self, n_vocab, n_embed, n_hidden, n_output, n_layers, drop_p = 0.5):
        super().__init__()

        self.hidden_dim = n_hidden
        self.n_layers = n_layers


        self.dropout = nn.Dropout(drop_p)  # use the constructor argument instead of a hard-coded value
        self.embedding = nn.Embedding(n_vocab, n_embed, padding_idx=0)

        # input_size – The number of expected features in the input x
        # hidden_size – The number of features in the hidden state h
        # num_layers – Number of recurrent layers. E.g., setting num_layers=2 would mean stacking two LSTMs together to form a stacked LSTM, with the second LSTM taking in outputs of the first LSTM and computing the final results. Default: 1
        # bias – If False, then the layer does not use bias weights b_ih and b_hh. Default: True
        # batch_first – If True, then the input and output tensors are provided as (batch, seq, feature) instead of (seq, batch, feature). Note that this does not apply to hidden or cell states. See the Inputs/Outputs sections below for details. Default: False
        # dropout – If non-zero, introduces a Dropout layer on the outputs of each LSTM layer except the last layer, with dropout probability equal to dropout. Default: 0
        # bidirectional – If True, becomes a bidirectional LSTM. Default: False

        self.lstm = nn.LSTM(input_size=n_embed, hidden_size=self.hidden_dim, num_layers=self.n_layers, batch_first=True)
        self.fc1 = nn.Linear(in_features=self.hidden_dim, out_features=n_output)

    def forward(self, x):

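        # Create the initial states on the input's device instead of relying on a global `device`
        h = torch.zeros((self.n_layers, x.size(0), self.hidden_dim), device=x.device)
        c = torch.zeros((self.n_layers, x.size(0), self.hidden_dim), device=x.device)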

        torch.nn.init.xavier_normal_(h)
        torch.nn.init.xavier_normal_(c)

        out = self.dropout(self.embedding(x)) # (batch_size, seq_length, n_embed) -> 50, 200, 100
        out, (hidden, cell) = self.lstm(out, (h,c)) # (batch_size, seq_length, n_hidden) -> 50, 200, 64
        out = self.dropout(out)
        # taking last element of the sequence
        out = torch.sigmoid(self.fc1(out[:,-1,:])) # (batch_size, output) -> 50, 1
        return out

### 9. INSTANTIATE THE MODEL W/ HYPERPARAMETERS

In [None]:
n_vocab = len(vocab_to_int) + 1  # +1 because index 0 is reserved for padding
n_embed = 100
n_hidden = 64
n_output = 1   # 1 ("positive") or 0 ("negative")
n_layers = 1

model = SentimentLSTM(n_vocab, n_embed, n_hidden, n_output, n_layers)

In [None]:
print(model)

SentimentLSTM(
  (dropout): Dropout(p=0.5, inplace=False)
  (embedding): Embedding(46694, 100, padding_idx=0)
  (lstm): LSTM(100, 64, batch_first=True)
  (fc1): Linear(in_features=64, out_features=1, bias=True)
)


### 10. DEFINE LOSS & OPTIMIZER

L2 regularization is built into the optimizer: the `weight_decay` parameter controls its strength.
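
To make the role of `weight_decay` concrete, the sketch below trains the same toy weight vector twice: once with `weight_decay` set on `Adam`, and once with an explicit penalty of `0.5 * weight_decay * ||w||^2` added to the loss. The toy regression data and the `train` helper are assumptions made for this example. Both variants should end up with nearly identical weights, because `Adam` adds `weight_decay * w` to the gradient.

```python
import torch

torch.manual_seed(0)
x, y = torch.randn(16, 3), torch.randn(16)   # made-up regression data
w_init = torch.randn(3)
weight_decay = 1e-3


def train(use_optimizer_decay, steps=5):
    """Hypothetical helper: run a few Adam steps with one of the two L2 variants."""
    w = w_init.clone().requires_grad_(True)
    decay = weight_decay if use_optimizer_decay else 0.0
    optimizer = torch.optim.Adam([w], lr=1e-3, weight_decay=decay)
    for _ in range(steps):
        loss = ((x @ w - y) ** 2).mean()
        if not use_optimizer_decay:
            # Explicit L2 penalty; its gradient is weight_decay * w
            loss = loss + 0.5 * weight_decay * (w ** 2).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return w.detach()


w_optimizer = train(use_optimizer_decay=True)
w_manual = train(use_optimizer_decay=False)
print(torch.allclose(w_optimizer, w_manual, atol=1e-5))
```

Note that `torch.optim.AdamW` applies the decay differently (decoupled from the gradient), so this equivalence is specific to `Adam` and `SGD`.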

[BCELoss vs CrossEntropyLoss](https://medium.com/dejunhuang/learning-day-57-practical-5-loss-function-crossentropyloss-vs-bceloss-in-pytorch-softmax-vs-bd866c8a0d23)
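
The relationship between the two losses can be checked numerically. In the sketch below (toy logits and labels are made up), `BCELoss` on `sigmoid(z)` gives the same value as `CrossEntropyLoss` on the two-class logits `[0, z]`, since the softmax of `[0, z]` assigns probability `sigmoid(z)` to class 1.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
z = torch.randn(8)                # one logit per example (made-up values)
y = torch.randint(0, 2, (8,))     # integer class labels, 0 or 1

# Binary formulation: a single sigmoid output and float targets
bce = nn.BCELoss()(torch.sigmoid(z), y.float())

# Two-class formulation: logits for class 0 and class 1, integer targets
two_class_logits = torch.stack([torch.zeros_like(z), z], dim=1)
ce = nn.CrossEntropyLoss()(two_class_logits, y)

print(bce.item(), ce.item())
```

In practice the difference is mostly one of interface: `BCELoss` expects probabilities and float targets, while `CrossEntropyLoss` expects raw logits and integer class indices.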

In [None]:
from torch import optim

criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr = 1e-3, weight_decay=1e-3)

### 11. TRAIN THE NETWORK!

To counter the exploding gradient problem in LSTMs/RNNs, we use the `clip_grad_norm_` function, which rescales the gradients so that their total norm does not exceed the `clip` value.
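
A small sketch of what the clipping does, using a hand-set gradient on a single made-up parameter: the gradient `[3, 4, 0, 12]` has norm 13, so clipping with `max_norm=5` scales every component by roughly 5/13 while keeping the direction.

```python
import torch
import torch.nn as nn

p = nn.Parameter(torch.zeros(4))
p.grad = torch.tensor([3.0, 4.0, 0.0, 12.0])   # hand-set gradient, norm = 13
direction_before = p.grad / p.grad.norm()

# Returns the total norm measured before clipping and rescales p.grad in place
norm_before = float(nn.utils.clip_grad_norm_([p], max_norm=5))
norm_after = p.grad.norm().item()
direction_after = p.grad / p.grad.norm()

print(norm_before, norm_after)
```

Gradients whose total norm is already below `max_norm` are left unchanged.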


In [None]:
n_epochs = 20
clip = 5  # gradient clip to prevent exploding gradient problem in LSTM/RNN
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

for epoch in range(n_epochs):
    model.train()
    train_losses = []
    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        output = model(inputs)

        loss = criterion(output.squeeze(), labels.float())
        train_losses.append(loss.cpu().item())


        optimizer.zero_grad()
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()

    ######################
    ##### VALIDATION #####
    ######################

    with torch.no_grad():
        model.eval()
        valid_losses = []
        num_correct = 0
        for v_inputs, v_labels in valid_loader:
            v_inputs, v_labels = v_inputs.to(device), v_labels.to(device)


            v_output = model(v_inputs)
            v_loss = criterion(v_output.squeeze(), v_labels.float())
            valid_losses.append(v_loss.item())
            preds = torch.round(v_output.squeeze())
            # Compare against the validation labels, not the last training batch
            correct_tensor = preds.eq(v_labels.float().view_as(preds))
            correct = np.squeeze(correct_tensor.cpu().numpy())
            num_correct += np.sum(correct)

        print("Epoch: {}/{}".format((epoch+1), n_epochs),
              "Training Loss: {:.4f}".format(np.mean(train_losses)),
              "Validation Loss: {:.4f}".format(np.mean(valid_losses)),
              f"Validation accuracy: {num_correct/len(valid_loader.dataset):.4f}")

Epoch: 1/20 Training Loss: 0.6962 Validation Loss: 0.6962 Validation accuracy: 0.4750
Epoch: 2/20 Training Loss: 0.6931 Validation Loss: 0.6963 Validation accuracy: 0.4350
Epoch: 3/20 Training Loss: 0.6899 Validation Loss: 0.6965 Validation accuracy: 0.4650
Epoch: 4/20 Training Loss: 0.6883 Validation Loss: 0.6948 Validation accuracy: 0.5050
Epoch: 5/20 Training Loss: 0.6871 Validation Loss: 0.6912 Validation accuracy: 0.5150
Epoch: 6/20 Training Loss: 0.6820 Validation Loss: 0.6907 Validation accuracy: 0.4550
Epoch: 7/20 Training Loss: 0.6754 Validation Loss: 0.6950 Validation accuracy: 0.5100
Epoch: 8/20 Training Loss: 0.6757 Validation Loss: 0.6958 Validation accuracy: 0.5150
Epoch: 9/20 Training Loss: 0.6717 Validation Loss: 0.7067 Validation accuracy: 0.4800
Epoch: 10/20 Training Loss: 0.6671 Validation Loss: 0.7115 Validation accuracy: 0.4350
Epoch: 11/20 Training Loss: 0.6599 Validation Loss: 0.7298 Validation accuracy: 0.5050
Epoch: 12/20 Training Loss: 0.6521 Validation Loss: 

### 12. TEST THE TRAINED MODEL ON THE TEST SET

In [None]:
model.eval()
test_losses = []
num_correct = 0
model.to(device)
with torch.no_grad():
    for inputs, labels in test_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        test_output = model(inputs)
        loss = criterion(test_output.squeeze(), labels.float())
        test_losses.append(loss.item())

        preds = torch.round(test_output.squeeze())
        correct_tensor = preds.eq(labels.float().view_as(preds))
        correct = np.squeeze(correct_tensor.cpu().numpy())
        num_correct += np.sum(correct)

print("Test Loss: {:.4f}".format(np.mean(test_losses)))
print("Test Accuracy: {:.2f}".format(num_correct / len(test_loader.dataset)))

Test Loss: 0.7333
Test Accuracy: 0.51


### 13. TEST THE TRAINED MODEL ON A RANDOM SINGLE REVIEW

In [None]:
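def predict(net, review, seq_length = 200):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    words = tokenize(review)
    # Skip out-of-vocabulary words instead of raising a KeyError
    encoded_words = [vocab_to_int[word] for word in words if word in vocab_to_int]

    # Check before padding: after padding the sequence always has seq_length items
    if len(encoded_words) == 0:
        print("Your review must contain at least 1 known word!")
        return None

    padded_words = pad_text([encoded_words], seq_length)
    padded_words = torch.from_numpy(padded_words.reshape(1, -1)).to(device)

    net.eval()
    with torch.no_grad():
        output = net(padded_words)  # use the `net` argument, not the global `model`
    pred = torch.round(output.squeeze())
    # load_files assigns labels in folder order: 0 = neg, 1 = pos
    msg = "This is a positive review." if pred == 1 else "This is a negative review."

    print(msg)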

In [None]:
review1 = "It made me cry."
review2 = "It was so good it made me cry."
review3 = "It's ok."
review4 = "I loved the dialogues!"
review5 = "Garbage"

predict(model, review1)
predict(model, review2)
predict(model, review3)
predict(model, review4)
predict(model, review5)

This is a negative review.
This is a negative review.
This is a positive review.
This is a positive review.
This is a negative review.


## Tasks

1. Initialize the model's embedding layer with pre-trained word2vec embeddings. Train a new model with this layer and compare the results with the model above on the test dataset.

2. Optimize the training process by
 - Introducing early stopping
 - Experimenting on various ways of weight initialisation
 - Selecting hyperparameters with [Optuna](https://optuna.org/)
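
As a starting point for the early stopping item, here is one possible shape for the logic, written in plain Python. The `EarlyStopping` class, its `patience` and `min_delta` parameters, and the patience value of 3 are assumptions made for this sketch. The loss values are the validation losses printed by the training run above.

```python
class EarlyStopping:
    """Hypothetical helper: signals a stop after `patience` epochs without improvement."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, valid_loss):
        """Record one validation loss; return True when training should stop."""
        if valid_loss < self.best - self.min_delta:
            self.best = valid_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Validation losses of epochs 1-11 from the training log above
valid_losses = [0.6962, 0.6963, 0.6965, 0.6948, 0.6912, 0.6907,
                0.6950, 0.6958, 0.7067, 0.7115, 0.7298]

stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(valid_losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break

print(stopped_at, stopper.best)
```

In a real training loop, `step` would be called once per epoch with the mean validation loss, and the model weights would typically be saved whenever a new best loss is recorded.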