In [None]:
%load_ext autoreload
%autoreload 2

# 2. Sentiment Analysis - Training Routine

## Setup

First, set up the path to the preprocessed dataset:

In [None]:
# Path to the preprocessed data
import os

# '__file__' is quoted on purpose: notebooks do not define __file__, so this
# resolves to the kernel's current working directory.
fileDir = os.path.dirname(os.path.realpath('__file__'))
absFilePathToPreprocessedDataset = os.path.join(fileDir, '../Data/training.1600000.processed.noemoticon_preprocessed.csv')
pathToPreprocessedDataset = os.path.abspath(os.path.realpath(absFilePathToPreprocessedDataset))
print(pathToPreprocessedDataset)

Choose the device to run the training on:

In [None]:
device = "cpu"
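
If a GPU may be available, the device can be chosen at run time instead of being hard-coded. The following is a minimal sketch, assuming PyTorch is installed; `torch.cuda.is_available()` returns `False` on CPU-only machines, so the result falls back to the `"cpu"` value used above.

```python
import torch

# Prefer the GPU when PyTorch can see one; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)
```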

### Step #1: Instantiate the dataset

Instantiate the dataset from the preprocessed dataset path. The dataset is also responsible for creating the vectorizer it uses.

In [None]:
from Common.TwitterDataset import TwitterDataset

# instantiate the dataset
dataset = TwitterDataset.load_dataset_and_make_vectorizer(pathToPreprocessedDataset)

# get the vectorizer
vectorizer = dataset.get_vectorizer()

### Step #2: Instantiate the model

Instantiate the model and move it to the desired device.

In [None]:
from Models.ModelPerceptron import SentimentClassifierPerceptron

# instantiate the model
model = SentimentClassifierPerceptron(num_features=len(vectorizer.text_vocabulary), output_dim=2)

# send model to appropriate device
model = model.to(device)

### Step #3: Instantiate the loss function

In [None]:
import torch.nn as nn

loss_func = nn.CrossEntropyLoss()

### Step #4: Instantiate the optimizer



In [None]:
import torch.optim as optim

learningRate = 0.001

optimizer = optim.Adam(model.parameters(), lr=learningRate)

### Bonus #1: Define how to calculate the accuracy of the model

In [None]:
import torch.nn.functional as F

def compute_accuracy(output, labels):
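    # for each sample, take the class with the highest softmax probability
    probability_values, indices = F.softmax(output, dim=1).max(dim=1)

    # count the predictions that match the labels
    correct = (indices == labels).float().sum()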

    return correct / len(labels)
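
As a quick sanity check, the accuracy function can be run on a handful of hand-written values. The sketch below repeats the function so that it runs on its own; the logits and labels are made up for illustration and do not come from the Twitter dataset.

```python
import torch
import torch.nn.functional as F

def compute_accuracy(output, labels):
    # Same logic as the cell above: pick the most probable class per row
    _, indices = F.softmax(output, dim=1).max(dim=1)
    correct = (indices == labels).float().sum()
    return correct / len(labels)

# Made-up model outputs (logits) for 4 samples and 2 classes
logits = torch.tensor([[2.0, 0.5],
                       [0.1, 1.2],
                       [0.3, 0.2],
                       [-1.0, 1.0]])
labels = torch.tensor([0, 1, 1, 1])

accuracy = compute_accuracy(logits, labels)
# Predicted classes are [0, 1, 0, 1], so 3 of the 4 labels match
print(accuracy.item())  # 0.75
```

Softmax does not change the ordering of the values within a row, so `output.argmax(dim=1)` would select the same classes.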

## Training loop

The training loop uses the objects instantiated in the previous steps to update the model parameters so that the model's performance improves over time.

The training loop is composed of two loops: an inner loop over minibatches in the dataset, and an outer loop which repeats the inner loop a predefined number of times (<code>num_epochs</code>). In the inner loop, the loss is calculated for each minibatch and the optimizer is used to update the model parameters.

In each epoch, the model is first trained on the training set: the training dataset is divided into batches and the following 5 steps are repeated for each batch:
- **Step #1**: Zero the gradients (clear the information about gradients from previous step)
- **Step #2**: Calculate the model output
- **Step #3**: Compute the loss, when compared with labels
- **Step #4**: Use the loss to calculate and backpropagate gradients
- **Step #5**: Use the optimizer to update the weights of the model

After the inner loop over the training batches, a similar loop runs over the validation data. The main difference is that the validation data is not used to update the model weights; it is only used to measure the model's performance. Therefore, it has 3 steps, repeated for each batch:
- **Step #1**: Calculate the model output
- **Step #2**: Compute the loss, when compared with labels
- **Step #3**: Compute the accuracy, when compared with the labels
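
The two loops described above can be sketched in plain PyTorch as follows. This is only an illustration under stated assumptions, not the code of the `Trainer` class used below: the data is synthetic, the model is a single `nn.Linear` layer, and the `report` keys are chosen to resemble the ones plotted later in this notebook.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in data (assumption): 256 samples, 10 features, 2 classes
features = torch.randn(256, 10)
labels = (features[:, 0] > 0).long()
train_loader = DataLoader(TensorDataset(features[:192], labels[:192]),
                          batch_size=64, shuffle=True)
val_loader = DataLoader(TensorDataset(features[192:], labels[192:]),
                        batch_size=64)

model = nn.Linear(10, 2)
loss_func = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.05)

report = {"train_loss": [], "validation_loss": [], "validation_accuracy": []}
num_epochs = 5

for epoch in range(num_epochs):          # outer loop over epochs
    model.train()
    running_loss = 0.0
    for x_batch, y_batch in train_loader:    # inner loop over training batches
        optimizer.zero_grad()                # Step #1: zero the gradients
        output = model(x_batch)              # Step #2: calculate the model output
        loss = loss_func(output, y_batch)    # Step #3: compute the loss
        loss.backward()                      # Step #4: backpropagate gradients
        optimizer.step()                     # Step #5: update the weights
        running_loss += loss.item()
    report["train_loss"].append(running_loss / len(train_loader))

    model.eval()
    val_loss, val_correct = 0.0, 0
    with torch.no_grad():                    # validation never updates weights
        for x_batch, y_batch in val_loader:
            output = model(x_batch)                          # Step #1: output
            val_loss += loss_func(output, y_batch).item()    # Step #2: loss
            val_correct += (output.argmax(dim=1) == y_batch).sum().item()  # Step #3: accuracy
    report["validation_loss"].append(val_loss / len(val_loader))
    report["validation_accuracy"].append(val_correct / len(val_loader.dataset))
```

Wrapping the validation pass in `torch.no_grad()` skips gradient tracking, which saves memory and makes it explicit that this pass only measures performance.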

In [None]:
from Common.Trainer import Trainer

sentiment_analysis_trainer = Trainer(
    dataset=dataset,
    model=model,
    loss_func=loss_func,
    optimizer=optimizer
)

In [None]:
# set the number of epochs
num_epochs = 50
# set the batch size
batch_size = 64

report = sentiment_analysis_trainer.train(num_epochs=num_epochs, batch_size=batch_size, device=device)

### Explore the training results

#### Training Set

In [None]:
import matplotlib.pyplot as plt

plt.plot(report["train_loss"])
plt.title("Training Set Loss")
plt.show()

plt.plot(report["train_accuracy"])
plt.title("Training Set Accuracy")
plt.show()

#### Validation Set

In [None]:
plt.plot(report["validation_loss"])
plt.title("Validation Set Loss")
plt.show()

plt.plot(report["validation_accuracy"])
plt.title("Validation Set Accuracy")
plt.show()