# Sentiment Analysis - Multilayer Perceptron that uses an embedding representation of the input sequence

We now introduce the PyTorch <code>Embedding</code> module, a layer that encapsulates the embedding weight matrix.

We are going to build a Multilayer Perceptron model very similar to the one in the previous example, except for an additional <code>Embedding</code> layer that maps token indices to their embedding vectors.

The vector representation of the entire sequence is calculated as the sum of the embedding vectors of all tokens in the sequence. Applying an aggregation function such as sum, max, or average is a common way to combine the embedding vectors of single tokens into one vector that captures the overall context of the sequence.

<img src="files/word2vec.png" width="300" height="500" align="center"/>
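To make the lookup-and-aggregate idea concrete, here is a minimal sketch with toy sizes. The vocabulary size, embedding size, and token indices are made up for illustration and are not taken from the Twitter dataset.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes chosen for illustration only
vocab_size, embedding_size, padding_idx = 10, 4, 0

# The Embedding layer is a lookup table of shape (vocab_size, embedding_size);
# the row at padding_idx is initialized to zeros and receives no gradient updates
embedding = nn.Embedding(vocab_size, embedding_size, padding_idx=padding_idx)

# A batch of two token-index sequences, padded with 0 to the same length
token_indices = torch.tensor([[3, 5, 7, 0],
                              [2, 9, 0, 0]])

# Look up one vector per token: shape (batch, seq_len, embedding_size)
token_vectors = embedding(token_indices)

# Aggregate over the sequence dimension to get one vector per sequence
sum_vectors = token_vectors.sum(dim=1)         # padding rows add nothing to the sum
max_vectors = token_vectors.max(dim=1).values  # note: the zero padding rows take part in the max

# Average over the real (non-padding) tokens only
lengths = (token_indices != padding_idx).sum(dim=1, keepdim=True)
mean_vectors = sum_vectors / lengths

print(token_vectors.shape)  # torch.Size([2, 4, 4])
print(sum_vectors.shape)    # torch.Size([2, 4])
```

Because the padding row is all zeros, padded positions do not change the sum, which is one reason the sum is a convenient aggregation for variable-length sequences.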


Additionally, two options are explored: training the embedding weight matrix from scratch, or initializing it with pretrained embedding weights (GloVe) and fine-tuning them for the specific task of sentiment analysis.


Everything except the model itself and the vectorizer stays very similar to the previous examples: building the dataset, the vocabulary, the data loader, the training loop, and the evaluation.

## Setup

First, set up the paths to the preprocessed dataset and to the GloVe embeddings file:

In [1]:
# Paths to the preprocessed data and the GloVe embeddings
import os

# __file__ is not defined inside a notebook, so the literal string '__file__' is
# resolved against the current working directory; fileDir is that directory
fileDir = os.path.dirname(os.path.realpath('__file__'))
absFilePathToPreprocessedDataset = os.path.join(fileDir, '../Data/training.1600000.processed.noemoticon_preprocessed.csv')
absFilePathToGloVe = os.path.join(fileDir, '../Data/glove.6B.50d.txt')
pathToPreprocessedDataset = os.path.abspath(os.path.realpath(absFilePathToPreprocessedDataset))
pathToGloveEmbeddings = os.path.abspath(os.path.realpath(absFilePathToGloVe))
print(pathToPreprocessedDataset)
print(pathToGloveEmbeddings)

c:\Users\v-tastan\source\repos\PetnicaNLPWorkshop\Data\training.1600000.processed.noemoticon_preprocessed.csv
c:\Users\v-tastan\source\repos\PetnicaNLPWorkshop\Data\glove.6B.50d.txt


Choose the device to run the training on:

In [2]:
device = "cpu"

Set the learning rate parameter:

In [3]:
learningRate = 0.001

## Initialization

In [4]:
import torch.nn as nn
import torch.optim as optim
from TwitterDataset import TwitterDataset
from ModelMLPWithEmbeddings import SentimentClassifierMLPWithEmbeddings

# Step #1: Instantiate the dataset
dataset = TwitterDataset.load_dataset_and_make_vectorizer(pathToPreprocessedDataset, representation="indices")
# get the vectorizer
vectorizer = dataset.get_vectorizer()

### Option A: Train embeddings from scratch, do not use pre-trained embeddings

No additional steps are required; the <code>Embedding</code> layer's weight matrix should not be set explicitly and keeps its random initialization.

### Option B: Use pre-trained embeddings

Using pre-trained embeddings takes three steps:

1. Load the pretrained embeddings
2. Select only the subset of embeddings for the words that are actually present in the data
3. Set the <code>Embedding</code> layer's weight matrix to the loaded subset
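The three steps above can be sketched in plain PyTorch and NumPy. This is only an illustration under stated assumptions: the in-memory "file", the toy vocabulary, and the random initialization for words without a pretrained vector are all made up, and the workshop's <code>PreTrainedEmbeddings</code> class may handle these details differently.

```python
import io

import numpy as np
import torch
import torch.nn as nn

# Tiny stand-in for a GloVe text file: each line holds a word followed by its vector
# (made-up 3-dimensional values, not real GloVe data)
fake_glove_file = io.StringIO(
    "good 0.1 0.2 0.3\n"
    "bad -0.1 -0.2 -0.3\n"
    "day 0.5 0.0 -0.5\n"
)

# Step 1: load every pretrained vector into a dictionary
pretrained = {}
for line in fake_glove_file:
    word, *values = line.split()
    pretrained[word] = np.array([float(v) for v in values], dtype=np.float32)

# Step 2: keep only the words of the vocabulary, in vocabulary index order
vocabulary = ["<MASK>", "<UNK>", "good", "day", "tweet"]  # hypothetical vocabulary
embedding_size = 3
rng = np.random.default_rng(0)
matrix = np.zeros((len(vocabulary), embedding_size), dtype=np.float32)
for idx, word in enumerate(vocabulary):
    if word in pretrained:
        matrix[idx] = pretrained[word]
    elif idx != 0:
        # Assumption: words without a pretrained vector get a small random vector;
        # index 0 (the padding token) stays all zeros
        matrix[idx] = rng.normal(scale=0.1, size=embedding_size)

# Step 3: use the matrix as the initial weights of the Embedding layer;
# freeze=False keeps the weights trainable so they can be fine-tuned
embedding = nn.Embedding.from_pretrained(
    torch.from_numpy(matrix), freeze=False, padding_idx=0
)

print(embedding.weight.shape)  # torch.Size([5, 3])
```

Passing <code>freeze=True</code> instead would keep the pretrained vectors fixed during training, which is the alternative to fine-tuning.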

#### Step #1.B.1: Load the pre-trained embeddings

Load the GloVe pretrained embeddings from the file path set above into a <code>PreTrainedEmbeddings</code> instance:

In [5]:
from PreTrainedEmbeddings import PreTrainedEmbeddings

embeddings = PreTrainedEmbeddings.from_embeddings_file(pathToGloveEmbeddings)

#### Step #1.B.2: Initialize the embedding matrix

In [6]:
# get list of words in the vocabulary
word_list = vectorizer.text_vocabulary.to_serializable()["token_to_idx"].keys()

# get the pre-trained embedding vectors only for words that appear in the vocabulary
embeddings_matrix = embeddings.make_embeddings_matrix(word_list)

In [7]:
# Step #2: Instantiate the model
model = SentimentClassifierMLPWithEmbeddings(
    embedding_size=50,
    num_embeddings=len(vectorizer.text_vocabulary),
    hidden_dim=10,
    output_dim=len(vectorizer.target_vocabulary),
    padding_idx=vectorizer.text_vocabulary.mask_index,
    # Step #1.B.3: set the loaded subset as a weight matrix
    pretrained_embedding_matrix=embeddings_matrix,
)
# send model to appropriate device
model = model.to(device)

# Step #3: Instantiate the loss function
loss_func = nn.CrossEntropyLoss()

# Step #4: Instantiate the optimizer
optimizer = optim.Adam(model.parameters(), lr=learningRate)

## Training Loop

In [8]:
from Trainer import Trainer

sentiment_analysis_trainer = Trainer(
    dataset=dataset,
    model=model,
    loss_func=loss_func,
    optimizer=optimizer
)

In [9]:
# set the number of epochs
num_epochs = 50
# set the batch size
batch_size = 16

report = sentiment_analysis_trainer.train(num_epochs=num_epochs, batch_size=batch_size, device=device)
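The <code>Trainer</code> class is not shown in this notebook. As a rough sketch of what a training loop of this kind typically does for each epoch, the following uses synthetic data and a tiny stand-in model; every name here (<code>TinyModel</code>, <code>toy_model</code>, <code>toy_optimizer</code>, and so on) is hypothetical and not part of the workshop code.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in data: 64 "tweets" of 8 token indices each, with binary labels
x = torch.randint(low=1, high=20, size=(64, 8))
y = torch.randint(low=0, high=2, size=(64,))
loader = DataLoader(TensorDataset(x, y), batch_size=16, shuffle=True)


class TinyModel(nn.Module):
    """Stand-in model: embedding lookup, sum over tokens, one linear layer."""

    def __init__(self):
        super().__init__()
        self.embeddings = nn.Embedding(20, 5, padding_idx=0)
        self.fc = nn.Linear(5, 2)

    def forward(self, x_in):
        return self.fc(self.embeddings(x_in).sum(dim=1))


toy_model = TinyModel()
toy_loss_func = nn.CrossEntropyLoss()
toy_optimizer = optim.Adam(toy_model.parameters(), lr=0.001)

epoch_losses = []
for epoch in range(3):
    toy_model.train()  # enable training mode
    running_loss = 0.0
    for batch_x, batch_y in loader:
        toy_optimizer.zero_grad()               # clear gradients from the previous step
        logits = toy_model(batch_x)             # forward pass
        loss = toy_loss_func(logits, batch_y)   # CrossEntropyLoss expects raw logits
        loss.backward()                         # backpropagate
        toy_optimizer.step()                    # update the weights
        running_loss += loss.item()
    # average loss over the batches of this epoch
    epoch_losses.append(running_loss / len(loader))

print(len(epoch_losses))  # 3
```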

## Evaluate the results

In [10]:
# set the model in eval state
model.eval()

SentimentClassifierMLPWithEmbeddings(
  (embeddings): Embedding(322, 50, padding_idx=0)
  (fc1): Linear(in_features=50, out_features=10, bias=True)
  (fc2): Linear(in_features=10, out_features=2, bias=True)
)
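The printed structure above lists an embedding layer followed by two linear layers. The class itself lives in <code>ModelMLPWithEmbeddings</code> and is not shown here, so the following is only a minimal sketch of how such a model might be written. The class name <code>SketchMLPWithEmbeddings</code>, the ReLU activation, and the cast of the input to integer indices are assumptions, not the workshop's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SketchMLPWithEmbeddings(nn.Module):
    """Hypothetical stand-in for SentimentClassifierMLPWithEmbeddings."""

    def __init__(self, embedding_size, num_embeddings, hidden_dim, output_dim,
                 padding_idx=0, pretrained_embedding_matrix=None):
        super().__init__()
        self.embeddings = nn.Embedding(num_embeddings, embedding_size,
                                       padding_idx=padding_idx)
        if pretrained_embedding_matrix is not None:
            # Option B: start from pretrained vectors instead of random ones
            weights = torch.as_tensor(pretrained_embedding_matrix, dtype=torch.float32)
            self.embeddings.weight.data.copy_(weights)
        self.fc1 = nn.Linear(embedding_size, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x_in, apply_softmax=False):
        # Assumption: indices are cast to integers before the lookup
        embedded = self.embeddings(x_in.long())  # (batch, seq_len, embedding_size)
        summed = embedded.sum(dim=1)             # one vector per sequence
        hidden = F.relu(self.fc1(summed))        # assumed activation
        logits = self.fc2(hidden)
        if apply_softmax:
            return F.softmax(logits, dim=1)
        return logits


# Same sizes as in the printed structure: 322 tokens, 50-d embeddings, 10 hidden units, 2 classes
model_sketch = SketchMLPWithEmbeddings(50, 322, 10, 2, padding_idx=0)
batch = torch.tensor([[5, 17, 42, 0, 0]])  # one padded sequence of made-up indices
probabilities = model_sketch(batch, apply_softmax=True)
print(probabilities.shape)  # torch.Size([1, 2])
```

Returning raw logits by default fits <code>nn.CrossEntropyLoss</code>, which applies the log-softmax internally; the softmax is only needed at inference time to read the outputs as probabilities.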

In [11]:
def evaluate(split):
    loss, accuracy = sentiment_analysis_trainer.evaluate(split=split, device=device, batch_size=batch_size)

    print("Loss: {:.3f}".format(loss))
    print("Accuracy: {:.3f}".format(accuracy))

#### Training Set

In [13]:
evaluate(split="train")

Loss: 0.386
Accuracy: 0.816


#### Validation Set

In [16]:
evaluate(split="validation")

Loss: 0.865
Accuracy: 0.667


#### Test Set

In [18]:
evaluate(split="test")

Loss: 0.799
Accuracy: 0.615


## Inference and classifying new data points

Let's run inference on new data. This is another way to evaluate the model: a qualitative judgement of whether it is working.

In [19]:
import torch

def predict(text, model, vectorizer):
    """
    Predict the sentiment of the tweet

    Args:
        text (str): the text of the tweet
        model (SentimentClassifierMLPWithEmbeddings): the trained model
        vectorizer (TwitterVectorizer): the corresponding vectorizer
    Returns:
        sentiment of the tweet (int), probability of that prediction (float)
    """
    # vectorize the text of the tweet
    vectorized_text = vectorizer.vectorize(text)

    # make a tensor with the expected size (1, sequence_length)
    vectorized_text = torch.Tensor(vectorized_text).view(1, -1)

    # set the model in the eval state
    model.eval()

    # run the model on the vectorized text and apply softmax activation function on the outputs
    result = model(vectorized_text, apply_softmax=True)

    # find the best class as the one with the highest probability
    probability_values, indices = result.max(dim=1)

    # take only value of the indices tensor
    index = indices.item()

    # decode the predicted target index into the sentiment, using target vocabulary
    predicted_target = vectorizer.target_vocabulary.find_index(index)

    # take only value of the probability_values tensor 
    probability_value = probability_values.item()

    return predicted_target, probability_value
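The last steps of <code>predict</code>, turning the model outputs into a class and a probability, can be seen in isolation in the small sketch below. The logit values are made up for illustration.

```python
import torch
import torch.nn.functional as F

# Made-up raw model outputs (logits) for one tweet and two classes
logits = torch.tensor([[-0.2, 0.4]])

# Softmax turns the logits into probabilities that sum to 1 along the class dimension
probabilities = F.softmax(logits, dim=1)

# max over dim=1 returns both the highest probability and the index of its class
probability_values, indices = probabilities.max(dim=1)

print(indices.item())                       # 1
print(round(probability_values.item(), 3))  # 0.646
```

The returned index is what <code>predict</code> then decodes through the target vocabulary, and the probability is the model's confidence in that class.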

Let's try the model on some examples:

In [22]:
text = "This is a good day."

predict(text, model, vectorizer)

(1.0, 0.5920077562332153)

In [25]:
text = "I was very sad yesterday."

predict(text, model, vectorizer)

(0.0, 0.8469178080558777)

In [28]:
text = "This is a book."

predict(text, model, vectorizer)

(0.0, 0.5402117967605591)

### More detailed evaluation on the Test Set

In [29]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# run the model on the tweets from test set 
y_predicted = dataset.test_df.text.apply(lambda x: predict(text=x, model=model, vectorizer=vectorizer)[0])

# compare that with labels
print(classification_report(y_true=dataset.test_df.target, y_pred=y_predicted))

# print the confusion matrix
print("Confusion matrix:")
print(confusion_matrix(y_true=dataset.test_df.target, y_pred=y_predicted))

              precision    recall  f1-score   support

         0.0       0.54      0.96      0.69        47
         1.0       0.88      0.28      0.43        53

    accuracy                           0.60       100
   macro avg       0.71      0.62      0.56       100
weighted avg       0.72      0.60      0.55       100

Confusion matrix:
[[45  2]
 [38 15]]
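The per-class numbers in the report follow directly from the confusion matrix. As a worked check, the sketch below recomputes them with NumPy, using scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix reported above: rows are true classes, columns are predicted classes
cm = np.array([[45, 2],
               [38, 15]])

true_positives = np.diag(cm)                  # correctly classified tweets per class
precision = true_positives / cm.sum(axis=0)   # divide by the number predicted as each class
recall = true_positives / cm.sum(axis=1)      # divide by the number truly in each class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_positives.sum() / cm.sum()

print(np.round(precision, 2))      # [0.54 0.88]
print(np.round(recall, 2))         # [0.96 0.28]
print(np.round(f1, 2))             # [0.69 0.43]
print(round(float(accuracy), 2))   # 0.6
```

The matrix shows that 38 of the 53 positive tweets were predicted as negative, which is why the recall for class 1.0 is so low while its precision stays high.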
