# Sentiment Analysis of Twitter Posts
<!-- Notebook name goes here -->
<center><b>Notebook: Neural Network Model, Error Analysis, and Tuning</b></center>
<br>

**By**: Stephen Borja, Justin Ching, Erin Chua, and Zhean Ganituen.

**Dataset**: Hussein, S. (2021). Twitter Sentiments Dataset [Dataset]. Mendeley. https://doi.org/10.17632/Z9ZW7NT5H2.1

**Motivation**: Every minute, social media users generate a large influx of textual data on live events. Performing sentiment analysis on this data provides a real-time view of public perception, enabling quick insights into the general population’s opinions and reactions.

**Goal**: By the end of the project, our goal is to create and compare supervised learning algorithms for sentiment analysis.

# **1. Project Setup**

We import the libraries needed for our neural network, the most important being PyTorch. PyTorch provides the tools to build and train neural networks, and it lets us train on an NVIDIA GPU through CUDA when one is available, falling back to the CPU otherwise.

In [1]:
# PyTorch
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
import itertools

# General Imports
import sys, os
sys.path.append(os.path.abspath("../lib"))

# Select device (using CUDA if possible)
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", DEVICE)

Using device: cpu


# **2. Data Setup**

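Next, we run the data processing pipeline. The cell below exists only to run the `data.ipynb` notebook, which performs all data cleaning and pre-processing. It temporarily disables the pager and `help` and captures the output, so that the sub-notebook does not clutter this one, then checks the shapes of the resulting arrays.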

In [1]:
import IPython.core.page
import builtins
import time
from IPython.utils.capture import capture_output

pager = IPython.core.page.page
helper = builtins.help

IPython.core.page.page = lambda *args, **kwargs: None
builtins.help = lambda *args, **kwargs: None

try:
    with capture_output():
        %run data.ipynb
finally:
    IPython.core.page.page = pager
    builtins.help = helper

print("Data Setup is DONE")

# Tests
assert X.shape == (162_801, 29318), "Feature matrix shape is wrong; expected (162_801, 29318)"
assert y.shape == (162_801,), "Labels shape is wrong; expected (162_801,)"

assert X_train.shape == (113_960, 29_318), "Train shape is wrong; expected (113_960, 29_318)"
assert X_test.shape == (48_841, 29_318), "Test shape is wrong; expected (48_841, 29_318)"

assert y_train.shape == (113_960,), "Train labels shape is wrong; expected (113_960,)"
assert y_test.shape == (48_841,), "Test labels shape is wrong; expected (48_841,)"
print("All tests passed.")

Data Setup is DONE
All tests passed.


We then prepare the dataset to be PyTorch-compatible.

Our labels in `y_train` are -1, 0, and 1, which PyTorch cannot use directly: `nn.CrossEntropyLoss` expects class labels to be non-negative integers. `TorchableSet` maps them to non-negative integers so that PyTorch can handle them.

`TorchableSet` also ensures that the features (`X_train`) are stored in a sparse matrix and the labels (`y_train`) are in a NumPy array for easier handling.
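The label remapping can be illustrated with a minimal sketch. The exact mapping used inside `TorchableSet` is not shown in this notebook, so the dictionary below (shifting each label up by one) and the helper names `encode_labels` and `decode_labels` are assumptions for illustration only.

```python
# Assumed mapping: shift each sentiment label up by one so that
# -1 (negative), 0 (neutral), 1 (positive) become class indices 0, 1, 2.
LABEL_TO_INDEX = {-1: 0, 0: 1, 1: 2}
INDEX_TO_LABEL = {index: label for label, index in LABEL_TO_INDEX.items()}


def encode_labels(labels):
    """Convert raw sentiment labels to non-negative class indices."""
    return [LABEL_TO_INDEX[label] for label in labels]


def decode_labels(indices):
    """Convert class indices back to the original sentiment labels."""
    return [INDEX_TO_LABEL[index] for index in indices]


raw = [-1, 0, 1, 1, -1]
encoded = encode_labels(raw)
print(encoded)                 # [0, 1, 2, 2, 0]
print(decode_labels(encoded))  # [-1, 0, 1, 1, -1]
```

Keeping the inverse mapping around makes it easy to report predictions in the original -1/0/1 convention.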

In [3]:
from barn.data_preparer import TorchableSet
TRAIN = TorchableSet(X_train, y_train)

# **3. Model Selection**

## **Architecture**

In [4]:
class MyLittlePony(nn.Module):
    def __init__(self, vocab_size, hidden_dim, n_hiddens, dropout):
        super().__init__()
        layers = [nn.Linear(vocab_size, hidden_dim), nn.ReLU(), nn.Dropout(dropout)]

        for _ in range(n_hiddens - 1):
            layers.append(nn.Linear(hidden_dim, hidden_dim))
            layers.append(nn.ReLU())
            layers.append(nn.Dropout(dropout))

        layers.append(nn.Linear(
            hidden_dim, 
            # There are 3 states at the end for the three sentiments
            3
        ))
        self.model = nn.Sequential(*layers)

    def forward(self, x):
        return self.model(x)
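To get a feel for the size of this architecture, the sketch below counts its trainable parameters in plain Python. It assumes the vocabulary size equals the 29,318 feature columns checked in the data setup tests; the helper name `count_mlp_parameters` is hypothetical. `nn.ReLU` and `nn.Dropout` have no parameters, so only the `nn.Linear` layers contribute (weights plus biases).

```python
def count_mlp_parameters(vocab_size, hidden_dim, n_hiddens, n_classes=3):
    """Count weights and biases of the Linear layers in the MLP above."""
    # Input layer: vocab_size -> hidden_dim
    total = vocab_size * hidden_dim + hidden_dim
    # Remaining hidden layers: hidden_dim -> hidden_dim, (n_hiddens - 1) times
    total += (n_hiddens - 1) * (hidden_dim * hidden_dim + hidden_dim)
    # Output layer: hidden_dim -> n_classes (one logit per sentiment)
    total += hidden_dim * n_classes + n_classes
    return total


# One hidden layer of 128 neurons over a 29,318-term vocabulary.
print(count_mlp_parameters(29_318, 128, 1))  # 3753219
```

Almost all of the parameters sit in the first layer, because the input is as wide as the vocabulary.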

# **4. Training the Model**

Import the helper functions used for training and evaluation.

In [5]:
from barn.trainer import train, validate, crossvalid

## **Computational Resources**

Training was performed on the following machine:
   * CPU: AMD Ryzen 7 7800X3D 8-Core Processor
   * GPU: NVIDIA GeForce RTX 3070 
   * RAM: 16 GB DDR4 (2×8 GB) dual-channel

## **Training**

As a testbench, we first defined an arbitrary set of simple hyperparameters.

In [7]:
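hidden_dim     = 2      # number of hidden layers
hidden_neurons = 32     # neurons per hidden layer
dropout        = 0.2
batch_size     = 32
learning_rate  = 1e-3
weight_decay   = 1e-5
epoch          = 5

criterion = nn.CrossEntropyLoss()
# Pass by keyword: in MyLittlePony, `hidden_dim` is the layer width and
# `n_hiddens` is the layer count, so the two values must not be swapped.
mlp = MyLittlePony(
    vocab_size = TRAIN.vocab_size,
    hidden_dim = hidden_neurons,
    n_hiddens  = hidden_dim,
    dropout    = dropout,
)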

mlp.to(DEVICE)

optimizer = torch.optim.Adam(mlp.parameters(), lr=learning_rate, weight_decay=weight_decay)

Then, train the model on these hyperparameters.

In [None]:
train_accuracy = train(
    model        = mlp,
    criterion    = criterion,
    optimizer    = optimizer,
    epoch        = epoch,
    train_loader = DataLoader(TRAIN, batch_size=batch_size, shuffle=True),
    device       = DEVICE
)

print(train_accuracy)

## **Validation**

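Now, let's evaluate the model. Note that `val_loader` in the cell below wraps `TRAIN`, so the reported accuracy is measured on the training data, not on the held-out test split.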

In [None]:
validation_accuracy = validate(
    model      = mlp,
    val_loader = DataLoader(TRAIN, batch_size=batch_size, shuffle=False),
    device     = DEVICE
)

print(validation_accuracy)

Save the model checkpoint for future use.

In [None]:
torch.save({
    'model_state_dict': mlp.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'validation_accuracy': validation_accuracy,
    'hyperparameters': {
        'hidden_dim': hidden_dim,
        'hidden_neurons': hidden_neurons,
        'dropout': dropout
    }
}, 'checkpoint.pth')
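A checkpoint saved this way can be restored by rebuilding a model with the same architecture and loading the stored state dictionary. The sketch below is a self-contained round trip that uses a tiny stand-in `nn.Sequential` model and a temporary file instead of `MyLittlePony` and `checkpoint.pth`; the layer sizes and the stored hyperparameter values are placeholders, not the notebook's.

```python
import os
import tempfile

import torch
import torch.nn as nn


def build_model():
    # Tiny stand-in for the notebook's MLP so the sketch runs quickly.
    return nn.Sequential(nn.Linear(10, 4), nn.ReLU(), nn.Linear(4, 3))


model = build_model()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Save in the same dictionary layout as the cell above.
path = os.path.join(tempfile.mkdtemp(), "checkpoint.pth")
torch.save({
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    "hyperparameters": {"hidden_dim": 4, "dropout": 0.0},
}, path)

# Restore: rebuild the same architecture, then load the saved weights.
checkpoint = torch.load(path, map_location="cpu")
restored = build_model()
restored.load_state_dict(checkpoint["model_state_dict"])

# Both models should now produce identical outputs in evaluation mode.
model.eval()
restored.eval()
x = torch.randn(2, 10)
with torch.no_grad():
    same = torch.equal(model(x), restored(x))
print(same)  # True
```

Storing the hyperparameters next to the weights matters here: `load_state_dict` only succeeds when the rebuilt model has the same layer shapes as the saved one.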

# **5. Model Tuning**

## **Hyperparameters**

In [None]:
class Hyperparameters:
    """
    Hyperparameters for the multi-layer perceptron (MLP) used for sentiment analysis.

    Remark (Zhean). I defined this as a class so that the hyperparameters live in one
    namespace and are treated as constants. Python does not enforce immutability of
    class attributes, so this is a convention: nothing in the code should reassign them.

    # Choosable hyperparameters (lists of candidate values)
    * C_EPOCHS: The number of training epochs.
    * C_HIDDENS: The number of hidden layers in the MLP.
    * C_SNEAKY_NEURONS: The number of neurons in each hidden layer.
    * C_BATCH_SIZE: The batch size.
    * C_DROPOUT: The dropout probability.

    # Usage
    ```
    print(Hyperparameters.C_EPOCHS)
    print(Hyperparameters.C_HIDDENS)
    print(Hyperparameters.C_SNEAKY_NEURONS)
    ```
    """
    # Choosable Hyperparameters
    # add prefix (C_) to denote it as choosable
    C_EPOCHS         = [5]#, 10 , 20 , 30]
    C_HIDDENS        = [1  ] #, 2  , 3]           
    C_SNEAKY_NEURONS = [128] #, 256, 512]
    C_BATCH_SIZE     = [64 , 128] #, 256]
    C_DROPOUT        = [0.2]#, 0.3, 0.5]     
    
    # Fixed (choice-less) hyperparameters
    OPTIMIZER        = torch.optim.Adam
    CRITERION        = nn.CrossEntropyLoss()
    LEARNING_RATE    = 1e-3
    WEIGHT_DECAY     = 1e-5

We can now enumerate every combination of the choosable hyperparameters with `itertools.product`:

In [None]:
hyperparam_combinations = itertools.product(
    Hyperparameters.C_BATCH_SIZE,
    Hyperparameters.C_EPOCHS, 
    Hyperparameters.C_HIDDENS, 
    Hyperparameters.C_SNEAKY_NEURONS,
    Hyperparameters.C_DROPOUT
)
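One detail worth keeping in mind is that `itertools.product` returns a single-use iterator: once it has been looped over, it is empty. The sketch below copies the candidate lists from the `Hyperparameters` class into standalone variables (so that it runs on its own) and shows both the number of combinations and the exhaustion behaviour.

```python
import itertools

# Candidate values copied from the Hyperparameters class above.
C_BATCH_SIZE = [64, 128]
C_EPOCHS = [5]
C_HIDDENS = [1]
C_SNEAKY_NEURONS = [128]
C_DROPOUT = [0.2]

combos = itertools.product(
    C_BATCH_SIZE, C_EPOCHS, C_HIDDENS, C_SNEAKY_NEURONS, C_DROPOUT
)

first_pass = list(combos)   # consumes the iterator
second_pass = list(combos)  # nothing left to yield

print(len(first_pass), len(second_pass))  # 2 0
print(first_pass[0])                      # (64, 5, 1, 128, 0.2)
```

If the combinations need to be inspected before being handed to the grid search, wrapping the call in `list(...)` keeps them reusable.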

## **Grid Search**

In [None]:
from barn.gs import grid_search

In [None]:
# results, best_cfg = grid_search(
#     model_class=MyLittlePony,
#     hyperparam_combinations=hyperparam_combinations,
#     optimizer=Hyperparameters.OPTIMIZER,
#     criterion=Hyperparameters.CRITERION,
#     learning_rate=Hyperparameters.LEARNING_RATE,
#     weight_decay=Hyperparameters.WEIGHT_DECAY,
#     dataset=TRAIN,
#     vocab_size=TRAIN.vocab_size,
#     device=DEVICE,
#     k_fold=5,
#     save_path="best_pony.pt"
# )
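The `k_fold=5` argument suggests that each configuration is scored with 5-fold cross-validation. How `barn.gs.grid_search` actually splits the data is not shown in this notebook, so the following is only a sketch of one common scheme (contiguous, unshuffled folds), with a hypothetical helper `k_fold_indices`, applied to the 113,960 training rows.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices) for k contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if fold < extra else 0)
        stop = start + size
        val_indices = list(range(start, stop))
        train_indices = list(range(0, start)) + list(range(stop, n_samples))
        yield train_indices, val_indices
        start = stop


folds = list(k_fold_indices(113_960, 5))
for fold_number, (train_idx, val_idx) in enumerate(folds, start=1):
    # Every fold here has 91168 training rows and 22792 validation rows.
    print(fold_number, len(train_idx), len(val_idx))
```

Each row is used for validation exactly once, so averaging the five validation scores gives one estimate per hyperparameter combination. In practice the indices are usually shuffled before splitting.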