# Sentiment Analysis of Twitter Posts
<center><b>Notebook: Neural Network Model, Error Analysis, and Tuning</b></center>
<br>

**By**: Stephen Borja, Justin Ching, Erin Chua, and Zhean Ganituen.

**Dataset**: Hussein, S. (2021). Twitter Sentiments Dataset [Dataset]. Mendeley. https://doi.org/10.17632/Z9ZW7NT5H2.1

**Motivation**: Every minute, social media users generate a large influx of textual data on live events. Performing sentiment analysis on this data provides a real-time view of public perception, enabling quick insights into the general population's opinions and reactions.

**Goal**: By the end of the project, we aim to build and compare supervised learning algorithms for sentiment analysis.

# **1. Project Setup**

In [2]:
# PyTorch
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

# General Imports
from tqdm import tqdm
import pandas as pd
import numpy as np
import sys, os
sys.path.append(os.path.abspath("../lib"))

# **2. Data Setup**

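Run the data processing pipeline in `data.ipynb`. The pager and `help()` are temporarily disabled and the notebook's output is captured, so only the final status messages are shown.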

In [3]:
import IPython.core.page
import builtins
from IPython.utils.capture import capture_output

pager = IPython.core.page.page
helper = builtins.help

IPython.core.page.page = lambda *args, **kwargs: None
builtins.help = lambda *args, **kwargs: None

try:
    with capture_output():
        %run data.ipynb
finally:
    IPython.core.page.page = pager
    builtins.help = helper

print("Data Setup is DONE")

# Tests
assert X.shape == (162_801, 29318), "Feature matrix shape is wrong; expected (162_801, 29318)"
assert y.shape == (162_801,), "Labels shape is wrong; expected (162_801,)"

assert X_train.shape == (113_960, 29_318), "Train shape is wrong; expected (113_960, 29_318)"
assert X_test.shape == (48_841, 29_318), "Test shape is wrong; expected (48_841, 29_318)"

assert y_train.shape == (113_960,), "Train labels shape is wrong; expected (113_960,)"
assert y_test.shape == (48_841,), "Test labels shape is wrong; expected (48_841,)"
print("All tests passed.")

Data Setup is DONE
All tests passed.
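As a quick sanity check on the asserted shapes, the row counts can be compared against each other. The sketch below uses only the numbers from the assertions above; reading the result as a roughly 70/30 split is an inference from those numbers, not something the notebook states.

```python
# Row counts taken from the shape assertions above
n_total, n_train, n_test = 162_801, 113_960, 48_841

# The train and test partitions should cover the full dataset exactly once
assert n_train + n_test == n_total

train_fraction = n_train / n_total
test_fraction = n_test / n_total

print(round(train_fraction, 4))  # 0.7
print(round(test_fraction, 4))   # 0.3
```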


Now, we need to convert the dataset to a PyTorch-compatible dataset.

In [4]:
class BoWForTorch(Dataset):
    label_map = {-1: 0, 0: 1, 1: 2}

    def __init__(self, X, y):
        self.X = X
        self.y = y.map(self.label_map).values.astype(int)

    def __len__(self):
        return self.X.shape[0]

    def __getitem__(self, idx):
        x = self.X[idx].toarray().ravel().astype("float32")
        y = self.y[idx]
        return torch.from_numpy(x), torch.tensor(y, dtype=torch.long)

train_dataset = BoWForTorch(X_train, y_train)
test_dataset  = BoWForTorch(X_test, y_test)
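The `label_map` remapping is needed because `nn.CrossEntropyLoss` expects class indices in the range `[0, C-1]`, so labels of -1, 0, and 1 cannot be used directly. The sketch below reproduces the mapping on a few made-up labels and builds the inverse map that would turn predicted class indices back into the original labels. The sample values and the `inverse_label_map` name are illustrative assumptions, not part of the notebook.

```python
# Same mapping as BoWForTorch.label_map: original label -> class index
label_map = {-1: 0, 0: 1, 1: 2}

# Hypothetical helper (not in the notebook): class index -> original label
inverse_label_map = {index: label for label, index in label_map.items()}

# Made-up labels, for illustration only
raw_labels = [1, -1, 0, 1]

class_indices = [label_map[label] for label in raw_labels]
decoded = [inverse_label_map[index] for index in class_indices]

print(class_indices)  # [2, 0, 1, 2]
print(decoded)        # [1, -1, 0, 1]
```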

Then, construct `DataLoader` objects for the train and test datasets.

In [5]:
train_loader = DataLoader(
    train_dataset,
    batch_size=32,
    shuffle=True
)

test_loader = DataLoader(
    test_dataset,
    batch_size=32,
    shuffle=False
)
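A rough sketch of what these loaders imply, using only the shapes asserted earlier and `batch_size=32`. With the default `drop_last=False`, a `DataLoader` yields `ceil(n_samples / batch_size)` batches per epoch. The byte counts assume `float32` values (4 bytes each), as in `__getitem__`, and show why rows are densified one at a time rather than converting the whole matrix up front. The helper names `n_batches` and `dense_bytes` are hypothetical.

```python
import math


def n_batches(n_samples: int, batch_size: int, drop_last: bool = False) -> int:
    """Number of batches a DataLoader yields per epoch."""
    if drop_last:
        return n_samples // batch_size
    return math.ceil(n_samples / batch_size)


def dense_bytes(n_rows: int, n_cols: int, bytes_per_value: int = 4) -> int:
    """Memory needed to hold an n_rows x n_cols dense float32 array."""
    return n_rows * n_cols * bytes_per_value


print(n_batches(113_960, 32))        # 3562 training batches
print(n_batches(48_841, 32))         # 1527 test batches
print(113_960 % 32)                  # 8 samples in the last training batch

print(dense_bytes(32, 29_318))       # 3752704 bytes, about 3.75 MB per dense batch
print(dense_bytes(113_960, 29_318))  # 13364317120 bytes, about 13.4 GB for all of X_train
```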

# **3. Model Selection**

In [6]:
class MyLittlePony(nn.Module):
    def __init__(self, vocab_size, hidden_dim, n_sentiment, n_hiddens=1, dropout=0.2):
        super().__init__()

        layers = [nn.Linear(vocab_size, hidden_dim), nn.ReLU(), nn.Dropout(dropout)]

        for _ in range(n_hiddens - 1):
            layers.append(nn.Linear(hidden_dim, hidden_dim))
            layers.append(nn.ReLU())
            layers.append(nn.Dropout(dropout))

        layers.append(nn.Linear(hidden_dim, n_sentiment))
        self.model = nn.Sequential(*layers)

    def forward(self, x):
        return self.model(x)
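To get a feel for the size of this architecture, the number of trainable parameters can be counted by hand: each `nn.Linear(in, out)` contributes `in * out` weights plus `out` biases, while `nn.ReLU` and `nn.Dropout` contribute none. The sketch below is plain arithmetic under that assumption; `count_mlp_parameters` is a hypothetical helper, and the example values are the vocabulary size from the shape assertions and the hyperparameters used later in the notebook.

```python
def count_mlp_parameters(vocab_size, hidden_dim, n_sentiment, n_hiddens=1):
    """Count weights and biases of the Linear layers in the MLP."""
    # Input layer: vocab_size -> hidden_dim
    total = vocab_size * hidden_dim + hidden_dim
    # Additional hidden layers: hidden_dim -> hidden_dim
    total += (n_hiddens - 1) * (hidden_dim * hidden_dim + hidden_dim)
    # Output layer: hidden_dim -> n_sentiment
    total += hidden_dim * n_sentiment + n_sentiment
    return total


n_params = count_mlp_parameters(29_318, 128, 3, n_hiddens=2)
print(n_params)  # 3769731
```

Almost all of these parameters sit in the first layer, since the bag-of-words input is far wider than the hidden layers.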

## **Hyperparameters**

In [7]:
class _FrozenMeta(type):
    """Metaclass that rejects assignment to class attributes after class creation."""
    def __setattr__(cls, name, value):
        raise AttributeError("Cannot modify Hyperparameters")


class Hyperparameters(metaclass=_FrozenMeta):
    """
    Hyperparameters for the multi-layer perceptron (MLP) used for sentiment analysis.

    Remark (Zhean). I defined this as a class to enforce IMMUTABILITY of the hyperparameters, so
    that an accidental reassignment anywhere in the code raises an error.

    Note: an instance-level `__setattr__` alone does not stop `Hyperparameters.N_EPOCHS = 5`,
    so the metaclass above blocks class-level assignment as well.

    # Hyperparameters
    * N_EPOCHS: The number of training epochs.
    * N_HIDDENS: The number of hidden layers in the MLP.
    * N_SNEAKY_NEURONS: The number of neurons in each hidden layer.

    # Usage
    ```
    print(Hyperparameters.N_EPOCHS)          # 10
    print(Hyperparameters.N_HIDDENS)         # 2
    print(Hyperparameters.N_SNEAKY_NEURONS)  # 128
    ```
    """
    N_EPOCHS = 10
    N_HIDDENS = 2
    N_SNEAKY_NEURONS = 128

    # Block assignments on instances
    def __setattr__(self, name, value):
        raise AttributeError("Cannot modify Hyperparameters")

## **Initializing the MLP**

In [8]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

vocab_size = X_train.shape[1]
n_sentiment = 3

model = MyLittlePony(
    vocab_size  = vocab_size,
    hidden_dim  = Hyperparameters.N_SNEAKY_NEURONS,
    n_sentiment = n_sentiment,
    n_hiddens   = Hyperparameters.N_HIDDENS
)

model.to(device)

print("MLP initialized")

Using device: cuda
MLP initialized


# **4. Training the Model**

## **Training Setup**

In [9]:
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
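`nn.CrossEntropyLoss` takes raw logits and integer class indices; it applies log-softmax internally, which is why the model has no softmax layer. For a single sample, the loss is the negative log-softmax of the target class. The sketch below computes that by hand in plain Python as an illustration of the formula, not as a replacement for the PyTorch implementation; the logits are made up.

```python
import math


def cross_entropy(logits, target):
    """Negative log-softmax of the target class for a single sample."""
    # Subtract the maximum logit before exponentiating, for numerical stability
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[target]


# Equal logits over 3 classes: the model is maximally unsure
uniform = cross_entropy([0.0, 0.0, 0.0], target=1)
print(round(uniform, 4))  # 1.0986, i.e. ln(3)

# A large logit on the correct class drives the loss toward zero
confident = cross_entropy([0.0, 0.0, 5.0], target=2)
print(confident < uniform)  # True
```

The value ln(3) ≈ 1.0986 is a useful reference point for three classes: a loss near it means the model is doing no better than guessing uniformly.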

## **Training Proper**

In [10]:
for epoch in range(Hyperparameters.N_EPOCHS):
    model.train()
    running_loss = 0.0

    for x_batch, y_batch in tqdm(train_loader, desc=f"Epoch {epoch+1}/{Hyperparameters.N_EPOCHS}"):
        x_batch = x_batch.to(device, non_blocking=True)
        y_batch = y_batch.to(device, non_blocking=True)

        optimizer.zero_grad()
        logits = model(x_batch)
        loss = criterion(logits, y_batch)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

    avg_loss = running_loss / len(train_loader)
    print(f"Epoch {epoch+1}/{Hyperparameters.N_EPOCHS} | Loss: {avg_loss:.4f}")

Epoch 1/10 | Loss: 0.5407
Epoch 2/10 | Loss: 0.3249
Epoch 3/10 | Loss: 0.1996
Epoch 4/10 | Loss: 0.1168
Epoch 5/10 | Loss: 0.0761
Epoch 6/10 | Loss: 0.0539
Epoch 7/10 | Loss: 0.0416
Epoch 8/10 | Loss: 0.0339
Epoch 9/10 | Loss: 0.0288
Epoch 10/10 | Loss: 0.0253
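One detail of the reported numbers: `running_loss / len(train_loader)` averages the per-batch mean losses, so a smaller final batch counts as much as a full one. With 113,960 training samples and a batch size of 32, the last batch holds only 8 samples, so the effect here should be tiny, but a sample-weighted average is the exact per-sample mean. The sketch below contrasts the two with made-up losses; both helper names are hypothetical.

```python
def batch_mean(batch_losses):
    """Average of per-batch mean losses (what the training loop reports)."""
    return sum(batch_losses) / len(batch_losses)


def sample_weighted_mean(batch_losses, batch_sizes):
    """Average loss per sample, weighting each batch by its size."""
    weighted_sum = sum(loss * size for loss, size in zip(batch_losses, batch_sizes))
    return weighted_sum / sum(batch_sizes)


# Made-up values: two full batches and one short final batch
batch_losses = [0.50, 0.40, 1.00]
batch_sizes = [32, 32, 8]

print(round(batch_mean(batch_losses), 4))                         # 0.6333
print(round(sample_weighted_mean(batch_losses, batch_sizes), 4))  # 0.5111
```

In the training loop, the weighted variant would correspond to accumulating `loss.item() * y_batch.size(0)` and dividing by `len(train_loader.dataset)`.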


# **5. Evaluation**

In [11]:
def evaluate(model, data_loader, device):
    """
    Evaluate the model on a validation/test dataset.

    # Parameters
    * model: PyTorch model
    * data_loader: DataLoader for validation/test set
    * device: torch.device ('cpu' or 'cuda')

    # Returns
    * accuracy: float, proportion of correct predictions
    """
    model.eval() 
    correct, total = 0, 0

    with torch.no_grad():
        for x_batch, y_batch in tqdm(data_loader, desc="Validation"):
            x_batch = x_batch.to(device, non_blocking=True)
            y_batch = y_batch.to(device, non_blocking=True)

            logits = model(x_batch)
            
            prediction = logits.argmax(dim=1)

            correct += (prediction == y_batch).sum().item()
            total += y_batch.size(0)

    return correct / total

val_accuracy = evaluate(model, test_loader, device)
print(f"Validation Accuracy: {val_accuracy:.4f}")

Validation Accuracy: 0.8176
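A single accuracy figure does not show which classes are being confused with each other. One possible next step for error analysis is a confusion matrix built from the class indices that `evaluate` already compares. The sketch below is a minimal pure-Python version on made-up targets and predictions; `confusion_matrix` is a hypothetical helper written for this illustration, and the numbers are not results from the model.

```python
def confusion_matrix(targets, predictions, n_classes=3):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for target, prediction in zip(targets, predictions):
        matrix[target][prediction] += 1
    return matrix


# Made-up class indices (0, 1, 2 as produced by label_map), for illustration only
targets = [0, 0, 1, 1, 2, 2]
predictions = [0, 1, 1, 1, 2, 0]

matrix = confusion_matrix(targets, predictions)
accuracy = sum(matrix[i][i] for i in range(3)) / len(targets)
per_class_recall = [matrix[i][i] / sum(matrix[i]) for i in range(3)]

print(matrix)              # [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
print(round(accuracy, 4))  # 0.6667
print(per_class_recall)    # [0.5, 1.0, 0.5]
```

To apply this to the real model, `evaluate` would need to collect `prediction` and `y_batch` across batches (moved to the CPU and converted to lists) instead of only counting matches.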
