# Twitter Sentiment Analysis

<div style="display:flex; align-items:center; padding: 50px;">
<p style="margin-right:10px;">
    <img height="200px" style="width:auto;" width="200px" src="https://avatars.githubusercontent.com/u/192148546?s=400&u=95d76fbb02e6c09671d87c9155f17ca1e4ef8f21&v=4"> 
</p>
<p style="margin-right:10px;">
    <img height="200px" style="width:auto;" width="200px" src="https://img.utdstc.com/icon/716/88f/71688f4905ece2a5ee744eaf351ec21bd51491e02025ea4a68501cc93d847e5c:200"> 
</p>
</div>

## Description
This project fine-tunes a pretrained BERT model (`bert-base-uncased`) to classify the sentiment of tweets as Positive, Negative, or Neutral. The pipeline reads a CSV dataset of Twitter posts, fills in missing text, encodes the sentiment labels as integers, tokenizes the tweets, and then trains and evaluates the classifier. The code also includes helper functions for installing dependencies and loading environment variables, and it saves the trained model weights to disk.

## Step 1: Install Required Packages

- The function `install_requirements` installs the packages listed in `requirements.txt`.

- If the installation fails, it retries up to `max_retries` (three) times before exiting with a non-zero status.

In [1]:
import os

requirements_installed = False
max_retries = 3
retries = 0


def install_requirements():
    """Installs the requirements from requirements.txt file"""
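    # Both names are reassigned below, so both must be declared global;
    # otherwise `retries += 1` raises UnboundLocalError.
    global requirements_installed, retries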
    if requirements_installed:
        print("Requirements already installed.")
        return

    print("Installing requirements...")
    install_status = os.system("pip install -r requirements.txt")
    if install_status == 0:
        print("Requirements installed successfully.")
        requirements_installed = True
    else:
        print("Failed to install requirements.")
        if retries < max_retries:
            print("Retrying...")
            retries += 1
            return install_requirements()
        exit(1)
    return

install_requirements()
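
The same retry policy can also be written as a loop, which avoids recursion and module-level counters. The following is a minimal sketch, not the notebook's implementation: `run_with_retries` and `PIP_INSTALL` are hypothetical names, and the sketch assumes pip is invoked through `sys.executable -m pip` so that packages are installed for the interpreter running the notebook.

```python
import subprocess
import sys


def run_with_retries(command, max_retries=3):
    """Run `command` until it succeeds or the retry budget is exhausted.

    Returns the number of attempts made on success and raises RuntimeError
    otherwise. (Hypothetical helper, shown for illustration only.)
    """
    attempts = max_retries + 1  # one initial attempt plus `max_retries` retries
    for attempt in range(1, attempts + 1):
        result = subprocess.run(command)
        if result.returncode == 0:
            return attempt
        print(f"Attempt {attempt} of {attempts} failed.")
    raise RuntimeError(f"Command failed after {attempts} attempts: {command}")


# Assumed pip invocation mirroring the cell above. It is stored as data here,
# so nothing is installed until run_with_retries(PIP_INSTALL) is called.
PIP_INSTALL = [sys.executable, "-m", "pip", "install", "-r", "requirements.txt"]
```

Calling `run_with_retries(PIP_INSTALL)` would then perform the installation, and raising an exception instead of calling `exit(1)` leaves the decision about how to fail to the caller.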

## Step 2: Load Environment Variables

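- Loads environment variables from a `.env` file using `dotenv` and checks that every variable named in `variables_to_check` is set. The list is empty by default, so no variable is validated until you add the names your setup requires.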

In [3]:
from dotenv import load_dotenv
import os


def setup_env():
    """Sets up the environment variables"""

    def check_env(env_var):
        value = os.getenv(env_var)
        if value is None:
            print(f"Please set the {env_var} environment variable.")
            exit(1)
        else:
            print(f"{env_var} is set.")

    load_dotenv()

    variables_to_check = []

    for var in variables_to_check:
        check_env(var)

setup_env()
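
A variant of this check can collect every missing name before failing, so that all problems are reported at once. This is a minimal sketch under stated assumptions: `find_missing_env_vars` is a hypothetical helper, the variable names are invented for illustration, and unlike `check_env` above it also treats an empty string as missing.

```python
import os


def find_missing_env_vars(names, environ=None):
    """Return the names in `names` that are unset or empty in `environ`."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


# Hypothetical variable names and values, used only to illustrate the call.
example_env = {"HF_TOKEN": "abc123", "DATA_DIR": ""}
missing = find_missing_env_vars(["HF_TOKEN", "DATA_DIR", "WANDB_API_KEY"], example_env)
print(missing)  # ['DATA_DIR', 'WANDB_API_KEY']
```

Passing the environment as a parameter keeps the function easy to test without modifying the real process environment.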

## Step 3: Read Dataset

- Reads a CSV file containing Twitter sentiment analysis data into a Pandas DataFrame.

In [5]:
import pandas as pd


def read_dataset():
    """Reads the dataset"""
    dataset = pd.read_csv("data/twitter_sentiment_analysis/twitter.csv")
    return dataset


dataset = read_dataset()

dataset.head()
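
Later steps read the `content` and `sentiment` columns, so it can help to confirm that they exist and to look at the label distribution before training. The following is a small sketch using only the standard library; `summarize_labels` is a hypothetical helper, and the sample rows are invented for illustration and do not come from the dataset.

```python
import csv
import io
from collections import Counter

REQUIRED_COLUMNS = {"content", "sentiment"}  # column names used in later steps


def summarize_labels(csv_file):
    """Check that the required columns exist and count rows per sentiment label."""
    reader = csv.DictReader(csv_file)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")
    return Counter(row["sentiment"] for row in reader)


# Invented sample rows standing in for twitter.csv.
sample = io.StringIO(
    "content,sentiment\n"
    "great game,Positive\n"
    "so many bugs,Negative\n"
    "love it,Positive\n"
)
label_counts = summarize_labels(sample)
print(label_counts)  # Counter({'Positive': 2, 'Negative': 1})
```

With the DataFrame loaded above, `dataset["sentiment"].value_counts()` gives the same kind of summary.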

## Step 4: Prepare Training and Testing Data

This step defines `get_train_test_data`, a general-purpose preparation function for tabular data. It reads the dataset, encodes the sentiment labels as integers, splits the rows into training (80%) and testing (20%) sets, standardizes numeric columns, one-hot encodes categorical columns, and converts the results to PyTorch tensors. The function is only defined here: the BERT steps that follow do not call it and instead tokenize the raw tweet text.

In [9]:
import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer

def get_train_test_data():
    """Prepare the training and testing data."""
    dataset = read_dataset()
    target_column = "sentiment"
    X = dataset.drop(columns=[target_column])
    y = dataset[target_column]
    
    # Encode target labels into numerical values
    label_encoder = LabelEncoder()
    y = label_encoder.fit_transform(y)
    
    # Separate columns into categorical and numeric
    categorical_cols = X.select_dtypes(include=["object"]).columns
    numeric_cols = X.select_dtypes(include=["number"]).columns
    
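    # Preprocessing the data (scaling numeric columns and one-hot encoding categorical columns).
    # handle_unknown="ignore" keeps transform() from failing on categories that appear only in the test split.
    preprocessor = ColumnTransformer(
        transformers=[("num", StandardScaler(), numeric_cols),
                      ("cat", OneHotEncoder(sparse_output=False, handle_unknown="ignore"), categorical_cols)])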
    
    # Split the dataset into training and testing data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Apply preprocessing and convert to PyTorch tensors
    X_train = torch.tensor(preprocessor.fit_transform(X_train), dtype=torch.float32)
    X_test = torch.tensor(preprocessor.transform(X_test), dtype=torch.float32)
    y_train = torch.tensor(y_train, dtype=torch.float32).reshape(-1, 1)
    y_test = torch.tensor(y_test, dtype=torch.float32).reshape(-1, 1)
    
    return X_train, X_test, y_train, y_test

dataset_path = "data/twitter_sentiment_analysis/twitter.csv"
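
The scaling in this step fits its statistics on the training split only and reuses them for the test split, which is what calling `fit_transform` on `X_train` and `transform` on `X_test` does. The arithmetic can be sketched in plain Python with invented numbers. `StandardScaler` divides by the population standard deviation, which `statistics.pstdev` matches; the `standardize` helper is hypothetical.

```python
import statistics

train_values = [2.0, 4.0, 6.0]  # invented numeric feature, training split
test_values = [4.0, 10.0]       # invented numeric feature, test split

# Fit: the statistics come from the training split only.
mean = statistics.fmean(train_values)  # 4.0
std = statistics.pstdev(train_values)  # sqrt(8/3), about 1.633


def standardize(values):
    """Apply z = (x - mean) / std using the training statistics."""
    return [(x - mean) / std for x in values]


train_scaled = standardize(train_values)
test_scaled = standardize(test_values)
print([round(z, 3) for z in train_scaled])  # [-1.225, 0.0, 1.225]
print([round(z, 3) for z in test_scaled])   # [0.0, 3.674]
```

The test value 10.0 lands well outside the training range after scaling, which is expected: the test split is never allowed to influence the mean or standard deviation.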


## Step 5: Preprocessing and Dataset Class for Sentiment Analysis using BERT

This step prepares the dataset for a BERT-based sentiment analysis model. It replaces missing tweet text with empty strings, maps sentiment labels to integers (Negative = 0, Positive = 1, Neutral = 2, with any other value treated as Neutral), and defines a custom `TwitterDataset` class for PyTorch.

The class tokenizes each tweet with a BERT tokenizer, pads or truncates it to `max_length`, and returns the token IDs, attention mask, and sentiment label as tensors ready for training.

In [None]:
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from torch import nn
from transformers import BertTokenizer, BertModel

# Load your dataset
dataset = pd.read_csv(dataset_path)

# Ensure the 'content' column has only strings and handle missing values
dataset["content"] = dataset["content"].fillna("").astype(str)

# Preprocessing sentiment to numerical labels
def sentiment_to_label(sentiment):
    return {"Positive": 1, "Negative": 0, "Neutral": 2}.get(
        sentiment, 2
    )  # Default to Neutral if not found

dataset["sentiment_label"] = dataset["sentiment"].apply(sentiment_to_label)

# Dataset Class
class TwitterDataset(Dataset):
    def __init__(self, data, tokenizer, max_length):
        self.data = data
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        tweet = self.data.iloc[index]["content"]
        sentiment = self.data.iloc[index]["sentiment_label"]

        encoding = self.tokenizer(
            tweet,
            max_length=self.max_length,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )

        return {
            "input_ids": encoding["input_ids"].squeeze(0),
            "attention_mask": encoding["attention_mask"].squeeze(0),
            "label": torch.tensor(sentiment, dtype=torch.long),
        }
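
To make the roles of `max_length`, `padding="max_length"`, and `truncation=True` concrete, here is a toy sketch in plain Python. It splits on whitespace and uses a made-up vocabulary, so it is not BERT's WordPiece tokenizer and it omits special tokens such as `[CLS]` and `[SEP]`. It only illustrates how every example ends up with the same length and how the attention mask marks real tokens. All names and IDs in it are assumptions.

```python
PAD_ID = 0  # assumed padding id for this toy vocabulary
UNK_ID = 1  # assumed id for words outside the toy vocabulary
TOY_VOCAB = {"i": 2, "love": 3, "this": 4, "game": 5}


def toy_encode(text, max_length):
    """Map words to ids, then truncate or pad to exactly `max_length`."""
    ids = [TOY_VOCAB.get(word, UNK_ID) for word in text.lower().split()]
    ids = ids[:max_length]                      # truncation
    mask = [1] * len(ids)                       # 1 marks a real token
    padding = max_length - len(ids)
    return {
        "input_ids": ids + [PAD_ID] * padding,  # padding up to max_length
        "attention_mask": mask + [0] * padding,
    }


short = toy_encode("I love this", max_length=5)
long = toy_encode("I love this game so much", max_length=5)
print(short)  # {'input_ids': [2, 3, 4, 0, 0], 'attention_mask': [1, 1, 1, 0, 0]}
print(long)   # {'input_ids': [2, 3, 4, 5, 1], 'attention_mask': [1, 1, 1, 1, 1]}
```

Because every item has the same length, the `DataLoader` can stack items into a batch tensor without a custom collate function.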


## Step 6: Model and Training Setup

This step sets up the model and training configuration for fine-tuning BERT on sentiment classification. It defines the hyperparameters (batch size 16, maximum sequence length 128, 3 epochs, learning rate 2e-5), loads the Hugging Face `BertTokenizer`, shuffles the dataset and splits it 80/20 into training and validation sets, and defines a custom `SentimentClassifier` model: a BERT encoder followed by dropout and a linear output layer that produces one logit per sentiment class. Training uses cross-entropy loss and the AdamW optimizer.


In [None]:
# Hyperparameters
BATCH_SIZE = 16
MAX_LENGTH = 128
EPOCHS = 3
LEARNING_RATE = 2e-5

# Tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Datasets and DataLoaders
dataset = dataset.sample(frac=1, random_state=42).reset_index(drop=True)  # Shuffle dataset (seeded so the split is reproducible)
train_size = int(0.8 * len(dataset))
train_data = dataset[:train_size]
val_data = dataset[train_size:]

train_dataset = TwitterDataset(train_data, tokenizer, MAX_LENGTH)
val_dataset = TwitterDataset(val_data, tokenizer, MAX_LENGTH)

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)

# Model
class SentimentClassifier(nn.Module):
    def __init__(self, n_classes):
        super(SentimentClassifier, self).__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.drop = nn.Dropout(p=0.3)
        self.out = nn.Linear(self.bert.config.hidden_size, n_classes)

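    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # Pooled [CLS] representation (the same tensor as outputs[1])
        pooled_output = outputs.pooler_output
        output = self.drop(pooled_output)
        # Raw logits; CrossEntropyLoss applies the softmax internally
        return self.out(output)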

model = SentimentClassifier(n_classes=3)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Loss and Optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)
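
`nn.CrossEntropyLoss` takes the raw logits from the linear layer together with the integer class index, applies a softmax internally, and returns the negative log-probability of the true class (averaged over the batch by default). The arithmetic for a single example can be sketched in plain Python. The logits are invented, the helper functions are hypothetical, and the class order follows the label mapping from Step 5 (0 = Negative, 1 = Positive, 2 = Neutral).

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[label])


logits = [0.5, 2.0, -1.0]  # invented scores for Negative, Positive, Neutral
probs = softmax(logits)
# Index of the largest probability, as torch.max(outputs, dim=1) does per row
prediction = max(range(len(probs)), key=probs.__getitem__)

print([round(p, 3) for p in probs])              # [0.175, 0.786, 0.039]
print(prediction)                                # 1, i.e. Positive
print(round(cross_entropy(logits, label=1), 3))  # 0.241
```

A confident correct prediction gives a loss near zero, while assigning low probability to the true class makes the loss grow quickly.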


## Step 7: Train and Evaluate the Model

This step involves training and evaluating the sentiment analysis model. It defines two key functions:

- `train_epoch`: Trains the model for one epoch, computes the loss and accuracy, and updates the model's weights.

- `eval_model`: Evaluates the model on the validation set, calculating loss and accuracy to assess performance.

The **training loop** runs for `EPOCHS` epochs, training and then evaluating the model in each one and printing the loss and accuracy for both splits. After the final epoch, the model's weights (`state_dict`) are saved to `sentiment_model.pth` for future use.


In [None]:
# Training Function
def train_epoch(model, data_loader, criterion, optimizer, device):
    print("Training...")
    model.train()
    total_loss = 0
    correct_predictions = 0

    for batch in data_loader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = criterion(outputs, labels)
        total_loss += loss.item()

        # Count correct predictions as a plain Python int
        _, preds = torch.max(outputs, dim=1)
        correct_predictions += torch.sum(preds == labels).item()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Accuracy over all examples, mean loss over batches
    return correct_predictions / len(data_loader.dataset), total_loss / len(data_loader)

# Evaluation Function
def eval_model(model, data_loader, criterion, device):
    model.eval()
    total_loss = 0
    correct_predictions = 0

    with torch.no_grad():
        for batch in data_loader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["label"].to(device)

            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            loss = criterion(outputs, labels)
            total_loss += loss.item()

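            # Count correct predictions as a plain Python int
            _, preds = torch.max(outputs, dim=1)
            correct_predictions += torch.sum(preds == labels).item()

    # Accuracy over all examples, mean loss over batches
    return correct_predictions / len(data_loader.dataset), total_loss / len(data_loader)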

# Training Loop
for epoch in range(EPOCHS):
    train_acc, train_loss = train_epoch(model, train_loader, criterion, optimizer, device)
    val_acc, val_loss = eval_model(model, val_loader, criterion, device)

    print(f"Epoch {epoch + 1}/{EPOCHS}")
    print(f"Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.4f}")
    print(f"Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.4f}")

# Save Model
torch.save(model.state_dict(), "sentiment_model.pth")

print("Model training complete!")
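
Both functions divide the count of correct predictions by the number of examples (`len(data_loader.dataset)`) but divide the summed loss by the number of batches (`len(data_loader)`). Since `nn.CrossEntropyLoss` already averages within each batch, the reported loss is a mean of batch means, which differs slightly from the per-example mean when the last batch is smaller. A plain-Python sketch with invented per-batch numbers:

```python
# Each tuple: (batch_size, mean_loss_for_batch, correct_in_batch); values invented.
batches = [(16, 0.90, 10), (16, 0.70, 12), (8, 0.40, 7)]

num_examples = sum(size for size, _, _ in batches)
num_batches = len(batches)

# Accuracy: total correct predictions over total examples
accuracy = sum(correct for _, _, correct in batches) / num_examples
# Loss as computed in train_epoch and eval_model: mean of the batch means
mean_of_batch_means = sum(loss for _, loss, _ in batches) / num_batches
# Alternative: weight each batch mean by its size to get the per-example mean
per_example_mean = sum(size * loss for size, loss, _ in batches) / num_examples

print(f"accuracy: {accuracy:.4f}")                        # accuracy: 0.7250
print(f"mean of batch means: {mean_of_batch_means:.4f}")  # mean of batch means: 0.6667
print(f"per-example mean: {per_example_mean:.4f}")        # per-example mean: 0.7200
```

The gap is usually small with a batch size of 16, but it is worth knowing about when comparing loss values across runs that use different batch sizes.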
    

## Conclusion
This sentiment analysis application uses a deep learning model based on BERT (Bidirectional Encoder Representations from Transformers) to classify tweets into three sentiment categories: Positive, Negative, and Neutral. The app follows a systematic approach to preprocess the data, tokenize the text using BERT's tokenizer, and train a model to predict sentiment.

Key highlights of the app include:

- **Data Preprocessing**: It replaces missing tweet text with empty strings and encodes sentiment labels as integers for training.

- **Model Architecture**: The app fine-tunes a pre-trained BERT model for the sentiment classification task. A dropout layer is applied to reduce overfitting, and a fully connected layer produces the class logits.

- **Training and Evaluation**: The app includes functions for training and evaluating the model across multiple epochs, monitoring the loss and accuracy on both the training and validation datasets.

- **Performance Monitoring**: The training loop prints the loss and accuracy after each epoch, allowing for easy tracking of model performance.

- **Model Saving**: After training, the model's weights are saved to `sentiment_model.pth`, so the model can be reloaded for predictions without retraining.

Overall, this application provides a complete fine-tuning pipeline for sentiment analysis of tweets built on BERT. It can be extended to larger datasets or integrated into a web application for real-time tweet sentiment classification.



