# Twitter Sentiment Analysis

<div style="display:flex; align-items:center; padding: 50px;">
<p style="margin-right:10px;">
    <img height="200px" style="width:auto;" width="200px" src="https://avatars.githubusercontent.com/u/192148546?s=400&u=95d76fbb02e6c09671d87c9155f17ca1e4ef8f21&v=4"> 
</p>
<p style="margin-right:10px;">
    <img height="200px" style="width:auto;" width="200px" src="https://img.utdstc.com/icon/716/88f/71688f4905ece2a5ee744eaf351ec21bd51491e02025ea4a68501cc93d847e5c:200"> 
</p>
</div>

## Description:
This project implements a sentiment analysis model for Twitter data using the BERT transformer model. The pipeline includes data preprocessing, tokenization, model training, and evaluation. It reads a dataset of Twitter posts, performs necessary preprocessing (such as encoding sentiments), and then trains a sentiment classification model using BERT. The model is capable of classifying sentiments into three categories: Positive, Negative, and Neutral. The code also includes functions for installing dependencies, setting up environment variables, and saving the trained model.

## Step 1: Install Required Packages

- The function `install_requirements` installs the necessary packages from the `requirements.txt` file.

- If the installation fails, it retries up to `max_retries` (3) times before exiting.

In [1]:
import os
import sys

requirements_installed = False
max_retries = 3
retries = 0


def install_requirements():
    """Installs the requirements from requirements.txt file"""
    # Both names are reassigned below, so both must be declared global
    global requirements_installed, retries
    if requirements_installed:
        print("Requirements already installed.")
        return

    print("Installing requirements...")
    install_status = os.system("pip install -r requirements.txt")
    if install_status == 0:
        print("Requirements installed successfully.")
        requirements_installed = True
    else:
        print("Failed to install requirements.")
        if retries < max_retries:
            print("Retrying...")
            retries += 1
            return install_requirements()
        # Retry budget exhausted: stop with a non-zero exit code
        sys.exit(1)
    return

install_requirements()
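
The same retry behaviour can also be written as a loop, which avoids recursion and module-level counters. The following is a minimal sketch rather than the notebook's own code: `run_with_retries` and `pip_command` are hypothetical names, and the sketch assumes the command is given as an argument list for `subprocess.run`.

```python
import subprocess
import sys


def run_with_retries(command, max_retries=3):
    """Run `command` until it succeeds or the retry budget is used up.

    Returns True on success and False if every attempt failed.
    (Hypothetical helper for illustration.)
    """
    # One initial attempt plus up to `max_retries` retries
    for attempt in range(1, max_retries + 2):
        result = subprocess.run(command)
        if result.returncode == 0:
            return True
        print(f"Attempt {attempt} failed with exit code {result.returncode}.")
    return False


# Example command: install from requirements.txt with the current interpreter's pip
pip_command = [sys.executable, "-m", "pip", "install", "-r", "requirements.txt"]
```

Calling `run_with_retries(pip_command)` would then perform the installation. Invoking pip through `sys.executable -m pip` installs into the environment of the running interpreter.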

## Step 2: Load Environment Variables

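- Loads environment variables from a `.env` file using `dotenv` and checks that every variable listed in `variables_to_check` is set. The list is empty by default, so no variable is required until you add names to it.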

In [3]:
from dotenv import load_dotenv
import os


def setup_env():
    """Sets up the environment variables"""

    def check_env(env_var):
        value = os.getenv(env_var)
        if value is None:
            print(f"Please set the {env_var} environment variable.")
            exit(1)
        else:
            print(f"{env_var} is set.")

    load_dotenv()

    # Add the names of required environment variables here (none by default)
    variables_to_check = []

    for var in variables_to_check:
        check_env(var)

setup_env()
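
To show what the check does once the list is populated, here is a small sketch that collects every missing variable instead of exiting on the first one. It uses only the standard library; `find_missing_env_vars`, `EXAMPLE_API_KEY`, and `EXAMPLE_UNSET_VARIABLE` are placeholder names, not names the project requires.

```python
import os


def find_missing_env_vars(required_names):
    """Return the names in `required_names` that are not set in the environment."""
    return [name for name in required_names if os.getenv(name) is None]


# Placeholder variables for the demonstration (assumptions, not project settings)
os.environ["EXAMPLE_API_KEY"] = "dummy-value"
os.environ.pop("EXAMPLE_UNSET_VARIABLE", None)

missing = find_missing_env_vars(["EXAMPLE_API_KEY", "EXAMPLE_UNSET_VARIABLE"])
print(missing)
```

Reporting all missing names at once saves a round trip for each variable when a `.env` file is incomplete.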

## Step 3: Read Dataset

- Reads a CSV file containing Twitter sentiment analysis data into a Pandas DataFrame.

In [5]:
import pandas as pd


def read_dataset():
    """Reads the dataset"""
    dataset = pd.read_csv("data/twitter_sentiment_analysis/twitter.csv")
    return dataset


dataset = read_dataset()

dataset.head()
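
Later steps rely on the `content` and `sentiment` columns, so it can help to check for them right after loading. The sketch below assumes those two column names (taken from the later cells) and uses a tiny in-memory DataFrame with made-up rows in place of the CSV file; `check_required_columns` is a hypothetical helper.

```python
import pandas as pd


def check_required_columns(frame, required=("content", "sentiment")):
    """Raise a ValueError if a required column is absent; otherwise return the frame."""
    missing = [column for column in required if column not in frame.columns]
    if missing:
        raise ValueError(f"Dataset is missing required columns: {missing}")
    return frame


# Tiny stand-in for the CSV file (made-up rows, for illustration only)
sample = pd.DataFrame(
    {"content": ["great game", "terrible update"], "sentiment": ["Positive", "Negative"]}
)
check_required_columns(sample)

# Class balance is worth a look before training
label_counts = sample["sentiment"].value_counts()
print(label_counts)
```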

## Step 4: Prepare Training and Testing Data

This code snippet is responsible for preparing the dataset by processing it into the right format for model training. Here’s a breakdown of what each step does:

- Reading the Dataset:

    - First, it loads the dataset (a CSV file containing Twitter data) using the read_dataset() function.

- Data Splitting:

    - The code separates the features (X) from the target variable (y). In this case, the target variable is the sentiment of the tweet, which is labeled as Positive, Negative, or Neutral.
    - It then splits the dataset into training and testing sets, using train_test_split(). 80% of the data is used for training, and 20% is held out for testing.

- Label Encoding:

    - The target sentiment column (y) is transformed into integer codes using LabelEncoder, because the model and loss function operate on numbers rather than strings.

    - LabelEncoder assigns codes in sorted order of the class names, so with the labels Positive, Negative, and Neutral the mapping is Negative = 0, Neutral = 1, Positive = 2. This differs from the manual mapping used in Step 5.

- Preprocessing the Features:

    - The dataset contains both numeric and categorical features. The numeric columns are scaled using StandardScaler to ensure all values are on a similar scale, which helps the model learn more effectively.

    - The categorical columns are one-hot encoded using OneHotEncoder. This converts each unique category value into a new column with binary values.

    - The ColumnTransformer is used to apply these transformations to the respective columns in a single step.

- Converting to Tensors:

    - After preprocessing, the features (X_train and X_test) are converted into PyTorch tensors. This allows the data to be used in a PyTorch-based model for training and testing.
    - Similarly, the target variable (y_train and y_test) is converted into float tensors and reshaped into column vectors of shape (N, 1).

- Return Processed Data:

    - Finally, the function returns the processed training and testing datasets: X_train, X_test, y_train, and y_test, which can now be used to train and evaluate the sentiment analysis model.
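
The label-encoding step can be checked in isolation. This short sketch uses a made-up list of labels and shows how to read the mapping back from the fitted encoder rather than assuming it.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels for illustration
labels = ["Positive", "Negative", "Neutral", "Positive"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the class names in sorted order; the position is the integer code
mapping = {name: code for code, name in enumerate(encoder.classes_)}
print(mapping)  # {'Negative': 0, 'Neutral': 1, 'Positive': 2}

# inverse_transform converts integer codes back to the original strings
decoded = encoder.inverse_transform(encoded)
```

Keeping the fitted encoder around makes it possible to turn model predictions back into readable sentiment names.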

In [9]:
import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer

def get_train_test_data():
    """Prepare the training and testing data."""
    dataset = read_dataset()
    target_column = "sentiment"
    X = dataset.drop(columns=[target_column])
    y = dataset[target_column]
    
    # Encode target labels into numerical values
    label_encoder = LabelEncoder()
    y = label_encoder.fit_transform(y)
    
    # Separate columns into categorical and numeric
    categorical_cols = X.select_dtypes(include=["object"]).columns
    numeric_cols = X.select_dtypes(include=["number"]).columns
    
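    # Preprocessing the data (Scaling numeric columns and One-Hot Encoding categorical columns)
    # handle_unknown="ignore" encodes categories that appear only in the test split as
    # all-zero rows instead of raising an error in transform()
    preprocessor = ColumnTransformer(
        transformers=[("num", StandardScaler(), numeric_cols),
                      ("cat", OneHotEncoder(sparse_output=False, handle_unknown="ignore"), categorical_cols)])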
    
    # Split the dataset into training and testing data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Apply preprocessing and convert to PyTorch tensors
    X_train = torch.tensor(preprocessor.fit_transform(X_train), dtype=torch.float32)
    X_test = torch.tensor(preprocessor.transform(X_test), dtype=torch.float32)
    y_train = torch.tensor(y_train, dtype=torch.float32).reshape(-1, 1)
    y_test = torch.tensor(y_test, dtype=torch.float32).reshape(-1, 1)
    
    return X_train, X_test, y_train, y_test

dataset_path = "data/twitter_sentiment_analysis/twitter.csv"


## Step 5: Preprocessing and Dataset Class for Sentiment Analysis using BERT

This code focuses on preparing the dataset for a BERT-based sentiment analysis model. The main tasks include:

1. **Data Preprocessing**:

   - **Handling Missing Values**: The `content` column in the dataset (which contains the text of tweets) may have missing values. The missing values are filled with empty strings using the `fillna()` method, ensuring there are no NaN values in the text data.

   - **Converting Sentiment Labels to Numerical Values**: The sentiment column in the dataset (`Positive`, `Negative`, `Neutral`) is converted to numerical labels using the function `sentiment_to_label`, because the loss function expects integer class indices. Each sentiment is mapped to an integer (`Positive` → 1, `Negative` → 0, `Neutral` → 2), and any value outside these three falls back to 2 (`Neutral`).

2. **Dataset Class**:
   - The `TwitterDataset` class inherits from PyTorch's `Dataset` class, which is essential for working with data in PyTorch.

   - **Constructor (`__init__`)**:

     - Takes in the data, the tokenizer, and the maximum length of the tokenized sequences.

     - The tokenizer (in this case, a BERT tokenizer) will convert tweets into tokenized sequences, which will be input into the BERT model.

   - **`__len__` Method**:
     - Returns the number of examples in the dataset. This method is required by PyTorch’s `DataLoader` to determine how many samples are in the dataset.

   - **`__getitem__` Method**:
     - For each data sample (in this case, a tweet), this method retrieves the tweet text and its associated sentiment label.

     - The `tweet` is tokenized using the BERT tokenizer. The tokenization process includes padding the sequence to the specified `max_length` and truncating longer sequences. The result is a dictionary containing the `input_ids`, `attention_mask`, and the sentiment label.

     - The `input_ids` represent the tokenized form of the tweet, while the `attention_mask` tells the model which tokens are actual words and which are padding.

     - The `label` is the sentiment label (numerical value) associated with the tweet.
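
The relationship between `input_ids`, padding, truncation, and `attention_mask` can be illustrated without downloading a tokenizer. The following is a toy sketch in plain Python, not the BERT tokenizer itself: the token ids are made up, `pad_or_truncate` is a hypothetical helper, the padding id of 0 is an assumption, and details such as WordPiece splitting and the handling of `[CLS]`/`[SEP]` during truncation are ignored.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly `max_length` long."""
    kept = token_ids[:max_length]                   # truncate long sequences
    padding = [pad_id] * (max_length - len(kept))   # pad short sequences
    input_ids = kept + padding
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [1] * len(kept) + [0] * len(padding)
    return input_ids, attention_mask


# Made-up token ids standing in for tokenized tweets
short_ids, short_mask = pad_or_truncate([101, 2307, 2208, 102], max_length=6)
long_ids, long_mask = pad_or_truncate([101, 7, 8, 9, 10, 11, 12, 102], max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every sample ends up with the same length, the `DataLoader` can stack samples into rectangular batch tensors.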

In [None]:
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from torch import nn
from transformers import BertTokenizer, BertModel

# Load your dataset
dataset = pd.read_csv(dataset_path)

# Ensure the 'content' column has only strings and handle missing values
dataset["content"] = dataset["content"].fillna("").astype(str)

# Preprocessing sentiment to numerical labels
def sentiment_to_label(sentiment):
    return {"Positive": 1, "Negative": 0, "Neutral": 2}.get(
        sentiment, 2
    )  # Default to Neutral if not found

dataset["sentiment_label"] = dataset["sentiment"].apply(sentiment_to_label)

# Dataset Class
class TwitterDataset(Dataset):
    def __init__(self, data, tokenizer, max_length):
        self.data = data
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        tweet = self.data.iloc[index]["content"]
        sentiment = self.data.iloc[index]["sentiment_label"]

        encoding = self.tokenizer(
            tweet,
            max_length=self.max_length,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )

        return {
            "input_ids": encoding["input_ids"].squeeze(0),
            "attention_mask": encoding["attention_mask"].squeeze(0),
            "label": torch.tensor(sentiment, dtype=torch.long),
        }


## Step 6: Model and Training Setup

In this section, we define the hyperparameters, prepare the dataset using PyTorch's `DataLoader`, build the model, and set up the loss function and optimizer for training. This step is crucial to fine-tune a pre-trained BERT model for sentiment classification.

#### 1. **Hyperparameters**:

   - **BATCH_SIZE**: Defines the number of samples processed in one batch. Here, it is set to 16.

   - **MAX_LENGTH**: Specifies the maximum length of the input sequence. Any sequence longer than this is truncated, and shorter ones are padded. This is set to 128 tokens.

   - **EPOCHS**: Number of times the entire dataset is passed through the model. We are training for 3 epochs.

   - **LEARNING_RATE**: The learning rate for the optimizer. A smaller learning rate (2e-5) is chosen to fine-tune the pre-trained BERT model.

#### 2. **Tokenizer**:

   - We use the `BertTokenizer` from the Hugging Face `transformers` library, loaded from the pre-trained `bert-base-uncased` checkpoint, to convert the text into the token IDs and attention masks that BERT expects.

#### 3. **Dataset and DataLoaders**:

   - The dataset is shuffled (`frac=1`) and split into training and validation sets. 
   - **Training Set**: 80% of the data is used for training.

   - **Validation Set**: The remaining 20% is used for validation.

   - Both the training and validation datasets are wrapped in `DataLoader` objects, which handle batching. Only the training loader shuffles its samples (`shuffle=True`); the validation loader keeps a fixed order.
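
The shuffle-then-slice split can be tried on a small DataFrame. This sketch uses made-up rows; the `random_state` argument is an added assumption that makes the shuffle, and therefore the split, reproducible between runs.

```python
import pandas as pd

# Made-up rows standing in for the tweet dataset
frame = pd.DataFrame(
    {"content": [f"tweet {i}" for i in range(10)], "sentiment_label": [0, 1] * 5}
)

# frac=1 samples every row, which amounts to a shuffle
shuffled = frame.sample(frac=1, random_state=42).reset_index(drop=True)

# First 80% for training, remaining 20% for validation
train_size = int(0.8 * len(shuffled))
train_data = shuffled[:train_size]
val_data = shuffled[train_size:]

print(len(train_data), len(val_data))  # 8 2
```

A plain random split does not guarantee that each sentiment class appears in the same proportion in both parts; with strongly imbalanced data, a stratified split may be preferable.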

#### 4. **Model Architecture**:

   - **SentimentClassifier**:
     - A custom class that extends `nn.Module` and builds the model architecture.
     - **BERT Layer**: The `BertModel` from Hugging Face is used for feature extraction. It outputs contextual embeddings for each token.

     - **Dropout Layer**: A dropout layer with a rate of 0.3 is applied for regularization to avoid overfitting.

     - **Output Layer**: A linear layer is used to produce predictions corresponding to the number of sentiment classes (3 classes: Positive, Negative, and Neutral).
   
   - The `forward` method takes `input_ids` and `attention_mask` as inputs and passes them through the BERT model to obtain the pooled output, a single vector per tweet derived from the hidden state of the `[CLS]` token. This vector is passed through the dropout layer and then the output layer, which returns one raw score (logit) per sentiment class.
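
To make the tensor shapes concrete, the sketch below runs only the classification head (dropout plus linear layer) on random values that stand in for BERT's pooled output. It assumes the hidden size of 768 used by `bert-base-uncased`; no pre-trained weights are loaded.

```python
import torch
from torch import nn

HIDDEN_SIZE = 768  # hidden size of bert-base-uncased
N_CLASSES = 3
BATCH = 4

# Random values standing in for BERT's pooled output, shape (batch, hidden_size)
pooled_output = torch.randn(BATCH, HIDDEN_SIZE)

# Same head as in the classifier: dropout followed by a linear layer
head = nn.Sequential(nn.Dropout(p=0.3), nn.Linear(HIDDEN_SIZE, N_CLASSES))
head.eval()  # dropout is inactive in evaluation mode

logits = head(pooled_output)               # shape (batch, n_classes)
predictions = torch.argmax(logits, dim=1)  # shape (batch,), one class id per sample

print(logits.shape, predictions.shape)
```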

#### 5. **Loss Function and Optimizer**:

   - **Loss Function**: Cross-Entropy Loss is used for multi-class classification. This is appropriate since we have three sentiment classes.

   - **Optimizer**: AdamW (Adam with Weight Decay) is used as the optimizer, which is commonly used for training transformer models like BERT. The learning rate is set to 2e-5.
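
`nn.CrossEntropyLoss` expects raw logits and integer class indices, and applies the softmax internally. The sketch below, with made-up logits, compares it against the equivalent manual computation.

```python
import torch
from torch import nn

# Made-up logits for two samples and three classes
logits = torch.tensor([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
# Class indices must be integers (torch.long), one per sample
labels = torch.tensor([0, 2], dtype=torch.long)

criterion = nn.CrossEntropyLoss()
loss = criterion(logits, labels)

# Manual equivalent: mean negative log-softmax probability of the true class
log_probs = torch.log_softmax(logits, dim=1)
manual_loss = -log_probs[torch.arange(len(labels)), labels].mean()

print(loss.item(), manual_loss.item())
```

This is why the model's `forward` method returns logits directly, without a softmax layer, and why the dataset class stores labels as `torch.long`.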

In [None]:
# Hyperparameters
BATCH_SIZE = 16
MAX_LENGTH = 128
EPOCHS = 3
LEARNING_RATE = 2e-5

# Tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Datasets and DataLoaders
dataset = dataset.sample(frac=1, random_state=42).reset_index(drop=True)  # Shuffle dataset (fixed seed for a reproducible split)
train_size = int(0.8 * len(dataset))
train_data = dataset[:train_size]
val_data = dataset[train_size:]

train_dataset = TwitterDataset(train_data, tokenizer, MAX_LENGTH)
val_dataset = TwitterDataset(val_data, tokenizer, MAX_LENGTH)

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)

# Model
class SentimentClassifier(nn.Module):
    def __init__(self, n_classes):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.drop = nn.Dropout(p=0.3)
        self.out = nn.Linear(self.bert.config.hidden_size, n_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # Pooled [CLS] representation, shape (batch_size, hidden_size)
        pooled_output = outputs.pooler_output
        output = self.drop(pooled_output)
        # Raw logits, shape (batch_size, n_classes)
        return self.out(output)

model = SentimentClassifier(n_classes=3)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Loss and Optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)


## Step 7: Train and Evaluate the Model

This section contains functions for training and evaluating the sentiment analysis model. Additionally, it includes the training loop that iterates over multiple epochs, computes losses and accuracy for both the training and validation sets, and saves the trained model.

#### 1. **Training Function**:

   - **`train_epoch`**:

     - This function is responsible for training the model for one epoch.

     - **Parameters**:
     
       - `model`: The model to train.

       - `data_loader`: The DataLoader for the training dataset.

       - `criterion`: The loss function (Cross-Entropy Loss).

       - `optimizer`: The optimizer (AdamW).

       - `device`: The device (CUDA or CPU) where the model and data are located.

     - **Process**:

       - The model is set to training mode (`model.train()`).

       - For each batch:

         - Input data (`input_ids`, `attention_mask`, and `labels`) is moved to the device.

         - The model's predictions (`outputs`) are calculated.

         - The loss is computed using the criterion.

         - The gradients are zeroed, backpropagation is performed, and the optimizer step is applied.

       - The accuracy and loss are accumulated to return at the end of the epoch.

#### 2. **Evaluation Function**:

   - **`eval_model`**:

     - This function evaluates the model on the validation dataset.

     - **Parameters**:

       - `model`: The model to evaluate.

       - `data_loader`: The DataLoader for the validation dataset.

       - `criterion`: The loss function (Cross-Entropy Loss).

       - `device`: The device where the model and data are located.

     - **Process**:
       - The model is set to evaluation mode (`model.eval()`).

       - For each batch:

         - Input data is moved to the device.

         - The model's predictions (`outputs`) are calculated.

         - The loss is computed using the criterion.

       - The accuracy and loss are accumulated to return at the end of the evaluation.


#### 3. **Training Loop**:

   - The loop iterates over the number of epochs defined by `EPOCHS`.

   - For each epoch:

     - The model is trained using `train_epoch`.

     - The model is evaluated on the validation set using `eval_model`.

   - The training and validation loss and accuracy for each epoch are printed.

   - The model's state dictionary is saved after training is complete.
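
The train and evaluate pattern described above can be exercised on a toy problem that runs in a fraction of a second. In this sketch a single linear layer stands in for the BERT classifier and the data is random, so the resulting numbers carry no meaning; only the structure of the loop is the point.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Made-up data: 20 samples, 5 features, 3 classes
features = torch.randn(20, 5)
labels = torch.randint(0, 3, (20,))
loader = DataLoader(TensorDataset(features, labels), batch_size=8)

model = nn.Linear(5, 3)  # stand-in for the BERT classifier
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)

# One training epoch: forward pass, loss, backward pass, parameter update
model.train()
for batch_features, batch_labels in loader:
    optimizer.zero_grad()
    loss = criterion(model(batch_features), batch_labels)
    loss.backward()
    optimizer.step()

# Evaluation: no gradients, accumulate correct predictions and per-batch losses
model.eval()
correct, total_loss = 0, 0.0
with torch.no_grad():
    for batch_features, batch_labels in loader:
        outputs = model(batch_features)
        total_loss += criterion(outputs, batch_labels).item()
        correct += (outputs.argmax(dim=1) == batch_labels).sum().item()

accuracy = correct / len(loader.dataset)  # divided by the number of samples
mean_loss = total_loss / len(loader)      # divided by the number of batches
print(f"accuracy={accuracy:.4f}, mean_loss={mean_loss:.4f}")
```

Note the two different denominators: accuracy is averaged over samples, while the loss is averaged over batches, so a smaller final batch carries the same weight in the loss as a full one.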



In [None]:
# Training Function
def train_epoch(model, data_loader, criterion, optimizer, device):
    print("Training...")
    model.train()
    total_loss = 0
    correct_predictions = 0

    for batch in data_loader:
        # Move the batch to the same device as the model
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)

        # Forward pass and loss
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = criterion(outputs, labels)
        total_loss += loss.item()

        # Predicted class is the index of the largest logit
        _, preds = torch.max(outputs, dim=1)
        correct_predictions += torch.sum(preds == labels)

        # Backward pass and parameter update
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return correct_predictions.double() / len(data_loader.dataset), total_loss / len(data_loader)

# Evaluation Function
def eval_model(model, data_loader, criterion, device):
    model.eval()
    total_loss = 0
    correct_predictions = 0

    with torch.no_grad():
        for batch in data_loader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["label"].to(device)

            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            loss = criterion(outputs, labels)
            total_loss += loss.item()

            _, preds = torch.max(outputs, dim=1)
            correct_predictions += torch.sum(preds == labels)

    return correct_predictions.double() / len(data_loader.dataset), total_loss / len(data_loader)

# Training Loop
for epoch in range(EPOCHS):
    train_acc, train_loss = train_epoch(model, train_loader, criterion, optimizer, device)
    val_acc, val_loss = eval_model(model, val_loader, criterion, device)

    print(f"Epoch {epoch + 1}/{EPOCHS}")
    print(f"Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.4f}")
    print(f"Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.4f}")

# Save Model
torch.save(model.state_dict(), "sentiment_model.pth")

print("Model training complete!")
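
A saved state dictionary contains only the weights, so the model class has to be instantiated again before the weights are loaded. The sketch below shows the round trip with a small linear layer standing in for `SentimentClassifier` and a temporary file in place of `sentiment_model.pth`; `ID_TO_SENTIMENT` is a hypothetical lookup table that inverts the mapping used by `sentiment_to_label` in Step 5.

```python
import os
import tempfile

import torch
from torch import nn

# Inverse of the Step 5 mapping: Negative = 0, Positive = 1, Neutral = 2
ID_TO_SENTIMENT = {0: "Negative", 1: "Positive", 2: "Neutral"}

trained = nn.Linear(4, 3)  # stand-in for the trained SentimentClassifier
path = os.path.join(tempfile.mkdtemp(), "sentiment_model.pth")
torch.save(trained.state_dict(), path)

# Rebuild the same architecture, then load the saved weights into it
restored = nn.Linear(4, 3)
restored.load_state_dict(torch.load(path, map_location="cpu"))
restored.eval()  # switch to inference mode

with torch.no_grad():
    logits = restored(torch.zeros(1, 4))
predicted_sentiment = ID_TO_SENTIMENT[int(logits.argmax(dim=1))]
print(predicted_sentiment)
```

Passing `map_location="cpu"` lets weights that were saved on a GPU be loaded on a machine without one.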
    

## Conclusion
This sentiment analysis application uses a deep learning model based on BERT (Bidirectional Encoder Representations from Transformers) to classify tweets into three sentiment categories: Positive, Negative, and Neutral. The app follows a systematic approach to preprocess the data, tokenize the text using BERT's tokenizer, and train a model to predict sentiment.

- Key highlights of the app include:

    - `Data Preprocessing`: It handles missing values and encodes sentiment labels into numerical format for training.

    - `Model Architecture`: The app uses a pre-trained BERT model and fine-tunes it for the sentiment classification task. A dropout layer is applied to reduce overfitting, and a fully connected layer produces the class scores.

    - `Training and Evaluation`: The app includes functions for training and evaluating the model across multiple epochs, monitoring the loss and accuracy on both training and validation datasets.

    - `Performance Monitoring`: The training loop prints the loss and accuracy after each epoch, allowing for easy tracking of model performance.

    - `Model Saving`: After training, the model is saved to a file, allowing for future predictions or model loading without retraining.

Overall, this application provides an end-to-end framework for sentiment analysis of tweets built on a fine-tuned BERT model, with loss and accuracy reported on both the training and validation sets after every epoch. It can be further extended to perform sentiment analysis on larger datasets or integrated into a web application for real-time tweet sentiment classification.

---

# Thank You for visiting The Hackers Playbook! 🌐

If you liked this research material:

- [Subscribe to our newsletter.](https://thehackersplaybook.substack.com)

- [Follow us on LinkedIn.](https://www.linkedin.com/company/the-hackers-playbook/)

- [Leave a star on our GitHub.](https://www.github.com/thehackersplaybook)

<div style="display:flex; align-items:center; padding: 50px;">
<p style="margin-right:10px;">
    <img height="200px" style="width:auto;" width="200px" src="https://avatars.githubusercontent.com/u/192148546?s=400&u=95d76fbb02e6c09671d87c9155f17ca1e4ef8f21&v=4"> 
</p>
</div>

