# Acme Corp: Transforming Customer Feedback into Strategic Insights
## Business Problem Scenario
In today's competitive e-commerce environment, understanding customer sentiment is more crucial than ever. With the rise of online reviews, social media feedback, and customer interactions, businesses are inundated with data that can significantly impact their strategies. However, manually sifting through this information can be overwhelming and inefficient.

To tackle this challenge, Acme Corp, a leader in the online retail sector, recognizes the need for an innovative solution to streamline sentiment analysis. Using natural language processing (NLP) built on transformer models, the company aims to automate the analysis of customer reviews and turn raw feedback into valuable insight.

This AI model will provide significant support in:
- Real-time Insights: Gain immediate understanding of customer sentiment, enabling proactive responses to feedback.
- Data-Driven Decisions: Inform product development and marketing strategies based on comprehensive analysis of customer opinions.
- Enhanced Customer Engagement: Identify and address negative sentiments quickly, improving overall customer satisfaction and loyalty.
- Operational Efficiency: Reduce the manual effort required for sentiment analysis, allowing team members to focus on higher-value tasks.

Through this project, Acme Corp will harness the power of AI to turn customer feedback into actionable insights, driving business growth and improving customer experiences.

## Import Libraries

In [1]:
import pandas as pd
from datasets import load_dataset, DatasetDict
import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
# AdamW comes from torch.optim (used below); the deprecated transformers.AdamW import is not needed
from transformers import get_linear_schedule_with_warmup, DistilBertTokenizer, DistilBertModel
from tqdm import tqdm
import numpy as np
import optuna
import time

## Load the dataset

We will load the data in the following steps:
- Load the Stanford SST-2 dataset (part of the GLUE benchmark) from Hugging Face.
- Remove the unnecessary `idx` column.
- Check the dataset shape to confirm that the `idx` column was removed.
- The data is already split into training, validation, and test sets, so we only need to convert each split to a **pandas** DataFrame.

In [2]:
df = load_dataset("glue", "sst2")

# Define a function to remove the idx column from a dataset
def remove_column(dataset):
    return dataset.remove_columns('idx')

# Apply the function to each dataset in the DatasetDict
df = DatasetDict({
    split: remove_column(dataset)
    for split, dataset in df.items()
})

In [3]:
df.shape

{'train': (67349, 2), 'validation': (872, 2), 'test': (1821, 2)}

In [4]:
df_train = pd.DataFrame(df['train'])
df_valid = pd.DataFrame(df['validation'])
df_test = pd.DataFrame(df['test'])

## Dataset Class

We will define a custom dataset class to handle tokenization and prepare our input data for the model.

In [5]:
class SST2Dataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts.iloc[idx]
        label = self.labels.iloc[idx]
        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_length,
            return_token_type_ids=False,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt',
        )
        return {
            'sentence': text,
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'label': torch.tensor(label, dtype=torch.long)
        }

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased', clean_up_tokenization_spaces=True)
max_length = 128

train_dataset = SST2Dataset(df_train['sentence'], df_train['label'], tokenizer, max_length)
val_dataset = SST2Dataset(df_valid['sentence'], df_valid['label'], tokenizer, max_length)
test_dataset = SST2Dataset(df_test['sentence'], df_test['label'], tokenizer, max_length)


train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True )
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)
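
To make the `padding='max_length'` and `truncation=True` options concrete, here is a minimal pure-Python sketch of what happens to the token IDs of a short and a long sentence. The helper `pad_and_truncate` and the word-piece IDs are hypothetical and for illustration only; the special-token IDs (101 for `[CLS]`, 102 for `[SEP]`, 0 for `[PAD]`) are those of the `distilbert-base-uncased` vocabulary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token IDs in the distilbert-base-uncased vocabulary


def pad_and_truncate(token_ids, max_length):
    """Hypothetical helper mimicking padding='max_length' with truncation=True."""
    # Reserve two positions for [CLS] and [SEP], then cut the sentence if it is too long
    body = token_ids[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    # Real tokens get attention 1, padding positions get attention 0
    attention_mask = [1] * len(input_ids)
    n_pad = max_length - len(input_ids)
    return input_ids + [PAD_ID] * n_pad, attention_mask + [0] * n_pad


# Made-up word-piece IDs standing in for a short and a long review
short_ids, short_mask = pad_and_truncate([2307, 3185], max_length=8)
long_ids, long_mask = pad_and_truncate(list(range(2000, 2020)), max_length=8)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every item therefore has exactly `max_length` entries, which is what lets the `DataLoader` stack them into fixed-size batches.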

## Model Architecture

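We will build the sentiment classifier on top of the pretrained DistilBERT model: a dropout layer and a linear classification head are applied to the hidden state of the first (`[CLS]`) token. The pretrained DistilBERT weights are loaded later, when the model is instantiated.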

In [6]:
class SentimentClassifier(nn.Module):
    def __init__(self, model, dropout_rate, n_classes=2):
        super(SentimentClassifier, self).__init__()
        self.model = model
        self.drop = nn.Dropout(dropout_rate)
        self.out = nn.Linear(self.model.config.hidden_size, n_classes)

    def forward(self, input_ids, attention_mask):
        # Get the last hidden state from DistilBERT
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
        hidden_state = outputs.last_hidden_state  # [batch_size, sequence_length, hidden_size]
        
        # Use the hidden state of the first token ([CLS] token)
        cls_output = hidden_state[:, 0, :]  # [batch_size, hidden_size]
        output = self.drop(cls_output)
        return self.out(output)

        
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


## Loss Function

In [7]:
#Loss Function
criterion = torch.nn.CrossEntropyLoss().to(device)
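
`CrossEntropyLoss` takes the raw logits produced by the classifier together with the integer class labels and applies a log-softmax internally, which is why the model itself has no softmax layer. As a worked illustration, the loss for a single example can be computed by hand as follows. The function name and the logit values are made up for the example.

```python
import math


def cross_entropy(logits, label):
    """Negative log-probability of the true class, computed from raw logits."""
    # Subtract the max logit for numerical stability (does not change the result)
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


# Hypothetical logits for one review: index 0 = negative, index 1 = positive
logits = [-1.2, 2.3]
loss_if_positive = cross_entropy(logits, label=1)
loss_if_negative = cross_entropy(logits, label=0)
print(f"{loss_if_positive:.4f} {loss_if_negative:.4f}")
```

The loss is small when the larger logit belongs to the true class and large otherwise. With the default settings, `CrossEntropyLoss` averages these per-example values over the batch.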

## Model Training and Hyperparameter Tuning Functions

In [8]:
def hypertune(model, optimizer, train_loader, criterion, device, trial):
    """Train for one epoch, reporting to Optuna for pruning; return the training accuracy."""
    model.train()
    train_loss = 0
    train_correct = 0
    samples_seen = 0

    for batch_idx, batch in enumerate(tqdm(train_loader)):
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)

        optimizer.zero_grad()
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask
        )
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        train_loss += loss.item()
        _, predicted = torch.max(outputs, dim=1)
        train_correct += torch.sum(predicted == labels).item()
        samples_seen += labels.size(0)

        # Report the running error rate to Optuna every 20 batches.
        # Divide by the samples seen so far, not the full dataset size,
        # so that early reports do not understate the accuracy.
        if batch_idx % 20 == 0:
            intermediate_accuracy = train_correct / samples_seen
            trial.report(1 - intermediate_accuracy, batch_idx)
            if trial.should_prune():
                raise optuna.exceptions.TrialPruned()

    train_loss /= len(train_loader)
    train_accuracy = train_correct / len(train_loader.dataset)

    return train_accuracy



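def model_train(model, optimizer, train_loader, val_loader, criterion, device):
        # Note: this function reads `scheduler`, `epoch`, and `num_epochs` from the
        # notebook's global scope, so they must be defined before it is called.
        # Training phase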
        model.train()
        train_loss = 0
        train_correct = 0

        for batch in tqdm(train_loader):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['label'].to(device)

            optimizer.zero_grad()
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask
            )
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            scheduler.step()  # Adjust the learning rate
            train_loss += loss.item()
            _, predicted = torch.max(outputs, dim=1)
            train_correct += torch.sum(predicted == labels).item()

        train_loss /= len(train_loader)
        train_accuracy = train_correct / len(train_loader.dataset)

        # Evaluation phase
        model.eval()
        val_loss = 0
        val_correct = 0

        with torch.no_grad():
            for batch in val_loader:
                input_ids = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)
                labels = batch['label'].to(device)

                outputs = model(
                    input_ids=input_ids,
                    attention_mask=attention_mask
                )
                loss = criterion(outputs, labels)

                val_loss += loss.item()
                _, predicted = torch.max(outputs, dim=1)
                val_correct += torch.sum(predicted == labels).item()

        val_loss /= len(val_loader)
        val_accuracy = val_correct / len(val_loader.dataset)

        print(f'Epoch {epoch+1}/{num_epochs}, Train Loss: {train_loss:.4f}, Train Accuracy: {train_accuracy:.4f}, '
              f'Validation Loss: {val_loss:.4f}, Validation Accuracy: {val_accuracy:.4f}')

        return train_accuracy, val_accuracy


## Hyperparameter Tuning

We will use the **Optuna** framework to tune the following hyperparameters:
- learning rate
- dropout rate
- weight decay

We now define the **model** and the **optimizer** used in the tuning process. Note that the objective below minimizes the training error rate (1 - training accuracy).

Model: as mentioned, we use **DistilBERT**, a smaller, faster, and more efficient version of BERT (Bidirectional Encoder Representations from Transformers). According to its authors, it retains about 97% of BERT's language understanding capabilities while being 40% smaller and 60% faster. This makes it a good option for tasks like sentiment analysis, where both accuracy and speed matter.

Optimizer: we use **AdamW**, a variant of the Adam optimizer that decouples weight decay from the gradient-based update instead of adding it to the loss as an L2 penalty. It is a common choice for fine-tuning transformers such as BERT and DistilBERT, where the weight decay acts as a regularizer that helps limit overfitting.
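
The difference between decoupled weight decay and an L2 penalty can be seen on a single scalar parameter. The sketch below is a simplified illustration, not PyTorch's implementation; the function names and the numeric values are chosen for the example only.

```python
import math


def adam_direction(grad, m, v, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam moment update; returns the normalized step and the new moments."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)  # bias correction
    v_hat = v / (1 - beta2 ** t)
    return m_hat / (math.sqrt(v_hat) + eps), m, v


def adamw_step(p, grad, m, v, t, lr, wd):
    """AdamW: weight decay shrinks the parameter directly, outside the Adam step."""
    direction, m, v = adam_direction(grad, m, v, t)
    p = p * (1 - lr * wd) - lr * direction
    return p, m, v


def adam_l2_step(p, grad, m, v, t, lr, wd):
    """Adam with an L2 penalty: the decay term is added to the gradient and rescaled."""
    direction, m, v = adam_direction(grad + wd * p, m, v, t)
    return p - lr * direction, m, v


# First step from p = 1.0 with a zero loss gradient, so only the decay acts
p_adamw, _, _ = adamw_step(1.0, 0.0, 0.0, 0.0, t=1, lr=1e-3, wd=0.01)
p_l2, _, _ = adam_l2_step(1.0, 0.0, 0.0, 0.0, t=1, lr=1e-3, wd=0.01)
print(p_adamw, p_l2)
```

With AdamW the parameter shrinks by the factor `1 - lr * wd`. With the L2 formulation the decay term passes through Adam's normalization, so on this first step the parameter moves by roughly `lr` regardless of how small `wd` is.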

In [10]:
def objective(trial):
    # Hyperparameters to tune
    learningrate = trial.suggest_float('learning_rate', 1e-5, 5e-3)
    weightdecay = trial.suggest_float('weight_decay', 1e-4, 1e-2)
    dropoutrate = trial.suggest_float('dropout_rate', 0.2, 0.7)

    bert_model_tuning = SentimentClassifier(model = DistilBertModel.from_pretrained('distilbert-base-uncased') , dropout_rate = dropoutrate )

    model_tuning = bert_model_tuning.to(device)

    optimizer_tuning = torch.optim.AdamW(model_tuning.parameters(), lr=learningrate, weight_decay=weightdecay)


    # Train for 3 epochs; hypertune() reports intermediate results to Optuna for pruning
    for epoch in range(3):
        train_acc = hypertune(
            model_tuning,
            optimizer_tuning,
            train_loader,
            criterion,
            device,
            trial
        )

        # Training error rate (1 - accuracy): the value Optuna minimizes
        train_loss = 1 - train_acc

    return train_loss

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=50)

best_trial = study.best_trial
learning_rate = best_trial.params['learning_rate']
weight_decay = best_trial.params['weight_decay']
dropout_rate = best_trial.params['dropout_rate']

[I 2024-09-28 10:23:02,808] A new study created in memory with name: no-name-f4e3ff5b-d7a4-4913-80af-fabca1bfdb6e
100%|██████████████████████████████████████████████████████████████████████████████| 2105/2105 [06:34<00:00,  5.33it/s]
100%|██████████████████████████████████████████████████████████████████████████████| 2105/2105 [06:35<00:00,  5.33it/s]
100%|██████████████████████████████████████████████████████████████████████████████| 2105/2105 [06:33<00:00,  5.35it/s]
[I 2024-09-28 10:42:47,172] Trial 0 finished with value: 0.05493771251243518 and parameters: {'learning_rate': 0.00017133144290496324, 'weight_decay': 0.004414139811726712, 'dropout_rate': 0.4806636668253681}. Best is trial 0 with value: 0.05493771251243518.
100%|██████████████████████████████████████████████████████████████████████████████| 2105/2105 [06:33<00:00,  5.35it/s]
100%|██████████████████████████████████████████████████████████████████████████████| 2105/2105 [06:37<00:00,  5.30it/s]
100%|██████████████████████
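
One point to consider when defining the search space: `suggest_float` samples uniformly by default, so most trials over `[1e-5, 5e-3]` land in the upper part of the range. Optuna's `suggest_float` accepts `log=True` to sample on a logarithmic scale instead. The sketch below is an illustration and not part of the tuning run above; it computes the share of samples expected below `1e-4` under each scheme, and the helper names are hypothetical.

```python
import math

LOW, HIGH = 1e-5, 5e-3  # learning-rate search range used in objective()


def uniform_fraction_below(x, low=LOW, high=HIGH):
    """Probability that a uniform sample from [low, high] falls below x."""
    return (x - low) / (high - low)


def log_uniform_fraction_below(x, low=LOW, high=HIGH):
    """Probability that a log-uniform sample from [low, high] falls below x."""
    return (math.log(x) - math.log(low)) / (math.log(high) - math.log(low))


threshold = 1e-4
print(f"uniform:     {uniform_fraction_below(threshold):.1%}")
print(f"log-uniform: {log_uniform_fraction_below(threshold):.1%}")
```

Under uniform sampling only about 2% of trials would try a learning rate below `1e-4`, compared with more than a third under log-uniform sampling, which matters because small learning rates are the usual range for fine-tuning pretrained transformers.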

## Model Training

Now we will use the best hyperparameters to build the model and optimizer and proceed with training.

We will use the **get_linear_schedule_with_warmup** scheduler, which increases the learning rate linearly during a warmup period and then decays it linearly to zero. Warmup helps stabilize training when fine-tuning pretrained models like **DistilBERT**.

In [15]:
bert_model_tuned = SentimentClassifier(model = DistilBertModel.from_pretrained('distilbert-base-uncased') , dropout_rate = dropout_rate )

model_tuned = bert_model_tuned.to(device)

optimizer_tuned = torch.optim.AdamW(model_tuned.parameters(), lr=learning_rate, weight_decay=weight_decay)

# Learning Rate Scheduler
total_steps = len(train_loader) * 2  # batches per epoch * 2 epochs (must match num_epochs below)
warmup_steps = int(0.1 * total_steps)  # Warm up over the first 10% of the steps
scheduler = get_linear_schedule_with_warmup(
    optimizer_tuned,
    num_warmup_steps=warmup_steps,
    num_training_steps=total_steps
)
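
For reference, the learning-rate multiplier applied by a linear schedule with warmup can be written out in plain Python. This sketch reproduces the formula rather than calling the library; the function name `linear_warmup_factor` is hypothetical, and the step counts assume the 2105 batches per epoch shown in the training logs.

```python
def linear_warmup_factor(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Warmup: rise linearly from 0 to 1
        return step / max(1, warmup_steps)
    # Decay: fall linearly from 1 to 0 at the last step
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


total_steps = 2105 * 2                 # batches per epoch * epochs
warmup_steps = int(0.1 * total_steps)  # first 10% of the steps

for step in (0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps):
    print(step, round(linear_warmup_factor(step, warmup_steps, total_steps), 3))
```

The multiplier starts at 0, reaches 1 at the end of warmup, and returns to 0 at the final step, so the tuned learning rate is only reached briefly at the peak.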

We will train for only 2 epochs. Experiments with more epochs led to overfitting, which is not unusual when fine-tuning a pretrained model: 2 to 5 epochs are usually enough, depending on the use case.

In [16]:
num_epochs = 2
training_start_time = time.time()

# This code implements early stopping; it is disabled here since we train for only 2 epochs
# patience = 15 # this variable is to apply early stopping technique
# best_val_accuracy = 0.0
# epochs_without_improvement = 0

for epoch in range(num_epochs):
    print(f"------------------ Training Epoch {epoch+1} /{num_epochs} ------------------")
    train_accuracy, val_accuracy = model_train(model_tuned, optimizer_tuned, train_loader, val_loader, criterion, device)
    
    #If statement for early stopping mechanism
    # if val_accuracy > best_val_accuracy:
    #     best_val_accuracy = val_accuracy
    #     epochs_without_improvement = 0
    # else:
    #     epochs_without_improvement += 1

    # if epochs_without_improvement >= patience:
    #     print("Early stopping triggered")
    #     break
    
training_end_time = time.time()

# Calculate the total training time in hours
total_training_time = (training_end_time - training_start_time) / 3600
print(f"Total training time: {total_training_time:.2f} hours")

------------------ Training Epoch 1 /2 ------------------


100%|██████████████████████████████████████████████████████████████████████████████| 2105/2105 [06:36<00:00,  5.31it/s]


Epoch 1/2, Train Loss: 0.2310, Train Accuracy: 0.9043, Validation Loss: 0.2822, Validation Accuracy: 0.8922
------------------ Training Epoch 2 /2 ------------------


100%|██████████████████████████████████████████████████████████████████████████████| 2105/2105 [06:33<00:00,  5.35it/s]


Epoch 2/2, Train Loss: 0.0941, Train Accuracy: 0.9674, Validation Loss: 0.2565, Validation Accuracy: 0.8991
Total training time: 0.22 hours
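
The early stopping logic that is commented out in the training cell can be packaged as a small helper for longer runs. This is a sketch of the same idea (stop after `patience` epochs without an improvement in validation accuracy); the class name `EarlyStopping` and the accuracy values in the usage example are made up for illustration.

```python
class EarlyStopping:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience):
        self.patience = patience
        self.best_val_accuracy = 0.0
        self.epochs_without_improvement = 0

    def step(self, val_accuracy):
        """Record one epoch's validation accuracy; return True if training should stop."""
        if val_accuracy > self.best_val_accuracy:
            self.best_val_accuracy = val_accuracy
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


# Made-up validation accuracies for a hypothetical longer run
stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, acc in enumerate([0.89, 0.90, 0.895, 0.898, 0.91], start=1):
    if stopper.step(acc):
        stopped_at = epoch
        break
print(stopped_at, stopper.best_val_accuracy)
```

In the training loop, `stopper.step(val_accuracy)` would be called after each `model_train` call, breaking out of the loop when it returns `True`.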


## Conclusion

In this notebook, we developed a sentiment analysis model based on the DistilBERT architecture and fine-tuned it on SST-2 to classify the sentiment of text. We tuned the learning rate, dropout rate, and weight decay with Optuna and used the get_linear_schedule_with_warmup learning rate scheduler during training. After 2 epochs, the model reached a training accuracy of 96.74% and a validation accuracy of 89.91%.

Additionally, we prepared the model for deployment on Hugging Face (deployment is covered in a separate notebook), allowing for easy testing of individual text inputs to predict sentiment classifications.

This project serves as a comprehensive and practical tool for sentiment analysis, providing valuable insights for understanding public opinions and sentiments in various applications.
