<a href="https://github.com/Deffro/Data-Science-Portfolio/tree/master"><img src="https://img.shields.io/badge/GitHub%20Repository-black?logo=github"></a>
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1xKct26SXCNQf0WffDeh5l3alMQiiftGp?usp=sharing)

In a previous tutorial called [Traditional vs. Generative AI for Sentiment Classification](https://github.com/Deffro/Data-Science-Portfolio/blob/master/Generative%20AI/Traditional%20vs.%20Generative%20AI%20for%20Sentiment%20Classification/Traditional_vs_Generative_AI_for_Sentiment_Classification.ipynb), we predicted the sentiment of product reviews from the [Flipkart Customer Review dataset](https://www.kaggle.com/datasets/kabirnagpal/flipkart-customer-review-and-rating?resource=download).

We compared several methods:

1. **Logistic Regression with TF-IDF**:
   - A simple yet effective baseline that classifies reviews from term-frequency-based features.

2. **Logistic Regression with Pretrained Embeddings**:
   - A classifier trained on semantic representations produced by embedding models such as `all-MiniLM-L6-v2`.

3. **Zero-shot Classification**:
   - Classification without labeled data, based on the cosine similarity between document and label embeddings.

4. **Generative Models**:
   - Generative language models such as `Flan-T5`, which classify text by generating a response to a prompt.

5. **Task-Specific Sentiment Models**:
   - Models already fine-tuned for sentiment, such as `juliensimon/reviews-sentiment-analysis`, applied without further training.

Here are the results of that experiment:


| **Method**                                   | **Labeled Data** | **Accuracy** | **Time Taken**      |
|----------------------------------------------|------------------|--------------|---------------------|
| Logistic Regression using TF-IDF            | Yes              | 0.87         | ~1 second           |
| Logistic Regression with embeddings (all-MiniLM-L6-v2) | Yes              | 0.86         | ~1 minute           |
| Logistic Regression with embeddings (all-mpnet-base-v2) | Yes              | 0.86         | ~7 minutes          |
| Zero-shot classification (all-mpnet-base-v2) | No               | 0.78         | ~5 seconds          |
| Classification using Generative Models (Flan-T5)        | No               | 0.86         | ~30 seconds         |
| Task-Specific Sentiment Model (juliensimon/reviews-sentiment-analysis) | No               | 0.79         | ~4 seconds          |


In this tutorial, we fine-tune the task-specific sentiment model (`juliensimon/reviews-sentiment-analysis`) on the same dataset to see whether its accuracy improves beyond 0.79.

In [4]:
%%capture
!pip install datasets transformers sentence-transformers evaluate

We will use the same dataset and the same pre-processing.

# Load dataset

In [1]:
import pandas as pd
import numpy as np
from tqdm import tqdm

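data = pd.read_csv('data.csv')
# Drop reviews rated 4; .copy() avoids pandas' SettingWithCopyWarning on the next line
data = data[data["rating"] != 4].copy()
# Binary label: 1 (positive) for ratings of 4 or more, 0 (negative) otherwise
data["label"] = data["rating"].apply(lambda x: 1 if x >= 4 else 0)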

# Down-sample the positive class and combine with the negative class
data = pd.concat([
    data[data["label"] == 1].sample(n=len(data[data["label"] == 0]), random_state=1),
    data[data["label"] == 0]
])

# Shuffle the resulting dataset
data = data.sample(frac=1, random_state=1).reset_index(drop=True)

# Split 80/20 into train and test sets
train = data[:int(0.8*len(data))]
test = data[int(0.8*len(data)):].reset_index(drop=True)

y_test = test['label']
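
The balancing and splitting logic above can be sketched without pandas on a handful of made-up rows. This is only an illustration: the helper name `balance_and_split` and the toy reviews are hypothetical, and unlike the cell above it down-samples whichever class happens to be larger. Because the split happens after shuffling, each split is only approximately balanced, which is why the test set later shows 360 negative and 394 positive reviews.

```python
import random

def balance_and_split(rows, train_frac=0.8, seed=1):
    """Down-sample the larger class, shuffle, and split into train/test.

    `rows` is a list of (text, label) pairs with labels 0 or 1.
    """
    rng = random.Random(seed)
    positives = [r for r in rows if r[1] == 1]
    negatives = [r for r in rows if r[1] == 0]

    # Keep as many examples per class as the smaller class contains
    n = min(len(positives), len(negatives))
    balanced = rng.sample(positives, n) + rng.sample(negatives, n)

    # Shuffle, then cut at the requested fraction
    rng.shuffle(balanced)
    cut = int(train_frac * len(balanced))
    return balanced[:cut], balanced[cut:]

# Hypothetical toy data: 8 positive and 2 negative reviews
toy_rows = [(f"good {i}", 1) for i in range(8)] + [(f"bad {i}", 0) for i in range(2)]
toy_train, toy_test = balance_and_split(toy_rows)
print(len(toy_train), len(toy_test))  # 3 1
```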

# Task-Specific Model

We use the pre-trained task-specific sentiment model `juliensimon/reviews-sentiment-analysis` from Hugging Face's Model Hub.

- Load the model pipeline.
- Convert the test data into a compatible Hugging Face Dataset.
- Perform sentiment classification on the test set.

In [None]:
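import torch
from transformers import pipeline

model_path = "juliensimon/reviews-sentiment-analysis"

# Load model into pipeline; use the first GPU if available, otherwise the CPU
pipe = pipeline(
    model=model_path,
    tokenizer=model_path,
    device=0 if torch.cuda.is_available() else -1
)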

In [5]:
from datasets import Dataset
from sklearn.metrics import classification_report, f1_score
from transformers.pipelines.pt_utils import KeyDataset

# Convert pandas DataFrame to datasets.Dataset for compatibility with Hugging Face
dataset = Dataset.from_pandas(test)

y_pred = []

# Run inference on the test set
for output in tqdm(pipe(KeyDataset(dataset, key="review"))):
    label = output["label"]
    # Map textual output to numerical labels
    y_pred.append(0 if "0" in label else 1)

print("Classification Report:")
print(classification_report(y_test, y_pred))

100%|██████████| 754/754 [00:10<00:00, 71.62it/s]

Classification Report:
              precision    recall  f1-score   support

           0       0.73      0.91      0.81       360
           1       0.89      0.69      0.78       394

    accuracy                           0.79       754
   macro avg       0.81      0.80      0.79       754
weighted avg       0.81      0.79      0.79       754






Without any fine-tuning, the pre-trained model reaches 79% accuracy on the test set.
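
The inference loop maps the pipeline's textual label to an integer with a substring check (`"0" in label`). A slightly stricter variant is sketched below. It assumes the labels have the form `LABEL_0` / `LABEL_1`, which is an assumption about this model's configuration rather than something verified here; the helper `label_to_int` and the sample outputs are hypothetical.

```python
def label_to_int(label: str) -> int:
    """Parse labels such as 'LABEL_0' or 'LABEL_1' into 0 or 1."""
    # Take the part after the last underscore and insist it is a known class id
    suffix = label.rsplit("_", 1)[-1]
    if suffix not in {"0", "1"}:
        raise ValueError(f"Unexpected label: {label!r}")
    return int(suffix)

# Hypothetical pipeline outputs, shaped like the dictionaries the loop reads
outputs = [
    {"label": "LABEL_1", "score": 0.98},
    {"label": "LABEL_0", "score": 0.87},
]
preds = [label_to_int(o["label"]) for o in outputs]
print(preds)  # [1, 0]
```

Raising on an unknown label makes a change in the model's label names fail loudly instead of silently counting every unrecognized label as positive.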

# Fine-Tuning the Model

- **Load Model and Tokenizer**: We use the same `juliensimon/reviews-sentiment-analysis` model.
- **Prepare the Data**: Tokenize the training and test datasets, and pad each batch dynamically to its longest sequence.
- **Define Metrics**: Use the F1-score as the primary evaluation metric.
- **Set Training Arguments**: Specify hyperparameters such as learning rate, batch size, and number of epochs.
- **Train the Model**: Use the Hugging Face Trainer API to fine-tune the model.
- **Evaluation**: Evaluate the model's performance on the test set.
- **Manual Inference**: Run the fine-tuned model directly, one review at a time, to predict the sentiment of each test review.

## Load Model and Tokenizer

In [66]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load Model and Tokenizer
model_id = "juliensimon/reviews-sentiment-analysis"
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(model_id)

## Prepare the Data

In [67]:
from transformers import DataCollatorWithPadding

# Pad to the longest sequence in the batch
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Convert pandas DataFrames to Hugging Face Dataset format
train_dataset = Dataset.from_pandas(train)
test_dataset = Dataset.from_pandas(test)

def preprocess_function(examples):
    """Tokenize input data."""
    return tokenizer(examples["review"], truncation=True)


# Tokenize train/test data
tokenized_train = train_dataset.map(preprocess_function, batched=True)
tokenized_test = test_dataset.map(preprocess_function, batched=True)

Map:   0%|          | 0/3016 [00:00<?, ? examples/s]

Map:   0%|          | 0/754 [00:00<?, ? examples/s]
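
To make "pad to the longest sequence in the batch" concrete, here is a simplified pure-Python sketch of dynamic padding. It is not the `DataCollatorWithPadding` implementation: the helper `pad_batch`, the pad id of 0, right-side padding, and the token ids are all assumptions chosen for illustration.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest sequence in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids for three reviews of different lengths
batch = pad_batch([[101, 2204, 102], [101, 2919, 4031, 2200, 102], [101, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to one global maximum length keeps short reviews from being inflated to the length of the longest review in the whole dataset.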

## Define Metrics

In [68]:
import numpy as np
from sklearn.metrics import f1_score

def compute_metrics(eval_pred):
    """Calculate the binary F1 score from the model's logits."""

    logits, labels = eval_pred
    # The predicted class is the index of the largest logit
    predictions = np.argmax(logits, axis=-1)

    f1 = f1_score(labels, predictions, average="binary")

    return {"f1": f1}
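
For intuition, the same computation can be spelled out by hand on a few made-up logits: take the argmax of each row, count true positives, false positives, and false negatives for class 1, and combine precision and recall. The helpers `argmax` and `binary_f1` and the toy values are hypothetical; in the notebook, scikit-learn's `f1_score` does this work.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=lambda i: row[i])

def binary_f1(labels, predictions):
    """F1 score for the positive class (label 1)."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical logits for four reviews: [score for class 0, score for class 1]
toy_logits = [[2.0, -1.0], [0.1, 0.9], [-0.5, 1.5], [1.2, 0.3]]
toy_labels = [0, 1, 0, 1]

toy_preds = [argmax(row) for row in toy_logits]
print(toy_preds)                         # [0, 1, 1, 0]
print(binary_f1(toy_labels, toy_preds))  # 0.5
```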

In [69]:
print(train_dataset[0])

{'review': 'Sometimes I can only hear the songs from one side onlyREAD MORE', 'rating': 5, 'label': 1}


## Train the Model

In [70]:
from transformers import TrainingArguments, Trainer

# Hyperparameters and output settings for fine-tuning
training_args = TrainingArguments(
   "model",  # output directory for checkpoints
   learning_rate=2e-5,
   per_device_train_batch_size=16,
   per_device_eval_batch_size=16,
   num_train_epochs=1,
   weight_decay=0.01,
   save_strategy="epoch",
   report_to="none"
)

# Trainer which executes the training process
trainer = Trainer(
   model=model,
   args=training_args,
   train_dataset=tokenized_train,
   eval_dataset=tokenized_test,
   data_collator=data_collator,
   compute_metrics=compute_metrics,
)


In [71]:
trainer.train()


TrainOutput(global_step=189, training_loss=0.32724548016906413, metrics={'train_runtime': 69.2659, 'train_samples_per_second': 43.542, 'train_steps_per_second': 2.729, 'total_flos': 67537675283520.0, 'train_loss': 0.32724548016906413, 'epoch': 1.0})
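
The `global_step=189` in the output follows from the dataset size and batch size: 3016 training examples in batches of 16 give 188.5, which rounds up to 189 steps for one epoch. The short check below assumes a single device and no gradient accumulation; the helper name `optimizer_steps` is hypothetical.

```python
import math

def optimizer_steps(n_examples, batch_size, epochs=1):
    """Number of optimizer steps, assuming one device and no gradient accumulation."""
    return math.ceil(n_examples / batch_size) * epochs

steps = optimizer_steps(n_examples=3016, batch_size=16, epochs=1)
print(steps)  # 189

# Cross-check against the reported throughput: steps per second times runtime
print(round(2.729 * 69.2659))  # 189
```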

## Evaluation

In [72]:
trainer.evaluate()

{'eval_loss': 0.29022637009620667,
 'eval_f1': 0.9001233045622689,
 'eval_runtime': 2.0912,
 'eval_samples_per_second': 360.55,
 'eval_steps_per_second': 22.953,
 'epoch': 1.0}

## Manual Inference

In [73]:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.eval()  # Make sure dropout is disabled during inference

# Inference loop
y_pred = []

for output in tqdm(tokenized_test):
    # Extract input_ids and attention_mask
    input_ids = torch.tensor(output["input_ids"]).unsqueeze(0).to(device)
    attention_mask = torch.tensor(output["attention_mask"]).unsqueeze(0).to(device)

    # Run inference
    with torch.no_grad():  # Disable gradient computation for inference
        logits = model(input_ids=input_ids, attention_mask=attention_mask).logits

    # Get the predicted label
    predictions = np.argmax(logits.cpu().detach().numpy(), axis=1)
    y_pred.append(predictions[0])

100%|██████████| 754/754 [00:03<00:00, 226.85it/s]
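
The loop keeps only the argmax of the logits. If a confidence value is also wanted, the logits can be turned into probabilities with a softmax. The sketch below uses plain Python and a made-up pair of logits; in the notebook, `torch.softmax(logits, dim=-1)` would do the same on the tensor.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() numerically stable
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one review: [negative, positive]
example_logits = [-1.2, 2.3]
probs = softmax(example_logits)
predicted = probs.index(max(probs))
print(predicted, round(probs[predicted], 3))
```

The predicted class is the same as with argmax on the raw logits, since softmax preserves their ordering.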


In [74]:
print("Classification Report:")
print(classification_report(y_test, y_pred))

Classification Report:
              precision    recall  f1-score   support

           0       0.91      0.86      0.88       360
           1       0.88      0.93      0.90       394

    accuracy                           0.89       754
   macro avg       0.89      0.89      0.89       754
weighted avg       0.89      0.89      0.89       754



The accuracy is now 89%, an increase of 10 percentage points over the base model's 79%!
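
To put the move from 0.79 to 0.89 in perspective, the gain can be expressed in three ways: as an absolute difference in percentage points, as a relative improvement in accuracy, and as a reduction in error rate. The variable names are chosen for this sketch only, and the inputs are the rounded accuracies from the two classification reports.

```python
base_acc = 0.79   # pre-trained model, from the first classification report
tuned_acc = 0.89  # fine-tuned model, from the second classification report

point_gain = (tuned_acc - base_acc) * 100
relative_gain = (tuned_acc - base_acc) / base_acc * 100
error_reduction = (tuned_acc - base_acc) / (1 - base_acc) * 100

print(f"{point_gain:.0f} percentage points")           # 10 percentage points
print(f"{relative_gain:.1f}% relative improvement")    # 12.7% relative improvement
print(f"{error_reduction:.1f}% fewer errors")          # 47.6% fewer errors
```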