# <span style="color:blue">**Project Overview: DistriBERT for Sentiment Analysis**</span>

## <span style="color:green">**Introduction**</span>

This project focuses on implementing a sentiment analysis (SA) model using DistilBERT, a distilled version of BERT, for analyzing sentiment in textual data.

## <span style="color:green">**Objectives**</span>

### <span style="color:purple">**Main Objectives**</span>
- <span style="color:purple">**Implementation:**</span> Implement DistilBERT for sentiment analysis using the Hugging Face Transformers library.
- <span style="color:purple">**Evaluation:**</span> Evaluate the model's performance using metrics such as accuracy, precision, recall, and F1-score.
- <span style="color:purple">**Cross-Validation:**</span> Perform cross-validation to ensure the robustness and generalization of the model.

### <span style="color:purple">**Specific Goals**</span>
- <span style="color:purple">**Dataset Preparation:**</span> Load and preprocess the Amazon Reviews dataset for training and validation.
- <span style="color:purple">**Model Training:**</span> Train the DistilBERT model for sentiment classification.
- <span style="color:purple">**Evaluation Metrics:**</span> Calculate and report metrics to assess model performance.
- <span style="color:purple">**Cross-Validation:**</span> Conduct cross-validation to validate model performance across different folds.

## <span style="color:green">**Implementation Steps**</span>

### <span style="color:orange">**Install Necessary Libraries**</span>
- Install PyTorch, Transformers, Datasets, and other required libraries for model development.

### <span style="color:orange">**Import Statements**</span>
- Import essential libraries such as PyTorch, Hugging Face Transformers, and scikit-learn for dataset handling, model training, and evaluation.

### <span style="color:orange">**Dataset Loading and Preprocessing**</span>
- Load a subset of the Amazon Reviews dataset and preprocess it using DistilBERT's tokenizer.
- Split the dataset into training and validation sets.

### <span style="color:orange">**Model Initialization**</span>
- Initialize the DistilBERT model for sequence classification.
- Define training arguments such as batch size, number of epochs, and logging configurations.

### <span style="color:orange">**Model Training**</span>
- Train the DistilBERT model using the Trainer class from Transformers.
- Save the trained model and evaluation metrics to files for further analysis.

### <span style="color:orange">**Model Evaluation**</span>
- Evaluate the trained model on the validation dataset.
- Calculate accuracy, precision, recall, and F1-score to assess model performance.

### <span style="color:orange">**Cross-Validation**</span>
- Implement cross-validation to validate the model across multiple folds.
- Compute average metrics like accuracy, precision, recall, and F1-score across cross-validation runs.
- Save cross-validation results to a CSV file.

## <span style="color:green">**Conclusion**</span>

This project aims to demonstrate the effectiveness of DistilBERT for sentiment analysis tasks. By following these steps, we can build, train, evaluate, and validate a robust sentiment analysis model using state-of-the-art techniques.


<div style="background-color:#f0f8ff; padding:10px;">

# <span style="color:blue">Step 1: Install Necessary Libraries</span>

- **Purpose:**
  - The goal of this step is to install essential Python libraries required for the project. These libraries are foundational for tasks such as data preprocessing, model training, evaluation, and optimization.

- **Actions:**
  - **PyTorch Libraries:** Install `torch`, `torchvision`, and `torchaudio` libraries. These are core components for building and training neural networks using PyTorch framework. `torchvision` provides utilities for computer vision tasks, while `torchaudio` supports audio data processing.
  
  - **Transformers Library:** Install `transformers` from Hugging Face. This library is crucial for leveraging pre-trained transformer models like BERT, DistilBERT, and others. It simplifies model loading, fine-tuning, and inference.
  
  - **Datasets Library:** Install `datasets` library. It provides access to various datasets commonly used in machine learning research and applications. This facilitates seamless integration of datasets into your training and evaluation pipelines.
  
  - **Accelerate Library:** Install `accelerate`, ensuring it is version 0.21.0 or higher. This library optimizes PyTorch training, particularly beneficial for scaling training on multiple GPUs efficiently.

- **Code Example:**
  ```python
  # Install PyTorch and related libraries
  !pip install torch torchvision torchaudio
  
  # Install Hugging Face Transformers and Datasets
  !pip install transformers datasets
  
  # Install Accelerate library (version 0.21.0 or higher)
  !pip install accelerate>=0.21.0

In [4]:
# Install necessary libraries

In [5]:
!pip install torch torchvision torchaudio transformers datasets  # Install required libraries



In [6]:
!pip install accelerate>=0.21.0  # Install accelerate library with version >=0.21.0

### <span style="color:blue">Step 2: Import Statements</span>

#### <span style="color:green">Library Import</span>

- **PyTorch**: Import PyTorch (`import torch`) to leverage its capabilities for deep learning model training and operations.
- **NumPy**: Import NumPy (`import numpy as np`) for efficient numerical computations and array operations.
- **Pandas**: Import Pandas (`import pandas as pd`) for handling and manipulating data in tabular form.
- **Hugging Face Transformers**: Import necessary classes from Hugging Face Transformers (`from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, Trainer, TrainingArguments`) for using DistilBERT model and related components.
- **Hugging Face Datasets**: Import the function to load datasets from Hugging Face Datasets (`from datasets import load_dataset`) to conveniently access the Amazon Reviews dataset.
- **Scikit-learn Metrics**: Import evaluation metrics (`from sklearn.metrics import accuracy_score, precision_recall_fscore_support`) from scikit-learn to evaluate model performance.

#### <span style="color:green">Purpose</span>

- **PyTorch**: Utilize PyTorch for building and training neural network models, including DistilBERT.
- **NumPy**: Leverage NumPy for numerical computations needed during data preprocessing and evaluation.
- **Pandas**: Use Pandas for handling and manipulating structured data, including loading datasets and storing results.
- **Hugging Face Transformers**: Access the DistilBERT model and utilities for tokenization, training, and evaluation.
- **Hugging Face Datasets**: Load the Amazon Reviews dataset using a convenient interface provided by Hugging Face.
- **Scikit-learn Metrics**: Compute evaluation metrics such as accuracy, precision, recall, and F1-score to assess model performance.

#### <span style="color:green">Integration</span>

- **Integration Strategy**: Integrate these libraries and modules to streamline the development and evaluation of the sentiment analysis model using DistilBERT.
- **Compatibility**: Ensure compatibility and functionality across different components, facilitating efficient data handling, model training, and performance evaluation.

---

These import statements are crucial for setting up the environment and accessing necessary tools and utilities to proceed with the development of the DistriBERT model for sentiment analysis on the Amazon Reviews dataset.

In [7]:
# Import statements

In [8]:
import torch  # Import PyTorch library

In [60]:
import numpy as np  # Import NumPy library for numerical computations

In [45]:
import pandas as pd  # Import pandas for handling dataframes

In [9]:
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, Trainer, TrainingArguments  # Import classes from Hugging Face Transformers


In [10]:
from datasets import load_dataset  # Import function to load datasets from Hugging Face Datasets

In [11]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support  # Import evaluation metrics from scikit-learn

<div style="background-color:#f0f8ff; padding:10px;">

# <span style="color:orange">Step 3: Dataset Loading and Preprocessing</span>

- **<span style="color:purple">Purpose:</span>**
  - This step involves loading and preprocessing a subset of the Amazon Reviews dataset for training and validation purposes.

- **<span style="color:purple">Actions:</span>**
  - **<span style="color:blue">Dataset Loading:</span>**
    - Load a smaller subset of the Amazon Reviews dataset using the `load_dataset` function from the `datasets` module. Specify the dataset name (`"amazon_polarity"`) and the split (`"train[:100]"`) to load 100 samples for initial testing.

  - **<span style="color:blue">Data Preprocessing:</span>**
    - **<span style="color:green">Tokenization:</span>** Use the DistilBERT tokenizer (`DistilBertTokenizer`) to tokenize and encode the dataset. Set parameters such as `truncation=True` for handling long sequences and `padding='max_length'` to ensure uniform sequence length.

  - **<span style="color:blue">Dataset Format:</span>**
    - Convert tokenized datasets into a format compatible with PyTorch (`type='torch'`). Specify columns to include (`['input_ids', 'attention_mask', 'label']`) for model input (input IDs and attention masks) and labels (sentiment labels).

  - **<span style="color:blue">Splitting into Training and Validation Sets:</span>**
    - Determine the sizes for the training (`train_size`) and validation (`val_size`) datasets based on a predefined split ratio (e.g., 80% for training, 20% for validation).
    
    - Shuffle and select samples for the training dataset (`train_dataset`) to ensure randomization and avoid bias.
    
    - Select samples for the validation dataset (`val_dataset`) using the remaining samples after selecting the training set.

- **<span style="color:purple">Notes:</span>**
  - **<span style="color:green">Dataset Size Considerations:</span>** Adjust the subset size (`100` samples in this case) based on computational resources and initial testing requirements.
  
  - **<span style="color:green">Data Integrity:</span>** Ensure data integrity by verifying the successful loading and preprocessing of the dataset before proceeding to the next steps.
  
  - **<span style="color:green">Error Handling:</span>** Handle potential errors such as missing data or incompatible formats during dataset loading and preprocessing stages.
  
  - **<span style="color:green">Documentation:</span>** Refer to documentation and examples provided by Hugging Face and other relevant sources for detailed usage and parameter configurations of the `datasets` and `transformers` libraries.

</div>

In [12]:
# Load a smaller subset of the Amazon Reviews dataset for initial testing

In [13]:
dataset = load_dataset("amazon_polarity", split="train[:100]")  # Load a subset of Amazon Reviews dataset

In [14]:
# Tokenizer initialization

In [15]:
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')  # Initialize DistilBERT tokenizer

In [16]:
# Function for preprocessing data

In [17]:
def preprocess(example):
    return tokenizer(example['content'], truncation=True, padding='max_length')  # Preprocess dataset examples

In [18]:
# Tokenize datasets in batches

In [19]:
tokenized_datasets = dataset.map(preprocess, batched=True)  # Tokenize dataset in batches

In [20]:
tokenized_datasets.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])  # Set dataset format for PyTorch


<div style="background-color:#f0f8ff; padding:10px;">

# <span style="color:orange">Step 4: Split Datasets into Training and Validation Sets</span>

- **<span style="color:purple">Purpose:</span>**
  - Split the preprocessed dataset into training and validation sets to facilitate model training and evaluation.

- **<span style="color:purple">Actions:</span>**
  - **<span style="color:blue">Dataset Splitting:</span>**
    - Calculate the sizes for training (`train_size`) and validation (`val_size`) datasets based on a specified ratio (e.g., 80% training, 20% validation).
    
  - **<span style="color:blue">Shuffling:</span>**
    - Use the `shuffle` method (`tokenized_datasets.shuffle`) with a seed for reproducibility to randomize the dataset before splitting.
    
  - **<span style="color:blue">Selection:</span>**
    - Select data ranges (`range`) for training and validation sets using the `select` method (`tokenized_datasets.select`).

- **<span style="color:purple">Notes:</span>**
  - **<span style="color:green">Data Distribution:</span>** Ensure a representative distribution of data across training and validation sets to maintain model performance and generalization.
  
  - **<span style="color:green">Shuffling:</span>** Shuffle the dataset to randomize the order of examples before splitting to prevent any inherent bias in the data sequence.
  
  - **<span style="color:green">Seed Selection:</span>** Use a consistent seed (e.g., `seed=42`) for reproducibility in dataset shuffling and selection across different runs or environments.
  
  - **<span style="color:green">Validation Set:</span>** Validate the selected validation set range (`train_size` to `train_size + val_size`) to ensure correct separation from the training data.

</div>

In [21]:
# Split datasets into training and validation sets

In [22]:
train_size = int(0.8 * len(tokenized_datasets))  # Calculate size of training dataset

In [23]:
val_size = len(tokenized_datasets) - train_size  # Calculate size of validation dataset

In [24]:
train_dataset = tokenized_datasets.shuffle(seed=42).select(range(train_size))  # Shuffle and select training dataset


In [25]:
val_dataset = tokenized_datasets.shuffle(seed=42).select(range(train_size, train_size + val_size))  # Select validation dataset


<div style="background-color:#f0f8ff; padding:10px;">

# <span style="color:orange">Step 5: Initialize and Train DistilBERT Model</span>

- **<span style="color:purple">Purpose:</span>**
  - Initialize the DistilBERT model for sequence classification and train it on the prepared training dataset.

- **<span style="color:purple">Actions:</span>**
  - **<span style="color:blue">Model Initialization:</span>**
    - Initialize the DistilBERT model for sequence classification using the `DistilBertForSequenceClassification.from_pretrained` method with appropriate parameters (e.g., `num_labels`).

  - **<span style="color:blue">Training Arguments:</span>**
    - Define training arguments such as `TrainingArguments` specifying parameters like `output_dir`, `num_train_epochs`, `per_device_train_batch_size`, `per_device_eval_batch_size`, `warmup_steps`, `weight_decay`, `logging_dir`, `logging_steps`, `evaluation_strategy`, and `save_strategy`.

  - **<span style="color:blue">Trainer Initialization:</span>**
    - Initialize the `Trainer` object with the defined DistilBERT model, training arguments, and the prepared training dataset (`train_dataset`) and validation dataset (`val_dataset`).

  - **<span style="color:blue">Model Training:</span>**
    - Use the `trainer.train()` method to commence training the initialized DistilBERT model on the training dataset.

- **<span style="color:purple">Notes:</span>**
  - **<span style="color:green">Model Selection:</span>** Choose an appropriate pre-trained DistilBERT model configuration (`distilbert-base-uncased`, etc.) for sequence classification based on the task requirements.
  
  - **<span style="color:green">Training Configuration:</span>** Configure training parameters (`num_train_epochs`, batch sizes, logging settings, etc.) to optimize model performance and monitor training progress.
  
  - **<span style="color:green">Evaluation Strategy:</span>** Specify the `evaluation_strategy` to evaluate the model at the end of each epoch for performance assessment on the validation dataset.
  
  - **<span style="color:green">Save Strategy:</span>** Define the `save_strategy` to save the model checkpoints at the end of each epoch for potential further evaluation or deployment.

</div>

In [26]:
# Load pre-trained DistilBERT model for sequence classification

In [27]:
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)  # Initialize DistilBERT for sequence classification


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [28]:
# Define training arguments, adjusting batch sizes and epochs for faster execution

In [29]:
training_args = TrainingArguments(
    output_dir='./results',  # Directory to save training results
    num_train_epochs=1,  # Number of training epochs
    per_device_train_batch_size=4,  # Batch size for training
    per_device_eval_batch_size=4,  # Batch size for evaluation
    warmup_steps=500,  # Number of warmup steps
    weight_decay=0.01,  # Weight decay coefficient
    logging_dir='./logs',  # Directory to save logs
    logging_steps=10,  # Log every 10 steps
    evaluation_strategy='epoch',  # Evaluate at the end of each epoch
    save_strategy='epoch',  # Save model at the end of each epoch
)



In [30]:
# Initialize Trainer with defined model, training arguments, and datasets

In [31]:
trainer = Trainer(
    model=model,  # Pass model to Trainer
    args=training_args,  # Pass training arguments to Trainer
    train_dataset=train_dataset,  # Pass training dataset to Trainer
    eval_dataset=val_dataset,  # Pass validation dataset to Trainer
)

In [32]:
# Train the model

In [33]:
trainer.train()  # Train the model

Epoch,Training Loss,Validation Loss
1,0.6942,0.694923


TrainOutput(global_step=20, training_loss=0.6982547283172608, metrics={'train_runtime': 436.1211, 'train_samples_per_second': 0.183, 'train_steps_per_second': 0.046, 'total_flos': 10597391892480.0, 'train_loss': 0.6982547283172608, 'epoch': 1.0})

<div style="background-color:#f0f8ff; padding:10px;">

# <span style="color:orange">Step 6: Evaluate the Trained Model</span>

- **<span style="color:purple">Purpose:</span>**
  - Evaluate the performance of the trained DistilBERT model on the validation dataset using evaluation metrics such as accuracy, precision, recall, and F1-score.

- **<span style="color:purple">Actions:</span>**
  - **<span style="color:blue">Model Loading:</span>**
    - Load the trained DistilBERT model using `DistilBertForSequenceClassification.from_pretrained` from the saved directory or checkpoint.
  
  - **<span style="color:blue">Prediction:</span>**
    - Use the loaded model to make predictions on the validation dataset (`val_dataset`) using `trainer.predict()` method.
  
  - **<span style="color:blue">Metrics Calculation:</span>**
    - Compute evaluation metrics such as accuracy, precision, recall, and F1-score using `accuracy_score`, `precision_recall_fscore_support`, or other appropriate functions from scikit-learn.
  
  - **<span style="color:blue">Results Display:</span>**
    - Print or display the calculated evaluation metrics to assess the model's performance on the validation dataset.

- **<span style="color:purple">Notes:</span>**
  - **<span style="color:green">Model Loading:</span>** Ensure the correct path or directory is provided to load the trained model from the saved checkpoint or directory.
  
  - **<span style="color:green">Prediction:</span>** Utilize the `trainer.predict()` method to obtain model predictions on the validation dataset efficiently.
  
  - **<span style="color:green">Metric Selection:</span>** Select appropriate evaluation metrics (`accuracy`, `precision`, `recall`, `F1-score`) based on the task requirements and dataset characteristics.
  
  - **<span style="color:green">Performance Assessment:</span>** Interpret and analyze the evaluation metrics to assess the model's performance and identify areas for potential improvement.

</div>

In [34]:
# Evaluate the trained model on validation dataset

In [35]:
predictions = trainer.predict(val_dataset)  # Make predictions on validation dataset

In [36]:
preds = predictions.predictions.argmax(-1)  # Get predicted labels

In [37]:
labels = predictions.label_ids  # Get true labels

In [38]:
# Calculate evaluation metrics (accuracy, precision, recall, F1-score)

In [39]:
accuracy = accuracy_score(labels, preds)  # Calculate accuracy

In [40]:
precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')  # Calculate precision, recall, F1-score


In [41]:
# Print evaluation metrics

In [59]:
print(f"Accuracy: {accuracy:.4f}")  # Print accuracy

Accuracy: 0.6000


In [55]:
print(f"Precision: {precision:.4f}, Recall: {recall:.4f}, F1: {f1:.4f}")  # Print precision, recall, F1-score

Precision: 1.0000, Recall: 0.1111, F1: 0.2000


<div style="background-color:#f0f8ff; padding:10px;">

# <span style="color:orange">Step 7: Save Evaluation Metrics and Predictions</span>

- **<span style="color:purple">Purpose:</span>**
  - Save the evaluation metrics (accuracy, precision, recall, F1-score) and model predictions (true labels and predicted labels) to files for further analysis and reporting.

- **<span style="color:purple">Actions:</span>**
  - **<span style="color:blue">Metrics Saving:</span>**
    - Create a DataFrame or data structure to store evaluation metrics (accuracy, precision, recall, F1-score) calculated during model evaluation.
    - Save the evaluation metrics to a CSV file using `to_csv()` method of Pandas DataFrame.
  
  - **<span style="color:blue">Predictions Saving:</span>**
    - Create a DataFrame to store true labels and predicted labels obtained from model predictions.
    - Save the predictions DataFrame to a CSV file using `to_csv()` method of Pandas DataFrame.

- **<span style="color:purple">Notes:</span>**
  - **<span style="color:green">Metrics Storage:</span>** Ensure the CSV file path is correctly specified to save evaluation metrics for future reference or reporting.
  
  - **<span style="color:green">Predictions Storage:</span>** Save true labels and predicted labels in a structured format to facilitate comparison and analysis.
  
  - **<span style="color:green">File Naming:</span>** Choose meaningful names for CSV files (e.g., `evaluation_metrics.csv`, `predictions.csv`) to easily identify stored data.

</div>

In [None]:
# Save evaluation metrics to CSV file

In [56]:
eval_metrics_df = pd.DataFrame({  # Create dataframe for evaluation metrics
    'Metric': ['Accuracy', 'Precision', 'Recall', 'F1-score'],  # Define metric names
    'Score': [accuracy, precision, recall, f1]  # Define corresponding scores
})

In [57]:
eval_metrics_df.to_csv('./evaluation_metrics.csv', index=False)  # Save evaluation metrics to CSV file

In [48]:
# Save predictions to CSV file

In [58]:
predictions_df = pd.DataFrame({  # Create dataframe for predictions
    'True Labels': labels,  # True labels column
    'Predicted Labels': preds  # Predicted labels column
})

In [50]:
predictions_df.to_csv('./predictions.csv', index=False)  # Save predictions to CSV file

<div style="background-color:#f0f8ff; padding:10px;">

# <span style="color:orange">Step 8: Save and Load the Trained Model</span>

- **<span style="color:purple">Purpose:</span>**
  - Save the trained DistilBERT model to disk for future use and load it back into memory for inference or further training.

- **<span style="color:purple">Actions:</span>**
  - **<span style="color:blue">Model Saving:</span>**
    - Use the `save_model()` method of the Trainer class to save the trained DistilBERT model to a specified directory.
  
  - **<span style="color:blue">Model Loading:</span>**
    - Utilize the `from_pretrained()` method of DistilBertForSequenceClassification to load the saved model from the directory path.

- **<span style="color:purple">Notes:</span>**
  - **<span style="color:green">Model Storage:</span>** Ensure the directory path provided during model saving is accessible and descriptive.
  
  - **<span style="color:green">Model Loading:</span>** Verify that the correct directory path is used to load the saved model back into memory.
  
  - **<span style="color:green">Reusability:</span>** Saved models can be reused for inference on new data or continued training without retraining from scratch.

</div>

In [51]:
# Save the trained model

In [52]:
trainer.save_model('./distilbert-amazon-reviews')  # Save the trained model

In [53]:
# Load the saved model

In [54]:
loaded_model = DistilBertForSequenceClassification.from_pretrained('./distilbert-amazon-reviews')  # Load the saved model


<div style="background-color:#f0f8ff; padding:10px;">

# <span style="color:orange">Step 9: Perform Cross-Validation and Calculate Average Metrics</span>

- **<span style="color:purple">Purpose:</span>**
  - Validate the performance of the DistilBERT model across multiple folds using cross-validation.
  - Compute average evaluation metrics to assess the model's robustness and generalization.

- **<span style="color:purple">Actions:</span>**
  - **<span style="color:blue">Cross-Validation Setup:</span>**
    - Define the number of cross-validation runs and initialize an empty list to store results.

  - **<span style="color:blue">Dataset Processing:</span>**
    - Tokenize and preprocess the dataset for each cross-validation run to ensure consistency and fairness.

  - **<span style="color:blue">Model Training and Evaluation:</span>**
    - Train the DistilBERT model and evaluate its performance on each fold of the cross-validation using predefined metrics.
  
  - **<span style="color:blue">Average Metrics Calculation:</span>**
    - Calculate average metrics such as accuracy, precision, recall, and F1-score across all cross-validation runs.
  
  - **<span style="color:blue">Results Saving:</span>**
    - Save the cross-validation results to a CSV file for further analysis and comparison.

- **<span style="color:purple">Notes:</span>**
  - **<span style="color:green">Data Consistency:</span>** Ensure dataset preprocessing and model training procedures are consistent across all cross-validation folds.
  
  - **<span style="color:green">Metric Interpretation:</span>** Interpret average metrics to gauge the model's performance across various data subsets.
  
  - **<span style="color:green">Result Analysis:</span>** Use saved results to compare different models or parameter settings and inform future model improvements.

</div>

In [65]:
# Function to train and evaluate the model

In [66]:
def train_and_evaluate(train_dataset, val_dataset):
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset
    )

    trainer.train()  # Train the model

    predictions = trainer.predict(val_dataset)  # Make predictions on validation dataset
    preds = predictions.predictions.argmax(-1)  # Get predicted labels
    labels = predictions.label_ids  # Get true labels

    accuracy = accuracy_score(labels, preds)  # Calculate accuracy
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')  # Calculate precision, recall, F1-score

    return accuracy, precision, recall, f1  # Return evaluation metrics

In [67]:
# Function to perform cross-validation

In [68]:
def cross_validate(dataset, num_runs=5):
    results = []  # Initialize an empty list to store results

    for i in range(num_runs):
        seed = 42 + i  # Define seed for reproducibility

        # Preprocess datasets for current seed
        tokenized_datasets = dataset.map(preprocess, batched=True)  # Tokenize and preprocess dataset

        # Split datasets into training and validation sets
        train_size = int(0.8 * len(tokenized_datasets))  # Calculate size of training dataset
        val_size = len(tokenized_datasets) - train_size  # Calculate size of validation dataset
        train_dataset = tokenized_datasets.shuffle(seed=seed).select(range(train_size))  # Shuffle and select training dataset
        val_dataset = tokenized_datasets.shuffle(seed=seed).select(range(train_size, train_size + val_size))  # Select validation dataset

        # Train and evaluate with current seed
        accuracy, precision, recall, f1 = train_and_evaluate(train_dataset, val_dataset)  # Train and evaluate the model

        # Collect results
        results.append({
            'Seed': seed,
            'Accuracy': accuracy,
            'Precision': precision,
            'Recall': recall,
            'F1-score': f1
        })  # Append results for current run to the list

    return results  # Return the collected results after all cross-validation runs

In [69]:
# Perform cross-validation

In [70]:
num_runs = 5  # Number of cross-validation runs

In [71]:
cross_validation_results = cross_validate(dataset, num_runs)  # Execute cross-validation

Epoch,Training Loss,Validation Loss
1,0.6864,0.69318


Epoch,Training Loss,Validation Loss
1,0.684,0.685068


Epoch,Training Loss,Validation Loss
1,0.6773,0.678845


Epoch,Training Loss,Validation Loss
1,0.6795,0.681347


Epoch,Training Loss,Validation Loss
1,0.6871,0.67751


In [72]:
# Calculate average metrics across runs

In [73]:
avg_accuracy = np.mean([result['Accuracy'] for result in cross_validation_results])  # Calculate average accuracy

In [74]:
avg_precision = np.mean([result['Precision'] for result in cross_validation_results])  # Calculate average precision


In [75]:
avg_recall = np.mean([result['Recall'] for result in cross_validation_results])  # Calculate average recall

In [76]:
avg_f1 = np.mean([result['F1-score'] for result in cross_validation_results])  # Calculate average F1-score

In [77]:
# Print average metrics

In [78]:
print(f"Average Accuracy: {avg_accuracy:.4f}")  # Print average accuracy
print(f"Average Precision: {avg_precision:.4f}")  # Print average precision
print(f"Average Recall: {avg_recall:.4f}")  # Print average recall
print(f"Average F1-score: {avg_f1:.4f}")  # Print average F1-score

Average Accuracy: 0.7000
Average Precision: 0.6754
Average Recall: 0.6624
Average F1-score: 0.6600


In [79]:
# Save average metrics to CSV file

In [80]:
avg_metrics_df = pd.DataFrame({  # Create dataframe for average metrics
    'Metric': ['Average Accuracy', 'Average Precision', 'Average Recall', 'Average F1-score'],  # Define metric names
    'Score': [avg_accuracy, avg_precision, avg_recall, avg_f1]  # Define corresponding scores
})

In [81]:
avg_metrics_df.to_csv('./average_metrics_cv_updated.csv', index=False)  # Save average metrics to CSV file

<div style="background-color:#f0f8ff; padding:10px;">

# <span style="color:blue">Project Conclusion and Additional Notes</span>

- **<span style="color:green">Project Overview:</span>**
  - This project focused on implementing a sentiment analysis model using DistilBERT, a lighter version of BERT, for analyzing sentiment in textual data from the Amazon Reviews dataset.

- **<span style="color:green">Key Objectives Achieved:</span>**
  - **Implementation:** The project successfully implemented DistilBERT for sentiment analysis using the Hugging Face Transformers library. This involved leveraging pre-trained language models and fine-tuning them for sentiment classification.
  
  - **Evaluation:** The model's performance was evaluated using standard metrics such as accuracy, precision, recall, and F1-score. This evaluation provided insights into how well the model classified sentiment in the validation dataset.
  
  - **Cross-Validation:** To ensure the model's robustness and generalization, cross-validation was performed across multiple folds of the dataset. This helped validate that the model's performance was consistent and not overfitted to a specific subset of data.

- **<span style="color:green">Implementation Steps Recap:</span>**
  - **Step 1: Install Necessary Libraries:** Essential libraries including PyTorch, Transformers, and Datasets were installed to facilitate model development and data handling.
  
  - **Step 2: Import Statements:** Necessary modules and libraries were imported, such as PyTorch for tensor computations, Hugging Face Transformers for leveraging pre-trained models, and scikit-learn for evaluation metrics.
  
  - **Step 3: Dataset Loading and Preprocessing:** A subset of the Amazon Reviews dataset was loaded and preprocessed using the DistilBERT tokenizer. This involved tokenizing the text data and preparing it for input into the model.
  
  - **Step 4: Split Datasets into Training and Validation Sets:** The dataset was split into training and validation sets to facilitate model training and evaluation. This step ensured that the model's performance could be assessed on unseen data.
  
  - **Step 5: Initialize and Train DistilBERT Model:** The DistilBERT model was initialized for sequence classification and trained using the Trainer class from Hugging Face Transformers. Training parameters such as batch size, number of epochs, and learning rate were configured to optimize model performance.
  
  - **Step 6: Evaluate the Trained Model:** The trained model was evaluated on the validation dataset using metrics like accuracy, precision, recall, and F1-score. These metrics provided quantitative measures of the model's performance in sentiment analysis tasks.
  
  - **Step 7: Save Evaluation Metrics and Predictions:** Evaluation metrics (e.g., accuracy, F1-score) and model predictions were saved to CSV files. This allowed for further analysis and comparison of different model configurations or datasets.
  
  - **Step 8: Save and Load the Trained Model:** After training, the DistilBERT model was saved to disk for future use or deployment. This step ensured that the trained model could be easily loaded for inference tasks without needing to retrain from scratch.
  
  - **Step 9: Perform Cross-Validation and Calculate Average Metrics:** Cross-validation was implemented to validate the model's performance across multiple folds of the dataset. Average metrics such as accuracy, precision, recall, and F1-score were computed to assess the model's consistency and generalization.

- **<span style="color:green">Conclusion:</span>**
  - This project demonstrated the effectiveness of DistilBERT, a distilled version of BERT, for sentiment analysis tasks. By following the outlined steps, a robust sentiment analysis model was developed, trained, evaluated, and validated.
  
  - The project highlighted the importance of leveraging pre-trained models and fine-tuning them for specific tasks like sentiment analysis. This approach not only saves computational resources but also benefits from the extensive knowledge embedded in large-scale language models.
  
  - The evaluation metrics provided insights into the model's strengths and areas for improvement, helping guide future iterations or refinements of the sentiment analysis pipeline.
  
- **<span style="color:green">Future Steps:</span>**
  - **Model Fine-Tuning:** Consider fine-tuning the DistilBERT model with additional labeled data or adjusting hyperparameters to further improve performance in specific domains or datasets.
  
  - **Deployment:** Explore deployment options to integrate the trained model into real-world applications for sentiment analysis. This may involve deploying the model as a web service, embedding it into existing workflows, or deploying it on edge devices.
  
  - **Further Research:** Continue exploring advancements in transformer-based models and their applications in natural language processing tasks. This includes investigating new architectures, exploring multilingual models, or integrating domain-specific knowledge into model training.

- **<span style="color:green">Notes:</span>**
  - **Dataset Considerations:** Ensure that data preprocessing steps are tailored to specific datasets and tasks to maintain model performance consistency across different domains or datasets.
  
  - **Model Evaluation:** Interpret evaluation metrics comprehensively to understand the strengths and limitations of the sentiment analysis model. Consider analyzing performance metrics in conjunction with qualitative assessments of model predictions.
  
  - **Documentation and Reproducibility:** Maintain detailed documentation of code, experimental setups, and results for reproducibility and future reference. This ensures that the project can be easily replicated or extended by other researchers or practitioners in the field.

</div>