# BERT Sentiment Analysis Model

This notebook implements a BERT-based sentiment analysis model for Twitter sentiment classification, using pre-trained BERT with fine-tuning. It is adapted to run in Google Colab, using preprocessed data from `data_preprocessing.ipynb` stored on Google Drive.

## Setup and Imports

Mount Google Drive, install dependencies, and import required libraries.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

# Install necessary libraries
!pip install torch transformers scikit-learn joblib tqdm matplotlib seaborn

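import torch
import torch.nn as nn
# AdamW comes from torch.optim: the copy in transformers is deprecated
# and no longer available in recent releases
from torch.optim import AdamW
from torch.utils.data import Dataset, DataLoader
from transformers import (
    BertTokenizer, BertForSequenceClassification,
    get_linear_schedule_with_warmup
)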
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import joblib
import logging
from tqdm import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Setup logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Configuration
DATA_DIR = Path('/content/drive/MyDrive/Colab Notebooks/Data_RO')
MODELS_DIR = DATA_DIR / 'models'
RESULTS_DIR = DATA_DIR / 'results'
BERT_CONFIG = {
    'model_name': 'bert-base-uncased',
    'num_labels': 2,
    'max_length': 128,
    'batch_size': 16,
    'num_epochs': 3,
    'learning_rate': 2e-5,
    'weight_decay': 0.01,
    'warmup_steps': 0
}
RANDOM_STATE = 42

# Create directories (including missing parents) if they don't exist
MODELS_DIR.mkdir(parents=True, exist_ok=True)
RESULTS_DIR.mkdir(parents=True, exist_ok=True)

# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
logger.info(f"Using device: {device}")

Mounted at /content/drive

## Twitter Sentiment Dataset

Define the custom Dataset class for Twitter sentiment data.

In [3]:
class TwitterSentimentDataset(Dataset):
    """
    Custom Dataset class for Twitter sentiment data
    """

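    def __init__(self, texts, labels, tokenizer, max_length):
        # Materialize as lists so positional indexing in __getitem__ works
        # even for a pandas Series with a shuffled, non-contiguous index
        self.texts = list(texts)
        self.labels = list(labels)
        self.tokenizer = tokenizer
        self.max_length = max_length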

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = str(self.texts[idx])
        label = self.labels[idx]

        # Tokenize text (calling the tokenizer directly replaces the
        # deprecated encode_plus method and takes the same arguments)
        encoding = self.tokenizer(
            text,
            add_special_tokens=True,
            max_length=self.max_length,
            return_token_type_ids=False,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt'
        )

        return {
            'text': text,
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'label': torch.tensor(label, dtype=torch.long)
        }
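
To make the effect of `padding='max_length'` and `truncation=True` concrete, here is a minimal pure-Python sketch of the fixed-length encoding that `__getitem__` returns. It is an illustration under stated assumptions, not the tokenizer's own code: `pad_and_mask` is a hypothetical helper, the token ids passed to it are made up, and the constants 101, 102 and 0 are the `[CLS]`, `[SEP]` and `[PAD]` ids of the `bert-base-uncased` vocabulary.

```python
# Special-token ids in the bert-base-uncased vocabulary.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def pad_and_mask(token_ids, max_length):
    """Return (input_ids, attention_mask), each exactly max_length long.

    Hypothetical helper that mimics single-sequence encoding with
    padding='max_length' and truncation=True. Assumes max_length >= 2.
    """
    # Reserve two positions for [CLS] and [SEP], then truncate the rest.
    body = list(token_ids)[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)

    # Right-pad to the fixed length; padded positions get mask value 0
    # so the model ignores them.
    n_pad = max_length - len(input_ids)
    return input_ids + [PAD_ID] * n_pad, attention_mask + [0] * n_pad


# A short input is padded, a long input is truncated (ids are made up).
short_ids, short_mask = pad_and_mask([2023, 2003, 2307], max_length=8)
long_ids, long_mask = pad_and_mask(range(1000, 1020), max_length=8)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every example has the same length, the default `DataLoader` collation can stack them into a batch without a custom `collate_fn`. The cost is that short tweets still occupy all `max_length` positions.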

## BERT Sentiment Model Class

Define the `BERTSentimentModel` class for fine-tuning and evaluating the BERT model.

In [4]:
class BERTSentimentModel:
    """
    BERT-based sentiment analysis model with fine-tuning capabilities
    """

    def __init__(self, config=BERT_CONFIG):
        """
        Initialize BERT model with configuration

        Args:
            config (dict): Model configuration parameters
        """
        self.config = config
        self.tokenizer = None
        self.model = None
        self.is_trained = False
        self.training_history = []

        # Initialize tokenizer and model
        self._initialize_model()

        logger.info("BERT model initialized")

    def _initialize_model(self):
        """Initialize BERT tokenizer and model"""
        logger.info(f"Loading BERT model: {self.config['model_name']}")

        # Load tokenizer
        self.tokenizer = BertTokenizer.from_pretrained(
            self.config['model_name']
        )

        # Load model
        self.model = BertForSequenceClassification.from_pretrained(
            self.config['model_name'],
            num_labels=self.config['num_labels'],
            output_attentions=False,
            output_hidden_states=False
        )

        # Move model to device
        self.model.to(device)

    def create_data_loader(self, texts, labels, batch_size, shuffle=True):
        """
        Create DataLoader for training/evaluation

        Args:
            texts (array-like): Input texts
            labels (array-like): Labels
            batch_size (int): Batch size
            shuffle (bool): Whether to shuffle data

        Returns:
            DataLoader: PyTorch DataLoader
        """
        dataset = TwitterSentimentDataset(
            texts=texts,
            labels=labels,
            tokenizer=self.tokenizer,
            max_length=self.config['max_length']
        )

        return DataLoader(
            dataset,
            batch_size=batch_size,
            shuffle=shuffle,
            num_workers=0  # Set to 0 for compatibility
        )

    def train(self, X_train, y_train, X_val=None, y_val=None):
        """
        Fine-tune BERT model on Twitter sentiment data

        Args:
            X_train (array-like): Training texts
            y_train (array-like): Training labels
            X_val (array-like): Validation texts (optional)
            y_val (array-like): Validation labels (optional)

        Returns:
            dict: Training history
        """
        logger.info("Starting BERT fine-tuning...")

        # Create data loaders
        train_loader = self.create_data_loader(
            X_train, y_train,
            self.config['batch_size'],
            shuffle=True
        )

        val_loader = None
        if X_val is not None and y_val is not None:
            val_loader = self.create_data_loader(
                X_val, y_val,
                self.config['batch_size'],
                shuffle=False
            )

        # Setup optimizer and scheduler
        optimizer = AdamW(
            self.model.parameters(),
            lr=self.config['learning_rate'],
            weight_decay=self.config['weight_decay']
        )

        total_steps = len(train_loader) * self.config['num_epochs']
        scheduler = get_linear_schedule_with_warmup(
            optimizer,
            num_warmup_steps=self.config['warmup_steps'],
            num_training_steps=total_steps
        )

        # Training loop
        self.model.train()
        self.training_history = []

        for epoch in range(self.config['num_epochs']):
            logger.info(f"Epoch {epoch + 1}/{self.config['num_epochs']}")

            total_loss = 0
            correct_predictions = 0
            total_predictions = 0

            # Training phase
            progress_bar = tqdm(train_loader, desc=f"Training Epoch {epoch + 1}")

            for batch in progress_bar:
                optimizer.zero_grad()

                input_ids = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)
                labels = batch['label'].to(device)

                # Forward pass
                outputs = self.model(
                    input_ids=input_ids,
                    attention_mask=attention_mask,
                    labels=labels
                )

                loss = outputs.loss
                logits = outputs.logits

                # Backward pass
                loss.backward()
                optimizer.step()
                scheduler.step()

                # Calculate accuracy
                predictions = torch.argmax(logits, dim=1)
                correct_predictions += torch.sum(predictions == labels).item()
                total_predictions += labels.size(0)
                total_loss += loss.item()

                # Update progress bar
                progress_bar.set_postfix({
                    'loss': f'{loss.item():.4f}',
                    'acc': f'{correct_predictions/total_predictions:.4f}'
                })

            # Calculate epoch metrics
            avg_train_loss = total_loss / len(train_loader)
            train_accuracy = correct_predictions / total_predictions

            epoch_results = {
                'epoch': epoch + 1,
                'train_loss': avg_train_loss,
                'train_accuracy': train_accuracy
            }

            # Validation phase
            if val_loader is not None:
                val_results = self._evaluate_on_loader(val_loader)
                epoch_results.update({
                    'val_loss': val_results['loss'],
                    'val_accuracy': val_results['accuracy']
                })

                logger.info(
                    f"Epoch {epoch + 1}: "
                    f"Train Loss: {avg_train_loss:.4f}, "
                    f"Train Acc: {train_accuracy:.4f}, "
                    f"Val Loss: {val_results['loss']:.4f}, "
                    f"Val Acc: {val_results['accuracy']:.4f}"
                )
            else:
                logger.info(
                    f"Epoch {epoch + 1}: "
                    f"Train Loss: {avg_train_loss:.4f}, "
                    f"Train Acc: {train_accuracy:.4f}"
                )

            self.training_history.append(epoch_results)

        self.is_trained = True
        logger.info("BERT fine-tuning completed!")

        return self.training_history

    def _evaluate_on_loader(self, data_loader):
        """
        Evaluate model on a data loader

        Args:
            data_loader (DataLoader): PyTorch DataLoader

        Returns:
            dict: Evaluation metrics
        """
        self.model.eval()

        total_loss = 0
        correct_predictions = 0
        total_predictions = 0

        with torch.no_grad():
            for batch in data_loader:
                input_ids = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)
                labels = batch['label'].to(device)

                outputs = self.model(
                    input_ids=input_ids,
                    attention_mask=attention_mask,
                    labels=labels
                )

                loss = outputs.loss
                logits = outputs.logits

                predictions = torch.argmax(logits, dim=1)
                correct_predictions += torch.sum(predictions == labels).item()
                total_predictions += labels.size(0)
                total_loss += loss.item()

        avg_loss = total_loss / len(data_loader)
        accuracy = correct_predictions / total_predictions

        self.model.train()

        return {
            'loss': avg_loss,
            'accuracy': accuracy
        }

    def predict(self, texts):
        """
        Make predictions on new texts

        Args:
            texts (array-like): Input texts

        Returns:
            array: Predicted labels
        """
        if not self.is_trained:
            raise ValueError("Model must be trained before making predictions")

        # Create temporary labels (not used for prediction)
        temp_labels = [0] * len(texts)

        # Create data loader
        data_loader = self.create_data_loader(
            texts, temp_labels,
            batch_size=self.config['batch_size'],
            shuffle=False
        )

        self.model.eval()
        predictions = []

        with torch.no_grad():
            for batch in tqdm(data_loader, desc="Making predictions"):
                input_ids = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)

                outputs = self.model(
                    input_ids=input_ids,
                    attention_mask=attention_mask
                )

                logits = outputs.logits
                batch_predictions = torch.argmax(logits, dim=1)
                predictions.extend(batch_predictions.cpu().numpy())

        return np.array(predictions)

    def predict_proba(self, texts):
        """
        Get prediction probabilities

        Args:
            texts (array-like): Input texts

        Returns:
            array: Prediction probabilities
        """
        if not self.is_trained:
            raise ValueError("Model must be trained before making predictions")

        # Create temporary labels
        temp_labels = [0] * len(texts)

        # Create data loader
        data_loader = self.create_data_loader(
            texts, temp_labels,
            batch_size=self.config['batch_size'],
            shuffle=False
        )

        self.model.eval()
        probabilities = []

        with torch.no_grad():
            for batch in tqdm(data_loader, desc="Getting probabilities"):
                input_ids = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)

                outputs = self.model(
                    input_ids=input_ids,
                    attention_mask=attention_mask
                )

                logits = outputs.logits
                batch_probs = torch.softmax(logits, dim=1)
                probabilities.extend(batch_probs.cpu().numpy())

        return np.array(probabilities)

    def evaluate(self, X_test, y_test, save_results=True):
        """
        Comprehensive model evaluation

        Args:
            X_test (array-like): Test texts
            y_test (array-like): Test labels
            save_results (bool): Whether to save results

        Returns:
            dict: Evaluation metrics
        """
        logger.info("Evaluating BERT model...")

        # Make predictions
        y_pred = self.predict(X_test)
        y_pred_proba = self.predict_proba(X_test)

        # Calculate metrics
        accuracy = accuracy_score(y_test, y_pred)
        report = classification_report(y_test, y_pred, output_dict=True)
        cm = confusion_matrix(y_test, y_pred)

        results = {
            'accuracy': accuracy,
            'precision': report['weighted avg']['precision'],
            'recall': report['weighted avg']['recall'],
            'f1_score': report['weighted avg']['f1-score'],
            'confusion_matrix': cm.tolist(),
            'classification_report': report
        }

        logger.info(f"Test accuracy: {accuracy:.4f}")
        logger.info(f"Test F1-score: {results['f1_score']:.4f}")

        # Save training history and evaluation plots
        if save_results and self.training_history:
            self._save_training_plots(cm, y_test, y_pred_proba)

        return results

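    def _save_training_plots(self, cm, y_test, y_pred_proba):
        """Save training history and evaluation plots"""
        plots_dir = RESULTS_DIR / 'bert_plots'
        plots_dir.mkdir(parents=True, exist_ok=True)

        # Convert labels to a NumPy array so the boolean masks used for the
        # confidence plots (y_test == 0, y_test == 1) also work for plain lists
        y_test = np.asarray(y_test)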

        # Training history plot
        epochs = [h['epoch'] for h in self.training_history]
        train_losses = [h['train_loss'] for h in self.training_history]
        train_accs = [h['train_accuracy'] for h in self.training_history]

        # Check if validation metrics exist
        has_val = 'val_loss' in self.training_history[0]

        plt.figure(figsize=(15, 5))

        # Loss plot
        plt.subplot(1, 3, 1)
        plt.plot(epochs, train_losses, 'b-', label='Training Loss')
        if has_val:
            val_losses = [h['val_loss'] for h in self.training_history]
            plt.plot(epochs, val_losses, 'r-', label='Validation Loss')
        plt.title('Training and Validation Loss')
        plt.xlabel('Epoch')
        plt.ylabel('Loss')
        plt.legend()
        plt.grid(True, alpha=0.3)

        # Accuracy plot
        plt.subplot(1, 3, 2)
        plt.plot(epochs, train_accs, 'b-', label='Training Accuracy')
        if has_val:
            val_accs = [h['val_accuracy'] for h in self.training_history]
            plt.plot(epochs, val_accs, 'r-', label='Validation Accuracy')
        plt.title('Training and Validation Accuracy')
        plt.xlabel('Epoch')
        plt.ylabel('Accuracy')
        plt.legend()
        plt.grid(True, alpha=0.3)

        # Confusion matrix
        plt.subplot(1, 3, 3)
        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                   xticklabels=['Negative', 'Positive'],
                   yticklabels=['Negative', 'Positive'])
        plt.title('BERT Model - Confusion Matrix')
        plt.ylabel('True Label')
        plt.xlabel('Predicted Label')

        plt.tight_layout()
        plt.savefig(plots_dir / 'training_history.png', dpi=300, bbox_inches='tight')
        plt.close()

        # Confidence distribution plot
        plt.figure(figsize=(12, 5))

        # Subplot 1: Confidence for negative predictions
        plt.subplot(1, 2, 1)
        neg_confidence = y_pred_proba[y_test == 0][:, 0]
        plt.hist(neg_confidence, bins=30, alpha=0.7, color='red', label='True Negative')
        pos_confidence_neg = y_pred_proba[y_test == 1][:, 0]
        plt.hist(pos_confidence_neg, bins=30, alpha=0.7, color='blue', label='True Positive')
        plt.xlabel('Confidence for Negative Class')
        plt.ylabel('Frequency')
        plt.title('Confidence Distribution - Negative Class')
        plt.legend()

        # Subplot 2: Confidence for positive predictions
        plt.subplot(1, 2, 2)
        neg_confidence_pos = y_pred_proba[y_test == 0][:, 1]
        plt.hist(neg_confidence_pos, bins=30, alpha=0.7, color='red', label='True Negative')
        pos_confidence = y_pred_proba[y_test == 1][:, 1]
        plt.hist(pos_confidence, bins=30, alpha=0.7, color='blue', label='True Positive')
        plt.xlabel('Confidence for Positive Class')
        plt.ylabel('Frequency')
        plt.title('Confidence Distribution - Positive Class')
        plt.legend()

        plt.tight_layout()
        plt.savefig(plots_dir / 'confidence_distribution.png', dpi=300, bbox_inches='tight')
        plt.close()

        logger.info(f"Training plots saved to {plots_dir}")

    def save_model(self, model_path=None):
        """
        Save the fine-tuned model

        Args:
            model_path (str): Path to save the model
        """
        if not self.is_trained:
            raise ValueError("Model must be trained before saving")

        if model_path is None:
            model_path = MODELS_DIR / 'bert_model'

        # Save model and tokenizer
        self.model.save_pretrained(model_path)
        self.tokenizer.save_pretrained(model_path)
        logger.info(f"BERT model saved to {model_path}")

    def load_model(self, model_path=None):
        """
        Load a previously fine-tuned model

        Args:
            model_path (str): Path to the saved model
        """
        if model_path is None:
            model_path = MODELS_DIR / 'bert_model'

        if not Path(model_path).exists():
            raise FileNotFoundError(f"Model directory not found: {model_path}")

        self.tokenizer = BertTokenizer.from_pretrained(model_path)
        self.model = BertForSequenceClassification.from_pretrained(model_path)
        self.model.to(device)
        self.is_trained = True
        logger.info(f"BERT model loaded from {model_path}")
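
The `train` method pairs the optimizer with `get_linear_schedule_with_warmup`, which scales the base learning rate by a factor that rises linearly during warmup and then decays linearly to zero. The sketch below reproduces that factor in plain Python to show what happens with this notebook's settings (`warmup_steps` of 0, `learning_rate` of 2e-5). The function `linear_schedule_factor` is a hypothetical stand-in written for illustration, not the library's implementation, and the step count of 2014 per epoch is taken from the training progress bars.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given step.

    Hypothetical re-implementation of a linear warmup followed by a
    linear decay to zero.
    """
    if step < num_warmup_steps:
        # Warmup: ramp from 0 up to 1.
        return step / max(1, num_warmup_steps)
    # Decay: ramp from 1 down to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 2e-5            # BERT_CONFIG['learning_rate']
total_steps = 2014 * 3    # batches per epoch (from the logs) * num_epochs

for step in (0, total_steps // 2, total_steps):
    factor = linear_schedule_factor(step, 0, total_steps)
    print(f"step {step}: lr = {base_lr * factor:.2e}")
```

With no warmup, training starts at the full learning rate and reaches zero on the last step. A nonzero `warmup_steps` would instead start near zero, which is sometimes used to stabilize the first updates of the newly initialized classifier head.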

## Main Execution

Train and evaluate the BERT model using preprocessed data.

In [5]:
# Load preprocessed data
data_path = DATA_DIR / 'processed_splits.pkl'
if not data_path.exists():
    logger.error(f"Processed data not found: {data_path}")
    logger.info("Please run data_preprocessing.ipynb first to generate the processed data")
else:
    # Load data splits
    data = joblib.load(data_path)
    X_train = data['X_train']
    X_val = data['X_val']
    X_test = data['X_test']
    y_train = data['y_train']
    y_val = data['y_val']
    y_test = data['y_test']

    logger.info("Loaded preprocessed data splits")

    # Initialize and train model
    bert_model = BERTSentimentModel()

    # Train the model
    training_history = bert_model.train(X_train, y_train, X_val, y_val)

    # Evaluate on test set
    test_results = bert_model.evaluate(X_test, y_test)

    # Save model
    bert_model.save_model()

    # Save results
    all_results = {
        'training_history': training_history,
        'test_results': test_results
    }

    joblib.dump(all_results, RESULTS_DIR / 'bert_results.pkl')

    logger.info("BERT model training and evaluation completed!")


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Training Epoch 1: 100%|██████████| 2014/2014 [11:16<00:00,  2.98it/s, loss=0.5836, acc=0.7424]
Training Epoch 2: 100%|██████████| 2014/2014 [11:17<00:00,  2.97it/s, loss=0.0435, acc=0.8169]
Training Epoch 3: 100%|██████████| 2014/2014 [11:16<00:00,  2.98it/s, loss=0.0676, acc=0.8818]
Making predictions: 100%|██████████| 576/576 [01:03<00:00,  9.02it/s]
Getting probabilities: 100%|██████████| 576/576 [01:03<00:00,  9.02it/s]
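
The batch counts in the progress bars can be tied back to the configuration. Since `create_data_loader` uses the default `drop_last=False`, a loader yields `ceil(n / batch_size)` batches, so each count bounds the number of examples in its split. The sketch below works through that arithmetic. The helper names are hypothetical, and the exact split sizes are not stated in the notebook, so only ranges can be inferred.

```python
import math


def num_batches(num_examples, batch_size):
    """Batches produced by a DataLoader with drop_last=False."""
    return math.ceil(num_examples / batch_size)


def example_count_range(batches, batch_size):
    """Smallest and largest dataset size that yields `batches` batches."""
    return (batches - 1) * batch_size + 1, batches * batch_size


batch_size = 16  # BERT_CONFIG['batch_size']

train_lo, train_hi = example_count_range(2014, batch_size)
test_lo, test_hi = example_count_range(576, batch_size)
print(f"training examples: {train_lo} to {train_hi}")
print(f"test examples:     {test_lo} to {test_hi}")

# Total optimizer steps passed to the scheduler: batches * num_epochs.
total_steps = 2014 * 3
print(f"total training steps: {total_steps}")
```

The loss shown in each progress bar is that of the last batch only, which is why it jumps between epochs, while the `acc` value is the running accuracy over the whole epoch.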


## Usage Instructions

1. Ensure the preprocessed data file (`processed_splits.pkl`), generated by running `data_preprocessing.ipynb`, is in the `DATA_DIR` folder on your Google Drive (`MyDrive/Colab Notebooks/Data_RO` in the configuration above).
2. Run all cells in sequence. The first cell mounts Google Drive and installs dependencies.
3. The trained model is saved as a directory named `bert_model` in `DATA_DIR/models`.
4. Evaluation plots (training history with the confusion matrix, and confidence distributions) are saved in `DATA_DIR/results/bert_plots`.
5. Results, including the training history and test metrics, are saved as `bert_results.pkl` in `DATA_DIR/results`.
6. Note: Fine-tuning BERT is computationally intensive. In the run logged above, each epoch of 2,014 batches took about 11 minutes. Enable a GPU runtime in Colab for faster processing.
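
The confidence distribution plots are built from the output of `predict_proba`, which applies a softmax to the model's logits, while `predict` takes the argmax of the same logits. The following pure-Python sketch shows how the two relate for a single example. The logit values are hypothetical, and the index order (0 for negative, 1 for positive) follows the labels used in the confusion matrix plot.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    # Subtract the maximum for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


logits = [-1.2, 2.3]  # hypothetical logits for [negative, positive]
probs = softmax(logits)

print("probabilities:", [round(p, 4) for p in probs])
print("predicted label:", argmax(logits))
```

Softmax preserves the ordering of the logits, so the argmax of the probabilities always matches the label returned by `predict`. The probability of the predicted class is what the histograms show as confidence.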