# Use Case Demonstration: Selecting and Preparing Data for Fine-tuning

This notebook walks through selecting and preparing data for fine-tuning a BERT model that classifies emotion intensity in tweets.

## 1. Install Required Modules

First, let's install the necessary packages. The `!` prefix runs the command in the system shell rather than the Python interpreter.

In [None]:
# Install modules
# A '!' in a Jupyter Notebook runs the line in the system's shell, and not in the Python interpreter
!pip install nltk transformers torch scikit-learn pandas

## 2. Import Libraries

Import all the necessary libraries for data processing, model tokenization, and machine learning operations.

In [None]:
# Import necessary libraries
import pandas as pd
import nltk
import random
import torch
import re
from transformers import BertTokenizer
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, TensorDataset

## 3. Load and Preview Dataset

Load the tweet emotion intensity dataset. The code below expects the downloaded file to be saved as `data/tweet_emotion_intensity.csv`. You can download the dataset from: https://huggingface.co/datasets/stepp1/tweet_emotion_intensity/tree/main

In [None]:
# Load dataset 
# you can download this dataset from https://huggingface.co/datasets/stepp1/tweet_emotion_intensity/tree/main
data = pd.read_csv('data/tweet_emotion_intensity.csv')

# Preview the data
print(data.head())
print(f"\nDataset shape: {data.shape}")
print(f"\nColumn names: {data.columns.tolist()}")

## 4. Text Preprocessing

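Clean the text by lowercasing it, then removing text in square brackets, URLs, HTML tags, and finally every character that is not a lowercase letter or whitespace. That last step also strips digits, punctuation, emoji, and the `@` and `#` symbols, so mentions and hashtags survive only as plain words.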
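To make the effect of each rule concrete, the following minimal sketch applies the same regular expressions one at a time to a made-up tweet. The sample string and the `steps` list are illustrative assumptions, not rows from the dataset.

```python
import re

# Hypothetical sample tweet, invented for illustration
sample = "So HAPPY today!!! [photo] <b>Wow</b> https://example.com/abc #blessed"

# The same patterns as this section's cleaning function, applied in order
steps = [
    ("brackets", r'\[.*?\]'),
    ("urls", r'https?://\S+|www\.\S+'),
    ("html", r'<.*?>+'),
    ("non-letters", r'[^a-z\s]'),
]

text = sample.lower()
for name, pattern in steps:
    text = re.sub(pattern, '', text)
    print(f"after {name:<12}: {text!r}")

# Removing tokens leaves runs of spaces behind; collapsing them is optional
collapsed = ' '.join(text.split())
print(collapsed)
```

Because the URL pattern consumes every non-space character, a URL written directly against an HTML tag would swallow that tag as well; whether that matters depends on how the raw tweets are formatted.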

In [None]:
# Text preprocessing function
def clean_text(text):
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)  # Remove text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # Remove URLs
    text = re.sub(r'<.*?>+', '', text)  # Remove HTML tags
    text = re.sub(r'[^a-z\s]', '', text)  # Remove non-alphabet characters
    return text

# Apply cleaning function
data['cleaned_text'] = data['tweet'].apply(clean_text)

# Show examples of original vs cleaned text
print("Original vs Cleaned Text Examples:")
for i in range(3):
    print(f"\nExample {i+1}:")
    print(f"Original: {data['tweet'].iloc[i]}")
    print(f"Cleaned:  {data['cleaned_text'].iloc[i]}")

## 5. Tokenization with BERT

Use the BERT tokenizer to convert text into tokens that can be processed by the model.
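The tokenizer call in this section pads and truncates every sequence and returns an attention mask alongside the token IDs. The sketch below imitates only that bookkeeping in plain Python, so it runs without downloading a model. The helper `pad_and_truncate`, the toy ID lists, and the pad ID of 0 are assumptions for illustration; the real tokenizer also adds special tokens such as `[CLS]` and `[SEP]`, which are omitted here.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest remaining one."""
    truncated = [seq[:max_length] for seq in sequences]
    width = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = width - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

# Toy token IDs (made up); a real tokenizer would produce these from text
toy = [[11, 12, 13], [21], [31, 32, 33, 34, 35, 36]]
ids, mask = pad_and_truncate(toy, max_length=4)
for row_ids, row_mask in zip(ids, mask):
    print(row_ids, row_mask)
```

With `padding=True` the width is that of the longest sequence after truncation, which is why the printed `input_ids` shape can be narrower than `max_length`.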

In [None]:
# Tokenize using BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokens = tokenizer(data['cleaned_text'].tolist(), max_length=128, padding=True, truncation=True, return_tensors="pt")

print(f"Tokenized input shape: {tokens['input_ids'].shape}")
print(f"Attention mask shape: {tokens['attention_mask'].shape}")

# Show an example of tokenization
example_idx = 0
print(f"\nExample tokenization:")
print(f"Original text: {data['cleaned_text'].iloc[example_idx]}")
print(f"Token IDs: {tokens['input_ids'][example_idx][:20]}...")  # Show first 20 tokens
print(f"Decoded tokens: {tokenizer.decode(tokens['input_ids'][example_idx][:20])}")

## 6. Data Splitting

Split the dataset into training (70%), validation (15%), and test (15%) sets using stratified sampling to maintain class distribution.
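The 70/15/15 proportions come from two consecutive splits: first 30% of the rows are held out, then that held-out part is halved (0.3 * 0.5 = 0.15). The pure-Python sketch below illustrates the idea on made-up labels. `stratified_split` is a hypothetical helper written for this example and is not how scikit-learn implements `train_test_split`.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size, seed=42):
    """Return (train_idx, test_idx), sampling test_size of each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels (made up): 100 low, 60 medium, 40 high
labels = ["low"] * 100 + ["medium"] * 60 + ["high"] * 40

# Step 1: hold out 30% of the rows
train_idx, temp_idx = stratified_split(labels, test_size=0.3)
# Step 2: split the held-out rows in half, giving 15% validation and 15% test
temp_labels = [labels[i] for i in temp_idx]
val_pos, test_pos = stratified_split(temp_labels, test_size=0.5)
val_idx = [temp_idx[i] for i in val_pos]
test_idx = [temp_idx[i] for i in test_pos]

for name, idx in [("train", train_idx), ("val", val_idx), ("test", test_idx)]:
    print(name, len(idx), dict(Counter(labels[i] for i in idx)))
```

Each split keeps the 5:3:2 ratio of the toy classes, which is the property that `stratify=` is meant to preserve.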

In [None]:
# Split the dataset into training (70%), validation (15%), and test (15%) sets
train_data, temp_data = train_test_split(data, test_size=0.3, stratify=data['sentiment_intensity'], random_state=42)
val_data, test_data = train_test_split(temp_data, test_size=0.5, stratify=temp_data['sentiment_intensity'], random_state=42)

print(f"Training samples: {len(train_data)}")
print(f"Validation samples: {len(val_data)}")
print(f"Test samples: {len(test_data)}")

# Check class distribution in each split
print("\nClass distribution:")
print("Training set:")
print(train_data['sentiment_intensity'].value_counts().sort_index())
print("\nValidation set:")
print(val_data['sentiment_intensity'].value_counts().sort_index())
print("\nTest set:")
print(test_data['sentiment_intensity'].value_counts().sort_index())

## 7. Data Augmentation with Synonym Replacement

Augment the training data by replacing some words with their synonyms to increase dataset diversity.
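The replacement mechanism itself is independent of WordNet. The following sketch shows it with a small hand-written synonym table so that it runs offline; `TOY_SYNONYMS` and `augment` are hypothetical names introduced only for this illustration.

```python
import random

# Hypothetical synonym table standing in for a WordNet lookup
TOY_SYNONYMS = {"happy": "glad", "angry": "furious", "sad": "unhappy"}

def augment(text, replacement_prob, rng):
    """Replace each word that has a known synonym with probability replacement_prob."""
    out = []
    for word in text.split():
        if word in TOY_SYNONYMS and rng.random() < replacement_prob:
            out.append(TOY_SYNONYMS[word])
        else:
            out.append(word)
    return ' '.join(out)

sentence = "so happy and not angry or sad today"
rng = random.Random(42)  # a dedicated generator keeps the run reproducible
print(augment(sentence, 0.0, rng))  # probability 0: text is unchanged
print(augment(sentence, 1.0, rng))  # probability 1: every known word is replaced
print(augment(sentence, 0.5, rng))  # in between: the result depends on the seed
```

A synonym can be stronger or weaker than the word it replaces ("furious" versus "angry"), so for an intensity task it is worth reading a sample of augmented rows before training on them.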

In [None]:
# Download wordnet if this is your first time using it
nltk.download('wordnet')
from nltk.corpus import wordnet

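def synonym_replacement(word):
    """Replace a word with the first WordNet lemma that differs from it, if any"""
    # The first lemma of the first synset is often the word itself, so keep searching
    for synset in wordnet.synsets(word):
        for lemma in synset.lemmas():
            candidate = lemma.name().replace('_', ' ')  # multi-word lemmas use '_'
            if candidate.lower() != word.lower():
                return candidate
    return word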

# Apply synonym replacement on random words in the dataset
def augment_text(text, replacement_prob=0.2):
    """Augment text by randomly replacing words with synonyms"""
    words = text.split()
    augmented_words = [synonym_replacement(word) if random.random() < replacement_prob else word for word in words]
    return ' '.join(augmented_words)

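# Apply augmentation to the training split only, so validation and test data stay untouched
random.seed(42)  # For reproducibility
train_data = train_data.copy()  # Work on a copy to avoid pandas' SettingWithCopyWarning
train_data['augmented_text'] = train_data['cleaned_text'].apply(augment_text)

# Show examples of augmentation
print("Text Augmentation Examples:")
for i in range(3):
    print(f"\nExample {i+1}:")
    print(f"Original: {train_data['cleaned_text'].iloc[i]}")
    print(f"Augmented: {train_data['augmented_text'].iloc[i]}")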

## 8. Label Preparation and Dataset Creation

Convert sentiment intensity labels to numeric values and prepare the final dataset for model training.
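The mapping can also be expressed as a dictionary lookup. The sketch below puts two encodings side by side: the ordinal scores 0, 0.5, and 1 used in this section, and integer class indices, which is the form typically passed to classification losses such as PyTorch's `CrossEntropyLoss`. The names `SCORE_MAP` and `CLASS_MAP` and the sample labels are assumptions for illustration, and the integer encoding is an alternative rather than this notebook's choice.

```python
# Ordinal scores, matching the low/medium/high values used in this notebook
SCORE_MAP = {"low": 0.0, "medium": 0.5, "high": 1.0}
# Alternative (an assumption, not the notebook's choice): integer class indices
CLASS_MAP = {"low": 0, "medium": 1, "high": 2}

raw = ["high", "low", "medium", "unknown", "low"]  # made-up labels

# dict.get returns None for unexpected values, which can then be filtered out
scores = [SCORE_MAP.get(value) for value in raw]
classes = [CLASS_MAP.get(value) for value in raw]

kept = [(s, c) for s, c in zip(scores, classes) if s is not None]
print(scores)
print(classes)
print(kept)
```

Float scores suit a regression-style head with a single output, while integer indices suit a three-way classification head; which one fits depends on the model used for fine-tuning.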

In [None]:
# Reuse the token tensors produced in the tokenization step (they are already PyTorch tensors)
input_ids = tokens['input_ids']
attention_masks = tokens['attention_mask']

# Define a mapping function for sentiment intensity
def map_sentiment(value):
    """Map sentiment intensity categories to numeric values"""
    if value == "high":
        return 1
    elif value == "medium":
        return 0.5
    elif value == "low":
        return 0
    else:
        return None  # Handle unexpected values, if any

# Apply the function to each item in 'sentiment_intensity'
data['sentiment_intensity_numeric'] = data['sentiment_intensity'].apply(map_sentiment)

# Check for any null values after mapping
print(f"Null values after mapping: {data['sentiment_intensity_numeric'].isnull().sum()}")
print(f"Value distribution after mapping:")
print(data['sentiment_intensity_numeric'].value_counts().sort_index())

# Drop any rows where 'sentiment_intensity_numeric' is None
data = data.dropna(subset=['sentiment_intensity_numeric']).reset_index(drop=True)

# Convert the 'sentiment_intensity_numeric' column to a tensor
labels = torch.tensor(data['sentiment_intensity_numeric'].tolist(), dtype=torch.float32)

print(f"\nFinal dataset shape: {data.shape}")
print(f"Labels tensor shape: {labels.shape}")
print(f"Labels tensor dtype: {labels.dtype}")

## 9. Create DataLoader for Training

Create a PyTorch DataLoader that will be used during model training.
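A DataLoader groups the dataset into batches and, with `shuffle=True`, visits the samples in a new random order each epoch. When the last partial batch is kept, the number of batches is the sample count divided by the batch size, rounded up. The plain-Python sketch below mimics that behavior; `iter_batches` is a hypothetical helper for illustration, not part of PyTorch.

```python
import math
import random

def iter_batches(items, batch_size, shuffle=True, seed=42):
    """Yield lists of at most batch_size items, optionally in shuffled order."""
    order = list(range(len(items)))
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [items[i] for i in order[start:start + batch_size]]

samples = list(range(50))  # stand-in for 50 tokenized tweets
batches = list(iter_batches(samples, batch_size=16))

print(len(batches), math.ceil(len(samples) / 16))
print([len(batch) for batch in batches])
```

The final batch holds only the 2 leftover samples, which is why code that consumes batches should not assume every batch has exactly `batch_size` rows.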

In [None]:
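# Create DataLoader for training, using only the rows assigned to the training split
# (assumes `data` kept its default RangeIndex, so train_data.index holds row positions)
assert len(labels) == input_ids.shape[0], "labels and token tensors are misaligned"
train_idx = torch.as_tensor(train_data.index.to_numpy())
dataset = TensorDataset(input_ids[train_idx], attention_masks[train_idx], labels[train_idx])
dataloader = DataLoader(dataset, batch_size=16, shuffle=True)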

print(f"Dataset size: {len(dataset)}")
print(f"Number of batches: {len(dataloader)}")
print(f"Batch size: {dataloader.batch_size}")

# Show a sample batch
sample_batch = next(iter(dataloader))
print(f"\nSample batch shapes:")
print(f"Input IDs: {sample_batch[0].shape}")
print(f"Attention masks: {sample_batch[1].shape}")
print(f"Labels: {sample_batch[2].shape}")

# Show label distribution in sample batch
print(f"\nSample batch label distribution:")
unique_labels, counts = torch.unique(sample_batch[2], return_counts=True)
for label, count in zip(unique_labels, counts):
    print(f"Label {label.item()}: {count.item()} samples")

## 10. Summary and Next Steps

At this point, we have successfully:

1. ✅ Loaded and explored the tweet emotion intensity dataset
2. ✅ Cleaned and preprocessed the text data
3. ✅ Tokenized the text using BERT tokenizer
4. ✅ Split the data into train/validation/test sets
5. ✅ Applied data augmentation using synonym replacement
6. ✅ Converted labels to numeric format
7. ✅ Created PyTorch DataLoader for training

The data is now ready for fine-tuning a BERT model for emotion intensity classification!

In [None]:
# Final summary statistics
print("=== DATA PREPARATION SUMMARY ===")
print(f"Total samples: {len(data)}")
print(f"Training samples: {len(train_data)}")
print(f"Validation samples: {len(val_data)}")
print(f"Test samples: {len(test_data)}")
print("\nTokenization parameters:")
print("- Max sequence length: 128")
print("- Tokenizer: bert-base-uncased")
print("- Padding: True")
print("- Truncation: True")
print("\nDataLoader parameters:")
print(f"- Batch size: {dataloader.batch_size}")
print("- Shuffle: True")
print(f"- Number of batches: {len(dataloader)}")
print("\n🚀 Ready for model fine-tuning!")