<a href="https://colab.research.google.com/github/Malleshcr7/AI-ML-Projects/blob/main/DistilBERT_powered_Sentiment_Analysis_of_Airline_Tweets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Simple DistilBERT Sentiment Analysis

**Goal:** Build an easy-to-understand sentiment analysis model on a small subset of tweets (at most 5,000) so that training stays fast and simple.

**Dataset:** Twitter US Airline Sentiment (up to the first 5,000 rows)

**Model:** DistilBERT (a smaller, faster version of BERT)

In [None]:
# Step 1: Import required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, Trainer, TrainingArguments
import torch
import os

# Disable wandb logging to keep it simple
os.environ['WANDB_DISABLED'] = 'true'

print("✓ Libraries imported successfully!")

✓ Libraries imported successfully!


## Step 2: Load and Prepare Data

We'll load `Tweets.csv` and read at most the first **5,000 rows** to keep things simple and fast. `nrows=5000` is an upper bound: in the run shown here, 4,493 rows were loaded.

In [None]:
# Step 2: Load up to the first 5000 rows of data
df = pd.read_csv('Tweets.csv', nrows=5000)

# Display basic info
print(f"Total rows loaded: {len(df)}")
print(f"\nColumns: {list(df.columns)}")
print(f"\nSentiment distribution:")
print(df['airline_sentiment'].value_counts())

# Preview first few tweets
print("\n✓ Data loaded successfully!\n")
print("Sample tweets:")
print(df[['text', 'airline_sentiment']].head(3))

Total rows loaded: 4493

Columns: ['tweet_id', 'airline_sentiment', 'airline_sentiment_confidence', 'negativereason', 'negativereason_confidence', 'airline', 'airline_sentiment_gold', 'name', 'negativereason_gold', 'retweet_count', 'text', 'tweet_coord', 'tweet_created', 'tweet_location', 'user_timezone']

Sentiment distribution:
airline_sentiment
negative    2906
neutral      914
positive     673
Name: count, dtype: int64

✓ Data loaded successfully!

Sample tweets:
                                                text airline_sentiment
0                @VirginAmerica What @dhepburn said.           neutral
1  @VirginAmerica plus you've added commercials t...          positive
2  @VirginAmerica I didn't today... Must mean I n...           neutral


## Step 3: Preprocess Data

Keep only the positive and negative tweets (dropping the neutral ones), then convert the labels to binary format (0 = negative, 1 = positive).

In [None]:
# Step 3: Preprocess - Filter only positive/negative and create binary labels
df = df[df['airline_sentiment'].isin(['positive', 'negative'])].copy()

# Convert to binary: negative=0, positive=1
df['label'] = df['airline_sentiment'].map({'negative': 0, 'positive': 1})

# Keep only text and label columns
df = df[['text', 'label']]

print(f"After filtering: {len(df)} tweets")
print(f"\nLabel distribution:")
print(df['label'].value_counts())
print("\n✓ Data preprocessed successfully!")

After filtering: 3579 tweets

Label distribution:
label
0    2906
1     673
Name: count, dtype: int64

✓ Data preprocessed successfully!
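
The label counts above are imbalanced: negative tweets outnumber positive ones by more than four to one. As a rough reference point (not part of the original notebook), the sketch below computes the accuracy of a trivial classifier that always predicts the majority class, using the counts printed above. The variable names are illustrative.

```python
# Label counts copied from the output above (0 = negative, 1 = positive).
label_counts = {0: 2906, 1: 673}

total = sum(label_counts.values())
majority_label = max(label_counts, key=label_counts.get)

# Accuracy of always predicting the majority class on this subset.
baseline_accuracy = label_counts[majority_label] / total

print(f"Total tweets: {total}")
print(f"Majority label: {majority_label}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

A trained model is better judged against this baseline than against 50%, since predicting "negative" for every tweet is already right most of the time.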


## Step 4: Split Data and Load Model

Split the data into training (80%) and test (20%) sets, then load the DistilBERT tokenizer and model.

In [None]:
# Step 4: Split data into train/test sets
train_texts, test_texts, train_labels, test_labels = train_test_split(
    df['text'].values,
    df['label'].values,
    test_size=0.2,
    random_state=42
)

print(f"Training samples: {len(train_texts)}")
print(f"Testing samples: {len(test_texts)}")

# Load DistilBERT tokenizer and model
print("\nLoading DistilBERT model...")
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased',
    num_labels=2  # Binary classification: negative=0, positive=1
)

print("✓ Model and tokenizer loaded successfully!")

Training samples: 2863
Testing samples: 716

Loading DistilBERT model...


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✓ Model and tokenizer loaded successfully!
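
The split above is purely random, so the share of positive tweets in the test set can drift from the overall share. One option, shown here as a sketch and not what the notebook ran, is to pass `stratify=` to `train_test_split` so that both sets keep roughly the same class ratio. The placeholder texts and the `strat_` variable names are assumptions for illustration; only the label counts mirror the earlier output.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same class counts as the filtered DataFrame
# (2906 negative, 673 positive); the texts are placeholders.
labels = np.array([0] * 2906 + [1] * 673)
texts = np.array([f"tweet {i}" for i in range(len(labels))])

strat_train_texts, strat_test_texts, strat_train_labels, strat_test_labels = train_test_split(
    texts,
    labels,
    test_size=0.2,
    random_state=42,
    stratify=labels,  # keep the class ratio the same in both splits
)

print(f"Positive share overall: {labels.mean():.3f}")
print(f"Positive share in train: {strat_train_labels.mean():.3f}")
print(f"Positive share in test:  {strat_test_labels.mean():.3f}")
```

Switching to a stratified split would change which tweets land in each set, so the numbers reported in the rest of this notebook would no longer apply exactly.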


## Step 5: Tokenize Data and Create Dataset

Tokenize the text data and wrap the encodings in PyTorch datasets for training and evaluation.

In [None]:
# Step 5: Tokenize and create PyTorch Dataset
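class TweetDataset(torch.utils.data.Dataset):
    """Wraps tokenizer encodings and labels so the Trainer can index them."""

    def __init__(self, encodings, labels):
        self.encodings = encodings  # dict-like: input_ids, attention_mask
        self.labels = labels        # array of 0/1 labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Convert each field of the idx-th example to a tensor
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item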

# Tokenize texts (max length=128 for speed)
train_encodings = tokenizer(list(train_texts), truncation=True, padding=True, max_length=128)
test_encodings = tokenizer(list(test_texts), truncation=True, padding=True, max_length=128)

# Create datasets
train_dataset = TweetDataset(train_encodings, train_labels)
test_dataset = TweetDataset(test_encodings, test_labels)

print(f"✓ Created training dataset with {len(train_dataset)} samples")
print(f"✓ Created testing dataset with {len(test_dataset)} samples")

✓ Created training dataset with 2863 samples
✓ Created testing dataset with 716 samples
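
To make the effect of `truncation=True`, `padding=True`, and `max_length` concrete, here is a toy stand-in written in plain Python. It is not the tokenizer's implementation: the function name `pad_and_truncate`, the made-up token ids, and the pad id of 0 are assumptions for illustration. It mimics the visible behaviour: sequences longer than `max_length` are cut, shorter ones are padded to the longest sequence in the batch, and the attention mask marks real tokens with 1 and padding with 0.

```python
from typing import Dict, List


def pad_and_truncate(
    sequences: List[List[int]], max_length: int, pad_id: int = 0
) -> Dict[str, List[List[int]]]:
    """Toy illustration of truncation plus pad-to-longest (hypothetical helper)."""
    # Truncate first, so no sequence exceeds max_length.
    # Real tokenizers also keep special tokens such as [SEP] when truncating;
    # this toy version simply cuts the list.
    truncated = [seq[:max_length] for seq in sequences]
    # padding=True pads to the longest sequence in the batch, not to max_length.
    target_len = max(len(seq) for seq in truncated)

    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = target_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids for three "tweets" of different lengths.
batch = pad_and_truncate(
    [[101, 7, 8, 102], [101, 9, 102], [101, 1, 2, 3, 4, 5, 102]],
    max_length=5,
)
for ids, mask in zip(batch["input_ids"], batch["attention_mask"]):
    print(ids, mask)
```

Because tweets are short, most of them should fall well under the 128-token limit used above, so padding rather than truncation is the common case.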


## Step 6: Train the Model

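Set up the training parameters and train the model for 2 epochs (kept short for demonstration). Note that the test set is also used as the evaluation set during training, so it influences which checkpoint is restored at the end; the scores reported on it may therefore be slightly optimistic.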
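
As a quick sanity check on the schedule, the number of optimizer steps can be worked out from the training-set size (2,863 samples, from Step 4) together with the batch size, epoch count, and evaluation interval used in the cell below. This is a back-of-the-envelope sketch that assumes a single device and no gradient accumulation; the variable names are illustrative.

```python
import math

train_samples = 2863   # training set size printed in Step 4
batch_size = 16        # per_device_train_batch_size
epochs = 2             # num_train_epochs
eval_every = 50        # eval_steps
warmup_steps = 100

# Assumes one device and gradient_accumulation_steps=1.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs
num_evals = total_steps // eval_every

print(f"Steps per epoch: {steps_per_epoch}")
print(f"Total steps: {total_steps}")
print(f"Evaluations during training: {num_evals}")
print(f"Warmup share of training: {warmup_steps / total_steps:.0%}")
```

Under these assumptions training runs for 358 steps, so evaluating every 50 steps happens seven times, and the 100 warmup steps cover a bit over a quarter of the run.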

In [None]:
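# Upgrade transformers so that TrainingArguments accepts `eval_strategy`
# (older releases call this argument `evaluation_strategy`).
# If transformers was already imported in this session, restart the runtime
# after upgrading so that the new version is actually used.
!pip install --upgrade transformers

# Step 6: Set up training and train the model
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=2,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=100,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=50,
    load_best_model_at_end=True,  # Restore the best saved checkpoint when training ends
    eval_strategy='steps',        # Evaluate every `eval_steps` steps
    save_strategy='steps',        # Must match eval_strategy when load_best_model_at_end=True
    eval_steps=50,                # Same interval as logging_steps
)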


# Create Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

# Train!
print("\nStarting training...\n")
trainer.train()

# Evaluate
print("\n" + "="*50)
print("FINAL EVALUATION")
print("="*50)
eval_results = trainer.evaluate()
print(f"\nLoss: {eval_results['eval_loss']:.4f}")

# Get predictions
predictions = trainer.predict(test_dataset)
pred_labels = np.argmax(predictions.predictions, axis=1)

# No compute_metrics function was passed to the Trainer, so eval_results
# holds no accuracy entry; compute accuracy from the predictions instead.
print(f"Accuracy: {accuracy_score(test_labels, pred_labels):.2f}")

print("\nClassification Report:")
print(classification_report(test_labels, pred_labels, target_names=['negative', 'positive']))

print("\n✓ Training complete!")

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).

Starting training...

Step  Training Loss  Validation Loss
  50         0.5365         0.395107
 100         0.3017         0.336748
 150         0.2618         0.194964
 200         0.1757         0.194228
 250         0.1049         0.304255
 300         0.1425         0.21325
 350         0.1065         0.217746

FINAL EVALUATION

Loss: 0.2179
Accuracy: 0.94

Classification Report:
              precision    recall  f1-score   support

    negative       0.95      0.98      0.96       579
    positive       0.89      0.78      0.83       137

    accuracy                           0.94       716
   macro avg       0.92      0.88      0.90       716
weighted avg       0.94      0.94      0.94       716


✓ Training complete!
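
The `Trainer` above was created without a `compute_metrics` function, so `trainer.evaluate()` returns the loss but no accuracy. One possible way to have accuracy reported at every evaluation step is sketched below. The function receives the model's logits and the true labels and returns a dictionary of metrics; the name `compute_accuracy` and the toy logits are illustrative, and this was not part of the run shown above.

```python
import numpy as np


def compute_accuracy(eval_pred):
    """Turn (logits, labels) into an accuracy metric dictionary."""
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit in each row.
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": float((preds == labels).mean())}


# Toy check with made-up logits for four examples (two classes).
toy_logits = np.array([[2.0, -1.0], [0.1, 0.3], [1.5, 0.2], [-0.5, 0.4]])
toy_labels = np.array([0, 1, 1, 1])
print(compute_accuracy((toy_logits, toy_labels)))
```

Passing `compute_metrics=compute_accuracy` when constructing the `Trainer` should then expose the value as `eval_results['eval_accuracy']`, since the Trainer prefixes evaluation metric names with `eval_`.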
