<a href="https://colab.research.google.com/github/Lisarika-kanchumarthi/LLMs/blob/main/Disaster_Tweet_Classification_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Disaster Tweet Classification using BERT

This notebook fine-tunes a BERT-based language model to classify tweets as either related to a disaster or not. It starts from the `bert-base-cased` checkpoint and trains on the Kaggle 'NLP with Disaster Tweets' dataset.

The following steps are covered in this notebook:
1. Mount Google Drive and Load Data
2. Data Preprocessing and Splitting
3. Tokenization using BERT Tokenizer
4. Dataset Wrapping and Model Setup
5. Training using Hugging Face Trainer
6. Evaluation Metrics (Accuracy, F1)
7. Inference with Sample Tweets


Install required libraries

In [None]:
!pip install -q --upgrade transformers datasets scikit-learn

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
sklearn-compat 0.1.3 requires scikit-learn<1.7,>=1.2, but you have scikit-learn 1.7.0 which is incompatible.

## Step 1: Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Step 2: Load the Dataset

In [None]:
import pandas as pd

# Loading the training data from Google Drive
file_path = '/content/drive/MyDrive/Large_Language_Models/train.csv'
df = pd.read_csv(file_path)

# Display the first few rows
df.head()


|   | id | keyword | location | text | target |
|---|----|---------|----------|------|--------|
| 0 | 1 | NaN | NaN | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | NaN | NaN | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | NaN | NaN | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | NaN | NaN | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | NaN | NaN | Just got sent this photo from Ruby #Alaska as ... | 1 |


## Step 3: Preprocess the Data

In [None]:
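# Check for missing values in each column
print(df.isnull().sum())

# Drop rows with missing text; .copy() avoids pandas' SettingWithCopyWarning
# when the new column is assigned below
df = df.dropna(subset=['text']).copy()

# Add a 'label' column (a copy of 'target') for classification
df['label'] = df['target']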

id             0
keyword       61
location    2533
text           0
target         0
dtype: int64


## Step 4: Train/Test Split

In [None]:
from sklearn.model_selection import train_test_split

# To split the dataset: 80% for training, 20% for validation
train_texts, val_texts, train_labels, val_labels = train_test_split(
    df['text'].tolist(),
    df['label'].tolist(),
    test_size=0.2,
    random_state=42
)

# To print the number of examples in each split
print(f"Training samples: {len(train_texts)}")
print(f"Validation samples: {len(val_texts)}")


Training samples: 6090
Validation samples: 1523
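
The split above is purely random, so the share of disaster tweets can differ slightly between the training and validation sets. The sketch below shows what a stratified split does, using only the standard library and hypothetical toy data; `stratified_split` is an illustrative helper, not part of this notebook. In practice the same effect should be available by passing `stratify=df['label']` to `train_test_split`, which would change the exact rows in each split and therefore the metrics reported later.

```python
import random
from collections import defaultdict

def stratified_split(texts, labels, test_size=0.2, seed=42):
    """Split so that each label keeps roughly the same share in both parts."""
    rng = random.Random(seed)

    # Group example indices by their label
    by_label = defaultdict(list)
    for i, y in enumerate(labels):
        by_label[y].append(i)

    train_idx, val_idx = [], []
    for idxs in by_label.values():
        rng.shuffle(idxs)
        n_val = round(len(idxs) * test_size)  # validation share for this label
        val_idx.extend(idxs[:n_val])
        train_idx.extend(idxs[n_val:])

    # Shuffle again so the labels are not grouped together
    rng.shuffle(train_idx)
    rng.shuffle(val_idx)

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(texts, train_idx), pick(texts, val_idx),
            pick(labels, train_idx), pick(labels, val_idx))

# Hypothetical toy data: 60 non-disaster and 40 disaster tweets
toy_labels = [0] * 60 + [1] * 40
toy_texts = [f"tweet {i}" for i in range(100)]

tr_x, va_x, tr_y, va_y = stratified_split(toy_texts, toy_labels)
print(len(tr_x), len(va_x))                          # 80 20
print(sum(tr_y) / len(tr_y), sum(va_y) / len(va_y))  # 0.4 0.4
```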


## Step 5: Tokenization with BERT

In [None]:
from transformers import BertTokenizer

# To load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

# To tokenize the training and validation texts
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=128)
val_encodings = tokenizer(val_texts, truncation=True, padding=True, max_length=128)
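
To make the `truncation`, `padding`, and `max_length` arguments concrete, here is a minimal pure-Python sketch of what they do to a batch of token ids. The helper `pad_and_truncate` and the token ids are hypothetical. The real tokenizer also handles special tokens when truncating and returns additional fields, which this sketch ignores.

```python
def pad_and_truncate(batch_ids, max_length=128, pad_id=0):
    """Cut each sequence to max_length, then pad to the longest remaining one."""
    truncated = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)

    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)

    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids, not real BERT vocabulary entries
toy_batch = [[101, 7, 8, 9, 102], [101, 7, 102]]
enc = pad_and_truncate(toy_batch, max_length=4)

print(enc["input_ids"])       # [[101, 7, 8, 9], [101, 7, 102, 0]]
print(enc["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

With `padding=True`, sequences are padded to the longest one in the call rather than to `max_length`, so the training and validation encodings may end up with different widths.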

## Step 6: Convert Encodings to Dataset Format

In [None]:
import torch

# Creating a PyTorch dataset class
class TweetDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item


# To create train and validation datasets
train_dataset = TweetDataset(train_encodings, train_labels)
val_dataset = TweetDataset(val_encodings, val_labels)


## Step 7: Fine-tune BERT with Trainer API

In [None]:
import os
from transformers import BertForSequenceClassification, Trainer, TrainingArguments
from sklearn.metrics import accuracy_score, f1_score
import numpy as np

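# Disable Weights & Biases logging to avoid login prompts.
# Note: this environment variable is deprecated (see the warning in the output below);
# newer transformers versions expect report_to="none" in TrainingArguments instead.
os.environ["WANDB_DISABLED"] = "true"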

# Loading the pre-trained BERT model with a classification head
model = BertForSequenceClassification.from_pretrained('bert-base-cased', num_labels=2)

# To define evaluation metrics
def compute_metrics(pred):
    labels = pred.label_ids
    preds = np.argmax(pred.predictions, axis=1)
    acc = accuracy_score(labels, preds)
    f1 = f1_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1
    }

# To define training arguments
training_args = TrainingArguments(
    output_dir='./results',                # output directory for model checkpoints
    num_train_epochs=3,                    # total number of training epochs
    per_device_train_batch_size=16,        # batch size per device during training
    per_device_eval_batch_size=16,         # batch size for evaluation
    warmup_steps=500,                      # number of warmup steps for learning rate scheduler
    weight_decay=0.01,                     # strength of weight decay
    logging_dir='./logs',                  # directory for storing logs
    eval_strategy="epoch",                 # evaluate at the end of each epoch
    save_strategy="epoch",                 # save a model checkpoint at each epoch
    logging_strategy="epoch"               # log the training loss at each epoch
)


# To initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics
)

# Start training
trainer.train()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


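| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|-------|---------------|-----------------|----------|----|
| 1 | 0.5377 | 0.420672 | 0.831911 | 0.784874 |
| 2 | 0.3795 | 0.476584 | 0.804334 | 0.778932 |
| 3 | 0.2298 | 0.551551 | 0.822062 | 0.790734 |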


TrainOutput(global_step=1143, training_loss=0.3823268686796841, metrics={'train_runtime': 446.1411, 'train_samples_per_second': 40.951, 'train_steps_per_second': 2.562, 'total_flos': 995207289123600.0, 'train_loss': 0.3823268686796841, 'epoch': 3.0})
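
The `global_step=1143` in the output can be reproduced from the split size and batch size, and it puts `warmup_steps=500` in perspective. The sketch below works through that arithmetic and a linear warmup followed by linear decay. It assumes a single device, no gradient accumulation, and the `TrainingArguments` defaults as I understand them (a learning rate of 5e-5 with a linear schedule); `lr_at` is an illustrative helper, not a library function.

```python
import math

n_train = 6090        # training samples reported in Step 4
batch_size = 16       # per_device_train_batch_size
epochs = 3            # num_train_epochs
warmup_steps = 500    # warmup_steps
base_lr = 5e-5        # assumed default learning rate

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
warmup_fraction = warmup_steps / total_steps

def lr_at(step):
    """Linear warmup to base_lr, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = (total_steps - step) / (total_steps - warmup_steps)
    return base_lr * max(0.0, remaining)

print(steps_per_epoch, total_steps)   # 381 1143
print(f"{warmup_fraction:.1%} of training is warmup")
for step in (0, 250, 500, 1143):
    print(step, lr_at(step))
```

Under these assumptions, warmup covers roughly 44% of all optimizer steps, so the learning rate only reaches its peak partway through the second epoch. A smaller `warmup_steps` value may be worth trying on a dataset of this size.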

## Step 8: Evaluate the Model

In [None]:
# Evaluate on validation set
trainer.evaluate()

{'eval_loss': 0.5814521312713623,
 'eval_accuracy': 0.8194353250164149,
 'eval_f1': 0.7879722436391673,
 'eval_runtime': 8.0452,
 'eval_samples_per_second': 189.304,
 'eval_steps_per_second': 11.933,
 'epoch': 3.0}
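
To show what `eval_accuracy` and `eval_f1` measure, here is a small pure-Python sketch that computes both from predictions, treating label 1 (disaster) as the positive class. This matches my understanding of the `f1_score` defaults used in `compute_metrics` (`average='binary'`, `pos_label=1`). The labels and predictions are hypothetical and `binary_metrics` is an illustrative helper.

```python
def binary_metrics(labels, preds, positive=1):
    """Accuracy, precision, recall and F1 for a binary classifier."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    correct = sum(1 for y, p in pairs if y == p)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical labels and predictions (1 = disaster, 0 = not disaster)
example_labels = [1, 1, 1, 0, 0, 0, 0, 1]
example_preds = [1, 0, 1, 0, 0, 1, 0, 1]

metrics = binary_metrics(example_labels, example_preds)
print(metrics)  # accuracy, precision, recall and f1 are all 0.75 here
```

F1 ignores true negatives, which is why it can sit below accuracy when the positive class is the harder one, as in the results above.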

## Step 9: Make Sample Predictions

In [None]:
# Sample disaster-related and non-disaster tweets
test_texts = [
    "Forest fire near La Ronge Sask. Canada",
    "What a beautiful day to relax in the sun!",
    "Explosion heard near downtown New York",
    "Enjoying a peaceful weekend with family",
    "Earthquake tremors felt across Los Angeles"
]

# Tokenize inputs with the same length limit used during training
test_encodings = tokenizer(test_texts, truncation=True, padding=True, max_length=128, return_tensors="pt")

# To move input tensors to the same device as the model
device = model.device
test_encodings = {key: val.to(device) for key, val in test_encodings.items()}

# To make predictions: switch to evaluation mode (disables dropout)
# and skip gradient tracking, which inference does not need
model.eval()
with torch.no_grad():
    outputs = model(**test_encodings)
preds = torch.argmax(outputs.logits, dim=1)

# To display results
for text, pred in zip(test_texts, preds):
    label = "DISASTER" if pred.item() == 1 else "NOT DISASTER"
    print(f"{text} → {label}")

Forest fire near La Ronge Sask. Canada → DISASTER
What a beautiful day to relax in the sun! → NOT DISASTER
Explosion heard near downtown New York → DISASTER
Enjoying a peaceful weekend with family → NOT DISASTER
Earthquake tremors felt across Los Angeles → DISASTER
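
The predictions above report only the winning class. Applying a softmax to the two logits turns them into probabilities, which gives a rough sense of how confident the model is. The sketch below uses hypothetical logit values rather than the model's actual outputs; with the real tensors, `torch.softmax(outputs.logits, dim=1)` should give the same result.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits in the order [NOT DISASTER, DISASTER]
example_logits = [
    [-1.2, 2.3],
    [1.8, -0.9],
]

for row in example_logits:
    probs = softmax(row)
    label = "DISASTER" if probs[1] > probs[0] else "NOT DISASTER"
    print(f"{label} (p_disaster={probs[1]:.3f})")
```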


## Conclusion

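After three epochs, the fine-tuned BERT model reached about 0.82 accuracy and 0.79 F1 on the validation set, which shows that the approach works reasonably well for short texts such as tweets. Validation loss rose from 0.42 after the first epoch to 0.55 after the third while training loss kept falling, which suggests the model starts to overfit after the first epoch. The implementation follows the standard fine-tuning procedure with Hugging Face's Transformers API and can be extended further using advanced methods like LoRA or prompt-tuning for efficiency or performance improvements.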
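
As a small follow-up, the per-epoch results logged in Step 7 can be used to decide which checkpoint to keep. The sketch below copies those numbers into a list and picks the best epoch by two different criteria. The `Trainer` can likely automate this through the `load_best_model_at_end` and `metric_for_best_model` training arguments, though that was not used in this notebook.

```python
# Per-epoch validation results copied from the training log in Step 7
history = [
    {"epoch": 1, "eval_loss": 0.420672, "accuracy": 0.831911, "f1": 0.784874},
    {"epoch": 2, "eval_loss": 0.476584, "accuracy": 0.804334, "f1": 0.778932},
    {"epoch": 3, "eval_loss": 0.551551, "accuracy": 0.822062, "f1": 0.790734},
]

# Lower validation loss is better; higher F1 is better
best_by_loss = min(history, key=lambda row: row["eval_loss"])
best_by_f1 = max(history, key=lambda row: row["f1"])

print("Best epoch by validation loss:", best_by_loss["epoch"])  # 1
print("Best epoch by F1:", best_by_f1["epoch"])                 # 3
```

The two criteria disagree here, so the choice depends on whether calibrated probabilities (loss) or hard classification quality (F1) matters more for the application.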
