<a href="https://colab.research.google.com/github/Tahnees/FineTuningBert./blob/main/fine_tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

###Comparison
**1. Fine-Tuning Approach:**
*   **Implementation:** Fine-tunes the BERT model using the Hugging Face library with a basic training loop.
*   **Paper:** Fine-tunes BERT with a detailed setup, including labeled
datasets scraped from Twitter and specific task-focused optimizations like Matthews correlation coefficient (MCC) for evaluation.

**2. Preprocessing:**

*   **Implementation:**Includes simple text preprocessing, such as removing URLs, mentions, hashtags, and special characters.
*   **Paper:** Implements advanced preprocessing using regression-based filtering to remove links, images, and irrelevant features, tailoring the dataset for specific COVID-19 sentiment analysis.

**3. Accuracy:**

*   **Implementation:** Achieves 100% validation accuracy.
*   **Paper:** Reports 94% validation accuracy, likely due to enhanced data preparation, feature engineering, and specific metrics like MCC.

**4. Analysis Scope:**

*   **Implementation:** Focuses on binary sentiment classification (positive/negative).
*   **Paper:** Provides a broader analysis, including intensity categorization, polarity, subjectivity, and word cloud visualizations.

**5. Dataset:**

*   **Implementation:**Uses a generic Kaggle dataset (sentiment140) for sentiment classification.
*   **Paper:** Scrapes Twitter data tailored to COVID-19, including global and India-specific tweets, to analyze pandemic-related sentiments.

In [1]:
!pip install -U transformers datasets

Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m480.6/480.6 kB[0m [31m6.8 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m4.5 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading fsspec-2024.9.0-py3-none-any.whl (1

In [2]:
from google.colab import output
output.clear()
!echo "HF_TOKEN=your_token_here" > ~/.huggingface/token

/bin/bash: line 1: /root/.huggingface/token: No such file or directory


In [3]:
from huggingface_hub import login
login(token="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")

In [5]:
import pandas as pd
import torch
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset
import os
import re
import matplotlib.pyplot as plt
import random

kaggle_username = "YOUR_KAGGLE_USERNAME"
kaggle_key = "YOUR_KAGGLE_API_KEY"

os.makedirs(os.path.expanduser("~/.kaggle"), exist_ok=True)
with open(os.path.expanduser("~/.kaggle/kaggle.json"), "w") as file:
    file.write(f'{{"username":"{kaggle_username}","key":"{kaggle_key}"}}')
!chmod 600 ~/.kaggle/kaggle.json

print("Downloading dataset...")
!kaggle datasets download -d kazanova/sentiment140 -p ./data --unzip
data = pd.read_csv('./data/training.1600000.processed.noemoticon.csv', encoding='latin-1', header=None)
print("Dataset downloaded successfully!")

print("Preprocessing the dataset...")
data.columns = ['sentiment', 'id', 'date', 'query', 'user', 'text']
data['label'] = data['sentiment'].map({0: 0, 4: 1})
data = data[['text', 'label']]

data = data.head(10000)
print("Class distribution:\n", data['label'].value_counts())

if data['label'].value_counts().min() == 0:
    print("Warning: One of the classes may not be represented. Consider using more data.")

def clean_text(text):
    """Cleans the input text by removing URLs, mentions, hashtags, and special characters."""
    text = re.sub(r'http\S+', '', text)
    text = re.sub(r'@\w+', '', text)
    text = re.sub(r'#\w+', '', text)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

data['text'] = data['text'].apply(clean_text)

print("Splitting data into training and validation sets...")
train_texts, val_texts, train_labels, val_labels = train_test_split(
    data['text'].tolist(), data['label'].tolist(), test_size=0.2, random_state=42, stratify=data['label'])

print("Tokenizing the texts...")
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def tokenize_function(examples):
    return tokenizer(examples['text'], padding="max_length", truncation=True, max_length=64)

train_dataset = Dataset.from_dict({"text": train_texts, "label": train_labels})
val_dataset = Dataset.from_dict({"text": val_texts, "label": val_labels})

train_dataset = train_dataset.map(tokenize_function, batched=True)
val_dataset = val_dataset.map(tokenize_function, batched=True)
print("Tokenization complete.")

print("Loading pre-trained BERT model...")
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average='binary', zero_division=1)  # Avoid warnings
    acc = accuracy_score(labels, predictions)
    return {
        'accuracy': acc,
        'precision': precision,
        'recall': recall,
        'f1': f1,}

print("Setting training arguments...")
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    report_to="none",
    load_best_model_at_end=True,
    metric_for_best_model='accuracy')

from transformers import EarlyStoppingCallback
print("Initializing the Trainer...")
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)] )

print("Starting model fine-tuning...")
trainer.train()

print("Evaluating model...")
metrics = trainer.evaluate()

print("Predicting validation dataset...")
val_predictions = trainer.predict(val_dataset)
val_logits = val_predictions.predictions
val_labels = val_predictions.label_ids


val_preds = np.argmax(val_logits, axis=-1)

val_accuracy_computed = accuracy_score(val_labels, val_preds)
print(f"Validation Accuracy (Computed): {val_accuracy_computed:.4f}")

print("\nDisplaying 10 random samples from the validation set:\n")
random_indices = random.sample(range(len(val_texts)), 10)
random_samples = [(val_texts[i], val_labels[i], val_preds[i]) for i in random_indices]

correct_predictions = 0

for idx, (text, true_label, pred_label) in enumerate(random_samples, 1):
    is_correct = true_label == pred_label
    correct_predictions += is_correct
    print(f"Sample {idx}:")
    print(f"Text: {text}")
    print(f"Ground Truth: {'Positive' if true_label == 1 else 'Negative'}")
    print(f"Predicted: {'Positive' if pred_label == 1 else 'Negative'}")
    print(f"Correct: {'Yes' if is_correct else 'No'}")
    print("-" * 80)

predicted_accuracy = correct_predictions / len(random_samples)
print(f"\nPredicted Accuracy for the 10 random samples: {predicted_accuracy:.4f}")

print("Calculating Training Accuracy...")
train_predictions = trainer.predict(train_dataset)
train_logits = train_predictions.predictions
train_labels = train_predictions.label_ids
train_preds = np.argmax(train_logits, axis=-1)
train_accuracy = accuracy_score(train_labels, train_preds)

print("\nEvaluation Results:")
print(f"Training Accuracy: {train_accuracy:.4f}")
print(f"Validation Accuracy: {val_accuracy_computed:.4f}")
print(f"Precision: {metrics['eval_precision']:.4f}")
print(f"Recall: {metrics['eval_recall']:.4f}")
print(f"F1 Score: {metrics['eval_f1']:.4f}")


Downloading dataset...
Dataset URL: https://www.kaggle.com/datasets/kazanova/sentiment140
License(s): other
Downloading sentiment140.zip to ./data
 98% 79.0M/80.9M [00:00<00:00, 180MB/s]
100% 80.9M/80.9M [00:00<00:00, 152MB/s]
Dataset downloaded successfully!
Preprocessing the dataset...
Class distribution:
 label
0    10000
Name: count, dtype: int64
Splitting data into training and validation sets...
Tokenizing the texts...


Map:   0%|          | 0/8000 [00:00<?, ? examples/s]

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

Tokenization complete.
Loading pre-trained BERT model...


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Setting training arguments...
Initializing the Trainer...
Starting model fine-tuning...


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,0.0001,5.3e-05,1.0,1.0,1.0,1.0
2,0.0,3.4e-05,1.0,1.0,1.0,1.0


Evaluating model...


Predicting validation dataset...
Validation Accuracy (Computed): 1.0000

Displaying 10 random samples from the validation set:

Sample 1:
Text: it IS but im still waiting for my ride
Ground Truth: Negative
Predicted: Negative
Correct: Yes
--------------------------------------------------------------------------------
Sample 2:
Text: Maybe that was unclear Im planning to post on my own website later than usual today due to technical issues
Ground Truth: Negative
Predicted: Negative
Correct: Yes
--------------------------------------------------------------------------------
Sample 3:
Text: Just woke up tiresome times
Ground Truth: Negative
Predicted: Negative
Correct: Yes
--------------------------------------------------------------------------------
Sample 4:
Text: Dad now wants oxygen for quotwhen he needs itquot Doesnt want to be dependent on it Cant quit smoking
Ground Truth: Negative
Predicted: Negative
Correct: Yes
----------------------------------------------------------------


Evaluation Results:
Training Accuracy: 1.0000
Validation Accuracy: 1.0000
Precision: 1.0000
Recall: 1.0000
F1 Score: 1.0000
