Transformer-BERT Take Home Assessment
24CS60R49
SHAILJA PATIL

Task-1: Prepare dataset

Task-2: Preprocessing Social Media Posts

In [None]:
# Install required libraries
!pip install emoji transformers scikit-learn

Collecting emoji
  Downloading emoji-2.14.1-py3-none-any.whl.metadata (5.7 kB)
Downloading emoji-2.14.1-py3-none-any.whl (590 kB)
Installing collected packages: emoji
Successfully installed emoji-2.14.1


In [None]:


import pandas as pd
import re
import emoji
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Load the dataset
df = pd.read_csv("Dataset_MisinformationData.csv")

# Drop any missing values
df.dropna(subset=['tweet', 'label'], inplace=True)

# Label encoding (e.g., fake -> 0, real -> 1)
le = LabelEncoder()
df['label'] = le.fit_transform(df['label'])

# Preprocessing function
def preprocess(text):
    text = text.lower()  # Lowercase
    text = emoji.demojize(text)  # Convert emojis to text
    text = re.sub(r"http\S+|www\.\S+", "", text)  # Remove URLs (dot escaped to match a literal ".")
    text = re.sub(r"#\S+", "", text)  # Remove hashtags
    text = re.sub(r"\s+", " ", text).strip()  # Clean up whitespace
    return text

# Apply preprocessing
df['tweet'] = df['tweet'].apply(preprocess)

# Split the data
train_texts, test_texts, train_labels, test_labels = train_test_split(
    df['tweet'], df['label'], test_size=0.2, random_state=42, stratify=df['label'])

val_texts, test_texts, val_labels, test_labels = train_test_split(
    test_texts, test_labels, test_size=0.5, random_state=42, stratify=test_labels)

# Output summary
print(f"Train size: {len(train_texts)}, Validation size: {len(val_texts)}, Test size: {len(test_texts)}")


Train size: 7774, Validation size: 972, Test size: 972
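
The reported sizes follow from the two-stage split: 20% of the rows are held out, and that hold-out is then halved into validation and test. A minimal sketch of the arithmetic, assuming the held-out count is rounded up and the remainder goes to training (my understanding of how scikit-learn treats a float `test_size`), and assuming a total of 9718 rows, which is inferred from the three printed sizes rather than stated in the notebook:

```python
import math

def split_sizes(n_samples, test_fraction):
    """Return (n_kept, n_held_out) for a float hold-out fraction.

    Assumption: the held-out count is rounded up and the rest is kept.
    """
    n_held_out = math.ceil(test_fraction * n_samples)
    return n_samples - n_held_out, n_held_out

total = 7774 + 972 + 972                        # inferred from the printed sizes
n_train, n_holdout = split_sizes(total, 0.2)    # first split: train vs. hold-out
n_val, n_test = split_sizes(n_holdout, 0.5)     # second split: validation vs. test
print(n_train, n_val, n_test)
```

Under these assumptions the result is an 80/10/10 split that matches the sizes printed above.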


Print sample data

In [None]:
# Display first 5 rows of the full preprocessed dataset
print("Sample of full dataset (preprocessed):")
print(df.head())

# Display samples from each split
print("\nSample training data:")
for i in range(3):
    print(f"{i+1}. Tweet: {train_texts.iloc[i]}\n   Label: {train_labels.iloc[i]}")

print("\nSample validation data:")
for i in range(3):
    print(f"{i+1}. Tweet: {val_texts.iloc[i]}\n   Label: {val_labels.iloc[i]}")

print("\nSample test data:")
for i in range(3):
    print(f"{i+1}. Tweet: {test_texts.iloc[i]}\n   Label: {test_labels.iloc[i]}")

Sample of full dataset (preprocessed):
                                               tweet  label
0  the cdc currently reports 99031 deaths. in gen...      1
1  states reported 1121 deaths a small rise from ...      1
2  politically correct woman (almost) uses pandem...      0
3  we have 1524 testing laboratories in india and...      1
4  populous states can generate large case counts...      1

Sample training data:
1. Tweet: professor chris whitty says the uk must take "very seriously" for the next six months and there's "no evidence" the virus is currently a milder form of the one in april. read more:
   Label: 1
2. Tweet: the identification on feb. 26 2020 of a patient with covid-19 &amp; no travel history indicated the likelihood of community spread. until late feb. incidence was too low to be detected by emergency department surveillance. more from @cdcmmwr:
   Label: 1
3. Tweet: a video of a woman reciting sanskrit verses has been viewed thousands of times on facebook and twitt
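
The regex steps of the preprocessing function can be checked in isolation on a hand-written string. The sketch below is only an approximation of `preprocess`: the emoji step is left out so that it needs nothing beyond the standard library, the dot in `www\.` is escaped so that it matches a literal period, and the sample tweet is made up for illustration.

```python
import re

def strip_urls_and_hashtags(text):
    """Lowercase, drop URLs and hashtags, then collapse whitespace."""
    text = text.lower()
    text = re.sub(r"http\S+|www\.\S+", "", text)   # URLs
    text = re.sub(r"#\S+", "", text)               # hashtags, including the tag text
    return re.sub(r"\s+", " ", text).strip()

sample = "Read more: https://example.com/a?b=1 #COVID19  Stay safe"  # made-up tweet
cleaned = strip_urls_and_hashtags(sample)
print(cleaned)
```

Note that `#\S+` removes the tag text as well as the `#`, so any topical word carried only by a hashtag is lost to the model.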

Task-3: Obtaining Representations using BERT-based Models

In [None]:
!pip install transformers datasets

Collecting datasets
  Downloading datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.12.0,>=2023.1.0 (from fsspec[http]<=2024.12.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.12.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.5.0-py3-none-any.whl (491 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.12.0-py3-none-any.

In [None]:
from transformers import AutoTokenizer
from datasets import Dataset
import torch

# Set the model names

model_names = {
    "bert": "bert-base-uncased",
    "covid": "digitalepidemiologylab/covid-twitter-bert",
    "twhin": "Twitter/twhin-bert-base",
    "socbert": "sarkerlab/SocBERT-base"
}


# Convert splits to Hugging Face datasets
train_dataset = Dataset.from_dict({"text": train_texts.tolist(), "label": train_labels.tolist()})
val_dataset = Dataset.from_dict({"text": val_texts.tolist(), "label": val_labels.tolist()})
test_dataset = Dataset.from_dict({"text": test_texts.tolist(), "label": test_labels.tolist()})

# Tokenization function
def tokenize_function(example, tokenizer):
    return tokenizer(example["text"], truncation=True, padding="max_length", max_length=128)

# Dictionary to store tokenized data
tokenized_data = {}

# Tokenize using all models
for key, model_name in model_names.items():
    print(f"Tokenizing with {key} model...")
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    train_tok = train_dataset.map(lambda x: tokenize_function(x, tokenizer), batched=True)
    val_tok = val_dataset.map(lambda x: tokenize_function(x, tokenizer), batched=True)
    test_tok = test_dataset.map(lambda x: tokenize_function(x, tokenizer), batched=True)

    # Set format for PyTorch
    train_tok.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])
    val_tok.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])
    test_tok.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])

    tokenized_data[key] = {
        "train": train_tok,
        "val": val_tok,
        "test": test_tok
    }


Tokenizing with bert model...
Tokenizing with covid model...
Tokenizing with twhin model...
Tokenizing with socbert model...

Print sample data

In [None]:
# Choose one model to preview (e.g., 'bert')
preview_key = 'bert'

# Show one sample from the tokenized training dataset
sample = tokenized_data[preview_key]['train'][0]

print("Sample tokenized input:")
print("Input IDs:", sample['input_ids'])
print("Attention Mask:", sample['attention_mask'])
print("Label:", sample['label'])

# Optional: decode the input IDs to see actual tokens
tokenizer = AutoTokenizer.from_pretrained(model_names[preview_key])
decoded_text = tokenizer.decode(sample['input_ids'], skip_special_tokens=True)
print("\nDecoded Text:", decoded_text)


Sample tokenized input:
Input IDs: tensor([  101,  2934,  3782,  1059, 16584,  3723,  2758,  1996,  2866,  2442,
         2202,  1000,  2200,  5667,  1000,  2005,  1996,  2279,  2416,  2706,
         1998,  2045,  1005,  1055,  1000,  2053,  3350,  1000,  1996,  7865,
         2003,  2747,  1037, 10256,  2121,  2433,  1997,  1996,  2028,  1999,
         2258,  1012,  3191,  2062,  1024,   102,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0, 
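
The long run of zeros in the tensor above is padding: `padding="max_length"` extends every sequence to 128 IDs, and the attention mask marks which positions hold real tokens. A toy sketch of that idea in plain Python, assuming a pad ID of 0 as seen in the BERT output (other tokenizers may use a different pad ID) and using simple slicing for truncation, which is cruder than what a real tokenizer does with its special tokens:

```python
def pad_to_max_length(token_ids, max_length=128, pad_id=0):
    """Truncate or right-pad token_ids and build the matching attention mask."""
    ids = list(token_ids)[:max_length]
    mask = [1] * len(ids)                 # 1 = real token
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad   # 0 = padding

# Short made-up sequence: [CLS], two word pieces, [SEP]
ids, mask = pad_to_max_length([101, 2934, 3782, 102], max_length=8)
print(ids)
print(mask)
```

The model ignores positions where the mask is 0, so padding changes the tensor shape but not the content being attended to.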

Task-4: Training Classifiers with Hyperparameter Tuning

Task-5: Evaluating Models.

In [None]:
!pip install transformers datasets scikit-learn evaluate

Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.3


In [None]:
import evaluate
import numpy as np
import torch
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer, EarlyStoppingCallback
from sklearn.metrics import (
    accuracy_score,
    precision_recall_fscore_support,
    f1_score,
    classification_report,
    confusion_matrix
)

# Evaluation function
def compute_metrics(pred):
    labels = pred.label_ids
    preds = np.argmax(pred.predictions, axis=1)

    acc = accuracy_score(labels, preds)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    f1_micro = f1_score(labels, preds, average='micro')
    f1_macro = f1_score(labels, preds, average='macro')

    return {
        'accuracy': acc,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'f1_micro': f1_micro,
        'f1_macro': f1_macro,
    }

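# Hyperparameters: one fixed configuration shared by all four models
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # newer transformers releases name this argument `eval_strategy`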
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=1.5,
    weight_decay=0.01,
    logging_dir="./logs",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)

# Storage for final results
final_results = {}

# Loop through each model
for key, model_name in model_names.items():
    print(f"\n🔧 Training model: {key} ({model_name})")

    # Load model
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

    # Get tokenized data
    data = tokenized_data[key]

    # Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=data["train"],
        eval_dataset=data["val"],
        compute_metrics=compute_metrics,
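        # With 1.5 epochs there are at most two epoch-end evaluations,
        # so a patience of 2 cannot cut training short.
        callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],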
    )

    # Train
    trainer.train()

    # Evaluate
    predictions = trainer.predict(data["test"])
    preds = np.argmax(predictions.predictions, axis=1)
    labels = predictions.label_ids

    # Compute metrics manually for final evaluation
    acc = accuracy_score(labels, preds)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    f1_micro = f1_score(labels, preds, average='micro')
    f1_macro = f1_score(labels, preds, average='macro')
    confusion = confusion_matrix(labels, preds)

    final_results[key] = {
        "accuracy": acc,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "f1_micro": f1_micro,
        "f1_macro": f1_macro,
        "confusion_matrix": confusion,
        # The single fixed configuration used above; no search was run.
        "best_hyperparameters": {
            "learning_rate": training_args.learning_rate,
            "batch_size": training_args.per_device_train_batch_size,
            "epochs": training_args.num_train_epochs
        }
    }

    # Print summary
    print(f"\n✅ Evaluation for {key}:")
    print("Accuracy:", acc)
    print("Precision:", precision)
    print("Recall:", recall)
    print("F1-score:", f1)
    print("F1-score (micro):", f1_micro)
    print("F1-score (macro):", f1_macro)
    print("Confusion Matrix:\n", confusion)
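
The cell above trains every model with one fixed configuration. If an actual search were wanted for Task-4, a small grid could be enumerated along the following lines. This is only a sketch: `train_and_evaluate` is a hypothetical callable that would wrap the `Trainer` setup and return the validation F1, the grid values other than `2e-5` and `16` are illustrative, and `fake_scorer` is a stand-in so that the example runs without a GPU.

```python
from itertools import product

def grid_search(train_and_evaluate, learning_rates, batch_sizes, epochs):
    """Try every combination and return (best_config, best_score)."""
    best_config, best_score = None, float("-inf")
    for lr, bs, ep in product(learning_rates, batch_sizes, epochs):
        config = {"learning_rate": lr, "batch_size": bs, "epochs": ep}
        score = train_and_evaluate(config)   # expected: validation F1
        if score > best_score:               # ties keep the earlier config
            best_config, best_score = config, score
    return best_config, best_score

def fake_scorer(config):
    """Stand-in for real training; peaks at lr=2e-5 with batch size 16."""
    penalty = 0.01 if config["batch_size"] == 32 else 0.0
    return 1.0 - abs(config["learning_rate"] - 2e-5) * 1e4 - penalty

best_config, best_score = grid_search(fake_scorer, [1e-5, 2e-5, 3e-5], [16, 32], [1, 2])
print(best_config, best_score)
```

Selecting on the validation split and reporting the test split only once, for the chosen configuration, keeps the test scores from being used for tuning.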





🔧 Training model: bert (bert-base-uncased)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Epoch  Training Loss  Validation Loss  Accuracy  Precision  Recall    F1        F1 Micro  F1 Macro
1      0.2114         0.180142         0.949588  0.933837   0.972441  0.952748  0.949588  0.949362



✅ Evaluation for bert:
Accuracy: 0.9475308641975309
Precision: 0.9253731343283582
Recall: 0.9783037475345168
F1-score: 0.9511025886864813
F1-score (micro): 0.9475308641975309
F1-score (macro): 0.9472494075507878
Confusion Matrix:
 [[425  40]
 [ 11 496]]
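
The scalar metrics above can all be recovered from the confusion matrix. A minimal sketch, assuming the scikit-learn layout `[[TN, FP], [FN, TP]]` with label 1 as the positive class:

```python
def binary_metrics(confusion):
    """Derive metrics from a 2x2 matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = confusion
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    f1_negative = 2 * tn / (2 * tn + fp + fn)   # F1 with class 0 treated as positive
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "f1_macro": (f1 + f1_negative) / 2,     # unweighted mean over both classes
    }

bert = binary_metrics([[425, 40], [11, 496]])   # matrix printed for the bert model
for name, value in bert.items():
    print(f"{name}: {value:.4f}")
```

For single-label classification the micro-averaged F1 equals accuracy, which is why those two lines of the output are identical.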

🔧 Training model: covid (digitalepidemiologylab/covid-twitter-bert)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at digitalepidemiologylab/covid-twitter-bert and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch  Training Loss  Validation Loss  Accuracy  Precision  Recall    F1        F1 Micro  F1 Macro
1      0.1497         0.146527         0.967078  0.974104   0.962598  0.968317  0.967078  0.967028



✅ Evaluation for covid:
Accuracy: 0.9722222222222222
Precision: 0.9669260700389105
Recall: 0.980276134122288
F1-score: 0.9735553379040157
F1-score (micro): 0.9722222222222222
F1-score (macro): 0.9721514501004369
Confusion Matrix:
 [[448  17]
 [ 10 497]]

🔧 Training model: twhin (Twitter/twhin-bert-base)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at Twitter/twhin-bert-base and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch  Training Loss  Validation Loss  Accuracy  Precision  Recall    F1        F1 Micro  F1 Macro
1      0.2067         0.17202          0.959877  0.96994    0.952756  0.961271  0.959877  0.959824



✅ Evaluation for twhin:
Accuracy: 0.9742798353909465
Precision: 0.9725490196078431
Recall: 0.9783037475345168
F1-score: 0.9754178957718781
F1-score (micro): 0.9742798353909465
F1-score (macro): 0.9742245897413867
Confusion Matrix:
 [[451  14]
 [ 11 496]]

🔧 Training model: socbert (sarkerlab/SocBERT-base)


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at sarkerlab/SocBERT-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch  Training Loss  Validation Loss  Accuracy  Precision  Recall    F1        F1 Micro  F1 Macro
1      0.2043         0.15622          0.953704  0.962076   0.948819  0.955401  0.953704  0.953637



✅ Evaluation for socbert:
Accuracy: 0.9567901234567902
Precision: 0.9497098646034816
Recall: 0.9684418145956607
F1-score: 0.958984375
F1-score (micro): 0.9567901234567902
F1-score (macro): 0.9566661005434782
Confusion Matrix:
 [[439  26]
 [ 16 491]]
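
To compare the four runs side by side, the test-set scores can be collected and ranked. The sketch below copies the values from the evaluation output above, rounded to four decimals; ranking by binary F1 is one reasonable choice and not something the notebook itself specifies.

```python
# Test-set scores copied from the evaluation output, rounded to four decimals.
test_scores = {
    "bert":    {"accuracy": 0.9475, "f1": 0.9511, "f1_macro": 0.9472},
    "covid":   {"accuracy": 0.9722, "f1": 0.9736, "f1_macro": 0.9722},
    "twhin":   {"accuracy": 0.9743, "f1": 0.9754, "f1_macro": 0.9742},
    "socbert": {"accuracy": 0.9568, "f1": 0.9590, "f1_macro": 0.9567},
}

# Rank model keys by binary F1, best first.
ranking = sorted(test_scores, key=lambda k: test_scores[k]["f1"], reverse=True)
for rank, key in enumerate(ranking, start=1):
    s = test_scores[key]
    print(f"{rank}. {key:<8} acc={s['accuracy']:.4f}  f1={s['f1']:.4f}  macro-f1={s['f1_macro']:.4f}")
```

The gap between the top two models is small: the confusion matrices show 25 misclassified test tweets for twhin against 27 for covid, so this ordering may not hold under a different random seed or split.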
