<img src="https://i.imgur.com/FNi8CFE.png">
<center><h1>COVID-19: EDA and Text Analysis</h1></center>


### **Project Overview: Sentiment Classification of COVID-19 Tweets using DistilBERT**

This notebook demonstrates a complete NLP pipeline to classify the sentiment of tweets related to COVID-19 using the **DistilBERT** transformer model. We fine-tune a pre-trained model on a labeled dataset, compare it against a TF-IDF + Logistic Regression baseline, and evaluate it on a validation set and a held-out test set.

### **Objectives:**

- Fine-tune a transformer model (DistilBERT) on tweet sentiment data.
- Accurately classify tweets as **Positive**, **Negative**, or **Neutral**.
- Visualize model performance and save the final model for deployment.

### **Workflow Steps:**

1. **Load and Explore the Dataset**  
   Load the training dataset, inspect its structure, and visualize class distribution.

2. **Preprocess Text Data**  
   Clean tweets by removing links, mentions, and non-letter characters, then normalize case and whitespace.

3. **Encode Sentiment Labels**  
   Convert sentiment labels (Positive, Negative, Neutral) into numerical values suitable for modeling.

4. **Tokenization & Input Processing for DistilBERT**  
   Tokenize the cleaned tweets with `DistilBertTokenizerFast` and wrap them in Hugging Face `Dataset` objects formatted as PyTorch tensors for training and validation.

5. **Train DistilBERT Model on Tweets**  
   Fine-tune DistilBERT on the training data with the Hugging Face `Trainer`, which minimizes cross-entropy loss with an AdamW optimizer by default, and evaluate the model on the validation set after each epoch.

6. **Evaluate DistilBERT on the Validation Set**  
   Measure model performance using classification metrics (precision, recall, F1-score) and visualize the confusion matrix.

7. **Save Trained Models**  
   Save the fine-tuned model and tokenizer using `save_pretrained()` for future use or deployment.


### **Importing Libraries**


In [1]:
import re
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import torch
from datasets import Dataset
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tqdm import tqdm
from transformers import TrainingArguments

# Use the GPU when one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
warnings.filterwarnings("ignore", category=UserWarning)

Using device: cuda


### **Data Loading**


In [2]:
train_df = pd.read_csv(r"/content/Corona_NLP_train.csv",
                       header=0, encoding='cp437')
test_df = pd.read_csv(r"/content/Corona_NLP_test.csv",
                      header=0, encoding='cp437')
train_df.head()

| | UserName | ScreenName | Location | TweetAt | OriginalTweet | Sentiment |
|---|---|---|---|---|---|---|
| 0 | 3799 | 48751 | London | 16-03-2020 | @MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i... | Neutral |
| 1 | 3800 | 48752 | UK | 16-03-2020 | advice Talk to your neighbours family to excha... | Positive |
| 2 | 3801 | 48753 | Vagabonds | 16-03-2020 | Coronavirus Australia: Woolworths to give elde... | Positive |
| 3 | 3802 | 48754 | | 16-03-2020 | My food stock is not the only one which is emp... | Positive |
| 4 | 3803 | 48755 | | 16-03-2020 | Me, ready to go at supermarket during the #COV... | Extremely Negative |


### **Exploratory Data Analysis (EDA)**


In [3]:
print(" Train Data Shape:", train_df.shape)
print(" Test Data Shape:", test_df.shape)

 Train Data Shape: (41157, 6)
 Test Data Shape: (3798, 6)


In [4]:
cols_to_drop = ["UserName", "ScreenName", "Location", "TweetAt"]
train_df = train_df.drop(columns=cols_to_drop)
test_df = test_df.drop(columns=cols_to_drop)

In [5]:
pd.DataFrame(train_df['Sentiment'].value_counts())

| Sentiment | count |
|---|---|
| Positive | 11422 |
| Negative | 9917 |
| Neutral | 7713 |
| Extremely Positive | 6624 |
| Extremely Negative | 5481 |


In [6]:
sentiment_mapping_3class = {
    "Extremely Negative": "Negative",
    "Negative": "Negative",
    "Neutral": "Neutral",
    "Positive": "Positive",
    "Extremely Positive": "Positive"
}

train_df["Sentiment"] = train_df["Sentiment"].map(sentiment_mapping_3class)
test_df["Sentiment"] = test_df["Sentiment"].map(sentiment_mapping_3class)
pd.DataFrame(train_df['Sentiment'].value_counts())

| Sentiment | count |
|---|---|
| Positive | 18046 |
| Negative | 15398 |
| Neutral | 7713 |
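
After collapsing five classes into three, it is worth checking that `Series.map` did not silently produce `NaN` for a label missing from the dictionary. A minimal sketch on a small hypothetical `Series` (the sample values are illustrative, not rows from the dataset):

```python
import pandas as pd

# Same 5-class to 3-class mapping as in the notebook
sentiment_mapping_3class = {
    "Extremely Negative": "Negative",
    "Negative": "Negative",
    "Neutral": "Neutral",
    "Positive": "Positive",
    "Extremely Positive": "Positive",
}

# Hypothetical sample labels, one per original class
sample = pd.Series(
    ["Extremely Negative", "Positive", "Neutral", "Extremely Positive", "Negative"]
)
mapped = sample.map(sentiment_mapping_3class)

# Any label absent from the dictionary would show up as NaN here
unmapped = int(mapped.isna().sum())
print("Unmapped labels:", unmapped)
print(mapped.tolist())
```

Running the same `isna().sum()` check on `train_df["Sentiment"]` and `test_df["Sentiment"]` right after mapping would catch typos or unexpected label values early.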


In [7]:
sentiment_distribution = train_df["Sentiment"].value_counts(
    normalize=True) * 100

print("Sentiment Class Distribution :\n")
print(sentiment_distribution.round(2).astype(str) + " %")

Sentiment Class Distribution :

Sentiment
Positive    43.85 %
Negative    37.41 %
Neutral     18.74 %
Name: proportion, dtype: object


In [8]:
train_df.head()

| | OriginalTweet | Sentiment |
|---|---|---|
| 0 | @MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i... | Neutral |
| 1 | advice Talk to your neighbours family to excha... | Positive |
| 2 | Coronavirus Australia: Woolworths to give elde... | Positive |
| 3 | My food stock is not the only one which is emp... | Positive |
| 4 | Me, ready to go at supermarket during the #COV... | Negative |


## Text Cleaning and Preprocessing

**This step focuses on preparing the raw tweet text for modeling by performing the following:**

- **Text normalization**: Convert all text to lowercase for consistency.
- **Noise removal**:
  - Remove URLs (e.g., links to articles or tweets).
  - Remove mentions (`@username`) and strip the `#` symbol from hashtags while keeping the word.
  - Remove digits, punctuation, and any other non-letter characters.
  - Remove extra white spaces and newlines.

**After Cleaning:**

- **Create a new column** `clean_text` that contains the processed version of the tweets.
- **Encode the sentiment labels** into numerical format:
  - `Negative → 0`
  - `Neutral → 1`
  - `Positive → 2`
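
The integer codes listed above are consistent with how `LabelEncoder` behaves: it sorts the class names and numbers them in that order. A small sketch on hypothetical labels (the input list is made up for illustration):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# Hypothetical labels; the order of appearance does not affect the codes
codes = le.fit_transform(["Positive", "Negative", "Neutral", "Positive"])

# classes_ holds the sorted class names; the position is the integer code
label_to_id = {str(name): int(idx) for idx, name in enumerate(le.classes_)}
print(label_to_id)
print(codes.tolist())
```

Because the order is alphabetical rather than semantic, any hard-coded id-to-name dictionary used later has to match `le.classes_`.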


In [9]:
def clean_text(text):
    text = text.lower()                          # convert text to lowercase
    text = re.sub(r"http\S+|www\S+", "", text)   # remove URLs
    # remove user mentions (@username)
    text = re.sub(r"@\w+", "", text)
    # remove hashtag symbol but keep the word
    text = re.sub(r"#", "", text)
    text = re.sub(r"[^a-z\s]", "", text)         # keep only letters and spaces
    # remove extra spaces and trim text
    text = re.sub(r"\s+", " ", text).strip()
    return text


train_df["clean_text"] = train_df["OriginalTweet"].apply(clean_text)
test_df["clean_text"] = test_df["OriginalTweet"].apply(clean_text)

le = LabelEncoder()
train_df["label"] = le.fit_transform(train_df["Sentiment"])
display(train_df[["clean_text", "Sentiment", "label"]].head())

| | clean_text | Sentiment | label |
|---|---|---|---|
| 0 | and and | Neutral | 1 |
| 1 | advice talk to your neighbours family to excha... | Positive | 2 |
| 2 | coronavirus australia woolworths to give elder... | Positive | 2 |
| 3 | my food stock is not the only one which is emp... | Positive | 2 |
| 4 | me ready to go at supermarket during the covid... | Negative | 0 |
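
To see what each cleaning step does to a single tweet, here is the same function applied to a made-up example (the sample tweet is hypothetical). Digits are removed along with punctuation, so `#COVID19` becomes `covid`:

```python
import re


def clean_text(text):
    text = text.lower()                          # lowercase
    text = re.sub(r"http\S+|www\S+", "", text)   # remove URLs
    text = re.sub(r"@\w+", "", text)             # remove @mentions
    text = re.sub(r"#", "", text)                # drop '#', keep the word
    text = re.sub(r"[^a-z\s]", "", text)         # keep only letters and whitespace
    text = re.sub(r"\s+", " ", text).strip()     # collapse whitespace
    return text


# Hypothetical tweet used only to illustrate the cleaning steps
sample = "Stock up NOW!! @user https://t.co/abc #COVID19 prices +20%"
cleaned = clean_text(sample)
print(cleaned)  # stock up now covid prices
```

Row 0 of the output above shows a side effect of this approach: a tweet made up almost entirely of mentions and links is reduced to a couple of leftover words.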


### **Split the Training Data into Train & Validation Sets**


In [10]:
# Define features (X) and labels (y)
X = train_df["clean_text"]
y = train_df["label"]

# Split the data: 80% train, 20% validation
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Display the sizes of each split
print(f"Training set size: {len(X_train)}")
print(f"Validation set size: {len(X_val)}")

Training set size: 32925
Validation set size: 8232
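
The split above is purely random. Since Neutral makes up under a fifth of the tweets, one option (not used in this notebook) is to pass `stratify=y` so that both splits keep the same class proportions. A sketch on synthetic labels with roughly the same class mix; the arrays here are placeholders, not the real data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels approximating the class mix: 37% / 19% / 44%
y = np.array([0] * 37 + [1] * 19 + [2] * 44)
X = np.arange(len(y))  # placeholder features

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Per-class counts in each split stay proportional to the full set
train_counts = np.bincount(y_tr)
val_counts = np.bincount(y_va)
print("train:", train_counts, "val:", val_counts)
```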


In [11]:
train_data = pd.DataFrame(
    {"clean_text": X_train.values, "label": y_train.values})

val_data = pd.DataFrame({"clean_text": X_val.values, "label": y_val.values})

### **Feature Engineering for Logistic Regression**

We use **TF-IDF** to convert cleaned text into numerical features for Logistic Regression. It assigns higher weight to important terms that appear frequently in a document but rarely across the corpus.

### Key Parameters:

- **max_features=10000**: Keeps only the 10,000 most frequent terms in the vocabulary.
- **ngram_range=(1, 2)**: Uses unigrams and bigrams.
- **stop_words='english'**: Removes common English stopwords.


In [12]:

tfidf_vectorizer = TfidfVectorizer(
    max_features=10000, ngram_range=(1, 2), stop_words='english')

# Fit the vectorizer on the training data and transform it.
# X_train_tfidf has shape (num_samples, n_features), with n_features <= 10000
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)

# Transform the validation and test sets using the same vectorizer
X_val_tfidf = tfidf_vectorizer.transform(X_val)
X_test_tfidf = tfidf_vectorizer.transform(test_df["clean_text"])

### **Logistic Regression Model**


In [13]:

log_reg = LogisticRegression(
    class_weight='balanced', max_iter=1000, random_state=42)

# Train the model on the training data
log_reg.fit(X_train_tfidf, y_train)

# Predict the validation set
val_preds = log_reg.predict(X_val_tfidf)

# Print accuracy and classification report

print("Validation Accuracy:", accuracy_score(y_val, val_preds))
print("Classification Report:\n", classification_report(y_val, val_preds))

Validation Accuracy: 0.760689990281827
Classification Report:
               precision    recall  f1-score   support

           0       0.80      0.74      0.77      3062
           1       0.57      0.80      0.66      1553
           2       0.85      0.76      0.80      3617

    accuracy                           0.76      8232
   macro avg       0.74      0.77      0.75      8232
weighted avg       0.78      0.76      0.77      8232
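
The `class_weight='balanced'` setting used above weights each class by `n_samples / (n_classes * class_count)`, so rarer classes count more in the loss. A rough illustration using the full training-set counts reported earlier (the model itself was fit on the 80% split, so its actual weights differ slightly):

```python
import numpy as np

classes = ["Negative", "Neutral", "Positive"]
counts = np.array([15398, 7713, 18046])  # 3-class counts from the full training set

# scikit-learn's "balanced" heuristic: n_samples / (n_classes * count per class)
weights = counts.sum() / (len(classes) * counts)
class_weights = dict(zip(classes, weights.round(3).tolist()))
print(class_weights)
```

Neutral receives roughly twice the weight of the other classes, which is consistent with its high recall (0.80) and lower precision (0.57) in the report above.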



### **DistilBERT Model**

DistilBERT has 40% fewer parameters than BERT, runs about 60% faster, and achieves over 95% of BERT’s performance on common NLP tasks.

In this project, DistilBERT is fine-tuned for sentiment classification, using its contextual understanding of text to classify tweets into sentiment categories while keeping training and inference fast.


In [14]:
from transformers import (
    DistilBertTokenizerFast,
    DistilBertForSequenceClassification,
    TrainingArguments,
    Trainer)

### **Tokenizer**


In [15]:
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")


### **Converting Pandas DataFrame into HuggingFace Dataset**


In [16]:
train_ds = Dataset.from_pandas(train_data)
val_ds = Dataset.from_pandas(val_data)

In [17]:
def tokenize_fn(batch):
    return tokenizer(batch["clean_text"], padding="max_length", truncation=True, max_length=128)
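
The effect of `padding="max_length"` and `truncation=True` can be pictured with a toy stand-in that needs no model download. This is a simplified sketch for a single sequence, not the tokenizer's actual implementation; `pad_or_truncate` is a hypothetical helper and the content token ids are arbitrary, while 101, 102, and 0 are the `[CLS]`, `[SEP]`, and `[PAD]` ids in the uncased BERT vocabulary:

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # [CLS], [SEP], [PAD]


def pad_or_truncate(token_ids, max_length):
    """Mimic padding="max_length" with truncation=True for one sequence."""
    body = token_ids[: max_length - 2]           # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)        # 1 = real token
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,  # 0 = padding
    }


short = pad_or_truncate([7592, 2088], max_length=6)
long = pad_or_truncate(list(range(1000, 1010)), max_length=6)
print(short)
print(long)
```

With `max_length=128`, every tweet becomes a fixed-length pair of `input_ids` and `attention_mask`, which makes batching simple at the cost of computing over padding for short tweets.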

### **Apply Tokenizer on Data**


In [18]:
train_ds = train_ds.map(tokenize_fn, batched=True)
val_ds = val_ds.map(tokenize_fn, batched=True)


In [19]:
train_ds = train_ds.remove_columns(["clean_text"])
val_ds = val_ds.remove_columns(["clean_text"])

In [20]:
train_ds.set_format("torch")
val_ds.set_format("torch")

In [21]:
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=3)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [55]:

training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=5,
    weight_decay=0.03,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    save_total_limit=2,
    fp16=True,
    dataloader_num_workers=2,
    logging_steps=50,
    report_to="none"
)

In [56]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=1)
    return {"accuracy": accuracy_score(labels, preds), "f1": f1_score(labels, preds, average="weighted")}
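
To check what `compute_metrics` returns, it can be called directly on a small hand-made batch. The logits and labels below are invented for illustration; `argmax` picks the highest-scoring class per row:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=1)  # predicted class = index of largest logit
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds, average="weighted"),
    }


# Hypothetical logits for 4 examples and 3 classes
logits = np.array([
    [2.0, 0.1, -1.0],   # -> class 0
    [0.2, 0.3, 1.5],    # -> class 2
    [0.1, 1.2, 0.3],    # -> class 1
    [-0.5, 0.4, 0.9],   # -> class 2 (true label is 1)
])
labels = np.array([0, 2, 1, 1])

metrics = compute_metrics((logits, labels))
print(metrics)
```

Three of the four predictions are correct, so accuracy is 0.75. The weighted F1 averages the per-class F1 scores using class support as weights.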

In [57]:
from transformers import EarlyStoppingCallback

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]  # Stops if no improvement for 2 epochs
)



### **Training**


In [58]:
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.0049 | 0.806244 | 0.891642 | 0.891566 |
| 2 | 0.027 | 0.795309 | 0.894315 | 0.89445 |
| 3 | 0.0048 | 0.796458 | 0.896137 | 0.896384 |
| 4 | 0.0052 | 0.797227 | 0.902818 | 0.902434 |


TrainOutput(global_step=4116, training_loss=0.01597241235523488, metrics={'train_runtime': 546.4862, 'train_samples_per_second': 301.243, 'train_steps_per_second': 9.415, 'total_flos': 4361566881715200.0, 'train_loss': 0.01597241235523488, 'epoch': 4.0})
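
The `global_step=4116` in the output can be reproduced by hand from the training-set size and batch size. This assumes a single device, no gradient accumulation, and that the last partial batch is kept:

```python
import math

train_size = 32925   # rows in the training split
batch_size = 32      # per_device_train_batch_size
epochs_run = 4       # training stopped after epoch 4

steps_per_epoch = math.ceil(train_size / batch_size)
total_steps = steps_per_epoch * epochs_run
print(steps_per_epoch, total_steps)  # 1029 4116
```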

### **Evaluation**


In [59]:
trainer.evaluate()

{'eval_loss': 0.7953088283538818,
 'eval_accuracy': 0.8943148688046647,
 'eval_f1': 0.8944498762281359,
 'eval_runtime': 7.1981,
 'eval_samples_per_second': 1143.628,
 'eval_steps_per_second': 35.843,
 'epoch': 4.0}
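
The reported `eval_loss` matches epoch 2 rather than epoch 4 because `load_best_model_at_end=True` restores the checkpoint with the lowest validation loss, and early stopping with a patience of 2 ended training after two epochs without improvement. The patience logic can be sketched as follows; `early_stopping_trace` is a hypothetical helper that simplifies the callback, not its real code:

```python
def early_stopping_trace(eval_losses, patience):
    """Return (best_epoch, stopped_at_epoch) for a list of per-epoch losses."""
    best_loss = float("inf")
    best_epoch = None
    bad_epochs = 0
    for epoch, loss in enumerate(eval_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch  # stop training here
    return best_epoch, len(eval_losses)


# Validation losses from the training log above
val_losses = [0.806244, 0.795309, 0.796458, 0.797227]
best_epoch, stopped_at = early_stopping_trace(val_losses, patience=2)
print(best_epoch, stopped_at)  # 2 4
```

Accuracy and F1 kept rising through epoch 4 while validation loss did not, so selecting the checkpoint by `eval_loss` picks a different epoch than selecting by accuracy would.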

### **Testing**


In [60]:
test_ds = Dataset.from_pandas(test_df[["clean_text"]])

test_ds = test_ds.map(tokenize_fn, batched=True)
test_ds = test_ds.remove_columns(["clean_text"])
test_ds.set_format("torch")


In [61]:
test_preds = trainer.predict(test_ds)
test_labels = test_preds.predictions.argmax(axis=1)

In [62]:
label_map = {0: "Negative", 1: "Neutral", 2: "Positive"}
test_df["Predicted_Sentiment"] = [label_map[i] for i in test_labels]

In [67]:
from sklearn.metrics import accuracy_score, classification_report

y_true = test_df["Sentiment"]
y_pred = test_df["Predicted_Sentiment"]

print(f"Test Accuracy: {accuracy_score(y_true, y_pred) * 100:.2f}%")
print(classification_report(y_true, y_pred))

Test Accuracy: 86.81%
              precision    recall  f1-score   support

    Negative       0.85      0.92      0.88      1633
     Neutral       0.78      0.80      0.79       619
    Positive       0.93      0.84      0.88      1546

    accuracy                           0.87      3798
   macro avg       0.85      0.85      0.85      3798
weighted avg       0.87      0.87      0.87      3798
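
A confusion matrix shows which classes are mistaken for each other, which the per-class scores above only hint at. A minimal sketch with made-up labels (in the notebook, `test_df["Sentiment"]` and `test_df["Predicted_Sentiment"]` would take the place of the two sample lists):

```python
from sklearn.metrics import confusion_matrix

class_names = ["Negative", "Neutral", "Positive"]

# Hypothetical true and predicted labels for six tweets
y_true_demo = ["Negative", "Negative", "Neutral", "Positive", "Positive", "Neutral"]
y_pred_demo = ["Negative", "Neutral", "Neutral", "Positive", "Negative", "Neutral"]

# Rows are true classes, columns are predicted classes, in class_names order
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=class_names)
print(cm)
```

To plot the matrix, `ConfusionMatrixDisplay(cm, display_labels=class_names).plot()` from `sklearn.metrics` can be used.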



In [64]:
comparison_df = test_df[["OriginalTweet", "Sentiment", "Predicted_Sentiment"]]
comparison_df.head(10)

| | OriginalTweet | Sentiment | Predicted_Sentiment |
|---|---|---|---|
| 0 | TRENDING: New Yorkers encounter empty supermar... | Negative | Negative |
| 1 | When I couldn't find hand sanitizer at Fred Me... | Positive | Positive |
| 2 | Find out how you can protect yourself and love... | Positive | Positive |
| 3 | #Panic buying hits #NewYork City as anxious sh... | Negative | Negative |
| 4 | #toiletpaper #dunnypaper #coronavirus #coronav... | Neutral | Neutral |
| 5 | Do you remember the last time you paid $2.99 a... | Neutral | Neutral |
| 6 | Voting in the age of #coronavirus = hand sanit... | Positive | Positive |
| 7 | @DrTedros "We can’t stop #COVID19 without pro... | Neutral | Positive |
| 8 | HI TWITTER! I am a pharmacist. I sell hand san... | Negative | Negative |
| 9 | Anyone been in a supermarket over the last few... | Positive | Positive |
