<a href="https://colab.research.google.com/github/R-r632/Data-Analysis-Project/blob/main/Twitter_Sentiment_Analysis_BERT_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
#Installing Kaggle Library
!pip install kaggle



In [None]:
import torch

print("CUDA available:", torch.cuda.is_available())
print("Device:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "No GPU found")


CUDA available: True
Device: Tesla T4


In [None]:
import os
import shutil

In [None]:
# Create the .kaggle directory if it doesn't exist
os.makedirs(os.path.expanduser("~/.kaggle"), exist_ok=True)

In [None]:
# Copy the kaggle.json file into the .kaggle directory
shutil.copy("kaggle.json", os.path.expanduser("~/.kaggle/kaggle.json"))

'/root/.kaggle/kaggle.json'

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Set file permissions (skipped or adjusted for Windows)
kaggle_path = os.path.expanduser("~/.kaggle/kaggle.json")
if os.name != 'nt':  # If not Windows
    os.chmod(kaggle_path, 0o600)
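
The `0o600` mode grants read and write access to the file's owner only, which keeps the API token private from other users on the machine. A minimal sketch of the same call on a throwaway file (the temporary file stands in for `kaggle.json` and is an assumption of this example), checking the resulting permission bits with `stat`:

```python
import os
import stat
import tempfile

# Throwaway file standing in for kaggle.json (illustrative only).
fd, demo_path = tempfile.mkstemp(suffix=".json")
os.close(fd)

mode = None
if os.name != "nt":  # POSIX permission bits are not meaningful on Windows
    os.chmod(demo_path, 0o644)  # start from a world-readable file
    os.chmod(demo_path, 0o600)  # restrict to owner read/write
    # S_IMODE keeps only the permission bits of st_mode.
    mode = stat.S_IMODE(os.stat(demo_path).st_mode)
    print(oct(mode))

os.remove(demo_path)
```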

### Importing the Twitter Sentiment Analysis Dataset

In [None]:
# API to fetch the dataset from Kaggle
!kaggle datasets download -d kazanova/sentiment140

Dataset URL: https://www.kaggle.com/datasets/kazanova/sentiment140
License(s): other


In [None]:
#extracting the zip archive
from zipfile import ZipFile
with ZipFile('sentiment140.zip', 'r') as zip_ref:
    zip_ref.extractall('sentiment140')
    print("Dataset extracted successfully!")

Dataset extracted successfully!


#### Importing The Dependencies

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report


In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
#printing the stopwords in English
print(stopwords.words('english'))

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

## Introduction to BERT and the problem at hand
Task: binary sentiment classification of tweets (the original Kaggle Sentiment140 set).

Labels: 0 → negative, 1 → positive (the notebook first remaps Kaggle’s label 4 to 1).

Why BERT? Tweets are short and full of slang, emojis and creative spelling. A transformer pretrained on large-scale, open-domain English (e.g. bert-base-uncased, 109 M parameters) is a strong baseline because it already “knows” modern English syntax and vocabulary and only needs a thin classification head to specialize on the sentiment signal. (The notebook imports scikit-learn's TF-IDF and logistic-regression tools for a lighter baseline, but the code below goes straight to fine-tuning BERT.)

### 🧹 Exploratory Data Analysis (EDA) Summary

- **Raw shape after unzip**:  
  `1,600,000 rows × 6 columns`

- **Class balance**:  
  - 800,000 negative  
  - 800,000 positive  
  - Perfect 50 / 50 balance

- **Typical tweet length**:  
  - Mean ≈ 71 characters  
  - ≈ 19 word-pieces after BERT tokenization

- **Key cleaning steps**:  
  - Lower-casing  
  - URL and user-mention removal  
  - HTML-entity fix-ups  
  - Stop-word and punctuation stripping  
  - Emojis and other non-alphabetic characters stripped (the cleaning regex keeps only `a-z` and whitespace)

- **Post-cleaning features**:  
  - Only the text column is kept  
  - Sparse TF-IDF vocabulary includes 150k+ unigrams/bigrams

#### Installing Transformers and PyTorch

In [None]:
!pip install transformers



In [None]:
!pip install torch


#### BERT Sentiment Classification Notebook

**What you'll learn:**

- Preprocess and clean data for BERT classification
- Load pretrained BERT with a custom output layer
- Train and evaluate the fine-tuned BERT model on your own problem statement

##### 1. Introduction to BERT and the problem at hand

In this notebook, we will fine-tune `bert-base-uncased` on the Sentiment140 dataset to classify tweets as positive or negative.

##### 2. Exploratory Data Analysis and Preprocessing

In [None]:
import pandas as pd
import numpy as np
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:

# Load the Sentiment140 dataset (ensure the CSV is in the working directory)
df = pd.read_csv('sentiment140/training.1600000.processed.noemoticon.csv',
                 encoding='latin-1',
                 names=['target','id','date','query','user','text'])
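
The Sentiment140 CSV has no header row, which is why the column names are passed explicitly through `names=`; by default pandas `read_csv` treats the first line as a header. A small sketch of the same effect using the standard-library `csv` module on two made-up rows (the rows are placeholders in the six-column layout, not real dataset records):

```python
import csv
import io

# Two made-up rows in the six-column layout (not real dataset records).
sample = (
    '"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","user_a","so tired today"\n'
    '"4","2","Mon Apr 06 22:20:01 PDT 2009","NO_QUERY","user_b","great morning"\n'
)
columns = ["target", "id", "date", "query", "user", "text"]

# Supplying the field names keeps both rows as data.
with_names = list(csv.DictReader(io.StringIO(sample), fieldnames=columns))

# Without field names, the first row is consumed as the header.
without_names = list(csv.DictReader(io.StringIO(sample)))

print(len(with_names), len(without_names))  # 2 1
```

Omitting the names would therefore silently drop one tweet and turn its values into column labels.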

In [None]:

# Map targets: 0 -> negative, 4 -> positive
df['target'] = df['target'].map({0: 0, 4: 1})

In [None]:

# Basic cleaning function
# Build the stop-word set once; calling stopwords.words() for every token is very slow
STOP_WORDS = set(stopwords.words('english'))

def clean_text(text):
    text = text.lower()
    text = re.sub(r'http\S+|www\.\S+', '', text)  # URLs
    text = re.sub(r'@\w+', '', text)              # Mentions
    text = re.sub(r'[^a-z\s]', '', text)          # Non-alphabetic (digits, punctuation, emojis)
    tokens = text.split()
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return ' '.join(tokens)
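
To see what these steps do to a single tweet, here is a minimal worked example. It repeats the same regular expressions but uses a small hand-picked subset of NLTK's English stop words (`DEMO_STOP_WORDS` and `clean_text_demo` are names invented for this sketch) so that it runs without downloading the corpus:

```python
import re

# Small illustrative subset of NLTK's English stop-word list (assumption:
# the notebook itself filters against the full list).
DEMO_STOP_WORDS = {"i", "am", "not", "at", "is", "it", "no"}

def clean_text_demo(text):
    text = text.lower()
    text = re.sub(r'http\S+|www\.\S+', '', text)  # URLs
    text = re.sub(r'@\w+', '', text)              # mentions
    text = re.sub(r'[^a-z\s]', '', text)          # digits, punctuation, emojis
    return ' '.join(t for t in text.split() if t not in DEMO_STOP_WORDS)

print(clean_text_demo("@bob I am NOT happy at 9am :( http://t.co/x"))  # happy
print(clean_text_demo("I love it 😍!!!"))                              # love
```

Note that `not` is on NLTK's stop-word list, so the negation in the first tweet disappears along with the emoticon. This may be worth keeping in mind when interpreting the model's errors.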


In [None]:
# Apply cleaning
df['clean_text'] = df['text'].apply(clean_text)


In [None]:
# Quick EDA
print(df['target'].value_counts(normalize=True))
print(df['clean_text'].str.len().describe())

target
0    0.5
1    0.5
Name: proportion, dtype: float64
count    1.600000e+06
mean     4.293283e+01
std      2.424669e+01
min      0.000000e+00
25%      2.300000e+01
50%      3.900000e+01
75%      6.000000e+01
max      1.750000e+02
Name: clean_text, dtype: float64


In [None]:
# Randomly down-sample to 40,000 tweets to keep fine-tuning time manageable
df = df.sample(n=40000, random_state=42).reset_index(drop=True)
print("Reduced dataset shape:", df.shape)

Reduced dataset shape: (40000, 7)


##### 3. Training/Validation Split

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
train_df, test_df = train_test_split(
    df[['clean_text', 'target']],
    test_size=0.2,
    stratify=df['target'],
    random_state=42
)

print(f'Train size: {len(train_df)}, Test size: {len(test_df)}')

Train size: 32000, Test size: 8000
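
Passing `stratify=` asks for the same label proportions in both splits. The idea can be sketched in plain Python as below; this is an illustration of the principle under simple assumptions, not scikit-learn's actual implementation, and `stratified_split` is a name made up for the example:

```python
import random
from collections import Counter

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with the label ratio preserved in both."""
    rng = random.Random(seed)
    by_label = {}
    for idx, label in enumerate(labels):
        by_label.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # same fraction from every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

labels = [0] * 50 + [1] * 50  # toy balanced labels
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))          # 80 20
print(Counter(labels[i] for i in test_idx))   # 10 of each class
```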


##### 4. Loading Tokenizer and Encoding our Data

In [None]:
from transformers import BertTokenizer

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize
train_encodings = tokenizer(
    train_df['clean_text'].tolist(),
    truncation=True,
    padding=True,
    max_length=128,
    return_tensors='pt'
)
test_encodings = tokenizer(
    test_df['clean_text'].tolist(),
    truncation=True,
    padding=True,
    max_length=128,
    return_tensors='pt'
)
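
With `padding=True` the tokenizer pads every sequence in a call to the longest one, and `truncation=True` caps the length at `max_length`; the attention mask marks which positions hold real tokens. A simplified sketch of that bookkeeping in plain Python (`pad_and_mask` is a hypothetical helper, and the token ids between `[CLS]` and `[SEP]` are placeholders):

```python
def pad_and_mask(batch_ids, max_length=128, pad_id=0):
    """Truncate each sequence, pad to the longest one, and build attention masks."""
    # Simplification: plain slicing; the real tokenizer keeps the closing [SEP] token.
    truncated = [ids[:max_length] for ids in batch_ids]
    width = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in truncated]
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in truncated]
    return input_ids, attention_mask

# 101 and 102 are [CLS] and [SEP] in bert-base-uncased; the ids in between are placeholders.
batch = [[101, 2001, 2002, 102], [101, 2003, 102]]
input_ids, attention_mask = pad_and_mask(batch)
print(input_ids)       # [[101, 2001, 2002, 102], [101, 2003, 102, 0]]
print(attention_mask)  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Because the notebook tokenizes each split in a single call, every row is padded to the longest tweet in that split rather than per batch.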

##### 5. Setting up BERT Pretrained Model

In [None]:
from transformers import BertForSequenceClassification

In [None]:
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
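
The message above is expected: the encoder weights come from the checkpoint, while the classification head (`classifier.weight`, `classifier.bias`) is created fresh for the two labels. As a rough sketch, the head is a linear map from the 768-dimensional pooled sentence vector to one logit per label. The plain-Python version below only illustrates its size and shape; the 0.02 standard deviation is assumed to mirror BERT's default initializer range:

```python
import random

HIDDEN_SIZE = 768  # hidden width of bert-base-uncased
NUM_LABELS = 2

rng = random.Random(0)
# Randomly initialised head: one weight row and one bias per label.
weight = [[rng.gauss(0.0, 0.02) for _ in range(HIDDEN_SIZE)] for _ in range(NUM_LABELS)]
bias = [0.0] * NUM_LABELS

def classify(pooled):
    """Map a pooled sentence vector to one logit per label."""
    return [sum(w * x for w, x in zip(row, pooled)) + b for row, b in zip(weight, bias)]

n_params = NUM_LABELS * HIDDEN_SIZE + NUM_LABELS
print(n_params)  # 1538

pooled = [rng.gauss(0.0, 1.0) for _ in range(HIDDEN_SIZE)]  # stand-in for an encoder output
print(len(classify(pooled)))  # 2
```

Only these 1,538 parameters start from scratch; everything else is adjusted from its pretrained values during fine-tuning.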


##### 6. Creating Data Loaders

In [None]:
from torch.utils.data import Dataset, DataLoader

In [None]:

class TweetDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {k: v[idx] for k, v in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

In [None]:

train_dataset = TweetDataset(train_encodings, train_df['target'].tolist())
test_dataset = TweetDataset(test_encodings, test_df['target'].tolist())

In [None]:

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)
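
Conceptually, each loader walks through the dataset indices in chunks of 16, reshuffling the training order every epoch. A plain-Python sketch of that behaviour (`iter_batches` is a hypothetical stand-in, not the `DataLoader` implementation), which also gives the number of optimizer steps per epoch for the 32,000 training tweets:

```python
import math
import random

def iter_batches(n_items, batch_size, shuffle, seed=0):
    """Yield lists of dataset indices, one list per batch."""
    order = list(range(n_items))
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, n_items, batch_size):
        yield order[start:start + batch_size]

toy_batches = list(iter_batches(10, batch_size=4, shuffle=True))
print([len(b) for b in toy_batches])  # [4, 4, 2]

steps_per_epoch = math.ceil(32000 / 16)
print(steps_per_epoch)  # 2000
```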

##### 7. Setting Up Optimizer and Scheduler

In [None]:
# 1) pull AdamW from torch.optim
from torch.optim import AdamW

In [None]:
# 2) get the LR scheduler (depending on your transformers version)
try:
    from transformers import get_linear_schedule_with_warmup
except ImportError:
    from transformers.optimization import get_linear_schedule_with_warmup

In [None]:
optimizer = AdamW(model.parameters(), lr=2e-5)

epochs = 3
total_steps = len(train_loader) * epochs

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=total_steps
)
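
With `num_warmup_steps=0`, the schedule simply decays the learning rate linearly from `2e-5` to zero over all training steps. The multiplier it applies can be sketched as follows (`linear_schedule_factor` is a name chosen for this illustration; 6,000 steps assumes 2,000 batches per epoch for 3 epochs):

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Learning-rate multiplier: linear warmup, then linear decay to zero."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 2e-5
total_steps = 6000  # assumed: 2,000 batches per epoch x 3 epochs

for step in (0, 3000, 6000):
    factor = linear_schedule_factor(step, 0, total_steps)
    print(step, factor, base_lr * factor)
```

The factor is 1.0 at the first step, 0.5 halfway through, and 0.0 at the end, so the last updates of epoch 3 are very small.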

##### 8. Defining our Performance Metrics

In [None]:
from sklearn.metrics import accuracy_score, f1_score

In [None]:
def compute_metrics(preds, labels):
    preds = np.argmax(preds, axis=1)
    return {
        'accuracy': accuracy_score(labels, preds),
        'f1': f1_score(labels, preds)
    }
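
For reference, both metrics can be derived by hand from the prediction counts. The sketch below computes accuracy and the F1 score of the positive class (label 1, which is what `f1_score` reports by default for binary labels) on a toy example; `binary_metrics` is a helper written for this illustration:

```python
def binary_metrics(labels, preds):
    """Accuracy and F1 for the positive class (label 1)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    correct = sum(1 for y, p in zip(labels, preds) if y == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(labels), "f1": f1}

toy_labels = [1, 1, 1, 0, 0, 0]
toy_preds = [1, 1, 0, 0, 0, 1]
toy_metrics = binary_metrics(toy_labels, toy_preds)
print(toy_metrics)  # accuracy and F1 are both 2/3 here
```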

##### 9. Creating our Training Loop

In [None]:
import torch

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

In [None]:
print(torch.cuda.is_available())
print(device)

True
cuda


In [None]:
for epoch in range(epochs):
    model.train()
    total_loss = 0
    for batch in train_loader:
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        outputs = model(
            input_ids,
            attention_mask=attention_mask,
            labels=labels
        )
        loss = outputs.loss
        total_loss += loss.item()
        loss.backward()
        optimizer.step()
        scheduler.step()

    avg_loss = total_loss / len(train_loader)
    print(f'Epoch {epoch+1} | Training Loss: {avg_loss:.4f}')

Epoch 1 | Training Loss: 0.4957
Epoch 2 | Training Loss: 0.3738
Epoch 3 | Training Loss: 0.2531
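
When `labels` are passed, the model returns a cross-entropy loss over the two logits. A small sketch of that quantity in plain Python helps put the numbers above in context (the logits used here are made up):

```python
import math

def cross_entropy(logits, label):
    """Negative log-probability of the true label under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[label]

# Equal logits mean a 50/50 guess, which costs ln(2).
print(round(cross_entropy([0.0, 0.0], 1), 4))  # 0.6931

# A confident correct prediction costs much less.
print(cross_entropy([2.0, -1.0], 0) < cross_entropy([0.0, 0.0], 0))  # True
```

An uninformed two-class model sits at about 0.693, so the epoch-1 average of 0.4957 is already below chance level, and the later epochs keep lowering the training loss.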


##### 10. Evaluating our Model

In [None]:
model.eval()
all_preds = []
all_labels = []

with torch.no_grad():
    for batch in test_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        outputs = model(input_ids, attention_mask=attention_mask)
        logits = outputs.logits.cpu().numpy()
        all_preds.extend(logits)
        all_labels.extend(labels.cpu().numpy())

metrics = compute_metrics(np.array(all_preds), np.array(all_labels))
print('Test Accuracy:', metrics['accuracy'])
print('Test F1 Score:', metrics['f1'])

Test Accuracy: 0.77975
Test F1 Score: 0.779087261785356


##### 11. Saving the Fine-Tuned BERT Model

In [None]:
model.save_pretrained('saved_model/bert_sentiment')
tokenizer.save_pretrained('saved_model/bert_sentiment')
print("✅ Model and tokenizer saved to 'saved_model/bert_sentiment/'")

✅ Model and tokenizer saved to 'saved_model/bert_sentiment/'


In [None]:
import shutil
shutil.make_archive('bert_sentiment_model', 'zip', 'saved_model/bert_sentiment')
print("Zipped model saved as 'bert_sentiment_model.zip'")

Zipped model saved as 'bert_sentiment_model.zip'



##### 12. Loading the Saved Model & Tokenizer for Inference

In [None]:
import re
import torch
import numpy as np
from nltk.corpus import stopwords
from transformers import BertForSequenceClassification, BertTokenizer

In [None]:
save_dir = 'saved_model/bert_sentiment'
infer_tokenizer = BertTokenizer.from_pretrained(save_dir)
infer_model     = BertForSequenceClassification.from_pretrained(save_dir)

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
infer_model.to(device).eval()

In [None]:
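# Build the stop-word set once instead of reloading the list for every token
STOP_WORDS = set(stopwords.words('english'))

def clean_text(text: str) -> str:
    text = text.lower()
    text = re.sub(r'http\S+|www\.\S+', '', text)  # URLs
    text = re.sub(r'@\w+', '', text)              # Mentions
    text = re.sub(r'[^a-z\s]', '', text)          # Non-alphabetic
    tokens = text.split()
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return " ".join(tokens)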

In [None]:
def predict_sentiment(text: str, max_length: int = 128) -> str:
    cleaned = clean_text(text)
    inputs  = infer_tokenizer(
        cleaned,
        return_tensors='pt',
        truncation=True,
        padding='max_length',
        max_length=max_length
    )
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        logits = infer_model(**inputs).logits
    pred_idx = int(torch.argmax(logits, dim=1).item())
    return "Positive" if pred_idx == 1 else "Negative"
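
`predict_sentiment` returns only the arg-max label. If a confidence value is also wanted, the logits can be passed through a softmax first. A plain-Python sketch on made-up logits (`label_with_confidence` is a hypothetical helper, not part of the notebook):

```python
import math

def softmax(logits):
    """Convert logits to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_with_confidence(logits):
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return ("Positive" if idx == 1 else "Negative"), probs[idx]

label, confidence = label_with_confidence([-1.2, 2.3])  # made-up logits
print(label, round(confidence, 3))
```

The softmax does not change which label wins; it only rescales the logits so the winning score can be read as a probability.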

In [None]:
for sample in [
    "I absolutely loved the new album!",
    "This is hands down the worst experience ever."
]:
    print(f"{sample}  →  {predict_sentiment(sample)}")

I absolutely loved the new album!  →  Positive
This is hands down the worst experience ever.  →  Negative


In [None]:
!zip -r sample_data.zip /content/sample_data

  adding: content/sample_data/ (stored 0%)
  adding: content/sample_data/anscombe.json (deflated 83%)
  adding: content/sample_data/README.md (deflated 39%)
  adding: content/sample_data/mnist_test.csv (deflated 88%)
  adding: content/sample_data/california_housing_train.csv (deflated 79%)
  adding: content/sample_data/mnist_train_small.csv (deflated 88%)
  adding: content/sample_data/california_housing_test.csv (deflated 76%)


In [None]:
from google.colab import files
files.download('sample_data.zip')

In [None]:
!zip -r sample_data.zip /content/saved_model

updating: content/saved_model/ (stored 0%)
updating: content/saved_model/bert_sentiment/ (stored 0%)
updating: content/saved_model/bert_sentiment/config.json (deflated 49%)
updating: content/saved_model/bert_sentiment/tokenizer_config.json (deflated 75%)
updating: content/saved_model/bert_sentiment/model.safetensors (deflated 7%)
updating: content/saved_model/bert_sentiment/vocab.txt (deflated 53%)
updating: content/saved_model/bert_sentiment/special_tokens_map.json (deflated 42%)
