# Text Classification with a Fine-Tuned BERT Model

### Project Overview
This project fine-tunes a pre-trained **BERT (Bidirectional Encoder Representations from Transformers)** model for binary sentiment classification. Unlike traditional models that rely on hand-crafted features, BERT builds contextual representations in which each word's encoding depends on the words around it, which typically gives stronger accuracy on tasks like this one.

### Dataset
The model is fine-tuned on the **IMDb movie review dataset**, which contains 50,000 movie reviews labeled as either positive or negative, split evenly into 25,000 training and 25,000 test reviews. It is a standard benchmark for sentiment analysis.

### Methodology
1.  **Tokenization and Encoding:** The raw text is preprocessed with BERT's WordPiece tokenizer. Each review is truncated or padded to 128 tokens and converted to the tensors the model expects: `input_ids` and `attention_mask`.
2.  **Model Loading:** A pre-trained `BertForSequenceClassification` model is loaded from the Hugging Face Transformers library. The `bert-base-uncased` checkpoint supplies the pre-trained language representations, and a newly initialized classification head with two output labels is added on top.
3.  **Fine-tuning:** The pre-trained model is fine-tuned on the IMDb training split. All of the model's parameters, not only the final layers, are passed to the optimizer and updated for the sentiment classification task.
4.  **Training and Evaluation:** The model is trained for 3 epochs with the `AdamW` optimizer (learning rate `2e-5`), a linear learning rate scheduler, and gradient clipping at a norm of 1.0, which are common choices for fine-tuning Transformer models. Its performance is evaluated on the 25,000-review test split; no separate validation set is held out.
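
The padding and attention-mask logic from step 1 can be sketched in plain Python. This is a simplified illustration, not the tokenizer's actual implementation: the helper name `pad_and_mask` is hypothetical, the token ids in the usage lines are made up, and the sketch simply slices off extra tokens, whereas the real tokenizer also keeps the closing `[SEP]` token when it truncates. It assumes a padding id of 0, which is the `[PAD]` id in the `bert-base-uncased` vocabulary.

```python
from typing import List, Tuple

PAD_ID = 0  # assumption: [PAD] maps to id 0, as in bert-base-uncased


def pad_and_mask(
    token_ids: List[int], max_length: int, pad_id: int = PAD_ID
) -> Tuple[List[int], List[int]]:
    """Truncate or right-pad token ids to max_length and build the attention mask."""
    kept = token_ids[:max_length]                    # truncation: drop tokens past the limit
    n_pad = max_length - len(kept)                   # number of padding positions needed
    input_ids = kept + [pad_id] * n_pad              # right-padding with the [PAD] id
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Illustrative (made-up) ids for a short review, padded to 6 positions.
short_ids, short_mask = pad_and_mask([101, 7, 8, 102], max_length=6)

# A long sequence is cut down to max_length, so every row has the same shape.
long_ids, long_mask = pad_and_mask(list(range(1, 11)), max_length=4)
```

Because every encoded review ends up with exactly `max_length` entries, the per-review tensors can be concatenated into one batch tensor, and the mask tells the model which positions to ignore.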

### Results
The fine-tuned BERT model is expected to exceed 90% classification accuracy on the test set, well above what basic machine learning models usually reach on this benchmark; the measured figure is printed at the end of the run. This project demonstrates proficiency with pre-trained Transformer models, a critical skill in modern NLP, and an understanding of the training techniques used to fine-tune them.

### Technologies Used
- Python
- Hugging Face Transformers
- PyTorch
- Pandas
- Scikit-learn
- Jupyter Notebook

In [None]:
# Project 4: Advanced Text Classification with a Fine-Tuned BERT Model

# --- Section 1: Setup and Data Loading ---

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
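from transformers import BertTokenizer, BertForSequenceClassification, get_linear_schedule_with_warmup
# transformers' own AdamW is deprecated in favor of the PyTorch implementation.
from torch.optim import AdamW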
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Load IMDb dataset from Hugging Face Datasets library
print("Loading IMDb dataset...")
from datasets import load_dataset
dataset = load_dataset('imdb')

df_train = dataset['train'].to_pandas()
df_test = dataset['test'].to_pandas()

# --- Section 2: Tokenization and Encoding ---

print("Tokenizing and encoding text data...")

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

def encode_text(texts, labels):
    input_ids = []
    attention_masks = []

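    for text in texts:
        # Calling the tokenizer directly is the current replacement for encode_plus.
        encoded_dict = tokenizer(
            text,
            add_special_tokens=True,
            max_length=128,
            padding='max_length',
            # Without truncation, reviews longer than 128 tokens keep their full
            # length and torch.cat below fails on mismatched shapes.
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt',
        )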
        input_ids.append(encoded_dict['input_ids'])
        attention_masks.append(encoded_dict['attention_mask'])

    input_ids = torch.cat(input_ids, dim=0)
    attention_masks = torch.cat(attention_masks, dim=0)
    labels = torch.tensor(labels)

    return TensorDataset(input_ids, attention_masks, labels)

train_dataset = encode_text(df_train['text'].tolist(), df_train['label'].tolist())
test_dataset = encode_text(df_test['text'].tolist(), df_test['label'].tolist())

# Create DataLoader
batch_size = 32
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

# --- Section 3: Model Fine-tuning ---

print("Fine-tuning BERT model...")

model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2, 
    output_attentions=False,
    output_hidden_states=False
)

optimizer = AdamW(model.parameters(), lr=2e-5, eps=1e-8)
epochs = 3
total_steps = len(train_dataloader) * epochs
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

for epoch in range(epochs):
    print(f'\n======== Epoch {epoch + 1} / {epochs} ========')
    model.train()
    total_loss = 0
    for batch in train_dataloader:
        b_input_ids, b_input_mask, b_labels = [b.to(device) for b in batch]
        model.zero_grad()
        outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)
        loss = outputs.loss
        total_loss += loss.item()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
    avg_train_loss = total_loss / len(train_dataloader)
    print(f'  Average training loss: {avg_train_loss:.2f}')

# --- Section 4: Model Evaluation ---

print("\nEvaluating the model on the test dataset...")
model.eval()
predictions, true_labels = [], []
for batch in test_dataloader:
    b_input_ids, b_input_mask, b_labels = [b.to(device) for b in batch]
    with torch.no_grad():
        outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)
    logits = outputs.logits
    logits = logits.detach().cpu().numpy()
    label_ids = b_labels.to('cpu').numpy()
    predictions.append(logits)
    true_labels.append(label_ids)

flat_predictions = np.argmax(np.concatenate(predictions, axis=0), axis=1)
flat_true_labels = np.concatenate(true_labels, axis=0)

from sklearn.metrics import accuracy_score
accuracy = accuracy_score(flat_true_labels, flat_predictions)
print(f"\nTest Accuracy: {accuracy*100:.2f}%")
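
The learning rate schedule configured in Section 3 can be illustrated without running any training. The sketch below is a plain-Python approximation of the shape a linear warmup-and-decay schedule follows, not code taken from the library: `linear_schedule_multiplier` is a hypothetical helper name, and the step count assumes the 25,000 training reviews, the batch size of 32, and the 3 epochs used above.

```python
import math


def linear_schedule_multiplier(
    step: int, num_warmup_steps: int, num_training_steps: int
) -> float:
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Warmup phase: ramp linearly from 0 up to 1.
        return step / max(1, num_warmup_steps)
    # Decay phase: fall linearly from 1 down to 0 at the final step.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 2e-5
batches_per_epoch = math.ceil(25_000 / 32)  # the last batch is smaller than 32
total_steps = batches_per_epoch * 3         # 3 epochs

# With num_warmup_steps=0 the rate starts at base_lr and decays for the whole run.
for step in (0, total_steps // 2, total_steps):
    lr = base_lr * linear_schedule_multiplier(step, 0, total_steps)
    print(f"step {step:4d}: lr = {lr:.2e}")
```

Setting `num_warmup_steps` to a small positive value would make the rate ramp up before it decays, which is a common variation when early updates are unstable.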