### **Load or Simulate Text Data**
For simplicity, we’ll simulate a small set of labeled reviews (1 = positive, 0 = negative). You can replace this with any CSV that has a text column and a label column.

✅ Simulate Sentiment-Labeled Text Data

In [63]:
import pandas as pd

data = {
    'text': [
        "I love this product!",
        "Worst purchase ever.",
        "Totally satisfied with the service.",
        "Horrible experience.",
        "Amazing quality, will buy again!",
        "Not worth the price."
    ],
    'label': [1, 0, 1, 0, 1, 0]
}
df = pd.DataFrame(data)
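
If you would rather start from a CSV, a minimal sketch along these lines should work, assuming the file has columns named `text` and `label`. The in-memory string below stands in for a real file, and `reviews.csv` is only a hypothetical name.

```python
import io
import pandas as pd

# Stand-in for a real file such as a hypothetical "reviews.csv".
# Assumes columns named "text" and "label" (0 = negative, 1 = positive).
csv_source = io.StringIO(
    "text,label\n"
    '"I love this product!",1\n'
    '"Worst purchase ever.",0\n'
)

csv_df = pd.read_csv(csv_source)

# Basic sanity checks before tokenizing: drop incomplete rows, force integer labels
csv_df = csv_df.dropna(subset=["text", "label"])
csv_df["label"] = csv_df["label"].astype(int)
print(csv_df)
```

Assigning the result to `df` instead of `csv_df` would let the remaining cells run unchanged.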

✅ Tokenize Text with BERT

In [64]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
encodings = tokenizer(df['text'].tolist(), truncation=True, padding=True, return_tensors='pt')
labels = df['label'].values
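
To build intuition for what `padding=True` and `truncation=True` do to a batch, here is a toy illustration. It splits on whitespace and invents ids on the fly instead of using BERT's WordPiece vocabulary, so the helper `toy_encode` and its `max_length` default are purely illustrative and the ids are not what `BertTokenizer` would produce.

```python
def toy_encode(texts, max_length=8, pad_id=0):
    """Toy stand-in for batch tokenization; NOT BERT's real WordPiece tokenizer."""
    vocab = {}
    batch = []
    for text in texts:
        ids = []
        for word in text.lower().split():
            # Give each new word the next free id (ids start at 1; 0 is reserved for padding)
            ids.append(vocab.setdefault(word, len(vocab) + 1))
        batch.append(ids[:max_length])  # truncation: cut sequences that are too long

    longest = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)             # pad to the longest sequence
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

toy = toy_encode(["I love this product!", "Horrible experience."])
print(toy["input_ids"])       # [[1, 2, 3, 4], [5, 6, 0, 0]]
print(toy["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

The real tokenizer returns the same kind of structure (`input_ids` and `attention_mask`, among others), which is why the shorter review ends up padded to the length of the longest one in the batch.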

✅ Create a PyTorch Dataset

In [65]:

import torch
from torch.utils.data import Dataset

class SentimentDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx], dtype=torch.long)
        return item
    def __len__(self):
        return len(self.labels)

dataset = SentimentDataset(encodings, labels)
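
A quick way to sanity-check a dataset like this is to pull one batch through a `DataLoader`, which is roughly how batches are assembled during training. The sketch below uses small hand-written tensors in place of the real BERT encodings, so `ToyDataset`, `fake_encodings`, and `fake_labels` are placeholders.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Same structure as SentimentDataset, repeated so this sketch runs on its own."""
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx], dtype=torch.long)
        return item
    def __len__(self):
        return len(self.labels)

# Placeholder encodings: 6 sequences of length 4 (made-up ids, not real BERT output)
fake_encodings = {
    'input_ids': torch.arange(24).reshape(6, 4),
    'attention_mask': torch.ones(6, 4, dtype=torch.long),
}
fake_labels = [1, 0, 1, 0, 1, 0]

loader = DataLoader(ToyDataset(fake_encodings, fake_labels), batch_size=2, shuffle=False)
batch = next(iter(loader))

# Each key is stacked across the batch dimension
print({k: tuple(v.shape) for k, v in batch.items()})
print(len(loader))  # number of batches per epoch
```

With six examples and a batch size of 2, the loader yields three batches per epoch, each a dictionary whose keys match the model's keyword arguments.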

✅ Load BERT Model for Classification

In [66]:
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
# Disable Weights & Biases logging prompts (set before the Trainer is created)
import os
os.environ["WANDB_DISABLED"] = "true"

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✅ Train the Model

In [67]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=10,
    per_device_train_batch_size=2,
    logging_dir='./logs',
    logging_steps=1,
    save_strategy='no',
    disable_tqdm=False
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset
)

trainer.train()

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


| Step | Training Loss |
|------|---------------|
| 1 | 0.76 |
| 2 | 0.9086 |
| 3 | 0.681 |
| 4 | 0.7546 |
| 5 | 0.6553 |
| 6 | 0.7681 |
| 7 | 0.7423 |
| 8 | 0.5307 |
| 9 | 0.5501 |
| 10 | 0.5805 |


TrainOutput(global_step=30, training_loss=0.4103278748691082, metrics={'train_runtime': 42.5772, 'train_samples_per_second': 1.409, 'train_steps_per_second': 0.705, 'total_flos': 277499941200.0, 'train_loss': 0.4103278748691082, 'epoch': 10.0})
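
The numbers in the `TrainOutput` above can be reproduced with a little arithmetic. This sketch assumes a single device and no gradient accumulation, which matches the arguments shown but is not stated explicitly in the output.

```python
import math

num_examples = 6         # rows in the simulated DataFrame
batch_size = 2           # per_device_train_batch_size
num_epochs = 10          # num_train_epochs
train_runtime = 42.5772  # seconds, copied from the TrainOutput above

# Assumption: one device, no gradient accumulation
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
total_samples = num_examples * num_epochs

steps_per_second = round(total_steps / train_runtime, 3)
samples_per_second = round(total_samples / train_runtime, 3)

print(total_steps)         # 30, matches global_step
print(steps_per_second)    # 0.705, matches train_steps_per_second
print(samples_per_second)  # 1.409, matches train_samples_per_second
```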

✅ Predict Sentiment on New Text

In [69]:
test_text = ["This is excellent!", "Terrible service..."]

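# Tokenize the test text and move it to the same device as the model
test_inputs = tokenizer(test_text, return_tensors="pt", padding=True, truncation=True).to(model.device)

# Switch to evaluation mode (disables dropout), then get model predictions
model.eval()
with torch.no_grad():
    outputs = model(**test_inputs)
    predictions = torch.argmax(outputs.logits, dim=1)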

# Print the sentiment results
for text, pred in zip(test_text, predictions):
    sentiment = "Positive" if pred.item() == 1 else "Negative"
    print(f"Text: {text} → Sentiment: {sentiment}")

Text: This is excellent! → Sentiment: Positive
Text: Terrible service... → Sentiment: Negative
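
`torch.argmax` only reports the winning class. If you also want a rough confidence score, one option is to pass the logits through a softmax. The logits below are hypothetical values chosen for illustration, not output from the fine-tuned model.

```python
import torch

# Hypothetical logits for two texts (columns: [negative, positive]); not real model output
example_logits = torch.tensor([[-0.4, 0.9], [0.7, -0.6]])

# Softmax turns each row of logits into probabilities that sum to 1
probs = torch.softmax(example_logits, dim=1)
confidence, predicted = probs.max(dim=1)

for pred, conf in zip(predicted, confidence):
    sentiment = "Positive" if pred.item() == 1 else "Negative"
    print(f"Sentiment: {sentiment} (probability {conf.item():.2f})")
```

In the notebook itself, `outputs.logits` would take the place of `example_logits`.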


### 📝 Project Summary:

This project uses a pre-trained BERT model (`bert-base-uncased`) for binary sentiment classification. The model is fine-tuned on a small simulated dataset of six labeled reviews (0 = Negative, 1 = Positive).

**Steps:**
1. Simulate a labeled text dataset using `pandas`.
2. Tokenize text using `BertTokenizer`.
3. Create a custom PyTorch `Dataset`.
4. Load `BertForSequenceClassification` with `num_labels=2`.
5. Train the model using Hugging Face's `Trainer` and `TrainingArguments`.
6. Predict sentiment on new examples using the fine-tuned model.

**Important Notes:**
- The warning about uninitialized weights is expected: the classifier head is randomly initialized and learned during training.
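- To disable `wandb` logging prompts, this notebook sets `os.environ["WANDB_DISABLED"] = "true"`. As the deprecation warning in the training output indicates, that variable is slated for removal in v5; passing `report_to="none"` to `TrainingArguments` is the suggested replacement.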