# Hands-On Task

## Yelp Sentiment Analysis – Baseline vs Transformer (Google Colab)

---

## Task Goal

Build and compare **two sentiment classification models** on the Yelp Polarity dataset:

1. Traditional ML baseline (TF-IDF + Logistic Regression)
2. Transformer model (DistilBERT Fine-Tuning)

Then **compare results** and write a short technical recommendation.

---

## Scenario

A food delivery company wants to analyze customer reviews automatically to understand customer satisfaction.
You are hired to test two modeling approaches and recommend the best solution.

---

## Final Deliverables

Each student must submit:

* Google Colab Notebook (.ipynb)
* experiments_log.csv file
* Saved fine-tuned Transformer model folder
* Screenshot of inference predictions
* Mini technical report (5–6 lines)

---

## Step 1: Set Up Google Colab

1. Open Google Colab.
2. Go to Runtime → Change runtime type and select **GPU**.
3. Verify that a GPU is available.

**Outcome:** Your notebook is ready for accelerated Transformer training.
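
One way to confirm that the GPU is visible from Python is sketched below. It assumes PyTorch is installed, as it normally is on Colab runtimes; the helper name `describe_accelerator` is illustrative and not part of the task.

```python
import importlib.util


def describe_accelerator() -> str:
    """Return a short description of the accelerator PyTorch can see."""
    # Guard against environments where PyTorch is missing.
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed"
    import torch

    if torch.cuda.is_available():
        # Index 0 is the first visible GPU.
        return f"GPU available: {torch.cuda.get_device_name(0)}"
    return "No GPU detected; training will fall back to CPU"


print(describe_accelerator())
```

If the message reports no GPU, recheck the runtime type before starting any Transformer training.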

---

## Step 2: Load Yelp Polarity Dataset

The Yelp Polarity dataset is available on the **Hugging Face Datasets Hub**.

**What you need to do:**

* Use the Hugging Face `datasets` library
* Load the dataset by name: `yelp_polarity` (the notebook cell further down uses the namespaced ID `fancyzhx/yelp_polarity`)

```python
from datasets import load_dataset

dataset = load_dataset("yelp_polarity")
```

**What happens automatically:**

* Dataset downloads from Hugging Face servers
* Train and Test splits are created for you
* Each record contains:

  * `text` → the customer review
  * `label` → sentiment (0 = Negative, 1 = Positive)

**Outcome:** You now have ready-to-use train and test datasets of real customer reviews.

---

In [3]:
from datasets import load_dataset

ds = load_dataset("fancyzhx/yelp_polarity")


## Step 3: Create a Training Subset

The original Yelp dataset is very large. To make experiments faster:

**What you need to do:**

* Shuffle the dataset
* Select a smaller training subset (example: 2000 samples; the cell below uses 8000)
* Create a smaller test subset (example: 1000 samples; the cell below uses 2000)

**Why:**

* Faster training
* Controlled experiments
* Fair comparison between models

**Outcome:** Faster experiments with consistent data size.
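
The reason for fixing a seed can be illustrated with a small standard-library sketch. It mirrors the idea of shuffling with a seed and then keeping the first rows, but it does not reproduce the shuffling algorithm of the `datasets` library, so the indices it picks will differ from the real subset. The function name `seeded_subset` is hypothetical.

```python
import random


def seeded_subset(n_total: int, n_keep: int, seed: int) -> list[int]:
    """Return the first n_keep row indices after a seeded shuffle."""
    indices = list(range(n_total))
    # A private Random instance leaves the global random state untouched.
    random.Random(seed).shuffle(indices)
    return indices[:n_keep]


# 560,000 is the size of the full Yelp Polarity training split.
first_run = seeded_subset(n_total=560_000, n_keep=2_000, seed=42)
second_run = seeded_subset(n_total=560_000, n_keep=2_000, seed=42)
other_seed = seeded_subset(n_total=560_000, n_keep=2_000, seed=7)

print("Same seed gives same rows:", first_run == second_run)
print("Different seed gives same rows:", first_run == other_seed)
```

Because both models are trained and evaluated on the same seeded subset, differences in their scores come from the models rather than from the sampled rows.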

---

In [4]:
ds['train'][0]

{'text': "Unfortunately, the frustration of being Dr. Goldberg's patient is a repeat of the experience I've had with so many other doctors in NYC -- good doctor, terrible staff.  It seems that his staff simply never answers the phone.  It usually takes 2 hours of repeated calling to get an answer.  Who has time for that or wants to deal with it?  I have run into this problem with many other doctors and I just don't get it.  You have office workers, you have patients with medical needs, why isn't anyone answering the phone?  It's incomprehensible and not work the aggravation.  It's with regret that I feel that I have to give Dr. Goldberg 2 stars.",
 'label': 0}

In [5]:
ds['train'].shape

(560000, 2)

In [6]:
ds['test'].shape

(38000, 2)

In [7]:
train_ds = ds['train'].shuffle(seed=42).select(range(8000))
test_ds = ds['test'].shuffle(seed=42).select(range(2000))
print(train_ds[0])
print("-----")
print(test_ds[0])

{'text': "Decent size, decent selection, decent staff.\\n\\nI guess that can wholly sum this place up, it's decent.  As with many other stores that are like this, the product rotates depending on what doesn't sale well at other stores.  Can always snag a deal here.  I was able to pick up a pretty sweet Puma jacket for $10, can't beat that, right?\\n\\nThat being said, there are those times that you may not find anything as well.  So really don't get your hopes up if you are looking for a specific item.", 'label': 1}
-----
{'text': "Nightclub rating only...\\n\\nWe got lucky because we happened to arrive during Kris Humphries' (new husband of Kim Kardashian) bachelor party.\\n\\nI also saw Jordan Farmar, Lamar Odom and Scott.\\n\\nPlace was packed on a Saturday night and we didn't want to wait in line so we did bottle service. Cost us $575 total for 5 guys and we got grey goose and our choose of drinks.\\n\\nYou don't get your own VIP booth - you actually get to sit on a long couch and 

## Step 4: Train Baseline Model

Build a traditional machine learning baseline.

**What you need to do:**

1. Convert text into numeric vectors using **TF-IDF**
2. Train a **Logistic Regression** classifier
3. Evaluate the model on the test set
4. Record **Accuracy** and **F1-score**

**Why this model:**

* Simple
* Fast
* Common NLP baseline

**Outcome:** You obtain baseline Accuracy and F1 scores.
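
To make the TF-IDF step less of a black box, here is a hand-rolled sketch on three toy reviews. It follows the smoothed IDF formula and L2 normalisation that scikit-learn documents as the `TfidfVectorizer` defaults, but it is an illustration only; the toy documents and helper names are assumptions, not part of the task.

```python
import math
import re
from collections import Counter

docs = [
    "great food and great service",
    "terrible service",
    "food was cold",
]


def tokenize(text: str) -> list[str]:
    # Lowercase and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


tokenized = [tokenize(d) for d in docs]
n_docs = len(docs)

# Document frequency: the number of documents each term appears in.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def idf(term: str) -> float:
    # Smoothed inverse document frequency: rare terms get larger weights.
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0


def tfidf_vector(tokens: list[str]) -> dict[str, float]:
    counts = Counter(tokens)
    raw = {term: count * idf(term) for term, count in counts.items()}
    # L2-normalise so every document vector has unit length.
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}


vectors = [tfidf_vector(tokens) for tokens in tokenized]
for doc, vec in zip(docs, vectors):
    top_term = max(vec, key=vec.get)
    print(f"{doc!r}: highest-weighted term = {top_term!r}")
```

Terms that are frequent in one review but rare across the corpus receive the highest weights, which is what lets a linear classifier pick up on sentiment-bearing words.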

---

In [8]:
# Step 4: Train Baseline Model
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# 1. Prepare data
X_train = train_ds['text']
y_train = train_ds['label']
X_test = test_ds['text']
y_test = test_ds['label']


In [19]:
import pandas as pd
pd.Series(y_train).value_counts()

label  count
1      4018
0      3982


In [9]:
# 2. TF-IDF Vectorization
vectorizer = TfidfVectorizer(max_features=5000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

In [10]:
# 3. Train Logistic Regression
lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train_tfidf, y_train)

In [11]:
# 4. Evaluate
y_pred_baseline = lr_model.predict(X_test_tfidf)
baseline_accuracy = accuracy_score(y_test, y_pred_baseline)
baseline_f1 = f1_score(y_test, y_pred_baseline)

print(f"Baseline Accuracy: {baseline_accuracy:.4f}")
print(f"Baseline F1 Score: {baseline_f1:.4f}")

Baseline Accuracy: 0.9100
Baseline F1 Score: 0.9087
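
For readers who want to see what the two metrics measure, the sketch below recomputes accuracy and binary F1 (positive class = 1) from confusion counts. The toy labels are made up for illustration and are not drawn from the Yelp data.

```python
def accuracy_and_f1(y_true: list[int], y_pred: list[int]) -> tuple[float, float]:
    """Compute accuracy and binary F1 with label 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1


toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 1, 0, 0, 0, 1, 1, 0]
acc, f1 = accuracy_and_f1(toy_true, toy_pred)
print(f"accuracy={acc:.4f}  f1={f1:.4f}")  # accuracy=0.7500  f1=0.7500
```

On a nearly balanced subset such as this one, accuracy and F1 tend to sit close together; they diverge when one class dominates.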


In [12]:
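# Save the baseline classifier.
# Note: only the classifier is pickled here; the fitted `vectorizer` is also
# needed to transform raw text before calling lr_model.predict.
import pickle
with open('lr_model.pkl', 'wb') as f:
    pickle.dump(lr_model, f)
print("Model saved to lr_model.pkl")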

Model saved to lr_model.pkl
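
A pickled classifier can only score text that has been transformed by the same fitted vectorizer. One possible way to keep the two together, sketched here on toy data, is to wrap them in a scikit-learn `Pipeline` and pickle that single object. The toy texts and variable names are assumptions; in the notebook the pipeline would be fitted on `X_train` and `y_train`.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy stand-in data for illustration only.
toy_texts = [
    "great food and friendly staff",
    "amazing pizza, will come back",
    "terrible service and cold food",
    "awful experience, never again",
]
toy_labels = [1, 1, 0, 0]

# The pipeline applies TF-IDF and then the classifier in one call.
baseline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
baseline.fit(toy_texts, toy_labels)

# Round-trip through pickle in memory; write to a file with open(path, "wb").
restored = pickle.loads(pickle.dumps(baseline))

new_reviews = ["friendly staff and amazing food", "cold pizza and awful service"]
print(restored.predict(new_reviews))
```

The restored object accepts raw strings directly, so there is no separate vectorizer file to keep in sync.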


## Step 5: Train Transformer Model

Fine-tune **DistilBERT** for sentiment classification.

**Outcome:** You obtain Transformer Accuracy and F1 scores.


In [13]:
# Step 5: Train Transformer Model
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer, DataCollatorWithPadding
import numpy as np

# Setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

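def preprocess_function(examples):
    # Truncate to the model's maximum length; padding is left to
    # DataCollatorWithPadding, which pads each batch dynamically.
    return tokenizer(examples["text"], truncation=True)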
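
`DataCollatorWithPadding` pads each batch to the length of its longest sequence at collation time. The idea can be sketched in plain Python as follows; the token ids are invented, `pad_batch` is a hypothetical helper, and a padding id of 0 is assumed to match DistilBERT's vocabulary.

```python
PAD_ID = 0  # assumed padding token id


def pad_batch(batch: list[list[int]], pad_id: int = PAD_ID) -> dict[str, list[list[int]]]:
    """Pad every sequence in the batch to the length of the longest one."""
    longest = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in batch]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids for four reviews of different lengths.
encoded = [
    [101, 7, 102],
    [101, 7, 8, 9, 102],
    [101, 5, 6, 7, 8, 9, 10, 11, 12, 102],
    [101, 3, 102],
]

batch_a = pad_batch(encoded[:2])  # short reviews: padded to length 5
batch_b = pad_batch(encoded[2:])  # contains a long review: padded to length 10
print(len(batch_a["input_ids"][0]), len(batch_b["input_ids"][0]))  # 5 10
```

Batches of short reviews stay short, which saves computation compared with padding every example to one global length.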


In [14]:
# Tokenize
tokenized_train = train_ds.map(preprocess_function, batched=True)
tokenized_test = test_ds.map(preprocess_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Model
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=2)
model.to(device)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


In [15]:
# Metrics
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, predictions),
        "f1": f1_score(labels, predictions)
    }

In [16]:
# Trainer
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to="none"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,  # the test subset doubles as the validation set in this notebook
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()


| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
| --- | --- | --- | --- | --- |
| 1 | 0.233 | 0.186141 | 0.941 | 0.93998 |
| 2 | 0.0925 | 0.209669 | 0.9395 | 0.938981 |


TrainOutput(global_step=1000, training_loss=0.1627424774169922, metrics={'train_runtime': 830.8427, 'train_samples_per_second': 19.258, 'train_steps_per_second': 1.204, 'total_flos': 2119478378496000.0, 'train_loss': 0.1627424774169922, 'epoch': 2.0})
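
The `global_step` and throughput figures in `TrainOutput` can be checked with a little arithmetic. The sketch below assumes a single GPU and no gradient accumulation, which matches the arguments shown in the training cell.

```python
import math

train_samples = 8000      # size of train_ds
batch_size = 16           # per_device_train_batch_size
epochs = 2                # num_train_epochs
train_runtime = 830.8427  # seconds, taken from TrainOutput

# One optimizer step per batch, so steps per epoch is samples / batch size.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = train_samples * epochs / train_runtime

print(f"steps per epoch: {steps_per_epoch}")
print(f"total optimizer steps: {total_steps}")
print(f"samples per second: {samples_per_second:.3f}")
```

The results agree with the reported `global_step=1000` and `train_samples_per_second` of about 19.258.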

In [22]:
from transformers import pipeline
# Try on a new text
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
text = "The food was amazing and fresh!"
prediction = classifier(text)
print(prediction)

Device set to use cuda:0


[{'label': 'LABEL_1', 'score': 0.9944788217544556}]
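
The `score` field is a probability derived from the model's raw logits. For a two-class classifier the pipeline is understood to apply a softmax by default; the sketch below shows that computation in plain Python. The logits are hypothetical values chosen for illustration, not numbers from this run.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Convert raw logits into probabilities that sum to 1."""
    # Subtracting the maximum keeps the exponentials numerically stable.
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits ordered as [Negative, Positive].
example_logits = [-2.5, 2.7]
probs = softmax(example_logits)

# The predicted class is the index with the highest probability.
predicted_id = max(range(len(probs)), key=probs.__getitem__)
print(f"LABEL_{predicted_id}", round(probs[predicted_id], 4))
```

A score close to 1 therefore means the logit for the predicted class is much larger than the logit for the other class.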


In [24]:
label_names = ["Negative", "Positive"]
# Extract the numeric part from the label string and convert to int
predicted_label_index = int(prediction[0]["label"].split('_')[1])
predicted_label = label_names[predicted_label_index]
print(f"Input: {text}\n  → Prediction: {predicted_label}")

Input: The food was amazing and fresh!
  → Prediction: Positive


In [28]:
# Evaluate
eval_results = trainer.evaluate()
transformer_accuracy = eval_results["eval_accuracy"]
transformer_f1 = eval_results["eval_f1"]

print(f"Transformer Accuracy: {transformer_accuracy:.4f}")
print(f"Transformer F1 Score: {transformer_f1:.4f}")

Transformer Accuracy: 0.9410
Transformer F1 Score: 0.9400


## Step 6: Compare Results

Compare both models: Baseline F1 vs Transformer F1.

**Outcome:** Identify which model performs better.


In [30]:
# Step 6: Compare Results
print("Model | Accuracy | F1 Score")
print("--- | --- | ---")
print(f"Baseline | {baseline_accuracy:.4f} | {baseline_f1:.4f}")
print(f"Transformer | {transformer_accuracy:.4f} | {transformer_f1:.4f}")

better_model = "Transformer" if transformer_f1 > baseline_f1 else "Baseline"
print(f"\nConclusion: The {better_model} model performed better.")

Model | Accuracy | F1 Score
--- | --- | ---
Baseline | 0.9100 | 0.9087
Transformer | 0.9410 | 0.9400

Conclusion: The Transformer model performed better.
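
With 2000 test reviews, it is worth checking whether a three-point accuracy gap is larger than sampling noise. A rough check, sketched below, uses a normal-approximation 95% interval for each accuracy. This treats the two models as independent, which is only an approximation because both were scored on the same test subset; the helper name is hypothetical.

```python
import math


def accuracy_interval(accuracy: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation confidence interval for a proportion."""
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - half_width, accuracy + half_width


n_test = 2000  # size of test_ds
baseline_low, baseline_high = accuracy_interval(0.9100, n_test)
transformer_low, transformer_high = accuracy_interval(0.9410, n_test)

print(f"Baseline:    [{baseline_low:.4f}, {baseline_high:.4f}]")
print(f"Transformer: [{transformer_low:.4f}, {transformer_high:.4f}]")
print("Intervals overlap:", baseline_high >= transformer_low)
```

Non-overlapping intervals suggest the gap is unlikely to be an artifact of the particular 2000 reviews sampled, though a paired test on per-example predictions would be a stricter check.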


## Step 7: Save Fine-Tuned Model

Save the trained Transformer model and tokenizer.

**Outcome:** A reusable trained model ready for inference.


In [17]:
# Step 7: Save Fine-Tuned Model
save_path = "./fine_tuned_distilbert_yelp"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model saved to {save_path}")

Model saved to ./fine_tuned_distilbert_yelp
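
Since the saved model folder is one of the deliverables, it may be convenient to bundle it into a single archive before downloading it from Colab. The sketch below uses the standard-library `shutil.make_archive`; the helper name `zip_model_folder` and the stand-in folder are assumptions made so the example runs anywhere.

```python
import shutil
import tempfile
from pathlib import Path


def zip_model_folder(folder: str, archive_base: str) -> str:
    """Create <archive_base>.zip from the contents of folder and return its path."""
    return shutil.make_archive(archive_base, "zip", root_dir=folder)


# Demonstration with a temporary stand-in folder; in the notebook, pass
# "./fine_tuned_distilbert_yelp" as the folder instead.
with tempfile.TemporaryDirectory() as tmp:
    stand_in = Path(tmp) / "model_folder"
    stand_in.mkdir()
    (stand_in / "config.json").write_text("{}")

    archive_path = zip_model_folder(str(stand_in), str(Path(tmp) / "model_archive"))
    archive_name = Path(archive_path).name
    archive_exists = Path(archive_path).exists()

print(archive_name, archive_exists)  # model_archive.zip True
```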


## Step 8: Inference Demo

Test your fine-tuned Transformer on new customer reviews.

**Outcome:** Visual proof that your model works.


In [31]:
# Step 8: Inference Demo
inputs = ["The food was amazing and fresh!", "The service was horrible and slow."]
label_map = {0: "Negative 😡", 1: "Positive 😀"}
model.eval()
for text in inputs:
    # A single review needs no padding; truncation guards against long inputs
    inputs_tok = tokenizer(text, return_tensors="pt", truncation=True).to(device)
    with torch.no_grad():
        logits = model(**inputs_tok).logits
    # argmax over the class dimension; .item() works because the batch holds one review
    predicted_class_id = logits.argmax(dim=-1).item()
    print(f"Input: {text}\n  → Prediction: {label_map[predicted_class_id]}")

Input: The food was amazing and fresh!
  → Prediction: Positive 😀
Input: The service was horrible and slow.
  → Prediction: Negative 😡


## Step 9: Log Experiment Results

Create a CSV file named `experiments_log.csv`.

**Outcome:** Structured tracking of your experiments.


In [32]:
# Step 9: Log Experiment Results
import pandas as pd
import os
from datetime import datetime

log_file = "experiments_log.csv"
timestamp = datetime.now().strftime("%Y-%m-%d %H:%M")
log_data = {
    "dataset": ["yelp_polarity"],
    "subset_size": [len(train_ds)],
    "baseline_accuracy": [baseline_accuracy],
    "baseline_f1": [baseline_f1],
    "transformer_accuracy": [transformer_accuracy],
    "transformer_f1": [transformer_f1],
    "timestamp": [timestamp]
}
df_log = pd.DataFrame(log_data)
if os.path.exists(log_file):
    df_log.to_csv(log_file, mode='a', header=False, index=False)
else:
    df_log.to_csv(log_file, index=False)
print(f"Results logged to {log_file}")

Results logged to experiments_log.csv
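
The cell above writes the header only when the log file does not yet exist, so repeated runs append rows to one table. The same append-or-create pattern can be sketched with the standard-library `csv` module. The field names and `run_id` values here are illustrative; the first row reuses the F1 scores reported in this notebook, and the second is a hypothetical run with its metrics left empty.

```python
import csv
import os
import tempfile

FIELDS = ["run_id", "subset_size", "baseline_f1", "transformer_f1"]


def append_row(path: str, row: dict) -> None:
    """Append one experiment row, writing the header only for a new file."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)


with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "experiments_log.csv")
    # First row: F1 scores reported in this notebook.
    append_row(log_path, {"run_id": "run_1", "subset_size": 8000,
                          "baseline_f1": 0.9087, "transformer_f1": 0.9400})
    # Second row: hypothetical later run, metrics not filled in yet.
    append_row(log_path, {"run_id": "run_2", "subset_size": 2000,
                          "baseline_f1": None, "transformer_f1": None})

    # Read the log back to confirm there is one header and two data rows.
    with open(log_path, newline="") as f:
        rows = list(csv.DictReader(f))

print(len(rows), [r["run_id"] for r in rows])
```

Keeping one row per run makes it easy to compare subset sizes or hyperparameters later by loading the file into a DataFrame.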


## Step 10: Mini Technical Report
Write a short technical recommendation.



In [33]:
# Step 10: Mini Technical Report
report = f"""
1. Dataset: Yelp Polarity
2. Samples: {len(train_ds)}
3. Baseline F1: {baseline_f1:.4f}
4. Transformer F1: {transformer_f1:.4f}
5. Better model: {better_model}
6. Recommendation: {better_model} is recommended due to higher performance."""
print(report)


1. Dataset: Yelp Polarity
2. Samples: 8000
3. Baseline F1: 0.9087
4. Transformer F1: 0.9400
5. Better model: Transformer
6. Recommendation: Transformer is recommended due to higher performance.


## Task Completed

By finishing this task, you have built and compared two sentiment models, saved the fine-tuned Transformer, and run inference with it.
