### Task 3 & 4 – Evaluation + Troubleshooting

In [None]:
## 1.Imports & Setup

from datasets import load_dataset
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM
import numpy as np



In [None]:
## 2.Loading dataset

dataset = load_dataset("tweet_eval", "sentiment")
test_data = dataset["test"].shuffle(seed=42).select(range(200))

label_names = dataset["train"].features["label"].names

In [None]:
## 3.Load Pretrained Model

classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")

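y_true = []
y_pred = []

# Run the binary SST-2 classifier on every sampled tweet.
for row in test_data:
    pred = classifier(row["text"])[0]["label"].upper()
    y_pred.append(pred)
    y_true.append(label_names[row["label"]].upper())

# The classifier only outputs POSITIVE or NEGATIVE, so the three gold labels
# are collapsed to two. Note that NEUTRAL tweets are counted as NEGATIVE here.
mapped_true = ["POSITIVE" if t == "POSITIVE" else "NEGATIVE" for t in y_true]
mapped_pred = ["POSITIVE" if "POS" in p else "NEGATIVE" for p in y_pred]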

Device set to use cpu
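
The mapping in the cell above counts every NEUTRAL gold label as NEGATIVE. One alternative, sketched here under the assumption that a purely binary comparison is wanted, is to drop the neutral tweets before scoring. `drop_neutral` is a hypothetical helper that is not part of the notebook, and the labels are toy values, not results from the run.

```python
def drop_neutral(y_true, y_pred):
    """Keep only the pairs whose gold label is POSITIVE or NEGATIVE."""
    kept = [(t, p) for t, p in zip(y_true, y_pred) if t != "NEUTRAL"]
    true_labels = [t for t, _ in kept]
    pred_labels = [p for _, p in kept]
    return true_labels, pred_labels


# Toy labels for illustration only (assumed, not taken from the run above).
toy_true = ["POSITIVE", "NEUTRAL", "NEGATIVE", "NEUTRAL"]
toy_pred = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]

binary_true, binary_pred = drop_neutral(toy_true, toy_pred)
print(binary_true)
print(binary_pred)
```

Scoring the filtered lists measures only how well the model separates positive from negative tweets, so the result is not directly comparable to the numbers reported for the collapsed labels.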


In [None]:
## 4.Evaluate DistilBERT (scikit-learn)

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

print("=== DistilBERT Evaluation (scikit-learn) ===")
print("Accuracy:", accuracy_score(mapped_true, mapped_pred))

precision, recall, f1, _ = precision_recall_fscore_support(mapped_true, mapped_pred, average="macro")
print("Precision:", precision)
print("Recall:", recall)
print("F1:", f1)

=== DistilBERT Evaluation (scikit-learn) ===
Accuracy: 0.75
Precision: 0.6879595588235294
Recall: 0.7125779625779626
F1: 0.6964545896066052
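
With `average="macro"`, precision, recall, and F1 are computed per class and then averaged with equal weight, so the smaller class counts as much as the larger one. The sketch below reproduces that calculation in plain Python on toy labels. `macro_prf` is a hypothetical helper, and treating an undefined ratio as 0.0 is an assumption made for this illustration.

```python
def macro_prf(y_true, y_pred):
    """Return macro-averaged (precision, recall, f1) over all seen labels."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    pairs = list(zip(y_true, y_pred))
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Undefined ratios are treated as 0.0 (assumption for this sketch).
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy labels for illustration only (assumed, not taken from the run above).
toy_true = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE", "NEGATIVE", "NEGATIVE"]
toy_pred = ["POSITIVE", "NEGATIVE", "NEGATIVE", "NEGATIVE", "NEGATIVE", "POSITIVE"]

macro_p, macro_r, macro_f1 = macro_prf(toy_true, toy_pred)
print(macro_p, macro_r, macro_f1)
```

Here the POSITIVE class scores 0.5 and the NEGATIVE class 0.75 on each metric, so every macro average is 0.625 even though plain accuracy is 4/6.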


In [None]:
## 5.Prompting FLAN-T5

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

t5_name = "google/flan-t5-small"
t5_tokenizer = AutoTokenizer.from_pretrained(t5_name)
t5_model = AutoModelForSeq2SeqLM.from_pretrained(t5_name)

def flan_generate(prompt, max_new_tokens=50):
    inputs = t5_tokenizer(prompt, return_tensors="pt")
    outputs = t5_model.generate(**inputs, max_new_tokens=max_new_tokens)
    return t5_tokenizer.decode(outputs[0], skip_special_tokens=True)


def prompt_direct(tweet):
    return f"Classify the sentiment of this tweet as Positive, Neutral, or Negative:\n\nTweet: {tweet}"

def prompt_fewshot(tweet):
    examples = """
Tweet: "I love flying with Delta, great service!" → Positive
Tweet: "The flight was delayed and staff were rude." → Negative
Tweet: "The plane was okay, nothing special." → Neutral
"""
    return f"{examples}\nClassify the following tweet:\nTweet: {tweet}"

def prompt_chain_of_thought(tweet):
    return f"Analyze the tweet step by step. First decide if the tone is favorable, unfavorable, or neutral. Then provide the final label.\n\nTweet: {tweet}"

## Testing prompts on a few tweets
sample_tweets = [
    "I love the crew, they were very friendly!",
    "The flight was delayed by 3 hours.",
    "It was just okay, nothing special."
]

print("\n=== FLAN-T5 Prompt Interaction ===")
for tweet in sample_tweets:
    print("\nTWEET:", tweet)
    for style, func in [("Direct", prompt_direct), ("Few-Shot", prompt_fewshot), ("Chain-of-Thought", prompt_chain_of_thought)]:
        prompt = func(tweet)
        response = flan_generate(prompt)
        print(f"\n{style} Prompt:\n{prompt}")
        print("Model Response:", response)




=== FLAN-T5 Prompt Interaction ===

TWEET: I love the crew, they were very friendly!

Direct Prompt:
Classify the sentiment of this tweet as Positive, Neutral, or Negative:

Tweet: I love the crew, they were very friendly!
Model Response: Positive

Few-Shot Prompt:

Tweet: "I love flying with Delta, great service!" → Positive
Tweet: "The flight was delayed and staff were rude." → Negative
Tweet: "The plane was okay, nothing special." → Neutral

Classify the following tweet:
Tweet: I love the crew, they were very friendly!
Model Response: Positive

Chain-of-Thought Prompt:
Analyze the tweet step by step. First decide if the tone is favorable, unfavorable, or neutral. Then provide the final label.

Tweet: I love the crew, they were very friendly!
Model Response: Positive

TWEET: The flight was delayed by 3 hours.

Direct Prompt:
Classify the sentiment of this tweet as Positive, Neutral, or Negative:

Tweet: The flight was delayed by 3 hours.
Model Response: Negative

Few-Shot Prompt:

T

#### 6.Troubleshooting


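Likely Issue:
- DistilBERT ignores custom prompts because it is a fine-tuned classifier, not an instruction-following model. This checkpoint was fine-tuned on SST-2, so it only outputs POSITIVE or NEGATIVE and cannot label a tweet as 'Neutral'. That leads to poor performance on 'Neutral' tweets.

Proposed Solutions:
- Use generative LLMs (e.g., FLAN-T5, GPT) that respond to prompt variations, or fine-tune DistilBERT on three-class sentiment data if compute allows.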
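
Scoring a generative model against the three gold labels requires mapping its free-text responses onto a fixed label set first. A minimal sketch, assuming the responses mention the label word somewhere in the text. `normalize_label` and the `UNKNOWN` fallback are hypothetical and not part of the notebook.

```python
LABELS = ("positive", "neutral", "negative")


def normalize_label(response):
    """Map a free-text model response to POSITIVE, NEUTRAL, NEGATIVE or UNKNOWN."""
    text = response.lower()
    found = [label for label in LABELS if label in text]
    # Accept the response only when exactly one label word appears;
    # empty or ambiguous answers fall back to UNKNOWN.
    return found[0].upper() if len(found) == 1 else "UNKNOWN"


# Toy responses for illustration only (assumed, not actual model output).
for response in ["Positive", "The tone is negative.", "positive or negative", ""]:
    print(repr(response), "->", normalize_label(response))
```

Keeping an explicit `UNKNOWN` bucket makes it visible how often a prompt style produces an unusable answer, instead of silently counting those cases as one of the classes.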
