### Lab Assignment 6: Sentiment Analysis with Zero-Shot Prompting, Few-Shot Prompting, and Multiple LLMs
### Author: Maurya sasanka Bhima
### ASU ID: 1234108592
### Date: 5th March, 2025

In [None]:
# Code Cell 1: Import Required Libraries and Load Data
import pandas as pd
from transformers import pipeline
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import random

# Load Dataset
file_path = "/content/restaurant_reviews_az.csv"
df = pd.read_csv(file_path)

# Display dataset structure
display(df.head())

Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date,Sentiment
0,IVS7do_HBzroiCiymNdxDg,fdFgZQQYQJeEAshH4lxSfQ,sGy67CpJctjeCWClWqonjA,3,1,1,0,"OK, the hype about having Hatch chili in your ...",1/27/2020 22:59,1
1,QP2pSzSqpJTMWOCuUuyXkQ,JBLWSXBTKFvJYYiM-FnCOQ,3w7NRntdQ9h0KwDsksIt5Q,5,1,1,1,Pandemic pit stop to have an ice cream.... onl...,4/19/2020 5:33,1
2,oK0cGYStgDOusZKz9B1qug,2_9fKnXChUjC5xArfF8BLg,OMnPtRGmbY8qH_wIILfYKA,5,1,0,0,I was lucky enough to go to the soft opening a...,2/29/2020 19:43,1
3,E_ABvFCNVLbfOgRg3Pv1KQ,9MExTQ76GSKhxSWnTS901g,V9XlikTxq0My4gE8LULsjw,5,0,0,0,I've gone to claim Jumpers all over the US and...,3/14/2020 21:47,1
4,Rd222CrrnXkXukR2iWj69g,LPxuausjvDN88uPr-Q4cQA,CA5BOxKRDPGJgdUQ8OUOpw,4,1,0,0,"If you haven't been to Maynard's kitchen, it'...",1/17/2020 20:32,1


In [None]:
# Code Cell 2: Data Preprocessing
# Select 50 positive and 50 negative reviews
positive_reviews = df[df['Sentiment'] == 1].sample(n=50, random_state=42)
negative_reviews = df[df['Sentiment'] == 0].sample(n=50, random_state=42)
balanced_df = pd.concat([positive_reviews, negative_reviews]).reset_index(drop=True)

# Extract review texts and labels
reviews = balanced_df['text'].tolist()
true_labels = balanced_df['Sentiment'].tolist()

# Display dataset preview
display(balanced_df.head())


Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date,Sentiment
0,nAN_rYmPh82T1WzlROCcsw,8XeTv8Js_8um5Ht1Qnb0qw,n9kqlp48MzXB--LKoRjQhA,5,1,0,0,"First time ordering online, easy transaction, ...",1/4/2020 21:54,1
1,B0hgC22SWvPBStXv8jzSmw,8IPSQT6yPWmqxafauO4LrA,isrmmF6K_OZC2maNStwYNQ,5,0,0,0,This might be my favorite restaurant in Tucson...,7/18/2020 2:32,1
2,8j3H4k2gthWI3-AuqgqWzw,J7qboaD38ra2I0EMb3dqHA,eN-Zrz1orLoqIb7D6mUMbg,4,1,1,0,Ordered at the window! Because I ordered the b...,8/5/2020 21:52,1
3,KDtcFDryEmyJ0xZG_yy4rQ,vEFMvU78DrLZVKV1h-ZOpg,RSrBPqSze2HJkx5DZsm7FA,5,2,0,1,"Super good food, most of the menu is made from...",10/19/2020 4:16,1
4,I09Lr_K3DofaIHSiylVTFQ,DgtXZzWytUNWKyjq9i4dhw,l7FBm3yxW0dx0WqQVlcQ1Q,5,0,0,0,This place has amazing wings! The mild and buf...,1/10/2020 1:57,1


In [None]:
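# Code Cell 3: Perform Sentiment Analysis Using Zero-Shot Learning
# Load a sentiment analysis pipeline from Hugging Face.
# Note: this checkpoint is DistilBERT fine-tuned on SST-2, so "zero-shot" here
# means only that no examples from this dataset are supplied to the model.
sentiment_pipeline = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")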

def zero_shot_sentiment_analysis(reviews):
    predictions = []
    for review in reviews:
        result = sentiment_pipeline(review)[0]
        sentiment = result['label'].lower()
        if "positive" in sentiment:
            predictions.append(1)
        elif "negative" in sentiment:
            predictions.append(0)
        else:
            predictions.append(None)  # Handle unexpected responses
    return predictions

predicted_labels = zero_shot_sentiment_analysis(reviews)

# Filter out None values
valid_indices = [i for i, x in enumerate(predicted_labels) if x is not None]
filtered_true_labels = [true_labels[i] for i in valid_indices]
filtered_predicted_labels = [predicted_labels[i] for i in valid_indices]

# Compute Evaluation Metrics
accuracy = accuracy_score(filtered_true_labels, filtered_predicted_labels)
precision = precision_score(filtered_true_labels, filtered_predicted_labels)
recall = recall_score(filtered_true_labels, filtered_predicted_labels)
f1 = f1_score(filtered_true_labels, filtered_predicted_labels)

# Display Results
print("Zero-Shot Sentiment Analysis Results:")
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")

Device set to use cpu


Zero-Shot Sentiment Analysis Results:
Accuracy: 0.86
Precision: 0.82
Recall: 0.92
F1 Score: 0.87
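
The rounded scores above are consistent with a single confusion matrix, assuming all 100 sampled reviews received a valid prediction (none filtered out as `None`). With 50 positive reviews, a recall of 0.92 implies 46 true positives and 4 false negatives; a precision of 0.82 then implies 10 false positives, leaving 40 true negatives. The sketch below recomputes the four metrics from those inferred counts. The counts are a reconstruction from the printed scores, not values produced by the notebook.

```python
# Confusion-matrix counts inferred from the rounded scores above
# (assumption: all 100 reviews received a valid prediction).
tp, fn = 46, 4   # positive reviews: predicted positive / predicted negative
fp, tn = 10, 40  # negative reviews: predicted positive / predicted negative

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
```

Under this reconstruction, false positives outnumber false negatives, which is why precision trails recall.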


In [None]:
# Code Cell 4: Perform Sentiment Analysis Using Few-Shot Learning
# Select a few labeled examples.
# Note: the sample is unseeded and drawn from the same reviews that are evaluated.
few_shot_examples = random.sample(reviews, 4)

# Build a prompt that includes labeled examples
def few_shot_sentiment_analysis(reviews, examples):
    predictions = []
    for review in reviews:
        # Construct prompt including few-shot examples
        prompt = "Here are some examples of sentiment analysis:\n"
        for ex in examples:
            sentiment_label = "positive" if true_labels[reviews.index(ex)] == 1 else "negative"
            prompt += f"Review: {ex}\nSentiment: {sentiment_label}\n"
        prompt += f"Now classify the following:\nReview: {review}\nSentiment: "

        # Note: the prompt is built but never used. The classifier receives the
        # raw review, so the predictions match Code Cell 3 exactly.
        result = sentiment_pipeline(review)[0]
        sentiment = result['label'].lower()
        if "positive" in sentiment:
            predictions.append(1)
        elif "negative" in sentiment:
            predictions.append(0)
        else:
            predictions.append(None)  # Handle unexpected responses
    return predictions

few_shot_predicted_labels = few_shot_sentiment_analysis(reviews, few_shot_examples)

# Compute Evaluation Metrics
accuracy_fs = accuracy_score(true_labels, few_shot_predicted_labels)
precision_fs = precision_score(true_labels, few_shot_predicted_labels)
recall_fs = recall_score(true_labels, few_shot_predicted_labels)
f1_fs = f1_score(true_labels, few_shot_predicted_labels)

# Display Results
print("Few-Shot Sentiment Analysis Results:")
print(f"Accuracy: {accuracy_fs:.2f}")
print(f"Precision: {precision_fs:.2f}")
print(f"Recall: {recall_fs:.2f}")
print(f"F1 Score: {f1_fs:.2f}")


Few-Shot Sentiment Analysis Results:
Accuracy: 0.86
Precision: 0.82
Recall: 0.92
F1 Score: 0.87
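
For labeled examples to influence a prediction, the assembled prompt has to be sent to a model that reads it, and the examples should not also appear in the evaluation set. The sketch below covers only the data handling: it picks the examples by index with a fixed seed, holds them out, and builds the prompt text. `split_examples` and `build_few_shot_prompt` are hypothetical helper names, the toy reviews are invented for illustration, and the call to a generative model is deliberately left out.

```python
import random


def split_examples(reviews, labels, k=4, seed=42):
    """Pick k labeled examples; return them plus the held-out evaluation pairs."""
    rng = random.Random(seed)
    example_idx = set(rng.sample(range(len(reviews)), k))
    examples = [(reviews[i], labels[i]) for i in sorted(example_idx)]
    held_out = [(reviews[i], labels[i])
                for i in range(len(reviews)) if i not in example_idx]
    return examples, held_out


def build_few_shot_prompt(examples, review):
    """Assemble a prompt with labeled examples followed by the review to classify."""
    lines = ["Here are some examples of sentiment analysis:"]
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {'positive' if label == 1 else 'negative'}")
    lines.append("Now classify the following:")
    lines.append(f"Review: {review}")
    lines.append("Sentiment:")
    return "\n".join(lines)


# Toy data, invented for illustration only.
toy_reviews = ["Great tacos.", "Cold fries.", "Loved the patio.",
               "Rude staff.", "Best salsa in town.", "Overpriced and bland."]
toy_labels = [1, 0, 1, 0, 1, 0]

examples, held_out = split_examples(toy_reviews, toy_labels, k=2)
prompt = build_few_shot_prompt(examples, held_out[0][0])
print(prompt)
```

Selecting by index also avoids `reviews.index(ex)`, which returns the first match and can attach the wrong label when two reviews have identical text.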


In [None]:
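# Code Cell 5: Experiment with Multiple LLMs
# Using two different models. Note: the second checkpoint,
# facebook/bart-large-mnli, is a BART model fine-tuned for natural language
# inference (MNLI); it is not LLaMA-based. The "llama" names are kept only so
# the code matches the recorded output.
llama_pipeline = pipeline("text-classification", model="facebook/bart-large-mnli")

def multi_llm_sentiment_analysis(reviews):
    results = {"distilbert": [], "llama": []}
    for review in reviews:
        distilbert_result = sentiment_pipeline(review)[0]['label'].lower()
        llama_result = llama_pipeline(review)[0]['label'].lower()

        results["distilbert"].append(1 if "positive" in distilbert_result else 0)
        # The MNLI labels are contradiction / neutral / entailment, so
        # "positive" never matches and every review is mapped to 0 here.
        results["llama"].append(1 if "positive" in llama_result else 0)
    return results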

multi_llm_results = multi_llm_sentiment_analysis(reviews)

# Display outputs
print("Comparison of Multiple LLMs")
print("DistilBERT Sentiment Predictions:", multi_llm_results["distilbert"][:5])
print("LLaMA Sentiment Predictions:", multi_llm_results["llama"][:5])


Device set to use cpu


Comparison of Multiple LLMs
DistilBERT Sentiment Predictions: [1, 0, 1, 1, 1]
LLaMA Sentiment Predictions: [0, 0, 0, 0, 0]
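
The all-zero list above is what the mapping in Code Cell 5 produces for any NLI checkpoint, because its labels are `contradiction`, `neutral`, and `entailment`. An MNLI model can still be used for sentiment through the Hugging Face `zero-shot-classification` pipeline, which scores a set of candidate labels for each text. The sketch below is a minimal illustration under that assumption: `zero_shot_sentiment` is a hypothetical helper that takes the classifier as an argument, and `stub_classifier` is a stand-in with made-up scores that only mimics the output shape so the example runs without downloading a model.

```python
def zero_shot_sentiment(classifier, review):
    """Return 1 (positive) or 0 (negative) using a zero-shot classifier.

    `classifier` is expected to behave like the Hugging Face
    "zero-shot-classification" pipeline: called with a text and
    `candidate_labels`, it returns a dict whose "labels" list is sorted
    by descending score.
    """
    output = classifier(review, candidate_labels=["positive", "negative"])
    return 1 if output["labels"][0] == "positive" else 0


def stub_classifier(text, candidate_labels):
    """Stand-in for a real pipeline; the rule and scores are made up."""
    top = "negative" if "cold" in text.lower() else "positive"
    other = [label for label in candidate_labels if label != top][0]
    return {"sequence": text, "labels": [top, other], "scores": [0.9, 0.1]}


print(zero_shot_sentiment(stub_classifier, "The fries were cold."))  # 0
print(zero_shot_sentiment(stub_classifier, "Amazing wings!"))        # 1
```

With `transformers` installed, `pipeline("zero-shot-classification", model="facebook/bart-large-mnli")` should be usable in place of the stub. That configuration was not run on the 100 sampled reviews here, so no scores are reported for it.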


In [1]:
# Text Cell 6: Discussion and Observations

## **Comparison of Zero-Shot and Few-Shot Learning**
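''' Both the zero-shot and few-shot runs produced identical scores:
- **Accuracy**: 86%
- **Precision**: 82%
- **Recall**: 92%
- **F1 Score**: 87%

The identical scores follow from the code rather than from the model. The few-shot function builds a prompt containing labeled examples but then calls the classifier on the raw review, so the examples never reach the model and both runs make the same predictions. The comparison therefore says nothing about whether few-shot examples help. In addition, `distilbert-base-uncased-finetuned-sst-2-english` is a classifier fine-tuned on SST-2, not a prompted LLM, so neither run is zero-shot or few-shot prompting in the strict sense; testing that would require a generative, instruction-following model.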

## **Instances Where Predictions Deviated from Actual Labels**
Precision (82%) is lower than recall (92%) on a balanced sample, so most errors are negative reviews predicted as positive. The individual misclassified reviews were not inspected, so the categories below are plausible sources of error rather than observed ones:
- Reviews that convey both positive and negative aspects (e.g., *"The food was excellent, but the service was slow."*)
- Brief reviews that provide minimal context (e.g., *"Not bad."*)
- Sarcastic remarks that models may struggle to interpret accurately.

## **Analysis of Misclassifications**
Several factors could have contributed to these errors:
- **Model bias**: The training data may contain stylistic patterns that influence the model’s predictions.
- **Complex sentiment**: Some reviews involve nuanced opinions that require deeper contextual understanding.
- **Training data limitations**: The model was fine-tuned on SST-2, which consists of movie-review sentences, so it may struggle with restaurant-specific expressions and with long, multi-topic Yelp-style reviews.

## **Performance Differences Between DistilBERT and the Second Model**
The second model is labeled "LLaMA" in the code, but the checkpoint loaded is `facebook/bart-large-mnli`, a BART model fine-tuned for natural language inference. In the first five predictions shown:
- **DistilBERT Predictions**: A mix of positive and negative classifications (`[1, 0, 1, 1, 1]`).
- **BART-large-MNLI ("LLaMA") Predictions**: All negative (`[0, 0, 0, 0, 0]`).

The all-negative output does not show that the model has difficulty detecting positive sentiment. It is most likely an artifact of the label mapping:
- Under the `text-classification` pipeline, this model returns `contradiction`, `neutral`, or `entailment`, never `positive`.
- The code maps every label that does not contain "positive" to 0, so every review is recorded as negative.

A fair comparison would require a second model that outputs sentiment labels, or a zero-shot setup that scores "positive" and "negative" as candidate labels. More generally, these results highlight the importance of checking that a model's output labels match the task, and of selecting or fine-tuning models for domain-specific applications.
'''
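
The discussion above reasons about likely error types without looking at the reviews that were actually misclassified. A minimal sketch of how those reviews could be listed is shown below. `find_misclassified` is a hypothetical helper, and the toy inputs are invented for illustration; in the notebook it would be called with `reviews`, `true_labels`, and `predicted_labels`.

```python
def find_misclassified(reviews, true_labels, predicted_labels):
    """Return (index, review, true, predicted) for each wrong prediction.

    Entries whose prediction is None (unexpected model output) are skipped.
    """
    errors = []
    for i, (text, truth, pred) in enumerate(
            zip(reviews, true_labels, predicted_labels)):
        if pred is not None and pred != truth:
            errors.append((i, text, truth, pred))
    return errors


# Toy inputs, invented for illustration only.
toy_reviews = ["Great tacos.", "Not bad, I guess.", "Loved it.", "Rude staff."]
toy_true = [1, 0, 1, 0]
toy_pred = [1, 1, None, 0]

for idx, text, truth, pred in find_misclassified(toy_reviews, toy_true, toy_pred):
    print(f"[{idx}] true={truth} predicted={pred}: {text}")
```

Reading the returned reviews would show whether mixed sentiment, brevity, or sarcasm actually account for the errors.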



### Acknowledgment

I acknowledge that I have not used any GenAI tools other than ChatGPT, which was utilized for guidance and structuring the analysis where needed.