In [1]:
import pandas as pd
import matplotlib.pyplot as plt

from task1_scripts.prompts import (
    DIRECT_PROMPT_TEMPLATE,
    CHAIN_REASONING_PROMPT_TEMPLATE,
    FEWSHOT_PROMPT_PREFIX
)

from task1_scripts.llm_client import LLMClient
from task1_scripts.evaluate_prompts import PromptEvaluator
from task1_scripts.utils import load_sample

In [2]:
df = load_sample("data/sample_yelp_200.csv")
df.head()



Unnamed: 0,review,stars
95,Went here while on vacation in Phoneix based o...,5
15,So I was pretty excited about this burger join...,2
30,"My wife and I live around the corner, hadn't e...",1
158,I don't often review places on Yelp because of...,2
128,The Harkins Camelview 5 gives Arizonans the un...,5


In [3]:
llm = LLMClient()
evaluator = PromptEvaluator(llm)


In [4]:
print(llm.generate("Say hello"))





Hello there! 😊 How’s your day going so far?


In [5]:
df_small = df.sample(10, random_state=42)
df_small.head()


Unnamed: 0,review,stars
194,Love this place! Wish we had one here in CA. G...,5
177,I was searching for a unique item to give as a...,1
9,"1 star for service, but the food is not ok :( ...",1
17,I have been trying to reduce my caloric intake...,4
81,I had a chicken panini and was more than enoug...,4


In this task, I evaluated three prompting strategies (direct, chain-of-thought, and few-shot) for predicting Yelp star ratings with an LLM. Each prompt is shown below, followed by the reasons for its design and for any changes made to it.

You are a review classifier. Given the Yelp review below, assign a star rating from 1 to 5.
Return only a JSON object with keys:
  "predicted_stars": (int),
  "explanation": (brief string)

Review:
""" {review} """

Why this prompt?

Establishes a baseline for performance.

Very simple and easy for the model to follow.

Requests strict JSON output, which keeps parsing simple.

Helps measure how well the model performs without examples or reasoning.
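
A minimal sketch of how a template like the one above might be filled for each review. The contents of `DIRECT_PROMPT_TEMPLATE` and the internals of `PromptEvaluator` are not shown in this notebook, so `TEMPLATE` and `build_prompt` are hypothetical stand-ins. `str.replace` is used instead of `str.format` so that literal braces in a template (for example, JSON samples) are left untouched.

```python
# Hypothetical copy of the direct prompt; the real DIRECT_PROMPT_TEMPLATE
# lives in task1_scripts.prompts and may differ.
TEMPLATE = (
    "You are a review classifier. Given the Yelp review below, "
    "assign a star rating from 1 to 5.\n"
    "Return only a JSON object with keys:\n"
    '  "predicted_stars": (int),\n'
    '  "explanation": (brief string)\n'
    "\n"
    "Review:\n"
    '""" {review} """'
)


def build_prompt(template: str, review: str) -> str:
    """Substitute the review text into the {review} placeholder."""
    return template.replace("{review}", review)


prompt = build_prompt(TEMPLATE, "Great tacos, slow service.")
print(prompt)
```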

In [6]:
df_small = df.sample(10, random_state=42)
res_direct = evaluator.run_on_df(df_small, "direct")


You are an assistant that reasons briefly about the sentiment and assigns a rating.

First, think step-by-step about the sentiment.
THEN output ONLY a JSON object with:
  "predicted_stars"
  "explanation"

IMPORTANT:
- Do NOT output anything except the JSON object.

Review:
""" {review} """

Why this prompt?

Added light reasoning (“think step-by-step”) to see if accuracy improves.

Emotion and sentiment tasks sometimes benefit from CoT.

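The original version returned extra text before and after the JSON, so the prompt was updated to require JSON-only output.

The aim is to keep the response structured and machine-readable, although the raw_response column in the output below still starts with reasoning text, so the JSON has to be located within the response.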
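
Because a response can carry reasoning text or code fences around the JSON, a tolerant parser is useful. The following is a minimal sketch under that assumption; `extract_json` is a hypothetical helper, and how `PromptEvaluator` actually fills the `parsed_json` column is not shown in this notebook.

```python
import json
import re


def extract_json(raw: str):
    """Return the first JSON object found in raw, or None if there is none."""
    decoder = json.JSONDecoder()
    # Try to decode starting at every "{" until one yields a valid object.
    for match in re.finditer(r"\{", raw):
        try:
            obj, _ = decoder.raw_decode(raw, match.start())
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            return obj
    return None


with_preamble = 'The review is positive.\n{"predicted_stars": 5, "explanation": "Happy customer."}'
with_fence = '```json\n{\n  "predicted_stars": 2,\n  "explanation": "Poor service."\n}\n```'

print(extract_json(with_preamble))
print(extract_json(with_fence))
print(extract_json("no JSON here"))
```

Returning `None` on failure lets the caller count unparsable responses separately from wrong predictions.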

In [7]:
res_chain = evaluator.run_on_df(df_small, "chain")
res_chain.head()



Unnamed: 0,review,gold,raw_response,parsed_json,predicted
0,Love this place! Wish we had one here in CA. G...,5,The review expresses strong positive feelings ...,"{'predicted_stars': 5, 'explanation': 'The rev...",5.0
1,I was searching for a unique item to give as a...,1,The review expresses strong dissatisfaction wi...,"{'predicted_stars': 1, 'explanation': 'The rev...",1.0
2,"1 star for service, but the food is not ok :( ...",1,The customer expresses strong dissatisfaction ...,"{'predicted_stars': 1, 'explanation': 'The cus...",1.0
3,I have been trying to reduce my caloric intake...,4,The review expresses a helpful tip for those w...,"{'predicted_stars': 4, 'explanation': 'The rev...",4.0
4,I had a chicken panini and was more than enoug...,4,The review expresses positive feelings about t...,"{'predicted_stars': 4, 'explanation': 'The rev...",4.0


You are a helpful classifier. Examples:

Review: "Great food and quick service"
→ {"predicted_stars": 5, "explanation": "Positive mention of food and service."}

Review: "Food was cold and the waiter was rude"
→ {"predicted_stars": 2, "explanation": "Negative food quality and poor service."}

Review: "Okay place, not bad but overpriced"
→ {"predicted_stars": 3, "explanation": "Mixed sentiment; price is an issue."}

Review: "Terrible, found hair in my soup"
→ {"predicted_stars": 1, "explanation": "Severe negative hygiene issue."}

Review: "Decent food, friendly staff"
→ {"predicted_stars": 4, "explanation": "Mostly positive overall."}

Now classify the following review.
Output ONLY a JSON object with predicted_stars and explanation.

Review:
""" {review} """

Why this prompt?

Provides examples of how star ratings connect to sentiment.

Helps the model generalize across review styles.

Includes a variety of tones: highly positive, mixed, negative, severe complaint.

Forces consistent formatting by showing correct JSON outputs.
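
One way to keep the examples and their JSON formatting consistent is to generate the prompt from a list of labelled examples. This is only a sketch: `EXAMPLES` and `build_fewshot_prompt` are hypothetical names, and the real `FEWSHOT_PROMPT_PREFIX` in `task1_scripts.prompts` may be assembled differently.

```python
import json

# Labelled examples copied from the prompt above: (review, stars, explanation).
EXAMPLES = [
    ("Great food and quick service", 5, "Positive mention of food and service."),
    ("Food was cold and the waiter was rude", 2, "Negative food quality and poor service."),
    ("Okay place, not bad but overpriced", 3, "Mixed sentiment; price is an issue."),
    ("Terrible, found hair in my soup", 1, "Severe negative hygiene issue."),
    ("Decent food, friendly staff", 4, "Mostly positive overall."),
]


def build_fewshot_prompt(review, examples=EXAMPLES):
    """Assemble a few-shot prompt; hypothetical helper, not the notebook's code."""
    parts = ["You are a helpful classifier. Examples:", ""]
    for text, stars, why in examples:
        # json.dumps guarantees each demonstrated answer is valid JSON.
        answer = json.dumps({"predicted_stars": stars, "explanation": why})
        parts += [f'Review: "{text}"', f"→ {answer}", ""]
    parts += [
        "Now classify the following review.",
        "Output ONLY a JSON object with predicted_stars and explanation.",
        "",
        "Review:",
        f'""" {review} """',
    ]
    return "\n".join(parts)


prompt = build_fewshot_prompt("Nice patio, average coffee.")
print(prompt)
```

Holding the examples in a list also makes it easy to check that every star rating from 1 to 5 is represented, or to swap examples when testing for bias toward particular ratings.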

In [None]:
res_fewshot = evaluator.run_on_df(df_small, "fewshot")
res_fewshot.head()



Unnamed: 0,review,gold,raw_response,parsed_json,predicted
0,Love this place! Wish we had one here in CA. G...,5,"```json\n{\n ""predicted_stars"": 5,\n ""explan...","{'predicted_stars': 5, 'explanation': 'Strong ...",5
1,I was searching for a unique item to give as a...,1,"```json\n{\n ""predicted_stars"": 2,\n ""explan...","{'predicted_stars': 2, 'explanation': 'Strongl...",2
2,"1 star for service, but the food is not ok :( ...",1,"```json\n{\n ""predicted_stars"": 2,\n ""explan...","{'predicted_stars': 2, 'explanation': 'Very ne...",2
3,I have been trying to reduce my caloric intake...,4,"```json\n{\n ""predicted_stars"": 4,\n ""explan...","{'predicted_stars': 4, 'explanation': 'Informa...",4
4,I had a chicken panini and was more than enoug...,4,"```json\n{\n ""predicted_stars"": 4,\n ""explan...","{'predicted_stars': 4, 'explanation': 'Positiv...",4


In [9]:
import numpy as np

def accuracy(df):
    return np.mean(df["predicted"] == df["gold"])

print("Direct Accuracy:", accuracy(res_direct))
print("Chain Accuracy:", accuracy(res_chain))
print("Few-shot Accuracy:", accuracy(res_fewshot))


Direct Accuracy: 0.7
Chain Accuracy: 0.7
Few-shot Accuracy: 0.6


| Prompt Type      | Accuracy |
| ---------------- | -------- |
| Direct           | *0.70* |
| Chain-of-Thought | *0.70* |
| Few-shot         | *0.60* |

In [10]:
import json

final_output = {
    "direct": res_direct.to_dict(orient="records"),
    "chain": res_chain.to_dict(orient="records"),
    "fewshot": res_fewshot.to_dict(orient="records")
}

with open("submission.json", "w") as f:
    json.dump(final_output, f, indent=2)

print("✔ submission.json created successfully!")


✔ submission.json created successfully!


*FINAL DISCUSSION*

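The Direct and Chain-of-Thought prompts both achieved an accuracy of 0.70, while the Few-shot prompt achieved 0.60. On the 10-review sample, that gap corresponds to a single review.

Key observations:

The Direct prompt performed strongly despite being the simplest.

The Chain-of-Thought prompt did not improve accuracy, possibly because star rating prediction depends more on sentiment than on multi-step reasoning.

The Few-shot prompt scored slightly lower. One possible cause is anchoring on the provided examples: in the rows shown above, two 1-star reviews were predicted as 2 stars, and the only 1-star example in the prompt describes a hygiene problem.

Conclusion:
The Direct prompt matches the best accuracy observed and is the simplest of the three, which makes it the most practical choice for this task. A larger sample would be needed to confirm the ranking.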
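
To gauge how much weight the accuracies above can carry, the sketch below computes a Wilson score interval for each prompt. The correct counts (7, 7, and 6 out of 10) follow from the reported accuracies on the 10-review sample; the choice of the Wilson interval and the `wilson_interval` helper are assumptions added here, not part of the original evaluation.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


# Correct counts implied by the reported accuracies on 10 reviews.
for name, correct in [("Direct", 7), ("Chain-of-Thought", 7), ("Few-shot", 6)]:
    low, high = wilson_interval(correct, 10)
    print(f"{name}: {correct}/10, interval {low:.2f} to {high:.2f}")
```

With only 10 reviews, the intervals for 7/10 and 6/10 overlap substantially, so the sample cannot separate the three prompts with confidence.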