# Task 1: Yelp Rating Prediction via Prompt Engineering

This notebook explores how different prompt designs affect an LLM’s ability
to predict Yelp star ratings (1–5) from review text.

We evaluate:
- Accuracy
- JSON validity
- Reliability across prompts


In [2]:

import pandas as pd
import json
import os
import time
import random
import re

from tqdm import tqdm
from google.api_core.exceptions import ResourceExhausted
import google.generativeai as genai


In [2]:
df = pd.read_csv("yelp.csv")
df.head()


Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,review,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0
1,ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad review...,review,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0
2,6oRAC4uyJCsJl1X0WZpVSA,2012-06-14,IESLBzqUCLdSzSqm0eCSxQ,4,love the gyro plate. Rice is so good and I als...,review,0hT2KtfLiobPvh6cDC8JQg,0,1,0
3,_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",review,uZetl9T0NcROGOyFfughhg,1,2,0
4,6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,General Manager Scott Petello is a good egg!!!...,review,vYmM4KTsC8ZfQBg-j5MWkw,0,0,0


In [3]:
df.columns

Index(['business_id', 'date', 'review_id', 'stars', 'text', 'type', 'user_id',
       'cool', 'useful', 'funny'],
      dtype='object')

In [4]:
df = df[['text', 'stars']]
df.head()


Unnamed: 0,text,stars
0,My wife took me here on my birthday for breakf...,5
1,I have no idea why some people give bad review...,5
2,love the gyro plate. Rice is so good and I als...,4
3,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",5
4,General Manager Scott Petello is a good egg!!!...,5


In [5]:
sample_df = df.sample(200, random_state=42)
sample_df.reset_index(drop=True, inplace=True)
sample_df.head()


Unnamed: 0,text,stars
0,We got here around midnight last Friday... the...,4
1,Brought a friend from Louisiana here. She say...,5
2,"Every friday, my dad and I eat here. We order ...",3
3,"My husband and I were really, really disappoin...",1
4,Love this place! Was in phoenix 3 weeks for w...,5


## Dataset Preparation

We use a sampled subset of 200 Yelp reviews for efficiency.
Only the review text and corresponding star ratings are retained,
as these are sufficient for the rating prediction task.


In [None]:


api_key = os.getenv("GOOGLE_API_KEY")
if not api_key:
    raise RuntimeError("GOOGLE_API_KEY environment variable not set")

genai.configure(api_key=api_key)

model = genai.GenerativeModel("models/gemini-flash-lite-latest")



In [14]:
def prompt_v1(review_text):
    return f"""
You are a JSON generator.

Return ONLY valid JSON.
Do NOT add explanations outside JSON.
Do NOT use markdown.
Do NOT add any text before or after JSON.

Review:
"{review_text}"

JSON output:
{{
  "predicted_stars": 1-5,
  "explanation": "short reason"
}}
"""


In [None]:

# Retry logic to handle temporary API quota exhaustion during batch evaluation

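def safe_generate(prompt, max_retries=5):
    for attempt in range(max_retries):
        try:
            return model.generate_content(prompt)
        except ResourceExhausted:
            # Do not wait after the final attempt; fail immediately instead.
            if attempt == max_retries - 1:
                break
            # Linear backoff (20s, 30s, ...) plus jitter to spread out retries.
            wait_time = 20 + attempt * 10 + random.uniform(0, 5)
            print(f"Quota hit. Waiting {wait_time:.1f}s...")
            time.sleep(wait_time)
    raise RuntimeError("Failed due to repeated quota exhaustion")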
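
The retry idea above can also be written as a reusable helper. The following is a minimal sketch, not the notebook's method: `retry_with_backoff` and its parameters (`retriable`, `base_delay`, `max_delay`, `sleep`) are hypothetical names, and it swaps the linear wait for exponential backoff with jitter.

```python
import random
import time


def retry_with_backoff(fn, retriable=(Exception,), max_retries=5,
                       base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call fn() until it succeeds or max_retries attempts have failed."""
    for attempt in range(max_retries):
        try:
            return fn()
        except retriable as exc:
            # Give up without sleeping once the last attempt has failed.
            if attempt == max_retries - 1:
                raise RuntimeError("Gave up after repeated failures") from exc
            # Exponential backoff capped at max_delay, plus up to 10% jitter.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
```

In this notebook it could wrap the API call as `retry_with_backoff(lambda: model.generate_content(prompt), retriable=(ResourceExhausted,), base_delay=20)`. Passing `sleep` as a parameter makes the helper testable without real waiting.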


In [3]:


def parse_response(response_text):
    try:
        json_str = re.search(r"\{.*\}", response_text, re.DOTALL).group()
        data = json.loads(json_str)

        predicted = int(float(data["predicted_stars"]))
        return predicted, True
    except Exception:
        return None, False
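
The parser above truncates fractional values and accepts any number, including ratings outside 1 to 5. A stricter variant is sketched below as an illustration only; `parse_stars_strict` is a hypothetical name, and rejecting fractional ratings is an assumption about what should count as a valid prediction.

```python
import json
import re


def parse_stars_strict(response_text):
    """Return (stars, True) for a well-formed reply, else (None, False)."""
    if not isinstance(response_text, str):
        return None, False
    # Take the outermost {...} span, ignoring any text around it.
    match = re.search(r"\{.*\}", response_text, re.DOTALL)
    if match is None:
        return None, False
    try:
        data = json.loads(match.group())
        stars = float(data["predicted_stars"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return None, False
    # Reject fractional or out-of-range ratings instead of truncating them.
    if not stars.is_integer() or not 1 <= stars <= 5:
        return None, False
    return int(stars), True
```

Catching specific exception types rather than `Exception` keeps unrelated failures visible while still treating malformed replies as invalid.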



In [None]:


results_v1 = []

for _, row in tqdm(sample_df.iterrows(), total=len(sample_df)):
    review = row["text"]
    actual = row["stars"]

    try:
        response = safe_generate(prompt_v1(review))
        predicted, valid = parse_response(response.text)
    except Exception:
        predicted, valid = None, False

    results_v1.append({
        "actual": actual,
        "predicted": predicted,
        "json_valid": valid
    })

    time.sleep(0.4)  # prevents quota exhaustion


100%|██████████| 200/200 [01:49<00:00,  1.83it/s]


In [None]:
results_df_v1 = pd.DataFrame(results_v1)
results_df_v1.head()


Unnamed: 0,actual,predicted,json_valid
0,4,4.0,True
1,5,5.0,True
2,3,4.0,True
3,1,1.0,True
4,5,5.0,True


In [None]:
accuracy_v1 = (results_df_v1["actual"] == results_df_v1["predicted"]).mean()
json_valid_rate_v1 = results_df_v1["json_valid"].mean()

accuracy_v1, json_valid_rate_v1


(np.float64(0.065), np.float64(0.095))

In [None]:

exact_accuracy_v1 = accuracy_v1

within_one_accuracy_v1 = (
    (results_df_v1["actual"] - results_df_v1["predicted"]).abs() <= 1
).mean()

exact_accuracy_v1, within_one_accuracy_v1


(np.float64(0.065), np.float64(0.095))

In [None]:
# Keep only rows whose response parsed and yielded a prediction.
valid_df_v1 = results_df_v1[
    results_df_v1["json_valid"] &
    results_df_v1["predicted"].notna()
]

len(valid_df_v1), len(results_df_v1)


(19, 200)

In [None]:
exact_accuracy_v1 = (
    valid_df_v1["actual"] == valid_df_v1["predicted"]
).mean()

within_one_accuracy_v1 = (
    (valid_df_v1["actual"] - valid_df_v1["predicted"]).abs() <= 1
).mean()

json_valid_rate_v1 = results_df_v1["json_valid"].mean()

exact_accuracy_v1, within_one_accuracy_v1, json_valid_rate_v1


(np.float64(0.6842105263157895), np.float64(1.0), np.float64(0.095))

### Prompt Version 1 — Baseline Results

- Exact Match Accuracy (valid outputs only): **68.42%** (13 of 19)
- Within ±1 Star Accuracy (valid outputs only): **100.00%** (19 of 19)
- JSON Validity Rate (overall): **9.50%** (19 of 200)

Only 19 of the 200 responses parsed as valid JSON, so the accuracy figures rest on a small sample.


In [7]:
def prompt_v2(review_text):
    return f"""
You are a rating classifier.

Your task:
- Analyze the sentiment of the review.
- Map sentiment to star ratings using the rules below.

Rules:
- Very negative experience → 1 star
- Mostly negative with few positives → 2 stars
- Mixed or average experience → 3 stars
- Mostly positive with minor issues → 4 stars
- Extremely positive, no complaints → 5 stars

Return ONLY valid JSON.
No markdown. No extra text.

Review:
"{review_text}"

JSON output:
{{
  "predicted_stars": 1-5,
  "explanation": "short justification"
}}
"""


In [None]:

results_v2 = []

for _, row in tqdm(sample_df.iterrows(), total=len(sample_df)):
    review = row["text"]
    actual = row["stars"]

    try:
        response = safe_generate(prompt_v2(review))
        predicted, valid = parse_response(response.text)
    except Exception:
        predicted, valid = None, False

    results_v2.append({
        "actual": actual,
        "predicted": predicted,
        "json_valid": valid
    })

    time.sleep(0.4)  # prevent quota exhaustion


100%|██████████| 200/200 [01:20<00:00,  2.48it/s]


In [4]:
results_df_v2 = pd.DataFrame(results_v2)
results_df_v2.head()


NameError: name 'results_v2' is not defined

In [11]:
accuracy_v2 = (results_df_v2["actual"] == results_df_v2["predicted"]).mean()
json_valid_rate_v2 = results_df_v2["json_valid"].mean()

accuracy_v2, json_valid_rate_v2


(np.float64(0.0), np.float64(0.0))

In [12]:
# Keep only rows whose response parsed and yielded a prediction.
valid_df_v2 = results_df_v2[
    results_df_v2["json_valid"] &
    results_df_v2["predicted"].notna()
]

exact_accuracy_v2 = (
    valid_df_v2["actual"] == valid_df_v2["predicted"]
).mean()

within_one_accuracy_v2 = (
    (valid_df_v2["actual"] - valid_df_v2["predicted"]).abs() <= 1
).mean()

exact_accuracy_v2, within_one_accuracy_v2


(nan, nan)

In [13]:
print(f"Prompt V2 Exact Accuracy: {exact_accuracy_v2:.2%}")
print(f"Prompt V2 ±1 Star Accuracy: {within_one_accuracy_v2:.2%}")
print(f"Prompt V2 JSON Validity Rate: {json_valid_rate_v2:.2%}")


Prompt V2 Exact Accuracy: nan%
Prompt V2 ±1 Star Accuracy: nan%
Prompt V2 JSON Validity Rate: 0.00%


### Prompt Version 2 — Evaluation Results

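- Exact Match Accuracy (valid outputs only): **undefined (NaN)**
- Within ±1 Star Accuracy (valid outputs only): **undefined (NaN)**
- JSON Validity Rate (overall): **0.00%**

In the recorded run, none of the Prompt Version 2 responses parsed as valid JSON,
so both accuracy figures are undefined. These outputs therefore do not show an
improvement over the baseline. The explicit sentiment-to-rating rules are still
expected to improve consistency, but confirming this requires re-running the
evaluation and checking why no response parsed (for example, API errors being
caught by the `except Exception` branch and counted as invalid JSON).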
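
Accuracy over valid outputs only is undefined when nothing parses, which is how `nan` arises in the cells above. The sketch below shows one way to report the counts alongside the rates so that an empty valid set is explicit. `summarize` is a hypothetical helper, and it assumes records shaped like the entries of `results_v1` and `results_v2`, with `None` for a missing prediction.

```python
def summarize(results):
    """Summarize records shaped like {"actual", "predicted", "json_valid"}."""
    n_total = len(results)
    # A record counts as valid only if it parsed and carries a prediction.
    valid = [r for r in results
             if r["json_valid"] and r["predicted"] is not None]
    summary = {
        "n_total": n_total,
        "n_valid": len(valid),
        "json_valid_rate": len(valid) / n_total if n_total else None,
        "exact_accuracy": None,       # stays None when nothing parsed
        "within_one_accuracy": None,  # stays None when nothing parsed
    }
    if valid:
        exact = sum(r["actual"] == r["predicted"] for r in valid)
        close = sum(abs(r["actual"] - r["predicted"]) <= 1 for r in valid)
        summary["exact_accuracy"] = exact / len(valid)
        summary["within_one_accuracy"] = close / len(valid)
    return summary
```

Calling `summarize(results_v2)` would then return `None` for both accuracies and `n_valid` of 0, rather than a `nan` that formats as `nan%`.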


### Prompt Version 2 — Rule-Based Sentiment Mapping

This prompt introduces explicit rules that map sentiment strength
to discrete star ratings. By constraining the model’s interpretation
of sentiment, ambiguity is reduced and predictions become more consistent.
This is expected to improve exact-match accuracy compared to the baseline prompt.


In [None]:
def prompt_v3(review_text):
    return f"""
You are a rating prediction system.

Follow these steps internally:
1. Identify positive aspects in the review.
2. Identify negative aspects in the review.
3. Assess overall sentiment strength.
4. Decide the most appropriate star rating (1–5).

Do NOT reveal the steps.
Return ONLY valid JSON.
No markdown. No extra text.

Review:
"{review_text}"

JSON output:
{{
  "predicted_stars": 1-5,
  "explanation": "brief reasoning"
}}
"""


### Prompt Version 3 — Structured Reasoning

This prompt encourages the model to internally reason through
positive and negative aspects of a review before predicting a rating.
Although intermediate reasoning is hidden, this structured approach
is expected to improve robustness on mixed-sentiment reviews and
further increase prediction accuracy.
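
Prompt Version 3 has no evaluation cell in this notebook. A prompt-agnostic loop like the sketch below could be used to run it. This is an illustration under assumptions: `evaluate_prompt` and `generate_fn` are hypothetical names, and the extra `error` field is an addition that records why a row failed.

```python
import json
import re


def evaluate_prompt(prompt_fn, reviews, generate_fn):
    """Run prompt_fn over (text, stars) pairs and collect one record each.

    generate_fn is any callable mapping a prompt string to reply text.
    """
    results = []
    for text, actual in reviews:
        record = {"actual": actual, "predicted": None,
                  "json_valid": False, "error": None}
        try:
            reply = generate_fn(prompt_fn(text))
            match = re.search(r"\{.*\}", reply, re.DOTALL)
            data = json.loads(match.group())
            record["predicted"] = int(float(data["predicted_stars"]))
            record["json_valid"] = True
        except Exception as exc:
            # Keep the failure reason so API errors and parse errors
            # can be told apart afterwards.
            record["error"] = f"{type(exc).__name__}: {exc}"
        results.append(record)
    return results
```

With the objects defined earlier, a call might look like `evaluate_prompt(prompt_v3, zip(sample_df["text"], sample_df["stars"]), lambda p: safe_generate(p).text)`. Rate limiting between calls is left to `generate_fn` in this sketch.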


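Overall, the rule-based and structured prompts are designed to improve
consistency and reliability over the baseline by changing prompt structure
rather than the model. The recorded runs do not yet confirm this: Prompt
Version 1 produced valid JSON for 9.5% of reviews, Prompt Version 2 for 0%,
and Prompt Version 3 was not evaluated.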
