# Task 1 – Yelp Rating Prediction via Prompt Engineering
This notebook evaluates several prompt designs for predicting Yelp star ratings with a local LLM (Mistral, served through Ollama).


In [1]:
!pip install --upgrade pip




In [2]:
import sys
sys.executable


'C:\\Users\\Rhitik9579\\AppData\\Local\\Programs\\Python\\Python312\\python.exe'

In [10]:
import sys
!{sys.executable} -m pip install ollama










In [7]:
!pip install pandas tqdm python-dotenv





In [8]:
import pandas as pd
import json
from tqdm import tqdm
import re


## 1. Dataset
We use a sampled subset (200 reviews) from the Yelp Reviews dataset to evaluate prompt performance efficiently.


In [9]:
df = pd.read_csv("yelp.csv")

df = df[['text', 'stars']]
df_sample = df.sample(200, random_state=42)

df_sample.head()


| | text | stars |
|---|---|---|
| 6252 | We got here around midnight last Friday... the... | 4 |
| 4684 | Brought a friend from Louisiana here. She say... | 5 |
| 1731 | Every friday, my dad and I eat here. We order ... | 3 |
| 4742 | My husband and I were really, really disappoin... | 1 |
| 4521 | Love this place! Was in phoenix 3 weeks for w... | 5 |


In [11]:
import ollama

prompt = """
Classify this Yelp review into a star rating (1–5).

Review:
"The food was good and the service was quick. I would come back again."

Return ONLY valid JSON:
{
  "predicted_stars": number,
  "explanation": "short reason"
}
"""

response = ollama.chat(
    model="mistral",
    messages=[{"role": "user", "content": prompt}]
)

print(response["message"]["content"])


 {
  "predicted_stars": 4,
  "explanation": "The reviewer mentions that the food was good and service was quick, indicating a positive experience. The phrase 'I would come back again' also suggests satisfaction with their visit."
}


In [12]:
import ollama
import json
import re

def call_llm(prompt: str):
    """
    Calls a local Ollama LLM and returns raw text output.
    """

    response = ollama.chat(
        model="mistral",   # Mistral model served locally by Ollama
        messages=[
            {
                "role": "user",
                "content": prompt
            }
        ],
        options={
            "temperature": 0
        }
    )

    return response["message"]["content"]




In [13]:
def extract_json(text):
    try:
        match = re.search(r"\{.*\}", text, re.DOTALL)
        if not match:
            return None
        return json.loads(match.group())
    except Exception:
        return None
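The regex in `extract_json` is greedy: it spans from the first `{` to the last `}` in the output. If the model adds trailing text that contains a brace, the match is no longer valid JSON and the function returns `None`. The following is a minimal alternative sketch, not the notebook's method. It uses the standard library's `json.JSONDecoder.raw_decode` to parse the first decodable object and ignore whatever follows it. The name `extract_first_json_object` is hypothetical.

```python
import json

def extract_first_json_object(text):
    """Return the first decodable JSON object in text, or None (hypothetical helper)."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores any text that comes after it.
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = text.find("{", start + 1)
    return None

sample = 'Sure: {"predicted_stars": 4} (see {note})'
print(extract_first_json_object(sample))
```

On well-formed output, such as the fenced JSON the model returns in this notebook, both approaches should give the same result. The difference only shows up when extra braces appear after the object.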


In [14]:
test = call_llm("""
Return ONLY valid JSON:
{
  "predicted_stars": 5,
  "explanation": "test"
}
""")

print(test)
print(extract_json(test))


 Here is the requested JSON:

```json
{
  "predicted_stars": 5,
  "explanation": "test"
}
```
{'predicted_stars': 5, 'explanation': 'test'}


## 2. Prompting Approaches

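We list four prompt designs. V1 to V3 are defined and evaluated in the cells below; V4 is listed but has no prompt function or results in this notebook section.

- **V1 Basic** – Simple instruction-based prompt
- **V2 Role-based** – Adds an expert role and explicit output rules
- **V3 Stepwise** – Adds an explicit sentiment-to-star mapping and asks for a decisive answer
- **V4 Confidence-aware** – Adds uncertainty estimation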
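No V4 prompt function appears in this notebook, so the following is only a hypothetical sketch of what a confidence-aware prompt could look like, modeled on the shape of the other prompt functions here. The name `prompt_v4`, the `confidence` field, and its 0 to 1 scale are all assumptions for illustration, not the author's design.

```python
def prompt_v4(review):
    # Hypothetical sketch: not part of the original notebook.
    # Doubled braces render as literal braces inside the f-string.
    return f"""
You are a Yelp review rating classifier.

Predict an integer rating from 1 to 5 and state how confident you are.

Review:
{review}

Return ONLY valid JSON:
{{
  "predicted_stars": 1 | 2 | 3 | 4 | 5,
  "confidence": number between 0 and 1,
  "explanation": "one sentence justification"
}}
"""

print(prompt_v4("Great tacos, slow service."))
```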


In [12]:

def prompt_v1(review):
    return f"""
Classify the Yelp review into a rating from 1 to 5.

Review:
{review}

Return JSON only:
{{
  "predicted_stars": number,
  "explanation": "short reason"
}}
"""


In [13]:
def prompt_v2(review):
    return f"""
You are a sentiment analysis expert.

Rules:
- Predict rating 1 to 5
- Strict JSON output
- No extra text

Review:
{review}

JSON:
{{
  "predicted_stars": 1-5,
  "explanation": "brief explanation"
}}
"""


In [14]:
def prompt_v3(review):
    return f"""
You are a Yelp review rating classifier.

Instructions:
- Yelp uses integer ratings from 1 to 5.
- Be decisive. Do NOT hedge.
- Follow this mapping strictly:

Very negative / angry → 1
Mostly negative → 2
Mixed or neutral → 3
Mostly positive → 4
Very positive / praise → 5

Review:
{review}

Return ONLY valid JSON:
{{
  "predicted_stars": 1 | 2 | 3 | 4 | 5,
  "explanation": "one sentence justification"
}}
"""


## 3. Evaluation Metrics

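We evaluate each prompt on:
- Exact Accuracy: share of scored reviews where the predicted rating equals the true rating
- Tolerant Accuracy (±1 star): share of scored reviews where the prediction is within one star of the true rating
- JSON Validity Rate: share of all 200 sampled reviews whose output parses as JSON
- Confidence Calibration (V4 only): not computed in the cells below

Both accuracies are computed only over reviews that yield a usable prediction, so their denominator can be smaller than 200.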
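To make the metric definitions concrete, here is a small worked example on made-up data (the five rows are illustrative, not taken from the Yelp sample). As a simplification, `None` stands for any output that yields no usable rating, whether it fails to parse or lacks a numeric rating. Accuracy is taken over the scored rows only, while validity is taken over all rows.

```python
# Toy data: (true_stars, predicted_stars); None marks an unusable model output.
rows = [(5, 5), (4, 5), (1, 3), (3, None), (2, 2)]

# Only rows with a usable prediction are scored.
scored = [(t, p) for t, p in rows if p is not None]

exact = sum(t == p for t, p in scored) / len(scored)
tolerant = sum(abs(t - p) <= 1 for t, p in scored) / len(scored)
validity = len(scored) / len(rows)

print(exact, tolerant, validity)  # 0.5 0.75 0.8
```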


In [15]:
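def evaluate_prompt(prompt_fn):
    """Run one prompt over df_sample and return accuracy and JSON-validity rates."""
    exact = 0         # predictions equal to the true rating
    tolerant = 0      # predictions within one star of the true rating
    valid_json = 0    # outputs that parsed as JSON
    total_valid = 0   # outputs that also yielded a usable star rating

    for _, row in tqdm(df_sample.iterrows(), total=len(df_sample)):
        output = call_llm(prompt_fn(row['text']))
        parsed = extract_json(output)

        # Skip outputs with no parsable (or an empty) JSON object.
        if not parsed:
            continue

        valid_json += 1

        # Skip outputs whose rating is missing or not numeric.
        pred = normalize_stars(parsed.get("predicted_stars"))
        if pred is None:
            continue

        true = int(row["stars"])
        total_valid += 1

        if pred == true:
            exact += 1
        if abs(pred - true) <= 1:
            tolerant += 1

    # Accuracies use scored rows as the denominator; validity uses the full sample.
    return {
        "exact_accuracy": exact / total_valid if total_valid else 0,
        "tolerant_accuracy": tolerant / total_valid if total_valid else 0,
        "json_validity": valid_json / len(df_sample)
    }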


In [None]:
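def normalize_stars(value):
    """Convert a model-provided rating to an int in 1..5, or None if it is not numeric."""
    try:
        val = round(float(value))
        # Clamp out-of-range ratings into the valid 1..5 band.
        return max(1, min(5, val))
    except (TypeError, ValueError, OverflowError):
        return None

# Gap between tolerant and exact accuracy. `results` is built in section 4,
# so run this line only after that cell.
results["gap"] = results["tolerant_accuracy"] - results["exact_accuracy"]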
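A few inputs show how `normalize_stars` treats strings, out-of-range numbers, and non-numeric values. The function is repeated here, with a narrowed `except` clause, so the example runs on its own. Note that Python's `round` rounds halves to the nearest even integer, so `2.5` becomes `2` while `3.5` becomes `4`.

```python
def normalize_stars(value):
    # Standalone copy for illustration.
    try:
        val = round(float(value))
        return max(1, min(5, val))
    except (TypeError, ValueError, OverflowError):
        return None

for raw in ["4", 4.6, 7, 0, 2.5, 3.5, "1-5", None]:
    print(repr(raw), "->", normalize_stars(raw))
```

Because out-of-range values are clamped rather than rejected, a model answer of `7` is scored as a 5-star prediction instead of being dropped.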

## 4. Results


In [17]:
results = pd.DataFrame([
    {"Prompt": "V1 Basic", **evaluate_prompt(prompt_v1)},
    {"Prompt": "V2 Role-based", **evaluate_prompt(prompt_v2)},
    {"Prompt": "V3 Stepwise", **evaluate_prompt(prompt_v3)}
])

results

100%|████████████████████████████████████████████████████████████████████████████████| 200/200 [38:35<00:00, 11.58s/it]
100%|████████████████████████████████████████████████████████████████████████████████| 200/200 [42:12<00:00, 12.66s/it]
100%|████████████████████████████████████████████████████████████████████████████████| 200/200 [37:49<00:00, 11.35s/it]


| | Prompt | exact_accuracy | tolerant_accuracy | json_validity |
|---|---|---|---|---|
| 0 | V1 Basic | 0.645 | 0.995 | 1.0 |
| 1 | V2 Role-based | 0.665 | 0.97 | 1.0 |
| 2 | V3 Stepwise | 0.471795 | 0.902564 | 0.975 |
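The gap between tolerant and exact accuracy can be derived directly from the reported numbers. The sketch below copies the values from the results table into a plain dictionary, an illustrative structure rather than the notebook's `results` DataFrame. It also checks that V3's fractional accuracies are consistent with 195 scored reviews, assuming every parsed output also produced a usable rating.

```python
# Values copied from the results table; the sample size is 200 reviews.
reported = {
    "V1 Basic": {"exact": 0.645, "tolerant": 0.995, "json_validity": 1.0},
    "V2 Role-based": {"exact": 0.665, "tolerant": 0.97, "json_validity": 1.0},
    "V3 Stepwise": {"exact": 0.471795, "tolerant": 0.902564, "json_validity": 0.975},
}
n_sample = 200

gaps = {}
parsed_counts = {}
for name, m in reported.items():
    gaps[name] = round(m["tolerant"] - m["exact"], 3)
    parsed_counts[name] = round(m["json_validity"] * n_sample)
    print(f"{name}: gap={gaps[name]:.3f}, parsed outputs={parsed_counts[name]}")

# With 195 scored reviews, 92 exact hits and 176 within-one-star hits
# reproduce V3's reported accuracies to six decimals.
v3_exact = round(92 / 195, 6)
v3_tolerant = round(176 / 195, 6)
print(v3_exact, v3_tolerant)
```

By this calculation V3 has the largest gap (about 0.431), meaning a larger share of its predictions are off by exactly one star than for V1 (0.350) or V2 (0.305).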
