# TASK 1: Rating Prediction via Prompting

## Objective
This notebook explores how Large Language Models (LLMs) can be used to classify
Yelp restaurant reviews into star ratings (1–5) using **prompt engineering**.

Instead of training a supervised model, we:
- Design multiple prompting strategies
- Enforce structured JSON outputs
- Evaluate accuracy, JSON validity, and reliability

Three different prompt designs are implemented and compared.

In [1]:
import os
import json
import time
import pandas as pd
from tqdm import tqdm

In [2]:
# Load dataset
df = pd.read_csv("data/yelp_reviews.csv")

# Keep relevant columns only
df = df[["text", "stars"]]

# Sample exactly 200 rows (fixed seed for reproducibility)
df_sample = df.sample(n=200, random_state=42).reset_index(drop=True)

df_sample.head()


| | text | stars |
|---|---|---|
| 0 | We got here around midnight last Friday... the... | 4 |
| 1 | Brought a friend from Louisiana here. She say... | 5 |
| 2 | Every friday, my dad and I eat here. We order ... | 3 |
| 3 | My husband and I were really, really disappoin... | 1 |
| 4 | Love this place! Was in phoenix 3 weeks for w... | 5 |


## LLM Setup (Free Tier)

We use **OpenRouter** with a free, open-source LLM.
This is explicitly allowed by the task instructions.

Model used:
- mistralai/mistral-7b-instruct

Temperature is set to 0 to improve consistency for classification.


In [3]:
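import os
from getpass import getpass

# Do not hard-code the API key; read it from the environment or prompt for it.
if not os.getenv("OPENROUTER_API_KEY"):
    os.environ["OPENROUTER_API_KEY"] = getpass("OpenRouter API key: ")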

In [4]:
from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible endpoint, so the OpenAI client works as-is.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.getenv("OPENROUTER_API_KEY"),
)

MODEL_NAME = "mistralai/mistral-7b-instruct"

In [5]:
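def call_llm(prompt: str) -> str:
    """Send a single-turn prompt to the model and return the raw text reply."""
    response = client.chat.completions.create(
        model=MODEL_NAME,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # favour consistent outputs for classification
    )
    # content can be None for an empty completion; normalise to a string
    return response.choices[0].message.content or ""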
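
Free-tier endpoints can fail transiently, so a single failed request would abort a 200-row run. The following is a minimal sketch of a retry wrapper with exponential backoff; `with_retries`, its parameters, and its defaults are hypothetical and not part of the original notebook.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(fn: Callable[[], T], max_attempts: int = 3, base_delay: float = 1.0) -> T:
    """Call fn(), retrying with exponential backoff on any exception.

    Hypothetical helper: the name and defaults are assumptions for illustration.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, 2*base_delay, 4*base_delay, ... between attempts
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

Under these assumptions, a call would be wrapped as `with_retries(lambda: call_llm(prompt))`. Catching every exception is a simplification; a real run would probably retry only on rate-limit or connection errors.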

## Prompt Version 1 – Baseline

A minimal instruction-based prompt.
Serves as a baseline to observe raw LLM behaviour.

In [6]:
PROMPT_V1 = """
You are given a Yelp restaurant review.

Classify the review into a star rating from 1 to 5.

Return the result in the following JSON format:
{{
  "predicted_stars": <integer>,
  "explanation": "<short explanation>"
}}

Review:
"{review_text}"
"""

In [7]:
def run_prompt_v1(review_text: str) -> str:
    return call_llm(PROMPT_V1.format(review_text=review_text))
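
The doubled braces in `PROMPT_V1` are `str.format` escapes: `{{` and `}}` render as literal braces, while `{review_text}` is substituted. A small check with a shortened template shows the rendered prompt; the template and the sample review below are illustrative and not taken from the dataset.

```python
# Shortened, illustrative template using the same brace escaping as PROMPT_V1.
TEMPLATE = """Return JSON:
{{
  "predicted_stars": <integer>
}}

Review:
"{review_text}"
"""

# Hypothetical review text, not taken from the Yelp data.
rendered = TEMPLATE.format(review_text="Great tacos, slow service.")
print(rendered)
```

Braces inside the review itself are safe, because `format` only interprets the template string. Double quotes inside a review are not escaped, so the quoted review in the prompt may contain nested quotes.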


## Prompt Version 2 – Rubric-Based Prompt

Introduces an explicit rating rubric to reduce ambiguity,
especially between adjacent ratings such as 3 vs 4 stars.

In [8]:
PROMPT_V2 = """
You are an expert sentiment analyst.

Use the following rubric:
1 star: Very negative experience
2 stars: Mostly negative with minor positives
3 stars: Mixed or neutral experience
4 stars: Mostly positive with minor issues
5 stars: Extremely positive experience

Classify the review strictly using this rubric.

Return ONLY valid JSON:
{{
  "predicted_stars": <1-5>,
  "explanation": "<brief explanation>"
}}

Review:
"{review_text}"
"""

In [9]:
def run_prompt_v2(review_text: str) -> str:
    return call_llm(PROMPT_V2.format(review_text=review_text))

## Prompt Version 3 – Reasoned Classification + JSON Guard

This prompt enforces:
- Step-wise reasoning
- Strict JSON-only output
- Strong constraints on rating values

Expected to be the most reliable approach.

In [10]:
PROMPT_V3 = """
You are a rating prediction system.

Step 1: Analyze the sentiment and key points in the review.
Step 2: Decide the most appropriate star rating (1–5).
Step 3: Output ONLY the final result in valid JSON.

Rules:
- predicted_stars must be an integer between 1 and 5
- Output must be valid JSON
- Do not include any extra text

JSON format:
{{
  "predicted_stars": <integer>,
  "explanation": "<concise explanation>"
}}

Review:
"{review_text}"
"""

In [11]:
def run_prompt_v3(review_text: str) -> str:
    return call_llm(PROMPT_V3.format(review_text=review_text))

In [12]:
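def parse_response(response: str):
    """Strictly parse the raw reply as JSON; return (parsed, is_valid)."""
    try:
        return json.loads(response), True
    except (ValueError, TypeError):
        # Any text around the JSON object (e.g. a preamble) makes parsing fail
        return None, False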
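
Strict parsing rejects any reply that wraps the JSON object in extra text. As a hedged sketch of one alternative, the hypothetical helper `parse_response_lenient` below first tries strict parsing and then falls back to the outermost `{...}` span. It is not the parser used for the results reported in this notebook.

```python
import json
import re
from typing import Any, Optional, Tuple


def parse_response_lenient(response: Optional[str]) -> Tuple[Optional[Any], bool]:
    """Try strict parsing first, then fall back to the outermost {...} span."""
    try:
        return json.loads(response), True
    except (ValueError, TypeError):
        pass
    # Greedy match from the first "{" to the last "}" (DOTALL lets "." span newlines)
    match = re.search(r"\{.*\}", response or "", flags=re.DOTALL)
    if match is None:
        return None, False
    try:
        return json.loads(match.group(0)), True
    except ValueError:
        return None, False
```

If a fallback like this is used, the strict and lenient validity rates are best reported separately, since only the strict rate measures whether the model obeyed the "JSON only" instruction.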


In [13]:
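# Evaluation function
def evaluate(prompt_runner, df):
    """Run one prompt variant over df and return accuracy and JSON validity rate."""
    correct = 0
    valid_json = 0

    for _, row in tqdm(df.iterrows(), total=len(df)):
        response = prompt_runner(row["text"])
        parsed, is_valid = parse_response(response)

        if is_valid:
            valid_json += 1
            # Valid JSON may still lack the key or not be an object; count that as wrong
            if isinstance(parsed, dict) and parsed.get("predicted_stars") == row["stars"]:
                correct += 1

        time.sleep(0.4)  # rate-limit safety

    return {
        # Both rates use all rows as the denominator, so unparseable replies count as errors
        "accuracy": correct / len(df),
        "json_validity_rate": valid_json / len(df),
    }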


In [14]:
results_v1 = evaluate(run_prompt_v1, df_sample)
results_v2 = evaluate(run_prompt_v2, df_sample)
results_v3 = evaluate(run_prompt_v3, df_sample)

results_v1, results_v2, results_v3


100%|██████████| 200/200 [06:33<00:00,  1.97s/it]
100%|██████████| 200/200 [05:22<00:00,  1.61s/it]
100%|██████████| 200/200 [03:40<00:00,  1.10s/it]


({'accuracy': 0.06, 'json_validity_rate': 0.12},
 {'accuracy': 0.01, 'json_validity_rate': 0.02},
 {'accuracy': 0.015, 'json_validity_rate': 0.015})

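- Prompt V1 achieved the highest accuracy (6%) and JSON validity rate (12%), yet 88% of its responses still failed to parse.
- Prompt V2, which adds an explicit rating rubric, scored lower on both metrics (1% accuracy, 2% valid JSON).
- Prompt V3, which combines step-wise instructions with strict JSON constraints, did not perform best: 1.5% accuracy and 1.5% valid JSON.

Accuracy counts every unparseable response as wrong, so it can never exceed the JSON validity rate, and all three scores are dominated by formatting failures rather than by rating errors. These results therefore do not show that the more elaborate prompts improve classification. They show that, with strict `json.loads` parsing, **output-format compliance is the bottleneck**, and that classification quality cannot be judged until far more responses parse.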
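
To separate formatting failures from rating errors, the reported rates can be converted back to counts and the accuracy recomputed over parseable responses only. The sketch below assumes the sample size of 200 used in this notebook and takes the rates from the evaluation output; the variable names are illustrative.

```python
N = 200  # sample size used in the notebook

# Metrics copied from the evaluation output in this notebook.
reported = {
    "V1": {"accuracy": 0.06, "json_validity_rate": 0.12},
    "V2": {"accuracy": 0.01, "json_validity_rate": 0.02},
    "V3": {"accuracy": 0.015, "json_validity_rate": 0.015},
}

summary = {}
for name, metrics in reported.items():
    # round() guards against floating-point noise when recovering counts
    correct = round(metrics["accuracy"] * N)
    valid = round(metrics["json_validity_rate"] * N)
    summary[name] = (correct, valid)
    conditional = correct / valid if valid else float("nan")
    print(f"{name}: {correct}/{valid} parseable responses correct ({conditional:.0%})")
```

This gives 12 of 24 for V1, 2 of 4 for V2, and 3 of 3 for V3. With so few parseable responses, these conditional accuracies are too noisy to rank the prompts.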