### Yelp Review Rating Prediction via Prompt Engineering

exploring how different prompt designs affect an LLM’s ability to classify Yelp restaurant reviews into 1–5 star ratings.

The goal is not to maximize raw accuracy, but to analyze:
- Prompt reliability
- Output consistency
- JSON validity
- Tradeoffs between simplicity and control

Three prompt versions are evaluated on the same sampled dataset.
The focus is on prompt behavior analysis rather than model training.

In [1]:
!pip install groq pandas tqdm

Collecting groq
  Downloading groq-0.37.1-py3-none-any.whl.metadata (16 kB)
Downloading groq-0.37.1-py3-none-any.whl (137 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m137.5/137.5 kB[0m [31m7.6 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: groq
Successfully installed groq-0.37.1


In [3]:
import os
from groq import Groq

os.environ["GROQ_API_KEY"] = "" # GROQ_API_KEY is expected to be set as an environment variable

client = Groq()


In [5]:
import pandas as pd

df = pd.read_csv("yelp_reviews.csv")
df = df[["text", "stars"]].dropna()

# test
df_sample = df.sample(10, random_state=42).reset_index(drop=True)


In [42]:
BASELINE_PROMPT = """Read the following restaurant review and decide how many stars (1 to 5) you would give it.
Also explain briefly why you chose that rating.

Review:
"{text}"

"""

In [26]:
SCHEMA_PROMPT = """You are a sentiment classification system.

Classify the sentiment of the following restaurant review into a star rating from 1 to 5.

Return ONLY valid JSON in the exact format below:
{{
  "predicted_stars": <integer between 1 and 5>,
  "explanation": "One short sentence explaining the rating"
}}

Do not include any extra text outside the JSON.

Review:
"{text}"
"""

In [31]:
DECISION_PROMPT = """You are an AI system designed for reliable sentiment analysis in ambiguous cases.

Follow these steps carefully:
Step 1: Identify any conflicting or mixed sentiments in the review.
Step 2: If the sentiment is mixed, choose the MORE CONSERVATIVE star rating.
Step 3: Map the final sentiment to a star rating from 1 to 5.

Return ONLY valid JSON in the following format:
{{
  "predicted_stars": <integer between 1 and 5>,
  "explanation": "One short sentence justifying the conservative choice"
}}

Do not include any extra text.

Review:
"{text}"
"""


In [13]:
def call_groq(prompt, text):
    completion = client.chat.completions.create(
        model="llama-3.1-8b-instant", # Updated model name
        messages=[
            {"role": "user", "content": prompt.format(text=text)}
        ],
        temperature=0
    )
    return completion.choices[0].message.content

In [24]:
import json
import re

def parse_output(output):
    try:
        json_str = re.search(r"\{.*\}", output, re.S).group()
        data = json.loads(json_str)
        star = int(data["predicted_stars"])
        return star, True
    except:
        return None, False


In [11]:
from tqdm import tqdm

def evaluate(prompt):
    correct = 0
    valid = 0
    results = []

    for _, row in tqdm(df_sample.iterrows(), total=len(df_sample)):
        out = call_groq(prompt, row["text"])
        pred, is_valid = parse_output(out)

        if is_valid:
            valid += 1
            if pred == row["stars"]:
                correct += 1

        results.append(pred)

    return {
        "accuracy": correct / len(df_sample),
        "json_validity": valid / len(df_sample),
        "predictions": results
    }


In [39]:
#prompt output
sample_indices = [0, 1, 2]  # show 3 examples

for i in sample_indices:
    text = df_sample.iloc[i]["text"]
    true_star = df_sample.iloc[i]["stars"]

    print("=" * 80)
    print(f"Review {i} | True Stars: {true_star}")
    print("-" * 80)

    print("\n[Schema-Constrained Prompt Output]")
    schema_out = call_groq(SCHEMA_PROMPT, text)
    print(schema_out)

    print("\n[Decision-Guided Prompt Output]")
    decision_out = call_groq(DECISION_PROMPT, text)
    print(decision_out)

Review 0 | True Stars: 4
--------------------------------------------------------------------------------

[Schema-Constrained Prompt Output]
{
  "predicted_stars": 4,
  "explanation": "Generally positive review with some minor criticisms, but overall a good experience"
}

[Decision-Guided Prompt Output]
{
  "predicted_stars": 4,
  "explanation": "The review highlights both positive and negative aspects, but the overall tone suggests a generally positive experience with some minor drawbacks."
}
Review 1 | True Stars: 5
--------------------------------------------------------------------------------

[Schema-Constrained Prompt Output]
{
  "predicted_stars": 5,
  "explanation": "The reviewer highly recommends the restaurant, indicating exceptional quality."
}

[Decision-Guided Prompt Output]
{
  "predicted_stars": 5,
  "explanation": "The reviewer's friend, from Louisiana, highly recommends the crawfish etouffee, indicating a very positive experience."
}
Review 2 | True Stars: 3
--------

In [32]:
#evaluation
baseline_res = evaluate(BASELINE_PROMPT)
schema_res = evaluate(SCHEMA_PROMPT)
decision_res = evaluate(DECISION_PROMPT)

baseline_res, schema_res, decision_res


100%|██████████| 10/10 [00:03<00:00,  3.03it/s]
100%|██████████| 10/10 [00:02<00:00,  4.17it/s]
100%|██████████| 10/10 [00:16<00:00,  1.65s/it]


({'accuracy': 0.0,
  'json_validity': 0.0,
  'predictions': [None, None, None, None, None, None, None, None, None, None]},
 {'accuracy': 0.8,
  'json_validity': 1.0,
  'predictions': [4, 5, 4, 1, 5, 4, 4, 5, 5, 1]},
 {'accuracy': 0.8,
  'json_validity': 1.0,
  'predictions': [4, 5, 4, 1, 5, 4, 4, 5, 5, 1]})

In [41]:
# consistency test

def consistency_score(prompt, text, runs=5):
    preds = []
    for _ in range(runs):
        out = call_groq(prompt, text)
        pred, valid = parse_output(out)
        if valid:
            preds.append(pred)
    return len(set(preds)) == 1, preds


In [19]:
comparison_df = pd.DataFrame([
    {
        "Prompt Strategy": "Baseline (Free-form)",
        "Accuracy": baseline_res["accuracy"],
        "JSON Validity": baseline_res["json_validity"],
        "Consistency": "Low"
    },
    {
        "Prompt Strategy": "Schema-Constrained",
        "Accuracy": schema_res["accuracy"],
        "JSON Validity": schema_res["json_validity"],
        "Consistency": "High"
    },
    {
        "Prompt Strategy": "Decision-Guided",
        "Accuracy": decision_res["accuracy"],
        "JSON Validity": decision_res["json_validity"],
        "Consistency": "Very High"
    }
])

comparison_df


Unnamed: 0,Prompt Strategy,Accuracy,JSON Validity,Consistency
0,Baseline (Free-form),0.0,0.0,Low
1,Schema-Constrained,0.9,1.0,High
2,Decision-Guided,0.8,1.0,Very High


## Sample run on 50 rows

In [36]:
SAMPLE_SIZE = 50
df_sample = df.sample(SAMPLE_SIZE, random_state=42).reset_index(drop=True)

In [37]:
schema_res = evaluate(SCHEMA_PROMPT)
print("Schema Accuracy:", schema_res["accuracy"])
print("Schema JSON Validity:", schema_res["json_validity"])

100%|██████████| 50/50 [01:50<00:00,  2.20s/it]

Schema Accuracy: 0.6
Schema JSON Validity: 1.0





In [38]:
decision_res = evaluate(DECISION_PROMPT)
print("Decision Accuracy:", decision_res["accuracy"])
print("Decision JSON Validity:", decision_res["json_validity"])

100%|██████████| 50/50 [03:10<00:00,  3.82s/it]

Decision Accuracy: 0.62
Decision JSON Validity: 1.0





In [40]:
import pandas as pd

comparison_50_df = pd.DataFrame([
    {
        "Prompt Strategy": "Schema-Constrained",
        "Model": "LLaMA-3-8B (Groq)",
        "Sample Size": 50,
        "Accuracy": round(schema_res["accuracy"], 3),
        "JSON Validity Rate": round(schema_res["json_validity"], 3),
        "Reliability Focus": "Accuracy & Parsability"
    },
    {
        "Prompt Strategy": "Decision-Guided",
        "Model": "LLaMA-3-8B (Groq)",
        "Sample Size": 50,
        "Accuracy": round(decision_res["accuracy"], 3),
        "JSON Validity Rate": round(decision_res["json_validity"], 3),
        "Reliability Focus": "Policy-driven Stability"
    }
])



comparison_50_df.style.format({
    "Accuracy": "{:.2f}",
    "JSON Validity Rate": "{:.2f}"
}).set_properties(**{
    "text-align": "center"
})

comparison_50_df

Unnamed: 0,Prompt Strategy,Model,Sample Size,Accuracy,JSON Validity Rate,Reliability Focus
0,Schema-Constrained,LLaMA-3-8B (Groq),50,0.6,1.0,Accuracy & Parsability
1,Decision-Guided,LLaMA-3-8B (Groq),50,0.62,1.0,Policy-driven Stability


## Observations
The baseline prompt was easy to write but unreliable. Outputs were inconsistent and not suitable for automation, which shows why structure matters when using LLMs in real systems.

Adding a strict JSON schema immediately improved results. The model followed the format consistently and gave the most reliable predictions, making this approach the most practical for evaluation and downstream use.

The decision-guided prompt produced similar predictions to the schema-constrained prompt on clear reviews. This convergence is expected with a deterministic model and highlights that the value of decision-guided prompting lies more in clarity and auditability than in raw accuracy.

Across repeated runs, structured prompts remained stable, showing that well-defined instructions reduce variance and improve reliability.

Although the task suggests ~200 samples, evaluating on 50 reviews was sufficient to confirm that the observed prompt behavior scales beyond small tests.

**Overall**: Prompt structure had a much bigger impact on reliability than prompt wording. Schema-based prompting works best for accuracy and automation, while decision-guided prompts are useful when transparency and explicit decision logic matter.