LAB M2.01 PROMPT ENGINEERING LAB
Author: Cindy Lund

# Prompt Engineering Lab — TechFlow Solutions

## Step 1: Environment Setup

In this notebook we will:
- Connect to the OpenAI API
- Create helper functions to test prompts repeatedly (5 / 10 / 15 runs)
- Optionally run multiple API calls in parallel to speed up testing

In [2]:
import os
import json
import time
from collections import Counter

from dotenv import load_dotenv
from openai import OpenAI

# Load environment variables from .env file
load_dotenv()

# Initialize OpenAI client (reads OPENAI_API_KEY from environment)
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

# Set default model
MODEL = "gpt-4o-mini"

print("✅ Setup complete! OpenAI client initialized.")


✅ Setup complete! OpenAI client initialized.


## Helper Functions

We define:
- `call_openai(prompt)` → runs one prompt and returns text
- `run_prompt_n_times(prompt, n)` → runs the same prompt N times (sequential)
- `run_prompt_parallel(prompt, n)` → runs the same prompt N times (parallel threads)

We’ll use these helpers throughout the lab to measure consistency.

In [4]:
from typing import List
from concurrent.futures import ThreadPoolExecutor, as_completed

def call_openai(prompt: str, model: str = MODEL, temperature: float = 0.7) -> str:
    """
    Single API call. Returns text output.
    """
    resp = client.responses.create(
        model=model,
        input=prompt,
        temperature=temperature
    )
    return resp.output_text


def run_prompt_n_times(prompt: str, n: int, model: str = MODEL, temperature: float = 0.7) -> List[str]:
    """
    Runs the same prompt N times (sequentially).
    """
    return [call_openai(prompt, model=model, temperature=temperature) for _ in range(n)]


def run_prompt_parallel(prompt: str, n: int, model: str = MODEL, temperature: float = 0.7, max_workers: int = 5) -> List[str]:
    """
    Runs the same prompt N times using threads (parallel).
    """
    outputs = [None] * n

    def _one(i: int):
        return call_openai(prompt, model=model, temperature=temperature)

    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        futures = {ex.submit(_one, i): i for i in range(n)}
        for fut in as_completed(futures):
            i = futures[fut]
            outputs[i] = fut.result()

    return outputs


def summarize_runs(outputs: List[str]) -> None:
    """
    Quick summary for failure analysis: unique outputs + most common.
    """
    cleaned = [o.strip() for o in outputs]
    counts = Counter(cleaned)
    print("Total runs:", len(outputs))
    print("Unique outputs:", len(counts))
    print("Most common outputs:")
    for text, c in counts.most_common(3):
        print(f"- ({c}x) {repr(text[:120])}{'...' if len(text) > 120 else ''}")


In [5]:
# Quick checkpoint: test call_openai with a simple prompt
out = call_openai("Reply with exactly: OK", temperature=0)
print("Model output:", repr(out))


Model output: 'OK'


## Step 2: Create Initial Prompts (Zero-shot, No Examples)

We create three basic prompts (version 1) for:
1. Sentiment Analysis (classification)
2. Product Description Generation (generation)
3. Data Extraction (structured extraction)

Goal: get *some* output for each task (even if inconsistent).


In [6]:
## Task 1 — Sentiment Analysis (Prompt v1)
# Ask model to classify sentiment of a customer message as positive, negative, or neutral.

customer_message_1 = "I love this product! It's exactly what I needed."

sentiment_prompt_v1 = f"""
Classify the sentiment of this customer message as positive, negative, or neutral.

Message: "{customer_message_1}"
"""

result = call_openai(sentiment_prompt_v1)
print("Sentiment Analysis Result:")
print(result)


Sentiment Analysis Result:
The sentiment of the customer message is positive.


In [7]:
# Task 2 — Product Description Generator (Prompt v1)

product_info_1 = {
    "name": "AeroBrew Travel Mug",
    "key_features": ["Leak-proof lid", "Keeps drinks hot for 6 hours", "Fits most cup holders"],
    "target_audience": "commuters and travelers",
    "price": "$29.99"
}

product_prompt_v1 = f"""
Write a compelling product description using the following product information.

Product information (JSON):
{json.dumps(product_info_1, indent=2)}
"""

result = call_openai(product_prompt_v1)
print("Product Description Result:")
print(result)


Product Description Result:
### AeroBrew Travel Mug

Elevate your daily commute with the **AeroBrew Travel Mug**—your perfect companion for hot beverages on the go! Designed with the modern traveler in mind, this sleek and stylish travel mug combines functionality and convenience to keep you fueled throughout your day.

#### Key Features:
- **Leak-Proof Lid:** Say goodbye to spills and stains! The AeroBrew's expertly designed lid ensures that your favorite drinks stay securely contained, whether you’re navigating crowded trains or bumpy roads.
- **Keeps Drinks Hot for 6 Hours:** Enjoy your coffee or tea at the perfect temperature, even hours after your first sip. With innovative insulation technology, the AeroBrew guarantees a delightful drinking experience, no matter where your journey takes you.
- **Fits Most Cup Holders:** Designed for ultimate convenience, the AeroBrew seamlessly fits into most vehicle cup holders, making it the ideal travel mug for commuters and adventurers alike.

In [8]:
# Task 3 — Data Extraction (Prompt v1)
# Extract structured data from unstructured text.
feedback_1 = """
I bought the AeroBrew Travel Mug last week (Order #A12345). The mug keeps coffee hot,
but the lid started leaking after two days. Support was friendly and offered a replacement.
I'm still a bit disappointed because I expected better quality for $29.99.
"""

extraction_prompt_v1 = f"""
Extract the following fields from the customer feedback and return them as JSON:
- order_id
- product_name
- issues
- positive_notes
- overall_sentiment

Customer feedback:
\"\"\"{feedback_1}\"\"\"
"""

result = call_openai(extraction_prompt_v1)
print("Data Extraction Result:")
print(result)



Data Extraction Result:
```json
{
  "order_id": "A12345",
  "product_name": "AeroBrew Travel Mug",
  "issues": "Lid started leaking after two days",
  "positive_notes": "Support was friendly and offered a replacement",
  "overall_sentiment": "Disappointed"
}
```


In [9]:
# Part 2: Diagnosing Failures (Systematic Testing with Multiple Runs)
print("=== Sentiment Analysis | 5 runs ===")
sentiment_runs_5 = run_prompt_n_times(sentiment_prompt_v1, n=5)
summarize_runs(sentiment_runs_5)
print("\nDetailed outputs:")
for i, out in enumerate(sentiment_runs_5, 1):
    print(f"\nRun {i}:\n{out}")


print("\n\n=== Product Description | 5 runs ===")
product_runs_5 = run_prompt_n_times(product_prompt_v1, n=5)
summarize_runs(product_runs_5)
print("\nDetailed outputs:")
for i, out in enumerate(product_runs_5, 1):
    print(f"\nRun {i}:\n{out}")


print("\n\n=== Data Extraction | 5 runs ===")
extraction_runs_5 = run_prompt_n_times(extraction_prompt_v1, n=5)
summarize_runs(extraction_runs_5)
print("\nDetailed outputs:")
for i, out in enumerate(extraction_runs_5, 1):
    print(f"\nRun {i}:\n{out}")

=== Sentiment Analysis | 5 runs ===
Total runs: 5
Unique outputs: 1
Most common outputs:
- (5x) 'The sentiment of the customer message is positive.'

Detailed outputs:

Run 1:
The sentiment of the customer message is positive.

Run 2:
The sentiment of the customer message is positive.

Run 3:
The sentiment of the customer message is positive.

Run 4:
The sentiment of the customer message is positive.

Run 5:
The sentiment of the customer message is positive.


=== Product Description | 5 runs ===
Total runs: 5
Unique outputs: 5
Most common outputs:
- (1x) '**AeroBrew Travel Mug - Your Ultimate Companion for On-the-Go Refreshment**\n\nElevate your daily commute with the AeroBre'...
- (1x) '### AeroBrew Travel Mug: Your Perfect Companion for On-the-Go Refreshment\n\nIntroducing the **AeroBrew Travel Mug**, desi'...
- (1x) '### AeroBrew Travel Mug\n\nElevate your on-the-go beverage experience with the **AeroBrew Travel Mug**—the perfect compani'...

Detailed outputs:

Run 1:
**AeroBrew Tr

Documented result of the 5 runs: all outputs are consistent in format and content. Variation: 3 include the word "customer" and 2 do not.

In [10]:
# Step 4: Run each initial prompt 10 times and compare consistency with 5-run test

print("=== Sentiment Analysis | 10 runs ===")
sentiment_runs_10 = run_prompt_n_times(sentiment_prompt_v1, n=10)
summarize_runs(sentiment_runs_10)


print("\n=== Product Description | 10 runs ===")
product_runs_10 = run_prompt_n_times(product_prompt_v1, n=10)
summarize_runs(product_runs_10)


print("\n=== Data Extraction | 10 runs ===")
extraction_runs_10 = run_prompt_n_times(extraction_prompt_v1, n=10)
summarize_runs(extraction_runs_10)


# --- Optional: Compare uniqueness vs 5-run test ---
print("\n\n=== Comparison: Unique Outputs ===")
print(f"Sentiment - 5 runs unique: {len(set(sentiment_runs_5))}")
print(f"Sentiment - 10 runs unique: {len(set(sentiment_runs_10))}")

print(f"Product - 5 runs unique: {len(set(product_runs_5))}")
print(f"Product - 10 runs unique: {len(set(product_runs_10))}")

print(f"Extraction - 5 runs unique: {len(set(extraction_runs_5))}")
print(f"Extraction - 10 runs unique: {len(set(extraction_runs_10))}")


=== Sentiment Analysis | 10 runs ===
Total runs: 10
Unique outputs: 1
Most common outputs:
- (10x) 'The sentiment of the customer message is positive.'

=== Product Description | 10 runs ===
Total runs: 10
Unique outputs: 10
Most common outputs:
- (1x) '**Introducing the AeroBrew Travel Mug: Your Ultimate Companion for Every Journey!**\n\nElevate your on-the-go beverage exp'...
- (1x) '### AeroBrew Travel Mug: Your Perfect Companion for Every Journey\n\nIntroducing the **AeroBrew Travel Mug**, the ultimate'...
- (1x) '**AeroBrew Travel Mug: Your Perfect Companion for Every Journey**\n\nElevate your sipping experience with the **AeroBrew T'...

=== Data Extraction | 10 runs ===
Total runs: 10
Unique outputs: 7
Most common outputs:
- (3x) '```json\n{\n  "order_id": "A12345",\n  "product_name": "AeroBrew Travel Mug",\n  "issues": "The lid started leaking after tw'...
- (2x) 'Here\'s the extracted information in JSON format:\n\n```json\n{\n  "order_id": "A12345",\n  "product_name": "AeroBr

In [None]:
# Checkpoint (consistency): With 10 runs, the product description prompt produced 10 unique outputs and the extraction prompt produced 7, including formatting drift such as an explanatory sentence before the JSON. Sentiment classification returned the same sentence in all 10 runs, but as a full sentence rather than a bare label, so the output format is still not strictly controlled.
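
Exact-match uniqueness mixes two things: whether the label changes and whether only the wording around it changes. The following is a minimal sketch, not part of the original lab, that pulls the label out of a free-form answer so the two can be counted separately. `extract_label` and `label_counts` are hypothetical helper names, and the sample strings are the two phrasings seen in the captured outputs.

```python
import re
from collections import Counter
from typing import List, Optional

# Labels named in the v1 sentiment prompt.
ALLOWED_LABELS = ("positive", "negative", "neutral")


def extract_label(output: str) -> Optional[str]:
    """Return the one label mentioned in the output, or None if zero or several appear."""
    text = output.lower()
    found = {label for label in ALLOWED_LABELS if re.search(rf"\b{label}\b", text)}
    return found.pop() if len(found) == 1 else None


def label_counts(outputs: List[str]) -> Counter:
    """Count runs per extracted label (None marks an ambiguous or missing label)."""
    return Counter(extract_label(o) for o in outputs)


# Two phrasings of the same answer, as seen in the sentiment runs.
sample_outputs = [
    "The sentiment of the customer message is positive.",
    "The sentiment of the message is positive.",
]
print(label_counts(sample_outputs))
```

Under this view both phrasings count as the same label, which is what "content-stable but format-loose" means in practice.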

In [11]:
# Step 5: Run each initial prompt 15 times and create a structured failure analysis

print("=== Sentiment Analysis | 15 runs ===")
sentiment_runs_15 = run_prompt_n_times(sentiment_prompt_v1, n=15)
summarize_runs(sentiment_runs_15)

print("\n=== Product Description | 15 runs ===")
product_runs_15 = run_prompt_n_times(product_prompt_v1, n=15)
summarize_runs(product_runs_15)

print("\n=== Data Extraction | 15 runs ===")
extraction_runs_15 = run_prompt_n_times(extraction_prompt_v1, n=15)
summarize_runs(extraction_runs_15)


# --- Consistency Calculation ---
def consistency_percentage(outputs):
    total = len(outputs)
    unique = len(set(o.strip() for o in outputs))
    return round((1 - (unique - 1) / total) * 100, 2)

print("\n=== Consistency Percentages (15 runs) ===")
print(f"Sentiment Consistency: {consistency_percentage(sentiment_runs_15)}%")
print(f"Product Description Consistency: {consistency_percentage(product_runs_15)}%")
print(f"Data Extraction Consistency: {consistency_percentage(extraction_runs_15)}%")


# --- Failure Pattern Flags ---
def detect_extraction_issues(outputs):
    issues = {
        "code_block_wrapped": 0,
        "extra_explanation_text": 0,
        "list_vs_string_inconsistency": 0
    }
    
    for out in outputs:
        if "```" in out:
            issues["code_block_wrapped"] += 1
        if "Here's" in out or "Below is" in out:
            issues["extra_explanation_text"] += 1
        if '"issues": [' in out:
            issues["list_vs_string_inconsistency"] += 1
            
    return issues

extraction_failure_patterns = detect_extraction_issues(extraction_runs_15)

print("\n=== Detected Extraction Failure Patterns ===")
print(extraction_failure_patterns)



=== Sentiment Analysis | 15 runs ===
Total runs: 15
Unique outputs: 2
Most common outputs:
- (10x) 'The sentiment of the customer message is positive.'
- (5x) 'The sentiment of the message is positive.'

=== Product Description | 15 runs ===
Total runs: 15
Unique outputs: 15
Most common outputs:
- (1x) '**AeroBrew Travel Mug: Your Perfect Companion for Every Journey**\n\nElevate your daily commute with the AeroBrew Travel M'...
- (1x) '### AeroBrew Travel Mug\n\nElevate your on-the-go beverage experience with the **AeroBrew Travel Mug**—the ultimate compan'...
- (1x) '**AeroBrew Travel Mug**  \nPrice: $29.99\n\nElevate your on-the-go beverage experience with the **AeroBrew Travel Mug**! De'...

=== Data Extraction | 15 runs ===
Total runs: 15
Unique outputs: 6
Most common outputs:
- (7x) '```json\n{\n  "order_id": "A12345",\n  "product_name": "AeroBrew Travel Mug",\n  "issues": "The lid started leaking after tw'...
- (2x) '```json\n{\n  "order_id": "A12345",\n  "product_name": "AeroBre

In [None]:
# Result: With 15 runs, variability becomes clearly measurable. Classification is content-stable but format-loose (2 phrasings of the same label), generation is different in every run (15 unique outputs), and extraction is structurally unstable (6 unique outputs, with inconsistent JSON formatting).
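
As a worked check of the consistency formula, the sketch below recomputes the score from the unique-output counts reported for the 15-run test (2, 15 and 6). `fake_runs` is a hypothetical helper that builds placeholder strings, so no API call is needed.

```python
from typing import List


def consistency_percentage(outputs: List[str]) -> float:
    """Same formula as in the lab: 100 * (1 - (unique - 1) / total)."""
    total = len(outputs)
    unique = len(set(o.strip() for o in outputs))
    return round((1 - (unique - 1) / total) * 100, 2)


def fake_runs(total: int, unique: int) -> List[str]:
    """Hypothetical helper: `total` placeholder strings with exactly `unique` distinct values."""
    return [f"output-{i % unique}" for i in range(total)]


for task, unique in [("Sentiment", 2), ("Product", 15), ("Extraction", 6)]:
    print(task, consistency_percentage(fake_runs(15, unique)))
# Sentiment 93.33
# Product 6.67
# Extraction 66.67
```

Note that the score never reaches 0 with this formula: fifteen distinct outputs out of fifteen still score 6.67, because one output is always counted as the baseline.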

In [14]:
# Create a separate failure analysis document file (Markdown) from Step 5 results
from collections import Counter
from datetime import datetime

def make_failure_analysis_md(
    filepath: str,
    sentiment_runs: list[str],
    product_runs: list[str],
    extraction_runs: list[str],
    notes: dict | None = None,
):
    """
    Creates a standalone Markdown file summarizing failure patterns from 15-run tests.
    Pass your existing lists: sentiment_runs_15, product_runs_15, extraction_runs_15.
    """

    def _unique_count(outputs: list[str]) -> int:
        return len(set(o.strip() for o in outputs))

    def _most_common(outputs: list[str], k: int = 3) -> list[tuple[str, int]]:
        return Counter(o.strip() for o in outputs).most_common(k)

    def _consistency(outputs: list[str]) -> float:
        total = len(outputs)
        unique = _unique_count(outputs)
        # same formula you used earlier
        return round((1 - (unique - 1) / total) * 100, 2)

    def _detect_extraction_patterns(outputs: list[str]) -> dict:
        patterns = {
            "code_block_wrapped": 0,
            "extra_explanation_text": 0,
            "issues_as_list": 0,
        }
        for out in outputs:
            text = out or ""
            if "```" in text:
                patterns["code_block_wrapped"] += 1
            if ("Here's" in text) or ("Below is" in text) or ("extracted information" in text.lower()):
                patterns["extra_explanation_text"] += 1
            if '"issues": [' in text:
                patterns["issues_as_list"] += 1
        return patterns

    # Basic stats
    s_total, p_total, e_total = len(sentiment_runs), len(product_runs), len(extraction_runs)
    s_unique, p_unique, e_unique = _unique_count(sentiment_runs), _unique_count(product_runs), _unique_count(extraction_runs)

    s_common = _most_common(sentiment_runs)
    p_common = _most_common(product_runs)
    e_common = _most_common(extraction_runs)

    e_patterns = _detect_extraction_patterns(extraction_runs)

    # Optional custom notes (you can edit later)
    notes = notes or {}
    s_notes = notes.get("sentiment", [
        "Output is often a full sentence instead of a single-word label (format not controlled).",
        "Zero-shot prompt does not enforce strict output constraints, making parsing fragile.",
    ])
    p_notes = notes.get("product", [
        "High variability in structure (headings/bold/markdown), length, and tone across runs.",
        "Creative additions beyond the provided product info can appear.",
        "No structure or length constraints -> unpredictable output.",
    ])
    e_notes = notes.get("extraction", [
        "JSON is sometimes wrapped in markdown code fences and/or preceded by explanatory text.",
        "Field type inconsistency can appear (e.g., issues as string vs list).",
        "Not reliably machine-parseable without post-processing.",
    ])

    def _fmt_common(common_list):
        lines = []
        for text, c in common_list:
            preview = text.replace("\n", " ")[:140]
            if len(text.replace("\n", " ")) > 140:
                preview += "..."
            lines.append(f"- ({c}x) `{preview}`")
        return "\n".join(lines) if lines else "- (no outputs)"

    now = datetime.now().strftime("%Y-%m-%d %H:%M")

    md = f"""# Failure Analysis – Version 1 Prompts (15-Run Test)

Generated: {now}

## Summary Metrics

| Task | Runs | Unique Outputs | Consistency (%) |
|---|---:|---:|---:|
| Sentiment Analysis | {s_total} | {s_unique} | {_consistency(sentiment_runs)} |
| Product Description | {p_total} | {p_unique} | {_consistency(product_runs)} |
| Data Extraction | {e_total} | {e_unique} | {_consistency(extraction_runs)} |

---

## 1) Sentiment Analysis (v1)

**Most common outputs:**  
{_fmt_common(s_common)}

**Observed failure patterns:**
{chr(10).join([f"- {n}" for n in s_notes])}

---

## 2) Product Description Generation (v1)

**Most common outputs (previews):**  
{_fmt_common(p_common)}

**Observed failure patterns:**
{chr(10).join([f"- {n}" for n in p_notes])}

---

## 3) Data Extraction (v1)

**Most common outputs (previews):**  
{_fmt_common(e_common)}

**Detected extraction-specific patterns (counts out of {e_total} runs):**
- Code block wrapped: {e_patterns["code_block_wrapped"]}
- Extra explanation text: {e_patterns["extra_explanation_text"]}
- Issues field as list: {e_patterns["issues_as_list"]}

**Observed failure patterns:**
{chr(10).join([f"- {n}" for n in e_notes])}

---

## Overall Diagnosis

- Classification is content-stable but format-loose without strict constraints.
- Generation shows extreme variability without structure and length rules.
- Extraction is structurally unreliable (format drift, inconsistent JSON), unsafe for production parsing.

## Recommended Next Improvements

- Add explicit output constraints (single label; fixed schemas).
- Add structured formatting requirements (JSON only; no markdown fences).
- Add few-shot examples to lock in format.
- Use more advanced prompting techniques in later iterations.
"""

    with open(filepath, "w", encoding="utf-8") as f:
        f.write(md)

    return filepath


# --- Use it (assumes you already have these from Step 5) ---
# sentiment_runs_15, product_runs_15, extraction_runs_15
output_path = "Failure_Analysis_Version1.md"
created = make_failure_analysis_md(
    filepath=output_path,
    sentiment_runs=sentiment_runs_15,
    product_runs=product_runs_15,
    extraction_runs=extraction_runs_15,
)

print("✅ Created:", created)




✅ Created: Failure_Analysis_Version1.md


| Task       | V1 Consistency (%) | V3 Consistency (%) | Improvement (percentage points) |
|------------|-------------------:|-------------------:|--------------------------------:|
| Sentiment  | 93.33              | 100                | +6.67                           |
| Product    | 6.67               | 20.0               | +13.33                          |
| Extraction | 66.67              | 86.67              | +20.0                           |


In [None]:
# Part 3: Iteration 2 - Rewriting Simple Prompts to Improve Consistency
# Step 6: Improve Sentiment Prompt (add clarity, strict format, and single-word constraint)

# --- Improved Sentiment Prompt (Version 2) ---

customer_message_1 = "I love this product! It's exactly what I needed."

sentiment_prompt_v2 = f"""
You are a sentiment classification system.

Task:
Classify the sentiment of the customer message strictly as one of the following:
positive
negative
neutral

Rules:
- Respond with EXACTLY ONE WORD.
- Do NOT include punctuation.
- Do NOT include explanations.
- Do NOT include any additional text.

Customer message:
"{customer_message_1}"
"""

# Test once
print("=== Sentiment v2 | Single Test ===")
result_v2 = call_openai(sentiment_prompt_v2, temperature=0.3)
print(result_v2)


# --- Test 15 times for consistency comparison ---
print("\n=== Sentiment v2 | 15 runs ===")
sentiment_runs_v2_15 = run_prompt_n_times(sentiment_prompt_v2, n=15, temperature=0.3)
summarize_runs(sentiment_runs_v2_15)

# Compare with v1
print("\n=== Comparison with v1 ===")
print(f"V1 unique outputs (15 runs): {len(set(sentiment_runs_15))}")
print(f"V2 unique outputs (15 runs): {len(set(sentiment_runs_v2_15))}")



=== Sentiment v2 | Single Test ===
positive

=== Sentiment v2 | 15 runs ===
Total runs: 15
Unique outputs: 1
Most common outputs:
- (15x) 'positive'

=== Comparison with v1 ===
V1 unique outputs (15 runs): 2
V2 unique outputs (15 runs): 1


# Result: The improved sentiment prompt (v2) returned the single word "positive" in all 15 runs, so the output is now format-controlled and directly parseable, unlike version 1. This was measured on one test message only.
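
A strict format check makes the "exactly one word" rule measurable instead of judged by eye. This is a minimal sketch under the assumption that surrounding whitespace is acceptable but any other character is not. `is_strict_label` and `compliance_rate` are hypothetical names, not part of the lab's helpers.

```python
from typing import List

# The three labels the v2 prompt allows.
ALLOWED_LABELS = {"positive", "negative", "neutral"}


def is_strict_label(output: str) -> bool:
    """True only if the output is exactly one allowed lowercase label (whitespace ignored)."""
    return output.strip() in ALLOWED_LABELS


def compliance_rate(outputs: List[str]) -> float:
    """Fraction of runs that satisfy the strict single-word format."""
    return sum(is_strict_label(o) for o in outputs) / len(outputs)


print(compliance_rate(["positive"] * 15))  # 1.0
print(is_strict_label("The sentiment of the customer message is positive."))  # False
```

Applied to `sentiment_runs_15` and `sentiment_runs_v2_15`, the same function would give one compliance number per prompt version.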

In [16]:
# Step 7: Improve Product Description Prompt (add structure, length constraints, and style control)
# --- Improved Product Description Prompt (Version 2) ---

product_info_1 = {
    "name": "AeroBrew Travel Mug",
    "key_features": ["Leak-proof lid", "Keeps drinks hot for 6 hours", "Fits most cup holders"],
    "target_audience": "commuters and travelers",
    "price": "$29.99"
}

product_prompt_v2 = f"""
You are a professional e-commerce copywriter.

Task:
Write a product description using the provided product information.

Structure Requirements:
1. Start with a short headline (no markdown, no bold formatting).
2. Write exactly TWO short paragraphs.
3. Each paragraph must be 2–3 sentences.
4. Total length must be between 80–120 words.

Style Guidelines:
- Professional and persuasive tone.
- Do NOT use bullet points.
- Do NOT use markdown formatting.
- Do NOT add information that is not provided.
- Do NOT mention that the information was provided as JSON.

Product Information:
{json.dumps(product_info_1, indent=2)}
"""

# Single test
print("=== Product Description v2 | Single Test ===")
result_v2 = call_openai(product_prompt_v2, temperature=0.7)
print(result_v2)


# --- Test 15 times for consistency ---
print("\n=== Product Description v2 | 15 runs ===")
product_runs_v2_15 = run_prompt_n_times(product_prompt_v2, n=15, temperature=0.7)
summarize_runs(product_runs_v2_15)

# Compare with v1
print("\n=== Comparison with v1 ===")
print(f"V1 unique outputs (15 runs): {len(set(product_runs_15))}")
print(f"V2 unique outputs (15 runs): {len(set(product_runs_v2_15))}")


=== Product Description v2 | Single Test ===
Stay caffeinated on the go with the AeroBrew Travel Mug, designed specifically for commuters and travelers. Featuring a leak-proof lid, this mug ensures your favorite beverages stay secure, while its insulation keeps drinks hot for up to six hours.

With a sleek design that fits most cup holders, the AeroBrew Travel Mug is the perfect companion for your daily journey. Priced at just $29.99, it combines functionality with style, making every sip a delightful experience.

=== Product Description v2 | 15 runs ===
Total runs: 15
Unique outputs: 15
Most common outputs:
- (1x) 'Elevate your on-the-go experience with the AeroBrew Travel Mug. Designed specifically for commuters and travelers, this '...
- (1x) 'Stay refreshed on the go with the AeroBrew Travel Mug, designed specifically for commuters and travelers. This sleek mug'...
- (1x) 'Experience the ultimate convenience with the AeroBrew Travel Mug, designed for commuters and travelers who ref

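# Result: Version 2 still produced 15 unique descriptions, so exact-match uniqueness did not improve. The outputs shown are plain text without markdown headings, bold text, or bullet points, which makes formatting more predictable than in version 1. Structure and length compliance need a separate check, because the unique-output count cannot capture them.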
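
Because every description is worded differently, compliance with the v2 rules has to be checked rule by rule. The sketch below covers only the rules that are easy to verify mechanically. It assumes the headline and the two paragraphs are separated by blank lines and that the word limit applies to the whole output. `check_description` is a hypothetical helper, and the per-paragraph sentence count is not checked.

```python
import re
from typing import Dict

# Bold markers, markdown headings, or bullet characters at the start of a line.
MARKDOWN_PATTERN = re.compile(r"\*\*|^#{1,6}\s|^\s*[-*]\s", re.MULTILINE)


def check_description(text: str, min_words: int = 80, max_words: int = 120) -> Dict[str, bool]:
    """Check one generated description against the mechanically verifiable v2 rules."""
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text.strip()) if b.strip()]
    if not blocks:
        return {
            "headline_plus_two_paragraphs": False,
            "word_count_in_range": False,
            "no_markdown_or_bullets": False,
        }
    headline, paragraphs = blocks[0], blocks[1:]
    return {
        # Assumption: a headline is a single line followed by exactly two blocks.
        "headline_plus_two_paragraphs": len(paragraphs) == 2 and "\n" not in headline,
        "word_count_in_range": min_words <= len(text.split()) <= max_words,
        "no_markdown_or_bullets": MARKDOWN_PATTERN.search(text) is None,
    }


# Short illustrative samples; the word limits are lowered to fit them.
good_sample = (
    "AeroBrew Travel Mug\n\n"
    "First paragraph here. It has two sentences.\n\n"
    "Second paragraph here. It also has two."
)
bad_sample = "### AeroBrew Travel Mug\n\n- **Leak-Proof Lid:** no spills"

print(check_description(good_sample, min_words=5, max_words=40))
print(check_description(bad_sample, min_words=5, max_words=40))
```

Counting how many of the 15 runs pass each rule gives a structural consistency figure that is independent of wording.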

In [17]:
# Step 8: Improve Data Extraction Prompt (add strict JSON structure and formatting constraints)

# --- Improved Data Extraction Prompt (Version 2) ---

feedback_1 = """
I bought the AeroBrew Travel Mug last week (Order #A12345). The mug keeps coffee hot,
but the lid started leaking after two days. Support was friendly and offered a replacement.
I'm still a bit disappointed because I expected better quality for $29.99.
"""

extraction_prompt_v2 = f"""
You are an information extraction system.

Task:
Extract the required fields from the customer feedback.

Output Requirements:
- Return ONLY valid JSON.
- Do NOT include markdown code fences.
- Do NOT include explanations.
- Do NOT include any text before or after the JSON.
- Use exactly the field names shown below.
- If information is missing, use null.

Required JSON structure:

{{
  "order_id": string,
  "product_name": string,
  "issues": string,
  "positive_notes": string,
  "overall_sentiment": "positive" | "negative" | "neutral"
}}

Customer Feedback:
\"\"\"{feedback_1}\"\"\"
"""

# --- Single test ---
print("=== Data Extraction v2 | Single Test ===")
result_v2 = call_openai(extraction_prompt_v2, temperature=0.3)
print(result_v2)


# --- 15-run test ---
print("\n=== Data Extraction v2 | 15 runs ===")
extraction_runs_v2_15 = run_prompt_n_times(extraction_prompt_v2, n=15, temperature=0.3)
summarize_runs(extraction_runs_v2_15)


# --- Compare with v1 ---
print("\n=== Comparison with v1 ===")
print(f"V1 unique outputs (15 runs): {len(set(extraction_runs_15))}")
print(f"V2 unique outputs (15 runs): {len(set(extraction_runs_v2_15))}")


=== Data Extraction v2 | Single Test ===
{
  "order_id": "A12345",
  "product_name": "AeroBrew Travel Mug",
  "issues": "Lid started leaking after two days",
  "positive_notes": "Support was friendly and offered a replacement",
  "overall_sentiment": "negative"
}

=== Data Extraction v2 | 15 runs ===
Total runs: 15
Unique outputs: 5
Most common outputs:
- (6x) '{\n  "order_id": "A12345",\n  "product_name": "AeroBrew Travel Mug",\n  "issues": "The lid started leaking after two days."'...
- (4x) '{\n  "order_id": "A12345",\n  "product_name": "AeroBrew Travel Mug",\n  "issues": "Lid started leaking after two days",\n  "'...
- (2x) '{\n  "order_id": "A12345",\n  "product_name": "AeroBrew Travel Mug",\n  "issues": "The lid started leaking after two days."'...

=== Comparison with v1 ===
V1 unique outputs (15 runs): 6
V2 unique outputs (15 runs): 5


In [None]:
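# Result: In the outputs shown, version 2 removed the markdown code fences and explanatory preambles seen in version 1, so the JSON can be parsed directly. Unique outputs dropped only from 6 to 5; the remaining differences are in wording (for example "The lid started leaking after two days." vs. "Lid started leaking after two days").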
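
Whether an output is "production-ready JSON" can be tested directly by parsing it. The sketch below is one possible check, assuming that the five field names from the prompt are required, that the first four fields may be a string or null, and that `overall_sentiment` must be one of the three labels. `parse_extraction` and `is_schema_valid` are hypothetical helpers.

```python
import json
from typing import Any, Dict, Optional

REQUIRED_FIELDS = ("order_id", "product_name", "issues", "positive_notes", "overall_sentiment")
ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}


def parse_extraction(output: str) -> Optional[Dict[str, Any]]:
    """Parse the raw output as JSON; None if it is fenced, has extra text, or is invalid."""
    try:
        data = json.loads(output.strip())
    except json.JSONDecodeError:
        return None
    return data if isinstance(data, dict) else None


def is_schema_valid(data: Optional[Dict[str, Any]]) -> bool:
    """Check field names and basic types against the structure requested in the prompt."""
    if data is None or set(data) != set(REQUIRED_FIELDS):
        return False
    for field in REQUIRED_FIELDS[:-1]:
        if data[field] is not None and not isinstance(data[field], str):
            return False
    return data["overall_sentiment"] in ALLOWED_SENTIMENTS


# Sample modeled on the v2 single-test output.
clean = json.dumps({
    "order_id": "A12345",
    "product_name": "AeroBrew Travel Mug",
    "issues": "Lid started leaking after two days",
    "positive_notes": "Support was friendly and offered a replacement",
    "overall_sentiment": "negative",
}, indent=2)
fence = "`" * 3  # built indirectly so this example contains no literal fence
fenced = f"{fence}json\n{clean}\n{fence}"  # v1-style wrapping
v1_style = clean.replace('"negative"', '"Disappointed"')  # v1-style free-form sentiment

print(is_schema_valid(parse_extraction(clean)))     # True
print(parse_extraction(fenced))                     # None
print(is_schema_valid(parse_extraction(v1_style)))  # False
```

The fraction of runs for which `is_schema_valid(parse_extraction(run))` is true is a more direct reliability measure than the unique-output count.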

In [None]:
# Result: Few-shot prompting (v3) for sentiment classification produced 1 unique output across 15 runs (100% consistency), matching v2 and improving over v1 (2 unique outputs).
# There was no format drift: every run returned the same single label.


In [19]:
# Step 10: Add Chain-of-Thought style reasoning instructions to Data Extraction and test 15 times
# --- Data Extraction Prompt (Version 3: CoT-style guidance, JSON-only output) ---

feedback_test = """
I bought the AeroBrew Travel Mug last week (Order #A12345). The mug keeps coffee hot,
but the lid started leaking after two days. Support was friendly and offered a replacement.
I'm still a bit disappointed because I expected better quality for $29.99.
"""

extraction_prompt_v3 = f"""
You are an information extraction system.

Task:
Extract the required fields from the customer feedback.

Reasoning Instructions:
- Think step by step to identify the correct values for each field.
- Use the exact information from the text; do not invent details.
- Then output ONLY the final JSON.

Output Requirements:
- Return ONLY valid JSON.
- Do NOT include markdown code fences.
- Do NOT include explanations.
- Do NOT include any text before or after the JSON.
- Use exactly the field names shown below.
- If information is missing, use null.

Required JSON structure:
{{
  "order_id": string,
  "product_name": string,
  "issues": string,
  "positive_notes": string,
  "overall_sentiment": "positive" | "negative" | "neutral"
}}

Customer Feedback:
\"\"\"{feedback_test}\"\"\"
"""

# Single test
print("=== Data Extraction v3 (CoT guidance) | Single Test ===")
print(call_openai(extraction_prompt_v3, temperature=0.2))

# 15-run test
print("\n=== Data Extraction v3 (CoT guidance) | 15 runs ===")
extraction_runs_v3_15 = run_prompt_n_times(extraction_prompt_v3, n=15, temperature=0.2)
summarize_runs(extraction_runs_v3_15)

# Compare with v1 and v2 (if available)
print("\n=== Comparison ===")
print(f"V1 unique outputs (15 runs): {len(set(extraction_runs_15))}")        # Step 5
print(f"V2 unique outputs (15 runs): {len(set(extraction_runs_v2_15))}")    # Step 8
print(f"V3 unique outputs (15 runs): {len(set(extraction_runs_v3_15))}")


=== Data Extraction v3 (CoT guidance) | Single Test ===
{
  "order_id": "A12345",
  "product_name": "AeroBrew Travel Mug",
  "issues": "The lid started leaking after two days.",
  "positive_notes": "The mug keeps coffee hot. Support was friendly and offered a replacement.",
  "overall_sentiment": "negative"
}

=== Data Extraction v3 (CoT guidance) | 15 runs ===
Total runs: 15
Unique outputs: 3
Most common outputs:
- (6x) '{\n  "order_id": "A12345",\n  "product_name": "AeroBrew Travel Mug",\n  "issues": "the lid started leaking after two days",'...
- (5x) '{\n  "order_id": "A12345",\n  "product_name": "AeroBrew Travel Mug",\n  "issues": "The lid started leaking after two days."'...
- (4x) '{\n  "order_id": "A12345",\n  "product_name": "AeroBrew Travel Mug",\n  "issues": "lid started leaking after two days",\n  "'...

=== Comparison ===
V1 unique outputs (15 runs): 6
V2 unique outputs (15 runs): 5
V3 unique outputs (15 runs): 3


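# Result: With chain-of-thought guidance (v3), unique outputs dropped from 5 (v2) to 3 across 15 runs, and the outputs shown are still bare JSON. The remaining variants differ mainly in capitalization, a leading "the", and a trailing period in the "issues" field. v3 was also run at temperature 0.2 instead of 0.3, so part of the gain may come from the lower temperature rather than from the reasoning instructions.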
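
The three v3 variants appear to differ only cosmetically, so it can help to count unique outputs after normalizing the field values. This is a minimal sketch, assuming every output is a bare JSON object whose values are strings or null. `normalize_value` and `normalized_unique_count` are hypothetical helpers, and the sample runs are illustrative strings built from the visible previews, not the full captured outputs.

```python
import json
import string
from typing import Any, List


def normalize_value(value: Any) -> Any:
    """Lowercase, trim, and drop trailing punctuation and a leading 'the '."""
    if not isinstance(value, str):
        return value
    text = value.strip().lower().rstrip(string.punctuation).strip()
    return text[4:] if text.startswith("the ") else text


def normalized_unique_count(outputs: List[str]) -> int:
    """Number of distinct outputs once cosmetic differences in values are removed."""
    seen = set()
    for out in outputs:
        data = json.loads(out)  # assumes bare JSON, as in the v3 runs
        seen.add(tuple(sorted((key, normalize_value(val)) for key, val in data.items())))
    return len(seen)


# Illustrative runs that vary only in the "issues" wording.
issue_variants = [
    "the lid started leaking after two days",
    "The lid started leaking after two days.",
    "lid started leaking after two days",
]
sample_runs = [
    json.dumps({"order_id": "A12345", "issues": issue, "overall_sentiment": "negative"})
    for issue in issue_variants
]

print(len(set(sample_runs)))                 # 3 by exact match
print(normalized_unique_count(sample_runs))  # 1 after normalization
```

If the normalized count is lower than the exact-match count, the remaining variability is stylistic rather than a disagreement about the extracted facts.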

In [20]:
# Step 11: Add few-shot examples + strict structure to Product Description and test 15 times
# --- Product Description Prompt (Version 3: Few-shot + strict structure) ---

product_info_test = {
    "name": "AeroBrew Travel Mug",
    "key_features": ["Leak-proof lid", "Keeps drinks hot for 6 hours", "Fits most cup holders"],
    "target_audience": "commuters and travelers",
    "price": "$29.99"
}

product_prompt_v3 = f"""
You are a professional e-commerce copywriter.

Task:
Write a product description using ONLY the provided product information.

Output Format Requirements:
- Output EXACTLY 3 lines.
- Line 1: HEADLINE (max 10 words). No punctuation at the end.
- Line 2: DESCRIPTION (exactly 22–28 words). No markdown. No bullets.
- Line 3: CTA (call-to-action) starting with "Buy now:" and include the price.

Rules:
- Do NOT use markdown formatting (no **, no ###).
- Do NOT add features not provided.
- Do NOT mention JSON or "provided information".

Examples:

Product Information:
{{
  "name": "EcoWave Water Bottle",
  "key_features": ["BPA-free", "Keeps drinks cold for 12 hours", "Dishwasher safe"],
  "target_audience": "fitness enthusiasts",
  "price": "$24.99"
}}
Output:
EcoWave Bottle for Everyday Hydration
BPA-free and built for fitness enthusiasts, EcoWave keeps drinks cold for 12 hours and stays easy to clean because it is dishwasher safe
Buy now: $24.99

Product Information:
{{
  "name": "CloudSoft Pillow",
  "key_features": ["Hypoallergenic", "Medium-firm support", "Breathable cover"],
  "target_audience": "side sleepers",
  "price": "$39.00"
}}
Output:
CloudSoft Pillow for Better Sleep
Hypoallergenic comfort for side sleepers, CloudSoft provides medium-firm support and a breathable cover that helps you stay cool while resting through the night
Buy now: $39.00

Now write the output for this product:

Product Information:
{json.dumps(product_info_test, indent=2)}
Output:
"""

# Single test
print("=== Product Description v3 (few-shot) | Single Test ===")
print(call_openai(product_prompt_v3, temperature=0.5))

# 15-run test
print("\n=== Product Description v3 (few-shot) | 15 runs ===")
product_runs_v3_15 = run_prompt_n_times(product_prompt_v3, n=15, temperature=0.5)
summarize_runs(product_runs_v3_15)

# Compare with v1 and v2
print("\n=== Comparison ===")
print(f"V1 unique outputs (15 runs): {len(set(product_runs_15))}")        # Step 5
print(f"V2 unique outputs (15 runs): {len(set(product_runs_v2_15))}")    # Step 7
print(f"V3 unique outputs (15 runs): {len(set(product_runs_v3_15))}")


=== Product Description v3 (few-shot) | Single Test ===
AeroBrew Travel Mug for On-the-Go Sips  
Designed for commuters and travelers, AeroBrew features a leak-proof lid and retains heat for 6 hours while fitting most cup holders for ultimate convenience  
Buy now: $29.99

=== Product Description v3 (few-shot) | 15 runs ===
Total runs: 15
Unique outputs: 8
Most common outputs:
- (4x) 'AeroBrew Travel Mug for On-the-Go Enjoyment  \nDesigned for commuters and travelers, AeroBrew features a leak-proof lid a'...
- (3x) 'AeroBrew Mug for On-the-Go Enjoyment  \nDesigned for commuters and travelers, the AeroBrew Travel Mug features a leak-pro'...
- (2x) 'AeroBrew Travel Mug for On-the-Go Enjoyment  \nDesigned for commuters and travelers, AeroBrew features a leak-proof lid a'...

=== Comparison ===
V1 unique outputs (15 runs): 15
V2 unique outputs (15 runs): 15
V3 unique outputs (15 runs): 8


#Result: Version 3 achieved higher consistency with the required format and style. All outputs followed the exact three-line structure and formatting rules, and unique outputs dropped from 15 (Versions 1 and 2) to 8.

Checkpoint: Compared to Version 1, Version 3 reduced unique outputs and increased consistency percentages across sentiment classification, product generation, and data extraction, confirming the effectiveness of few-shot and Chain-of-Thought techniques.
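
The format rules in the Version 3 product prompt can also be checked programmatically instead of by eye. The sketch below is one possible validator, not part of the original lab: `check_product_format` is a hypothetical helper name, and it assumes that blank lines and trailing spaces (such as the two-space line breaks visible in the outputs above) should be ignored.

```python
def check_product_format(text: str, price: str) -> list:
    """Return a list of rule violations; an empty list means the output passes."""
    # Assumption: ignore blank lines and trailing spaces before counting lines.
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if len(lines) != 3:
        return [f"expected 3 lines, got {len(lines)}"]

    headline, description, cta = lines
    problems = []

    # Line 1: headline of at most 10 words, no punctuation at the end
    if len(headline.split()) > 10:
        problems.append("headline longer than 10 words")
    if headline[-1] in ".!?,;:":
        problems.append("headline ends with punctuation")

    # Line 2: description of 22 to 28 words
    n_words = len(description.split())
    if not 22 <= n_words <= 28:
        problems.append(f"description has {n_words} words (expected 22-28)")

    # Rules: no markdown markers (bold or headings)
    if "**" in text or any(ln.startswith("#") for ln in lines):
        problems.append("markdown formatting found")

    # Line 3: call-to-action with the price
    if not cta.startswith("Buy now:") or price not in cta:
        problems.append("CTA must start with 'Buy now:' and include the price")

    return problems


sample = (
    "AeroBrew Travel Mug for On-the-Go Sips  \n"
    "Designed for commuters and travelers, AeroBrew features a leak-proof lid "
    "and retains heat for 6 hours while fitting most cup holders for ultimate convenience  \n"
    "Buy now: $29.99"
)
print(check_product_format(sample, "$29.99"))
```

Applied to a list such as `product_runs_v3_15`, the same function would give a count of runs that break at least one rule, which is a more direct measure of format adherence than the number of unique outputs.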


In [22]:
# Step 12: Test best prompts on task variations to evaluate robustness and tuning needs
# --- Variation Tests Using Best Prompts (v3 versions) ---

# 1️⃣ Sentiment Variations
sentiment_variations = [
    "This product completely exceeded my expectations!",
    "It's okay, nothing special but not bad either.",
    "I regret buying this. It stopped working after a day."
]

print("=== Sentiment Variations (v3) ===")
for msg in sentiment_variations:
    prompt = sentiment_prompt_v3.replace(customer_message_test, msg)
    result = call_openai(prompt, temperature=0.2)
    print(f"Message: {msg}")
    print("Output:", result)
    print("-" * 50)


# 2️⃣ Product Description Variation (different product)

product_info_variation = {
    "name": "ThermaLight Jacket",
    "key_features": ["Water-resistant", "Lightweight insulation", "Breathable fabric"],
    "target_audience": "outdoor enthusiasts",
    "price": "$89.99"
}

product_prompt_variation = product_prompt_v3.replace(
    json.dumps(product_info_test, indent=2),
    json.dumps(product_info_variation, indent=2)
)

print("\n=== Product Description Variation (v3) ===")
print(call_openai(product_prompt_variation, temperature=0.5))


# 3️⃣ Data Extraction Variation (different feedback structure)

feedback_variation = """
Order #B98765 arrived yesterday. The ThermaLight Jacket fits perfectly and keeps me warm.
However, the zipper feels cheap. Customer service responded quickly when I contacted them.
Overall, I'm satisfied but slightly concerned about durability.
"""

extraction_prompt_variation = extraction_prompt_v3.replace(
    feedback_test,
    feedback_variation
)

print("\n=== Data Extraction Variation (v3) ===")
print(call_openai(extraction_prompt_variation, temperature=0.2))



=== Sentiment Variations (v3) ===
Message: This product completely exceeded my expectations!
Output: positive
--------------------------------------------------
Message: It's okay, nothing special but not bad either.
Output: neutral
--------------------------------------------------
Message: I regret buying this. It stopped working after a day.
Output: negative
--------------------------------------------------

=== Product Description Variation (v3) ===
ThermaLight Jacket for Outdoor Adventures  
Designed for outdoor enthusiasts, the ThermaLight Jacket features water-resistant protection, lightweight insulation, and breathable fabric to keep you comfortable in any weather  
Buy now: $89.99

=== Data Extraction Variation (v3) ===
{
  "order_id": "B98765",
  "product_name": "ThermaLight Jacket",
  "issues": "The zipper feels cheap.",
  "positive_notes": "Fits perfectly and keeps me warm. Customer service responded quickly.",
  "overall_sentiment": "positive"
}


#Result: The Version 3 prompts kept the required format and style across all task variations and returned accurate outputs.
The structured Version 3 prompts generalized to new inputs, but the tests also exposed a nuanced area that may need further tuning: mixed sentiment. For example, the jacket feedback combines praise with a durability concern, and the extraction prompt labeled it "positive" overall.
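
One caveat about the variation tests: `str.replace` returns the prompt unchanged when the original input is not found, so a mismatch (for example, different whitespace) would silently re-test the old input. A minimal sketch of a stricter substitution is shown below; `swap_input` is a hypothetical helper, not part of the original notebook.

```python
def swap_input(prompt: str, old: str, new: str) -> str:
    """Replace the embedded test input, failing loudly if it is not present."""
    if old not in prompt:
        raise ValueError("original input not found in prompt; nothing would be replaced")
    return prompt.replace(old, new)


# Toy example with a made-up template (the real prompts are defined in earlier steps)
template = "Classify the sentiment.\nMessage: I love it\nLabel:"
print(swap_input(template, "I love it", "It broke after a day"))
```

In the cell above, `swap_input(sentiment_prompt_v3, customer_message_test, msg)` could stand in for the direct `.replace(...)` call.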


In [23]:
# Step 13: Run final Version 3 prompts 15 times and generate comprehensive comparison metrics
# --- Final 15-run evaluation for Version 3 prompts ---

print("=== FINAL EVALUATION (Version 3) ===")

# Sentiment v3
sentiment_final_15 = run_prompt_n_times(sentiment_prompt_v3, n=15, temperature=0.2)
sentiment_v3_unique = len(set(sentiment_final_15))

# Product v3
product_final_15 = run_prompt_n_times(product_prompt_v3, n=15, temperature=0.5)
product_v3_unique = len(set(product_final_15))

# Extraction v3
extraction_final_15 = run_prompt_n_times(extraction_prompt_v3, n=15, temperature=0.2)
extraction_v3_unique = len(set(extraction_final_15))


# --- Consistency Function ---
def consistency(unique, total=15):
    """Consistency score in %: 100 when all runs are identical, 100/total when every run is unique."""
    return round((1 - (unique - 1) / total) * 100, 2)


# --- Collect Metrics ---
report = {
    "Sentiment": {
        "V1_unique": len(set(sentiment_runs_15)),
        "V3_unique": sentiment_v3_unique,
        "V1_consistency_%": consistency(len(set(sentiment_runs_15))),
        "V3_consistency_%": consistency(sentiment_v3_unique),
    },
    "Product": {
        "V1_unique": len(set(product_runs_15)),
        "V3_unique": product_v3_unique,
        "V1_consistency_%": consistency(len(set(product_runs_15))),
        "V3_consistency_%": consistency(product_v3_unique),
    },
    "Extraction": {
        "V1_unique": len(set(extraction_runs_15)),
        "V3_unique": extraction_v3_unique,
        "V1_consistency_%": consistency(len(set(extraction_runs_15))),
        "V3_consistency_%": consistency(extraction_v3_unique),
    }
}

# --- Print Report ---
print("\n=== COMPARISON REPORT ===")
for task, metrics in report.items():
    print(f"\n{task}")
    print(f"  V1 Unique Outputs: {metrics['V1_unique']}")
    print(f"  V3 Unique Outputs: {metrics['V3_unique']}")
    print(f"  V1 Consistency: {metrics['V1_consistency_%']}%")
    print(f"  V3 Consistency: {metrics['V3_consistency_%']}%")


=== FINAL EVALUATION (Version 3) ===

=== COMPARISON REPORT ===

Sentiment
  V1 Unique Outputs: 2
  V3 Unique Outputs: 1
  V1 Consistency: 93.33%
  V3 Consistency: 100.0%

Product
  V1 Unique Outputs: 15
  V3 Unique Outputs: 13
  V1 Consistency: 6.67%
  V3 Consistency: 20.0%

Extraction
  V1 Unique Outputs: 6
  V3 Unique Outputs: 3
  V1 Consistency: 66.67%
  V3 Consistency: 86.67%


#Document improvement metrics:

- Sentiment consistency improved from 93.33% (V1) to 100% (V3), with unique outputs reduced from 2 to 1.
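- Product description consistency improved from 6.67% (V1) to 20.0% (V3), with unique outputs dropping from 15 to 13. Wording still varies between runs (the same prompt produced 8 unique outputs in Step 11), so this figure should be read as a rough indicator rather than a stable score.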
- Data extraction consistency improved from 66.67% (V1) to 86.67% (V3), cutting unique outputs in half (6 → 3) and significantly improving JSON reliability.
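
The percentages above depend only on the number of unique outputs, so they do not show how the runs are distributed. As a rough illustration, the sketch below repeats the `consistency` formula from Step 13 and compares it with a second, hypothetical metric, `modal_share`: the share of runs that match the most common output after collapsing whitespace. The toy data is invented for the example.

```python
from collections import Counter


def consistency(unique: int, total: int = 15) -> float:
    """Same formula as in Step 13."""
    return round((1 - (unique - 1) / total) * 100, 2)


def modal_share(outputs: list) -> float:
    """Percentage of runs equal to the most common output (whitespace-normalized)."""
    if not outputs:
        raise ValueError("outputs must not be empty")
    # Collapse runs of whitespace so trailing spaces do not create "new" outputs
    normalized = [" ".join(o.split()) for o in outputs]
    top_count = Counter(normalized).most_common(1)[0][1]
    return round(top_count / len(normalized) * 100, 2)


# Toy data: two distinct answers split almost evenly across 15 runs
runs = ["positive"] * 8 + ["neutral"] * 7
print(consistency(len(set(runs)), total=len(runs)))  # 93.33
print(modal_share(runs))                             # 53.33
```

Both metrics agree when one answer dominates, but they diverge when a few answers split the runs, which is why reporting them side by side can be more informative.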


What failure patterns were identified:
During this lab, I identified several failure patterns in the original zero-shot prompts. The most common issues were format drift, inconsistent JSON structure, and high variability in generated outputs. While sentiment classification was relatively stable, it lacked strict format control. Product descriptions showed extreme variation, and data extraction outputs were not always production-safe.

Which techniques had the biggest impact:
Few-shot prompting had the biggest impact on consistency, especially for sentiment classification. Adding structured output constraints significantly improved reliability across all tasks. Chain-of-Thought guidance improved reasoning quality in the extraction task and reduced variability further. Overall, combining structure with examples proved more effective than simple instruction refinement.

What would I do differently next time:
If I were to improve this further, I would experiment with stricter token-length control for product descriptions and possibly enforce schema validation through post-processing. I would also test edge cases such as ambiguous sentiment and incomplete feedback to evaluate robustness more deeply.
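
As a starting point for the schema validation mentioned above, the following sketch checks an extraction output using only the standard library. The key names are taken from the Step 12 extraction output; the set of allowed sentiment labels and the helper name `validate_extraction` are assumptions made for illustration.

```python
import json

# Keys observed in the Step 12 extraction output
REQUIRED_KEYS = {"order_id", "product_name", "issues", "positive_notes", "overall_sentiment"}
# Assumption: the same three labels used in the sentiment task
ALLOWED_SENTIMENTS = {"positive", "neutral", "negative"}


def validate_extraction(raw: str):
    """Return (data, errors); data is None when raw is not a JSON object."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top-level value is not a JSON object"]

    errors = []
    missing = REQUIRED_KEYS - data.keys()
    extra = data.keys() - REQUIRED_KEYS
    if missing:
        errors.append(f"missing keys: {sorted(missing)}")
    if extra:
        errors.append(f"unexpected keys: {sorted(extra)}")
    if "overall_sentiment" in data and data["overall_sentiment"] not in ALLOWED_SENTIMENTS:
        errors.append(f"unknown sentiment: {data['overall_sentiment']!r}")
    return data, errors


raw_output = """{
  "order_id": "B98765",
  "product_name": "ThermaLight Jacket",
  "issues": "The zipper feels cheap.",
  "positive_notes": "Fits perfectly and keeps me warm. Customer service responded quickly.",
  "overall_sentiment": "positive"
}"""
data, errors = validate_extraction(raw_output)
print(errors)
```

Outputs that fail validation could be retried or logged, so that downstream code only ever sees records with the expected keys.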
