# Day 5 Mini-Capstone Lab  
## End-to-End Evaluation & Tracking for a HomePro LLM App (TruLens + MLflow)

**Estimated duration:** ~240 minutes (4 hours)  
**Theme:** Design and implement an *end-to-end* evaluation strategy for a small LLM-backed application in a **retail home improvement** domain using:

- Traditional metrics: **accuracy, F1, BLEU, ROUGE**
- An **evaluation harness** that runs your model over a labeled test set
- **MLflow** for experiment tracking (metrics, params, artifacts)
- **TruLens** for structured LLM evaluation (groundedness, answer relevance, context relevance)
- A local LLM running in **LM Studio** (OpenAI-compatible API)

By the end of this lab you should be able to:

1. Build a synthetic **HomePro**-style corpus (~500 docs) with realistic variation.  
2. Define an **evaluation dataset** for at least one of:
   - FAQ answerer (RAG-style Q&A)  
   - Summarizer (summarize product/policy docs)  
   - Sentiment classifier (label customer reviews)  
3. Implement a simple **RAG-style LLM app** over that corpus using LM Studio.  
4. Build a **metric harness** to compute accuracy/F1 (for classification), BLEU, and ROUGE.  
5. Track **two variants** of your app with MLflow and compare them.  
6. Instrument your app with **TruLens** and compute structured LLM metrics:
   - Groundedness (faithfulness)
   - Answer relevance
   - Context relevance  
7. Combine **MLflow** + **TruLens** outputs to argue which variant is “better” and why.


## 0. Pre-requisites & Environment Setup

### 0.1. LM Studio (local LLM)

You will use **LM Studio** as an on-box LLM, exposed via an **OpenAI-compatible HTTP API**.

1. Open LM Studio and download a reasonably capable chat model (e.g., a 7B–8B instruction-tuned model).  
2. Start the **local server** in LM Studio (UI: *"Local Server"* → *"Start Server"*).  
   - Default base URL (check LM Studio docs / status panel):
     - Typically something like: `http://localhost:1234/v1`
3. Note the **model ID** you want to use, exactly as LM Studio displays it (e.g., `openai/gpt-oss-20b`, which is the default in the configuration cell below, or the identifier of whichever model you downloaded).

We will configure the Python `openai` client (used by TruLens) to talk to this local server via environment variables.
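
Before wiring up the `openai` client, it can help to confirm that the server is reachable. The sketch below uses only the standard library and assumes that LM Studio serves the OpenAI-style `GET /v1/models` route at the default base URL. The helper name `list_local_models` is hypothetical and not part of any library.

```python
import http.client
import json
import urllib.request


def list_local_models(base_url: str = "http://localhost:1234/v1", timeout: float = 3.0) -> list:
    """Return model IDs reported by an OpenAI-compatible server, or [] if unreachable."""
    url = base_url.rstrip("/") + "/models"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            payload = json.load(response)
    except (OSError, ValueError, http.client.HTTPException):
        # URLError subclasses OSError; ValueError covers malformed JSON or URLs.
        return []
    if not isinstance(payload, dict):
        return []
    entries = payload.get("data") or []
    return [m["id"] for m in entries if isinstance(m, dict) and "id" in m]


models = list_local_models()
print("Models reported by the server:", models or "none (is the local server running?)")
```

If the list is empty, check that the local server is started in LM Studio and that the port matches the one shown in its status panel.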

---

### 0.2. MLflow via Docker (local tracking server)

We’ll run an **MLflow tracking server** locally in Docker and log experiments to it.

**From a terminal in your lab project directory:**

```bash
# Create the data directories if they do not already exist
mkdir -p "$PWD/mlflow_data/mlruns" "$PWD/mlflow_data/mlartifacts"

# Run the MLflow server detached. The SQLite file is placed in the mounted
# mlruns directory so run metadata survives container restarts.
docker run --rm -d -p 5000:5000 \
  -v "$PWD/mlflow_data/mlruns:/mlflow/mlruns" \
  -v "$PWD/mlflow_data/mlartifacts:/mlflow/artifacts" \
  ghcr.io/mlflow/mlflow:latest \
  mlflow server --host 0.0.0.0 --port 5000 \
    --backend-store-uri "sqlite:////mlflow/mlruns/mlflow.db" \
    --default-artifact-root "/mlflow/artifacts"
```

- In the browser, open: <http://localhost:5000> to see the MLflow UI.
- You will log runs to this server from the notebook using `MLFLOW_TRACKING_URI=http://localhost:5000`.
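
A quick way to confirm the tracking server is up before logging anything is to probe its health route. This is a minimal sketch using only the standard library. It assumes the MLflow server answers `GET /health` with HTTP 200, and the helper name `mlflow_server_is_healthy` is hypothetical.

```python
import http.client
import urllib.request


def mlflow_server_is_healthy(tracking_uri: str = "http://localhost:5000", timeout: float = 3.0) -> bool:
    """Return True if the server answers GET /health with HTTP 200, else False."""
    url = tracking_uri.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeouts, and non-2xx responses all end up here.
        return False


print("MLflow server healthy:", mlflow_server_is_healthy())
```

If this prints `False` while the container is running, another process may already be bound to port 5000 (on some macOS versions a system service uses it). In that case, map a different host port in the `docker run` command and adjust `MLFLOW_TRACKING_URI` to match.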

---

### 0.3. Python environment & dependencies

This notebook assumes a Python 3.13+ virtual environment with `pip` available.

In the kernel being used for the notebook, install the core dependencies (if not already installed):

```bash
pip install mlflow scikit-learn sacrebleu rouge-score pandas numpy matplotlib openai trulens trulens-providers-openai
```

> **Note on CUDA vs Apple Silicon (Metal)**  
> - LM Studio handles GPU selection for you.  
> - On **NVIDIA/CUDA** machines, LM Studio can offload inference to the GPU; choose a model and quantization that fit in your VRAM.  
> - On **Apple Silicon**, LM Studio runs models on the integrated GPU; choose a model that fits in unified memory.  
> - This notebook itself does not require GPU-specific code; any heavy lifting is done by LM Studio.


In [None]:
%pip install mlflow scikit-learn sacrebleu rouge-score pandas numpy matplotlib openai trulens trulens-providers-openai


In [None]:
import os

# === LM Studio / OpenAI-compatible configuration ===
#
# Adjust these if your LM Studio server uses a different base URL or port.
# Common default from LM Studio docs is http://localhost:1234/v1

os.environ.setdefault("OPENAI_API_KEY", "lm-studio")  # Dummy key; LM Studio ignores it.
os.environ.setdefault("OPENAI_BASE_URL", "http://localhost:1234/v1")
# Some libraries still look for OPENAI_API_BASE; mirror the value for robustness.
os.environ.setdefault("OPENAI_API_BASE", os.environ["OPENAI_BASE_URL"])

# Choose a default model ID; you can override this later in the lab.
DEFAULT_LM_STUDIO_MODEL = os.getenv("LM_STUDIO_MODEL", "openai/gpt-oss-20b")

print("Configured LM Studio-compatible OpenAI client:")
print("  OPENAI_BASE_URL =", os.environ["OPENAI_BASE_URL"])
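# Avoid echoing a real key if OPENAI_API_KEY was already set in your environment.
print("  OPENAI_API_KEY  =", "lm-studio" if os.environ["OPENAI_API_KEY"] == "lm-studio" else "(set, hidden)")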
print("  Default model    =", DEFAULT_LM_STUDIO_MODEL)

# === MLflow tracking server configuration ===
#
# Make sure the Docker-based MLflow server is running on localhost:5000 before continuing.

import mlflow
from pathlib import Path

MLFLOW_TRACKING_URI = os.getenv("MLFLOW_TRACKING_URI", "http://localhost:5000")
mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)

print("Configured MLflow tracking URI:", mlflow.get_tracking_uri())

# Set up local artifact storage path
REPO_ROOT = Path.cwd()
LOCAL_ARTIFACT_ROOT = REPO_ROOT / "mlflow_data" / "mlartifacts"

## 1. Build a Synthetic HomePro Corpus (~500 docs)

In this section you will:

- Generate a synthetic **HomePro**-style corpus of ~500 short documents.  
- Cover multiple categories (flooring, tile, paint, lighting, cabinets, vanities, power tools, garden & outdoor).  
- Include **variation** so evaluation metrics won’t be “too perfect”: overlapping info, messy reviews, promotions, policies, etc.

We’ll represent the corpus as a `pandas.DataFrame` with columns like:

- `doc_id`: integer ID  
- `category`: product or topic category  
- `doc_type`: type of content (`product_guide`, `policy_faq`, `installation_guide`, `troubleshooting`, `promotion`, `customer_review`)  
- `text`: the actual text chunk used for retrieval / summarization
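
The column contract above can be checked mechanically once the corpus exists. The following is a small sketch rather than part of the lab's required code: `validate_corpus_records` is a hypothetical helper that works on plain dictionaries, so it does not depend on pandas.

```python
REQUIRED_COLUMNS = ("doc_id", "category", "doc_type", "text")


def validate_corpus_records(records):
    """Return a list of human-readable problems; an empty list means the records look fine."""
    problems = []
    seen_ids = set()
    for position, record in enumerate(records):
        missing = [column for column in REQUIRED_COLUMNS if column not in record]
        if missing:
            problems.append(f"row {position}: missing columns {missing}")
            continue
        if record["doc_id"] in seen_ids:
            problems.append(f"row {position}: duplicate doc_id {record['doc_id']}")
        seen_ids.add(record["doc_id"])
        if not isinstance(record["text"], str) or not record["text"].strip():
            problems.append(f"row {position}: empty text")
    return problems


# Illustrative records only; the real corpus is generated in the next cell.
sample_records = [
    {"doc_id": 0, "category": "interior paint", "doc_type": "policy_faq", "text": "Return policy text."},
    {"doc_id": 1, "category": "LED lighting", "doc_type": "promotion", "text": "Promotion text."},
]
print(validate_corpus_records(sample_records))
```

With the DataFrame in hand, the same check can be run as `validate_corpus_records(corpus_df.to_dict("records"))`.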


In [None]:
import random
import pandas as pd

random.seed(42)

categories = [
    "laminate flooring",
    "vinyl plank flooring",
    "ceramic tile",
    "interior paint",
    "exterior paint",
    "LED lighting",
    "kitchen cabinets",
    "bathroom vanities",
    "power tools",
    "garden & outdoor",
]

doc_types = [
    "product_guide",
    "policy_faq",
    "installation_guide",
    "troubleshooting",
    "promotion",
    "customer_review",
]

warranty_terms = ["1-year", "2-year", "3-year", "5-year", "10-year"]
return_windows = ["30 days", "60 days", "90 days"]
finish_levels = ["matte", "eggshell", "satin", "semi-gloss"]
traffic_levels = ["light", "medium", "heavy"]

review_sentiments = ["positive", "negative", "neutral"]
review_phrases_positive = [
    "loved how easy the install was",
    "customer service was fantastic",
    "quality felt better than expected",
    "installer arrived on time and cleaned up thoroughly",
    "the color matched the sample perfectly",
]
review_phrases_negative = [
    "installation was delayed and communication was poor",
    "finish scratched more easily than expected",
    "return process felt confusing and slow",
    "instructions were unclear and missing steps",
    "delivery arrived late and boxes were damaged",
]
review_phrases_neutral = [
    "product quality was fine, nothing special",
    "store was crowded but staff eventually helped",
    "color was close enough to what we expected",
    "installer did the job, but scheduling took a few calls",
    "overall experience was acceptable but not memorable",
]

docs = []
doc_id = 0

# Target ~500 docs: 10 categories * ~50 docs each = 500
docs_per_category = 50

for category in categories:
    for i in range(docs_per_category):
        d_type = random.choice(doc_types)

        # Random knobs to create variation
        warranty = random.choice(warranty_terms)
        return_window = random.choice(return_windows)
        finish = random.choice(finish_levels)
        traffic = random.choice(traffic_levels)

        if d_type == "product_guide":
            text = (
                f"HomePro offers several {category} options designed for {traffic} traffic areas. "
                f"Most {category} products come with a {warranty} limited residential warranty when "
                f"installed according to the manufacturer guidelines. Customers should check the "
                f"subfloor, moisture levels, and acclimation requirements before starting. For finishes, "
                f"many customers prefer a {finish} finish to balance durability and appearance. "
                f"HomePro stores typically keep popular colors and textures in stock but special orders "
                f"may take 7–14 days depending on the vendor."
            )
        elif d_type == "policy_faq":
            text = (
                f"HomePro's return policy for {category} is designed to be flexible while protecting product quality. "
                f"Most unopened {category} can be returned within {return_window} with proof of purchase. "
                f"Cut-to-length or custom-tinted products may not be returnable. For installed {category}, "
                f"HomePro recommends contacting the installation support team before removal to avoid damage and safety issues. "
                f"Refunds are generally issued to the original payment method; store credit is available for certain promotions."
            )
        elif d_type == "installation_guide":
            text = (
                f"When installing {category}, HomePro recommends dry-fitting a small area before committing to full adhesive coverage. "
                f"For {traffic}-traffic areas, customers should use premium underlayment and follow the expansion-gap guidelines "
                f"listed on the packaging. Surfaces must be clean, flat, and structurally sound. "
                f"For {finish} finishes, customers should avoid harsh cleaners for the first 7 days after installation. "
                f"HomePro provides printed installation guides in-store and video tutorials in the online learning center."
            )
        elif d_type == "troubleshooting":
            text = (
                f"Common issues with {category} purchased at HomePro include minor color variation between batches and "
                f"surface noise in {traffic}-traffic hallways. Color variation can often be minimized by mixing planks or tiles "
                f"from multiple boxes during installation. For squeaks or hollow sounds, customers should verify subfloor prep "
                f"and ensure that recommended underlayment was used. If problems persist, HomePro's installation support line "
                f"can review photos and suggest remedies or warranty options."
            )
        elif d_type == "promotion":
            text = (
                f"During the seasonal HomePro Savings Event, select {category} products may qualify for bundled discounts. "
                f"Typical offers include percentage discounts when purchasing a minimum square footage, or free delivery "
                f"on qualifying orders. Promotions on {category} sometimes combine with installation offers, such as "
                f"discounted labor on weekday installs. Customers should review fine print, as clearance items and certain "
                f"premium collections might be excluded."
            )
        else:  # customer_review
            sentiment = random.choice(review_sentiments)
            if sentiment == "positive":
                phrase = random.choice(review_phrases_positive)
            elif sentiment == "negative":
                phrase = random.choice(review_phrases_negative)
            else:
                phrase = random.choice(review_phrases_neutral)

            text = (
                f"I recently purchased {category} from HomePro. Overall, I would describe my experience as {sentiment}. "
                f"I {phrase}. The pricing felt {'fair' if sentiment != 'negative' else 'higher than I expected'}, "
                f"and I {'would' if sentiment == 'positive' else 'might'} shop here again for future projects. "
                f"The store team {'answered most of my questions' if sentiment != 'negative' else 'seemed rushed and hard to flag down'}. "
                f"Installation scheduling was {'smooth' if sentiment == 'positive' else 'a little bumpy but acceptable'}."
            )

        docs.append(
            {
                "doc_id": doc_id,
                "category": category,
                "doc_type": d_type,
                "text": text,
            }
        )
        doc_id += 1

corpus_df = pd.DataFrame(docs)
len_corpus = len(corpus_df)
print(f"Corpus size: {len_corpus} documents")
display(corpus_df.head())
display(corpus_df["doc_type"].value_counts())
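
Because each template has only a few random knobs, many generated documents will be exact duplicates. The promotion template, for example, varies only by category, so it can yield at most 10 distinct texts. The sketch below is one way to quantify this per `doc_type`. `duplicate_text_report` is a hypothetical helper and the sample records are illustrative only.

```python
from collections import Counter, defaultdict


def duplicate_text_report(records):
    """Map each doc_type to a (total documents, distinct texts) pair."""
    totals = Counter()
    distinct_texts = defaultdict(set)
    for record in records:
        totals[record["doc_type"]] += 1
        distinct_texts[record["doc_type"]].add(record["text"])
    return {doc_type: (totals[doc_type], len(distinct_texts[doc_type])) for doc_type in totals}


# Illustrative records only; pass corpus_df.to_dict("records") for the real corpus.
sample_records = [
    {"doc_type": "promotion", "text": "Savings Event on ceramic tile."},
    {"doc_type": "promotion", "text": "Savings Event on ceramic tile."},
    {"doc_type": "promotion", "text": "Savings Event on power tools."},
    {"doc_type": "customer_review", "text": "I recently purchased power tools."},
]
report = duplicate_text_report(sample_records)
for doc_type, (total, distinct) in report.items():
    print(f"{doc_type}: {total} docs, {distinct} distinct texts")
```

A large gap between the two numbers means retrieval will often return several identical chunks, which is worth keeping in mind when interpreting context-relevance scores later.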

## 2. Build an Evaluation Dataset

We’ll construct a labeled evaluation set for one of the three candidate tasks, the FAQ answerer:

1. **FAQ Answerer (RAG)**  
   - Input: customer question (e.g., *“What is the return policy for laminate flooring?”*).  
   - Reference: short “gold” answer derived from the corpus.  
   - Metrics: BLEU, ROUGE, token-level F1, plus TruLens groundedness/relevance.

To keep the lab focused on evaluation (not data engineering), we’ll generate the evaluation rows **programmatically** from the corpus templates. You can extend or edit this to better match your goals.
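
Token-level F1, listed among the metrics above, is commonly computed as the harmonic mean of token precision and recall over the bag-of-words overlap between prediction and reference. The sketch below shows one such formulation. The lowercase alphanumeric tokenizer is an assumption, and the lab's own harness may normalize text differently.

```python
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase the text and keep runs of letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall, counting repeated tokens."""
    pred_tokens = tokenize(prediction)
    ref_tokens = tokenize(reference)
    if not pred_tokens or not ref_tokens:
        # Two empty strings agree perfectly; one empty string shares nothing.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = token_f1("Return within 30 days", "Return within 90 days")
print(f"token F1 = {score:.2f}")
```

In the example, three of the four tokens match on each side, so precision and recall are both 0.75. A single wrong number (30 vs 90) costs only a quarter of the score, which is one reason lexical metrics are paired with groundedness checks in this lab.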


In [None]:
import numpy as np

eval_rows = []

# --- FAQ-style Q&A examples ---
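faq_count = 80
faq_pool = corpus_df[corpus_df["doc_type"].isin(["policy_faq", "product_guide"])]
# Cap the sample at the size of the filtered pool, not the full corpus,
# so sampling cannot request more rows than are available.
faq_candidates = faq_pool.sample(n=min(faq_count, len(faq_pool)), random_state=123)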

for idx, row in faq_candidates.iterrows():
    category = row["category"]
    base_text = row["text"]

    # Simple templated questions derived from category
    question_templates = [
        f"What is the return policy for {category} at HomePro?",
        f"How long is the typical warranty on {category} from HomePro?",
        f"What should I know before installing {category}?",
    ]
    question = random.choice(question_templates)

    # Reference answer: one generic summary reused for every question type. It paraphrases
    # the templates but does not quote the source doc's specific warranty or return window.
    ref_answer = (
        f"For {category}, HomePro usually offers a limited warranty and a time-bound return window. "
        f"Unopened items can often be returned within a specific number of days with a receipt, while "
        f"custom or cut-to-length products may not be returnable. Installation should follow the "
        f"product guidelines around subfloor prep, moisture checks, and expansion gaps."
    )

    eval_rows.append(
        {
            "task_type": "faq",
            "category": category,
            "source_doc_id": row["doc_id"],
            "input_text": question,
            "reference_answer": ref_answer,
            "sentiment_label": None,
        }
    )

eval_df = pd.DataFrame(eval_rows)
print(f"Evaluation set size: {len(eval_df)} examples")
display(eval_df.head())
display(eval_df["task_type"].value_counts())
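
One possible extension is to tie each question and reference answer to facts actually present in the source document, so that a wrong return window or warranty term lowers the score. The sketch below assumes the wording of the `policy_faq` and `product_guide` templates from Section 1. `extract_facts` and `build_qa_pair` are hypothetical helpers, not part of the lab's required code.

```python
import re

# Patterns mirror the phrasing used in the Section 1 templates.
RETURN_WINDOW_RE = re.compile(r"returned within (\d+ days)")
WARRANTY_RE = re.compile(r"a (\d+-year) limited residential warranty")


def extract_facts(text: str) -> dict:
    """Pull the return window and warranty term out of a corpus document, if present."""
    window = RETURN_WINDOW_RE.search(text)
    warranty = WARRANTY_RE.search(text)
    return {
        "return_window": window.group(1) if window else None,
        "warranty": warranty.group(1) if warranty else None,
    }


def build_qa_pair(category: str, text: str):
    """Return (question, reference_answer) grounded in the document, or None if no fact is found."""
    facts = extract_facts(text)
    if facts["return_window"]:
        question = f"What is the return policy for {category} at HomePro?"
        answer = f"Most unopened {category} can be returned within {facts['return_window']} with proof of purchase."
        return question, answer
    if facts["warranty"]:
        question = f"How long is the typical warranty on {category} from HomePro?"
        answer = f"Most {category} products come with a {facts['warranty']} limited residential warranty."
        return question, answer
    return None


policy_text = "Most unopened ceramic tile can be returned within 60 days with proof of purchase."
pair = build_qa_pair("ceramic tile", policy_text)
print(pair)
```

Rows built this way keep the question type aligned with the document type, which the random template choice in the cell above does not guarantee.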
