# 03 — Tiny Evaluation
Run a small test set (10–20 questions) to sanity-check correctness and refusals.

## Load store & model with auto-CUDA 

In [1]:
import os

# Ask Hugging Face Transformers to skip its TensorFlow and Flax backends so TensorFlow (TF) is never imported.
os.environ["TRANSFORMERS_NO_TF"] = "1"
os.environ["TRANSFORMERS_NO_FLAX"] = "1"

# Quiet TF's C++ logs if something still pulls it in.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"  # 0=show all, 1=hide INFO, 2=also hide WARNING, 3=also hide ERROR
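Environment variables like these are typically read when a library is first imported, so they only help if this cell runs before anything imports the frameworks in question. A minimal sketch of a sanity check, assuming nothing beyond the standard library (`loaded_frameworks` is a hypothetical helper, not part of the notebook):

```python
import sys

def loaded_frameworks(names=("tensorflow", "flax", "jax")):
    """Return the subset of `names` that are already imported in this process."""
    # sys.modules holds every module imported so far, keyed by module name.
    return [name for name in names if name in sys.modules]

# An empty list means none of the listed frameworks has been imported yet.
print("Already imported:", loaded_frameworks() or "none")
```

If the list is non-empty right after this cell, restarting the kernel and running the cell first is the usual remedy.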

In [2]:
# Imports
import os, sys, json
from pathlib import Path

# If the notebook is inside .../SlideHunt/notebooks, step up to the repo root
repo_root = Path.cwd().parent if Path.cwd().name.lower() == "notebooks" else Path.cwd()
sys.path.insert(0, str(repo_root))  # make repo root importable

from nb01_setup_and_ingest import search


Foundations '25 Data Science
Foundations Course
IF '25 Data Science Cohort A
IF '25 NY Career Readiness and Success
  Module_id: 1118
  Module: Fellow Resources
 - Item: Fellow Success Resources (Page)
  Module_id: 1239
  Module: Phase 2 (6/9-8/29)
 - Item: Homework: Option 1 - Weekly Job Applications & Progress Report (Due August 30) (Assignment)
 - Item: P2W1 (6/12) NO CAREER CLASS - TECHNICAL CLASS (SubHeader)
 - Item: P2W2 (6/16) Bloomberg Ideathon (SubHeader)
 - Item: Homework (SubHeader)
 - Item: Homework: Watch Hackathon Video (Assignment)
 - Item: Homework: Upwardly Global Learning Paths: Tech Market/Resume/Cover Letter (Assignment)
 - Item: Homework: Draft Resume (Assignment)
 - Item: P2W2 NO CLASS MEETING 6/19 Juneteenth TKH Closed (SubHeader)
 - Item: P2W3 (6/26) Bloomberg Hackathon (SubHeader)
 - Item: Homework (SubHeader)
 - Item: Homework: Hackathon Activity Log + Judges' Feedback (Assignment)
 - Item: P2W4 (7/3) Resume + Digital Footprint (SubHeader)
 - Item: In Class A

FAISS ntotal: 324

Q: Where did we define precision vs. recall?   [scope=technical]
  0.382 :: IF '25 Data Science Cohort A > P2W3 (6/23-6/27) Classification Algorithms > 💻 W3D2 (6/24) Logistic Regression Accuracy Metrics (Page)  [https://tkh.instructure.com/courses/172/pages/w3d2-6-slash-24-logistic-regression-accuracy-metrics]
  0.306 :: Foundations '25 Data Science > Week 5:  Statistics(Feb. 24th- Feb. 27th) > What is Data Science? (Page)  [https://tkh.instructure.com/courses/165/pages/what-is-data-science]
  0.276 :: IF '25 Data Science Cohort A > P2W11 (8/18-8/22) Agents & End of Phase Project > 💻 W11D1 (8/18) Applied LLM Review & AI Agents (Page)  [https://tkh.instructure.com/courses/172/pages/w11d1-8-slash-18-applied-llm-review-and-ai-agents]
  0.263 :: IF '25 Data Science Cohort A > P2W9 (8/4-8/8) NLP Foundations & Transformers > 📚 P2W9 Overview & Lesson Plan (Page)  [https://tkh.instructure.com/courses/172/pages/p2w9-overview-and-lesson-plan]

Q: tips for a resume and cover le

In [3]:
test_prompts = [
    "When does phase 2 begin?",
    "Any way of saying June 9th?",
    "Where can I find my instructor's email?",
    "Under Course Team Contact Information?",
    "What was the last TLAB about?",
    "An explanation of making a music recommender.",
    "What lecture slides do we review pivot tables?",
    "What lecture slides did we learn about control flow?",
    "Can you give me a bullet point list of the most important concepts covered about SQL?",
    "I'd like to know when pahse 2 commences, so I can prepare, thanks!",
    "Give me a summary of P2W2's material.",
    "Where did we define precision vs. recall?",
    "Explain the linear regression formula.",
    "What are the steps for PCA?",
    "What's the difference between an L1 and L2 penalty?",
    "Where can I find information on?",
    "So I want to know more about XGboost and its trade-offs with AdaBoost.",
    "Which slides describe the bias-variance trade-off (summary + cite both pages)?",
    "pls where did u all show that 'log loss' thing ??? and that \"slide w/ the blue curve comparing roc stuff—where?\"",
    "Can I see other students grades??"
]

## 🧪 Evaluate a list of prompts (latency + scope + citation)

In [None]:
# Point this ONCE to the repo root (hardcode or env var)
BASE = Path(os.getenv("SLIDEHUNT", r"C:\place path\to the repo here")).resolve()

# Imports for timing and tabular results
import time
import pandas as pd


# Evaluate a list of test queries against the search() retriever (auto-routing, top-1 citation)
def evaluate_prompts(prompts, k=4, scope="auto", save_csv=True):
    """Tests how well a list of questions can find answers in your data.

    This function takes a list of questions (prompts) and tries to find
    the best answers for each using the `search` tool. For every question,
    it records how long it took to find answers, the score of the best answer,
    which topic (like "technical" or "career") was searched, and details
    about where the best answer came from.

    After checking all the questions, it shows a summary table and can also
    save all the results into a spreadsheet file (CSV) on your computer.

    Args:
        prompts (List[str]): A list of questions you want to test.
        k (int, optional): How many top answers to try and find for each question.
                           Defaults to 4.
        scope (str, optional): Where to look for answers. You can say "auto"
                               (to let it guess the best topic), "technical",
                               "career", or "all". Defaults to "auto".
        save_csv (bool, optional): If `True`, the results will be saved to a
                                   CSV file. If `False`, they won't.
                                   Defaults to `True`.

    Returns:
        pd.DataFrame: A table (DataFrame) with all the test results, including
                      the question, search scope, time taken, top answer score,
                      topic of the top answer, and its source.
    """
    rows = []
    for q in prompts:
        t0 = time.perf_counter()                 # start the timer (monotonic, suited to short intervals)
        sc, hits = search(q, k=k, scope=scope)   # call search(); with scope="auto" the router picks the scope
        dt = time.perf_counter() - t0            # elapsed time in seconds

        if hits:                            # if search returned results
            h0 = hits[0]                    # take the top result
            m = h0["meta"]                  # metadata for the top result
            # Build a human-readable citation string
            cite = f"{m['course_name']} › {m['module_name']} › {m['item_title']} ({m['type']})"
            if m.get('url'): cite += f" [{m['url']}]"
            rows.append({
                "query": q,
                "scope": sc,                # scope actually searched (technical, career, or all)
                "latency_s": round(dt, 3),  # search time in seconds
                "top_score": round(h0["score"], 3),  # similarity score of the top hit
                "top_domain": m.get("domain"),       # domain label from metadata
                "citation": cite                     # readable reference for the top hit
            })
        else:  # no hits found
            rows.append({
                "query": q, "scope": sc, "latency_s": round(dt, 3),
                "top_score": None, "top_domain": None, "citation": "(no hits)"
            })
    # Convert results to a DataFrame for easy display and saving
    df = pd.DataFrame(rows)

    # pretty print the DF
    with pd.option_context("display.max_colwidth", 80, "display.width", 120):
        print(df.to_string(index=False))

    if save_csv:
        # save results as a CSV inside a folder
        outdir = os.path.join(BASE, "outputs")
        os.makedirs(outdir, exist_ok=True)
        path = os.path.join(outdir, "eval_prompts.csv")
        df.to_csv(path, index=False)
        #print("\n saved:", path)

    # quick summary
    coverage = (df["citation"] != "(no hits)").mean()
    avg_score = df["top_score"].dropna().mean() if df["top_score"].notna().any() else None
    print(f"\nCoverage: {coverage:.0%}   Avg top_score: {None if avg_score is None else round(avg_score,3)}")
    print("By domain (top-1):")
    print(df["top_domain"].value_counts(dropna=True))

    return df

# Run the evaluation on test_prompts
_ = evaluate_prompts(test_prompts, k=4, scope="auto", save_csv=True)


                                                                                                           query     scope  latency_s  top_score top_domain                                                                                                                                                                                                                                   citation
                                                                                        When does phase 2 begin?       all      3.950      0.476  technical                                                                                      IF '25 Data Science Cohort A › P2W12 (8/25 - 8/29) End of Phase Project Week › [TEPP] Phase 2 Portfolio Project Checkpoint #2 (Due 8/27) (Assignment)
                                                                                     Any way of saying June 9th? technical      2.559      0.230  technical                                                               
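
The table above records scores and latencies but does not itself say which prompts should be treated as "no good answer". A minimal sketch of one way to flag those rows, assuming a cutoff on `top_score`: the `min_score=0.30` value and the `flag_low_confidence` helper are hypothetical and would need tuning against real results. The first two demo rows reuse the latency and score values shown above; the third is an invented no-hit row.

```python
import pandas as pd

def flag_low_confidence(df: pd.DataFrame, min_score: float = 0.30) -> pd.DataFrame:
    """Return a copy of df with a boolean 'low_confidence' column.

    A row is flagged when it has no top_score (no hits) or when the
    score falls below min_score (an assumed, untuned threshold).
    """
    out = df.copy()
    scores = pd.to_numeric(out["top_score"], errors="coerce")
    out["low_confidence"] = scores.isna() | (scores < min_score)
    return out

# Small demo frame with the same column names as the evaluation output.
demo = pd.DataFrame({
    "query": ["When does phase 2 begin?", "Any way of saying June 9th?", "(hypothetical no-hit prompt)"],
    "latency_s": [3.950, 2.559, 0.500],
    "top_score": [0.476, 0.230, None],
})

flagged = flag_low_confidence(demo)
print(flagged[["query", "top_score", "low_confidence"]].to_string(index=False))

# Median and 95th-percentile latency give a quick view of typical and slow queries.
print(flagged["latency_s"].quantile([0.5, 0.95]))
```

Flagged rows are candidates for a refusal or a clarifying question rather than a confident citation; whether a fixed cutoff is adequate depends on how the similarity scores are distributed for this index.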