# Creating Practice Exercises

The goal of this notebook is to extend the 7-day personalized plan by generating practice tasks (quiz questions, coding exercises, and reflection prompts) that align with the student’s focus areas.

**Inputs**:
- A generated 7-day plan (from Notebook 03)
- Contextual chunks (slides + labs) for reference

**What this notebook does**:
1. Reads a study plan created from the student's feedback.
2. For each daily task, calls the LLM to generate:
   - Quiz questions (multiple choice or short answer)
   - Coding exercises (if applicable)
   - Reflection prompts
3. Stores practice exercises in a structured format (CSV/Parquet).
4. Prepares exercises to be integrated with progress tracking (Notebook 04).

**Outputs**:
- `practice_exercises.csv` in `data/processed/`
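- Each row will contain:
  - `day`: Day 1 through Day 7
  - `task_order`: position of the task within its day
  - `task_title`: task description from the plan
  - `practice_type`: quiz, coding, or reflection (`raw` when the model reply could not be parsed)
  - `question`, `choices`, `answer`, `citation`: filled in for quiz rows
  - `content`: generated instructions for coding and reflection rows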


In [66]:
# Libraries

from dotenv import load_dotenv
from pathlib import Path
import sys
from openai import OpenAI
import os
import pandas as pd 
import json
import re


In [67]:
# Paths
PROJECT_ROOT = Path("..").resolve()
sys.path.append(str(PROJECT_ROOT))

PLAN_JSON = Path("../data/processed/example_plan.json")   # from Notebook 3
OUT_STRUCT  = Path("../data/processed/plan_structured.json")
OUT_CSV = Path("../data/processed/practice_exercises.csv")

# Check for paths 
print("PLAN_JSON path:", PLAN_JSON, "| Exists?", PLAN_JSON.exists())
print("OUT_CSV path: ", OUT_CSV)

# Make sure the output folder exists (create if missing)
OUT_CSV.parent.mkdir(parents=True, exist_ok=True)
print("Plan exists?", PLAN_JSON.exists())
print("Output folder ready:", OUT_CSV.parent.exists())


PLAN_JSON path: ..\data\processed\example_plan.json | Exists? True
OUT_CSV path:  ..\data\processed\practice_exercises.csv
Plan exists? True
Output folder ready: True


In [68]:
plan_path = Path("../data/processed/example_plan.json")
raw = json.loads(plan_path.read_text(encoding="utf-8"))

def looks_structured(obj):
    """Return True if obj has a top-level days list with task dicts."""
    if not isinstance(obj, dict): 
        return False
    days = obj.get("days")
    if not isinstance(days, list) or not days:
        return False
    first = days[0]
    return isinstance(first, dict) and "day" in first and isinstance(first.get("tasks"), list)

print("Top-level keys:", list(raw.keys()))
print("Structured?", looks_structured(raw))
if looks_structured(raw):
    # peek
    print("Day 1 sample:", raw["days"][0])


Top-level keys: ['plan_text']
Structured? False


In [69]:
def parse_markdown_plan(plan_text: str):
    """
    Heuristic parser for 'plan_text' shaped like:
    #### Day 1 — ...
    - **Task 1 (est. 15–25 min)** — description [CITATION: 1]
    Returns: {"days":[{"day":1,"tasks":[{"title":...,"est_mins":..., "citation_url":...}, ...]}, ...]}
    Note: "citation_url" holds the bracketed citation tag (e.g. "[1]"), not a URL.
    """
    days = []
    current_day = None

    # Process the plan line by line
    for line in plan_text.splitlines():
        line = line.strip()
        if not line:
            continue

        # Detect day header lines like "#### Day 3 — Review"
        m_day = re.match(r"^(?:#+\s*)?Day\s*(\d+)", line, flags=re.IGNORECASE)
        if m_day:
            # Start new day bucket
            day_num = int(m_day.group(1))
            current_day = {"day": day_num, "tasks": []}
            days.append(current_day)
            continue

        # Detect bullet task lines that have an estimate and optional citation
        # Examples:
        # - **Task 1 (est. 15–25 min)** — Review SQL joins. [CITATION: 3]
        # - Task: Practice inner vs left (est. 20 min) [CITATION: 1]
        if line.startswith(("-", "*", ".")):  # line begins with a bullet-like marker
            # grab citation like [CITATION: 3] or [3]
            cit = ""
            m_cit = re.search(r"\[(?:CITATION:\s*)?(\d+)\]", line, flags=re.IGNORECASE)
            if m_cit:
                cit = f"[{m_cit.group(1)}]"

            # grab est minutes: the integer directly before "min"
            # (for a range like "15–25 min" this is the upper bound, 25)
            m_mins = re.search(r"\b(\d+)\s*min", line, flags=re.IGNORECASE)
            est_mins = int(m_mins.group(1)) if m_mins else 20  # default 20 if missing

            # strip the leading bullet marker and bold markers
            text = re.sub(r"^[-*.]+\s*", "", line)
            text = re.sub(r"\*\*|—", " ", text)  # remove bold markers/em-dash
            text = re.sub(r"\[(?:CITATION:\s*)?\d+\]", "", text).strip()  # remove citation tag

            title = text
            # optional: shorten very long titles
            if len(title) > 180:
                title = title[:177] + "..."

            # Task lines that appear before any day header are ignored
            if current_day is not None:
                current_day["tasks"].append({
                    "title": title,
                    "est_mins": est_mins,
                    "citation_url": cit
                })

    # Sort by day number just in case
    days = sorted(days, key=lambda d: d["day"])
    return {"days": days}

# If the JSON is not structured, try converting it:
if not looks_structured(raw):
    if "plan_text" in raw:
        plan_struct = parse_markdown_plan(raw["plan_text"])
        print("Converted to structured format with", len(plan_struct["days"]), "days")
    else:
        raise ValueError("JSON has neither 'days' nor 'plan_text'.")
else:
    plan_struct = raw

# Save a clean structured copy (optional but recommended)
OUT_STRUCT.write_text(json.dumps(plan_struct, indent=2), encoding="utf-8")
print("Wrote", OUT_STRUCT)
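The parser in the previous cell relies on three regular expressions. The sketch below applies the same patterns to single lines so their behavior can be checked in isolation. `probe_line` and the sample strings are hypothetical, added here only for illustration.

```python
import re

# The same three patterns used by parse_markdown_plan
DAY_RE = re.compile(r"^(?:#+\s*)?Day\s*(\d+)", flags=re.IGNORECASE)
MINS_RE = re.compile(r"\b(\d+)\s*min", flags=re.IGNORECASE)
CIT_RE = re.compile(r"\[(?:CITATION:\s*)?(\d+)\]", flags=re.IGNORECASE)

def probe_line(line: str) -> dict:
    """Report what each pattern extracts from one line of plan text."""
    m_day = DAY_RE.match(line)
    m_mins = MINS_RE.search(line)
    m_cit = CIT_RE.search(line)
    return {
        "day": int(m_day.group(1)) if m_day else None,
        "est_mins": int(m_mins.group(1)) if m_mins else None,
        "citation": f"[{m_cit.group(1)}]" if m_cit else "",
    }

header = "#### Day 3"
task = "- Task 1 (est. 15–25 min) Review SQL joins. [CITATION: 3]"

print(probe_line(header))
print(probe_line(task))
```

For a range such as `15–25 min`, the minutes pattern captures `25`, the number directly before `min`, so estimates taken from ranges lean toward the upper bound.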

## Normalize the plan to accept structured JSON (`{"plan": {"days": [...]}}`) and raw text (`{"plan_text": "..."}`)

In [41]:
# Load the plan file saved in Notebook 3 (PLAN_JSON is defined in the Paths cell)
with open(PLAN_JSON, "r", encoding="utf-8") as f:
    raw_plan = json.load(f)

# print("Loaded keys:", raw_plan.keys())
def normalize_plan(obj):
    """
    Turn whatever we loaded from JSON into a consistent shape:

    Returns:
      {
        "plan": {
          "days": [
            {"day": int, "tasks": [
                {"title": str, "est_mins": int, "citation_url": str},
                ...
            ]},
            ...
          ]
        }
      }

    Works for:
    - Structured JSON: {"plan": {"days": [...]}}
    - Raw text: {"plan_text": "..."}  (parse Day 1/Day 2 sections naively)
    """

    # CASE 1: Try to use structured JSON directly

    try:
        # Try to access obj["plan"]["days"]
        days_value = obj["plan"]["days"]

        # If it's already a list, assume it's fine and return it unchanged
        if isinstance(days_value, list):
            return obj
    except (KeyError, TypeError):
        # obj has no "plan" key, no "days" key, or is not a dict: go to Case 2
        pass

    # CASE 2: Fallback to raw text
  
    # Get the raw text string (or empty string if missing)
    text = str(obj.get("plan_text", "")).strip()

    # If there's no text, return a tiny default plan so the rest of the code won’t break
    if not text:
        return {
            "plan": {
                "days": [
                    {"day": 1, "tasks": [
                        {"title": "Fallback: review your notes", "est_mins": 20, "citation_url": ""}
                    ]}
                ]
            }
        }

    # Split the text on "Day {number}" to find the day sections.
    # For example, this input:
    # "Day 1 - Understand\n- Read slide 10...\nDay 2 - Apply\n- Try lab task..."
    parts = re.split(r"\bDay\s+(\d+)\b", text)
    # yields parts like: ["prefix", "1", "text for day 1", "2", "text for day 2", ...]

    days_list = []

    # Loop through pairs: (day number, day text)
    for i in range(1, len(parts), 2): 
        day_num_str = parts[i]
        day_text    = parts[i + 1] if i + 1 < len(parts) else ""

        # Convert day number to int safely
        try:
            day_num = int(day_num_str)
        except Exception:
            # If it’s not a valid integer, skip this chunk
            continue

        # Split day_text into lines and keep the non-empty ones
        raw_lines = [ln.strip() for ln in day_text.splitlines() if ln.strip()]

        # Turn lines into up to 4 simple tasks
        tasks = []
        for ln in raw_lines:
            if len(tasks) >= 4: # limit to 4 tasks per day for simplicity
                break

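            # Skip obvious headers in the plan text.
            # Note: the check runs on the raw line, so a header remainder that
            # still carries a leading "- " (e.g. "- Understand") is not skipped.
            lower_ln = ln.lower()
            if lower_ln.startswith(("day", "understand", "apply", "checkpoint")):
                continue

            # Make a simple task object
            tasks.append({
                "title": ln[:140],        # keep task title short
                "est_mins": 20,           # default time estimate
                "citation_url": ""        # no citation available in raw text
            })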

        # Only add the day if we found at least one task line
        if tasks:
            days_list.append({"day": day_num, "tasks": tasks})

    # If parsing failed to produce any day, return a minimal default
    if not days_list:
        days_list = [
            {"day": 1, "tasks": [
                {"title": "Fallback: review your notes", "est_mins": 20, "citation_url": ""}
            ]}
        ]

    # Final format
    return {"plan": {"days": days_list}}

In [31]:
# Testing the normalize_plan function
# Example 1: Structured input
structured_example = {
    "plan": {
        "days": [
            {"day": 1, "tasks": [{"title":"Read slides", "est_mins":15, "citation_url": ""}]}
        ]
    }
}
print(normalize_plan(structured_example))

# Example 2: Raw text
raw_text_example = {"plan_text": "Day 1 - Understand\n- Read slide 10 about Joins\nDay 2 - Apply\n- Do Lab SQL-1"}
print(normalize_plan(raw_text_example))


{'plan': {'days': [{'day': 1, 'tasks': [{'title': 'Read slides', 'est_mins': 15, 'citation_url': ''}]}]}}
{'plan': {'days': [{'day': 1, 'tasks': [{'title': '- Understand', 'est_mins': 20, 'citation_url': ''}, {'title': '- Read slide 10 about Joins', 'est_mins': 20, 'citation_url': ''}]}, {'day': 2, 'tasks': [{'title': '- Apply', 'est_mins': 20, 'citation_url': ''}, {'title': '- Do Lab SQL-1', 'est_mins': 20, 'citation_url': ''}]}]}}
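In the raw-text output above, `- Understand` and `- Apply` come through as tasks because the header check looks at the line before its leading `- ` is removed. One possible adjustment, sketched here under the assumption that those remainders should be dropped, is to strip bullet markers before comparing. `is_header_line` is a hypothetical helper and not part of `normalize_plan`.

```python
import re

HEADER_WORDS = ("day", "understand", "apply", "checkpoint")

def is_header_line(line: str) -> bool:
    """True if the line, ignoring leading bullets and spaces, starts with a header word."""
    stripped = re.sub(r"^[\s\-*.]+", "", line).lower()
    return stripped.startswith(HEADER_WORDS)

lines = ["- Understand", "- Read slide 10 about Joins", "- Apply", "- Do Lab SQL-1"]
tasks = [ln for ln in lines if not is_header_line(ln)]
print(tasks)
```

The trade-off is the same as in the original heuristic: a genuine task that happens to begin with one of the header words (for example "Apply a LEFT JOIN") would also be filtered out.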


## Load search utilities & indexes for tiny RAG context

In [42]:
from src.utils import (
    load_chunks, load_index,
    search_slides, search_labs,
    FAISS_SLIDES_PATH, FAISS_LABS_PATH
)

# Load chunk DataFrames & FAISS indexes
slides_df, labs_df = load_chunks()
slides_index = load_index(FAISS_SLIDES_PATH)
labs_index   = load_index(FAISS_LABS_PATH)

print("Slides rows:", len(slides_df), "| index size:", slides_index.ntotal)
print("Labs rows:  ", len(labs_df),   "| index size:", labs_index.ntotal)

Slides rows: 40 | index size: 40
Labs rows:   1410 | index size: 1410


## Context per task using RAG search

In [49]:
plan_dict = json.loads(Path("../data/processed/plan_structured.json").read_text(encoding="utf-8"))

# Helper to build a small context for each task using vector search 
def short_context_for_task(task_text, k_slides=2, k_labs=2):
    """
    For a given task description (e.g. 'Review SQL joins'),
    run semantic search over slides and labs, grab the top matches,
    and build a small context string with numbered citations.
    """
    # Search both sources
    lab_hits   = search_labs(labs_index, labs_df, task_text, top_k=k_labs)
    slide_hits = search_slides(slides_index, slides_df, task_text, top_k=k_slides)

    lines = []
    ref = 1

    # Add lab matches first (practical/code)
    for _, r in lab_hits.iterrows():
        lines.append(f"[{ref}] (Lab: {r['file']}) {r['text']}")
        ref += 1

    # Add slide matches next
    for _, r in slide_hits.iterrows():
        page = r.get("page", "")
        lines.append(f"[{ref}] (Slide: {r['file']} | Page {page}) {r['text']}")
        ref += 1

    # If we found nothing, at least return a hint
    if not lines:
        return "No matching course context was found for this task."

    return "\n\n".join(lines)
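To see the shape of the context string without loading the FAISS indexes, the formatting step can be tried on hand-written hits. This is a minimal sketch: `format_context` is a hypothetical stand-in that takes plain lists of dicts rather than the DataFrames returned by `search_labs` and `search_slides`, and the file names are invented sample values.

```python
def format_context(lab_hits, slide_hits):
    """Number lab hits first, then slide hits, and join them into one context string."""
    lines = []
    ref = 1
    for hit in lab_hits:
        lines.append(f"[{ref}] (Lab: {hit['file']}) {hit['text']}")
        ref += 1
    for hit in slide_hits:
        page = hit.get("page", "")
        lines.append(f"[{ref}] (Slide: {hit['file']} | Page {page}) {hit['text']}")
        ref += 1
    if not lines:
        return "No matching course context was found for this task."
    return "\n\n".join(lines)

# Invented sample hits (assumptions, not real course files)
labs = [{"file": "lab_sql_1.ipynb", "text": "SELECT * FROM a INNER JOIN b ON a.id = b.id"}]
slides = [{"file": "week2_joins.pdf", "page": 10, "text": "INNER JOIN keeps only matching rows."}]

print(format_context(labs, slides))
```

Because the numbering runs across both sources, the `[#]` values the model is asked to copy into its answers refer to positions in this one string, not to anything stable across tasks.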


## LLM: generate practice for a single task (quiz + coding + reflection)

In [44]:
load_dotenv(override=True)
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

print("Client ready?", bool(os.getenv("OPENAI_API_KEY")))

def generate_practice_for_task(task_text, context_text, model="gpt-4o-mini"):
    """
    Ask the model to create practice aligned to a specific task.
    We request strict JSON to make saving and parsing easy.
    
    Returns the parsed Python dict, expected to have the keys quiz (list), coding (str), and reflection (str).
    If the reply is not a JSON object, returns {"raw": "..."} with the raw text.
    """
    system_msg = (
        "You are a helpful teaching assistant. Generate practice aligned to the given task. "
        "Use the provided course context only. Be concise and accurate."
    )

    user_msg = f"""
Task: {task_text}

Course context (use for grounding and citations):
{context_text}

Return a JSON object with EXACTLY these keys:
- "quiz": an array of 2 items; each item must have:
    - "question": string
    - "choices": array of 3-5 short strings (if not multiple-choice, put an empty array [])
    - "answer": the correct answer as a string (for MCQ, copy the correct choice text)
    - "citation": copy one [#] citation number from the context that supports the answer (if possible)
- "coding": a short coding exercise prompt (string) aligned to the task; add one citation in brackets if possible.
- "reflection": a brief reflection prompt (string)

Make sure the response is valid JSON. Do not include markdown code fences.
"""

    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_msg},
            {"role": "user",   "content": user_msg},
        ],
        temperature=0.3,
        max_tokens=700
    )
    text = resp.choices[0].message.content

    # Try to parse JSON safely
    try:
        data = json.loads(text)
        # Minimal shape check
        if not isinstance(data, dict):
            return {"raw": text}
        return data
    except Exception:
        # If the model returned plain text, keep it so you don't lose work
        return {"raw": text}


Client ready? True
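Even when the prompt asks for no code fences, a model may still wrap its JSON in them, and `json.loads` then fails on otherwise usable output. The sketch below shows one way the parsing step could tolerate that and also check for the three expected keys. `parse_practice_reply` is a hypothetical helper, not something `generate_practice_for_task` currently calls.

```python
import json

FENCE = "`" * 3  # a Markdown code fence marker
REQUIRED_KEYS = ("quiz", "coding", "reflection")

def parse_practice_reply(text: str) -> dict:
    """Return the parsed reply, or {"raw": text} if it is not the expected JSON object."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line (it may carry a language tag such as "json").
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        cleaned = cleaned.rstrip()
        # Drop the closing fence, if present.
        if cleaned.endswith(FENCE):
            cleaned = cleaned[: -len(FENCE)]
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        return {"raw": text}
    if not isinstance(data, dict) or any(key not in data for key in REQUIRED_KEYS):
        return {"raw": text}
    return data

fenced = FENCE + 'json\n{"quiz": [], "coding": "Write a query.", "reflection": "Why?"}\n' + FENCE
print(parse_practice_reply(fenced))
print(parse_practice_reply("Sorry, I cannot help with that."))
```

Returning `{"raw": text}` for every failure keeps the same fallback contract as the function above, so downstream code needs only one check.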


## Generate practice for the whole plan (loop days -> tasks)

In [65]:

# Iterate over the plan and build a table of practice exercises 
rows = []

for day_obj in plan_dict.get("days", []):
    day_num = int(day_obj.get("day", 0))
    tasks = day_obj.get("tasks", [])
    for t_idx, t in enumerate(tasks, start=1):
        task_title = t.get("title", "").strip()
        if not task_title:
            continue  # skip empty tasks

        # Build RAG context for this task
        context = short_context_for_task(task_title, k_slides=2, k_labs=2)

        # Ask the model to generate practice for this task
        practice = generate_practice_for_task(task_title, context)

        # If we got structured JSON
        if isinstance(practice, dict) and "quiz" in practice and "coding" in practice and "reflection" in practice:
            # Add one row per quiz item (the prompt asks for two)
            for q in practice.get("quiz", []):
                rows.append({
                    "day": day_num,
                    "task_order": t_idx,
                    "task_title": task_title,
                    "practice_type": "quiz",
                    "question": q.get("question", ""),
                    "choices": " | ".join(q.get("choices", [])),
                    "answer": q.get("answer", ""),
                    "citation": q.get("citation", ""),
                    "content": ""  # empty for quiz
                })
            # Add coding
            rows.append({
                "day": day_num,
                "task_order": t_idx,
                "task_title": task_title,
                "practice_type": "coding",
                "question": "",
                "choices": "",
                "answer": "",
                "citation": "",
                "content": practice.get("coding", "")
            })
            # Add reflection
            rows.append({
                "day": day_num,
                "task_order": t_idx,
                "task_title": task_title,
                "practice_type": "reflection",
                "question": "",
                "choices": "",
                "answer": "",
                "citation": "",
                "content": practice.get("reflection", "")
            })
        else:
            # Fallback: store raw text if JSON parsing failed
            rows.append({
                "day": day_num,
                "task_order": t_idx,
                "task_title": task_title,
                "practice_type": "raw",
                "question": "",
                "choices": "",
                "answer": "",
                "citation": "",
                "content": practice.get("raw", str(practice))
            })

# Create a DataFrame of all generated practice content
practice_df = pd.DataFrame(rows)
print("Generated rows:", len(practice_df))
practice_df.head(10)


Generated rows: 56


Unnamed: 0,day,task_order,task_title,practice_type,question,choices,answer,citation,content
0,1,1,Task 1 (est. 15–25 min) Review the definiti...,quiz,What does an INNER JOIN return?,All rows from both tables | Only matching rows...,Only matching rows from both tables,[1],
1,1,1,Task 1 (est. 15–25 min) Review the definiti...,quiz,When would you use a LEFT JOIN?,When you want all rows from the right table | ...,"When you want all rows from the left table, re...",[1],
2,1,1,Task 1 (est. 15–25 min) Review the definiti...,coding,,,,,Write a SQL query using INNER JOIN to combine ...
3,1,1,Task 1 (est. 15–25 min) Review the definiti...,reflection,,,,,Reflect on a scenario where using a RIGHT JOIN...
4,1,2,Task 2 (est. 10–20 min) Read through the SQ...,quiz,What is the SQL syntax for an INNER JOIN?,SELECT * FROM table1 INNER JOIN table2 ON tabl...,SELECT * FROM table1 INNER JOIN table2 ON tabl...,[1],
5,1,2,Task 2 (est. 10–20 min) Read through the SQ...,quiz,What is the SQL syntax for a LEFT JOIN?,SELECT * FROM table1 LEFT JOIN table2 ON table...,SELECT * FROM table1 LEFT JOIN table2 ON table...,[1],
6,1,2,Task 2 (est. 10–20 min) Read through the SQ...,coding,,,,,Write a SQL query that performs a RIGHT JOIN b...
7,1,2,Task 2 (est. 10–20 min) Read through the SQ...,reflection,,,,,Reflect on how different types of joins can af...
8,2,1,Task 1 (est. 20–30 min) Use the interactive...,quiz,What does an INNER JOIN do in SQL?,Combines rows from two or more tables based on...,Combines rows from two or more tables based on...,[2],
9,2,1,Task 1 (est. 20–30 min) Use the interactive...,quiz,Which SQL clause is used to specify the condit...,WHERE | ON | JOIN,ON,[2],


Generated rows: 56


Unnamed: 0,day,task_order,task_title,practice_type,question,choices,answer,citation,content
0,1,1,Task 1 (est. 15–25 min) Review the definiti...,quiz,What does an INNER JOIN return?,All rows from the left table | Only matching r...,Only matching rows from both tables,[1],
1,1,1,Task 1 (est. 15–25 min) Review the definiti...,quiz,When would you use a LEFT JOIN?,When you want to include all rows from the rig...,When you want to include all rows from the lef...,[1],
2,1,1,Task 1 (est. 15–25 min) Review the definiti...,coding,,,,,Write a SQL query using INNER JOIN to combine ...
3,1,1,Task 1 (est. 15–25 min) Review the definiti...,reflection,,,,,Reflect on a scenario where using a RIGHT JOIN...
4,1,2,Task 2 (est. 10–20 min) Read through the SQ...,quiz,What is the SQL syntax for an INNER JOIN?,SELECT * FROM table1 INNER JOIN table2 ON tabl...,SELECT * FROM table1 INNER JOIN table2 ON tabl...,[1],
5,1,2,Task 2 (est. 10–20 min) Read through the SQ...,quiz,What is the SQL syntax for a LEFT JOIN?,SELECT * FROM table1 LEFT JOIN table2 ON table...,SELECT * FROM table1 LEFT JOIN table2 ON table...,[1],
6,1,2,Task 2 (est. 10–20 min) Read through the SQ...,coding,,,,,Write a SQL query that performs a RIGHT JOIN b...
7,1,2,Task 2 (est. 10–20 min) Read through the SQ...,reflection,,,,,Reflect on how different types of JOINs can af...
8,2,1,Task 1 (est. 20–30 min) Use the interactive...,quiz,What does an INNER JOIN do in SQL?,Combines rows from two or more tables based on...,Combines rows from two or more tables based on...,[2],
9,2,1,Task 1 (est. 20–30 min) Use the interactive...,quiz,Which SQL clause is used to specify the condit...,WHERE | ON | JOIN,ON,[2],
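The Outputs section names `practice_exercises.csv`, but none of the cells shown here writes it. The sketch below flattens one task's practice dict into rows with the same columns as the table above and serializes them as CSV. `practice_to_rows` and the sample data are hypothetical, and the standard-library `csv` module is used only to keep the example self-contained.

```python
import csv
import io

COLUMNS = ["day", "task_order", "task_title", "practice_type",
           "question", "choices", "answer", "citation", "content"]

def practice_to_rows(day, task_order, task_title, practice):
    """Flatten one task's practice dict into a list of row dicts."""
    base = {col: "" for col in COLUMNS}
    base.update({"day": day, "task_order": task_order, "task_title": task_title})
    if not all(key in practice for key in ("quiz", "coding", "reflection")):
        # Unparsed reply: keep the raw text in a single row
        return [{**base, "practice_type": "raw",
                 "content": practice.get("raw", str(practice))}]
    rows = []
    for q in practice["quiz"]:
        rows.append({**base, "practice_type": "quiz",
                     "question": q.get("question", ""),
                     "choices": " | ".join(str(c) for c in q.get("choices", [])),
                     "answer": q.get("answer", ""),
                     "citation": q.get("citation", "")})
    rows.append({**base, "practice_type": "coding", "content": practice["coding"]})
    rows.append({**base, "practice_type": "reflection", "content": practice["reflection"]})
    return rows

# Invented sample reply (an assumption, not real model output)
sample = {
    "quiz": [{"question": "What does an INNER JOIN return?",
              "choices": ["All rows from both tables", "Only matching rows from both tables"],
              "answer": "Only matching rows from both tables",
              "citation": "[1]"}],
    "coding": "Write a SQL query using INNER JOIN.",
    "reflection": "When would a LEFT JOIN be a better fit?",
}
rows = practice_to_rows(1, 1, "Review SQL joins", sample)

# Serialize to CSV text in memory
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

In the notebook itself, `practice_df.to_csv(OUT_CSV, index=False)` would write the DataFrame to the path defined in the Paths cell; passing `index=False` avoids an extra unnamed index column in the file.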
