# Search & Study Plan Generation

The **goal of this notebook** is to connect our semantic search (using FAISS) with a large language model (LLM) to generate a personalized **7-day study plan** from student feedback.  

### What this notebook does
1. Loads pre-processed slide and lab chunks (from `data/processed/`).  
2. Loads FAISS indexes so we can quickly search the most relevant content.  
3. Searches the slides and labs for matches to a student’s feedback.  
4. Builds a formatted context block that attaches a simple **numbered citation marker** (`[1]`, `[2]`, ...) to each chunk.  
5. Sends the feedback + context to an LLM to compose a **7-day plan** with review, practice, and reflection tasks.  
6. Prints the plan in the notebook and saves it for later use.  

### Inputs
- `slides_chunks.parquet` — processed lecture slides  
- `labs_chunks.parquet` — processed Jupyter notebooks  
- `faiss_slides.index` — FAISS index for slides  
- `faiss_labs.index` — FAISS index for labs  

### Outputs
- Printed 7-day study plan in the notebook  
- Option to export/save as JSON or CSV for later analysis  


In [77]:
from dotenv import load_dotenv
from pathlib import Path
import sys
from openai import OpenAI
import json  # needed by json.dump() when the plan is saved
import os
import pandas as pd 


# Add project root to Python path
PROJECT_ROOT = Path("..").resolve()
sys.path.append(str(PROJECT_ROOT))

# Pull helper functions from src/utils.py
from src.utils import (
    load_chunks,
    load_index,
    search_labs,
    search_slides,
    FAISS_SLIDES_PATH,
    FAISS_LABS_PATH
)

load_dotenv(override=True)  # take environment variables from .env file


True

In [78]:
# Load data and FAISS indexes
slides_df, labs_df = load_chunks()

slides_index = load_index(FAISS_SLIDES_PATH)
labs_index = load_index(FAISS_LABS_PATH)

print("Slides rows:", len(slides_df), "|FAISS size:", slides_index.ntotal)
print("Labs rows:  ", len(labs_df), "|FAISS size:", labs_index.ntotal)


Slides rows: 40 |FAISS size: 40
Labs rows:   1410 |FAISS size: 1410


In [79]:
def make_context(lab_matches, slide_matches):
    """
    Combine lab and slide search results into a single text block 
    that we can pass to the language model.
    
    Args:
        lab_matches (pd.DataFrame): results from search_labs()
        slide_matches (pd.DataFrame): results from search_slides()
    
    Returns:
        str: formatted context with numbered citation markers ([1], [2], ...)
    """

    # List to hold formatted text chunks
    context_texts = []

    # Start citation numbers at 1
    ref_number = 1       

    # --- Add lab matches first ---
    for _, row in lab_matches.iterrows():
        # Build one formatted line for this lab result
        line = f"[{ref_number}] (Lab file: {row['file']}) {row['text']}"
        context_texts.append(line)  # store the line
        ref_number += 1             # move to the next citation number

    # --- Add slide matches next ---
    for _, row in slide_matches.iterrows():
        # Build one formatted line for this slide result
        line = f"[{ref_number}] (Slide file: {row['file']} | Page {row['page']}) {row['text']}"
        context_texts.append(line)  # store the line
        ref_number += 1             # move to the next citation number

    # Join everything into one string with blank lines between
    return "\n\n".join(context_texts)
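
Because `make_context` numbers labs first and slides second, the same ordering can be used to resolve a citation number in a generated plan back to its source. The following is a minimal sketch of that idea, not part of the project's `src.utils`: `build_citation_lookup` and `cited_numbers` are hypothetical helpers, the rows are assumed to be plain dicts (for example from `lab_matches.to_dict("records")`), and the slide file name and page are invented for illustration.

```python
import re


def build_citation_lookup(lab_rows, slide_rows):
    """Map citation numbers to source labels, in the order make_context uses."""
    lookup = {}
    ref_number = 1
    for row in lab_rows:  # labs come first, as in make_context
        lookup[ref_number] = f"Lab file: {row['file']}"
        ref_number += 1
    for row in slide_rows:  # slides continue the same numbering
        lookup[ref_number] = f"Slide file: {row['file']} | Page {row['page']}"
        ref_number += 1
    return lookup


def cited_numbers(plan_text):
    """Return the sorted, de-duplicated citation numbers found in plan text."""
    return sorted({int(n) for n in re.findall(r"\[CITATION:\s*(\d+)\]", plan_text)})


# Hypothetical rows, shaped like lab_matches.to_dict("records")
lab_rows = [{"file": "sql_refresher.ipynb"}, {"file": "w9-class2.ipynb"}]
slide_rows = [{"file": "week9.pdf", "page": 12}]  # invented file name and page

lookup = build_citation_lookup(lab_rows, slide_rows)
sample_plan = "- Review joins [CITATION: 1]\n- Practice [CITATION: 3]\n- Recap [CITATION: 1]"
for n in cited_numbers(sample_plan):
    print(n, "->", lookup.get(n, "unknown source"))
```

Using `lookup.get` with a fallback makes it easy to spot citation numbers that do not correspond to any retrieved chunk.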


In [80]:
# Example student feedback (swap this for any test case)
feedback_text = "I lost points on SQL joins and I keep mixing up inner vs left vs right joins."
lab_matches = search_labs(labs_index, labs_df, feedback_text, top_k=5)
slide_matches = search_slides(slides_index, slides_df, feedback_text, top_k=4)

context = make_context(lab_matches, slide_matches)
print(context[:800])  # show a preview of the context


[1] (Lab file: sql_refresher.ipynb) ### SQL JOINs (Review)
JOINs are used to combine data from two or more tables based on a related column.
- `INNER JOIN`: returns only matching rows
- `LEFT JOIN`: returns all rows from the left table, even if there are no matches in the right table
Try both join types below!

[2] (Lab file: sql_refresher.ipynb) #### SQL JOINs
JOINs combine rows from two or more tables based on a related column. Try changing the join type or columns.

[3] (Lab file: w9-class2.ipynb) ### Joins

inner joins:
```sql
SELECT *
FROM orders
INNER JOIN users ON orders.user_id = users.id;
```

left joins:
```sql
SELECT *
FROM users
LEFT JOIN orders ON users.id = orders.user_id;
```
right joins:
```sql
SELECT *
FROM users
RIGHT JOIN orders ON users.id = orders.user_id;
```
full out


In [84]:
load_dotenv(override=True)  # take environment variables from .env file

# Create a single OpenAI client (reads key from environment)
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def generate_7_day_plan(
    feedback_text,
    slides_index,
    slides_df,
    labs_index,
    labs_df,
    top_k_slides=5,
    top_k_labs=5,
    model_name="gpt-4o-mini",
):
    """
    Generate a 7-day study plan from student feedback.

    How it works:
    1) Search slides and labs separately.
    2) Concatenate the results (labs first by default since they are practical).
    3) Build a single context block with simple numbered citations.
    4) Call OpenAI to write a 7-day plan with spaced review.
    5) Return the generated text.

    Args:
        feedback_text (str): Student's feedback, e.g. "I lost points on SQL joins and confusion matrix."
        slides_index: FAISS index for slides, from load_index().
        slides_df (pd.DataFrame): Slide chunks, from load_chunks().
        labs_index: FAISS index for labs, from load_index().
        labs_df (pd.DataFrame): Lab chunks, from load_chunks().
        top_k_slides (int): Number of slide chunks to include.
        top_k_labs   (int): Number of lab chunks to include.
        model_name   (str): OpenAI chat model.

    Returns:
        str: The study plan text generated by the LLM.
    """
    # Search slides and labs
    slides_results = search_slides(slides_index, slides_df, feedback_text, top_k=top_k_slides)
    labs_results   = search_labs(labs_index, labs_df, feedback_text, top_k=top_k_labs)

    # Make context
    context = make_context(labs_results, slides_results)

    system_prompt = (
        "You are an academic coach for a data science course. "
        "You must create a concrete 7-day micro-task plan using ONLY the provided context. "
        "Ensure tasks alternate between review (reading/notes), application (coding exercises), and reflection. "
        "Each day should include 2–4 actionable tasks, with estimated time, and a citation line that points back to the source. "
        "Use spaced review on Day 1, Day 3, and Day 6. "
        "If context is insufficient for any part, state that clearly."
    )
    
    user_prompt = (
        f"Student feedback: {feedback_text}\n\n"
        "Context from course materials (slides and labs):\n"
        f"{context}\n\n"
        "Now write the 7-day plan in this structure:\n"
        "Day 1 — Understand\n"
        "- Task 1 (est. 15–25 min) — description [CITATION: n]\n"
        "- Task 2 (est. 10–20 min) — description [CITATION: n]\n"
        "Day 2 — Apply\n"
        "- Task 1 ...\n"
        "...\n"
        "Day 7 — Checkpoint\n"
        "- Mini-quiz or small coding task ...\n"
        "\n"
        "Rules:\n"
        "- Use only facts available in the context above.\n"
        "- End each task line with [CITATION: n], where n is the bracketed number of the matching context entry.\n"
        "- If something is unclear or missing, say so.\n"
    )

    # Call OpenAI chat completion
    response = client.chat.completions.create(
        model=model_name,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0.2,  # low temperature for focused output
        max_tokens=1000   # adjust as needed
    )

    return response.choices[0].message.content
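
With `max_tokens=1000`, a full 7-day plan may be cut off before Day 7. In the OpenAI chat completions API, each choice has a `finish_reason` that is `"length"` when generation stopped at the token limit. The sketch below shows one way to surface that. `extract_plan` is a hypothetical helper, and the `SimpleNamespace` object is only a stand-in with the same attribute layout as a real response, so the example runs without an API call.

```python
from types import SimpleNamespace


def extract_plan(response):
    """Return (plan_text, was_truncated) from a chat completion response."""
    choice = response.choices[0]
    # finish_reason is "length" when generation stopped at the max_tokens limit
    was_truncated = choice.finish_reason == "length"
    return choice.message.content, was_truncated


# Stand-in for a real response object (assumption: used for illustration only)
fake_response = SimpleNamespace(
    choices=[
        SimpleNamespace(
            finish_reason="length",
            message=SimpleNamespace(content="Day 1 ..."),
        )
    ]
)

text, was_truncated = extract_plan(fake_response)
if was_truncated:
    print("Plan hit the token limit; consider raising max_tokens.")
```

Returning the flag alongside the text leaves the decision (retry with a larger limit, or warn the student) to the caller.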

In [89]:
feedback_text = "I lost points on SQL joins and I keep mixing up inner vs left vs right joins."
plan_text = generate_7_day_plan(feedback_text, slides_index, slides_df, labs_index, labs_df, top_k_slides=4, top_k_labs=6, model_name="gpt-4o-mini")
print(plan_text)

out_path = Path("../data/processed/example_plan.json")
out_path.parent.mkdir(parents=True, exist_ok=True)

# Save as a JSON object with a single string field
with open(out_path, "w", encoding="utf-8") as f:
    json.dump({"plan_text": plan_text}, f, indent=2, ensure_ascii=False)

print("Saved (raw text) plan to:", out_path.as_posix())

### 7-Day Micro-Task Plan for SQL Joins

#### Day 1 — Understand
- **Task 1 (est. 15–25 min)** — Review the definitions and differences between `INNER JOIN`, `LEFT JOIN`, and `RIGHT JOIN`. Take notes on when to use each type. [CITATION: 1]
- **Task 2 (est. 10–20 min)** — Read through the SQL JOIN examples provided in the course materials and write down the SQL syntax for each type of join. [CITATION: 3]

#### Day 2 — Apply
- **Task 1 (est. 20–30 min)** — Use the interactive SQL playground to practice writing `INNER JOIN` queries. Experiment with different tables and columns. [CITATION: 6]
- **Task 2 (est. 15–25 min)** — Complete a coding exercise that requires you to write `LEFT JOIN` queries based on provided datasets. [CITATION: 1]

#### Day 3 — Review
- **Task 1 (est. 15–25 min)** — Revisit your notes from Day 1 and summarize the key points about SQL JOINs. Focus on the differences between join types. [CITATION: 1]
- **Task 2 (est. 10–20 min)** — Go through the SQL JOIN examples aga
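
The Outputs section mentions CSV as an export option, while the cell above saves only JSON. A minimal sketch of a CSV export, assuming the plan follows the `Day N` header and `- ` task-line layout requested in the prompt: `plan_to_rows` is a hypothetical helper, and the sample plan text is shortened and invented for illustration.

```python
import csv
import io
import re

DAY_RE = re.compile(r"^#*\s*Day\s+(\d+)")           # e.g. "#### Day 1 ..." or "Day 1 ..."
CITATION_RE = re.compile(r"\[CITATION:\s*(\d+)\]")  # e.g. "[CITATION: 3]"


def plan_to_rows(plan_text):
    """Split plan text into one dict per task line (lines starting with '- ')."""
    rows = []
    current_day = None
    for raw in plan_text.splitlines():
        line = raw.strip()
        if line.startswith("- "):
            if current_day is None:
                continue  # task listed before any day header; skip it
            citations = CITATION_RE.findall(line)
            task = CITATION_RE.sub("", line[2:]).strip()
            rows.append({"day": current_day, "task": task, "citations": ";".join(citations)})
        else:
            match = DAY_RE.match(line)
            if match:
                current_day = int(match.group(1))
    return rows


sample_plan = """#### Day 1 - Understand
- **Task 1 (est. 15-25 min)** Review join definitions. [CITATION: 1]
#### Day 2 - Apply
- **Task 1 (est. 20-30 min)** Practice INNER JOIN queries. [CITATION: 6]
"""
rows = plan_to_rows(sample_plan)

# Write to an in-memory buffer here; swap in
# open(path, "w", newline="", encoding="utf-8") to write a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["day", "task", "citations"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

One row per task keeps the file easy to filter by day or by citation number in later analysis.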

In [56]:
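# Detect topic: keyword lists per topic, matched as lowercase substrings.
# Caveat: common words such as "from" and "where" also appear in ordinary
# English feedback, so this simple matcher can over-trigger.
TOPIC_KEYWORDS = {
    "sql": ["join", "left join", "right join", "inner join", "outer join", "select", "from", "where", "group by", "order by", "having", "union", "intersect", "except", "subquery", "cte", "window function"],
    "loops": ["for loop", "while loop", "do while loop", "nested loop", "break", "continue", "infinite loop", "loop control"],
}


def detect_topic(query):
    """Return each topic that has at least one keyword in the query."""
    query_lower = query.lower()
    result = []
    for topic, keywords in TOPIC_KEYWORDS.items():
        # any() adds a topic once, even when several of its keywords match
        if any(keyword in query_lower for keyword in keywords):
            result.append(topic)
    return result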