The goal of this notebook is to retrieve relevant chunks from the FAISS indexes and pass them to an LLM that outputs a 7-day study plan.

Implement semantic search (top-k) and a simple 7-day plan composer that attaches citations from the retrieved chunks.

LLM planner

Use the OpenAI API to:
1. Take a query (the student's input).
2. Search the indexes using `search_slides` and `search_labs`.
3. Format the results with citations.
4. Call the chat model using the API key.
5. Generate the 7-day plan.
6. Save the plan to JSON/CSV.

**Inputs**
- `../data/processed/slides_chunks.parquet`
- `../data/processed/labs_chunks.parquet`
- `../data/processed/faiss_slides.index`
- `../data/processed/faiss_labs.index`



**Outputs**
- Printed study plan in the notebook
- JSON file containing the saved plan
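
The JSON output listed above could be written with a small helper like the one below. This is a minimal sketch, not part of the notebook's code: the function name `save_plan_json`, the record fields, and the output path are all assumptions chosen for illustration.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def save_plan_json(plan_text, feedback_text, out_path):
    """Write one generated plan, plus the feedback that produced it, to a JSON file.

    Hypothetical helper: the field names below are illustrative, not a fixed schema.
    """
    record = {
        "feedback": feedback_text,
        "plan": plan_text,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
    out_path = Path(out_path)
    # Create the target folder if it does not exist yet
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(
        json.dumps(record, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return out_path
```

It would be called with the text returned by the planner, for example `save_plan_json(plan_text, example_feedback, "../data/processed/plan.json")`, where the path is again only an example.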

In [45]:
from openai import OpenAI
from pathlib import Path
import pandas as pd 
import numpy as np
import os
import faiss
import json
from dotenv import load_dotenv
from sentence_transformers import SentenceTransformer


In [46]:
# Load data
# Folders
PROJECT_ROOT = Path("..").resolve()
PROCESSED_FOLDER = PROJECT_ROOT / "data" / "processed"

SLIDES_CHUNKS_PATH = PROCESSED_FOLDER / "slides_chunks.parquet"
LABS_CHUNKS_PATH   = PROCESSED_FOLDER / "labs_chunks.parquet"
FAISS_SLIDES_PATH  = PROCESSED_FOLDER / "faiss_slides.index"
FAISS_LABS_PATH    = PROCESSED_FOLDER / "faiss_labs.index"


print("Slides chunks:", SLIDES_CHUNKS_PATH.exists(), SLIDES_CHUNKS_PATH)
print("Labs chunks:  ", LABS_CHUNKS_PATH.exists(),   LABS_CHUNKS_PATH)
print("Slides index: ", FAISS_SLIDES_PATH.exists(),  FAISS_SLIDES_PATH)
print("Labs index:   ", FAISS_LABS_PATH.exists(),    FAISS_LABS_PATH)



Slides chunks: True C:\Users\julmo\OneDrive - University of Rochester\TKH Labs\Grades2Goals_Planner\data\processed\slides_chunks.parquet
Labs chunks:   True C:\Users\julmo\OneDrive - University of Rochester\TKH Labs\Grades2Goals_Planner\data\processed\labs_chunks.parquet
Slides index:  True C:\Users\julmo\OneDrive - University of Rochester\TKH Labs\Grades2Goals_Planner\data\processed\faiss_slides.index
Labs index:    True C:\Users\julmo\OneDrive - University of Rochester\TKH Labs\Grades2Goals_Planner\data\processed\faiss_labs.index


In [47]:
# Load chunks (reuse the path constants defined above)
slides_df = pd.read_parquet(SLIDES_CHUNKS_PATH)
labs_df   = pd.read_parquet(LABS_CHUNKS_PATH)

# Drop empty rows
slides_df = slides_df.dropna(subset=['text']).reset_index(drop=True)
labs_df   = labs_df.dropna(subset=['text']).reset_index(drop=True)

print("Slides rows:", len(slides_df))
print("Labs rows:  ", len(labs_df))

Slides rows: 40
Labs rows:   1410


In [48]:
# Load FAISS indexes from disk
slides_index = faiss.read_index(FAISS_SLIDES_PATH.as_posix())
labs_index   = faiss.read_index(FAISS_LABS_PATH.as_posix())



In [49]:
# Embedding model
embedder = SentenceTransformer("bert-base-nli-mean-tokens")


def encode_normalized(texts):
    """Convert a list of texts (chunk texts, or a one-item list holding the student's query) into normalized float32 vectors.
    - normalize_embeddings=True ensures vectors have length 1, so FAISS inner product ≈ cosine similarity.
    - We return a NumPy array with dtype float32 because FAISS expects float32 vectors."""
    embeddings = embedder.encode(texts, normalize_embeddings=True, show_progress_bar=True)
    return np.asarray(embeddings, dtype="float32")

print("Embedding model loaded:", embedder)

Embedding model loaded: SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False, 'architecture': 'BertModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
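
The docstring above relies on the fact that the inner product of unit-length vectors equals their cosine similarity. The following is a small NumPy-only sketch of that identity; it uses random vectors in place of real embeddings (an assumption made so the example runs without the model), with 768 dimensions to match the pooling layer printed above.

```python
import numpy as np

# Random stand-ins for two embeddings (illustrative only, not real model output)
rng = np.random.default_rng(0)
a = rng.normal(size=768).astype("float32")
b = rng.normal(size=768).astype("float32")

# Cosine similarity computed directly from the raw vectors
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Inner product after scaling each vector to length 1
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
inner = float(a_unit @ b_unit)

# The two values agree up to float32 rounding
print(np.isclose(cosine, inner, atol=1e-5))
```

This is why an inner-product index such as `IndexFlatIP` can be used for cosine-style ranking once every vector has been normalized.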


In [50]:
# Rebuild the two FAISS indexes (one for slides, one for labs) from the chunk text and overwrite the files on disk

# Slides index
slides_embeddings = encode_normalized(slides_df['text'].tolist())
slides_index = faiss.IndexFlatIP(slides_embeddings.shape[1])
slides_index.add(slides_embeddings)
faiss.write_index(slides_index, FAISS_SLIDES_PATH.as_posix())

# Labs index
labs_embeddings = encode_normalized(labs_df['text'].tolist())
labs_index = faiss.IndexFlatIP(labs_embeddings.shape[1])
labs_index.add(labs_embeddings)
faiss.write_index(labs_index, FAISS_LABS_PATH.as_posix())

print("Slides index size:", slides_index.ntotal)
print("Labs index size:", labs_index.ntotal)

Batches:   0%|          | 0/2 [00:00<?, ?it/s]

Batches:   0%|          | 0/45 [00:00<?, ?it/s]

Slides index size: 40
Labs index size: 1410


In [51]:
def search_slides(query_text, top_k=5):
    """
    Search ONLY the slides index and return a small, readable DataFrame of results.

    Steps:
    1) Encode the query using the same embedding model and normalization.
    2) Ask FAISS for the top_k most similar vectors from the slides index.
    3) For each match, look up the original row in slides_df to get metadata (file, page, text).
    4) Build a small results table with source_type, file, page, text, and the similarity score.
    5) Sort by score descending (higher ≈ more similar).
    """
    # Embed query 
    query_vector = encode_normalized([query_text])

    # FAISS search
    distances, indices = slides_index.search(query_vector, top_k)

    # Turn indices into a list of rows from slides_df
    rows = []
    for score, idx in zip(distances[0], indices[0]):
        if idx < 0:  # FAISS returns -1 for empty results
            continue
        row = slides_df.iloc[idx] # get the matching slide chunk
        # result structure
        rows.append({
            "source_type": "slide",
            "file": row['file'], # slide filename
            "page": row['page'], # slide page number
            "text": row['text'], # slide text chunk that matched
            "score": score # similarity score  (higher the better)
    
        })

    # Create a DataFrame and sort by score descending
    results_df = pd.DataFrame(rows).sort_values(by="score", ascending=False).reset_index(drop=True)
    return results_df




In [52]:
def search_labs(query_text, top_k=5):
    """
    Search ONLY the labs index and return a small, readable DataFrame of results.

    Steps:
    1) Encode the query using the same embedding model and normalization.
    2) Ask FAISS for the top_k most similar vectors from the labs index.
    3) For each match, look up the original row in labs_df to get metadata (file, text).
    4) Build a small results table with source_type, file, text, and the similarity score.
    5) Sort by score descending (higher ≈ more similar).
    """
    # Embed query 
    query_vector = encode_normalized([query_text])

    # FAISS search
    distances, indices = labs_index.search(query_vector, top_k)

    # Turn indices into a list of rows from labs_df
    rows = []
    for score, idx in zip(distances[0], indices[0]):
        if idx < 0:  # FAISS returns -1 for empty results
            continue
        row = labs_df.iloc[idx] # get the matching lab chunk
        # result structure
        rows.append({
            "source_type": "lab",
            "file": row['file'], # lab filename
            "text": row['text'], # lab text chunk that matched
            "score": score # similarity score  (higher the better)
    
        })

    # Create a DataFrame and sort by score descending
    results_df = pd.DataFrame(rows).sort_values(by="score", ascending=False).reset_index(drop=True)
    return results_df
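
`search_slides` and `search_labs` differ only in the index, the DataFrame, and the `page` field, so the row-building step could be shared. The sketch below is one possible refactor, not the notebook's code: `results_to_frame` is a hypothetical helper that takes one row of scores and indices (the shape FAISS returns for a single query) and also returns an empty, correctly-columned DataFrame when nothing matches, a case where sorting by `"score"` would otherwise fail.

```python
import numpy as np
import pandas as pd


def results_to_frame(scores, indices, chunks_df, source_type):
    """Turn one query's scores and indices into a sorted results DataFrame.

    Hypothetical helper: `scores` and `indices` are 1-D arrays such as
    distances[0] and indices[0] from an index search.
    """
    columns = ["source_type", "file", "page", "text", "score"]
    rows = []
    for score, idx in zip(scores, indices):
        if idx < 0:  # -1 marks an empty result slot
            continue
        row = chunks_df.iloc[int(idx)]
        rows.append({
            "source_type": source_type,
            "file": row["file"],
            "page": row.get("page"),  # None when the chunks have no page column
            "text": row["text"],
            "score": float(score),
        })
    # Passing columns= keeps the "score" column present even when rows is empty
    return (
        pd.DataFrame(rows, columns=columns)
        .sort_values("score", ascending=False)
        .reset_index(drop=True)
    )
```

With this in place, each search function would reduce to encoding the query, calling `.search` on its own index, and passing `distances[0]` and `indices[0]` to the helper.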


In [None]:
load_dotenv(override=True)  # take environment variables from .env file

# Create a single OpenAI client (reads key from environment)
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def generate_7_day_plan(feedback_text, top_k_slides=5, top_k_labs=5, model_name="gpt-4o-mini"):
    """
    End-to-end:
    1) Search slides and labs separately.
    2) Concatenate the results (labs first by default since they are practical).
    3) Build a single context block with simple citations.
    4) Call OpenAI to write a 7-day plan with spaced review.
    5) Return the generated text.

    Args:
        feedback_text (str): Student's feedback, e.g. "I lost points on SQL joins and confusion matrix."
        top_k_slides (int): Number of slide chunks to include.
        top_k_labs   (int): Number of lab chunks to include.
        model_name   (str): OpenAI chat model.

    Returns:
        str: The study plan text generated by the LLM.
    """
    # Search slides and labs
    slides_results = search_slides(feedback_text, top_k=top_k_slides)
    labs_results   = search_labs(feedback_text, top_k=top_k_labs)

    # Combine results 
    combined_results = pd.concat([labs_results, slides_results], ignore_index=True)

    # Build context block with citations
    context_blocks = []
    for i, row in combined_results.iterrows():
        citation = f"[{i+1}]"
        if row['source_type'] == 'slide':
            context_blocks.append(f"{citation} (Slide: {row['file']} Page: {row['page']}) {row['text']}")
        else:
            context_blocks.append(f"{citation} (Lab: {row['file']}) {row['text']}")

    context = "\n\n".join(context_blocks) # double newline for readability
    
    system_prompt = (
        "You are an academic coach for a data science course. "
        "You must create a concrete 7-day micro-task plan using ONLY the provided context. "
        "Ensure tasks alternate between review (reading/notes), application (coding exercises), and reflection. "
        "Each day should include 2–4 actionable tasks, with estimated time, and a citation line that points back to the source. "
        "Use spaced review on Day 1, Day 3, and Day 6. "
        "If context is insufficient for any part, state that clearly."
    )
    
    user_prompt = (
        f"Student feedback: {feedback_text}\n\n"
        f"Context from course materials (slides and labs):\n"
        f"{context}\n\n"
        "Now write the 7-day plan in this structure:\n"
        "Day 1 — Understand\n"
        "- Task 1 (est. 15–25 min) — description [CITATION]\n"
        "- Task 2 (est. 10–20 min) — description [CITATION]\n"
        "Day 2 — Apply\n"
        "- Task 1 ...\n"
        "...\n"
        "Day 7 — Checkpoint\n"
        "- Mini-quiz or small coding task ...\n"
        "\n"
        "Rules:\n"
        "- Use only facts available in the context above.\n"
        "- Each task line should end with a [CITATION] that uses the bracketed number of the matching context entry, e.g. [1].\n"
        "- If something is unclear or missing, say so.\n"
    )

    # Call OpenAI chat completion
    response = client.chat.completions.create(
        model=model_name,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0.2,  # low temperature for focused output
        max_tokens=1000   # adjust as needed
    )

    return response.choices[0].message.content
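
The context-building step inside `generate_7_day_plan` numbers each retrieved chunk so the model can cite it. A possible standalone version is sketched below; `build_context` and its lookup dictionary are assumptions added for illustration (the notebook builds the context inline and keeps no lookup). Returning the lookup alongside the text would make it possible to map a numeric citation in the plan back to its file later.

```python
def build_context(results):
    """Number each retrieved chunk and return (context_text, citation_lookup).

    Hypothetical helper: `results` is a list of dicts with the same keys the
    search functions produce (source_type, file, text, and page for slides).
    """
    blocks = []
    lookup = {}
    for number, item in enumerate(results, start=1):
        if item["source_type"] == "slide":
            label = f"Slide: {item['file']} Page: {item['page']}"
        else:
            label = f"Lab: {item['file']}"
        lookup[number] = label
        blocks.append(f"[{number}] ({label}) {item['text']}")
    # Blank line between entries, matching the format used in the prompt
    return "\n\n".join(blocks), lookup
```

A DataFrame of combined results could be passed in as `combined_results.to_dict("records")`.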

In [54]:
example_feedback = "I lost points on SQL joins and I keep mixing up inner vs left vs right joins."
plan_text = generate_7_day_plan(example_feedback, top_k_slides=4, top_k_labs=6, model_name="gpt-4o-mini")
print(plan_text)


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

### Day 1 — Understand
- Task 1 (est. 20 min) — Review the SQL keywords related to Data Definition Language (DDL) and Data Manipulation Language (DML). Focus on understanding the purpose of each keyword, especially `JOIN`. Take notes on how `INNER JOIN`, `LEFT JOIN`, and `RIGHT JOIN` differ. [CITATION: 5]
- Task 2 (est. 15 min) — Read through the SQL commands for selecting, inserting, updating, and deleting data. Pay special attention to the `SELECT` statement and how it can be used with `JOIN`. [CITATION: 1]

### Day 2 — Apply
- Task 1 (est. 30 min) — Complete a coding exercise that involves creating a sample database with at least two tables. Use `INNER JOIN`, `LEFT JOIN`, and `RIGHT JOIN` to query data from these tables. [CITATION: 1]
- Task 2 (est. 15 min) — Write SQL queries to insert, update, and delete data in your sample database. Ensure you practice using the `WHERE` clause effectively. [CITATION: 1]

### Day 3 — Review
- Task 1 (est. 20 min) — Revisit your notes on SQL joins.

# Prototype Eval (test cases)

Evaluating the prototype on at least 10 examples to check:
- Topic Correctness
- Citation Correctness
- Format Correctness
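
Two of the checks above, format and citation correctness, can be partly automated. The sketch below is one rough way to do it under stated assumptions: the function names are hypothetical, a plan is considered well formatted if it has a heading for each of Day 1 to Day 7, and only numeric citations such as `[3]` or `[CITATION: 3]` are accepted, so citations that name a file directly would be flagged for manual review. Topic correctness still needs a human read.

```python
import re

# A day heading at the start of a line, optionally preceded by markdown markers
DAY_HEADING = re.compile(r"^[#*\s]*Day\s+(\d)\b", re.MULTILINE)
# Numeric citations in either "[3]" or "[CITATION: 3]" form
NUMERIC_CITATION = re.compile(r"\[(?:CITATION:\s*)?(\d+)\]")


def check_format(plan_text):
    """Return the sorted day numbers (1 to 7) that have no heading in the plan."""
    found = {int(d) for d in DAY_HEADING.findall(plan_text)}
    return sorted(set(range(1, 8)) - found)


def check_citations(plan_text, n_context_entries):
    """Return task lines whose numeric citation is missing or out of range."""
    problems = []
    for line in plan_text.splitlines():
        if not line.lstrip().startswith("-"):
            continue  # only bullet lines are treated as tasks
        numbers = [int(n) for n in NUMERIC_CITATION.findall(line)]
        if not numbers or any(n < 1 or n > n_context_entries for n in numbers):
            problems.append(line.strip())
    return problems
```

Here `n_context_entries` would be the number of chunks sent to the model, i.e. `top_k_slides + top_k_labs` when every slot returns a result.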



In [55]:
# Define sample feedback (easy, medium, hard, and noisy).
# An AI assistant helped us generate these 10 diverse feedback examples quickly.
# To make sure the examples fit the easy, medium, hard, and noisy categories, we will need to write new ones and/or edit these.
# For now, we use them only to test that the code runs end to end.

Test_feedbacks = [
    "I lost points on SQL joins and I keep mixing up inner vs left vs right joins.",
    "I struggled with the concepts of overfitting and regularization in our machine learning assignments.",
    "The sections on neural networks and backpropagation were really hard for me to grasp.",
    "I found the data visualization part confusing, especially when to use different types of charts.",
    "I had trouble understanding the differences between supervised and unsupervised learning.",
    "The statistical concepts like p-values and confidence intervals were challenging to apply in practice.",
    "I got lost in the details of Python programming, particularly with functions and loops.",
    "The project on natural language processing was overwhelming, especially tokenization and stemming.",
    "I found it difficult to follow the steps in data cleaning and preprocessing for our datasets.",
    "The explanations about clustering algorithms like K-means and hierarchical clustering were unclear."
]

# Loop through the test feedbacks and generate a plan for each one
# (iterate over the strings directly; enumerate would pass an (index, text) tuple)
for feedback in Test_feedbacks:
    plan = generate_7_day_plan(
        feedback,
        top_k_slides=5,
        top_k_labs=5,
        model_name="gpt-4o-mini")

    print(plan)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

### Day 1 — Understand
- **Task 1 (est. 15–25 min)** — Review the concept of SQL joins, focusing on inner, left, and right joins. Take notes on their differences and use cases. [CITATION: Slide: Introduction to Structured Databases I (1).pdf Page: 62.0]
- **Task 2 (est. 10–20 min)** — Read through the SQL examples provided in the course materials, specifically focusing on the `DELETE` and `DROP TABLE` commands to understand their implications. [CITATION: Lab: w10-class1.ipynb]

### Day 2 — Apply
- **Task 1 (est. 20–30 min)** — Complete a coding exercise that requires writing SQL queries using inner, left, and right joins on a sample dataset. [CITATION: Lab: w9-class2.ipynb]
- **Task 2 (est. 15–25 min)** — Use Python's `pd.read_sql_query` to execute a query that retrieves users with missing age values. [CITATION: Lab: w9-class2.ipynb]

### Day 3 — Review
- **Task 1 (est. 15–25 min)** — Revisit your notes on SQL joins and summarize the key differences between inner, left, and right joins

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

### 7-Day Micro-Task Plan

**Day 1 — Understand**
- Task 1 (est. 15–25 min) — Review the concepts of overfitting and regularization in machine learning. Take notes on how these concepts affect model performance and generalization. [CITATION: Not provided]
- Task 2 (est. 10–20 min) — Read through the slides on classification algorithms and note the differences between them, including the benefits of ensemble methods. [CITATION: Slide: Introduction to Decision Trees.pdf Page: 62.0]

**Day 2 — Apply**
- Task 1 (est. 30 min) — Implement the `k_means` function from Lab [5]. Experiment with different values of k (e.g., k=3) and observe how the final centroids and assignments differ. [CITATION: Lab: lab.ipynb]
- Task 2 (est. 20 min) — Complete the TODO in Lab [4] by implementing a simple machine learning model. Test its performance on a sample dataset. [CITATION: Lab: classification_challenge.ipynb]

**Day 3 — Review**
- Task 1 (est. 15–25 min) — Revisit your notes on overfitting and regulari

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

### 7-Day Micro-Task Plan

#### Day 1 — Understand
- **Task 1 (est. 15–25 min)** — Review the section titled "How does backpropagation work?" from the provided IBM link to grasp the foundational concepts of backpropagation in neural networks. [CITATION: 4]
- **Task 2 (est. 10–20 min)** — Read through the "Introduction to Neural Networks" slide to familiarize yourself with the basic concepts of neural networks, including recurrent neural networks and their applications. [CITATION: 9]

#### Day 2 — Apply
- **Task 1 (est. 30–45 min)** — Complete the coding exercises in the `neural_networks.ipynb` lab, focusing on implementing a simple neural network using the Keras package. [CITATION: 4]
- **Task 2 (est. 20–30 min)** — Experiment with modifying the parameters of your neural network in the lab to observe how changes affect the model's performance. [CITATION: 4]

#### Day 3 — Review
- **Task 1 (est. 15–25 min)** — Revisit the "Types of Visualizations Review" slide to reinforce your understa

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

### Day 1 — Understand
- Task 1 (est. 15–25 min) — Review the "Types of Visualizations Review.pdf" to understand different types of charts and when to use them. Pay special attention to examples that clarify the use of each type. [CITATION: Slide: Types of Visualizations Review.pdf Page: 64.0]
- Task 2 (est. 10–20 min) — Read through the "neural_networks.ipynb" lab, focusing on the section about plotting errors of mini-batch training. Take notes on how visualization is used in this context. [CITATION: Lab: neural_networks.ipynb]

### Day 2 — Apply
- Task 1 (est. 30 min) — Complete a coding exercise that involves creating different types of visualizations using a sample dataset. Try to implement at least three different chart types. [CITATION: Context insufficient for specific coding exercise]
- Task 2 (est. 20 min) — Use the "svm.ipynb" lab to explore non-linear decision boundaries by modifying the code to visualize the results. [CITATION: Lab: svm.ipynb]

### Day 3 — Understand
- Task

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

### 7-Day Micro-Task Plan

#### Day 1 — Understand
- **Task 1 (est. 15–25 min)** — Review the differences between supervised and unsupervised learning. Focus on definitions and examples of each type. [CITATION: Not provided]
- **Task 2 (est. 10–20 min)** — Read through the slides on Measures of Dispersion & Central Limit Theorem to reinforce your understanding of key statistical concepts that underpin supervised learning. [CITATION: 9]

#### Day 2 — Apply
- **Task 1 (est. 30 min)** — Complete coding exercises in the `svm.ipynb` lab, focusing on implementing a Support Vector Machine (SVM) model. Experiment with both linear and non-linear decision boundaries. [CITATION: 2]
- **Task 2 (est. 20 min)** — Work on a coding exercise that involves implementing Laplace Smoothing in the `naive_bayes.ipynb` lab. [CITATION: 3]

#### Day 3 — Review
- **Task 1 (est. 15–25 min)** — Revisit the definitions of ordinal, nominal, and binary data types from the feature engineering lab. Take notes on how th

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

### 7-Day Micro-Task Plan

#### Day 1 — Understand
- **Task 1 (est. 15–25 min)** — Review the concept of p-values and confidence intervals. Take notes on their definitions and applications in statistical analysis. [CITATION: Not provided]
- **Task 2 (est. 10–20 min)** — Read through the slide on latent variables and summarize how they can be inferred from observable variables. [CITATION: Slide: Types of Visualizations Review.pdf Page: 64.0]

#### Day 2 — Apply
- **Task 1 (est. 30–40 min)** — Complete the coding exercises in the `svm.ipynb` lab, focusing on exploring non-linear decision boundaries. [CITATION: Lab: svm.ipynb]
- **Task 2 (est. 20–30 min)** — Implement a simple example of Laplace Smoothing in the `naive_bayes.ipynb` lab to handle zero counts in word frequencies. [CITATION: Lab: naive_bayes.ipynb]

#### Day 3 — Understand
- **Task 1 (est. 15–25 min)** — Review the concept of gradient descent and its role in updating weights in neural networks. Take notes on how the `gradien

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

### Day 1 — Understand
- **Task 1 (est. 15–25 min)** — Review the basics of Python functions and loops. Focus on the control flow section from the Python refresher lab. Take notes on how loops and functions are structured and used. [CITATION: Lab: python_refresher.ipynb]
- **Task 2 (est. 10–20 min)** — Read through the SQL + Python slide to understand how Python interacts with SQL databases, particularly focusing on the use of SQLite. Take notes on key points. [CITATION: Slide: Advanced SQL I.pdf Page: 43.0]

### Day 2 — Apply
- **Task 1 (est. 20–30 min)** — Complete coding exercises related to functions and loops from the Python refresher lab. Implement at least two different types of loops (for and while) in your code. [CITATION: Lab: python_refresher.ipynb]
- **Task 2 (est. 15–25 min)** — Write a simple SQL query using Python to select users where age is NULL, similar to the example provided in the lab. Test it in your SQLite environment. [CITATION: Lab: w9-class2.ipynb]

### Day 3 

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

### 7-Day Micro-Task Plan

#### Day 1 — Understand
- **Task 1 (est. 15–25 min)** — Review the concept of vector databases and RAGs, focusing on their importance in providing contextual information to models. Take notes on key points. [CITATION: 1]
- **Task 2 (est. 10–20 min)** — Read about the transformer architecture and its applications beyond text, including image and audio data. Summarize the main ideas. [CITATION: 7]

#### Day 2 — Apply
- **Task 1 (est. 30–45 min)** — Complete the coding exercise in `vectors_rags.ipynb` to implement the SentenceTransformer and generate embeddings for a set of facts. [CITATION: 3]
- **Task 2 (est. 20–30 min)** — Install the necessary libraries (`gensim` and `nltk`) as shown in `nlp_vectors.ipynb`, and explore basic functionalities of these libraries. [CITATION: 4]

#### Day 3 — Review
- **Task 1 (est. 15–25 min)** — Revisit your notes on vector databases and RAGs. Reflect on how they enhance the performance of LLMs. [CITATION: 1]
- **Task 2 (est. 1

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

### 7-Day Micro-Task Plan

**Day 1 — Understand**
- Task 1 (est. 15–25 min) — Review the concepts of data cleaning and preprocessing, focusing on the importance of consistency in datasets and the differences between data cleaning and data validation. Take notes on key points. [CITATION: 4]
- Task 2 (est. 10–20 min) — Read through the slides on Types of Visualizations to understand latent variables and their significance in data analysis. Note how they can be inferred from measurable variables. [CITATION: 6]

**Day 2 — Apply**
- Task 1 (est. 30–45 min) — Complete coding exercises related to data cleaning using the vectors_rags.ipynb lab. Focus on identifying and correcting inconsistencies in a sample dataset. [CITATION: 1]
- Task 2 (est. 20–30 min) — Work on a coding task that involves implementing data validation techniques in a provided dataset. [CITATION: 4]

**Day 3 — Review**
- Task 1 (est. 15–25 min) — Revisit the notes taken on data cleaning and preprocessing. Summarize the key t

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

### 7-Day Micro-Task Plan

**Day 1 — Understand**
- Task 1 (est. 15–25 min) — Review the slides on K-Nearest Neighbors, focusing on the concepts of "Who is my neighbor?" and "What is a Manhattan distance?" [CITATION: Slide: Introduction to K-Nearest-Neighbors.pdf Page: 90.0]
- Task 2 (est. 10–20 min) — Read about clustering algorithms, specifically K-means and hierarchical clustering, and take notes on their differences and applications. [CITATION: Context from course materials]

**Day 2 — Apply**
- Task 1 (est. 30 min) — Complete coding exercises related to K-means clustering from the lab materials, focusing on implementing the algorithm and visualizing the clusters. [CITATION: Lab: vectors_rags.ipynb]
- Task 2 (est. 20 min) — Modify the number of clusters (k) in your K-means implementation and observe the changes in clustering outcomes. Document your observations. [CITATION: Lab: lab.ipynb]

**Day 3 — Review**
- Task 1 (est. 15–25 min) — Revisit the notes taken on K-means and hierarc

In [56]:
# Detect topic (draft, still commented out)
# Topic_keywords = {
#     "sql": ["join", "left join", "right join", "inner join", "outer join", "select", "from", "where", "group by", "order by", "having", "union", "intersect", "except", "subquery", "cte", "window function"],
#     "loops": ["for loop", "while loop", "do while loop", "nested loop", "break", "continue", "infinite loop", "loop control"],
# }
#
# def detect_topic(query):
#     result = []
#     for topic, keywords in Topic_keywords.items():
#         for keyword in keywords:
#             if keyword in query.lower():
#                 result.append(topic)
#                 break  # one keyword match is enough to tag the topic once
#     return result