<div style="border: 2px solid #ccc; border-radius: 12px; padding: 20px; max-width: 950px; margin: auto; background-color: #1e1e1e; color: #f0f0f0; font-family: Arial, sans-serif; line-height: 1.6;">

  <div style="text-align: center; margin-bottom: 20px;">
    <img src="../images/SlideHunter_Logo.png" 
         alt="SlideHunter logo"
         style="width: 80%; max-width: 80%; height: auto; border-radius: 8px; box-shadow: 0 0 10px rgba(0,0,0,0.4);">
  </div>

  <blockquote style="margin: 0; padding: 10px 20px; border-left: 4px solid #4faaff;">
    <p><strong>
      SlideHunter App
    </strong></p>
    <p>
      User Interface (UI):
      <a href="../images/SlideHunter_Logo.png" target="_blank" style="color: #4faaff;">
        Find exactly where a topic was covered in course materials. Fast answers with precise slide/page citations.
      </a>
    </p>
  </blockquote>

</div>

# 01 — Setup & Ingest
This notebook pulls course content from Canvas (module items and Page text), embeds the text chunks, and builds a FAISS store.

## Install (optional)

In [56]:
# Warning suppressor
import os

# Tell Hugging Face to skip TensorFlow/Flax so they never import TensorFlow (TF).
os.environ["TRANSFORMERS_NO_TF"] = "1"
os.environ["TRANSFORMERS_NO_FLAX"] = "1"

# Quiet TF logs if something still pulls it in.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"  # 0=show all, 1=hide INFO, 2=hide INFO+WARNING, 3=hide INFO+WARNING+ERROR


In [57]:


# OPTIONAL If env is missing packages
%pip install -q sentence-transformers faiss-cpu beautifulsoup4 canvasapi


Note: you may need to restart the kernel to use updated packages.


## Imports and Credentials

In [58]:
from canvasapi import Canvas
from bs4 import BeautifulSoup
from sentence_transformers import SentenceTransformer
import numpy as np, faiss, re, json, os


In [63]:
# Read Canvas credentials from environment variables; never hard-code a token in the notebook.
CANVAS_BASE_URL = os.environ.get("CANVAS_BASE_URL", "https://tkh.instructure.com")
CANVAS_TOKEN = os.environ["CANVAS_TOKEN"]  # raises KeyError if the variable is not set

canvas = Canvas(CANVAS_BASE_URL, CANVAS_TOKEN)

In [65]:
my_courses = canvas.get_courses()
course_list = []

for course in my_courses:
    print(course.name)
    course_list.append(course)


Foundations '25 Data Science
Foundations Course
IF '25 Data Science Cohort A
IF '25 NY Career Readiness and Success


In [66]:
# Inspect the modules of the last course listed above
course = course_list[-1]
modules = course.get_modules()

for module in modules:
    print(f"  Module_id: {module.id}")
    print(f"  Module: {module.name}")
    module_items = module.get_module_items()
    for item in module_items:
        print(f" - Item: {item.title} ({item.type})")

  Module_id: 1118
  Module: Fellow Resources
 - Item: Fellow Success Resources (Page)
  Module_id: 1239
  Module: Phase 2 (6/9-8/29)
 - Item: Homework: Option 1 - Weekly Job Applications & Progress Report (Due August 30) (Assignment)
 - Item: P2W1 (6/12) NO CAREER CLASS - TECHNICAL CLASS (SubHeader)
 - Item: P2W2 (6/16) Bloomberg Ideathon (SubHeader)
 - Item: Homework (SubHeader)
 - Item: Homework: Watch Hackathon Video (Assignment)
 - Item: Homework: Upwardly Global Learning Paths: Tech Market/Resume/Cover Letter (Assignment)
 - Item: Homework: Draft Resume (Assignment)
 - Item: P2W2 NO CLASS MEETING 6/19 Juneteenth TKH Closed (SubHeader)
 - Item: P2W3 (6/26) Bloomberg Hackathon (SubHeader)
 - Item: Homework (SubHeader)
 - Item: Homework: Hackathon Activity Log + Judges' Feedback (Assignment)
 - Item: P2W4 (7/3) Resume + Digital Footprint (SubHeader)
 - Item: In Class Activity (SubHeader)
 - Item: In-Class Activity: Updated Resume (Assignment)
 - Item: Homework (SubHeader)
 - Item: Ho

## Embedding Canvas module text and items, then turning those embeddings into a facts list + FAISS index that we can query

1. Build facts (+ metadata) from Canvas
  - This pulls each Page's text (HTML → plain text) and adds light, title-only facts for
    External URLs, Files, and other item types. SubHeaders are skipped. This may need extending later.
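
To make the target shape concrete, here is a minimal sketch of one fact string and its parallel metadata record. The helper name `make_fact` and every sample value (course, module, title, URL) are illustrative assumptions, not data pulled from Canvas.

```python
# Sketch: one "fact" string plus a parallel metadata dict.
# make_fact and all sample values below are hypothetical.

def make_fact(domain, course, module, title, text):
    """Return an embeddable string that carries its own breadcrumb."""
    return f"[{domain}] {course} › {module} › {title}: {text}"

facts, metas = [], []

facts.append(make_fact("technical", "Demo Course", "Week 1", "Intro Page",
                       "Variables store values."))
metas.append({
    "domain": "technical",
    "course_name": "Demo Course",
    "module_name": "Week 1",
    "item_title": "Intro Page",
    "type": "Page",
    "url": "https://example.com/pages/intro",  # placeholder URL
})

# facts[i] and metas[i] describe the same item, so a single row id
# from the vector index can be used to look up both.
print(facts[0])
print(metas[0]["type"], len(facts) == len(metas))
```

Keeping the two lists strictly parallel is what lets the search step map an index position back to a citation.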

## Single FAISS store
- A simple demo MVP that tags every fact with a domain and uses a tiny auto-router.
  - The router relies on two short route descriptions (technical and career).
    - Pulls multiple Canvas courses
    - Builds one facts/metas list with the domain stored in metadata
    - Creates one FAISS index
- Queries are routed automatically to technical, career, or all, and hits are filtered accordingly.
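
The routing idea can be sketched without any model: compare the query vector with one vector per route, and fall back to "all" when the top two scores are too close. The 3-dimensional vectors and the name `pick_scope` below are made-up stand-ins for real sentence embeddings, chosen only to make the margin rule visible.

```python
import math

def cosine(a, b):
    """Cosine similarity between two plain Python vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Hypothetical 3-d "embeddings" of the two route descriptions.
ROUTE_VECS = {
    "technical": [1.0, 0.0, 0.0],
    "career":    [0.0, 1.0, 0.0],
}

def pick_scope(query_vec, margin=0.05):
    """Return the best route, or 'all' if the winner is not clearly ahead."""
    sims = {name: cosine(query_vec, vec) for name, vec in ROUTE_VECS.items()}
    ranked = sorted(sims.items(), key=lambda kv: kv[1], reverse=True)
    if ranked[0][1] - ranked[1][1] < margin:
        return "all"
    return ranked[0][0]

print(pick_scope([0.9, 0.1, 0.0]))  # leans technical
print(pick_scope([0.1, 0.9, 0.0]))  # leans career
print(pick_scope([0.5, 0.5, 0.2]))  # tie between routes
```

A larger `margin` sends more queries to "all"; a smaller one commits to a single domain more often.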



In [67]:
# Multi-course to ONE FAISS store + simple router using career and technical courses

# 0) CONFIG: map course names to domain buckets
DOMAINS = {
    "technical": [
        "Foundations '25 Data Science",
        "Foundations Course",
        "IF '25 Data Science Cohort A",
    ],
    "career": [
        "IF '25 NY Career Readiness and Success",
    ],
}
# Short route descriptions used for auto-routing; more can be added if needed
ROUTE_DESC = {
    "technical": "Technical class content: Python, SQL, statistics, machine learning, slides, labs, code, algorithms, data science, lecture notes.",
    "career":    "Career readiness content: resumes, cover letters, job search, interviews, career prep, LinkedIn, networking, internship resources.",
}

In [68]:
# 1) Utility: HTML → text, light chunking
def strip_html(html: str) -> str:
    if not html: return ""
    txt = " ".join(BeautifulSoup(html, "html.parser").stripped_strings)
    return re.sub(r"\s+", " ", txt).strip()

def chunk_text(text, max_chars=600):
    if not text: return []
    parts = re.split(r"(\n|\.\s+)", text)
    buf, chunks = "", []
    for p in parts:
        buf += p
        if len(buf) >= max_chars:
            chunks.append(buf.strip()); buf = ""
    if buf.strip(): chunks.append(buf.strip())
    return [c for c in chunks if c]
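
A quick check of how `chunk_text` behaves, using a deliberately small `max_chars` so the effect is visible. The function body is copied from the cell above; the sample sentence is made up for illustration.

```python
import re

def chunk_text(text, max_chars=600):
    if not text: return []
    parts = re.split(r"(\n|\.\s+)", text)
    buf, chunks = "", []
    for p in parts:
        buf += p
        if len(buf) >= max_chars:
            chunks.append(buf.strip()); buf = ""
    if buf.strip(): chunks.append(buf.strip())
    return [c for c in chunks if c]

sample = "Alpha beta. Gamma delta. Epsilon."
chunks = chunk_text(sample, max_chars=15)
for c in chunks:
    print(len(c), repr(c))
```

This prints `23 'Alpha beta. Gamma delta'` and `10 '. Epsilon.'`. The length check runs only after a piece is appended, so a chunk can overshoot `max_chars`, and the separator that follows the flush lands at the start of the next chunk. With the default of 600 characters this is usually harmless, but it is worth knowing when reading the stored facts.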

In [69]:

# 2) Select courses by name (use your Canvas client `canvas`)
def course_domain(course_name: str):
    for dom, names in DOMAINS.items():
        if any(course_name.startswith(n) for n in names):
            return dom
    return "other"

wanted_prefixes = sum(DOMAINS.values(), [])
all_courses = [c for c in canvas.get_courses(enrollment_state="active")
               if any(c.name.startswith(p) for p in wanted_prefixes)]

print("Selected courses:", [c.name for c in all_courses])


Selected courses: ["Foundations '25 Data Science", 'Foundations Course', "IF '25 Data Science Cohort A", "IF '25 NY Career Readiness and Success"]


In [70]:
# 3) Build facts + metas from ALL selected courses
facts, metas = [], []
for crs in all_courses:
    dom = course_domain(crs.name)
    for module in crs.get_modules():
        for item in module.get_module_items():
            t = (item.type or "").strip()
            if t == "Page":
                page = crs.get_page(item.page_url)
                text = strip_html(getattr(page, "body", ""))
                for chunk in chunk_text(text, max_chars=600):
                    facts.append(f"[{dom}] {crs.name} › {module.name} › {item.title}: {chunk}")
                    metas.append({
                        "domain": dom,
                        "course_id": crs.id, "course_name": crs.name,
                        "module_id": module.id, "module_name": module.name,
                        "item_title": item.title, "type": "Page",
                        "url": getattr(page, "html_url", None)
                    })
            elif t in ("ExternalUrl", "ExternalTool"):
                facts.append(f"[{dom}] {crs.name} › {module.name} › {item.title}: external link {getattr(item, 'external_url', '')}")
                metas.append({
                    "domain": dom, "course_id": crs.id, "course_name": crs.name,
                    "module_id": module.id, "module_name": module.name,
                    "item_title": item.title, "type": t,
                    "url": getattr(item, "external_url", None)
                })
            elif t == "File":
                facts.append(f"[{dom}] {crs.name} › {module.name} › {item.title} (file)")
                metas.append({
                    "domain": dom, "course_id": crs.id, "course_name": crs.name,
                    "module_id": module.id, "module_name": module.name,
                    "item_title": item.title, "type": "File", "file_id": item.content_id
                })
            elif t == "SubHeader":
                continue
            else:
                facts.append(f"[{dom}] {crs.name} › {module.name} › {item.title} ({t})")
                metas.append({
                    "domain": dom, "course_id": crs.id, "course_name": crs.name,
                    "module_id": module.id, "module_name": module.name,
                    "item_title": item.title, "type": t
                })

print(f"Built {len(facts)} facts")

Built 323 facts


In [71]:

# 4) Embed — use GPU if available, else CPU
import torch
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device=DEVICE)
print("model device:", model.device)

# (optional) quick warm-up on GPU
if DEVICE == "cuda":
    _ = model.encode(["warm up"], show_progress_bar=False)

# pick a sensible batch size per device
BATCH = 192 if DEVICE == "cuda" else 64

emb = model.encode(
    facts,
    batch_size=BATCH,
    normalize_embeddings=True,   # cosine-ready
    convert_to_numpy=True,       # returns NumPy on CPU for FAISS
    show_progress_bar=True
).astype("float32")

d = emb.shape[1]
index = faiss.IndexFlatIP(d)               # cosine (vectors normalized)
index.add(emb)
print("FAISS ntotal:", index.ntotal)


model device: cpu


FAISS ntotal: 323
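
The cell above leans on one fact: once vectors are L2-normalized, their inner product equals their cosine similarity, which is why an inner-product index can serve cosine search. A small pure-Python check, with made-up 2-d vectors in place of real embeddings:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]
b = [4.0, 3.0]

# Inner product of the normalized vectors vs. cosine of the raw vectors.
ip_normalized = dot(l2_normalize(a), l2_normalize(b))
print(round(cosine(a, b), 6), round(ip_normalized, 6))
```

Both values come out to 0.96 here. If `normalize_embeddings=True` were dropped, the inner-product scores would also depend on vector length and would no longer be cosine similarities.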


In [72]:
# 5) Router: choose technical / career / all based on similarity to route descriptions
route_emb = {k: model.encode([v], normalize_embeddings=True).astype("float32") for k,v in ROUTE_DESC.items()}

def choose_scope(query, margin=0.05):
    q = model.encode([query], normalize_embeddings=True).astype("float32")
    sims = {k: float((q @ route_emb[k].T)[0,0]) for k in ROUTE_DESC}
    best = max(sims, key=sims.get)
    # if not clearly better, use 'all'
    ordered = sorted(sims.items(), key=lambda x: x[1], reverse=True)
    if ordered[0][1] - ordered[1][1] < margin:
        return "all", sims
    return best, sims

In [73]:
# 6) Search with optional scope filter (auto by default)
def search(query, k=5, scope="auto"):
    if scope == "auto":
        scope, sims = choose_scope(query)
    q = model.encode([query], normalize_embeddings=True).astype("float32")
    # over-fetch (k*8 candidates), then filter by domain
    D, I = index.search(q, k*8)
    hits = []
    for score, idx in zip(D[0], I[0]):
        if idx == -1: continue
        m = metas[idx]
        if scope != "all" and m["domain"] != scope:
            continue
        hits.append({"score": float(score), "fact": facts[idx], "meta": m})
        if len(hits) >= k: break
    # if not enough in-scope, backfill with any
    if len(hits) < k:
        for score, idx in zip(D[0], I[0]):
            if idx == -1: continue
            if any(h["meta"] is metas[idx] for h in hits): continue
            hits.append({"score": float(score), "fact": facts[idx], "meta": metas[idx]})
            if len(hits) >= k: break
    return scope, hits
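
The over-fetch, filter, and backfill pattern used in `search` can be seen in isolation on a hand-written ranked list. The tuples and the name `filter_then_backfill` below are hypothetical stand-ins for the index results, not output from the real store.

```python
# ranked: (score, domain, label) tuples, best first, standing in for
# the candidates returned by the vector index.
def filter_then_backfill(ranked, scope, k):
    # 1) keep in-scope candidates, in rank order
    hits = [r for r in ranked if scope == "all" or r[1] == scope][:k]
    # 2) if too few, top up with the best remaining candidates
    if len(hits) < k:
        for r in ranked:
            if r not in hits:
                hits.append(r)
            if len(hits) >= k:
                break
    return hits

ranked = [
    (0.90, "career", "resume tips"),
    (0.80, "technical", "precision vs recall"),
    (0.70, "career", "cover letters"),
    (0.60, "technical", "logistic regression"),
]

hits = filter_then_backfill(ranked, scope="technical", k=3)
for score, domain, label in hits:
    print(f"{score:.2f} {domain:9s} {label}")
```

Note that backfilled items are appended after the in-scope ones, so the final list is not sorted by score overall: here the 0.90 career hit appears last. The notebook's `search` has the same property.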

In [74]:
# 7) Try it out with a few test prompts
tests = [
    "Where did we define precision vs. recall?",
    "tips for a resume and cover letter?",
    "What lecture slides did we learn about control flow?",
]
for q in tests:
    scope, hits = search(q, k=4, scope="auto")
    print(f"\nQ: {q}   [scope={scope}]")
    if not hits: print("  (no hits)"); continue
    for h in hits:
        m = h["meta"]
        cite = f"{m['course_name']} › {m['module_name']} › {m['item_title']} ({m['type']})"
        if m.get("url"): cite += f"  [{m['url']}]"
        print(f"  {h['score']:.3f} :: {cite}")


Q: Where did we define precision vs. recall?   [scope=technical]
  0.393 :: IF '25 Data Science Cohort A › P2W3 (6/23-6/27) Classification Algorithms › 💻 W3D2 (6/24) Logistic Regression Accuracy Metrics (Page)  [https://tkh.instructure.com/courses/172/pages/w3d2-6-slash-24-logistic-regression-accuracy-metrics]
  0.320 :: Foundations '25 Data Science › Week 5:  Statistics(Feb. 24th- Feb. 27th) › What is Data Science? (Page)  [https://tkh.instructure.com/courses/165/pages/what-is-data-science]
  0.299 :: IF '25 Data Science Cohort A › P2W11 (8/18-8/22) Agents & End of Phase Project › 💻 W11D1 (8/18) Applied LLM Review & AI Agents (Page)  [https://tkh.instructure.com/courses/172/pages/w11d1-8-slash-18-applied-llm-review-and-ai-agents]
  0.279 :: Foundations '25 Data Science › Week 5:  Statistics(Feb. 24th- Feb. 27th) › What is Data Science? (Page)  [https://tkh.instructure.com/courses/165/pages/what-is-data-science]

Q: tips for a resume and cover letter?   [scope=career]
  0.402 :: IF '2

In [76]:
import json, os

os.makedirs("data/faiss", exist_ok=True)
faiss.write_index(index, "data/faiss/canvas.index")
with open("data/faiss/facts.json", "w", encoding="utf-8") as f:
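    json.dump({"facts": facts, "metas": metas}, f, ensure_ascii=False)

# Persistence: save / load FAISS store + metadata
import os, json, faiss
from pathlib import Path

# BASE is the project root. It is not defined in the cells above, so fall back
# to the current working directory when an earlier cell has not set it.
BASE = globals().get("BASE", os.getcwd())

STORE_DIR  = f"{BASE}/data/faiss"
INDEX_PATH = f"{STORE_DIR}/canvas.index"
FACTS_PATH = f"{STORE_DIR}/facts.json"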

def save_store(index, facts, metas, store_dir=STORE_DIR):
    Path(store_dir).mkdir(parents=True, exist_ok=True)
    faiss.write_index(index, os.path.join(store_dir, "canvas.index"))
    with open(os.path.join(store_dir, "facts.json"), "w", encoding="utf-8") as f:
        json.dump({"facts": facts, "metas": metas}, f, ensure_ascii=False)
    print(" saved:", INDEX_PATH, "and", FACTS_PATH)

def load_store(store_dir=STORE_DIR):
    idx = faiss.read_index(os.path.join(store_dir, "canvas.index"))
    with open(os.path.join(store_dir, "facts.json"), "r", encoding="utf-8") as f:
        data = json.load(f)
    print(" loaded:", os.path.join(store_dir, "canvas.index"), "and facts.json")
    return idx, data["facts"], data["metas"]

# This saves preceding index, facts, and metadata right after building
save_store(index, facts, metas)

 saved: C:\Users\oneps\Documents\Research_Dev_Documents\DataEden_Github\TEPP-2-SlideHunt-Repo\SlideHunt/data/faiss/canvas.index and C:\Users\oneps\Documents\Research_Dev_Documents\DataEden_Github\TEPP-2-SlideHunt-Repo\SlideHunt/data/faiss/facts.json
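
The JSON half of the store can be round-tripped on its own to confirm that non-ASCII characters such as `›` survive with `ensure_ascii=False` and UTF-8 encoding. This is a minimal sketch using a temporary directory and made-up sample records; it does not touch the FAISS index.

```python
import json
import os
import tempfile

# Hypothetical sample records in the same shape as the notebook's lists.
facts = ["[technical] Demo Course › Week 1 › Intro Page: Variables store values."]
metas = [{"domain": "technical", "type": "Page"}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "facts.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"facts": facts, "metas": metas}, f, ensure_ascii=False)
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)

# The loaded lists should match what was saved, and stay parallel.
print(data["facts"] == facts, data["metas"] == metas)
print(len(data["facts"]) == len(data["metas"]))
```

After a real `load_store` call, a similar sanity check is to compare the loaded index's `ntotal` with `len(facts)`, since a mismatch would mean the index and the facts file came from different builds.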
