# Technical Course Recommender

## Core Steps:
1. Load major requirements and extract technical courses/electives
2. Filter courses based on prerequisites (courses user can take)
3. Filter courses based on postrequisites (courses that lead somewhere)
4. Build text corpus from course titles and descriptions
5. Apply TF-IDF vectorization (1-2 grams) + cosine similarity to user interests
6. Diversify with MMR
7. Return top-N technical course recommendations

In [19]:
pip install fuzzywuzzy

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [86]:
from __future__ import annotations

import pandas as pd
import numpy as np
import json
import sys
from pathlib import Path
from typing import Dict, List, Set, Optional

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from fuzzywuzzy import process

# Add backend to path for importing services
sys.path.append(str(Path('../backend').resolve()))

# Import backend services for robust prerequisite checking and data loading
from app.services.prereq_checker import (
    normalize_course_code,
    can_take_course,
    check_prerequisites_met
)
from app.services.year_detector import detect_student_year

# Add pdf_to_dars to path
sys.path.append(str(Path('../pdf_to_dars').resolve()))

# Import DARS parser (optional - only if available)


## 1. Data Loading Functions

In [87]:
# File paths
MAJORS_FILE = "../data_scraping/output/processed/majors_structured.json"
PREREQ_FILE = "../data_scraping/output/processed/prerequisite_graph.json"
POSTREQ_FILE = "../data_scraping/output/processed/postrequisite_graph.json"

def load_json(filepath: str) -> dict:
    """Load JSON file."""
    with open(filepath, 'r') as f:
        return json.load(f)

def load_major_requirements(filepath: str, major_name: str) -> Dict:
    """Load requirements for a specific major."""
    data = load_json(filepath)
    if major_name not in data:
        raise ValueError(f"Major '{major_name}' not found in data")
    return data[major_name]

# Load prerequisite and postrequisite graphs
prereq_graph = load_json(PREREQ_FILE)
postreq_graph = load_json(POSTREQ_FILE)

print(f"Loaded prerequisite graph with {len(prereq_graph)} courses")
print(f"Loaded postrequisite graph with {len(postreq_graph)} courses")

Loaded prerequisite graph with 7968 courses
Loaded postrequisite graph with 1376 courses


## 2. Extract Technical Courses from Major

In [95]:
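def extract_technical_courses(major_data: Dict) -> Dict[str, List[str]]:
    """
    Extract technical courses from major requirements.

    Requirement groups are classified by keywords in their names. Returns a
    dict with de-duplicated lists under 'required', 'electives', and 'areas'.
    """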
    required_courses = []
    elective_courses = []
    area_courses = []

    for group in major_data.get('requirement_groups', []):
        group_name = group.get('group_name', '').lower()

        # Check if this is a technical group
        is_technical = any(keyword in group_name for keyword in [
            'technical', 'elective', 'cs ', 'computer science',
            'programming', 'advanced', 'areas', 'specialization',
            'core', 'major', 'concentration'
        ])

        is_elective = any(keyword in group_name for keyword in [
            'elective', 'choose', 'select', 'option'
        ])

        is_area = 'area' in group_name or 'specialization' in group_name

        # Extract courses from this group
        for course in group.get('courses', []):
            if isinstance(course, dict) and 'code' in course:
                course_code = course['code']

                if is_area:
                    area_courses.append(course_code)
                elif is_elective or not course.get('required', True):
                    elective_courses.append(course_code)
                elif is_technical or course.get('required', False):
                    required_courses.append(course_code)

        # Also check course_codes list
        for course_code in group.get('course_codes', []):
            if is_area:
                area_courses.append(course_code)
            elif is_elective:
                elective_courses.append(course_code)
            elif is_technical:
                required_courses.append(course_code)

    return {
        'required': list(set(required_courses)),
        'electives': list(set(elective_courses)),
        'areas': list(set(area_courses))
    }

# This will be set when you run the profile cell below


## 3. Prerequisite Filtering with Backend Integration

Prerequisite checks are delegated to the backend's `prereq_checker` service.

In [96]:
def get_eligible_courses_improved(
    candidate_courses: List[str],
    courses_completed: List[str],
    courses_in_progress: List[str],
    prereq_graph: Dict[str, List[str]]
) -> List[str]:
    """
    Filter courses based on prerequisites using backend prereq_checker.
    """
    eligible = []
    completed_or_in_progress = set(courses_completed + courses_in_progress)

    for course in candidate_courses:
        # Build course data dict with prerequisites
        course_data = {
            'prerequisites': prereq_graph.get(course, [])
        }

        # Use backend's robust prerequisite checking
        can_take, missing = can_take_course(
            course,
            completed_or_in_progress,
            course_data
        )

        if can_take:
            eligible.append(course)

    return eligible

def filter_by_postrequisites(
    courses: List[str],
    postreq_graph: Dict[str, List[str]],
    min_postreqs: int = 0
) -> List[str]:
    """
    Filter courses that have postrequisites (lead to other courses).
    Useful for finding courses that open up more options.
    """
    filtered = []
    for course in courses:
        postreqs = postreq_graph.get(course, [])
        if len(postreqs) >= min_postreqs:
            filtered.append(course)
    return filtered

print("ready")

ready
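
As a quick illustration of the postrequisite filter, the sketch below runs the same logic on a made-up three-course graph. The course codes and edges are invented for the example, and the function is restated compactly so the block runs on its own. With the default `min_postreqs=0` every course passes, so the filter only has an effect once the threshold is raised to 1 or more.

```python
from typing import Dict, List

def filter_by_postrequisites(courses: List[str],
                             postreq_graph: Dict[str, List[str]],
                             min_postreqs: int = 0) -> List[str]:
    """Keep courses that unlock at least `min_postreqs` other courses."""
    return [c for c in courses if len(postreq_graph.get(c, [])) >= min_postreqs]

# Hypothetical toy graph: course -> courses it unlocks
toy_postreqs = {
    "AAA 101": ["AAA 201", "AAA 202"],
    "AAA 201": ["AAA 301"],
}
toy_courses = ["AAA 101", "AAA 201", "AAA 301"]

keep_all = filter_by_postrequisites(toy_courses, toy_postreqs)
keep_unlocking = filter_by_postrequisites(toy_courses, toy_postreqs, min_postreqs=1)
keep_hubs = filter_by_postrequisites(toy_courses, toy_postreqs, min_postreqs=2)
print(keep_all, keep_unlocking, keep_hubs)
```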


## 4. Load Real Course Data from CSV

Instead of creating synthetic features, we'll use actual course names and descriptions from the catalog.

In [97]:
# Load course data from CSV (same as gened_recommender)
COURSES_FILE = "../data_scraping/raw_data/all_courses.csv"

def load_course_data(csv_path: str) -> pd.DataFrame:
    """Load course data from CSV file."""
    df = pd.read_csv(csv_path)

    # Normalize course_id: turn non-breaking space into a normal space
    df['course_code'] = (
        df['course_id']
        .astype(str)
        .str.replace('\xa0', ' ', regex=False)   # important!
        .str.strip()
    )

    # Combine name and description for richer text features
    df['text'] = df['name'].fillna('') + ' ' + df['description'].fillna('')

    # Extract subject and number from the CLEANED code, not course_id
    df[['subject', 'number']] = df['course_code'].str.extract(r'^([A-Z]+)\s+(\d+)$')

    # Infer level from the numeric part (or you could just map df['course_level'])
    def get_level(num_str):
        try:
            num = int(num_str)
            if num >= 400:
                return 'advanced'
            elif num >= 200:
                return 'intermediate'
            else:
                return 'introductory'
        except Exception:
            return 'unknown'

    df['level'] = df['number'].apply(get_level)

    return df




# Load all courses
all_courses_df = load_course_data(COURSES_FILE)

print(f"Loaded {len(all_courses_df)} courses from catalog")


Loaded 14382 courses from catalog
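
The profile cell later calls `create_technical_course_dataframe`, whose definition does not appear in this notebook. The block below is only a guess at what such a helper might look like, assuming it subsets the catalog to the major's course codes and adds the `code`, `prereq_count`, and `postreq_count` columns that the recommender reads. The toy catalog values are invented.

```python
from typing import Dict, List
import pandas as pd

def create_technical_course_dataframe(
    course_codes: List[str],
    catalog_df: pd.DataFrame,
    prereq_graph: Dict[str, list],
    postreq_graph: Dict[str, list],
) -> pd.DataFrame:
    """Hypothetical helper: subset the catalog to the major's courses."""
    wanted = set(course_codes)
    df = catalog_df[catalog_df["course_code"].isin(wanted)].copy()
    df = df.rename(columns={"course_code": "code"})
    # Graph sizes as simple numeric features
    df["prereq_count"] = df["code"].map(lambda c: len(prereq_graph.get(c, [])))
    df["postreq_count"] = df["code"].map(lambda c: len(postreq_graph.get(c, [])))
    cols = ["code", "name", "description", "subject", "level",
            "prereq_count", "postreq_count"]
    # Positional index so row i lines up with row i of the TF-IDF matrix
    return df[cols].reset_index(drop=True)

# Toy catalog in the shape produced by load_course_data (values invented)
toy_catalog = pd.DataFrame({
    "course_code": ["AAA 101", "AAA 201", "BBB 300"],
    "name": ["Intro", "Methods", "Other"],
    "description": ["Basics.", "Techniques.", "Unrelated."],
    "subject": ["AAA", "AAA", "BBB"],
    "level": ["introductory", "intermediate", "intermediate"],
})
toy_df = create_technical_course_dataframe(
    ["AAA 101", "AAA 201"], toy_catalog,
    prereq_graph={"AAA 201": ["AAA 101"]},
    postreq_graph={"AAA 101": ["AAA 201"]},
)
print(toy_df)
```

Resetting the index matters because the recommender uses DataFrame index values to select rows of the TF-IDF matrix.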


## 5. TF-IDF Vectorization

In [98]:
def build_text_corpus(df: pd.DataFrame) -> pd.Series:
    """
    Build text corpus from course data.
    """
    return (
        df["name"].fillna("").astype(str) + " " +
        df["description"].fillna("").astype(str) + " " +
        df["subject"].fillna("").astype(str) + " " +
        df["level"].fillna("").astype(str)
    )

def fit_vectorizer(corpus: pd.Series) -> TfidfVectorizer:
    """Fit TF-IDF vectorizer on course corpus."""
    vec = TfidfVectorizer(
        max_df=0.7,           # Ignore terms in >70% of documents
        min_df=2,             # Include terms in at least 2 documents (filters noise)
        ngram_range=(1, 2),   # Use unigrams and bigrams
        stop_words='english'  # Remove English stop words
    )
    vec.fit(corpus)
    return vec

# Build corpus and fit vectorizer.
# NOTE: courses_df is created in the profile cell further down; run that cell first.
corpus = build_text_corpus(courses_df)
vectorizer = fit_vectorizer(corpus)
X_tfidf = vectorizer.transform(corpus)

print(f"\nTF-IDF matrix shape: {X_tfidf.shape}")
print(f"Number of courses: {X_tfidf.shape[0]}")
print(f"Number of features (unique terms): {X_tfidf.shape[1]}")


TF-IDF matrix shape: (43, 952)
Number of courses: 43
Number of features (unique terms): 952
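
To make the scoring step concrete, here is a minimal sketch of matching an interest query against a TF-IDF matrix with cosine similarity. The four course blurbs are invented, and `min_df=1` is used only because the toy corpus is too small for the notebook's `min_df=2`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented course blurbs, for illustration only
toy_corpus = [
    "molecular genetics and gene regulation",
    "molecular biology of the cell",
    "probability and statistics for biology",
    "organic chemistry laboratory",
]
# Unigrams and bigrams, English stop words removed
vec = TfidfVectorizer(ngram_range=(1, 2), stop_words="english", min_df=1)
X = vec.fit_transform(toy_corpus)

# The query is projected into the same vocabulary as the courses
query = vec.transform(["molecular genetics"])
sims = cosine_similarity(query, X).flatten()
ranking = sims.argsort()[::-1]  # best match first
for i in ranking:
    print(f"{sims[i]:.3f}  {toy_corpus[i]}")
```

Courses that share no vocabulary with the query score exactly 0, which is why the recommender later drops zero-score rows.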


## 6. Hybrid Recommender: TF-IDF + Rule-Based Boosting

Combines interest-based matching (TF-IDF) with curriculum rules (boosting).
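
As a worked example of how the two parts combine, the sketch below reproduces the numbers for MCB 250 from the sample run at the end of this notebook. That run reports the year label `first_year`, which is not a key in `year_to_level`, so the code's fallback expected level of 2 applies. The variable names here are illustrative only.

```python
# Values copied from the MCB 250 row of the sample output
interest_score = 0.205722
postreq_count = 16
course_level = 2      # MCB 250 is a 200-level course
expected_level = 2    # fallback when the year label is not in year_to_level

boost = 1.0
if postreq_count > 5:  # prefer_foundational=True in the sample profile
    boost *= 1.2       # foundational boost
if course_level in (expected_level, expected_level + 1):
    boost *= 1.1       # level-appropriate boost

# Final score is the interest score scaled by the compounded boost
final_score = interest_score * boost
print(round(boost, 2), round(final_score, 6))
```

The two boosts compound to 1.2 × 1.1 = 1.32, matching the `rule_boost` column for that course.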

In [None]:
def get_course_level(course_code: str) -> int:
    """
    Extract the course level (hundreds digit) from a code, e.g. 'MCB 250' -> 2.
    """
    if not course_code or not isinstance(course_code, str):
        return 5  # Default to high if can't parse

    import re
    match = re.match(r'[A-Z]{2,4}\s*(\d)(\d{2})', course_code.upper())
    if match:
        return int(match.group(1))
    return 5  # Default to high if can't parse


def calculate_rule_based_boost(
    course_row: pd.Series,
    student_year: str,
    prefer_foundational: bool = False,
    prefer_advanced: bool = False
) -> float:
    """
    Calculate rule-based boost multiplier for a course.

    Returns a multiplier (e.g., 1.0 = no change, 1.3 = +30%, 0.8 = -20%).
    Multipliers compound when several rules apply.

    Boosts applied:
    - Foundational courses (postreq_count > 5), if prefer_foundational: +20%
    - Advanced-level courses, if prefer_advanced: +30%
    - Courses at or one level above the student's expected level: +10%

    Penalties applied:
    - More than one level above the student's expected level: -30%
    - 100-level or lower courses for juniors and seniors: -20%
    """
    boost = 1.0

    # 1. Foundational boost (courses that unlock many others)
    if prefer_foundational and course_row.get('postreq_count', 0) > 5:
        boost *= 1.2  # +20%

    # 2. Advanced course preference
    if prefer_advanced and course_row.get('level') == 'advanced':
        boost *= 1.3  # +30%

    # 3. Level appropriateness based on student year
    course_level = get_course_level(course_row['code'])
    year_to_level = {
        'freshman': 1,
        'sophomore': 2,
        'junior': 3,
        'senior': 4
    }
    expected_level = year_to_level.get(student_year, 2)

    # Boost courses at or slightly above student level
    if course_level == expected_level or course_level == expected_level + 1:
        boost *= 1.1  # +10%

    # Penalize courses too advanced
    elif course_level > expected_level + 1:
        boost *= 0.7  # -30%

    # Penalize very basic courses for advanced students
    elif student_year in ['junior', 'senior'] and course_level <= 1:
        boost *= 0.8  # -20%

    return boost


def fix_typo(text: str, valid_terms: Set[str]) -> str:
    """Fix typos in user input using fuzzy matching."""
    if not text:
        return ""

    fixed = []
    for word in text.lower().split():
        word = word.strip()
        if word in valid_terms:
            fixed.append(word)
        elif valid_terms:  # Only do fuzzy match if we have valid terms
            match, score = process.extractOne(word, list(valid_terms))
            fixed.append(match if score > 70 else word)
        else:
            fixed.append(word)
    return ' '.join(fixed)


def mmr_diversify(
    scores: np.ndarray,
    X: np.ndarray,
    topk: int,
    lambda_param: float = 0.7
) -> List[int]:
    """Maximal Marginal Relevance for diversity."""
    selected = []
    candidates = np.arange(len(scores))

    # Select highest scoring course first
    first = np.argmax(scores)
    selected.append(first)
    candidates = np.delete(candidates, np.where(candidates == first))

    while len(selected) < topk and len(candidates) > 0:
        mmr_scores = []
        for c in candidates:
            relevance = scores[c]

            # Calculate similarity to already selected courses
            if len(selected) > 0:
                sims_to_selected = cosine_similarity(
                    X[c:c+1], X[selected]
                ).flatten()
                max_sim = np.max(sims_to_selected)
            else:
                max_sim = 0

            # MMR score: balance relevance and diversity
            mmr = lambda_param * relevance - (1 - lambda_param) * max_sim
            mmr_scores.append(mmr)

        # Select best MMR score
        best_idx = np.argmax(mmr_scores)
        best_candidate = candidates[best_idx]
        selected.append(best_candidate)
        candidates = np.delete(candidates, best_idx)

    return selected


def recommend_technical_courses_hybrid(
    profile: Dict,
    major_name: str,
    df: pd.DataFrame,
    X_tfidf: np.ndarray,
    vectorizer: TfidfVectorizer,
    prereq_graph: Dict[str, List[str]],
    postreq_graph: Dict[str, List[str]],
    topk: int = 20,
    mmr_lambda: float = 0.7,
    tfidf_weight: float = 0.6,
    rule_weight: float = 0.4
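) -> pd.DataFrame:
    """
    Recommend technical courses by combining TF-IDF interest matching with
    rule-based boosts, then diversifying the ranking with MMR.

    Returns a DataFrame of up to `topk` courses sorted by final score,
    or an empty DataFrame if no course is eligible.
    """
    # Extract profile parameters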
    interests = profile.get('interests', '')
    completed = profile.get('courses_completed', [])
    in_progress = profile.get('courses_in_progress', [])
    prefer_foundational = profile.get('prefer_foundational', False)
    prefer_advanced = profile.get('prefer_advanced', False)

    # STEP 1: Rule-Based Filtering (prerequisites)
    # =============================================
    all_courses = df['code'].tolist()

    # FIX 1: Exclude courses already completed or in-progress
    already_taken = set(completed + in_progress)
    candidate_courses = [c for c in all_courses if c not in already_taken]

    print(f"Total courses: {len(all_courses)}")
    print(f"Already completed or in-progress: {len(already_taken)}")
    print(f"Candidates after excluding taken courses: {len(candidate_courses)}")

    eligible_courses = get_eligible_courses_improved(
        candidate_courses,
        completed,
        in_progress,
        prereq_graph
    )

    # Detect student year for level-based boosting
    student_year = detect_student_year(completed)
    print(f"Detected student year: {student_year}")

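    # FIX 2: Hard filter for course levels based on student year.
    # A student may take courses up to one level above their year:
    # freshmen up to 200-level, sophomores up to 300-level,
    # juniors up to 400-level, seniors up to 500-level.
    # Unrecognized year labels fall back to the sophomore setting.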
    level_filtered = []
    year_to_level = {
        'freshman': 1,
        'sophomore': 2,
        'junior': 3,
        'senior': 4
    }
    max_level_allowed = year_to_level.get(student_year, 2) + 1  # Can take 1 level above

    for course in eligible_courses:
        course_level = get_course_level(course)
        if course_level <= max_level_allowed:
            level_filtered.append(course)

    print(f"After level filtering (max level {max_level_allowed}): {len(level_filtered)} courses")

    # Filter dataframe to only eligible courses
    df_eligible = df[df['code'].isin(level_filtered)].copy()
    eligible_indices = df_eligible.index.tolist()

    if len(df_eligible) == 0:
        print("\n⚠️ No eligible courses found!")
        print("  This might be because:")
        print("  - All appropriate-level courses require prerequisites you haven't met")
        print("  - You've already completed/are taking most available courses")
        return pd.DataFrame()

    print(f"Final eligible courses: {len(df_eligible)}")

    # Get TF-IDF vectors for eligible courses
    X_eligible = X_tfidf[eligible_indices]

    # STEP 2: TF-IDF Interest Scoring
    # =================================
    # Use interests as-is (no typo fixing needed with real course descriptions)
    print(f"\nUser interests: '{interests}'")

    # Vectorize user interests
    query_vec = vectorizer.transform([interests])

    # Calculate cosine similarity (0-1 range)
    interest_scores = cosine_similarity(query_vec, X_eligible).flatten()

    # STEP 3: Rule-Based Boosting
    # ============================
    rule_boosts = np.array([
        calculate_rule_based_boost(
            row,
            student_year,
            prefer_foundational,
            prefer_advanced
        )
        for _, row in df_eligible.iterrows()
    ])

    # STEP 4: Combine Scores (Interest Score x Rule Boost)
    # =====================================================
    # Method: multiply the interest score by the rule boost.
    # NOTE: tfidf_weight and rule_weight are accepted but not used here.
    boosted_scores = interest_scores * rule_boosts

    non_zero_mask = interest_scores > 0
    if non_zero_mask.sum() > 0:
        print(f"  Courses with interest match: {non_zero_mask.sum()}")

        # Keep only non-zero interest courses
        df_eligible = df_eligible[non_zero_mask].copy()
        eligible_indices = [idx for idx, mask in zip(eligible_indices, non_zero_mask) if mask]
        X_eligible = X_eligible[non_zero_mask]
        interest_scores = interest_scores[non_zero_mask]
        rule_boosts = rule_boosts[non_zero_mask]
        boosted_scores = boosted_scores[non_zero_mask]

    print(f"\nScore statistics:")
    print(f"  Interest scores: mean={interest_scores.mean():.3f}, max={interest_scores.max():.3f}")
    print(f"  Rule boosts: mean={rule_boosts.mean():.3f}, max={rule_boosts.max():.3f}")
    print(f"  Final scores: mean={boosted_scores.mean():.3f}, max={boosted_scores.max():.3f}")

    # STEP 5: Apply MMR for Diversity
    # ================================
    top_indices = mmr_diversify(boosted_scores, X_eligible, min(topk, len(df_eligible)), mmr_lambda)

    # Get results
    result_df_indices = [eligible_indices[i] for i in top_indices]
    result = df.loc[result_df_indices].copy()
    result['interest_score'] = interest_scores[top_indices]
    result['rule_boost'] = rule_boosts[top_indices]
    result['final_score'] = boosted_scores[top_indices]
    result = result.sort_values('final_score', ascending=False)

    return result[[
        'code', 'name', 'subject', 'level', 'prereq_count', 'postreq_count',
        'interest_score', 'rule_boost', 'final_score'
    ]]
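
The effect of the MMR step is easiest to see on a tiny example. The sketch below is a compact, vectorized restatement of the greedy selection in `mmr_diversify`, not the notebook's exact code, applied to three invented items where the two highest-scoring ones are near-duplicates.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def mmr_order(scores, X, topk, lambda_param=0.7):
    """Greedy MMR: trade relevance against similarity to items already picked."""
    selected = [int(np.argmax(scores))]
    candidates = [i for i in range(len(scores)) if i != selected[0]]
    while candidates and len(selected) < topk:
        # Highest similarity of each candidate to anything already selected
        sims = cosine_similarity(X[candidates], X[selected]).max(axis=1)
        mmr = lambda_param * scores[candidates] - (1 - lambda_param) * sims
        selected.append(candidates.pop(int(np.argmax(mmr))))
    return selected

# Items 0 and 1 point in almost the same direction; item 2 is orthogonal
toy_scores = np.array([0.9, 0.85, 0.5])
toy_X = np.array([[1.0, 0.0], [1.0, 0.05], [0.0, 1.0]])

diverse = mmr_order(toy_scores, toy_X, topk=3, lambda_param=0.7)
relevance_only = mmr_order(toy_scores, toy_X, topk=3, lambda_param=1.0)
print(diverse, relevance_only)
```

With `lambda_param=0.7` the distinct item is promoted ahead of the near-duplicate, while `lambda_param=1.0` reduces to a plain relevance ranking. Note that the recommender re-sorts its output by `final_score`, so MMR there decides which courses are kept rather than their displayed order.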

In [103]:


YOUR_MAJOR = "Biochemistry, BS"  

YOUR_COMPLETED = ['CHEM 102', 'BIOL 110']  

YOUR_IN_PROGRESS = [  
    'CHEM 104', 'BIOL 120', 'MATH 220'
]


YOUR_INTERESTS = 'genetics molecular biology cell signaling'  


PREFER_ADVANCED = False      # Set True if you want 400-level courses
PREFER_FOUNDATIONAL = True   # Set True to prioritize courses that unlock more courses

print(f"Completed: {len(YOUR_COMPLETED)} course(s)")
print(f"In Progress: {len(YOUR_IN_PROGRESS)} courses")
print(f"Interests: {YOUR_INTERESTS}")
print(f"Preferences: advanced={PREFER_ADVANCED}, foundational={PREFER_FOUNDATIONAL}")




try:
    major_data = load_major_requirements(MAJORS_FILE, YOUR_MAJOR)
    tech_courses = extract_technical_courses(major_data)

    print(f"\n✓ Loaded {YOUR_MAJOR}")
    print(f"  Required courses: {len(tech_courses['required'])}")
    print(f"  Elective courses: {len(tech_courses['electives'])}")
    print(f"  Area courses: {len(tech_courses['areas'])}")

    # Create course dataframe
    all_technical = tech_courses['required'] + tech_courses['electives'] + tech_courses['areas']
    courses_df = create_technical_course_dataframe(all_technical, all_courses_df, prereq_graph, postreq_graph)

    print(f"\n✓ Loaded {len(courses_df)} courses for your major")

    # Build TF-IDF for this major's courses
    corpus = build_text_corpus(courses_df)
    vectorizer = fit_vectorizer(corpus)
    X_tfidf = vectorizer.transform(corpus)

    print(f"\n✓ Built recommendation model ({X_tfidf.shape[1]} features)")

except Exception as e:
    print(f"\n❌ Error loading major: {e}")
    print("\nAvailable majors (first 30):")
    majors = load_json(MAJORS_FILE)
    for major_name in sorted(majors.keys())[:30]:
        print(f"  - {major_name}")
    raise

# Create profile
my_profile = {
    'interests': YOUR_INTERESTS,
    'courses_completed': YOUR_COMPLETED,
    'courses_in_progress': YOUR_IN_PROGRESS,
    'prefer_advanced': PREFER_ADVANCED,
    'prefer_foundational': PREFER_FOUNDATIONAL
}


my_recommendations = recommend_technical_courses_hybrid(
    my_profile,
    YOUR_MAJOR,
    courses_df,
    X_tfidf,
    vectorizer,
    prereq_graph,
    postreq_graph,
    topk=20
)

# Display results
print("\n" + "="*80)
print(f"TOP {len(my_recommendations)} RECOMMENDED COURSES")
print("="*80)

if not my_recommendations.empty:
    for idx, row in my_recommendations.iterrows():
        print(f"\n{row['code']:15} | Level: {row['level']:15} | Score: {row['final_score']:.3f}")
        print(f"  {row['name']}")
        print(f"  Interest Match: {row['interest_score']:.3f} | Curriculum Boost: {row['rule_boost']:.2f}x")
        print(f"  Prerequisites: {row['prereq_count']} | Unlocks: {row['postreq_count']} courses")

    print("\n" + "="*80)
    print("SUMMARY TABLE")
    print("="*80)
    display(my_recommendations)
else:
    print("\n⚠️ No recommendations found. See the messages printed above for possible reasons.")




Completed: 2 course(s)
In Progress: 3 courses
Interests: genetics molecular biology cell signaling
Preferences: advanced=False, foundational=True

✓ Loaded Biochemistry, BS
  Required courses: 27
  Elective courses: 0
  Area courses: 0

✓ Loaded 43 courses for your major

✓ Built recommendation model (952 features)
Total courses: 43
Already completed or in-progress: 5
Candidates after excluding taken courses: 39
Detected student year: first_year
After level filtering (max level 3): 16 courses
Final eligible courses: 16

User interests: 'genetics molecular biology cell signaling'

Filtering to courses with non-zero interest scores...
  Courses with interest match: 6

Score statistics:
  Interest scores: mean=0.095, max=0.206
  Rule boosts: mean=1.207, max=1.320
  Final scores: mean=0.121, max=0.272

TOP 6 RECOMMENDED COURSES

MCB 250         | Level: intermediate    | Score: 0.272
  Molecular Genetics
  Interest Match: 0.206 | Curriculum Boost: 1.32x
  Prerequisites: 3 | Unlocks: 16 cou

| | code | name | subject | level | prereq_count | postreq_count | interest_score | rule_boost | final_score |
|---|---|---|---|---|---|---|---|---|---|
| 26 | MCB 250 | Molecular Genetics | MCB | intermediate | 3 | 16 | 0.205722 | 1.32 | 0.271553 |
| 32 | MCB 250 | Molecular Genetics | MCB | intermediate | 3 | 16 | 0.205722 | 1.32 | 0.271553 |
| 25 | MCB 150 | Molecular & Cellular Basis of Life | MCB | introductory | 0 | 14 | 0.053384 | 1.2 | 0.06406 |
| 31 | MCB 150 | Molecular & Cellular Basis of Life | MCB | introductory | 0 | 14 | 0.053384 | 1.2 | 0.06406 |
| 41 | STAT 212 | Biostatistics | STAT | intermediate | 0 | 2 | 0.025414 | 1.1 | 0.027956 |
| 42 | STAT 212 | Biostatistics | STAT | intermediate | 0 | 2 | 0.025414 | 1.1 | 0.027956 |
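
The summary output lists MCB 250, MCB 150, and STAT 212 twice each under different row indices, which suggests the course DataFrame contains repeated codes. One possible guard, sketched here on invented rows rather than taken from the notebook, is to drop repeated codes before fitting the vectorizer so that the frame and the TF-IDF matrix stay aligned row for row.

```python
import pandas as pd

# Toy frame with repeated course codes (rows invented for the example)
toy = pd.DataFrame({
    "code": ["MCB 250", "MCB 150", "MCB 250", "STAT 212", "MCB 150"],
    "name": ["Molecular Genetics", "Molecular & Cellular Basis of Life",
             "Molecular Genetics", "Biostatistics",
             "Molecular & Cellular Basis of Life"],
})

# Keep the first row per code and rebuild a positional index
deduped = toy.drop_duplicates(subset="code", keep="first").reset_index(drop=True)
print(deduped)
```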
