# Content-Based Recommendation System

This notebook implements a content-based filtering approach using **TF-IDF (Term Frequency-Inverse Document Frequency)** to match user interests with academic programs.

## Why Content-Based Filtering?
- Works well with cold-start problems (new users/programs)
- Provides transparent, explainable recommendations
- Doesn't require historical interaction data
- Ideal for academic program matching based on interests

## Import Libraries
We use scikit-learn for text vectorization and similarity computation.


In [4]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
import joblib


In [5]:
users = pd.read_csv("../data/raw/users.csv")
programs = pd.read_csv("../data/raw/programs.csv")
interactions = pd.read_csv("../data/raw/interactions.csv")


print(f"Total data: {len(users)} users, {len(programs)} programs, {len(interactions)} interactions")

Total data: 100 users, 15 programs, 440 interactions


## Train/Test Split (80/20)

Split interactions into training (80%) and testing (20%) sets for proper model evaluation.

**Note:** Content-based filtering doesn't use interaction history for training (only program features), but we create this split now so all notebooks use the same train/test data.

In [6]:
# Split interactions 80/20 for train and test
train_interactions, test_interactions = train_test_split(
    interactions,
    test_size=0.2,
    random_state=42
)

print("\nTrain/Test Split:")
print(f"  Training: {len(train_interactions)} interactions ({len(train_interactions)/len(interactions)*100:.1f}%)")
print(f"  Testing:  {len(test_interactions)} interactions ({len(test_interactions)/len(interactions)*100:.1f}%)")

# Save splits for use in other notebooks
train_interactions.to_csv("../data/processed/train_interactions.csv", index=False)
test_interactions.to_csv("../data/processed/test_interactions.csv", index=False)

print("\n✓ Train/test splits saved to ../data/processed/")



Train/Test Split:
  Training: 352 interactions (80.0%)
  Testing:  88 interactions (20.0%)

✓ Train/test splits saved to ../data/processed/


## Load Generated Data
Load the user and program datasets created in the data generation notebook.

In [7]:
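# Combine description and tags into one text field; missing values become empty strings
programs["text"] = programs["description"].fillna("") + " " + programs["tags_text"].fillna("")
# Lowercase for consistent matching
programs["text"] = programs["text"].str.lower()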


## Prepare Program Text Features
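Combine program descriptions and tags into a single text field for analysis. The text is lowercased for consistent matching; `TfidfVectorizer` also lowercases its input by default, so this step is a safeguard rather than a requirement.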

**Why combine description + tags?**
- Descriptions provide context and detail
- Tags capture key concepts succinctly
- Together they give the vectorizer more terms to match against user interests

In [8]:
tfidf = TfidfVectorizer(
    stop_words="english",
    max_features=500
)

program_tfidf = tfidf.fit_transform(programs["text"])


## Build TF-IDF Vectorizer
Transform program text into numerical features using TF-IDF.

**Note:** Content-based filtering uses program features only (not interaction history), so the vectorizer is fitted on all programs regardless of the train/test split.

**Why TF-IDF?**
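- Weights a term by how often it appears in a program's text (TF) and down-weights terms that appear across many programs (IDF)
- More discriminative than raw word counts, which let frequent generic terms dominate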
- Standard approach for text-based recommendations

**Hyperparameters:**

- `stop_words="english"`: Removes common words like "the", "and"
- `max_features=500`: Limits vocabulary size for efficiency
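To make the TF and IDF roles concrete, here is a minimal from-scratch sketch of the weighting, using the smoothed IDF formula that scikit-learn applies by default (`smooth_idf=True`) followed by L2 normalization. The function name `tfidf_vectors` and the three toy documents are illustrative assumptions, not part of the notebook's data.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalized TF-IDF vectors) for tokenized docs."""
    n_docs = len(docs)
    vocab = sorted({term for doc in docs for term in doc})
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        raw = [tf[t] * idf[t] for t in vocab]
        # Guard against division by zero for an empty document.
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        vectors.append([w / norm for w in raw])
    return vocab, vectors

# Hypothetical toy corpus: "design" appears everywhere, "studio" only once.
docs = [
    "design studio drawing".split(),
    "design software programming".split(),
    "design buildings drawing".split(),
]
vocab, vectors = tfidf_vectors(docs)
weights = dict(zip(vocab, vectors[0]))

for term in ("design", "drawing", "studio"):
    print(f"{term:8s} {weights[term]:.3f}")
```

In the first document each term occurs once, so the differences in weight come entirely from IDF: "design" (present in every document) receives the lowest weight and "studio" (present in one) the highest.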

In [9]:
user_tfidf = tfidf.transform(users["interests_text"])
similarity = cosine_similarity(user_tfidf, program_tfidf)


## Calculate User-Program Similarity
Transform user interests using the same TF-IDF model, then compute cosine similarity between all users and programs.

**Why Cosine Similarity?**
- Measures the cosine of the angle between vectors; TF-IDF weights are non-negative, so scores fall in the 0-1 range
- Ignores magnitude, focuses on direction
- Standard metric for text similarity
- Efficient for large-scale comparisons
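The properties listed above can be checked with a small hand-rolled version of the metric. This is a sketch for illustration only; the notebook itself relies on scikit-learn's `cosine_similarity`. The helper name `cosine` and the toy vectors are assumptions, as is the choice to return 0.0 for an all-zero vector (for example, a user whose interests contain no vocabulary terms).

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Assumption for this sketch: an all-zero vector matches nothing.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

user = [1.0, 2.0, 0.0]
same_direction = [2.0, 4.0, 0.0]   # twice the magnitude, same direction
no_overlap = [0.0, 0.0, 5.0]       # shares no non-zero dimension with `user`

print(cosine(user, same_direction))  # approximately 1.0
print(cosine(user, no_overlap))      # 0.0
```

Doubling every weight leaves the score unchanged, which is why a long program description does not automatically outrank a short one.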

In [10]:
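def recommend_programs(user_idx, k=3):
    """Return the k programs most similar to the user at row `user_idx`."""
    scores = similarity[user_idx]
    # argsort is ascending, so reverse it and keep the first k indices
    top_idx = np.argsort(scores)[::-1][:k]
    return programs.iloc[top_idx][["program_id", "name"]]

recommend_programs(0)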

| | program_id | name |
|---|---|---|
| 7 | p_7 | Graphic Design |
| 0 | p_0 | Computer Science |
| 9 | p_9 | Architecture |


## Recommendation Function
Create a function to recommend top-k programs for any user.

**Parameters:**
- `user_idx`: Row position of the user in `users` (and in the `similarity` matrix)
- `k`: Number of recommendations (default=3)

The function sorts similarity scores in descending order and returns the top matches.
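The sort-and-slice step can be traced on a toy score vector. The sketch below also adds an optional `min_score` filter, which is an assumption of this example and not something `recommend_programs` does; without it, programs with a similarity of zero can still be returned when fewer than `k` programs match.

```python
import numpy as np

def top_k_indices(scores, k=3, min_score=None):
    """Indices of the k highest scores, best first."""
    scores = np.asarray(scores)
    # argsort sorts ascending; reversing gives descending order.
    order = np.argsort(scores)[::-1]
    if min_score is not None:
        # Hypothetical extension: drop entries at or below the threshold.
        order = order[scores[order] > min_score]
    return order[:k]

toy_scores = np.array([0.10, 0.72, 0.00, 0.35, 0.58])

top = top_k_indices(toy_scores, k=3)
print(top.tolist())       # [1, 4, 3]

filtered = top_k_indices(toy_scores, k=5, min_score=0.0)
print(filtered.tolist())  # [1, 4, 3, 0]
```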

In [11]:
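def explain(user_idx, program_idx):
    """List the terms shared by a user's interests and a program's tags."""
    # Lowercase both sides so the comparison is case-insensitive
    user_terms = set(users.iloc[user_idx]["interests_text"].lower().split())
    program_terms = set(programs.iloc[program_idx]["tags_text"].lower().split())
    # Sort the overlap so the output order is deterministic
    overlap = sorted(user_terms & program_terms)
    return f"Shared interests: {', '.join(overlap)}"

explain(0, 2)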


'Shared interests: drawing'

## Explanation Function
Provide transparency by showing which of the user's interest terms also appear in the program's tags.

**Why Explainability Matters:**
- Builds user trust in recommendations
- Helps users understand why programs were suggested
- Critical for academic/career decisions
- Supports fairness and transparency requirements

In [12]:
joblib.dump(tfidf, "../models/tfidf.pkl")
joblib.dump(program_tfidf, "../models/program_tfidf.pkl")


['../models/program_tfidf.pkl']

## Save Models
Persist the fitted TF-IDF vectorizer and the program TF-IDF matrix for deployment.

The API will load these artifacts to generate real-time recommendations without refitting.
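As a rough sketch of the serving side, the snippet below saves toy versions of both artifacts to a temporary directory, reloads them with `joblib.load`, and scores a new user's free-text interests. The API itself is not shown in this notebook, so the function `recommend_for_text`, the toy program data, and the temporary directory are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-ins for the notebook's fitted objects.
program_names = ["Graphic Design", "Computer Science", "Architecture"]
program_texts = [
    "visual design drawing typography",
    "programming algorithms software data",
    "buildings design drawing structures",
]
tfidf = TfidfVectorizer(stop_words="english")
program_tfidf = tfidf.fit_transform(program_texts)

with tempfile.TemporaryDirectory() as model_dir:
    # Save and reload, mirroring what a separate API process would do.
    joblib.dump(tfidf, Path(model_dir) / "tfidf.pkl")
    joblib.dump(program_tfidf, Path(model_dir) / "program_tfidf.pkl")
    loaded_tfidf = joblib.load(Path(model_dir) / "tfidf.pkl")
    loaded_programs = joblib.load(Path(model_dir) / "program_tfidf.pkl")

def recommend_for_text(interests_text, vectorizer, program_matrix, names, k=2):
    """Score free-text interests against the stored program matrix."""
    # transform (not fit_transform) reuses the vocabulary learned at fit time.
    user_vec = vectorizer.transform([interests_text])
    scores = cosine_similarity(user_vec, program_matrix).ravel()
    top = scores.argsort()[::-1][:k]
    return [(names[i], float(scores[i])) for i in top]

results = recommend_for_text(
    "software algorithms programming", loaded_tfidf, loaded_programs, program_names
)
print(results[0])
```

Pickled scikit-learn objects are generally expected to be loaded with the same library version that produced them, so pinning the version in the API environment is a sensible precaution.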