### Content-based filtering

Content-based filtering [1] is a recommendation approach that suggests items to users based on the similarity between the features of the items and the user's profile. In educational contexts, this means recommending learning materials (e.g., videos, lessons, quizzes) by analyzing their content, such as subject, difficulty, format, or tags, and matching it against what the user prefers or has previously engaged with.



#### Step 1: Import libraries
We use:
- `pandas` to load and manipulate data
- `TfidfVectorizer` to turn text into numerical vectors
- `cosine_similarity` to measure similarity between vectors

In [12]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

#### Step 2: Load data
The `content_data.csv` file contains columns such as:
- Subject
- Topic Tags
- Format (Video, Text, etc.)
- Difficulty

The `user_profiles.csv` file contains columns such as:
- Preferred Subjects
- Preferred Format
- Skills Mastered

We will use these fields to match users with the most suitable content.


In [14]:
content_df = pd.read_csv("content_data.csv")
content_df.head()

| | Content ID | Title | Subject | Grade Level | Topic Tags | Difficulty | Format | Duration (min) |
|---|---|---|---|---|---|---|---|---|
| 0 | C001 | Physics Lesson 1 | Physics | 6 | Thermodynamics, Optics | Easy | Text | 9 |
| 1 | C002 | Physics Lesson 2 | Physics | 11 | Thermodynamics, Mechanics | Hard | Quiz | 6 |
| 2 | C003 | Math Lesson 3 | Math | 6 | Algebra, Trigonometry | Hard | Video | 22 |
| 3 | C004 | History Lesson 4 | History | 11 | Modern, Medieval | Easy | Quiz | 23 |
| 4 | C005 | Art Lesson 5 | Art | 12 | Painting, Art History | Hard | Quiz | 15 |


In [15]:
user_df = pd.read_csv("user_profiles.csv")
display(user_df.head())

| | User ID | Preferred Subjects | Preferred Format | Grade Level | Skills Mastered |
|---|---|---|---|---|---|
| 0 | Student1 | Chemistry, Economics | Interactive | 10 | Problem Solving, Creativity |
| 1 | Student2 | History, ComputerScience | Interactive | 10 | Collaboration, Creativity |
| 2 | Student3 | Math, Economics | Quiz | 12 | Creativity, Critical Thinking |
| 3 | Student4 | Math, Chemistry | Text | 11 | Creativity, Problem Solving |
| 4 | Student5 | Economics, Art | Quiz | 10 | Critical Thinking, Scientific Method |


#### Step 3: Feature vectorization
To compare items using machine learning, we need to convert them into numerical form.
We first combine several descriptive fields of each content item into one string. This `combined_features` column will be the basis for our recommendation engine.

**TF-IDF (Term Frequency - Inverse Document Frequency)** is a technique for converting text into numerical vectors.

It gives more weight to terms that occur in few documents (in our case, content items) and less weight to terms that occur in most of them.
After vectorization, each content item is represented as a vector of term weights that capture its distinguishing features.

In [16]:
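# Merge the descriptive fields into one text string per content item
content_df['combined_features'] = (
    content_df['Subject'] + ' ' +
    content_df['Topic Tags'] + ' ' +
    content_df['Format'] + ' ' +
    content_df['Difficulty']
)

# Learn the vocabulary from the content and build the item-by-term TF-IDF matrix
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(content_df['combined_features'])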
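Concatenating string columns with `+` yields a missing value for any row where one of the fields is missing, and the vectorizer cannot process missing values. If the data might contain gaps, one option is a small helper like the sketch below. The function name `combine_text_columns` and the demo frame are hypothetical, introduced only for illustration.

```python
import pandas as pd

def combine_text_columns(df, columns):
    """Join the given columns into one space-separated string per row.

    Missing values become empty strings, so the result never contains NaN.
    """
    cleaned = df[columns].fillna("").astype(str)
    joined = cleaned.agg(" ".join, axis=1)
    # Collapse the double spaces left behind by empty fields.
    return joined.str.replace(r"\s+", " ", regex=True).str.strip()

# Hypothetical two-row frame with one missing 'Topic Tags' value.
demo_df = pd.DataFrame({
    "Subject": ["Physics", "Math"],
    "Topic Tags": ["Thermodynamics, Optics", None],
    "Format": ["Text", "Video"],
    "Difficulty": ["Easy", "Hard"],
})
demo_df["combined_features"] = combine_text_columns(
    demo_df, ["Subject", "Topic Tags", "Format", "Difficulty"]
)
print(demo_df["combined_features"].tolist())
```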

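To compare users with content, we need to vectorize their preferences too. We reuse the vectorizer already fitted on the content (calling `transform`, not `fit_transform`), so user and content vectors share the same vocabulary; profile terms that never appear in any content item are ignored.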

We build a text profile of each user by combining:
- Preferred subjects
- Preferred content format
- Skills they've mastered

In [17]:
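# Merge each user's preferences into one text string
user_df['user_profile_text'] = (
    user_df['Preferred Subjects'] + ' ' +
    user_df['Preferred Format'] + ' ' +
    user_df['Skills Mastered']
)

# Project the profiles into the vocabulary learned from the content
user_tfidf = vectorizer.transform(user_df['user_profile_text'])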

#### Step 4: Recommendation for learning resources
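Now we match a user to content using **cosine similarity**, the cosine of the angle between two vectors. For TF-IDF vectors, which have no negative entries, it ranges from 0 (no shared terms) to 1 (same direction).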

For our chosen user (Student10), we:
1. Compare their profile with every content item's vector
2. Sort the content by similarity
3. Show the top 5 recommendations

These are the learning resources most aligned with the user's interests.
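As a sanity check on what the score means, the sketch below computes cosine similarity by hand on small made-up vectors (hypothetical values, not from the dataset) and compares the result with scikit-learn's `cosine_similarity`.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical 3-term vectors (not taken from the dataset).
user_vec = np.array([[1.0, 1.0, 0.0]])
item_vecs = np.array([
    [1.0, 1.0, 0.0],   # same direction as the user
    [1.0, 0.0, 1.0],   # shares one term with the user
    [0.0, 0.0, 1.0],   # shares no terms with the user
])

# cos(u, v) = (u . v) / (||u|| * ||v||)
dots = item_vecs @ user_vec.ravel()
norms = np.linalg.norm(item_vecs, axis=1) * np.linalg.norm(user_vec)
manual = dots / norms

# The library call returns a (1, n_items) array; take its only row.
library = cosine_similarity(user_vec, item_vecs)[0]

print(np.round(manual, 3))
print(np.allclose(manual, library))
```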

In [18]:
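user_id = "Student10"
# Row position of the chosen user (assumes the default 0..n-1 index from read_csv)
user_index = user_df[user_df['User ID'] == user_id].index[0]
user_vector = user_tfidf[user_index]

# Similarity between this user and every content item; shape (1, n_items)
similarities = cosine_similarity(user_vector, tfidf_matrix)

# Indices of the 5 highest scores, best first
top_indices = similarities[0].argsort()[-5:][::-1]
recommended = content_df.iloc[top_indices]

recommended[['Content ID', 'Title', 'Subject', 'Topic Tags', 'Format']]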

| | Content ID | Title | Subject | Topic Tags | Format |
|---|---|---|---|---|---|
| 34 | C035 | Biology Lesson 35 | Biology | Evolution, Genetics | Text |
| 16 | C017 | Biology Lesson 17 | Biology | Evolution, Ecology | Text |
| 8 | C009 | History Lesson 9 | History | Ancient, Modern | Text |
| 37 | C038 | History Lesson 38 | History | Medieval, Modern | Text |
| 33 | C034 | Biology Lesson 34 | Biology | Evolution, Genetics | Video |
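The ranking step can also be wrapped in a reusable function. The sketch below is one possible variant, not the notebook's own code: the name `recommend_top_k`, the `similarity` column, and the toy items are all hypothetical. Unlike the cell above, it also drops items whose similarity is exactly zero, so it may return fewer than `k` rows.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def recommend_top_k(profile_text, items, vectorizer, item_matrix, k=5):
    """Return up to k items most similar to profile_text, best first.

    Items with zero similarity are dropped (an assumption of this sketch).
    """
    profile_vector = vectorizer.transform([profile_text])
    scores = cosine_similarity(profile_vector, item_matrix)[0]
    order = np.argsort(-scores, kind="stable")[:k]  # highest scores first
    order = order[scores[order] > 0]                # keep only real matches
    result = items.iloc[order].copy()
    result["similarity"] = scores[order]
    return result

# Hypothetical toy items standing in for content_df.
toy_items = pd.DataFrame({
    "Content ID": ["T1", "T2", "T3"],
    "combined_features": [
        "Biology Evolution Genetics Text Easy",
        "History Ancient Modern Text Hard",
        "Math Algebra Video Hard",
    ],
})
toy_vectorizer = TfidfVectorizer()
toy_item_matrix = toy_vectorizer.fit_transform(toy_items["combined_features"])

top = recommend_top_k("Biology History Text", toy_items,
                      toy_vectorizer, toy_item_matrix, k=2)
print(top[["Content ID", "similarity"]])
```

Returning the score alongside each item makes it easier to see how strong a match is, and to notice when the top results are separated by only a small margin.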


#### Reference
[1] Pazzani, Michael J., and Daniel Billsus. "Content-based recommendation systems." In The Adaptive Web: Methods and Strategies of Web Personalization, pp. 325-341. Berlin, Heidelberg: Springer, 2007.