# Content-Based Filtering for Recommendation Systems

## Recap: Recommendation Systems

A **recommendation system** is an algorithm that gives users personalized suggestions or predictions.

The primary goal is to assist users in discovering items (products, services, content) they are likely to be interested in, based on their preferences, behaviors, or historical data.

Examples of platforms that rely on recommendation systems: Netflix, Amazon, Zillow, Airbnb.

## Content-Based Filtering

**Content-based filtering** is a type of recommendation system that focuses on the **characteristics (content)** of the items and a **profile of the user's preferences**.

**How it works:**
1.  **Item Profile Creation:** Each item is described by a set of features or attributes (e.g., for movies: keywords, genres, actors, director; for articles: topics, keywords).
2.  **User Profile Creation:** A profile is built for each user, representing their preferences. This profile is often based on the characteristics of items the user has previously interacted with, liked, or rated positively.
3.  **Recommendation:** Recommendations are made by comparing the user's profile with the profiles of unrated/unseen items. Items that are "similar" in content to what the user has liked in the past are recommended.

### Step 1: Item-Profile Curation (Example: Movie Genres)

We start with item data, for example, movies and their genres. This data needs to be transformed into a structured format where each item (movie) is represented by a vector of its features (genres).

**Initial Movie Data (e.g., from a DataFrame; the code stores the genres as a pipe-separated string such as `Adventure|Comedy`):**

| movieId | year | genres             |
|---------|------|--------------------|
| A       | 2023 | Adventure and Comedy   |
| B       | 2013 | Romance and Sci-Fi     |
| C       | 2023 | Adventure          |
| D       | 2010 | Adventure and Sci-Fi   |

**Transformed Item-Profile (Genre Matrix - One-Hot Encoded):**

| movieId | Adventure | Comedy | Romance | Sci-Fi |
|---------|-----------|--------|---------|--------|
| A       | 1         | 1      | 0       | 0      |
| B       | 0         | 0      | 1       | 1      |
| C       | 1         | 0      | 0       | 0      |
| D       | 1         | 0      | 0       | 1      |

(Here, `1` indicates the movie has the genre, `0` indicates it does not.)

```python
import pandas as pd

# Initial Movie Data
movies_data = {
    'movieId': ['A', 'B', 'C', 'D'],
    'year': [2023, 2013, 2023, 2010],
    'genres_str': ['Adventure|Comedy', 'Romance|Sci-Fi', 'Adventure', 'Adventure|Sci-Fi']
}
movies_df = pd.DataFrame(movies_data)
print("Initial Movies DataFrame:")
print(movies_df)

# Item-Profile Curation: One-hot encode genres
# str.get_dummies() splits each pipe-separated string and creates one 0/1 column per genre
genre_dummies = movies_df['genres_str'].str.get_dummies(sep='|')
item_profile_df = pd.concat([movies_df['movieId'], genre_dummies], axis=1).set_index('movieId')

print("Item-Profile DataFrame (Genres One-Hot Encoded):")
print(item_profile_df)
```

### Step 2: User-Profile Curation

A user's profile is created based on their past interactions. For content-based filtering, this means aggregating the features of items the user has liked.

**Example: User 1 likes a movie that is 'Adventure' and 'Sci-Fi'.**

**User's Liked Item Data:**

| UserId | year | genres               |
|--------|------|----------------------|
| 1      | 2014 | Adventure and Sci-Fi |

**User-Profile (Genre Vector):**

| UserId | Adventure | Comedy | Romance | Sci-Fi |
|--------|-----------|--------|---------|--------|
| 1      | 1         | 0      | 0       | 1      |

*If a user liked multiple movies, their profile could be an average or sum of the feature vectors of those movies.*
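The aggregation mentioned above can be sketched as follows. This is a minimal illustration, not a prescribed method: it assumes, purely for the example, that the user liked movies A and D, and it rebuilds the Step 1 genre matrix so the snippet runs on its own. The names `liked_movie_ids`, `user_profile_mean`, and `user_profile_sum` are hypothetical.

```python
import pandas as pd

# Same genre matrix as the item profile from Step 1
item_profile_df = pd.DataFrame(
    {
        "Adventure": [1, 0, 1, 1],
        "Comedy": [1, 0, 0, 0],
        "Romance": [0, 1, 0, 0],
        "Sci-Fi": [0, 1, 0, 1],
    },
    index=pd.Index(["A", "B", "C", "D"], name="movieId"),
)

# Assumption for illustration only: the user liked movies A and D
liked_movie_ids = ["A", "D"]
liked_profiles = item_profile_df.loc[liked_movie_ids]

# Mean: share of liked movies that carry each genre
user_profile_mean = liked_profiles.mean(axis=0)
# Sum: number of liked movies that carry each genre
user_profile_sum = liked_profiles.sum(axis=0)

print("Mean-based profile:")
print(user_profile_mean)
print("Sum-based profile:")
print(user_profile_sum)
```

Because the sum is just the mean multiplied by the number of liked movies, both profiles point in the same direction, so a scale-invariant measure such as cosine similarity would rank items identically for either one.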

```python
# Example User Profile (could be derived from user's liked movies)
user_profile_data = {
    'Adventure': [1],
    'Comedy': [0],
    'Romance': [0],
    'Sci-Fi': [1]
}
user_profile_df = pd.DataFrame(user_profile_data, index=['User1'])
print("User-Profile DataFrame:")
print(user_profile_df)
```

### Step 3: Recommendation using Similarity

Recommendations are made by finding items that are "similar" to the user's profile. A common way to measure this similarity between feature vectors is **Cosine Similarity**.

**Cosine Similarity:**
*   Measures the cosine of the angle between two non-zero vectors in a multi-dimensional space.
*   Value ranges from -1 to 1 (for vectors with no negative entries, such as one-hot genre profiles, it stays between 0 and 1):
    *   `1`: Vectors point in exactly the same direction (max similarity).
    *   `0`: Vectors are orthogonal (no similarity).
    *   `-1`: Vectors point in completely opposite directions (max dissimilarity).
*   Formula:  `cos(θ) = (A ⋅ B) / (||A|| ⋅ ||B||)`
    Where `A ⋅ B` is the dot product of vectors A and B, and `||A||`, `||B||` are their Euclidean norms (magnitudes).

#### Calculating Cosine Similarity Example

*   **User Profile (User1):** `[Adventure:1, Comedy:0, Romance:0, Sci-Fi:1]`
*   **Item Profile (Movie A):** `[Adventure:1, Comedy:1, Romance:0, Sci-Fi:0]`

Let `U = [1, 0, 0, 1]` and `M_A = [1, 1, 0, 0]`

1.  **Dot Product (U ⋅ M_A):**
    `(1*1) + (0*1) + (0*0) + (1*0) = 1 + 0 + 0 + 0 = 1`

2.  **Magnitude of U (||U||):**
    `sqrt(1² + 0² + 0² + 1²) = sqrt(1 + 0 + 0 + 1) = sqrt(2)`

3.  **Magnitude of M_A (||M_A||):**
    `sqrt(1² + 1² + 0² + 0²) = sqrt(1 + 1 + 0 + 0) = sqrt(2)`

4.  **Cosine Similarity:**
    `cos(θ) = 1 / (sqrt(2) * sqrt(2)) = 1 / 2 = 0.5`
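The formula and the manual calculation above can be checked with a few lines of NumPy. This is a minimal sketch: `cosine_sim` is a hypothetical helper written for this example, not a library function, and the two-dimensional vectors in the boundary cases are made up to illustrate the ends of the range.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity of two non-zero 1-D vectors: (A . B) / (||A|| * ||B||)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    norm_product = np.linalg.norm(a) * np.linalg.norm(b)
    if norm_product == 0:
        # The formula divides by the norms, so zero vectors are not allowed
        raise ValueError("cosine similarity is undefined for zero vectors")
    return float(np.dot(a, b) / norm_product)

U = [1, 0, 0, 1]    # User1: Adventure, Sci-Fi
M_A = [1, 1, 0, 0]  # Movie A: Adventure, Comedy

sim_user_movie_a = cosine_sim(U, M_A)
print("User1 vs. Movie A:", round(sim_user_movie_a, 4))  # about 0.5, as computed by hand

# Boundary cases of the range (illustrative vectors)
sim_same = cosine_sim([1, 2], [2, 4])        # same direction
sim_orthogonal = cosine_sim([1, 0], [0, 1])  # orthogonal
sim_opposite = cosine_sim([1, 2], [-1, -2])  # opposite direction
print(round(sim_same, 4), round(sim_orthogonal, 4), round(sim_opposite, 4))
```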

#### Using `sklearn.metrics.pairwise.cosine_similarity`

The Scikit-learn library provides a convenient function for this.

```python
from sklearn.metrics.pairwise import cosine_similarity

# User profile vector, with genre columns aligned to the order used in item_profile_df
user1_vector = user_profile_df[item_profile_df.columns].iloc[0].values.reshape(1, -1)
# .values converts the Series to a NumPy array
# .reshape(1, -1) makes it a 2D array with 1 row, as expected by cosine_similarity

print("User1 Vector:", user1_vector)
print("Item Profile DataFrame (for reference):")
print(item_profile_df)

# Calculate similarity between user1_vector and ALL item profiles in item_profile_df
similarities = cosine_similarity(user1_vector, item_profile_df.values)  # .values gets NumPy array from DF

# similarities is a 2D array of shape (1, number of movies), e.g., [[sim_A, sim_B, sim_C, sim_D]]
print("Cosine Similarities (User1 vs. All Movies):")
print(similarities)

# Create a DataFrame for easier viewing of results
similarity_scores = pd.DataFrame(
    data=similarities[0],         # Take the first (and only) row of similarities
    index=item_profile_df.index,  # Use movieIds as index
    columns=['CS_Genre']
)
similarity_scores['User_movieId'] = '1_' + similarity_scores.index  # Label each row as "<userId>_<movieId>"

print("cs_genre_df (Similarity Scores with User1):")
print(similarity_scores[['User_movieId', 'CS_Genre']].sort_values(by='CS_Genre', ascending=False))
```

The output should show:
*   Similarity(User1, MovieA) = 0.5
*   Similarity(User1, MovieB) = 0.5
*   Similarity(User1, MovieC) = 0.7071 (approx, because MovieC is only Adventure, a partial match)
*   Similarity(User1, MovieD) = 1.0 (perfect match on Adventure & Sci-Fi)
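To turn these scores into actual recommendations, the items the user has already seen are usually filtered out before ranking, as described in the "How it works" steps. The sketch below is one possible way to do that. `recommend_top_n` is a hypothetical helper, and the assumption that User1 has already watched movie D is made only for illustration. It also assumes every item has at least one genre, so no item vector is all zeros.

```python
import numpy as np
import pandas as pd

# Same genre matrix as the item profile from Step 1
item_profile_df = pd.DataFrame(
    {
        "Adventure": [1, 0, 1, 1],
        "Comedy": [1, 0, 0, 0],
        "Romance": [0, 1, 0, 0],
        "Sci-Fi": [0, 1, 0, 1],
    },
    index=pd.Index(["A", "B", "C", "D"], name="movieId"),
)
user_vector = np.array([1, 0, 0, 1], dtype=float)  # User1: Adventure, Sci-Fi

def recommend_top_n(user_vec, item_profiles, seen_ids, n=2):
    """Hypothetical helper: rank unseen items by cosine similarity to the user vector."""
    # Drop items the user has already interacted with
    candidates = item_profiles.drop(index=list(seen_ids), errors="ignore")
    item_matrix = candidates.to_numpy(dtype=float)
    # Cosine similarity of every candidate row with the user vector
    # (assumes no all-zero item rows and a non-zero user vector)
    item_norms = np.linalg.norm(item_matrix, axis=1)
    scores = item_matrix @ user_vec / (item_norms * np.linalg.norm(user_vec))
    ranked = pd.Series(scores, index=candidates.index, name="CS_Genre")
    return ranked.sort_values(ascending=False, kind="stable").head(n)

# Assumption for illustration: User1 has already watched movie D
top_picks = recommend_top_n(user_vector, item_profile_df, seen_ids={"D"}, n=2)
print(top_picks)
```

With D excluded, movie C ranks first because it has the highest remaining score (about 0.7071). Movies A and B tie at 0.5, so which of them fills the second slot depends only on the tie-breaking order.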

## More Data? Enhancing Content-Based Filtering

To create a more sophisticated content-based recommendation system, what additional factors (features) beyond just genre could be considered?

Think about data available from sources like IMDb:
*   **Actors, Directors, Writers:** Users who like movies by certain people might like other movies they are involved in.
*   **Keywords/Plot Summaries:** Text analysis (NLP) can extract key themes and plot elements.
*   **Year of Release:** Some users prefer newer or older movies.
*   **Country of Origin, Language.**
*   **Average Rating, Number of Votes:** While this leans towards collaborative filtering, it can also be a content feature indicating quality/popularity.
*   **Runtime, MPAA Rating.**

The more relevant features you incorporate into your item profiles, the more nuanced and accurate your content-based recommendations can become. However, each added feature also increases the complexity of feature engineering and similarity calculation.
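As a small illustration of adding one more feature, the sketch below appends the release year from the movie table to the genre matrix. It is only one possible encoding: the min-max scaling and the `YEAR_WEIGHT` value of 0.5 are assumptions chosen for this example, not recommended settings.

```python
import pandas as pd

movies_df = pd.DataFrame(
    {
        "movieId": ["A", "B", "C", "D"],
        "year": [2023, 2013, 2023, 2010],
        "genres_str": ["Adventure|Comedy", "Romance|Sci-Fi", "Adventure", "Adventure|Sci-Fi"],
    }
).set_index("movieId")

# Genre features: one 0/1 column per genre, as in Step 1
genre_features = movies_df["genres_str"].str.get_dummies(sep="|").astype(float)

# Min-max scale the year to [0, 1] so it is on a scale comparable to the genre flags
year_min = movies_df["year"].min()
year_max = movies_df["year"].max()
year_scaled = (movies_df["year"] - year_min) / (year_max - year_min)

# Assumed weight controlling how much the year counts relative to one genre
YEAR_WEIGHT = 0.5

enriched_profile_df = genre_features.copy()
enriched_profile_df["year_scaled"] = YEAR_WEIGHT * year_scaled

print(enriched_profile_df.round(3))
```

A single scaled column mostly captures recency: older movies get values near 0 and contribute little to a cosine score. If the goal is to match users who prefer a particular era, one-hot encoding decades may be a better fit.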