# Recommender Systems Tutorial
This notebook covers recommender systems from a popularity baseline to SVD, collaborative filtering, and content-based filtering, followed by evaluation metrics and the cold start problem.

# Chapter 1: Dataset Creation

## Import necessary libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
from scipy.sparse.linalg import svds
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import mean_squared_error

## Explanation

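In this chapter, we create a synthetic dataset to simulate user-item interactions. The dataset consists of:

- 50 users and 20 items.  
- Every user rates every item on an integer scale of 1 to 5, so the rating matrix is fully dense.  
- A test dataset of 20 users, sampled from the same 50 users for evaluation purposes. Because these users also appear in the full dataset, it is not a true hold-out set.  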

## Mathematical Formulation

Let:  

- **U** be the set of users, where $ |U| = 50 $.  
- **I** be the set of items, where $ |I| = 20 $.  
- **R** be the user-item interaction matrix, where $ R_{u,i} $ represents the rating given by user $ u $ to item $ i $.  

The matrix **R** is of size $ |U| \times |I| $, and each entry $ R_{u,i} $ is an integer rating from 1 to 5.  

In [2]:
# Create synthetic dataset
np.random.seed(42)
num_users = 50
num_items = 20
user_ids = np.arange(1, num_users + 1)
item_ids = np.arange(1, num_items + 1)

In [3]:
# Generate user-item interactions (ratings between 1 and 5)
ratings = np.random.randint(1, 6, size=(num_users, num_items))

# Create a DataFrame for the dataset
ratings_df = pd.DataFrame(ratings, columns=item_ids, index=user_ids)
ratings_df.index.name = 'UserID'
ratings_df.columns.name = 'ItemID'

print("Synthetic User-Item Ratings Dataset:")
print(ratings_df.head())

Synthetic User-Item Ratings Dataset:
ItemID  1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  \
UserID                                                                       
1        4   5   3   5   5   2   3   3   3   5   4   3   5   2   4   2   4   
2        2   5   4   1   1   3   3   2   4   4   3   4   4   1   3   5   3   
3        4   1   4   2   2   1   2   5   2   4   4   4   4   5   3   1   4   
4        2   4   5   2   2   4   2   2   4   4   1   5   5   2   5   2   1   
5        5   1   5   5   1   1   1   1   4   3   3   1   3   3   1   3   5   

ItemID  18  19  20  
UserID              
1        5   1   4  
2        5   1   2  
3        2   4   2  
4        4   4   4  
5        2   2   1  
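
The matrix generated above is fully dense, whereas real rating data is mostly missing. The following is a minimal sketch of how a share of the entries could be masked out to simulate sparsity. The helper name `sparsify_ratings`, the 30% missing fraction, the seed, and the small stand-in matrix are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def sparsify_ratings(ratings_df, missing_fraction=0.3, seed=0):
    """Return a copy of ratings_df with a random share of entries set to NaN.

    Hypothetical helper: missing_fraction and seed are illustrative defaults.
    """
    rng = np.random.default_rng(seed)
    # True marks an entry to hide; each entry is hidden independently
    hide = rng.random(ratings_df.shape) < missing_fraction
    # Cast to float first so the frame can hold NaN
    return ratings_df.astype(float).mask(hide)

# Small stand-in for the 50 x 20 matrix built above
demo = pd.DataFrame(
    np.random.default_rng(42).integers(1, 6, size=(10, 8)),
    index=np.arange(1, 11),
    columns=np.arange(1, 9),
)
sparse_demo = sparsify_ratings(demo)
observed_share = sparse_demo.notna().to_numpy().mean()
print(f"Share of observed ratings: {observed_share:.2f}")
```

The later cells call `fillna(0)` before computing similarities or the SVD, so a masked matrix like this one can be passed through them unchanged, although treating a missing rating as 0 is itself a modelling choice.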


In [5]:
# Create a test dataset with 20 users
test_users = np.random.choice(user_ids, size=20, replace=False)
test_ratings_df = ratings_df.loc[test_users]

print("\nTest Dataset (20 Users):")
print(test_ratings_df.head())


Test Dataset (20 Users):
ItemID  1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  \
UserID                                                                       
13       3   1   1   3   3   3   4   1   4   3   1   4   4   3   1   3   1   
32       5   2   4   4   2   2   4   2   4   4   5   1   4   3   1   1   1   
41       3   3   4   5   2   2   3   3   1   5   4   2   1   1   2   4   1   
7        2   4   5   3   1   4   5   4   5   5   3   5   4   5   3   3   4   
27       4   1   4   3   5   4   5   1   5   5   2   2   2   5   3   5   3   

ItemID  18  19  20  
UserID              
13       5   2   2  
32       5   4   5  
41       1   5   4  
7        2   2   5  
27       3   2   4  
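
The test dataset above is a row subset of `ratings_df`, so any model fitted on the full matrix has already seen every test rating. One common alternative is to hide a few ratings per user and evaluate on those. The sketch below illustrates that idea; `holdout_split`, `n_holdout`, and the stand-in data are hypothetical names and values chosen for this example.

```python
import numpy as np
import pandas as pd

def holdout_split(ratings_df, n_holdout=3, seed=0):
    """Hide n_holdout rated items per user; return (train, test) frames.

    Hypothetical helper. Hidden entries are NaN in train and present in test;
    all other entries are NaN in test.
    """
    rng = np.random.default_rng(seed)
    train = ratings_df.astype(float).copy()
    test = pd.DataFrame(np.nan, index=ratings_df.index, columns=ratings_df.columns)
    for user in ratings_df.index:
        # Only items the user actually rated can be held out
        rated = ratings_df.columns[ratings_df.loc[user].notna().to_numpy()]
        held_out = rng.choice(rated.to_numpy(), size=min(n_holdout, len(rated)), replace=False)
        test.loc[user, held_out] = ratings_df.loc[user, held_out].to_numpy()
        train.loc[user, held_out] = np.nan
    return train, test

# Small stand-in for the ratings matrix
demo = pd.DataFrame(
    np.random.default_rng(42).integers(1, 6, size=(6, 8)),
    index=np.arange(1, 7),
    columns=np.arange(1, 9),
)
train, test = holdout_split(demo, n_holdout=2)
print(test.notna().sum(axis=1))  # number of held-out ratings per user
```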


# Chapter 2: Base Recommender System

## Explanation

We start with a popularity-based recommender system, which recommends the most popular items based on average ratings. This is a simple baseline model.

## Mathematical Formulation

The popularity of an item $ i $ is calculated as:

$$
\text{Popularity}(i) = \frac{1}{|U|} \sum_{u \in U} R_{u,i}
$$

The top $ N $ items with the highest popularity scores are recommended.

In [8]:
# Popularity-Based Recommender
def popularity_based_recommender(ratings_df, top_n=10):
    item_popularity = ratings_df.mean(axis=0).sort_values(ascending=False)
    return item_popularity.head(top_n)

print("\nTop 10 Popular Items:")
popularity_based_recommender(ratings_df)


Top 10 Popular Items:


ItemID
2     3.24
4     3.22
7     3.20
13    3.16
16    3.12
10    3.10
20    3.06
3     3.06
14    3.04
19    3.02
dtype: float64
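
On the dense matrix used here, `ratings_df.mean(axis=0)` matches the popularity formula exactly. Once ratings are missing, the two differ, because pandas skips `NaN` values and divides by the number of raters rather than by $|U|$. The toy frame below is a made-up example that shows the difference.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative only): user u2 has not rated item A
toy = pd.DataFrame(
    {"A": [5.0, np.nan, 4.0], "B": [2.0, 4.0, 3.0]},
    index=["u1", "u2", "u3"],
)

# Formula as written: divide the rating sum by the number of users |U|
popularity_all_users = toy.sum(axis=0) / len(toy)

# pandas mean skips NaN, so it divides by the number of users who rated the item
popularity_raters_only = toy.mean(axis=0)

print(popularity_all_users.to_dict())    # {'A': 3.0, 'B': 3.0}
print(popularity_raters_only.to_dict())  # {'A': 4.5, 'B': 3.0}
```

Which denominator is preferable depends on whether an unrated item should count against its popularity.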

# Chapter 3: Advanced Recommender Systems

## 3.1 Pure SVD (Singular Value Decomposition)

### Explanation

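SVD is a matrix factorization technique. Truncated SVD approximates the user-item interaction matrix $R$ by the product of three matrices, keeping only the $k$ largest singular values:

$$
R \approx U \cdot \Sigma \cdot V^T
$$

where:

- $U$ is the user-feature matrix, with one row per user and $k$ columns (not to be confused with the user set **U** from Chapter 1).  
- $\Sigma$ is the $k \times k$ diagonal matrix of singular values.  
- $V^T$ is the item-feature matrix, with $k$ rows and one column per item.  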

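The predicted ratings are obtained by reconstructing the matrix from the retained factors ($k = 5$ in the code below). The code subtracts each user's mean rating before the decomposition and adds it back after the reconstruction: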

$$
\hat{R} = U \cdot \Sigma \cdot V^T
$$


In [10]:
# Normalize the ratings matrix
mean_user_rating = ratings_df.mean(axis=1)
ratings_normalized = ratings_df.sub(mean_user_rating, axis=0)
ratings_normalized_array = ratings_normalized.fillna(0).values

# Perform SVD
U, sigma, Vt = svds(ratings_normalized_array, k=5)
sigma = np.diag(sigma)

# Reconstruct the predicted ratings matrix
predicted_ratings = np.dot(np.dot(U, sigma), Vt) + mean_user_rating.values.reshape(-1, 1)
predicted_ratings_df = pd.DataFrame(predicted_ratings, columns=item_ids, index=user_ids)

print("\nPredicted Ratings using Pure SVD:")
predicted_ratings_df.head()


Predicted Ratings using Pure SVD:


UserID,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20
1,4.329476,5.070966,3.522534,4.089787,5.006367,3.045555,2.895946,2.903321,4.129436,3.582906,3.553028,3.278289,4.560916,2.764263,3.710295,1.943241,4.9563,4.352671,1.805617,2.499085
2,2.599279,4.294169,1.775465,2.945804,2.824183,4.603698,2.962221,2.467252,3.827262,3.200505,2.171488,2.919653,3.314646,2.223176,2.744578,3.608871,3.075225,3.936971,1.649683,2.855872
3,3.931546,1.775628,4.404455,3.655858,3.076656,0.757935,2.159078,4.364129,2.399096,2.817185,4.401907,3.53839,3.012854,3.67328,3.109144,1.612546,2.715727,1.622212,4.392102,2.580272
4,2.750254,3.721415,2.943639,3.278692,2.867132,4.084801,2.991481,3.657217,3.402968,3.532144,2.812794,3.313849,2.920908,3.033761,3.436959,3.545547,2.807501,3.1645,2.722724,3.011714
5,4.181443,1.619621,3.907487,3.329708,2.793707,0.831197,1.737544,2.33855,4.161896,2.784486,3.333062,1.171065,3.125921,3.020331,0.999292,0.803955,2.629107,3.408533,2.495592,2.327502
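
To see how the number of retained factors affects the reconstruction, the sketch below repeats the centre, decompose, and rebuild steps on a small random matrix for every possible $k$. It uses `np.linalg.svd` rather than `svds` so that all singular values are available at once. The matrix size and seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.integers(1, 6, size=(8, 6)).astype(float)  # toy 8 users x 6 items

# Centre each user's ratings, as in the cell above
user_means = R.mean(axis=1, keepdims=True)
R_centered = R - user_means

# Full (thin) SVD; singular values come back in descending order
U_mat, singular_values, Vt = np.linalg.svd(R_centered, full_matrices=False)

def reconstruct(k):
    """Rank-k reconstruction with the user means added back."""
    approx = U_mat[:, :k] @ np.diag(singular_values[:k]) @ Vt[:k, :]
    return approx + user_means

# Frobenius-norm error between the original and the rank-k reconstruction
errors = {k: np.linalg.norm(R - reconstruct(k)) for k in range(1, len(singular_values) + 1)}
for k, err in errors.items():
    print(f"k={k}: reconstruction error {err:.3f}")
```

The error shrinks as $k$ grows and reaches zero when all factors are kept. A small $k$ is therefore a deliberate trade-off: it reproduces the observed ratings less exactly in exchange for a smoother, lower-dimensional model.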


## 3.2 Hybrid SVD

### Explanation

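Hybrid SVD combines SVD with additional features (e.g., user demographics or item metadata). The predicted ratings are a weighted combination of SVD predictions and feature-based predictions, with $\alpha \in [0, 1]$ controlling the balance. The example below creates user and item feature tables but does not use them yet: it sets $\alpha = 0.7$ and approximates the feature-based term with each item's mean rating.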

$$
\hat{R}_{\text{hybrid}} = \alpha \cdot \hat{R}_{\text{SVD}} + (1 - \alpha) \cdot \hat{R}_{\text{features}}
$$


In [14]:
# Example: Add user features (age, gender) and item features (category)
user_features = pd.DataFrame({
    'UserID': user_ids,
    'Age': np.random.randint(18, 65, size=num_users),
    'Gender': np.random.choice(['M', 'F'], size=num_users)
}).set_index('UserID')

item_features = pd.DataFrame({
    'ItemID': item_ids,
    'Category': np.random.choice(['A', 'B', 'C'], size=num_items)
}).set_index('ItemID')

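# Blend SVD predictions with item mean ratings (alpha = 0.7).
# The item means stand in for feature-based predictions; user_features and item_features are not used here.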
hybrid_predicted_ratings = 0.7 * predicted_ratings_df + 0.3 * ratings_df.mean().values

print("\nHybrid SVD Predicted Ratings:")
hybrid_predicted_ratings.head()


Hybrid SVD Predicted Ratings:


UserID,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20
1,3.864633,4.521676,3.383774,3.828851,4.326457,3.001889,2.987163,2.938325,3.730605,3.438034,3.29712,3.200802,4.140641,2.846984,3.491207,2.296269,4.36941,3.92287,2.169932,2.66736
2,2.653495,3.977918,2.160825,3.028063,2.798928,4.092588,3.033554,2.633076,3.519084,3.170354,2.330042,2.949757,3.268252,2.468223,2.815205,3.462209,3.052657,3.63188,2.060778,2.917111
3,3.586082,2.21494,4.001118,3.5251,2.975659,1.400554,2.471355,3.96089,2.519367,2.90203,3.891335,3.382873,3.056998,3.483296,3.070401,2.064782,2.801009,2.011548,3.980472,2.72419
4,2.759178,3.576991,2.978547,3.261085,2.828992,3.729361,3.054036,3.466052,3.222078,3.402501,2.778956,3.225694,2.992636,3.035632,3.299871,3.417883,2.86525,3.09115,2.811907,3.0262
5,3.76101,2.105735,3.653241,3.296796,2.777595,1.451838,2.176281,2.542985,3.753327,2.87914,3.143143,1.725746,3.136145,3.026232,1.593505,1.498769,2.740375,3.261973,2.652914,2.547251
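
The cell above uses item means as a stand-in for $\hat{R}_{\text{features}}$. As a sketch of what a feature-based term could look like, the code below predicts each rating as the user's mean rating within the item's category and then blends it with a rank-2 SVD reconstruction. All function names, the toy data, the category assignment, and the choice of predictor are assumptions for illustration; the category means are computed on the same ratings they predict, so this is not an evaluation setup.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
users = np.arange(1, 7)
items = np.arange(1, 9)
ratings = pd.DataFrame(rng.integers(1, 6, size=(6, 8)), index=users, columns=items)
# Hypothetical item categories
category = pd.Series(["A", "B", "C", "A", "B", "C", "A", "B"], index=items)

def category_mean_predictions(ratings, category):
    """Predict each rating as the user's mean rating within the item's category."""
    # users x categories: mean rating each user gave within each category
    per_category = ratings.T.groupby(category).mean().T
    # Expand back to users x items by looking up each item's category
    lookup = category.loc[ratings.columns].tolist()
    expanded = per_category.loc[:, lookup].to_numpy()
    return pd.DataFrame(expanded, index=ratings.index, columns=ratings.columns)

def rank_k_predictions(ratings, k=2):
    """Mean-centred rank-k SVD reconstruction (same idea as section 3.1)."""
    values = ratings.to_numpy(dtype=float)
    means = values.mean(axis=1, keepdims=True)
    U_mat, s, Vt = np.linalg.svd(values - means, full_matrices=False)
    approx = U_mat[:, :k] @ np.diag(s[:k]) @ Vt[:k, :] + means
    return pd.DataFrame(approx, index=ratings.index, columns=ratings.columns)

def blend(svd_pred, feature_pred, alpha=0.7):
    """Weighted combination from the hybrid formula."""
    return alpha * svd_pred + (1 - alpha) * feature_pred

feature_pred = category_mean_predictions(ratings, category)
svd_pred = rank_k_predictions(ratings)
hybrid = blend(svd_pred, feature_pred, alpha=0.7)
print(hybrid.round(2).head())
```

Setting `alpha=1.0` recovers the pure SVD predictions and `alpha=0.0` the feature-based ones, which makes $\alpha$ a natural parameter to tune on held-out ratings.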


## 3.3 Collaborative Filtering (User-User and Item-Item)

### Explanation

Collaborative filtering recommends items based on user-user or item-item similarity.

#### User-User Collaborative Filtering

The similarity between users $u$ and $v$ is computed using cosine similarity:

$$
\text{Similarity}(u,v) = \cos(\theta) = \frac{R_u \cdot R_v}{\|R_u\| \cdot \|R_v\|}
$$

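Recommendations are based on ratings from similar users. Note that the implementation below sorts the other users by similarity but then averages the ratings of all of them with equal weight, so its output stays close to the popularity baseline.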

#### Item-Item Collaborative Filtering

The similarity between items $i$ and $j$ is computed as:

$$
\text{Similarity}(i,j) = \cos(\theta) = \frac{R_i \cdot R_j}{\|R_i\| \cdot \|R_j\|}
$$

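Recommendations are based on similar items. The implementation below returns the items most similar to a given item rather than scores for a particular user.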

In [15]:
# User-User Collaborative Filtering
def user_user_collaborative_filtering(ratings_df, user_id, top_n=5):
    user_similarity = cosine_similarity(ratings_df.fillna(0))
    user_similarity_df = pd.DataFrame(user_similarity, index=ratings_df.index, columns=ratings_df.index)
    similar_users = user_similarity_df[user_id].sort_values(ascending=False).index[1:]
    recommendations = ratings_df.loc[similar_users].mean().sort_values(ascending=False)
    return recommendations.head(top_n)

print("\nUser-User Collaborative Filtering Recommendations for User 1:")
print(user_user_collaborative_filtering(ratings_df, user_id=1))

# Item-Item Collaborative Filtering
def item_item_collaborative_filtering(ratings_df, item_id, top_n=5):
    item_similarity = cosine_similarity(ratings_df.fillna(0).T)
    item_similarity_df = pd.DataFrame(item_similarity, index=ratings_df.columns, columns=ratings_df.columns)
    similar_items = item_similarity_df[item_id].sort_values(ascending=False).index[1:]
    return similar_items[:top_n]

print("\nItem-Item Collaborative Filtering Recommendations for Item 1:")
print(item_item_collaborative_filtering(ratings_df, item_id=1))


User-User Collaborative Filtering Recommendations for User 1:
ItemID
7     3.204082
2     3.204082
4     3.183673
16    3.142857
13    3.122449
dtype: float64

Item-Item Collaborative Filtering Recommendations for Item 1:
Index([13, 11, 4, 7, 5], dtype='int64', name='ItemID')
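
A common refinement of user-user filtering is to keep only the $k$ most similar users and weight their ratings by similarity. The sketch below shows one way to do that; `knn_user_scores`, the value of `k`, and the toy data are hypothetical, and cosine similarity is computed directly with NumPy to keep the example self-contained. It assumes every user has at least one rating and that the neighbours' ratings are not missing.

```python
import numpy as np
import pandas as pd

def cosine_similarity_matrix(values):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    norms = np.linalg.norm(values, axis=1, keepdims=True)
    unit = values / norms  # assumes no all-zero rows
    return unit @ unit.T

def knn_user_scores(ratings_df, user_id, k=5):
    """Similarity-weighted average rating per item over the k nearest users."""
    filled = ratings_df.fillna(0).to_numpy(dtype=float)
    sim = pd.DataFrame(
        cosine_similarity_matrix(filled),
        index=ratings_df.index,
        columns=ratings_df.index,
    )
    # Drop the target user explicitly, then keep the k most similar users
    neighbors = sim[user_id].drop(user_id).nlargest(k)
    weights = neighbors / neighbors.sum()
    scores = ratings_df.loc[neighbors.index].mul(weights, axis=0).sum(axis=0)
    return scores.sort_values(ascending=False), neighbors

# Toy stand-in for ratings_df
demo = pd.DataFrame(
    np.random.default_rng(42).integers(1, 6, size=(12, 6)),
    index=np.arange(1, 13),
    columns=np.arange(1, 7),
)
target = 1
scores, neighbors = knn_user_scores(demo, user_id=target, k=3)
print(neighbors)
print(scores)
```

Dropping the target user by label avoids relying on it being first in the sorted similarity list, which can fail if another user has an identical rating vector.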


## 3.4 Content-Based Filtering

### Explanation

Content-based filtering recommends items based on their features. For example, if items are songs, features could include genre, tempo, or artist.

The similarity between items $i$ and $j$ is computed using cosine similarity:

$$
\text{Similarity}(i,j) = \cos(\theta) = \frac{F_i \cdot F_j}{\|F_i\| \cdot \|F_j\|}
$$

where $F_i$ and $F_j$ are feature vectors for items $i$ and $j$.


In [17]:
# Map single-character categories to meaningful text
item_features['Category'] = item_features['Category'].map({
    'A': 'CategoryA',
    'B': 'CategoryB',
    'C': 'CategoryC'
})

# Use item features (e.g., category) to recommend similar items
tfidf = TfidfVectorizer()
item_feature_matrix = tfidf.fit_transform(item_features['Category'])
item_similarity = cosine_similarity(item_feature_matrix)

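def content_based_recommender(item_id, top_n=5):
    # Sort by similarity (highest first) and skip the first entry, assuming it is the query item.
    # With a single categorical feature many scores tie at 1.0, so that assumption is not guaranteed.
    similar_items = item_similarity[item_id - 1].argsort()[::-1][1:top_n + 1]
    return item_ids[similar_items]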

print("\nContent-Based Recommendations for Item 1:")
print(content_based_recommender(item_id=1))


Content-Based Recommendations for Item 1:
[18  3  6 12 11]
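
Because the only feature here is a category label, every pair of items has similarity 1 (same category) or 0 (different category). The sketch below makes that explicit with a one-hot encoding and removes the query item by label instead of by position. The function name `similar_items_by_category` and the toy feature table are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def similar_items_by_category(item_features, item_id, top_n=5):
    """Rank other items by cosine similarity of one-hot category vectors."""
    one_hot = pd.get_dummies(item_features["Category"]).to_numpy(dtype=float)
    unit = one_hot / np.linalg.norm(one_hot, axis=1, keepdims=True)
    sim = pd.DataFrame(
        unit @ unit.T,
        index=item_features.index,
        columns=item_features.index,
    )
    # Remove the query item by label so it can never be recommended to itself
    scores = sim.loc[item_id].drop(item_id)
    return scores.sort_values(ascending=False, kind="mergesort").head(top_n)

# Toy feature table (illustrative only)
demo_features = pd.DataFrame(
    {"Category": ["A", "A", "B", "A", "C", "B"]},
    index=pd.Index(range(1, 7), name="ItemID"),
)
result = similar_items_by_category(demo_features, item_id=1, top_n=2)
print(result)
```

With richer features (several attributes, text descriptions, or audio descriptors) the similarities take more than two values and the ranking becomes informative beyond category membership.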


# Chapter 4: Evaluation Metrics

## Explanation

We evaluate recommender systems using:

- **Hit Rate (HR)**: The fraction of recommended items that are relevant. Defined this way, it is the same quantity as precision@N.  
- **Mean Reciprocal Rank (MRR)**: The average reciprocal rank of the first relevant item in the recommendation list.  

The **Hit Rate (HR)** is calculated as:

$$
HR = \frac{\text{Number of relevant recommendations}}{\text{Total number of recommendations}}
$$

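The **Mean Reciprocal Rank (MRR)** is given by:

$$
MRR = \frac{1}{|Q|} \sum_{q=1}^{|Q|} \frac{1}{\text{rank}_q}
$$

where $Q$ is the set of queries (here, one recommendation list per user) and $\text{rank}_q$ is the position of the first relevant item in the list for query $q$. A query with no relevant item contributes 0.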


In [18]:
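def hit_rate(recommended_items, actual_items):
    # Fraction of recommended items that are relevant, matching the HR formula above
    return len(set(recommended_items) & set(actual_items)) / len(recommended_items)

def mean_reciprocal_rank(recommended_items, actual_items):
    # Reciprocal rank for a single recommendation list; average over all lists to get MRR
    for i, item in enumerate(recommended_items):
        if item in actual_items:
            return 1.0 / (i + 1)
    return 0.0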

# Example evaluation
actual_items = [1, 2, 3]
recommended_items = [3, 4, 5]
print("\nHit Rate:", hit_rate(recommended_items, actual_items))
print("Mean Reciprocal Rank:", mean_reciprocal_rank(recommended_items, actual_items))


Hit Rate: 0.3333333333333333
Mean Reciprocal Rank: 1.0
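
The example above scores a single recommendation list. The MRR formula averages over many queries, so the sketch below aggregates across several users. It also computes a per-user hit rate, defined here as the share of users with at least one relevant item in their list, which is another common reading of the term. The function names and the toy lists are hypothetical.

```python
def reciprocal_rank(recommended, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is recommended."""
    relevant = set(relevant)
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def mrr_over_users(recs_by_user, relevant_by_user):
    """Average the reciprocal rank over all users (queries)."""
    values = [reciprocal_rank(recs_by_user[u], relevant_by_user[u]) for u in recs_by_user]
    return sum(values) / len(values)

def per_user_hit_rate(recs_by_user, relevant_by_user):
    """Share of users with at least one relevant item in their list."""
    hits = [bool(set(recs_by_user[u]) & set(relevant_by_user[u])) for u in recs_by_user]
    return sum(hits) / len(hits)

# Toy recommendation lists and relevant items (illustrative only)
recs = {"u1": [3, 4, 5], "u2": [7, 1, 9], "u3": [8, 6, 2]}
relevant = {"u1": [1, 2, 3], "u2": [1], "u3": [4]}

# Reciprocal ranks: u1 -> 1.0 (rank 1), u2 -> 0.5 (rank 2), u3 -> 0.0 (no hit)
print("MRR:", mrr_over_users(recs, relevant))                       # 0.5
print("Per-user hit rate:", per_user_hit_rate(recs, relevant))      # 2 of 3 users
```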


# Chapter 5: Cold Start Problem

## Explanation

The cold start problem occurs when there is insufficient data for new users or items. Solutions include:

- Using hybrid models.
- Leveraging content-based filtering.
- Incorporating demographic data.

In [19]:
# Example: New user with no ratings
new_user_ratings = pd.Series([np.nan] * num_items, index=item_ids)
print("\nCold Start Problem for New User:")
print(new_user_ratings)


Cold Start Problem for New User:
1    NaN
2    NaN
3    NaN
4    NaN
5    NaN
6    NaN
7    NaN
8    NaN
9    NaN
10   NaN
11   NaN
12   NaN
13   NaN
14   NaN
15   NaN
16   NaN
17   NaN
18   NaN
19   NaN
20   NaN
dtype: float64
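
A user with no ratings gives similarity-based methods nothing to work with. One simple fallback, sketched below, is to serve the popularity ranking from Chapter 2 to such users. The function `recommend_with_fallback` and the toy data are hypothetical, and the branch for users who do have ratings is only a placeholder (popular items they have not rated yet), not a personalised model.

```python
import numpy as np
import pandas as pd

def recommend_with_fallback(user_ratings, ratings_df, top_n=5):
    """Return popular items; for users with ratings, restrict to unrated items."""
    popularity = ratings_df.mean(axis=0).sort_values(ascending=False)
    if user_ratings.notna().sum() == 0:
        # Cold start: no ratings at all, so fall back to global popularity
        return popularity.head(top_n)
    # Placeholder for a personalised model: popular items the user has not rated
    unseen = user_ratings.index[user_ratings.isna()]
    return popularity.loc[popularity.index.isin(unseen)].head(top_n)

# Toy ratings from two existing users over three items (illustrative only)
toy_ratings = pd.DataFrame({1: [5, 4], 2: [1, 2], 3: [3, 3]}, index=["u1", "u2"])

new_user = pd.Series([np.nan, np.nan, np.nan], index=[1, 2, 3])
returning_user = pd.Series([4.0, np.nan, np.nan], index=[1, 2, 3])

cold = recommend_with_fallback(new_user, toy_ratings, top_n=2)
warm = recommend_with_fallback(returning_user, toy_ratings, top_n=2)
print(list(cold.index))  # [1, 3]
print(list(warm.index))  # [3, 2]
```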


# Chapter 6: Next Steps for Musical Recommender Systems

## Explanation

To advance research in music recommender systems:

- Explore datasets like the **Million Song Dataset** or **Last.fm**.
- Incorporate audio features (e.g., MFCC, tempo) into content-based models.
- Experiment with deep learning models like **Neural Collaborative Filtering**.
- Address scalability and real-time recommendations.