# C5: RECOMMENDATION SYSTEM

## What is a Recommendation System?

A recommendation system suggests items to users based on their behavior, preferences, or similarity to others.

### Types of Recommendation Systems

- Popularity-based  
- Content-based filtering  
- Collaborative filtering  
- Matrix factorization  
- Hybrid systems  
- Deep learning–based systems  
- Knowledge-based systems  
- Context-aware systems  

## Popularity-Based Recommendation Systems

- A simple and widely used type  
- Recommends items that are popular among all users, regardless of individual preferences  
- Uses metrics like number of views, clicks, purchases, or ratings  
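
The count-style metrics listed above can be computed with a single aggregation. The following is a minimal sketch, assuming a hypothetical interaction log with `user_id` and `item` columns (the data is made up for illustration); it ranks items by how many distinct users interacted with them.

```python
import pandas as pd

# Hypothetical interaction log (illustrative data, not from a real system)
events = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 1, 2, 5],
    "item":    ["A", "A", "A", "B", "B", "C", "A"],
})

# Popularity = number of distinct users who interacted with each item
interaction_counts = (
    events.groupby("item")["user_id"]
    .nunique()
    .sort_values(ascending=False)
)

# Every user receives the same top-N list
top_n = interaction_counts.head(2).index.tolist()
print(top_n)
```

Counting distinct users rather than raw events is one possible design choice; it keeps a single very active user from inflating an item's score.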

### Advantages

- Easy to implement  
- Works even when there is no per-user history, for example for new users  

### Disadvantages

- No personalization — everyone sees the same recommendations  
- Can create a "rich get richer" effect  


In [1]:
import pandas as pd

# Sample dataset: user_id, movie, rating
data = {
    'user_id': [1, 2, 3, 4, 5, 1, 2, 3],
    'movie': ['A', 'A', 'B', 'B', 'C', 'C', 'C', 'D'],
    'rating': [5, 4, 5, 4, 3, 4, 5, 2]
}
df = pd.DataFrame(data)

# Popularity score: mean rating per movie
# (note: this ignores how many ratings each movie received)
popularity = df.groupby('movie')['rating'].mean().sort_values(ascending=False)

print("Top Recommendations (Popularity-Based):")
print(popularity)


Top Recommendations (Popularity-Based):
movie
A    4.5
B    4.5
C    4.0
D    2.0
Name: rating, dtype: float64


## Content-Based Recommendation Systems

**Definition:** Recommends items similar to what the user has liked in the past, using item attributes.  

### How It Works

- Analyze item features  
- Create a user profile based on items the user has interacted with  
- Match new items against the profile to recommend similar ones  
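
The three steps above can be sketched as follows. This is an illustrative example under stated assumptions: the catalogue, its descriptions, and the `liked` list are hypothetical, and averaging TF-IDF vectors is only one simple way to build a user profile.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical catalogue: title -> keyword description (illustrative only)
catalogue = {
    "Space Wars": "space battle aliens spaceship galaxy",
    "Galaxy Quest": "space crew aliens galaxy comedy",
    "Love Story": "romance couple love wedding",
    "Star Voyage": "spaceship crew galaxy exploration",
    "Wedding Day": "wedding couple comedy love",
}
titles = list(catalogue)

# Step 1: analyze item features (TF-IDF over the descriptions)
item_vectors = TfidfVectorizer().fit_transform(list(catalogue.values())).toarray()

# Step 2: user profile = mean vector of the items the user liked
liked = ["Space Wars", "Galaxy Quest"]
liked_idx = [titles.index(t) for t in liked]
user_profile = item_vectors[liked_idx].mean(axis=0, keepdims=True)

# Step 3: score every item against the profile, then drop items already liked
scores = cosine_similarity(user_profile, item_vectors)[0]
ranked = [titles[i] for i in np.argsort(-scores) if titles[i] not in liked]
print(ranked)
```

Items that share vocabulary with the liked titles rank first, while an item with no shared terms gets a score of zero and falls to the end of the list.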

### Advantages

- Personalized to each user  
- Does not require data from other users  

### Disadvantages

- Limited to suggesting items similar to what the user already liked, leading to less diversity  
- Requires detailed item metadata  


In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

# Sample movie dataset
movies = {
    'title': [
        'The Matrix',
        'John Wick',
        'Avengers',
        'Inception',
        'The Lion King',
        'Bambi'
    ],
    'description': [
        'A computer hacker learns about the true nature of reality and his role in the war against its controllers.',
        'A retired hitman seeks vengeance for the killing of his dog.',
        'Earth’s mightiest heroes must come together to fight a powerful villain.',
        'A thief who enters dreams to steal secrets faces his toughest mission.',
        'A lion cub crown prince is tricked by a treacherous uncle into thinking he caused his father’s death.',
        'A deer cub growing to overcome difficulties'
    ]
}

# Create DataFrame
df = pd.DataFrame(movies)

# Convert descriptions into TF-IDF vectors
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(df['description'])

# Compute similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

# Function to get recommendations
def recommend(title, cosine_sim=cosine_sim):
    # Get index of the movie that matches the title
    idx = df.index[df['title'] == title][0]
    
    # Get similarity scores for this movie with all others
    sim_scores = list(enumerate(cosine_sim[idx]))
    
    # Sort movies based on similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # Skip the first (itself), return top 3 similar
    sim_scores = sim_scores[1:4]
    movie_indices = [i[0] for i in sim_scores]
    
    return df['title'].iloc[movie_indices]

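# Example: recommend movies similar to "The Matrix".
# Note: in this tiny corpus, "The Matrix" shares no non-stop-word terms with the
# other descriptions, so every similarity is 0 and the result simply follows the
# original row order. A larger corpus or richer features is needed for meaningful ranks.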
print(recommend("The Matrix"))


1    John Wick
2     Avengers
3    Inception
Name: title, dtype: object


## Collaborative Filtering

- **Definition:** Predicts what a user will like from the recorded preferences of many users, without needing item attributes.  
- Model-based variants often use matrix factorization to handle sparse user-item matrices efficiently.  

### Steps

1. Collect data  
2. Build a user-item matrix  
3. Choose approach: user-based, item-based, or model-based  
4. Measure similarity: cosine similarity, Pearson correlation, Jaccard index  
5. Find neighbors  
6. Generate predictions  
7. Recommend items  
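
The similarity measures named in step 4 can be compared on a single pair of users. The sketch below is illustrative only: the two rating vectors are made up, `0` is assumed to mean "not rated", and restricting Pearson correlation to co-rated items is one common convention rather than the only one.

```python
import numpy as np

# Hypothetical ratings of two users over the same five items (0 = not rated)
u = np.array([5, 4, 0, 1, 0], dtype=float)
v = np.array([4, 5, 0, 2, 3], dtype=float)

def cosine(a, b):
    """Cosine of the angle between two rating vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pearson(a, b):
    """Pearson correlation computed on co-rated items only."""
    mask = (a > 0) & (b > 0)
    return float(np.corrcoef(a[mask], b[mask])[0, 1])

def jaccard(a, b):
    """Overlap of the sets of rated items, ignoring the rating values."""
    rated_a, rated_b = a > 0, b > 0
    return float((rated_a & rated_b).sum() / (rated_a | rated_b).sum())

print(f"cosine:  {cosine(u, v):.3f}")
print(f"pearson: {pearson(u, v):.3f}")
print(f"jaccard: {jaccard(u, v):.3f}")
```

Cosine and Pearson use the rating values, whereas Jaccard only looks at which items were rated, so it suits implicit feedback such as clicks or purchases.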

### Types

- **User-based:** Finds users similar to the target user and recommends items they liked  
- **Item-based:** Finds items similar to those the user already liked and recommends them  
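
A minimal sketch of the item-based variant, assuming a small hypothetical rating matrix where `0` means "not rated". The helper `recommend_items` is an illustrative name, and scoring by a similarity-weighted sum is one simple choice among several.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical user-item ratings (0 = not rated); illustrative data only
ratings = pd.DataFrame(
    [[5, 4, 0, 0],
     [4, 5, 0, 1],
     [0, 0, 5, 4],
     [5, 0, 0, 0]],
    index=["U1", "U2", "U3", "U4"],
    columns=["Inception", "Interstellar", "Titanic", "Avatar"],
)

# Item-item similarity: compare columns (items) instead of rows (users)
item_sim = pd.DataFrame(
    cosine_similarity(ratings.T),
    index=ratings.columns,
    columns=ratings.columns,
)

def recommend_items(user, top_n=1):
    """Score unrated items by a similarity-weighted sum of the user's ratings."""
    user_ratings = ratings.loc[user]
    scores = item_sim.dot(user_ratings)      # weighted sum per candidate item
    scores = scores[user_ratings == 0]       # keep only items not yet rated
    return scores.sort_values(ascending=False).head(top_n).index.tolist()

print(recommend_items("U4"))
```

Because item-item similarities tend to change more slowly than user-user similarities, they are often precomputed and reused across requests.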

### Advantages

- Domain-independent  
- Captures hidden patterns  
- Personalized recommendations  
- Improves as more data is collected  

### Disadvantages

- Cold start problem  
- Data sparsity  
- Scalability issues  
- Popularity bias  
- Grey sheep problem: Users with unique tastes get poor recommendations  


In [3]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Sample user-movie rating data
data = {
    'User': ['A', 'A', 'B', 'B', 'C', 'C', 'D'],
    'Movie': ['Inception', 'Titanic', 'Inception', 'Avatar', 'Titanic', 'Avatar', 'Inception'],
    'Rating': [5, 4, 4, 5, 2, 4, 5]
}

df = pd.DataFrame(data)

# Create user-item rating matrix
# (unrated entries are filled with 0, which cosine similarity treats as a rating of 0)
rating_matrix = df.pivot_table(index='User', columns='Movie', values='Rating').fillna(0)

# Compute user similarity (cosine similarity)
user_similarity = cosine_similarity(rating_matrix)
user_similarity_df = pd.DataFrame(user_similarity, index=rating_matrix.index, columns=rating_matrix.index)

print("User Similarity Matrix:")
print(user_similarity_df)

# Recommend for User C (based on most similar user)
target_user = 'C'
similar_users = user_similarity_df[target_user].sort_values(ascending=False)
most_similar_user = similar_users.index[1]  # skip self (C)

print(f"\nMost similar user to {target_user}: {most_similar_user}")

# Find movies that the similar user rated but the target user hasn't seen
# (no rating threshold is applied here, so low-rated movies would also appear)
target_seen = set(df[df['User'] == target_user]['Movie'])
recommendations = df[(df['User'] == most_similar_user) & (~df['Movie'].isin(target_seen))]

print(f"\nRecommendations for {target_user}:")
print(recommendations[['Movie', 'Rating']])


User Similarity Matrix:
User         A         B         C         D
User                                        
A     1.000000  0.487805  0.279372  0.780869
B     0.487805  1.000000  0.698430  0.624695
C     0.279372  0.698430  1.000000  0.000000
D     0.780869  0.624695  0.000000  1.000000

Most similar user to C: B

Recommendations for C:
       Movie  Rating
2  Inception       4


## Matrix Factorization

Matrix factorization decomposes a large user-item interaction matrix into smaller latent-factor matrices whose product approximates the original.  

### How It Works

1. Start with a sparse rating matrix  
2. Factorize it into two smaller dense matrices  
3. Predict the rating of user $i$ for item $j$ as the dot product of the user vector $U_i$ and the item vector $V_j$: $\hat{R}_{ij} = U_i \cdot V_j^T$
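
The prediction formula in step 3 can be checked numerically. In this sketch the factor values are made up purely for illustration; in practice they are learned from data.

```python
import numpy as np

# Hypothetical latent factors with 2 dimensions (values chosen for illustration)
U = np.array([[1.0, 0.5],    # user 0
              [0.2, 1.5]])   # user 1
V = np.array([[2.0, 1.0],    # item 0
              [0.5, 3.0],    # item 1
              [1.0, 0.0]])   # item 2

# Predicted rating of user i for item j is the dot product of their vectors
i, j = 1, 1
r_hat_ij = U[i] @ V[j]       # 0.2 * 0.5 + 1.5 * 3.0

# All predictions at once: R_hat = U V^T, with shape (n_users, n_items)
R_hat = U @ V.T
print(r_hat_ij, R_hat.shape)
```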

### Common Algorithms

1. Singular Value Decomposition (SVD)  
2. Funk SVD  
3. Alternating Least Squares (ALS)  
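
Funk SVD is usually described as stochastic gradient descent over the observed ratings only, with L2 regularization. The sketch below follows that description; the function name `funk_svd`, the hyperparameters (`lr`, `reg`, `n_epochs`), and the convention that `0` marks a missing rating are all assumptions made for illustration, not a tuned implementation.

```python
import numpy as np

def funk_svd(R, n_factors=2, lr=0.01, reg=0.02, n_epochs=500, seed=0):
    """Fit U and V by SGD on observed entries only (0 marks a missing rating)."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, n_factors))
    V = rng.normal(scale=0.1, size=(n_items, n_factors))
    observed = np.argwhere(R > 0)            # (user, item) index pairs

    for _ in range(n_epochs):
        rng.shuffle(observed)                # visit ratings in random order
        for i, j in observed:
            err = R[i, j] - U[i] @ V[j]
            u_old = U[i].copy()
            # Gradient step on squared error with L2 regularization
            U[i] += lr * (err * V[j] - reg * U[i])
            V[j] += lr * (err * u_old - reg * V[j])
    return U, V

R = np.array([
    [5, 4, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 0, 5, 4],
    [0, 1, 5, 4],
], dtype=float)

U, V = funk_svd(R)
R_hat = U @ V.T
mask = R > 0
rmse = np.sqrt(np.mean((R[mask] - R_hat[mask]) ** 2))
print(f"Training RMSE on observed ratings: {rmse:.3f}")
```

Unlike a plain SVD of the zero-filled matrix, this loss never treats a missing rating as a rating of 0, so the entries of `R_hat` at unobserved positions can be read as predictions.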

### Advantages

- Handles sparsity better than memory-based collaborative filtering  
- Captures hidden dimensions of user preferences  
- Scales well to millions of users and items  

### Limitations

- Still suffers from cold start problem  
- Requires careful tuning of the number of latent factors  
- Latent factors have no predefined meaning, so it is hard to explain why an item is recommended  

## Singular Value Decomposition (SVD)

SVD finds latent structures. Instead of directly comparing users and items in the sparse rating matrix, it projects them into a lower-dimensional space where similarities are easier to compute.  
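
To make the projection concrete, the sketch below uses `numpy.linalg.svd` on a small sample rating matrix, keeps the top `k = 2` singular values, and compares users in the resulting latent space. The matrix and the choice of `k` are illustrative assumptions, and zeros are treated as actual values here, which is a simplification.

```python
import numpy as np

# Small sample rating matrix (rows = users, columns = items; 0 = not rated)
R = np.array([
    [5, 4, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 0, 5, 4],
    [0, 1, 5, 4],
], dtype=float)

# Full decomposition: R = U @ diag(s) @ Vt, singular values in descending order
U, s, Vt = np.linalg.svd(R, full_matrices=False)

# Keep the k strongest latent dimensions
k = 2
user_coords = U[:, :k] * s[:k]           # users in latent space
R_k = user_coords @ Vt[:k, :]            # best rank-k approximation of R

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Users 0 and 1 both favor the first item; user 3 favors the last two
print("sim(user0, user1):", round(cosine(user_coords[0], user_coords[1]), 3))
print("sim(user0, user3):", round(cosine(user_coords[0], user_coords[3]), 3))
```

In the two-dimensional space, users with similar tastes end up pointing in nearly the same direction even when their raw rating rows only partly overlap.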

### Use Cases

- Capturing hidden patterns in user-item interactions  
- Reducing dimensionality before computing similarities  
- Estimating missing ratings in recommendation systems  

### Applications of SVD

- Recommender systems  
- Dimensionality reduction  
- Image compression  
- Natural language processing  


In [4]:
import numpy as np
import pandas as pd
from sklearn.decomposition import TruncatedSVD

# Step 1: Create a sample user-movie rating matrix
ratings = pd.DataFrame([
    [5, 4, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 0, 5, 4],
    [0, 1, 5, 4]
], columns=["Inception", "Titanic", "Avatar", "Interstellar"],
   index=["User1", "User2", "User3", "User4", "User5"])

print("Original User-Item Matrix:")
print(ratings)

# Step 2: Apply Matrix Factorization (SVD)
svd = TruncatedSVD(n_components=2)  # reduce to 2 latent factors
latent_matrix = svd.fit_transform(ratings)       # U·Σ: user coordinates (users × latent factors)
latent_matrix_items = svd.components_            # V^T matrix (latent factors × items)

# Step 3: Reconstruct the predicted rating matrix
reconstructed_ratings = np.dot(latent_matrix, latent_matrix_items)

predicted_ratings = pd.DataFrame(reconstructed_ratings,
                                 columns=ratings.columns,
                                 index=ratings.index)

print("\nPredicted Ratings (after Matrix Factorization):")
print(predicted_ratings.round(2))

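# Step 4: Rank all movies for a user (e.g., User2) by predicted rating
# (this includes movies the user already rated; filter those out for real recommendations)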
user = "User2"
user_predictions = predicted_ratings.loc[user].sort_values(ascending=False)

print(f"\nTop recommendations for {user}:")
print(user_predictions)


Original User-Item Matrix:
       Inception  Titanic  Avatar  Interstellar
User1          5        4       0             1
User2          4        0       0             1
User3          1        1       0             5
User4          0        0       5             4
User5          0        1       5             4

Predicted Ratings (after Matrix Factorization):
       Inception  Titanic  Avatar  Interstellar
User1       5.29     2.98   -0.59          1.69
User2       3.00     1.70   -0.25          1.05
User3       1.75     1.25    2.10          2.94
User4      -0.38     0.31    4.48          4.47
User5       0.01     0.53    4.44          4.60

Top recommendations for User2:
Inception       3.002019
Titanic         1.698432
Interstellar    1.047272
Avatar         -0.248552
Name: User2, dtype: float64


In [2]:
from surprise import SVD, Dataset, Reader
from surprise.model_selection import cross_validate

# Load built-in MovieLens dataset
data = Dataset.load_builtin('ml-100k')

# Use SVD (matrix factorization)
algo = SVD()

# Run cross-validation
results = cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Dataset ml-100k could not be found. Do you want to download it? [Y/n] Trying to download dataset from https://files.grouplens.org/datasets/movielens/ml-100k.zip...
Done! Dataset ml-100k has been saved to /home/mht/.surprise_data/ml-100k
Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9405  0.9309  0.9436  0.9402  0.9230  0.9356  0.0076  
MAE (testset)     0.7392  0.7349  0.7440  0.7394  0.7292  0.7373  0.0050  
Fit time          0.99    1.03    1.03    0.93    0.92    0.98    0.04    
Test time         0.08    0.07    0.13    0.13    0.07    0.10    0.02    
