# Introduction To Recommendation Engines
Recommendation engines (or systems) are algorithms that suggest relevant items to users based on their preferences, behavior, or patterns. They are widely used in e-commerce, media streaming, and social media to enhance user experience and drive engagement.

#### **Real-world Examples**:
- **Amazon**: "Customers who bought this also bought..."
- **Netflix**: "Because you watched..."
- **Spotify**: "Recommended songs for you"

### **Types of Recommendation Systems**:
Recommendation engines can be classified into several types:

#### **A. Collaborative Filtering**:
Collaborative filtering relies on user behavior (ratings, purchases, views) rather than item attributes: it infers what a user will like from the preferences of many users.
- **User-based Collaborative Filtering**: This method recommends items to a user based on what similar users have liked. It involves:
    1. Finding similar users (using similarity measures like cosine similarity).
    2. Recommending items that similar users liked but the current user hasn't interacted with yet.

- **Item-based Collaborative Filtering**: Instead of finding similar users, item-based collaborative filtering recommends items based on the similarity between items.
    1. Calculate similarity between items.
    2. Recommend items similar to those the user has interacted with.
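The two user-based steps above can be sketched on a tiny ratings matrix. This is a minimal illustration, not the method used later in this notebook: the `ratings` array is made up, and treating `0` as "not rated" inside the cosine computation is a simplifying assumption.

```python
import numpy as np

# Hypothetical toy ratings: rows are users, columns are items, 0 means "not rated".
ratings = np.array([
    [5, 4, 0, 0],   # user 0 (the target user)
    [5, 5, 4, 0],   # user 1, tastes close to user 0
    [1, 0, 5, 4],   # user 2, different tastes
], dtype=float)

def cosine(u, v):
    """Cosine similarity between two rating vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

target = 0

# Step 1: find the most similar other user.
sims = {u: cosine(ratings[target], ratings[u])
        for u in range(len(ratings)) if u != target}
nearest = max(sims, key=sims.get)

# Step 2: recommend items the neighbor rated that the target has not rated yet.
unseen_and_rated = (ratings[target] == 0) & (ratings[nearest] > 0)
recommended = [int(i) for i in np.where(unseen_and_rated)[0]]

print(nearest, recommended)  # 1 [2]
```

Item-based filtering follows the same pattern with the matrix transposed, so that similarities are computed between item columns instead of user rows.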

#### **B. Content-based Filtering**:
This approach suggests items based on the features of the items that the user has shown interest in. For example, if a user likes action movies, the system will recommend other action movies based on their descriptions.

#### **C. Hybrid Methods**:
A hybrid recommendation system combines the strengths of collaborative and content-based filtering to improve accuracy. This could be done by averaging the predictions of both models or by using collaborative filtering when sufficient user data is available, and content-based when the data is sparse.
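Both combination strategies mentioned above fit in a few lines. The sketch below is only illustrative: the function name `hybrid_score`, the `min_ratings` threshold, and the `weight` parameter are hypothetical choices, not part of any library.

```python
def hybrid_score(cf_score, content_score, n_user_ratings, min_ratings=5, weight=0.5):
    """Blend two predicted scores, or fall back to content-based for sparse users."""
    if n_user_ratings < min_ratings:
        # Too little behavior data: rely on item features only.
        return content_score
    # Enough data: weighted average of the two models' predictions.
    return weight * cf_score + (1 - weight) * content_score

print(hybrid_score(4.0, 3.0, n_user_ratings=20))  # 3.5
print(hybrid_score(4.0, 3.0, n_user_ratings=2))   # 3.0
```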

**Example**: 
- Netflix uses a hybrid approach, combining collaborative filtering for personalized recommendations and content-based filtering to recommend movies with similar themes or genres.

### **Metrics for Recommendation Systems**:
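- **Accuracy**: How closely predicted ratings match the ratings users actually gave, commonly measured with RMSE (Root Mean Squared Error) or MAE (Mean Absolute Error); lower values are better.
- **Precision**: The fraction of recommended items that are relevant to the user.
- **Recall**: The fraction of all relevant items that were actually recommended.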
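Precision and recall are usually computed on the top `k` recommendations for each user. The following is a small sketch with made-up item labels; the helper name `precision_recall_at_k` is hypothetical.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision@k and recall@k for one user's ranked recommendation list."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 4 items recommended, 3 items the user actually liked.
recommended = ["A", "B", "C", "D"]
relevant = {"B", "D", "E"}

precision, recall = precision_recall_at_k(recommended, relevant, k=4)
print(precision, round(recall, 2))  # 0.5 0.67
```

Two of the four recommendations are relevant (precision 0.5), and two of the three relevant items were found (recall about 0.67).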

## Collaborative Filtering Recommendation System For Movies
- In this section, we will implement **Collaborative Filtering** using Python with the **Surprise** library.
- We’ll use the MovieLens dataset, which contains user ratings for movies.

### **Step 1: Install and Import Required Libraries**

In [1]:
# Install scikit-surprise (comment this line out if the library is already installed)
!pip install scikit-surprise

# Importing necessary modules from the Surprise library
# Dataset: Used for loading and manipulating datasets
# Reader: Helps define the format of the data to be used in the Dataset
from surprise import Dataset, Reader

# Import train_test_split from the model_selection module
# This will be used to split the dataset into training and testing sets
from surprise.model_selection import train_test_split

# Import KNNBasic, which is a basic k-nearest neighbors algorithm implementation
from surprise import KNNBasic

# Import accuracy to calculate prediction accuracy on the test set
from surprise import accuracy

# Import KFold for cross-validation to split the dataset into k folds
from surprise.model_selection import KFold

# Import Pandas for data manipulation
import pandas as pd

Collecting scikit-surprise
  Using cached scikit_surprise-1.1.4.tar.gz (154 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (pyproject.toml) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.4-cp312-cp312-macosx_10_15_x86_64.whl size=513452 sha256=aff183b412766157ca4496948f06874e65b293cc92d21500a07004dd4ac8cb65
  Stored in directory: /Users/bambammain/Library/Caches/pip/wheels/75/fa/bc/739bc2cb1fbaab6061854e6cfbb81a0ae52c92a502a7fa454b
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Successfully installed scikit-surprise-1.1.4


#### **Step 2: Load the Data**

In [3]:
# Load the MovieLens dataset (small version, containing 100,000 ratings)
# The 'ml-100k' dataset is a built-in dataset provided by the surprise library
data = Dataset.load_builtin('ml-100k')

# Get the raw ratings data
raw_ratings = data.raw_ratings
# raw_ratings is a list that contains tuples of (user_id, item_id, rating, timestamp).

# We convert this list into a DataFrame with columns named user_id, item_id, rating, and timestamp, to view it.
df = pd.DataFrame(raw_ratings, columns=["user_id", "item_id", "rating", "timestamp"])

# Display the DataFrame (the notebook shows only the first and last rows)
df

Dataset ml-100k could not be found. Do you want to download it? [Y/n] 

 Y


Trying to download dataset from https://files.grouplens.org/datasets/movielens/ml-100k.zip...
Done! Dataset ml-100k has been saved to /Users/bambammain/.surprise_data/ml-100k


Unnamed: 0,user_id,item_id,rating,timestamp
0,196,242,3.0,881250949
1,186,302,3.0,891717742
2,22,377,1.0,878887116
3,244,51,2.0,880606923
4,166,346,1.0,886397596
...,...,...,...,...
99995,880,476,3.0,880175444
99996,716,204,5.0,879795543
99997,276,1090,1.0,874795795
99998,13,225,2.0,882399156


- The MovieLens dataset consists of user ratings for movies.
- We load the built-in version which contains 100,000 ratings.

#### **Step 3: Train-Test Split**

In [5]:
# Split the data into train and test sets
trainset, testset = train_test_split(data, test_size=0.2)


# We split the dataset so that 80% of the data is used for training and 20% for testing.
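Conceptually, the split shuffles the ratings and holds out a fraction of them. The sketch below illustrates that idea in plain Python with a fixed seed; it is not Surprise's implementation, and `simple_split` and `toy_ratings` are hypothetical names. Surprise's own `train_test_split` also accepts a `random_state` argument if a reproducible split is needed.

```python
import random

def simple_split(ratings, test_size=0.2, seed=42):
    """Shuffle a list of ratings and split it into train and test parts."""
    shuffled = list(ratings)               # copy so the input is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for reproducibility
    n_test = int(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical (user_id, item_id, rating) tuples
toy_ratings = [(u, i, 3.0) for u in range(10) for i in range(10)]

train, test = simple_split(toy_ratings)
print(len(train), len(test))  # 80 20
```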

#### **Step 4: Build a User-Based Collaborative Filtering Model**

In [20]:
# Use KNN (K-Nearest Neighbors) for collaborative filtering
sim_options = {'name': 'cosine', 'user_based': True}  # Cosine similarity, user-based approach
algo = KNNBasic(sim_options=sim_options)

# Note: There are other algorithms from the Surprise library besides KNN, such as:
# - SVD (Singular Value Decomposition) for matrix factorization
# - BaselineOnly for basic collaborative filtering with baseline estimates
# - CoClustering for a co-clustering-based approach
# - NormalPredictor for random predictions

In [21]:
# Train the model on the training data
# Fit the KNN model on the training set to learn user similarities based on their ratings
algo.fit(trainset)

# We use KNN with cosine similarity to measure how similar users are to each other.
# 'user_based': True selects user-based collaborative filtering: a user's rating for an item
# is predicted from the ratings that similar users gave to that item.

Computing the cosine similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBasic at 0x11b3addb2e0>

### What `KNNBasic` Does Behind the Scenes
`KNNBasic` from the Surprise library implements the K-Nearest Neighbors (KNN) algorithm specifically for collaborative filtering.
It essentially creates a model that relies on the idea that similar users (or items) will have similar ratings. By leveraging the ratings of similar users, it makes predictions for users who have not interacted with certain items, aiming to provide personalized recommendations.

1. **Similarity Calculation**:
   - **Computes Similarity**: It calculates the similarity between users (or items, depending on the `user_based` parameter). For user-based collaborative filtering, it measures how alike users are based on their ratings using a specified similarity metric (like cosine similarity, Pearson correlation, etc.).
   - **User-based vs. Item-based**: If `user_based=True`, it computes user-user similarities; if `user_based=False`, it computes item-item similarities.

2. **Finding Neighbors**:
   - **K Selection**: When predicting a rating, `KNNBasic` uses at most `k` nearest neighbors, i.e., the most similar users who have rated the item in question (or, in the item-based case, the most similar items the user has rated). `k` is a parameter you can set; it defaults to 40 in Surprise.
   - **Sorting**: Candidate neighbors are ranked by their similarity scores, and the top `k` are kept.

3. **Prediction**:
   - **Weighted Average**: When making predictions for a user-item pair, `KNNBasic` computes a weighted average of the ratings from the `K` nearest neighbors. The weights are determined by the similarity scores, meaning closer neighbors have a larger influence on the prediction.
   - **Handling Unrated Items**: If a user has not rated an item, the algorithm uses the ratings from similar users to estimate a score for that item.

4. **Efficiency**:
   - **Sparse Data Handling**: Only observed ratings are used, which suits collaborative filtering data where most users rate only a small fraction of the items.
   - **Cost**: The similarity matrix is computed once during `fit`, and its size grows with the square of the number of users (or items), so training time and memory depend heavily on the size of the dataset. Prediction time also grows with the value of `k`.
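The weighted-average prediction step can be sketched in a few lines of NumPy. This is an illustration of the idea under simplifying assumptions (all similarities positive, every listed neighbor has rated the item), not Surprise's actual code; `knn_predict` and the sample values are hypothetical.

```python
import numpy as np

def knn_predict(similarities, neighbor_ratings, k=2):
    """Similarity-weighted average of the k most similar neighbors' ratings."""
    similarities = np.asarray(similarities, dtype=float)
    neighbor_ratings = np.asarray(neighbor_ratings, dtype=float)
    top_k = np.argsort(similarities)[::-1][:k]   # indices of the k largest similarities
    weights = similarities[top_k]
    return float(weights @ neighbor_ratings[top_k] / weights.sum())

# Hypothetical neighbors who all rated the target item
sims = [0.9, 0.1, 0.6]      # similarity of each neighbor to the target user
ratings = [5.0, 1.0, 3.0]   # each neighbor's rating of the target item

prediction = knn_predict(sims, ratings, k=2)
print(round(prediction, 2))  # 4.2
```

With `k=2`, only the neighbors with similarities 0.9 and 0.6 contribute: (0.9 × 5 + 0.6 × 3) / (0.9 + 0.6) = 4.2. The least similar neighbor is ignored.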

#### **Step 5: Evaluate the Model**

In [22]:
# Test the algorithm on the test set
predictions = algo.test(testset)

# Calculate the accuracy of predictions
accuracy.rmse(predictions)  # Root Mean Squared Error
accuracy.mae(predictions)   # Mean Absolute Error

# We evaluate the model on the held-out test set using RMSE (Root Mean Squared Error) and
# MAE (Mean Absolute Error). Both measure how far the predicted ratings are from the actual user ratings (lower is better).

RMSE: 1.0208
MAE:  0.8083


0.8083226976603757
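To make the two reported metrics concrete, here is a small sketch that computes RMSE and MAE by hand. The `actual` and `predicted` lists are made-up values, not results from the model above.

```python
import math

def rmse(actual, predicted):
    """Root Mean Squared Error: square root of the mean squared difference."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean Absolute Error: mean of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical true ratings and model predictions
actual = [4.0, 3.0, 5.0, 1.0]
predicted = [3.0, 3.0, 4.0, 3.0]

print(round(rmse(actual, predicted), 4))  # 1.2247
print(mae(actual, predicted))             # 1.0
```

RMSE penalizes the single large error (predicting 3.0 for a rating of 1.0) more heavily than MAE does, which is why it is the larger of the two numbers.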

## Simple Content-Based Recommendation System Using Movie Genres

#### **Step 1: Load Movie Data**

In [23]:
# Sample dataset (Movies and their genres)
import pandas as pd

data = {'Movie_ID': [1, 2, 3, 4],
        'Title': ['The Matrix', 'Inception', 'Toy Story', 'Interstellar'],
        'Genre': ['Action, Sci-Fi', 'Action, Sci-Fi', 'Animation, Family', 'Drama, Sci-Fi']}
df = pd.DataFrame(data)
df
# Here we have a small dataset where each movie is associated with its genre.

Unnamed: 0,Movie_ID,Title,Genre
0,1,The Matrix,"Action, Sci-Fi"
1,2,Inception,"Action, Sci-Fi"
2,3,Toy Story,"Animation, Family"
3,4,Interstellar,"Drama, Sci-Fi"


#### **Step 2: Preprocess the Data**

We need to convert the genres into a format that allows us to calculate similarities between movies.

In [24]:
# Split the 'Genre' column into a list of genres for each movie
df['Genre'] = df['Genre'].apply(lambda x: x.split(', '))

# Create a one-hot encoded DataFrame where each genre becomes a separate column
# The genres are joined with '|' as a separator, and then get_dummies creates binary columns
onehot_df = df['Genre'].str.join('|').str.get_dummies()

# Display the one-hot encoded DataFrame
onehot_df

Unnamed: 0,Action,Animation,Drama,Family,Sci-Fi
0,1,0,0,0,1
1,1,0,0,0,1
2,0,1,0,1,0
3,0,0,1,0,1


#### **Step 3: Calculate Similarity Between Movies**

In [25]:
# Calculate cosine similarity between movies based on their genres
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity(onehot_df)

# We use cosine similarity to calculate the similarity between the movies based on their genres.

In [26]:
# This displays a matrix comparing the 4 movies with each other
cosine_sim

array([[1. , 1. , 0. , 0.5],
       [1. , 1. , 0. , 0.5],
       [0. , 0. , 1. , 0. ],
       [0.5, 0.5, 0. , 1. ]])

### Cosine Similarity

#### Cosine Similarity Calculation:

1. **Cosine Similarity Overview**: 
   - Cosine similarity compares the **angle** between two vectors. A value of `1` means they are perfectly aligned (same direction), and a value of `0` means they are completely orthogonal (no similarity).
   - For binary data like this, cosine similarity can be thought of as comparing the **overlap** in the presence of specific genres between two movies.

2. **Interpretation of `cosine_sim`**:
   The `cosine_similarity(onehot_df)` computes the similarity between each row (i.e., each movie) based on the one-hot encoded genre features. 


#### `cosine_sim` Output:
Summary of `cosine_sim` array:
- **1.0**: Perfect similarity (when the movies have the same genres).
- **0.5**: Partial similarity (here, two movies with two genres each that share exactly one genre).
- **0.0**: No similarity (no shared genres).
    
Let's break down the values from `cosine_sim`:

- **All diagonal values are 1**, meaning each movie is perfectly similar to itself (`cosine_sim[0, 0] = 1`).

- **Movies 0 and 1** (rows 0 and 1 in `onehot_df`):
  - They have the same genres: Action and Sci-Fi (`[1, 0, 0, 0, 1]`).
  - Cosine similarity between them is **1**, meaning they are perfectly similar (`cosine_sim[0, 1] = 1`).

- **Movies 0 and 2** (rows 0 and 2 in `onehot_df`):
  - Movie 0 has Action and Sci-Fi, while Movie 2 has Animation and Family (`[0, 1, 0, 1, 0]`).
  - There is **no overlap** between the genres of these two movies, so the cosine similarity is **0** (`cosine_sim[0, 2] = 0`).

- **Movies 0 and 3** (rows 0 and 3 in `onehot_df`):
  - Movie 3 has Drama and Sci-Fi (`[0, 0, 1, 0, 1]`), sharing only the Sci-Fi genre with Movie 0.
  - They have a cosine similarity of **0.5** due to this partial overlap (`cosine_sim[0, 3] = 0.5`).

- **Movies 2 and 3** (rows 2 and 3 in `onehot_df`):
  - Movie 2 has Animation and Family, while Movie 3 has Drama and Sci-Fi, so there is **no overlap**, leading to a similarity of **0** (`cosine_sim[2, 3] = 0`).



This matrix reflects the pairwise similarity between movies based on their genres, and you can use it to make genre-based recommendations by finding movies with high similarity to a given movie.
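The values discussed above can be checked by hand with the dot product and vector lengths. The sketch below rebuilds three of the genre vectors directly (column order Action, Animation, Drama, Family, Sci-Fi) rather than reading them from `onehot_df`; the helper `cosine` is a hypothetical name.

```python
import numpy as np

# Genre vectors in the column order: Action, Animation, Drama, Family, Sci-Fi
the_matrix   = np.array([1, 0, 0, 0, 1])
toy_story    = np.array([0, 1, 0, 1, 0])
interstellar = np.array([0, 0, 1, 0, 1])

def cosine(a, b):
    """Dot product divided by the product of the vector lengths."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

shared = int(the_matrix @ interstellar)          # number of shared genres (Sci-Fi only)
sim_partial = cosine(the_matrix, interstellar)   # 1 / (sqrt(2) * sqrt(2))
sim_none = cosine(the_matrix, toy_story)         # 0 / (sqrt(2) * sqrt(2))

print(shared, round(sim_partial, 2), round(sim_none, 2))  # 1 0.5 0.0
```

For binary vectors, the dot product counts the shared genres and each vector's length is the square root of its number of genres, which is why one shared genre out of two gives 0.5.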

- **Cosine Similarity** measures the similarity between two vectors based on the angle between them.
- Here, it calculates how similar two movies are based on their **genre vectors**.
- It’s a common metric for similarity in recommendation systems, particularly when features are represented as **binary or numeric vectors**.

### How Cosine Similarity Works:
1. **Cosine of the Angle**: 
   - The cosine similarity measures the cosine of the angle between two vectors in an **n-dimensional space**.
   - If two vectors point in the same direction, the cosine similarity is **1** (maximum similarity).
   - If two vectors are at **90 degrees** to each other, the cosine similarity is **0** (no similarity).
   - If two vectors point in completely opposite directions, the cosine similarity is **-1** (maximum dissimilarity).

   The formula for cosine similarity between two vectors $A$ and $B$ is:

   $$
   \operatorname{cosine\_similarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}
   $$

   Where:
   - $A \cdot B$ is the **dot product** of the two vectors.
   - $\|A\|$ and $\|B\|$ are the magnitudes (lengths) of the vectors.

2. **Cosine Similarity Between Movies:**
   Here, the vectors represent movies, and each dimension corresponds to a genre (the one-hot encoded DataFrame `onehot_df` represents the genres as binary values).
   
   - Movies with **similar genre patterns** (e.g., both movies are action and adventure) will have a cosine similarity close to **1**.
   - Movies with **very different genres** (e.g., one is comedy and the other is horror) will have a cosine similarity closer to **0**.

### Application to Movies:
- **Genres as Features**: If each movie is represented by a vector of its genres (as in the one-hot encoded `onehot_df`), then `cosine_similarity` calculates how similar two movies are based on their genre composition.
   - For example, if Movie A is an **Action, Adventure** movie and Movie B is an **Action, Sci-Fi** movie, they share one of their two genres (**Action**), so their cosine similarity is **0.5**.
   - If Movie C is a **Romantic Comedy** and Movie D is a **Horror** movie, they share no genres, so their cosine similarity is **0**.
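As a worked check of the Action, Adventure versus Action, Sci-Fi example, assume a genre space ordered as (Action, Adventure, Sci-Fi). The ordering and the vectors below are illustrative assumptions, not part of the dataset used in this notebook.

```latex
A = (1, 1, 0), \qquad B = (1, 0, 1)

A \cdot B = 1 \cdot 1 + 1 \cdot 0 + 0 \cdot 1 = 1, \qquad \|A\| = \|B\| = \sqrt{1^2 + 1^2} = \sqrt{2}

\operatorname{cosine\_similarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{1}{\sqrt{2} \cdot \sqrt{2}} = 0.5
```

Sharing one of two genres therefore gives the same 0.5 that appears in `cosine_sim` for The Matrix and Interstellar.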

#### **Step 4: Make Recommendations**

In [27]:
# Function to recommend movies based on similarity
def recommend(movie_title, df, cosine_sim, top_n=3):
    # Get the index of the movie in the dataframe (assumes the default 0..n-1 index)
    matches = df.index[df['Title'] == movie_title]
    if len(matches) == 0:
        raise ValueError(f"Movie not found: {movie_title}")
    movie_idx = matches[0]

    # Create a list of tuples (index, similarity score), leaving out the movie itself.
    # Filtering by index is safer than dropping the first sorted entry, because another
    # movie can tie at a similarity of 1.0 (e.g. 'The Matrix' and 'Inception').
    similar_movies = [(i, score) for i, score in enumerate(cosine_sim[movie_idx])
                      if i != movie_idx]

    # Sort the movies based on similarity scores (second item) in descending order
    similar_movies = sorted(similar_movies, key=lambda x: x[1], reverse=True)

    # Get the titles of the top_n most similar movies
    recommended_movies = [df['Title'].iloc[i] for i, _ in similar_movies[:top_n]]

    return recommended_movies  # Return the list of recommended movie titles

# Example recommendation
print(recommend('The Matrix', df, cosine_sim))  # Print the recommended movies for 'The Matrix'

# The function finds movies similar to the given movie (in this case 'The Matrix') based on cosine similarity 
# and returns the top 3 recommendations.

['Inception', 'Interstellar', 'Toy Story']
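Note that the output above includes 'Toy Story', whose similarity to 'The Matrix' is 0. One possible refinement is to drop candidates at or below a minimum similarity. The sketch below is a standalone variant under that assumption: `recommend_above` and `min_similarity` are hypothetical names, and the similarity values are copied from the `cosine_sim` matrix shown earlier.

```python
import numpy as np

titles = ['The Matrix', 'Inception', 'Toy Story', 'Interstellar']

# Genre-based cosine similarities, copied from the matrix shown earlier
cosine_sim = np.array([[1.0, 1.0, 0.0, 0.5],
                       [1.0, 1.0, 0.0, 0.5],
                       [0.0, 0.0, 1.0, 0.0],
                       [0.5, 0.5, 0.0, 1.0]])

def recommend_above(movie_title, titles, sim, top_n=3, min_similarity=0.0):
    """Top-n similar titles, skipping the movie itself and scores <= min_similarity."""
    idx = titles.index(movie_title)
    ranked = sorted(
        (i for i in range(len(titles)) if i != idx and sim[idx][i] > min_similarity),
        key=lambda i: sim[idx][i],
        reverse=True,
    )
    return [titles[i] for i in ranked[:top_n]]

print(recommend_above('The Matrix', titles, cosine_sim))  # ['Inception', 'Interstellar']
print(recommend_above('Toy Story', titles, cosine_sim))   # []
```

Returning fewer than `top_n` titles (or none at all) is a design choice: it avoids suggesting movies that share no genres with the query.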
