# Recommendation Engines

Recommendation engines are typically built to serve one or more of these goals:

* **Similar item recommendations**: surfacing items that are similar to an item you specify.
* **Personalized rankings**: a list of recommended items that is re-ranked for a specific user.
* **New item recommendations**: offering relevant recommendations when new items are added to your catalog. This is one of the most challenging problems in building a recommendation engine.

## How does a recommendation engine work?

At a high level, the two simplest strategies are:

* Recommend the items that are most popular among all users.
* Divide the users into segments based on their preferences (user features) and recommend items based on the segment each user belongs to.
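
The two strategies above can be sketched in a few lines of pandas. This is only an illustration: the `interactions` table, its column names, and the `teen`/`adult` segments are made up for the example.

```python
import pandas as pd

# Hypothetical toy interaction log: one row per (user, item) purchase.
interactions = pd.DataFrame({
    "user":    ["u1", "u1", "u2", "u2", "u3", "u3", "u4"],
    "segment": ["teen", "teen", "teen", "teen", "adult", "adult", "adult"],
    "item":    ["A", "B", "A", "C", "D", "A", "D"],
})

def most_popular(df, n=2):
    """Return the n items with the most interactions in df."""
    return df["item"].value_counts().head(n).index.tolist()

def most_popular_in_segment(df, segment, n=2):
    """Same as most_popular, restricted to one user segment."""
    return most_popular(df[df["segment"] == segment], n)

global_top = most_popular(interactions, n=1)                      # popularity baseline
adult_top = most_popular_in_segment(interactions, "adult", n=1)   # segment baseline
print(global_top, adult_top)
```

In this toy data the globally most popular item and the most popular item in the `adult` segment differ, which is the reason for segmenting in the first place.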

### Content based filtering

This algorithm recommends products which are similar to the ones that a user has liked in the past.

#### But what does **"similar"** mean in the case of movies, music, books, etc.?

First, we need to store all the information related to each user in vector form (the **profile vector**). This vector captures the user's past behavior, for example the movies they liked or disliked and the ratings they gave.

All the information related to items is stored in another vector called the **item vector**. For example, the item vector for a movie contains details such as genre, cast, and director.

Once the data about users and items is collected in vectors, we can apply vector operations to them, including calculating the distance between them.

One common way to measure similarity between vectors is **cosine similarity**. Cosine similarity measures the cosine of the angle between two **non-zero** vectors of an inner product space. It is concerned with the orientation of the vectors rather than their magnitude.

![image](./img/cosine-similarity-1007790.jpeg) 


The cosine value ranges from -1 to 1. The items are sorted in descending order of this value, and the top-n items are recommended.
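
As a rough sketch of this ranking step: cosine similarity is cos(θ) = (a · b) / (‖a‖ ‖b‖), which can be computed directly with NumPy. The three-genre feature layout, the profile vector, and the movie names below are invented for illustration.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical feature layout: [action, comedy, romance]
profile = np.array([3.0, 1.0, 0.0])          # user mostly likes action
items = {
    "Action Movie":  np.array([1.0, 0.0, 0.0]),
    "Action Comedy": np.array([1.0, 1.0, 0.0]),
    "Romance":       np.array([0.0, 0.0, 1.0]),
}

# Score every item against the profile and sort in descending order.
scores = {title: cosine_sim(profile, vec) for title, vec in items.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
top_2 = ranked[:2]
print(top_2)
```

Because only the orientation matters, scaling a vector (for example `[1, 2]` versus `[10, 20]`) leaves the similarity unchanged.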

##### Advantages and Disadvantages:

Advantages:
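* No data about other users is needed, since recommendations are based only on the user's own profile.
* Able to recommend to users with unique tastes.
* Able to recommend new and unpopular items, since no ratings from other users are required.
* Can explain a recommendation by pointing to the item features that caused it.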

Disadvantages:
* Finding appropriate features is hard.
* Doesn't recommend items outside the user profile.
    * Because recommendations are driven entirely by the user's own profile, the algorithm is limited to items that resemble what the user has already bought or liked. If a user has watched or liked only action movies in the past, the system will recommend only action movies. This makes for a very narrow engine.

------

### Collaborative filtering

The collaborative filtering algorithm uses “User Behavior” to recommend items. It is one of the most commonly used algorithms in the industry because it does not depend on any additional information about the users or the items. Collaborative filtering is based on the idea that similar people (based on the data) generally tend to like similar things.


* User-User collaborative filtering: This algorithm first finds the similarity score between users. Based on this similarity score, it then picks out the most similar users and recommends products which these similar users have liked or bought previously.

![image](https://miro.medium.com/max/720/0*o0zVW2O6Rv-LI5Mu.png) 
source: https://miro.medium.com/max/720/0*o0zVW2O6Rv-LI5Mu.png


* Item-Item collaborative filtering: In this algorithm, we compute the similarity between each pair of items. Based on that, we recommend items that are similar to the ones the user has liked in the past.

Building a collaborative filtering system raises a few questions:

* How do you determine which users or items are similar to one another?
* Given that you know which users are similar, how do you determine the rating that a user would give to an item based on the ratings of similar users?
* How do you measure the accuracy of the ratings you calculate?


There is no single answer. Collaborative filtering is a family of algorithms: there are multiple ways to find similar users or items and multiple ways to calculate a rating from the ratings of similar users. The choices you make determine which type of collaborative filtering approach you end up with.
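
To make the three questions concrete, here is one possible set of choices, sketched on a made-up 3x3 rating matrix: cosine similarity over co-rated items, a similarity-weighted average for the prediction, and RMSE for accuracy. The matrix `R` and the helper names are assumptions for this example, not a reference implementation.

```python
import numpy as np

# Hypothetical ratings: rows are users, columns are items, NaN = not rated.
R = np.array([
    [5.0, 4.0, np.nan],   # user 0: rating for item 2 is unknown
    [5.0, 4.0, 4.0],
    [1.0, 2.0, 1.0],
])

def cosine_on_common(u, v):
    """Cosine similarity computed only over items both users rated."""
    mask = ~np.isnan(u) & ~np.isnan(v)
    if not mask.any():
        return 0.0
    a, b = u[mask], v[mask]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict(R, user, item):
    """Similarity-weighted average of other users' ratings for item."""
    sims, ratings = [], []
    for other in range(R.shape[0]):
        if other == user or np.isnan(R[other, item]):
            continue
        sims.append(cosine_on_common(R[user], R[other]))
        ratings.append(R[other, item])
    sims, ratings = np.array(sims), np.array(ratings)
    total = np.abs(sims).sum()
    if total == 0:
        return float("nan")
    return float(sims @ ratings / total)

def rmse(predicted, actual):
    """Root mean squared error between predicted and held-out ratings."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))

pred = predict(R, user=0, item=2)
print(round(pred, 2))
```

Note that cosine similarity on raw ratings still gives user 2 a high similarity to user 0 even though their tastes differ. Subtracting each user's mean rating first (which amounts to Pearson correlation) is a common refinement.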


##### Advantages and Disadvantages:

Advantages:
* No domain knowledge is needed because the embeddings are learned automatically.

Disadvantages:
* It is hard to add new features that might improve the quality of the model.
* Cannot handle new items or users. This is known as the **cold start** problem.
    * One possible solution is to recommend the best-selling products, i.e. the products that are in high demand. Another is to recommend the products that would bring the most profit to the business.
* The algorithm can be time-consuming, as it involves calculating the similarity between every pair of users or items and then computing predictions from those similarity scores.
    * One way of handling this is to use only a subset of the users or items when making predictions.


# Movie recommendation using MovieLens

https://grouplens.org/datasets/movielens/

Small: 100,000 ratings and 3,600 tag applications applied to 9,000 movies by 600 users. Last updated 9/2018.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns

In [None]:
# Loading the data
df_ratings = pd.read_csv('./data/ml-latest-small/ratings.csv')
df_ratings.head()

In [None]:
sns.histplot(data=df_ratings, x="rating")


In [None]:
df_movies = pd.read_csv("./data/ml-latest-small/movies.csv")
df_movies.head()

In [None]:
df = pd.merge(df_ratings, df_movies, on="movieId", how="left")
df.head()

The dataset is a collection of ratings by a number of users for different movies. Let’s find out the average rating for each and every movie in the dataset.



In [None]:
df_avg_ratings = pd.DataFrame(df.groupby('title')['rating'].mean())
df_avg_ratings.head()

An average rating is only as reliable as the number of ratings behind it: a movie rated 5.0 by a single user tells us much less than one rated 4.2 by hundreds. Therefore, we will also record the total number of ratings cast for each movie.

In [None]:
# groupby(...).count() already returns a Series indexed by title
df_avg_ratings['total_ratings'] = df.groupby('title')['rating'].count()
df_avg_ratings.head()

Let's pivot the table to get the user-movie matrix.

In [None]:
user_movie_mat = df.pivot_table(index='userId',columns='title',values='rating')
user_movie_mat.head()

In [None]:
user_movie_mat[user_movie_mat.notnull()]

#### Note that a lot of the entries are NaN.
Recommendation systems usually deal with highly sparse data, since not every user has seen and rated every movie. With larger datasets, a dense matrix like this wastes memory and may not fit in memory at all. One solution is to work with the sparse matrix types in `scipy.sparse`.
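
A minimal sketch of the sparse representation, using `scipy.sparse.csr_matrix` on a handful of made-up (user, movie, rating) triples. For the MovieLens data, the integer row and column positions would have to be derived from `userId` and `movieId` first; that step is not shown here.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical ratings as (row, column, value) triples.
user_idx = np.array([0, 0, 1, 2])
movie_idx = np.array([0, 2, 1, 2])
ratings = np.array([4.0, 5.0, 3.0, 2.0])

# Only the 4 observed ratings are stored, not all 3 x 4 cells.
sparse_mat = csr_matrix((ratings, (user_idx, movie_idx)), shape=(3, 4))

density = sparse_mat.nnz / (sparse_mat.shape[0] * sparse_mat.shape[1])
print(sparse_mat.nnz, round(density, 2))
```

One caveat: a sparse matrix treats missing entries as 0 rather than NaN, so "not rated" and "rated 0" become indistinguishable unless the rating scale excludes 0.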

Now let's try a few approaches to build CF recommendation systems

In [None]:
# pick a movie before 2019
target_movie = "toy story"

In [None]:
df_movies[df_movies["title"].str.lower().str.contains(target_movie)]

In [None]:
target_movie = "Toy Story (1995)"

## 1. Correlation

In [None]:
target_corr = user_movie_mat.corrwith(user_movie_mat[target_movie])

In [None]:
target_corr.head()

In [None]:
target_corr = target_corr.dropna()

In [None]:
target_corr = target_corr.reset_index()
target_corr.columns = ["title", "corr"]

In [None]:
target_corr = pd.merge(target_corr, df_avg_ratings, on="title")
target_corr = target_corr.sort_values(by='corr', ascending=False)
target_corr.head()

In [None]:
target_corr[target_corr['total_ratings']>100].head()
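
The filter on `total_ratings` matters because `corrwith` computes the Pearson correlation only over users who rated both movies, and a correlation built from very few shared ratings is unreliable. The toy frame below (made-up ratings, hypothetical movie names) illustrates this:

```python
import numpy as np
import pandas as pd

# Rows are users, columns are movies; NaN = not rated.
toy = pd.DataFrame({
    "Movie A": [5.0, 4.0, 1.0, 2.0],
    "Movie B": [4.0, 5.0, 2.0, 1.0],        # rated by all four users
    "Movie C": [np.nan, np.nan, 1.0, 2.0],  # shares only two raters with Movie A
})

corr = toy.corrwith(toy["Movie A"])

# Number of users who rated both the column and Movie A.
overlap = toy.loc[toy["Movie A"].notna()].notna().sum()

print(pd.DataFrame({"corr": corr, "overlap": overlap}))
```

Movie C gets a perfect correlation with Movie A from just two shared ratings, while Movie B, backed by four, scores lower. Requiring a minimum number of ratings is a simple guard against such spurious matches.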

## 2. KNN

In [None]:
from operator import itemgetter 
from sklearn.preprocessing import Normalizer
from sklearn.neighbors import NearestNeighbors


Remember that the goal here is to find similar movies, so we transpose `user_movie_mat` so that each row represents a movie and each column a user.

**KNN can be very sensitive to the scale of data as it relies on computing the distances.**
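
A small numerical check of why normalization helps, using two made-up rating vectors: once vectors are scaled to unit length, squared Euclidean distance equals `2 * (1 - cosine similarity)`, so distance no longer depends on magnitude.

```python
import numpy as np

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    return v / np.linalg.norm(v)

# Two hypothetical rating vectors with similar direction but different scale.
a = np.array([5.0, 0.0, 3.0])
b = np.array([1.0, 0.0, 0.5])

cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

raw_dist = np.linalg.norm(a - b)                         # dominated by magnitude
sq_dist = np.sum((l2_normalize(a) - l2_normalize(b)) ** 2)

print(round(raw_dist, 3), round(sq_dist, 4), round(2 * (1 - cos), 4))
```

The raw distance between `a` and `b` is large even though they point in almost the same direction; after normalization the distance reflects only the angle between them.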

In [None]:
knn_mat = user_movie_mat.T
knn_mat = knn_mat.fillna(0)

In [None]:
knn_mat.iloc[:,:] = Normalizer().fit_transform(knn_mat)

In [None]:
knn = NearestNeighbors(metric='cosine', algorithm='auto')

In [None]:
knn.fit(knn_mat)

distances, indices = knn.kneighbors(knn_mat, n_neighbors=6)

In [None]:
# get index for the target movie
index_for_movie = knn_mat.index.tolist().index(target_movie)

# find the indices for the similar movies
sim_movies = indices[index_for_movie].tolist()
# drop the first neighbor, which is expected to be the target movie itself
sim_movies.pop(0)


In [None]:
print(itemgetter(*sim_movies)(knn_mat.index.tolist()))
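
The cells above compute neighbors for every movie and then look up one row. A possible alternative, sketched here on a tiny made-up matrix, is to query only the target row. The `toy` frame and the `similar` helper are hypothetical names for this example.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Hypothetical movie-by-user matrix (rows: movies, columns: users), 0 = not rated.
toy = pd.DataFrame(
    [[5, 4, 0], [5, 5, 0], [0, 1, 5], [1, 0, 4]],
    index=["A", "B", "C", "D"],
    dtype=float,
)

knn_toy = NearestNeighbors(metric="cosine", algorithm="brute").fit(toy.values)

def similar(title, k=1):
    """Return the k nearest movies to title, excluding title itself."""
    row = toy.loc[[title]].values
    # Ask for k + 1 neighbors because the query movie is in the fitted data.
    _, idx = knn_toy.kneighbors(row, n_neighbors=k + 1)
    return [toy.index[i] for i in idx[0] if toy.index[i] != title][:k]

print(similar("A"), similar("C"))
```

Filtering by title instead of always dropping position 0 also avoids removing the wrong movie when two movies have identical rating vectors.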

## 3. Cosine similarity

In [None]:
from sklearn.metrics.pairwise import cosine_similarity


In [None]:
# despite the name, this matrix holds similarities (higher = more similar), not distances
dist_matrix = cosine_similarity(knn_mat, knn_mat)

In [None]:
dist_matrix.shape

In [None]:
df_cos = pd.DataFrame(dist_matrix)
df_cos.columns = user_movie_mat.columns
df_cos.index = user_movie_mat.columns

In [None]:
target_cos = pd.merge(df_cos[target_movie], df_avg_ratings, on="title")

target_cos.columns = ["similarity", "rating", "total_ratings"]
target_cos = target_cos.sort_values(by='similarity', ascending=False)

In [None]:
target_cos[target_cos['total_ratings']>100].head()