# In-Depth Machine Learning: Recommendation Systems

---

# Introduction

---

### What is a Recommendation System? 
A recommendation system filters data using different algorithms in order to recommend the most relevant items to users. Often, it first captures a customer's past behavior and then bases its recommendations on that behavior.

### Example Flow of a Recommendation System:
1. Collect the data from the customer in the form of ratings.
2. Comment and store data in a standard database.
3. Filter the data to extract the relevant meaningful information required.
4. Predict the final recommendations.

### Data and Data Collection
The data used in a recommendation system can take many forms. It could relate to customer behavior, interests, browsing history, and even a customer's similarity to other customers.

Data can be collected explicitly and/or implicitly. **Explicit data** is information that is provided intentionally. This includes any form of input from the users, such as the contribution of movie ratings. **Implicit data** is information that is not provided intentionally but gathered from available data streams. This includes things like search history, clicks, order history, etc.


# Main Types of Recommendation Systems

---

## Content-Based Filtering/Recommenders

---

Content-based filtering is often based on the description of the product or the keywords used to specify more about the product. This filtering technique studies a user’s preferred choices and then delivers the most relevant recommendations. Essentially, based on what we like, the algorithm will simply recommend similar items.

A major drawback of this algorithm is that it is limited to recommending items of the same type. It will never recommend products unlike those the user has bought or liked in the past. So if a user has watched or liked only action movies, the system will recommend only action movies. In this respect, this method is a very narrow way of building an engine.

There are a number of ways to measure similarity, and this is a major decision point in this approach. Common methods include Euclidean Distance, Cosine Similarity, and Pearson's Correlation, with Cosine Similarity probably being the most popular. 

### Euclidean Distance
This is the distance "as the crow flies" or the straight line between two points. 

![EuclideanDisPic.png](assets/EuclideanDisPic.png) 
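
As a minimal sketch, the straight-line distance between two vectors can be computed with NumPy. The vectors `a` and `b` below are made-up values used only for illustration.

```python
import numpy as np

# Hypothetical feature vectors for two items (illustrative values only)
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

# Euclidean distance: square root of the summed squared differences
euclidean = np.sqrt(np.sum((a - b) ** 2))

# np.linalg.norm of the difference computes the same quantity
print(euclidean, np.linalg.norm(a - b))  # 5.0 5.0
```

A smaller distance means the two items are more alike, so items would be ranked in ascending order of distance.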

### Cosine Similarity
Cosine similarity works to quantify the relationship between two vectors. 

Suppose A and B are two vectors, where A is the user's profile vector and B is a movie's item vector. The similarity between them can then be calculated as:

![CosSimPic.png](assets/CosSimPic.png)

An advanced version of the cosine similarity function can be seen here: 

![CosSimEq.png](assets/CosSimEq.png)

Cosine similarity values range from -1 to 1. The movies are arranged in descending order of similarity, and the top-most are delivered to the user as recommendations. A value near 1 indicates a strong positive relationship, a value near -1 indicates a strong negative relationship, and a value close to 0 indicates no relationship.

![CosSimPic2.png](assets/CosSimPic2.png)
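
The ranking step can be sketched as follows. The profile vector, the item vectors, and the helper name `cosine_sim` are all hypothetical and chosen only so that the three cases (near 1, near 0, near -1) are easy to see.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical user profile and item vectors over three features
profile = np.array([1.0, 0.0, 1.0])
items = {
    "movie_a": np.array([2.0, 0.0, 2.0]),    # same direction as the profile
    "movie_b": np.array([0.0, 1.0, 0.0]),    # orthogonal to the profile
    "movie_c": np.array([-1.0, 0.0, -1.0]),  # opposite direction
}

# Score every item against the profile, then sort in descending order
scores = {name: cosine_sim(profile, vec) for name, vec in items.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # ['movie_a', 'movie_b', 'movie_c']
```

Note that `movie_a` scores (approximately) 1 even though its vector is twice as long as the profile: cosine similarity depends only on direction, not magnitude.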

The main drawback of this technique is that it recommends movies in the same genre only. If we want recommendations from another movie genre then it might not perform well.

### Pearson's Correlation
Pearson’s Correlation can tell us the extent to which two items are correlated. A higher correlation would indicate greater similarity. 

Pearson’s correlation can be calculated for two users, u and v, using the following formula:

![PearsonCorPic.png](assets/PearsonCorPic.png) 

With:
 - rᵤᵢ = rating given by user (u) to item (i)
 - rᵥᵢ = rating given by user (v) to item (i)
 - rᵥ (mean) = mean of the ratings given by user (v)
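
A minimal sketch of this calculation is shown below. It makes two assumptions that are not stated in the text: unrated items are marked with `np.nan`, and each user's mean is taken over the items both users have rated (some variants use each user's overall mean instead). The function name `pearson_sim` is hypothetical.

```python
import numpy as np

def pearson_sim(ru, rv):
    """Pearson correlation between two users over the items both have rated.

    ru, rv: 1-D arrays of ratings, with np.nan marking unrated items.
    """
    both = ~np.isnan(ru) & ~np.isnan(rv)
    if both.sum() < 2:
        return 0.0  # not enough overlap to measure a correlation
    du = ru[both] - ru[both].mean()   # user u's deviations from their mean
    dv = rv[both] - rv[both].mean()   # user v's deviations from their mean
    denom = np.sqrt((du ** 2).sum() * (dv ** 2).sum())
    return float((du * dv).sum() / denom) if denom > 0 else 0.0

# Hypothetical ratings for two users on five items
u = np.array([5.0, 3.0, np.nan, 4.0, 1.0])
v = np.array([4.0, 2.0, 5.0, np.nan, 1.0])
print(round(pearson_sim(u, v), 3))
```

Only the first, second, and fifth items enter the calculation here, because those are the only items rated by both users.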

## Collaborative-Based Filtering/Recommenders
---

Collaborative filtering techniques usually work with users’ preferences, activities, and behavior. They give recommendations to a user based on that user’s similarity to other users. These techniques need no additional information about the items themselves; we only need to collect and analyze users’ behavior. For this reason, collaborative filtering is one of the most commonly used approaches in the industry. There are different types of collaborative filtering techniques, and we shall look at them in detail below.

## User - User Collaborative Filtering
---

This algorithm first determines the similarity between the users. Based on the similarity score, it identifies the most similar users and recommends products which these similar users have liked or bought previously. Users having a higher similarity score will tend to be recommended similar things. 

As an example, this algorithm could work to find the similarity between each user based on the ratings they have previously given to different movies. The prediction of an item for a user u is calculated by computing the weighted sum of the user ratings given by other users to an item i.

The prediction Pu,i is given by:

![UUCollabPic.png](assets/UUCollabPic.png) 

With this,
 - Pu,i is the predicted rating of item i for user u
 - Rv,i is the rating given by a user v to a movie i
 - Su,v is the similarity between users u and v


With each user's ratings stored in a profile vector, we can then use those vectors to predict the ratings a user would give to items they have not yet rated.

The following steps are taken to get the final recommendation:
1. First, to calculate the similarity between users u and v, we can make use of Pearson correlations.
2. We find the items rated by both users and, using those ratings, calculate the correlation between the users.
3. The predictions can then be calculated using these similarity values. The algorithm calculates the similarity between each pair of users and then calculates the predictions based on those similarities. The assumption here is that users who have a high correlation will tend to have similar tastes.
4. Based on the prediction values, recommendations can be made.
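
The prediction step can be sketched as a similarity-weighted average. This is a minimal illustration under stated assumptions, not a definitive implementation: the similarity matrix is supplied by hand rather than computed, unrated entries are marked with `np.nan`, the denominator uses absolute similarities (a common variant), and the name `predict_rating` is hypothetical.

```python
import numpy as np

def predict_rating(ratings, sims, u, i):
    """Similarity-weighted average of other users' ratings for item i.

    ratings: users x items array, np.nan where a user has not rated an item.
    sims:    users x users similarity matrix (e.g. Pearson correlations).
    """
    # Only users other than u who have actually rated item i can contribute
    others = [v for v in range(ratings.shape[0])
              if v != u and not np.isnan(ratings[v, i])]
    weights = np.array([sims[u, v] for v in others])
    if len(others) == 0 or np.abs(weights).sum() == 0:
        return np.nan  # no usable neighbours for this item
    neighbour_ratings = np.array([ratings[v, i] for v in others])
    return float((weights * neighbour_ratings).sum() / np.abs(weights).sum())

# Hypothetical data: 3 users x 2 items, user 0 has not rated item 1
ratings = np.array([[4.0, np.nan],
                    [5.0, 2.0],
                    [1.0, 5.0]])
sims = np.array([[1.0, 0.9, 0.1],
                 [0.9, 1.0, 0.2],
                 [0.1, 0.2, 1.0]])
print(predict_rating(ratings, sims, u=0, i=1))
```

Because user 0 is far more similar to user 1 (0.9) than to user 2 (0.1), the prediction lands close to user 1's rating of 2 rather than user 2's rating of 5.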

### An Example

![UUMatrix.png](assets/UUMatrix.png)

Here we have a user-movie rating matrix, in which each row represents a user and each column represents a movie.

To understand this, let’s find the similarity between users (A, C) and (B, C) in the above table using Pearson’s correlation. Common movies rated by A and C are movies x2 and x4 and by B and C are movies x2, x4 and x5.

![UUCorr.png](assets/UUCorr.png)

The correlation between user A and C is more than the correlation between B and C. Therefore, users A and C have more similarity and the movies liked by user A will be recommended to user C and vice versa.

As a warning, this algorithm can be computationally intensive as it involves calculating the similarity for each user and then calculating prediction for each similarity score. This technique is only really usable when there is a low or moderate data size. 

One way of handling this problem is to select only a few users (neighbors) instead of all of them when making predictions, i.e. instead of using every similarity value, we use only a few of them.

There are various ways to select neighbors:
 - Select a threshold similarity and choose all the users above that value
 - Randomly select the users
 - Arrange the neighbors in descending order of their similarity value and choose top-N users
 - Use clustering approaches to choose neighbors
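
Two of these strategies, the similarity threshold and the top-N selection, can be sketched as below. The similarity matrix is made up for illustration, and the function names are hypothetical.

```python
import numpy as np

def top_n_neighbours(sims, u, n):
    """Indices of the n users most similar to user u, excluding u itself."""
    order = np.argsort(sims[u])[::-1]  # most similar first
    return [int(v) for v in order if v != u][:n]

def neighbours_above(sims, u, threshold):
    """Indices of all users whose similarity to u exceeds the threshold."""
    return [int(v) for v in np.flatnonzero(sims[u] > threshold) if v != u]

# Hypothetical user-user similarity matrix for four users
sims = np.array([[1.0, 0.9, 0.1, 0.6],
                 [0.9, 1.0, 0.2, 0.3],
                 [0.1, 0.2, 1.0, 0.4],
                 [0.6, 0.3, 0.4, 1.0]])

print(top_n_neighbours(sims, u=0, n=2))            # [1, 3]
print(neighbours_above(sims, u=0, threshold=0.5))  # [1, 3]
```

Top-N keeps the cost per prediction fixed, while a threshold lets the neighborhood size vary with how many similar users actually exist.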

## Item-Item Collaborative Filtering
---

This technique is very similar to the user-user collaborative technique, but instead of finding similarities between users it focuses on the similarity score between items. This approach is effective when the number of users is greater than the number of items being recommended.

![IICollabPic.png](assets/IICollabPic.png) 

With the above example in mind, this approach finds the similarity between each pair of movies and, based on that, recommends movies similar to those the user has liked in the past. This algorithm works much like user-user collaborative filtering with just a slight modification: instead of utilizing the weighted sum of ratings of “user-neighbors”, it utilizes the weighted sum of ratings of “item-neighbors”.

The prediction is given by:
![IIPredict.png](assets/IIPredict.png) 

To get the similarity of items we would use this equation:
![IISimil.png](assets/IISimil.png) 

Now, with the similarity between each pair of movies and their ratings, predictions can be made, and based on those predictions, similar movies can be recommended.
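
A minimal sketch of an item-based prediction is shown below. It assumes a precomputed item-item similarity matrix (made up here), marks unrated entries with `np.nan`, and divides by the sum of absolute similarities; the name `predict_item_based` is hypothetical.

```python
import numpy as np

def predict_item_based(ratings, item_sims, u, i):
    """Weighted average of user u's own ratings, weighted by similarity to item i."""
    # Only items other than i that user u has rated can contribute
    rated = [j for j in range(ratings.shape[1])
             if j != i and not np.isnan(ratings[u, j])]
    weights = np.array([item_sims[i, j] for j in rated])
    if len(rated) == 0 or np.abs(weights).sum() == 0:
        return np.nan  # user u has rated no items similar to item i
    own = np.array([ratings[u, j] for j in rated])
    return float((weights * own).sum() / np.abs(weights).sum())

# Hypothetical data: 2 users x 3 items, user 0 has not rated item 1
ratings = np.array([[5.0, np.nan, 1.0],
                    [4.0, 4.0, 2.0]])
item_sims = np.array([[1.0, 0.8, 0.2],
                      [0.8, 1.0, 0.2],
                      [0.2, 0.2, 1.0]])
print(predict_item_based(ratings, item_sims, u=0, i=1))
```

The difference from the user-based version is which ratings are averaged: here the prediction draws only on the target user's own ratings, weighted by how similar each rated item is to the item being predicted.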

### An Example
![IIMatrix.png](assets/IIMatrix.png) 

Here the mean item rating is the average of all the ratings given to a particular item. Instead of finding the user-user similarity, we will work to find the item-item similarity.

To do this, we would first find users who have rated our items of interest, and based on those ratings, the similarity between the items would be calculated. To find the similarity between movies (x1, x4) and (x1, x5), we would look at the common users who have rated movies x1 and x4, as well as the common users who have rated movies x1 and x5. These are users A and B.

![IIMath.png](assets/IIMath.png) 

The similarity between movie x1 and x4 is higher than the similarity between movie x1 and x5. Based on these similarity values, if a user searches for movie x1, they will be recommended movie x4 and vice versa. 

### Question: What happens with new items or new users? 

Before going further and implementing these concepts, there is a question which we must know the answer to — what will happen if a new user or a new item is added in the dataset? This is called a **Cold Start**. 

There are two main types of cold start:

#### Visitor Cold Start
Visitor Cold Start refers to the situation when a new user is introduced in the dataset. Since there is no history of that user, the system does not know the preferences of that user, and it becomes harder to recommend products to that user. 

One basic approach to resolve this challenge is to apply a popularity-based strategy, where the most popular products are recommended. Popularity can be measured recently, overall, or even regionally. Once we know the preferences of that user, recommending products will become more feasible.
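
A popularity-based fallback could look something like the sketch below. It assumes a ratings table with `movie_id` and `rating` columns, and it defines popularity as the number of ratings, with ties broken by mean rating. Both the definition and the function name `most_popular` are illustrative choices rather than anything prescribed above.

```python
import pandas as pd

def most_popular(ratings, n=10, min_ratings=1):
    """Rank items by number of ratings, breaking ties by mean rating."""
    stats = (ratings.groupby("movie_id")["rating"]
                    .agg(n_ratings="count", mean_rating="mean"))
    stats = stats[stats["n_ratings"] >= min_ratings]
    stats = stats.sort_values(["n_ratings", "mean_rating"], ascending=False)
    return stats.head(n).index.tolist()

# Hypothetical ratings table
ratings = pd.DataFrame({
    "user_id":  [1, 1, 2, 2, 3, 3],
    "movie_id": [10, 20, 10, 30, 10, 20],
    "rating":   [5, 3, 4, 2, 5, 4],
})
print(most_popular(ratings, n=2))  # [10, 20]
```

Restricting the input to recent ratings, or to ratings from one region, would give the "recent" and "regional" variants of the same idea.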


#### Product Cold Start
Product Cold Start refers to the situation when a new product is launched on the market or added to the system. 

User action is most important to determine the value of any product. The more interaction a product receives, the easier it is for the model to recommend that product to the right user. 

We can make use of content-based filtering to resolve this problem. The system would first use the content of the new product for recommendations and then eventually the user actions on that product. But, to be clear, there would be some wait before the initial user interaction data becomes available.

## Hybrid Recommendation Systems
---

A hybrid recommendation system is a combination of collaborative and content-based recommendations. This system can be implemented by making content-based and collaborative-based predictions separately and then combining them, or by adding content-based capabilities to a collaborative-based approach and vice versa.

![HybridPic.png](assets/HybridPic.png) 
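
One simple way to combine separate predictions is a weighted blend, sketched below. The min-max rescaling, the weight `alpha`, and the function name `hybrid_scores` are assumptions made for illustration; other combination schemes are equally valid.

```python
import numpy as np

def hybrid_scores(content_scores, collab_scores, alpha=0.5):
    """Blend two score arrays after rescaling each to the range [0, 1]."""
    def rescale(x):
        x = np.asarray(x, dtype=float)
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)
    return alpha * rescale(content_scores) + (1 - alpha) * rescale(collab_scores)

content = [0.9, 0.1, 0.5]  # hypothetical content-based similarities
collab = [2.0, 4.5, 4.0]   # hypothetical collaborative predicted ratings
blended = hybrid_scores(content, collab, alpha=0.5)
print(int(np.argmax(blended)))  # index of the best item under the blend: 2
```

The rescaling matters because the two recommenders produce scores on different scales; without it, the predicted ratings would dominate the similarities. In this example the third item wins because it scores reasonably well under both approaches.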


## Creating A Recommendation System
---

The data for this applied example comes from the MovieLens dataset and we will work to build a model that recommends movies to the end users. This model to recommend movies will be based on user-user similarity and item-item similarity.

The data has been collected by the **GroupLens Research Project at the University of Minnesota**. The dataset can be downloaded from https://grouplens.org/datasets/movielens/100k/. 

This dataset consists of:
 - 100,000 ratings (1–5) from 943 users on 1682 movies
 - Demographic information of the users (age, gender, occupation, etc.)

#### Importing Libraries

In [1]:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np

#### Loading in Data

In [2]:
# Reading users file:
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv('ml-100k/u.user', sep='|', names=u_cols, encoding='latin-1')

# Reading ratings file:
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv('ml-100k/u.data', sep='\t', names=r_cols, encoding='latin-1')

# Reading items file:
i_cols = ['movie id', 'movie title' ,'release date','video release date', 'IMDb URL', 'unknown', 'Action', 'Adventure',
'Animation', 'Children\'s', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']

items = pd.read_csv('ml-100k/u.item', sep='|', names=i_cols, encoding='latin-1')

#### Exploring Each Dataset

In [3]:
users.head(3)

| | user_id | age | sex | occupation | zip_code |
|---|---|---|---|---|---|
| 0 | 1 | 24 | M | technician | 85711 |
| 1 | 2 | 53 | F | other | 94043 |
| 2 | 3 | 23 | M | writer | 32067 |


In [4]:
ratings.head(3)

| | user_id | movie_id | rating | unix_timestamp |
|---|---|---|---|---|
| 0 | 196 | 242 | 3 | 881250949 |
| 1 | 186 | 302 | 3 | 891717742 |
| 2 | 22 | 377 | 1 | 878887116 |


In [5]:
items.head(3)

| | movie id | movie title | release date | video release date | IMDb URL | unknown | Action | Adventure | Animation | Children's | ... | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Toy%20Story%2... | 0 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | GoldenEye (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?GoldenEye%20(... | 0 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 3 | Four Rooms (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Four%20Rooms%... | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |


#### Getting Train and Test Datasets

The dataset has already been divided into train and test sets by GroupLens. The test data has 10 ratings for each user, for 9,430 rows in total. We can simply import these files into our Python environment; if they were not provided, there are a number of ways to split data into train and test sets in Python.

In [6]:
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']

ratings_train = pd.read_csv('ml-100k/ua.base', sep='\t', names=r_cols, encoding='latin-1')

ratings_test = pd.read_csv('ml-100k/ua.test', sep='\t', names=r_cols, encoding='latin-1')

ratings_train.shape, ratings_test.shape

((90570, 4), (9430, 4))
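
If the pre-split files were not available, one option would be a per-user holdout similar in spirit to the provided split. The sketch below uses `DataFrameGroupBy.sample` (available in pandas 1.1 and later); the function name `holdout_per_user` is hypothetical, and the split it produces will not match the GroupLens files.

```python
import pandas as pd

def holdout_per_user(ratings, n_test=10, seed=0):
    """Move n_test randomly chosen ratings per user into a test set.

    Raises ValueError if any user has fewer than n_test ratings.
    """
    test = ratings.groupby("user_id").sample(n=n_test, random_state=seed)
    train = ratings.drop(test.index)
    return train, test

# Hypothetical ratings table: user 1 has four ratings, user 2 has three
demo = pd.DataFrame({
    "user_id":  [1] * 4 + [2] * 3,
    "movie_id": [10, 20, 30, 40, 10, 20, 30],
    "rating":   [5, 3, 4, 2, 1, 4, 5],
})
train, test = holdout_per_user(demo, n_test=1)
print(train.shape, test.shape)  # (5, 3) (2, 3)
```

Holding out a fixed number of ratings per user guarantees that every user appears in both sets, which a purely random row-level split does not.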

#### Calculating Unique Users and Items

In [7]:
n_users = ratings.user_id.unique().shape[0]
n_items = ratings.movie_id.unique().shape[0]

#### Creating User-Item Matrix
Now that we have the number of unique users and items, the next step is to create a user-item matrix. This will be used to calculate the similarity between users and between items. To do this, we first initialize a matrix of zeros with shape 943 x 1682 (943 users and 1682 movies). Then we iteratively load the data into this matrix.

Here,
- line[1] is the user id
    - we subtract 1 from it, since array indexing starts from 0, to get the row
- line[2]-1 is the movie id minus 1, which gives the column
- at that specific row and column we store line[3], which is the movie rating

In [8]:
data_matrix = np.zeros((n_users, n_items))
for line in ratings.itertuples():
    data_matrix[line[1]-1, line[2]-1] = line[3]
    
data_matrix

array([[5., 3., 4., ..., 0., 0., 0.],
       [4., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [5., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 5., 0., ..., 0., 0., 0.]])

#### Calculating Similarity

Now that the matrix holds the rating each user gave to each movie, we can compare users and items pairwise. We will use the pairwise_distances function from sklearn with the cosine metric. Note that with metric='cosine' this function returns the cosine distance, which is 1 minus the cosine similarity. The result is one user-user array and one item-item array.

In [9]:
from sklearn.metrics.pairwise import pairwise_distances 
user_similarity = pairwise_distances(data_matrix, metric='cosine')
item_similarity = pairwise_distances(data_matrix.T, metric='cosine')
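
One detail worth checking is what `pairwise_distances` actually returns. The small sketch below, on a made-up 3 x 3 matrix, suggests that the `'cosine'` metric yields a distance, and that subtracting it from 1 recovers the values given by sklearn's `cosine_similarity`.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, pairwise_distances

# Hypothetical ratings for three users on three items
toy = np.array([[5.0, 3.0, 0.0],
                [4.0, 0.0, 1.0],
                [0.0, 2.0, 5.0]])

dist = pairwise_distances(toy, metric='cosine')  # cosine distance
sim = 1 - dist                                   # cosine similarity

print(np.allclose(sim, cosine_similarity(toy)))  # True
print(np.allclose(np.diag(dist), 0))             # True: each row is at distance 0 from itself
```

With distances, the most similar users have the smallest values, so it is worth being deliberate about which of the two quantities is passed on as a weight.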

#### Making Predictions
The next step is to make predictions based on these similarities. Let’s define a function to do just that.

In [10]:
def predict(ratings, similarity, type='user'):
    
    if type == 'user':
        mean_user_rating = ratings.mean(axis=1).reshape(-1,1)
        # reshape(-1, 1) turns the per-user means into a column vector so it broadcasts across ratings
        
        ratings_diff = (ratings - mean_user_rating)
        pred = mean_user_rating + similarity.dot(ratings_diff) / np.array([np.abs(similarity).sum(axis=1)]).T
    
    elif type == 'item':
        pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
    
    return pred

#### Applying Prediction Function

In [11]:
user_prediction = predict(data_matrix, user_similarity, type='user')
item_prediction = predict(data_matrix, item_similarity, type='item')

In [12]:
item_prediction

array([[0.44627765, 0.475473  , 0.50593755, ..., 0.58815455, 0.5731069 ,
        0.56669645],
       [0.10854432, 0.13295661, 0.12558851, ..., 0.13445801, 0.13657587,
        0.13711081],
       [0.08568497, 0.09169006, 0.08764343, ..., 0.08465892, 0.08976784,
        0.09084451],
       ...,
       [0.03230047, 0.0450241 , 0.04292449, ..., 0.05302764, 0.0519099 ,
        0.05228033],
       [0.15777917, 0.17409459, 0.18900003, ..., 0.19979296, 0.19739388,
        0.20003117],
       [0.24767207, 0.24489212, 0.28263031, ..., 0.34410424, 0.33051406,
        0.33102478]])

These predictions can then be used within some sort of application to give recommendations. 
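
As a rough sketch of that last step, the helpers below turn one row of a prediction matrix into a top-N list and score predictions against known ratings. They follow the convention of the user-item matrix above, where 0 means "not rated"; the function names and the toy arrays are hypothetical.

```python
import numpy as np

def top_n_unseen(pred_row, rated_row, n=5):
    """Indices of the n highest-scoring items the user has not rated yet."""
    scores = np.where(rated_row > 0, -np.inf, pred_row)  # mask rated items
    order = np.argsort(scores)[::-1]                     # best score first
    return [int(j) for j in order if np.isfinite(scores[j])][:n]

def rmse(pred, actual):
    """Root mean squared error over entries with a known (non-zero) rating."""
    mask = actual > 0
    return float(np.sqrt(np.mean((pred[mask] - actual[mask]) ** 2)))

# Hypothetical predictions and known ratings for one user and four items
pred = np.array([[0.4, 0.9, 0.7, 0.2]])
actual = np.array([[5.0, 0.0, 0.0, 3.0]])
print(top_n_unseen(pred[0], actual[0], n=2))  # [1, 2]
```

A function like `rmse` could be applied to a matrix built from the test ratings in the same way as the training matrix, which gives a simple way to compare the user-based and item-based predictions.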

## Second Example Using Text Similarity
---

#### Importing Libraries

In [13]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import ast 
from scipy import stats
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel,cosine_similarity

#### Importing Data

The data for this second example also comes from the MovieLens dataset, but is different than the one used above. This example follows the tutorial here (https://medium.com/swlh/beginners-guide-to-build-recommendation-system-2bd4a96aa3e), where there is a link to the data. 

In [14]:
movies_df = pd.read_csv('movie_dataset/movies_metadata.csv')
ratings = pd.read_csv('movie_dataset/ratings_small.csv')


#### Dropping Unnecessary Columns

In [15]:
movies_df = movies_df.drop(['belongs_to_collection', 'budget', 'homepage', 
                            'original_language', 'release_date', 'revenue', 
                            'runtime', 'spoken_languages', 'status', 'video', 
                            'poster_path', 'production_companies', 'production_countries'], axis = 1)

movies_df.head()

| | adult | genres | id | imdb_id | original_title | overview | popularity | tagline | title | vote_average | vote_count |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | [{'id': 16, 'name': 'Animation'}, {'id': 35, '... | 862 | tt0114709 | Toy Story | Led by Woody, Andy's toys live happily in his ... | 21.9469 | | Toy Story | 7.7 | 5415.0 |
| 1 | False | [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... | 8844 | tt0113497 | Jumanji | When siblings Judy and Peter discover an encha... | 17.0155 | Roll the dice and unleash the excitement! | Jumanji | 6.9 | 2413.0 |
| 2 | False | [{'id': 10749, 'name': 'Romance'}, {'id': 35, ... | 15602 | tt0113228 | Grumpier Old Men | A family wedding reignites the ancient feud be... | 11.7129 | Still Yelling. Still Fighting. Still Ready for... | Grumpier Old Men | 6.5 | 92.0 |
| 3 | False | [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam... | 31357 | tt0114885 | Waiting to Exhale | Cheated on, mistreated and stepped on, the wom... | 3.85949 | Friends are the people who let you be yourself... | Waiting to Exhale | 6.1 | 34.0 |
| 4 | False | [{'id': 35, 'name': 'Comedy'}] | 11862 | tt0113041 | Father of the Bride Part II | Just when George Banks has recovered from his ... | 8.38752 | Just When His World Is Back To Normal... He's ... | Father of the Bride Part II | 5.7 | 173.0 |


#### Some Data Wrangling

Each entry in the ‘genres’ column of the data frame is a string holding a list of dictionaries with the keys 'id' and 'name'. We need to extract the 'name' values, which are the movie genres, and keep only those values in the column.

In [16]:
movies_df['genres'] = (movies_df['genres']
                       .fillna('[]')
                       .apply(literal_eval)
                       .apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else []))

movies_df['genres'].head()

0     [Animation, Comedy, Family]
1    [Adventure, Fantasy, Family]
2               [Romance, Comedy]
3        [Comedy, Drama, Romance]
4                        [Comedy]
Name: genres, dtype: object

#### Creating TF-IDF Vector

In order to create the recommendation engine, we have to create a vector for each and every movie. This recommendation engine depends on pairwise similarity, which is measured between vectors, so each movie needs a vector before any similarities can be identified.

Since the overview column in the dataset is string data, we need to use a TF-IDF vectorizer to create a document matrix from these sentences.

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfv = TfidfVectorizer(min_df=3, max_features=None, strip_accents='unicode', analyzer='word',token_pattern=r'\w{1,}',
            ngram_range=(1, 3), stop_words = 'english')

movies_df['overview'] = movies_df['overview'].fillna('')

tfv_matrix = tfv.fit_transform(movies_df['overview'])
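
To see what the vectorizer produces, here is a small sketch on three made-up overviews, using default settings apart from the English stop-word list. Each row of the resulting sparse matrix is one document, and each column is one term from the learned vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical movie overviews
docs = ["a space pilot joins the rebellion",
        "a rebellion rises in space",
        "two friends open a bakery"]

vec = TfidfVectorizer(stop_words='english')
matrix = vec.fit_transform(docs)

print(matrix.shape)             # (number of documents, number of vocabulary terms)
print(sorted(vec.vocabulary_))  # the terms that survived stop-word removal
```

Terms shared by the first two overviews, such as "space" and "rebellion", receive non-zero weights in both rows, which is what later makes those two documents score as similar.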

#### Finding Cosine Similarity

This may take a moment to run, since a similarity score is computed for every pair of movies.

In [18]:
cos_sim = linear_kernel(tfv_matrix, tfv_matrix)

movies_df = movies_df.reset_index()

indices = pd.Series(movies_df.index, index=movies_df['title'])

indices.head(20)


title
Toy Story                          0
Jumanji                            1
Grumpier Old Men                   2
Waiting to Exhale                  3
Father of the Bride Part II        4
Heat                               5
Sabrina                            6
Tom and Huck                       7
Sudden Death                       8
GoldenEye                          9
The American President            10
Dracula: Dead and Loving It       11
Balto                             12
Nixon                             13
Cutthroat Island                  14
Casino                            15
Sense and Sensibility             16
Four Rooms                        17
Ace Ventura: When Nature Calls    18
Money Train                       19
dtype: int64
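
The section title mentions cosine similarity while the code calls `linear_kernel`, which computes plain dot products. The two agree here because `TfidfVectorizer` scales each row to unit length by default (`norm='l2'`), and the dot product of unit-length vectors is their cosine similarity. A quick check on made-up text:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical movie overviews
docs = ["a space pilot joins the rebellion",
        "a rebellion rises in space",
        "two friends open a bakery"]

m = TfidfVectorizer().fit_transform(docs)

# Dot products of L2-normalized rows equal their cosine similarities
print(np.allclose(linear_kernel(m, m), cosine_similarity(m, m)))  # True
```

`linear_kernel` skips the normalization step that `cosine_similarity` would repeat, which is why it is often preferred for TF-IDF matrices.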

#### Writing Function to Recommend Movies

In [19]:
def sugg_recm(title):
    
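    # Get the row index corresponding to the given title
    idx = indices[title]

    # Get the pairwise similarity scores of every movie with that movie
    sim_scores = list(enumerate(cos_sim[idx]))

    # Sort the movies by similarity score, highest first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the 10 most similar movies, skipping the first entry (the movie itself)
    sim_scores = sim_scores[1:11]
    movie_indices = [i[0] for i in sim_scores]

    # Return the matching titles (as the index) with their row positions
    return indices.iloc[movie_indices]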
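
One edge case to be aware of: if several movies share a title, looking that title up in a pandas Series returns a Series of positions rather than a single integer. The sketch below shows one possible way to handle this by falling back to the first match. The function name `recommend`, the toy titles, and the toy similarity matrix are all hypothetical.

```python
import numpy as np
import pandas as pd

def recommend(title, indices, sim, n=10):
    """Titles of the n movies most similar to the given title."""
    idx = indices[title]
    if not np.isscalar(idx):  # several movies share this title: use the first
        idx = idx.iloc[0]
    order = np.argsort(sim[idx])[::-1]              # most similar first
    order = [int(j) for j in order if j != idx][:n]  # drop the movie itself
    return indices.index[order].tolist()

# Toy data: the title "A" appears twice
titles = ["A", "B", "A", "C"]
indices = pd.Series(range(4), index=titles)
sim = np.array([[1.0, 0.2, 0.9, 0.5],
                [0.2, 1.0, 0.3, 0.8],
                [0.9, 0.3, 1.0, 0.4],
                [0.5, 0.8, 0.4, 1.0]])

print(recommend("A", indices, sim, n=2))  # ['A', 'C']
print(recommend("B", indices, sim, n=2))  # ['C', 'A']
```

Picking the first match is only a convenience; a real application would more likely disambiguate by release year or by a unique movie id.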

#### Applying the Function

In [20]:
sugg_recm('Star Wars').head(10)

title
The Empire Strikes Back                        1154
Star Wars: The Force Awakens                  26555
The Star Wars Holiday Special                 30434
Return of the Jedi                             1167
Samson and the Seven Miracles of the World    34153
The Interrogation                             16264
The Thief of Bagdad                            7149
Threads of Destiny                            22939
Where Eagles Dare                              6659
West Of Shanghai                              34855
dtype: int64

In [21]:
sugg_recm('Indiana Jones and the Temple of Doom').head(10)

title
Secret of the Incas                          40234
Raiders of the Lost Ark                       1156
The Saint                                    45145
Allan Quatermain and the Temple of Skulls    12647
Treasure of the Four Crowns                  25052
Beginning of the End                          6164
The Rocketeer                                 1985
Queen of the Amazons                         18469
Armour of God                                 2764
The Condemned of Altona                      35498
dtype: int64

## Conclusion

Overall, this workshop has introduced you to some common approaches to building a recommendation system in Python. There are many other ways to build a recommendation system, including K-Nearest Neighbors, clustering algorithms, and even classification approaches. There is so much more to learn here, but the hope is that this case will introduce you to some major concepts and give you some code to start your journey.

## References

- Recommendation Systems from Scratch in Python - https://medium.com/@lope.ai/recommendation-systems-from-scratch-in-python-pytholabs-6946491e76c2
- Beginner’s guide to build Recommendation Engine in Python - https://medium.com/swlh/beginners-guide-to-build-recommendation-system-2bd4a96aa3e
- Machine Learning for Building Recommender System in Python - https://towardsdatascience.com/machine-learning-for-building-recommender-system-in-python-9e4922dd7e97
- Creating a Movie Recommendation System using Python - https://blog.jovian.ai/creating-a-movie-recommendation-system-using-python-5ba88a7eb6df