## What Is a Recommender System
A recommender system is a filtering program whose primary goal is to predict the “rating” or “preference” a user would give to a domain-specific item or set of items. In our case, the domain-specific item is a movie, so the main focus of our recommender system is to filter and predict only those movies a user would prefer, given some data about that user.
___
## Types of Recommender Systems
Broadly, recommender systems can be classified into 4 types:

* **Simple recommenders:** offer generalized recommendations to every user, based on movie popularity and/or genre. The basic idea behind this system is that movies that are more popular and critically acclaimed will have a higher probability of being liked by the average audience. An example could be IMDB Top 250.
___
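
A simple recommender of this kind can be sketched with a weighted rating, in the style of the formula often attributed to the IMDb Top 250 chart. The toy catalogue, the function name `weighted_rating`, and the threshold `m` are assumptions made for illustration; they are not taken from the dataset used later.

```python
# Toy catalogue: (title, average vote R, number of votes v). Values are made up.
movies = [
    ("Movie A", 9.0, 10),
    ("Movie B", 8.0, 5000),
    ("Movie C", 6.5, 20000),
]

def weighted_rating(R, v, m, C):
    """Shrink a movie's average vote R toward the global mean C.

    v is the movie's vote count and m is the minimum number of votes
    considered trustworthy (an assumed, tunable threshold).
    """
    return (v / (v + m)) * R + (m / (v + m)) * C

# Global mean vote across the toy catalogue.
C = sum(R for _, R, _ in movies) / len(movies)
m = 1000  # hypothetical threshold

# Rank movies by weighted rating, best first.
ranked = sorted(movies, key=lambda row: weighted_rating(row[1], row[2], m, C), reverse=True)
for title, R, v in ranked:
    print(title, round(weighted_rating(R, v, m, C), 3))
```

With these numbers, a movie with a very high average but only a handful of votes is pulled toward the global mean, so it no longer outranks a well-rated movie with thousands of votes.
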
* **Content-based recommenders:** suggest similar items based on a particular item. This system uses item metadata, such as genre, director, description, actors, etc. for movies, to make these recommendations. The general idea behind these recommender systems is that if a person likes a particular item, he or she will also like an item that is similar to it. To make such recommendations, the system uses the metadata of items the user has interacted with in the past. A good example is YouTube, which suggests new videos you might watch based on your history.
  * **Disadvantages:**
  1) Different products do not get much exposure to the user.
  2) Businesses cannot be expanded as the user does not try different types of products.
___
* **Association & Market Based Model:** The system makes recommendations based on the items in the consumer's basket. For instance, if the system detects that a buyer is purchasing ground coffee, it would also suggest filters (observed association: coffee and filters).
___
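
The basket idea can be sketched by counting how often items are bought together and estimating a rule's confidence from those counts. The baskets and the helper names `confidence` and `suggest` below are hypothetical and serve only to illustrate the coffee and filters association.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets; item names are illustrative only.
baskets = [
    {"ground coffee", "filters", "milk"},
    {"ground coffee", "filters"},
    {"ground coffee", "sugar"},
    {"tea", "milk"},
]

item_counts = Counter()
pair_counts = Counter()
for basket in baskets:
    item_counts.update(basket)
    # Count each unordered pair once per basket.
    pair_counts.update(frozenset(p) for p in combinations(sorted(basket), 2))

def confidence(antecedent, consequent):
    """Estimate P(consequent in basket | antecedent in basket)."""
    together = pair_counts[frozenset((antecedent, consequent))]
    return together / item_counts[antecedent]

def suggest(item, k=1):
    """Return up to k items most often bought together with `item`."""
    scores = {
        other: confidence(item, other)
        for other in item_counts
        if other != item and pair_counts[frozenset((item, other))] > 0
    }
    # Highest confidence first; ties broken alphabetically.
    return sorted(scores, key=lambda other: (-scores[other], other))[:k]

print(suggest("ground coffee"))
```
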
* **Collaborative Filtering (CF):** an algorithmic approach that recommends items to consumers based on their observed behavior. Collaborative filtering frameworks are usually divided into a **Model-Based Approach** (for example, matrix factorization) and a **Memory-Based Approach**. The two memory-based variants are:
 
 * **User-based (UBCF):** The basic idea is to find users whose past preference patterns are similar to those of user ‘A’, and then recommend to ‘A’ the items those similar users liked that ‘A’ has not yet encountered.
    * **Disadvantages:**
    1) People are fickle, i.e., their tastes change over time.
    2) There are usually many more users than items, so the similarity matrices are large, hard to maintain, and must be recomputed regularly.
    3) The algorithm is susceptible to shilling attacks, in which fake user profiles with biased preference patterns are used to manipulate the recommendations.
 
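 * **Item-based (IBCF):** Instead of comparing users, the system compares items: it recommends items that are similar, in terms of how users rated them, to the items the user has already liked. IBCF was popularized by Amazon and has been widely adopted by online companies (e.g. Netflix, YouTube, etc.).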
    * **Advantages over User-based Collaborative Filtering:**
    1) Unlike people’s taste, movies don’t change.
    2) There are usually a lot fewer items than people, therefore easier to maintain and compute the matrices.
    3) Shilling attacks are much harder because items cannot be faked.
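
As a rough illustration of the item-based idea, the sketch below computes cosine similarity between items over the users who rated both, and predicts a rating as a similarity-weighted average of the user's own ratings. The ratings and the helper names (`item_vector`, `cosine`, `predict`) are hypothetical; this is a minimal sketch, not the method used later in this notebook.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating}. Values are made up.
ratings = {
    "alice": {"Toy Story": 5.0, "Jumanji": 4.0},
    "bob":   {"Toy Story": 4.0, "Jumanji": 5.0, "Heat": 2.0},
    "carol": {"Toy Story": 1.0, "Heat": 5.0},
}

def item_vector(item):
    """Ratings for one item, keyed by user."""
    return {u: r[item] for u, r in ratings.items() if item in r}

def cosine(item_a, item_b):
    """Cosine similarity between two items over users who rated both."""
    a, b = item_vector(item_a), item_vector(item_b)
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[u] * b[u] for u in common)
    norm_a = sqrt(sum(a[u] ** 2 for u in common))
    norm_b = sqrt(sum(b[u] ** 2 for u in common))
    return dot / (norm_a * norm_b)

def predict(user, item):
    """Similarity-weighted average of the user's own ratings on other items."""
    pairs = [(cosine(item, other), r) for other, r in ratings[user].items() if other != item]
    total = sum(s for s, _ in pairs)
    if total == 0:
        return None  # no overlapping evidence for this item
    return sum(s * r for s, r in pairs) / total

print(predict("alice", "Heat"))
```

Note that two items with a single common rater always get a similarity of 1 under this definition, which is one reason practical systems require a minimum overlap before trusting a similarity score.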


## First: Content-Based Recommender:
We will construct a recommender that relates movies based on their descriptions and then uses cosine similarity to find similar movies.

In [1]:
# Import Pandas
import pandas as pd

# Load Movies Metadata
metadata = pd.read_csv('movies_imdb.csv', low_memory=False)

# Print the first three rows
metadata.head(3)

| | adult | belongs_to_collection | budget | genres | homepage | id | imdb_id | original_language | original_title | overview | ... | release_date | revenue | runtime | spoken_languages | status | tagline | title | video | vote_average | vote_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | {'id': 10194, 'name': 'Toy Story Collection', ... | 30000000 | [{'id': 16, 'name': 'Animation'}, {'id': 35, '... | http://toystory.disney.com/toy-story | 862 | tt0114709 | en | Toy Story | Led by Woody, Andy's toys live happily in his ... | ... | 1995-10-30 | 373554033.0 | 81.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | | Toy Story | False | 7.7 | 5415.0 |
| 1 | False | | 65000000 | [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... | | 8844 | tt0113497 | en | Jumanji | When siblings Judy and Peter discover an encha... | ... | 1995-12-15 | 262797249.0 | 104.0 | [{'iso_639_1': 'en', 'name': 'English'}, {'iso... | Released | Roll the dice and unleash the excitement! | Jumanji | False | 6.9 | 2413.0 |
| 2 | False | {'id': 119050, 'name': 'Grumpy Old Men Collect... | 0 | [{'id': 10749, 'name': 'Romance'}, {'id': 35, ... | | 15602 | tt0113228 | en | Grumpier Old Men | A family wedding reignites the ancient feud be... | ... | 1995-12-22 | 0.0 | 101.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | Still Yelling. Still Fighting. Still Ready for... | Grumpier Old Men | False | 6.5 | 92.0 |


In [2]:
#Print plot overviews of the first 5 movies.
metadata['overview'].head()

0    Led by Woody, Andy's toys live happily in his ...
1    When siblings Judy and Peter discover an encha...
2    A family wedding reignites the ancient feud be...
3    Cheated on, mistreated and stepped on, the wom...
4    Just when George Banks has recovered from his ...
Name: overview, dtype: object

In [3]:
#Import TfIdfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

#Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
metadata['overview'] = metadata['overview'].fillna('')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(metadata['overview'])

#Output the shape of tfidf_matrix
tfidf_matrix.shape

(10000, 32350)

In [4]:
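#Array mapping from feature integer indices to feature name.
#Note: get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
list(tfidf.get_feature_names_out()[5000:5010])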

['cellisten',
 'cellmate',
 'cellmates',
 'cello',
 'cellular',
 'celtics',
 'cement',
 'cemetary',
 'cemetery',
 'cenobite']

In [5]:
# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

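# Compute the cosine similarity matrix; TF-IDF rows are L2-normalized by default,
# so the dot product computed by linear_kernel equals the cosine similarity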
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [6]:
cosine_sim.shape

(10000, 10000)
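
`linear_kernel` computes plain dot products. This gives cosine similarity here because `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`). The dependency-free sketch below checks that identity on two made-up vectors; the helper names are hypothetical.

```python
from math import sqrt, isclose

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (sqrt(dot(a, a)) * sqrt(dot(b, b)))

# Two hypothetical, unnormalized term-weight vectors.
doc_a = [1.0, 2.0, 0.0, 3.0]
doc_b = [0.0, 1.0, 4.0, 1.0]

# After L2 normalization, the plain dot product equals the cosine similarity.
unit_a, unit_b = l2_normalize(doc_a), l2_normalize(doc_b)
print(isclose(dot(unit_a, unit_b), cosine(doc_a, doc_b)))
```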

In [7]:
cosine_sim[1]

array([0.01682915, 1.        , 0.04871976, ..., 0.        , 0.01200997,
       0.        ])

In [8]:
#Construct a reverse map from movie titles to row indices, keeping the first row for each duplicated title
indices = pd.Series(metadata.index, index=metadata['title'])
indices = indices[~indices.index.duplicated(keep='first')]
indices[:10]

title
Toy Story                      0
Jumanji                        1
Grumpier Old Men               2
Waiting to Exhale              3
Father of the Bride Part II    4
Heat                           5
Sabrina                        6
Tom and Huck                   7
Sudden Death                   8
GoldenEye                      9
dtype: int64

In [9]:
# Function that takes in movie title as input and outputs most similar movies
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 5 most similar movies (index 0 is the movie itself)
    sim_scores = sim_scores[1:6]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 5 most similar movies
    return metadata['title'].iloc[movie_indices]

In [10]:
get_recommendations('Toy Story')

2997              Toy Story 2
8327                The Champ
1071    Rebel Without a Cause
3057          Man on the Moon
1932                Condorman
Name: title, dtype: object

In [11]:
get_recommendations('The Godfather')

1178     The Godfather: Part II
1914    The Godfather: Part III
8653               Violent City
6711                   Mobsters
6977            Queen of Hearts
Name: title, dtype: object

___
## Second: Collaborative Item-Based Recommender:
Here we use the Surprise package with the MovieLens dataset to predict ratings for movies a user has not yet rated, and then select the top 10 recommended movies for that user. The model used below is non-negative matrix factorization (NMF).

In [13]:
import pandas as pd
import numpy as np


from collections import defaultdict #data collector

#Surprise: https://surprise.readthedocs.io/en/stable/
import surprise

from surprise.reader import Reader
from surprise import Dataset


##Matrix Factorization Algorithms
from surprise import NMF

np.random.seed(42) # replicating results

## Importing Online Data
[MovieLens](https://grouplens.org/datasets/movielens/) provides rating datasets collected from the [MovieLens](http://movielens.org) web site (F. M. Harper and J. A. Konstan, 2015). Several rating files are available, differing in the number of rated movies and in their release date. For demonstration purposes and because of limited computational power, the author worked with 100,836 ratings and 3,683 tag applications across 9,742 movies. The full description of this particular dataset can be found [here](http://files.grouplens.org/datasets/movielens/ml-latest-small-README.html). According to the documentation, **every selected user rated at least 20 movies, on a scale from 0.5 to 5**. The dataset was last updated in 9/2018.
 
The work considers only the tidy data in `ratings.csv` and `movies.csv`.
Specifically, `ratings_df` records `userId`, `movieId`, `rating`, and `timestamp`, in that order.
`movies_df`, on the other hand, stores `movieId`, `title`, and `genres`. `movieId` is therefore the key shared by both tables.
 
Note that `Surprise` can load data, e.g. csv files, for predictions through its own methods. As discussed below, `Surprise` also accepts pandas DataFrames. The author works with `pd.DataFrame` objects for convenience.

In [15]:
from io import BytesIO
from zipfile import ZipFile
from urllib.request import urlopen

r = urlopen("http://files.grouplens.org/datasets/movielens/ml-latest-small.zip")
zipfile = ZipFile(BytesIO(r.read()))

#print the content of zipfile
zipfile.namelist()

['ml-latest-small/',
 'ml-latest-small/links.csv',
 'ml-latest-small/tags.csv',
 'ml-latest-small/ratings.csv',
 'ml-latest-small/README.txt',
 'ml-latest-small/movies.csv']

In [16]:
#ratings df (tidy data)
ratings_df = pd.read_csv(zipfile.open('ml-latest-small/ratings.csv'))
print('Columns of ratings_df: {0}'.format(ratings_df.columns))

#movies df (tidy data)
movies_df = pd.read_csv(zipfile.open('ml-latest-small/movies.csv'))
print('Columns of movies_df: {0}'.format(movies_df.columns))

Columns of ratings_df: Index(['userId', 'movieId', 'rating', 'timestamp'], dtype='object')
Columns of movies_df: Index(['movieId', 'title', 'genres'], dtype='object')


## Inspecting the Data
One advantage of training on the selected dataset is its cleanliness. Unlike with real-world data, no extra time needs to be spent on data cleansing. The output of the following chunk shows how the data is stored.
The results are in line with the published data description.

In [17]:
#ratings
print(ratings_df.head())

print(ratings_df.info())

print(ratings_df.describe())

   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB
None
              userId        movieId         rating     timestamp
count  100836.000000  100836.000000  100836.000000  1.008360e+05
mean      326.127564   19435.295718       3.501557  1.205946e+09
std       182.618491   35530.987199       1.042529  2.162610e+08
min         1.000000       1.000000       0.500000  8.281246e+08
25%       177.000000    1199.000000 

In [18]:
#movies
print(movies_df.head())

   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  


## Data Pre-Processing




#### Filtering Data Set
First, it is useful to filter out movies and users with low exposure to remove some of the noise from outliers. According to the official MovieLens documentation, all selected users have rated at least 20 movies in the data set. The following code nevertheless filters movies and users based on arbitrary thresholds and creates a new data frame, `ratings_flrd_df`. The chunk also reports the number of deleted movies together with the old and new dimensions.

In [19]:
min_movie_ratings = 2 #minimum number of ratings a movie must have received
min_user_ratings =  5 #minimum number of ratings a user must have given


ratings_flrd_df = ratings_df.groupby("movieId").filter(lambda x: x['movieId'].count() >= min_movie_ratings)
ratings_flrd_df = ratings_flrd_df.groupby("userId").filter(lambda x: x['userId'].count() >= min_user_ratings)



"{0} movies deleted; all movies are now rated at least: {1} times. Old dimensions: {2}; New dimensions: {3}"\
.format(len(ratings_df.movieId.value_counts()) - len(ratings_flrd_df.movieId.value_counts())\
        ,min_movie_ratings,ratings_df.shape, ratings_flrd_df.shape )

'3446 movies deleted; all movies are now rated at least: 2 times. Old dimensions: (100836, 4); New dimensions: (97390, 4)'
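
The two-step filter can be illustrated on a tiny, made-up list of rating events. The data and the toy thresholds below are assumptions chosen so that both steps remove something; the logic mirrors the movie-then-user order used above.

```python
from collections import Counter

# Hypothetical (userId, movieId) rating events; the ratings themselves are omitted.
events = [(1, 10), (1, 11), (2, 10), (2, 11), (2, 12), (3, 10), (3, 13)]
min_movie_ratings = 2  # toy threshold
min_user_ratings = 2   # toy threshold

# Step 1: keep only movies with enough ratings.
movie_counts = Counter(movie for _, movie in events)
kept = [(u, m) for u, m in events if movie_counts[m] >= min_movie_ratings]

# Step 2: keep only users with enough remaining ratings.
user_counts = Counter(user for user, _ in kept)
kept = [(u, m) for u, m in kept if user_counts[u] >= min_user_ratings]

print(kept)
```

Because users are filtered after movies, removing a user can in principle push a movie back below its threshold; whether that happens in the real data depends on the thresholds chosen.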

## Data Loading
`Surprise` ships with several built-in datasets (e.g. Jester or MovieLens) that are parsed by the `Dataset` module. However, a customized recommender system usually requires its own data. In that case, you need to load your own rating dataset, either from a file (e.g. csv) or from a pandas dataframe. In both cases, you need to define a `Reader` object so that `Surprise` can parse the file or the dataframe. See the reference [here](https://surprise.readthedocs.io/en/stable/getting_started.html#use-a-custom-dataset).
 
Next, the data set is loaded by calling a method of `surprise.Dataset`. Specifically, `load_from_file()` loads a csv file, while `Dataset.load_from_df` loads a pandas `DataFrame`. In the latter case, the data frame must hold one rating per user and movie (i.e. the tidy format), and the `reader` is passed as an argument.
 
Lastly, the `build_full_trainset()` method builds the training set from the entire data set. As demonstrated later, training on the whole data set with the best hyperparameters is useful for predicting an arbitrary number of top movies for each `userId`.

In [20]:
reader = Reader(rating_scale=(0.5, 5)) #load_from_df expects the columns in the order: user, item, rating
data = Dataset.load_from_df(ratings_flrd_df[["userId", "movieId", "rating"]], reader=reader)

trainset = data.build_full_trainset()

testset = trainset.build_anti_testset()
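
Conceptually, the anti-testset is every (user, item) pair for which both the user and the item are known but no rating exists, with the global mean standing in for the unknown rating. The dependency-free sketch below imitates that behavior on made-up data; it is an illustration of the idea, not Surprise's implementation.

```python
# Hypothetical known ratings: (user, item, rating).
known = [("u1", "i1", 4.0), ("u1", "i2", 3.0), ("u2", "i1", 5.0)]

users = sorted({u for u, _, _ in known})
items = sorted({i for _, i, _ in known})
rated = {(u, i) for u, i, _ in known}
global_mean = sum(r for _, _, r in known) / len(known)

# Every (user, item) pair without a known rating, with the global mean
# used as a placeholder for the unknown true rating.
anti_testset = [(u, i, global_mean) for u in users for i in items if (u, i) not in rated]
print(anti_testset)
```

The number of such pairs grows with the number of users times the number of items, so on a full dataset this list can be large.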


## Matrix Factorization
Matrix factorization is an effective CF technique because it benefits from the properties of linear algebra. Specifically, consider a matrix $R$ that records the ratings users gave to items. Just as any integer can be decomposed into a product of its prime factors, a matrix can be decomposed into a product of smaller matrices, and these factors reveal structure in the original array of elements that is not apparent from the entries themselves.

In [21]:
def get_top_n(predictions, userId, movies_df, ratings_df, n = 10):
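    '''Return the top-N (default 10) predicted movies for a user, together with the user's rating history for comparison.
    Args:
        predictions: list of Surprise predictions, as returned by the `test` method of an algorithm.
        userId: id of the user to report on.
        movies_df: DataFrame with `movieId`, `title`, and `genres`.
        ratings_df: DataFrame with `userId`, `movieId`, and `rating`.
        n: number of recommendations to keep per user.
    Returns:
        hist_usr: the user's rated movies, sorted by rating in descending order.
        pred_usr: the user's top-N predicted movies.
    '''
    #Part I.: Surprise documentation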
    
    #1. First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    #2. Then sort the predictions for each user and retrieve the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key = lambda x: x[1], reverse = True)
        top_n[uid] = user_ratings[: n ]
    
    #Part II.: inspired by: https://beckernick.github.io/matrix-factorization-recommender/
    
    #3. Tells how many movies the user has already rated
    user_data = ratings_df[ratings_df.userId == (userId)]
    print('User {0} has already rated {1} movies.'.format(userId, user_data.shape[0]))

    
    #4. Data Frame with predictions. 
    preds_df = pd.DataFrame([(id, pair[0],pair[1]) for id, row in top_n.items() for pair in row],
                        columns=["userId" ,"movieId","rat_pred"])
    
    
    #5. Return pred_usr, i.e. top N recommended movies with (merged) titles and genres. 
    pred_usr = preds_df[preds_df["userId"] == (userId)].merge(movies_df, how = 'left', left_on = 'movieId', right_on = 'movieId')
            
    #6. Return hist_usr, i.e. the user's historically rated movies, sorted by rating, with (merged) titles and genres for holistic evaluation
    hist_usr = ratings_df[ratings_df.userId == (userId) ].sort_values("rating", ascending = False).merge\
    (movies_df, how = 'left', left_on = 'movieId', right_on = 'movieId')
    
    
    return hist_usr, pred_usr
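
The grouping and truncation in Part I can be tried in isolation on a few made-up prediction tuples. The tuples below only imitate the five-field shape unpacked in `get_top_n`; the helper name `top_n_per_user` is hypothetical.

```python
from collections import defaultdict

# Hypothetical predictions shaped like the 5-tuples unpacked in get_top_n:
# (uid, iid, true_r, est, details).
predictions = [
    (1, "m1", 3.5, 4.2, None),
    (1, "m2", 3.5, 4.8, None),
    (1, "m3", 3.5, 2.1, None),
    (2, "m1", 3.5, 3.9, None),
]

def top_n_per_user(predictions, n=2):
    """Group estimates by user and keep the n highest per user."""
    top_n = defaultdict(list)
    for uid, iid, _true_r, est, _details in predictions:
        top_n[uid].append((iid, est))
    for uid, user_ratings in top_n.items():
        # Sort by estimated rating, highest first, then truncate.
        user_ratings.sort(key=lambda pair: pair[1], reverse=True)
        top_n[uid] = user_ratings[:n]
    return dict(top_n)

print(top_n_per_user(predictions))
```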


## Non-Negative Matrix Factorization (NMF)

Non-negative matrix factorization (NMF or NNMF), also called non-negative matrix approximation, is a group of algorithms in multivariate analysis and linear algebra in which a matrix $R$ is factorized into (usually) two matrices $W$ and $H$, with the property that all three matrices have no negative elements:

$R_{n \times d} \approx W_{n \times r} H_{r \times d}$
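
To make the shapes concrete, the sketch below multiplies two small non-negative factor matrices $W$ and $H$ by hand. The numbers are hypothetical (2 users, 3 movies, $r = 2$ latent factors) and are not learned from the data; each entry of the product plays the role of one predicted rating.

```python
# Hypothetical non-negative factors for 2 users, 3 movies, r = 2 latent factors.
W = [[1.0, 0.5],
     [0.0, 2.0]]
H = [[2.0, 0.0, 1.0],
     [1.0, 2.0, 0.0]]

def matmul(A, B):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Each entry (u, i) is the dot product of user u's row in W and movie i's column in H.
R_hat = matmul(W, H)
print(R_hat)
```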

In [22]:
algo_NMF = NMF(n_factors = 16)
algo_NMF.fit(trainset)


# Predict ratings for all pairs (u, i) that are NOT in the training set.
testset = trainset.build_anti_testset()
predictions = algo_NMF.test(testset)

In [27]:
hist_NMF_124, pred_NMF_124 = get_top_n(predictions, movies_df = movies_df, userId = 124, ratings_df = ratings_df)

User 124 has already rated 50 movies.


In [28]:
pred_NMF_124

| | userId | movieId | rat_pred | title | genres |
|---|---|---|---|---|---|
| 0 | 124 | 2583 | 5.0 | Cookie's Fortune (1999) | Comedy\|Drama |
| 1 | 124 | 6300 | 5.0 | Flickering Lights (Blinkende lygter) (2000) | Action\|Comedy\|Crime |
| 2 | 124 | 177593 | 5.0 | Three Billboards Outside Ebbing, Missouri (2017) | Crime\|Drama |
| 3 | 124 | 3451 | 5.0 | Guess Who's Coming to Dinner (1967) | Drama |
| 4 | 124 | 7842 | 5.0 | Dune (2000) | Drama\|Fantasy\|Sci-Fi |
| 5 | 124 | 1945 | 5.0 | On the Waterfront (1954) | Crime\|Drama |
| 6 | 124 | 86781 | 5.0 | Incendies (2010) | Drama\|Mystery\|War |
| 7 | 124 | 89759 | 5.0 | Separation, A (Jodaeiye Nader az Simin) (2011) | Drama |
| 8 | 124 | 51931 | 5.0 | Reign Over Me (2007) | Drama |
| 9 | 124 | 2732 | 5.0 | Jules and Jim (Jules et Jim) (1961) | Drama\|Romance |


In [25]:
hist_NMF_100, pred_NMF_100 = get_top_n(predictions, movies_df = movies_df, userId = 100, ratings_df = ratings_df)

User 100 has already rated 148 movies.


In [26]:
pred_NMF_100

| | userId | movieId | rat_pred | title | genres |
|---|---|---|---|---|---|
| 0 | 100 | 1046 | 5.0 | Beautiful Thing (1996) | Drama\|Romance |
| 1 | 100 | 3429 | 5.0 | Creature Comforts (1989) | Animation\|Comedy |
| 2 | 100 | 55721 | 5.0 | Elite Squad (Tropa de Elite) (2007) | Action\|Crime\|Drama\|Thriller |
| 3 | 100 | 1204 | 5.0 | Lawrence of Arabia (1962) | Adventure\|Drama\|War |
| 4 | 100 | 82 | 5.0 | Antonia's Line (Antonia) (1995) | Comedy\|Drama |
| 5 | 100 | 306 | 5.0 | Three Colors: Red (Trois couleurs: Rouge) (1994) | Drama |
| 6 | 100 | 89759 | 5.0 | Separation, A (Jodaeiye Nader az Simin) (2011) | Drama |
| 7 | 100 | 51931 | 5.0 | Reign Over Me (2007) | Drama |
| 8 | 100 | 106100 | 5.0 | Dallas Buyers Club (2013) | Drama |
| 9 | 100 | 6666 | 5.0 | Discreet Charm of the Bourgeoisie, The (Charme... | Comedy\|Drama\|Fantasy |
