### Recommender system 

Build a prototype recommender system using several methods:

- Content-Based Filtering: recommends items similar to those the user rated highly in the past, using information about the items themselves.

- Collaborative Filtering: recommends items to a user based on preference or taste information collected from many users (hence "collaborating").

- Hybrid methods: combine collaborative filtering and content-based filtering to mitigate common recommender-system problems such as cold start and data sparsity.
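
As a rough illustration of the collaborative filtering idea described above, the following is a minimal user-based sketch on a made-up ratings matrix. The matrix `R`, the helper `predict_rating`, and the choice to treat unrated items as 0 when computing cosine similarity are all assumptions for illustration, not part of the MovieLens workflow in this notebook.

```python
import numpy as np

# Hypothetical toy ratings: rows are users, columns are items, 0 means "not rated".
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 5, 1],
    [1, 1, 2, 5],
], dtype=float)

def predict_rating(R, user, item):
    """Predict R[user, item] as a similarity-weighted mean of other users' ratings."""
    norms = np.linalg.norm(R, axis=1)
    # Cosine similarity between the target user and every user
    # (simplification: missing ratings count as 0).
    sims = R @ R[user] / (norms * norms[user])
    rated = R[:, item] > 0      # users who rated this item
    rated[user] = False         # never use the target user's own entry
    if not rated.any() or sims[rated].sum() == 0:
        return float("nan")     # nothing to base a prediction on
    return float(sims[rated] @ R[rated, item] / sims[rated].sum())

pred = predict_rating(R, user=0, item=2)
print(round(pred, 2))
```

User 0's tastes are closer to user 1's than to user 2's, so the prediction for the unrated item leans toward user 1's rating.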


MovieLens 100K Dataset - https://grouplens.org/datasets/movielens/100k/

In [33]:
import pandas as pd
from sklearn.metrics.pairwise import linear_kernel

In [15]:
# Read the users file
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv('ml-100k/u.user', sep='|', names=u_cols,
                    encoding='latin-1')

# Read the items file (19 binary genre flags follow the metadata columns)
i_cols = ['movie id', 'movie title', 'release date', 'video release date', 'IMDb URL',
          'unknown', 'Action', 'Adventure', 'Animation', "Children's", 'Comedy', 'Crime',
          'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery',
          'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
items = pd.read_csv('ml-100k/u.item', sep='|', names=i_cols,
                    encoding='latin-1')

# Read the ratings file (tab-separated)
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv('ml-100k/u.data', sep='\t', names=r_cols,
                      encoding='latin-1')

In [21]:
users.head()

|   | user_id | age | sex | occupation | zip_code |
|---|---|---|---|---|---|
| 0 | 1 | 24 | M | technician | 85711 |
| 1 | 2 | 53 | F | other | 94043 |
| 2 | 3 | 23 | M | writer | 32067 |
| 3 | 4 | 24 | M | technician | 43537 |
| 4 | 5 | 33 | F | other | 15213 |


In [19]:
items.head()

|   | movie id | movie title | release date | video release date | IMDb URL | unknown | Action | Adventure | Animation | Children's | ... | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | 01-Jan-1995 |  | http://us.imdb.com/M/title-exact?Toy%20Story%2... | 0 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | GoldenEye (1995) | 01-Jan-1995 |  | http://us.imdb.com/M/title-exact?GoldenEye%20(... | 0 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 3 | Four Rooms (1995) | 01-Jan-1995 |  | http://us.imdb.com/M/title-exact?Four%20Rooms%... | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 4 | Get Shorty (1995) | 01-Jan-1995 |  | http://us.imdb.com/M/title-exact?Get%20Shorty%... | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | Copycat (1995) | 01-Jan-1995 |  | http://us.imdb.com/M/title-exact?Copycat%20(1995) | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |


In [23]:
ratings.head()

|   | user_id | movie_id | rating | unix_timestamp |
|---|---|---|---|---|
| 0 | 196 | 242 | 3 | 881250949 |
| 1 | 186 | 302 | 3 | 891717742 |
| 2 | 22 | 377 | 1 | 878887116 |
| 3 | 244 | 51 | 2 | 880606923 |
| 4 | 166 | 346 | 1 | 886397596 |


### 1. Popularity model

In [27]:
# Recommend the same top-rated movies to every user.
# This uses the plain mean rating; a rating weighted by the number of votes would be
# more robust, because a movie with very few ratings can easily reach a 5.0 mean.

ratings.groupby(by='movie_id')['rating'].mean().sort_values(ascending=False).head(10)

movie_id
1293    5.0
1467    5.0
1653    5.0
814     5.0
1122    5.0
1599    5.0
1201    5.0
1189    5.0
1500    5.0
1536    5.0
Name: rating, dtype: float64
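
The weighted rating mentioned in the cell comment can be sketched as follows. This is one common choice (a Bayesian-style average that pulls each movie's mean toward the global mean), not necessarily the weighting the author had in mind. The toy frame `toy_ratings`, the helper `weighted_rating`, and the threshold `m` are assumptions introduced for illustration; only the column names match the `ratings` frame.

```python
import pandas as pd

# Toy stand-in for the `ratings` frame (same column names); the values are made up.
toy_ratings = pd.DataFrame({
    "movie_id": [1, 1, 1, 1, 2, 3, 3],
    "rating":   [5, 5, 4, 5, 5, 2, 3],
})

def weighted_rating(ratings, m=2):
    """Score = (v * R + m * C) / (v + m).

    v: number of ratings for the movie, R: the movie's mean rating,
    C: global mean rating, m: hypothetical prior weight (in "votes").
    """
    stats = ratings.groupby("movie_id")["rating"].agg(v="count", R="mean")
    C = ratings["rating"].mean()
    stats["score"] = (stats["v"] * stats["R"] + m * C) / (stats["v"] + m)
    return stats.sort_values("score", ascending=False)

ranked = weighted_rating(toy_ratings, m=2)
print(ranked)
```

In this toy data, movie 2 has the highest plain mean (a single 5-star rating), but movie 1 ranks first once the number of ratings is taken into account. The same function could be applied to `ratings` with a larger `m`.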

### 2. Content-based model

In [40]:
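# Content-based: recommend items similar to ones the user already chose,
# using the 19 binary genre flags as each movie's feature vector.

items_vectors = items.iloc[:, 5:]
print(items_vectors.shape)

# linear_kernel returns raw dot products, i.e. the number of shared genres.
# It equals cosine similarity only if the rows are L2-normalized first.
cosine_sim = linear_kernel(items_vectors, items_vectors)

items_vectors.head(10)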

(1682, 19)


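|   | unknown | Action | Adventure | Animation | Children's | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |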


In [56]:
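def get_recommendations(title, cosine_sim=cosine_sim):
    """Return the 10 movies whose genre vectors score highest against `title`."""
    # Row position of the query movie (first match if the title repeats)
    idx = items.index[items["movie title"] == title].tolist()[0]
    # Pair every other movie with its score; drop the query movie explicitly
    # rather than assuming it sorts first, since tied scores are possible
    sim_scores = [(i, s) for i, s in enumerate(cosine_sim[idx]) if i != idx]
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[:10]
    movie_indices = [i for i, _ in sim_scores]

    return items['movie title'].iloc[movie_indices]

get_recommendations("Toy Story (1995)")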

94                              Aladdin (1992)
421     Aladdin and the King of Thieves (1996)
819                           Space Jam (1996)
992                            Hercules (1997)
1218                     Goofy Movie, A (1995)
7                                  Babe (1995)
62                    Santa Clause, The (1994)
70                       Lion King, The (1994)
90      Nightmare Before Christmas, The (1993)
93                           Home Alone (1990)
Name: movie title, dtype: object
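
The ranking above comes from raw dot products, so a film with many genre flags can score as high as an exact genre match. A possible refinement, sketched here on made-up genre vectors, is to L2-normalize the rows before taking dot products, which gives true cosine similarity. The array `genres` and the helper `cosine_matrix` are hypothetical names for illustration.

```python
import numpy as np

# Hypothetical genre vectors (columns: Animation, Children's, Comedy, Musical, Adventure).
genres = np.array([
    [1, 1, 1, 0, 0],   # query film
    [1, 1, 1, 1, 1],   # superset of the query's genres
    [1, 1, 1, 0, 0],   # exact genre match
    [0, 0, 1, 0, 0],   # one shared genre
], dtype=float)

def cosine_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # leave all-zero rows as zeros instead of dividing by 0
    Xn = X / norms
    return Xn @ Xn.T

dot = genres @ genres.T              # raw dot products: number of shared genres
cos = cosine_matrix(genres)          # normalized: penalizes extra, unshared genres

print(dot[0])
print(np.round(cos[0], 3))
```

With raw dot products the superset film and the exact match tie against the query; after normalization the exact match scores higher.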