# Collaborative Filtering Algorithms
* Idea: If person A likes items 1, 2 and 3, and person B likes items 2, 3 and 4, then they have similar interests, so A is likely to enjoy item 4 and B is likely to enjoy item 1.

* This algorithm relies entirely on past behavior and not on context.
    This makes it one of the most commonly used approaches, since it does not depend on any additional information about users or items.
* For instance: product recommendations by e-commerce players such as Amazon, and merchant recommendations by banks such as American Express.
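
The A/B example above can be sketched in a few lines of plain Python. The `likes` dictionary and the helper names are hypothetical and exist only for illustration; Jaccard overlap is one possible way to quantify "similar interests", not the only one.

```python
# Toy data from the example above: which items each person likes.
likes = {
    "A": {1, 2, 3},
    "B": {2, 3, 4},
}

def jaccard(a, b):
    """Overlap between two sets: shared items divided by all distinct items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend_from_neighbour(user, neighbour, likes):
    """Items the neighbour likes that the user has not picked yet."""
    return sorted(likes[neighbour] - likes[user])

similarity = jaccard(likes["A"], likes["B"])       # 2 shared items out of 4 distinct
print(similarity)                                  # 0.5
print(recommend_from_neighbour("A", "B", likes))   # [4]
print(recommend_from_neighbour("B", "A", likes))   # [1]
```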

# Types of collaborative filtering algorithms:
* User-User Collaborative filtering: 
    Here we find look-alike customers (based on similarity) and offer each customer the products that their look-alikes have chosen in the past. 
    This algorithm is very effective but takes a lot of time and resources. 
    It requires computing a similarity score for every pair of customers, which is slow. 
    Therefore, for platforms with a large customer base, this algorithm is hard to implement without a strongly parallelized system.
* Item-Item Collaborative filtering: 
    It is quite similar to the previous algorithm, but instead of finding look-alike customers, we find look-alike items. 
    Once we have the item look-alike matrix, we can easily recommend similar items to a customer who has purchased any item from the store. 
    This algorithm is far less resource-intensive than user-user collaborative filtering. 
    For a new customer it takes far less time than user-user collaborative filtering, since we don't need the similarity scores between all customers. 
    And with a fixed number of products, the product-product look-alike matrix changes little over time, so it can be computed once and reused.
* Other simpler algorithms: There are other approaches, such as market basket analysis, which generally have lower predictive power than the algorithms described above.
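
As a rough illustration of the item-item approach described above, the sketch below builds an item look-alike matrix from a toy ratings dictionary and uses it to score unseen items. The data, the function names, and the choice of cosine similarity with similarity-weighted scoring are all assumptions made for this example; production systems differ in many details.

```python
from math import sqrt

# Hypothetical toy ratings: user -> {item: rating}. Missing items are unrated.
ratings = {
    "u1": {"i1": 5, "i2": 4},
    "u2": {"i1": 4, "i2": 5, "i3": 1},
    "u3": {"i2": 1, "i3": 5},
}

def item_vectors(ratings):
    """Turn user -> item ratings into item -> {user: rating}."""
    vectors = {}
    for user, user_ratings in ratings.items():
        for item, rating in user_ratings.items():
            vectors.setdefault(item, {})[user] = rating
    return vectors

def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(a[u] * b[u] for u in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def item_similarity_matrix(ratings):
    """Similarity for every pair of distinct items (the look-alike matrix)."""
    vectors = item_vectors(ratings)
    return {
        i: {j: cosine(vectors[i], vectors[j]) for j in vectors if j != i}
        for i in vectors
    }

def recommend(user, ratings, sim, k=1):
    """Score each unseen item by the similarity-weighted ratings of the user's items."""
    seen = ratings[user]
    scores = {}
    for item, rating in seen.items():
        for other, s in sim[item].items():
            if other not in seen:
                scores[other] = scores.get(other, 0.0) + s * rating
    return sorted(scores, key=scores.get, reverse=True)[:k]

sim = item_similarity_matrix(ratings)
print(recommend("u1", ratings, sim))  # u1 has not rated i3, so it is the only candidate
```

Note that `sim` depends only on the items' rating columns, so a new customer can be served from the existing matrix without comparing them to every other customer.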

In [1]:
import pandas as pd

# Pass in column names for each file and read it with pandas.
# The column names are listed in the dataset's README file.

# Reading users file:
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv('ml-100k/u.user', sep='|', names=u_cols,
                    encoding='latin-1')

# Reading ratings file:
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv('ml-100k/u.data', sep='\t', names=r_cols,
                      encoding='latin-1')

# Reading items file:
i_cols = ['movie id', 'movie title', 'release date', 'video release date', 'IMDb URL',
          'unknown', 'Action', 'Adventure', 'Animation', "Children's", 'Comedy', 'Crime',
          'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery',
          'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
items = pd.read_csv('ml-100k/u.item', sep='|', names=i_cols,
                    encoding='latin-1')

In [2]:
users.head(5)

Unnamed: 0,user_id,age,sex,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213


In [24]:
users.drop(['zip_code','occupation','sex'],axis=1, inplace=True)

In [3]:
ratings.head(5)

Unnamed: 0,user_id,movie_id,rating,unix_timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


In [4]:
items.head(5)

Unnamed: 0,movie id,movie title,release date,video release date,IMDb URL,unknown,Action,Adventure,Animation,Children's,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [25]:
items.drop(['movie title','release date','video release date','IMDb URL'],axis=1, inplace=True)

In [27]:
items = items.rename(columns={'movie id': 'movie_id'})

# Split Sets

In [5]:
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings_base = pd.read_csv('ml-100k/ua.base', sep='\t', names=r_cols, encoding='latin-1')
# ratings_base_20m = pd.read_csv('ml-20m/ratings.csv', sep=',', names=r_cols, encoding='latin-1',skiprows=1)
ratings_test = pd.read_csv('ml-100k/ua.test', sep='\t', names=r_cols, encoding='latin-1')
ratings_base.shape, ratings_test.shape

((90570, 4), (9430, 4))
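
The shapes above are consistent with the MovieLens 100K README, which describes `ua.test` as holding exactly 10 ratings per user (943 users x 10 = 9,430 rows). A minimal sketch of that kind of per-user holdout is shown below; `holdout_per_user` and the toy rows are hypothetical, and the random selection is an assumption rather than the procedure used to build the published split.

```python
import random

def holdout_per_user(rows, n_test, seed=0):
    """Split (user_id, movie_id, rating) rows so each user has n_test test rows.

    Users with n_test or fewer ratings stay entirely in the training set.
    """
    rng = random.Random(seed)
    by_user = {}
    for row in rows:
        by_user.setdefault(row[0], []).append(row)

    train, test = [], []
    for user_rows in by_user.values():
        if len(user_rows) <= n_test:
            train.extend(user_rows)
            continue
        shuffled = user_rows[:]
        rng.shuffle(shuffled)
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return train, test

# Hypothetical toy data: 3 users with 5 ratings each.
rows = [(u, m, (u + m) % 5 + 1) for u in range(1, 4) for m in range(1, 6)]
train, test = holdout_per_user(rows, n_test=2)
print(len(train), len(test))  # 9 6
```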

In [6]:
ratings_base.head(5)

Unnamed: 0,user_id,movie_id,rating,unix_timestamp
0,1,1,5,874965758
1,1,2,3,876893171
2,1,3,4,878542960
3,1,4,3,876893119
4,1,5,3,889751712


In [7]:
# ratings_base_20m.head(4)

# GraphLab

In [8]:
import graphlab
train_data = graphlab.SFrame(ratings_base)
test_data = graphlab.SFrame(ratings_test)





In [9]:
# train_20M_data = graphlab.SFrame(ratings_base_20m)

## Popularity Model

In [28]:
popularity_model = graphlab.popularity_recommender.create(train_data, 
                                                          user_id='user_id', 
                                                          item_id='movie_id', 
                                                          target='rating',
                                                          user_data=graphlab.SFrame(users),
                                                          item_data=graphlab.SFrame(items))

In [11]:
popularity_model = graphlab.popularity_recommender.create(train_data, 
                                                          user_id='user_id', 
                                                          item_id='movie_id', 
                                                          target='rating')

In [12]:
# Get recommendations for the first 5 users and print them.
# users=range(1, 6) selects user IDs 1 through 5.
# k=5 requests the top 5 recommendations per user.
popularity_recomm = popularity_model.recommend(users=range(1, 6), k=5)
popularity_recomm.print_rows(num_rows=25)

+---------+----------+-------+------+
| user_id | movie_id | score | rank |
+---------+----------+-------+------+
|    1    |   1467   |  5.0  |  1   |
|    1    |   1201   |  5.0  |  2   |
|    1    |   1189   |  5.0  |  3   |
|    1    |   1122   |  5.0  |  4   |
|    1    |   814    |  5.0  |  5   |
|    2    |   1467   |  5.0  |  1   |
|    2    |   1201   |  5.0  |  2   |
|    2    |   1189   |  5.0  |  3   |
|    2    |   1122   |  5.0  |  4   |
|    2    |   814    |  5.0  |  5   |
|    3    |   1467   |  5.0  |  1   |
|    3    |   1201   |  5.0  |  2   |
|    3    |   1189   |  5.0  |  3   |
|    3    |   1122   |  5.0  |  4   |
|    3    |   814    |  5.0  |  5   |
|    4    |   1467   |  5.0  |  1   |
|    4    |   1201   |  5.0  |  2   |
|    4    |   1189   |  5.0  |  3   |
|    4    |   1122   |  5.0  |  4   |
|    4    |   814    |  5.0  |  5   |
|    5    |   1467   |  5.0  |  1   |
|    5    |   1201   |  5.0  |  2   |
|    5    |   1189   |  5.0  |  3   |
|    5    | 
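
In the output above every user receives the same five movies, each with a score of 5.0. That is what one would expect from a baseline that ranks items by their mean rating without looking at the individual user. The sketch below illustrates that idea under this assumption; it is not GraphLab's implementation, and the `min_ratings` threshold is a hypothetical addition showing how items with very few ratings could be kept from dominating the list.

```python
from collections import defaultdict

def popularity_scores(rows, min_ratings=1):
    """Mean rating per item, keeping items with at least min_ratings ratings."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _user, item, rating in rows:
        totals[item] += rating
        counts[item] += 1
    return {i: totals[i] / counts[i] for i in totals if counts[i] >= min_ratings}

def recommend_popular(user, rows, k=5, min_ratings=1):
    """Top-k items by mean rating that the user has not rated yet."""
    seen = {item for u, item, _r in rows if u == user}
    scores = popularity_scores(rows, min_ratings)
    candidates = [i for i in scores if i not in seen]
    # Sort by score (descending), then by item id so ties are deterministic.
    return sorted(candidates, key=lambda i: (-scores[i], i))[:k]

# Hypothetical toy data: (user_id, movie_id, rating)
rows = [
    (1, 10, 4), (2, 10, 5), (3, 10, 4),   # widely rated, mean about 4.33
    (2, 20, 5),                           # a single 5-star rating, mean 5.0
    (1, 30, 2), (3, 30, 3),               # mean 2.5
]
print(recommend_popular(1, rows, k=2))                 # [20]
print(recommend_popular(4, rows, k=2))                 # [20, 10]
print(recommend_popular(4, rows, k=2, min_ratings=2))  # [10, 30]
```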

In [29]:
item_sim_model_cosine = graphlab.item_similarity_recommender.create(train_data, 
                                                             user_id='user_id', 
                                                             item_id='movie_id', 
                                                             target='rating', 
                                                             similarity_type='cosine',
                                                          user_data=graphlab.SFrame(users),
                                                          item_data=graphlab.SFrame(items))


In [30]:
item_sim_model_jaccard = graphlab.item_similarity_recommender.create(train_data, 
                                                             user_id='user_id', 
                                                             item_id='movie_id', 
                                                             target='rating', 
                                                             similarity_type='jaccard',
                                                          user_data=graphlab.SFrame(users),
                                                          item_data=graphlab.SFrame(items))


In [31]:
item_sim_model_pearson = graphlab.item_similarity_recommender.create(train_data, 
                                                             user_id='user_id', 
                                                             item_id='movie_id', 
                                                             target='rating', 
                                                             similarity_type='pearson',
                                                          user_data=graphlab.SFrame(users),
                                                          item_data=graphlab.SFrame(items))
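
The three item-similarity models above differ only in the similarity measure. The sketch below gives textbook versions of cosine, Jaccard and Pearson similarity between two items, each represented as a `user_id -> rating` dictionary. These are illustrative definitions; GraphLab's exact formulas (for example how it centers ratings or handles missing values) may differ, and the two toy movies are hypothetical.

```python
from math import sqrt

def jaccard(a, b):
    """Users who rated both items divided by users who rated either; ignores rating values."""
    users_a, users_b = set(a), set(b)
    union = users_a | users_b
    return len(users_a & users_b) / len(union) if union else 0.0

def cosine(a, b):
    """Cosine of the angle between rating vectors; unrated entries count as 0."""
    dot = sum(a[u] * b[u] for u in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def pearson(a, b):
    """Correlation of the ratings, computed over users who rated both items."""
    common = sorted(a.keys() & b.keys())
    if len(common) < 2:
        return 0.0
    mean_a = sum(a[u] for u in common) / len(common)
    mean_b = sum(b[u] for u in common) / len(common)
    cov = sum((a[u] - mean_a) * (b[u] - mean_b) for u in common)
    var_a = sum((a[u] - mean_a) ** 2 for u in common)
    var_b = sum((b[u] - mean_b) ** 2 for u in common)
    denom = sqrt(var_a * var_b)
    return cov / denom if denom else 0.0

# Hypothetical ratings for two movies: user_id -> rating
movie_x = {1: 5, 2: 3, 3: 1}
movie_y = {1: 1, 2: 3, 3: 5, 4: 4}

print(jaccard(movie_x, movie_y))   # 0.75: 3 shared raters out of 4 distinct raters
print(cosine(movie_x, movie_y))    # positive, because all ratings are positive
print(pearson(movie_x, movie_y))   # -1.0: the shared raters rank the movies in opposite order
```

The same pair of movies looks similar under Jaccard, moderately similar under cosine, and opposite under Pearson, which is why the choice of `similarity_type` can change the recommendations noticeably.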

In [32]:
item_ranking_factorization_model = graphlab.ranking_factorization_recommender.create(train_data,
                                                                                     user_id='user_id',
                                                                                     item_id='movie_id',
                                                                                     target='rating',
                                                                                     max_iterations =300,
                                                          user_data=graphlab.SFrame(users),
                                                          item_data=graphlab.SFrame(items))

[ERROR] graphlab.toolkits._main: Toolkit error: Cancelled by user.


ToolkitError: Cancelled by user.

In [33]:
item_factorization_model = graphlab.factorization_recommender.create(train_data, 
                                                                     user_id='user_id', 
                                                                     item_id='movie_id', 
                                                                     target='rating',
                                                                     max_iterations =300,
                                                          user_data=graphlab.SFrame(users),
                                                          item_data=graphlab.SFrame(items))
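
For intuition about what a factorization recommender learns, here is a simplified sketch of matrix factorization trained with stochastic gradient descent: each user and each item gets a small vector, and a rating is predicted as their dot product. This shows the general technique only. It omits bias terms and side features, it is not GraphLab's implementation, and the function names, hyperparameters and toy ratings are assumptions.

```python
import random
from math import sqrt

def train_mf(rows, n_factors=2, lr=0.02, reg=0.02, epochs=500, seed=0):
    """Learn user and item factor vectors with plain SGD on squared error."""
    rng = random.Random(seed)
    users = sorted({u for u, _i, _r in rows})
    items = sorted({i for _u, i, _r in rows})
    # Small positive random starting values for every latent factor.
    P = {u: [rng.uniform(0.1, 1.0) for _ in range(n_factors)] for u in users}
    Q = {i: [rng.uniform(0.1, 1.0) for _ in range(n_factors)] for i in items}
    for _ in range(epochs):
        for u, i, r in rows:
            err = r - sum(p * q for p, q in zip(P[u], Q[i]))
            for f in range(n_factors):
                p, q = P[u][f], Q[i][f]
                # Gradient step with L2 regularisation on both factors.
                P[u][f] += lr * (err * q - reg * p)
                Q[i][f] += lr * (err * p - reg * q)
    return P, Q

def predict(P, Q, user, item):
    """Predicted rating: dot product of the user and item factor vectors."""
    return sum(p * q for p, q in zip(P[user], Q[item]))

def rmse(P, Q, rows):
    """Root mean squared error of the predictions on the given rows."""
    return sqrt(sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in rows) / len(rows))

# Hypothetical toy data: (user_id, movie_id, rating)
rows = [
    (1, 1, 5), (1, 2, 4), (1, 3, 1),
    (2, 1, 4), (2, 2, 5),
    (3, 2, 1), (3, 3, 5),
    (4, 1, 1), (4, 3, 4),
]

P0, Q0 = train_mf(rows, epochs=0)   # untrained factors, same seed
P, Q = train_mf(rows)               # trained factors
rmse_before = rmse(P0, Q0, rows)
rmse_after = rmse(P, Q, rows)
print(rmse_before, rmse_after)      # training error should drop after fitting
print(predict(P, Q, 2, 3))          # prediction for a pair missing from the data
```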

In [34]:
#Make Recommendations:
item_sim_recomm = item_sim_model_cosine.recommend(users=range(1,6),k=64)
item_sim_recomm.print_rows(num_rows=25)

+---------+----------+----------------+------+
| user_id | movie_id |     score      | rank |
+---------+----------+----------------+------+
|    1    |   423    | 0.988330925694 |  1   |
|    1    |   202    | 0.936275708311 |  2   |
|    1    |   655    | 0.803317898785 |  3   |
|    1    |   403    | 0.778580487912 |  4   |
|    1    |   568    | 0.754668656651 |  5   |
|    1    |   385    | 0.753191386698 |  6   |
|    1    |   393    | 0.638743438566 |  7   |
|    1    |   265    | 0.635910023032 |  8   |
|    1    |   357    | 0.573736994094 |  9   |
|    1    |   483    | 0.490505273106 |  10  |
|    1    |   496    | 0.46411596068  |  11  |
|    1    |   318    | 0.460766265183 |  12  |
|    1    |   550    | 0.437688769502 |  13  |
|    1    |   566    | 0.434414722992 |  14  |
|    1    |   732    | 0.428943741868 |  15  |
|    1    |   474    | 0.407339939634 |  16  |
|    1    |   367    | 0.332838937299 |  17  |
|    1    |   405    | 0.331113176264 |  18  |
|    1    |  

In [35]:
model_list = [popularity_model, item_sim_model_cosine, item_sim_model_pearson,
              item_sim_model_jaccard, item_ranking_factorization_model,
              item_factorization_model]

model_performance = graphlab.compare(test_data, model_list)
graphlab.show_comparison(model_performance, model_list)

PROGRESS: Evaluate model M0

Precision and recall summary statistics by cutoff
+--------+-------------------+-------------------+
| cutoff |   mean_precision  |    mean_recall    |
+--------+-------------------+-------------------+
|   1    |        0.0        |        0.0        |
|   2    |        0.0        |        0.0        |
|   3    | 0.000353481795688 | 0.000106044538706 |
|   4    | 0.000265111346766 | 0.000106044538706 |
|   5    | 0.000424178154825 | 0.000212089077413 |
|   6    | 0.000353481795688 | 0.000212089077413 |
|   7    | 0.000302984396304 | 0.000212089077413 |
|   8    | 0.000265111346766 | 0.000212089077413 |
|   9    | 0.000235654530458 | 0.000212089077413 |
|   10   | 0.000212089077413 | 0.000212089077413 |
+--------+-------------------+-------------------+
[10 rows x 3 columns]
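
The precision and recall columns above are reported per cutoff. A common definition, assumed here rather than taken from GraphLab's source, is: precision at k is the share of the top k recommendations that appear in the user's held-out ratings, and recall at k is the share of the held-out ratings that appear in the top k. The helper names and toy data below are hypothetical.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the first k recommendations for one user."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def mean_precision_recall(recs_by_user, test_by_user, k):
    """Average precision and recall at k over all users in the test set."""
    results = [
        precision_recall_at_k(recs_by_user.get(user, []), relevant, k)
        for user, relevant in test_by_user.items()
    ]
    n = len(results)
    return sum(p for p, _ in results) / n, sum(r for _, r in results) / n

# Hypothetical example: ranked recommendations and held-out movies for one user.
recommended = [10, 20, 30, 40, 50]
relevant = {20, 50, 60}
print(precision_recall_at_k(recommended, relevant, 1))  # (0.0, 0.0)
print(precision_recall_at_k(recommended, relevant, 2))  # 1 hit: precision 0.5, recall 1/3
print(precision_recall_at_k(recommended, relevant, 5))  # 2 hits: precision 0.4, recall 2/3

# Averaging over two hypothetical users.
recs_by_user = {1: [10, 20, 30], 2: [40, 50, 60]}
test_by_user = {1: {20}, 2: {70}}
print(mean_precision_recall(recs_by_user, test_by_user, 2))  # (0.25, 0.5)
```

Under this definition, precision and recall coincide whenever the cutoff equals the number of held-out ratings per user, which matches the equal values in the cutoff 10 row above.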

PROGRESS: Evaluate model M1

Precision and recall summary statistics by cutoff
+--------+----------------+-----------------+
| cutoff | mean_precision |   mean_recall   |
+--------+---