## Movie Recommender

<li> Uses the public MovieLens dataset
    <li> About 100,000 ratings and 3,600 tag applications applied to 9,000 movies by 600 users

### Importing the Data and Printing

In [2]:
import turicreate as tc
movies = tc.SFrame.read_csv("movies.csv", header=True, delimiter=',')
movies

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy ...
2,Jumanji (1995),Adventure|Children|Fantasy ...
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,Father of the Bride Part II (1995) ...,Comedy
6,Heat (1995),Action|Crime|Thriller
7,Sabrina (1995),Comedy|Romance
8,Tom and Huck (1995),Adventure|Children
9,Sudden Death (1995),Action
10,GoldenEye (1995),Action|Adventure|Thriller


### Data Visualization with a Single Command

In [16]:
movies.show()


### Loading and Inspecting the User Ratings

In [4]:
ratings = tc.SFrame.read_csv("ratings.csv", header=True, delimiter=',')
ratings

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,int,float,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


userId,movieId,rating,timestamp
1,1,4.0,964982703
1,3,4.0,964981247
1,6,4.0,964982224
1,47,5.0,964983815
1,50,5.0,964982931
1,70,3.0,964982400
1,101,5.0,964980868
1,110,4.0,964982176
1,151,5.0,964984041
1,157,5.0,964984100


In [17]:
ratings['rating'].show()
# Shows the distribution of the rating values

### Building the Recommenders

#### Popularity Recommender

In [8]:
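# Train a popularity model, then ask for the top 3 movies for users 1 to 5
model = tc.recommender.popularity_recommender.create(
    ratings, user_id='userId', item_id='movieId', target='rating')
most_popular = model.recommend(users=[1, 2, 3, 4, 5], k=3)
# Join with the movies table so that each recommendation shows its title
most_popular = most_popular.join(right=movies, on={'movieId': 'movieId'}, how='inner')
most_popular = most_popular.sort(['userId', 'rank'], ascending=True)
most_popular.print_rows(num_rows=15)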



+--------+---------+-------+------+--------------------------------+
| userId | movieId | score | rank |             title              |
+--------+---------+-------+------+--------------------------------+
|   1    |   6835  |  5.0  |  1   |   Alien Contamination (1980)   |
|   1    |   5746  |  5.0  |  2   | Galaxy of Terror (Quest) (...  |
|   1    |  131724 |  5.0  |  3   | The Jinx: The Life and Dea...  |
|   2    |   3851  |  5.0  |  1   | I'm the One That I Want (2000) |
|   2    |   6835  |  5.0  |  2   |   Alien Contamination (1980)   |
|   2    |   5746  |  5.0  |  3   | Galaxy of Terror (Quest) (...  |
|   3    |   1151  |  5.0  |  1   |      Lesson Faust (1994)       |
|   3    |   3851  |  5.0  |  2   | I'm the One That I Want (2000) |
|   3    |  131724 |  5.0  |  3   | The Jinx: The Life and Dea...  |
|   4    |   6835  |  5.0  |  1   |   Alien Contamination (1980)   |
|   4    |   5746  |  5.0  |  2   | Galaxy of Terror (Quest) (...  |
|   4    |  131724 |  5.0  |  3   

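<li>The results differ slightly between users because a movie that a user has already rated is not recommended to that user again. Every recommended movie has a score of 5.0, which suggests that the ranking follows the mean rating, so titles with only a few perfect ratings rise to the top.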
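
The exclusion of already-rated movies can be sketched in plain Python. This is a minimal illustration and not Turi Create's implementation: it assumes that popularity means the mean rating per item, and `popularity_recommend` and `toy_ratings` are hypothetical names with made-up data.

```python
from collections import defaultdict


def popularity_recommend(ratings, user, k=3):
    """Rank items by mean rating, skipping items `user` has already rated.

    `ratings` is a list of (user, item, rating) tuples.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    seen = set()
    for u, item, rating in ratings:
        totals[item] += rating
        counts[item] += 1
        if u == user:
            seen.add(item)

    means = {item: totals[item] / counts[item] for item in totals}
    # Keep only unseen items, best mean rating first (ties broken by item id).
    candidates = [item for item in means if item not in seen]
    candidates.sort(key=lambda item: (-means[item], item))
    return candidates[:k]


# Made-up ratings: (user, item, rating)
toy_ratings = [
    (1, "A", 5.0), (1, "B", 3.0),
    (2, "A", 4.0), (2, "C", 5.0),
    (3, "B", 4.0), (3, "D", 2.0),
]

print(popularity_recommend(toy_ratings, user=1, k=2))
print(popularity_recommend(toy_ratings, user=2, k=2))
```

User 1 never sees items A or B in the result, and user 2 never sees A or C, which is why the lists differ per user even though the underlying popularity ranking is global.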

### Split
Next, let’s try item-item similarity. First we split the ratings into training and validation data, so that we can evaluate model performance afterwards.
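
The idea of holding out part of each user's ratings can be sketched without the library. This is a simplified illustration under stated assumptions: it holds out a fixed share of every user's ratings, which may differ from how `random_split_by_user` samples users and items. The names `split_by_user` and `toy` are hypothetical.

```python
import random
from collections import defaultdict


def split_by_user(ratings, test_proportion=0.2, seed=0):
    """Hold out a share of each user's ratings for validation.

    `ratings` is a list of (user, item, rating) tuples.
    """
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for row in ratings:
        by_user[row[0]].append(row)

    train, validation = [], []
    for rows in by_user.values():
        rows = rows[:]              # copy so the input is left untouched
        rng.shuffle(rows)
        n_test = int(len(rows) * test_proportion)
        validation.extend(rows[:n_test])
        train.extend(rows[n_test:])
    return train, validation


# Made-up data: 2 users with 10 ratings each
toy = [(user, item, 3.0) for user in (1, 2) for item in range(10)]
train, validation = split_by_user(toy, test_proportion=0.2)
print(len(train), len(validation))
```

Splitting per user, rather than over all rows at once, keeps some training history for every user, so the model can still produce recommendations for the users it is evaluated on.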

In [9]:
training_data, validation_data = tc.recommender.util.random_split_by_user(
    ratings, 'userId', 'movieId', item_test_proportion=0.2)
model = tc.recommender.item_similarity_recommender.create(training_data,
                                                          user_id='userId',
                                                          item_id='movieId',
                                                          target='rating')
items_similarity = model.get_similar_items()
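
To make the similarity scores more concrete, here is a minimal sketch of item-item similarity in plain Python. It assumes Jaccard similarity over the sets of users who rated each item (the rating values are ignored); whether this matches the similarity measure the model above actually used is an assumption. `jaccard_similar_items` and `toy_ratings` are hypothetical names with made-up data.

```python
from collections import defaultdict


def jaccard_similar_items(ratings, item, k=3):
    """Return the k items most similar to `item` as (item, score) pairs.

    Similarity is |users who rated both| / |users who rated either|.
    """
    users_by_item = defaultdict(set)
    for user, rated_item, _ in ratings:
        users_by_item[rated_item].add(user)

    target_users = users_by_item[item]
    scores = []
    for other, users in users_by_item.items():
        if other == item:
            continue
        union = len(target_users | users)
        score = len(target_users & users) / union if union else 0.0
        scores.append((other, score))

    # Highest score first; ties broken by item id for a stable order.
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:k]


# Made-up ratings: (user, item, rating)
toy_ratings = [
    (1, "A", 5.0), (2, "A", 4.0), (3, "A", 4.0),
    (1, "B", 5.0), (2, "B", 5.0), (4, "B", 3.0),
    (3, "C", 4.0), (4, "C", 5.0),
]

print(jaccard_similar_items(toy_ratings, "A"))
```

Items A and B share two of four distinct raters, so B scores 0.5; A and C share one of four, so C scores 0.25.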

<li> Spot-check the model with one movie, “Alien” (movieId 1214)

In [10]:
alien_rows = items_similarity[items_similarity['movieId'] == 1214]
(alien_rows
    .join(right=movies, on={'similar': 'movieId'}, how='inner')
    .sort('rank', ascending=True)
    .print_rows())

+---------+---------+---------------------+------+
| movieId | similar |        score        | rank |
+---------+---------+---------------------+------+
|   1214  |   1200  |  0.5241379141807556 |  1   |
|   1214  |   1097  | 0.34355831146240234 |  2   |
|   1214  |   1089  | 0.34117645025253296 |  3   |
|   1214  |   1210  | 0.32535886764526367 |  4   |
|   1214  |   1198  | 0.30232560634613037 |  5   |
|   1214  |   1136  | 0.29545456171035767 |  6   |
|   1214  |   1387  |  0.2857142686843872 |  7   |
|   1214  |   1653  |  0.2770270109176636 |  8   |
|   1214  |   1213  |  0.2721893787384033 |  9   |
|   1214  |   480   | 0.27196651697158813 |  10  |
+---------+---------+---------------------+------+
+-------------------------------+--------------------------------+
|             title             |             genres             |
+-------------------------------+--------------------------------+
|         Aliens (1986)         | Action|Adventure|Horror|Sci-Fi |
| E.T. the Extra-T

In [11]:
model.evaluate(validation_data)


Precision and recall summary statistics by cutoff
+--------+---------------------+----------------------+
| cutoff |    mean_precision   |     mean_recall      |
+--------+---------------------+----------------------+
|   1    |  0.3103448275862067 | 0.018634324532989934 |
|   2    |  0.2947454844006568 | 0.038303208108109804 |
|   3    |  0.2698412698412698 | 0.05155647467531792  |
|   4    |  0.261904761904762  | 0.06541700519494215  |
|   5    | 0.24729064039408877 | 0.07531744392400544  |
|   6    |  0.2402846195949644 |  0.0857107909849983  |
|   7    | 0.23269997654234098 |  0.0943738849336917  |
|   8    | 0.22844827586206898 | 0.10712201535797734  |
|   9    | 0.22276956759715383 | 0.11569717101200327  |
|   10   | 0.21527093596059108 | 0.12216046262488454  |
+--------+---------------------+----------------------+
[10 rows x 3 columns]


Overall RMSE: 3.634390588560626

Per User RMSE (best)
+--------+--------------------+-------+
| userId |        rmse        | count |
+------

{'precision_recall_by_user': Columns:
 	userId	int
 	cutoff	int
 	precision	float
 	recall	float
 	count	int
 
 Rows: 10962
 
 Data:
 +--------+--------+---------------------+----------------------+-------+
 | userId | cutoff |      precision      |        recall        | count |
 +--------+--------+---------------------+----------------------+-------+
 |   1    |   1    |         0.0         |         0.0          |   44  |
 |   1    |   2    |         0.0         |         0.0          |   44  |
 |   1    |   3    |         0.0         |         0.0          |   44  |
 |   1    |   4    |         0.0         |         0.0          |   44  |
 |   1    |   5    |         0.0         |         0.0          |   44  |
 |   1    |   6    |         0.0         |         0.0          |   44  |
 |   1    |   7    | 0.14285714285714285 | 0.022727272727272728 |   44  |
 |   1    |   8    |        0.125        | 0.022727272727272728 |   44  |
 |   1    |   9    |  0.2222222222222222 | 0.04545454
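
The precision and recall columns can be reproduced by hand. The sketch below assumes the usual definitions (hits divided by the cutoff, and hits divided by the number of held-out items). The item ids are made up and chosen so that the results line up with the user 1 rows at cutoffs 7 to 9 above; `precision_recall_at_k` is a hypothetical helper.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the top-k recommendations.

    `recommended` is a ranked list of item ids,
    `relevant` is the set of held-out items the user actually rated.
    """
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# 44 held-out items, matching the `count` column for user 1 (ids are made up)
relevant = set(range(100, 144))
# Ranked recommendations with hits at positions 7 and 9
recommended = [1, 2, 3, 4, 5, 6, 100, 7, 101, 8]

for cutoff in (6, 7, 8, 9):
    print(cutoff, precision_recall_at_k(recommended, relevant, cutoff))
# cutoff 7: precision 1/7, recall 1/44
# cutoff 8: precision 1/8, recall 1/44
# cutoff 9: precision 2/9, recall 2/44
```

Because each user has many held-out items and only a handful of recommendations, recall stays small even when precision is reasonable.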

### Factorization approach

In [13]:
model = tc.recommender.ranking_factorization_recommender.create(training_data,
                                                                user_id='userId',
                                                                item_id='movieId',
                                                                target='rating')
results = model.recommend(k=3)


In [14]:
def join_titles(sframe, on):
    # Attach title and genres by joining against the movies SFrame
    return sframe.join(right=movies, on=on, how='inner')
results = join_titles(results, 'movieId')
results.sort(['userId', 'rank'], ascending=True).print_rows(20)

+--------+---------+--------------------+------+-------------------------------+
| userId | movieId |       score        | rank |             title             |
+--------+---------+--------------------+------+-------------------------------+
|   1    |   296   | 5.722471357287951  |  1   |      Pulp Fiction (1994)      |
|   1    |    47   | 5.6180686959644035 |  2   |  Seven (a.k.a. Se7en) (1995)  |
|   1    |   858   |  5.61585241645772  |  3   |     Godfather, The (1972)     |
|   2    |   1198  | 4.804590404929705  |  1   | Raiders of the Lost Ark (I... |
|   2    |   260   | 4.766318351926394  |  2   | Star Wars: Episode IV - A ... |
|   2    |   541   | 4.723595620097704  |  3   |      Blade Runner (1982)      |
|   3    |   1240  | 4.6727822580953315 |  1   |     Terminator, The (1984)    |
|   3    |   2916  |  4.37605489880044  |  2   |      Total Recall (1990)      |
|   3    |   150   | 4.3733899751325325 |  3   |        Apollo 13 (1995)       |
|   4    |   924   | 5.29685

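The algorithm represents each user and each movie by a vector of latent features learned from the training ratings. It fits these vectors by minimizing the prediction error (reported as RMSE) with stochastic gradient descent, tuning the learning rate along the way.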
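
The idea of latent features fitted by stochastic gradient descent can be sketched in a few lines of plain Python. This is a minimal matrix factorization on squared rating error with a fixed learning rate, not the library's ranking factorization model, which may optimize a different objective. `train_mf`, `rmse`, the hyperparameter values, and the toy data are all assumptions made for illustration.

```python
import math
import random


def train_mf(ratings, n_factors=2, lr=0.05, reg=0.02, epochs=200, seed=0):
    """Fit user and item latent vectors with SGD; return a predict(user, item) function."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    mean = sum(r for _, _, r in ratings) / len(ratings)
    # Small random starting vectors for every user and item
    P = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    Q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    def predict(u, i):
        return mean + sum(pu * qi for pu, qi in zip(P[u], Q[i]))

    data = list(ratings)
    for _ in range(epochs):
        rng.shuffle(data)
        for u, i, r in data:
            err = r - predict(u, i)
            for f in range(n_factors):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on squared error with L2 regularization
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return predict


def rmse(predict, ratings):
    """Root mean squared error of `predict` over (user, item, rating) tuples."""
    squared = sum((r - predict(u, i)) ** 2 for u, i, r in ratings)
    return math.sqrt(squared / len(ratings))


# Made-up ratings: (user, item, rating)
toy = [
    (1, "A", 5.0), (1, "B", 1.0), (1, "C", 2.0),
    (2, "A", 4.0), (2, "B", 2.0),
    (3, "A", 1.0), (3, "B", 5.0), (3, "C", 4.0),
]

predict = train_mf(toy)
baseline = rmse(lambda u, i: 3.0, toy)   # always predict the global mean (3.0)
print("baseline RMSE:", baseline)
print("factorization RMSE:", rmse(predict, toy))
```

The comparison against the constant-mean baseline is on the training data only; on real data the error should be measured on held-out ratings, as `model.evaluate(validation_data)` does.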

In [15]:
model.evaluate(validation_data)


Precision and recall summary statistics by cutoff
+--------+---------------------+----------------------+
| cutoff |    mean_precision   |     mean_recall      |
+--------+---------------------+----------------------+
|   1    | 0.20032840722495898 | 0.00950652619099485  |
|   2    |  0.1847290640394088 | 0.015179279562914426 |
|   3    | 0.17569786535303775 |  0.0216443387613403  |
|   4    | 0.16625615763546797 | 0.028139439698030518 |
|   5    | 0.15894909688013137 | 0.03394687132637824  |
|   6    | 0.15188834154351397 | 0.03959102223001426  |
|   7    | 0.14801782782078352 | 0.045359559742451254 |
|   8    |  0.1434729064039409 | 0.04975328047615025  |
|   9    |  0.140485312899106  | 0.05501675118647552  |
|   10   |  0.1372742200328407 | 0.05929834908026429  |
+--------+---------------------+----------------------+
[10 rows x 3 columns]


Overall RMSE: 1.0516069324149855

Per User RMSE (best)
+--------+--------------------+-------+
| userId |        rmse        | count |
+-----

{'precision_recall_by_user': Columns:
 	userId	int
 	cutoff	int
 	precision	float
 	recall	float
 	count	int
 
 Rows: 10962
 
 Data:
 +--------+--------+--------------------+----------------------+-------+
 | userId | cutoff |     precision      |        recall        | count |
 +--------+--------+--------------------+----------------------+-------+
 |   1    |   1    |        1.0         | 0.022727272727272728 |   44  |
 |   1    |   2    |        1.0         | 0.045454545454545456 |   44  |
 |   1    |   3    | 0.6666666666666666 | 0.045454545454545456 |   44  |
 |   1    |   4    |        0.5         | 0.045454545454545456 |   44  |
 |   1    |   5    |        0.4         | 0.045454545454545456 |   44  |
 |   1    |   6    | 0.3333333333333333 | 0.045454545454545456 |   44  |
 |   1    |   7    | 0.2857142857142857 | 0.045454545454545456 |   44  |
 |   1    |   8    |        0.25        | 0.045454545454545456 |   44  |
 |   1    |   9    | 0.2222222222222222 | 0.045454545454545456 |