<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Recommender Systems

_Authors: Riley Dallas (AUS)_

---

In [1]:
import pandas as pd
import numpy as np
from scipy import sparse

from sklearn.metrics.pairwise import pairwise_distances, cosine_distances, cosine_similarity

## Load `movies.csv` and `ratings.csv`
---

We'll be using the [MovieLens](https://grouplens.org/datasets/movielens/) dataset for building our recommendation engine. There are two CSVs (`movies.csv` and `ratings.csv`) that we'll eventually inner join. 

In [2]:
movies = pd.read_csv('datasets/movies.csv')
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [3]:
ratings = pd.read_csv('datasets/ratings.csv')
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


## Drop unnecessary columns
---

We won't need the `timestamp` column from `ratings`, nor will we need the `genres` column from `movies`. Drop both columns in the cells below.

In [4]:
movies.drop(columns='genres', inplace=True)
ratings.drop(columns='timestamp', inplace=True)

## Merge `movies` and `ratings`
---

Use `pd.merge` to **inner join** `movies` with `ratings` on the `movieId` column.

In [5]:
ratings_with_titles = pd.merge(ratings, movies, on='movieId')
ratings_with_titles.head()

Unnamed: 0,userId,movieId,rating,title
0,1,1,4.0,Toy Story (1995)
1,5,1,4.0,Toy Story (1995)
2,7,1,4.5,Toy Story (1995)
3,15,1,2.5,Toy Story (1995)
4,17,1,4.5,Toy Story (1995)
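
To make the join behaviour concrete, here is a minimal sketch on two toy frames. The data and the `toy_*` names are made up for illustration and are not taken from MovieLens. `pd.merge` defaults to `how='inner'`, so a rating whose `movieId` has no match in the movies frame is dropped, as is any movie with no ratings.

```python
import pandas as pd

# Hypothetical toy data, not taken from MovieLens
toy_movies = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Movie A', 'Movie B', 'Movie C'],
})
toy_ratings = pd.DataFrame({
    'userId': [10, 11, 10],
    'movieId': [1, 1, 4],   # movieId 4 has no match in toy_movies
    'rating': [4.0, 3.5, 5.0],
})

# how='inner' is the default; it is spelled out here for clarity
toy_merged = pd.merge(toy_ratings, toy_movies, on='movieId', how='inner')
print(toy_merged)
```

Only the two ratings for `movieId` 1 survive the join in this sketch.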


## Create pivot table
---

Because we're creating an item-based collaborative recommender (where the items are movies), we'll set up our pivot table as follows:
1. The `title` will be the index
2. The `userId` will be the column
3. The `rating` will be the value

**If we were building a user-based collaborative recommender, what would change about this pivot table?**

In [6]:
pivot = pd.pivot_table(ratings_with_titles, index='title', columns='userId', values='rating')  # index -> rows, columns -> columns
pivot.tail()

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
eXistenZ (1999),,,,,,,,,,,...,,,5.0,,,,,4.5,,
xXx (2002),,,,,,,,,1.0,,...,,,,,,,,3.5,,2.0
xXx: State of the Union (2005),,,,,,,,,,,...,,,,,,,,,,1.5
¡Three Amigos! (1986),4.0,,,,,,,,,,...,,,,,,,,,,
À nous la liberté (Freedom for Us) (1931),,,,,,,,,,,...,,,,,,,,,,
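
As a rough illustration of how the orientation of the pivot table determines what gets compared, the sketch below builds both layouts from the same toy ratings. The data and the `item_pivot` / `user_pivot` names are assumptions for this example only.

```python
import pandas as pd

# Hypothetical toy ratings, not taken from MovieLens
toy = pd.DataFrame({
    'userId': [1, 1, 2, 3],
    'title': ['Movie A', 'Movie B', 'Movie A', 'Movie B'],
    'rating': [4.0, 2.0, 5.0, 3.0],
})

# Item-based layout: one row per movie, one column per user
item_pivot = pd.pivot_table(toy, index='title', columns='userId', values='rating')

# User-based layout: one row per user, one column per movie
user_pivot = pd.pivot_table(toy, index='userId', columns='title', values='rating')

print(item_pivot)
print(user_pivot)
```

Pairwise functions compare rows, so the first layout compares movies with movies and the second compares users with users.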


## Create sparse matrix
---

In a minute, we'll calculate the cosine distance between every pair of movies using `sklearn`'s pairwise functions. Before that, we'll convert the pivot table into a sparse matrix (a data structure that stores only the non-zero entries) using `scipy`'s `sparse` module like so:
```python
sparse.csr_matrix(pivot.fillna(0))
```

In [7]:
sparse_pivot = sparse.csr_matrix(pivot.fillna(0))

In [8]:
print(sparse_pivot)

  (0, 609)	4.0
  (1, 331)	4.0
  (2, 331)	3.5
  (2, 376)	3.5
  (3, 344)	5.0
  (4, 112)	3.0
  (4, 344)	5.0
  (5, 20)	1.5
  (6, 11)	5.0
  (6, 18)	2.0
  (6, 90)	2.0
  (6, 94)	3.0
  (6, 171)	4.0
  (6, 216)	4.0
  (6, 287)	3.0
  (6, 293)	1.0
  (6, 306)	3.5
  (6, 376)	3.5
  (6, 413)	3.0
  (6, 473)	1.0
  (6, 476)	3.5
  (6, 519)	4.0
  (6, 554)	5.0
  (6, 560)	4.5
  (6, 598)	2.0
  :	:
  (9717, 26)	5.0
  (9717, 41)	5.0
  (9717, 56)	2.0
  (9717, 67)	4.0
  (9717, 87)	3.5
  (9717, 140)	3.5
  (9717, 197)	2.0
  (9717, 214)	2.5
  (9717, 216)	2.0
  (9717, 220)	3.5
  (9717, 238)	3.0
  (9717, 281)	4.0
  (9717, 293)	4.0
  (9717, 306)	2.5
  (9717, 312)	1.0
  (9717, 413)	3.0
  (9717, 420)	3.0
  (9717, 447)	3.0
  (9717, 473)	3.0
  (9717, 476)	3.5
  (9717, 554)	3.0
  (9717, 560)	4.0
  (9717, 596)	3.0
  (9717, 598)	2.5
  (9718, 526)	1.0
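
To see why a sparse representation suits this kind of data, the sketch below converts a small toy pivot table and counts how many cells are actually stored. The toy values and variable names are assumptions for illustration; `nnz` is the attribute `scipy` uses for the number of stored entries.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical toy pivot: 3 movies x 4 users, NaN means "no rating"
toy_pivot = pd.DataFrame(
    [[4.0, np.nan, np.nan, 5.0],
     [np.nan, 3.0, np.nan, np.nan],
     [np.nan, np.nan, np.nan, 1.5]],
    index=['Movie A', 'Movie B', 'Movie C'],
    columns=[1, 2, 3, 4],
)

# Missing ratings become 0, and zeros are not stored in the CSR matrix
toy_sparse = sparse.csr_matrix(toy_pivot.fillna(0))

stored = toy_sparse.nnz
total = toy_sparse.shape[0] * toy_sparse.shape[1]
density = stored / total
print(f'{stored} of {total} cells stored (density {density:.2f})')
```

Printing a CSR matrix lists only those stored entries as `(row, column)  value`, which is the format of the output above.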


## Calculate cosine similarity
---

`sklearn` has a built-in `pairwise_distances` function that we can use for our recommender. It returns a square matrix holding the distance between every pair of movies in the dataset.

```python
pairwise_distances(sparse_pivot, metric='cosine')
cosine_distances(sparse_pivot)                     # Identical but more concise
```
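
The equivalence of those two calls can be checked on a small toy matrix. This is a sketch under the assumption that each row is an item and each column a user; the values are made up.

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances, cosine_distances

# Hypothetical toy matrix: 3 items x 3 users
X = np.array([
    [4.0, 0.0, 5.0],
    [4.0, 0.0, 5.0],   # same ratings as item 0
    [0.0, 3.0, 0.0],   # shares no raters with items 0 and 1
])

d1 = pairwise_distances(X, metric='cosine')
d2 = cosine_distances(X)

print(np.allclose(d1, d2))
print(np.round(d2, 3))
```

Items 0 and 1 have identical rating vectors, so their distance is 0; item 2 has no raters in common with them, so its distance to both is 1.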

In [9]:
dists = cosine_distances(sparse_pivot)
dists

array([[0.        , 1.        , 1.        , ..., 0.67267316, 1.        ,
        1.        ],
       [1.        , 0.        , 0.29289322, ..., 1.        , 1.        ,
        1.        ],
       [1.        , 0.29289322, 0.        , ..., 1.        , 1.        ,
        1.        ],
       ...,
       [0.67267316, 1.        , 1.        , ..., 0.        , 1.        ,
        1.        ],
       [1.        , 1.        , 1.        , ..., 1.        , 0.        ,
        1.        ],
       [1.        , 1.        , 1.        , ..., 1.        , 1.        ,
        0.        ]])

However, note that distance is not the same as similarity: a similarity of 1 corresponds to a distance of 0.

The two are related by `cosine_similarity = 1.0 - cosine_distance`, so the distance of 0.29289322 in the output above becomes a similarity of 0.70710678. To compute similarities directly, we can use the `cosine_similarity` function instead.

In [10]:
similarities = cosine_similarity(sparse_pivot)
similarities

array([[1.        , 0.        , 0.        , ..., 0.32732684, 0.        ,
        0.        ],
       [0.        , 1.        , 0.70710678, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.70710678, 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.32732684, 0.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        1.        ]])
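
For intuition, cosine similarity is the dot product of two rating vectors divided by the product of their lengths. The sketch below computes it by hand for two made-up vectors and compares the result with `sklearn`; the vectors `a` and `b` are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances, cosine_similarity

# Hypothetical rating vectors for two movies across four users
a = np.array([4.0, 0.0, 5.0, 3.0])
b = np.array([5.0, 1.0, 0.0, 3.0])

# Cosine similarity: dot product divided by the product of the norms
manual_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

X = np.vstack([a, b])
sk_sim = cosine_similarity(X)[0, 1]
sk_dist = cosine_distances(X)[0, 1]

# All three printed values should agree up to floating-point error
print(manual_sim, sk_sim, 1.0 - sk_dist)
```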

## Create distances DataFrame
---

At this point, we essentially have a recommender. We'll load it into a `pandas` DataFrame for readability. 

You'll notice that each movie has a "distance" of 0 with itself (along the diagonal).

In [11]:
recommender_df = pd.DataFrame(dists, columns=pivot.index, index=pivot.index)
recommender_df.head()

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
'71 (2014),0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.858347,1.0,...,1.0,0.657945,0.456695,0.292893,1.0,1.0,0.860569,0.672673,1.0,1.0
'Hellboy': The Seeds of Creation (2004),1.0,0.0,0.292893,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
'Round Midnight (1986),1.0,0.292893,0.0,1.0,1.0,1.0,0.823223,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
'Salem's Lot (2004),1.0,1.0,1.0,0.0,0.142507,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
'Til There Was You (1997),1.0,1.0,1.0,0.142507,0.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


## Evaluate recommender performance
---

Now comes the fun part! Let's check out a few movies to see if the recommender aligns with our intuition. In the cell below we'll do the following:
1. Create a search term
2. Use that to find all titles matching the search query
3. For each title, we'll list off the following:
   1. The average rating
   2. The number of ratings
   3. The ten most similar movies

In [12]:
recommender_df['Godfather, The (1972)']  # Distances between The Godfather and every movie

title
'71 (2014)                                   0.917670
'Hellboy': The Seeds of Creation (2004)      1.000000
'Round Midnight (1986)                       1.000000
'Salem's Lot (2004)                          1.000000
'Til There Was You (1997)                    0.966113
                                               ...   
eXistenZ (1999)                              0.798106
xXx (2002)                                   0.758071
xXx: State of the Union (2005)               0.845493
¡Three Amigos! (1986)                        0.781896
À nous la liberté (Freedom for Us) (1931)    1.000000
Name: Godfather, The (1972), Length: 9719, dtype: float64

In [13]:
recommender_df['Godfather, The (1972)'].sort_values().iloc[1:11]  # 10 most similar movies based on user ratings (position 0 is the movie itself)

title
Godfather: Part II, The (1974)                           0.178227
Goodfellas (1990)                                        0.335159
One Flew Over the Cuckoo's Nest (1975)                   0.379464
Star Wars: Episode IV - A New Hope (1977)                0.404683
Fargo (1996)                                             0.411386
Star Wars: Episode V - The Empire Strikes Back (1980)    0.413970
Fight Club (1999)                                        0.418721
Reservoir Dogs (1992)                                    0.420941
Pulp Fiction (1994)                                      0.424730
American Beauty (1999)                                   0.424988
Name: Godfather, The (1972), dtype: float64
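
Skipping the first sorted entry assumes the movie itself always sorts first. That holds when no other movie sits at distance 0, but a tie could put a different title in that position. One possible alternative, sketched below on a toy distance matrix, drops the movie by label before picking the smallest distances. The `most_similar` helper and the toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy distance matrix; 'Movie A' and 'Movie B' tie at distance 0
titles = ['Movie A', 'Movie B', 'Movie C', 'Movie D']
toy_dists = pd.DataFrame(
    [[0.0, 0.0, 0.4, 0.9],
     [0.0, 0.0, 0.5, 0.8],
     [0.4, 0.5, 0.0, 0.3],
     [0.9, 0.8, 0.3, 0.0]],
    index=titles, columns=titles,
)

def most_similar(dist_df, title, n=10):
    """Return the n titles with the smallest distance to `title`, excluding itself."""
    return dist_df[title].drop(title).nsmallest(n)

print(most_similar(toy_dists, 'Movie A', n=2))
```

With the real data, the equivalent call would be `most_similar(recommender_df, 'Godfather, The (1972)')`.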

In [14]:
recommender_df.filter(like='Matrix', axis='index')

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
"Matrix Reloaded, The (2003)",0.883333,0.9125,0.938128,1.0,1.0,0.897916,0.840624,0.9125,0.705184,0.911087,...,0.95625,0.812314,0.800787,0.835008,1.0,0.786295,0.658392,0.758141,0.830678,1.0
"Matrix Revolutions, The (2003)",0.882482,1.0,1.0,1.0,1.0,0.865693,0.876486,0.916058,0.714626,0.858422,...,1.0,0.829159,0.889243,0.845675,1.0,0.75943,0.608498,0.776525,0.825728,1.0
"Matrix, The (1999)",0.930325,0.930325,0.950732,1.0,1.0,0.94426,0.828052,0.937293,0.796685,0.909482,...,0.97213,0.842406,0.894007,0.906391,1.0,0.727038,0.735027,0.864681,0.785112,1.0


In [15]:
def recommend_movie(search_term):
    # Iterate over the matching row labels (titles). Looping over the filtered
    # DataFrame itself would iterate over its columns, i.e. every movie.
    titles = recommender_df.filter(like=search_term, axis='index').index

    for title in titles:
        print(title)
        print('Average rating:', pivot.loc[title].mean())
        print('Number of ratings:', pivot.loc[title].count())
        print('\n10 most similar movies:')
        # Position 0 after sorting is expected to be the movie itself (distance 0)
        print(recommender_df[title].sort_values().iloc[1:11])
        print('*' * 50)
        print()

In [16]:
recommend_movie('Matrix')
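
One detail worth knowing about the search step: `filter(like=...)` does a case-sensitive substring match on the labels, so searching for `'matrix'` would not find `'Matrix, The (1999)'`. The sketch below shows that behaviour on a toy frame, along with a case-insensitive alternative based on the index's string accessor. The toy titles and the `mask` variable are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy frame indexed by title
toy_df = pd.DataFrame(
    {'score': [1, 2, 3]},
    index=['Matrix, The (1999)', 'matrix of tiny things', 'Toy Story (1995)'],
)

# filter(like=...) is a case-sensitive substring match on the labels
print(toy_df.filter(like='Matrix', axis='index'))

# A case-insensitive alternative using the string accessor on the index
mask = toy_df.index.str.contains('matrix', case=False, regex=False)
print(toy_df[mask])
```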