## 1st Class: Recommendation Heuristics

> Date: June 23, 2020

- Here we are going to build a movie recommender that suggests movies to users based on what they have watched.
- Along the way, we will make heavy use of the main Python libraries for machine learning: Pandas, NumPy, scikit-learn, Matplotlib, and others.

In [1]:
import pandas as pd

# Load the movies and ratings tables
movies = pd.read_csv("ml-latest-small/movies.csv")
ratings = pd.read_csv("ml-latest-small/ratings.csv")

### Translation to Portuguese

If you want, run the cell below to rename the columns to Brazilian Portuguese (pt-BR). If you do, you will need to adapt the rest of the code to use the Portuguese column names.

#### Attention
Don't run the next cell if you want to keep using the English column names.

In [None]:
movies.columns  = ["filmeId", "titulo", "genero"]
ratings.columns = ["usuarioId", "filmeId", "nota", "tempo"]

In [2]:
movies.head()

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [None]:
movies = movies.set_index("movieId")
movies.head()

In [6]:
ratings = ratings.set_index("movieId")  # index by movieId, as the output below shows
ratings.head()

movieId,userId,rating,timestamp
1,1,4.0,964982703
3,1,4.0,964981247
6,1,4.0,964982224
47,1,5.0,964983815
50,1,5.0,964982931


### First Attempt
- Since we don't yet know what kind of movies the user likes, we need to decide which films to suggest first.
- The ratings data tells us which films have received the most ratings. We can use this to suggest the most-watched films in our catalog.

In [26]:
qnt_votes = ratings.index.value_counts()
qnt_votes.head()

356     329
318     317
296     307
593     279
2571    278
Name: movieId, dtype: int64

In [16]:
print(movies.loc[356], "\n")
print(movies.loc[318])

title          Forrest Gump (1994)
genres    Comedy|Drama|Romance|War
Name: 356, dtype: object 

title     Shawshank Redemption, The (1994)
genres                         Crime|Drama
Name: 318, dtype: object


In [19]:
movies["qnt_votes"] = qnt_votes
movies.head()

movieId,title,genres,qnt_votes
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,215.0
2,Jumanji (1995),Adventure|Children|Fantasy,110.0
3,Grumpier Old Men (1995),Comedy|Romance,52.0
4,Waiting to Exhale (1995),Comedy|Drama|Romance,7.0
5,Father of the Bride Part II (1995),Comedy,49.0
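
The counts show up as floats (for example `215.0`) because assigning a Series to a column aligns on the index: any movie with no entry in the Series gets `NaN`, which forces a float dtype. The sketch below illustrates this with made-up toy data; the names `toy_movies`, `toy_votes`, and `qnt_votes_int` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy catalog indexed by movieId (made-up data, not from ml-latest-small)
toy_movies = pd.DataFrame(
    {"title": ["Movie A", "Movie B", "Movie C"]},
    index=pd.Index([1, 2, 3], name="movieId"),
)

# Vote counts exist only for movies 1 and 2
toy_votes = pd.Series([10, 4], index=pd.Index([1, 2], name="movieId"))

# Assignment aligns on the index: movie 3 has no count and becomes NaN,
# so the whole column is stored as float
toy_movies["qnt_votes"] = toy_votes

# Optional: treat missing counts as zero and restore an integer dtype
toy_movies["qnt_votes_int"] = toy_movies["qnt_votes"].fillna(0).astype(int)
print(toy_movies)
```

Whether to fill missing counts with zero is a design choice; leaving them as `NaN` keeps unrated movies out of any later comparison on `qnt_votes`.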


In [24]:
movies.sort_values('qnt_votes', ascending = False).head()

movieId,title,genres,qnt_votes
356,Forrest Gump (1994),Comedy|Drama|Romance|War,329.0
318,"Shawshank Redemption, The (1994)",Crime|Drama,317.0
296,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller,307.0
593,"Silence of the Lambs, The (1991)",Crime|Horror|Thriller,279.0
2571,"Matrix, The (1999)",Action|Sci-Fi|Thriller,278.0


### Second Attempt

#### Sort by rating

- Looking for a better way to suggest movies, we can make a second attempt: sort by average rating.
- Here we need to pay attention to two things: the average rating and the number of votes. If a movie has an average of 5.0 but only one vote, that average is not reliable evidence that many people would like it.
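
To see why the number of votes matters, the mean and the count can be computed side by side. This is a minimal sketch on made-up data (`toy_ratings` and `summary` are hypothetical names), not the notebook's own cell:

```python
import pandas as pd

# Made-up ratings: movie 10 has a single 5.0, movie 20 has four good ratings
toy_ratings = pd.DataFrame(
    {
        "movieId": [10, 20, 20, 20, 20],
        "rating": [5.0, 4.5, 4.0, 4.5, 5.0],
    }
)

# Mean rating and number of votes per movie, in one pass
summary = toy_ratings.groupby("movieId")["rating"].agg(["mean", "count"])

# Sorting by the mean alone puts the single-vote movie first
print(summary.sort_values("mean", ascending=False))
```

In this toy case the movie with one vote outranks the movie with four votes, which is the situation a minimum vote count is meant to prevent.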

In [33]:
rates = ratings.groupby("movieId")["rating"].mean()
rates.head()

movieId
1    3.920930
2    3.431818
3    3.259615
4    2.357143
5    3.071429
Name: rating, dtype: float64

In [36]:
movies["rating"] = rates
movies.sort_values('rating', ascending = False).head(10)

movieId,title,genres,qnt_votes,rating
88448,Paper Birds (Pájaros de papel) (2010),Comedy|Drama,1.0,5.0
100556,"Act of Killing, The (2012)",Documentary,1.0,5.0
143031,Jump In! (2007),Comedy|Drama|Romance,1.0,5.0
143511,Human (2015),Documentary,1.0,5.0
143559,L.A. Slasher (2015),Comedy|Crime|Fantasy,1.0,5.0
6201,Lady Jane (1986),Drama|Romance,1.0,5.0
102217,Bill Hicks: Revelations (1993),Comedy,1.0,5.0
102084,Justice League: Doom (2012),Action|Animation|Fantasy,1.0,5.0
6192,Open Hearts (Elsker dig for evigt) (2002),Romance,1.0,5.0
145994,Formula of Love (1984),Comedy,1.0,5.0


#### Filtering the data

In [47]:
movies.query('qnt_votes > 0.15 * qnt_votes.max()').sort_values('rating', ascending = False).head(10)

movieId,title,genres,qnt_votes,rating
318,"Shawshank Redemption, The (1994)",Crime|Drama,317.0,4.429022
858,"Godfather, The (1972)",Crime|Drama,192.0,4.289062
2959,Fight Club (1999),Action|Crime|Drama|Thriller,218.0,4.272936
1276,Cool Hand Luke (1967),Drama,57.0,4.27193
750,Dr. Strangelove or: How I Learned to Stop Worr...,Comedy|War,97.0,4.268041
904,Rear Window (1954),Mystery|Thriller,84.0,4.261905
1221,"Godfather: Part II, The (1974)",Crime|Drama,129.0,4.25969
48516,"Departed, The (2006)",Crime|Drama|Thriller,107.0,4.252336
1213,Goodfellas (1990),Crime|Drama,126.0,4.25
912,Casablanca (1942),Drama|Romance,100.0,4.24


In [48]:
print("Minimum number of ratings: ", 0.15 * movies.qnt_votes.max())

Minimum number of ratings:  49.35
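
The filter keeps only movies whose vote count exceeds 15% of the largest vote count. With a maximum of 329 votes, the cutoff is 0.15 × 329 = 49.35, so a movie needs at least 50 votes to be ranked. The same idea can be wrapped in a small function. This is a sketch only: `top_rated`, `min_fraction`, and the toy data are hypothetical, and it assumes a frame with `qnt_votes` and `rating` columns like the one built above.

```python
import pandas as pd


def top_rated(movies, min_fraction=0.15, n=10):
    """Return the n best-rated movies that have enough votes.

    Hypothetical helper: keeps movies whose vote count exceeds
    min_fraction times the largest vote count, then sorts by rating.
    """
    threshold = min_fraction * movies["qnt_votes"].max()
    popular = movies[movies["qnt_votes"] > threshold]
    return popular.sort_values("rating", ascending=False).head(n)


# Made-up data using the same column names as the notebook
toy = pd.DataFrame(
    {
        "title": ["A", "B", "C"],
        "qnt_votes": [1.0, 100.0, 40.0],
        "rating": [5.0, 4.2, 4.4],
    },
    index=pd.Index([1, 2, 3], name="movieId"),
)

result = top_rated(toy)
print(result)
```

The 15% fraction is a tunable choice: a higher value favors well-known titles, while a lower one lets less-voted movies into the ranking.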
