## 1st Class: Recommendation Heuristics

> Date: June 23, 2020

- Here we build a movie recommender that suggests movies to a user based on what they have already watched.
- Along the way we use the main Python libraries for machine learning: Pandas, NumPy, scikit-learn, Matplotlib and others.

In [22]:
import pandas as pd

movies = pd.read_csv("ml-latest-small/movies.csv")
ratings = pd.read_csv("ml-latest-small/ratings.csv")

### Translation to Portuguese

If you want, run the cell below to rename the columns to Brazilian Portuguese (pt-BR). In that case, uncomment the lines and adapt the rest of the notebook's code to the new column names.

#### Attention
Don't run the next cell if you want to keep the column names in English. In that case, leave the lines commented out, as they are.

In [23]:
# movies.columns  = ["filmeId", "titulo", "genero"]
# ratings.columns = ["usuarioId", "filmeId", "nota", "tempo"]

In [24]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [25]:
movies = movies.set_index("movieId")
movies.head()

movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,Father of the Bride Part II (1995),Comedy


In [29]:
ratings = ratings.set_index("movieId")
ratings.head()

movieId,userId,rating,timestamp
1,1,4.0,964982703
3,1,4.0,964981247
6,1,4.0,964982224
47,1,5.0,964983815
50,1,5.0,964982931


### First Attempt
- Since we don't yet know what kind of movie the user likes, we need to decide which films to suggest first.
- The ratings tell us which films received the most ratings, so we can use that count to suggest the most-watched films in our catalog.

In [30]:
qnt_votes = ratings.index.value_counts()
qnt_votes.head()

356     329
318     317
296     307
593     279
2571    278
Name: movieId, dtype: int64

In [31]:
print(movies.loc[356], "\n")
print(movies.loc[318])

title          Forrest Gump (1994)
genres    Comedy|Drama|Romance|War
Name: 356, dtype: object 

title     Shawshank Redemption, The (1994)
genres                         Crime|Drama
Name: 318, dtype: object


In [32]:
movies["qnt_votes"] = qnt_votes
movies.head()

movieId,title,genres,qnt_votes
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,215.0
2,Jumanji (1995),Adventure|Children|Fantasy,110.0
3,Grumpier Old Men (1995),Comedy|Romance,52.0
4,Waiting to Exhale (1995),Comedy|Drama|Romance,7.0
5,Father of the Bride Part II (1995),Comedy,49.0


In [39]:
movies.sort_values('qnt_votes', ascending = False).head(10)

movieId,title,genres,qnt_votes,rating
356,Forrest Gump (1994),Comedy|Drama|Romance|War,329.0,4.164134
318,"Shawshank Redemption, The (1994)",Crime|Drama,317.0,4.429022
296,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller,307.0,4.197068
593,"Silence of the Lambs, The (1991)",Crime|Horror|Thriller,279.0,4.16129
2571,"Matrix, The (1999)",Action|Sci-Fi|Thriller,278.0,4.192446
260,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Sci-Fi,251.0,4.231076
480,Jurassic Park (1993),Action|Adventure|Sci-Fi|Thriller,238.0,3.75
110,Braveheart (1995),Action|Drama|War,237.0,4.031646
589,Terminator 2: Judgment Day (1991),Action|Sci-Fi,224.0,3.970982
527,Schindler's List (1993),Drama|War,220.0,4.225
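
As a self-contained illustration of this popularity heuristic, here is a minimal sketch on made-up data. The frames `movies_demo` and `ratings_demo` and their contents are assumptions for the example; they are not the MovieLens files used above.

```python
import pandas as pd

# Hypothetical miniature stand-ins for movies.csv and ratings.csv.
movies_demo = pd.DataFrame(
    {"title": ["Movie A", "Movie B", "Movie C"]},
    index=pd.Index([10, 20, 30], name="movieId"),
)
ratings_demo = pd.DataFrame(
    {
        "userId": [1, 2, 3, 1, 2, 1],
        "movieId": [10, 10, 10, 20, 20, 30],
        "rating": [4.0, 5.0, 3.0, 2.0, 4.0, 5.0],
    }
)

# Count how many ratings each movie received.
votes_demo = ratings_demo["movieId"].value_counts()

# Assigning a Series to a column aligns on the index values (movieId).
movies_demo["qnt_votes"] = votes_demo

# Most-rated movies first.
most_rated = movies_demo.sort_values("qnt_votes", ascending=False)
print(most_rated)
```

A movie with no ratings at all would get `NaN` in `qnt_votes`, which is why the vote counts in the tables above show up as floats.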


### Second Attempt

#### Sort by rating

- Looking for a better way to suggest movies, we make a second attempt: sort by average rating.
- Here we need to pay attention to two things: the average rating and the number of votes. If the rating is 5.0 but the movie has a single vote, it probably isn't a movie that many people would like to see.

In [34]:
rates = ratings["rating"].groupby("movieId").mean()
rates.head()

movieId
1    3.920930
2    3.431818
3    3.259615
4    2.357143
5    3.071429
Name: rating, dtype: float64

In [35]:
movies["rating"] = rates
movies.sort_values('rating', ascending = False).head(10)

movieId,title,genres,qnt_votes,rating
88448,Paper Birds (Pájaros de papel) (2010),Comedy|Drama,1.0,5.0
100556,"Act of Killing, The (2012)",Documentary,1.0,5.0
143031,Jump In! (2007),Comedy|Drama|Romance,1.0,5.0
143511,Human (2015),Documentary,1.0,5.0
143559,L.A. Slasher (2015),Comedy|Crime|Fantasy,1.0,5.0
6201,Lady Jane (1986),Drama|Romance,1.0,5.0
102217,Bill Hicks: Revelations (1993),Comedy,1.0,5.0
102084,Justice League: Doom (2012),Action|Animation|Fantasy,1.0,5.0
6192,Open Hearts (Elsker dig for evigt) (2002),Romance,1.0,5.0
145994,Formula of Love (1984),Comedy,1.0,5.0


#### Filtering the data

In [45]:
movies015p = movies.query('qnt_votes > 0.15 * qnt_votes.max()').sort_values('rating', ascending = False)
movies015p.head(10)

movieId,title,genres,qnt_votes,rating
318,"Shawshank Redemption, The (1994)",Crime|Drama,317.0,4.429022
858,"Godfather, The (1972)",Crime|Drama,192.0,4.289062
2959,Fight Club (1999),Action|Crime|Drama|Thriller,218.0,4.272936
1276,Cool Hand Luke (1967),Drama,57.0,4.27193
750,Dr. Strangelove or: How I Learned to Stop Worr...,Comedy|War,97.0,4.268041
904,Rear Window (1954),Mystery|Thriller,84.0,4.261905
1221,"Godfather: Part II, The (1974)",Crime|Drama,129.0,4.25969
48516,"Departed, The (2006)",Crime|Drama|Thriller,107.0,4.252336
1213,Goodfellas (1990),Crime|Drama,126.0,4.25
912,Casablanca (1942),Drama|Romance,100.0,4.24


In [37]:
print("Minimum number of ratings: ", (0.15 * movies.qnt_votes.max()))

Minimum number of ratings:  49.35
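
To see the effect of the vote threshold in isolation, here is a small sketch on made-up ratings. The data and the names `ratings_demo`, `stats` and `filtered` are assumptions for illustration; only the 0.15 factor comes from the cell above.

```python
import pandas as pd

# Hypothetical ratings: movie 10 has 10 votes, movie 20 has 2,
# and movie 30 has a single perfect vote.
ratings_demo = pd.DataFrame(
    {
        "movieId": [10] * 10 + [20] * 2 + [30],
        "rating": [4.0] * 5 + [5.0] * 5 + [3.0, 4.0] + [5.0],
    }
)

# One row per movie: number of votes and mean rating.
stats = ratings_demo.groupby("movieId")["rating"].agg(
    qnt_votes="count", rating="mean"
)

# Keep only movies with more than 15% of the maximum vote count.
min_votes = 0.15 * stats["qnt_votes"].max()
filtered = stats[stats["qnt_votes"] > min_votes].sort_values(
    "rating", ascending=False
)
print(filtered)
```

Movie 30 has the highest mean rating but only one vote, so the threshold removes it and movie 10 ends up on top.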


## 2nd Class: Recommendation System

> Date: June 24, 2020

**Collaborative Filtering:** Filter using the users' ratings and comments

**Content Filtering:** Filter based on the content of the movies (genres, author, cast etc.)

Now we have a user who has watched some of the movies in the list. What can we do with this information?

In [41]:
watched_movies = [2571, 260, 480,  527, 1,  1196, 4993] # These are the movies that I watched haha ;D

movies.loc[watched_movies]

movieId,title,genres,qnt_votes,rating
2571,"Matrix, The (1999)",Action|Sci-Fi|Thriller,278.0,4.192446
260,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Sci-Fi,251.0,4.231076
480,Jurassic Park (1993),Action|Adventure|Sci-Fi|Thriller,238.0,3.75
527,Schindler's List (1993),Drama|War,220.0,4.225
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,215.0,3.92093
1196,Star Wars: Episode V - The Empire Strikes Back...,Action|Adventure|Sci-Fi,211.0,4.21564
4993,"Lord of the Rings: The Fellowship of the Ring,...",Adventure|Fantasy,198.0,4.106061


### 1st Attempt: Suggest movies of the same genre

If we take one of these movies and look at its genres, can't we use them to find similar movies that might interest the user?

Obs.: The argument `errors = 'ignore'` is used because some of the watched movies are not among the rows with `genres == "Action|Adventure|Sci-Fi"`. Without it, `drop` would raise a `KeyError` for the labels it cannot find.

In [54]:
act_adv_scifi_movies = movies015p.query('genres == "Action|Adventure|Sci-Fi"').drop(watched_movies, errors = 'ignore').sort_values('rating', ascending = False)
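
The query above only matches movies whose genre string is exactly `"Action|Adventure|Sci-Fi"`. A possible variation, sketched below, splits the string and keeps every movie that contains all the genres we care about. The rows are copied from the tables above; `watched`, `wanted` and the other names are assumptions for the example.

```python
import pandas as pd

# A few rows copied from the tables above, indexed by movieId.
movies_demo = pd.DataFrame(
    {
        "title": [
            "Toy Story (1995)",
            "Star Wars: Episode IV - A New Hope (1977)",
            "Jurassic Park (1993)",
            "Schindler's List (1993)",
            "Matrix, The (1999)",
        ],
        "genres": [
            "Adventure|Animation|Children|Comedy|Fantasy",
            "Action|Adventure|Sci-Fi",
            "Action|Adventure|Sci-Fi|Thriller",
            "Drama|War",
            "Action|Sci-Fi|Thriller",
        ],
        "rating": [3.920930, 4.231076, 3.75, 4.225, 4.192446],
    },
    index=pd.Index([1, 260, 480, 527, 2571], name="movieId"),
)

watched = [260, 527]           # hypothetical list of watched movieIds
wanted = {"Action", "Sci-Fi"}  # hypothetical genres of interest

# Turn "A|B|C" into a set and keep movies containing every wanted genre.
genre_sets = movies_demo["genres"].str.split("|").apply(set)
has_all = genre_sets.apply(wanted.issubset)

suggestions = (
    movies_demo[has_all]
    .drop(watched, errors="ignore")  # 527 was already filtered out by genre
    .sort_values("rating", ascending=False)
)
print(suggestions[["title", "rating"]])
```

Here movie 527 is in `watched` but not in the genre-filtered frame, which is exactly the situation `errors="ignore"` handles.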

## 3rd Class: Distance between users

> Date: June 24, 2020

Using the users' ratings, we can measure how close two users are: if both like a specific movie, we can recommend to one of them other films that the other likes.

**Example:** If I liked both X-Men and Avengers and you liked X-Men too, we can recommend Avengers to you.

-------------

**Norm:** The length of a vector. The norm of the difference between two points gives the distance between them.
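
As a quick sanity check of this idea, the sketch below measures the distance between two made-up points with `numpy.linalg.norm`. The points `a` and `b` are assumptions chosen so the result is easy to verify by hand.

```python
import numpy as np
from numpy.linalg import norm

# Two hypothetical points in the plane.
a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

# Euclidean distance = norm of the difference vector.
distance = norm(a - b)
print(distance)  # 5.0, the hypotenuse of a 3-4-5 right triangle
```

With ratings, each coordinate is one movie, so two users who rate the same movies similarly end up close to each other.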

In [59]:
# Take the ratings of one user, keeping movieId and rating
# Set movieId as the index
# Define a function that does all of this
# Join the data of the 2 users into one DataFrame
# Remove rows with NaN
# Compute the distance between the 2 users

In [161]:
ratings.head()

movieId,userId,rating,timestamp
1,1,4.0,964982703
3,1,4.0,964981247
6,1,4.0,964982224
47,1,5.0,964983815
50,1,5.0,964982931


In [162]:
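from numpy.linalg import norm

user = 'userId'
number1 = 1
number2 = 3
# Ratings of each user, indexed by movieId
user1_test = ratings.query('{0} == {1}'.format(user, number1))[['rating']]
user3_test = ratings.query('{0} == {1}'.format(user, number2))[['rating']]
# Keep only the movies rated by both users
result_df = user1_test.join(user3_test, lsuffix = '_user1', rsuffix = '_user3').dropna()
print(result_df)
# norm() over the whole table is the Frobenius norm of both rating columns together
norm(result_df)

         rating_user1  rating_user3
movieId                            
527               5.0           0.5
1275              5.0           3.5
1587              5.0           4.5
2018              5.0           0.5
2090              5.0           0.5
2105              4.0           2.0
3703              5.0           5.0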


15.107944929738128

In [160]:
from numpy.linalg import norm

def dist_between_users(data, column_users, column_ratings, user1, user2):
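    # Ratings of each user, indexed by movieId
    data_user1 = data.query('{0} == {1}'.format(column_users, user1))[[column_ratings]]
    data_user2 = data.query('{0} == {1}'.format(column_users, user2))[[column_ratings]]
    # Keep only the movies rated by both users
    data_users = data_user1.join(data_user2, lsuffix = user1, rsuffix = user2).dropna()
    # Frobenius norm of the joined table (both rating columns together)
    return norm(data_users)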

In [167]:
norm_1_4 = dist_between_users(ratings, 'userId', 'rating', 1, 4)
norm_1_3 = dist_between_users(ratings, 'userId', 'rating', 1, 3)
print(norm_1_4)
print(norm_1_3)

39.84971769034255
15.107944929738128
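
The values above are the norm of the whole joined table, so they grow with the number of shared movies even when two users agree. A possible variant, sketched here as an assumption rather than as the notebook's method, takes the norm of the difference between the two rating columns. The function name `euclidean_dist_between_users` is hypothetical; the demo ratings are the ones printed above for users 1 and 3, plus user 1's rating of movie 1.

```python
import pandas as pd
from numpy.linalg import norm

def euclidean_dist_between_users(data, column_users, column_ratings, user1, user2):
    """Hypothetical variant: norm of the rating differences on shared movies."""
    ratings1 = data.loc[data[column_users] == user1, column_ratings]
    ratings2 = data.loc[data[column_users] == user2, column_ratings]
    # Align on the index (movieId) and keep only movies both users rated.
    both = pd.concat([ratings1, ratings2], axis=1, keys=["a", "b"]).dropna()
    return norm(both["a"] - both["b"])

shared = [527, 1275, 1587, 2018, 2090, 2105, 3703]
demo = pd.DataFrame(
    {
        "userId": [1] * 8 + [3] * 7,
        "rating": [4.0, 5.0, 5.0, 5.0, 5.0, 5.0, 4.0, 5.0]
        + [0.5, 3.5, 4.5, 0.5, 0.5, 2.0, 5.0],
    },
    # Movie 1 was rated only by user 1, so dropna() removes it.
    index=pd.Index([1] + shared + shared, name="movieId"),
)

dist = euclidean_dist_between_users(demo, "userId", "rating", 1, 3)
print(dist)
```

On these seven shared movies the difference-based distance is about 8.2, compared with 15.1 for the norm of the whole table.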
