## 1st Class: Recommendation Heuristics

> Date: June 23, 2020

- Here we are going to build a movie recommender that suggests movies to a user based on what they have watched.
- Along the way, we will make heavy use of the main Python libraries for machine learning: Pandas, NumPy, scikit-learn, Matplotlib, and others.

In [1]:
import pandas as pd

movies = pd.read_csv("ml-latest-small/movies.csv")
ratings = pd.read_csv("ml-latest-small/ratings.csv")

### Translation to Portuguese

If you want, run the cell below to rename the columns to Brazilian Portuguese (pt-BR). In that case, you will need to adapt the rest of the code to the new column names and uncomment the lines in this notebook.

#### Attention
Don't run the next cell if you want to keep using the English column names. In that case, leave the lines commented out, as they are.

In [2]:
# movies.columns  = ["filmeId", "titulo", "genero"]
# ratings.columns = ["usuarioId", "filmeId", "nota", "tempo"]

In [3]:
movies.head()

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [4]:
movies = movies.set_index("movieId")
print(len(movies))
movies.head()

9742


movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,Father of the Bride Part II (1995),Comedy


In [117]:
ratings = ratings.drop(columns = 'timestamp')
ratings.head()

,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0


### First Attempt
- Since we don't yet know what kind of movie the user likes, we need to decide which films to suggest first.
- The ratings data tells us which films have the largest number of ratings. We can use this to suggest the most-watched films in our catalog.

In [118]:
qnt_votes = ratings['movieId'].value_counts()
qnt_votes.head()

356     329
318     317
296     307
593     279
2571    278
Name: movieId, dtype: int64

In [119]:
print(movies.loc[356], "\n")
print(movies.loc[318])

title             Forrest Gump (1994)
genres       Comedy|Drama|Romance|War
qnt_votes                         329
rating                        4.16413
Name: 356, dtype: object 

title        Shawshank Redemption, The (1994)
genres                            Crime|Drama
qnt_votes                                 317
rating                                4.42902
Name: 318, dtype: object


In [120]:
movies["qnt_votes"] = qnt_votes
movies.head()

movieId,title,genres,qnt_votes,rating
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,215.0,3.92093
2,Jumanji (1995),Adventure|Children|Fantasy,110.0,3.431818
3,Grumpier Old Men (1995),Comedy|Romance,52.0,3.259615
4,Waiting to Exhale (1995),Comedy|Drama|Romance,7.0,2.357143
5,Father of the Bride Part II (1995),Comedy,49.0,3.071429


In [121]:
movies.sort_values('qnt_votes', ascending = False).head(20)

movieId,title,genres,qnt_votes,rating
356,Forrest Gump (1994),Comedy|Drama|Romance|War,329.0,4.164134
318,"Shawshank Redemption, The (1994)",Crime|Drama,317.0,4.429022
296,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller,307.0,4.197068
593,"Silence of the Lambs, The (1991)",Crime|Horror|Thriller,279.0,4.16129
2571,"Matrix, The (1999)",Action|Sci-Fi|Thriller,278.0,4.192446
260,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Sci-Fi,251.0,4.231076
480,Jurassic Park (1993),Action|Adventure|Sci-Fi|Thriller,238.0,3.75
110,Braveheart (1995),Action|Drama|War,237.0,4.031646
589,Terminator 2: Judgment Day (1991),Action|Sci-Fi,224.0,3.970982
527,Schindler's List (1993),Drama|War,220.0,4.225


### Second Attempt

#### Sort by rating

- Looking for a better way to suggest movies, we can make a second attempt: sort by rating.
- Here we need to pay attention to two things: the rating and the number of votes. If the rating is 5.0 but it comes from a single vote, that is weak evidence that many people would enjoy the movie.
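To make the point concrete, here is a minimal sketch on made-up data. The `toy_ratings` frame, its values, and the `min_votes` threshold are hypothetical and are not taken from the MovieLens files. It shows how a single 5.0 vote outranks a well-rated popular movie when sorting by mean alone.

```python
import pandas as pd

# Hypothetical toy data: movie 10 has four ratings, movie 20 has a single 5.0
toy_ratings = pd.DataFrame({
    "movieId": [10, 10, 10, 10, 20],
    "rating":  [4.5, 4.0, 5.0, 4.5, 5.0],
})

# Mean rating and number of votes per movie, computed in one pass
summary = toy_ratings.groupby("movieId")["rating"].agg(["mean", "count"])

# Sorting by mean alone puts the single-vote movie first
by_mean = summary.sort_values("mean", ascending=False)

# Requiring a minimum number of votes removes it from the ranking
min_votes = 3  # arbitrary threshold chosen for this toy example
filtered = by_mean[by_mean["count"] >= min_votes]

print(by_mean)
print(filtered)
```

The same idea is applied to the real data in the cells that follow, with the threshold expressed as a fraction of the largest vote count.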

In [122]:
rates = ratings.groupby('movieId')['rating'].mean()
rates.head()

movieId
1    3.920930
2    3.431818
3    3.259615
4    2.357143
5    3.071429
Name: rating, dtype: float64

In [123]:
movies["rating"] = rates
movies.sort_values('rating', ascending = False).head(10)

movieId,title,genres,qnt_votes,rating
88448,Paper Birds (Pájaros de papel) (2010),Comedy|Drama,1.0,5.0
100556,"Act of Killing, The (2012)",Documentary,1.0,5.0
143031,Jump In! (2007),Comedy|Drama|Romance,1.0,5.0
143511,Human (2015),Documentary,1.0,5.0
143559,L.A. Slasher (2015),Comedy|Crime|Fantasy,1.0,5.0
6201,Lady Jane (1986),Drama|Romance,1.0,5.0
102217,Bill Hicks: Revelations (1993),Comedy,1.0,5.0
102084,Justice League: Doom (2012),Action|Animation|Fantasy,1.0,5.0
6192,Open Hearts (Elsker dig for evigt) (2002),Romance,1.0,5.0
145994,Formula of Love (1984),Comedy,1.0,5.0


#### Filtering the data

In [124]:
movies015p = movies.query('qnt_votes > 0.15 * qnt_votes.max()').sort_values('rating', ascending = False)
movies015p.head(10)

movieId,title,genres,qnt_votes,rating
318,"Shawshank Redemption, The (1994)",Crime|Drama,317.0,4.429022
858,"Godfather, The (1972)",Crime|Drama,192.0,4.289062
2959,Fight Club (1999),Action|Crime|Drama|Thriller,218.0,4.272936
1276,Cool Hand Luke (1967),Drama,57.0,4.27193
750,Dr. Strangelove or: How I Learned to Stop Worr...,Comedy|War,97.0,4.268041
904,Rear Window (1954),Mystery|Thriller,84.0,4.261905
1221,"Godfather: Part II, The (1974)",Crime|Drama,129.0,4.25969
48516,"Departed, The (2006)",Crime|Drama|Thriller,107.0,4.252336
1213,Goodfellas (1990),Crime|Drama,126.0,4.25
912,Casablanca (1942),Drama|Romance,100.0,4.24


In [125]:
print("Minimum number of ratings: ", (0.15 * movies.qnt_votes.max()))

Minimum number of ratings:  49.35


## 2nd Class: Recommendation System

> Date: June 24, 2020

**Collaborative Filtering:** Filter using other users' ratings and comments

**Content Filtering:** Filter based on the content of the movies (genres, author, cast, etc.)

Now we have a user who has watched some of the movies on the list. What can we do with this information?

In [126]:
watched_movies = [2571, 260, 480,  527, 1,  1196, 4993] # These are the movies that I watched haha ;D

movies.loc[watched_movies]

movieId,title,genres,qnt_votes,rating
2571,"Matrix, The (1999)",Action|Sci-Fi|Thriller,278.0,4.192446
260,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Sci-Fi,251.0,4.231076
480,Jurassic Park (1993),Action|Adventure|Sci-Fi|Thriller,238.0,3.75
527,Schindler's List (1993),Drama|War,220.0,4.225
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,215.0,3.92093
1196,Star Wars: Episode V - The Empire Strikes Back...,Action|Adventure|Sci-Fi,211.0,4.21564
4993,"Lord of the Rings: The Fellowship of the Ring,...",Adventure|Fantasy,198.0,4.106061


### 1st Attempt: Suggest movies of the same genre

If we take one of these movies and look at its genres, can we use them to find similar movies that people might find interesting?

Note: The argument `errors = 'ignore'` is used because some of the watched movies are not among the rows with `genres == "Action|Adventure|Sci-Fi"`. Without it, `drop` would raise a `KeyError` for the labels it cannot find.

In [127]:
act_adv_scifi_movies = movies015p.query('genres == "Action|Adventure|Sci-Fi"').drop(watched_movies, errors = 'ignore').sort_values('rating', ascending = False)
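The query above only keeps movies whose `genres` string is exactly `"Action|Adventure|Sci-Fi"`, so a movie tagged `Action|Adventure|Sci-Fi|Thriller` is left out. As a hedged sketch of a looser match, the snippet below keeps every movie that contains all of the wanted genres. The `toy_movies` frame, its titles, and the `wanted` set are hypothetical; only the `A|B|C` genre format comes from the dataset.

```python
import pandas as pd

# Hypothetical toy catalog using the same "A|B|C" genre format as movies.csv
toy_movies = pd.DataFrame(
    {
        "title": ["Movie A", "Movie B", "Movie C"],
        "genres": [
            "Action|Adventure|Sci-Fi",
            "Action|Adventure|Sci-Fi|Thriller",
            "Comedy|Romance",
        ],
    },
    index=pd.Index([1, 2, 3], name="movieId"),
)

wanted = {"Action", "Adventure", "Sci-Fi"}

# Split each genre string into a set of genre names
genre_sets = toy_movies["genres"].str.split("|").apply(set)

# Keep the movies whose genre set contains every wanted genre
has_all = genre_sets.apply(wanted.issubset)
similar = toy_movies[has_all]

print(similar)
```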

## 3rd Class: Distance between users

> Date: June 24, 2020

Using the users' ratings, we can measure how similar two users are: if both like the same movies, we can recommend to one of them the films that the other likes.

**Example:** If I liked both X-Men and Avengers and you liked X-Men too, we can recommend Avengers to you.

-------------

**Norm:** Computes the distance between two points (here, the rating vectors of two users)
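As a sketch of what the norm computes in this setting (the symbols are introduced here for illustration only): let $r_{u,i}$ be the rating that user $u$ gave to movie $i$, and let $I_{uv}$ be the set of movies rated by both $u$ and $v$. The distance between the two users is then the Euclidean norm of the difference of their ratings over the common movies:

$$
d(u, v) = \sqrt{\sum_{i \in I_{uv}} \left(r_{u,i} - r_{v,i}\right)^2}
$$

For example, if two users have two movies in common, rated $(5, 3)$ by one and $(4, 1)$ by the other, then $d = \sqrt{(5-4)^2 + (3-1)^2} = \sqrt{5} \approx 2.236$.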

In [128]:
# Take the ratings of one user, with movieId and rating
# Set movieId as the index
# Define a function that does all of these steps
# Join the data of 2 users into one DataFrame
# Remove rows with NaN
# Compute the distance between the 2 users

In [129]:
ratings[300:400]

,userId,movieId,rating
300,4,21,3.0
301,4,32,2.0
302,4,45,3.0
303,4,47,2.0
304,4,52,3.0
...,...,...,...
395,4,1391,1.0
396,4,1449,5.0
397,4,1466,4.0
398,4,1500,4.0


In [423]:
import numpy as np
from numpy.linalg import norm

user = 'userId'
number1 = 77
number2 = 511

user1_test = ratings.query('{0} == {1}'.format(user, number1))[['rating', 'movieId']].set_index('movieId')
user2_test = ratings.query('{0} == {1}'.format(user, number2))[['rating', 'movieId']].set_index('movieId')

user1_test.join(user2_test, lsuffix = number1, rsuffix = number2).dropna().index

Int64Index([1198, 2571, 4973, 4993, 5349, 7153, 79132, 109487], dtype='int64', name='movieId')

In [131]:
from numpy.linalg import norm

def dist_between_users(data, column_users, column_ratings, column_id, user1, user2, min_data = 5):
    # Ratings of each user, indexed by movie
    data_user1 = data.query('{0} == {1}'.format(column_users, user1))[[column_ratings, column_id]].set_index([column_id])
    data_user2 = data.query('{0} == {1}'.format(column_users, user2))[[column_ratings, column_id]].set_index([column_id])
    # Keep only the movies rated by both users
    data_users = data_user1.join(data_user2, lsuffix = user1, rsuffix = user2).dropna()
    # Too few movies in common to compare the two users
    if(len(data_users) < min_data):
        return None
    # Euclidean distance between the two rating vectors
    return norm(data_users[column_ratings + '%d' % user1] - data_users[column_ratings + '%d' % user2])
    

In [132]:
dist_between_users(ratings, 'userId', 'rating', 'movieId', 1, 77)

0.0

In [133]:
norm_1_4 = dist_between_users(ratings, 'userId', 'rating', 'movieId', 1, 4)
norm_1_3 = dist_between_users(ratings, 'userId', 'rating', 'movieId', 1, 3)
print(norm_1_4)
print(norm_1_3)
norm_1_1 = dist_between_users(ratings, 'userId', 'rating','movieId', 1, 1)
print(norm_1_1)

11.135528725660043
8.200609733428363
0.0
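
The same computation can be sketched on toy data, which makes the result easy to check by hand. This is an illustrative variant, not the notebook's function: `toy_ratings`, `toy_distance`, and `min_common` are hypothetical names, and an inner `merge` is used instead of `join` plus `dropna`.

```python
import pandas as pd
from numpy.linalg import norm

# Hypothetical toy data: users 1 and 2 have movies 10 and 20 in common
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 2],
    "movieId": [10, 20, 30, 10, 20, 40],
    "rating":  [5.0, 3.0, 4.0, 4.0, 1.0, 2.0],
})

def toy_distance(data, user1, user2, min_common=2):
    """Euclidean distance over the movies both users rated, or None if too few."""
    a = data[data["userId"] == user1][["movieId", "rating"]]
    b = data[data["userId"] == user2][["movieId", "rating"]]
    # An inner merge keeps only the movies rated by both users
    common = a.merge(b, on="movieId", suffixes=("_a", "_b"))
    if len(common) < min_common:
        return None
    return norm(common["rating_a"] - common["rating_b"])

# Differences on the common movies are (1, 2), so the distance is sqrt(5)
print(toy_distance(toy_ratings, 1, 2))
# With a stricter minimum, the pair is considered not comparable
print(toy_distance(toy_ratings, 1, 2, min_common=3))
```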


In [134]:
from math import sqrt

def pitagoras(num1, num2):
    return (sqrt((pow(num1, 2) + pow(num2, 2))))


print(pitagoras(4, 5))
print(norm([4, 5]))

6.4031242374328485
6.4031242374328485


In [135]:
# Show the number of users
# Show the array of users (each user only once)
# Loop over each user
# Call the function to compute the distance between one user and all the others
# Build a DataFrame with 3 columns: users, other_users, distance

In [136]:
ratings.userId.unique()[1:20]

array([ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
       19, 20])

In [137]:
qnt_users = len(ratings.userId.unique())
print(qnt_users)

610


## 4th Class: Nearest Users

> Date: June 25, 2020

Now that we have a function that returns the distance between two users, we can build a new
 function on top of it that gives us the distance between one given user and all the others.
 
 ----------------
 
- `loc` selects rows by index label
- `iloc` selects rows by integer position
- `dropna` removes rows with NaN (not a number) values from the DataFrame
- `drop` removes columns (using the parameter `columns = <column>`) or rows (by index label)
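
A short sketch of these four methods on a toy frame (the `toy` DataFrame and its values are hypothetical and only serve to illustrate the calls):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame indexed by movieId, with one missing rating
toy = pd.DataFrame(
    {"rating": [4.0, np.nan, 3.5], "votes": [10, 2, 7]},
    index=pd.Index([101, 205, 309], name="movieId"),
)

by_label = toy.loc[205]               # row whose index label is 205
by_position = toy.iloc[0]             # first row, whatever its label is
no_nan = toy.dropna()                 # rows containing NaN are removed
no_votes = toy.drop(columns="votes")  # remove a column
no_row = toy.drop(309)                # remove a row by index label

print(no_nan)
print(no_row)
```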

In [483]:
def distance_of_others(data, column_users, column_ratings, columnId, main_user, k_nearest_users = 1, k_users = None):
    # Compare against every user in the data, or only the first k_users when given
    users = data[column_users].unique()
    if(k_users):
        users = users[:k_users]
    # Use the `data` argument here rather than the global `ratings`
    dist_user = [[main_user, int(user), dist_between_users(data, column_users, column_ratings, columnId, main_user, int(user))] for user in users]

    dist_user_df = pd.DataFrame(data = dist_user, columns = ['main_user', 'other_user', 'distance'])
    # Pairs with too few movies in common have distance None and are dropped
    dist_user_df = dist_user_df.dropna()
    # Remove the main user itself; errors = 'ignore' covers the case where it is absent
    dist_user_df = dist_user_df.sort_values('distance').set_index('other_user').drop(main_user, errors = 'ignore')
    return dist_user_df.iloc[:k_nearest_users]

In [484]:
dist_user_1 = distance_of_others(ratings, "userId", "rating", 'movieId', 1, k_nearest_users = 10)

In [485]:
dist_user_1

other_user,main_user,distance
77,1,0.0
511,1,0.5
366,1,0.707107
258,1,1.0
9,1,1.0
49,1,1.0
523,1,1.0
319,1,1.118034
398,1,1.224745
65,1,1.322876


In [486]:
# Limit the number of elements returned by distance_of_others
# Remove users that have no movies in common with the main user
# Use a filter to drop those users
# Use iloc to refer to a row of the DataFrame
# Find the films that the most similar user saw
# Remove the films that the main user has already seen
# Show the remaining films (the next recommendations) in a DataFrame joined with movies, which adds the information about the selected films

In [487]:
def user_ratings(user, data, users_column, movies_column): 
    return data.query('{0} == {1}'.format(users_column, user)).set_index(movies_column)

In [488]:
user_ratings(77, ratings, 'userId', 'movieId')

movieId,userId,rating
260,77,5.0
1196,77,5.0
1198,77,5.0
1210,77,5.0
2571,77,5.0
3578,77,5.0
3948,77,3.0
3996,77,5.0
4226,77,2.5
4878,77,1.0


In [489]:
user_ratings(77, ratings, 'userId', 'movieId').join(user_ratings(1, ratings, 'userId', 'movieId'), lsuffix = 77, rsuffix = 1).dropna()

movieId,userId77,rating77,userId1,rating1
260,77,5.0,1.0,5.0
1196,77,5.0,1.0,5.0
1198,77,5.0,1.0,5.0
1210,77,5.0,1.0,5.0
2571,77,5.0,1.0,5.0
3578,77,5.0,1.0,5.0


## 5th Class: KNN - K Nearest Neighbors

> Date: June 26, 2020

Now, to finish the process of using distances, ratings, and number of votes to filter and refine our dataset and find a better way to suggest movies to a user, we will use the KNN algorithm. This method uses the distance between users and looks at the nearest "neighbors" to decide what someone would like to watch. Basically, it tries to build a profile from other users whose ratings are equal or similar to those of a specific user.
 
 ----------------
 
- `sort_values` can sort by several columns at once
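
A minimal sketch of sorting by more than one column, on a hypothetical `toy` frame whose column names mirror the ones produced later by `knn`: ties in the first column are broken by the second.

```python
import pandas as pd

# Hypothetical toy frame: movies 1 and 3 tie on rating_user
toy = pd.DataFrame(
    {"rating_user": [5.0, 4.0, 5.0], "rating_mean": [4.1, 4.8, 4.4]},
    index=pd.Index([1, 2, 3], name="movieId"),
)

# Sort by rating_user first, then by rating_mean, both descending
ranked = toy.sort_values(["rating_user", "rating_mean"], ascending=False)

# A list passed to `ascending` sets the direction for each column separately
ranked_mixed = toy.sort_values(["rating_user", "rating_mean"], ascending=[False, True])

print(ranked)
print(ranked_mixed)
```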

In [490]:
def knn(user, data, users_column, ratings_column, movies_column, k_nearest_users = 1, k_users = None):
    nearest_users_dist = distance_of_others(data, users_column, ratings_column, movies_column, user, k_nearest_users = k_nearest_users, k_users = k_users)
    user_movies = user_ratings(user, data, users_column, movies_column)
    nearest_users_movies = data.set_index(users_column).loc[nearest_users_dist.index]

    suggested_movies_drop = nearest_users_movies.drop(user_movies.index, errors = 'ignore')
    suggested_movies_idx = suggested_movies_drop.set_index(movies_column)
    
    suggested_movies_mean = suggested_movies_idx.groupby(movies_column).mean()[[ratings_column]]
    suggested_movies_join = suggested_movies_mean.join(movies, lsuffix = '_user', rsuffix = '_mean')
    suggested_movies = suggested_movies_join.sort_values(['%s_user' % ratings_column, '%s_mean' % ratings_column], ascending = False)
    filtered_suggested_movies = suggested_movies.query('qnt_votes > (qnt_votes.max()*0.15)')
    return filtered_suggested_movies


In [491]:
# Take the 10 nearest users
# Print the array with their ratings
# Take the mean of these ratings (grouped by movie)
# Sort by these means
# Join with movies

In [496]:
suggestions_user1_2 = knn(1, ratings, 'userId', 'rating', 'movieId', k_nearest_users = 2) # 1198, 2571, 4973, 4993, 5349, 7153, 79132, 109487
suggestions_user1_2.head(10)

movieId,rating_user,title,genres,qnt_votes,rating_mean
318,5.0,"Shawshank Redemption, The (1994)",Crime|Drama,317.0,4.429022
2959,5.0,Fight Club (1999),Action|Crime|Drama|Thriller,218.0,4.272936
58559,5.0,"Dark Knight, The (2008)",Action|Crime|Drama|IMAX,149.0,4.238255
260,5.0,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Sci-Fi,251.0,4.231076
527,5.0,Schindler's List (1993),Drama|War,220.0,4.225
1196,5.0,Star Wars: Episode V - The Empire Strikes Back...,Action|Adventure|Sci-Fi,211.0,4.21564
1198,5.0,Raiders of the Lost Ark (Indiana Jones and the...,Action|Adventure,200.0,4.2075
2571,5.0,"Matrix, The (1999)",Action|Sci-Fi|Thriller,278.0,4.192446
1210,5.0,Star Wars: Episode VI - Return of the Jedi (1983),Action|Adventure|Sci-Fi,196.0,4.137755
7153,5.0,"Lord of the Rings: The Return of the King, The...",Action|Adventure|Drama|Fantasy,185.0,4.118919


In [495]:
suggestions_user1_15 = knn(1, ratings, 'userId', 'rating', 'movieId', k_nearest_users = 15)
suggestions_user1_15.head(10)

movieId,rating_user,title,genres,qnt_votes,rating_mean
858,5.0,"Godfather, The (1972)",Crime|Drama,192.0,4.289062
1221,5.0,"Godfather: Part II, The (1974)",Crime|Drama,129.0,4.25969
1197,5.0,"Princess Bride, The (1987)",Action|Adventure|Comedy|Fantasy|Romance,142.0,4.232394
260,5.0,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Sci-Fi,251.0,4.231076
1196,5.0,Star Wars: Episode V - The Empire Strikes Back...,Action|Adventure|Sci-Fi,211.0,4.21564
1089,5.0,Reservoir Dogs (1992),Crime|Mystery|Thriller,131.0,4.20229
1136,5.0,Monty Python and the Holy Grail (1975),Adventure|Comedy|Fantasy,136.0,4.161765
4011,5.0,Snatch (2000),Comedy|Crime|Thriller,93.0,4.155914
5618,5.0,Spirited Away (Sen to Chihiro no kamikakushi) ...,Adventure|Animation|Fantasy,87.0,4.155172
1210,5.0,Star Wars: Episode VI - Return of the Jedi (1983),Action|Adventure|Sci-Fi,196.0,4.137755
