## 1st Class: Recommendation Heuristics

> Date: June 23, 2020

- Here we are going to build a movie recommender that suggests movies to users based on what they have watched.
- Along the way, we will use the main Python libraries for machine learning: Pandas, NumPy, scikit-learn, Matplotlib and others.

In [1]:
import pandas as pd

movies = pd.read_csv("ml-latest-small/movies.csv")
ratings = pd.read_csv("ml-latest-small/ratings.csv")

### Translation to Portuguese

If you want, run the cell below to rename the columns to Brazilian Portuguese (pt-BR). In that case, uncomment the lines and adapt the rest of the code in this notebook to the new column names.

#### Attention
Don't run the next cell if you want to keep using the column names in English. In that case, leave the lines commented out, as they are.

In [2]:
# movies.columns  = ["filmeId", "titulo", "genero"]
# ratings.columns = ["usuarioId", "filmeId", "nota", "tempo"]

In [3]:
movies.head()

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [4]:
movies = movies.set_index("movieId")
print(len(movies))
movies.head()

9742


movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,Father of the Bride Part II (1995),Comedy


In [5]:
ratings = ratings.drop(columns = 'timestamp')
ratings.head()

,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0


### First Attempt
- Since we don't know yet what kind of movie the user likes, we need to decide which movies to suggest first.
- The dataset tells us which movies have the most ratings. We can use that to suggest the most watched movies in our catalog.

In [6]:
qnt_votes = ratings['movieId'].value_counts()
qnt_votes.head()

356     329
318     317
296     307
593     279
2571    278
Name: movieId, dtype: int64

In [7]:
print(movies.loc[356], "\n")
print(movies.loc[318])

title          Forrest Gump (1994)
genres    Comedy|Drama|Romance|War
Name: 356, dtype: object 

title     Shawshank Redemption, The (1994)
genres                         Crime|Drama
Name: 318, dtype: object


In [8]:
movies["qnt_votes"] = qnt_votes
movies.head()

movieId,title,genres,qnt_votes
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,215.0
2,Jumanji (1995),Adventure|Children|Fantasy,110.0
3,Grumpier Old Men (1995),Comedy|Romance,52.0
4,Waiting to Exhale (1995),Comedy|Drama|Romance,7.0
5,Father of the Bride Part II (1995),Comedy,49.0


In [9]:
movies.sort_values('qnt_votes', ascending = False).head(20)

movieId,title,genres,qnt_votes
356,Forrest Gump (1994),Comedy|Drama|Romance|War,329.0
318,"Shawshank Redemption, The (1994)",Crime|Drama,317.0
296,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller,307.0
593,"Silence of the Lambs, The (1991)",Crime|Horror|Thriller,279.0
2571,"Matrix, The (1999)",Action|Sci-Fi|Thriller,278.0
260,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Sci-Fi,251.0
480,Jurassic Park (1993),Action|Adventure|Sci-Fi|Thriller,238.0
110,Braveheart (1995),Action|Drama|War,237.0
589,Terminator 2: Judgment Day (1991),Action|Sci-Fi,224.0
527,Schindler's List (1993),Drama|War,220.0


### Second Attempt

#### Sort by rating

- Looking for a better way to suggest movies to someone, we can make our second attempt: sort by rating.
- In this case, we need to pay attention to two things: the rating and the number of votes. If the rating is 5.0 but the movie has only one vote, that single opinion is weak evidence that many people would like to see it.
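
The trade-off described above can be checked on a toy table. The sketch below uses made-up ratings; the `toy` frame and the `min_votes` threshold are assumptions for illustration, not values taken from the dataset. It keeps only the movies with enough votes before ranking them by mean rating.

```python
import pandas as pd

# Hypothetical ratings: movie 10 has a single 5.0, movie 20 has several good votes.
toy = pd.DataFrame({
    "movieId": [10, 20, 20, 20, 20, 30, 30],
    "rating":  [5.0, 4.5, 4.0, 4.5, 5.0, 2.0, 3.0],
})

# Mean rating and number of votes per movie in one pass.
summary = toy.groupby("movieId")["rating"].agg(["mean", "count"])

min_votes = 2  # assumed threshold, chosen only for this example
ranked = summary[summary["count"] >= min_votes].sort_values("mean", ascending=False)
print(ranked)
```

Movie 10 has the highest mean, but it is left out of `ranked` because a single vote does not reach the threshold.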

In [10]:
rates = ratings.groupby('movieId')['rating'].mean()
rates.head()

movieId
1    3.920930
2    3.431818
3    3.259615
4    2.357143
5    3.071429
Name: rating, dtype: float64

In [11]:
movies["rating"] = rates
movies.sort_values('rating', ascending = False).head(10)

movieId,title,genres,qnt_votes,rating
88448,Paper Birds (Pájaros de papel) (2010),Comedy|Drama,1.0,5.0
100556,"Act of Killing, The (2012)",Documentary,1.0,5.0
143031,Jump In! (2007),Comedy|Drama|Romance,1.0,5.0
143511,Human (2015),Documentary,1.0,5.0
143559,L.A. Slasher (2015),Comedy|Crime|Fantasy,1.0,5.0
6201,Lady Jane (1986),Drama|Romance,1.0,5.0
102217,Bill Hicks: Revelations (1993),Comedy,1.0,5.0
102084,Justice League: Doom (2012),Action|Animation|Fantasy,1.0,5.0
6192,Open Hearts (Elsker dig for evigt) (2002),Romance,1.0,5.0
145994,Formula of Love (1984),Comedy,1.0,5.0


#### Filtering the data

In [12]:
movies015p = movies.query('qnt_votes > 0.15 * qnt_votes.max()').sort_values('rating', ascending = False)
movies015p.head(10)

movieId,title,genres,qnt_votes,rating
318,"Shawshank Redemption, The (1994)",Crime|Drama,317.0,4.429022
858,"Godfather, The (1972)",Crime|Drama,192.0,4.289062
2959,Fight Club (1999),Action|Crime|Drama|Thriller,218.0,4.272936
1276,Cool Hand Luke (1967),Drama,57.0,4.27193
750,Dr. Strangelove or: How I Learned to Stop Worr...,Comedy|War,97.0,4.268041
904,Rear Window (1954),Mystery|Thriller,84.0,4.261905
1221,"Godfather: Part II, The (1974)",Crime|Drama,129.0,4.25969
48516,"Departed, The (2006)",Crime|Drama|Thriller,107.0,4.252336
1213,Goodfellas (1990),Crime|Drama,126.0,4.25
912,Casablanca (1942),Drama|Romance,100.0,4.24


In [13]:
print("Minimum number of ratings: ", (0.15 * movies.qnt_votes.max()))

Minimum number of ratings:  49.35


## 2nd Class: Recommendation System

> Date: June 24, 2020

**Collaborative Filtering:** Filter using the ratings and comments of other users

**Content Filtering:** Filter based on the content of the movies (genres, author, cast, etc.)

Now we have a user who has watched some of the movies on the list. What can we do with this information?

In [14]:
watched_movies = [2571, 260, 480,  527, 1,  1196, 4993] # These are the movies that I watched haha ;D

movies.loc[watched_movies]

movieId,title,genres,qnt_votes,rating
2571,"Matrix, The (1999)",Action|Sci-Fi|Thriller,278.0,4.192446
260,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Sci-Fi,251.0,4.231076
480,Jurassic Park (1993),Action|Adventure|Sci-Fi|Thriller,238.0,3.75
527,Schindler's List (1993),Drama|War,220.0,4.225
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,215.0,3.92093
1196,Star Wars: Episode V - The Empire Strikes Back...,Action|Adventure|Sci-Fi,211.0,4.21564
4993,"Lord of the Rings: The Fellowship of the Ring,...",Adventure|Fantasy,198.0,4.106061


### 1st Attempt: Suggest movies of the same genre

If we take one of these movies and look at its genres, we can use them to find similar movies that might interest the user.

Note: the argument `errors = 'ignore'` is needed because some of the watched movies are not in the result of `genres == "Action|Adventure|Sci-Fi"`. Without it, `drop` would raise a `KeyError` for the missing labels.

In [15]:
act_adv_scifi_movies = movies015p.query('genres == "Action|Adventure|Sci-Fi"').drop(watched_movies, errors = 'ignore').sort_values('rating', ascending = False)
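
The role of `errors = 'ignore'` in the cell above can be seen on a toy frame. This is a minimal sketch; the `toy` frame and the ids in `watched` are made up for illustration.

```python
import pandas as pd

# Toy frame indexed by hypothetical movie ids.
toy = pd.DataFrame({"title": ["A", "B", "C"]}, index=[1, 2, 3])
watched = [2, 99]  # 99 is not in the index

try:
    toy.drop(watched)  # default is errors='raise'
    raised = False
except KeyError:
    raised = True

# With errors='ignore', labels that are not found are skipped.
remaining = toy.drop(watched, errors="ignore")
print(raised, list(remaining.index))
```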

## 3rd Class: Distance between users

> Date: June 24, 2020

Using the ratings, we can measure how similar two users are: if both like a specific movie, we can recommend to each of them other movies that the other one likes.

**Example:** If I liked both X-Men and Avengers and you liked X-Men too, we can recommend Avengers to you.

-------------

**Norm:** The length of a vector. The norm of the difference between two points gives the distance between them.
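
As a small illustration, the sketch below computes the distance between two users from their ratings of the same three movies. The rating vectors are made up, and `by_hand` is only there to show what `norm` computes for a difference vector.

```python
import numpy as np
from numpy.linalg import norm

# Hypothetical ratings given by two users to the same three movies.
user_a = np.array([5.0, 3.0, 4.0])
user_b = np.array([4.0, 3.0, 2.0])

# Euclidean distance: square root of the summed squared differences.
distance = norm(user_a - user_b)
by_hand = np.sqrt(((user_a - user_b) ** 2).sum())
print(distance, by_hand)
```

The smaller the distance, the more similar the two users' ratings are; identical ratings give a distance of zero.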

In [16]:
# Take the ratings of one user, keeping movieId and rating
# Set movieId as the index
# Define a function that does all of this
# Join the data of 2 users in one DataFrame
# Remove rows with NaN
# Compute the distance between the 2 users

In [17]:
ratings[300:400]

,userId,movieId,rating
300,4,21,3.0
301,4,32,2.0
302,4,45,3.0
303,4,47,2.0
304,4,52,3.0
...,...,...,...
395,4,1391,1.0
396,4,1449,5.0
397,4,1466,4.0
398,4,1500,4.0


In [18]:
import numpy as np
from numpy.linalg import norm

user = 'userId'
number1 = 77
number2 = 511

user1_test = ratings.query('{0} == {1}'.format(user, number1))[['rating', 'movieId']].set_index('movieId')
user2_test = ratings.query('{0} == {1}'.format(user, number2))[['rating', 'movieId']].set_index('movieId')

user1_test.join(user2_test, lsuffix = number1, rsuffix = number2).dropna().index

Int64Index([1198, 2571, 4973, 4993, 5349, 7153, 79132, 109487], dtype='int64', name='movieId')

In [19]:
from numpy.linalg import norm

def dist_between_users(data, column_users, column_ratings, column_id, user1, user2, min_data = 5):
    # Ratings of each user, indexed by movie id
    data_user1 = data.query('{0} == {1}'.format(column_users, user1))[[column_ratings, column_id]].set_index([column_id])
    data_user2 = data.query('{0} == {1}'.format(column_users, user2))[[column_ratings, column_id]].set_index([column_id])
    # Keep only the movies rated by both users
    data_users = data_user1.join(data_user2, lsuffix = user1, rsuffix = user2).dropna()
    # Too few movies in common to give a meaningful distance
    if len(data_users) < min_data:
        return None
    # Euclidean distance between the two rating vectors
    return norm(data_users[column_ratings + '%d' % user1] - data_users[column_ratings + '%d' % user2])
    

In [20]:
dist_between_users(ratings, 'userId', 'rating', 'movieId', 1, 77)

0.0

In [21]:
norm_1_4 = dist_between_users(ratings, 'userId', 'rating', 'movieId', 1, 4)
norm_1_3 = dist_between_users(ratings, 'userId', 'rating', 'movieId', 1, 3)
print(norm_1_4)
print(norm_1_3)
norm_1_1 = dist_between_users(ratings, 'userId', 'rating','movieId', 1, 1)
print(norm_1_1)

11.135528725660043
8.200609733428363
0.0


In [22]:
from math import sqrt

def pitagoras(num1, num2):
    return (sqrt((pow(num1, 2) + pow(num2, 2))))


print(pitagoras(4, 5))
print(norm([4, 5]))

6.4031242374328485
6.4031242374328485


In [23]:
# Show the number of users
# Show the array of users (each user only once)
# Loop through each user
# Run the function to compute the distance between one user and all the others
# Build a DataFrame with 3 columns: main_user, other_user, distance

In [24]:
ratings.userId.unique()[1:20]

array([ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
       19, 20])

In [25]:
qnt_users = len(ratings.userId.unique())
print(qnt_users)

610


## 4th Class: Nearest Users

> Date: June 25, 2020

Now that we have a function that returns the distance between two users, we can build a new function on top of it that gives us the distance between one given user and all the others.
 
 ----------------
 
- `loc` selects rows by index label
- `iloc` selects rows by integer position
- `dropna` removes rows that contain NaN (missing) values from the DataFrame
- `drop` removes columns (using the parameter `columns = <column>`) or rows (by index label)
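
The four methods listed above can be tried on a toy frame. This is a minimal sketch; the `toy` frame and its ids are made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {"title": ["A", "B", "C"], "rating": [4.0, np.nan, 3.0]},
    index=[10, 20, 30],  # hypothetical movie ids used as the index
)

by_label = toy.loc[20]                     # row whose index label is 20
by_position = toy.iloc[0]                  # first row, whatever its label
without_nan = toy.dropna()                 # drops the row that holds NaN
without_col = toy.drop(columns="rating")   # drops a column
without_row = toy.drop(30)                 # drops the row labelled 30
```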

In [26]:
def distance_of_others(data, column_users, column_ratings, columnId, main_user, k_nearest_users = 1, k_users = None):
    # Compare against users 1..k_users if given, otherwise against every user.
    # This assumes user ids are consecutive integers starting at 1, as in this dataset.
    qnt_users = k_users if k_users else len(data[column_users].unique())
    dist_user = [[main_user, other_user, dist_between_users(data, column_users, column_ratings, columnId, main_user, other_user)] for other_user in range(1, qnt_users + 1)]

    dist_user_df = pd.DataFrame(data = dist_user, columns = ['main_user', 'other_user', 'distance'])
    # Drop users with too few movies in common (their distance is None)
    dist_user_df = dist_user_df.dropna()
    # errors = 'ignore' because the main user may already be gone after dropna
    dist_user_df = dist_user_df.sort_values('distance').set_index('other_user').drop(main_user, errors = 'ignore')
    return dist_user_df.iloc[:k_nearest_users]

In [27]:
dist_user_1 = distance_of_others(ratings, "userId", "rating", 'movieId', 1, k_nearest_users = 10)

In [28]:
dist_user_1

other_user,main_user,distance
77,1,0.0
511,1,0.5
366,1,0.707107
258,1,1.0
9,1,1.0
49,1,1.0
523,1,1.0
319,1,1.118034
398,1,1.224745
65,1,1.322876


In [29]:
# Limit the number of elements in distance_of_others
# Remove users that have no movies in common with the main user
# Use a filter to drop those users
# Use iloc to refer to a row of the DataFrame
# Find the movies that the most similar user watched
# Remove the movies that the main user has already watched
# Show the remaining movies (the next recommendations) in a DataFrame joined with movies, which adds the information about the selected movies

In [30]:
def user_ratings(user, data, users_column, movies_column): 
    return data.query('{0} == {1}'.format(users_column, user)).set_index(movies_column)

In [31]:
user_ratings(77, ratings, 'userId', 'movieId')

movieId,userId,rating
260,77,5.0
1196,77,5.0
1198,77,5.0
1210,77,5.0
2571,77,5.0
3578,77,5.0
3948,77,3.0
3996,77,5.0
4226,77,2.5
4878,77,1.0


In [32]:
user_ratings(77, ratings, 'userId', 'movieId').join(user_ratings(1, ratings, 'userId', 'movieId'), lsuffix = 77, rsuffix = 1).dropna()

movieId,userId77,rating77,userId1,rating1
260,77,5.0,1.0,5.0
1196,77,5.0,1.0,5.0
1198,77,5.0,1.0,5.0
1210,77,5.0,1.0,5.0
2571,77,5.0,1.0,5.0
3578,77,5.0,1.0,5.0


## 5th Class: KNN - K Nearest Neighbors

> Date: June 26, 2020

Now, to finish the process of using distances, ratings and number of votes to filter and refine our dataset, and to find a better way to suggest movies to a user, we will use the KNN algorithm. This method uses the distance between users and relies on the nearest "neighbors" to decide, in our case, what someone would like to watch. Basically, it tries to build a profile of a specific user from other users whose ratings are equal or similar to theirs.
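
To make the idea concrete, here is a minimal sketch on a toy ratings matrix. It is a simplification and not the implementation used in this notebook: the `toy` frame, the user names and the helper `toy_knn` are all hypothetical, the data is stored wide (one column per movie), and there is no minimum number of movies in common and no filter by number of votes.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings matrix: rows are users, columns are movies, NaN = not rated.
toy = pd.DataFrame(
    {"m1": [5.0, 5.0, 1.0, 4.5],
     "m2": [4.0, 4.0, 2.0, 4.0],
     "m3": [np.nan, 5.0, 1.0, 4.0],
     "m4": [np.nan, 2.0, 5.0, 3.0]},
    index=["ana", "bob", "cat", "dan"],
)

def toy_knn(matrix, user, k=2):
    target = matrix.loc[user]
    others = matrix.drop(user)
    # Euclidean distance over the movies the target user has rated
    # (in this toy data every other user has rated those movies too).
    rated = target.dropna().index
    distances = np.sqrt(((others[rated] - target[rated]) ** 2).sum(axis=1))
    neighbors = distances.sort_values().index[:k]
    # Average the neighbors' ratings for movies the target has not rated yet.
    unseen = target[target.isna()].index
    return matrix.loc[neighbors, unseen].mean().sort_values(ascending=False)

suggestions = toy_knn(toy, "ana", k=2)
print(suggestions)
```

Here `bob` and `dan` are the two nearest neighbors of `ana`, so `m3`, which both of them rated highly, comes out ahead of `m4`.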
 
 ----------------
 
- `sort_values` can sort by several columns at once
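
A short sketch of sorting by more than one column, using made-up numbers: rows are ordered by the first column, and ties are broken by the second. The column names mirror the ones used later in this class, but the values are invented.

```python
import pandas as pd

toy = pd.DataFrame(
    {"rating_count": [3, 2, 3, 2], "rating_mean": [4.0, 5.0, 4.5, 4.75]},
    index=[25, 260, 2571, 1],  # hypothetical movie ids
)

# Sort by count first; ties are broken by the mean. Both in descending order.
ordered = toy.sort_values(["rating_count", "rating_mean"], ascending=False)
print(ordered)
```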

In [33]:
def knn(user, data, users_column, ratings_column, movies_column, k_nearest_users = 1, k_users = None):
    nearest_users_dist = distance_of_others(data, users_column, ratings_column, movies_column, user, k_nearest_users = k_nearest_users, k_users = k_users)
    user_movies = user_ratings(user, data, users_column, movies_column)
    nearest_users_movies = data.set_index(users_column).loc[nearest_users_dist.index]

    suggested_movies_mean = nearest_users_movies.groupby(movies_column).mean()[[ratings_column]]
    suggested_movies_count = nearest_users_movies.groupby(movies_column).count()[[ratings_column]]
    
    min_ratings = k_nearest_users * 0.25
    
    suggested_movies_join = suggested_movies_mean.join(suggested_movies_count, lsuffix = '_mean', rsuffix = '_count')
    suggested_movies = suggested_movies_join.join(movies)

    filtered_suggested_movies = suggested_movies.query('qnt_votes > (qnt_votes.max()*0.15)')
    filtered_counted_suggested_movies = filtered_suggested_movies.query('{0} >= {1}'.format((ratings_column + '_count') , min_ratings))
    filtered_counted_sorted_suggested_movies = filtered_counted_suggested_movies.sort_values(['%s_count' % ratings_column, '%s_mean' % ratings_column], ascending = False)
    
    most_suggested_movies = filtered_counted_sorted_suggested_movies.drop(user_movies.index, errors = 'ignore')
    return most_suggested_movies


In [34]:
knn(1, ratings, 'userId', 'rating', 'movieId', k_nearest_users = 20)

movieId,rating_mean,rating_count,title,genres,qnt_votes,rating
318,4.708333,12,"Shawshank Redemption, The (1994)",Crime|Drama,317.0,4.429022
79132,4.272727,11,Inception (2010),Action|Crime|Drama|Mystery|Sci-Fi|Thriller|IMAX,143.0,4.066434
4993,4.6875,8,"Lord of the Rings: The Fellowship of the Ring,...",Adventure|Fantasy,198.0,4.106061
109487,4.625,8,Interstellar (2014),Sci-Fi|IMAX,73.0,3.993151
7153,4.4375,8,"Lord of the Rings: The Return of the King, The...",Action|Adventure|Drama|Fantasy,185.0,4.118919
58559,4.785714,7,"Dark Knight, The (2008)",Action|Crime|Drama|IMAX,149.0,4.238255
60069,4.833333,6,WALL·E (2008),Adventure|Animation|Children|Romance|Sci-Fi,104.0,4.057692
116797,4.666667,6,The Imitation Game (2014),Drama|Thriller|War,50.0,4.02
5952,4.583333,6,"Lord of the Rings: The Two Towers, The (2002)",Adventure|Fantasy,188.0,4.021277
4973,4.0,6,"Amelie (Fabuleux destin d'Amélie Poulain, Le) ...",Comedy|Romance,120.0,4.183333


In [35]:
ratings.head()

,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0


In [36]:
# Take the 10 nearest users
# Print the array with their ratings
# Compute the mean of these ratings (grouped by movie)
# Sort by these means
# Join with movies

In [38]:
suggestions_user11_2 = knn(11, ratings, 'userId', 'rating', 'movieId', k_nearest_users = 5) # 1198, 2571, 4973, 4993, 5349, 7153, 79132, 109487
suggestions_user11_2.head(10)

movieId,rating_mean,rating_count,title,genres,qnt_votes,rating
25,5.0,3,Leaving Las Vegas (1995),Drama|Romance,76.0,3.625
2571,4.5,3,"Matrix, The (1999)",Action|Sci-Fi|Thriller,278.0,4.192446
1073,4.0,3,Willy Wonka & the Chocolate Factory (1971),Children|Comedy|Fantasy|Musical,119.0,3.87395
5445,4.0,3,Minority Report (2002),Action|Crime|Mystery|Sci-Fi|Thriller,120.0,3.6375
32,3.666667,3,Twelve Monkeys (a.k.a. 12 Monkeys) (1995),Mystery|Sci-Fi|Thriller,177.0,3.983051
260,5.0,2,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Sci-Fi,251.0,4.231076
1196,5.0,2,Star Wars: Episode V - The Empire Strikes Back...,Action|Adventure|Sci-Fi,211.0,4.21564
5418,5.0,2,"Bourne Identity, The (2002)",Action|Mystery|Thriller,112.0,3.816964
1,4.75,2,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,215.0,3.92093
608,4.75,2,Fargo (1996),Comedy|Crime|Drama|Thriller,181.0,4.116022


In [39]:
suggestions_user1_15 = knn(1, ratings, 'userId', 'rating', 'movieId', k_nearest_users = 15)
suggestions_user1_15.head(10)

movieId,rating_mean,rating_count,title,genres,qnt_votes,rating
79132,4.1875,8,Inception (2010),Action|Crime|Drama|Mystery|Sci-Fi|Thriller|IMAX,143.0,4.066434
4993,4.714286,7,"Lord of the Rings: The Fellowship of the Ring,...",Adventure|Fantasy,198.0,4.106061
318,4.642857,7,"Shawshank Redemption, The (1994)",Crime|Drama,317.0,4.429022
7153,4.75,6,"Lord of the Rings: The Return of the King, The...",Action|Adventure|Drama|Fantasy,185.0,4.118919
58559,4.75,6,"Dark Knight, The (2008)",Action|Crime|Drama|IMAX,149.0,4.238255
109487,4.583333,6,Interstellar (2014),Sci-Fi|IMAX,73.0,3.993151
5952,5.0,4,"Lord of the Rings: The Two Towers, The (2002)",Adventure|Fantasy,188.0,4.021277
60069,5.0,4,WALL·E (2008),Adventure|Animation|Children|Romance|Sci-Fi,104.0,4.057692
68954,4.875,4,Up (2009),Adventure|Animation|Children|Drama,105.0,4.004762
91529,4.75,4,"Dark Knight Rises, The (2012)",Action|Adventure|Crime|IMAX,76.0,3.993421


## 6th Class: KNN - K Nearest Neighbors

> Date: June 26, 2020

KNN: an algorithm that classifies a user based on information from their nearest neighbors

Now we will create and add a new user to test and verify how the algorithm works with real data.

In [40]:
# Create a list with movie ids and ratings
# Add it to the DataFrame
# Find suggestions
# Reset the index of the dataset

In [41]:
movies.sort_values('qnt_votes', ascending = False).iloc[30:45]

movieId,title,genres,qnt_votes,rating
380,True Lies (1994),Action|Adventure|Comedy|Romance|Thriller,178.0,3.497191
32,Twelve Monkeys (a.k.a. 12 Monkeys) (1995),Mystery|Sci-Fi|Thriller,177.0,3.983051
364,"Lion King, The (1994)",Adventure|Animation|Children|Drama|Musical|IMAX,172.0,3.94186
1270,Back to the Future (1985),Adventure|Comedy|Sci-Fi,171.0,4.038012
377,Speed (1994),Action|Romance|Thriller,171.0,3.52924
3578,Gladiator (2000),Action|Adventure|Drama,170.0,3.938235
4306,Shrek (2001),Adventure|Animation|Children|Comedy|Fantasy|Ro...,170.0,3.867647
1580,Men in Black (a.k.a. MIB) (1997),Action|Comedy|Sci-Fi,165.0,3.487879
590,Dances with Wolves (1990),Adventure|Drama|Western,164.0,3.835366
648,Mission: Impossible (1996),Action|Adventure|Mystery|Thriller,162.0,3.537037


In [42]:
# Add a column with the number of times each movie appears in the nearest users' ratings
# Ignore movies watched by fewer than a minimum number of the nearest users

In [43]:
new_user_ratings = [[2571, 4], [260, 4], [480, 3], [1, 5], [527, 5], 
            [1196, 3], [4993, 5], [588, 4], [1210, 3], [364, 5], 
            [4306, 5], [1580, 4], [367, 4], [6539, 5]]

In [44]:
def new_user(data):
    new_user_id = ratings.userId.max() + 1
    new_user_ratings = pd.DataFrame(data = data, columns = ['movieId', 'rating'])
    new_user_ratings['userId'] = new_user_id
    # sort = True orders the columns alphabetically; drop = True discards the old index
    return pd.concat([ratings, new_user_ratings], sort = True).reset_index(drop = True)

In [45]:
ratings.tail(20)

,userId,movieId,rating
100816,610,158872,3.5
100817,610,158956,3.0
100818,610,159093,3.0
100819,610,160080,3.0
100820,610,160341,2.5
100821,610,160527,4.5
100822,610,160571,3.0
100823,610,160836,3.0
100824,610,161582,4.0
100825,610,161634,4.0


In [46]:
ratings_with_new_user = new_user(new_user_ratings)
ratings = ratings_with_new_user
ratings.tail(20)

,movieId,rating,userId
100830,166528,4.0,610
100831,166534,4.0,610
100832,168248,5.0,610
100833,168250,5.0,610
100834,168252,5.0,610
100835,170875,3.0,610
100836,2571,4.0,611
100837,260,4.0,611
100838,480,3.0,611
100839,1,5.0,611


In [47]:
suggestions_new_user = knn(611, ratings, 'userId', 'rating', 'movieId', k_nearest_users = 10)

In [48]:
suggestions_new_user.head(15)

movieId,rating_mean,rating_count,title,genres,qnt_votes,rating
593,4.6875,8,"Silence of the Lambs, The (1991)",Crime|Horror|Thriller,279.0,4.16129
318,4.428571,7,"Shawshank Redemption, The (1994)",Crime|Drama,317.0,4.429022
296,4.071429,7,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller,307.0,4.197068
356,4.25,6,Forrest Gump (1994),Comedy|Drama|Romance|War,329.0,4.164134
150,3.916667,6,Apollo 13 (1995),Adventure|Drama|IMAX,201.0,3.845771
457,3.75,6,"Fugitive, The (1993)",Thriller,190.0,3.992105
34,4.2,5,Babe (1995),Children|Drama,128.0,3.652344
344,4.0,5,Ace Ventura: Pet Detective (1994),Comedy,161.0,3.040373
377,3.9,5,Speed (1994),Action|Romance|Thriller,171.0,3.52924
589,3.9,5,Terminator 2: Judgment Day (1991),Action|Sci-Fi,224.0,3.970982
