# Movie Recommendation Systems

Content:

1. Common Movie Pairs
2. Content Based Recommender System: based on genre and textual description
3. Collaborative Filtering: item-item & user-user

In [288]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import regex

In [289]:
movies = pd.read_csv('https://raw.githubusercontent.com/AkhilRD/Recommender-Systems/main/movies.csv',low_memory=False)
users = pd.read_csv('https://raw.githubusercontent.com/AkhilRD/Recommender-Systems/main/user_ratings.csv',low_memory=False)

In [290]:
#Merging the datasets

df = movies.merge(users, on='movieId')
df.head()

Unnamed: 0,movieId,title_x,genres_x,userId,rating,timestamp,title_y,genres_y
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7,4.5,1106635946,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15,2.5,1510577970,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,17,4.5,1305696483,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy


In [291]:
#dropping columns deemed unnecessary

df.drop(['title_y','genres_y'],axis = 1,inplace = True)

In [292]:
df.rename(columns = {'title_x':'title','genres_x':'genres'},inplace = True)
df.genres = df.genres.str.split('|')

In [293]:
#removing year from title

df.title = df['title'].str.replace(r'\s\(\d+\)',"",regex = True) 


In [294]:
df.head()

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1,4.0,964982703
1,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",5,4.0,847434962
2,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",7,4.5,1106635946
3,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",15,2.5,1510577970
4,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",17,4.5,1305696483


### Some EDA to remove unpopular movies and subset the most popular

In [295]:
average_rating = df[['title','rating']].groupby('title').mean()
sorted_average_ratings = average_rating.sort_values('rating',ascending = False)
sorted_average_ratings

Unnamed: 0_level_0,rating
title,Unnamed: 1_level_1
Hollywood Chainsaw Hookers,5.0
"Calcium Kid, The",5.0
Chinese Puzzle (Casse-tête chinois),5.0
Raise Your Voice,5.0
Rain,5.0
...,...
Anaconda: The Offspring,0.5
Superfast!,0.5
Don't Look Now,0.5
Yongary: Monster from the Deep,0.5


- The movies at the top have a perfect 5/5 score, most likely because only a handful of users rated them. To find movies that are both popular and highly rated, we first need to filter out titles with few ratings.
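To see why the number of ratings matters, here is a minimal sketch on made-up data. The toy titles and the `min_ratings` threshold are assumptions for illustration only, not values taken from the dataset.

```python
import pandas as pd

# Hypothetical toy ratings: "Obscure Film" has one perfect rating,
# "Popular Film" has several mixed ratings.
toy = pd.DataFrame({
    "title": ["Obscure Film", "Popular Film", "Popular Film", "Popular Film"],
    "rating": [5.0, 4.0, 4.5, 3.5],
})

# Compute the mean rating and the number of ratings per title in one pass.
stats = toy.groupby("title")["rating"].agg(["mean", "count"])

# Ranking by mean alone puts the single-rating title first.
top_by_mean = stats.sort_values("mean", ascending=False).index[0]

# Requiring a minimum number of ratings removes it (threshold is illustrative).
min_ratings = 2
reliable = stats[stats["count"] >= min_ratings]
top_reliable = reliable.sort_values("mean", ascending=False).index[0]

print(top_by_mean, "->", top_reliable)
```

Keeping the count next to the mean makes it easy to try different thresholds before settling on one.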

In [296]:
#creating a set for popular movies by making sure more than 50 users have rated it

movie_popularity = df['title'].value_counts()
popular_movies = movie_popularity[movie_popularity > 50].index

In [297]:
#subsetting it from the old dataframe

movies_popular = df[df['title'].isin(popular_movies)]

In [298]:
# Finding the average rating given to these frequently watched movies

popular_movie_rankings = movies_popular[['title', 'rating']].groupby('title').mean()
popular_movie_rankings.sort_values("rating", ascending=False).head(15)

Unnamed: 0_level_0,rating
title,Unnamed: 1_level_1
"Shawshank Redemption, The",4.429022
"Godfather, The",4.289062
Fight Club,4.272936
Cool Hand Luke,4.27193
Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb,4.268041
Rear Window,4.261905
"Godfather: Part II, The",4.25969
"Departed, The",4.252336
Goodfellas,4.25
Casablanca,4.24


## 1. Permutations of Movies - Basic Movie Pair Recommendation

- Here we count ordered pairs (permutations) of movies that the same user has watched, to find which movies are most often watched together.
- This gives a basic recommender: if a user has watched one movie, recommend the movies that other users most often watched alongside it and that the user has not yet seen.
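The idea can be sketched on a tiny example before applying it to the full dataset. The watch histories and the helper name `watched_alongside` below are hypothetical, chosen only to illustrate the pair counting.

```python
from collections import Counter
from itertools import permutations

# Hypothetical watch histories keyed by user id.
watched = {
    1: ["Alien", "Blade Runner", "Toy Story"],
    2: ["Alien", "Blade Runner"],
    3: ["Alien", "Toy Story"],
    4: ["Alien", "Blade Runner"],
}

# permutations(..., 2) yields ordered pairs, so (A, B) and (B, A) are both
# counted; that lets us look up any movie as the first element later.
pair_counts = Counter()
for titles in watched.values():
    pair_counts.update(permutations(titles, 2))

def watched_alongside(movie1):
    """Return (movie2, count) tuples sorted by how often they co-occur with movie1."""
    pairs = [(m2, n) for (m1, m2), n in pair_counts.items() if m1 == movie1]
    return sorted(pairs, key=lambda p: (-p[1], p[0]))

print(watched_alongside("Alien"))
```

Using permutations rather than combinations doubles the storage, but it means every movie can be queried as the "already watched" title without checking both positions of the pair.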

In [299]:
from itertools import permutations

In [300]:
def movie_pairs(x):
    pairs = pd.DataFrame(list(permutations(x.values,2)),columns = ['movie1','movie2'])
    return pairs

In [301]:
#grouping by user id and title and applying permutation function

movie_combinations = movies_popular.groupby('userId')['title'].apply(movie_pairs).reset_index(drop =True)

In [302]:
#grouping by movie pair to find the most popular combinations

combination_counts = movie_combinations.groupby(['movie1','movie2']).size()
combination_counts.head()

movie1                      movie2                 
10 Things I Hate About You  12 Angry Men                7
                            2001: A Space Odyssey      19
                            28 Days Later              11
                            300                        25
                            40-Year-Old Virgin, The    25
dtype: int64

In [303]:
combination_counts_df = combination_counts.to_frame(name='size').reset_index()
combination_counts_df.sort_values('size',ascending = False,inplace = True)

In [304]:
# Find the movies most frequently watched by people who watched 2001: A Space Odyssey

# Named differently from the movie_combinations DataFrame above so it is not overwritten
def most_watched_with(y):
    movies_find = combination_counts_df[combination_counts_df['movie1'] == y]
    combi = movies_find[['movie2','size']].sort_values('size',ascending = False).head(15)
    return combi

most_watched_with('2001: A Space Odyssey')

Unnamed: 0,movie2,size
1046,Forrest Gump,86
1268,Star Wars: Episode IV - A New Hope,85
1151,"Matrix, The",82
1269,Star Wars: Episode V - The Empire Strikes Back,82
1242,"Silence of the Lambs, The",81
1206,Pulp Fiction,80
945,Blade Runner,78
1116,Jurassic Park,76
1270,Star Wars: Episode VI - Return of the Jedi,76
901,Alien,76


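- Based on these simple pair counts, people who watched 2001: A Space Odyssey, a classic sci-fi film, have also watched the Star Wars movies and other sci-fi titles such as Blade Runner and Alien. The list also contains widely watched movies such as Forrest Gump, which is expected because raw co-occurrence counts favor popular titles.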

## 2. Content Based Filtering

- This filtering method uses item features to recommend items similar to those the user already likes, based on the user's previous actions or explicit feedback. The main idea of content-based methods is to build a model from the available “features” that explains the observed user-item interactions.

List of actions:

1. Unstacking the list of genres for every movie
2. Creating a pandas crosstab for all movie genres: determines if a movie falls under a genre or not (0 or 1)
3. Calculating the Jaccard distance between movies
4. Recommendation function

In [305]:
genre = df.iloc[:,1:3] #extracting the required columns
genre.head()

Unnamed: 0,title,genres
0,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]"
1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]"
2,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]"
3,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]"
4,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]"


In [306]:
# The steps below take each genre stored in the list and give it its own row for the movie

# expanding the genres feature into one row per (movie, genre)
movie_genre = genre.apply(lambda x: pd.Series(x['genres']),axis=1).stack().reset_index(level=1, drop=True)

# Naming the new feature 'genre'
movie_genre.name = 'genre'

# Create a new dataframe genre_list by dropping the old 'genres' feature and joining the new 'genre'

genre_list = genre.drop('genres', axis=1).join(movie_genre)
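
The row-wise `apply(pd.Series)` expansion above can be slow on roughly 100k rows. As a possible alternative, `DataFrame.explode` (available since pandas 0.25) produces the same long title/genre layout. The sketch below uses a small made-up frame (`toy_genre`), not the notebook's `genre` dataframe.

```python
import pandas as pd

# Toy stand-in for the title/genres frame; rows repeat per rating as in the notebook.
toy_genre = pd.DataFrame({
    "title": ["Toy Story", "Toy Story", "Heat"],
    "genres": [["Animation", "Comedy"], ["Animation", "Comedy"], ["Action", "Crime"]],
})

# explode() emits one row per list element, repeating the other columns.
toy_long = (
    toy_genre.explode("genres")
    .rename(columns={"genres": "genre"})
    .drop_duplicates()
    .reset_index(drop=True)
)

# One row per title, one 0/1 column per genre.
toy_crosstab = pd.crosstab(toy_long["title"], toy_long["genre"])
print(toy_crosstab)
```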

In [307]:
#dropping duplicates

genre_list.drop_duplicates(inplace = True)
genre_list

Unnamed: 0,title,genre
0,Toy Story,Adventure
0,Toy Story,Animation
0,Toy Story,Children
0,Toy Story,Comedy
0,Toy Story,Fantasy
...,...,...
100832,No Game No Life: Zero,Fantasy
100833,Flint,Drama
100834,Bungo Stray Dogs: Dead Apple,Action
100834,Bungo Stray Dogs: Dead Apple,Animation


In [308]:
#creating movie name and genre crosstabs as 1 or 0

movies_crosstab = pd.crosstab(genre_list['title'],genre_list['genre'])
movies_crosstab.head()

genre,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
'71,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0
'Hellboy': The Seeds of Creation,0,1,1,0,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0
'Round Midnight,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0
'Salem's Lot,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,1,0,0
'Til There Was You,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0


- Putting the data in this vector format makes it straightforward to calculate distance and similarity.

In [309]:
from scipy.spatial.distance import pdist, squareform

In [310]:
jaccard_distances = pdist(movies_crosstab.values,metric = 'jaccard')

- pdist calculates the Jaccard distance, a measure of how different rows are from each other. As we want its complement (the similarity), we subtract the values from 1.
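As a worked example, the Jaccard similarity of two genre sets is the number of shared genres divided by the number of genres that appear in either set. The sketch below checks that by hand against `pdist` on two made-up genre rows; the rows are illustrative and not taken from `movies_crosstab`.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical 0/1 genre rows over the columns
# [Adventure, Animation, Children, Comedy, Fantasy].
movie_a = [1, 1, 1, 1, 1]
movie_b = [1, 0, 1, 0, 1]
genres = np.array([movie_a, movie_b], dtype=bool)

# By hand: size of the intersection divided by size of the union.
shared = np.logical_and(genres[0], genres[1]).sum()
either = np.logical_or(genres[0], genres[1]).sum()
similarity_by_hand = shared / either  # 3 shared genres out of 5 in total

# With SciPy: pdist returns the distance, so similarity = 1 - distance.
similarity_scipy = 1 - squareform(pdist(genres, metric="jaccard"))[0, 1]

print(similarity_by_hand, similarity_scipy)
```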

In [311]:
jaccard_similarity = 1 - squareform(jaccard_distances)
print(jaccard_similarity)

[[1.         0.125      0.2        ... 0.4        0.         0.        ]
 [0.125      1.         0.         ... 0.14285714 0.16666667 0.16666667]
 [0.2        0.         1.         ... 0.         0.         0.33333333]
 ...
 [0.4        0.14285714 0.         ... 1.         0.         0.        ]
 [0.         0.16666667 0.         ... 0.         1.         0.33333333]
 [0.         0.16666667 0.33333333 ... 0.         0.33333333 1.        ]]


In [312]:
# NOTE: despite its name, `distance` holds Jaccard similarities (1.0 = identical genre sets)
distance = pd.DataFrame(jaccard_similarity,index = movies_crosstab.index,
                       columns = movies_crosstab.index)

#### Finding similar movies 

In [313]:
def content_genre(movie):
    series = distance.loc[movie]
    series = series.sort_values(ascending = False)
    return series.head(15)
    
content_genre('Pulp Fiction')

title
Freeway                                           1.0
Informant!, The                                   1.0
Party Monster                                     1.0
Pulp Fiction                                      1.0
In Bruges                                         1.0
Leaves of Grass                                   1.0
Fargo                                             1.0
Man Bites Dog (C'est arrivé près de chez vous)    1.0
Out of Sight                                      0.8
11:14                                             0.8
Running Scared                                    0.8
Metro                                             0.8
Confessions of a Dangerous Mind                   0.8
Last Boy Scout, The                               0.8
Nurse Betty                                       0.8
Name: Pulp Fiction, dtype: float64

- This approach is simple to execute, but it depends on a single attribute, genre. Any two movies with exactly the same set of genres get a perfect score of 1, which is why several movies tie with Pulp Fiction itself.

###  Content Based with Description 

- In this method, we use the movie descriptions to find the movies most similar to a given one and recommend them to the user.

List of actions:

1. Importing text pre-processing libraries
2. Lowercasing, removing stop words, removing numbers and lemmatizing the text column
3. Building a TF-IDF vectorizer to convert text into a numeric vector format
4. Using cosine similarity to find the most similar movies
5. Recommendation function
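
Steps 3 and 4 can be sketched on a few made-up overviews before running them on the real data. The titles and texts in `toy_overviews` are hypothetical, and the vectorizer uses default settings rather than the `min_df`/`max_df` values chosen later in the notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical, already-cleaned overviews.
toy_overviews = pd.Series({
    "Spy Film A": "secret agent mission london",
    "Spy Film B": "secret agent chase villain",
    "Family Film": "toy cowboy friendship adventure",
})

# Each overview becomes a TF-IDF weighted word vector.
toy_vectors = TfidfVectorizer().fit_transform(toy_overviews)

# Pairwise cosine similarity between all overviews.
toy_sim = pd.DataFrame(
    cosine_similarity(toy_vectors),
    index=toy_overviews.index,
    columns=toy_overviews.index,
)

# Most similar title to "Spy Film A", excluding itself.
best_match = toy_sim["Spy Film A"].drop("Spy Film A").idxmax()
print(best_match)
```

Overviews that share no words get a similarity of 0, so the ranking is driven entirely by overlapping vocabulary.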

In [314]:
from nltk.tokenize import sent_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import unicodedata
import nltk
import re
from sklearn.feature_extraction.text import TfidfVectorizer

In [315]:
movie = pd.read_csv('https://raw.githubusercontent.com/AkhilRD/Recommender-Systems/main/imdb5000.csv')
movie = movie[['original_title','overview']]
movie.head()

Unnamed: 0,original_title,overview
0,Avatar,"In the 22nd century, a paraplegic Marine is di..."
1,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha..."
2,Spectre,A cryptic message from Bond’s past sends him o...
3,The Dark Knight Rises,Following the death of District Attorney Harve...
4,John Carter,"John Carter is a war-weary, former military ca..."


#### Cleaning the text column  

In [316]:
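# Build the stop-word set and lemmatizer once instead of on every call
# (requires the NLTK 'stopwords' and 'wordnet' data to be downloaded)
stop = set(nltk.corpus.stopwords.words('english'))
lem = WordNetLemmatizer()

def preprocessing(text):
    text = str(text)
    text = text.lower()                              # lowercase
    text = re.sub('[0-9]+', '', text)                # remove numbers
    words = re.sub(r'[^\w\s]', '', text).split()     # remove punctuation and tokenize
    return [lem.lemmatize(w) for w in words if w not in stop]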

In [317]:
# applying the cleaning function to the text column (returns a list of tokens per row)
movie['overview'] = movie['overview'].apply(preprocessing)

def final(lem_col):
    return " ".join(lem_col)

# joining the tokens back into a single string per movie
movie['overview'] = movie['overview'].apply(final)

In [318]:
#After pre-processing we have a clean text column which can be fit to a tf-idf vectorizer

movie.head()

Unnamed: 0,original_title,overview
0,Avatar,nd century paraplegic marine dispatched moon p...
1,Pirates of the Caribbean: At World's End,captain barbossa long believed dead come back ...
2,Spectre,cryptic message bond past sends trail uncover ...
3,The Dark Knight Rises,following death district attorney harvey dent ...
4,John Carter,john carter warweary former military captain w...


In [319]:
vectorizer = TfidfVectorizer(min_df = 2, max_df=0.8) 

In [338]:
vectorized_data = vectorizer.fit_transform(movie['overview'])
# print(vectorizer.get_feature_names_out())

In [321]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
frame = pd.DataFrame(vectorized_data.toarray(),
                    columns = vectorizer.get_feature_names_out())

In [322]:
frame.index = movie['original_title']
print(frame)

                                          aaron  abandon  abandoned  \
original_title                                                        
Avatar                                      0.0      0.0        0.0   
Pirates of the Caribbean: At World's End    0.0      0.0        0.0   
Spectre                                     0.0      0.0        0.0   
The Dark Knight Rises                       0.0      0.0        0.0   
John Carter                                 0.0      0.0        0.0   
...                                         ...      ...        ...   
El Mariachi                                 0.0      0.0        0.0   
Newlyweds                                   0.0      0.0        0.0   
Signed, Sealed, Delivered                   0.0      0.0        0.0   
Shanghai Calling                            0.0      0.0        0.0   
My Date with Drew                           0.0      0.0        0.0   

                                          abandoning  abandonment  abbie  \


### Measuring similarity with cosine similarity

In [323]:
from sklearn.metrics.pairwise import cosine_similarity

In [324]:
frame.index = movie['original_title']
cosine_array = cosine_similarity(frame)
cosine_df = pd.DataFrame(cosine_array,index =frame.index,columns = frame.index)

### Function 

In [325]:
def recommend(x):
    # sort by similarity and skip the first entry, which is the movie itself
    scores = cosine_df.loc[x].sort_values(ascending = False)[1:]
    return scores.head(10)

In [326]:
recommend('Spectre')

original_title
Never Say Never Again          0.336332
From Russia with Love          0.246971
Thunderball                    0.213093
Safe Haven                     0.185243
Quantum of Solace              0.176800
Jason Bourne                   0.149170
Skyfall                        0.143149
Octopussy                      0.132402
In the Valley of Elah          0.130559
The Man with the Golden Gun    0.128952
Name: Spectre, dtype: float64

- As expected, the recommendations are mostly spy movies, in particular other James Bond movies.

## 3. Collaborative Filtering

- Collaborative filtering recommends items based on what similar users like. It groups users with similar tastes and recommends to each user according to the preferences of that group.

List of actions:

1. Creating a new dataframe that consists of userId, rating and title, with the rest excluded. Ideally we could also use additional demographic features such as age or geography to filter movies
2. Creating a pivot table with userId as index, movies as columns and ratings as values. This creates a sparse matrix
3. Centering the data and filling the NaNs with 0

In [327]:
ratings_table = df.loc[:,['userId','rating','title']] #creating a new ratings dataframe

In [328]:
user_table = ratings_table.pivot_table(index = 'userId',columns = 'title',values = 'rating')
user_table

title,'71,'Hellboy': The Seeds of Creation,'Round Midnight,'Salem's Lot,'Til There Was You,'Tis the Season for Love,"'burbs, The",'night Mother,(500) Days of Summer,*batteries not included,...,Zulu,[REC],[REC]²,[REC]³ 3 Génesis,anohana: The Flower We Saw That Day - The Movie,eXistenZ,xXx,xXx: State of the Union,¡Three Amigos!,À nous la liberté (Freedom for Us)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,,,,,,,,,,,...,,,,,,,,,4.0,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,,,,,,,,,,,...,,,,,,,,,,
607,,,,,,,,,,,...,,,,,,,,,,
608,,,,,,,,,,,...,,,,,,4.5,3.5,,,
609,,,,,,,,,,,...,,,,,,,,,,


- It's a sparse matrix/dataframe which is expected as people do not watch/rate every movie in the database.

In [329]:
#calculating average rating of each user

avg_ratings = user_table.mean(axis = 1)

#center each user's rating around 0 

user_table_centered = user_table.sub(avg_ratings,axis = 0)

#fill the missing data with 0 

user_table_final = user_table_centered.fillna(0)

- We center the data around 0 first because filling NaNs with 0 directly would imply the user rated the movie 0/5.
- After centering, 0 corresponds to the user's own average rating, so a filled-in 0 reads as a neutral rating rather than a strong dislike.
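
A small worked example may make the effect of centering concrete. The two users and three films below are made up; only the operations (`mean`, `sub`, `fillna`) mirror the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: NaN means "not rated".
toy_ratings = pd.DataFrame(
    {"Film A": [5.0, 2.0], "Film B": [np.nan, 1.0], "Film C": [3.0, np.nan]},
    index=pd.Index([1, 2], name="userId"),
)

# Row means ignore NaN, so each user's average uses only the films they rated.
user_means = toy_ratings.mean(axis=1)

# Subtract each user's own mean, then fill unrated films with 0,
# which now means "same as this user's average" rather than "0 out of 5".
toy_centered = toy_ratings.sub(user_means, axis=0).fillna(0)
print(toy_centered)
```

User 1 averages 4.0, so their 5.0 becomes +1.0, their 3.0 becomes -1.0, and the unrated film sits at 0.0 between them.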

#### Item based filtering based on user reviews 

In [330]:
#Transpose to make userid into columns

user_movie_pivot = user_table_final.T

In [331]:
user_movie_pivot 

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
'71,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.312789
'Hellboy': The Seeds of Creation,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000
'Round Midnight,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000
'Salem's Lot,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000
'Til There Was You,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
eXistenZ,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,...,0.0,0.0,1.490415,0.0,0.0,0.0,0.0,1.370606,0.0,0.000000
xXx,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-2.26087,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.370606,0.0,-1.687211
xXx: State of the Union,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,-2.187211
¡Three Amigos!,-0.373362,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000


In [332]:
#similarity is between -1 and +1 after we centered the data

cosine_similarity(user_movie_pivot.loc['Departed, The',:].values.reshape(1,-1),
                 user_movie_pivot.loc['Twilight',:].values.reshape(1,-1))

## As expected the above movies are very different from each other

array([[-0.17201319]])

In [333]:
#finding item based similarity

item_based = cosine_similarity(user_movie_pivot)
item_based_df = pd.DataFrame(item_based,index = user_movie_pivot.index,
                             columns = user_movie_pivot.index)

In [334]:
cosine_item_recommendation = item_based_df.loc["Schindler's List"].sort_values(ascending = False)[1:]
cosine_item_recommendation.head(10)

title
Shawshank Redemption, The                     0.394199
Usual Suspects, The                           0.328439
Godfather, The                                0.299780
Forrest Gump                                  0.288181
Silence of the Lambs, The                     0.285953
Saving Private Ryan                           0.275372
Godfather: Part II, The                       0.268039
Pulp Fiction                                  0.258395
12 Angry Men                                  0.252515
Star Wars: Episode VI - Return of the Jedi    0.243258
Name: Schindler's List, dtype: float64

- That's a good list of recommendations.

#### User-User Filtering: Using KNN 

- Uses the average of the ratings that the k most similar users gave a movie to predict the rating a target user would give it.
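
The prediction rule can be sketched by hand with NumPy. The rating matrix, the value of `k`, and the helper `cosine` are all assumptions made for illustration; this is a simplified stand-in for what a nearest-neighbor regressor with uniform weights computes, not the notebook's model.

```python
import numpy as np

# Hypothetical centered ratings (rows = users, columns = movies already seen).
features = np.array([
    [ 1.0, -1.0,  0.5],   # user 0 (target)
    [ 0.9, -0.8,  0.4],   # user 1: similar taste
    [ 1.0, -1.0,  0.0],   # user 2: similar taste
    [-1.0,  1.0, -0.5],   # user 3: opposite taste
])
# Ratings that users 1, 2 and 3 gave the movie we want to predict for user 0.
target_ratings = np.array([4.0, 5.0, 1.0])

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Similarity of the target user to every other user.
sims = np.array([cosine(features[0], other) for other in features[1:]])

# Keep the k most similar users and average their ratings.
k = 2
nearest = np.argsort(sims)[::-1][:k]
predicted = target_ratings[nearest].mean()
print(predicted)
```

The user with opposite taste has a similarity of -1 and is left out, so the prediction is the mean of the two like-minded users' ratings.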

In [335]:
#finding user based similarity

user_knn = cosine_similarity(user_table_final)
user_knn_df = pd.DataFrame(user_knn,index = user_table_final.index,
                             columns = user_table_final.index)

user_knn_df

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,1.000000,0.001279,0.000590,0.049433,0.022008,-0.047110,-0.013769,0.047964,0.019495,-0.008565,...,0.018306,-0.018415,-0.017309,-0.038083,-0.030076,0.012316,0.056576,0.077001,-0.026641,0.004367
2,0.001279,1.000000,0.000000,-0.017164,0.021796,-0.021051,-0.011277,-0.048085,0.000000,0.003012,...,-0.049020,-0.031581,-0.001703,0.000000,0.000000,0.006254,-0.020504,-0.005949,-0.060091,0.025043
3,0.000590,0.000000,1.000000,-0.011260,-0.031539,0.004800,0.000000,-0.032471,0.000000,0.000000,...,-0.004904,-0.016117,0.017863,0.000000,-0.001437,-0.037490,-0.007789,-0.013147,0.000000,0.019609
4,0.049433,-0.017164,-0.011260,1.000000,-0.029620,0.011498,0.058999,0.002065,-0.005874,0.051590,...,-0.037687,0.060523,0.029643,-0.013782,0.040044,0.017081,0.014628,-0.037884,-0.017884,-0.000992
5,0.022008,0.021796,-0.031539,-0.029620,1.000000,0.009111,0.010269,-0.012284,0.000000,-0.033165,...,0.015964,0.012427,0.027204,0.012461,-0.036334,0.029234,0.031896,-0.001783,0.093829,-0.000285
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,0.012316,0.006254,-0.037490,0.017081,0.029234,-0.008436,0.028953,0.022399,0.031822,-0.040162,...,0.053980,0.016488,0.096804,0.062097,0.017334,1.000000,0.018027,0.054311,0.038606,0.074998
607,0.056576,-0.020504,-0.007789,0.014628,0.031896,0.054727,0.020016,0.048822,-0.012161,-0.017656,...,0.049059,0.038197,0.045640,0.008820,-0.029403,0.018027,1.000000,0.042051,0.019049,0.016952
608,0.077001,-0.005949,-0.013147,-0.037884,-0.001783,0.021907,0.026243,0.072031,0.032992,-0.051838,...,0.070107,0.051622,0.011140,0.006394,-0.007895,0.054311,0.042051,1.000000,0.050935,0.059114
609,-0.026641,-0.060091,0.000000,-0.017884,0.093829,0.053017,0.008911,0.077180,0.000000,-0.040090,...,0.043465,0.062400,0.015364,0.094038,-0.054767,0.038606,0.019049,0.050935,1.000000,-0.012480


### Using KNN  

In [336]:
from sklearn.neighbors import KNeighborsRegressor

def recommend_knn(user,movie):
    # Manual version, kept for reference only (its result is not returned):
    # mean rating given to the movie by the 10 most similar users
    user_delta = user_knn_df.loc[user].sort_values(ascending = False) #finding similar users
    neighbors = user_delta[1:11].index                                #selecting the top 10 (position 0 is the user itself)
    neighbors_ratings = user_table.reindex(neighbors)                 #reindexing to retrieve only the neighbors' ratings
    manual_estimate = neighbors_ratings[movie].mean()                 #mean of the neighbors' ratings (NaNs ignored)

    # Drop the column you are trying to predict
    drop = user_table_final.drop(movie, axis=1)

    # Get the data for the user you are predicting for
    target_user_x = drop.loc[[user]]

    # Get the target ratings from user_table, keeping only users who have rated the movie
    # (dropna() returns a copy, so user_table itself is not modified)
    other_users_y = user_table[movie].dropna()

    # Get the features for only those users
    other_users_x = drop.loc[other_users_y.index]

    # Instantiate the user KNN model
    user_knn = KNeighborsRegressor(metric='cosine', n_neighbors=10)

    # Fit the model and predict the target user
    user_knn.fit(other_users_x, other_users_y)
    user_user_pred = user_knn.predict(target_user_x)

    return user_user_pred

In [337]:
recommend_knn(48,"Pirates of the Caribbean: At World's End")

array([3.6])

- According to the KNN model, user 48 would give Pirates of the Caribbean: At World's End a rating of 3.6, based on the ratings of the user's nearest neighbors.