# Netflik Film Recommender System

## Overview

A one-paragraph overview of the project, including the business problem, data, methods, results and recommendations.

## Business Problem

Summary of the business problem you are trying to solve, and the data questions that you plan to answer to solve them.

Questions to consider:

- Who are your stakeholders?
- What are your stakeholders' pain points related to this project?
- Why are your predictions important from a business perspective?
- What exactly is your deliverable: your analysis, or the model itself?
- Does your business understanding/stakeholder require a specific type of model?
    - For example: a highly regulated industry would require a very transparent/simple/interpretable model, whereas a situation where the model itself is your deliverable would likely benefit from a more complex and thus stronger model
   

Additional questions to consider for classification:

- What does a false positive look like in this context?
- What does a false negative look like in this context?
- Which is worse for your stakeholder?
- What metric are you focusing on optimizing, given the answers to the above questions?

## Data Understanding

The data for this project comes from the [MovieLens](https://grouplens.org/datasets/movielens/latest/) dataset provided by GroupLens. Specifically, we are working with the small version of the dataset, which contains four CSV files detailing user ratings for a variety of movies. We worked with two of these files for our project: the 'ratings.csv' file and the 'movies.csv' file.

The 'ratings.csv' file contains just over 100,000 user rating records from 610 unique users.  

The 'movies.csv' file contains records of 9,742 films, their title and year produced, and any genre categories that they fall into.
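
As a rough sketch of how these counts can be checked, the helper below summarizes a ratings table and reports how densely the user-item matrix is filled. The function name `summarize_ratings` and the toy frame are illustrative assumptions and not part of the original notebook; in the notebook the same call would be made on `ratings_df`.

```python
import pandas as pd

def summarize_ratings(ratings: pd.DataFrame) -> dict:
    """Return basic size and density statistics for a ratings table."""
    n_users = ratings['userId'].nunique()
    n_items = ratings['movieId'].nunique()
    n_ratings = len(ratings)
    # Density = share of all possible user/movie pairs that actually have a rating.
    density = n_ratings / (n_users * n_items)
    return {'n_users': n_users, 'n_items': n_items,
            'n_ratings': n_ratings, 'density': density}

# Toy data for illustration only (hypothetical values, not MovieLens records).
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 2, 3],
    'movieId': [10, 20, 10, 30],
    'rating':  [4.0, 3.5, 5.0, 2.0],
})
summary = summarize_ratings(toy_ratings)
print(summary)
```

A low density is typical for rating data and is one reason matrix-factorization models are a common starting point.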

## Modeling

Describe and justify the process for analyzing or modeling the data.

Questions to consider:

- How will you analyze the data to arrive at an initial approach?
- How will you iterate on your initial approach to make it better?
- What model type is most appropriate, given the data and the business problem?

In [1]:
import pandas as pd
import numpy as np

# Surprise: data loading, algorithms, evaluation helpers
from surprise import Dataset, Reader, accuracy
from surprise import SVD, SVDpp, NMF, KNNBasic, NormalPredictor, BaselineOnly
from surprise.model_selection import cross_validate, GridSearchCV, train_test_split

### Reading data into python and exploring data info

In [2]:
ratings_df = pd.read_csv('../data/ratings.csv')
ratings_df

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
100831,610,166534,4.0,1493848402
100832,610,168248,5.0,1493850091
100833,610,168250,5.0,1494273047
100834,610,168252,5.0,1493846352


In [3]:
ratings_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [4]:
movies_df = pd.read_csv('../data/movies.csv')
movies_df

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


In [5]:
tags_df = pd.read_csv('../data/tags.csv')
tags_df

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200
...,...,...,...,...
3678,606,7382,for katie,1171234019
3679,606,7936,austere,1173392334
3680,610,3265,gun fu,1493843984
3681,610,3265,heroic bloodshed,1493843978


In [6]:
links_df = pd.read_csv('../data/links.csv')
links_df

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0
...,...,...,...
9737,193581,5476944,432131.0
9738,193583,5914996,445030.0
9739,193585,6397426,479308.0
9740,193587,8391976,483455.0


In [7]:
#Instantiate algorithm from Surprise
algo = SVD()

In [8]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


In [9]:
len(movies_df)

9742

In [10]:
len(ratings_df)

100836

In [11]:
movies_df.isna().sum()

movieId    0
title      0
genres     0
dtype: int64

In [13]:
ratings_df.isna().sum()

userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

### Merging dataframes

We merge the ratings and movies dataframes so that the combined data can be used for the recommendation system. We then save the merged dataframe as a .csv file so it can be reloaded in the correct format for Surprise.

In [14]:
## merge ratings and movies df's
merged_df = pd.merge(ratings_df, movies_df, on='movieId', how='right')

In [15]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100854 entries, 0 to 100853
Data columns (total 6 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  float64
 1   movieId    100854 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  float64
 4   title      100854 non-null  object 
 5   genres     100854 non-null  object 
dtypes: float64(3), int64(1), object(2)
memory usage: 5.4+ MB


In [16]:
merged_df.isna().sum()

userId       18
movieId       0
rating       18
timestamp    18
title         0
genres        0
dtype: int64

In [17]:
merged_df = merged_df.dropna()
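
The 18 missing values above are what a `how='right'` merge produces when some movies have no ratings: those movies are kept, and their rating columns are filled with NaN, which also turns `userId` into a float column. The sketch below reproduces that behavior on toy data and compares it with an inner merge as one possible alternative. The frames and values are hypothetical.

```python
import pandas as pd

# Toy frames for illustration only (hypothetical values).
toy_ratings = pd.DataFrame({'userId': [1, 2], 'movieId': [10, 10], 'rating': [4.0, 5.0]})
toy_movies = pd.DataFrame({'movieId': [10, 20], 'title': ['Movie A', 'Movie B']})

# how='right' keeps the unrated movie 20, so its rating columns are NaN
# and userId is upcast to float64.
right = pd.merge(toy_ratings, toy_movies, on='movieId', how='right')

# how='inner' keeps only movies that have at least one rating,
# so no NaN rows appear and userId stays an integer column.
inner = pd.merge(toy_ratings, toy_movies, on='movieId', how='inner')

print(right['userId'].isna().sum(), right['userId'].dtype)
print(inner['userId'].isna().sum(), inner['userId'].dtype)
```

With an inner merge the later `dropna()` step would have nothing to remove, and user ids would display as `610` rather than `610.0`.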

In [18]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100836 entries, 0 to 100853
Data columns (total 6 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  float64
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  float64
 4   title      100836 non-null  object 
 5   genres     100836 non-null  object 
dtypes: float64(3), int64(1), object(2)
memory usage: 5.4+ MB


In [19]:
merged_df['rating'].value_counts()

4.0    26818
3.0    20047
5.0    13211
3.5    13136
4.5     8551
2.0     7551
2.5     5550
1.0     2811
1.5     1791
0.5     1370
Name: rating, dtype: int64

In [20]:
merged_df.to_csv('../data/ratings_and_movies.csv', index=False)

In [21]:
user_item_rating = merged_df[['userId', 'movieId', 'rating']]

In [22]:
user_item_rating.to_csv('../data/user_item_rating.csv', index=False)

In [23]:
merged_df

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1.0,1,4.0,9.649827e+08,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,5.0,1,4.0,8.474350e+08,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,7.0,1,4.5,1.106636e+09,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
3,15.0,1,2.5,1.510578e+09,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
4,17.0,1,4.5,1.305696e+09,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
...,...,...,...,...,...,...
100849,184.0,193581,4.0,1.537109e+09,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
100850,184.0,193583,3.5,1.537110e+09,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
100851,184.0,193585,3.5,1.537110e+09,Flint (2017),Drama
100852,184.0,193587,3.5,1.537110e+09,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


### Baseline Model Placeholder
NormalPredictor - may give the biggest "improvement"
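
Surprise's `NormalPredictor` is documented as predicting random ratings drawn from a normal distribution fitted to the training ratings. The NumPy-only sketch below illustrates that idea next to a constant global-mean predictor. It is not the library implementation, and the ratings are synthetic, so the printed scores say nothing about MovieLens.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic ratings on the 0.5-5.0 half-star scale (illustrative only).
ratings = rng.choice(np.arange(0.5, 5.5, 0.5), size=2000)
train, test = ratings[:1600], ratings[1600:]

def rmse(y_true, y_pred):
    """Root mean squared error between two arrays."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

mu, sigma = train.mean(), train.std()

# Random baseline: sample from N(mu, sigma^2) and clip to the rating scale.
random_preds = np.clip(rng.normal(mu, sigma, size=test.shape), 0.5, 5.0)
# Constant baseline: always predict the training mean.
mean_preds = np.full(test.shape, mu)

rmse_random = rmse(test, random_preds)
rmse_mean = rmse(test, mean_preds)
print(f'random baseline RMSE: {rmse_random:.3f}')
print(f'mean baseline RMSE:   {rmse_mean:.3f}')
```

A random predictor is a deliberately weak reference point, which is why any tuned model will show a large "improvement" against it.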

### First Model Type: SVD

~Comments about why we started here, what this model can do/achieve.

In [240]:
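# load_from_df only uses rating_scale from the Reader; line_format, sep and skip_lines apply when loading from a file.
# Note: ratings in this data start at 0.5 (see value_counts above), so rating_scale=(0.5, 5) would match the data more closely.
reader = Reader(line_format='user item rating', sep=',', skip_lines=1, rating_scale=(1, 5))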

ratings_surprise = Dataset.load_from_df(ratings_df[['userId', 'movieId', 'rating']], reader)

In [25]:
trainset, testset = train_test_split(ratings_surprise, test_size=0.2, random_state=42)

In [26]:
print('Number of users: ', trainset.n_users, '\n')
print('Number of items: ', trainset.n_items, '\n')

Number of users:  610 

Number of items:  8928 



#### Implementing the basic SVD algorithm
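
For reference, the biased SVD model in Surprise is documented as predicting a rating from a global mean, user and item biases, and a product of latent factor vectors. Written out, with notation taken from the Surprise documentation rather than from this notebook:

```latex
\hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top} p_u

\min_{b,\,p,\,q} \sum_{r_{ui} \in R_{\text{train}}}
  \left(r_{ui} - \hat{r}_{ui}\right)^2
  + \lambda \left(b_i^2 + b_u^2 + \lVert q_i \rVert^2 + \lVert p_u \rVert^2\right)
```

Here `n_factors` is the length of the vectors $p_u$ and $q_i$, `reg_all` sets $\lambda$ for every term, and `lr_all` and `n_epochs` control the stochastic gradient descent that minimizes the objective.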

In [27]:
svd = SVD()
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7ff50a0d1d00>

In [28]:
preds = svd.test(testset)

In [29]:
rmse = accuracy.rmse(preds)
mae = accuracy.mae(preds)

RMSE: 0.8812
MAE:  0.6777
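
Both metrics compare each predicted rating $\hat{r}_{ui}$ with the true rating $r_{ui}$ over the set of test predictions $\hat{R}$. As a reminder of the standard definitions:

```latex
\text{RMSE} = \sqrt{\frac{1}{|\hat{R}|} \sum_{\hat{r}_{ui} \in \hat{R}} \left(r_{ui} - \hat{r}_{ui}\right)^2}
\qquad
\text{MAE} = \frac{1}{|\hat{R}|} \sum_{\hat{r}_{ui} \in \hat{R}} \left|r_{ui} - \hat{r}_{ui}\right|
```

Both are expressed in rating points, so the MAE above corresponds to an average absolute error of about 0.68 points. RMSE squares the errors before averaging, so it penalizes large misses more heavily.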


~Model Evaluation/Comments

#### First Grid Search with SVD model

~Comment on why we tuned the parameters this way for the grid search and what we hope to achieve with it

In [30]:
params = {'n_factors': [20, 25, 50, 100],
         'reg_all': [0.02, 0.03, 0.04, 0.05, 0.06, 0.07], 'n_epochs': [20, 25, 30, 35]}
g_s_svd = GridSearchCV(SVD,param_grid=params,n_jobs=-1, cv=3)
g_s_svd.fit(ratings_surprise)

KeyboardInterrupt: 
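
The `KeyboardInterrupt` above suggests the search was stopped by hand, and the size of the grid is one likely reason. As a quick sketch, the number of model fits a grid search needs can be counted before running it. The grid and `cv` value are copied from the cell above; the variable names are illustrative.

```python
from itertools import product

# Same grid as the first search above.
params = {'n_factors': [20, 25, 50, 100],
          'reg_all': [0.02, 0.03, 0.04, 0.05, 0.06, 0.07],
          'n_epochs': [20, 25, 30, 35]}
cv = 3

# Every combination of values is one candidate model,
# and each candidate is fitted once per cross-validation fold.
combinations = list(product(*params.values()))
n_fits = len(combinations) * cv
print(len(combinations), 'combinations,', n_fits, 'model fits')
```

This grid has 96 combinations and therefore 288 fits, which explains why narrower grids are used in the later searches.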

In [None]:
g_s_svd.best_params['rmse']

In [None]:
# Cross validate
results = cross_validate(svd, ratings_surprise, measures=['RMSE'], cv=3, n_jobs = -1, verbose=True)

In [None]:
svd.fit(trainset)
predictions = svd.test(testset)
svd_baseline = accuracy.rmse(predictions)

In [31]:
# Re-doing the SVD model with the best params from the GridSearchCV
svd_bestparams = SVD(n_factors=100, n_epochs=30, biased=True, reg_all=0.07, random_state=42)

svd_bestparams.fit(trainset)
predictions = svd_bestparams.test(testset)
svd_gs1 = accuracy.rmse(predictions)

RMSE: 0.8693


~Model Evaluation/Comments

- The RMSE score (0.8693) is slightly better than the score of the untuned SVD model (0.8812).

Hyperparameter Analysis:
- n_factors: 100 was the highest value we allowed the grid search to try. Because it came out as the optimal value, we will test whether increasing it further improves the model in later iterations.
- n_epochs: 30 appears to be the best value here. It fell in the middle of the grid options, so we will keep this value.
- biased:
- reg_all: 0.07 is at the highest end of the options tried, so we will test additional, higher values in the next grid search.
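
The reasoning in the list above (a best value on the edge of the grid means the range should be widened) can be checked mechanically. The helper below is a hypothetical sketch, not part of the notebook. The best values are the ones used in the refit cell above; in the notebook they would come from `g_s_svd.best_params['rmse']`.

```python
def params_on_grid_edge(param_grid, best_params):
    """Return the hyperparameters whose best value is the smallest ('low')
    or largest ('high') value that was searched."""
    on_edge = {}
    for name, best in best_params.items():
        values = sorted(param_grid[name])
        if len(values) > 1 and best in (values[0], values[-1]):
            on_edge[name] = 'low' if best == values[0] else 'high'
    return on_edge

grid1 = {'n_factors': [20, 25, 50, 100],
         'reg_all': [0.02, 0.03, 0.04, 0.05, 0.06, 0.07],
         'n_epochs': [20, 25, 30, 35]}
best1 = {'n_factors': 100, 'reg_all': 0.07, 'n_epochs': 30}

edges = params_on_grid_edge(grid1, best1)
print(edges)  # {'n_factors': 'high', 'reg_all': 'high'}
```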

#### Second Grid Search with SVD model

In [None]:
params = {'n_factors': [50, 100, 125, 150],
         'reg_all': [0.06, 0.07, 0.08], 'n_epochs': [30], 'lr_all': [0.02, 0.05]}
g_s_svd2 = GridSearchCV(SVD,param_grid=params,n_jobs=-1, cv=3)
g_s_svd2.fit(ratings_surprise)

In [None]:
g_s_svd2.best_params['rmse']

In [None]:
params = {'n_factors': [150, 175, 200],
         'reg_all': [0.07, 0.08, 0.09, 0.1], 'n_epochs': [30], 'lr_all': [0.01, 0.02]}
g_s_svd3 = GridSearchCV(SVD,param_grid=params,n_jobs=-1, cv=3)
g_s_svd3.fit(ratings_surprise)

In [None]:
g_s_svd3.best_params['rmse']

In [None]:
params = {'n_factors': [200, 300, 500],
         'reg_all': [0.1, 0.3, 0.5], 'n_epochs': [30], 'lr_all': [0.02]}
g_s_svd4 = GridSearchCV(SVD,param_grid=params,n_jobs=-1, cv=3)
g_s_svd4.fit(ratings_surprise)

In [None]:
g_s_svd4.best_params['rmse']

In [None]:
params = {'n_factors': [1, 500, 1000, 1500],
         'reg_all': [0.1], 'n_epochs': [30], 'lr_all': [0.02]}
g_s_svd5 = GridSearchCV(SVD,param_grid=params,n_jobs=-1, cv=3)
g_s_svd5.fit(ratings_surprise)

In [None]:
g_s_svd5.best_params['rmse']

In [32]:
svd_bestparams2 = SVD(n_factors=500, n_epochs=30, biased=True, reg_all=0.1, lr_all=0.02, random_state=42)

svd_bestparams2.fit(trainset)
predictions = svd_bestparams2.test(testset)
svd_gs2 = accuracy.rmse(predictions)

RMSE: 0.8568


~Model Evaluation/Comments

- This is the lowest RMSE score we achieved using the SVD model.

### Second Model Type: KNN

In [33]:
#KNN
sim = {'user_based': True, 'name': 'pearson'}
KNN = KNNBasic(sim_options=sim)
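
The cell above configures a user-based neighborhood model that compares users with Pearson correlation. As a minimal sketch of what that similarity measures, the function below computes Pearson correlation between two users over the movies both have rated. Centering on the co-rated items and returning 0.0 when there is too little overlap are assumptions of this sketch and may differ from the library's exact implementation.

```python
import numpy as np

def pearson_sim(ratings_u: dict, ratings_v: dict) -> float:
    """Pearson correlation between two users over their co-rated movies.

    Each argument maps movieId -> rating.
    """
    common = sorted(set(ratings_u) & set(ratings_v))
    if len(common) < 2:
        return 0.0  # not enough overlap to measure a trend
    x = np.array([ratings_u[i] for i in common], dtype=float)
    y = np.array([ratings_v[i] for i in common], dtype=float)
    x_c, y_c = x - x.mean(), y - y.mean()
    denom = np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum())
    return float((x_c * y_c).sum() / denom) if denom > 0 else 0.0

# Toy users (hypothetical ratings): b rates like a, c rates in the opposite order.
user_a = {1: 1.0, 2: 2.0, 3: 3.0}
user_b = {1: 2.0, 2: 4.0, 3: 5.0, 4: 1.0}
user_c = {1: 3.0, 2: 2.0, 3: 1.0}

sim_ab = pearson_sim(user_a, user_b)
sim_ac = pearson_sim(user_a, user_c)
print(round(sim_ab, 3), round(sim_ac, 3))
```

Users with a high positive similarity become the neighbors whose ratings are averaged to predict a missing rating.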

In [34]:
trainset, testset = train_test_split(ratings_surprise, test_size=0.2, random_state=42)

In [35]:
KNN.fit(trainset)

Computing the pearson similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBasic at 0x7ff50a0d1c40>

In [36]:
nmf = NMF()

In [37]:
nmf.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.NMF at 0x7ff50b73d3d0>

In [38]:
nmf_preds = nmf.test(testset)

In [39]:
rmse_nmf = accuracy.rmse(nmf_preds)
print("RMSE (NMF):", rmse_nmf)

RMSE: 0.9265
RMSE (NMF): 0.9265056869664153


In [40]:
NMF_bestparamsSVD = NMF(n_factors=500, n_epochs=30, biased=True, random_state=42)

In [41]:
NMF_bestparamsSVD.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.NMF at 0x7ff518d12eb0>

In [42]:
nmf_preds2 = NMF_bestparamsSVD.test(testset)

In [43]:
rmse_nmf2 = accuracy.rmse(nmf_preds2)
print("RMSE (NMF):", rmse_nmf2)

RMSE: 1.8088
RMSE (NMF): 1.8087746792544073


In [None]:
params = {'n_factors': [10, 100, 500],
         'n_epochs': [75, 100, 135]}
g_s_nmf1 = GridSearchCV(NMF,param_grid=params,n_jobs=-1, cv=3)
g_s_nmf1.fit(ratings_surprise)

In [None]:
g_s_nmf1.best_params['rmse']

In [44]:
NMF2 = NMF(n_factors=500, n_epochs=100, biased=True, random_state=42)

In [45]:
NMF2.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.NMF at 0x7ff50a0d1c10>

In [46]:
nmf_preds3 = NMF2.test(testset)

In [47]:
rmse_nmf3 = accuracy.rmse(nmf_preds3)
print("RMSE (NMF):", rmse_nmf3)

RMSE: 1.9353
RMSE (NMF): 1.9353167735262842


### SVD++

In [48]:
svdpp = SVDpp()

In [49]:
svdpp.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVDpp at 0x7ff50b7242b0>

In [50]:
preds_svdpp = svdpp.test(testset)

In [51]:
rmse_svdpp = accuracy.rmse(preds_svdpp)
print("RMSE (SVD++):", rmse_svdpp)

RMSE: 0.8689
RMSE (SVD++): 0.8688701924247805


In [52]:
svdpp_bestparams = SVDpp(n_factors=25, n_epochs=30, reg_all=0.1, lr_all=0.02, random_state=42)

svdpp_bestparams.fit(trainset)
predictions = svdpp_bestparams.test(testset)
svdpp_gs = accuracy.rmse(predictions)

RMSE: 0.8610
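
To compare the models side by side, the test-set RMSE values printed in the cells above can be collected in one place. The sketch below copies those values by hand; the labels are shorthand chosen for this example.

```python
import pandas as pd

# RMSE values copied from the cell outputs above (same 80/20 split).
rmse_results = pd.Series({
    'SVD (default)': 0.8812,
    'SVD (n_factors=100, reg_all=0.07)': 0.8693,
    'SVD (n_factors=500, reg_all=0.1)': 0.8568,
    'NMF (default)': 0.9265,
    'NMF (n_factors=500, n_epochs=30)': 1.8088,
    'NMF (n_factors=500, n_epochs=100)': 1.9353,
    'SVD++ (default)': 0.8689,
    'SVD++ (n_factors=25, reg_all=0.1)': 0.8610,
}, name='rmse').sort_values()

best_name = rmse_results.index[0]
print(rmse_results)
print('Lowest RMSE:', best_name)
```

The tuned SVD model with `n_factors=500` and `reg_all=0.1` has the lowest RMSE of the models tried.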


### Building a Function to Generate Film Recommendations

In [54]:
best_model = svd_bestparams2

In [56]:
best_model.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7ff50b733d00>

In [57]:
watched_df = ratings_df.set_index('userId')
watched_df.drop(columns=['rating', 'timestamp'], inplace=True)
watched_df

Unnamed: 0_level_0,movieId
userId,Unnamed: 1_level_1
1,1
1,3
1,6
1,47
1,50
...,...
610,166534
610,168248
610,168250
610,168252


In [None]:
watched_df.to_csv('../data/watched_df.csv')

In [266]:
user = int(input('UserId: '))

UserId:  474


In [267]:
type(user)

int

In [250]:
watched_list = list(watched_df.loc[user, 'movieId'])
len(watched_list)

2108

In [252]:
unwatched_list = movies_df.set_index('movieId')
unwatched_list

Unnamed: 0_level_0,title,genres
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,Father of the Bride Part II (1995),Comedy
...,...,...
193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
193585,Flint (2017),Drama
193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


In [259]:
unwatched_list.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7634 entries, 0 to 7633
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  7634 non-null   int64 
 1   title    7634 non-null   object
 2   genres   7634 non-null   object
dtypes: int64(1), object(2)
memory usage: 179.0+ KB


In [254]:
unwatched_list.drop(watched_list, inplace=True)
unwatched_list.reset_index(inplace=True)
unwatched_list.head()

Unnamed: 0,movieId,title,genres
0,3,Grumpier Old Men (1995),Comedy|Romance
1,4,Waiting to Exhale (1995),Comedy|Drama|Romance
2,8,Tom and Huck (1995),Adventure|Children
3,9,Sudden Death (1995),Action
4,10,GoldenEye (1995),Action|Adventure|Thriller


In [255]:
len(movies_df) - len(unwatched_list)

2108

In [268]:
def movie_recommender():
    # Cast the input to int so it matches the integer userId index of watched_df
    user = int(input('userId:'))

    # Movies the user has already rated
    watched_list = list(watched_df.loc[user, 'movieId'])
    # Index by movieId so the watched movies can be dropped by label
    unwatched_list = movies_df.set_index('movieId')
    unwatched_list.drop(watched_list, inplace=True)
    unwatched_list.reset_index(inplace=True)
    # Predict a rating for every unwatched movie and rank from highest to lowest
    unwatched_list['pred_rating'] = unwatched_list['movieId'].apply(lambda i: best_model.predict(user, i).est)
    unwatched_list.sort_values(by='pred_rating', ascending=False, inplace=True)

    return unwatched_list.head()

In [79]:
movie_recommender()

userId: 474


Unnamed: 0,movieId,title,genres,pred_rating
7510,177593,"Three Billboards Outside Ebbing, Missouri (2017)",Crime|Drama,4.314493
421,741,Ghost in the Shell (Kôkaku kidôtai) (1995),Animation|Sci-Fi,4.26225
1518,3275,"Boondock Saints, The (2000)",Action|Crime|Drama|Thriller,4.243491
4604,58559,"Dark Knight, The (2008)",Action|Crime|Drama|IMAX,4.190657
6358,112552,Whiplash (2014),Drama,4.181801


In [65]:
movie_recommender()

userId:  610


Unnamed: 0,movieId,title,genres,pred_rating
827,1204,Lawrence of Arabia (1962),Adventure|Drama|War,4.382414
810,1178,Paths of Glory (1957),Drama|War,4.364607
772,1104,"Streetcar Named Desire, A (1951)",Drama,4.328158
805,1172,Cinema Paradiso (Nuovo cinema Paradiso) (1989),Drama,4.327683
8316,177593,"Three Billboards Outside Ebbing, Missouri (2017)",Crime|Drama,4.314493


In [269]:
movie_recommender()

userId: 2


Unnamed: 0,movieId,title,genres,pred_rating
2579,3451,Guess Who's Coming to Dinner (1967),Drama,4.520722
935,1237,"Seventh Seal, The (Sjunde inseglet, Det) (1957)",Drama,4.424485
9589,177593,"Three Billboards Outside Ebbing, Missouri (2017)",Crime|Drama,4.396874
678,898,"Philadelphia Story, The (1940)",Comedy|Drama|Romance,4.387048
1419,1945,On the Waterfront (1954),Crime|Drama,4.384074


In [67]:
movie_recommender()

userId:  2


KeyError: '2'
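
The `KeyError: '2'` above is consistent with the user id reaching `.loc` as a string (for example, from `input()` without `int(...)`) while the index of `watched_df` holds integers. That reading is an assumption based on the quoted key in the error message. The sketch below reproduces the behavior on a toy frame with hypothetical rows.

```python
import pandas as pd

# Toy stand-in for watched_df: integer userId index, one row per rated movie.
toy_watched = pd.DataFrame(
    {'movieId': [318, 333, 1704]},
    index=pd.Index([2, 2, 3], name='userId'),
)

try:
    toy_watched.loc['2', 'movieId']  # string label: not present in an integer index
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Casting the input to int first makes the label match the index.
watched_list = list(toy_watched.loc[int('2'), 'movieId'])
print(lookup_failed, watched_list)
```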

In [83]:
user=2
watched_list2 = list(watched_df.loc[user, 'movieId'])
watched_list2

[318,
 333,
 1704,
 3578,
 6874,
 8798,
 46970,
 48516,
 58559,
 60756,
 68157,
 71535,
 74458,
 77455,
 79132,
 80489,
 80906,
 86345,
 89774,
 91529,
 91658,
 99114,
 106782,
 109487,
 112552,
 114060,
 115713,
 122882,
 131724]

In [242]:
def movie_recommender2(user):
    watched_list = list(watched_df.loc[user, 'movieId'])
    unwatched_list = movies_df.set_index('movieId')
    unwatched_list.drop(watched_list, inplace=True)
    unwatched_list.reset_index(inplace=True)
    unwatched_list['pred_rating'] = unwatched_list['movieId'].apply(lambda i: best_model.predict(user, i).est)
    unwatched_list.sort_values(by='pred_rating', ascending=False, inplace=True)
    
    return unwatched_list.head()

In [243]:
movie_recommender2(2)

Unnamed: 0,movieId,title,genres,pred_rating
2579,3451,Guess Who's Coming to Dinner (1967),Drama,4.520722
935,1237,"Seventh Seal, The (Sjunde inseglet, Det) (1957)",Drama,4.424485
9589,177593,"Three Billboards Outside Ebbing, Missouri (2017)",Crime|Drama,4.396874
678,898,"Philadelphia Story, The (1940)",Comedy|Drama|Romance,4.387048
1419,1945,On the Waterfront (1954),Crime|Drama,4.384074
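
One possible variation on the functions above is to pass in the data and the prediction function as arguments, which makes the recommender easier to test and lets the caller choose how many titles to return. The sketch below is a hypothetical refactoring on toy data. `recommend_top_n`, `dummy_predict` and the toy frames are not part of the notebook; with the fitted model, `predict` could be `lambda u, i: best_model.predict(u, i).est`.

```python
import pandas as pd

def recommend_top_n(user, ratings, movies, predict, n=5):
    """Return the n movies the user has not rated with the highest predicted rating.

    `predict` is any callable (user_id, movie_id) -> float.
    """
    watched = list(ratings.loc[ratings['userId'] == user, 'movieId'])
    candidates = movies.loc[~movies['movieId'].isin(watched)].copy()
    candidates['pred_rating'] = [predict(user, m) for m in candidates['movieId']]
    return candidates.sort_values('pred_rating', ascending=False).head(n)

# Toy data for illustration only (hypothetical values).
toy_ratings = pd.DataFrame({'userId': [1, 1, 2],
                            'movieId': [10, 20, 10],
                            'rating': [4.0, 3.0, 5.0]})
toy_movies = pd.DataFrame({'movieId': [10, 20, 30, 40],
                           'title': ['A', 'B', 'C', 'D']})

def dummy_predict(user, movie):
    """Stand-in predictor: larger movie ids get higher scores."""
    return movie / 10.0

top = recommend_top_n(1, toy_ratings, toy_movies, dummy_predict, n=2)
print(top)
```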


### Building a Function to Generate User Profile Information

In [205]:
profile = merged_df.copy()
profile

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1.0,1,4.0,9.649827e+08,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,5.0,1,4.0,8.474350e+08,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,7.0,1,4.5,1.106636e+09,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
3,15.0,1,2.5,1.510578e+09,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
4,17.0,1,4.5,1.305696e+09,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
...,...,...,...,...,...,...
100849,184.0,193581,4.0,1.537109e+09,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
100850,184.0,193583,3.5,1.537110e+09,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
100851,184.0,193585,3.5,1.537110e+09,Flint (2017),Drama
100852,184.0,193587,3.5,1.537110e+09,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


In [218]:
user=610
profile = merged_df.loc[merged_df['userId'] == user, ['userId', 'title', 'rating']]
profile

Unnamed: 0,userId,title,rating
214,610.0,Toy Story (1995),5.0
534,610.0,Heat (1995),5.0
954,610.0,Casino (1995),4.5
1678,610.0,Twelve Monkeys (a.k.a. 12 Monkeys) (1995),4.5
2309,610.0,Seven (a.k.a. Se7en) (1995),5.0
...,...,...,...
100250,610.0,Split (2017),4.0
100312,610.0,John Wick: Chapter Two (2017),5.0
100327,610.0,Get Out (2017),5.0
100352,610.0,Logan (2017),5.0


In [207]:
profile.set_index('userId', inplace=True)
profile

Unnamed: 0_level_0,title,rating
userId,Unnamed: 1_level_1,Unnamed: 2_level_1
610.0,Toy Story (1995),5.0
610.0,Heat (1995),5.0
610.0,Casino (1995),4.5
610.0,Twelve Monkeys (a.k.a. 12 Monkeys) (1995),4.5
610.0,Seven (a.k.a. Se7en) (1995),5.0
...,...,...
610.0,Split (2017),4.0
610.0,John Wick: Chapter Two (2017),5.0
610.0,Get Out (2017),5.0
610.0,Logan (2017),5.0


In [196]:
profile.sort_values(by='rating', ascending=False, inplace=True)
profile

Unnamed: 0_level_0,title,rating
userId,Unnamed: 1_level_1,Unnamed: 2_level_1
610.0,Toy Story (1995),5.0
610.0,In Bruges (2008),5.0
610.0,Blue Velvet (1986),5.0
610.0,"Bourne Ultimatum, The (2007)",5.0
610.0,John Wick (2014),5.0
...,...,...
610.0,Stan Helsing (2009),1.0
610.0,Taken 3 (2015),1.0
610.0,Derailed (2002),0.5
610.0,"Crow, The: Wicked Prayer (2005)",0.5


In [220]:
user = 610
profile_ = merged_df.loc[merged_df['userId'] == user, ['userId', 'title', 'rating']]
profile_.set_index('userId', inplace=True)
profile_.sort_values(by='rating', ascending=False, inplace=True)
profile_.reset_index(inplace=True)
profile_

Unnamed: 0,userId,title,rating
0,610.0,Toy Story (1995),5.0
1,610.0,In Bruges (2008),5.0
2,610.0,Blue Velvet (1986),5.0
3,610.0,"Bourne Ultimatum, The (2007)",5.0
4,610.0,John Wick (2014),5.0
...,...,...,...
1297,610.0,Stan Helsing (2009),1.0
1298,610.0,Taken 3 (2015),1.0
1299,610.0,Derailed (2002),0.5
1300,610.0,"Crow, The: Wicked Prayer (2005)",0.5


In [208]:
profile.reset_index(inplace=True)
profile

Unnamed: 0,userId,title,rating
0,610.0,Toy Story (1995),5.0
1,610.0,Heat (1995),5.0
2,610.0,Casino (1995),4.5
3,610.0,Twelve Monkeys (a.k.a. 12 Monkeys) (1995),4.5
4,610.0,Seven (a.k.a. Se7en) (1995),5.0
...,...,...,...
1297,610.0,Split (2017),4.0
1298,610.0,John Wick: Chapter Two (2017),5.0
1299,610.0,Get Out (2017),5.0
1300,610.0,Logan (2017),5.0


In [198]:
profile['rating'].value_counts()

3.5    315
4.0    286
3.0    230
5.0    180
4.5    148
2.5     74
2.0     42
1.0     13
1.5     11
0.5      3
Name: rating, dtype: int64

In [184]:
profile.head(20)

Unnamed: 0,userId,title,rating
214,610.0,Toy Story (1995),5.0
534,610.0,Heat (1995),5.0
954,610.0,Casino (1995),4.5
1678,610.0,Twelve Monkeys (a.k.a. 12 Monkeys) (1995),4.5
2309,610.0,Seven (a.k.a. Se7en) (1995),5.0
2582,610.0,"Usual Suspects, The (1995)",4.0
2913,610.0,From Dusk Till Dawn (1996),4.0
3178,610.0,Broken Arrow (1996),3.5
3623,610.0,Braveheart (1995),4.5
3727,610.0,Taxi Driver (1976),5.0


In [200]:
print(f'Profile for User: {user}')
print('______________________________')
print('Highest rated films by User (Rated 5):')
print(profile['title'][0])
print(profile['title'][1])
print(profile['title'][2])
print(profile['title'][3])
print(profile['title'][4])

Profile for User: 610
______________________________
Highest rated films by User (Rated 5):
Toy Story (1995)
In Bruges (2008)
Blue Velvet (1986)
Bourne Ultimatum, The (2007)
John Wick (2014)


In [283]:
def profile_builder():

    user = int(input('Profile for User: '))

    # Keep only this user's ratings and sort them from highest to lowest
    profile = merged_df.loc[merged_df['userId'] == user, ['userId', 'title', 'rating']]
    profile = profile.sort_values(by='rating', ascending=False).reset_index(drop=True)

    print('====================================')
    print('Highest rated films by User (Rated 5):')
    print('====================================')
    # head(5) avoids an error for users with fewer than five ratings
    for title in profile['title'].head(5):
        print(title)

In [286]:
profile_builder()

Profile for User:  474


Highest rated films by User (Rated 5):
Safety Last! (1923)
Strictly Ballroom (1992)
Moonstruck (1987)
Enchanted April (1992)
Harry Potter and the Goblet of Fire (2005)


Thinking of options to include in the profile:
- Should we include a detail about how many films the user rated 5, 4, etc.?
- Should we list some of the top rated films? If so, how many?

In [285]:
profile['rating'].value_counts()

3.5    315
4.0    286
3.0    230
5.0    180
4.5    148
2.5     74
2.0     42
1.0     13
1.5     11
0.5      3
Name: rating, dtype: int64

Code to add to function if we go this route:

In [230]:
profile5 = profile.loc[profile['rating'] == 5.0]
len(profile5)

180

In [232]:
profile4 = profile.loc[profile['rating'] == 4.0]
len(profile4)

286
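
If the profile goes this route, the per-rating counts and the top titles could be returned together instead of printed. The sketch below is one hypothetical way to do that on toy data. `build_profile` and its return format are assumptions and not part of the notebook; in the notebook it would be called on `merged_df`.

```python
import pandas as pd

def build_profile(rated, user, n_top=5):
    """Summarize one user's ratings: top titles and a count per rating value."""
    user_rows = rated.loc[rated['userId'] == user, ['title', 'rating']]
    top_titles = (user_rows.sort_values('rating', ascending=False)
                           .head(n_top)['title'].tolist())
    rating_counts = user_rows['rating'].value_counts().to_dict()
    return {'user': user,
            'n_rated': len(user_rows),
            'top_titles': top_titles,
            'rating_counts': rating_counts}

# Toy data for illustration only (hypothetical values).
toy_rated = pd.DataFrame({
    'userId': [1, 1, 1, 1, 2],
    'title':  ['A', 'B', 'C', 'D', 'A'],
    'rating': [5.0, 3.0, 4.0, 3.0, 2.0],
})

profile_summary = build_profile(toy_rated, user=1, n_top=2)
print(profile_summary)
```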

## Evaluation

The evaluation of each model should accompany the creation of each model, and you should be sure to evaluate your models consistently.

Evaluate how well your work solves the stated business problem. 

Questions to consider:

- How do you interpret the results?
- How well does your model fit your data? How much better is this than your baseline model? Is it over or under fit?
- How well does your model/data fit any relevant modeling assumptions?

For the final model, you might also consider:

- How confident are you that your results would generalize beyond the data you have?
- How confident are you that this model would benefit the business if put into use?
- What does this final model tell you about the relationship between your inputs and outputs?

## Conclusions

Provide your conclusions about the work you've done, including any limitations or next steps.

Questions to consider:

- What would you recommend the business do as a result of this work?
- How could the stakeholder use your model effectively?
- What are some reasons why your analysis might not fully solve the business problem?
- What else could you do in the future to improve this project (future work)?