# Recommender Systems Walkthrough

### Intro

There are two main types of recommender system (RS):

- Content Based Filtering

The aim of content-based recommendation is to create a ‘profile’ for each user and each item, then recommend items that are similar to ones the user has previously used.


- Collaborative Filtering

The aim of collaborative filtering (CF) is to find users similar to the target user and recommend products that those similar users liked.
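
To make the two definitions concrete, here is a minimal sketch of both ideas using cosine similarity. The item profiles, the rating matrix and names such as `movie_a` are made up for illustration, and treating a rating of 0 as "not rated" is a simplifying assumption. It is only meant to illustrate the difference between the two approaches.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Content-based: compare item profiles (hypothetical feature vectors).
item_profiles = {
    "movie_a": np.array([1.0, 1.0, 0.0]),
    "movie_b": np.array([1.0, 0.8, 0.0]),
    "movie_c": np.array([0.0, 0.0, 1.0]),
}
liked = "movie_a"  # an item the user has already used
scores = {
    name: cosine(item_profiles[liked], vec)
    for name, vec in item_profiles.items()
    if name != liked
}
content_pick = max(scores, key=scores.get)  # item most similar to the liked one

# Collaborative: compare users via their rating rows (0 = not rated, an assumption).
titles = ["movie_a", "movie_b", "movie_c", "movie_d"]
ratings = np.array([
    [5, 4, 0, 0],  # target user
    [5, 5, 0, 4],
    [1, 0, 5, 0],
])
target = 0
sims = [cosine(ratings[target], ratings[u]) for u in range(len(ratings))]
sims[target] = -1.0  # never pick the target user as their own neighbour
neighbour = int(np.argmax(sims))
# Recommend what the most similar user rated and the target user has not.
cf_picks = [
    titles[j]
    for j in range(len(titles))
    if ratings[target, j] == 0 and ratings[neighbour, j] > 0
]
```

The content-based part never looks at other users, while the collaborative part never looks at item features; that is the key difference between the two.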


![alt text](1_yrkvweErbifbPFkBUyZlOw.png)

### Data Prep

For simplicity, we are using a small subset of the available data.

In [30]:
import pandas as pd

import numpy as np 

In [31]:
credits = pd.read_csv('credits.csv')
credits.head(3)

Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602


In [32]:
keywords = pd.read_csv('keywords.csv')
keywords.head(3)

Unnamed: 0,id,keywords
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."


In [33]:
links = pd.read_csv('links.csv')
links.head(3)

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0


In [34]:
meta = pd.read_csv('movies_metadata.csv')
meta.head(3)


Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0


In [35]:
rating = pd.read_csv('ratings.csv')
rating.head(3)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,110,1.0,1425941529
1,1,147,4.5,1425942435
2,1,858,5.0,1425941523


### Cleaning meta data

In [39]:
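# Keep only a few columns for simplicity (this notebook is about trying things out, not building the best model)
meta = meta[['id','imdb_id','title','overview','genres','vote_average','budget','runtime','adult']].copy()
# Map the string flags to 0/1; assign back rather than calling replace(inplace=True) on a column of a slice
meta['adult'] = meta['adult'].replace({'False': 0, 'True': 1})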

In [40]:
meta.shape

(45466, 9)

In [41]:
meta.head(5)

Unnamed: 0,id,imdb_id,title,overview,genres,vote_average,budget,runtime,adult
0,862,tt0114709,Toy Story,"Led by Woody, Andy's toys live happily in his ...","[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",7.7,30000000,81.0,0
1,8844,tt0113497,Jumanji,When siblings Judy and Peter discover an encha...,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",6.9,65000000,104.0,0
2,15602,tt0113228,Grumpier Old Men,A family wedding reignites the ancient feud be...,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",6.5,0,101.0,0
3,31357,tt0114885,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...","[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",6.1,16000000,127.0,0
4,11862,tt0113041,Father of the Bride Part II,Just when George Banks has recovered from his ...,"[{'id': 35, 'name': 'Comedy'}]",5.7,0,106.0,0


In [42]:
meta.genres = meta.genres.str.extract(r'(\d+)', expand=False) # keeps only the first genre id; a simplification, since a movie can have several genres

In [43]:
meta.genres = pd.to_numeric(meta.genres, errors='coerce')

In [44]:
meta.isnull().sum() / meta.shape[0] * 100.00

id              0.000000
imdb_id         0.037391
title           0.013197
overview        2.098271
genres          5.371046
vote_average    0.013197
budget          0.000000
runtime         0.578454
adult           0.000000
dtype: float64

In [45]:
meta = meta.drop([19730, 29503, 35587]) # Incorrect IDs

meta.dropna(inplace = True)

In [46]:
meta['id'] = meta['id'].astype('int')
meta.genres = meta.genres.astype('int')
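
The genre step above keeps only the first genre id of each movie. If all genres were wanted instead, one possible approach is to parse the raw string and one-hot encode the names. This is a sketch on a small hand-written sample in the same format as the `genres` column; the helper name `genre_names` is hypothetical and is not used elsewhere in this notebook.

```python
from ast import literal_eval

import pandas as pd

# Hand-written sample in the same format as the raw `genres` column.
raw = pd.Series([
    "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]",
    "[{'id': 35, 'name': 'Comedy'}]",
    "[]",
])

def genre_names(cell):
    """Return every genre name in one raw cell, or [] if the cell cannot be parsed."""
    try:
        parsed = literal_eval(cell)
    except (ValueError, SyntaxError, TypeError):
        return []  # e.g. NaN or a malformed string
    if not isinstance(parsed, list):
        return []
    return [g["name"] for g in parsed if isinstance(g, dict) and "name" in g]

names = raw.apply(genre_names)
# One 0/1 column per genre name; a movie with no genres gets a row of zeros.
one_hot = names.apply("|".join).str.get_dummies(sep="|")
```

This gives one indicator column per genre instead of a single integer id, at the cost of a wider dataframe.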

### Cleaning links data

In [47]:
links.head(3)
links.dropna(inplace = True)

In [48]:
links['tmdbId'] = links['tmdbId'].astype('int')
links['imdbId'] = links['imdbId'].astype('int')

In [49]:
links.isnull().sum() / links.shape[0] * 100.00

movieId    0.0
imdbId     0.0
tmdbId     0.0
dtype: float64

In [50]:
links.dropna(inplace = True)

In [51]:
df = pd.merge(meta, links, left_on=['id'], right_on = ['tmdbId'], how='inner')
df.drop(['imdb_id','id'],axis = 1,inplace = True)

In [52]:
df

Unnamed: 0,title,overview,genres,vote_average,budget,runtime,adult,movieId,imdbId,tmdbId
0,Toy Story,"Led by Woody, Andy's toys live happily in his ...",16,7.7,30000000,81.0,0,1,114709,862
1,Jumanji,When siblings Judy and Peter discover an encha...,12,6.9,65000000,104.0,0,2,113497,8844
2,Grumpier Old Men,A family wedding reignites the ancient feud be...,10749,6.5,0,101.0,0,3,113228,15602
3,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",35,6.1,16000000,127.0,0,4,114885,31357
4,Father of the Bride Part II,Just when George Banks has recovered from his ...,35,5.7,0,106.0,0,5,113041,11862
...,...,...,...,...,...,...,...,...,...,...
42368,Caged Heat 3000,It's the year 3000 AD. The world's most danger...,878,3.5,0,85.0,0,176263,112613,222848
42369,Robin Hood,"Yet another version of the classic epic, with ...",18,5.7,0,104.0,0,176267,102797,30840
42370,Subdue,Rising and falling between a man and woman.,18,4.0,0,90.0,0,176269,6209470,439050
42371,Century of Birthing,An artist struggles to finish his work while a...,18,9.0,0,360.0,0,176271,2028550,111109


### Cleaning credits data

In [53]:
credits

Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862
...,...,...,...
45471,"[{'cast_id': 0, 'character': '', 'credit_id': ...","[{'credit_id': '5894a97d925141426c00818c', 'de...",439050
45472,"[{'cast_id': 1002, 'character': 'Sister Angela...","[{'credit_id': '52fe4af1c3a36847f81e9b15', 'de...",111109
45473,"[{'cast_id': 6, 'character': 'Emily Shaw', 'cr...","[{'credit_id': '52fe4776c3a368484e0c8387', 'de...",67758
45474,"[{'cast_id': 2, 'character': '', 'credit_id': ...","[{'credit_id': '533bccebc3a36844cf0011a7', 'de...",227506


In [54]:
df = pd.merge(df, credits, left_on=['tmdbId'], right_on = ['id'], how='inner')
df.drop(['id'],axis = 1,inplace = True)
df.head(3)

Unnamed: 0,title,overview,genres,vote_average,budget,runtime,adult,movieId,imdbId,tmdbId,cast,crew
0,Toy Story,"Led by Woody, Andy's toys live happily in his ...",16,7.7,30000000,81.0,0,1,114709,862,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de..."
1,Jumanji,When siblings Judy and Peter discover an encha...,12,6.9,65000000,104.0,0,2,113497,8844,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de..."
2,Grumpier Old Men,A family wedding reignites the ancient feud be...,10749,6.5,0,101.0,0,3,113228,15602,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de..."


In [55]:
df = pd.merge(df, keywords, left_on=['tmdbId'], right_on = ['id'], how='inner')
df.drop(['id'],axis = 1,inplace = True)
df.head(3)

Unnamed: 0,title,overview,genres,vote_average,budget,runtime,adult,movieId,imdbId,tmdbId,cast,crew,keywords
0,Toy Story,"Led by Woody, Andy's toys live happily in his ...",16,7.7,30000000,81.0,0,1,114709,862,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,Jumanji,When siblings Judy and Peter discover an encha...,12,6.9,65000000,104.0,0,2,113497,8844,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,Grumpier Old Men,A family wedding reignites the ancient feud be...,10749,6.5,0,101.0,0,3,113228,15602,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...","[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
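
One thing worth checking with inner merges like the ones above: if the right-hand table repeats an `id`, the merged dataframe silently gains duplicate rows. The sketch below uses two tiny made-up frames (the duplicated `id` is hypothetical) to show how the `validate` argument of `pd.merge` can surface this, and how `drop_duplicates` could be applied first.

```python
import pandas as pd

# Tiny stand-ins for df and keywords; the repeated id 862 is made up for illustration.
movies = pd.DataFrame({"tmdbId": [862, 8844], "title": ["Toy Story", "Jumanji"]})
extra = pd.DataFrame({"id": [862, 862, 8844], "keywords": ["[]", "[]", "[]"]})

# validate="one_to_one" raises MergeError when a key is repeated on either side.
try:
    pd.merge(movies, extra, left_on="tmdbId", right_on="id",
             how="inner", validate="one_to_one")
    duplicate_keys = False
except pd.errors.MergeError:
    duplicate_keys = True

# Keep the first row per id, then merge again with the same check.
deduped = extra.drop_duplicates(subset="id")
merged = pd.merge(movies, deduped, left_on="tmdbId", right_on="id",
                  how="inner", validate="one_to_one").drop(columns="id")
```

Whether the real `credits` and `keywords` files need this depends on the data; comparing `len(df)` before and after each merge is a quick way to find out.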


In [56]:
from ast import literal_eval
df['cast'] = df['cast'].apply(literal_eval)
df['crew'] = df['crew'].apply(literal_eval)
df['keywords'] =  df['keywords'].apply(literal_eval)
df['cast_size'] = df['cast'].apply(len)
df['crew_size'] = df['crew'].apply(len)

In [57]:
def get_director(x):
    """Return the name of the first crew member whose job is 'Director', or NaN if there is none."""
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

df['director'] = df['crew'].apply(get_director)

df['cast'] = df['cast'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
df['cast'] = df['cast'].apply(lambda x: x[:3])  # keep at most the first three names

df['keywords'] = df['keywords'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
df.drop(['crew'],axis = 1,inplace = True)

In [58]:
df.head(3)

Unnamed: 0,title,overview,genres,vote_average,budget,runtime,adult,movieId,imdbId,tmdbId,cast,keywords,cast_size,crew_size,director
0,Toy Story,"Led by Woody, Andy's toys live happily in his ...",16,7.7,30000000,81.0,0,1,114709,862,"[Tom Hanks, Tim Allen, Don Rickles]","[jealousy, toy, boy, friendship, friends, riva...",13,106,John Lasseter
1,Jumanji,When siblings Judy and Peter discover an encha...,12,6.9,65000000,104.0,0,2,113497,8844,"[Robin Williams, Jonathan Hyde, Kirsten Dunst]","[board game, disappearance, based on children'...",26,16,Joe Johnston
2,Grumpier Old Men,A family wedding reignites the ancient feud be...,10749,6.5,0,101.0,0,3,113228,15602,"[Walter Matthau, Jack Lemmon, Ann-Margret]","[fishing, best friend, duringcreditsstinger, o...",7,4,Howard Deutch


## Content Based Filtering 

Goal: group similar movies together and rank them.

There are many different approaches:

- Recommend movies with similar descriptions, crew and cast, i.e. NLP on the text fields
- Use tabular data, e.g. ratings, cost, etc.

I want to try a combination of the two.


I will:

- Cluster descriptions, crew and cast separately, and make features out of the resulting clusters.
- Then cluster the whole dataframe using those features.
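
The plan above could look roughly like the sketch below. It assumes scikit-learn is available and swaps in TF-IDF plus k-means as the clustering method, which is my choice for illustration and not something fixed by this notebook. The `toy` dataframe, its values and the helper `cluster_text` are all hypothetical stand-ins for the merged `df`.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the merged df (all values are made up for illustration).
toy = pd.DataFrame({
    "overview": [
        "toys come to life and go on an adventure",
        "a board game adventure in the jungle",
        "two old men feud over fishing",
        "old friends argue at a family wedding",
    ],
    "cast": [["Ann Lee", "Bob Ray"], ["Ann Lee", "Cat Fox"],
             ["Dan Kim", "Eve Roe"], ["Dan Kim", "Bob Ray"]],
    "vote_average": [7.7, 6.9, 6.5, 5.7],
    "runtime": [81.0, 104.0, 101.0, 106.0],
})

def cluster_text(texts, n_clusters, seed=0):
    """TF-IDF encode a sequence of strings and return one k-means label per string."""
    vectors = TfidfVectorizer(stop_words="english").fit_transform(texts)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return model.fit_predict(vectors)

# Step 1: cluster each text-like column separately.
toy["overview_cluster"] = cluster_text(toy["overview"], n_clusters=2)
# Join each name with underscores so that one person becomes one token.
cast_text = toy["cast"].apply(lambda names: " ".join(n.replace(" ", "_") for n in names))
toy["cast_cluster"] = cluster_text(cast_text, n_clusters=2)

# Step 2: turn the cluster labels into features and add scaled tabular columns.
numeric_cols = ["vote_average", "runtime"]
scaled = pd.DataFrame(
    StandardScaler().fit_transform(toy[numeric_cols]),
    columns=numeric_cols,
    index=toy.index,
)
features = pd.concat(
    [
        pd.get_dummies(toy["overview_cluster"], prefix="overview"),
        pd.get_dummies(toy["cast_cluster"], prefix="cast"),
        scaled,
    ],
    axis=1,
).astype(float)

# Step 3: cluster the combined feature table.
toy["movie_cluster"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
```

On the real data the number of clusters would need tuning, and crew could be handled the same way as cast. Ranking within a cluster (for example by `vote_average`) is left open here.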

## Collaborative Filtering

![alt text](1_qFweWAKML-SdpGndGMvLDw.png)