# Recommender Systems Walk Through

### Intro

Two distinct types of recommender system (RS):

- Content Based Filtering

The aim of content-based recommendation is to create a 'profile' for each user and each item, then recommend items that are similar to ones the user has previously used.


- Collaborative Filtering

The aim of collaborative filtering (CF) is to find similar users and recommend products that those similar users liked.

Finally, I will implement a simple hybrid model.

![alt text](1_yrkvweErbifbPFkBUyZlOw.png)

### Data Prep

For simplicity, we are using a small subset of the available data.

In [74]:
import pandas as pd
from nltk.stem.snowball import SnowballStemmer
from surprise import Reader, Dataset
import numpy as np 

In [75]:
credits = pd.read_csv('credits.csv')
credits.head(3)

Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602


In [76]:
keywords = pd.read_csv('keywords.csv')
keywords.head(3)

Unnamed: 0,id,keywords
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."


In [77]:
links = pd.read_csv('links.csv')
links.head(3)

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0


In [78]:
meta = pd.read_csv('movies_metadata.csv')
meta.head(3)


Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0


In [None]:
rating = pd.read_csv('ratings.csv')
rating.head(3)

### Cleaning meta data

In [None]:
# Keep only a few columns for simplicity (this notebook is for trying things out, not for building the best model)
meta = meta[['id','imdb_id','title','overview','genres','vote_average','budget','runtime','adult']].copy()
# Map the string flags to 0/1; assigning the result back avoids modifying a slice in place
meta['adult'] = meta['adult'].replace({'False': 0, 'True': 1})

In [None]:
meta.shape

In [None]:
meta.head(5)

In [None]:
# Keeps only the first genre id; this loses information because most movies have several genres, but it keeps things simple
meta.genres = meta.genres.str.extract(r'(\d+)', expand=False)

In [None]:
meta.genres = pd.to_numeric(meta.genres, errors='coerce')
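
Keeping only the first genre id throws away the other genres. A minimal sketch of keeping every genre name instead, assuming the raw `genres` column holds stringified lists of dicts as in the `meta.head(3)` output above. It would have to run on the raw column, before the extraction step. The `parse_genres` helper and the toy frame are hypothetical, not part of the notebook.

```python
from ast import literal_eval

import pandas as pd


def parse_genres(raw):
    """Return the list of genre names from a stringified list of dicts."""
    try:
        items = literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        return []  # malformed or missing entry
    if not isinstance(items, list):
        return []
    return [g['name'] for g in items if isinstance(g, dict) and 'name' in g]


# Toy rows shaped like the raw `genres` column (illustrative only)
toy = pd.DataFrame({'genres': [
    "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]",
    "[]",
    None,
]})
toy['genre_names'] = toy['genres'].apply(parse_genres)
print(toy['genre_names'].tolist())
```

The resulting lists could later be joined into the text features used for content-based filtering, at the cost of a non-numeric `genres` column.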

In [None]:
meta.isnull().sum() / meta.shape[0] * 100.00

In [None]:
meta = meta.drop([19730, 29503, 35587]) # Incorrect IDs

meta.dropna(inplace = True)

In [None]:
meta['id'] = meta['id'].astype('int')
meta.genres = meta.genres.astype('int')

#### links data

In [None]:
links.head(3)
links.dropna(inplace = True)

In [None]:
links['tmdbId'] = links['tmdbId'].astype('int')
links['imdbId'] = links['imdbId'].astype('int')

In [None]:
links.isnull().sum() / links.shape[0] * 100.00

In [None]:
links.dropna(inplace = True)

In [None]:
df = pd.merge(meta, links, left_on=['id'], right_on = ['tmdbId'], how='inner')
df.drop(['imdb_id','id'],axis = 1,inplace = True)

In [None]:
df

### Cleaning credits data

In [None]:
credits

In [None]:
df = pd.merge(df, credits, left_on=['tmdbId'], right_on = ['id'], how='inner')
df.drop(['id'],axis = 1,inplace = True)
df.head(3)

In [None]:
df = pd.merge(df, keywords, left_on=['tmdbId'], right_on = ['id'], how='inner')
df.drop(['id'],axis = 1,inplace = True)
df.head(3)

In [None]:
from ast import literal_eval
df['cast'] = df['cast'].apply(literal_eval)
df['crew'] = df['crew'].apply(literal_eval)
df['keywords'] =  df['keywords'].apply(literal_eval)
df['cast_size'] = df['cast'].apply(len)
df['crew_size'] = df['crew'].apply(len)

In [None]:
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

df['director'] = df['crew'].apply(get_director)

df['cast'] = df['cast'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
df['cast'] = df['cast'].apply(lambda x: x[:3])  # keep at most the first three cast members

df['keywords'] = df['keywords'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
df.drop(['crew'],axis = 1,inplace = True)

In [None]:
df.head(3)

# Simple recommender

Simply suggesting the most 'popular' movies

In [None]:
# Very naive approach: a proper ranking should take the number of votes into account, not just the average vote.

df.sort_values('vote_average', ascending=False).head(5)
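
Sorting by `vote_average` alone lets a movie with a handful of votes outrank a widely rated one. One common remedy is an IMDB-style weighted rating, `WR = v/(v+m) * R + m/(v+m) * C`, where `v` is the movie's vote count, `R` its average vote, `C` the mean vote across all movies and `m` a minimum-votes threshold. The sketch below uses made-up data. Note that `vote_count` is not among the columns kept in `meta` above, so it would need to be retained, and using the median as `m` is an arbitrary assumption.

```python
import pandas as pd

# Toy data (illustrative values, not taken from the dataset)
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'vote_average': [9.5, 8.0, 7.5, 6.0],
    'vote_count': [3, 2000, 5000, 800],
})

C = toy['vote_average'].mean()        # mean rating across all movies
m = toy['vote_count'].quantile(0.5)   # minimum-votes threshold (assumed)


def weighted_rating(row, m=m, C=C):
    """Shrink a movie's average towards C when it has few votes."""
    v, R = row['vote_count'], row['vote_average']
    return v / (v + m) * R + m / (v + m) * C


toy['score'] = toy.apply(weighted_rating, axis=1)
print(toy.sort_values('score', ascending=False)[['title', 'score']])
```

With these numbers, movie A (9.5 average from only 3 votes) no longer comes first, because its score is pulled towards the global mean.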

## Content Based Filtering 

Goal: be able to group similar movies together and have a ranking system

Many different approaches:

- Recommend movies with similar descriptions, crew and cast, i.e. NLP
- Tabular data, i.e. ratings, cost, etc.

I want to try a combination


I will:

- Cluster descriptions, crew and cast separately, and make features out of these.
- Then cluster the dataframe.
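
The plan above can be sketched roughly as follows: vectorize one text column, cluster it, and store the cluster label as a new feature. TF-IDF and k-means are assumed here as one reasonable choice, since the notebook does not commit to a specific method. The toy texts and `n_clusters=2` are illustrative only.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy overviews (illustrative only)
toy = pd.DataFrame({'overview': [
    'a toy comes to life when the boy leaves the room',
    'a cowboy toy and a space toy become friends',
    'two detectives hunt a serial killer in the city',
    'a detective chases a killer across the city',
]})

# Step 1: turn the text column into a sparse TF-IDF matrix
tfidf = TfidfVectorizer(stop_words='english')
X = tfidf.fit_transform(toy['overview'])

# Step 2: cluster the texts and keep the label as a categorical feature
km = KMeans(n_clusters=2, n_init=10, random_state=0)
toy['overview_cluster'] = km.fit_predict(X)

# The same two steps would be repeated for cast and crew, and the resulting
# cluster columns joined with the numeric columns before a final clustering.
print(toy)
```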

In [None]:
dff = df.sample(10000)

In [None]:
dff.dropna(inplace = True)

In [None]:
dff.head(5)

In [None]:
dff.keywords

In [None]:
s = dff.apply(lambda x: pd.Series(x['keywords']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'keyword'




In [None]:
s = s.value_counts()
s[:10]

In [None]:
s = s[s > 1]
s

In [None]:

stemmer = SnowballStemmer('english')

In [None]:
def filter_keywords(x):
    words = []
    for i in x:
        if i in s:
            words.append(i)
    return words

In [None]:
dff['keywords'] = dff['keywords'].apply(filter_keywords)
dff['keywords'] = dff['keywords'].apply(lambda x: [stemmer.stem(i) for i in x])
dff['keywords'] = dff['keywords'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])

### Below is a bit of a trick

A better way would be to compute similarity on keywords, cast and director separately and then combine the results to find similar movies.

Instead (for speed), I simply join all of these strings into a single string per movie.
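
For comparison, the "better way" described above could look something like this: build one cosine-similarity matrix per field and blend them with weights. This is only a sketch on a toy frame. The `field_similarity` helper and the weights are hypothetical and would need tuning on real data.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy frame with list-valued columns, shaped like dff (values are illustrative)
toy = pd.DataFrame({
    'title': ['M1', 'M2', 'M3'],
    'keywords': [['toy', 'friendship'], ['toy', 'rivalry'], ['heist']],
    'cast': [['Actor A', 'Actor B'], ['Actor A'], ['Actor C']],
    'director': [['Director X'], ['Director X'], ['Director Y']],
})


def field_similarity(lists):
    """Cosine similarity between rows, treating each list item as one token."""
    vec = CountVectorizer(analyzer=lambda items: items)
    return cosine_similarity(vec.fit_transform(lists))


# Hypothetical weights; they would need tuning on real data
weights = {'keywords': 0.5, 'cast': 0.3, 'director': 0.2}
combined = sum(w * field_similarity(toy[col]) for col, w in weights.items())
print(pd.DataFrame(combined, index=toy['title'], columns=toy['title']).round(2))
```

Passing a callable as `analyzer` keeps multi-word names such as `'Actor A'` as single tokens, which the joined-string approach does not.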

In [None]:
dff.director

In [None]:
def Convert(string):
    # Wrap the director name in a list so it can be concatenated with the keyword and cast lists
    return [string]

dff['director'] = dff['director'].apply(Convert)


In [None]:
dff['soup'] = dff['keywords'] + dff['cast'] + dff['director']
dff['soup'] = dff['soup'].apply(lambda x: ' '.join(x))

In [None]:
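from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# min_df=1 keeps every term (recent scikit-learn versions reject the integer 0)
count = CountVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
# Note: this vectorizes the plot overview; pass dff['soup'] instead to use the combined keywords/cast/director string
count_matrix = count.fit_transform(dff['overview'])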

In [None]:
count_matrix

In [None]:
cosine_sim = cosine_similarity(count_matrix, count_matrix)

In [None]:
dff = dff.reset_index()
titles = dff['title']
indices = pd.Series(dff.index, index=dff['title'])

In [None]:
def get_recommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:21]
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]

In [None]:
dff.title.sample(10)

In [None]:
get_recommendations('Toy Story').head(15)

In [None]:
dff[dff.title == 'Toy Story']

In [None]:
dff[dff.title == "You're Only Young Once"]

In [None]:
# Could improve the above by ensuring the recommended movies are still somewhat popular and well rated
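
One way to act on the note above is to filter the content-based candidates by rating before returning them. A minimal sketch, assuming the candidate rows still carry `vote_average`. The `rerank_by_rating` helper and the 6.0 threshold are hypothetical.

```python
import pandas as pd


def rerank_by_rating(candidates, min_rating=6.0, top_n=10):
    """Drop poorly rated candidates, then return the best-rated ones.

    `candidates` is a DataFrame with 'title' and 'vote_average' columns,
    e.g. dff.iloc[movie_indices] inside get_recommendations.
    """
    kept = candidates[candidates['vote_average'] >= min_rating]
    return kept.sort_values('vote_average', ascending=False).head(top_n)


# Toy candidates (illustrative values)
cands = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'vote_average': [5.1, 7.8, 6.4],
})
print(rerank_by_rating(cands)['title'].tolist())
```

Sorting by rating discards the similarity order, so a gentler variant would only filter and keep the original ranking.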

## Collaborative Filtering

![alt text](1_qFweWAKML-SdpGndGMvLDw.png)
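
To make the idea concrete, here is a toy user-based CF sketch: predict a missing rating as the similarity-weighted average of other users' ratings. The rating matrix is made up, and this neighbourhood approach is only an illustration of "find similar users". It is not the SVD model fitted in this section.

```python
import numpy as np
import pandas as pd

# Toy user x movie rating matrix (made-up values); NaN means "not rated"
R = pd.DataFrame(
    [[5, 4, np.nan, 1],
     [4, 5, 5, 1],
     [1, 1, 2, 5]],
    index=['u1', 'u2', 'u3'],
    columns=['m1', 'm2', 'm3', 'm4'],
    dtype=float,
)


def cosine(a, b):
    """Cosine similarity over the movies both users have rated."""
    both = a.notna() & b.notna()
    if not both.any():
        return 0.0
    x, y = a[both].to_numpy(), b[both].to_numpy()
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))


def predict(user, movie):
    """Similarity-weighted average of other users' ratings for `movie`."""
    mask = (R.index != user) & R[movie].notna().to_numpy()
    others = R.index[mask]
    sims = np.array([cosine(R.loc[user], R.loc[o]) for o in others])
    if sims.sum() == 0:
        return float('nan')
    return float(sims @ R.loc[others, movie].to_numpy() / sims.sum())


print(round(predict('u1', 'm3'), 2))
```

Here `u1` rates movies much like `u2` and unlike `u3`, so the prediction for `m3` lands closer to `u2`'s rating of 5 than to `u3`'s rating of 2.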

In [None]:
rating

In [None]:
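from surprise import SVD
from surprise.model_selection import cross_validate
reader = Reader()  # default rating_scale is (1, 5); adjust if the ratings use a different range
data = Dataset.load_from_df(rating[['userId', 'movieId', 'rating']], reader)  # the ratings frame was loaded as `rating`
svd = SVD()
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5)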