# Recommender Systems Walkthrough

### Intro

Recommender Systems:

- Content Based Filtering

    Content-based filtering uses item features to recommend other items similar to what the user likes, based on their previous actions or explicit feedback.
    
    In this notebook: use NLP and cosine similarity on movie synopses, cast, and directors to find similar movies.
    

- Collaborative Filtering

    Collaborative filtering (CF) finds users with similar tastes and recommends items that those similar users rated highly.
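The idea of recommending from a similar user can be sketched on a toy rating matrix. This is a minimal illustration, not the approach used later in the notebook: the matrix `R` is made up, the helper names `most_similar_user` and `recommend_from_neighbour` are hypothetical, and treating an unrated movie as 0 in the cosine similarity is a simplifying assumption.

```python
import numpy as np

# Rows are users, columns are movies; 0 means "not rated". Made-up ratings.
R = np.array([
    [5.0, 4.0, 0.0, 1.0],   # user 0
    [4.0, 5.0, 3.0, 0.0],   # user 1
    [1.0, 0.0, 5.0, 4.0],   # user 2
])

def most_similar_user(ratings, user):
    """Index of the other user whose rating vector has the highest cosine similarity."""
    unit = ratings / np.linalg.norm(ratings, axis=1, keepdims=True)
    sims = unit @ unit[user]
    sims[user] = -np.inf            # exclude the user themselves
    return int(np.argmax(sims))

def recommend_from_neighbour(ratings, user):
    """Movies the nearest neighbour rated that `user` has not, best rated first."""
    nb = most_similar_user(ratings, user)
    unseen = np.where((ratings[user] == 0) & (ratings[nb] > 0))[0]
    return unseen[np.argsort(-ratings[nb, unseen])].tolist()

neighbour = most_similar_user(R, 0)
suggestions = recommend_from_neighbour(R, 0)
print(neighbour, suggestions)
```

Real rating data is far sparser than this, which is one reason model-based methods such as matrix factorisation are often preferred over a direct neighbour lookup.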

Finally, I will implement a simple hybrid model.



### Loading data from the cleaning notebook

In [1]:
import pandas as pd
from nltk.stem.snowball import SnowballStemmer
from surprise import Reader, Dataset
import numpy as np 

In [2]:
df = pd.read_csv('Clean_Item_Data')
df.head(3)

Unnamed: 0.1,Unnamed: 0,title,overview,genres,vote_average,budget,runtime,adult,movieId,imdbId,tmdbId,cast,keywords,cast_size,crew_size,director
0,0,Toy Story,"Led by Woody, Andy's toys live happily in his ...",16,7.7,30000000,81.0,0,1,114709,862,"['Tom Hanks', 'Tim Allen', 'Don Rickles']","['jealousy', 'toy', 'boy', 'friendship', 'frie...",13,106,John Lasseter
1,1,Jumanji,When siblings Judy and Peter discover an encha...,12,6.9,65000000,104.0,0,2,113497,8844,"['Robin Williams', 'Jonathan Hyde', 'Kirsten D...","['board game', 'disappearance', ""based on chil...",26,16,Joe Johnston
2,2,Grumpier Old Men,A family wedding reignites the ancient feud be...,10749,6.5,0,101.0,0,3,113228,15602,"['Walter Matthau', 'Jack Lemmon', 'Ann-Margret']","['fishing', 'best friend', 'duringcreditssting...",7,4,Howard Deutch


In [4]:
rating = pd.read_csv('ratings.csv')
rating.sample(3)

Unnamed: 0,userId,movieId,rating,timestamp
1019067,10363,2141,3.5,1223305921
21638236,224751,91542,5.0,1445360333
16251544,168860,14,3.0,1007838247


## Simple Recommender

Simply suggest the most 'popular' movies.

In [5]:
# Very naive approach: to do this properly, the number of votes should be taken into account, not just the average vote.

df.sort_values('vote_average', ascending=False).head(5)

Unnamed: 0.1,Unnamed: 0,title,overview,genres,vote_average,budget,runtime,adult,movieId,imdbId,tmdbId,cast,keywords,cast_size,crew_size,director
30521,30779,Canal Zone,CANAL ZONE is about the people who live and wo...,99,10.0,0,174.0,0,139237,75803,114423,[],[],0,1,Frederick Wiseman
28299,28513,One Night Only,A group of female friends get together with so...,35,10.0,0,87.0,0,132000,91682,72178,"['Lenore Zann', 'Geoffrey MacKay', 'Helene Udy']",[],9,1,Timothy Bond
38161,38584,Sunnyside Up,"Molly and Bee, sweet young 'working girls,' li...",35,10.0,0,121.0,0,161876,20466,86360,"['Janet Gaynor', 'Charles Farrell', 'Marjorie ...",[],12,4,David Butler
40269,40736,Patient Zero,After an unprecedented global pandemic has tur...,28,10.0,0,0.0,0,168274,3458254,295011,"['Natalie Dormer', 'Stanley Tucci', 'Matt Smith']","['survivor', 'language', 'end of the world', '...",16,6,Stefan Ruzowitzky
181,181,Reckless,"On Christmas eve, a relentlessly cheerful woma...",14,10.0,0,91.0,0,189,114241,58372,"['Mia Farrow', 'Tony Goldwyn', 'Scott Glenn']","['trauma', 'game show', 'female protagonist', ...",4,17,Norman René
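The table above shows the problem with sorting on the average alone: obscure titles with a perfect 10.0 rise to the top. A common remedy is a vote-count-weighted score in the style of IMDb's weighted rating, WR = v/(v+m)·R + m/(v+m)·C, where R is the movie's average, v its vote count, C the mean average across all movies, and m a minimum-votes threshold. The sketch below assumes a `vote_count` column, which is not in the frame shown here; the toy frame, the function name `weighted_rating`, and the 0.5 quantile used for m are illustrative choices.

```python
import pandas as pd

# Toy frame; `vote_count` is an assumed column (not in the data shown above).
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'vote_average': [10.0, 8.0, 7.5, 6.0],
    'vote_count': [1, 500, 2000, 50],
})

def weighted_rating(frame, quantile=0.5):
    """IMDb-style weighted rating: shrink each average toward the global mean."""
    C = frame['vote_average'].mean()            # mean rating across all movies
    m = frame['vote_count'].quantile(quantile)  # vote threshold (hypothetical choice)
    v, R = frame['vote_count'], frame['vote_average']
    return v / (v + m) * R + m / (v + m) * C

toy['score'] = weighted_rating(toy)
ranked = toy.sort_values('score', ascending=False)
print(ranked[['title', 'vote_average', 'vote_count', 'score']])
```

With this score, the single-vote 10.0 is pulled most of the way back toward the global mean, so it no longer outranks a well-supported 8.0.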


## Content Based Filtering 

Goal: group similar movies together and rank them by similarity.

Many different approaches:

- Recommend movies with similar descriptions, crew, and cast, i.e. NLP on the text fields
- Use tabular data, e.g. ratings, cost, etc.



In [None]:
dff = df.sample(10000)

In [None]:
dff.dropna(inplace = True)

In [None]:
dff.head(5)

In [None]:
dff.keywords

In [None]:
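from ast import literal_eval
dff[['keywords', 'cast']] = dff[['keywords', 'cast']].apply(lambda c: c.map(literal_eval))  # read_csv loads list columns as strings; parse them back into lists
s = dff.apply(lambda x: pd.Series(x['keywords']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'keyword'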




In [None]:
s = s.value_counts()
s[:10]

In [None]:
s = s[s > 1]
s

In [None]:

stemmer = SnowballStemmer('english')

In [None]:
def filter_keywords(x):
    # Keep only keywords that survived the frequency filter (the index of s)
    return [i for i in x if i in s.index]

In [None]:
dff['keywords'] = dff['keywords'].apply(filter_keywords)
dff['keywords'] = dff['keywords'].apply(lambda x: [stemmer.stem(i) for i in x])
dff['keywords'] = dff['keywords'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])

### Below is a bit of a trick

A better way would be to compute similarity on keywords, cast, and director separately and then combine the scores to find similar movies.

Instead (for speed), I just combine all their strings into one.
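For comparison, the per-field alternative could look roughly like this: compute one similarity matrix per field and blend them with weights. This is a sketch on made-up data, not what the notebook runs; `field_similarity`, the toy frame, and the weights are hypothetical, and a binary bag of tokens stands in for whichever vectorizer is preferred.

```python
import numpy as np
import pandas as pd

# Made-up movies with list-valued fields, mirroring the columns used above.
toy = pd.DataFrame({
    'title': ['M1', 'M2', 'M3'],
    'keywords': [['toy', 'friendship'], ['toy', 'rivalry'], ['fishing']],
    'cast': [['Tom Hanks'], ['Tom Hanks', 'Tim Allen'], ['Jack Lemmon']],
    'director': [['John Lasseter'], ['John Lasseter'], ['Howard Deutch']],
})

def field_similarity(token_lists):
    """Cosine similarity between rows, each row a list of tokens (binary bag)."""
    vocab = sorted({tok for toks in token_lists for tok in toks})
    col = {tok: j for j, tok in enumerate(vocab)}
    X = np.zeros((len(token_lists), len(vocab)))
    for i, toks in enumerate(token_lists):
        for tok in toks:
            X[i, col[tok]] = 1.0
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # avoid division by zero for empty lists
    X = X / norms
    return X @ X.T

# Hypothetical weights; they would need tuning on real data.
weights = {'keywords': 0.5, 'cast': 0.3, 'director': 0.2}
combined = sum(w * field_similarity(toy[f].tolist()) for f, w in weights.items())
print(np.round(combined, 3))
```

The cost is one similarity matrix per field, which is why concatenating everything into a single string is the faster shortcut.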

In [None]:
dff.director

In [None]:
def Convert(string):
    # Wrap the director name in a list so it can be concatenated with the other list columns
    return [string]

dff['director'] = dff['director'].apply(Convert)


In [None]:
dff['soup'] = dff['keywords'] + dff['cast'] + dff['director']
dff['soup'] = dff['soup'].apply(lambda x: ' '.join(x))

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
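# min_df=1 (the default) keeps every term; recent scikit-learn versions reject an integer min_df of 0
count = CountVectorizer(analyzer='word',ngram_range=(1, 2),min_df=1, stop_words='english')
# Note: this vectorizes the overview text only; pass dff['soup'] instead to use keywords, cast and director
count_matrix = count.fit_transform(dff['overview'])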

In [None]:
count_matrix

In [None]:
cosine_sim = cosine_similarity(count_matrix, count_matrix)

In [None]:
dff = dff.reset_index()
titles = dff['title']
indices = pd.Series(dff.index, index=dff['title'])

In [None]:
def get_recommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:21]
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]

In [None]:
dff.title.sample(10)

In [None]:
# Note: dff is a random sample, so 'Toy Story' must be in it or this lookup raises a KeyError
get_recommendations('Toy Story').head(15)

In [None]:
dff[dff.title == 'Toy Story']

In [None]:
dff[dff.title == "You're Only Young Once"]

In [None]:
# Could improve the above by ensuring the recommended movies are still somewhat popular and well rated.
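One way to act on that note is to take the similarity-ranked candidates and drop those below a rating cutoff before returning them. This is a sketch under assumptions: `filter_by_rating`, the toy frame, and the median cutoff are illustrative, and `vote_average` is used on its own because the frame shown earlier has no vote-count column.

```python
import pandas as pd

# Toy stand-ins: `movies` mimics dff, `candidates` mimics the row positions
# returned by a similarity ranking (most similar first).
movies = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D', 'E'],
    'vote_average': [7.7, 3.1, 6.9, 8.2, 5.0],
})
candidates = [1, 3, 4, 2]

def filter_by_rating(frame, positions, quantile=0.5):
    """Keep candidates whose vote_average is at or above the given quantile,
    preserving the similarity order."""
    cutoff = frame['vote_average'].quantile(quantile)
    picked = frame.iloc[positions]
    return picked[picked['vote_average'] >= cutoff]['title']

kept = filter_by_rating(movies, candidates)
print(kept.tolist())
```

In `get_recommendations`, the same filter could be applied to `movie_indices` after taking a larger candidate pool than the final list length.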

## Collaborative Filtering

![alt text](1_qFweWAKML-SdpGndGMvLDw.png)

In [None]:
rating

In [None]:
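from surprise import SVD
from surprise.model_selection import cross_validate

reader = Reader(rating_scale=(0.5, 5))  # assumption: ratings run from 0.5 to 5, as the sample above suggests
data = Dataset.load_from_df(rating[['userId', 'movieId', 'rating']], reader)
svd = SVD()
cross_validate(svd, data, measures=['RMSE', 'MAE'],cv=5)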
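For intuition about what a model like `SVD` fits, here is a stripped-down matrix factorisation trained with stochastic gradient descent on made-up ratings. It is a sketch only: it omits the user and item bias terms that surprise's `SVD` includes, and `train_mf`, its hyperparameters, and the toy triples are assumptions for illustration.

```python
import numpy as np

# (user, item, rating) triples; made-up data.
triples = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0), (1, 2, 2.0),
           (2, 1, 1.0), (2, 2, 5.0)]

def train_mf(triples, n_users, n_items, k=2, lr=0.05, reg=0.02, epochs=500, seed=0):
    """Fit rating ~ mu + P[u] . Q[i] by SGD on squared error with an L2 penalty."""
    rng = np.random.default_rng(seed)
    P = rng.normal(0, 0.1, (n_users, k))   # user factors
    Q = rng.normal(0, 0.1, (n_items, k))   # item factors
    mu = np.mean([r for _, _, r in triples])
    for _ in range(epochs):
        for u, i, r in triples:
            err = r - (mu + P[u] @ Q[i])
            pu = P[u].copy()               # keep the old P[u] for Q's update
            P[u] += lr * (err * Q[i] - reg * pu)
            Q[i] += lr * (err * pu - reg * Q[i])
    return mu, P, Q

mu, P, Q = train_mf(triples, n_users=3, n_items=3)
rmse = np.sqrt(np.mean([(r - (mu + P[u] @ Q[i])) ** 2 for u, i, r in triples]))
print(f'training RMSE: {rmse:.3f}')
print('predicted rating for user 0 on unseen item 2:', round(float(mu + P[0] @ Q[2]), 2))
```

The training RMSE here is measured on the same triples the model was fit on, so it says nothing about generalisation; that is what the cross-validation above is for.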