# Recommender Systems Walk Through

### Intro

Recommender Systems:

- Content-Based Filtering

    Content-based filtering uses item features to recommend other items similar to what the user likes, based on their previous actions or explicit feedback.
    
    Our case: use NLP and cosine similarity on movie synopses, casts, and directors to find similar movies.
    

- Collaborative Filtering

    The aim of CF is to find users with similar tastes and recommend the items that those similar users liked.

Finally, I will implement a simple hybrid model.



### Loading data from the cleaning notebook

In [1]:
import pandas as pd
from nltk.stem.snowball import SnowballStemmer
#from surprise import Reader, Dataset
import numpy as np 
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
import string
import keras
from keras.preprocessing.text import one_hot,Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense , Flatten ,Embedding,Input
from keras.models import Model
import nltk

Using TensorFlow backend.


In [2]:
df = pd.read_csv('Clean_Item_Data')
df.drop('Unnamed: 0',inplace = True,axis = 1)
df.drop_duplicates(subset = 'title', inplace=True)
df.head(3)

Unnamed: 0,title,overview,genres,vote_average,budget,runtime,adult,movieId,imdbId,tmdbId,cast,keywords,cast_size,crew_size,director
0,Toy Story,"Led by Woody, Andy's toys live happily in his ...",16,7.7,30000000,81.0,0,1,114709,862,"['Tom Hanks', 'Tim Allen', 'Don Rickles']","['jealousy', 'toy', 'boy', 'friendship', 'frie...",13,106,John Lasseter
1,Jumanji,When siblings Judy and Peter discover an encha...,12,6.9,65000000,104.0,0,2,113497,8844,"['Robin Williams', 'Jonathan Hyde', 'Kirsten D...","['board game', 'disappearance', ""based on chil...",26,16,Joe Johnston
2,Grumpier Old Men,A family wedding reignites the ancient feud be...,10749,6.5,0,101.0,0,3,113228,15602,"['Walter Matthau', 'Jack Lemmon', 'Ann-Margret']","['fishing', 'best friend', 'duringcreditssting...",7,4,Howard Deutch


In [3]:
df.shape

(38873, 15)

In [4]:
rating = pd.read_csv('ratings.csv')
rating.sample(3)

Unnamed: 0,userId,movieId,rating,timestamp
14928935,155220,1625,4.0,938910155
12021746,124604,745,3.0,848513996
8352947,86128,3733,3.5,1108585812


In [5]:
df = df[df.movieId.isin(rating.movieId.unique())]

In [6]:
rating = rating[rating.movieId.isin(df.movieId.unique())]

In [7]:
df = df.reset_index(drop=True)
rating = rating.reset_index(drop=True)

# Simple recommender

Simply suggest the most 'popular' movies, ranked here by average vote.

In [8]:
# Very naive approach: to do this properly I need to take the number of votes into account, not just the average vote.

df.sort_values('vote_average', ascending=False).head(5)

Unnamed: 0,title,overview,genres,vote_average,budget,runtime,adult,movieId,imdbId,tmdbId,cast,keywords,cast_size,crew_size,director
24484,Claymation Comedy of Horrors,Wilshire Pig and Sheldon Snail discover a map ...,16,10.0,0,30.0,0,126094,252993,53543,[],[],0,2,Barry Bruce
25891,Crooks and Coronets,Two crooks are hired to rob an eccentric old l...,35,10.0,0,106.0,0,132052,64192,322460,"['Telly Savalas', 'Edith Evans', 'Warren Oates']",['heist'],11,2,Jim O'Connolly
21398,Tall Story,Love puts a college basketball star into a tai...,35,10.0,0,91.0,0,112301,54367,86297,"['Jane Fonda', 'Anthony Perkins', 'Ray Walston']","['college', 'bribe']",8,13,Joshua Logan
713,Carmen Miranda: Bananas Is My Business,A biography of the Portuguese-Brazilian singer...,99,10.0,0,91.0,0,756,109381,255546,"['Carmen Miranda', 'Aurora Miranda', 'Cesar Ro...","['latin', 'profile', 'woman director']",10,3,Helena Solberg
18579,Road to Redemption,A couple come into contact with stolen mob mon...,28,10.0,0,89.0,0,99735,256341,46016,"['Pat Hingle', 'Julie Condra', 'Leo Rossi']",['independent film'],4,1,Robert Vernon
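
The comment in the cell above points out that the ranking should take the number of votes into account. One common way to do that is a Bayesian-average "weighted rating". The sketch below is only an illustration: it assumes a `vote_count` column, which is not among the columns shown for `df` in this notebook, and the toy data and the `min_votes_quantile` parameter are made up.

```python
import pandas as pd

def weighted_rating(frame, min_votes_quantile=0.5):
    """Bayesian-average score: (v / (v + m)) * R + (m / (v + m)) * C."""
    C = frame['vote_average'].mean()                      # mean rating over all movies
    m = frame['vote_count'].quantile(min_votes_quantile)  # vote-count threshold
    v = frame['vote_count']                               # votes per movie
    R = frame['vote_average']                             # average rating per movie
    return (v / (v + m)) * R + (m / (v + m)) * C

# Made-up data; `vote_count` is a hypothetical column.
toy = pd.DataFrame({
    'title': ['Obscure Film', 'Popular Film', 'Average Film'],
    'vote_average': [10.0, 8.5, 6.0],
    'vote_count': [1, 5000, 300],
})
toy['score'] = weighted_rating(toy)
ranked = toy.sort_values('score', ascending=False)
print(ranked[['title', 'score']])
```

With a single vote, the 10.0-rated film is pulled towards the overall mean, so it no longer outranks the film with thousands of votes.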


## Content Based Filtering 

Goal: group similar movies together and rank them by similarity.

Many different approaches are possible:

- Recommend movies with similar descriptions, crew, and cast, i.e. NLP
- Use tabular data, e.g. ratings, cost, etc.



In [75]:


# Importing necessary libraries
import pandas as pd
import numpy as np
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import RegexpTokenizer
import re
import string
import random
from PIL import Image
import requests
from io import BytesIO
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.metrics.pairwise import cosine_similarity
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser
from matplotlib import pyplot
from gensim.models import KeyedVectors


def _removeNonAscii(s):
    return "".join(i for i in s if  ord(i)<128)

def make_lower_case(text):
    return text.lower()

def remove_stop_words(text):
    text = text.split()
    stops = set(stopwords.words("english"))
    text = [w for w in text if w not in stops]
    text = " ".join(text)
    return text

def remove_html(text):
    html_pattern = re.compile('<.*?>')
    return html_pattern.sub(r'', text)

def remove_punctuation(text):
    tokenizer = RegexpTokenizer(r'\w+')
    text = tokenizer.tokenize(text)
    text = " ".join(text)
    return text

df['cleaned'] = df['overview'].apply(_removeNonAscii)

df['cleaned'] = df.cleaned.apply(func=make_lower_case)
# Strip HTML tags before removing punctuation; afterwards the angle brackets
# that remove_html matches on would already be gone.
df['cleaned'] = df.cleaned.apply(func=remove_html)
df['cleaned'] = df.cleaned.apply(func=remove_stop_words)
df['cleaned'] = df.cleaned.apply(func=remove_punctuation)

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/jupyter/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [76]:
df['cleaned']

0        led woody andy s toys live happily room andy s...
1        siblings judy peter discover enchanted board g...
2        family wedding reignites ancient feud net door...
3        cheated on mistreated stepped on women holding...
4        george banks recovered daughter s wedding rece...
                               ...                        
38277    true crime documentary delve murder spree insp...
38278    film archivist revisits story rustin parr herm...
38279    year 3000 ad world s dangerous women banished ...
38280                             rising falling man woman
38281    artist struggles finish work storyline cult pl...
Name: cleaned, Length: 38282, dtype: object

In [77]:
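# min_df=1 keeps every term, same as the original min_df=0; recent scikit-learn releases reject an integer min_df of 0
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')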
tfidf_matrix = tf.fit_transform(df['cleaned'])

In [78]:
tfidf_matrix.shape

(38282, 915918)

In [80]:
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
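
A note on why `linear_kernel` (a plain dot product) is used as the cosine similarity here: `TfidfVectorizer` L2-normalises each row by default, and the dot product of two unit vectors is their cosine. The small NumPy sketch below illustrates this with made-up vectors; it is not the notebook's data.

```python
import numpy as np

# Made-up term-weight vectors for three documents (one per row).
raw = np.array([
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 0.0],   # same direction as row 0
    [0.0, 0.0, 3.0],   # orthogonal to rows 0 and 1
])

# L2-normalise each row, as TfidfVectorizer does by default.
unit = raw / np.linalg.norm(raw, axis=1, keepdims=True)

# Plain dot products of the unit vectors (what a linear kernel computes).
dot_sim = unit @ unit.T

# Cosine similarity computed directly from the raw vectors, for comparison.
norms = np.linalg.norm(raw, axis=1)
cos_sim = (raw @ raw.T) / np.outer(norms, norms)

print(np.allclose(dot_sim, cos_sim))
```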

In [81]:
titles = df['title']

In [83]:
indices = pd.Series(df.index, index=df['title'])

In [84]:
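def get_recommendations(title):
    """Return the titles of the 30 movies most similar to `title`."""
    idx = indices[title]
    # Pair each movie's positional index with its similarity to the query movie.
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip the first entry, which is normally the query movie itself.
    sim_scores = sim_scores[1:31]
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]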

In [85]:
get_recommendations('The Godfather').head(10)

1124      The Godfather: Part II
17001            The Outside Man
28011           Honor Thy Father
21011                 Blood Ties
29                Shanghai Triad
15286      New York Confidential
7994                        Fury
35242              Live by Night
12474               I Am the Law
1834     The Godfather: Part III
Name: title, dtype: object

In [86]:
get_recommendations('The Dark Knight').head(10)

16939                                The Dark Knight Rises
145                                         Batman Forever
1270                                        Batman Returns
574                                                 Batman
14557                           Batman: Under the Red Hood
19401    Batman Unmasked: The Psychology of the Dark Kn...
18600              Batman: The Dark Knight Returns, Part 2
16755                                     Batman: Year One
34828    LEGO DC Comics Super Heroes: Batman: Be-Leaguered
35623    Batman Beyond Darwyn Cooke's Batman 75th Anniv...
Name: title, dtype: object
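
The full `cosine_sim` matrix used above has 38282 x 38282 entries, which is roughly 11 to 12 GB as float64. If memory is tight, one alternative is to compute only the row for the query movie when it is needed. The sketch below shows the idea on a small dense array; the function name `top_k_similar` and the toy rows are my own, and it assumes the rows are already L2-normalised.

```python
import numpy as np

def top_k_similar(unit_rows, idx, k=10):
    """Return indices of the k rows most similar to row `idx`, best first.

    `unit_rows` is assumed to be a dense 2-D float array with L2-normalised
    rows, so a dot product is a cosine similarity.
    """
    scores = unit_rows @ unit_rows[idx]   # similarity of every row to the query
    scores[idx] = -np.inf                 # exclude the query itself
    k = min(k, len(scores) - 1)
    if k <= 0:
        return np.array([], dtype=int)
    top = np.argpartition(-scores, k - 1)[:k]   # unordered top-k
    return top[np.argsort(-scores[top])]        # sort those k by descending score

# Made-up unit vectors.
rows = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.6, 0.8]])
result = top_k_similar(rows, idx=0, k=2)
print(result)
```

For the sparse `tfidf_matrix`, the analogous on-demand call would be `linear_kernel(tfidf_matrix[idx], tfidf_matrix).ravel()`, which gives the scores for one movie without building the full matrix.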

In [137]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

In [138]:
data = df['cleaned']

tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in enumerate(data)]

In [None]:
max_epochs = 100
vec_size = 15
alpha = 0.025

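model = Doc2Vec(vector_size=vec_size,
                alpha=alpha,
                min_alpha=0.00025,
                min_count=1,
                dm=1)  # dm=1 selects the distributed-memory (PV-DM) variant

model.build_vocab(tagged_data)

# Manual learning-rate schedule. Note that every train() call below already
# runs model.epochs passes over the corpus, so the total training is
# max_epochs * model.epochs passes.
for epoch in range(max_epochs):

    print(epoch)
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.epochs)
    # decrease the learning rate for the next train() call
    model.alpha -= 0.0002
    # pin min_alpha to alpha so the next call uses a constant rate (no decay)
    model.min_alpha = model.alpha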

model.save("d2v.model")
print("Model Saved")

0
1
2
...
40
41


In [129]:
model= Doc2Vec.load("d2v.model")

In [130]:
["I love machine learning. Its awesome.",
        "I love coding in python",
        "I love building chatbots",
        "I love building chatbots!"]

['I love machine learning. Its awesome.',
 'I love coding in python',
 'I love building chatbots',
 'I love building chatbots!']

In [131]:
model.dv.key_to_index 

{'0': 0, '1': 1, '2': 2, '3': 3}

In [132]:
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load("d2v.model")
# to find the vector of a document which is not in the training data
test_data = word_tokenize("I love chatbots".lower())
inferred_vector = model.infer_vector(test_data)


# to find the most similar docs using tags
# (model.dv is the gensim 4 name for what older releases called model.docvecs)
similar_doc = model.dv.most_similar('3')
print(similar_doc)


# to print the vector of the training document tagged '1'
print(model.dv['1'])


[('2', 0.734465479850769), ('1', 0.7178300023078918), ('0', 0.4611991345882416)]
[-0.4464615   0.05181082 -0.46369314  0.0706271   0.6003256  -0.6622851
 -0.6252563  -0.6537112   0.07582504 -0.7323634   0.2951613   0.5026427
 -0.7469875  -0.7266919  -0.28869343  0.07489646  0.00168609 -0.35149062
 -0.6598793   0.21526803]




### Below is a bit of a trick

A better way would be to compute similarity on keywords, cast, and director separately and then combine the scores to find similar movies.

Instead (for speed), I simply join all of these strings into a single "soup" string per movie.
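
For reference, the "better way" mentioned above could look roughly like this: score each field separately and take a weighted average. This is only a sketch under my own assumptions. Jaccard similarity, the weights, and the three example records are all made up for illustration and are not what the notebook actually runs.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def combined_similarity(m1, m2, weights):
    """Weighted average of the per-field Jaccard similarities."""
    total = sum(weights.values())
    return sum(w * jaccard(m1[f], m2[f]) for f, w in weights.items()) / total

# Made-up records; the field names mirror the notebook's columns.
movie_a = {'keywords': ['toy', 'friendship', 'boy'],
           'cast': ['Actor One', 'Actor Two'],
           'director': ['Director X']}
movie_b = {'keywords': ['toy', 'friendship', 'rescue'],
           'cast': ['Actor One', 'Actor Two'],
           'director': ['Director X']}
movie_c = {'keywords': ['heist'],
           'cast': ['Actor Three'],
           'director': ['Director Y']}

# Illustrative weights: here the director counts twice as much as the other fields.
weights = {'keywords': 1.0, 'cast': 1.0, 'director': 2.0}

sim_ab = combined_similarity(movie_a, movie_b, weights)
sim_ac = combined_similarity(movie_a, movie_c, weights)
print(sim_ab, sim_ac)
```

Keeping the fields separate makes it possible to tune how much each one matters, which the single joined string cannot do.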

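In [None]:
# NOTE: `dff` is never defined in this notebook; it is assumed here to be a working copy of `df`.
dff = df.copy()
dff.director

In [None]:
from ast import literal_eval

def to_list(value):
    # read_csv loads list columns as strings such as "['a', 'b']"; parse them back
    # (assumption: each string is a valid Python list literal).
    if isinstance(value, str):
        return literal_eval(value)
    return value if isinstance(value, list) else []

def Convert(string):
    # Wrap the director name in a list; a missing director becomes an empty list.
    return [string] if isinstance(string, str) else []

for col in ['keywords', 'cast']:
    dff[col] = dff[col].apply(to_list)
dff['director'] = dff['director'].apply(Convert)
dff['soup'] = dff['keywords'] + dff['cast'] + dff['director']
dff['soup'] = dff['soup'].apply(lambda x: ' '.join(x))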

In [None]:
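from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# min_df=1 keeps every term; recent scikit-learn releases reject an integer min_df of 0
count = CountVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
# Vectorize the combined keywords/cast/director string, not the overview
count_matrix = count.fit_transform(dff['soup'])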

In [None]:
count_matrix

In [None]:
cosine_sim = cosine_similarity(count_matrix, count_matrix)

In [None]:
dff = dff.reset_index(drop=True)
titles = dff['title']
indices = pd.Series(dff.index, index=dff['title'])

In [None]:
def get_recommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:21]
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]

In [None]:
dff.title.sample(10)

In [None]:
get_recommendations('Toy Story').head(15)

In [None]:
dff[dff.title == 'Toy Story']

In [None]:
dff[dff.title == "You're Only Young Once"]

In [None]:
# could improve the above by ensuring the recommended movies are still somewhat popular and well rated
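
One simple way to act on that idea is to filter the similar movies by rating before returning them. The sketch below is a minimal illustration with made-up candidates; the `rerank` helper and the `min_rating` threshold are my own assumptions, and a vote-count threshold would make it more robust if that data were available.

```python
def rerank(candidates, min_rating=6.0, top_n=3):
    """Drop candidates rated below `min_rating`, then order by similarity."""
    kept = [c for c in candidates if c['vote_average'] >= min_rating]
    return sorted(kept, key=lambda c: c['similarity'], reverse=True)[:top_n]

# Made-up candidates with their similarity to the query movie and average vote.
candidates = [
    {'title': 'Close but poorly rated', 'similarity': 0.90, 'vote_average': 3.1},
    {'title': 'Close and well rated', 'similarity': 0.85, 'vote_average': 7.8},
    {'title': 'Less close, well rated', 'similarity': 0.40, 'vote_average': 8.2},
]

picks = [c['title'] for c in rerank(candidates)]
print(picks)
```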

## Collaborative Filtering

![alt text](1_qFweWAKML-SdpGndGMvLDw.png)

In [None]:
rating

In [None]:
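from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate
reader = Reader(rating_scale=(0.5, 5))  # assumed half-star scale from 0.5 to 5
data = Dataset.load_from_df(rating[['userId', 'movieId', 'rating']], reader)
svd = SVD()
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5)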
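
To make the "find similar users" idea from the intro concrete, here is a tiny user-based CF sketch in plain NumPy. It is a simplification under my own assumptions, not what the SVD model does: the rating matrix is made up, unrated items are stored as 0, and similarity is a plain cosine over the full rating rows.

```python
import numpy as np

# Made-up user x movie rating matrix; 0 means "not rated".
R = np.array([
    [5.0, 4.0, 0.0, 1.0],   # user 0 (target), has not rated movie 2
    [5.0, 5.0, 4.0, 1.0],   # user 1, similar taste to user 0
    [1.0, 1.0, 2.0, 5.0],   # user 2, roughly opposite taste
])

def cosine(u, v):
    """Cosine similarity of two rating vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def predict(R, user, item):
    """Similarity-weighted average of the other users' ratings for `item`."""
    num = den = 0.0
    for other in range(R.shape[0]):
        if other == user or R[other, item] == 0:
            continue   # skip the target user and users who have not rated the item
        s = cosine(R[user], R[other])
        num += s * R[other, item]
        den += abs(s)
    return num / den if den else 0.0

pred = predict(R, user=0, item=2)
print(round(pred, 2))
```

The prediction lands closer to user 1's rating than to user 2's, because user 1's ratings look more like the target user's.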