# Recommender Systems Walk Through

Assumptions:

- We are not recommending any brand-new shows (i.e. shows with no ratings).
    If we were, we would have to rely on content-based (CB) models,
    i.e. recommend shows based on the metadata of the movies.
    
- No brand-new users. If there were, we would again have to use movie metadata for recommendations.

- We use the entire customer history of movie ratings for collaborative filtering, as opposed to, say, only the past month or week (simply because we do not have much data).

Notes:

- These models are very difficult to evaluate. The best method is A/B testing, but that is not possible for this exercise; instead we use a naive hit rate metric.

- The end goal is to understand hybrid models and to see whether they perform best.

- Each model must produce recommendations given a user's ID.
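
The naive hit rate from the notes above can be sketched roughly as follows. This is only an illustration of the idea, not the code inside `ut.evaluate`: the names `hit_rate` and `recommend` are hypothetical, and counting a "hit" as at least one top-n recommendation appearing in the user's held-out ratings is an assumption.

```python
import pandas as pd

def hit_rate(rating_test, recommend, n=5, sample_size=200, seed=0):
    """Fraction of sampled users with at least one top-n recommendation
    among their held-out (test) movies. Hypothetical helper."""
    users = pd.Series(rating_test["userId"].unique())
    users = users.sample(min(sample_size, len(users)), random_state=seed)
    # Set of held-out movie ids per user
    held_out = rating_test.groupby("userId")["movieId"].apply(set)
    hits = 0
    for user in users:
        recs = set(recommend(user, n))  # any model that maps a user id to n movie ids
        if recs & held_out[user]:
            hits += 1
    return hits / len(users)

# Toy data: three users, and a dummy model that always recommends movies 10 and 12
toy_test = pd.DataFrame({"userId": [1, 1, 2, 3], "movieId": [10, 11, 12, 13]})
always_same = lambda user, n: [10, 12][:n]
toy_hit_rate = hit_rate(toy_test, always_same, n=5)
```

In the toy data, users 1 and 2 each have a recommended movie in their held-out set and user 3 does not. Because only some users are sampled (200 in this notebook), the metric is noisy from run to run.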

To do:

- Model based CF

- Use directors, cost and keywords too.

In [1]:
import pandas as pd
import numpy as np 
import Utils as ut
from sklearn.model_selection import train_test_split

Using TensorFlow backend.


In [2]:
df = pd.read_csv('Clean_Item_Data')
df.drop('Unnamed: 0',inplace = True,axis = 1)

In [3]:
rating = pd.read_csv('ratings.csv')

In [4]:
rating = rating[rating.userId.isin(ut.random_sample(rating.userId.unique(),9000))]

In [5]:
rating = rating[rating.userId.map(rating.userId.value_counts())>4]

In [6]:
rating, rating_test = train_test_split(rating,
                                   stratify=rating['userId'], 
                                   test_size=0.20,
                                   random_state=432)

In [7]:
movieId_set = set.intersection(set(rating_test.movieId), 
                               set(rating.movieId), 
                               set(df.movieId))

rating_test = rating_test[rating_test.movieId.isin(movieId_set)]
rating = rating[rating.movieId.isin(movieId_set)]
df = df[df.movieId.isin(movieId_set)]

In [8]:
rating_test.shape

(128019, 4)

In [9]:
rating.shape

(507711, 4)

In [10]:
df.shape

(5158, 17)

In [11]:
df = df.reset_index(drop=True)
rating = rating.reset_index(drop=True)

In [12]:
# Very naive approach (to do this properly I also need to take the number of votes into account, not just the average vote).
user = rating.userId.sample(1).iloc[0]
ut.simple(user,df,rating,5)

Unnamed: 0,title,overview,genres,vote_average,budget,runtime,adult,vote_count,movieId,imdbId,tmdbId,cast,keywords,cast_size,crew_size,director,score
124,Pulp Fiction,"A burger-loving hit man, his philosophical par...",53,8.3,8000000,154.0,0,8670.0,296,110912,680,"['John Travolta', 'Samuel L. Jackson', 'Uma Th...","['transport', 'brothel', 'drug deal', 'boxer',...",54,87,Quentin Tarantino,8.187793
215,Schindler's List,The true story of how businessman Oskar Schind...,18,8.3,22000000,195.0,0,4436.0,527,108052,424,"['Liam Neeson', 'Ben Kingsley', 'Ralph Fiennes']","['factori', 'concentration camp', 'hero', 'hol...",30,36,Steven Spielberg,8.091368
502,The Godfather: Part II,In the continuing saga of the Corleone crime f...,18,8.3,13000000,200.0,0,3418.0,1221,71562,240,"['Al Pacino', 'Robert Duvall', 'Diane Keaton']","['italo-american', 'cuba', 'vorort', 'melancho...",84,53,Francis Ford Coppola,8.037035
4403,The Intouchables,A true story of two men who should never have ...,18,8.2,13000000,112.0,0,5410.0,92259,1675434,77338,"['François Cluzet', 'Omar Sy', 'Audrey Fleurot']","['male friendship', 'masseus', 'friendship', '...",32,34,Eric Toledano,8.034125
500,Psycho,When larcenous real estate clerk Marion Crane ...,18,8.3,806948,109.0,0,2405.0,1219,54215,539,"['Anthony Perkins', 'Vera Miles', 'John Gavin']","['hotel', 'clerk', 'arizona', 'shower', 'rain'...",30,22,Alfred Hitchcock,7.945052


In [13]:
ut.evaluate(rating_test, 5, 'simple', df, sample_size = 200,  rating = rating)

0.31
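
One common way to take the number of votes into account is the IMDb-style weighted rating, WR = v/(v+m) * R + m/(v+m) * C, where v is the vote count, R the average vote, C the mean vote across all movies and m a minimum-votes threshold. The sketch below is an assumption about how such a score could be computed; whether the existing `score` column or `ut.simple` works this way is not stated here, and `weighted_rating` and the quantile choice are hypothetical.

```python
import pandas as pd

def weighted_rating(movies, quantile=0.90):
    """IMDb-style weighted rating: shrinks averages based on few votes
    towards the global mean C. Hypothetical helper."""
    v = movies["vote_count"]
    R = movies["vote_average"]
    C = R.mean()              # global mean vote
    m = v.quantile(quantile)  # votes needed before the average is trusted
    return (v / (v + m)) * R + (m / (v + m)) * C

# Toy data: movie A has the best average but only 10 votes
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "vote_average": [9.0, 8.0, 6.0],
    "vote_count": [10, 5000, 3000],
})
toy["weighted"] = weighted_rating(toy, quantile=0.5)
ranked = toy.sort_values("weighted", ascending=False)
```

With this weighting, movie A drops below movie B despite its higher raw average, because its 10 votes carry little evidence.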

## Content Based Filtering 

In [14]:
df['overview_toke'] = df['overview'].apply(ut.clean_text, toke = True)

In [15]:
df['overview_clean'] = df['overview'].apply(ut.clean_text)

In [16]:
df.head(3)

Unnamed: 0,title,overview,genres,vote_average,budget,runtime,adult,vote_count,movieId,imdbId,tmdbId,cast,keywords,cast_size,crew_size,director,score,overview_toke,overview_clean
0,Toy Story,"Led by Woody, Andy's toys live happily in his ...",16,7.7,30000000,81.0,0,5415.0,1,114709,862,"['Tom Hanks', 'Tim Allen', 'Don Rickles']","['jealousi', 'toy', 'boy', 'friendship', 'frie...",13,106,John Lasseter,7.575833,"[led, woodi, andi, toy, live, happili, hi, roo...",led woodi andi toy live happili hi room andi b...
1,Jumanji,When siblings Judy and Peter discover an encha...,12,6.9,65000000,104.0,0,2413.0,2,113497,8844,"['Robin Williams', 'Jonathan Hyde', 'Kirsten D...","['board gam', 'disappear', ""based on children'...",26,16,Joe Johnston,6.782738,"[sibl, judi, peter, discov, enchant, board, ga...",sibl judi peter discov enchant board game open...
2,Grumpier Old Men,A family wedding reignites the ancient feud be...,10749,6.5,0,101.0,0,92.0,3,113228,15602,"['Walter Matthau', 'Jack Lemmon', 'Ann-Margret']","['fish', 'best friend', 'duringcreditssting', ...",7,4,Howard Deutch,6.252781,"[famili, wed, reignit, ancient, feud, next-doo...",famili wed reignit ancient feud next-door neig...


## TF-IDF

In [17]:
tfidf_cosine_sim = ut.TF_IDF(df['overview_clean'], ngram = 5)

In [18]:
ut.top_rec("The Dark Knight",
           tfidf_cosine_sim, 
           pd.Series(df.index, index=df['title']), 
           df, 5)

4388         The Dark Knight Rises
242                         Batman
611                 Batman Returns
4186    Batman: Under the Red Hood
Name: title, dtype: object

In [19]:
ut.evaluate(rating_test, 5, 'CB', df, sample_size = 200, sim = tfidf_cosine_sim)

0.015
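
For reference, a TF-IDF similarity matrix of this kind can be built with scikit-learn as in the minimal sketch below. This is not necessarily what `ut.TF_IDF` does internally; in particular, how its `ngram` argument maps onto `ngram_range` is unknown, so the value used here is purely illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Toy cleaned overviews (stand-ins for df['overview_clean'])
docs = [
    "batman fight joker gotham",
    "batman return gotham penguin",
    "toy cowboy space ranger friendship",
]

# TfidfVectorizer L2-normalises each row by default, so the plain dot product
# computed by linear_kernel equals the cosine similarity.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))  # illustrative n-gram setting
tfidf = vectorizer.fit_transform(docs)
sim = linear_kernel(tfidf, tfidf)  # shape (n_docs, n_docs)
```

Here the two Batman overviews share terms and so get a positive similarity, while the third overview shares none and scores zero against both.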

## Word2Vec

In [20]:
word2vec_cosine_sim = ut.Word2Vec_Hybrid(
                                      df['overview_toke'], 
                                      vector_size = 300,
                                      window = 7, 
                                      epochs = 100)

100%|██████████| 18444/18444 [00:11<00:00, 1615.10it/s] 


In [21]:
ut.top_rec("The Dark Knight",
           word2vec_cosine_sim, 
           pd.Series(df.index, index=df['title']), 
           df,5)

4388         The Dark Knight Rises
242                         Batman
611                 Batman Returns
4186    Batman: Under the Red Hood
Name: title, dtype: object

In [22]:
word2vec_cosine_sim.shape

(5158, 5158)

In [23]:
ut.evaluate(rating_test, 5, 'CB', df, sample_size = 200, sim = word2vec_cosine_sim)

0.01

### Word embedding

In [24]:
doc2vec_cosine_sim = ut.Doc2Word_embed(df['overview_clean'], 
                                    df['overview_toke'], 
                                    vector_size = 300,
                                    window = 15,
                                    epochs = 100)



In [25]:
ut.top_rec("The Dark Knight",
           doc2vec_cosine_sim, 
           pd.Series(df.index, index=df['title']), 
           df,5)

4388                     The Dark Knight Rises
242                                     Batman
4369                          Batman: Year One
2406    Remo Williams: The Adventure Begins...
Name: title, dtype: object

In [26]:
ut.evaluate(rating_test, 5, 'CB', df, sample_size = 200, sim = doc2vec_cosine_sim)

0.03

## Hybrid Content based model

We want to combine the NLP models (i.e. TF_IDF, Word2Vec_Hybrid and Doc2Word_embed). Note that the cell below averages only the TF-IDF and Word2Vec similarities.


In [27]:
# Assuming the cosine similarities from the NLP models are on comparable scales, we take their average.

avg_nlp_sim = (word2vec_cosine_sim+tfidf_cosine_sim)/2


ut.top_rec("The Dark Knight",
           avg_nlp_sim, 
           pd.Series(df.index, index=df['title']), 
           df,5)

4388         The Dark Knight Rises
242                         Batman
611                 Batman Returns
4186    Batman: Under the Red Hood
Name: title, dtype: object

In [28]:
ut.evaluate(rating_test, 5, 'CB', df, sample_size = 200, sim = avg_nlp_sim)

0.05
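
If the assumption that the similarity matrices are comparable does not hold (for example, embedding-based cosine similarities can be negative while TF-IDF ones cannot), one option is to rescale each matrix to [0, 1] before averaging. The sketch below shows that idea on toy matrices; `rescale` and `average_similarities` are hypothetical helpers, not part of `Utils`.

```python
import numpy as np

def rescale(sim):
    """Min-max rescale a similarity matrix to the range [0, 1]."""
    sim = np.asarray(sim, dtype=float)
    lo, hi = sim.min(), sim.max()
    if hi == lo:  # constant matrix: nothing to rescale
        return np.zeros_like(sim)
    return (sim - lo) / (hi - lo)

def average_similarities(*sims):
    """Element-wise mean of the rescaled similarity matrices."""
    return np.mean([rescale(s) for s in sims], axis=0)

# Toy matrices: the second one contains negative similarities
a = np.array([[1.0, 0.2, 0.6],
              [0.2, 1.0, 0.4],
              [0.6, 0.4, 1.0]])
b = np.array([[1.0, -0.5, 0.25],
              [-0.5, 1.0, -0.2],
              [0.25, -0.2, 1.0]])
avg = average_similarities(a, b)
```

After rescaling, every entry of the averaged matrix is non-negative, which also matters later when similarities are multiplied together.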

In [29]:
# Check whether the averaged similarity matrix contains any negative values.
# (The original nested loop's `break` only left the inner loop, so it kept scanning.)
if (np.asarray(avg_nlp_sim) < 0).any():
    print("Negative")

## Collaborative Filtering

### Item based

This is collaborative filtering, although we don't actually map users to one another here. We just compute the cosine similarity between movie rating vectors.

For example, if a lot of the people who rate MasterChef highly also rate Bake Off highly, these two shows will have a high similarity score.
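
The MasterChef / Bake Off example above can be made concrete with a small sketch: pivot the ratings into an item-by-user matrix and take the cosine similarity between its rows. The toy ratings are made up for illustration, and this is not necessarily how `ut.Rating2Vec` is implemented.

```python
import numpy as np
import pandas as pd

# Made-up ratings: users 1 and 2 like both cooking shows, user 3 prefers Top Gear
ratings = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3, 3],
    "title": ["MasterChef", "Bake Off", "MasterChef", "Bake Off", "MasterChef", "Top Gear"],
    "rating": [5.0, 4.5, 4.0, 5.0, 1.0, 5.0],
})

# One row per item, one column per user; missing ratings become 0
item_user = ratings.pivot_table(index="title", columns="userId",
                                values="rating", fill_value=0)

# Cosine similarity between item rating vectors
vectors = item_user.to_numpy(dtype=float)
norms = np.linalg.norm(vectors, axis=1, keepdims=True)
unit = vectors / np.where(norms == 0, 1.0, norms)  # guard against all-zero rows
item_sim = pd.DataFrame(unit @ unit.T,
                        index=item_user.index, columns=item_user.index)
```

In this toy data `item_sim.loc["MasterChef", "Bake Off"]` is much higher than `item_sim.loc["MasterChef", "Top Gear"]`, because the same users rated the first pair highly. Filling missing ratings with 0 treats "not rated" the same as a very low rating, which is a simplification.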

In [30]:
df['index1'] = df.index
new = pd.merge(rating,
               df[["title", 'index1', "movieId"]], 
               how='inner',
               left_on="movieId",
               right_on="movieId")

new = new[["userId","index1","rating"]]
x = pd.pivot_table(new, values='rating', index=['index1'], columns=['userId'], aggfunc=np.max, fill_value=0) 
new.head(2)

Unnamed: 0,userId,index1,rating
0,60395,816,2.5
1,214377,816,2.0


In [31]:
item_cosine_sim = ut.Rating2Vec(x)

In [32]:
ut.top_rec("Terminator 2: Judgment Day",
           item_cosine_sim, 
           pd.Series(df.index, index=df['title']), 
           df,5)

195     Jurassic Park
519    The Terminator
165             Speed
187      The Fugitive
Name: title, dtype: object

In [33]:
ut.evaluate(rating_test, 5, 'CB', df, sample_size = 200, sim = item_cosine_sim)

0.26

### User based

In [34]:
user_cosine_sim = ut.Rating2Vec(x.T)

In [35]:
indexes = pd.DataFrame(x.T.index)

In [36]:
ut.top_rec_user(rating.userId[0], user_cosine_sim, 10,5,df, x.T,indexes).title

134                              The Shawshank Redemption
1056                                           The Matrix
1499                                            Gladiator
1979    The Lord of the Rings: The Fellowship of the Ring
1757                                                Shrek
Name: title, dtype: object

In [37]:
ut.evaluate(rating_test, 5, 'UB', df, sample_size = 200, rating = rating, sim = user_cosine_sim, user_matrix = x.T, indexes = indexes, k = 10)

0.105
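
A user-based recommender of this kind typically finds the k users most similar to the target user and scores each movie by the similarity-weighted sum of the neighbours' ratings. The sketch below illustrates that idea on a toy matrix; it is an assumption about the general approach, not the implementation of `ut.top_rec_user`, and `user_based_scores` is a hypothetical name.

```python
import numpy as np

def user_based_scores(user_item, user_sim, user_idx, k=10):
    """Score items for one user from the ratings of their k nearest neighbours."""
    sims = user_sim[user_idx].astype(float).copy()
    sims[user_idx] = -np.inf                 # never pick the user as their own neighbour
    k = min(k, len(sims) - 1)
    neighbours = np.argsort(sims)[::-1][:k]  # indices of the k most similar users
    scores = sims[neighbours] @ user_item[neighbours]
    scores[user_item[user_idx] > 0] = -np.inf  # do not recommend already-rated items
    return scores

# Toy user-by-item rating matrix (0 = not rated)
user_item = np.array([[5.0, 0.0, 0.0, 4.0],
                      [4.0, 5.0, 0.0, 0.0],
                      [0.0, 0.0, 5.0, 0.0]])
norms = np.linalg.norm(user_item, axis=1, keepdims=True)
unit = user_item / norms
user_sim = unit @ unit.T                     # cosine similarity between users

scores = user_based_scores(user_item, user_sim, user_idx=0, k=1)
best_item = int(np.argmax(scores))
```

For user 0 the nearest neighbour is user 1 (they both rated item 0 highly), so the top suggestion is item 1, which user 1 rated 5 and user 0 has not seen.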

## Hybrid Collab & Content model

In [38]:
combined_sim = np.multiply(item_cosine_sim, avg_nlp_sim)

In [39]:
title = df.title.sample(1).iloc[0]

print("Title watched: ", title)

print("Movie recommendations")
ut.top_rec(title,
           combined_sim, 
           pd.Series(df.index, index=df['title']), 
           df,5)

Title watched:  Much Ado About Nothing
Movie recommendations


156    Four Weddings and a Funeral
737             The Wedding Singer
111       Like Water for Chocolate
36                     The Postman
Name: title, dtype: object

In [40]:
ut.evaluate(rating_test, 5, 'CB', df, sample_size = 200, sim = combined_sim)

0.195

In [41]:
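from sklearn.preprocessing import normalize
# Weighted blend: the item-based similarity counts twice as much as the NLP similarity.
# The division by 2 only rescales; the row-wise L1 normalisation below removes any constant factor.
combined_sim2 = np.divide(np.add(item_cosine_sim*2, avg_nlp_sim),2)
combined_sim2 = normalize(combined_sim2, axis=1, norm='l1')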

In [42]:
ut.top_rec("Terminator 2: Judgment Day",
           combined_sim2, 
           pd.Series(df.index, index=df['title']), 
           df,5)

519    The Terminator
195     Jurassic Park
166         True Lies
165             Speed
Name: title, dtype: object

In [43]:
ut.evaluate(rating_test, 5, 'CB', df, sample_size = 200, sim = combined_sim2)

0.16

In [44]:
scores = np.matmul(x.T, combined_sim2)

In [45]:
ut.CB_CF_MEMORY_HYBRID_REC(scores , rating.userId[0], df, 5)

Unnamed: 0,title,movieId
549,Back to the Future,1270
544,Groundhog Day,1265
434,E.T. the Extra-Terrestrial,1097
712,The Truman Show,1682
155,Forrest Gump,356


In [46]:
ut.evaluate(rating_test, 5, 'Hybrid', df, sample_size = 200, user_movie = scores)

0.065
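
The `scores` matrix above is a user-by-item product: each user's score for movie j is the sum, over the movies they rated, of rating times similarity to j. A minimal sketch of turning such a matrix into top-n recommendations is shown below. Masking out already-rated movies is an assumption made here for illustration; whether `ut.CB_CF_MEMORY_HYBRID_REC` does this is not shown, and `top_n_unseen` is a hypothetical helper.

```python
import numpy as np

# Toy user-by-item ratings and item-by-item similarities
user_item = np.array([[5.0, 0.0, 0.0],
                      [0.0, 4.0, 0.0]])
item_sim = np.array([[1.0, 0.8, 0.1],
                     [0.8, 1.0, 0.3],
                     [0.1, 0.3, 1.0]])

# scores[u, j] = sum_i rating[u, i] * sim[i, j]
scores = user_item @ item_sim

def top_n_unseen(scores, user_item, user_idx, n=5):
    """Indices of the n highest-scoring items the user has not rated yet."""
    user_scores = scores[user_idx].copy()
    user_scores[user_item[user_idx] > 0] = -np.inf  # mask already-rated items
    order = np.argsort(user_scores)[::-1]
    unseen = order[np.isfinite(user_scores[order])]
    return unseen[:n].tolist()

recs_user0 = top_n_unseen(scores, user_item, 0, n=2)
```

Without the mask, a movie's similarity of 1.0 with itself means the movies a user has already rated tend to dominate their own score row.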

### Model based

## Evaluation

These models are very difficult to evaluate; here we use a hit rate. However, this approach has issues and does not fully represent the power or usefulness of each model.

In [None]:
UB = ut.evaluate(rating_test, 5, 'UB', df, sample_size = 200, rating = rating, sim = user_cosine_sim, user_matrix = x.T, indexes = indexes, k = 10)
IB = ut.evaluate(rating_test, 5, 'CB', df, sample_size = 200, sim = item_cosine_sim)
Hybrid_cb = ut.evaluate(rating_test, 5, 'CB', df, sample_size = 200, sim = avg_nlp_sim)
d2v = ut.evaluate(rating_test, 5, 'CB', df, sample_size = 200, sim = doc2vec_cosine_sim)
w2v = ut.evaluate(rating_test, 5, 'CB', df, sample_size = 200, sim = word2vec_cosine_sim)
tf = ut.evaluate(rating_test, 5, 'CB', df, sample_size = 200, sim = tfidf_cosine_sim)
simp = ut.evaluate(rating_test, 5, 'simple', df,  rating = rating)
h1 = ut.evaluate(rating_test, 5, 'CB', df, sample_size = 200, sim = combined_sim)
h2 = ut.evaluate(rating_test, 5, 'CB', df, sample_size = 200, sim = combined_sim2)
h3 = ut.evaluate(rating_test, 5, 'Hybrid', df, sample_size = 200, user_movie = scores)

To do:
- Evaluation method
- Finish simple memory-based collab
- Do a model-based one

Look into how a neural network could combine everything.

Hybrid

Factorisation machines?
