# Recommender Systems Walk Through

Assumptions:

- I assume we are not recommending any brand-new shows (i.e. shows with no ratings).
    If we were, we would have to fall back on content-based (CB) models,
    i.e. recommend shows based on the metadata of the movies.
    
- No brand-new users. If there were, we would again have to use movie metadata for recommendations.

- We use the entire customer history of movie ratings for collaborative filtering, as opposed to, say, only the past month or week (simply because we do not have a lot of data).

Notes:

- These models are very difficult to evaluate. The best method is A/B testing, but that is not possible for this exercise; instead we use a naive hit-rate metric.

- The end goal is to understand hybrid models and to see whether they perform best.

- Each model must produce recommendations given a user's ID.

To do:

- Model based CF

- Use directors, cost and keywords too.

In [1]:
import pandas as pd
import numpy as np 
import Utils as ut
from sklearn.model_selection import train_test_split

Using TensorFlow backend.


In [2]:
df = pd.read_csv('Clean_Item_Data')
df.drop('Unnamed: 0',inplace = True,axis = 1)

In [3]:
rating = pd.read_csv('ratings.csv')

In [4]:
rating = rating[rating.userId.isin(ut.random_sample(rating.userId.unique(),9000))]

In [5]:
rating = rating[rating.userId.map(rating.userId.value_counts())>4]

In [6]:
rating, rating_test = train_test_split(rating,
                                   stratify=rating['userId'], 
                                   test_size=0.20,
                                   random_state=432)

In [7]:
movieId_set = set.intersection(set(rating_test.movieId), 
                               set(rating.movieId), 
                               set(df.movieId))

rating_test = rating_test[rating_test.movieId.isin(movieId_set)]
rating = rating[rating.movieId.isin(movieId_set)]
df = df[df.movieId.isin(movieId_set)]

In [8]:
rating_test.shape

(132916, 4)

In [9]:
rating.shape

(528253, 4)

In [10]:
df.shape

(5469, 17)

In [11]:
df = df.reset_index(drop=True)
rating = rating.reset_index(drop=True)

In [12]:
# Very naive approach (to do this properly I need to take the number of votes into account, not just the average vote).
user = rating.userId.sample(1).iloc[0]
ut.simple(user,df,rating,5)

Unnamed: 0,title,overview,genres,vote_average,budget,runtime,adult,vote_count,movieId,imdbId,tmdbId,cast,keywords,cast_size,crew_size,director,score
135,The Shawshank Redemption,Framed in the 1940s for the double murder of h...,18,8.5,25000000,142.0,0,8358.0,318,111161,278,"['Tim Robbins', 'Morgan Freeman', 'Bob Gunton']","['prison', 'corrupt', 'police brut', 'prison c...",42,90,Frank Darabont,8.372739
304,The Godfather,"Spanning the years 1945 to 1955, a chronicle o...",18,8.5,6000000,175.0,0,6024.0,858,68646,238,"['Marlon Brando', 'Al Pacino', 'James Caan']","['itali', 'love at first sight', 'loss of fath...",58,42,Francis Ford Coppola,8.327148
216,Schindler's List,The true story of how businessman Oskar Schind...,18,8.3,22000000,195.0,0,4436.0,527,108052,424,"['Liam Neeson', 'Ben Kingsley', 'Ralph Fiennes']","['factori', 'concentration camp', 'hero', 'hol...",30,36,Steven Spielberg,8.091368
2219,Spirited Away,A ten year old girl who wanders away from her ...,14,8.3,15000000,125.0,0,3968.0,5618,245429,129,"['Rumi Hiiragi', 'Miyu Irino', 'Mari Natsuki']","['witch', 'parent child relationship', 'magic'...",15,25,Hayao Miyazaki,8.069471
979,Life Is Beautiful,A touching story of an Italian book seller of ...,35,8.3,20000000,116.0,0,3643.0,2324,118799,637,"['Nicoletta Braschi', 'Roberto Benigni', 'Gior...","['itali', 'riddl', 'bookshop', 'self sacrific'...",23,25,Roberto Benigni,8.051348


In [13]:
ut.evaluate(rating_test, 5, 'simple', df, sample_size = 200,  rating = rating)

0.31
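
The comment in the simple model says the number of votes should count as well as the average vote. One common way to do that is an IMDB-style weighted rating, which shrinks each movie's average towards the global mean when it has few votes. The sketch below is only an illustration: `weighted_rating`, the 0.90 quantile and the toy data are my assumptions, and I have not checked that `ut.simple` computes its `score` column this way.

```python
import pandas as pd

def weighted_rating(movies: pd.DataFrame, quantile: float = 0.90) -> pd.Series:
    """IMDB-style weighted rating: shrink each average towards the global mean.

    Hypothetical helper, not part of `Utils`. Assumes the columns
    `vote_average` and `vote_count` exist, as they do in `df`.
    """
    C = movies["vote_average"].mean()            # global mean rating
    m = movies["vote_count"].quantile(quantile)  # vote-count threshold
    v = movies["vote_count"]
    R = movies["vote_average"]
    return (v / (v + m)) * R + (m / (v + m)) * C

# Toy data (made up): movie A has the best average but only two votes.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "vote_average": [9.0, 8.5, 7.0],
    "vote_count": [2, 5000, 300],
})
toy["score"] = weighted_rating(toy)
ranked = toy.sort_values("score", ascending=False)
print(ranked[["title", "score"]])
```

With this weighting, movie B (many votes) ranks above movie A even though A has the higher raw average.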

## Content Based Filtering 

In [14]:
df['overview_toke'] = df['overview'].apply(ut.clean_text, toke = True)

In [15]:
df['overview_clean'] = df['overview'].apply(ut.clean_text)

In [16]:
df.head(3)

Unnamed: 0,title,overview,genres,vote_average,budget,runtime,adult,vote_count,movieId,imdbId,tmdbId,cast,keywords,cast_size,crew_size,director,score,overview_toke,overview_clean
0,Toy Story,"Led by Woody, Andy's toys live happily in his ...",16,7.7,30000000,81.0,0,5415.0,1,114709,862,"['Tom Hanks', 'Tim Allen', 'Don Rickles']","['jealousi', 'toy', 'boy', 'friendship', 'frie...",13,106,John Lasseter,7.575833,"[led, woodi, andi, toy, live, happili, hi, roo...",led woodi andi toy live happili hi room andi b...
1,Jumanji,When siblings Judy and Peter discover an encha...,12,6.9,65000000,104.0,0,2413.0,2,113497,8844,"['Robin Williams', 'Jonathan Hyde', 'Kirsten D...","['board gam', 'disappear', ""based on children'...",26,16,Joe Johnston,6.782738,"[sibl, judi, peter, discov, enchant, board, ga...",sibl judi peter discov enchant board game open...
2,Grumpier Old Men,A family wedding reignites the ancient feud be...,10749,6.5,0,101.0,0,92.0,3,113228,15602,"['Walter Matthau', 'Jack Lemmon', 'Ann-Margret']","['fish', 'best friend', 'duringcreditssting', ...",7,4,Howard Deutch,6.252781,"[famili, wed, reignit, ancient, feud, next-doo...",famili wed reignit ancient feud next-door neig...


## TF-IDF

In [72]:
tfidf_cosine_sim = ut.TF_IDF(df['overview_clean'], ngram = 5)
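
For reference, a TF-IDF similarity matrix like the one above can be built with scikit-learn roughly as follows. This is a minimal sketch, not the actual `ut.TF_IDF` code: the function name `tfidf_similarity` is hypothetical, and I am assuming `ngram` means "use word n-grams of length 1 up to `ngram`".

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

def tfidf_similarity(texts, ngram=1):
    """Cosine similarity between documents using TF-IDF vectors.

    Hypothetical stand-in for `ut.TF_IDF`; `ngram` is assumed to mean
    "use word n-grams of length 1 to `ngram`".
    """
    vectorizer = TfidfVectorizer(ngram_range=(1, ngram))
    tfidf = vectorizer.fit_transform(texts)   # rows are L2-normalised by default
    # For L2-normalised rows the dot product equals the cosine similarity.
    return linear_kernel(tfidf, tfidf)

# Toy overviews (made up), already cleaned and lower-cased.
docs = [
    "peter parker bitten spider high school",
    "peter parker spider man villain",
    "italian book seller love story",
]
sim = tfidf_similarity(docs, ngram=2)
print(sim.round(2))
```

The first two toy overviews share words, so their similarity is well above zero, while the third shares none and scores zero against both.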

In [85]:
df[df.title.str.contains("spider", case = False)]

Unnamed: 0,title,overview,genres,vote_average,budget,runtime,adult,vote_count,movieId,imdbId,tmdbId,cast,keywords,cast_size,crew_size,director,score,overview_toke,overview_clean,index1
2141,Spider-Man,After being bitten by a genetically altered sp...,14,6.8,139000000,121.0,0,5398.0,5349,145487,557,"['Tobey Maguire', 'Willem Dafoe', 'Kirsten Dun...","['loss of lov', 'spider', 'thanksgiv', 'bad bo...",71,126,Sam Raimi,6.750513,"[bitten, genet, alter, spider, nerdi, high, sc...",bitten genet alter spider nerdi high school st...,2141
2370,Spider,A mentally disturbed man takes residence in a ...,18,6.4,0,98.0,0,176.0,6197,278731,9613,"['Ralph Fiennes', 'Miranda Richardson', 'Gabri...","['secret', 'asylum', 'bed and breakfast plac',...",19,41,David Cronenberg,6.257529,"[mental, disturb, man, take, resid, halfway, h...",mental disturb man take resid halfway hous hi ...,2370
2443,Ziggy Stardust and the Spiders from Mars,Documentary from David Bowie's concert at the ...,99,7.6,0,90.0,0,15.0,6430,86643,34759,"['David Bowie', 'Mick Ronson', 'Trevor Bolder']",[],5,1,D.A. Pennebaker,6.24777,"[documentari, david, bowi, concert, hammersmit...",documentari david bowi concert hammersmith ode...,2443
2550,Kiss of the Spider Woman,Luis Molina and Valentin Arregui are cell mate...,18,6.8,0,120.0,0,53.0,6786,89424,11703,"['William Hurt', 'Raúl Juliá', 'Sônia Braga']","['gay', 'male nud', 'prison', 'based on novel'...",5,9,Hector Babenco,6.264286,"[lui, molina, valentin, arregui, cell, mate, s...",lui molina valentin arregui cell mate south am...,2550
3020,Spider-Man 2,Peter Parker is going through a major identity...,28,6.7,200000000,127.0,0,4432.0,8636,316654,558,"['Tobey Maguire', 'Kirsten Dunst', 'James Fran...","['dual ident', ""love of one's lif"", 'pizza boy...",76,128,Sam Raimi,6.650776,"[peter, parker, go, major, ident, crisi, burn,...",peter parker go major ident crisi burn spider-...,3020
4001,The Spiderwick Chronicles,Upon moving into the run-down Spiderwick Estat...,12,6.3,90000000,95.0,0,593.0,58105,416236,8204,"['Freddie Highmore', 'Mary-Louise Parker', 'Ni...","['brother sister relationship', 'family relati...",9,99,Mark Waters,6.257631,"[upon, move, run-down, spiderwick, estat, moth...",upon move run-down spiderwick estat mother twi...,4001
4717,The Amazing Spider-Man,Peter Parker is an outcast high schooler aband...,28,6.5,215000000,136.0,0,6734.0,95510,948470,1930,"['Andrew Garfield', 'Emma Stone', 'Rhys Ifans']","['loss of fath', 'vigilant', 'serum', 'marvel ...",57,106,Marc Webb,6.480051,"[peter, parker, outcast, high, schooler, aband...",peter parker outcast high schooler abandon hi ...,4717
5000,The Amazing Spider-Man 2,"For Peter Parker, life is busy. Between taking...",28,6.5,200000000,142.0,0,4274.0,110553,1872181,102382,"['Andrew Garfield', 'Emma Stone', 'Jamie Foxx']","['obsess', 'marvel com', 'sequel', 'based on c...",65,102,Marc Webb,6.469753,"[peter, parker, life, busi, take, bad, guy, sp...",peter parker life busi take bad guy spider-man...,5000


In [86]:
ut.top_rec("Spider-Man",
           tfidf_cosine_sim, 
           pd.Series(df.index, index=df['title']), 
           df, 5)

4667              21 Jump Street
817           The Breakfast Club
5000    The Amazing Spider-Man 2
2837       Bang Bang You're Dead
Name: title, dtype: object

In [18]:
ut.top_rec("The Dark Knight",
           tfidf_cosine_sim, 
           pd.Series(df.index, index=df['title']), 
           df, 5)

4388         The Dark Knight Rises
242                         Batman
611                 Batman Returns
4186    Batman: Under the Red Hood
Name: title, dtype: object

In [19]:
ut.evaluate(rating_test, 5, 'CB', df, sample_size = 200, sim = tfidf_cosine_sim)

0.015

## Word2Vec

In [20]:
word2vec_cosine_sim = ut.Word2Vec_Hybrid(
                                      df['overview_toke'], 
                                      vector_size = 300,
                                      window = 7, 
                                      epochs = 100)

100%|██████████| 18444/18444 [00:11<00:00, 1615.10it/s] 
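
One simple way to turn word vectors into a movie-level similarity matrix is to average the vectors of the words in each overview and then take cosine similarities. The sketch below shows only that pooling step with a hand-made dictionary of word vectors; whether `ut.Word2Vec_Hybrid` pools in this way (or weights the words, for example) is an assumption on my part, and `doc_vectors` and `cosine_sim_matrix` are hypothetical names.

```python
import numpy as np

def doc_vectors(token_lists, word_vectors, dim):
    """Average the vectors of the known words in each document.

    `word_vectors` is a plain dict here; with a trained Word2Vec model the
    lookup would come from the model instead (assumption, not shown).
    """
    out = np.zeros((len(token_lists), dim))
    for i, tokens in enumerate(token_lists):
        known = [word_vectors[t] for t in tokens if t in word_vectors]
        if known:                      # leave all-unknown documents as zeros
            out[i] = np.mean(known, axis=0)
    return out

def cosine_sim_matrix(X):
    """Pairwise cosine similarity of the rows of X (zero rows stay zero)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0            # avoid division by zero
    Xn = X / norms
    return Xn @ Xn.T

# Toy 2-dimensional "embeddings" (made up for illustration).
toy_vectors = {
    "batman": np.array([1.0, 0.0]),
    "joker": np.array([0.9, 0.1]),
    "wedding": np.array([0.0, 1.0]),
}
docs = [["batman", "joker"], ["batman"], ["wedding", "unknownword"]]
sim = cosine_sim_matrix(doc_vectors(docs, toy_vectors, dim=2))
print(sim.round(3))
```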


In [21]:
ut.top_rec("The Dark Knight",
           word2vec_cosine_sim, 
           pd.Series(df.index, index=df['title']), 
           df,5)

4388         The Dark Knight Rises
242                         Batman
611                 Batman Returns
4186    Batman: Under the Red Hood
Name: title, dtype: object

In [22]:
word2vec_cosine_sim.shape

(5158, 5158)

In [23]:
ut.evaluate(rating_test, 5, 'CB', df, sample_size = 200, sim = word2vec_cosine_sim)

0.01

### Word embedding

In [24]:
doc2vec_cosine_sim = ut.Doc2Word_embed(df['overview_clean'], 
                                    df['overview_toke'], 
                                    vector_size = 300,
                                    window = 15,
                                    epochs = 100)



In [25]:
ut.top_rec("The Dark Knight",
           doc2vec_cosine_sim, 
           pd.Series(df.index, index=df['title']), 
           df,5)

4388                     The Dark Knight Rises
242                                     Batman
4369                          Batman: Year One
2406    Remo Williams: The Adventure Begins...
Name: title, dtype: object

In [26]:
ut.evaluate(rating_test, 5, 'CB', df, sample_size = 200, sim = doc2vec_cosine_sim)

0.03

## Hybrid Content based model

We want to combine the NLP models (i.e. TF_IDF, Word2Vec_Hybrid and Doc2Word_embed).


In [27]:
# Assuming the cosine similarities of the NLP models are comparable, we average them.
# Note: only the Word2Vec and TF-IDF similarities are averaged here.

avg_nlp_sim = (word2vec_cosine_sim+tfidf_cosine_sim)/2


ut.top_rec("The Dark Knight",
           avg_nlp_sim, 
           pd.Series(df.index, index=df['title']), 
           df,5)

4388         The Dark Knight Rises
242                         Batman
611                 Batman Returns
4186    Batman: Under the Red Hood
Name: title, dtype: object

In [28]:
ut.evaluate(rating_test, 5, 'CB', df, sample_size = 200, sim = avg_nlp_sim)

0.05
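
The average above relies on the similarity matrices being on comparable scales. If that is in doubt, one option is to min-max scale each matrix to [0, 1] before averaging. This is a sketch of that idea only; `minmax_scale_matrix` and `average_similarities` are hypothetical helpers and the toy matrices are made up.

```python
import numpy as np

def minmax_scale_matrix(sim):
    """Rescale a similarity matrix to the range [0, 1]."""
    lo, hi = sim.min(), sim.max()
    if hi == lo:                       # constant matrix: nothing to rescale
        return np.zeros_like(sim, dtype=float)
    return (sim - lo) / (hi - lo)

def average_similarities(*sims):
    """Average several equally shaped similarity matrices after rescaling."""
    scaled = [minmax_scale_matrix(np.asarray(s, dtype=float)) for s in sims]
    return sum(scaled) / len(scaled)

a = np.array([[1.0, 0.2], [0.2, 1.0]])      # e.g. TF-IDF similarities
b = np.array([[1.0, -0.5], [-0.5, 1.0]])    # e.g. embedding similarities
avg = average_similarities(a, b)
print(avg)
```

A side effect of the rescaling is that the combined matrix can no longer contain negative values.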

In [29]:
# Sanity check: report whether the averaged similarity matrix contains any negative values.
matrix = avg_nlp_sim
if (np.asarray(matrix) < 0).any():
    print("Negative")

## Collaborative Filtering

### Item based

This is collaborative filtering, although we don't actually map users here. We just find the cosine similarity between movie rating vectors.

I.e. if a lot of people who rate MasterChef highly also rate Bake Off highly, these two shows will have a high similarity score.
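
The MasterChef / Bake Off idea can be shown on a tiny example: pivot the ratings so that each row is a show and each column a user, then take the cosine similarity between rows. This is a sketch on made-up data; `ut.Rating2Vec` may do additional processing that I am not reproducing here.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy long-format ratings (hypothetical data).
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3, 3],
    "show":   ["MasterChef", "Bake Off", "MasterChef", "Bake Off",
               "Top Gear", "MasterChef"],
    "rating": [5.0, 4.5, 4.0, 5.0, 5.0, 1.0],
})

# Rows are items, columns are users; unrated cells are filled with 0.
item_user = toy_ratings.pivot_table(index="show", columns="userId",
                                    values="rating", fill_value=0)

# Cosine similarity between the rating vectors of every pair of shows.
item_sim = pd.DataFrame(cosine_similarity(item_user),
                        index=item_user.index, columns=item_user.index)
print(item_sim.round(2))
```

Users 1 and 2 rate both cooking shows highly, so MasterChef ends up far more similar to Bake Off than to Top Gear.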

In [17]:
df['index1'] = df.index
new = pd.merge(rating,
               df[["title", 'index1', "movieId"]], 
               how='inner',
               left_on="movieId",
               right_on="movieId")

new = new[["userId","index1","rating"]]
x = pd.pivot_table(new, values='rating', index=['index1'], columns=['userId'], aggfunc=np.max, fill_value=0) 
new.head(2)

Unnamed: 0,userId,index1,rating
0,137529,4764,4.5
1,184749,4764,4.0


In [52]:
item_cosine_sim = ut.Rating2Vec(x)

In [32]:
ut.top_rec("Terminator 2: Judgment Day",
           item_cosine_sim, 
           pd.Series(df.index, index=df['title']), 
           df,5)

195     Jurassic Park
519    The Terminator
165             Speed
187      The Fugitive
Name: title, dtype: object

In [33]:
ut.evaluate(rating_test, 5, 'CB', df, sample_size = 200, sim = item_cosine_sim)

0.26

### User based

In [59]:
user_cosine_sim = ut.Rating2Vec(x.T)

In [60]:
indexes = pd.DataFrame(x.T.index)

In [61]:
ut.top_rec_user(rating.userId[0], user_cosine_sim, 10,5,df, x.T,indexes).title

2960    Harry Potter and the Prisoner of Azkaban
4033                                    Iron Man
1790                                       Shrek
496                           Return of the Jedi
5160                Star Wars: The Force Awakens
Name: title, dtype: object

In [62]:
ut.evaluate(rating_test, 5, 'UB', df, sample_size = 200, rating = rating, sim = user_cosine_sim, user_matrix = x.T, indexes = indexes, k = 10)

0.145
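
Roughly, user-based filtering finds the `k` users most similar to the target user and recommends what they rated highly, skipping anything the target user has already rated. The sketch below is one plausible version of that logic; `recommend_user_based` is a hypothetical helper and `ut.top_rec_user` may well work differently.

```python
import numpy as np

def recommend_user_based(user_idx, ratings, user_sim, k=10, n=5):
    """Recommend `n` unseen items from the `k` most similar users.

    ratings  : (n_users, n_items) array, 0 meaning "not rated"
    user_sim : (n_users, n_users) user-user similarity matrix
    Hypothetical helper; `ut.top_rec_user` may work differently.
    """
    sims = user_sim[user_idx].astype(float).copy()
    sims[user_idx] = -np.inf                       # never pick the user themself
    neighbours = np.argsort(sims)[::-1][:k]        # k nearest users
    # Similarity-weighted sum of the neighbours' ratings for every item.
    scores = (sims[neighbours] @ ratings[neighbours]).astype(float)
    scores[ratings[user_idx] > 0] = -np.inf        # drop items already rated
    top = np.argsort(scores)[::-1][:n]
    return [int(i) for i in top if np.isfinite(scores[i])]

# Toy ratings (made up): rows are users, columns are items.
ratings = np.array([
    [5, 4, 0, 0],   # target user
    [5, 5, 4, 0],   # similar taste, also liked item 2
    [0, 0, 1, 5],   # different taste
], dtype=float)
norms = np.linalg.norm(ratings, axis=1, keepdims=True)
user_sim = (ratings / norms) @ (ratings / norms).T   # cosine similarity
recs = recommend_user_based(0, ratings, user_sim, k=1, n=2)
print(recs)
```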

## Hybrid Collab & Content model

In [38]:
combined_sim = np.multiply(item_cosine_sim, avg_nlp_sim)

In [39]:
title = df.title.sample(1).iloc[0]

print("Title watched: ", title)

print("Movie recommendations")
ut.top_rec(title,
           combined_sim, 
           pd.Series(df.index, index=df['title']), 
           df,5)

Title watched:  Much Ado About Nothing
Movie recommendations


156    Four Weddings and a Funeral
737             The Wedding Singer
111       Like Water for Chocolate
36                     The Postman
Name: title, dtype: object

In [40]:
ut.evaluate(rating_test, 5, 'CB', df, sample_size = 200, sim = combined_sim)

0.195

In [41]:
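from sklearn.preprocessing import normalize
# Weight the item-based CF similarity twice as heavily as the NLP similarity,
# then L1-normalise each row so that its absolute values sum to 1.
combined_sim2 = np.divide(np.add(item_cosine_sim*2, avg_nlp_sim),2)
combined_sim2 = normalize(combined_sim2, axis=1, norm='l1')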

In [42]:
ut.top_rec("Terminator 2: Judgment Day",
           combined_sim2, 
           pd.Series(df.index, index=df['title']), 
           df,5)

519    The Terminator
195     Jurassic Park
166         True Lies
165             Speed
Name: title, dtype: object

In [43]:
ut.evaluate(rating_test, 5, 'CB', df, sample_size = 200, sim = combined_sim2)

0.16

In [44]:
scores = np.matmul(x.T, combined_sim2)

In [45]:
ut.CB_CF_MEMORY_HYBRID_REC(scores , rating.userId[0], df, 5)

Unnamed: 0,title,movieId
549,Back to the Future,1270
544,Groundhog Day,1265
434,E.T. the Extra-Terrestrial,1097
712,The Truman Show,1682
155,Forrest Gump,356


In [46]:
ut.evaluate(rating_test, 5, 'Hybrid', df, sample_size = 200, user_movie = scores)

0.065
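
The `np.matmul(x.T, combined_sim2)` step multiplies the user-by-movie rating matrix by the movie-by-movie similarity matrix, so a user's score for a movie is the sum of their own ratings weighted by how similar each rated movie is to it. A small sketch of that step, plus masking of movies already rated, is given below. The helper names are hypothetical, and I have not checked whether `ut.CB_CF_MEMORY_HYBRID_REC` masks rated movies.

```python
import numpy as np

def score_items(user_item, item_sim):
    """Score every item for every user: ratings weighted by item similarity."""
    return user_item @ item_sim

def top_n_unseen(scores, user_item, user_idx, n=5):
    """Indices of the `n` best-scoring items the user has not rated yet."""
    user_scores = scores[user_idx].astype(float).copy()
    user_scores[user_item[user_idx] > 0] = -np.inf   # mask rated items
    order = np.argsort(user_scores)[::-1][:n]
    return [int(i) for i in order if np.isfinite(user_scores[i])]

# Toy data (made up): 2 users, 3 items.
user_item = np.array([[5.0, 0.0, 0.0],
                      [0.0, 4.0, 0.0]])
item_sim = np.array([[1.0, 0.8, 0.1],
                     [0.8, 1.0, 0.3],
                     [0.1, 0.3, 1.0]])
toy_scores = score_items(user_item, item_sim)
recs = top_n_unseen(toy_scores, user_item, user_idx=0, n=2)
print(recs)
```

User 0 only rated item 0, so item 1 (similarity 0.8) is ranked ahead of item 2 (similarity 0.1).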

### Model based

#### SVD

In [20]:
from surprise import Reader, Dataset, SVD

In [32]:
x.head(2)

userId,47,80,89,90,102,107,127,131,150,201,...,270641,270674,270685,270688,270718,270719,270729,270736,270848,270894
index1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,0.0,0,0,0.0,0.0,0,0,0,4,4.5,...,0.0,0,0,0.0,0,0,0,0.0,0.0,0.0
1,0.0,0,0,0.0,0.0,0,0,0,0,0.0,...,0.0,0,0,0.0,0,0,0,0.0,0.0,0.0


In [64]:
from scipy.sparse.linalg import svds
M = x.T
# svds expects a floating-point array, so convert the pivot table explicitly.
U, sigma, Vt = svds(M.to_numpy(dtype=float), k=150)

In [65]:
sigma_diag_matrix = np.diag(sigma)

In [66]:
all_user_predicted_ratings = np.dot(np.dot(U, sigma_diag_matrix), Vt)

preds_df = pd.DataFrame(all_user_predicted_ratings, columns = M.columns, index=M.index)

In [67]:
preds_df

index1,0,1,2,3,4,5,6,7,8,9,...,5459,5460,5461,5462,5463,5464,5465,5466,5467,5468
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
47,0.206890,-0.084963,0.226051,0.275886,-0.065360,0.344545,-0.443070,0.426263,0.352360,6.041230,...,-0.031172,-0.025064,0.028588,0.004298,0.009962,0.063335,0.002214,0.098241,-0.026274,0.003710
80,0.090107,0.188953,-0.219534,0.416049,4.384805,0.736280,-0.033869,-0.037168,0.291702,0.085258,...,0.011065,0.007162,0.003098,0.001978,-0.010329,0.040640,0.001263,0.002258,-0.006294,0.000246
89,0.033644,-0.367754,-0.227612,-0.044297,-0.220981,-0.590141,0.067733,0.047287,0.576854,0.014218,...,-0.027074,-0.018429,-0.045219,-0.000608,-0.031837,0.113323,-0.148709,0.004298,-0.050884,0.000297
90,-0.091386,0.063128,0.177861,0.031419,-0.017055,-0.096234,-0.026647,-0.037875,0.209395,-0.010165,...,0.007329,-0.010749,-0.009444,-0.001855,-0.016785,-0.004429,-0.005432,0.002929,-0.001332,0.000460
102,-0.017500,-0.050120,-0.059742,0.165551,0.093606,-0.157325,-0.020486,0.013692,-0.036205,0.067421,...,-0.012737,0.000189,0.034035,0.008510,0.012160,-0.015185,-0.016565,-0.006447,-0.006688,-0.000257
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
270719,0.071621,0.194067,0.104578,0.008986,0.315960,-0.000330,0.006396,-0.019401,-0.143088,0.029776,...,0.005657,0.004282,-0.017193,-0.001938,-0.008488,-0.006809,-0.002854,-0.008224,0.000058,0.000003
270729,-0.127308,0.007736,0.323470,0.334567,0.764410,0.056092,-0.027744,0.334242,0.798346,-0.103513,...,0.015841,0.015080,-0.006487,-0.001503,-0.021646,-0.015089,0.018839,-0.000969,0.004803,0.000778
270736,-0.071442,0.194331,-0.181691,-0.084906,0.190892,-0.177154,0.084736,-0.032094,-0.329955,-0.035208,...,-0.032269,0.006759,-0.031083,-0.002315,-0.016456,0.008624,0.003902,-0.014177,0.018448,0.000243
270848,-0.095066,0.026433,-0.020042,-0.004706,0.089384,-0.086531,-0.038021,-0.007356,0.138613,-0.072419,...,-0.000826,0.002997,0.007170,0.000687,0.006720,0.009558,0.003103,0.005537,0.003057,-0.000279
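
The cells above build a rank-150 approximation of the user-by-movie matrix: keep only the largest singular values and multiply the factors back together. Here is the same idea on a toy matrix, as a sketch only. `low_rank_predictions` is a hypothetical helper and the data is made up.

```python
import numpy as np
from scipy.sparse.linalg import svds

def low_rank_predictions(user_item, k):
    """Rank-k approximation of a user-item matrix via truncated SVD."""
    user_item = np.asarray(user_item, dtype=float)   # svds needs floats
    U, sigma, Vt = svds(user_item, k=k)              # requires k < min(shape)
    return U @ np.diag(sigma) @ Vt

# Two groups of users with opposite tastes (toy data).
toy = np.array([[5, 4, 0, 1],
                [4, 5, 0, 1],
                [0, 1, 5, 4],
                [1, 0, 4, 5]], dtype=float)
preds = low_rank_predictions(toy, k=2)
print(preds.round(2))
```

The reconstructed matrix is dense, so entries that were 0 (unrated) now hold small predicted values, which is why negative numbers show up in `preds_df`.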


In [68]:
user_cosine_sim = ut.Rating2Vec(preds_df)

In [69]:
indexes = pd.DataFrame(preds_df.index)

In [70]:
ut.top_rec_user(rating.userId[0], user_cosine_sim, 10,5,df, x.T,indexes).title

2960    Harry Potter and the Prisoner of Azkaban
4217                                          Up
2425                                Finding Nemo
1790                                       Shrek
726                            Good Will Hunting
Name: title, dtype: object

In [71]:
ut.evaluate(rating_test, 5, 'UB', df, sample_size = 200, rating = rating, sim = user_cosine_sim, user_matrix = preds_df, indexes = indexes, k = 10)

0.17

## Evaluation

These models are very difficult to evaluate; here we use a hit rate. However, there are issues with doing this, and it doesn't exactly represent the power or usefulness of each model.
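
For concreteness, one plausible definition of the hit rate is: sample some test users, recommend `n` items to each, and count a hit if at least one recommended item appears in that user's held-out ratings. The sketch below implements that definition; `hit_rate` is a hypothetical function and `ut.evaluate` may compute the metric differently.

```python
import random

def hit_rate(test_items_by_user, recommend, n=5, sample_size=200, seed=0):
    """Fraction of sampled users with at least one held-out item in their top n.

    test_items_by_user : dict mapping user id -> set of held-out item ids
    recommend          : callable (user_id, n) -> list of item ids
    This is one plausible definition; `ut.evaluate` may compute it differently.
    """
    users = sorted(test_items_by_user)
    rng = random.Random(seed)
    if len(users) > sample_size:
        users = rng.sample(users, sample_size)
    hits = sum(
        1 for u in users
        if set(recommend(u, n)) & test_items_by_user[u]
    )
    return hits / len(users) if users else 0.0

# Toy example (made up): users 1 and 3 get a hit, users 2 and 4 do not.
held_out = {1: {10, 11}, 2: {20}, 3: {30}, 4: {40}}
toy_recs = {1: [10, 99], 2: [98, 97], 3: [30, 96], 4: [95, 94]}
rate = hit_rate(held_out, lambda u, n: toy_recs[u][:n], n=2)
print(rate)
```

Because only a sample of users is scored, the value varies from run to run unless the sample is seeded, which is one reason the numbers in this notebook should be read loosely.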

In [None]:
UB = ut.evaluate(rating_test, 5, 'UB', df, sample_size = 200, rating = rating, sim = user_cosine_sim, user_matrix = x.T, indexes = indexes, k = 10)
IB = ut.evaluate(rating_test, 5, 'CB', df, sample_size = 200, sim = item_cosine_sim)
Hybrid_cb = ut.evaluate(rating_test, 5, 'CB', df, sample_size = 200, sim = avg_nlp_sim)
d2v = ut.evaluate(rating_test, 5, 'CB', df, sample_size = 200, sim = doc2vec_cosine_sim)
w2v = ut.evaluate(rating_test, 5, 'CB', df, sample_size = 200, sim = word2vec_cosine_sim)
tf = ut.evaluate(rating_test, 5, 'CB', df, sample_size = 200, sim = tfidf_cosine_sim)
simp = ut.evaluate(rating_test, 5, 'simple', df,  rating = rating)
h1 = ut.evaluate(rating_test, 5, 'CB', df, sample_size = 200, sim = combined_sim)
h2 = ut.evaluate(rating_test, 5, 'CB', df, sample_size = 200, sim = combined_sim2)
h3 = ut.evaluate(rating_test, 5, 'Hybrid', df, sample_size = 200, user_movie = scores)

To do:
- Evaluation method
- Finish simple memory-based collab
- Do a model-based one

Look into how a neural network could combine everything.

Hybrid

Factorisation machines?
