# Recommender Systems

Demonstration for MH6221 Analytics Workshop 1

---

### Agenda

In this demonstration, we cover:

- Simple content-based recommendations using "bag of words" title and genre similarity
- Collaborative filtering with user ratings
- Cross-validation using Singular Value Decomposition and K-Nearest Neighbours

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from surprise import Dataset, Reader, SVD, KNNBasic
from surprise.model_selection import cross_validate
from surprise.model_selection import GridSearchCV
import gradio as gr

# Content Filtering Recommender

First, we read our movies dataset.

Columns and formats:
- Movie ID (int)
- Title (string)
- Genres (string, pipe-separated list)

In [2]:
movies_df = pd.read_csv('movies.csv')
# Drop the trailing " (YYYY)" release year (7 characters) from each title
movies_df['title'] = movies_df['title'].map(lambda x: str(x)[:-7])
movies_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji,Adventure|Children|Fantasy
2,3,Grumpier Old Men,Comedy|Romance
3,4,Waiting to Exhale,Comedy|Drama|Romance
4,5,Father of the Bride Part II,Comedy


## Ingestion and checks

Run a few simple summary statistics to make sure all is in order.

In [None]:
movies_df.shape

(9125, 3)

In [None]:
movies_df.describe(include=['object'])

Unnamed: 0,title,genres
count,9125,9125
unique,8893,902
top,Hamlet,Drama
freq,6,1170


In [None]:
movies_df.describe()

Unnamed: 0,movieId
count,9125.0
mean,31123.291836
std,40782.633604
min,1.0
25%,2850.0
50%,6290.0
75%,56274.0
max,164979.0


## Data preprocessing

Sanitise and reformat strings into a suitable "bag of words" format.

- Split by separator
- For titles, split to individual words
- Strip spaces, make words case-insensitive

In [3]:
movies_df['genre'] = movies_df['genres'].map(lambda x: x.split('|'))
movies_df['title_name'] = movies_df['title'].map(lambda x: x.split(' '))

# No explicit lower-casing is applied here: assigning to the `row` yielded by
# iterrows() only changes a copy, and CountVectorizer (used below) lower-cases
# tokens by default.
# .copy() avoids a SettingWithCopyWarning when new columns are added later.
new_movies_df = movies_df[['movieId', 'title', 'genre', 'title_name']].copy()

In [4]:
new_movies_df['Bag of words'] = new_movies_df['genre'] + new_movies_df['title_name']
new_movies_df['bag_of_words'] = [' '.join(map(str, l)) for l in new_movies_df['Bag of words']]
final_movies_df = new_movies_df.drop(columns=['genre','title_name', 'Bag of words'])
final_movies_df.head()

Unnamed: 0,movieId,title,bag_of_words
0,1,Toy Story,Adventure Animation Children Comedy Fantasy To...
1,2,Jumanji,Adventure Children Fantasy Jumanji
2,3,Grumpier Old Men,Comedy Romance Grumpier Old Men
3,4,Waiting to Exhale,Comedy Drama Romance Waiting to Exhale
4,5,Father of the Bride Part II,Comedy Father of the Bride Part II


## Similarity

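Vectorise the words in each bag and calculate cosine similarity. The pairwise cosine similarity matrix for all movies is shown below.

**NOTE:** With its default settings, `CountVectorizer` lower-cases the text and counts every occurrence of a word, so a word repeated within one row is counted each time. Pass `binary=True` to record only whether a word appears.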

In [5]:
count_vectorizer = CountVectorizer()
count_matrix = count_vectorizer.fit_transform(final_movies_df['bag_of_words'])
cosine_sim = cosine_similarity(count_matrix, count_matrix)
print(cosine_sim)

[[1.         0.56694671 0.16903085 ... 0.         0.14285714 0.13363062]
 [0.56694671 1.         0.         ... 0.         0.         0.        ]
 [0.16903085 0.         1.         ... 0.         0.         0.15811388]
 ...
 [0.         0.         0.         ... 1.         0.         0.        ]
 [0.14285714 0.         0.         ... 0.         1.         0.        ]
 [0.13363062 0.         0.15811388 ... 0.         0.         1.        ]]
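
As a sanity check on the matrix above, the similarity between the first two movies can be reproduced by hand. This is a minimal sketch using only the standard library; the helper names `bag_vector` and `cosine` are illustrative, and the regular expression is an assumption that mimics `CountVectorizer`'s default tokenisation (lower-cased tokens of two or more word characters).

```python
import math
import re
from collections import Counter

def bag_vector(text):
    # Lower-case the text and count tokens of two or more word characters
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))

def cosine(a, b):
    # Dot product over shared words, divided by the product of vector lengths
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

toy_story = bag_vector("Adventure Animation Children Comedy Fantasy Toy Story")
jumanji = bag_vector("Adventure Children Fantasy Jumanji")

# Three shared words, vectors of 7 and 4 words: 3 / sqrt(7 * 4)
print(round(cosine(toy_story, jumanji), 8))  # 0.56694671
```

The result matches the entry in row 0, column 1 of the printed matrix.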


## Recommendation modelling

Based on an input movie already in the database, this function recommends the 20 most similar movies also in the database.

As the similarity matrix is precomputed, recommendations are generated very quickly.

In [6]:
def recommend(title, movie_df, cosine_sim):
    # Guard against titles that are not in the dataset
    if title not in movie_df['title'].values:
        print('Error: please ensure that movies dataframe contains your query title.')
        return

    # Row position of the first movie with this title
    title_idx = movie_df[movie_df['title'] == title].index[0]
    sorted_similarities = pd.Series(cosine_sim[title_idx]).sort_values(ascending=False)
    # Skip position 0, normally the query movie itself (similarity = 1), and keep the next 20
    top_indices = sorted_similarities.index[1:21]

    movie_titles = movie_df['title'].tolist()
    return [movie_titles[i] for i in top_indices]

In [7]:
recommend('Jumanji', final_movies_df, cosine_sim)

['Pan',
 "Pete's Dragon",
 'MirrorMask',
 'Halloweentown',
 'G-Force',
 "Gulliver's Travels",
 'Seventh Son',
 'Zathura',
 'Moana',
 'Bridge to Terabithia',
 'Antz',
 'NeverEnding Story, The',
 'Golden Compass, The',
 'Tall Tale',
 'Return to Oz',
 'Peter Pan',
 'Turbo',
 'Alice in Wonderland',
 'Borrowers, The',
 'Halloweentown High']

# Collaborative Filtering Recommender

This recommender uses user movie ratings to recommend movies that users similar to the target user rated highly.

**Desired outcome:** The target user rates the movies recommended to them highly.

For this, we use a user-ratings table, tied to the previous movies dataset.

Columns and formats:
- User ID (int)
- Movie ID (int, corresponds to movie ID in movies dataset)
- Rating (float, 1 to 5)
- Timestamp (int, Unix epoch seconds)

In [8]:
user_df = pd.read_csv('ratings.csv')
user_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


## Ingestion and checks

Run a few simple summary statistics to check the ratings dataset as well.

In [None]:
user_df.shape

(100000, 4)

In [None]:
user_df.describe()

Unnamed: 0,userId,movieId,rating,timestamp
count,100000.0,100000.0,100000.0,100000.0
mean,462.48475,425.53013,3.52986,883528900.0
std,266.61442,330.798356,1.125674,5343856.0
min,1.0,1.0,1.0,874724700.0
25%,254.0,175.0,3.0,879448700.0
50%,447.0,322.0,4.0,882826900.0
75%,682.0,631.0,4.0,888260000.0
max,943.0,1682.0,5.0,893286600.0


## Build data set

Build a full dataset from our user-ratings data.

**NOTE:** We ignore timestamp for simplicity of demonstrating the concept, but a time decay factor could yield additional prediction power (recent ratings are more likely to be representative than older ones).
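
One way such a time decay could look is an exponential weight that halves after a fixed period. This is only a sketch under stated assumptions: `decay_weight`, the one-year `half_life_days` default, and the reference time `now` are all hypothetical, timestamps are assumed to be Unix seconds, and nothing here is wired into the models used in this demonstration.

```python
def decay_weight(rating_ts, now_ts, half_life_days=365.0):
    # Age of the rating in days (timestamps assumed to be Unix seconds)
    age_days = (now_ts - rating_ts) / 86400.0
    # The weight halves every `half_life_days`
    return 0.5 ** (age_days / half_life_days)

now = 1_000_000_000  # hypothetical reference time
print(decay_weight(now, now))                # 1.0 (a rating made right now)
print(decay_weight(now - 365 * 86400, now))  # 0.5 (a rating made one year ago)
```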

In [9]:
data = Dataset.load_from_df(user_df[['userId', 'movieId', 'rating']], Reader())
data_train = data.build_full_trainset()

## Finding the ideal recommender

Two algorithms are tested in this demonstration: SVD and KNN.

---

### Singular Value Decomposition (SVD)

In [None]:
svd_recomender_model = SVD()
cross_validate(svd_recomender_model, data, measures=['rmse', 'mae'], cv=5)

{'test_rmse': array([0.94040942, 0.94035754, 0.92704873, 0.9402736 , 0.93715993]),
 'test_mae': array([0.74046544, 0.74201133, 0.73170769, 0.74008957, 0.73830493]),
 'fit_time': (7.356231927871704,
  7.312633991241455,
  7.318242073059082,
  7.332475185394287,
  7.346334934234619),
 'test_time': (0.48653578758239746,
  0.4798588752746582,
  0.47815394401550293,
  0.4732837677001953,
  0.4705379009246826)}
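
For reference, the two accuracy measures reported above can be written out directly. The sketch below uses made-up ratings purely for illustration; `rmse` and `mae` are hypothetical helper names, not functions from `surprise`.

```python
import math

def rmse(actual, predicted):
    # Root mean squared error: penalises large errors more heavily
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    # Mean absolute error: average size of the error
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual = [4.0, 3.0, 5.0, 2.0]      # illustrative true ratings
predicted = [3.5, 3.0, 4.0, 3.0]   # illustrative model estimates

print(rmse(actual, predicted))  # sqrt((0.25 + 0 + 1 + 1) / 4) = 0.75
print(mae(actual, predicted))   # (0.5 + 0 + 1 + 1) / 4 = 0.625
```

Lower is better for both measures.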

In [None]:
svd_recomender_model.fit(data_train)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x12facfc10>

In [None]:
svd_recomender_model.predict(1, 1)

Prediction(uid=1, iid=1, r_ui=None, est=3.880759152958209, details={'was_impossible': False})

### K-nearest neighbours (KNN)

In [None]:
knn = KNNBasic()
cross_validate(knn, data, measures=['rmse', 'mae'], cv=5)

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.


{'test_rmse': array([0.97824658, 0.9821069 , 0.97479225, 0.98342658, 0.97154262]),
 'test_mae': array([0.7751123 , 0.77557099, 0.76779762, 0.77626524, 0.76775025]),
 'fit_time': (0.42267918586730957,
  0.46935606002807617,
  0.45532894134521484,
  0.4592478275299072,
  0.5928571224212646),
 'test_time': (9.25972604751587,
  9.007874011993408,
  8.888593912124634,
  9.36017918586731,
  9.170369148254395)}

In [None]:
knn.fit(data_train)

Computing the msd similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBasic at 0x12fb04290>

In [None]:
knn.predict(1, 1)

Prediction(uid=1, iid=1, r_ui=None, est=4.1299713089494405, details={'actual_k': 40, 'was_impossible': False})

In [None]:
svd_mean_rmse = sum([0.94040942, 0.94035754, 0.92704873, 0.9402736 , 0.93715993])/5
knn_mean_rmse = sum([0.97824658, 0.9821069 , 0.97479225, 0.98342658, 0.97154262])/5

print('RMSE scores')
print(f'SVD: {svd_mean_rmse}')
print(f'KNN: {knn_mean_rmse}')

RMSE scores
SVD: 0.937049844
KNN: 0.978022986
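
The fold scores above were copied by hand. A less error-prone variant keeps the dictionaries returned by `cross_validate` in variables and averages the `test_rmse` entry. In the sketch below, plain lists holding the same fold scores stand in for those dictionaries, and the names `svd_results` and `knn_results` are hypothetical.

```python
from statistics import mean

# Stand-ins for the dictionaries returned by cross_validate (same fold scores as above)
svd_results = {"test_rmse": [0.94040942, 0.94035754, 0.92704873, 0.9402736, 0.93715993]}
knn_results = {"test_rmse": [0.97824658, 0.9821069, 0.97479225, 0.98342658, 0.97154262]}

for name, results in [("SVD", svd_results), ("KNN", knn_results)]:
    print(f"{name}: {mean(results['test_rmse']):.6f}")
```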


### Choosing our recommender

SVD has the lower (better) mean RMSE, so we continue with it.

Algorithm|Mean RMSE
---|---
SVD|0.937
KNN|0.978

## Finding the best parameters

`GridSearchCV` performs an exhaustive search over various combinations of parameters, computing accuracy metrics on our chosen SVD algorithm in order to find the best parameters.

##### It may take upwards of 30 minutes to run.

In [10]:
param_grid = {"n_epochs": [40,50,60], "lr_all": [0.01, 0.015], "reg_all": [0.4, 0.3], 'random_state':[88]}
gs = GridSearchCV(SVD, param_grid, measures=["rmse", "mae"], cv=7)

gs.fit(data)


# best RMSE score
print(gs.best_score["rmse"])

# combination of parameters that gave the best RMSE score
print(gs.best_params["rmse"])

0.8724982089680967
{'n_epochs': 40, 'lr_all': 0.01, 'reg_all': 0.3, 'random_state': 88}
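
To get a feel for why the search is slow, the candidate settings can be enumerated by hand. This sketch only counts the combinations implied by `param_grid` and `cv=7`; it does not train anything, and the variable names are illustrative.

```python
from itertools import product

param_grid = {"n_epochs": [40, 50, 60], "lr_all": [0.01, 0.015],
              "reg_all": [0.4, 0.3], "random_state": [88]}

# Every combination of one value per parameter
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

print(len(combinations))      # 3 * 2 * 2 * 1 = 12 candidate settings
print(len(combinations) * 7)  # 84 model fits with 7-fold cross-validation
print(combinations[0])
```

Each extra value in the grid multiplies the number of fits, so widening the search quickly becomes expensive.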


Now we can use the estimator that yields the best RMSE.

In [11]:
svd_algo = gs.best_estimator["rmse"]
svd_algo.fit(data.build_full_trainset())

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x20be0cfedc0>

In [12]:
# Find the list of unique IDs for users and movies
idnos = list(user_df['userId'].unique())
movienos = list(user_df['movieId'].unique())

## Predicted ratings

Predict a rating for every user-movie pair; pairs the user has already rated are filtered out afterwards, leaving a predicted score for each unwatched movie.

Again, this is precomputed so recommendations emerge quickly when requested.

In [13]:
ratings = []
user_id = []
movie_id = []

# Predict a rating for every (user, movie) pair
for x in idnos:
    for i in movienos:
        prediction = svd_algo.predict(x, i)
        user_id.append(x)
        movie_id.append(prediction.iid)
        ratings.append(prediction.est)  # estimated rating

final_ratings = pd.DataFrame({
    'userId': user_id,
    'movieId': movie_id,
    'predicted rating': ratings,
})

In [59]:
final_ratings

Unnamed: 0,userId,movieId,predicted rating
0,1,1,4.469234
1,1,3,3.921582
2,1,6,4.512606
3,1,47,4.552512
4,1,50,4.748446
...,...,...,...
5931635,610,160341,3.301183
5931636,610,160527,3.948788
5931637,610,160836,3.470038
5931638,610,163937,3.631114


In [14]:
# Remove movies that the user has already rated
predicted_ratings = final_ratings.merge(user_df.drop_duplicates(), on=['userId', 'movieId'], how='left',
                                       indicator = True)

final_df = predicted_ratings[predicted_ratings['_merge'] == 'left_only']
final_df = final_df.drop(columns=['rating','_merge','timestamp'])
final_df.head()

Unnamed: 0,userId,movieId,predicted rating
232,1,318,4.876347
233,1,1704,4.628691
234,1,6874,4.606376
235,1,8798,4.374886
236,1,46970,4.077419
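
The filtering step above is a left anti-join: keep the rows of the left table that have no match in the right table. A minimal sketch on made-up data, assuming `pandas` is available; the frames `predicted` and `rated` are hypothetical stand-ins for `final_ratings` and `user_df`.

```python
import pandas as pd

predicted = pd.DataFrame({"userId": [1, 1, 2], "movieId": [10, 20, 10],
                          "predicted rating": [4.2, 3.1, 3.8]})
rated = pd.DataFrame({"userId": [1], "movieId": [10], "rating": [5.0]})

# indicator=True adds a `_merge` column saying where each row was found
merged = predicted.merge(rated, on=["userId", "movieId"], how="left", indicator=True)

# "left_only" rows have a prediction but no existing rating
unseen = merged[merged["_merge"] == "left_only"].drop(columns=["rating", "_merge"])
print(unseen)
```

User 1 has already rated movie 10, so only the pairs (1, 20) and (2, 10) remain.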


In [15]:
# Get the movie titles by merging with the movies dataset
movies = pd.read_csv('movies.csv')
final_user_df = final_df.merge(movies, on=['movieId'], how='inner')
final_user_df = final_user_df.drop(columns=['genres'])
final_user_df.head()

Unnamed: 0,userId,movieId,predicted rating,title
0,1,318,4.876347,"Shawshank Redemption, The (1994)"
1,3,318,3.282163,"Shawshank Redemption, The (1994)"
2,4,318,4.020638,"Shawshank Redemption, The (1994)"
3,7,318,3.85077,"Shawshank Redemption, The (1994)"
4,9,318,4.036433,"Shawshank Redemption, The (1994)"


## Recommending movies to a user

Given a user who has already rated some movies, what other movies are they likely to rate highly?

AKA: what can we show on a user's home feed?

In [16]:
def user_recommender(df, user):
    user_df = df[df['userId'] == user]
    # sort_values returns a new frame, which avoids pandas' SettingWithCopyWarning
    user_df = user_df.sort_values(by='predicted rating', ascending=False)
    # Titles of the 9 highest predicted ratings
    return user_df.head(9)[['title']]

In [17]:
user_recommender(final_user_df, 2)

Unnamed: 0,title
5163026,"Jetée, La (1962)"
1664218,"Trial, The (Procès, Le) (1962)"
3769069,Come and See (Idi i smotri) (1985)
668163,"Three Billboards Outside Ebbing, Missouri (2017)"
4290258,Bad Boy Bubby (1993)
1501058,Neon Genesis Evangelion: The End of Evangelion...
1119824,Guess Who's Coming to Dinner (1967)
1695311,Captain Fantastic (2016)
2880331,Woman in the Dunes (Suna no onna) (1964)


## Hybrid Recommender

Given a user and a movie, what similar movies is the user likely to rate highly?

Get similar movies first, then find the ones that the user is likely to like.

In [57]:
def hybrid_recommend(movie_df, cosine_sim, rating_df):
    title = input("Enter movie title: ").lower()
    user = int(input("Enter user ID: "))
    # Compare case-insensitively without overwriting the caller's titles
    titles_lower = movie_df['title'].str.lower()

    if title not in titles_lower.values:
        print('Error: please ensure that movies dataframe contains your query title.')
        return

    title_idx = titles_lower[titles_lower == title].index[0]
    sorted_similarities = pd.Series(cosine_sim[title_idx]).sort_values(ascending=False)
    # Skip position 0, normally the query movie itself (similarity = 1); keep 19 candidates
    similar_indices = sorted_similarities.index[1:20]

    # Translate row positions into movie IDs
    movie_ids = movie_df['movieId'].tolist()
    similar_df = pd.DataFrame({'movieId': [movie_ids[i] for i in similar_indices]})

    user_df = rating_df[rating_df['userId'] == user]

    # Of the similar movies, keep those with a predicted rating for this user, best first
    recommendation = user_df.merge(similar_df, on='movieId', how='inner')
    recommendation = recommendation.sort_values(by='predicted rating', ascending=False)
    return recommendation.head(10)[['title']]

In [61]:
hybrid_recommend(final_movies_df, cosine_sim, final_user_df)

Enter movie title: toy story
Enter user ID: 76


Unnamed: 0,title
13,Paddington 2 (2017)
3,Toy Story 3 (2010)
11,Presto (2008)
1,Shrek (2001)
0,Toy Story 2 (1999)
2,"Monsters, Inc. (2001)"
7,Enchanted (2007)
12,Moana (2016)
9,Halloweentown (1998)
16,Turbo (2013)


In [62]:
hybrid_recommend(final_movies_df, cosine_sim, final_user_df)

Enter movie title: toy story
Enter user ID: 288


Unnamed: 0,title
9,Paddington 2 (2017)
0,Toy Story 3 (2010)
7,Presto (2008)
3,Enchanted (2007)
8,Moana (2016)
5,Halloweentown (1998)
12,Turbo (2013)
1,Madagascar (2005)
10,Gnomeo & Juliet (2011)
2,Minions (2015)



## A simple Recommendations GUI

Input and preview the recommended movies in a simple GUI!

Implementation done using `gradio`.

In [51]:
def hybrid_ui(user, title):
    title = title.lower()
    user = int(user)
    # Compare case-insensitively without overwriting the stored titles
    titles_lower = final_movies_df['title'].str.lower()

    if title not in titles_lower.values:
        # Return the message as a one-cell DataFrame so the Dataframe output can display it
        return pd.DataFrame(['Error: please ensure that movies dataframe contains your query title.'])

    title_idx = titles_lower[titles_lower == title].index[0]
    sorted_similarities = pd.Series(cosine_sim[title_idx]).sort_values(ascending=False)
    # Skip position 0, normally the query movie itself (similarity = 1); keep 19 candidates
    similar_indices = sorted_similarities.index[1:20]

    # Translate row positions into movie IDs
    movie_ids = final_movies_df['movieId'].tolist()
    similar_df = pd.DataFrame({'movieId': [movie_ids[i] for i in similar_indices]})

    user_df = final_user_df[final_user_df['userId'] == user]

    # Of the similar movies, keep those with a predicted rating for this user, best first
    recommendation = user_df.merge(similar_df, on='movieId', how='inner')
    recommendation = recommendation.sort_values(by='predicted rating', ascending=False)
    return recommendation.head(10)[['title']]

In [48]:
hybrid_ui(2, 'toy story')

Unnamed: 0,title
13,Paddington 2 (2017)
3,Toy Story 3 (2010)
11,Presto (2008)
1,Shrek (2001)
0,Toy Story 2 (1999)
2,"Monsters, Inc. (2001)"
7,Enchanted (2007)
12,Moana (2016)
16,Turbo (2013)
9,Halloweentown (1998)


In [49]:
user_input = gr.Number(label = "Enter user ID")
title_input = gr.Textbox(label = "Enter the movie that user is currently watching")
output = gr.Dataframe(label = "Recommended Title")

In [55]:
app = gr.Interface(fn = hybrid_ui, inputs=[user_input, title_input], outputs=output)
app.launch()

Running on local URL:  http://127.0.0.1:7865

To create a public link, set `share=True` in `launch()`.



