## Recommendation Engine Quickstart, the Notebook

An overview of recommender principles and techniques, mainly for my own enrichment and practice.  See links to main sections below:

- <a href="#Principles">High Level Principles</a><br>
- <a href="#Terms">Common Terms</a>
- <a href="#Examples">Models &amp; Snippets / Examples:</a>
    - <a href="#Content">Content-Based Recommender</a>
         - <a href="#Enrichment">Data Enrichment</a><br><br>
    - <a href="#Collaborate">Collaborative Filtering Item-Item Recommender</a><br><br>
    - <a href="#CollaborateU">Collaborative Filtering User-User Recommender</a>
<br><br>
- <a href="#Evaluation">Evaluating Recommender Quality</a>
- <a href="#Additional">Additional Thoughts</a>

### High Level Principles <a id="Principles"></a>

For a succinct definition of recommendation engines, I will reference the following:

<blockquote>A recommendation engine, also known as a recommender system, is software that analyzes available data to make suggestions for something that a website user might be interested in, such as a book, a video or a job, among other possibilities.

(https://whatis.techtarget.com/definition/recommendation-engine)
</blockquote> 

In researching various methods of building recommendation engines, I found a common pattern that seems to apply to <i>most</i> machine learning methods.  First, clean the data and get every unique entity at the row level, such as a unique dataframe/table of customers.  Next, encode features for each entity and compute the similarity of each entity--typically you create a similarity matrix in this step.  Finally, query the matrix or compute similarity based on a single user's or item's attributes and output the most-similar items.

For example, when building a simple user-based product recommendation engine, a quick recipe might be:
1. Gather user data into a single dataframe of unique users, including behavioral and contextual variables
2. Enrich user data, if possible, e.g., add social profile variables or user segmentation or recent purchases
3. Encode/engineer user features
4. Compute the similarity of all unique users seen in recent history and rank them
5. Take a user, query similar users (highly-ranked users) and see which items they bought that user has not purchased
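
The recipe above can be sketched on toy data.  This is a minimal illustration, not a production design: the user names, feature columns, and purchase sets are all hypothetical, and Pearson correlation is just one possible similarity measure.

```python
import numpy as np
import pandas as pd

# Steps 1-3 (assumed already done): one row per unique user, numerically encoded.
# All names and values here are hypothetical.
user_features = pd.DataFrame(
    {
        "age_scaled": [0.20, 0.25, 0.90, 0.85],
        "visits_scaled": [0.80, 0.70, 0.10, 0.20],
        "is_member": [1.0, 1.0, 0.0, 0.0],
    },
    index=["ann", "bob", "cat", "dan"],
)

# Hypothetical purchase history per user
purchases = {
    "ann": {"tent", "stove"},
    "bob": {"tent", "stove", "lantern"},
    "cat": {"novel"},
    "dan": {"novel", "bookmark"},
}

# Step 4: user-user similarity matrix (Pearson correlation between feature rows)
sim_values = user_features.T.corr(method="pearson").to_numpy(copy=True)
np.fill_diagonal(sim_values, 0.0)  # ignore each user's similarity with themselves
user_sim = pd.DataFrame(sim_values, index=user_features.index, columns=user_features.index)


def recommend_for(user, k=1):
    """Step 5: items bought by the k most similar users that `user` has not bought."""
    neighbours = user_sim[user].sort_values(ascending=False).index[:k]
    recs = set()
    for other in neighbours:
        recs |= purchases[other] - purchases[user]
    return sorted(recs)


print(recommend_for("ann"))
```

Here the nearest neighbour of `ann` is `bob`, so the only candidate is the one item `bob` bought that `ann` has not.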

Of course, more sophistication can be added to the above, including combining recommendation results and deploying in a fashion similar to below.

An actual deployment plan might look something like the following:
1. Build a data pipeline - extract, transform, and load data at regular intervals, perhaps daily
    - transform data into a format that can be easily fed into a machine learning algorithm / wrapper
2. Create a regularly scheduled task that ingests fresh user data then builds and deploys a new recommendation model
    - automated script takes model wrapper parameters and generates a new similarity matrix
    - script also evaluates and records model quality
3. Tag certain web pages with JavaScript that encodes the current user and compares them against the current recommendation build
4. Once step 3 is complete, output inventory recommendations in HTML based on similarity computations.
5. Gather further feedback from actual users and QA

### Common Terms <a id="Terms"></a>

Esoteric recommender jargon as well as common terms I have run into during my research.  See terms and definitions below:

- <b>Collaborative Filtering</b>: aka 'social filtering', recommending actions/items to a user based on the past actions of similar users (or on items that users tend to rate alike)
- <b>Content-based Filtering</b>: recommending actions/items that are similar to other actions/items based on static attributes
- <b>Similarity Matrix</b>: (typically) a correlation table that shows the similarity measures between all known entities, e.g., a user-user similarity matrix would show how similar every user is with every other user.
- <b>The "Cold-start" problem</b>: when there is not enough data to draw inferences about a user or entity <b>yet</b>
- <b>Pearson Correlation</b>: a measure of linear correlation between two vectors, ranging from -1 to 1, i.e., comparing numerically encoded attributes after centering each vector on its mean
- <b>Cosine Similarity</b>: another measure of similarity between two non-zero vectors, the cosine of the angle between them
- <b>Vector</b>: an array of attributes, independent variables, typically associated to a user, class, or item, encoded numerically as features (this is a common term in all ML).
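
To make the two similarity measures above concrete, here is a small NumPy sketch.  The function names and the sample vectors are my own illustration; the point is that Pearson correlation is cosine similarity applied to mean-centered vectors, so it ignores a constant offset while cosine similarity does not.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def pearson_correlation(a, b):
    """Pearson correlation: cosine similarity after subtracting each vector's mean."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return cosine_similarity(a - a.mean(), b - b.mean())


# Hypothetical rating vectors for two users over the same four items
u = np.array([5.0, 3.0, 4.0, 4.0])
v = np.array([3.0, 1.0, 2.0, 3.0])

print("cosine :", cosine_similarity(u, v))
print("pearson:", pearson_correlation(u, v))

# Shifting every rating by a constant leaves Pearson at 1 but lowers cosine below 1
print("pearson(u, u + 10):", pearson_correlation(u, u + 10))
print("cosine (u, u + 10):", cosine_similarity(u, u + 10))
```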

### Models & Snippets / Examples <a id="Examples"></a>

See quickly-coded examples of different movie recommendation engines below--all recommenders below utilized movie rating data to recommend new movies to watch.  I tried to review and recreate the most-popular methods I could find in Python.<br>

#### Data Source
<p>Data was provided by MovieLens at https://grouplens.org/datasets/movielens/.</p>
<p>All of my examples come from the 100K data set of ratings known as <a href="http://files.grouplens.org/datasets/movielens/ml-latest-small.zip">ml-latest-small</a>.</p>

### Content-based Recommender <a id="Content"></a>

In cases where there is little data (Cold Start Problem) or great item features exist already, a content-based recommender can do an adequate job.  
A content-based recommender will recommend similar items based on static attributes or qualities of the item, not based on individual user ratings or user behavior.  The similarity matrix we need will look like the one below:

<br>
<img src="data/similarity_matrix.png">
<br>

In the example below, I engineered movie features from the MovieLens dataset and also from IMDb data (pulled through the OMDb API), then I computed movie (item) similarity based on those features.  

In [1]:
#import some libraries and load data
import numpy as np
import pandas as pd

#EXTRACT STEP
#load the movie data
movies = pd.read_csv('data/ml-latest-small/movies.csv')

#merge the links data so that I can enrich the dataset from IMDB
links = pd.read_csv('data/ml-latest-small/links.csv')
movies = pd.merge(movies, links, on='movieId')  

In [2]:
#a quick look at our dataset
movies.head()

Unnamed: 0,movieId,title,genres,imdbId,tmdbId
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0
1,2,Jumanji (1995),Adventure|Children|Fantasy,113497,8844.0
2,3,Grumpier Old Men (1995),Comedy|Romance,113228,15602.0
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,114885,31357.0
4,5,Father of the Bride Part II (1995),Comedy,113041,11862.0


Above, see the first 5 rows of our movie data.  I have pulled a list of distinct movies; you can see each title (with the year it came out), the genres that apply, and a couple of ids that will allow us to enrich the movie data further.

In [3]:
#TRANSFORM STEP aka "feature engineering" followed by "data enrichment"

#feature engineering

#split out the genres string into features
genres_df = movies['genres'].str.split('|', expand=True)
genres_df = genres_df.fillna('(no genres listed)')

#get all unique genres
cols = np.unique(genres_df[genres_df.columns].values)

#create columns for each genre
for col in cols:
    
    movies[col] = col

    #input the values to each dummy
    def bool_dums(x):
        genre = x['genres']
        col_name = x[col]

        if col_name in genre:
            return 1
        else:
            return 0
    
    movies[col] = movies.apply(bool_dums, axis=1)

In [4]:
movies.iloc[:,5:].head()

Unnamed: 0,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,0,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
1,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0
3,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0
4,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


After breaking out the genres string into multiple columns, I encoded each variable as 0 or 1 (one-hot encoding); see the new movie genre variables above.  
These variables, while rather coarse, are already enough to build a recommender: as long as we have labels, we can compute similarity and begin to recommend items.
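
As an aside, the same one-hot encoding of a pipe-delimited genre string can be sketched with pandas' built-in `Series.str.get_dummies`.  This is an alternative to the loop above, not the code that produced the output shown; the third row ("Unknown Film") is a hypothetical title added to show the "(no genres listed)" case.

```python
import pandas as pd

# Small stand-in for movies.csv; "Unknown Film" is a made-up row
toy = pd.DataFrame(
    {
        "title": ["Toy Story (1995)", "Grumpier Old Men (1995)", "Unknown Film"],
        "genres": [
            "Adventure|Animation|Children|Comedy|Fantasy",
            "Comedy|Romance",
            "(no genres listed)",
        ],
    }
)

# One 0/1 column per distinct genre found in the pipe-delimited strings
genre_dummies = toy["genres"].str.get_dummies(sep="|")

# Attach the dummy columns to the original frame
encoded = pd.concat([toy, genre_dummies], axis=1)
print(encoded.drop(columns="genres"))
```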

In [5]:
#compute similarity and create a similarity matrix just based on the genre features above
movies_sim_matrix = movies.drop(['genres', 'movieId','imdbId','tmdbId'], axis=1) #drop extra columns
movies_sim_matrix = movies_sim_matrix.set_index('title') #set the index for the correlation calc
movies_sim_matrix = movies_sim_matrix.T.corr(method='pearson',min_periods=20) #get correlations by index / row instead of columns
movies_sim_matrix = movies_sim_matrix.replace(1,0)#replace all 1(s) with zeroes, eliminate movie correlation with itself (note: this also zeroes any other pair whose correlation is exactly 1)
print("Similarity Matrix")
movies_sim_matrix.head().iloc[:, : 10]

Similarity Matrix


title,Toy Story (1995),Jumanji (1995),Grumpier Old Men (1995),Waiting to Exhale (1995),Father of the Bride Part II (1995),Heat (1995),Sabrina (1995),Tom and Huck (1995),Sudden Death (1995),GoldenEye (1995)
Toy Story (1995),0.0,0.727607,0.19245,0.080845,0.39736,-0.242536,0.19245,0.57735,-0.132453,0.080845
Jumanji (1995),0.727607,0.0,-0.140028,-0.176471,-0.096374,-0.176471,-0.140028,0.793492,-0.096374,0.215686
Grumpier Old Men (1995),0.19245,-0.140028,0.0,0.793492,0.688247,-0.140028,0.0,-0.111111,-0.076472,-0.140028
Waiting to Exhale (1995),0.080845,-0.176471,0.793492,0.0,0.546119,-0.176471,0.793492,-0.140028,-0.096374,-0.176471
Father of the Bride Part II (1995),0.39736,-0.096374,0.688247,0.546119,0.0,-0.096374,0.688247,-0.076472,-0.052632,-0.096374


Above, see the first 5 rows of the computed similarity between each movie (stopping at GoldenEye here).  I used Pearson correlation, which ranges from -1 to 1: -1 is perfectly negatively correlated and 1 is perfectly positively correlated.  You can already see that this matrix is starting to pass the sniff test: Toy Story is more correlated with Jumanji (0.73) than with Grumpier Old Men (0.19).  Maybe this is enough data to do an "OK" job...
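
One detail of the cell above is worth a sketch: `replace(1,0)` zeroes every value equal to 1, not only the diagonal, so two different movies with identical genre vectors can lose their perfect match as well.  A possible alternative, shown here on hypothetical films and not used for the outputs in this notebook, is to zero only the diagonal with `np.fill_diagonal`.

```python
import numpy as np
import pandas as pd

# Hypothetical genre features; Film A and Film B share exactly the same genres
features = pd.DataFrame(
    {
        "Animation": [1, 1, 0],
        "Comedy": [1, 1, 1],
        "Romance": [0, 0, 1],
        "Western": [0, 0, 0],
    },
    index=["Film A", "Film B", "Film C"],
)

# Row-wise Pearson correlation, same pattern as movies_sim_matrix above
sim = features.T.corr(method="pearson")

# Zero only the self-similarity entries, working on an explicit copy of the values
values = sim.to_numpy(copy=True)
np.fill_diagonal(values, 0.0)
sim_no_self = pd.DataFrame(values, index=sim.index, columns=sim.columns)

print(sim_no_self)
```

With this approach Film A and Film B keep their near-perfect similarity, while each film's similarity with itself is removed.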

In [6]:
#create a function to print out the most-similar recommendations
from IPython.display import display, HTML

def content_recommendations_n(movies_matrix, title,n):
    
    #get series with similarity scores for this title
    sim_series = movies_matrix[title].sort_values(ascending = False) #sort the values w/ highest corr
    sim_series = sim_series[:n] #take top n values
    #sim_series = sim_series.sort_index() #sort alphabetically
    data = {'Title':[],'Similarity Score':[]}
    
    for recs in range(len(sim_series)):
        data['Title'].append(str(sim_series.index[recs]).split('(')[0]) #strip the "(year)" suffix
        data['Similarity Score'].append(str(round(sim_series.iloc[recs],5))) #positional lookup, safe with duplicate titles

    df = pd.DataFrame(data)
    display(df)

Next, I created a function (above) to print out and recommend the most-similar movies, given a movie title and the number of nearest neighbors I specify.   
Let's go ahead and see what my engine recommends for Toy Story!

In [7]:
#get the top 10 similar movies to Toy Story, just based on genre
print("\nSee top 10 movie recommendations for Toy Story based on genre ONLY")
content_recommendations_n(movies_sim_matrix,'Toy Story (1995)',10)


See top 10 movie recommendations for Toy Story based on genre ONLY


Unnamed: 0,Title,Similarity Score
0,Shrek Forever After,0.88192
1,Gnomeo & Juliet,0.88192
2,Puss in Boots,0.88192
3,Space Jam,0.88192
4,The Lego Movie,0.88192
5,TMNT,0.88192
6,"Twelve Tasks of Asterix, The",0.88192
7,Valiant,0.88192
8,Toy Story 3,0.88192
9,Shrek,0.88192


We have a coarse model here that seems much better than nothing: genre in this dataset IS a <i>somewhat</i> good indicator of similarity by itself, and the recommended movies tend to be animated and for kids.  Some of these movies would probably please a movie watcher who likes Toy Story.  In other ways our recommender is not very intelligent: the top 10 movies here all have the same similarity score, which would be very unlikely if we had more features.  Our engine believes these movies are all equally similar, which is debatable, especially since one is actually another Toy Story movie and is intuitively more similar than Space Jam.

In order to improve results, I enriched the dataset and created more features, adding complexity to the model in order to increase similarity accuracy.

### Data Enrichment <a id="Enrichment"></a>

In [8]:
#utilize the OMDb API for IMDb data

#get data from the OMDb API http://www.omdbapi.com/
#100K requests for $1/mo

#libraries for using API
import requests
import json

#example API query
PARAMS = {'t':'Toy Story','apikey':'a194be20'}
r = requests.get(url = "http://www.omdbapi.com/",params=PARAMS) 

json_ = json.loads(r.text)

print(json.dumps(json_, indent=4, sort_keys=True))

{
    "Actors": "Tom Hanks, Tim Allen, Don Rickles, Jim Varney",
    "Awards": "Nominated for 3 Oscars. Another 23 wins & 17 nominations.",
    "BoxOffice": "N/A",
    "Country": "USA",
    "DVD": "20 Mar 2001",
    "Director": "John Lasseter",
    "Genre": "Animation, Adventure, Comedy, Family, Fantasy",
    "Language": "English",
    "Metascore": "95",
    "Plot": "A cowboy doll is profoundly threatened and jealous when a new spaceman figure supplants him as top toy in a boy's room.",
    "Poster": "https://m.media-amazon.com/images/M/MV5BMDU2ZWJlMjktMTRhMy00ZTA5LWEzNDgtYmNmZTEwZTViZWJkXkEyXkFqcGdeQXVyNDQ2OTk4MzI@._V1_SX300.jpg",
    "Production": "Buena Vista",
    "Rated": "G",
    "Ratings": [
        {
            "Source": "Internet Movie Database",
            "Value": "8.3/10"
        },
        {
            "Source": "Rotten Tomatoes",
            "Value": "100%"
        },
        {
            "Source": "Metacritic",
            "Value": "95/100"
        }
    ],
    "Rele

Above, see the data I can pull from the OMDb API for Toy Story; I will fold some of it into my dataframe.  For my example purposes, I will include 'Runtime', the Rotten Tomatoes score, and the number of imdbVotes.  I could increase complexity much further, for example by adding actors' names and description keywords via a bag-of-words model, but I just want a simple boost in recommendation quality without spending hours on new features.

Below, see my data-enrichment script.  I ran it once and then commented it out.

In [9]:
#create a new movie title column
# movies['title2'] = movies['title'].astype(str).str[:-7]

# #create new columns for all the new features
# movies['runtime'] = None
# movies['rt_score'] = None
# movies['imdb_votes'] = None

# #enrich the dataset now, based on the newly-formatted title
# def new_imdb_features(row):

#     row_name = row.name
    
#     #api query for each row
#     PARAMS = {'t':row['title2'],'apikey':'a194be20'}
#     r = requests.get(url = "http://www.omdbapi.com/",params=PARAMS) 
#     json_ = json.loads(r.text)
    
#     print(json.dumps(json_, indent=4, sort_keys=True))
    
#     #extract data from json response
#     if 'Runtime' in json_ and json_['Runtime'] != 'N/A':
#         runtime = int(json_['Runtime'].split()[0])
#     if 'imdbVotes' in json_ and json_['imdbVotes'] != 'N/A':
#         imdb_votes = int(json_['imdbVotes'].replace(',',''))
#     if 'Ratings' in json_:
#         if len(json_['Ratings']) > 1 and json_['Ratings'][1]['Source'] == 'Rotten Tomatoes':
#             rt_score = int(json_['Ratings'][1]['Value'].replace('%',''))
    
#     #print(runtime, rt_score,imdb_votes, row_name)
#     locals_ = locals()
    
#     #input data into appropriate fields
#     if 'runtime' in locals_:
#         movies.at[row_name, 'runtime'] = runtime
#     else:
#         movies.at[row_name, 'runtime'] = None
        
#     if 'rt_score' in locals_:
#         movies.at[row_name, 'rt_score'] = rt_score
#     else:
#         movies.at[row_name, 'rt_score'] = None
    
#     if 'imdb_votes' in locals_:
#         movies.at[row_name, 'imdb_votes'] = imdb_votes
#     else:
#         movies.at[row_name, 'imdb_votes'] = None
        
# movies.apply(new_imdb_features, axis=1)
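
The field extraction in the commented-out script above could also be sketched as a small pure function, which avoids the `locals()` check.  This is an illustrative variant under assumptions: `parse_omdb_fields` is a hypothetical helper name, the sample `Runtime` and `imdbVotes` values are made up in the format the script expects, and unlike the script it scans every entry in `Ratings` instead of only the second one.

```python
def parse_omdb_fields(payload):
    """Extract runtime (minutes), Rotten Tomatoes score (percent) and IMDb vote count.

    Any field that is missing or reported as 'N/A' comes back as None.
    """
    runtime = rt_score = imdb_votes = None

    if payload.get("Runtime", "N/A") != "N/A":
        runtime = int(payload["Runtime"].split()[0])  # e.g. "81 min" -> 81

    if payload.get("imdbVotes", "N/A") != "N/A":
        imdb_votes = int(payload["imdbVotes"].replace(",", ""))  # e.g. "1,000" -> 1000

    for rating in payload.get("Ratings", []):
        if rating.get("Source") == "Rotten Tomatoes":
            rt_score = int(rating["Value"].replace("%", ""))  # e.g. "100%" -> 100

    return {"runtime": runtime, "rt_score": rt_score, "imdb_votes": imdb_votes}


# Hypothetical response fragment; Runtime and imdbVotes values are illustrative only
sample = {
    "Runtime": "81 min",
    "imdbVotes": "1,000",
    "Ratings": [
        {"Source": "Internet Movie Database", "Value": "8.3/10"},
        {"Source": "Rotten Tomatoes", "Value": "100%"},
    ],
}

print(parse_omdb_fields(sample))
print(parse_omdb_fields({"Runtime": "N/A"}))
```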

In [10]:
#save new movies data to file so we can skip enrichment next time
#movies.to_csv('data/movies_enriched.csv', index=False)

In [11]:
movies = pd.read_csv('data/movies_enriched.csv')

Next I scaled the new features to between 0 and 1, so that they would not vastly outweigh the genre variables.

In [12]:
#scale the new variables
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1), copy=True)
#MinMaxScaler expects 2D input, so pass the three columns together (each column is still scaled independently)
new_cols = ['runtime', 'rt_score', 'imdb_votes']
movies[new_cols] = min_max_scaler.fit_transform(movies[new_cols].fillna(0))
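
For reference, min-max scaling maps each value x to (x - min) / (max - min) within its column.  The pandas-only sketch below uses hypothetical values and is meant to show one side effect of the `fillna(0)` step above: a missing value becomes the column minimum and therefore scales to 0.

```python
import numpy as np
import pandas as pd

# Hypothetical raw features with one missing value per column
toy = pd.DataFrame(
    {
        "runtime": [81.0, 104.0, np.nan, 127.0],
        "rt_score": [100.0, 54.0, 17.0, np.nan],
    }
)

# Same order of operations as the cell above: fill missing values with 0, then scale
filled = toy.fillna(0)
scaled = (filled - filled.min()) / (filled.max() - filled.min())

print(scaled)
```

Whether treating a missing runtime as "shortest possible" is acceptable depends on the data; imputing the column median before scaling is one alternative worth testing.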



Below, see my new dataframe with scaled variables of runtime, rt_score, and imdb_votes

In [13]:
movies.head()

Unnamed: 0,movieId,title,genres,imdbId,tmdbId,(no genres listed),Action,Adventure,Animation,Children,...,Mystery,Romance,Sci-Fi,Thriller,War,Western,title2,runtime,rt_score,imdb_votes
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,0,0,1,1,1,...,0,0,0,0,0,0,Toy Story,0.136364,1.0,0.428633
1,2,Jumanji (1995),Adventure|Children|Fantasy,113497,8844.0,0,0,1,0,1,...,0,0,0,0,0,0,Jumanji,0.175084,0.54,0.145633
2,3,Grumpier Old Men (1995),Comedy|Romance,113228,15602.0,0,0,0,0,0,...,0,1,0,0,0,0,Grumpier Old Men,0.170034,0.17,0.012196
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,114885,31357.0,0,0,0,0,0,...,0,1,0,0,0,0,Waiting to Exhale,0.208754,0.56,0.004696
4,5,Father of the Bride Part II (1995),Comedy,113041,11862.0,0,0,0,0,0,...,0,0,0,0,0,0,Father of the Bride Part II,0.178451,0.48,0.016978


Next, see my new similarity matrix, based on genre and new attributes from IMDB.

In [14]:
#recompute similarity
#compute similarity and create a similarity matrix based on the genre features plus the new enriched features
movies_sim_matrix = movies.drop(['title','genres', 'movieId','imdbId','tmdbId'], axis=1) #drop extra columns
movies_sim_matrix = movies_sim_matrix.set_index('title2') #set the index for the correlation calc
movies_sim_matrix = movies_sim_matrix.T.corr(method='pearson',min_periods=20) #get correlations by index / row instead of columns
movies_sim_matrix = movies_sim_matrix.replace(1,0)#replace all 1(s) with zeroes, eliminate movie correlation with itself
print("New Similarity Matrix")
movies_sim_matrix.head()

New Similarity Matrix


title2,Toy Story,Jumanji,Grumpier Old Men,Waiting to Exhale,Father of the Bride Part II,Heat,Sabrina,Tom and Huck,Sudden Death,GoldenEye,...,Gintama: The Movie,anohana: The Flower We Saw That Day - The Movie,Silver Spoon,Love Live! The School Idol Movie,Jon Stewart Has Left the Building,Black Butler: Book of the Atlantic,No Game No Life: Zero,Flint,Bungo Stray Dogs: Dead Apple,Andrew Dice Clay: Dice Rules
Toy Story,0.0,0.739035,0.187899,0.149131,0.465106,-0.067107,0.325886,0.562099,0.023781,0.192479,...,0.220736,0.143938,0.147629,0.336867,-0.147604,0.487002,0.637948,0.071676,0.152638,0.351179
Jumanji,0.739035,0.0,-0.122665,-0.108135,0.006664,-0.065501,0.002872,0.790663,0.012922,0.279402,...,-0.225628,-0.15089,-0.151324,-0.103144,-0.104032,0.111038,0.187211,0.043639,-0.15132,-0.104539
Grumpier Old Men,0.187899,-0.122665,0.0,0.782176,0.653474,-0.108224,0.884384,-0.096917,-0.038805,-0.108319,...,0.247424,-0.106189,0.441336,-0.068585,-0.07234,0.247008,0.318357,-0.030328,-0.112278,0.680884
Waiting to Exhale,0.149131,-0.108135,0.782176,0.0,0.583573,-0.063734,0.786839,-0.099319,0.024399,-0.068074,...,0.117561,0.303962,0.751291,-0.09619,-0.098538,0.117352,0.190005,0.581172,-0.146742,0.516535
Father of the Bride Part II,0.465106,0.006664,0.653474,0.583573,0.0,0.078286,0.745462,-0.018615,0.133129,0.063087,...,0.376903,-0.088847,0.600657,-0.052573,-0.059501,0.376071,0.453057,0.174978,-0.100954,0.887687


Finally, see my new top 10 recommendations.  You will note that the newly-recommended films are still animated, but are more popular and more similar to Toy Story--<b>we improved the engine!!</b>  This is subjectively obvious when you note that there are now 2 Toy Story sequels listed in the results, and that the scores are no longer all tied.

In [15]:
#get the top 10 similar movies to Toy Story
print("\nSee top 10 movie recommendations for Toy Story based on genre, popularity, and runtime")
content_recommendations_n(movies_sim_matrix,'Toy Story',10)


See top 10 movie recommendations for Toy Story based on genre, popularity, and runtime


Unnamed: 0,Title,Similarity Score
0,"Monsters, Inc.",0.99973
1,Toy Story 2,0.99657
2,Moana,0.98901
3,Antz,0.98583
4,The Good Dinosaur,0.97833
5,Turbo,0.97205
6,Shrek the Third,0.95277
7,Toy Story 3,0.89335
8,Inside Out,0.8909
9,Shrek,0.88719


### Collaborative Filtering Recommenders <a id="Collaborate"></a>

Generally considered more-sophisticated and accurate than content-based filters, collaborative filtering methods lead to recommendation engines that calculate the similarity of items or users based on the past actions of a user.  I built two recommenders below based on two popular methods of collaborative filtering: item-item filtering and user-user filtering.

#### Item-Item Filtering Method

First, I built the item-item recommender.  This method looks at items rated/purchased by users and compares the items based on their ratings.  In this case, I will look at a dataframe of user movie ratings in order to construct my similarity matrix of movies.  My dataset will be the same as the previous dataset.

In [16]:
#EXTRACT
#load ratings data and movie names
ratings_data = pd.read_csv("data/ml-latest-small/ratings.csv")  
movie_titles = pd.read_csv('data/ml-latest-small/movies.csv')
movie_data = pd.merge(ratings_data, movie_titles, on='movieId')

movie_data.head()

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,5,1,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,7,1,4.5,1106635946,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
3,15,1,2.5,1510577970,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
4,17,1,4.5,1305696483,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy


As you can see in the first 5 rows of our dataframe above, we have one row for every rating a user gave a movie in our dataset.

In [17]:
#TRANSFORM
#create our user-level dataframe
userRatings = movie_data.pivot_table(index='userId',columns=['title'],values=['rating'])
#create our item-item similarity matrix
#compute correlation for every column pair in the matrix
corrMatrix = userRatings.corr(method='pearson')#, min_periods=20)
#min_periods=20 (at least 20 shared ratings per movie pair) is disabled here, so pairs with very few shared raters are kept
corrMatrix.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating
Unnamed: 0_level_1,title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
rating,'71 (2014),,,,,,,,,,,...,,,,,,,,,,
rating,'Hellboy': The Seeds of Creation (2004),,,,,,,,,,,...,,,,,,,,,,
rating,'Round Midnight (1986),,,,,,,,,,,...,,,,,,,,,,
rating,'Salem's Lot (2004),,,,,,,,,,,...,,,,,,,,,,
rating,'Til There Was You (1997),,,,,1.0,,,,,,...,,,,,,,,,,


Above, see the similarity matrix of every movie to every movie, based on user ratings.  Note that MOST values are null: the matrix is very sparse, because most pairs of movies share too few raters to compute a correlation.
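
The `min_periods` argument that is commented out above controls how many shared ratings a pair of movies needs before a correlation is reported.  The toy ratings below are hypothetical; they sketch why it matters: with only two shared raters, Pearson correlation is always exactly +1 or -1, which says little about real similarity.

```python
import numpy as np
import pandas as pd

# Hypothetical user x movie ratings; Movie C shares only two raters with A and B
ratings = pd.DataFrame(
    {
        "Movie A": [5.0, 4.0, 1.0, 2.0, np.nan],
        "Movie B": [4.0, 5.0, 2.0, 1.0, np.nan],
        "Movie C": [1.0, 5.0, np.nan, np.nan, 3.0],
    }
)

# Default: any pair with enough data to compute a value is kept
loose = ratings.corr(method="pearson")

# Require at least 3 users who rated both movies
strict = ratings.corr(method="pearson", min_periods=3)

print(loose.round(3))
print(strict.round(3))
```

In the strict matrix, the Movie A / Movie C entry becomes NaN while Movie A / Movie B (four shared raters) is kept.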

Next I picked a single user to make recommendations for and looked at all the movies that user rated.  I arbitrarily chose user number 19 in our dataframe.  During an actual real-time recommendation we would add a new user and compute or query similarity again.

In [18]:
#get user by loc = userId
#drop all the movies NOT rated with dropna()
user19 = userRatings.loc[19].dropna()
print("See some of user 19's ratings profile below, our user has rated 703 movies.\n")
print(user19[:20])

See some of user 19's ratings profile below, our user has rated 703 movies.

        title                                                             
rating  'burbs, The (1989)                                                    2.0
        10 Things I Hate About You (1999)                                     3.0
        101 Dalmatians (1996)                                                 1.0
        2001: A Space Odyssey (1968)                                          3.0
        28 Days (2000)                                                        2.0
        39 Steps, The (1935)                                                  2.0
        Absent-Minded Professor, The (1961)                                   3.0
        Abyss, The (1989)                                                     3.0
        Ace Ventura: Pet Detective (1994)                                     2.0
        Ace Ventura: When Nature Calls (1995)                                 2.0
        Addams Family Values

In [19]:
print("Note our user's favorite movies:")
user19.where(user19 > 4).dropna()

Note our user's favorite movies:


        title                                                                         
rating  Adventures of Buckaroo Banzai Across the 8th Dimension, The (1984)                5.0
        Batman (1989)                                                                     5.0
        Crow, The (1994)                                                                  5.0
        Defending Your Life (1991)                                                        5.0
        E.T. the Extra-Terrestrial (1982)                                                 5.0
        Empire Records (1995)                                                             5.0
        Ferris Bueller's Day Off (1986)                                                   5.0
        Fifth Element, The (1997)                                                         5.0
        Fight Club (1999)                                                                 5.0
        Heathers (1989)                                            

Below, we find all the movies that are similar to the movies the user rated, then print out the top 10 movies as a recommendation.

In [20]:
#pick a user and create a recommendation for them
#might need a holdout set
candidate_list = []

#loop through each rating for that user and look for similar movies
for movie, rating in user19.items():

    #get movies similar to the ones I rated
    #slice the corrMatrix by my index
    similars = corrMatrix[movie].dropna()

    #scale similarity by my user's rating and keep the candidates
    candidate_list.append(similars * rating)

#Series.append was removed in pandas 2.0, so concatenate the pieces instead
similarCandidates = pd.concat(candidate_list)

#dedupe and add scores together
similarCandidates = similarCandidates.groupby(similarCandidates.index).sum()
similarCandidates.sort_values(inplace=True,ascending = False)

In [21]:
display(pd.DataFrame({'Title':np.array(similarCandidates[:10].index),'Similarity Score':np.array(similarCandidates[:10])}))

Unnamed: 0,Title,Similarity Score
0,"(rating, Captain America: Civil War (2016))",622.412268
1,"(rating, Quest, The (1996))",621.725653
2,"(rating, Batman: Year One (2011))",603.031266
3,"(rating, Dead Pool, The (1988))",590.345159
4,"(rating, Untitled Spider-Man Reboot (2017))",589.22006
5,"(rating, Nine Months (1995))",589.072744
6,"(rating, Black Snake Moan (2006))",582.120218
7,"(rating, Breach (2007))",562.349355
8,"(rating, Confidence (2003))",561.606443
9,"(rating, Beverly Hills Ninja (1997))",555.153451


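Our script above outputs a new score for each candidate movie: for every movie user 19 rated, it takes that movie's correlation with the candidate, multiplies it by user 19's rating, and sums the results.  The score therefore grows with the number of rated movies a candidate correlates with, which is why the values are far larger than 1.  According to item-item similarity, our recommender indicates our user would love to watch Captain America: Civil War.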
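
The scoring step can be traced by hand on a tiny example.  Everything below is hypothetical (film names, correlations, ratings), and the final `drop` of already-rated movies is my own addition that the script above does not perform.

```python
import pandas as pd

films = ["Film A", "Film B", "Film C"]

# Hypothetical item-item correlation matrix
corr = pd.DataFrame(
    [
        [1.0, 0.8, -0.2],
        [0.8, 1.0, 0.1],
        [-0.2, 0.1, 1.0],
    ],
    index=films,
    columns=films,
)

# Hypothetical user who rated two of the three films
user_ratings = pd.Series({"Film A": 5.0, "Film C": 2.0})

# For each rated film, weight its correlations by the user's rating
pieces = [corr[film].dropna() * rating for film, rating in user_ratings.items()]

# Sum the weighted correlations per candidate film
scores = pd.concat(pieces).groupby(level=0).sum()

# Assumption: do not recommend films the user has already rated
scores = scores.drop(user_ratings.index).sort_values(ascending=False)

print(scores)
```

Film B scores 0.8 * 5.0 + 0.1 * 2.0 = 4.2, which shows how a high rating on a strongly correlated film dominates the sum.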

Additionally, you can imagine a hybrid recommender that would take these item-item similarity scores and also weight similarity scores by results from our previous content-based recommender.  Perhaps adding genre weighting would rearrange the list above.

#### User-User Filtering Method <a id="CollaborateU"></a>

A popular method, user-user filtering looks at items rated/purchased or activities by users and compares the <b>users</b> based on their ratings or behaviors. Again, I will look at a dataframe of user movie ratings in order to construct my similarity matrix, except I will compare users to users based on their ratings. My dataset will be the same as the previous dataset.

In [22]:
#create new ratings df and corr matrix

#EXTRACT & TRANSFORM
#get movie_data averaged at the user level
#this creates rating_y as the average rating at the user level
movie_data2 = movie_data.groupby(by="userId",as_index=False)['rating'].mean()

#bring user averages info into the normal ratings
movie_data2 = pd.merge(movie_data,movie_data2,on='userId')

#the tutorial's method normalizes ratings by subtracting the user average (rating_x - rating_y)
#that adjustment is commented out here, so adj_rating is just the raw rating
movie_data2['adj_rating'] = movie_data2['rating_x']#movie_data2['rating_x']-movie_data2['rating_y']

#create dataframe of ratings by movies
user_rating_matrix = pd.pivot_table(movie_data2,values='adj_rating',index='userId',columns='movieId')
user_rating_matrix.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1,4.0,,4.0,,,4.0,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,,,,,,,,,,...,,,,,,,,,,


Above, see the dataframe of users and their ratings for every movie (NaN where a user has not rated a movie).  We want this dataframe at the unique user level so we can compare users.

In [23]:
# Replacing NaN by an average rating for scaling purposes was tried (both variants are commented out below); NaNs are kept as-is
# user_rating_matrix2 = user_rating_matrix.fillna(user_rating_matrix.mean(axis=0)) #using avg user rating
#user_rating_matrix2 = user_rating_matrix.T.fillna(user_rating_matrix.mean(axis=1)).T
user_rating_matrix2 = user_rating_matrix
user_rating_matrix2.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1,4.0,,4.0,,,4.0,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,,,,,,,,,,...,,,,,,,,,,


In [38]:
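#compute user-user similarity; corr() correlates columns, so transpose to make each user a column
#min_periods=15 leaves NaN for any pair of users with fewer than 15 co-rated movies
userCorrMatrix = user_rating_matrix2.T.corr(method='pearson', min_periods = 15)
#zero out self-correlation; note replace(1,0) also zeroes any other pair whose correlation is exactly 1
userCorrMatrix = userCorrMatrix.replace(1,0)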
userCorrMatrix.head()

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
1,0.0,,,0.207983,,-0.291636,-0.118773,0.469668,,,...,,0.0,-0.061503,,-0.164871,0.066378,0.174557,0.26807,,-0.032086
2,,0.0,,,,,,,,,...,,,,,,,,,,0.623288
3,,,0.0,,,,,,,,...,,,,,,,,,,
4,0.207983,,,0.0,,0.148498,0.542861,,,,...,-0.222113,0.396641,0.09009,,0.400124,0.144603,0.116518,-0.170501,,-0.043786
5,,,,,0.0,0.043166,,0.028347,,,...,,0.153303,0.234743,0.067791,,0.244321,0.23108,-0.020546,,


In [25]:
# method used in this tutorial: 
# https://medium.com/sfu-big-data/recommendation-systems-user-based-collaborative-filtering-using-n-nearest-neighbors-bf7361dc24e0
# (my way is more-elegant)
# user similarity on replacing NAN by item(movie) avg
# from sklearn.metrics.pairwise import cosine_similarity
# cosine = cosine_similarity(user_rating_matrix)
# np.fill_diagonal(cosine, 0 )
# similarity_with_movie =pd.DataFrame(cosine,index=user_rating_matrix.index)
# similarity_with_movie.head()

Two cells above, you will see the user similarity matrix.  I removed perfect correlations (which includes each user's correlation with themselves) and tried different methods for filling null values.  Ultimately, I decided to keep all nulls and only compute a correlation when two users have enough movies rated in common (min_periods, set to 15 in the cell above).  There is certainly room for more-extensive tuning here.
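To make the effect of min_periods concrete, here is a small sketch on a made-up 3-user matrix (the data and the names `toy` and `toy_corr` are hypothetical). It also shows one alternative way to drop self-similarity, `np.fill_diagonal`, which touches only the diagonal and leaves correlations between different users as they are:

```python
import numpy as np
import pandas as pd

# Toy user x movie matrix (made-up ratings); NaN = not rated
toy = pd.DataFrame(
    {
        "m1": [5.0, 4.0, np.nan],
        "m2": [3.0, 2.0, 4.0],
        "m3": [4.0, 3.0, np.nan],
        "m4": [1.0, np.nan, 2.0],
    },
    index=pd.Index([1, 2, 3], name="userId"),
)

# Transpose so users are columns, then require 3 co-rated movies per pair
toy_corr = toy.T.corr(method="pearson", min_periods=3)

# Zero out only the diagonal (each user's similarity with themselves)
values = toy_corr.to_numpy(copy=True)
np.fill_diagonal(values, 0.0)
toy_corr = pd.DataFrame(values, index=toy_corr.index, columns=toy_corr.columns)
print(toy_corr)
```

Users 1 and 2 share three movies, so they get a correlation; every pair involving user 3 shares fewer than three and stays NaN.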

In [26]:
userId = 19

#get a series of the most-similar users for the user specified (label-based lookup; NaN sorts last)
similarUsers = userCorrMatrix.loc[userId].sort_values(ascending=False)

#print out most-similar user_ids for the specified user
print("Top 10 Most-Similar Users, Based on User Pairs with at Least 15 Co-Rated Movies")
display(pd.DataFrame({'User Similarity Score (Pearson Correlation)':similarUsers})[:10])

Top 10 Most-Similar Users, Based on User Pairs with at Least 15 Co-Rated Movies


userId,User Similarity Score (Pearson Correlation)
335,0.701781
310,0.674296
165,0.673441
396,0.669934
211,0.6672
382,0.644205
208,0.642675
445,0.617486
422,0.614042
450,0.599536


Above you will see the most-similar users to user number 19, the same user we built an item-item recommendation list for earlier.  We now have his "nearest neighbors", the users most similar to him.  Next, we need to find the movies they rated highest that user 19 has not yet rated...

In [37]:
#get the n most-similar users (nearest neighbors) for one user
def find_n_neighbors(userCorrDf, user_matrix_id, n):

    #label-based lookup of the user's row, sorted from most- to least-similar (NaN sorts last)
    similarUsers = userCorrDf.loc[user_matrix_id].sort_values(ascending=False)
    return similarUsers.head(n).index

#score movies our user has not rated, using the neighbors' ratings
#note: reads the global user_rating_matrix and movie_titles defined in earlier cells
def recommend_movies(userCorrDf, userId, movieNum):
    
    #find the 20 nearest neighbors
    nn = find_n_neighbors(userCorrDf,userId,20).tolist()
    
    #get all movies already rated by userId
    rated_list = user_rating_matrix.loc[userId].dropna().index.tolist()

    #collect each neighbor's top-rated movies and the ratings they gave
    movie_list = []
    movie_rating = []
    
    for n in nn:
        
        #get this neighbor's top movieNum movies by rating
        neighbor = user_rating_matrix.loc[n].dropna().sort_values(ascending=False)
        m_list = neighbor.head(movieNum).index.tolist()

        #remove movies the user has already rated
        m_list = [x for x in m_list if x not in rated_list]
        neighbor = neighbor.filter(items=m_list)        
        
        #collate movies that are unrated by the user but highly rated by this neighbor
        movie_list.extend(neighbor.index.tolist())
        movie_rating.extend(neighbor.tolist())
        
    #compute the mean neighbor rating per movie (unweighted: every neighbor counts equally)
    df = pd.DataFrame({'movieId':movie_list,'movie_rating':movie_rating}) #get all results in a dataframe
    gb = df.groupby('movieId').agg({'movie_rating':['sum','count']})
    gb.columns = gb.columns.get_level_values(1)
    gb = gb[gb['count'] > 3].copy() #keep movies that appear in more than 3 neighbors' top lists
    gb['w_score'] = gb['sum']/gb['count']
    gb = gb.sort_values(['w_score','count'],ascending=False)
    
    #attach titles, dropping the trailing " (year)" from each title
    titles = movie_titles.copy()
    titles['title'] = titles['title'].astype(str).str[:-7]
    gb = pd.merge(gb,titles[['movieId','title']],on='movieId')
    
    print('See most-recommended movies, based on weighted user scores from nearest neighbors')
    display(gb)        
    
recommend_movies(userCorrMatrix,19,30)

See most-recommended movies, based on weighted user scores from nearest neighbors


,movieId,sum,count,w_score,title
0,2019,19.0,4,4.75,Seven Samurai (Shichinin no samurai)
1,750,23.5,5,4.7,Dr. Strangelove or: How I Learned to Stop Worr...
2,50,37.0,8,4.625,"Usual Suspects, The"
3,858,18.5,4,4.625,"Godfather, The"
4,58559,18.5,4,4.625,"Dark Knight, The"
5,4226,23.0,5,4.6,Memento
6,79132,23.0,5,4.6,Inception
7,318,55.0,12,4.583333,"Shawshank Redemption, The"
8,2028,22.5,5,4.5,Saving Private Ryan
9,110,18.0,4,4.5,Braveheart


Voila!  A list of recommendations based on movie scores from similar users.  There is some intuitive fit here: user 19 liked Batman, some more-classic action movies, and science fiction, and the list above contains several plausible choices.

While there are plenty of tweaks I can make to the recommender above, including tuning the number of movies and adjusting the heuristics for final scoring, it is a great start at outputting movies the user has not rated but may enjoy.  This relies wholly on our assumptions that the user is truly similar to his nearest neighbors and that we have encoded features well, so we need to vigilantly evaluate our similarity calculations and then use various quality metrics to evaluate our results.
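One possible tweak to the final scoring is to weight each neighbor's rating by how similar that neighbor is, instead of counting every neighbor equally. The sketch below is an illustration under my own assumptions, not the method used above: the function name `similarity_weighted_scores`, its parameters, and the toy data are all hypothetical.

```python
import pandas as pd

def similarity_weighted_scores(ratings, similarities, target_user, min_raters=1):
    """Score unrated movies by a similarity-weighted mean of neighbors' ratings.

    ratings:      user x movie DataFrame, NaN where unrated
    similarities: Series of similarity scores indexed by neighbor userId
    """
    # Keep only positively correlated neighbors (this also drops NaN similarities)
    sims = similarities.drop(labels=[target_user], errors="ignore")
    sims = sims[sims > 0]

    # Restrict to movies the target user has not rated
    unseen = ratings.loc[target_user].isna()
    neighbor_ratings = ratings.loc[sims.index, unseen[unseen].index]

    # Weighted mean per movie, counting only neighbors who rated that movie
    rated = neighbor_ratings.notna()
    weighted_sum = neighbor_ratings.mul(sims, axis=0).sum(axis=0)
    weight_total = rated.astype(float).mul(sims, axis=0).sum(axis=0)
    scores = weighted_sum / weight_total

    # Require a minimum number of neighbors behind each score
    scores = scores[rated.sum(axis=0) >= min_raters]
    return scores.sort_values(ascending=False)

# Toy example (made-up ratings and similarities)
toy_ratings = pd.DataFrame(
    {10: [5.0, 4.0, 5.0], 20: [None, 5.0, 3.0], 30: [None, 2.0, None]},
    index=pd.Index([1, 2, 3], name="userId"),
)
toy_sims = pd.Series({2: 0.75, 3: 0.25})
scores = similarity_weighted_scores(toy_ratings, toy_sims, target_user=1)
print(scores)
```

With this scoring, a closer neighbor pulls a movie's score toward their own rating more strongly than a distant one does.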

### Evaluating Recommender Quality <a id="Evaluation"></a>

Some metrics I found for recommender evaluation:
- <b>Mean Average Precision</b>: precision is the share of recommended items that are relevant; average precision averages that share over the ranks at which a relevant item appears in one user's list, and the mean is then taken across users
- <b>Mean Average Recall</b>: recall is the share of all possible relevant items that were recommended; it is averaged over ranks and users in the same way
- <b>Coverage</b>: the percentage of items in the training data the model is able to recommend on a test set
- <b>Personalization (as a metric)</b>: the measure of dissimilarity between different users' recommendations
- <b>Intra-list Similarity</b>: the average cosine similarity of all items in a list of recommendations
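
As a rough sketch, several of the metrics above can be written in a few lines of plain Python. The function names, the toy recommendation lists, and the choice to normalize average precision by min(number of relevant items, k) are my own assumptions; other conventions exist. Intra-list similarity is left out because it needs item feature vectors.

```python
import itertools
import math

def precision_at_k(recommended, relevant, k):
    """Share of the top-k recommendations that are relevant."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k

def recall_at_k(recommended, relevant, k):
    """Share of all relevant items that appear in the top-k recommendations."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)

def average_precision(recommended, relevant, k):
    """Sum of precision@i at each rank i <= k holding a relevant item,
    divided by min(len(relevant), k) (one convention among several)."""
    hits, total = 0, 0.0
    for i, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / i
    denom = min(len(relevant), k)
    return total / denom if denom else 0.0

def mean_average_precision(recs_by_user, relevant_by_user, k):
    """Mean of average_precision across users."""
    aps = [average_precision(recs, relevant_by_user.get(user, set()), k)
           for user, recs in recs_by_user.items()]
    return sum(aps) / len(aps)

def coverage(recs_by_user, catalog):
    """Share of catalog items that appear in at least one recommendation list."""
    recommended = set(itertools.chain.from_iterable(recs_by_user.values()))
    return len(recommended & set(catalog)) / len(catalog)

def personalization(recs_by_user):
    """1 - mean pairwise cosine similarity between users' recommendation sets
    (needs at least two users)."""
    sets = [set(r) for r in recs_by_user.values()]
    sims = [len(a & b) / math.sqrt(len(a) * len(b))
            for a, b in itertools.combinations(sets, 2)]
    return 1 - sum(sims) / len(sims)

# Toy example (made-up item ids)
recs = {"u1": [1, 2, 3, 4], "u2": [3, 5, 1, 6]}
relevant = {"u1": {1, 3, 9}, "u2": {5}}
catalog = range(1, 11)

print(precision_at_k(recs["u1"], relevant["u1"], k=4))
print(recall_at_k(recs["u1"], relevant["u1"], k=4))
print(mean_average_precision(recs, relevant, k=4))
print(coverage(recs, catalog))
print(personalization(recs))
```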

Given more time, I would evaluate all recommenders above on holdout sets of data, checking predicted ratings against actual ratings and utilizing the metrics above, especially the first two.
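
A holdout evaluation could start from something like the sketch below, which hides a fraction of each user's ratings and compares predictions against them with RMSE. The helper names `per_user_holdout` and `rmse`, the toy ratings, and the user-mean baseline predictor are hypothetical stand-ins; a real run would plug in the recommenders built above. It assumes the ratings dataframe has a unique index and a pandas version that provides `GroupBy.sample` (1.1 or later).

```python
import numpy as np
import pandas as pd

def per_user_holdout(ratings, test_frac=0.2, seed=0):
    """Hold out a random fraction of each user's ratings as a test set."""
    test = ratings.groupby("userId").sample(frac=test_frac, random_state=seed)
    train = ratings.drop(index=test.index)
    return train, test

def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))

# Toy long-format ratings (made-up values): 2 users, 5 ratings each
toy_ratings = pd.DataFrame({
    "userId":  [1] * 5 + [2] * 5,
    "movieId": [10, 20, 30, 40, 50] * 2,
    "rating":  [4.0, 5.0, 3.0, 4.0, 2.0, 1.0, 2.0, 3.0, 2.0, 5.0],
})

train, test = per_user_holdout(toy_ratings, test_frac=0.2, seed=0)

# Baseline predictor: each user's mean rating in the training data
user_means = train.groupby("userId")["rating"].mean()
baseline_pred = test["userId"].map(user_means)
print(rmse(baseline_pred, test["rating"]))
```

Splitting per user rather than over the whole table keeps every user present in both the training and the test data.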

### Additional Thoughts <a id="Additional"></a>

In actual production-level code, I would not recreate the exact code snippets I published here.  My code would be much more object-oriented and packaged as a library, calling most functions as methods of my own recommender class.  Additionally, further research is needed into deep learning methods, which will happen in later versions of this notebook, but note the pragmatic power of the simpler similarity methods displayed above.  Finally, I have looked into methods of using various recommenders together (hybrid or ensemble methods), which I will also research further.  One can imagine that, when little data is available, a content-based or popularity recommender would operate better and mitigate the cold start problem, but that a collaborative filter would ultimately operate best in a setting with more data.
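
To illustrate both ideas at once (a recommender class and a hybrid that handles cold start), here is a minimal sketch. Everything in it is hypothetical: the class name `FallbackRecommender`, its parameters, the toy data, and the `collaborative_fn` callback, which stands in for whichever collaborative filter is plugged in.

```python
import pandas as pd

class FallbackRecommender:
    """Use collaborative scores when a user has enough history, else popularity."""

    def __init__(self, ratings, min_history=5, min_raters=2):
        self.ratings = ratings            # user x movie DataFrame, NaN = unrated
        self.min_history = min_history    # ratings needed before trusting collaborative scores
        # Popularity = mean rating, restricted to movies with enough raters
        counts = ratings.notna().sum(axis=0)
        means = ratings.mean(axis=0)
        self.popular = means[counts >= min_raters].sort_values(ascending=False)

    def recommend(self, user_id, collaborative_fn, n=10):
        """Return the top-n scored movies the user has not rated yet.

        collaborative_fn: callable taking a user id and returning a Series of
        scores indexed by movieId.
        """
        if user_id in self.ratings.index:
            history = self.ratings.loc[user_id].dropna()
        else:
            history = pd.Series(dtype=float)   # brand-new user: no history at all

        if len(history) >= self.min_history:
            scores = collaborative_fn(user_id).sort_values(ascending=False)
        else:
            scores = self.popular              # cold start: fall back to popularity
        return scores.drop(labels=history.index, errors="ignore").head(n)

# Toy example (made-up ratings)
toy_ratings = pd.DataFrame(
    {10: [5.0, 4.0, 5.0], 20: [None, 2.0, 3.0], 30: [None, 5.0, None]},
    index=pd.Index([1, 2, 3], name="userId"),
)
rec = FallbackRecommender(toy_ratings, min_history=2, min_raters=2)
fake_collab = lambda user_id: pd.Series({40: 4.8, 10: 4.0})  # stand-in scorer

print(rec.recommend(1, fake_collab))    # one rating only -> popularity fallback
print(rec.recommend(2, fake_collab))    # enough history -> collaborative scores
print(rec.recommend(99, fake_collab))   # unknown user -> popularity fallback
```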