## Recommendation Engine Quickstart, the Notebook

#### An overview of recommender principles and techniques, mainly for my own enrichment and practice.

- <a href="#Principles">High Level Principles</a><br>
- <a href="#Terms">Common Terms</a>
- <a href="">Models &amp; Snippets / Examples:</a>
    - <a href="">Content-Based Recommender</a>
         - <a href="">Data Enrichment</a>
    - <a href="">Collaborative Filtering Item-Item Recommender</a>
    - <a href="">Collaborative Filtering User-User Recommender</a>
    - <a href="">Hybrid Recommender</a>
<br><br>
- <a href="">Evaluating Recommender Quality</a>
- <a href="">Additional Thoughts</a>

### High Level Principles <a id="Principles"></a>

For a succinct definition of recommendation engines, I will reference the following:

<blockquote>A recommendation engine, also known as a recommender system, is software that analyzes available data to make suggestions for something that a website user might be interested in, such as a book, a video or a job, among other possibilities.

(https://whatis.techtarget.com/definition/recommendation-engine)
</blockquote> 

In researching various methods of building recommendation engines, I found a common pattern that seems to apply to <i>most</i> machine learning methods. First, clean the data and reduce it to one row per unique entity, such as a dataframe/table of unique customers. Next, encode features for each entity and compute the similarity between every pair of entities; typically you create a similarity matrix in this step. Finally, query the matrix based on a single user's or item's attributes and output the most-similar items.

For example, when building a simple user-based product recommendation engine, a quick recipe might be:
1. Gather user data into a single dataframe of unique users, including behavioral and contextual variables
2. Enrich user data, if possible, e.g., add social profile variables or user segmentation or recent purchases
3. Encode/engineer user features
4. Compute similarity between all unique users seen in recent history
5. Take a user, query similar users, and see which items they bought that the user has not purchased
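
As a rough illustration of steps 4 and 5 above, here is a minimal sketch on a tiny, made-up purchase table. The user names, item names, and the helper functions `cosine_similarity_matrix` and `recommend_for_user` are all hypothetical, and cosine similarity is just one reasonable choice of measure, not necessarily the one a production system would use.

```python
import numpy as np
import pandas as pd

# Hypothetical purchase data: rows are users, columns are items, 1 = purchased
purchases = pd.DataFrame(
    [[1, 1, 0, 0],
     [1, 1, 1, 0],
     [0, 0, 1, 1]],
    index=["ann", "bob", "cat"],
    columns=["book", "lamp", "desk", "rug"],
)

def cosine_similarity_matrix(frame):
    """Return a user-user cosine similarity matrix as a DataFrame."""
    values = frame.to_numpy(dtype=float)
    norms = np.linalg.norm(values, axis=1, keepdims=True)
    unit = values / norms  # assumes every user has at least one purchase
    return pd.DataFrame(unit @ unit.T, index=frame.index, columns=frame.index)

def recommend_for_user(frame, user, k=1):
    """Suggest items bought by the k most-similar users that `user` lacks."""
    sims = cosine_similarity_matrix(frame)[user].drop(user)
    neighbors = sims.sort_values(ascending=False).index[:k]
    bought_by_neighbors = frame.loc[neighbors].sum(axis=0) > 0
    not_yet_bought = frame.loc[user] == 0
    mask = (bought_by_neighbors & not_yet_bought).to_numpy()
    return list(frame.columns[mask])

print(recommend_for_user(purchases, "ann"))
```

In this toy data, "bob" is the nearest neighbor of "ann", so the only candidate is the one item bob bought that ann has not.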

Of course, more sophistication can be added to the above, including combining recommendation results and deploying in a fashion similar to below.

An actual deployment plan might look something like the following:
1. Build a data pipeline - extract, transform, and load data at regular intervals, perhaps daily
    - transform data into a format that can be easily fed into a machine learning algorithm / wrapper
2. Create a regularly scheduled task that ingests fresh user data then builds and deploys a new recommendation model
    - automated script takes model wrapper parameters and generates a new similarity matrix
    - script also evaluates and records model quality
3. Tag certain web pages with JavaScript that encodes the current user and compares them against the current recommendation build
4. Once task 3 is complete, output inventory recommendations in HTML based on similarity computations.
5. Gather further feedback from actual users and QA

### Common Terms <a id="Terms"></a>

Below are the esoteric recommender jargon and common terms I have run into during my research, with definitions:

- <b>Collaborative Filtering</b>: aka 'social filtering', recommending actions/items based on the past actions of users with similar behavior
- <b>Content-based Filtering</b>: recommending actions/items that are similar to other actions/items based on static attributes
- <b>Similarity Matrix</b>: (typically) a correlation table that shows the similarity measures between all known entities, e.g., a user-user similarity matrix would show how similar every user is with every other user.
- <b>The "Cold-start" problem</b>: when there is not enough data to draw inferences about a user or entity <b>yet</b>
- <b>Pearson Correlation</b>: measure of linear correlation between two vectors, ranging from -1 to 1, i.e., comparing numerically encoded attributes
- <b>Cosine Similarity</b>: another measure of similarity between two non-zero vectors, the cosine of the angle between them
- <b>Vector</b>: an array of attributes, independent variables, typically associated to a user, class, or item, encoded numerically as features (this is a common term in all ML).
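
To make the two similarity measures above concrete, here is a small sketch that computes both on a pair of made-up 0/1 vectors. The function names and the example vectors are hypothetical. It relies on the fact that Pearson correlation is the cosine similarity of the mean-centered vectors, which is why the two measures can give different scores for the same pair.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pearson_correlation(a, b):
    """Pearson correlation, computed as cosine similarity after mean-centering."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # assumes neither vector is constant, otherwise the centered vector is all zeros
    return cosine_similarity(a - a.mean(), b - b.mean())

# Hypothetical encoded attribute vectors for two items
item_a = [1, 1, 0, 0]
item_b = [1, 0, 0, 0]

print("cosine :", cosine_similarity(item_a, item_b))
print("pearson:", pearson_correlation(item_a, item_b))
```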

### Models & Snippets / Examples

See quickly-coded examples of different movie recommendation engines below--all recommenders below utilized movie rating data to recommend new movies to watch.  I tried to review and recreate the most-popular methods I could find in Python.<br>

#### Data Source
<p>Data was provided by MovieLens at https://grouplens.org/datasets/movielens/.</p>
<p>All of my examples come from the 100K data set of ratings known as <a href="http://files.grouplens.org/datasets/movielens/ml-latest.zip">ml-latest-small</a>.</p>

### Content-based Recommender

In cases where there is little user data or good item features already exist, a content-based recommender can do an adequate job.  
A content-based recommender will recommend similar items based on static attributes or qualities of the item, not based on individual user ratings or user behavior. The similarity matrix we will need looks similar to the one below:

<br>
<img src="data/similarity_matrix.png">
<br>

In the example below, I engineered movie features from the ratings dataset and also from the IMDB database, then I computed movie (item) similarity based on those features.  

In [3]:
#import some libraries and load data
import numpy as np
import pandas as pd

#EXTRACT STEP
#load the movie data
movies = pd.read_csv('data/ml-latest-small/movies.csv')

#merge the links data so that I can enrich the dataset from IMDB
links = pd.read_csv('data/ml-latest-small/links.csv')
movies = pd.merge(movies, links, on='movieId')  

In [4]:
#a quick look at our dataset
movies.head()

Unnamed: 0,movieId,title,genres,imdbId,tmdbId
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0
1,2,Jumanji (1995),Adventure|Children|Fantasy,113497,8844.0
2,3,Grumpier Old Men (1995),Comedy|Romance,113228,15602.0
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,114885,31357.0
4,5,Father of the Bride Part II (1995),Comedy,113041,11862.0


Above, see the first 5 rows of our movie data. I have pulled a list of distinct movies; you can see each title (which includes the release year), the genres that apply, and a couple of ids that will allow us to enrich the movie data further.

In [5]:
#TRANSFORM STEP aka "feature engineering" followed by "data enrichment"

#feature engineering

#split out the genres string into features
genres_df = movies['genres'].str.split('|', expand=True)
genres_df = genres_df.fillna('(no genres listed)')

#get all unique genres
cols = np.unique(genres_df[genres_df.columns].values)

#create columns for each genre
for col in cols:
    
    movies[col] = col

    #input the values to each dummy
    def bool_dums(x):
        genre = x['genres']
        col_name = x[col]

        if col_name in genre:
            return 1
        else:
            return 0
    
    movies[col] = movies.apply(bool_dums, axis=1)
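
As an aside, pandas can build the same kind of 0/1 genre columns without a Python loop, using `Series.str.get_dummies`. The sketch below runs on a two-row toy frame shaped like the `genres` column rather than on the real `movies` dataframe; `toy_movies` and `genre_dummies` are names introduced only for this illustration.

```python
import pandas as pd

# Toy frame in the same shape as the MovieLens `genres` column
toy_movies = pd.DataFrame({
    "title": ["Toy Story (1995)", "Grumpier Old Men (1995)"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy", "Comedy|Romance"],
})

# One 0/1 column per distinct genre, split on the pipe separator
genre_dummies = toy_movies["genres"].str.get_dummies(sep="|")

# Attach the dummy columns next to the original columns
toy_movies = pd.concat([toy_movies, genre_dummies], axis=1)
print(toy_movies.drop(columns="genres"))
```

The per-column `apply` above and this vectorized call should give the same encoding; the vectorized version mainly matters for speed on larger tables.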

In [16]:
movies.iloc[:,5:].head()

Unnamed: 0,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,0,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
1,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0
3,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0
4,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


After breaking out the genres string into multiple columns, I encoded each variable as 0 or 1 (one-hot encoding); see the new movie genre variables above.  
These variables, while rather coarse, are already enough to build a recommender: as long as we have labels, we can compute similarity and begin to recommend items.

In [33]:
#compute similarity and create a similarity matrix just based on the genre features above
movies_sim_matrix = movies.drop(['genres', 'movieId','imdbId','tmdbId'], axis=1) #drop extra columns
movies_sim_matrix = movies_sim_matrix.set_index('title') #set the index for the correlation calc
movies_sim_matrix = movies_sim_matrix.T.corr(method='pearson',min_periods=20) #get correlations by index / row instead of columns
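#replace all 1(s) with zeroes to eliminate each movie's correlation with itself
#note: this also zeroes any OTHER pair whose correlation is exactly 1 (e.g. identical genre lists)
movies_sim_matrix = movies_sim_matrix.replace(1,0)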
print("Similarity Matrix")
movies_sim_matrix.head()

Similarity Matrix


title,Toy Story (1995),Jumanji (1995),Grumpier Old Men (1995),Waiting to Exhale (1995),Father of the Bride Part II (1995),Heat (1995),Sabrina (1995),Tom and Huck (1995),Sudden Death (1995),GoldenEye (1995),...,Gintama: The Movie (2010),anohana: The Flower We Saw That Day - The Movie (2013),Silver Spoon (2014),Love Live! The School Idol Movie (2015),Jon Stewart Has Left the Building (2015),Black Butler: Book of the Atlantic (2017),No Game No Life: Zero (2017),Flint (2017),Bungo Stray Dogs: Dead Apple (2018),Andrew Dice Clay: Dice Rules (1991)
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Toy Story (1995),0.0,0.727607,0.19245,0.080845,0.39736,-0.242536,0.19245,0.57735,-0.132453,0.080845,...,0.288675,0.19245,0.19245,0.39736,-0.132453,0.57735,0.727607,-0.132453,0.19245,0.39736
Jumanji (1995),0.727607,0.0,-0.140028,-0.176471,-0.096374,-0.176471,-0.140028,0.793492,-0.096374,0.215686,...,-0.210042,-0.140028,-0.140028,-0.096374,-0.096374,0.140028,0.215686,-0.096374,-0.140028,-0.096374
Grumpier Old Men (1995),0.19245,-0.140028,0.0,0.793492,0.688247,-0.140028,0.0,-0.111111,-0.076472,-0.140028,...,0.25,-0.111111,0.444444,-0.076472,-0.076472,0.25,0.326732,-0.076472,-0.111111,0.688247
Waiting to Exhale (1995),0.080845,-0.176471,0.793492,0.0,0.546119,-0.176471,0.793492,-0.140028,-0.096374,-0.176471,...,0.140028,0.326732,0.793492,-0.096374,-0.096374,0.140028,0.215686,0.546119,-0.140028,0.546119
Father of the Bride Part II (1995),0.39736,-0.096374,0.688247,0.546119,0.0,-0.096374,0.688247,-0.076472,-0.052632,-0.096374,...,0.458831,-0.076472,0.688247,-0.052632,-0.052632,0.458831,0.546119,-0.052632,-0.076472,0.0


Above, see the first 5 rows of the computed similarity between each pair of movies. I used Pearson correlation here, which ranges from -1 to 1: -1 is perfectly negatively correlated and 1 is perfectly positively correlated. You can already see that this matrix is starting to pass the sniff test: Toy Story is more correlated with Jumanji (0.73) than with Grumpier Old Men (0.19). Maybe this is enough data to do an "OK" job...
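
One caveat with the `replace(1,0)` step in the cell above: it zeroes every entry equal to 1, so two different movies with identical genre lists lose their perfect match as well. A possible alternative, sketched here on a made-up three-movie feature table (the movie labels and `sim_no_self` are hypothetical names), is to zero only the diagonal with `np.fill_diagonal`.

```python
import numpy as np
import pandas as pd

# Toy 0/1 genre features; "A" and "B" are distinct movies with identical genres
features = pd.DataFrame(
    [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 1, 1, 0]],
    index=["A", "B", "C"],
    columns=["Comedy", "Romance", "Drama", "Action"],
)

# Row-wise Pearson correlation, as in the cell above
sim = features.T.corr(method="pearson")

# Zero only the diagonal (self-similarity), leaving genuine perfect matches intact
sim_values = sim.to_numpy(copy=True)
np.fill_diagonal(sim_values, 0.0)
sim_no_self = pd.DataFrame(sim_values, index=sim.index, columns=sim.columns)

print(sim_no_self)
```

Applying this to the real matrix would change the recommendations printed later in this notebook, since movies with exactly the same genres as the query title would then rank first.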

In [41]:
#create a function to print out the most-similar recommendations
def content_recommendations_n(movies_matrix, title,n):
    
    #get series with similarity scores for this title
    sim_series = movies_matrix[title].sort_values(ascending = False) #sort the values w/ highest corr first
    sim_series = sim_series[:n] #take top n values
    sim_series = sim_series.sort_index() #sort alphabetically by title
    
    #print each recommended title with its similarity score
    for rec_title, score in sim_series.items():
        print(rec_title, score)

Next, I created a function (above) to print out the most-similar movies, given a movie title I specify and the number of nearest neighbors I specify.   
Let's go ahead and see what my engine recommends for Toy Story!

In [44]:
#get the top 10 similar movies to Toy Story, just based on genre
print("\nSee top 10 movie recommendations for Toy Story based on genre ONLY\n")
print("Title\t\tSimilarity Score")
content_recommendations_n(movies_sim_matrix,'Toy Story (1995)',10)


See top 10 movie recommendations for Toy Story based on genre ONLY

Title		Similarity Score
Gnomeo & Juliet (2011) 0.8819171036881975
Puss in Boots (Nagagutsu o haita neko) (1969) 0.8819171036881975
Shrek (2001) 0.8819171036881975
Shrek Forever After (a.k.a. Shrek: The Final Chapter) (2010) 0.8819171036881975
Space Jam (1996) 0.8819171036881975
TMNT (Teenage Mutant Ninja Turtles) (2007) 0.8819171036881975
The Lego Movie (2014) 0.8819171036881975
Toy Story 3 (2010) 0.8819171036881975
Twelve Tasks of Asterix, The (Les douze travaux d'Astérix) (1976) 0.8819171036881975
Valiant (2005) 0.8819171036881975


We have a coarse model here that seems much better than nothing. This result set is pretty great in some ways: genre in this dataset IS a somewhat good indicator of similarity by itself, and the content is labeled well enough that some of these movies would probably please a movie watcher who likes Toy Story. In other ways, not so much: the top 10 movies all have the same similarity score, which would be very unlikely if we had more features. Our engine believes these movies are all equally similar, which is debatable, especially since one is actually another Toy Story movie and is intuitively more similar than Space Jam, for example.

In order to improve results, I enriched the dataset and created more features, adding complexity to the model to increase similarity accuracy.

In [46]:
#utilize the OMDb API to pull IMDB data

#get data from the OMDb API http://www.omdbapi.com/
#100K requests for $1/mo

#libraries for using API
import requests
import json

#example API query; replace {apikey} with your own key (note that r.url will print it)
PARAMS = {'t':'Toy Story','apikey':'{apikey}'}
r = requests.get(url = "http://www.omdbapi.com/",params=PARAMS) 
print(r.url)
print(r.content)

json_ = json.loads(r.text)
print(json_)

http://www.omdbapi.com/?t=Toy+Story&apikey=REDACTED
b'{"Title":"Toy Story","Year":"1995","Rated":"G","Released":"22 Nov 1995","Runtime":"81 min","Genre":"Animation, Adventure, Comedy, Family, Fantasy","Director":"John Lasseter","Writer":"John Lasseter (original story by), Pete Docter (original story by), Andrew Stanton (original story by), Joe Ranft (original story by), Joss Whedon (screenplay by), Andrew Stanton (screenplay by), Joel Cohen (screenplay by), Alec Sokolow (screenplay by)","Actors":"Tom Hanks, Tim Allen, Don Rickles, Jim Varney","Plot":"A cowboy doll is profoundly threatened and jealous when a new spaceman figure supplants him as top toy in a boy\'s room.","Language":"English","Country":"USA","Awards":"Nominated for 3 Oscars. Another 23 wins & 17 nominations.","Poster":"https://m.media-amazon.com/images/M/MV5BMDU2ZWJlMjktMTRhMy00ZTA5LWEzNDgtYmNmZTEwZTViZWJkXkEyXkFqcGdeQXVyNDQ2OTk4MzI@._V1_SX300.jpg","Ratings":[{"Source":"Internet Movie Database","Value":"8.3/10"},{"Source

Above, see the data I can pull for Toy Story through the OMDb API; I will fold some of it into my dataframe. For my example purposes, I will include 'Runtime', the Rotten Tomatoes score, and the number of imdbVotes. Obviously, I could increase complexity much further, for example by including actors' names and description keywords via a bag-of-words model, but I just want a simple boost in accuracy and precision without spending hours on new features.
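
Turning those three fields into numbers might look like the sketch below. It works on a hand-written dict, not a live API call. The `"Runtime"` format and the IMDb rating entry match the response printed above; the `"imdbVotes"` format and the Rotten Tomatoes entry are assumptions, because the printed response is cut off before those fields, and the values used here are made up. `parse_omdb_features` is a hypothetical helper.

```python
def parse_omdb_features(record):
    """Extract numeric features from one OMDb-style response dict."""
    # "81 min" -> 81.0
    runtime = float(record["Runtime"].split()[0])
    # "1,000" -> 1000.0 (assumed comma-separated format)
    votes = float(record["imdbVotes"].replace(",", ""))
    # Assumed entry shape: {"Source": "Rotten Tomatoes", "Value": "90%"}
    rotten = None
    for rating in record.get("Ratings", []):
        if rating["Source"] == "Rotten Tomatoes":
            rotten = float(rating["Value"].rstrip("%"))
    return {"runtime_min": runtime, "rotten_tomatoes": rotten, "imdb_votes": votes}

# Hypothetical record with made-up vote count and Rotten Tomatoes score
sample = {
    "Runtime": "81 min",
    "imdbVotes": "1,000",
    "Ratings": [
        {"Source": "Internet Movie Database", "Value": "8.3/10"},
        {"Source": "Rotten Tomatoes", "Value": "90%"},
    ],
}

print(parse_omdb_features(sample))
```

A real pipeline would also need to handle missing or non-numeric values, which this sketch does not attempt.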

In [None]:
#create a new movie title column

#create new columns for all the new features

#scale the features between 0 and 1

#recompute similarity

#re-query the results
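
The cell above is still a placeholder. One way the scaling and recompute steps could go is sketched below on a small made-up feature table; the movie labels, column names, and values are all hypothetical, and min-max scaling is assumed as the way to bring features between 0 and 1.

```python
import pandas as pd

# Hypothetical feature table: two genre dummies plus two enriched numeric columns
features = pd.DataFrame(
    {
        "Animation": [1, 1, 0],
        "Comedy": [1, 0, 1],
        "runtime_min": [81.0, 95.0, 120.0],
        "imdb_votes": [900000.0, 50000.0, 20000.0],
    },
    index=["Movie A", "Movie B", "Movie C"],
)

def min_max_scale(column):
    """Rescale a numeric column to the 0-1 range (assumes it is not constant)."""
    return (column - column.min()) / (column.max() - column.min())

# Scale only the enriched columns; the genre dummies are already 0/1
numeric_cols = ["runtime_min", "imdb_votes"]
features[numeric_cols] = features[numeric_cols].apply(min_max_scale)

# Recompute movie-movie similarity on the combined features
similarity = features.T.corr(method="pearson")
print(similarity.round(3))
```

Scaling first keeps a large-valued column such as the vote count from dominating the correlation relative to the 0/1 genre columns.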

In [26]:
movies.head()

Unnamed: 0,movieId,title,genres,imdbId,tmdbId,(no genres listed),Action,Adventure,Animation,Children,...,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,Jumanji (1995),Adventure|Children|Fantasy,113497,8844.0,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
2,3,Grumpier Old Men (1995),Comedy|Romance,113228,15602.0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,114885,31357.0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
4,5,Father of the Bride Part II (1995),Comedy,113041,11862.0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
