## Recommender Systems: How do they work?

Recommender systems are sociotechnical artefacts that suggest information to a person based on some internal logic. Most systems use one of two types of logic, or a combination of both:
- **Collaborative filtering**: recommends based on what people who liked what you liked also liked 
- **Content filtering**: recommends based on characteristics of the thing you liked that other items also have

One way to visualise these two systems is as follows:

<img src="https://miro.medium.com/max/1064/1*mz9tzP1LjPBhmiWXeHyQkQ.png" alt="Recommender systems" width="600"/>

In this workshop we'll build a basic version of both of these recommender systems. For this, we will use a dataset that is often used in recommender system tutorials called *MovieLens*: https://www.wikiwand.com/en/MovieLens.
The dataset contains a list of movies, their genres, and ratings given to them by different people on a scale of 0.5 to 5.

We are using this dataset not because it is a particularly interesting subject, but because it gives a window into the culture of data science. In many situations there is a standard dataset, approach, or algorithm that is repeated again and again in online tutorials, workshops, and demos. In several cases, those datasets later turned out to be problematic for various reasons (see e.g., [1], [2]). Knowing what those standard datasets are is helpful when critically studying processes of datafication.


[1] Koch et al. (2021) Reduced, Reused and Recycled: The Life of a Dataset in Machine Learning Research. https://openreview.net/pdf?id=zNQBIBKJRkd  
[2] Crawford and Paglen. (n.d.) Excavating AI: The Politics of Images in Machine Learning Training Sets https://excavating.ai/

### Loading & exploring the data

In [2]:
import pandas as pd #import a popular data processing library

The first dataset we will look at is the `ratings.csv` file, which contains all the ratings given by users to movies.

In [3]:
ratings = pd.read_csv('ratings.csv') #read in the file
ratings.head() #show the first 5 items of the file

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [155]:
print("Number of ratings:", len(ratings))
print("Number of unique movieId's:", ratings['movieId'].nunique())
print("Number of unique users:", ratings['userId'].nunique())

Number of ratings: 100836
Number of unique movieId's: 9724
Number of unique users: 610


In [6]:
ratings.describe() #print out some basic statistics of the dataset

Unnamed: 0,userId,movieId,rating,timestamp
count,100836.0,100836.0,100836.0,100836.0
mean,326.127564,19435.295718,3.501557,1205946000.0
std,182.618491,35530.987199,1.042529,216261000.0
min,1.0,1.0,0.5,828124600.0
25%,177.0,1199.0,3.0,1019124000.0
50%,325.0,2991.0,3.5,1186087000.0
75%,477.0,8122.0,4.0,1435994000.0
max,610.0,193609.0,5.0,1537799000.0


The second file we are interested in is `movies.csv`, which contains movie titles and the genres that someone has decided they belong to.

In [23]:
movies = pd.read_csv('movies.csv')
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


Because both files have a similar column `movieId`, we can easily merge them and get one dataset that has ratings by users, movie titles, and movie genres.

In [10]:
merged_data = pd.merge(ratings, movies, on='movieId') #merge `ratings` and `movies` on the values in column `movieId`
merged_data.head()

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,5,1,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,7,1,4.5,1106635946,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
3,15,1,2.5,1510577970,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
4,17,1,4.5,1305696483,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy


In [10]:
merged_data.sort_values('rating', ascending=False) #sort the data from highest rating to lowest

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
77065,42,4102,5.0,996258715,Eddie Murphy Raw (1987),Comedy|Documentary
31492,58,377,5.0,847718657,Speed (1994),Action|Romance|Thriller
31485,31,377,5.0,850467368,Speed (1994),Action|Romance|Thriller
14890,380,3033,5.0,1494803646,Spaceballs (1987),Comedy|Sci-Fi
74460,456,1393,5.0,856883540,Jerry Maguire (1996),Drama|Romance
...,...,...,...,...,...,...
29267,365,180,0.5,1491088177,Mallrats (1995),Comedy|Romance
42146,365,56949,0.5,1488594780,27 Dresses (2008),Comedy|Romance
59582,287,1485,0.5,1110228283,Liar Liar (1997),Comedy
64957,160,3986,0.5,1065992767,"6th Day, The (2000)",Action|Sci-Fi|Thriller


Now that we have our data imported and we have a basic sense of what is in there, we can start working with it to create our two recommendation algorithms.

### Collaborative Filtering

Collaborative filtering means that you are recommended something based on what other people liked, specifically people who also liked something you have liked in the past.


To do this, we need to transform our data structure into something called a user-by-item matrix, which is a fancy way of saying a table with item ratings, where the rows are different users and the columns are the different items. For us, this means the columns are the movies, the rows are the people, and each cell holds the rating (from 0.5 to 5) that a person gave to that particular movie.
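As a small illustration of this reshaping, here is a toy example with made-up ratings (the titles `A`, `B`, `C` and all values are placeholders, not MovieLens data). It shows how a long list of ratings becomes a user-by-item table, and that movies a user has not rated show up as missing values (`NaN`):

```python
import pandas as pd

# Long format: one row per (user, movie) rating, with hypothetical values
toy = pd.DataFrame({
    'userId': [1, 1, 2, 2, 3],
    'title': ['A', 'B', 'A', 'C', 'B'],
    'rating': [4.0, 5.0, 3.0, 2.5, 4.5],
})

# Wide format: one row per user, one column per movie
toy_matrix = toy.pivot_table(index='userId', columns='title', values='rating')
print(toy_matrix)

# Cells without a rating are NaN
print("missing cells:", toy_matrix.isna().sum().sum())
```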




In [22]:
#Create a new table with the rows being userIds, the columns being the movie title, and 
#the values in the cells the rating
movie_matrix = merged_data.pivot_table(index='userId', columns='title', values='rating')

#Not every user has rated every movie, but we can't just leave that empty.
#So we fill in all the missing values (NaN, 'not a number') with 0s, so that everything in our table is numbers, and we can do math with it.
#Note that this treats "not rated" the same as a very low rating.
movie_matrix = movie_matrix.fillna(0) 
movie_matrix

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
607,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
608,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,4.5,3.5,0.0,0.0,0.0
609,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
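The table above is almost entirely zeros. As a rough back-of-the-envelope check, the counts printed earlier (100,836 ratings, 610 users, 9,724 movie IDs) show how few of all possible user-movie pairs actually have a rating. The variable names are illustrative, and the number of movie IDs is used here as an approximation of the number of title columns:

```python
# Counts taken from the exploration output earlier in this notebook
n_ratings = 100836
n_users = 610
n_movies = 9724

# Share of user-movie cells that contain an actual rating
density = n_ratings / (n_users * n_movies)
print(f"Filled cells: {density:.1%}")
```

In other words, the vast majority of the zeros in the matrix mean "not rated" rather than "disliked".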


Now that we have our data in a structure that we like, we can start recommending things! The way we do this is by trying to find movies whose ratings are very similar, i.e. strongly correlated. For this we will calculate Pearson's correlation coefficient* between the different columns. This will give us a score between -1 (opposite rating patterns) and 1 (very similar rating patterns), with scores around 0 meaning there is no clear relationship.
<br><br>
  


\* technically Spearman's Rank Order Correlation is better for ordinal data, but it is a less intuitive score and given the size of our dataset it does not have that big of an impact
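To make the two coefficients less abstract, here is a minimal sketch that computes Pearson's correlation by hand for two made-up rating columns and compares it with pandas. Spearman's coefficient is computed here as Pearson's correlation of the ranks. The series `a` and `b` and the helper `pearson` are illustrative assumptions, not part of the workshop data:

```python
import numpy as np
import pandas as pd

# Hypothetical ratings of two movies by the same five users
a = pd.Series([5.0, 4.0, 0.0, 1.0, 3.0])
b = pd.Series([4.0, 5.0, 1.0, 0.0, 3.0])

def pearson(x, y):
    """Covariance of x and y divided by the product of their spreads."""
    x_c = x - x.mean()
    y_c = y - y.mean()
    return (x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum())

r_manual = pearson(a, b)            # computed by hand
r_pandas = a.corr(b)                # pandas uses Pearson by default
rho = pearson(a.rank(), b.rank())   # Spearman: Pearson on the ranks

print("Pearson (manual):", round(r_manual, 4))
print("Pearson (pandas):", round(r_pandas, 4))
print("Spearman:", round(rho, 4))
```

The two Pearson values agree, while Spearman only looks at the order of the ratings, so it can differ somewhat.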

In [33]:
#The movie we liked and based on which we want to find recommendations
movie = 'Lion King, The (1994)'

#Calculate the correlation
collaborative_recommendation = movie_matrix.corrwith(movie_matrix[movie])

#Sort the results from highest to lowest correlation score
collaborative_recommendation = collaborative_recommendation.sort_values(ascending=False)

#Make the table look pretty and print the top 10
pd.DataFrame(collaborative_recommendation, columns=['corr']).head(10)

Unnamed: 0_level_0,corr
title,Unnamed: 1_level_1
"Lion King, The (1994)",1.0
Beauty and the Beast (1991),0.613107
Aladdin (1992),0.609844
Mrs. Doubtfire (1993),0.53837
"Mask, The (1994)",0.518191
Jumanji (1995),0.481138
Snow White and the Seven Dwarfs (1937),0.466429
Babe (1995),0.452781
Home Alone (1990),0.441517
Jurassic Park (1993),0.440386


This seems to work pretty well!
However, collaborative filtering approaches have one big weakness: they can't recommend items that don't have any scores, and they can't recommend anything to someone who hasn't scored anything!
This is also referred to as the **cold start problem**: no inferences are possible for something we don't have any information about.  
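The cold start problem can be illustrated with a toy user-by-item matrix in which one movie has not been rated by anyone. The titles and ratings below are made up for illustration. Because the unrated column is constant, its correlation with any other column is undefined, and pandas reports it as `NaN` (NumPy may also print a warning about an invalid division):

```python
import pandas as pd

# Toy user-by-item matrix with hypothetical ratings; missing ratings already filled with 0
toy_matrix = pd.DataFrame(
    {
        'Liked Movie': [5.0, 4.0, 0.0, 1.0],
        'Similar Movie': [4.0, 5.0, 1.0, 0.0],
        'New Release': [0.0, 0.0, 0.0, 0.0],  # nobody has rated this one yet
    },
    index=pd.Index([1, 2, 3, 4], name='userId'),
)

# Same approach as above: correlate every column with the movie we liked
scores = toy_matrix.corrwith(toy_matrix['Liked Movie'])
print(scores.sort_values(ascending=False))
```

The new release gets no usable score, so it can never show up in the recommendations, however good it may be.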

One way to fix the cold start problem is to use something called content filtering.

### Content Filtering

Content filtering means that you get recommended something that is similar to what you liked before, and that similarity is often based on some internal characteristics that someone else decided items have. In our case, these are the genres that movies fall into.

Let's look at our data again.

In [24]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


As you can see, each movie has a list of genres, separated by `|`. For us to work with this data, we have to transform it again into a different structure. First, we have to split the genres and put them into a list (a formal Python data structure).

In [25]:
movies['genres'] = movies['genres'].str.split('|') #split the string data in the column `genres`

In [26]:
movies.head() #show the first five items

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]"
1,2,Jumanji (1995),"[Adventure, Children, Fantasy]"
2,3,Grumpier Old Men (1995),"[Comedy, Romance]"
3,4,Waiting to Exhale (1995),"[Comedy, Drama, Romance]"
4,5,Father of the Bride Part II (1995),[Comedy]


Now we want to, again, create a type of matrix, with the movies as columns, the genres as rows, and in each cell either a 0 or a 1 to indicate whether this movie belongs in this genre (1) or not (0).

In [27]:
#First, let's set the row names to be the same as the movie titles, 
#so it will be easier to figure out which score belongs to which movie later
movies = movies.set_index('title')

In [32]:
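#Now `explode` the list of genres so that each (movie, genre) pair gets its own row,
#and use `get_dummies` to turn those rows into 0/1 indicator columns, one column per genre.
#Next, sum all rows of the same movie, such that there is only one row per movie with the binary genre score
#(`groupby(level=0).sum()` replaces `sum(level=0)`, which was removed in pandas 2.0; `sort=False` keeps the original movie order),
#then transpose (i.e. flip) the table to get the movies at the top
movie_features = pd.get_dummies(movies['genres'].explode()).groupby(level=0, sort=False).sum().transpose()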
movie_features

title,Toy Story (1995),Jumanji (1995),Grumpier Old Men (1995),Waiting to Exhale (1995),Father of the Bride Part II (1995),Heat (1995),Sabrina (1995),Tom and Huck (1995),Sudden Death (1995),GoldenEye (1995),...,Gintama: The Movie (2010),anohana: The Flower We Saw That Day - The Movie (2013),Silver Spoon (2014),Love Live! The School Idol Movie (2015),Jon Stewart Has Left the Building (2015),Black Butler: Book of the Atlantic (2017),No Game No Life: Zero (2017),Flint (2017),Bungo Stray Dogs: Dead Apple (2018),Andrew Dice Clay: Dice Rules (1991)
(no genres listed),0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Action,0,0,0,0,0,1,0,0,1,1,...,1,0,0,0,0,1,0,0,1,0
Adventure,1,1,0,0,0,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
Animation,1,0,0,0,0,0,0,0,0,0,...,1,1,0,1,0,1,1,0,1,0
Children,1,1,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
Comedy,1,0,1,1,1,0,1,0,0,0,...,1,0,1,0,0,1,1,0,0,1
Crime,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Documentary,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
Drama,0,0,0,1,0,0,0,0,0,0,...,0,1,1,0,0,0,0,1,0,0
Fantasy,1,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,1,0,0,0
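The one-line transformation above packs several steps together. The sketch below unpacks them on three made-up movies so each intermediate result can be inspected (the titles `Movie A` to `Movie C`, their genres, and the variable names are illustrative assumptions):

```python
import pandas as pd

# Toy data (hypothetical): genres already split into lists, titles as the index
toy_movies = pd.DataFrame(
    {'genres': [['Adventure', 'Children'], ['Comedy'], ['Adventure', 'Comedy']]},
    index=pd.Index(['Movie A', 'Movie B', 'Movie C'], name='title'),
)

# Step 1: one row per (movie, genre) pair
exploded = toy_movies['genres'].explode()

# Step 2: one indicator column per genre
dummies = pd.get_dummies(exploded)

# Step 3: collapse back to one row per movie, keeping the original movie order
per_movie = dummies.groupby(level=0, sort=False).sum()

# Step 4: flip the table so that movies are columns and genres are rows
toy_features = per_movie.transpose()
print(toy_features)
```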


In [42]:
#The movie we liked and based on which we want to find recommendations
movie = 'Lion King, The (1994)'

#Calculate the correlation
content_recommendation = movie_features.corrwith(movie_features[movie])

#Sort the results from highest to lowest correlation score
content_recommendation = content_recommendation.sort_values(ascending=False)

#Make the table look pretty and print the top 10
pd.DataFrame(content_recommendation, columns=['corr']).head(10)

Unnamed: 0_level_0,corr
title,Unnamed: 1_level_1
"Lion King, The (1994)",1.0
Anastasia (1997),0.881917
Rock-A-Doodle (1991),0.763763
Song of the South (1946),0.763763
Land Before Time III: The Time of the Great Giving (1995),0.763763
Pete's Dragon (1977),0.763763
Dumbo (1941),0.763763
"Secret of NIMH, The (1982)",0.763763
Up (2009),0.763763
Fantasia 2000 (1999),0.763763


As you can see, these two logics do not give the same recommendations. While content filtering overcomes the cold start problem of a collaborative approach, it has its own limitations. First, you need labels for each item to describe what it is, and if those labels are not granular enough (e.g., only 20 genres), differences that matter to people might not be captured. Second, by only recommending things with the same characteristics, people might get trapped in a 'rabbit hole' or 'echo chamber', and are no longer exposed to diverse information.
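To make the difference concrete, the two top-10 lists printed earlier for 'Lion King, The (1994)' can be compared directly. The sketch below copies the titles from those outputs by hand (leaving out the movie itself) and checks how many appear in both lists. The variable names are illustrative:

```python
# Titles copied from the collaborative filtering output above
collaborative_top = {
    'Beauty and the Beast (1991)', 'Aladdin (1992)', 'Mrs. Doubtfire (1993)',
    'Mask, The (1994)', 'Jumanji (1995)', 'Snow White and the Seven Dwarfs (1937)',
    'Babe (1995)', 'Home Alone (1990)', 'Jurassic Park (1993)',
}

# Titles copied from the content filtering output above
content_top = {
    'Anastasia (1997)', 'Rock-A-Doodle (1991)', 'Song of the South (1946)',
    'Land Before Time III: The Time of the Great Giving (1995)',
    "Pete's Dragon (1977)", 'Dumbo (1941)', 'Secret of NIMH, The (1982)',
    'Up (2009)', 'Fantasia 2000 (1999)',
}

# Set intersection: movies recommended by both approaches
overlap = collaborative_top & content_top
print("Recommended by both approaches:", len(overlap))
```

For this movie, the two approaches do not share a single title among their top results.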

One thing that is not covered in this tutorial is: where are all these genres and ratings coming from? Recommender systems are more than just the algorithm that calculates similarity between items. They are often more complicated pipelines where features of items are automatically extracted or manually added, and users explicitly or implicitly rate items through their interactions. These aspects of a recommender system are generally referred to as **user profiling** and **content classification**.