## Recommender Systems: How do they work?

Recommender systems are sociotechnical artefacts that suggest information to a person based on some internal logic. Most systems use one, or a combination, of two types of logic:
- **Collaborative filtering**: recommends items that were liked by people who liked the same things you did
- **Content filtering**: recommends items that share characteristics with the things you liked

One way to visualise these two systems is as follows:

<img src="https://miro.medium.com/max/1064/1*mz9tzP1LjPBhmiWXeHyQkQ.png" alt="Recommender systems" width="600"/>

In this workshop we'll build a basic version of both of these recommender systems. For this, we will use a dataset that is often used in recommender system tutorials called *MovieLens*: https://www.wikiwand.com/en/MovieLens.
The dataset contains a list of movies, their genres, and the ratings that different people gave them on a scale of 0.5-5 stars.

We are using this dataset not because it is a particularly interesting subject, but because it gives a window into the culture of data science. In many fields there is a standard dataset, approach, or algorithm that is repeated again and again in online tutorials, demos, and workshops. Often, those datasets later turn out to be problematic for various reasons (see e.g., [1], [2]). Knowing what those standard datasets are is helpful when critically studying processes of datafication.


[1] Koch et al. (2021) Reduced, Reused and Recycled: The Life of a Dataset in Machine Learning Research. https://openreview.net/pdf?id=zNQBIBKJRkd  
[2] Crawford and Paglen (n.d.) Excavating AI: The Politics of Images in Machine Learning Training Sets. https://excavating.ai/

### Loading & exploring the data

In [None]:
import pandas as pd #import a popular data processing library

The first dataset we will look at is the `ratings.csv` file, which contains all the ratings given by users to movies.

In [None]:
ratings = pd.read_csv('ratings.csv') #read in the file
ratings.head() #show the first 5 items of the file

In [None]:
print("Number of ratings:", len(ratings))
print("Number of unique movieIds:", ratings['movieId'].nunique())
print("Number of unique users:", ratings['userId'].nunique())

In [None]:
ratings.describe() #print out some basic statistics of the dataset

The second file we are interested in is `movies.csv`, which contains movie titles and the genres that someone has decided they belong to.

In [None]:
movies = pd.read_csv('movies.csv')
movies.head()

Because both files share the column `movieId`, we can easily merge them and get one dataset that has ratings by users, movie titles, and movie genres.

In [None]:
merged_data = pd.merge(ratings, movies, on='movieId') #merge `ratings` and `movies` on the values in column `movieId`
merged_data.head()

In [None]:
merged_data.sort_values('rating', ascending=False) #sort the data from highest rating to lowest

Now that we have our data imported and we have a basic sense of what is in there, we can start working with it to create our two recommendation algorithms.

### Collaborative Filtering

Collaborative filtering means that you are recommended something based on what other people liked, but only people who also liked something you've liked in the past.


To do this, we need to transform our data structure into something called a user-by-item matrix, which is a fancy way of saying a table with item ratings, where the rows are different users and the columns are the different items. For us, this means the columns are the movies, the rows are the people, and in each cell is the rating that a person gave to that particular movie.
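
To make the shape of a user-by-item matrix concrete, here is a minimal sketch on a tiny, made-up table. The names `toy_ratings` and `toy_matrix` and all the values are hypothetical and only serve as illustration; the real data is transformed in the next cell.

```python
import pandas as pd

# Hypothetical toy ratings in the same "long" format as the merged data:
# one row per (user, movie) rating
toy_ratings = pd.DataFrame({
    'userId': [1, 1, 2, 2, 3],
    'title':  ['Movie A', 'Movie B', 'Movie A', 'Movie C', 'Movie B'],
    'rating': [5.0, 3.0, 4.0, 2.0, 1.0],
})

# Rows become users, columns become movies, cells hold the rating.
# Combinations that were never rated show up as NaN.
toy_matrix = toy_ratings.pivot_table(index='userId', columns='title', values='rating')
print(toy_matrix)
```

Five ratings spread over three users and three movies leave four of the nine cells empty, which is why the real matrix needs a decision about what to do with missing values.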


In [None]:
#Create a new table with the rows being userIds, the columns being the movie title, and 
#the values in the cells the rating
movie_matrix = merged_data.pivot_table(index='userId', columns='title', values='rating')

#Not every user has rated every movie, so the table contains missing values (NaN, 'not a number').
#We fill these in with 0s, so that everything in our table is a number and we can do math with it.
movie_matrix = movie_matrix.fillna(0)
movie_matrix

Now that we have our data in a structure that we like, we can start recommending things! We do this by finding movies whose ratings are very similar, i.e. strongly correlated. For this we will calculate Pearson's correlation coefficient* between the different columns. This gives us a score from -1 (people rate the two movies in opposite ways) through 0 (no relationship) to 1 (people rate them very similarly).
<br><br>
  


\* Technically, Spearman's rank-order correlation is better suited to ordinal data, but it is a less intuitive score and, given the size of our dataset, the choice does not have that big of an impact.
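
As a rough illustration of what these two scores measure, the sketch below computes Pearson's coefficient by hand on two made-up rating columns, checks it against pandas, and then gets Spearman's coefficient by applying the same calculation to the ranks of the values. The table `toy` and its numbers are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings of two movies by the same five users
toy = pd.DataFrame({
    'Movie A': [5.0, 4.0, 3.0, 1.0, 2.0],
    'Movie B': [4.5, 4.0, 2.0, 1.0, 3.0],
})
x, y = toy['Movie A'], toy['Movie B']

# Pearson: how much the two columns vary together,
# divided by how much each of them varies on its own
x_centered = x - x.mean()
y_centered = y - y.mean()
pearson_by_hand = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)

# The same score as calculated by pandas
pearson_pandas = x.corr(y)

# Spearman: Pearson's correlation calculated on the ranks instead of the raw values
spearman = x.rank().corr(y.rank())

print(pearson_by_hand, pearson_pandas, spearman)
```

Working on ranks means Spearman only cares about the order in which people rate movies, not about the distance between, say, a 4 and a 5.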

In [None]:
#The movie we liked and based on which we want to find recommendations
movie = 'Lion King, The (1994)'

#Calculate the correlation
collaborative_recommendation = movie_matrix.corrwith(movie_matrix[movie])

#Sort the results from highest to lowest correlation score
collaborative_recommendation = collaborative_recommendation.sort_values(ascending=False)

#Make the table look pretty and print the top 10
pd.DataFrame(collaborative_recommendation, columns=['corr']).head(10)

This seems to work pretty well!
However, collaborative filtering approaches have one big weakness: they can't recommend items that don't have any scores, and they can't recommend anything to someone who hasn't scored anything!
This is also referred to as the **cold start problem**: no inferences are possible for something we don't have any information about.  
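
The cold start problem can be seen in a small sketch. The table below is a made-up user-by-item matrix (missing ratings already filled with 0) in which the hypothetical `'New Movie'` has not been rated by anyone.

```python
import pandas as pd

# Hypothetical user-by-item matrix after filling missing ratings with 0.
# 'New Movie' has no ratings at all, so its column contains only zeros.
toy_matrix = pd.DataFrame(
    {
        'Movie A': [5.0, 4.0, 0.0, 1.0],
        'Movie B': [4.0, 5.0, 1.0, 0.0],
        'New Movie': [0.0, 0.0, 0.0, 0.0],
    },
    index=pd.Index([1, 2, 3, 4], name='userId'),
)

# Correlate every movie with 'Movie A', as in the recommendation cell above
toy_scores = toy_matrix.corrwith(toy_matrix['Movie A'])
print(toy_scores)
```

Because the ratings for `'New Movie'` do not vary at all, its correlation is undefined and pandas reports it as `NaN` (possibly with a runtime warning), so it can never show up in a sorted top 10.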

One way to fix the cold start problem is to use something called content filtering.

### Content Filtering

Content filtering means that you get recommended something that is similar to what you liked before, and that similarity is often based on some internal characteristics that someone else decided items have. In our case, these characteristics are the genres that movies fall into.

Let's look at our data again.

In [None]:
movies.head()

As you can see, each movie has a list of genres, separated by `|`. For us to work with this data, we have to transform it again into a different structure. First, we have to split the genres and put them into a list (a formal Python data structure).

In [None]:
movies['genres'] = movies['genres'].str.split('|') #split the string data in the column `genres`

In [None]:
movies.head() #show the first five items

Now we want to, again, create a type of matrix, with the movies as columns, the genres as rows, and in each cell either a 0 or a 1 to indicate whether this movie belongs in this genre (1) or not (0).
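
Before doing this on the full dataset, here is a minimal sketch of the target structure on three made-up movies. The names `toy_genres`, `toy_dummies` and `toy_features` are hypothetical, and the steps are one possible way to build such a table with pandas.

```python
import pandas as pd

# Hypothetical movies with their genre lists, indexed by title
toy_genres = pd.Series({
    'Movie A': ['Animation', 'Children'],
    'Movie B': ['Animation', 'Comedy'],
    'Movie C': ['Drama'],
})

# `explode` gives one row per (movie, genre) pair;
# `get_dummies` then creates one 0/1 indicator column per genre
toy_dummies = pd.get_dummies(toy_genres.explode(), dtype=int)

# Collapse back to one row per movie, then flip the table so movies are the columns
toy_features = toy_dummies.groupby(level=0).sum().transpose()
print(toy_features)
```

Each column of `toy_features` is now a movie described only by zeros and ones, which is all the information a content filter built this way has about it.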

In [None]:
#First, let's set the row names to be the same as the movie titles, 
#so it will be easier to figure out which score belongs to which movie later
movies = movies.set_index('title')

In [None]:
#Now `explode` the lists of genres so that each (movie, genre) pair gets its own row,
#and use `get_dummies` to turn the genres into 0/1 columns, one column per genre.
#Next, sum all rows of the same movie, such that there is only one row per movie with the binary genre score
#then transpose (i.e. flip) the table to get the movies at the top
movie_features = pd.get_dummies(movies['genres'].explode()).groupby(level=0, sort=False).sum().transpose()
movie_features

In [None]:
#The movie we liked and based on which we want to find recommendations
movie = 'Lion King, The (1994)'

#Calculate the correlation
content_recommendation = movie_features.corrwith(movie_features[movie])

#Sort the results from highest to lowest correlation score
content_recommendation = content_recommendation.sort_values(ascending=False)

#Make the table look pretty and print the top 10
pd.DataFrame(content_recommendation, columns=['corr']).head(10)

As you can see, these two logics do not give the same recommendations. While content filtering overcomes the cold start problem of a collaborative approach, it has its own limitations. First, you need labels for each item to describe what it is, and if those labels are not granular enough (e.g., only 20 genres), differences that matter to people might not be captured. Second, by only recommending things with the same characteristics, people might get trapped in a 'rabbit hole' or 'echo chamber', and no longer be exposed to diverse information.
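
The granularity limitation can be sketched with a made-up genre table. The movies and genres below are hypothetical; the point is only what happens when two items carry exactly the same labels.

```python
import pandas as pd

# Hypothetical genre-by-movie table (1 = the movie has this genre, 0 = it does not)
toy_features = pd.DataFrame(
    {
        'Movie A': [1, 1, 0, 0],
        'Movie B': [1, 1, 0, 0],  # exactly the same genres as Movie A
        'Movie C': [0, 0, 1, 1],  # no genres in common with Movie A
    },
    index=['Animation', 'Children', 'Drama', 'Thriller'],
)

# Correlate every movie with 'Movie A' and sort from most to least similar
toy_scores = toy_features.corrwith(toy_features['Movie A']).sort_values(ascending=False)
print(toy_scores)
```

`'Movie B'` gets the same perfect score as `'Movie A'` itself: however different the two might feel to a viewer, the labels give the recommender no way to tell them apart.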