## Recommender Systems: How do they work?

Recommender systems are sociotechnical artefacts that suggest information to a person based on some internal logic. Most systems use one, or a combination, of two types of logic:
- **Collaborative filtering**: recommends based on what people who liked what you liked also liked 
- **Content filtering**: recommends based on characteristics of the thing you liked that other items also have
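
To make the difference concrete, here is a minimal sketch of both kinds of logic in plain Python. The names, films, and genres are invented for illustration, and the functions `collaborative` and `content_based` are hypothetical helpers, not the versions built later in this workshop. Real systems weight and rank suggestions rather than returning every match.

```python
# Toy data, invented for illustration only
likes = {
    "ann": {"Alien", "Blade Runner"},
    "bob": {"Alien", "Blade Runner", "Solaris"},
    "cat": {"Amelie"},
}
genres = {
    "Alien": {"Sci-Fi", "Horror"},
    "Blade Runner": {"Sci-Fi"},
    "Solaris": {"Sci-Fi", "Drama"},
    "Amelie": {"Romance", "Comedy"},
    "The Thing": {"Horror"},  # nobody has liked this one yet
}


def collaborative(user):
    """Suggest items liked by people who share at least one like with `user`."""
    mine = likes[user]
    suggestions = set()
    for other, theirs in likes.items():
        if other != user and mine & theirs:
            suggestions |= theirs - mine
    return suggestions


def content_based(user):
    """Suggest unseen items sharing at least one genre with something `user` liked."""
    mine = likes[user]
    liked_genres = set().union(*(genres[item] for item in mine))
    return {
        item
        for item, item_genres in genres.items()
        if item not in mine and item_genres & liked_genres
    }


print(sorted(collaborative("ann")))  # ['Solaris']
print(sorted(content_based("ann")))  # ['Solaris', 'The Thing']
```

In this toy example, only the content-based logic can suggest "The Thing": nobody has liked it, so the collaborative logic has nothing to go on.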

One way to visualise these two systems is as follows:

<img src="https://miro.medium.com/max/1064/1*mz9tzP1LjPBhmiWXeHyQkQ.png" alt="Recommender systems" width="600"/>

In this workshop we'll build a basic version of both of these recommender systems. For this, we will use a dataset that is often used in recommender system tutorials called *MovieLens*: https://www.wikiwand.com/en/MovieLens.
The dataset contains a list of movies with their genres, along with ratings that different people gave those movies on a scale of 0.5 to 5 stars.

We are using this dataset not because it is a particularly interesting subject, but because it gives a window into the culture of data science. In many fields there is a standard dataset, approach, or algorithm that is repeated again and again in online tutorials, workshops, and demos. Many of those datasets later turned out to be problematic for various reasons (see e.g., [1], [2]). Knowing what those standard datasets are is helpful when critically studying processes of datafication.


[1] Koch et al. (2021) Reduced, Reused and Recycled: The Life of a Dataset in Machine Learning Research. https://openreview.net/pdf?id=zNQBIBKJRkd  
[2] Crawford and Paglen (n.d.) Excavating AI: The Politics of Images in Machine Learning Training Sets. https://excavating.ai/

### Loading & exploring the data

In [None]:
import pandas as pd #import a popular data processing library

The first dataset we will look at is the `ratings.csv` file, which contains all the ratings given by users to movies.

In [None]:
ratings = pd.read_csv('ratings.csv') #read in the file
ratings.head() #show the first 5 items of the file

In [None]:
print("Number of ratings:", len(ratings))
print("Number of unique movieIds:", ratings['movieId'].nunique())
print("Number of unique users:", ratings['userId'].nunique())

In [None]:
ratings.describe() #print out some basic statistics of the dataset
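
`describe()` reports the minimum and maximum rating, but not which values actually occur or how active each user is. The sketch below shows one way to check both. To stay self-contained it uses `toy_ratings`, a small made-up stand-in with the same column names as `ratings.csv`; in the notebook you could run the same calls on `ratings` instead.

```python
import pandas as pd

# Hypothetical stand-in for `ratings`; the values are invented
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3],
    "movieId": [10, 20, 10, 30, 20, 30],
    "rating":  [4.0, 3.5, 5.0, 0.5, 4.0, 4.0],
})

# How often does each rating value occur, ordered from lowest to highest?
rating_counts = toy_ratings["rating"].value_counts().sort_index()
print(rating_counts)

# How many ratings did each user give?
ratings_per_user = toy_ratings.groupby("userId").size()
print(ratings_per_user)
```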

The second file we are interested in is `movies.csv`, which contains movie titles and the genres that someone has decided they belong to.

In [None]:
movies = pd.read_csv('movies.csv')
movies.head()
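
In the standard MovieLens files the `genres` column stores all genres of a movie in a single string, separated by `|`. Assuming that format, the sketch below splits the string and counts how many movies carry each genre. `toy_movies` and its titles are made up so that the example runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for `movies`; titles and genres are invented
toy_movies = pd.DataFrame({
    "movieId": [10, 20, 30],
    "title": ["Movie A (1995)", "Movie B (1999)", "Movie C (2004)"],
    "genres": ["Adventure|Comedy", "Drama", "Comedy|Drama|Romance"],
})

# Split the pipe-separated string into a list of genres per movie
genre_lists = toy_movies["genres"].str.split("|")

# Turn each list into one row per (movie, genre) pair, then count per genre
genre_counts = genre_lists.explode().value_counts()
print(genre_counts)
```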

Because both files share the column `movieId`, we can easily merge them into one dataset that has ratings by users, movie titles, and movie genres.

In [None]:
merged_data = pd.merge(ratings, movies, on='movieId') #merge `ratings` and `movies` on the values in column `movieId`
merged_data.head()
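
By default `pd.merge` performs an inner join, so a rating whose `movieId` does not appear in the movies table is silently dropped. The sketch below, using made-up toy tables, shows that effect and one possible way to detect it with the `how`, `indicator`, and `validate` parameters of `pd.merge`. Whether the real files contain such unmatched rows is something to check, not something assumed here.

```python
import pandas as pd

# Hypothetical toy tables; movieId 99 has no entry in toy_movies
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2],
    "movieId": [10, 20, 99],
    "rating":  [4.0, 3.5, 5.0],
})
toy_movies = pd.DataFrame({
    "movieId": [10, 20, 30],
    "title": ["Movie A", "Movie B", "Movie C"],
})

# Default (inner) merge: the rating for movieId 99 disappears
inner = pd.merge(toy_ratings, toy_movies, on="movieId")
print(len(toy_ratings), len(inner))  # 3 2

# A left join keeps every rating. `indicator=True` adds a `_merge` column,
# and `validate` raises an error if a movieId occurs twice in toy_movies.
left = pd.merge(
    toy_ratings, toy_movies, on="movieId",
    how="left", indicator=True, validate="many_to_one",
)
unmatched = left[left["_merge"] == "left_only"]
print(unmatched)
```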

In [None]:
merged_data.sort_values('rating', ascending=False) #sort the data from highest rating to lowest
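
Sorting individual ratings mostly surfaces single 5-star votes. A rough alternative is to aggregate per title first. The sketch below uses a made-up `toy_merged` table with the same column names as `merged_data`, and the threshold `min_ratings` is an arbitrary illustrative choice.

```python
import pandas as pd

# Hypothetical stand-in for `merged_data`; the values are invented
toy_merged = pd.DataFrame({
    "userId": [1, 2, 3, 1, 2],
    "title":  ["Movie A", "Movie A", "Movie A", "Movie B", "Movie C"],
    "rating": [4.0, 3.0, 5.0, 5.0, 2.0],
})

# Mean rating and number of ratings per title, best average first
summary = (
    toy_merged.groupby("title")["rating"]
    .agg(mean_rating="mean", n_ratings="count")
    .sort_values("mean_rating", ascending=False)
)
print(summary)

# Keep only titles with at least `min_ratings` ratings
min_ratings = 2
popular = summary[summary["n_ratings"] >= min_ratings]
print(popular)
```

Here "Movie B" tops the first table on the strength of a single rating, which is why a minimum count is often applied before ranking by average.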

Now that we have our data imported and we have a basic sense of what is in there, we can start working with it to create our two recommendation algorithms.