# Unit 1: Exploratory Data Analysis on the MovieLens 100k Dataset

In these lessons, we will work a lot with [pandas](https://pandas.pydata.org/) and NumPy, so please take the time to at least familiarize yourself with them, e.g. with [10 Minutes to pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html).

The [MovieLens](https://grouplens.org/datasets/movielens/) datasets are to recommender systems practitioners and researchers what MNIST is to computer vision people. Of course, the MovieLens datasets are not the only public datasets used in the RecSys community, but they are among the most widely used. Other public datasets include:
* [Million Song Dataset](http://millionsongdataset.com/)
* [Amazon product review dataset](https://nijianmo.github.io/amazon/index.html)
* [Criteo datasets](https://labs.criteo.com/category/dataset/)
* [Twitter RecSys Challenge 2020](https://recsys-twitter.com/previous_challenge)
* [Spotify Million Playlist Dataset](https://www.aicrowd.com/challenges/spotify-million-playlist-dataset-challenge)
* [YooChoose RecSys Challenge 2015](https://www.kaggle.com/chadgostopp/recsys-challenge-2015)
* [BookCrossings](http://www2.informatik.uni-freiburg.de/~cziegler/BX/)

On _kdnuggets_ you can find a [simple overview](https://www.kdnuggets.com/2016/02/nine-datasets-investigating-recommender-systems.html) of some of them.

MovieLens comes in different sizes, which differ in the number of movie ratings, users, and items. Take a look at the GroupLens website and explore them yourself.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns

In [2]:
from recsys_training.data import genres

In [3]:
ml100k_ratings_filepath = '../data/raw/ml-100k/u.data'
ml100k_item_filepath = '../data/raw/ml-100k/u.item'
ml100k_user_filepath = '../data/raw/ml-100k/u.user'

## Load Data

In [4]:
ratings = pd.read_csv(ml100k_ratings_filepath,
                      sep='\t',
                      header=None,
                      names=['user', 'item', 'rating', 'timestamp'],
                      engine='python')

In [5]:
# u.item is not UTF-8 encoded; latin-1 avoids decode errors on accented titles
items = pd.read_csv(ml100k_item_filepath, sep='|', header=None,
                    names=['item', 'title', 'release', 'video_release', 'imdb_url']+genres,
                    encoding='latin-1',
                    engine='python')

In [6]:
users = pd.read_csv(ml100k_user_filepath, sep='|', header=None,
                    names=['user', 'age', 'gender', 'occupation', 'zip'])
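
To make the file layout concrete, here is a minimal sketch that parses a few rows in the same tab-separated layout as `u.data` from an in-memory string instead of from disk. The sample rows and the names `sample` and `sample_ratings` are illustrative assumptions, not part of the lesson code. The sketch also assumes that `timestamp` holds Unix seconds, as described in the dataset's README.

```python
import io

import pandas as pd

# Illustrative rows in the u.data layout: user, item, rating, timestamp (tab-separated).
# These values are made up for the example and are not read from the real file.
sample = "1\t10\t3\t881250949\n2\t10\t5\t891717742\n1\t20\t1\t878887116\n"

sample_ratings = pd.read_csv(io.StringIO(sample),
                             sep='\t',
                             header=None,
                             names=['user', 'item', 'rating', 'timestamp'])

# Assumption: timestamps are Unix seconds, so unit='s' converts them to datetimes.
sample_ratings['datetime'] = pd.to_datetime(sample_ratings['timestamp'], unit='s')

print(sample_ratings.dtypes)
print(sample_ratings.head())
```

The same `pd.to_datetime(..., unit='s')` call should work on the real `ratings` frame if you want human-readable dates later on.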

## Data Exploration

In this unit, we want to get a better picture of the data we will use for making recommendations in the upcoming units. Therefore, let's have a look at some statistics to become familiar with the data before turning to the algorithms.

![](parrot.png)

**TODO:**
Let's find out the following:

* number of users
* number of items
* rating distribution
* user / item mean ratings
* popularity skewness
    * user rating count distribution
    * item rating count distribution
* time (distribution of ratings over time)
* sparsity
* user / item features
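
As a starting point, the following sketch shows one possible way to compute several of these statistics with pandas. It runs on a tiny made-up frame, `toy_ratings`, that only mimics the columns of `ratings`, so the numbers say nothing about MovieLens itself. The sparsity formula assumes that each user rates each item at most once.

```python
import pandas as pd

# Toy stand-in for the `ratings` DataFrame (same columns, made-up values).
toy_ratings = pd.DataFrame({
    'user':      [1, 1, 1, 2, 2, 3],
    'item':      [10, 20, 30, 10, 20, 10],
    'rating':    [5, 3, 4, 4, 2, 5],
    'timestamp': [100, 200, 300, 150, 250, 400],
})

# Number of distinct users and items, and total number of ratings
n_users = toy_ratings['user'].nunique()
n_items = toy_ratings['item'].nunique()
n_ratings = len(toy_ratings)

# Sparsity: share of all possible user-item pairs that have no rating
# (assumes at most one rating per user-item pair)
sparsity = 1 - n_ratings / (n_users * n_items)

# Rating distribution: how often each rating value occurs
rating_distribution = toy_ratings['rating'].value_counts().sort_index()

# Mean rating per user and per item
user_mean = toy_ratings.groupby('user')['rating'].mean()
item_mean = toy_ratings.groupby('item')['rating'].mean()

# Rating counts per user and per item, most active / most popular first
user_counts = toy_ratings.groupby('user').size().sort_values(ascending=False)
item_counts = toy_ratings.groupby('item').size().sort_values(ascending=False)

print(f"users: {n_users}, items: {n_items}, ratings: {n_ratings}")
print(f"sparsity: {sparsity:.3f}")
print(rating_distribution)
print(item_counts)
```

Replacing `toy_ratings` with the loaded `ratings` frame should give the corresponding figures for MovieLens 100k. Plotting `user_counts` and `item_counts`, for example as histograms, is one way to look at popularity skewness.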