## A Simple Recommender System using Pandas 
In this part of the exercise you will work with a real data set: the **MovieLens data** from https://grouplens.org/datasets/movielens/. It is a **collection of movie ratings** that is widely used in research projects about recommender systems.

As in any data science project, **we start with the data exploration task** to get familiar with the data and look for hidden patterns. Afterwards, we **build a simple movie recommender system** without using any fancy machine learning algorithms. Keep in mind that this will not be a robust recommender system, but it is a nice first step for getting used to pandas.

**Remark**: Some of the expected results are already shown in the output. Therefore, add a new cell above the desired output. Otherwise you would overwrite some of the expected results!

So, let's start by loading the needed packages:

In [1]:
# import packages
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline

Next, please load the two CSV files **'../data/ml-latest-small/ratings.csv'** and **'../data/ml-latest-small/movies.csv'** as DataFrames named **ratings** and **movies**.

**Hint**: Use pd.read_csv().

In [None]:
# import the data
ratings = <FILL-IN>
<FILL-IN> = <FILL-IN>('../data/ml-latest-small/movies.csv')

Investigate the DataFrames and answer the following questions:
- How much memory do the DataFrames use?
- What kind of columns do we have (names and data types)?
- How many records do the DataFrames hold?
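The three questions above can be answered with a handful of standard pandas calls. The following is a minimal sketch on a hypothetical toy DataFrame (the name `toy` and its values are made up for illustration and are not part of the exercise data):

```python
import pandas as pd

# hypothetical toy data standing in for the real ratings file
toy = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [10, 20, 10],
    "rating": [4.0, 3.5, 5.0],
})

# column names, dtypes, non-null counts and memory usage in one call
toy.info(memory_usage="deep")

# the same pieces individually
column_types = toy.dtypes                        # dtype per column
n_rows, n_cols = toy.shape                       # number of records and columns
total_bytes = toy.memory_usage(deep=True).sum()  # bytes, index included

print(column_types)
print(n_rows, n_cols, total_bytes)
```

The same calls should work unchanged on **ratings** and **movies** once they are loaded.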

**Merge** the two DataFrames **ratings** and **movies on the key movieId** and select the **columns rating, userId and title**. Give the resulting DataFrame the name df. Afterwards, print the first 5 rows of the DataFrame.
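If merging on a key column is new to you, here is a small sketch of the idea on hypothetical toy tables (`toy_ratings`, `toy_movies` and their contents are invented for illustration):

```python
import pandas as pd

# hypothetical toy tables sharing the key column "movieId"
toy_ratings = pd.DataFrame({
    "userId": [1, 2, 2],
    "movieId": [10, 10, 20],
    "rating": [4.0, 3.0, 5.0],
})
toy_movies = pd.DataFrame({
    "movieId": [10, 20],
    "title": ["Movie X", "Movie Y"],
    "genres": ["Comedy", "Drama"],
})

# inner join on the shared key, then keep only the columns of interest
toy_df = pd.merge(toy_ratings, toy_movies, on="movieId")[["userId", "title", "rating"]]
print(toy_df.head())
```

Every rating row picks up the title of its movie; the key column itself is dropped by the column selection.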

In [None]:
# merge
df = <FILL-IN>
<FILL-IN>.head()

| | userId | title | rating |
|---|---|---|---|
| 0 | 1 | Dangerous Minds (1995) | 2.5 |
| 1 | 7 | Dangerous Minds (1995) | 3.0 |
| 2 | 31 | Dangerous Minds (1995) | 4.0 |
| 3 | 32 | Dangerous Minds (1995) | 4.0 |
| 4 | 36 | Dangerous Minds (1995) | 3.0 |


**Compute** the mean, median and count of the ratings per movie.

**Hint**: Extract the two needed columns *title* and *rating* and use the groupby('column') method. Try to produce the same DataFrame as below. For the following tasks it is important that you get the same result as below, i.e. same structure and column names.
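The structure of the expected result comes from aggregating a DataFrame with a list of functions, which produces two-level column labels. A minimal sketch on hypothetical toy data (names and values are illustrative only):

```python
import pandas as pd

# hypothetical toy data
toy_df = pd.DataFrame({
    "title": ["Movie X", "Movie X", "Movie Y"],
    "rating": [4.0, 3.0, 5.0],
})

# selecting with double brackets keeps a DataFrame, so the aggregated
# columns are labelled by (column, function) tuples
toy_agg = toy_df[["title", "rating"]].groupby("title").agg(["mean", "count"])

print(toy_agg.columns.tolist())
print(toy_agg.loc["Movie X", ("rating", "mean")])
```

A single aggregated column is then addressed by a tuple such as `("rating", "mean")`, which is the form used in the following tasks.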

In [None]:
# mean, median and count
df_agg = <FILL-IN>.agg(['mean', <FILL-IN>, 'count'])
df_agg.head()

| title | (rating, mean) | (rating, median) | (rating, count) |
|---|---|---|---|
| "Great Performances" Cats (1998) | 1.75 | 1.75 | 2 |
| $9.99 (2008) | 3.833333 | 4.5 | 3 |
| 'Hellboy': The Seeds of Creation (2004) | 2.0 | 2.0 | 1 |
| 'Neath the Arizona Skies (1934) | 0.5 | 0.5 | 1 |
| 'Round Midnight (1986) | 2.25 | 2.25 | 2 |


Extract only movies of the dataframe df_agg which have been rated **more than 100 times** and sort the resulting dataset in **descending order with respect to the mean rating**. Name the resulting dataframe df_agg_100 and show the first 20 and the last 20 rows.

**Hint**: Use conditional indexing and the sort_values() method. Furthermore, you can use head and tail.
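Conditional indexing and sorting both work with tuple column labels. The sketch below uses a hypothetical aggregate with a threshold of 3 instead of 100 (all names and numbers are made up):

```python
import pandas as pd

# hypothetical aggregate with two-level column labels
columns = pd.MultiIndex.from_tuples([("rating", "mean"), ("rating", "count")])
toy_agg = pd.DataFrame(
    [[3.5, 2], [4.5, 5], [2.0, 7]],
    index=pd.Index(["Movie X", "Movie Y", "Movie Z"], name="title"),
    columns=columns,
)

# boolean mask: True for rows whose count exceeds the threshold
mask = toy_agg[("rating", "count")] > 3

# keep the matching rows and sort them by mean rating, best first
popular = toy_agg[mask].sort_values(("rating", "mean"), ascending=False)
print(popular)
```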

In [None]:
<FILL-IN> = df_agg[df_agg[('rating', <FILL-IN>)] > <FILL-IN>].<FILL-IN>([('rating', 'mean')], ascending=<FILL-IN>)
<FILL-IN>

| title | (rating, mean) | (rating, median) | (rating, count) |
|---|---|---|---|
| Godfather, The (1972) | 4.4875 | 5.0 | 200 |
| Shawshank Redemption, The (1994) | 4.487138 | 5.0 | 311 |
| Godfather: Part II, The (1974) | 4.385185 | 5.0 | 135 |
| Usual Suspects, The (1995) | 4.370647 | 4.5 | 201 |
| Schindler's List (1993) | 4.303279 | 4.5 | 244 |
| One Flew Over the Cuckoo's Nest (1975) | 4.256944 | 4.5 | 144 |
| Fargo (1996) | 4.256696 | 4.5 | 224 |
| Pulp Fiction (1994) | 4.256173 | 4.5 | 324 |
| American Beauty (1999) | 4.236364 | 4.25 | 220 |
| Dark Knight, The (2008) | 4.235537 | 4.5 | 121 |


| title | (rating, mean) | (rating, median) | (rating, count) |
|---|---|---|---|
| Stargate (1994) | 3.368966 | 3.0 | 145 |
| Pretty Woman (1990) | 3.360544 | 3.0 | 147 |
| Star Trek: Generations (1994) | 3.350877 | 3.0 | 114 |
| Natural Born Killers (1994) | 3.336449 | 3.0 | 107 |
| Titanic (1997) | 3.332317 | 3.5 | 164 |
| Ghost (1990) | 3.325397 | 3.0 | 126 |
| Outbreak (1995) | 3.309091 | 3.0 | 110 |
| American Pie (1999) | 3.297297 | 3.5 | 111 |
| Austin Powers: The Spy Who Shagged Me (1999) | 3.272321 | 3.5 | 112 |
| Twister (1996) | 3.25 | 3.0 | 150 |


We have not talked much about plotting yet. Making plots is one of the key ingredients of exploring data sets, and another exercise will cover this topic. For now, we want to visualize the mean rating vs. the number of ratings. Therefore, **please execute the commands below** and set the arguments **kind to 'scatter'** and **alpha to 0.3** to produce a scatter plot with a certain transparency level.

In [None]:
# create figure and axes object
fig, ax = plt.subplots(figsize=(12,8))
# use Pandas built-in Plotting Function
df_agg['rating'].plot(kind=<FILL-IN>, x='mean', y='count', alpha=<FILL-IN>, ax=ax)

Can you interpret the plot? Does it make sense? Do you see any kind of **correlation or pattern**?

Next, please answer the following question: Which movie got the **worst mean rating** out of the df_agg_100 DataFrame?

**Hint**: Use idxmin().
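Note that `idxmin()` behaves differently on a DataFrame and on a single column, which explains why the expected output has one entry per column. A small sketch on hypothetical data:

```python
import pandas as pd

# hypothetical summary statistics, indexed by movie title
toy = pd.DataFrame(
    {"mean": [4.5, 2.0, 3.0], "count": [5, 7, 2]},
    index=["Movie Y", "Movie Z", "Movie W"],
)

# on a DataFrame: one index label per column (row of the column minimum)
worst_per_column = toy.idxmin()

# on a single column: just the label of the smallest value
worst_mean = toy["mean"].idxmin()

print(worst_per_column)
print(worst_mean)
```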

In [None]:
# worst movie
df_agg_100.<FILL-IN>

rating  mean        Waterworld (1995)
        median       Firm, The (1993)
        count     Shining, The (1980)
dtype: object

Next, we want to create a so-called **feature matrix**, where the **columns** are the **movie titles**, the **row indices** are the **userIds** and the **values** are the corresponding **ratings**. To do so, you can use the **pivot_table** method on the DataFrame df. Try to figure it out on your own. Name the resulting DataFrame **feature_mat**.

**Hint**: Use the Shortcut *Shift+Tab* to get the docstring of the function. Afterwards, show the first 10 entries of the matrix.
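In case the docstring alone is not enough, here is a sketch of how `pivot_table` reshapes long data into a wide matrix, using a hypothetical toy DataFrame (names and values are illustrative):

```python
import pandas as pd

# hypothetical long-format ratings: one row per (user, movie) pair
toy_df = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "title": ["Movie X", "Movie Y", "Movie X", "Movie Z"],
    "rating": [4.0, 3.0, 5.0, 2.0],
})

# rows become users, columns become titles, cells hold the rating;
# combinations that never occur in the data are filled with NaN
toy_mat = toy_df.pivot_table(index="userId", columns="title", values="rating")
print(toy_mat)
```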

In [None]:
<FILL-IN> = df.<FILL-IN>
<FILL-IN>

Wow, there are so many NaNs. Could this be correct? Can you explain why we get so many NaNs?

Such a matrix is called a **sparse matrix**. Actually, there are better formats to store elements of such a matrix more efficiently. Those who are interested may have a look at the **Compressed Sparse Column (CSC) format**.

Now, please compute the **percentage of null values** of the feature matrix.

**Hint**: Use the methods .isnull() and .sum() and the DataFrame attribute size, which gives you the number of elements of the matrix.
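One detail is easy to miss: on a DataFrame, `.sum()` adds up each column separately, so a second `.sum()` is needed to get a single number. A sketch on a hypothetical 3x2 matrix:

```python
import numpy as np
import pandas as pd

# hypothetical small matrix with three missing entries out of six
toy_mat = pd.DataFrame({
    "Movie X": [4.0, 5.0, np.nan],
    "Movie Y": [3.0, np.nan, np.nan],
})

# first sum: missing values per column, second sum: grand total
n_missing = toy_mat.isnull().sum().sum()

# size is rows * columns
pct_missing = n_missing / toy_mat.size * 100
print(pct_missing)
```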

In [None]:
# percentage of null values
nulls = feature_mat.<FILL-IN>
nulls / <FILL-IN>.<FILL-IN> * 100

98.355739546434492

## Recommender System
There are many different ways to build a recommender system. One popular approach is to use **collaborative filtering**, which is just a fancy name for recommending stuff based on the combination of what you did (e.g. bought) and what everybody else did (e.g. bought).

We want to discuss two approaches:

1. Find similar users based on correlations between their ratings. If e.g. users A and B are highly correlated, recommend a movie rated highly by user A to user B (if they have not seen that movie yet). This approach is called **user-based collaborative filtering**.

2. Find similar items based on correlations between their rating patterns. Afterwards, recommend similar movies which the user has not seen yet. This approach is called **item-based collaborative filtering**.

Both approaches are very similar, but we choose a slightly modified version of method 2. What are the **drawbacks of method 1**?

Compute the **correlation matrix** of the dataframe **feature_mat** with the **method corr()** and name the resulting DataFrame **movie_corr**.
This may take a while. You can **measure the time** with the cell magic **%%time**.

In [None]:
%%<FILL-IN>
<FILL-IN> = <FILL-IN>.<FILL-IN>

Use this correlation matrix to **find which movies are highly correlated with the movie Toy Story**.

**Hint**: Look at the DataFrame with the .head() method. After you have discovered the structure of the dataframe try to **extract the column 'Toy Story (1995)'**. The result will be a **Pandas Series** containing the correlation coefficients of Toy Story with all the other movies. Sort the series in descending order and print the first 10 records.

In [None]:
movie_corr[<FILL-IN>].sort_values(<FILL-IN>).head(<FILL-IN>)

Do you know any of these movies? What happened? Did we do **something wrong**? Can you **explain the result**?

**Compute the correlation matrix again**, but this time only for movies which lead to **at least 100 *observation pairs***, i.e. the movie Toy Story and another movie have both been rated by at least 100 common users. In the following table you can see an example with two *observation pairs* for Toy Story and MovieA, and one *observation pair* for Toy Story and MovieB.

| userId | ToyStory | MovieA | MovieB |
|--------|----------|--------|--------|
| 1      | 3.0      | 4.0    | 4.0    |
| 2      | NaN      | 5.0    | 4.0    |
| 3      | 4.0      | NaN    | NaN    |
| 4      | 5.0      | 4.0    | NaN    |


Afterwards, check again which movies are highly correlated with Toy Story.

**Hint**: Use the argument **min_periods** of the corr() method to achieve the desired result.
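To see what **min_periods** does, consider the following sketch. It uses hypothetical ratings (not the table above) chosen so that ToyStory and MovieB share only two raters, while ToyStory and MovieA share three:

```python
import numpy as np
import pandas as pd

# hypothetical ratings of five users for three movies
toy = pd.DataFrame(
    {
        "ToyStory": [3.0, np.nan, 4.0, 5.0, 2.0],
        "MovieA": [4.0, 5.0, np.nan, 4.0, 1.0],
        "MovieB": [4.0, 4.0, 5.0, np.nan, np.nan],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="userId"),
)

# number of observation pairs: users who rated both movies
observed = toy.notna().astype(int)
pairs = observed.T.dot(observed)

# default: any two common ratings are enough, so two points that
# happen to line up give a "perfect" correlation
loose = toy.corr()

# require at least three common ratings, otherwise the entry is NaN
strict = toy.corr(min_periods=3)

print(pairs)
print(loose.loc["ToyStory"])
print(strict.loc["ToyStory"])
```

With only two common raters the correlation between ToyStory and MovieB is trivially 1.0; requiring three pairs replaces it with NaN, while the better-supported ToyStory/MovieA value is kept.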

In [None]:
movie_corr_100 = <FILL-IN>
movie_corr_100<FILL-IN>.head(10)

Next, we create a dictionary which contains **your own movie preferences**. Please add your own ratings (integers 1 to 5).

In [124]:
# please add your ratings as values
myRatings = {'Dark Knight, The (2008)': <FILL-IN>, 'Mask, The (1994)': <FILL-IN>,
             'Titanic (1997)': <FILL-IN>, 'Star Wars: Episode IV - A New Hope (1977)': <FILL-IN>,
             'Star Wars: Episode I - The Phantom Menace (1999)': <FILL-IN>, 'Pulp Fiction (1994)': <FILL-IN>}

The function below **weights the similarity score** by your ratings and **returns a sorted pandas Series** containing your recommendations in descending order. Please execute the function definition.

In [20]:
def recommend(myRatings, movie_corr_100):
    '''Find the movies which are highly correlated with your rated movies and weight them with your rating.'''

    # collect one weighted candidate Series per rated movie
    weighted = []
    for key, rating in myRatings.items():
        candidates = movie_corr_100[key].dropna()
        weighted.append(candidates * rating)

    # Series.append was removed in pandas 2.0, so concatenate instead
    recommendations = pd.concat(weighted)

    # remove the movies you have already rated
    recommendations = recommendations.drop(list(myRatings.keys()), errors='ignore')

    # aggregation, because different movies (e.g. Toy Story 1 and 2) can lead to the same candidate (e.g. Shrek)
    return recommendations.groupby(by=recommendations.index).agg('mean').sort_values(ascending=False)
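The weighting and aggregation idea can be traced by hand on a tiny example. The sketch below uses two hypothetical correlation columns and made-up ratings of 5 and 2; it illustrates the idea only and is not a replacement for the function:

```python
import pandas as pd

# hypothetical correlations of candidate movies with two rated movies
corr_with_a = pd.Series({"Movie C": 0.8, "Movie D": 0.2})
corr_with_b = pd.Series({"Movie C": 0.4, "Movie E": 0.6})

# weight each correlation by the (made-up) rating of the rated movie
weighted = pd.concat([corr_with_a * 5, corr_with_b * 2])

# "Movie C" appears twice, so average the duplicates per title
scores = weighted.groupby(level=0).mean().sort_values(ascending=False)
print(scores)
```

Movie C ends up on top because it is correlated with both rated movies, and most strongly with the one that got the higher rating.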

Use the function **recommend** on your own dictionary **myRatings** and name the result myRecommendations.

In [None]:
myRecommendations = <FILL-IN>

Finally, we **join the series** with the **df_agg_100 dataframe** to add some summary statistics. Since we can only join two dataframes, we have to **convert the series to a DataFrame** by using the method to_frame(name='column_name') on our pandas series myRecommendations.
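The conversion and join can be sketched as follows on hypothetical data (`toy_scores` and `toy_stats` are invented stand-ins for the recommendations and the summary statistics):

```python
import pandas as pd

# hypothetical recommendation scores, indexed by title
toy_scores = pd.Series({"Movie C": 2.4, "Movie E": 1.2, "Movie D": 1.0})

# hypothetical summary statistics for a subset of the titles
toy_stats = pd.DataFrame(
    {"mean": [4.1, 3.2], "count": [150, 120]},
    index=["Movie C", "Movie D"],
)

# Series -> one-column DataFrame, then a left join on the index;
# titles without statistics get NaN and are removed by dropna()
joined = toy_scores.to_frame(name="score").join(toy_stats).dropna()
print(joined)
```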

In [None]:
<FILL-IN>.to_frame(name='score').join(df_agg_100['rating']).dropna().head(10)

### End of the Exercise
Great job! I hope you enjoyed the exercise.

## Bonus:
Start again with the movies and ratings DataFrames. Try to solve the following problems:

1. How many different movie genres does the DataFrame movies contain?
2. Which movie genre has the highest average rating considering only movies with more than 100 ratings?

**Hint**: These two tasks can be a bit tricky. Here are some useful steps that I have used for my solution:

1. Transform the 'genres' column into a list using the apply and split method.
2. Explode (denormalize) the list, i.e. each element of the list yields a new row in the dataframe. There are several ways to do that. One way is to use a nested for loop and iterate over a Pandas Series which contains the genres list. You could append each element of the genres_list to an empty list. Afterwards use the set(list) function to get only unique elements. As an alternative you could also use the method stack() on a Pandas Series. Unfortunately, there is no built-in explode function, e.g. like in Hive.

**Update**: Since pandas 0.25, Series and DataFrames have a built-in explode method.
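A minimal sketch of the split-and-explode idea, assuming pandas 0.25 or newer and using a hypothetical two-row table (names and genres are made up):

```python
import pandas as pd

# hypothetical movies table with pipe-separated genres
toy_movies = pd.DataFrame({
    "title": ["Movie X", "Movie Y"],
    "genres": ["Comedy|Romance", "Comedy|Drama"],
})

# split each string into a list, then give every list element its own row;
# the original row index is repeated for each element
genres_long = toy_movies["genres"].str.split("|").explode()

# unique genres
unique_genres = set(genres_long)
print(genres_long)
print(len(unique_genres))
```

Because the original index is repeated, the exploded Series can later be matched back to the rows it came from.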

For question 2 you also need to do something like:

3. Do something similar to step 2 (see above), but keep track of the index to *rejoin* your results with the original DataFrame.
4. Join the result with the ratings df.
5. Compute the average rating and the count grouped by the exploded column genres.
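Steps 3 to 5 can be sketched in a few lines when `DataFrame.explode` is available (pandas 0.25 or newer). The toy tables below are hypothetical, and the filter on movies with more than 100 ratings is left out on purpose:

```python
import pandas as pd

# hypothetical toy tables
toy_movies = pd.DataFrame({
    "movieId": [10, 20],
    "genres": ["Comedy|Romance", "Comedy|Drama"],
})
toy_ratings = pd.DataFrame({
    "movieId": [10, 10, 20],
    "rating": [4.0, 2.0, 5.0],
})

# one row per (movie, genre); movieId is carried along for the join
exploded = toy_movies.assign(
    genres=toy_movies["genres"].str.split("|")
).explode("genres")

# attach the ratings, then aggregate per genre
per_genre = (
    exploded.merge(toy_ratings, on="movieId")
    .groupby("genres")["rating"]
    .agg(["mean", "count"])
)
print(per_genre)
```

Here the key column movieId plays the role of the tracked index: it survives the explode step and makes the join with the ratings possible.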

| | movieId | title | genres | genres_list |
|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | [Adventure, Animation, Children, Comedy, Fantasy] |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy | [Adventure, Children, Fantasy] |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance | [Comedy, Romance] |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance | [Comedy, Drama, Romance] |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy | [Comedy] |


{'Film-Noir', 'Children', 'Drama', 'Comedy', 'Mystery', '(no genres listed)', 'Animation', 'Musical', 'Crime', 'Adventure', 'Fantasy', 'Romance', 'Horror', 'Documentary', 'Thriller', 'War', 'Western', 'IMAX', 'Sci-Fi', 'Action'}


20

   old_index     genres
0          0  Adventure
1          0  Animation
2          0   Children
3          0     Comedy
4          0    Fantasy
CPU times: user 12.4 ms, sys: 7 µs, total: 12.5 ms
Wall time: 11.9 ms


| | old_index | genres |
|---|---|---|
| 0 | 0 | Adventure |
| 1 | 0 | Animation |
| 2 | 0 | Children |
| 3 | 0 | Comedy |
| 4 | 0 | Fantasy |


| genres | (rating, mean) | (rating, median) | (rating, count) |
|---|---|---|---|
| (no genres listed) | 3.777778 | 4.0 | 18 |
| Action | 3.445613 | 3.5 | 27056 |
| Adventure | 3.520393 | 4.0 | 22017 |
| Animation | 3.636062 | 4.0 | 6170 |
| Children | 3.466187 | 3.5 | 8680 |


| genres | (rating, mean) | (rating, median) | (rating, count) |
|---|---|---|---|
| Film-Noir | 3.955702 | 4.0 | 1140 |
| War | 3.817214 | 4.0 | 5025 |
| Documentary | 3.813299 | 4.0 | 1564 |
| Drama | 3.68178 | 4.0 | 44752 |
| Crime | 3.679639 | 4.0 | 16266 |
| Mystery | 3.679541 | 4.0 | 7625 |
| Animation | 3.636062 | 4.0 | 6170 |
| Musical | 3.598793 | 4.0 | 4722 |
| IMAX | 3.571134 | 4.0 | 3156 |
| Western | 3.566423 | 4.0 | 1912 |
