# Exercise Notebook Instructions

### 1. Important: Only modify the cells which instruct you to modify them - leave "do not modify" cells alone.  

The code which tests your responses assumes you have run the startup/read-only code exactly.

### 2. Work through the notebook in order.

Some steps depend on earlier ones, so you'll want to move through the notebook in order.

### 3. It is okay to use libraries.

You may find some questions are fairly straightforward to answer using built-in library functions.  That's totally okay - part of the point of these exercises is to familiarize you with the commonly used functions.

### 4. Seek help if stuck

If you get stuck, don't worry!  You can either review the videos/notebooks from this week, ask in the course forums, or look to the solutions for the correct answer.  BUT, be careful about looking to the solutions too quickly.  Struggling to get the right answer is an important part of the learning process.

In [None]:
import pandas as pd
import numpy as np
import os.path

# Exercise Notebook 4 on pandas

In this exercise notebook you will have the opportunity to load the MovieLens database and perform additional analysis.

First, let's load the data into pandas DataFrames:

In [None]:
# DO NOT MODIFY

# set here the relative path to the movielens folder
MOVIELENS="../movielens"

movies = pd.read_csv(os.path.join(MOVIELENS, 'movies.csv'), sep=',')
ratings = pd.read_csv(os.path.join(MOVIELENS, 'ratings.csv'), sep=',')

In [None]:
movies.head()

In [None]:
ratings.head()

## Exercise 1: Find the minimum rating

Let's start by computing the minimum rating.

In the next cell, define the `min_rating` variable as the minimum value of the `rating` column of the `ratings` DataFrame:

In [None]:
min_rating = None
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# DO NOT MODIFY

assert isinstance(min_rating, np.float64), "Try again, make sure you are taking the min of just 1 column"
assert abs(min_rating - 0.5) < .01, "Try again, the minimum should be 0.5"

## Exercise 2: Find the mean rating of a movie

**Toy Story** has `movieId` 1. Find its mean rating. For this exercise you only need the `ratings` DataFrame:


In [None]:
toy_story_rating = None
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert abs(toy_story_rating - 3.92) < 0.01, "Try again, select only the rows where the movieId is equal to 1"

## Exercise 3: Find the most common rating

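Next, let's find which rating is the most common. In the next cell, define `rating_counts` as the number of occurrences of each rating, indexed by rating and sorted so that the most common rating comes first: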

In [None]:
rating_counts = None
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# DO NOT MODIFY

assert rating_counts.iloc[:1].index[0] == 4., "Print out rating_counts and try to understand what is wrong"

It is also interesting to inspect step by step what the statement above is doing:

    rating_counts.iloc[:1].index[0]

Print all the intermediate stages of this expression and read the documentation of the functions used here.
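
As a rough illustration of this kind of step-by-step inspection, here is a minimal sketch on a small made-up Series. The name `toy_counts` and its values are hypothetical and are not taken from the MovieLens data:

```python
import pandas as pd

# Hypothetical counts, indexed by rating value (illustrative data only)
toy_counts = pd.Series([50, 30, 20], index=[4.0, 3.0, 5.0])

first_row = toy_counts.iloc[:1]   # slicing by position keeps a Series, here of length 1
labels = first_row.index          # the Index object holding that row's label
first_label = labels[0]           # the label itself

print(type(first_row), type(labels), first_label)
```

Each stage returns a different kind of object, which is why printing them one at a time helps.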

## Exercise 4: Usage of the index in pandas

In `numpy`, the way to point to a specific entry in an array is by its integer position. In `pandas` you can do the same with `iloc`, but you also have the option of defining a column in a DataFrame as an `index` and referring to rows by their `index` labels with `loc` instead of by integer location.

For example in `rating_counts` defined above, the rating is the `index`, so we can reference a value either by its position or by its rating.

For example we can reference the 4th record either with `iloc` and its position or with `loc` and its label:

In [None]:
rating_counts.iloc[3]

In [None]:
rating_counts.loc[3.5]
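
The same distinction can be tried on a small made-up Series. This is only a sketch: `toy_counts` and its values are hypothetical, chosen so that the label `3.5` sits in the fourth position:

```python
import pandas as pd

# Hypothetical counts, indexed by rating value (illustrative data only)
toy_counts = pd.Series([50, 30, 20, 10], index=[4.0, 3.0, 5.0, 3.5])

by_position = toy_counts.iloc[3]   # fourth entry, counting from zero
by_label = toy_counts.loc[3.5]     # entry whose index label is 3.5

print(by_position, by_label)
```

Both lookups return the same entry here only because of how the toy index is ordered; `loc[3.0]` and `iloc[3]` would give different entries.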

## Exercise 5: Set an index for the movies DataFrame

The `movies` DataFrame has one column, `movieId`, that uniquely identifies each row. When that is the case, it is useful to turn that column into the index.

In [None]:
movies = movies.set_index("movieId")

In [None]:
ratings.head()

The `ratings` DataFrame, by contrast, has no row identifier: both `userId` and `movieId` reference records in other DataFrames. There is no good candidate for an index, so we can just keep the default integer index.

## Exercise 6: Year with maximum standard deviation in the rating

The first assignment is to find which year has the maximum standard deviation in the rating: **not** the maximum value of the standard deviation, but the year in which it occurs.
You can use the `idxmax` method; see its documentation on the pandas website, https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.idxmax.html
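
To make the difference between the maximum value and the label where it occurs concrete, here is a minimal sketch on a made-up Series. The name `toy_spread` and its numbers are hypothetical:

```python
import pandas as pd

# Hypothetical per-year values (illustrative data only)
toy_spread = pd.Series([0.8, 1.3, 1.1], index=[2001, 2002, 2003])

largest_value = toy_spread.max()        # the maximum value itself
label_of_largest = toy_spread.idxmax()  # the index label where the maximum occurs

print(largest_value, label_of_largest)
```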

First we want to convert the timestamp into a datetime object:

In [None]:
# DO NOT MODIFY

ratings['parsed_time'] = pd.to_datetime(ratings['timestamp'], unit='s')

We can then access datetime-related fields through the `dt` accessor, for example:

In [None]:
ratings.parsed_time.dt.month.head()
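
Other fields are available in the same way. The following sketch uses a small made-up DataFrame (`toy` is a hypothetical name) with two Unix timestamps in seconds, and reads the year instead of the month:

```python
import pandas as pd

# Two hypothetical Unix timestamps in seconds: the epoch and 2000-01-01 00:00:00 UTC
toy = pd.DataFrame({"timestamp": [0, 946684800]})

# unit="s" tells pandas to interpret the integers as seconds since the epoch
toy["parsed_time"] = pd.to_datetime(toy["timestamp"], unit="s")

years = toy["parsed_time"].dt.year
print(years.tolist())
```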

In [None]:
def find_year_with_max_std(ratings):
    """Return the year with the largest standard deviation in rating"""
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
assert find_year_with_max_std(ratings) == 1998, "Wrong year identified, try again!"

## Advanced exercise 1: Identify popular movies

In the rest of the notebook, we will introduce new concepts not covered in class. This will challenge you to read additional pandas documentation.

First, we would like to consider only movies that have a significant number of ratings. This task is complicated by the fact that movies and ratings are in 2 different DataFrames, and we want to filter the `movies` DataFrame based on a statistic computed from the `ratings` DataFrame.

First let's compute the number of ratings per movie:

In [None]:
number_of_ratings = ratings.movieId.value_counts()

In [None]:
number_of_ratings.head()

Now we want to filter this pandas Series and keep only the rows where the count is greater than or equal to 100.
We don't want to pollute our analysis with movies that have only a tiny number of ratings:
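
As a general reminder of how filtering a Series with a boolean mask works, here is a minimal sketch on made-up data. The names `toy_counts`, `mask` and `kept` are hypothetical, and the threshold is only there to show that `>=` keeps the boundary value:

```python
import pandas as pd

# Hypothetical counts, indexed by an arbitrary id (illustrative data only)
toy_counts = pd.Series([250, 100, 99, 3], index=[10, 20, 30, 40])

mask = toy_counts >= 100   # boolean Series with the same index as toy_counts
kept = toy_counts[mask]    # keeps only the rows where the mask is True

print(kept)
```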

In [None]:
number_of_ratings_of_popular_movies = None
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert len(number_of_ratings_of_popular_movies) == 8546, "Try again, check that movies with 100 ratings are accepted"

Finally, we want to use the `reindex` function to change the index of `movies`. This will create a new DataFrame whose index contains the `movieId` of only the most popular movies.

The values of all the rows of `movies` whose `movieId` appears in the new index will be copied over to the new `all_popular_movies` DataFrame; the rest will be discarded.

In [None]:
all_popular_movies = movies.reindex(number_of_ratings_of_popular_movies)

## Advanced exercise 2: Data cleaning

Every time we perform a reindexing operation, `pandas` will create a row for every value of the new index, even if it doesn't exist in the original data structure, and it will mark the data in those rows as missing with `NaN` (Not a Number).
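
This behavior is easy to see on a small made-up DataFrame. In this sketch, `toy_movies` and the label `99` are hypothetical; `99` is deliberately absent from the original index:

```python
import pandas as pd

# Hypothetical three-row DataFrame indexed by movieId (illustrative data only)
toy_movies = pd.DataFrame(
    {"title": ["A", "B", "C"]},
    index=pd.Index([1, 2, 3], name="movieId"),
)

# Label 2 is dropped, labels 3 and 1 are copied over, label 99 gets a NaN row
reindexed = toy_movies.reindex([3, 1, 99])

missing_per_column = reindexed.isnull().sum()
print(reindexed)
print(missing_per_column)
```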

Always check if reindexing generated invalid data:

In [None]:
all_popular_movies.isnull().sum()

In the next cell, we want to drop the invalid data. Look for a `pandas` function that performs that operation (it starts with "drop"!).

In [None]:
popular_movies = None
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert len(popular_movies) == 7847, "Try again, check the documentation of the function you used"

## Advanced exercise 3: Filter by genre

Let's implement a general function that filters movies by genres:

In [None]:
def filter_by_genre(input_movies, genre):
    """Return only movies of a specific genre"""
    # YOUR CODE HERE
    raise NotImplementedError()

Then let's apply it to the `popular_movies` dataset to retain only the "Fantasy" movies:

In [None]:
fantasy_movies = filter_by_genre(popular_movies, "Fantasy")

In [None]:
assert len(fantasy_movies) == 382, """Try again, Make sure you are filtering the popular movies"""

## Advanced exercise 4: Join movies and ratings

Let's create a single `DataFrame` that contains both titles and mean ratings of the popular fantasy movies.

Titles are only available in the `movies` `DataFrame`, while ratings are only available in the `ratings` `DataFrame`, so we need to combine information from both.

Create the `mean_ratings` variable by computing the mean rating for each movie:

In [None]:
mean_ratings = None
# YOUR CODE HERE
raise NotImplementedError()

In this case we don't even need to use a join operation, we can just create a new column in the `fantasy_movies` DataFrame. This will automatically match the index of `mean_ratings` with the index of `fantasy_movies` and attach to each movie its rating. Ratings for movies that are not in the `fantasy_movies` DataFrame are discarded.

The recommended way of creating columns in recent versions of `pandas` is the `assign` function. Read its documentation!
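
The index matching described above can be checked on made-up data first. This is a minimal sketch: `toy_movies` and `toy_means` are hypothetical, and the Series is deliberately in a different order and contains a label (`3`) that the DataFrame does not have:

```python
import pandas as pd

# Hypothetical data (illustrative only): two movies, three mean ratings
toy_movies = pd.DataFrame({"title": ["A", "B"]}, index=[1, 2])
toy_means = pd.Series([4.5, 3.0, 2.0], index=[2, 3, 1])

# assign returns a new DataFrame; values are matched by index label, not by position
with_rating = toy_movies.assign(rating=toy_means)

print(with_rating)
```

The value for label `3` is discarded because no row of `toy_movies` has that label, and `toy_movies` itself is left unchanged.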

In [None]:
fantasy_movies.assign?

In [None]:
fantasy_movies_with_ratings = fantasy_movies.assign(rating = mean_ratings)

In [None]:
fantasy_movies_with_ratings.head()

In [None]:
assert fantasy_movies_with_ratings.loc[7842].title.startswith("Dune"), "Try again, missing or wrong title"

In [None]:
assert abs(fantasy_movies_with_ratings.loc[7842].rating - 3.56) < 0.01 , "Try again, missing or wrong rating"

## Advanced exercise 5: Find the highest rated fantasy movie

Again we need to find the index where a column is max, in this case rating:

In [None]:
index_of_max_rating = None
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
highest_rated_fantasy_movie = fantasy_movies_with_ratings.loc[index_of_max_rating]

In [None]:
assert highest_rated_fantasy_movie.title.startswith("Princess"), "Try again"