In this notebook you will practice the following:

- Scatterplots
- Line charts
- Bar charts
- Histograms
- Box plots
- Scaling plots

To learn about data visualization, we are going to use a modified version of [The Movies Dataset](https://www.kaggle.com/rounakbanik/the-movies-dataset), which contains information about movies.

The dataset is located at `data/movies.csv` and has the following fields:

```
    belongs_to_collection (franchise): Name of the franchise the movie belongs to.
    budget: Movie budget (in $).
    genre: Genre the movie belongs to.
    original_language: Language the movie was originally filmed in.
    production_company: Name of the production company.
    production_country: Country where the movie was produced.
    release_year: Year the movie was released.
    revenue: Movie ticket sales (in $).
    runtime: Movie duration (in minutes).
    title: Movie title.
    vote_average: Average rating in MovieLens.
    vote_count: Number of votes in MovieLens.
```

In [None]:
import pandas as pd

In [None]:
movies = pd.read_csv("data/movies.csv")

In [None]:
movies.shape

In [None]:
movies.head()

Import `matplotlib` and `matplotlib.pyplot` (as `plt`), and enable the `%matplotlib inline` magic.

In [None]:
#YOUR CODE HERE

Change the default figure size to `[10, 10]` (the check below expects these values).

In [None]:
#YOUR CODE HERE

In [None]:
assert matplotlib.rcParams["figure.figsize"] == [10.0, 10.0]

<hr>

### Note about the grading

Grading plots is difficult, so we use `plotchecker` to grade them with nbgrader.
For `plotchecker` to work with nbgrader, every plotting cell must end with the line

`axis = plt.gca();`

**placed after the code that draws the plot**.

For example, to draw a scatter plot showing the relationship between budget and revenue, we would do the following:

In [None]:
# code required to draw the plot
movies[["budget", "revenue"]].plot.scatter(x="budget", y="revenue")

# last line in the cell, required to "capture" the plot so nbgrader can grade it
axis = plt.gca();
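
To see why that last line works, here is a minimal sketch on toy data. The `toy` frame and its columns `a` and `b` are invented for illustration and are not part of the movies dataset. It suggests that `plt.gca()` ("get current axes") hands back the same Axes object that pandas just drew on, which is what the checker later inspects.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data, invented for illustration only (not the movies dataset)
toy = pd.DataFrame({"a": [1, 2, 3], "b": [2, 4, 6]})

# pandas draws on a matplotlib Axes and returns it
returned_axes = toy.plot.scatter(x="a", y="b")

# plt.gca() returns the Axes that is currently active,
# which here is the one pandas just created
axis = plt.gca()
print(axis is returned_axes)
```

Because `axis` must point at the finished plot, the `plt.gca()` call has to come after the plotting code, not before it.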

<hr>

### How does the budget correlate with the revenue?

In [None]:
#YOUR CODE HERE

axis = plt.gca();

In [None]:
from plotchecker import ScatterPlotChecker, BarPlotChecker, LinePlotChecker
import numpy as np
pc = ScatterPlotChecker(axis)
pc.assert_y_data_equal(movies[movies.revenue.notnull()].revenue)
pc.assert_x_data_equal(movies[movies.revenue.notnull()].budget)

### How does the average movie duration evolve over time? Set the plot title to "Average Movie Duration by year"

In [None]:
#YOUR CODE HERE

axis = plt.gca();

In [None]:
pc = LinePlotChecker(axis)
pc.assert_x_data_equal([sorted(movies[movies.runtime.notnull()].release_year.unique())])
pc.assert_y_data_equal([movies.groupby("release_year")["runtime"].mean()])
pc.assert_title_equal("Average Movie Duration by year")
print("Success!")

### How does the average revenue vary by genre? Label the x-axis as "Average Revenue"

In [None]:
#YOUR CODE HERE

axis = plt.gca();

In [None]:
pc = BarPlotChecker(axis)
pc.assert_num_bars(len(movies.groupby("genre").groups))
pc.assert_widths_allclose(movies.groupby("genre")["revenue"].mean().values)
pc.assert_xlabel_equal("Average Revenue")
print("Success!")

### How is the variable vote_count distributed? (Set the x-axis limits to 0 and 2000)

In [None]:
#YOUR CODE HERE

axis = plt.gca();

In [None]:
submission_pc = BarPlotChecker(axis)
evaluation_plot = movies.vote_count.plot.hist(xlim=(0, 2000))
plt.close()
evaluation_pc = BarPlotChecker(evaluation_plot)

np.testing.assert_allclose(submission_pc.heights, evaluation_pc.heights)
np.testing.assert_allclose(submission_pc.widths, evaluation_pc.widths)

print("Success!")

Make a plot that displays revenue broken down by original language and lets us check whether there are outliers.

In [None]:
#YOUR CODE HERE

axis = plt.gca();

In [None]:
submission_pc = LinePlotChecker(axis)
evaluation_plot = movies.boxplot(column="revenue", by="original_language")
plt.close()
evaluation_pc = LinePlotChecker(evaluation_plot)

np.testing.assert_allclose(evaluation_pc.colors, submission_pc.colors)
np.testing.assert_allclose(evaluation_pc.yticks, submission_pc.yticks)
for submit_array, evaluation_array in zip(submission_pc.y_data, evaluation_pc.y_data):
    np.testing.assert_allclose(submit_array, evaluation_array)
for submit_array, evaluation_array in zip(submission_pc.x_data, evaluation_pc.x_data):
    np.testing.assert_allclose(submit_array, evaluation_array)

assert evaluation_pc.xticklabels == submission_pc.xticklabels

print("Success!")



# Ungraded Exercise
Load the file `misterious_data.csv` and use data visualization to answer the following questions:

* What does the distribution of `x` look like overall?
* Are there any outliers in any of the fields?
* Which two charts best represent the underlying data? Change their style to `bmh` and add a title to each chart explaining it.
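
As a hint for the styling step, here is a minimal sketch of applying the `bmh` style and setting a title. It uses made-up values in a hypothetical `toy` frame rather than the exercise file, and the histogram is only an example chart type, not a suggested answer.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy values, invented for illustration only
toy = pd.DataFrame({"x": [1, 2, 2, 3, 3, 3, 4, 4, 5, 12]})

# plt.style.context applies the style only inside the with-block;
# plt.style.use("bmh") would instead change it for the whole session
with plt.style.context("bmh"):
    ax = toy["x"].plot.hist(bins=6)
    ax.set_title("Distribution of x (toy data)")

styled_axis = plt.gca()
```

Using the context manager keeps the style change local, so earlier graded cells are not affected.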
