# Are Fandango Ratings Still Inflated?
Walt Hickey discovered that the movie ratings displayed on Fandango were inflated compared to the actual values. This was due to the fact that the actual ratings were rounded up, typically to the nearest half star. This skewed the overall rating distribution. The original article can be found [here](https://fivethirtyeight.com/features/fandango-movies-ratings/).
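
To make the inflation mechanism concrete, here is a minimal sketch of rounding a rating up to a fixed step. The helper `round_up` and its `step` parameter are hypothetical names introduced for illustration; they are not taken from Fandango's site or from Hickey's article.

```python
import math

def round_up(rating, step):
    """Round a rating UP to the next multiple of `step` (hypothetical helper)."""
    return math.ceil(rating / step) * step

# A 4.1 average is displayed as 4.5 when rounded up to the next half star ...
half_star = round_up(4.1, 0.5)
# ... and as 5.0 when rounded up to the next whole star.
whole_star = round_up(4.1, 1.0)
# A rating that already sits on a half-star step is left unchanged.
unchanged = round_up(4.5, 0.5)

print(half_star, whole_star, unchanged)
```

Ordinary rounding would display 4.1 as 4.0, so always rounding up pushes the displayed distribution toward higher values.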

We are here to find out if the situation has changed.

In [1]:
import numpy as np
import pandas as pd

# the data before Hickey's analysis
before = pd.read_csv('fandango_score_comparison.csv') 

# the data at some point after Hickey's analysis, collected by a Dataquest student
after = pd.read_csv('movie_ratings_16_17.csv') 

In [2]:
# select relevant columns for this analysis
# bh: before Hickey
bh = before[['FILM', 'Fandango_Stars', 'Fandango_Ratingvalue', 'Fandango_votes', 'Fandango_Difference']]

# ah: after Hickey
ah = after[['movie', 'year', 'fandango']]

In [4]:
ah.head()

| | movie | year | fandango |
|---|---|---|---|
| 0 | 10 Cloverfield Lane | 2016 | 3.5 |
| 1 | 13 Hours | 2016 | 4.5 |
| 2 | A Cure for Wellness | 2016 | 3.0 |
| 3 | A Dog's Purpose | 2017 | 4.5 |
| 4 | A Hologram for the King | 2016 | 3.0 |


We'll use the movies in our 'before' dataset as the basis of our analysis.

# Task:

By reading the README.md files of the two repositories, figure out whether the two samples are representative of the population we're trying to describe.

* Determine whether the sampling is random or not: did all the movies have an equal chance to be included in the two samples?
* Useful information can also be found in Hickey's [article](https://fivethirtyeight.com/features/fandango-movies-ratings/).
* You can access the two README.md files directly using [this link](https://github.com/fivethirtyeight/data/blob/master/fandango/README.md) and [this link](https://github.com/mircealex/Movie_ratings_2016_17/blob/master/README.md).

## Task:
Slightly change the current goal of our analysis so that:

- The population of interest changes and the samples we currently work with become representative.
- The new goal is still a fairly good proxy for our initial goal, which was to determine whether there has been any change in Fandango's rating system after Hickey's analysis.

* Check if both samples contain popular movies — that is, check whether all (or at least most) sample points are movies with over 30 fan ratings on Fandango's website.
* One of the data sets doesn't provide information about the number of fan ratings, and this raises representativity issues once again.
    * Find a quick way to check whether this sample contains enough popular movies to be representative.
    * If you get stuck here, you can always sneak a look at the solution notebook.
* If you explore the data sets enough, you'll notice that some movies were not released in 2015 and 2016. We need to isolate only the sample points that belong to our populations of interest.
    * Isolate the movies released in 2015 in a separate data set.
    * Isolate the movies released in 2016 in another separate data set.
    * These are the data sets we'll use next to perform our analysis.
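
The popularity check and the year filtering above could be sketched as follows. The two small data frames are toy stand-ins with made-up rows, not the real data; the column names match the ones selected earlier in this notebook. The sketch also assumes that each `FILM` title ends with the release year in parentheses, which should be verified against the actual file.

```python
import pandas as pd

# Toy stand-ins for the two samples (illustrative values only, not real data)
bh = pd.DataFrame({
    'FILM': ['Movie A (2015)', 'Movie B (2014)', 'Movie C (2015)'],
    'Fandango_Stars': [4.5, 4.0, 5.0],
    'Fandango_votes': [1200, 35, 410],
})
ah = pd.DataFrame({
    'movie': ['Movie D', 'Movie E', 'Movie F'],
    'year': [2016, 2017, 2016],
    'fandango': [3.5, 4.5, 4.0],
})

# Share of 'before' movies with over 30 fan ratings
share_popular = (bh['Fandango_votes'] > 30).mean()

# The 'after' sample has no vote counts, so one quick option is to draw a
# small random sample and look up the number of fan ratings by hand
to_check = ah.sample(2, random_state=1)

# ASSUMPTION: FILM ends with "(YYYY)"; extract that year into its own column
bh = bh.assign(
    year=bh['FILM'].str.extract(r'\((\d{4})\)\s*$', expand=False).astype(int)
)

# Keep only the sample points that belong to each population of interest
fandango_2015 = bh[bh['year'] == 2015]
fandango_2016 = ah[ah['year'] == 2016]

print(share_popular)
print(fandango_2015['FILM'].tolist(), fandango_2016['movie'].tolist())
```

Sampling a handful of titles for a manual check is only one way to approach the missing vote counts; a larger sample gives a more reliable estimate of how many movies are popular.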

* Generate two kernel density plots on the same figure for the distribution of movie ratings of each sample. Customize the graph such that:

    * It has a title with an increased font size.
    * It has labels for both the x and y-axis.
    * It has a legend which explains which distribution is for 2015 and which is for 2016.
    * The x-axis starts at 0 and ends at 5 because movie ratings on Fandango start at 0 and end at 5.
    * The tick labels of the x-axis are: [0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0].
    * It has the fivethirtyeight style (this is optional). You can change to this style by using `plt.style.use('fivethirtyeight')`. This line of code must be placed before the code that generates the kernel density plots.
* Analyze the two kernel density plots. Try to answer the following questions:

    * What is the shape of each distribution?
    * How do their shapes compare?
    * If their shapes are similar, is there anything that clearly differentiates them?
    * Can we see any evidence on the graph that suggests that there is indeed a change between Fandango's ratings for popular movies in 2015 and Fandango's ratings for popular movies in 2016?
    * Provided there's a difference, can we tell anything about the direction of the difference? In other words, were movies in 2016 rated lower or higher compared to 2015?
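
A possible way to build the figure described above is sketched below. The two series hold made-up ratings for illustration only; in the notebook they would be the Fandango rating columns of the 2015 and 2016 data sets. Note that `Series.plot.kde` relies on SciPy being installed, and the title text is a placeholder.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up ratings for illustration only (not real data)
ratings_2015 = pd.Series([3.5, 4.0, 4.5, 4.5, 5.0, 4.0, 4.5])
ratings_2016 = pd.Series([3.0, 3.5, 4.0, 4.0, 4.5, 3.5, 4.0])

# The style must be set before the plots are generated
plt.style.use('fivethirtyeight')

fig, ax = plt.subplots(figsize=(8, 5))

# Draw both kernel density estimates on the same axes
ratings_2015.plot.kde(ax=ax, label='2015', legend=True)
ratings_2016.plot.kde(ax=ax, label='2016', legend=True)

# Title with an increased font size, plus labels for both axes
ax.set_title("Distribution of Fandango ratings (2015 vs 2016)", fontsize=18)
ax.set_xlabel('Stars')
ax.set_ylabel('Density')

# Ratings run from 0 to 5 in half-star steps
ax.set_xlim(0, 5)
ax.set_xticks(np.arange(0, 5.1, 0.5))
ax.legend()
```

In a notebook the figure is displayed automatically at the end of the cell; in a script, `plt.show()` or `fig.savefig(...)` would be needed.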

* Examine the frequency distribution tables of the two distributions.

    * The samples have different numbers of movies. Does it make sense to compare the two tables using absolute frequencies?
    * If absolute frequencies are not useful here, would relative frequencies be of more help? If so, what would be better for readability — proportions or percentages?
* Analyze the two tables and try to answer the following questions:

    * Is it still clear that there is a difference between the two distributions?
    * What can you tell about the direction of the difference just from the tables? Is the direction still as clear as it was on the graph?
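
Because the samples differ in size, one option is to compare relative frequencies expressed as percentages. The sketch below uses made-up ratings (not real data) and `Series.value_counts(normalize=True)`; the column labels of the combined table are arbitrary.

```python
import pandas as pd

# Made-up ratings for illustration only; note the different sample sizes
ratings_2015 = pd.Series([3.5, 4.0, 4.5, 4.5, 5.0, 4.0, 4.5, 4.5])
ratings_2016 = pd.Series([3.0, 3.5, 4.0, 4.0, 4.5, 3.5, 4.0, 4.0, 4.0, 5.0])

# Relative frequencies as percentages, ordered by star rating
pct_2015 = ratings_2015.value_counts(normalize=True).sort_index() * 100
pct_2016 = ratings_2016.value_counts(normalize=True).sort_index() * 100

# Side-by-side table; ratings missing from one sample get 0%
freq_table = pd.DataFrame({'2015 (%)': pct_2015, '2016 (%)': pct_2016})
freq_table = freq_table.fillna(0).round(1)

print(freq_table)
```

Percentages tend to be easier to read than proportions here, since the values can be compared directly across the two columns.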

* Compute the mean, median, and mode for each distribution.
* Compare these metrics and determine what they tell about the direction of the difference.
* What's the magnitude of the difference? Is there a big difference or just a slight difference?
* Generate a grouped bar plot to show comparatively how the mean, median, and mode varied for 2015 and 2016. You should arrive at a graph that looks similar (not necessarily identical) to the reference picture, which is stored on local disk and not included here.
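
The summary statistics and the grouped bar plot could be produced along these lines. The ratings are made-up values for illustration only, and taking the first entry of `Series.mode()` is a simplification that ignores ties.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up ratings for illustration only (not real data)
ratings_2015 = pd.Series([3.5, 4.0, 4.5, 4.5, 5.0, 4.0, 4.5, 4.5])
ratings_2016 = pd.Series([3.0, 3.5, 4.0, 4.0, 4.5, 3.5, 4.0, 4.0, 4.0, 5.0])

# One row per statistic, one column per year
summary = pd.DataFrame(
    {
        '2015': [ratings_2015.mean(), ratings_2015.median(), ratings_2015.mode()[0]],
        '2016': [ratings_2016.mean(), ratings_2016.median(), ratings_2016.mode()[0]],
    },
    index=['mean', 'median', 'mode'],
)

# Relative change of the mean from 2015 to 2016, as a fraction
mean_change = (summary.loc['mean', '2016'] - summary.loc['mean', '2015']) / summary.loc['mean', '2015']

# Grouped bars: statistics on the x-axis, one bar per year in each group
ax = summary.plot.bar(rot=0, figsize=(8, 5))
ax.set_title('Summary statistics: 2015 vs 2016', fontsize=18)
ax.set_ylabel('Stars')
ax.set_ylim(0, 5.5)

print(summary)
```

Putting the statistics in the index and the years in the columns is what makes `DataFrame.plot.bar` group the bars by statistic.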

# Conclusions

These are a few next steps to consider:

* Customize your graphs more by reproducing almost completely the FiveThirtyEight style. You can take a look at this tutorial if you want to do that.
* Improve your project from a stylistic point of view by following the guidelines discussed in this style guide.
* Use the two samples to compare the ratings of different movie ratings aggregators and recommend the best website to check for a movie rating. There are many approaches you can take here; you can take some inspiration from this article.
* Collect recent movie ratings data and formulate your own research questions. You can take a look at this blog post to learn how to scrape movie ratings from IMDb and Metacritic.