# Is Fandango Still Inflating Ratings?

In October 2015, a data journalist named Walt Hickey analyzed movie ratings data and found strong evidence to suggest that Fandango's rating system was biased and dishonest ([Fandango](https://www.fandango.com/) is an online movie ratings aggregator). He published his analysis [in this article](https://fivethirtyeight.com/features/fandango-movies-ratings/).

Fandango's officials said that this was due to a bug in their system, and that they would fix the bug.

In this project, we are going to analyze more recent movie ratings data to determine whether Fandango has fixed the bug since Walt Hickey's analysis.

In [1]:
import pandas as pd

# Hickey's original dataset and the newer 2016-2017 dataset
prev = pd.read_csv('fandango_score_comparison.csv')
after = pd.read_csv('movie_ratings_16_17.csv')

# Keep only the columns related to Fandango's ratings
fand_prev = prev[[
    'FILM', 'Fandango_Stars', 'Fandango_Ratingvalue', 'Fandango_votes', 'Fandango_Difference'
]].copy()

fand_after = after[['movie', 'year', 'fandango']].copy()

fand_after.head(10)

Unnamed: 0,movie,year,fandango
0,10 Cloverfield Lane,2016,3.5
1,13 Hours,2016,4.5
2,A Cure for Wellness,2016,3.0
3,A Dog's Purpose,2017,4.5
4,A Hologram for the King,2016,3.0
5,A Monster Calls,2016,4.0
6,A Street Cat Named Bob,2016,4.5
7,Alice Through the Looking Glass,2016,4.0
8,Allied,2016,4.0
9,Amateur Night,2016,3.5


# Population of Interest
The population we would like to describe is all of the ratings that Fandango displays on its website, regardless of when the film was released. As stated earlier, our goal is to see whether Fandango fixed the "bug" that biased its movie ratings, which made their distribution even more left-skewed and inflated the displayed ratings.

Unfortunately, the samples that we have (Hickey's original dataset and our 2016-2017 dataset) are unlikely to be representative of that population: the first was filtered by ratings and ticket sales, and the second by release date and votes. A later analysis might involve re-sampling the data, though that may be impossible since Fandango "fixed" the availability of some of the data that Hickey originally sampled in 2015. For now, we'll change the goal of our analysis to fit the data available.

Our new goal will be to compare popular movies from 2015 with popular movies from 2016 to see whether the ratings displayed on the site are still as left-skewed. Our population is therefore limited to movies that were released close to the date of Hickey's original analysis and that were selected based on engagement on the Fandango site, which is a good proxy for our initial goal. With this revised goal, the data we have is representative of the population we want to describe.

# Refining our Populations
Now we have two data sets:

- All Fandango's ratings for "popular" movies released in 2015.
- All Fandango's ratings for "popular" movies released in 2016.

For the purpose of this notebook, we'll use Hickey's benchmark of 30 fan ratings and count a movie as popular only if it has 30 fan ratings or more on Fandango's website.

Unfortunately, our second data set doesn't include the number of fan ratings. We should be skeptical once more and ask whether it is truly representative and contains popular movies (movies with 30 or more fan ratings).

To evaluate whether it is representative, we will randomly sample 10 movies from our second data set and check the number of fan ratings each one has on Fandango's website. We'll say that the data set is representative if at least 8 of the 10 randomly sampled movies meet our criterion of 30 fan ratings or more.

In [2]:
fand_after.sample(10, random_state = 1)

Unnamed: 0,movie,year,fandango
108,Mechanic: Resurrection,2016,4.0
206,Warcraft,2016,4.0
106,Max Steel,2016,3.5
107,Me Before You,2016,4.5
51,Fantastic Beasts and Where to Find Them,2016,4.5
33,Cell,2016,3.0
59,Genius,2016,3.5
152,Sully,2016,4.5
4,A Hologram for the King,2016,3.0
31,Captain America: Civil War,2016,4.5


Unfortunately, our dataset doesn't contain this information. We would prefer to check it ourselves, but the Fandango website is unclear about it. Luckily, the [Dataquest GitHub repository](https://github.com/dataquestio/solutions/blob/master/Mission288Solutions.ipynb) has a table containing the desired information as of 2018:

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg .tg-baqh{text-align:center;vertical-align:top}
.tg .tg-amwm{font-weight:bold;text-align:center;vertical-align:top}
.tg .tg-yw4l{vertical-align:top}
</style>
<table class="tg">
  <tr>
    <th class="tg-amwm">Movie</th>
    <th class="tg-amwm">Fan ratings</th>
  </tr>
  <tr>
    <td class="tg-baqh">Mechanic: Resurrection</td>
    <td class="tg-baqh">2247</td>
  </tr>
  <tr>
    <td class="tg-baqh">Warcraft</td>
    <td class="tg-baqh">7271</td>
  </tr>
  <tr>
    <td class="tg-baqh">Max Steel</td>
    <td class="tg-baqh">493</td>
  </tr>
  <tr>
    <td class="tg-baqh">Me Before You</td>
    <td class="tg-baqh">5263</td>
  </tr>
  <tr>
    <td class="tg-baqh">Fantastic Beasts and Where to Find Them</td>
    <td class="tg-baqh">13400</td>
  </tr>
  <tr>
    <td class="tg-yw4l">Cell</td>
    <td class="tg-yw4l">17</td>
  </tr>
  <tr>
    <td class="tg-yw4l">Genius</td>
    <td class="tg-yw4l">127</td>
  </tr>
  <tr>
    <td class="tg-yw4l">Sully</td>
    <td class="tg-yw4l">11877</td>
  </tr>
  <tr>
    <td class="tg-yw4l">A Hologram for the King</td>
    <td class="tg-yw4l">500</td>
  </tr>
  <tr>
    <td class="tg-yw4l">Captain America: Civil War</td>
    <td class="tg-yw4l">35057</td>
  </tr>
</table>

In fact, 9 of the 10 randomly sampled movies meet our criterion. Only Cell, with 17 fan ratings, falls short. That is above our threshold of 8, so we will treat the data set as representative.
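The check against the table above can also be written out explicitly. The following is a minimal sketch in plain Python: the fan-rating counts are copied from the table, while the names `fan_ratings`, `MIN_FAN_RATINGS`, and `MIN_POPULAR` are introduced here only for illustration.

```python
# Fan-rating counts copied from the table above (as of 2018).
fan_ratings = {
    "Mechanic: Resurrection": 2247,
    "Warcraft": 7271,
    "Max Steel": 493,
    "Me Before You": 5263,
    "Fantastic Beasts and Where to Find Them": 13400,
    "Cell": 17,
    "Genius": 127,
    "Sully": 11877,
    "A Hologram for the King": 500,
    "Captain America: Civil War": 35057,
}

MIN_FAN_RATINGS = 30  # Hickey's popularity benchmark
MIN_POPULAR = 8       # our threshold: at least 8 of the 10 sampled movies

# Split the sample into movies that meet the benchmark and movies that do not.
popular = [title for title, votes in fan_ratings.items() if votes >= MIN_FAN_RATINGS]
not_popular = [title for title in fan_ratings if title not in popular]

is_representative = len(popular) >= MIN_POPULAR
print(len(popular), "of", len(fan_ratings), "sampled movies are popular")
print("Below the benchmark:", not_popular)
print("Representative enough:", is_representative)
```

Keeping the two thresholds as named constants makes it easy to see how the conclusion would change under a stricter benchmark.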

Next, we'll double-check that Hickey's old dataset follows his own criterion of at least 30 fan ratings per movie.

In [3]:
# Count the movies with fewer than 30 fan ratings; 0 means every movie meets the benchmark
(fand_prev['Fandango_votes'] < 30).sum()

0

Since we changed our goal so that we are only looking at movies released in 2015 and 2016, we need to remove from our datasets any movies that fall outside this range.

First, we'll start with Hickey's old dataset. There isn't a column that only contains the release year, but the `FILM` column contains the release year in a predictable location.

In [4]:
# Checking the dataframe to see how the string in the 'FILM' column is formatted
fand_prev.head(2)

Unnamed: 0,FILM,Fandango_Stars,Fandango_Ratingvalue,Fandango_votes,Fandango_Difference
0,Avengers: Age of Ultron (2015),5.0,4.5,14846,0.5
1,Cinderella (2015),5.0,4.5,12640,0.5


In [5]:
# Since the year is always in the same place, we can extract it with a slice
fand_prev['Year'] = fand_prev['FILM'].str[-5:-1]
fand_prev.head(2)

Unnamed: 0,FILM,Fandango_Stars,Fandango_Ratingvalue,Fandango_votes,Fandango_Difference,Year
0,Avengers: Age of Ultron (2015),5.0,4.5,14846,0.5,2015
1,Cinderella (2015),5.0,4.5,12640,0.5,2015
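The slice works as long as every title ends with a four-digit year in parentheses. As a more defensive alternative (not what this notebook uses), the year could be pulled out with a regular expression, so that titles that do not fit the pattern show up as missing values instead of being silently mislabeled. This is a sketch on a small illustrative frame: the first two titles come from the output above, and the third is a hypothetical malformed title added to show the failure case.

```python
import pandas as pd

# Two titles from the output above plus one hypothetical malformed title.
films = pd.DataFrame({
    "FILM": [
        "Avengers: Age of Ultron (2015)",
        "Cinderella (2015)",
        "Hypothetical Title Without Year",
    ]
})

# Capture four digits inside parentheses at the very end of the string.
# expand=False returns a Series; rows that do not match become NaN.
films["Year"] = films["FILM"].str.extract(r"\((\d{4})\)\s*$", expand=False)

# Rows where the pattern failed to match.
unmatched = films[films["Year"].isna()]
print(films)
print(len(unmatched), "title(s) without a trailing year")
```

As with the slice, the extracted year is a string, so a later filter would still compare against `'2015'` and not the integer `2015`.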


Let's make sure we aren't throwing away too much data. If 2014 accounts for more than 50% of the rows in our frequency table, we might reconsider our approach.

In [6]:
fand_prev['Year'].value_counts()

2015    129
2014     17
Name: Year, dtype: int64
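To compare these counts with the 50% rule directly, the frequency table can be expressed as proportions. The sketch below rebuilds a stand-in `Year` column with the same counts as the output above, so that it runs without the CSV file; in the notebook itself, `fand_prev['Year']` would take its place. The names `years`, `shares`, and `too_much_lost` are illustrative.

```python
import pandas as pd

# Rebuild a Year column with the same counts as the frequency table above.
years = pd.Series(["2015"] * 129 + ["2014"] * 17, name="Year")

# normalize=True returns each value's share of the total instead of a count.
shares = years.value_counts(normalize=True)
print((shares * 100).round(1))

# Our stated rule: reconsider the approach if 2014 exceeds 50% of the rows.
too_much_lost = shares["2014"] > 0.5
print("Reconsider approach:", too_much_lost)
```

With these counts, 2014 accounts for about 11.6% of the rows (17 of 146), well under the 50% limit, so dropping those movies is reasonable.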

In [7]:
fand_2015 = fand_prev[fand_prev['Year'] == '2015'].copy()
fand_2015['Year'].value_counts()

2015    129
Name: Year, dtype: int64

In [8]:
# We'll repeat the process for our 2016 dataframe
fand_after.head(2)

Unnamed: 0,movie,year,fandango
0,10 Cloverfield Lane,2016,3.5
1,13 Hours,2016,4.5


In [9]:
fand_after['year'].value_counts()

2016    191
2017     23
Name: year, dtype: int64

In [10]:
fand_2016 = fand_after[fand_after['year'] == 2016].copy()
fand_2016['year'].value_counts()

2016    191
Name: year, dtype: int64