<h1><center>Final Project Demonstration</center></h1>
<center><h2>Rotten Tomatoes</h2></center>

This is a final demonstration of the outputs of the Rotten Tomatoes research project. The research questions we set out to answer were: 
1. Historically, how well have Rotten Tomatoes critic scores correlated with "Best Picture" Oscar wins?
2. Historically, are Rotten Tomatoes ratings good predictors of wins in any category at the Oscars?

Our goal was to build a research repository that makes our findings simple to reproduce. The repository is organized into runner scripts, each of which executes one task, and partner "utility" scripts that contain the functions those runners call. This design let us easily test the main components of our pipeline.

## 1. Encapsulating data download

Our first task was ensuring that our datasets would download reliably, so future users could reproduce our analysis. Since we were mostly using Kaggle data, we wrote several functions to encapsulate the API calls and manage credentials. 
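
As a rough illustration of what such a wrapper could look like (not the project's actual code), the sketch below checks for credentials in the two places the official Kaggle client looks, the `KAGGLE_USERNAME`/`KAGGLE_KEY` environment variables and a `kaggle.json` file, before attempting a download. The function names `find_kaggle_credentials` and `download_dataset` are hypothetical.

```python
import json
import os
from pathlib import Path


def find_kaggle_credentials(env=None, config_dir=None):
    """Return (username, key) from env vars or kaggle.json, else None."""
    env = os.environ if env is None else env
    # Environment variables take priority over the token file.
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return env["KAGGLE_USERNAME"], env["KAGGLE_KEY"]
    # Fall back to kaggle.json, which lives in ~/.kaggle by default.
    config_dir = Path(config_dir) if config_dir else Path.home() / ".kaggle"
    token_file = config_dir / "kaggle.json"
    if token_file.is_file():
        token = json.loads(token_file.read_text())
        return token["username"], token["key"]
    return None


def download_dataset(dataset_slug, target_dir):
    """Download and unzip one Kaggle dataset into target_dir (hypothetical helper)."""
    if find_kaggle_credentials() is None:
        raise RuntimeError(
            "No Kaggle credentials found: set KAGGLE_USERNAME and KAGGLE_KEY "
            "or place kaggle.json in ~/.kaggle"
        )
    # Imported lazily: some versions of the kaggle package try to
    # authenticate as soon as the package is imported.
    from kaggle.api.kaggle_api_extended import KaggleApi

    api = KaggleApi()
    api.authenticate()
    Path(target_dir).mkdir(parents=True, exist_ok=True)
    api.dataset_download_files(dataset_slug, path=str(target_dir), unzip=True)
    return Path(target_dir)
```

A runner script could then call `download_dataset("owner/dataset-name", "data/raw")`, where the slug and the target path are placeholders. Failing early with a clear message when credentials are missing is the main point of the wrapper.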

### <center><em>Code Demo</em></center>

## 2. Data Cleaning
One of the important artifacts of this project is our set of data cleaning classes. They add guardrails against common data issues, such as null values, unexpected columns, or missing columns. Each cleaner inherits from a DataCleaner base class, which centralizes the checks that are common to all datasets.

In [2]:
from rotten_tomatoes.utils.data_cleaning import (
    DataCleaner, 
    CriticsDataCleaner, 
    MoviesDataCleaner,
    OscarsDataCleaner, 
    BestPictureOscarsDataCleaner,
    AnyWinOscarsDataCleaner
)

The four data cleaners that users call directly are CriticsDataCleaner, MoviesDataCleaner, BestPictureOscarsDataCleaner, and AnyWinOscarsDataCleaner. All four inherit from the DataCleaner base class, and the "best picture" and "any win" classes also inherit from OscarsDataCleaner.

This class inheritance structure allowed us to reduce duplicate code. Examples of shared functionality include reading in a CSV, subsetting to a list of columns, and checking for nulls.
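
To make the idea concrete, here is a minimal sketch of how a base class can hold the shared steps while a subclass only declares what differs. It is an illustration under assumptions, not the repository's implementation: `SketchCleaner`, `SketchMoviesCleaner`, and the column names `title` and `critic_score` are all hypothetical.

```python
import io

import pandas as pd


class SketchCleaner:
    """Hypothetical base class holding the shared loading and validation steps."""

    required_cols = []  # each subclass lists the columns it needs

    def __init__(self, source):
        # Any path or file-like object that pd.read_csv accepts.
        self.source = source

    def read(self):
        return pd.read_csv(self.source)

    def check_columns(self, df):
        missing = [c for c in self.required_cols if c not in df.columns]
        if missing:
            raise ValueError(f"Missing columns: {missing}")
        # Subsetting also drops any unexpected columns.
        return df[self.required_cols]

    def check_nulls(self, df):
        null_counts = df.isna().sum()
        bad = null_counts[null_counts > 0]
        if not bad.empty:
            raise ValueError(f"Null values found in: {list(bad.index)}")
        return df

    def clean(self):
        df = self.read()
        df = self.check_columns(df)
        return self.check_nulls(df)


class SketchMoviesCleaner(SketchCleaner):
    """Hypothetical subclass: only the column list changes."""

    required_cols = ["title", "critic_score"]


# Usage with an in-memory CSV that carries one unexpected column.
raw = io.StringIO("title,critic_score,extra\nA,90,x\nB,75,y\n")
cleaned = SketchMoviesCleaner(raw).clean()
print(list(cleaned.columns))
```

Because every check lives in the base class, a new dataset only needs a new subclass with its own column list.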

In [5]:
issubclass(AnyWinOscarsDataCleaner, DataCleaner)

True

In [6]:
issubclass(CriticsDataCleaner, DataCleaner)

True

In [7]:
issubclass(AnyWinOscarsDataCleaner, OscarsDataCleaner)

True

In [8]:
issubclass(MoviesDataCleaner, OscarsDataCleaner)

False

## 3. Regression Helper Classes
To run our regressions, we built several helper classes that encapsulate calls to sklearn. They reduce each analysis to a handful of method calls and let us standardize our results across multiple research questions. The cells below demonstrate RegressionAnalysis on the iris dataset.

In [10]:
from rotten_tomatoes.utils.regression import (
    RegressionAnalysis,
    CorrelationAnalysis, 
    plot_linear_fit
)

In [38]:
# Import the iris dataset from sklearn
from sklearn import datasets
import pandas as pd

iris = datasets.load_iris()
X = pd.DataFrame(iris.data)
X.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
y = iris.target

In [36]:
X.head()

| | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [40]:
y = pd.DataFrame(y, columns=['target'])
y.head()

| | target |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |


In [41]:
# Create regression analysis object 
analysis = RegressionAnalysis(X, y, is_categorical=True)

In [44]:
# The constructor automatically splits the data into train and test sets.
# The default test fraction is 0.25, but you can change it with the "test_size" argument
analysis.X_train_.shape

(112, 4)

In [45]:
# Set the columns you want to use as inputs to your regression 
analysis.set_X_cols(['sepal_length', 'sepal_width'])

In [46]:
# Train the model 
analysis.fit_train()

  y = column_or_1d(y, warn=True)


In [47]:
# Review the test accuracy 
analysis.score_test()

0.868421052631579

In [48]:
# View the test set predictions
analysis.predict_test()

array([1, 2, 2, 1, 0, 1, 1, 0, 0, 1, 2, 0, 1, 2, 2, 2, 0, 0, 1, 0, 0, 1,
       0, 1, 0, 0, 0, 1, 2, 0, 2, 1, 0, 0, 1, 1, 2, 0])
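
The calls demonstrated above suggest a thin wrapper around scikit-learn. The sketch below mirrors that interface to show how such a helper could be put together. It is an assumption-laden illustration, not the project's code: the class name `SketchRegression`, the choice of `LogisticRegression` when `is_categorical=True`, and the fixed `random_state` are all guesses. It also flattens `y` with `np.ravel`, since the `column_or_1d` line in the training output above looks like the tail of the warning scikit-learn emits when `y` is passed as a column vector.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split


class SketchRegression:
    """Hypothetical stand-in that mirrors the interface demonstrated above."""

    def __init__(self, X, y, is_categorical=False, test_size=0.25, random_state=0):
        # Flatten y to 1-D so scikit-learn receives the shape it expects.
        y = np.ravel(y)
        (self.X_train_, self.X_test_,
         self.y_train_, self.y_test_) = train_test_split(
            X, y, test_size=test_size, random_state=random_state
        )
        # Assumption: a categorical target gets a classifier, otherwise OLS.
        if is_categorical:
            self.model_ = LogisticRegression(max_iter=1000)
        else:
            self.model_ = LinearRegression()
        self.X_cols_ = list(X.columns)  # default: use every column

    def set_X_cols(self, cols):
        self.X_cols_ = list(cols)

    def fit_train(self):
        self.model_.fit(self.X_train_[self.X_cols_], self.y_train_)
        return self

    def score_test(self):
        # Accuracy for the classifier, R^2 for the linear model.
        return self.model_.score(self.X_test_[self.X_cols_], self.y_test_)

    def predict_test(self):
        return self.model_.predict(self.X_test_[self.X_cols_])


iris = load_iris()
X = pd.DataFrame(
    iris.data,
    columns=["sepal_length", "sepal_width", "petal_length", "petal_width"],
)
sketch = SketchRegression(X, iris.target, is_categorical=True)
sketch.set_X_cols(["sepal_length", "sepal_width"])
sketch.fit_train()
print(sketch.X_train_.shape, sketch.score_test())
```

Because the split happens once in the constructor, every later call to `fit_train`, `score_test`, or `predict_test` works on the same train and test rows, which keeps results comparable when the input columns change.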