# Multivariate analysis: checking the relationship between more than one variable
##### We want to find out which variables contribute to the success of a movie.
We will test how strongly budget and rating (score) are related to gross, first on the whole dataset, then on Animation movies only, and finally on the Disney princess subset.

In [20]:
# Import the libraries we are going to use
import pandas as pd
import numpy as np
import statsmodels.api as sm
import re

In [28]:
movie = pd.read_csv('../Database/Clean/movie_clean.csv', sep = ',', index_col = 0)
movie.head()

| Unnamed: 0 | budget | company | genre | gross | name | score | votes | year |
|---|---|---|---|---|---|---|---|---|
| 0 | 8000000.0 | Columbia Pictures Corporation | Adventure | 52287414.0 | Stand by Me | 8.1 | 299174 | 1986 |
| 1 | 6000000.0 | Paramount Pictures | Comedy | 70136369.0 | Ferris Bueller's Day Off | 7.8 | 264740 | 1986 |
| 2 | 15000000.0 | Paramount Pictures | Action | 179800601.0 | Top Gun | 6.9 | 236909 | 1986 |
| 3 | 18500000.0 | Twentieth Century Fox Film Corporation | Action | 85160248.0 | Aliens | 8.4 | 540152 | 1986 |
| 4 | 9000000.0 | Walt Disney Pictures | Adventure | 18564613.0 | Flight of the Navigator | 6.9 | 36636 | 1986 |


We create a new DataFrame containing only the columns we will use, so that unused columns do not cause problems later when fitting the OLS regression model.

In [29]:
movie1 = movie[['budget', 'company', 'genre', 'gross', 'name', 'score', 'year']]

We fit an ordinary least squares (OLS) regression of gross on budget and score.

In [36]:
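parts = ['budget', 'score']          # predictor columns
all_parts = movie1.columns           # all available columns (not used in the fit)
Y = movie1.gross                     # outcome variable
X = sm.add_constant(movie1[parts])   # predictors plus an intercept term

model = sm.OLS(Y, X)

results = model.fit()
print(results.summary())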

                            OLS Regression Results                            
Dep. Variable:                  gross   R-squared:                       0.525
Model:                            OLS   Adj. R-squared:                  0.525
Method:                 Least Squares   F-statistic:                     3768.
Date:                Mon, 25 Nov 2019   Prob (F-statistic):               0.00
Time:                        17:14:24   Log-Likelihood:            -1.2883e+05
No. Observations:                6807   AIC:                         2.577e+05
Df Residuals:                    6804   BIC:                         2.577e+05
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const      -4.413e+07   3.13e+06    -14.083      0.0

###### Even though our model explains only about 52% of the variance in gross (R-squared = 0.525), we can see a positive relationship of both budget and score with gross.
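To make the R-squared figure concrete, here is a minimal sketch of how it can be computed by hand as one minus the ratio of the residual to the total sum of squares. The data is synthetic and every name in it (`budget`, `score`, `gross`, `r_squared`) is an illustrative assumption, not the movie dataset. The fit uses `numpy.linalg.lstsq` instead of statsmodels so that the calculation is visible.

```python
import numpy as np

# Synthetic stand-in data (an assumption for illustration, not the movie dataset).
rng = np.random.default_rng(0)
n = 200
budget = rng.uniform(1e6, 1e8, size=n)
score = rng.uniform(3.0, 9.0, size=n)
noise = rng.normal(0.0, 5e7, size=n)
gross = -4e7 + 1.5 * budget + 1e7 * score + noise

# Design matrix with an intercept column, analogous to sm.add_constant.
X = np.column_stack([np.ones(n), budget, score])

# Least-squares coefficients.
beta, *_ = np.linalg.lstsq(X, gross, rcond=None)

# R-squared = 1 - (residual sum of squares / total sum of squares).
residuals = gross - X @ beta
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((gross - gross.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(f"R-squared: {r_squared:.3f}")
```

An R-squared of 0.525 therefore means the residual sum of squares is a little under half of the total variation in gross around its mean.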

Let's now focus on Animation movies.

In [34]:
movie2 = movie1[movie1['genre']=='Animation']

In [37]:
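parts = ['budget', 'score']          # predictor columns
all_parts = movie2.columns           # all available columns (not used in the fit)
Y = movie2.gross                     # outcome variable
X = sm.add_constant(movie2[parts])   # predictors plus an intercept term

model = sm.OLS(Y, X)

results = model.fit()
print(results.summary())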

                            OLS Regression Results                            
Dep. Variable:                  gross   R-squared:                       0.500
Model:                            OLS   Adj. R-squared:                  0.497
Method:                 Least Squares   F-statistic:                     137.1
Date:                Mon, 25 Nov 2019   Prob (F-statistic):           5.49e-42
Time:                        17:20:13   Log-Likelihood:                -5390.5
No. Observations:                 277   AIC:                         1.079e+04
Df Residuals:                     274   BIC:                         1.080e+04
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const      -1.092e+08   3.05e+07     -3.578      0.0

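###### For Animation movies the model explains about 50% of the variance in gross (R-squared = 0.500), slightly less than the 52.5% for the whole dataset, so by this measure the relationship of budget and score with gross is comparable rather than stronger.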
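Because budget is measured in dollars and score on a scale of roughly 1 to 10, the raw coefficients of the two predictors are not directly comparable. One common way to compare their relative strength is to z-score all variables before fitting, which yields standardized coefficients. The sketch below illustrates the idea on synthetic data; the function name `standardized_coefficients` and the generated values are hypothetical and are not taken from the notebook.

```python
import numpy as np

def standardized_coefficients(X, y):
    """Fit OLS on z-scored predictors and outcome; return one beta per predictor."""
    Xz = (X - X.mean(axis=0)) / X.std(axis=0)
    yz = (y - y.mean()) / y.std()
    # No intercept column is needed: every z-scored variable has mean zero.
    beta, *_ = np.linalg.lstsq(Xz, yz, rcond=None)
    return beta

# Synthetic stand-in data (an assumption for illustration, not the movie dataset).
rng = np.random.default_rng(1)
n = 300
budget = rng.uniform(1e6, 1e8, size=n)
score = rng.uniform(3.0, 9.0, size=n)
gross = 2.0 * budget + 5e6 * score + rng.normal(0.0, 3e7, size=n)

betas = standardized_coefficients(np.column_stack([budget, score]), gross)
for name, b in zip(["budget", "score"], betas):
    print(f"{name}: {b:.3f}")
```

Each standardized coefficient reads as the change in gross, in standard deviations, associated with a one standard deviation change in that predictor, holding the other fixed.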

In [52]:
movie_princess = pd.read_csv('../Database/Clean/movies_90_all_and_princess.csv', sep = ';', index_col = 0)

In [53]:
movie_princess2 = movie_princess.query('Disney_princess == "Princess"')

In [54]:
movie_princess2 = movie_princess2[['budget', 'company', 'genre', 'gross', 'name', 'score', 'year']]
movie_princess2.isnull().sum()

budget     0
company    0
genre      0
gross      0
name       0
score      0
year       0
dtype: int64

In [55]:
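parts = ['budget', 'score']                  # predictor columns
all_parts = movie_princess2.columns          # all available columns (not used in the fit)
Y = movie_princess2.gross                    # outcome variable
X = sm.add_constant(movie_princess2[parts])  # predictors plus an intercept term

model = sm.OLS(Y, X)

results = model.fit()
print(results.summary())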

                            OLS Regression Results                            
Dep. Variable:                  gross   R-squared:                       0.239
Model:                            OLS   Adj. R-squared:                 -0.015
Method:                 Least Squares   F-statistic:                    0.9421
Date:                Tue, 26 Nov 2019   Prob (F-statistic):              0.441
Time:                        17:13:11   Log-Likelihood:                -176.00
No. Observations:                   9   AIC:                             358.0
Df Residuals:                       6   BIC:                             358.6
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       -4.54e+08   5.53e+08     -0.821      0.4
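With only 9 observations, the summary above reports an adjusted R-squared of -0.015 and an F-test p-value of 0.441, which suggests that this fit cannot be distinguished from a model with no predictors at all. As a small worked example, the sketch below reproduces the adjusted R-squared from the reported R-squared, the number of observations, and the number of predictors. The helper name `adjusted_r_squared` is hypothetical.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-squared: penalizes R-squared for the number of predictors."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Values read from the summary above: R-squared 0.239, 9 observations, 2 predictors.
adj = adjusted_r_squared(0.239, n_obs=9, n_predictors=2)
print(f"{adj:.3f}")  # -0.015
```

The penalty factor here is 8/6, so even an R-squared of 0.239 drops below zero once the two predictors are accounted for.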
