# Student and Problem Set Info
---


## Title: MGSC 310: Problem Set 1

Author:

Ben Labaschin, King of the Notebooks, Destroyer of Worlds.


# Libraries

In [2]:
from os import environ
from google.colab import drive

# Setup

In [3]:
# Ensure you run this cell or otherwise connect to your Google Drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


## Question 1, Training and Testing Datasets

a) Load the IMDB_movies dataset. Use pandas to create columns `grossM` and `budgetM` that are gross and budget, respectively, in units of 1 million.

Also, separately, create two new columns: `log_gross` and `log_budget` that are equal to the natural log of `gross` and `budget`.

Store ([copy](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.copy.html)) the new DataFrame as `movies_clean`.

**Hint: import the `log` function from the `numpy` library**

In [4]:
from pandas import read_csv
from numpy import log

file_path = "/content/drive/MyDrive/Work/Chapman/MGSC_310/MGSC_310_shared_files_and_resources/Data/IMDB_movies.csv"

movies = read_csv(file_path)
movies['budgetM'] = movies['budget'] / 1_000_000
movies['grossM'] = movies['gross'] / 1_000_000
movies['log_gross'] = log(movies['gross'])
movies['log_budget'] = log(movies['budget'])
movies_clean = movies.copy()
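
A minimal sketch of the same transformations on a toy DataFrame. The three rows below are invented for illustration and are not taken from IMDB_movies.csv. It also shows one thing that may be worth checking on the real data: `log` of a zero returns `-inf` (with a warning) rather than raising an error.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({"budget": [1_000_000, 25_000_000, 0],
                    "gross": [3_000_000, 10_000_000, 500_000]})

# Rescale dollars to millions of dollars
toy["budgetM"] = toy["budget"] / 1_000_000
toy["grossM"] = toy["gross"] / 1_000_000

# Silence the divide-by-zero warning that log(0) would otherwise emit
with np.errstate(divide="ignore"):
    toy["log_budget"] = np.log(toy["budget"])
    toy["log_gross"] = np.log(toy["gross"])

# .copy() gives an independent DataFrame, so later edits do not touch `toy`
toy_clean = toy.copy()

# Count rows where the log is infinite (here, the zero-budget row)
n_infinite = int(np.isinf(toy_clean["log_budget"]).sum())
print(toy_clean)
print("rows with log_budget == -inf:", n_infinite)
```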


b) Use pandas to create a new column called `rating_simple`, using `content_rating` as its source, that explicitly lists the four most common values and the rest are given the category "Other". Add this new column to the `movies_clean` DataFrame.

Strategy if it were me: To get the most common values, use the `.value_counts()` function, more [here](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html). Grab the four most common categories and assign them to a list. Then use an `.apply` lambda to replace any category _not_ in that list with the value `'Other'`.

In [53]:
top_four = movies_clean['content_rating'].value_counts()[:4].index.tolist()
movies_clean['rating_simple'] = movies_clean['content_rating'].apply(lambda x: x if x in top_four else 'Other')

In [54]:
# or just using what we learned in class:
top_four = (movies[['content_rating']]
        .reset_index()
        .groupby('content_rating')
        .count()
        .sort_values(by='index', ascending=False)
        .reset_index()['content_rating'][:4]
        .tolist()
)
top_four

['R', 'PG-13', 'PG', 'G']

In [55]:
movies_clean['cleaned_ratings'] = movies['content_rating'].apply(lambda x: x if x in top_four else 'Other')
movies_clean['cleaned_ratings'].value_counts()

R        1737
PG-13    1329
PG        576
Other     156
G          91
Name: cleaned_ratings, dtype: int64
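
As an alternative to the `.apply` lambda, the same recoding can be sketched with `isin` and `where`, which work on the whole column at once. The ratings below are toy values invented for illustration, and this is only one possible approach, not the one the problem set requires.

```python
import pandas as pd

# Toy ratings, invented for illustration only
ratings = pd.Series(["R", "R", "R", "PG-13", "PG-13", "PG", "PG",
                     "G", "G", "NC-17", "Approved"])

# value_counts() sorts by frequency, so the first four labels are the most common
top_four = ratings.value_counts().index[:4].tolist()

# Keep values found in top_four; replace everything else with "Other"
rating_simple = ratings.where(ratings.isin(top_four), "Other")
print(rating_simple.value_counts())
```

Note that `isin` returns `False` for missing values, so under this sketch any `NaN` entries would also be recoded as `"Other"`.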

c) Use the `crosstab()` function from Pandas (more [here](https://pandas.pydata.org/docs/reference/api/pandas.crosstab.html))
to compare `content_rating` and `rating_simple`. In your own words, what does crosstab do?

**Hint: `from pandas import crosstab`**

In [None]:
from pandas import crosstab
crosstab(movies_clean['content_rating'], movies_clean['rating_simple'])

Rows are `content_rating`; columns are `rating_simple`.

| content_rating | G | Other | PG | PG-13 | R |
|---|---|---|---|---|---|
| Approved | 0 | 17 | 0 | 0 | 0 |
| G | 91 | 0 | 0 | 0 | 0 |
| GP | 0 | 1 | 0 | 0 | 0 |
| M | 0 | 2 | 0 | 0 | 0 |
| Missing | 0 | 51 | 0 | 0 | 0 |
| NC-17 | 0 | 6 | 0 | 0 | 0 |
| Not Rated | 0 | 42 | 0 | 0 | 0 |
| PG | 0 | 0 | 576 | 0 | 0 |
| PG-13 | 0 | 0 | 0 | 1329 | 0 |
| Passed | 0 | 3 | 0 | 0 | 0 |
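
As a small illustration of what `crosstab` computes, the sketch below builds a frequency table on six toy rows (invented for illustration). Each cell counts how many rows have that combination of row label and column label, so combinations that never occur show up as 0.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "content_rating": ["R", "R", "PG", "NC-17", "Approved", "PG"],
    "rating_simple":  ["R", "R", "PG", "Other", "Other", "PG"],
})

# Rows: distinct content_rating values; columns: distinct rating_simple values
table = pd.crosstab(toy["content_rating"], toy["rating_simple"])
print(table)
```

Read this way, the table above works as a check on the recoding: every original rating should land in exactly one `rating_simple` column.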


d) Sometimes we want to convert text data into numerical representations. Look at the get_dummies function from pandas [here](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html).

Apply `.get_dummies()` to the following columns: 'country', 'language', 'director_name'. What happens when you run the code?

Save the new dataframe as `dummy_movies`

In [None]:
from pandas import get_dummies

cols = ['country', 'language', 'director_name', 'grossM', 'budgetM']

dummy_movies = get_dummies(movies_clean[cols])
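
A toy sketch of what `get_dummies` does to a mixed DataFrame (values invented for illustration). Text columns are expanded into one indicator column per distinct value, while numeric columns pass through unchanged. The dtype of the indicator columns depends on the pandas version.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "country": ["USA", "UK", "USA"],
    "language": ["English", "English", "French"],
    "grossM": [3.0, 10.0, 0.5],
})

# Two text columns with two distinct values each become four indicator columns
dummies = pd.get_dummies(toy)
print(dummies.columns.tolist())
print(dummies.shape)
```

By the same logic, a column such as `director_name` with many distinct values would produce one new column per director, which makes the resulting DataFrame very wide.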

e) Use `train_test_split` from sklearn to split the `dummy_movies` dataset into training and testing sets of 80% and 20%, respectively.

The feature dataset should contain all columns except `grossM`.
The target dataset should contain `grossM`.

**Hint: try [`.drop()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html)**

Call these dataframes X_train, y_train, X_test, and y_test.

In [None]:
from sklearn.model_selection import train_test_split

X = dummy_movies.drop('grossM', axis=1)
y = dummy_movies['grossM']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=435)
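
A quick way to sanity-check a split like this is to look at the resulting shapes and indices. The sketch below does that on a ten-row toy DataFrame (invented for illustration), reusing the same `test_size` and `random_state` arguments as above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data, invented for illustration only
toy = pd.DataFrame({"budgetM": np.arange(10, dtype=float),
                    "grossM": np.arange(10, dtype=float) * 2})

X = toy.drop("grossM", axis=1)   # features: everything except the target
y = toy["grossM"]                # target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=435
)

# 80% of 10 rows go to training, 20% to testing
print(X_train.shape, X_test.shape)

# Features and targets stay aligned row by row, and no row is in both sets
print(X_train.index.equals(y_train.index))
print(set(X_train.index).isdisjoint(X_test.index))
```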

f) In your own words, explain the purpose of the training dataset.

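The training dataset is the in-sample data used to fit the model: its coefficients are estimated to minimize prediction error on these observations. Keeping it separate from the test set lets us check afterward whether the model has overfit to the training data.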

g) In your own words, explain the purpose of the test dataset

The test dataset is held out from fitting and is used to measure how well the model's predictions generalize out of sample.

## Question 2, Predicting Movie Gross

a) Using the `movies_clean` DataFrame, estimate an OLS (using statsmodels.api) regression model where `imdb_score` is the dependent variable and `grossM` is the independent variable. Store this model as `model`.

In [None]:
import statsmodels.api as sm

X_with_const = sm.add_constant(movies_clean['grossM'])

model = sm.OLS(movies_clean['imdb_score'], X_with_const).fit()

print(model.summary())


                            OLS Regression Results                            
Dep. Variable:             imdb_score   R-squared:                       0.045
Model:                            OLS   Adj. R-squared:                  0.044
Method:                 Least Squares   F-statistic:                     182.0
Date:                Mon, 11 Sep 2023   Prob (F-statistic):           1.42e-40
Time:                        00:43:53   Log-Likelihood:                -5641.9
No. Observations:                3889   AIC:                         1.129e+04
Df Residuals:                    3887   BIC:                         1.130e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          6.3002      0.021    307.112      0.0

b) Interpret the coefficient on `grossM` in the `imdb_score` regression, being specific about the magnitude of the impact of gross on `imdb_score` and the direction (positive or negative).

The coefficient on `grossM` is positive but very small. For every additional million dollars a movie grosses, the predicted `imdb_score` increases by only about 0.0032 points.

c) Discuss the significance of the coefficient on `grossM`. What does this imply about the relationship between `imdb_score` and gross?

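The coefficient is statistically significant: with a single regressor, the slope's p-value equals the reported Prob (F-statistic) of 1.42e-40. Its practical magnitude is small, however, and `grossM` explains only about 4.5% of the variation in `imdb_score` (R-squared of 0.045). In a single-variable regression like this, the model indicates that a movie would have to earn hundreds of millions of dollars to see a meaningful increase in `imdb_score`. A gross of a thousand million would raise the predicted score by only 3.2 points, and since the intercept is 6.3, a movie would need to gross roughly a billion dollars to reach a predicted rating of 9.5.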
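
The arithmetic behind this answer can be sketched directly from the fitted line. The snippet assumes the rounded values reported in this notebook (an intercept of 6.3002 and a slope of 0.0032 per million dollars), so the results are approximate.

```python
# Rounded values from this notebook: intercept from the summary table,
# slope from the answer to part (b)
intercept = 6.3002
slope_per_million = 0.0032

def predicted_score(gross_millions):
    """Predicted imdb_score from the fitted line, with gross in millions of dollars."""
    return intercept + slope_per_million * gross_millions

# Gross (in millions) at which the fitted line reaches a score of 9.5
target = 9.5
gross_needed = (target - intercept) / slope_per_million

print(predicted_score(100))    # a $100M movie
print(predicted_score(1000))   # a $1B movie
print(gross_needed)
```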