## Reducing Dimensionality

There are two main drivers of dimensionality in this dataset: Keywords and Production Companies.

Both bring thousands of dummy variables and create a sparse matrix that must be used for modelling. Several other categories also add more dimensions than they should, so I need to severely reduce the number of features in my data. I'll list the general approach for each here and explain why I'm taking that approach in the subsections below.

**Keywords:** Eliminate any keyword that appears in fewer than fifty films, reducing to 23 features

**Production Companies:** Bin this by quartiles, reducing to 4 features

**Release Year:** Bin by decade, reducing to 6 features

**Production Country:** Eliminate and create a single column indicating if a film was produced in the US or not
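The Release Year and Production Country reductions listed above can be illustrated with a minimal sketch on toy data. The column names (`release_year`, `production_country`) and the single-country-per-film layout are assumptions made for illustration; the real dataset may store these differently.

```python
import pandas as pd

# Toy data; the column names here are hypothetical, not the notebook's real ones.
films = pd.DataFrame({
    'release_year': [1968, 1985, 1999, 2004, 2017],
    'production_country': ['US', 'FR', 'US', 'GB', 'US'],
})

# Floor each year to the start of its decade, e.g. 1985 -> 1980.
films['decade'] = (films['release_year'] // 10) * 10

# One binary column replaces the full set of country dummies.
films['produced_in_us'] = (films['production_country'] == 'US').astype(int)

# One dummy column per decade present in the data.
decade_dummies = pd.get_dummies(films['decade'], prefix='decade')
```

The number of decade columns depends only on the span of years in the data, so it stays small no matter how many films are added.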

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.preprocessing import StandardScaler, MinMaxScaler, PowerTransformer
from sklearn.metrics import r2_score, mean_squared_error, mean_squared_log_error
from sklearn.svm import LinearSVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

pd.set_option('display.float_format', lambda x: '%.5f' % x)

In [None]:
boxoffice = pd.read_csv('../Data/No_Outliers.csv')

In [None]:
# Copy the boxoffice dataframe before making major changes so the original stays intact.
# A plain assignment (box = boxoffice) would only create a second name for the same object.
box = boxoffice.copy()
box.shape

### Keywords

The Keywords category adds 7,134 dimensions to this data set and is incredibly sparse. Only 1,000 keywords actually appear in more than 5 films. As a starting point I'm going to eliminate all keywords that don't appear in at least 50 films, which reduces this category by 7,123 dimensions to 11 features.

I would prefer to bin keywords by revenue quartiles, as I'll do with production companies, but this presents problems for stakeholders who would like to use this model to predict a film's revenue. If keywords are binned by revenue, you need to consult the existing list of keywords to identify which quartile a keyword belongs to. This by itself isn't a problem. However, over half of all keywords appear in only a single film, and many are garbled collections of symbols that are not interpretable. A single previous data point is not a good predictor of revenue, and the problem is compounded by the likelihood that a new film's keyword has never appeared in any previous film.

If a keyword has not previously appeared in a film, it's impossible to use prior revenue performance to predict future revenue performance. As a result, I've chosen to filter keywords by their frequency in the data set.
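To make the frequency argument concrete, here is a small sketch on a toy dummy matrix. The keyword names, the helper `surviving_keywords`, and the threshold of 2 are all hypothetical and chosen only so the toy data shows the effect; the notebook itself uses a threshold of 50.

```python
import pandas as pd

# Toy 0/1 dummy matrix: rows are films, columns are keywords (hypothetical names).
keyword_dummies = pd.DataFrame({
    'sequel': [1, 1, 1, 0, 1],
    'heist':  [0, 1, 0, 0, 1],
    'x#9@':   [0, 0, 1, 0, 0],
})

# Number of films each keyword appears in.
films_per_keyword = keyword_dummies.sum()


def surviving_keywords(counts, min_films):
    """Return the keywords that appear in at least `min_films` films."""
    return list(counts[counts >= min_films].index)


# Share of keywords that appear in exactly one film.
single_use_share = (films_per_keyword == 1).mean()

# Keywords kept at a toy threshold of 2 films.
kept = surviving_keywords(films_per_keyword, min_films=2)
```

Checking `single_use_share` and trying a few values of `min_films` is a quick way to see how the feature count falls as the threshold rises.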

In [None]:
kwrds = boxoffice['Keywords']
# Count the films in which each keyword dummy equals 1
count = (kwrds == 1).sum()
# Keep only keywords that appear in at least 50 films
new_cols = list(count[count >= 50].index)

In [None]:
# Keyword columns that fall below the 50-film threshold
old_cols = set(box['Keywords'].columns)
drop = list(old_cols.difference(new_cols))
box.drop(drop, axis=1, level=1, inplace=True)
# Also drop the 'No Keywords' column
box.drop('No Keywords', axis=1, level=1, inplace=True)
box.shape

### Production Companies

Next, I need to bin production companies by revenue tier. This reduces the 2,688 Production Companies features to 4 features, one per revenue quartile. These indicate whether a film has a production company whose films generally fall in the first, second, third, or fourth quartile of film revenue.

I'm also going to sum the production company dummies across each row to create a new column that records how many companies contributed to a film, since each film can have multiple companies working on it.

In [None]:
sums = boxoffice['Company']
sums = sums.sum(axis=1)
# create a new column that indicates the number of companies that participate in the creation of a film
box['Numerical', 'Num_companies'] = sums

In [None]:
rev = boxoffice['Numerical', 'revenue']
rev = pd.DataFrame(rev)
rev.columns = rev.columns.droplevel()
prod_co = boxoffice['Company']
prod_co_rev = prod_co.join(rev)
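
The quartile binning described at the start of this subsection could be sketched roughly as follows. This is an illustration on toy data, not necessarily the approach taken in the rest of the notebook: the company names, the use of mean revenue per company, and the `Q1` to `Q4` labels are all assumptions.

```python
import pandas as pd

# Toy data: 0/1 company dummies plus a revenue column (all names hypothetical).
toy = pd.DataFrame({
    'Co_A': [1, 0, 0, 0, 1, 0],
    'Co_B': [0, 1, 0, 0, 0, 0],
    'Co_C': [0, 0, 1, 0, 0, 1],
    'Co_D': [0, 1, 0, 1, 0, 0],
    'revenue': [10.0, 200.0, 50.0, 400.0, 30.0, 70.0],
})
companies = ['Co_A', 'Co_B', 'Co_C', 'Co_D']
labels = ['Q1', 'Q2', 'Q3', 'Q4']

# Mean revenue of the films each company worked on.
mean_rev = pd.Series({
    co: toy.loc[toy[co] == 1, 'revenue'].mean() for co in companies
})

# Assign each company to a revenue quartile.
company_quartile = pd.qcut(mean_rev, q=4, labels=labels)

# For each film, flag whether any of its companies falls in each quartile.
quartile_flags = pd.DataFrame({
    q: toy[list(company_quartile[company_quartile == q].index)].max(axis=1)
    for q in labels
})
```

Because the quartiles are derived from the target variable, it would be safer to compute them on the training split only and then apply the resulting company-to-quartile mapping to held-out films.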