<a id='999'></a>
# TMDB Box Office Prediction

This notebook walks step by step through the common stages of a traditional machine learning workflow: data loading, feature engineering, feature selection, model selection, and the final prediction. 

We have only a little data for predicting a TMDB movie's box office revenue: 3,000 rows in the train set and 4,400 in the test set. With so little data, data-hungry modern ML techniques are not appropriate, so we have to use traditional ML algorithms, where domain knowledge is needed for major steps of building the model such as feature engineering and feature selection. 

<img src='https://storage.googleapis.com/kaggle-competitions/kaggle/10300/logos/thumb76_76.png' align='right' />

1. [Loading Data](#1)
1. [Feature Engineering](#2)
    1. [Convert JSON features](#21)
    1. [Drop some features](#22)
    1. [Missing values](#23)
    1. [Creation of new features](#24)
1. [Feature selection](#3)
    1. [Intuitive analysis](#31)
        1. [Numerical features](#310)
        1. [Text length features](#311)
        1. [Release date features](#312)
        1. [Genre features](#313)
        1. [Count features](#314)
        1. [Language, Country](#315)
    1. [Summary of Intuitive analysis](#32)
    1. [Automatic Feature Selection](#33)
        1. [Univariate Selection](#331)
        1. [Recursive Feature Elimination](#332)
        1. [Principal Component Analysis](#333)
        1. [Extra Trees Regressor](#334)
    1. [Summary of Automatic Feature Selection](#34)
1. [Model Selection](#4)
    1. [Normalization of skewed data](#41)
    1. [Model Selection](#42)
    1. [Feature selection for the best model](#43)
1. [Train and Predict](#5)


6. **[Further Research Work](#6)**
    1. [Can sub-prediction improve the overall accuracy?](#61)
    1. [Forward Feature Selection](#62)

# 1. Loading Data<a id='1'></a>

In [None]:
import numpy as np
import pandas as pd

import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns

mpl.style.use('ggplot')
sns.set_style('white')
pylab.rcParams['figure.figsize'] = 12,8

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_validate

from sklearn.feature_selection import RFECV
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor

from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

import random
import time

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

In [None]:
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')

print('train dataset size:', train.shape)
print('test dataset size:', test.shape)
train.sample(4)

There are 8 JSON-style features, 4 numerical, 4 text, and 1 date feature.

In [None]:
train.info()

<a id='2'></a>
# [2. Feature Engineering](#999)

First, convert the JSON-style features into string/category/list ones.

<a id='21'></a>
## [2.1 Convert JSON features](#999)

* **`belongs_to_collection`**: convert `name` into string
* **`genres`, `production_companies`**: convert `name` values into comma-separated string list
* **`production_countries`**: convert `iso_3166_1` values into comma-separated string list
* **`spoken_languages`**: convert `iso_639_1` values into comma-separated string list
* **`Keywords`**: convert `name` values into comma-separated string list
* **`cast`, `crew`**: get their lengths, as their detailed information is very unlikely to be relevant to the revenue 
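To make the conversion concrete, here is a minimal sketch of what a raw value in these columns looks like and how it can be parsed. The sample string and the helper name `names_from_json` are made up for illustration. `ast.literal_eval` is used as a safer stand-in for `eval`, since it only evaluates Python literals.

```python
import ast

# Hypothetical raw value, shaped like the entries of the `genres` column
raw = "[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}]"

def names_from_json(string, key):
    """Join the `key` values of a list-of-dicts string; return '' if it cannot be parsed."""
    if not isinstance(string, str):      # missing values arrive as NaN (a float)
        return ''
    try:
        return ",".join(d[key] for d in ast.literal_eval(string))
    except (ValueError, SyntaxError, KeyError, TypeError):
        return ''

parsed = names_from_json(raw, 'name')
missing = names_from_json(float('nan'), 'name')
print(repr(parsed), repr(missing))
```

A missing value is mapped to an empty string, so the later string operations never see NaN.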

In [None]:
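import ast

def proc_json(string, key):
    # Missing values arrive as NaN (a float); treat them as empty.
    if not isinstance(string, str):
        return ''
    try:
        # literal_eval only evaluates Python literals, so it is safer than eval
        data = ast.literal_eval(string)
        return ",".join([d[key] for d in data])
    except (ValueError, SyntaxError, KeyError, TypeError):
        return ''

def proc_json_len(string):
    if not isinstance(string, str):
        return 0
    try:
        data = ast.literal_eval(string)
        return len(data)
    except (ValueError, SyntaxError, TypeError):
        return 0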

train.belongs_to_collection = train.belongs_to_collection.apply(lambda x: proc_json(x, 'name'))
test.belongs_to_collection = test.belongs_to_collection.apply(lambda x: proc_json(x, 'name'))

train.genres = train.genres.apply(lambda x: proc_json(x, 'name'))
test.genres = test.genres.apply(lambda x: proc_json(x, 'name'))

train.production_companies = train.production_companies.apply(lambda x: proc_json(x, 'name'))
test.production_companies = test.production_companies.apply(lambda x: proc_json(x, 'name'))

train.production_countries = train.production_countries.apply(lambda x: proc_json(x, 'iso_3166_1'))
test.production_countries = test.production_countries.apply(lambda x: proc_json(x, 'iso_3166_1'))

train.spoken_languages = train.spoken_languages.apply(lambda x: proc_json(x, 'iso_639_1'))
test.spoken_languages = test.spoken_languages.apply(lambda x: proc_json(x, 'iso_639_1'))

train.Keywords = train.Keywords.apply(lambda x: proc_json(x, 'name'))
test.Keywords = test.Keywords.apply(lambda x: proc_json(x, 'name'))

train.cast = train.cast.apply(proc_json_len)
test.cast = test.cast.apply(proc_json_len)

train.crew = train.crew.apply(proc_json_len)
test.crew = test.crew.apply(proc_json_len)

Though `belongs_to_collection` has many missing values, movies in the same collection tend to have similar budgets and revenues. Keep it for now and consider its relevance later.

In [None]:
train.isnull().sum()

<a id='22'></a>
## [2.2. Drop some features](#999)
It's obvious that ID/URL features are not useful for predicting revenue, so I will drop them.

Text features such as `title` and `overview` are unlikely to be useful by themselves. For now I use only the lengths of the text columns; if I need more features later, I can reconsider extracting information from these columns with NLP.

In [None]:
# get lengths of text columns
columns = ['original_title', 'title', 'overview', 'tagline']
for col in columns:
    new_col = col + '_len'
    # pd.isnull catches every kind of missing value, not only the np.nan singleton
    train[new_col] = train[col].apply(lambda x: 0 if pd.isnull(x) else len(x))
    test[new_col] = test[col].apply(lambda x: 0 if pd.isnull(x) else len(x))

# drop ID/URL/text columns
columns.extend(['homepage', 'imdb_id', 'poster_path'])

train.drop(columns, axis=1, inplace=True)
test.drop(columns, axis=1, inplace=True)

<a id='23'></a>
## [2.3. Missing values](#999)

In [None]:
print('-'*30, '\n', train.isnull().sum())
print('-'*30, '\n', test.isnull().sum())

### - runtime
The `runtime` column has 2 and 4 missing values in the train and test datasets respectively. I don't know what a runtime of 0 means, but some rows already have 0 in the `runtime` column, so I fill the missing values with 0 as well.

In [None]:
train.runtime = train.runtime.fillna(0)
test.runtime = test.runtime.fillna(0)

### - status
The `status` column has 2 missing values in the test dataset. Fill them with 'Released', the most frequent value.

In [None]:
test.loc[test.status.isnull(), 'status'] = 'Released'

### - release_date
The `release_date` column has one missing value in the test dataset. Fill it with the most frequent value.

In [None]:
test.loc[test.release_date.isnull(), 'release_date'] = test.release_date.mode()[0]


<a id='24'></a>
## [2.4. Creation of new features](#999)

### - Date related Values

In [None]:
def expand_release_date(df):
    df.release_date = pd.to_datetime(df.release_date)

    df['release_year'] = df.release_date.dt.year
    df['release_month'] = df.release_date.dt.month
    df['release_day'] = df.release_date.dt.dayofweek
    df['release_quarter'] = df.release_date.dt.quarter
    
    return df

train = expand_release_date(train)

In [None]:
train[['release_year', 'release_month', 'release_day', 'release_quarter']].describe()

# 😲 
Oops! The maximum of `release_year` is 2068! 

The dates store the year with only two digits, so years before 1969 are parsed as years in the 2000s (for example, 1968 becomes 2068). Let's correct this.
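The effect can be reproduced on a few made-up dates. This is only a sketch: the sample strings and the month/day/two-digit-year format are assumptions for illustration, and the cut-off year 2020 is the one used in this notebook.

```python
import pandas as pd

# Hypothetical two-digit-year dates in month/day/year form
dates = pd.Series(['2/20/15', '8/6/68', '3/12/99'])
parsed = pd.to_datetime(dates, format='%m/%d/%y')
years = parsed.dt.year

# A release year in the future is not plausible for this dataset,
# so shift such years back by one century.
fixed_years = years.where(years <= 2020, years - 100)
print(years.tolist(), fixed_years.tolist())
```

`Series.where` keeps the values that satisfy the condition and replaces the others, which avoids a row-by-row `apply`.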

In [None]:
def expand_release_date(df):
    df.release_date = pd.to_datetime(df.release_date)

    df['release_year'] = df.release_date.dt.year
    df['release_year'] = df.release_year.apply(lambda x: x-100 if x > 2020 else x)
    
    df['release_month'] = df.release_date.dt.month
    df['release_day'] = df.release_date.dt.dayofweek
    df['release_quarter'] = df.release_date.dt.quarter
    
    return df

train = expand_release_date(train)
test = expand_release_date(test)

### - Genres
As each movie has 0 or more genres, converting the `genres` column into a category type is not reasonable. It would treat the same set of genres as different values, e.g. 'Drama,Romance' and 'Romance,Drama' would be categorized differently.
Therefore I made dummy columns for all of the genres.
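As a side note, pandas can build such dummy columns directly from a separator-joined string with `Series.str.get_dummies`. The sketch below uses made-up genre strings and is an alternative to the loop used in this notebook, not a description of it.

```python
import pandas as pd

# Hypothetical comma-separated genre strings, including one movie with no genre
genres_demo = pd.Series(['Drama,Romance', 'Romance,Drama', 'Action', ''])

# One 0/1 column per distinct genre, independent of the order inside each string
dummies = genres_demo.str.get_dummies(sep=',').add_prefix('genre_')
print(dummies)
```

The first two rows get identical dummy values, which is exactly the property a single category column would lose.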

In [None]:
# get total genres list
genres = []
for idx, val in train.genres.items():  # iteritems() was removed in pandas 2.0
    gen_list = val.split(',')
    for gen in gen_list:
        if gen == '':
            continue

        if gen not in genres:
            genres.append(gen)
            

genre_column_names = []
for gen in genres:
    col_name = 'genre_' + gen.replace(' ', '_')
    train[col_name] = train.genres.str.contains(gen).astype('uint8')
    test[col_name] = test.genres.str.contains(gen).astype('uint8')
    genre_column_names.append(col_name)

train.sample(5)

It's not certain if genre count is relevant to the revenue, but calculate it now and test later.
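One detail is worth checking when counting comma-separated values: splitting an empty string still yields one (empty) item, so a movie without any genre would be counted as having one. A small sketch with a hypothetical helper `count_items` that ignores empty items:

```python
def count_items(value, sep=','):
    """Number of non-empty items in a separator-joined string."""
    return len([item for item in value.split(sep) if item])

naive_count = len(''.split(','))   # the naive count for an empty string
print(naive_count, count_items(''), count_items('Drama,Romance'))
```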

In [None]:
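# an empty string means "no genres", so count it as 0 (''.split(',') has length 1)
train['genre_count'] = train.genres.apply(lambda x: len(x.split(',')) if x else 0)
test['genre_count'] = test.genres.apply(lambda x: len(x.split(',')) if x else 0)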

### - Country & Company

In [None]:
# an empty string means "no entries", so count it as 0 (''.split(',') has length 1)
train['country_us'] = train.production_countries.str.contains('US').astype('uint8')
train['country_count'] = train.production_countries.apply(lambda x: len(x.split(',')) if x else 0)
train['company_count'] = train.production_companies.apply(lambda x: len(x.split(',')) if x else 0)

test['country_us'] = test.production_countries.str.contains('US').astype('uint8')
test['country_count'] = test.production_countries.apply(lambda x: len(x.split(',')) if x else 0)
test['company_count'] = test.production_companies.apply(lambda x: len(x.split(',')) if x else 0)

### - Language

In [None]:
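train['slc'] = train.spoken_languages.apply(lambda x: len(x.split(',')) if x else 0)
test['slc'] = test.spoken_languages.apply(lambda x: len(x.split(',')) if x else 0)

# use one shared category list so that a language gets the same code in train and test
lang_categories = sorted(set(train.original_language.dropna()) | set(test.original_language.dropna()))
train['orig_lang_code'] = pd.Categorical(train.original_language, categories=lang_categories).codes
test['orig_lang_code'] = pd.Categorical(test.original_language, categories=lang_categories).codes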

### - crew and cast
`crew` and `cast` were converted into counts when the JSON values were parsed. I created a new feature by adding those 2 values.

In [None]:
train['total_staff'] = train.cast + train.crew
test['total_staff'] = test.cast + test.crew

<a id='3'></a>
# [3. Feature selection](#999)

Now, I have nearly 50 columns in the train dataset.
But there is still room for new features: `belongs_to_collection`, `status`, `Keywords`, and the text columns I dropped at the beginning. 

There are lots of ways to do feature selection using sklearn, but first I'd like to look at charts to see the potential relevance of the features to the target value. After looking at some charts directly, I can apply various methods to calculate feature importance.

Related references:
- From Wikipedia, the free encyclopedia, [Feature selection](https://en.wikipedia.org/wiki/Feature_selection)
- Jason Brownlee, [Feature Selection For Machine Learning in Python](https://machinelearningmastery.com/feature-selection-machine-learning-python/)
- Sudharsan Asaithambi, [Why, How and When to apply Feature Selection](https://towardsdatascience.com/why-how-and-when-to-apply-feature-selection-e9c69adfabf2)
- Kaggle Kernel, [6 Ways for Feature Selection](https://www.kaggle.com/sz8416/6-ways-for-feature-selection)
- Matthew Mayo, [Step Forward Feature Selection: A Practical Example in Python](https://www.kdnuggets.com/2018/06/step-forward-feature-selection-python.html)

<a id='31'></a>
## [3.1 Intuitive analysis](#999)

<a id='310'></a>
### - [Numerical features](#999)

    * Relationship between Numerical Features

In [None]:
numerical_data = train[['budget', 'popularity', 'runtime', 'cast', 'crew', 'total_staff', 'revenue']]
g = sns.PairGrid(numerical_data)
g = g.map_diag(plt.hist, bins=10)
g = g.map_offdiag(plt.scatter, s=5, alpha=.9, linewidth=.5)
g = g.add_legend()
plt.show()

In [None]:
plt.figure(figsize=(8,6))
sns.heatmap(numerical_data.corr(), annot=True, fmt='.2', center=0.0, cmap='RdBu_r')
plt.title('Correlation between Numerical Features')
plt.show()

**TIPS**: All numerical features are highly related to the revenue. `total_staff`, which is calculated by adding `cast` and `crew`, shows an interesting characteristic: it is highly correlated with most of the other numerical features.
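Reading strongly correlated pairs off a heatmap gets harder as the number of features grows. A small helper can list them instead. The function name `correlated_pairs`, the 0.8 threshold, and the synthetic data are all assumptions made for this sketch.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.8):
    """Return feature pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr = df.corr().abs()
    # keep only the upper triangle so each pair appears once and the diagonal is dropped
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Synthetic stand-in: `total_staff` is almost a copy of `cast`, `runtime` is unrelated
rng = np.random.default_rng(0)
cast = rng.normal(size=200)
demo = pd.DataFrame({
    'cast': cast,
    'total_staff': cast + 0.1 * rng.normal(size=200),
    'runtime': rng.normal(size=200),
})
high = correlated_pairs(demo)
print(high)
```

Applied to `numerical_data`, a helper like this would flag pairs such as a sum and its own components, which carry largely duplicated information.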

<a id='311'></a>
### - [Text length features](#999)

    * Distribution of Text Length Features

In [None]:
cols = ['original_title_len', 'title_len', 'overview_len', 'tagline_len']

plt.figure(figsize=(16, 8))
for idx, col in enumerate(cols):
    plt.subplot(2, 4, idx+1)
    plt.hist(train[col], bins=15)
    plt.title(col)
    
    plt.subplot(2, 4, idx+5)
    plt.scatter(x=train[col], y=train.revenue, alpha=.4, marker='+')
    
plt.show()   

In [None]:
cols.append('revenue')
plt.figure(figsize=(8,6))
sns.heatmap(train[cols].corr(), annot=True, fmt='.2', center=0.0, cmap='RdBu_r')
plt.title('Correlation of Text Length Features')
plt.show()

**TIPS**: Text lengths other than `original_title_len` are unlikely to be relevant to the revenue. `original_title_len` and `title_len` are highly correlated with each other.

<a id='312'></a>
### - [Release date features](#999)

In [None]:
plt.figure(figsize=(16,3))
sns.set(style="whitegrid")
sns.barplot(x='release_year', y='revenue', errwidth=0.5, data=train)
plt.xticks(rotation=90)
plt.title('Average Revenue per Year')
plt.show()

plt.figure(figsize=(16,3))
ax = plt.subplot(131)
sns.barplot(x='release_month', y='revenue', data=train, ax=ax)
ax.set_title('Average Revenue per Month')

ax = plt.subplot(132)
sns.barplot(x='release_day', y='revenue', data=train, ax=ax)
ax.set_title('Average Revenue per Weekday')

ax = plt.subplot(133)
sns.barplot(x='release_quarter', y='revenue', data=train, ax=ax)
ax.set_title('Average Revenue per Quarter')
plt.show()

plt.figure(figsize=(8,6))
sns.heatmap(train[['release_year', 'release_quarter', 'release_month', 'release_day', 'revenue']].corr(), annot=True, fmt='.2', center=0.0, cmap='RdBu_r')
plt.title('Correlation between Date Features')
plt.show()

**TIPS**: Only `release_year` seems to be relevant to the revenue. `release_quarter` and `release_month` are highly correlated with each other.

<a id='313'></a>
### - [Genre features](#999)

In [None]:
genre_counts = []
genre_avg_revenues = []

for col in genre_column_names:
    genre_counts.append(train[col].sum())
    genre_avg_revenues.append(train.loc[train[col] == 1, 'revenue'].mean())

genre_df = pd.DataFrame({'genre': genres, 'counts': genre_counts, 'revenue': genre_avg_revenues})    

_, (ax0, ax1) = plt.subplots(nrows=1, ncols=2, sharey=True, figsize=(8, 8))
genre_df.plot.barh(x='genre', y='counts', ax=ax0, legend=False)
ax0.set_title('Movie counts')

genre_df.plot.barh(x='genre', y='revenue', ax=ax1, legend=False)
ax1.set_title('Average Revenues')

plt.subplots_adjust(wspace=0.01)
plt.show()

cols = genre_column_names[:]
cols.append('revenue')
plt.figure(figsize=(20, 12))
sns.heatmap(train[cols].corr(), annot=True, fmt='.2g', center=0.0, cmap='RdBu_r')
plt.title('Movie Genres\' Correlation with Revenue')
plt.show()

**TIPS**: A few genres are highly relevant to the revenue; they have a relatively large number of movies and a high average revenue. They are Adventure, Action, Fantasy, Drama, Family, Animation, and Science Fiction.

<a id='314'></a>
### - [Count features](#999)
I have made 4 count features: `genre_count`, `country_count`, `company_count`, and `slc` (spoken language count). Let's look at them.

In [None]:
_, ((ax0, ax1), (ax2, ax3)) = plt.subplots(nrows=2, ncols=2, figsize=(16, 10))
sns.boxplot(x='genre_count', y='revenue', ax=ax0, data=train)
sns.boxplot(x='slc', y='revenue', ax=ax1, data=train)
sns.boxplot(x='country_count', y='revenue', ax=ax2, data=train)
sns.boxplot(x='company_count', y='revenue', ax=ax3, data=train)
plt.show()

plt.figure(figsize=(8,6))
sns.heatmap(train[['genre_count', 'country_count', 'company_count', 'slc', 'revenue']].corr(), annot=True, fmt='.2', center=0.0, cmap='coolwarm')
plt.title('Correlation between Count Features')
plt.show()

**TIPS**: all the count-related features seem to be useless. :-(

<a id='315'></a>
### - [Language, Country](#999)


In [None]:
_, (ax0, ax1) = plt.subplots(nrows=1, ncols=2, figsize=(16, 8))
sns.boxplot(x='country_us', y='revenue', ax=ax0, data=train)
ax0.set_title('Revenue per Country_US')
sns.boxplot(x='orig_lang_code', y='revenue', ax=ax1, data=train)
ax1.set_title('Revenue per Original Language Code')
plt.show()

plt.figure(figsize=(6, 4))
sns.heatmap(train[['country_us', 'orig_lang_code', 'revenue']].corr(), annot=True, fmt='.2', center=0.0, cmap='coolwarm')
plt.title('Correlation of Language Features with Revenue')
plt.show()

**TIPS**: US movies produce high revenues, so `country_us` is likely a useful feature. Language is not relevant to the revenue.

<a id='32'></a>
## [3.2 Summary of Intuitive analysis](#999)

* `budget`, `popularity`, `total_staff`, and `runtime` are great features for the model. (`cast` and `crew` are better ignored, as they duplicate the information in `total_staff`.)
* `original_title_len` is a little bit related to the revenue. 
* `release_year` and `release_day` are OK.
* Selected genres are `Adventure`, `Action`, `Fantasy`, `Drama`, `Family`, `Animation`, and `Science Fiction`.
* `genre_count` and `company_count` need further consideration.
* `country_us` is good.

In [None]:
manually_selected_features = ['budget', 'popularity', 'total_staff', 'runtime', 'original_title_len', 
                              'release_year', 'release_day', 'genre_Adventure', 'genre_Action', 'genre_Fantasy', 
                              'genre_Drama', 'genre_Family', 'genre_Animation', 'genre_Science_Fiction',
                              'genre_count', 'company_count', 'country_us']

<a id='33'></a>
## [3.3 Automatic Feature Selection](#999)

There are several ways using sklearn library to select suitable features for the model. I personally like this blog '[Feature Selection For Machine Learning in Python](https://machinelearningmastery.com/feature-selection-machine-learning-python/)' written by Jason Brownlee. I applied 4 methods mentioned in the blog.

<a id='331'></a>
### - [Univariate Selection](#999)

In [None]:
features = train.select_dtypes(include=['int64', 'float64', 'uint8', 'int8']).columns.tolist()
features.remove('id')
features.remove('revenue')

X, Y = train[features], train['revenue']

uni_test = SelectKBest(score_func=chi2, k='all')
fit = uni_test.fit(X, Y)

feature_df = pd.DataFrame({'feature': features, 'importance': np.log10(fit.scores_)})
feature_df.sort_values(by='importance', ascending=True, inplace=True)

feature_df.plot.barh(x='feature', y='importance', figsize=(12, 12))
plt.title('Feature Importance by Univariate Selection')

features1 = feature_df.feature[-20:].tolist()

The result shown in the chart is a little different from my intuitive analysis. Let's go on with other methods.
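A caveat on the scoring function: `chi2` is intended for classification targets and non-negative features, while `revenue` is continuous. scikit-learn also provides `f_regression` for regression targets. The sketch below shows it on synthetic data (the column names and coefficients are invented); it is an alternative to try, not the method used above.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in: the target depends on `budget` only
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({
    'budget': rng.uniform(0, 100, 300),
    'noise_a': rng.normal(size=300),
    'noise_b': rng.normal(size=300),
})
demo_y = 3.0 * demo_X['budget'] + rng.normal(scale=10, size=300)

# Score each feature against the continuous target and keep the best one
selector = SelectKBest(score_func=f_regression, k=1).fit(demo_X, demo_y)
scores = pd.Series(selector.scores_, index=demo_X.columns).sort_values(ascending=False)
selected = demo_X.columns[selector.get_support()].tolist()
print(scores)
```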

<a id='332'></a>
### - [Recursive Feature Elimination](#999)

RFECV with a RandomForestRegressor gives different results on every run, so I repeated the process several times and aggregated the results.
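The run-to-run variation comes from the randomness of the forest. One way to keep the aggregation reproducible is to give each run its own fixed `random_state`. The following sketch uses a tiny synthetic dataset and small forests so that it runs quickly; the sizes and names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFECV

# Synthetic stand-in: only `signal` drives the target
rng = np.random.default_rng(0)
demo_X = pd.DataFrame(rng.normal(size=(150, 4)),
                      columns=['signal', 'noise_a', 'noise_b', 'noise_c'])
demo_y = 5.0 * demo_X['signal'] + rng.normal(scale=0.5, size=150)

n_runs = 3
votes = pd.Series(0, index=demo_X.columns)
for seed in range(n_runs):
    # a different but fixed seed per run: results vary across runs, not across reruns
    forest = RandomForestRegressor(n_estimators=20, random_state=seed)
    rfecv = RFECV(forest, cv=3).fit(demo_X, demo_y)
    votes += rfecv.support_.astype(int)

print(votes.sort_values(ascending=False))
```

Features with a vote count equal to `n_runs` were kept in every run, which mirrors the `importance==10` filter used in this notebook.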

In [None]:
from tqdm import tqdm

importances = np.zeros(len(features))

for i in tqdm(range(10)):
    model = RandomForestRegressor()
    rfecv = RFECV(model, cv=5)
    fit = rfecv.fit(X, Y)

    selected = np.array(fit.support_)
    importances = importances + selected

feature_df = pd.DataFrame({'feature': features, 'importance': importances})
feature_df.sort_values(by='importance', ascending=True, inplace=True)

feature_df.plot.barh(x='feature', y='importance', figsize=(10, 12))
plt.title('Feature Importance by RFECV & random forest regressor')    

features2 = feature_df.loc[feature_df.importance==10, 'feature'].tolist()

9 features were selected in all 10 runs.

<a id='333'></a>
### - [Principal Component Analysis](#999)

Principal Component Analysis (or PCA) uses linear algebra to transform the dataset into a compressed form.

Generally this is called a data reduction technique. A property of PCA is that you can choose the number of dimensions or principal component in the transformed result.

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
fit = pca.fit(X)

print(pca.explained_variance_ratio_)

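The first PC explains almost all of the variance, so we calculated feature importances from the first PC. (PCA is sensitive to scale, so on unscaled data the first PC is most likely dominated by large-valued features such as `budget`.)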
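To illustrate how much the scale of the inputs affects PCA, here is a sketch on synthetic data (the three columns and their ranges are invented). Without scaling, the large-valued column takes nearly all of the variance; after standardization the variance is spread across components.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic, independent columns on very different scales
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'budget': rng.uniform(0, 1e8, 500),
    'runtime': rng.normal(100, 20, 500),
    'popularity': rng.normal(8, 3, 500),
})

raw_ratio = PCA(n_components=3).fit(demo).explained_variance_ratio_
scaled = StandardScaler().fit_transform(demo)
scaled_ratio = PCA(n_components=3).fit(scaled).explained_variance_ratio_

print('unscaled:', raw_ratio)
print('scaled:  ', scaled_ratio)
```

Whether standardizing first gives a more useful feature ranking for this dataset is something that would need to be tested.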

In [None]:
feature_df = pd.DataFrame({'feature': features, 'importance': abs( pca.components_[0])})
feature_df.sort_values(by='importance', ascending=False, inplace=True)

features3 = feature_df.feature[:20]
features3

<a id='334'></a>
### - [Extra Trees Regressor](#999)

Bagged decision trees like Random Forest and Extra Trees can be used to estimate the importance of features.

Top 20 important features are:

In [None]:
from sklearn.ensemble import ExtraTreesRegressor
model = ExtraTreesRegressor()
model.fit(X, Y)

feature_df = pd.DataFrame({'feature': features, 'importance': model.feature_importances_})
feature_df.sort_values(by='importance', ascending=False, inplace=True)
features4 = feature_df.feature[:20]
features4

<a id='34'></a>
## 3.4 [Summary of Automatic Feature Selection](#999)

The more features we use, the more likely the model is to overfit and the longer training takes. From the feature selection results, I tried to select fewer than 10 features.

As we can see above, the selected features differ slightly depending on the method. That's why we should do feature selection along with model selection.

Finally, we take the intersection of the 4 feature sets calculated above as the automatically selected features:

In [None]:
automatically_selected_features = list( set(features1) & set(features2) & set(features3) & set(features4) )
automatically_selected_features

#### features selected by correlation

As one more automatic feature selection step, we take the features that have a relatively high correlation with the revenue:

In [None]:
features = train.select_dtypes(include=['int64', 'float64', 'uint8', 'int8']).columns.tolist()
features.remove('id')
features.remove('revenue')

target = 'revenue'
corr_features = features[:]
corr_features.append(target)
corrs = abs(train[corr_features].corr()['revenue']).sort_values(ascending=False)
corr_selected_features = corrs[:20].index.tolist()
corr_selected_features.remove('revenue')
corr_selected_features

## Summary of Feature Selection
Through the feature selection process, we got 3 lists of features selected by different approaches.
* **manually_selected_features**
* **automatically_selected_features**
* **corr_selected_features**

But the feature selection process is not finished yet. I can't say clearly which list is the best right now, so I will use `corr_selected_features` to select the model and reconsider the features after the model selection.

<a id='4'></a>
# 4. [Model Selection](#999)

<a id='41'></a>
## 4.1 [Normalization of skewed data](#999)
`budget` and `revenue` are highly skewed, so they are normalized with a logarithm (`log1p`).
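A quick sketch of what the transformation does, on synthetic log-normally distributed values (an assumption standing in for the real columns). `log1p` computes log(1 + x), so a value of 0 stays 0, and `expm1` is its exact inverse, which is why the predictions can later be converted back to revenue.

```python
import numpy as np
import pandas as pd

# Synthetic, heavily right-skewed values
rng = np.random.default_rng(0)
raw = pd.Series(rng.lognormal(mean=15, sigma=1.5, size=1000))

logged = np.log1p(raw)        # compress the long right tail
restored = np.expm1(logged)   # inverse transform back to the original scale

print('skew before:', raw.skew(), 'skew after:', logged.skew())
```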

In [None]:
train['revenue'] = np.log1p(train['revenue'])
train['budget'] = np.log1p(train['budget'])

test['budget'] = np.log1p(test['budget'])

X = train[corr_selected_features]
y = train['revenue']

<a id='42'></a>
## 4.2 [Model Selection](#999)
Several regression models are compared, and the most accurate one is selected.

In [None]:
def select_model(X, Y):

    best_models = {}
    models = [
        {
            'name': 'LinearRegression',
            'estimator': LinearRegression(),
            'hyperparameters': {},
        },
        {
            'name': 'KNeighbors',
            'estimator': KNeighborsRegressor(),
            'hyperparameters':{
                'n_neighbors': range(3,50,3),
                'weights': ['distance', 'uniform'],
                'algorithm': ['auto'],
                'leaf_size': list(range(10,51,10)),
                }
        },
        {
            'name': 'GradientBoostingRegressor',
            'estimator': GradientBoostingRegressor(),
            'hyperparameters':{
                'n_estimators': range(70, 150, 10),
                'criterion': ['friedman_mse'],
                'max_depth': [3, 5, 7, 9],
                'max_features': ['log2', 'sqrt'],
                'min_samples_leaf': [1, 2, 4],
                'min_samples_split': [3, 5, 7]
            }
            
        },

        {
            'name': 'XGBoost',
            'estimator': XGBRegressor(),
            'hyperparameters':{
                'booster': ['gbtree', 'gblinear', 'dart'],
                'max_depth': range(10, 51, 10),
                'n_estimators': [200],
                'nthread': [4],
                'min_child_weight': range(1, 8, 2),
                'learning_rate': [.05, .1, .15],
            }
        },
        {
            'name': 'Light GBM',
            'estimator': LGBMRegressor(),
            'hyperparameters':{
                'max_depth': range(20, 85, 15),
                'learning_rate': [.01, .05, .1],
                'num_leaves': [300, 600, 900, 1200],
                'n_estimators': [200]
            }
        },
        {
            'name': 'Cat Boost',
            'estimator': CatBoostRegressor(),
            'hyperparameters':{
                'depth': [4, 7, 10],
                'learning_rate': [.03, .06, .1, .15],
                'l2_leaf_reg': [1, 4, 7, 9],
                'iterations': [300]
            }
        }
        
    ]
    
    for model in tqdm(models):
        # print('\n', '-'*20, '\n', model['name'])
        start = time.perf_counter()
        grid = GridSearchCV(model['estimator'], param_grid=model['hyperparameters'], cv=5, scoring = "explained_variance", verbose=False, n_jobs=-1)
        grid.fit(X, Y)
        best_models[model['name']] = {'score': grid.best_score_, 'params': grid.best_params_}
        run = time.perf_counter() - start
        # print('accuracy: {}\n{} --{:.2f} seconds.'.format(str(grid.best_score_), str(grid.best_params_), run))
        
    return best_models

#best = select_model(X, y)
#best

The best model is GradientBoostingRegressor. Now let's test which feature list works best with it.

<a id='43'></a>
## 4.3 [Feature selection for the best model](#999)
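Besides explained variance, candidate feature lists can also be compared by cross-validated RMSE. Since `revenue` is log-transformed at this point, the RMSE on that scale corresponds to an RMSLE on the original revenue. The sketch below uses synthetic data and a hypothetical helper `cv_rmse`; the model settings are placeholders, not the tuned ones.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

def cv_rmse(df, features, target, cv=5):
    """Cross-validated RMSE of a small gradient boosting model on the given features."""
    model = GradientBoostingRegressor(n_estimators=50, random_state=0)
    scores = cross_val_score(model, df[features], df[target], cv=cv,
                             scoring='neg_mean_squared_error')
    return float(np.sqrt(-scores.mean()))

# Synthetic stand-in for the log-transformed data
rng = np.random.default_rng(0)
demo = pd.DataFrame({'budget': rng.uniform(10, 18, 300), 'noise': rng.normal(size=300)})
demo['revenue'] = demo['budget'] + rng.normal(scale=0.5, size=300)

candidates = {'with_budget': ['budget', 'noise'], 'noise_only': ['noise']}
results = {name: cv_rmse(demo, cols, 'revenue') for name, cols in candidates.items()}
best_name = min(results, key=results.get)   # lower RMSE is better
print(results, best_name)
```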

In [None]:
def get_accuracy(features):
    X, y = train[features], train['revenue']
    
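    # criterion='mse' was removed in newer scikit-learn releases; 'friedman_mse' is the
    # default and the value used in the grid search above
    model = GradientBoostingRegressor(criterion='friedman_mse', max_depth=6, max_features='sqrt', 
                                      min_samples_leaf=4, min_samples_split=9, n_estimators=110, loss='huber')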
    result = cross_validate(model, X, y, cv=10, scoring="explained_variance", verbose=False, n_jobs=-1)
    return np.mean(result['test_score'])

In [None]:
all_features = train.select_dtypes(include=['int64', 'float64', 'uint8', 'int8']).columns.tolist()
all_features.remove('id')
all_features.remove('revenue')

best_features = None
best_accuracy = 0

feature_candidates = [all_features, manually_selected_features, automatically_selected_features, corr_selected_features]
for flist in feature_candidates:
    acc = get_accuracy(flist)
    if acc > best_accuracy:
        best_accuracy = acc
        best_features = flist
        
print('The best accuracy is', best_accuracy)
best_features

<a id='5'></a>
# 5. [Train and Predict](#999)

In [None]:
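# criterion='mse' was removed in newer scikit-learn releases; 'friedman_mse' is the default
model = GradientBoostingRegressor(criterion='friedman_mse', max_depth=6, max_features='sqrt', 
                                      min_samples_leaf=4, min_samples_split=9, n_estimators=110, loss='huber')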
model.fit(train[best_features], train['revenue'])
predict = model.predict(test[best_features])


In [None]:
submit = pd.DataFrame({'id': test.id, 'revenue':np.expm1(predict)})
submit.to_csv('submission.csv', index=False)

The best score obtained with the model described so far was 2.10761. (The competition metric is RMSLE, so lower is better.)
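For reference, here is a minimal sketch of the RMSLE metric, which to our knowledge is what the leaderboard of this competition reports. The revenue values are made up. It also shows why training on `log1p(revenue)` fits the metric: the RMSE between log-transformed values is the RMSLE of the original values.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error between two arrays of non-negative values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Made-up revenues
actual = np.array([1_000_000.0, 50_000_000.0, 250_000.0])
predicted = np.array([2_000_000.0, 40_000_000.0, 250_000.0])

score = rmsle(actual, predicted)
# the same number, computed as a plain RMSE on the log1p scale
rmse_on_log_scale = float(np.sqrt(mean_squared_error(np.log1p(actual), np.log1p(predicted))))
print(score, rmse_on_log_scale)
```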



<a id='6'></a>
# [6. Further Research Work](#999)

Though I have done a fair amount of feature engineering/selection and model selection, this is only an introduction to the whole work, and the accuracy of my model is not high enough. There are still many gaps in this kernel, and I'd like to study the possible improvements one by one.

Possible parts to be improved are:

* **feature engineering**
    * there are lots of features to be created.
    * many missing values
    * outliers
* **feature selection**
    * more feature selection methods
    * consideration of validation of selected features
* **model selection**
    * other models including neural networks
    * how to improve the performance of a model (specialized for the selected model)
    * model parameter tuning

<a id='61'></a>
## [6.1 Can sub-prediction improve the overall accuracy?](#999)

We wonder if the overall accuracy can be improved by a sub-prediction, which means an internal prediction inside the system. The target of a sub-prediction might be the missing/outlier values of an existing feature or another abstract, intermediate feature.

Of course, there is no perfect predictor (if one existed, it would be an equation rather than a predictor), so these sub-predictions may lead to an accumulation of errors.

Take the `budget` feature in this competition, which is the most important feature for predicting `revenue`: nearly 25% of its values are 0, and these are not real budgets but effectively missing values. If we could guess these 0 budgets with more than 50% accuracy, could the overall accuracy improve even a little? Would guessing be better than taking them as they are?

It depends. Let's go for it.

------------------------------------------------------------------
We ran some experiments on predicting `budget` outside of this kernel. The process is very similar to this kernel itself, and we made a function that predicts `budget` values.

Here we used the GradientBoostingRegressor again, but not because I am a fan of GBR. GBR was already the model chosen earlier in this kernel, and in this competition the result of our model selection is always the Gradient Boosting Regressor. This seems a little strange to us, and it implicitly suggests that my approach to model selection has a problem. 
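The core of the idea can be sketched compactly on synthetic data: train a regressor on the rows with a known budget, predict the rows where the budget is 0, and write the predictions back with a boolean mask. The data, the feature names, and the model settings here are placeholders for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in: a log-scale budget that depends on two features
rng = np.random.default_rng(0)
demo = pd.DataFrame({'popularity': rng.uniform(0, 20, 400),
                     'cast': rng.integers(1, 60, 400)})
demo['budget'] = (10 + 0.3 * demo['popularity'] + 0.05 * demo['cast']
                  + rng.normal(scale=0.3, size=400))
# mark 25% of the budgets as "missing" by setting them to 0
demo.loc[demo.sample(frac=0.25, random_state=0).index, 'budget'] = 0.0

features = ['popularity', 'cast']
missing = demo['budget'] == 0

# sub-prediction: learn budget from the rows where it is known
sub_model = GradientBoostingRegressor(n_estimators=50, random_state=0)
sub_model.fit(demo.loc[~missing, features], demo.loc[~missing, 'budget'])

# write the predictions back only where the budget was missing
filled = demo.copy()
filled.loc[missing, 'budget'] = sub_model.predict(demo.loc[missing, features])
print(int(missing.sum()), 'budgets filled')
```

Writing back through the mask avoids looping over rows and leaves the known budgets untouched.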

In [None]:
def fill_budget(in_train, in_test):


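    def fb_proc_json_len(string):
        # only needed by the commented-out preprocessing steps below
        import ast  # local import keeps this helper self-contained
        try:
            # literal_eval only evaluates Python literals, so it is safer than eval
            data = ast.literal_eval(string)
            return len(data)
        except (ValueError, SyntaxError, TypeError):
            return 0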

    test = in_test.copy()
    test['revenue'] = np.nan
    total = pd.concat([in_train, test], axis=0)

    ###############################################################
    # this is for the time after data loading
    # here, in the last part of kernel, these steps are already done in the previous steps
    # so, comment them
    
    # total.drop(['belongs_to_collection', 'homepage', 'imdb_id', 'poster_path', 'original_title', 'overview', 'status', 'tagline', 'title'], axis=1, inplace=True)    
    # total.loc[total.runtime.isnull(), 'runtime'] = 0
    # total.loc[total.release_date.isnull(), 'release_date'] = total.release_date.mode()[0]

    # total['genre_count'] = total.genres.apply(fb_proc_json_len)
    # total['company_count'] = total.production_companies.apply(fb_proc_json_len)
    # total.cast = total.cast.apply(fb_proc_json_len)
    # total.crew = total.crew.apply(fb_proc_json_len)

    # total.release_date = pd.to_datetime(total.release_date)
    # total['release_year'] = total.release_date.dt.year
    # total['release_month'] = total.release_date.dt.month
    ################################################################

    # these are the result of a small feature selection.
    fb_features = ['popularity', 'runtime', 'cast', 'crew', 'genre_count', 
                   'company_count', 'release_year', 'release_month']

    fb_target = 'budget'

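    # rows with a known budget are used for fitting; rows with a zero budget are predicted.
    train = total[total.budget > 0]
    pred = total[total.budget == 0].copy()  # copy, so the assignment below does not write to a view

    # and this is the result of a small model selection.
    # note: scikit-learn >= 1.0 renames criterion='mse' to 'squared_error'.
    model = GradientBoostingRegressor(criterion='mse', learning_rate=0.15, loss='huber', max_depth=6, 
                                  max_features='sqrt', min_samples_leaf=4, min_samples_split=9, n_estimators=110)
    model.fit(train[fb_features], train[fb_target])

    pred[fb_target] = model.predict(pred[fb_features])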

    out_train = in_train.copy()
    out_test = in_test.copy()

    # copy the predicted budgets back into the zero-budget rows, matching on `id`.
    predicted_budget = pred.set_index('id')[fb_target]
    for out in (out_train, out_test):
        missing = out.budget == 0
        out.loc[missing, 'budget'] = out.loc[missing, 'id'].map(predicted_budget)

    return (out_train, out_test)

- Predict the missing (zero) `budget` values

In [None]:
# train, test = fill_budget(train, test)

- Predict `revenue` (same as Chapter 5)

In [None]:
# model = GradientBoostingRegressor(criterion='mse', max_depth=6, max_features='sqrt', 
#                                   min_samples_leaf=4, min_samples_split=9, n_estimators=110, loss='huber')
# model.fit(train[best_features], train['revenue'])
# predict = model.predict(test[best_features])

# submit = pd.DataFrame({'id': test.id, 'revenue':np.expm1(predict)})
# submit.to_csv('submission.csv', index=False)

We tested this sub-prediction, but the result was poor: it lowered the accuracy.

**Sub-prediction doesn't improve the overall accuracy.**
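
A local check of this kind can be sketched as follows: hide some budgets, fill them with a sub-model, and compare the cross-validated error of the revenue model with and without the filled values. This is a minimal sketch on synthetic data; the data-generating formulas, the 30% missing rate, and the model settings are all assumptions, so the printed numbers say nothing about the TMDB result.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
n = 400

# Synthetic stand-in data (assumption): budget depends on two features,
# and the log-scaled revenue depends on budget and popularity.
df = pd.DataFrame({'popularity': rng.gamma(2.0, 5.0, n),
                   'runtime': rng.normal(100, 15, n)})
df['budget'] = np.exp(0.05 * df.popularity + 0.01 * df.runtime
                      + rng.normal(0, 0.3, n))
df['revenue'] = np.log1p(3 * df.budget + df.popularity + rng.gamma(1.0, 1.0, n))

# Hide roughly 30% of the budgets as zeroes.
missing = rng.rand(n) < 0.3
df.loc[missing, 'budget'] = 0.0

# Sub-prediction: fit on rows with a known budget, predict the hidden ones.
aux = ['popularity', 'runtime']
imputer = GradientBoostingRegressor(n_estimators=50, random_state=0)
imputer.fit(df.loc[~missing, aux], df.loc[~missing, 'budget'])
filled = df.copy()
filled.loc[missing, 'budget'] = imputer.predict(df.loc[missing, aux])


def cv_rmse(frame):
    """5-fold cross-validated RMSE of a revenue model on the given frame."""
    model = GradientBoostingRegressor(n_estimators=50, random_state=0)
    scores = cross_val_score(model, frame[['budget'] + aux], frame['revenue'],
                             cv=5, scoring='neg_mean_squared_error')
    return float(np.sqrt(-scores.mean()))


rmse_zero, rmse_filled = cv_rmse(df), cv_rmse(filled)
print('RMSE with zero budgets:    %.4f' % rmse_zero)
print('RMSE with imputed budgets: %.4f' % rmse_filled)
```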

<a id='62'></a>
## [6.2 Forward Feature Selection](#999)
We thought hard about how to improve the feature selection and came up with two approaches: forward feature selection and backward feature selection. We soon found that both had already been invented and studied by others. They are known as Sequential Forward Selection (SFS) and Sequential Backward Selection (SBS), and they are described in detail in "[A survey on feature selection methods](http://romisatriawahono.net/lecture/rm/survey/machine%20learning/Chandrashekar%20-%20Feature%20Selection%20Methods%20-%202014.pdf)".

We implemented and tested both methods and found that SFS (adding features one at a time) helps, but SBS does not.

Here, we introduce the SFS method.
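
To make the idea concrete, here is a from-scratch sketch of greedy forward selection: start with no features and, at each step, add the one that gives the best cross-validated score. The helper name `forward_select`, the linear model, and the synthetic data are assumptions for illustration; this is not the mlxtend implementation and it omits the floating variant.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def forward_select(estimator, X, y, max_features, cv=5,
                   scoring='neg_mean_squared_error'):
    """Greedy forward selection: repeatedly add the column with the best CV score."""
    selected, history = [], []
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < max_features:
        # Score each remaining column when added to the current subset.
        trial_scores = {
            col: cross_val_score(estimator, X[:, selected + [col]], y,
                                 cv=cv, scoring=scoring).mean()
            for col in remaining
        }
        best_col = max(trial_scores, key=trial_scores.get)
        selected.append(best_col)
        remaining.remove(best_col)
        history.append((tuple(selected), trial_scores[best_col]))
    return selected, history


# Synthetic data (assumption): only 3 of the 8 columns carry signal.
X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       noise=5.0, shuffle=False, random_state=0)
selected, history = forward_select(LinearRegression(), X, y, max_features=4)

# Each line shows the subset after one step and its mean negative MSE.
for subset, score in history:
    print(subset, round(score, 2))
```

Each step costs one cross-validation run per remaining feature, so the total cost grows roughly with the number of candidates times the number of features selected.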

### Forward Feature Selection (SFS)

In [None]:
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs

candidates = train.select_dtypes(include=['int64', 'float64', 'uint8', 'int8']).columns.tolist()
candidates.remove('id')
candidates.remove('revenue')

X, y = train[candidates], train['revenue']

model = GradientBoostingRegressor(criterion='mse', max_depth=6, max_features='sqrt', 
                                  min_samples_leaf=4, min_samples_split=9, n_estimators=110, loss='huber')
    
sfs = SFS(estimator=model, 
           k_features=(3, 9),
           forward=True, 
           floating=False, 
           scoring='neg_mean_squared_error',
           cv=5)

sfs.fit(X, y, custom_feature_names=candidates)

# the score is the negative MSE (scoring='neg_mean_squared_error'), not an accuracy
print('best combination (neg. MSE: %.3f): %s\n' % (sfs.k_score_, sfs.k_feature_idx_))
print('all subsets:\n', sfs.subsets_)

fig = plot_sfs(sfs.get_metric_dict(), kind='std_err')
plt.title('Sequential Forward Selection (w. StdErr)')
plt.grid()
plt.show()



In [None]:
# features chosen by the forward selection above
sfs_features = ['budget', 'popularity', 'runtime', 'tagline_len', 'release_year', 'genre_Drama', 'genre_Family', 'genre_Thriller', 'genre_Crime']
model = GradientBoostingRegressor(criterion='friedman_mse', max_depth=5, max_features='sqrt', 
                                      min_samples_leaf=4, min_samples_split=3, n_estimators=110, random_state=1)
model.fit(train[sfs_features], train['revenue'])
predict = model.predict(test[sfs_features])

submit = pd.DataFrame({'id': test.id, 'revenue':np.expm1(predict)})
submit.to_csv('submission.csv', index=False)


This resulted in a small improvement in the public score. SBS, however, gave us no improvement.
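
For reference, backward selection works in the opposite direction: it starts from all features and removes one at a time. The sketch below uses scikit-learn's `SequentialFeatureSelector` (available from scikit-learn 0.24) with `direction='backward'` on synthetic data. The estimator, the data, and the target of 4 features are assumptions, and this is not the SBS run from our experiments.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): 8 columns, 3 of which carry signal.
X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       noise=5.0, random_state=0)

# Start from all 8 columns and drop one per step until 4 remain.
sbs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=4,
                                direction='backward',
                                scoring='neg_mean_squared_error', cv=5)
sbs.fit(X, y)

kept = sbs.get_support(indices=True)
print('columns kept by backward selection:', kept.tolist())
```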

**Our research goes on...**