# Code Template for Revenue-Prediction (using a Simple Regression)
This template is meant as a quick start for more detailed projects. The example discards most of the available features, so expect a poor score; it is intended only as a starting point for your own kernel.

## 1/4 Import Modules and Dataset
We start by importing pandas, NumPy and a few scikit-learn helpers, and then load the datasets.

In [5]:
# 1.) Import python modules
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_log_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# 2.) Import datasets
original_df_trainval = pd.read_csv("../input/train.csv")
original_df_test_X = pd.read_csv("../input/test.csv")

# 3.) Output the first rows of one of the datasets
original_df_trainval.head(2)

Unnamed: 0,id,belongs_to_collection,budget,genres,homepage,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,runtime,spoken_languages,status,tagline,title,Keywords,cast,crew,revenue
0,1,"[{'id': 313576, 'name': 'Hot Tub Time Machine ...",14000000,"[{'id': 35, 'name': 'Comedy'}]",,tt2637294,en,Hot Tub Time Machine 2,"When Lou, who has become the ""father of the In...",6.575393,/tQtWuwvMf0hCc2QR2tkolwl7c3c.jpg,"[{'name': 'Paramount Pictures', 'id': 4}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",2/20/15,93.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,The Laws of Space and Time are About to be Vio...,Hot Tub Time Machine 2,"[{'id': 4379, 'name': 'time travel'}, {'id': 9...","[{'cast_id': 4, 'character': 'Lou', 'credit_id...","[{'credit_id': '59ac067c92514107af02c8c8', 'de...",12314651
1,2,"[{'id': 107674, 'name': 'The Princess Diaries ...",40000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,tt0368933,en,The Princess Diaries 2: Royal Engagement,Mia Thermopolis is now a college graduate and ...,8.248895,/w9Z7A0GHEhIp7etpj0vyKOeU1Wx.jpg,"[{'name': 'Walt Disney Pictures', 'id': 2}]","[{'iso_3166_1': 'US', 'name': 'United States o...",8/6/04,113.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,It can take a lifetime to find true love; she'...,The Princess Diaries 2: Royal Engagement,"[{'id': 2505, 'name': 'coronation'}, {'id': 42...","[{'cast_id': 1, 'character': 'Mia Thermopolis'...","[{'credit_id': '52fe43fe9251416c7502563d', 'de...",95149435


## 2/4 Prepare Data
We need to prepare our test and training data. Usually, implementing this takes a lot of time, but for this simple example we will just remove features that would be too complicated to preprocess.
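
Step d.) in the cell below fills missing `runtime` values with 0. One of the alternatives its comments mention, an additional `is_missing` column, might look roughly like the following sketch. The toy values and the column name `runtime_is_missing` are assumptions for illustration and are not part of the original kernel.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); the runtime of the second row is missing
toy = pd.DataFrame({"runtime": [93.0, np.nan, 113.0]}, index=[1, 2, 3])

# Record which rows were missing before filling them
toy["runtime_is_missing"] = toy["runtime"].isna().astype(int)

# Fill the gap with the median of the observed values instead of 0
toy["runtime"] = toy["runtime"].fillna(toy["runtime"].median())

print(toy)
```

In a real pipeline the fill value would normally be computed on the training rows only and then reused for the validation and test rows.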

In [6]:
# This function will be called later to prepare our input data
def prepare_data(df):
    # a.) Use the `id` feature as the index column of the data frame
    df = df.set_index('id')

    # b.) Only use easy to process features
    #  Warning: huge information loss here, you should probably include more features in your production code.
    df = df[['budget', 'original_language' ,'popularity', 'runtime', 'status']]
    
    # c.) One-Hot-Encoding for all nominal data
    df = pd.get_dummies(df)
    
    # d.) The `runtime` feature is not filled in 2 of the rows. We replace those empty cells / NaN values with a 0.
    #  Warning: in production code, please use a better method to deal with missing cells like interpolation or additional `is_missing` feature columns.
    return df.fillna(0)


# 1.) Extract the target variable `revenue` and use the `id` column as index of that data frame
df_trainval_y = original_df_trainval[['id','revenue']].set_index('id')

# 2.) Prepare the training and test data by using the function we defined above
df_trainval_X = prepare_data(original_df_trainval)
df_test_X  = prepare_data(original_df_test_X)

# 3.) Create columns in train/test dataframes if they only exist in one of them (can happen through one hot encoding / get_dummies)
#  Example: There are no status=`Post Production` entries in the training set, but there are some in the test set.
df_trainval_X, df_test_X = df_trainval_X.align(df_test_X, join='outer', axis=1, fill_value=0)

# 4.) Show the first rows of one of the prepared tables
df_trainval_X.head(2)

id,budget,original_language_af,original_language_ar,original_language_bm,original_language_bn,original_language_ca,original_language_cn,original_language_cs,original_language_da,original_language_de,original_language_el,original_language_en,original_language_es,original_language_fa,original_language_fi,original_language_fr,original_language_he,original_language_hi,original_language_hu,original_language_id,original_language_is,original_language_it,original_language_ja,original_language_ka,original_language_kn,original_language_ko,original_language_ml,original_language_mr,original_language_nb,original_language_nl,original_language_no,original_language_pl,original_language_pt,original_language_ro,original_language_ru,original_language_sr,original_language_sv,original_language_ta,original_language_te,original_language_th,original_language_tr,original_language_ur,original_language_vi,original_language_xx,original_language_zh,popularity,runtime,status_Post Production,status_Released,status_Rumored
1,14000000,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6.575393,93.0,0,1,0
2,40000000,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8.248895,113.0,0,1,0
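
To see what the `align` call in step 3.) does, here is a minimal sketch on two toy frames. The frames and their values are made up for illustration; only the `align` arguments are taken from the cell above.

```python
import pandas as pd

# Toy frames (hypothetical values): each one has a status the other lacks
train = pd.get_dummies(pd.DataFrame({"status": ["Released", "Rumored"]}))
test = pd.get_dummies(pd.DataFrame({"status": ["Released", "Post Production"]}))

# Outer join on the column axis: both frames end up with the union of all
# columns, and a column that was missing in one frame is filled with 0
train_aligned, test_aligned = train.align(test, join="outer", axis=1, fill_value=0)

print(list(train_aligned.columns))
print(list(test_aligned.columns))
```

After alignment both frames have identical columns in identical order, which matters because the model only sees the raw values and not the column names.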


## 3/4 Predict Values (k-Nearest-Neighbors Regression)
In this example we will use a k-nearest-neighbors regression model (`KNeighborsRegressor`) to predict the target value (revenue).
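
The cell below trains on a log-transformed, min-max-scaled target and later inverts both steps in `inverseY`. The following sketch checks that round trip on toy revenues; the numbers and variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy revenues (hypothetical values) as a column vector, shaped like `y_train`
y = np.array([[1_000.0], [50_000.0], [12_000_000.0]])

# Forward: log-transform, scale the result into [0, 1], flatten to 1d
scaler = MinMaxScaler((0, 1))
y_scaled = scaler.fit_transform(np.log(y)).ravel()

# Backward: restore the column shape, undo the scaling, undo the logarithm
y_back = np.exp(scaler.inverse_transform(y_scaled.reshape(-1, 1)))

print(y_scaled.min(), y_scaled.max())
print(np.allclose(y_back, y))
```

Note that `np.log` requires strictly positive revenues; a target containing zeros would need `np.log1p` and `np.expm1` instead.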

In [7]:
# 1.) Remove table meta data, column names etc. → Just use values for prediction.
X_trainval = df_trainval_X.values
y_trainval = df_trainval_y.values

X_test  = df_test_X.values

# 2.) Create Validation Split
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.5, random_state=56)

# 3.) Scale
X_scaler = StandardScaler()
X_train_scaled  = X_scaler.fit_transform(X_train)
X_val_scaled    = X_scaler.transform(X_val)
X_test_scaled   = X_scaler.transform(X_test)

y_scaler = MinMaxScaler((0,1))
# log-transform y, scale it to [0, 1], and flatten the column vector to a 1d array with ravel
y_train_scaled  = y_scaler.fit_transform(np.log(y_train)).ravel()
#y_val_scaled  = y_scaler.transform(np.log(y_val)).ravel() #not used but here for consistency

# 4.) Fit the k-nearest-neighbors regressor / "Train"
reg     = KNeighborsRegressor().fit(X_train_scaled, y_train_scaled)

# 5.) Define functions to calculate a score
def score_function(y_true, y_pred):
    # see https://www.kaggle.com/c/tmdb-box-office-prediction/overview/evaluation
    # we use Root Mean squared logarithmic error (RMSLE) regression loss
    assert len(y_true) == len(y_pred)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true))**2))

def score_function2(y_true, y_pred):
    # alternative implementation
    y_pred = np.where(y_pred>0, y_pred, 0)
    return np.sqrt(mean_squared_log_error(y_true, y_pred))

def inverseY(y):
    return np.exp(y_scaler.inverse_transform(np.reshape(y, (-1,1))))

# 6.) Apply the regression model on the prepared train, validation and test set and invert the logarithmic scaling
y_train_pred  = inverseY(reg.predict(X_train_scaled))
y_val_pred    = inverseY(reg.predict(X_val_scaled))
y_test_pred   = inverseY(reg.predict(X_test_scaled))
                   
# 7.) Print the RMSLE on the training and validation set. It should be as low as possible
print("RMSLE on Training Dataset:\t", score_function(y_train , y_train_pred), score_function2(y_train, y_train_pred))
print("RMSLE on Val Dataset:\t", score_function(y_val , y_val_pred), score_function2(y_val , y_val_pred))
print("RMSLE on Test Dataset:\t Check by submitting on kaggle")

RMSLE on Training Dataset:	 2.1159215354096195 2.1159215354096195
RMSLE on Val Dataset:	 2.3979569481101604 2.3979569481101604
RMSLE on Test Dataset:	 Check by submitting on kaggle
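
To get a feeling for the scale of these numbers, the following sketch evaluates the same formula as `score_function` on toy values (the values and the function name `rmsle` are assumptions for illustration). A prediction that is off by a constant factor of 10 gives a score close to ln(10) ≈ 2.30, so a validation score of about 2.4 is in the range of predictions that miss by roughly an order of magnitude.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, same formula as score_function."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Toy revenues (hypothetical values)
y_true = np.array([100.0, 1_000.0, 10_000.0])

print(rmsle(y_true, y_true))       # perfect prediction
print(rmsle(y_true, y_true * 10))  # every prediction 10x too high
print(rmsle(y_true, y_true / 10))  # every prediction 10x too low
print(np.log(10))                  # reference value for a factor of 10
```

Because the error is computed on logarithms, it penalizes relative rather than absolute deviations, so a small film and a blockbuster that are both mispredicted by the same factor contribute about equally.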


## 4/4 Convert Prediction to submittable CSV file
To get our test score, we need to write our predictions to a comma-separated values (CSV) file, which we can upload to Kaggle [here](https://www.kaggle.com/c/tmdb-box-office-prediction/data).

In [8]:
# 1.) Add the predicted values to the original test data
df_test = original_df_test_X.assign(revenue=y_test_pred.ravel())  # flatten the (n, 1) predictions to 1d

# 2.) Extract a table of ids and their revenue predictions
df_test_y = df_test[['id','revenue']].set_index('id')

# 3.) Save that table to a csv file. On Kaggle, the file will be visible in the "output" tab if the kernel has been committed at least once.
df_test_y.to_csv("submission.csv")

# 4.) Output the head of our file here to check if it looks good :)
pd.read_csv("submission.csv").head(5)

Unnamed: 0,id,revenue
0,3001,11831950.0
1,3002,220524.4
2,3003,3704143.0
3,3004,2011876.0
4,3005,47287.37


## That's it!
I hope you liked this basic template. If you have any suggestions on how to improve this kernel, feel free to write a comment.

If this kernel helped you get started on your own data science project, please leave an upvote :)