In [1]:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
from patsy import dmatrix
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_predict, cross_val_score
from sklearn.impute import KNNImputer
from sklearn.model_selection import KFold, train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import export_graphviz 
from six import StringIO
from IPython.display import Image  
import pydotplus
import time as tm
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.ensemble import BaggingRegressor,BaggingClassifier
import warnings
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, precision_recall_curve, auc, make_scorer, recall_score, accuracy_score, precision_score, confusion_matrix
from sklearn.ensemble import BaggingRegressor,BaggingClassifier,RandomForestRegressor,RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from xgboost import XGBRegressor
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.experimental import enable_iterative_imputer  
from sklearn.impute import IterativeImputer

# How this model works

This model was created in order to introduce bias to the base models of the stacked model. Each of these models individually will not offer a great solution, but a stacked model with these as the base will allow the overall model to work off the strengths of the individual models. 

Four feature sets were used: 
The top predictors from a random forest model, with the number of features being decided by a forward step-wise 5-fold cv process (not in this code), where the most significant model was added first and the RMSE on the train data was calculated from there. 

The top predictors from a MARS model:
These were the significant predictors in a MARS model where X predicts log_y (no cv and the code is not present as I could only get it to work on Google Collab). 

Top 40 predictors from kbest:
I wanted to add another predictor subset, but was sttuggling to think of other ways. K-best is a function from sklearn, and provided different features from the MARS and RF best predictors so I decided to use it. 

The entire predictor set:
This was used to train a catboost regressor on the entire dataset, which is relatively quick and provided higher accuracy for the meta model to learn from. 

## How to use the pipeline 

In [166]:
rf_pipe = Pipeline([
    ('column_transformer', ColumnTransformer([('rf_transform', 'passthrough', X_top_kbest.columns)], remainder='drop')),
    ('rf', rf_model)
])

The above code is used in the actual model, and is used as an example here. Pipeline is also a feature from sklearn that allows multiple functions to be executed in order. This can be used to streamline many processes. 

In this case, there are two functions. The first is the 'column_transformer,' which looking back, I probably should have created a unique name for every transformation. This works by using ColumnTransformer, which takes the subset of columns specified, in this case X_top_kbest.columns. The 'passthrough' indicates that these are the columns to be worked on, with the remainder being dropped. 'rf' is the name in the pipeline I gave the rf_model. I gave unique names to every model depending on what subset they were working on. e.g. 'rf1' corresponds with the MARS predictors. 

# What the model does overall

As mentioned, the purpose of the model is to introduce bias to the base models, which will be overcome by the meta model. A good way to introduct bias is by using different predictors, and different models, which is what I did. I suspect that using other, more different models could work well, but I was not succesful in my attempts. This is a stacking regressor, which is different from a voting one. The difference is that there is a 'meta model' that corrects the error for the base models, by learning from the RMSE (or another specified loss function) on the base models. The meta model regressor I used was catboost, as it is the most accurate (I tried many other ways, and I thought that maybe ridge regression would work well, but it did not). 

The specifics of the model are as follows: (Note, I have lightboost set up to be used in the model, but removing it actually made the model better. I suspect this is because catboost is a very similar model, but provides a lower RMSE)

The same, untuned models are run on each predictor subset. The models are random forest, catboost, and adaboost. These are all ran on the MARS subset, the rf subset, and the kbest subset. The catboost model is the only model ran on the entire dataset. 15 fold cross validation is used, which takes around 20-30 minutes to run. This model is not too slow because most of the models are run only on 20-40 predictors. It seems that the higher the k-fold, the better the models, but this is not true as the model gets worse after 15-fold cv. I suspect this is because the model will be testing on very little data for each split, which will not help RMSE, after a point. 

The exact same model was also ran on a 13-fold cv. The purpose of this was to once again introduce bias. This introduces bias because this will create two different models, each with innacuracies of their own. By themselves, these models are still extremely useful. 

These models were then combined by simply averaging the output, giving a low RMSE. 

# These models are not precisely reproducable 

Because these models are untuned, they will give slightly different output every time they are run. To combat this, the models could be tuned for reproducibility, but that could take a very long time. On my first try I got my best RMSE, so it does not take very long. In my case, I saved the model by using pickle, a library to save models. The exact model I used can be uploaded at the very bottom of this under the 'upload' section. 

# Train Data Cleaning

In [2]:
train = pd.read_csv('train.csv')

In [3]:
y = train.y
x = train.drop("y", axis=1)
x = x.drop("id", axis=1)

In [4]:
x = x.apply(pd.to_numeric, errors='coerce')

In [5]:
y = y.apply(pd.to_numeric, errors='coerce')

In [6]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(x), columns=x.columns)

In [7]:
imputer = KNNImputer(n_neighbors=7)

In [8]:
X = pd.DataFrame(imputer.fit_transform(X_scaled), columns=x.columns)

In [9]:
log_y = np.log(y)

# Get top predictors from untuned random forest

In [10]:
model = RandomForestRegressor(n_estimators = 100)
model.fit(X,log_y)

In [11]:
top_50_rf = model.feature_importances_.argsort()[-50:][::-1]

In [12]:
top_50_rf = pd.DataFrame(top_50_rf)

In [13]:
top_50_rf.columns = ['predictor']

## I computed the best MARS features on google collab and am importing it as a csv

In [14]:
MARS_features = pd.read_csv('features.csv')

In [15]:
filtered_features = top_50_rf[~top_50_rf['predictor'].isin(MARS_features)]
new_df = filtered_features[['predictor']]

## Above code should only take predictors from the random forest subset that are not present in MARS, I am not sure if it worked, however, so there might be some overlap

In [16]:
rf_top_30 = new_df[-30:][::-1]

In [17]:
top_predictors = rf_top_30['predictor']
X_top_rf = X.iloc[:,top_predictors]

In [18]:
top_predictors_MARS = MARS_features['predictor']
X_top_MARS = X[top_predictors_MARS]

# Get further subset of features by selecting kbest

In [24]:
from sklearn.feature_selection import SelectKBest, f_regression

# Assuming X is your feature set and log_y is your target variable
selector = SelectKBest(f_regression, k=40)
X_new = selector.fit_transform(X, log_y)

# Get the indices sorted by most important to least important
indices = np.argsort(selector.scores_)[::-1]

# Get the names of the top 40 features
X_top_kbest = []
for i in range(40):
    X_top_kbest.append(X.columns[indices[i]])

In [26]:
X_top_kbest = X[X_top_kbest]

## Test Cleaning

In [19]:
test = pd.read_csv('test.csv')

In [20]:
test_x = test.drop("id", axis=1)
test_x = test_x.apply(pd.to_numeric, errors='coerce')

In [21]:
scaler = StandardScaler()
test_X_scaled = pd.DataFrame(scaler.fit_transform(test_x), columns=test_x.columns)

In [22]:
imputer_KNN = KNNImputer(n_neighbors=7)
test_x = pd.DataFrame(imputer_KNN.fit_transform(test_X_scaled), columns=test_x.columns)

# Create Base Models

In [41]:
from catboost import CatBoostRegressor
cat = CatBoostRegressor(verbose=False, random_seed=403)

In [38]:
rf_model = RandomForestRegressor()

In [39]:
from lightgbm import LGBMRegressor
light_model = LGBMRegressor(verbose=-1)

In [40]:
from sklearn.ensemble import BaggingRegressor,BaggingClassifier,AdaBoostRegressor,AdaBoostClassifier
ada_model = AdaBoostRegressor()

# First Model

In [98]:
X_train, X_test, y_train_log, y_test_log = train_test_split(X, log_y, test_size = 0.2, random_state = 8)

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
import numpy as np

#kbest
cat_pipe = Pipeline([
    ('column_transformer', ColumnTransformer([('cat_transform', 'passthrough', X_top_kbest.columns)], remainder='drop')),
    ('cat', cat)
])

rf_pipe = Pipeline([
    ('column_transformer', ColumnTransformer([('rf_transform', 'passthrough', X_top_kbest.columns)], remainder='drop')),
    ('rf', rf_model)
])

ada_pipe = Pipeline([
    ('column_transformer', ColumnTransformer([('ada_transform', 'passthrough', X_top_kbest.columns)], remainder='drop')),
    ('ada', ada_model)
])

light_pipe = Pipeline([
    ('column_transformer', ColumnTransformer([('light_transform', 'passthrough', X_top_kbest.columns)], remainder='drop')),
    ('light', light_model)
])

cat_pipe1 = Pipeline([
    ('column_transformer', ColumnTransformer([('cat_transform1', 'passthrough', X_top_MARS.columns)], remainder='drop')),
    ('cat1', cat)
])

rf_pipe1 = Pipeline([
    ('column_transformer', ColumnTransformer([('rf_transform1', 'passthrough', X_top_MARS.columns)], remainder='drop')),
    ('rf1', rf_model)
])

ada_pipe1 = Pipeline([
    ('column_transformer', ColumnTransformer([('ada_transform1', 'passthrough', X_top_MARS.columns)], remainder='drop')),
    ('ada1', ada_model)
])

light_pipe1 = Pipeline([
    ('column_transformer', ColumnTransformer([('light_transform1', 'passthrough', X_top_MARS.columns)], remainder='drop')),
    ('light1', light_model)
])

cat_pipe2 = Pipeline([
    ('column_transformer', ColumnTransformer([('cat_transform2', 'passthrough', X_top_rf.columns)], remainder='drop')),
    ('cat2', cat)
])

rf_pipe2 = Pipeline([
    ('column_transformer', ColumnTransformer([('rf_transform2', 'passthrough', X_top_rf.columns)], remainder='drop')),
    ('rf2', rf_model)
])

ada_pipe2 = Pipeline([
    ('column_transformer', ColumnTransformer([('ada_transform2', 'passthrough', X_top_rf.columns)], remainder='drop')),
    ('ada2', ada_model)
])

light_pipe2 = Pipeline([
    ('column_transformer', ColumnTransformer([('light_transform2', 'passthrough', X_top_rf.columns)], remainder='drop')),
    ('light2', light_model)
])

cat_pipe3 = Pipeline([
    ('column_transformer', ColumnTransformer([('cat_transform3', 'passthrough', X.columns)], remainder='drop')),
    ('cat3', cat)
])

en_new = StackingRegressor(estimators = [('cat', cat_pipe),('rf', rf_pipe),('ada', ada_pipe), 
                                     ('cat1', cat_pipe1),('rf1', rf_pipe1),('ada1', ada_pipe1),
                                     ('cat2', cat_pipe2),('rf2', rf_pipe2),('ada2', ada_pipe2),
                                     ('cat3', cat_pipe3)],
                     final_estimator=CatBoostRegressor(),                                          
                    cv = KFold(n_splits = 15, shuffle = True, random_state=1))

In [99]:
en_new.fit(X_train, y_train_log)

Learning rate set to 0.051562
0:	learn: 0.9740167	total: 1.86ms	remaining: 1.86s
1:	learn: 0.9519246	total: 3.33ms	remaining: 1.66s
2:	learn: 0.9319381	total: 24.7ms	remaining: 8.21s
3:	learn: 0.9119546	total: 30.4ms	remaining: 7.56s
4:	learn: 0.8941058	total: 33.9ms	remaining: 6.75s
5:	learn: 0.8775318	total: 35.6ms	remaining: 5.9s
6:	learn: 0.8623498	total: 37.4ms	remaining: 5.3s
7:	learn: 0.8485120	total: 39.6ms	remaining: 4.91s
8:	learn: 0.8362195	total: 42.9ms	remaining: 4.72s
9:	learn: 0.8236334	total: 44.6ms	remaining: 4.41s
10:	learn: 0.8120381	total: 48.4ms	remaining: 4.35s
11:	learn: 0.8013524	total: 50.1ms	remaining: 4.13s
12:	learn: 0.7919514	total: 51.6ms	remaining: 3.92s
13:	learn: 0.7826593	total: 52.9ms	remaining: 3.73s
14:	learn: 0.7743578	total: 54.3ms	remaining: 3.57s
15:	learn: 0.7666741	total: 55.9ms	remaining: 3.44s
16:	learn: 0.7604175	total: 57.2ms	remaining: 3.31s
17:	learn: 0.7542868	total: 58.7ms	remaining: 3.21s
18:	learn: 0.7479252	total: 60.3ms	remaining: 

252:	learn: 0.6153457	total: 587ms	remaining: 1.73s
253:	learn: 0.6149339	total: 589ms	remaining: 1.73s
254:	learn: 0.6146677	total: 591ms	remaining: 1.73s
255:	learn: 0.6144093	total: 593ms	remaining: 1.72s
256:	learn: 0.6139833	total: 594ms	remaining: 1.72s
257:	learn: 0.6135536	total: 596ms	remaining: 1.71s
258:	learn: 0.6130873	total: 598ms	remaining: 1.71s
259:	learn: 0.6126978	total: 599ms	remaining: 1.7s
260:	learn: 0.6121975	total: 602ms	remaining: 1.71s
261:	learn: 0.6120109	total: 605ms	remaining: 1.7s
262:	learn: 0.6118542	total: 606ms	remaining: 1.7s
263:	learn: 0.6116968	total: 608ms	remaining: 1.69s
264:	learn: 0.6112500	total: 609ms	remaining: 1.69s
265:	learn: 0.6109600	total: 611ms	remaining: 1.69s
266:	learn: 0.6105668	total: 612ms	remaining: 1.68s
267:	learn: 0.6101790	total: 614ms	remaining: 1.68s
268:	learn: 0.6096914	total: 616ms	remaining: 1.67s
269:	learn: 0.6095043	total: 619ms	remaining: 1.67s
270:	learn: 0.6092946	total: 621ms	remaining: 1.67s
271:	learn: 0.6

444:	learn: 0.5614884	total: 1.18s	remaining: 1.47s
445:	learn: 0.5611528	total: 1.18s	remaining: 1.47s
446:	learn: 0.5610134	total: 1.19s	remaining: 1.47s
447:	learn: 0.5608533	total: 1.19s	remaining: 1.46s
448:	learn: 0.5606738	total: 1.19s	remaining: 1.46s
449:	learn: 0.5603978	total: 1.19s	remaining: 1.46s
450:	learn: 0.5600801	total: 1.19s	remaining: 1.45s
451:	learn: 0.5600013	total: 1.19s	remaining: 1.45s
452:	learn: 0.5597783	total: 1.2s	remaining: 1.45s
453:	learn: 0.5595730	total: 1.2s	remaining: 1.44s
454:	learn: 0.5592732	total: 1.2s	remaining: 1.44s
455:	learn: 0.5590758	total: 1.21s	remaining: 1.44s
456:	learn: 0.5586806	total: 1.21s	remaining: 1.43s
457:	learn: 0.5585551	total: 1.21s	remaining: 1.43s
458:	learn: 0.5583428	total: 1.22s	remaining: 1.43s
459:	learn: 0.5579254	total: 1.3s	remaining: 1.52s
460:	learn: 0.5576562	total: 1.3s	remaining: 1.52s
461:	learn: 0.5573748	total: 1.3s	remaining: 1.51s
462:	learn: 0.5570204	total: 1.3s	remaining: 1.51s
463:	learn: 0.55680

631:	learn: 0.5182739	total: 1.79s	remaining: 1.04s
632:	learn: 0.5180729	total: 1.79s	remaining: 1.04s
633:	learn: 0.5177337	total: 1.79s	remaining: 1.03s
634:	learn: 0.5174252	total: 1.79s	remaining: 1.03s
635:	learn: 0.5171039	total: 1.79s	remaining: 1.03s
636:	learn: 0.5169432	total: 1.8s	remaining: 1.02s
637:	learn: 0.5166679	total: 1.8s	remaining: 1.02s
638:	learn: 0.5163846	total: 1.8s	remaining: 1.02s
639:	learn: 0.5162315	total: 1.8s	remaining: 1.01s
640:	learn: 0.5161178	total: 1.8s	remaining: 1.01s
641:	learn: 0.5159509	total: 1.8s	remaining: 1.01s
642:	learn: 0.5157021	total: 1.81s	remaining: 1s
643:	learn: 0.5154267	total: 1.81s	remaining: 1s
644:	learn: 0.5153385	total: 1.81s	remaining: 997ms
645:	learn: 0.5151482	total: 1.81s	remaining: 993ms
646:	learn: 0.5150705	total: 1.81s	remaining: 990ms
647:	learn: 0.5147841	total: 1.81s	remaining: 986ms
648:	learn: 0.5146593	total: 1.82s	remaining: 983ms
649:	learn: 0.5143785	total: 1.82s	remaining: 980ms
650:	learn: 0.5142754	to

792:	learn: 0.4843690	total: 2.2s	remaining: 574ms
793:	learn: 0.4839981	total: 2.2s	remaining: 571ms
794:	learn: 0.4836229	total: 2.2s	remaining: 568ms
795:	learn: 0.4834509	total: 2.21s	remaining: 565ms
796:	learn: 0.4833601	total: 2.21s	remaining: 562ms
797:	learn: 0.4830889	total: 2.21s	remaining: 559ms
798:	learn: 0.4828809	total: 2.21s	remaining: 556ms
799:	learn: 0.4827573	total: 2.21s	remaining: 553ms
800:	learn: 0.4827276	total: 2.21s	remaining: 550ms
801:	learn: 0.4824022	total: 2.21s	remaining: 547ms
802:	learn: 0.4820443	total: 2.22s	remaining: 544ms
803:	learn: 0.4819027	total: 2.22s	remaining: 541ms
804:	learn: 0.4816353	total: 2.22s	remaining: 538ms
805:	learn: 0.4814728	total: 2.22s	remaining: 535ms
806:	learn: 0.4814263	total: 2.22s	remaining: 532ms
807:	learn: 0.4813278	total: 2.23s	remaining: 529ms
808:	learn: 0.4811033	total: 2.23s	remaining: 526ms
809:	learn: 0.4809175	total: 2.23s	remaining: 523ms
810:	learn: 0.4807265	total: 2.23s	remaining: 520ms
811:	learn: 0.4

974:	learn: 0.4532523	total: 2.59s	remaining: 66.4ms
975:	learn: 0.4530742	total: 2.6s	remaining: 64.1ms
976:	learn: 0.4529803	total: 2.61s	remaining: 61.4ms
977:	learn: 0.4527142	total: 2.61s	remaining: 58.7ms
978:	learn: 0.4525500	total: 2.61s	remaining: 56.1ms
979:	learn: 0.4524338	total: 2.62s	remaining: 53.4ms
980:	learn: 0.4522060	total: 2.63s	remaining: 50.9ms
981:	learn: 0.4520777	total: 2.63s	remaining: 48.2ms
982:	learn: 0.4519839	total: 2.63s	remaining: 45.5ms
983:	learn: 0.4518609	total: 2.65s	remaining: 43ms
984:	learn: 0.4517328	total: 2.65s	remaining: 40.3ms
985:	learn: 0.4515749	total: 2.65s	remaining: 37.7ms
986:	learn: 0.4514927	total: 2.65s	remaining: 35ms
987:	learn: 0.4513126	total: 2.66s	remaining: 32.3ms
988:	learn: 0.4511083	total: 2.66s	remaining: 29.6ms
989:	learn: 0.4509678	total: 2.66s	remaining: 26.9ms
990:	learn: 0.4508539	total: 2.67s	remaining: 24.3ms
991:	learn: 0.4506224	total: 2.67s	remaining: 21.5ms
992:	learn: 0.4505976	total: 2.68s	remaining: 18.9m

## Add intercept because model underestimates

In [100]:
new_intercept = np.mean(np.exp(y_test_log) - np.exp(en_new.predict(X_test)))

In [101]:
new_intercept

1.5521085436351363

In [102]:
en_new.fit(X, log_y)

Learning rate set to 0.053413
0:	learn: 0.9723929	total: 1.8ms	remaining: 1.79s
1:	learn: 0.9487814	total: 3.24ms	remaining: 1.62s
2:	learn: 0.9263357	total: 7.4ms	remaining: 2.46s
3:	learn: 0.9059956	total: 8.68ms	remaining: 2.16s
4:	learn: 0.8871400	total: 10.7ms	remaining: 2.13s
5:	learn: 0.8693813	total: 14.2ms	remaining: 2.35s
6:	learn: 0.8530581	total: 16ms	remaining: 2.27s
7:	learn: 0.8384837	total: 17.9ms	remaining: 2.22s
8:	learn: 0.8249121	total: 19.4ms	remaining: 2.14s
9:	learn: 0.8117238	total: 21.6ms	remaining: 2.14s
10:	learn: 0.7998289	total: 24.2ms	remaining: 2.18s
11:	learn: 0.7892638	total: 27.5ms	remaining: 2.27s
12:	learn: 0.7797173	total: 31.6ms	remaining: 2.4s
13:	learn: 0.7705153	total: 33.3ms	remaining: 2.35s
14:	learn: 0.7625587	total: 40.8ms	remaining: 2.67s
15:	learn: 0.7548620	total: 47.2ms	remaining: 2.9s
16:	learn: 0.7479183	total: 49ms	remaining: 2.83s
17:	learn: 0.7419697	total: 51.3ms	remaining: 2.8s
18:	learn: 0.7361637	total: 53.2ms	remaining: 2.75s
1

170:	learn: 0.6407398	total: 391ms	remaining: 1.89s
171:	learn: 0.6405133	total: 394ms	remaining: 1.9s
172:	learn: 0.6403213	total: 396ms	remaining: 1.89s
173:	learn: 0.6401381	total: 398ms	remaining: 1.89s
174:	learn: 0.6398845	total: 401ms	remaining: 1.89s
175:	learn: 0.6396981	total: 403ms	remaining: 1.89s
176:	learn: 0.6393799	total: 409ms	remaining: 1.9s
177:	learn: 0.6391176	total: 411ms	remaining: 1.9s
178:	learn: 0.6388956	total: 414ms	remaining: 1.9s
179:	learn: 0.6386675	total: 416ms	remaining: 1.9s
180:	learn: 0.6384864	total: 418ms	remaining: 1.89s
181:	learn: 0.6383035	total: 422ms	remaining: 1.9s
182:	learn: 0.6380844	total: 424ms	remaining: 1.89s
183:	learn: 0.6376872	total: 426ms	remaining: 1.89s
184:	learn: 0.6374904	total: 431ms	remaining: 1.9s
185:	learn: 0.6373088	total: 433ms	remaining: 1.9s
186:	learn: 0.6371453	total: 438ms	remaining: 1.9s
187:	learn: 0.6367635	total: 441ms	remaining: 1.9s
188:	learn: 0.6365917	total: 443ms	remaining: 1.9s
189:	learn: 0.6364551	t

357:	learn: 0.5943867	total: 964ms	remaining: 1.73s
358:	learn: 0.5941716	total: 966ms	remaining: 1.73s
359:	learn: 0.5937752	total: 969ms	remaining: 1.72s
360:	learn: 0.5935131	total: 971ms	remaining: 1.72s
361:	learn: 0.5931982	total: 976ms	remaining: 1.72s
362:	learn: 0.5931185	total: 982ms	remaining: 1.72s
363:	learn: 0.5928262	total: 984ms	remaining: 1.72s
364:	learn: 0.5926525	total: 985ms	remaining: 1.71s
365:	learn: 0.5926057	total: 986ms	remaining: 1.71s
366:	learn: 0.5923967	total: 988ms	remaining: 1.7s
367:	learn: 0.5921470	total: 990ms	remaining: 1.7s
368:	learn: 0.5918375	total: 992ms	remaining: 1.7s
369:	learn: 0.5916628	total: 994ms	remaining: 1.69s
370:	learn: 0.5913670	total: 1s	remaining: 1.7s
371:	learn: 0.5910822	total: 1.01s	remaining: 1.7s
372:	learn: 0.5908355	total: 1.01s	remaining: 1.69s
373:	learn: 0.5904657	total: 1.01s	remaining: 1.69s
374:	learn: 0.5902398	total: 1.01s	remaining: 1.69s
375:	learn: 0.5898917	total: 1.01s	remaining: 1.68s
376:	learn: 0.589698

530:	learn: 0.5585551	total: 1.56s	remaining: 1.38s
531:	learn: 0.5582526	total: 1.57s	remaining: 1.38s
532:	learn: 0.5582062	total: 1.57s	remaining: 1.37s
533:	learn: 0.5580459	total: 1.57s	remaining: 1.37s
534:	learn: 0.5578771	total: 1.58s	remaining: 1.38s
535:	learn: 0.5576264	total: 1.58s	remaining: 1.37s
536:	learn: 0.5574913	total: 1.58s	remaining: 1.37s
537:	learn: 0.5573666	total: 1.59s	remaining: 1.37s
538:	learn: 0.5570825	total: 1.6s	remaining: 1.37s
539:	learn: 0.5569625	total: 1.6s	remaining: 1.36s
540:	learn: 0.5568867	total: 1.6s	remaining: 1.36s
541:	learn: 0.5566506	total: 1.61s	remaining: 1.36s
542:	learn: 0.5564724	total: 1.61s	remaining: 1.35s
543:	learn: 0.5563297	total: 1.62s	remaining: 1.36s
544:	learn: 0.5561199	total: 1.63s	remaining: 1.36s
545:	learn: 0.5558887	total: 1.63s	remaining: 1.35s
546:	learn: 0.5556381	total: 1.63s	remaining: 1.35s
547:	learn: 0.5554790	total: 1.63s	remaining: 1.34s
548:	learn: 0.5553569	total: 1.63s	remaining: 1.34s
549:	learn: 0.5

749:	learn: 0.5160867	total: 2.38s	remaining: 795ms
750:	learn: 0.5158233	total: 2.39s	remaining: 791ms
751:	learn: 0.5155009	total: 2.39s	remaining: 788ms
752:	learn: 0.5154731	total: 2.41s	remaining: 791ms
753:	learn: 0.5152455	total: 2.41s	remaining: 787ms
754:	learn: 0.5150335	total: 2.41s	remaining: 783ms
755:	learn: 0.5149349	total: 2.42s	remaining: 780ms
756:	learn: 0.5147573	total: 2.43s	remaining: 780ms
757:	learn: 0.5147068	total: 2.47s	remaining: 788ms
758:	learn: 0.5145585	total: 2.47s	remaining: 784ms
759:	learn: 0.5143735	total: 2.47s	remaining: 780ms
760:	learn: 0.5141583	total: 2.48s	remaining: 778ms
761:	learn: 0.5139260	total: 2.48s	remaining: 776ms
762:	learn: 0.5137292	total: 2.5s	remaining: 775ms
763:	learn: 0.5133989	total: 2.5s	remaining: 772ms
764:	learn: 0.5131933	total: 2.5s	remaining: 768ms
765:	learn: 0.5129686	total: 2.5s	remaining: 764ms
766:	learn: 0.5128495	total: 2.5s	remaining: 761ms
767:	learn: 0.5126067	total: 2.51s	remaining: 758ms
768:	learn: 0.512

918:	learn: 0.4860504	total: 3.19s	remaining: 281ms
919:	learn: 0.4858796	total: 3.2s	remaining: 278ms
920:	learn: 0.4856433	total: 3.2s	remaining: 274ms
921:	learn: 0.4855873	total: 3.2s	remaining: 271ms
922:	learn: 0.4853771	total: 3.23s	remaining: 269ms
923:	learn: 0.4851219	total: 3.23s	remaining: 266ms
924:	learn: 0.4850477	total: 3.23s	remaining: 262ms
925:	learn: 0.4848850	total: 3.24s	remaining: 259ms
926:	learn: 0.4846734	total: 3.24s	remaining: 255ms
927:	learn: 0.4845667	total: 3.24s	remaining: 252ms
928:	learn: 0.4843799	total: 3.24s	remaining: 248ms
929:	learn: 0.4841611	total: 3.25s	remaining: 244ms
930:	learn: 0.4840890	total: 3.25s	remaining: 241ms
931:	learn: 0.4838787	total: 3.26s	remaining: 238ms
932:	learn: 0.4836538	total: 3.26s	remaining: 234ms
933:	learn: 0.4835067	total: 3.27s	remaining: 231ms
934:	learn: 0.4834331	total: 3.28s	remaining: 228ms
935:	learn: 0.4832786	total: 3.28s	remaining: 224ms
936:	learn: 0.4830565	total: 3.28s	remaining: 221ms
937:	learn: 0.4

In [111]:
s = pd.DataFrame({'id':test.iloc[:, 0], "y":np.exp(en_new.predict(test_x)) + new_intercept})

# Second Model

In [135]:
X_train, X_test, y_train_log, y_test_log = train_test_split(X, log_y, test_size = 0.2, random_state = 8)

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
import numpy as np

#kbest
cat_pipe = Pipeline([
    ('column_transformer', ColumnTransformer([('cat_transform', 'passthrough', X_top_kbest.columns)], remainder='drop')),
    ('cat', cat)
])

rf_pipe = Pipeline([
    ('column_transformer', ColumnTransformer([('rf_transform', 'passthrough', X_top_kbest.columns)], remainder='drop')),
    ('rf', rf_model)
])

ada_pipe = Pipeline([
    ('column_transformer', ColumnTransformer([('ada_transform', 'passthrough', X_top_kbest.columns)], remainder='drop')),
    ('ada', ada_model)
])

light_pipe = Pipeline([
    ('column_transformer', ColumnTransformer([('light_transform', 'passthrough', X_top_kbest.columns)], remainder='drop')),
    ('light', light_model)
])

cat_pipe1 = Pipeline([
    ('column_transformer', ColumnTransformer([('cat_transform1', 'passthrough', X_top_MARS.columns)], remainder='drop')),
    ('cat1', cat)
])

rf_pipe1 = Pipeline([
    ('column_transformer', ColumnTransformer([('rf_transform1', 'passthrough', X_top_MARS.columns)], remainder='drop')),
    ('rf1', rf_model)
])

ada_pipe1 = Pipeline([
    ('column_transformer', ColumnTransformer([('ada_transform1', 'passthrough', X_top_MARS.columns)], remainder='drop')),
    ('ada1', ada_model)
])

light_pipe1 = Pipeline([
    ('column_transformer', ColumnTransformer([('light_transform1', 'passthrough', X_top_MARS.columns)], remainder='drop')),
    ('light1', light_model)
])

cat_pipe2 = Pipeline([
    ('column_transformer', ColumnTransformer([('cat_transform2', 'passthrough', X_top_rf.columns)], remainder='drop')),
    ('cat2', cat)
])

rf_pipe2 = Pipeline([
    ('column_transformer', ColumnTransformer([('rf_transform2', 'passthrough', X_top_rf.columns)], remainder='drop')),
    ('rf2', rf_model)
])

ada_pipe2 = Pipeline([
    ('column_transformer', ColumnTransformer([('ada_transform2', 'passthrough', X_top_rf.columns)], remainder='drop')),
    ('ada2', ada_model)
])

light_pipe2 = Pipeline([
    ('column_transformer', ColumnTransformer([('light_transform2', 'passthrough', X_top_rf.columns)], remainder='drop')),
    ('light2', light_model)
])

cat_pipe3 = Pipeline([
    ('column_transformer', ColumnTransformer([('cat_transform3', 'passthrough', X.columns)], remainder='drop')),
    ('cat3', cat)
])

winner = StackingRegressor(estimators = [('cat', cat_pipe),('rf', rf_pipe),('ada', ada_pipe), 
                                     ('cat1', cat_pipe1),('rf1', rf_pipe1),('ada1', ada_pipe1),
                                     ('cat2', cat_pipe2),('rf2', rf_pipe2),('ada2', ada_pipe2),
                                     ('cat3', cat_pipe3)],
                     final_estimator=CatBoostRegressor(),                                          
                    cv = KFold(n_splits = 13, shuffle = True, random_state=1))

In [136]:
winner.fit(X_train, y_train_log)

Learning rate set to 0.051562
0:	learn: 0.9740131	total: 1.16ms	remaining: 1.16s
1:	learn: 0.9512061	total: 2.21ms	remaining: 1.1s
2:	learn: 0.9302736	total: 3.25ms	remaining: 1.08s
3:	learn: 0.9107460	total: 4.34ms	remaining: 1.08s
4:	learn: 0.8924918	total: 5.4ms	remaining: 1.07s
5:	learn: 0.8757241	total: 6.37ms	remaining: 1.05s
6:	learn: 0.8606471	total: 7.39ms	remaining: 1.05s
7:	learn: 0.8461644	total: 8.49ms	remaining: 1.05s
8:	learn: 0.8330520	total: 10.1ms	remaining: 1.12s
9:	learn: 0.8202699	total: 11.6ms	remaining: 1.15s
10:	learn: 0.8092162	total: 12.9ms	remaining: 1.16s
11:	learn: 0.7985638	total: 14.3ms	remaining: 1.17s
12:	learn: 0.7892767	total: 15.4ms	remaining: 1.17s
13:	learn: 0.7804305	total: 16.8ms	remaining: 1.18s
14:	learn: 0.7725125	total: 18.1ms	remaining: 1.19s
15:	learn: 0.7650103	total: 19.8ms	remaining: 1.22s
16:	learn: 0.7582901	total: 21.1ms	remaining: 1.22s
17:	learn: 0.7524004	total: 22.9ms	remaining: 1.25s
18:	learn: 0.7466448	total: 28.3ms	remaining: 

241:	learn: 0.6214502	total: 392ms	remaining: 1.23s
242:	learn: 0.6209422	total: 394ms	remaining: 1.23s
243:	learn: 0.6207490	total: 396ms	remaining: 1.23s
244:	learn: 0.6205087	total: 397ms	remaining: 1.22s
245:	learn: 0.6199884	total: 400ms	remaining: 1.23s
246:	learn: 0.6195272	total: 401ms	remaining: 1.22s
247:	learn: 0.6193635	total: 403ms	remaining: 1.22s
248:	learn: 0.6190066	total: 404ms	remaining: 1.22s
249:	learn: 0.6187321	total: 406ms	remaining: 1.22s
250:	learn: 0.6184626	total: 409ms	remaining: 1.22s
251:	learn: 0.6179081	total: 411ms	remaining: 1.22s
252:	learn: 0.6176743	total: 412ms	remaining: 1.22s
253:	learn: 0.6173956	total: 414ms	remaining: 1.22s
254:	learn: 0.6169756	total: 415ms	remaining: 1.21s
255:	learn: 0.6166275	total: 417ms	remaining: 1.21s
256:	learn: 0.6163764	total: 418ms	remaining: 1.21s
257:	learn: 0.6161628	total: 420ms	remaining: 1.21s
258:	learn: 0.6159854	total: 422ms	remaining: 1.21s
259:	learn: 0.6157150	total: 424ms	remaining: 1.21s
260:	learn: 

452:	learn: 0.5617735	total: 768ms	remaining: 927ms
453:	learn: 0.5615559	total: 769ms	remaining: 925ms
454:	learn: 0.5613350	total: 771ms	remaining: 923ms
455:	learn: 0.5611399	total: 777ms	remaining: 927ms
456:	learn: 0.5607666	total: 778ms	remaining: 925ms
457:	learn: 0.5606645	total: 780ms	remaining: 923ms
458:	learn: 0.5604990	total: 784ms	remaining: 924ms
459:	learn: 0.5603952	total: 785ms	remaining: 922ms
460:	learn: 0.5602774	total: 787ms	remaining: 920ms
461:	learn: 0.5601747	total: 788ms	remaining: 917ms
462:	learn: 0.5599493	total: 789ms	remaining: 915ms
463:	learn: 0.5597969	total: 791ms	remaining: 913ms
464:	learn: 0.5595076	total: 792ms	remaining: 912ms
465:	learn: 0.5591991	total: 794ms	remaining: 909ms
466:	learn: 0.5590132	total: 795ms	remaining: 907ms
467:	learn: 0.5586675	total: 801ms	remaining: 910ms
468:	learn: 0.5584135	total: 805ms	remaining: 911ms
469:	learn: 0.5581392	total: 806ms	remaining: 909ms
470:	learn: 0.5576378	total: 808ms	remaining: 907ms
471:	learn: 

646:	learn: 0.5194133	total: 1.17s	remaining: 637ms
647:	learn: 0.5190965	total: 1.17s	remaining: 635ms
648:	learn: 0.5188374	total: 1.17s	remaining: 633ms
649:	learn: 0.5186828	total: 1.18s	remaining: 633ms
650:	learn: 0.5185162	total: 1.18s	remaining: 631ms
651:	learn: 0.5183143	total: 1.18s	remaining: 629ms
652:	learn: 0.5182548	total: 1.18s	remaining: 630ms
653:	learn: 0.5181214	total: 1.19s	remaining: 628ms
654:	learn: 0.5177798	total: 1.19s	remaining: 626ms
655:	learn: 0.5175390	total: 1.19s	remaining: 623ms
656:	learn: 0.5172582	total: 1.19s	remaining: 621ms
657:	learn: 0.5170468	total: 1.19s	remaining: 619ms
658:	learn: 0.5168182	total: 1.19s	remaining: 618ms
659:	learn: 0.5167241	total: 1.19s	remaining: 615ms
660:	learn: 0.5164841	total: 1.2s	remaining: 614ms
661:	learn: 0.5163065	total: 1.2s	remaining: 612ms
662:	learn: 0.5160071	total: 1.2s	remaining: 610ms
663:	learn: 0.5157571	total: 1.2s	remaining: 608ms
664:	learn: 0.5155791	total: 1.2s	remaining: 606ms
665:	learn: 0.515

862:	learn: 0.4763638	total: 1.57s	remaining: 249ms
863:	learn: 0.4762470	total: 1.59s	remaining: 250ms
864:	learn: 0.4761616	total: 1.6s	remaining: 250ms
865:	learn: 0.4759466	total: 1.6s	remaining: 248ms
866:	learn: 0.4758457	total: 1.6s	remaining: 246ms
867:	learn: 0.4756290	total: 1.6s	remaining: 244ms
868:	learn: 0.4754260	total: 1.61s	remaining: 242ms
869:	learn: 0.4751882	total: 1.61s	remaining: 240ms
870:	learn: 0.4750122	total: 1.61s	remaining: 238ms
871:	learn: 0.4749331	total: 1.63s	remaining: 239ms
872:	learn: 0.4747422	total: 1.63s	remaining: 237ms
873:	learn: 0.4744208	total: 1.63s	remaining: 236ms
874:	learn: 0.4742124	total: 1.64s	remaining: 235ms
875:	learn: 0.4740218	total: 1.67s	remaining: 237ms
876:	learn: 0.4737925	total: 1.68s	remaining: 235ms
877:	learn: 0.4735645	total: 1.68s	remaining: 233ms
878:	learn: 0.4733681	total: 1.69s	remaining: 232ms
879:	learn: 0.4732324	total: 1.69s	remaining: 230ms
880:	learn: 0.4730776	total: 1.69s	remaining: 228ms
881:	learn: 0.47

In [137]:
winner_intercept = np.mean(np.exp(y_test_log) - np.exp(winner.predict(X_test)))

In [138]:
winner_intercept

1.5335550664780593

In [140]:
winner.fit(X, log_y)

Learning rate set to 0.053413
0:	learn: 0.9724442	total: 1.61ms	remaining: 1.61s
1:	learn: 0.9491958	total: 2.95ms	remaining: 1.47s
2:	learn: 0.9268486	total: 4.25ms	remaining: 1.41s
3:	learn: 0.9064946	total: 5.37ms	remaining: 1.34s
4:	learn: 0.8872936	total: 6.64ms	remaining: 1.32s
5:	learn: 0.8698887	total: 8.02ms	remaining: 1.33s
6:	learn: 0.8543032	total: 12ms	remaining: 1.7s
7:	learn: 0.8395214	total: 13.5ms	remaining: 1.67s
8:	learn: 0.8265957	total: 15ms	remaining: 1.65s
9:	learn: 0.8133820	total: 16.9ms	remaining: 1.67s
10:	learn: 0.8016223	total: 18.6ms	remaining: 1.67s
11:	learn: 0.7909732	total: 20.2ms	remaining: 1.66s
12:	learn: 0.7809912	total: 22ms	remaining: 1.67s
13:	learn: 0.7716140	total: 23.6ms	remaining: 1.66s
14:	learn: 0.7637242	total: 25.4ms	remaining: 1.67s
15:	learn: 0.7560214	total: 27.1ms	remaining: 1.66s
16:	learn: 0.7495837	total: 28.9ms	remaining: 1.67s
17:	learn: 0.7432860	total: 30.5ms	remaining: 1.66s
18:	learn: 0.7376116	total: 32.2ms	remaining: 1.66s

184:	learn: 0.6387811	total: 397ms	remaining: 1.75s
185:	learn: 0.6386268	total: 399ms	remaining: 1.75s
186:	learn: 0.6384920	total: 401ms	remaining: 1.74s
187:	learn: 0.6382623	total: 402ms	remaining: 1.74s
188:	learn: 0.6379516	total: 404ms	remaining: 1.73s
189:	learn: 0.6377895	total: 405ms	remaining: 1.73s
190:	learn: 0.6375687	total: 407ms	remaining: 1.73s
191:	learn: 0.6372902	total: 409ms	remaining: 1.72s
192:	learn: 0.6369916	total: 412ms	remaining: 1.72s
193:	learn: 0.6367537	total: 414ms	remaining: 1.72s
194:	learn: 0.6363988	total: 416ms	remaining: 1.72s
195:	learn: 0.6361665	total: 418ms	remaining: 1.71s
196:	learn: 0.6359843	total: 419ms	remaining: 1.71s
197:	learn: 0.6356099	total: 421ms	remaining: 1.71s
198:	learn: 0.6354579	total: 423ms	remaining: 1.7s
199:	learn: 0.6350726	total: 424ms	remaining: 1.7s
200:	learn: 0.6348828	total: 426ms	remaining: 1.69s
201:	learn: 0.6345359	total: 428ms	remaining: 1.69s
202:	learn: 0.6343620	total: 430ms	remaining: 1.69s
203:	learn: 0.

348:	learn: 0.5996849	total: 795ms	remaining: 1.48s
349:	learn: 0.5993561	total: 803ms	remaining: 1.49s
350:	learn: 0.5992685	total: 807ms	remaining: 1.49s
351:	learn: 0.5989937	total: 808ms	remaining: 1.49s
352:	learn: 0.5988441	total: 811ms	remaining: 1.49s
353:	learn: 0.5987650	total: 813ms	remaining: 1.48s
354:	learn: 0.5986006	total: 816ms	remaining: 1.48s
355:	learn: 0.5984628	total: 819ms	remaining: 1.48s
356:	learn: 0.5980875	total: 824ms	remaining: 1.48s
357:	learn: 0.5979461	total: 827ms	remaining: 1.48s
358:	learn: 0.5978276	total: 830ms	remaining: 1.48s
359:	learn: 0.5973984	total: 831ms	remaining: 1.48s
360:	learn: 0.5972582	total: 833ms	remaining: 1.47s
361:	learn: 0.5971120	total: 836ms	remaining: 1.47s
362:	learn: 0.5968805	total: 837ms	remaining: 1.47s
363:	learn: 0.5966683	total: 839ms	remaining: 1.47s
364:	learn: 0.5965347	total: 841ms	remaining: 1.46s
365:	learn: 0.5962529	total: 842ms	remaining: 1.46s
366:	learn: 0.5958601	total: 844ms	remaining: 1.46s
367:	learn: 

541:	learn: 0.5577389	total: 1.4s	remaining: 1.18s
542:	learn: 0.5574337	total: 1.4s	remaining: 1.18s
543:	learn: 0.5570869	total: 1.4s	remaining: 1.18s
544:	learn: 0.5569823	total: 1.4s	remaining: 1.17s
545:	learn: 0.5566892	total: 1.4s	remaining: 1.17s
546:	learn: 0.5565373	total: 1.41s	remaining: 1.17s
547:	learn: 0.5562892	total: 1.41s	remaining: 1.16s
548:	learn: 0.5562492	total: 1.41s	remaining: 1.16s
549:	learn: 0.5561570	total: 1.41s	remaining: 1.16s
550:	learn: 0.5560423	total: 1.41s	remaining: 1.15s
551:	learn: 0.5559294	total: 1.42s	remaining: 1.15s
552:	learn: 0.5556751	total: 1.42s	remaining: 1.15s
553:	learn: 0.5553167	total: 1.42s	remaining: 1.14s
554:	learn: 0.5550938	total: 1.42s	remaining: 1.14s
555:	learn: 0.5548096	total: 1.42s	remaining: 1.14s
556:	learn: 0.5546770	total: 1.42s	remaining: 1.13s
557:	learn: 0.5544550	total: 1.43s	remaining: 1.13s
558:	learn: 0.5543219	total: 1.43s	remaining: 1.13s
559:	learn: 0.5539788	total: 1.43s	remaining: 1.12s
560:	learn: 0.553

700:	learn: 0.5279077	total: 1.99s	remaining: 849ms
701:	learn: 0.5277045	total: 1.99s	remaining: 845ms
702:	learn: 0.5274452	total: 2s	remaining: 843ms
703:	learn: 0.5273100	total: 2s	remaining: 840ms
704:	learn: 0.5270901	total: 2s	remaining: 837ms
705:	learn: 0.5270199	total: 2s	remaining: 833ms
706:	learn: 0.5268401	total: 2s	remaining: 830ms
707:	learn: 0.5266534	total: 2s	remaining: 826ms
708:	learn: 0.5263158	total: 2s	remaining: 823ms
709:	learn: 0.5260052	total: 2.01s	remaining: 820ms
710:	learn: 0.5258503	total: 2.01s	remaining: 816ms
711:	learn: 0.5256978	total: 2.01s	remaining: 814ms
712:	learn: 0.5255621	total: 2.01s	remaining: 811ms
713:	learn: 0.5254109	total: 2.02s	remaining: 807ms
714:	learn: 0.5250861	total: 2.02s	remaining: 804ms
715:	learn: 0.5248515	total: 2.02s	remaining: 800ms
716:	learn: 0.5247324	total: 2.02s	remaining: 797ms
717:	learn: 0.5246263	total: 2.02s	remaining: 794ms
718:	learn: 0.5243622	total: 2.02s	remaining: 791ms
719:	learn: 0.5241769	total: 2.02

866:	learn: 0.4981489	total: 2.4s	remaining: 368ms
867:	learn: 0.4979843	total: 2.4s	remaining: 366ms
868:	learn: 0.4979317	total: 2.4s	remaining: 363ms
869:	learn: 0.4977902	total: 2.41s	remaining: 360ms
870:	learn: 0.4977109	total: 2.41s	remaining: 357ms
871:	learn: 0.4975536	total: 2.41s	remaining: 354ms
872:	learn: 0.4974275	total: 2.41s	remaining: 351ms
873:	learn: 0.4973809	total: 2.42s	remaining: 348ms
874:	learn: 0.4971219	total: 2.42s	remaining: 345ms
875:	learn: 0.4969463	total: 2.42s	remaining: 343ms
876:	learn: 0.4967473	total: 2.42s	remaining: 340ms
877:	learn: 0.4966629	total: 2.42s	remaining: 337ms
878:	learn: 0.4963562	total: 2.42s	remaining: 334ms
879:	learn: 0.4961620	total: 2.43s	remaining: 331ms
880:	learn: 0.4960826	total: 2.43s	remaining: 329ms
881:	learn: 0.4957942	total: 2.44s	remaining: 326ms
882:	learn: 0.4956377	total: 2.44s	remaining: 323ms
883:	learn: 0.4956000	total: 2.44s	remaining: 320ms
884:	learn: 0.4952857	total: 2.44s	remaining: 317ms
885:	learn: 0.4

In [141]:
game = pd.DataFrame({'id':test.iloc[:, 0], "y":np.exp(winner.predict(test_x)) + winner_intercept})

# Create en ensemble by combining the models 

In [142]:
x = game.copy()
x.y = s.y*.5 + game.y*.5

In [143]:
x.to_csv('xgame1.csv', index=False)

# Parameters for Reproducibility

In [146]:
winner.get_params()

{'cv': KFold(n_splits=13, random_state=1, shuffle=True),
 'estimators': [('cat',
   Pipeline(steps=[('column_transformer',
                    ColumnTransformer(transformers=[('cat_transform',
                                                     'passthrough',
                                                     Index(['x146', 'x102', 'x014', 'x096', 'x619', 'x687', 'x724', 'x118', 'x670',
          'x696', 'x651', 'x755', 'x742', 'x654', 'x105', 'x569', 'x561', 'x543',
          'x581', 'x253', 'x427', 'x591', 'x735', 'x756', 'x366', 'x749', 'x488',
          'x702', 'x685', 'x355', 'x572', 'x364', 'x239', 'x669', 'x430', 'x425',
          'x185', 'x638', 'x353', 'x108'],
         dtype='object'))])),
                   ('cat',
                    <catboost.core.CatBoostRegressor object at 0x2a1f58490>)])),
  ('rf',
   Pipeline(steps=[('column_transformer',
                    ColumnTransformer(transformers=[('rf_transform', 'passthrough',
                                             

In [147]:
en_new.get_params()

{'cv': KFold(n_splits=15, random_state=1, shuffle=True),
 'estimators': [('cat',
   Pipeline(steps=[('column_transformer',
                    ColumnTransformer(transformers=[('cat_transform',
                                                     'passthrough',
                                                     Index(['x146', 'x102', 'x014', 'x096', 'x619', 'x687', 'x724', 'x118', 'x670',
          'x696', 'x651', 'x755', 'x742', 'x654', 'x105', 'x569', 'x561', 'x543',
          'x581', 'x253', 'x427', 'x591', 'x735', 'x756', 'x366', 'x749', 'x488',
          'x702', 'x685', 'x355', 'x572', 'x364', 'x239', 'x669', 'x430', 'x425',
          'x185', 'x638', 'x353', 'x108'],
         dtype='object'))])),
                   ('cat',
                    <catboost.core.CatBoostRegressor object at 0x2a1f58490>)])),
  ('rf',
   Pipeline(steps=[('column_transformer',
                    ColumnTransformer(transformers=[('rf_transform', 'passthrough',
                                             

In [149]:
!pip3 install pickle-mixin

Collecting pickle-mixin
  Downloading pickle-mixin-1.0.2.tar.gz (5.1 kB)
  Preparing metadata (setup.py) ... [?25ldone
[?25hBuilding wheels for collected packages: pickle-mixin
  Building wheel for pickle-mixin (setup.py) ... [?25ldone
[?25h  Created wheel for pickle-mixin: filename=pickle_mixin-1.0.2-py3-none-any.whl size=5990 sha256=b8cc4c8825a4d567c061fdbcfab8e2badd6561cd45ea3b12fdb8a405fc1e1295
  Stored in directory: /Users/jackokeefe/Library/Caches/pip/wheels/3e/c6/e9/d1b0a34e1efc6c3ec9c086623972c6de6317faddb2af0a619c
Successfully built pickle-mixin
Installing collected packages: pickle-mixin
Successfully installed pickle-mixin-1.0.2


# The below code uploads my model. Because there is no hyperparameter tuning, the exact model is not reproducible unless my specific hyperparameters are uploaded. That could be done manually, but saving the entire model with pickle is much easier. 

# Upload winner model

In [155]:
# Load from file
pkl_filename = "winner.pkl"
with open(pkl_filename, 'rb') as file:
    pickle_winner = pickle.load(file)

# Now you can use this model to make predictions, for instance
# y_pred = winner.predict(X_test)

# Upload en_new model

In [156]:
pkl_filename = "en_new.pkl"
with open(pkl_filename, 'rb') as file:
    pickle_en_new = pickle.load(file)

In [157]:
pickle_game = pd.DataFrame({'id':test.iloc[:, 0], "y":np.exp(winner.predict(test_x)) + winner_intercept})

In [158]:
pickle_s = pd.DataFrame({'id':test.iloc[:, 0], "y":np.exp(en_new.predict(test_x)) + new_intercept})

In [160]:
pickle_x = pickle_game.copy()
pickle_x.y = pickle_s.y*.5 + pickle_game.y*.5

In [165]:
pickle_x.to_csv('pickle.csv', index=False)