# ASSIGNMENT
- Use the Caterpillar dataset (or _any_ dataset of your choice). 
- Use scikit-learn for hyperparameter optimization with RandomizedSearchCV.
- Add comments and Markdown to your notebook. Clean up your code.
- Commit your notebook to your fork of the GitHub repo.

Do your assignment ["the hard way."](https://docs.google.com/document/d/1ubOw9B3Hfip27hF2ZFnW3a3z9xAgrUDRReOEo-FHCVs/edit) _"If you copy-paste, you are cheating yourself out of the effectiveness of the lessons."_

### Stretch Goals
- Make more Kaggle submissions. Improve your scores! Look at [Kaggle Kernels](https://www.kaggle.com/c/caterpillar-tube-pricing/kernels) for ideas. **Share your best features and techniques on Slack.**
- Try combining xgboost early stopping, cross-validation, & hyperparameter optimization, [with the scikit-learn API](https://www.kaggle.com/c/liberty-mutual-group-property-inspection-prediction/discussion/15235#180497), or [the "regular" xgboost API](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.cv).
- In addition to `RandomizedSearchCV`, scikit-learn has [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html). Another library called scikit-optimize has [`BayesSearchCV`](https://scikit-optimize.github.io/notebooks/sklearn-gridsearchcv-replacement.html). Experiment with these alternatives.
- _[Introduction to Machine Learning with Python](http://shop.oreilly.com/product/0636920030515.do)_ discusses options for "Grid-Searching Which Model To Use" in Chapter 6:

> You can even go further in combining GridSearchCV and Pipeline: it is also possible to search over the actual steps being performed in the pipeline (say whether to use StandardScaler or MinMaxScaler). This leads to an even bigger search space and should be considered carefully. Trying all possible solutions is usually not a viable machine learning strategy. However, here is an example comparing a RandomForestClassifier and an SVC ...

The example is shown in [the accompanying notebook](https://github.com/amueller/introduction_to_ml_with_python/blob/master/06-algorithm-chains-and-pipelines.ipynb), code cells 35-37. Could you apply this concept to your own pipelines?
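
A minimal sketch of that idea, run on the iris dataset rather than the Caterpillar data. The step names `preprocessing` and `classifier` and all grid values are illustrative assumptions, not taken from the book's notebook:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Placeholder steps; the grid below swaps them out during the search.
pipe = Pipeline([('preprocessing', StandardScaler()),
                 ('classifier', SVC())])

# Each dict is its own sub-grid, so SVC-only and forest-only
# hyperparameters are never mixed.
param_grid = [
    {'classifier': [SVC()],
     'preprocessing': [StandardScaler(), MinMaxScaler()],
     'classifier__C': [0.1, 1, 10]},
    {'classifier': [RandomForestClassifier(n_estimators=50, random_state=0)],
     'preprocessing': ['passthrough'],  # trees do not need scaling
     'classifier__max_depth': [2, 4]},
]

grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_)
print(grid.best_score_)
```

The same list-of-dicts structure should also work as `param_distributions` for `RandomizedSearchCV`, which samples from the sub-grids instead of trying every combination.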

### Post-Reads
- Jake VanderPlas, [_Python Data Science Handbook_, Chapter 5.3](https://jakevdp.github.io/PythonDataScienceHandbook/05.03-hyperparameters-and-model-validation.html), Hyperparameters and Model Validation
- Jake VanderPlas, [Statistics for Hackers](https://speakerdeck.com/jakevdp/statistics-for-hackers?slide=107)
- Ron Zacharski, [_A Programmer's Guide to Data Mining_, Chapter 5](http://guidetodatamining.com/chapter5/), 10-fold cross validation
- Sebastian Raschka, [A Basic Pipeline and Grid Search Setup](https://github.com/rasbt/python-machine-learning-book/blob/master/code/bonus/svm_iris_pipeline_and_gridsearch.ipynb)

In [0]:
!pip install category_encoders

In [0]:
!wget https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/caterpillar/caterpillar-tube-pricing.zip

In [0]:
!unzip caterpillar-tube-pricing.zip

In [0]:
!unzip data.zip

## Use scikit-learn for hyperparameter optimization with RandomizedSearchCV.

In [0]:
import warnings
import numpy as np
import pandas as pd
from glob import glob
import category_encoders as ce
import matplotlib.pyplot as plt
from xgboost import XGBRegressor
from scipy.stats import randint, uniform
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split, RandomizedSearchCV

In [0]:
SOURCE = 'competition_data/'

### 1a. Get a tidy list of the component IDs in each tube assembly

In [8]:
materials = pd.read_csv(SOURCE + 'bill_of_materials.csv')
materials.head()

Unnamed: 0,tube_assembly_id,component_id_1,quantity_1,component_id_2,quantity_2,component_id_3,quantity_3,component_id_4,quantity_4,component_id_5,quantity_5,component_id_6,quantity_6,component_id_7,quantity_7,component_id_8,quantity_8
0,TA-00001,C-1622,2.0,C-1629,2.0,,,,,,,,,,,,
1,TA-00002,C-1312,2.0,,,,,,,,,,,,,,
2,TA-00003,C-1312,2.0,,,,,,,,,,,,,,
3,TA-00004,C-1312,2.0,,,,,,,,,,,,,,
4,TA-00005,C-1624,1.0,C-1631,1.0,C-1641,1.0,,,,,,,,,,


In [9]:
assembly_components = materials.melt(id_vars='tube_assembly_id',
                                     value_vars=[f'component_id_{n}' for n in range(1,9)])
assembly_components = (assembly_components.sort_values(by='tube_assembly_id')
                       .dropna()
                       .rename(columns={'value':'component_id'}))
print(assembly_components.shape)
assembly_components.head()

(39459, 3)


Unnamed: 0,tube_assembly_id,variable,component_id
0,TA-00001,component_id_1,C-1622
21198,TA-00001,component_id_2,C-1629
1,TA-00002,component_id_1,C-1312
2,TA-00003,component_id_1,C-1312
3,TA-00004,component_id_1,C-1312
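
To make the reshaping step easier to follow, here is a small sketch of the same `melt` pattern on a toy table. The assembly and component IDs are hypothetical, and only two component columns are used:

```python
import pandas as pd

# Wide format: one row per assembly, one column per component slot.
wide = pd.DataFrame({
    'tube_assembly_id': ['TA-1', 'TA-2'],
    'component_id_1': ['C-1', 'C-3'],
    'component_id_2': ['C-2', None],   # TA-2 has only one component
})

# Tidy format: one row per (assembly, component) pair; empty slots dropped.
tidy = (wide.melt(id_vars='tube_assembly_id',
                  value_vars=['component_id_1', 'component_id_2'])
        .dropna()
        .sort_values(by='tube_assembly_id')
        .rename(columns={'value': 'component_id'}))
print(tidy)
```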


#### Get a tidy list of the component quantities in each tube assembly (not needed)

In [10]:
# components_quantity = materials.melt(id_vars='tube_assembly_id',
#                                      value_vars=[f'quantity_{n}' for n in range(1,9)])
# components_quantity = (components_quantity.sort_values(by='tube_assembly_id')
#                        .dropna()
#                        .rename(columns={'value':'quantity',
#                                         'variable':'quantity_num'}))
# print(components_quantity.shape)
# components_quantity.head()

(39467, 3)


Unnamed: 0,tube_assembly_id,quantity_num,quantity
0,TA-00001,quantity_1,2.0
21198,TA-00001,quantity_2,2.0
1,TA-00002,quantity_1,2.0
2,TA-00003,quantity_1,2.0
3,TA-00004,quantity_1,2.0


##### Merge component quantities and IDs

In [0]:
# component_quantity = (assembly_components
#                      .merge(components_quantity, left_index=True, right_index=True)
#                      .drop(columns='tube_assembly_id_y')
#                      .rename(columns={'tube_assembly_id_x':'tube_assembly_id'}))
# print(component_quantity.shape)
# component_quantity.head()

### 1b. Merge with component types

In [12]:
components = pd.read_csv(SOURCE + 'components.csv')
components.describe()

Unnamed: 0,component_id,name,component_type_id
count,2048,2047,2048
unique,2048,297,29
top,C-1866,FLANGE,OTHER
freq,1,350,1006


In [13]:
assembly_component_types = assembly_components.merge(components, how='left')
print(assembly_component_types.shape)
assembly_component_types.head()

(39459, 5)


Unnamed: 0,tube_assembly_id,variable,component_id,name,component_type_id
0,TA-00001,component_id_1,C-1622,NUT-SWIVEL,CP-025
1,TA-00001,component_id_2,C-1629,SLEEVE-ORFS,CP-024
2,TA-00002,component_id_1,C-1312,NUT-FLARED,CP-028
3,TA-00003,component_id_1,C-1312,NUT-FLARED,CP-028
4,TA-00004,component_id_1,C-1312,NUT-FLARED,CP-028


### 1c. Make a crosstab of the component types for each assembly (per-type counts, similar to one-hot encoding)

In [14]:
table = pd.crosstab(assembly_component_types['tube_assembly_id'],
                    assembly_component_types['component_type_id'])
table = table.reset_index()
table.columns.name = ''
print(table.shape)
table.head()

(19149, 30)


Unnamed: 0,tube_assembly_id,CP-001,CP-002,CP-003,CP-004,CP-005,CP-006,CP-007,CP-008,CP-009,CP-010,CP-011,CP-012,CP-014,CP-015,CP-016,CP-017,CP-018,CP-019,CP-020,CP-021,CP-022,CP-023,CP-024,CP-025,CP-026,CP-027,CP-028,CP-029,OTHER
0,TA-00001,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0
1,TA-00002,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,TA-00003,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,TA-00004,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
4,TA-00005,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0
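
A toy sketch, with hypothetical IDs, of what `pd.crosstab` produces here. The cells are counts, so an assembly with two components of the same type gets a 2 rather than a 1. A strict 0/1 indicator can be derived from the counts if that is preferred:

```python
import pandas as pd

# Long format: one row per (assembly, component) pair.
long_df = pd.DataFrame({
    'tube_assembly_id': ['TA-1', 'TA-1', 'TA-1', 'TA-2'],
    'component_type_id': ['CP-A', 'CP-A', 'CP-B', 'CP-B'],
})

# Rows are assemblies, columns are component types, cells are counts.
counts = pd.crosstab(long_df['tube_assembly_id'],
                     long_df['component_type_id'])

# Optional: collapse counts to a 0/1 indicator (true one-hot style).
indicator = (counts > 0).astype(int)

print(counts)
print(indicator)
```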


### 2a. Most of the component files have a "weight" feature:

In [15]:
def search_column(name):
  for path in glob(SOURCE + '*.csv'):
    df = pd.read_csv(path)
    if name in df.columns:
      print(path, df.shape)
      print(df.columns.tolist(), '\n')
      
search_column('weight')

competition_data/comp_nut.csv (65, 11)
['component_id', 'component_type_id', 'hex_nut_size', 'seat_angle', 'length', 'thread_size', 'thread_pitch', 'diameter', 'blind_hole', 'orientation', 'weight'] 

competition_data/comp_sleeve.csv (50, 10)
['component_id', 'component_type_id', 'connection_type_id', 'length', 'intended_nut_thread', 'intended_nut_pitch', 'unique_feature', 'plating', 'orientation', 'weight'] 

competition_data/comp_hfl.csv (6, 9)
['component_id', 'component_type_id', 'hose_diameter', 'corresponding_shell', 'coupling_class', 'material', 'plating', 'orientation', 'weight'] 

competition_data/comp_boss.csv (147, 15)
['component_id', 'component_type_id', 'type', 'connection_type_id', 'outside_shape', 'base_type', 'height_over_tube', 'bolt_pattern_long', 'bolt_pattern_wide', 'groove', 'base_diameter', 'shoulder_diameter', 'unique_feature', 'orientation', 'weight'] 

competition_data/comp_float.csv (16, 7)
['component_id', 'component_type_id', 'bolt_pattern_long', 'bolt_patt

### 2b. Most of the component files have "orientation" & "unique_feature" binary features

In [16]:
comp_threaded = pd.read_csv(SOURCE + 'comp_threaded.csv')
comp_threaded['orientation'].value_counts()

No     121
Yes     73
Name: orientation, dtype: int64

In [17]:
comp_threaded['unique_feature'].value_counts()

No     161
Yes     33
Name: unique_feature, dtype: int64

### 2c. Read all the component files and concatenate them together

In [0]:
comp = pd.concat((pd.read_csv(path) for path in glob(SOURCE + 'comp_*.csv')), sort=False)
columns = ['component_id', 'component_type_id', 'orientation', 'unique_feature', 'weight']
comp = comp[columns]
comp['orientation'] = (comp['orientation']=='Yes').astype(int)
comp['unique_feature'] = (comp['unique_feature']=='Yes').astype(int)
comp['weight'] = comp['weight'].fillna(comp['weight'].median())

In [19]:
comp.head()

Unnamed: 0,component_id,component_type_id,orientation,unique_feature,weight
0,C-1621,CP-025,0,0,0.015
1,C-1624,CP-025,0,0,0.035
2,C-1623,CP-025,0,0,0.044
3,C-1622,CP-025,0,0,0.036
4,C-1625,CP-025,0,0,0.129


### 2d. Engineer features, aggregated for all components in a tube assembly

In [20]:
materials['components_total'] = sum(materials[f'quantity_{n}'].fillna(0) for n in range(1,9))
materials['components_distinct'] = sum(materials[f'component_id_{n}'].notnull().astype(int) for n in range(1,9))
materials['orientation'] = 0
materials['unique_feature'] = 0
materials['weight'] = 0

for n in range(1,9):
    materials = materials.merge(comp, left_on=f'component_id_{n}', right_on='component_id', 
                                how='left', suffixes=('', f'_{n}'))

for col in materials:
    if 'orientation' in col or 'unique_feature' in col or 'weight' in col:
        materials[col] = materials[col].fillna(0)
        
materials['orientation'] = sum(materials[f'orientation_{n}'] for n in range(1,9))
materials['unique_feature'] = sum(materials[f'unique_feature_{n}'] for n in range(1,9))
materials['weight'] = sum(materials[f'weight_{n}'] for n in range(1,9))

materials.head()

Unnamed: 0,tube_assembly_id,component_id_1,quantity_1,component_id_2,quantity_2,component_id_3,quantity_3,component_id_4,quantity_4,component_id_5,quantity_5,component_id_6,quantity_6,component_id_7,quantity_7,component_id_8,quantity_8,components_total,components_distinct,orientation,unique_feature,weight,component_id,component_type_id,orientation_1,unique_feature_1,weight_1,component_id_2.1,component_type_id_2,orientation_2,unique_feature_2,weight_2,component_id_3.1,component_type_id_3,orientation_3,unique_feature_3,weight_3,component_id_4.1,component_type_id_4,orientation_4,unique_feature_4,weight_4,component_id_5.1,component_type_id_5,orientation_5,unique_feature_5,weight_5,component_id_6.1,component_type_id_6,orientation_6,unique_feature_6,weight_6,component_id_7.1,component_type_id_7,orientation_7,unique_feature_7,weight_7,component_id_8.1,component_type_id_8,orientation_8,unique_feature_8,weight_8
0,TA-00001,C-1622,2.0,C-1629,2.0,,,,,,,,,,,,,4.0,2,0.0,1.0,0.048,C-1622,CP-025,0.0,0.0,0.036,C-1629,CP-024,0.0,1.0,0.012,,,0.0,0.0,0.0,,,0.0,0.0,0.0,,,0.0,0.0,0.0,,,0.0,0.0,0.0,,,0.0,0.0,0.0,,,0.0,0.0,0.0
1,TA-00002,C-1312,2.0,,,,,,,,,,,,,,,2.0,1,0.0,0.0,0.009,C-1312,CP-028,0.0,0.0,0.009,,,0.0,0.0,0.0,,,0.0,0.0,0.0,,,0.0,0.0,0.0,,,0.0,0.0,0.0,,,0.0,0.0,0.0,,,0.0,0.0,0.0,,,0.0,0.0,0.0
2,TA-00003,C-1312,2.0,,,,,,,,,,,,,,,2.0,1,0.0,0.0,0.009,C-1312,CP-028,0.0,0.0,0.009,,,0.0,0.0,0.0,,,0.0,0.0,0.0,,,0.0,0.0,0.0,,,0.0,0.0,0.0,,,0.0,0.0,0.0,,,0.0,0.0,0.0,,,0.0,0.0,0.0
3,TA-00004,C-1312,2.0,,,,,,,,,,,,,,,2.0,1,0.0,0.0,0.009,C-1312,CP-028,0.0,0.0,0.009,,,0.0,0.0,0.0,,,0.0,0.0,0.0,,,0.0,0.0,0.0,,,0.0,0.0,0.0,,,0.0,0.0,0.0,,,0.0,0.0,0.0,,,0.0,0.0,0.0
4,TA-00005,C-1624,1.0,C-1631,1.0,C-1641,1.0,,,,,,,,,,,3.0,3,0.0,1.0,0.21,C-1624,CP-025,0.0,0.0,0.035,C-1631,CP-024,0.0,1.0,0.026,C-1641,CP-014,0.0,0.0,0.149,,,0.0,0.0,0.0,,,0.0,0.0,0.0,,,0.0,0.0,0.0,,,0.0,0.0,0.0,,,0.0,0.0,0.0


In [22]:
features = ['tube_assembly_id', 'orientation', 'unique_feature', 'weight', 
            'components_total', 'components_distinct', 'component_id_1']
materials = materials[features]
print(materials.shape)
materials.head()

(21198, 7)


Unnamed: 0,tube_assembly_id,orientation,unique_feature,weight,components_total,components_distinct,component_id_1
0,TA-00001,0.0,1.0,0.048,4.0,2,C-1622
1,TA-00002,0.0,0.0,0.009,2.0,1,C-1312
2,TA-00003,0.0,0.0,0.009,2.0,1,C-1312
3,TA-00004,0.0,0.0,0.009,2.0,1,C-1312
4,TA-00005,0.0,1.0,0.21,3.0,3,C-1624


### 3. Read tube data

In [27]:
tube = pd.read_csv(SOURCE + 'tube.csv')
tube.head()

Unnamed: 0,tube_assembly_id,material_id,diameter,wall,length,num_bends,bend_radius,end_a_1x,end_a_2x,end_x_1x,end_x_2x,end_a,end_x,num_boss,num_bracket,other
0,TA-00001,SP-0035,12.7,1.65,164.0,5,38.1,N,N,N,N,EF-003,EF-003,0,0,0
1,TA-00002,SP-0019,6.35,0.71,137.0,8,19.05,N,N,N,N,EF-008,EF-008,0,0,0
2,TA-00003,SP-0019,6.35,0.71,127.0,7,19.05,N,N,N,N,EF-008,EF-008,0,0,0
3,TA-00004,SP-0019,6.35,0.71,137.0,9,19.05,N,N,N,N,EF-008,EF-008,0,0,0
4,TA-00005,SP-0029,19.05,1.24,109.0,4,50.8,N,N,N,N,EF-003,EF-003,0,0,0


### 3a. Merge tube_end_form features into the tube data

In [32]:
tube_end_form = pd.read_csv(SOURCE + 'tube_end_form.csv')
tube_end_form.head()

Unnamed: 0,end_form_id,forming
0,EF-001,Yes
1,EF-002,No
2,EF-003,No
3,EF-004,No
4,EF-005,Yes


In [34]:
# Look up the forming flag for each tube end (end_a and end_x)
tube_end_a = tube_end_form.rename(columns={'end_form_id': 'end_a', 'forming': 'forming_a'})
tube_end_x = tube_end_form.rename(columns={'end_form_id': 'end_x', 'forming': 'forming_x'})
tube = tube.merge(tube_end_a, how='left').merge(tube_end_x, how='left')
# Concatenate the two Yes/No flags into one categorical feature, e.g. 'NoNo' or 'YesYes'
tube = tube.assign(forming = tube.forming_a + tube.forming_x)
tube.head()

Unnamed: 0,tube_assembly_id,material_id,diameter,wall,length,num_bends,bend_radius,end_a_1x,end_a_2x,end_x_1x,end_x_2x,end_a,end_x,num_boss,num_bracket,other,forming_a,forming_x,forming
0,TA-00001,SP-0035,12.7,1.65,164.0,5,38.1,N,N,N,N,EF-003,EF-003,0,0,0,No,No,NoNo
1,TA-00002,SP-0019,6.35,0.71,137.0,8,19.05,N,N,N,N,EF-008,EF-008,0,0,0,Yes,Yes,YesYes
2,TA-00003,SP-0019,6.35,0.71,127.0,7,19.05,N,N,N,N,EF-008,EF-008,0,0,0,Yes,Yes,YesYes
3,TA-00004,SP-0019,6.35,0.71,137.0,9,19.05,N,N,N,N,EF-008,EF-008,0,0,0,Yes,Yes,YesYes
4,TA-00005,SP-0029,19.05,1.24,109.0,4,50.8,N,N,N,N,EF-003,EF-003,0,0,0,No,No,NoNo
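
Adding the two string columns concatenates them, which yields a categorical feature. As an alternative (not what this notebook uses), the flags could be counted instead. A minimal sketch on toy values, where `formed_ends` is a hypothetical column name:

```python
import pandas as pd

ends = pd.DataFrame({'forming_a': ['No', 'Yes', 'Yes'],
                     'forming_x': ['No', 'Yes', 'No']})

# String addition concatenates: a categorical with up to four levels.
concatenated = ends['forming_a'] + ends['forming_x']

# Numeric alternative: how many of the two ends are formed (0, 1, or 2).
formed_ends = ((ends['forming_a'] == 'Yes').astype(int)
               + (ends['forming_x'] == 'Yes').astype(int))

print(concatenated.tolist())
print(formed_ends.tolist())
```

The concatenated version keeps the distinction between 'YesNo' and 'NoYes'; the count discards it but gives the model an ordered number.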


### 4. Merge all this data with train, validation, and test sets

In [0]:
# Read data
trainval = pd.read_csv(SOURCE + 'train_set.csv')
test = pd.read_csv(SOURCE + 'test_set.csv')

# Split into train & validation sets
# All rows for a given tube_assembly_id should go in either train or validation
trainval_tube_assemblies = trainval['tube_assembly_id'].unique()
train_tube_assemblies, val_tube_assemblies = train_test_split(
    trainval_tube_assemblies, random_state=42)
train = trainval[trainval.tube_assembly_id.isin(train_tube_assemblies)]
val = trainval[trainval.tube_assembly_id.isin(val_tube_assemblies)]

# Wrangle train, validation, and test sets
def wrangle(X):
    X = X.copy()
    
    # Engineer date features
    X['quote_date'] = pd.to_datetime(X['quote_date'])
    X['quote_date_year'] = X['quote_date'].dt.year
    X['quote_date_month'] = X['quote_date'].dt.month
    X = X.drop(columns='quote_date')
    
    # Merge data
    X = (X.merge(table, how='left')
         .merge(materials, how='left')
         .merge(tube, how='left')
         .fillna(0))
    
    # Drop tube_assembly_id because our goal is to predict unknown assemblies
    X = X.drop(columns='tube_assembly_id')
    return X

train_wrangled = wrangle(train)
val_wrangled = wrangle(val)
test_wrangled = wrangle(test)
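
The split above works on unique assembly IDs rather than on rows. A small sketch, using hypothetical IDs and costs, of the same pattern plus a check that no assembly ends up on both sides:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy quotes table: several price quotes per assembly.
quotes = pd.DataFrame({
    'tube_assembly_id': ['TA-1'] * 3 + ['TA-2'] * 2 + ['TA-3'] * 3 + ['TA-4'] * 2,
    'cost': [5.0, 4.0, 3.5, 9.0, 8.0, 2.0, 1.8, 1.5, 7.0, 6.5],
})

# Split the unique IDs, then select rows by ID membership.
ids = quotes['tube_assembly_id'].unique()
train_ids, val_ids = train_test_split(ids, random_state=42)
toy_train = quotes[quotes['tube_assembly_id'].isin(train_ids)]
toy_val = quotes[quotes['tube_assembly_id'].isin(val_ids)]

# Leakage check: the two sets should share no assembly.
overlap = set(toy_train['tube_assembly_id']) & set(toy_val['tube_assembly_id'])
print('shared assemblies:', overlap)
```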

### 5. Arrange X matrix and y vector (log-transformed)

In [0]:
target = 'cost'
X_train = train_wrangled.drop(columns=target)
X_val = val_wrangled.drop(columns=target)
X_test = test_wrangled.drop(columns='id')
y_train = train[target]
y_val = val[target]
y_train_log = np.log1p(y_train)
y_val_log = np.log1p(y_val)
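
The reason for the log transform: the competition metric is RMSLE, and RMSE computed on `log1p`-transformed values is the same quantity. A short numeric check on made-up costs (the `rmsle` helper is defined here for illustration only):

```python
import numpy as np

y_true = np.array([10.0, 20.0, 30.0])
y_pred = np.array([12.0, 18.0, 33.0])

def rmsle(actual, predicted):
    """Root mean squared logarithmic error on the original scale."""
    return np.sqrt(np.mean((np.log1p(predicted) - np.log1p(actual)) ** 2))

# Work in log space, as a model trained on the log target would.
y_true_log = np.log1p(y_true)
y_pred_log = np.log1p(y_pred)
rmse_in_log_space = np.sqrt(np.mean((y_pred_log - y_true_log) ** 2))

# expm1 undoes log1p, returning predictions to the original units.
y_pred_back = np.expm1(y_pred_log)

print(rmse_in_log_space, rmsle(y_true, y_pred_back))
```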

### 6. Use XGBoost to fit and evaluate model

In [0]:
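# Subclass that always fits with early stopping on a validation set.
# Note: `eval_set` is read from the enclosing (global) scope, so it must be
# defined before `fit` is called. This pipeline is never fit here; the next
# cell replaces it.
class XGRegressorEval(XGBRegressor):
  def fit(self, *args, **kwargs):
    return super().fit(*args, eval_set=eval_set, eval_metric='rmse', 
          early_stopping_rounds=50, **kwargs)
    
pipeline = make_pipeline(ce.OrdinalEncoder(), 
                         XGRegressorEval(n_estimators=1000, random_state=42, n_jobs=-1))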

In [62]:
warnings.simplefilter(action='ignore', category=FutureWarning)

pipeline = make_pipeline(
    ce.OrdinalEncoder(), 
    XGBRegressor(random_state=42)
)

param_distributions = {
    'xgbregressor__n_estimators': randint(500,1000),
    'xgbregressor__max_depth': randint(3, 7)
}

search = RandomizedSearchCV( 
    pipeline,
    param_distributions=param_distributions,
    n_iter=10,
    cv=5,
    scoring='neg_mean_squared_error',
    verbose=10,
    return_train_score=True,
    n_jobs=-1
)

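# Caveat: with an integer `cv`, scikit-learn uses plain KFold, which ignores `groups`.
# A group-aware splitter (e.g. GroupKFold) is needed to keep each assembly in one fold.
groups = train['tube_assembly_id']
search.fit(X_train, y_train_log, groups=groups);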

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:   33.4s
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:  3.1min
[Parallel(n_jobs=-1)]: Done  14 tasks      | elapsed:  4.7min
[Parallel(n_jobs=-1)]: Done  21 tasks      | elapsed:  6.5min
[Parallel(n_jobs=-1)]: Done  28 tasks      | elapsed:  8.6min
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed: 11.6min
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed: 15.4min
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed: 16.3min finished
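
The `groups` argument only influences the folds when the `cv` object is group-aware. A minimal sketch, on toy data with hypothetical IDs, of how `GroupKFold` keeps every row of an assembly on one side of each split:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Ten toy rows belonging to six assemblies.
toy_groups = np.array(['TA-1', 'TA-1', 'TA-2', 'TA-2', 'TA-3',
                       'TA-3', 'TA-4', 'TA-5', 'TA-5', 'TA-6'])
X_toy = np.arange(len(toy_groups)).reshape(-1, 1)
y_toy = np.arange(len(toy_groups), dtype=float)

group_cv = GroupKFold(n_splits=3)

# For each fold, collect any assembly seen in both train and test.
shared_per_fold = []
for train_idx, test_idx in group_cv.split(X_toy, y_toy, groups=toy_groups):
    shared = set(toy_groups[train_idx]) & set(toy_groups[test_idx])
    shared_per_fold.append(shared)

print(shared_per_fold)
```

Passing `cv=GroupKFold(n_splits=5)` to `RandomizedSearchCV`, with `groups=groups` still given to `fit`, would apply the same rule to the search. The scores reported in this notebook were not produced that way, so they would likely change.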




In [63]:
print('Best hyperparameters', search.best_params_)
print('Cross-validation RMSLE', np.sqrt(-search.best_score_))

Best hyperparameters {'xgbregressor__max_depth': 6, 'xgbregressor__n_estimators': 666}
Cross-validation RMSLE 0.28739240004335115


## See detailed results

In [64]:
pd.DataFrame(search.cv_results_).sort_values(by='rank_test_score')

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_xgbregressor__max_depth,param_xgbregressor__n_estimators,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,split3_train_score,split4_train_score,mean_train_score,std_train_score
7,44.829155,0.16124,0.252764,0.008539,6,666,"{'xgbregressor__max_depth': 6, 'xgbregressor__...",-0.073504,-0.064903,-0.081792,-0.122083,-0.070695,-0.082594,0.020479,1,-0.005998,-0.005997,-0.00553,-0.004231,-0.006454,-0.005642,0.000763
8,58.814881,0.241069,0.355354,0.014722,6,877,"{'xgbregressor__max_depth': 6, 'xgbregressor__...",-0.074233,-0.06469,-0.081679,-0.122059,-0.070571,-0.082645,0.02046,2,-0.004153,-0.00425,-0.003768,-0.002805,-0.004453,-0.003886,0.000585
5,37.318288,0.128471,0.187184,0.003099,5,672,"{'xgbregressor__max_depth': 5, 'xgbregressor__...",-0.073594,-0.066566,-0.08227,-0.122027,-0.070771,-0.083044,0.020157,3,-0.01226,-0.012856,-0.011648,-0.009517,-0.013541,-0.011964,0.001375
1,42.296043,0.188037,0.213968,0.020965,5,759,"{'xgbregressor__max_depth': 5, 'xgbregressor__...",-0.074604,-0.066204,-0.08265,-0.122203,-0.070283,-0.083188,0.020253,4,-0.010705,-0.011621,-0.010434,-0.008272,-0.011934,-0.010593,0.001287
4,40.322105,0.155136,0.210595,0.012818,5,725,"{'xgbregressor__max_depth': 5, 'xgbregressor__...",-0.074197,-0.066323,-0.082537,-0.122341,-0.070728,-0.083224,0.020267,5,-0.011214,-0.01214,-0.011022,-0.008662,-0.012337,-0.011075,0.001309
2,42.682102,0.302261,0.203198,0.009878,4,955,"{'xgbregressor__max_depth': 4, 'xgbregressor__...",-0.077464,-0.06988,-0.086473,-0.12531,-0.076512,-0.087126,0.019808,6,-0.018407,-0.01923,-0.017546,-0.01447,-0.019331,-0.017797,0.001784
9,28.635311,0.900268,0.147299,0.01544,4,647,"{'xgbregressor__max_depth': 4, 'xgbregressor__...",-0.077735,-0.069164,-0.08766,-0.12578,-0.076618,-0.08739,0.020075,7,-0.024183,-0.025635,-0.022971,-0.019238,-0.025702,-0.023546,0.002379
6,33.354902,0.080207,0.165641,0.00636,3,954,"{'xgbregressor__max_depth': 3, 'xgbregressor__...",-0.077423,-0.071592,-0.092649,-0.139892,-0.078827,-0.092075,0.024885,8,-0.03445,-0.035387,-0.033197,-0.028146,-0.037068,-0.03365,0.003028
0,31.065596,0.261691,0.168437,0.0173,3,864,"{'xgbregressor__max_depth': 3, 'xgbregressor__...",-0.07769,-0.072094,-0.093156,-0.139572,-0.080011,-0.092503,0.024526,9,-0.03632,-0.037307,-0.034734,-0.029732,-0.03903,-0.035425,0.003169
3,20.656617,0.124907,0.118014,0.005736,3,588,"{'xgbregressor__max_depth': 3, 'xgbregressor__...",-0.080742,-0.074871,-0.096694,-0.142759,-0.08353,-0.095718,0.02458,10,-0.043471,-0.045654,-0.04208,-0.035826,-0.047287,-0.042864,0.003946
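
One way to read this table is to convert the negated MSE scores to RMSLE and compare train against test. The sketch below copies `mean_test_score` and `mean_train_score` from the first and last rows above; the derived column names are illustrative:

```python
import numpy as np
import pandas as pd

# Best (rank 1) and worst (rank 10) candidates from the table above.
summary = pd.DataFrame({
    'max_depth': [6, 3],
    'n_estimators': [666, 588],
    'mean_test_score': [-0.082594, -0.095718],
    'mean_train_score': [-0.005642, -0.042864],
})

# Scores are negated MSE on the log target, so sqrt(-score) is RMSLE.
summary['test_rmsle'] = np.sqrt(-summary['mean_test_score'])
summary['train_rmsle'] = np.sqrt(-summary['mean_train_score'])

# A large gap suggests the model fits the training folds much more
# closely than it generalizes.
summary['gap'] = summary['test_rmsle'] - summary['train_rmsle']
print(summary[['max_depth', 'n_estimators', 'train_rmsle', 'test_rmsle', 'gap']])
```

Here the deeper model scores best on the held-out folds but also shows the larger train/test gap, which may be worth keeping in mind when widening the search.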


## Generate Submission for Kaggle

In [0]:
def generate_submission(estimator, X_test, filename):
    y_pred_log = estimator.predict(X_test)
    y_pred = np.expm1(y_pred_log)  # Convert from log-dollars to dollars
    submission = pd.read_csv(SOURCE + '../sample_submission.csv')
    submission['cost'] = y_pred
    submission.to_csv(filename, index=False)

In [0]:
# X_test_encoded = encoder.transform(X_test)
# generate_submission(model, X_test_encoded, 'submission.csv')

pipeline = search.best_estimator_
    
generate_submission(pipeline, X_test, 'submission.csv')

Private Score: 0.24002

Public Score: 0.24575