# Assignment
- Start a clean notebook.
- Get the [Caterpillar data from Kaggle](https://www.kaggle.com/c/caterpillar-tube-pricing/data).
- Do a train/validate/test split.
- Select features from `train_set.csv`, `tube.csv`, and at least one more file.
- Fit a model.
- Get your validation RMSLE (or RMSE with log-transformed targets).
- [Submit](https://www.kaggle.com/c/caterpillar-tube-pricing/submit) your predictions to the Kaggle competition.
- Commit your notebook to your fork of the GitHub repo.
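
The two metrics named in the list above give the same number when the target is transformed with `log1p`. A minimal sketch with made-up costs (the arrays and the helper names `rmse` and `rmsle` are illustrative, not part of the competition code):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return np.sqrt(np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2))

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error: RMSE on log1p-transformed values."""
    return rmse(np.log1p(y_true), np.log1p(y_pred))

# Made-up costs, for illustration only
y_true = np.array([21.97, 12.41, 6.67, 4.75, 3.61])
y_pred = np.array([20.00, 13.00, 7.00, 4.00, 4.00])

# If the model is fit on log1p(cost), its predictions live on the log scale
y_true_log = np.log1p(y_true)
y_pred_log = np.log1p(y_pred)  # stands in for predictions of such a model

score_on_logs = rmse(y_true_log, y_pred_log)           # RMSE with log targets
score_on_costs = rmsle(y_true, np.expm1(y_pred_log))   # RMSLE on original scale
print(score_on_logs, score_on_costs)
```

If the model is trained on `log1p(cost)`, the predictions need `np.expm1` applied before they go into a Kaggle submission.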

## Stretch Goals
- Improve your scores on Kaggle.
- Make visualizations and share on Slack.
- Look at [Kaggle Kernels](https://www.kaggle.com/c/caterpillar-tube-pricing/kernels) for ideas about feature engineering and visualization.

Read the [Better Explained](https://betterexplained.com/) Exponents & Logs series:

1. [An Intuitive Guide To Exponential Functions & e](https://betterexplained.com/articles/an-intuitive-guide-to-exponential-functions-e/)
2. [Demystifying the Natural Logarithm (ln)](https://betterexplained.com/articles/demystifying-the-natural-logarithm-ln/)
3. [A Visual Guide to Simple, Compound and Continuous Interest Rates](https://betterexplained.com/articles/a-visual-guide-to-simple-compound-and-continuous-interest-rates/)
4. [Common Definitions of e (Colorized)](https://betterexplained.com/articles/definitions-of-e-colorized/)
5. [Understanding Exponents (Why does 0^0 = 1?)](https://betterexplained.com/articles/understanding-exponents-why-does-00-1/)
6. [Using Logarithms in the Real World](https://betterexplained.com/articles/using-logs-in-the-real-world/)
7. [How To Think With Exponents And Logarithms](https://betterexplained.com/articles/think-with-exponents/)
8. [Understanding Discrete vs. Continuous Growth](https://betterexplained.com/articles/understanding-discrete-vs-continuous-growth/)
9. [What does an exponent really mean?](https://betterexplained.com/articles/what-does-an-exponent-mean/)
10. [Q: Why is e special? (2.718..., not 2, 3.7 or another number?)](https://betterexplained.com/articles/q-why-is-e-special-2-718-not-other-number/)

In [18]:
from glob import glob
import pandas as pd

for path in glob('competition_data/*.csv'):
    df = pd.read_csv(path)
    print(path, df.shape)

competition_data/comp_threaded.csv (194, 32)
competition_data/comp_adaptor.csv (25, 20)
competition_data/tube_end_form.csv (27, 2)
competition_data/comp_straight.csv (361, 12)
competition_data/comp_tee.csv (4, 14)
competition_data/comp_boss.csv (147, 15)
competition_data/components.csv (2048, 3)
competition_data/comp_float.csv (16, 7)
competition_data/bill_of_materials.csv (21198, 17)
competition_data/comp_elbow.csv (178, 16)
competition_data/type_connection.csv (14, 2)
competition_data/train_set.csv (30213, 8)
competition_data/comp_sleeve.csv (50, 10)
competition_data/test_set.csv (30235, 8)
competition_data/tube.csv (21198, 16)
competition_data/comp_hfl.csv (6, 9)
competition_data/type_end_form.csv (8, 2)
competition_data/comp_other.csv (1001, 3)
competition_data/type_component.csv (29, 2)
competition_data/specs.csv (21198, 11)
competition_data/comp_nut.csv (65, 11)


In [142]:
train = pd.read_csv('competition_data/train_set.csv')
test = pd.read_csv('competition_data/test_set.csv')
tube = pd.read_csv('competition_data/tube.csv')
mats = pd.read_csv('competition_data/bill_of_materials.csv')
comps = pd.read_csv('competition_data/components.csv')
specs = pd.read_csv('competition_data/specs.csv')
end_form = pd.read_csv('competition_data/tube_end_form.csv')

## Merging

#### merging tube with end_form df

In [143]:
# merge the end_form lookup onto tube twice: once for end_a, once for end_x
tube = (tube
        .merge(end_form, how='left', left_on='end_a', right_on='end_form_id')
        .merge(end_form, how='left', left_on='end_x', right_on='end_form_id'))

In [144]:
# drop duplicate ids
tube = tube.drop(['end_form_id_x','end_form_id_y'],axis=1)
# rename forming columns to match end_a and end_x
tube = tube.rename({'forming_x':'forming_a','forming_y':'forming_x'},axis=1)
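
Merging the same lookup table twice leaves suffixed duplicates (`_x` for the first merge, `_y` for the second), which is why the drop and rename above are needed. A small sketch of that behaviour, using made-up frames `tube_demo` and `end_form_demo` with hypothetical IDs:

```python
import pandas as pd

# Hypothetical stand-ins for tube.csv and tube_end_form.csv
tube_demo = pd.DataFrame({
    'tube_assembly_id': ['TA-1', 'TA-2'],
    'end_a': ['EF-001', 'EF-002'],
    'end_x': ['EF-002', 'EF-001'],
})
end_form_demo = pd.DataFrame({
    'end_form_id': ['EF-001', 'EF-002'],
    'forming': ['Yes', 'No'],
})

merged = (tube_demo
          .merge(end_form_demo, how='left', left_on='end_a', right_on='end_form_id')
          .merge(end_form_demo, how='left', left_on='end_x', right_on='end_form_id'))
print(list(merged.columns))

# Drop the duplicated keys, then name each forming column after the end it describes
merged = merged.drop(columns=['end_form_id_x', 'end_form_id_y'])
merged = merged.rename(columns={'forming_x': 'forming_a', 'forming_y': 'forming_x'})
print(merged)
```

Note that after the rename, `forming_x` means "forming for `end_x`", not "left side of a merge".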

#### merging comps onto mats.
We only use the first component (`component_id_1`) of each assembly.

In [145]:
# merging comps on mats but only on the first component
mats = mats.merge(comps,left_on='component_id_1',right_on='component_id',how='left')

In [146]:
# dropping redundant columns
mats = mats.drop(['component_id','component_type_id'],axis=1)

In [147]:
train = train.merge(tube, on='tube_assembly_id', how='left')
test = test.merge(tube, on='tube_assembly_id', how='left')

In [148]:
train = train.merge(mats, on='tube_assembly_id')
test = test.merge(mats, on='tube_assembly_id')

In [149]:
print(train['tube_assembly_id'].nunique(), test['tube_assembly_id'].nunique())

8855 8856


In [150]:
unique_tubes = train['tube_assembly_id'].unique()

### Exploring Features

In [156]:
cat_features = train.describe(exclude='number').T.sort_values(by='unique')

In [191]:
hot_encode_cols = list(cat_features.loc[cat_features['unique'] < 150].index)

In [192]:
hot_encode_cols

['component_id_8',
 'bracket_pricing',
 'end_a_1x',
 'end_a_2x',
 'end_x_1x',
 'end_x_2x',
 'forming_a',
 'forming_x',
 'component_id_7',
 'component_id_6',
 'material_id',
 'end_x',
 'end_a',
 'component_id_5',
 'supplier',
 'component_id_4',
 'name']
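
The same kind of list can be built from `nunique()` directly, without going through `describe()`. A minimal sketch on a made-up frame (`demo` and the threshold of 4 are illustrative; the notebook uses 150):

```python
import pandas as pd

# Made-up rows that mimic a few of the merged columns
demo = pd.DataFrame({
    'bracket_pricing': ['Yes', 'No', 'Yes', 'Yes'],
    'supplier': ['S-0066', 'S-0041', 'S-0066', 'S-0072'],
    'tube_assembly_id': ['TA-00002', 'TA-00004', 'TA-00005', 'TA-00012'],
    'quantity': [1, 2, 5, 10],
})

# Count distinct values per non-numeric column, lowest cardinality first
cardinality = demo.select_dtypes(exclude='number').nunique().sort_values()

# Keep only the columns with few enough categories to one-hot encode
low_cardinality = list(cardinality[cardinality < 4].index)
print(low_cardinality)
```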

In [193]:
train.describe()

Unnamed: 0,annual_usage,min_order_quantity,quantity,cost,diameter,wall,length,num_bends,bend_radius,num_boss,num_bracket,other,quantity_1,quantity_2,quantity_3,quantity_4,quantity_5,quantity_6,quantity_7,quantity_8
count,30213.0,30213.0,30213.0,30213.0,30213.0,30213.0,30213.0,30213.0,30213.0,30213.0,30213.0,30213.0,28751.0,21084.0,7171.0,787.0,66.0,28.0,8.0,3.0
mean,1.120939,0.272728,2.465449,2.200478,2.672097,1.384782,97.647605,3.813061,3.455385,0.009489,0.001343,0.004891,1.641021,1.566591,0.425439,0.424446,0.42503,1.178571,1.0,1.0
std,1.91782,0.786065,1.540914,0.82325,0.595846,0.63861,63.230131,2.199564,0.790113,0.064777,0.024488,0.047073,0.488404,0.504756,0.01785,0.01548,0.016197,0.390021,0.0,0.0
min,0.0,0.0,0.693147,0.407831,1.430311,0.71,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.423036,0.423036,0.423036,1.0,1.0,1.0
25%,0.0,0.0,1.098612,1.584774,2.353278,0.89,48.0,2.0,2.998229,0.0,0.0,0.0,1.0,1.0,0.423036,0.423036,0.423036,1.0,1.0,1.0
50%,0.0,0.0,2.397895,2.017719,2.617396,1.24,86.0,3.0,3.488903,0.0,0.0,0.0,2.0,2.0,0.423036,0.423036,0.423036,1.0,1.0,1.0
75%,1.098612,0.0,3.713572,2.669433,2.998229,1.65,133.0,5.0,3.94739,0.0,0.0,0.0,2.0,2.0,0.423036,0.423036,0.423036,1.0,1.0,1.0
max,11.918397,6.284134,7.824446,6.908755,5.3191,7.9,1333.0,17.0,9.21034,0.706395,0.6258,0.771165,4.0,4.0,0.672503,0.672503,0.554618,2.0,1.0,1.0


### Wrangle

In [176]:
import seaborn as sns
import numpy as np

In [180]:
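# log-transform skewed numeric columns; the columns are chosen once, from train,
# so that train and test get exactly the same transformation
def transform_skewed_cols(df, skew_cols):
    df = df.copy()
    for col in skew_cols:
        if col in df.columns:  # e.g. test has no cost column
            df[col] = np.log1p(df[col])
    return df

In [182]:
skew = train.skew(numeric_only=True)
skew_cols = list(skew[skew > 4].index)
train = transform_skewed_cols(train, skew_cols)
test = transform_skewed_cols(test, skew_cols)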
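
The column choice matters here: if the skewed columns are re-derived after train has already been transformed, test can end up with a different set of transformed columns. A sketch of one way to keep them aligned, on made-up data (the helper `log1p_columns` and both demo frames are hypothetical):

```python
import numpy as np
import pandas as pd

def log1p_columns(df, cols):
    """Return a copy of df with log1p applied to the given columns, if present."""
    df = df.copy()
    for col in cols:
        if col in df.columns:
            df[col] = np.log1p(df[col])
    return df

train_demo = pd.DataFrame({
    'annual_usage': [1.0] * 99 + [1000.0],   # one huge value: strongly right-skewed
    'wall': np.linspace(0.7, 2.0, 100),      # evenly spread: no skew
})
test_demo = pd.DataFrame({
    'annual_usage': [0.0, 10.0, 100.0],
    'wall': [0.9, 1.2, 1.6],
})

# Decide which columns to transform once, using train only
skew = train_demo.skew(numeric_only=True)
skew_cols = list(skew[skew > 4].index)

# Apply the same column list to both frames
train_log = log1p_columns(train_demo, skew_cols)
test_log = log1p_columns(test_demo, skew_cols)
print(skew_cols)
```

Because the helper works on a copy, the original frames are left untouched and re-running the cell does not apply the log twice.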


In [215]:
# remove quote date and replace by year and month cols
def convert_add_dates(df):
    df['quote_date'] = pd.to_datetime(df['quote_date'])
    df['month'] = df['quote_date'].dt.month
    df['year'] = df['quote_date'].dt.year
    df = df.drop('quote_date',axis=1)
    return df

In [217]:
test = convert_add_dates(test)
train = convert_add_dates(train)

## Train/Validation Split

In [None]:
from sklearn.model_selection import train_test_split

In [119]:
tubes_train, tubes_val = train_test_split(unique_tubes, random_state=42)

In [128]:
# select the validation rows before filtering train, otherwise train_val would be empty
train_val = train[train['tube_assembly_id'].isin(tubes_val)]

In [129]:
train = train[train['tube_assembly_id'].isin(tubes_train)]
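
The split is done on tube IDs rather than on rows because one assembly appears in several rows (one per quantity bracket); a row-wise split would put the same tube on both sides. A minimal sketch on a made-up frame (`quotes` and its IDs are hypothetical) that checks the two sides do not overlap:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up quotes: each tube has one row per quantity bracket
quotes = pd.DataFrame({
    'tube_assembly_id': ['TA-1', 'TA-1', 'TA-2', 'TA-3', 'TA-3', 'TA-4', 'TA-4', 'TA-4'],
    'quantity': [1, 2, 1, 1, 5, 1, 2, 5],
})

# Split the unique IDs, not the rows
ids = quotes['tube_assembly_id'].drop_duplicates().to_numpy()
ids_train, ids_val = train_test_split(ids, random_state=42)

# Build both subsets from the full frame before overwriting anything
val_rows = quotes[quotes['tube_assembly_id'].isin(ids_val)]
train_rows = quotes[quotes['tube_assembly_id'].isin(ids_train)]

# No tube should appear on both sides
overlap = set(train_rows['tube_assembly_id']) & set(val_rows['tube_assembly_id'])
print(len(train_rows), len(val_rows), overlap)
```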

### define features and target

In [135]:
# checking for cardinality

In [139]:
train.describe(exclude='number').T.sort_values(by='unique')

Unnamed: 0,count,unique,top,freq
component_id_8,3,1,C-1981,3
bracket_pricing,22628,2,Yes,19699
end_a_1x,22628,2,N,22024
end_a_2x,22628,2,N,20757
end_x_1x,22628,2,N,22194
end_x_2x,22628,2,N,20939
forming_a,21946,2,No,13911
forming_x,21605,2,No,13274
component_id_7,8,3,C-1921,4
component_id_6,25,10,C-2005,8


In [131]:
train.head()

Unnamed: 0,tube_assembly_id,supplier,quote_date,annual_usage,min_order_quantity,bracket_pricing,quantity,cost,material_id,diameter,...,quantity_4,component_id_5,quantity_5,component_id_6,quantity_6,component_id_7,quantity_7,component_id_8,quantity_8,name
8,TA-00004,S-0066,2013-07-07,0,0,Yes,1,21.972702,SP-0019,6.35,...,,,,,,,,,,NUT-FLARED
9,TA-00004,S-0066,2013-07-07,0,0,Yes,2,12.407983,SP-0019,6.35,...,,,,,,,,,,NUT-FLARED
10,TA-00004,S-0066,2013-07-07,0,0,Yes,5,6.668596,SP-0019,6.35,...,,,,,,,,,,NUT-FLARED
11,TA-00004,S-0066,2013-07-07,0,0,Yes,10,4.754539,SP-0019,6.35,...,,,,,,,,,,NUT-FLARED
12,TA-00004,S-0066,2013-07-07,0,0,Yes,25,3.608331,SP-0019,6.35,...,,,,,,,,,,NUT-FLARED


In [132]:
target = 'cost'
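# exclude the target (leakage) and the tube ID (validation tubes are unseen by construction)
features = [col for col in train.columns if col not in (target, 'tube_assembly_id')]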

In [133]:
X_train = train[features]
y_train = train[target]
X_val = train_val[features]
y_val = train_val[target]

### make pipeline

In [141]:
from sklearn.pipeline import Pipeline
import category_encoders as ce
from sklearn.ensemble import RandomForestRegressor

In [None]:
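# Assumed steps, inferred from the imports above: ordinal-encode categoricals, then fit a random forest
pipeline = Pipeline([('encoder', ce.OrdinalEncoder()), ('model', RandomForestRegressor(random_state=42))])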
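
The notebook stops before the pipeline is fit. A sketch of how fitting and validation scoring could look, on made-up data. Two things here are assumptions: the sketch swaps in scikit-learn's own `OrdinalEncoder` (through a column transformer) instead of `category_encoders`, so that it runs with scikit-learn alone, and it adds a median imputer for the missing `quantity_*` values. All frame names and values are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-ins for X_train / X_val; targets are log1p(cost)
X_train_demo = pd.DataFrame({
    'supplier': ['S-0066', 'S-0041', 'S-0066', 'S-0072', 'S-0041', 'S-0066'],
    'quantity': [1, 2, 5, 10, 25, 50],
    'quantity_2': [2.0, np.nan, 1.0, np.nan, 2.0, 1.0],
})
y_train_demo = np.log1p([22.0, 12.4, 6.7, 4.8, 3.6, 3.2])
X_val_demo = pd.DataFrame({
    'supplier': ['S-0066', 'S-9999'],  # second supplier never seen in training
    'quantity': [1, 25],
    'quantity_2': [np.nan, 2.0],
})
y_val_demo = np.log1p([21.0, 3.9])

# Encode the categorical column; unseen categories map to -1
encode = make_column_transformer(
    (OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1), ['supplier']),
    remainder='passthrough',
)
demo_pipeline = make_pipeline(
    encode,
    SimpleImputer(strategy='median'),
    RandomForestRegressor(n_estimators=50, random_state=42),
)
demo_pipeline.fit(X_train_demo, y_train_demo)

# RMSE on log targets is the validation RMSLE
pred_log = demo_pipeline.predict(X_val_demo)
val_rmse = np.sqrt(mean_squared_error(y_val_demo, pred_log))

# Undo the log before writing a submission file
pred_cost = np.expm1(pred_log)
print(val_rmse, pred_cost)
```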