# Assignment
- Start a clean notebook.
- Get the [Caterpillar data from Kaggle](https://www.kaggle.com/c/caterpillar-tube-pricing/data).
- Do train/validate/test split.
- Select features from `train_set.csv`, `tube.csv`, and at least one more file.
- Fit a model.
- Get your validation RMSLE (or RMSE with log-transformed targets).
- [Submit](https://www.kaggle.com/c/caterpillar-tube-pricing/submit) your predictions to the Kaggle competition.
- Commit your notebook to your fork of the GitHub repo.
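
The "RMSLE (or RMSE with log-transformed targets)" item above relies on the two metrics being the same number. A minimal sketch of that equivalence, assuming the `log1p` form of RMSLE; the helper names and the cost values here are invented for illustration:

```python
import numpy as np

def rmse(y_true, y_pred):
    # root mean squared error on whatever scale the inputs are in
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def rmsle(y_true, y_pred):
    # root mean squared logarithmic error, using log(1 + y)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# made-up costs on the original (dollar) scale
cost_true = np.array([21.9, 12.3, 6.6, 4.7])
cost_pred = np.array([20.0, 13.0, 7.0, 4.0])

a = rmsle(cost_true, cost_pred)
b = rmse(np.log1p(cost_true), np.log1p(cost_pred))
print(a, b)  # the two values agree
```

So a model trained on `log1p(cost)` can be scored with plain RMSE, and the result is comparable to the competition's RMSLE.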

## Stretch Goals
- Improve your scores on Kaggle.
- Make visualizations and share on Slack.
- Look at [Kaggle Kernels](https://www.kaggle.com/c/caterpillar-tube-pricing/kernels) for ideas about feature engineering and visualization.

Read the [Better Explained](https://betterexplained.com/) Exponents & Logs series:

1. [An Intuitive Guide To Exponential Functions & e](https://betterexplained.com/articles/an-intuitive-guide-to-exponential-functions-e/)
2. [Demystifying the Natural Logarithm (ln)](https://betterexplained.com/articles/demystifying-the-natural-logarithm-ln/)
3. [A Visual Guide to Simple, Compound and Continuous Interest Rates](https://betterexplained.com/articles/a-visual-guide-to-simple-compound-and-continuous-interest-rates/)
4. [Common Definitions of e (Colorized)](https://betterexplained.com/articles/definitions-of-e-colorized/)
5. [Understanding Exponents (Why does 0^0 = 1?)](https://betterexplained.com/articles/understanding-exponents-why-does-00-1/)
6. [Using Logarithms in the Real World](https://betterexplained.com/articles/using-logs-in-the-real-world/)
7. [How To Think With Exponents And Logarithms](https://betterexplained.com/articles/think-with-exponents/)
8. [Understanding Discrete vs. Continuous Growth](https://betterexplained.com/articles/understanding-discrete-vs-continuous-growth/)
9. [What does an exponent really mean?](https://betterexplained.com/articles/what-does-an-exponent-mean/)
10. [Q: Why is e special? (2.718..., not 2, 3.7 or another number?)](https://betterexplained.com/articles/q-why-is-e-special-2-718-not-other-number/)

In [18]:
from glob import glob
import pandas as pd

for path in glob('competition_data/*.csv'):
    df = pd.read_csv(path)
    print(path, df.shape)

competition_data/comp_threaded.csv (194, 32)
competition_data/comp_adaptor.csv (25, 20)
competition_data/tube_end_form.csv (27, 2)
competition_data/comp_straight.csv (361, 12)
competition_data/comp_tee.csv (4, 14)
competition_data/comp_boss.csv (147, 15)
competition_data/components.csv (2048, 3)
competition_data/comp_float.csv (16, 7)
competition_data/bill_of_materials.csv (21198, 17)
competition_data/comp_elbow.csv (178, 16)
competition_data/type_connection.csv (14, 2)
competition_data/train_set.csv (30213, 8)
competition_data/comp_sleeve.csv (50, 10)
competition_data/test_set.csv (30235, 8)
competition_data/tube.csv (21198, 16)
competition_data/comp_hfl.csv (6, 9)
competition_data/type_end_form.csv (8, 2)
competition_data/comp_other.csv (1001, 3)
competition_data/type_component.csv (29, 2)
competition_data/specs.csv (21198, 11)
competition_data/comp_nut.csv (65, 11)


In [454]:
train = pd.read_csv('competition_data/train_set.csv')
test = pd.read_csv('competition_data/test_set.csv')
tube = pd.read_csv('competition_data/tube.csv')
mats = pd.read_csv('competition_data/bill_of_materials.csv')
comps = pd.read_csv('competition_data/components.csv')
specs = pd.read_csv('competition_data/specs.csv')
end_form = pd.read_csv('competition_data/tube_end_form.csv')

## Merging

#### merging tube with end_form df

In [455]:
# merging tube df with end_form df
tube = tube.merge(end_form,how='left',left_on='end_a',right_on='end_form_id').merge(end_form,how='left',left_on='end_x',right_on='end_form_id')

In [456]:
# drop duplicate ids
tube = tube.drop(['end_form_id_x','end_form_id_y'],axis=1)
# rename forming columns to match end_a and end_x
tube = tube.rename({'forming_x':'forming_a','forming_y':'forming_x'},axis=1)
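
The double merge above joins `end_form` once for `end_a` and once for `end_x`, so pandas appends `_x`/`_y` suffixes to the colliding columns. A small sketch on toy data may make the renaming step easier to follow; the frames `tube_toy` and `end_form_toy` and their values are invented, and only the column names follow the notebook:

```python
import pandas as pd

# toy stand-ins for tube.csv and tube_end_form.csv (values are invented)
tube_toy = pd.DataFrame({
    'tube_assembly_id': ['TA-1', 'TA-2'],
    'end_a': ['EF-001', 'EF-002'],
    'end_x': ['EF-002', 'EF-001'],
})
end_form_toy = pd.DataFrame({
    'end_form_id': ['EF-001', 'EF-002'],
    'forming': ['Yes', 'No'],
})

# first merge looks up end_a, second merge looks up end_x;
# the second merge collides on end_form_id and forming, hence _x/_y
merged = (
    tube_toy
    .merge(end_form_toy, how='left', left_on='end_a', right_on='end_form_id')
    .merge(end_form_toy, how='left', left_on='end_x', right_on='end_form_id')
)

# drop the lookup keys and rename so the suffix matches the end it describes
merged = merged.drop(['end_form_id_x', 'end_form_id_y'], axis=1)
merged = merged.rename({'forming_x': 'forming_a', 'forming_y': 'forming_x'}, axis=1)
print(merged)
```

Note that `forming_x` from the first merge describes `end_a`, which is why it is renamed to `forming_a`.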

#### merging comps onto mats.
We only use the first component (`component_id_1`) of each assembly.

In [457]:
# merging comps on mats but only on the first component
mats = mats.merge(comps,left_on='component_id_1',right_on='component_id',how='left')

In [458]:
# dropping redundant columns
mats = mats.drop(['component_id','component_type_id'],axis=1)

In [459]:
train = train.merge(tube, on='tube_assembly_id', how='left')
test = test.merge(tube, on='tube_assembly_id', how='left')

In [460]:
train = train.merge(mats, on='tube_assembly_id', how='left')
test = test.merge(mats, on='tube_assembly_id', how='left')

### Exploring Features

### Wrangle

In [461]:
import seaborn as sns
import numpy as np

In [462]:
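# log-transform skewed columns; the column list is computed once from train
# so that train and test receive the same transformation
def transform_skewed_cols(df, skew_cols):
    for col in skew_cols:
        # columns missing from df (such as the target in test) are skipped
        if col in df.columns:
            df[col] = np.log1p(df[col])
    return df

In [463]:
skewness = train.skew(numeric_only=True)
skew_cols = list(skewness[skewness > 4].index)
train = transform_skewed_cols(train, skew_cols)
test = transform_skewed_cols(test, skew_cols)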

In [464]:
# replace quote_date with separate month and year columns
def convert_add_dates(df):
    # infer_datetime_format is deprecated as of pandas 2.0, so it is omitted here
    df['quote_date'] = pd.to_datetime(df['quote_date'])
    df['month'] = df['quote_date'].dt.month
    df['year'] = df['quote_date'].dt.year
    df = df.drop('quote_date',axis=1)
    return df

In [465]:
test = convert_add_dates(test)
train = convert_add_dates(train)

### Handle missing data

In [590]:
# columns with more than 9000 missing values in test
na_counts = test.isna().sum()
na_cols = list(na_counts[na_counts > 9000].index)

In [591]:
test = test.drop(na_cols,axis=1)
train = train.drop(na_cols,axis=1)

In [592]:
train = train.dropna()
test = test.dropna()

In [593]:
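# summarize non-numeric columns, sorted by number of distinct values
cat_features = train.describe(exclude='number').T.sort_values(by='unique')
# low-cardinality columns are one-hot encoded, the rest are ordinal encoded
# (>= so that a column with exactly 150 distinct values is not dropped)
hot_encode_cols = list(cat_features.loc[cat_features['unique'] < 150].index)
ordinal_cols = list(cat_features.loc[cat_features['unique'] >= 150].index)
# numeric_features still contains the target 'cost'; it is removed below
numeric_features = list(train.describe().columns)
features = hot_encode_cols + numeric_features + ordinal_cols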

## Train/Validation Split

In [594]:
from sklearn.model_selection import train_test_split

In [595]:
print(train['tube_assembly_id'].nunique(), test['tube_assembly_id'].nunique())
unique_tubes = train['tube_assembly_id'].unique()

7460 7518


In [596]:
tubes_train, tubes_val = train_test_split(unique_tubes,random_state=42)

In [597]:
tubes_val

array(['TA-06792', 'TA-19812', 'TA-08457', ..., 'TA-11964', 'TA-12724',
       'TA-14090'], dtype=object)

In [598]:
tubes_train

array(['TA-05734', 'TA-00850', 'TA-00870', ..., 'TA-15801', 'TA-02417',
       'TA-20687'], dtype=object)

In [599]:
train_sub = train[train['tube_assembly_id'].isin(tubes_train)]

In [600]:
train_val = train[train['tube_assembly_id'].isin(tubes_val)]

In [601]:
train_sub.shape, train_val.shape, train.shape

((20581, 29), (6990, 29), (27571, 29))
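
The split above is done on unique `tube_assembly_id` values rather than on rows, so every quote for a given tube lands on one side only. A minimal sketch of the same idea on a toy frame, with invented ids and quantities, checks that no tube appears in both subsets:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# toy quotes: several rows per tube, as in train_set.csv (values are invented)
quotes = pd.DataFrame({
    'tube_assembly_id': ['TA-1', 'TA-1', 'TA-2', 'TA-2', 'TA-3', 'TA-4', 'TA-4', 'TA-5'],
    'quantity': [1, 5, 1, 10, 1, 2, 20, 1],
})

# split the ids, not the rows
ids = quotes['tube_assembly_id'].unique()
ids_train, ids_val = train_test_split(ids, random_state=42)

quotes_train = quotes[quotes['tube_assembly_id'].isin(ids_train)]
quotes_val = quotes[quotes['tube_assembly_id'].isin(ids_val)]

# ids shared by both subsets; empty when the split is by tube
overlap = set(quotes_train['tube_assembly_id']) & set(quotes_val['tube_assembly_id'])
print(len(overlap))
```

Splitting by row instead would let quotes for the same tube appear in both subsets, which would likely make the validation score look better than it should.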

### define features and target

In [602]:
target = 'cost'
features.remove('cost')

In [603]:
X_train = train_sub[features]
y_train = train_sub[target]
X_val = train_val[features]
y_val = train_val[target]

In [604]:
X_train.shape

(20581, 28)

### make pipeline

In [605]:
from sklearn.pipeline import Pipeline
import category_encoders as ce
from sklearn.ensemble import RandomForestRegressor

In [606]:
encoder_hot = ce.OneHotEncoder(cols=hot_encode_cols, use_cat_names=True)

In [607]:
encoder_ord = ce.OrdinalEncoder(cols=ordinal_cols)

In [608]:
X_train = encoder_ord.fit_transform(X_train)
X_val = encoder_ord.transform(X_val)

In [609]:
X_train = encoder_hot.fit_transform(X_train)
X_val = encoder_hot.transform(X_val)

### Random Forest

In [610]:
model = RandomForestRegressor(n_estimators=100, max_depth=30, n_jobs=-1)

In [611]:
model.fit(X_train,y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=30,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [612]:
model.score(X_train,y_train)

0.9904374112791802

In [613]:
model.score(X_val,y_val)

0.8885119268209178

In [614]:
y_pred_log = model.predict(X_val)

In [615]:
from sklearn.metrics import mean_squared_error
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

In [616]:
rmse(y_val, y_pred_log)

0.2605352980272578
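
The score above is on the log scale, assuming `cost` was among the columns that `transform_skewed_cols` passed through `np.log1p` (the magnitude of the `y_val` values is consistent with that). Under that assumption, predictions would need to be mapped back with `np.expm1` before they are written to a Kaggle submission. A minimal sketch with a hypothetical array of log-scale predictions:

```python
import numpy as np

# hypothetical log1p-scale predictions, standing in for model.predict(...)
y_pred_log_toy = np.array([3.131396, 2.590858, 1.318429])

# expm1 is the inverse of log1p: exp(x) - 1
cost_pred = np.expm1(y_pred_log_toy)

# round trip back to the log scale to confirm the inverse
roundtrip = np.log1p(cost_pred)
print(cost_pred)
```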

In [617]:
y_val

0        3.131396
1        2.590858
2        2.028389
3        1.738318
4        1.513271
5        1.440879
6        1.406715
7        1.386059
59       1.318429
78       3.419672
79       2.928495
80       2.449534
81       2.220274
82       2.052613
83       1.998674
84       1.973089
85       1.957497
86       3.683552
87       3.365905
88       3.160954
89       2.866506
98       2.558433
99       2.266610
100      2.107517
101      2.008428
110      1.216656
115      2.463606
128      3.118192
129      2.568079
130      1.988061
           ...   
30050    0.968150
30051    0.935939
30052    5.378739
30053    4.899534
30054    4.670192
30055    4.532210
30056    3.128964
30057    2.586679
30058    2.021042
30059    1.728487
30060    1.500944
30061    1.427620
30062    1.392992
30063    1.372048
30080    3.457721
30081    3.070555
30082    2.736889
30083    2.595435
30084    2.499657
30085    2.471391
30086    2.458144
30087    2.450255
30109    1.487834
30163    2.292234
30176    1