<a href="https://colab.research.google.com/github/ElisabethShah/DS-Unit-2-Applied-Modeling/blob/master/module2-gradient-boosting/Gradient%20Boosting%20Assignment%20Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

_Lambda School Data Science — Applied Modeling_ 

This sprint, your project is Caterpillar Tube Pricing: Predict the prices suppliers will quote for industrial tube assemblies.

# Gradient Boosting

## Overview

### Objectives
- Do feature engineering with relational data
- Use xgboost for gradient boosting

### Python libraries for Gradient Boosting
- [scikit-learn Gradient Tree Boosting](https://scikit-learn.org/stable/modules/ensemble.html#gradient-boosting) — slower than other libraries, but [the new version may be better](https://twitter.com/amuellerml/status/1129443826945396737)
  - Anaconda: already installed
  - Google Colab: already installed
- [xgboost](https://xgboost.readthedocs.io/en/latest/) — can accept missing values and enforce [monotonic constraints](https://xiaoxiaowang87.github.io/monotonicity_constraint/)
  - Anaconda, Mac/Linux: `conda install -c conda-forge xgboost`
  - Windows: `conda install -c anaconda py-xgboost`
  - Google Colab: already installed
- [LightGBM](https://lightgbm.readthedocs.io/en/latest/) — can accept missing values and enforce [monotonic constraints](https://blog.datadive.net/monotonicity-constraints-in-machine-learning/)
  - Anaconda: `conda install -c conda-forge lightgbm`
  - Google Colab: already installed
- [CatBoost](https://catboost.ai/) — can accept missing values and use [categorical features](https://catboost.ai/docs/concepts/algorithm-main-stages_cat-to-numberic.html) without preprocessing
  - Anaconda: `conda install -c conda-forge catboost`
  - Google Colab: `pip install catboost`

### Understand the difference between boosting & bagging

Boosting (used by Gradient Boosting) is different from bagging (used by Random Forests).

[_An Introduction to Statistical Learning_](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf) Chapter 8.2.3, Boosting:

>Recall that bagging involves creating multiple copies of the original training data set using the bootstrap, fitting a separate decision tree to each copy, and then combining all of the trees in order to create a single predictive model.

>**Boosting works in a similar way, except that the trees are grown _sequentially_: each tree is grown using information from previously grown trees.**

>Unlike fitting a single large decision tree to the data, which amounts to _fitting the data hard_ and potentially overfitting, the boosting approach instead _learns slowly._ Given the current model, we fit a decision tree to the residuals from the model.

>We then add this new decision tree into the fitted function in order to update the residuals. Each of these trees can be rather small, with just a few terminal nodes. **By fitting small trees to the residuals, we slowly improve fˆ in areas where it does not perform well.**

>Note that in boosting, unlike in bagging, the construction of each tree depends strongly on the trees that have already been grown.
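
The passage above can be sketched in a few lines of NumPy. This is only an illustration of the idea, not how xgboost or scikit-learn implement it: each "tree" is a one-split stump, the data is synthetic, and the names `fit_stump`, `predict_stump`, `boost`, and `learning_rate` are made up for this example.

```python
import numpy as np

def fit_stump(x, residuals):
    """Find the single split on x that minimizes squared error on the residuals."""
    best = None
    # Skip the largest value so the right-hand side is never empty.
    for threshold in np.unique(x)[:-1]:
        left = residuals[x <= threshold]
        right = residuals[x > threshold]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def boost(x, y, n_trees=100, learning_rate=0.1):
    """Grow stumps sequentially, each one fit to the current residuals."""
    prediction = np.zeros_like(y, dtype=float)
    stumps = []
    for _ in range(n_trees):
        residuals = y - prediction          # what the model still gets wrong
        stump = fit_stump(x, residuals)     # small tree fit to the residuals
        prediction += learning_rate * predict_stump(stump, x)  # learn slowly
        stumps.append(stump)
    return stumps, prediction

# Synthetic data (an assumption for the demo, unrelated to the Caterpillar data).
rng = np.random.default_rng(0)
x = np.linspace(0, 6, 200)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)

_, pred_1 = boost(x, y, n_trees=1)
stumps, pred_100 = boost(x, y, n_trees=100)
mse_1 = np.mean((y - pred_1) ** 2)
mse_100 = np.mean((y - pred_100) ** 2)
print(mse_1, mse_100)
```

The training error after 100 stumps is lower than after one, because every new stump depends on the residuals left by the stumps before it. In bagging, by contrast, the trees could be fit in any order.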

### Get data

#### Option 1. Kaggle web UI
 
Sign in to Kaggle and go to the [Caterpillar Tube Pricing](https://www.kaggle.com/c/caterpillar-tube-pricing) competition. Go to the Data page. After you have accepted the rules of the competition, use the download buttons to download the data.

#### Option 2. Kaggle API

Follow these [instructions](https://github.com/Kaggle/kaggle-api).

#### Option 3. GitHub Repo — LOCAL

If you are working locally:

1. Clone the [GitHub repo](https://github.com/LambdaSchool/DS-Unit-2-Applied-Modeling/tree/master/data/caterpillar) locally. The data is in the repo, so you don't need to download it separately.

2. Unzip the file `caterpillar-tube-pricing.zip` which is in the data folder of your local repo.

3. Unzip the file `data.zip`. 

4. Run the cell below to assign a constant named `SOURCE`, a string that points to the location of the data on your local machine. The rest of the code in the notebook will use this constant.

#### Option 4. GitHub Repo — COLAB

If you are working on Google Colab, run these cells to download the data, unzip it, and assign a constant that points to the location of the data.

In [1]:
!wget https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/caterpillar/caterpillar-tube-pricing.zip

--2019-08-12 01:36:27--  https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/caterpillar/caterpillar-tube-pricing.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 823789 (804K) [application/zip]
Saving to: ‘caterpillar-tube-pricing.zip’


2019-08-12 01:36:27 (21.2 MB/s) - ‘caterpillar-tube-pricing.zip’ saved [823789/823789]



In [2]:
!unzip -o caterpillar-tube-pricing.zip

Archive:  caterpillar-tube-pricing.zip
  inflating: sample_submission.csv   
  inflating: data.zip                


In [3]:
!unzip -o data.zip

Archive:  data.zip
   creating: competition_data/
  inflating: competition_data/bill_of_materials.csv  
  inflating: competition_data/comp_adaptor.csv  
  inflating: competition_data/comp_boss.csv  
  inflating: competition_data/comp_elbow.csv  
  inflating: competition_data/comp_float.csv  
  inflating: competition_data/comp_hfl.csv  
  inflating: competition_data/comp_nut.csv  
  inflating: competition_data/comp_other.csv  
  inflating: competition_data/comp_sleeve.csv  
  inflating: competition_data/comp_straight.csv  
  inflating: competition_data/comp_tee.csv  
  inflating: competition_data/comp_threaded.csv  
  inflating: competition_data/components.csv  
  inflating: competition_data/specs.csv  
  inflating: competition_data/test_set.csv  
  inflating: competition_data/train_set.csv  
  inflating: competition_data/tube.csv  
  inflating: competition_data/tube_end_form.csv  
  inflating: competition_data/type_component.csv  
  inflating: competition_data/type_connection.csv  
  i

In [0]:
SOURCE = 'competition_data/'

## Assignment

- Continue to participate in the [Kaggle Caterpillar competition](https://www.kaggle.com/c/caterpillar-tube-pricing).
- Do more feature engineering. 
- Use xgboost for gradient boosting.
- Submit new predictions.
- Commit your notebook to your fork of the GitHub repo.

In [0]:
import pandas as pd

In [6]:
materials = pd.read_csv(SOURCE + 'bill_of_materials.csv')
materials.head()

Unnamed: 0,tube_assembly_id,component_id_1,quantity_1,component_id_2,quantity_2,component_id_3,quantity_3,component_id_4,quantity_4,component_id_5,quantity_5,component_id_6,quantity_6,component_id_7,quantity_7,component_id_8,quantity_8
0,TA-00001,C-1622,2.0,C-1629,2.0,,,,,,,,,,,,
1,TA-00002,C-1312,2.0,,,,,,,,,,,,,,
2,TA-00003,C-1312,2.0,,,,,,,,,,,,,,
3,TA-00004,C-1312,2.0,,,,,,,,,,,,,,
4,TA-00005,C-1624,1.0,C-1631,1.0,C-1641,1.0,,,,,,,,,,


In [10]:
materials.describe(include='all')

Unnamed: 0,tube_assembly_id,component_id_1,quantity_1,component_id_2,quantity_2,component_id_3,quantity_3,component_id_4,quantity_4,component_id_5,quantity_5,component_id_6,quantity_6,component_id_7,quantity_7,component_id_8,quantity_8
count,21198,19149,19149.0,14786,14786.0,4791,4798.0,607,608.0,92,92.0,26,26.0,7,7.0,1,1.0
unique,21198,1079,,834,,509,,204,,62,,19,,4,,1,
top,TA-12035,C-1621,,C-1628,,C-1641,,C-1660,,C-0872,,C-0378,,C-1019,,C-1981,
freq,1,2043,,1959,,421,,62,,10,,3,,3,,1,
mean,,,1.559873,,1.526106,,1.020634,,1.027961,,1.032609,,1.153846,,1.0,,1.0
std,,,0.507444,,0.510851,,0.1601,,0.209041,,0.178583,,0.367946,,0.0,,
min,,,1.0,,1.0,,1.0,,1.0,,1.0,,1.0,,1.0,,1.0
25%,,,1.0,,1.0,,1.0,,1.0,,1.0,,1.0,,1.0,,1.0
50%,,,2.0,,2.0,,1.0,,1.0,,1.0,,1.0,,1.0,,1.0
75%,,,2.0,,2.0,,1.0,,1.0,,1.0,,1.0,,1.0,,1.0


In [7]:
components = pd.read_csv(SOURCE + 'components.csv')
components.head()

Unnamed: 0,component_id,name,component_type_id
0,9999,OTHER,OTHER
1,C-0001,SLEEVE,CP-024
2,C-0002,SLEEVE,CP-024
3,C-0003,SLEEVE-FLARED,CP-024
4,C-0004,NUT,CP-026


In [8]:
components.describe(include='all')

Unnamed: 0,component_id,name,component_type_id
count,2048,2047,2048
unique,2048,297,29
top,C-0597,FLANGE,OTHER
freq,1,350,1006


In [0]:
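# Build an ID -> type lookup once and replace in one call,
# instead of calling replace() once per row of components.
id_to_type = dict(zip(components['component_id'], components['component_type_id']))
materials = materials.replace(id_to_type)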

In [18]:
materials.head()

Unnamed: 0,tube_assembly_id,component_id_1,quantity_1,component_id_2,quantity_2,component_id_3,quantity_3,component_id_4,quantity_4,component_id_5,quantity_5,component_id_6,quantity_6,component_id_7,quantity_7,component_id_8,quantity_8
0,TA-00001,CP-025,2.0,CP-024,2.0,,,,,,,,,,,,
1,TA-00002,CP-028,2.0,,,,,,,,,,,,,,
2,TA-00003,CP-028,2.0,,,,,,,,,,,,,,
3,TA-00004,CP-028,2.0,,,,,,,,,,,,,,
4,TA-00005,CP-025,1.0,CP-024,1.0,CP-014,1.0,,,,,,,,,,
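
The ID-to-type substitution can also be written per column with `Series.map`, shown here on a toy example. The IDs and types below are made up for illustration. Note one difference from `replace`: with `map`, an ID that has no entry in the lookup becomes `NaN` instead of being left unchanged.

```python
import pandas as pd

# Hypothetical miniature versions of components.csv and bill_of_materials.csv.
components = pd.DataFrame({
    'component_id': ['C-1', 'C-2', 'C-3'],
    'component_type_id': ['CP-A', 'CP-B', 'CP-A'],
})
materials = pd.DataFrame({
    'tube_assembly_id': ['TA-1', 'TA-2'],
    'component_id_1': ['C-1', 'C-3'],
    'quantity_1': [2.0, 1.0],
    'component_id_2': ['C-2', None],
    'quantity_2': [1.0, None],
})

# Lookup Series: index is the component ID, value is the component type.
id_to_type = components.set_index('component_id')['component_type_id']

# Only touch the ID columns; quantity columns are left alone.
id_columns = [c for c in materials.columns if c.startswith('component_id_')]
for column in id_columns:
    materials[column] = materials[column].map(id_to_type)

print(materials)
```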


In [19]:
materials.describe(include='all')

Unnamed: 0,tube_assembly_id,component_id_1,quantity_1,component_id_2,quantity_2,component_id_3,quantity_3,component_id_4,quantity_4,component_id_5,quantity_5,component_id_6,quantity_6,component_id_7,quantity_7,component_id_8,quantity_8
count,21198,19149,19149.0,14786,14786.0,4791,4798.0,607,608.0,92,92.0,26,26.0,7,7.0,1,1.0
unique,21198,26,,25,,27,,19,,12,,6,,1,,1,
top,TA-12035,CP-025,,CP-024,,CP-014,,CP-024,,OTHER,,OTHER,,OTHER,,OTHER,
freq,1,8542,,10670,,2899,,256,,54,,19,,7,,1,
mean,,,1.559873,,1.526106,,1.020634,,1.027961,,1.032609,,1.153846,,1.0,,1.0
std,,,0.507444,,0.510851,,0.1601,,0.209041,,0.178583,,0.367946,,0.0,,
min,,,1.0,,1.0,,1.0,,1.0,,1.0,,1.0,,1.0,,1.0
25%,,,1.0,,1.0,,1.0,,1.0,,1.0,,1.0,,1.0,,1.0
50%,,,2.0,,2.0,,1.0,,1.0,,1.0,,1.0,,1.0,,1.0
75%,,,2.0,,2.0,,1.0,,1.0,,1.0,,1.0,,1.0,,1.0


In [0]:
columns = ['component_id_' + str(i+1) for i in range(8)]
quantities = ['quantity_' + str(i+1) for i in range(8)]

In [0]:
materials_wide = materials.pivot(index='tube_assembly_id', 
                                 columns=columns[0], 
                                 values=quantities[0])

In [42]:
materials_wide.head()

tube_assembly_id,nan,CP-001,CP-002,CP-003,CP-004,CP-005,CP-006,CP-007,CP-008,CP-009,CP-010,CP-011,CP-012,CP-014,CP-015,CP-016,CP-018,CP-019,CP-020,CP-021,CP-022,CP-024,CP-025,CP-026,CP-027,CP-028,OTHER
TA-00001,,,,,,,,,,,,,,,,,,,,,,,2.0,,,,
TA-00002,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,
TA-00003,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,
TA-00004,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,
TA-00005,,,,,,,,,,,,,,,,,,,,,,,1.0,,,,


In [0]:
# Add the remaining component slots; fill_value=0 keeps a quantity when only one side has it.
for column, quantity in zip(columns[1:], quantities[1:]):
  materials_wide = materials_wide.add(materials.pivot(index='tube_assembly_id', 
                                                      columns=column, 
                                                      values=quantity), 
                                     fill_value=0)

In [44]:
materials_wide.head()

tube_assembly_id,nan,CP-001,CP-002,CP-003,CP-004,CP-005,CP-006,CP-007,CP-008,CP-009,CP-010,CP-011,CP-012,CP-014,CP-015,CP-016,CP-017,CP-018,CP-019,CP-020,CP-021,CP-022,CP-023,CP-024,CP-025,CP-026,CP-027,CP-028,CP-029,OTHER
TA-00001,,,,,,,,,,,,,,,,,,,,,,,,2.0,2.0,,,,,
TA-00002,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,,
TA-00003,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,,
TA-00004,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,,
TA-00005,,,,,,,,,,,,,,1.0,,,,,,,,,,1.0,1.0,,,,,
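
To see what the pivot-and-add step does, here is the same pattern on a toy frame with two component slots. The assembly and type names are made up. As a small variation on the cells above, rows with no component in a slot are dropped before pivoting, so no `nan` column is created.

```python
import pandas as pd

# Hypothetical bill of materials with two slots per assembly.
materials = pd.DataFrame({
    'tube_assembly_id': ['TA-1', 'TA-2', 'TA-3'],
    'component_id_1': ['CP-A', 'CP-B', 'CP-A'],
    'quantity_1': [2.0, 1.0, 1.0],
    'component_id_2': ['CP-B', None, 'CP-A'],
    'quantity_2': [1.0, None, 3.0],
})

wide = None
for i in (1, 2):
    # One row per assembly, one column per component type, quantity as the value.
    slot = (materials
            .dropna(subset=[f'component_id_{i}'])
            .pivot(index='tube_assembly_id',
                   columns=f'component_id_{i}',
                   values=f'quantity_{i}'))
    # fill_value=0 treats a value missing on one side as 0;
    # cells missing on both sides stay NaN.
    wide = slot if wide is None else wide.add(slot, fill_value=0)

print(wide)
```

TA-3 uses type CP-A in both slots, so its quantities are summed (1 + 3 = 4). TA-2 never uses CP-A, so that cell stays `NaN`.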


In [0]:
materials_wide['total_components'] = materials_wide.sum(axis=1)

In [0]:
# Subtract 1 so the total_components column (never null) is not counted as a component type.
materials_wide['distinct_components'] = materials_wide.notnull().sum(axis=1) - 1

In [49]:
materials_wide.head().T

tube_assembly_id,TA-00001,TA-00002,TA-00003,TA-00004,TA-00005
,,,,,
CP-001,,,,,
CP-002,,,,,
CP-003,,,,,
CP-004,,,,,
CP-005,,,,,
CP-006,,,,,
CP-007,,,,,
CP-008,,,,,
CP-009,,,,,
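
The two summary features can be checked on a small hand-built wide table. The values below are made up. In this sketch the counts are computed from an explicit list of type columns, which is one way to avoid the `- 1` correction needed when `total_components` is already part of the frame.

```python
import numpy as np
import pandas as pd

# Hypothetical wide table: one row per assembly, one column per component type.
wide = pd.DataFrame(
    {'CP-A': [2.0, np.nan, 4.0], 'CP-B': [1.0, 1.0, np.nan]},
    index=pd.Index(['TA-1', 'TA-2', 'TA-3'], name='tube_assembly_id'),
)

# Freeze the list of type columns before adding any derived columns.
type_columns = list(wide.columns)

# sum() skips NaN by default, so missing types contribute 0.
wide['total_components'] = wide[type_columns].sum(axis=1)
# Count how many type columns hold a value for each assembly.
wide['distinct_components'] = wide[type_columns].notnull().sum(axis=1)

print(wide)
```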


## Stretch Goals

- Improve your scores on Kaggle.
- Make visualizations and share on Slack.
- Look at [Kaggle Kernels](https://www.kaggle.com/c/caterpillar-tube-pricing/kernels) for ideas about feature engineering and visualization.
- Look at the bonus notebook in the repo, about Monotonic Constraints with Gradient Boosting.
- Read more about gradient boosting:
  - [A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
  - [A Kaggle Master Explains Gradient Boosting](http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/)
  - [_An Introduction to Statistical Learning_](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf) Chapter 8
  - [Gradient Boosting Explained](http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html)
  - [Boosting](https://www.youtube.com/watch?v=GM3CDQfQ4sw) (3 minute video)