
_Lambda School Data Science, Classification 1_

This sprint, your project is about water pumps in Tanzania. Can you predict which water pumps are faulty?

# Decision Trees, Data Cleaning

#### Objectives
- clean data with outliers
- impute missing values
- use scikit-learn for decision trees
- understand why decision trees are useful to model non-linear, non-monotonic relationships and feature interactions
- get and interpret feature importances of a tree-based model
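
To make the objective about non-monotonic relationships concrete, here is a minimal sketch on synthetic data (not the water pump data). The target is 1 only when `x` is near zero, so it rises and then falls as `x` grows. All names and numbers in it are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic, non-monotonic target: positive only in the middle of the range.
rng = np.random.RandomState(0)
x = rng.uniform(-3, 3, size=(1000, 1))
y = (np.abs(x[:, 0]) < 1).astype(int)

x_train, x_test = x[:800], x[800:]
y_train, y_test = y[:800], y[800:]

# A linear model can only draw one threshold along x.
logit = LogisticRegression().fit(x_train, y_train)
# A depth-2 tree can split twice, once near -1 and once near 1.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(x_train, y_train)

logit_acc = logit.score(x_test, y_test)
tree_acc = tree.score(x_test, y_test)
print('Logistic Regression Accuracy', logit_acc)
print('Decision Tree Accuracy', tree_acc)
```

The tree should score clearly higher here, because two splits are enough to isolate the middle band that a single linear boundary cannot.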

#### Links

- A Visual Introduction to Machine Learning
  - [Part 1: A Decision Tree](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/)
  - [Part 2: Bias and Variance](http://www.r2d3.us/visual-intro-to-machine-learning-part-2/)
- [Decision Trees: Advantages & Disadvantages](https://christophm.github.io/interpretable-ml-book/tree.html#advantages-2)
- [How a Russian mathematician constructed a decision tree — by hand — to solve a medical problem](http://fastml.com/how-a-russian-mathematician-constructed-a-decision-tree-by-hand-to-solve-a-medical-problem/)
- [How decision trees work](https://brohrer.github.io/how_decision_trees_work.html)
- [Let’s Write a Decision Tree Classifier from Scratch](https://www.youtube.com/watch?v=LDRbO9a6XPU): _Don’t worry about understanding the code; just get introduced to the concepts. This 10-minute video has excellent diagrams and explanations._
- [Random Forests for Complete Beginners: The definitive guide to Random Forests and Decision Trees](https://victorzhou.com/blog/intro-to-random-forests/)

### OPTIONAL SETUP

#### 1. Downgrade pandas to fix pivot table bug

For this lesson, I'll downgrade pandas from 0.24 to 0.23.4, because of a known issue: https://github.com/pandas-dev/pandas/issues/25087

I'm making a pivot table just for demonstration during this lesson, but it's not required for your assignment. So, you don't need to downgrade pandas if you don't want to.

#### 2. Install graphviz to visualize trees

This is also not required for your assignment.

Anaconda:  
```
conda install python-graphviz
```

Google Colab:  
```
!pip install graphviz
!apt-get install graphviz
```
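
Once graphviz is available, a fitted tree can be exported for plotting. The following is a minimal sketch that uses scikit-learn's `export_graphviz` on the iris toy dataset as a stand-in for the encoded water pump features. The dataset, the shallow `max_depth`, and the variable names are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Toy data stands in for the encoded water pump features.
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(iris.data, iris.target)

# out_file=None makes export_graphviz return the DOT source as a string.
dot_data = export_graphviz(
    tree,
    out_file=None,
    feature_names=iris.feature_names,
    class_names=list(iris.target_names),
    filled=True,
    rounded=True,
)
print(dot_data[:60])

# With the graphviz package installed, the DOT string can be rendered in a notebook:
# import graphviz
# graphviz.Source(dot_data)
```

Keeping `max_depth` small makes the diagram readable. A tree as deep as the one fit later in this notebook would likely be too large to inspect by eye.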


## Clean data with outliers, impute missing values (example solutions)

In [2]:
!pip install category_encoders



In [13]:
from google.colab import files
uploaded = files.upload()

Saving sample_submission.csv to sample_submission.csv


In [0]:
import category_encoders as ce
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split


train_features_file = 'train_features.csv'
train_labels_file = 'train_labels.csv'
test_features_file = 'test_features.csv'

train_feat = pd.read_csv(train_features_file)
train_labels = pd.read_csv(train_labels_file)
test_feat = pd.read_csv(test_features_file)

In [10]:
X_train = train_feat
y_train = train_labels['status_group']

X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, train_size=0.80, test_size=0.20, 
    stratify=y_train, random_state=42)

X_train.shape, X_val.shape, y_train.shape, y_val.shape

((47520, 40), (11880, 40), (47520,), (11880,))
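
The cells in this section select and encode features but do not show the two steps named in the section title. A minimal sketch of both, on a tiny made-up frame, could look like the following. The column names and the assumption that a zero marks a missing value are illustrative and should be checked against the real data before reuse.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up rows; the column names and the "0 means missing" convention are assumptions.
toy = pd.DataFrame({
    'longitude': [35.2, 0.0, 37.1, 33.4],
    'construction_year': [1999, 0, 2008, 0],
})

# Treat implausible zeros as missing rather than as real measurements.
cleaned = toy.replace(0, np.nan)

# Replace each missing value with its column's median.
imputer = SimpleImputer(strategy='median')
imputed = pd.DataFrame(imputer.fit_transform(cleaned), columns=cleaned.columns)
print(imputed)
```

On real data, the imputer would be fit on the training split only and then applied to the validation and test splits with `imputer.transform`, so that no statistics leak from held-out rows.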

In [0]:
num_features = ['region_code', 'district_code']
               
cat_features = ['payment_type','waterpoint_type', 'quantity', 
                'extraction_type', 'permit', 'management'] 

In [12]:
features = cat_features + num_features

X_train_subset = X_train[features]
X_val_subset = X_val[features]

# Fit the encoder on the training split only, then reuse the same mapping for validation.
encoder = ce.OneHotEncoder(use_cat_names=True)
X_train_encoded = encoder.fit_transform(X_train_subset)
X_val_encoded = encoder.transform(X_val_subset)


dt = DecisionTreeClassifier(random_state=42, max_depth=15)
dt.fit(X_train_encoded, y_train)
print('Decision Tree')
print('Train Accuracy', dt.score(X_train_encoded, y_train))
print('Validation Accuracy', dt.score(X_val_encoded, y_val))

Decision Tree
Train Accuracy 0.7805555555555556
Validation Accuracy 0.7566498316498317
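
A fitted tree also exposes `feature_importances_`, which is one of this lesson's objectives. The sketch below uses synthetic data with one informative column and one pure-noise column so the expected ranking is obvious. The column names `signal` and `noise` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the label depends on 'signal' only.
rng = np.random.RandomState(42)
X = pd.DataFrame({
    'signal': rng.uniform(-1, 1, 500),
    'noise': rng.uniform(-1, 1, 500),
})
y = (X['signal'] > 0).astype(int)

model = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)

# Impurity-based importances, normalized to sum to 1.
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

For the tree fit above, the analogous call would be `pd.Series(dt.feature_importances_, index=X_train_encoded.columns)`, assuming the encoder returned a DataFrame. These values show how much each feature reduced impurity across the tree's splits. They do not show the direction of a feature's effect.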


In [0]:
X_test_subset = test_feat[features]
X_test_encoded = encoder.transform(X_test_subset)


In [0]:
submission_sample_file = 'sample_submission.csv'
sample_submission = pd.read_csv(submission_sample_file)

In [0]:
y_pred = dt.predict(X_test_encoded)
submission = sample_submission.copy()
submission['status_group'] = y_pred
submission.to_csv('kel-submission-03.csv', index=False)

In [18]:
!head kel-submission-03.csv

id,status_group
50785,non functional
51630,functional
17168,functional
45559,non functional
49871,functional
52449,functional
24806,functional
28965,non functional
36301,non functional


In [0]:
from google.colab import files
files.download('kel-submission-03.csv')

# Assignment
- Start a clean notebook, or continue with yesterday's assignment notebook.
- Continue to participate in our Kaggle competition with the Tanzania water pumps data.
- Do more exploratory data analysis, data cleaning, feature engineering, and feature selection.
- Try a Decision Tree Classifier. 
- Submit new predictions.
- Commit your notebook to your fork of the GitHub repo.

## Stretch Goals
- Create visualizations and share on Slack.
- Read more about decision trees and tree ensembles. You can start with the links at the top of this notebook.
- Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html):

> Pipeline can be used to chain multiple estimators into one. This is useful as there is often a fixed sequence of steps in processing the data, for example feature selection, normalization and classification. Pipeline serves multiple purposes here:

> - **Convenience and encapsulation.** You only have to call fit and predict once on your data to fit a whole sequence of estimators.
> - **Joint parameter selection.** You can grid search over parameters of all estimators in the pipeline at once.
> - **Safety.** Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors.
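
As a starting point for this stretch goal, here is a minimal sketch of a pipeline that chains an imputer and a decision tree. It runs on synthetic data with some values removed at random. The data, the 5% missing rate, and the hyperparameters are assumptions for illustration, not tuned choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some values knocked out, standing in for the real features.
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=4, random_state=42)
rng = np.random.RandomState(42)
X[rng.rand(*X.shape) < 0.05] = np.nan

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)

# Calling fit learns the imputer's medians from X_train only, then fits the tree.
pipeline = make_pipeline(
    SimpleImputer(strategy='median'),
    DecisionTreeClassifier(max_depth=5, random_state=42),
)
pipeline.fit(X_train, y_train)

# Calling score applies the already-fitted imputer to X_val before predicting.
val_accuracy = pipeline.score(X_val, y_val)
print('Validation Accuracy', val_accuracy)
```

An encoder step could be placed ahead of the imputer in the same way, so that the whole preprocessing sequence is fit once on the training split and reused unchanged on the validation and test data.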
