<a href="https://colab.research.google.com/github/Terrencebosco/DS-Unit-2-Kaggle-Challenge/blob/master/module1-decision-trees/LS_DS_221_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 2, Module 1*

---

# Decision Trees

## Assignment
- [ ] [Sign up for a Kaggle account](https://www.kaggle.com/), if you don’t already have one. Go to our Kaggle InClass competition website. You will be given the URL in Slack. Go to the Rules page. Accept the rules of the competition. Notice that the Rules page also has instructions for the Submission process. The Data page has feature definitions.
- [ ] Do train/validate/test split with the Tanzania Waterpumps data.
- [ ] Begin with baselines for classification.
- [ ] Select features. Use a scikit-learn pipeline to encode categoricals, impute missing values, and fit a decision tree classifier.
- [ ] Get your validation accuracy score.
- [ ] Get and plot your feature importances.
- [ ] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [ ] Commit your notebook to your fork of the GitHub repo.


## Stretch Goals

### Reading

- A Visual Introduction to Machine Learning
  - [Part 1: A Decision Tree](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/)
  - [Part 2: Bias and Variance](http://www.r2d3.us/visual-intro-to-machine-learning-part-2/)
- [Decision Trees: Advantages & Disadvantages](https://christophm.github.io/interpretable-ml-book/tree.html#advantages-2)
- [How a Russian mathematician constructed a decision tree — by hand — to solve a medical problem](http://fastml.com/how-a-russian-mathematician-constructed-a-decision-tree-by-hand-to-solve-a-medical-problem/)
- [How decision trees work](https://brohrer.github.io/how_decision_trees_work.html)
- [Let’s Write a Decision Tree Classifier from Scratch](https://www.youtube.com/watch?v=LDRbO9a6XPU) — _Don’t worry about understanding the code, just get introduced to the concepts. This 10 minute video has excellent diagrams and explanations._
- [Random Forests for Complete Beginners: The definitive guide to Random Forests and Decision Trees](https://victorzhou.com/blog/intro-to-random-forests/)


### Doing
- [ ] Add your own stretch goal(s)!
- [ ] Define a function to wrangle train, validate, and test sets in the same way. Clean outliers and engineer features. (For example, [what columns have zeros and shouldn't?](https://github.com/Quartz/bad-data-guide#zeros-replace-missing-values) What columns are duplicates, or nearly duplicates? Can you extract the year from date_recorded? Can you engineer new features, such as the number of years from waterpump construction to waterpump inspection?)
- [ ] Try other [scikit-learn imputers](https://scikit-learn.org/stable/modules/impute.html).
- [ ] Make exploratory visualizations and share on Slack.


#### Exploratory visualizations

Visualize the relationships between feature(s) and target. I recommend you do this with your training set, after splitting your data. 

For this problem, you may want to create a new column to represent the target as a number, 0 or 1. For example:

```python
train['functional'] = (train['status_group']=='functional').astype(int)
```



You can try [Seaborn "Categorical estimate" plots](https://seaborn.pydata.org/tutorial/categorical.html) for features with reasonably few unique values. (With too many unique values, the plot is unreadable.)

- Categorical features. (If there are too many unique values, you can replace less frequent values with "OTHER.")
- Numeric features. (If there are too many unique values, you can [bin with pandas cut / qcut functions](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html?highlight=qcut#discretization-and-quantiling).)

You can try [Seaborn linear model plots](https://seaborn.pydata.org/tutorial/regression.html) with numeric features. For this classification problem, you may want to use the parameter `logistic=True`, but it can be slow.

You do _not_ need to use Seaborn, but it's nice because it includes confidence intervals to visualize uncertainty.
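
The binning idea above can be sketched with pandas alone. This is a minimal illustration, assuming a toy DataFrame with made-up values in place of the real training set; `toy`, `population_bin`, and `rate_by_bin` are names chosen for this example only.

```python
import pandas as pd

# Toy stand-in for the training set (values are made up for illustration)
toy = pd.DataFrame({
    'status_group': ['functional', 'non functional', 'functional', 'functional',
                     'non functional', 'functional needs repair',
                     'functional', 'non functional'],
    'population': [10, 20, 30, 40, 50, 60, 70, 80],
})

# Represent the target as 0 or 1, as suggested above
toy['functional'] = (toy['status_group'] == 'functional').astype(int)

# Bin the numeric feature into 4 quantile-based bins
toy['population_bin'] = pd.qcut(toy['population'], q=4)

# Share of functional pumps in each bin
rate_by_bin = toy.groupby('population_bin', observed=True)['functional'].mean()
print(rate_by_bin)
```

A Seaborn call such as `sns.catplot(x='population_bin', y='functional', data=toy, kind='bar')` should plot the same per-bin means, with confidence intervals added.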

#### High-cardinality categoricals

This code from a previous assignment demonstrates how to replace less frequent values with 'OTHER':

```python
# Reduce cardinality for NEIGHBORHOOD feature ...

# Get a list of the top 10 neighborhoods
top10 = train['NEIGHBORHOOD'].value_counts()[:10].index

# At locations where the neighborhood is NOT in the top 10,
# replace the neighborhood with 'OTHER'
train.loc[~train['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
test.loc[~test['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
```


In [15]:
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/master/data/'
    !pip install category_encoders==2.*
    !pip install pandas-profiling==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'



In [16]:
import pandas as pd
from sklearn.model_selection import train_test_split

train = pd.merge(pd.read_csv(DATA_PATH+'waterpumps/train_features.csv'), 
                 pd.read_csv(DATA_PATH+'waterpumps/train_labels.csv'))
test = pd.read_csv(DATA_PATH+'waterpumps/test_features.csv')
sample_submission = pd.read_csv(DATA_PATH+'waterpumps/sample_submission.csv')

train.shape, test.shape

((59400, 41), (14358, 40))

In [17]:
from pandas_profiling import ProfileReport
# Keep the report object; to_notebook_iframe() only displays it and returns None
profile = ProfileReport(train, minimal=True)

profile.to_notebook_iframe()


In [18]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import plotly.express as px
import category_encoders as ce
from sklearn.pipeline import make_pipeline 
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

In [19]:
train, val = train_test_split(train)

train.shape, val.shape

((44550, 41), (14850, 41))
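
The split above relies on the defaults of `train_test_split`: a 25% holdout and no fixed seed, so the rows in each set change from run to run. A possible reproducible variant is sketched below on toy data. The seed value, the `stratify` choice, and the `toy` names are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with a 60/40 class balance (made-up data)
toy = pd.DataFrame({
    'feature': range(100),
    'status_group': ['functional'] * 60 + ['non functional'] * 40,
})

toy_train, toy_val = train_test_split(
    toy,
    test_size=0.25,                # same proportion as the default
    stratify=toy['status_group'],  # keep class proportions equal in both sets
    random_state=42,               # arbitrary seed, fixed for reproducibility
)

print(toy_train.shape, toy_val.shape)
print(toy_val['status_group'].value_counts(normalize=True))
```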

In [20]:
train['status_group'].value_counts(normalize=True)

functional                 0.542671
non functional             0.384040
functional needs repair    0.073288
Name: status_group, dtype: float64
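
These proportions give the majority-class baseline: always predicting `functional` would be right for about 54% of the training rows. A minimal sketch of computing that baseline on any train/validation pair follows. `majority_class_baseline` is a hypothetical helper, and the toy Series are made up.

```python
import pandas as pd

def majority_class_baseline(y_train, y_eval):
    """Predict the most frequent training class for every row and score it."""
    majority_class = y_train.mode()[0]
    accuracy = (y_eval == majority_class).mean()
    return majority_class, accuracy

# Toy labels (made up for illustration)
toy_y_train = pd.Series(['functional'] * 6 + ['non functional'] * 4)
toy_y_val = pd.Series(['functional'] * 3 + ['non functional'] * 2)

majority_class, baseline_accuracy = majority_class_baseline(toy_y_train, toy_y_val)
print(majority_class, baseline_accuracy)
```

In this notebook the call would presumably be `majority_class_baseline(train['status_group'], val['status_group'])`; any model should beat that score on the validation set.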

In [21]:
px.scatter(train,x='longitude', y='latitude', color='status_group', opacity=.1)

In [22]:
train[['longitude', 'latitude']].describe()

| | longitude | latitude |
|---|---|---|
| count | 44550.0 | 44550.0 |
| mean | 34.075456 | -5.702514 |
| std | 6.568601 | 2.948288 |
| min | 0.0 | -11.64944 |
| 25% | 33.083729 | -8.541999 |
| 50% | 34.905645 | -5.019156 |
| 75% | 37.170586 | -3.325431 |
| max | 40.345193 | -2e-08 |


In [32]:
train.head(3)

| | id | amount_tsh | date_recorded | funder | gps_height | installer | longitude | latitude | wpt_name | num_private | basin | subvillage | region | region_code | district_code | lga | ward | population | public_meeting | recorded_by | scheme_management | scheme_name | permit | construction_year | extraction_type | extraction_type_group | extraction_type_class | management | management_group | payment | water_quality | quality_group | quantity | source | source_type | source_class | waterpoint_type | waterpoint_type_group | status_group | year_recorded | month_recorded | day_recorded | years | year_missing |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4348 | 43487 | 50.0 | 2013-05-03 | World Vision | 1368.0 | Naishu construction co. ltd | 36.572899 | -3.401705 | Kwa Richard | 0 | Internal | Nguruvani | Arusha | 2 | 2 | Arusha Rural | Mateves | 200.0 | True | GeoData Consultants Ltd | WUA | Mangola pipe scheme | True | 2012.0 | submersible | submersible | submersible | wua | user-group | pay per bucket | soft | good | enough | machine dbh | borehole | groundwater | communal standpipe multiple | communal standpipe | functional | 2013 | 5 | 3 | 1.0 | False |
| 18156 | 6742 | 500.0 | 2011-11-03 | Rc | 2179.0 | RC | 34.456428 | -9.313526 | Kwa Jesto Sanga | 0 | Lake Nyasa | Mtendee | Iringa | 11 | 4 | Njombe | Igosi | 30.0 | True | GeoData Consultants Ltd | VWC | Mafinga | True | 1998.0 | gravity | gravity | gravity | vwc | user-group | pay annually | soft | good | insufficient | spring | spring | groundwater | communal standpipe | communal standpipe | functional | 2011 | 11 | 3 | 13.0 | False |
| 34580 | 46514 | 0.0 | 2011-07-05 | Government Of Tanzania | | DWE | 31.165492 | -1.417646 | Kwamoto | 0 | Lake Victoria | Maguge | Kagera | 18 | 1 | Karagwe | Kihanga | | True | GeoData Consultants Ltd | VWC | Katanda Water Sup | True | | gravity | gravity | gravity | other | other | never pay | soft | good | dry | river | river/lake | surface | communal standpipe | communal standpipe | non functional | 2011 | 7 | 5 | | True |


In [31]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44550 entries, 4348 to 42291
Data columns (total 44 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   id                     44550 non-null  int64         
 1   amount_tsh             44550 non-null  float64       
 2   date_recorded          44550 non-null  datetime64[ns]
 3   funder                 41789 non-null  object        
 4   gps_height             29098 non-null  float64       
 5   installer              41780 non-null  object        
 6   longitude              43190 non-null  float64       
 7   latitude               43190 non-null  float64       
 8   wpt_name               44550 non-null  object        
 9   num_private            44550 non-null  int64         
 10  basin                  44550 non-null  object        
 11  subvillage             44270 non-null  object        
 12  region                 44550 non-null  object        
 13

In [25]:
def wrangle(x):

    x = x.copy()

    # treat the near-zero latitude placeholder as 0 so it is caught below
    x['latitude'] = x['latitude'].replace(-2e-08, 0)

    # in these columns a 0 stands for a missing value, so replace it with NaN
    cols_with_zero = ['latitude','longitude',
                      'construction_year','gps_height','population']
    for col in cols_with_zero:
        x[col] = x[col].replace(0, np.nan)

    # drop columns that duplicate other columns
    duplicates = ['quantity_group', 'payment_type']
    x = x.drop(columns=duplicates)

    # columns with no usable variance, listed for reference only;
    # they are not dropped here ('id' is dropped when the features are selected)
    unusable_variance = ['recorded_by', 'id']

    # convert the recorded date to datetime
    x['date_recorded'] = pd.to_datetime(x['date_recorded'])

    # separate the date into year, month, and day (the original column is kept)
    x['year_recorded'] = x['date_recorded'].dt.year
    x['month_recorded'] = x['date_recorded'].dt.month
    x['day_recorded'] = x['date_recorded'].dt.day

    # number of years from construction to the recorded date
    x['years'] = x['year_recorded'] - x['construction_year']

    # flag rows where that number of years is missing
    x['year_missing'] = x['years'].isnull()

    return x

train = wrangle(train)
val = wrangle(val)
test = wrangle(test)

In [10]:
fig = px.scatter_mapbox(train, lat='latitude', lon='longitude', color='status_group', opacity=0.1)
fig.update_layout(mapbox_style='stamen-terrain')
fig.show()

In [27]:
# target (y) variable
target = 'status_group'

# drop the target and id columns
train_features = train.drop(columns=[target, 'id'])

# names of the numeric features
numeric_features = train_features.select_dtypes(include='number').columns.to_list()

# number of unique values in each non-numeric feature
cardinality = train_features.select_dtypes(exclude='number').nunique()

# categorical features with at most 50 unique values
categorical_features = cardinality[cardinality <= 50].index.to_list()

# combined feature list used to subset the data
features = numeric_features + categorical_features

In [29]:
X_train = train[features]
X_val = val[features]
X_test = test[features]

y_train = train[target]
y_val = val[target]

In [30]:
pipeline = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True),
    SimpleImputer(strategy='mean'),
    DecisionTreeClassifier(max_depth=20, min_samples_leaf=20, random_state=42)
)

pipeline.fit(X_train,y_train)

print('Train score:', pipeline.score(X_train, y_train))
print('Validation score:', pipeline.score(X_val, y_val))

Train score: 0.8046913580246914
Validation score: 0.7673400673400673
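
The assignment also asks for feature importances. One way to collect them is sketched below, assuming a fitted scikit-learn decision tree and a list of column names in the order the tree saw them. `importance_series` and the toy data are hypothetical and only illustrate the idea.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def importance_series(tree, feature_names):
    """Return feature importances as a Series sorted in ascending order."""
    return pd.Series(tree.feature_importances_, index=feature_names).sort_values()

# Toy data: 'signal' fully determines the label, 'noise' is constant
X_toy = pd.DataFrame({'signal': [0, 0, 1, 1], 'noise': [5, 5, 5, 5]})
y_toy = ['non functional', 'non functional', 'functional', 'functional']

toy_tree = DecisionTreeClassifier(random_state=42).fit(X_toy, y_toy)
toy_importances = importance_series(toy_tree, X_toy.columns)
print(toy_importances)
```

With the pipeline above, the tree would be `pipeline.named_steps['decisiontreeclassifier']`, and the names could come from the columns of the fitted encoder's transformed output (assuming the imputer drops no columns). Calling `.plot.barh()` on the resulting Series should draw a horizontal bar chart, largest importance on top.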


In [None]:
import graphviz
from sklearn.tree import export_graphviz

model = pipeline.named_steps['decisiontreeclassifier']
encoder = pipeline.named_steps['onehotencoder']
encoded_columns = encoder.transform(X_val).columns

dot_data = export_graphviz(model, 
                           out_file=None, 
                           max_depth=3, 
                           feature_names=encoded_columns,
                           class_names=model.classes_, 
                           impurity=False, 
                           filled=True, 
                           proportion=True, 
                           rounded=True)   
display(graphviz.Source(dot_data))
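
For the submission step, a minimal sketch follows. It assumes the rows of `sample_submission` are in the same order as the test features, and that the competition expects an `id` column plus a `status_group` column, as the sample file suggests. `make_submission`, the toy data, and the file name are hypothetical.

```python
import pandas as pd

def make_submission(sample_submission, predictions, target='status_group'):
    """Copy the sample submission and overwrite the target with predictions."""
    if len(predictions) != len(sample_submission):
        raise ValueError('predictions and sample_submission differ in length')
    submission = sample_submission.copy()
    submission[target] = list(predictions)
    return submission

# Toy stand-ins for sample_submission and pipeline.predict(X_test)
toy_sample = pd.DataFrame({'id': [101, 102, 103],
                           'status_group': ['functional'] * 3})
toy_predictions = ['non functional', 'functional', 'functional needs repair']

toy_submission = make_submission(toy_sample, toy_predictions)
print(toy_submission)

# Writing the file for upload (hypothetical file name):
# toy_submission.to_csv('my_submission.csv', index=False)
```

In this notebook the call would presumably be `make_submission(sample_submission, pipeline.predict(X_test))`.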