<a href="https://colab.research.google.com/github/JesseOtradovec/DS-Unit-2-Kaggle-Challenge/blob/master/OtradovecassignmentKaggleChallenge1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science, Unit 2: Predictive Modeling

# Kaggle Challenge, Module 1

## Assignment
- [x] Do train/validate/test split with the Tanzania Waterpumps data.
- [x] Define a function to wrangle train, validate, and test sets in the same way. Clean outliers and engineer features. (For example, [what other columns have zeros and shouldn't?](https://github.com/Quartz/bad-data-guide#zeros-replace-missing-values) What other columns are duplicates, or nearly duplicates? Can you extract the year from date_recorded? Can you engineer new features, such as the number of years from waterpump construction to waterpump inspection?)
- [ ] Select features. Use a scikit-learn pipeline to encode categoricals, impute missing values, and fit a decision tree classifier.
- [ ] Get your validation accuracy score.
- [ ] Get and plot your feature importances.
- [ ] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [ ] Commit your notebook to your fork of the GitHub repo.


## Stretch Goals

### Reading

- A Visual Introduction to Machine Learning
  - [Part 1: A Decision Tree](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/)
  - [Part 2: Bias and Variance](http://www.r2d3.us/visual-intro-to-machine-learning-part-2/)
- [Decision Trees: Advantages & Disadvantages](https://christophm.github.io/interpretable-ml-book/tree.html#advantages-2)
- [How a Russian mathematician constructed a decision tree — by hand — to solve a medical problem](http://fastml.com/how-a-russian-mathematician-constructed-a-decision-tree-by-hand-to-solve-a-medical-problem/)
- [How decision trees work](https://brohrer.github.io/how_decision_trees_work.html)
- [Let’s Write a Decision Tree Classifier from Scratch](https://www.youtube.com/watch?v=LDRbO9a6XPU) — _Don’t worry about understanding the code, just get introduced to the concepts. This 10 minute video has excellent diagrams and explanations._
- [Random Forests for Complete Beginners: The definitive guide to Random Forests and Decision Trees](https://victorzhou.com/blog/intro-to-random-forests/)


### Doing
- [ ] Add your own stretch goal(s)!
- [ ] Try other [scikit-learn imputers](https://scikit-learn.org/stable/modules/impute.html).
- [ ] Try other [scikit-learn scalers](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Make exploratory visualizations and share on Slack.


#### Exploratory visualizations

Visualize the relationships between feature(s) and target. I recommend you do this with your training set, after splitting your data. 

For this problem, you may want to create a new column to represent the target as a number, 0 or 1. For example:

```python
train['functional'] = (train['status_group']=='functional').astype(int)
```



You can try [Seaborn "Categorical estimate" plots](https://seaborn.pydata.org/tutorial/categorical.html) for features with reasonably few unique values. (With too many unique values, the plot is unreadable.)

- Categorical features. (If there are too many unique values, you can replace less frequent values with "OTHER.")
- Numeric features. (If there are too many unique values, you can [bin with pandas cut / qcut functions](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html?highlight=qcut#discretization-and-quantiling).)
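
The binning idea above can be sketched with pandas alone. This is a minimal illustration on made-up rows, not the real data: the `toy` frame and its values are hypothetical, and `gps_height` is used only as an example of a numeric feature.

```python
import pandas as pd

# Hypothetical toy rows standing in for the training set.
toy = pd.DataFrame({
    'gps_height': [0, 50, 120, 300, 800, 1200, 1500, 1700],
    'status_group': ['non functional', 'non functional', 'functional',
                     'non functional', 'functional', 'functional',
                     'functional needs repair', 'functional'],
})

# Binary target, as suggested above.
toy['functional'] = (toy['status_group'] == 'functional').astype(int)

# Split the numeric feature into 4 bins holding equal numbers of rows.
toy['gps_height_bin'] = pd.qcut(toy['gps_height'], q=4)

# Share of functional pumps within each bin.
rate_by_bin = toy.groupby('gps_height_bin', observed=True)['functional'].mean()
print(rate_by_bin)
```

The resulting series has one row per bin, so it can be passed to a bar plot or inspected directly.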

You can try [Seaborn linear model plots](https://seaborn.pydata.org/tutorial/regression.html) with numeric features. For this problem, you may want to pass `logistic=True`, which fits a logistic curve and expects a 0/1 target such as the `functional` column above.

You do _not_ need to use Seaborn, but it's nice because it includes confidence intervals to visualize uncertainty.

#### High-cardinality categoricals

This code from a previous assignment shows how to replace less frequent values with 'OTHER'. It uses that assignment's `NEIGHBORHOOD` column; substitute a high-cardinality column from this dataset:

```python
# Reduce cardinality for NEIGHBORHOOD feature ...

# Get a list of the top 10 neighborhoods
top10 = train['NEIGHBORHOOD'].value_counts()[:10].index

# At locations where the neighborhood is NOT in the top 10,
# replace the neighborhood with 'OTHER'
train.loc[~train['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
test.loc[~test['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
```
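
The same pattern can be wrapped in a small helper so that the top categories are learned from the training column only and then applied to the other splits. This is a sketch under assumptions: `reduce_cardinality` and the toy series are hypothetical names, not part of the assignment code.

```python
import pandas as pd

def reduce_cardinality(train_col, other_cols, top_n=10, other_label='OTHER'):
    """Keep the top_n most frequent values of train_col; replace the rest.

    The top values are computed from train_col only and reused for every
    column in other_cols. Missing values are also replaced, which matches
    the behavior of the .loc[~isin(...)] pattern above.
    """
    top = train_col.value_counts().index[:top_n]
    return [col.where(col.isin(top), other_label)
            for col in [train_col, *other_cols]]

# Hypothetical toy columns.
train_funder = pd.Series(['A', 'A', 'A', 'B', 'B', 'C', 'D'])
test_funder = pd.Series(['A', 'C', 'E'])

train_reduced, test_reduced = reduce_cardinality(
    train_funder, [test_funder], top_n=2)
print(train_reduced.tolist())
print(test_reduced.tolist())
```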



In [0]:
# If you're in Colab...
import os, sys
in_colab = 'google.colab' in sys.modules

if in_colab:
    # Install required python packages:
    # category_encoders, version >= 2.0
    # pandas-profiling, version >= 2.0
    # plotly, version >= 4.0
    !pip install --upgrade category_encoders pandas-profiling plotly
    
    # Pull files from Github repo
    os.chdir('/content')
    !git init .
    !git remote add origin https://github.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge.git
    !git pull origin master
    
    # Change into directory for module
    os.chdir('module1')



In [0]:
import pandas as pd
from sklearn.model_selection import train_test_split

train = pd.merge(pd.read_csv('../data/tanzania/train_features.csv'), 
                 pd.read_csv('../data/tanzania/train_labels.csv'))
test = pd.read_csv('../data/tanzania/test_features.csv')
sample_submission = pd.read_csv('../data/tanzania/sample_submission.csv')

train.shape, test.shape

((59400, 41), (14358, 40))

In [0]:
import numpy as np

# Hold out a validation set (train_test_split defaults to a 75/25 split).
train, validate = train_test_split(train)

train.head()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group,status_group
50676,6068,0.0,2013-02-03,Fw,280,FW,38.993729,-10.864072,Hanu,0,Ruvuma / Southern Coast,Mtenga,Mtwara,90,33,Masasi,Chiugutwa,1,True,GeoData Consultants Ltd,VWC,,True,1983,other,other,other,vwc,user-group,unknown,unknown,salty,salty,dry,dry,shallow well,shallow well,groundwater,other,other,non functional
54289,14257,0.0,2013-01-17,Dfid,1468,Water Aid,34.645702,-4.880806,Nyota,0,Internal,Makungu,Singida,13,4,Singida Urban,Mwankoko,1,True,GeoData Consultants Ltd,WUG,,True,2003,nira/tanira,nira/tanira,handpump,wug,user-group,other,other,soft,good,dry,dry,shallow well,shallow well,groundwater,hand pump,hand pump,non functional
1871,4789,0.0,2012-11-09,Rwssp,0,WEDECO,34.354095,-3.139753,Maji Muhimu,0,Internal,Mayunge,Shinyanga,17,6,Meatu,Sakasaka,0,True,GeoData Consultants Ltd,WUG,,True,0,nira/tanira,nira/tanira,handpump,wug,user-group,never pay,never pay,soft,good,enough,enough,shallow well,shallow well,groundwater,hand pump,hand pump,functional
59075,14131,30.0,2011-02-26,Jaica,43,JAICA,39.292457,-6.977986,Kwa Mbawala,0,Wami / Ruvu,Mwandege,Pwani,60,43,Mkuranga,Vikindu,6922,True,GeoData Consultants Ltd,VWC,,False,2010,submersible,submersible,submersible,vwc,user-group,pay per bucket,per bucket,soft,good,enough,enough,machine dbh,borehole,groundwater,communal standpipe,communal standpipe,functional
13320,69535,0.0,2011-07-10,Concern,0,CONCERN,30.677431,-2.469766,Kinyinya C,0,Lake Victoria,Kinyinya C,Kagera,18,30,Ngara,Nyamiyaga,0,True,GeoData Consultants Ltd,VWC,,False,0,gravity,gravity,gravity,vwc,user-group,never pay,never pay,soft,good,insufficient,insufficient,spring,spring,groundwater,communal standpipe,communal standpipe,non functional
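
The split in this notebook is random and unseeded, so scores change from run to run. One possible variant, shown here on hypothetical toy data, fixes `random_state` and stratifies on the target so that both parts keep the same class proportions. The frame `toy` and the names `train_part` and `val_part` are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: 12 functional and 8 non functional pumps.
toy = pd.DataFrame({
    'id': range(20),
    'status_group': ['functional'] * 12 + ['non functional'] * 8,
})

# Seeded, stratified 75/25 split.
train_part, val_part = train_test_split(
    toy, test_size=0.25, stratify=toy['status_group'], random_state=42)

print(train_part['status_group'].value_counts())
print(val_part['status_group'].value_counts())
```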


In [0]:
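def getXandY(df):
  """Wrangle a labeled dataframe and return (features, target)."""
  X = df.drop(columns=['id', 'status_group']).copy()
  X['date_recorded'] = pd.to_datetime(X['date_recorded'])

  # Treat latitude -2e-08 as a zero so it is handled as missing below.
  X['latitude'] = X['latitude'].replace(-2e-08, 0)

  # In these columns a zero stands for a missing value:
  # convert it to NaN and record that in a boolean flag column.
  cols_with_zeros = ['longitude', 'latitude', 'construction_year', 'population']
  for col in cols_with_zeros:
    X[col] = X[col].replace(0, np.nan)
    X[col + '_MISSING'] = X[col].isnull()

  # Drop near-duplicate columns and recorded_by, which appears constant.
  X = X.drop(columns=['quantity_group', 'payment_type', 'recorded_by'])

  # Split the inspection date into year, month, and day.
  X['year_recorded'] = X['date_recorded'].dt.year
  X['month_recorded'] = X['date_recorded'].dt.month
  X['day_recorded'] = X['date_recorded'].dt.day

  # Years from construction to inspection (NaN if construction_year is unknown).
  X['Yrs_before_inspection'] = X['year_recorded'] - X['construction_year']
  X['yrs_before_inspection_MISSING'] = X['Yrs_before_inspection'].isnull()

  X = X.drop(columns='date_recorded')
  return X, df['status_group'].copy()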

In [0]:
train_x, train_y = getXandY(train)
val_x, val_y = getXandY(validate)
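
To see what the zero handling in the wrangle function does, here is a minimal sketch on three made-up rows. The `toy` frame and the `years_in_service` column name are hypothetical; only the replace-then-flag pattern mirrors the function above.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the second pump has an unknown construction year (0).
toy = pd.DataFrame({
    'construction_year': [1983, 0, 2010],
    'year_recorded': [2013, 2012, 2011],
})

# Treat 0 as unknown and keep that fact in a separate boolean column.
toy['construction_year'] = toy['construction_year'].replace(0, np.nan)
toy['construction_year_MISSING'] = toy['construction_year'].isnull()

# The derived feature is NaN wherever the construction year is unknown,
# instead of a misleading value such as 2012 - 0 = 2012.
toy['years_in_service'] = toy['year_recorded'] - toy['construction_year']
print(toy)
```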

In [0]:
import pandas_profiling
#train.profile_report()


In [0]:
target = train_y

numeric_features=train_x.select_dtypes(include='number').columns.tolist()

cardinality = train_x.select_dtypes(exclude='number').nunique()

In [0]:
import category_encoders as ce
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import f_regression, SelectKBest
from sklearn.preprocessing import StandardScaler

numeric_target=target.replace({'non functional':0,
                              'functional needs repair':1,
                              'functional':2})
numeric_val_y=val_y.replace({'non functional':0,
                              'functional needs repair':1,
                              'functional':2})

# Sweep k over every possible feature count (k=43 maximized validation accuracy).
for k in range(1,len(train_x.columns)+1):
  print(f'{k} features')

  pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    StandardScaler(),
    SimpleImputer(strategy='median'), 
    SelectKBest(score_func=f_regression, k=k),
    RandomForestClassifier(random_state=42, n_estimators=100, n_jobs=-1)
  )

  pipeline.fit(train_x, numeric_target)
  print(f'Validate Accuracy Score: {pipeline.score(val_x, numeric_val_y)}')

1 features
Validate Accuracy Score: 0.5831649831649832
2 features
Validate Accuracy Score: 0.5892255892255892
3 features
Validate Accuracy Score: 0.6451178451178451
4 features
Validate Accuracy Score: 0.6728619528619528
5 features
Validate Accuracy Score: 0.6792592592592592
6 features
Validate Accuracy Score: 0.6934006734006734
7 features
Validate Accuracy Score: 0.7456565656565657
8 features
Validate Accuracy Score: 0.7505050505050505
9 features
Validate Accuracy Score: 0.7461952861952862
10 features
Validate Accuracy Score: 0.7574410774410775
11 features
Validate Accuracy Score: 0.7585858585858586
12 features
Validate Accuracy Score: 0.7613468013468013
13 features
Validate Accuracy Score: 0.7739393939393939
14 features
Validate Accuracy Score: 0.7731986531986532
15 features
Validate Accuracy Score: 0.7771717171717172
16 features
Validate Accuracy Score: 0.7793939393939394
17 features
Validate Accuracy Score: 0.7807407407407407
18 features
Validate Accuracy Score: 0.7804713804713804
[output truncated]

In [0]:


from sklearn.ensemble import ExtraTreesClassifier

for k in range(1,len(train_x.columns)+1):
  print(f'{k} features')


  pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    StandardScaler(),
    SimpleImputer(strategy='median'), 
    SelectKBest(score_func=f_regression, k=k),
    ExtraTreesClassifier(random_state=42, n_estimators=100, n_jobs=-1)
  )

  pipeline.fit(train_x, numeric_target)
  print(f'Validate Accuracy Score: {pipeline.score(val_x, numeric_val_y)}')

1 features
Validate Accuracy Score: 0.5831649831649832
2 features
Validate Accuracy Score: 0.589023569023569
3 features
Validate Accuracy Score: 0.6452525252525253
4 features
Validate Accuracy Score: 0.671986531986532
5 features
Validate Accuracy Score: 0.6783838383838384
6 features
Validate Accuracy Score: 0.6936026936026936
7 features
Validate Accuracy Score: 0.7448484848484849
8 features
Validate Accuracy Score: 0.748013468013468
9 features
Validate Accuracy Score: 0.7436363636363637
10 features
Validate Accuracy Score: 0.7528619528619529
11 features
Validate Accuracy Score: 0.7544107744107744
12 features
Validate Accuracy Score: 0.756094276094276
13 features
Validate Accuracy Score: 0.7685521885521885
14 features
Validate Accuracy Score: 0.7668013468013468
15 features
Validate Accuracy Score: 0.7704377104377105
16 features
Validate Accuracy Score: 0.7713131313131313
17 features
Validate Accuracy Score: 0.7732659932659932
18 features
Validate Accuracy Score: 0.7749494949494949
[output truncated]

In [0]:
from sklearn.ensemble import ExtraTreesClassifier

for k in range(1,len(train_x.columns)+1):
  print(f'{k} features')
 

  pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    StandardScaler(),
    SimpleImputer(strategy='median'), 
    SelectKBest(score_func=f_regression, k=k),
    ExtraTreesClassifier(random_state=42, n_estimators=100, n_jobs=-1)
  )

  pipeline.fit(train_x, numeric_target)
  print(f'Validate Accuracy Score: {pipeline.score(val_x, numeric_val_y)}')

In [0]:
k=43
pipeline = make_pipeline(
  ce.OrdinalEncoder(),
  StandardScaler(),
  SimpleImputer(strategy='median'), 
  SelectKBest(score_func=f_regression, k=k),
  RandomForestClassifier(random_state=42, n_estimators=100, n_jobs=-1)
)

pipeline.fit(train_x, numeric_target)
print(f'Validate Accuracy Score: {pipeline.score(val_x, numeric_val_y)}')

Validate Accuracy Score: 0.8043097643097643
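
The cells above score features with `f_regression`, which treats the 0/1/2 labels as a continuous target, and pick `k` by reading validation scores from a loop. A possible alternative, sketched here on synthetic data, is to score with `f_classif` (the ANOVA F-test meant for class labels) and let `GridSearchCV` choose `k` by cross-validation. The data from `make_classification`, the grid of `k` values, and the small forest size are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Synthetic 3-class problem standing in for the wrangled features.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, random_state=42)

pipeline = make_pipeline(
    SelectKBest(score_func=f_classif),
    RandomForestClassifier(n_estimators=20, random_state=42),
)

# make_pipeline names each step after its lowercased class name,
# so the k parameter is addressed as 'selectkbest__k'.
search = GridSearchCV(pipeline,
                      param_grid={'selectkbest__k': [2, 4, 6, 8]},
                      cv=3, scoring='accuracy')
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

Cross-validation uses only the training data to choose `k`, which leaves the validation set untouched for a final check.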


In [0]:
from sklearn.ensemble import RandomForestClassifier

# Baseline: no scaling or feature selection, fit on the original string labels.
pipeline = make_pipeline(
  ce.OrdinalEncoder(),
  SimpleImputer(strategy='mean'),
  RandomForestClassifier(random_state=42, n_estimators=100)
)

pipeline.fit(train_x, train_y)
print('validation accuracy', pipeline.score(val_x, val_y))

validation accuracy 0.8034343434343434
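
The assignment also asks for feature importances. A minimal sketch of reading them out of a fitted pipeline follows, using hypothetical toy data and a pipeline without the categorical encoder. For the notebook's own pipeline the forest step would likewise be named `randomforestclassifier`; pairing the importances with `train_x.columns` assumes the encoder and imputer keep exactly one output column per input column.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Hypothetical toy features (one value is missing) and labels.
toy_x = pd.DataFrame({
    'quantity_code': [0, 0, 1, 1, 2, 2, 0, 1, 2, 2],
    'gps_height': [10, 20, 300, None, 800, 900, 15, 350, 850, 950],
    'noise': [5, 3, 5, 3, 5, 3, 3, 5, 3, 5],
})
toy_y = pd.Series(['non functional', 'non functional', 'functional',
                   'functional', 'functional', 'functional',
                   'non functional', 'functional', 'functional', 'functional'])

toy_pipeline = make_pipeline(
    SimpleImputer(strategy='mean'),
    RandomForestClassifier(n_estimators=50, random_state=42),
)
toy_pipeline.fit(toy_x, toy_y)

# Pull the fitted forest out of the pipeline by its step name.
forest = toy_pipeline.named_steps['randomforestclassifier']
importances = pd.Series(forest.feature_importances_,
                        index=toy_x.columns).sort_values(ascending=False)
print(importances)

# With matplotlib installed, importances.plot.barh() draws a bar chart.
```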


In [63]:
from sklearn.tree import DecisionTreeClassifier

# Single decision tree baseline, fit on the original string labels.
# (numeric_target and numeric_val_y are already defined above.)
pipeline = make_pipeline(
  ce.OrdinalEncoder(),
  SimpleImputer(strategy='mean'),
  DecisionTreeClassifier(random_state=42)
)

pipeline.fit(train_x, train_y)
print('validation accuracy', pipeline.score(val_x, val_y))

validation accuracy 0.7077441077441078


In [0]:


for k in range(1,len(train_x.columns)+1):
  print(f'{k} features')
 

  pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    StandardScaler(),
    SimpleImputer(strategy='median'), 
    SelectKBest(score_func=f_regression, k=k),
    DecisionTreeClassifier(random_state=42)
  )

  pipeline.fit(train_x, numeric_target)
  print(f'Validate Accuracy Score: {pipeline.score(val_x, numeric_val_y)}')
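
The remaining assignment step is a Kaggle submission. Pipelines fit on `numeric_target` predict 0, 1, or 2, so those codes need to be mapped back to the original labels first. The sketch below uses hypothetical ids and predictions, and assumes the submission file needs `id` and `status_group` columns like `sample_submission`. Note also that `getXandY` as written drops `status_group`, which the test set does not have, so the test features would need a variant of that function.

```python
import pandas as pd

# Hypothetical stand-ins for test['id'] and pipeline.predict(test_x).
test_ids = pd.Series([101, 102, 103], name='id')
numeric_predictions = [2, 0, 1]

# Inverse of the label encoding used for numeric_target.
label_names = {0: 'non functional',
               1: 'functional needs repair',
               2: 'functional'}

submission = pd.DataFrame({
    'id': test_ids,
    'status_group': pd.Series(numeric_predictions).map(label_names),
})

# Render as CSV text; submission.to_csv('submission.csv', index=False)
# would write the same content to a file for upload.
csv_text = submission.to_csv(index=False)
print(csv_text)
```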