<a href="https://colab.research.google.com/github/Pdugovich/DS-Unit-2-Kaggle-Challenge/blob/master/module2/assignment_kaggle_challenge_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science, Unit 2: Predictive Modeling

# Kaggle Challenge, Module 2

## Assignment
- [ ] Read [“Adopting a Hypothesis-Driven Workflow”](https://outline.com/5S5tsB), a blog post by a Lambda DS student about the Tanzania Waterpumps challenge.
- [ ] Continue to participate in our Kaggle challenge.
- [ ] Try Ordinal Encoding.
- [ ] Try a Random Forest Classifier.
- [ ] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [ ] Commit your notebook to your fork of the GitHub repo.

## Stretch Goals

### Doing
- [ ] Add your own stretch goal(s) !
- [ ] Do more exploratory data analysis, data cleaning, feature engineering, and feature selection.
- [ ] Try other [categorical encodings](https://contrib.scikit-learn.org/categorical-encoding/).
- [ ] Get and plot your feature importances.
- [ ] Make visualizations and share on Slack.

### Reading

Top recommendations in _**bold italic:**_

#### Decision Trees
- A Visual Introduction to Machine Learning, [Part 1: A Decision Tree](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/),  and _**[Part 2: Bias and Variance](http://www.r2d3.us/visual-intro-to-machine-learning-part-2/)**_
- [Decision Trees: Advantages & Disadvantages](https://christophm.github.io/interpretable-ml-book/tree.html#advantages-2)
- [How a Russian mathematician constructed a decision tree — by hand — to solve a medical problem](http://fastml.com/how-a-russian-mathematician-constructed-a-decision-tree-by-hand-to-solve-a-medical-problem/)
- [How decision trees work](https://brohrer.github.io/how_decision_trees_work.html)
- [Let’s Write a Decision Tree Classifier from Scratch](https://www.youtube.com/watch?v=LDRbO9a6XPU)

#### Random Forests
- [_An Introduction to Statistical Learning_](http://www-bcf.usc.edu/~gareth/ISL/), Chapter 8: Tree-Based Methods
- [Coloring with Random Forests](http://structuringtheunstructured.blogspot.com/2017/11/coloring-with-random-forests.html)
- _**[Random Forests for Complete Beginners: The definitive guide to Random Forests and Decision Trees](https://victorzhou.com/blog/intro-to-random-forests/)**_

#### Categorical encoding for trees
- [Are categorical variables getting lost in your random forests?](https://roamanalytics.com/2016/10/28/are-categorical-variables-getting-lost-in-your-random-forests/)
- [Beyond One-Hot: An Exploration of Categorical Variables](http://www.willmcginnis.com/2015/11/29/beyond-one-hot-an-exploration-of-categorical-variables/)
- _**[Categorical Features and Encoding in Decision Trees](https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931)**_
- _**[Coursera — How to Win a Data Science Competition: Learn from Top Kagglers — Concept of mean encoding](https://www.coursera.org/lecture/competitive-data-science/concept-of-mean-encoding-b5Gxv)**_
- [Mean (likelihood) encodings: a comprehensive study](https://www.kaggle.com/vprokopev/mean-likelihood-encodings-a-comprehensive-study)
- [The Mechanics of Machine Learning, Chapter 6: Categorically Speaking](https://mlbook.explained.ai/catvars.html)

#### Imposter Syndrome
- [Effort Shock and Reward Shock (How The Karate Kid Ruined The Modern World)](http://www.tempobook.com/2014/07/09/effort-shock-and-reward-shock/)
- [How to manage impostor syndrome in data science](https://towardsdatascience.com/how-to-manage-impostor-syndrome-in-data-science-ad814809f068)
- ["I am not a real data scientist"](https://brohrer.github.io/imposter_syndrome.html)
- _**[Imposter Syndrome in Data Science](https://caitlinhudon.com/2018/01/19/imposter-syndrome-in-data-science/)**_






### Setup

You can work locally (follow the [local setup instructions](https://lambdaschool.github.io/ds/unit2/local/)) or on Colab (run the code cell below).

In [278]:
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'



In [279]:
import pandas as pd
from sklearn.model_selection import train_test_split

train = pd.merge(pd.read_csv(DATA_PATH+'waterpumps/train_features.csv'), 
                 pd.read_csv(DATA_PATH+'waterpumps/train_labels.csv'))
test = pd.read_csv(DATA_PATH+'waterpumps/test_features.csv')
sample_submission = pd.read_csv(DATA_PATH+'waterpumps/sample_submission.csv')

train.shape, test.shape

((59400, 41), (14358, 40))

In [280]:
# I checked: the average elevation of Tanzania is about 1200, so the many
# zeros here are probably missing values rather than real readings.
# The lowest elevation in the country is 0, so the negative numbers are puzzling.
train['gps_height'].value_counts()

 0       20438
-15         60
-16         55
-13         55
-20         52
 1290       52
-14         51
 303        51
-18         49
-19         47
 1269       46
 1295       46
 1304       45
-23         45
 280        44
 1538       44
 1286       44
-8          44
-17         44
 1332       43
 320        43
 1317       42
 1293       42
 1319       42
 1359       42
 1264       42
 1288       42
 1401       42
 1303       42
-27         42
         ...  
 2506        1
 2023        1
-53          1
 2364        1
 2332        1
 2402        1
 2236        1
 2420        1
 2291        1
 2407        1
 2080        1
 2250        1
 591         1
 2378        1
 2535        1
 2614        1
 2484        1
 2450        1
 2072        1
 2286        1
 2567        1
 2322        1
 2254        1
 2264        1
 2464        1
 2285        1
 2424        1
 2552        1
 2413        1
 2385        1
Name: gps_height, Length: 2428, dtype: int64
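
To put a number on how many readings are zero or negative, a small helper like the one below can help. This is only a sketch: `sign_shares` is a hypothetical name, and the toy Series reuses a few values from the output above with made-up frequencies, so the shares are not the real dataset's.

```python
import pandas as pd

def sign_shares(values: pd.Series) -> dict:
    """Return the fraction of zero, negative, and positive entries."""
    return {
        "zero": float((values == 0).mean()),
        "negative": float((values < 0).mean()),
        "positive": float((values > 0).mean()),
    }

# Toy stand-in for train['gps_height'] (frequencies are invented)
toy_heights = pd.Series([0, 0, -15, 1290, 303, 0, -16, 1269])
shares = sign_shares(toy_heights)
print(shares)
```

On the real data the same call would be `sign_shares(train['gps_height'])`.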

### Copying over some code from previous assignments

In [0]:
# Numeric Columns to clean
numeric_to_clean = ['longitude','latitude','construction_year', 'gps_height']

In [282]:
# Checking for duplicate columns
duplicates1 = ['extraction_type','extraction_type_group','extraction_type_class']
duplicates2 = ['payment','payment_type']
duplicates3 = ['quantity_group','quantity']
duplicates4 = ['source','source_type']
duplicates5 = ['waterpoint_type','waterpoint_type_group']
train.head()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group,status_group
0,69572,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,none,0,Lake Nyasa,Mnyusi B,Iringa,11,5,Ludewa,Mundindi,109,True,GeoData Consultants Ltd,VWC,Roman,False,1999,gravity,gravity,gravity,vwc,user-group,pay annually,annually,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe,functional
1,8776,0.0,2013-03-06,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,0,Lake Victoria,Nyamara,Mara,20,2,Serengeti,Natta,280,,GeoData Consultants Ltd,Other,,True,2010,gravity,gravity,gravity,wug,user-group,never pay,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,functional
2,34310,25.0,2013-02-25,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,0,Pangani,Majengo,Manyara,21,4,Simanjiro,Ngorika,250,True,GeoData Consultants Ltd,VWC,Nyumba ya mungu pipe scheme,True,2009,gravity,gravity,gravity,vwc,user-group,pay per bucket,per bucket,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe,functional
3,67743,0.0,2013-01-28,Unicef,263,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,0,Ruvuma / Southern Coast,Mahakamani,Mtwara,90,63,Nanyumbu,Nanyumbu,58,True,GeoData Consultants Ltd,VWC,,True,1986,submersible,submersible,submersible,vwc,user-group,never pay,never pay,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe,non functional
4,19728,0.0,2011-07-13,Action In A,0,Artisan,31.130847,-1.825359,Shuleni,0,Lake Victoria,Kyanyamisa,Kagera,18,1,Karagwe,Nyakasimbi,0,True,GeoData Consultants Ltd,,,True,0,gravity,gravity,gravity,other,other,never pay,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,functional


In [283]:
#Checking the duplicates to decide which to keep
duplicate_lists = [duplicates1, duplicates2, duplicates3, duplicates4,
                   duplicates5]

for duplicate in duplicate_lists:
  print(train[duplicate].describe())
  print("")

       extraction_type extraction_type_group extraction_type_class
count            59400                 59400                 59400
unique              18                    13                     7
top            gravity               gravity               gravity
freq             26780                 26780                 26780

          payment payment_type
count       59400        59400
unique          7            7
top     never pay    never pay
freq        25348        25348

       quantity_group quantity
count           59400    59400
unique              5        5
top            enough   enough
freq            33186    33186

        source source_type
count    59400       59400
unique      10           7
top     spring      spring
freq     17021       17021

           waterpoint_type waterpoint_type_group
count                59400                 59400
unique                   7                     6
top     communal standpipe    communal standpipe
freq                

In [0]:
# Based on the summaries above, I'll drop the exact duplicate columns,
# and for near-duplicates I'll drop the versions with fewer unique values

duplicates_to_drop = ['extraction_type_group','extraction_type_class',
                    'payment_type','quantity_group', 'source_type',
                    'waterpoint_type_group']
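
The `describe()` summaries only compare counts, so one possible extra check is whether two columns map one-to-one onto each other. The sketch below assumes a hypothetical helper name, `is_one_to_one`, and uses a tiny toy frame built from values seen in `train.head()`; it is not run on the real data here.

```python
import pandas as pd

def is_one_to_one(df: pd.DataFrame, col_a: str, col_b: str) -> bool:
    """True if each value of col_a pairs with exactly one value of col_b, and vice versa."""
    pairs = df[[col_a, col_b]].drop_duplicates()
    return bool(pairs[col_a].is_unique and pairs[col_b].is_unique)

# Toy rows only; the real check would use the full train DataFrame
toy = pd.DataFrame({
    "payment": ["pay annually", "never pay", "pay per bucket", "never pay"],
    "payment_type": ["annually", "never pay", "per bucket", "never pay"],
    "waterpoint_type": ["communal standpipe", "communal standpipe",
                        "communal standpipe multiple", "communal standpipe multiple"],
    "waterpoint_type_group": ["communal standpipe"] * 4,
})

payment_is_duplicate = is_one_to_one(toy, "payment", "payment_type")
waterpoint_is_duplicate = is_one_to_one(toy, "waterpoint_type", "waterpoint_type_group")
print(payment_is_duplicate, waterpoint_is_duplicate)
```

A one-to-one pair carries the same information under different labels, while a many-to-one pair is a coarser grouping of the more detailed column.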

In [0]:
from sklearn.model_selection import train_test_split

my_train, my_val = train_test_split(train, random_state=333)

### Wrangling

In [0]:
import numpy as np
def wrangle(X):
  
  # To prevent copy warnings
  X = X.copy()

  # Latitude is strange in that it doesn't have any 0s, but it does have these
  # near-0 values
  X['latitude'] = X['latitude'].replace(-2e-08, 0)

  # Some numeric columns use 0 as a placeholder that should be NaN
  nans_as_zeros = ['latitude','longitude', 'construction_year',
                   'gps_height', 'population']
  for column in nans_as_zeros:
    X[column] = X[column].replace(0, np.nan)
    # Flag missing values in a new indicator column (an idea borrowed from Ryan)
    X[column+'_MISSING'] = X[column].isnull()

  # date_recorded is read in as a string; convert it to datetime and extract year/month/day
  X['date_recorded'] = pd.to_datetime(X['date_recorded'])
  X['year_recorded'] = X['date_recorded'].dt.year
  X['month_recorded'] = X['date_recorded'].dt.month
  X['day_recorded'] = X['date_recorded'].dt.day
  X = X.drop(columns='date_recorded')

  #Removing duplicate or near-duplicate columns
  X = X.drop(columns=duplicates_to_drop)

  #Can be used for each train and validation
  return X
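
To see what the zero-to-NaN step and the `_MISSING` flags do, here is the same pattern on a toy frame. The helper name `zeros_to_nan_with_flag` and the toy values are illustrative assumptions, not part of the notebook's `wrangle` function.

```python
import numpy as np
import pandas as pd

def zeros_to_nan_with_flag(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Replace 0 with NaN in the given columns and add a boolean <column>_MISSING flag."""
    df = df.copy()
    for column in columns:
        df[column] = df[column].replace(0, np.nan)
        df[column + "_MISSING"] = df[column].isnull()
    return df

# Toy frame: one placeholder zero in each column
toy = pd.DataFrame({
    "construction_year": [1999, 0, 2010],
    "population": [109, 280, 0],
})
flagged = zeros_to_nan_with_flag(toy, ["construction_year", "population"])
print(flagged)
```

The flag columns let a tree-based model split on "was this value missing?" even after the imputer has filled the NaNs in.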

In [0]:
my_train = wrangle(my_train)
my_val = wrangle(my_val)
test = wrangle(test)

### Feature Selection

In [0]:
# # Copied from previous assignment. 
# # Unnecessary because high cardinality features are fine

# # # Selecting target

# target = 'status_group'

# #Removing the target and useless id columns
# train_columns = my_train.drop(columns=[target,'id'])

# # Separating numeric columns to re-add afterwards
# numeric_columns = train_columns.select_dtypes(include='number').columns.tolist()

# # Getting the cardinality of each categorical feature so the large ones can be excluded
# cardinality = train_columns.select_dtypes(exclude='number').nunique()

# #Excluding features with a cardinality over 50
# categorical_columns = cardinality[cardinality <50].index.tolist()

# #combining lists to get the features I will use for my model
# features = numeric_columns + categorical_columns

In [0]:
# We can use high cardinality features, so no need to remove them
target = 'status_group'

features = my_train.drop(columns=[target,'id']).columns
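
Even when high-cardinality features are kept, it can be useful to know which columns they are. The following is a minimal sketch on a toy frame whose values come from `train.head()`; on the real data the same expression would presumably be applied to `my_train[features]`.

```python
import pandas as pd

# Toy stand-in for the training features
toy = pd.DataFrame({
    "funder": ["Roman", "Grumeti", "Lottery Club", "Unicef"],
    "basin": ["Lake Nyasa", "Lake Victoria", "Pangani", "Lake Victoria"],
    "gps_height": [1390, 1399, 686, 263],
})

# Number of distinct values per non-numeric column, largest first
cardinality = (
    toy.select_dtypes(exclude="number")
       .nunique()
       .sort_values(ascending=False)
)
print(cardinality)
```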

In [0]:
#Assigning variables

X_train = my_train[features]
y_train = my_train[target]

X_val = my_val[features]
y_val = my_val[target]

X_test = test[features]

### Making pipeline

In [0]:
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

# Making a pipeline to encode, impute, then classify the data using a random forest
my_pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='median'),
    RandomForestClassifier(n_estimators=100, random_state=333, n_jobs=-1,
                           max_depth=20)
)


In [317]:
my_pipeline.fit(X_train,y_train)

Pipeline(memory=None,
         steps=[('ordinalencoder',
                 OrdinalEncoder(cols=['funder', 'installer', 'wpt_name',
                                      'basin', 'subvillage', 'region', 'lga',
                                      'ward', 'public_meeting', 'recorded_by',
                                      'scheme_management', 'scheme_name',
                                      'permit', 'extraction_type', 'management',
                                      'management_group', 'payment',
                                      'water_quality', 'quality_group',
                                      'quantity', 'source', 'source_class',
                                      'waterp...
                ('randomforestclassifier',
                 RandomForestClassifier(bootstrap=True, class_weight=None,
                                        criterion='gini', max_depth=20,
                                        max_features='auto',
                                        m

In [318]:
my_pipeline.score(X_val,y_val)

0.8057912457912458
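
One of the stretch goals is to get the feature importances. With a pipeline, the fitted forest can be reached through `named_steps`. The sketch below uses synthetic numeric data and leaves out the ordinal encoder, so the column names (`signal`, `noise`) and the numbers are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Synthetic data: the label depends only on the 'signal' column
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "signal": rng.normal(size=200),
    "noise": rng.normal(size=200),
})
y_toy = (X_toy["signal"] > 0).astype(int)

toy_pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    RandomForestClassifier(n_estimators=50, random_state=0),
)
toy_pipeline.fit(X_toy, y_toy)

# make_pipeline names each step after its lowercased class name
forest = toy_pipeline.named_steps["randomforestclassifier"]
importances = pd.Series(
    forest.feature_importances_, index=X_toy.columns
).sort_values(ascending=False)
print(importances)
```

For the notebook's pipeline the equivalent would be `my_pipeline.named_steps['randomforestclassifier'].feature_importances_` indexed by `X_train.columns`, and `importances.plot.barh()` is one way to plot the result.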

### Submission Code 

In [0]:
pred_y_test = my_pipeline.predict(X_test)

In [321]:
pred_y_test

array(['functional', 'functional', 'functional', ..., 'functional',
       'functional', 'non functional'], dtype=object)

In [0]:
randomforest_submission = test.copy()

In [0]:
randomforest_submission['status_group'] = pred_y_test

In [0]:
randomforest_submission = randomforest_submission.filter(['id','status_group'])

In [325]:
randomforest_submission

Unnamed: 0,id,status_group
0,50785,functional
1,51630,functional
2,17168,functional
3,45559,non functional
4,49871,functional
5,52449,functional
6,24806,functional
7,28965,non functional
8,36301,non functional
9,54122,functional


In [0]:
randomforest_submission.to_csv('randomforest_submission.csv',index=False)
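
Before uploading, it may be worth comparing the submission against `sample_submission`. This sketch assumes the sample has the columns `id` and `status_group` (not shown in this notebook), and `check_submission` is a hypothetical helper rather than anything Kaggle provides.

```python
import pandas as pd

def check_submission(submission: pd.DataFrame, sample: pd.DataFrame) -> list:
    """Return a list of problems found; an empty list means the basic checks pass."""
    problems = []
    if list(submission.columns) != list(sample.columns):
        problems.append("columns differ from the sample")
    if len(submission) != len(sample):
        problems.append("row count differs from the sample")
    if "id" in submission.columns and submission["id"].duplicated().any():
        problems.append("duplicate ids")
    return problems

# Toy frames standing in for sample_submission and randomforest_submission
sample = pd.DataFrame({"id": [50785, 51630, 17168],
                       "status_group": ["functional"] * 3})
good = pd.DataFrame({"id": [50785, 51630, 17168],
                     "status_group": ["functional"] * 3})
bad = good.drop(columns="status_group").iloc[:2]

good_problems = check_submission(good, sample)
bad_problems = check_submission(bad, sample)
print(good_problems, bad_problems)
```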

## Additional Tests for best hyperparameters

In [0]:
max_depth_scores = []
def pipeline_differing_max_depth(n):
  # Fit one pipeline per max_depth in range(13, n) and record its validation accuracy
  for num in range(13,n):
    my_pipeline = make_pipeline(
        ce.OrdinalEncoder(),
        SimpleImputer(strategy='median'),
        RandomForestClassifier(n_estimators=100, random_state=333, n_jobs=-1,
                               max_depth=num)
    )
    my_pipeline.fit(X_train,y_train)
    max_depth_scores.append({num:my_pipeline.score(X_val,y_val)})



In [295]:
%%time
pipeline_differing_max_depth(23)

CPU times: user 2min 26s, sys: 1.08 s, total: 2min 27s
Wall time: 1min 20s


In [296]:
# Looks like max_depth=20 gives the best validation accuracy
max_depth_scores

[{13: 0.7902356902356902},
 {14: 0.7942760942760942},
 {15: 0.7985858585858586},
 {16: 0.8016161616161617},
 {17: 0.8055218855218855},
 {18: 0.8047811447811448},
 {19: 0.8049158249158249},
 {20: 0.8057912457912458},
 {21: 0.8049158249158249},
 {22: 0.8024915824915825}]
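
Rather than reading the best depth off the list by eye, the single-key dicts can be merged and the maximum looked up directly. The sketch copies the scores printed above; `scores_by_run`, `merged`, and `best_depth` are illustrative names.

```python
# Validation scores copied from the output above, one {max_depth: accuracy} dict per run
scores_by_run = [
    {13: 0.7902356902356902}, {14: 0.7942760942760942},
    {15: 0.7985858585858586}, {16: 0.8016161616161617},
    {17: 0.8055218855218855}, {18: 0.8047811447811448},
    {19: 0.8049158249158249}, {20: 0.8057912457912458},
    {21: 0.8049158249158249}, {22: 0.8024915824915825},
]

# Merge into one {max_depth: accuracy} mapping
merged = {depth: score for run in scores_by_run for depth, score in run.items()}

# Depth with the highest validation accuracy
best_depth = max(merged, key=merged.get)
print(best_depth, merged[best_depth])
```

Storing the scores in a single dict from the start would make this lookup a one-liner.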

In [0]:
# The same code, but for strategy='most_frequent'
max_depth_scores = []
def pipeline_differing_max_depth(n):
  # Fit one pipeline per max_depth in range(13, n) and record its validation accuracy
  for num in range(13,n):
    my_pipeline = make_pipeline(
        ce.OrdinalEncoder(),
        SimpleImputer(strategy='most_frequent'),
        RandomForestClassifier(n_estimators=100, random_state=333, n_jobs=-1,
                               max_depth=num)
    )
    my_pipeline.fit(X_train,y_train)
    max_depth_scores.append({num:my_pipeline.score(X_val,y_val)})



In [302]:
%%time
pipeline_differing_max_depth(23)

CPU times: user 2min 35s, sys: 1.06 s, total: 2min 36s
Wall time: 1min 29s


In [298]:
max_depth_scores

[{13: 0.7902356902356902},
 {14: 0.7942760942760942},
 {15: 0.7985858585858586},
 {16: 0.8016161616161617},
 {17: 0.8055218855218855},
 {18: 0.8047811447811448},
 {19: 0.8049158249158249},
 {20: 0.8057912457912458},
 {21: 0.8049158249158249},
 {22: 0.8024915824915825}]
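
Since the two search cells differ only in the imputer strategy, one option is a single function that takes the strategy as a parameter and returns its scores, so each run keeps its own results. This is a sketch under assumptions: `score_by_depth` is a hypothetical name, the data is synthetic, and the ordinal encoder is omitted because the toy features are already numeric.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def score_by_depth(X_tr, y_tr, X_va, y_va, depths, strategy="median"):
    """Fit one pipeline per max_depth and return {max_depth: validation accuracy}."""
    scores = {}
    for depth in depths:
        pipeline = make_pipeline(
            SimpleImputer(strategy=strategy),
            RandomForestClassifier(n_estimators=20, random_state=0,
                                   max_depth=depth),
        )
        pipeline.fit(X_tr, y_tr)
        scores[depth] = pipeline.score(X_va, y_va)
    return scores

# Synthetic stand-in for the waterpumps data
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

median_scores = score_by_depth(X_tr, y_tr, X_va, y_va, [2, 4], strategy="median")
frequent_scores = score_by_depth(X_tr, y_tr, X_va, y_va, [2, 4],
                                 strategy="most_frequent")
print(median_scores, frequent_scores)
```

Returning a fresh dict per call avoids sharing one global list between runs, which makes it easier to tell which results belong to which strategy.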

In [299]:
train.shape

(59400, 41)

In [300]:
test.shape

(14358, 41)