
Lambda School Data Science

*Unit 2, Sprint 2, Module 4*

---

# Classification Metrics

## Assignment
- [ ] If you haven't yet, [review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2), then submit your dataset.
- [ ] Plot a confusion matrix for your Tanzania Waterpumps model.
- [ ] Continue to participate in our Kaggle challenge. Every student should have made at least one submission that scores at least 70% accuracy (well above the majority class baseline).
- [ ] Submit your final predictions to our Kaggle competition. Optionally, go to **My Submissions**, where _"you may select up to 1 submission to be used to count towards your final leaderboard score."_
- [ ] Commit your notebook to your fork of the GitHub repo.
- [ ] Read [Maximizing Scarce Maintenance Resources with Data: Applying predictive modeling, precision at k, and clustering to optimize impact](https://towardsdatascience.com/maximizing-scarce-maintenance-resources-with-data-8f3491133050), by Lambda DS3 student Michael Brady. His blog post extends the Tanzania Waterpumps scenario far beyond what's in the lecture notebook.


## Stretch Goals

### Reading

- [Attacking discrimination with smarter machine learning](https://research.google.com/bigpicture/attacking-discrimination-in-ml/), by Google Research, with interactive visualizations. _"A threshold classifier essentially makes a yes/no decision, putting things in one category or another. We look at how these classifiers work, ways they can potentially be unfair, and how you might turn an unfair classifier into a fairer one. As an illustrative example, we focus on loan granting scenarios where a bank may grant or deny a loan based on a single, automatically computed number such as a credit score."_
- [Notebook about how to calculate expected value from a confusion matrix by treating it as a cost-benefit matrix](https://github.com/podopie/DAT18NYC/blob/master/classes/13-expected_value_cost_benefit_analysis.ipynb)
- [Visualizing Machine Learning Thresholds to Make Better Business Decisions](https://blog.insightdatascience.com/visualizing-machine-learning-thresholds-to-make-better-business-decisions-4ab07f823415)


### Doing
- [ ] Share visualizations in our Slack channel!
- [ ] RandomizedSearchCV / GridSearchCV, for model selection. (See module 3 assignment notebook)
- [ ] Stacking Ensemble. (See module 3 assignment notebook)
- [ ] More Categorical Encoding. (See module 2 assignment notebook)

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [0]:
import pandas as pd

# Merge train_features.csv & train_labels.csv
train = pd.merge(pd.read_csv(DATA_PATH+'waterpumps/train_features.csv'), 
                 pd.read_csv(DATA_PATH+'waterpumps/train_labels.csv'))

# Read test_features.csv & sample_submission.csv
test = pd.read_csv(DATA_PATH+'waterpumps/test_features.csv')
sample_submission = pd.read_csv(DATA_PATH+'waterpumps/sample_submission.csv')

### Split Validation dataset from Train dataset

In [0]:
# Make Val the same size as Test dataset
from sklearn.model_selection import train_test_split

target = 'status_group'

train, val = train_test_split(train, test_size = len(test),
                              stratify = train[target], random_state = 42)
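
The assignment compares accuracy against the majority class baseline, and the split above relies on `stratify` to keep class proportions similar in train and val. Here is a minimal sketch of both ideas. It uses a hypothetical toy label column (`toy`, with a made-up class mix) rather than the real waterpumps data, so the numbers it prints are illustrative only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy labels standing in for train['status_group'];
# the class mix here is made up for illustration only.
toy = pd.DataFrame({
    'status_group': ['functional'] * 54
                    + ['non functional'] * 38
                    + ['functional needs repair'] * 8
})

toy_train, toy_val = train_test_split(
    toy, test_size=25, stratify=toy['status_group'], random_state=42)

# Class proportions in each split; stratify keeps them close to each other
train_mix = toy_train['status_group'].value_counts(normalize=True)
val_mix = toy_val['status_group'].value_counts(normalize=True)
print(pd.DataFrame({'train': train_mix, 'val': val_mix}))

# Majority class baseline: accuracy from always predicting the most common class
majority_class = train_mix.idxmax()
baseline_accuracy = (toy_val['status_group'] == majority_class).mean()
print(f'Baseline accuracy on val: {baseline_accuracy:.2f}')
```

On the real data, the same two steps would use `train[target]` and `val[target]` in place of the toy column.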

### Explore datasets

In [19]:
train.shape, val.shape, test.shape

((45042, 41), (14358, 41), (14358, 40))

In [20]:
train.head()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group,status_group
45793,41413,0.0,2011-03-30,,0,,33.075583,-9.385449,Kwa Mwazembe,0,Lake Nyasa,Itaba,Mbeya,12,5,Ileje,Chitete,0,True,GeoData Consultants Ltd,VWC,,False,0,nira/tanira,nira/tanira,handpump,vwc,user-group,unknown,unknown,unknown,unknown,dry,dry,shallow well,shallow well,groundwater,hand pump,hand pump,non functional
26326,48397,500.0,2011-02-28,Dhv,285,DWE,36.228574,-8.207742,Kwamwampwaga,0,Rufiji,Igima,Morogoro,5,3,Kilombero,Mbingu,1000,True,GeoData Consultants Ltd,,,True,1984,swn 80,swn 80,handpump,vwc,user-group,pay monthly,monthly,soft,good,insufficient,insufficient,shallow well,shallow well,groundwater,hand pump,hand pump,non functional
53251,6910,0.0,2013-01-27,Finw,218,FinW,39.673635,-10.835281,Pachani,0,Ruvuma / Southern Coast,Mnyekehe,Mtwara,9,4,Tandahimba,Naputa,260,True,GeoData Consultants Ltd,Water Board,Borehole,True,1982,submersible,submersible,submersible,vwc,user-group,never pay,never pay,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe,non functional
26791,12526,500.0,2011-02-27,Adb,1704,DWE,34.915589,-9.016965,Kwa Mwangayange Mfumbilwa,0,Rufiji,Ndanula,Iringa,11,4,Njombe,Igongolo,40,True,GeoData Consultants Ltd,VWC,Ibiki gravity water scheme,False,2008,gravity,gravity,gravity,vwc,user-group,pay monthly,monthly,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe,functional
2162,17511,200.0,2013-01-18,Dwe,1232,DWE,30.332034,-4.308921,Kwa Bungwa,0,Lake Tanganyika,Nyakerera,Kigoma,16,2,Kasulu,Kitagata,500,True,GeoData Consultants Ltd,Water authority,Nyachenda,True,2003,gravity,gravity,gravity,vwc,user-group,pay monthly,monthly,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe,functional


### Pre-process/Wrangle data

In [21]:
import numpy as np

def wrangle(X):
    """Wrangle train, validate, and test sets in the same way."""
    X = X.copy()

    # About 3% of the time, latitude has small values near zero,
    # outside Tanzania, so we'll treat these like null values
    X['latitude'] = X['latitude'].replace(-2e-08, np.nan)

    # When columns have zeros and shouldn't, they are like null values
    cols_with_zeros = ['construction_year', 'longitude', 'latitude', 'gps_height', 'population']
    for col in cols_with_zeros:
        X[col] = X[col].replace(0, np.nan)

    # Convert 'date_recorded' to datetime
    X['date_recorded'] = pd.to_datetime(X['date_recorded'])

    # Extract components from 'date_recorded'
    X['year_recorded'] = X['date_recorded'].dt.year
    X['month_recorded'] = X['date_recorded'].dt.month
    X['day_recorded'] = X['date_recorded'].dt.day

    # Drop the original 'date_recorded' column
    X = X.drop(columns = 'date_recorded')

    # Engineer feature: waterpump age = 'year_recorded' - 'construction_year'
    # (done after zeros become NaN, so an unknown construction year gives a NaN age)
    X['waterpump_age'] = X['year_recorded'] - X['construction_year']

    # Drop 'recorded_by' (never varies) & 'id' (always varies)
    unusable_variance = ['recorded_by', 'id']
    X = X.drop(columns = unusable_variance)

    # Drop 'quantity_group', which duplicates 'quantity'
    duplicate_columns = ['quantity_group']
    X = X.drop(columns = duplicate_columns)

    return X

train = wrangle(train)
val = wrangle(val)
test = wrangle(test)

print(train.shape, val.shape, test.shape)

(45042, 41) (14358, 41) (14358, 40)
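
One thing worth checking in a wrangle function like this is how zero placeholders interact with derived features such as `waterpump_age`. The sketch below uses a hypothetical two-row frame (`toy`, not the real data) to compare the age computed from the raw `construction_year` column with the age computed after zeros are treated as missing.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame; the second pump has no known construction year,
# recorded as 0 the way the raw data does.
toy = pd.DataFrame({
    'year_recorded': [2011, 2013],
    'construction_year': [1984, 0],
})

# Age computed from the raw column: the placeholder 0 yields an age of 2013
age_raw = toy['year_recorded'] - toy['construction_year']

# Age computed after treating 0 as missing: the unknown pump gets NaN instead
age_clean = toy['year_recorded'] - toy['construction_year'].replace(0, np.nan)

print(pd.DataFrame({'age_raw': age_raw, 'age_clean': age_clean}))
```

A NaN age can be imputed later in a pipeline, whereas an age of roughly 2,000 years would look like a real (and extreme) value to the model.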


### Plot a confusion matrix for your Tanzania Waterpumps model