<a href="https://colab.research.google.com/github/Pdugovich/DS-Unit-2-Kaggle-Challenge/blob/master/module3/assignment_kaggle_challenge_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science, Unit 2: Predictive Modeling

# Kaggle Challenge, Module 3


## Assignment
- [ ] [Review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2), then submit your dataset.
- [ ] Continue to participate in our Kaggle challenge. 
- [ ] Use scikit-learn for hyperparameter optimization with RandomizedSearchCV.
- [ ] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [ ] Commit your notebook to your fork of the GitHub repo.

## Stretch Goals

### Reading
- Jake VanderPlas, [Python Data Science Handbook, Chapter 5.3](https://jakevdp.github.io/PythonDataScienceHandbook/05.03-hyperparameters-and-model-validation.html), Hyperparameters and Model Validation
- Jake VanderPlas, [Statistics for Hackers](https://speakerdeck.com/jakevdp/statistics-for-hackers?slide=107)
- Ron Zacharski, [A Programmer's Guide to Data Mining, Chapter 5](http://guidetodatamining.com/chapter5/), 10-fold cross validation
- Sebastian Raschka, [A Basic Pipeline and Grid Search Setup](https://github.com/rasbt/python-machine-learning-book/blob/master/code/bonus/svm_iris_pipeline_and_gridsearch.ipynb)
- Peter Worcester, [A Comparison of Grid Search and Randomized Search Using Scikit Learn](https://blog.usejournal.com/a-comparison-of-grid-search-and-randomized-search-using-scikit-learn-29823179bc85)

### Doing
- In addition to `RandomizedSearchCV`, scikit-learn has [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html). Another library called scikit-optimize has [`BayesSearchCV`](https://scikit-optimize.github.io/notebooks/sklearn-gridsearchcv-replacement.html). Experiment with these alternatives.
- _[Introduction to Machine Learning with Python](http://shop.oreilly.com/product/0636920030515.do)_ discusses options for "Grid-Searching Which Model To Use" in Chapter 6:

> You can even go further in combining GridSearchCV and Pipeline: it is also possible to search over the actual steps being performed in the pipeline (say whether to use StandardScaler or MinMaxScaler). This leads to an even bigger search space and should be considered carefully. Trying all possible solutions is usually not a viable machine learning strategy. However, here is an example comparing a RandomForestClassifier and an SVC ...

The example is shown in [the accompanying notebook](https://github.com/amueller/introduction_to_ml_with_python/blob/master/06-algorithm-chains-and-pipelines.ipynb), code cells 35-37. Could you apply this concept to your own pipelines?
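
As a rough illustration of the idea in the quote above, the sketch below searches over both the scaler and the classifier of a pipeline with `GridSearchCV`. It runs on synthetic data from `make_classification`, and the step names (`scaler`, `classifier`), the parameter values, and the variable names are assumptions chosen for the example, not taken from the book's notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; not the water pump dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Named steps so the grid can swap whole estimators in and out
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', SVC()),
])

# A list of dicts: each dict is its own sub-grid, so SVC-only parameters
# are never combined with the random forest and vice versa
param_grid = [
    {
        'scaler': [StandardScaler(), MinMaxScaler()],
        'classifier': [SVC()],
        'classifier__C': [0.1, 1, 10],
    },
    {
        'scaler': ['passthrough'],  # skip scaling for the tree-based model
        'classifier': [RandomForestClassifier(n_estimators=50, random_state=0)],
        'classifier__max_depth': [3, 5, None],
    },
]

grid = GridSearchCV(pipe, param_grid, cv=3, scoring='accuracy')
grid.fit(X_demo, y_demo)

print(grid.best_params_)
print(round(grid.best_score_, 3))
```

The grid has 2 x 3 SVC candidates plus 3 random forest candidates, so the search space grows quickly as more steps are added, which is the caution the quote raises.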

### Copied from Assignment 2

In [0]:
import sys

DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/master/data/'

In [0]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

train = pd.merge(pd.read_csv(DATA_PATH+'waterpumps/train_features.csv'), 
                 pd.read_csv(DATA_PATH+'waterpumps/train_labels.csv'))
test = pd.read_csv(DATA_PATH+'waterpumps/test_features.csv')
sample_submission = pd.read_csv(DATA_PATH+'waterpumps/sample_submission.csv')

train.shape, test.shape

((59400, 41), (14358, 40))

In [0]:
# I checked the average elevation of Tanzania, and it's about 1200, so I
# can treat the zeros here as missing values. The lowest elevation is 0,
# though, so I'm a little confused about the negative numbers
train['gps_height'].value_counts()

 0       20438
-15         60
-16         55
-13         55
-20         52
         ...  
 2285        1
 2424        1
 2552        1
 2413        1
 2385        1
Name: gps_height, Length: 2428, dtype: int64

In [0]:
# Numeric Columns to clean
numeric_to_clean = ['longitude','latitude','construction_year', 'gps_height']

In [0]:
# Checking for duplicate columns
duplicates1 = ['extraction_type','extraction_type_group','extraction_type_class']
duplicates2 = ['payment','payment_type']
duplicates3 = ['quantity_group','quantity']
duplicates4 = ['source','source_type']
duplicates5 = ['waterpoint_type','waterpoint_type_group']
train.head()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,...,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group,status_group
0,69572,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,none,0,...,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe,functional
1,8776,0.0,2013-03-06,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,0,...,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,functional
2,34310,25.0,2013-02-25,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,0,...,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe,functional
3,67743,0.0,2013-01-28,Unicef,263,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,0,...,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe,non functional
4,19728,0.0,2011-07-13,Action In A,0,Artisan,31.130847,-1.825359,Shuleni,0,...,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,functional


In [0]:
#Checking the duplicates to decide which to keep
duplicate_lists = [duplicates1, duplicates2, duplicates3, duplicates4,
                   duplicates5]

for duplicate in duplicate_lists:
  print(train[duplicate].describe())
  print("")

       extraction_type extraction_type_group extraction_type_class
count            59400                 59400                 59400
unique              18                    13                     7
top            gravity               gravity               gravity
freq             26780                 26780                 26780

          payment payment_type
count       59400        59400
unique          7            7
top     never pay    never pay
freq        25348        25348

       quantity_group quantity
count           59400    59400
unique              5        5
top            enough   enough
freq            33186    33186

        source source_type
count    59400       59400
unique      10           7
top     spring      spring
freq     17021       17021

           waterpoint_type waterpoint_type_group
count                59400                 59400
unique                   7                     6
top     communal standpipe    communal standpipe
freq                

In [0]:
#my_train['region'].value_counts().index

In [0]:
# .copy() gives an independent DataFrame, which avoids SettingWithCopyWarning
Mwanza = train[train['region'] == 'Mwanza'].copy()
Mwanza['longitude'] = Mwanza['longitude'].replace(0, np.nan)
Mwanza['latitude'] = Mwanza['latitude'].replace(-2e-08, np.nan)


In [0]:
np.mean(Mwanza['latitude'])

-2.6205017775686277

In [0]:
np.mean(Mwanza['longitude'])

33.09156419778649

In [0]:
# .copy() gives an independent DataFrame, which avoids SettingWithCopyWarning
Shinyanga = train[train['region'] == 'Shinyanga'].copy()
Shinyanga['longitude'] = Shinyanga['longitude'].replace(0, np.nan)
Shinyanga['latitude'] = Shinyanga['latitude'].replace(-2e-08, np.nan)


In [0]:
np.mean(Shinyanga['longitude'])

33.24012071028917

In [0]:
np.mean(Shinyanga['latitude'])

-3.495696017133518
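
The per-region means computed above could be used to fill in the placeholder coordinates. A minimal sketch of that idea, using `groupby(...).transform('mean')` on a toy frame, is shown here. The rows and values are hypothetical; only the column names and the two placeholder values (`0` and `-2e-08`) come from the cells above.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with the same column names as the dataset
toy = pd.DataFrame({
    'region': ['Mwanza', 'Mwanza', 'Mwanza', 'Shinyanga', 'Shinyanga'],
    'longitude': [33.0, 0.0, 33.2, 33.3, 0.0],
    'latitude': [-2.5, -2e-08, -2.7, -3.5, -2e-08],
})

# Mark the placeholder coordinates as missing
toy['longitude'] = toy['longitude'].replace(0, np.nan)
toy['latitude'] = toy['latitude'].replace(-2e-08, np.nan)

# Fill each missing value with the mean of its own region;
# transform('mean') skips NaN and returns a Series aligned to the rows
for column in ['longitude', 'latitude']:
    region_mean = toy.groupby('region')[column].transform('mean')
    toy[column] = toy[column].fillna(region_mean)

print(toy)
```

On the real data, the region means would probably be best computed from the training split only and then reused for the validation and test sets, so that no information leaks across splits.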

In [0]:
# Looking at the summaries above, I'll remove the duplicate columns
# and the nearly duplicate columns that have fewer unique values

duplicates_to_drop = ['extraction_type_group','extraction_type_class',
                    'payment_type','quantity_group', 'source_type',
                    'waterpoint_type_group']
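
The `describe()` summaries only compare counts, so they cannot confirm that one column is a strict grouping of another. One possible check is sketched below on a toy frame. The helper `is_coarser_version` and the example rows are hypothetical; only the column names come from the lists above.

```python
import pandas as pd

# Toy frame (hypothetical rows) using two of the column pairs from above
toy = pd.DataFrame({
    'extraction_type': ['gravity', 'swn 80', 'nira/tanira', 'swn 80', 'gravity'],
    'extraction_type_class': ['gravity', 'handpump', 'handpump', 'handpump', 'gravity'],
    'quantity': ['enough', 'dry', 'enough', 'seasonal', 'dry'],
    'quantity_group': ['enough', 'dry', 'enough', 'seasonal', 'dry'],
})

def is_coarser_version(df, fine, coarse):
    """True if every value of `fine` maps to exactly one value of `coarse`."""
    return bool((df.groupby(fine)[coarse].nunique() == 1).all())

# The class column is fully determined by the detailed column...
print(is_coarser_version(toy, 'extraction_type', 'extraction_type_class'))
# ...but not the other way round, so the detailed column carries more information
print(is_coarser_version(toy, 'extraction_type_class', 'extraction_type'))
# Exact duplicates can be checked element by element
print(bool((toy['quantity'] == toy['quantity_group']).all()))
```

If the first check holds on the real data and the second does not, dropping the coarser column loses nothing that the detailed column does not already encode.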

In [0]:
from sklearn.model_selection import train_test_split

my_train, my_val = train_test_split(train, random_state=333)

In [0]:
my_train[my_train['longitude']==0]['region'].value_counts()

Shinyanga    763
Mwanza       623
Name: region, dtype: int64

In [0]:
my_train[my_train['region']== 'Mwanza']

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,...,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group,status_group
3353,41817,0.0,2011-07-21,,1183,,33.111956,-2.125932e+00,Office,0,...,unknown,unknown,unknown,unknown,unknown,other,unknown,hand pump,hand pump,non functional
16640,64071,0.0,2011-08-08,Hesawa,1149,DWE,33.020814,-1.870797e+00,Church,0,...,soft,good,insufficient,insufficient,shallow well,shallow well,groundwater,hand pump,hand pump,functional
26365,24605,0.0,2011-07-18,Hesawa,0,DWE,33.251206,-2.536370e+00,Dyuya,0,...,soft,good,insufficient,insufficient,spring,spring,groundwater,other,other,non functional
51183,2206,0.0,2011-07-28,Government Of Tanzania,0,Government,0.000000,-2.000000e-08,Mahakama,0,...,unknown,unknown,dry,dry,dam,dam,surface,communal standpipe multiple,communal standpipe,non functional
52506,8245,0.0,2011-08-08,Hesawa,0,DWE,33.515992,-2.651120e+00,Mwabashola,0,...,soft,good,enough,enough,shallow well,shallow well,groundwater,other,other,non functional
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18963,41164,0.0,2011-07-25,Bkhws,0,BKHWS,32.919408,-2.711763e+00,Loundy,0,...,soft,good,enough,enough,lake,river/lake,surface,communal standpipe,communal standpipe,non functional
49091,60405,0.0,2011-07-28,Hesawa,0,HESAWA,32.980052,-2.658572e+00,Kwa Kilo,0,...,soft,good,enough,enough,shallow well,shallow well,groundwater,other,other,functional
44103,28939,0.0,2011-07-30,Hesawa,0,DWE,33.048568,-2.460987e+00,Kwa Dundo,0,...,soft,good,enough,enough,shallow well,shallow well,groundwater,other,other,non functional
41317,47225,0.0,2011-07-30,Unicef,0,Maswi,33.069096,-2.431699e+00,Kwa Matayo,0,...,soft,good,enough,enough,machine dbh,borehole,groundwater,hand pump,hand pump,functional


In [0]:
import numpy as np
def wrangle(X):
  
  # Work on a copy to prevent SettingWithCopyWarning
  X = X.copy()

  # Latitude is strange in that it doesn't have any 0s, but it does have
  # these near-0 placeholder values, so convert them to 0 first
  X['latitude'] = X['latitude'].replace(-2e-08, 0)

  # These numeric columns have 0s that should be NaN
  nans_as_zeros = ['latitude','longitude', 'construction_year',
                   'gps_height', 'population']
  for column in nans_as_zeros:
    X[column] = X[column].replace(0, np.nan)
    # I like this code Ryan had: add a new column flagging the missing data
    X[column+'_MISSING'] = X[column].isnull()
  
  # X['longitude'] = X.apply(
  #   lambda row: np.mean(Mwanza['longitude']) if np.isnan(row['longitude']) and row['region'] == 'Mwanza' else row['longitude'],
  #   axis=1)
  # X['longitude'] = X.apply(
  #   lambda row: np.mean(Shinyanga['longitude']) if np.isnan(row['longitude']) and row['region'] == 'Shinyanga' else row['longitude'],
  #   axis=1)


  # X['latitude'] = X.apply(
  #   lambda row: np.mean(Mwanza['latitude']) if np.isnan(row['latitude']) and row['region'] == 'Mwanza' else row['latitude'],
  #   axis=1)
  # X['latitude'] = X.apply(
  #   lambda row: np.mean(Shinyanga['latitude']) if np.isnan(row['latitude']) and row['region'] == 'Shinyanga' else row['latitude'],
  #   axis=1)
  
  #date_recorded is loaded as a string. Converting to datetime, extracting the year
  X['date_recorded'] = pd.to_datetime(X['date_recorded'])
  X['year_recorded'] = X['date_recorded'].dt.year
#   X['month_recorded'] = X['date_recorded'].dt.month
#   X['day_recorded'] = X['date_recorded'].dt.day
  X = X.drop(columns='date_recorded')

  #Removing duplicate or near-duplicate columns
  #X = X.drop(columns=duplicates_to_drop)

  #Can be used for each train and validation
  return X
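
The hand-made `_MISSING` columns in `wrangle` have a built-in counterpart: `SimpleImputer` accepts `add_indicator=True`, which appends indicator columns while imputing. The sketch below shows the behavior on a tiny hypothetical array. It is an alternative to consider, not what this notebook does.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy numeric matrix (hypothetical values) with NaN marking missing entries
X_toy = np.array([
    [1.0, np.nan],
    [3.0, 4.0],
    [np.nan, 6.0],
])

# add_indicator=True appends one 0/1 column per feature that had
# missing values during fit, placed after the imputed features
imputer = SimpleImputer(strategy='median', add_indicator=True)
X_imputed = imputer.fit_transform(X_toy)

print(X_imputed)
```

One difference to keep in mind: the indicator columns are only created for features that contained missing values when the imputer was fitted, whereas `wrangle` always creates a flag for each listed column.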

In [0]:
%%time
my_train = wrangle(my_train)
my_val = wrangle(my_val)
test = wrangle(test)

Wall time: 198 ms


In [0]:
my_train['longitude'].value_counts()

39.103752    2
39.096499    2
37.530515    2
37.339811    2
37.542785    2
            ..
31.701989    1
37.131667    1
34.607110    1
33.076362    1
35.005922    1
Name: longitude, Length: 43123, dtype: int64

In [0]:
my_train[my_train['id']== 6091]

Unnamed: 0,id,amount_tsh,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,...,source_class,waterpoint_type,waterpoint_type_group,status_group,latitude_MISSING,longitude_MISSING,construction_year_MISSING,gps_height_MISSING,population_MISSING,year_recorded


In [0]:
# # Copied from previous assignment.
# # Unnecessary because high cardinality features are fine

# # # Selecting target

# target = 'status_group'

# #Removing the target and useless id columns
# train_columns = my_train.drop(columns=[target,'id'])

# # separating numeric columns to re-add after
# numeric_columns = train_columns.select_dtypes(include='number').columns.tolist()

# #Getting the cardinality of each categorical feature to exclude the large ones
# cardinality = train_columns.select_dtypes(exclude='number').nunique()

# #Excluding features with a cardinality over 50
# categorical_columns = cardinality[cardinality <50].index.tolist()

# #combining lists to get the features I will use for my model
# features = numeric_columns + categorical_columns

In [0]:
# We can use high cardinality features, so no need to remove them
target = 'status_group'

features = my_train.drop(columns=[target,'id']).columns

In [0]:
#Assigning variables

X_train = my_train[features]
y_train = my_train[target]

X_val = my_val[features]
y_val = my_val[target]

X_test = test[features]

In [0]:
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

# Making a pipeline to encode, impute, then classify the data using a random forest
my_pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='median'),
    RandomForestClassifier(n_estimators=100, random_state=333, n_jobs=-1,
                           max_depth=20)
)


In [0]:
my_pipeline.fit(X_train,y_train)

Pipeline(memory=None,
         steps=[('ordinalencoder',
                 OrdinalEncoder(cols=['funder', 'installer', 'wpt_name',
                                      'basin', 'subvillage', 'region', 'lga',
                                      'ward', 'public_meeting', 'recorded_by',
                                      'scheme_management', 'scheme_name',
                                      'permit', 'extraction_type',
                                      'extraction_type_group',
                                      'extraction_type_class', 'management',
                                      'management_group', 'payment',
                                      'payment_type', 'water_qual...
                ('randomforestclassifier',
                 RandomForestClassifier(bootstrap=True, class_weight=None,
                                        criterion='gini', max_depth=20,
                                        max_features='auto',
                                        max

In [0]:
my_pipeline.score(X_val,y_val)

0.805993265993266

In [0]:
max_depth_scores = []
def pipeline_differing_max_depth(n):
  # Fit one pipeline per max_depth in range(13, n) and record its
  # validation accuracy in max_depth_scores
  for num in range(13,n):
    my_pipeline = make_pipeline(
        ce.OrdinalEncoder(),
        SimpleImputer(strategy='median'),
        RandomForestClassifier(n_estimators=100, random_state=333, n_jobs=-1,
                               max_depth=num)
    )
    my_pipeline.fit(X_train,y_train)
    max_depth_scores.append({num:my_pipeline.score(X_val,y_val)})


In [0]:
%%time
pipeline_differing_max_depth(23)

Wall time: 1min 4s


In [0]:
#Looks like 20 is the best
max_depth_scores

[{13: 0.78996632996633},
 {14: 0.7950841750841751},
 {15: 0.7968350168350168},
 {16: 0.8021548821548822},
 {17: 0.8037710437710438},
 {18: 0.8036363636363636},
 {19: 0.8048484848484848},
 {20: 0.805993265993266},
 {21: 0.8045117845117845},
 {22: 0.8044444444444444}]
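
The loop above picks `max_depth` from a single validation split, so the winning score may be slightly optimistic. A cross-validated version of the same sweep can be sketched with scikit-learn's `validation_curve`. The example uses synthetic data and a bare `RandomForestClassifier`; the data, the depth values, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

# Synthetic stand-in data (an assumption; not the water pump dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)

depths = [2, 4, 8, 16]
train_scores, cv_scores = validation_curve(
    RandomForestClassifier(n_estimators=30, random_state=0),
    X_demo, y_demo,
    param_name='max_depth',
    param_range=depths,
    cv=3,
    scoring='accuracy',
)

# Each row holds the per-fold scores for one max_depth value
for depth, fold_scores in zip(depths, cv_scores):
    print(depth, round(fold_scores.mean(), 3))

best_depth = depths[int(np.argmax(cv_scores.mean(axis=1)))]
print('best max_depth:', best_depth)
```

With a pipeline built by `make_pipeline`, the parameter name would need the step prefix, for example `randomforestclassifier__max_depth`.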

### Assignment 3 code

In [0]:
from scipy.stats import randint, uniform
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV


In [0]:
%%time

my_pipeline = make_pipeline(
    ce.OrdinalEncoder(), 
    SimpleImputer(strategy='median'), 
    RandomForestClassifier(random_state=333, max_depth=20)
)

param_distributions = {
    'randomforestclassifier__n_estimators': randint(100, 200),
}

search = RandomizedSearchCV(
    my_pipeline,
    param_distributions=param_distributions,
    n_iter=10,
    cv=3,
    scoring='accuracy',
    verbose=10,
    return_train_score=False,
    n_jobs=-1
)

Wall time: 1e+03 µs


In [0]:
search.fit(X_train,y_train)

Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:   34.8s
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:   59.2s
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done  27 out of  30 | elapsed:  2.1min remaining:   14.1s
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:  2.3min finished


RandomizedSearchCV(cv=3, error_score='raise-deprecating',
                   estimator=Pipeline(memory=None,
                                      steps=[('ordinalencoder',
                                              OrdinalEncoder(cols=None,
                                                             drop_invariant=False,
                                                             handle_missing='value',
                                                             handle_unknown='value',
                                                             mapping=None,
                                                             return_df=True,
                                                             verbose=0)),
                                             ('simpleimputer',
                                              SimpleImputer(add_indicator=False,
                                                            copy=True,
                                                            f

In [0]:
search.best_score_

0.8044893378226712

In [0]:
search.best_params_

{'randomforestclassifier__n_estimators': 191}
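
The search above samples only `n_estimators`. A wider search space might look like the sketch below, which also samples the imputation strategy, `max_depth`, and `max_features`, and fixes `random_state` so the sampled candidates are repeatable. It runs on synthetic data, and every range shown is an illustrative assumption rather than a tuned recommendation.

```python
import pandas as pd
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (an assumption; not the water pump dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)

demo_pipeline = make_pipeline(
    SimpleImputer(strategy='median'),
    RandomForestClassifier(random_state=0),
)

# Illustrative ranges only; they are not tuned for the real data
demo_distributions = {
    'simpleimputer__strategy': ['mean', 'median'],
    'randomforestclassifier__n_estimators': randint(20, 60),
    'randomforestclassifier__max_depth': [5, 10, 20, None],
    'randomforestclassifier__max_features': uniform(0.2, 0.8),  # floats in [0.2, 1.0]
}

demo_search = RandomizedSearchCV(
    demo_pipeline,
    param_distributions=demo_distributions,
    n_iter=5,
    cv=3,
    scoring='accuracy',
    random_state=0,  # makes the sampled candidates repeatable
)
demo_search.fit(X_demo, y_demo)

# Rank the sampled candidates by mean cross-validated accuracy
results = pd.DataFrame(demo_search.cv_results_).sort_values('rank_test_score')
print(results[['params', 'mean_test_score']].head())
```

Lists are sampled uniformly, while the `scipy.stats` distributions are sampled through their `rvs` method, so discrete and continuous hyperparameters can be mixed in one dictionary.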

In [0]:
search.best_estimator_

Pipeline(memory=None,
         steps=[('ordinalencoder',
                 OrdinalEncoder(cols=['funder', 'installer', 'wpt_name',
                                      'basin', 'subvillage', 'region', 'lga',
                                      'ward', 'public_meeting', 'recorded_by',
                                      'scheme_management', 'scheme_name',
                                      'permit', 'extraction_type',
                                      'extraction_type_group',
                                      'extraction_type_class', 'management',
                                      'management_group', 'payment',
                                      'payment_type', 'water_qual...
                 RandomForestClassifier(bootstrap=True, class_weight=None,
                                        criterion='gini', max_depth=20,
                                        max_features='auto',
                                        max_leaf_nodes=None,
                         

In [0]:
pipeline = search.best_estimator_

In [0]:
pred_y_test = pipeline.predict(X_test)

In [0]:
cv_submission = test.copy()

In [0]:
cv_submission['status_group'] = pred_y_test

In [0]:
cv_submission = cv_submission.filter(['id','status_group'])

In [0]:
cv_submission

Unnamed: 0,id,status_group
0,50785,non functional
1,51630,functional
2,17168,functional
3,45559,non functional
4,49871,functional
...,...,...
14353,39307,non functional
14354,18990,functional
14355,28749,functional
14356,33492,functional


In [0]:
cv_submission.to_csv('cv_submission.csv', index=False)
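
`sample_submission` was loaded earlier but never used. One possible use is a quick layout check before uploading, sketched below with miniature hypothetical frames. The helper `check_submission` is not part of any library, and the sketch assumes the sample file lists the same ids in the same order as the test features.

```python
import pandas as pd

# Hypothetical miniature versions of sample_submission and cv_submission
sample_demo = pd.DataFrame({
    'id': [50785, 51630, 17168],
    'status_group': ['functional'] * 3,
})
submission_demo = pd.DataFrame({
    'id': [50785, 51630, 17168],
    'status_group': ['non functional', 'functional', 'functional'],
})

def check_submission(submission, sample):
    """Raise AssertionError if the submission layout differs from the sample."""
    assert list(submission.columns) == list(sample.columns), 'column mismatch'
    assert len(submission) == len(sample), 'row count mismatch'
    assert submission['id'].tolist() == sample['id'].tolist(), 'id order mismatch'
    assert submission['status_group'].notna().all(), 'missing predictions'
    return True

print(check_submission(submission_demo, sample_demo))
```

In the notebook this would be called as `check_submission(cv_submission, sample_submission)` just before writing the CSV.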

In [0]:
pipeline.score(X_val,y_val)

0.8062626262626262

In [0]:
my_pipeline.fit(X_train,y_train)

Pipeline(memory=None,
         steps=[('ordinalencoder',
                 OrdinalEncoder(cols=['funder', 'installer', 'wpt_name',
                                      'basin', 'subvillage', 'region', 'lga',
                                      'ward', 'public_meeting', 'recorded_by',
                                      'scheme_management', 'scheme_name',
                                      'permit', 'extraction_type', 'management',
                                      'management_group', 'payment',
                                      'water_quality', 'quality_group',
                                      'quantity', 'source', 'source_class',
                                      'waterp...
                ('randomforestclassifier',
                 RandomForestClassifier(bootstrap=True, class_weight=None,
                                        criterion='gini', max_depth=20,
                                        max_features='auto',
                                        m

In [0]:
my_pipeline.score(X_val,y_val)

0.8049158249158249