<a href="https://colab.research.google.com/github/ThisIsJorgeLima/DS-Unit-1-Sprint-1-Data-Wrangling-and-Storytelling/blob/master/module3/JAL_Dec_11__LS_DS_223_assignment_(1).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 2, Module 3*

---

# Cross-Validation


## Assignment
- [ ] [Review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2), then submit your dataset.
- [ ] Continue to participate in our Kaggle challenge. 
- [ ] Use scikit-learn for hyperparameter optimization with RandomizedSearchCV.
- [ ] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [ ] Commit your notebook to your fork of the GitHub repo.


You won't be able to just copy from the lesson notebook to this assignment.

- Because the lesson was ***regression***, but the assignment is ***classification.***
- Because the lesson used [TargetEncoder](https://contrib.scikit-learn.org/categorical-encoding/targetencoder.html), which doesn't work as-is for _multi-class_ classification.

So you will have to adapt the example, which is good real-world practice.

1. Use a model for classification, such as [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
2. Use hyperparameters that match the classifier, such as `randomforestclassifier__ ...`
3. Use a metric for classification, such as [`scoring='accuracy'`](https://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values)
4. If you’re doing a multi-class classification problem — such as whether a waterpump is functional, functional needs repair, or nonfunctional — then use a categorical encoding that works for multi-class classification, such as [OrdinalEncoder](https://contrib.scikit-learn.org/categorical-encoding/ordinal.html) (not [TargetEncoder](https://contrib.scikit-learn.org/categorical-encoding/targetencoder.html))
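
The four adaptations above can be combined into one randomized search. What follows is only a minimal sketch on synthetic data, not the lesson's code: the columns, labels, and parameter ranges are hypothetical, and scikit-learn's own `OrdinalEncoder` stands in for the category_encoders one so the block needs nothing beyond scikit-learn, pandas, NumPy, and SciPy.

```python
import numpy as np
import pandas as pd
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Synthetic stand-in for the water pump data (hypothetical columns and labels)
rng = np.random.RandomState(42)
n = 300
X_demo = pd.DataFrame({
    'quantity': rng.choice(['dry', 'enough', 'insufficient'], size=n),
    'source': rng.choice(['spring', 'river', 'well'], size=n),
})
is_dry = (X_demo['quantity'] == 'dry').to_numpy()
other_labels = rng.choice(['functional', 'functional needs repair'], size=n)
y_demo = np.where(is_dry, 'non functional', other_labels)

# 1. A classifier, 4. an encoding that works for a multi-class target
pipeline = make_pipeline(
    OrdinalEncoder(),  # stand-in for ce.OrdinalEncoder()
    SimpleImputer(strategy='median'),
    RandomForestClassifier(random_state=42),
)

# 2. Hyperparameter names are prefixed with the lowercased step name
param_distributions = {
    'simpleimputer__strategy': ['mean', 'median'],
    'randomforestclassifier__n_estimators': randint(20, 60),
    'randomforestclassifier__max_depth': [3, 5, None],
}

# 3. A classification metric
search = RandomizedSearchCV(
    pipeline,
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    scoring='accuracy',
    random_state=42,
)
search.fit(X_demo, y_demo)
print('Best hyperparameters:', search.best_params_)
print('Cross-validation accuracy:', search.best_score_)
```

After fitting, `search.best_estimator_` is the pipeline refit on all of the data passed to `fit`, so it can be used directly to predict on a test set.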



## Stretch Goals

### Reading
- Jake VanderPlas, [Python Data Science Handbook, Chapter 5.3](https://jakevdp.github.io/PythonDataScienceHandbook/05.03-hyperparameters-and-model-validation.html), Hyperparameters and Model Validation
- Jake VanderPlas, [Statistics for Hackers](https://speakerdeck.com/jakevdp/statistics-for-hackers?slide=107)
- Ron Zacharski, [A Programmer's Guide to Data Mining, Chapter 5](http://guidetodatamining.com/chapter5/), 10-fold cross validation
- Sebastian Raschka, [A Basic Pipeline and Grid Search Setup](https://github.com/rasbt/python-machine-learning-book/blob/master/code/bonus/svm_iris_pipeline_and_gridsearch.ipynb)
- Peter Worcester, [A Comparison of Grid Search and Randomized Search Using Scikit Learn](https://blog.usejournal.com/a-comparison-of-grid-search-and-randomized-search-using-scikit-learn-29823179bc85)

### Doing
- Add your own stretch goals!
- Try other [categorical encodings](https://contrib.scikit-learn.org/categorical-encoding/). See the previous assignment notebook for details.
- In addition to `RandomizedSearchCV`, scikit-learn has [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html). Another library called scikit-optimize has [`BayesSearchCV`](https://scikit-optimize.github.io/notebooks/sklearn-gridsearchcv-replacement.html). Experiment with these alternatives.
- _[Introduction to Machine Learning with Python](http://shop.oreilly.com/product/0636920030515.do)_ discusses options for "Grid-Searching Which Model To Use" in Chapter 6:

> You can even go further in combining GridSearchCV and Pipeline: it is also possible to search over the actual steps being performed in the pipeline (say whether to use StandardScaler or MinMaxScaler). This leads to an even bigger search space and should be considered carefully. Trying all possible solutions is usually not a viable machine learning strategy. However, here is an example comparing a RandomForestClassifier and an SVC ...

The example is shown in [the accompanying notebook](https://github.com/amueller/introduction_to_ml_with_python/blob/master/06-algorithm-chains-and-pipelines.ipynb), code cells 35-37. Could you apply this concept to your own pipelines?
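
As a rough illustration of the idea in the quote, and not a copy of the book's cells, the sketch below searches over the pipeline steps themselves. It assumes the iris dataset and hypothetical step names (`preprocessing`, `classifier`); each dictionary in `param_grid` swaps in a different model together with the hyperparameters that belong to it.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)

# Placeholder steps; the grid below replaces them during the search
pipe = Pipeline([('preprocessing', StandardScaler()), ('classifier', SVC())])

param_grid = [
    # SVC benefits from scaling, so try two scalers with it
    {'classifier': [SVC()],
     'preprocessing': [StandardScaler(), MinMaxScaler()],
     'classifier__C': [0.1, 1, 10]},
    # Random forests do not need scaling, so skip that step
    {'classifier': [RandomForestClassifier(n_estimators=50, random_state=0)],
     'preprocessing': ['passthrough'],
     'classifier__max_depth': [2, 4]},
]

grid = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
grid.fit(X_iris, y_iris)
print('Best combination:', grid.best_params_)
print('Cross-validation accuracy:', grid.best_score_)
```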


### BONUS: Stacking!

Here's some code you can use to "stack" multiple submissions by majority vote, which is another form of ensembling:

```python
import pandas as pd

# Filenames of your submissions you want to ensemble
files = ['submission-01.csv', 'submission-02.csv', 'submission-03.csv']

target = 'status_group'
submissions = (pd.read_csv(file)[[target]] for file in files)
ensemble = pd.concat(submissions, axis='columns')
majority_vote = ensemble.mode(axis='columns')[0]

sample_submission = pd.read_csv('sample_submission.csv')
submission = sample_submission.copy()
submission[target] = majority_vote
submission.to_csv('my-ultimate-ensemble-submission.csv', index=False)
```
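
One detail worth checking before relying on the vote: with three files and three classes, a row can tie three ways, and `[0]` then simply takes the first mode that pandas returns for that row. The sketch below builds three hypothetical submissions in memory (instead of reading CSV files) so the behavior can be inspected.

```python
import pandas as pd

target = 'status_group'

# Three hypothetical submissions; the last row is a three-way tie
submissions = [
    pd.DataFrame({target: ['functional', 'non functional', 'functional']}),
    pd.DataFrame({target: ['functional', 'non functional', 'non functional']}),
    pd.DataFrame({target: ['non functional', 'non functional',
                           'functional needs repair']}),
]

ensemble = pd.concat([s[[target]] for s in submissions], axis='columns')
modes = ensemble.mode(axis='columns')   # one column per tied mode
majority_vote = modes[0]                # first mode of each row

print(modes)
print(majority_vote.tolist())
```

If ties matter for your score, one option is to list your strongest submission more than once so that it breaks them.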

In [0]:
test = 'https://github.com/ThisIsJorgeLima/DS-Unit-2-Kaggle-Challenge/blob/master/data/waterpumps/test_features.csv?raw=true'

In [0]:
train = 'https://github.com/ThisIsJorgeLima/DS-Unit-2-Kaggle-Challenge/blob/master/data/waterpumps/train_features.csv?raw=true'

In [0]:
label = 'https://github.com/ThisIsJorgeLima/DS-Unit-2-Kaggle-Challenge/blob/master/data/waterpumps/train_labels.csv?raw=true'

In [0]:
import pandas as pd
from sklearn.model_selection import train_test_split

#train = pd.merge(pd.read_csv(DATA_PATH+'waterpumps/train_features.csv'), 
                 #pd.read_csv(DATA_PATH+'waterpumps/train_labels.csv'))
#test = pd.read_csv(DATA_PATH+'waterpumps/test_features.csv')
#sample_submission = pd.read_csv(DATA_PATH+'waterpumps/sample_submission.csv')


train =  pd.merge(pd.read_csv(train), 
                 pd.read_csv(label))
test = pd.read_csv(test)
sample_submission = pd.read_csv('https://raw.githubusercontent.com/ThisIsJorgeLima/DS-Unit-2-Kaggle-Challenge/master/data/waterpumps/sample_submission.csv')

# Split train into train & val
train, val = train_test_split(train, train_size=0.80, test_size=0.20, 
                              stratify=train['status_group'], random_state=42)


In [0]:
# The status_group column is the target
target = 'status_group'

# Get a dataframe with all train columns except the target & id
train_features = train.drop(columns=[target, 'id'])

# Get a list of the numeric features
numeric_features = train_features.select_dtypes(include='number').columns.tolist()

# Get a series with the cardinality of the nonnumeric features
cardinality = train_features.select_dtypes(exclude='number').nunique()

# Get a list of all categorical features with cardinality <= 50
categorical_features = cardinality[cardinality <= 50].index.tolist()

# Combine the lists 
features = numeric_features + categorical_features
print(features)

['amount_tsh', 'gps_height', 'longitude', 'latitude', 'num_private', 'region_code', 'district_code', 'population', 'construction_year', 'basin', 'region', 'public_meeting', 'recorded_by', 'scheme_management', 'permit', 'extraction_type', 'extraction_type_group', 'extraction_type_class', 'management', 'management_group', 'payment', 'payment_type', 'water_quality', 'quality_group', 'quantity', 'quantity_group', 'source', 'source_type', 'source_class', 'waterpoint_type', 'waterpoint_type_group']
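
The cardinality filter is easier to see on a tiny frame. This is a sketch with hypothetical values and a threshold of 2 instead of 50: the numeric column is always kept, and a nonnumeric column is kept only if its number of unique values is at or below the threshold.

```python
import pandas as pd

toy = pd.DataFrame({
    'population': [10, 20, 30, 40],          # numeric: always kept
    'basin': ['a', 'b', 'a', 'b'],           # 2 unique values: kept
    'wpt_name': ['w1', 'w2', 'w3', 'w4'],    # 4 unique values: dropped
})

numeric = toy.select_dtypes(include='number').columns.tolist()
cardinality = toy.select_dtypes(exclude='number').nunique()
categorical = cardinality[cardinality <= 2].index.tolist()

selected = numeric + categorical
print(selected)
```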


In [0]:
# Arrange data into X features matrix and y target vector 
X_train = train[features]
y_train = train[target]
X_val = val[features]
y_val = val[target]
X_test = test[features]


In [0]:
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

encoder = ce.OneHotEncoder(use_cat_names=True)
imputer = SimpleImputer()
scaler = StandardScaler()
model = LogisticRegression(multi_class='auto', solver='lbfgs', n_jobs=-1)

X_train_encoded = encoder.fit_transform(X_train)
X_train_imputed = imputer.fit_transform(X_train_encoded)
X_train_scaled = scaler.fit_transform(X_train_imputed)
model.fit(X_train_scaled, y_train)

X_val_encoded = encoder.transform(X_val)
X_val_imputed = imputer.transform(X_val_encoded)
X_val_scaled = scaler.transform(X_val_imputed)
print('Our Validation Accuracy is:', model.score(X_val_scaled, y_val))

X_test_encoded = encoder.transform(X_test)
X_test_imputed = imputer.transform(X_test_encoded)
X_test_scaled = scaler.transform(X_test_imputed)
y_pred = model.predict(X_test_scaled)

Our Validation Accuracy is: 0.7319023569023569


In [0]:
train.shape, val.shape, test.shape

((47520, 41), (11880, 41), (14358, 40))

In [0]:
train['status_group'].value_counts(normalize=True)

functional                 0.543077
non functional             0.384238
functional needs repair    0.072685
Name: status_group, dtype: float64
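
These proportions give a baseline: always predicting `functional` would be right about 54% of the time on the training set. A small sketch with scikit-learn's `DummyClassifier`, using hypothetical labels that only approximate the class balance above:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# 100 hypothetical labels with roughly the class balance printed above
y_demo = np.array(['functional'] * 54
                  + ['non functional'] * 39
                  + ['functional needs repair'] * 7)
X_demo = np.zeros((len(y_demo), 1))  # the baseline ignores the features

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_demo, y_demo)
baseline_accuracy = baseline.score(X_demo, y_demo)
print('Majority class baseline accuracy:', baseline_accuracy)
```

Any model worth submitting should beat this number on the validation set.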

In [0]:
import numpy as np

def wrangle(X):
    """Wrangle train, validate, and test sets in the same way"""
    
    # Prevent SettingWithCopyWarning
    X = X.copy()
    
    # About 3% of the time, latitude has small values near zero,
    # outside Tanzania, so we'll treat these values like zero.
    X['latitude'] = X['latitude'].replace(-2e-08, 0)
    
    # The following columns use zeros as placeholders for missing values.
    # Replace those zeros with NaN so the imputer can fill them in later.
    cols_with_zeros = ['longitude', 'latitude',  
                       'construction_year','gps_height',
                       'amount_tsh']
    for col in cols_with_zeros:
        X[col] = X[col].replace(0, np.nan)
            
    # quantity & quantity_group are duplicates, also num_private has no 
    # useful information, so these columns will be dropped 
    X = X.drop(columns=['quantity_group',"num_private"])
    
    # return the wrangled dataframe
    return X
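
To see what the zero replacement does, here is a sketch on a hypothetical four-row frame. It repeats the two replacement steps from the function on made-up values and counts the missing values that result.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: a zero construction_year really means "unknown"
toy = pd.DataFrame({
    'construction_year': [1998, 0, 2005, 0],
    'latitude': [-6.2, -2e-08, -3.4, -9.1],
})

# Same two steps as in the function: near-zero latitude becomes zero,
# then zeros become NaN
toy['latitude'] = toy['latitude'].replace(-2e-08, 0)
for col in ['construction_year', 'latitude']:
    toy[col] = toy[col].replace(0, np.nan)

missing = toy.isnull().sum()
print(missing)
```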

In [0]:
pip install --upgrade category_encoders

Requirement already up-to-date: category_encoders in ./opt/anaconda3/lib/python3.7/site-packages (2.1.0)
Note: you may need to restart the kernel to use updated packages.


In [0]:
%%time
from sklearn.ensemble import RandomForestClassifier

# This pipeline differs from the example cell above: it swaps the one-hot
# encoder for an "ordinal" encoder, drops the scaler, and fits a random
# forest instead of logistic regression
pipeline = make_pipeline(
    #ce.BinaryEncoder(),
    ce.OrdinalEncoder(), 
    SimpleImputer(strategy='median'), 
    RandomForestClassifier(n_estimators=500, random_state=0, n_jobs=-1)
)

# Fit on train, score on val
pipeline.fit(X_train, y_train)
print('Our Train Accuracy is:', pipeline.score(X_train, y_train))
print('Validation Accuracy', pipeline.score(X_val, y_val))


## BinaryEncoder .81 with n_estimators=100
## BinaryEncoder .81203 with n_estimators=300
## BinaryEncoder .81212 with n_estimators=500
## OrdinalEncoder .812 with n_estimators=100
## OrdinalEncoder .8153 with n_estimators=300
## OrdinalEncoder .81555 with n_estimators=500



Our Train Accuracy is: 1.0
Validation Accuracy 0.8154882154882155
CPU times: user 1min 35s, sys: 1.57 s, total: 1min 37s
Wall time: 16.2 s


In [0]:
print('X_train shape before', X_train.shape)

encoder = pipeline.named_steps['ordinalencoder']
encoded = encoder.transform(X_train)

print('X_train shape after encoding', encoded.shape)

X_train shape before (47520, 40)
X_train shape after encoding (47520, 40)


In [0]:
%matplotlib inline
import matplotlib.pyplot as plt

# Get feature importances
rf = pipeline.named_steps['randomforestclassifier']
rf.feature_importances_,

(array([0.0513891 , 0.01917882, 0.03629088, 0.03044401, 0.04055447,
        0.02527832, 0.07251141, 0.07171061, 0.04898794, 0.00092392,
        0.0141706 , 0.04741341, 0.01457888, 0.01488837, 0.01568785,
        0.0223551 , 0.0350888 , 0.02959902, 0.00683123, 0.        ,
        0.01157812, 0.02282693, 0.00699801, 0.03794511, 0.01741252,
        0.01516247, 0.02522414, 0.01088409, 0.00520953, 0.01532811,
        0.01436795, 0.00694199, 0.00699435, 0.05785779, 0.05583353,
        0.01376542, 0.01311005, 0.00470031, 0.0361549 , 0.02382194]),)
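
The raw array is hard to read without the column names. A sketch of one way to label it, using synthetic data and hypothetical feature names:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 2 informative features and 3 noise features
X_arr, y_arr = make_classification(n_samples=200, n_features=5,
                                   n_informative=2, n_redundant=0,
                                   random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f'feature_{i}' for i in range(5)])

rf_demo = RandomForestClassifier(n_estimators=50, random_state=0)
rf_demo.fit(X_demo, y_arr)

# Pair each importance with its column name, then sort
importances = pd.Series(rf_demo.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```

In this notebook the ordinal encoder keeps one column per feature (the shape is unchanged by encoding), so `X_train.columns` should work as the index for `rf.feature_importances_`.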

In [0]:
X_train['wpt_name'].describe()

count     47520
unique    30661
top        none
freq       2879
Name: wpt_name, dtype: object

In [0]:
X_train['wpt_name'].value_counts()

none           2879
Shuleni        1416
Zahanati        675
Msikitini       424
Kanisani        253
               ... 
High School       1
Kwa Mmoto         1
Kwa Ndwati        1
Nakatunga         1
Plot 52           1
Name: wpt_name, Length: 30661, dtype: int64

In [0]:
encoded['wpt_name'].value_counts()

27       2879
73       1416
8         675
69        424
46        253
         ... 
2162        1
115         1
14452       1
12405       1
2047        1
Name: wpt_name, Length: 30661, dtype: int64

In [0]:
feature = 'extraction_type_class'

In [0]:
X_train[feature].head(20)

43360        gravity
7263         gravity
2486        handpump
313            other
52726      motorpump
8558         gravity
2559         gravity
54735      motorpump
25763       handpump
44540    submersible
28603          other
4372     submersible
30666        gravity
6431     submersible
57420          other
1373         gravity
2026         gravity
58977       handpump
41101        gravity
10019        gravity
Name: extraction_type_class, dtype: object

In [0]:
encoder = ce.OrdinalEncoder()
encoded = encoder.fit_transform(X_train[[feature]])
print(f'{len(encoded.columns)} columns')
encoded.head(20)

1 columns


| | extraction_type_class |
|---|---|
| 43360 | 1 |
| 7263 | 1 |
| 2486 | 2 |
| 313 | 3 |
| 52726 | 4 |
| 8558 | 1 |
| 2559 | 1 |
| 54735 | 4 |
| 25763 | 2 |
| 44540 | 5 |
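
The codes above are consistent with numbering the categories in order of first appearance, starting at 1. The sketch below reproduces that numbering with `pd.factorize` as a stand-in; it is not the encoder's actual implementation.

```python
import pandas as pd

# First ten values of the feature, in the order shown earlier
values = pd.Series(['gravity', 'gravity', 'handpump', 'other', 'motorpump',
                    'gravity', 'gravity', 'motorpump', 'handpump',
                    'submersible'])

# factorize numbers categories from 0 in order of appearance; shift to 1
codes, categories = pd.factorize(values)
codes = codes + 1

mapping = dict(zip(categories, range(1, len(categories) + 1)))
print(mapping)
print(codes.tolist())
```

The integers carry no real order (`handpump` is not "less than" `other`). Tree-based models can still split on such codes, while linear models may read them as a scale.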


In [0]:
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

lr = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True),
    SimpleImputer(),
    StandardScaler(),
    LogisticRegressionCV(multi_class='auto', solver='lbfgs', cv=5, n_jobs=-1)

)

lr.fit(X_train[[feature]], y_train)
score = lr.score(X_val[[feature]], y_val)
print('Logistic Regression, Validation Accuracy', score)

Logistic Regression, Validation Accuracy 0.6202861952861953


In [0]:
from sklearn.ensemble import RandomForestClassifier

rf_pipeline = make_pipeline(
    ce.BinaryEncoder(),
    #ce.OrdinalEncoder(),
    SimpleImputer(),
    RandomForestClassifier(n_estimators=500, random_state=0, n_jobs=-1)
)

rf_pipeline.fit(X_train[[feature]], y_train)
score = rf_pipeline.score(X_val[[feature]], y_val)
print('Random Forest Classifier, Validation Accuracy', score)

# ordinal encoder 0.6202861952861953


Random Forest Classifier, Validation Accuracy 0.6202861952861953
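
The logistic regression and the random forest land on exactly the same validation accuracy. One plausible explanation, offered as a hypothesis and not verified here: with a single categorical feature, about the best any model can do is predict the most common class within each category, so both models may end up making identical predictions. A sketch of that lookup rule on hypothetical rows:

```python
import pandas as pd

# Hypothetical training rows with one categorical feature
train_demo = pd.DataFrame({
    'extraction_type_class': ['gravity'] * 4 + ['handpump'] * 3 + ['other'] * 3,
    'status_group': ['functional', 'functional', 'functional', 'non functional',
                     'functional', 'functional', 'non functional',
                     'non functional', 'non functional', 'functional'],
})

# Most common class within each category
lookup = (train_demo.groupby('extraction_type_class')['status_group']
          .agg(lambda s: s.mode()[0]))

# Predict by looking up each row's category
pred = train_demo['extraction_type_class'].map(lookup)
accuracy = (pred == train_demo['status_group']).mean()

print(lookup.to_dict())
print('Accuracy of the lookup rule:', accuracy)
```

Comparing `lr.predict(X_val[[feature]])` with the forest's predictions would confirm or rule out this explanation for the real data.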


In [0]:
encoder = ce.OrdinalEncoder()
encoded = encoder.fit_transform(X_train[[feature]])
print(f'1 column, {encoded[feature].nunique()} unique values')
encoded.head(20)

1 column, 7 unique values


| | extraction_type_class |
|---|---|
| 43360 | 1 |
| 7263 | 1 |
| 2486 | 2 |
| 313 | 3 |
| 52726 | 4 |
| 8558 | 1 |
| 2559 | 1 |
| 54735 | 4 |
| 25763 | 2 |
| 44540 | 5 |


In [0]:
lr = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True),
    SimpleImputer(), 
    StandardScaler(), 
    LogisticRegressionCV(multi_class='auto', solver='lbfgs', cv=5, n_jobs=-1)
)

lr.fit(X_train[[feature]], y_train)
score = lr.score(X_val[[feature]], y_val)
print('Logistic Regression, Validation Accuracy', score)

Logistic Regression, Validation Accuracy 0.6202861952861953


In [0]:
import category_encoders as ce
import numpy as np
from sklearn.feature_selection import f_classif, SelectKBest
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# status_group is categorical, so the lesson's regression pieces (Ridge,
# f_regression, MAE) are swapped for their classification counterparts
pipeline = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True), 
    SimpleImputer(strategy='mean'), 
    StandardScaler(), 
    SelectKBest(f_classif, k=20), 
    LogisticRegression(multi_class='auto', solver='lbfgs', n_jobs=-1)
)

k = 2
scores = cross_val_score(pipeline, X_train, y_train, cv=k, 
                         scoring='accuracy')
print(f'Accuracy for {k} folds:', scores)
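
`cross_val_score` hides the loop over folds. As a sketch of roughly what it does for a classifier when `cv` is an integer, the block below runs the folds by hand on the iris dataset (an assumption made to keep the example small) and compares the result with the built-in function.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X_iris, y_iris = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

k = 3
manual_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=k).split(X_iris, y_iris):
    # Fit a fresh copy on k-1 folds, score on the held-out fold
    fold_model = clone(model)
    fold_model.fit(X_iris[train_idx], y_iris[train_idx])
    manual_scores.append(fold_model.score(X_iris[val_idx], y_iris[val_idx]))

auto_scores = cross_val_score(model, X_iris, y_iris, cv=k, scoring='accuracy')
print('Manual folds:   ', np.round(manual_scores, 3))
print('cross_val_score:', np.round(auto_scores, 3))
```

Because the whole pipeline is refit inside each fold, steps such as encoding and imputation never see the held-out rows.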

In [0]:
from sklearn.ensemble import RandomForestClassifier

features = train.columns.drop(target)
X_train = train[features]
y_train = train[target]

# TargetEncoder doesn't work as-is for a multi-class target, so use
# OrdinalEncoder with a classifier and score by accuracy instead of MAE
pipeline = make_pipeline(
    ce.OrdinalEncoder(), 
    SimpleImputer(strategy='median'), 
    RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
)

k = 3
scores = cross_val_score(pipeline, X_train, y_train, cv=k, 
                         scoring='accuracy')
print(f'Accuracy for {k} folds:', scores)

In [0]:
scores.mean()

In [0]:
print('Model Hyperparameters:')
print(pipeline.named_steps['randomforestclassifier'])
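
Printing the step shows the model's hyperparameters, but a search over a pipeline needs their prefixed `step__parameter` names. A sketch, assuming a small classifier pipeline built only for this illustration, of listing those names with `get_params()`:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Hypothetical pipeline used only to show the naming scheme
demo_pipeline = make_pipeline(
    SimpleImputer(strategy='median'),
    RandomForestClassifier(n_estimators=100, random_state=42),
)

# Every tunable name for the forest step, as a search would expect it
all_params = demo_pipeline.get_params()
forest_params = sorted(name for name in all_params
                       if name.startswith('randomforestclassifier__'))
print(forest_params)
print(all_params['randomforestclassifier__n_estimators'])
```

Any of these names can be used as a key in the `param_distributions` dictionary passed to `RandomizedSearchCV`.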