
# ASSIGNMENT options

- **Replicate the lesson code.** [Do it "the hard way" or with the "Benjamin Franklin method."](https://docs.google.com/document/d/1ubOw9B3Hfip27hF2ZFnW3a3z9xAgrUDRReOEo-FHCVs/edit)
- Apply the lesson to other datasets you've worked with before, and compare results.
- Iterate and improve your **Bank Marketing** model. Engineer new features.
- Get **weather** data for your own area and calculate both baselines: _"One (persistence) predicts that the weather tomorrow is going to be whatever it was today. The other (climatology) predicts whatever the average historical weather has been on this day from prior years."_ What is the mean absolute error for each baseline? What if you average the two together?
- [This example from scikit-learn documentation](https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html) demonstrates its improved `OneHotEncoder` and new `ColumnTransformer` objects, which can replace functionality from [third-party libraries](https://github.com/scikit-learn-contrib) like category_encoders and sklearn-pandas. Adapt this example, which uses Titanic data, to work with Bank Marketing or another dataset.
- When would this notebook's pipelines fail? How could you fix them? Add more [preprocessing](https://scikit-learn.org/stable/modules/preprocessing.html) and [imputation](https://scikit-learn.org/stable/modules/impute.html) to your [pipelines](https://scikit-learn.org/stable/modules/compose.html) with scikit-learn.
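
The persistence and climatology baselines from the weather option can be sketched as follows. This is only an illustration on synthetic daily temperatures, not real weather data: the series, the noise model, and the names `weather`, `temp`, `persistence`, `climatology`, and `doy` are all assumptions made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error

# Synthetic daily temperatures for 4 years (assumption: a stand-in for real data).
rng = np.random.RandomState(0)
dates = pd.date_range('2015-01-01', '2018-12-31', freq='D')
seasonal = 15 + 10 * np.sin(2 * np.pi * (dates.dayofyear.values - 100) / 365.25)
weather = pd.DataFrame({'temp': seasonal + rng.normal(0, 3, len(dates))}, index=dates)

# Persistence: the forecast for each day is the previous day's observed value.
weather['persistence'] = weather['temp'].shift(1)

# Climatology: mean of the same day of year, using prior years only.
weather['doy'] = weather.index.dayofyear
weather['climatology'] = (
    weather.groupby('doy')['temp']
    .transform(lambda s: s.expanding().mean().shift(1))
)

# Evaluate on the final year, where both baselines are defined.
test = weather.loc['2018'].dropna()
mae_persistence = mean_absolute_error(test['temp'], test['persistence'])
mae_climatology = mean_absolute_error(test['temp'], test['climatology'])

# Average the two baselines together.
blend = (test['persistence'] + test['climatology']) / 2
mae_blend = mean_absolute_error(test['temp'], blend)

print('Persistence MAE:', mae_persistence)
print('Climatology MAE:', mae_climatology)
print('Blended MAE:', mae_blend)
```

The blended MAE can never exceed the average of the two individual MAEs, which is one reason averaging baselines is worth trying.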

In [0]:
# Imports
%matplotlib inline
import warnings
import category_encoders as ce
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import graphviz
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.metrics import roc_auc_score, accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.exceptions import DataConversionWarning
warnings.filterwarnings(action='ignore', category=DataConversionWarning)

In [0]:
df = sns.load_dataset('titanic').drop(columns= ['alive', 'class', 'embark_town'])

In [0]:
# Fill the 2 missing values of embarked with the most frequent port of embarkation:
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])

# Fill missing ages with 30, roughly the mean age:
df['age'] = df['age'].fillna(30)
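
The fill values above are chosen from the full dataset before any train/test split. One alternative is to impute inside the pipeline, so the fill value is learned from the training rows only. The sketch below uses a small synthetic frame rather than the Titanic data; `toy`, `toy_y`, and `impute_pipe` are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy numeric data with missing ages (hypothetical, not the Titanic frame).
rng = np.random.RandomState(42)
toy = pd.DataFrame({
    'age': rng.normal(30, 10, 200),
    'fare': rng.exponential(30, 200),
})
toy.loc[rng.choice(200, 40, replace=False), 'age'] = np.nan
toy_y = (toy['fare'] > 25).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(toy, toy_y, test_size=0.2, random_state=42)

# The imputer learns its fill value from the training rows only.
impute_pipe = make_pipeline(
    SimpleImputer(strategy='mean'),
    LogisticRegression(solver='lbfgs', max_iter=1000),
)
impute_pipe.fit(X_tr, y_tr)

# statistics_ holds one learned fill value per column; index 0 is age.
learned_age_fill = impute_pipe.named_steps['simpleimputer'].statistics_[0]
print('Fill value learned for age:', learned_age_fill)
print('Test accuracy:', impute_pipe.score(X_te, y_te))
```

Because the imputer is a pipeline step, `cross_val_score` would refit it on each training fold as well.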

In [0]:
# Extra feature engineering: total number of relatives aboard
# (siblings/spouses plus parents/children).

df['num_companions'] = df['sibsp'] + df['parch']

In [5]:
# Convert categorical columns to integer category codes.
# Note: missing values (e.g. in deck) get the code -1.

cat_cols = ['pclass', 'sex', 'embarked', 'who', 'deck', 'adult_male', 'alone']

def convert_categoricals(df, cat_cols):
  """Return a copy of df with each column in cat_cols replaced by its integer category codes."""
  copy = df.copy()
  for col in cat_cols:
    copy[col] = pd.Categorical(copy[col])
    copy[col] = copy[col].cat.codes
  return copy

df = convert_categoricals(df, cat_cols)
df.head()

   survived  pclass  sex   age  sibsp  parch     fare  embarked  who  adult_male  deck  alone  num_companions
0         0       2    1  22.0      1      0   7.2500         2    1           1    -1      0               1
1         1       0    0  38.0      1      0  71.2833         0    2           0     2      0               1
2         1       2    0  26.0      0      0   7.9250         2    2           0    -1      1               0
3         1       0    0  35.0      1      0  53.1000         2    2           0     2      0               1
4         0       2    1  35.0      0      0   8.0500         2    1           1    -1      1               0
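
The `-1` values in the `deck` column above come from how pandas encodes missing values: `pd.Categorical(...).codes` assigns `-1` to anything that is not a category. A minimal sketch with a made-up series (`deck_toy` is a hypothetical name), including one possible alternative that gives missing values their own explicit category:

```python
import numpy as np
import pandas as pd

# Made-up deck values with one missing entry.
deck_toy = pd.Series(['C', np.nan, 'E', 'C'])

# Missing values are not a category, so they receive the code -1.
codes_with_missing = pd.Categorical(deck_toy).codes.tolist()
print(codes_with_missing)  # [0, -1, 1, 0]

# Alternative (an assumption, not the notebook's approach):
# label missing values explicitly before encoding.
codes_explicit = pd.Categorical(deck_toy.fillna('missing')).codes.tolist()
print(codes_explicit)  # [0, 2, 1, 0]
```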


In [0]:
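def test_pipeline(pipeline, X, y):
  """Fit the pipeline on an 80/20 split and print test accuracy and ROC AUC."""
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

  pipeline.fit(X_train, y_train)
  y_pred = pipeline.predict(X_test)
  # scikit-learn metrics take (y_true, y_pred), in that order
  print('Accuracy score: ' + str(accuracy_score(y_test, y_pred)))
  # ROC AUC needs the predicted probability of the positive class
  y_pred_proba = pipeline.predict_proba(X_test)[:,1]
  print('Roc_auc_score: ' + str(roc_auc_score(y_test, y_pred_proba)))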

In [0]:
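# Note: by default ce.OneHotEncoder encodes only string/categorical-dtype columns.
# The categorical columns were already converted to integer codes above, so the
# encoder likely passes them through unchanged and the models treat the codes as
# ordinal numbers. Pass cols=cat_cols to force one-hot encoding.

# DecisionTreeClassifier has no random_state set, so its scores can vary between runs.
tree_pipe = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True),
    StandardScaler(),
    DecisionTreeClassifier()
)

log_reg_pipe = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True),
    StandardScaler(),
    LogisticRegression(solver='lbfgs', max_iter=1000)
)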
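
For comparison, the scikit-learn `ColumnTransformer` approach named in the assignment options might look roughly like the sketch below. It one-hot encodes the listed categorical columns and scales the numeric ones. The toy frame, its values, and the names `toy`, `toy_y`, `preprocess`, and `column_pipe` are assumptions for illustration, not the notebook's data or method.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Small made-up frame with two categorical and two numeric columns.
toy = pd.DataFrame({
    'sex': ['male', 'female', 'female', 'male', 'female', 'male', 'male', 'female'],
    'embarked': ['S', 'C', 'S', 'Q', 'C', 'S', 'Q', 'C'],
    'age': [22.0, 38.0, 26.0, 35.0, 35.0, 54.0, 2.0, 27.0],
    'fare': [7.25, 71.28, 7.93, 53.10, 8.05, 51.86, 21.08, 11.13],
})
toy_y = pd.Series([0, 1, 1, 1, 0, 0, 0, 1])

# Each transformer is applied only to the columns listed for it.
preprocess = ColumnTransformer(
    transformers=[
        ('onehot', OneHotEncoder(handle_unknown='ignore'), ['sex', 'embarked']),
        ('scale', StandardScaler(), ['age', 'fare']),
    ]
)

column_pipe = make_pipeline(
    preprocess,
    LogisticRegression(solver='lbfgs', max_iter=1000),
)
column_pipe.fit(toy, toy_y)

# 2 sex levels + 3 embarked levels + 2 scaled numeric columns = 7 features.
n_features = column_pipe.named_steps['columntransformer'].transform(toy).shape[1]
print('Number of features after preprocessing:', n_features)
```

`handle_unknown='ignore'` is chosen here so that a category seen only at prediction time is encoded as all zeros instead of raising an error.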

In [0]:
target = 'survived'
X = df.drop(columns=target)
y = df[target]

In [9]:
test_pipeline(tree_pipe, X, y)

Accuracy score: 0.7597765363128491
Roc_auc_score: 0.7601673101673102


In [10]:
test_pipeline(log_reg_pipe, X, y)

Accuracy score: 0.8100558659217877
Roc_auc_score: 0.8794079794079794
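
Accuracy scores like these are easier to judge against a majority-class baseline. The sketch below uses scikit-learn's `DummyClassifier` on made-up labels with a 60/40 class split (an assumption, not the Titanic split); `X_toy`, `y_toy`, and `baseline` are hypothetical names.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Made-up labels: 60 negatives and 40 positives.
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.zeros((100, 1))  # the dummy model ignores the features

# Always predict the most frequent class.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_toy, y_toy)

baseline_accuracy = accuracy_score(y_toy, baseline.predict(X_toy))
baseline_auc = roc_auc_score(y_toy, baseline.predict_proba(X_toy)[:, 1])

print('Baseline accuracy:', baseline_accuracy)  # equals the majority-class share
print('Baseline ROC AUC:', baseline_auc)        # constant scores give 0.5
```

A model is only useful to the extent that it beats these numbers on the same rows.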


In [0]:
# Use 10-fold cross-validation on the training data instead of a single test split:

def test_pipeline_cross_val(pipeline, X, y):
  # Hold out the same 20% test set as test_pipeline; only the training rows are cross-validated.
  X_train, _, y_train, _ = train_test_split(X, y, test_size=0.2, random_state=42)
  scores = cross_val_score(pipeline, X_train, y_train, scoring='roc_auc', cv=10, n_jobs=-1, verbose=10)
  print('Cross-Validation ROC AUC scores:', scores)
  print('Average:', scores.mean())

In [12]:
test_pipeline_cross_val(tree_pipe, X, y)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.


Cross-Validation ROC AUC scores: [0.70246914 0.71851852 0.69835391 0.81481481 0.78661616 0.74242424
 0.69276094 0.79292929 0.74519231 0.8361014 ]
Average: 0.753018072462517


[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:    1.9s
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:    2.0s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    2.1s finished


In [13]:
test_pipeline_cross_val(log_reg_pipe, X, y)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.


Cross-Validation ROC AUC scores: [0.92345679 0.83786008 0.70452675 0.94650206 0.88468013 0.8047138
 0.76346801 0.8973064  0.8277972  0.96328671]
Average: 0.8553597945264613


[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Batch computation too fast (0.0546s.) Setting batch_size=6.
[Parallel(n_jobs=-1)]: Done   3 out of  10 | elapsed:    0.1s remaining:    0.2s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    0.2s finished
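
The fold scores printed above vary quite a bit from fold to fold, so it can help to look at the spread as well as the average. A small sketch that summarizes the two score arrays copied from the outputs above (`tree_scores`, `log_reg_scores`, and `summary` are names introduced here):

```python
import numpy as np

# Fold scores copied from the cross-validation outputs above.
tree_scores = np.array([0.70246914, 0.71851852, 0.69835391, 0.81481481, 0.78661616,
                        0.74242424, 0.69276094, 0.79292929, 0.74519231, 0.8361014])
log_reg_scores = np.array([0.92345679, 0.83786008, 0.70452675, 0.94650206, 0.88468013,
                           0.8047138, 0.76346801, 0.8973064, 0.8277972, 0.96328671])

# Collect mean, standard deviation, and range for each model.
summary = {}
for name, scores in [('tree', tree_scores), ('log_reg', log_reg_scores)]:
    summary[name] = {
        'mean': scores.mean(),
        'std': scores.std(),
        'min': scores.min(),
        'max': scores.max(),
    }
    print(f"{name}: mean={summary[name]['mean']:.3f} std={summary[name]['std']:.3f} "
          f"min={summary[name]['min']:.3f} max={summary[name]['max']:.3f}")
```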
