_Lambda School Data Science — Model Validation_ 

# Begin the modeling process

Objectives
- Train/Validate/Test split
- Cross-Validation
- Begin with baselines

## Get the Bank Marketing dataset

You can get the dataset in several ways:

#### Kaggle
- Download from the [Kaggle competition page](https://www.kaggle.com/c/ds2-model-validation/data)
- Use the Kaggle API

#### GitHub
- Clone the [repo](https://github.com/LambdaSchool/DS-Unit-2-Sprint-4-Model-Validation/tree/master/module-1-begin-modeling-process/bank-marketing)
- Download from the repo:

In [0]:
# Use /raw/ rather than /blob/ so wget fetches the CSV itself, not the GitHub HTML page
# !wget https://github.com/LambdaSchool/DS-Unit-2-Sprint-4-Model-Validation/raw/master/module-1-begin-modeling-process/bank-marketing/train_features.csv
# !wget https://github.com/LambdaSchool/DS-Unit-2-Sprint-4-Model-Validation/raw/master/module-1-begin-modeling-process/bank-marketing/train_labels.csv
# !wget https://github.com/LambdaSchool/DS-Unit-2-Sprint-4-Model-Validation/raw/master/module-1-begin-modeling-process/bank-marketing/test_features.csv
# !wget https://github.com/LambdaSchool/DS-Unit-2-Sprint-4-Model-Validation/raw/master/module-1-begin-modeling-process/bank-marketing/sample_submission.csv

In [2]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
%env KAGGLE_CONFIG_DIR=/content/drive/My Drive/

!kaggle competitions download -c ds2-model-validation

Mounted at /content/drive
env: KAGGLE_CONFIG_DIR=/content/drive/My Drive/
Downloading train_features.csv.zip to /content
  0% 0.00/424k [00:00<?, ?B/s]
100% 424k/424k [00:00<00:00, 29.1MB/s]
Downloading train_labels.csv to /content
  0% 0.00/241k [00:00<?, ?B/s]
100% 241k/241k [00:00<00:00, 73.2MB/s]
Downloading test_features.csv.zip to /content
  0% 0.00/142k [00:00<?, ?B/s]
100% 142k/142k [00:00<00:00, 43.1MB/s]
Downloading sample_submission.csv to /content
  0% 0.00/101k [00:00<?, ?B/s]
100% 101k/101k [00:00<00:00, 32.5MB/s]


In [3]:
# Only needed if you cloned the repo; the Kaggle download saves files to the current directory
%cd bank-marketing

[Errno 2] No such file or directory: 'bank-marketing'
/content


In [5]:
!unzip train_features.csv.zip
!unzip test_features.csv.zip

Archive:  train_features.csv.zip
  inflating: train_features.csv      
Archive:  test_features.csv.zip
  inflating: test_features.csv       


In [6]:
import pandas as pd

train_features = pd.read_csv('train_features.csv')
train_labels = pd.read_csv('train_labels.csv')
test_features = pd.read_csv('test_features.csv')
sample_submission = pd.read_csv('sample_submission.csv')

train_features.shape, train_labels.shape, test_features.shape, sample_submission.shape

((30891, 20), (30891, 2), (10297, 20), (10297, 2))

In [9]:
train_features.head()

| | id | age | job | marital | education | default | housing | loan | contact | month | day_of_week | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20591 | 29 | services | single | high.school | no | yes | yes | cellular | may | thu | 10 | 999 | 0 | nonexistent | -1.8 | 92.893 | -46.2 | 1.266 | 5099.1 |
| 1 | 18343 | 54 | management | married | university.degree | no | no | no | cellular | nov | tue | 1 | 999 | 1 | failure | -0.1 | 93.2 | -42.0 | 4.153 | 5195.8 |
| 2 | 32826 | 55 | self-employed | married | unknown | unknown | no | no | cellular | jul | mon | 3 | 999 | 0 | nonexistent | 1.4 | 93.918 | -42.7 | 4.962 | 5228.1 |
| 3 | 29780 | 43 | blue-collar | married | unknown | unknown | no | no | cellular | may | mon | 6 | 999 | 0 | nonexistent | -1.8 | 92.893 | -46.2 | 1.244 | 5099.1 |
| 4 | 40736 | 54 | blue-collar | married | basic.4y | no | yes | no | telephone | may | wed | 5 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.856 | 5191.0 |


In [0]:
y_train = train_labels['y']
X_train = train_features.drop(columns='id')
X_test  = test_features.drop(columns='id')

In [11]:
y_train.value_counts(normalize=True)
# Customers sign up only about 11% of the time (class 1).
# The goal is to identify those customers!

0    0.887346
1    0.112654
Name: y, dtype: float64

## Train/Validation/Test split

How can we get from a two-way split to a three-way split?

We can use the [**`sklearn.model_selection.train_test_split`**](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function to split the training data into training and validation data.

In [0]:
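# A two-way train/test split isn't sufficient when you compare multiple models
# or tune hyperparameters: if you pick the winner by its test score, you
# overfit to the test set and lose your unbiased estimate of performance.

# Methods: three-way split (train/validation/test) or cross-validation.
# Begin with baselines!
# Beat the score you get from guessing as fast as possible, then iterate.
# Then try to exceed human performance, if that is realistic for the problem.

from sklearn.model_selection import train_test_split

# stratify keeps the ~11% positive rate in both splits;
# random_state makes the split reproducible (42 is an arbitrary choice).
# Run this cell only once: it overwrites X_train and y_train.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, stratify=y_train, random_state=42)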
X_train.shape, X_val.shape, y_train.shape, y_val.shape

Fit on the training set.

Predict and score with the validation set.
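
The full workflow can be sketched on synthetic data: two calls to `train_test_split` produce a three-way split, the model is fit on the training portion, and it is scored on the validation portion while the test portion stays untouched. The data, the `_toy`/`_tr`/`_va`/`_te` names, and the 60/20/20 proportions are assumptions for illustration, chosen so the cell does not overwrite the notebook's own variables.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical; not the Bank Marketing set)
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = (X_toy[:, 0] + rng.normal(scale=0.5, size=1000) > 1.2).astype(int)

# First split: hold out a final test set (20% of all rows)
X_rest, X_te, y_rest, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42)

# Second split: carve a validation set out of the remainder
# (0.25 of the remaining 80% is 20% of all rows)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42)

# Fit on the training set, then predict and score with the validation set
toy_model = LogisticRegression(solver='lbfgs', max_iter=1000)
toy_model.fit(X_tr, y_tr)
val_accuracy = accuracy_score(y_va, toy_model.predict(X_va))

print(X_tr.shape, X_va.shape, X_te.shape)
print('validation accuracy:', val_accuracy)
```

The test portion (`X_te`, `y_te`) is deliberately unused here; it is meant to be scored once, after model selection is finished.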

## Majority class baseline

Determine the majority class:

In [0]:
y_train.value_counts(normalize=True)

Guess the majority class for every prediction:

In [0]:
majority_class = 0
y_pred = [majority_class] * len(y_val)

#### [`sklearn.metrics.accuracy_score`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html)

Baseline accuracy by guessing the majority class for every prediction:

In [0]:
from sklearn.metrics import accuracy_score
accuracy_score(y_val, y_pred)

#### [`sklearn.metrics.roc_auc_score`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html)

Baseline "ROC AUC" score by guessing the majority class for every prediction:

In [0]:
from sklearn.metrics import roc_auc_score
roc_auc_score(y_val, y_pred)
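
As a cross-check on the two baseline scores, here is a small sketch using scikit-learn's `DummyClassifier` with `strategy='most_frequent'`, which is one way to get the same majority-class guess without hard-coding the class. The toy labels (89 zeros, 11 ones) are made up to roughly mirror this dataset's imbalance.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Toy labels with roughly the same imbalance as this dataset (illustrative only)
y_toy = np.array([0] * 89 + [1] * 11)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by the dummy model

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_toy, y_toy)
toy_pred = baseline.predict(X_toy)

# Accuracy equals the share of the majority class
toy_accuracy = accuracy_score(y_toy, toy_pred)
# A constant guess cannot rank positives above negatives, so ROC AUC is 0.5
toy_auc = roc_auc_score(y_toy, toy_pred)

print(toy_accuracy, toy_auc)
```

A high accuracy next to a ROC AUC of 0.5 is the reason accuracy alone can mislead on imbalanced classes.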

## Fast first models

### Ignore rows/columns with nulls

This dataset doesn't have any nulls:

In [0]:
X_train.isnull().sum()

### Ignore nonnumeric features

Here are the numeric features:

In [0]:
import numpy as np
X_train.describe(include=np.number)

Here are the nonnumeric features:

In [0]:
X_train.describe(exclude=np.number)

Just select the numeric features:

In [0]:
X_train_numeric = X_train.select_dtypes(np.number)
X_val_numeric = X_val.select_dtypes(np.number)

In [0]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='lbfgs', max_iter=1000)
model.fit(X_train_numeric, y_train)

y_pred = model.predict(X_val_numeric)
roc_auc_score(y_val, y_pred)
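
One caveat about scoring: `roc_auc_score` measures how well a model ranks examples, so it is usually given scores or probabilities rather than hard 0/1 labels. The sketch below, on synthetic data with made-up names, compares the two ways of calling it. With hard labels from `predict`, the ROC curve collapses to a single threshold; with `predict_proba`, every threshold is considered.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (hypothetical)
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(2000, 2))
y_toy = (X_toy[:, 0] + rng.normal(size=2000) > 1.5).astype(int)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, stratify=y_toy, random_state=0)

toy_model = LogisticRegression(solver='lbfgs', max_iter=1000)
toy_model.fit(X_tr, y_tr)

# Hard 0/1 predictions: only one operating point on the ROC curve
auc_from_labels = roc_auc_score(y_va, toy_model.predict(X_va))

# Predicted probability of class 1: the full ranking of validation rows
auc_from_probs = roc_auc_score(y_va, toy_model.predict_proba(X_va)[:, 1])

print(auc_from_labels, auc_from_probs)
```

On imbalanced data the label-based score is typically the lower of the two, because few rows cross the default 0.5 threshold.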

### With Scaler

In [0]:
from sklearn.exceptions import DataConversionWarning
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings(action='ignore', category=DataConversionWarning)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_numeric)
X_val_scaled = scaler.transform(X_val_numeric)

model = LogisticRegression(solver='lbfgs', max_iter=1000)
model.fit(X_train_scaled, y_train)  # fit on the scaled features, matching X_val_scaled below

y_pred = model.predict(X_val_scaled)
roc_auc_score(y_val, y_pred)
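
The scaler is fit on the training data only, and the validation data is transformed with the statistics learned there. A small sketch on made-up arrays shows the effect: the scaled training columns have mean 0 and standard deviation 1, while the scaled validation columns are only approximately standardized.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up numeric features (illustrative only)
rng = np.random.RandomState(0)
toy_train = rng.normal(loc=50, scale=10, size=(500, 2))
toy_val = rng.normal(loc=50, scale=10, size=(100, 2))

toy_scaler = StandardScaler()
toy_train_scaled = toy_scaler.fit_transform(toy_train)  # learns mean/std from train
toy_val_scaled = toy_scaler.transform(toy_val)          # reuses the train mean/std

print('train:', toy_train_scaled.mean(axis=0).round(3), toy_train_scaled.std(axis=0).round(3))
print('val:  ', toy_val_scaled.mean(axis=0).round(3), toy_val_scaled.std(axis=0).round(3))
```

Calling `fit_transform` on the validation data instead would leak information about the validation set into preprocessing.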

### Same, as a pipeline

In [0]:
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver='lbfgs', max_iter=1000)  # same settings as the model above
)

pipeline.fit(X_train_numeric, y_train)

y_pred = pipeline.predict(X_val_numeric)
roc_auc_score(y_val, y_pred)
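
A pipeline also makes cross-validation straightforward, because the scaler is refit on the training folds of every split. Here is a minimal sketch using `cross_val_score` on synthetic data; the data, the `toy_` names, and the choice of 5 folds are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical; not the Bank Marketing set)
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = (X_toy[:, 0] + rng.normal(size=1000) > 1.0).astype(int)

toy_pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver='lbfgs', max_iter=1000)
)

# cv=5 uses stratified folds for classifiers; each fold is held out once.
# scoring='roc_auc' scores the model's continuous outputs, not hard labels.
cv_scores = cross_val_score(toy_pipeline, X_toy, y_toy, cv=5, scoring='roc_auc')

print(cv_scores)
print('mean:', cv_scores.mean(), 'std:', cv_scores.std())
```

The spread across folds gives a rough sense of how much a single validation score might vary from split to split.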

### Encode "low cardinality" categoricals

One-hot encode the "low cardinality" categoricals. First, check how many unique values each nonnumeric feature has:

In [0]:
X_train.select_dtypes(exclude=np.number).nunique()

Install the Category Encoders library:

If you're running on Google Colab:

```
!pip install category_encoders
```

If you're running locally with Anaconda:

```
!conda install -c conda-forge category_encoders
```

In [0]:
import category_encoders as ce

pipeline = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True), 
    StandardScaler(), 
    LogisticRegression(solver='lbfgs', max_iter=1000)
)

pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_val)
roc_auc_score(y_val, y_pred)
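
If `category_encoders` is not available, a similar result can be sketched with scikit-learn alone. This is a swapped-in alternative, not the approach used above: it picks nonnumeric columns whose number of unique values is at or below a threshold and one-hot encodes them with `OneHotEncoder` inside a `ColumnTransformer`. The toy frame and the `max_cardinality` threshold are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Tiny made-up frame borrowing a few column names from this dataset
toy = pd.DataFrame({
    'age': [29, 54, 55, 43],
    'marital': ['single', 'married', 'married', 'married'],
    'contact': ['cellular', 'cellular', 'cellular', 'telephone'],
})

max_cardinality = 10  # hypothetical cutoff for "low cardinality"
categorical = toy.select_dtypes(exclude='number')
low_cardinality = [
    col for col in categorical.columns
    if categorical[col].nunique() <= max_cardinality
]

encoder = ColumnTransformer(
    # handle_unknown='ignore' encodes unseen categories as all zeros
    [('onehot', OneHotEncoder(handle_unknown='ignore'), low_cardinality)],
    remainder='passthrough',  # keep the numeric columns unchanged
)
toy_encoded = encoder.fit_transform(toy)

print(low_cardinality)
print(toy_encoded.shape)
```

Each categorical column expands into one indicator column per category, so the width of the encoded data grows with cardinality, which is why the encoding is limited to low-cardinality features.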