_Lambda School Data Science — Model Validation_ 

# Begin the modeling process

Objectives
- Train/Validate/Test split
- Cross-Validation
- Begin with baselines

## Get the Bank Marketing dataset

There are several ways to get the dataset:

#### Kaggle
- Download from the [Kaggle competition page](https://www.kaggle.com/c/ds2-model-validation/data)
- Use the Kaggle API

#### GitHub
- Clone the [repo](https://github.com/LambdaSchool/DS-Unit-2-Sprint-4-Model-Validation/tree/master/module-1-begin-modeling-process/bank-marketing)
- Download from the repo:

In [1]:
import pandas as pd

train_features = pd.read_csv('train_features.csv')
train_labels = pd.read_csv('train_labels.csv')
test_features = pd.read_csv('test_features.csv')
sample_submission = pd.read_csv('sample_submission.csv')

train_features.shape, train_labels.shape, test_features.shape, sample_submission.shape

((30891, 20), (30891, 2), (10297, 20), (10297, 2))

In [2]:
y_train = train_labels['y']
X_train = train_features.drop(columns='id')
X_test  = test_features.drop(columns='id')

## Train/Validation/Test split

How do we get from a two-way split to a three-way split?

We can use the [**`sklearn.model_selection.train_test_split`**](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function to split the training data into training and validation data.

In [3]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train)
X_train.shape, X_val.shape, y_train.shape, y_val.shape

((23168, 19), (7723, 19), (23168,), (7723,))

Fit on the training set.

Predict and score with the validation set.
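As a hedged illustration of the split above, the sketch below uses synthetic stand-in data (not the Bank Marketing data) and two optional arguments the notebook cell does not pass: `random_state`, which makes the split repeatable, and `stratify`, which keeps the class ratio nearly identical in the training and validation parts. The names `X_demo`, `y_demo`, `X_tr`, and `X_va` are made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the Bank Marketing data): 1000 rows, roughly 11% positives
rng = np.random.RandomState(0)
X_demo = pd.DataFrame({'feature': rng.normal(size=1000)})
y_demo = pd.Series((rng.uniform(size=1000) < 0.11).astype(int), name='y')

# random_state makes the split repeatable; stratify keeps the class ratio
# nearly the same in both parts. test_size=0.25 is also the function's default.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42)

print(X_tr.shape, X_va.shape)    # sizes of the two parts
print(y_tr.mean(), y_va.mean())  # positive rate in each part
```

With an imbalanced target like this one, stratifying is one way to avoid a validation set that happens to contain unusually few positives.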

## Majority class baseline

Determine the majority class:

In [4]:
y_train.value_counts(normalize=True)

0    0.887604
1    0.112396
Name: y, dtype: float64

Guess the majority class for every prediction:

In [5]:
majority_class = 0
y_pred = [majority_class] * len(y_val)

#### [`sklearn.metrics.accuracy_score`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html)

Baseline accuracy by guessing the majority class for every prediction:

In [6]:
from sklearn.metrics import accuracy_score
accuracy_score(y_val, y_pred)

0.886572575424058

#### [`sklearn.metrics.roc_auc_score`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html)

Baseline "ROC AUC" score by guessing the majority class for every prediction:

In [7]:
from sklearn.metrics import roc_auc_score
roc_auc_score(y_val, y_pred)

0.5
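The same baseline can also be expressed with scikit-learn's `DummyClassifier`. The sketch below is only an illustration on made-up labels (90 negatives, 10 positives), not the notebook's data. It shows why the two metrics behave so differently: accuracy equals the majority-class share, while ROC AUC is 0.5 because a constant prediction cannot rank any positive above any negative.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Made-up labels for illustration: 90 negatives, 10 positives
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.zeros((len(y_demo), 1))  # DummyClassifier ignores the feature values

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_demo, y_demo)
y_pred_demo = baseline.predict(X_demo)  # every prediction is the majority class

acc = accuracy_score(y_demo, y_pred_demo)  # share of the majority class
auc = roc_auc_score(y_demo, y_pred_demo)   # constant predictions carry no ranking
print(acc, auc)
```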

## Fast first models

### Ignore rows/columns with nulls

This dataset doesn't have any nulls:

In [8]:
X_train.isnull().sum()

age               0
job               0
marital           0
education         0
default           0
housing           0
loan              0
contact           0
month             0
day_of_week       0
campaign          0
pdays             0
previous          0
poutcome          0
emp.var.rate      0
cons.price.idx    0
cons.conf.idx     0
euribor3m         0
nr.employed       0
dtype: int64

### Ignore nonnumeric features

Here are the numeric features:

In [9]:
import numpy as np
X_train.describe(include=np.number)

| | age | campaign | pdays | previous | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed |
|---|---|---|---|---|---|---|---|---|---|
| count | 23168.0 | 23168.0 | 23168.0 | 23168.0 | 23168.0 | 23168.0 | 23168.0 | 23168.0 | 23168.0 |
| mean | 40.060773 | 2.563665 | 962.269208 | 0.175328 | 0.087116 | 93.57718 | -40.494212 | 3.624343 | 5167.039662 |
| std | 10.476481 | 2.74335 | 187.418505 | 0.503358 | 1.568426 | 0.578115 | 4.623754 | 1.734214 | 72.62778 |
| min | 17.0 | 1.0 | 0.0 | 0.0 | -3.4 | 92.201 | -50.8 | 0.634 | 4963.6 |
| 25% | 32.0 | 1.0 | 999.0 | 0.0 | -1.8 | 93.075 | -42.7 | 1.344 | 5099.1 |
| 50% | 38.0 | 2.0 | 999.0 | 0.0 | 1.1 | 93.749 | -41.8 | 4.857 | 5191.0 |
| 75% | 47.0 | 3.0 | 999.0 | 0.0 | 1.4 | 93.994 | -36.4 | 4.961 | 5228.1 |
| max | 98.0 | 56.0 | 999.0 | 7.0 | 1.4 | 94.767 | -26.9 | 5.045 | 5228.1 |


Here are the nonnumeric features:

In [10]:
X_train.describe(exclude=np.number)

| | job | marital | education | default | housing | loan | contact | month | day_of_week | poutcome |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 23168 | 23168 | 23168 | 23168 | 23168 | 23168 | 23168 | 23168 | 23168 | 23168 |
| unique | 12 | 4 | 8 | 3 | 3 | 3 | 2 | 10 | 5 | 3 |
| top | admin. | married | university.degree | no | yes | no | cellular | may | thu | nonexistent |
| freq | 5839 | 14056 | 6802 | 18285 | 12157 | 19008 | 14753 | 7702 | 4863 | 19991 |


Select just the numeric features:

In [11]:
X_train_numeric = X_train.select_dtypes(np.number)
X_val_numeric = X_val.select_dtypes(np.number)

In [12]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='lbfgs', max_iter=1000)
model.fit(X_train_numeric, y_train)

y_pred = model.predict(X_val_numeric)
roc_auc_score(y_val, y_pred)

0.5986235014101433
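The score above is computed from hard 0/1 labels returned by `predict`. When `roc_auc_score` receives hard labels, it can only evaluate a single threshold, and the result equals the average of sensitivity and specificity (balanced accuracy). Passing the positive-class probability from `predict_proba` lets it evaluate the full ranking instead. The sketch below illustrates the difference on synthetic data, not the Bank Marketing data; all variable names are made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 90% class 0), not the Bank Marketing data
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=5, weights=[0.9, 0.1], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0)

clf = LogisticRegression(solver='lbfgs', max_iter=1000).fit(X_tr, y_tr)

hard_labels = clf.predict(X_va)            # 0/1 labels at the default 0.5 threshold
pos_proba = clf.predict_proba(X_va)[:, 1]  # estimated probability of class 1

auc_from_labels = roc_auc_score(y_va, hard_labels)  # one threshold only
auc_from_proba = roc_auc_score(y_va, pos_proba)     # uses the full ranking
print(auc_from_labels, auc_from_proba)
```

On an imbalanced target, the label-based score is often much lower than the probability-based one, because few rows cross the 0.5 threshold.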

### With Scaler

In [13]:
from sklearn.exceptions import DataConversionWarning
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings(action='ignore', category=DataConversionWarning)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_numeric)
X_val_scaled = scaler.transform(X_val_numeric)

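model = LogisticRegression(solver='lbfgs', max_iter=1000)
# Fit on the scaled training data so it matches the scaled validation data
model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_val_scaled)
roc_auc_score(y_val, y_pred)

The score originally recorded for this cell, 0.5152552396043195, came from fitting on the unscaled `X_train_numeric` while predicting on `X_val_scaled`. With the model fit on `X_train_scaled`, the score should be close to the pipeline result in the next section.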
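The scaler learns its mean and standard deviation from the training data only, and `transform` reuses those statistics on the validation data. A model must be fit on the same representation it later predicts on, so a model that predicts on scaled data needs to be fit on scaled data too. The sketch below shows the scaler's behavior on made-up numbers; the names `train_demo`, `val_demo`, and `scaler_demo` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up numbers for illustration
train_demo = np.array([[1.0], [2.0], [3.0]])
val_demo = np.array([[2.0], [4.0]])

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train_demo)  # learns mean and std from train
val_scaled = scaler_demo.transform(val_demo)          # reuses the training mean and std

print(scaler_demo.mean_, scaler_demo.scale_)  # statistics learned from train_demo
print(val_scaled.ravel())                     # validation values in training units
```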

### Same, as a pipeline

In [14]:
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    StandardScaler(), 
    LogisticRegression(solver='lbfgs')
)

pipeline.fit(X_train_numeric, y_train)

y_pred = pipeline.predict(X_val_numeric)
roc_auc_score(y_val, y_pred)

0.5964864457519976
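A pipeline also fits naturally with cross-validation, one of the objectives listed at the top. The sketch below is a minimal illustration on synthetic data, not the Bank Marketing data, using `cross_val_score` with `scoring='roc_auc'`. Because the scaler lives inside the pipeline, it is refit on the training folds in each round, so no statistics leak in from the held-out fold. The names `cv_pipeline`, `X_demo`, and `y_demo` are made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, not the Bank Marketing data
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=5, weights=[0.9, 0.1], random_state=0)

cv_pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver='lbfgs', max_iter=1000)
)

# 5-fold cross-validation; the scaler is refit on the training folds each round.
# scoring='roc_auc' scores the model's continuous outputs rather than hard labels.
scores = cross_val_score(cv_pipeline, X_demo, y_demo, cv=5, scoring='roc_auc')
print(scores, scores.mean())
```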

### Encode "low cardinality" categoricals

One-hot encode the "low cardinality" categoricals. First, check how many unique values each categorical feature has:

In [15]:
X_train.select_dtypes(exclude=np.number).nunique()

job            12
marital         4
education       8
default         3
housing         3
loan            3
contact         2
month          10
day_of_week     5
poutcome        3
dtype: int64

Install the Category Encoders library.

If you're running on Google Colab:

```
!pip install category_encoders
```

If you're running locally with Anaconda:

```
!conda install -c conda-forge category_encoders
```

In [16]:
!pip install category_encoders
import category_encoders as ce

pipeline = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True), 
    StandardScaler(), 
    LogisticRegression(solver='lbfgs', max_iter=1000)
)

pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_val)
roc_auc_score(y_val, y_pred)

Collecting category_encoders
  Downloading https://files.pythonhosted.org/packages/f7/d3/82a4b85a87ece114f6d0139d643580c726efa45fa4db3b81aed38c0156c5/category_encoders-1.3.0-py2.py3-none-any.whl (61kB)
Installing collected packages: category-encoders
Successfully installed category-encoders-1.3.0


0.6084920869920699
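For readers who cannot install `category_encoders`, a roughly comparable pipeline can be sketched with scikit-learn alone. This is a swapped-in alternative, not the notebook's method: it uses `ColumnTransformer` with scikit-learn's own `OneHotEncoder`, and unlike the pipeline above it scales only the numeric column rather than the one-hot columns as well. The tiny data frame is made up and only borrows two column names from the dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up frame that borrows two column names from the dataset
X_demo = pd.DataFrame({
    'age': [29, 55, 52, 34, 41, 38],
    'contact': ['telephone', 'telephone', 'cellular',
                'cellular', 'cellular', 'telephone'],
})
y_demo = pd.Series([0, 0, 1, 1, 1, 0])

# One-hot encode the categorical column and scale the numeric column
preprocess = ColumnTransformer([
    ('onehot', OneHotEncoder(handle_unknown='ignore'), ['contact']),
    ('scale', StandardScaler(), ['age']),
])

demo_pipeline = make_pipeline(
    preprocess,
    LogisticRegression(solver='lbfgs', max_iter=1000)
)
demo_pipeline.fit(X_demo, y_demo)

encoded = preprocess.fit_transform(X_demo)
print(encoded.shape)  # two one-hot columns for 'contact' plus one scaled 'age' column
```

Setting `handle_unknown='ignore'` means a category that appears only in validation or test data is encoded as all zeros instead of raising an error.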

In [17]:
X_train.head()

| | age | job | marital | education | default | housing | loan | contact | month | day_of_week | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2810 | 29 | blue-collar | married | basic.9y | no | no | no | telephone | may | wed | 5 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 |
| 25207 | 55 | technician | married | professional.course | unknown | yes | no | telephone | jun | wed | 1 | 999 | 0 | nonexistent | 1.4 | 94.465 | -41.8 | 4.962 | 5228.1 |
| 1012 | 52 | admin. | single | high.school | no | no | no | cellular | may | mon | 2 | 999 | 0 | nonexistent | -1.8 | 92.893 | -46.2 | 1.354 | 5099.1 |
| 5645 | 52 | management | married | university.degree | no | yes | no | cellular | nov | mon | 1 | 999 | 0 | nonexistent | -0.1 | 93.2 | -42.0 | 4.191 | 5195.8 |
| 8374 | 34 | management | single | university.degree | no | no | no | cellular | jul | thu | 2 | 999 | 0 | nonexistent | 1.4 | 93.918 | -42.7 | 4.962 | 5228.1 |


In [30]:
X_train.nunique()

age                75
job                12
marital             4
education           8
default             3
housing             3
loan                3
contact             2
month              10
day_of_week         5
campaign           38
pdays              27
previous            8
poutcome            3
emp.var.rate       10
cons.price.idx     26
cons.conf.idx      26
euribor3m         304
nr.employed        11
dtype: int64

In [32]:
submission = sample_submission.copy()
# Column 1 of predict_proba holds the predicted probability of class 1
submission['y'] = pipeline.predict_proba(X_test)[:, 1]
submission.to_csv('submission-0001.csv', index=False)
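For a binary problem, `predict_proba` returns one column per class, so the positive-class probability is column 1. How that column is sliced determines the shape of the result. The sketch below uses made-up probabilities to show that `[:, 1]` gives a 1-D array, the usual shape for a single DataFrame column, while `[:, 1:]` keeps a 2-D array with one column.

```python
import numpy as np

# Made-up probabilities in the layout predict_proba returns for a binary problem:
# column 0 is P(class 0), column 1 is P(class 1)
proba_demo = np.array([[0.9, 0.1],
                       [0.2, 0.8],
                       [0.6, 0.4]])

as_column = proba_demo[:, 1:]  # 2-D, shape (3, 1)
as_vector = proba_demo[:, 1]   # 1-D, shape (3,)
print(as_column.shape, as_vector.shape)
```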