# Exercise with bank marketing data

## Introduction

- Data from the UCI Machine Learning Repository: [data](https://github.com/justmarkham/DAT8/blob/master/data/bank-additional.csv), [data dictionary](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing)
- **Goal:** Predict whether a customer will purchase a bank product marketed over the phone
- `bank.csv` is already in your repo, so there is no need to download the data from the UCI website

## Step 1: Read the data into Pandas

In [None]:
import pandas as pd
url = 'bank.csv'
bank = pd.read_csv(url, sep=',')
bank.head()

## Step 2: Prepare at least three features

- Include both numeric and categorical features
- Choose features that you think might be related to the response (based on intuition or exploration)
- Think about how to handle missing values (encoded as "unknown")
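
One way to start on the last bullet is to count how often "unknown" appears in each categorical column. The sketch below runs on a tiny made-up DataFrame (`toy` is a hypothetical stand-in, not the real bank data); with the real data the same expression would be applied to `bank`.

```python
import pandas as pd

# toy stand-in for the bank data (made-up values, for illustration only)
toy = pd.DataFrame({
    'job': ['admin.', 'unknown', 'services', 'admin.'],
    'default': ['no', 'unknown', 'no', 'unknown'],
    'age': [30, 45, 52, 38],
})

# compare the categorical columns against the string "unknown",
# then sum the resulting True/False values per column
unknown_counts = (toy[['job', 'default']] == 'unknown').sum()
print(unknown_counts)
```

Whether to keep "unknown" as its own category, merge it with another value, or drop those rows depends on how many rows are affected in each column.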

In [None]:
# list all columns (for reference)
bank.columns

### convert the Outcome (response variable) to numeric

In [None]:
# convert the response to numeric values (this overwrites the Outcome column in place)
# note: rerunning this cell would map the already-numeric values to NaN
bank['Outcome'] = bank.Outcome.map({'no':0, 'yes':1})

### age

In [None]:
%matplotlib inline

In [None]:
# draw a box plot of age grouped by Outcome to judge whether age is a useful feature
bank.boxplot(column='age', by='Outcome')

### job

In [None]:
# check whether job (a categorical column) looks useful: compare the mean Outcome per job
bank.groupby('job').Outcome.mean()

In [None]:
# create job_dummies (we will add it to the bank DataFrame later)
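
A minimal sketch of one common approach, using `pd.get_dummies` on a made-up `toy` DataFrame (hypothetical values, not the real bank data). Dropping the first dummy column is an assumption here: it makes that category the baseline the others are compared against.

```python
import pandas as pd

# toy stand-in with a handful of made-up job values
toy = pd.DataFrame({'job': ['admin.', 'services', 'technician', 'admin.']})

# one indicator column per job category, named job_<category>
job_dummies = pd.get_dummies(toy.job, prefix='job')

# drop the first column so the remaining ones are read relative to that baseline
job_dummies = job_dummies.drop(columns=job_dummies.columns[0])
print(job_dummies.columns.tolist())
```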


### default

In [None]:
# Does "default" look like a useful feature? It is categorical, so compare the mean Outcome per category as you did for job


In [None]:
# Does it need three values, or can some be combined into two? (hint: check .value_counts())


In [None]:
# If so, treat it as a 2-class feature rather than a 3-class feature (hint: use .map() to recode the values)
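
The two hints for "default" can be sketched together as follows. The data is a made-up `toy` column, and the grouping chosen here ('no' versus everything else) is only an assumption; check the counts in the real data before deciding.

```python
import pandas as pd

# toy stand-in with made-up values for the default column
toy = pd.DataFrame({'default': ['no', 'unknown', 'no', 'yes', 'no', 'unknown']})

# how many rows fall into each category?
print(toy.default.value_counts())

# assumed grouping: 'no' -> 0, everything else ('unknown', 'yes') -> 1
toy['default'] = toy.default.map({'no': 0, 'unknown': 1, 'yes': 1})
print(toy.default.tolist())
```

Any value missing from the dictionary passed to `.map()` becomes NaN, so every category that occurs in the column should be listed.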


### contact

In [None]:
# Repeat the same analysis for "contact", which is also a categorical variable


In [None]:
# If it is worth adding to your model, convert it to numeric values, since scikit-learn only accepts numeric features


### month

In [None]:
# Looks like a useful feature at first glance, but is it really? Repeat the categorical analysis from above


In [None]:
# But, it looks like the month's success rate is actually just correlated with number of calls
# thus, the month feature is unlikely to generalize
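
One way to check the claim above is to put the success rate and the number of calls side by side for each month. The sketch uses a made-up `toy` DataFrame whose values were chosen only to show the pattern; it says nothing about the real data.

```python
import pandas as pd

# toy stand-in: one row per call, with made-up months and outcomes
toy = pd.DataFrame({
    'month': ['may', 'may', 'may', 'may', 'dec', 'dec'],
    'Outcome': [0, 0, 0, 1, 1, 1],
})

# per month: success rate (mean of Outcome) next to the number of calls (count)
by_month = toy.groupby('month').Outcome.agg(['mean', 'count'])
print(by_month)
```

If the months with the highest success rate are also the months with very few calls, the rate may reflect how selectively customers were called rather than anything about the month itself.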


### euribor3m

In [None]:
# Does it look like a good feature? (hint: repeat the analysis you did for the numeric variable age)
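
Alongside a box plot, a numeric summary per Outcome class can help judge a numeric feature. This is a minimal sketch on a made-up `toy` DataFrame (hypothetical values, not real euribor3m rates from the bank data).

```python
import pandas as pd

# toy stand-in with made-up rates and outcomes
toy = pd.DataFrame({
    'euribor3m': [4.9, 4.8, 1.3, 1.2, 4.9, 0.9],
    'Outcome': [0, 0, 1, 1, 0, 1],
})

# mean and median of the feature within each Outcome class
summary = toy.groupby('Outcome').euribor3m.agg(['mean', 'median'])
print(summary)
```

A clear gap between the two classes suggests the feature carries signal; heavily overlapping values suggest it does not.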


## Step 3: Model building

- Use cross-validation to evaluate the AUC of a logistic regression model with your chosen features
- Try to increase the AUC by selecting different sets of features

In [None]:
# choose your final predictors from the available columns (including any dummy columns)
# feature_cols = 

In [None]:
# Set your X variable (including the features from above)
X = bank[feature_cols]

In [None]:
# create y
y = bank.Outcome

In [None]:
# calculate cross-validated AUC and print it
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
# a very large C makes the regularization penalty negligible
logreg = LogisticRegression(C=1e9)
# mean AUC across the 10 cross-validation folds
print(cross_val_score(logreg, X, y, cv=10, scoring='roc_auc').mean())
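
For reference, the whole of Step 3 can be sketched end to end on synthetic data. Everything about `toy` below is made up (random values and an invented relationship between `euribor3m` and `Outcome`), and `max_iter=1000` is an added assumption to give the solver room to converge; only the sequence of steps is meant to carry over to the real `bank` DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# synthetic stand-in for the prepared bank data (random values, illustration only)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    'euribor3m': rng.uniform(0.5, 5.0, size=n),
    'job': rng.choice(['admin.', 'services', 'technician'], size=n),
})
# made-up response: a lower euribor3m makes a "yes" more likely, plus noise
toy['Outcome'] = (toy.euribor3m + rng.normal(0, 1.5, size=n) < 2.5).astype(int)

# add integer dummy columns for job, dropping the first category as the baseline
job_dummies = pd.get_dummies(toy.job, prefix='job', dtype=int).iloc[:, 1:]
toy = pd.concat([toy, job_dummies], axis=1)

# features: one numeric column plus the dummy columns
feature_cols = ['euribor3m'] + job_dummies.columns.tolist()
X = toy[feature_cols]
y = toy.Outcome

# 10-fold cross-validated AUC of a logistic regression with negligible regularization
logreg = LogisticRegression(C=1e9, max_iter=1000)
auc = cross_val_score(logreg, X, y, cv=10, scoring='roc_auc').mean()
print(round(auc, 3))
```

To compare feature sets, change `feature_cols` and rerun the last few lines; the set with the higher cross-validated AUC is the better candidate.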