# Workshop: Analyzing bank marketing data with scikit-learn

Task: Your client has given you a dataset and asked you to build a model to:
1. Predict whether a given customer is likely to purchase a bank term deposit.
2. Analyze the factors that make a customer more (or less) likely to purchase a bank term deposit.

Build this model by going through the process of tackling classification problems:
1. Load and explore data
2. Preprocess / clean data
3. Train the model
4. Evaluate the model
5. Use the model (for prediction and interpretation)

According to the dataset's [README](http://archive.ics.uci.edu/ml/datasets/Bank+Marketing), the data relates to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (variable y). See the README for more information on the dataset.

In [1]:
# Load libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

pd.options.display.max_columns = 50
%precision 3

'%.3f'

## 1. Load and explore the data

In [2]:
# Load data
df = pd.read_csv('../data/bank-marketing-data/bank-additional-full.csv', sep=';')

### Data exploration

In [3]:
# see the top n rows by calling df.head(n)

# YOUR CODE HERE:
df.head()

| | age | job | marital | education | default | housing | loan | contact | month | day_of_week | duration | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56 | housemaid | married | basic.4y | no | no | no | telephone | may | mon | 261 | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 1 | 57 | services | married | high.school | unknown | no | no | telephone | may | mon | 149 | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 2 | 37 | services | married | high.school | no | yes | no | telephone | may | mon | 226 | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 3 | 40 | admin. | married | basic.6y | no | no | no | telephone | may | mon | 151 | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 4 | 56 | services | married | high.school | no | no | yes | telephone | may | mon | 307 | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |


In [None]:

# see summary statistics by calling df.describe()

# YOUR CODE HERE:

## 2. Clean and preprocess data

Because data cleaning with pandas warrants a workshop of its own, the following cells do the cleanup for you. There are plenty of pandas tutorials you can follow to learn more about data cleaning.

In [None]:
# removing rows with unknown values
for column in df.columns:
    if (df[column].dtype == object): 
        df = df[df[column] != 'unknown'] ## remove rows with unknown

df.head()
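
The loop above filters one column at a time. As a minimal sketch, the same idea can be written as a single boolean mask. The frame `toy` and its values are made up for illustration and are not taken from the bank dataset.

```python
import pandas as pd

# Toy frame standing in for the bank data (hypothetical values)
toy = pd.DataFrame({
    "age": [30, 41, 52, 28],
    "job": ["admin.", "unknown", "services", "technician"],
    "housing": ["yes", "no", "unknown", "no"],
})

# True for every row that contains the string 'unknown' in any column;
# numeric columns simply compare as False
has_unknown = (toy == "unknown").any(axis=1)

# Keep only the rows without any 'unknown' value
cleaned = toy[~has_unknown]
print(cleaned)
```

Building the mask first also makes it easy to check how many rows will be dropped (`has_unknown.sum()`) before committing to the filter.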

In [None]:
# Convert string data to numerical data so that scikit-learn can understand it
cols_to_transform = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week',
                    'poutcome', 'y']
df_with_dummies = pd.get_dummies(df, columns = cols_to_transform)


In [None]:
# Remove the redundant 'y_no' column generated by .get_dummies()
df = df_with_dummies.drop('y_no', axis=1)
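
To get a feel for what `pd.get_dummies()` produces, here is a small sketch on a made-up frame. The frame `toy` and its values are assumptions for illustration. The `dtype=int` argument is optional and is used here only so that the indicator columns print as 0/1.

```python
import pandas as pd

# Toy frame with one numeric column, one categorical column and a yes/no target
toy = pd.DataFrame({
    "age": [30, 41, 52],
    "marital": ["married", "single", "married"],
    "y": ["no", "yes", "no"],
})

# Each listed column is replaced by one indicator column per category
encoded = pd.get_dummies(toy, columns=["marital", "y"], dtype=int)

# 'y_no' carries the same information as 'y_yes', so one of them is enough
encoded = encoded.drop("y_no", axis=1)

print(encoded.columns.tolist())
print(encoded)
```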

In [None]:
df.head()

### Convert pandas dataframe into 2 matrices for the model's consumption

In [None]:
X = df.iloc[:, df.columns != 'y_yes'].values
y = df.iloc[:, df.columns == 'y_yes'].values.ravel()

In [None]:
X.shape

In [None]:
X[0]

In [None]:
# try printing the following commands to get a sense of what X and y actually are:
# X.shape, y.shape
# X[0], y[0]
# X[any_random_integer], y[any_random_integer]
# X, y

In [None]:
# YOUR CODE HERE:

### Split data into train and test set

In [None]:
# Use sklearn's train_test_split method to split the data into train and test set

# YOUR CODE HERE:
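
If you have not used `train_test_split` before, the sketch below shows the call on a small synthetic array rather than on the bank data. `X_toy` and `y_toy` are hypothetical names, and `test_size`, `stratify` and `random_state` are optional arguments chosen here for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 synthetic samples with 2 features each, and an imbalanced 0/1 target
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 15 + [1] * 5)

# Hold out 25% of the rows for testing; stratify keeps the class ratio
# similar in both parts, random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```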



## 3. Train the model!

In [None]:
# import the LogisticRegression class from sklearn.linear_model

# YOUR CODE HERE:

In [None]:
# train the model using the .fit(x_train, y_train) method

# YOUR CODE HERE:
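
For reference, this is roughly what training looks like end to end on synthetic data. It is a sketch, not the workshop solution: the data comes from `make_classification`, and `max_iter=1000` is an assumption made here to give the solver room to converge.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem (not the bank data)
X_toy, y_toy = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_toy, y_toy, random_state=0)

# Create the estimator, then learn the coefficients from the training set
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# One coefficient per feature, plus a single intercept
print(model.coef_.shape, model.intercept_.shape)

# Mean accuracy on data the model has not seen during training
test_accuracy = model.score(X_test, y_test)
print(test_accuracy)
```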

## 4. Evaluate the model

### Evaluation method 1: `.score(X, y)`

In [None]:
# Evaluate your model's performance using the .score() method

# YOUR CODE HERE:



Note: is 90% accuracy good? Not really. A model that always predicted y=0 would be correct 88.7% of the time:

In [None]:
print(df['y_yes'].value_counts())
print('Accuracy of a model that always predicted y = 0:')
print(36548.0/(36548 + 4640))
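
The same baseline can be computed from the labels instead of typing the counts by hand. The sketch below rebuilds a label Series with the class counts used above (36548 negatives, 4640 positives). `y_toy` is a hypothetical stand-in for the real target column.

```python
import pandas as pd

# Stand-in target column with the class counts quoted above
y_toy = pd.Series([0] * 36548 + [1] * 4640)

# Share of each class; the largest share is the accuracy of a model
# that always predicts the majority class
class_shares = y_toy.value_counts(normalize=True)
baseline_accuracy = class_shares.max()

print(class_shares)
print(round(baseline_accuracy, 3))
```

Any model worth keeping should beat this number by a clear margin.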

### Evaluation method 2: `.confusion_matrix(expected, predicted)`

In [None]:
from sklearn import metrics

In [None]:
# Evaluate model using metrics.confusion_matrix(y_true, y_predicted)

# YOUR CODE HERE:



Confusion matrices are in the following format:
    
```
[[ true negatives,  false positives ]
 [ false negatives, true positives  ]]
```
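
As a small worked example of that layout, the sketch below builds a confusion matrix from hand-written labels. The lists `expected` and `predicted` are made up for illustration.

```python
from sklearn import metrics

# Hand-written true labels and predictions for 8 samples
expected = [0, 0, 0, 0, 1, 1, 1, 0]
predicted = [0, 0, 1, 0, 1, 0, 1, 0]

# Rows are the true classes, columns are the predicted classes
matrix = metrics.confusion_matrix(expected, predicted)
print(matrix)

# Flatten row by row to unpack the four cells
tn, fp, fn, tp = matrix.ravel()
print(tn, fp, fn, tp)
```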

In [None]:
# Uncomment the rest of this cell and execute it

# tn, fp, fn, tp = metrics.confusion_matrix(expected, predicted).ravel()
# tn, fp, fn, tp
# print("=== negatives ===")
# print("true negatives: {}".format(tn))
# print("false positives: {}".format(fp))
# print()
# print("=== positives ===")
# print("false negatives: {}".format(fn))
# print("true positives: {}".format(tp))

### Evaluation method 3: `.classification_report(expected, predicted)`

In [None]:
# Evaluate model using .classification_report(y_true, y_predicted)

# YOUR CODE HERE:
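
The report lists precision and recall per class. To see how those numbers follow from the confusion matrix, here is a sketch on hand-written labels (the lists are made up for illustration): precision is tp / (tp + fp) and recall is tp / (tp + fn) for the positive class.

```python
from sklearn import metrics

# Hand-written labels: 2 true positives, 1 false positive, 1 false negative
expected = [0, 0, 0, 0, 1, 1, 1, 0]
predicted = [0, 0, 1, 0, 1, 0, 1, 0]

# Of the samples predicted as 1, how many really are 1?
precision = metrics.precision_score(expected, predicted)

# Of the samples that really are 1, how many did we find?
recall = metrics.recall_score(expected, predicted)

print(precision, recall)

# The full report shows the same metrics for both classes
print(metrics.classification_report(expected, predicted))
```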



## Iteration 2: Train and evaluate the model again with a balanced dataset

In [None]:
df = pd.read_csv('../data/bank-marketing-data/bank-additional-one-hot-encoded.csv')
df['y'].value_counts()

In [None]:
negatives = df[df['y'] == 0]
positives = df[df['y'] == 1]
negatives.head()

In [None]:
number_of_samples = len(positives)
sliced_negatives = negatives.head(number_of_samples)

In [None]:
df = pd.concat([sliced_negatives, positives])
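
Note that `.head()` takes the first rows in file order, which may not be representative of all negatives. A possible alternative is to draw the negatives at random. The sketch below shows this on a made-up frame. `toy` and `sampled_negatives` are hypothetical names, and `random_state=0` is only there to make the draw reproducible.

```python
import pandas as pd

# Toy imbalanced frame: 7 negatives and 3 positives
toy = pd.DataFrame({"x": range(10), "y": [0] * 7 + [1] * 3})

negatives = toy[toy["y"] == 0]
positives = toy[toy["y"] == 1]

# Randomly draw as many negatives as there are positives
sampled_negatives = negatives.sample(n=len(positives), random_state=0)

# Stack both parts back into one balanced frame
balanced = pd.concat([sampled_negatives, positives])
print(balanced["y"].value_counts())
```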

In [None]:
X = df.iloc[:, df.columns != 'y'].values
y = df.iloc[:, df.columns == 'y'].values.ravel()

In [None]:
# Split data into train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [None]:
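# Train a new model on the balanced training set and name it model_2 (the evaluation cell below expects that name)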

# YOUR CODE HERE:



In [None]:
# Evaluate model
# Evaluate on the held-out test set, not on data the model was trained on
expected_2 = y_test
predicted_2 = model_2.predict(X_test)
report = metrics.classification_report(expected_2, predicted_2)

print("CLASSIFICATION REPORT")
print(report)

## 5. Using the model to predict outcomes based on fresh/unseen data

Load new data from '../data/bank-marketing-data/bank-unseen-data.csv'

In [None]:
df_unseen = pd.read_csv('../data/bank-marketing-data/bank-unseen-data.csv')

In [None]:
# Explore data again with df.head(). Notice that there's no 'y' column at the end

# YOUR CODE HERE:


In [None]:
# Convert our pandas dataframe to a matrix, so that the model can consume it
X_unseen = df_unseen.values

In [None]:
# Use your model to predict the y value (i.e. 0 or 1) of the new data (hint: `model.predict()`)



In [None]:
# Use your model to predict the probabilities of y being 0 or 1 (hint: `model.predict_proba()`)
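
To illustrate how the two methods differ, here is a sketch on synthetic data (not the unseen bank data). `X_new` is a hypothetical name for a handful of rows standing in for fresh samples.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Fit a model on synthetic data, then treat the first 5 rows as "new" samples
X_toy, y_toy = make_classification(n_samples=300, n_features=4, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)
X_new = X_toy[:5]

# Hard class labels: one 0 or 1 per row
labels = model.predict(X_new)

# Class probabilities: one row per sample, columns are P(y=0) and P(y=1)
probabilities = model.predict_proba(X_new)

print(labels)
print(probabilities)
```

Each row of `predict_proba()` sums to 1, and `predict()` returns the class with the larger probability.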


## Bonus: Interpreting our model

In [None]:
plt.figure(figsize=(16,9))

plt.plot(model.coef_.T, 'o', label="logistic regression model (C=1)")
# Label the ticks with feature names only; df.columns also contains the target 'y'
plt.xticks(range(X.shape[1]), df.columns.drop('y'), rotation=90)
plt.title("Coefficients of the logistic regression model")
plt.ylabel("Coefficients")
plt.xlabel("X variables")
plt.legend()

# Note: if you get an error saying 'model' is not defined, replace 'model' in the plt.plot() call with the name of your model variable

In [None]:
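# Before we can interpret coefficients as probabilities, we need to do a little math:
# log-odds -> odds -> probability.
# Note that each coefficient is multiplied by 2 below, so every value corresponds to
# one feature set to 2 with all other features at 0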
logodds = model.intercept_ + model.coef_[0] * 2
odds = np.exp(logodds)
probabilities = odds/(1 + odds)
probabilities
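
The conversion from log-odds to probability can be checked against scikit-learn itself. The sketch below, on synthetic data, computes the probability for one full sample by hand (intercept plus the dot product of coefficients and feature values) and compares it with `predict_proba()`. All names here are illustrative and are not part of the workshop data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Fit a model on synthetic data
X_toy, y_toy = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# Take one sample and compute its log-odds by hand
x = X_toy[0]
log_odds = model.intercept_[0] + np.dot(model.coef_[0], x)

# log-odds -> odds -> probability of y=1
odds = np.exp(log_odds)
probability = odds / (1 + odds)

# The model's own estimate of P(y=1) for the same sample
model_probability = model.predict_proba(x.reshape(1, -1))[0, 1]

print(probability, model_probability)
```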

In [None]:
number_of_x_vars = len(df.columns) - 1

In [None]:
plt.figure(figsize=(16,9))

plt.bar(range(0, number_of_x_vars), probabilities)
plt.title("Probabilities of outcome where y=1 given a unit change in X")
plt.xlabel("X variables")
plt.ylabel("Probability")
# Reference line at a probability of 0.5
plt.axhline(y=0.5, alpha=0.5, label="p = 0.5")
# Label the ticks with feature names only; df.columns also contains the target 'y'
plt.xticks(range(X.shape[1]), df.columns.drop('y'), rotation=90)
plt.legend()

#### How to interpret the chart 

We can interpret the chart above as follows: given a unit increase in X, the customer is predicted to be \__% more likely to purchase a bank term deposit (i.e. y=1).

For example, given a unit increase in employment variation rate (the first positive blip in the chart), the customer is predicted to be 16% more likely to purchase a bank term deposit.

#### Based on this chart, we can observe the following: 
    
Attributes that have a positive effect on the outcome:
- contact_cellular
- month_aug
- month_oct
- day_of_week_fri

Attributes that have a negative effect on the outcome:
- emp.var.rate
- cons.price.idx
- cons.conf.idx
- euribor3m
- education_basic.4y
- contact_telephone
- month_may