<h1><center>Credit Risk Analysis using IBM Snap ML</center></h1>

In this notebook, we will explore a customer credit history dataset (1 million observations) and train a logistic regression model to predict credit default behavior. You first need to download this dataset into the data folder. You can find the dataset at: https://ibm.box.com/s/ithxaw0lx7ccyylrek1sge0v4kajbqfx

We will compare how long it takes to train a logistic regression model with SKLearn and with Snap ML on CPU.
We will also use some common data science libraries: NumPy and pandas for working with data, Matplotlib for visualisations, and SKLearn for training our machine learning model.

# Imports

In [None]:
from __future__ import print_function
import numpy as np
import pandas as pd
pd.options.display.max_columns = 999
import matplotlib.pyplot as plt
import sklearn
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import MinMaxScaler, LabelEncoder, normalize
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, accuracy_score, roc_curve, roc_auc_score
from scipy.stats import chi2_contingency,ttest_ind
from sklearn.utils import shuffle
import time
import zipfile
import os
import warnings
warnings.filterwarnings('ignore')

In [None]:
%matplotlib inline

# Dataset Visualization

In [4]:
if not os.path.exists("data/credit_customer_history.csv.zip"):
    print("Dataset is not in the data folder. Download data from: https://ibm.box.com/s/8r2osyysmrh0tks9dsd3oo501lqh1v0m")

Dataset is not in the data folder. Download data from: https://ibm.box.com/s/8r2osyysmrh0tks9dsd3oo501lqh1v0m


In [1]:
# unzip dataset
with zipfile.ZipFile("data/credit_customer_history.csv.zip","r") as zip_ref:
    zip_ref.extractall("data")

In [None]:
cust_pd = pd.read_csv('data/credit_customer_history.csv')

In [None]:
print("There are " + str(len(cust_pd)) + " observations in the customer history dataset.")
print("There are " + str(len(cust_pd.columns)) + " variables in the dataset.")

In [None]:
cust_pd.info()

In [None]:
cust_pd.head()

In [None]:
cust_pd.describe()

In [None]:
del cust_pd['NUMBER_CREDITS']

In [None]:
cust_pd.describe()

In [None]:
x = cust_pd['EMI_TENURE']
y = cust_pd['TRANSACTION_AMOUNT']

x_bin = len(cust_pd['EMI_TENURE'].unique())
y_bin = len(cust_pd['TRANSACTION_AMOUNT'].unique())

fig = plt.figure(figsize=(0.3*x_bin, 0.3*y_bin))

graph = plt.hist2d(x,y, bins=(x_bin, y_bin))
plt.xlabel('EMI_TENURE')
plt.ylabel('TRANSACTION_AMOUNT')
plt.colorbar(graph[3])

plt.show()

In [None]:
print(cust_pd.groupby(['IS_DEFAULT']).size())
# Bar chart of the class balance; the title reports the total number of observations
default_plot = cust_pd['IS_DEFAULT'].value_counts(sort=True, ascending=False).plot(kind='bar',figsize=(4,4),title="Loan default occurrences (total observations: " + str(cust_pd['IS_DEFAULT'].count()) + ")", color=['#BB6B5A','#8CCB9B'])
default_plot.set_xlabel("IS_DEFAULT")
default_plot.set_ylabel("Frequency")

# Data Preprocessing

Data preparation is a very important step in building a machine learning model, because a model can perform well only when the data it is trained on is clean and well prepared. As a result, this step consumes the bulk of the time a data scientist spends building models.

During this process, we identify the categorical columns in the dataset. Categories need to be indexed, which means the string labels are converted to label indices. These label indices are then encoded as binary vectors using one-hot encoding. This encoding allows algorithms that expect continuous features to use categorical features.
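
To make the two steps concrete, here is a minimal sketch on a toy frame. The column values below are made up for illustration and are not taken from the credit dataset; only `LabelEncoder` and `pd.get_dummies`, which this notebook already uses, are assumed.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data: the values are illustrative only, not from the credit dataset.
toy = pd.DataFrame({
    "ACCOUNT_TYPE": ["savings", "checking", "savings"],
    "AMOUNT": [100, 250, 75],
    "IS_DEFAULT": ["No", "Yes", "No"],
})

# Label indexing: each distinct string is mapped to an integer index
# (LabelEncoder sorts the classes, so "No" comes before "Yes").
le = LabelEncoder()
labels = le.fit_transform(toy["IS_DEFAULT"])
print(le.classes_.tolist(), labels.tolist())

# One-hot encoding: one indicator column per category value;
# numeric columns such as AMOUNT pass through unchanged.
encoded = pd.get_dummies(toy.drop(columns=["IS_DEFAULT"]), columns=["ACCOUNT_TYPE"])
print(encoded.columns.tolist())
```

Note that `pd.get_dummies` works directly on the string columns, so the intermediate label indices never appear explicitly for the features.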

In [None]:
# Split dataframe into Features (X) and Labels (y)
cust_pd_Y = cust_pd[['IS_DEFAULT']].copy()  # copy so the label column can be modified without a SettingWithCopyWarning
cust_pd_X = cust_pd.drop(['IS_DEFAULT'],axis=1)
print('cust_pd_X.shape=', cust_pd_X.shape, 'cust_pd_Y.shape=', cust_pd_Y.shape)

# Transform Labels (y)

In [None]:
le = LabelEncoder()
cust_pd_Y["IS_DEFAULT"] = le.fit_transform(cust_pd_Y['IS_DEFAULT'])
cust_pd_Y.head()

# Transform Features (X)

In [None]:
print('features X dataframe shape = ', cust_pd_X.shape)

# One-Hot Encoding of Categorical Features

In [None]:
categoricalColumns = ['CREDIT_HISTORY', 'TRANSACTION_CATEGORY', 'ACCOUNT_TYPE', 'ACCOUNT_AGE',
                      'STATE', 'IS_URBAN', 'IS_STATE_BORDER', 'HAS_CO_APPLICANT', 'HAS_GUARANTOR',
                      'OWN_REAL_ESTATE', 'OTHER_INSTALMENT_PLAN',
                      'OWN_RESIDENCE', 'RFM_SCORE', 'OWN_CAR', 'SHIP_INTERNATIONAL']
cust_pd_X = pd.get_dummies(cust_pd_X, columns=categoricalColumns)
print('features X dataframe shape = ', cust_pd_X.shape)

In [None]:
cust_pd_X.head()

# Normalize Features

In [None]:
min_max_scaler = MinMaxScaler()
features = min_max_scaler.fit_transform(cust_pd_X)
features = normalize(features, axis=1, norm='l1')

cust_pd_X = pd.DataFrame(features,columns=cust_pd_X.columns)
cust_pd_X.head()
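
The effect of the two transformations above is easier to see on a tiny matrix. The sketch below uses made-up numbers: `MinMaxScaler` rescales each column to the range [0, 1], and `normalize(..., axis=1, norm='l1')` then rescales each row so that its absolute values sum to 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, normalize

# Made-up feature matrix: 3 samples, 3 features on very different scales.
X = np.array([[1.0, 10.0, 0.0],
              [2.0, 20.0, 1.0],
              [3.0, 40.0, 1.0]])

# Column-wise scaling: every feature ends up in [0, 1].
scaled = MinMaxScaler().fit_transform(X)

# Row-wise L1 normalization: every non-zero row sums to 1.
# A row that is all zeros after scaling is left as zeros.
row_normalized = normalize(scaled, axis=1, norm="l1")

print(scaled)
print(row_normalized.sum(axis=1))
```

In this toy matrix the first sample holds the column minimum for every feature, so it becomes an all-zero row and stays that way after row normalization.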

# Generate Train and Test Datasets

In [None]:
labels    = cust_pd_Y.values.ravel()  # 1-D label array, the shape scikit-learn estimators expect
features  = cust_pd_X.values

In [None]:
# Hold out 30% of the data for testing; stratify so both splits keep the same default rate
X_train,X_test,y_train,y_test = train_test_split(features, labels, test_size=0.3, random_state=42, stratify=labels)

print('X_train.shape=', X_train.shape, 'Y_train.shape=', y_train.shape)
print('X_test.shape=', X_test.shape, 'Y_test.shape=', y_test.shape)
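
As a quick check of what `stratify` does, the sketch below splits a synthetic label vector with a 20% positive rate (an arbitrary figure chosen for illustration, not the default rate of the credit dataset) and compares the positive rate in each split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 samples, 20% positives (illustrative numbers only).
rng = np.random.RandomState(0)
X = rng.rand(1000, 3)
y = np.array([1] * 200 + [0] * 800)

# Same arguments as the split above, applied to the synthetic data.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# With stratification, both splits keep the original class ratio.
print("train positive rate:", y_tr.mean())
print("test positive rate: ", y_te.mean())
```

Without `stratify`, the positive rate in each split would vary with the random seed, which matters when one class is much rarer than the other.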

# Train a Logistic Regression Model using Scikit-Learn

In [None]:
from sklearn.linear_model import LogisticRegression
sklearn_lr = LogisticRegression()

# Alternatively, import it from the Snap ML library and choose a scikit-learn solver, e.g. liblinear:
# from pai4sk.linear_model import LogisticRegression
# sklearn_lr = LogisticRegression(solver='liblinear')
print(sklearn_lr)

In [None]:
# Train a logistic regression model using Scikit-Learn
t0 = time.time()
sklearn_lr.fit(X_train, y_train)
sklearn_time = time.time() - t0
print("[sklearn] Training time (s):  {0:.2f}".format(sklearn_time))

# Evaluate accuracy on test set
sklearn_pred = sklearn_lr.predict(X_test)
print('[sklearn] Accuracy score : {0:.6f}'.format(accuracy_score(y_test, sklearn_pred)))

# Train a Logistic Regression Model using Snap ML

In [None]:
from pai4sk import LogisticRegression
# use_gpu=True trains on the GPU; set use_gpu=False to train on the CPU
# snapml_lr = LogisticRegression(use_gpu=True, device_ids=[0])
snapml_lr = LogisticRegression(use_gpu=True)

In [None]:
print(snapml_lr.get_params())

In [None]:
# Train a logistic regression model using Snap ML
t0 = time.time()
model = snapml_lr.fit(X_train, y_train)
snapml_time = time.time() - t0
print("[Snap ML] Training time (s):  {0:.2f}".format(snapml_time))

# Evaluate accuracy on test set
snapml_pred = snapml_lr.predict(X_test)
print('[Snap ML] Accuracy score : {0:.6f}'.format(accuracy_score(y_test, snapml_pred)))
print('[Logistic Regression] Snap ML vs. sklearn speedup : {0:.2f}x '.format(sklearn_time/snapml_time))
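
The timing pattern used in the two training cells can be wrapped in a small helper so that every model is measured the same way. The sketch below is an illustration only: `time_fit` is a hypothetical helper name, the data is synthetic, and two scikit-learn solvers stand in for the two libraries so that the example runs without Snap ML installed.

```python
import time
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def time_fit(estimator, X, y):
    """Fit the estimator and return the elapsed wall-clock time in seconds."""
    t0 = time.perf_counter()
    estimator.fit(X, y)
    return time.perf_counter() - t0

# Synthetic, linearly separable data (illustrative only).
rng = np.random.RandomState(0)
X = rng.rand(2000, 5)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# Two scikit-learn solvers used as stand-ins for two different libraries.
candidates = [
    ("lbfgs", LogisticRegression(solver="lbfgs")),
    ("liblinear", LogisticRegression(solver="liblinear")),
]

results = {}
for name, est in candidates:
    elapsed = time_fit(est, X, y)
    acc = accuracy_score(y, est.predict(X))
    results[name] = (elapsed, acc)
    print("[{0}] Training time (s): {1:.4f}  Accuracy: {2:.4f}".format(name, elapsed, acc))

# Relative speed of the two fits; guard against a zero denominator.
speedup = results["lbfgs"][0] / max(results["liblinear"][0], 1e-9)
print("lbfgs vs. liblinear time ratio: {0:.2f}x".format(speedup))
```

A single run is noisy, so for a fair comparison it helps to repeat each fit a few times and compare the best or median time.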

# Q: Can you train a Random Forest Classifier with SKLearn and Snap ML and compare the results? 

Now let's see if you can use the Snap ML API guide to train a Random Forest Classifier and compare its performance with SKLearn.

Snap ML API: https://ibmsoe.github.io/snap-ml-doc/v1.6.0/ranforapidoc.html

In [None]:
# specify model parameters
max_depth     =  10
n_estimators  =  50
n_jobs        =  16     # e.g. number of threads
max_features  =  4

In [None]:
from sklearn.ensemble import RandomForestClassifier
sklearn_rf = RandomForestClassifier(random_state=0, max_depth=max_depth, n_estimators=n_estimators, n_jobs=n_jobs, max_features=max_features)

In [None]:
# Training a random forest model using scikit-learn
t0 = time.time()
sklearn_rf.fit(X_train, y_train)
sklearn_time = time.time() - t0
print("[sklearn] Training time (s):  {0:.5f}".format(sklearn_time))

# Evaluate accuracy on test set
sklearn_pred = sklearn_rf.predict(X_test)
print('[sklearn] Accuracy score : ', accuracy_score(y_test, sklearn_pred))

In [None]:
# Import the Random Forest model directly from the SnapML package
# [insert code here]

In [None]:
# Training a random forest model using Snap ML
# [insert code here]

In [None]:
# Evaluate accuracy on test set
# [insert code here]

&copy; Copyright IBM Corporation 2019