<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc" style="margin-top: 1em;"><ul class="toc-item"><li><span><a href="#Machine-Learning-with-Python:-Hands-On-Practice-in-Business-Scenario" data-toc-modified-id="Machine-Learning-with-Python:-Hands-On-Practice-in-Business-Scenario-1">Machine Learning with Python: Hands-On Practice in Business Scenario</a></span><ul class="toc-item"><li><span><a href="#Loan-Default-Prediction-(LDP)-Competition-Introduction" data-toc-modified-id="Loan-Default-Prediction-(LDP)-Competition-Introduction-1.1">Loan Default Prediction (LDP) Competition Introduction</a></span></li><li><span><a href="#Data-Description" data-toc-modified-id="Data-Description-1.2">Data Description</a></span></li><li><span><a href="#Results-Evaluation" data-toc-modified-id="Results-Evaluation-1.3">Results Evaluation</a></span></li><li><span><a href="#Assignment" data-toc-modified-id="Assignment-1.4">Assignment</a></span></li><li><span><a href="#Task-1:-Exploratory-data-analysis" data-toc-modified-id="Task-1:-Exploratory-data-analysis-1.5">Task 1: Exploratory data analysis</a></span></li><li><span><a href="#Task-2:-Classification-w/o-feature-engineering" data-toc-modified-id="Task-2:-Classification-w/o-feature-engineering-1.6">Task 2: Classification w/o feature engineering</a></span></li><li><span><a href="#Task-3:-Classification-w/-feature-engineering" data-toc-modified-id="Task-3:-Classification-w/-feature-engineering-1.7">Task 3: Classification w/ feature engineering</a></span></li><li><span><a href="#Task-4:-Classification-plus-regression-and-submit-the-results-to-the-forum" data-toc-modified-id="Task-4:-Classification-plus-regression-and-submit-the-results-to-the-forum-1.8">Task 4: Classification plus regression and submit the results to the forum</a></span></li></ul></li></ul></div>

# Machine Learning with Python: Hands-On Practice in Business Scenario

**Wang, En Qun EwenWangSH@cn.ibm.com**

## Loan Default Prediction (LDP) Competition Introduction

This competition asks you to determine whether a loan will default, as well as the loss incurred if it does default. Unlike traditional finance-based approaches to this problem, where one distinguishes between good and bad counterparties in a binary way, we seek to anticipate and incorporate both the default and the severity of the losses that result. In doing so, we are building a bridge between traditional banking, where the aim is to reduce the consumption of economic capital, and an asset-management perspective, where we optimize the risk to the financial investor.

This competition is sponsored by researchers at Imperial College London.

![image](https://kaggle2.blob.core.windows.net/competitions/kaggle/3756/media/icl_logo.gif)

## Data Description

This data corresponds to a set of financial transactions associated with individuals. The data has been standardized, detrended, and anonymized. You are provided with over two hundred thousand observations and nearly 800 features.  Each observation is independent from the previous. 

For each observation, it was recorded whether a default was triggered. In case of a default, the loss was measured. This quantity lies between 0 and 100. It has been normalized, considering that the notional of each transaction at inception is 100. For example, a loss of 60 means that only 40 is reimbursed. If the loan did not default, the loss was 0. You are asked to predict the losses for each observation in the test set.

Missing feature values have been kept as is, so that the competing teams can really use the maximum data available, implementing a strategy to fill the gaps if desired. Note that some variables may be categorical (e.g. $f776$ and $f777$).

The competition sponsor has worked to remove time-dimensionality from the data. However, the observations are still listed in order from old to new in the training set. In the test set they are in random order.

Please go to [Kaggle](https://www.kaggle.com/c/loan-default-prediction/data) and download the training and test data.


## Results Evaluation

This competition is evaluated on the mean absolute error (MAE):

$$MAE = \frac{1}{n} \sum_{i=1}^{n}{|y_i - \hat{y_i}|}$$

where

$n$ is the number of rows

$\hat{y_i}$ is the predicted loss

$y_i$ is the actual loss
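
As a minimal sketch of the metric above, the function below computes the MAE with `numpy`. The function name and the toy values are made up for illustration and are not part of the competition materials.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Mean absolute error between actual and predicted losses."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    return float(np.mean(np.abs(y_true - y_pred)))

# Toy values (made up): three non-defaults and one default with a loss of 60.
actual = [0, 0, 60, 0]
predicted = [0, 5, 40, 0]
mae = mean_absolute_error(actual, predicted)
print(mae)  # (0 + 5 + 20 + 0) / 4 = 6.25
```

Because losses are never negative, predicting 0 for every row gives an MAE equal to the mean loss. The `train.describe()` output in Task 1 reports a mean loss of 0.799585, so an all-zero prediction would score roughly 0.80 on the training set, which is a useful baseline to beat.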


## Assignment

A two-step process is an intuitive solution to this competition: **classification to predict the defaulters**, followed by **regression to predict the loss** (more precisely, $\log(loss+1)$).
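
One possible way to glue the two steps together is sketched below. The helper name `combine_predictions` and the toy inputs are assumptions for illustration; the only facts taken from the text are that the regression target is $\log(loss+1)$ and that losses lie between 0 and 100.

```python
import numpy as np

def combine_predictions(is_default, log_loss_pred):
    """Turn step-1 labels and step-2 log-scale predictions into final losses."""
    is_default = np.asarray(is_default, dtype=bool)
    log_loss_pred = np.asarray(log_loss_pred, dtype=float)
    # Invert log(loss + 1) with expm1, then keep the result inside [0, 100].
    loss = np.clip(np.expm1(log_loss_pred), 0.0, 100.0)
    # Rows classified as non-default get a loss of exactly 0.
    return np.where(is_default, loss, 0.0)

# Toy inputs (made up): classifier labels and regressor outputs on the log scale.
flags = [0, 1, 1, 0]
log_preds = [0.3, np.log1p(10.0), np.log1p(250.0), 2.0]
final_loss = combine_predictions(flags, log_preds)
print(final_loss)
```

In this sketch the regressor's output is ignored wherever the classifier predicts no default, and an out-of-range prediction (250 here) is clipped to the maximum possible loss of 100.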

In this hands-on practice, we divide the competition into four tasks as follows:

- Task 1: Exploratory data analysis.
- Task 2: Classification w/o feature engineering.
- Task 3: Classification w/ feature engineering.
- Task 4: Classification plus regression and submit the results to the forum.

You may complete **Task 1** and **Task 2** during the training and the remaining two tasks after class.

---

## Task 1: Exploratory data analysis

Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. “EDA” is a critical first step in analyzing the data from an experiment. Here are the main reasons we use EDA:

- detection of mistakes
- checking of assumptions
- preliminary selection of appropriate models
- determining relationships among the explanatory variables, and
- assessing the direction and rough size of relationships between explanatory and outcome variables.

Loosely speaking, any method of looking at data that does not include formal statistical modeling and inference falls under the term exploratory data analysis.

**Task: Do as much EDA as possible with the training data.**

**Requirement: Please provide well-reasoned EDA with descriptions. Never drop in a chart or graph without an explanation!**

In [19]:
import os
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

wd = '/Users/ewenwang/Documents/practice_data'  # use your work directory
file_train = 'loan_default.csv'                 # use your training set file name
os.chdir(wd)

train = pd.read_csv(file_train)

In [7]:
train.info()                       # get the information of the training set
train.head()                       # take a look at the training set

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105471 entries, 0 to 105470
Columns: 771 entries, id to loss
dtypes: float64(653), int64(99), object(19)
memory usage: 620.4+ MB


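| | id | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | ... | f770 | f771 | f772 | f773 | f774 | f775 | f776 | f777 | f778 | loss |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 126 | 10 | 0.686842 | 1100 | 3 | 13699 | 7201.0 | 4949.0 | 126.75 | ... | 5 | 2.14 | -1.54 | 1.18 | 0.1833 | 0.7873 | 1 | 0 | 5 | 0 |
| 1 | 2 | 121 | 10 | 0.782776 | 1100 | 3 | 84645 | 240.0 | 1625.0 | 123.52 | ... | 6 | 0.54 | -0.24 | 0.13 | 0.1926 | -0.6787 | 1 | 0 | 5 | 0 |
| 2 | 3 | 126 | 10 | 0.50008 | 1100 | 3 | 83607 | 1800.0 | 1527.0 | 127.76 | ... | 13 | 2.89 | -1.73 | 1.04 | 0.2521 | 0.7258 | 1 | 0 | 5 | 0 |
| 3 | 4 | 134 | 10 | 0.439874 | 1100 | 3 | 82642 | 7542.0 | 1730.0 | 132.94 | ... | 4 | 1.29 | -0.89 | 0.66 | 0.2498 | 0.7119 | 1 | 0 | 5 | 0 |
| 4 | 5 | 109 | 9 | 0.502749 | 2900 | 4 | 79124 | 89.0 | 491.0 | 122.72 | ... | 26 | 6.11 | -3.82 | 2.51 | 0.2282 | -0.5399 | 0 | 0 | 5 | 0 |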


In [12]:
train.describe()

| | id | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | ... | f770 | f771 | f772 | f773 | f774 | f775 | f776 | f777 | f778 | loss |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 105471.0 | 105471.0 | 105471.0 | 105471.0 | 105471.0 | 105471.0 | 105471.0 | 105289.0 | 105370.0 | 105471.0 | ... | 105471.0 | 105471.0 | 105471.0 | 105471.0 | 104407.0 | 103946.0 | 105471.0 | 105471.0 | 105471.0 | 105471.0 |
| mean | 52736.0 | 134.603171 | 8.246883 | 0.499066 | 2678.488874 | 7.354533 | 47993.704317 | 2974.336018 | 2436.363718 | 134.555225 | ... | 17.422543 | 5.800976 | -4.246788 | 3.273059 | 0.233852 | 0.014797 | 0.310246 | 0.322847 | 175.951589 | 0.799585 |
| std | 30446.999458 | 14.725467 | 1.691535 | 0.288752 | 1401.010943 | 5.151112 | 35677.136048 | 2546.551085 | 2262.950221 | 13.824682 | ... | 18.548936 | 6.508555 | 4.828265 | 3.766746 | 0.073578 | 1.039439 | 0.462597 | 0.467567 | 298.294043 | 4.32112 |
| min | 1.0 | 103.0 | 1.0 | 6e-06 | 1100.0 | 1.0 | 0.0 | 1.0 | 1.0 | 106.82 | ... | 2.0 | 0.0 | -43.16 | 0.0 | 0.0 | -18.4396 | 0.0 | 0.0 | 2.0 | 0.0 |
| 25% | 26368.5 | 124.0 | 8.0 | 0.24895 | 1500.0 | 4.0 | 11255.0 | 629.0 | 746.0 | 124.29 | ... | 5.0 | 1.48 | -5.7 | 0.74 | 0.1984 | -0.704275 | 0.0 | 0.0 | 19.0 | 0.0 |
| 50% | 52736.0 | 129.0 | 9.0 | 0.498267 | 2200.0 | 4.0 | 76530.0 | 2292.0 | 1786.0 | 128.46 | ... | 11.0 | 3.57 | -2.6 | 1.99 | 0.2518 | 0.3754 | 0.0 | 0.0 | 40.0 | 0.0 |
| 75% | 79103.5 | 148.0 | 9.0 | 0.749494 | 3700.0 | 10.0 | 80135.0 | 4679.0 | 3411.0 | 149.08 | ... | 23.0 | 7.7 | -1.01 | 4.44 | 0.2836 | 0.7371 | 1.0 | 1.0 | 104.0 | 0.0 |
| max | 105471.0 | 176.0 | 11.0 | 0.999994 | 7900.0 | 17.0 | 88565.0 | 9968.0 | 11541.0 | 172.95 | ... | 168.0 | 58.12 | 0.0 | 34.04 | 0.4737 | 11.092 | 1.0 | 1.0 | 1212.0 | 100.0 |
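
In the summary above, the `count` entries for columns such as `f7`, `f8`, `f774`, and `f775` are below 105471, which points to missing values. A natural next EDA step is to quantify them per column and to check how rare defaults are. The sketch below does this on a small made-up frame; `summarize_missing` is a hypothetical helper name, and on the real data you would pass `train` instead of `toy`.

```python
import numpy as np
import pandas as pd

def summarize_missing(df):
    """Return the count and share of missing values per column, worst first."""
    n_missing = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": 100.0 * n_missing / len(df),
    })
    # Keep only the columns that actually have gaps.
    summary = summary[summary["n_missing"] > 0]
    return summary.sort_values("n_missing", ascending=False)

# Toy frame standing in for `train` (values are made up).
toy = pd.DataFrame({
    "f7": [7201.0, np.nan, 1800.0, np.nan],
    "f8": [4949.0, 1625.0, np.nan, 1730.0],
    "loss": [0, 0, 12, 0],
})
missing_report = summarize_missing(toy)
print(missing_report)

# Share of rows with a positive loss, i.e. the default rate.
default_rate = (toy["loss"] > 0).mean()
print(default_rate)
```

Both numbers feed directly into later decisions: the missing-value report suggests which columns need imputation, and the default rate tells you how imbalanced the classification target is.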


**Help:**

`pandas_profiling` generates profile reports from a `pandas` DataFrame. The `df.describe()` function is great but a little basic for serious exploratory data analysis.

For each column the following statistics - if relevant for the column type - are presented in an interactive HTML report:

- Essentials: type, unique values, missing values
- Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
- Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
- Most frequent values
- Histogram

Go to [GitHub](https://github.com/JosPolfliet/pandas-profiling) to find more information about `pandas_profiling`.

In [None]:
import pandas_profiling as pp

# pp.ProfileReport(train)

Now, it's your turn!

---

## Task 2: Classification w/o feature engineering

We will explore the classification algorithms you learned in the morning.

**Task: Try any classification algorithms you like and evaluate them with the AUC score.**

**Requirement: Start with simple statistical models such as logistic regression, and then dive deeper with algorithms such as SVM, bagging, boosting, or even artificial neural networks.**

**Notes:** 
1. All the standard algorithms can be found in the `sklearn` package.
2. If you are already familiar with the standard machine learning algorithms, please try more efficient implementations such as `xgboost` and `lightgbm`.
3. The algorithms mentioned above are either single models or single ensemble algorithms; to go further, please try a more advanced technique such as **stacking**.
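
For readers who have not seen stacking before, here is a minimal sketch using `sklearn.ensemble.StackingClassifier` (available in scikit-learn 0.22 and later). The data is synthetic, and the choice of base models, `cv=3`, and all hyperparameters are illustrative assumptions rather than recommended settings for this competition.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data standing in for the loan features.
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           weights=[0.9, 0.1], random_state=2018)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=2018)

# Base models produce out-of-fold predictions; the final estimator learns
# how to combine them.
stack = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=2018)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,
)
stack.fit(X_tr, y_tr)

stack_auc = roc_auc_score(y_va, stack.predict_proba(X_va)[:, 1])
print("AUC_Valid:", stack_auc)
```

The same pattern should carry over to the loan data by replacing `X` and `y` with your prepared features and the `default` target, after handling missing values.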

The training set and validation set are prepared as follows:

**Help:**

If you are unsure about the difference between a training set, a test set, and a validation set, please go to [Wikipedia](https://en.wikipedia.org/wiki/Training,_test,_and_validation_sets) for more information or just ask our training instructors.

In [20]:
# generate target variable for the classification task
# (a new 0/1 column; the original 'loss' values are kept intact for the regression task)

train['default'] = (train['loss'] > 0).astype(int)

target = 'default'
features = [x for x in train.columns if x not in [target, 'id', 'loss']]

In [21]:
from sklearn.model_selection import train_test_split

seed = 2018
test_size = 0.3

dtrain, dvalid = train_test_split(train, test_size = test_size, random_state = seed)

Try your algorithms: fit your models on `dtrain` and predict on `dvalid`.

The following is a simple logistic regression model with the golden feature. 

**Note:** The golden feature is generated from feature engineering, which will be covered in the next task.

In [28]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

golden_feature = dtrain['f274'] - dtrain['f528']               # generated from feature engineering
golden_feature = pd.DataFrame(golden_feature.fillna(golden_feature.median()))

lr = LogisticRegression()
lr.fit(golden_feature, dtrain[target])


train_prob = lr.predict_proba(golden_feature)[:, 1]
train_auc = metrics.roc_auc_score(dtrain[target], train_prob)

valid_gf = dvalid['f274'] - dvalid['f528']             
valid_gf = pd.DataFrame(valid_gf.fillna(valid_gf.median()))

valid_prob = lr.predict_proba(valid_gf)[:, 1]
valid_auc = metrics.roc_auc_score(dvalid[target], valid_prob)

In [30]:
print('Model Report:')
print('AUC_Train: ', train_auc)
print('AUC_Valid: ', valid_auc)

Model Report:
AUC_Train:  0.936762776912
AUC_Valid:  0.939375926472


The logistic regression above, using only the golden feature, already achieves a pretty good result, largely thanks to feature engineering. Please feel free to use the golden feature and explore more machine learning algorithms.

Now, it's your turn!

---

## Task 3: Classification w/ feature engineering

You have already benefited from feature engineering through the golden feature in the task above. We will do more feature engineering in this section.

Feature engineering is the process of using domain knowledge of the data to create features that make 
machine learning algorithms work. Feature engineering is fundamental to the application of machine learning, 
and is both difficult and expensive. The need for manual feature engineering can be obviated by automated 
feature learning.

Feature engineering is an informal topic, but it is considered essential in applied machine learning.

**Task:**

1. Summarize the data types in the dataset. You may need some EDA to explore them. Please refer to the EDA task.
2. Try to find possible ways to transform the raw data into a form that machine learning algorithms can handle.
3. Generate reasonable features with your expert knowledge and possible techniques. Explain why you do so.
4. Rebuild your classification models with the data after feature engineering.

**Requirement: Please provide well-reasoned FE with descriptions. Do everything for a reason and with the necessary explanation!**
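
The first three steps can be sketched on a toy frame as follows. The column names and values are made up; the difference feature mirrors the form of the golden feature from Task 2 (`f274 - f528`), but which pairs of columns are worth combining in the real data is something you need to find out yourself.

```python
import pandas as pd

# Toy frame standing in for `train` (column names and values are made up).
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": ["10", "20", "oops", "40"],   # numeric data stored as text
    "c": [0.5, 1.5, 2.5, 5.0],
})

# 1. Summarize the data types.
dtype_counts = toy.dtypes.astype(str).value_counts()
print(dtype_counts)

# 2. Convert non-numeric columns to numbers; unparseable entries become NaN.
non_numeric_cols = toy.select_dtypes(exclude="number").columns
numeric = toy.copy()
for col in non_numeric_cols:
    numeric[col] = pd.to_numeric(numeric[col], errors="coerce")

# 3. Generate a candidate feature: the difference of two columns.
numeric["a_minus_c"] = numeric["a"] - numeric["c"]
print(numeric)
```

Coercing to `NaN` is only one option for text columns. Some of them may be genuinely categorical, in which case an encoding step would be more appropriate than numeric conversion.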

---

## Task 4: Classification plus regression and submit the results to the forum

So far you have completed the most difficult part of this competition, classification. Your next task is to run a regression on the defaulters identified by the classifier.

Linear regression is one of the most classical topics in statistics. If you would like to learn more about it, please see [Wikipedia](https://en.wikipedia.org/wiki/Linear_regression).
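
As a starting point, the sketch below fits a linear regression on defaulters only, using the $\log(loss+1)$ target from the Assignment section, and scores the combined prediction with MAE. Everything here is synthetic: the features, the default flags, and the loss formula are invented for illustration, and the true default flags stand in for what your classifier would predict.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2018)

# Synthetic data: 200 rows, 3 features, roughly 20% defaulters.
X = rng.normal(size=(200, 3))
is_default = rng.random(200) < 0.2
raw_loss = np.clip(10 + 5 * X[:, 0] + rng.normal(size=200), 1, 100)
loss = np.where(is_default, raw_loss, 0.0)

# Fit the regressor on defaulters only, on the log(loss + 1) scale.
reg = LinearRegression()
reg.fit(X[is_default], np.log1p(loss[is_default]))

# Predict 0 for non-defaulters and the back-transformed output otherwise.
pred = np.zeros(len(loss))
pred[is_default] = np.clip(np.expm1(reg.predict(X[is_default])), 0, 100)

mae = np.mean(np.abs(loss - pred))
baseline_mae = np.mean(np.abs(loss - 0.0))   # all-zero prediction
print(mae, baseline_mae)
```

On the real data, the regression should be evaluated on a held-out validation set, and the mask should come from your classifier, so classification mistakes will also show up in the final MAE.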

Now, it's your turn!

When all of the above is done, it's time to submit your results to [Kaggle](https://www.kaggle.com/c/loan-default-prediction).