# This notebook gives an example of an end-to-end Data Science process (from business problem to evaluation)

## Business Problem
First National Bank (FNB) Botswana has a dataset on its customers' past loans. As one of the fastest-growing banks in the region, FNB wants detailed insights into customers whose loans have been paid off or have defaulted (gone into collection), taking into account the key factors that might have contributed to each loan status. The findings will enable management and executives to devise data-driven strategies for developing loan products that better suit customer needs.

## Description of the Data
To address this problem (which is a classification problem), a Data Scientist is provided with the loan dataset stored in two CSV files: __Loan_Train.csv__ and __Loan_Test.csv__. Loan_Train.csv is used as training data when building the model, and Loan_Test.csv is treated as unseen test data in the model evaluation step. The dataset includes the following key fields (column attributes):

| Field          | Description                                                                           |
|----------------|---------------------------------------------------------------------------------------|
| Loan_status    | Whether a loan is paid off or is in collection                                           |
| Principal      | Basic principal loan amount                                                           |
| Terms          | Payoff schedule (frequency)                                                           |
| Effective_date | When the loan took effect                                                            |
| Due_date       | Payoff schedule due date                                                              |
| Age            | Age of applicant                                                                      |
| Education      | Education of applicant                                                                |
| Gender         | The gender of applicant                                                               |

## Load required libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing
import seaborn as sns
%matplotlib inline
print("Important libraries to get us started!")

Important libraries to get us started!


## Load Data from CSV file

In [2]:
# Assumes Loan_Train.csv is in the working directory
df = pd.read_csv('Loan_Train.csv')
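
As a rough illustration of the loading step, the sketch below reads a small in-memory CSV instead of the real file. The two rows, their values, and the date format are made up for the example; only the column names come from the field table above. Parsing the two date columns with `parse_dates` is an assumption about how the dates might be handled, not something the notebook prescribes.

```python
import io

import pandas as pd

# Hypothetical stand-in for Loan_Train.csv: the rows and date format are invented.
csv_text = """Loan_status,Principal,Terms,Effective_date,Due_date,Age,Education,Gender
PAIDOFF,1000,30,2016-09-08,2016-10-07,45,High School or Below,male
COLLECTION,800,15,2016-09-09,2016-09-23,29,college,female
"""

# read_csv accepts a path or any file-like object; parse_dates converts
# the listed columns to datetime64 instead of leaving them as strings.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Effective_date", "Due_date"])

print(df.shape)
print(df.dtypes)
```

With the real file, `io.StringIO(csv_text)` would be replaced by the file path.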

## Data Visualization and Pre-processing
Once loaded, we analyse the dataset to summarize its main characteristics. We perform descriptive statistical analysis using, for example, the describe(), value_counts(), and info() methods. This gives us insight into the nature of the data, including data types, missing values, categorical variables, and whether the target classes are balanced (an imbalanced target can bias the model).
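
The descriptive checks mentioned above might look like the following minimal sketch. The DataFrame is a hypothetical stand-in for the loaded training data: the column names follow the field table, while the rows and category labels are invented. `isna().sum()` is added here as one common way to count missing values.

```python
import pandas as pd

# Hypothetical rows standing in for Loan_Train.csv; all values are made up.
df = pd.DataFrame({
    "Loan_status": ["PAIDOFF", "PAIDOFF", "COLLECTION", "PAIDOFF", "COLLECTION", "PAIDOFF"],
    "Principal": [1000, 800, 1000, 300, 1000, 800],
    "Terms": [30, 15, 30, 7, 15, 30],
    "Age": [45, 33, 27, 50, 29, None],
    "Education": ["college", "Bachelor", "High School or Below", "college", "college", "Bachelor"],
    "Gender": ["male", "female", "male", "female", "male", "male"],
})

df.info()                                        # column dtypes and non-null counts
summary = df.describe()                          # count, mean, std, quartiles of numeric columns
missing = df.isna().sum()                        # number of missing values per column
class_counts = df["Loan_status"].value_counts()  # balance of the target classes

print(summary)
print(missing)
print(class_counts)
```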

## Pre-processing: Feature Selection
Important features (column attributes or variables) are selected, and any categorical variables are converted to numeric values.
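
One possible way to carry out this step is sketched below, assuming a two-valued `Gender` column and a multi-valued `Education` column. The category labels, the 0/1 mapping, and the choice of one-hot encoding are assumptions for illustration; the notebook does not state which encoding it uses.

```python
import pandas as pd

# Hypothetical sample; column names follow the field table, values are invented.
df = pd.DataFrame({
    "Principal": [1000, 800, 300],
    "Terms": [30, 15, 7],
    "Age": [45, 33, 50],
    "Education": ["college", "Bachelor", "college"],
    "Gender": ["male", "female", "male"],
})

# Select the columns to keep, then map the binary category to 0/1.
features = df[["Principal", "Terms", "Age", "Gender"]].copy()
features["Gender"] = features["Gender"].map({"male": 0, "female": 1})

# One-hot encode the multi-valued category into indicator columns.
education_dummies = pd.get_dummies(df["Education"], dtype=int)
features = pd.concat([features, education_dummies], axis=1)

print(features)
```

One-hot encoding avoids implying an order between education levels; an ordinal mapping would be an alternative if the levels are treated as ranked.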

## Classification using the scikit-learn library
We split the dataset (from __Loan_Train.csv__) into a training set and a test set, then build candidate models and keep the one with the best accuracy. We can apply, for example, the following algorithms:
- Support Vector Machine (SVM)
- Decision Tree
- K Nearest Neighbor (KNN)
- Logistic Regression

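For accuracy evaluation, several scoring metrics can be considered, for example the Jaccard index, the F1-score, and log loss (where applicable, since log loss requires predicted class probabilities). Root mean square error (RMSE) is a regression metric, so it is not a natural fit for this classification task.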
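
To make the classification metrics concrete, here is a small worked example on hand-made labels. The arrays are invented, with class 1 as the positive class; they are not results from the loan data.

```python
from sklearn.metrics import f1_score, jaccard_score, log_loss

# Invented labels: 3 true positives, 1 false negative, 1 false positive.
y_true = [1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
y_prob = [0.9, 0.4, 0.2, 0.8, 0.6, 0.7]  # predicted probability of class 1

jaccard = jaccard_score(y_true, y_pred)  # TP / (TP + FP + FN) = 3 / 5 = 0.6
f1 = f1_score(y_true, y_pred)            # 2TP / (2TP + FP + FN) = 6 / 8 = 0.75
ll = log_loss(y_true, y_prob)            # needs probabilities, not hard labels

print(f"Jaccard: {jaccard:.2f}, F1: {f1:.2f}, log loss: {ll:.3f}")
```

Jaccard and F1 are computed from hard predictions, whereas log loss penalizes confident but wrong probabilities, which is why it applies only to models that output probabilities.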

### Splitting Data and Standardization
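Note that we split the data first and then standardize the feature matrix (say __X__) so that each feature has zero mean and unit variance. Fitting the scaler on the training split only, and applying it unchanged to the test split, keeps test-set information out of training.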
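
A minimal sketch of this step, using a synthetic feature matrix in place of the loan features. The 80/20 split, the `random_state` value, and stratifying on the labels are assumptions chosen for illustration; the scaler is fitted on the training split only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 100 samples, 3 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y = rng.integers(0, 2, size=100)

# Split first (80/20 here is an assumption), keeping class proportions similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4, stratify=y
)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

print(X_train_std.mean(axis=0).round(6), X_train_std.std(axis=0).round(6))
```

The test split will not have exactly zero mean and unit variance after the transform, which is expected: it is scaled with the training statistics.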

### Import Classifiers and Accuracy Scoring Metrics
- SVM classifier
- Decision Tree classifier
- KNN classifier
- Logistic Regression classifier
- Accuracy scoring metrics (Jaccard, F1-score, Log-Loss)
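
The imports listed above could be written as follows. The hyperparameter values in the `classifiers` dictionary are placeholders chosen for illustration, not settings taken from this notebook.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, jaccard_score, log_loss
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Illustrative (assumed) hyperparameters; tune them on the training data.
classifiers = {
    "SVM": SVC(kernel="rbf"),
    "Decision Tree": DecisionTreeClassifier(criterion="entropy", max_depth=4),
    "KNN": KNeighborsClassifier(n_neighbors=7),
    "Logistic Regression": LogisticRegression(C=1.0, max_iter=1000),
}

print(list(classifiers))
```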

### Train (Fit) the Model, Prediction, Accuracy Evaluation
First, each classifier is defined and the model is trained (fitted) on the training set. The trained model is then used to predict the loan status, and finally the scoring metrics are used to evaluate the accuracy of the predictions against the true labels of the test set (the __y__ vector).
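
The define, fit, predict, and evaluate cycle can be sketched as a loop over the four classifiers. The data comes from `make_classification`, a synthetic stand-in for the loan features, and the hyperparameters are assumptions. Log loss is computed only for models that expose `predict_proba`, matching the "where applicable" caveat.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, jaccard_score, log_loss
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the loan dataset.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4, stratify=y
)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# 1. Define each classifier (hyperparameters are illustrative assumptions).
models = {
    "SVM": SVC(kernel="rbf"),
    "Decision Tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=7),
    "Logistic Regression": LogisticRegression(C=1.0, max_iter=1000),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)        # 2. train (fit) the model
    y_pred = model.predict(X_test)     # 3. predict on the held-out split
    scores = {                         # 4. score against the true labels
        "jaccard": jaccard_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred, average="weighted"),
    }
    if hasattr(model, "predict_proba"):
        scores["log_loss"] = log_loss(y_test, model.predict_proba(X_test))
    results[name] = scores

for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```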

## Model Evaluation and Reporting
Repeat only the Loading Data, Feature Selection, and Standardization steps from above, this time using the __Loan_Test.csv__ dataset. For standardization, reuse the scaler fitted on the training data rather than fitting a new one. Data splitting and model building are not performed in this step, since we are simply testing the model already built from __Loan_Train.csv__ in the previous steps. Evaluating on an unseen test set is important because it indicates how well the model is likely to generalize after deployment. Finally, report the accuracy of the built model using the different evaluation metrics.
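
The evaluation flow might be organized as in the sketch below. Everything data-related is hypothetical: `make_fake_loans` generates random rows in place of reading the two CSV files, the label strings and the `Gender` mapping are assumptions, and logistic regression stands in for whichever model was selected. The sketch shows the same preparation function applied to both datasets, with the scaler and model fitted on the training data only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, jaccard_score, log_loss
from sklearn.preprocessing import StandardScaler

FEATURES = ["Principal", "Terms", "Age", "Gender"]


def make_fake_loans(n, seed):
    """Hypothetical stand-in for reading Loan_Train.csv or Loan_Test.csv."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "Principal": rng.choice([300, 800, 1000], size=n),
        "Terms": rng.choice([7, 15, 30], size=n),
        "Age": rng.integers(18, 60, size=n),
        "Gender": rng.choice(["male", "female"], size=n),
        "Loan_status": rng.choice(["PAIDOFF", "COLLECTION"], size=n, p=[0.7, 0.3]),
    })


def prepare(df):
    """Apply the same feature selection and encoding to any loan DataFrame."""
    X = df[FEATURES].copy()
    X["Gender"] = X["Gender"].map({"male": 0, "female": 1})
    return X.to_numpy(dtype=float), df["Loan_status"].to_numpy()


# Training phase: the scaler and model are fitted on training data only.
X_train, y_train = prepare(make_fake_loans(200, seed=0))
scaler = StandardScaler().fit(X_train)
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)

# Evaluation phase: no splitting or refitting, only transform and predict.
X_test, y_test = prepare(make_fake_loans(60, seed=1))
X_test_std = scaler.transform(X_test)
y_pred = model.predict(X_test_std)
y_prob = model.predict_proba(X_test_std)

report = {
    "jaccard": jaccard_score(y_test, y_pred, pos_label="PAIDOFF"),
    "f1": f1_score(y_test, y_pred, average="weighted", zero_division=0),
    "log_loss": log_loss(y_test, y_prob, labels=model.classes_),
}
print(report)
```

Because the fake labels are random, the scores here carry no meaning; the point is the order of operations.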