# Practice Case Study
## German Credit Risk Analysis

### Problem Statement:
The approval of loans to customers is the most important aspect of a bank, yet it is also a time-consuming process. Value Trust Bank was founded in 2003 in Germany with the objective of providing loans and finance to individuals and enterprises. Three years after launch, the bank was struggling to make a profit and to deal with customers who did not repay their loans. The bank manager now wants a software application that identifies the right customers to approve for a loan and reduces the time and effort required to process applications, in order to improve the customer experience.


### Import all necessary libraries

In this example, we will use the `pandas`, `numpy`, `datetime` and `time` libraries to process our data, the `tarfile` library to package our trained model, the `boto3` and `sagemaker` libraries to interact with AWS services and SageMaker (to start training and hyper-parameter tuning jobs, and to deploy the model using endpoints), and finally several packages from the `sklearn` library for machine learning.

In [None]:
# import the libraries

### Setup a SageMaker Session

In this step, we will set up a SageMaker session using the `boto3` library and set the region and S3 bucket we will use for this walkthrough. I will set my S3 location to the default bucket that SageMaker created for me, but you can change this by giving the URL of a different S3 bucket you want to use.

In [None]:
# Setting up a sagemaker session


In [None]:
# Get region and default bucket


# Display region and bucket


In [65]:
# Set variables to write data into s3
data = ""
prefix = ""

In [None]:
# Write the data into s3
boto3.Session().resource('s3').Bucket(bucket) \
     .Object(os.path.join(prefix, 'input/german_credit_case_study.csv')) \
     .upload_file('german_credit_case_study.csv')

### Read the data & pre-process
In this case study we are going to use 'german_credit.csv'. Each entry represents a person who takes credit from a bank. Each person is classified as a good or bad credit risk according to a set of attributes. The attributes are:

* _Age_  = customer age
* _Sex_  = customer gender
* _JOB_  = customer profession
* _HOUSING_  = whether the customer has a house
* _Savings account$Checking account_  = customer savings account and checking account (two values stored together in one column)
* _Credit amount_  = customer credit amount 
* _Duration_  = duration of loan (months)
* _Purpose_   = loan purpose
* _risk_ = Class variable (1: person is at risk, 0: person is not at risk)

#### Read data from S3

In [7]:
# Read data from S3 into a pandas dataframe


In [8]:
# Check the shape of the dataset


#### Handle missing values

In [9]:
# Check for missing values


In [11]:
# Handle missing values in age column by imputing with median


In [12]:
# Handle missing values in Sex column by imputing with mode
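
A minimal sketch of the two imputations above, using a small hypothetical frame. The values are made up for illustration; only the column names `Age` and `Sex` come from the attribute list, and `df_example` is an assumed name.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the real data comes from the CSV stored in S3
df_example = pd.DataFrame({
    "Age": [25.0, np.nan, 40.0, 31.0],
    "Sex": ["male", "female", None, "male"],
})

# Count missing values per column
print(df_example.isnull().sum())

# Numeric column: fill with the median, which is robust to outliers
df_example["Age"] = df_example["Age"].fillna(df_example["Age"].median())

# Categorical column: fill with the most frequent value (mode)
df_example["Sex"] = df_example["Sex"].fillna(df_example["Sex"].mode()[0])
```

`mode()` returns a Series because several values can tie for most frequent, so `[0]` picks the first of them.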


#### Manage Vector Features

In [13]:
# Show the vector feature


In [14]:
# Apply lambda function to split the vector into 2 parts
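
One possible way to split the vector column is sketched below. It ASSUMES the two parts of each value are joined by a `$` character, as the column name suggests; check the real values first. The sample values and the new column names `savings_account` and `checking_account` are hypothetical.

```python
import pandas as pd

# Hypothetical values; we ASSUME the two parts are joined by '$'
vec_col = "Savings account$Checking account"
df_example = pd.DataFrame({vec_col: ["little$moderate", "rich$little", "moderate$rich"]})

# Split each value with a lambda and keep each part in its own column
df_example["savings_account"] = df_example[vec_col].apply(lambda v: v.split("$")[0])
df_example["checking_account"] = df_example[vec_col].apply(lambda v: v.split("$")[1])

# The original vector column is no longer needed once it has been split
df_example = df_example.drop(columns=[vec_col])
print(df_example)
```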


#### Manage Columns

In [15]:
# Drop the vector column


In [17]:
# Rename the columns 'Age' to 'age', 'Sex' to 'sex', 'JOB' to 'job', 'HOUSING' to 'housing', 'Credit amount' to 'credit_amount', 'Duration' to 'duration', 'Purpose' to 'purpose'


#### Duplicate Rows

In [18]:
# Check for all duplicate rows


In [19]:
# Drop duplicate rows (If any)
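
A short sketch of the duplicate check and removal, on a hypothetical three-row frame (the data and the name `df_example` are illustrative only):

```python
import pandas as pd

# Hypothetical rows; the first two are exact duplicates
df_example = pd.DataFrame({"age": [25, 25, 40], "credit_amount": [1000, 1000, 2500]})

# keep=False marks every copy of a duplicated row, not just the repeats
print(df_example[df_example.duplicated(keep=False)])

# Keep the first copy of each duplicated row and rebuild the index
df_example = df_example.drop_duplicates().reset_index(drop=True)
```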


#### Parse Features to the correct datatypes

In [20]:
# Data info


#### Format string (standardize string values)

In [21]:
# Check the unique values of job column feature 


In [22]:
# Use RegEx to replace '-' with '' (No space)

# Use RegEx to replace '_' to ' ' (Space)


In [23]:
# Use RegEx to replace "?" with the most frequent value (mode) of the purpose column


In [24]:
# Convert all string features in 'housing' column into lower case
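
The string clean-up steps in this section could look roughly like the sketch below. The messy values are hypothetical and the real categories may differ; only the column names come from the notebook.

```python
import pandas as pd

# Hypothetical messy values, for illustration only
df_example = pd.DataFrame({
    "job": ["un-skilled", "highly_skilled", "skilled"],
    "purpose": ["car", "?", "car"],
    "housing": ["OWN", "Rent", "free"],
})

# Remove '-' and turn '_' into a space
df_example["job"] = df_example["job"].str.replace(r"-", "", regex=True)
df_example["job"] = df_example["job"].str.replace(r"_", " ", regex=True)

# '?' is a regex metacharacter, so it must be escaped as '\?'
most_frequent = df_example.loc[df_example["purpose"] != "?", "purpose"].mode()[0]
df_example["purpose"] = df_example["purpose"].str.replace(r"\?", most_frequent, regex=True)

# Lower-case the housing labels so 'OWN' and 'own' count as one category
df_example["housing"] = df_example["housing"].str.lower()
```

The mode is computed after excluding `"?"` so that the placeholder itself can never be chosen as the replacement.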


### Data Exploration

#### Table Summary

#### How is age distributed?

#### Feature Correlation

#### Multicollinearity Check

### Data Preparation

#### Encode Categorical

#### Data Imbalance Check

#### Process Numeric

In [25]:
# Copy the data into new variable


In [26]:
# Standardize the values of the age, credit_amount and duration columns with the help of standard scaler
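
A minimal sketch of the copy-then-standardize step, assuming the renamed columns `age`, `credit_amount` and `duration`. The numbers and the names `df_example` and `df_scaled` are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric values, for illustration only
df_example = pd.DataFrame({
    "age": [22, 35, 48, 61],
    "credit_amount": [1200, 2500, 5000, 9800],
    "duration": [6, 12, 24, 48],
})

# Work on a copy so the cleaned data stays untouched
df_scaled = df_example.copy()

num_cols = ["age", "credit_amount", "duration"]
scaler = StandardScaler()

# Each column is rescaled to mean 0 and unit (population) standard deviation
df_scaled[num_cols] = scaler.fit_transform(df_scaled[num_cols])
print(df_scaled.round(3))
```

This follows the order used in this notebook (scale, then split). A stricter workflow would fit the scaler on the training split only and reuse it on the test split.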


#### Split data and upload back to S3

In [27]:
# Split data into Train & Test dataframes

# Write the train and test dataframes to sagemaker file browser


In [28]:
# Upload train data to s3


# Upload test data to s3


# Display train and test data location in s3


### Build models to train using a Classification Algorithm

In this step, we will first read the data from S3, split the features & target into train & test parts, and create and evaluate multiple classification models like:

* Logistic Regression
* Decision Tree Classifier
* Random Forest Classifier
* Gradient Boosting
* Adaptive Boosting
* K-Nearest Neighbors Classifier
* Support Vector Machine Classifier
* Naive Bayes Classifier

#### Read train & test data from s3

In [29]:
# Read training & testing datasets from s3


In [30]:
# Create train dataframes with features & target

# Create test dataframes with features & target


#### Train Classification Models using the data

NOTE: _Uncomment the model you wish to use, keep the rest commented_

In [55]:
# model = LogisticRegression(random_state=1)
# model = DecisionTreeClassifier(random_state=1)
model = RandomForestClassifier(random_state=1)
# model = GradientBoostingClassifier(random_state=1)
# model = AdaBoostClassifier(random_state=1)
# model = KNeighborsClassifier()  # has no random_state parameter
# model = SVC(random_state=1)
# model = GaussianNB()  # has no random_state parameter

In [31]:
# Fit the training data


In [32]:
# Get predictions on test data
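
The fit and predict steps can be sketched end to end on synthetic data. `make_classification` stands in for the credit dataset here, and the names `model_example`, `X_train` and so on are assumptions made for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the credit data: 8 features, binary target
X, y = make_classification(n_samples=200, n_features=8, random_state=1)

# Hold out 30% of the rows for testing, keeping the class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

# Fit on the training rows only
model_example = RandomForestClassifier(random_state=1)
model_example.fit(X_train, y_train)

# One predicted label (0 or 1) per test row
y_pred = model_example.predict(X_test)
print(y_pred[:10])
```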


#### Save predictions as a dataframe

In [33]:
# Save the predictions and actual values

# Concatenate them into a single dataframe

# Display output


#### Evaluate the model using performance metrics

In [34]:
# Evaluate the model using a confusion matrix

# Display matrix


In [35]:
# Calculate TP, FP, FN, TN
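
As an illustration, the four counts can be read straight off scikit-learn's confusion matrix. The labels below are hypothetical; the coding 1 = at risk, 0 = not at risk follows the `risk` column.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = at risk, 0 = not at risk
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes, ordered [0, 1]
matrix = confusion_matrix(y_true, y_pred)
print(matrix)

# For binary labels, ravel() flattens the matrix as TN, FP, FN, TP
tn, fp, fn, tp = matrix.ravel()
print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
```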


**Accuracy**: 

This metric is calculated by dividing the correct predictions made by the total number of predictions made. In the context of our model, we are trying to answer the below questions:

* In how many cases did we correctly predict that a customer is at risk?
* In how many cases did we correctly predict that a customer is not at risk?

In [36]:
# Calculate Accuracy Score

# Display score


**Precision**: 

This metric is calculated by dividing the correct positive predictions by the total number of positive predictions. In the context of our model, we are trying to answer: out of all the cases where we predicted that a customer is at risk, how many customers are actually at risk?

This metric comes from the perspective of checking how good we were when predicting the positive class, but only out of all the predictions we made for the positive class. In other words, 

* I am **concerned** about how many customers were actually at risk, only out of the ones I predicted to be at-risk (the penalty comes from predicting a customer as at-risk when they were not)
* I am **not concerned** about the cases where I missed out on predicting that a customer is at-risk, when there actually was evidence of risk after the loan duration

In [37]:
# Calculate Precision Score

# Display score


**Recall**:

This metric is calculated by dividing the correct positive predictions by the total number of actual positives. In the context of our model, we are trying to answer: out of all the cases where customers were actually at risk, how many did I predict to be at-risk?

This metric comes from the perspective of checking how good we were at predicting the positive class, out of all the actual positive cases. In other words, 

* I am **concerned** about how many cases I was able to identify as at-risk out of all the customers who were at risk (the penalty comes from missing a case that actually was at-risk)
* I am **not concerned** about the cases where I wrongly predicted that a customer was at-risk, when actually there was no risk

In [38]:
# Calculate Recall Score

# Display score


**F1 score**:

The F1 score is a metric that combines the merits of both precision and recall by calculating the harmonic mean of the two. In other words, this score is concerned with making sure that:

* We don't wrongly label a customer as at-risk when they are not (which can lead to unnecessarily rejecting a good customer)
* We don't wrongly label a customer as not at-risk when they are (which can lead to approving a loan that is not repaid)
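
The four metrics described above can also be computed by hand from the confusion-matrix counts, which makes the definitions concrete. This is a plain-Python sketch; the function name `classification_metrics` and the counts are hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # Guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for 200 customers
scores = classification_metrics(tp=30, fp=10, fn=20, tn=140)
print(scores)
```

With these counts accuracy is 0.85 while recall is only 0.60, which shows how accuracy alone can hide missed at-risk customers when the classes are imbalanced.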

In [39]:
# Calculate F1-score

# Display score


### Model Tuning

By default, a model is built with preset internal parameters. For example, in the Random Forest classifier, the `max_depth` value, which controls how deep each tree may grow, is set to `None` by default. This means each tree expands until all of its leaves are pure (or contain fewer than `min_samples_split` samples).

This fits the training data very well, but letting the trees grow until every leaf is pure risks over-fitting the model. Here are some of the other significant questions you can ask:

* What should be the minimum number of samples in a node to call it a leaf?
* How do you score your features on their strength to predict?
* What is the maximum number of leaf nodes? - too many leaves means you might be over-fitting

When building an ML model, you can decide what these hidden parameters must be. For RandomForest Classifier, you can find all the hyper-parameters in this documentation: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

**Hyper-parameter Tuning** is the art of figuring out what works best for your model - performing well statistically and aligning with your business context & needs
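
The questions above map onto constructor arguments of `RandomForestClassifier`. The sketch below sets them explicitly; the values are illustrative only, not tuned recommendations, and `constrained_model` is an assumed name.

```python
from sklearn.ensemble import RandomForestClassifier

# Illustrative values only, not tuned recommendations
constrained_model = RandomForestClassifier(
    max_depth=5,          # stop growing each tree after 5 levels
    min_samples_leaf=10,  # a leaf must hold at least 10 samples
    criterion="gini",     # how candidate splits are scored
    max_leaf_nodes=20,    # cap on the number of leaves per tree
    random_state=1,
)

# get_params() lists every hyper-parameter and its current value
print(constrained_model.get_params()["max_depth"])

# The default model leaves max_depth unset (None)
print(RandomForestClassifier().max_depth)
```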

#### Define the parameter space & train the model

Each model is built on a different algorithm, and hence the hyper-parameters may vary. So before you tune these models, refer to the sklearn documentation pages below to check the parameters and modify your tuning code accordingly.

* [sklearn.linear_model.LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
* [sklearn.tree.DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)
* [sklearn.ensemble.RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
* [sklearn.ensemble.GradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)
* [sklearn.ensemble.AdaBoostClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html)
* [sklearn.neighbors.KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
* [sklearn.svm.SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)
* [sklearn.naive_bayes.GaussianNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html)

In [40]:
# Set the parameter space / grid to tune on


In [41]:
# Set parameters for Grid Search Algorithm


In [42]:
# Train and Evaluate the model on all parameter combinations


In [43]:
# Save the results into the dataframe

# Get the top 5 models which have the best mean_test_score
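
A compact sketch of the grid-search steps above, using scikit-learn's `GridSearchCV` on synthetic data. The grid values, the 3-fold setting and the variable names are assumptions chosen to keep the example small.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data
X, y = make_classification(n_samples=200, n_features=8, random_state=1)

# Illustrative grid: 2 x 2 = 4 combinations
param_grid = {"n_estimators": [20, 50], "max_depth": [3, 5]}

grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=1),
    param_grid=param_grid,
    scoring="accuracy",
    cv=3,  # 3-fold cross-validation for every combination
)
grid_search.fit(X, y)

# One row per combination, best mean cross-validated score first
results = pd.DataFrame(grid_search.cv_results_)
top5 = results.sort_values("mean_test_score", ascending=False).head(5)
print(top5[["params", "mean_test_score"]])
print(grid_search.best_params_)
```

The number of model fits grows as the product of the grid sizes times the number of folds, so it is worth starting with a coarse grid.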


#### Re-build the model & predict with the best parameters

In [44]:
# Re-build the model using the best parameters


In [45]:
# Re-fit the model using the best parameters


In [46]:
# Save the predictions and actual values

# Concatenate them into a single dataframe

# Display output


#### Evaluate the tuned model (compare performance with baseline model)

In [47]:
# Evaluate the model using a confusion matrix

# Display matrix


In [48]:
# Calculate TP, FP, FN, TN


In [49]:
# Calculate all performance metrics - after tuning


# Display performance metrics before & after tuning, and the % increment achieved
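
One simple way to express the before/after comparison is as a relative change in percent. The scores below are hypothetical placeholders, and `percent_change` is an assumed helper name.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


# Hypothetical scores, for illustration only
baseline = {"accuracy": 0.70, "recall": 0.50}
tuned = {"accuracy": 0.77, "recall": 0.60}

for metric in baseline:
    change = percent_change(baseline[metric], tuned[metric])
    print(f"{metric}: {baseline[metric]:.2f} -> {tuned[metric]:.2f} ({change:+.1f}%)")
```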


### Model Deployment

A model is deployed when it is "fit for production use", but what does it really mean to "deploy" a model? All you are doing is packaging the model you trained as Python code so that a stakeholder can feed it newly arriving data and receive predictions to act on within the business context.

This is known as creating an **Endpoint**. An endpoint is a system / node / computer / machine where your code lives, and which can be triggered with inputs to get predictions out.

NOTE: _Before we create docker containers and deploy our model, let's first learn how to deploy it locally and see how it works_

#### Understanding how "argument parsing" works

Your model code has to be flexible enough to take in different hyper-parameter inputs and new production data in the future. You therefore need a way to pass values into your code before it starts running, because editing multiple values in the code one at a time is unproductive. You can parameterize your code using the `argparse` library. Let's learn how to do this first.

In [157]:
%%writefile argparse_example.py

# Package to parse arguments
import argparse

# Create a parser instance
parser = argparse.ArgumentParser()

# Set rules to parse variables
parser.add_argument("--hyp-par-1", type=int, default=10)
parser.add_argument("--hyp-par-2", type=int, default=5)
parser.add_argument("--hyp-par-3", type=int, default=200)

# Read all input values into a variable
args = parser.parse_args()

# Save each input value into separate variables
hyperparam_1 = args.hyp_par_1
hyperparam_2 = args.hyp_par_2
hyperparam_3 = args.hyp_par_3

# Display values
print("training model with below hyper-parameters...:")
print("hyper-parameter-1: {}".format(hyperparam_1))
print("hyper-parameter-2: {}".format(hyperparam_2))
print("hyper-parameter-3: {}".format(hyperparam_3))

Writing argparse_example.py
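
To see how the flags reach the variables, the sketch below parses a list of strings instead of the real command line. It mirrors two of the arguments from the script above and is equivalent to running `python argparse_example.py --hyp-par-1 20`.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--hyp-par-1", type=int, default=10)
parser.add_argument("--hyp-par-2", type=int, default=5)

# Passing a list simulates: python argparse_example.py --hyp-par-1 20
args = parser.parse_args(["--hyp-par-1", "20"])

# Dashes in flag names become underscores in attribute names
print(args.hyp_par_1)  # value given on the command line
print(args.hyp_par_2)  # not given, so the default is used
```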


#### Modifying our model code before local deployment

In [50]:
%%writefile model_script_local.py

# ------------------------------
# Import all necessary libraries
# ------------------------------

# Package to take in features and hyper-parameters as inputs


# Packages to interact with system

# Packages to process data


# Packages to interact with AWS Services & Sagemaker


# Packages for Machine Learning


# Packages for Hyper-parameter tuning


# ----------------------------------
# Parse test data & hyper-parameters
# ----------------------------------

# -----------------------
# Setup sagemaker session
# -----------------------

# Setting up a sagemaker session


# Get region and bucket


# Display region and bucket


# ---------------------------
# Read the data & pre-process
# ---------------------------


# Read data from S3 into a pandas dataframe


# Display the dataset

# Split data into Train & Test dataframes


# Write the train and test dataframes to sagemaker file browser

# Upload train data to s3

# Upload test data to s3


# Display train and test data location in s3
# print(trainpath)
# print(testpath)

# --------------------------
# Build classification model
# --------------------------

# Read training & testing datasets from s3

# Create train dataframes with features & target

# Create test dataframes with features & target

# Build model using the best parameters

# Fit model


# -----------------------
# Predict using the model
# -----------------------

# Save the predictions and actual values


# Concatenate them into a single dataframe


# Display output

# --------------------
# Evaluate Performance
# --------------------


# Evaluate the model using a confusion matrix

# Display matrix


# Calculate TP, FP, FN, TN


# Calculate all performance metrics - after tuning


# Display performance metrics before & after tuning, and the % increment achieved


Writing model_script_local.py


### Running Training, Hyper-parameter Tuning Jobs and Deploying the model on Docker

Now that we have done training, tuning and deploying on our sagemaker local notebook, let's learn how to run these stages in the ML Lifecycle as processing jobs on Docker Containers

#### Prepare the model script to run on Docker Container

In [162]:
%%writefile model_script_docker.py

# ------------------------------
# Import all necessary libraries
# ------------------------------

# Package to take in features and hyper-parameters as inputs


# Packages to interact with system


# Packages to process data


# Packages for Machine Learning


# ---------------------------------
# Function to persist model package
# ---------------------------------

# Write a function that the training job will use to create the model package
# Use this as it is

# Main code

    # ----------------------------------
    # Parse test data & hyper-parameters
    # ----------------------------------
    
    parser = argparse.ArgumentParser()
    # test and train file
    parser.add_argument("--test-file", type=str, default='test.csv')
    parser.add_argument("--train-file", type=str, default="train.csv")
    # hyper-parameters
    parser.add_argument(" ", type=str, default="gini")
    parser.add_argument(" ", type=int, default=5)
    parser.add_argument(" ", type=int, default=50)
    # docker container default locations
    parser.add_argument("--model-dir", type=str, default=os.environ.get("SM_MODEL_DIR")) #*
    parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN")) #*
    parser.add_argument("--test", type=str, default=os.environ.get("SM_CHANNEL_TEST")) #*

    args = parser.parse_args()

    # --------------------------
    # Build classification model
    # --------------------------

    # Read training & testing datasets from s3
    
    # Create train dataframes with features & target

    # Create test dataframes with features & target
 

    # Build model using the best parameters


    # Fit model


    # -----------------------
    # Predict using the model
    # -----------------------

    # Save the predictions and actual values


    # Concatenate them into a single dataframe

    # Display output
    

    # --------------------
    # Evaluate Performance
    # --------------------


    # Evaluate the model using a confusion matrix

    # Display matrix

    # Calculate TP, FP, FN, TN

    # Calculate all performance metrics - after tuning

    # Display performance metrics before & after tuning, and the % increment achieved


    # ------------------------
    # Save model package file
    # ------------------------

    # persist model


Overwriting model_script_docker.py
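
A minimal sketch of how the persistence step might look, under stated assumptions. The SageMaker scikit-learn container calls a function named `model_fn(model_dir)` from the entry-point script to load the model for serving; the file name `model.joblib` and the use of `joblib` are assumptions of this sketch, and a temporary folder stands in for `args.model_dir`.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def model_fn(model_dir):
    """Load the persisted model; SageMaker calls this when serving."""
    return joblib.load(os.path.join(model_dir, "model.joblib"))


# Train a small model on synthetic data
X, y = make_classification(n_samples=100, n_features=8, random_state=1)
clf = RandomForestClassifier(n_estimators=10, random_state=1).fit(X, y)

# In a training job this would be args.model_dir; a temp folder stands in here
with tempfile.TemporaryDirectory() as model_dir:
    joblib.dump(clf, os.path.join(model_dir, "model.joblib"))
    restored = model_fn(model_dir)
    same_predictions = (restored.predict(X) == clf.predict(X)).all()

print(same_predictions)
```

Whatever the script writes into the model directory is what the training job packages as the model artifact, so the name used when saving must match the name used in `model_fn`.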


#### Train the model using SageMaker Estimator Function

In [54]:
# Import library to train model
from sagemaker.sklearn.estimator import SKLearn

# Set framework version (stable)
FRAMEWORK_VERSION = "0.23-1"

# Create the sklearn estimator instance
sklearn_estimator = SKLearn(
    entry_point=" ", # your model code
    role=get_execution_role(), # the default IAM role used
    instance_count=1, # number of nodes
    instance_type=" ", # EC2 instance type
    framework_version=FRAMEWORK_VERSION, # Framework version
    base_job_name=" ", # Prefix of the job name
    hyperparameters={ # Hyper-parameters
      
    }
)

In [51]:
# Launch Training Job
#sklearn_estimator.fit({"train": trainpath, "test": testpath}, wait=True)

#### Tune the model using SageMaker Tuner Function

In [165]:
# Library to tune models
from sagemaker.tuner import IntegerParameter

# Define exploration boundaries
hyperparameter_ranges = {
    
}

# Create an optimizer instance
sklearn_optimizer = sagemaker.tuner.HyperparameterTuner(
    estimator=sklearn_estimator, # model training instance
    hyperparameter_ranges=hyperparameter_ranges, # parameter space
    base_tuning_job_name=" ", # Prefix of the tuning job
    objective_type=" ", 
    objective_metric_name=" ",
    max_jobs=10,
    max_parallel_jobs=5,
    metric_definitions=[
        {"Name": "accuracy", "Regex": "accuracy:([0-9.]+).*$"} # search logs on cloudwatch
    ],
)
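
The `Regex` in `metric_definitions` is matched against the training job's log output, so the model script has to print the metric in a shape the pattern can capture. The sketch below checks the pattern locally with Python's `re` module; the log lines are hypothetical.

```python
import re

# Same pattern as in metric_definitions
pattern = r"accuracy:([0-9.]+).*$"

# Hypothetical log line; the training script must print in this exact shape
log_line = "accuracy:0.8125"

match = re.search(pattern, log_line)
captured = float(match.group(1)) if match else None
print(captured)

# A space after the colon would not match the pattern
print(re.search(pattern, "accuracy: 0.8125"))
```

If the tuner reports no metric values, a mismatch between the printed line and this pattern is one of the first things to check.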

In [55]:
sklearn_optimizer.fit({"train": trainpath, "test": testpath})

In [56]:
# Get the tuner results in a df


#### Deploy the model using SageMaker Model Deploy Function

In [57]:
# Save the model artefact into a variable


In [169]:
# Import library to deploy model
from sagemaker.sklearn.model import SKLearnModel

model = SKLearnModel(
    model_data=artifact, # The model zip file the training job generated
    role=get_execution_role(), # Default IAM role
    entry_point=" ", # Model Code
    framework_version=FRAMEWORK_VERSION
)

In [58]:
# Deploy SKLearn Predictor
sklearn_predictor = model.deploy(instance_type=" ", initial_instance_count=1)

In [59]:
# Make prediction on test data using deployed model


In [60]:
# Save predictions as output dataframe


In [61]:
# Evaluate the model using a confusion matrix

# Display matrix
# print(matrix)

# Calculate TP, FP, FN, TN

# Calculate all performance metrics - after tuning


# Display performance metrics before & after tuning, and the % increment achieved


#### Delete the Endpoint

**Important!** An Endpoint is a LIVE node which is always running, ready to process & predict to give you output. So unless you are making real-time predictions on streaming data, delete your endpoints after use

In [64]:
# Make sure you delete your endpoint after use to avoid charges

In [62]:
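# sm_boto3 is assumed to be a boto3 SageMaker client, e.g. boto3.client("sagemaker")
# In SageMaker Python SDK v2 the predictor attribute is `endpoint_name`
sm_boto3.delete_endpoint(EndpointName=sklearn_predictor.endpoint_name)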

# Thank you