# LAB3 - Model Definitions and Training

After the data manipulation we did in LAB2, we are ready to use our dataframe to set up and train machine learning models.

## Initial setup

What would otherwise be a single notebook has been split into two parts (LAB2 and LAB3), to reflect the logical structure of working with data in a real data science experiment. The split obliges us to add some extra code at the beginning of this lab.

As we did in LAB-2, we need to do an initial set up of the notebook to have all the data available with the same structure we coded before.

We first recreate the `import` statements.

In [None]:
# CELL 1

import pandas as pd
import numpy as np
import seaborn as sns
import datetime
import time
from sklearn import metrics
from sklearn import neighbors
from sklearn import ensemble
from sklearn import tree
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from pandas.plotting import scatter_matrix
from sklearn.metrics import classification_report
from matplotlib import pyplot as plt
from datetime import datetime, date, time, timedelta
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import train_test_split
import matplotlib.ticker as mtick
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn import svm
plt.style.use('ggplot')

import dsx_core_utils, requests, os, io
from dsx_core_utils import ProjectContext
from pyspark.sql import SparkSession
from pyspark import SparkContext

Now we have to reload some data that is required for this lab that was already formatted in LAB-2. We loaded it from a Db2 table in LAB-2 but we are now loading it from a CSV file, just for informational purposes.

In [None]:
# CELL 2

#### Insert code below. Load LOAN table from Db2 as you did in LAB2.


#### Insert code from the lab script here to convert loan_raw into loan with the required options


The next step is to load the dataframe built in LAB-2 with exactly the same structure it had then. You will remember that, at the end of LAB-2, you saved the formatted pandas dataframe in a CSV file. That is the file we are loading now, so that we continue working under the same conditions created in LAB-2.

Let's check the working path for this environment in the USS of the system:

In [None]:
# CELL 3

%pwd

In [None]:
# CELL 4

df = pd.read_csv(
    'fullDataframe_EOLAB1_<login_userid>.csv',
    sep=";",
    delimiter=None,
    header="infer",
    names=None,
    low_memory=False)

df.head()

From now on, we will start working on the machine learning models. 

# Standard processing and Training/Test set Split

It is necessary to have a set of data used to train machine learning models. We are splitting our dataframe into two parts and will use one of them for training. The other part will be used to evaluate how models work after being trained.

In [None]:
# CELL 5

# Select status column as the y (our target) and all the rest as X (features) 

X = df.loc[:, df.columns != "status"]
y = df.loc[:, "status"]

#### Insert code below. Split the dataframe in 2 sets. 30% of the data will be held out for testing (test_size=0.3, as in CELL 34)


After splitting we have several dataframes ready to use with the models. `X` contains the full list of features and `y` contains the target. Remember, we want to know whether a new borrower is at risk of defaulting, and that is what `status` reflects. We also have `X_train` and `y_train`, used to train the models, and `X_test` and `y_test`, used to evaluate them.
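The split described above can be sketched on a toy dataframe. This is only an illustration under stated assumptions: the column names `feature_a` and `feature_b`, the 80/20 class mix, and the use of `stratify` and `random_state` are made up for the example and are not taken from the lab script.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the lab dataframe: two made-up features plus a binary "status".
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature_a": rng.normal(size=100),
    "feature_b": rng.normal(size=100),
    "status": np.repeat([0, 1], [80, 20]),  # 80 good loans, 20 defaults
})

# Same selection pattern as CELL 5: "status" is the target, the rest are features.
X_toy = toy.loc[:, toy.columns != "status"]
y_toy = toy.loc[:, "status"]

# Hold out 30% of the rows for testing. stratify keeps the default rate
# similar in both parts; random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```

Stratifying is optional, but with a rare class such as loan defaults it helps both parts contain a comparable share of defaults.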

# Model Evaluation

# Draft Modeling: Random Forest

We are running an initial Random Forest model because we need to build some small dataframes around it. Some initial visualizations will be helpful to understand the behaviour of the features and the influence they have on the overall fitting of the model. After resolving the influence of each of the features we will be able to build a more accurate model. That is the reason why this is a _draft model_.
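As a rough sketch of what a draft model gives us, the example below fits a small Random Forest and ranks its features by importance. It runs on synthetic data from `make_classification`, and the names `draft_rf`, `X_demo` and `ranking` are hypothetical, so treat it as an illustration of the idea and not as the lab solution.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the loan dataframe (an assumption, not the lab data).
X_arr, y_arr = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_arr, test_size=0.3, random_state=0
)

# Fit on the training part, predict on the held-out part.
draft_rf = RandomForestClassifier(n_estimators=50, random_state=0)
draft_rf.fit(X_tr, y_tr)
y_hat = draft_rf.predict(X_te)

# One importance value per feature; the values add up to 1.
ranking = (
    pd.DataFrame(
        {"feature": X_tr.columns, "importance": draft_rf.feature_importances_}
    )
    .sort_values("importance", ascending=False)
    .reset_index(drop=True)
)
print(ranking)
```

Features near the bottom of such a ranking are candidates to drop before building the more accurate model.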

In [None]:
# CELL 6

# This is how a model is trained. Nothing strange: just invoke the algorithm with the necessary parameters. If you want more information
# you may visit the documentation centre: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

rf = ensemble.RandomForestClassifier(
    n_estimators=200,
    criterion="gini",
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    min_weight_fraction_leaf=0.0,
    max_features="sqrt",  # same behaviour as the legacy "auto" value for classifiers
    max_leaf_nodes=None,
    min_impurity_decrease=0.0,
    bootstrap=True,
    oob_score=False,
    n_jobs=1,
    random_state=None,
    verbose=0,
    warm_start=False,
    class_weight=None
)

#### Insert code below. Train the model created above and get a target prediction. 




In [None]:
# CELL 7

# For an explanation of the classification report and the following confusion matrix
# Follow this link: https://www.geeksforgeeks.org/confusion-matrix-machine-learning/

#### Insert code below.




In [None]:
# CELL 8

# A confusion matrix helps you see how good the predictions (y_pred, made on X_test) are compared to the actual values (y_test).
cm1 = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'])
sns.heatmap(cm1,annot = True)

In [None]:
# CELL 9

# Check how features have an "importance" assigned. It is a measure of the influence of each in the classification process

fi = rf.feature_importances_

#### Insert code below. Print the array of importances. 


In [None]:
# CELL 10

#### Insert code below. Subset the dataframe to check the behaviour of importances and features



In [None]:
# CELL 11

# Study the subset created in previous cell. Using a visual aid is very useful.
# Plot features in x axis and importance in y axis
# Limit the visual to 20 first features in importance

importance = pd.DataFrame(
    {"feature": feature_cols[:], "importance": rf.feature_importances_[:]})

importance.sort_values(
    by="importance",
    axis=0,
    ascending=False,
    inplace=True,
    kind="quicksort",
    na_position="last",
)

#### Insert code below to do the plotting



# Visualization & Feature Selection

In this section we study the behaviour of the features and the influence they have on the model. We will select the most relevant ones to try to improve model accuracy. After this process we will set up new models, compare them and choose the best-fitting one.
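A quick way to see how one feature relates to the target is a table of shares per group. The sketch below uses a made-up ten-row sample with the same coding as the lab (`sex` 0 = women, 1 = men; `status` 0 = all ok, 1 = default); `pd.crosstab` with `normalize="index"` is just one possible way to get the percentages.

```python
import pandas as pd

# Hypothetical mini-sample using the lab's coding:
# sex 0 = women, 1 = men; status 0 = all ok, 1 = default.
sample = pd.DataFrame({
    "sex":    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "status": [0, 0, 0, 1, 0, 0, 0, 1, 1, 1],
})

# normalize="index" turns counts into row shares, so every sex sums to 100%.
share = pd.crosstab(sample["sex"], sample["status"], normalize="index") * 100
print(share.round(1))
```

In this toy sample one woman in four and three men in six default, so the table shows 25% and 50% in the `status` 1 column. Calling `share.plot(kind="bar", stacked=True)` would draw the same information as a stacked bar graph.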

In [None]:
# CELL 12

# Get an idea of the distribution of defaults by sex. 
# Remember: sex 0 = women, sex 1 = men. Status 0 = all ok, Status 1 = default

df.groupby(["sex", "status"])["status"].size()


In [None]:
# CELL 13

# Visualize same result as in former cell using a stacked bar graph

df.groupby(["sex", "status"])["status"].size().groupby(level=0).apply(
    lambda x: 100 * x / x.sum()
).unstack().plot(kind="bar", stacked=True)

#### Insert code below. Customize the graph to make it more understandable


In [None]:
# CELL 14

# Visualize age related to status

#### Insert code below. Plot the relationship between age and status



In [None]:
# CELL 15

# Plot years of having a card vs status

#### Insert code below 



In [None]:
# CELL 16

# Study the influence of age vs status. To do that you need to segment by ages
# Define a function to create age bins

# Binning:
def binning(col, cut_points, labels=None):
    # Define min and max values:
    minval = col.min()
    maxval = col.max()

    # create list by adding min and max to cut_points
    break_points = [minval] + cut_points + [maxval]

    # if no labels provided, use default labels 0 ... (n-1)
    if not labels:
        labels = range(len(cut_points) + 1)

    # Binning using cut function of pandas
    colBin = pd.cut(col, bins=break_points, labels=labels, include_lowest=True)
    return colBin


# Binning age:
#### Insert code below. Apply function binning to a sample of ages between 20 and 50 years



In [None]:
# CELL 17

# Plot the recently created column age_bin vs status

#### Insert code below. Create a plot grouping age_bin with status vs status



In [None]:
# CELL 18

# Have a look at the dataframe selecting all rows marked with a 1 status (any kind of default)

#### Insert code below. Select all rows with status 1


In [None]:
# CELL 19

# Graph status vs payments mean to have an idea of the behaviour of clients defaulting and not defaulting.

#### Insert code below. Group averaged payments with status and plot it in a bar graph



In [None]:
# CELL 20

# Plot a heatmap to understand the correlation of the main features. Limit it to a 10x10 matrix
import seaborn as sns

cols = list(importance.feature[:10]) # To add or remove features to the matrix change this number
cols.insert(0, "status")
corrcoef_map = np.corrcoef(df[cols].values.T)
fig, ax = plt.subplots(figsize=(12, 12))  # Sample figsize in inches
hm = sns.heatmap(
    corrcoef_map,
    cbar=True,
    annot=True,
    square=True,
    fmt=".2f",
    annot_kws={"size": 15},
    yticklabels=cols,
    xticklabels=cols,
    ax=ax,
)

#Optional. Fix for this version of the package matplotlib. 

#### Insert code below to fix the view. This is a problem with the present version of matplotlib


# Modeling

We have run several tests and visualizations to understand our dataframes. We are now in a position to train some machine learning models and decide which one is the best for us.
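One way to make that decision concrete is to score every candidate on the same held-out data. The sketch below is only an illustration: it uses synthetic data from `make_classification` and the F1 score as the yardstick, both of which are assumptions and not choices made in this lab.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem (stand-in for the loan data).
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)

# Hypothetical shortlist of models to compare.
candidates = {
    "decision_tree": DecisionTreeClassifier(random_state=1),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=1),
}

# Train each model on the same training rows and score it on the same test rows.
scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    scores[name] = f1_score(y_te, model.predict(X_te))

# Print the candidates from best to worst F1.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: F1 = {score:.3f}")
```

Because every model sees exactly the same split, differences in the score come from the models and not from the data they happened to receive.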

## Random Forest

In [None]:
# CELL 21

# Train and test a Random Forest model
# Remember that in a previous cell we defined the dataframes like this:
#X = df.loc[:, df.columns != "status"]
#y = df.loc[:, "status"] 

rf = ensemble.RandomForestClassifier(
    n_estimators=800,
    criterion="gini",
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    min_weight_fraction_leaf=0.0,
    max_features="sqrt",  # same behaviour as the legacy "auto" value for classifiers
    max_leaf_nodes=None,
    min_impurity_decrease=0.0,
    bootstrap=True,
    oob_score=False,
    n_jobs=1,
    random_state=None,
    verbose=0,
    warm_start=False,
    class_weight=None
)

#### Insert code below. Run the fitting and prediction of the model created above



In [None]:
# CELL 22

# Get the classification report

#### Insert code below. Print the classification report for this fit of the model


In [None]:
# CELL 23

# Get the confusion matrix for the model

#### Insert code below. Print the confusion matrix for this fit of the model


In [None]:
# CELL 24

# Build feature and importance dataframe to further check the behaviour of the model

feature_cols = X_test.columns
importance = pd.DataFrame(
    {"feature": feature_cols[:], "importance": rf.feature_importances_[:]}
)

importance.sort_values(
    by="importance",
    axis=0,
    ascending=False,
    inplace=True,
    kind="quicksort",
    na_position="last"
)


importance[:18].plot(x="feature", y="importance", kind="bar")
plt.ylabel("importance")

In [None]:
# CELL 25

# Now you will export this model to the repository on z/OS so that it will be available for use in test and production

from repository_v3.mlrepository import MetaNames
from repository_v3.mlrepository import MetaProps
from repository_v3.mlrepositoryclient import MLRepositoryClient
from repository_v3.mlrepositoryartifact import MLRepositoryArtifact
import pprint

metaservicePath = "https://10.3.72.69:443"
client = MLRepositoryClient(metaservicePath)
client.authorize_with_token(pc.authToken)

props1 = MetaProps(
        {MetaNames.AUTHOR_NAME:"author",
         MetaNames.AUTHOR_EMAIL:"author@example.com",
         MetaNames.MODEL_META_PROJECT_ID: pc.projectName,
         MetaNames.MODEL_META_ORIGIN_TYPE: "notebook",
         MetaNames.SCOPE: "Project",
         MetaNames.MODEL_META_ORIGIN_ID: pc.nbName})

input_artifact = MLRepositoryArtifact(rf, 
      name="loan_random_forest", meta_props=props1, # Change the name of the model if needed
      training_data=X_train, training_target=y_train)

client.models.save(artifact=input_artifact)
print("model saved successfully")


# Decision Tree

For learning purposes we are creating and training some additional machine learning models. This allows us to compare how different models give different results.

In [None]:
# CELL 26

# Create and train a decision tree model

#### Insert code below. 


In [None]:
# CELL 27

# Get the classification report

#### Insert code below. Print the classification report for this fit of the model



In [None]:
# CELL 28

# Plot the confusion matrix for this decision tree

#### Insert code below. 



In [None]:
# CELL 29

# Classify features by order of importance for the decision tree
# `dt` is assumed to be the decision tree fitted in CELL 26; adjust the name if yours differs

feature_cols = X_test.columns
importance = pd.DataFrame(
    {"feature": feature_cols[:], "importance": dt.feature_importances_[:]}
)
importance.sort_values(
    by="importance",
    axis=0,
    ascending=False,
    inplace=True,
    kind="quicksort",
    na_position="last",
)
importance[:18].plot(x="feature", y="importance", kind="bar")


In [None]:
# CELL 30

# Copy the code used to export the previous model and update this cell. It is in CELL 25
# Delete the imports which have already been done in a previous cell
# Update the name of the model before executing the cell 

### Insert code below. Export decision tree model to the repository.
from repository_v3.mlrepository import MetaNames
from repository_v3.mlrepository import MetaProps
from repository_v3.mlrepositoryclient import MLRepositoryClient
from repository_v3.mlrepositoryartifact import MLRepositoryArtifact
import pprint

metaservicePath = "https://10.3.72.69:443"
client = MLRepositoryClient(metaservicePath)
client.authorize_with_token(pc.authToken)

props1 = MetaProps(
        {MetaNames.AUTHOR_NAME:"author",
         MetaNames.AUTHOR_EMAIL:"author@example.com",
         MetaNames.MODEL_META_PROJECT_ID: pc.projectName,
         MetaNames.MODEL_META_ORIGIN_TYPE: "notebook",
         MetaNames.SCOPE: "Project",
         MetaNames.MODEL_META_ORIGIN_ID: pc.nbName})

input_artifact = MLRepositoryArtifact(rf, 
      name="loan_random_forest", meta_props=props1, # Change the name of the model if needed
      training_data=X_train, training_target=y_train)

client.models.save(artifact=input_artifact)
print("model saved successfully")


# Gradient Boosting Classifier

We are doing the same work as for previous models.

In [None]:
# CELL 31

from sklearn.ensemble import GradientBoostingClassifier

# Create the model and set parameters

# loss is left at its default (the logistic loss, named "deviance" in older scikit-learn)
gbc = GradientBoostingClassifier(
    learning_rate=0.1,
    n_estimators=200,
    subsample=1.0,
    criterion="friedman_mse",
    min_samples_split=2,
    min_samples_leaf=1,
    min_weight_fraction_leaf=0.0,
    max_depth=3,
    min_impurity_decrease=0.0,
    init=None,
    random_state=None,
    max_features=None,
)

# Train the model and fit data
model = gbc.fit(X_train, y_train)
y_pred = gbc.predict(X_test)

# Get the classification report
print(classification_report(y_test, y_pred))

In [None]:
# CELL 32

# Get the confusion matrix for this Gradient Boost Classifier

#### Insert code below. Plot the confusion matrix for this GBC model


In [None]:
# CELL 33

# Copy the code used to export the previous model and update this cell
# Delete the imports which have already been done in a previous cell
# Update the name of the model before executing the cell 

### Insert code below. Export GBC model to the repository.


# Feature Scaling for SVM & Logistic Regression Models

Before we can start working with Support Vector Machine (SVM) algorithms we need to do some technical work called _feature scaling_. You can learn a little more about it in this [Wikipedia entry](https://en.wikipedia.org/wiki/Feature_scaling).
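To show what standardization does, here is a minimal sketch on made-up numbers. It follows one common variant, which is an assumption and not necessarily the order used in this lab: the scaler learns its mean and standard deviation from the training rows only and then reuses them on the test rows. The two synthetic features (an amount and a count) are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two made-up features on very different scales (an amount and a count).
X_demo = np.column_stack(
    [rng.normal(50_000, 10_000, 200), rng.integers(0, 10, 200)]
)
y_demo = rng.integers(0, 2, 200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn mean/std from the training rows only
X_te_scaled = scaler.transform(X_te)      # reuse those statistics on the test rows

# After scaling, each training column has mean ~0 and standard deviation ~1.
print(X_tr_scaled.mean(axis=0).round(3), X_tr_scaled.std(axis=0).round(3))
```

Distance-based models such as SVM are sensitive to the scale of each column, which is why features measured in tens of thousands should not sit unscaled next to features measured in single digits.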

In [None]:
# CELL 34

# Standard processing for feature scaling

sc = StandardScaler()
X = X.drop(columns=["age"])  # non-inplace drop avoids modifying a slice of df
X = sc.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)


# Support Vector Machine

In [None]:
# CELL 35

# Create, train and fit the SVM model. Print the classification report

svc = svm.SVC(
    C=5,
    kernel="rbf",
    degree=3,
    gamma="auto",
    coef0=0.0,
    shrinking=True,
    probability=False,
    tol=0.001,
    cache_size=200,
    class_weight=None,
    verbose=False,
    max_iter=-1,
    decision_function_shape="ovr",
    random_state=None,
)
model = svc.fit(X_train, y_train)

y_pred = svc.predict(X_test)

print(classification_report(y_test, y_pred))

In [None]:
# CELL 36

# Get the confusion matrix as with every other model.

#### Insert code below. Get the confusion matrix for your SVM model



In [None]:
# CELL 37

# Copy the code used to export the previous model and update this cell
# Delete the imports which have already been done in a previous cell
# Update the name of the model before executing the cell 

### Insert code below. Export SVM model to the repository.


# Logistic Regression

In [None]:
# CELL 38

# Create and fit a Logistic Regression model

#### Insert code below. Invoke the model and fit it, then make your predictions




In [None]:
# CELL 39

# Get the confusion matrix for the Logistic Regression model

cm6 = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'])
sns.heatmap(cm6,annot = True)


In [None]:
# CELL 40

# Copy the code used to export the previous model and update this cell
# Delete the imports which have already been done in a previous cell
# Update the name of the model before executing the cell 

### Insert code below. Export logistic regression model to the repository.


**END OF LAB-3**

# Try: Use only two features to plot Decision Boundary

We are adding here some more cells for you to work on some additional areas of modelling. 

Classifiers such as SVM and Logistic Regression separate the classes with a _decision boundary_: the frontier in feature space where the predicted class changes. We are using just two features because the boundary can then be drawn on a plane, which makes it much easier to understand.

With two features and a linear model, the boundary is a single line separating the data points into two regions; non-linear models such as Random Forest can produce more irregular regions. The data points in each region are predicted to belong to the same _class_. The two classes come from the two values of `status`, not from the number of features.

Read the comments inside the code for suggestions.

In [None]:
# CELL 41

def plot_decision_boundary(model, X, y):
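    # Min/max per column; .iloc keeps the access positional whatever the column names are
    X_max = X.max(axis=0)
    X_min = X.min(axis=0)
    xticks = np.linspace(X_min.iloc[0], X_max.iloc[0], 100)
    yticks = np.linspace(X_min.iloc[1], X_max.iloc[1], 100)
    # Build a 100x100 grid over the feature plane and classify every grid point
    xx, yy = np.meshgrid(xticks, yticks)
    ZZ = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = ZZ >= 0.5
    Z = Z.reshape(xx.shape)
    fig, ax = plt.subplots()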
    ax.contourf(xx, yy, Z, cmap=plt.cm.PRGn, alpha=0.6)
    ax.scatter(X.iloc[:, 0], X.iloc[:, 1], c=y, alpha=0.6)


In [None]:
# CELL 42

# You can test here changing times_balance_below_5K for some other feature
# the second best fitting, for example: X = df[["min_balance_before_loan", "amount_INTEREST IF NEG. BALANCE"]]

X = df[["min_balance_before_loan", "times_balance_below_5K"]]
y = df["status"]


In [None]:
# CELL 43

rf = ensemble.RandomForestClassifier(
    n_estimators=500,
    criterion="gini",
    max_depth=4,
    min_samples_split=2,
    min_samples_leaf=1,
    min_weight_fraction_leaf=0.0,
    max_features="sqrt",  # same behaviour as the legacy "auto" value for classifiers
    max_leaf_nodes=None,
    min_impurity_decrease=0.0,
    bootstrap=True,
    oob_score=False,
    n_jobs=1,
    random_state=None,
    verbose=0,
    warm_start=False,
    class_weight=None,
)

X_train, X_test, y_train, y_test = train_test_split(X, y)
model = rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
f1 = f1_score(y_test, y_pred)  # scikit-learn expects (y_true, y_pred)
f1


In [None]:
# CELL 44

print(classification_report(y_test, y_pred))


In [None]:
# CELL 45

cm7 = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'])
sns.heatmap(cm7,annot = True)


In [None]:
# CELL 46

plot_decision_boundary(model, X_test, y_test)
plt.xlabel("min_balance_before_loan")

# If you have changed the column times_balance_below_5K for some other,
# You'll have to change the label in this graph.
plt.ylabel("times_balance_below_5K")


In [None]:
# CELL 47

# This kind of graph is easy to understand with two features, 
# if more features are used we might not be able to graph it
# because it would exceed 3 dimensions.

feature_cols = X_test.columns
importance = pd.DataFrame(
    {"feature": feature_cols[:], "importance": rf.feature_importances_[:]}
)
importance.sort_values(
    by="importance",
    axis=0,
    ascending=False,
    inplace=True,
    kind="quicksort",
    na_position="last",
)
importance[:18].plot(x="feature", y="importance", kind="bar")
