# Practice Exercise 3

Today's exercise focuses on the analysis of your machine learning models. To this end, we will use the well-curated [deepchecks](https://github.com/deepchecks/deepchecks) library, which offers model evaluation checks alongside its data integrity and validation functions.

<img src="https://raw.githubusercontent.com/deepchecks/deepchecks/ce15432c6224c4e2636481fc168bb9e3b01d734e/docs/source/_static/images/general/pipeline_when_to_validate.svg">



## 1-Installs & Imports

### Install libraries

The following packages are pre-installed if you are using GitHub Codespaces.

If you are not using GitHub Codespaces with the pre-installed kernel, consider creating a conda environment with Python 3.8 and installing the following packages manually. Installing them also pulls in all required dependencies.
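Before running the notebook, it can help to confirm which of these packages are actually available in your environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the package list simply mirrors the install cell below.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as they appear in the install cell of this notebook.
REQUIRED = ["deepchecks", "scikit-learn", "lightgbm", "seaborn", "pandas"]


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Print one line per package so that missing installs are easy to spot.
for name, ver in installed_versions(REQUIRED).items():
    print(f"{name:15s} {ver or 'NOT INSTALLED'}")
```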


In [1]:
# !pip install deepchecks
# !pip install scikit-learn
# !pip install lightgbm
# !pip install seaborn
# !pip install pandas==1.5.3

In [2]:
import numpy as np
from scipy import stats
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split, RepeatedStratifiedKFold, StratifiedKFold

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

## 2-Data Imports

You will now load a custom banking dataset on loan approvals. The label column is `loan_status`.

In [3]:
# Reading the loan data
loan_df = pd.read_csv('data.csv')

Look at the data in brief!

In [4]:
loan_df.head()

Unnamed: 0,loan_amnt,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,purpose,title,dti,earliest_cr_line,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,application_type,mort_acc,pub_rec_bankruptcies,address
0,10000.0,36 months,11.44,329.48,B,B4,Marketing,10+ years,RENT,117000.0,Not Verified,Jan-2015,Fully Paid,vacation,Vacation,26.24,Jun-1990,16.0,0.0,36369.0,41.8,25.0,w,INDIVIDUAL,0.0,0.0,"0174 Michelle Gateway\r\nMendozaberg, OK 22690"
1,8000.0,36 months,11.99,265.68,B,B5,Credit analyst,4 years,MORTGAGE,65000.0,Not Verified,Jan-2015,Fully Paid,debt_consolidation,Debt consolidation,22.05,Jul-2004,17.0,0.0,20131.0,53.3,27.0,f,INDIVIDUAL,3.0,0.0,"1076 Carney Fort Apt. 347\r\nLoganmouth, SD 05113"
2,15600.0,36 months,10.49,506.97,B,B3,Statistician,< 1 year,RENT,43057.0,Source Verified,Jan-2015,Fully Paid,credit_card,Credit card refinancing,12.79,Aug-2007,13.0,0.0,11987.0,92.2,26.0,f,INDIVIDUAL,0.0,0.0,"87025 Mark Dale Apt. 269\r\nNew Sabrina, WV 05113"
3,7200.0,36 months,6.49,220.65,A,A2,Client Advocate,6 years,RENT,54000.0,Not Verified,Nov-2014,Fully Paid,credit_card,Credit card refinancing,2.6,Sep-2006,6.0,0.0,5472.0,21.5,13.0,f,INDIVIDUAL,0.0,0.0,"823 Reid Ford\r\nDelacruzside, MA 00813"
4,24375.0,60 months,17.27,609.33,C,C5,Destiny Management Inc.,9 years,MORTGAGE,55000.0,Verified,Apr-2013,Charged Off,credit_card,Credit Card Refinance,33.95,Mar-1999,13.0,0.0,24584.0,69.8,43.0,f,INDIVIDUAL,1.0,0.0,"679 Luna Roads\r\nGreggshire, VA 11650"
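Beyond `head()`, two quick checks are often useful at this point: how many values are missing per column, and how the label classes are distributed. The sketch below uses a small made-up frame (`toy_df`) with a few of the column names from the dataset; on the real data you would apply the same calls to `loan_df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for loan_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "loan_amnt": [10000.0, 8000.0, 15600.0, 7200.0, 24375.0],
    "mort_acc": [0.0, 3.0, np.nan, 0.0, 1.0],
    "loan_status": ["Fully Paid", "Fully Paid", "Fully Paid", "Fully Paid", "Charged Off"],
})

# Number of missing values per column.
missing_counts = toy_df.isna().sum()

# Share of each class in the label column.
label_share = toy_df["loan_status"].value_counts(normalize=True)

print(missing_counts)
print(label_share)
```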


Meaning of the data features:

- loan_amnt: The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.
- term: The number of payments on the loan. Values are in months and can be either 36 or 60.
- int_rate: Interest rate on the loan
- installment: The monthly payment owed by the borrower if the loan originates.
- grade: LC assigned loan grade
- sub_grade: LC assigned loan subgrade
- emp_title: The job title supplied by the borrower when applying for the loan.
- emp_length: Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.
- home_ownership: The home ownership status provided by the borrower during registration or obtained from the credit report. Our values are: RENT, OWN, MORTGAGE, OTHER
- annual_inc: The self-reported annual income provided by the borrower during registration.
- verification_status: Indicates if income was verified by LC, not verified, or if the income source was verified
- issue_d: The month in which the loan was funded
- loan_status: Current status of the loan
- purpose: A category provided by the borrower for the loan request.
- title: The loan title provided by the borrower
- zip_code: The first 3 numbers of the zip code provided by the borrower in the loan application.
- addr_state: The state provided by the borrower in the loan application
- dti: A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.
- earliest_cr_line: The month the borrower's earliest reported credit line was opened
- open_acc: The number of open credit lines in the borrower's credit file.
- pub_rec: Number of derogatory public records
- revol_bal: Total credit revolving balance
- revol_util: Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.
- total_acc: The total number of credit lines currently in the borrower's credit file
- initial_list_status: The initial listing status of the loan. Possible values are – W, F
- application_type: Indicates whether the loan is an individual application or a joint application with two co-borrowers
- mort_acc: Number of mortgage accounts.
- pub_rec_bankruptcies: Number of public record bankruptcies

## 3-Data Preprocessing

Some custom preprocessing drops irrelevant features and encodes categorical features, among other steps.

In [5]:
def quick_preprocessing(df):
    # Feature engineering
    ## Extract the zip code (last five characters) from the address feature
    df['zip_code'] = df['address'].str[-5:]
    ## Convert the date feature to a datetime feature
    df['earliest_cr_line'] = pd.to_datetime(df['earliest_cr_line'])
    ## Extract the year from the earliest_cr_line feature
    df['earliest_cr_line'] = df['earliest_cr_line'].dt.year.astype(int)

    # Dropping features
    ## emp_length: charge-off rates are extremely similar across all employment lengths, so we drop it.
    ## grade: it is already contained in sub_grade, so we drop it.
    ## issue_d: it would cause data leakage, since knowing it implies that the loan was issued.
    ## address: the zip code has already been extracted and the rest is free text, so we drop it.
    df.drop(columns=['emp_title', 'emp_length', 'title', 'grade', 'address', 'issue_d'], inplace=True)

    # Filling missing values with the (rounded) median value
    df['mort_acc'] = df['mort_acc'].fillna(df['mort_acc'].median().round())
    df['revol_util'] = df['revol_util'].fillna(df['revol_util'].median().round())
    df['pub_rec_bankruptcies'] = df['pub_rec_bankruptcies'].fillna(df['pub_rec_bankruptcies'].median().round())

    # Mapping features
    ## Strip whitespace first so that both ' 36 months' and '36 months' are mapped
    df['term'] = df['term'].str.strip().map({'36 months': 36, '60 months': 60})
    ## Mapping the target feature
    df['loan_status'] = df['loan_status'].map({'Fully Paid': 1, 'Charged Off': 0})

    # Convert the categorical features to integer codes (starting at 1)
    cat_columns = ['sub_grade', 'verification_status', 'purpose', 'initial_list_status', 'application_type', 'home_ownership', 'zip_code']
    for item in cat_columns:
        df[item] = df[item].astype('category').cat.codes + 1

    # One-hot encode the categorical features
    df = pd.get_dummies(df, columns=cat_columns, drop_first=True)
    return df

In [6]:
# Preprocessing the loan dataframe
loan_df = quick_preprocessing(loan_df)
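To see what two of the steps inside `quick_preprocessing` do, here is a small sketch on three rows. The addresses are taken from the sample rows shown earlier, while the frame `toy` itself is only an illustration. It shows the zip code extraction and the effect of `drop_first=True` in `pd.get_dummies`.

```python
import pandas as pd

toy = pd.DataFrame({
    "address": [
        "0174 Michelle Gateway\r\nMendozaberg, OK 22690",
        "1076 Carney Fort Apt. 347\r\nLoganmouth, SD 05113",
        "823 Reid Ford\r\nDelacruzside, MA 00813",
    ],
    "home_ownership": ["RENT", "MORTGAGE", "RENT"],
})

# The last five characters of the address are the zip code.
toy["zip_code"] = toy["address"].str[-5:]

# One-hot encoding; drop_first=True drops the first category level
# (here MORTGAGE), so only a RENT indicator column remains.
encoded = pd.get_dummies(toy[["home_ownership"]], columns=["home_ownership"], drop_first=True)

print(toy["zip_code"].tolist())
print(encoded.columns.tolist())
```

Dropping the first level avoids a redundant column: with two categories, a single indicator already carries all the information.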

Perform your train-test split

In [7]:
# Separating the feature matrix and the target variable
y =  loan_df['loan_status']
X = loan_df.drop(['loan_status'], axis=1)

# Splitting the data with a train:test ratio of 80:20
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
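The split above does not stratify by label. If the label classes turn out to be imbalanced, one option is to pass `stratify=y` so that both parts keep the same class ratio. The sketch below demonstrates this on synthetic data; `X_toy` and `y_toy` are made up and are not part of the exercise.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary label: 80 positives and 20 negatives.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 80 + [0] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=42, stratify=y_toy
)

# With stratification, both parts keep the 80/20 class ratio.
print(y_tr.mean(), y_te.mean())
```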

Create deepchecks `Dataset` instances for the training and the test data

In [8]:
from deepchecks.tabular import Dataset

ds_train = X_train.merge(y_train, left_index=True, right_index=True)
ds_test = X_test.merge(y_test, left_index=True, right_index=True)
ds_train = Dataset(ds_train, label="loan_status")
ds_test =  Dataset(ds_test,  label="loan_status")



## 4-Modelling

In [9]:
# Apply AdaBoost Classifier 

from sklearn.ensemble import AdaBoostClassifier
ada_clf = AdaBoostClassifier(random_state=42)
ada_clf.fit(ds_train.data[ds_train.features], ds_train.data[ds_train.label_name])

In [10]:
# Apply LightGBM Classifier
from lightgbm import LGBMClassifier
lgb_clf = LGBMClassifier(random_state=42)
lgb_clf.fit(ds_train.data[ds_train.features], ds_train.data[ds_train.label_name])

## 5-Model Analysis

Now, let's start with today's topic of interest: model analysis! You can refer to all the checks available for model evaluation [here](https://docs.deepchecks.com/stable/tabular/auto_checks/model_evaluation/index.html).

### Model Info

First, let's gather some information about the fitted models.

In [11]:
from deepchecks.tabular.checks import ModelInfo

In [12]:
ModelInfo().run(lgb_clf)


In [13]:
ModelInfo().run(ada_clf)

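For scikit-learn compatible estimators, you can also read the hyperparameters directly with `get_params()`. This is a plain scikit-learn sketch, not a deepchecks call, and the estimator is created fresh here so that the snippet runs on its own.

```python
from sklearn.ensemble import AdaBoostClassifier

# Same constructor call as in the modelling section.
clf = AdaBoostClassifier(random_state=42)

# get_params() returns a dict of all constructor parameters.
params = clf.get_params()
for name in sorted(params):
    print(f"{name}: {params[name]}")
```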

### Inference Times

Inference speed can matter in ML operations, for example when predictions have to be served under latency constraints. It is therefore worth measuring the inference time of your models.

In [14]:
from deepchecks.tabular.checks import ModelInferenceTime

In [15]:
ModelInferenceTime().run(ds_test, ada_clf)


In [16]:
ModelInferenceTime().run(ds_test, lgb_clf)

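If you want to cross-check such a measurement by hand, a simple timing loop is enough. The following is a minimal sketch using the standard library; `average_inference_time` and `dummy_predict` are hypothetical names, and the dummy function only stands in for a model's `predict` method.

```python
import time


def average_inference_time(predict, samples, n_repeats=5):
    """Average seconds per sample over n_repeats calls of predict(samples)."""
    start = time.perf_counter()
    for _ in range(n_repeats):
        predict(samples)
    elapsed = time.perf_counter() - start
    return elapsed / (n_repeats * len(samples))


# Hypothetical stand-in for a fitted model's predict method.
def dummy_predict(rows):
    return [1 if sum(row) > 0 else 0 for row in rows]


rows = [[0.1 * i, -0.05 * i] for i in range(1000)]
per_sample = average_inference_time(dummy_predict, rows)
print(f"{per_sample:.2e} seconds per sample")
```

With the fitted models from this notebook, a call such as `average_inference_time(ada_clf.predict, X_test)` should work in the same way.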

### Model Performance

Now, regarding your trained classifiers, you might be interested in their actual performance. The confusion matrix shows, for each true class, how many samples were assigned to each predicted class.

In [17]:
from deepchecks.tabular.checks import ConfusionMatrixReport

In [18]:
ConfusionMatrixReport().run(ds_test, lgb_clf).show()


In [19]:
ConfusionMatrixReport().run(ds_test, ada_clf).show()


You can also define a threshold for misclassifications and add it as a condition, so that the check summarizes whether the model under investigation passes.

In [20]:
# Let's add a condition and re-run the check:
ConfusionMatrixReport().add_condition_misclassified_samples_lower_than_condition(misclassified_samples_threshold=0.2).run(ds_test, ada_clf).show()

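To make the numbers behind a confusion matrix concrete, the sketch below counts the cells by hand on made-up labels and applies a simplified threshold rule: every off-diagonal cell must hold less than 20% of all samples. This rule is an assumption for illustration, and the exact definition of the deepchecks condition may differ.

```python
from collections import Counter

# Made-up labels and predictions (1 = Fully Paid, 0 = Charged Off).
y_true = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1, 1, 0]

# Count each (true, predicted) pair; these are the confusion matrix cells.
cells = Counter(zip(y_true, y_pred))
n = len(y_true)

# Off-diagonal cells are the misclassifications, expressed as shares of all samples.
misclassified_share = {
    pair: count / n for pair, count in cells.items() if pair[0] != pair[1]
}

# Simplified condition (assumption): every misclassified cell stays below the threshold.
threshold = 0.2
passes = all(share < threshold for share in misclassified_share.values())

print(dict(cells))
print(misclassified_share, passes)
```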

### Segment Performance (Slicing)

You might also be interested in the model's performance on different slices of your data. Try some experiments of your own.

In [21]:
from deepchecks.tabular.checks import SegmentPerformance

In [22]:
SegmentPerformance(feature_1='annual_inc', feature_2='installment').run(ds_test, lgb_clf)


The SegmentPerformance check is deprecated and will be removed in the 0.11 version. Please use the WeakSegmentsPerformance check instead.




In [23]:
SegmentPerformance(feature_1='annual_inc', feature_2='installment').run(ds_test, ada_clf)


The SegmentPerformance check is deprecated and will be removed in the 0.11 version. Please use the WeakSegmentsPerformance check instead.



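The idea behind slicing can be reproduced with plain pandas: bin a feature, then compute a score per bin. The sketch below uses a made-up evaluation frame and a single feature, whereas the check above segments by two features, so treat it as an illustration of the principle only.

```python
import pandas as pd

# Made-up evaluation frame: one feature, true labels, and predictions.
eval_df = pd.DataFrame({
    "annual_inc": [20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000],
    "y_true": [0, 0, 1, 1, 1, 1, 1, 1],
    "y_pred": [1, 0, 1, 0, 1, 1, 1, 1],
})
eval_df["correct"] = eval_df["y_true"] == eval_df["y_pred"]

# Split the feature into two equally sized, quantile-based bins.
eval_df["income_bin"] = pd.qcut(eval_df["annual_inc"], q=2, labels=["low", "high"])

# Accuracy per segment.
segment_accuracy = eval_df.groupby("income_bin", observed=True)["correct"].mean()
print(segment_accuracy)
```

A large gap between segments, as in this toy data, is a hint that the model serves some parts of the population worse than others.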

### Feature Importance and Use

Digging further into your models raises the question of whether there are any features that your model(s) do not use.

In [24]:
from deepchecks.tabular.checks import UnusedFeatures

In [25]:
UnusedFeatures(feature_variance_threshold=1.5).run(ds_test, ada_clf).show()


In [26]:
UnusedFeatures(feature_variance_threshold=1.5).run(ds_test, lgb_clf).show()

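A rough way to think about an "unused" feature is one that varies a lot in the data but receives little or no importance from the model. The sketch below illustrates this with a depth-1 decision tree on synthetic data. It is not how deepchecks computes the check, and all names and values are made up.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
signal = rng.normal(size=200)              # determines the label
noise = rng.normal(scale=10.0, size=200)   # high variance, unrelated to the label
X_toy = np.column_stack([signal, noise])
y_toy = (signal > 0).astype(int)

# A depth-1 tree can split on one feature only, so the other one stays unused.
stump = DecisionTreeClassifier(max_depth=1, random_state=42).fit(X_toy, y_toy)

importances = stump.feature_importances_
variances = X_toy.var(axis=0)
for name, imp, var in zip(["signal", "noise"], importances, variances):
    print(f"{name}: importance={imp:.2f}, variance={var:.2f}")
```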

### Bias

Lastly, you may be curious whether your models are biased. To find out, assess performance bias with deepchecks for selected features.

In [27]:
from deepchecks.tabular.checks.model_evaluation import PerformanceBias

In [28]:
PerformanceBias(protected_feature="pub_rec", control_feature="annual_inc", scorer="accuracy", max_segments=3).run(ds_test, lgb_clf).show()



In [29]:
PerformanceBias(protected_feature="pub_rec", control_feature="annual_inc", scorer="accuracy", max_segments=3).run(ds_test, ada_clf).show()
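At its core, a performance bias assessment compares a score across the groups of a protected feature. The sketch below computes the accuracy per `pub_rec` group and the gap between the best and worst group on made-up data. It leaves out the control feature used above, so it is a simplification rather than a reimplementation of the check.

```python
import pandas as pd

# Made-up evaluation frame with a protected feature, labels, and predictions.
eval_df = pd.DataFrame({
    "pub_rec": [0, 0, 0, 0, 1, 1, 1, 1],
    "y_true": [1, 1, 0, 1, 1, 0, 0, 1],
    "y_pred": [1, 1, 0, 1, 1, 1, 0, 0],
})
eval_df["correct"] = eval_df["y_true"] == eval_df["y_pred"]

# Accuracy for each group of the protected feature.
group_accuracy = eval_df.groupby("pub_rec")["correct"].mean()

# Gap between the best and the worst group.
accuracy_gap = group_accuracy.max() - group_accuracy.min()

print(group_accuracy)
print(f"accuracy gap: {accuracy_gap:.2f}")
```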


Thank you for working through this week's material on model analysis.