<div class="jumbotron">
  <h1 class="display-3">Credit Risk Case Study in Python</h1>
  <p>Using Logistic Regression</p>
  
<hr>
<h3> What is Credit risk? </h3>
Credit risk is the risk that a borrower fails to repay a loan, so that the lender loses the principal, the associated interest, or both. In the banking sector it is an important factor to weigh before approving an applicant's loan.

<h3> How Is Credit Risk Assessed? </h3>
Credit risk is assessed from the borrower's overall ability to repay. To assess credit risk on a consumer loan, lenders look at the five C's: the applicant's credit history, capacity to repay, capital, the loan's conditions and the associated collateral.

<h3> Problem </h3>
The goal is to automate the loan eligibility process based on the details customers provide when filling in the online application form. These details include Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History and others. To automate the process, we are asked to identify the customer segments that are eligible for a loan, so that the lender can target these customers specifically. Only a partial data set has been provided.

<h3> Data Variables and Description </h3>
<table>
 <tr> <th align="left">Variable</th> <th align="left">Description</th> </tr>
 <tr> <td>Loan_ID</td> <td>Unique Loan ID</td> </tr>
 <tr> <td>Gender</td> <td>Male/ Female</td> </tr>
 <tr> <td>Married</td> <td>Applicant married (Y/N)</td> </tr>
 <tr> <td>Dependents</td> <td>Number of dependents</td> </tr>
 <tr> <td>Education</td> <td>Applicant Education (Graduate/ Under Graduate)</td> </tr>
 <tr> <td>Self_Employed</td> <td>Self employed (Y/N)</td> </tr>
 <tr> <td>ApplicantIncome</td> <td>Applicant income</td> </tr>
 <tr> <td>CoapplicantIncome</td> <td>Coapplicant income</td> </tr>
 <tr> <td>LoanAmount</td> <td>Loan amount in thousands</td> </tr>
 <tr> <td>Loan_Amount_Term</td> <td>Term of loan in months</td> </tr>
 <tr> <td>Credit_History</td> <td>credit history meets guidelines</td> </tr>
 <tr> <td>Property_Area</td> <td>Urban/ Semi Urban/ Rural</td> </tr>	
 <tr> <td>Loan_Status</td> <td>Loan approved (Y/N)</td> </tr>			
</table>

<h3> Dataset File Given </h3>
* Credit_Risk_Train_Data
* Credit_Risk_Test_Data
  
 
 <h2><span class="label label-info">Index</span></h2> <br>
 &#x25FE; [Importing Datasets](#ImportingDatasets) <br>
 &#x25FE; [Finding NULL Values](#FindNULLVal) <br>
 &#x25FE; [Counting Levels in Datasets](#CountDSLevels) <br>
 &#x25FE; [Treating NULL Values & Converting Variables into 0's and 1's](#TreatNULLVal) <br>
 &#x25FE; [Plot and Graphs](#Plots) <br>
 &#x25FE; [Logistic Regression](#logreg) <br>
 &#x25FE; [Classification Report using Stats Model](#classreport_statsmod) <br>
 &#x25FE; [Classification Report using Sci Kit Learn](#classreport_scikit) <br>
 &#x25FE; [Model Performance Evaluation](#ModPerformEva) <br>
 &#x25FE; [Confusion Matrix](#ConfMat) <br>
 &#x25FE; [Adjusting the Classification Threshold](#AdjClassThershold) <br>
 &#x25FE; [Decreasing the Threshold (Optional)](#DecThreshold) <br>
 &#x25FE; [ROC Curves and Area Under the Curve (AUC)](#rocauc) <br>
 &#x25FE; [Exporting predicted values in Validate Dataset File](#ExportVals) <br>
 &#x25FE; [Conclusion](#Conclusion) <br>
</div>
<a id="head"></a>

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from pandas import Series, DataFrame

import sklearn
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer scikit-learn releases
from sklearn import metrics 
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve, auc

import pylab as pl

<a id="ImportingDatasets"></a>
<div class="alert alert-block alert-info" style="color:#000000">
# Importing Datasets

## <span class="label label-default">Importing Training Dataset</span>

In [3]:
# Importing Train Dataset
train_data = pd.read_csv("F:/Lectures/Data Science/iMarticus/Python/Scripts/iMarticus-Projects/Datasets/Credit_Risk_Train_Data.csv")
train_data = pd.DataFrame(train_data)
train_data.shape # Shape gives you total number of observations and variables present in datasets

(614, 13)

In [4]:
train_data.info() # info() lists each variable with its non-null count and data type.
train_data.head(3) # head() returns the first few rows of the dataset (3 is the number of rows).

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
Loan_ID              614 non-null object
Gender               601 non-null object
Married              611 non-null object
Dependents           599 non-null object
Education            614 non-null object
Self_Employed        582 non-null object
ApplicantIncome      614 non-null int64
CoapplicantIncome    614 non-null float64
LoanAmount           592 non-null float64
Loan_Amount_Term     600 non-null float64
Credit_History       564 non-null float64
Property_Area        614 non-null object
Loan_Status          614 non-null object
dtypes: float64(4), int64(1), object(8)
memory usage: 62.4+ KB


Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y


## <span class="label label-default">Importing Testing Dataset</span>

In [5]:
# Importing Test Dataset
test_data = pd.read_csv("F:/Lectures/Data Science/iMarticus/Python/Scripts/iMarticus-Projects/Datasets/Credit_Risk_Test_Data.csv")
test_data = pd.DataFrame(test_data)
test_data.shape

(367, 13)

In [6]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 367 entries, 0 to 366
Data columns (total 13 columns):
Loan_ID              367 non-null object
Gender               356 non-null object
Married              367 non-null object
Dependents           357 non-null object
Education            367 non-null object
Self_Employed        344 non-null object
ApplicantIncome      367 non-null int64
CoapplicantIncome    367 non-null int64
LoanAmount           362 non-null float64
Loan_Amount_Term     361 non-null float64
Credit_History       338 non-null float64
Property_Area        367 non-null object
outcome              367 non-null object
dtypes: float64(3), int64(2), object(8)
memory usage: 37.4+ KB


In [7]:
test_data.head(3)

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,outcome
0,LP001015,Male,Yes,0,Graduate,No,5720,0,110.0,360.0,1.0,Urban,Y
1,LP001022,Male,Yes,1,Graduate,No,3076,1500,126.0,360.0,1.0,Urban,Y
2,LP001031,Male,Yes,2,Graduate,No,5000,1800,208.0,360.0,1.0,Urban,Y


<a id="FindNULLVal"></a>
<div class="alert alert-block alert-info" style="color:#000000">
# Finding NULL Values

In [8]:
train_data.isnull().sum() # Gives Variable wise NaN values

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

In [9]:
train_data.isnull().sum().sum() # Gives the total number of NaN values in the dataset

149

In [10]:
test_data.isnull().sum() 

Loan_ID               0
Gender               11
Married               0
Dependents           10
Education             0
Self_Employed        23
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount            5
Loan_Amount_Term      6
Credit_History       29
Property_Area         0
outcome               0
dtype: int64

In [11]:
test_data.isnull().sum().sum() #84

84

<a id="CountDSLevels"></a>
<div class="alert alert-block alert-info" style="color:#000000">
# Counting Levels in Datasets

Counting the levels of each variable matters because it shows how values are distributed within a categorical variable. It also exposes wrongly entered or non-numeric levels, as with the **Dependents** variable, where the level `3+` occurs 51 times.
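
As a minimal sketch of why level counts matter, the toy series below (hypothetical values, not the real file) shows how `value_counts(dropna=False)` surfaces both a non-numeric level such as `3+` and the missing entries in a single call.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Dependents column; not the real data.
dependents = pd.Series(["0", "1", "3+", "0", np.nan, "2", "3+", "0"])

# dropna=False keeps NaN as its own level, so missing values are counted too.
levels = dependents.value_counts(dropna=False)
print(levels)

# Two quick checks that often follow a level count.
n_missing = int(dependents.isna().sum())
n_three_plus = int((dependents == "3+").sum())
print(n_missing, n_three_plus)
```

The default `value_counts()` silently drops NaN, which is why the notebook counts missing values separately with `isnull().sum()`.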

## <span class="label label-default">For Training Dataset</span>

In [None]:
train_data.Gender.value_counts()

In [None]:
train_data.Married.value_counts()

In [None]:
train_data.Dependents.value_counts()  # 3+

In [None]:
train_data.Education.value_counts()

In [None]:
train_data.Self_Employed.value_counts()

In [None]:
train_data.LoanAmount.value_counts()

In [None]:
train_data.Loan_Amount_Term.value_counts()

In [None]:
train_data.Credit_History.value_counts()

In [None]:
train_data.Property_Area.value_counts()

In [None]:
train_data.Loan_Status.value_counts()

## <span class="label label-default">For Testing Dataset</span>

In [None]:
test_data.Gender.value_counts()

In [None]:
test_data.Married.value_counts()

In [None]:
test_data.Dependents.value_counts()  # 3+

In [None]:
test_data.Education.value_counts()

In [None]:
test_data.Self_Employed.value_counts()

In [None]:
test_data.LoanAmount.value_counts()

In [None]:
test_data.Loan_Amount_Term.value_counts()

In [None]:
test_data.Credit_History.value_counts()

In [None]:
test_data.Property_Area.value_counts()

<a id="TreatNULLVal"></a>
<div class="alert alert-block alert-info" style="color:#000000">
# Treating NULL Values and Converting Variables into 0's and 1's 

### Key

* **Gender**          : Male=1 | Female=0
* **Married**         : Yes=1 | No=0
* **Dependents**      : 0,1,2,3 (3 stands for "3+")
* **Education**       : Graduate=0 | Not Graduate=1
* **Self_Employed**   : Yes=1 | No=0
* **Credit_History**  : 0 and 1
* **Property_Area**   : Rural=0 | Semiurban=1 | Urban=2
* **Loan_Status**     : Yes=1 | No=0

### Categorical Encoding
* <p>The idea is to convert every categorical variable into integer codes (0, 1, 2, ...). The logistic regression implementations used here need numeric inputs, so string categories have to be encoded before the model can be fitted.</p>
* <p>For this we use the LabelEncoder class from scikit-learn, which encodes labels with values between **0 and n_classes-1**, assigned in sorted label order. There are several other [methods](http://pbpython.com/categorical-encoding.html) to do this. [Label Encoder Help](https://chrisalbon.com/machine-learning/convert_pandas_categorical_column_into_integers_for_scikit-learn.html) <br></p>
* <p>The **fit_transform(y)** function fits the label encoder (for example Female/Male) and returns the encoded labels (0/1).</p>
* <p>Because each column is first cast with astype('str'), NaN values become the string 'nan' and fit_transform() encodes them as an extra level. So we have to take care of them in the best possible way.</p>
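
The encoding behaviour described above can be sketched on toy labels. The lists below are hypothetical; the string `'nan'` stands in for what a missing value looks like after `astype('str')` in the pandas versions this notebook was written for.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy columns; 'nan' mimics a missing value cast to string.
gender = ["Male", "Female", "nan", "Male"]
education = ["Graduate", "Not Graduate", "Graduate"]

def fit_and_map(values):
    """Fit a LabelEncoder and return (codes, label -> code mapping)."""
    encoder = LabelEncoder()
    codes = encoder.fit_transform(values)
    # classes_ holds the labels in sorted order; the position is the code.
    mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}
    return codes.tolist(), mapping

gender_codes, gender_map = fit_and_map(gender)
education_codes, education_map = fit_and_map(education)
print(gender_map)
print(education_map)
```

Printing the mapping after every `fit_transform` is a cheap way to confirm which integer each level received before interpreting coefficients.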

In [None]:
# Importing the LabelEncoder Libraries
from sklearn.preprocessing import LabelEncoder
number = LabelEncoder()

## <span class="label label-default">For Training Dataset</span>

In [None]:
train_data['Gender'] = number.fit_transform(train_data['Gender'].astype('str')) # This makes three levels: 0, 1 and NaN = 2
avgnum = int(np.round(np.mean(train_data['Gender']))) # So, we convert '2' into either 1 or 0 by taking the rounded mean.
train_data['Gender'] = train_data['Gender'].replace(2, avgnum) # Replaces '2' with the rounded mean (assigned back instead of a chained in-place edit)

train_data['Married'] = number.fit_transform(train_data['Married'].astype('str'))
avgnum = int(np.round(np.mean(train_data['Married'])))
train_data['Married'] = train_data['Married'].replace(2, avgnum)

train_data['Dependents'] = number.fit_transform(train_data['Dependents'].astype('str')) # Creates 5 levels: 0, 1, 2, '3+' = 3 and NaN = 4
train_data['Dependents'] = train_data['Dependents'].replace(4, 3) # The NaN level (4) is merged into level 3

train_data['Education'] = number.fit_transform(train_data['Education'].astype('str'))

train_data['Self_Employed'] = number.fit_transform(train_data['Self_Employed'].astype('str'))
avgnum = int(np.round(np.mean(train_data['Self_Employed'])))
train_data['Self_Employed'] = train_data['Self_Employed'].replace(2, avgnum)

avgnum = np.round(np.mean(train_data['Loan_Amount_Term'])) # It gives 342, which is closest to the common term of 360
train_data['Loan_Amount_Term'] = train_data['Loan_Amount_Term'].fillna(360) # So, filled NA values with 360

avgnum = np.round(np.mean(train_data['LoanAmount'])) # It gives 146
train_data['LoanAmount'] = train_data['LoanAmount'].fillna(146) # So, filled NA values with 146

avgnum = np.round(np.mean(train_data['Credit_History'])) # It gives 1
train_data['Credit_History'] = train_data['Credit_History'].fillna(1) # So, filled NA values with 1

train_data['Property_Area'] = number.fit_transform(train_data['Property_Area'].astype('str'))
train_data['Loan_Status'] = number.fit_transform(train_data['Loan_Status'].astype('str'))

In [None]:
train_data.head(3)

In [None]:
train_data.isnull().sum() 

## <span class="label label-default">For Testing Dataset</span>

In [None]:
test_data['Gender'] = number.fit_transform(test_data['Gender'].astype('str')) # Three levels: 0, 1 and NaN = 2
avgnum = int(np.round(np.mean(test_data['Gender'])))
test_data['Gender'] = test_data['Gender'].replace(2, avgnum) # Assigned back instead of a chained in-place edit

test_data['Married'] = number.fit_transform(test_data['Married'].astype('str'))
avgnum = int(np.round(np.mean(test_data['Married'])))
test_data['Married'] = test_data['Married'].replace(2, avgnum)

test_data['Dependents'] = number.fit_transform(test_data['Dependents'].astype('str')) # Creates 5 levels: 0, 1, 2, '3+' = 3 and NaN = 4
test_data['Dependents'] = test_data['Dependents'].replace(4, 3) # The NaN level (4) is merged into level 3

test_data['Education'] = number.fit_transform(test_data['Education'].astype('str'))

test_data['Self_Employed'] = number.fit_transform(test_data['Self_Employed'].astype('str'))
avgnum = int(np.round(np.mean(test_data['Self_Employed'])))
test_data['Self_Employed'] = test_data['Self_Employed'].replace(2, avgnum)

avgnum = np.round(np.mean(test_data['Loan_Amount_Term'])) # It gives 343, which is closest to the common term of 360
test_data['Loan_Amount_Term'] = test_data['Loan_Amount_Term'].fillna(360) # So, filled NA values with 360

avgnum = np.round(np.mean(test_data['LoanAmount'])) # It gives 136
test_data['LoanAmount'] = test_data['LoanAmount'].fillna(136) # So, filled NA values with 136

avgnum = np.round(np.mean(test_data['Credit_History'])) # It gives 1
test_data['Credit_History'] = test_data['Credit_History'].fillna(1) # So, filled NA values with 1

test_data['Property_Area'] = number.fit_transform(test_data['Property_Area'].astype('str'))
test_data['outcome'] = number.fit_transform(test_data['outcome'].astype('str'))

In [None]:
test_data.head(3)

In [None]:
test_data.isnull().sum() 

<a id="Plots"></a>
<div class="alert alert-block alert-info" style="color:#000000">
# Plot and Graphs 

Now we plot some graphs to visualize the training dataset. For this I am going to use seaborn, a Python visualization library based on matplotlib. [More Info.](https://seaborn.pydata.org/introduction.html#introduction)

In [None]:
# Setting up the style and grid style of seaborn graphs
sns.set(style="white")
sns.set(style="whitegrid", color_codes=True)
# palette=pkmn_type_colors

## <span class="label label-default">Heatmap</span>

In [None]:
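# Calculate correlations and show them as a heatmap
corr = train_data.drop(columns=['Loan_ID']).corr() # Loan_ID is a text identifier, so it is excluded from the correlation matrix
sns.heatmap(corr, annot=True, fmt="0.2f"); # annot: write the data value in each cell | fmt: string formatting code (d=decimal, f=float)
plt.xticks(rotation=-90) # Rotate the x-axis labels so they do not overlap
plt.show()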

The scale shows the Pearson coefficient (-1 to 1); values near 1 or -1 indicate a strong correlation. <br>
The heatmap shows very little correlation between most of the variables and the target variable **Loan_Status**. Only **Credit_History** correlates noticeably with **Loan_Status**.
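
To make the Pearson coefficient behind each heatmap cell concrete, here is a minimal sketch that computes it by hand on two hypothetical 0/1 vectors (they are illustrative only, not columns from the dataset) and checks it against NumPy.

```python
import numpy as np

# Hypothetical toy vectors: a binary feature and a binary target.
credit_history = np.array([1, 1, 0, 1, 0, 1, 1, 0], dtype=float)
loan_status = np.array([1, 1, 0, 1, 0, 0, 1, 0], dtype=float)

def pearson(x, y):
    """Pearson r: covariance of x and y divided by the product of their spreads."""
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

r = pearson(credit_history, loan_status)
r_numpy = float(np.corrcoef(credit_history, loan_status)[0, 1])
print(round(r, 3), round(r_numpy, 3))
```

`DataFrame.corr()` applies the same formula to every pair of numeric columns, which is what the heatmap then colours.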

## <span class="label label-default">Distribution Plot (a.k.a. Histogram)</span>

In [None]:
# Distribution Plot (a.k.a. Histogram)
sns.histplot(train_data.ApplicantIncome, kde=True) # histplot replaces the deprecated distplot; kde=True overlays the density curve
plt.show()

In [None]:
sns.catplot(x="Gender", y="Credit_History", hue="Education", data=train_data, kind="bar", palette="muted", legend=True) # catplot is the current name of factorplot
plt.show()

## <span class="label label-default">Pair Plot</span>

In [None]:
# Pair Plot
sns.pairplot(train_data)
plt.show()

## <span class="label label-default">Boxplot</span>

In [None]:
# Boxplot
sns.boxplot(data=train_data, palette="deep")
sns.boxplot(x='Gender' , y='ApplicantIncome', data=train_data, palette="deep")
plt.show()

## <span class="label label-default">Joint Distribution Plot</span>

In [None]:
# Joint Distribution Plot
sns.jointplot(x='Gender', y='Loan_Status', data=train_data)
plt.show()


<a id="logreg"></a>
<div class="alert alert-block alert-info" style="color:#000000">
# === Logistic Regression ===

#### We drop the Loan_ID column because it is not required, then build the model with all variables first to find the most significant ones (using p-values).

In [None]:
train_data_new = train_data.drop(['Loan_ID'], axis=1)
train_data_new.head(3) # 94.55% Accuracy | AUC = 92%

In [None]:
# Validating Dataset
test_data_new = test_data.drop(['Loan_ID'], axis=1)
test_data_new.head(3)

In [None]:
# Taking Train Dataset values in two variables
X_train = train_data_new.iloc[:, 0:11] # .ix was removed from pandas; .iloc selects the first 11 columns by position
X_train.head(3)

In [None]:
y_train = train_data_new.iloc[:, 11] # Loan_Status is the 12th column (position 11)
y_train.head(5)

In [None]:
# Taking the test dataset values as validation data
# Its "outcome" column holds the actual loan status,
# so the predicted values can be compared with it
X_test = test_data_new.iloc[:, 0:11]
X_test.head()

In [None]:
y_test = test_data_new.iloc[:, 11] # outcome is the 12th column (position 11)
y_test.head()

<a id="classreport_statsmod"></a>
<div class="alert alert-block alert-info" style="color:#000000">
# Classification Report using Stats Model
### This gives the detailed output of the logistic regression, which the scikit-learn package does not report in comparable detail.

In [None]:
logit_model = sm.Logit(y_train, sm.add_constant(X_train)).fit()

In [None]:
logit_model.summary() # Gives the summary of model

> ### The model built with all variables gives an accuracy of 94.55% and an AUC of 92%. It also shows that Married and Credit_History are the only significant variables.

In [None]:
logit_model.conf_int() # confidence intervals show how precisely each coefficient is estimated

In [None]:
np.exp(logit_model.params) # odds ratios: the exponential of each coefficient
# An odds ratio above 1 means the variable is positively associated with the target (Loan_Status);
# below 1 means a negative association, and close to 1 means little effect.
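
As a small worked illustration of how a logit coefficient turns into an odds ratio, the sketch below uses made-up coefficients (they are assumptions for the example, not the fitted values from `logit_model`).

```python
import math

# Hypothetical logit coefficients, not the values fitted in this notebook.
coefficients = {"Credit_History": 1.2, "LoanAmount": -0.3, "Gender": 0.0}

# exp(beta) is the multiplicative change in the odds for a one-unit increase.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    if ratio > 1:
        direction = "raises"
    elif ratio < 1:
        direction = "lowers"
    else:
        direction = "does not change"
    print(f"{name}: odds ratio {ratio:.2f} {direction} the odds of approval")
```

A positive coefficient always maps to an odds ratio above 1 and a negative one to a ratio below 1, so the sign of the coefficient and the side of 1 carry the same information.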

========================================================================================================================

## <span class="label label-primary">Now, building a model using only the Married and Credit_History variables. <br>(These are the variables significantly associated with Loan_Status.)</span>

In [None]:
train_data_new = train_data.drop(['Loan_ID', 'Gender','Dependents', 'Self_Employed', 'Education', 'Property_Area', 'ApplicantIncome', 'Loan_Amount_Term','CoapplicantIncome', 'LoanAmount' ], axis=1)
train_data_new.head(3) # 95% Accuracy | ROC = 91%

In [None]:
test_data_new = test_data.drop(['Loan_ID', 'Gender','Dependents', 'Self_Employed', 'Education', 'Property_Area', 'ApplicantIncome', 'Loan_Amount_Term','CoapplicantIncome', 'LoanAmount' ], axis=1)
test_data_new.head(3)

In [None]:
# Taking Train Dataset values in two variables
X_train = train_data_new.iloc[:, [0, 1]] # Married and Credit_History
X_train.head()

In [None]:
y_train = train_data_new.iloc[:, 2] # Loan_Status
y_train.head()

In [None]:
# Taking the test dataset values as validation data
# Its "outcome" column holds the actual loan status,
# so the predicted values can be compared with it
X_test = test_data_new.iloc[:, [0, 1]]
X_test.head()

In [None]:
y_test = test_data_new.iloc[:, 2] # outcome
y_test.head()

<a id="classreport_scikit"></a>
<div class="alert alert-block alert-info" style="color:#000000">
## Classification Report using Sci Kit Learn

In [None]:
logreg = LogisticRegression(fit_intercept=True,C = 1e15)
logreg.fit(X_train, y_train)
logreg.get_params()
logreg.decision_function(X_train)


> ### Predicting the variable

In [None]:
# y_pred = logreg.predict(X_test) returns a class prediction (0 or 1)
# for every observation in the testing set, stored in y_pred.
y_pred = logreg.predict(X_test)
y_pred


In [None]:
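# Calculating R squared (a regression metric; for 0/1 class labels the accuracy and confusion matrix are more informative)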
sklearn.metrics.r2_score(y_test, y_pred)

In [None]:
# classification Report using Scikit Learn
print(classification_report(y_test, y_pred))


In [None]:
logreg.score(X_test, y_test) # Exact Accuracy


In [None]:
logreg.coef_

In [None]:
logreg.intercept_

<a id="ModPerformEva"></a>
<div class="alert alert-block alert-info" style="color:#000000">
### =======================================================================
# Model Performance Evaluation 
### =======================================================================

 __Classification Accuracy:__ Percentage of correct predictions

In [None]:
# Model Accuracy or The correct classification of model in %

from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred))

__Null accuracy:__ the accuracy that could be achieved by always predicting the most frequent class. It is important to compare classification accuracy with null accuracy.
**_Whichever class (0s or 1s) is more frequent determines the null accuracy._**

* It answers the question: _"If my model predicted the predominant class 100% of the time, how often would it be correct?"_
* It is like a dummy model that is correct exactly as often as the most frequent class occurs.
* In this case, always predicting Loan Outcome 1 (Yes) would be correct **79.01%** of the time.
* This is not a genuine model, but it gives a baseline against which to judge our logistic regression model.
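
The null-accuracy idea can be checked with plain Python. The class counts below (290 approved, 77 rejected) are the ones implied by the sensitivity and specificity figures quoted later in this notebook; the list itself is reconstructed for illustration.

```python
from collections import Counter

# Reconstructed test labels: 290 approved (1) and 77 rejected (0).
y_true = [1] * 290 + [0] * 77

# Null accuracy is the share of the most frequent class.
most_common_class, most_common_count = Counter(y_true).most_common(1)[0]
null_accuracy = most_common_count / len(y_true)
print(most_common_class, null_accuracy)
```

Any real model should beat this figure; a model accuracy only slightly above it would mean the features add little beyond the class imbalance.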

In [None]:
y_test.value_counts() # examine the class distribution of the testing set (using a Pandas Series method)

In [None]:
y_test.mean() # calculate the percentage of ones

In [None]:
1 - y_test.mean() # calculate the percentage of zeros

In [None]:
max(y_test.mean(), 1 - y_test.mean()) # calculates null accuracy (for binary classification problems coded as 0/1)

In [None]:
y_test.value_counts().head(1) / len(y_test) # calculates null accuracy (for multi-class classification problems)

In [None]:
y_test.value_counts()


In [None]:
np.bincount(y_pred) # Count the elements of an numpy array

#### Drawbacks of Classification Accuracy
* Classification accuracy is the easiest classification metric to understand. __But__ it does not tell you the underlying distribution of response values (null accuracy tells you this).
* It also does not tell you what "types" of errors your classifier is making. **_A confusion matrix solves this problem._**

<a id="ConfMat"></a>
<div class="alert alert-block alert-info" style="color:#000000">
# Confusion Matrix

* Confusion Matrix Table describes the performance of a classification model

In [None]:
from sklearn.metrics import confusion_matrix
confmat = confusion_matrix(y_test, y_pred) # IMPORTANT: first argument is true values, second is predicted values
confmat # Swapping the arguments transposes the matrix without raising an error,
# so always keep the arguments in this order.
# The result tells us that we have 58+289=347 correct predictions
# and 19+1=20 incorrect predictions.

* __True Positives (TP):__ we correctly predicted that the loan is approved
* __True Negatives (TN):__ we correctly predicted that the loan is not approved
* __False Positives (FP):__ we incorrectly predicted that the loan is approved (a "Type I error")
* __False Negatives (FN):__ we incorrectly predicted that the loan is not approved (a "Type II error")

In [None]:
# Let us see the first 10 true and predicted responses
print('True:', y_test.values[0:10])
print('Pred:', y_pred[0:10])
# Identify the four cases for the output generated.

In [None]:
# slice confusion matrix into four pieces and save it 
TP = confmat[1, 1]
TN = confmat[0, 0]
FP = confmat[0, 1]
FN = confmat[1, 0]

**Classification Accuracy:** Overall, how often is the classifier correct?

In [None]:
(TP + TN) / (TP + TN + FP + FN) 

**Classification Error (_"Misclassification Rate"_):** Overall, how often is the classifier incorrect?

In [None]:
(FP + FN) / (TP + TN + FP + FN) 

**Sensitivity (_"True Positive Rate" or "Recall"_):** When the actual value is positive (1), how often is the prediction correct?
* How "sensitive" is the classifier to detecting positive instances?
                     

In [None]:
TP / (TP + FN)

**Specificity:** When the actual value is negative, how often is the prediction correct?
* How "specific" (or "selective") is the classifier in predicting positive instances?

In [None]:
TN / (TN + FP) 

**False Positive Rate:** When the actual value is negative, how often is the prediction incorrect?

In [None]:
FP / (TN + FP)

**Precision:** When a positive value is predicted, how often is the prediction correct?
* How "precise" is the classifier when predicting positive instances?

In [None]:
TP / (TP + FP) 

**The F1 score is the harmonic mean of precision and sensitivity.**

In [None]:
(2*TP) / ((2*TP) + FP + FN) 

#### Using Metrics

In [None]:
print(metrics.accuracy_score(y_test, y_pred)) # Classification Accuracy
print(1 - metrics.accuracy_score(y_test, y_pred)) # Classification Error
print(metrics.recall_score(y_test, y_pred)) # Sensitivity
print(metrics.precision_score(y_test, y_pred)) # Precision
# Specificity has no dedicated function in scikit-learn; it equals the recall of class 0: metrics.recall_score(y_test, y_pred, pos_label=0)

#### Sensitivity and Specificity should both be as high as possible.
* In this model the classifier is highly sensitive (about 0.997) and reasonably specific (about 0.753)
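
As a worked check of the formulas in this section, the sketch below plugs in the four counts quoted in the confusion-matrix comment (58, 289, 19, 1). The assignment FP = 19 and FN = 1 is inferred from the sensitivity and specificity values cited in the threshold section, so treat it as an assumption.

```python
# Counts taken from the confusion-matrix comment; FP/FN assignment is inferred.
TN, FP, FN, TP = 58, 19, 1, 289
total = TN + FP + FN + TP

accuracy = (TP + TN) / total          # share of all predictions that are correct
error_rate = (FP + FN) / total        # share of all predictions that are wrong
sensitivity = TP / (TP + FN)          # recall / true positive rate
specificity = TN / (TN + FP)          # true negative rate
precision = TP / (TP + FP)            # share of predicted positives that are right
f1 = 2 * TP / (2 * TP + FP + FN)      # harmonic mean of precision and sensitivity

for name, value in [("accuracy", accuracy), ("error", error_rate),
                    ("sensitivity", sensitivity), ("specificity", specificity),
                    ("precision", precision), ("f1", f1)]:
    print(f"{name}: {value:.4f}")
```

The gap between sensitivity and specificity shows that almost all errors are false positives, i.e. rejected applicants predicted as approved.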

<a id="AdjClassThershold"></a>
<div class="alert alert-block alert-info" style="color:#000000">
# Adjusting the Classification Threshold

In [None]:
# print the first 10 predicted responses
logreg.predict(X_test)[0:10]

In [None]:
# print the first 10 predicted probabilities of class membership
logreg.predict_proba(X_test)[0:10, :]

* Each row represents an observation and the columns represent classes 0 and 1
* Each row sums to 1
* By default the classification threshold is 0.5
* So an observation is labelled '1' when its class-1 probability is greater than or equal to 0.5
* **Otherwise it is labelled '0'**
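
The thresholding rule in the list above can be sketched with NumPy on a hypothetical probability matrix shaped like the output of `predict_proba`. The tie at exactly 0.5 follows the rule as stated here; a library may break that tie differently.

```python
import numpy as np

# Hypothetical predict_proba-style output: column 0 = P(class 0), column 1 = P(class 1).
proba = np.array([
    [0.18, 0.82],
    [0.65, 0.35],
    [0.50, 0.50],
    [0.09, 0.91],
])

# Each row sums to 1 because the two classes are exhaustive.
row_sums = proba.sum(axis=1)

# Apply the 0.5 cutoff to the class-1 column.
threshold = 0.5
predicted = (proba[:, 1] >= threshold).astype(int)
print(predicted)
```

Changing `threshold` is all it takes to trade sensitivity against specificity, which is what the following cells explore.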

In [None]:
print (logreg.predict_proba(X_test)[0:10, 1]) # print the first 10 predicted probabilities for class 1
y_pred_prob = logreg.predict_proba(X_test)[:, 1] # And store the predicted probabilities for class 1

<a id="Histogram_of_pred_prob"></a>
### Histogram of predicted probabilities for class 1

In [None]:
# histplot shows counts on the y-axis by default; it replaces the deprecated distplot(kde=False)
sns.histplot(y_pred_prob, color="red")
sns.rugplot(y_pred_prob, color="red") # Rug marks show the individual predicted probabilities
plt.xlim(0, 1)
plt.title('Histogram of predicted probabilities', fontsize=17, fontweight='bold')
plt.xlabel('Predicted probability of loan approval', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.show()


* The histogram shows that the probabilities lie between **0 and 1**, with most points above 0.6
* The probability **0.82** has the highest frequency
* So the majority of predicted probabilities for class 1 are above 0.5 **[Default Threshold = 0.5]**
* If the majority were below 0.5, class 1 would rarely be predicted and we might change the threshold
* **In this case class 1 is predicted frequently**

> * You can adjust sensitivity and specificity by changing the threshold
> * Sensitivity and specificity have an inverse relationship
> * The lower the cutoff, the higher the sensitivity
> * The higher the cutoff, the higher the specificity
> * So, depending on your business requirements, you can increase or decrease the threshold
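
The inverse relationship can be seen on a small example. The labels and probabilities below are hypothetical and chosen only so that the effect is easy to follow; `sens_spec` is an illustrative helper, not part of any library.

```python
import numpy as np

# Hypothetical true labels and predicted probabilities for class 1.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 1, 0, 1])
y_prob = np.array([0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.55, 0.45, 0.35])

def sens_spec(threshold):
    """Return (sensitivity, specificity) when predicting 1 for prob >= threshold."""
    y_hat = (y_prob >= threshold).astype(int)
    tp = int(((y_hat == 1) & (y_true == 1)).sum())
    tn = int(((y_hat == 0) & (y_true == 0)).sum())
    fn = int(((y_hat == 0) & (y_true == 1)).sum())
    fp = int(((y_hat == 1) & (y_true == 0)).sum())
    return tp / (tp + fn), tn / (tn + fp)

for cutoff in (0.3, 0.5, 0.7):
    sens, spec = sens_spec(cutoff)
    print(f"cutoff={cutoff:.1f}  sensitivity={sens:.2f}  specificity={spec:.2f}")
```

Lowering the cutoff can only move observations from predicted 0 to predicted 1, so sensitivity never decreases and specificity never increases as the cutoff drops.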

 <a id="DecThreshold"></a>
 <div class="alert alert-block alert-info" style="color:#000000">
 ** =====================================================================================================================**
 ## Decreasing the threshold in order to increase the sensitivity of the classifier  
 ** [Optional for this Project] **

In [None]:
# predict loan outcome if the predicted probability is greater than 0.1
from sklearn.preprocessing import binarize
y_pred_class = binarize([y_pred_prob], threshold=0.1)[0] # binarize expects a 2D array, so wrap the vector and take the first row

y_pred_prob[0:10] # print the first 10 predicted probabilities
y_pred_class[0:10] # print the first 10 predicted classes with the lower threshold
print(confmat) # previous confusion matrix (default threshold of 0.5)
metrics.confusion_matrix(y_test, y_pred_class) # new confusion matrix (threshold of 0.1)

290 / (290 + 0) # sensitivity has increased to 1 (used to be 0.99655172)
24 / (24 + 53) # specificity has decreased to 0.31168831 (used to be 0.7532467)
# =================================================================================

  <div class="alert alert-block alert-info" style="color:#000000">
 ** =====================================================================================================================**

<a id="rocauc"></a>
<div class="alert alert-block alert-info" style="color:#000000">
# ROC Curves and Area Under the Curve (AUC)

* It would be useful to see how sensitivity and specificity are affected by various thresholds without changing the threshold manually each time
* Otherwise the process above has to be repeated for every threshold
* The ROC curve does this. **[Receiver operating characteristic]**

In [None]:
# IMPORTANT: first argument is true values, second argument is predicted probabilities
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob)
auc = metrics.auc(fpr, tpr) 
auc

* The higher the AUC, the better the classifier
* It is a single-number summary of classifier performance (an alternative to the accuracy score)
* If you randomly choose one positive and one negative observation, AUC is the likelihood that your classifier assigns a higher predicted probability to the positive observation.
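
The ranking interpretation of AUC can be verified directly by counting pairs. This is a minimal sketch on hypothetical scores; `pairwise_auc` is an illustrative helper written for this example, and ties are counted as half a win.

```python
from itertools import product

# Hypothetical predicted probabilities for positive and negative observations.
pos_scores = [0.9, 0.8, 0.55, 0.4]
neg_scores = [0.6, 0.3, 0.2]

def pairwise_auc(pos, neg):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    wins = 0.0
    for p, n in product(pos, neg):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(pos) * len(neg))

auc_value = pairwise_auc(pos_scores, neg_scores)
print(auc_value)
```

Here 10 of the 12 positive-negative pairs are ordered correctly. The quadratic loop is fine for a demonstration, while `metrics.auc` on the ROC points is the practical route for real data.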

In [None]:
plt.plot(fpr, tpr, color='red', lw=2, label='ROC Curve (Area = %0.2f)' % auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('ROC Curve for Default Loan Classifier', fontsize=17, fontweight='bold')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.legend(loc="lower right")
plt.grid(True)
plt.show()

In [None]:
print("False Positive Rates : ", fpr) # Increasing false positive rates
print("True Positive Rates : ",tpr) # Increasing true positive rates
print("Thresholds : ",thresholds) # Decreasing thresholds on the decision function

* The curve tells you, for example, that to achieve a sensitivity of 0.90 you have to accept a specificity of 0.20
* The ROC curve helps you visually choose a threshold that balances sensitivity and specificity
* But you cannot see the thresholds used to generate the curve on the ROC curve itself

## <span class="label label-primary"> A function that accepts a threshold and prints sensitivity and specificity</span>

* Refer to the [Histogram of Predicted Probabilities](#Histogram_of_pred_prob) while studying the values of sensitivity and specificity.
* The ROC graph has 1 - Specificity on the x-axis, which the function converts to specificity.

In [None]:
def evaluate_threshold(threshold):
    print('Sensitivity:', tpr[thresholds > threshold][-1])
    print('Specificity:', 1 - fpr[thresholds > threshold][-1])

This tells us that, when we set cutoff/threshold of 0.5 we get the **Sensitivity=0.99** and **Specificity: 0.75**

In [None]:
evaluate_threshold(0.5)

But when we set the cutoff to 0.01 we get the **Sensitivity=1** and **Specificity: 0**. This is because all the probabilities are above 0.01.

In [None]:
evaluate_threshold(0.01)

**So, depending on your business requirements, you have to adjust sensitivity and specificity.**

<a id="ExportVals"></a>
<div class="alert alert-block alert-info" style="color:#000000">
## Exporting predicted values in Validate Dataset File

In [None]:
test_data['Predicted Loan Status'] = y_pred 

In [None]:
test_data.to_csv('F:/Lectures/Data Science/iMarticus/Python/Project/Project 5-6 Case Study on Credit Risk/Predicted_Data.csv', index=False, encoding='utf-8') # validate_data was never defined; the predictions were added to test_data

In [None]:
test_data.head(10)

<a id="Conclusion"></a>
<div class="alert alert-block alert-info">
# Conclusion

**Confusion matrix advantages:**
* Allows you to calculate a variety of metrics
* Useful for multi-class problems (more than two response classes)


**ROC/AUC advantages:**
* Does not require you to set a classification threshold
* Still useful when there is high class imbalance
* However, for multi-class problems it is difficult to identify a threshold
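
To illustrate the point that a confusion matrix extends naturally to more than two classes, here is a minimal pure-Python sketch on hypothetical three-class labels (this notebook's own target is binary, so the data below is invented for the example).

```python
# Hypothetical three-class labels and predictions.
y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 2, 0, 2]

n_classes = 3
# Rows are true classes, columns are predicted classes.
matrix = [[0] * n_classes for _ in range(n_classes)]
for t, p in zip(y_true, y_pred):
    matrix[t][p] += 1

# Per-class recall: diagonal count divided by the row total.
recall = [matrix[k][k] / sum(matrix[k]) for k in range(n_classes)]
print(matrix)
print(recall)
```

Each off-diagonal cell names a specific confusion between two classes, which a single accuracy figure cannot show.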

[Top](#head)