## Logistic Regression

1. Supervised ML technique
2. Used for binary classification
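
To make the "binary classification" point concrete, here is a minimal sketch of how a logistic regression model turns a weighted sum of features into a probability and then into a 0/1 label. The weights, bias, and feature values are hypothetical numbers chosen for illustration; they are not the coefficients fitted later in this notebook.

```python
import math

def sigmoid(z):
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(features, weights, bias):
    """Linear score (bias + w . x) passed through the sigmoid."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical values, for illustration only
weights = [0.5, -0.25]
bias = 0.1
features = [2.0, 4.0]

p = predict_probability(features, weights, bias)
label = 1 if p > 0.5 else 0   # class 1 when the probability exceeds 0.5

print(round(p, 3), label)  # 0.525 1
```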

### Project: Classification of default and non-default credit card customers

### 1. Importing libraries 

In [1]:
import pandas as pd  # for basic data handling
import numpy as np   # for working with arrays
import matplotlib.pyplot as plt  # for data visualisation
from sklearn.linear_model import LogisticRegression  # to build the ML model
from sklearn.model_selection import train_test_split  # to split the data into train and test sets

In [2]:
Risk_data = {'CreditScore': [780,750,690,710,680,730,690,720,740,690,610,690,710,680,770,610,580,650,540,590,620,600,550,550,570,670,660,580,650,660,640,620,660,660,680,650,670,580,590,690],
              'DC_ratio': [4,3.9,3.3,3.7,3.9,3.7,2.3,3.3,3.3,1.7,2.7,3.7,3.7,3.3,3.3,3,2.7,3.7,2.7,2.3,3.3,2,2.3,2.7,3,3.3,3.7,2.3,3.7,3.3,3,2.7,4,3.3,3.3,2.3,2.7,3.3,1.7,3.7],
              'work_experience': [3,4,3,5,4,6,1,4,5,1,3,5,6,4,3,1,4,6,2,3,2,1,4,1,2,6,4,2,6,5,1,2,4,6,5,1,2,1,4,5],
              'default': [1,1,1,1,1,1,0,1,1,0,0,1,1,1,1,0,0,1,0,0,0,0,0,0,0,1,1,0,1,1,0,0,1,0,0,0,0,0,0,1]
              }  #data in Dictionary structure

###  2. Converting data to pandas dataframe object

In [3]:
df=pd.DataFrame(Risk_data) #Creating a dataframe object 

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40 entries, 0 to 39
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   CreditScore      40 non-null     int64  
 1   DC_ratio         40 non-null     float64
 2   work_experience  40 non-null     int64  
 3   default          40 non-null     int64  
dtypes: float64(1), int64(3)
memory usage: 1.4 KB


In [5]:
df.describe()

| | CreditScore | DC_ratio | work_experience | default |
|---|---|---|---|---|
| count | 40.0 | 40.0 | 40.0 | 40.0 |
| mean | 654.0 | 3.095 | 3.425 | 0.475 |
| std | 61.427464 | 0.631218 | 1.737778 | 0.505736 |
| min | 540.0 | 1.7 | 1.0 | 0.0 |
| 25% | 607.5 | 2.7 | 2.0 | 0.0 |
| 50% | 660.0 | 3.3 | 4.0 | 0.0 |
| 75% | 690.0 | 3.7 | 5.0 | 1.0 |
| max | 780.0 | 4.0 | 6.0 | 1.0 |


In [6]:
df['default'].value_counts()

0    21
1    19
Name: default, dtype: int64

Of the 40 records, 21 have `default` = 0 and 19 have `default` = 1, so the two classes are roughly balanced.

In [7]:
df.head() #displaying first 5 rows of the dataset

| | CreditScore | DC_ratio | work_experience | default |
|---|---|---|---|---|
| 0 | 780 | 4.0 | 3 | 1 |
| 1 | 750 | 3.9 | 4 | 1 |
| 2 | 690 | 3.3 | 3 | 1 |
| 3 | 710 | 3.7 | 5 | 1 |
| 4 | 680 | 3.9 | 4 | 1 |


X variables (features): CreditScore, DC_ratio, work_experience
y variable (target for the logistic regression model): default

### 3. Building a logistic regression model 

In [8]:
X=df.drop(['default'],axis=1) #declaring X and y variables
y=df['default'] 

In [9]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=0) 
# Split the dataset into training and test sets.
# test_size=0.25 holds out 25% of the rows (10 of 40) for testing and keeps 75% (30 rows) for training.
# random_state=0 fixes the shuffle, so every run produces the same split and models can be compared fairly.
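
The effect of fixing the random state can be sketched with the standard library alone. The helper below, `simple_train_test_split`, is a hypothetical stand-in written for this illustration. It is not scikit-learn's implementation and will not select the same rows as `train_test_split`; it only shows that a fixed seed gives a repeatable 30/10 split.

```python
import random

def simple_train_test_split(rows, test_size=0.25, seed=0):
    """Shuffle row indices with a fixed seed, then cut off the test share."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)   # same seed -> same shuffle on every run
    n_test = round(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(40))                     # stand-in for the 40 records
train_a, test_a = simple_train_test_split(rows)
train_b, test_b = simple_train_test_split(rows)

print(len(train_a), len(test_a))  # 30 10
print(test_a == test_b)           # True
```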

In [10]:
logistic_regression=LogisticRegression()  #building the model using logistic regression

In [11]:
logistic_regression.fit(X_train,y_train)  #training the model

In [13]:
y_test_predict=logistic_regression.predict(X_test)

In [14]:
y_test_predict #predicted values

array([0, 0, 1, 1, 0, 0, 1, 1, 0, 1], dtype=int64)

In [15]:
logistic_regression.predict_proba(X_test) #Predicting the probabilities of test data

array([[9.99357599e-01, 6.42401145e-04],
       [9.61854639e-01, 3.81453612e-02],
       [2.60474402e-01, 7.39525598e-01],
       [1.81320940e-01, 8.18679060e-01],
       [9.83741583e-01, 1.62584174e-02],
       [9.89488452e-01, 1.05115478e-02],
       [3.77297855e-01, 6.22702145e-01],
       [1.00649100e-01, 8.99350900e-01],
       [9.99722602e-01, 2.77397774e-04],
       [4.73266321e-01, 5.26733679e-01]])
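
The second column of this output is the predicted probability of class 1. For a binary model, the labels returned by `predict` correspond to checking whether that probability exceeds 0.5. The sketch below copies the class-1 probabilities from the output above (rounded to six decimals) and applies the threshold by hand; the variable names are illustrative.

```python
# Class-1 probabilities copied (rounded) from the predict_proba output
p_class1 = [0.000642, 0.038145, 0.739526, 0.818679, 0.016258,
            0.010512, 0.622702, 0.899351, 0.000277, 0.526734]

# Assign class 1 when the probability is above 0.5, otherwise class 0
labels = [1 if p > 0.5 else 0 for p in p_class1]

print(labels)  # [0, 0, 1, 1, 0, 0, 1, 1, 0, 1]
```

The result matches the `y_test_predict` array shown earlier. The last sample (about 0.527) sits close to the threshold, which the hard label alone does not reveal.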

In [16]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_test_predict)
cm

array([[5, 0],
       [0, 5]], dtype=int64)

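The confusion matrix counts true negatives, false positives, false negatives, and true positives. In scikit-learn's layout, rows are the actual classes and columns are the predicted classes, so a binary matrix reads [[TN, FP], [FN, TP]].
Here the off-diagonal entries are zero, so all 10 test samples (5 of each class) are classified correctly.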
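
Because the matrix above has no errors, it does not show what the off-diagonal cells look like. The following sketch counts the four cells by hand on a small set of hypothetical labels (not taken from this dataset) that contains one false positive and one false negative. The function name `confusion_counts` is made up for this example.

```python
def confusion_counts(y_true, y_pred):
    """Count TN, FP, FN, TP for binary labels (0 = negative, 1 = positive)."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 0:
            tn += 1
        elif actual == 0 and predicted == 1:
            fp += 1   # predicted positive, actually negative
        else:
            fn += 1   # predicted negative, actually positive
    return tn, fp, fn, tp

# Hypothetical labels, for illustration only
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
print([[tn, fp], [fn, tp]])  # [[3, 1], [1, 3]]
```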

In [17]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test,y_test_predict))  #Accuracy score = (TP+TN)/(TP+TN+FP+FN)

1.0


In [18]:
from sklearn.metrics import recall_score
print(recall_score(y_test,y_test_predict)) #recall= TP/(TP+FN)

1.0
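
Both scores can be checked by hand from the confusion matrix counts using the formulas in the code comments. The sketch below applies them to the counts reported in this notebook (TN = 5, FP = 0, FN = 0, TP = 5) and, for contrast, to a hypothetical imperfect result. The function names are illustrative.

```python
def accuracy(tn, fp, fn, tp):
    """Share of all samples that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def recall(fn, tp):
    """Share of actual positives that the model found."""
    return tp / (tp + fn)

# Counts from the confusion matrix in this notebook: [[5, 0], [0, 5]]
print(accuracy(tn=5, fp=0, fn=0, tp=5), recall(fn=0, tp=5))  # 1.0 1.0

# Hypothetical imperfect result, for contrast
print(accuracy(tn=4, fp=1, fn=2, tp=3), recall(fn=2, tp=3))  # 0.7 0.6
```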


In [19]:
from sklearn.metrics import roc_auc_score

In [20]:
print(roc_auc_score(y_test,y_test_predict)) #area under the ROC curve (true positive rate plotted against false positive rate)

1.0
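
ROC AUC can also be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it that way from scores rather than hard labels. The helper `pairwise_auc` is written for this illustration and is not scikit-learn's implementation. The scores are the rounded class-1 probabilities from the `predict_proba` output, and the true labels are assumed to equal the predicted labels, which the error-free confusion matrix implies.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Labels and class-1 probabilities for the 10 test samples in this notebook
y_true = [0, 0, 1, 1, 0, 0, 1, 1, 0, 1]
scores = [0.000642, 0.038145, 0.739526, 0.818679, 0.016258,
          0.010512, 0.622702, 0.899351, 0.000277, 0.526734]
print(pairwise_auc(y_true, scores))  # 1.0

# Hypothetical example with one misordered pair
print(pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Every positive sample here scores above every negative one, so the value is 1.0 whether probabilities or hard labels are used. In general, passing probabilities gives a finer-grained AUC than passing 0/1 predictions.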
