# Classification:

# Logistic Regression Model

# Credit Card Default Prediction Model

   The data set consists of 2000 samples in total, split between two categories (default and no default). The five variables are:

      1. Income
      2. Age
      3. Loan
      4. Loan to Income (engineered feature)
      5. Default (target)
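
The `Loan to Income` column appears to be the ratio `Loan / Income`; the rows of the data set are consistent with that reading. The following is a minimal sketch under that assumption, using two rows copied from the data (the name `sample` is illustrative and not part of the notebook):

```python
import pandas as pd

# Two rows copied from the data set; `sample` is an illustrative name.
sample = pd.DataFrame({
    "Income": [66155.9251, 34415.15397],
    "Loan": [8106.532131, 6564.745018],
})

# Assumed definition of the engineered feature: loan amount divided by income.
sample["Loan to Income"] = sample["Loan"] / sample["Income"]
print(sample)
```

If the assumption holds, the computed ratios should agree with the `Loan to Income` values stored in the file up to rounding.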

## Step1: import library

In [1]:
import pandas as pd

## Step2: import data

In [2]:
default = pd.read_csv('https://github.com/ybifoundation/Dataset/raw/main/Credit%20Default.csv')

In [3]:
default.head()

        Income        Age         Loan  Loan to Income  Default
0   66155.9251  59.017015  8106.532131        0.122537        0
1  34415.15397  48.117153  6564.745018        0.190752        0
2  57317.17006  63.108049  8020.953296         0.13994        0
3   42709.5342  45.751972   6103.64226        0.142911        0
4  66952.68885  18.584336  8770.099235         0.13099        1


In [4]:
default.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Income          2000 non-null   float64
 1   Age             2000 non-null   float64
 2   Loan            2000 non-null   float64
 3   Loan to Income  2000 non-null   float64
 4   Default         2000 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 78.2 KB


In [5]:
default.describe()

             Income        Age         Loan  Loan to Income   Default
count        2000.0     2000.0       2000.0          2000.0    2000.0
mean   45331.600018  40.927143  4444.369695        0.098403    0.1415
std    14326.327119   13.26245  3045.410024         0.05762  0.348624
min     20014.48947  18.055189      1.37763         4.9e-05       0.0
25%     32796.45972  29.062492  1939.708847        0.047903       0.0
50%     45789.11731  41.382673  3974.719418        0.099437       0.0
75%     57791.28167  52.596993  6432.410625        0.147585       0.0
max     69995.68558  63.971796  13766.05124        0.199938       1.0


In [6]:
default.columns

Index(['Income', 'Age', 'Loan', 'Loan to Income', 'Default'], dtype='object')

In [7]:
default.shape

(2000, 5)

## Pandas Series.value_counts(): 
  
  Returns a Series containing the count of each unique value. The result is sorted in descending order, so the first element is the most frequently occurring value. NA values are excluded by default.
  
    Syntax: Series.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=True)

In [8]:
# Count of each category

default['Default'].value_counts()

0    1717
1     283
Name: Default, dtype: int64
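
The `normalize` parameter listed in the syntax above turns the counts into proportions, which makes the class balance easier to read. A small sketch, assuming a label column rebuilt from the counts shown above (the names `labels`, `counts` and `proportions` are illustrative):

```python
import pandas as pd

# Rebuild a label column with the same class counts as the output above.
labels = pd.Series([0] * 1717 + [1] * 283, name="Default")

counts = labels.value_counts()                     # absolute counts, most frequent first
proportions = labels.value_counts(normalize=True)  # relative frequencies that sum to 1

print(counts)
print(proportions)
```

Roughly 14% of the rows are defaults, so the two classes are imbalanced. That is worth keeping in mind when reading accuracy figures later on.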

## Step3: define target (y) and features (x), where y is the dependent variable and x holds the independent variables.

## We can have multiple independent variables (x) and a single dependent variable (y), which is categorical in a classification problem.

In [9]:
default.columns

Index(['Income', 'Age', 'Loan', 'Loan to Income', 'Default'], dtype='object')

## pandas.DataFrame.drop():

   Removes rows or columns by specifying label names and the corresponding axis, or by directly specifying index or column names.

    axis : {0 or ‘index’, 1 or ‘columns’}, default 0
    Whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’).
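
The two ways of calling `drop()` described above can be compared on a toy frame. This is only a sketch: the frame `frame` and its values are made up for illustration.

```python
import pandas as pd

# Small illustrative frame; the values are made up for the example.
frame = pd.DataFrame({
    "Income": [50000.0, 30000.0],
    "Age": [40.0, 25.0],
    "Default": [0, 1],
})

by_axis = frame.drop(["Default"], axis=1)     # label list plus axis
by_keyword = frame.drop(columns=["Default"])  # same result via the columns keyword
without_first_row = frame.drop(0, axis=0)     # axis=0 drops a row by its index label

print(by_axis.columns.tolist())
print(frame.columns.tolist())  # the original frame is unchanged
```

Because `drop()` returns a new DataFrame by default, the original frame keeps all of its columns.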

In [10]:
y = default['Default']

x = default.drop(['Default'], axis=1)

In [11]:
default.shape

(2000, 5)

In [12]:
x.shape

(2000, 4)

In [13]:
y.shape

(2000,)

## Step4: train test split | splitting the dataset into a train set and a test set.

In [14]:
from sklearn.model_selection import train_test_split

In [15]:
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size = 0.7, random_state = 2529)

In [16]:
x_train.shape, x_test.shape, y_train.shape, y_test.shape

((1400, 4), (600, 4), (1400,), (600,))
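
With imbalanced classes, a plain random split can leave the train and test sets with slightly different default rates. One common option, not used in this notebook, is the `stratify` argument of `train_test_split`. The sketch below assumes synthetic stand-in features with the same class counts as the data set; all variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same class counts as the data set (1717 vs 283).
rng = np.random.default_rng(0)
features = pd.DataFrame(rng.normal(size=(2000, 4)), columns=["a", "b", "c", "d"])
target = pd.Series([0] * 1717 + [1] * 283, name="Default")

# stratify=target keeps the class ratio nearly identical in both parts.
f_train, f_test, t_train, t_test = train_test_split(
    features, target, train_size=0.7, random_state=2529, stratify=target
)

print(f_train.shape, f_test.shape)
print(t_train.mean(), t_test.mean())  # share of defaults in each part
```

Whether stratification changes the results noticeably here is not something this sketch shows; it only demonstrates that the class shares stay aligned.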

In [17]:
x_train

           Income        Age          Loan  Loan to Income
67    30735.80850  22.242098   5946.822297        0.193482
60    37660.77072  53.745060   2129.597165        0.056547
1751  65913.83084  36.477758  11738.915710        0.178095
300   27218.56103  55.171020   4145.003587        0.152286
1504  43044.51778  60.848424   1661.713460        0.038605
...           ...        ...           ...             ...
740   63661.38333  25.595524   6095.308749        0.095746
399   24037.16514  23.311574   2469.364426        0.102731
828   68100.73562  47.752940   8124.598980        0.119303
1586  34163.62565  45.782718   6617.400172        0.193697


In [18]:
y_train

67      1
60      0
1751    0
300     0
1504    0
       ..
740     0
399     0
828     0
1586    0
1376    0
Name: Default, Length: 1400, dtype: int64

In [19]:
x_test

           Income        Age         Loan  Loan to Income
1317  62125.25811  21.085868  5700.457195        0.091757
705   53330.76714  42.377246  2343.497556        0.043943
1881  24406.89381  37.905318  1733.403111        0.071021
1725  34428.97264  27.368410  6016.615091        0.174754
1622  54609.46518  18.413736  5618.204570        0.102880
...           ...        ...          ...             ...
573   42476.26553  46.438223  8334.182008        0.196208
1868  58503.77101  42.372513  7050.432526        0.120512
969   23066.96468  33.091353  1933.353568        0.083815
1127  21448.82799  31.795188  1989.182976        0.092741


In [20]:
y_test

1317    0
705     0
1881    0
1725    1
1622    0
       ..
573     0
1868    0
969     0
1127    0
1538    0
Name: Default, Length: 600, dtype: int64

## Step5: select a model

In [21]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

## Step6: train the model (fit the model)

In [22]:
model.fit(x_train, y_train)

LogisticRegression()

In [23]:
# To get the intercept of the Logistic Regression model.
model.intercept_

array([9.39569099])

In [24]:
# To get the coefficients (one per feature: Income, Age, Loan, Loan to Income)
model.coef_

array([[-2.31410018e-04, -3.43062682e-01,  1.67863324e-03,
         1.51188531e+00]])
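
The intercept and coefficients define the log-odds of default as a linear function of the features, and the logistic (sigmoid) function maps that value to a probability. The sketch below reproduces this by hand for two rows of `x_test` (index 1317 and 1725). It assumes the rounded values printed above, so the results can differ slightly from what `model` itself would return.

```python
import numpy as np

# Intercept and coefficients copied from the fitted model output above (as displayed).
intercept = 9.39569099
coef = np.array([-2.31410018e-04, -3.43062682e-01, 1.67863324e-03, 1.51188531e+00])

# Two rows of x_test (index 1317 and 1725): Income, Age, Loan, Loan to Income.
rows = np.array([
    [62125.25811, 21.085868, 5700.457195, 0.091757],
    [34428.97264, 27.368410, 6016.615091, 0.174754],
])

log_odds = intercept + rows @ coef              # linear part of the model
prob_default = 1.0 / (1.0 + np.exp(-log_odds))  # logistic (sigmoid) function
predicted = (prob_default >= 0.5).astype(int)   # 0.5 is the usual decision threshold

print(prob_default)
print(predicted)
```

The sign of each coefficient shows the direction of its effect: the negative `Age` coefficient lowers the log-odds of default as age increases, while the positive `Loan` coefficient raises them.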

## Step7: prediction of model

In [25]:
y_pred = model.predict(x_test)

In [26]:
y_pred

array([0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
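
`predict()` returns hard class labels. For a binary logistic regression these correspond to thresholding the estimated probability of class 1 at 0.5, and `predict_proba()` exposes the probabilities themselves. The sketch below uses synthetic stand-in data rather than the credit data set. The names `clf`, `proba` and the 0.3 threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic, imbalanced stand-in data (not the credit data set).
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.85, 0.15], random_state=0,
)

clf = LogisticRegression(max_iter=1000).fit(X, y)

proba = clf.predict_proba(X)[:, 1]          # estimated probability of class 1
labels_default = clf.predict(X)             # same as thresholding proba at 0.5
labels_strict = (proba >= 0.3).astype(int)  # hypothetical lower threshold flags more rows

print(labels_default.sum(), labels_strict.sum())
```

Lowering the threshold trades precision for recall, which can matter when missing a default is more costly than a false alarm.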

## Step8: accuracy of model

In [27]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [28]:
accuracy_score(y_test, y_pred)

0.95

In [29]:
confusion_matrix(y_test, y_pred)

array([[506,  13],
       [ 17,  64]], dtype=int64)

In [30]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.97      0.97      0.97       519
           1       0.83      0.79      0.81        81

    accuracy                           0.95       600
   macro avg       0.90      0.88      0.89       600
weighted avg       0.95      0.95      0.95       600
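
The figures in the report for class 1 can be derived directly from the confusion matrix, where rows are actual classes and columns are predicted classes. The sketch below uses the matrix printed above; the `baseline` value (always predicting class 0) is an added comparison, not something the notebook computes.

```python
import numpy as np

# Confusion matrix copied from the output above: rows = actual, columns = predicted.
cm = np.array([[506, 13],
               [17, 64]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)       # of the predicted defaults, how many were real
recall = tp / (tp + fn)          # of the real defaults, how many were caught
f1 = 2 * precision * recall / (precision + recall)
baseline = (tn + fp) / cm.sum()  # accuracy of always predicting class 0

print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))
print(round(baseline, 3))
```

Because most test rows belong to class 0, a model that never predicts a default would still score about 0.865 accuracy. The precision and recall for class 1 are therefore more informative than the overall accuracy alone.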

