# **Credit Loan Default Prediction: A Machine Learning Approach**  

## **Introduction**  

In today’s financial landscape, lending institutions face significant risks when issuing loans. Determining whether a borrower will repay their loan or default is a critical challenge for banks and credit agencies. This project aims to develop a **predictive model** that assesses the likelihood of loan default based on various financial and demographic factors.  

## **Project Overview**  

This analysis leverages a structured dataset containing key attributes such as **loan amount, interest rate, borrower grade, verification status, and total payment history** to predict loan repayment behavior. The dataset includes both numerical and categorical variables, each carrying valuable insights into a borrower's creditworthiness.  

### **Workflow Breakdown**  
The project follows a structured approach:  

1. **Data Cleaning & Preprocessing (Conducted Separately)**  
   - Handling missing values  
   - Feature engineering and transformation  
   - Encoding categorical variables  
   - Scaling antl2ers and anomalies  

3. **Predictive Modeling**  
   - Training machine learning models  
   - Evaluating model performance  
   - Optimizing for accura3y and interpretability  

4. **Results & Insights**  
   - Model evaluation metrics  
   - Business implications  
   - Recommendations for lenders  

## **Objective**  
The primary goal is to **build an accurate machine learning model that predicts loan default risks**, helping financial institutions **minimize risk exposure and optimize lending decisions**.  

## **Dataset Overview**  
The dataset consists of the following key columns:  
- **`loan_amnt`** – Total loan amount requested  
- **`loan_status`** – The target variable (e.g., Fully Paid, Default, Charged Off)  
- **`int_rate`** – Interest rate applied to the loan  
- **`installment`** – Fixed monthly payment  
- **`grade` & `sub_grade`** – Creditworthiness classification  
- **`verification_status`** – Whether income was verified  
- **`total_pymnt`** – Total amount repaid by the borrower  
- ... and other essential financial indicators  

## **Significance of the Study**  
By leveraging **machine learning**, this project provides valuable insights that can:  
✔ Improve lending decision-making  
✔ Reduce financial losses due to loan defaults  
✔ Enhance risk assessment strategies  

Let’s dive into the data and build an effective prediction model! 🚀  


## Importing Necessary Libaries 

In [1]:
import pandas as pd
import numpy as np
np.set_printoptions(suppress = True, linewidth = 100, precision = 2)

**We read our data using np.genfromtxt and this baciscally gives us a numerical dataset and any column with string is return as "nan". With this we can easily
classify our dataset into numerical and sting**

In [2]:
raw_data = np.genfromtxt('loan-data.csv', delimiter = ';', skip_header= 1, autostrip= True)

In [3]:
raw_data

array([[48010226.  ,         nan,    35000.  , ...,         nan,         nan,     9452.96],
       [57693261.  ,         nan,    30000.  , ...,         nan,         nan,     4679.7 ],
       [59432726.  ,         nan,    15000.  , ...,         nan,         nan,     1969.83],
       ...,
       [50415990.  ,         nan,    10000.  , ...,         nan,         nan,     2185.64],
       [46154151.  ,         nan,         nan, ...,         nan,         nan,     3199.4 ],
       [66055249.  ,         nan,    10000.  , ...,         nan,         nan,      301.9 ]])

## Handling missing data

In [4]:
# Checking for missing data
np.isnan(raw_data).sum()

88005

**Since all string data in our dataset are represented by "nan". we can create a variable and attribute a value of the maximum numurical data + 1**

In [5]:
#creating a temporary filler

temporary_fill = np.nanmax(raw_data) + 1
temporary_fill

68616520.0

**We take the mean of each column in the dataset**

In [6]:
temporary_mean = np.nanmean(raw_data, axis = 0)
temporary_mean

  temporary_mean = np.nanmean(raw_data, axis = 0)


array([54015809.19,         nan,    15273.46,         nan,    15311.04,         nan,       16.62,
            440.92,         nan,         nan,         nan,         nan,         nan,     3143.85])

In [7]:
temporary_stat = np.array([np.nanmin(raw_data, axis = 0), np.nanmean(raw_data, axis = 0), np.nanmax(raw_data, axis = 0)])
temporary_stat

  temporary_stat = np.array([np.nanmin(raw_data, axis = 0), np.nanmean(raw_data, axis = 0), np.nanmax(raw_data, axis = 0)])
  temporary_stat = np.array([np.nanmin(raw_data, axis = 0), np.nanmean(raw_data, axis = 0), np.nanmax(raw_data, axis = 0)])


array([[  373332.  ,         nan,     1000.  ,         nan,     1000.  ,         nan,        6.  ,
              31.42,         nan,         nan,         nan,         nan,         nan,        0.  ],
       [54015809.19,         nan,    15273.46,         nan,    15311.04,         nan,       16.62,
             440.92,         nan,         nan,         nan,         nan,         nan,     3143.85],
       [68616519.  ,         nan,    35000.  ,         nan,    35000.  ,         nan,       28.99,
            1372.97,         nan,         nan,         nan,         nan,         nan,    41913.62]])

**Now we can easily differential and identify the strings columns from the numerical columns**

In [8]:
# identifying the strings columns
column_strings = np.argwhere(np.isnan(temporary_mean)).squeeze()
column_strings

array([ 1,  3,  5,  8,  9, 10, 11, 12], dtype=int64)

In [9]:
# identifying the numeric columns
column_numeric = np.argwhere(np.isnan(temporary_mean) == False).squeeze()
column_numeric

array([ 0,  2,  4,  6,  7, 13], dtype=int64)

**Let's now create a list of only strings and numerical columns**

In [10]:
# dataset for only strings related columns
loan_data_strings = np.genfromtxt('loan-data.csv',
                                  delimiter = ';',
                                  autostrip = True,
                                  skip_header = 1,
                                  usecols = column_strings,
                                  dtype = 'str')

loan_data_strings

array([['May-15', 'Current', '36 months', ..., 'Verified',
        'https://www.lendingclub.com/browse/loanDetail.action?loan_id=48010226', 'CA'],
       ['', 'Current', '36 months', ..., 'Source Verified',
        'https://www.lendingclub.com/browse/loanDetail.action?loan_id=57693261', 'NY'],
       ['Sep-15', 'Current', '36 months', ..., 'Verified',
        'https://www.lendingclub.com/browse/loanDetail.action?loan_id=59432726', 'PA'],
       ...,
       ['Jun-15', 'Current', '36 months', ..., 'Source Verified',
        'https://www.lendingclub.com/browse/loanDetail.action?loan_id=50415990', 'CA'],
       ['Apr-15', 'Current', '36 months', ..., 'Source Verified',
        'https://www.lendingclub.com/browse/loanDetail.action?loan_id=46154151', 'OH'],
       ['Dec-15', 'Current', '36 months', ..., '',
        'https://www.lendingclub.com/browse/loanDetail.action?loan_id=66055249', 'IL']],
      dtype='<U69')

In [11]:
# Dataset for numeric related data
loan_data_numeric = np.genfromtxt('loan-data.csv',
                                  delimiter = ';',
                                  autostrip = True,
                                  skip_header = 1,
                                  usecols = column_numeric,
                                  dtype = 'int',
                                  filling_values = temporary_fill)

loan_data_numeric

array([[48010226,    35000,    35000,       13,     1184,     9452],
       [57693261,    30000,    30000, 68616520,      938,     4679],
       [59432726,    15000,    15000, 68616520,      494,     1969],
       ...,
       [50415990,    10000,    10000, 68616520, 68616520,     2185],
       [46154151, 68616520,    10000,       16,      354,     3199],
       [66055249,    10000,    10000, 68616520,      309,      301]])

**Also, we need to identify and differential column headers for numerical and strings data**

In [12]:
# getting the headers for each data
header_full = np.genfromtxt('loan-data.csv',
                                  delimiter = ';',
                                  autostrip = True,
                                  skip_footer = raw_data.shape[0] ,
                                  dtype = 'str')
header_full

array(['id', 'issue_d', 'loan_amnt', 'loan_status', 'funded_amnt', 'term', 'int_rate',
       'installment', 'grade', 'sub_grade', 'verification_status', 'url', 'addr_state',
       'total_pymnt'], dtype='<U19')

In [13]:
header_strings, header_numeric = header_full[column_strings], header_full[column_numeric]

In [14]:
header_strings

array(['issue_d', 'loan_status', 'term', 'grade', 'sub_grade', 'verification_status', 'url',
       'addr_state'], dtype='<U19')

In [15]:
header_numeric

array(['id', 'loan_amnt', 'funded_amnt', 'int_rate', 'installment', 'total_pymnt'], dtype='<U19')

## Checkpoint Function

**For us to keep track of different stage of our work, we create a checkpoint function.** 

In [16]:
def checkpoint(file_name, checkpoint_header, checkpoint_data):
    np.savez(file_name, header = checkpoint_header, data = checkpoint_data)
    checkpoint_variable = np.load(file_name + ".npz")
    return checkpoint_variable

In [17]:
checkpoint_test = checkpoint('Checkpoint-test', header_strings, loan_data_strings)

In [18]:
checkpoint_test['header']

array(['issue_d', 'loan_status', 'term', 'grade', 'sub_grade', 'verification_status', 'url',
       'addr_state'], dtype='<U19')

In [19]:
checkpoint_test['data']

array([['May-15', 'Current', '36 months', ..., 'Verified',
        'https://www.lendingclub.com/browse/loanDetail.action?loan_id=48010226', 'CA'],
       ['', 'Current', '36 months', ..., 'Source Verified',
        'https://www.lendingclub.com/browse/loanDetail.action?loan_id=57693261', 'NY'],
       ['Sep-15', 'Current', '36 months', ..., 'Verified',
        'https://www.lendingclub.com/browse/loanDetail.action?loan_id=59432726', 'PA'],
       ...,
       ['Jun-15', 'Current', '36 months', ..., 'Source Verified',
        'https://www.lendingclub.com/browse/loanDetail.action?loan_id=50415990', 'CA'],
       ['Apr-15', 'Current', '36 months', ..., 'Source Verified',
        'https://www.lendingclub.com/browse/loanDetail.action?loan_id=46154151', 'OH'],
       ['Dec-15', 'Current', '36 months', ..., '',
        'https://www.lendingclub.com/browse/loanDetail.action?loan_id=66055249', 'IL']],
      dtype='<U69')

In [20]:
# Checking if the checkpoint data is same and the main data

np.array_equal(checkpoint_test['data'], loan_data_strings)

True

## Data Preprocessing

### String Columns

**Let's first preprocess the strings data part of our project. We will be working with each column at a time**

In [21]:
# First column
header_strings[0]

'issue_d'

In [22]:
# remaining the column to 'issue_date'
header_strings[0] = 'issue_data'
header_strings[0]

'issue_data'

In [23]:
# Let's see the data inside this column

loan_data_strings[:, 0]

array(['May-15', '', 'Sep-15', ..., 'Jun-15', 'Apr-15', 'Dec-15'], dtype='<U69')

In [24]:
np.unique(loan_data_strings[:, 0])

array(['', 'Apr-15', 'Aug-15', 'Dec-15', 'Feb-15', 'Jan-15', 'Jul-15', 'Jun-15', 'Mar-15',
       'May-15', 'Nov-15', 'Oct-15', 'Sep-15'], dtype='<U69')

In [25]:
# removing the -15 from each data since we are only working on just a single year (2015) data

loan_data_strings[:, 0] = np.chararray.strip(loan_data_strings[:, 0], '-15')

In [26]:
np.unique(loan_data_strings[:, 0])

array(['', 'Apr', 'Aug', 'Dec', 'Feb', 'Jan', 'Jul', 'Jun', 'Mar', 'May', 'Nov', 'Oct', 'Sep'],
      dtype='<U69')

In [27]:
# converting each of the data in this column in its respective month number equivalent

month = ['', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
for i in range(13):
    loan_data_strings[:, 0] = np.where(loan_data_strings[:, 0] == month[i], 
                                      i,
                                      loan_data_strings[:, 0])

In [28]:
np.unique(loan_data_strings[:, 0])

array(['0', '1', '10', '11', '12', '2', '3', '4', '5', '6', '7', '8', '9'], dtype='<U69')

In [29]:
# Second column
header_strings[1]

'loan_status'

In [30]:
loan_data_strings[:, 1]

array(['Current', 'Current', 'Current', ..., 'Current', 'Current', 'Current'], dtype='<U69')

In [31]:
np.unique(loan_data_strings[:, 1])

array(['', 'Charged Off', 'Current', 'Default', 'Fully Paid', 'In Grace Period', 'Issued',
       'Late (16-30 days)', 'Late (31-120 days)'], dtype='<U69')

In [32]:
# Categorizing this column into good(1) and bad(0) that where ['Current', 'Fully Paid', 'In Grace Period', 'Issued' 'Late(16-30 days')] 
# are tag as good and the rest as bad
bad_status = ['', 'Charged Off', 'Default', 'Late (310-120 days)']
loan_data_strings[:, 1] = np.where(np.isin(loan_data_strings[:, 1], bad_status), 
                                  0,
                                  1)

In [33]:
np.unique(loan_data_strings[:, 1])

array(['0', '1'], dtype='<U69')

In [34]:
# Third column
header_strings[2]

'term'

In [35]:
header_strings[2] = 'term_month'
header_strings[2]

'term_month'

In [36]:
loan_data_strings[:, 2]

array(['36 months', '36 months', '36 months', ..., '36 months', '36 months', '36 months'],
      dtype='<U69')

In [37]:
np.unique(loan_data_strings[:, 2])

array(['', '36 months', '60 months'], dtype='<U69')

In [38]:
# removing the months to just leave only the number
loan_data_strings[:, 2] = np.chararray.strip(loan_data_strings[:, 2], ' months')

In [39]:
np.unique(loan_data_strings[:, 2])

array(['', '36', '60'], dtype='<U69')

In [40]:
# fixing the missing data and converting to int
loan_data_strings[:, 2] = np.where(loan_data_strings[:, 2] == '',
                                   60, 
                                   loan_data_strings[:, 2])

In [41]:
np.unique(loan_data_strings[:, 2])

array(['36', '60'], dtype='<U69')

In [42]:
# Fourth column
header_strings[3]

'grade'

In [43]:
loan_data_strings[:, 3]

array(['C', 'A', 'B', ..., 'A', 'D', 'A'], dtype='<U69')

In [44]:
np.unique(loan_data_strings[:, 3])

array(['', 'A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype='<U69')

In [45]:
# There is a relationship between the grade column and the subgrade column

header_strings[4]

'sub_grade'

In [46]:
loan_data_strings[:, 4]

array(['C3', 'A5', 'B5', ..., 'A5', 'D2', 'A4'], dtype='<U69')

In [47]:
np.unique(loan_data_strings[:, 4])

array(['', 'A1', 'A2', 'A3', 'A4', 'A5', 'B1', 'B2', 'B3', 'B4', 'B5', 'C1', 'C2', 'C3', 'C4',
       'C5', 'D1', 'D2', 'D3', 'D4', 'D5', 'E1', 'E2', 'E3', 'E4', 'E5', 'F1', 'F2', 'F3', 'F4',
       'F5', 'G1', 'G2', 'G3', 'G4', 'G5'], dtype='<U69')

In [48]:
# Using the grade column to populate the missing data in the sub-grade column
for i in np.unique(loan_data_strings[:, 3])[1:]:
    loan_data_strings[:, 4]= np.where((loan_data_strings[:, 4] == '') & (loan_data_strings[:, 3] == i), 
                    i + '5',
                    loan_data_strings[:, 4])

In [49]:
np.unique(loan_data_strings[:, 4], return_counts = True)

(array(['', 'A1', 'A2', 'A3', 'A4', 'A5', 'B1', 'B2', 'B3', 'B4', 'B5', 'C1', 'C2', 'C3', 'C4',
        'C5', 'D1', 'D2', 'D3', 'D4', 'D5', 'E1', 'E2', 'E3', 'E4', 'E5', 'F1', 'F2', 'F3', 'F4',
        'F5', 'G1', 'G2', 'G3', 'G4', 'G5'], dtype='<U69'),
 array([  9, 285, 278, 239, 323, 592, 509, 517, 530, 553, 633, 629, 567, 586, 564, 577, 391, 267,
        250, 255, 288, 235, 162, 171, 139, 160,  94,  52,  34,  43,  24,  19,  10,   3,   7,   5],
       dtype=int64))

In [50]:
# We still have 9 missing data and we can just assign 'H1' to them
loan_data_strings[:, 4] = np.where(loan_data_strings[:, 4] == '',
                                   'H1',
                                   loan_data_strings[:, 4])

In [51]:
np.unique(loan_data_strings[:, 4], return_counts = True)

(array(['A1', 'A2', 'A3', 'A4', 'A5', 'B1', 'B2', 'B3', 'B4', 'B5', 'C1', 'C2', 'C3', 'C4', 'C5',
        'D1', 'D2', 'D3', 'D4', 'D5', 'E1', 'E2', 'E3', 'E4', 'E5', 'F1', 'F2', 'F3', 'F4', 'F5',
        'G1', 'G2', 'G3', 'G4', 'G5', 'H1'], dtype='<U69'),
 array([285, 278, 239, 323, 592, 509, 517, 530, 553, 633, 629, 567, 586, 564, 577, 391, 267, 250,
        255, 288, 235, 162, 171, 139, 160,  94,  52,  34,  43,  24,  19,  10,   3,   7,   5,   9],
       dtype=int64))

In [52]:
# We can now effectively delete the grade column

loan_data_strings = np.delete(loan_data_strings, 3, axis = 1)
header_strings = np.delete(header_strings, 3)

In [53]:
loan_data_strings[:, 3]

array(['C3', 'A5', 'B5', ..., 'A5', 'D2', 'A4'], dtype='<U69')

In [54]:
# Unpacking the sub-grade column into key-value pairs
keys = np.unique(loan_data_strings[:, 3])
values = list(range(1, len(np.unique(loan_data_strings[:, 3])) + 1))
dict_sub_grade = dict(zip(keys, values))

In [55]:
dict_sub_grade

{'A1': 1,
 'A2': 2,
 'A3': 3,
 'A4': 4,
 'A5': 5,
 'B1': 6,
 'B2': 7,
 'B3': 8,
 'B4': 9,
 'B5': 10,
 'C1': 11,
 'C2': 12,
 'C3': 13,
 'C4': 14,
 'C5': 15,
 'D1': 16,
 'D2': 17,
 'D3': 18,
 'D4': 19,
 'D5': 20,
 'E1': 21,
 'E2': 22,
 'E3': 23,
 'E4': 24,
 'E5': 25,
 'F1': 26,
 'F2': 27,
 'F3': 28,
 'F4': 29,
 'F5': 30,
 'G1': 31,
 'G2': 32,
 'G3': 33,
 'G4': 34,
 'G5': 35,
 'H1': 36}

In [56]:
for i in np.unique(loan_data_strings[:, 3]):
    loan_data_strings[:, 3] = np.where(loan_data_strings[:, 3] == i, 
                                       dict_sub_grade[i],
                                       loan_data_strings[:, 3])

In [57]:
np.unique(loan_data_strings[:, 3])

array(['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '20', '21', '22',
       '23', '24', '25', '26', '27', '28', '29', '3', '30', '31', '32', '33', '34', '35', '36',
       '4', '5', '6', '7', '8', '9'], dtype='<U69')

In [58]:
# Forth Column
header_strings[4]

'verification_status'

In [59]:
loan_data_strings[:, 4]

array(['Verified', 'Source Verified', 'Verified', ..., 'Source Verified', 'Source Verified', ''],
      dtype='<U69')

In [60]:
np.unique(loan_data_strings[:, 4])

array(['', 'Not Verified', 'Source Verified', 'Verified'], dtype='<U69')

In [61]:
# Formating the verification status into dummie variables(0, 1)
not_verified = ['', 'Not Verified']
loan_data_strings[:, 4] = np.where(np.isin(loan_data_strings[:, 4], not_verified),
                                  0,
                                  1)

In [62]:
np.unique(loan_data_strings[:, 4])

array(['0', '1'], dtype='<U69')

In [63]:
# Fifth column
header_strings[5]

'url'

In [64]:
loan_data_strings[:, 5]

array(['https://www.lendingclub.com/browse/loanDetail.action?loan_id=48010226',
       'https://www.lendingclub.com/browse/loanDetail.action?loan_id=57693261',
       'https://www.lendingclub.com/browse/loanDetail.action?loan_id=59432726', ...,
       'https://www.lendingclub.com/browse/loanDetail.action?loan_id=50415990',
       'https://www.lendingclub.com/browse/loanDetail.action?loan_id=46154151',
       'https://www.lendingclub.com/browse/loanDetail.action?loan_id=66055249'], dtype='<U69')

**The url column contain almost the same data throughout the dataset only for it last set of numbers which indicate the users id and this has 
been capture in another column. So we delete the url column**

In [65]:
loan_data_strings = np.delete(loan_data_strings, 5, axis = 1)
header_strings = np.delete(header_strings, 5)

In [66]:
header_strings[5]

'addr_state'

In [67]:
loan_data_strings[:, 5]

array(['CA', 'NY', 'PA', ..., 'CA', 'OH', 'IL'], dtype='<U69')

In [68]:
np.unique(loan_data_strings[:, 5])

array(['', 'AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA', 'HI', 'IL', 'IN',
       'KS', 'KY', 'LA', 'MA', 'MD', 'ME', 'MI', 'MN', 'MO', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH',
       'NJ', 'NM', 'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VA',
       'VT', 'WA', 'WI', 'WV', 'WY'], dtype='<U69')

In [69]:
np.unique(loan_data_strings[:, 5], return_counts = True)

(array(['', 'AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA', 'HI', 'IL', 'IN',
        'KS', 'KY', 'LA', 'MA', 'MD', 'ME', 'MI', 'MN', 'MO', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH',
        'NJ', 'NM', 'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VA',
        'VT', 'WA', 'WI', 'WV', 'WY'], dtype='<U69'),
 array([ 500,   26,  119,   74,  220, 1336,  201,  143,   27,   27,  690,  321,   44,  389,  152,
          84,   84,  116,  210,  222,   10,  267,  156,  160,   61,   28,  261,   16,   25,   58,
         341,   57,  130,  777,  312,   83,  108,  320,   40,  107,   24,  143,  758,   74,  242,
          17,  216,  148,   49,   27], dtype=int64))

In [70]:
np.unique(loan_data_strings[:, 5]).size

50

In [71]:
state_name, state_count = np.unique(loan_data_strings[:, 5], return_counts = True)
state_count_sorted = np.argsort(-state_count)
state_name[state_count_sorted], state_count[state_count_sorted]

(array(['CA', 'NY', 'TX', 'FL', '', 'IL', 'NJ', 'GA', 'PA', 'OH', 'MI', 'NC', 'VA', 'MD', 'AZ',
        'WA', 'MA', 'CO', 'MO', 'MN', 'IN', 'WI', 'CT', 'TN', 'NV', 'AL', 'LA', 'OR', 'SC', 'KY',
        'KS', 'OK', 'UT', 'AR', 'MS', 'NH', 'NM', 'WV', 'HI', 'RI', 'MT', 'DE', 'DC', 'WY', 'AK',
        'NE', 'SD', 'VT', 'ND', 'ME'], dtype='<U69'),
 array([1336,  777,  758,  690,  500,  389,  341,  321,  320,  312,  267,  261,  242,  222,  220,
         216,  210,  201,  160,  156,  152,  148,  143,  143,  130,  119,  116,  108,  107,   84,
          84,   83,   74,   74,   61,   58,   57,   49,   44,   40,   28,   27,   27,   27,   26,
          25,   24,   17,   16,   10], dtype=int64))

In [72]:
# filling the missing data 
loan_data_strings[:, 5] = np.where(loan_data_strings[:, 5] == '',
                                   0, 
                                   loan_data_strings[:, 5])

In [73]:
np.unique(loan_data_strings[:, 5])

array(['0', 'AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA', 'HI', 'IL', 'IN',
       'KS', 'KY', 'LA', 'MA', 'MD', 'ME', 'MI', 'MN', 'MO', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH',
       'NJ', 'NM', 'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VA',
       'VT', 'WA', 'WI', 'WV', 'WY'], dtype='<U69')

**This addr_state column give the exact total number of states in the United States of American, So we categorize them into their geopolitical zones**

In [74]:
states_west = np.array(['WA', 'OR', 'CA', 'NV', 'ID', 'MT', 'WY', 'UT', 'CO', 'AZ', 'NM', 'HI', 'AK'])
states_south = np.array(['TX', 'OK', 'AR', 'LA', 'MS', 'AL', 'TN', 'KY', 'FL', 'GA', 'SC', 'NC', 'VA', 'WV', 'MD', 'DE', 'DC'])
states_midwest = np.array(['ND', 'SD', 'NE', 'KS', 'MN', 'IA', 'MO', 'WI', 'IL', 'IN', 'MI', 'OH'])
states_east = np.array(['PA', 'NY', 'NJ', 'CT', 'MA', 'VT', 'NH', 'ME', 'RI'])

In [75]:
loan_data_strings[:, 5] = np.where(np.isin(loan_data_strings[:, 5], states_west), 1, loan_data_strings[:, 5])
loan_data_strings[:, 5] = np.where(np.isin(loan_data_strings[:, 5], states_south), 2, loan_data_strings[:, 5])
loan_data_strings[:, 5] = np.where(np.isin(loan_data_strings[:, 5], states_midwest), 3, loan_data_strings[:, 5])
loan_data_strings[:, 5] = np.where(np.isin(loan_data_strings[:, 5], states_east), 4, loan_data_strings[:, 5])

In [76]:
np.unique(loan_data_strings[:, 5])

array(['0', '1', '2', '3', '4'], dtype='<U69')

In [77]:
loan_data_strings

array([['5', '1', '36', '13', '1', '1'],
       ['0', '1', '36', '5', '1', '4'],
       ['9', '1', '36', '10', '1', '4'],
       ...,
       ['6', '1', '36', '5', '1', '1'],
       ['4', '1', '36', '17', '1', '3'],
       ['12', '1', '36', '4', '0', '3']], dtype='<U69')

In [78]:
loan_data_strings = loan_data_strings.astype(int)

In [79]:
loan_data_strings

array([[ 5,  1, 36, 13,  1,  1],
       [ 0,  1, 36,  5,  1,  4],
       [ 9,  1, 36, 10,  1,  4],
       ...,
       [ 6,  1, 36,  5,  1,  1],
       [ 4,  1, 36, 17,  1,  3],
       [12,  1, 36,  4,  0,  3]])

**We have been able to preprocess the strings data columns**

### Checkpoint

In [80]:
checkpoint_strings = checkpoint("Checkpoint-Strings", header_strings, loan_data_strings)

In [81]:
checkpoint_strings['header']

array(['issue_data', 'loan_status', 'term_month', 'sub_grade', 'verification_status', 'addr_state'],
      dtype='<U19')

In [82]:
checkpoint_strings['data']

array([[ 5,  1, 36, 13,  1,  1],
       [ 0,  1, 36,  5,  1,  4],
       [ 9,  1, 36, 10,  1,  4],
       ...,
       [ 6,  1, 36,  5,  1,  1],
       [ 4,  1, 36, 17,  1,  3],
       [12,  1, 36,  4,  0,  3]])

### Numeric Columns

**Let's now preprocess the numeric columns**

In [83]:
header_numeric

array(['id', 'loan_amnt', 'funded_amnt', 'int_rate', 'installment', 'total_pymnt'], dtype='<U19')

In [84]:
loan_data_numeric

array([[48010226,    35000,    35000,       13,     1184,     9452],
       [57693261,    30000,    30000, 68616520,      938,     4679],
       [59432726,    15000,    15000, 68616520,      494,     1969],
       ...,
       [50415990,    10000,    10000, 68616520, 68616520,     2185],
       [46154151, 68616520,    10000,       16,      354,     3199],
       [66055249,    10000,    10000, 68616520,      309,      301]])

In [85]:
# Checking for missing data
np.isnan(loan_data_numeric).sum()

0

**We can see that we don't have any missing value because we did fill all missing numeric value with the max value of the data + 1 which was 68616520**

In [86]:
loan_data_numeric[:, 0]

array([48010226, 57693261, 59432726, ..., 50415990, 46154151, 66055249])

In [87]:
# checking if any of the first column was filled with the temporry fil
np.isin(loan_data_numeric[:, 0], temporary_fill).sum()

0

In [88]:
temporary_stat

array([[  373332.  ,         nan,     1000.  ,         nan,     1000.  ,         nan,        6.  ,
              31.42,         nan,         nan,         nan,         nan,         nan,        0.  ],
       [54015809.19,         nan,    15273.46,         nan,    15311.04,         nan,       16.62,
             440.92,         nan,         nan,         nan,         nan,         nan,     3143.85],
       [68616519.  ,         nan,    35000.  ,         nan,    35000.  ,         nan,       28.99,
            1372.97,         nan,         nan,         nan,         nan,         nan,    41913.62]])

In [89]:
temporary_stat[:, column_numeric]
# the first row represent the min values, followed by the mean and then the max

array([[  373332.  ,     1000.  ,     1000.  ,        6.  ,       31.42,        0.  ],
       [54015809.19,    15273.46,    15311.04,       16.62,      440.92,     3143.85],
       [68616519.  ,    35000.  ,    35000.  ,       28.99,     1372.97,    41913.62]])

In [90]:
header_numeric

array(['id', 'loan_amnt', 'funded_amnt', 'int_rate', 'installment', 'total_pymnt'], dtype='<U19')

In [91]:
# funded_amt
loan_data_numeric[:, 2]

array([35000, 30000, 15000, ..., 10000, 10000, 10000])

In [92]:
# Replacing every temporary_fill values in the funded_amt with the min of the the column
loan_data_numeric[:, 2] = np.where(loan_data_numeric[:, 2] == temporary_fill, 
                                   temporary_stat[0, column_numeric[2]],
                                   loan_data_numeric[:, 2])


In [93]:
np.isin(loan_data_numeric[:, 2], temporary_fill).sum()

0

**Working on the other columns that is 'loan_amnt', 'int_rate', 'installment', and, 'total_pymnt'. This columns have the index of 1, 3, 4, 5**

**Using a for loop to fill the max values from the temporary_stat into all these columns**

In [94]:
for i in [1, 3, 4, 5]:
    loan_data_numeric[:, i] = np.where(loan_data_numeric[:, i] == temporary_fill,
                                      temporary_stat[2, column_numeric[i]],
                                      loan_data_numeric[:, i]) 


In [95]:
loan_data_numeric

array([[48010226,    35000,    35000,       13,     1184,     9452],
       [57693261,    30000,    30000,       28,      938,     4679],
       [59432726,    15000,    15000,       28,      494,     1969],
       ...,
       [50415990,    10000,    10000,       28,     1372,     2185],
       [46154151,    35000,    10000,       16,      354,     3199],
       [66055249,    10000,    10000,       28,      309,      301]])

In [96]:
# checking if any of the numeric column is still filled with the temporry fil
np.isin(loan_data_numeric, temporary_fill).sum()

0

### Currency Change

**Importing the Exchange dataset**

In [97]:
# Exchange dataset
EUR_USD = np.genfromtxt('EUR-USD.csv', delimiter =',', autostrip = True, dtype = 'str')
EUR_USD

array([['Open', 'High', 'Low', 'Close', 'Volume'],
       ['1.2098628282546997', '1.2098628282546997', '1.11055588722229', '1.1287955045700073', '0'],
       ['1.1287955045700073', '1.1484194993972778', '1.117680549621582', '1.1205360889434814',
        '0'],
       ['1.119795799255371', '1.1240400075912476', '1.0460032224655151', '1.0830246210098267',
        '0'],
       ['1.0741022825241089', '1.1247594356536865', '1.0521597862243652', '1.1114321947097778',
        '0'],
       ['1.1215037107467651', '1.145304799079895', '1.0821995735168457', '1.0960345268249512',
        '0'],
       ['1.095902442932129', '1.1428401470184326', '1.0888904333114624', '1.122296690940857', '0'],
       ['1.1134989261627197', '1.1219995021820068', '1.081270456314087', '1.0939244031906128',
        '0'],
       ['1.0969001054763794', '1.1705996990203857', '1.0850305557250977', '1.1340054273605347',
        '0'],
       ['1.1225990056991577', '1.1460003852844238', '1.1089695692062378', '1.1255937814712524

**Since we will only be using the "Close" column we can only import just that column**

In [98]:
EUR_USD = np.genfromtxt('EUR-USD.csv', delimiter =',', autostrip = True, skip_header = 1, usecols = 3)
EUR_USD

array([1.13, 1.12, 1.08, 1.11, 1.1 , 1.12, 1.09, 1.13, 1.13, 1.1 , 1.06, 1.09])

In [99]:
loan_data_strings[:, 0]

array([ 5,  0,  9, ...,  6,  4, 12])

In [100]:
# attributing exchange rate for each of the issued_data
exchange_rate = loan_data_strings[:, 0]

for i in range(1, 13):
    exchange_rate = np.where(exchange_rate == i,
                             EUR_USD[i - 1],
                             exchange_rate)

exchange_rate = np.where(exchange_rate == 0,
                             np.mean(EUR_USD),
                             exchange_rate)
exchange_rate

array([1.1 , 1.11, 1.13, ..., 1.12, 1.11, 1.09])

In [101]:
exchange_rate.shape

(10000,)

In [102]:
loan_data_numeric.shape

(10000, 6)

In [103]:
# In other to stack the exchange_rate into the loan_data_numeric, we need to change the shape of the exchange_rate

exchange_rate = np.reshape(exchange_rate, (10000, 1))
exchange_rate.shape

(10000, 1)

#### Horizotal stalking

In [104]:
loan_data_numeric = np.hstack((loan_data_numeric, exchange_rate))
loan_data_numeric

array([[48010226.  ,    35000.  ,    35000.  , ...,     1184.  ,     9452.  ,        1.1 ],
       [57693261.  ,    30000.  ,    30000.  , ...,      938.  ,     4679.  ,        1.11],
       [59432726.  ,    15000.  ,    15000.  , ...,      494.  ,     1969.  ,        1.13],
       ...,
       [50415990.  ,    10000.  ,    10000.  , ...,     1372.  ,     2185.  ,        1.12],
       [46154151.  ,    35000.  ,    10000.  , ...,      354.  ,     3199.  ,        1.11],
       [66055249.  ,    10000.  ,    10000.  , ...,      309.  ,      301.  ,        1.09]])

In [105]:
# Also stalking the header 

header_numeric = np.hstack((header_numeric, np.array(['exchange_rate'])))
header_numeric

array(['id', 'loan_amnt', 'funded_amnt', 'int_rate', 'installment', 'total_pymnt', 'exchange_rate'],
      dtype='<U19')

In [106]:
# columns that indicated amount in dollars
columns_dollars = np.array([1, 2, 4, 5])

In [107]:
loan_data_numeric[:, columns_dollars]

array([[35000., 35000.,  1184.,  9452.],
       [30000., 30000.,   938.,  4679.],
       [15000., 15000.,   494.,  1969.],
       ...,
       [10000., 10000.,  1372.,  2185.],
       [35000., 10000.,   354.,  3199.],
       [10000., 10000.,   309.,   301.]])

In [108]:
loan_data_numeric[:, 6]

array([1.1 , 1.11, 1.13, ..., 1.12, 1.11, 1.09])

In [109]:
# creating new columns where we convert all the dollar to EUR equivalent
for i in columns_dollars:
    loan_data_numeric = np.hstack((loan_data_numeric, np.reshape((loan_data_numeric[:, i] / loan_data_numeric[:, 6]), (10000, 1))))

In [110]:
loan_data_numeric

array([[48010226.  ,    35000.  ,    35000.  , ...,    31933.3 ,     1080.26,     8623.82],
       [57693261.  ,    30000.  ,    30000.  , ...,    27132.46,      848.34,     4231.76],
       [59432726.  ,    15000.  ,    15000.  , ...,    13326.3 ,      438.88,     1749.3 ],
       ...,
       [50415990.  ,    10000.  ,    10000.  , ...,     8910.3 ,     1222.49,     1946.9 ],
       [46154151.  ,    35000.  ,    10000.  , ...,     8997.4 ,      318.51,     2878.27],
       [66055249.  ,    10000.  ,    10000.  , ...,     9145.8 ,      282.61,      275.29]])

In [111]:
loan_data_numeric.shape

(10000, 11)

**Expanding the headers name**

In [112]:
header_additional = np.array([column_name + '_EUR' for column_name in header_numeric[columns_dollars]])

header_additional

array(['loan_amnt_EUR', 'funded_amnt_EUR', 'installment_EUR', 'total_pymnt_EUR'], dtype='<U15')

In [113]:
# Stacking the additional columns names with header_numeric

header_numeric = np.hstack((header_numeric, header_additional))
header_numeric 

array(['id', 'loan_amnt', 'funded_amnt', 'int_rate', 'installment', 'total_pymnt', 'exchange_rate',
       'loan_amnt_EUR', 'funded_amnt_EUR', 'installment_EUR', 'total_pymnt_EUR'], dtype='<U19')

In [114]:
#adding '_USD' to the dollar amount 

for i in header_numeric[columns_dollars]:
    header_numeric = np.where(header_numeric == i,
                             i + '_USD', 
                             header_numeric)

header_numeric

array(['id', 'loan_amnt_USD', 'funded_amnt_USD', 'int_rate', 'installment_USD', 'total_pymnt_USD',
       'exchange_rate', 'loan_amnt_EUR', 'funded_amnt_EUR', 'installment_EUR', 'total_pymnt_EUR'],
      dtype='<U19')

In [115]:
col_rearrange  = [0, 1, 7, 2, 8, 3, 4, 9, 5,10, 6]

In [116]:
header_numeric = header_numeric[col_rearrange]
header_numeric

array(['id', 'loan_amnt_USD', 'loan_amnt_EUR', 'funded_amnt_USD', 'funded_amnt_EUR', 'int_rate',
       'installment_USD', 'installment_EUR', 'total_pymnt_USD', 'total_pymnt_EUR', 'exchange_rate'],
      dtype='<U19')

In [117]:
loan_data_numeric =loan_data_numeric[:, col_rearrange]
loan_data_numeric

array([[48010226.  ,    35000.  ,    31933.3 , ...,     9452.  ,     8623.82,        1.1 ],
       [57693261.  ,    30000.  ,    27132.46, ...,     4679.  ,     4231.76,        1.11],
       [59432726.  ,    15000.  ,    13326.3 , ...,     1969.  ,     1749.3 ,        1.13],
       ...,
       [50415990.  ,    10000.  ,     8910.3 , ...,     2185.  ,     1946.9 ,        1.12],
       [46154151.  ,    35000.  ,    31490.9 , ...,     3199.  ,     2878.27,        1.11],
       [66055249.  ,    10000.  ,     9145.8 , ...,      301.  ,      275.29,        1.09]])

### Checkpoint

In [118]:
checkpoint_numeric = checkpoint('Checkpoint-Numeric', header_numeric, loan_data_numeric)

In [119]:
checkpoint_numeric['data']

array([[48010226.  ,    35000.  ,    31933.3 , ...,     9452.  ,     8623.82,        1.1 ],
       [57693261.  ,    30000.  ,    27132.46, ...,     4679.  ,     4231.76,        1.11],
       [59432726.  ,    15000.  ,    13326.3 , ...,     1969.  ,     1749.3 ,        1.13],
       ...,
       [50415990.  ,    10000.  ,     8910.3 , ...,     2185.  ,     1946.9 ,        1.12],
       [46154151.  ,    35000.  ,    31490.9 , ...,     3199.  ,     2878.27,        1.11],
       [66055249.  ,    10000.  ,     9145.8 , ...,      301.  ,      275.29,        1.09]])

In [120]:
checkpoint_numeric['header']

array(['id', 'loan_amnt_USD', 'loan_amnt_EUR', 'funded_amnt_USD', 'funded_amnt_EUR', 'int_rate',
       'installment_USD', 'installment_EUR', 'total_pymnt_USD', 'total_pymnt_EUR', 'exchange_rate'],
      dtype='<U19')

## Creating A Complete Dataset

In [121]:
checkpoint_strings['data'].shape

(10000, 6)

In [122]:
checkpoint_numeric['data'].shape

(10000, 11)

In [123]:
loan_data = np.hstack((checkpoint_numeric['data'], checkpoint_strings['data']))

In [124]:
loan_data

array([[48010226.  ,    35000.  ,    31933.3 , ...,       13.  ,        1.  ,        1.  ],
       [57693261.  ,    30000.  ,    27132.46, ...,        5.  ,        1.  ,        4.  ],
       [59432726.  ,    15000.  ,    13326.3 , ...,       10.  ,        1.  ,        4.  ],
       ...,
       [50415990.  ,    10000.  ,     8910.3 , ...,        5.  ,        1.  ,        1.  ],
       [46154151.  ,    35000.  ,    31490.9 , ...,       17.  ,        1.  ,        3.  ],
       [66055249.  ,    10000.  ,     9145.8 , ...,        4.  ,        0.  ,        3.  ]])

In [125]:
loan_data.shape

(10000, 17)

In [126]:
np.isnan(loan_data).sum()

0

In [127]:
header_full = np.concatenate([checkpoint_numeric['header'], checkpoint_strings['header']])

In [128]:
header_full

array(['id', 'loan_amnt_USD', 'loan_amnt_EUR', 'funded_amnt_USD', 'funded_amnt_EUR', 'int_rate',
       'installment_USD', 'installment_EUR', 'total_pymnt_USD', 'total_pymnt_EUR', 'exchange_rate',
       'issue_data', 'loan_status', 'term_month', 'sub_grade', 'verification_status', 'addr_state'],
      dtype='<U19')

In [129]:
header_full[-1] = 'state_address'

In [130]:
header_full

array(['id', 'loan_amnt_USD', 'loan_amnt_EUR', 'funded_amnt_USD', 'funded_amnt_EUR', 'int_rate',
       'installment_USD', 'installment_EUR', 'total_pymnt_USD', 'total_pymnt_EUR', 'exchange_rate',
       'issue_data', 'loan_status', 'term_month', 'sub_grade', 'verification_status',
       'state_address'], dtype='<U19')

In [131]:
np.argsort(loan_data[:, 0])

array([2086, 4812, 2353, ..., 4935, 9388, 8415], dtype=int64)

In [132]:
loan_data = loan_data[np.argsort(loan_data[:, 0])]

In [133]:
loan_data

array([[  373332.  ,     9950.  ,     9038.08, ...,       21.  ,        0.  ,        1.  ],
       [  575239.  ,    12000.  ,    10900.2 , ...,       25.  ,        1.  ,        2.  ],
       [  707689.  ,    10000.  ,     8924.3 , ...,       13.  ,        1.  ,        0.  ],
       ...,
       [68614880.  ,     5600.  ,     5121.65, ...,        8.  ,        1.  ,        1.  ],
       [68615915.  ,     4000.  ,     3658.32, ...,       10.  ,        1.  ,        2.  ],
       [68616519.  ,    21600.  ,    19754.93, ...,        3.  ,        0.  ,        2.  ]])

In [134]:
np.argsort(loan_data[:, 0])

array([   0,    1,    2, ..., 9997, 9998, 9999], dtype=int64)

**Combining the header_full to the loan_data**

In [135]:
loan_data = np.vstack((header_full, loan_data))

In [136]:
loan_data

array([['id', 'loan_amnt_USD', 'loan_amnt_EUR', ..., 'sub_grade', 'verification_status',
        'state_address'],
       ['373332.0', '9950.0', '9038.082814338286', ..., '21.0', '0.0', '1.0'],
       ['575239.0', '12000.0', '10900.20037910145', ..., '25.0', '1.0', '2.0'],
       ...,
       ['68614880.0', '5600.0', '5121.647851612413', ..., '8.0', '1.0', '1.0'],
       ['68615915.0', '4000.0', '3658.319894008867', ..., '10.0', '1.0', '2.0'],
       ['68616519.0', '21600.0', '19754.927427647883', ..., '3.0', '0.0', '2.0']], dtype='<U32')

In [137]:
# Saving the dataset
np.savetxt('loan-data-preprocessing.csv', 
           loan_data, 
           fmt ="%s", 
           delimiter=',')

## Train a Logistic Regression Model

**Loading the preprocessed dataset**

In [138]:
preprocessed_df = pd.read_csv('loan-data-preprocessing.csv')

In [139]:
preprocessed_df.head()

Unnamed: 0,id,loan_amnt_USD,loan_amnt_EUR,funded_amnt_USD,funded_amnt_EUR,int_rate,installment_USD,installment_EUR,total_pymnt_USD,total_pymnt_EUR,exchange_rate,issue_data,loan_status,term_month,sub_grade,verification_status,state_address
0,373332.0,9950.0,9038.082814,1000.0,908.350032,18.0,360.0,327.006011,1072.0,973.751234,1.100897,10.0,1.0,36.0,21.0,0.0,1.0
1,575239.0,12000.0,10900.200379,12000.0,10900.200379,20.0,324.0,294.30541,959.0,871.10768,1.100897,10.0,1.0,60.0,25.0,1.0,2.0
2,707689.0,10000.0,8924.299805,10000.0,8924.299805,13.0,340.0,303.426193,3726.0,3325.194107,1.120536,2.0,1.0,36.0,13.0,1.0,0.0
3,709828.0,27200.0,24707.120859,27200.0,24707.120859,28.0,553.0,502.317567,41913.0,38071.674874,1.100897,10.0,1.0,60.0,6.0,0.0,4.0
4,849994.0,11400.0,10526.076489,11400.0,10526.076489,28.0,376.0,347.175856,3753.0,3465.295181,1.083025,3.0,0.0,36.0,10.0,0.0,1.0


**Since we only need to work with the payment in Eur, we can futher split the data**

In [140]:
model_df = preprocessed_df.iloc[:, [2,4,5,7,9,11,12,13,14,15,16]]

In [141]:
model_df

Unnamed: 0,loan_amnt_EUR,funded_amnt_EUR,int_rate,installment_EUR,total_pymnt_EUR,issue_data,loan_status,term_month,sub_grade,verification_status,state_address
0,9038.082814,908.350032,18.0,327.006011,973.751234,10.0,1.0,36.0,21.0,0.0,1.0
1,10900.200379,10900.200379,20.0,294.305410,871.107680,10.0,1.0,60.0,25.0,1.0,2.0
2,8924.299805,8924.299805,13.0,303.426193,3325.194107,2.0,1.0,36.0,13.0,1.0,0.0
3,24707.120859,24707.120859,28.0,502.317567,38071.674874,10.0,1.0,60.0,6.0,0.0,4.0
4,10526.076489,10526.076489,28.0,347.175856,3465.295181,3.0,0.0,36.0,10.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...
9995,12804.119629,12804.119629,28.0,385.038169,38332.790429,12.0,1.0,36.0,1.0,0.0,1.0
9996,18291.599470,18291.599470,28.0,577.099963,0.000000,12.0,1.0,36.0,6.0,0.0,2.0
9997,5121.647852,5121.647852,28.0,164.624395,0.000000,12.0,1.0,36.0,8.0,1.0,1.0
9998,3658.319894,3658.319894,28.0,119.809977,0.000000,12.0,1.0,36.0,10.0,1.0,2.0


In [142]:
X = model_df.drop('loan_status', axis = 1)

In [143]:
y = model_df['loan_status']

### Train Test Split

In [144]:
from sklearn.model_selection import train_test_split

In [145]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=101)

### Normalizing the data

In [146]:
from sklearn.preprocessing import MinMaxScaler

In [147]:
scaler = MinMaxScaler()

In [148]:
X_train = scaler.fit_transform(X_train)

In [149]:
X_test = scaler.transform(X_test)

### Logistic Regression Model

In [150]:
from sklearn.linear_model import LogisticRegression

In [151]:
logreg = LogisticRegression()

In [152]:
logreg.fit(X_train, y_train)

In [153]:
predictions = logreg.predict(X_test)

### Model Evaluation

In [154]:
from sklearn.metrics import precision_score, confusion_matrix, classification_report

In [155]:
print("Precision Score: ", precision_score(y_test, predictions))

Precision Score:  0.947


In [156]:
print("Confusion Matrix: ")
print(confusion_matrix(y_test, predictions))

Confusion Matrix: 
[[   0  106]
 [   0 1894]]


In [157]:
print("Classification Report:")
print(classification_report(y_test, predictions))

Classification Report:
              precision    recall  f1-score   support

         0.0       0.00      0.00      0.00       106
         1.0       0.95      1.00      0.97      1894

    accuracy                           0.95      2000
   macro avg       0.47      0.50      0.49      2000
weighted avg       0.90      0.95      0.92      2000



  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


In [158]:
y.value_counts()

loan_status
1.0    9428
0.0     572
Name: count, dtype: int64

In [159]:
unique, counts = np.unique(predictions, return_counts=True)
print(dict(zip(unique, counts)))

{1.0: 2000}


**The model show a significant imbalance in our data, so let's create another model and using "class_weight = 'balance' "**

In [160]:
logreg2 = LogisticRegression(class_weight='balanced')

In [161]:
logreg2.fit(X_train, y_train)

In [162]:
predictions2 = logreg2.predict(X_test)

In [163]:
print("Precision Score: ", precision_score(y_test, predictions2))

Precision Score:  0.9677033492822966


In [164]:
print("Confusion Matrix: ")
print(confusion_matrix(y_test, predictions2))

Confusion Matrix: 
[[  79   27]
 [1085  809]]


In [165]:
print("Classification Report:")
print(classification_report(y_test, predictions2))

Classification Report:
              precision    recall  f1-score   support

         0.0       0.07      0.75      0.12       106
         1.0       0.97      0.43      0.59      1894

    accuracy                           0.44      2000
   macro avg       0.52      0.59      0.36      2000
weighted avg       0.92      0.44      0.57      2000



In [166]:
unique, counts = np.unique(predictions2, return_counts=True)
print(dict(zip(unique, counts)))

{0.0: 1164, 1.0: 836}


# Conclusion, Limitations, Recommendations, and Business Insights

## Conclusion 
The objective of this project was to build a **predictive model for loan default** using logistic regression. However, during the modeling process, a **significant class imbalance** was identified in the dataset, where the majority class (1.0 – Non-default) heavily outweighed the minority class (0.0 – Default). This imbalance led to an initial model that **predicted only the majority class**, resulting in a misleadingly high accuracy but poor generalization to default cases.  

To address this, a second model was trained using **class_weight='balanced'**, which improved class predictions but at the cost of a lower overall accuracy. The confusion matrix and classification report indicate that while the recall for the minority class (defaults) improved significantly, precision suffered, leading to a weaker overall model performance.  

This highlights the challenge of **building predictive models in highly imbalanced datasets**—ensuring the model is both sensitive to the minority class while maintaining overall predictive power.  

---

## Limitations  
1. **Severe Class Imbalance** – The dataset was highly skewed towards non-default cases, making it difficult for the model to learn patterns associated with loan defaults.   
2. **Model Choice** – Logistic regression, while interpretable, may not be the best model for handling imbalanced data. More advanced models like Random Forest, XGBoost, or SMOTE-balanced classifiers could yield better results.  
3. **False Negatives** – The model still struggles to correctly classify defaulted loans, which is a major business risk for lenders.  

---

## Recommendations for Improvement
1. **Apply Data Resampling Techniques:**  
   - Use **SMOTE (Synthetic Minority Over-sampling Technique)** to generate synthetic samples for the minority class.  
   - Experiment with **undersampling the majority class** to balance the dataset.  

2. **Try Advanced Models:**  
   - Implement **Random Forest, XGBoost, or Gradient Boosting** to see if they better capture patterns in the data.  
   - Use **ensemble methods** to combine multiple models for better performance.  

3. **Feature Engineering & Data Enrichment:**  
   - Incorporate additional borrower financial indicators like **income, credit history, and debt-to-income ratio**.  
   - Create new features (e.g., **loan-to-income ratio, payment delinquency trends**).  
 
---

## Business Insights & Implications
1. **High Risk of Loan Defaults is Hard to Predict with the Current Data**  
   - Lenders should **enhance their risk assessment criteria** by incorporating additional borrower details beyond just the given features.  
   
2. **False Negatives Pose a Financial Risk**  
   - If the model misclassifies defaulters as safe borrowers, financial institutions may suffer heavy losses. A **better risk scoring system** is needed before approving loans.  

3. **Alternative Credit Scoring Strategies Should Be Considered**  
   - Banks should explore **alternative data sources (e.g., transaction behavior, mobile payment history, social media insights)** to assess creditworthiness.  

4. **Lending Institutions Should Adjust Loan Approval Strategies**  
   - Given the difficulty in predicting defaults, institutions may consider **stricter lending policies** for high-risk applicants or **higher interest rates to offset risk**.  

5. **Regular Model Retraining is Essential**  
   - Loan patterns and economic conditions change over time. **Periodic retraining and evaluation** of the model are necessary to maintain its predictive accuracy.  

---

## Final Thoughts  
While the model has provided some predictive insights, its performance is limited due to the imbalanced nature of the dataset. **Further data preprocessing, advanced modeling, and business strategy refinements** are required before deploying this model in a real-world lending environment.  

---