## Predictive Default Risk Modeling for Smarter Loan Approvals

**Summary:**  
Build a regularized logistic regression model on 45,000 historical loan applications to estimate each applicant’s probability of default.  

**Key Features:**  
- **Demographics:** age, gender, education  
- **Financial Profile:** income, employment tenure, home‐ownership  
- **Credit History:** credit score, length of credit history, past defaults  
- **Loan Details:** amount, purpose, interest rate, payment‑to‑income ratio  

**Objective:**  
- **Balance** catching true defaulters (recall) against avoiding false alarms on good borrowers (precision)  
- **Optimize** the F₁‑Score on the “default” class as our primary metric  

**Business Impact:**  
- Reduce credit losses by flagging high‑risk applicants  
- Accelerate approvals for low‑risk borrowers  
- Inform data‑driven underwriting thresholds and loan‐term adjustments via `.predict_proba()`  
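
The objective above can be sketched end to end as follows. This is a minimal illustration rather than the notebook's actual model: the data is synthetic (generated with `make_classification`, with roughly 22% positives), and the value of `C`, the candidate thresholds, and all variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan data (assumption): class 1 plays the
# role of "default" and makes up roughly 22% of the rows.
X, y = make_classification(
    n_samples=2000, n_features=8, n_informative=5,
    weights=[0.78, 0.22], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Logistic regression with scikit-learn's default L2 regularization;
# C is the inverse regularization strength (smaller C = stronger penalty).
model = make_pipeline(StandardScaler(), LogisticRegression(C=1.0, max_iter=1000))
model.fit(X_train, y_train)

# Column 1 of predict_proba holds the estimated probability of class 1.
default_proba = model.predict_proba(X_test)[:, 1]

# Compare F1 on the positive class across a few candidate thresholds.
thresholds = [0.3, 0.4, 0.5, 0.6]
f1_by_threshold = {
    t: f1_score(y_test, (default_proba >= t).astype(int), pos_label=1)
    for t in thresholds
}
best_threshold = max(f1_by_threshold, key=f1_by_threshold.get)
print(f1_by_threshold, best_threshold)
```

Lowering the threshold flags more applicants (higher recall, lower precision), and raising it does the opposite. In a real workflow the threshold would be chosen on a separate validation split, not on the final test set as in this sketch.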


In [5]:
#import the required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


#import sklearn libraries
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score


### Data loading and Inspection

In [22]:
loan_df = pd.read_csv('data/loan_data.csv')
loan_df.head(5)

| | person_age | person_gender | person_education | person_income | person_emp_exp | person_home_ownership | loan_amnt | loan_intent | loan_int_rate | loan_percent_income | cb_person_cred_hist_length | credit_score | previous_loan_defaults_on_file | loan_status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 22.0 | female | Master | 71948.0 | 0 | RENT | 35000.0 | PERSONAL | 16.02 | 0.49 | 3.0 | 561 | No | 1 |
| 1 | 21.0 | female | High School | 12282.0 | 0 | OWN | 1000.0 | EDUCATION | 11.14 | 0.08 | 2.0 | 504 | Yes | 0 |
| 2 | 25.0 | female | High School | 12438.0 | 3 | MORTGAGE | 5500.0 | MEDICAL | 12.87 | 0.44 | 3.0 | 635 | No | 1 |
| 3 | 23.0 | female | Bachelor | 79753.0 | 0 | RENT | 35000.0 | MEDICAL | 15.23 | 0.44 | 2.0 | 675 | No | 1 |
| 4 | 24.0 | male | Master | 66135.0 | 1 | RENT | 35000.0 | MEDICAL | 14.27 | 0.53 | 4.0 | 586 | No | 1 |


In [23]:
loan_df.shape

(45000, 14)

There are `45,000` rows and `14` columns in the data.

#### Check for missing values

In [16]:
loan_df.isnull().sum()

person_age                        0
person_gender                     0
person_education                  0
person_income                     0
person_emp_exp                    0
person_home_ownership             0
loan_amnt                         0
loan_intent                       0
loan_int_rate                     0
loan_percent_income               0
cb_person_cred_hist_length        0
credit_score                      0
previous_loan_defaults_on_file    0
loan_status                       0
dtype: int64

#### Check data types

In [10]:
loan_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45000 entries, 0 to 44999
Data columns (total 14 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   person_age                      45000 non-null  float64
 1   person_gender                   45000 non-null  object 
 2   person_education                45000 non-null  object 
 3   person_income                   45000 non-null  float64
 4   person_emp_exp                  45000 non-null  int64  
 5   person_home_ownership           45000 non-null  object 
 6   loan_amnt                       45000 non-null  float64
 7   loan_intent                     45000 non-null  object 
 8   loan_int_rate                   45000 non-null  float64
 9   loan_percent_income             45000 non-null  float64
 10  cb_person_cred_hist_length      45000 non-null  float64
 11  credit_score                    45000 non-null  int64  
 12  previous_loan_defaults_on_file  45000 non-null  object 
 13  loan_status                     45000 non-null  int64  
dtypes: float64(6), int64(3), object(5)

### Interpreting `loan_df.info()` and planned type conversions

| Column Name                         | Current Type         | Target Type  | Reason                                                          |
| ----------------------------------- | -------------------- | ------------ | --------------------------------------------------------------- |
| `person_gender`                     | `object`             | `category`   | Categorical variable with limited values ("male", "female")     |
| `person_education`                  | `object`             | `category`   | Ordinal/nominal variable ("High School", "Bachelor", etc.)      |
| `person_home_ownership`             | `object`             | `category`   | Nominal categories like "RENT", "OWN", "MORTGAGE"               |
| `loan_intent`                       | `object`             | `category`   | Nominal loan purpose categories ("EDUCATION", "PERSONAL", etc.) |
| `previous_loan_defaults_on_file`    | `object`             | `int` (0/1)  | Map `"Yes"` to `1` and `"No"` to `0` for modeling               |
| `loan_status`                       | `int64`              | No change    | Target column, already binary (0/1)                             |
| `person_age`, `person_income`, etc. | `float64` / `int64`  | No change    | Already numeric                                                 |
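
The conversions in the table could be applied along these lines. This is a sketch under stated assumptions: `sample_df` is a hypothetical two-row frame whose values are copied from the `loan_df.head()` output, and the variable names are illustrative; the same calls would apply to `loan_df` itself.

```python
import pandas as pd

# Hypothetical two-row sample using values from the head() output.
sample_df = pd.DataFrame({
    "person_gender": ["female", "female"],
    "person_education": ["Master", "High School"],
    "person_home_ownership": ["RENT", "OWN"],
    "loan_intent": ["PERSONAL", "EDUCATION"],
    "previous_loan_defaults_on_file": ["No", "Yes"],
    "loan_status": [1, 0],
})

# Columns the table marks for conversion to the pandas category dtype.
categorical_cols = [
    "person_gender",
    "person_education",
    "person_home_ownership",
    "loan_intent",
]
sample_df = sample_df.astype({col: "category" for col in categorical_cols})

# Encode the Yes/No flag as 1/0 so it can be used directly as a feature.
sample_df["previous_loan_defaults_on_file"] = (
    sample_df["previous_loan_defaults_on_file"]
    .map({"Yes": 1, "No": 0})
    .astype("int64")
)

print(sample_df.dtypes)
```

Any value other than `"Yes"` or `"No"` would map to `NaN` and make the final `astype("int64")` fail, which serves as a cheap check that the column contains only the two expected labels.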


#### Summary statistics: a first look at the data distribution

In [12]:
loan_df.describe()

| | person_age | person_income | person_emp_exp | loan_amnt | loan_int_rate | loan_percent_income | cb_person_cred_hist_length | credit_score | loan_status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 45000.0 | 45000.0 | 45000.0 | 45000.0 | 45000.0 | 45000.0 | 45000.0 | 45000.0 | 45000.0 |
| mean | 27.764178 | 80319.05 | 5.410333 | 9583.157556 | 11.006606 | 0.139725 | 5.867489 | 632.608756 | 0.222222 |
| std | 6.045108 | 80422.5 | 6.063532 | 6314.886691 | 2.978808 | 0.087212 | 3.879702 | 50.435865 | 0.415744 |
| min | 20.0 | 8000.0 | 0.0 | 500.0 | 5.42 | 0.0 | 2.0 | 390.0 | 0.0 |
| 25% | 24.0 | 47204.0 | 1.0 | 5000.0 | 8.59 | 0.07 | 3.0 | 601.0 | 0.0 |
| 50% | 26.0 | 67048.0 | 4.0 | 8000.0 | 11.01 | 0.12 | 4.0 | 640.0 | 0.0 |
| 75% | 30.0 | 95789.25 | 8.0 | 12237.25 | 12.99 | 0.19 | 8.0 | 670.0 | 0.0 |
| max | 144.0 | 7200766.0 | 125.0 | 35000.0 | 20.0 | 0.66 | 30.0 | 850.0 | 1.0 |
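
Two things stand out in these statistics: the maxima of `person_age` (144) and `person_emp_exp` (125) look implausible, and the mean of `loan_status` (about 0.22) indicates that roughly 22% of rows belong to class 1. A quick check for both might look like the sketch below. The four-row `sample_df` is hypothetical (only the two maxima come from the output), and the age cutoff of 100 is an assumption chosen for illustration, not a rule from the source.

```python
import pandas as pd

# Hypothetical mini-sample; 144 and 125 are the maxima reported by
# describe(), and their pairing in one row is made up for the example.
sample_df = pd.DataFrame({
    "person_age": [22.0, 26.0, 30.0, 144.0],
    "person_emp_exp": [0, 4, 8, 125],
    "loan_status": [1, 0, 0, 0],
})

MAX_PLAUSIBLE_AGE = 100  # assumed cutoff for illustration

# Flag rows with an implausible age, or with more years of employment
# experience than years of age.
implausible = sample_df[
    (sample_df["person_age"] > MAX_PLAUSIBLE_AGE)
    | (sample_df["person_emp_exp"] > sample_df["person_age"])
]

# Share of rows in class 1, i.e. the mean of the binary target.
positive_rate = sample_df["loan_status"].mean()

print(len(implausible), positive_rate)
```

Whether flagged rows should be dropped, capped, or kept is a separate modeling decision; the check only surfaces them for review.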
