# Credit Risk Prediction System - Business Understanding

## Business Problem
Financial institutions face significant losses due to loan defaults.  
The goal of this project is to **reduce loan defaults** by building a machine learning model that predicts whether a loan applicant is a **high-risk** or **low-risk** borrower.

## Objective
- Analyze applicant demographic and financial data.
- Identify key factors that contribute to loan default.
- Build a predictive system that classifies applicants into risk categories.
- Provide insights that help lenders make informed decisions and minimize risk.

## Dataset
The dataset contains 45,000 records and 14 columns, with features such as:
- **Demographics:** Age, Gender, Education, Income, Employment Experience, Home Ownership
- **Loan Details:** Loan Amount, Loan Intent, Interest Rate, Loan Percent Income
- **Credit History:** Credit Score, Credit History Length, Previous Loan Defaults
- **Target Variable:** `loan_status` (High Risk - 0 / Low Risk - 1)

---

## Data Understanding

In [1]:
# Import the necessary libraries
import pandas as pd
import numpy as np

# Load the raw loan data
data_path = "../data/raw/loan_data.csv"
dataframe = pd.read_csv(data_path)

print("Shape of the dataset:", dataframe.shape)


Shape of the dataset: (45000, 14)
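Since later cells depend on specific column names, a quick schema check right after loading can catch a wrong or outdated file early. The following is a minimal sketch, not part of the original notebook: the helper `missing_columns` and the tiny in-memory frame are assumptions made for illustration, while the column names are taken from the outputs shown in this notebook.

```python
import pandas as pd

# Column names as they appear in the notebook's dtypes output.
EXPECTED_COLUMNS = [
    "person_age", "person_gender", "person_education", "person_income",
    "person_emp_exp", "person_home_ownership", "loan_amnt", "loan_intent",
    "loan_int_rate", "loan_percent_income", "cb_person_cred_hist_length",
    "credit_score", "previous_loan_defaults_on_file", "loan_status",
]


def missing_columns(frame: pd.DataFrame, expected: list[str]) -> list[str]:
    """Return the expected column names that are absent from `frame`."""
    return [name for name in expected if name not in frame.columns]


# Hypothetical stand-in for the loaded file: it has only two of the columns.
sample = pd.DataFrame({"person_age": [22.0], "loan_status": [0]})
absent = missing_columns(sample, EXPECTED_COLUMNS)
print(f"{len(absent)} expected columns are missing")
```

On the real data, calling `missing_columns(dataframe, EXPECTED_COLUMNS)` should return an empty list if the file matches the schema shown here.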


In [2]:
print("Head data")
print(dataframe.head())


Head data
   person_age person_gender person_education  person_income  person_emp_exp  \
0        22.0        female           Master        71948.0               0   
1        21.0        female      High School        12282.0               0   
2        25.0        female      High School        12438.0               3   
3        23.0        female         Bachelor        79753.0               0   
4        24.0          male           Master        66135.0               1   

  person_home_ownership  loan_amnt loan_intent  loan_int_rate  \
0                  RENT    35000.0    PERSONAL          16.02   
1                   OWN     1000.0   EDUCATION          11.14   
2              MORTGAGE     5500.0     MEDICAL          12.87   
3                  RENT    35000.0     MEDICAL          15.23   
4                  RENT    35000.0     MEDICAL          14.27   

   loan_percent_income  cb_person_cred_hist_length  credit_score  \
0                 0.49                         3.0      

In [3]:
#data types
dataframe.dtypes

person_age                        float64
person_gender                      object
person_education                   object
person_income                     float64
person_emp_exp                      int64
person_home_ownership              object
loan_amnt                         float64
loan_intent                        object
loan_int_rate                     float64
loan_percent_income               float64
cb_person_cred_hist_length        float64
credit_score                        int64
previous_loan_defaults_on_file     object
loan_status                         int64
dtype: object
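The dtypes listing suggests a natural split into numeric and categorical features. As a rough sketch of how that split could be derived programmatically, the example below uses a small frame built from the first three rows of the `head()` output; the variable names `numeric_columns` and `categorical_columns` are illustrative assumptions.

```python
import pandas as pd

# First three rows of a few columns, copied from the head() output above.
sample = pd.DataFrame(
    {
        "person_age": [22.0, 21.0, 25.0],
        "person_gender": ["female", "female", "female"],
        "person_education": ["Master", "High School", "High School"],
        "person_income": [71948.0, 12282.0, 12438.0],
        "person_emp_exp": [0, 0, 3],
    }
)

# Numeric columns are selected directly; everything else is treated as categorical.
numeric_columns = sample.select_dtypes(include="number").columns.tolist()
categorical_columns = sample.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_columns)
print("Categorical:", categorical_columns)
```

Selecting the categorical columns by exclusion avoids depending on whether text columns are stored as `object` or as a dedicated string dtype.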

In [4]:
# Structure summary and descriptive statistics
print("Info")
dataframe.info()  # info() prints its report itself and returns None
print("=" * 50)
print("Description")
print(dataframe.describe())

Info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45000 entries, 0 to 44999
Data columns (total 14 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   person_age                      45000 non-null  float64
 1   person_gender                   45000 non-null  object 
 2   person_education                45000 non-null  object 
 3   person_income                   45000 non-null  float64
 4   person_emp_exp                  45000 non-null  int64  
 5   person_home_ownership           45000 non-null  object 
 6   loan_amnt                       45000 non-null  float64
 7   loan_intent                     45000 non-null  object 
 8   loan_int_rate                   45000 non-null  float64
 9   loan_percent_income             45000 non-null  float64
 10  cb_person_cred_hist_length      45000 non-null  float64
 11  credit_score                    45000 non-null  int64  
 12  previous_loan_defaults_on_f

In [7]:
# Value counts for each categorical column
cat_columns = dataframe.select_dtypes(include=["object"]).columns
for col in cat_columns:
    print(dataframe[col].value_counts())
    print()

person_gender
male      24841
female    20159
Name: count, dtype: int64

person_education
Bachelor       13399
Associate      12028
High School    11972
Master          6980
Doctorate        621
Name: count, dtype: int64

person_home_ownership
RENT        23443
MORTGAGE    18489
OWN          2951
OTHER         117
Name: count, dtype: int64

loan_intent
EDUCATION            9153
MEDICAL              8548
VENTURE              7819
PERSONAL             7552
DEBTCONSOLIDATION    7145
HOMEIMPROVEMENT      4783
Name: count, dtype: int64

previous_loan_defaults_on_file
Yes    22858
No     22142
Name: count, dtype: int64
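Raw counts are easier to compare as shares of the total, which also makes very small categories stand out. The sketch below reuses the `person_home_ownership` counts printed above; the 1% cutoff and the name `RARE_THRESHOLD` are arbitrary assumptions for illustration, not choices made in the notebook.

```python
import pandas as pd

# Counts copied from the person_home_ownership output above.
home_ownership_counts = pd.Series(
    {"RENT": 23443, "MORTGAGE": 18489, "OWN": 2951, "OTHER": 117}
)

# Convert counts to shares of all records.
shares = home_ownership_counts / home_ownership_counts.sum()

# The 1% cutoff is an illustrative threshold, not taken from the notebook.
RARE_THRESHOLD = 0.01
rare_categories = shares[shares < RARE_THRESHOLD].index.tolist()

print(shares.round(4))
print("Rare categories:", rare_categories)
```

On the loaded frame, `dataframe[col].value_counts(normalize=True)` gives the same shares directly.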



In [8]:
#missing values
print(dataframe.isna().sum())

person_age                        0
person_gender                     0
person_education                  0
person_income                     0
person_emp_exp                    0
person_home_ownership             0
loan_amnt                         0
loan_intent                       0
loan_int_rate                     0
loan_percent_income               0
cb_person_cred_hist_length        0
credit_score                      0
previous_loan_defaults_on_file    0
loan_status                       0
dtype: int64


In [9]:
#check for duplicates
print(dataframe.duplicated().sum())

0
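The missing-value and duplicate checks can be bundled into one reusable summary, which may be convenient if the raw file is refreshed later. This is a hedged sketch: `quality_summary` and the toy frame are hypothetical, and the toy values are invented purely to exercise the function.

```python
import pandas as pd


def quality_summary(frame: pd.DataFrame) -> dict:
    """Count missing cells and fully duplicated rows in `frame`."""
    return {
        "rows": len(frame),
        "missing_cells": int(frame.isna().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
    }


# Hypothetical toy frame with one missing value and one repeated row.
toy = pd.DataFrame(
    {
        "loan_amnt": [1000.0, 5500.0, 5500.0, None],
        "loan_intent": ["EDUCATION", "MEDICAL", "MEDICAL", "PERSONAL"],
    }
)
summary = quality_summary(toy)
print(summary)
```

Given the outputs above, `quality_summary(dataframe)` would be expected to report zero missing cells and zero duplicate rows for this dataset.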


In [10]:
dataframe["loan_status"].value_counts()

loan_status
0    35000
1    10000
Name: count, dtype: int64

In [11]:
# The dataset is imbalanced (35,000 rows with loan_status 0 vs. 10,000 with loan_status 1); it has no missing values and no duplicate rows.
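To make the imbalance concrete, the class shares and the majority-to-minority ratio can be computed from the counts above. The sketch below also shows one possible way to draw a stratified hold-out with pandas alone, so that both classes keep their proportions; the 20% fraction and the seed are assumptions, and the notebook does not state which splitting strategy is used later.

```python
import pandas as pd

# Rebuild the target column from the counts shown above (35000 zeros, 10000 ones).
target = pd.Series([0] * 35000 + [1] * 10000, name="loan_status")

class_counts = target.value_counts()
class_shares = target.value_counts(normalize=True)
imbalance_ratio = class_counts.max() / class_counts.min()

print(class_shares.round(3))
print("Majority-to-minority ratio:", imbalance_ratio)

# Stratified 20% hold-out: sample the same fraction within each class.
# The fraction and random_state are illustrative choices.
holdout = target.groupby(target).sample(frac=0.2, random_state=42)
print(holdout.value_counts())
```

Sampling within each class keeps the hold-out at the same roughly 78/22 split as the full data, which a plain random sample only approximates.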