# 🎯 Credit Risk Prediction – Machine Learning Project

## 📌 Problem Statement

Credit risk prediction is a critical task in the banking and financial services industry. Financial institutions provide loans to customers with the expectation that they will repay the borrowed amount on time. However, some borrowers fail to meet their repayment obligations, leading to loan defaults and financial losses.
This project aims to build a machine learning model that can predict whether a borrower is likely to default on a loan based on their financial and demographic information. By identifying high-risk borrowers in advance, lenders can make informed loan approval decisions, reduce potential losses, and improve overall risk management.

---

### 🧠 Problem Type: Classification
### 🎯 Target Variable: Loan Status (Default / Non-Default)
- 1 → Default (High Credit Risk)
- 0 → Non-Default (Low Credit Risk)
### ⚙️ Approach: Supervised Machine Learning

---

## 1. Data Loading

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("data/application_train.csv")

In [4]:
df.head()

|   | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


### 📐 Dataset Shape

In [5]:
df.shape

(307511, 122)

### 🏷️ Target Variable Overview

In [8]:
df['TARGET'].value_counts()

TARGET
0    282686
1     24825
Name: count, dtype: int64

In [9]:
df['TARGET'].value_counts(normalize=True) * 100

TARGET
0    91.927118
1     8.072882
Name: proportion, dtype: float64
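
The split above is heavily imbalanced: roughly 92% of applicants did not default. As a rough illustration of why plain accuracy can mislead on data like this, the sketch below uses the counts printed above to compute the accuracy of a trivial model that always predicts class 0, along with the negative-to-positive ratio. The variable names are illustrative and not part of the original notebook.

```python
# Class counts copied from the value_counts() output above.
n_non_default = 282_686
n_default = 24_825
n_total = n_non_default + n_default

# Accuracy of a trivial model that always predicts the majority class (0).
baseline_accuracy = n_non_default / n_total

# Number of non-default rows per default row. Some estimators accept a ratio
# like this as a positive-class weight; using it that way is an assumption
# here, not a choice made by this project.
imbalance_ratio = n_non_default / n_default

print(f"Baseline accuracy: {baseline_accuracy:.4f}")
print(f"Negatives per positive: {imbalance_ratio:.2f}")
```

Any model evaluated later should beat this baseline on a metric that reflects the minority class, such as recall, precision, or ROC AUC, rather than accuracy alone.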

### 🧾 Feature Types Identification

In [10]:
df.dtypes.value_counts()

float64    65
int64      41
object     16
Name: count, dtype: int64

### 📊 Numerical vs Categorical Features

In [11]:
numerical_features = df.select_dtypes(include=['int64', 'float64']).columns
categorical_features = df.select_dtypes(include=['object']).columns

len(numerical_features), len(categorical_features)

(106, 16)
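
Note that the 106 numerical columns include `SK_ID_CURR` (a row identifier) and `TARGET` (the label), since both are stored as integers. A minimal sketch of excluding them from the model inputs is shown below on a toy frame that reuses a few column names from the dataset. The toy values and the `non_feature_cols` list are assumptions for illustration only.

```python
import pandas as pd

# Toy frame reusing a few column names from the dataset; values are illustrative.
toy = pd.DataFrame({
    "SK_ID_CURR": [100002, 100003, 100004],
    "TARGET": [1, 0, 0],
    "AMT_INCOME_TOTAL": [202500.0, 270000.0, 67500.0],
    # Force object dtype so the selection below behaves the same across pandas versions.
    "CODE_GENDER": pd.Series(["M", "F", "M"], dtype="object"),
})

# Hypothetical list of numeric columns that should not be used as model inputs.
non_feature_cols = ["SK_ID_CURR", "TARGET"]

numerical = toy.select_dtypes(include=["int64", "float64"]).columns
numerical_inputs = numerical.drop(non_feature_cols)  # Index.drop removes the given labels
categorical_inputs = toy.select_dtypes(include=["object"]).columns

print(list(numerical_inputs), list(categorical_inputs))
```

Applied to the full dataset, the same pattern would leave 104 numerical input columns, assuming no other identifier columns are present.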

### ❓ Missing Values Overview

In [14]:
missing_values = df.isnull().mean().sort_values(ascending=False)
missing_values.head(10)

COMMONAREA_AVG              0.698723
COMMONAREA_MODE             0.698723
COMMONAREA_MEDI             0.698723
NONLIVINGAPARTMENTS_MEDI    0.694330
NONLIVINGAPARTMENTS_MODE    0.694330
NONLIVINGAPARTMENTS_AVG     0.694330
FONDKAPREMONT_MODE          0.683862
LIVINGAPARTMENTS_AVG        0.683550
LIVINGAPARTMENTS_MEDI       0.683550
LIVINGAPARTMENTS_MODE       0.683550
dtype: float64
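
The values above are fractions of missing rows, so the top columns are close to 70% empty. One common option, sketched below on toy data, is to flag columns whose missing share exceeds a cutoff and drop them before modelling. The 0.5 threshold and the toy values are assumptions, and the notebook may handle these columns differently.

```python
import numpy as np
import pandas as pd

# Toy frame reusing column names from the dataset; values are illustrative.
toy = pd.DataFrame({
    "COMMONAREA_AVG": [np.nan, np.nan, np.nan, 0.02],
    "AMT_ANNUITY": [24700.5, 35698.5, np.nan, 29686.5],
    "AMT_CREDIT": [406597.5, 1293502.5, 135000.0, 312682.5],
})

# Fraction of missing values per column, highest first (same call as above).
missing_share = toy.isnull().mean().sort_values(ascending=False)

threshold = 0.5  # hypothetical cutoff; tune for the problem at hand
high_missing = missing_share[missing_share > threshold].index.tolist()

# Drop the flagged columns; the remaining ones can be imputed later.
reduced = toy.drop(columns=high_missing)

print(high_missing)
print(reduced.columns.tolist())
```

Dropping is not the only choice: a high missing rate can itself carry signal, in which case a missing-indicator feature may be worth keeping instead.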

## 2. Data Cleaning & Preprocessing

### Separate Target & Features