## 📌 Project Overview

Credit card approval is a critical decision-making process for financial institutions. This project uses **machine learning techniques** to build a predictive model that determines whether a credit card application should be **approved or rejected** based on applicant features such as age, income, employment, and credit history.

The goal of this project is to demonstrate a **complete end-to-end machine learning pipeline**, from raw data to model evaluation.

---

## 🔍 Key Components of the Project

- **Data Preprocessing**
  - Handling missing values
  - Encoding categorical variables
  - Feature scaling and cleaning

- **Exploratory Data Analysis (EDA)**
  - Understanding data distributions
  - Identifying patterns and correlations
  - Visualizing key insights

- **Feature Engineering**
  - Creating and selecting meaningful features
  - Reducing noise and improving model performance

- **Model Training & Tuning**
  - Training multiple machine learning models
  - Hyperparameter optimization

- **Model Evaluation**
  - Performance metrics (Accuracy, Precision, Recall, F1-score, ROC-AUC)
  - Comparison of model results

---

## 🎯 Outcome

The final model is intended to give financial institutions a data-driven basis for credit card approval decisions, helping them reduce risk while making decisions more efficiently.


# Import necessary libraries

In [1]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 
print("All Libraries Imported Successfully")

All Libraries Imported Successfully


## 📂 Loading Dataset

The dataset used in this project is taken from **Kaggle**. It contains two CSV files related to credit card applications and credit history.

### `application_record.csv`
This file includes basic information about credit card applicants such as:
- Age and gender
- Income details
- Employment and education
- Family and housing information

It helps understand **who the applicant is**.

### `credit_record.csv`
This file contains the credit history of applicants, including:
- Monthly repayment status
- Past credit behavior

It helps understand **how the applicant has handled credit in the past**.

### 🔗 Note
Both datasets are connected using a common **ID** column and are merged for analysis and model building.


In [2]:
application_record = pd.read_csv('application_record.csv')
credit_score = pd.read_csv('credit_record.csv')
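
Because the two files share the **ID** column, they can be joined with `DataFrame.merge`. The sketch below uses tiny made-up frames (`applications` and `credit` are hypothetical stand-ins, not the real CSVs) to show an inner join and a quick check for IDs that appear in only one table. Whether an inner join is the right choice for this project is an assumption.

```python
import pandas as pd

# Tiny stand-ins for the two CSV files (values are made up for illustration)
applications = pd.DataFrame({
    "ID": [1, 2, 3],
    "AMT_INCOME_TOTAL": [202500.0, 162000.0, 495000.0],
})
credit = pd.DataFrame({
    "ID": [1, 1, 2, 4],
    "MONTHS_BALANCE": [0, -1, 0, 0],
    "STATUS": ["0", "C", "X", "1"],
})

# Inner join: keeps only applicants that appear in both tables.
# An applicant with several monthly records yields several merged rows.
merged = applications.merge(credit, on="ID", how="inner")
print(merged)

# IDs present in one table but missing from the other
only_in_applications = set(applications["ID"]) - set(credit["ID"])
only_in_credit = set(credit["ID"]) - set(applications["ID"])
print("Only in applications:", only_in_applications)
print("Only in credit:", only_in_credit)
```

Since the credit file holds one row per applicant per month, it is often easier to aggregate it to one row per ID before merging.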

### EDA on application_record.csv

#### Check data shape

`shape` is a property that returns the number of rows and columns as a `(rows, columns)` tuple.

In [3]:
application_record.shape

(438557, 18)

In [4]:
application_record.sample(5)

|  | ID | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | DAYS_BIRTH | DAYS_EMPLOYED | FLAG_MOBIL | FLAG_WORK_PHONE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 232086 | 6091355 | F | Y | Y | 0 | 495000.0 | Commercial associate | Higher education | Civil marriage | House / apartment | -18236 | -347 | 1 | 0 | 0 | 0 | Core staff | 2 |
| 280839 | 6121343 | F | Y | N | 0 | 202500.0 | Working | Higher education | Married | House / apartment | -11658 | -1445 | 1 | 0 | 0 | 0 | NaN | 2 |
| 313145 | 6241471 | M | Y | Y | 2 | 202500.0 | Working | Secondary / secondary special | Married | House / apartment | -14518 | -4436 | 1 | 0 | 0 | 1 | Laborers | 4 |
| 341573 | 6409010 | M | N | N | 0 | 202500.0 | Commercial associate | Secondary / secondary special | Married | With parents | -17045 | -2040 | 1 | 0 | 0 | 0 | Waiters/barmen staff | 2 |
| 235298 | 5999506 | F | N | Y | 1 | 162000.0 | Working | Higher education | Single / not married | House / apartment | -13863 | -2707 | 1 | 0 | 1 | 0 | Core staff | 2 |


In [5]:
# Count null values per column

application_record.isnull().sum()

ID                          0
CODE_GENDER                 0
FLAG_OWN_CAR                0
FLAG_OWN_REALTY             0
CNT_CHILDREN                0
AMT_INCOME_TOTAL            0
NAME_INCOME_TYPE            0
NAME_EDUCATION_TYPE         0
NAME_FAMILY_STATUS          0
NAME_HOUSING_TYPE           0
DAYS_BIRTH                  0
DAYS_EMPLOYED               0
FLAG_MOBIL                  0
FLAG_WORK_PHONE             0
FLAG_PHONE                  0
FLAG_EMAIL                  0
OCCUPATION_TYPE        134203
CNT_FAM_MEMBERS             0
dtype: int64
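
Only `OCCUPATION_TYPE` has missing values: 134203 of 438557 rows, roughly 30.6%. One possible way to handle this, sketched below on a made-up frame, is to report the missing share per column and then keep the rows by filling the gap with an explicit `"Unknown"` category. The `"Unknown"` label is an assumption for illustration, not necessarily what the project uses.

```python
import numpy as np
import pandas as pd

# Made-up sample that reuses the dataset's column name
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "OCCUPATION_TYPE": ["Core staff", np.nan, "Laborers", np.nan],
})

# Share of missing values per column, as a percentage
missing_pct = df.isnull().mean() * 100
print(missing_pct)

# Keep all rows and mark missing occupations with an explicit category
df["OCCUPATION_TYPE"] = df["OCCUPATION_TYPE"].fillna("Unknown")
occupation_counts = df["OCCUPATION_TYPE"].value_counts()
print(occupation_counts)
```

Dropping nearly a third of the rows would discard a lot of data, which is why a placeholder category is often preferred over `dropna()` for a column like this.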

In [None]:
# Summary statistics

application_record.describe()

In [6]:
application_record.duplicated()

0         False
1         False
2         False
3         False
4         False
          ...  
438552    False
438553    False
438554    False
438555    False
438556    False
Length: 438557, dtype: bool
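
`duplicated()` returns one boolean per row, so the raw output is hard to read at this size. Summing the mask gives a count instead. The sketch below, on a made-up frame, also checks for repeated IDs and for rows that are identical once `ID` is ignored; whether such rows should be dropped is a judgment call this sketch does not make.

```python
import pandas as pd

# Made-up sample: row 4 repeats row 3, and rows 1 and 2 differ only by ID
df = pd.DataFrame({
    "ID": [1, 2, 3, 3],
    "CODE_GENDER": ["F", "F", "M", "M"],
    "AMT_INCOME_TOTAL": [202500.0, 202500.0, 162000.0, 162000.0],
})

# Rows that repeat an earlier row in every column
n_full_duplicates = int(df.duplicated().sum())

# IDs that occur more than once
n_duplicate_ids = int(df["ID"].duplicated().sum())

# Rows that repeat an earlier row when the ID column is ignored
feature_cols = [c for c in df.columns if c != "ID"]
n_duplicate_profiles = int(df.duplicated(subset=feature_cols).sum())

print(n_full_duplicates, n_duplicate_ids, n_duplicate_profiles)
```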

In [7]:
# Data types and non-null counts of each column

application_record.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 438557 entries, 0 to 438556
Data columns (total 18 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   ID                   438557 non-null  int64  
 1   CODE_GENDER          438557 non-null  object 
 2   FLAG_OWN_CAR         438557 non-null  object 
 3   FLAG_OWN_REALTY      438557 non-null  object 
 4   CNT_CHILDREN         438557 non-null  int64  
 5   AMT_INCOME_TOTAL     438557 non-null  float64
 6   NAME_INCOME_TYPE     438557 non-null  object 
 7   NAME_EDUCATION_TYPE  438557 non-null  object 
 8   NAME_FAMILY_STATUS   438557 non-null  object 
 9   NAME_HOUSING_TYPE    438557 non-null  object 
 10  DAYS_BIRTH           438557 non-null  int64  
 11  DAYS_EMPLOYED        438557 non-null  int64  
 12  FLAG_MOBIL           438557 non-null  int64  
 13  FLAG_WORK_PHONE      438557 non-null  int64  
 14  FLAG_PHONE           438557 non-null  int64  
 15  FLAG_EMAIL       

In [8]:
# Explore numerical columns
numerical_cols = [
    "AMT_INCOME_TOTAL",
    "CNT_CHILDREN",
    "CNT_FAM_MEMBERS",
    "DAYS_BIRTH",
    "DAYS_EMPLOYED"
]

print("\nSummary Statistics for Numerical Columns:")
print(application_record[numerical_cols].describe())


Summary Statistics for Numerical Columns:
       AMT_INCOME_TOTAL   CNT_CHILDREN  CNT_FAM_MEMBERS     DAYS_BIRTH  \
count      4.385570e+05  438557.000000    438557.000000  438557.000000   
mean       1.875243e+05       0.427390         2.194465  -15997.904649   
std        1.100869e+05       0.724882         0.897207    4185.030007   
min        2.610000e+04       0.000000         1.000000  -25201.000000   
25%        1.215000e+05       0.000000         2.000000  -19483.000000   
50%        1.607805e+05       0.000000         2.000000  -15630.000000   
75%        2.250000e+05       1.000000         3.000000  -12514.000000   
max        6.750000e+06      19.000000        20.000000   -7489.000000   

       DAYS_EMPLOYED  
count  438557.000000  
mean    60563.675328  
std    138767.799647  
min    -17531.000000  
25%     -3103.000000  
50%     -1467.000000  
75%      -371.000000  
max    365243.000000  
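
Two things stand out in this summary: the day columns are negative because they count backwards from day 0, and `DAYS_EMPLOYED` has a maximum of 365243 (about 1000 years), which is a placeholder rather than a real duration. In this dataset positive values mark applicants who are not currently employed. The sketch below shows one way to turn both columns into readable features; the new column names (`AGE_YEARS`, `IS_EMPLOYED`, `YEARS_EMPLOYED`) and the 365.25-day year are assumptions for illustration.

```python
import pandas as pd

# Made-up sample that reuses the dataset's column names
df = pd.DataFrame({
    "DAYS_BIRTH": [-18236, -11658, -21900],
    "DAYS_EMPLOYED": [-347, -1445, 365243],
})

# Negative day counts become positive years
df["AGE_YEARS"] = (-df["DAYS_BIRTH"] / 365.25).round(1)

# Negative DAYS_EMPLOYED means currently employed; positive is the placeholder
df["IS_EMPLOYED"] = (df["DAYS_EMPLOYED"] < 0).astype(int)

# Replace the placeholder with 0 days before converting to years
employed_days = df["DAYS_EMPLOYED"].where(df["DAYS_EMPLOYED"] < 0, 0)
df["YEARS_EMPLOYED"] = (-employed_days / 365.25).round(1)

print(df)
```

Leaving the placeholder in place would skew the mean and standard deviation, as the positive mean of `DAYS_EMPLOYED` in the summary already suggests.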


In [9]:

# Explore categorical columns

categorical_cols = [
    "CODE_GENDER",
    "FLAG_OWN_CAR",
    "FLAG_OWN_REALTY",
    "NAME_INCOME_TYPE",
    "NAME_EDUCATION_TYPE",
    "NAME_FAMILY_STATUS",
    "NAME_HOUSING_TYPE",
    "OCCUPATION_TYPE"
]

print("\nUnique Values in Categorical Columns:")
for col in categorical_cols:
    print(f"\n{col}:")
    print(application_record[col].value_counts())


Unique Values in Categorical Columns:

CODE_GENDER:
CODE_GENDER
F    294440
M    144117
Name: count, dtype: int64

FLAG_OWN_CAR:
FLAG_OWN_CAR
N    275459
Y    163098
Name: count, dtype: int64

FLAG_OWN_REALTY:
FLAG_OWN_REALTY
Y    304074
N    134483
Name: count, dtype: int64

NAME_INCOME_TYPE:
NAME_INCOME_TYPE
Working                 226104
Commercial associate    100757
Pensioner                75493
State servant            36186
Student                     17
Name: count, dtype: int64

NAME_EDUCATION_TYPE:
NAME_EDUCATION_TYPE
Secondary / secondary special    301821
Higher education                 117522
Incomplete higher                 14851
Lower secondary                    4051
Academic degree                     312
Name: count, dtype: int64

NAME_FAMILY_STATUS:
NAME_FAMILY_STATUS
Married                 299828
Single / not married     55271
Civil marriage           36532
Separated                27251
Widow                    19675
Name: count, dtype: int64

NAME_HOUSING_TYP

### Column Meanings

- **ID** – Unique identifier for each applicant.
- **CODE_GENDER** – Gender of the applicant (M = Male, F = Female).
- **FLAG_OWN_CAR** – Whether the applicant owns a car (Y/N).
- **FLAG_OWN_REALTY** – Whether the applicant owns property (Y/N).
- **CNT_CHILDREN** – Number of children in the applicant's family.
- **AMT_INCOME_TOTAL** – Total annual income of the applicant.
- **NAME_INCOME_TYPE** – Type of income (e.g., Working, Commercial associate).
- **NAME_EDUCATION_TYPE** – Highest education level of the applicant.
- **NAME_FAMILY_STATUS** – Marital status of the applicant.
- **NAME_HOUSING_TYPE** – Type of housing (apartment, house, etc.).
- **DAYS_BIRTH** – Date of birth, counted backwards in days from the day the data was recorded (day 0). Values are negative; for example, -18236 is roughly 50 years old.
- **DAYS_EMPLOYED** – Start of current employment, counted backwards in days (negative value). A positive value (365243 in the summary statistics) marks applicants who are not currently employed.
- **FLAG_MOBIL** – Whether the applicant has a mobile phone (1 = Yes, 0 = No).
- **FLAG_WORK_PHONE** – Whether the applicant has a work phone (1 = Yes, 0 = No).
- **FLAG_PHONE** – Whether the applicant has a home phone (1 = Yes, 0 = No).
- **FLAG_EMAIL** – Whether the applicant has an email (1 = Yes, 0 = No).
- **OCCUPATION_TYPE** – Type of job/occupation (contains missing values).
- **CNT_FAM_MEMBERS** – Number of family members in the household.
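
Several of these columns are text and need to be encoded before modeling. The sketch below, on a made-up frame, maps the two-valued flags to 0/1 and one-hot encodes a multi-category column with `pd.get_dummies`. Which columns get which encoding, and the choice of M = 1, is an assumption here rather than the notebook's confirmed approach.

```python
import pandas as pd

# Made-up sample that reuses the dataset's column names and categories
df = pd.DataFrame({
    "CODE_GENDER": ["F", "M", "F"],
    "FLAG_OWN_CAR": ["Y", "N", "Y"],
    "NAME_HOUSING_TYPE": ["House / apartment", "With parents", "House / apartment"],
})

# Two-valued text columns map directly to 1/0
df["FLAG_OWN_CAR"] = df["FLAG_OWN_CAR"].map({"Y": 1, "N": 0})
df["CODE_GENDER"] = df["CODE_GENDER"].map({"M": 1, "F": 0})

# Multi-category columns get one indicator column per category
encoded = pd.get_dummies(df, columns=["NAME_HOUSING_TYPE"], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

One-hot encoding avoids implying an order between categories such as housing types, at the cost of adding one column per category.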


In [11]:
print(credit_score.head())
print(credit_score.shape)

        ID  MONTHS_BALANCE STATUS
0  5001711               0      X
1  5001711              -1      0
2  5001711              -2      0
3  5001711              -3      0
4  5001712               0      C
(1048575, 3)


### Column Meanings (`credit_record.csv`)

- **ID** – Unique identifier for each applicant (matches `application_record.csv`).
- **MONTHS_BALANCE** – Month of the record relative to the month the data was extracted (0 = current month, -1 = previous month, and so on).
- **STATUS** – Credit repayment status for that month:
  - `0` – 1 to 29 days past due
  - `1` – 30 to 59 days past due
  - `2` – 60 to 89 days overdue
  - `3` – 90 to 119 days overdue
  - `4` – 120 to 149 days overdue
  - `5` – Overdue or bad debt, written off, for more than 150 days
  - `C` – Paid off that month
  - `X` – No loan for that month
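
The credit file has no approve/reject column, so a target label has to be derived from `STATUS`. The sketch below shows one common convention, stated here as an assumption: an applicant is flagged as high risk (`TARGET = 1`) if any month has status `2` or higher, meaning 60 or more days late. The cutoff, the `BAD_STATUSES` set, and the `MONTHS_ON_BOOK` and `TARGET` names are hypothetical, and the data is made up.

```python
import pandas as pd

# Made-up monthly records in the same layout as credit_record.csv
credit = pd.DataFrame({
    "ID": [101, 101, 101, 102, 102, 103],
    "MONTHS_BALANCE": [0, -1, -2, 0, -1, 0],
    "STATUS": ["X", "0", "0", "C", "2", "X"],
})

# Assumption: statuses "2" to "5" (60+ days late) count as serious delinquency
BAD_STATUSES = {"2", "3", "4", "5"}
credit["IS_BAD_MONTH"] = credit["STATUS"].isin(BAD_STATUSES)

# Collapse the monthly history to one row per applicant
labels = (
    credit.groupby("ID")
    .agg(
        MONTHS_ON_BOOK=("MONTHS_BALANCE", lambda m: -m.min()),  # length of history
        TARGET=("IS_BAD_MONTH", "max"),  # True if any month was bad
    )
    .astype({"TARGET": int})
    .reset_index()
)
print(labels)
```

The resulting one-row-per-ID table can then be joined to the application data on `ID`. A stricter or looser cutoff changes the class balance noticeably, so the threshold is worth treating as a modeling decision.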
