# Credit Card Approval Prediction using ML

### **Task**:
- Build a machine learning model that predicts whether an applicant is a 'good' or 'bad' client.
- The dataset does not define 'good' or 'bad', so the target label has to be constructed.
- Class imbalance is a major challenge in this task.
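
Because no target is provided, one possible workaround is to derive a label from the `STATUS` history in the credit records. The sketch below assumes the coding usually documented for this dataset (digits `0` to `5` for increasingly overdue payments, `C` for paid off, `X` for no loan that month) and an arbitrary rule that a client is 'bad' if any month is 60 or more days overdue. Both the coding and the threshold are assumptions, and `BAD_STATUSES` and `make_label` are hypothetical names.

```python
import pandas as pd

# Assumption: STATUS "2".."5" mean 60+ days overdue; "0", "1", "C", "X" count as acceptable.
BAD_STATUSES = {"2", "3", "4", "5"}

def make_label(credit: pd.DataFrame) -> pd.Series:
    """Return one 0/1 label per ID: 1 ('bad') if any month has a bad status."""
    is_bad_month = credit["STATUS"].astype(str).isin(BAD_STATUSES)
    return is_bad_month.groupby(credit["ID"]).max().astype(int).rename("is_bad")

# Toy records shaped like credit_record.csv (values invented for illustration).
toy_credit = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 3],
    "MONTHS_BALANCE": [0, -1, -2, 0, -1, 0],
    "STATUS": ["C", "0", "2", "X", "0", "1"],
})
labels = make_label(toy_credit)
bad_rate = labels.mean()  # share of 'bad' clients, a quick view of class imbalance
```

A stricter or looser threshold changes the share of 'bad' clients, so the rule is worth revisiting once the real class balance is known.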

## <b> <font color=green> STEP (1) Problem Definition, Scoping & Framing </font> </b>

### 1.1) Load Libraries
Import all necessary libraries for the project.

In [1]:
import numpy as np
import pandas as pd
import sklearn

### 1.2) Load The Dataset
Load the dataset and display the first few rows.

In [2]:
df_app = pd.read_csv("application_record.csv")
print("The size of the Data: ", len(df_app))
df_app.head()

The size of the Data:  438557


Unnamed: 0,ID,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,DAYS_BIRTH,DAYS_EMPLOYED,FLAG_MOBIL,FLAG_WORK_PHONE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS
0,5008804,M,Y,Y,0,427500.0,Working,Higher education,Civil marriage,Rented apartment,-12005,-4542,1,1,0,0,,2.0
1,5008805,M,Y,Y,0,427500.0,Working,Higher education,Civil marriage,Rented apartment,-12005,-4542,1,1,0,0,,2.0
2,5008806,M,Y,Y,0,112500.0,Working,Secondary / secondary special,Married,House / apartment,-21474,-1134,1,0,0,0,Security staff,2.0
3,5008808,F,N,Y,0,270000.0,Commercial associate,Secondary / secondary special,Single / not married,House / apartment,-19110,-3051,1,0,1,1,Sales staff,1.0
4,5008809,F,N,Y,0,270000.0,Commercial associate,Secondary / secondary special,Single / not married,House / apartment,-19110,-3051,1,0,1,1,Sales staff,1.0


In [3]:
df_credit = pd.read_csv("credit_record.csv")
print("The size of the Data: ", len(df_credit))
df_credit.head()

The size of the Data:  1048575


Unnamed: 0,ID,MONTHS_BALANCE,STATUS
0,5001711,0,X
1,5001711,-1,0
2,5001711,-2,0
3,5001711,-3,0
4,5001712,0,C


### Check for duplicate IDs in the applicant dataset

In [33]:
count_ids = df_app["ID"].value_counts()
count_ids

ID
7137299    2
7702238    2
7282535    2
7243768    2
7050948    2
          ..
5690727    1
6621262    1
6621261    1
6621260    1
6842885    1
Name: count, Length: 438510, dtype: int64

In [34]:
repeated_ids = len(df_app["ID"]) - len(df_app["ID"].unique())
repeated_ids

47

### Example: one ID shared by two different applicants

In [18]:
df_app[df_app["ID"] == 7137299]

Unnamed: 0,ID,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,DAYS_BIRTH,DAYS_EMPLOYED,FLAG_MOBIL,FLAG_WORK_PHONE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS
423160,7137299,M,Y,N,1,225000.0,Working,Secondary / secondary special,Married,House / apartment,-15243,-7260,1,1,1,0,High skill tech staff,3.0
426665,7137299,F,N,Y,0,292500.0,Working,Secondary / secondary special,Single / not married,Office apartment,-19679,-2074,1,0,0,0,Cleaning staff,1.0
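
The two rows for ID 7137299 differ in gender, income and occupation, so that ID does not identify a single applicant. One option, shown here as a sketch on invented toy data rather than as the notebook's chosen approach, is to set aside every row whose ID occurs more than once.

```python
import pandas as pd

# Toy applicant table (invented values); ID 7 appears twice with conflicting attributes.
toy_app = pd.DataFrame({
    "ID": [5, 6, 7, 7],
    "CODE_GENDER": ["M", "F", "M", "F"],
    "CNT_CHILDREN": [0, 2, 1, 0],
})

# keep=False marks every occurrence of a repeated ID, not just the later ones.
dup_mask = toy_app["ID"].duplicated(keep=False)
conflicting_rows = toy_app[dup_mask]  # rows to inspect by hand
deduped_app = toy_app[~dup_mask]      # rows whose ID is unambiguous
```

Dropping both copies is the conservative choice here, since nothing in the table says which of the two rows the credit history belongs to.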


### Check for IDs that appear in the applicant dataset but not in the credit dataset, and vice versa

In [28]:
missing_ids1 = df_app[~df_app["ID"].isin(df_credit["ID"])]
unique_missing_ids1 = missing_ids1["ID"].unique()
print("Number of IDs in applicant data but not in credit data:", len(unique_missing_ids1))

Number of IDs in applicant data but not in credit data: 402053


In [29]:
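# Approximate number of applicants that DO have a credit record.
# len(df_app) counts rows while unique_missing_ids1 counts unique IDs,
# so the result is inflated by the 47 repeated-ID rows found above.
applicants_with_credit_record = len(df_app) - len(unique_missing_ids1)
applicants_with_credit_record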

36504

In [30]:
missing_ids2 = df_credit[~df_credit["ID"].isin(df_app["ID"])]
unique_missing_ids2 = missing_ids2["ID"].unique()
print("Number of IDs in credit data but not in applicant data:", len(unique_missing_ids2))

Number of IDs in credit data but not in applicant data: 9528


In [31]:
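# Caution: this subtracts a count of unique IDs from a count of monthly rows,
# so the result is neither a number of rows nor a number of IDs.
credit_rows_minus_missing_ids = len(df_credit) - len(unique_missing_ids2)
credit_rows_minus_missing_ids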

1039047
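
The overlap checks above mix row counts with unique-ID counts. A possible alternative, sketched on invented toy frames, is to reduce both tables to their unique IDs and let `merge(..., indicator=True)` classify each ID in one pass.

```python
import pandas as pd

toy_app = pd.DataFrame({"ID": [1, 2, 2, 3]})        # ID 2 duplicated on purpose
toy_credit = pd.DataFrame({"ID": [2, 2, 3, 3, 4]})  # several monthly rows per ID

# Reduce each table to its unique IDs first so rows and IDs are not mixed.
app_ids = toy_app[["ID"]].drop_duplicates()
credit_ids = toy_credit[["ID"]].drop_duplicates()

# indicator=True adds a "_merge" column: "left_only", "right_only" or "both".
overlap = app_ids.merge(credit_ids, on="ID", how="outer", indicator=True)
counts = overlap["_merge"].value_counts()
```

Here `left_only` counts IDs found only in the applicant table, `right_only` those found only in the credit table, and `both` the IDs an inner join would keep.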

In [5]:
combined_df = pd.merge(df_app, df_credit, on="ID")
combined_df

Unnamed: 0,ID,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,DAYS_BIRTH,DAYS_EMPLOYED,FLAG_MOBIL,FLAG_WORK_PHONE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS,MONTHS_BALANCE,STATUS
0,5008804,M,Y,Y,0,427500.0,Working,Higher education,Civil marriage,Rented apartment,-12005,-4542,1,1,0,0,,2.0,0,C
1,5008804,M,Y,Y,0,427500.0,Working,Higher education,Civil marriage,Rented apartment,-12005,-4542,1,1,0,0,,2.0,-1,C
2,5008804,M,Y,Y,0,427500.0,Working,Higher education,Civil marriage,Rented apartment,-12005,-4542,1,1,0,0,,2.0,-2,C
3,5008804,M,Y,Y,0,427500.0,Working,Higher education,Civil marriage,Rented apartment,-12005,-4542,1,1,0,0,,2.0,-3,C
4,5008804,M,Y,Y,0,427500.0,Working,Higher education,Civil marriage,Rented apartment,-12005,-4542,1,1,0,0,,2.0,-4,C
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
777710,5150337,M,N,Y,0,112500.0,Working,Secondary / secondary special,Single / not married,Rented apartment,-9188,-1193,1,0,0,0,Laborers,1.0,-9,0
777711,5150337,M,N,Y,0,112500.0,Working,Secondary / secondary special,Single / not married,Rented apartment,-9188,-1193,1,0,0,0,Laborers,1.0,-10,2
777712,5150337,M,N,Y,0,112500.0,Working,Secondary / secondary special,Single / not married,Rented apartment,-9188,-1193,1,0,0,0,Laborers,1.0,-11,1
777713,5150337,M,N,Y,0,112500.0,Working,Secondary / secondary special,Single / not married,Rented apartment,-9188,-1193,1,0,0,0,Laborers,1.0,-12,0


In [6]:
id_count = combined_df["ID"].value_counts()
id_count

ID
5090630    61
5148524    61
5066707    61
5061848    61
5118380    61
           ..
5024557     1
5062311     1
5024365     1
5024364     1
5041568     1
Name: count, Length: 36457, dtype: int64
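
The merged table holds one row per applicant per month (up to 61 rows per ID), whereas a classifier usually expects one row per applicant. The following is a minimal sketch of collapsing the monthly history into per-ID summary features; the feature names `n_months` and `first_month` are hypothetical and the data is invented.

```python
import pandas as pd

toy_combined = pd.DataFrame({
    "ID": [10, 10, 10, 11],
    "MONTHS_BALANCE": [0, -1, -2, 0],
    "STATUS": ["C", "0", "0", "X"],
})

# Named aggregation: one output column per (source column, function) pair.
history = toy_combined.groupby("ID").agg(
    n_months=("MONTHS_BALANCE", "count"),   # number of monthly records
    first_month=("MONTHS_BALANCE", "min"),  # earliest month relative to the snapshot
)
```

The resulting frame is indexed by `ID`, so it can be joined back onto the applicant table without repeating the static columns for every month.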

## <b> <font color=green> STEP (2) Data Exploration & Understanding </font> </b>

### 2.1) Descriptive Statistics
Generate summary statistics for the dataset.

In [7]:
combined_df.describe()

Unnamed: 0,ID,CNT_CHILDREN,AMT_INCOME_TOTAL,DAYS_BIRTH,DAYS_EMPLOYED,FLAG_MOBIL,FLAG_WORK_PHONE,FLAG_PHONE,FLAG_EMAIL,CNT_FAM_MEMBERS,MONTHS_BALANCE
count,777715.0,777715.0,777715.0,777715.0,777715.0,777715.0,777715.0,777715.0,777715.0,777715.0,777715.0
mean,5078743.0,0.428082,188534.8,-16124.937046,57775.825016,1.0,0.231818,0.300965,0.091675,2.208837,-19.373564
std,41804.42,0.745755,101622.5,4104.304018,136471.735391,0.0,0.421993,0.458678,0.288567,0.90738,14.082208
min,5008804.0,0.0,27000.0,-25152.0,-15713.0,1.0,0.0,0.0,0.0,1.0,-60.0
25%,5044568.0,0.0,121500.0,-19453.0,-3292.0,1.0,0.0,0.0,0.0,2.0,-29.0
50%,5069530.0,0.0,162000.0,-15760.0,-1682.0,1.0,0.0,0.0,0.0,2.0,-17.0
75%,5115551.0,1.0,225000.0,-12716.0,-431.0,1.0,0.0,1.0,0.0,3.0,-8.0
max,5150487.0,19.0,1575000.0,-7489.0,365243.0,1.0,1.0,1.0,1.0,20.0,0.0
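
Two things stand out in this summary: `DAYS_BIRTH` is always negative, and `DAYS_EMPLOYED` has a maximum of 365243 days (roughly 1,000 years), which looks like a placeholder rather than a real duration. The sketch below converts both columns to years and assumes that a positive `DAYS_EMPLOYED` marks applicants who are not currently employed. That interpretation is an assumption, and the new column names are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "DAYS_BIRTH": [-12005, -21474, -19110],
    "DAYS_EMPLOYED": [-4542, 365243, -3051],
})

DAYS_PER_YEAR = 365.25

# Days are counted backwards, so flip the sign before converting to years.
toy["AGE_YEARS"] = -toy["DAYS_BIRTH"] / DAYS_PER_YEAR

# Assumption: a positive DAYS_EMPLOYED is a sentinel, so flag it and blank it out.
toy["NOT_EMPLOYED"] = (toy["DAYS_EMPLOYED"] > 0).astype(int)
valid_days = toy["DAYS_EMPLOYED"].where(toy["DAYS_EMPLOYED"] <= 0)  # NaN for the sentinel
toy["YEARS_EMPLOYED"] = -valid_days / DAYS_PER_YEAR
```

Keeping a separate flag preserves the information carried by the sentinel while stopping it from distorting the mean and standard deviation.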


In [11]:
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 777715 entries, 0 to 777714
Data columns (total 20 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   ID                   777715 non-null  int64  
 1   CODE_GENDER          777715 non-null  object 
 2   FLAG_OWN_CAR         777715 non-null  object 
 3   FLAG_OWN_REALTY      777715 non-null  object 
 4   CNT_CHILDREN         777715 non-null  int64  
 5   AMT_INCOME_TOTAL     777715 non-null  float64
 6   NAME_INCOME_TYPE     777715 non-null  object 
 7   NAME_EDUCATION_TYPE  777715 non-null  object 
 8   NAME_FAMILY_STATUS   777715 non-null  object 
 9   NAME_HOUSING_TYPE    777715 non-null  object 
 10  DAYS_BIRTH           777715 non-null  int64  
 11  DAYS_EMPLOYED        777715 non-null  int64  
 12  FLAG_MOBIL           777715 non-null  int64  
 13  FLAG_WORK_PHONE      777715 non-null  int64  
 14  FLAG_PHONE           777715 non-null  int64  
 15  FLAG_EMAIL       

In [10]:
mobile_counts = combined_df["FLAG_MOBIL"].value_counts()
mobile_counts

FLAG_MOBIL
1    777715
Name: count, dtype: int64
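
Since `FLAG_MOBIL` equals 1 on every row, it cannot help separate clients. A small sketch of detecting and dropping such constant columns, shown on invented toy data:

```python
import pandas as pd

toy = pd.DataFrame({
    "FLAG_MOBIL": [1, 1, 1, 1],
    "FLAG_PHONE": [0, 1, 0, 0],
    "FLAG_EMAIL": [0, 0, 1, 0],
})

# nunique() counts distinct values per column; 1 means the column is constant.
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]
reduced = toy.drop(columns=constant_cols)
```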

### 2.2) Data Visualizations
Visualize the data to identify patterns and relationships.

## <b> <font color=green> STEP (3) Data Preparation & Feature Engineering </font> </b>

### 3.1) Data Cleaning
Handle missing values and outliers.
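
The first rows of the applicant table show an empty `OCCUPATION_TYPE`, and the summary statistics show a long right tail in `AMT_INCOME_TOTAL`. As one possible treatment (an assumption, not a decision taken in this notebook), the sketch below fills missing occupations with an explicit `"Unknown"` category and clips income at the usual 1.5 × IQR fences. The data and the `AMT_INCOME_CLIPPED` column name are invented for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "OCCUPATION_TYPE": ["Sales staff", np.nan, "Laborers", np.nan, "Sales staff"],
    "AMT_INCOME_TOTAL": [112500.0, 121500.0, 162000.0, 225000.0, 1575000.0],
})

# Treat "missing" as its own category instead of dropping the rows.
toy["OCCUPATION_TYPE"] = toy["OCCUPATION_TYPE"].fillna("Unknown")

# Clip values outside the 1.5 * IQR fences.
q1, q3 = toy["AMT_INCOME_TOTAL"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
toy["AMT_INCOME_CLIPPED"] = toy["AMT_INCOME_TOTAL"].clip(lower=lower, upper=upper)
```

Clipping keeps every applicant in the data, which matters when the minority class is already small.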

### 3.2) Feature Selection
Select the most relevant features for the model.

### 3.3) Data Transforms
Apply transformations to prepare the data for modeling.

## <b> <font color=green> STEP (4)  ML Model Selection & Evaluation </font> </b>

### 4.1) Split-out Validation Dataset
Split the dataset into training and validation sets.
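
With imbalanced classes, a plain random split can leave the validation set with very few 'bad' clients. A minimal sketch of a stratified split using scikit-learn's `train_test_split`; the features, the 10% 'bad' rate and the `is_bad` label name are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = pd.DataFrame({"income": rng.normal(size=200), "age": rng.normal(size=200)})
y = pd.Series([1] * 20 + [0] * 180, name="is_bad")  # 10% 'bad', invented

# stratify=y keeps the class proportions the same in both parts.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```

This assumes the data has already been reduced to one row per applicant; splitting the monthly rows directly would put the same ID in both sets.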

### 4.2) Test Options and Evaluation Metric
Define the evaluation metric for the model.
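
Accuracy alone can be misleading on imbalanced data. The sketch below, on invented labels, scores a trivial model that labels every client 'good' and compares accuracy with recall and F1 on the 'bad' class.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

y_true = [0] * 95 + [1] * 5  # 5% 'bad' clients, invented
y_all_good = [0] * 100       # a model that never predicts 'bad'

accuracy = accuracy_score(y_true, y_all_good)
# zero_division=0 avoids a warning when no positive predictions exist.
recall_bad = recall_score(y_true, y_all_good, pos_label=1, zero_division=0)
f1_bad = f1_score(y_true, y_all_good, pos_label=1, zero_division=0)
```

The trivial model reaches high accuracy while catching none of the 'bad' clients, which is why a minority-class metric is a safer basis for comparing models here.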

### 4.3) Spot Check Algorithms
Test multiple algorithms to identify the best-performing one.
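
A spot check can be as simple as cross-validating a few candidates under the same folds and metric. The sketch below uses synthetic data from `make_classification` and two arbitrarily chosen models; the candidate list, the F1 metric and the hyperparameters are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (about 10% positives) standing in for the real features.
X, y = make_classification(n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=0)

candidates = {
    "logreg": LogisticRegression(class_weight="balanced", max_iter=1000),
    "tree": DecisionTreeClassifier(class_weight="balanced", max_depth=4, random_state=0),
}

# Same stratified folds for every model so the scores are comparable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="f1").mean()
    for name, model in candidates.items()
}
```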

## <b> <font color=green> STEP (5) Performance Tuning & Optimization </font> </b>

### 5.1) Algorithm Tuning
Optimize hyperparameters for better performance.

### 5.2) Ensembles
Combine multiple models to improve predictive performance.

## <b> <font color=green> STEP (6) Results Interpretation & Deployment </font> </b>

### 6.1) Predictions on Validation Dataset
Make predictions using the validation dataset.

### 6.2) Create Standalone Model on Entire Training Dataset
Train the final model on the entire training dataset.

### 6.3) Save Model for Later Use
Save the trained model to a file.
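
One way to persist a fitted scikit-learn model is Python's built-in `pickle` module. The sketch below trains a throwaway model on invented data and round-trips it through a file; the file name `credit_model.pkl` is hypothetical.

```python
import os
import pickle
import tempfile

from sklearn.linear_model import LogisticRegression

# Throwaway model on invented data, standing in for the final trained model.
model = LogisticRegression().fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])

model_path = os.path.join(tempfile.gettempdir(), "credit_model.pkl")  # hypothetical name
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Later, or in another process: load the model back and use it.
with open(model_path, "rb") as f:
    restored = pickle.load(f)
```

A pickled model should be loaded with the same scikit-learn version that produced it, and pickle files from untrusted sources should never be loaded.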