# Enhancing Creditworthiness Assessment beyond Traditional Credit History.

# 1.1 Business Problem
Financial institutions lose customers, and therefore revenue, because they rely heavily on credit history as the main, and sometimes sole, factor in determining creditworthiness. This costs them many creditworthy customers. In this project I use other factors to demonstrate the creditworthiness of such customers, rather than relying only on traditional credit history, which treats them unfairly.



# 1.2 Business understanding
Traditional financial institutions have historically relied on formal credit history as the primary basis for loan approval, a practice rooted in the development of centralized credit bureaus and standardized credit scoring systems in mature financial markets. These systems were designed to provide a scalable, objective proxy for borrower risk by assuming that past borrowing behavior reliably predicts future repayment. While effective for customers with established credit records, this approach has proven increasingly limiting as banking expands to emerging markets and underserved populations, where large segments of otherwise financially responsible individuals operate outside formal credit systems and therefore lack sufficient credit footprints.

As a result, banks continue to reject approximately 25–35% of loan applications, with research and industry evidence indicating that 30–40% of these rejected applicants are in fact creditworthy when assessed using broader financial and behavioral indicators such as income stability, transaction consistency, and expense management. This over-reliance on credit history leads to an estimated 8–12% loss in potential loan volume, translating to KES 800 million–1.2 billion in unrealized lending opportunity for a mid-sized bank issuing KES 10 billion annually. Addressing this gap through enhanced data-driven credit assessment presents a clear opportunity to improve financial inclusion while capturing profitable, low-risk customers.
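The unrealized-lending range follows directly from the percentages quoted above. A quick arithmetic check, using only the figures stated in this section:

```python
# Annual loan volume of the illustrative mid-sized bank, in KES (as stated above).
annual_volume_kes = 10_000_000_000

# Estimated share of potential loan volume lost to credit-history-only screening.
loss_share_low, loss_share_high = 0.08, 0.12

unrealized_low = annual_volume_kes * loss_share_low
unrealized_high = annual_volume_kes * loss_share_high

print(f"Unrealized lending: KES {unrealized_low:,.0f} to KES {unrealized_high:,.0f}")
```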

# 1.3 Data understanding
The dataset used in this project was obtained from Kaggle and is composed of eight relational tables (the application table is provided as separate train and test files) containing detailed historical and behavioral credit information for loan applicants. The data represents real-world financial records typically used by traditional lending institutions to assess creditworthiness. Because of its scale and granularity, the dataset captures both current application information and longitudinal credit behavior across multiple financial products. The dataset is available on Kaggle: https://www.kaggle.com/c/home-credit-default-risk/data

## Overview of Tables
1.	Application Train (application_train)
This is the primary dataset used for model training and contains demographic, socioeconomic, and financial information for each loan applicant. It includes attributes such as income, employment history, family status, housing conditions, and loan characteristics. Importantly, this table contains the target variable indicating whether a customer defaulted on a loan. Each row represents a unique applicant identified by SK_ID_CURR.
2.	Application Test (application_test)
This table has the same structure as the training dataset but does not contain the target variable. It is used for model evaluation and prediction. It allows the trained model to assess credit risk for new applicants using learned patterns.
3.	Bureau (bureau)
This table contains historical credit records of applicants obtained from external credit bureaus. Each applicant may have multiple entries representing different loans taken in the past. Features include credit type, loan status (active or closed), credit amount, outstanding debt, and overdue amounts. This table provides insight into long-term credit behavior beyond the current loan application.
4.	Bureau Balance (bureau_balance)
This table provides monthly snapshots of an applicant’s credit status for each loan reported in the bureau table. It contains information such as days past due and loan status over time. With over 27 million rows, it enables temporal analysis of repayment behavior and credit stability.
5.	Previous Application (previous_application)
This table records past loan applications made by the applicant, whether approved, refused, or canceled. It includes information about loan amounts, application decisions, and contract terms. This table helps identify application behavior patterns, such as repeated rejections or frequent borrowing.
6.	Installments Payments (installments_payments)
This table captures detailed payment history for installment-based loans. It includes scheduled payment amounts, actual payments made, payment delays, and early repayments. With millions of records, it provides strong indicators of repayment discipline and financial responsibility.
7.	Credit Card Balance (credit_card_balance)
This table contains monthly credit card usage data such as balances, limits, minimum payments, and utilization. It reflects short-term financial behavior, spending discipline, and credit dependency.
8.	POS Cash Balance (POS_CASH_balance)
This table tracks point-of-sale and cash loan repayment behavior, including delinquency and contract status. It offers insight into small-loan behavior and short-term liquidity management.

### Data Characteristics
•	The dataset contains hundreds of columns across tables, including numeric, categorical, and temporal variables.
•	Several tables contain hundreds of thousands to millions of rows, reflecting one-to-many relationships with applicants.
•	Missing values are present in multiple features, often representing absence of credit history rather than data quality issues.
•	All tables are linked through unique identifiers such as SK_ID_CURR and SK_ID_BUREAU, enabling relational aggregation.




# 1.4 Data Preparation
To transform the raw multi-table credit data into a clean, consistent dataset ready for analysis, the first step will be to aggregate and merge all relational tables into a single customer-level dataset using the unique applicant identifier (SK_ID_CURR). Since most auxiliary tables contain one-to-many relationships, relevant numeric and categorical features will be summarized using statistical aggregations such as mean, sum, minimum, maximum, counts, and unique counts. This ensures that each applicant is represented by a single, comprehensive record while preserving historical credit behavior across bureau records, installments, credit cards, POS cash, and previous applications.
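The aggregate-then-merge step described above can be sketched on a toy example. The two small frames and their values are made up for illustration; only the column names follow the real tables.

```python
import pandas as pd

# Toy stand-ins for the real tables (values are invented for illustration).
applications = pd.DataFrame({"SK_ID_CURR": [1, 2, 3]})
bureau_toy = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "AMT_CREDIT_SUM": [1000.0, 3000.0, 500.0],
    "CREDIT_ACTIVE": ["Active", "Closed", "Active"],
})

# Collapse the one-to-many table to one row per applicant.
bureau_summary = bureau_toy.groupby("SK_ID_CURR").agg(
    BUREAU_AMT_CREDIT_SUM_mean=("AMT_CREDIT_SUM", "mean"),
    BUREAU_AMT_CREDIT_SUM_max=("AMT_CREDIT_SUM", "max"),
    BUREAU_LOAN_COUNT=("AMT_CREDIT_SUM", "count"),
    BUREAU_CREDIT_ACTIVE_nunique=("CREDIT_ACTIVE", "nunique"),
).reset_index()

# A left join keeps applicants without bureau history; their summary columns are NaN.
customer_level = applications.merge(bureau_summary, on="SK_ID_CURR", how="left")
print(customer_level)
```

Applicant 3 has no bureau rows, so the merged record keeps the applicant but leaves the bureau summary columns empty. That is the source of most missing values after merging.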


After merging, data cleaning and consistency checks will be applied. Missing values will be handled using context-appropriate strategies: numerical features will be imputed using median values to reduce the influence of outliers, count-based features will be filled with zeros where missing values indicate absence of credit activity, and categorical variables will be filled using forward fill, backward fill, or meaningful default categories where applicable. Data types will be validated and corrected to ensure numerical features are stored as integers or floats, and binary indicators are properly encoded as integers (0/1). Duplicate records will be checked and removed where necessary to maintain data integrity.
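A minimal sketch of these imputation rules on invented rows. `CC_PAYMENT_COUNT` is a hypothetical count feature used only for illustration; the other column names follow the application table.

```python
import numpy as np
import pandas as pd

# Invented rows: the second applicant has no recorded activity.
df = pd.DataFrame({
    "AMT_ANNUITY": [12000.0, np.nan, 30000.0, 15000.0],
    "CC_PAYMENT_COUNT": [4.0, np.nan, 10.0, 2.0],  # hypothetical count feature
    "OCCUPATION_TYPE": ["Laborers", None, "Drivers", None],
    "FLAG_OWN_CAR": ["Y", "N", "Y", "N"],
})

# Numerical amount: the median is robust to extreme values.
df["AMT_ANNUITY"] = df["AMT_ANNUITY"].fillna(df["AMT_ANNUITY"].median())

# Count feature: missing means "no activity", so zero is a meaningful fill.
df["CC_PAYMENT_COUNT"] = df["CC_PAYMENT_COUNT"].fillna(0).astype(int)

# Categorical: an explicit default category keeps missingness visible.
df["OCCUPATION_TYPE"] = df["OCCUPATION_TYPE"].fillna("Unknown")

# Binary indicator stored as text: encode as integer 0/1.
df["FLAG_OWN_CAR"] = df["FLAG_OWN_CAR"].map({"Y": 1, "N": 0}).astype(int)

print(df)
```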


Finally, the dataset will be prepared for modeling by engineering ratio features, flags, and normalized metrics that better capture financial behavior, such as debt ratios, utilization rates, and repayment consistency. Outliers will be assessed using statistical methods (e.g., IQR), and feature distributions will be reviewed to ensure stability and low noise. The result will be a single, clean, and analysis-ready dataset that accurately reflects applicant behavior beyond traditional credit history, supporting fairer and more inclusive credit-worthiness assessment.
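As a rough illustration of a ratio feature and the IQR rule mentioned above, here is a sketch on invented numbers. The 1.5 × IQR fences are the conventional choice, assumed here rather than taken from the project.

```python
import pandas as pd

# Invented applicants; the last income is deliberately extreme.
toy = pd.DataFrame({
    "AMT_INCOME_TOTAL": [40_000, 45_000, 50_000, 55_000, 60_000, 1_000_000],
    "AMT_CREDIT": [80_000, 90_000, 200_000, 55_000, 120_000, 500_000],
})

# Ratio feature: loan size relative to income.
toy["CREDIT_INCOME_RATIO"] = toy["AMT_CREDIT"] / toy["AMT_INCOME_TOTAL"]

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1 = toy["AMT_INCOME_TOTAL"].quantile(0.25)
q3 = toy["AMT_INCOME_TOTAL"].quantile(0.75)
iqr = q3 - q1
toy["INCOME_OUTLIER"] = (
    (toy["AMT_INCOME_TOTAL"] < q1 - 1.5 * iqr)
    | (toy["AMT_INCOME_TOTAL"] > q3 + 1.5 * iqr)
).astype(int)

print(toy)
```

Flagging the outlier instead of dropping the row keeps the applicant in the dataset while still letting a model treat the extreme value differently.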


# 1.5 Business Objectives
# Main objective
To use alternative data signals to improve credit approval decisions by accurately identifying creditworthy applicants with limited or no traditional credit history.
# Specific objectives
i)	To identify important features that are neglected by legacy banks.

ii)	To handle missing values by applying appropriate statistical and logical imputation methods for numerical and categorical features.

iii)	To engineer alternative creditworthiness features such as repayment behavior, application velocity, and utilization ratios.

iv)	To encode categorical columns to make them model ready.

v)	To identify and handle outliers accordingly so that they do not affect the model.

vi)	To standardize large values through log transformation so that they do not overshadow smaller ones.
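Objective vi can be illustrated with `numpy.log1p`, which computes log(1 + x) and therefore handles zeros. The amounts below are invented, and using `log1p` rather than a plain logarithm is an assumption of this sketch.

```python
import numpy as np
import pandas as pd

# Invented amounts with one very large value.
amounts = pd.Series([25_000, 90_000, 135_000, 270_000, 4_500_000], name="AMT_INCOME_TOTAL")

# log1p compresses the range so the largest value no longer dominates.
log_amounts = np.log1p(amounts)

# expm1 is the inverse, so the original scale can be recovered when needed.
recovered = np.expm1(log_amounts)

print("skew before:", round(amounts.skew(), 2))
print("skew after: ", round(log_amounts.skew(), 2))
```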


# Criteria of Success
1.	To reduce the rejection of creditworthy applicants by at least 30–40% among customers with thin or non-existent credit histories compared to a traditional credit-history-only baseline.
2.	To lower False Negatives (good borrowers incorrectly declined) by a minimum of 50%, while keeping default risk within acceptable thresholds.
3.	To demonstrate that ignored features (cash-flow stability, rent payment regularity, stability of skill-based income, and first-time employment) contribute significantly to credit decisions, as shown by measurable feature importance.
4.	To increase approved loan volume by 8–12% without a corresponding increase in default rates, reflecting recovered missed lending opportunities.


# Data validation
Data validation is important in this project because it ensures that the data is accurate, consistent, and trustworthy. Decisions, models, and insights are only as good as the data behind them: if the data is wrong, the conclusions will be wrong too.

## Key reasons

1. Prevents wrong decisions

Invalid data (wrong values, duplicates, incorrect formats) can lead to false insights. For example, a negative income value or an impossible age can distort averages, ratios, and model predictions, leading to poor business decisions.

2. Improves model performance

In projects like credit risk scoring, invalid or inconsistent data increases noise. This causes models to learn incorrect patterns, increasing false positives or false negatives. Validated data leads to more reliable and stable models.

3. Ensures consistency across datasets

When working with multiple tables (like application, bureau, installments, POS cash), validation ensures that:

•	Keys match correctly (e.g., SK_ID_CURR)
•	Units and formats are consistent
•	Aggregations represent reality

Without validation, merges can silently fail or introduce bias.

4. Reduces bias and data leakage

Data validation helps detect:

•	Duplicates that overweight some customers
•	Future information leaking into training data
•	Outliers that unfairly influence predictions

This is critical in regulated domains like finance.
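The key-matching point (reason 3) can be checked mechanically. The sketch below uses two invented frames; `validate` and `indicator` are existing parameters of `DataFrame.merge`, and the IDs are made up.

```python
import pandas as pd

applications = pd.DataFrame({"SK_ID_CURR": [100, 101, 102]})
bureau_toy = pd.DataFrame({
    "SK_ID_CURR": [100, 100, 101, 999],  # 999 has no matching application
    "SK_ID_BUREAU": [1, 2, 3, 4],
})

# 1. The key must be unique in the parent (one row per applicant).
assert applications["SK_ID_CURR"].is_unique

# 2. Child rows whose key is absent from the parent table.
orphans = bureau_toy[~bureau_toy["SK_ID_CURR"].isin(applications["SK_ID_CURR"])]

# 3. validate raises MergeError if the join is not one-to-many;
#    indicator adds a _merge column showing where each row came from.
merged = applications.merge(
    bureau_toy, on="SK_ID_CURR", how="left",
    validate="one_to_many", indicator=True,
)
no_history = merged.loc[merged["_merge"] == "left_only", "SK_ID_CURR"].tolist()

print(len(orphans), "orphan bureau rows;", "applicants without history:", no_history)
```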

# Importing libraries

In [98]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import roc_auc_score
import lightgbm as lgb
from sklearn.feature_selection import VarianceThreshold

In [99]:
#loading the dataset

# Main application data
app_train = pd.read_csv(r"C:/Users/USER/Desktop/first project/application_train.csv")
app_test = pd.read_csv(r"C:/Users/USER/Desktop/first project/application_test.csv")

# Other relational tables
bureau = pd.read_csv(r"C:/Users/USER/Desktop/first project/bureau.csv")
bureau_balance = pd.read_csv(r"C:/Users/USER/Desktop/first project/bureau_balance.csv")
prev_app = pd.read_csv(r"C:/Users/USER/Desktop/first project/previous_application.csv")
pos_cash = pd.read_csv(r"C:/Users/USER/Desktop/first project/POS_CASH_balance.csv")
installments = pd.read_csv(r"C:/Users/USER/Desktop/first project/installments_payments.csv")
credit_card = pd.read_csv(r"C:/Users/USER/Desktop/first project/credit_card_balance.csv")
print("loaded")

loaded


## Viewing the dataset

In [100]:
# viewing the datasets
app_train.shape


(307511, 122)

In [101]:
bureau.shape

(1716428, 17)

In [102]:
credit_card.shape

(3840312, 23)

In [103]:
installments.shape

(13605401, 8)

In [104]:
installments.head()

| | SK_ID_PREV | SK_ID_CURR | NUM_INSTALMENT_VERSION | NUM_INSTALMENT_NUMBER | DAYS_INSTALMENT | DAYS_ENTRY_PAYMENT | AMT_INSTALMENT | AMT_PAYMENT |
|---|---|---|---|---|---|---|---|---|
| 0 | 1054186 | 161674 | 1.0 | 6 | -1180.0 | -1187.0 | 6948.36 | 6948.36 |
| 1 | 1330831 | 151639 | 0.0 | 34 | -2156.0 | -2156.0 | 1716.525 | 1716.525 |
| 2 | 2085231 | 193053 | 2.0 | 1 | -63.0 | -63.0 | 25425.0 | 25425.0 |
| 3 | 2452527 | 199697 | 1.0 | 3 | -2418.0 | -2426.0 | 24350.13 | 24350.13 |
| 4 | 2714724 | 167756 | 1.0 | 2 | -1383.0 | -1366.0 | 2165.04 | 2160.585 |


In [105]:
bureau_balance.shape

(27299925, 3)

In [106]:
app_test.shape

(48744, 121)

In [107]:
pos_cash.shape

(10001358, 8)

# Merging one-to-many tables

In [108]:
# defining aggregation function
def aggregate_one_to_many(df, group_key, prefix):
    num_cols = df.select_dtypes(include='number').columns.tolist()
    cat_cols = df.select_dtypes(include='object').columns.tolist()

    # Remove group key if present
    if group_key in num_cols:
        num_cols.remove(group_key)
    if group_key in cat_cols:
        cat_cols.remove(group_key)

    agg_dict = {}

    for col in num_cols:
        agg_dict[col] = ['mean', 'sum', 'max', 'min']

    for col in cat_cols:
        agg_dict[col] = ['nunique']

    agg = df.groupby(group_key).agg(agg_dict)

    agg.columns = [f"{prefix}_{col}_{stat}" for col, stat in agg.columns]
    agg.reset_index(inplace=True)

    return agg


In [None]:
#aggregating one to many tables
bureau_agg = aggregate_one_to_many(bureau, 'SK_ID_CURR', 'BUREAU')

bureau_balance_agg = aggregate_one_to_many(
    bureau_balance.merge(bureau[['SK_ID_BUREAU', 'SK_ID_CURR']], 
                          on='SK_ID_BUREAU', how='left'),
    'SK_ID_CURR',
    'BB'
)

prev_app_agg = aggregate_one_to_many(prev_app, 'SK_ID_CURR', 'PREV')

installments_agg = aggregate_one_to_many(installments, 'SK_ID_CURR', 'INST')

credit_card_agg = aggregate_one_to_many(credit_card, 'SK_ID_CURR', 'CC')

pos_cash_agg = aggregate_one_to_many(pos_cash, 'SK_ID_CURR', 'POS')

In [None]:
# merging the tables
final_df = app_train.copy()

final_df = (
    final_df
    .merge(bureau_agg, on='SK_ID_CURR', how='left')
    .merge(bureau_balance_agg, on='SK_ID_CURR', how='left')
    .merge(prev_app_agg, on='SK_ID_CURR', how='left')
    .merge(installments_agg, on='SK_ID_CURR', how='left')
    .merge(credit_card_agg, on='SK_ID_CURR', how='left')
    .merge(pos_cash_agg, on='SK_ID_CURR', how='left')
)

print(final_df.shape)

In [None]:
final_df.info()

# Data validation

### 1. Completeness check

In [None]:
# completeness check
final_df.isnull().sum()

### 2. Uniqueness check

In [None]:
# uniqueness check
final_df.duplicated().sum()
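Note that `duplicated()` with no arguments only flags rows that are identical in every column. To confirm one row per applicant, the check has to be restricted to the key. A toy illustration with invented values; on the merged data the equivalent call would be `final_df.duplicated(subset='SK_ID_CURR').sum()`.

```python
import pandas as pd

# Invented rows: applicant 101 appears twice with different credit amounts.
toy = pd.DataFrame({
    "SK_ID_CURR": [100, 101, 101],
    "AMT_CREDIT": [5000.0, 7000.0, 7500.0],
})

# Whole-row comparison misses the repeated applicant.
full_row_dupes = toy.duplicated().sum()

# Key-based comparison catches it.
key_dupes = toy.duplicated(subset="SK_ID_CURR").sum()

print(full_row_dupes, key_dupes)
```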

In [None]:
final_df.describe()

In [None]:
final_df[['DAYS_BIRTH', 'DAYS_EMPLOYED']].describe()

In [None]:
(final_df['DAYS_BIRTH'] > 0).value_counts()

### 3. Accuracy check

In [None]:
# Identify impossible values
# Age check (DAYS_BIRTH is negative in Home Credit)
final_df['AGE_YEARS'] = (-final_df['DAYS_BIRTH'] / 365).round(1)

# Find unrealistic ages
invalid_age = final_df[
    (final_df['AGE_YEARS'] < 18) | (final_df['AGE_YEARS'] > 100)
]

print(f"Invalid age records: {invalid_age.shape[0]}")

In [None]:
# Negative income
invalid_income = final_df[final_df['AMT_INCOME_TOTAL'] < 0]

# Zero or negative loan amounts
invalid_loan = final_df[final_df['AMT_CREDIT'] <= 0]

print(len(invalid_income), len(invalid_loan))

### 4. Validity check

In [None]:
# validity check

final_df.dtypes

### 5. Consistency check

In [None]:
inconsistent_income = final_df[
    (final_df['NAME_INCOME_TYPE'] == 'Unemployed') &
    (final_df['AMT_INCOME_TOTAL'] > 0)
]

print(inconsistent_income.shape[0])

In [None]:
final_df['EMPLOYMENT_YEARS'] = (-final_df['DAYS_EMPLOYED'] / 365)

employment_inconsistency = final_df[
    final_df['EMPLOYMENT_YEARS'] > final_df['AGE_YEARS']
]

print(employment_inconsistency.shape[0])

## Data Validation Report

After merging and aggregating all source tables into a single dataset (final_df), data validation was performed to ensure the data is accurate, consistent, and ready for analysis and modeling.

Completeness:
Missing values mainly resulted from left joins where applicants had no historical records. Numerical fields will be filled using the median or zero (where absence implies no activity), while categorical fields will be filled using an appropriate method, e.g. the mode or an “Unknown” category. Columns with excessive missing values were flagged for feature selection.

Accuracy:
Checks were conducted to identify unrealistic values such as negative incomes, invalid credit amounts, and extreme ratios. I found no such issues, so the data is considered accurate and ready for use.

Consistency:
A consistency check was performed between income type and reported income. Applicants labeled as “Unemployed” but reporting positive income were flagged as logically inconsistent. These cases were reviewed to identify potential data quality issues or alternative income sources not captured by traditional employment labels.
Logical relationships across features were validated, such as alignment between credit amounts, installments, and repayment behavior. 22 inconsistencies were found across two columns; they will be handled by flagging them, which also creates a useful feature for the model.

Uniqueness:
Duplicate records were checked using SK_ID_CURR to ensure one row per applicant. No duplicate customer records were found.

Conclusion:
The dataset passed all major validation checks and is considered clean, consistent, and reliable for feature engineering and predictive modeling.

# Data Remediation

## Data cleaning

In [None]:
# handling the inconsistency by flagging it
final_df['INCOME_TYPE_MISMATCH'] = (
    (final_df['NAME_INCOME_TYPE'] == 'Unemployed') &
    (final_df['AMT_INCOME_TOTAL'] > 0)
).astype(int)

In [None]:
# Get numeric columns
numeric_cols = final_df.select_dtypes(include='number').columns

# Filter only columns with null values
cols_with_null = final_df[numeric_cols].isnull().sum()
cols_with_null = cols_with_null[cols_with_null > 0].index

# Compute medians for those columns
medians = final_df[cols_with_null].median()
print("Medians for columns with null values:\n", medians)
# the median is used because, unlike the mean, it is not affected by outliers
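The cell above computes and prints the medians but does not write them back. One way to apply them is sketched below on invented data. Excluding the zero-fill columns first is an assumption made here so that "no activity" features are not overwritten with a median.

```python
import numpy as np
import pandas as pd

# Invented data: one amount column and one count-like column with gaps.
toy = pd.DataFrame({
    "AMT_ANNUITY": [10_000.0, np.nan, 20_000.0, 900_000.0],
    "POS_SK_DPD_DEF_sum": [0.0, np.nan, 3.0, np.nan],
})

# Columns where a missing value means "no activity" and should become zero.
zero_fill_cols = ["POS_SK_DPD_DEF_sum"]

# Every other column with nulls gets its own median.
median_fill_cols = [
    c for c in toy.columns[toy.isnull().any()] if c not in zero_fill_cols
]
medians = toy[median_fill_cols].median()

# fillna with a Series fills each column with its matching median.
toy[median_fill_cols] = toy[median_fill_cols].fillna(medians)
toy[zero_fill_cols] = toy[zero_fill_cols].fillna(0)

print(toy)
```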

In [None]:
cols_to_fill_zero = [
    'POS_SK_DPD_DEF_mean',
    'POS_SK_DPD_DEF_sum',
    'POS_SK_DPD_DEF_max',
    'POS_SK_DPD_DEF_min'
]

final_df[cols_to_fill_zero] = final_df[cols_to_fill_zero].fillna(0)
# because missing values here mean the applicant has no history in that column

In [None]:
final_df[cols_to_fill_zero].isnull().sum()

In [None]:
final_df['POS_NAME_CONTRACT_STATUS_nunique'] = (
    final_df['POS_NAME_CONTRACT_STATUS_nunique'].fillna(0)
)
final_df.isnull().sum()

In [None]:
# handling the 22 inconsistencies

final_df.loc[
    (final_df['NAME_INCOME_TYPE'] == 'Unemployed') &
    (final_df['AMT_INCOME_TOTAL'] > 0),
    'NAME_INCOME_TYPE'
] = 'Other'

# I prefer this method because it:
# - keeps the customer in the dataset
# - is realistic for African informal economies
# - preserves the income signal

In [None]:
final_df.columns.tolist()

In [None]:
final_df.duplicated().sum()

In [None]:
final_df.head(10)

# Feature engineering

In [None]:
from sklearn.preprocessing import MinMaxScaler


In [None]:

# Make sure final_df is a copy

final_df = final_df.copy()


# 1. Feature Engineering

# 1. Income & Credit Capacity
final_df['CREDIT_INCOME_RATIO'] = final_df['AMT_CREDIT'] / final_df['AMT_INCOME_TOTAL']
final_df['ANNUITY_INCOME_RATIO'] = final_df['AMT_ANNUITY'] / final_df['AMT_INCOME_TOTAL']
final_df['GOODS_PRICE_CREDIT_RATIO'] = final_df['AMT_GOODS_PRICE'] / final_df['AMT_CREDIT']

# 2. Employment & Age Stability
final_df['EMPLOYED_YEARS'] = np.abs(final_df['DAYS_EMPLOYED']) / 365
final_df['AGE_YEARS'] = np.abs(final_df['DAYS_BIRTH']) / 365
final_df['EMPLOYMENT_AGE_RATIO'] = final_df['EMPLOYED_YEARS'] / final_df['AGE_YEARS']

# 3. External Source Aggregations
ext_sources = ['EXT_SOURCE_1','EXT_SOURCE_2','EXT_SOURCE_3']
final_df['EXT_SOURCE_MEAN'] = final_df[ext_sources].mean(axis=1)
final_df['EXT_SOURCE_MAX'] = final_df[ext_sources].max(axis=1)
final_df['EXT_SOURCE_MIN'] = final_df[ext_sources].min(axis=1)

# 4. Credit History Presence Flags
final_df['HAS_BUREAU_HISTORY'] = (final_df['BUREAU_SK_ID_BUREAU_sum'] > 0).astype(int)
final_df['HAS_PREV_APPLICATION'] = (final_df['PREV_SK_ID_PREV_sum'] > 0).astype(int)
final_df['HAS_POS_HISTORY'] = (final_df['POS_SK_ID_PREV_sum'] > 0).astype(int)
final_df['HAS_CC_HISTORY'] = (final_df['CC_SK_ID_PREV_sum'] > 0).astype(int)

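# 5. Delinquency / Risk Features
# A row-wise sum skips NaN, so an applicant missing one source (e.g. no credit
# card) still keeps the days past due recorded in the other sources.
final_df['TOTAL_DPD'] = final_df[
    ['BUREAU_CREDIT_DAY_OVERDUE_sum', 'POS_SK_DPD_sum', 'CC_SK_DPD_sum']
].sum(axis=1)
final_df['TOTAL_DPD_DEF'] = final_df[['POS_SK_DPD_DEF_sum', 'CC_SK_DPD_DEF_sum']].sum(axis=1)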

# 6. Credit Card Utilization
final_df['CC_UTILIZATION'] = final_df['CC_AMT_BALANCE_mean'] / final_df['CC_AMT_CREDIT_LIMIT_ACTUAL_mean']

# 7. Installment Payment Behavior
final_df['INST_PAYMENT_RATIO'] = final_df['INST_AMT_PAYMENT_mean'] / final_df['INST_AMT_INSTALMENT_mean']

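# 8. Loan Activity Intensity
# Count distinct loans per applicant from the raw tables. Summing the aggregated
# ID columns (e.g. BUREAU_SK_ID_BUREAU_sum) would add up ID values, not count loans.
def count_loans(table, id_col):
    counts = table.groupby('SK_ID_CURR')[id_col].nunique()
    return final_df['SK_ID_CURR'].map(counts).fillna(0)

final_df['TOTAL_LOAN_COUNT'] = (
    count_loans(bureau, 'SK_ID_BUREAU') +
    count_loans(prev_app, 'SK_ID_PREV') +
    count_loans(pos_cash, 'SK_ID_PREV') +
    count_loans(credit_card, 'SK_ID_PREV')
)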

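# 9. Recent Application Behavior
# DAYS_DECISION is negative (days before the current application), so the most
# recent previous decision is the maximum, not the minimum.
final_df['RECENT_APPLICATION'] = (final_df['PREV_DAYS_DECISION_max'] > -30).astype(int)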

# 10. Housing / Living Quality
final_df['LIVING_AREA_RATIO'] = final_df['LIVINGAREA_AVG'] / final_df['TOTALAREA_MODE']


# 2. Handle Missing / Infinite Values

# Replace inf/-inf with NaN
final_df.replace([np.inf, -np.inf], np.nan, inplace=True)

# Fill NaN for sum/count features
fill_zero_cols = ['TOTAL_DPD','TOTAL_DPD_DEF','CC_UTILIZATION','TOTAL_LOAN_COUNT']
final_df[fill_zero_cols] = final_df[fill_zero_cols].fillna(0)

# Binary features (ensure integer)
binary_cols = ['HAS_BUREAU_HISTORY','HAS_PREV_APPLICATION','HAS_POS_HISTORY','HAS_CC_HISTORY','RECENT_APPLICATION']
final_df[binary_cols] = final_df[binary_cols].fillna(0).astype(int)

# Ratio / numeric features
ratio_cols = [
    'CREDIT_INCOME_RATIO','ANNUITY_INCOME_RATIO','GOODS_PRICE_CREDIT_RATIO',
    'EMPLOYED_YEARS','AGE_YEARS','EMPLOYMENT_AGE_RATIO',
    'EXT_SOURCE_MEAN','EXT_SOURCE_MAX','EXT_SOURCE_MIN',
    'CC_UTILIZATION','INST_PAYMENT_RATIO','LIVING_AREA_RATIO'
]
final_df[ratio_cols] = final_df[ratio_cols].fillna(0)


# 3. Scale Ratio Features

scaler = MinMaxScaler()
final_df[ratio_cols] = scaler.fit_transform(final_df[ratio_cols])


# 4. Preview Engineered Features
engineered_features = binary_cols + ratio_cols + fill_zero_cols
print("✅ Engineered, cleaned, and scaled features preview:")
print(final_df[engineered_features].head())

In [None]:
final_df.isnull().sum()

In [None]:
final_df.duplicated().sum()

# Exploratory Data Analysis

In [None]:
final_df.shape

In [None]:
final_df.info()

In [None]:
final_df.head()

In [None]:
final_df.describe()

In [None]:
final_df.isna().sum()

In [None]:
final_df.duplicated().sum()

# Univariate EDA

## Distribution

In [None]:
import matplotlib.pyplot as plt

final_df['AMT_INCOME_TOTAL'].hist(bins=5)

plt.title('Distribution of Total income')
plt.xlabel('Total income')
plt.ylabel('No. of applicants')

plt.show()

## Comment
The histogram shows that the vast majority of applicants fall in the lowest income bracket, while a few applicants have much higher incomes.

## Recommendation
Use a log scale for highly skewed data.
Check for outliers and verify that the extreme values are legitimate data points.

In [None]:
import matplotlib.pyplot as plt

final_df['AMT_CREDIT'].hist(bins=25)

plt.title('Distribution of Credit Amount')
plt.xlabel('Credit amount')
plt.ylabel('No. of applicants')

plt.show()

## Comment
The data is positively skewed and has a few outliers.

## Recommendation
Use the median to handle missing values in such data, since it is not distorted by outliers.

## Checking for outliers

In [None]:
final_df.boxplot(column = 'AMT_CREDIT')

## Comment
The data is right-skewed and there are many outliers.

## Recommendation
Do not remove outliers and null values blindly; instead, use robust statistics, e.g. the median, to fill null values in data that has outliers.

In [None]:
final_df.boxplot(column = 'AMT_INCOME_TOTAL')

## Comment
There is high positive skewness and a good number of outliers.

## Recommendation
Handle the outliers carefully so that they do not distort the analysis.

In [None]:
final_df.boxplot(column ='AMT_ANNUITY')

## Comment
The data is heavily right-skewed and there are extreme outliers.

## Recommendation
Fill missing values with a robust statistic such as the median, which is not affected by outliers. A log transformation can also reduce the influence of the outliers.

In [None]:
final_df.boxplot(column ='AGE_YEARS')

## Comment
The data is roughly symmetrical, with little skew.

## Recommendation
No log transformation is needed here, because the distribution is not skewed.

In [None]:
final_df.boxplot(column ='EMPLOYMENT_YEARS')

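## Comment
There is negative skewness and a mix of positive and negative employment durations.

## Recommendation
Handle the signs so that all employment durations share one sign. Because EMPLOYMENT_YEARS is computed as -DAYS_EMPLOYED / 365, negative years come from positive DAYS_EMPLOYED values, which should be investigated rather than kept as real durations.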
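In the Home Credit data, the value 365243 in DAYS_EMPLOYED is commonly treated as a placeholder rather than a real duration. That is an assumption to verify on the actual column (for example with `value_counts()`). If it holds, one possible treatment is to flag the placeholder and convert it to a missing value before computing years, as in this sketch on invented rows:

```python
import numpy as np
import pandas as pd

# Assumed placeholder value; confirm against the real data before relying on it.
PLACEHOLDER_DAYS = 365243

# Invented rows: three real (negative) durations and one placeholder.
toy = pd.DataFrame({"DAYS_EMPLOYED": [-1200, -365, PLACEHOLDER_DAYS, -7300]})

# Keep the information that the value was a placeholder as its own feature.
toy["DAYS_EMPLOYED_ANOM"] = (toy["DAYS_EMPLOYED"] == PLACEHOLDER_DAYS).astype(int)

# Replace the placeholder with NaN so it does not become about -1000 years.
toy["EMPLOYMENT_YEARS"] = -toy["DAYS_EMPLOYED"].replace(PLACEHOLDER_DAYS, np.nan) / 365

print(toy)
```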

## checking skewness 

In [None]:
final_df['EMPLOYMENT_YEARS'].skew()

In [None]:
final_df['AGE_YEARS'].skew()

In [None]:
final_df['AMT_ANNUITY'].skew()

In [None]:
final_df['AMT_INCOME_TOTAL'].skew()

In [None]:
final_df['AMT_CREDIT'].skew()

# Bivariate EDA

In [None]:
# NUMERICAL VS NUMERICAL ANALYSIS
final_df[['AMT_INCOME_TOTAL', 'AMT_ANNUITY']].corr()

## Comment
There is a weak positive relationship between the two columns: as one increases, the other tends to increase slightly.
## Recommendation
Do not assume that the two columns directly affect each other; correlation does not imply causation.
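Because both columns are heavily skewed, a single extreme value can pull the default (Pearson) coefficient up or down. A rank-based coefficient is a useful cross-check. The sketch below uses invented values and computes Spearman correlation as the Pearson correlation of the ranks.

```python
import pandas as pd

# Invented values: annuity rises with income, but one income is extreme.
toy = pd.DataFrame({
    "AMT_INCOME_TOTAL": [50_000, 60_000, 70_000, 80_000, 5_000_000],
    "AMT_ANNUITY": [10_000, 12_000, 15_000, 20_000, 21_000],
})

# Pearson measures linear association and is sensitive to the extreme income.
pearson = toy["AMT_INCOME_TOTAL"].corr(toy["AMT_ANNUITY"])

# Spearman: Pearson correlation computed on the ranks of each column.
ranks = toy.rank()
spearman = ranks["AMT_INCOME_TOTAL"].corr(ranks["AMT_ANNUITY"])

print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

A large gap between the two coefficients suggests the relationship is monotonic but not linear, or that outliers are driving the Pearson value.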

In [None]:
plt.scatter(final_df['AMT_INCOME_TOTAL'], final_df['AMT_ANNUITY'])
plt.xlabel('AMT_INCOME_TOTAL')
plt.ylabel('AMT_ANNUITY')
plt.show()

In [None]:
final_df[['AMT_INCOME_TOTAL', 'AMT_CREDIT']].corr()

In [None]:
final_df[['AMT_INCOME_TOTAL', 'EMPLOYMENT_YEARS']].corr()

In [None]:
final_df[['AMT_INCOME_TOTAL', 'AGE_YEARS']].corr()

## Comment
There is a weak negative relationship between age and total income.
## Recommendation
Do not assume that older applicants have higher incomes.

# Multivariate EDA

In [None]: