# Capstone EDA
By : Pythonic Minds

In [2]:
# Table of Contents  

In [3]:
## 1. Overview of the Project

### **a. Business Problems**
Home Credit aims to expand financial services to the unbanked population, where traditional credit history is limited or non-existent. To assess creditworthiness, they supplement loan applications with alternative data such as transaction records and telecom behavior.

The business objective is to predict the likelihood of loan default, where 0 indicates no default and 1 represents at least one late payment. Accurately predicting defaults will help Home Credit minimize losses and promote financial inclusion while ensuring responsible lending. This project aims to build a predictive model to assess clients' loan repayment ability using available data.

### **b. Analytical Methodology**
Default risk will be predicted using a supervised machine learning classification model. Historical loan application data will be split into two sets: 80% for training and 20% for testing. A model will be built on the training set and its performance evaluated on the test set. The model will analyze factors such as applicant demographics, credit history, past loan performance and repayment behavior to predict the likelihood that a client defaults on a loan.

### **c. Goal**
Reduce the likelihood of default by identifying the main predictors that are positively or negatively correlated with it.

## **2. Introduction**
The notebook begins by exploring the application_train.csv file, which contains key demographic information about the applicants, such as age and income, along with their credit history and other significant documents. Throughout the notebook, extensive data cleaning is performed to prepare the dataset for analysis. The notebook also focuses on identifying the most important predictors of default. Removing unnecessary rows and columns reduces noise and improves the model's overall efficiency.

## 3. Importing Packages

In [4]:
!pip list

Package                            Version
---------------------------------- --------------------
alabaster                          0.7.12
anaconda-client                    1.9.0
anaconda-navigator                 2.1.1
anaconda-project                   0.10.1
anyio                              2.2.0
appdirs                            1.4.4
argh                               0.26.2
argon2-cffi                        20.1.0
arrow                              0.13.1
asn1crypto                         1.4.0
astroid                            2.6.6
astropy                            4.3.1
asttokens                          3.0.0
async-generator                    1.10
atomicwrites                       1.4.0
attrs                              21.2.0
autopep8                           1.5.7
Babel                              2.9.1
backcall                           0.2.0
backports.functools-lru-cache      1.6.4
backports.shutil-get-terminal-size 1.0.0
backports.tempfile                 

In [5]:
!where python

pyerfa                             2.0.0
pyflakes                           2.3.1
Pygments                           2.10.0
pyjanitor                          0.30.0
PyJWT                              2.1.0
pylint                             2.9.6
pyls-spyder                        0.4.0
PyNaCl                             1.4.0
pyodbc                             4.0.0-unsupported
pyOpenSSL                          21.0.0
pyparsing                          3.0.4
pyreadline                         2.1
pyrsistent                         0.18.0
PySocks                            1.7.1
pytest                             6.2.4
python-dateutil                    2.8.2
python-lsp-black                   1.0.0
python-lsp-jsonrpc                 1.0.0
python-lsp-server                  1.2.4
python-slugify                     5.0.2
pytz                               2021.3
PyWavelets                         1.1.1
pywin32                            228
pywin32-ctypes                     0.2.0
pyw

In [6]:
#Importing Packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## 4. Loading Data

In [7]:
# suppress warnings
import warnings
warnings.filterwarnings('ignore')

# define the function for reducing memory usage when importing data
def reduce_memory_usage(df):
  
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df
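
One caveat with the downcasting above: `float16` tops out at about 65,504 and carries roughly three significant decimal digits, so aggregates such as `mean()` can overflow or lose precision on columns stored at that width. A minimal sketch of a more conservative variant, assuming some extra memory is an acceptable trade for accuracy, stops at `float32`. The function name `downcast_floats_safely` and the toy values are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def downcast_floats_safely(df):
    """Downcast float64 columns to float32 at most (never float16)."""
    out = df.copy()
    for col in out.select_dtypes(include=["float64"]).columns:
        c_min, c_max = out[col].min(), out[col].max()
        # Only downcast when the observed range fits inside float32
        if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
            out[col] = out[col].astype(np.float32)
    return out

# Toy data (illustrative values, not read from the Home Credit files)
demo = pd.DataFrame({"income": [25650.0, 147150.0, 202500.0]})
safe = downcast_floats_safely(demo)
print(safe.dtypes)
```

Integer columns could still be handled by the original function; only the float branch differs here.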

In [13]:
#reading the csv file 
application_train_df = pd.read_csv("application_train.csv")
bureau_df = pd.read_csv("bureau.csv")
# bureau_balance_df = pd.read_csv("bureau_balance.csv")
# credit_card_balance_df = pd.read_csv("credit_card_balance.csv")
# installments_payments_df = pd.read_csv("installments_payments.csv")
# pos_cash_balance_df =  pd.read_csv("POS_CASH_balance.csv")
# previous_application_df = pd.read_csv("previous_application.csv")

In [14]:
# Apply the reduce memory usage function to each DataFrame
application_train_df = reduce_memory_usage(application_train_df)
bureau_df = reduce_memory_usage(bureau_df)
# bureau_balance_df = reduce_memory_usage(bureau_balance_df)
# credit_card_balance_df = reduce_memory_usage(credit_card_balance_df)
# installments_payments_df = reduce_memory_usage(installments_payments_df)
# POS_CASH_balance_df = reduce_memory_usage(pos_cash_balance_df)
# previous_application_df = reduce_memory_usage(previous_application_df)

Memory usage of dataframe is 286.23 MB
Memory usage after optimization is: 92.38 MB
Decreased by 67.7%
Memory usage of dataframe is 222.62 MB
Memory usage after optimization is: 112.95 MB
Decreased by 49.3%



## 5. Data Exploration
In this section, we explore the features of each dataset.


In [10]:
#getting a summary statistics and shape of the application dataset
summary = application_train_df.describe()
display(summary)
print(application_train_df.shape)

| Statistic | SK_ID_CURR | TARGET | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 307511.0 | 307511.0 | 307511.0 | 307511.0 | 307511.0 | 307499.0 | 307233.0 | 307511.0 | 307511.0 | 307511.0 | ... | 307511.0 | 307511.0 | 307511.0 | 307511.0 | 265992.0 | 265992.0 | 265992.0 | 265992.0 | 265992.0 | 265992.0 |
| mean | 278180.518577 | 0.080729 | 0.417052 | 168797.9 | 599025.9 | 27108.572266 | 538396.1 | 0.0 | -16036.995067 | 63815.045904 | ... | 0.00813 | 0.000595 | 0.000507 | 0.000335 | 0.0 | 0.0 | 0.0 |  |  |  |
| std | 102790.175348 | 0.272419 | 0.722121 | 237175.9 | 402479.5 | 14493.233398 | 369542.7 | 0.013824 | 4363.988632 | 141275.766519 | ... | 0.089798 | 0.024387 | 0.022518 | 0.018299 | 0.083984 | 0.110718 | 0.204712 | 0.0 |  | 0.0 |
| min | 100002.0 | 0.0 | 0.0 | 25650.0 | 45000.0 | 1615.5 | 40500.0 | 0.00029 | -25229.0 | -17912.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 189145.5 | 0.0 | 0.0 | 112500.0 | 270000.0 | 16524.0 | 238500.0 | 0.01001 | -19682.0 | -2760.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 278202.0 | 0.0 | 0.0 | 147150.0 | 513531.0 | 24903.0 | 450000.0 | 0.018845 | -15750.0 | -1213.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 75% | 367142.5 | 0.0 | 1.0 | 202500.0 | 808650.0 | 34596.0 | 679500.0 | 0.028656 | -12413.0 | -289.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| max | 456255.0 | 1.0 | 19.0 | 117000000.0 | 4050000.0 | 258025.5 | 4050000.0 | 0.07251 | -7489.0 | 365243.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 4.0 | 9.0 | 8.0 | 27.0 | 261.0 | 25.0 |


(307511, 122)


In [11]:
#getting a summary statistics and shape of the bureau_df dataset
summary = bureau_df.describe()
display(summary)
print(bureau_df.shape)

| Statistic | SK_ID_CURR | SK_ID_BUREAU | DAYS_CREDIT | CREDIT_DAY_OVERDUE | DAYS_CREDIT_ENDDATE | DAYS_ENDDATE_FACT | AMT_CREDIT_MAX_OVERDUE | CNT_CREDIT_PROLONG | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | AMT_CREDIT_SUM_LIMIT | AMT_CREDIT_SUM_OVERDUE | DAYS_CREDIT_UPDATE | AMT_ANNUITY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1716428.0 | 1716428.0 | 1716428.0 | 1716428.0 | 1610875.0 | 1082775.0 | 591940.0 | 1716428.0 | 1716415.0 | 1458759.0 | 1124648.0 | 1716428.0 | 1716428.0 | 489637.0 |
| mean | 278214.9 | 5924434.0 | -1142.108 | 0.8181666 |  |  | 3825.417 | 0.006410406 | 354994.6 | 137085.1 | 6229.514 | 37.91277 | -593.7483 | 15712.76 |
| std | 102938.6 | 532265.7 | 795.1649 | 36.54443 |  |  | 205987.3 | 0.09622391 | 1150277.0 | 679074.9 | 44896.66 | 5937.519 | 720.7473 | 325655.6 |
| min | 100001.0 | 5000000.0 | -2922.0 | 0.0 | -42048.0 | -42016.0 | 0.0 | 0.0 | 0.0 | -4705600.0 | -586406.1 | 0.0 | -41947.0 | 0.0 |
| 25% | 188866.8 | 5463954.0 | -1666.0 | 0.0 | -1138.0 | -1489.0 | 0.0 | 0.0 | 51300.0 | 0.0 | 0.0 | 0.0 | -908.0 | 0.0 |
| 50% | 278055.0 | 5926304.0 | -987.0 | 0.0 | -330.0 | -897.0 | 0.0 | 0.0 | 125518.5 | 0.0 | 0.0 | 0.0 | -395.0 | 0.0 |
| 75% | 367426.0 | 6385681.0 | -474.0 | 0.0 | 474.0 | -425.0 | 0.0 | 0.0 | 315000.0 | 40153.5 | 0.0 | 0.0 | -33.0 | 13500.0 |
| max | 456255.0 | 6843457.0 | 0.0 | 2792.0 | 31200.0 | 0.0 | 115987200.0 | 9.0 | 585000000.0 | 170100000.0 | 4705600.0 | 3756681.0 | 372.0 | 118453400.0 |


(1716428, 17)


In [12]:
# #getting a summary statistics and shape of the bureau_balance_df dataset
# summary = bureau_balance_df.describe()
# display(summary)
# print(bureau_balance_df.shape)


In [None]:
# #getting a summary statistics and shape of the credit_card_balance_df dataset
# summary = credit_card_balance_df.describe()
# display(summary)
# print(credit_card_balance_df.shape)

In [None]:
# #getting a summary statistics and shape of the installments_payments_df dataset
# summary = installments_payments_df.describe()
# display(summary)
# print(installments_payments_df.shape)

In [None]:
# #getting a summary statistics and shape of the POS_CASH_balance_df dataset
# summary = POS_CASH_balance_df.describe()
# display(summary)
# print(POS_CASH_balance_df.shape)

In [None]:
# #getting a summary statistics and shape of the previous_application_df dataset
# summary = previous_application_df.describe()
# display(summary)
# print(previous_application_df.shape)

In [None]:
### a. Exploring Target Variable

In [None]:
sns.countplot(x = application_train_df['TARGET'])
plt.title('Distribution of the Target Variable')
plt.xlabel('Non Default = [0], Default = [1]')
plt.show()

In [None]:
# Obtaining Target variable proportion
target_prop = round(application_train_df.value_counts(subset='TARGET', normalize=True),2)
print(target_prop)

print(f'The proportion of Non Defaulters [0] is {target_prop[0]}')
print(f'The proportion of Defaulters [1] is {target_prop[1]} ')


Data Description:
Far more clients have no payment difficulties on their loans than have payment difficulties.
The count plot above illustrates this: about 92% of clients are non-defaulters and about 8% are defaulters.

Additionally, the Train set has 307,511 rows and 122 columns,
whereas the Test set has roughly 48,000 rows and 121 columns (it excludes the Target variable).

In [None]:
### b. Plots

In [None]:
# Plot distribution on numerical columns
# Select numerical columns
num_cols = application_train_df.select_dtypes(include=['number']).columns

# Define batch size for better visualization
batch_size = 6  
num_batches = int(np.ceil(len(num_cols) / batch_size))

# Plot in smaller groups
for i in range(num_batches):
    batch_cols = num_cols[i * batch_size:(i + 1) * batch_size]
    application_train_df[batch_cols].hist(figsize=(15, 10), bins=30)
    plt.show()


Since the following features show very little variation, we can remove them from our analysis:

1. FLAG_DOCUMENT_13
2. FLAG_DOCUMENT_16

In [None]:
# drop FLAG DOCUMENT 13 and 16
application_train_df = application_train_df.drop(columns=['FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_16'])

In [None]:
# Plot distribution on categorical columns
# Select categorical columns
cat_cols = application_train_df.select_dtypes(include=['object']).columns

# Define batch size for better visualization
batch_size = 1
num_batches = int(np.ceil(len(cat_cols) / batch_size))

# Plot in smaller groups
for i in range(num_batches):
    batch_cols = cat_cols[i * batch_size:(i + 1) * batch_size]
    
    # Plot each categorical column
    plt.figure(figsize=(30, 15))
    for j, col in enumerate(batch_cols, 1):
        plt.subplot(2, 3, j)  # Adjust rows and columns based on your batch size
        sns.countplot(data=application_train_df, x=col)
        plt.title(f'Distribution of {col}')
        plt.xticks(rotation=90)  # Rotate x-axis labels vertically
    
    plt.tight_layout()
    plt.show()


Most applicants apply for cash loans, and the most common housing type is house/apartment. Interestingly, nearly 70,000 applicants belong to the organization type Business Entity Type 3, far more than any other organization type.

In [None]:
# Count plot for CODE_GENDER vs TARGET
sns.countplot(x='CODE_GENDER', hue='TARGET', data=application_train_df)
plt.title('CODE_GENDER vs TARGET')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.show()


Interestingly, men have a higher proportion of loan defaults than women.
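
A count plot shows absolute counts, so the default rate per group is easier to read from a grouped mean of `TARGET`. The sketch below uses a toy frame whose values are made up for illustration; they are not taken from `application_train.csv`.

```python
import pandas as pd

# Toy data: 4 male and 6 female applicants (illustrative only)
toy = pd.DataFrame({
    "CODE_GENDER": ["M", "M", "M", "M", "F", "F", "F", "F", "F", "F"],
    "TARGET":      [1,   0,   0,   0,   1,   0,   0,   0,   0,   0],
})

# TARGET is 0/1, so its mean per group is the share of defaulters
default_rate = toy.groupby("CODE_GENDER")["TARGET"].mean()
print(default_rate)
```

On the real data, the same idea would be `application_train_df.groupby('CODE_GENDER')['TARGET'].mean()`.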

In [None]:
# Set up the figure size
plt.figure(figsize=(12, 6))

# Create a box plot with 'TARGET' as hue to separate the data
sns.boxplot(data=application_train_df, x='NAME_CONTRACT_TYPE', y='AMT_CREDIT', hue='TARGET', palette='Set2')

# Add title and labels
plt.title('Bivariate Analysis: NAME_CONTRACT_TYPE vs AMT_CREDIT by TARGET')
plt.xlabel('NAME_CONTRACT_TYPE')
plt.ylabel('AMT_CREDIT')

# Show the plot
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
plt.show()

The credit amount for cash loans is much higher than for revolving loans. Also, non-defaulters usually borrow more than defaulters.

In [None]:
# Set the figure size
plt.figure(figsize=(10, 6))

# Create the boxplot for AMT_ANNUITY based on REGION_RATING_CLIENT and TARGET
sns.boxplot(data=application_train_df, x='REGION_RATING_CLIENT', y='AMT_ANNUITY', hue='TARGET')

# Set the labels and title
plt.title('AMT_ANNUITY by REGION_RATING_CLIENT and TARGET')
plt.xlabel('Region Rating Client')
plt.ylabel('Annuity Amount (AMT_ANNUITY)')

# Display the plot
plt.show()

The loan annuity is generally higher for clients in regions with a rating of 1.

In [None]:
#Plotting the variation in the normalised credit score
# from scipy.stats import gaussian_kde

# # Replacing inf values with NaN
# df_clean = application_train_df.copy()
# df_clean.replace([np.inf, -np.inf], np.nan, inplace=True)

# # Drop NaN values before plotting
# df_clean.dropna(subset=['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3'], inplace=True)

# # Plot using Scipy's gaussian_kde for the density plot
# plt.figure(figsize=(10, 5))

# # Create the KDE for each EXT_SOURCE column
# for column in ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']:
#     data = df_clean[column].dropna()
#     kde = gaussian_kde(data)
#     x_vals = np.linspace(data.min(), data.max(), 1000)
#     y_vals = kde(x_vals)
#     plt.fill(x_vals, y_vals, label=column, alpha=0.5)

# # Add legend and show the plot
# plt.legend()
# plt.show()


From the plot, it is seen that
EXT_SOURCE_2 has a concentrated distribution with a peak around 0.6, suggesting that this score is generally higher and less spread out than the other two sources. The plot is drawn after dropping nulls, so it does not show which source has more missing values; that has to be checked from the null counts.
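
The share of null values in each source can be checked directly instead of being read off a density plot. A minimal sketch on toy data (the values are illustrative only):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "EXT_SOURCE_1": [0.5, np.nan, np.nan, 0.7],
    "EXT_SOURCE_2": [0.6, 0.62, 0.58, 0.61],
    "EXT_SOURCE_3": [np.nan, 0.4, 0.5, 0.3],
})

# Share of missing values per column, as a percentage
null_share = toy.isna().mean() * 100
print(null_share)
```

On the real data, the equivalent would be `application_train_df[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].isna().mean() * 100`.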

## **6. Data Cleaning**
###  **a. Data Cleaning on application_train dataset**
####  **1. Evaluating columns with missing values**

In [None]:
# filtering the data that has missing values > 65%
def dropna_over65(df):
    missing_values = df.isnull().sum().sort_values(ascending=False)
    missing_percent = (missing_values / len(df) * 100)

    missing_data_over_65 = missing_percent[missing_percent > 65]
    print(f'There are: {len(missing_data_over_65)} columns missing data over 65%')
    print(missing_data_over_65)

    # Drop columns that have more than 65% null values
    df.drop(columns = missing_data_over_65.index, inplace=True)
    print('\n')
    print(f'Shape of the df after removing missing data over 65% : {df.shape}')
    
    return df  

In [None]:
### **b. Factorize all Categorical Columns**

Since 'EXT_SOURCE_1', 'EXT_SOURCE_2' and 'EXT_SOURCE_3' seem like important columns, we first bin their values and then factorize the resulting categories.

In [None]:
def factorize_application(df):
    bins = [0, 0.3, 0.6, 0.8, 1.0]
    labels = ['Very Poor', 'Average', 'Good', 'Excellent']

    # Replacing NaN values with 0 (missing scores therefore fall into the 'Very Poor' bin)
    for col in ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']:
        df[col] = df[col].fillna(0).astype('float32')  
    
    # Binning and creating new category columns
    for col in ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']:
        df[col + '_Category'] = pd.cut(df[col], bins=bins, labels=labels, right=False)
    
    # Dropping original EXT_SOURCE columns
    df.drop(columns=['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3'], inplace = True)
    
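    # Encode the ordered categories as integers (0 = 'Very Poor' ... 3 = 'Excellent').
    # .cat.codes keeps the bin order, whereas pd.factorize numbers values by first appearance.
    for col in ['EXT_SOURCE_1_Category', 'EXT_SOURCE_2_Category', 'EXT_SOURCE_3_Category']:
        df[col] = df[col].cat.codes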

    print(f"Factorized Columns: {['EXT_SOURCE_1_Category', 'EXT_SOURCE_2_Category', 'EXT_SOURCE_3_Category']}")
    return df



In [None]:
factorize_application(application_train_df)
print(application_train_df.shape)

The rest of the categorical columns are factorized here.
Note that the function below integer-codes every remaining object column with pd.factorize, whether it has two categories or more; it does not use LabelEncoder or OneHotEncoder. pd.factorize also encodes missing values as -1.

In [None]:
def factorize_cat_cols(df):
    
    cat_cols = df.select_dtypes(include='object').columns
    for col in cat_cols:
        df[col] = pd.factorize(df[col])[0]
    return df
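
One alternative treats binary and multi-category columns differently: binary columns are integer-coded, and columns with more than two categories are one-hot encoded. This is a sketch under assumptions, not the notebook's method. The function name `encode_categoricals` is hypothetical, the toy values only mimic the dataset's categories, and `pd.get_dummies` stands in for scikit-learn's `OneHotEncoder`.

```python
import pandas as pd

def encode_categoricals(df):
    """Integer-code binary text columns; one-hot encode the rest."""
    out = df.copy()
    cat_cols = [c for c in out.columns if not pd.api.types.is_numeric_dtype(out[c])]
    binary_cols = [c for c in cat_cols if out[c].nunique(dropna=True) <= 2]
    multi_cols = [c for c in cat_cols if c not in binary_cols]

    for col in binary_cols:
        out[col] = pd.factorize(out[col])[0]  # missing values become -1

    # One indicator column per category for the remaining columns
    return pd.get_dummies(out, columns=multi_cols, dtype="int8")

toy = pd.DataFrame({
    "NAME_CONTRACT_TYPE": ["Cash loans", "Revolving loans", "Cash loans"],
    "NAME_HOUSING_TYPE": ["House / apartment", "With parents", "Rented apartment"],
})
encoded = encode_categoricals(toy)
print(encoded.columns.tolist())
```

One-hot encoding avoids implying an order between unrelated categories, at the cost of more columns.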

#### 2. Evaluating highly-correlated columns
1. Remove highly-correlated numerical variables (correlation score greater than 0.8) using the correlation matrix.
2. Remove highly-correlated categorical variables and other irrelevant columns using domain knowledge.

In [None]:
# numeric_columns = application_train_df.select_dtypes(include=['number']).columns

# correlation_matrix = application_train_df[numeric_columns].corr()

# plt.figure(figsize=(20,20))
# sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths = 0.2)
# plt.show()

In [None]:
# Drop highly-correlated numerical variables
# Identify pairs of features with correlation above a threshold
def redundant_data(df, correlation_matrix=None, threshold=0.8):
    # The correlation matrix is always recomputed from df's numeric columns;
    # the correlation_matrix argument is kept only so existing calls still work
    to_drop = []  # List to store columns to drop
    numeric_columns = df.select_dtypes(include=['number']).columns
    correlation_matrix = df[numeric_columns].corr()

    # Loop through the lower triangle of the matrix to find highly correlated pairs
    for i in range(len(correlation_matrix.columns)):
        for j in range(i):
            if abs(correlation_matrix.iloc[i, j]) > threshold:
                colname = correlation_matrix.columns[i]
                if colname not in to_drop:
                    to_drop.append(colname)

    # Drop the later column of each highly correlated pair
    to_drop = list(set(to_drop) & set(df.columns))
    df.drop(columns=to_drop, inplace=True)
    print(f"Dropped {len(to_drop)} redundant columns")
    print(f"The column names that were dropped are: {to_drop}")
    print('\n')
    print(f"Shape of the dataset after removing multicollinearity: {df.shape}")

    return df

### **c. Imputing NA values in the Remaining Columns**
For all numeric columns, the NA values will be replaced with the median of the column.
For all categorical columns that were factorized, the NA values will be replaced with the mode of the column.

In [None]:
def imputing_na(df):
    for col in df.columns:
        # Median for numeric columns, mode for everything else
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna(df[col].mode()[0])
    print("Imputed all na values")
    return df


#### 3. Looking for Columns with Outliers

In [None]:
# Fixing CNT_CHILDREN
# Some applicants have up to 19 children according to the dataset, which is statistically rare.
# Any record with more than 6 children is capped at 6 children.

test = application_train_df.copy()
test['CNT_CHILDREN'] = np.where(application_train_df['CNT_CHILDREN'] > 6, 6, application_train_df['CNT_CHILDREN'])

# Write the change back to application_train_df
application_train_df = test.copy()

application_train_df['CNT_CHILDREN'].value_counts().sort_index(ascending=True)

# CNT_CHILDREN is much more reasonable now.

In [None]:
def cap_outliers_sd(df, threshold=3, exclude_columns=None):
    """
    Caps outliers at ±3 standard deviations for all numeric columns,
    except those in exclude_columns.

    Parameters:
    df (pd.DataFrame): The input DataFrame.
    threshold (float): Standard deviation threshold (default is 3).
    exclude_columns (list): List of column names to exclude from capping.

    Returns:
    pd.DataFrame: DataFrame with outliers capped for numeric columns only.
    """
    if exclude_columns is None:
        exclude_columns = []
        
    df_capped = df.copy()
    # Select only numeric columns
    numeric_columns = df.select_dtypes(include=[np.number]).columns
    
    for column in numeric_columns:
        if column in exclude_columns:
            continue  # Skip excluded column
        
        mean = df[column].mean()
        std_dev = df[column].std()

        lower_bound = mean - threshold * std_dev
        upper_bound = mean + threshold * std_dev

        df_capped[column] = df[column].clip(lower=lower_bound, upper=upper_bound)
       
    return df_capped

# Exclude 'CNT_CHILDREN' and 'TARGET' when the function is applied to the numeric columns
exclude_cols = ['CNT_CHILDREN', 'TARGET']
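
To make the ±3 standard deviation rule concrete, here is a small worked sketch on a toy series (the numbers are invented for illustration). The mean and standard deviation are themselves pulled up by the extreme value, so the cap lands well above the typical values.

```python
import pandas as pd

# Toy column: 19 ordinary values and one extreme value (illustrative only)
income = pd.Series([100.0] * 19 + [10_000.0])

mean, std = income.mean(), income.std()
lower = mean - 3 * std
upper = mean + 3 * std

# Values outside [lower, upper] are pulled back to the nearest bound
capped = income.clip(lower=lower, upper=upper)
print(f"upper bound: {upper:.1f}, max after capping: {capped.max():.1f}")
```

Only the extreme value changes; the 19 ordinary values stay at 100.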



In [None]:
## cleaned dataset

In [None]:
# Calling all the functions for data cleaning
application_clean_df = application_train_df.copy()
dropna_over65(application_clean_df)

factorize_cat_cols(application_clean_df)
print(application_clean_df.shape)

imputing_na(application_clean_df)
print(application_clean_df.shape)

redundant_data(application_clean_df, None)  # the correlation matrix is recomputed inside the function
print(application_clean_df.shape)

df_capped = cap_outliers_sd(application_clean_df, exclude_columns = exclude_cols)
print(df_capped.shape)

application_clean_df = df_capped


In [None]:
#looking for the updated summaries
application_clean_df.describe()


In [None]:
#checking for na values
application_clean_df.isna().sum()


## 7. **Scaling Data**
Logistic regression and distance-based models such as KNN are sensitive to feature magnitudes, so they tend to perform better on scaled data. Tree-based models such as random forests are largely unaffected by scaling.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn.metrics import classification_report

def scale_df(df, target):
    X = df.drop(columns=[target])  # Features
    if 'SK_ID_CURR' in X.columns:
        X_id = X[['SK_ID_CURR']]  # Retain SK_ID_CURR
        X = X.drop(columns=['SK_ID_CURR'])
    else:
        X_id = None
    y = df[target]

    # Split Data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # 1. Scale the Data (the scaler is fit on the training set only)
    scaler = StandardScaler()
    X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X.columns)
    X_test = pd.DataFrame(scaler.transform(X_test), columns=X.columns)

    if X_id is not None:
        # y_train / y_test keep the original index labels, so look the IDs up with .loc
        X_train = pd.concat([X_id.loc[y_train.index].reset_index(drop=True), X_train], axis=1)
        X_test = pd.concat([X_id.loc[y_test.index].reset_index(drop=True), X_test], axis=1)
        
    print("Data Scaling Done")
    return X_train, X_test, y_train, y_test
    

## 8. **Handle Class Imbalance**
Imbalance skews the model towards the majority class, making it harder to predict defaults (1s).

In [None]:
# 2. Undersample the Majority Class
def under_sample(X_train, y_train):
    if 'SK_ID_CURR' in X_train.columns:
        X_id = X_train[['SK_ID_CURR']]
        X_train = X_train.drop(columns=['SK_ID_CURR'])
    else:
        X_id = None

    undersample = RandomUnderSampler(sampling_strategy=0.5, random_state=42)
    X_train, y_train = undersample.fit_resample(X_train, y_train)

    if X_id is not None:
        # sample_indices_ holds the positions of the rows that were kept
        kept_ids = X_id.iloc[undersample.sample_indices_].reset_index(drop=True)
        X_train = pd.concat([kept_ids, X_train.reset_index(drop=True)], axis=1)
        y_train = y_train.reset_index(drop=True)

    print("Class Imbalance Fixed")
    return X_train, y_train
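
To show what `sampling_strategy=0.5` amounts to, the sketch below reproduces the arithmetic with plain pandas and NumPy. It assumes imbalanced-learn's convention that a float strategy is the desired minority-to-majority ratio after resampling. The helper `undersample_majority` and the toy data are hypothetical and are not a replacement for `RandomUnderSampler`.

```python
import numpy as np
import pandas as pd

def undersample_majority(X, y, ratio=0.5, seed=42):
    """Keep all minority rows and sample n_minority / ratio majority rows."""
    y = pd.Series(np.asarray(y))
    X = X.reset_index(drop=True)
    counts = y.value_counts()
    minority, majority = counts.idxmin(), counts.idxmax()
    n_majority = int(counts[minority] / ratio)

    rng = np.random.default_rng(seed)
    majority_pos = np.flatnonzero((y == majority).to_numpy())
    minority_pos = np.flatnonzero((y == minority).to_numpy())
    sampled = rng.choice(majority_pos, size=n_majority, replace=False)
    keep = np.sort(np.concatenate([minority_pos, sampled]))
    return X.iloc[keep].reset_index(drop=True), y.iloc[keep].reset_index(drop=True)

# Toy data: 92 non-defaults and 8 defaults, mirroring the 92% / 8% split
X_toy = pd.DataFrame({"feature": np.arange(100)})
y_toy = pd.Series([0] * 92 + [1] * 8)
X_bal, y_bal = undersample_majority(X_toy, y_toy)
print(y_bal.value_counts())
```

With a ratio of 0.5, every default is kept and twice as many non-defaults are sampled, so the training set shrinks considerably.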

## 9. **Feature Selection (Top Predictors)**
Helps reduce dimensionality and removes noisy features before final model training.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

def imp_features(X_train, X_test, y_train, num_features=10):
    # Keep SK_ID_CURR out of the model so the identifier is not ranked as a feature
    feature_cols = [col for col in X_train.columns if col != 'SK_ID_CURR']
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train[feature_cols], y_train)

    # Get feature importance scores
    feature_importance = pd.Series(model.feature_importances_, index=feature_cols)

    # Select top 'num_features' based on importance
    selected_cols = feature_importance.nlargest(num_features).index

    # Restore SK_ID_CURR if it was present
    if 'SK_ID_CURR' in X_train.columns:
        X_train_id = X_train[['SK_ID_CURR']]
        X_test_id = X_test[['SK_ID_CURR']]
        
        X_train = pd.concat([X_train_id, X_train[selected_cols]], axis=1)
        X_test = pd.concat([X_test_id, X_test[selected_cols]], axis=1)
    else:
        X_train = X_train[selected_cols]
        X_test = X_test[selected_cols]
    print(selected_cols)

    return X_train, X_test, y_train

In [None]:
application_final_df = application_clean_df.copy() 
target = 'TARGET'  # Your Target Column Name

# 1. Scale Data
X_train, X_test, y_train, y_test = scale_df(application_final_df,target)

# 2. Fix Class Imbalance
X_train, y_train = under_sample(X_train, y_train)

# 3. Feature Selection
X_train, X_test, y_train = imp_features(X_train, X_test, y_train)

print(f"Final X_train Shape: {X_train.shape}")
print(f"Final X_test Shape: {X_test.shape}")


In [None]:
## 10. Training the Model Using Various Methods

In [None]:
### a. Logistic Regression Model

In [None]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

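# Train Logistic Regression model on the selected features
# (SK_ID_CURR is an identifier, so it is excluded from the predictors)
feature_cols = [col for col in X_train.columns if col != 'SK_ID_CURR']
model_log_reg = LogisticRegression(max_iter=1000, random_state=42)
model_log_reg.fit(X_train[feature_cols], y_train)

# Predict on the test set
y_pred = model_log_reg.predict(X_test[feature_cols])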

# Evaluate model performance 
accuracy = accuracy_score(y_test, y_pred)
print(f"Logistic Regression Accuracy: {accuracy:.4f}")
print("Logistic Regression Classification Report:")
print(classification_report(y_test, y_pred))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)

# Plot Confusion Matrix
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", xticklabels=["Class 0", "Class 1"], yticklabels=["Class 0", "Class 1"])
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Logistic Regression Confusion Matrix")
plt.show()


In [None]:
### c. KNN

In [None]:
# KNN model
# TODO: add GridSearchCV for hyperparameter tuning

from sklearn.neighbors import KNeighborsClassifier

# Initialize and train KNN model (SK_ID_CURR is an identifier, not a predictor)
feature_cols = [col for col in X_train.columns if col != 'SK_ID_CURR']
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train[feature_cols], y_train)

# Predict on the test set
y_pred_knn = knn_model.predict(X_test[feature_cols])

# Evaluate the KNN model
accuracy_knn = accuracy_score(y_test, y_pred_knn)
print(f"KNN Accuracy: {accuracy_knn:.4f}")
print("KNN Classification Report:")
print(classification_report(y_test, y_pred_knn))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred_knn)

# Plot Confusion Matrix
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", xticklabels=["Class 0", "Class 1"], yticklabels=["Class 0", "Class 1"])
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("KNN Confusion Matrix")
plt.show()
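
The grid search noted as a to-do in the KNN cell could look roughly like the sketch below. It runs on synthetic data from `make_classification` as a stand-in for the scaled training set. The parameter grid, the `roc_auc` scoring choice and the variable names are assumptions for illustration, not settings from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, mildly imbalanced stand-in for the scaled training data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, weights=[0.7, 0.3], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Candidate hyperparameters (hypothetical grid)
param_grid = {"n_neighbors": [3, 5, 11], "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, scoring="roc_auc", cv=3)
search.fit(X_tr, y_tr)

test_auc = search.score(X_te, y_te)  # uses the same roc_auc scorer
print(search.best_params_)
print(f"Test ROC AUC: {test_auc:.3f}")
```

ROC AUC is used here instead of accuracy because accuracy can look high on imbalanced targets even when few defaults are detected.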



In [None]:
### b. Random Classifier Model

In [None]:
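# Random forest with balanced class weights (SK_ID_CURR is excluded from the predictors)
feature_cols = [col for col in X_train.columns if col != 'SK_ID_CURR']
model_RFC = RandomForestClassifier(class_weight='balanced', random_state=42)
model_RFC.fit(X_train[feature_cols], y_train)
rfc_accuracy = accuracy_score(y_test, model_RFC.predict(X_test[feature_cols]))
print(f"Random Forest Accuracy: {rfc_accuracy:.4f}")

# Baseline for comparison: guess 0 with probability 0.7 and 1 with probability 0.3
rng = np.random.default_rng(42)
random_predictions = rng.choice([0, 1], size=y_test.shape[0], p=[0.7, 0.3])

# Calculate accuracy of the random classifier
random_accuracy = accuracy_score(y_test, random_predictions)
print(f"Random Classifier Accuracy: {random_accuracy:.4f}")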

In [None]:
# #Creating an ROC curve
# from sklearn.metrics import roc_curve, auc
# fpr, tpr, thresholds = roc_curve(y_train, y_train_preds_proba)
# roc_auc = auc(fpr, tpr)

# plt.plot(fpr, tpr)
# plt.plot([0,1], [0,1])
# plt.xlabel('False positive Rate')
# plt.ylabel('True Positive Rate')
# plt.title('ROC curve')
# plt.show()

In [None]:
## Exploring Bureau Dataset

In [None]:
### Merging other Dataframes

The bureau_df has a column named CREDIT_ACTIVE, which records the status of each of an applicant's previous credits. Every applicant seems to have at least two credit records, either active or closed. We collapse these rows into one per applicant and merge the result with the main dataset. CREDIT_ACTIVE seems to be an important indicator of whether an applicant defaults, so we merge the datasets to improve accuracy.

In [None]:
print(bureau_df.isna().sum())

In [None]:
bureau_clean_df = bureau_df.copy()
dropna_over65(bureau_clean_df)

factorize_cat_cols(bureau_clean_df)
print(bureau_clean_df.shape)

imputing_na(bureau_clean_df)
print(bureau_clean_df.shape)



redundant_data(bureau_clean_df, None)  # the correlation matrix is recomputed inside the function
print(bureau_clean_df.shape)

df_capped = cap_outliers_sd(bureau_clean_df, exclude_columns = exclude_cols)
print(df_capped.shape)

bureau_clean_df = df_capped


In [None]:
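# Converting multiple records per ID into one based on the condition.
# The raw bureau_df is used here because factorize_cat_cols has already replaced
# the CREDIT_ACTIVE strings in bureau_clean_df with integer codes.
bureau_summarised_df = bureau_df.groupby("SK_ID_CURR")["CREDIT_ACTIVE"].apply(lambda x: "Active" if "Active" in x.values else "Closed").reset_index()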

# Display the result
print(bureau_summarised_df)
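
Collapsing several bureau records into one row per applicant can also carry more than a single flag. The sketch below uses a toy frame with invented values. The output columns `N_CREDITS`, `N_ACTIVE` and `TOTAL_CREDIT` are hypothetical names chosen for illustration.

```python
import pandas as pd

# Toy bureau records: several credits per applicant (illustrative only)
toy_bureau = pd.DataFrame({
    "SK_ID_CURR": [100001, 100001, 100002, 100002, 100003],
    "CREDIT_ACTIVE": ["Closed", "Active", "Closed", "Closed", "Active"],
    "AMT_CREDIT_SUM": [50_000.0, 120_000.0, 30_000.0, 45_000.0, 80_000.0],
})

# Flag active credits, then aggregate to one row per applicant
bureau_features = (
    toy_bureau.assign(IS_ACTIVE=toy_bureau["CREDIT_ACTIVE"].eq("Active"))
    .groupby("SK_ID_CURR")
    .agg(
        N_CREDITS=("CREDIT_ACTIVE", "size"),
        N_ACTIVE=("IS_ACTIVE", "sum"),
        TOTAL_CREDIT=("AMT_CREDIT_SUM", "sum"),
    )
    .reset_index()
)
print(bureau_features)
```

Because the result has exactly one row per `SK_ID_CURR`, it can be merged onto the application data without creating duplicate applicants.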


In [None]:
print(bureau_summarised_df[bureau_summarised_df.duplicated(subset=["SK_ID_CURR"])])

In [None]:
merged_bureau_df = pd.merge(bureau_summarised_df,bureau_clean_df, on = "SK_ID_CURR", how = "inner")
merged_bureau_df = merged_bureau_df.drop_duplicates(subset=["SK_ID_CURR"])
print(merged_bureau_df.shape)

In [None]:
print(X_train.describe())

In [None]:
duplicates = X_train.columns[X_train.columns.duplicated()]
print("Duplicate column names:", duplicates)
# If there are duplicates (e.g., two columns with the same name like 'SK_ID_CURR')
if len(duplicates) > 0:
    # Drop the second 'SK_ID_CURR' column
    X_train = X_train.loc[:, ~X_train.columns.duplicated()]

# Print the updated shape of X_train to confirm the change
print(X_train.shape)



In [None]:
new_df = pd.merge(merged_bureau_df, X_train, on = "SK_ID_CURR", how = "inner")
print(new_df.shape)

In [None]:
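# TODO: imp_features expects (X_train, X_test, y_train); new_df still needs a
# target column and a matching test split before this call can run
featured_df = imp_features(new_df)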
