### Important Features
Based on the insights gained from our EDA, the following features were identified as important for classifying customers as low-risk or high-risk:

- Experience
- Current Job Years (CURRENT_JOB_YRS)
- House Ownership
- State (STATE)
- Income
- Age
- Current House Years (CURRENT_HOUSE_YRS)

#### We keep only one of Experience and Current Job Years (CURRENT_JOB_YRS), because the two features are highly correlated.
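
One way to check this kind of redundancy is to compute the pairwise correlation and drop one column of the pair. The sketch below uses a small made-up DataFrame (the values are illustrative and not taken from the loan dataset), and the 0.6 threshold is an arbitrary assumption rather than a value from the EDA.

```python
import pandas as pd

# Toy data (hypothetical values): job years roughly track experience
toy = pd.DataFrame({
    "Experience":      [3, 10, 4, 2, 11, 0, 14, 17],
    "CURRENT_JOB_YRS": [3,  9, 4, 2,  3, 0,  8, 12],
})

# Pearson correlation between the two columns
corr = toy["Experience"].corr(toy["CURRENT_JOB_YRS"])
print(f"correlation = {corr:.2f}")

# Assumed threshold: above it, keep only one of the two features
THRESHOLD = 0.6
reduced = toy.drop(columns=["Experience"]) if abs(corr) > THRESHOLD else toy
print(list(reduced.columns))
```

On the real data, the same call on `df["Experience"]` and `df["CURRENT_JOB_YRS"]` would give the value that motivates dropping one of the two.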


In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler,OneHotEncoder
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

In [2]:
df = pd.read_json("loan_approval_dataset.json.zip")
df.head(3)

Unnamed: 0,Id,Income,Age,Experience,Married/Single,House_Ownership,Car_Ownership,Profession,CITY,STATE,CURRENT_JOB_YRS,CURRENT_HOUSE_YRS,Risk_Flag
0,1,1303834,23,3,single,rented,no,Mechanical_engineer,Rewa,Madhya_Pradesh,3,13,0
1,2,7574516,40,10,single,rented,no,Software_Developer,Parbhani,Maharashtra,9,13,0
2,3,3991815,66,4,married,rented,no,Technical_writer,Alappuzha,Kerala,4,10,0


# Select Important Features
- We keep the features that the EDA showed to be relevant, plus the target column Risk_Flag.

In [3]:
important_features = ['CURRENT_JOB_YRS', 'House_Ownership', 'STATE', 'Income', 'Age', 'CURRENT_HOUSE_YRS', 'Risk_Flag']

data_selected = df[important_features]

In [4]:
data_selected.head(3)

Unnamed: 0,CURRENT_JOB_YRS,House_Ownership,STATE,Income,Age,CURRENT_HOUSE_YRS,Risk_Flag
0,3,rented,Madhya_Pradesh,1303834,23,13,0
1,9,rented,Maharashtra,7574516,40,13,0
2,4,rented,Kerala,3991815,66,10,0


In [5]:
data_selected.info()

<class 'pandas.core.frame.DataFrame'>
Index: 252000 entries, 0 to 251999
Data columns (total 7 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   CURRENT_JOB_YRS    252000 non-null  int64 
 1   House_Ownership    252000 non-null  object
 2   STATE              252000 non-null  object
 3   Income             252000 non-null  int64 
 4   Age                252000 non-null  int64 
 5   CURRENT_HOUSE_YRS  252000 non-null  int64 
 6   Risk_Flag          252000 non-null  int64 
dtypes: int64(5), object(2)
memory usage: 15.4+ MB


In [6]:
data_selected.isnull().sum()   # count missing values per column (none are missing)

CURRENT_JOB_YRS      0
House_Ownership      0
STATE                0
Income               0
Age                  0
CURRENT_HOUSE_YRS    0
Risk_Flag            0
dtype: int64

# Feature Engineering

In [7]:
# A high income-to-age ratio may indicate financial security, which may correlate with lower risk.
# Add the Income_Per_Age feature; the +1 in the denominator guards against division by zero.
# assign() returns a new DataFrame, so pandas does not raise SettingWithCopyWarning here.
data_selected = data_selected.assign(
    Income_Per_Age=data_selected['Income'] / (data_selected['Age'] + 1)
)


In [8]:
# This feature is the ratio of years in the current job to years in the current house,
# which serves as a rough indicator of stability and consistency.
# Why: higher stability often correlates with lower risk.
# Add the Job_House_Stability feature; the +1 in the denominator guards against division by zero.
# assign() returns a new DataFrame, so pandas does not raise SettingWithCopyWarning here.
data_selected = data_selected.assign(
    Job_House_Stability=data_selected['CURRENT_JOB_YRS'] / (data_selected['CURRENT_HOUSE_YRS'] + 1)
)
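
As a quick sanity check, both ratio features can be recomputed by hand on the three rows printed by `df.head(3)`. The sketch below copies only the four columns involved; the variable name `rows` is just for illustration.

```python
import pandas as pd

# The three rows shown by df.head(3), restricted to the columns used by the ratios
rows = pd.DataFrame({
    "Income": [1303834, 7574516, 3991815],
    "Age": [23, 40, 66],
    "CURRENT_JOB_YRS": [3, 9, 4],
    "CURRENT_HOUSE_YRS": [13, 13, 10],
})

# Same formulas as in the cells above
rows = rows.assign(
    Income_Per_Age=rows["Income"] / (rows["Age"] + 1),
    Job_House_Stability=rows["CURRENT_JOB_YRS"] / (rows["CURRENT_HOUSE_YRS"] + 1),
)
print(rows[["Income_Per_Age", "Job_House_Stability"]].round(6))
```

For the first row this gives 1303834 / 24 ≈ 54326.42 and 3 / 14 ≈ 0.214, which agrees with the values that appear in `data_selected.head()` after encoding.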


# Encode Categorical Variables

In [9]:
def encode_categorical(data):
    # One-hot encode every object-typed (categorical) column
    cat_cols = data.select_dtypes(include=['object']).columns
    if len(cat_cols) > 0:
        # drop='first' removes one level per feature to avoid redundant columns
        encoder = OneHotEncoder(drop='first', sparse_output=False)
        encoded_cols = pd.DataFrame(
            encoder.fit_transform(data[cat_cols]),
            columns=encoder.get_feature_names_out(cat_cols),
            index=data.index
        )

        # Combine the encoded columns with the remaining (non-categorical) columns
        data = pd.concat([data.drop(columns=cat_cols), encoded_cols], axis=1)

    # Return the data unchanged when there is nothing to encode
    return data

# Apply encoding
data_selected = encode_categorical(data_selected)

In [10]:
data_selected.head(1)

Unnamed: 0,CURRENT_JOB_YRS,Income,Age,CURRENT_HOUSE_YRS,Risk_Flag,Income_Per_Age,Job_House_Stability,House_Ownership_owned,House_Ownership_rented,STATE_Assam,...,STATE_Punjab,STATE_Rajasthan,STATE_Sikkim,STATE_Tamil_Nadu,STATE_Telangana,STATE_Tripura,STATE_Uttar_Pradesh,STATE_Uttar_Pradesh[5],STATE_Uttarakhand,STATE_West_Bengal
0,3,1303834,23,13,0,54326.416667,0.214286,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [11]:
data_selected.head()

Unnamed: 0,CURRENT_JOB_YRS,Income,Age,CURRENT_HOUSE_YRS,Risk_Flag,Income_Per_Age,Job_House_Stability,House_Ownership_owned,House_Ownership_rented,STATE_Assam,...,STATE_Punjab,STATE_Rajasthan,STATE_Sikkim,STATE_Tamil_Nadu,STATE_Telangana,STATE_Tripura,STATE_Uttar_Pradesh,STATE_Uttar_Pradesh[5],STATE_Uttarakhand,STATE_West_Bengal
0,3,1303834,23,13,0,54326.416667,0.214286,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,9,7574516,40,13,0,184744.292683,0.642857,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,4,3991815,66,10,0,59579.328358,0.363636,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2,6256451,41,12,1,148963.119048,0.153846,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,3,5768871,47,14,1,120184.8125,0.2,0.0,1.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


# Split Data into Features and Target

In [12]:
# Separate features and target
X = data_selected.drop(columns=['Risk_Flag'])  # input variables
y = data_selected['Risk_Flag']  # target variable

In [13]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
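
Because the target is imbalanced, one option worth considering is a stratified split, which keeps the class ratio the same in both parts. This is a sketch on made-up data (80 low-risk and 20 high-risk rows), not a change to the split used above; `X_toy` and `y_toy` are hypothetical names.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 80 samples of class 0 and 20 samples of class 1
X_toy = pd.DataFrame({"feature": range(100)})
y_toy = pd.Series([0] * 80 + [1] * 20, name="Risk_Flag")

# stratify=y_toy preserves the 80/20 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print("train positive share:", y_tr.mean())
print("test positive share:", y_te.mean())
```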

# Scale Numerical Features

In [14]:
def scale_numerical(data):
    # Numerical columns to standardize; the one-hot columns are left untouched
    num_cols = ['CURRENT_JOB_YRS', 'Income', 'Age', 'CURRENT_HOUSE_YRS']

    # Work on a copy so the caller's DataFrame is not modified in place
    data = data.copy()

    # Standardize the selected columns to zero mean and unit variance
    scaler = StandardScaler()
    data[num_cols] = scaler.fit_transform(data[num_cols])

    return data

# Scale numerical features.
# Note: X_train and X_test were created in In [13], so they keep their unscaled values.
data_selected = scale_numerical(data_selected)
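
Since `X_train` and `X_test` are created before this cell runs, scaling `data_selected` does not change them. If scaled splits are wanted, a common pattern is to fit the scaler on the training split only and reuse it on the test split, so that no test statistics leak into training. A minimal sketch on toy values (the frames `train` and `test` are hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

num_cols = ["Income", "Age"]

# Toy splits with float columns (hypothetical values)
train = pd.DataFrame({"Income": [100.0, 200.0, 300.0], "Age": [20.0, 30.0, 40.0]})
test = pd.DataFrame({"Income": [200.0, 400.0], "Age": [30.0, 50.0]})

scaler = StandardScaler()
train_scaled = train.copy()
test_scaled = test.copy()

# Learn mean and standard deviation from the training split only
train_scaled[num_cols] = scaler.fit_transform(train[num_cols])
# Apply the same training statistics to the test split
test_scaled[num_cols] = scaler.transform(test[num_cols])

print(train_scaled)
print(test_scaled)
```

Note that the derived ratio features assume unscaled `Income` and `Age`, so in such a setup they would need to be computed before scaling.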


# Handle Class Imbalance with SMOTE

In [15]:
def balance_classes(X, y):
    # Oversample the minority class with synthetic samples
    smote = SMOTE(random_state=42)
    X_balanced, y_balanced = smote.fit_resample(X, y)
    return X_balanced, y_balanced

# Apply SMOTE to the training split only; the test split stays untouched
X_train, y_train = balance_classes(X_train, y_train)
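
SMOTE creates each synthetic sample by interpolating between a minority-class sample and one of its nearest minority-class neighbors. The sketch below shows only that interpolation step with NumPy; it is a simplified illustration, not imblearn's implementation (there is no neighbor search), and the two sample vectors are made up.

```python
import numpy as np

rng = np.random.default_rng(42)

# Two hypothetical minority-class samples:
# [Income, House_Ownership_owned, House_Ownership_rented]
x = np.array([50_000.0, 1.0, 0.0])
neighbor = np.array([70_000.0, 0.0, 1.0])

# Interpolation step: a random point on the segment between x and its neighbor
gap = rng.uniform(0.0, 1.0)
synthetic = x + gap * (neighbor - x)
print(synthetic)
```

The example also shows a side effect: one-hot columns of synthetic rows can take fractional values. If that matters for the model, imblearn's `SMOTENC` variant, which treats categorical features separately, may be worth evaluating.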

# Feature Engineering

In [16]:
def engineer_features(data):
    # Work on a copy so the input DataFrame is not modified in place
    data = data.copy()

    # Derived feature: income-to-age ratio (same formula as Income_Per_Age in In [7])
    if 'Income' in data.columns and 'Age' in data.columns:
        data['income_age_ratio'] = data['Income'] / (data['Age'] + 1)  # +1 avoids division by zero

    # Add more engineered features as needed
    return data

# Apply feature engineering to both splits
X_train = engineer_features(X_train)
X_test = engineer_features(X_test)


In [17]:
# Combine X_train (features) and y_train (target) after SMOTE into a single DataFrame
train_balanced = pd.DataFrame(X_train, columns=X.columns)
train_balanced['Risk_Flag'] = y_train

In [18]:
# Combine X_test and y_test for the test set
test_data = pd.DataFrame(X_test, columns=X.columns)
test_data['Risk_Flag'] = y_test
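
Note that `pd.DataFrame(X_train, columns=X.columns)` keeps only the columns listed in `X.columns`. Because `income_age_ratio` is added after `X` was defined, it is not part of `train_balanced` or `test_data`. The sketch below reproduces this behavior on a toy frame; the names `features` and `original_columns` are hypothetical.

```python
import pandas as pd

# Columns known before the extra feature was added
original_columns = ["Income", "Age"]

# Toy frame that gained a derived column afterwards
features = pd.DataFrame({"Income": [100, 200], "Age": [20, 30]})
features["income_age_ratio"] = features["Income"] / (features["Age"] + 1)

# Passing columns= selects only the listed columns from the existing frame
rebuilt = pd.DataFrame(features, columns=original_columns)
print(list(rebuilt.columns))
```

To keep a column that was added later, the frame can be used directly, or `columns=` can be given the extended column list.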

In [19]:
# Save training and testing datasets
train_balanced.to_csv('balanced_training_data_2.csv', index=False)
test_data.to_csv('test_data_2.csv', index=False)
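
A quick way to confirm that a saved file can be read back unchanged is a round trip through `read_csv`. The sketch below writes a small made-up frame to a temporary directory; the file name `example.csv` and the values are placeholders.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the saved training data (hypothetical values)
frame = pd.DataFrame({"Income": [1000, 2000], "Risk_Flag": [0, 1]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.csv")
    # index=False keeps the row index out of the file, as in the cell above
    frame.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(reloaded.equals(frame))
```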