# Loan Prediction – Feature Engineering and Model Building

This notebook transforms the cleaned dataset into a machine-learning-ready
format. It applies feature encoding, feature selection, and supervised
classification models to predict loan approval status.

## 1. Import Libraries and Load Cleaned Dataset

The cleaned loan prediction dataset is loaded for feature engineering
and model training.

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

In [2]:
df = pd.read_csv('../data/loan_data_cleaned.csv')
df.head()

| | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | 128.0 | 360.0 | 1.0 | Urban | Y |
| 1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| 3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
| 4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |


## 2. Encoding Categorical Variables

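Most scikit-learn estimators require numerical inputs, so the categorical
features and the target are label-encoded: each category is mapped to an
integer in sorted order (for example, Property_Area maps Rural to 0 and Urban to 2).
These integer codes imply an ordering that nominal categories do not actually have.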

In [3]:
le = LabelEncoder()
categorical_cols = [
    'Gender', 'Married', 'Dependents',
    'Education', 'Self_Employed', 'Property_Area'
]
for col in categorical_cols:
    df[col] = le.fit_transform(df[col])

# Encode target variable (N -> 0, Y -> 1)
df['Loan_Status'] = le.fit_transform(df['Loan_Status'])
df.head()

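| | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001002 | 1 | 0 | 0 | 0 | 0 | 5849 | 0.0 | 128.0 | 360.0 | 1.0 | 2 | 1 |
| 1 | LP001003 | 1 | 1 | 1 | 0 | 0 | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | 0 | 0 |
| 2 | LP001005 | 1 | 1 | 0 | 0 | 1 | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | 2 | 1 |
| 3 | LP001006 | 1 | 1 | 0 | 1 | 0 | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | 2 | 1 |
| 4 | LP001008 | 1 | 0 | 0 | 0 | 0 | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | 2 | 1 |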
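
Reusing one `LabelEncoder` for every column, as the cell above does, leaves only the last fitted mapping available afterwards. A minimal sketch of one alternative keeps a separate encoder per column so each mapping can be inspected or inverted later. The `toy` frame and the names `encoders`, `encoded`, and `area_mapping` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny illustrative frame; the values mimic the loan dataset but are made up.
toy = pd.DataFrame({
    "Property_Area": ["Urban", "Rural", "Semiurban", "Urban"],
    "Education": ["Graduate", "Graduate", "Not Graduate", "Graduate"],
})

# Keep one fitted encoder per column so every mapping stays available.
encoders = {}
encoded = toy.copy()
for col in toy.columns:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# LabelEncoder assigns integer codes in sorted order of the labels.
area_mapping = {
    str(label): code
    for code, label in enumerate(encoders["Property_Area"].classes_)
}
print(area_mapping)  # {'Rural': 0, 'Semiurban': 1, 'Urban': 2}

# The stored encoder can also map codes back to the original labels.
restored = encoders["Property_Area"].inverse_transform(encoded["Property_Area"])
```

For linear models, one-hot encoding (for example with `pd.get_dummies` or `OneHotEncoder`) is a common alternative for nominal columns, since it avoids implying an order among categories.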


## 3. Feature and Target Separation

The dataset is split into input features (X) and target variable (y).

In [4]:
X = df.drop('Loan_Status', axis=1)
y = df['Loan_Status']


X.shape, y.shape

((614, 12), (614,))

## 4. Feature Selection using SelectKBest

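SelectKBest scores each feature against the target with the chi-square test and
keeps the k highest-scoring ones. With k=10 and 11 candidate features (after
dropping Loan_ID), only Self_Employed is removed, so the reduction in
dimensionality is modest. Note that chi2 requires non-negative feature values,
and that the selector is fitted here on the full dataset before the train-test
split, so the test rows influence which features are kept.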

In [5]:
# Drop identifier column (not useful for modeling)
X = X.drop('Loan_ID', axis=1)

selector = SelectKBest(score_func=chi2, k=10)
X_selected = selector.fit_transform(X, y)

selected_features = X.columns[selector.get_support()]
selected_features

Index(['Gender', 'Married', 'Dependents', 'Education', 'ApplicantIncome',
       'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term', 'Credit_History',
       'Property_Area'],
      dtype='object')
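
One way to keep test rows out of the selection step is to split first and fit the selector on the training portion only. The sketch below illustrates this on synthetic, non-negative data and also prints the chi-square scores behind the selection. The data and all names (`X_toy`, `y_toy`, `toy_selector`, `scores`) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split

# Synthetic, non-negative data: one column tracks the target, two are noise.
rng = np.random.default_rng(0)
n = 200
y_toy = rng.integers(0, 2, size=n)
X_toy = pd.DataFrame({
    "informative": y_toy * 3 + rng.integers(0, 2, size=n),
    "noise_a": rng.integers(0, 5, size=n),
    "noise_b": rng.integers(0, 5, size=n),
})

# Split first, so the held-out rows play no part in feature selection.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Fit the selector on the training rows only, then reuse it on the test rows.
toy_selector = SelectKBest(score_func=chi2, k=2)
X_tr_sel = toy_selector.fit_transform(X_tr, y_tr)
X_te_sel = toy_selector.transform(X_te)

# scores_ holds the chi-square statistic for every candidate feature.
scores = pd.Series(toy_selector.scores_, index=X_toy.columns)
print(scores.sort_values(ascending=False))
```

Because the chi-square statistic grows with the magnitude of a feature, large-valued columns such as incomes can dominate the ranking; inspecting `scores_` makes that visible.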

## 5. Train-Test Split

The selected features are split into training (80%) and testing (20%) sets
so that model generalization can be evaluated on rows not seen during training.

In [6]:
X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y, test_size=0.2, random_state=42
)

X_train.shape, X_test.shape

((491, 10), (123, 10))
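
When the two classes are imbalanced, a plain random split can shift the class ratio between the training and testing sets. A hedged sketch of a stratified split, using the `stratify` parameter of `train_test_split` on made-up labels (the arrays `X_demo` and `y_demo` are illustrative, not the loan data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 70 positives and 30 negatives.
y_demo = np.array([1] * 70 + [0] * 30)
X_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo preserves the class ratio in both partitions.
X_a, X_b, y_a, y_b = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

train_positive_share = y_a.mean()
test_positive_share = y_b.mean()
print(train_positive_share, test_positive_share)
```

Both shares stay at the 70% rate of the full label array, which keeps accuracy figures on the test set comparable to the overall class balance.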

## 6. Logistic Regression Model

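Logistic Regression is trained as a baseline classification model
for loan approval prediction. The features are not scaled, and the default
lbfgs solver reaches the max_iter=1000 limit without converging, as the
warning in the output shows; the reported accuracy therefore comes from a
model that has not fully converged.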

In [7]:
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)

log_reg_accuracy = log_reg.score(X_test, y_test)
log_reg_accuracy

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=1000).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.7886178861788617
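
The warning above suggests scaling the data. A minimal sketch of that suggestion wraps `StandardScaler` and `LogisticRegression` in a pipeline, so the scaler is fitted on the training rows only. The data here is synthetic (an income-like column plus a binary flag), and every name in the block is an assumption made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: a large-valued income-like column and a binary flag.
rng = np.random.default_rng(42)
n_rows = 300
income = rng.normal(5000, 2000, size=n_rows).clip(min=0)
credit = rng.integers(0, 2, size=n_rows)
X_demo = np.column_stack([income, credit])

# The label follows the binary flag about 85% of the time.
agree = rng.random(n_rows) < 0.85
y_demo = np.where(agree, credit, 1 - credit)

X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline standardizes each feature before the solver sees it.
scaled_log_reg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_log_reg.fit(X_fit, y_fit)

# n_iter_ reports how many solver iterations were actually used.
n_iterations = int(scaled_log_reg.named_steps["logisticregression"].n_iter_[0])
holdout_accuracy = scaled_log_reg.score(X_hold, y_hold)
print(n_iterations, holdout_accuracy)
```

Applied to the notebook, the same pipeline could be fitted on `X_train` and scored on `X_test`; whether it changes the accuracy reported above would need to be checked by running it.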

## 7. Decision Tree Classifier

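A Decision Tree model is trained to capture non-linear
relationships in the dataset. No depth or leaf-size limit is set, so the
tree is free to grow until it fits the training data closely.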

In [8]:
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)

dt_accuracy = dt_model.score(X_test, y_test)
dt_accuracy

0.6991869918699187
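
An unconstrained tree often memorizes the training set, which can show up as a gap between training and test accuracy. The sketch below compares an unlimited tree with one capped at `max_depth=3` on synthetic data with noisy labels. The data, the depth value, and the names `X_syn`, `y_syn`, and `results` are illustrative assumptions, not tuned settings for the loan dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the label depends on the first feature, with 20% label noise.
rng = np.random.default_rng(0)
n = 400
X_syn = rng.normal(size=(n, 5))
clean = (X_syn[:, 0] > 0).astype(int)
flip = rng.random(n) < 0.2
y_syn = np.where(flip, 1 - clean, clean)

X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

# Record (training accuracy, hold-out accuracy) for each depth setting.
results = {}
for name, depth in (("unlimited", None), ("max_depth=3", 3)):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_fit, y_fit)
    results[name] = (tree.score(X_fit, y_fit), tree.score(X_hold, y_hold))

for name, (train_acc, hold_acc) in results.items():
    print(f"{name}: train={train_acc:.3f}, hold-out={hold_acc:.3f}")
```

A large gap between the two numbers for the unlimited tree is a typical sign of overfitting; comparing `dt_model.score(X_train, y_train)` with `dt_accuracy` would show whether that applies here.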

## Conclusion

In this notebook, categorical features were label-encoded and 10 features
were selected using SelectKBest. Two supervised learning models were trained
and evaluated on the 123-row test set: Logistic Regression reached an accuracy
of about 0.789 and the Decision Tree Classifier about 0.699.
The model performances will be further analyzed using detailed evaluation
metrics in the next phase.