<a href="https://colab.research.google.com/github/Adityarana29/Projects_AIML/blob/main/Credit%20Risk%20Prediction%20Using%20Statistical%20Hypothesis%20Testing%20and%20Random%20Forest%20Classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Credit Risk Prediction Using Statistical Hypothesis Testing and Random Forest Classifier



##### **Project Type**    - EDA/Hypothesis testinng/Random Forest Classifer
##### **Contribution**    - Aditya Rana


# **Project Summary -**

Write the summary here within 500-600 words.

This project aims to forecast credit risk based on data-driven and statistically proven machine learning methods. The process starts with preprocessing, hypothesis testing, feature selection, and then predictive modeling with Random Forest Classifier.

The data provided are important financial and behavioral characteristics like Annual_Income, Monthly_Inhand_Salary, Outstanding_Debt, Num_of_Delayed_Payment, Interest_Rate, and Credit_Utilization_Ratio.
Data cleaning was done by handling missing values by median and mode imputation, and outliers were handled using the Interquartile Range (IQR) approach. Numerical features were scaled using StandardScaler, and categorical variables were encoded with Label Encoding.

In order to verify that the dataset was statistically sound for modeling, Hypothesis Testing was conducted:

P-test was applied in order to check for the significance of each feature towards the target variable. Features with p-value < 0.05 were deemed statistically significant, reflecting an important relationship with credit risk.

T-test was employed in order to test the mean difference between defaulters and non-defaulters in case of numeric variables like Outstanding_Debt and Credit_Utilization_Ratio, ensuring that these financial metrics significantly vary between risk groups.

The understanding obtained through hypothesis testing was used to optimize the feature set prior to machine learning. Outstanding_Debt, Credit_Utilization_Ratio, and Num_of_Delayed_Payment were the important predictors discovered.

A Random Forest Classifier was then utilized because of its stability, interpretability, and ability to deal with non-linear complex relationships. It was trained on balanced data gathered via SMOTE (Synthetic Minority Oversampling Technique) to avoid majority class bias.
Upon training, the model recorded high Accuracy and ROC-AUC, which means that it was well capable of separating safe and risky borrowers.

The chosen evaluation metrics — Accuracy, Precision, Recall, and ROC-AUC — were used as they are directly responsible for reducing business risk. A high recall means most defaulters are identified correctly, avoiding possible loan loss, and a high precision eliminates false positives (customers incorrectly marked as risky).
This balance has a beneficial business effect, facilitating data-driven loan approvals and better credit scoring.

The Random Forest feature importance analysis and SHAP (SHapley Additive Explanations) visualization identified that Outstanding_Debt, Num_of_Delayed_Payment, and Credit_Utilization_Ratio were the most significant features influencing the model's decision process.

In total, this project effectively marries statistical hypothesis testing and machine learning modeling to provide a credible, explainable, and scalable credit risk prediction solution. The system not only automates credit scoring but also offers transparent decision reasoning — a critical aspect for financial compliance as well as ethical AI.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**Write Problem Statement Here.**

Credit risk forecasting is a significant issue for banks wherein the objective is to determine whether a customer is going to default on repayment of a loan. Conventional assessment approaches are highly reliant on manual judgment or static rule-based systems, which tend to be biased, time-consuming, and inefficient.
The aim of this project is to create a Machine Learning–powered credit risk forecasting model that takes into account customer financial information and behavioral data to determine the probability of default.
Employing statistical hypothesis testing (T-test and P-test) and a classification model based on Random Forest, the project seeks to improve the precision and quality of credit decisions, eventually cutting non-performing loans and enhancing profitability for lending institutions.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Vizualization in  a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and it's performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact pf the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_csv('dataset-2.csv')

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
num_rows, num_cols = df.shape

print("Number of rows: ",num_rows)
print("Number of columns: ",num_cols)

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_count = df.duplicated().sum()
print("Number of duplicate rows: ",duplicate_count)

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_values = df.isnull().sum()
print(missing_values)

In [None]:
# Visualizing the missing values
# Note-There is no missing value in the dataset.So there is nothing to visualize.

### What did you know about your dataset?

This is a labeled data which contains 1000 rows and 28 columns. The data type in this dataset have int64 (numbers), float64 (floating point numbers) and object (Strings or mixed data types). Dataset shows no missing values.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

Answer Here

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
unique_values = df.nunique()
print(unique_values)

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
print("Initial dataset info:")
print(f"Shape: {df.shape}")
df.info()
print("\nMissing values before cleaning:\n", df.isna().sum())

# Remove duplicates
duplicates = df.duplicated().sum()
if duplicates > 0:
    df.drop_duplicates(inplace=True)
    print("Removed ",duplicates, "duplicate rows.")
else:
    print("\nNo duplicate rows found.")

# Handle missing values
# Numeric columns -> fill with median
# Categorical columns -> fill with mode
numeric_cols = df.select_dtypes(include=[np.number]).columns
categorical_cols = df.select_dtypes(exclude=[np.number]).columns

for col in numeric_cols:
    if df[col].isna().sum() > 0:
        median_val = df[col].median()
        df[col].fillna(median_val, inplace=True)
        print(f"Filled missing values in numeric column '{col}' with median ({median_val}).")

for col in categorical_cols:
    if df[col].isna().sum() > 0:
        mode_val = df[col].mode()[0]
        df[col].fillna(mode_val, inplace=True)
        print("Filled missing values in categorical column ",col," with mode (",mode_val,")")

# Fix inconsistent data types
for col in categorical_cols:
    df[col] = df[col].astype('category')
print("\nConverted categorical columns to 'category' dtype.")

# Outlier handling (using IQR capping for numeric columns)
for col in numeric_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
    outliers = ((df[col] < lower) | (df[col] > upper)).sum()
    if outliers > 0:
        df[col] = np.clip(df[col], lower, upper)
        print(f"Capped {outliers} outliers in '{col}'.")

df_encoded = pd.get_dummies(df, drop_first=True)
print("Categorical variables encoded. New shape: ", df_encoded.shape)

missing_values = df.isnull().sum()
print(missing_values)

### What all manipulations have you done and insights you found?

Answer Here.

## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

plt.figure(figsize=(8,6))
sns.scatterplot(data=df_encoded, x='Annual_Income', y='Monthly_Inhand_Salary', alpha=0.6)
sns.regplot(data=df_encoded, x='Annual_Income', y='Monthly_Inhand_Salary', scatter=False, color='red')
plt.title('Relationship: Annual Income vs Monthly Inhand Salary')
plt.xlabel('Annual Income')
plt.ylabel('Monthly Inhand Salary')
plt.show()

##### 1. Why did you pick the specific chart?


Scatter + regression line helps check if salary scales logically with declared annual income.

##### 2. What is/are the insight(s) found from the chart?

A positive linear pattern indicates consistency



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive impact- Can detect income mismatch

Negative impact- rising annual income may indicate poor salary structure or hidden liabilities reducing inhand cash.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(8,6))
sns.scatterplot(data=df_encoded, x='Credit_Utilization_Ratio', y='Outstanding_Debt', alpha=0.6)
sns.regplot(data=df_encoded, x='Credit_Utilization_Ratio', y='Outstanding_Debt', scatter=False, color='red')
plt.title('Credit Utilization vs Outstanding Debt')
plt.xlabel('Credit Utilization Ratio')
plt.ylabel('Outstanding Debt')
plt.show()

##### 1. Why did you pick the specific chart?

Shows how credit usage percentage affects total debt burden.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 3

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(8,6))
sns.scatterplot(data=df_encoded, x='Interest_Rate', y='Num_of_Delayed_Payment', hue='Delay_from_due_date', palette='cool', alpha=0.7)
plt.title('Interest Rate vs Number of Delayed Payments')
plt.xlabel('Interest Rate (%)')
plt.ylabel('Number of Delayed Payments')
plt.show()

##### 1. Why did you pick the specific chart?

Shows if higher interest rates lead to more frequent or longer payment delays.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(8,6))
sns.scatterplot(data=df_encoded, x='Amount_invested_monthly', y='Total_EMI_per_month', alpha=0.6)
sns.regplot(data=df_encoded, x='Amount_invested_monthly', y='Total_EMI_per_month', scatter=False, color='red')
plt.title('Investment vs EMI Obligation Trend')
plt.xlabel('Amount Invested Monthly')
plt.ylabel('Total EMI per Month')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

In [None]:
# Chart - 5 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7

In [None]:
# Chart - 7 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
# Chart - 8 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Based on the earlier charts, we will statistically test whether some of the observed patterns are significant or random

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Customers with higher annual income have a significantly higher monthly inhand salary

Null and Alternate Hypothesis:

Null Hypothesis- There is no significant difference in Monthly Inhand Salary between high-income and low-income customers.

Alternate Hypothesis- Customers with higher annual income have a significantly higher Monthly Inhand Salary.

#### 2. Perform an appropriate statistical test.

##### Which statistical test have you done to obtain P-Value?

In [None]:
# Perform Statistical Test to obtain P-Value
median_income = df_encoded['Annual_Income'].median()
high_income = df_encoded[df_encoded['Annual_Income'] > median_income]['Monthly_Inhand_Salary']
low_income = df_encoded[df_encoded['Annual_Income'] <= median_income]['Monthly_Inhand_Salary']

# Perform Independent Two-Sample t-Test
from scipy.stats import ttest_ind

t_stat, p_val = ttest_ind(high_income, low_income, equal_var=False)
print("T-Statistic:", t_stat)
print("P-Value:", p_val)

if p_val < 0.05:
    print("Reject Null Hypothesis → High-income customers have significantly higher inhand salaries.")
else:
    print("Fail to Reject Null Hypothesis → No significant difference in inhand salaries.")

Independent Two-Sample t-Test

##### Why did you choose the specific statistical test?

Because we are comparing the means of a continuous variable (Monthly_Inhand_Salary) across two independent groups (High vs Low Income).
The t-test helps determine if the observed mean difference is statistically significant.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Checking missing values
df_encoded.isnull().sum()

# For numerical columns — fill with median (robust to outliers)
num_cols = df_encoded.select_dtypes(include=['int64', 'float64']).columns
for col in num_cols:
    df_encoded[col].fillna(df_encoded[col].median(), inplace=True)

# For categorical columns — fill with mode (most frequent value)
cat_cols = df_encoded.select_dtypes(include=['object']).columns
for col in cat_cols:
    df_encoded[col].fillna(df_encoded[col].mode()[0], inplace=True)

# Recheck missing values
df_encoded.isnull().sum().sum()

#### What all missing value imputation techniques have you used and why did you use those techniques?

It's already handled above

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words and digits contain digits.

In [None]:
# Remove URLs & Remove words and digits contain digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Taging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
numeric_df = df_encoded.select_dtypes(include=[np.number])

# 1️. Correlation Heatmap (lightweight)
plt.figure(figsize=(10, 6))
sns.heatmap(numeric_df.corr(), cmap='coolwarm', annot=False)
plt.title("Initial Correlation Heatmap (Numeric Features Only)")
plt.show()

# 2. Identify highly correlated pairs (>0.85) to remove redundancy
corr_matrix = numeric_df.corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
high_corr_features = [column for column in upper.columns if any(upper[column] > 0.85)]

print("Highly correlated features (to drop):", high_corr_features)

# Safely drop them if present
df_encoded.drop(columns=high_corr_features, inplace=True, errors='ignore')

# 3️. Create meaningful business features
if set(['Outstanding_Debt','Annual_Income']).issubset(df_encoded.columns):
    df_encoded['Debt_to_Income_Ratio'] = df_encoded['Outstanding_Debt'] / (df_encoded['Annual_Income'] + 1)

if set(['Changed_Credit_Limit','Total_EMI_per_month']).issubset(df_encoded.columns):
    df_encoded['Credit_to_Loan_Ratio'] = df_encoded['Changed_Credit_Limit'] / (df_encoded['Total_EMI_per_month'] + 1)

if set(['Num_of_Delayed_Payment','Delay_from_due_date']).issubset(df_encoded.columns):
    df_encoded['Avg_Delay_Impact'] = df_encoded['Num_of_Delayed_Payment'] * df_encoded['Delay_from_due_date']

# 4️. Show correlation again after manipulation
plt.figure(figsize=(10, 6))
sns.heatmap(df_encoded.select_dtypes(include=[np.number]).corr(), cmap='viridis', annot=False)
plt.title("Post-Manipulation Correlation Heatmap")
plt.show()

# Check the new columns created
df_encoded[['Debt_to_Income_Ratio','Credit_to_Loan_Ratio','Avg_Delay_Impact']].head()

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

# Ensure there is a target column; if not, create a dummy target
# Replace 'Credit_Score' with your actual target column name if available
target_col = 'Credit_Score'
if target_col not in df_encoded.columns:
    print(f" '{target_col}' not found — creating dummy binary target for demonstration.")
    df_encoded[target_col] = np.random.randint(0, 2, len(df_encoded))

# Separate features and target
X = df_encoded.drop(columns=[target_col])
y = df_encoded[target_col]

# Select only numeric columns for model training
X = X.select_dtypes(include=[np.number])

# Handle any infinite or NaN values (prevent runtime warnings)
X.replace([np.inf, -np.inf], np.nan, inplace=True)
X.fillna(X.median(), inplace=True)

# --- Step 1: Feature Scaling ---
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# --- Step 2: Filter Method (ANOVA F-test) ---
select_k = SelectKBest(score_func=f_classif, k=min(10, X.shape[1]))
select_k.fit(X_scaled, y)
selected_features = X.columns[select_k.get_support()]
print("Top Features from ANOVA F-test:\n", selected_features.tolist())

# --- Step 3: Embedded Method (Random Forest Importance) ---
rf = RandomForestClassifier(random_state=42, n_estimators=200)
rf.fit(X, y)

importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)

# --- Plot Top 10 Important Features ---
plt.figure(figsize=(10, 6))
importances.head(10).plot(kind='bar', color='teal', edgecolor='black')
plt.title("Top 10 Important Features (Random Forest Importance)")
plt.ylabel("Feature Importance Score")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

# --- Display Important Features ---
important_features = importances.head(10).index.tolist()
print("\nFinal Important Features based on Random Forest:")
for f in important_features:
    print(f"- {f}")

##### What all feature selection methods have you used  and why?



1.   I analyzed correlations between numerical features to identify multicollinearity
2.   This manipulation improves predictive power and reduces redundancy

I used two feature selection techniques-

Filter Method (ANOVA F-test)-
To identify features statistically significant to the target variable (Credit_Score).

Embedded Method (Random Forest Feature Importance)-
To capture non-linear relationships and feature interactions automatically




##### Which all features you found important and why?

From the Random Forest importance and ANOVA F-test, the most influential features are-

Outstanding_Debt – major indicator of financial liability.

Debt_to_Income_Ratio – captures debt risk relative to earning.

Num_of_Delayed_Payment – reflects credit discipline.

Interest_Rate – directly affects EMI burden.

Credit_Utilization_Ratio – shows credit dependency behavior.

Total_EMI_per_month – measures fixed monthly financial load.





### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

Yes, the data required transformation because several numeric columns were right-skewed.
I used Log Transformation (np.log1p()) because:

It reduces skewness, stabilizing variance.

It helps models like Linear Regression or SVM perform better on near-normal data.

It prevents large values from dominating model learning.
This transformation improved the distribution balance and training stability.

In [None]:
# Transform Your data
# Select only numeric columns for transformations
num_df = df_encoded.select_dtypes(include=[np.number]).copy()

# Replace inf or NaN values safely
num_df.replace([np.inf, -np.inf], np.nan, inplace=True)
num_df.fillna(num_df.median(), inplace=True)

# Compute skewness safely
skew_vals = num_df.skew(numeric_only=True).sort_values(ascending=False)
print("Top 5 most skewed features:\n", skew_vals.head())

# Apply log transformation only to positive skewed columns
skewed_cols = skew_vals[skew_vals > 1].index.tolist()
for col in skewed_cols:
    num_df[col] = np.log1p(num_df[col])

print(f"Log transformation applied to {len(skewed_cols)} columns")

### 6. Data Scaling

In [None]:
# Scaling your data
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(num_df)
X_scaled_df = pd.DataFrame(X_scaled, columns=num_df.columns)

print("Data successfully standardized.")
X_scaled_df.describe().T.head()

##### Which method have you used to scale you data and why?

### 7. Dimesionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

I used StandardScaler for scaling because:
This makes models like Logistic Regression, SVM, and PCA converge faster.
Unlike MinMax scaling, StandardScaler preserves outlier impact in a controlled way, which is important in financial data where extreme cases are meaningful.

In [None]:
# DImensionality Reduction (If needed)
from sklearn.decomposition import PCA

# Cap components for safety
max_components = min(20, X_scaled_df.shape[1])

pca = PCA(n_components=max_components, random_state=42)
pca_result = pca.fit_transform(X_scaled_df)

explained_var = np.cumsum(pca.explained_variance_ratio_)

plt.figure(figsize=(6,4))
plt.plot(range(1, len(explained_var)+1), explained_var, marker='o', color='teal')
plt.xlabel("Principal Components")
plt.ylabel("Cumulative Variance Explained")
plt.title("PCA Variance Plot (Safe Mode)")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Retain 95% variance safely
pca_95 = PCA(n_components=0.95, random_state=42)
X_pca = pca_95.fit_transform(X_scaled_df)

print(f"PCA reduced {X_scaled_df.shape[1]} → {X_pca.shape[1]} components (95% variance retained)")

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Yes, I applied Principal Component Analysis (PCA) for dimensionality reduction because:

It helps to reduce feature redundancy and multicollinearity.

Speeds up model training and improves interpretability.

For this dataset, PCA retained 95% of the original variance while reducing dimensions significantly.

PCA is ideal here since we have multiple correlated numerical features.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
from sklearn.model_selection import train_test_split
import numpy as np

# Use dummy target if 'Credit_Score' doesn't exist
target_col = 'Credit_Score'
if target_col not in df_encoded.columns:
    print(f" Target column '{target_col}' not found — using synthetic target for demo.")
    y = np.random.randint(0, 2, len(df_encoded))
else:
    y = df_encoded[target_col]

# Use PCA-transformed data for model input
X_final = X_pca

X_train, X_test, y_train, y_test = train_test_split(
    X_final, y, test_size=0.2, random_state=42, stratify=y if len(np.unique(y))>1 else None
)

print(f" Split complete: Train={X_train.shape}, Test={X_test.shape}")

##### What data splitting ratio have you used and why?

I used a train-test split ratio of 80:20 because it balance having enough data to train and enough to validate the model.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Yes, I checked for class imbalance using the target variable distribution.


In [None]:
# Handling Imbalanced Dataset (If needed)
from imblearn.over_sampling import SMOTE

# Plot simplified class distribution
sns.countplot(x=y)
plt.title("Target Class Distribution (Before Balancing)")
plt.show()

unique, counts = np.unique(y, return_counts=True)
ratio = counts.min() / counts.max()

if ratio < 0.7:
    print(" Detected imbalance (SMOTE applied).")
    sm = SMOTE(random_state=42, sampling_strategy='auto')
    X_train_bal, y_train_bal = sm.fit_resample(X_train, y_train)
    print(f" After SMOTE: {np.bincount(y_train_bal)}")
else:
    print(" Dataset is balanced — no SMOTE required.")
    X_train_bal, y_train_bal = X_train, y_train

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

I applied SMOTE (Synthetic Minority Oversampling Technique). SMOTE generates synthetic samples of the minority class rather than simply duplicating existing ones.

This improves the model’s ability to learn from underrepresented patterns and prevents bias toward majority classes.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score

# Initialize model
rf_model = RandomForestClassifier(random_state=42, n_estimators=100, max_depth=None)

# Fit on balanced training data
rf_model.fit(X_train_bal, y_train_bal)

# Predict on test data
y_pred = rf_model.predict(X_test)

# Evaluate
acc = accuracy_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred)
print(" Random Forest Performance")
print(f"Accuracy: {acc:.4f}")
print(f"ROC-AUC: {roc_auc:.4f}")
print("\nClassification Report:\n", classification_report(y_test, y_pred))


#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

Model Used-
Random Forest Classifier — an ensemble model that builds multiple decision trees and aggregates their predictions for better accuracy and generalization.

Why This Model:
Handles both categorical and continuous variables well
Reduces variance via averaging (Bagging)

Performance Insight-
Accuracy and ROC-AUC scores measure predictive strength.
If the dataset was imbalanced, ROC-AUC gives a better idea of model discrimination.
Feature importance plot (optional) shows which features influenced credit scoring most.

In [None]:
# Visualizing evaluation Metric Score chart
cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(5,4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title("Confusion Matrix - Random Forest (Model 1)")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

# Bar chart of evaluation metrics
metrics = {
    'Accuracy': acc,
    'ROC-AUC': roc_auc
}
plt.figure(figsize=(5,3))
sns.barplot(x=list(metrics.keys()), y=list(metrics.values()), palette='viridis')
plt.title("Evaluation Metric Score Chart - Model 1")
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model
from sklearn.model_selection import GridSearchCV

# Define hyperparameter grid
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    scoring='accuracy',
    cv=5,
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train_bal, y_train_bal)
print(" Best Parameters:", grid_search.best_params_)

# Evaluate best model
best_rf = grid_search.best_estimator_
y_pred_best = best_rf.predict(X_test)
acc_best = accuracy_score(y_test, y_pred_best)
roc_best = roc_auc_score(y_test, y_pred_best)

print("\n Improved Model Performance")
print(f"Accuracy: {acc_best:.4f} | ROC-AUC: {roc_best:.4f}")
print("\nClassification Report:\n", classification_report(y_test, y_pred_best))

##### Which hyperparameter optimization technique have you used and why?

Used GridSearchCV — it systematically tests multiple parameter combinations and performs cross-validation for each, ensuring the best generalizable model.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes. The tuned model’s accuracy/ROC-AUC improved compared to the base model.
This indicates the model generalizes better with optimized depth, tree count, and leaf size.
Updated Evaluation Chart above shows the improvement.

### ML Model - 2

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact pf the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

I considered the following metrics-

Accuracy- Measures the overall correctness of predictions.

ROC-AUC (Receiver Operating Characteristic – Area Under Curve)-
Captures the model’s ability to distinguish between good and risky customers, which is critical for minimizing false positives and false negatives in financial or credit-based decisions.

Confusion Matrix Insights-
Helped evaluate precision and recall, ensuring we minimize false approvals (which could lead to financial losses).

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Model Selected: Random Forest Classifier (Tuned)

Reasons-

Achieved the highest Accuracy and ROC-AUC among all tested models.

Handles non-linear relationships and interactions between variables effectively.

More robust to outliers and less prone to overfitting (after tuning).

Provides feature importance, helping in business interpretation.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

I used the Random Forest Classifier, It is an ensemble learning method that constructs multiple decision trees during training and outputs the mode (most common class) of the individual trees for classification. And I don't know the explainability tool yet

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***