# **Project Name**    - EMIPredict AI ( Intelligent Financial Risk Assessment Platform )



##### **Project Type**    - Classification + Regression (Supervised learning)
##### **Contribution**    - Individual
##### **Team Member 1**    - Chandraprakash kahar

# **Project Summary -**

### Financial Risk Assessment and EMI Prediction Platform using MLflow and Streamlit

The Financial Risk Assessment and EMI Prediction Platform is an end-to-end machine learning solution designed to improve loan decision-making and financial planning. The project addresses the growing issue of EMI defaults caused by poor risk assessment by developing a data-driven platform capable of predicting EMI eligibility and estimating the maximum affordable EMI for individuals.

Built using a dataset of 400,000 financial records with 22 demographic and economic variables, the system performs two ML tasks: a classification model for EMI eligibility prediction and a regression model for maximum EMI amount estimation. Extensive feature engineering was applied to derive meaningful attributes such as credit utilization ratio, EMI-to-income ratio, and financial health score, improving the interpretability and performance of the models.

The platform leverages MLflow for experiment tracking and model comparison, ensuring efficient management of model versions, hyperparameters, and metrics. This integration enables reproducible experimentation and streamlined model optimization. A range of algorithms (including Logistic Regression, Random Forest, XGBoost, and Linear Regression) was evaluated using metrics such as accuracy, F1-score, mean squared error (MSE), and R² to ensure robust predictive capability.

To make the solution accessible and interactive, a Streamlit-based web application was developed. Users can input personal and financial data to instantly view EMI eligibility results, estimated EMI limits, and visual insights into their financial health. The system also includes CRUD functionalities for managing financial records, enabling seamless data updates and historical tracking.

Deployed on Streamlit Cloud, the platform offers a production-ready, scalable, and user-friendly interface for real-time financial risk assessment.

⚓ Overall, this project demonstrates how machine learning, model tracking, and web technologies can be integrated into a comprehensive FinTech solution. It not only showcases technical expertise in end-to-end ML development but also delivers practical impact by helping individuals and financial institutions make smarter, data-driven loan decisions.

# **GitHub Link -**

https://github.com/Sandruez/EMI-Eligibility-Maximum-Monthly-EMI-Prediction-System.git

# MLflow Analysis Link:

https://dagshub.com/chandrapapr1501/MLflow-model-tracking.mlflow/#/compare-experiments/s?experiments=%5B%223%22%2C%224%22%2C%225%22%2C%226%22%5D&searchFilter=&orderByKey=attributes.start_time&orderByAsc=false&startTime=ALL&lifecycleFilter=Active&modelVersionFilter=All+Runs&datasetsFilter=W10%3D

# Streamlit App Link:

https://emi-insights-app.streamlit.app/

# **Problem Statement**


<u>Problem Statement</u>

In the modern financial ecosystem, individuals frequently face difficulties in managing their EMIs (Equated Monthly Instalments) due to inadequate financial planning and insufficient risk assessment. Traditional methods of evaluating creditworthiness often fail to capture complex financial patterns and behavioral variables, leading to inaccurate lending decisions and higher default risks.

To address this issue, the project aims to build a comprehensive Financial Risk Assessment Platform that leverages machine learning models integrated with MLflow for efficient experiment tracking and model management. The platform is designed to provide data-driven insights that enhance loan decision-making, improve financial literacy, and enable users to assess their EMI eligibility in real time.



The system focuses on delivering the following key capabilities:

 * Dual ML problem solving: Classification for EMI eligibility prediction and Regression for estimating the maximum EMI amount.

 * Real-time financial risk assessment using a dataset of over 400,000 financial records.

 * Advanced feature engineering from 22 financial and demographic variables to enhance model interpretability.

 * MLflow integration for experiment tracking, versioning, and performance comparison.

 * Interactive Streamlit Cloud deployment for scalable, production-ready web access.

 * Complete CRUD operations for managing and maintaining financial data seamlessly.

This platform aims to bridge the gap between financial decision-making and intelligent analytics, enabling both individuals and financial institutions to make more informed, transparent, and responsible lending choices.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help create a positive business impact?
  Are there any insights that lead to negative growth? Justify with a specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure the following format is answered for each algorithm.


*   Explain the ML model used and its performance using an evaluation metric score chart.


*   Cross-validation & hyperparameter tuning

*   Have you seen any improvement? Note down the improvement with an updated evaluation metric score chart.

*   Explain what each evaluation metric indicates for the business and the business impact of the ML model used.

# ***Let's Begin !***

## ***Step 1: Data Loading and Preprocessing***
* Load the provided dataset of 400,000 realistic financial records across 5 EMI scenarios
* Implement comprehensive data cleaning for missing values, inconsistencies, and duplicates
* Apply data quality assessment and validation checks
* Create train/validation/test splits for model development
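
The "data quality assessment and validation checks" step can be sketched as one small report function. This is only an illustration: the helper name `quality_report` and the toy frame are hypothetical, and the valid ranges are the ones quoted in the dataset description (age 25-60, credit score 300-850).

```python
import pandas as pd

# Valid ranges taken from the dataset description; the helper name is hypothetical
VALID_RANGES = {"age": (25, 60), "credit_score": (300, 850)}

def quality_report(frame: pd.DataFrame) -> dict:
    """Summarise duplicates, missing values, and out-of-range entries."""
    report = {
        "duplicate_rows": int(frame.duplicated().sum()),
        "missing_pct": (frame.isna().mean() * 100).round(2).to_dict(),
        "out_of_range": {},
    }
    for col, (low, high) in VALID_RANGES.items():
        if col in frame.columns:
            values = pd.to_numeric(frame[col], errors="coerce")
            # NaN compares False on both sides, so missing values are not counted here
            report["out_of_range"][col] = int(((values < low) | (values > high)).sum())
    return report

# Toy data: one duplicated row, one missing age, one age and one score out of range
toy = pd.DataFrame({
    "age": [30, 30, 72, None],
    "credit_score": [650, 650, 900, 710],
})
print(quality_report(toy))
```

Running the same function on `df` before and after cleaning would give a quick before/after comparison.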

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
data_file_path='/content/drive/MyDrive/Labmentix intern projects/EMIPredict AI /emi_prediction_dataset.csv'
#Loading of provided dataset of 400,000 realistic financial records across 5 EMI scenarios
df=pd.read_csv(data_file_path)

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("Number of rows in the dataset:", df.shape[0])
print("Number of columns in the dataset:", df.shape[1])

### **Implementing comprehensive data cleaning for missing values, inconsistencies, and duplicates**

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
dup_count=df.duplicated().sum()
print(f'Number of duplicate entries: {dup_count}')

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isna().sum()

In [None]:
# Visualizing the missing values
plt.figure(figsize=(10,8))
sns.barplot(x=df.isnull().sum().index,y=df.isnull().sum())
plt.xticks(rotation=45)
plt.title('Missing Values Count')
plt.show()

### What do we know about our dataset?

Dataset Scale:

* Total Records: 400,000 financial profiles
* Input Features: 22 comprehensive variables
* Target Variables: 2 (Classification + Regression)
* EMI Scenarios: 5 lending categories with realistic distributions

EMI Scenario Distribution:
* E-commerce Shopping EMI (80,000 records) - Amount: 10K-200K, Tenure: 3-24 months
* Home Appliances EMI (80,000 records) - Amount: 20K-300K, Tenure: 6-36 months
* Vehicle EMI (80,000 records) - Amount: 80K-1500K, Tenure: 12-84 months
* Personal Loan EMI (80,000 records) - Amount: 50K-1000K, Tenure: 12-60 months
* Education EMI (80,000 records) - Amount: 50K-500K, Tenure: 6-48 months

<u>Dataset Explanation</u>

Input Features (22 Variables):

  Personal Demographics:
* age: Customer age (25-60 years)
* gender: Gender classification (Male/Female)
* marital_status: Marital status (Single/Married)
* education: Educational qualification (High School/Graduate/Post Graduate/Professional)

Employment and Income:
* monthly_salary: Monthly gross salary (15K-200K INR)
* employment_type: Employment category (Private/Government/Self-employed)
* years_of_employment: Work experience duration
* company_type: Organization size and type

Housing and Family:
* house_type: Residential ownership status (Rented/Own/Family)
* monthly_rent: Monthly rental expenses
* family_size: Total household members
* dependents: Number of financial dependents

Monthly Financial Obligations:
* school_fees: Educational expenses for dependents
* college_fees: Higher education costs
* travel_expenses: Monthly transportation costs
* groceries_utilities: Essential living expenses
* other_monthly_expenses: Miscellaneous financial obligations

Financial Status and Credit History:
* existing_loans: Current loan obligations status
* current_emi_amount: Existing monthly EMI burden
* credit_score: Credit worthiness score (300-850)
* bank_balance: Current account balance
* emergency_fund: Available emergency savings

Loan Application Details:
* emi_scenario: Type of EMI application (5 categories)
* requested_amount: Desired loan amount
* requested_tenure: Preferred repayment period in months

Target Variables:
Classification Target:
* emi_eligibility: Primary classification target with 3 classes
     * Eligible: Low risk, comfortable EMI affordability
     * High_Risk: Marginal case, requires higher interest rates
     * Not_Eligible: High risk, loan not recommended

Regression Target:
* max_monthly_emi: Primary regression target
     * Continuous variable representing maximum safe monthly EMI amount (500-50000 INR)
     * Calculated using comprehensive financial capacity analysis

## ***2. Understanding dataset Variables/Features***

In [None]:
# Dataset Columns
df.columns.to_list()

In [None]:
# Dataset Describe
df.describe()


### Variables Description


<u>Dataset Explanation</u>

The input features and both target variables are described in full in the Dataset Explanation in Step 1 above.

### Check Unique Values for each variable.

In [None]:
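# Make the gender column consistent: map lower-cased variants to 'Male' / 'Female'.
# Note: values that are not keys of gender_dict become NaN after .map().
gender_dict = {'male': 'Male', 'm': 'Male', 'f': 'Female', 'female': 'Female'}
temp = df['gender'].astype(str).str.lower().map(gender_dict)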


In [None]:
df.gender.value_counts().index

In [None]:
df.gender=temp

In [None]:
df.gender.value_counts()
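
Because `.map()` silently turns unmapped values into NaN, it can help to list which raw values failed to map before overwriting the column. A minimal sketch on toy values follows; the extra `.str.strip()` step and the `unmapped` check are additions for illustration, not part of the notebook.

```python
import pandas as pd

# Same mapping as in the notebook cell above
gender_dict = {"male": "Male", "m": "Male", "f": "Female", "female": "Female"}

# Toy values; the real column may contain other variants
raw = pd.Series(["Male", " female", "M", "F ", "unknown", None])

# Strip whitespace and lower-case before mapping
normalised = raw.str.strip().str.lower().map(gender_dict)

# Raw values that were present but did not map to either label
unmapped = raw[normalised.isna() & raw.notna()].unique().tolist()
print(normalised.tolist())
print("Unmapped raw values:", unmapped)
```

If `unmapped` is not empty, the mapping dictionary probably needs another entry.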

In [None]:
# Check Unique Values for each variable.
unique_values = df.nunique()
unique_values

## **Handling missing values**

In [None]:
df.isna().sum()

Recommended action by share of missing values:

| Missing % | Recommended action | Reason |
| :--- | :--- | :--- |
| < 1% | Safe to drop rows | Very small data loss; negligible impact |
| 1-5% | Usually drop or impute | Context-dependent |
| > 5% | Consider imputation or model-based filling | Too much information lost if dropped |
| > 30% | Often drop the column | Not enough data to be useful |

### Handling missing values in Education column.

In [None]:
print(f'1. % Missing values in education column :{(df.education.isna().sum()/len(df))*100}')

In [None]:
# Following the recommendations above:
# the education column has < 1% missing values (~0.6%), so dropping those rows has negligible impact.
df.dropna(subset=['education'],inplace=True)
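
The rule of thumb applied above can also be written as a small helper, so the same thresholds are used for every column. This is a sketch only: the function name and the returned labels are hypothetical, and the cut-offs are the ones in the recommendation table.

```python
def recommend_action(missing_pct: float) -> str:
    """Map a column's missing-value percentage to the rule-of-thumb action."""
    if missing_pct > 30:
        return "drop column"
    if missing_pct > 5:
        return "impute (simple or model-based)"
    if missing_pct >= 1:
        return "drop rows or impute"
    if missing_pct > 0:
        return "drop rows"
    return "nothing to do"

for pct in (0.6, 3.0, 12.0, 45.0):
    print(f"{pct:>5.1f}% missing -> {recommend_action(pct)}")
```

It could be applied to every column with `(df.isna().mean() * 100).map(recommend_action)`.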

### Handling missing values in monthly_rent column.

In [None]:
print(f'2. % Missing values in monthly_rent column :{(df.monthly_rent.isna().sum()/len(df))*100}')

In [None]:
df.monthly_rent.median()

In [None]:
df[(df.house_type=='Rented') & (df.monthly_rent.isna())].shape

In [None]:
# def helper_fun(x):
#   if x.house_type=='Rented' and np.isnan(x.monthly_rent):
#     return x
#   elif (x.house_type=='Own' or x.house_type=='Family') and np.isnan(x.monthly_rent):
#     x.monthly_rent=0
#     return x
#   else :
#     return x


In [None]:
# temp=df.copy()
# temp=temp.apply(helper_fun,axis=1)

In [None]:
temp=df.copy()

In [None]:
# Following the recommendations above:
# monthly_rent has < 1% missing values (~0.6%). Rows with house_type 'Own' or 'Family' are filled with 0;
# the remaining 'Rented' rows with missing rent are dropped, which has negligible impact.


temp.reset_index(drop=True,inplace=True)
# For 'Own' or 'Family' houses where rent is missing, set rent = 0
temp.loc[(temp['house_type'].isin(['Own', 'Family'])) & (temp['monthly_rent'].isna()), 'monthly_rent'] = 0
df=temp.copy()


In [None]:
# Drop the remaining rows with NaN in the monthly_rent column
df.dropna(subset=['monthly_rent'],inplace=True)

In [None]:
df.monthly_rent.isna().sum()/len(df)*100
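
The rent rule is easier to see on a toy frame with one row per case. The values below are made up for illustration; only the column names and the house types come from the dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "house_type":   ["Own", "Family", "Rented", "Rented"],
    "monthly_rent": [np.nan, np.nan, np.nan, 12000.0],
})

# Owners and family-housed applicants pay no rent, so a missing value is read as 0
no_rent = toy["house_type"].isin(["Own", "Family"]) & toy["monthly_rent"].isna()
toy.loc[no_rent, "monthly_rent"] = 0

# A renter with unknown rent cannot be filled safely, so that row is dropped
toy = toy.dropna(subset=["monthly_rent"]).reset_index(drop=True)
print(toy)
```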

In [None]:
# ✅ Step 1: Inspect the bad entries

# Work on a copy:
temp=df.copy()


# Show rows whose bank_balance is not a plain number (missing values are listed too):

temp[~temp['bank_balance'].astype(str).str.replace('.', '', n=1, regex=False).str.isdigit()].head()


# Expect malformed values such as '73500.0.0'.


In [None]:
# ✅ Step 2: Clean and convert properly

# Sanitize the malformed values before converting:


# Convert to string first
temp['bank_balance'] = temp['bank_balance'].astype(str)

# Fix malformed entries like '73500.0.0' → '73500.0'
temp['bank_balance'] = temp['bank_balance'].str.replace(r'\.0\.', '.', regex=True)

# Remove any stray non-numeric characters (this also strips minus signs)
temp['bank_balance'] = temp['bank_balance'].str.replace(r'[^0-9\.]', '', regex=True)

# Convert safely to float; anything still unparseable becomes NaN
temp['bank_balance'] = pd.to_numeric(temp['bank_balance'], errors='coerce')



In [None]:
# ✅ After this:

temp['bank_balance'].dtype
# expected Output: float64


In [None]:
# Re-check: only rows with a missing bank_balance should be listed now
temp[~temp['bank_balance'].astype(str).str.replace('.', '', n=1, regex=False).str.isdigit()].head()


In [None]:
df=temp.copy()
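
The same clean-up is repeated later for `monthly_salary` and `age`, so it could be wrapped in one helper. The sketch below uses a different technique from the notebook: rather than the two-step `replace`, it extracts the first well-formed number with `str.extract`. The helper name `clean_numeric` is hypothetical, and values with thousands separators (for example `'1,200'`) would need the commas stripped first.

```python
import pandas as pd

def clean_numeric(column: pd.Series) -> pd.Series:
    """Keep the first well-formed non-negative number in each value; everything else becomes NaN."""
    first_number = column.astype(str).str.extract(r"(\d+(?:\.\d+)?)", expand=False)
    return pd.to_numeric(first_number, errors="coerce")

# Toy values covering the malformed patterns discussed above
raw = pd.Series(["73500.0.0", "48000.", "1200", "abc", None])
cleaned = clean_numeric(raw)
print(cleaned.tolist())
```

Usage on the real data would look like `df['bank_balance'] = clean_numeric(df['bank_balance'])`.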

### **Dealing with null values in the ['bank_balance', 'emergency_fund', 'credit_score'] columns**

In [None]:
df[['bank_balance',"emergency_fund",'credit_score']].skew()

#### 📊 <u>Skewness Summary</u>

| Column | Skewness | Distribution | Interpretation |
| :--- | :------: | ----: | ----:|
| bank_balance | +1.415 | Right-skewed | Long tail to the right: a few people have very high balances |
| emergency_fund | +1.791 | Right-skewed | A few people have very large funds |
| credit_score | -1.097 | Left-skewed | Most people have high credit scores; a few have very low ones |



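🧠 <u>What This Means</u>

* Right-skewed → use the median for imputation and optionally apply a log transform (np.log1p).

* Left-skewed → the low-value tail pulls the mean below the median. This notebook imputes with the mean, which is acceptable when the skew is not extreme, although the median is the more robust choice; the data can also be mirrored before transforming if modeling needs it.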
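
A small numeric illustration of why the direction of the skew matters for imputation. The numbers are toy values, not taken from the dataset.

```python
import numpy as np
import pandas as pd

right = pd.Series([1, 2, 2, 3, 3, 3, 4, 50])   # long right tail
left = 50 - right                               # mirrored copy: long left tail

for name, s in {"right-skewed": right, "left-skewed": left}.items():
    print(f"{name:>12}: skew={s.skew():+.2f}  mean={s.mean():.1f}  median={s.median():.1f}")

# log1p compresses the right tail; mirroring first lets the same trick work on a left tail
print("skew after log1p:", round(np.log1p(right).skew(), 2))
```

In the right-skewed series the single large value drags the mean well above the median, and in the mirrored series it drags the mean below the median. That is why the median is usually the safer fill value for either direction.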

***✅ Step-by-Step Recommendation***

In [None]:
# 1️⃣ bank_balance

# Skewness = 1.415 → moderately right-skewed
# ✅ Impute with median
# ✅ Optional: apply log transform for normalization

df['bank_balance'] = df['bank_balance'].fillna(df['bank_balance'].median())
# df['bank_balance_log'] = np.log1p(df['bank_balance'])

In [None]:
# 2️⃣ emergency_fund

# Skewness = 1.791 → strongly right-skewed
# ✅ Impute with median
# ✅ Optional: apply log transform (left commented out below)

df['emergency_fund'] = df['emergency_fund'].fillna(df['emergency_fund'].median())
# df['emergency_fund_log'] = np.log1p(df['emergency_fund'])

In [None]:
# 3️⃣ credit_score

# Skewness = -1.097 → left-skewed
# ✅ Impute with mean (note: with a left skew the mean sits below the median,
#    so this fills in slightly lower values than a median imputation would)
# ⚙️ Optional: transform if needed (e.g., mirror or box-cox)

df['credit_score'] = df['credit_score'].fillna(df['credit_score'].mean())


In [None]:
df.age.value_counts()

### **Inspecting bad entries in monthly_salary column**

In [None]:
# ✅ Step 1: Inspect the bad entries
df[~df['monthly_salary'].astype(str).str.replace('.', '', n=1, regex=False).str.isdigit()].head()


In [None]:
# ✅ Step 2: Clean and convert properly

# Sanitize the malformed values before converting:

temp=df.copy()
# Convert to string first
temp['monthly_salary'] = temp['monthly_salary'].astype(str)

# Fix malformed entries like '73500.0.0' → '73500.0'
temp['monthly_salary'] = temp['monthly_salary'].str.replace(r'\.0\.', '.', regex=True)

# Remove any stray non-numeric characters
temp['monthly_salary'] = temp['monthly_salary'].str.replace(r'[^0-9\.]', '', regex=True)

# Convert safely to float
temp['monthly_salary'] = pd.to_numeric(temp['monthly_salary'], errors='coerce')



In [None]:
# Re-check: rows that still fail the numeric test (NaN after coercion)
temp[~temp['monthly_salary'].astype(str).str.replace('.', '', n=1, regex=False).str.isdigit()].head()


In [None]:
temp.dropna(subset=['monthly_salary'],inplace=True)

In [None]:
df=temp.copy()

In [None]:
# Confirm that no bad entries remain
df[~df['monthly_salary'].astype(str).str.replace('.', '', n=1, regex=False).str.isdigit()].head()


In [None]:
# df[df.monthly_salary.astype(str).str.count(r'\.0')==1]

In [None]:
# Step 2: Deal with the bad entries in the age column (e.g. '56.0.0' -> '56.0')

# df[df.age.astype(str).str.count(r'\.0')==1]
# Convert to string first
df['age'] = df['age'].astype(str)

# Fix malformed entries like '56.0.0' → '56.0'
df['age'] = df['age'].str.replace(r'\.0\.', '.', regex=True)

# Convert safely to float
df['age'] = pd.to_numeric(df['age'], errors='coerce')


# Step 2: Exploratory Data Analysis (EDA)

* Analyze EMI eligibility distribution patterns across different lending scenarios
* Study correlation between financial variables and loan approval rates
* Investigate demographic patterns and risk factor relationships
* Generate comprehensive statistical summaries and business insights
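
For the first bullet, a row-normalised cross-tabulation is one compact way to compare the eligibility mix across scenarios. The sketch uses a toy frame: the class labels come from the dataset description, while the scenario labels are placeholders that may not match the real category values.

```python
import pandas as pd

toy = pd.DataFrame({
    "emi_scenario": ["Vehicle EMI", "Vehicle EMI", "Education EMI", "Education EMI", "Education EMI"],
    "emi_eligibility": ["Eligible", "Not_Eligible", "Eligible", "High_Risk", "Not_Eligible"],
})

# Row-normalised crosstab: each row sums to 100 and shows the class mix within one scenario
mix = pd.crosstab(toy["emi_scenario"], toy["emi_eligibility"], normalize="index") * 100
print(mix.round(1))
```

On the real data, replacing `toy` with `df` gives one row per lending scenario. The result can be passed to `mix.plot(kind="bar", stacked=True)` for a visual comparison.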


In [None]:
temp_df=df.copy()
temp_df.existing_loans= temp_df.existing_loans.map({'Yes':1,'No':0})
temp_df.existing_loans.value_counts()

In [None]:
df_temp=df.copy()

In [None]:
df=temp_df.copy()
df.existing_loans.value_counts()

In [None]:
# ----------------------------------------------------------
# 📦 Import Libraries
# ----------------------------------------------------------
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set plot style
sns.set(style="whitegrid", palette="pastel")
plt.rcParams['figure.figsize'] = (8, 5)

# ----------------------------------------------------------
# 1️⃣ Statistical Overview
# ----------------------------------------------------------
print("📋 Data Overview:")
display(df.head())

print("\n🔢 Basic Info:")
df.info()

print("\n📈 Statistical Summary (Numerical Columns):")
display(df.describe().T)

print("\n📊 Missing Value %:")
display((df.isna().sum() / len(df) * 100).sort_values(ascending=False))

# ----------------------------------------------------------
# 2️⃣ EMI Eligibility Distribution
# ----------------------------------------------------------
print("\n🎯 EMI Eligibility Distribution:")
emi_dist = df['emi_eligibility'].value_counts(normalize=True) * 100
print(emi_dist)

sns.countplot(x='emi_eligibility', data=df, hue='emi_scenario', palette='coolwarm')
plt.title("EMI Eligibility Across Different Lending Scenarios")
plt.xlabel("EMI Eligibility (Eligible / High_Risk / Not_Eligible)")
plt.ylabel("Count")
plt.legend(title="EMI Scenario")
plt.show()

# ----------------------------------------------------------
# 3️⃣ Financial Correlation Study
# ----------------------------------------------------------
financial_cols = [
    'monthly_salary', 'monthly_rent', 'school_fees', 'college_fees',
    'travel_expenses', 'groceries_utilities', 'other_monthly_expenses',
    'existing_loans', 'current_emi_amount', 'credit_score',
    'bank_balance', 'emergency_fund', 'requested_amount', 'max_monthly_emi'
]

# Correlation Matrix
corr = df[financial_cols].corr()
plt.figure(figsize=(12,8))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", linewidths=0.5)
plt.title("Correlation Between Financial Variables")
plt.show()

# Strongest correlations
print("\n📌 Top 5 Positive & Negative Correlations:")
corr_unstacked = corr.unstack().sort_values(ascending=False)
corr_unstacked = corr_unstacked[corr_unstacked < 0.999]  # remove self-correlations
display(pd.concat([
    corr_unstacked.head(5).to_frame("Top Positive Correlations"),
    corr_unstacked.tail(5).to_frame("Top Negative Correlations")
], axis=1))

# ----------------------------------------------------------
# 4️⃣ Demographic & Risk Factor Relationships
# ----------------------------------------------------------
# emi_eligibility has three classes (Eligible / High_Risk / Not_Eligible), not Yes/No,
# so the approval rate is computed as the share of 'Eligible' rows per group.
def approval_rate(col):
    return df['emi_eligibility'].eq('Eligible').groupby(df[col]).mean() * 100

# Gender-wise eligibility
rate = approval_rate('gender')
sns.barplot(x=rate.index, y=rate.values)
plt.title("Gender vs EMI Eligibility (%)")
plt.ylabel("Approval Rate (%)")
plt.show()

# Education level influence
rate = approval_rate('education')
sns.barplot(x=rate.index, y=rate.values)
plt.title("Education Level vs EMI Eligibility (%)")
plt.ylabel("Approval Rate (%)")
plt.xticks(rotation=45)
plt.show()

# House type & EMI risk
rate = approval_rate('house_type')
sns.barplot(x=rate.index, y=rate.values)
plt.title("House Type vs EMI Eligibility (%)")
plt.ylabel("Approval Rate (%)")
plt.show()

# Age vs EMI eligibility trend
sns.histplot(data=df, x='age', hue='emi_eligibility', kde=True, bins=30, palette='coolwarm')
plt.title("Age Distribution by EMI Eligibility")
plt.show()

# ----------------------------------------------------------
# 5️⃣ Risk Feature Insights
# ----------------------------------------------------------
# Compare distributions across the eligibility classes
num_cols = ['monthly_salary','existing_loans','credit_score','bank_balance','emergency_fund']

for col in num_cols:
    plt.figure(figsize=(8,4))
    sns.kdeplot(data=df, x=col, hue='emi_eligibility', fill=True)
    plt.title(f"{col} Distribution by EMI Eligibility")
    plt.show()

# ----------------------------------------------------------
# 6️⃣ Business Insights Summary
# ----------------------------------------------------------
print("\n💡 Business Insights:")

print("""
1️⃣ Income and Loan Factors:
   • Applicants with higher monthly salary and lower existing loan amounts
     show higher EMI eligibility.
   • Strong positive correlation between monthly salary, bank balance,
     and max_monthly_emi indicates good repayment capacity.

2️⃣ Credit & Risk:
   • Credit score and emergency fund both positively impact approval probability.
   • Applicants with poor credit or low savings face rejection risk.

3️⃣ Demographics:
   • Married and educated individuals tend to have higher approval rates.
   • The youngest age groups show lower eligibility due to limited employment years.

4️⃣ Housing:
   • Those with 'Own' or 'Family' housing often get approved, reflecting lower financial stress from rent.
   • 'Rented' applicants show higher rejection correlation due to added monthly liabilities.

5️⃣ Lending Scenarios:
   • Certain EMI scenarios may have stricter credit thresholds; visualize their
     approval rates for portfolio optimization.
""")


In [None]:
import matplotlib.pyplot as plt
df['monthly_rent'].hist(bins=50)
plt.xlabel('Monthly Rent')
plt.ylabel('Count')
plt.show()


In [None]:
df.credit_score.describe()

In [None]:
df.nunique()
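
The "Top 5 Positive & Negative Correlations" step in the EDA cell above unstacks the full matrix, so every pair appears twice, as (a, b) and (b, a). One possible alternative, sketched here on toy data, keeps only the upper triangle so each pair is listed once. The helper name `top_correlation_pairs` is hypothetical.

```python
import numpy as np
import pandas as pd

def top_correlation_pairs(frame: pd.DataFrame, k: int = 5) -> pd.Series:
    """Return the k feature pairs with the largest absolute correlation, each pair once."""
    corr = frame.corr()
    # Strict upper triangle: drops the diagonal and the mirrored lower half
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    order = pairs.abs().sort_values(ascending=False).index
    return pairs.reindex(order).head(k)

demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],   # perfectly correlated with a
    "c": [5, 4, 3, 2, 1],    # perfectly anti-correlated with a
    "d": [1, 3, 2, 5, 4],
})
top_pairs = top_correlation_pairs(demo, k=3)
print(top_pairs)
```

Sorting by absolute value keeps strong negative relationships in the same list as strong positive ones.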

# Step 3: Feature Engineering

* Create derived financial ratios (debt-to-income, expense-to-income, affordability ratios)
* Generate risk scoring features based on credit history and employment stability
* Apply categorical encoding and numerical feature scaling
* Develop interaction features between key financial variables
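
To make the derived ratios concrete, here is the arithmetic for a single hypothetical applicant. All amounts are invented for illustration, and debt is assumed to be measured by the current EMI burden.

```python
# Hypothetical applicant; all amounts are monthly INR figures chosen for illustration
salary = 60_000
expenses = {"rent": 15_000, "groceries_utilities": 9_000, "travel": 3_000, "other": 3_000}
current_emi = 6_000
liquid_savings = 180_000  # bank balance + emergency fund

total_outgo = sum(expenses.values()) + current_emi   # 36,000
debt_to_income = current_emi / salary                # share of salary already committed to EMIs
expense_to_income = total_outgo / salary             # share of salary spent each month
savings_ratio = liquid_savings / salary              # savings expressed in months of salary

print(f"debt-to-income    : {debt_to_income:.2f}")
print(f"expense-to-income : {expense_to_income:.2f}")
print(f"savings ratio     : {savings_ratio:.1f} months of salary")
```

For this applicant, 60% of the salary is already spent, leaving 24,000 INR per month as the upper bound for any new EMI.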



In [None]:
# ----------------------------------------------------------
# ⚙️ Step 3: Feature Engineering
# ----------------------------------------------------------
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Make a working copy
fe = df.copy()

# ----------------------------------------------------------
# 1️⃣ Derived Financial Ratios
# ----------------------------------------------------------
# To avoid division by zero, add a small epsilon (1e-6)
eps = 1e-6

# Total monthly expenses (combine major cost components)
fe['total_expenses'] = (
    fe['monthly_rent'].fillna(0) +
    fe['school_fees'].fillna(0) +
    fe['college_fees'].fillna(0) +
    fe['travel_expenses'].fillna(0) +
    fe['groceries_utilities'].fillna(0) +
    fe['other_monthly_expenses'].fillna(0) +
    fe['current_emi_amount'].fillna(0)
)

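# Debt-to-Income Ratio
# existing_loans is a 0/1 flag after the earlier Yes/No mapping, so the existing EMI burden
# (current_emi_amount) is used as the monthly debt measure.
fe['debt_to_income'] = fe['current_emi_amount'] / (fe['monthly_salary'] + eps)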

# Expense-to-Income Ratio
fe['expense_to_income'] = fe['total_expenses'] / (fe['monthly_salary'] + eps)

# EMI Affordability Ratio: income left after other expenses, relative to the
# requested monthly instalment (requested_amount / requested_tenure, interest ignored)
fe['estimated_affordability_ratio'] = (
    (fe['monthly_salary'] - fe['other_monthly_expenses']) /
    (fe['requested_amount'] / fe['requested_tenure'] + eps)
)

# Savings Ratio: shows liquidity safety
fe['savings_ratio'] = (fe['bank_balance'] + fe['emergency_fund']) / (fe['monthly_salary'] + eps)

# ----------------------------------------------------------
# 2️⃣ Risk Scoring Features
# ----------------------------------------------------------

# Credit Score Risk Bucket
def credit_risk(score):
    if pd.isna(score):
        return 'Unknown'
    elif score >= 800:
        return 'Very Low Risk'
    elif score >= 700:
        return 'Low Risk'
    elif score >= 600:
        return 'Medium Risk'
    elif score >= 500:
        return 'High Risk'
    else:
        return 'Very High Risk'

fe['credit_risk_level'] = fe['credit_score'].apply(credit_risk)

# Employment Stability: years of employment bucket
def employment_stability(years):
    if years >= 5: return 'Stable'
    elif years >= 2: return 'Moderate'
    else: return 'Unstable'

fe['employment_stability'] = fe['years_of_employment'].apply(employment_stability)

# Combined risk score (numerical)
risk_map = {
    'Very Low Risk': 1,
    'Low Risk': 2,
    'Medium Risk': 3,
    'High Risk': 4,
    'Very High Risk': 5,
    'Unknown': np.nan
}
stability_map = {'Stable': 1, 'Moderate': 2, 'Unstable': 3}

fe['combined_risk_score'] = (
    fe['credit_risk_level'].map(risk_map) +
    fe['employment_stability'].map(stability_map)
)
# ----------------------------------------------------------
# 4️⃣ Interaction Features
# ----------------------------------------------------------
# Income × Credit → Financial Strength
fe['income_credit_interaction'] = fe['monthly_salary'] * fe['credit_score']

# Employment × Income → Stability value
fe['employment_income_interaction'] = fe['years_of_employment'] * fe['monthly_salary']

# Expense × Loan → Pressure factor
fe['expense_loan_interaction'] = fe['total_expenses'] * fe['existing_loans']

# ----------------------------------------------------------
# ✅ Final Check
# ----------------------------------------------------------
print("\n✅ Feature Engineering Completed Successfully!")
print(f"Total Features: {fe.shape[1]}")

display(fe.head())

# Optional: show correlation heatmap of new engineered features
import seaborn as sns
import matplotlib.pyplot as plt

engineered_cols = [
    'debt_to_income','expense_to_income','estimated_affordability_ratio',
    'savings_ratio','combined_risk_score','income_credit_interaction',
    'employment_income_interaction','expense_loan_interaction'
]

plt.figure(figsize=(10,6))
sns.heatmap(fe[engineered_cols].corr(), annot=True, cmap='Blues', fmt=".2f")
plt.title("Correlation Among Engineered Features")
plt.show()
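
The affordability ratio above divides `requested_amount` by `requested_tenure`, which treats the loan as interest-free. As a point of comparison only, the standard reducing-balance EMI formula can be sketched as below. The 12% annual rate is an assumption for illustration, since the dataset description gives no interest rate.

```python
def monthly_emi(principal: float, annual_rate_pct: float, tenure_months: int) -> float:
    """Standard reducing-balance EMI: P * r * (1 + r)^n / ((1 + r)^n - 1)."""
    if tenure_months <= 0:
        raise ValueError("tenure_months must be positive")
    r = annual_rate_pct / 12 / 100          # monthly interest rate
    if r == 0:
        return principal / tenure_months    # interest-free case: simple division
    growth = (1 + r) ** tenure_months
    return principal * r * growth / (growth - 1)

# Hypothetical request: INR 100,000 over 12 months at an assumed 12% annual rate
with_interest = monthly_emi(100_000, 12.0, 12)
without_interest = monthly_emi(100_000, 0.0, 12)
print(f"with interest: {with_interest:.2f}  |  interest-free: {without_interest:.2f}")
```

The gap between the two figures grows with tenure, so the interest-free approximation is weakest for the long vehicle and personal loan tenures.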


In [None]:

# # ----------------------------------------------------------
# # 3️⃣ Categorical Encoding
# # ----------------------------------------------------------
# cat_cols = [
#     'gender', 'marital_status', 'education', 'employment_type',
#     'company_type', 'house_type', 'emi_scenario',
#     'credit_risk_level', 'employment_stability'
# ]

# # Label Encoding (simple, efficient for ML models)
# encoder = LabelEncoder()
# for col in cat_cols:
#     fe[col] = encoder.fit_transform(fe[col].astype(str))

# # ----------------------------------------------------------
# # 4️⃣ Numerical Feature Scaling
# # ----------------------------------------------------------
# num_cols = [
#     'age', 'monthly_salary', 'monthly_rent', 'years_of_employment', 'family_size',
#     'dependents', 'school_fees', 'college_fees', 'travel_expenses',
#     'groceries_utilities', 'other_monthly_expenses', 'existing_loans',
#     'current_emi_amount', 'credit_score', 'bank_balance', 'emergency_fund',
#     'requested_amount', 'requested_tenure', 'max_monthly_emi',
#     'debt_to_income', 'expense_to_income', 'estimated_affordability_ratio',
#     'savings_ratio', 'combined_risk_score'
# ]

# scaler = StandardScaler()
# fe[num_cols] = scaler.fit_transform(fe[num_cols])



## **Categorical encoding and scaling**

In [None]:
fe.columns

In [None]:
# One-hot encoded features (treated as nominal):
OneHot_enc_cols = [
     'gender', 'marital_status', 'education',
       'employment_type', 'company_type', 'house_type',
       'emi_scenario',
       'credit_risk_level', 'employment_stability'
 ]

numerical_cols=[
    'monthly_salary', 'monthly_rent', 'school_fees', 'college_fees',
    'travel_expenses', 'groceries_utilities', 'other_monthly_expenses',
    'existing_loans', 'current_emi_amount', 'bank_balance',
    'emergency_fund', 'requested_amount', 'requested_tenure',
    'total_expenses', 'debt_to_income', 'expense_to_income',
    'estimated_affordability_ratio', 'savings_ratio',
    'combined_risk_score', 'income_credit_interaction',
    'employment_income_interaction', 'expense_loan_interaction'
]


In [None]:
# OneHot Encoding
from sklearn.preprocessing import OneHotEncoder

enocder=OneHotEncoder()
encoded_cat_df=pd.DataFrame(enocder.fit_transform(fe[OneHot_enc_cols]).toarray(),
                            columns=enocder.get_feature_names_out(),
                            index=fe.index)

encoded_df = pd.concat(
    [fe.drop(columns=OneHot_enc_cols), encoded_cat_df],
    axis=1
)


In [None]:
print(fe.index.equals(encoded_cat_df.index))


In [None]:
# Scaling

temp=encoded_df.copy()
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
encoded_df[numerical_cols]=scaler.fit_transform(encoded_df[numerical_cols])
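
The cell above fits the scaler on the full dataset before any split, so statistics from the future validation and test rows influence the scaling. A common alternative is to fit on the training rows only and reuse those statistics elsewhere. The sketch below shows that pattern on synthetic data; all names (`X_toy`, `toy_scaler`, and so on) are hypothetical and not part of this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic "salary-like" columns, for illustration only
rng = np.random.default_rng(42)
X_toy = rng.normal(loc=50_000, scale=10_000, size=(1_000, 3))

X_tr, X_te = train_test_split(X_toy, test_size=0.2, random_state=42)

toy_scaler = StandardScaler()
X_tr_scaled = toy_scaler.fit_transform(X_tr)  # mean/std come from training rows only
X_te_scaled = toy_scaler.transform(X_te)      # test rows reuse the training statistics

print(X_tr_scaled.mean(axis=0).round(6))  # close to 0 by construction
print(X_te_scaled.mean(axis=0).round(3))  # near 0, but not exactly
```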

In [None]:
## Storing scaler
import joblib,os

joblib.dump(scaler,os.path.join('/content/drive/MyDrive/Labmentix intern projects/EMIPredict AI /Models/Scalers & Encoders','scaler.joblib'))

In [None]:
##Storing Encoder
joblib.dump(enocder,os.path.join('/content/drive/MyDrive/Labmentix intern projects/EMIPredict AI /Models/Scalers & Encoders','encoder.joblib'))
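
Since the saved scaler and encoder are meant to be reloaded later (for example by the Streamlit app), it can be worth checking that a reloaded object behaves like the original. This is a minimal round-trip sketch that writes to a temporary directory rather than the Drive path used above; the data and variable names are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
fitted = StandardScaler().fit(data)

# Write to a throwaway directory so the sketch leaves nothing behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "scaler.joblib")
    joblib.dump(fitted, path)
    restored = joblib.load(path)

# The restored scaler should produce the same output as the original
same = np.allclose(fitted.transform(data), restored.transform(data))
print(same)
```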

In [None]:
encoded_df.isna().sum()

## **Splitting the dataset (training, validation, and testing)**

**For classification predictions (multiclass classification)**

* Input columns (features): ['age', 'monthly_salary', 'years_of_employment',
       'monthly_rent',
       'family_size', 'dependents', 'school_fees', 'college_fees',
       'travel_expenses', 'groceries_utilities', 'other_monthly_expenses',
       'existing_loans', 'current_emi_amount', 'credit_score', 'bank_balance',
       'emergency_fund', 'requested_amount', 'requested_tenure', 'total_expenses',
       'debt_to_income', 'expense_to_income', 'estimated_affordability_ratio',
       'savings_ratio', 'combined_risk_score', 'income_credit_interaction',
       'employment_income_interaction', 'expense_loan_interaction',
       'gender_Female', 'gender_Male', 'marital_status_Married',
       'marital_status_Single', 'education_Graduate', 'education_High School',
       'education_Post Graduate', 'education_Professional',
       'employment_type_Government', 'employment_type_Private',
       'employment_type_Self-employed', 'company_type_Large Indian',
       'company_type_MNC', 'company_type_Mid-size', 'company_type_Small',
       'company_type_Startup', 'house_type_Family', 'house_type_Own',
       'house_type_Rented', 'emi_scenario_E-commerce Shopping EMI',
       'emi_scenario_Education EMI', 'emi_scenario_Home Appliances EMI',
       'emi_scenario_Personal Loan EMI', 'emi_scenario_Vehicle EMI',
       'credit_risk_level_High Risk', 'credit_risk_level_Low Risk',
       'credit_risk_level_Medium Risk', 'credit_risk_level_Very High Risk',
       'credit_risk_level_Very Low Risk', 'employment_stability_Moderate',
       'employment_stability_Stable', 'employment_stability_Unstable']

  * Target feature: emi_eligibility ('Not_Eligible': 0, 'Eligible': 1, 'High_Risk': 2)

In [None]:

## Mapping classification targets
encoded_df['emi_eligibility_target']=encoded_df['emi_eligibility'].map({
    'Not_Eligible': 0,
    'Eligible': 1,
    'High_Risk': 2
})
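
`Series.map` with a dictionary silently turns any label missing from the dictionary into `NaN`, so a quick check after mapping can catch spelling or casing variants. The sketch below uses a toy series with a deliberate casing error; it is an illustration, not output from the project data.

```python
import pandas as pd

label_map = {"Not_Eligible": 0, "Eligible": 1, "High_Risk": 2}

# The last value has the wrong casing on purpose
labels = pd.Series(["Eligible", "High_Risk", "Not_Eligible", "eligible"])
mapped = labels.map(label_map)

# Labels that did not match any dictionary key end up as NaN
unmapped = labels[mapped.isna()].unique().tolist()
print(unmapped)
```

An empty `unmapped` list would indicate that every label was covered by the mapping.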

In [None]:
## Shuffle the data (fixed seed so the order is reproducible)
encoded_df = encoded_df.sample(frac=1, random_state=42)

In [None]:
## Data splitting
from sklearn.model_selection import train_test_split
X=encoded_df.drop(columns=['max_monthly_emi','emi_eligibility_target','emi_eligibility'])
y_clf=encoded_df['emi_eligibility_target']
y_rg=encoded_df['max_monthly_emi']


In [None]:
x_train_clf,x_temp,y_train_clf,y_temp=train_test_split(X,y_clf,test_size=0.2,random_state=42)
x_val_clf,x_test_clf,y_val_clf,y_test_clf=train_test_split(x_temp,y_temp,test_size=0.5,random_state=42)

x_train_rg,x_temp,y_train_rg,y_temp = train_test_split(X,y_rg,test_size=0.2,random_state=42)
x_val_rg,x_test_rg,y_val_rg,y_test_rg=train_test_split(x_temp,y_temp,test_size=0.5,random_state=42)
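
The two-step call above yields an 80/10/10 split: 20% is held out first, and that holdout is then halved into validation and test sets. For the classification target, passing `stratify` is one way to keep class proportions similar across the three parts. The sketch below illustrates this on synthetic labels; stratification is a suggestion here, not what the cell above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 70% / 20% / 10%
X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.array([0] * 700 + [1] * 200 + [2] * 100)

# Step 1: hold out 20% of the rows
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)
# Step 2: split the holdout evenly into validation and test
X_val, X_te, y_val, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42
)

print(len(X_tr), len(X_val), len(X_te))
print(np.bincount(y_tr) / len(y_tr))  # class shares in the training part
```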



In [None]:
print(x_train_clf.shape,x_val_clf.shape,x_test_clf.shape)
print(x_train_rg.shape,x_val_rg.shape,x_test_rg.shape)

print(y_train_clf.shape,y_val_clf.shape,y_test_clf.shape)
print(y_train_rg.shape,y_val_rg.shape,y_test_rg.shape )


In [None]:
x_train_clf.columns

## ***7. ML Model Implementation***

A. Classification Models (EMI Eligibility Prediction)

Required Models (Minimum 3):

* Logistic Regression - Baseline interpretable model
* Random Forest Classifier - Feature importance analysis
* XGBoost Classifier - High-performance gradient boosting

Additional Models (Choose 1+):

* Support Vector Classifier (SVC)
* Decision Tree Classifier
* Gradient Boosting Classifier
* LightGBM Classifier
* CatBoost Classifier

In [None]:
## Importing Required Libraries..
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix,
                             classification_report, roc_curve)
import matplotlib.pyplot as plt
import seaborn as sns


In [None]:
# ** Setup for MLflow using DagsHub

In [None]:
!pip install mlflow

In [None]:

import mlflow
import mlflow.sklearn

In [None]:
%pip install -q dagshub 'mlflow>=2,<3'


In [None]:
import dagshub
dagshub.init(repo_owner='chandrapapr1501', repo_name='MLflow-model-tracking', mlflow=True)

In [None]:

mlflow.set_tracking_uri("https://dagshub.com/chandrapapr1501/MLflow-model-tracking.mlflow/")
mlflow.set_experiment("colab_experiment3")


In [None]:
import mlflow
with mlflow.start_run():
  # Your training code here...
  mlflow.log_metric('accuracy', 42)
  mlflow.log_param('Param name', 'Value')

In [None]:
# Second test experiment (placeholder metric and parameter values)
mlflow.set_experiment("colab_experiment1")

with mlflow.start_run():
  # Your training code here...
  mlflow.log_metric('accuracy', 99.99)
  mlflow.log_param('Param name', 'Value')

In [None]:
print(mlflow.__version__)

In [None]:
import mlflow
import mlflow.sklearn
from mlflow.models import infer_signature
import os

# 1. Setup MLflow
def setup_mlflow(experiment_name="EMI_Prediction_Models"):
    """
    Configure MLflow tracking
    """
    # Set tracking URI (local or remote server)
    # mlflow.set_tracking_uri("file:./mlruns")  # Local tracking
    # For remote: mlflow.set_tracking_uri("http://localhost:5000")

    # Reuse the experiment if it already exists; otherwise create it
    experiment = mlflow.get_experiment_by_name(experiment_name)
    if experiment is None:
        experiment_id = mlflow.create_experiment(experiment_name)
        print(f"MLflow experiment '{experiment_name}' created with ID: {experiment_id}")
    else:
        experiment_id = experiment.experiment_id
        print(f"MLflow experiment '{experiment_name}' already exists with ID: {experiment_id}")


    mlflow.set_experiment(experiment_name)
    print(f"MLflow experiment '{experiment_name}' is ready!")

    return experiment_id

In [None]:
import joblib
import os

base_path = "/content/drive/MyDrive/Labmentix intern projects/EMIPredict AI /Models"
clf_models_path = os.path.join(base_path, "clf_models")
reg_models_path = os.path.join(base_path, "reg_models")

os.makedirs(clf_models_path, exist_ok=True)
os.makedirs(reg_models_path, exist_ok=True)

print(f"Created directory: {clf_models_path}")
print(f"Created directory: {reg_models_path}")

In [None]:
y_train_clf.value_counts()

In [None]:
# 2. Model Training Function
def train_classification_models(X_train, X_test, y_train, y_test, experiment_name="EMI_Eligibility_prediction"):
    """
    Train classification models and log to MLflow
    """
    setup_mlflow(experiment_name=experiment_name)

    # 'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    # 'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    models = {
        # 'XGBoost': XGBClassifier(random_state=42, eval_metric='logloss'),
        # 'Decision Tree': DecisionTreeClassifier(random_state=42),
        'SVC': SVC(probability=True, random_state=42),
        'Gradient Boosting': GradientBoostingClassifier(random_state=42)
    }

    results = {}

    for name, model in models.items():
        with mlflow.start_run(run_name=f"Classification_{name}"):
            print(f"\nTraining {name}...")

            mlflow.log_param("model_type", "classification")
            mlflow.log_param("algorithm", name)

            # Train model
            model.fit(X_train, y_train)

            # Predictions
            y_pred = model.predict(X_test)
            # For multi-class, predict_proba returns probabilities for each class.
            # roc_auc_score for multi-class needs probabilities for each class or a single probability for one class in binary case.
            # Since we have 3 classes, we need to specify multi_class.
            y_pred_proba = model.predict_proba(X_test)

            # Evaluation metrics
            metrics = {
                'accuracy': accuracy_score(y_test, y_pred),
                'precision': precision_score(y_test, y_pred, average='weighted'),
                'recall': recall_score(y_test, y_pred, average='weighted'),
                'f1_score': f1_score(y_test, y_pred, average='weighted'),
                'roc_auc': roc_auc_score(y_test, y_pred_proba, multi_class='ovr')
            }

            # Cross-validation score
            cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
            metrics['cv_mean'] = cv_scores.mean()
            metrics['cv_std'] = cv_scores.std()

            # Log metrics
            mlflow.log_metrics(metrics)

            # Create and log confusion matrix
            cm = confusion_matrix(y_test, y_pred)
            fig, ax = plt.subplots(figsize=(8, 6))
            sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax)
            ax.set_title(f'Confusion Matrix - {name}')
            ax.set_ylabel('Actual')
            ax.set_xlabel('Predicted')
            plt.tight_layout()
            mlflow.log_figure(fig, f"confusion_matrix_{name}.png")
            plt.close()

            # Create and log ROC curve
            # For multi-class, ROC curve needs to be plotted per class or using micro/macro averaging.
            # Here, we'll skip plotting a single ROC curve as it's less informative for multi-class without averaging.
            # If needed, we can add plotting logic for multi-class ROC.
            # fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
            # fig, ax = plt.subplots(figsize=(8, 6))
            # ax.plot(fpr, tpr, label=f'AUC = {metrics["roc_auc"]:.3f}')
            # ax.plot([0, 1], [0, 1], 'k--', label='Random')
            # ax.set_xlabel('False Positive Rate')
            # ax.set_ylabel('True Positive Rate')
            # ax.set_title(f'ROC Curve - {name}')
            # ax.legend()
            # plt.tight_layout()
            # mlflow.log_figure(fig, f"roc_curve_{name}.png")
            # plt.close()

            # Log feature importance (if available)
            if hasattr(model, 'feature_importances_'):
                feature_names = X_train.columns.tolist()
                importance_df = pd.DataFrame({
                    'feature': feature_names,
                    'importance': model.feature_importances_
                }).sort_values('importance', ascending=False)

                fig, ax = plt.subplots(figsize=(10, 6))
                importance_df.head(20).plot(x='feature', y='importance',
                                            kind='barh', ax=ax)
                ax.set_title(f'Top 20 Feature Importances - {name}')
                plt.tight_layout()
                mlflow.log_figure(fig, f"feature_importance_{name}.png")
                plt.close()

                # Log as artifact
                importance_df.to_csv(f"feature_importance_{name}.csv", index=False)
                mlflow.log_artifact(f"feature_importance_{name}.csv")

            # Infer signature
            signature = infer_signature(X_train, model.predict(X_train))

            #Saving model locally
            joblib.dump(model, os.path.join(clf_models_path, f"{name}.joblib"))

            # # Log model
            # mlflow.sklearn.log_model(
            #     model,
            #     artifact_path="model",
            #     signature=signature,
            #     registered_model_name=f"EMI_Classification_{name}"
            # )

            # Log classification report
            report = classification_report(y_test, y_pred, output_dict=True)
            report_df = pd.DataFrame(report).transpose()
            report_df.to_csv(f"classification_report_{name}.csv")
            mlflow.log_artifact(f"classification_report_{name}.csv")

            results[name] = {
                'model': model,
                'metrics': metrics,
                'predictions': y_pred,
                'probabilities': y_pred_proba,
                'run_id': mlflow.active_run().info.run_id

            }

            print("✓ Model logged successfully!")
            print(f"Accuracy: {metrics['accuracy']:.4f}")
            print(f"ROC-AUC: {metrics['roc_auc']:.4f}")

    return results
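
The comments inside the function note that a three-class target needs `multi_class` set when computing ROC-AUC. The sketch below shows that call in isolation on synthetic data from `make_classification`, with a plain logistic regression as a stand-in model. It also contrasts the default macro average with the weighted one, which can differ when classes are imbalanced.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class problem standing in for the eligibility target
X_toy, y_toy = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)  # one probability column per class

# One-vs-rest AUC: each class is scored against all others, then averaged
auc_macro = roc_auc_score(y_te, proba, multi_class="ovr", average="macro")
auc_weighted = roc_auc_score(y_te, proba, multi_class="ovr", average="weighted")
print(f"macro OvR AUC: {auc_macro:.3f}, weighted OvR AUC: {auc_weighted:.3f}")
```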

In [None]:

# 3. Model Evaluation and Visualization
def evaluate_classification_models(results, y_test):
    """
    Comprehensive evaluation and visualization
    """
    # Create comparison DataFrame
    comparison_data = []
    for name, result in results.items():
        metrics = result['metrics']
        comparison_data.append({
            'Model': name,
            'Accuracy': metrics['accuracy'],
            'Precision': metrics['precision'],
            'Recall': metrics['recall'],
            'F1-Score': metrics['f1_score'],
            'ROC-AUC': metrics['roc_auc'],
            'CV Mean': metrics['cv_mean'],
            'CV Std': metrics['cv_std']
        })

    comparison_df = pd.DataFrame(comparison_data)
    print("\n" + "="*80)
    print("MODEL COMPARISON")
    print("="*80)
    print(comparison_df.to_string(index=False))

    # Visualizations
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))

    # 1. Metrics Comparison
    metrics_to_plot = ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'ROC-AUC']
    comparison_df.set_index('Model')[metrics_to_plot].plot(kind='bar', ax=axes[0, 0])
    axes[0, 0].set_title('Model Performance Comparison')
    axes[0, 0].set_ylabel('Score')
    axes[0, 0].legend(loc='lower right')
    axes[0, 0].set_ylim([0, 1])

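    # 2. ROC Curves (micro-averaged one-vs-rest; roc_curve itself only accepts binary targets)
    # Assumes the probability columns follow the sorted class labels, as in scikit-learn's classes_
    from sklearn.preprocessing import label_binarize
    y_test_bin = label_binarize(y_test, classes=np.unique(y_test))
    for name, result in results.items():
        fpr, tpr, _ = roc_curve(y_test_bin.ravel(), result['probabilities'].ravel())
        axes[0, 1].plot(fpr, tpr, label=f"{name} (OvR AUC={result['metrics']['roc_auc']:.3f})")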
    axes[0, 1].plot([0, 1], [0, 1], 'k--', label='Random')
    axes[0, 1].set_xlabel('False Positive Rate')
    axes[0, 1].set_ylabel('True Positive Rate')
    axes[0, 1].set_title('ROC Curves Comparison')
    axes[0, 1].legend()

    # 3. Feature Importance (Random Forest)
    if 'Random Forest' in results:
        rf_model = results['Random Forest']['model']
        importances = rf_model.feature_importances_
        indices = np.argsort(importances)[-10:]  # Top 10 features
        axes[1, 0].barh(range(len(indices)), importances[indices])
        axes[1, 0].set_yticks(range(len(indices)))
        axes[1, 0].set_yticklabels([f'Feature {i}' for i in indices])
        axes[1, 0].set_xlabel('Importance')
        axes[1, 0].set_title('Top 10 Feature Importances (Random Forest)')

    # 4. Confusion Matrix (Best Model)
    best_model_name = comparison_df.loc[comparison_df['ROC-AUC'].idxmax(), 'Model']
    cm = confusion_matrix(y_test, results[best_model_name]['predictions'])
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[1, 1])
    axes[1, 1].set_title(f'Confusion Matrix - {best_model_name}')
    axes[1, 1].set_ylabel('Actual')
    axes[1, 1].set_xlabel('Predicted')

    plt.tight_layout()
    plt.savefig('classification_model_comparison.png', dpi=300, bbox_inches='tight')
    plt.show()

    return comparison_df

B. Classification Models with MLflow (hyperparameter tuning)

In [None]:
def train_and_log_classification_models(X_train, X_test, y_train, y_test,
                                       feature_names, experiment_name="hyperpara-Tuned_EMI_eligibility_prediction"):
    """
    Train classification models and log to MLflow
    """
    setup_mlflow(experiment_name=experiment_name)

    models = {
        # 'Logistic_Regression': {
        #     'model': LogisticRegression(random_state=42, max_iter=1000,C=1.0,penalty='l2',solver='lbfgs'),
        #     'params': {'C': 1.0, 'penalty': 'l2', 'solver': 'lbfgs'}
        # },
        # 'Random_Forest': {
        #     'model': RandomForestClassifier(n_estimators=100, random_state=42,
        #                                    max_depth=10, min_samples_split=5,criterion='gini'),
        #     'params': {'n_estimators': 100, 'max_depth': 10,
        #               'min_samples_split': 5, 'criterion': 'gini'}
        # },
        # 'XGBoost': {
        #     'model': XGBClassifier(random_state=42, n_estimators=100,
        #                           learning_rate=0.1, max_depth=6,eval_metric= 'logloss'),
        #     'params': {'n_estimators': 100, 'learning_rate': 0.1,
        #               'max_depth': 6, 'eval_metric': 'logloss'}
        # },
        'SVC': {
            'model': SVC(probability=True, random_state=42, C=1.0, kernel='rbf',gamma='scale'),
            'params': {'C': 1.0, 'kernel': 'rbf', 'gamma': 'scale'}
        }
        # 'Gradient_Boosting': {
        #     'model': GradientBoostingClassifier(random_state=42, n_estimators=100,learning_rate= 0.1,max_depth= 3),
        #     'params': {'n_estimators': 100, 'learning_rate': 0.1, 'max_depth': 3}
        # }
    }

    results = {}

    for name, model_info in models.items():
        with mlflow.start_run(run_name=f"Classification_{name}"):
            print(f"\n{'='*60}")
            print(f"Training and logging: {name}")
            print(f"{'='*60}")

            model = model_info['model']
            params = model_info['params']

            # Log parameters
            mlflow.log_params(params)
            mlflow.log_param("model_type", "classification")
            mlflow.log_param("algorithm", name)

            # Train model
            model.fit(X_train, y_train)

            # Predictions
            y_pred = model.predict(X_test)
            y_pred_proba = model.predict_proba(X_test)

            # Calculate metrics
            metrics = {
                'accuracy': accuracy_score(y_test, y_pred),
                'precision': precision_score(y_test, y_pred, average='weighted'),
                'recall': recall_score(y_test, y_pred, average='weighted'),
                'f1_score': f1_score(y_test, y_pred, average='weighted'),
                'roc_auc': roc_auc_score(y_test, y_pred_proba,multi_class='ovr')
            }

            # Cross-validation
            cv_scores = cross_val_score(model, X_train, y_train, cv=5,
                                       scoring='accuracy')
            metrics['cv_accuracy_mean'] = cv_scores.mean()
            metrics['cv_accuracy_std'] = cv_scores.std()

            # Log metrics
            mlflow.log_metrics(metrics)

            # Create and log confusion matrix
            cm = confusion_matrix(y_test, y_pred)
            fig, ax = plt.subplots(figsize=(8, 6))
            sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax)
            ax.set_title(f'Confusion Matrix - {name}')
            ax.set_ylabel('Actual')
            ax.set_xlabel('Predicted')
            plt.tight_layout()
            mlflow.log_figure(fig, f"confusion_matrix_{name}.png")
            plt.close()

            # Create and log ROC curve
            # fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
            # fig, ax = plt.subplots(figsize=(8, 6))
            # ax.plot(fpr, tpr, label=f'AUC = {metrics["roc_auc"]:.3f}')
            # ax.plot([0, 1], [0, 1], 'k--', label='Random')
            # ax.set_xlabel('False Positive Rate')
            # ax.set_ylabel('True Positive Rate')
            # ax.set_title(f'ROC Curve - {name}')
            # ax.legend()
            # plt.tight_layout()
            # mlflow.log_figure(fig, f"roc_curve_{name}.png")
            # plt.close()

            # Log feature importance (if available)
            if hasattr(model, 'feature_importances_'):
                importance_df = pd.DataFrame({
                    'feature': feature_names,
                    'importance': model.feature_importances_
                }).sort_values('importance', ascending=False)

                fig, ax = plt.subplots(figsize=(10, 6))
                importance_df.head(15).plot(x='feature', y='importance',
                                           kind='barh', ax=ax)
                ax.set_title(f'Top 15 Feature Importances - {name}')
                plt.tight_layout()
                mlflow.log_figure(fig, f"feature_importance_{name}.png")
                plt.close()

                # Log as artifact
                importance_df.to_csv(f"feature_importance_{name}.csv", index=False)
                mlflow.log_artifact(f"feature_importance_{name}.csv")

            # Infer signature
            signature = infer_signature(X_train, model.predict(X_train))

            #Saving model locally
            joblib.dump(model, os.path.join(clf_models_path, f"{name}_tuned.joblib"))

            # Log model
            # mlflow.sklearn.log_model(
            #     model,
            #     artifact_path="model",
            #     signature=signature,
            #     registered_model_name=f"EMI_Eligibility_Classification_{name}"
            # )

            # Log classification report
            report = classification_report(y_test, y_pred, output_dict=True)
            report_df = pd.DataFrame(report).transpose()
            report_df.to_csv(f"classification_report_{name}.csv")
            mlflow.log_artifact(f"classification_report_{name}.csv")

            # Store results
            results[name] = {
                'model': model,
                'metrics': metrics,
                'run_id': mlflow.active_run().info.run_id
            }

            print("✓ Model logged successfully!")
            print(f"  Accuracy: {metrics['accuracy']:.4f}")
            print(f"  ROC-AUC: {metrics['roc_auc']:.4f}")

    return results
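
The function above trains each model with one fixed set of hyperparameter values. If an actual search over candidate values is wanted, `GridSearchCV` is one option. The sketch below runs a tiny grid on synthetic data; the grid values, the `f1_weighted` scoring choice, and the Random Forest stand-in are all assumptions for illustration, not settings taken from this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class data standing in for the training split
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42
)

# Deliberately small, hypothetical grid so the sketch runs quickly
param_grid = {"max_depth": [3, 6], "n_estimators": [20, 50]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="f1_weighted",
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(f"best CV f1_weighted: {search.best_score_:.3f}")
```

The winning combination in `search.best_params_` could then be logged with `mlflow.log_params` in the same way the fixed values are logged above.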

In [None]:
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score, mean_absolute_percentage_error, accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, roc_curve, confusion_matrix, classification_report

In [None]:
# 2. Model Training Function
def train_regression_models(X_train, X_test, y_train, y_test):
  """
  Train regression models and log to MLflow
  """
  setup_mlflow(experiment_name='Max_EMI_Prediction')

  models = {
  # 'Linear Regression': LinearRegression(),
  'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
  'XGBoost': XGBRegressor(random_state=42),
  'Decision Tree': DecisionTreeRegressor(random_state=42),
  'SVR': SVR(kernel='rbf'),
  'Gradient Boosting': GradientBoostingRegressor(random_state=42)

  }

  results = {}

  for name, model in models.items():
    with mlflow.start_run(run_name=f"Regression_{name}"):
      print(f"\nTraining {name}...")


      mlflow.log_param("model_type", "regression")
      mlflow.log_param("algorithm", name)

      # Train model
      model.fit(X_train, y_train)

      # Predictions
      y_pred_train = model.predict(X_train)
      y_pred_test = model.predict(X_test)

      # Evaluation metrics
      metrics = {
          'rmse': np.sqrt(mean_squared_error(y_test, y_pred_test)),
          'mae': mean_absolute_error(y_test, y_pred_test),
          'r2': r2_score(y_test, y_pred_test),
          'mape': mean_absolute_percentage_error(y_test, y_pred_test) * 100,
          'train_r2': r2_score(y_train, y_pred_train),
          'mse': mean_squared_error(y_test, y_pred_test)

      }

      # Cross-validation score (scores are negative MSE, so negate before taking the square root)
      cv_scores = cross_val_score(model, X_train, y_train, cv=5,
                                  scoring='neg_mean_squared_error')
      cv_rmse = np.sqrt(-cv_scores)
      metrics['cv_rmse_mean'] = cv_rmse.mean()
      metrics['cv_rmse_std'] = cv_rmse.std()

      # Log metrics
      mlflow.log_metrics(metrics)

      # Create and log actual vs predicted plot
      fig, axes = plt.subplots(1, 2, figsize=(15, 6))

      # Actual vs Predicted
      axes[0].scatter(y_test, y_pred_test, alpha=0.5)
      axes[0].plot([y_test.min(), y_test.max()],
                  [y_test.min(), y_test.max()], 'r--', lw=2)
      axes[0].set_xlabel('Actual Values')
      axes[0].set_ylabel('Predicted Values')
      axes[0].set_title(f'Actual vs Predicted - {name}')

      # Residuals
      residuals = y_test - y_pred_test
      axes[1].scatter(y_pred_test, residuals, alpha=0.5)
      axes[1].axhline(y=0, color='r', linestyle='--', lw=2)
      axes[1].set_xlabel('Predicted Values')
      axes[1].set_ylabel('Residuals')
      axes[1].set_title(f'Residual Plot - {name}')

      plt.tight_layout()
      mlflow.log_figure(fig, f"prediction_plots_{name}.png")
      plt.close()

      # Log feature importance (if available)
      if hasattr(model, 'feature_importances_'):
        feature_names = X_train.columns.tolist()
        importance_df = pd.DataFrame({
            'feature': feature_names,
            'importance': model.feature_importances_
        }).sort_values('importance', ascending=False)

        fig, ax = plt.subplots(figsize=(10, 6))
        importance_df.head(20).plot(x='feature', y='importance',
                                    kind='barh', ax=ax)
        ax.set_title(f'Top 20 Feature Importances - {name}')
        plt.tight_layout()
        mlflow.log_figure(fig, f"feature_importance_{name}.png")
        plt.close()

        importance_df.to_csv(f"feature_importance_{name}.csv", index=False)
        mlflow.log_artifact(f"feature_importance_{name}.csv")

      # Infer signature
      signature = infer_signature(X_train, model.predict(X_train))

      #Saving model locally
      joblib.dump(model, os.path.join(reg_models_path, f"{name}.joblib"))


      # Log model
      # mlflow.sklearn.log_model(
      #     model,
      #     artifact_path="model",
      #     signature=signature,
      #     registered_model_name=f"EMI_Regression_{name}"
      # )

      # Save prediction results
      results_df = pd.DataFrame({
          'Actual': y_test,
          'Predicted': y_pred_test,
          'Residual': residuals
      })
      results_df.to_csv(f"predictions_{name}.csv", index=False)
      mlflow.log_artifact(f"predictions_{name}.csv")

      # Store results
      results[name] = {
          'model': model,
          'metrics': metrics,
          'predictions': y_pred_test,  # needed by evaluate_regression_models
          'run_id': mlflow.active_run().info.run_id
      }

      print("✓ Model logged successfully!")
      print(f"  RMSE: {metrics['rmse']:.2f}")
      print(f"  R²: {metrics['r2']:.4f}")
  return results
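
With `scoring='neg_mean_squared_error'`, `cross_val_score` returns negated MSE values, so the sign has to be flipped before taking a square root; otherwise the result is `NaN`. The sketch below shows the conversion to per-fold RMSE on synthetic regression data, with a linear model as a stand-in.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for the max-EMI target
X_toy, y_toy = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)

neg_mse = cross_val_score(
    LinearRegression(), X_toy, y_toy, cv=5, scoring="neg_mean_squared_error"
)

# Scores are <= 0 by convention (higher is better), so negate first
fold_rmse = np.sqrt(-neg_mse)
print(f"CV RMSE: {fold_rmse.mean():.2f} +/- {fold_rmse.std():.2f}")
```

scikit-learn also offers `scoring="neg_root_mean_squared_error"`, which returns negated RMSE directly and avoids the manual square root.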


In [None]:
# 3. Model Evaluation and Visualization
def evaluate_regression_models(results, y_test):
    """
    Comprehensive evaluation and visualization for regression
    """
    # Create comparison DataFrame
    comparison_data = []
    for name, result in results.items():
        metrics = result['metrics']
        comparison_data.append({
            'Model': name,
            'RMSE': metrics['rmse'],
            'MAE': metrics['mae'],
            'R²': metrics['r2'],
            'MAPE (%)': metrics['mape'],
            'Train R²': metrics['train_r2'],
            'CV RMSE': metrics['cv_rmse_mean']
        })

    comparison_df = pd.DataFrame(comparison_data)
    print("\n" + "="*80)
    print("REGRESSION MODEL COMPARISON")
    print("="*80)
    print(comparison_df.to_string(index=False))

    # Visualizations
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))

    # 1. Metrics Comparison
    comparison_df.set_index('Model')[['RMSE', 'MAE']].plot(kind='bar', ax=axes[0, 0])
    axes[0, 0].set_title('Error Metrics Comparison')
    axes[0, 0].set_ylabel('Error')
    axes[0, 0].legend()

    # 2. R² Comparison
    comparison_df.set_index('Model')[['R²', 'Train R²']].plot(kind='bar', ax=axes[0, 1])
    axes[0, 1].set_title('R² Score Comparison')
    axes[0, 1].set_ylabel('R² Score')
    axes[0, 1].legend()
    axes[0, 1].set_ylim([0, 1])

    # 3. Actual vs Predicted (Best Model)
    best_model_name = comparison_df.loc[comparison_df['R²'].idxmax(), 'Model']
    best_predictions = results[best_model_name]['predictions']
    axes[1, 0].scatter(y_test, best_predictions, alpha=0.5)
    axes[1, 0].plot([y_test.min(), y_test.max()],
                    [y_test.min(), y_test.max()], 'r--', lw=2)
    axes[1, 0].set_xlabel('Actual Values')
    axes[1, 0].set_ylabel('Predicted Values')
    axes[1, 0].set_title(f'Actual vs Predicted - {best_model_name}')

    # 4. Residuals Plot
    residuals = y_test - best_predictions
    axes[1, 1].scatter(best_predictions, residuals, alpha=0.5)
    axes[1, 1].axhline(y=0, color='r', linestyle='--', lw=2)
    axes[1, 1].set_xlabel('Predicted Values')
    axes[1, 1].set_ylabel('Residuals')
    axes[1, 1].set_title(f'Residual Plot - {best_model_name}')

    plt.tight_layout()
    plt.savefig('regression_model_comparison.png', dpi=300, bbox_inches='tight')
    plt.show()

    return comparison_df
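
For reference, the metrics in the comparison table can be reproduced by hand. The sketch below computes MAE, RMSE, MAPE (as a percentage, matching the `* 100` used in the training function) and R² with NumPy on three made-up values.

```python
import numpy as np

# Made-up actual and predicted EMI values, for illustration only
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 180.0, 400.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))                  # mean absolute error
mse = np.mean(errors ** 2)                     # mean squared error
rmse = np.sqrt(mse)                            # same units as the target
mape = np.mean(np.abs(errors / y_true)) * 100  # percentage error

# R^2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  MAPE={mape:.2f}%  R2={r2:.4f}")
```

Note that MAPE divides by the actual values, so it is undefined or unstable when the target contains zeros or values close to zero.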

Hyperparameter-Tuned Regression Models with MLflow

In [None]:
def train_and_log_regression_models(X_train, X_test, y_train, y_test,
                                   feature_names, experiment_name='Hyperpar_tuned_MAX_EMI_prediction'):
    """
    Train regression models and log to MLflow
    """
    setup_mlflow(experiment_name=experiment_name)

    models = {
        'Linear_Regression': {
            'model': LinearRegression(),
            'params': {'fit_intercept': True}
        },
        'Random_Forest': {
            'model': RandomForestRegressor(n_estimators=100, random_state=42,
                                          max_depth=10, min_samples_split=5),
            'params': {'n_estimators': 100, 'max_depth': 10,
                      'min_samples_split': 5}
        },
        'XGBoost': {
            'model': XGBRegressor(random_state=42, n_estimators=100,
                                 learning_rate=0.1, max_depth=6),
            'params': {'n_estimators': 100, 'learning_rate': 0.1, 'max_depth': 6}
        },
        'Gradient_Boosting': {
            'model': GradientBoostingRegressor(random_state=42, n_estimators=100, learning_rate=0.1, max_depth=3),
            'params': {'n_estimators': 100, 'learning_rate': 0.1, 'max_depth': 3}
        },
        'Ridge_Regression': {
            'model': Ridge(alpha=1.0),
            'params': {'alpha': 1.0, 'solver': 'auto'}
        },
        'SVR': {
            'model': SVR(kernel='rbf', C=1.0, epsilon=0.1),
            'params': {'kernel': 'rbf', 'C': 1.0, 'epsilon': 0.1}
        }
    }

    results = {}

    for name, model_info in models.items():
        with mlflow.start_run(run_name=f"Regression_{name}"):
            print(f"\n{'='*60}")
            print(f"Training and logging: {name}")
            print(f"{'='*60}")

            model = model_info['model']
            params = model_info['params']

            # Log parameters
            mlflow.log_params(params)
            mlflow.log_param("model_type", "regression")
            mlflow.log_param("algorithm", name)

            # Train model
            model.fit(X_train, y_train)

            # Predictions
            y_pred_train = model.predict(X_train)
            y_pred_test = model.predict(X_test)

            # Calculate metrics
            metrics = {
                'rmse': np.sqrt(mean_squared_error(y_test, y_pred_test)),
                'mae': mean_absolute_error(y_test, y_pred_test),
                'r2': r2_score(y_test, y_pred_test),
                'mape': mean_absolute_percentage_error(y_test, y_pred_test) * 100,
                'train_r2': r2_score(y_train, y_pred_train),
                'mse': mean_squared_error(y_test, y_pred_test)
            }

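            # Cross-validation: convert each fold's negated MSE to RMSE first,
            # then summarise, so the mean and std are both in RMSE units
            cv_scores = cross_val_score(model, X_train, y_train, cv=5,
                                       scoring='neg_mean_squared_error')
            cv_rmse = np.sqrt(-cv_scores)
            metrics['cv_rmse_mean'] = cv_rmse.mean()
            metrics['cv_rmse_std'] = cv_rmse.std()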

            # Log metrics
            mlflow.log_metrics(metrics)

            # Create and log actual vs predicted plot
            fig, axes = plt.subplots(1, 2, figsize=(15, 6))

            # Actual vs Predicted
            axes[0].scatter(y_test, y_pred_test, alpha=0.5)
            axes[0].plot([y_test.min(), y_test.max()],
                        [y_test.min(), y_test.max()], 'r--', lw=2)
            axes[0].set_xlabel('Actual Values')
            axes[0].set_ylabel('Predicted Values')
            axes[0].set_title(f'Actual vs Predicted - {name}')

            # Residuals
            residuals = y_test - y_pred_test
            axes[1].scatter(y_pred_test, residuals, alpha=0.5)
            axes[1].axhline(y=0, color='r', linestyle='--', lw=2)
            axes[1].set_xlabel('Predicted Values')
            axes[1].set_ylabel('Residuals')
            axes[1].set_title(f'Residual Plot - {name}')

            plt.tight_layout()
            mlflow.log_figure(fig, f"prediction_plots_{name}.png")
            plt.close()

            # Log feature importance (if available)
            if hasattr(model, 'feature_importances_'):
                importance_df = pd.DataFrame({
                    'feature': feature_names,
                    'importance': model.feature_importances_
                }).sort_values('importance', ascending=False)

                fig, ax = plt.subplots(figsize=(10, 6))
                importance_df.head(15).plot(x='feature', y='importance',
                                           kind='barh', ax=ax)
                ax.set_title(f'Top 15 Feature Importances - {name}')
                plt.tight_layout()
                mlflow.log_figure(fig, f"feature_importance_{name}.png")
                plt.close()

                importance_df.to_csv(f"feature_importance_{name}.csv", index=False)
                mlflow.log_artifact(f"feature_importance_{name}.csv")

            # Infer signature
            signature = infer_signature(X_train, model.predict(X_train))

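            # Log model (currently disabled). register_production_models() expects
            # the model artifact at runs:/<run_id>/model, so re-enable this call
            # before registering a run produced by this function.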
            # mlflow.sklearn.log_model(
            #     model,
            #     artifact_path="model",
            #     signature=signature,
            #     registered_model_name=f"EMI_Regression_{name}"
            # )


            # Save model locally
            joblib.dump(model, os.path.join(reg_models_path, f"{name}_tuned.joblib"))


            # Save prediction results
            results_df = pd.DataFrame({
                'Actual': y_test,
                'Predicted': y_pred_test,
                'Residual': residuals
            })
            results_df.to_csv(f"predictions_tuned{name}.csv", index=False)
            mlflow.log_artifact(f"predictions_tuned{name}.csv")

            # Store results
            results[name] = {
                'model': model,
                'metrics': metrics,
                'run_id': mlflow.active_run().info.run_id
            }

            print("✓ Model logged successfully!")
            print(f"  RMSE: {metrics['rmse']:.2f}")
            print(f"  R²: {metrics['r2']:.4f}")

    return results
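
The cross-validation step in the function above reports a mean and a spread of RMSE over five folds. As a minimal sketch of the arithmetic (the fold scores below are made-up numbers, not results from this project), converting each fold's negated MSE to RMSE before summarising keeps both statistics in the target's units:

```python
import numpy as np

# Hypothetical per-fold scores, shaped like the output of
# cross_val_score(..., scoring='neg_mean_squared_error')
neg_mse_scores = np.array([-100.0, -121.0, -144.0, -169.0, -196.0])

# Convert every fold to RMSE first, then summarise
fold_rmse = np.sqrt(-neg_mse_scores)      # [10, 11, 12, 13, 14]
cv_rmse_mean = fold_rmse.mean()
cv_rmse_std = fold_rmse.std()

# For comparison: the square root of the spread of the MSE values is a
# different quantity and is not the spread of the fold RMSEs
sqrt_of_mse_std = np.sqrt(neg_mse_scores.std())

print(f"CV RMSE: {cv_rmse_mean:.2f} ± {cv_rmse_std:.2f}")
print(f"sqrt(std of MSE): {sqrt_of_mse_std:.2f}")
```

The two spread figures differ noticeably even on this small example, which is why the per-fold conversion is the safer default when the number is reported next to an RMSE.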

# Training the models

In [None]:
## Classification models

In [None]:
# clf_results= train_classification_models(x_train_clf, x_val_clf, y_train_clf, y_val_clf)

In [None]:
# clf_models_evaluation_df=evaluate_classification_models(results=clf_results,y_test=y_val_clf)

In [None]:
# clf_hyperpara_tuned_results= train_and_log_classification_models(x_train_clf, x_val_clf, y_train_clf, y_val_clf,feature_names=x_train_clf.columns.tolist(), experiment_name='Hyperpar_tuned_EMI_Classification')

In [None]:
# clf_tuned_models_evaluation_df=evaluate_classification_models(results=clf_hyperpara_tuned_results,y_test=y_val_clf)

In [None]:
## Regression models

In [None]:
# reg_results= train_regression_models(x_train_rg, x_val_rg, y_train_rg, y_val_rg)

In [None]:
# rg_models_evaluation_df=evaluate_regression_models(results=reg_results,y_test=y_val_rg)

In [None]:
# rg_hyperpara_tuned_results= train_and_log_regression_models(x_train_rg, x_val_rg, y_train_rg, y_val_rg,feature_names=x_train_rg.columns.tolist())

In [None]:
# rg_models_evaluation_df=evaluate_regression_models(results=rg_hyperpara_tuned_results,y_test=y_val_rg)

## Registering models and production deployment

def compare_and_select_best_models(classification_results, regression_results):
    """
    Compare all models and select the best performing ones
    """
    print("\n" + "="*80)
    print("MODEL SELECTION AND COMPARISON")
    print("="*80)
    
    # Classification Model Comparison
    print("\n📊 CLASSIFICATION MODELS COMPARISON:")
    print("-" * 80)
    
    classification_comparison = []
    for name, result in classification_results.items():
        metrics = result['metrics']
        classification_comparison.append({
            'Model': name,
            'Accuracy': f"{metrics['accuracy']:.4f}",
            'Precision': f"{metrics['precision']:.4f}",
            'Recall': f"{metrics['recall']:.4f}",
            'F1-Score': f"{metrics['f1_score']:.4f}",
            'ROC-AUC': f"{metrics['roc_auc']:.4f}",
            'CV Accuracy': f"{metrics['cv_accuracy_mean']:.4f} ± {metrics['cv_accuracy_std']:.4f}",
            'Run ID': result['run_id']
        })
    
    classification_df = pd.DataFrame(classification_comparison)
    print(classification_df.to_string(index=False))
    
    # Find best classification model
    best_classification_idx = classification_df['ROC-AUC'].astype(float).idxmax()
    best_classification_model = classification_df.loc[best_classification_idx, 'Model']
    best_classification_auc = classification_df.loc[best_classification_idx, 'ROC-AUC']
    
    print(f"\n🏆 Best Classification Model: {best_classification_model}")
    print(f"   ROC-AUC Score: {best_classification_auc}")
    
    # Regression Model Comparison
    print("\n" + "="*80)
    print("📊 REGRESSION MODELS COMPARISON:")
    print("-" * 80)
    
    regression_comparison = []
    for name, result in regression_results.items():
        metrics = result['metrics']
        regression_comparison.append({
            'Model': name,
            'RMSE': f"{metrics['rmse']:.2f}",
            'MAE': f"{metrics['mae']:.2f}",
            'R²': f"{metrics['r2']:.4f}",
            'MAPE (%)': f"{metrics['mape']:.2f}",
            'Train R²': f"{metrics['train_r2']:.4f}",
            'CV RMSE': f"{metrics['cv_rmse_mean']:.2f} ± {metrics['cv_rmse_std']:.2f}",
            'Run ID': result['run_id']
        })
    
    regression_df = pd.DataFrame(regression_comparison)
    print(regression_df.to_string(index=False))
    
    # Find best regression model (the column holds formatted strings, so cast to float)
    best_regression_idx = regression_df['R²'].astype(float).idxmax()
    best_regression_model = regression_df.loc[best_regression_idx, 'Model']
    best_regression_r2 = regression_df.loc[best_regression_idx, 'R²']
    
    print(f"\n🏆 Best Regression Model: {best_regression_model}")
    print(f"   R² Score: {best_regression_r2}")
    
    # Save comparison reports
    classification_df.to_csv('classification_models_comparison.csv', index=False)
    regression_df.to_csv('regression_models_comparison.csv', index=False)
    
    print("\n✓ Comparison reports saved!")
    
    return {
        'best_classification': {
            'name': best_classification_model,
            'model': classification_results[best_classification_model]['model'],
            'metrics': classification_results[best_classification_model]['metrics'],
            'run_id': classification_results[best_classification_model]['run_id']
        },
        'best_regression': {
            'name': best_regression_model,
            'model': regression_results[best_regression_model]['model'],
            'metrics': regression_results[best_regression_model]['metrics'],
            'run_id': regression_results[best_regression_model]['run_id']
        }
    }
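
The comparison above formats metrics as strings for display and converts them back with `astype(float)` before ranking. A small sketch of why that conversion matters, using the test RMSE values quoted in this notebook's regression analysis (the model labels are illustrative names, not keys taken from the results dictionaries):

```python
import pandas as pd

metrics_df = pd.DataFrame({
    "Model": ["Random_Forest", "XGBoost", "XGBoost_tuned", "Random_Forest_tuned"],
    "RMSE": [529.0, 628.0, 707.0, 1335.0],
})

# Rank on the numeric column
best_model = metrics_df.loc[metrics_df["RMSE"].idxmin(), "Model"]

# Format only a copy that is meant for printing
display_df = metrics_df.assign(RMSE=metrics_df["RMSE"].map("{:.2f}".format))

# Comparing the formatted strings is lexical: '1' sorts before '5'
lexical_min = min(display_df["RMSE"])

print(display_df.to_string(index=False))
print("Best by numeric RMSE:", best_model)
print("Smallest formatted string:", lexical_min)
```

Keeping a numeric frame for selection and a formatted copy for reporting avoids the round trip through strings altogether.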

# E. Model Registry and Production Deployment
def register_production_models(best_models):
    """
    Register best models to MLflow Model Registry for production
    """
    print("\n" + "="*80)
    print("MODEL REGISTRY - PRODUCTION DEPLOYMENT")
    print("="*80)
    
    from mlflow.tracking import MlflowClient
    client = MlflowClient()
    
    # Register Classification Model
    classification_name = best_models['best_classification']['name']
    classification_run_id = best_models['best_classification']['run_id']
    
    print(f"\n📦 Registering Classification Model: {classification_name}")
    
    model_uri_classification = f"runs:/{classification_run_id}/model"
    model_details_classification = mlflow.register_model(
        model_uri=model_uri_classification,
        name="EMI_Classification_Production"
    )
    
    # Transition to Production
    client.transition_model_version_stage(
        name="EMI_Classification_Production",
        version=model_details_classification.version,
        stage="Production",
        archive_existing_versions=True
    )
    
    # Add model description
    client.update_model_version(
        name="EMI_Classification_Production",
        version=model_details_classification.version,
        description=f"Best performing classification model: {classification_name}. "
                   f"ROC-AUC: {best_models['best_classification']['metrics']['roc_auc']:.4f}"
    )
    
    print(f"✓ Classification model registered as version {model_details_classification.version}")
    print("✓ Model transitioned to Production stage")
    
    # Register Regression Model
    regression_name = best_models['best_regression']['name']
    regression_run_id = best_models['best_regression']['run_id']
    
    print(f"\n📦 Registering Regression Model: {regression_name}")
    
    model_uri_regression = f"runs:/{regression_run_id}/model"
    model_details_regression = mlflow.register_model(
        model_uri=model_uri_regression,
        name="EMI_Regression_Production"
    )
    
    # Transition to Production
    client.transition_model_version_stage(
        name="EMI_Regression_Production",
        version=model_details_regression.version,
        stage="Production",
        archive_existing_versions=True
    )
    
    # Add model description
    client.update_model_version(
        name="EMI_Regression_Production",
        version=model_details_regression.version,
        description=f"Best performing regression model: {regression_name}. "
                   f"R²: {best_models['best_regression']['metrics']['r2']:.4f}"
    )
    
    print(f"✓ Regression model registered as version {model_details_regression.version}")
    print("✓ Model transitioned to Production stage")
    
    print("\n" + "="*80)
    print("✅ PRODUCTION MODELS SUCCESSFULLY DEPLOYED!")
    print("="*80)
    
    return {
        'classification': model_details_classification,
        'regression': model_details_regression
    }

# F. Load Production Models for Inference
def load_production_models():
    """
    Load production models from MLflow Model Registry
    """
    print("\n📥 Loading Production Models from Registry...")
    
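    # Load Classification Model with the native sklearn flavor: the inference
    # function calls predict_proba(), which a generic mlflow.pyfunc model does
    # not expose. This assumes the model was logged with mlflow.sklearn.log_model.
    classification_model = mlflow.sklearn.load_model(
        model_uri="models:/EMI_Classification_Production/Production"
    )
    print("✓ Classification model loaded")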
    
    # Load Regression Model
    regression_model = mlflow.pyfunc.load_model(
        model_uri="models:/EMI_Regression_Production/Production"
    )
    print("✓ Regression model loaded")
    
    return {
        'classification': classification_model,
        'regression': regression_model
    }

# G. Model Inference Function
def predict_emi_eligibility_and_amount(new_data, scaler=None):
    """
    Make predictions using production models
    """
    # Load production models
    models = load_production_models()
    
    # Preprocess new data
    if scaler:
        new_data_scaled = scaler.transform(new_data)
    else:
        new_data_scaled = new_data
    
    # Classification prediction
    eligibility_prediction = models['classification'].predict(new_data_scaled)
    eligibility_proba = models['classification'].predict_proba(new_data_scaled)
    
    # Regression prediction
    emi_amount_prediction = models['regression'].predict(new_data_scaled)
    
    results = pd.DataFrame({
        'EMI_Eligible': eligibility_prediction,
        'Eligibility_Probability': eligibility_proba[:, 1],
        'Predicted_EMI_Amount': emi_amount_prediction
    })
    
    return results

# Best model selection process

In [None]:
def clf_model_analysis(model,model_name):
  print(f"{'='*20}{model_name.split()[0]}{'='*20}\n")
  y_preds_train=model.predict(x_train_clf)
  print(f"train_accuracy:{accuracy_score(y_train_clf,y_preds_train)}")
  y_preds_val=model.predict(x_val_clf)
  print(f"val_accuracy:{accuracy_score(y_val_clf,y_preds_val)}")
  y_pred_test=model.predict(x_test_clf)
  print(f"test_accuracy:{accuracy_score(y_test_clf,y_pred_test)}\n")


In [None]:
def reg_model_analysis(model,model_name):
  print(f"{'='*20}{model_name}{'='*20}\n")
  y_preds_train=model.predict(x_train_rg)
  print(f"train_RMSE:{np.sqrt(mean_squared_error(y_train_rg,y_preds_train))}")
  y_preds_val=model.predict(x_val_rg)
  print(f"val_RMSE:{np.sqrt(mean_squared_error(y_val_rg,y_preds_val))}")
  y_pred_test=model.predict(x_test_rg)
  print(f"test_RMSE:{np.sqrt(mean_squared_error(y_test_rg,y_pred_test))}\n")
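
The two analysis helpers above read the train, validation and test splits from notebook globals. A possible variant, sketched here on synthetic data, passes the splits in explicitly so the same function can be reused for any model. The helper name `rmse_by_split` and the generated data are hypothetical and only illustrate the idea.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def rmse_by_split(model, splits):
    """Return {split_name: RMSE} for a fitted model.

    `splits` maps a split name to an (X, y) tuple.
    """
    return {
        split_name: float(np.sqrt(mean_squared_error(y, model.predict(X))))
        for split_name, (X, y) in splits.items()
    }


# Synthetic stand-in for the EMI features and target
rng = np.random.default_rng(42)
X = rng.normal(size=(600, 4))
y = 3.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.3, size=600)

# 60 / 20 / 20 split into train, validation and test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)

model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_train, y_train)

scores = rmse_by_split(model, {
    "train": (X_train, y_train),
    "val": (X_val, y_val),
    "test": (X_test, y_test),
})
for split_name, rmse in scores.items():
    print(f"{split_name}_RMSE: {rmse:.3f}")
```

Returning a dictionary rather than printing also makes it easy to collect the per-split scores of several models into one DataFrame.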


In [None]:
clf_dir_list=os.listdir('/content/drive/MyDrive/Labmentix intern projects/EMIPredict AI /Models/clf_models')

In [None]:
clf_dir_list

In [None]:
for model_name in clf_dir_list:
  model=joblib.load(f'/content/drive/MyDrive/Labmentix intern projects/EMIPredict AI /Models/clf_models/{model_name}')
  clf_model_analysis(model,model_name=model_name)

In [None]:
reg_dir_list=os.listdir('/content/drive/MyDrive/Labmentix intern projects/EMIPredict AI /Models/reg_models')

In [None]:
reg_dir_list

In [None]:
for model_name in reg_dir_list:
  model=joblib.load(f'/content/drive/MyDrive/Labmentix intern projects/EMIPredict AI /Models/reg_models/{model_name}')
  reg_model_analysis(model,model_name=model_name)

# 🟢 Best Generalising Classification Models

## ✅ 1. XGBoost (Tuned): **BEST GENERALISATION**

**Performance (accuracy):**

-   Train = **0.9716**
-   Val = **0.9707**
-   Test = **0.9725**

### ✔ Why does it generalize best?

-   Very small train-validation-test gap (about 0.0018 in accuracy)
-   Regularized through `max_depth` and `learning_rate`
-   High accuracy with low variance across splits
-   Gradient boosting is generally robust to noisy features and class imbalance
-   Widely used in practice for tabular data

➡ **Most generalizable model in these results.**

------------------------------------------------------------------------

## ✅ 2. Logistic Regression (Tuned): **MOST STABLE & SIMPLE**

**Performance (accuracy):**

-   Train = **0.8988**
-   Val = **0.8975**
-   Test = **0.8991**

### ✔ Why does it generalize well?

-   Smallest gap across splits (about 0.0016 in accuracy)
-   Very low overfitting risk
-   Highly interpretable linear decision boundary
-   Best choice for explainability and stability

➡ **Safest low-variance generalizing model.**
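
The generalization gaps discussed above can be checked with a few lines of arithmetic. This is a small sketch that defines the gap as the highest split accuracy minus the lowest, which is one reasonable convention assumed here rather than something the notebook fixes; the accuracies are the ones reported above.

```python
# Accuracy per split, copied from the results reported above
accuracy = {
    "XGBoost (Tuned)": {"train": 0.9716, "val": 0.9707, "test": 0.9725},
    "Logistic Regression (Tuned)": {"train": 0.8988, "val": 0.8975, "test": 0.8991},
}


def split_gap(scores):
    """Spread between the best and worst split, rounded to 4 decimals."""
    return round(max(scores.values()) - min(scores.values()), 4)


gaps = {name: split_gap(scores) for name, scores in accuracy.items()}

for name, gap in gaps.items():
    print(f"{name}: max-min accuracy gap = {gap:.4f}")
```

Both gaps are below 0.002, so the choice between the two models rests on accuracy versus interpretability rather than on overfitting.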

------------------------------------------------------------------------

# 🏆 Final Generalization Recommendation

### 🔹 If you want the **MOST generalizable + highest accuracy** model:

## ⭐ **XGBoost (Tuned)**

### 🔹 If you want the **MOST stable + interpretable** model:

## ⭐ **Logistic Regression (Tuned)**

------------------------------------------------------------------------

# 📊 Summary Table

  -------------------------------------------------------------------------------
  Model         Generalization Quality                         Why
  ------------- ---------------------------------------------- ------------------
  **XGBoost     ⭐⭐⭐⭐⭐                                     Best trade-off:
  (Tuned)**                                                    accuracy +
                                                               regularization +
                                                               low variance

  **Logistic    ⭐⭐⭐⭐                                       Most stable,
  Regression                                                   simplest boundary,
  (Tuned)**                                                    minimal
                                                               overfitting

  **Gradient    ⭐⭐⭐⭐                                       Strong but
  Boosting                                                     slightly less
  (Tuned)**                                                    generalizing than
                                                               XGB

  **Random      ⭐⭐⭐                                         Good stability but
  Forest                                                       lower performance
  (Tuned)**                                                    

  **Decision    ⭐                                             Overfits in
  Tree**                                                       general despite
                                                               good metrics
  -------------------------------------------------------------------------------

------------------------------------------------------------------------

# 📌 Conclusion

For this classification problem:

-   **XGBoost (Tuned)** → Best for high accuracy + strong generalization
-   **Logistic Regression (Tuned)** → Best for stability, simplicity, and interpretability


# ✅ Best Regression Model (Most Accurate + Most Generalized)

## ⭐ Random Forest (Untuned)

### **Why?**

-   **Lowest RMSE among all models**
    -   Train: **526**
    -   Validation: **490**
    -   Test: **529**
-   Train-test RMSE gap of only 3 (526 vs 529) → **excellent generalization**
-   Test RMSE about 16% lower than untuned XGBoost (529 vs 628) and far below the linear models (≈ 4000)
-   Low bias + low variance due to ensemble averaging

➡ **This is the strongest model in these results.**

------------------------------------------------------------------------

## 🥈 Second Best Model

### ⭐ XGBoost (Untuned)

-   Train: **621**
-   Val: **597**
-   Test: **628**
-   Good generalization but slightly worse RMSE than Random Forest
-   More stable than Decision Tree

------------------------------------------------------------------------

## ⚠️ Models to Avoid

### ❌ Linear Regression / Tuned Linear / Ridge Regression

-   RMSE ≈ **4000** → the linear models underfit badly
-   This points to:
    -   Strong non-linear relationships
    -   Feature interactions
    -   Complexity that a linear model cannot capture
-   Not suitable for this dataset

### ❌ Decision Tree

-   Low train RMSE but much higher test RMSE
-   **Overfitting**
-   Avoid unless pruned or regularized
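
To illustrate the last point, here is a minimal sketch on synthetic data (not the EMI dataset) that compares an unconstrained decision tree with a depth- and leaf-size-limited one. The `max_depth` and `min_samples_leaf` values are illustrative assumptions, not tuned settings from this project.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic, noisy non-linear data
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(2000, 3))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.5, size=2000)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)


def rmse(model, X_part, y_part):
    """Root mean squared error of a fitted model on one data split."""
    return float(np.sqrt(mean_squared_error(y_part, model.predict(X_part))))


candidates = {
    "unconstrained": DecisionTreeRegressor(random_state=0),
    # Limits on depth and leaf size act as regularization for a single tree
    "regularized": DecisionTreeRegressor(max_depth=8, min_samples_leaf=20, random_state=0),
}

report = {}
for label, tree in candidates.items():
    tree.fit(X_train, y_train)
    report[label] = {
        "train": rmse(tree, X_train, y_train),
        "test": rmse(tree, X_test, y_test),
    }
    print(f"{label}: train RMSE={report[label]['train']:.3f}, "
          f"test RMSE={report[label]['test']:.3f}")
```

The unconstrained tree memorises the training rows, so its train RMSE collapses towards zero while the test RMSE does not; the constrained tree keeps the two much closer together.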

------------------------------------------------------------------------

## ‚≠ê Tuned Models Performance

  ------------------------------------------------------------------------
  Model       Train RMSE          Test RMSE          Observation
  ----------- ------------------- ------------------ ---------------------
  Random      1366                1335               Worse than untuned
  Forest                                             
  (Tuned)                                            

  Gradient    1404                1350               High error
  Boosting                                           
  (Tuned)                                            

  XGBoost     714                 707                Worse than untuned
  (Tuned)                                            due to
                                                     over-regularization
  ------------------------------------------------------------------------

➡ **The tuned settings (for example `max_depth=10` for Random Forest) reduced model capacity → higher error.**
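
The size of that effect can be quantified from the test RMSE values quoted in this section. A short sketch follows; Gradient Boosting is left out because no untuned figure is reported for it here.

```python
# Test RMSE values quoted in this section
test_rmse = {
    "Random Forest": {"untuned": 529, "tuned": 1335},
    "XGBoost": {"untuned": 628, "tuned": 707},
}

# Relative increase of the tuned model's test RMSE over the untuned one
pct_increase = {
    name: round(100 * (r["tuned"] - r["untuned"]) / r["untuned"], 1)
    for name, r in test_rmse.items()
}

for name, pct in pct_increase.items():
    print(f"{name}: tuned test RMSE is {pct}% higher than untuned")
```

The increase is far larger for Random Forest than for XGBoost, which is consistent with the depth cap being the more restrictive change.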

------------------------------------------------------------------------

# 🏆 Final Recommendation Summary

  -------------------------------------------------------------------------
  Rank         Model            Performance                Reason
  ------------ ---------------- -------------------------- ----------------
  **1 (Best)** Random Forest    ⭐⭐⭐⭐⭐                 Best RMSE + Best
               (Untuned)                                   generalization

  **2**        XGBoost          ⭐⭐⭐⭐                   Strong, stable
               (Untuned)                                   model

  **3**        Decision Tree    ⭐⭐⭐                     Moderate, but
                                                           overfits

  **4**        Gradient         ⭐⭐                       Higher error
               Boosting/Tuned                              
               Models                                      

  **5          Linear / Ridge   ⭐                         Model not
  (Worst)**    Regression                                  fitting data
  -------------------------------------------------------------------------

------------------------------------------------------------------------

## 📌 Final Recommendation (For Project Report)

### ✔ For Highest Accuracy + Best Generalization

### ⭐ **Random Forest (Untuned)**

### ✔ For Backup Model

### ⭐ **XGBoost (Untuned)**
