# **MODELING**

This notebook implements machine learning models based on the comprehensive EDA findings and preprocessing recommendations. We'll follow the evidence-based approach to ensure our modeling strategy aligns with the data patterns discovered.

**Modeling Strategy**

1. **Model Selection** - Random Forest, XGBoost, Logistic Regression (EDA recommendations)
2. **Hyperparameter Tuning** - StratifiedKFold cross-validation with macro-F1 optimization
3. **Class Imbalance Handling** - Balanced class weights and stratified sampling
4. **Feature Importance** - Validate engineered features from EDA
5. **Model Evaluation** - Confusion matrix, balanced accuracy, macro-F1
6. **Explainability** - SHAP values for feature interpretation
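
As an illustration of strategy item 2, the following is a minimal sketch of a macro-F1 grid search over stratified folds. The synthetic data from `make_classification`, the small `param_grid`, and the names `X_demo`, `y_demo`, and `search` are assumptions made for the example; they are not taken from this notebook's dataset or final configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in data (assumption: roughly a 30/70 class split)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=7, n_informative=4,
    weights=[0.3, 0.7], random_state=234,
)

# Stratified folds keep the class ratio similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=234)

# Hypothetical, deliberately small grid
param_grid = {"max_depth": [3, 5, None], "min_samples_leaf": [1, 5]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=234),
    param_grid=param_grid,
    scoring="f1_macro",  # select the configuration with the best macro-F1
    cv=cv,
    n_jobs=1,
)
search.fit(X_demo, y_demo)

print("Best params:", search.best_params_)
print(f"Best CV macro-F1: {search.best_score_:.3f}")
```

Swapping in the real training split would only require replacing `X_demo` and `y_demo`; the grid values would need to be chosen for the actual data.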

**EDA Evidence for Modeling**
- **Non-linear interactions** - Tree-based models (Random Forest, XGBoost)
- **Class imbalance** - Stratified sampling, balanced metrics
- **Feature engineering** - Validate engineered ratios (e.g., DTI) and interactions
- **Production focus** - Interpretable models with feature importance
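
To make the "balanced metrics" point concrete, here is a small sketch showing that balanced accuracy is the mean of per-class recall, and how it can differ from plain accuracy on imbalanced labels. The toy labels below are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             confusion_matrix, recall_score)

# Toy labels (hypothetical): 3 negatives, 7 positives
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 1, 1, 1, 1, 1, 1, 1, 1, 0])

cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class
per_class_recall = recall_score(y_true, y_pred, average=None)

acc = accuracy_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)

print(cm)
print(f"Accuracy: {acc:.3f}")
print(f"Per-class recall: {per_class_recall.round(3)}")
print(f"Balanced accuracy: {bal_acc:.3f} (mean of per-class recall)")
```

Because the minority class is recalled poorly here, balanced accuracy comes out lower than accuracy, which is the behaviour that motivates using it alongside macro-F1.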


## **Import Libraries and Load Preprocessed Data**


In [34]:
# Core libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Machine Learning libraries
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score, GridSearchCV
from sklearn.metrics import (classification_report, confusion_matrix, accuracy_score, 
                           balanced_accuracy_score, f1_score, precision_score, recall_score)
from sklearn.preprocessing import StandardScaler
import joblib

# Advanced ML libraries
try:
    import xgboost as xgb
    XGBOOST_AVAILABLE = True
except ImportError:
    XGBOOST_AVAILABLE = False
    print("XGBoost not available. Install with: pip install xgboost")

try:
    import lightgbm as lgb
    LIGHTGBM_AVAILABLE = True
except ImportError:
    LIGHTGBM_AVAILABLE = False
    print("LightGBM not available. Install with: pip install lightgbm")

# Explainability
try:
    import shap
    SHAP_AVAILABLE = True
except ImportError:
    SHAP_AVAILABLE = False
    print("SHAP not available. Install with: pip install shap")

# Set style for better visualizations
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("Libraries imported successfully!")
print(f"XGBoost available: {XGBOOST_AVAILABLE}")
print(f"LightGBM available: {LIGHTGBM_AVAILABLE}")
print(f"SHAP available: {SHAP_AVAILABLE}")

XGBoost not available. Install with: pip install xgboost
LightGBM not available. Install with: pip install lightgbm
SHAP not available. Install with: pip install shap
Libraries imported successfully!
XGBoost available: False
LightGBM available: False
SHAP available: False
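
Since XGBoost was not installed in this run, one possible way to keep a boosted-tree model in the comparison is to fall back to scikit-learn's `GradientBoostingClassifier`. This is a sketch under that assumption, not the notebook's method: the helper name `make_boosting_model` is hypothetical, the demo data is synthetic, and the two libraries are not tuned to be equivalent.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier


def make_boosting_model(random_state=234):
    """Return an XGBoost classifier if importable, else a scikit-learn fallback."""
    try:
        from xgboost import XGBClassifier
        return XGBClassifier(n_estimators=100, random_state=random_state)
    except ImportError:
        # Fallback chosen for this sketch only (an assumption, not the original plan)
        return GradientBoostingClassifier(n_estimators=100, random_state=random_state)


# Synthetic stand-in data, used only to show that the returned model trains
X_demo, y_demo = make_classification(n_samples=200, n_features=7, random_state=234)

booster = make_boosting_model()
booster.fit(X_demo, y_demo)
print(type(booster).__name__, "train accuracy:", round(booster.score(X_demo, y_demo), 3))
```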


In [35]:
print("Loading preprocessed data")
try:
    # Load feature matrices
    X_train = pd.read_csv("X_train.csv")
    X_val = pd.read_csv("X_val.csv")
    X_test = pd.read_csv("X_test.csv")

    # Load preprocessing objects
    selected_features = joblib.load('selected_features.pkl')

    # Load targets as Series (squeeze the single-column DataFrames)
    y_train = pd.read_csv("y_train.csv").squeeze()
    y_val = pd.read_csv("y_val.csv").squeeze()
    y_test = pd.read_csv("y_test.csv").squeeze()
    
    print("Preprocessed data loaded")
    print(f"Training set: {X_train.shape}")
    print(f"Validation set: {X_val.shape}")
    print(f"Test set: {X_test.shape}")
    
    # Display class distribution
    print("\nClass distribution:")
    print("Training set:")
    print(y_train.value_counts().sort_index())
    print("\nValidation set:")
    print(y_val.value_counts().sort_index())
    print("\nTest set:")
    print(y_test.value_counts().sort_index())
    
    # Display selected features
    print(f"\nSelected features ({len(selected_features)}):")
    for i, feature in enumerate(selected_features, 1):
        print(f"{i:2d}. {feature}")

except FileNotFoundError as e:
    print(f"Error loading preprocessed data: {e}")
    print("""Run the preprocessing notebook first to generate the required files.
    Required files:
        X_train.csv, X_val.csv, X_test.csv
        y_train.csv, y_val.csv, y_test.csv
        selected_features.pkl""")


Loading preprocessed data
Preprocessed data loaded
Training set: (346, 7)
Validation set: (116, 7)
Test set: (116, 7)

Class distribution:
Training set:
Loan_Status
0    105
1    241
Name: count, dtype: int64

Validation set:
Loan_Status
0    35
1    81
Name: count, dtype: int64

Test set:
Loan_Status
0    35
1    81
Name: count, dtype: int64

Selected features (7):
 1. Credit_History
 2. Property_Area_Semiurban
 3. Married
 4. DTI
 5. Education
 6. TotalIncome
 7. CoapplicantIncome
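
The class counts above (105 vs. 241 in the training set) determine the weights that `class_weight="balanced"` assigns. The sketch below reproduces that calculation from the printed counts alone; the array `y_counts` is rebuilt for illustration and is not loaded from `y_train.csv`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Labels rebuilt from the training counts printed above: 105 zeros, 241 ones
y_counts = np.array([0] * 105 + [1] * 241)
classes = np.array([0, 1])

weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_counts)

# Same formula by hand: n_samples / (n_classes * count_per_class)
manual = len(y_counts) / (len(classes) * np.bincount(y_counts))

for cls, w in zip(classes, weights):
    print(f"class {cls}: weight {w:.3f}")
# class 0: weight 1.648
# class 1: weight 0.718
```

The minority class (0) therefore contributes a bit more than twice as much per sample to the loss as the majority class.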


## **Baseline Model: Logistic Regression**

In [36]:
print("="*50)
print("BASELINE LOGISTIC REGRESSION MODEL")
print("="*50)

# Create the model (balanced class weights to offset the class imbalance)
baseline_model = LogisticRegression(random_state=234, max_iter=1000, class_weight="balanced")

# Train the model
baseline_model.fit(X_train, y_train)

# Predict on each split with the trained model
y_train_pred= baseline_model.predict(X_train)
y_val_pred= baseline_model.predict(X_val)
y_test_pred= baseline_model.predict(X_test)

# Metric Calculations
train_accuracy= accuracy_score(y_train, y_train_pred)
val_accuracy= accuracy_score(y_val, y_val_pred)
test_accuracy= accuracy_score(y_test, y_test_pred)

# Balanced accuracy (mean of per-class recall)
train_balanced_acc= balanced_accuracy_score(y_train, y_train_pred)
val_balanced_acc= balanced_accuracy_score(y_val, y_val_pred)
test_balanced_acc= balanced_accuracy_score(y_test, y_test_pred)

# Macro-averaged F1 scores
train_f1= f1_score(y_train, y_train_pred, average="macro")
val_f1= f1_score(y_val, y_val_pred, average="macro")
test_f1= f1_score(y_test, y_test_pred, average="macro")
print()
print("Baseline Model Performance:")
print("---------------------------")
print(f"Train Accuracy: {train_accuracy:.3f}, Balanced Accuracy: {train_balanced_acc:.3f}, Macro F1: {train_f1:.3f}")
print(f"Validation Accuracy: {val_accuracy:.3f}, Balanced Accuracy: {val_balanced_acc:.3f}, Macro F1: {val_f1:.3f}")
print(f"Test Accuracy: {test_accuracy:.3f}, Balanced Accuracy: {test_balanced_acc:.3f}, Macro F1: {test_f1:.3f}")
print("---------------------------")
print()
# Display feature coefficients, sorted by absolute value
print("Feature Coefficients (Top 10):")
print("-------------------------------")

features_importance= pd.DataFrame({
    "feature": X_train.columns,
    "coefficient": baseline_model.coef_[0]
}).sort_values("coefficient", key=abs, ascending=False)

# Number the rows starting at 1
for i, (index, row) in enumerate(features_importance.head(10).iterrows(), 1):
    print(f"{i}. {row['feature']} : {row['coefficient']:.3f}")

# Store baseline results
baseline_results = {
    'model': 'Logistic Regression',
    'train_accuracy': train_accuracy,
    'val_accuracy': val_accuracy,
    'test_accuracy': test_accuracy,
    'train_balanced_acc': train_balanced_acc,
    'val_balanced_acc': val_balanced_acc,
    'test_balanced_acc': test_balanced_acc,
    'train_f1': train_f1,
    'val_f1': val_f1,
    'test_f1': test_f1
}
print()
print(".... Baseline model completed!")


BASELINE LOGISTIC REGRESSION MODEL

Baseline Model Performance:
---------------------------
Train Accuracy: 0.751, Balanced Accuracy: 0.711, Macro F1: 0.709
Validation Accuracy: 0.750, Balanced Accuracy: 0.716, Macro F1: 0.710
Test Accuracy: 0.698, Balanced Accuracy: 0.614, Macro F1: 0.619
---------------------------

Feature Coefficients (Top 10):
-------------------------------
1. Credit_History : 0.928
2. Married : 0.660
3. Property_Area_Semiurban : 0.616
4. CoapplicantIncome : -0.602
5. DTI : -0.513
6. TotalIncome : 0.153
7. Education : 0.080

.... Baseline model completed!
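
As a reference point for reading these scores, the sketch below computes what a classifier that always predicts the majority class would achieve on the test split. It uses only the class counts printed earlier (105/241 for training, 35/81 for test); the arrays are rebuilt for illustration, and `DummyClassifier` is introduced here as an assumption, not as part of the original notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Labels rebuilt from the printed counts (features are irrelevant for this strategy)
y_train_ref = np.array([0] * 105 + [1] * 241)
y_test_ref = np.array([0] * 35 + [1] * 81)
X_train_ref = np.zeros((len(y_train_ref), 1))
X_test_ref = np.zeros((len(y_test_ref), 1))

# Learns the majority class from training labels, then always predicts it
dummy = DummyClassifier(strategy="most_frequent").fit(X_train_ref, y_train_ref)
y_ref_pred = dummy.predict(X_test_ref)

ref_acc = accuracy_score(y_test_ref, y_ref_pred)
ref_bal = balanced_accuracy_score(y_test_ref, y_ref_pred)
ref_f1 = f1_score(y_test_ref, y_ref_pred, average="macro", zero_division=0)

print(f"Accuracy: {ref_acc:.3f}, Balanced Accuracy: {ref_bal:.3f}, Macro F1: {ref_f1:.3f}")
# Accuracy: 0.698, Balanced Accuracy: 0.500, Macro F1: 0.411
```

The baseline's reported test accuracy (0.698) coincides with this majority-class reference, while its balanced accuracy (0.614) and macro F1 (0.619) are above it. This is one reason the strategy relies on balanced metrics and not on accuracy alone.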


## **Random Forest Classifier**

In [None]:
print("="*50)
print("RANDOM FOREST CLASSIFIER")
print("="*50)

# Build the model
random_forest_model = RandomForestClassifier(n_estimators=100, random_state=234, class_weight='balanced', n_jobs=-1)

# Train the model
random_forest_model.fit(X_train, y_train)

# Predict on each split with the trained model
y_train_pred_rf= random_forest_model.predict(X_train)
y_val_pred_rf= random_forest_model.predict(X_val)
y_test_pred_rf= random_forest_model.predict(X_test)

# Accuracy scores
train_acc_rf= accuracy_score(y_train, y_train_pred_rf)
val_acc_rf= accuracy_score(y_val, y_val_pred_rf)
test_acc_rf= accuracy_score(y_test, y_test_pred_rf)

# Balanced accuracy scores
train_bal_rf= balanced_accuracy_score(y_train, y_train_pred_rf)
val_bal_rf= balanced_accuracy_score(y_val, y_val_pred_rf)
test_bal_rf= balanced_accuracy_score(y_test, y_test_pred_rf)

# Macro-averaged F1 scores
train_f1_rf= f1_score(y_train, y_train_pred_rf, average='macro')
val_f1_rf= f1_score(y_val, y_val_pred_rf, average='macro')
test_f1_rf= f1_score(y_test, y_test_pred_rf, average='macro')

# Report model performance
print("")
print("Random Forest Performance Report")
print("--" * 30)
print(f"Train - Accuracy Score: {train_acc_rf:.3f}, Balanced Acc: {train_bal_rf:.3f}, Macro F1: {train_f1_rf:.3f}")
print(f"Validation - Accuracy Score: {val_acc_rf:.3f}, Balanced Acc: {val_bal_rf:.3f}, Macro F1: {val_f1_rf:.3f}")
print(f"Test - Accuracy Score: {test_acc_rf:.3f}, Balanced Acc: {test_bal_rf:.3f}, Macro F1: {test_f1_rf:.3f}")

# Collect impurity-based feature importances (not coefficients)
features_importance_rf = pd.DataFrame({
    "feature": X_train.columns,
    "importance": random_forest_model.feature_importances_
}).sort_values("importance", key=abs, ascending=False)
print("")
# printing the top 10 features
print("Top 10 features:")
print("--" * 30)
for i, (index,row) in enumerate(features_importance_rf.head(10).iterrows(),1):
    print(f"{i:2d}. {row['feature']}: {row['importance']:.3f}")

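# Plot the top features (the original cell stops after selecting them;
# the bar chart below is an assumed completion)
plt.figure(figsize=(10,8))
top_features_rf= features_importance_rf.head(15)
sns.barplot(data=top_features_rf, x="importance", y="feature")
plt.title("Random Forest Feature Importance")
plt.tight_layout()
plt.show()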


RANDOM FOREST CLASSIFIER

Random Forest Performance Report
------------------------------------------------------------
Train - Accuracy Score: 1.000, Balanced Acc: 1.000, Macro F1: 1.000
Validation - Accuracy Score: 0.741, Balanced Acc: 0.677, Macro F1: 0.682
Test - Accuracy Score: 0.681, Balanced Acc: 0.577, Macro F1: 0.580

Top 10 features:
------------------------------------------------------------
 1. DTI: 0.299
 2. TotalIncome: 0.289
 3. Credit_History: 0.171
 4. CoapplicantIncome: 0.142
 5. Property_Area_Semiurban: 0.041
 6. Married: 0.033
 7. Education: 0.025
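
The importances above are impurity-based and computed on the training data, which the forest fits perfectly (train scores of 1.000). One way to cross-check them is permutation importance on held-out data. The following is a minimal sketch on synthetic stand-in data; `X_demo`, `y_demo`, the split, and the generic `feature_{idx}` labels are assumptions for illustration, and the notebook's own `X_val`/`y_val` would take their place in practice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data with 7 features
X_demo, y_demo = make_classification(
    n_samples=400, n_features=7, n_informative=3,
    weights=[0.3, 0.7], random_state=234,
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=234
)

rf = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=234)
rf.fit(X_tr, y_tr)

# Shuffle one feature at a time on the held-out split and measure the macro-F1 drop
result = permutation_importance(
    rf, X_va, y_va, scoring="f1_macro", n_repeats=10, random_state=234
)

for idx in result.importances_mean.argsort()[::-1]:
    print(f"feature_{idx}: {result.importances_mean[idx]:.3f} "
          f"+/- {result.importances_std[idx]:.3f}")
```

Features whose permutation importance is near zero on validation data, despite a high impurity-based score, are candidates for closer inspection.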
