## Loan Default Prediction (Profit Prophet)

**by Avinesh**

### Project Challenge

In the financial industry, assessing the creditworthiness of borrowers is crucial for lenders before granting loans or credit. Identifying potential defaulters, who are at a higher risk of failing to repay their debts, helps mitigate financial losses and maintain a healthy lending portfolio. The goal of this project is to develop a predictive model that accurately classifies borrowers as defaulters or non-defaulters based on various financial and demographic factors.

#### Goal

1. Create a machine learning model to predict defaulters and non-defaulters by analyzing historical data.
2. Provide recommendations on which features are important for predicting the target variable.

#### Approach

The approach combines comprehensive data analysis with machine learning model development, using a dataset of borrower information:
loan details (type, amount, interest rates, terms), personal factors (employment, income, credit scores), and
demographics (gender, marital status, education).


In [None]:
# Step 1: Install & Import Libraries
# Purpose: Import essential libraries for data manipulation, visualization, model building, and evaluation.

print("\n" + "="*50)
print("STEP 1: INSTALL & IMPORT LIBRARIES")
print("="*50)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, ConfusionMatrixDisplay, precision_score, recall_score, f1_score, roc_auc_score, RocCurveDisplay
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE

import warnings
warnings.filterwarnings("ignore")

In [None]:
# Step 2: Load Dataset
# Purpose: Load the dataset into a DataFrame to begin analysis

print("\n" + "="*50)
print("STEP 2: LOAD DATASET & QUICK PREVIEW")
print("="*50)

data = pd.read_csv("loan.csv")
print("Dataset loaded successfully!")

# Display first 5 rows
# Purpose: Quick preview of the data structure and sample values
print(data.head())

In [None]:
# Step 3: Basic Data Exploration

print("\n" + "="*50)
print("STEP 3: BASIC DATA EXPLORATION")
print("="*50)

# Display information about the dataset structure
# Purpose: Shows data types, non-null counts, and memory usage
# This helps identify missing values and data type issues
data.info()

print("\n" + "-"*30)

# Generates descriptive statistics for numerical columns
# Purpose: Shows count, mean, std, min, 25%, 50%, 75%, max for numeric data
# Helps identify outliers and understand data distribution
print("Statistical Summary of Numerical Columns:")
data.describe()

In [None]:
# Step 4: Check Missing and Duplicate Values
# Purpose: Ensure data completeness and remove duplicates

print("\n" + "="*50)
print("STEP 4: CHECK MISSING AND DUPLICATE VALUES")
print("="*50)

print("\nMissing Values:")
print(data.isnull().sum())
print("\nDuplicate Rows:", data.duplicated().sum())
data.drop_duplicates(inplace=True)
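
The cell above reports missing values but does not handle them. If any turn up, one simple option is median imputation for numeric columns and mode imputation for categorical ones. The sketch below uses a small made-up frame; the column names `income` and `employment_type` and the helper `impute_simple` are illustrative assumptions, not columns or functions confirmed to exist in this project.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the loan data (hypothetical columns and values)
toy = pd.DataFrame({
    "income": [52000.0, np.nan, 61000.0, 48000.0],
    "employment_type": ["salaried", "self-employed", None, "salaried"],
})

def impute_simple(df):
    """Fill numeric gaps with the column median and other gaps with the column mode."""
    out = df.copy()
    for col in out.columns:
        if out[col].isnull().any():
            if pd.api.types.is_numeric_dtype(out[col]):
                out[col] = out[col].fillna(out[col].median())
            else:
                out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

toy_filled = impute_simple(toy)
print(toy_filled)
```

In a full workflow the fill values would normally be computed on the training split only and then reused for the test split, so that no information from held-out rows leaks into training.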

In [None]:
# Step 5: Data Preparation (Before Preprocessing)
# Purpose: Encode categorical variables, engineer a loan-duration feature, and drop identifier columns

print("\n" + "="*50)
print("STEP 5: DATA PREPARATION")
print("="*50)

# Feature Engineering (create custom feature: loan_duration_days)
# Dates are parsed BEFORE encoding: once label-encoded, a date is only an integer code
if 'disbursement_date' in data.columns and 'due_date' in data.columns:
    data['disbursement_date'] = pd.to_datetime(data['disbursement_date'], errors='coerce')
    data['due_date'] = pd.to_datetime(data['due_date'], errors='coerce')
    data['loan_duration_days'] = (data['due_date'] - data['disbursement_date']).dt.days

# Drop redundant columns if they exist
redundant_cols = ['customer_id', 'loan_id', 'application_date', 'approval_date', 'disbursement_date', 'due_date']
data.drop([col for col in redundant_cols if col in data.columns], axis=1, inplace=True)

# Encode the remaining categorical columns using Label Encoding
# One encoder is kept per column so each mapping can be reused on new data
cat_cols = data.select_dtypes(include='object').columns
encoders = {}
for col in cat_cols:
    encoders[col] = LabelEncoder()
    data[col] = encoders[col].fit_transform(data[col].astype(str))



print("\nData After Preprocessing Preview:")
print(data.head())
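
Date columns have to be parsed before any label encoding, because an encoded date is only an integer code and no longer carries calendar meaning. A minimal illustration of the `loan_duration_days` calculation with two made-up loans (the dates are invented for the example):

```python
import pandas as pd

loans = pd.DataFrame({
    "disbursement_date": ["2023-01-15", "2023-03-01"],
    "due_date": ["2023-07-15", "2024-03-01"],
})

# Parse first; unparseable strings become NaT instead of raising an error
loans["disbursement_date"] = pd.to_datetime(loans["disbursement_date"], errors="coerce")
loans["due_date"] = pd.to_datetime(loans["due_date"], errors="coerce")

# Subtracting two datetime columns gives a timedelta; .dt.days extracts whole days
loans["loan_duration_days"] = (loans["due_date"] - loans["disbursement_date"]).dt.days
print(loans)
```

The second loan spans 366 days because the period includes 29 February 2024.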

In [None]:
# Step 5.1: Visualize After Preprocessing (with seaborn heatmap)
# Purpose: Visual quality check after data cleaning

print("\n" + "="*50)
print("STEP 5.1: VISUALIZE AFTER PREPROCESSING")
print("="*50)

plt.figure(figsize=(8,4))
sns.heatmap(data.isnull(), cbar=False)
plt.title("Missing Values Heatmap After Cleaning")
plt.show()

# Print interpretation after the heatmap
print("\n" + "="*60)
print("MISSING VALUES ANALYSIS RESULTS")
print("="*60)

# Get dataset dimensions
num_rows, num_cols = data.shape
total_missing = data.isnull().sum().sum()

print(f"\n📊 DATASET OVERVIEW:")
print(f"   • Total rows: {num_rows:,}")
print(f"   • Total columns: {num_cols}")
print(f"   • Total missing values: {total_missing}")

print(f"\n✅ DATA QUALITY ASSESSMENT:")
if total_missing == 0:
    print(f"   • ✅ Complete data across all {num_rows:,} rows")
    print(f"   • ✅ All {num_cols} columns have no missing values")
    print(f"   • ✅ No imputation or missing value handling needed")
    print(f"   • ✅ Ready to proceed with modeling")
else:
    cols_with_missing = data.columns[data.isnull().any()].tolist()
    print(f"   • ❌ {total_missing:,} missing values found in: {cols_with_missing}")
    print(f"   • ❌ Impute or drop these values before modeling")

print(f"\n🔍 HEATMAP INTERPRETATION:")
print(f"   • Dark/Black = False (no missing values) ✅")
print(f"   • Light/White = True (missing values present) ❌")

print(f"\n🎯 CONCLUSION:")
if total_missing == 0:
    print(f"   Dataset is clean and ready for machine learning pipeline!")
else:
    print(f"   Dataset still needs missing value handling before modeling.")
print("="*60)

In [None]:
# Step 6: Exploratory Data Analysis (EDA)
# Purpose: Understand data distribution, relationships, and imbalance before building models - "know your data, build better models"

print("\n" + "="*50)
print("STEP 6: EXPLORATORY DATA ANALYSIS (EDA)")
print("="*50)

# 6.1. Check target variable imbalance (Visualize class distribution in your target variable)
# Shows if you have balanced classes (50/50) or imbalanced (e.g., 90% non-defaulters, 10% defaulters).
# Imbalanced data requires special handling (like SMOTE) because models will be biased toward the majority class.

plt.figure(figsize=(6,4))
ax = sns.countplot(x='default_status', data=data, palette='Set2')

# Calculate percentages and add labels
total = len(data['default_status'])  # Total number of samples
for p in ax.patches:
    height = p.get_height()  # Height of the bar (count)
    percentage = (height / total) * 100  # Calculate percentage
    ax.text(
        p.get_x() + p.get_width() / 2,  # x-position (center of the bar)
        height + 0.5,  # y-position (slightly above the bar)
        f'{percentage:.1f}%',  # Text format (e.g., "70.0%")
        ha='center', va='bottom'  # Center horizontally, align bottom vertically
    )

plt.title('Loan Default Status Distribution')
plt.xlabel("Default Status (0 = Non-Defaulter, 1 = Defaulter)")
plt.ylabel("Count")
plt.show()

# Short interpretation for class distribution
class_counts = data['default_status'].value_counts()
total = len(data['default_status'])

# Calculate percentages for each class
non_defaulter_pct = (class_counts[0] / total) * 100
defaulter_pct = (class_counts[1] / total) * 100

majority_class = class_counts.max()
minority_class = class_counts.min()
imbalance_ratio = majority_class / minority_class

print(f"📊 CLASS DISTRIBUTION SUMMARY:")
print(f"   • Non-defaulters: {class_counts[0]:,} ({non_defaulter_pct:.1f}%)")
print(f"   • Defaulters: {class_counts[1]:,} ({defaulter_pct:.1f}%)")
print(f"   • Imbalance ratio: {imbalance_ratio:.1f}:1")
print(f"   • SMOTE needed: {'Yes' if imbalance_ratio > 1.5 else 'No'}")

# 6.2. Feature distribution plots
# Purpose: Compare how each numeric feature differs between defaulters vs non-defaulters
numeric_cols = data.select_dtypes(include=np.number).columns.tolist()
numeric_cols = [col for col in numeric_cols if col != 'default_status']

print("\nGenerating Feature Distribution Histograms...")
# Define a custom palette to ensure consistent colors for 0 and 1
# Seaborn often assigns blue to the first category (0) and orange to the second (1) by default
# So, mapping 0 to 'blue' and 1 to 'orange' should align with default behavior or make it explicit.
custom_palette = {0: 'blue', 1: 'orange'}

for col in numeric_cols:
    plt.figure(figsize=(8,5))
    # Using histplot to show overlaid distributions, which can be easier to interpret.
    # 'hue' separates by default_status
    # 'hue_order' ensures consistent order for colors and legend
    # 'element="step"' makes the outlines clear
    # 'kde=True' adds a smoothed density line
    # 'stat="density"' with 'common_norm=False' normalizes each group independently, making shape comparison easier
    sns.histplot(data=data, x=col, hue='default_status', hue_order=[0, 1], # Explicitly set order
                 element="step", kde=True, stat="density", common_norm=False, palette=custom_palette)

    plt.title(f"{col} Distribution by Default Status")
    plt.xlabel(col)
    plt.ylabel("Density")

    # seaborn builds the hue legend itself and does not expose its handles through
    # get_legend_handles_labels(), so relabel the existing legend in place.
    # hue_order=[0, 1] guarantees the entries appear in this order.
    legend = plt.gca().get_legend()
    if legend is not None:
        legend.set_title('Default Status')
        for text, label in zip(legend.get_texts(), ['Non-Defaulter (0)', 'Defaulter (1)']):
            text.set_text(label)

    plt.show()

print(f"\n🔍 FEATURE ANALYSIS SUMMARY:")
print(f"   • {len(numeric_cols)} numeric features analyzed")
print(f"   • Look for: Different distributions between classes")
print(f"   • Good predictors: Clear separation between defaulters/non-defaulters")
print(f"   • Poor predictors: Similar distributions for both classes")



# 6.3. Correlation analysis
# Purpose: Show relationships between all numeric variables
# Identifies highly correlated features (multicollinearity issues)
# Shows which features relate to your target variable
# Helps remove redundant features
# Prevents model confusion from duplicate information

plt.figure(figsize=(10,8))
# Store the correlation matrix in a variable
corr_matrix = data.corr()
sns.heatmap(corr_matrix, cmap='coolwarm', annot=True)
plt.title("Correlation Heatmap")
plt.show()

# Short interpretation for correlations
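# Drop the target's correlation with itself explicitly instead of relying on sort position
target_corr = corr_matrix['default_status'].drop('default_status').abs().sort_values(ascending=False)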
high_corr_pairs = []
for i in range(len(corr_matrix.columns)):
    for j in range(i+1, len(corr_matrix.columns)):
        if abs(corr_matrix.iloc[i,j]) > 0.7:
            high_corr_pairs.append((corr_matrix.columns[i], corr_matrix.columns[j], corr_matrix.iloc[i,j]))

print(f"\n🔗 CORRELATION SUMMARY:")
print(f"   • Top 3 features correlated with default:")
for i, (feature, corr_val) in enumerate(target_corr.head(3).items(), 1):
    print(f"     {i}. {feature}: {corr_val:.3f}")
print(f"   • High correlations (>0.7): {len(high_corr_pairs)} pairs found")
if high_corr_pairs:
    print(f"   • Consider removing redundant features")
    # Optional: Print the high correlation pairs
    for pair in high_corr_pairs:
        print(f"     - {pair[0]} ↔ {pair[1]}: {pair[2]:.3f}")
else:
    print(f"   • No multicollinearity issues detected")
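
One way to act on the "consider removing redundant features" hint is to drop one column from each feature pair whose absolute correlation exceeds the threshold. The helper below, `drop_correlated`, is a hypothetical name introduced for this sketch, and the toy columns are invented; it keeps the first column of each pair and never touches the target.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, target, threshold=0.7):
    """Drop the later column of every feature pair with |corr| above threshold."""
    features = df.drop(columns=[target])
    corr = features.corr().abs()
    # Keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

rng = np.random.default_rng(0)
amount = rng.normal(10_000, 2_000, size=200)
toy = pd.DataFrame({
    "loan_amount": amount,
    "installment": amount / 12 + rng.normal(0, 5, size=200),  # nearly a copy of loan_amount
    "credit_score": rng.normal(650, 50, size=200),
    "default_status": rng.integers(0, 2, size=200),
})

reduced, dropped = drop_correlated(toy, target="default_status")
print("Dropped:", dropped)
print("Remaining columns:", reduced.columns.tolist())
```

Which member of a pair to keep is a modeling choice; keeping the first one is only a convention for this example.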

In [None]:
# Step 7: Train/Test Split, Data Scaling and SMOTE Balancing
# Purpose: Hold out a test set, standardize features, and balance the training data using SMOTE

print("\n" + "="*50)
print("STEP 7: SPLIT DATA, SCALE FEATURES AND SMOTE BALANCING")
print("="*50)


# Feature and Target Separation.
# Purpose: Separate input features from target variable
# (X gets all columns except the answer, y gets just the answer we want to predict)
X = data.drop('default_status', axis=1)
y = data['default_status']

# Split BEFORE scaling and resampling so the test set contains only real, unseen records.
# stratify=y keeps the original class proportions in both splits.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# StandardScaler - Normalizes features to have mean=0 and std=1
# Purpose: Ensures all features are on the same scale for better model performance
# The scaler is fitted on the training split only, then applied to the test split
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Handle Class Imbalance
# Purpose: Balance classes by creating synthetic minority samples (training split only)
smote = SMOTE(random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)
print("Training class counts after SMOTE:")
print(pd.Series(y_train).value_counts())

print("✅ Features scaled and training set balanced successfully")
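
For intuition about what SMOTE does, the core idea is interpolation: pick a minority sample, pick one of its nearest minority-class neighbours, and create a new point somewhere on the line segment between them. The NumPy sketch below illustrates that idea only; it is not the `imblearn` implementation, and `smote_like` is a hypothetical helper name.

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=42):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances within the minority class
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)          # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))      # random minority sample
        nn = rng.choice(neighbours[base])    # one of its k nearest neighbours
        gap = rng.random()                   # position along the segment, in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[nn] - X_min[base])
    return synthetic

minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
new_points = smote_like(minority, n_new=6)
print(new_points)
```

Because every synthetic point is built from existing records, resampling is normally applied to the training split only, so that evaluation is done on real data.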

In [None]:
# Step 8: Model Training and Evaluation
# Purpose: Train and compare 8 models using classification metrics and visual evaluation for selecting the best model.

print("\n" + "="*50)
print("STEP 8: MODEL TRAINING & EVALUATION")
print("="*50)

# Creates a dictionary where keys are model names (strings) and values are instantiated machine learning model objects from scikit-learn and
# xgboost libraries.
models = {
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "KNN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(probability=True),
    "XGBoost": XGBClassifier(eval_metric='logloss'),
    "Gradient Boosting": GradientBoostingClassifier()
}


results = {} # Creates a dictionary to track model metrics by name (keys map to nested dictionaries: {Accuracy, Precision, Recall, F1-Score, AUC})



for name, model in models.items(): # Purpose: This loop ensures that each model is trained, tested, and evaluated systematically.
    model.fit(X_train, y_train) # Trains the current model on the training data (X_train, y_train).
    preds = model.predict(X_test) # Generates predictions for the test set (X_test) using the trained model.
    acc = accuracy_score(y_test, preds)
    prec = precision_score(y_test, preds)
    rec = recall_score(y_test, preds)
    f1 = f1_score(y_test, preds)
    # np.nan (rather than the string "N/A") keeps the AUC column numeric for idxmax() and plotting
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:,1]) if hasattr(model, 'predict_proba') else np.nan

    results[name] = {  # Stores the computed metrics for the current model in the results dictionary.
        "Accuracy": acc,
        "Precision": prec,
        "Recall": rec,
        "F1-Score": f1,
        "AUC": auc
    }

    print(f"\n{name}")
    print(classification_report(y_test, preds)) # Outputs a detailed report of classification metrics for each model.
    cm = confusion_matrix(y_test, preds) # Visualizes confusion matrix for each model to show the distribution of correct and incorrect predictions.
    ConfusionMatrixDisplay(confusion_matrix=cm).plot(cmap='Blues')
    plt.title(f"Confusion Matrix - {name}")
    plt.show()
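
A single train/test split can give a noisy picture of model quality. A common complement is stratified k-fold cross-validation, sketched here on synthetic data from `make_classification`. The data, the fold count, and the choice of F1 scoring are assumptions for illustration, not results from `loan.csv`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the loan data (roughly 80% / 20%)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=42
)

# Stratification keeps the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(n_estimators=50, random_state=42)

# One F1 score per fold; the spread shows how stable the estimate is
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="f1")
print(f"F1 per fold: {scores.round(3)}")
print(f"Mean F1: {scores.mean():.3f} (std {scores.std():.3f})")
```

If resampling such as SMOTE is used together with cross-validation, it belongs inside each training fold rather than before the folds are created.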

In [None]:
# Step 9: Compare Models - KPI Dashboard & ROC Curve Comparison
# Purpose: Visually compare model performance side-by-side, helps identify best model.
# ROC Curve Comparison: visualizes how well each model separates classes, with the AUC summarizing overall performance (evaluate model robustness)
# Integration with Previous Code: Uses the results dictionary for the dashboard and the models dictionary for ROC curves, building on the training and
# evaluation step to provide visual insights.

print("\n" + "="*50)
print("STEP 9: COMPARE MODELS - DASHBOARD & ROC CURVE COMPARISON")
print("="*50)

metrics_df = pd.DataFrame(results).T.reset_index().rename(columns={'index': 'Model'})

fig, axes = plt.subplots(3, 2, figsize=(16,18))
sns.barplot(x='Model', y='Accuracy', data=metrics_df, ax=axes[0,0], palette='viridis')
axes[0,0].set_title('Model Accuracy')
axes[0,0].tick_params(axis='x', rotation=45)

sns.barplot(x='Model', y='Precision', data=metrics_df, ax=axes[0,1], palette='magma')
axes[0,1].set_title('Model Precision')
axes[0,1].tick_params(axis='x', rotation=45)

sns.barplot(x='Model', y='Recall', data=metrics_df, ax=axes[1,0], palette='rocket')
axes[1,0].set_title('Model Recall')
axes[1,0].tick_params(axis='x', rotation=45)

sns.barplot(x='Model', y='F1-Score', data=metrics_df, ax=axes[1,1], palette='cool')
axes[1,1].set_title('Model F1-Score')
axes[1,1].tick_params(axis='x', rotation=45)

sns.barplot(x='Model', y='AUC', data=metrics_df, ax=axes[2,0], palette='plasma')
axes[2,0].set_title('ROC-AUC Scores')
axes[2,0].tick_params(axis='x', rotation=45)

axes[2,1].axis('off')
plt.tight_layout()
plt.show()

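# Draw every ROC curve on the same axes; without ax=..., each call opens its own figure
fig, ax = plt.subplots(figsize=(10,8))
for name, model in models.items():
    if hasattr(model, "predict_proba"):
        RocCurveDisplay.from_estimator(model, X_test, y_test, name=name, ax=ax)
ax.set_title("ROC Curves for All Models")
plt.show()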
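
ROC-AUC has a direct interpretation: it is the probability that a randomly chosen defaulter receives a higher score than a randomly chosen non-defaulter, with ties counted as half. The small sketch below computes it that way for four made-up scores; `pairwise_auc` is a name introduced here for illustration.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score with every negative score
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (len(pos) * len(neg))

y_demo = [0, 0, 1, 1]
score_demo = [0.1, 0.4, 0.35, 0.8]
# Pairs: (0.35 vs 0.1) win, (0.35 vs 0.4) loss, (0.8 vs 0.1) win, (0.8 vs 0.4) win
print(pairwise_auc(y_demo, score_demo))  # 3 of 4 pairs ordered correctly -> 0.75
```

The pairwise form is quadratic in the number of samples, so it is meant for understanding the metric rather than for large datasets.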

In [None]:
# Step 10: Feature Importance for Tree-based Models
# ================================================
print("\n" + "="*50)
print("STEP 10: FEATURE IMPORTANCE ANALYSIS")
print("="*50)

# Get best model based on F1-Score from metrics_df
best_model_name = metrics_df.sort_values(by='F1-Score', ascending=False).iloc[0]['Model']
best_model = models[best_model_name]
print(f"\n🏆 Best Model (by F1-Score): {best_model_name}")

# Feature importance visualization
if hasattr(best_model, 'feature_importances_'):
    importances = best_model.feature_importances_
    feature_names = X.columns
    sorted_idx = np.argsort(importances)

    plt.figure(figsize=(12, 8))
    sns.barplot(x=importances[sorted_idx], y=feature_names[sorted_idx], palette="viridis")
    plt.title(f"{best_model_name} Feature Importances", fontsize=16, fontweight='bold')
    plt.xlabel('Importance Score', fontsize=12)
    plt.ylabel('Features', fontsize=12)
    plt.tight_layout()
    plt.show()

    # Display top features
    feature_importance_df = pd.DataFrame({
        'feature': feature_names,
        'importance': importances
    }).sort_values('importance', ascending=False)

    print("\n📊 Top 10 Most Important Features:")
    print(feature_importance_df.head(10).to_string(index=False))
else:
    print(f"\n⚠️  {best_model_name} does not have feature_importances_ attribute")
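
When the best model has no `feature_importances_` attribute (for example Logistic Regression, KNN, or SVM), scikit-learn's `permutation_importance` offers a model-agnostic alternative: it shuffles one feature at a time and measures how much the score drops. A sketch on synthetic data, with placeholder feature names that are not taken from the loan dataset:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]  # placeholder names

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Shuffle each feature 10 times on held-out data and record the mean drop in F1
result = permutation_importance(
    clf, X_te, y_te, scoring="f1", n_repeats=10, random_state=0
)

perm_df = pd.DataFrame({
    "feature": feature_names,
    "importance": result.importances_mean,
}).sort_values("importance", ascending=False)
print(perm_df.to_string(index=False))
```

Permutation importance is measured on held-out data, so it reflects what the model relies on for generalization; strongly correlated features can share or mask each other's importance.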

In [None]:
# Step 11: Model Comparison and Results
# =====================================
print("\n" + "="*50)
print("STEP 11: COMPREHENSIVE MODEL COMPARISON AND RESULTS")
print("="*50)

# Display results from both dataframes
print("\n📊 Model Performance Comparison (from metrics_df):")
print(metrics_df.round(4))

if 'results' in locals():
    results_df = pd.DataFrame(results).T
    print("\n📊 Model Performance Comparison (from results):")
    print(results_df.round(4))

    # Find best model based on AUC
    best_model_auc = results_df['AUC'].idxmax()
    best_auc_score = results_df.loc[best_model_auc, 'AUC']
    print(f"\n🏆 Best Model (by AUC): {best_model_auc}")
    print(f"🎯 Best AUC Score: {best_auc_score:.4f}")
else:
    print("\n⚠️  'results' variable not found. Using metrics_df for comparison.")

# Detailed classification report for the best model
best_preds = best_model.predict(X_test)
print(f"\n📋 Detailed Classification Report for {best_model_name}:")
print(classification_report(y_test, best_preds))

# Comprehensive feature importance analysis
if hasattr(best_model, 'feature_importances_'):
    print(f"\n📊 Complete Feature Importance Analysis for {best_model_name}:")
    print("-" * 60)

    # Create comprehensive feature importance dataframe
    feature_importance = pd.DataFrame({
        'feature': X.columns,
        'importance': best_model.feature_importances_,
        'importance_percentage': (best_model.feature_importances_ / best_model.feature_importances_.sum()) * 100
    }).sort_values('importance', ascending=False)

    print("\n🔝 Top 15 Feature Importances:")
    print(feature_importance.head(15).round(4))

    # Summary statistics
    print(f"\n📈 Feature Importance Summary:")
    print(f"   • Total features: {len(feature_importance)}")
    print(f"   • Top feature: {feature_importance.iloc[0]['feature']} ({feature_importance.iloc[0]['importance']:.4f})")
    print(f"   • Top 5 features account for {feature_importance.head(5)['importance_percentage'].sum():.1f}% of total importance")
    print(f"   • Top 10 features account for {feature_importance.head(10)['importance_percentage'].sum():.1f}% of total importance")

    # Additional visualization - Top 15 features
    plt.figure(figsize=(12, 8))
    top_15_features = feature_importance.head(15)
    sns.barplot(data=top_15_features, x='importance', y='feature', palette="plasma")
    plt.title(f"Top 15 Feature Importances - {best_model_name}", fontsize=16, fontweight='bold')
    plt.xlabel('Importance Score', fontsize=12)
    plt.ylabel('Features', fontsize=12)

    # Add percentage labels
    for i, (idx, row) in enumerate(top_15_features.iterrows()):
        plt.text(row['importance'] + 0.001, i, f"{row['importance_percentage']:.1f}%",
                va='center', fontsize=10)

    plt.tight_layout()
    plt.show()

# Model performance summary
print("\n" + "="*50)
print("FINAL SUMMARY")
print("="*50)
print(f"🏆 Best performing model: {best_model_name}")
print(f"📊 Key metrics:")

# Get best model's metrics
if best_model_name in metrics_df['Model'].values:
    best_metrics = metrics_df[metrics_df['Model'] == best_model_name].iloc[0]
    print(f"   • Accuracy: {best_metrics['Accuracy']:.4f}")
    print(f"   • Precision: {best_metrics['Precision']:.4f}")
    print(f"   • Recall: {best_metrics['Recall']:.4f}")
    print(f"   • F1-Score: {best_metrics['F1-Score']:.4f}")

if 'results' in locals() and best_model_name in results_df.index:
    print(f"   • AUC: {results_df.loc[best_model_name, 'AUC']:.4f}")

if hasattr(best_model, 'feature_importances_'):
    print(f"🔍 Most important feature: {feature_importance.iloc[0]['feature']}")
    print(f"🎯 Model interpretability: Available (tree-based model)")
else:
    print(f"🎯 Model interpretability: Limited (non-tree-based model)")

print("\n✅ Analysis complete!")

In [None]:
# Step 12: Early Warning System Function
# Purpose: Predict if new applicants are at high risk of default

print("\n" + "="*50)
print("STEP 12: MAKING PREDICTIONS ON NEW DATA")
print("="*50)

def predict_default_probability(model, scaler, customer_data, model_name):
    # Every model above was trained on scaled features, so new data must always go
    # through the same fitted scaler (model_name is kept so existing calls still work)
    new_data = pd.DataFrame([customer_data])
    new_data_processed = scaler.transform(new_data)
    probability = model.predict_proba(new_data_processed)[0][1]
    return probability

# Example customer (the first row of X is reused here as a stand-in for a new applicant)
new_customer = dict(zip(X.columns, X.iloc[0]))
print("\n🔍 Predicting default probability for a new customer profile:")
for key, value in new_customer.items():
    print(f"  {key}: {value}")

# Make prediction
prob = predict_default_probability(best_model, scaler, new_customer, best_model_name)
print(f"\n🎯 Default Probability: {prob:.2%}")
# Tiers match the risk classification system summarized in Step 13
print(f"💡 Risk Level: {'VERY HIGH' if prob > 0.7 else 'HIGH' if prob > 0.5 else 'MEDIUM' if prob > 0.3 else 'LOW'}")
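
The probability-to-tier mapping can be isolated in a small function so that the thresholds live in one place. The cut-offs below (30%, 50%, 70%) are the ones listed in this notebook's risk classification summary; `risk_tier` is a hypothetical helper name, and treating a value exactly on a boundary as the lower tier is an assumption.

```python
def risk_tier(probability):
    """Map a default probability in [0, 1] to a risk label."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability > 0.7:
        return "VERY HIGH"
    if probability > 0.5:
        return "HIGH"
    if probability > 0.3:
        return "MEDIUM"
    return "LOW"

for p in (0.12, 0.30, 0.45, 0.65, 0.91):
    print(f"{p:.0%} -> {risk_tier(p)}")
```

Keeping the thresholds in a single function makes it easier to adjust them later, for instance after reviewing the cost of missed defaults against the cost of rejected good customers.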

In [None]:
# Step 13: Comprehensive Summary and Recommendations

print("\n" + "="*50)
print("STEP 13: COMPREHENSIVE SUMMARY AND RECOMMENDATIONS")
print("="*50)

# Dataset Overview
print("📊 DATASET OVERVIEW:")
print(f"   • Total loan records: {len(data):,}")
print(f"   • Number of features: {X.shape[1]}")
print(f"   • Training samples: {len(X_train):,}")
print(f"   • Test samples: {len(X_test):,}")

# Default rate analysis (y is the original, un-resampled target column)
default_rate = y.mean()
print(f"   • Overall default rate: {default_rate:.2%}")

# Model Performance Summary
print(f"\n🏆 BEST MODEL PERFORMANCE:")
print(f"   • Selected Model: {best_model_name}")
print(f"   • Model Type: {type(best_model).__name__}")

# Get performance metrics
if 'metrics_df' in locals():
    best_metrics = metrics_df[metrics_df['Model'] == best_model_name].iloc[0]
    print(f"   • Accuracy: {best_metrics['Accuracy']:.2%}")
    print(f"   • Precision: {best_metrics['Precision']:.2%}")
    print(f"   • Recall: {best_metrics['Recall']:.2%}")
    print(f"   • F1-Score: {best_metrics['F1-Score']:.4f} (Primary Selection Metric)")

if 'results' in locals():
    results_df = pd.DataFrame(results).T
    if best_model_name in results_df.index:
        print(f"   • AUC Score: {results_df.loc[best_model_name, 'AUC']:.4f}")

# Model Advantages (these points describe Random Forest, so print them only if it was selected)
if best_model_name == "Random Forest":
    print(f"\n🎯 RANDOM FOREST MODEL ADVANTAGES:")
    print("   • Interpretable through feature importance scores")
    print("   • Robust to outliers and noise")
    print("   • Handles non-linear relationships effectively")
    print("   • Reduces overfitting through ensemble approach")
    print("   • No need for feature scaling")
    print("   • Provides probability estimates (check calibration before relying on them)")

# Feature Importance Analysis
print(f"\n📊 KEY PREDICTIVE FACTORS:")
if 'feature_importance' in locals() and hasattr(best_model, 'feature_importances_'):
    top_features = feature_importance.head(5)
    print("   Most important factors for default prediction:")
    for i, (_, row) in enumerate(top_features.iterrows(), 1):
        print(f"   {i}. {row['feature']} (Importance: {row['importance']:.4f}, {row['importance_percentage']:.1f}%)")

    # Feature importance insights
    top_5_contribution = feature_importance.head(5)['importance_percentage'].sum()
    print(f"\n   📈 Feature Insights:")
    print(f"   • Top 5 features contribute {top_5_contribution:.1f}% of predictive power")
    print(f"   • Most critical factor: {feature_importance.iloc[0]['feature']}")
    print(f"   • Feature diversity: {len(feature_importance[feature_importance['importance_percentage'] > 1])} features contribute >1% each")

# Business Impact Analysis
print(f"\n💼 BUSINESS IMPACT ANALYSIS:")
print("   Risk Assessment Capabilities:")
print("   • ✅ Automated screening of loan applications")
print("   • ✅ Early identification of high-risk customers")
print("   • ✅ Data-driven decision making process")
print("   • ✅ Consistent risk evaluation across all applications")

# Risk Thresholds (based on the prediction system)
print(f"\n⚠️  RISK CLASSIFICATION SYSTEM:")
print("   • LOW Risk (0-30%): Approve with standard terms")
print("   • MEDIUM Risk (30-50%): Approve with enhanced monitoring")
print("   • HIGH Risk (50-70%): Manual review required")
print("   • VERY HIGH Risk (>70%): Recommend rejection")

# Strategic Recommendations
print(f"\n🎯 STRATEGIC RECOMMENDATIONS:")
print("\n   1. IMPLEMENTATION:")
print(f"      • Deploy {best_model_name} model for real-time application screening")
print("      • Integrate with existing loan origination system")
print("      • Set up automated alerts for high-risk applications")
print("      • Create dashboard for monitoring prediction accuracy")

print("\n   2. RISK MANAGEMENT:")
print("      • Focus manual review resources on MEDIUM-HIGH risk segments")
print("      • Develop differentiated pricing strategies based on risk scores")
print("      • Create early intervention programs for identified high-risk customers")
print("      • Implement continuous monitoring of approved loans")

print("\n   3. BUSINESS OPTIMIZATION:")
print("      • Offer preferential rates to LOW risk customers to increase volume")
print("      • Reduce manual underwriting time for clear-cut cases")
print("      • Improve portfolio quality through better risk selection")
print("      • Enable faster decision-making and improved customer experience")

print("\n   4. MODEL MAINTENANCE:")
print("      • Retrain model quarterly with new loan performance data")
print("      • Monitor for model drift and performance degradation")
print("      • A/B test model updates before full deployment")
print("      • Maintain champion-challenger model framework")

print("\n   5. COMPLIANCE & GOVERNANCE:")
print("      • Document model validation and testing procedures")
print("      • Ensure fair lending compliance across all demographics")
print("      • Create audit trail for all model-based decisions")
print("      • Establish model governance committee and oversight")

# Expected Business Benefits
print(f"\n📈 EXPECTED BUSINESS BENEFITS:")
if 'metrics_df' in locals():
    accuracy = best_metrics['Accuracy']
    precision = best_metrics['Precision']
    recall = best_metrics['Recall']

    print(f"   • Accuracy: {accuracy:.1%} of test-set predictions correct")
    print(f"   • Precision: {precision:.1%} of predicted defaulters actually defaulted")
    print(f"   • Default Detection: {recall:.1%} of actual defaults identified")

print("   • Target: 20-30% reduction in manual review workload (to be validated in a pilot)")
print("   • Target: 15-25% improvement in portfolio performance (to be validated in a pilot)")
print("   • Enhanced customer experience through faster decisions")
print("   • Improved regulatory compliance and audit readiness")

# Next Steps
print(f"\n🚀 IMMEDIATE NEXT STEPS:")
print("   1. Validate model performance on recent out-of-time data")
print("   2. Conduct bias testing across demographic segments")
print("   3. Develop integration plan with existing systems")
print("   4. Create user training materials for loan officers")
print("   5. Establish performance monitoring and alerting framework")
print("   6. Plan pilot deployment with selected branch locations")

print(f"\n✅ ANALYSIS COMPLETE - {best_model_name.upper()} MODEL READY FOR DEPLOYMENT!")
print("="*50)

In [None]:
# --- Part A: Get Ready ---

# 1. Tell Git who you are (your GitHub username and email)
!git config --global user.name "Senor-Avi"
!git config --global user.email "avi.rai.senor@gmail.com"

# 2. Define your GitHub repository's HTTPS URL
github_repo_url = "https://github.com/Senor-Avi/Profit-Prophet"

# This line automatically figures out your repository's name ('Profit-Prophet') from the URL.
repo_name = github_repo_url.split('/')[-1].replace('.git', '')
print(f"We will clone your repository into a folder called: {repo_name}")

# --- Part B: Clone and Move Your Project ---

# 3. Clone (Download) your empty GitHub repository into Colab
#    This creates a new folder in your Colab environment named 'Profit-Prophet'.
print(f"Cloning {github_repo_url}...")
!git clone {github_repo_url}

# 4. Move your Colab notebook (and any other project files) into the new 'Profit-Prophet' folder
#    This line is now updated with your exact notebook filename!
print(f"Moving your notebook 'profit prophet.ipynb' into the {repo_name} folder...")
!mv "profit prophet.ipynb" {repo_name}/

#    (Optional: If you have other folders like 'data', 'models', etc., move them too.
#    Example: !mv "data/" {repo_name}/)

# 5. Change your current location in Colab to be inside your repository folder
#    All 'git' commands from now on will apply to your 'Profit-Prophet' project.
print(f"Changing directory to: {repo_name}")
%cd {repo_name}

# --- Part C: Add, Commit, and Push ---

# 6. Add all your files to be tracked by Git
#    The '.' means "add everything new or changed in the current folder."
print("Adding all project files to Git...")
!git add .

# 7. Commit your changes (save a snapshot of your project)
print("Committing your changes...")
!git commit -m "Initial commit of Profit Prophet project from Google Colab"

# 8. Push your project to GitHub (This is where your Personal Access Token comes in!)
#    When prompted:
#    - For "Username", type your GitHub username: Senor-Avi and press Enter.
#    - For "Password", PASTE your Personal Access Token (PAT) and press Enter.
#      (Note: Nothing will show as you paste/type the password, which is normal for security.)
print("\n--- ALMOST DONE! ---")
print("Now, pushing your project to GitHub. Please follow the prompts:")
print("1. When asked for 'Username', enter 'Senor-Avi' and press Enter.")
print("2. When asked for 'Password', PASTE your Personal Access Token (PAT) here and press Enter.")
print("   (Note: Nothing will show as you paste/type, which is normal.)")
!git push origin main
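
The repository name in the cell above is derived with `.replace('.git', '')`, which removes `.git` wherever it occurs in the last URL segment, not only at the end. A minimal sketch of a stricter derivation follows; the helper name `repo_name_from_url` is hypothetical and not part of the original notebook.

```python
from urllib.parse import urlsplit


def repo_name_from_url(url: str) -> str:
    """Return the last path segment of a Git HTTPS URL without a trailing '.git'.

    Hypothetical helper: it only handles plain HTTPS URLs such as the one
    used in this notebook, not SSH-style 'git@host:owner/repo' addresses.
    """
    # Drop any trailing slash so the last segment is never empty.
    path = urlsplit(url).path.rstrip("/")
    name = path.rsplit("/", 1)[-1]
    # Strip the suffix only when it really is a suffix.
    if name.endswith(".git"):
        name = name[: -len(".git")]
    return name


repo_name = repo_name_from_url("https://github.com/Senor-Avi/Profit-Prophet")
print(f"Repository folder name: {repo_name}")
```

For the URL used here both approaches give `Profit-Prophet`; they differ only for names that contain `.git` in the middle.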

In [None]:
# --- Part A: Get Ready ---

# 1. Tell Git who you are (your GitHub username and email)
!git config --global user.name "Senor-Avi"
!git config --global user.email "avi.rai.senor@gmail.com"

# 2. Define your GitHub repository's HTTPS URL
github_repo_url = "https://github.com/Senor-Avi/Profit-Prophet"

# This line automatically figures out your repository's name ('Profit-Prophet') from the URL.
repo_name = github_repo_url.split('/')[-1].replace('.git', '')
print(f"We will clone your repository into a folder called: {repo_name}")

# --- Part B: Clone and Move Your Project ---

# 3. Clone (Download) your empty GitHub repository into Colab
#    This creates a new folder in your Colab environment named 'Profit-Prophet'.
print(f"Cloning {github_repo_url}...")
!git clone {github_repo_url}

# 4. Move your Colab notebook (and any other project files) into the new 'Profit-Prophet' folder
#    *** THIS LINE IS NOW CORRECTED FOR CASE SENSITIVITY ***
print(f"Moving your notebook 'Profit Prophet.ipynb' into the {repo_name} folder...")
!mv "Profit Prophet.ipynb" {repo_name}/

#    (Optional: If you have other folders like 'data', 'models', etc., move them too.
#    Example: !mv "data/" {repo_name}/)

# 5. Change your current location in Colab to be inside your repository folder
#    All 'git' commands from now on will apply to your 'Profit-Prophet' project.
print(f"Changing directory to: {repo_name}")
%cd {repo_name}

# --- Part C: Add, Commit, and Push ---

# 6. Add all your files to be tracked by Git
#    The '.' means "add everything new or changed in the current folder."
print("Adding all project files to Git...")
!git add .

# 7. Commit your changes (save a snapshot of your project)
print("Committing your changes...")
!git commit -m "Initial commit of Profit Prophet project from Google Colab"

# 8. Push your project to GitHub (This is where your Personal Access Token comes in!)
#    When prompted:
#    - For "Username", type your GitHub username: Senor-Avi and press Enter.
#    - For "Password", PASTE your Personal Access Token (PAT) and press Enter.
#      (Note: Nothing will show as you paste/type the password, which is normal for security.)
print("\n--- ALMOST DONE! ---")
print("Now, pushing your project to GitHub. Please follow the prompts:")
print("1. When asked for 'Username', enter 'Senor-Avi' and press Enter.")
print("2. When asked for 'Password', PASTE your Personal Access Token (PAT) here and press Enter.")
print("   (Note: Nothing will show as you paste/type, which is normal.)")
!git push origin main
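
The push step above relies on typing the username and token at an interactive prompt. As an alternative, here is a hedged sketch that reads the token with `getpass` and passes a credentialed URL to a single `git push`. The helper names `with_credentials` and `push_with_token` are hypothetical, and the sketch assumes a plain HTTPS remote without a port. Note that the token is briefly visible in the process arguments, so this is a convenience for a throwaway Colab session rather than a recommended practice.

```python
import getpass
import subprocess
from urllib.parse import quote, urlsplit, urlunsplit


def with_credentials(url: str, username: str, token: str) -> str:
    """Return `url` with percent-encoded `username:token@` added before the host."""
    parts = urlsplit(url)
    netloc = f"{quote(username, safe='')}:{quote(token, safe='')}@{parts.hostname}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


def push_with_token(repo_dir: str, url: str, username: str, branch: str = "main") -> int:
    """Prompt for a Personal Access Token and push `branch` once.

    The credentialed URL is used for this one command only; it is not
    written to the repository's remote configuration.
    """
    token = getpass.getpass("Personal Access Token: ")
    # No check=True: a raised CalledProcessError would echo the command,
    # including the token, into the traceback.
    result = subprocess.run(
        ["git", "push", with_credentials(url, username, token), branch],
        cwd=repo_dir,
    )
    return result.returncode
```

A call such as `push_with_token("Profit-Prophet", github_repo_url, "Senor-Avi")` would return git's exit code, with `0` indicating success.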

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# --- Part A: Get Ready ---

# 1. Tell Git who you are (your GitHub username and email)
!git config --global user.name "Senor-Avi"
!git config --global user.email "avi.rai.senor@gmail.com"

# 2. Define your GitHub repository's HTTPS URL
github_repo_url = "https://github.com/Senor-Avi/Profit-Prophet"

# This line automatically figures out your repository's name ('Profit-Prophet') from the URL.
repo_name = github_repo_url.split('/')[-1].replace('.git', '')
print(f"We will clone your repository into a folder called: {repo_name}")

# --- Part B: Clone and Move Your Project ---

# 3. Clone (Download) your empty GitHub repository into Colab
#    This creates a new folder in your Colab environment named 'Profit-Prophet'.
print(f"Cloning {github_repo_url}...")
!git clone {github_repo_url}

# 4. Move your Colab notebook (and any other project files) into the new 'Profit-Prophet' folder
#    *** THIS LINE IS NOW CORRECTED WITH THE FULL PATH TO YOUR NOTEBOOK ***
print(f"Moving your notebook 'Profit Prophet.ipynb' from Drive into the {repo_name} folder...")
!mv "/content/drive/MyDrive/Colab Notebooks/Profit Prophet.ipynb" {repo_name}/

#    (Optional: If you have other project-related files like data CSVs,
#    or other Python scripts in your Drive, add similar !mv commands here
#    using their full paths from /content/drive/MyDrive/ as the source.)
#    Example: !mv "/content/drive/MyDrive/MyProjectData/train_data.csv" {repo_name}/

# 5. Change your current location in Colab to be inside your repository folder
#    All 'git' commands from now on will apply to your 'Profit-Prophet' project.
print(f"Changing directory to: {repo_name}")
%cd {repo_name}

# --- Part C: Add, Commit, and Push ---

# 6. Add all your files to be tracked by Git
#    The '.' means "add everything new or changed in the current folder."
print("Adding all project files to Git...")
!git add .

# 7. Commit your changes (save a snapshot of your project)
print("Committing your changes...")
!git commit -m "Initial commit of Profit Prophet project from Google Colab"

# 8. Push your project to GitHub (This is where your Personal Access Token comes in!)
#    When prompted:
#    - For "Username", type your GitHub username: Senor-Avi and press Enter.
#    - For "Password", PASTE your Personal Access Token (PAT) and press Enter.
#      (Note: Nothing will show as you paste/type the password, which is normal for security.)
print("\n--- ALMOST DONE! ---")
print("Now, pushing your project to GitHub. Please follow the prompts:")
print("1. When asked for 'Username', enter 'Senor-Avi' and press Enter.")
print("2. When asked for 'Password', PASTE your Personal Access Token (PAT) here and press Enter.")
print("   (Note: Nothing will show as you paste/type, which is normal.)")
!git push origin main

In [None]:
# --- STEP 0: CLEAN UP OLD FOLDERS ---
# This removes any leftover 'Profit-Prophet' folder from previous attempts.
# Run this FIRST in a fresh Colab session.
!rm -rf Profit-Prophet
print("Cleaned up old 'Profit-Prophet' folder if it existed.")

In [None]:
# --- STEP 1: MOUNT GOOGLE DRIVE ---
# This is CRUCIAL if your notebook is saved in Google Drive.
# Follow the prompts to authorize your Google Drive access.
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# --- STEP 2: VERIFY YOUR NOTEBOOK'S EXACT PATH IN DRIVE ---
# This helps us be absolutely sure of the file's location and exact name.
print("\n--- Checking files in your 'Colab Notebooks' folder ---")
# This command lists all files/folders in your "Colab Notebooks" folder in Drive.
!ls -F "/content/drive/MyDrive/Colab Notebooks/"

# ***IMPORTANT: Look VERY CAREFULLY in the output above.***
# Do you see "Profit Prophet.ipynb" listed EXACTLY as it should be (capital P's and space)?
# If not, it means your notebook is either:
#   a) In a different folder within MyDrive (e.g., just "/content/drive/MyDrive/" or a custom folder).
#   b) Has a slightly different name (e.g., "Profit_Prophet.ipynb").
# If it's not listed, you MUST find its correct path using the Colab file explorer or `!find` commands.


In [None]:
!ls -F "/content/drive/MyDrive/Colab Notebooks/"

In [None]:
print("\n--- Checking files in your 'Colab Notebooks' folder ---")
!ls -F "/content/drive/MyDrive/Colab Notebooks/"

In [None]:
# Part A: Get Ready
!git config --global user.name "Senor-Avi"
!git config --global user.email "avi.rai.senor@gmail.com"
github_repo_url = "https://github.com/Senor-Avi/Profit-Prophet"
repo_name = github_repo_url.split('/')[-1].replace('.git', '')
print(f"\nWe will clone your repository into a folder called: {repo_name}")

# Part B: Clone and Move Your Project
print(f"Cloning {github_repo_url}...")
# This will now create the folder without error because we deleted it in Step 0.
!git clone {github_repo_url}

# Move your notebook from Drive into the cloned repo folder.
# This path is based on what you provided, assuming Step 2 confirms it.
print(f"Moving your notebook 'Profit Prophet.ipynb' from Drive into the {repo_name} folder...")
!mv "/content/drive/MyDrive/Colab Notebooks/Profit Prophet.ipynb" {repo_name}/

# Change current directory to the cloned repo.
print(f"Changing directory to: {repo_name}")
%cd {repo_name}

# Part C: Add, Commit, and Push
print("Adding all project files to Git...")
# This now stages the moved notebook as well.
!git add .
print("Committing your changes...")
!git commit -m "Initial commit of Profit Prophet project from Google Colab"
print("\n--- ALMOST DONE! ---")
print("Now, pushing your project to GitHub. Please follow the prompts:")
print("1. When asked for 'Username', enter 'Senor-Avi' and press Enter.")
print("2. When asked for 'Password', PASTE your Personal Access Token (PAT) here and press Enter.")
print("   (Note: Nothing will show as you paste/type, which is normal.)")
!git push origin main

In [None]:
# --- STEP 1: MOUNT GOOGLE DRIVE ---
# This is CRUCIAL as your notebook is in Drive.
# Follow the prompts to authorize your Google Drive access.
from google.colab import drive
drive.mount('/content/drive')

print("\n--- Google Drive Mount Status ---")
# ***IMPORTANT: Verify you see "Mounted at /content/drive" above this line.***
# If you don't, the mount failed. You MUST fix the drive mount before proceeding.
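
The cell above asks the reader to confirm the mount by eye. A minimal sketch of a programmatic check follows, assuming that a successful mount exposes a `MyDrive` folder under the mount point (as the paths used elsewhere in this notebook suggest). The function name `drive_is_mounted` is hypothetical.

```python
from pathlib import Path


def drive_is_mounted(mount_point: str = "/content/drive") -> bool:
    """Return True if `<mount_point>/MyDrive` exists and is a directory."""
    return (Path(mount_point) / "MyDrive").is_dir()


if drive_is_mounted():
    print("Google Drive appears to be mounted.")
else:
    print("Google Drive does not appear to be mounted; run drive.mount() first.")
```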

In [None]:
# --- STEP 2: FIND YOUR NOTEBOOK'S REAL PATH IN DRIVE ---
# This command will search your entire mounted Google Drive for your notebook.
# It might take a moment to run.
print("\n--- Searching for 'Profit Prophet.ipynb' in your Google Drive ---")
!find "/content/drive/MyDrive/" -name "Profit Prophet.ipynb" 2>/dev/null

# ***IMPORTANT: Look VERY CAREFULLY at the output below this cell.***
# You should see one line that looks like:
# /content/drive/MyDrive/YOUR_ACTUAL_FOLDER_PATH/Profit Prophet.ipynb
# For example: /content/drive/MyDrive/My Deep Learning Projects/Profit Prophet.ipynb
# This is the EXACT path you need for the !mv command.
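
The `!find` search above can also be done in Python, which returns the matches as values that later cells can reuse instead of copying a path by hand. This is a sketch under the same assumptions as the cell above (Drive mounted at `/content/drive`); the function name `find_notebooks` is hypothetical.

```python
from pathlib import Path


def find_notebooks(root, filename):
    """Recursively collect files named `filename` under `root`.

    Returns a sorted list of Path objects, or an empty list when `root`
    is not a directory (for example, when Drive is not mounted).
    """
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob(filename) if p.is_file())


matches = find_notebooks("/content/drive/MyDrive", "Profit Prophet.ipynb")
for path in matches:
    print(path)
if not matches:
    print("No matching notebook found.")
```

If exactly one match is returned, `str(matches[0])` can be used directly as the source path for the move step.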

In [1]:
# --- STEP 1: MOUNT GOOGLE DRIVE ---
# Run this cell first. You'll get a prompt to authorize access.
from google.colab import drive
drive.mount('/content/drive')

print("\n--- Google Drive Mount Status ---")
# ***IMPORTANT: Verify that you see "Mounted at /content/drive" in the output above.***
# If you don't, the drive mount failed. You MUST fix the drive mount before proceeding.
# This usually means following the authorization link, copying the code, and pasting it back.

Mounted at /content/drive

--- Google Drive Mount Status ---


In [2]:
# --- STEP 2: CLEAN UP OLD REPOSITORY FOLDER & FIND NOTEBOOK'S EXACT PATH ---

# Define your repository name (this is 'Profit-Prophet')
repo_name = "Profit-Prophet"

# Clean up any existing folder from previous failed attempts
import os
if os.path.exists(repo_name):
    !rm -rf {repo_name}
    print(f"Removed existing '{repo_name}' folder to ensure a clean clone.")
else:
    print(f"'{repo_name}' folder does not exist, no cleanup needed.")


print("\n--- NOW, LET'S FIND YOUR NOTEBOOK'S EXACT LOCATION ---")
print("This has been the main issue. We need the ABSOLUTE path.")

print("\nMethod A: Visually inspect via Colab's 'Files' sidebar (RECOMMENDED):")
print("1. Click the 'folder' icon on the left sidebar.")
print("2. Expand 'drive' > 'MyDrive'.")
print("3. Navigate through your folders until you visually find 'Profit Prophet.ipynb'.")
print("4. Right-click on 'Profit Prophet.ipynb' and select 'Copy path'.")
print("5. PASTE that exact copied path into the 'NOTEBOOK_FULL_PATH' variable below, replacing 'PASTE_YOUR_EXACT_FULL_PATH_HERE'.")


print("\nMethod B (for confirmation if Method A is difficult): Search your Drive.")
print("This might take a moment. Look for a line starting with '/content/drive/MyDrive/'")
!find "/content/drive/MyDrive/" -name "Profit Prophet.ipynb" 2>/dev/null


# *** VERY IMPORTANT: PASTE THE EXACT FULL PATH YOU FOUND/COPIED HERE: ***
# Example of what it might look like: "/content/drive/MyDrive/Colab Notebooks/Profit Prophet.ipynb"
# Another example: "/content/drive/MyDrive/MyProjectFolder/Profit Prophet.ipynb"
NOTEBOOK_FULL_PATH = "PASTE_YOUR_EXACT_FULL_PATH_HERE" # <--- REPLACE THIS ENTIRE STRING WITH YOUR NOTEBOOK'S FULL PATH

# This check ensures you updated the path.
if NOTEBOOK_FULL_PATH == "PASTE_YOUR_EXACT_FULL_PATH_HERE":
    raise ValueError("ERROR: You MUST update the 'NOTEBOOK_FULL_PATH' variable above with the correct path from your Google Drive!")
else:
    print(f"\nConfirmed notebook path to use: {NOTEBOOK_FULL_PATH}")

'Profit-Prophet' folder does not exist, no cleanup needed.

--- NOW, LET'S FIND YOUR NOTEBOOK'S EXACT LOCATION ---
This has been the main issue. We need the ABSOLUTE path.

Method A: Visually inspect via Colab's 'Files' sidebar (RECOMMENDED):
1. Click the 'folder' icon on the left sidebar.
2. Expand 'drive' > 'MyDrive'.
3. Navigate through your folders until you visually find 'Profit Prophet.ipynb'.
4. Right-click on 'Profit Prophet.ipynb' and select 'Copy path'.
5. PASTE that exact copied path into the 'NOTEBOOK_FULL_PATH' variable below, replacing 'PASTE_YOUR_EXACT_FULL_PATH_HERE'.

Method B (for confirmation if Method A is difficult): Search your Drive.
This might take a moment. Look for a line starting with '/content/drive/MyDrive/'
/content/drive/MyDrive/Colab Notebooks/Profit Prophet.ipynb


ValueError: ERROR: You MUST update the 'NOTEBOOK_FULL_PATH' variable above with the correct path from your Google Drive!
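
The `ValueError` above is raised because the placeholder string was never replaced, even though the search printed the notebook's path. A hedged sketch of a stricter check follows: it rejects the placeholder, confirms the file exists, and copies the notebook into the repository folder rather than moving it, so the original stays in Drive. The function name `copy_notebook_into_repo` is hypothetical, and copying instead of moving is a swapped-in choice, not the original notebook's method.

```python
import shutil
from pathlib import Path

PLACEHOLDER = "PASTE_YOUR_EXACT_FULL_PATH_HERE"


def copy_notebook_into_repo(notebook_path, repo_dir):
    """Copy the notebook at `notebook_path` into `repo_dir` and return the new path."""
    if str(notebook_path) == PLACEHOLDER:
        raise ValueError("Replace the placeholder with the notebook's full path.")
    src = Path(notebook_path)
    if not src.is_file():
        raise FileNotFoundError(f"No notebook found at: {src}")
    dest_dir = Path(repo_dir)
    if not dest_dir.is_dir():
        raise NotADirectoryError(f"Repository folder not found: {dest_dir}")
    # copy2 preserves file metadata and leaves the source file in place.
    return Path(shutil.copy2(src, dest_dir / src.name))
```

With the path printed by the search, the call would be `copy_notebook_into_repo("/content/drive/MyDrive/Colab Notebooks/Profit Prophet.ipynb", "Profit-Prophet")`, run after the repository has been cloned.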