# Task 4: Loan Approval Prediction

## Objective
Build a classification model that predicts whether a loan application will be approved.

## Dataset
Loan Approval Prediction Dataset (Simulated for demonstration)

## Tasks:
1. Handle missing values and encode categorical features
2. Train a classification model and evaluate performance on imbalanced data
3. Focus on precision, recall, and F1-score
4. Bonus: Use SMOTE or other techniques to address class imbalance
5. Bonus: Try logistic regression vs. decision tree

In [7]:
!pip install imbalanced-learn


Defaulting to user installation because normal site-packages is not writeable
Collecting imbalanced-learn
  Using cached imbalanced_learn-0.13.0-py3-none-any.whl.metadata (8.8 kB)
Collecting sklearn-compat<1,>=0.1 (from imbalanced-learn)
  Using cached sklearn_compat-0.1.3-py3-none-any.whl.metadata (18 kB)
Collecting scikit-learn<2,>=1.3.2 (from imbalanced-learn)
  Using cached scikit_learn-1.6.1-cp313-cp313-win_amd64.whl.metadata (15 kB)
Using cached imbalanced_learn-0.13.0-py3-none-any.whl (238 kB)
Using cached sklearn_compat-0.1.3-py3-none-any.whl (18 kB)
Downloading scikit_learn-1.6.1-cp313-cp313-win_amd64.whl (11.1 MB)

  You can safely remove it manually.


In [10]:
!pip install scikit-learn==1.3.2 imbalanced-learn==0.11.0


^C


Defaulting to user installation because normal site-packages is not writeable
Collecting scikit-learn==1.3.2
  Downloading scikit-learn-1.3.2.tar.gz (7.5 MB)

  error: subprocess-exited-with-error
  
  × Preparing metadata (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [649 lines of output]
      Partial import of sklearn during the build process.
      test_program.c
      Generating code
      Finished generating code
      test_program.c
      Generating code
      Finished generating code
      Compiling sklearn\__check_build\_check_build.pyx because it changed.
      Compiling sklearn\_isotonic.pyx because it changed.
      Compiling sklearn\_loss\_loss.pyx because it changed.
      Compiling sklearn\cluster\_dbscan_inner.pyx because it changed.
      Compiling sklearn\cluster\_hierarchical_fast.pyx because it changed.
      Compiling sklearn\cluster\_k_means_common.pyx because it changed.
      Compiling sklearn\cluster\_k_means_lloyd.pyx because it changed.
      Compiling sklearn\cluster\_k_means_elkan.pyx because it changed.
      Compiling sklearn\cluster\_k_means_minibatch.pyx because it changed.
      Compiling
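
The log above shows pip downloading the source tarball for scikit-learn 1.3.2 while the earlier wheel was tagged `cp313`, which suggests that no prebuilt wheel matched this interpreter and pip fell back to compiling from source. Before pinning versions, it can help to check what is already installed. A minimal sketch using only the standard library; the helper name `installed_version` is hypothetical:

```python
import sys
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report the interpreter version and the two packages this notebook depends on
print("Python:", sys.version.split()[0])
for name in ("scikit-learn", "imbalanced-learn"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If both packages are reported with compatible versions, the pinned reinstall in the cell above is probably unnecessary.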

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score, roc_curve
from sklearn.impute import SimpleImputer
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTEENN
import warnings
warnings.filterwarnings('ignore')

# Set style for better plots
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

In [None]:
# Create synthetic loan approval dataset
np.random.seed(42)
n_samples = 2000

# Generate features
gender = np.random.choice(['Male', 'Female'], n_samples, p=[0.6, 0.4])
married = np.random.choice(['Yes', 'No'], n_samples, p=[0.7, 0.3])
dependents = np.random.choice(['0', '1', '2', '3+'], n_samples, p=[0.4, 0.3, 0.2, 0.1])
education = np.random.choice(['Graduate', 'Not Graduate'], n_samples, p=[0.8, 0.2])
self_employed = np.random.choice(['Yes', 'No'], n_samples, p=[0.2, 0.8])
property_area = np.random.choice(['Urban', 'Semiurban', 'Rural'], n_samples, p=[0.4, 0.4, 0.2])

# Numerical features
applicant_income = np.random.exponential(5000, n_samples) + 2000
coapplicant_income = np.random.exponential(2000, n_samples) + 500
loan_amount = np.random.normal(150000, 50000, n_samples)
loan_amount_term = np.random.choice([12, 36, 60, 84, 120, 180, 240, 300, 360], n_samples, p=[0.1, 0.15, 0.2, 0.15, 0.15, 0.1, 0.05, 0.05, 0.05])
credit_history = np.random.choice([0, 1], n_samples, p=[0.3, 0.7])

# Create loan approval logic based on features
loan_approved = []
for i in range(n_samples):
    # Base approval probability
    prob = 0.5
    
    # Factors that increase approval probability
    if education[i] == 'Graduate':
        prob += 0.1
    if credit_history[i] == 1:
        prob += 0.2
    if married[i] == 'Yes':
        prob += 0.05
    if property_area[i] == 'Urban':
        prob += 0.05
    if applicant_income[i] > 8000:
        prob += 0.1
    if coapplicant_income[i] > 2000:
        prob += 0.05
    
    # Factors that decrease approval probability
    if dependents[i] in ['2', '3+']:
        prob -= 0.05
    if self_employed[i] == 'Yes':
        prob -= 0.1
    if loan_amount[i] > 200000:
        prob -= 0.1
    
    # Add some randomness
    prob += np.random.normal(0, 0.1)
    prob = np.clip(prob, 0, 1)
    
    loan_approved.append(1 if np.random.random() < prob else 0)

# Create DataFrame
data = pd.DataFrame({
    'Gender': gender,
    'Married': married,
    'Dependents': dependents,
    'Education': education,
    'Self_Employed': self_employed,
    'ApplicantIncome': applicant_income,
    'CoapplicantIncome': coapplicant_income,
    'LoanAmount': loan_amount,
    'Loan_Amount_Term': loan_amount_term,
    'Credit_History': credit_history,
    'Property_Area': property_area,
    'Loan_Status': loan_approved
})

print("Dataset Shape:", data.shape)
print("\nFirst 10 rows:")
print(data.head(10))
print("\nDataset Info:")
print(data.info())
print("\nDescriptive Statistics:")
print(data.describe())
print("\nLoan Status Distribution:")
print(data['Loan_Status'].value_counts())
print(f"\nApproval Rate: {(data['Loan_Status'].mean()*100):.2f}%")
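
The approval logic above runs a Python loop over every row. The same idea can be written with vectorized NumPy operations. The sketch below is illustrative only: it uses a subset of the factors, a separate `numpy.random.Generator`, and hypothetical variable names, so it will not reproduce the exact dataset generated above.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000

# Regenerate a few of the features (subset of the notebook's factors)
education = rng.choice(['Graduate', 'Not Graduate'], n, p=[0.8, 0.2])
credit_history = rng.choice([0, 1], n, p=[0.3, 0.7])
self_employed = rng.choice(['Yes', 'No'], n, p=[0.2, 0.8])
applicant_income = rng.exponential(5000, n) + 2000

# Boolean arrays act as 0/1 when multiplied by a weight
prob = (
    0.5
    + 0.1 * (education == 'Graduate')
    + 0.2 * (credit_history == 1)
    + 0.1 * (applicant_income > 8000)
    - 0.1 * (self_employed == 'Yes')
)

# Add noise, clip to a valid probability, then draw one Bernoulli outcome per row
prob = np.clip(prob + rng.normal(0, 0.1, n), 0, 1)
loan_approved = (rng.random(n) < prob).astype(int)

print(f"Approval rate: {loan_approved.mean() * 100:.2f}%")
```

Because the base probability is 0.5 and most adjustments are positive, approvals end up as the majority class, so the rejected class is the one that is under-represented.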

In [None]:
# Data Exploration and Visualization
fig, axes = plt.subplots(2, 3, figsize=(18, 12))

# 1. Loan Status Distribution
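# value_counts() sorts by frequency, so sort by label (0, 1) to keep the pie labels aligned
loan_status_counts = data['Loan_Status'].value_counts().sort_index()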
axes[0, 0].pie(loan_status_counts.values, labels=['Rejected', 'Approved'], autopct='%1.1f%%', 
                colors=['red', 'green'])
axes[0, 0].set_title('Loan Status Distribution')

# 2. Applicant Income by Loan Status
approved_income = data[data['Loan_Status'] == 1]['ApplicantIncome']
rejected_income = data[data['Loan_Status'] == 0]['ApplicantIncome']
axes[0, 1].hist(approved_income, alpha=0.6, label='Approved', bins=30, color='green')
axes[0, 1].hist(rejected_income, alpha=0.6, label='Rejected', bins=30, color='red')
axes[0, 1].set_xlabel('Applicant Income')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_title('Applicant Income by Loan Status')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# 3. Loan Amount by Loan Status
approved_loan = data[data['Loan_Status'] == 1]['LoanAmount']
rejected_loan = data[data['Loan_Status'] == 0]['LoanAmount']
axes[0, 2].hist(approved_loan, alpha=0.6, label='Approved', bins=30, color='green')
axes[0, 2].hist(rejected_loan, alpha=0.6, label='Rejected', bins=30, color='red')
axes[0, 2].set_xlabel('Loan Amount')
axes[0, 2].set_ylabel('Frequency')
axes[0, 2].set_title('Loan Amount by Loan Status')
axes[0, 2].legend()
axes[0, 2].grid(True, alpha=0.3)

# 4. Education vs Loan Status
education_loan = pd.crosstab(data['Education'], data['Loan_Status'])
education_loan.plot(kind='bar', ax=axes[1, 0], color=['red', 'green'])
axes[1, 0].set_xlabel('Education')
axes[1, 0].set_ylabel('Count')
axes[1, 0].set_title('Education vs Loan Status')
axes[1, 0].legend(['Rejected', 'Approved'])
axes[1, 0].grid(True, alpha=0.3)

# 5. Credit History vs Loan Status
credit_loan = pd.crosstab(data['Credit_History'], data['Loan_Status'])
credit_loan.plot(kind='bar', ax=axes[1, 1], color=['red', 'green'])
axes[1, 1].set_xlabel('Credit History')
axes[1, 1].set_ylabel('Count')
axes[1, 1].set_title('Credit History vs Loan Status')
axes[1, 1].legend(['Rejected', 'Approved'])
axes[1, 1].grid(True, alpha=0.3)

# 6. Property Area vs Loan Status
property_loan = pd.crosstab(data['Property_Area'], data['Loan_Status'])
property_loan.plot(kind='bar', ax=axes[1, 2], color=['red', 'green'])
axes[1, 2].set_xlabel('Property Area')
axes[1, 2].set_ylabel('Count')
axes[1, 2].set_title('Property Area vs Loan Status')
axes[1, 2].legend(['Rejected', 'Approved'])
axes[1, 2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Correlation analysis for numerical features
numerical_features = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term', 'Credit_History']
correlation_matrix = data[numerical_features + ['Loan_Status']].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix')
plt.show()

# Correlation of each numerical feature with the target (not a model-based importance)
print("\nFeature Correlation with Loan Status:")
for feature in numerical_features:
    corr = data[feature].corr(data['Loan_Status'])
    print(f"{feature}: {corr:.4f}")
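
The correlation loop above covers only the numerical columns. For the categorical columns, one option is to compare approval rates per category. A minimal sketch on a small hand-made frame; the helper `approval_rate_by` and the toy values are assumptions for illustration, not part of the notebook:

```python
import pandas as pd

# Tiny hand-made sample with the same column names as the notebook's dataset
toy = pd.DataFrame({
    'Education': ['Graduate', 'Graduate', 'Not Graduate', 'Graduate', 'Not Graduate', 'Graduate'],
    'Property_Area': ['Urban', 'Rural', 'Urban', 'Semiurban', 'Rural', 'Urban'],
    'Loan_Status': [1, 1, 0, 1, 0, 0],
})


def approval_rate_by(df, column, target='Loan_Status'):
    """Mean of the 0/1 target and row count for each category of `column`."""
    return df.groupby(column)[target].agg(approval_rate='mean', n='size')


for col in ['Education', 'Property_Area']:
    print(approval_rate_by(toy, col), end='\n\n')
```

Applied to the notebook's `data` frame, the same helper could be called once per entry in the list of categorical columns.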

In [None]:
# Data Preprocessing
print("=== Data Preprocessing ===")

# Check for missing values
print("\nMissing values:")
print(data.isnull().sum())

# Add some missing values to simulate real data
np.random.seed(42)
missing_indices = np.random.choice(data.index, size=int(len(data) * 0.05), replace=False)
data.loc[missing_indices, 'Self_Employed'] = np.nan

missing_indices = np.random.choice(data.index, size=int(len(data) * 0.03), replace=False)
data.loc[missing_indices, 'LoanAmount'] = np.nan

print("\nMissing values after simulation:")
print(data.isnull().sum())

# Handle missing values
print("\nHandling missing values...")

# For categorical variables, use mode
categorical_columns = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Property_Area']
for col in categorical_columns:
    if data[col].isnull().sum() > 0:
        mode_value = data[col].mode()[0]
        # Assign back instead of chained inplace fillna, which newer pandas versions deprecate
        data[col] = data[col].fillna(mode_value)
        print(f"Filled missing values in {col} with mode: {mode_value}")

# For numerical variables, use median
numerical_columns = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term']
for col in numerical_columns:
    if data[col].isnull().sum() > 0:
        median_value = data[col].median()
        # Assign back instead of chained inplace fillna, which newer pandas versions deprecate
        data[col] = data[col].fillna(median_value)
        print(f"Filled missing values in {col} with median: {median_value:.2f}")

print("\nMissing values after handling:")
print(data.isnull().sum())

# Encode categorical variables
print("\nEncoding categorical variables...")
label_encoders = {}

for col in categorical_columns:
    le = LabelEncoder()
    data[col + '_encoded'] = le.fit_transform(data[col])
    label_encoders[col] = le
    print(f"Encoded {col}: {dict(zip(le.classes_, le.transform(le.classes_)))}")

# Prepare features and target
feature_columns = [col + '_encoded' for col in categorical_columns] + numerical_columns + ['Credit_History']
X = data[feature_columns]
y = data['Loan_Status']

print(f"\nFeature columns: {feature_columns}")
print(f"Number of features: {len(feature_columns)}")
print(f"Target distribution: {y.value_counts().to_dict()}")

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"\nTraining set size: {len(X_train)}")
print(f"Testing set size: {len(X_test)}")
print(f"Training set target distribution: {y_train.value_counts().to_dict()}")
print(f"Testing set target distribution: {y_test.value_counts().to_dict()}")

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("\nFeatures scaled successfully!")
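
The cell above imputes on the full dataset before splitting, so the test rows influence the fill values. An alternative is to put imputation, encoding, and scaling in one transformer that is fitted on the training split only. The sketch below is one possible arrangement, assuming one-hot encoding in place of `LabelEncoder` and a small stand-in frame with hypothetical values:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Stand-in data (hypothetical) with the same kinds of columns as the notebook
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    'Self_Employed': rng.choice(['Yes', 'No'], n, p=[0.2, 0.8]),
    'Property_Area': rng.choice(['Urban', 'Semiurban', 'Rural'], n),
    'ApplicantIncome': rng.exponential(5000, n) + 2000,
    'LoanAmount': rng.normal(150000, 50000, n),
}).astype({'Self_Employed': object, 'Property_Area': object})
toy.loc[rng.choice(n, 10, replace=False), 'Self_Employed'] = np.nan
toy.loc[rng.choice(n, 6, replace=False), 'LoanAmount'] = np.nan
target = rng.integers(0, 2, n)

categorical = ['Self_Employed', 'Property_Area']
numerical = ['ApplicantIncome', 'LoanAmount']

preprocess = ColumnTransformer([
    # Categorical: fill with the most frequent value, then one-hot encode
    ('cat', Pipeline([
        ('impute', SimpleImputer(strategy='most_frequent')),
        ('encode', OneHotEncoder(handle_unknown='ignore')),
    ]), categorical),
    # Numerical: fill with the median, then standardize
    ('num', Pipeline([
        ('impute', SimpleImputer(strategy='median')),
        ('scale', StandardScaler()),
    ]), numerical),
], sparse_threshold=0)  # always return a dense array

X_tr, X_te, y_tr, y_te = train_test_split(
    toy, target, test_size=0.2, random_state=42, stratify=target
)

# Fill values, categories, and scaling statistics are learned from the training rows only
X_tr_t = preprocess.fit_transform(X_tr)
X_te_t = preprocess.transform(X_te)
print(X_tr_t.shape, X_te_t.shape)
```

One-hot encoding also avoids implying an order between categories such as `Urban`, `Semiurban`, and `Rural`, which matters for the logistic regression model.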

In [None]:
# Train Models on Imbalanced Data
print("=== Training Models on Imbalanced Data ===")

# Initialize models
models = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42, n_estimators=100)
}

# Train and evaluate models
imbalanced_results = {}
for name, model in models.items():
    print(f"\nTraining {name}...")
    
    # Train model
    model.fit(X_train_scaled, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test_scaled)
    y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
    
    # Calculate metrics (support-weighted averages; weighted recall is numerically equal to accuracy)
    accuracy = accuracy_score(y_test, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(y_test, y_pred, average='weighted')
    roc_auc = roc_auc_score(y_test, y_pred_proba)
    
    # Cross-validation score
    cv_score = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring='f1_weighted').mean()
    
    imbalanced_results[name] = {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1_score': f1,
        'roc_auc': roc_auc,
        'cv_score': cv_score,
        'predictions': y_pred,
        'predictions_proba': y_pred_proba,
        'model': model
    }
    
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1-Score: {f1:.4f}")
    print(f"ROC AUC: {roc_auc:.4f}")
    print(f"CV Score: {cv_score:.4f}")

# Compare imbalanced models
print("\n=== Imbalanced Data Model Comparison ===")
imbalanced_comparison = pd.DataFrame({
    'Model': list(imbalanced_results.keys()),
    'Accuracy': [imbalanced_results[model]['accuracy'] for model in imbalanced_results.keys()],
    'Precision': [imbalanced_results[model]['precision'] for model in imbalanced_results.keys()],
    'Recall': [imbalanced_results[model]['recall'] for model in imbalanced_results.keys()],
    'F1-Score': [imbalanced_results[model]['f1_score'] for model in imbalanced_results.keys()],
    'ROC AUC': [imbalanced_results[model]['roc_auc'] for model in imbalanced_results.keys()]
})

print(imbalanced_comparison.round(4))

best_imbalanced_model = imbalanced_comparison.loc[imbalanced_comparison['F1-Score'].idxmax(), 'Model']
print(f"\nBest Model on Imbalanced Data: {best_imbalanced_model}")
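
The metrics above use `average='weighted'`, which weights each class by its support. Weighted recall then works out to the same number as accuracy, so it says little about the minority class. A small sketch with hand-made labels (hypothetical values) that prints per-class figures alongside the weighted one:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# 8 approved (1) and 2 rejected (0); the model misses one of the two rejections
y_true = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
y_pred = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 0])

accuracy = accuracy_score(y_true, y_pred)
_, recall_weighted, _, _ = precision_recall_fscore_support(
    y_true, y_pred, average='weighted'
)

# average=None returns one value per class, in the order given by `labels`
precision_c, recall_c, f1_c, support_c = precision_recall_fscore_support(
    y_true, y_pred, average=None, labels=[0, 1]
)

print(f"Accuracy:        {accuracy:.4f}")
print(f"Weighted recall: {recall_weighted:.4f}")
for label, name in enumerate(['Rejected', 'Approved']):
    print(f"{name}: precision={precision_c[label]:.4f} "
          f"recall={recall_c[label]:.4f} f1={f1_c[label]:.4f} n={support_c[label]}")
```

In the notebook, the same `average=None` call on `y_test` and `y_pred` would expose how each model handles rejected applications specifically.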

In [None]:
# Bonus: Handle Class Imbalance with SMOTE
print("=== Bonus: Handling Class Imbalance ===")

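# Apply SMOTE to balance the classes (training split only; the test set stays untouched).
# Note: SMOTE interpolates every column, including the label-encoded categoricals;
# imblearn's SMOTENC is designed for mixed categorical/numerical data.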
print("\nApplying SMOTE...")
smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train_scaled, y_train)

print(f"Original training set distribution: {np.bincount(y_train)}")
print(f"Balanced training set distribution: {np.bincount(y_train_balanced)}")

# Train models on balanced data
balanced_results = {}
for name, model in models.items():
    print(f"\nTraining {name} on balanced data...")
    
    # Train model on balanced data
    model_balanced = type(model)(**model.get_params())
    model_balanced.fit(X_train_balanced, y_train_balanced)
    
    # Make predictions
    y_pred = model_balanced.predict(X_test_scaled)
    y_pred_proba = model_balanced.predict_proba(X_test_scaled)[:, 1]
    
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(y_test, y_pred, average='weighted')
    roc_auc = roc_auc_score(y_test, y_pred_proba)
    
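    # Cross-validation score. This is likely optimistic: SMOTE ran before the folds were split,
    # so synthetic points interpolated from validation rows can end up in the training folds.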
    cv_score = cross_val_score(model_balanced, X_train_balanced, y_train_balanced, cv=5, scoring='f1_weighted').mean()
    
    balanced_results[name] = {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1_score': f1,
        'roc_auc': roc_auc,
        'cv_score': cv_score,
        'predictions': y_pred,
        'predictions_proba': y_pred_proba,
        'model': model_balanced
    }
    
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1-Score: {f1:.4f}")
    print(f"ROC AUC: {roc_auc:.4f}")
    print(f"CV Score: {cv_score:.4f}")

# Compare balanced models
print("\n=== Balanced Data Model Comparison ===")
balanced_comparison = pd.DataFrame({
    'Model': list(balanced_results.keys()),
    'Accuracy': [balanced_results[model]['accuracy'] for model in balanced_results.keys()],
    'Precision': [balanced_results[model]['precision'] for model in balanced_results.keys()],
    'Recall': [balanced_results[model]['recall'] for model in balanced_results.keys()],
    'F1-Score': [balanced_results[model]['f1_score'] for model in balanced_results.keys()],
    'ROC AUC': [balanced_results[model]['roc_auc'] for model in balanced_results.keys()]
})

print(balanced_comparison.round(4))

best_balanced_model = balanced_comparison.loc[balanced_comparison['F1-Score'].idxmax(), 'Model']
print(f"\nBest Model on Balanced Data: {best_balanced_model}")

# Compare imbalanced vs balanced
print("\n=== Imbalanced vs Balanced Comparison ===")
comparison_df = pd.DataFrame({
    'Model': list(models.keys()),
    'Imbalanced F1': [imbalanced_results[model]['f1_score'] for model in models.keys()],
    'Balanced F1': [balanced_results[model]['f1_score'] for model in models.keys()],
    'Imbalanced Recall': [imbalanced_results[model]['recall'] for model in models.keys()],
    'Balanced Recall': [balanced_results[model]['recall'] for model in models.keys()]
})

print(comparison_df.round(4))

# Calculate improvement
for model in models.keys():
    f1_improvement = balanced_results[model]['f1_score'] - imbalanced_results[model]['f1_score']
    recall_improvement = balanced_results[model]['recall'] - imbalanced_results[model]['recall']
    print(f"\n{model}:")
    print(f"  F1-Score improvement: {f1_improvement:.4f}")
    print(f"  Recall improvement: {recall_improvement:.4f}")
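
Cross-validating on data that was already resampled lets information from validation rows leak into the training folds. One way around this is to resample inside each fold. The sketch below swaps SMOTE for plain random oversampling so that it depends only on scikit-learn; the helper names and the toy data are hypothetical:

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample


def oversample_minority(X, y, random_state=0):
    """Randomly duplicate minority-class rows until both classes have equal counts."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = int(counts.max() - counts.min())
    if n_extra == 0:
        return X, y
    X_extra = resample(X[y == minority], replace=True,
                       n_samples=n_extra, random_state=random_state)
    return np.vstack([X, X_extra]), np.concatenate([y, np.full(n_extra, minority)])


def cv_f1_with_fold_resampling(model, X, y, n_splits=5, random_state=42):
    """Weighted F1 averaged over folds, oversampling only the training part of each fold."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    scores = []
    for train_idx, val_idx in skf.split(X, y):
        X_tr, y_tr = oversample_minority(X[train_idx], y[train_idx], random_state)
        fitted = clone(model).fit(X_tr, y_tr)
        # The validation fold keeps its original, imbalanced distribution
        scores.append(f1_score(y[val_idx], fitted.predict(X[val_idx]), average='weighted'))
    return float(np.mean(scores))


# Toy imbalanced data: roughly 70% of rows fall in class 1
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] + rng.normal(scale=1.0, size=300) > -0.8).astype(int)

score = cv_f1_with_fold_resampling(LogisticRegression(max_iter=1000), X_demo, y_demo)
print(f"Fold-wise resampled CV F1 (weighted): {score:.4f}")
```

With imbalanced-learn installed, placing `SMOTE` inside an `imblearn.pipeline.Pipeline` and passing that pipeline to `cross_val_score` achieves the same fold-wise behaviour with SMOTE itself.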

In [None]:
# Visualize Results
fig, axes = plt.subplots(2, 3, figsize=(18, 12))

# 1. Confusion Matrix for best imbalanced model
best_imbalanced = imbalanced_results[best_imbalanced_model]
cm_imbalanced = confusion_matrix(y_test, best_imbalanced['predictions'])
sns.heatmap(cm_imbalanced, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Rejected', 'Approved'], yticklabels=['Rejected', 'Approved'], ax=axes[0, 0])
axes[0, 0].set_title(f'Confusion Matrix - {best_imbalanced_model} (Imbalanced)')
axes[0, 0].set_xlabel('Predicted')
axes[0, 0].set_ylabel('Actual')

# 2. Confusion Matrix for best balanced model
best_balanced = balanced_results[best_balanced_model]
cm_balanced = confusion_matrix(y_test, best_balanced['predictions'])
sns.heatmap(cm_balanced, annot=True, fmt='d', cmap='Greens', 
            xticklabels=['Rejected', 'Approved'], yticklabels=['Rejected', 'Approved'], ax=axes[0, 1])
axes[0, 1].set_title(f'Confusion Matrix - {best_balanced_model} (Balanced)')
axes[0, 1].set_xlabel('Predicted')
axes[0, 1].set_ylabel('Actual')

# 3. ROC Curves
for name in models.keys():
    fpr, tpr, _ = roc_curve(y_test, imbalanced_results[name]['predictions_proba'])
    axes[0, 2].plot(fpr, tpr, label=f'{name} (Imbalanced)', alpha=0.7)
    
    fpr, tpr, _ = roc_curve(y_test, balanced_results[name]['predictions_proba'])
    axes[0, 2].plot(fpr, tpr, label=f'{name} (Balanced)', linestyle='--', alpha=0.7)

axes[0, 2].plot([0, 1], [0, 1], 'k--', alpha=0.5)
axes[0, 2].set_xlabel('False Positive Rate')
axes[0, 2].set_ylabel('True Positive Rate')
axes[0, 2].set_title('ROC Curves')
axes[0, 2].legend()
axes[0, 2].grid(True, alpha=0.3)

# 4. Model Performance Comparison
metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score']
x_pos = np.arange(len(metrics))
width = 0.35

imbalanced_values = [imbalanced_results[best_imbalanced_model]['accuracy'], 
                     imbalanced_results[best_imbalanced_model]['precision'],
                     imbalanced_results[best_imbalanced_model]['recall'], 
                     imbalanced_results[best_imbalanced_model]['f1_score']]
balanced_values = [balanced_results[best_balanced_model]['accuracy'], 
                   balanced_results[best_balanced_model]['precision'],
                   balanced_results[best_balanced_model]['recall'], 
                   balanced_results[best_balanced_model]['f1_score']]

axes[1, 0].bar(x_pos - width/2, imbalanced_values, width, label='Imbalanced', alpha=0.8)
axes[1, 0].bar(x_pos + width/2, balanced_values, width, label='Balanced', alpha=0.8)
axes[1, 0].set_xlabel('Metrics')
axes[1, 0].set_ylabel('Score')
axes[1, 0].set_title('Best Model Comparison')
axes[1, 0].set_xticks(x_pos)
axes[1, 0].set_xticklabels(metrics)
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# 5. Feature Importance (for tree-based models)
if hasattr(balanced_results[best_balanced_model]['model'], 'feature_importances_'):
    feature_importance = balanced_results[best_balanced_model]['model'].feature_importances_
    feature_names = feature_columns
    
    indices = np.argsort(feature_importance)[::-1]
    axes[1, 1].bar(range(len(feature_importance)), feature_importance[indices])
    axes[1, 1].set_xticks(range(len(feature_importance)))
    axes[1, 1].set_xticklabels([feature_names[i] for i in indices], rotation=45, ha='right')
    axes[1, 1].set_title('Feature Importance (Balanced Model)')
    axes[1, 1].set_ylabel('Importance')
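    axes[1, 1].grid(True, alpha=0.3)
else:
    # Models without feature_importances_ (e.g. logistic regression) leave this panel unused
    axes[1, 1].axis('off')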

# 6. Class Distribution
class_dist = pd.DataFrame({
    'Dataset': ['Original', 'Balanced'],
    'Rejected': [np.sum(y_train == 0), np.sum(y_train_balanced == 0)],
    'Approved': [np.sum(y_train == 1), np.sum(y_train_balanced == 1)]
})

class_dist.plot(x='Dataset', y=['Rejected', 'Approved'], kind='bar', ax=axes[1, 2])
axes[1, 2].set_title('Class Distribution')
axes[1, 2].set_ylabel('Count')
axes[1, 2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print detailed classification reports
print(f"\n=== Detailed Classification Report - {best_imbalanced_model} (Imbalanced) ===")
print(classification_report(y_test, best_imbalanced['predictions'], target_names=['Rejected', 'Approved']))

print(f"\n=== Detailed Classification Report - {best_balanced_model} (Balanced) ===")
print(classification_report(y_test, best_balanced['predictions'], target_names=['Rejected', 'Approved']))
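
The reports above use the default 0.5 decision threshold. Since the models also return probabilities, the threshold itself can be tuned, which is another way to trade precision against recall. A minimal sketch, assuming class 1 (approved) is the positive class and using hand-made scores; `best_f1_threshold` is a hypothetical helper:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def best_f1_threshold(y_true, y_score):
    """Return the threshold that maximizes F1 for the positive class, and that F1."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # The final precision/recall pair has no matching threshold, so drop it
    p, r = precision[:-1], recall[:-1]
    f1 = 2 * p * r / np.maximum(p + r, 1e-12)
    best = int(np.argmax(f1))
    return float(thresholds[best]), float(f1[best])


# Hand-made labels and predicted probabilities (hypothetical values)
y_true = np.array([0, 0, 1, 0, 1, 1, 1, 1])
y_score = np.array([0.2, 0.4, 0.45, 0.55, 0.6, 0.7, 0.8, 0.9])

threshold, f1 = best_f1_threshold(y_true, y_score)
print(f"Best threshold: {threshold:.2f}, F1: {f1:.4f}")
```

To avoid tuning on the test set, the threshold would normally be chosen on a validation split or via cross-validated probabilities and then applied once to `X_test_scaled`.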

In [None]:
# Model Summary and Conclusions
print("=== Model Summary and Conclusions ===")

# Final comparison
final_comparison = pd.DataFrame({
    'Model': ['Best Imbalanced', 'Best Balanced'],
    'Algorithm': [best_imbalanced_model, best_balanced_model],
    'Accuracy': [imbalanced_results[best_imbalanced_model]['accuracy'], 
                 balanced_results[best_balanced_model]['accuracy']],
    'Precision': [imbalanced_results[best_imbalanced_model]['precision'], 
                  balanced_results[best_balanced_model]['precision']],
    'Recall': [imbalanced_results[best_imbalanced_model]['recall'], 
               balanced_results[best_balanced_model]['recall']],
    'F1-Score': [imbalanced_results[best_imbalanced_model]['f1_score'], 
                  balanced_results[best_balanced_model]['f1_score']],
    'ROC AUC': [imbalanced_results[best_imbalanced_model]['roc_auc'], 
                 balanced_results[best_balanced_model]['roc_auc']]
})

print("\nFinal Model Comparison:")
print(final_comparison.round(4))

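# Report the algorithm name, not just the 'Best Imbalanced' / 'Best Balanced' row label
best_row = final_comparison.loc[final_comparison['F1-Score'].idxmax()]
best_final_model = f"{best_row['Algorithm']} ({best_row['Model']})"
print(f"\nBest Final Model: {best_final_model}")

# Change in weighted F1 from the best imbalanced model (row 0) to the best balanced model (row 1)
f1_change = final_comparison.loc[1, 'F1-Score'] - final_comparison.loc[0, 'F1-Score']

print(f"\nKey Insights:")
print(f"1. Original data imbalance: {(data['Loan_Status'].mean()*100):.2f}% approval rate")
print(f"2. Strongest drivers in the simulation logic: Credit_History, ApplicantIncome, Education")
print(f"3. SMOTE changed the best weighted F1-score by {f1_change:+.4f}")
print(f"4. Best algorithm: {best_final_model}")
print(f"5. Best weighted F1-score: {best_row['F1-Score']:.4f}")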

print(f"\nBusiness Applications:")
print(f"1. Automated loan approval systems")
print(f"2. Risk assessment and credit scoring")
print(f"3. Customer segmentation for loan products")
print(f"4. Fraud detection in loan applications")
print(f"5. Portfolio management and risk control")

print(f"\nModel Deployment Recommendations:")
print(f"1. Prefer the balanced model only if it improves recall on the minority class")
print(f"2. Regular model retraining with new loan data")
print(f"3. Feature engineering for better performance")
print(f"4. Consider business costs of false positives/negatives")
print(f"5. Implement model interpretability for regulatory compliance")
print(f"6. Monitor model drift and performance degradation")
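
Recommendation 4 can be made concrete by attaching a cost to each error type and comparing models on expected cost rather than accuracy. The sketch below uses hypothetical costs (`cost_fp`, `cost_fn`) and hand-made predictions; real values would have to come from the business:

```python
import numpy as np
from sklearn.metrics import confusion_matrix


def expected_cost(y_true, y_pred, cost_fp, cost_fn):
    """Average cost per application, given the cost of each error type."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return (fp * cost_fp + fn * cost_fn) / len(y_true)


# 1 = approved, 0 = rejected (hypothetical labels and predictions)
y_true = np.array([1, 1, 1, 1, 0, 0, 1, 0, 1, 1])
pred_a = np.array([1, 1, 1, 1, 1, 1, 1, 0, 1, 1])  # lenient: 2 false approvals, 0 false rejections
pred_b = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 1])  # cautious: 0 false approvals, 2 false rejections

# Assumption: approving a loan that should be rejected costs 5x a wrongly rejected one
cost_fp, cost_fn = 5.0, 1.0

cost_a = expected_cost(y_true, pred_a, cost_fp, cost_fn)
cost_b = expected_cost(y_true, pred_b, cost_fp, cost_fn)
print(f"Model A cost per application: {cost_a:.2f}")
print(f"Model B cost per application: {cost_b:.2f}")
```

Both sets of predictions have the same accuracy (8 of 10 correct), yet their expected costs differ, which is why the cost ratio should inform the choice between the imbalanced and balanced models.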