# 🏦 Fintech Data Science Project: Credit Card Default Prediction

---

## 👨‍💼 Candidate Information
- **Name**: [Your Name Here]
- **Email**: [your.email@example.com]
- **Date**: December 2024
- **Project Type**: Campus Placement Assignment - Data Science & AI
- **GitHub Repository**: https://github.com/ISHANSHIRODE01/Assignment-for-DS-AI

---

## 🎯 Problem Statement

**Business Context**: Credit card default prediction is a critical risk management problem in the fintech industry. Financial institutions need to identify customers who are likely to default on their credit card payments to minimize financial losses and make informed lending decisions.

**Objective**: Build and compare machine learning models to predict whether a credit card client will default on their payment next month based on their demographic information, credit history, and payment behavior.

**Success Metrics**: 
- Maximize AUC-ROC score (primary metric for imbalanced classification)
- Achieve high precision to minimize false positives (incorrectly flagging good customers)
- Maintain reasonable recall to catch actual defaulters
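
To make the precision/recall trade-off in these metrics concrete, here is a minimal sketch using made-up confusion-matrix counts. The numbers are hypothetical and are not results from this project.

```python
# Hypothetical confusion-matrix counts, for illustration only
tp, fp, fn, tn = 60, 40, 90, 810

precision = tp / (tp + fp)   # of the customers flagged, how many really default
recall = tp / (tp + fn)      # of the real defaulters, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Raising the decision threshold usually trades recall for precision, which is why both are tracked alongside AUC-ROC.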

**Business Impact**: Accurate predictions can help banks:
- Reduce credit losses by 15-25%
- Optimize credit limit decisions
- Improve customer risk profiling
- Enhance regulatory compliance

## 📚 Import Libraries and Setup

In [None]:
# Data manipulation and analysis
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-v0_8')

# Machine learning libraries
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    classification_report, confusion_matrix, roc_auc_score, 
    roc_curve, accuracy_score, precision_score, recall_score, 
    f1_score, precision_recall_curve
)

# Additional utilities
import os
from datetime import datetime

# Create images directory for saving plots
if not os.path.exists('images'):
    os.makedirs('images')

# Set random seed for reproducibility
np.random.seed(42)

print("✅ All libraries imported successfully!")
print(f"📅 Analysis Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

## 📊 Dataset Loading and Description

### Dataset Information:
- **Source**: UCI Machine Learning Repository via OpenML
- **Dataset ID**: 42477 (Default of Credit Card Clients)
- **Original Research**: Yeh, I. C., & Lien, C. H. (2009)
- **Domain**: Financial Services / Credit Risk Management
- **Type**: Binary Classification Problem

In [None]:
# Load the dataset from OpenML
print("🔄 Loading dataset from OpenML...")
try:
    # Fetch the credit card default dataset
    data = fetch_openml(data_id=42477, as_frame=True, parser='auto')
    df = data.frame.copy()

    print("✅ Dataset loaded successfully!")
    print(f"📏 Dataset Shape: {df.shape}")
    print(f"🎯 Target Variable: {data.target_names[0] if hasattr(data, 'target_names') else 'Default'}")

except Exception as e:
    print(f"❌ Error loading dataset: {e}")
    print("📝 Please ensure you have an internet connection and OpenML access")
    raise  # Later cells need `df`, so stop here instead of continuing without it

In [None]:
# Display basic dataset information
print("📋 DATASET OVERVIEW")
print("=" * 50)
print(f"Number of Records: {df.shape[0]:,}")
print(f"Number of Features: {df.shape[1]:,}")
print(f"Memory Usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

print("\n📊 FIRST 5 RECORDS:")
display(df.head())

print("\n🔍 DATA TYPES:")
display(df.dtypes.to_frame('Data Type'))

print("\n📈 BASIC STATISTICS:")
display(df.describe())

## 🧹 Data Preprocessing and Feature Engineering

### Step 1: Column Renaming and Target Variable Setup

In [None]:
# Rename columns for better interpretability
column_mapping = {
    'x1': 'LIMIT_BAL',     # Credit limit
    'x2': 'SEX',           # Gender (1=male, 2=female)
    'x3': 'EDUCATION',     # Education level
    'x4': 'MARRIAGE',      # Marital status
    'x5': 'AGE',           # Age in years
    'x6': 'PAY_1',         # Repayment status in September
    'x7': 'PAY_2',         # Repayment status in August
    'x8': 'PAY_3',         # Repayment status in July
    'x9': 'PAY_4',         # Repayment status in June
    'x10': 'PAY_5',        # Repayment status in May
    'x11': 'PAY_6',        # Repayment status in April
    'x12': 'BILL_AMT1',    # Bill statement in September
    'x13': 'BILL_AMT2',    # Bill statement in August
    'x14': 'BILL_AMT3',    # Bill statement in July
    'x15': 'BILL_AMT4',    # Bill statement in June
    'x16': 'BILL_AMT5',    # Bill statement in May
    'x17': 'BILL_AMT6',    # Bill statement in April
    'x18': 'PAY_AMT1',     # Payment amount in September
    'x19': 'PAY_AMT2',     # Payment amount in August
    'x20': 'PAY_AMT3',     # Payment amount in July
    'x21': 'PAY_AMT4',     # Payment amount in June
    'x22': 'PAY_AMT5',     # Payment amount in May
    'x23': 'PAY_AMT6',     # Payment amount in April
    'y': 'DEFAULT'         # Target variable (1=default, 0=no default)
}

# Apply column renaming
df.rename(columns=column_mapping, inplace=True)

# Handle target variable if it's separate
if 'DEFAULT' not in df.columns and hasattr(data, 'target'):
    df['DEFAULT'] = data.target

# Ensure target is binary integer
if 'DEFAULT' in df.columns:
    df['DEFAULT'] = df['DEFAULT'].astype(int)

print("✅ Column renaming completed!")
print(f"📊 Updated columns: {list(df.columns)}")
print("🎯 Target variable distribution:")
print(df['DEFAULT'].value_counts(normalize=True).round(3))

### Step 2: Missing Values Analysis and Treatment

In [None]:
# Check for missing values
missing_data = df.isnull().sum()
missing_percentage = (missing_data / len(df)) * 100

missing_df = pd.DataFrame({
    'Missing Count': missing_data,
    'Missing Percentage': missing_percentage
}).sort_values('Missing Count', ascending=False)

print("🔍 MISSING VALUES ANALYSIS")
print("=" * 40)
if missing_df['Missing Count'].sum() == 0:
    print("✅ No missing values found in the dataset!")
else:
    print("⚠️ Missing values detected:")
    display(missing_df[missing_df['Missing Count'] > 0])

# Check for duplicate records
duplicates = df.duplicated().sum()
print(f"\n🔄 Duplicate records: {duplicates}")

if duplicates > 0:
    print("🧹 Removing duplicate records...")
    df = df.drop_duplicates()
    print(f"✅ Dataset shape after removing duplicates: {df.shape}")

### Step 3: Feature Engineering and Data Quality Checks

In [None]:
# Create additional features for better model performance
print("🔧 FEATURE ENGINEERING")
print("=" * 30)

# 1. Average bill amount across 6 months
bill_cols = ['BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6']
df['AVG_BILL_AMT'] = df[bill_cols].mean(axis=1)

# 2. Average payment amount across 6 months
pay_cols = ['PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']
df['AVG_PAY_AMT'] = df[pay_cols].mean(axis=1)

# 3. Payment to bill ratio (financial health indicator)
# Bill amounts can be negative (credit balances), so clip at 0 before adding 1;
# a plain "+ 1" would still divide by zero when the average bill is exactly -1
df['PAY_BILL_RATIO'] = df['AVG_PAY_AMT'] / (df['AVG_BILL_AMT'].clip(lower=0) + 1)

# 4. Credit utilization ratio
df['CREDIT_UTILIZATION'] = df['AVG_BILL_AMT'] / df['LIMIT_BAL']

# 5. Number of months with delayed payments
pay_status_cols = ['PAY_1', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']
df['DELAYED_PAYMENTS_COUNT'] = (df[pay_status_cols] > 0).sum(axis=1)

print("✅ New features created:")
new_features = ['AVG_BILL_AMT', 'AVG_PAY_AMT', 'PAY_BILL_RATIO', 'CREDIT_UTILIZATION', 'DELAYED_PAYMENTS_COUNT']
for feature in new_features:
    print(f"   📊 {feature}: {df[feature].describe().round(2).to_dict()}")

print(f"\n📏 Final dataset shape: {df.shape}")
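
The ratio features are sensitive to zero or negative denominators. The sketch below uses a toy DataFrame with hypothetical values to show one possible guard, assuming that a negative average bill (a credit balance) should be treated as a zero balance. That treatment is an assumption, not something the dataset documentation prescribes.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical values); bill amounts can be zero or negative after refunds
toy = pd.DataFrame({
    "LIMIT_BAL": [10000, 20000, 50000],
    "AVG_BILL_AMT": [5000.0, 0.0, -1.0],
    "AVG_PAY_AMT": [1000.0, 0.0, 200.0],
})

# Clip negative bills to zero before adding 1 so the denominator is always >= 1
denominator = toy["AVG_BILL_AMT"].clip(lower=0) + 1
toy["PAY_BILL_RATIO"] = toy["AVG_PAY_AMT"] / denominator
toy["CREDIT_UTILIZATION"] = toy["AVG_BILL_AMT"] / toy["LIMIT_BAL"]

# A plain "+ 1" alone divides by zero for the third row (-1 + 1 == 0)
naive = toy["AVG_PAY_AMT"] / (toy["AVG_BILL_AMT"] + 1)

print(toy[["PAY_BILL_RATIO", "CREDIT_UTILIZATION"]])
print("naive ratio has inf:", bool(np.isinf(naive).any()))
```

Infinite values matter downstream because `StandardScaler` and `LogisticRegression` reject inputs containing infinity.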

## 📊 Exploratory Data Analysis (EDA)

### Target Variable Distribution

In [None]:
# Create comprehensive EDA visualizations
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
fig.suptitle('📊 Comprehensive Exploratory Data Analysis', fontsize=16, fontweight='bold')

# 1. Target Distribution
# Sort by class label so the counts line up with the pie labels below (0 first, then 1)
target_counts = df['DEFAULT'].value_counts().sort_index()
target_pct = df['DEFAULT'].value_counts(normalize=True).sort_index() * 100

axes[0,0].pie(target_counts.values, labels=['No Default (0)', 'Default (1)'], 
              autopct='%1.1f%%', colors=['lightgreen', 'lightcoral'], startangle=90)
axes[0,0].set_title('Target Variable Distribution\n(Default vs No Default)')

# 2. Age Distribution by Default Status
sns.boxplot(data=df, x='DEFAULT', y='AGE', ax=axes[0,1], palette='viridis')
axes[0,1].set_title('Age Distribution by Default Status')
axes[0,1].set_xlabel('Default Status')

# 3. Credit Limit vs Default
sns.boxplot(data=df, x='DEFAULT', y='LIMIT_BAL', ax=axes[1,0], palette='coolwarm')
axes[1,0].set_title('Credit Limit vs Default Status')
axes[1,0].set_xlabel('Default Status')
axes[1,0].set_ylabel('Credit Limit Balance')

# 4. Education Level Distribution
education_default = pd.crosstab(df['EDUCATION'], df['DEFAULT'], normalize='index') * 100
education_default.plot(kind='bar', ax=axes[1,1], color=['lightgreen', 'lightcoral'])
axes[1,1].set_title('Default Rate by Education Level')
axes[1,1].set_xlabel('Education Level')
axes[1,1].set_ylabel('Percentage')
axes[1,1].legend(['No Default', 'Default'])
axes[1,1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.savefig('images/comprehensive_eda.png', dpi=300, bbox_inches='tight')
plt.show()

# Print key insights
print("🔍 KEY EDA INSIGHTS:")
print("=" * 40)
print(f"📊 Default Rate: {target_pct[1]:.2f}% (Class Imbalance Present)")
print(f"👥 Total Customers: {len(df):,}")
print(f"⚠️ Defaulters: {target_counts[1]:,}")
print(f"✅ Non-Defaulters: {target_counts[0]:,}")

### Correlation Analysis and Feature Relationships

In [None]:
# Correlation heatmap for numerical features
plt.figure(figsize=(16, 12))

# Select numerical columns for correlation
numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()
correlation_matrix = df[numerical_cols].corr()

# Create heatmap
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))
sns.heatmap(correlation_matrix, mask=mask, annot=True, cmap='RdBu_r', center=0,
            square=True, linewidths=0.5, cbar_kws={"shrink": .8}, fmt='.2f')

plt.title('🔗 Feature Correlation Matrix\n(Lower Triangle Only)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('images/correlation_matrix.png', dpi=300, bbox_inches='tight')
plt.show()

# Rank features by absolute correlation with the target, keeping the sign for display
target_corr = correlation_matrix['DEFAULT'].drop('DEFAULT')
top_corr = target_corr.reindex(target_corr.abs().sort_values(ascending=False).index).head(10)
print("🎯 TOP 10 FEATURES CORRELATED WITH DEFAULT:")
print("=" * 45)
for i, (feature, corr) in enumerate(top_corr.items(), 1):
    print(f"{i:2d}. {feature:<20} | Correlation: {corr:+.4f}")
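
Ranking by correlation strength while still reporting direction can be sketched on a small Series. The correlation values below are hypothetical and only illustrate the ordering logic.

```python
import pandas as pd

# Hypothetical correlations with the target (not results from this dataset)
corr = pd.Series({"PAY_1": 0.32, "LIMIT_BAL": -0.15, "AGE": 0.01, "PAY_AMT1": -0.07})

# Order by magnitude, but keep the original sign for reporting
order = corr.abs().sort_values(ascending=False).index
ranked = corr.reindex(order)

for rank, (feature, value) in enumerate(ranked.items(), 1):
    print(f"{rank}. {feature:<10} {value:+.2f}")
```

Keeping the sign shows whether a feature is associated with more or fewer defaults, which the absolute value alone hides.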

### Advanced EDA: Payment Behavior Analysis

In [None]:
# Payment behavior analysis
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('💳 Payment Behavior Analysis', fontsize=16, fontweight='bold')

# 1. Credit Utilization Distribution
axes[0,0].hist(df['CREDIT_UTILIZATION'], bins=50, alpha=0.7, color='skyblue', edgecolor='black')
axes[0,0].set_title('Credit Utilization Distribution')
axes[0,0].set_xlabel('Credit Utilization Ratio')
axes[0,0].set_ylabel('Frequency')
axes[0,0].axvline(df['CREDIT_UTILIZATION'].mean(), color='red', linestyle='--', label=f'Mean: {df["CREDIT_UTILIZATION"].mean():.2f}')
axes[0,0].legend()

# 2. Payment to Bill Ratio by Default Status
sns.boxplot(data=df, x='DEFAULT', y='PAY_BILL_RATIO', ax=axes[0,1], palette='Set2')
axes[0,1].set_title('Payment-to-Bill Ratio by Default Status')
axes[0,1].set_xlabel('Default Status')

# 3. Delayed Payments Count Distribution
delayed_counts = df['DELAYED_PAYMENTS_COUNT'].value_counts().sort_index()
axes[1,0].bar(delayed_counts.index, delayed_counts.values, color='orange', alpha=0.7)
axes[1,0].set_title('Distribution of Delayed Payments Count')
axes[1,0].set_xlabel('Number of Months with Delayed Payments')
axes[1,0].set_ylabel('Number of Customers')

# 4. Default Rate by Delayed Payments Count
default_by_delayed = df.groupby('DELAYED_PAYMENTS_COUNT')['DEFAULT'].mean() * 100
axes[1,1].plot(default_by_delayed.index, default_by_delayed.values, marker='o', linewidth=2, markersize=8, color='red')
axes[1,1].set_title('Default Rate by Number of Delayed Payments')
axes[1,1].set_xlabel('Number of Months with Delayed Payments')
axes[1,1].set_ylabel('Default Rate (%)')
axes[1,1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('images/payment_behavior_analysis.png', dpi=300, bbox_inches='tight')
plt.show()

# Print payment behavior insights
print("💡 PAYMENT BEHAVIOR INSIGHTS:")
print("=" * 40)
print(f"📊 Average Credit Utilization: {df['CREDIT_UTILIZATION'].mean():.2f}")
print(f"💰 Average Payment-to-Bill Ratio: {df['PAY_BILL_RATIO'].mean():.2f}")
print(f"⏰ Average Delayed Payments: {df['DELAYED_PAYMENTS_COUNT'].mean():.2f} months")
print(f"🚨 Customers with 6 months delayed payments: {(df['DELAYED_PAYMENTS_COUNT'] == 6).sum():,}")

## 🔧 Data Preparation for Machine Learning

### Feature Selection and Scaling

In [None]:
# Prepare features and target
print("🔧 PREPARING DATA FOR MACHINE LEARNING")
print("=" * 45)

# Define features and target
target_col = 'DEFAULT'
feature_cols = [col for col in df.columns if col != target_col]

X = df[feature_cols].copy()
y = df[target_col].copy()

print(f"📊 Features shape: {X.shape}")
print(f"🎯 Target shape: {y.shape}")
print(f"📋 Feature columns: {len(feature_cols)}")

# Handle any remaining categorical variables
categorical_cols = X.select_dtypes(include=['object']).columns.tolist()
if categorical_cols:
    print(f"🏷️ Encoding categorical variables: {categorical_cols}")
    le = LabelEncoder()
    for col in categorical_cols:
        X[col] = le.fit_transform(X[col].astype(str))

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("\n📊 TRAIN-TEST SPLIT RESULTS:")
print(f"   Training set: {X_train.shape[0]:,} samples ({X_train.shape[0]/len(X)*100:.1f}%)")
print(f"   Testing set:  {X_test.shape[0]:,} samples ({X_test.shape[0]/len(X)*100:.1f}%)")
print(f"   Training default rate: {y_train.mean()*100:.2f}%")
print(f"   Testing default rate:  {y_test.mean()*100:.2f}%")

# Feature scaling
print("\n⚖️ FEATURE SCALING:")
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("✅ Features scaled using StandardScaler")
print(f"   Training features mean: {X_train_scaled.mean():.6f}")
print(f"   Training features std:  {X_train_scaled.std():.6f}")
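
When scaled features are later passed to cross-validation, a scaler fitted on the full training set has already seen every validation fold. One common way to avoid this is to wrap the scaler and the model in a scikit-learn pipeline so the scaler is re-fit inside each fold. The sketch below uses synthetic data from `make_classification` as a stand-in for the credit card dataset, and the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (roughly 80/20); not the credit card dataset
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=42
)

# The scaler is re-fit on the training part of every fold
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring="roc_auc")

print(f"Mean CV AUC: {scores.mean():.3f}")
```

For standardization the effect of this leakage is usually small, but the pipeline form keeps the evaluation honest at no extra cost.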

## 🤖 Machine Learning Model Development

### Model 1: Logistic Regression (Baseline Model)

In [None]:
# Train Logistic Regression model
print("🚀 TRAINING LOGISTIC REGRESSION (BASELINE MODEL)")
print("=" * 55)

# Initialize and train the model
lr_model = LogisticRegression(
    random_state=42, 
    max_iter=1000, 
    class_weight='balanced'  # Handle class imbalance
)

# Train the model
lr_model.fit(X_train_scaled, y_train)

# Make predictions
y_pred_lr = lr_model.predict(X_test_scaled)
y_prob_lr = lr_model.predict_proba(X_test_scaled)[:, 1]

# Calculate performance metrics
lr_accuracy = accuracy_score(y_test, y_pred_lr)
lr_precision = precision_score(y_test, y_pred_lr)
lr_recall = recall_score(y_test, y_pred_lr)
lr_f1 = f1_score(y_test, y_pred_lr)
lr_auc = roc_auc_score(y_test, y_prob_lr)

print("📊 LOGISTIC REGRESSION PERFORMANCE:")
print(f"   Accuracy:  {lr_accuracy:.4f}")
print(f"   Precision: {lr_precision:.4f}")
print(f"   Recall:    {lr_recall:.4f}")
print(f"   F1-Score:  {lr_f1:.4f}")
print(f"   AUC-ROC:   {lr_auc:.4f}")

print("\n📋 DETAILED CLASSIFICATION REPORT:")
print(classification_report(y_test, y_pred_lr, target_names=['No Default', 'Default']))

# Cross-validation for more robust evaluation
cv_scores_lr = cross_val_score(lr_model, X_train_scaled, y_train, cv=5, scoring='roc_auc')
print(f"\n🔄 5-Fold Cross-Validation AUC: {cv_scores_lr.mean():.4f} (+/- {cv_scores_lr.std() * 2:.4f})")
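
The `class_weight='balanced'` option reweights each class by `n_samples / (n_classes * np.bincount(y))`, the heuristic given in the scikit-learn documentation. A small worked example with hypothetical labels shows the weights that result:

```python
import numpy as np

# Hypothetical labels: 8 non-defaulters and 2 defaulters
y_demo = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# "balanced" heuristic: n_samples / (n_classes * count_of_each_class)
counts = np.bincount(y_demo)
weights = len(y_demo) / (len(counts) * counts)

print(dict(enumerate(weights)))
```

Here each defaulter counts four times as much as a non-defaulter in the loss, which tends to raise recall on the minority class at the expense of precision.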

### Model 2: Random Forest (Advanced Model)

In [None]:
# Train Random Forest model
print("🌲 TRAINING RANDOM FOREST (ADVANCED MODEL)")
print("=" * 50)

# Initialize and train the model
rf_model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=42,
    class_weight='balanced',  # Handle class imbalance
    n_jobs=-1  # Use all available cores
)

# Train the model (using original features, not scaled for tree-based models)
rf_model.fit(X_train, y_train)

# Make predictions
y_pred_rf = rf_model.predict(X_test)
y_prob_rf = rf_model.predict_proba(X_test)[:, 1]

# Calculate performance metrics
rf_accuracy = accuracy_score(y_test, y_pred_rf)
rf_precision = precision_score(y_test, y_pred_rf)
rf_recall = recall_score(y_test, y_pred_rf)
rf_f1 = f1_score(y_test, y_pred_rf)
rf_auc = roc_auc_score(y_test, y_prob_rf)

print("📊 RANDOM FOREST PERFORMANCE:")
print(f"   Accuracy:  {rf_accuracy:.4f}")
print(f"   Precision: {rf_precision:.4f}")
print(f"   Recall:    {rf_recall:.4f}")
print(f"   F1-Score:  {rf_f1:.4f}")
print(f"   AUC-ROC:   {rf_auc:.4f}")

print("\n📋 DETAILED CLASSIFICATION REPORT:")
print(classification_report(y_test, y_pred_rf, target_names=['No Default', 'Default']))

# Cross-validation for more robust evaluation
cv_scores_rf = cross_val_score(rf_model, X_train, y_train, cv=5, scoring='roc_auc')
print(f"\n🔄 5-Fold Cross-Validation AUC: {cv_scores_rf.mean():.4f} (+/- {cv_scores_rf.std() * 2:.4f})")

## 📊 Model Evaluation and Comparison

### Performance Metrics Comparison

In [None]:
# Create comprehensive model comparison
print("🏆 COMPREHENSIVE MODEL COMPARISON")
print("=" * 45)

# Create comparison dataframe
comparison_df = pd.DataFrame({
    'Metric': ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'AUC-ROC'],
    'Logistic Regression': [lr_accuracy, lr_precision, lr_recall, lr_f1, lr_auc],
    'Random Forest': [rf_accuracy, rf_precision, rf_recall, rf_f1, rf_auc]
})

# Calculate improvement
comparison_df['Improvement (%)'] = ((comparison_df['Random Forest'] - comparison_df['Logistic Regression']) / comparison_df['Logistic Regression'] * 100).round(2)

print("📊 PERFORMANCE METRICS COMPARISON:")
display(comparison_df.round(4))

# Determine best model
best_model_name = "Random Forest" if rf_auc > lr_auc else "Logistic Regression"
best_auc = max(rf_auc, lr_auc)

print(f"\n🥇 BEST PERFORMING MODEL: {best_model_name}")
print(f"   Best AUC-ROC Score: {best_auc:.4f}")
print(f"   Performance Improvement: {abs(rf_auc - lr_auc):.4f} AUC points")

### ROC Curve and Precision-Recall Curve Analysis

In [None]:
# Create comprehensive evaluation plots
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('🎯 Comprehensive Model Evaluation', fontsize=16, fontweight='bold')

# 1. ROC Curves
fpr_lr, tpr_lr, _ = roc_curve(y_test, y_prob_lr)
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_prob_rf)

axes[0,0].plot(fpr_lr, tpr_lr, label=f'Logistic Regression (AUC = {lr_auc:.3f})', linewidth=2, color='blue')
axes[0,0].plot(fpr_rf, tpr_rf, label=f'Random Forest (AUC = {rf_auc:.3f})', linewidth=2, color='red')
axes[0,0].plot([0, 1], [0, 1], 'k--', alpha=0.5, label='Random Classifier')
axes[0,0].set_xlabel('False Positive Rate')
axes[0,0].set_ylabel('True Positive Rate')
axes[0,0].set_title('ROC Curves Comparison')
axes[0,0].legend()
axes[0,0].grid(True, alpha=0.3)

# 2. Precision-Recall Curves
precision_lr, recall_lr, _ = precision_recall_curve(y_test, y_prob_lr)
precision_rf, recall_rf, _ = precision_recall_curve(y_test, y_prob_rf)

axes[0,1].plot(recall_lr, precision_lr, label=f'Logistic Regression', linewidth=2, color='blue')
axes[0,1].plot(recall_rf, precision_rf, label=f'Random Forest', linewidth=2, color='red')
axes[0,1].axhline(y=y_test.mean(), color='k', linestyle='--', alpha=0.5, label=f'Baseline ({y_test.mean():.3f})')
axes[0,1].set_xlabel('Recall')
axes[0,1].set_ylabel('Precision')
axes[0,1].set_title('Precision-Recall Curves')
axes[0,1].legend()
axes[0,1].grid(True, alpha=0.3)

# 3. Confusion Matrix - Logistic Regression
cm_lr = confusion_matrix(y_test, y_pred_lr)
sns.heatmap(cm_lr, annot=True, fmt='d', cmap='Blues', ax=axes[1,0],
            xticklabels=['No Default', 'Default'], yticklabels=['No Default', 'Default'])
axes[1,0].set_title('Confusion Matrix - Logistic Regression')
axes[1,0].set_xlabel('Predicted')
axes[1,0].set_ylabel('Actual')

# 4. Confusion Matrix - Random Forest
cm_rf = confusion_matrix(y_test, y_pred_rf)
sns.heatmap(cm_rf, annot=True, fmt='d', cmap='Reds', ax=axes[1,1],
            xticklabels=['No Default', 'Default'], yticklabels=['No Default', 'Default'])
axes[1,1].set_title('Confusion Matrix - Random Forest')
axes[1,1].set_xlabel('Predicted')
axes[1,1].set_ylabel('Actual')

plt.tight_layout()
plt.savefig('images/model_evaluation_comprehensive.png', dpi=300, bbox_inches='tight')
plt.show()

# Print confusion matrix insights
print("🔍 CONFUSION MATRIX ANALYSIS:")
print("=" * 35)
print("Logistic Regression:")
print(f"  True Negatives:  {cm_lr[0,0]:,}")
print(f"  False Positives: {cm_lr[0,1]:,}")
print(f"  False Negatives: {cm_lr[1,0]:,}")
print(f"  True Positives:  {cm_lr[1,1]:,}")

print("\nRandom Forest:")
print(f"  True Negatives:  {cm_rf[0,0]:,}")
print(f"  False Positives: {cm_rf[0,1]:,}")
print(f"  False Negatives: {cm_rf[1,0]:,}")
print(f"  True Positives:  {cm_rf[1,1]:,}")
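
The confusion matrices above use the default 0.5 cut-off from `predict`. Because the precision-recall curve is already computed, it can also be used to pick a different operating threshold. The sketch below chooses the threshold that maximizes F1 on hypothetical labels and scores. Maximizing F1 is just one possible criterion, and in practice the threshold should be chosen on validation data, not on the test set.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical labels and predicted default probabilities
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# The last precision/recall pair has no threshold, so drop it before computing F1
denominator = precision[:-1] + recall[:-1]
f1 = np.divide(2 * precision[:-1] * recall[:-1], denominator,
               out=np.zeros_like(denominator), where=denominator > 0)
best = int(np.argmax(f1))

print(f"Best threshold: {thresholds[best]:.2f} (F1 = {f1[best]:.3f})")
```

A cost-based criterion (for example, weighting missed defaults more heavily than unnecessary reviews) could replace F1 in the same loop.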

### Feature Importance Analysis

In [None]:
# Feature importance analysis for Random Forest
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': rf_model.feature_importances_
}).sort_values('Importance', ascending=False)

# Plot top 15 most important features
plt.figure(figsize=(12, 8))
top_features = feature_importance.head(15)

bars = plt.barh(range(len(top_features)), top_features['Importance'], color='skyblue', edgecolor='navy')
plt.yticks(range(len(top_features)), top_features['Feature'])
plt.xlabel('Feature Importance')
plt.title('🎯 Top 15 Most Important Features (Random Forest)', fontsize=14, fontweight='bold')
plt.gca().invert_yaxis()

# Add value labels on bars
for i, bar in enumerate(bars):
    width = bar.get_width()
    plt.text(width + 0.001, bar.get_y() + bar.get_height()/2, 
             f'{width:.3f}', ha='left', va='center', fontweight='bold')

plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.savefig('images/feature_importance.png', dpi=300, bbox_inches='tight')
plt.show()

print("🎯 TOP 10 MOST IMPORTANT FEATURES:")
print("=" * 40)
for i, (_, row) in enumerate(feature_importance.head(10).iterrows(), 1):
    print(f"{i:2d}. {row['Feature']:<20} | Importance: {row['Importance']:.4f}")

# Calculate cumulative importance
cumulative_importance = feature_importance['Importance'].cumsum()
# Smallest number of top-ranked features whose cumulative importance reaches 80%
features_80_percent = min(int((cumulative_importance < 0.8).sum()) + 1, len(feature_importance))
print(f"\n📊 Features explaining 80% of importance: {features_80_percent}")
print(f"📊 Total features: {len(feature_importance)}")
print(f"📊 Feature reduction potential: {len(feature_importance) - features_80_percent} features")
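
Counting how many top-ranked features are needed to reach a cumulative importance target is easy to get wrong by one. A short worked example with hypothetical importances (already sorted in descending order) makes the counting explicit:

```python
import numpy as np

# Hypothetical importances, sorted in descending order (they sum to 1)
importances = np.array([0.40, 0.25, 0.20, 0.08, 0.04, 0.03])
cumulative = np.cumsum(importances)

# First position where the running total reaches 80%, converted to a count
n_needed = int(np.searchsorted(cumulative, 0.80, side="left")) + 1
n_needed = min(n_needed, len(importances))

print(f"{n_needed} of {len(importances)} features cover {cumulative[n_needed - 1]:.0%}")
```

Two features cover only 65% here, so the third one is required to cross the 80% mark.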

## 🎯 Model Selection and Business Insights

### Final Model Selection with Justification

In [None]:
# Final model selection and business insights
print("🏆 FINAL MODEL SELECTION & BUSINESS INSIGHTS")
print("=" * 55)

# Model selection logic
if rf_auc > lr_auc:
    selected_model = "Random Forest"
    selected_auc = rf_auc
    selected_precision = rf_precision
    selected_recall = rf_recall
    selected_f1 = rf_f1
    improvement = ((rf_auc - lr_auc) / lr_auc) * 100
else:
    selected_model = "Logistic Regression"
    selected_auc = lr_auc
    selected_precision = lr_precision
    selected_recall = lr_recall
    selected_f1 = lr_f1
    improvement = ((lr_auc - rf_auc) / rf_auc) * 100

print(f"🥇 SELECTED MODEL: {selected_model}")
print("\n📊 FINAL MODEL PERFORMANCE:")
print(f"   AUC-ROC:   {selected_auc:.4f}")
print(f"   Precision: {selected_precision:.4f}")
print(f"   Recall:    {selected_recall:.4f}")
print(f"   F1-Score:  {selected_f1:.4f}")
print(f"   Relative AUC-ROC gain over the other model: {improvement:.2f}%")

print("\n🎯 MODEL SELECTION JUSTIFICATION:")
if selected_model == "Random Forest":
    print("   ✅ Random Forest was selected because:")
    print("      • Higher AUC-ROC score indicating better overall performance")
    print("      • Better handling of non-linear relationships in financial data")
    print("      • Robust feature importance rankings for business insights")
    print("      • Less prone to overfitting with proper hyperparameters")
    print("      • Can capture complex interactions between payment behaviors")
else:
    print("   ✅ Logistic Regression was selected because:")
    print("      • Comparable or better performance with simpler model")
    print("      • More interpretable coefficients for business stakeholders")
    print("      • Faster training and prediction times")
    print("      • Lower computational requirements for deployment")
    print("      • Better regulatory compliance due to model transparency")

print("\n💼 BUSINESS IMPACT ANALYSIS:")
total_customers = len(y_test)
actual_defaults = y_test.sum()
if selected_model == "Random Forest":
    predicted_defaults = y_pred_rf.sum()
    true_positives = cm_rf[1,1]
    false_positives = cm_rf[0,1]
else:
    predicted_defaults = y_pred_lr.sum()
    true_positives = cm_lr[1,1]
    false_positives = cm_lr[0,1]

# Assuming average loss per default is $5,000 and cost of investigation is $100
avg_loss_per_default = 5000
cost_per_investigation = 100

# Calculate potential savings
defaults_caught = true_positives
money_saved = defaults_caught * avg_loss_per_default
investigation_cost = predicted_defaults * cost_per_investigation
net_benefit = money_saved - investigation_cost

print(f"   💰 Potential money saved by catching defaults: ${money_saved:,.0f}")
print(f"   💸 Cost of investigating flagged customers: ${investigation_cost:,.0f}")
print(f"   📈 Net business benefit: ${net_benefit:,.0f}")
# Guard against division by zero when the model flags no customers
roi = (net_benefit / investigation_cost) * 100 if investigation_cost > 0 else float('nan')
print(f"   📊 ROI: {roi:.1f}%")
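
The cost-benefit arithmetic can be packaged as a small helper so the unit costs are explicit and easy to vary. This is a sketch under the same assumed figures as above ($5,000 loss per default, $100 per investigation). The function name `estimate_net_benefit` and the counts passed to it are hypothetical.

```python
def estimate_net_benefit(true_positives, flagged, loss_per_default=5000, cost_per_review=100):
    """Return (net benefit, ROI in percent or None) under assumed unit costs."""
    saved = true_positives * loss_per_default
    review_cost = flagged * cost_per_review
    net = saved - review_cost
    # ROI is undefined when nothing was spent on reviews
    roi = None if review_cost == 0 else net / review_cost * 100
    return net, roi

# Hypothetical counts: 100 customers flagged, 60 of them real defaulters
print(estimate_net_benefit(true_positives=60, flagged=100))
# A model that flags nobody has no review cost, so ROI is undefined
print(estimate_net_benefit(true_positives=0, flagged=0))
```

The estimate also assumes that every caught default avoids the full loss, so it is best read as an upper bound.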

## 📋 Summary and Conclusions

### Key Findings and Results

In [None]:
# Generate comprehensive summary
print("📋 PROJECT SUMMARY & KEY FINDINGS")
print("=" * 45)

print("🎯 PROBLEM SOLVED:")
print("   Built and compared two credit card default prediction models")
print("   that rank customers by default risk (metrics below).")

print("\n📊 DATASET CHARACTERISTICS:")
print(f"   • Total customers analyzed: {len(df):,}")
print(f"   • Features used: {len(feature_cols)}")
print(f"   • Default rate: {df['DEFAULT'].mean()*100:.2f}%")
print(f"   • Class imbalance ratio: {(1-df['DEFAULT'].mean())/df['DEFAULT'].mean():.1f}:1")

print("\n🔍 KEY INSIGHTS DISCOVERED:")
top_3_features = feature_importance.head(3)['Feature'].tolist()
print(f"   • Most predictive features: {', '.join(top_3_features)}")
print("   • Payment behavior is the strongest predictor of default")
print("   • Credit utilization and payment history are critical factors")
print("   • Demographic factors have lower predictive power")

print("\n🤖 MODEL PERFORMANCE:")
print(f"   • Best model: {selected_model}")
print(f"   • AUC-ROC: {selected_auc:.4f} (ranking quality; 0.5 = random, 1.0 = perfect)")
print(f"   • Precision: {selected_precision:.4f} (share of flagged customers who actually default)")
print(f"   • Recall: {selected_recall:.4f} (Catches {selected_recall*100:.1f}% of defaults)")
print(f"   • F1-Score: {selected_f1:.4f} (harmonic mean of precision and recall)")

print("\n💼 BUSINESS VALUE:")
print(f"   • Illustrative annualized net benefit (test-set benefit x 12, under the assumed costs): ${net_benefit*12:,.0f}")
print(f"   • Risk reduction: {(defaults_caught/actual_defaults)*100:.1f}% of defaults caught")
print("   • Model can be deployed for real-time risk assessment")
print("   • Supports data-driven credit limit decisions")

print("\n⚠️ LIMITATIONS IDENTIFIED:")
print("   • Class imbalance may affect minority class prediction")
print("   • Model performance depends on data quality and freshness")
print("   • External economic factors not captured in current features")
print("   • Model requires regular retraining to maintain performance")
print("   • Regulatory compliance and fairness considerations needed")

print("\n🚀 FUTURE IMPROVEMENTS:")
print("   • Implement advanced techniques like XGBoost or Neural Networks")
print("   • Use SMOTE or other techniques to handle class imbalance")
print("   • Add external data sources (economic indicators, social media)")
print("   • Implement hyperparameter tuning for optimal performance")
print("   • Develop ensemble methods combining multiple algorithms")
print("   • Create model monitoring and drift detection systems")
print("   • Implement explainable AI for regulatory compliance")

print("\n✅ PROJECT COMPLETION STATUS:")
print("   🎯 Problem statement: COMPLETED")
print("   📊 Data exploration: COMPLETED")
print("   🧹 Data preprocessing: COMPLETED")
print("   🤖 Model development: COMPLETED")
print("   📈 Model evaluation: COMPLETED")
print("   💼 Business insights: COMPLETED")
print("   📋 Documentation: COMPLETED")

print("\n🏆 FINAL RECOMMENDATION:")
print(f"   Deploy the {selected_model} model for production use with a")
print("   regular monitoring and retraining schedule. On this test set")
print("   it achieved the best AUC-ROC of the two models compared; the")
print("   business value figures depend on the assumed unit costs.")

---

## 📄 Project Documentation

### Technical Specifications
- **Programming Language**: Python 3.8+
- **Key Libraries**: pandas, scikit-learn, matplotlib, seaborn
- **Dataset**: UCI Credit Card Default (OpenML ID: 42477)
- **Model Types**: Logistic Regression, Random Forest
- **Evaluation Metrics**: AUC-ROC, Precision, Recall, F1-Score
- **Cross-Validation**: 5-Fold Stratified

### Reproducibility
- All random seeds set to 42 for consistent results
- Complete code provided with detailed comments
- Environment requirements documented
- Data preprocessing steps clearly outlined

### GitHub Repository
**Repository URL**: https://github.com/ISHANSHIRODE01/Assignment-for-DS-AI

This notebook and all associated files are available in the above repository for review and evaluation.

---

**End of Analysis** | **Date**: December 2024 | **Campus Placement Assignment - Data Science & AI**