# Non-Performing Loan Prediction Model
## Comparative Analysis of Models with Oversampling Techniques

This notebook predicts non-performing loans (kredit macet) by comparing several machine learning algorithms, each trained with and without oversampling.

### Target:
- **KETERANGAN**: 5 labels (Lancar, Dalam Perhatian Khusus, Kurang Lancar, Diragukan, Macet), i.e. Current, Special Mention, Substandard, Doubtful, Loss
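
In the six sample rows loaded later in this notebook, each KOLEKTABILITAS code corresponds to exactly one KETERANGAN label. The sketch below checks that correspondence on those rows only; whether it holds for the full dataset is an assumption that should be verified. If it does, any model given KOLEKTABILITAS as an input would be reading the target directly.

```python
# Code-to-label pairs as they appear in the notebook's six sample rows
KOLEKTABILITAS_LABELS = {
    1: "Lancar",
    2: "Dalam Perhatian Khusus",
    3: "Kurang Lancar",
    4: "Diragukan",
    5: "Macet",
}

# KOLEKTABILITAS and KETERANGAN columns copied from the sample data
sample_codes = [1, 5, 1, 4, 3, 2]
sample_labels = ["Lancar", "Macet", "Lancar", "Diragukan",
                 "Kurang Lancar", "Dalam Perhatian Khusus"]

# If the label can be rebuilt from the code alone, the code leaks the target
derived_labels = [KOLEKTABILITAS_LABELS[code] for code in sample_codes]
label_is_function_of_code = derived_labels == sample_labels
print(label_is_function_of_code)  # True
```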

### Methodology:
1. **Data Preprocessing**: Data cleaning and feature engineering
2. **Oversampling**: SMOTE and ADASYN to address class imbalance
3. **Model Comparison**: Random Forest, XGBoost, LightGBM, Gradient Boosting
4. **Hyperparameter Tuning**: Grid search on the best-performing model

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Machine Learning Libraries
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, StratifiedKFold
from sklearn.preprocessing import LabelEncoder, StandardScaler, OrdinalEncoder
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score, precision_score, recall_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Imbalanced Learning Libraries
from imblearn.over_sampling import SMOTE, ADASYN
from imblearn.metrics import classification_report_imbalanced

# Set style for plots
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("Libraries imported successfully!")
print("pandas version:", pd.__version__)

In [None]:
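# Load Dataset
# The six rows below only illustrate the expected schema. They are too few to run
# the modeling cells: the stratified split needs at least 2 rows per class, and
# SMOTE/ADASYN with default settings need more than 5 rows per minority class.
# Load the full dataset (CSV or Excel) before running the rest of the notebook.

# Create sample dataset based on provided information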
data = {
    'NO': [1, 2, 3, 4, 5, 6],
    'PEKERJAAN': [37, 9, 9, 37, 37, 37],
    'KETERANGAN_PEKERJAAN': [
        'Pegawai pemerintahan/lembaga negara',
        'Pengajar (Guru,Dosen)',
        'Pengajar (Guru,Dosen)',
        'Pegawai pemerintahan/lembaga negara',
        'Pegawai pemerintahan/lembaga negara',
        'Pegawai pemerintahan/lembaga negara'
    ],
    'TANGGAL_LAHIR': ['1978-06-06', '1975-04-30', '1986-03-15', '1971-09-12', '1976-09-23', '1981-08-15'],
    'STATUS_PERNIKAHAN': ['K', 'B', 'K', 'K', 'K', 'K'],
    'KETERANGAN_STATUS_PERNIKAHAN': ['Kawin', 'Belum Kawin', 'Kawin', 'Kawin', 'Kawin', 'Kawin'],
    'KELURAHAN': ['GROGOL UTARA', 'JURANGMANGU BARAT', 'PULAU PANGGANG', 'JOGLO', 'KELAPA GADING TIMUR', 'TIRTAJAYA'],
    'KECAMATAN': ['KEBAYORAN LAMA', 'PONDOK AREN', 'KEPULAUAN SERIBU UTARA', 'KEMBANGAN', 'KELAPA GADING', 'SUKMAJAYA'],
    'KOTA': ['JAKARTA SELATAN', 'TANGERANG SELATAN', 'KEPULAUAN SERIBU', 'JAKARTA BARAT', 'JAKARTA UTARA', 'KOTA DEPOK'],
    'PROVINSI': ['DKI JAKARTA', 'BANTEN', 'DKI JAKARTA', 'DKI JAKARTA', 'DKI JAKARTA', 'JAWA BARAT'],
    'PRODUK': ['Konsumer', 'Konsumer', 'Konsumer', 'Konsumer', 'Konsumer', 'Konsumer'],
    'SUB_PRODUK': ['KMG', 'KMG', 'KMG', 'KMG', 'KMG', 'KPR'],
    'KETERANGAN_SUB_PRODUK': ['Kredit Multi Guna', 'Kredit Multi Guna', 'Kredit Multi Guna', 'Kredit Multi Guna', 'Kredit Multi Guna', 'Kredit Pemilikan Rumah'],
    'TANGGAL_INPUT': ['2022-10-03', '2024-09-27', '2024-06-03', '2020-07-07', '2024-04-24', '2015-02-24'],
    'PLAFOND': [141200000.00, 418000000.00, 265500000.00, 305000000.00, 91000000.00, 177100000.00],
    'JK_WAKTUBULAN': [60, 126, 192, 109, 60, 161],
    'STATUS_PRESCREENING': ['Low', 'High', 'Low', 'Low', 'High', 'Medium'],
    'KOLEKTABILITAS': [1, 5, 1, 4, 3, 2],
    'KETERANGAN': ['Lancar', 'Macet', 'Lancar', 'Diragukan', 'Kurang Lancar', 'Dalam Perhatian Khusus']
}

df = pd.DataFrame(data)

# If you have a CSV file, uncomment and use this instead:
# df = pd.read_csv('path_to_your_file.csv')

print("Dataset shape:", df.shape)
print("\nFirst 5 rows:")
df.head()

## 1. Data Exploration and Preprocessing

In [None]:
# Basic Information about Dataset
print("Dataset Info:")
print("="*50)
print(f"Shape: {df.shape}")
print(f"Columns: {len(df.columns)}")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024:.2f} KB")

print("\nColumn Types:")
print("="*50)
print(df.dtypes)

print("\nMissing Values:")
print("="*50)
missing_values = df.isnull().sum()
print(missing_values[missing_values > 0])

print("\nTarget Variable Distribution:")
print("="*50)
print(df['KETERANGAN'].value_counts())
print("\nTarget Proportions:")
print(df['KETERANGAN'].value_counts(normalize=True).round(4))

In [None]:
# Feature Engineering and Data Preprocessing
def create_features(df):
    """
    Create new features and preprocess existing ones
    """
    df_processed = df.copy()
    
    # Convert date columns to datetime
    df_processed['TANGGAL_LAHIR'] = pd.to_datetime(df_processed['TANGGAL_LAHIR'])
    df_processed['TANGGAL_INPUT'] = pd.to_datetime(df_processed['TANGGAL_INPUT'])
    
    # Calculate age at time of application
    df_processed['UMUR'] = (df_processed['TANGGAL_INPUT'] - df_processed['TANGGAL_LAHIR']).dt.days / 365.25
    
    # Create age groups
    df_processed['KELOMPOK_UMUR'] = pd.cut(df_processed['UMUR'], 
                                          bins=[0, 30, 40, 50, 60, 100], 
                                          labels=['<30', '30-40', '40-50', '50-60', '60+'])
    
    # Create plafond groups (loan amount categories)
    df_processed['KATEGORI_PLAFOND'] = pd.cut(df_processed['PLAFOND'], 
                                             bins=[0, 100000000, 200000000, 300000000, float('inf')], 
                                             labels=['Low', 'Medium', 'High', 'Very High'])
    
    # Create loan term categories
    df_processed['KATEGORI_TENOR'] = pd.cut(df_processed['JK_WAKTUBULAN'], 
                                           bins=[0, 60, 120, 180, float('inf')], 
                                           labels=['Short', 'Medium', 'Long', 'Very Long'])
    
    # Risk score based on prescreening status
    risk_mapping = {'Low': 1, 'Medium': 2, 'High': 3}
    df_processed['RISK_SCORE'] = df_processed['STATUS_PRESCREENING'].map(risk_mapping)
    
    return df_processed

# Apply feature engineering
df_processed = create_features(df)

print("New Features Created:")
print("="*50)
new_features = ['UMUR', 'KELOMPOK_UMUR', 'KATEGORI_PLAFOND', 'KATEGORI_TENOR', 'RISK_SCORE']
for feature in new_features:
    print(f"{feature}: {df_processed[feature].nunique()} unique values")
    if df_processed[feature].dtype == 'object' or df_processed[feature].dtype.name == 'category':
        print(f"  Values: {df_processed[feature].value_counts().to_dict()}")
    else:
        print(f"  Range: {df_processed[feature].min():.2f} - {df_processed[feature].max():.2f}")
    print()

In [None]:
# Prepare features for modeling
def prepare_model_data(df):
    """
    Prepare data for machine learning models
    """
    df_model = df.copy()
    
    # KOLEKTABILITAS is deliberately excluded from the features: it is the numeric
    # code of the KETERANGAN label (1 = Lancar ... 5 = Macet in the sample rows),
    # so using it as an input would leak the target into the model.
    
    # Categorical features that need encoding
    categorical_features = [
        'STATUS_PERNIKAHAN', 'PROVINSI', 'SUB_PRODUK', 'STATUS_PRESCREENING'
    ]
    
    # Encode categorical variables (one fitted LabelEncoder per column)
    label_encoders = {}
    for feature in categorical_features:
        le = LabelEncoder()
        df_model[f'{feature}_encoded'] = le.fit_transform(df_model[feature])
        label_encoders[feature] = le
    
    # Final feature set
    final_features = [
        'PEKERJAAN', 'STATUS_PERNIKAHAN_encoded', 'PROVINSI_encoded', 'SUB_PRODUK_encoded',
        'PLAFOND', 'JK_WAKTUBULAN', 'STATUS_PRESCREENING_encoded',
        'UMUR', 'RISK_SCORE'
    ]
    
    X = df_model[final_features]
    y = df_model['KETERANGAN']
    
    return X, y, label_encoders, final_features

# Prepare data
X, y, label_encoders, feature_names = prepare_model_data(df_processed)

print("Model Data Preparation:")
print("="*50)
print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"Feature names: {feature_names}")
print(f"\nTarget classes: {y.unique()}")
print(f"Class distribution:\n{y.value_counts()}")

# Display sample of prepared data
print(f"\nSample of prepared features:")
print(X.head())

## 2. Exploratory Data Analysis

In [None]:
# Visualize class distribution
plt.figure(figsize=(15, 10))

# Target distribution
plt.subplot(2, 3, 1)
y.value_counts().plot(kind='bar', color='skyblue')
plt.title('Distribution of Credit Status (KETERANGAN)')
plt.xlabel('Status')
plt.ylabel('Count')
plt.xticks(rotation=45)

# Plafond distribution by status
plt.subplot(2, 3, 2)
df_processed.boxplot(column='PLAFOND', by='KETERANGAN', ax=plt.gca())
plt.title('Plafond Distribution by Credit Status')
plt.suptitle('')  # Remove default title
plt.xticks(rotation=45)

# Age distribution by status
plt.subplot(2, 3, 3)
df_processed.boxplot(column='UMUR', by='KETERANGAN', ax=plt.gca())
plt.title('Age Distribution by Credit Status')
plt.suptitle('')
plt.xticks(rotation=45)

# Loan term distribution by status
plt.subplot(2, 3, 4)
df_processed.boxplot(column='JK_WAKTUBULAN', by='KETERANGAN', ax=plt.gca())
plt.title('Loan Term Distribution by Credit Status')
plt.suptitle('')
plt.xticks(rotation=45)

# Kolektabilitas distribution
plt.subplot(2, 3, 5)
df_processed['KOLEKTABILITAS'].value_counts().sort_index().plot(kind='bar', color='lightcoral')
plt.title('Kolektabilitas Distribution')
plt.xlabel('Kolektabilitas')
plt.ylabel('Count')

# Correlation heatmap
plt.subplot(2, 3, 6)
correlation_features = ['PLAFOND', 'JK_WAKTUBULAN', 'KOLEKTABILITAS', 'UMUR', 'RISK_SCORE']
corr_matrix = df_processed[correlation_features].corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0, ax=plt.gca())
plt.title('Feature Correlation Matrix')

plt.tight_layout()
plt.show()

# Class imbalance analysis
print("Class Imbalance Analysis:")
print("="*50)
class_counts = y.value_counts()
total_samples = len(y)
for class_name, count in class_counts.items():
    percentage = (count / total_samples) * 100
    print(f"{class_name}: {count} samples ({percentage:.1f}%)")

print(f"\nImbalance Ratio (Majority/Minority): {class_counts.max() / class_counts.min():.2f}")

## 3. Handling Class Imbalance with Oversampling
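
The oversamplers used in this section create synthetic minority rows by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below illustrates only that core idea in plain NumPy; `smote_like_samples` is a hypothetical helper written for this note, not the imblearn implementation, and the four toy points are made up. Because neighbours are found by Euclidean distance, it also shows why the features are scaled before resampling.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=5, seed=42):
    """Generate n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)  # a point cannot have more neighbours than there are other points

    # Pairwise Euclidean distances between minority points
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)                    # random minority point
        nb = neighbours[base, rng.integers(k)]    # one of its k nearest neighbours
        gap = rng.random()                        # position along the segment, in [0, 1)
        synthetic[i] = X_minority[base] + gap * (X_minority[nb] - X_minority[base])
    return synthetic

# Toy minority class: the four corners of the unit square
X_min_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_rows = smote_like_samples(X_min_demo, n_new=6, k=2)
print(new_rows.shape)  # (6, 2)
```

Every synthetic row lies on a segment between two existing minority points, so no new information is created; the class is only made denser in the regions it already occupies.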

In [None]:
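# Encode the target as integers 0..n_classes-1. XGBClassifier rejects string
# labels; LabelEncoder assigns codes in alphabetical order of the class names.
target_encoder = LabelEncoder()
y_encoded = pd.Series(target_encoder.fit_transform(y), index=y.index, name=y.name)

# Split data for training and testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y_encoded, test_size=0.2, random_state=42, stratify=y_encoded
)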

print("Data Split:")
print("="*50)
print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
print(f"\nTraining set class distribution:")
print(y_train.value_counts())
print(f"\nTest set class distribution:")
print(y_test.value_counts())

# Scale features for oversampling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"\nFeatures scaled successfully!")

In [None]:
# Apply oversampling techniques
def apply_oversampling(X_train, y_train):
    """
    Apply SMOTE and ADASYN oversampling techniques
    """
    # Original dataset
    original_data = {
        'X': X_train,
        'y': y_train,
        'name': 'Original'
    }
    
    # SMOTE oversampling
    smote = SMOTE(random_state=42)
    X_smote, y_smote = smote.fit_resample(X_train, y_train)
    smote_data = {
        'X': X_smote,
        'y': y_smote,
        'name': 'SMOTE'
    }
    
    # ADASYN oversampling
    adasyn = ADASYN(random_state=42)
    X_adasyn, y_adasyn = adasyn.fit_resample(X_train, y_train)
    adasyn_data = {
        'X': X_adasyn,
        'y': y_adasyn,
        'name': 'ADASYN'
    }
    
    return [original_data, smote_data, adasyn_data]

# Apply oversampling
datasets = apply_oversampling(X_train_scaled, y_train)

# Compare dataset sizes and distributions
print("Oversampling Results:")
print("="*70)
for dataset in datasets:
    print(f"\n{dataset['name']} Dataset:")
    print(f"Shape: {dataset['X'].shape}")
    print(f"Class distribution:")
    class_dist = pd.Series(dataset['y']).value_counts().sort_index()
    for class_name, count in class_dist.items():
        percentage = (count / len(dataset['y'])) * 100
        print(f"  {class_name}: {count} samples ({percentage:.1f}%)")

# Visualize the impact of oversampling
plt.figure(figsize=(15, 5))

for i, dataset in enumerate(datasets):
    plt.subplot(1, 3, i+1)
    pd.Series(dataset['y']).value_counts().plot(kind='bar', color='skyblue')
    plt.title(f'{dataset["name"]} Dataset\n({len(dataset["y"])} samples)')
    plt.xlabel('Credit Status')
    plt.ylabel('Count')
    plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

## 4. Model Comparison
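
The comparison in this section ranks models by macro-averaged F1 rather than accuracy. The sketch below, using made-up labels and a hypothetical model that always predicts the majority class, shows why: macro F1 averages per-class scores equally, so a class the model never predicts pulls the score down sharply, while weighted F1 and accuracy stay high.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """Return {class: F1} computed from true and predicted label lists."""
    scores = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[cls] = 2 * tp / denom if denom else 0.0
    return scores

# Made-up example: 8 "Lancar" and 2 "Macet" loans, model predicts "Lancar" for all
y_true_demo = ["Lancar"] * 8 + ["Macet"] * 2
y_pred_demo = ["Lancar"] * 10

f1_by_class = per_class_f1(y_true_demo, y_pred_demo)
support = Counter(y_true_demo)

accuracy = sum(t == p for t, p in zip(y_true_demo, y_pred_demo)) / len(y_true_demo)
macro_f1 = sum(f1_by_class.values()) / len(f1_by_class)
weighted_f1 = sum(f1_by_class[c] * support[c] for c in f1_by_class) / len(y_true_demo)

print(f"Accuracy:    {accuracy:.3f}")     # 0.800
print(f"Macro F1:    {macro_f1:.3f}")     # 0.444
print(f"Weighted F1: {weighted_f1:.3f}")  # 0.711
```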

In [None]:
# Define models to compare
def get_models():
    """
    Define the machine learning models to compare
    """
    models = {
        'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
        'XGBoost': XGBClassifier(random_state=42, eval_metric='mlogloss'),
        'LightGBM': LGBMClassifier(random_state=42, verbose=-1),
        'Gradient Boosting': GradientBoostingClassifier(random_state=42)
    }
    return models

# Function to evaluate models
def evaluate_model(model, X_train, X_test, y_train, y_test, model_name, dataset_name):
    """
    Train and evaluate a model
    """
    # Train the model
    model.fit(X_train, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test)
    
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    f1_macro = f1_score(y_test, y_pred, average='macro')
    f1_weighted = f1_score(y_test, y_pred, average='weighted')
    precision_macro = precision_score(y_test, y_pred, average='macro')
    recall_macro = recall_score(y_test, y_pred, average='macro')
    
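    # Cross-validation score on the training data.
    # Note: when X_train has already been oversampled, synthetic points derived
    # from a validation-fold sample can end up in the training folds, so these CV
    # scores tend to be optimistic. Treat the held-out test metrics as the reference.
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='f1_macro')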
    cv_mean = cv_scores.mean()
    cv_std = cv_scores.std()
    
    return {
        'Model': model_name,
        'Dataset': dataset_name,
        'Accuracy': accuracy,
        'F1_Macro': f1_macro,
        'F1_Weighted': f1_weighted,
        'Precision_Macro': precision_macro,
        'Recall_Macro': recall_macro,
        'CV_F1_Mean': cv_mean,
        'CV_F1_Std': cv_std,
        'Trained_Model': model,
        'Predictions': y_pred
    }

print("Models defined successfully!")
print("Models to compare:", list(get_models().keys()))

In [None]:
# Train and evaluate all models on all datasets
results = []
best_models = {}

print("Training and Evaluating Models...")
print("="*70)

for dataset in datasets:
    print(f"\nEvaluating on {dataset['name']} dataset...")
    
    models = get_models()
    dataset_results = {}
    
    for model_name, model in models.items():
        print(f"  Training {model_name}...")
        
        # Evaluate model
        result = evaluate_model(
            model, dataset['X'], X_test_scaled, dataset['y'], y_test, 
            model_name, dataset['name']
        )
        
        results.append(result)
        dataset_results[model_name] = result
        
        print(f"    Accuracy: {result['Accuracy']:.4f}")
        print(f"    F1-Macro: {result['F1_Macro']:.4f}")
        print(f"    CV F1-Macro: {result['CV_F1_Mean']:.4f} ± {result['CV_F1_Std']:.4f}")
    
    # Find best model for this dataset
    best_model_name = max(dataset_results.keys(), 
                         key=lambda x: dataset_results[x]['F1_Macro'])
    best_models[dataset['name']] = {
        'name': best_model_name,
        'result': dataset_results[best_model_name]
    }
    print(f"  Best model for {dataset['name']}: {best_model_name}")

print("\nModel evaluation completed!")

In [None]:
# Create comprehensive results comparison
results_df = pd.DataFrame(results)

# Display results table
print("Comprehensive Model Comparison Results:")
print("="*80)
display_cols = ['Model', 'Dataset', 'Accuracy', 'F1_Macro', 'F1_Weighted', 'CV_F1_Mean']
print(results_df[display_cols].round(4))

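# Find overall best combination.
# Note: the winner is selected on test-set F1, so test scores reported for it
# later are slightly optimistic; a separate validation split would avoid this.
best_overall = results_df.loc[results_df['F1_Macro'].idxmax()]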
print(f"\nOverall Best Combination:")
print(f"Model: {best_overall['Model']}")
print(f"Dataset: {best_overall['Dataset']}")
print(f"F1-Macro Score: {best_overall['F1_Macro']:.4f}")

# Visualize results
plt.figure(figsize=(15, 10))

# F1-Macro scores comparison
plt.subplot(2, 2, 1)
pivot_f1 = results_df.pivot(index='Model', columns='Dataset', values='F1_Macro')
sns.heatmap(pivot_f1, annot=True, cmap='YlOrRd', fmt='.3f')
plt.title('F1-Macro Scores by Model and Dataset')

# Accuracy comparison
plt.subplot(2, 2, 2)
pivot_acc = results_df.pivot(index='Model', columns='Dataset', values='Accuracy')
sns.heatmap(pivot_acc, annot=True, cmap='YlOrRd', fmt='.3f')
plt.title('Accuracy Scores by Model and Dataset')

# Cross-validation F1 scores
plt.subplot(2, 2, 3)
pivot_cv = results_df.pivot(index='Model', columns='Dataset', values='CV_F1_Mean')
sns.heatmap(pivot_cv, annot=True, cmap='YlOrRd', fmt='.3f')
plt.title('Cross-Validation F1-Macro Scores')

# Bar plot of best results per dataset
plt.subplot(2, 2, 4)
best_results = results_df.groupby('Dataset')['F1_Macro'].max()
best_results.plot(kind='bar', color='lightblue')
plt.title('Best F1-Macro Score per Dataset')
plt.xlabel('Dataset')
plt.ylabel('F1-Macro Score')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

# Summary of best models per dataset
print("\nBest Models per Dataset:")
print("="*50)
for dataset_name, best_info in best_models.items():
    result = best_info['result']
    print(f"{dataset_name} Dataset:")
    print(f"  Best Model: {best_info['name']}")
    print(f"  F1-Macro: {result['F1_Macro']:.4f}")
    print(f"  Accuracy: {result['Accuracy']:.4f}")
    print()

## 5. Hyperparameter Tuning for the Best Model
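
Grid search fits one model per parameter combination per cross-validation fold, so its cost grows multiplicatively with each added parameter. The sketch below counts the work for the Random Forest grid used in this section with 5 folds; `demo_grid` is a copy made for illustration, and the enumeration order shown is that of `itertools.product`, which is not necessarily the order GridSearchCV uses internally.

```python
from itertools import product
from math import prod

# Random Forest grid as defined in this section (copied for illustration)
demo_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}
n_splits = 5

# Number of candidates is the product of the list lengths: 3 * 3 * 3 * 3
n_candidates = prod(len(values) for values in demo_grid.values())
# Each candidate is fitted once per fold (the final refit is not counted here)
total_fits = n_candidates * n_splits

# Enumerate the candidates explicitly
candidates = [dict(zip(demo_grid, combo)) for combo in product(*demo_grid.values())]

print(n_candidates, total_fits)  # 81 405
print(candidates[0])
```

If 405 fits is too slow on the full dataset, a smaller grid or a randomized search over the same ranges is a common alternative.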

In [None]:
# Select best model and dataset for hyperparameter tuning
best_model_name = best_overall['Model']
best_dataset_name = best_overall['Dataset']

print(f"Hyperparameter Tuning for: {best_model_name} on {best_dataset_name} dataset")
print("="*70)

# Get the best dataset
best_dataset = next(d for d in datasets if d['name'] == best_dataset_name)
X_train_best = best_dataset['X']
y_train_best = best_dataset['y']

# Define hyperparameter grids for different models
def get_param_grid(model_name):
    """
    Define hyperparameter grids for different models
    """
    if model_name == 'Random Forest':
        return {
            'n_estimators': [100, 200, 300],
            'max_depth': [10, 20, None],
            'min_samples_split': [2, 5, 10],
            'min_samples_leaf': [1, 2, 4]
        }
    elif model_name == 'XGBoost':
        return {
            'n_estimators': [100, 200, 300],
            'max_depth': [3, 6, 10],
            'learning_rate': [0.01, 0.1, 0.2],
            'subsample': [0.8, 0.9, 1.0]
        }
    elif model_name == 'LightGBM':
        return {
            'n_estimators': [100, 200, 300],
            'max_depth': [3, 6, 10],
            'learning_rate': [0.01, 0.1, 0.2],
            'num_leaves': [31, 50, 100]
        }
    elif model_name == 'Gradient Boosting':
        return {
            'n_estimators': [100, 200, 300],
            'max_depth': [3, 6, 10],
            'learning_rate': [0.01, 0.1, 0.2],
            'subsample': [0.8, 0.9, 1.0]
        }
    else:
        return {}

# Get the model and parameter grid
models_dict = get_models()
best_model = models_dict[best_model_name]
param_grid = get_param_grid(best_model_name)

print(f"Parameter grid for {best_model_name}:")
for param, values in param_grid.items():
    print(f"  {param}: {values}")

print(f"\nTotal combinations to try: {np.prod([len(v) for v in param_grid.values()])}")

In [None]:
# Perform Grid Search with Cross-Validation
print("Performing Grid Search...")
print("This may take a few minutes...")

# Use stratified k-fold for cross-validation
cv_strategy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Perform grid search
grid_search = GridSearchCV(
    estimator=best_model,
    param_grid=param_grid,
    cv=cv_strategy,
    scoring='f1_macro',
    n_jobs=-1,
    verbose=1
)

# Fit the grid search
grid_search.fit(X_train_best, y_train_best)

# Get results
best_params = grid_search.best_params_
best_cv_score = grid_search.best_score_
best_tuned_model = grid_search.best_estimator_

print("\nHyperparameter Tuning Results:")
print("="*50)
print(f"Best parameters: {best_params}")
print(f"Best CV F1-Macro score: {best_cv_score:.4f}")

# Compare with original model
original_model = models_dict[best_model_name]
original_model.fit(X_train_best, y_train_best)
original_pred = original_model.predict(X_test_scaled)
original_f1 = f1_score(y_test, original_pred, average='macro')

# Evaluate tuned model on test set
tuned_pred = best_tuned_model.predict(X_test_scaled)
tuned_f1 = f1_score(y_test, tuned_pred, average='macro')
tuned_accuracy = accuracy_score(y_test, tuned_pred)

print(f"\nComparison on Test Set:")
print(f"Original model F1-Macro: {original_f1:.4f}")
print(f"Tuned model F1-Macro: {tuned_f1:.4f}")
print(f"Improvement: {tuned_f1 - original_f1:.4f}")
print(f"Tuned model Accuracy: {tuned_accuracy:.4f}")

# Display top 10 parameter combinations
results_df_tuning = pd.DataFrame(grid_search.cv_results_)
top_results = results_df_tuning.nlargest(10, 'mean_test_score')[
    ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
]

print(f"\nTop 10 Parameter Combinations:")
print("="*80)
for i, (idx, row) in enumerate(top_results.iterrows()):
    print(f"{i+1}. Score: {row['mean_test_score']:.4f} ± {row['std_test_score']:.4f}")
    print(f"   Params: {row['params']}")
    print()

## 6. Final Model Evaluation and Insights
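
The classification report and the confusion matrix in this section carry the same information. As a sketch of how they relate, the example below derives per-class precision, recall, and F1 from a made-up 3-class confusion matrix (rows are actual classes, columns are predicted classes, matching scikit-learn's convention).

```python
import numpy as np

# Made-up confusion matrix: rows = actual class, columns = predicted class
cm_demo = np.array([
    [50, 4, 1],
    [3, 10, 2],
    [0, 2, 8],
])

true_positives = np.diag(cm_demo)
precision = true_positives / cm_demo.sum(axis=0)  # column sums = predicted counts
recall = true_positives / cm_demo.sum(axis=1)     # row sums = actual counts (support)
f1 = 2 * precision * recall / (precision + recall)

for cls in range(len(cm_demo)):
    print(f"class {cls}: precision={precision[cls]:.3f} "
          f"recall={recall[cls]:.3f} f1={f1[cls]:.3f}")
print(f"macro F1 = {f1.mean():.3f}")
```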

In [None]:
# Detailed classification report
print("Final Model Classification Report:")
print("="*60)
print(classification_report(y_test, tuned_pred))

# Confusion Matrix
plt.figure(figsize=(15, 5))

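# Confusion matrix for tuned model.
# confusion_matrix orders rows and columns by sorted label, so the tick labels
# must be sorted as well (y.unique() returns order of first appearance).
# This assumes every class occurs in y_test or tuned_pred.
plt.subplot(1, 3, 1)
cm_tuned = confusion_matrix(y_test, tuned_pred)
class_labels = sorted(y.unique())
sns.heatmap(cm_tuned, annot=True, fmt='d', cmap='Blues', 
            xticklabels=class_labels, yticklabels=class_labels)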
plt.title(f'Confusion Matrix - Tuned {best_model_name}')
plt.xlabel('Predicted')
plt.ylabel('Actual')

# Feature importance (if available)
plt.subplot(1, 3, 2)
if hasattr(best_tuned_model, 'feature_importances_'):
    feature_importance = pd.DataFrame({
        'feature': feature_names,
        'importance': best_tuned_model.feature_importances_
    }).sort_values('importance', ascending=False)
    
    sns.barplot(data=feature_importance.head(10), y='feature', x='importance')
    plt.title('Top 10 Feature Importances')
else:
    plt.text(0.5, 0.5, 'Feature importance\nnot available\nfor this model', 
             ha='center', va='center', fontsize=12)
    plt.title('Feature Importance')

# Model performance comparison
plt.subplot(1, 3, 3)
comparison_data = {
    'Original': original_f1,
    'Tuned': tuned_f1
}
plt.bar(comparison_data.keys(), comparison_data.values(), color=['lightcoral', 'lightgreen'])
plt.title('F1-Macro Score Comparison')
plt.ylabel('F1-Macro Score')
plt.ylim(0, 1)
for i, (k, v) in enumerate(comparison_data.items()):
    plt.text(i, v + 0.01, f'{v:.4f}', ha='center', va='bottom')

plt.tight_layout()
plt.show()

# Per-class performance analysis
print("\nPer-Class Performance Analysis:")
print("="*60)
report_dict = classification_report(y_test, tuned_pred, output_dict=True)
classes = [k for k in report_dict.keys() if k not in ['accuracy', 'macro avg', 'weighted avg']]

per_class_df = pd.DataFrame({
    'Class': classes,
    'Precision': [report_dict[c]['precision'] for c in classes],
    'Recall': [report_dict[c]['recall'] for c in classes],
    'F1-Score': [report_dict[c]['f1-score'] for c in classes],
    'Support': [report_dict[c]['support'] for c in classes]
})

print(per_class_df.round(4))

In [None]:
# Summary and Conclusions
print("SUMMARY AND CONCLUSIONS")
print("="*80)

print("1. DATASET ANALYSIS:")
print(f"   - Original dataset size: {X.shape[0]} samples")
print(f"   - Number of features: {X.shape[1]}")
print(f"   - Target classes: {len(y.unique())} ({', '.join(y.unique())})")
print(f"   - Class imbalance ratio: {y.value_counts().max() / y.value_counts().min():.2f}")

print("\n2. OVERSAMPLING IMPACT:")
for dataset in datasets:
    print(f"   - {dataset['name']}: {len(dataset['y'])} samples")

print(f"\n3. BEST MODEL COMBINATION:")
print(f"   - Algorithm: {best_model_name}")
print(f"   - Dataset: {best_dataset_name}")
print(f"   - Final F1-Macro Score: {tuned_f1:.4f}")
print(f"   - Final Accuracy: {tuned_accuracy:.4f}")

print(f"\n4. HYPERPARAMETER TUNING IMPACT:")
print(f"   - Original F1-Macro: {original_f1:.4f}")
print(f"   - Tuned F1-Macro: {tuned_f1:.4f}")
print(f"   - Improvement: {tuned_f1 - original_f1:.4f} ({((tuned_f1/original_f1-1)*100):+.1f}%)")

print(f"\n5. BEST HYPERPARAMETERS:")
for param, value in best_params.items():
    print(f"   - {param}: {value}")

print(f"\n6. MODEL INSIGHTS:")
if hasattr(best_tuned_model, 'feature_importances_'):
    top_features = feature_importance.head(3)
    print(f"   - Top 3 most important features:")
    for _, row in top_features.iterrows():
        print(f"     * {row['feature']}: {row['importance']:.4f}")

print(f"\n7. RECOMMENDATIONS:")
print(f"   - Candidate for production: {best_model_name} trained on the {best_dataset_name} dataset")
if tuned_f1 > original_f1:
    print(f"   - Keep the tuned hyperparameters (test F1-Macro change: {tuned_f1 - original_f1:+.4f})")
else:
    print(f"   - Tuning did not improve test F1-Macro ({tuned_f1 - original_f1:+.4f}); consider the default configuration")
if best_dataset_name != 'Original':
    print(f"   - {best_dataset_name} oversampling gave the best test F1-Macro among the training sets compared")
print(f"   - Monitor model performance, especially for minority classes")

# Save the final model (optional)
import joblib
model_filename = f"best_credit_risk_model_{best_model_name.replace(' ', '_').lower()}_{best_dataset_name.lower()}.pkl"
joblib.dump(best_tuned_model, model_filename)
print(f"\n8. MODEL SAVED:")
print(f"   - Filename: {model_filename}")
print(f"   - Model: Tuned {best_model_name} trained on {best_dataset_name} dataset")

## Conclusions and Recommendations

### Analysis Summary:
1. **Data Preprocessing**: Derived new features such as age, plafond category, and risk score
2. **Class Imbalance**: The original dataset is significantly imbalanced across the five classes (see the imbalance ratio printed in Section 2)
3. **Oversampling**: SMOTE and ADASYN rebalance the training set by adding synthetic minority-class samples
4. **Model Comparison**: Compared 4 algorithms (Random Forest, XGBoost, LightGBM, Gradient Boosting) on 3 training sets (Original, SMOTE, ADASYN)
5. **Hyperparameter Tuning**: Grid search on the best model; the change in test F1-Macro relative to the default configuration is printed in Section 5
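
One caveat about points 3 to 5: cross-validation scores computed on data that was oversampled before splitting into folds tend to be optimistic. A more conservative setup resamples inside each fold, so validation folds contain only real rows. The sketch below shows the pattern under stated assumptions: the data is synthetic (`make_classification`), and `random_oversample` is a hypothetical helper that duplicates minority rows, swapped in for SMOTE/ADASYN to keep the example dependent on scikit-learn only. With imblearn, placing the sampler in an `imblearn.pipeline.Pipeline` achieves the same effect.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

def random_oversample(X, y, rng):
    """Duplicate minority rows at random until every class matches the largest one."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx_parts = []
    for cls, count in zip(classes, counts):
        cls_idx = np.flatnonzero(y == cls)
        extra = rng.choice(cls_idx, size=target - count, replace=True)
        idx_parts.append(np.concatenate([cls_idx, extra]))
    idx = np.concatenate(idx_parts)
    return X[idx], y[idx]

# Synthetic imbalanced data standing in for the credit dataset
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3,
    weights=[0.8, 0.15, 0.05], random_state=42
)

rng = np.random.default_rng(42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []
for train_idx, val_idx in cv.split(X_demo, y_demo):
    # Resample the training fold only; the validation fold keeps its real distribution
    X_res, y_res = random_oversample(X_demo[train_idx], y_demo[train_idx], rng)
    model = RandomForestClassifier(n_estimators=50, random_state=42)
    model.fit(X_res, y_res)
    val_pred = model.predict(X_demo[val_idx])
    fold_scores.append(f1_score(y_demo[val_idx], val_pred, average='macro'))

print(f"CV F1-Macro: {np.mean(fold_scores):.4f} ± {np.std(fold_scores):.4f}")
```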

### Recommendations:
- Use the best model and dataset combination found above for production
- Monitor model performance regularly, especially for minority classes
- Consider collecting more data for underrepresented classes
- Retrain the model periodically on the latest data

### Next Steps:
1. Deploy the model to the production system
2. Set up monitoring and alerting for model drift
3. Prepare a pipeline for automated retraining
4. Run A/B tests to validate performance in real-world scenarios
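
For the drift monitoring in step 2, one commonly used statistic is the Population Stability Index (PSI), which compares the distribution of a feature at training time with its distribution in production. The notebook does not prescribe a drift metric, so the sketch below is only one possible choice; `population_stability_index` is a hypothetical helper, and the data is randomly generated rather than taken from the credit dataset.

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a reference sample (expected) and a new sample (actual)."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)

    # Interior bin edges from quantiles of the reference data;
    # values below the first or above the last edge fall into the outer bins
    interior_edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))[1:-1]

    exp_bins = np.searchsorted(interior_edges, expected, side='right')
    act_bins = np.searchsorted(interior_edges, actual, side='right')
    exp_frac = np.bincount(exp_bins, minlength=n_bins) / len(expected)
    act_frac = np.bincount(act_bins, minlength=n_bins) / len(actual)

    # Avoid log(0) and division by zero for empty bins
    exp_frac = np.clip(exp_frac, eps, None)
    act_frac = np.clip(act_frac, eps, None)
    return float(np.sum((act_frac - exp_frac) * np.log(act_frac / exp_frac)))

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=5000)  # e.g. a scaled feature at training time
shifted = rng.normal(loc=1.0, scale=1.0, size=5000)    # same feature after a shift in the mean

psi_same = population_stability_index(reference, reference)
psi_shifted = population_stability_index(reference, shifted)
print(f"PSI (no drift): {psi_same:.4f}")
print(f"PSI (shifted):  {psi_shifted:.4f}")
```

A PSI of zero means the two binned distributions are identical; larger values indicate a larger shift. Alert thresholds should be chosen per feature from historical data.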