# Random Forest vs Gradient Boosting — Credit Risk Prediction (Offline Dataset)

In this notebook, we compare **Random Forest** and **Gradient Boosting** classifiers on a credit risk task: predicting whether a loan application is approved (the `Loan_Status` column). We use a real-world Credit Risk dataset, loaded from a local CSV file, to see how bagging and boosting differ in performance and behavior.

In [None]:
# Step 1: Import Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix, roc_auc_score, roc_curve
import warnings
warnings.filterwarnings('ignore')

## Step 2: Load Dataset

In [None]:
# Load the CSV from a local path (adjust this to wherever the file lives on your machine)
df = pd.read_csv('D:/ChaitanyaKhot-96/CreditRisk.csv')
df.head()

## Step 3: Data Preprocessing

In [None]:
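# Encode categorical (object-dtype) columns as integer codes.
# LabelEncoder numbers the values in sorted order, so the codes carry no real
# ordering; tree-based models tolerate this.
cat_cols = df.select_dtypes(include='object').columns
df_encoded = df.copy()
for col in cat_cols:
    df_encoded[col] = LabelEncoder().fit_transform(df_encoded[col])

# Define X and y (Loan_ID is an identifier, not a predictive feature)
X = df_encoded.drop(['Loan_ID', 'Loan_Status'], axis=1)
# 1 = Approved, 0 = Not approved (assumes the raw labels sort in that order, e.g. 'N' < 'Y')
y = df_encoded['Loan_Status']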

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

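# Scale: fit on the training split only so no test-set statistics leak into training.
# Tree-based models do not strictly need scaling, but it is harmless here.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)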
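
The preprocessing above assumes the CSV has no missing values; `GradientBoostingClassifier` does not accept NaN inputs. If the data does contain gaps, one option is to impute and encode inside a scikit-learn `Pipeline`, so that every step is fitted on training rows only. The following is a minimal sketch on a made-up frame: the column names and values are hypothetical and not taken from `CreditRisk.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with gaps; the real dataset's columns may differ.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", np.nan, "Male", "Female", "Male"],
    "Income": [5000.0, 3000.0, 4000.0, np.nan, 6000.0, 2500.0],
    "Loan_Status": [1, 0, 1, 1, 0, 0],
})
X_toy = toy.drop(columns="Loan_Status")
y_toy = toy["Loan_Status"]

preprocess = ColumnTransformer([
    # Categorical: fill gaps with the most frequent value, then one-hot encode.
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), ["Gender"]),
    # Numeric: fill gaps with the column median.
    ("num", SimpleImputer(strategy="median"), ["Income"]),
])

toy_pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestClassifier(n_estimators=10, random_state=42)),
])
toy_pipeline.fit(X_toy, y_toy)
toy_preds = toy_pipeline.predict(X_toy)
print(toy_preds)
```

One-hot encoding is used here instead of `LabelEncoder` because it avoids implying an order between categories; either choice can work with tree ensembles.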

## Step 4: Model Training

In [None]:
# Train Random Forest
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train_scaled, y_train)

# Train Gradient Boosting
gb_model = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb_model.fit(X_train_scaled, y_train)
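
Because Gradient Boosting adds trees one at a time, its quality can be inspected after each stage, which helps when choosing `n_estimators`. Below is a minimal sketch using `staged_predict_proba` on synthetic data from `make_classification`; the data and the `*_demo` names are illustrative assumptions, not the credit dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the credit dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

gb_demo = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
gb_demo.fit(X_tr, y_tr)

# staged_predict_proba yields held-out probabilities after 1, 2, ..., n trees.
staged_auc = [
    roc_auc_score(y_val, proba[:, 1])
    for proba in gb_demo.staged_predict_proba(X_val)
]

best_stage = max(range(len(staged_auc)), key=staged_auc.__getitem__) + 1
print(f"Stages: {len(staged_auc)}, best held-out AUC {max(staged_auc):.3f} at {best_stage} trees")
```

The held-out rows here act as a validation split; picking the number of trees on the final test set would make the reported test score optimistic.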

## Step 5: Model Evaluation

In [None]:
def evaluate(model, name):
    """Print accuracy, confusion matrix, classification report, and ROC AUC on the test split."""
    y_pred = model.predict(X_test_scaled)
    y_prob = model.predict_proba(X_test_scaled)[:, 1]  # predicted probability of class 1
    print(f"\nModel: {name}")
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print("Classification Report:\n", classification_report(y_test, y_pred))
    print("ROC AUC:", roc_auc_score(y_test, y_prob))

evaluate(rf_model, "Random Forest")
evaluate(gb_model, "Gradient Boosting")
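
A single 80/20 split can give noisy estimates, especially on a small dataset. The sketch below compares the two model types with stratified 5-fold cross-validation. It uses synthetic data from `make_classification` as a stand-in, so the printed numbers say nothing about the credit dataset; the `*_demo` and `cv_results` names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (an assumption; not the credit dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)

# Stratified folds keep the class ratio roughly equal in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=42),
}

cv_results = {}
for name, model in models.items():
    # One ROC AUC score per fold; the model is refitted from scratch each time.
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="roc_auc")
    cv_results[name] = scores
    print(f"{name}: mean AUC = {scores.mean():.3f} (std = {scores.std():.3f})")
```

Reporting the spread across folds alongside the mean makes it easier to judge whether a gap between the two models is larger than the split-to-split noise.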

## Step 6: ROC Curve

In [None]:
rf_probs = rf_model.predict_proba(X_test_scaled)[:, 1]
gb_probs = gb_model.predict_proba(X_test_scaled)[:, 1]
rf_fpr, rf_tpr, _ = roc_curve(y_test, rf_probs)
gb_fpr, gb_tpr, _ = roc_curve(y_test, gb_probs)

plt.figure(figsize=(8,6))
plt.plot(rf_fpr, rf_tpr, label='Random Forest')
plt.plot(gb_fpr, gb_tpr, label='Gradient Boosting')
plt.plot([0,1],[0,1],'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.grid(True)
plt.show()
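
The area under each curve can be computed from the same `fpr`/`tpr` arrays with `sklearn.metrics.auc`, which is handy for labelling the legend. A small sketch on made-up scores (the labels and scores below are randomly generated for illustration only):

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

rng = np.random.default_rng(0)

# Hypothetical labels and noisy scores that are correlated with the labels.
y_true = rng.integers(0, 2, size=200)
scores = 0.5 * y_true + rng.normal(0.0, 0.5, size=200)

fpr, tpr, _ = roc_curve(y_true, scores)
area = auc(fpr, tpr)  # trapezoidal area under the ROC curve

legend_label = f"Demo model (AUC = {area:.3f})"
print(legend_label)

# The curve-based area agrees with roc_auc_score on the same inputs.
print(np.isclose(area, roc_auc_score(y_true, scores)))
```

In the plot above this would amount to passing something like `label=f'Random Forest (AUC = {auc(rf_fpr, rf_tpr):.3f})'` to `plt.plot`, after adding `auc` to the `sklearn.metrics` import.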

## Step 7: Realistic Prediction Queries

In [None]:
# Sample profile from the test set (row 5 of X_test, which the models did not see during training)
sample = X_test.iloc[[5]]
sample_scaled = scaler.transform(sample)
print("Prediction for profile:")
print(sample)

# Probability refers to class 1 (Approved)
rf_prob = rf_model.predict_proba(sample_scaled)[0][1]
gb_prob = gb_model.predict_proba(sample_scaled)[0][1]
print(f"\nRandom Forest Prediction: {rf_model.predict(sample_scaled)[0]}, Probability = {rf_prob:.1%}")
print(f"Gradient Boosting Prediction: {gb_model.predict(sample_scaled)[0]}, Probability = {gb_prob:.1%}")

## Step 8: Conclusion

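- **Random Forest** uses bagging: each tree is trained independently on a bootstrap sample and the trees' predictions are combined, which makes it generally robust, easy to parallelize, and a strong baseline with little tuning.
- **Gradient Boosting** builds trees sequentially, each one fitted to the errors of the ensemble so far; it can give better accuracy after tuning (for example `learning_rate`, `n_estimators`, `max_depth`) but is more sensitive to those settings.
- The two models have different strengths; the accuracy, confusion matrix, and ROC AUC from Step 5 and the ROC curves from Step 6 show how they compare on this dataset.
- Choose RF for simplicity and speed, GB for performance and tuning flexibility.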
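
The difference between bagging and boosting summarized above can be sketched by hand with plain decision trees. This is a simplified illustration, not what `RandomForestClassifier` or `GradientBoostingClassifier` do internally: it uses a toy regression problem with squared error to keep the arithmetic simple, whereas Random Forest also subsamples features at each split and the boosting classifier fits trees to gradients of the log loss. All data and names below are made up.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)

# Toy 1-D regression data: a noisy sine wave (purely illustrative).
X_toy = rng.uniform(0, 6, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(0, 0.2, size=200)

n_trees = 50

# Bagging: independent trees on bootstrap samples, predictions averaged.
bag_preds = []
for _ in range(n_trees):
    idx = rng.integers(0, len(X_toy), size=len(X_toy))  # sample rows with replacement
    tree = DecisionTreeRegressor(max_depth=3).fit(X_toy[idx], y_toy[idx])
    bag_preds.append(tree.predict(X_toy))
bagged = np.mean(bag_preds, axis=0)

# Boosting: shallow trees fitted one after another to the current residuals.
learning_rate = 0.1
boosted = np.full(len(y_toy), y_toy.mean())  # start from the mean prediction
boost_mse = []
for _ in range(n_trees):
    residuals = y_toy - boosted  # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=3).fit(X_toy, residuals)
    boosted += learning_rate * tree.predict(X_toy)
    boost_mse.append(np.mean((y_toy - boosted) ** 2))

bag_mse = np.mean((y_toy - bagged) ** 2)
print(f"Bagging training MSE:  {bag_mse:.4f}")
print(f"Boosting training MSE: first stage {boost_mse[0]:.4f}, last stage {boost_mse[-1]:.4f}")
```

The bagging loop could run in any order or in parallel, since no tree depends on another; the boosting loop cannot, because each tree needs the residuals left by the previous ones.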