# Credit Risk Prediction Report

## 1. Business Problem
Financial institutions face the challenge of predicting whether customers will default on their payments.  
The cost of **missing a defaulter (false negative)** is often higher than the cost of **flagging a good customer as risky (false positive)**.  
Therefore, our goal is to **maximize recall (catch as many defaulters as possible)** while still maintaining **decent precision** to avoid rejecting too many good customers.
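
To make the two error types concrete, here is a minimal sketch of how recall and precision follow from confusion-matrix counts, with label `1` (defaulter) as the positive class. The toy labels and helper names are illustrative assumptions, not data or code from this project.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn), treating label 1 (defaulter) as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # good customer flagged as risky
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # defaulter missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def recall(tp, fn):
    """Share of actual defaulters that were caught."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp, fp):
    """Share of flagged customers that really were defaulters."""
    return tp / (tp + fp) if (tp + fp) else 0.0


# Toy labels, made up for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 0, 0, 1]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
toy_recall = recall(tp, fn)
toy_precision = precision(tp, fp)
```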

---

## 2. Data Overview
- The original dataset was imbalanced (about 22% defaulters):
  - **Non-defaulters (0): 18,691**
  - **Defaulters (1): 5,309**
- After applying **SMOTE (Synthetic Minority Oversampling Technique)**, classes were balanced for training:
  - **Non-defaulters (0): 18,691**
  - **Defaulters (1): 18,691**
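
The core idea behind SMOTE can be sketched in a few lines: each synthetic minority sample is an interpolation between a real minority sample and one of its nearest minority neighbours. The function below is a simplified, standard-library illustration under that assumption. It is not the implementation used for this report; in practice a library such as imbalanced-learn would typically be used, and oversampling would be applied to the training split only.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (simplified SMOTE)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy two-feature minority class, made up for illustration.
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_sketch(minority, n_new=6, k=2)

# Synthetic defaulters needed to match the class counts reported above.
n_needed = 18_691 - 5_309
```

With the counts above, `n_needed` works out to 13,382 synthetic defaulter rows.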

---

## 3. Baseline Modeling (Before Feature Engineering)
I first built models on the raw features, applying SMOTE to balance the training data where noted.  

### Logistic Regression (baseline)
- Recall: **62%**
- Precision: **37%**
- Strength: caught many defaulters.  
- Weakness: very low precision → too many false positives.

### Random Forest (baseline, without SMOTE)
- Recall: **34%**
- Precision: **64%**
- Strength: fewer false positives.  
- Weakness: missed the majority of defaulters (roughly two in three).

### Random Forest + SMOTE (baseline)
- Recall: **47%**
- Precision: **55%**
- Best balance among the baseline models.
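
One way to quantify "balance" is the F1 score, the harmonic mean of precision and recall. The sketch below applies it to the baseline figures reported above; it assumes precision and recall are weighted equally, which is a simplification given the business preference for recall.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs taken from the baseline results in this section.
baselines = {
    "Logistic Regression": (0.37, 0.62),
    "Random Forest (no SMOTE)": (0.64, 0.34),
    "Random Forest + SMOTE": (0.55, 0.47),
}

f1 = {name: f1_score(p, r) for name, (p, r) in baselines.items()}
best = max(f1, key=f1.get)

for name, score in f1.items():
    print(f"{name}: F1 = {score:.3f}")
```

This gives roughly 0.463 for Logistic Regression, 0.444 for Random Forest without SMOTE, and 0.507 for Random Forest + SMOTE.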

---

## 4. Feature Engineering
To improve predictive power, I engineered new features in addition to the original dataset.  
Examples of engineered features include:
- Ratios of bill amounts to credit limits.  
- Aggregated payment delays across months.  
- Grouping categorical levels (e.g., education, marital status).  

The motivation was to better capture customer repayment behavior.
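
The three kinds of features listed above could be derived along the following lines. This is a sketch only: the field names (`credit_limit`, `bill_amt_1`, `pay_delay_1`, `education`) and the three-month window are hypothetical and do not necessarily match the dataset's actual columns or the exact features used in this report.

```python
# Hypothetical set of education levels kept as-is; everything else is grouped.
KNOWN_EDUCATION = {"graduate", "university", "high_school"}


def engineer_features(row, months=3):
    """Return a copy of `row` with illustrative engineered features added."""
    out = dict(row)
    bills = [row[f"bill_amt_{m}"] for m in range(1, months + 1)]
    delays = [row[f"pay_delay_{m}"] for m in range(1, months + 1)]
    limit = row["credit_limit"]

    # Ratio of the average bill amount to the credit limit (utilisation).
    out["avg_bill_to_limit"] = (sum(bills) / len(bills)) / limit if limit else 0.0

    # Aggregated payment delays across months.
    out["max_delay"] = max(delays)
    out["months_late"] = sum(1 for d in delays if d > 0)

    # Group rare or unknown categorical levels into a single bucket.
    edu = row["education"]
    out["education_grouped"] = edu if edu in KNOWN_EDUCATION else "other"
    return out


# Made-up customer record for illustration.
customer = {
    "credit_limit": 20_000,
    "bill_amt_1": 5_000, "bill_amt_2": 4_000, "bill_amt_3": 3_000,
    "pay_delay_1": 2, "pay_delay_2": 0, "pay_delay_3": -1,
    "education": "unknown",
}
features = engineer_features(customer)
```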

---

## 5. Enhanced Modeling (After Feature Engineering)
After creating new features, I rebuilt the models to compare performance.

- **Logistic Regression (with engineered features):**
  - Recall remained high (~62%) but precision stayed low (~37%).  
- **Random Forest (with engineered features, no SMOTE):**
  - Recall improved slightly but remained too low to be practical.  
- **Random Forest + SMOTE (with engineered features):**
  - Recall ~47%, Precision ~55%.  
  - Maintained the best trade-off even with richer features.  

---

## 6. Why Random Forest + SMOTE?
- Logistic Regression gave the **highest recall** (~62%) but at the cost of low precision (~37%), meaning most flagged customers were not actually defaulters.  
- Random Forest without SMOTE improved precision but failed to catch enough defaulters.  
- **Random Forest + SMOTE consistently provided the best compromise**, both before and after feature engineering:
  - Better recall than plain RF.  
  - Better precision than logistic regression.  
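  - Random Forest can capture non-linear repayment patterns in the engineered features, although its headline metrics stayed close to the baseline (~47% recall, ~55% precision).  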

---

## 7. Model Improvement Strategy
- **Threshold tuning:** Adjust probability threshold to reach business-defined recall or precision.  
- **Cost-sensitive learning:** Penalize false negatives more heavily.  
- **Feature importance analysis:** Identify which engineered features drive model performance.  
- **Business cost simulation:** Quantify trade-off of false positives vs false negatives in monetary terms.
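
As an illustration of the business cost simulation, the sketch below totals the monetary cost of false negatives and false positives at several candidate thresholds and picks the cheapest one. The scores, labels, and cost figures are made-up assumptions; real costs would have to come from the business.

```python
def total_cost(y_true, scores, threshold, cost_fn, cost_fp):
    """Monetary cost of the errors made when flagging scores >= threshold."""
    pairs = list(zip(y_true, scores))
    fn = sum(1 for t, s in pairs if t == 1 and s < threshold)   # missed defaulters
    fp = sum(1 for t, s in pairs if t == 0 and s >= threshold)  # good customers flagged
    return fn * cost_fn + fp * cost_fp


def cheapest_threshold(y_true, scores, cost_fn, cost_fp, candidates):
    """Candidate threshold with the lowest total error cost."""
    return min(
        candidates,
        key=lambda th: total_cost(y_true, scores, th, cost_fn, cost_fp),
    )


# Toy validation labels and predicted default probabilities (illustrative).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.35, 0.7, 0.4, 0.3, 0.2, 0.1]
candidates = [0.3, 0.5, 0.7]

# Hypothetical costs: a missed defaulter is 10x as expensive as a false alarm.
recall_heavy = cheapest_threshold(y_true, scores, 5_000, 500, candidates)

# Reversed hypothetical costs, for contrast.
precision_heavy = cheapest_threshold(y_true, scores, 500, 5_000, candidates)
```

When false negatives are the expensive error, the cheapest threshold moves down (0.3 here), so more customers are flagged; when false positives dominate, it moves up (0.5 here).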

---

## 8. Recommendation
I recommend **deploying Random Forest with SMOTE and engineered features** as the production model.  
It consistently balanced recall and precision across both phases of experimentation, making it the most practical choice.  
Threshold tuning can be applied to align the model with business priorities (e.g., maximize recall in high-risk scenarios).
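
A minimal sketch of threshold tuning toward a recall target is shown below. It assumes the model outputs a default probability per customer and that the threshold is chosen on a held-out validation set; the labels, scores, and the 75% target are illustrative, not results from this project.

```python
def threshold_for_recall(y_true, scores, target_recall):
    """Highest threshold at which recall on the given data reaches the target."""
    positives = sum(1 for t in y_true if t == 1)
    if positives == 0:
        raise ValueError("no positive examples to compute recall on")
    # Walk thresholds from strict to lenient; recall only grows as we go down.
    for th in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= th)
        if tp / positives >= target_recall:
            return th
    raise ValueError("target_recall must be at most 1.0")


def precision_at(y_true, scores, threshold):
    """Precision when every score >= threshold is flagged as a defaulter."""
    flagged = [t for t, s in zip(y_true, scores) if s >= threshold]
    return sum(flagged) / len(flagged) if flagged else 0.0


# Toy validation data (illustrative only).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.45, 0.3, 0.6, 0.4, 0.2, 0.1]

th = threshold_for_recall(y_true, scores, target_recall=0.75)
prec = precision_at(y_true, scores, th)
```

Picking the highest threshold that meets the recall target keeps precision as high as the target allows.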
