<a href="https://colab.research.google.com/github/Naomie25/DI-Bootcamp/blob/main/Week5_Day1_ExerciceXP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Exercise 1: Defining the Problem and Data Collection for Loan Default Prediction



## **Loan Default Prediction: Problem Statement & Data Collection Plan**

### **1. Problem Statement**

Financial institutions face significant losses due to loan defaults. To reduce this risk and improve lending decisions, it is essential to identify potential defaulters before loans are approved. This project aims to build a machine learning model that predicts whether a borrower is likely to default on a loan based on historical and applicant data.

**Objective:**
Develop a predictive model to classify loan applicants as *likely to default* or *not likely to default*, using various financial, personal, and behavioral indicators.

**Output:**
A binary classification:

* `1` — Likely to default
* `0` — Not likely to default

**Success Criteria:**

* High precision and recall on the default class, not just high overall accuracy
* Robustness to data imbalance
* Interpretability for real-world decision-making
* Compliance with data privacy and financial regulations
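
To make the first criterion concrete, here is a minimal sketch of how accuracy, precision, recall, and F1 can be computed from predictions. The toy labels (100 loans, 5 defaults) and the two hypothetical models are invented for illustration only; they are not results from any real dataset.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = default)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # defaults caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # good loans flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # defaults missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # good loans passed
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical toy data: 100 loans, of which 5 defaulted.
y_true = [1] * 5 + [0] * 95

# Model A never predicts a default.
always_zero = [0] * 100
metrics = classification_metrics(y_true, always_zero)

# Model B catches 4 of the 5 defaults but also flags 10 good loans.
flagging = [1] * 4 + [0] * 1 + [1] * 10 + [0] * 85
metrics_flagging = classification_metrics(y_true, flagging)

print(metrics)
print(metrics_flagging)
```

In this toy setup, Model A scores higher on accuracy than Model B even though it never identifies a single defaulter, which is why accuracy should not be the only success criterion.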

---

### **2. Data Collection Plan**

#### **A. Types of Data Needed**

| Category                  | Data Attributes                                                                     |
| ------------------------- | ----------------------------------------------------------------------------------- |
| **Personal Details**      | Age, gender, marital status, education, dependents, residential status              |
| **Financial Information** | Annual income, monthly expenses, credit score, existing debts, debt-to-income ratio |
| **Employment Details**    | Employment status, job title, industry, years employed, employment type             |
| **Loan Information**      | Loan amount, term, interest rate, purpose, collateral                               |
| **Repayment History**     | Previous defaults, missed/late payments, repayment timeliness, delinquency          |
| **Transaction Data**      | Bank balances, recent transactions, overdraft frequency, payment-to-income ratio    |
| **Other Data**            | Location, loan application channel                                                  |

---

#### **B. Data Sources**

| Data Type                   | Possible Data Sources                                                                   |
| --------------------------- | --------------------------------------------------------------------------------------- |
| **Internal Records**        | Financial institution’s loan management systems, customer profiles, transaction logs    |
| **Credit Bureaus**          | Experian, Equifax, TransUnion — for credit scores, credit history, delinquencies        |
| **Employment Verification** | Income verification services, payroll providers, employment databases                   |
| **Bank Statements**         | Directly from customer-provided documents or consent-based API access (e.g., Plaid)     |
| **Public/Third-party Data** | Government databases (for income norms, inflation), alternative credit scoring services |
| **Application Forms**       | Online or offline loan application documents filled by the customer                     |

---

### **3. Considerations**

* **Privacy & Compliance**: Ensure GDPR, CCPA, or local data protection laws are followed. Use anonymization and encryption where needed.
* **Data Quality**: Validate completeness, accuracy, and consistency before model training.
* **Imbalance Handling**: Default cases are typically rare; techniques like SMOTE or stratified sampling may be required.
* **Ethical Use**: Check the data and the model's predictions for bias (for example, across gender) and ensure fairness in predictions.
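
As an illustration of the stratified sampling mentioned above, the following sketch splits record indices into train and test sets while keeping the share of defaults roughly the same in both. The function name `stratified_split`, its parameters, and the toy labels are assumptions made for this example; in practice a library routine would normally be used instead.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), sampling each class separately so that
    the class proportions are roughly preserved in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical labels: 10 defaults (1) among 100 loans.
labels = [1] * 10 + [0] * 90
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=42)

test_defaults = sum(labels[i] for i in test_idx)
train_defaults = sum(labels[i] for i in train_idx)
print(len(train_idx), train_defaults, len(test_idx), test_defaults)
```

A purely random split of such a small, imbalanced sample could leave the test set with very few or no defaults, which would make recall on the default class impossible to estimate.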

---

### **4. Next Steps**

1. Define data access and permissions with stakeholders.
2. Gather and explore data from all available internal systems.
3. Integrate external data sources (e.g., credit bureau APIs).
4. Perform data cleaning, feature engineering, and exploratory data analysis.
5. Train, evaluate, and deploy the predictive model.



Exercise 2: Feature Selection and Model Choice for Loan Default Prediction




### 🔑 **Most Relevant Features and Justifications:**

1. **Credit\_History**

   * **Justification:** This is typically the most important factor in predicting loan defaults. A good credit history (e.g., timely repayments, no defaults) strongly correlates with lower default risk.

2. **ApplicantIncome & CoapplicantIncome**

   * **Justification:** Higher income levels often indicate a better ability to repay loans. Both primary and secondary (coapplicant) incomes are relevant to assess total household earning capacity.

3. **LoanAmount**

   * **Justification:** Larger loan amounts imply greater financial burden. A high loan amount relative to income may increase default risk.

4. **Loan\_Amount\_Term**

   * **Justification:** The length of the loan term affects monthly repayment amounts. For a given loan amount, a shorter term means higher monthly payments, which can strain repayment ability.

5. **Education**

   * **Justification:** Generally, graduates may have more stable and higher-paying jobs, lowering their default risk.

6. **Married**

   * **Justification:** Married applicants may have dual income sources or more stable financial support, which could influence repayment capacity.

7. **Self\_Employed**

   * **Justification:** Self-employed individuals often have irregular income patterns, which may increase risk.

8. **Property\_Area**

   * **Justification:** Location can affect income stability, job availability, and living costs — all of which relate to default risk.
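
Several of the justifications above combine features: total household income, loan amount relative to income, and the effect of the term on the monthly payment. The sketch below shows one way these could be derived. It assumes that incomes are monthly figures in the same currency unit as the loan amount and that the term is in months. The `annual_rate` input is hypothetical, since an interest rate is not among the listed features. The payment formula is the standard fixed-rate amortization formula.

```python
def monthly_installment(principal, annual_rate, term_months):
    """Fixed monthly payment: P * r / (1 - (1 + r) ** -n), r = monthly rate."""
    r = annual_rate / 12
    if r == 0:
        return principal / term_months
    return principal * r / (1 - (1 + r) ** -term_months)


def derived_features(applicant_income, coapplicant_income,
                     loan_amount, term_months, annual_rate):
    """Combine raw columns into ratios that relate the loan to income."""
    total_income = applicant_income + coapplicant_income
    installment = monthly_installment(loan_amount, annual_rate, term_months)
    return {
        "total_income": total_income,
        "loan_to_income": (loan_amount / total_income
                           if total_income else float("inf")),
        "installment": installment,
        "installment_to_income": (installment / total_income
                                  if total_income else float("inf")),
    }


# Hypothetical applicant: same loan, two different terms, assumed 6% rate.
long_term = derived_features(4000, 1000, 120_000, 360, annual_rate=0.06)
short_term = derived_features(4000, 1000, 120_000, 180, annual_rate=0.06)

print(round(long_term["installment"], 2), round(short_term["installment"], 2))
```

Under these assumptions, halving the term raises the monthly installment and therefore the installment-to-income ratio, which is the mechanism the Loan\_Amount\_Term justification describes.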

---

### ⚠️ **Less Relevant / Indirect Features:**

* **Loan\_ID**: This is an identifier and does not provide predictive value.
* **Gender**: May have some correlation, but using it for prediction raises fairness and ethical concerns.
* **Dependents**: Could have minor relevance (more dependents may mean more financial burden), but effect size may be small.




Exercise 3: Training, Evaluating, and Optimizing the Model

Loan default prediction is a classification problem with labeled data: each past loan record states whether the borrower defaulted or not. This makes it a supervised learning problem.
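
A deliberately tiny sketch of what "learning from labeled data" means: the code below estimates the default rate for each `Credit_History` value from past records and uses it to classify a new applicant. The records, the function names, and the 0.5 threshold are all made up for illustration; a real model would use many features and a proper learning algorithm.

```python
from collections import defaultdict


def fit_rate_table(records):
    """Estimate the default rate per Credit_History value from labeled records.

    Each record is (credit_history, defaulted), with defaulted in {0, 1}.
    """
    counts = defaultdict(lambda: [0, 0])  # value -> [defaults, total]
    for credit_history, defaulted in records:
        counts[credit_history][0] += defaulted
        counts[credit_history][1] += 1
    return {value: d / n for value, (d, n) in counts.items()}


def predict(rate_table, credit_history, threshold=0.5):
    """Predict 1 (likely to default) if the learned rate reaches the threshold.

    Unseen values fall back to a rate of 0.0, an assumption of this sketch.
    """
    return 1 if rate_table.get(credit_history, 0.0) >= threshold else 0


# Hypothetical labeled history: (Credit_History, defaulted).
history = [(1, 0), (1, 0), (1, 0), (1, 1), (0, 1), (0, 1), (0, 0)]
rate_table = fit_rate_table(history)

print(rate_table)
print(predict(rate_table, 0), predict(rate_table, 1))
```

The labels (the second element of each record) are what make this supervised: the rates are learned directly from known outcomes.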



Exercise 4: Designing Machine Learning Solutions for Specific Problems

- Predicting stock prices (predict future prices): supervised learning (regression), because the task is to forecast continuous numeric values (future prices) from historical data (past prices, indicators, etc.).
- Organizing a library of books (group books into genres or categories based on similarities): unsupervised learning (clustering), because when there are no predefined genre labels, the books must be grouped automatically by features such as title, author, keywords, or text content.
- Programming a robot to navigate and find the shortest path in a maze: reinforcement learning, because an agent (the robot) interacts with an environment (the maze) and learns to take actions (moves) that maximize a reward (reaching the exit in as few moves as possible).
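
The maze case can be sketched with tabular Q-learning, one common reinforcement learning algorithm. Everything below is an assumption chosen for illustration: the 3x4 maze layout, the reward of -1 per move, and the hyperparameters. With a cost on every move, maximizing the total reward amounts to minimizing the number of moves.

```python
import random

# Hypothetical maze: S = start, G = goal, # = wall, . = free cell.
MAZE = ["S.#G",
        ".##.",
        "...."]
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def find(char):
    """Locate the (row, col) of a marker character in the maze."""
    for r, row in enumerate(MAZE):
        for c, cell in enumerate(row):
            if cell == char:
                return (r, c)
    raise ValueError(char)


START, GOAL = find("S"), find("G")


def step(state, action):
    """Move one cell; walls and borders leave the robot in place.
    Every move costs -1, and the episode ends at the goal."""
    r, c = state[0] + action[0], state[1] + action[1]
    if 0 <= r < len(MAZE) and 0 <= c < len(MAZE[0]) and MAZE[r][c] != "#":
        state = (r, c)
    return state, -1.0, state == GOAL


def train(episodes=2000, alpha=0.5, gamma=0.95, epsilon=0.2, seed=0):
    """Epsilon-greedy tabular Q-learning; unseen entries default to 0.0."""
    rng = random.Random(seed)
    q = {}  # (state, action_index) -> estimated return
    n = len(ACTIONS)
    for _ in range(episodes):
        state = START
        for _ in range(200):  # cap on episode length
            if rng.random() < epsilon:
                a = rng.randrange(n)  # explore
            else:
                a = max(range(n), key=lambda i: q.get((state, i), 0.0))
            nxt, reward, done = step(state, ACTIONS[a])
            best_next = 0.0 if done else max(q.get((nxt, i), 0.0)
                                             for i in range(n))
            old = q.get((state, a), 0.0)
            q[(state, a)] = old + alpha * (reward + gamma * best_next - old)
            state = nxt
            if done:
                break
    return q


def greedy_path(q, max_steps=50):
    """Follow the highest-valued action from the start, without exploring."""
    state, path = START, [START]
    for _ in range(max_steps):
        if state == GOAL:
            break
        a = max(range(len(ACTIONS)), key=lambda i: q.get((state, i), 0.0))
        state, _, _ = step(state, ACTIONS[a])
        path.append(state)
    return path


q_table = train()
path = greedy_path(q_table)
print(len(path) - 1, path)
```

In this particular maze the only route to the goal that avoids the dead end takes 7 moves, so the learned greedy path can be checked against that number.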


Exercise 5: Designing an Evaluation Strategy for Different ML Models

1. Supervised Learning (Classification)

   * Metrics: Accuracy, Precision, Recall, F1-score, ROC-AUC
   * Methods: k-fold cross-validation, confusion matrix, ROC curve
   * Challenges: Handling imbalanced data, avoiding overfitting, choosing the classification threshold

2. Unsupervised Learning (Clustering)

   * Metrics: Silhouette Score, Davies-Bouldin Index, within-cluster sum of squares (inertia)
   * Methods: Use the elbow method on inertia to choose the number of clusters, compare metric scores, use domain knowledge to validate clusters
   * Challenges: No ground truth labels, metrics that disagree with each other, high-dimensional data effects

3. Reinforcement Learning

   * Metrics: Cumulative reward, convergence behavior, episode length, exploration-exploitation balance
   * Methods: Track rewards over episodes, test the final policy, run multiple trials
   * Challenges: Sparse or delayed rewards, high variance due to exploration, designing the reward function
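
Because clustering has no ground truth labels, a score such as the silhouette coefficient is computed from the data and the cluster assignment alone. The sketch below is a plain-Python version for small datasets, assuming Euclidean distance. The six points and both labelings are invented for the example.

```python
from math import dist


def silhouette_score(points, labels):
    """Mean silhouette coefficient (b - a) / max(a, b) over all points.

    a = mean distance to the other points in the same cluster,
    b = smallest mean distance to the points of any other cluster.
    """
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # dist(point, point) is 0, so dividing by len(own) - 1 excludes self.
        a = sum(dist(point, other) for other in own) / (len(own) - 1)
        b = min(sum(dist(point, other) for other in members) / len(members)
                for key, members in clusters.items() if key != label)
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical 2-D points forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]  # matches the visible grouping
bad_labels = [0, 1, 0, 1, 0, 1]   # mixes the two groups

good_score = silhouette_score(points, good_labels)
bad_score = silhouette_score(points, bad_labels)
print(round(good_score, 3), round(bad_score, 3))
```

Scores close to 1 indicate compact, well-separated clusters, while scores near 0 or below suggest overlapping or wrongly assigned points.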

