
# 🌟 Exercise 1 : Defining the Problem and Data Collection for Loan Default Prediction

Instructions

Write a clear problem statement for predicting loan defaults.
Identify and list the types of data you would need for this project (e.g., personal details of applicants, credit scores, loan amounts, repayment history).
Discuss the sources where you can collect this data (e.g., financial institution’s internal records, credit bureaus).
Expected Output: A document detailing the problem statement and a comprehensive plan for data collection, including data types and sources.



**Problem statement for predicting loan defaults**

Predicting loan defaults is a critical task for financial institutions to minimize risk, optimize lending strategies, and enhance financial stability. The objective of this project is to develop a predictive model that accurately identifies potential loan defaulters based on historical and real-time data. By leveraging key financial, personal, and behavioral attributes of loan applicants, this model will help institutions improve their decision-making processes, reduce non-performing loans, and increase profitability while ensuring responsible lending practices.

**Types of data needed for this project:**

1) Applicant Personal Information:

Age

Gender

Marital status

Employment status

Education level

Number of dependents

2) Financial and Credit History:

Annual income

Credit score

Outstanding debts

Debt-to-income ratio

Previous loan history

Number of credit inquiries

3) Loan-Specific Information:

Loan amount requested

Loan tenure

Interest rate

Loan purpose (e.g., home, auto, personal, business)

Type of collateral (if applicable)

4) Repayment and Behavioral Data:

On-time payment history

Late payment instances

Default history

Account balance trends

Frequency of missed payments

**Sources where this data can be collected**

1) Financial Institution’s Internal Records:

Customer loan applications

Repayment history and loan performance records

Internal risk assessment reports

2) Credit Bureaus:

Credit scores

Credit report details (e.g., outstanding debts, past defaults)

Credit utilization patterns

3) Government Databases:

Tax records

Employment status verification

Public default records

4) Open Data and Third-Party Providers:

Economic indicators and market trends

Alternative credit scoring agencies

Social media and digital footprint analysis (where applicable)

# 🌟 Exercise 2 : Feature Selection and Model Choice for Loan Default Prediction

Instructions

From this dataset, identify which features might be most relevant for predicting loan defaults.

Justify your choice of features.


In [None]:
from google.colab import files
files.upload()

!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

!kaggle datasets download -d ninzaami/loan-predication

Saving kaggle.json to kaggle.json
Dataset URL: https://www.kaggle.com/datasets/ninzaami/loan-predication
License(s): CC0-1.0
Downloading loan-predication.zip to /content
  0% 0.00/7.83k [00:00<?, ?B/s]
100% 7.83k/7.83k [00:00<00:00, 12.4MB/s]


In [None]:
!unzip loan-predication.zip

Archive:  loan-predication.zip
  inflating: train_u6lujuX_CVtuZ9i (1).csv  


In [None]:
import pandas as pd
df = pd.read_csv('/content/train_u6lujuX_CVtuZ9i (1).csv')
df.head()

|   | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
| 1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| 3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
| 4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB
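
The non-null counts in the `df.info()` output above imply missing values in several columns. A minimal sketch that derives the gap per column from those printed counts (the dictionary only transcribes the output; on the live DataFrame, `df.isna().sum()` should give the same numbers directly):

```python
# Non-null counts transcribed from the df.info() output above.
TOTAL_ROWS = 614
non_null = {
    "Loan_ID": 614, "Gender": 601, "Married": 611, "Dependents": 599,
    "Education": 614, "Self_Employed": 582, "ApplicantIncome": 614,
    "CoapplicantIncome": 614, "LoanAmount": 592, "Loan_Amount_Term": 600,
    "Credit_History": 564, "Property_Area": 614, "Loan_Status": 614,
}

# Missing count per column, keeping only the columns that actually have gaps.
missing = {col: TOTAL_ROWS - n for col, n in non_null.items() if n < TOTAL_ROWS}

# Print the columns with the most missing values first.
for col, count in sorted(missing.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{col:<18} {count:>3} missing ({count / TOTAL_ROWS:.1%})")
```

Credit_History has the most gaps (50 of 614 rows), which matters because it is also expected to be one of the strongest predictors.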


**Features Likely to Be Relevant**

ApplicantIncome – Higher income typically indicates a greater ability to repay the loan.

CoapplicantIncome – If a co-applicant is contributing, it might reduce the burden on the primary applicant.

LoanAmount – A higher loan amount could increase default risk if it’s disproportionate to income.

Loan_Amount_Term – The length of the loan term affects monthly payments and affordability.

Credit_History – Past credit behavior is a strong predictor of default risk.

Married – Being married might indicate dual incomes, which could reduce default risk.

Dependents – More dependents could mean higher financial obligations, impacting repayment ability.

Self_Employed – Self-employed individuals may have more variable income, affecting repayment stability.

Education – Graduates may have more stable, higher-paying jobs.

Property_Area – The urban/rural location might influence income stability and property value.

**Features to Drop or Consider Less Important**

Loan_ID – It's just an identifier and has no predictive value.

Gender – It may correlate with the outcome, but it is not a direct driver of repayment ability, and using it in a lending decision raises fairness concerns.
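
As a rough sketch of how this selection could look in pandas, the snippet below rebuilds a few columns of the five rows printed by `df.head()`, drops the two columns set aside above, and derives two extra features. `TotalIncome` and `LoanToIncome` are hypothetical names introduced here for illustration; they are not part of the dataset.

```python
import pandas as pd

# A subset of the columns from the five rows shown by df.head() above;
# LoanAmount is missing in the first row.
df = pd.DataFrame({
    "Loan_ID": ["LP001002", "LP001003", "LP001005", "LP001006", "LP001008"],
    "Gender": ["Male"] * 5,
    "ApplicantIncome": [5849, 4583, 3000, 2583, 6000],
    "CoapplicantIncome": [0.0, 1508.0, 0.0, 2358.0, 0.0],
    "LoanAmount": [None, 128.0, 66.0, 120.0, 141.0],
    "Credit_History": [1.0, 1.0, 1.0, 1.0, 1.0],
    "Loan_Status": ["Y", "N", "Y", "Y", "Y"],
})

# Drop the identifier and the feature set aside above.
features = df.drop(columns=["Loan_ID", "Gender"])

# Hypothetical derived features (assumptions, not columns of the dataset).
features["TotalIncome"] = features["ApplicantIncome"] + features["CoapplicantIncome"]
features["LoanToIncome"] = features["LoanAmount"] / features["TotalIncome"]

# Encode the target as 1 for "Y" and 0 for "N".
features["Loan_Status"] = features["Loan_Status"].map({"Y": 1, "N": 0})

print(features[["TotalIncome", "LoanAmount", "LoanToIncome", "Loan_Status"]])
```

The source does not state the units of LoanAmount, so the ratio is only meaningful for comparing applicants with each other. The missing LoanAmount in the first row propagates to a missing ratio, which would need imputation before modelling.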

# 🌟 Exercise 3 : Training, Evaluating, and Optimizing the Model

Instructions

Which model(s) would you pick for loan prediction?

Outline the steps to evaluate the model’s performance, mentioning specific metrics that would be relevant to evaluate the model.

We're dealing with a **Supervised Learning Classification problem** where the target variable is **Loan_Status (Y/N)**.

**Best Models for Loan Approval Prediction**

Logistic Regression – A good baseline model for binary classification.

Decision Tree – Simple and interpretable, but prone to overfitting.

Random Forest – More robust than a single tree, reduces overfitting.

Gradient Boosting (XGBoost, LightGBM, CatBoost) – Often the strongest performer on structured/tabular data, though it needs more tuning than the simpler models above.

K-Nearest Neighbors (KNN) – Could be tested but needs proper scaling and handling of categorical data.
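
A minimal sketch of the logistic-regression baseline, assuming scikit-learn is available. The data is a synthetic stand-in generated here (not the Kaggle file); only the column names are borrowed from the dataset, and the feature subset is chosen for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (an assumption for illustration, not the real file).
rng = np.random.default_rng(0)
n = 200
credit = rng.choice([0.0, 1.0], size=n, p=[0.2, 0.8])
loan_amount = rng.integers(50, 300, size=n).astype(float)
loan_amount[::15] = np.nan  # blank out some values to mimic missing data

X = pd.DataFrame({
    "ApplicantIncome": rng.integers(1500, 10000, size=n).astype(float),
    "LoanAmount": loan_amount,
    "Credit_History": credit,
    "Property_Area": rng.choice(["Urban", "Semiurban", "Rural"], size=n),
})
# Toy target: follows credit history, flipped for roughly 10% of the rows.
flip = rng.random(n) < 0.1
y = np.where(flip, 1 - credit, credit).astype(int)

numeric = ["ApplicantIncome", "LoanAmount", "Credit_History"]
categorical = ["Property_Area"]

# Impute and scale numeric columns; one-hot encode the categorical column.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Train on the first 150 rows and score on the remaining 50.
model.fit(X.iloc[:150], y[:150])
accuracy = model.score(X.iloc[150:], y[150:])
print(f"Hold-out accuracy on synthetic data: {accuracy:.2f}")
```

Keeping preprocessing inside the pipeline means the imputer and scaler are fitted on training rows only, so swapping `LogisticRegression` for a tree-based model later does not change the rest of the code.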

**Model Evaluation Metrics**

To evaluate the model, we should use:

Accuracy – Overall correctness.

Precision – How many predicted approvals were actually correct.

Recall – How many actual approvals were correctly identified.

F1-Score – Balances precision and recall (important for imbalanced data).
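
The four metrics above can be computed from the confusion counts. The sketch below uses plain Python and hypothetical labels made up for illustration, with "Y" (approved) as the positive class.

```python
def classification_metrics(y_true, y_pred, positive="Y"):
    """Return accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels, for illustration only.
y_true = ["Y", "Y", "Y", "Y", "Y", "Y", "N", "N", "N", "N"]
y_pred = ["Y", "Y", "Y", "Y", "Y", "N", "Y", "Y", "N", "N"]

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name:<9} {value:.3f}")
```

With scikit-learn installed, `sklearn.metrics.classification_report` reports the same quantities per class.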

# 🌟 Exercise 4 : Designing Machine Learning Solutions for Specific Problems

Instructions

For each of these scenarios, decide which type of machine learning would be most suitable. Explain.

Predicting Stock Prices : predict future prices

Organizing a Library of Books : group books into genres or categories based on similarities.

Program a robot to navigate and find the shortest path in a maze.

**1. Predicting Stock Prices (Regression – Supervised Learning)**

Stock prices are continuous numerical values, making this a regression problem.

The model learns from historical stock price data to predict future prices.

**Possible Models:**

Linear Regression (for simple trends).

Random Forest Regressor (for non-linear patterns).
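
To make the "simple trends" case concrete, here is a small sketch that fits a straight line to a price series with ordinary least squares and checks it on the most recent days. The prices are synthetic numbers invented for this example, not market data, and a pure time trend is a deliberately naive feature.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x


# Synthetic closing prices (made up for illustration).
prices = [100.0, 101.5, 103.2, 104.1, 106.0, 107.4, 108.9, 110.2, 112.1, 113.0]
days = list(range(len(prices)))

# Chronological split: train on the past, test on the most recent days.
split = 8
slope, intercept = fit_line(days[:split], prices[:split])

predictions = [slope * d + intercept for d in days[split:]]
mae = sum(abs(p - a) for p, a in zip(predictions, prices[split:])) / len(predictions)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, test MAE={mae:.3f}")
```

The split is chronological rather than random because shuffling a time series would let the model train on days that come after the ones it is tested on.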

**2. Organizing a Library of Books (Clustering – Unsupervised Learning)**

If genres/categories are unknown, the model needs to find patterns and group similar books.

This is a clustering problem, as there are no predefined labels.

**Possible Models:**

K-Means Clustering (groups books based on features like text content, keywords, or metadata).

Hierarchical Clustering (useful for organizing books into a tree-like structure).
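
A minimal K-Means sketch in plain Python. Each book is represented by a hypothetical two-number feature vector (for example, the share of science-related and romance-related keywords); both the features and the values are invented for illustration. The seeding is a simple farthest-first rule chosen to keep the example deterministic, whereas libraries typically use k-means++.

```python
import math


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm with deterministic farthest-first seeding."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from its nearest existing centroid.
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )

    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids


# Hypothetical book features: (science keyword share, romance keyword share).
books = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8), (0.85, 0.05), (0.05, 0.95)]
labels, centroids = kmeans(books, k=2)
print(labels)
```

In practice the feature vectors would come from something like TF-IDF over the book text or metadata, which gives far more than two dimensions.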

**3. Programming a Robot to Navigate a Maze (Reinforcement Learning)**

The robot learns by trial and error, improving based on rewards (e.g., reaching the goal faster).
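
The trial-and-error loop can be sketched with tabular Q-learning on a tiny grid. The maze layout, the reward values (-1 per step, +10 at the goal) and the hyperparameters are all assumptions picked for this illustration.

```python
import random

# Hypothetical 4x4 maze: S = start, G = goal, # = wall, . = open cell.
MAZE = [
    "S..#",
    ".#..",
    ".#.#",
    "...G",
]
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
START, GOAL = (0, 0), (3, 3)


def step(state, action):
    """Move if the target cell is open, else stay. Returns (state, reward, done)."""
    row, col = state[0] + action[0], state[1] + action[1]
    if not (0 <= row < 4 and 0 <= col < 4) or MAZE[row][col] == "#":
        row, col = state
    if (row, col) == GOAL:
        return (row, col), 10.0, True
    return (row, col), -1.0, False  # step penalty encourages short paths


def train(episodes=2000, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = {}  # maps (state, action_index) -> estimated return
    for _episode in range(episodes):
        state = START
        for _t in range(100):
            # Epsilon-greedy: explore with probability epsilon, else exploit.
            if rng.random() < epsilon:
                a = rng.randrange(4)
            else:
                a = max(range(4), key=lambda i: q.get((state, i), 0.0))
            nxt, reward, done = step(state, ACTIONS[a])
            best_next = 0.0 if done else max(q.get((nxt, i), 0.0) for i in range(4))
            old = q.get((state, a), 0.0)
            # Q-learning update toward reward + discounted best next value.
            q[(state, a)] = old + alpha * (reward + gamma * best_next - old)
            state = nxt
            if done:
                break
    return q


def greedy_path(q, max_steps=50):
    """Follow the highest-valued action from START until the goal or the step cap."""
    state, path = START, [START]
    for _t in range(max_steps):
        a = max(range(4), key=lambda i: q.get((state, i), 0.0))
        state, _reward, done = step(state, ACTIONS[a])
        path.append(state)
        if done:
            break
    return path


q_table = train()
path = greedy_path(q_table)
print(path)
```

The per-step penalty is what ties reward to path length: without it, any route that eventually reaches the goal would look equally good to the agent.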

**1. Supervised Learning: Classification Model (e.g., Random Forest Classifier)**

Evaluation Strategy

Goal: To classify data into discrete categories (e.g., loan approval prediction).

Metrics:

Accuracy: The percentage of correct predictions.

Limitation: Accuracy can be misleading, especially in imbalanced datasets (e.g., 95% "no loan" vs. 5% "loan approved").

Precision: The proportion of correctly predicted positives out of all predicted positives.

Limitation: Precision alone does not account for false negatives.

Recall (Sensitivity): The proportion of actual positives correctly identified.

Limitation: Recall can be inflated trivially: a model that labels everything as positive reaches 100% recall while its precision collapses.

F1-Score: The harmonic mean of precision and recall, providing a balanced metric.

Ideal when: Both precision and recall are important, especially with class imbalance.

ROC Curve & AUC (Area Under Curve): Measures the trade-off between sensitivity and specificity. AUC close to 1 indicates a good model, while 0.5 indicates a random classifier.
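
One way to build intuition for AUC: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by direct pairwise comparison on hypothetical scores; this is quadratic in the number of examples and meant for illustration only.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and model scores, for illustration only.
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]

auc = roc_auc(y_true, scores)
print(f"AUC = {auc:.4f}")
```

A model that gives every example the same score gets 0.5 under this definition, which matches the random-classifier reference point mentioned above.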

Methods:

Cross-Validation: Split the data into k-folds (typically 5 or 10), train the model k times, and average the metrics. This helps assess how the model generalizes.

Limitation: Cross-validation can be time-consuming for large datasets.

Confusion Matrix: Visualizes the True Positives, True Negatives, False Positives, and False Negatives, which gives insight into where the model is failing.
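
The k-fold procedure can be sketched without any library. The example below splits row indices into k folds and scores a majority-class baseline on each; the labels are hypothetical, and the baseline stands in for whatever model is being evaluated.

```python
from collections import Counter


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; fold sizes differ by at most one."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# Hypothetical labels in a mixed order (real data should be shuffled or
# stratified before splitting).
labels = ["Y", "Y", "N", "Y", "Y", "N", "Y", "Y", "N", "Y"]


def majority_baseline_accuracy(train_idx, test_idx):
    """Predict the most common training label for every test row."""
    majority = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
    return sum(labels[i] == majority for i in test_idx) / len(test_idx)


scores = [
    majority_baseline_accuracy(train, test)
    for train, test in k_fold_indices(len(labels), k=5)
]
mean_score = sum(scores) / len(scores)
print(scores, mean_score)
```

The spread of the per-fold scores is as informative as their mean: a large spread suggests the estimate depends heavily on which rows end up in the test fold.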

Challenges:

Class Imbalance: The model may focus too much on the majority class. Metrics like F1-score and ROC-AUC reveal this more reliably than accuracy does.

Overfitting: Using cross-validation helps check if the model is generalizing well to unseen data.

**2. Unsupervised Learning: Clustering Model (e.g., K-Means Clustering)**

Evaluation Strategy

Goal: To group data into clusters based on similarity (e.g., customer segmentation).

Metrics:

Silhouette Score: Measures how similar each point is to its own cluster compared to other clusters. Scores range from -1 (incorrect clustering) to +1 (well-clustered). A higher score indicates better clustering.

Limitation: Not always interpretable, especially in high-dimensional spaces.

Elbow Method: Used to determine the optimal number of clusters (k) by plotting the sum of squared distances between points and their centroids for different k values.

Limitation: The "elbow" may not always be clear, requiring further domain knowledge.

Davies-Bouldin Index: Measures the average similarity ratio of each cluster with the cluster that is most similar to it. Lower values indicate better clustering.

Limitation: May not work well with irregularly shaped clusters.
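
To show what the Silhouette Score rewards, the sketch below computes it from scratch for two labelings of the same points: one that matches the obvious groups and one that mixes them. The points and labelings are hypothetical; with scikit-learn installed, `sklearn.metrics.silhouette_score` provides the same measure.

```python
import math


def mean_silhouette(points, labels):
    """Mean of s = (b - a) / max(a, b) over all points (needs >= 2 clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes s = 0 by convention
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        total += (b - a) / max(a, b)
    return total / len(points)


# Two tight, well-separated groups of hypothetical 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

good = mean_silhouette(points, [0, 0, 0, 1, 1, 1])  # matches the groups
bad = mean_silhouette(points, [0, 1, 0, 1, 0, 1])   # mixes the groups
print(f"good labeling: {good:.3f}, mixed labeling: {bad:.3f}")
```

The matching labeling scores close to +1, while the mixed one drops to around zero because several points sit nearer to the other cluster than to their own.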

Methods:

Visualizations: Plot clusters in 2D/3D using dimensionality reduction techniques (PCA, t-SNE) to evaluate how well-separated the clusters are.

Internal Validation: Use internal metrics like Silhouette Score to assess the quality without ground truth labels.

Challenges:

No Ground Truth: Unlike supervised learning, we don’t have predefined labels for validation.

Cluster Interpretability: Understanding the meaning of clusters may require domain knowledge.

Scalability: Clustering can be computationally expensive for large datasets.

**3. Reinforcement Learning Model (e.g., Q-Learning for Maze Navigation)**

Evaluation Strategy

Goal: To train an agent to maximize cumulative reward and navigate a maze effectively.

Metrics:

Cumulative Reward: The total reward accumulated by the agent over an episode or multiple episodes. Higher cumulative reward indicates better performance.

Limitation: Reward may not always correlate directly with task success (e.g., finding the shortest path may not always maximize reward).

Convergence: Measures how quickly and effectively the agent converges to the optimal policy. A model is considered successful when the learning curve flattens and the agent reaches near-optimal behavior.

Limitation: Convergence time can be long, especially in complex environments.

Exploration vs. Exploitation: Tracks how well the agent balances exploring new actions and exploiting known actions to maximize rewards. An optimal agent explores enough to discover better strategies but also exploits learned strategies for high rewards.

Limitation: It can be difficult to tune the exploration/exploitation balance effectively.

Success Rate: The percentage of episodes where the agent reaches the goal or successfully navigates the maze.

Limitation: In some cases, success might be defined differently (e.g., the shortest path rather than reaching the goal).

Methods:

Learning Curves: Track the agent’s cumulative reward or success rate over time to evaluate its improvement.

Policy Evaluation: Evaluate the learned policy (action-selection strategy) by simulating episodes and calculating performance metrics.
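
A learning curve is usually smoothed before it is read, because per-episode rewards are noisy. The sketch below applies a trailing moving average to hypothetical episode logs (the reward values and goal flags are invented) and computes the success rate.

```python
def moving_average(values, window):
    """Trailing moving average; early entries average the values seen so far."""
    averaged = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value leaving the window
        averaged.append(running / min(i + 1, window))
    return averaged


def success_rate(reached_goal):
    """Fraction of episodes in which the agent reached the goal."""
    return sum(reached_goal) / len(reached_goal)


# Hypothetical per-episode logs, for illustration only.
episode_rewards = [-50, -42, -30, -12, -5, 2, 4, 5, 5, 5]
reached_goal = [False, False, False, True, True, True, True, True, True, True]

curve = moving_average(episode_rewards, window=3)
rate = success_rate(reached_goal)
print([round(c, 2) for c in curve], rate)
```

A curve that rises and then flattens, as in this made-up log, is the pattern described under Convergence above.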

Challenges:

Sample Efficiency: Reinforcement learning often requires a lot of training data and interactions, making it computationally expensive.

Exploration-Exploitation Trade-Off: It’s challenging to fine-tune this balance, and poor exploration can lead to suboptimal policies.

Stochasticity: Environments with randomness can make performance evaluation harder, as the agent may not behave consistently in every episode.