### 🌟 Exercise 1: Loan Default Prediction

Problem Statement
The goal is to predict whether a loan applicant will repay the loan or default. An accurate prediction helps the bank reduce credit risk and make better lending decisions.

Data Needed

- Personal details (age, job, education, marital status).
- Financial data (income, expenses, savings).
- Credit history (credit score, past payments, other loans).
- Loan details (amount, duration, interest rate).

Data Sources

- Bank internal records.
- Credit bureaus.
- Customer payment history.

In [5]:
import pandas as pd

### 🌟 Exercise 2: Feature Selection and Model Choice for Loan Default Prediction

In [6]:
df = pd.read_csv('train_u6lujuX_CVtuZ9i (1).csv')
df.head()

|   | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
| 1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| 3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
| 4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |


Loan_ID – only an identifier with no predictive signal; drop it.

Strong features (keep)

- Credit_History – past payment behavior → typically the strongest signal of default.
- ApplicantIncome, CoapplicantIncome – ability to pay.
- LoanAmount, Loan_Amount_Term – size and length of the debt.
- Dependents – more dependents → higher expenses.
- Married, Education, Self_Employed – job stability and income risk.
- Property_Area – urban/rural differences in income and costs.
- Gender – usually weak; keep it for an initial test and drop it if it does not improve validation performance.

Simple feature engineering (often helps)

- TotalIncome = ApplicantIncome + CoapplicantIncome – total money coming in.
- EMI = LoanAmount / Loan_Amount_Term – rough monthly payment (ignores interest).
- DTI = EMI / TotalIncome – debt-to-income ratio (high DTI → higher risk).
- LoanAmountLog = log(LoanAmount) – reduces right skew.

Why, in short:
Income, loan size and term, and credit history show whether the applicant can pay and whether they have paid before. Family, education, employment, and area indicate stability and cost of living. Engineered ratios (EMI, DTI) directly measure the payment burden.
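The engineered features above can be sketched in pandas as follows. This is a minimal illustration, not a definitive implementation: the helper name `add_engineered_features` is hypothetical, and the three sample rows are copied from the `df.head()` preview. If LoanAmount and the income columns are recorded in different units, DTI is only meaningful as a relative score.

```python
import numpy as np
import pandas as pd


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with TotalIncome, EMI, DTI and LoanAmountLog added.

    Hypothetical helper for illustration; column names follow the dataset preview.
    """
    out = df.copy()
    out["TotalIncome"] = out["ApplicantIncome"] + out["CoapplicantIncome"]
    # Rough payment per term unit: ignores interest, as in the definition above.
    out["EMI"] = out["LoanAmount"] / out["Loan_Amount_Term"]
    # Replace a zero income with NaN so the ratio is missing rather than infinite.
    out["DTI"] = out["EMI"] / out["TotalIncome"].replace(0, np.nan)
    # Natural log compresses the long right tail of loan amounts.
    out["LoanAmountLog"] = np.log(out["LoanAmount"])
    return out


# First three rows of the preview (LoanAmount is missing in row 0).
sample = pd.DataFrame({
    "ApplicantIncome": [5849, 4583, 3000],
    "CoapplicantIncome": [0.0, 1508.0, 0.0],
    "LoanAmount": [np.nan, 128.0, 66.0],
    "Loan_Amount_Term": [360.0, 360.0, 360.0],
})
features = add_engineered_features(sample)
print(features[["TotalIncome", "EMI", "DTI", "LoanAmountLog"]])
```

Missing inputs propagate as NaN (row 0 has no EMI, DTI, or LoanAmountLog), so these features should be computed before or together with the imputation step, depending on how missing loan amounts are to be treated.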

### 🌟 Exercise 3: Training, Evaluating, and Optimizing the Model

Candidate models

- Logistic Regression (baseline, simple, explainable).
- Decision Tree / Random Forest (non-linear patterns, handles mixed numeric and categorical features).
- Gradient Boosting (e.g., XGBoost/LightGBM): often the strongest choice for tabular data.

Target: Loan_Status (Y/N). Handle class imbalance if needed (for example, with class weights or resampling).

Evaluation plan (steps)

1. Split data: train/validation/test (or a time-based split).
2. Preprocess:
   - Fill missing values; one-hot encode categoricals.
   - Create TotalIncome, EMI, DTI (optional).
3. Train a baseline (logistic regression).
4. Cross-validate (K-fold or stratified K-fold).
5. Metrics (focus on default detection):
   - ROC-AUC and PR-AUC (PR-AUC is the more informative of the two under class imbalance).
   - Recall (catch more defaults) and precision (avoid false alarms).
   - F1-score (balance of precision and recall) and the confusion matrix.
6. Threshold tuning: choose the cut-off that meets the business goal (e.g., higher recall).
7. Optimize: tune hyperparameters (grid/random/Bayesian search); try RF/GBM.
8. Explainability: feature importance / SHAP (why the model predicts what it does).
9. Final test: evaluate once on the hold-out test set; check calibration if you use the predicted probabilities.
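The split, preprocessing, baseline, and cross-validation steps above can be sketched with scikit-learn. This is a hedged sketch under stated assumptions: the data is synthetic (generated here, not the real loan file), only four columns are used, and label 1 is assumed to mean "default". `class_weight="balanced"` is one possible way to handle class imbalance, not the only one.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_validate, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (an assumption for illustration only).
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "ApplicantIncome": rng.lognormal(8.5, 0.5, n),
    "LoanAmount": rng.lognormal(4.8, 0.4, n),
    "Credit_History": rng.choice([0.0, 1.0], size=n, p=[0.2, 0.8]),
    "Property_Area": rng.choice(["Urban", "Semiurban", "Rural"], size=n),
})
# Label 1 = default (assumed); more likely when Credit_History is 0.
p_default = np.where(X["Credit_History"] == 0, 0.7, 0.15)
y = (rng.random(n) < p_default).astype(int)
# Remove some loan amounts so the numeric imputer has work to do.
X.loc[rng.random(n) < 0.05, "LoanAmount"] = np.nan

numeric_cols = ["ApplicantIncome", "LoanAmount", "Credit_History"]
categorical_cols = ["Property_Area"]

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])

# Hold out a stratified test set that is touched only once, at the end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Stratified K-fold keeps the default rate similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(
    model, X_train, y_train, cv=cv, scoring=["roc_auc", "average_precision"]
)
cv_roc_auc = scores["test_roc_auc"].mean()
cv_pr_auc = scores["test_average_precision"].mean()
print("CV ROC-AUC:", round(cv_roc_auc, 3))
print("CV PR-AUC :", round(cv_pr_auc, 3))

# Final check on the hold-out set.
model.fit(X_train, y_train)
test_proba = model.predict_proba(X_test)[:, 1]
print("Test ROC-AUC:", round(roc_auc_score(y_test, test_proba), 3))
print("Test PR-AUC :", round(average_precision_score(y_test, test_proba), 3))
```

Keeping imputation, scaling, and encoding inside the `Pipeline` means they are refit on each training fold, so no information from the validation fold leaks into preprocessing.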

Decision rule
Pick the model with the highest PR-AUC or the highest recall at the target precision (or the best F1) on validation data, provided the resulting trade-off is acceptable to the business.
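To make "recall at target precision" concrete, here is a small pure-Python sketch. The function name `recall_at_precision` and the toy labels and scores are made up for illustration, and label 1 is assumed to mean "default". It tries every observed score as a cut-off and keeps the one with the highest recall among those that meet the precision target.

```python
def recall_at_precision(y_true, scores, target_precision):
    """Return (threshold, precision, recall) with the highest recall among
    thresholds whose precision is at least target_precision, or None."""
    total_pos = sum(y_true)
    best = None
    for t in sorted(set(scores)):
        # Predict "default" when the score is at or above the threshold.
        tp = sum(1 for s, y in zip(scores, y_true) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, y_true) if s >= t and y == 0)
        precision = tp / (tp + fp)  # tp + fp >= 1 because t is an observed score
        recall = tp / total_pos
        if precision >= target_precision and (best is None or recall > best[2]):
            best = (t, precision, recall)
    return best


# Toy validation labels and predicted default probabilities (illustrative only).
y_true = [0, 0, 1, 0, 1, 0, 1, 1]
scores = [0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.90]

print(recall_at_precision(y_true, scores, target_precision=0.75))
# -> (0.55, 0.75, 0.75)
```

Two models can then be compared by the recall each achieves at the same precision target, which ties the choice to a business constraint rather than to the default 0.5 cut-off.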

### 🌟 Exercise 4: Designing Machine Learning Solutions for Specific Problems

Predicting Stock Prices → Supervised Learning (Regression)
Why: past prices serve as numeric labeled targets, so we can learn a function that predicts the next price.
Examples: Linear Regression, Random Forest Regressor, ARIMA/LSTM (time series).
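As a minimal sketch of the regression framing, the code below turns a price series into (previous prices → next price) pairs and fits ordinary least squares with NumPy. The helper names `make_lagged`, `fit_linear`, and `predict` are hypothetical, and the price series is a made-up linear trend, not real market data, so the near-perfect fit says nothing about real forecasting accuracy.

```python
import numpy as np


def make_lagged(prices, n_lags):
    """Build (X, y): each row of X holds n_lags consecutive prices, y is the next one."""
    prices = np.asarray(prices, dtype=float)
    X = np.column_stack(
        [prices[i:len(prices) - n_lags + i] for i in range(n_lags)]
    )
    y = prices[n_lags:]
    return X, y


def fit_linear(X, y):
    """Ordinary least squares with an intercept column appended."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict(coef, X):
    X = np.atleast_2d(np.asarray(X, dtype=float))
    return np.column_stack([X, np.ones(len(X))]) @ coef


# Made-up series: starts at 100 and rises 0.5 per step (illustration only).
prices = [100.0 + 0.5 * t for t in range(30)]
X, y = make_lagged(prices, n_lags=3)

# Time-ordered split: train on the earlier rows, test on the later ones.
split = int(0.8 * len(X))
coef = fit_linear(X[:split], y[:split])

test_mae = float(np.mean(np.abs(predict(coef, X[split:]) - y[split:])))
next_price = float(predict(coef, prices[-3:])[0])
print("test MAE:", round(test_mae, 6))
print("next price:", round(next_price, 3))
```

The split is chronological rather than random because shuffling would let the model train on prices that come after the ones it is tested on.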

Organizing a Library of Books → Unsupervised Learning (Clustering/Topic Modeling)
Why: the books have no predefined labels, so we group them by similarity (words, topics).
Examples: K-Means, Hierarchical Clustering, LDA topic modeling.
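A small sketch of the clustering approach, assuming scikit-learn is available: book descriptions are converted to TF-IDF vectors and grouped with K-Means. The six descriptions are invented for illustration, and `n_clusters=2` is chosen only because the toy data has two obvious topics.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented book descriptions (illustration only): three on cooking, three on space.
descriptions = [
    "bread baking recipes flour oven",
    "pasta sauce recipes cooking oven",
    "soup cooking recipes kitchen",
    "planets stars galaxy telescope",
    "galaxy black holes stars space",
    "space telescope planets orbit",
]

# TF-IDF turns each description into a weighted word-count vector.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(descriptions)

# K-Means groups the vectors without using any labels.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

for label, text in zip(labels, descriptions):
    print(label, text)
```

The cluster numbers themselves carry no meaning (cluster 0 may be either topic); only which books end up together matters.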

Robot in a Maze (shortest path) → Reinforcement Learning
Why: an agent tries actions, receives a reward for reaching the exit (larger for shorter paths), and learns a policy from that feedback.
Examples: Q-Learning, Deep Q-Network (DQN).
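The following is a minimal tabular Q-learning sketch for a tiny grid maze. Everything specific here is an assumption made for illustration: the 4x4 layout, the rewards (-1 per step, +10 at the goal), and the hyperparameters. The step penalty is what pushes the agent toward short paths.

```python
import random

# Assumed toy maze: S = start, G = goal, # = wall, . = free cell.
GRID = [
    "S...",
    ".##.",
    ".#G.",
    "....",
]
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def find(ch):
    for r, row in enumerate(GRID):
        c = row.find(ch)
        if c != -1:
            return (r, c)
    raise ValueError(ch)


START, GOAL = find("S"), find("G")


def step(state, action):
    """Move one cell; walls and borders leave the agent where it is."""
    r, c = state[0] + action[0], state[1] + action[1]
    if not (0 <= r < len(GRID) and 0 <= c < len(GRID[0])) or GRID[r][c] == "#":
        r, c = state
    new_state = (r, c)
    reward = 10.0 if new_state == GOAL else -1.0
    return new_state, reward, new_state == GOAL


def best_action(q, state):
    return max(range(len(ACTIONS)), key=lambda i: q.get((state, i), 0.0))


def train(episodes=3000, alpha=0.5, gamma=0.9, epsilon=0.2, max_steps=200, seed=0):
    rng = random.Random(seed)
    q = {}  # (state, action index) -> value; missing entries count as 0.0
    for _ in range(episodes):
        state = START
        for _ in range(max_steps):
            # Epsilon-greedy: explore with probability epsilon, otherwise exploit.
            if rng.random() < epsilon:
                a = rng.randrange(len(ACTIONS))
            else:
                a = best_action(q, state)
            nxt, reward, done = step(state, ACTIONS[a])
            best_next = 0.0 if done else q.get((nxt, best_action(q, nxt)), 0.0)
            old = q.get((state, a), 0.0)
            # Q-learning update toward reward + discounted best next value.
            q[(state, a)] = old + alpha * (reward + gamma * best_next - old)
            state = nxt
            if done:
                break
    return q


def greedy_path(q, max_steps=50):
    """Follow the learned policy from START without exploration."""
    state, path = START, [START]
    for _ in range(max_steps):
        if state == GOAL:
            break
        state, _, _ = step(state, ACTIONS[best_action(q, state)])
        path.append(state)
    return path


q_table = train()
path = greedy_path(q_table)
print(len(path) - 1, "steps:", path)
```

In this layout the walls block the direct route, so the shortest path from S to G takes six moves; after training, the greedy policy is expected to follow one of the two six-move routes.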

### 🌟 Exercise 5: Designing an Evaluation Strategy for Different ML Models

1. Supervised Model (Classification, e.g., Logistic Regression)

Metrics: accuracy, precision, recall, F1-score.

Methods: split data into train/test sets, use cross-validation, plot the ROC curve.

Challenge: imbalanced classes can make accuracy misleading.
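A short pure-Python illustration of why accuracy misleads on imbalanced data. The numbers are made up: 5 defaults (label 1) among 100 applicants, and a "model" that always predicts the majority class. The helper name `basic_metrics` is hypothetical, and precision and F1 are reported as 0.0 by convention when there are no positive predictions.

```python
def basic_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    # Convention: report 0.0 when a ratio's denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up imbalanced data: 5 defaults, 95 non-defaults.
y_true = [1] * 5 + [0] * 95
# A useless model that never predicts a default.
y_pred = [0] * 100

metrics = basic_metrics(y_true, y_pred)
print(metrics)
# -> {'accuracy': 0.95, 'precision': 0.0, 'recall': 0.0, 'f1': 0.0}
```

The model scores 95% accuracy while catching none of the defaults, which is why recall, precision, and F1 belong next to accuracy in the report.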

2. Unsupervised Model (Clustering, e.g., K-Means)

Metrics: silhouette score, inertia (within-cluster sum of squares, the quantity plotted in the elbow method).

Challenge: no true labels, hard to check if clusters are “good.”
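Both checks can be sketched with scikit-learn on synthetic data. The three well-separated blobs are an assumption made so that the "right" answer is known in advance; real data rarely gives such a clear signal.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (illustration only).
X, _ = make_blobs(
    n_samples=300,
    centers=[(0, 0), (10, 10), (-10, 10)],
    cluster_std=1.0,
    random_state=0,
)

results = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the within-cluster sum of squares used in the elbow plot.
    results[k] = (km.inertia_, silhouette_score(X, km.labels_))
    print(f"k={k}  inertia={results[k][0]:.1f}  silhouette={results[k][1]:.3f}")

# Pick the k with the highest silhouette score.
best_k = max(results, key=lambda k: results[k][1])
print("best k by silhouette:", best_k)
```

Inertia keeps falling as k grows, so the elbow is read as the point where the drop flattens, while the silhouette score has a maximum that can be selected directly.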

3. Reinforcement Learning Model (e.g., Q-learning)

Metrics: total (cumulative) reward per episode and how quickly the agent converges to a good policy. Also monitor the balance between exploration and exploitation.

Challenge: training can take many episodes, and results can change when the environment changes.