# Exercises XP: Student Notebook

For each exercise, the **Instructions** from the platform are provided, and the **Guidance** explains exactly what you must do to complete the task.

## What you will learn
- How to clearly define and articulate a machine learning problem statement.

- The process of data collection, including identifying relevant data types and potential data sources.

- Skills in feature selection and justification for machine learning models, particularly in the context of loan default prediction.

- Understanding of different types of machine learning models and their suitability for various real-world scenarios.

- Techniques and strategies for evaluating the performance of different machine learning models, including choosing appropriate metrics and understanding their implications.

## What you will create
- A detailed problem statement and data collection plan for a loan default prediction project, including identification of key data types and sources.
- A comprehensive feature selection analysis for a hypothetical loan default prediction dataset.
- A theoretical evaluation strategy for three different types of machine learning models, addressing the unique challenges and metrics relevant to each model type.
- Thoughtful analyses and justifications for choosing specific machine learning approaches for varied scenarios such as stock price prediction, library organization, and robot navigation.
- A document or presentation that showcases your understanding and approach to evaluating and optimizing machine learning models in diverse contexts.

## 🌟 Exercise 1: Defining the Problem and Data Collection for Loan Default Prediction

### Instructions
- Write a clear problem statement for predicting loan defaults.
- Identify and list the types of data you would need for this project (e.g., personal details of applicants, credit scores, loan amounts, repayment history).
- Discuss the sources where you can collect this data (e.g., financial institution's internal records, credit bureaus).

**Expected Output:** A document detailing the problem statement and a comprehensive plan for data collection, including data types and sources.

### Guidance
- Please write your answer as a short document. Begin by stating the prediction objective in a complete sentence that names the target variable and the decision it will support. Then, describe the data types you would collect in complete sentences. For each data type, explain in one sentence why it could help predict loan defaults.

- After that, name realistic data sources in complete sentences, and briefly describe how you would obtain or integrate each source.

- Finally, include one paragraph that explains risks and constraints such as privacy, regulation, data quality, sampling bias, and governance.

### Your answer
Problem statement (objective and decision)

We will predict, at or before loan approval, the probability that a loan defaults, where default means the loan reaches 90 days past due (target variable: default_90dpd, 1 = default, 0 = no default), so that credit officers can approve, price, or decline applications and set appropriate credit limits and monitoring rules.
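
As a minimal sketch of how this target could be materialized, the snippet below derives `default_90dpd` from a per-loan delinquency summary. The table, the `max_days_past_due` column, and the `>= 90` cutoff are assumptions made for illustration; they are not part of any real dataset used in this notebook.

```python
import pandas as pd

# Hypothetical per-loan summary; the values are invented for illustration.
loans = pd.DataFrame({
    "loan_id": ["L1", "L2", "L3", "L4"],
    "max_days_past_due": [0, 35, 90, 120],
})

# Assumption: a loan counts as defaulted once it has been 90 or more days past due.
loans["default_90dpd"] = (loans["max_days_past_due"] >= 90).astype(int)

print(loans)
```

In practice the delinquency summary would have to be computed over a fixed observation window after origination, so that every loan has had the same opportunity to default.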

Data to collect (what and why)
	1.	Applicant demographics (age, residency status, marital status, dependents). These features provide context for stability and potential financial obligations that can correlate with repayment risk.
	2.	Employment details (employment status, tenure, occupation, employer sector). Job stability and industry cyclicality are predictive of income continuity and default likelihood.
	3.	Income and affordability (gross/net monthly income, other income sources, DTI ratio). Ability-to-pay signals, especially debt-to-income, directly relate to the capacity to service new debt.
	4.	Credit bureau history (credit score, number of active trades, utilization, delinquencies, inquiries, bankruptcies). Past credit behavior is a strong leading indicator of future repayment performance.
	5.	Loan application characteristics (amount, term, product type, interest rate, collateral, purpose). Loan structure and purpose affect payment burden and incentives to repay.
	6.	Banking behavior (account balances, inflow/outflow volatility, returned payments). Cash-flow patterns capture real affordability and financial stress beyond stated income.
	7.	Repayment history on internal products (previous loans/cards with the institution, days past due, restructurings). In-house historical performance provides high-fidelity signals about the applicant's behavior with this lender.
	8.	Collateral/guarantor details (collateral value/LTV, guarantor credit strength). Security and support reduce loss severity and can lower default propensity.
	9.	Application channel and fraud indicators (channel type, device fingerprint, mismatch flags, velocity). Certain channels and inconsistencies correlate with higher first-payment default and fraud risk.
	10.	Macroeconomic context (unemployment rate, inflation, interest rates, region). External conditions influence default rates and help the model generalize across cycles.

Data sources and integration approach
	1.	Financial institution internal systems (LOS/LMS, core banking, collections). Extract application, loan, and repayment data via batch ETL or streaming into the data warehouse; join on customer ID and loan ID.
	2.	Credit bureaus. Pull scores and tradeline attributes through bureau APIs/SFTP with applicant consent; store snapshots with bureau date to manage staleness.
	3.	Open Banking / bank-statement providers (with consent). Ingest categorized transactions and balance histories via secure APIs to compute cash-flow and affordability features.
	4.	Fraud and identity verification vendors. Integrate device, document, and consortium fraud signals through real-time APIs to enrich first-party data.
	5.	Collateral valuation sources (internal appraisals or third-party AVMs). Load valuations and compute LTV metrics alongside loan terms.
	6.	Public or benchmark datasets for prototyping (e.g., historical lending datasets) and macroeconomic data (national statistics or central bank feeds). Use these for feature engineering ideas and macro join-keys by date/region; keep distinct from production data and document provenance.
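
The idea of storing bureau snapshots with their pull date can be illustrated with a point-in-time join. The sketch below uses `pandas.merge_asof` to attach, for each application, the most recent bureau snapshot dated on or before the application date. All table names, column names, and values are hypothetical.

```python
import pandas as pd

# Hypothetical application records from internal systems.
applications = pd.DataFrame({
    "customer_id": [1, 2, 1],
    "application_date": pd.to_datetime(["2023-03-01", "2023-03-05", "2023-06-01"]),
    "loan_amount": [5000, 12000, 3000],
})

# Hypothetical bureau snapshots, each stamped with the date it was pulled.
bureau = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "bureau_date": pd.to_datetime(["2023-02-15", "2023-05-20", "2023-03-10"]),
    "credit_score": [680, 700, 640],
})

# merge_asof requires both frames to be sorted by their time keys.
applications = applications.sort_values("application_date")
bureau = bureau.sort_values("bureau_date")

# For each application, take the latest snapshot at or before the application date.
joined = pd.merge_asof(
    applications,
    bureau,
    left_on="application_date",
    right_on="bureau_date",
    by="customer_id",
    direction="backward",
)

print(joined)
```

An application with no earlier snapshot gets a missing credit score rather than a value from the future, which is one way to keep look-ahead information out of the training data.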

Risks and constraints (privacy, quality, bias, governance)

This project must comply with privacy and data-protection laws (e.g., GDPR/CCPA and local regulations), obtain explicit customer consent for bureau and bank-statement pulls, and minimize/secure personally identifiable information via encryption, access controls, and retention limits. Data quality risks include missing or stale bureau pulls, duplicate identities, and inconsistent IDs across systems; we will implement validation rules, survivorship logic, and time-stamped snapshots. Sampling bias (e.g., only past approved applicants) can skew results; we will correct with reject inference or careful evaluation to avoid optimism. Fair-lending and discrimination risks require excluding protected attributes and testing for disparate impact, while maintaining model explainability for adverse-action notices. Finally, we will operate under model-risk management and governance: documenting assumptions, versioning data and code, monitoring drift/calibration, and establishing periodic re-training and review.
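
One simple way to test for disparate impact is to compare approval rates across groups. The sketch below computes the ratio of the lowest to the highest group approval rate on invented decisions. The group labels and data are hypothetical, and a real fair-lending review would involve considerably more than this single ratio.

```python
import pandas as pd

# Invented decisions for two hypothetical groups, used only for testing model outputs.
decisions = pd.DataFrame({
    "group": ["A"] * 5 + ["B"] * 5,
    "approved": [1, 1, 1, 1, 0, 1, 1, 0, 0, 0],
})

# Approval rate per group, then the ratio of the lowest rate to the highest.
rates = decisions.groupby("group")["approved"].mean()
impact_ratio = rates.min() / rates.max()

print(rates)
print("Disparate impact ratio:", round(impact_ratio, 2))
```

A ratio well below 1 signals that one group is approved much less often than another; a commonly cited heuristic flags ratios under 0.8 for closer review.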

## 🌟 Exercise 2: Feature Selection and Model Choice for Loan Default Prediction

### Instructions
From this dataset, identify which features might be most relevant for predicting loan defaults.
Justify your choice of features.

### Guidance
- First, identify the features that you believe are most relevant, and write their names in a sentence.
Then, provide a justification in complete sentences that explains how each selected feature relates to the likelihood of default.

- If you decide to exclude common features, write one sentence for each excluded feature to explain why it is not appropriate in this context.

- Conclude with two complete sentences that explain how you would encode categorical features and how you would impute missing values.

In [1]:

# This piece of code is already prefilled, run it to execute it and see the results.
# It provides a simple template you can modify while writing your justification.

import pandas as pd

# This placeholder DataFrame allows the cell to run even if you did not load a dataset yet.
example_columns = [
    "age","employment_length","annual_income","credit_score","loan_amount","interest_rate",
    "debt_to_income","num_delinquencies","num_open_accounts","total_utilization","home_ownership",
    "purpose","term","application_type","state","zip_code"
]
df = pd.DataFrame(columns=example_columns)

# Please replace this list with the actual columns that you select.
selected_features = [
    # e.g., "credit_score","debt_to_income","annual_income","loan_amount","interest_rate",
    # "employment_length","num_delinquencies","total_utilization"
]

print("You will now justify the selected features in complete sentences below.")

You will now justify the selected features in complete sentences below.


### Your justification
The features most relevant for predicting loan defaults (Loan_Status) are: ApplicantIncome, CoapplicantIncome, LoanAmount, Loan_Amount_Term, Credit_History, Dependents, Education, Self_Employed, Married, Gender, and Property_Area.

ApplicantIncome is relevant because higher or more stable income generally improves a borrower's ability to meet monthly repayment obligations, reducing default risk.
CoapplicantIncome matters because combined household income can significantly strengthen loan affordability and reduce financial stress.
LoanAmount is important because larger loan sizes increase the repayment burden, which may elevate the likelihood of default, especially for applicants with lower income.
Loan_Amount_Term affects default risk because the repayment period determines monthly installment size; shorter terms increase monthly payments and may raise default probability.
Credit_History is one of the strongest predictors because borrowers with a clean repayment record are statistically more likely to meet their obligations on a new loan.
Dependents contribute useful information because applicants with more dependents typically face higher household expenses, potentially worsening repayment capacity.
Education helps capture differences in long-term earning potential and financial stability, making it indirectly related to loan repayment performance.
Self_Employed status is relevant because self-employment incomes tend to be more volatile, which can increase repayment uncertainty.
Married status may impact financial stability, because dual-income households may be more resilient and better able to manage unexpected expenses.
Gender can be included because it is part of the dataset, but it should be used with caution, as it may not be ethically or legally appropriate in real credit models; however, in this academic dataset it may still contribute to prediction power.
Property_Area is relevant because urban, semi-urban, and rural areas differ in income opportunities, stability, and loan pricing, all of which influence repayment behavior.

Excluded Feature

Loan_ID is excluded because it is purely an identifier and contains no predictive value related to repayment behavior.

Encoding and Missing Values

Categorical features such as Gender, Married, Education, and Property_Area would be encoded using one-hot encoding to convert them into numerical indicators suitable for machine-learning algorithms.
Missing values in numerical variables such as LoanAmount or Loan_Amount_Term would be imputed using median imputation, while missing categorical values would be filled using the mode to preserve the most common category.
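
The encoding and imputation choices above can be sketched with scikit-learn. The toy rows below are invented for illustration and only reuse a few of the column names mentioned in the answer; this is one possible arrangement, not the only way to apply median/mode imputation and one-hot encoding.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy rows (invented) with missing values in numeric and categorical columns.
toy = pd.DataFrame({
    "LoanAmount": [120.0, np.nan, 200.0, 150.0],
    "Loan_Amount_Term": [360.0, 180.0, np.nan, 360.0],
    "Gender": pd.Series(["Male", "Female", np.nan, "Male"], dtype=object),
    "Property_Area": pd.Series(["Urban", "Rural", "Urban", np.nan], dtype=object),
})

numeric_cols = ["LoanAmount", "Loan_Amount_Term"]
categorical_cols = ["Gender", "Property_Area"]

# Mode imputation followed by one-hot encoding for the categorical columns.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer(
    transformers=[
        ("num", SimpleImputer(strategy="median"), numeric_cols),
        ("cat", categorical_steps, categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

X = preprocess.fit_transform(toy)
print(X.shape)
```

In a real workflow the transformer would be fitted on the training data only and then applied to validation and test data, so that medians and modes are not learned from held-out rows.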

## 🌟 Exercise 3: Training, Evaluating, and Optimizing the Model

### Instructions
Which model(s) would you pick for loan default prediction?
Outline the steps to evaluate the model's performance, mentioning specific metrics that would be relevant to evaluate the model.

### Guidance
- Begin by naming one or two candidate models in a complete sentence and explain why each model is suitable for this problem.

- Next, describe an evaluation plan in complete sentences that covers the data split, the cross-validation strategy, the metrics you will report, and how you will choose a decision threshold.

- Then, explain in complete sentences how you will address class imbalance using stratification, class weights, or resampling.

- Finally, state in one or two complete sentences how you would iterate on hyperparameters to improve performance while avoiding data leakage.

In [2]:

# This piece of code is already prefilled, run it to execute it and see the results.
# It demonstrates standard classification metrics for binary loan default prediction.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, average_precision_score, confusion_matrix, classification_report

# Please replace these placeholders with your true labels and predicted probabilities.
y_true = [0,1,0,1,0,0,1,0,1,0]            # placeholder labels
y_pred_proba = [0.05,0.80,0.10,0.65,0.20,0.15,0.70,0.30,0.85,0.25]  # placeholder probabilities

# You should set a decision threshold that reflects the precision-recall trade-off for your business case.
threshold = 0.5
y_pred = [1 if p >= threshold else 0 for p in y_pred_proba]

print("Accuracy:", round(accuracy_score(y_true, y_pred), 4))
print("Precision:", round(precision_score(y_true, y_pred, zero_division=0), 4))
print("Recall:", round(recall_score(y_true, y_pred, zero_division=0), 4))
print("F1-score:", round(f1_score(y_true, y_pred, zero_division=0), 4))
print("ROC-AUC:", round(roc_auc_score(y_true, y_pred_proba), 4))
print("PR-AUC (Average Precision):", round(average_precision_score(y_true, y_pred_proba), 4))
print("\nConfusion matrix:\n", confusion_matrix(y_true, y_pred))
print("\nClassification report:\n", classification_report(y_true, y_pred, zero_division=0))

Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1-score: 1.0
ROC-AUC: 1.0
PR-AUC (Average Precision): 1.0

Confusion matrix:
 [[6 0]
 [0 4]]

Classification report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00         6
           1       1.00      1.00      1.00         4

    accuracy                           1.00        10
   macro avg       1.00      1.00      1.00        10
weighted avg       1.00      1.00      1.00        10



### Your answer
I would select Logistic Regression and Random Forest Classifier as the two primary candidate models. Logistic Regression is suitable because it is interpretable, works well on structured tabular data, and usually provides reasonably well-calibrated probability estimates, which are essential for credit-risk decisions. Random Forest is suitable because it can capture nonlinear relationships and interactions between variables, is relatively robust to outliers (missing values may still need imputation, depending on the implementation), and typically performs strongly on medium-sized structured datasets like loan applications.

To evaluate these models, I would begin by splitting the dataset into training (70–80%) and testing (20–30%) sets using a stratified split to preserve the proportion of approved versus rejected loans. I would use k-fold cross-validation (typically k=5 or k=10) on the training set to obtain stable performance estimates and reduce variance. The main evaluation metrics would include accuracy, precision, recall, F1-score, ROC-AUC, and PR-AUC, because these metrics collectively capture both overall correctness and the ability to detect high-risk borrowers. I would choose the final decision threshold by analyzing the precision-recall trade-off and selecting a cutoff that aligns with business priorities, for example prioritizing recall if the institution wants to minimize approving risky customers.
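
The threshold choice described above can be sketched with `precision_recall_curve`. The labels and scores reuse the placeholder values from the prefilled cell, and the recall target of 0.9 is a hypothetical business requirement chosen only for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Placeholder labels and predicted probabilities (same values as the prefilled cell).
y_true = np.array([0, 1, 0, 1, 0, 0, 1, 0, 1, 0])
y_score = np.array([0.05, 0.80, 0.10, 0.65, 0.20, 0.15, 0.70, 0.30, 0.85, 0.25])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# Assumption: the business requires at least 90% recall on defaulters.
target_recall = 0.9

# precision and recall have one more entry than thresholds, so drop the last point.
meets_recall = recall[:-1] >= target_recall
candidate_precision = np.where(meets_recall, precision[:-1], -1.0)

# Among thresholds that meet the recall target, keep the most precise one.
best = int(np.argmax(candidate_precision))
chosen_threshold = thresholds[best]
y_pred = (y_score >= chosen_threshold).astype(int)

print("Chosen threshold:", chosen_threshold)
```

To avoid an optimistic estimate, the threshold should be selected on validation data (or cross-validated predictions) and only then applied to the test set.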

To address class imbalance, I would apply stratified sampling during all train/test splits and cross-validation folds to maintain consistent class proportions. Additionally, I would experiment with class weights (for example, class_weight="balanced") during model training or apply resampling techniques such as SMOTE or random under-sampling, fitted on the training folds only so that validation and test data keep their original class distribution, to ensure that minority-class signals are learned effectively.

Finally, I would tune hyperparameters using grid search or randomized search applied only to the training folds, ensuring that the test set remains untouched to avoid data leakage. Hyperparameter tuning would include parameters such as regularization strength for Logistic Regression or tree depth and number of estimators for Random Forest, iteratively improving performance without exposing the model to unseen test labels.
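
A minimal sketch of this tuning setup is shown below, using synthetic imbalanced data from `make_classification` in place of a real loan dataset. The grid values, the choice of `average_precision` as the selection metric, and the random seeds are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a loan dataset with roughly 15% positives.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.85, 0.15], random_state=42
)

# Stratified hold-out split; the test set is not touched during tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling lives inside the pipeline, so it is re-fitted on each training fold only.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

search = GridSearchCV(
    pipe,
    param_grid={"clf__C": [0.01, 0.1, 1.0, 10.0]},
    scoring="average_precision",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X_train, y_train)

# Single final evaluation on the untouched test set.
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print("Best C:", search.best_params_["clf__C"])
print("Test ROC-AUC:", round(test_auc, 4))
```

Keeping every preprocessing step inside the pipeline is what prevents leakage here: nothing is fitted on validation folds or on the test set.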

## 🌟 Exercise 4: Designing Machine Learning Solutions for Specific Problems

### Instructions
For each of these scenarios, decide which type of machine learning would be most suitable, and explain why.

- Predicting stock prices: predict future prices.
- Organizing a library of books: group books into genres or categories based on similarities.
- Programming a robot to navigate and find the shortest path in a maze.

### Guidance
Please identify the appropriate machine learning paradigm for each scenario in complete sentences and justify your choice.

For each scenario, write one complete sentence that describes the input data, one complete sentence that describes the output, and one complete sentence that describes the learning signal or objective.

### Your answer
1. Predicting Stock Prices

Machine-learning paradigm:
This problem is best solved using supervised learning, because we have historical labeled data where past prices are known and the model is trained to predict future numeric values.

Input data:
The input data would consist of historical features such as past stock prices, trading volume, technical indicators, and possibly external financial variables.

Output:
The output would be a continuous numerical prediction of the stock's future price for a given time horizon.

Learning signal / objective:
The learning objective would be to minimize a regression loss function such as mean squared error so that predicted future prices closely match the true future prices.
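
As a rough illustration of this supervised regression framing, the sketch below builds lagged-price features from a synthetic random walk and fits a linear model with a time-ordered split. The data is simulated, and the lag count and split ratio are arbitrary assumptions; it is not a realistic trading model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic price series (random walk); real prices would replace this.
rng = np.random.default_rng(0)
prices = 100 + np.cumsum(rng.normal(0, 1, 300))

# Each row holds the previous n_lags prices; the target is the next price.
n_lags = 5
X = np.column_stack(
    [prices[i:len(prices) - n_lags + i] for i in range(n_lags)]
)
y = prices[n_lags:]

# Time-ordered split: train on the past, test on the most recent 20%.
split = int(0.8 * len(y))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

model = LinearRegression().fit(X_train, y_train)
mse_model = mean_squared_error(y_test, model.predict(X_test))

# Naive baseline: predict that the next price equals the last observed price.
mse_naive = mean_squared_error(y_test, X_test[:, -1])

print("Model MSE:", round(mse_model, 4))
print("Naive MSE:", round(mse_naive, 4))
```

Comparing against the naive baseline matters because, on series that behave like a random walk, a model may not beat "tomorrow equals today" by much, if at all.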

---

2. Organizing a Library of Books

Machine-learning paradigm:
This scenario is well suited for unsupervised learning, because the goal is to automatically group books into categories based on inherent similarities without requiring predefined labels.

Input data:
The input data would include text descriptions, book summaries, metadata, or embeddings derived from book content.

Output:
The output would be clusters of books where each group contains items that are similar in theme, style, or topic.

Learning signal / objective:
The learning objective is to identify natural structure in the data and maximize within-cluster similarity while maximizing separation between clusters.
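
A minimal sketch of this unsupervised approach is shown below, using TF-IDF vectors and K-Means on a handful of invented book descriptions. The texts and the choice of two clusters are assumptions for illustration only.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented one-line book descriptions; a real library would supply summaries or metadata.
descriptions = [
    "A detective investigates a murder in a quiet village",
    "A private investigator hunts a killer through the city",
    "A police inspector solves a murder on a train",
    "A starship crew explores a distant planet",
    "Colonists on a distant planet meet an alien species",
    "A robot and a starship pilot travel between galaxies",
]

# Turn each description into a TF-IDF vector, ignoring common English stop words.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(descriptions)

# Assumption: two clusters; in practice the number of clusters is chosen with diagnostics.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

for label, text in zip(labels, descriptions):
    print(label, text)
```

No genre labels are supplied anywhere; the grouping comes only from word overlap between descriptions, which is why the resulting clusters still need human inspection.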

---

3. Programming a Robot to Navigate a Maze

Machine-learning paradigm:
This scenario is best addressed using reinforcement learning, because the robot must learn through trial and error to choose actions that lead to the shortest path and maximize cumulative reward.

Input data:
The input data would consist of the robot's sensory observations or its current position within the maze environment.

Output:
The output would be an action at each step (such as moving forward, turning left, or turning right) that determines how the robot navigates.

Learning signal / objective:
The learning objective is to maximize rewards by reaching the goal efficiently, which encourages the robot to learn optimal policies for finding the shortest path.
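
The trial-and-error idea can be sketched with tabular Q-learning on a tiny grid maze. The maze layout, the reward values (-1 per step, +10 at the goal), and the hyperparameters below are all assumptions chosen for illustration, not a prescribed design.

```python
import numpy as np

# Hypothetical 4x4 maze: 0 = free cell, 1 = wall.
maze = np.array([
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 0],
])
start, goal = (0, 0), (3, 3)
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def step(state, action):
    """Apply an action; walls and borders leave the robot in place."""
    r, c = state[0] + moves[action][0], state[1] + moves[action][1]
    if not (0 <= r < 4 and 0 <= c < 4) or maze[r, c] == 1:
        r, c = state
    if (r, c) == goal:
        return (r, c), 10.0, True
    return (r, c), -1.0, False


rng = np.random.default_rng(0)
Q = np.zeros((4, 4, 4))  # one value per (row, column, action)
alpha, gamma, epsilon = 0.5, 0.95, 0.1

for episode in range(2000):
    state = start
    for t in range(100):
        # Epsilon-greedy: explore occasionally, otherwise act greedily.
        if rng.random() < epsilon:
            action = int(rng.integers(4))
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, done = step(state, action)
        # Q-learning update; no bootstrapping from the terminal state.
        target = reward if done else reward + gamma * np.max(Q[next_state])
        Q[state][action] += alpha * (target - Q[state][action])
        state = next_state
        if done:
            break

# Follow the learned greedy policy from the start cell.
state, path = start, [start]
for t in range(20):
    state, reward, done = step(state, int(np.argmax(Q[state])))
    path.append(state)
    if done:
        break

print("Greedy path:", path)
print("Steps taken:", len(path) - 1)
```

The per-step penalty is what pushes the agent toward short paths: every extra move lowers the cumulative reward, so the highest-value policy is also the shortest route.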

## 🌟 Exercise 5: Designing an Evaluation Strategy for Different ML Models

### Instructions
- Select three types of machine learning models: one from supervised learning (e.g., a classification model), one from unsupervised learning (e.g., a clustering model), and one from reinforcement learning.
- For the supervised model, outline a strategy to evaluate its performance, including the choice of metrics (like accuracy, precision, recall, F1-score) and methods (like cross-validation, ROC curves).
- For the unsupervised model, describe how you would assess the effectiveness of the model, considering techniques like silhouette score, elbow method, or cluster validation metrics.
- For the reinforcement learning model, discuss how you would measure its success, considering aspects like cumulative reward, convergence, and exploration vs. exploitation balance.
- Address the challenges and limitations of evaluating models in each category.

### Guidance
- Please write a separate paragraph for each of the three model categories.
- In the supervised paragraph, describe your validation plan and list the metrics you will report in complete sentences.
- In the unsupervised paragraph, explain how you would measure cluster quality or structure in complete sentences and mention any diagnostic plots.
- In the reinforcement learning paragraph, describe how you would track cumulative reward, assess convergence, and balance exploration and exploitation using complete sentences.
- Conclude with one complete sentence per category that states a key evaluation challenge.

In [3]:

# This piece of code is already prefilled, run it to execute it and see the results.
# Supervised classification metrics template with placeholders.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, average_precision_score

# Replace these placeholders with your real outputs.
y_true = [0,1,1,0,1,0,0,1,0,1]
y_pred_proba = [0.1,0.7,0.8,0.2,0.6,0.3,0.4,0.9,0.2,0.85]
threshold = 0.5
y_pred = [1 if p >= threshold else 0 for p in y_pred_proba]

print("Accuracy:", round(accuracy_score(y_true, y_pred), 4))
print("Precision:", round(precision_score(y_true, y_pred, zero_division=0), 4))
print("Recall:", round(recall_score(y_true, y_pred, zero_division=0), 4))
print("F1-score:", round(f1_score(y_true, y_pred, zero_division=0), 4))
print("ROC-AUC:", round(roc_auc_score(y_true, y_pred_proba), 4))
print("PR-AUC (Average Precision):", round(average_precision_score(y_true, y_pred_proba), 4))

Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1-score: 1.0
ROC-AUC: 1.0
PR-AUC (Average Precision): 1.0


In [4]:

# This piece of code is already prefilled, run it to execute it and see the results.
# Unsupervised clustering metrics template with synthetic data.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, n_init="auto", random_state=42)
labels = kmeans.fit_predict(X)
sil = silhouette_score(X, labels)
print("Silhouette score (higher is better):", round(sil, 4))

print("Please explain in complete sentences when you would use the elbow method and how you would interpret it.")

Silhouette score (higher is better): 0.848
Please explain in complete sentences when you would use the elbow method and how you would interpret it.


### Your answer
Supervised Learning (e.g., Classification Model)

For a supervised learning model such as a loan default classifier, I would evaluate performance by splitting the dataset into training and testing sets using stratified sampling, and then applying k-fold cross-validation on the training set to obtain stable and unbiased performance estimates. I would report several evaluation metrics including accuracy for overall correctness, precision to measure the reliability of identifying defaulters, recall to measure how many true defaulters are captured, and F1-score to summarize the balance between precision and recall. I would also generate a ROC curve to visualize the true-positive/false-positive trade-off and compute the ROC-AUC score as a threshold-independent measure of model discrimination. A key challenge in evaluating supervised models is dealing with class imbalance, which can cause misleading accuracy scores and require careful threshold tuning.
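
The validation plan above can be sketched with `cross_validate`, which reports several metrics per fold in one pass. The data is synthetic, and the model settings and seeds are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic imbalanced data standing in for a loan default dataset.
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.8, 0.2], random_state=0
)

# Stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = ["accuracy", "precision", "recall", "f1", "roc_auc", "average_precision"]

model = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=0
)
results = cross_validate(model, X, y, cv=cv, scoring=scoring)

# Mean and standard deviation across folds for each metric.
summary = {
    metric: (results[f"test_{metric}"].mean(), results[f"test_{metric}"].std())
    for metric in scoring
}
for metric, (mean, std) in summary.items():
    print(f"{metric}: {mean:.3f} +/- {std:.3f}")
```

Reporting the spread across folds alongside the mean makes it easier to judge whether a difference between two models is larger than fold-to-fold noise.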

Unsupervised Learning (e.g., Clustering Model)

For an unsupervised learning model such as K-Means clustering, I would assess effectiveness by computing internal validation metrics that quantify how well the discovered clusters represent meaningful structure. I would use the silhouette score to measure how similar each sample is to its assigned cluster relative to other clusters, and I would apply the elbow method by plotting the within-cluster sum of squared distances against the number of clusters and choosing the point where adding more clusters yields only small further reductions. I would also inspect cluster centroids or dendrograms (if using hierarchical clustering) to evaluate whether the cluster assignments align with intuitive or interpretable groupings. A central challenge in evaluating unsupervised models is the absence of ground-truth labels, which makes it difficult to determine whether the discovered structure is meaningful or merely an artifact of the algorithm.
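
The elbow method and silhouette comparison can be sketched on the same kind of synthetic blobs used in the prefilled cell. The range of candidate cluster counts is an arbitrary assumption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

inertias = {}
silhouettes = {}
for k in range(1, 7):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = kmeans.fit_predict(X)
    # Inertia is the within-cluster sum of squared distances used by the elbow method.
    inertias[k] = kmeans.inertia_
    # The silhouette score is only defined for two or more clusters.
    if k >= 2:
        silhouettes[k] = silhouette_score(X, labels)

for k in inertias:
    sil = f"{silhouettes[k]:.3f}" if k in silhouettes else "n/a"
    print(f"k={k}  inertia={inertias[k]:.1f}  silhouette={sil}")
```

Plotting the inertia values against k would show the elbow visually; the silhouette scores give a second, independent signal about which k separates the data best.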

Reinforcement Learning (e.g., Navigation or Control Problem)

For a reinforcement learning model such as a maze-navigation agent, I would measure performance by tracking the cumulative reward obtained across episodes, since higher cumulative reward indicates more successful policies. I would assess convergence by monitoring whether the reward curve stabilizes over time, suggesting that the agent has learned a near-optimal policy. I would also evaluate the balance between exploration and exploitation by analyzing the agent's behavior, such as the frequency of random exploratory actions versus policy-driven decisions, and adjusting parameters like epsilon in an epsilon-greedy strategy when necessary. A key challenge in evaluating reinforcement learning models is that performance can vary greatly between episodes, making it difficult to determine true convergence or stability.
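
The tracking described above can be sketched with a moving average over episode returns and an epsilon decay schedule. The return series below is simulated to stand in for a real training log, and the window size, plateau tolerance, and decay rate are hypothetical values.

```python
import numpy as np

rng = np.random.default_rng(0)
n_episodes = 500

# Simulated per-episode returns: a noisy curve that rises toward a plateau.
# A real agent's logged returns would replace this array.
episode_returns = 10 * (1 - np.exp(-np.arange(n_episodes) / 80)) + rng.normal(
    0, 1.5, n_episodes
)

# Moving average smooths episode-to-episode noise in the reward curve.
window = 50
moving_avg = np.convolve(episode_returns, np.ones(window) / window, mode="valid")

# Hypothetical plateau check: the last two windows should have similar means.
recent = episode_returns[-window:].mean()
previous = episode_returns[-2 * window:-window].mean()
looks_converged = abs(recent - previous) < 0.5

# Exponential epsilon decay: explore heavily at first, exploit more later.
eps_start, eps_end, decay = 1.0, 0.05, 0.99
epsilons = np.maximum(eps_end, eps_start * decay ** np.arange(n_episodes))

print("Last moving-average value:", round(moving_avg[-1], 3))
print("Looks converged:", bool(looks_converged))
print("Final epsilon:", epsilons[-1])
```

Because individual episodes vary a lot, running several seeds and comparing their smoothed curves gives a more trustworthy picture than a single training run.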