# TECHNICAL REPORT
## Credit Risk Assessment – Machine Learning Pipeline

## 1. Executive Summary

Financial institutions must accurately assess loan applicants to minimize financial losses caused by defaults. FinTech Solutions Inc. currently relies on a manual credit evaluation system that takes 3–5 days per application and produces inconsistent decisions.

This project develops an automated Machine Learning pipeline to predict whether a customer is likely to default on a loan. Multiple classification models were developed and compared, including Logistic Regression, Decision Tree, Random Forest, and XGBoost.

The primary business objective was to achieve Recall ≥ 0.75, ensuring that most high-risk customers are correctly identified.

After hyperparameter tuning using 5-fold cross-validation, XGBoost was selected as the final model. The solution includes MLflow tracking, model versioning, and a complete deployment strategy.

The system is designed for containerized cloud deployment and meets the stated business requirement of Recall ≥ 0.75.

## 2. Business Context

FinTech Solutions Inc. is a digital lending company that provides personal loans to customers. The company faces the following challenges:

* Manual credit assessment process

* Decision delays (3–5 days per application)

* Inconsistent risk evaluation

* Financial losses due to undetected defaulters

An automated ML-based system can significantly improve efficiency, consistency, and risk management.

## 3. Problem Statement

The objective is to build a classification model that assigns each loan applicant to one of two classes:

* Good Credit (Non-default)

* Bad Credit (Default)

The key business requirement is:

* Maintain Recall ≥ 0.75 for detecting defaulters.
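For reference, recall is the share of actual defaulters that the model flags. Using the standard confusion-matrix symbols $TP$ (defaulters correctly flagged) and $FN$ (defaulters missed), which are notation assumed here rather than defined in this report, the requirement can be written as:

```latex
\mathrm{Recall} = \frac{TP}{TP + FN} \geq 0.75
```

For example, if 100 applicants in an evaluation set actually default, the model must flag at least 75 of them.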

## 4. Dataset Description

The dataset contains customer financial and demographic information including:

- Age

- Sex

- Job

- Housing

- Saving Accounts

- Checking Account

- Credit Amount

- Duration

- Purpose

The target variable indicates whether the customer defaulted or not.

## 5. Exploratory Data Analysis (EDA)

EDA was performed to:

- Understand data distribution

- Identify missing values

- Detect outliers

- Analyze class imbalance

- Study relationships between features and target

Key observations:

- Moderate class imbalance exists

- Credit amount and duration influence default probability

- Some categorical variables required encoding
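The class-imbalance and missing-value checks can be illustrated with a small, library-free sketch. The rows below are made up for illustration, and the target column name `Risk` with values `good` and `bad` is an assumption, not something confirmed by the dataset description.

```python
from collections import Counter

def summarize(rows, target="Risk"):
    """Return class shares and per-column missing-value counts."""
    class_counts = Counter(row[target] for row in rows)
    total = sum(class_counts.values())
    class_share = {label: count / total for label, count in class_counts.items()}
    # Count None entries per column across all rows.
    missing = Counter(
        column for row in rows for column, value in row.items() if value is None
    )
    return class_share, dict(missing)

# Made-up rows for illustration only; not taken from the project dataset.
rows = [
    {"Credit Amount": 1200, "Duration": 12, "Saving Accounts": "little", "Risk": "good"},
    {"Credit Amount": 5900, "Duration": 48, "Saving Accounts": None, "Risk": "bad"},
    {"Credit Amount": 2100, "Duration": 18, "Saving Accounts": "moderate", "Risk": "good"},
    {"Credit Amount": 800, "Duration": 6, "Saving Accounts": "rich", "Risk": "good"},
]

class_share, missing = summarize(rows)
print(class_share)  # share of each class in the sample
print(missing)      # columns that contain missing values
```

In a real analysis the same two summaries would typically come from a dataframe library; the point here is only what is being measured.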

## 6. Data Preprocessing

The following preprocessing steps were applied:

- Handling missing values

- Encoding categorical variables (label encoding or one-hot encoding)

- Train-test split (80:20)

- Feature scaling using StandardScaler

- Ensuring no data leakage between the training and test sets
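One common way to keep the split and scaling steps leakage-free is to learn the scaling statistics from the training portion only and reuse them on the test portion. The sketch below shows that idea in plain Python; the helper names, the seed, and the sample data are hypothetical, and the project itself uses StandardScaler rather than these functions.

```python
import random
from statistics import mean, pstdev

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx), keeping the class ratio in both parts."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)

def fit_scaler(values):
    """Learn mean and population standard deviation from training values only."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # guard against constant columns
    return mu, sigma

def transform(values, mu, sigma):
    """Standardize values with statistics learned elsewhere."""
    return [(v - mu) / sigma for v in values]

# Illustrative data: credit amounts and default labels (1 = default).
amounts = [float(1000 + 250 * i) for i in range(20)]
labels = [1 if i % 4 == 0 else 0 for i in range(20)]

train_idx, test_idx = stratified_split(labels)
mu, sigma = fit_scaler([amounts[i] for i in train_idx])  # training data only
train_scaled = transform([amounts[i] for i in train_idx], mu, sigma)
test_scaled = transform([amounts[i] for i in test_idx], mu, sigma)  # reuses train statistics
```

The test values are never seen by `fit_scaler`, which is the property that prevents leakage through the scaler.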

## 7. Model Development

Four models were trained:

- Logistic Regression (Baseline)

- Decision Tree

- Random Forest

- XGBoost

Evaluation metrics used:

- Accuracy

- Precision

- Recall

- F1-score

- AUC-ROC
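To make the metric definitions concrete, here is a minimal sketch that computes them from scratch. The labels and scores are invented for illustration (1 = default), and the 0.5 decision threshold is an assumption; none of these numbers are project results.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = default)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

def roc_auc(y_true, scores):
    """AUC as the chance that a random defaulter outscores a random non-defaulter."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # Ties between a positive and a negative score count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented example: four defaulters followed by four non-defaulters.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

In this toy case one defaulter is missed and one non-defaulter is flagged, so the recall sits exactly at the 0.75 requirement.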

## 8. Hyperparameter Tuning

Hyperparameter tuning was performed using GridSearchCV with 5-fold cross-validation.

The goal was to optimize Recall while maintaining balanced overall performance.

XGBoost achieved the highest recall after tuning and satisfied the business requirement.
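The mechanics of a recall-scored grid search with 5-fold cross-validation can be sketched without any ML library. This is a simplified stand-in for GridSearchCV, not the project's tuning code: the one-feature cutoff model, its `shift` hyperparameter, and the data are all hypothetical and exist only so the loop runs end to end.

```python
from itertools import product
from statistics import mean

def stratified_folds(labels, k=5):
    """Assign indices to k folds, round-robin within each class."""
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        for position, i in enumerate(members):
            folds[position % k].append(i)
    return folds

def recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if tp + fn else 0.0

def grid_search_cv(fit, predict, X, y, param_grid, k=5):
    """Score every parameter combination by mean validation recall over k folds."""
    folds = stratified_folds(y, k)
    names = sorted(param_grid)
    results = []
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_scores = []
        for val_idx in folds:
            held_out = set(val_idx)
            train_idx = [i for i in range(len(y)) if i not in held_out]
            model = fit([X[i] for i in train_idx], [y[i] for i in train_idx], **params)
            preds = predict(model, [X[i] for i in val_idx])
            fold_scores.append(recall([y[i] for i in val_idx], preds))
        results.append((mean(fold_scores), params))
    best_score, best_params = max(results, key=lambda item: item[0])
    return best_params, best_score, results

# Hypothetical toy model: flag default when the feature exceeds a learned cutoff.
def fit_cutoff(X, y, shift=0.0):
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == 0]
    return (mean(pos) + mean(neg)) / 2 + shift

def predict_cutoff(cutoff, X):
    return [1 if x > cutoff else 0 for x in X]

# Invented loan durations: 10 defaulters followed by 20 non-defaulters.
X = [20 + 4 * i for i in range(10)] + [6 + 2 * i for i in range(20)]
y = [1] * 10 + [0] * 20

best_params, best_score, results = grid_search_cv(
    fit_cutoff, predict_cutoff, X, y, {"shift": [-10, 0, 10]}
)
```

Because the search is scored on recall alone, it favors the lowest cutoff. That is why the text above pairs recall with "balanced overall performance": in practice precision or F1 would be checked alongside the winning configuration.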

## 9. Final Model Selection

XGBoost was selected for the following reasons:

- Highest Recall score

- Strong generalization

- Handles non-linearity effectively

- Built-in regularization that helps limit overfitting

The final model met both acceptance criteria:

- ✔ Recall ≥ 0.75
- ✔ Stable performance across folds

## 10. MLflow Tracking and Experiment Management

MLflow was used for:

- Logging model parameters

- Logging evaluation metrics

- Tracking experiments

- Model versioning

This ensures reproducibility and proper model lifecycle management.

## 11. Deployment Plan

### Infrastructure

The model will be deployed on a cloud platform such as AWS EC2, Amazon SageMaker, or GCP.

Docker will be used for containerization to support scalability and portability.

### API Design

The model will be exposed through a REST API built with FastAPI or Flask.

- Input: JSON
- Output: Prediction + Default Probability
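The request and response contract can be sketched independently of the web framework. The handler below is a framework-free illustration: the JSON key names mirror the dataset fields but are assumptions, `score_applicant` is a hypothetical placeholder rule rather than the trained model, and the 0.5 threshold is also assumed.

```python
import json

# Assumed request keys, mirroring the dataset fields listed in this report.
REQUIRED_FIELDS = (
    "Age", "Sex", "Job", "Housing", "Saving Accounts",
    "Checking Account", "Credit Amount", "Duration", "Purpose",
)

def score_applicant(features):
    """Placeholder scorer (hypothetical); a real service would call the trained model."""
    return min(1.0, features["Credit Amount"] / 20000 + features["Duration"] / 100)

def handle_request(body, threshold=0.5):
    """Validate a JSON request body and return a JSON response string."""
    try:
        features = json.loads(body)
    except json.JSONDecodeError:
        return json.dumps({"error": "invalid JSON"})
    if not isinstance(features, dict):
        return json.dumps({"error": "expected a JSON object"})
    missing = [name for name in REQUIRED_FIELDS if name not in features]
    if missing:
        return json.dumps({"error": "missing fields", "fields": missing})
    probability = score_applicant(features)
    prediction = "Bad Credit" if probability >= threshold else "Good Credit"
    return json.dumps(
        {"prediction": prediction, "default_probability": round(probability, 4)}
    )

applicant = {
    "Age": 35, "Sex": "male", "Job": 2, "Housing": "own",
    "Saving Accounts": "little", "Checking Account": "moderate",
    "Credit Amount": 5000, "Duration": 24, "Purpose": "car",
}
response = json.loads(handle_request(json.dumps(applicant)))
```

In a FastAPI or Flask service the same validation and response shape would sit inside the route function, with the framework handling HTTP parsing.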

### Monitoring

The following will be tracked in production:

- Accuracy

- Precision

- Recall (≥ 0.75)

- AUC-ROC

- Latency

- Data drift
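This report does not specify how data drift is measured. One commonly used option is the Population Stability Index (PSI), which compares the binned distribution of a feature at training time with its distribution in production. The sketch below is an assumption about how such a check could look, with invented data and a hypothetical bin count.

```python
import math

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a baseline sample and a recent sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant baseline

    def shares(values):
        counts = [0] * bins
        for v in values:
            k = int((v - lo) / width)
            counts[min(max(k, 0), bins - 1)] += 1  # clip out-of-range values
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Invented example: a feature at training time versus a shifted production sample.
baseline = [float(v) for v in range(100)]
shifted = [v + 30.0 for v in baseline]

no_drift = psi(baseline, baseline)
drift = psi(baseline, shifted)
```

A frequently quoted rule of thumb treats PSI above roughly 0.25 as a significant shift, but the alert threshold should be chosen per feature rather than taken as fixed.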

### Retraining

The model will be retrained every 3–6 months, or earlier if recall drops below the 0.75 threshold.
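The retraining rule, a fixed schedule or a recall drop, whichever comes first, can be expressed as a small check. The function name and the 180-day limit (the upper end of the 3–6 month window) are assumptions made for this sketch.

```python
from datetime import date, timedelta

def should_retrain(last_trained, today, current_recall,
                   recall_threshold=0.75, max_age_days=180):
    """Return True when the model is too old or its recall is below the threshold."""
    too_old = (today - last_trained) > timedelta(days=max_age_days)
    degraded = current_recall < recall_threshold
    return too_old or degraded

# A recent model with healthy recall does not need retraining.
fresh = should_retrain(date(2024, 1, 1), date(2024, 3, 1), current_recall=0.80)
# A recall drop triggers retraining even though the model is recent.
degraded = should_retrain(date(2024, 1, 1), date(2024, 3, 1), current_recall=0.70)
# An old model is retrained regardless of its current recall.
stale = should_retrain(date(2024, 1, 1), date(2024, 9, 1), current_recall=0.80)
```

Note that `current_recall` can only be computed once loan outcomes are known, so the recall trigger necessarily lags behind the predictions it evaluates.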

### A/B Testing

- 20% of traffic is routed to the new model
- 80% of traffic remains on the existing system
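A 20/80 split is often implemented by hashing a stable identifier, so that the same application always follows the same route. The sketch below assumes a string `application_id` and the route labels shown; both are hypothetical names, not part of the existing system.

```python
import hashlib

def assign_route(application_id, new_model_share=0.20):
    """Deterministically route an application to the new model or the existing system."""
    digest = hashlib.sha256(application_id.encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 2**32
    return "new_model" if bucket < new_model_share else "existing_system"

# Simulate routing for 10,000 hypothetical application IDs.
routes = [assign_route(f"app-{i}") for i in range(10_000)]
observed_share = routes.count("new_model") / len(routes)
```

Hash-based assignment needs no stored state, and repeated requests for the same application never switch arms in the middle of the experiment.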

## 12. Risk Mitigation

- Bias audits

- SHAP-based explainability

- MLflow version control

- Rollback mechanism

- Secure API communication

## 13. Conclusion

This project successfully developed a production-ready credit risk classification system.

Among all evaluated models, XGBoost achieved the best performance and satisfied the business requirement of Recall ≥ 0.75.

The solution includes:

- End-to-end ML pipeline

- Hyperparameter tuning

- MLflow experiment tracking

- Deployment and monitoring strategy

The system improves decision speed, reduces manual effort, and minimizes financial risk.
