The objective of this final project was to build a fully reproducible machine learning pipeline that uses real data to address an applied classification problem. The dataset provided was `German_bank.csv`, containing customer and financial information, with the goal of predicting loan default.
This project combines data preprocessing, exploratory analysis, feature engineering, class imbalance handling, model training, hyperparameter tuning, and performance evaluation across multiple classification methods.
```
final-project-assignment/
├── README.md
├── requirements.txt
├── .gitignore
├── Week6_Final_Project.ipynb
└── Dataset/
    └── German_bank.csv
```
- Loaded `German_bank.csv` using pandas.
- Inspected data types, summary statistics, and missing values.
- Identified and replaced `unknown` values in categorical features (`checking_balance`, `savings_balance`).
- Visualized distributions of categorical variables with countplots.
- Detected and removed outliers in numeric variables using the IQR method.
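The cleaning steps above can be sketched as follows. The toy frame and the choice of replacing `unknown` with the column mode are illustrative assumptions, not necessarily the notebook's exact approach:

```python
import pandas as pd

# Toy frame standing in for German_bank.csv (column names from the project).
df = pd.DataFrame({
    "checking_balance": ["< 0 DM", "unknown", "1 - 200 DM", "unknown"],
    "savings_balance": ["< 100 DM", "unknown", "unknown", "500 - 1000 DM"],
    "amount": [1200, 2500, 30000, 1800],
})

# Replace the literal string "unknown" with each column's mode
# (computed over the known values only).
for col in ["checking_balance", "savings_balance"]:
    mode = df.loc[df[col] != "unknown", col].mode()[0]
    df[col] = df[col].replace("unknown", mode)

# IQR-based outlier removal for a numeric column: keep rows within
# [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["amount"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
```

On this toy data the 30000 loan amount falls outside the IQR fence and is dropped.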
- Encoding:
  - Label Encoding for ordinal variables (`checking_balance`, `savings_balance`, `employment_duration`).
  - One-Hot Encoding for nominal variables (`credit_history`, `purpose`, `other_credit`, `housing`, `job`, `phone`, `default`).
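A minimal sketch of the two encoding schemes on a toy frame. Note that `LabelEncoder` assigns integer codes alphabetically, so for truly ordinal columns a manual category mapping may be preferable:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "checking_balance": ["< 0 DM", "1 - 200 DM", "> 200 DM"],
    "purpose": ["car", "furniture", "car"],
    "housing": ["own", "rent", "own"],
})

# Label-encode an ordinal column: each category becomes an integer code.
le = LabelEncoder()
df["checking_balance"] = le.fit_transform(df["checking_balance"])

# One-hot encode nominal columns; drop_first avoids redundant dummies.
df = pd.get_dummies(df, columns=["purpose", "housing"], drop_first=True)
```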
- Correlation analysis with heatmaps.
- Feature importance via Random Forest and XGBoost models.

Final selected features: `credit_history`, `housing`, `age`, `checking_balance`, `purpose`, `other_credit`, `months_loan_duration`, `amount`, `percent_of_income`, `existing_loans_count`, `years_at_residence`.
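Importance-based selection like the above can be sketched with scikit-learn's Random Forest via `feature_importances_` (XGBoost exposes the same attribute). The synthetic data and feature names here are placeholders:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the encoded German_bank features.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=42)
feature_names = [f"f{i}" for i in range(8)]

rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)

# Rank features by impurity-based importance and keep the top k.
importances = pd.Series(rf.feature_importances_, index=feature_names)
top_features = importances.sort_values(ascending=False).head(5).index.tolist()
```

Impurity-based importances sum to 1, so the ranking reads directly as each feature's share of the model's split quality.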
- Oversampled the minority class with `RandomOverSampler` from `imbalanced-learn`.
- Scaled numeric features with `StandardScaler`.
- Split the oversampled dataset into 80% training and 20% testing sets.
Five models were fit, tuned, and compared:
- Logistic Regression
- Random Forest
- XGBoost
- Support Vector Machine (SVM)
- Gradient Boosting
Each model was trained in two phases: a baseline model with default settings, and a hyperparameter-tuned model (via `GridSearchCV` with cross-validation).
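The two-phase pattern, shown here for Logistic Regression on synthetic data. The parameter grid is a hypothetical example, not the notebook's exact grid:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=400, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Phase 1: baseline model with default hyperparameters.
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Phase 2: hyperparameter tuning via cross-validated grid search,
# optimizing recall (relevant when missing a defaulter is costly).
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=5, scoring="recall")
grid.fit(X_train, y_train)
tuned = grid.best_estimator_
```

The same pattern applies to the other four models; only the estimator and `param_grid` change.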
**Logistic Regression**
- Baseline: Accuracy 69%, Recall 69%.
- Tuned: Accuracy 64%, Recall 82%.
- Tradeoff: tuning improved recall at the expense of accuracy and precision.
**Random Forest**
- Accuracy ~90%, Recall ~96%, F1 ~89%.
- Best overall performance; stable even without tuning.
**XGBoost**
- Baseline: Accuracy ~85%, Recall ~97%.
- Tuned: Accuracy ~86%; precision improved, recall stayed balanced.
- Strong candidate model, with further improvements after tuning.
**Support Vector Machine (SVM)**
- Baseline: Accuracy ~74%, Recall ~79%.
- Tuned: Accuracy ~77%, Recall ~86%.
- Clear gains after tuning, particularly in precision and recall.
**Gradient Boosting**
- Baseline: Accuracy ~75%, Recall ~86%.
- Tuned: Accuracy ~82%, Recall ~91%.
- Significant improvements after tuning; competitive with XGBoost.
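The Accuracy, Recall, and F1 figures reported above are standard scikit-learn metrics; a minimal illustration of how each is computed, on hypothetical labels:

```python
from sklearn.metrics import accuracy_score, recall_score, f1_score

# Hypothetical labels: 1 = default, 0 = no default.
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 1]

acc = accuracy_score(y_true, y_pred)   # fraction of correct predictions
rec = recall_score(y_true, y_pred)     # fraction of true defaults caught
f1 = f1_score(y_true, y_pred)          # harmonic mean of precision and recall
```

Recall is the metric to watch here: a missed default (false negative) is typically costlier to a lender than a false alarm.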
- Random Forest delivered the best overall performance, balancing accuracy, precision, recall, and F1 score.
- XGBoost and Gradient Boosting showed strong improvements after hyperparameter tuning.
- SVM improved with tuning but lagged behind ensemble methods.
- Logistic Regression provided a baseline; tuning improved recall but reduced overall accuracy.
Conclusion: Ensemble methods (Random Forest, XGBoost, Gradient Boosting) consistently outperformed simpler models, with Random Forest being the most robust for predicting loan defaults.
The assignment was developed in Python using a shared virtual environment.
To install dependencies:
```bash
cd ~/projects/final-project-assignment
source ~/venvs/ml-env/bin/activate
pip install -r requirements.txt
```
- James, Witten, Hastie, Tibshirani (2021). *An Introduction to Statistical Learning with Applications in Python*.
- Scikit-learn Documentation: https://scikit-learn.org/stable/
- XGBoost Documentation: https://xgboost.readthedocs.io/
- Imbalanced-learn Documentation: https://imbalanced-learn.org/
---