The objective of this final project was to build a fully reproducible machine learning pipeline that uses real data to address an applied classification problem. The dataset provided was `German_bank.csv`, containing customer and financial information, with the goal of predicting loan default.
This project combines data preprocessing, exploratory analysis, feature engineering, class imbalance handling, model training, hyperparameter tuning, and performance evaluation across multiple classification methods.
```
final-project-assignment/
├── README.md
├── requirements.txt
├── .gitignore
├── Week6_Final_Project.ipynb
└── Dataset/
    └── German_bank.csv
```
- Loaded `German_bank.csv` using pandas.
- Inspected data types, summary statistics, and missing values.
- Identified and replaced `unknown` values in categorical features (`checking_balance`, `savings_balance`).
- Visualized distributions of categorical variables with countplots.
- Detected and removed outliers in numeric variables using the IQR method.
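The cleaning steps above can be sketched as follows. The toy frame and the choice of replacing `unknown` with the column mode are illustrative assumptions, not necessarily the notebook's exact approach:

```python
import pandas as pd

# Toy frame standing in for German_bank.csv (column names from the project).
df = pd.DataFrame({
    "checking_balance": ["< 0 DM", "unknown", "1 - 200 DM", "unknown"],
    "savings_balance": ["< 100 DM", "unknown", "unknown", "500 - 1000 DM"],
    "amount": [1200, 2500, 30000, 1800],
})

# Replace the literal string "unknown" with each column's mode
# (computed over the known values only).
for col in ["checking_balance", "savings_balance"]:
    mode = df.loc[df[col] != "unknown", col].mode()[0]
    df[col] = df[col].replace("unknown", mode)

# IQR-based outlier removal for a numeric column: keep rows within
# [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["amount"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
```

On this toy data the 30000 loan amount falls outside the IQR fence and is dropped.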
- Encoding:
  - Label Encoding for ordinal variables (`checking_balance`, `savings_balance`, `employment_duration`).
  - One-Hot Encoding for nominal variables (`credit_history`, `purpose`, `other_credit`, `housing`, `job`, `phone`, `default`).
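A minimal sketch of the two encoding schemes on a toy frame. Note that `LabelEncoder` assigns integer codes alphabetically, so for truly ordinal columns a manual category mapping may be preferable:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "checking_balance": ["< 0 DM", "1 - 200 DM", "> 200 DM"],
    "purpose": ["car", "furniture", "car"],
    "housing": ["own", "rent", "own"],
})

# Label-encode an ordinal column: each category becomes an integer code.
le = LabelEncoder()
df["checking_balance"] = le.fit_transform(df["checking_balance"])

# One-hot encode nominal columns; drop_first avoids redundant dummies.
df = pd.get_dummies(df, columns=["purpose", "housing"], drop_first=True)
```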
- Correlation analysis with heatmaps.
- Feature importance via Random Forest and XGBoost models.

Final selected features: `credit_history`, `housing`, `age`, `checking_balance`, `purpose`, `other_credit`, `months_loan_duration`, `amount`, `percent_of_income`, `existing_loans_count`, `years_at_residence`.
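Importance-based selection like the above can be sketched with scikit-learn's Random Forest via `feature_importances_` (XGBoost exposes the same attribute). The synthetic data and feature names here are placeholders:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the encoded German_bank features.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=42)
feature_names = [f"f{i}" for i in range(8)]

rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)

# Rank features by impurity-based importance and keep the top k.
importances = pd.Series(rf.feature_importances_, index=feature_names)
top_features = importances.sort_values(ascending=False).head(5).index.tolist()
```

Impurity-based importances sum to 1, so the ranking reads directly as each feature's share of the model's split quality.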
- Oversampled the minority class with `RandomOverSampler` from `imbalanced-learn`.
- Scaled numeric features with `StandardScaler`.
- Split the oversampled dataset into 80% training and 20% testing sets.
Five models were fit, tuned, and compared:
- Logistic Regression
- Random Forest
- XGBoost
- Support Vector Machine (SVM)
- Gradient Boosting
Each model was trained in two phases: a baseline model with default settings, and a hyperparameter-tuned model (via `GridSearchCV` with cross-validation).
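The two-phase pattern, shown here for Logistic Regression on synthetic data. The parameter grid is a hypothetical example, not the notebook's exact grid:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=400, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Phase 1: baseline model with default hyperparameters.
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Phase 2: hyperparameter tuning via cross-validated grid search,
# optimizing recall (relevant when missing a defaulter is costly).
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=5, scoring="recall")
grid.fit(X_train, y_train)
tuned = grid.best_estimator_
```

The same pattern applies to the other four models; only the estimator and `param_grid` change.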
**Logistic Regression**
- Baseline: Accuracy 69%, Recall 69%.
- Tuned: Accuracy 64%, Recall 82%.
- Tradeoff: tuning improved recall at the expense of accuracy and precision.
**Random Forest**
- Accuracy ~90%, Recall ~96%, F1 ~89%.
- Best overall performance; stable even without tuning.
**XGBoost**
- Baseline: Accuracy ~85%, Recall ~97%.
- Tuned: Accuracy ~86%; precision improved, recall stayed balanced.
- Strong candidate model, with further improvements after tuning.
**Support Vector Machine (SVM)**
- Baseline: Accuracy ~74%, Recall ~79%.
- Tuned: Accuracy ~77%, Recall ~86%.
- Clear gains after tuning, particularly in precision and recall.
**Gradient Boosting**
- Baseline: Accuracy ~75%, Recall ~86%.
- Tuned: Accuracy ~82%, Recall ~91%.
- Significant improvements after tuning; competitive with XGBoost.
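The Accuracy, Recall, and F1 figures reported above are standard scikit-learn metrics; a minimal illustration of how each is computed, on hypothetical labels:

```python
from sklearn.metrics import accuracy_score, recall_score, f1_score

# Hypothetical labels: 1 = default, 0 = no default.
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 1]

acc = accuracy_score(y_true, y_pred)   # fraction of correct predictions
rec = recall_score(y_true, y_pred)     # fraction of true defaults caught
f1 = f1_score(y_true, y_pred)          # harmonic mean of precision and recall
```

Recall is the metric to watch here: a missed default (false negative) is typically costlier to a lender than a false alarm.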
- Random Forest delivered the best overall performance, balancing accuracy, precision, recall, and F1 score.
- XGBoost and Gradient Boosting showed strong improvements after hyperparameter tuning.
- SVM improved with tuning but lagged behind ensemble methods.
- Logistic Regression provided a baseline; tuning improved recall but reduced overall accuracy.
Conclusion: Ensemble methods (Random Forest, XGBoost, Gradient Boosting) consistently outperformed simpler models, with Random Forest being the most robust for predicting loan defaults.
The assignment was developed in Python using a shared virtual environment.
To install dependencies:
```bash
cd ~/projects/final-project-assignment
source ~/venvs/ml-env/bin/activate
pip install -r requirements.txt
```
- James, Witten, Hastie, Tibshirani (2021). *An Introduction to Statistical Learning with Applications in Python*.
- Scikit-learn Documentation: https://scikit-learn.org/stable/
- XGBoost Documentation: https://xgboost.readthedocs.io/
- Imbalanced-learn Documentation: https://imbalanced-learn.org/
---