Week 6: Final Project

Objective

The objective of the final project was to build a fully reproducible machine learning pipeline that addresses an applied classification problem on real data. The dataset provided was German_bank.csv, containing customer and financial information, with the goal of predicting loan default.

This project combines data preprocessing, exploratory analysis, feature engineering, class imbalance handling, model training, hyperparameter tuning, and performance evaluation across multiple classification methods.


Repository Structure


final-project-assignment/
├── README.md
├── requirements.txt
├── .gitignore
├── Week6_Final_Project.ipynb
└── Dataset/
    └── German_bank.csv


Workflow

A) Data Loading and Exploration

  • Loaded German_bank.csv using pandas.
  • Inspected data types, summary statistics, and missing values.
  • Identified and replaced unknown values in categorical features (checking_balance, savings_balance).
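
A minimal sketch of this step (mode imputation for the "unknown" entries is an assumption; the notebook may substitute a different value):

```python
import pandas as pd

# Load the dataset from the repository's Dataset/ folder
df = pd.read_csv("Dataset/German_bank.csv")

# Inspect types, summary statistics, and missing values
df.info()
print(df.describe())
print(df.isna().sum())

# Replace "unknown" entries in the two affected categorical features.
# Imputing with the column mode is one reasonable strategy (an assumption).
for col in ["checking_balance", "savings_balance"]:
    mode = df.loc[df[col] != "unknown", col].mode()[0]
    df[col] = df[col].replace("unknown", mode)
```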

B) Exploratory Data Analysis (EDA)

  • Visualized distributions of categorical variables with countplots.
  • Detected and removed outliers in numeric variables using the IQR rule.
  • Encoding:
    • Label Encoding for ordinal variables (checking_balance, savings_balance, employment_duration).
    • One-Hot Encoding for nominal variables (credit_history, purpose, other_credit, housing, job, phone, default).
  • Correlation analysis with heatmaps.
  • Feature importance via Random Forest and XGBoost models.
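
Continuing from the loading sketch, a condensed version of the outlier-removal and encoding steps. Note that LabelEncoder assigns integer codes alphabetically, so an explicit ordinal mapping may be closer to what the notebook does; mapping the binary target default to 0/1 (assuming "yes"/"no" labels) stands in here for the one-hot step listed above:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# IQR rule: drop rows outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] per numeric column
for col in df.select_dtypes(include=np.number).columns:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    df = df[df[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Label-encode the ordinal variables
for col in ["checking_balance", "savings_balance", "employment_duration"]:
    df[col] = LabelEncoder().fit_transform(df[col])

# Map the binary target, then one-hot encode the nominal variables
# (drop_first avoids redundant dummy columns)
df["default"] = df["default"].map({"no": 0, "yes": 1})
nominal = ["credit_history", "purpose", "other_credit", "housing", "job", "phone"]
df = pd.get_dummies(df, columns=nominal, drop_first=True)
```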

C) Feature Selection

Final selected features included:
credit_history, housing, age, checking_balance, purpose, other_credit, months_loan_duration, amount, percent_of_income, existing_loans_count, years_at_residence.

D) Handling Class Imbalance

  • Oversampled the minority class with RandomOverSampler from imbalanced-learn.
  • Scaled numeric features with StandardScaler (both steps are sketched together with the train-test split after section E).

E) Train-Test Split

  • Split the oversampled dataset into 80% training and 20% testing sets, as shown in the sketch below.
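
A sketch covering this split together with the oversampling and scaling from section D. The random_state values are illustrative, scaling after the split (fit on the training set only, which avoids test-set leakage) is an assumed ordering, and the numeric column list follows the selected features in section C:

```python
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = df.drop(columns=["default"])
y = df["default"]

# Oversample the minority class until both classes are equally represented
X_res, y_res = RandomOverSampler(random_state=42).fit_resample(X, y)

# 80/20 train-test split on the oversampled data
X_train, X_test, y_train, y_test = train_test_split(
    X_res, y_res, test_size=0.2, random_state=42
)

# Standardize the numeric features; the scaler is fit on the training split
# only, then applied to the test split
num_cols = ["months_loan_duration", "amount", "percent_of_income",
            "age", "existing_loans_count", "years_at_residence"]
scaler = StandardScaler()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols].astype(float))
X_test[num_cols] = scaler.transform(X_test[num_cols].astype(float))
```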

F) Models Trained

Five models were fit, tuned, and compared:

  1. Logistic Regression
  2. Random Forest
  3. XGBoost
  4. Support Vector Machine (SVM)
  5. Gradient Boosting

Each model was trained in two phases: a baseline model with default settings, followed by a hyperparameter-tuned model selected via GridSearchCV with cross-validation.
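
To illustrate the two-phase pattern, a sketch for the Random Forest case; the parameter grid, fold count, and recall scoring are illustrative assumptions rather than the notebook's exact settings:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Phase 1: baseline with default hyperparameters
baseline = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Phase 2: grid search with 5-fold cross-validation
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5],
}
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="recall",  # recall is emphasized throughout the results below
    n_jobs=-1,
)
grid.fit(X_train, y_train)
tuned = grid.best_estimator_
```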


Results and Analysis

Logistic Regression

  • Baseline: Accuracy 69%, Recall 69%.
  • Tuned: Accuracy 64%, Recall 82%.
  • Tradeoff: Tuned version improved recall but at the expense of accuracy and precision.

Random Forest

  • Accuracy ~90%, Recall ~96%, F1 ~89%.
  • Best performance overall; stable even without tuning.

XGBoost

  • Baseline: Accuracy ~85%, Recall ~97%.
  • Tuned: Accuracy ~86%; precision improved while recall remained balanced.
  • Strong candidate model with improvements after tuning.

SVM

  • Baseline: Accuracy ~74%, Recall ~79%.
  • Tuned: Accuracy ~77%, Recall ~86%.
  • Gains after tuning, particularly in precision and recall.

Gradient Boosting

  • Baseline: Accuracy ~75%, Recall ~86%.
  • Tuned: Accuracy ~82%, Recall ~91%.
  • Significant improvements after tuning; competitive with XGBoost.

Key Findings

  • Random Forest delivered the best overall performance, balancing accuracy, precision, recall, and F1 score.
  • XGBoost and Gradient Boosting showed strong improvements after hyperparameter tuning.
  • SVM improved with tuning but lagged behind ensemble methods.
  • Logistic Regression provided a baseline; tuning improved recall but reduced overall accuracy.

Conclusion: Ensemble methods (Random Forest, XGBoost, Gradient Boosting) consistently outperformed simpler models, with Random Forest being the most robust for predicting loan defaults.
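
The per-model figures above are the kind produced by scikit-learn's metrics utilities; a minimal evaluation sketch, reusing the tuned estimator from the grid-search example:

```python
from sklearn.metrics import accuracy_score, classification_report

y_pred = tuned.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print(classification_report(y_test, y_pred))  # precision, recall, F1 per class
```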


Setup Instructions

The assignment was developed in Python using a shared virtual environment.

To install dependencies:

cd ~/projects/final-project-assignment
source ~/venvs/ml-env/bin/activate
pip install -r requirements.txt
