A reproducible, end-to-end binary classification workflow in scikit-learn — preprocessing, scaling, model training, GridSearchCV tuning, and a full evaluation suite. Built to demonstrate pipeline discipline, not just a one-off model.
Stack: Python · scikit-learn · Pandas · NumPy · Matplotlib
Status: Completed · Author: Justin Ali · LinkedIn
Fitting a logistic regression is easy. Wrapping that model in a workflow another analyst can re-run six months later and get the same answer is much harder. This project is less about a specific business outcome and more about the discipline around the model: a clean pipeline with explicit preprocessing, documented validation, and an evaluation suite a stakeholder can actually read.
- Preprocessing pipeline — All transformations (imputation, scaling, encoding) wrapped in a scikit-learn Pipeline so train/test leakage is impossible by construction.
- EDA — Class balance, feature distributions, correlation with target.
- Feature scaling — Standardized features so logistic regression coefficients are comparable.
- Modeling — Logistic regression trained inside the pipeline; coefficients interpreted as log-odds contributions.
- Tuning — GridSearchCV across regularization strength (C) and penalty type (L1 vs. L2), with cross-validated ROC-AUC as the selection criterion.
- Evaluation — Confusion matrix, precision, recall, ROC curve, and AUC. Each metric is included because it reveals something the others do not.
- Documentation — Every step explained inline so a stakeholder reading the notebook understands not just what but why.
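The preprocessing and modeling bullets above could be wired together roughly like this. A minimal sketch only: the column names, imputation strategies, and two-branch layout are illustrative assumptions, not the notebook's actual features.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column groups -- the real notebook defines its own.
numeric = ["age", "income"]
categorical = ["segment"]

# All transformations live inside the pipeline, so they are fit on the
# training fold only and merely applied to validation/test folds.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),   # comparable log-odds coefficients
    ]), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])
```

Because scaling and imputation are steps inside the estimator, calling `model.fit(X_train, y_train)` cannot leak test-set statistics into training.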
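The tuning step described above might look like this sketch. The specific `C` grid and the `liblinear` solver are assumptions (chosen because `liblinear` supports both L1 and L2 penalties); the selection metric is cross-validated ROC-AUC, as in the project.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(solver="liblinear")),  # handles l1 and l2
])

# Search regularization strength and penalty type together; the
# "clf__" prefix routes each parameter to the pipeline's final step.
param_grid = {
    "clf__C": [0.01, 0.1, 1.0, 10.0],
    "clf__penalty": ["l1", "l2"],
}

search = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=5)
# search.fit(X_train, y_train); search.best_params_, search.best_score_
```

Tuning the whole pipeline (rather than a bare estimator) means the scaler is refit inside every CV fold, keeping the cross-validated scores honest.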
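The evaluation suite takes only a few lines; here synthetic data stands in for the project's dataset, so the numbers are placeholders, not the project's results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)
proba = clf.predict_proba(X_te)[:, 1]

cm = confusion_matrix(y_te, pred)        # raw TN/FP/FN/TP counts
precision = precision_score(y_te, pred)  # of flagged positives, how many are real
recall = recall_score(y_te, pred)        # of real positives, how many were caught
fpr, tpr, _ = roc_curve(y_te, proba)     # points for the ROC curve plot
auc = roc_auc_score(y_te, proba)         # threshold-free ranking quality
```

Precision and recall judge the default 0.5 threshold, while the ROC curve and AUC describe the model across all thresholds; reporting both keeps a single flattering number from hiding a weakness.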
A working pipeline with cross-validated performance numbers, documented methodology, and a model that another analyst can re-train or re-tune by running one notebook top to bottom.
- Wrap the trained pipeline in a joblib-serialized artifact plus a small inference script for deployment.
- Add SHAP value explanations so per-prediction reasoning is auditable.
- Compare against tree-based methods (Random Forest, gradient boosting) to confirm logistic regression is actually the right choice for this data.
- Add a calibration plot — most binary classifiers need one before their probabilities are trustworthy.
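The first roadmap item, a joblib-serialized artifact, could start from a sketch like this; the filename and the stand-in pipeline here are illustrative assumptions.

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for the tuned pipeline the notebook produces.
X, y = make_classification(n_samples=200, random_state=0)
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(X, y)

# Persist the whole pipeline so preprocessing travels with the model;
# an inference script then only needs joblib.load() plus raw features.
joblib.dump(pipe, "model.joblib")
restored = joblib.load("model.joblib")
```

Serializing the pipeline rather than the bare classifier is the point: the artifact carries its own scaler, so callers never have to reimplement preprocessing.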
```
.
├── notebooks/
└── README.md
```
```bash
git clone https://github.com/JustinAliData/binary-classification-pipeline.git
cd binary-classification-pipeline
python -m venv .venv && source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt
jupyter lab
```

Built as part of the Springboard Data Science Career Track.