Logistic Regression Implementation from Scratch

A comprehensive implementation of the logistic regression algorithm, built from scratch in Python, demonstrating binary classification for heart disease prediction and a simple study-hours pass/fail example.

Overview

This project implements logistic regression from scratch without using pre-built machine learning libraries for the core algorithm. It demonstrates:

  • Binary classification for medical diagnosis (heart disease prediction)
  • Simple binary classification (study hours vs pass/fail)
  • Mathematical concepts behind logistic regression
  • Data preprocessing and feature scaling
  • Model evaluation with multiple metrics
  • Comprehensive visualizations

Project Structure

Logistic-Regression/
├── README.md
├── Real Datasets/
│   ├── heart_disease.csv           # Heart disease dataset (303 samples)
│   ├── heartdiseasecode.py         # Main implementation with visualizations
│   └── logregExplained.py          # Detailed commented version
└── Logistic Classification/
    └── StudyHours.py               # Simple example implementation

Dataset Information

Heart Disease Dataset (heart_disease.csv)

  • Size: 303 samples, 13 features plus a binary target (14 columns in total)
  • Target: Binary classification (0 = No heart disease, 1 = Heart disease)
  • Features:
    • age: Age in years
    • sex: Gender (1 = male, 0 = female)
    • cp: Chest pain type (0-3)
    • trestbps: Resting blood pressure (mm Hg)
    • chol: Serum cholesterol (mg/dl)
    • fbs: Fasting blood sugar > 120 mg/dl (1 = true, 0 = false)
    • restecg: Resting electrocardiographic results (0-2)
    • thalach: Maximum heart rate achieved
    • exang: Exercise induced angina (1 = yes, 0 = no)
    • oldpeak: ST depression induced by exercise
    • slope: Slope of peak exercise ST segment (0-2)
    • ca: Number of major vessels colored by fluoroscopy (0-3)
    • thal: Thalassemia type (1-3)
    • target: Heart disease diagnosis (0 = no, 1 = yes)

Study Hours Dataset (Synthetic)

  • Size: 6 samples
  • Purpose: Demonstrate basic logistic regression concept
  • Features: Hours studied (1-6)
  • Target: Pass/Fail (0/1)

Implementation Details

Core Algorithm Components

  1. Sigmoid Function

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))
    • Converts linear predictions to probabilities (0-1 range)
    • S-shaped curve ideal for binary classification
  2. Cost Function

    • Binary Cross-Entropy Loss
    • Formula: J = -1/m * Σ[y*log(h) + (1-y)*log(1-h)]
    • Penalizes confident wrong predictions heavily
  3. Gradient Descent

    • Updates weights: w = w - α * dw
    • Updates bias: b = b - α * db
    • Learning rate (α) controls step size
  4. Feature Scaling

    • StandardScaler: (x - μ) / σ
    • Ensures all features contribute equally
    • Prevents larger-scale features from dominating
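
A minimal NumPy sketch tying the four components above together (function and variable names here are illustrative, not necessarily those used in the repository):

    import numpy as np

    def sigmoid(z):
        # Map linear scores to probabilities in the (0, 1) range
        return 1 / (1 + np.exp(-z))

    def binary_cross_entropy(y_true, y_prob, eps=1e-12):
        # Clip to avoid log(0) on very confident predictions
        y_prob = np.clip(y_prob, eps, 1 - eps)
        return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

    def gradient_step(X, y, w, b, lr=0.01):
        # One pass of batch gradient descent: forward pass, gradients, update
        m = X.shape[0]
        y_prob = sigmoid(X @ w + b)
        dw = X.T @ (y_prob - y) / m
        db = np.sum(y_prob - y) / m
        return w - lr * dw, b - lr * db

    def standardize(X):
        # Column-wise (x - mean) / std, the same effect as StandardScaler
        return (X - X.mean(axis=0)) / X.std(axis=0)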

Files Description

1. heartdiseasecode.py

Main implementation with complete workflow

  • Full logistic regression from scratch
  • Heart disease dataset processing
  • Train/test split (80/20)
  • Feature scaling with StandardScaler
  • Model training (5000 epochs, lr=0.01)
  • Multiple evaluation metrics
  • Three visualization plots:
    • Training loss convergence curve
    • Confusion matrix heatmap
    • ROC curve with AUC score
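
This workflow can be approximated with a sketch along the following lines (the CSV path matches the project structure above; the random seed and variable names are assumptions, so the repository's code may differ in detail):

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    # Load the dataset and separate features from the target column
    df = pd.read_csv("Real Datasets/heart_disease.csv")
    X = df.drop(columns=["target"]).values
    y = df["target"].values

    # 80/20 train/test split, then scale features to zero mean and unit variance
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    # Gradient-descent training: 5000 epochs, learning rate 0.01
    w, b = np.zeros(X_train.shape[1]), 0.0
    for epoch in range(5000):
        y_prob = 1 / (1 + np.exp(-(X_train @ w + b)))
        w -= 0.01 * X_train.T @ (y_prob - y_train) / len(y_train)
        b -= 0.01 * np.mean(y_prob - y_train)

    # Evaluate on the held-out set with a 0.5 decision threshold
    test_prob = 1 / (1 + np.exp(-(X_test @ w + b)))
    print("Classification Accuracy:", np.mean((test_prob >= 0.5) == y_test))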

Key Results:

  • Classification Accuracy: ~79.5%
  • AUC Score: Typically 0.85-0.90
  • Training converges smoothly

2. logregExplained.py

Educational version with extensive documentation

  • Detailed comments explaining every step
  • Mathematical concepts explained
  • Purpose of each function documented
  • Step-by-step algorithm breakdown
  • Same functionality as main implementation
  • Perfect for learning and understanding

Educational Value:

  • Explains sigmoid function purpose
  • Details gradient descent process
  • Clarifies feature scaling necessity
  • Describes evaluation metrics meaning

3. StudyHours.py

Simple demonstration example

  • Minimal 6-sample dataset
  • Basic logistic regression implementation
  • Perfect for understanding core concepts
  • No external dependencies except NumPy
  • Clear input/output demonstration

Example Output:

Predictions: [0, 0, 0, 1, 1, 1]
Actual:      [0, 0, 0, 1, 1, 1]
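
A self-contained sketch of such a minimal example, assuming a tiny synthetic dataset like the one described above (the exact values and training settings are illustrative):

    import numpy as np

    # Synthetic data: hours studied vs. pass (1) / fail (0)
    hours = np.array([1, 2, 3, 4, 5, 6], dtype=float)
    passed = np.array([0, 0, 0, 1, 1, 1], dtype=float)

    # Single-feature logistic regression trained by gradient descent
    w, b, lr = 0.0, 0.0, 0.1
    for _ in range(10000):
        prob = 1 / (1 + np.exp(-(w * hours + b)))
        w -= lr * np.mean((prob - passed) * hours)
        b -= lr * np.mean(prob - passed)

    # Predict with a 0.5 threshold and compare against the labels
    predictions = (1 / (1 + np.exp(-(w * hours + b))) >= 0.5).astype(int)
    print("Predictions:", predictions.tolist())
    print("Actual:     ", passed.astype(int).tolist())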

Installation & Setup

Prerequisites

Python 3.7+

Required Libraries

pip install numpy pandas matplotlib scikit-learn

Installation Commands

# Clone the repository
git clone https://github.com/dnjstr/Logistic-Regression.git
cd Logistic-Regression

# Install dependencies
pip install numpy pandas matplotlib scikit-learn

# Run the main implementation
python "Real Datasets/heartdiseasecode.py"

# Run the educational version
python "Real Datasets/logregExplained.py"

# Run the simple example
python "Logistic Classification/StudyHours.py"

Usage

Running Heart Disease Prediction

cd "Real Datasets"
python heartdiseasecode.py

Expected Output:

Working directory: C:\...\Logistic-Regression
Classification Accuracy: 0.7951219512195122
Sample 0: true=1, prob=0.9724, pred_label=1
Sample 1: true=1, prob=0.9969, pred_label=1
Sample 2: true=0, prob=0.0160, pred_label=0
Sample 3: true=1, prob=0.9700, pred_label=1
Sample 4: true=0, prob=0.0461, pred_label=0

Running Simple Example

cd "Logistic Classification"
python StudyHours.py

Mathematical Foundation

Logistic Regression Equation

h(x) = σ(θᵀx + b)
where σ(z) = 1/(1 + e^(-z))

Cost Function (Binary Cross-Entropy)

J(θ) = -1/m * Σᵢ[yᵢlog(h(xᵢ)) + (1-yᵢ)log(1-h(xᵢ))]

Gradient Descent Updates

θⱼ := θⱼ - α * ∂J/∂θⱼ
b := b - α * ∂J/∂b

Derivatives

∂J/∂θⱼ = 1/m * Σᵢ(h(xᵢ) - yᵢ) * xᵢⱼ
∂J/∂b = 1/m * Σᵢ(h(xᵢ) - yᵢ)
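
As a quick sanity check on these derivatives, the analytic gradient can be compared against a finite-difference approximation of the cost (a small sketch using random data, not tied to the repository's code):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 3))
    y = rng.integers(0, 2, size=20).astype(float)
    theta, b = rng.normal(size=3), 0.5

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    def cost(theta, b):
        h = sigmoid(X @ theta + b)
        return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

    # Analytic gradients from the formulas above
    h = sigmoid(X @ theta + b)
    grad_theta = X.T @ (h - y) / len(y)
    grad_b = np.mean(h - y)

    # Finite-difference check on the first weight: the two numbers should agree
    eps = 1e-6
    step = np.array([eps, 0.0, 0.0])
    numeric = (cost(theta + step, b) - cost(theta - step, b)) / (2 * eps)
    print(grad_theta[0], numeric)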

Results & Performance

Heart Disease Prediction Model

  • Accuracy: 79.51%
  • Training Time: ~2-3 seconds (5000 epochs)
  • Convergence: Loss decreases smoothly
  • Overfitting: Minimal (good generalization)

Model Performance Metrics

  • Precision: ~0.80
  • Recall: ~0.78
  • F1-Score: ~0.79
  • AUC-ROC: ~0.87
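
These metrics can be reproduced from the model's test-set outputs with scikit-learn's metric helpers (a sketch; the arrays below are placeholders standing in for the real test labels and predicted probabilities):

    import numpy as np
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

    # Placeholder values; in practice these come from the held-out test set
    y_test = np.array([1, 0, 1, 1, 0, 0, 1, 0])
    test_prob = np.array([0.92, 0.13, 0.78, 0.45, 0.31, 0.05, 0.88, 0.62])
    y_pred = (test_prob >= 0.5).astype(int)

    print("Accuracy: ", accuracy_score(y_test, y_pred))
    print("Precision:", precision_score(y_test, y_pred))
    print("Recall:   ", recall_score(y_test, y_pred))
    print("F1-Score: ", f1_score(y_test, y_pred))
    print("AUC-ROC:  ", roc_auc_score(y_test, test_prob))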

Comparison with Sklearn

Our implementation performs comparably to sklearn's LogisticRegression:

  • Similar accuracy (±2%)
  • Same mathematical approach
  • Educational transparency
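
A scikit-learn baseline for this comparison can be set up along these lines (a sketch; whether the repository bundles such a script is not stated):

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    df = pd.read_csv("Real Datasets/heart_disease.csv")
    X, y = df.drop(columns=["target"]), df["target"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Same preprocessing as the scratch version: scaling followed by logistic regression
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    clf.fit(X_train, y_train)
    print("sklearn accuracy:", clf.score(X_test, y_test))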

Visualizations

1. Training Loss Curve

  • Purpose: Monitor convergence
  • Pattern: Decreasing loss over epochs
  • Insight: Model learning progress

2. Confusion Matrix

  • Format: 2x2 grid
  • Diagonal: Correct predictions
  • Off-diagonal: Classification errors
  • Labels: "No Heart Disease" vs "Heart Disease"

3. ROC Curve

  • X-axis: False Positive Rate
  • Y-axis: True Positive Rate
  • Diagonal: Random classifier baseline
  • AUC: Model discrimination ability
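
A Matplotlib sketch of the ROC plot described above (the labels and probabilities are placeholders; in practice they come from the held-out test set):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.metrics import roc_curve, auc

    # Placeholder test labels and predicted probabilities
    y_test = np.array([1, 0, 1, 1, 0, 0, 1, 0])
    test_prob = np.array([0.92, 0.13, 0.78, 0.45, 0.31, 0.05, 0.88, 0.62])

    fpr, tpr, _ = roc_curve(y_test, test_prob)
    plt.plot(fpr, tpr, label=f"Model (AUC = {auc(fpr, tpr):.2f})")
    plt.plot([0, 1], [0, 1], "k--", label="Random classifier")  # diagonal baseline
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title("ROC Curve")
    plt.legend()
    plt.show()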

Key Features

Technical Features

  • From Scratch Implementation: No sklearn for core algorithm
  • Gradient Descent: Manual parameter optimization
  • Feature Scaling: StandardScaler preprocessing
  • Multiple Metrics: Accuracy, precision, recall, F1, AUC
  • Comprehensive Visualization: Loss curve, confusion matrix, ROC curve
  • Real Dataset: Medical diagnosis application

Educational Features

  • Extensive Documentation: Every step explained
  • Mathematical Clarity: Formulas and concepts detailed
  • Progressive Complexity: Simple to advanced examples
  • Code Comments: Line-by-line explanations
  • Practical Application: Real-world medical dataset

Software Engineering Features

  • Clean Code Structure: Modular and readable
  • Error Handling: Robust implementation
  • Consistent Naming: Clear variable names
  • Type Hints: Enhanced code clarity
  • Documentation: Comprehensive README

Learning Outcomes

Mathematical Understanding

  • Sigmoid Function: Probability mapping
  • Cross-Entropy: Classification loss function
  • Gradient Descent: Optimization algorithm
  • Feature Scaling: Data preprocessing importance

Machine Learning Concepts

  • Binary Classification: Two-class prediction
  • Train/Test Split: Model evaluation strategy
  • Overfitting Prevention: Generalization techniques
  • Performance Metrics: Model assessment methods

Programming Skills

  • NumPy Operations: Matrix computations
  • Data Preprocessing: Real dataset handling
  • Visualization: Matplotlib plotting
  • Class Design: Object-oriented implementation

Practical Applications

  • Medical Diagnosis: Healthcare AI applications
  • Risk Assessment: Binary decision making
  • Pattern Recognition: Feature importance analysis
  • Model Interpretation: Understanding predictions

Notes & Disclaimers

  • All implementations are educational and demonstrate core concepts
  • Real-world applications should consider additional validation
  • Dataset is for educational purposes only
  • Medical predictions require professional validation

License

This project is for educational purposes. Feel free to use and modify for learning.


Author: Den & Adrian
Date: September 2025
Purpose: Educational Implementation of Logistic Regression
Accuracy: ~79.5% on Heart Disease Dataset
