A from-scratch Python implementation of the logistic regression algorithm, demonstrating binary classification on two examples: heart disease prediction and a study hours (pass/fail) analysis.
- Overview
- Project Structure
- Dataset Information
- Implementation Details
- Files Description
- Installation & Setup
- Usage
- Mathematical Foundation
- Results & Performance
- Visualizations
- Key Features
- Learning Outcomes
This project implements logistic regression from scratch without using pre-built machine learning libraries for the core algorithm. It demonstrates:
- Binary classification for medical diagnosis (heart disease prediction)
- Simple binary classification (study hours vs pass/fail)
- Mathematical concepts behind logistic regression
- Data preprocessing and feature scaling
- Model evaluation with multiple metrics
- Comprehensive visualizations
Logistic-Regression/
├── README.md
├── Real Datasets/
│ ├── heart_disease.csv # Heart disease dataset (303 samples)
│ ├── heartdiseasecode.py # Main implementation with visualizations
│ └── logregExplained.py # Detailed commented version
└── Logistic Classification/
    └── StudyHours.py # Simple example implementation
Heart disease dataset (heart_disease.csv):
- Size: 303 samples, 14 columns (13 features plus the target)
- Target: Binary classification (0 = No heart disease, 1 = Heart disease)
- Features:
  - age: Age in years
  - sex: Gender (1 = male, 0 = female)
  - cp: Chest pain type (0-3)
  - trestbps: Resting blood pressure (mm Hg)
  - chol: Serum cholesterol (mg/dl)
  - fbs: Fasting blood sugar > 120 mg/dl (1 = true, 0 = false)
  - restecg: Resting electrocardiographic results (0-2)
  - thalach: Maximum heart rate achieved
  - exang: Exercise induced angina (1 = yes, 0 = no)
  - oldpeak: ST depression induced by exercise
  - slope: Slope of peak exercise ST segment (0-2)
  - ca: Number of major vessels colored by fluoroscopy (0-3)
  - thal: Thalassemia type (1-3)
  - target: Heart disease diagnosis (0 = no, 1 = yes)
Study hours dataset:
- Size: 6 samples
- Purpose: Demonstrate basic logistic regression concept
- Features: Hours studied (1-6)
- Target: Pass/Fail (0/1)
- Sigmoid Function
  def sigmoid(z): return 1 / (1 + np.exp(-z))
  - Converts linear predictions to probabilities (0-1 range)
  - S-shaped curve ideal for binary classification
- Cost Function
  - Binary Cross-Entropy Loss
  - Formula: J = -1/m * Σ[y*log(h) + (1-y)*log(1-h)]
  - Penalizes confident wrong predictions heavily
- Gradient Descent
  - Updates weights: w = w - α * dw
  - Updates bias: b = b - α * db
  - Learning rate (α) controls step size
- Feature Scaling
  - StandardScaler: (x - μ) / σ
  - Ensures all features contribute equally
  - Prevents larger-scale features from dominating
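The sigmoid, cost, and gradient descent pieces above fit together roughly as in the following sketch. It is a minimal illustration, not the repository's actual code: the function names `binary_cross_entropy` and `train`, the `eps` clipping, and the synthetic two-feature data are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    """Map real-valued scores to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y, h, eps=1e-12):
    """J = -1/m * sum(y*log(h) + (1-y)*log(1-h)); clipping avoids log(0)."""
    h = np.clip(h, eps, 1.0 - eps)
    return -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

def train(X, y, lr=0.01, epochs=5000):
    """Batch gradient descent on weights w and bias b (hypothetical helper)."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    losses = []
    for _ in range(epochs):
        h = sigmoid(X @ w + b)          # predicted probabilities
        dw = X.T @ (h - y) / m          # gradient with respect to the weights
        db = np.mean(h - y)             # gradient with respect to the bias
        w -= lr * dw                    # w = w - alpha * dw
        b -= lr * db                    # b = b - alpha * db
        losses.append(binary_cross_entropy(y, h))
    return w, b, losses

# Synthetic, illustrative data (NOT the heart disease dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w, b, losses = train(X, y)
print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

With zero-initialized parameters every prediction starts at 0.5, so the first recorded loss equals log(2); the loss then falls as the weights align with the separating direction.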
heartdiseasecode.py: Main implementation with complete workflow
- Full logistic regression from scratch
- Heart disease dataset processing
- Train/test split (80/20)
- Feature scaling with StandardScaler
- Model training (5000 epochs, lr=0.01)
- Multiple evaluation metrics
- Three visualization plots:
- Training loss convergence curve
- Confusion matrix heatmap
- ROC curve with AUC score
Key Results:
- Classification Accuracy: ~79.5%
- AUC Score: Typically 0.85-0.90
- Training converges smoothly
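The 80/20 split and the scaling step listed for this script could look something like the sketch below. It uses plain NumPy as a stand-in for StandardScaler and applies the same (x - μ) / σ formula; the helper names `split_train_test` and `standardize`, the random seed, and the synthetic two-column data are assumptions for illustration only.

```python
import numpy as np

def split_train_test(X, y, test_size=0.2, seed=42):
    """Shuffle row indices and hold out a `test_size` fraction (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(round(len(X) * test_size))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

def standardize(X_train, X_test):
    """Apply (x - mu) / sigma using statistics from the training set only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant columns
    return (X_train - mu) / sigma, (X_test - mu) / sigma

# Synthetic stand-in: two features on very different scales
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(55, 9, 100), rng.normal(250, 50, 100)])
y = rng.integers(0, 2, 100)

X_train, X_test, y_train, y_test = split_train_test(X, y)
X_train_s, X_test_s = standardize(X_train, X_test)
print(X_train_s.shape, X_test_s.shape)
```

Computing μ and σ on the training rows alone keeps information from the test set out of the model, which is the usual reason for fitting the scaler before transforming the held-out data.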
logregExplained.py: Educational version with extensive documentation
- Detailed comments explaining every step
- Mathematical concepts explained
- Purpose of each function documented
- Step-by-step algorithm breakdown
- Same functionality as main implementation
- Perfect for learning and understanding
Educational Value:
- Explains sigmoid function purpose
- Details gradient descent process
- Clarifies feature scaling necessity
- Describes evaluation metrics meaning
StudyHours.py: Simple demonstration example
- Minimal 6-sample dataset
- Basic logistic regression implementation
- Perfect for understanding core concepts
- No external dependencies except NumPy
- Clear input/output demonstration
Example Output:
Predictions: [0, 0, 0, 1, 1, 1]
Actual: [0, 0, 0, 1, 1, 1]
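A NumPy-only example in this spirit might be sketched as follows. The pairing of hours 1-6 with the labels shown above, the centering of the feature, and the learning rate and epoch count are assumptions; the real StudyHours.py may differ.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed data: hours studied 1-6, first three fail (0), last three pass (1)
hours = np.array([1, 2, 3, 4, 5, 6], dtype=float)
passed = np.array([0, 0, 0, 1, 1, 1], dtype=float)

# Assumption: center the feature so gradient descent converges quickly
x = hours - hours.mean()

w, b, lr = 0.0, 0.0, 0.1
for _ in range(1000):
    h = sigmoid(w * x + b)
    w -= lr * np.mean((h - passed) * x)   # gradient step for the weight
    b -= lr * np.mean(h - passed)         # gradient step for the bias

predictions = (sigmoid(w * x + b) >= 0.5).astype(int)
print("Predictions:", predictions.tolist())
print("Actual:     ", passed.astype(int).tolist())
```

Because the data are symmetric around 3.5 hours, the learned decision boundary falls between 3 and 4 hours and all six samples are classified correctly.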
Requires Python 3.7+ and the packages numpy, pandas, matplotlib, and scikit-learn.
# Clone the repository
git clone https://github.com/dnjstr/Logistic-Regression.git
cd Logistic-Regression
# Install dependencies
pip install numpy pandas matplotlib scikit-learn
# Run the main implementation
python "Real Datasets/heartdiseasecode.py"
# Run the educational version
python "Real Datasets/logregExplained.py"
# Run the simple example
python "Logistic Classification/StudyHours.py"
# Alternatively, run the heart disease example from inside its directory
cd "Real Datasets"
python heartdiseasecode.py
Expected Output:
Working directory: C:\...\Logistic-Regression
Classification Accuracy: 0.7951219512195122
Sample 0: true=1, prob=0.9724, pred_label=1
Sample 1: true=1, prob=0.9969, pred_label=1
Sample 2: true=0, prob=0.0160, pred_label=0
Sample 3: true=1, prob=0.9700, pred_label=1
Sample 4: true=0, prob=0.0461, pred_label=0
# Run the study hours example from inside its directory (starting at the repository root)
cd "Logistic Classification"
python StudyHours.py
h(x) = σ(θᵀx + b)
where σ(z) = 1/(1 + e^(-z))
J(θ) = -1/m * Σᵢ[yᵢlog(h(xᵢ)) + (1-yᵢ)log(1-h(xᵢ))]
θⱼ := θⱼ - α * ∂J/∂θⱼ
b := b - α * ∂J/∂b
∂J/∂θⱼ = 1/m * Σᵢ(h(xᵢ) - yᵢ) * xᵢⱼ
∂J/∂b = 1/m * Σᵢ(h(xᵢ) - yᵢ)
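One way to convince yourself that the gradient formulas above match the cost function is a finite-difference check. The sketch below compares the analytic gradients with central differences on random data; the function names `cost` and `analytic_gradients`, the data, and the step size `eps` are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, b, X, y):
    """J(theta) = -1/m * sum(y*log(h) + (1-y)*log(1-h))."""
    h = sigmoid(X @ theta + b)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def analytic_gradients(theta, b, X, y):
    """dJ/dtheta_j = 1/m * sum((h - y) * x_j), dJ/db = 1/m * sum(h - y)."""
    err = sigmoid(X @ theta + b) - y
    return X.T @ err / len(y), err.mean()

# Random illustrative problem
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = rng.integers(0, 2, 50).astype(float)
theta, b = rng.normal(size=3), 0.5

d_theta, d_b = analytic_gradients(theta, b, X, y)

# Central differences: (J(p + eps) - J(p - eps)) / (2 * eps)
eps = 1e-6
num_theta = np.zeros_like(theta)
for j in range(len(theta)):
    step = np.zeros_like(theta)
    step[j] = eps
    num_theta[j] = (cost(theta + step, b, X, y) - cost(theta - step, b, X, y)) / (2 * eps)
num_b = (cost(theta, b + eps, X, y) - cost(theta, b - eps, X, y)) / (2 * eps)

print("max weight-gradient gap:", np.max(np.abs(d_theta - num_theta)))
print("bias-gradient gap:", abs(d_b - num_b))
```

Both gaps should be tiny, which reflects the fact that the sigmoid's derivative σ(z)(1 - σ(z)) cancels against the denominators of the log terms and leaves the simple (h - y) form.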
- Accuracy: 79.51%
- Training Time: ~2-3 seconds (5000 epochs)
- Convergence: Loss decreases smoothly
- Overfitting: Minimal (good generalization)
- Precision: ~0.80
- Recall: ~0.78
- F1-Score: ~0.79
- AUC-ROC: ~0.87
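For reference, accuracy, precision, recall, and F1 can all be derived from the four confusion-matrix counts. The sketch below shows one way to compute them with NumPy; the helper names and the eight-sample label vectors are made up for illustration and are unrelated to the figures reported above.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels in {0, 1}."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return tp, tn, fp, fn

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1, with zero-division guarded as 0.0."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up labels: 3 true positives, 3 true negatives, 1 false positive, 1 false negative
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```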
Our implementation performs comparably to sklearn's LogisticRegression:
- Similar accuracy (±2%)
- Same mathematical approach
- Educational transparency
Training loss curve:
- Purpose: Monitor convergence
- Pattern: Decreasing loss over epochs
- Insight: Model learning progress
Confusion matrix heatmap:
- Format: 2x2 grid
- Diagonal: Correct predictions
- Off-diagonal: Classification errors
- Labels: "No Heart Disease" vs "Heart Disease"
ROC curve:
- X-axis: False Positive Rate
- Y-axis: True Positive Rate
- Diagonal: Random classifier baseline
- AUC: Model discrimination ability
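The ROC points and the AUC value behind such a plot can be computed without any plotting or ML library. The following is a minimal sketch, assuming both classes are present in the labels; `roc_curve_points` and `auc` are hypothetical helper names, and the four-sample input is a toy example rather than model output from this project.

```python
import numpy as np

def roc_curve_points(y_true, scores):
    """Sweep the decision threshold over the scores and return (fpr, tpr)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="mergesort")   # highest score first
    y_sorted, s_sorted = y_true[order], scores[order]
    tps = np.cumsum(y_sorted == 1)                  # true positives at each cut
    fps = np.cumsum(y_sorted == 0)                  # false positives at each cut
    # Keep the last index of every distinct score so tied scores move together
    last = np.r_[np.where(np.diff(s_sorted))[0], len(s_sorted) - 1]
    tpr = np.r_[0.0, tps[last] / tps[-1]]
    fpr = np.r_[0.0, fps[last] / fps[-1]]
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Toy example: 3 of the 4 positive/negative pairs are ranked correctly
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_curve_points(y_true, scores)
print("AUC:", auc(fpr, tpr))
```

An AUC of 0.5 corresponds to the diagonal random-classifier baseline, and 1.0 means every positive sample is scored above every negative one.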
- From Scratch Implementation: No sklearn for core algorithm
- Gradient Descent: Manual parameter optimization
- Feature Scaling: StandardScaler preprocessing
- Multiple Metrics: Accuracy, precision, recall, F1, AUC
- Comprehensive Visualization: Loss curve, confusion matrix, ROC curve
- Real Dataset: Medical diagnosis application
- Extensive Documentation: Every step explained
- Mathematical Clarity: Formulas and concepts detailed
- Progressive Complexity: Simple to advanced examples
- Code Comments: Line-by-line explanations
- Practical Application: Real-world medical dataset
- Clean Code Structure: Modular and readable
- Error Handling: Robust implementation
- Consistent Naming: Clear variable names
- Type Hints: Enhanced code clarity
- Documentation: Comprehensive README
- Sigmoid Function: Probability mapping
- Cross-Entropy: Classification loss function
- Gradient Descent: Optimization algorithm
- Feature Scaling: Data preprocessing importance
- Binary Classification: Two-class prediction
- Train/Test Split: Model evaluation strategy
- Overfitting Prevention: Generalization techniques
- Performance Metrics: Model assessment methods
- NumPy Operations: Matrix computations
- Data Preprocessing: Real dataset handling
- Visualization: Matplotlib plotting
- Class Design: Object-oriented implementation
- Medical Diagnosis: Healthcare AI applications
- Risk Assessment: Binary decision making
- Pattern Recognition: Feature importance analysis
- Model Interpretation: Understanding predictions
- All implementations are educational and demonstrate core concepts
- Real-world applications should consider additional validation
- Dataset is for educational purposes only
- Medical predictions require professional validation
This project is for educational purposes. Feel free to use and modify for learning.
Authors: Den & Adrian
Date: September 2025
Purpose: Educational Implementation of Logistic Regression
Accuracy: ~79.5% on Heart Disease Dataset