# Day 14: Algorithm Comparison Mini-Project

**Week 2 - Comprehensive ML Model Comparison Practice Template**

Welcome to Day 14 of your ML journey! Today we tackle one of the most critical decisions in machine learning: **Algorithm Selection**. Choosing the right algorithm can make the difference between a mediocre model and an exceptional one. You'll compare models side by side, evaluate them across multiple metrics, apply cross-validation, and practice a systematic approach to algorithm selection.

---

## Learning Objectives

By the end of this project, you will be able to:

1. **Apply multiple ML algorithms** to the same dataset
2. **Compare model performance** using various evaluation metrics  
3. **Understand algorithm strengths and weaknesses** in practice
4. **Make informed decisions** about model selection
5. **Create visualizations** to communicate results effectively

### Key ML Concepts Practiced

- Complete ML Pipeline: From raw data to model evaluation
- Data Preprocessing: Handling missing values, encoding, scaling
- Model Building: Implementing 9 different algorithms
- Model Evaluation: Accuracy, precision, recall, F1-Score, ROC-AUC
- Cross-Validation: Ensuring robust model evaluation
- Visualization: Creating comparison charts and performance plots

---


## Project Overview

### The Challenge

In this project, you will work with a **Heart Disease Prediction** dataset to:
- Compare 9 different ML algorithms
- Evaluate their performance comprehensively
- Identify the best model for this medical diagnosis problem

### Why This Matters

In real-world ML projects:
- **No single algorithm works best for all problems**
- **Data characteristics matter**: Dataset size, features, noise, class balance
- **Different metrics reveal different insights**: Accuracy alone is not enough
- **Computational cost vs. performance**: Sometimes simpler models are better

### Algorithms You Will Compare

| Algorithm | Type | Best Use Cases |
|-----------|------|----------------|
| Logistic Regression | Linear | Baseline, interpretability |
| Decision Tree | Tree-based | Non-linear patterns, interpretability |
| Random Forest | Ensemble | High accuracy, feature importance |
| Gradient Boosting | Ensemble | Strong accuracy on tabular data |
| XGBoost | Ensemble (Advanced) | Regularized boosting, strong accuracy on tabular data |
| LightGBM | Ensemble (Advanced) | Large datasets, speed |
| Support Vector Machine | Kernel-based | High-dimensional data |
| k-Nearest Neighbors | Instance-based | Small datasets, non-parametric |
| Naive Bayes | Probabilistic | Fast training, works with small data |

---


## Part 1: Setup and Data Loading

Let's start by importing all necessary libraries and loading our dataset.


In [104]:
# TODO: Import all required libraries
# Data manipulation: numpy, pandas
# Visualization: matplotlib, seaborn  
# Configure warnings and matplotlib inline
# Preprocessing: train_test_split, cross_val_score, StratifiedKFold, StandardScaler, SimpleImputer
# Models: LogisticRegression, DecisionTreeClassifier, RandomForestClassifier, GradientBoostingClassifier, 
#         SVC, KNeighborsClassifier, GaussianNB, XGBClassifier, LGBMClassifier
# Metrics: accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, 
#          confusion_matrix, classification_report, roc_curve


### Loading the Dataset

**Dataset**: Heart Disease UCI  
**URL**: https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data  
**Target**: Binary classification (0 = No disease, 1 = Disease)


In [105]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
column_names = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
                'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target']

# TODO: Load dataset with pd.read_csv(url, names=column_names, na_values='?')
# TODO: Convert target to binary: (df['target'] > 0).astype(int)
# TODO: Display shape and first few rows
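The loading steps above can be sketched without a network connection. The rows below are made up for illustration (they are not real UCI records) and only mimic the file's comma-separated layout, with `?` standing in for a missing value. Treat this as a minimal sketch of the parsing and target conversion, not as the reference solution.

```python
import io
import pandas as pd

column_names = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
                'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target']

# Three made-up rows in the same layout as the UCI file (illustrative only);
# '?' marks a missing value.
sample = io.StringIO(
    "60.0,1.0,3.0,140.0,240.0,0.0,0.0,150.0,0.0,1.0,2.0,0.0,3.0,0\n"
    "58.0,1.0,4.0,150.0,290.0,0.0,2.0,110.0,1.0,1.5,2.0,?,7.0,2\n"
    "45.0,0.0,2.0,125.0,210.0,0.0,0.0,170.0,0.0,0.4,1.0,0.0,?,0\n"
)

# na_values='?' converts the '?' placeholders to NaN while parsing
df = pd.read_csv(sample, names=column_names, na_values='?')

# The raw target can exceed 1; any value above 0 is treated as disease
df['target'] = (df['target'] > 0).astype(int)

print(df.shape)
print(df['target'].tolist())
print(df.isnull().sum().sum())
```

With the real data, `sample` would be replaced by the `url` defined in the cell above.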


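### Alternative: Synthetic Practice Dataset

If you cannot download the UCI file, uncomment and run the cell below to generate a synthetic dataset and use it instead.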

In [106]:
# # Alternative: Create a dummy dataset for practice
# # Use this if you want to practice without downloading from UCI

# import numpy as np
# import pandas as pd

# # Set random seed for reproducibility
# np.random.seed(42)

# # Generate dummy data with 300 samples
# n_samples = 300

# # Create dummy features based on real heart disease characteristics
# data = {
#     'age': np.random.normal(54, 9, n_samples).astype(int),  # Age: mean 54, SD 9
#     'sex': np.random.choice([0, 1], n_samples, p=[0.3, 0.7]),  # 70% male
#     'cp': np.random.choice([0, 1, 2, 3], n_samples, p=[0.4, 0.3, 0.2, 0.1]),  # Chest pain type
#     'trestbps': np.random.normal(130, 20, n_samples).astype(int),  # Resting BP
#     'chol': np.random.normal(250, 60, n_samples).astype(int),  # Cholesterol
#     'fbs': np.random.choice([0, 1], n_samples, p=[0.8, 0.2]),  # Fasting blood sugar
#     'restecg': np.random.choice([0, 1, 2], n_samples, p=[0.5, 0.3, 0.2]),  # Resting ECG
#     'thalach': np.random.normal(150, 25, n_samples).astype(int),  # Max heart rate
#     'exang': np.random.choice([0, 1], n_samples, p=[0.7, 0.3]),  # Exercise angina
#     'oldpeak': np.random.exponential(1, n_samples).round(1),  # ST depression
#     'slope': np.random.choice([0, 1, 2], n_samples, p=[0.3, 0.5, 0.2]),  # Slope
#     'ca': np.random.choice([0, 1, 2, 3], n_samples, p=[0.6, 0.2, 0.15, 0.05]),  # Vessels
#     'thal': np.random.choice([3, 6, 7], n_samples, p=[0.7, 0.2, 0.1]),  # Thalassemia
# }

# # Create target variable with some logical relationships
# target = []
# for i in range(n_samples):
#     prob = 0.3  # Base probability
    
#     # Increase probability based on risk factors
#     if data['age'][i] > 55: prob += 0.2
#     if data['sex'][i] == 1: prob += 0.1  # Male
#     if data['cp'][i] in [2, 3]: prob += 0.15  # Atypical chest pain
#     if data['trestbps'][i] > 140: prob += 0.1  # High BP
#     if data['chol'][i] > 280: prob += 0.1  # High cholesterol
#     if data['exang'][i] == 1: prob += 0.2  # Exercise angina
#     if data['oldpeak'][i] > 1: prob += 0.15  # ST depression
#     if data['ca'][i] > 0: prob += 0.1  # Vessel narrowing
    
#     # Add some noise
#     prob += np.random.normal(0, 0.1)
#     prob = max(0, min(1, prob))  # Clamp between 0 and 1
    
#     target.append(1 if np.random.random() < prob else 0)

# data['target'] = target

# # Create DataFrame
# df = pd.DataFrame(data)

# # Add some missing values (about 5% randomly)
# missing_indices = np.random.choice(df.index, size=int(0.05 * len(df)), replace=False)
# missing_cols = ['ca', 'thal', 'oldpeak']  # These often have missing values
# df[missing_cols] = df[missing_cols].astype(float)  # NaN needs a float dtype
# for idx in missing_indices:
#     col = np.random.choice(missing_cols)
#     df.loc[idx, col] = np.nan

# print("Dummy Dataset Created!")
# print(f"Shape: {df.shape}")
# print(f"Target distribution: {df['target'].value_counts().to_dict()}")
# print(f"Missing values: {df.isnull().sum().sum()}")
# print("\nFirst few rows:")
# df.head()

In [107]:
# TODO: Display dataset info using df.info()
# TODO: Display statistical summary using df.describe()
# TODO: Check missing values using df.isnull().sum()


**Reflection Question 1:** What do you notice about the missing values? How will they affect model building?

*Write your observations here:*


---

## Part 2: Exploratory Data Analysis

Before building models, analyze the data to understand patterns and relationships.


In [108]:
# TODO: Visualize target distribution with bar chart and pie chart
# TODO: Calculate class distribution percentages


In [109]:
# TODO: Plot distributions of numerical features
# Suggested: age, trestbps, chol, thalach, oldpeak


In [110]:
# TODO: Create correlation heatmap
# TODO: Find and display features most correlated with target
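One possible way to rank features by their correlation with the target is sketched below. The data is synthetic: `signal` and `noise` are hypothetical column names, built so that only `signal` is related to the target. On the real dataset the same pattern would apply to `df`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in data: 'signal' drives the target, 'noise' does not
signal = rng.normal(size=n)
noise = rng.normal(size=n)
target = (signal + 0.5 * rng.normal(size=n) > 0).astype(int)
demo = pd.DataFrame({'signal': signal, 'noise': noise, 'target': target})

# Pearson correlation of each feature with the target,
# sorted by absolute value so strong negative correlations also surface
target_corr = (demo.corr()['target']
               .drop('target')
               .sort_values(key=np.abs, ascending=False))
print(target_corr)
```

Sorting by absolute value matters here because a strongly negative correlation is just as informative as a strongly positive one.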


**Reflection Question 2:** Which features are most strongly correlated with heart disease? Any multicollinearity issues?

*Write your insights here:*


---

## Part 3: Data Preprocessing

Prepare the data for modeling by handling missing values, splitting data, and scaling features.


In [111]:
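# TODO: Handle missing values using SimpleImputer with median strategy
# Note: fitting the imputer on the full dataset lets test rows influence the medians;
#       for a stricter workflow, fit it on the training split only (after the split below)
# TODO: Verify no missing values remain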


In [112]:
# TODO: Separate features (X) and target (y)
# TODO: Perform stratified train-test split (test_size=0.2, random_state=42)
# TODO: Display split sizes and class distribution


In [113]:
# TODO: Scale features using StandardScaler
# Important: Fit only on training data to prevent data leakage
# TODO: Verify scaling (mean ~0, std ~1)
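The preprocessing order can be sketched as follows, using synthetic data in place of the heart disease features. The key point is that `fit_transform` is called only on the training split, while the test split only sees `transform`. This is a minimal sketch under those assumptions, not the only valid arrangement.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(loc=50, scale=10, size=(200, 4))   # synthetic features
y = rng.integers(0, 2, size=200)                  # synthetic binary labels
X[rng.random(X.shape) < 0.05] = np.nan            # roughly 5% missing values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

imputer = SimpleImputer(strategy='median')
scaler = StandardScaler()

# fit_transform learns medians, means, and stds from the training split only
X_train_scaled = scaler.fit_transform(imputer.fit_transform(X_train))
# transform reuses those training statistics on the test split
X_test_scaled = scaler.transform(imputer.transform(X_test))

print(X_train_scaled.mean(axis=0).round(3))   # close to 0 per feature
print(X_train_scaled.std(axis=0).round(3))    # close to 1 per feature
```

The test split will generally not have a mean of exactly 0 or a standard deviation of exactly 1 after scaling, and that is expected: it is being measured against the training statistics.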


**Reflection Question 3:** Why fit the scaler only on the training data? Explain data leakage.

*Explain the concept:*


---

## Part 4: Model Training and Comparison

Train multiple algorithms and collect their predictions for comparison.


In [114]:
# TODO: Create models dictionary with 9 algorithms:
# - Logistic Regression, Decision Tree, Random Forest
# - Gradient Boosting, XGBoost, LightGBM
# - SVM, k-Nearest Neighbors, Naive Bayes
# TODO: Print the number of models and their names


In [115]:
# TODO: Loop through models and:
# 1. Train on X_train_scaled, y_train
# 2. Get predictions on X_test_scaled
# 3. Get probability predictions (if available)
# 4. Calculate: accuracy, precision, recall, f1_score, ROC-AUC
# 5. Store results in dictionaries
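The loop described above could look roughly like this. To keep the sketch short and dependency-free, it uses three scikit-learn models and a synthetic dataset from `make_classification`; XGBoost and LightGBM would slot into the same dictionary. The fallback from `predict_proba` to `decision_function` is one way to handle models (such as `SVC` with default settings) that do not expose probabilities.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed heart disease data
X, y = make_classification(n_samples=300, n_features=13, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Naive Bayes': GaussianNB(),
}

results = {}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    # ROC-AUC needs continuous scores rather than hard labels
    if hasattr(model, 'predict_proba'):
        y_score = model.predict_proba(X_test_scaled)[:, 1]
    elif hasattr(model, 'decision_function'):
        y_score = model.decision_function(X_test_scaled)
    else:
        y_score = None
    results[name] = {
        'accuracy': accuracy_score(y_test, y_pred),
        'precision': precision_score(y_test, y_pred, zero_division=0),
        'recall': recall_score(y_test, y_pred, zero_division=0),
        'f1': f1_score(y_test, y_pred, zero_division=0),
        'roc_auc': roc_auc_score(y_test, y_score) if y_score is not None else None,
    }

for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items() if v is not None})
```

Storing the metrics as a dictionary of dictionaries makes the later step simple: `pd.DataFrame(results).T` gives one row per model.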


---

## Part 5: Model Evaluation and Visualization

Analyze and visualize model performance using multiple metrics.


In [116]:
# TODO: Create results DataFrame and sort by accuracy
# TODO: Display formatted comparison table


In [117]:
# TODO: Create horizontal bar chart comparing model accuracies


In [118]:
# TODO: Create grouped bar chart for multiple metrics


In [119]:
# TODO: Plot ROC curves for all models


In [120]:
# TODO: Create confusion matrices for top 3 models


In [121]:
# TODO: Print classification reports for top 3 models


**Reflection Question 4:** Which model performed best? For medical diagnosis, which metric matters most and why?

*Write your analysis here:*


---

## Part 6: Cross-Validation Analysis

Use cross-validation to get more robust performance estimates.


In [122]:
# TODO: Perform 5-fold stratified cross-validation for all models
# TODO: Store mean, std, and test accuracy
# TODO: Display CV results table
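A minimal sketch of 5-fold stratified cross-validation for a single model is shown below, again on synthetic data. Wrapping the scaler and the model in a pipeline is a design choice assumed here, not something the template requires: it makes the scaler re-fit inside each fold, so validation folds never contribute to the scaling statistics.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart disease features and labels
X, y = make_classification(n_samples=300, n_features=13, random_state=42)

# The scaler is fitted separately on the training part of every fold
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratification keeps the class ratio similar across folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring='accuracy')

print(f"fold accuracies: {scores.round(3)}")
print(f"mean={scores.mean():.3f} std={scores.std():.3f}")
```

A small standard deviation across folds suggests the model's performance does not depend much on which rows land in the validation fold.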


In [123]:
# TODO: Create error bar plot for CV results


**Reflection Question 5:** How do the CV results compare to the test results? Which model shows the most stability?

*Write your analysis here:*


---

## Part 7: Final Model Selection and Insights

Synthesize all metrics to determine the best overall model and provide actionable insights.


In [124]:
# TODO: Combine all metrics and calculate overall ranking
# TODO: Display comprehensive summary of top 3 models
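One simple way to combine several metrics into an overall ranking is to rank the models per metric and average the ranks. The sketch below uses placeholder scores for three hypothetical models (`model_a`, `model_b`, `model_c`); they are not real results. Equal weighting of the metrics is an assumption, and a medical use case might reasonably weight recall more heavily.

```python
import pandas as pd

# Placeholder scores for three hypothetical models; not real results
summary = pd.DataFrame(
    {'accuracy': [0.80, 0.85, 0.83],
     'recall':   [0.90, 0.78, 0.84],
     'roc_auc':  [0.88, 0.91, 0.90]},
    index=['model_a', 'model_b', 'model_c'])

# Rank each metric separately (1 = best), then average the ranks per model
ranks = summary.rank(ascending=False)
summary['mean_rank'] = ranks.mean(axis=1)

print(summary.sort_values('mean_rank'))
```

Rank averaging avoids mixing metrics that live on different scales, at the cost of ignoring how large the gaps between models are.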


**Final Reflection:** Which model would you deploy and why? Consider performance, interpretability, and deployment constraints.

*Write your decision and justification here:*


---

## Key Insights and Recommendations

### Algorithm Characteristics

**Ensemble Methods** - Random Forest, XGBoost, and LightGBM often perform well on tabular data because they capture non-linear relationships and feature interactions.

**Linear Models** - Logistic Regression provides an interpretable baseline that trains quickly.

**Kernel Methods** - SVM is sensitive to feature scaling and becomes computationally expensive on large datasets.

**Instance-Based** - kNN depends heavily on the choice of k and the distance metric.

### Recommendations by Use Case

**Production Deployment:** Choose the top-performing ensemble method when accuracy and robustness matter most.

**Interpretability:** Choose Logistic Regression or Decision Tree when transparency is critical.

**Real-Time:** Choose LightGBM or Logistic Regression when low inference latency is the priority.

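**Medical Context:** Recall is crucial because a false negative is a missed disease case. Prefer models that minimize false negatives, even at some cost in precision.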
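A small worked example shows why accuracy alone can hide a recall problem. The labels below are hypothetical (10 patients, 4 with disease) and chosen only to make the arithmetic easy to follow.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels for 10 patients (1 = disease); illustrative only
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={accuracy:.2f}")   # 8 of 10 correct -> 0.80
print(f"recall={recall:.2f}")       # 2 of 4 disease cases found -> 0.50
```

The model is right 80% of the time yet misses half of the patients who actually have the disease, which is exactly the failure mode recall is meant to expose.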

---

## Optional Challenges (Bonus)

1. **Hyperparameter Tuning**: Use GridSearchCV to optimize your top model
2. **Feature Selection**: Implement and compare performance with selected features  
3. **Ensemble Creation**: Build voting/stacking classifier with top 3 models
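For the hyperparameter tuning challenge, a starting point might look like the sketch below. The parameter grid, the choice of Random Forest, and the use of recall as the scoring metric are all assumptions for illustration; the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the training data
X, y = make_classification(n_samples=200, n_features=13, random_state=42)

# A deliberately small grid to keep the search fast
param_grid = {'n_estimators': [50, 100], 'max_depth': [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring='recall',   # assumed priority for a medical setting
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

In the project itself, the search should run on the training split only, with the held-out test set reserved for a final check of the tuned model.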

---


## Project Submission Guidelines

### Submission Requirements

Before submitting your project, ensure you have:

1. **Completed all TODO sections** with working code
2. **Answered all reflection questions** with detailed analysis
3. **Generated all required visualizations** 
4. **Tested code and verified** all outputs
5. **Documented findings** and insights

### How to Submit

**Step 1:** Send LinkedIn connection request to: https://www.linkedin.com/in/hashirahmed07/

**Step 2:** Clean up notebook and ensure all cells run without errors

**Step 3:** Submit with title format: **30_Days_ML_Practice_Project_Week2**

### Evaluation Criteria

Your project will be evaluated based on:

- **Code Quality (30%)** - Correctness, organization, readability
- **Analysis Quality (30%)** - Depth of EDA and interpretation  
- **Visualization (20%)** - Clarity and professionalism
- **Insights (20%)** - Model comparison and justification

---

**Good luck with your project! Remember: The goal is to learn and practice, not just to get perfect results.**


---
## 📫 Let's Connect
- 💼 **LinkedIn:** [hashirahmed07](https://www.linkedin.com/in/hashirahmed07/)
- 📧 **Email:** [Hashirahmad330@gmail.com](mailto:Hashirahmad330@gmail.com)
- 🐙 **GitHub:** [CodeByHashir](https://github.com/CodeByHashir)
