# Day 14: Algorithm Comparison Mini-Project

**Week 2 - Comprehensive ML Model Comparison Practice Template**

Welcome to Day 14 of your ML journey! Today we tackle one of the most critical decisions in machine learning: **algorithm selection**. Choosing the right algorithm can make the difference between a mediocre model and an exceptional one. You'll compare several models on the same data, evaluate them across multiple metrics, apply cross-validation, and practice a systematic approach to choosing among them.

---

## Learning Objectives

By the end of this project, you will be able to:

1. **Apply multiple ML algorithms** to the same dataset
2. **Compare model performance** using various evaluation metrics  
3. **Understand algorithm strengths and weaknesses** in practice
4. **Make informed decisions** about model selection
5. **Create visualizations** to communicate results effectively

### Key ML Concepts Practiced

- Complete ML Pipeline: From raw data to model evaluation
- Data Preprocessing: Handling missing values, encoding, scaling
- Model Building: Implementing 9 different algorithms
- Model Evaluation: Accuracy, precision, recall, F1-Score, ROC-AUC
- Cross-Validation: Ensuring robust model evaluation
- Visualization: Creating comparison charts and performance plots

---


## Project Overview

### The Challenge

In this project, you will work with a **Heart Disease Prediction** dataset to:
- Compare 9 different ML algorithms
- Evaluate their performance comprehensively
- Identify the best model for this medical diagnosis problem

### Why This Matters

In real-world ML projects:
- **No single algorithm works best for all problems**
- **Data characteristics matter**: Dataset size, features, noise, class balance
- **Different metrics reveal different insights**: Accuracy alone is not enough
- **Computational cost vs. performance**: Sometimes simpler models are better

### Algorithms You Will Compare

| Algorithm | Type | Best Use Cases |
|-----------|------|----------------|
| Logistic Regression | Linear | Baseline, interpretability |
| Decision Tree | Tree-based | Non-linear patterns, interpretability |
| Random Forest | Ensemble | High accuracy, feature importance |
| Gradient Boosting | Ensemble | Strong accuracy on structured/tabular data |
| XGBoost | Ensemble (Advanced) | Optimized, regularized gradient boosting |
| LightGBM | Ensemble (Advanced) | Large datasets, speed |
| Support Vector Machine | Kernel-based | High-dimensional data |
| k-Nearest Neighbors | Instance-based | Small datasets, non-parametric |
| Naive Bayes | Probabilistic | Fast training, works with small data |

---


## Part 1: Setup and Data Loading

Let's start by importing all necessary libraries and loading our dataset.


In [81]:
# TODO: Import all required libraries
# Data manipulation: numpy, pandas
# Visualization: matplotlib, seaborn  
# Configure warnings and matplotlib inline
# Preprocessing: train_test_split, cross_val_score, StratifiedKFold, StandardScaler, SimpleImputer
# Models: LogisticRegression, DecisionTreeClassifier, RandomForestClassifier, GradientBoostingClassifier, 
#         SVC, KNeighborsClassifier, GaussianNB, XGBClassifier, LGBMClassifier
# Metrics: accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, 
#          confusion_matrix, classification_report, roc_curve


### Loading the Dataset

**Dataset**: Heart Disease UCI  
**URL**: https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data  
**Target**: Binary classification (0 = No disease, 1 = Disease)


In [None]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
column_names = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
                'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target']

# TODO: Load dataset with pd.read_csv(url, names=column_names, na_values='?')
# TODO: Convert target to binary: (df['target'] > 0).astype(int)
# TODO: Display shape (df.shape) and first few rows (df.head(), called with parentheses)


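Output (first and last five rows; the `slope`, `ca`, `thal`, and `target` columns are truncated in this export):

| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63.0 | 1.0 | 1.0 | 145.0 | 233.0 | 1.0 | 2.0 | 150.0 | 0.0 | 2.3 |
| 1 | 67.0 | 1.0 | 4.0 | 160.0 | 286.0 | 0.0 | 2.0 | 108.0 | 1.0 | 1.5 |
| 2 | 67.0 | 1.0 | 4.0 | 120.0 | 229.0 | 0.0 | 2.0 | 129.0 | 1.0 | 2.6 |
| 3 | 37.0 | 1.0 | 3.0 | 130.0 | 250.0 | 0.0 | 0.0 | 187.0 | 0.0 | 3.5 |
| 4 | 41.0 | 0.0 | 2.0 | 130.0 | 204.0 | 0.0 | 2.0 | 172.0 | 0.0 | 1.4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 298 | 45.0 | 1.0 | 1.0 | 110.0 | 264.0 | 0.0 | 0.0 | 132.0 | 0.0 | 1.2 |
| 299 | 68.0 | 1.0 | 4.0 | 144.0 | 193.0 | 1.0 | 0.0 | 141.0 | 0.0 | 3.4 |
| 300 | 57.0 | 1.0 | 4.0 | 130.0 | 131.0 | 0.0 | 0.0 | 115.0 | 1.0 | 1.2 |
| 301 | 57.0 | 0.0 | 2.0 | 130.0 | 236.0 | 0.0 | 2.0 | 174.0 | 0.0 | 0.0 |
| 302 | 38.0 | 1.0 | 3.0 | 138.0 | 175.0 | 0.0 | 0.0 | 173.0 | 0.0 | 0.0 |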
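
As a hint for the loading step, here is a minimal sketch of how `na_values='?'` and the binary target conversion behave. It reads from an in-memory sample rather than the URL so it runs offline; the four rows are hypothetical values in the file's comma-separated layout, not an excerpt of the real data.

```python
import io
import pandas as pd

column_names = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
                'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target']

# Hypothetical rows for illustration only (including the '?' placeholders).
sample = io.StringIO(
    "63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0\n"
    "67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2\n"
    "52.0,1.0,3.0,138.0,223.0,0.0,0.0,169.0,0.0,0.0,1.0,?,3.0,0\n"
    "53.0,0.0,3.0,128.0,216.0,0.0,2.0,115.0,0.0,0.0,1.0,0.0,?,3\n"
)

# '?' entries are parsed as NaN, so the affected columns stay numeric.
df = pd.read_csv(sample, names=column_names, na_values='?')

# Any original target value above 0 is treated as "disease present".
df['target'] = (df['target'] > 0).astype(int)

print(df.shape)
print(df.head())          # head is a method, so it needs parentheses
print(df.isnull().sum())  # missing values per column
```

To use the real dataset, pass `url` in place of `sample`. Writing `df.head` without parentheses displays the bound method instead of returning the first rows.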

In [83]:
# TODO: Display dataset info using df.info()
# TODO: Display statistical summary using df.describe()
# TODO: Check missing values using df.isnull().sum()


**Reflection Question 1:** What do you notice about missing values? How will this affect model building?

*Write your observations here:*


---

## Part 2: Exploratory Data Analysis

Before building models, analyze the data to understand patterns and relationships.


In [84]:
# TODO: Visualize target distribution with bar chart and pie chart
# TODO: Calculate class distribution percentages
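
One possible way to compute the class distribution percentages is sketched below. The `target` series is a hypothetical stand-in for `df['target']`.

```python
import pandas as pd

# Hypothetical labels standing in for df['target'].
target = pd.Series([0, 1, 0, 0, 1, 0, 1, 0, 0, 1], name='target')

counts = target.value_counts().sort_index()
percentages = (target.value_counts(normalize=True).sort_index() * 100).round(1)

summary = pd.DataFrame({'count': counts, 'percent': percentages})
print(summary)
```

The same `counts` series can feed a bar chart or pie chart, for example through `counts.plot(kind='bar')`.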


In [85]:
# TODO: Plot distributions of numerical features
# Suggested: age, trestbps, chol, thalach, oldpeak


In [86]:
# TODO: Create correlation heatmap
# TODO: Find and display features most correlated with target
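
A sketch of ranking features by their correlation with the target follows. It uses synthetic data (a hypothetical `signal` column that drives the target and a `noise` column that does not), so the column names are assumptions rather than dataset features.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200

# Synthetic stand-in data: 'signal' influences the target, 'noise' does not.
signal = rng.normal(size=n)
noise = rng.normal(size=n)
target = (signal + 0.5 * rng.normal(size=n) > 0).astype(int)
df = pd.DataFrame({'signal': signal, 'noise': noise, 'target': target})

corr_matrix = df.corr()

# Order features by absolute correlation with the target, keeping the sign.
corr_with_target = corr_matrix['target'].drop('target')
order = corr_with_target.abs().sort_values(ascending=False).index
target_corr = corr_with_target.loc[order]
print(target_corr)
```

The full `corr_matrix` is what you would pass to `seaborn.heatmap(corr_matrix, annot=True)` for the heatmap itself.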


**Reflection Question 2:** Which features are most strongly correlated with heart disease? Any multicollinearity issues?

*Write your insights here:*


---

## Part 3: Data Preprocessing

Prepare the data for modeling by handling missing values, splitting data, and scaling features.


In [87]:
# TODO: Handle missing values using SimpleImputer with median strategy
# TODO: Verify no missing values remain
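
The imputation step might look like the sketch below. The small frame is hypothetical and only borrows the `ca` and `thal` column names.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical frame with one missing value per column.
df = pd.DataFrame({'ca': [0.0, 3.0, np.nan, 1.0, 0.0],
                   'thal': [6.0, 3.0, 3.0, np.nan, 7.0]})

imputer = SimpleImputer(strategy='median')

# fit_transform returns a NumPy array, so rebuild the DataFrame afterwards.
df_imputed = pd.DataFrame(imputer.fit_transform(df),
                          columns=df.columns, index=df.index)

print(df_imputed)
print(df_imputed.isnull().sum())  # should be all zeros
```

Note that the median of categorical codes can produce a value that is not a valid category (0.5 for `ca` here); `strategy='most_frequent'` is an alternative worth considering for such columns.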


In [88]:
# TODO: Separate features (X) and target (y)
# TODO: Perform stratified train-test split (test_size=0.2, random_state=42, stratify=y)
# TODO: Display split sizes and class distribution
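
A minimal sketch of a stratified split, using synthetic arrays in place of the real `X` and `y`, shows how `stratify=y` preserves the class ratio in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 100 samples, 60% class 0 and 40% class 1.
X = rng.normal(size=(100, 4))
y = np.array([0] * 60 + [1] * 40)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

print("Train:", X_train.shape, "Test:", X_test.shape)
print("Train class share:", np.bincount(y_train) / len(y_train))
print("Test class share: ", np.bincount(y_test) / len(y_test))
```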


In [89]:
# TODO: Scale features using StandardScaler
# Important: Fit only on training data to prevent data leakage
# TODO: Verify scaling (mean ~0, std ~1)
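
The scaling step can be sketched as follows, again on synthetic data. The key point is that the scaler's mean and standard deviation come from the training rows only and are then reused on the test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features with a non-zero mean and non-unit spread.
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics learned from training rows only
X_test_scaled = scaler.transform(X_test)        # training statistics reused, no refit

print("Train mean:", X_train_scaled.mean(axis=0).round(3))
print("Train std: ", X_train_scaled.std(axis=0).round(3))
print("Test mean: ", X_test_scaled.mean(axis=0).round(3))
```

The test-set mean will generally be close to, but not exactly, zero. That is expected, because the test rows played no part in fitting the scaler.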


**Reflection Question 3:** Why fit scaler only on training data? Explain data leakage.

*Explain the concept:*


---

## Part 4: Model Training and Comparison

Train multiple algorithms and collect their predictions for comparison.


In [90]:
# TODO: Create models dictionary with 9 algorithms:
# - Logistic Regression, Decision Tree, Random Forest
# - Gradient Boosting, XGBoost, LightGBM
# - SVM, k-Nearest Neighbors, Naive Bayes
# TODO: Print the number of models and their names
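
One way to set up the models dictionary is sketched below. The hyperparameters are illustrative defaults, not tuned values, and XGBoost and LightGBM are imported defensively in case those optional packages are not installed.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
    # probability=True enables predict_proba, which ROC-AUC needs later
    'SVM': SVC(probability=True, random_state=42),
    'k-Nearest Neighbors': KNeighborsClassifier(n_neighbors=5),
    'Naive Bayes': GaussianNB(),
}

# Optional third-party libraries: add them only if they are available.
try:
    from xgboost import XGBClassifier
    models['XGBoost'] = XGBClassifier(random_state=42)
except ImportError:
    print("xgboost not installed; skipping XGBoost")

try:
    from lightgbm import LGBMClassifier
    models['LightGBM'] = LGBMClassifier(random_state=42, verbose=-1)
except ImportError:
    print("lightgbm not installed; skipping LightGBM")

print(f"{len(models)} models: {list(models)}")
```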


In [91]:
# TODO: Loop through models and:
# 1. Train on X_train_scaled, y_train
# 2. Get predictions on X_test_scaled
# 3. Get probability predictions (if available)
# 4. Calculate: accuracy, precision, recall, f1_score, ROC-AUC
# 5. Store results in dictionaries
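
The loop described above could be structured roughly like this. The sketch uses synthetic data from `make_classification` and only three models to stay short; the `results` layout is an assumption, not a required format.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart disease data.
X, y = make_classification(n_samples=300, n_features=13, n_informative=6,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=50, random_state=42),
    'Naive Bayes': GaussianNB(),
}

results = {}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    # Use class-1 probabilities when available, else decision scores.
    if hasattr(model, 'predict_proba'):
        y_score = model.predict_proba(X_test_scaled)[:, 1]
    else:
        y_score = model.decision_function(X_test_scaled)
    results[name] = {
        'accuracy': accuracy_score(y_test, y_pred),
        'precision': precision_score(y_test, y_pred, zero_division=0),
        'recall': recall_score(y_test, y_pred, zero_division=0),
        'f1': f1_score(y_test, y_pred, zero_division=0),
        'roc_auc': roc_auc_score(y_test, y_score),
    }

results_df = pd.DataFrame(results).T.sort_values('accuracy', ascending=False)
print(results_df.round(3))
```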


---

## Part 5: Model Evaluation and Visualization

Analyze and visualize model performance using multiple metrics.


In [92]:
# TODO: Create results DataFrame and sort by accuracy
# TODO: Display formatted comparison table


In [93]:
# TODO: Create horizontal bar chart comparing model accuracies


In [94]:
# TODO: Create grouped bar chart for multiple metrics


In [95]:
# TODO: Plot ROC curves for all models
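
For the ROC step, the points of each curve come from `roc_curve`. The sketch below computes them for a single model on synthetic data; plotting is left out so the snippet has no plotting dependency.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data.
X, y = make_classification(n_samples=300, n_features=13, n_informative=6,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train, y_train)

# ROC analysis needs scores for the positive class, not hard predictions.
y_score = model.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, y_score)
auc_value = roc_auc_score(y_test, y_score)

print(f"AUC = {auc_value:.3f} from {len(thresholds)} thresholds")
```

Repeating this per model and calling `plt.plot(fpr, tpr, label=name)` on a shared axis, plus a diagonal reference line for a random classifier, gives the combined ROC chart.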


In [96]:
# TODO: Create confusion matrices for top 3 models
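
Reading a confusion matrix is easier with a tiny worked case. The labels below are hypothetical, chosen only to show how the four cells map to recall.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical true labels and predictions (1 = disease).
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)

# For binary labels, scikit-learn orders the cells as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = cm.ravel()
recall = tp / (tp + fn)

print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp} recall={recall:.3f}")
```

Here the single false negative is the missed disease case. The matrix itself can be drawn with `seaborn.heatmap(cm, annot=True, fmt='d')`.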


In [97]:
# TODO: Print classification reports for top 3 models


**Reflection Question 4:** Which model performed best? For medical diagnosis, which metric matters most and why?

*Write your analysis here:*


---

## Part 6: Cross-Validation Analysis

Use cross-validation to get more robust performance estimates.


In [98]:
# TODO: Perform 5-fold stratified cross-validation for all models
# TODO: Store mean, std, and test accuracy
# TODO: Display CV results table
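
A possible shape for the cross-validation step is sketched below on synthetic data with two models. Wrapping each model in a pipeline with the scaler is a design choice of this sketch: it refits the scaler inside every fold so no validation rows influence the scaling.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data.
X, y = make_classification(n_samples=300, n_features=13, n_informative=6,
                           random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
}

rows = []
for name, model in models.items():
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipe, X, y, cv=cv, scoring='accuracy')
    rows.append({'model': name,
                 'cv_mean': scores.mean(),
                 'cv_std': scores.std()})

cv_df = (pd.DataFrame(rows)
         .set_index('model')
         .sort_values('cv_mean', ascending=False))
print(cv_df.round(3))
```

A small `cv_std` relative to the other models suggests more stable performance across folds.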


In [99]:
# TODO: Create error bar plot for CV results


**Reflection Question 5:** How do CV results compare to test results? Which model shows most stability?

*Write your analysis here:*


---

## Part 7: Final Model Selection and Insights

Synthesize all metrics to determine the best overall model and provide actionable insights.


In [100]:
# TODO: Combine all metrics and calculate overall ranking
# TODO: Display comprehensive summary of top 3 models
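
One simple way to combine metrics into an overall ranking is to rank the models per metric and average the ranks. The model names and scores below are placeholders made up for illustration, not results from this dataset, and mean rank is only one of several reasonable aggregation schemes.

```python
import pandas as pd

# Placeholder scores for three hypothetical models.
metrics = pd.DataFrame(
    {'accuracy': [0.80, 0.82, 0.78],
     'recall':   [0.70, 0.75, 0.80],
     'roc_auc':  [0.85, 0.84, 0.86]},
    index=['model_a', 'model_b', 'model_c'])

# Rank 1 is the best (highest) score within each metric.
ranks = metrics.rank(ascending=False)

# Lower mean rank means better overall standing.
metrics['mean_rank'] = ranks.mean(axis=1)
ranking = metrics.sort_values('mean_rank')

print(ranking.round(2))
```

For a medical use case you might weight recall more heavily than the other metrics instead of averaging the ranks equally.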


**Final Reflection:** Which model would you deploy and why? Consider performance, interpretability, and deployment constraints.

*Write your decision and justification here:*


---

## Key Insights and Recommendations

### Algorithm Characteristics

**Ensemble Methods** - Random Forest, XGBoost, and LightGBM often perform best on tabular data because they capture non-linear relationships and feature interactions.

**Linear Models** - Logistic Regression provides an interpretable baseline that trains quickly.

**Kernel Methods** - SVMs are sensitive to feature scaling and become computationally expensive on large datasets.

**Instance-Based** - kNN depends heavily on the choice of k and the distance metric.

### Recommendations by Use Case

**Production Deployment:** Choose the top-performing ensemble method when accuracy and robustness matter most.

**Interpretability:** Choose Logistic Regression or a Decision Tree when transparency is critical.

**Real-Time:** Choose LightGBM or Logistic Regression when inference speed is the priority.

**Medical Context:** Prioritize recall. A false negative (a missed disease case) is usually more costly than a false positive.

---

## Optional Challenges (Bonus)

1. **Hyperparameter Tuning**: Use GridSearchCV to optimize your top model
2. **Feature Selection**: Implement and compare performance with selected features  
3. **Ensemble Creation**: Build voting/stacking classifier with top 3 models
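
As a starting point for the first challenge, a grid search might look like the sketch below. The parameter grid, the Random Forest choice, and the recall scoring are assumptions for illustration; substitute your own top model and the metric you care about.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in data.
X, y = make_classification(n_samples=200, n_features=13, n_informative=6,
                           random_state=42)

# Deliberately small grid so the search finishes quickly.
param_grid = {'n_estimators': [25, 50], 'max_depth': [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring='recall',
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best mean CV recall: {search.best_score_:.3f}")
```

In the project itself, run the search on the training split only and keep the test set for the final evaluation.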

---


## Project Submission Guidelines

### Submission Requirements

Before submitting your project, ensure you have:

1. **Completed all TODO sections** with working code
2. **Answered all reflection questions** with detailed analysis
3. **Generated all required visualizations** 
4. **Tested code and verified** all outputs
5. **Documented findings** and insights

### How to Submit

**Step 1:** Send a LinkedIn connection request to: https://www.linkedin.com/in/hashirahmed07/

**Step 2:** Clean up your notebook and ensure all cells run without errors

**Step 3:** Submit with title format: **30_Days_ML_Practice_Project_Week2**

### Evaluation Criteria

Your project will be evaluated based on:

- **Code Quality (30%)** - Correctness, organization, readability
- **Analysis Quality (30%)** - Depth of EDA and interpretation  
- **Visualization (20%)** - Clarity and professionalism
- **Insights (20%)** - Model comparison and justification

---

**Good luck with your project! Remember: The goal is to learn and practice, not just to get perfect results.**
