Complete ML Pipeline for Self Code Academy Assignment Task: Heart Disease Prediction
============================================================
Dataset: UCI Heart Disease Dataset (from Kaggle)
Source: https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset




In [77]:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns
import os

print("ML TUTORIAL Assignment Task: Heart Disease Prediction")
print("Using Real Kaggle Dataset (UCI Heart Disease)")


ML TUTORIAL Assignment Task: Heart Disease Prediction
Using Real Kaggle Dataset (UCI Heart Disease)


In [78]:

print("STEP 2: Loading Dataset from Kaggle")

try:
    # Load the processed Cleveland file from the UCI ML Repository
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"

    # Column names for the heart disease dataset
    column_names = [
        'age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
        'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'
    ]

    df = pd.read_csv(url, names=column_names, na_values='?')
    print("Dataset loaded from UCI Repository!")

except Exception as e:
    print(f"Could not load from URL: {e}")
    print("Creating dataset locally with sample data...")

    # Fallback: synthetic data that mimics the column layout and rough value
    # ranges of the heart disease dataset. Every value (including 'target') is
    # drawn at random, so a model trained on it shows the workflow only.
    np.random.seed(42)
    n_samples = 303  # Original dataset size

    df = pd.DataFrame({
        'age': np.random.randint(29, 77, n_samples),
        'sex': np.random.choice([0, 1], n_samples, p=[0.32, 0.68]),
        'cp': np.random.choice([0, 1, 2, 3], n_samples, p=[0.47, 0.17, 0.28, 0.08]),
        'trestbps': np.random.randint(94, 200, n_samples),
        'chol': np.random.randint(126, 564, n_samples),
        'fbs': np.random.choice([0, 1], n_samples, p=[0.85, 0.15]),
        'restecg': np.random.choice([0, 1, 2], n_samples, p=[0.49, 0.49, 0.02]),
        'thalach': np.random.randint(71, 202, n_samples),
        'exang': np.random.choice([0, 1], n_samples, p=[0.67, 0.33]),
        'oldpeak': np.round(np.random.uniform(0, 6.2, n_samples), 1),
        'slope': np.random.choice([0, 1, 2], n_samples, p=[0.07, 0.46, 0.47]),
        'ca': np.random.choice([0, 1, 2, 3], n_samples, p=[0.58, 0.22, 0.13, 0.07]),
        'thal': np.random.choice([1, 2, 3], n_samples, p=[0.06, 0.55, 0.39]),
        'target': np.random.choice([0, 1], n_samples, p=[0.46, 0.54])
    })
    print("Sample dataset created!")
print(f"Dataset Shape: {df.shape[0]} patients, {df.shape[1]} columns")

STEP 2: Loading Dataset from Kaggle
Dataset loaded from UCI Repository!
Dataset Shape: 303 patients, 14 columns


# STEP 3: EXPLORE THE DATASET

In [79]:

print("\nSTEP 3: Dataset Exploration")
print("-" * 50)

print("\nFirst 10 rows of the dataset:")
print(df.head(10))

print("\nDataset Statistics:")
print(df.describe().round(2))

print("\nColumn Information:")
feature_descriptions = {
    'age': 'Age in years',
    'sex': 'Sex (1 = male, 0 = female)',
    'cp': 'Chest pain type (coded 1-4 in the UCI file)',
    'trestbps': 'Resting blood pressure (mm Hg)',
    'chol': 'Serum cholesterol (mg/dl)',
    'fbs': 'Fasting blood sugar > 120 mg/dl (1 = true, 0 = false)',
    'restecg': 'Resting ECG results (0-2)',
    'thalach': 'Maximum heart rate achieved',
    'exang': 'Exercise induced angina (1 = yes, 0 = no)',
    'oldpeak': 'ST depression induced by exercise',
    'slope': 'Slope of peak exercise ST segment (coded 1-3 in the UCI file)',
    'ca': 'Number of major vessels (0-3)',
    'thal': 'Thalassemia (UCI coding: 3 = normal, 6 = fixed defect, 7 = reversible defect)',
    'target': 'Heart disease (0 = no, 1-4 = yes; binarized to 0/1 before training) - THIS IS WHAT WE PREDICT!'
}

for col, desc in feature_descriptions.items():
    print(f"  • {col}: {desc}")


STEP 3: Dataset Exploration
--------------------------------------------------

First 10 rows of the dataset:
    age  sex   cp  trestbps   chol  fbs  restecg  thalach  exang  oldpeak  \
0  63.0  1.0  1.0     145.0  233.0  1.0      2.0    150.0    0.0      2.3   
1  67.0  1.0  4.0     160.0  286.0  0.0      2.0    108.0    1.0      1.5   
2  67.0  1.0  4.0     120.0  229.0  0.0      2.0    129.0    1.0      2.6   
3  37.0  1.0  3.0     130.0  250.0  0.0      0.0    187.0    0.0      3.5   
4  41.0  0.0  2.0     130.0  204.0  0.0      2.0    172.0    0.0      1.4   
5  56.0  1.0  2.0     120.0  236.0  0.0      0.0    178.0    0.0      0.8   
6  62.0  0.0  4.0     140.0  268.0  0.0      2.0    160.0    0.0      3.6   
7  57.0  0.0  4.0     120.0  354.0  0.0      0.0    163.0    1.0      0.6   
8  63.0  1.0  4.0     130.0  254.0  0.0      2.0    147.0    0.0      1.4   
9  53.0  1.0  4.0     140.0  203.0  1.0      2.0    155.0    1.0      3.1   

   slope   ca  thal  target  
0    3.0  0


# STEP 4: DATA PREPROCESSING


In [80]:

print("STEP 4: Data Preprocessing")

# Check for missing values
missing = df.isnull().sum()
print(f"Missing Values:")
print(missing[missing > 0] if missing.sum() > 0 else "  No missing values!")



STEP 4: Data Preprocessing
Missing Values:
ca      4
thal    2
dtype: int64
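
Median imputation, which the next cell applies to these columns, replaces each gap with the middle value of the entries that were observed. A minimal sketch in plain Python, assuming a hypothetical `ca` column in which `None` stands in for the `?` entries (the values are illustrative, not taken from the dataset, and `fill_with_median` is a made-up helper name):

```python
from statistics import median

def fill_with_median(values):
    """Return a copy of values with None entries replaced by the median of the rest."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    mid = median(observed)
    return [mid if v is None else v for v in values]

# Hypothetical 'ca' readings; None marks a missing entry
ca_column = [0.0, 3.0, None, 1.0, 0.0, None, 2.0]
filled = fill_with_median(ca_column)
print(filled)
```

For category-like codes such as `ca` and `thal`, filling with the most frequent value is a common alternative; which one suits this dataset better is a judgment call.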


In [81]:
# Handle missing values if any
if df.isnull().sum().sum() > 0:
    print("\n Handling missing values...")
    # Fill each column that has gaps with its median.
    # Assigning the result back avoids the chained inplace call that pandas deprecates.
    for col in df.columns:
        if df[col].isnull().sum() > 0:
            df[col] = df[col].fillna(df[col].median())
    print("  Missing values filled with median ")

# Check target distribution
# Raw labels run 0-4 and every value above 0 means disease, so count on the binarized label
print("\nTarget Variable Distribution:")
n_disease = int((df['target'] > 0).sum())
n_healthy = len(df) - n_disease
print(f"No Heart Disease (0): {n_healthy} patients ({n_healthy/len(df)*100:.1f}%)")
print(f"Heart Disease (1): {n_disease} patients ({n_disease/len(df)*100:.1f}%)")




 Handling missing values...
  Missing values filled with median 

Target Variable Distribution:
No Heart Disease (0): 164 patients (54.1%)
Heart Disease (1): 139 patients (45.9%)


In [82]:
# Separate features and target
print("Separating Features (X) and Target (y)...")
X = df.drop('target', axis=1)
y = df['target']
print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")

# Convert target to int (handle any edge cases)
y = y.astype(int)
# Handle target values > 0 (in original dataset, values 1-4 all mean disease)
y = (y > 0).astype(int)


Separating Features (X) and Target (y)...
Features shape: (303, 13)
Target shape: (303,)
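
The `(y > 0)` step above collapses the raw severity labels into a yes/no target. A small sketch of the same mapping in plain Python, using a hypothetical list of raw labels (`binarize_target` is an illustrative name, not part of the notebook):

```python
def binarize_target(raw_labels):
    """Map raw labels 0-4 to 0 (no disease) or 1 (any disease level)."""
    return [1 if label > 0 else 0 for label in raw_labels]

# Hypothetical raw labels as they might appear in the 'target' column
raw = [0, 2, 1, 0, 4, 3, 0]
binary = binarize_target(raw)
print(binary)
print(sum(binary), "of", len(binary), "flagged as disease")
```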


In [83]:
# Train-test split
print("Splitting into Training (80%) and Testing (20%) sets...")
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y  # Maintain class balance
)

print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")

# Feature scaling
print("Scaling features (StandardScaler)...")
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print("Features scaled to mean=0, std=1 ")

Splitting into Training (80%) and Testing (20%) sets...
Training samples: 242
Testing samples: 61
Scaling features (StandardScaler)...
Features scaled to mean=0, std=1 
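
The scaler above is fitted on the training split only and then reused on the test split, so no test-set statistics leak into training. A minimal sketch of that idea in plain Python for a single feature, assuming hypothetical cholesterol values (`fit_scaler` and `transform` are illustrative helpers; like `StandardScaler`, the sketch uses the population standard deviation):

```python
from statistics import mean, pstdev

def fit_scaler(train_values):
    """Learn the mean and population std from the training data only."""
    mu = mean(train_values)
    sigma = pstdev(train_values)
    # Guard against a constant feature, which would otherwise divide by zero
    return mu, (sigma if sigma > 0 else 1.0)

def transform(values, mu, sigma):
    """Apply already-learned statistics to any split."""
    return [(v - mu) / sigma for v in values]

# Hypothetical cholesterol readings
train_chol = [200.0, 220.0, 240.0, 260.0, 280.0]
test_chol = [240.0, 300.0]

mu, sigma = fit_scaler(train_chol)
train_scaled = transform(train_chol, mu, sigma)
test_scaled = transform(test_chol, mu, sigma)  # reuses the training mean and std

print(train_scaled)
print(test_scaled)
```

The scaled training values have mean 0 and standard deviation 1 by construction; the test values generally do not, which is expected.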


# STEP 5: MODEL TRAINING

In [84]:

print("\n Model Training")

print("\n Creating Logistic Regression model...")
model = LogisticRegression(random_state=42, max_iter=1000)

print(" Training the model on patient data...")
model.fit(X_train_scaled, y_train)
print(" Model trained successfully!")

# Display learned coefficients
print("\n What the Model Learned (Feature Importance):")
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': model.coef_[0],
    'Impact': ['↑ Increases Risk' if c > 0 else '↓ Decreases Risk' for c in model.coef_[0]]
}).sort_values('Coefficient', key=abs, ascending=False)

print(feature_importance.to_string(index=False))


 Model Training

 Creating Logistic Regression model...
 Training the model on patient data...
 Model trained successfully!

 What the Model Learned (Feature Importance):
 Feature  Coefficient           Impact
      ca     1.107898 ↑ Increases Risk
    thal     0.677821 ↑ Increases Risk
     sex     0.655563 ↑ Increases Risk
      cp     0.543483 ↑ Increases Risk
   exang     0.383642 ↑ Increases Risk
   slope     0.354072 ↑ Increases Risk
 thalach    -0.348486 ↓ Decreases Risk
trestbps     0.313655 ↑ Increases Risk
     fbs    -0.220560 ↓ Decreases Risk
 restecg     0.217329 ↑ Increases Risk
    chol     0.215375 ↑ Increases Risk
 oldpeak     0.149953 ↑ Increases Risk
     age    -0.103159 ↓ Decreases Risk


# SUMMARY

In [85]:

# Accuracy on the held-out test set, needed by the summary below
accuracy = accuracy_score(y_test, model.predict(X_test_scaled))

print(" TUTORIAL COMPLETE!")
print(f"""
 Model Performance Summary:
   • Dataset: UCI Heart Disease (303 patients, 13 features)
   • Algorithm: Logistic Regression
   • Accuracy: {accuracy * 100:.2f}%
   • Training samples: {len(X_train)}
   • Testing samples: {len(X_test)}

 What I Have Learned:
   • Loading real-world Kaggle datasets
   • Exploratory Data Analysis (EDA)
   • Data preprocessing (missing values, scaling)
   • Train-test split with stratification
   • Training Logistic Regression model
   • Evaluating with confusion matrix & classification report
   • Making predictions on new data

 Next Steps to try:
   • Try Random Forest or XGBoost for better accuracy
   • Perform hyperparameter tuning (GridSearchCV)
   • Add more visualizations (ROC curve, feature correlations)
   • Deploy as a web app with Streamlit or Flask
""")


 TUTORIAL COMPLETE!

 Model Performance Summary:
   • Dataset: UCI Heart Disease (303 patients, 13 features)
   • Algorithm: Logistic Regression
   • Accuracy: 86.89%
   • Training samples: 242
   • Testing samples: 61

 What I Have Learned:
   • Loading real-world Kaggle datasets
   • Exploratory Data Analysis (EDA)
   • Data preprocessing (missing values, scaling)
   • Train-test split with stratification
   • Training Logistic Regression model
   • Evaluating with confusion matrix & classification report
   • Making predictions on new data

 Next Steps to try:
   • Try Random Forest or XGBoost for better accuracy
   • Perform hyperparameter tuning (GridSearchCV)
   • Add more visualizations (ROC curve, feature correlations)
   • Deploy as a web app with Streamlit or Flask
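
The accuracy in the summary is the share of the 61 test patients classified correctly; 86.89% corresponds to 53 of 61. A minimal sketch of how accuracy and the four confusion-matrix counts follow from predictions, in plain Python with hypothetical label lists (scikit-learn's `accuracy_score` and `confusion_matrix`, imported in the first cell, compute the same quantities; `confusion_counts` is an illustrative helper):

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels 0/1."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1  # disease correctly flagged
        elif actual == 1:
            fn += 1  # disease missed
        elif predicted == 1:
            fp += 1  # healthy patient flagged
        else:
            tn += 1  # healthy patient correctly cleared
    return tn, fp, fn, tp

# Hypothetical true labels and predictions for eight patients
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}, accuracy={accuracy:.2%}")

# Sanity check on the figure reported in the summary: 53 correct out of 61
reported = f"{53 / 61 * 100:.2f}%"
print(reported)
```

In a screening setting the missed cases (`fn`) usually matter more than accuracy alone, which is why the confusion matrix is worth reading alongside it.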

