# Heart Disease Risk Prediction: Logistic Regression

## Step 1: Load and Prepare the Dataset

**Data Source:** Heart Disease Dataset downloaded from Kaggle (https://www.kaggle.com/datasets/neurocipher/heartdisease).

The dataset contains 270 patient records with 13 clinical features and a binary target (`Heart Disease`: Presence/Absence), for 14 columns in total.

In [None]:
%pip install numpy matplotlib pandas



In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

plt.rcParams["figure.figsize"] = (6, 4)
plt.rcParams["axes.grid"] = True

# Load the dataset
df = pd.read_csv('Heart_Disease_Prediction.csv')
print("Dataset shape:", df.shape)
print("\nFirst few rows:")
print(df.head())
print("\nColumn names and types:")
print(df.dtypes)

Dataset shape: (270, 14)

First few rows:
   Age  Sex  Chest pain type   BP  Cholesterol  FBS over 120  EKG results  \
0   70    1                4  130          322             0            2   
1   67    0                3  115          564             0            2   
2   57    1                2  124          261             0            0   
3   64    1                4  128          263             0            0   
4   74    0                2  120          269             0            2   

   Max HR  Exercise angina  ST depression  Slope of ST  \
0     109                0            2.4            2   
1     160                0            1.6            2   
2     141                0            0.3            1   
3     105                1            0.2            2   
4     121                1            0.2            1   

   Number of vessels fluro  Thallium Heart Disease  
0                        3         3      Presence  
1                        0         7    

### 1.1 Exploratory Data Analysis (EDA)

In [None]:
# Basic statistics
print("Basic statistics:")
print(df.describe())

# Check for missing values
print("\nMissing values per column:")
print(df.isnull().sum())
print(f"\nTotal missing values: {df.isnull().sum().sum()}")

In [None]:
# Binarize target: 1 = Presence (disease), 0 = Absence (no disease)
df['Target'] = (df['Heart Disease'] == 'Presence').astype(int)

print("Target distribution:")
print(df['Target'].value_counts())
print(f"\nDisease rate: {df['Target'].mean()*100:.1f}%")

# Plot class distribution
plt.figure()
df['Target'].value_counts().plot(kind='bar', color=['tab:blue', 'tab:orange'])
plt.title("Class Distribution: Heart Disease")
plt.xlabel("Target (0 = Absence, 1 = Presence)")
plt.ylabel("Count")
plt.xticks(rotation=0)
plt.show()

### 1.2 Feature Selection and Preprocessing

We select six of the 13 clinical features as model inputs:
- **Age**: Patient age in years
- **BP**: Resting blood pressure (mm Hg)
- **Cholesterol**: Serum cholesterol (mg/dl)
- **Max HR**: Maximum heart rate achieved
- **ST depression**: ST depression induced by exercise, relative to rest
- **Number of vessels fluro**: Number of major vessels (0-3) colored by fluoroscopy
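
Before building the feature matrix, it can help to confirm that every selected column exists and is numeric. The following is a minimal sketch, not part of the original notebook: `check_features` is a hypothetical helper, and `demo` holds only the first two rows copied from the `df.head()` output above.

```python
import pandas as pd

def check_features(frame, features):
    """Raise ValueError if any requested column is missing or non-numeric."""
    missing = [c for c in features if c not in frame.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    non_numeric = [c for c in features
                   if not pd.api.types.is_numeric_dtype(frame[c])]
    if non_numeric:
        raise ValueError(f"Non-numeric columns: {non_numeric}")
    return True

features = ['Age', 'BP', 'Cholesterol', 'Max HR',
            'ST depression', 'Number of vessels fluro']

# Two rows copied from the df.head() output above, for illustration only
demo = pd.DataFrame({
    'Age': [70, 67],
    'BP': [130, 115],
    'Cholesterol': [322, 564],
    'Max HR': [109, 160],
    'ST depression': [2.4, 1.6],
    'Number of vessels fluro': [3, 0],
})
print(check_features(demo, features))
```

In the notebook itself, the same call would presumably be `check_features(df, selected_features)` once `selected_features` is defined.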

In [None]:
# Select features for the model
selected_features = ['Age', 'BP', 'Cholesterol', 'Max HR', 'ST depression', 'Number of vessels fluro']

X = df[selected_features].values
y = df['Target'].values

print(f"Feature matrix shape: {X.shape}")
print(f"Target vector shape: {y.shape}")
print(f"\nSelected features: {selected_features}")

In [None]:
# Stratified train/test split (70/30)
np.random.seed(42)

# Get indices for each class
idx_class_0 = np.where(y == 0)[0]
idx_class_1 = np.where(y == 1)[0]

# Shuffle indices
np.random.shuffle(idx_class_0)
np.random.shuffle(idx_class_1)

# Calculate test-set size per class (30%); int() truncates, so the test share can fall slightly below 30%
n_test_0 = int(len(idx_class_0) * 0.3)
n_test_1 = int(len(idx_class_1) * 0.3)

# Split indices
test_idx = np.concatenate([idx_class_0[:n_test_0], idx_class_1[:n_test_1]])
train_idx = np.concatenate([idx_class_0[n_test_0:], idx_class_1[n_test_1:]])

# Shuffle train and test indices
np.random.shuffle(train_idx)
np.random.shuffle(test_idx)

# Create train and test sets
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
print(f"\nTraining set class distribution: {np.mean(y_train)*100:.1f}% disease")
print(f"Test set class distribution: {np.mean(y_test)*100:.1f}% disease")
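
The per-class shuffle-and-slice logic above can be wrapped in a reusable function. The sketch below is one possible refactoring, not the notebook's own code: `stratified_split` is a hypothetical name, it handles any number of classes, and it uses `np.random.default_rng` instead of `np.random.seed`, so it will not reproduce the exact indices of the cell above. The labels in `y_demo` are toy values, not the heart disease data.

```python
import numpy as np

def stratified_split(y, test_frac=0.3, seed=42):
    """Return (train_idx, test_idx), splitting each class by test_frac."""
    rng = np.random.default_rng(seed)
    train_parts, test_parts = [], []
    for label in np.unique(y):
        # Shuffle the indices belonging to this class
        idx = rng.permutation(np.where(y == label)[0])
        n_test = int(len(idx) * test_frac)  # truncates, as in the cell above
        test_parts.append(idx[:n_test])
        train_parts.append(idx[n_test:])
    # Shuffle again so the classes are interleaved within each set
    train_idx = rng.permutation(np.concatenate(train_parts))
    test_idx = rng.permutation(np.concatenate(test_parts))
    return train_idx, test_idx

# Toy labels (not the heart disease data): 60 zeros and 40 ones
y_demo = np.array([0] * 60 + [1] * 40)
train_idx, test_idx = stratified_split(y_demo)
print(len(train_idx), len(test_idx))
print(y_demo[train_idx].mean(), y_demo[test_idx].mean())
```

Because each class is split separately, the positive rate in both subsets should stay close to the overall rate, which is the property the printed class distributions above are checking.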

In [None]:
# Feature normalization (z-score)
# Compute mean and std from the training set only to avoid data leakage
# (np.std defaults to the population standard deviation, ddof=0)
mu = np.mean(X_train, axis=0)
sigma = np.std(X_train, axis=0)

# Normalize both sets using training statistics
X_train_norm = (X_train - mu) / sigma
X_test_norm = (X_test - mu) / sigma

print("Normalization statistics (from training set):")
print(f"{'Feature':<25} {'Mean':>10} {'Std':>10}")
print("-" * 47)
for i, feat in enumerate(selected_features):
    print(f"{feat:<25} {mu[i]:>10.2f} {sigma[i]:>10.2f}")

print(f"\nNormalized training set mean: {np.mean(X_train_norm, axis=0).round(4)}")
print(f"Normalized training set std: {np.std(X_train_norm, axis=0).round(4)}")
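
The z-score step above divides by `sigma`, which would produce NaN or infinite values if a feature were constant in the training set. The sketch below shows one way to guard against that. It is an assumption-laden illustration rather than the notebook's method: `fit_standardizer`, `apply_standardizer`, and the `eps` threshold are hypothetical, and `X_demo` is a toy matrix, not the heart disease data.

```python
import numpy as np

def fit_standardizer(X_train, eps=1e-12):
    """Compute per-feature mean and std from training data only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    # Assumption: treat (near-)constant features as having unit scale
    sigma = np.where(sigma < eps, 1.0, sigma)
    return mu, sigma

def apply_standardizer(X, mu, sigma):
    """Apply z-score scaling using previously fitted statistics."""
    return (X - mu) / sigma

# Toy matrix (not the heart disease data); the second column is constant
X_demo = np.array([[1.0, 5.0], [2.0, 5.0], [3.0, 5.0]])
mu_demo, sigma_demo = fit_standardizer(X_demo)
X_demo_norm = apply_standardizer(X_demo, mu_demo, sigma_demo)
print(X_demo_norm)
```

With this guard, a constant column maps to all zeros instead of NaN, while the other columns are scaled exactly as in the cell above.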