# Heart Disease Prediction Project 
Sections:
1. Importing Libraries
2. Load & clean data
3. Preprocess data 
4. Splitting the dataset
5. Training the model
6. Explanation of Patient Information Terms
7. Sample prediction for a single patient
8. Explaining prediction with SHAP
9. Plot a gauge for the predicted risk 
10. Next steps

This notebook uses the `HeartDiseaseTrain-Test.csv` dataset downloaded from Kaggle.

## Importing Libraries

We import the libraries used in preprocessing, modeling, evaluation and explainability.


In [106]:
# pandas for dataframes and CSV loading
import pandas as pd
# numpy for numeric operations  
import numpy as np  
# train/test split helper
from sklearn.model_selection import train_test_split 
# feature scaling 
from sklearn.preprocessing import StandardScaler  
# the model we train 
from sklearn.linear_model import LogisticRegression 
# to evaluate the model 
from sklearn.metrics import accuracy_score 
# SHAP for model explainability 
import shap  
# plotly for gauge visualization
import plotly.graph_objects as go
import warnings
# hide all Python warnings (not only sklearn's) to keep the output readable
warnings.filterwarnings('ignore')


## Load and clean data

Loading the dataset as a CSV, fixing inconsistent column names and inspecting the dataset.


In [107]:
# Path to the dataset
csv_path = 'HeartDiseaseTrain-Test.csv'

# Read the CSV into a pandas DataFrame
df = pd.read_csv(csv_path)

# Some datasets have typos in column names; we standardize them here
df.rename(columns={'cholestoral': 'cholesterol', 'Max_heart_rate': 'max_heart_rate'}, inplace=True)

# Show the first 5 rows to check the column names and sample values
df.head()


Unnamed: 0,age,sex,chest_pain_type,resting_blood_pressure,cholesterol,fasting_blood_sugar,rest_ecg,max_heart_rate,exercise_induced_angina,oldpeak,slope,vessels_colored_by_flourosopy,thalassemia,target
0,52,Male,Typical angina,125,212,Lower than 120 mg/ml,ST-T wave abnormality,168,No,1.0,Downsloping,Two,Reversable Defect,0
1,53,Male,Typical angina,140,203,Greater than 120 mg/ml,Normal,155,Yes,3.1,Upsloping,Zero,Reversable Defect,0
2,70,Male,Typical angina,145,174,Lower than 120 mg/ml,ST-T wave abnormality,125,Yes,2.6,Upsloping,Zero,Reversable Defect,0
3,61,Male,Typical angina,148,203,Lower than 120 mg/ml,ST-T wave abnormality,161,No,0.0,Downsloping,One,Reversable Defect,0
4,62,Female,Typical angina,138,294,Greater than 120 mg/ml,ST-T wave abnormality,106,No,1.9,Flat,Three,Fixed Defect,0


## Preprocess data

This step drops rows with missing values, encodes the categorical columns as integer codes and separates the features from the target.
The train/test split and the scaling (to zero mean and unit variance) follow in the next section.


In [None]:
# Drop rows with missing values as a simple strategy (alternative: imputation)
df_clean = df.dropna().reset_index(drop=True) 

# Define categorical columns we expect; adjust if dataset differs
categorical_cols = [
    'sex', 'chest_pain_type', 'fasting_blood_sugar',
    'rest_ecg', 'exercise_induced_angina', 'slope',
    'vessels_colored_by_flourosopy', 'thalassemia'
]

# Convert categorical text columns to numeric codes (0,1,2,...). 
# This keeps things simple and reproducible.
for col in categorical_cols:
    if col in df_clean.columns:
        # map categories to integers
        df_clean[col] = df_clean[col].astype('category').cat.codes  

# Ensure target column exists
if 'target' not in df_clean.columns:
    raise ValueError("Dataset must contain a 'target' column with labels (0/1).")

# Separate features and target label
X = df_clean.drop('target', axis=1)  # features dataframe
y = df_clean['target']  # target series

# Save feature names for later (ensures we preserve column order)
feature_names = X.columns.tolist()
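Because `cat.codes` numbers the labels in their sorted order, the code a label receives is not always the intuitive one (a label such as `Zero` does not automatically get code 0). A minimal sketch of how the label-to-code mapping could be recorded, using a tiny made-up frame rather than the Kaggle data; the helper name `category_mapping` is hypothetical:

```python
import pandas as pd

def category_mapping(series):
    """Return {label: code} exactly as astype('category').cat.codes assigns them."""
    categories = series.astype('category').cat.categories
    return {label: code for code, label in enumerate(categories)}

# Made-up rows that only reuse two of the notebook's column names
demo = pd.DataFrame({
    'sex': ['Male', 'Female', 'Male'],
    'vessels_colored_by_flourosopy': ['Zero', 'Two', 'One'],
})

mappings = {col: category_mapping(demo[col]) for col in demo.columns}
for col, mapping in mappings.items():
    print(col, mapping)
```

Building the same dictionaries from `df` before the encoding loop runs would show which integer to enter for each label when a sample patient is typed in by hand later on.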




## Splitting the dataset

In [109]:
# Split into train and test sets to evaluate generalization (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features: fit on training data and transform both train and test
scaler = StandardScaler()  # StandardScaler removes mean and scales to unit variance
X_train_scaled = scaler.fit_transform(X_train)  # fit on train
X_test_scaled = scaler.transform(X_test)  # use same transform on test

# Show shapes to confirm everything looks right
print('X_train shape:', X_train_scaled.shape)
print('X_test shape :', X_test_scaled.shape)
print('Number of features:', len(feature_names))

X_train shape: (820, 13)
X_test shape : (205, 13)
Number of features: 13
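To make the scaling step concrete, here is a rough sketch of the arithmetic `StandardScaler` performs: each column has the training mean subtracted and is divided by the training standard deviation, and the test rows reuse those same training statistics. The numbers are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Made-up values for two features (think age and cholesterol)
X_train_demo = np.array([[50.0, 200.0], [60.0, 240.0], [70.0, 280.0]])
X_test_demo = np.array([[55.0, 220.0]])

# Statistics are computed from the training rows only
train_mean = X_train_demo.mean(axis=0)
train_std = X_train_demo.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# The same mean and std are applied to both train and test rows
X_train_demo_scaled = (X_train_demo - train_mean) / train_std
X_test_demo_scaled = (X_test_demo - train_mean) / train_std

print(X_train_demo_scaled.mean(axis=0))  # close to 0 per column
print(X_train_demo_scaled.std(axis=0))   # close to 1 per column
print(X_test_demo_scaled)
```

Only the training columns end up with exactly zero mean and unit variance; the test rows are shifted and scaled by the training statistics, which keeps information from the test set out of the fit.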


## Training the model

In [110]:
# Initialize logistic regression model; set max_iter higher in case of slow convergence
model = LogisticRegression(max_iter=1000)

# Fit model on scaled training data
model.fit(X_train_scaled, y_train)  # train the model parameters

# Evaluate on the held-out test set
y_pred = model.predict(X_test_scaled)  # predicted labels for the test set
accuracy = accuracy_score(y_test, y_pred)  # compute simple accuracy metric
print(f'Model accuracy on test set: {accuracy*100:.2f}%')


Model accuracy on test set: 79.51%
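Accuracy alone does not show which kind of mistake the model makes, and for a disease screen a missed case (false negative) is usually the more costly one. The sketch below counts the four outcome types by hand on made-up labels; `confusion_counts` is a hypothetical helper, and `sklearn.metrics` offers `confusion_matrix`, `precision_score` and `recall_score` for the same purpose.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

# Made-up labels for illustration (not the notebook's test set)
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true_demo, y_pred_demo)
accuracy_demo = (tp + tn) / len(y_true_demo)
precision_demo = tp / (tp + fp)  # of the predicted positives, how many were right
recall_demo = tp / (tp + fn)     # of the actual positives, how many were found

print(tp, tn, fp, fn)
print(accuracy_demo, precision_demo, recall_demo)
```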


## Explanation of Patient Information Terms


| **Term** | **Simple Meaning (Layman Explanation)** |
|-----------|------------------------------------------|
| **Age** | How old the person is, measured in years. |
| **Sex** | Whether the person is male or female. |
| **Chest Pain Type** | Describes what kind of chest pain the person feels. *Typical angina* means chest pain that happens when the heart doesn’t get enough oxygen, often during exercise. |
| **Resting Blood Pressure (mm Hg)** | The pressure of blood against your artery walls when you’re relaxed. A normal range is around 120/80 mm Hg. High numbers can strain your heart. |
| **Cholesterol (mg/dl)** | The amount of fat in your blood. High cholesterol can block blood flow and cause heart problems. |
| **Fasting Blood Sugar** | The level of sugar (glucose) in your blood before you eat in the morning. High sugar can mean diabetes risk. |
| **Rest ECG (Electrocardiogram)** | A simple test that shows how your heart is beating. “Normal” means your heart rhythm looks healthy. |
| **Max Heart Rate** | The fastest your heart beats during exercise. It shows how well your heart handles physical effort. |
| **Exercise-Induced Angina** | Chest pain that happens when you exercise. “No” means you don’t get chest pain during activity. |
| **Oldpeak (ST Depression)** | A small dip seen on the heart test after exercise. It helps doctors see how much stress your heart feels. |
| **Slope** | The shape of the line on the heart test after exercise. “Upsloping” is usually a normal sign. |
| **Vessels Colored by Fluoroscopy** | Checks if any heart blood vessels are blocked. “Zero” means all are clear, so blood is flowing well. |
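| **Thalassemia** | Despite the column name, values such as “Fixed Defect” and “Reversable Defect” most likely come from a thallium stress test (a scan of blood flow to the heart muscle) rather than from the blood disorder thalassemia. A “reversible defect” is an area that gets too little blood during exercise but recovers at rest. |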


## Sample prediction for a single patient

Create a sample input as a user might enter it, scale it, and compute the predicted label and probability.



In [111]:
# Example patient values; replace them with real inputs when needed
sample_input = {
    'age': 52,  # patient's age in years
    'sex': 1,  # 1 for male, 0 for female - depends on your encoding
    'chest_pain_type': 1,  # encoded category value
    'resting_blood_pressure': 130,  # mm Hg
    'cholesterol': 200,  # mg/dl
    'fasting_blood_sugar': 0,  # encoded
    'rest_ecg': 1,  # encoded
    'max_heart_rate': 150,
    'exercise_induced_angina': 0,  # 0 = No, 1 = Yes
    'oldpeak': 1.0,  # ST depression
    'slope': 1,
    'vessels_colored_by_flourosopy': 0,
    'thalassemia': 2
}

# Build a DataFrame for the single sample;
# selecting columns by feature_names enforces the training column order
input_df = pd.DataFrame([sample_input])[feature_names]

# Scale the input using the previously fitted scaler
# returns 2D array with shape (1, n_features)
input_scaled = scaler.transform(input_df)  

# Predict label and probability
# 0 or 1
pred_label = model.predict(input_scaled)[0]  
# probability of class '1' as percent
pred_prob = model.predict_proba(input_scaled)[0, 1] * 100  

# Print results
print(f'Predicted label: {pred_label} (1=high risk, 0=low risk)')
print(f'Predicted probability of heart disease: {pred_prob:.1f}%')


Predicted label: 0 (1=high risk, 0=low risk)
Predicted probability of heart disease: 30.2%
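For a binary logistic regression, the probability comes from passing a weighted sum of the scaled features through the logistic (sigmoid) function, and the predicted label is 1 when that probability exceeds 0.5. A small sketch with made-up coefficients; in the notebook the fitted values live in `model.coef_` and `model.intercept_`:

```python
import math

def sigmoid(z):
    """Logistic function: maps any real-valued score to the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Made-up numbers for illustration only
coefs_demo = [0.8, -1.2, 0.5]
intercept_demo = -0.3
x_scaled_demo = [0.5, 1.0, -0.2]

# Linear score (log-odds), then probability, then thresholded label
score = intercept_demo + sum(w * x for w, x in zip(coefs_demo, x_scaled_demo))
prob = sigmoid(score)
label = 1 if prob > 0.5 else 0

print(f'score={score:.2f}, probability={prob * 100:.1f}%, label={label}')
```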


## Explaining prediction with SHAP

SHAP values estimate how much each feature contributed to the model's output for a single prediction.
For linear models, we can use `LinearExplainer`. If the SHAP computation fails, we fall back to a simple `coef * value` approximation on the scaled input.


In [None]:
# Try SHAP LinearExplainer for this linear model
try:
    explainer = shap.LinearExplainer(model, X_train_scaled, feature_perturbation='interventional')
    # returns array shaped (n_samples, n_features)
    shap_values = explainer.shap_values(input_scaled)  
    # map feature names to their SHAP contribution for the sample
    contributions = dict(zip(feature_names, shap_values[0]))
except Exception as e:
    print('SHAP failed with error:', e)
    # fallback: use coefficient * input value as a rough influence measure
    coefs = model.coef_[0]
    contributions = {f: float(coefs[i] * input_scaled[0, i]) for i, f in enumerate(feature_names)}

# Convert contributions to a DataFrame for easy display and sorting
contrib_df = pd.DataFrame({
    'Feature': list(contributions.keys()),
    'Contribution': list(contributions.values()),
    'Value': [input_df[f].iloc[0] for f in feature_names]
})

# Compute absolute impact to sort by strongest contributors
contrib_df['AbsImpact'] = contrib_df['Contribution'].abs()
# top 8 features
top_contrib = contrib_df.sort_values('AbsImpact', ascending=False).head(8)  

top_contrib[['Feature', 'Value', 'Contribution']]


Unnamed: 0,Feature,Value,Contribution
11,vessels_colored_by_flourosopy,0.0,-1.618142
2,chest_pain_type,1.0,0.960982
1,sex,1.0,-0.410563
4,cholesterol,200.0,0.407165
8,exercise_induced_angina,0.0,0.332666
12,thalassemia,2.0,-0.282826
5,fasting_blood_sugar,0.0,0.141568
6,rest_ecg,1.0,-0.140012
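The contributions are on the log-odds scale, not in percentage points. For a linear model with features treated independently, the SHAP value of a feature is its coefficient times the distance of the input from the background mean; since the scaled training data has a mean of about zero, the `coef * value` fallback should give nearly the same numbers. A sketch with made-up values (not the fitted model):

```python
import numpy as np

# Made-up, already centered background data and coefficients
background = np.array([[-1.0, 0.5], [0.0, -1.5], [1.0, 1.0]])
coefs_demo = np.array([0.7, -0.4])
intercept_demo = 0.2
x_demo = np.array([0.5, 1.0])

background_mean = background.mean(axis=0)

# Linear-model SHAP value under feature independence
shap_demo = coefs_demo * (x_demo - background_mean)
# The notebook's fallback: coefficient times scaled input value
fallback_demo = coefs_demo * x_demo

# Base value is the model's log-odds at the background mean
base_value = intercept_demo + coefs_demo @ background_mean
log_odds = intercept_demo + coefs_demo @ x_demo

print(shap_demo, fallback_demo)
print(base_value + shap_demo.sum(), log_odds)  # the two numbers agree
```

The last line illustrates the additivity property: the base value plus all contributions reproduces the model's log-odds for the sample.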


## Plot a gauge for the predicted risk
This uses Plotly to show the predicted probability on a gauge.


In [113]:
# Create a gauge chart to visualize the risk percentage
fig = go.Figure(go.Indicator(
    mode='gauge+number',
    value=pred_prob,
    number={'suffix': '%', 'valueformat': '.1f'},
    title={'text': '<b>Heart Disease Risk</b>'},
    gauge={
        'axis': {'range': [0, 100]},
        'bar': {'color': '#b30000' if pred_label == 1 else '#2a9d8f'},
        'steps': [
            {'range': [0, 40], 'color': '#d4f5e6'},
            {'range': [40, 70], 'color': '#fff2cc'},
            {'range': [70, 100], 'color': '#ffd6d6'}
        ],
        'threshold': {'line': {'color': 'black', 'width': 2}, 'thickness': 0.75, 'value': pred_prob}
    }
))
fig.update_layout(height=300, margin=dict(l=20, r=20, t=40, b=20))
fig.show()
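The three gauge steps split the scale at 40% and 70%. If a text label is wanted next to the chart, the same cut-offs could be reused as in this sketch; the helper `risk_band` and the names "low", "moderate" and "high" are assumptions, since the notebook only assigns colours to the bands.

```python
def risk_band(prob_percent):
    """Map a probability in percent to a band using the gauge's 40/70 cut-offs."""
    if not 0 <= prob_percent <= 100:
        raise ValueError('prob_percent must be between 0 and 100')
    if prob_percent < 40:
        return 'low'
    if prob_percent < 70:
        return 'moderate'
    return 'high'

# 30.2 is the probability printed for the sample patient earlier
print(risk_band(30.2))
print(risk_band(40))
print(risk_band(85))
```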


## Next steps
- To run a Streamlit UI, I will put the app in `heart_app.py` and start it with `python -m streamlit run heart_app.py`.
- A plain `.py` file worked well for me for this.
