
# Cardiovascular Study Biostatistics and Predictive Analytics

This notebook demonstrates the analysis of the Cleveland Heart Disease dataset, covering hypothesis testing, linear regression, and logistic regression models.
        


## Loading and Examining the Cleveland Dataset

The Cleveland dataset is used to predict the presence of heart disease based on several patient characteristics. First, we load the data and examine its structure.
        


## Dataset Information and Download Links

The analysis in this notebook utilizes the **Cleveland Heart Disease dataset**, which is publicly available.

### Accessing the Dataset

You can download the dataset from the following sources:

1. **UCI Machine Learning Repository:**
   - [Heart Disease Dataset - UCI](https://archive.ics.uci.edu/ml/datasets/Heart%2BDisease)
   - Look for the file named `processed.cleveland.data`.

2. **Kaggle:**
   - [Heart Disease Cleveland UCI Dataset - Kaggle](https://www.kaggle.com/datasets/cherngs/heart-disease-cleveland-uci)

### Dataset Attributes

- The dataset contains the following features:
  - **age**: Age of the patient.
  - **sex**: Sex of the patient (0 = female, 1 = male).
  - **cp**: Chest pain type (categorical).
  - **trestbps**: Resting blood pressure (mmHg).
  - **chol**: Serum cholesterol (mg/dL).
  - **fbs**: Fasting blood sugar > 120 mg/dL (1 = true, 0 = false).
  - **restecg**: Resting electrocardiographic results (categorical).
  - **thalach**: Maximum heart rate achieved.
  - **exang**: Exercise-induced angina (1 = yes, 0 = no).
  - **oldpeak**: ST depression induced by exercise relative to rest.
  - **slope**: Slope of the peak exercise ST segment.
  - **ca**: Number of major vessels (0–3) colored by fluoroscopy.
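  - **thal**: Thallium stress test result (categorical: 3 = normal, 6 = fixed defect, 7 = reversible defect); often mislabeled as thalassemia.
  - **cad**: Presence of heart disease (0 = no presence, 1–4 = varying severity). This column is named `num` in the UCI documentation; this notebook calls it `cad`.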

### Usage Notes

- Handle missing values appropriately (`?` values in the dataset represent missing data).
- Recode the target variable (`cad`) into binary: 0 = No CAD, 1 = CAD presence.
- Refer to the [dataset documentation](https://archive.ics.uci.edu/ml/datasets/Heart%2BDisease) for more details.
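
The two usage notes above can be sketched as follows. This is a minimal illustration on three made-up rows in the same comma-separated layout as the data file, not on the real dataset; the rows and the names `sample_rows` and `demo` are assumptions introduced for the example.

```python
import io

import pandas as pd

column_names = [
    "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
    "thalach", "exang", "oldpeak", "slope", "ca", "thal", "cad"
]

# Made-up rows for illustration only (not taken from the real dataset)
sample_rows = io.StringIO(
    "60.0,1.0,4.0,130.0,250.0,0.0,2.0,140.0,1.0,1.5,2.0,?,7.0,2\n"
    "45.0,0.0,3.0,120.0,210.0,0.0,0.0,170.0,0.0,0.0,1.0,0.0,3.0,0\n"
    "55.0,1.0,2.0,140.0,230.0,1.0,0.0,150.0,0.0,0.8,1.0,1.0,?,1\n"
)

# Treat '?' as missing while reading
demo = pd.read_csv(sample_rows, names=column_names, header=None, na_values=["?"])

# Count missing values per column and show only the affected columns
missing_counts = demo.isna().sum()
print(missing_counts[missing_counts > 0])

# Recode the target: 0 stays 0, values 1-4 become 1
demo["cad"] = (demo["cad"] > 0).astype(int)

# One simple option is complete-case analysis: drop rows with any missing value
complete = demo.dropna()
print(complete)
```

Dropping incomplete rows is only one possible strategy; whether it is appropriate depends on how many rows are affected and on the analysis being run.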
    

In [None]:

import pandas as pd
import numpy as np

# Set the column names based on dataset documentation
column_names = [
    "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
    "thalach", "exang", "oldpeak", "slope", "ca", "thal", "cad"
]

# Load the dataset (replace with the actual file path)
dataset = pd.read_csv(
    r'C:\Path\to\processed.cleveland.data',
    names=column_names, header=None, na_values=["?"]
)

# Display the first few rows
print(dataset.head())

# Dichotomize the 'cad' variable
dataset['cad'] = np.where(dataset['cad'] > 0, 1, 0)

# Group the dataset by 'cad' column
grouped_data = dataset.groupby("cad")

# Calculate and display descriptive statistics
statistics = grouped_data.describe()
print(statistics.T)
        


## Hypothesis Testing: Age Differences Between CAD and Non-CAD Subjects

Using an independent two-sample t-test, we test whether mean age differs between subjects with and without CAD. Note that `ttest_ind` assumes equal variances in the two groups by default.
        

In [None]:

from scipy.stats import ttest_ind

# Separate data for CAD and non-CAD groups
datacad = dataset[dataset['cad'] == 1]
datacontrol = dataset[dataset['cad'] == 0]

# Perform t-test for age
t_statistic, p_value = ttest_ind(datacad['age'], datacontrol['age'])
print(f"t-statistic: {t_statistic}")
print(f"p-value: {p_value}")
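
If the two groups may have unequal variances, Welch's variant of the t-test is a common alternative to the pooled-variance default. The following is a minimal sketch on simulated ages; the group means, spreads, and sizes are made-up assumptions, not results from the Cleveland data.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Simulated ages for two hypothetical groups (illustration only)
age_group_a = rng.normal(loc=57, scale=8, size=140)
age_group_b = rng.normal(loc=52, scale=10, size=160)

# equal_var=False requests Welch's t-test instead of the pooled-variance test
t_welch, p_welch = ttest_ind(age_group_a, age_group_b, equal_var=False)

# Cohen's d with a pooled standard deviation, as a rough effect-size measure
n_a, n_b = len(age_group_a), len(age_group_b)
pooled_sd = np.sqrt(
    ((n_a - 1) * age_group_a.var(ddof=1) + (n_b - 1) * age_group_b.var(ddof=1))
    / (n_a + n_b - 2)
)
cohens_d = (age_group_a.mean() - age_group_b.mean()) / pooled_sd

print(f"Welch t-statistic: {t_welch:.3f}")
print(f"p-value: {p_welch:.4g}")
print(f"Cohen's d: {cohens_d:.3f}")
```

Reporting an effect size alongside the p-value helps distinguish a statistically detectable difference from a clinically meaningful one.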
        


## Linear Regression: Stress Test Heart Rate vs. ST Depression

We use multiple linear regression to model maximum heart rate achieved during a stress test (`thalach`) as a function of ST depression (`oldpeak`) and age. The model is fitted on the CAD subjects only (`datacad`).
        

In [None]:

import statsmodels.formula.api as smf
import seaborn as sns
import matplotlib.pyplot as plt

# Multiple linear regression: max heart rate on ST depression and age (CAD subjects only)
linmod = smf.ols(
    formula='thalach ~ oldpeak + age', data=datacad
).fit()

# Print regression summary
print(linmod.summary())

# Regression plot of thalach against oldpeak alone (not adjusted for age)
sns.regplot(x='oldpeak', y='thalach', data=datacad, ci=95)
plt.title('Linear Regression: Stress Test Max BPM vs ST Depression')
plt.xlabel('ST Depression Level (mm)')
plt.ylabel('Stress Test Max BPM')
plt.show()
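
To make explicit what `smf.ols` estimates, the formula `thalach ~ oldpeak + age` corresponds to a least-squares fit of an intercept plus two slopes. Below is a minimal sketch using NumPy on simulated data; the coefficients and noise level used to generate the data are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Simulated predictors (illustration only, not Cleveland data)
oldpeak = rng.uniform(0.0, 4.0, size=n)
age = rng.uniform(30.0, 75.0, size=n)

# Simulated response generated from arbitrary, assumed coefficients
thalach = 200.0 - 5.0 * oldpeak - 0.8 * age + rng.normal(0.0, 2.0, size=n)

# Design matrix: a column of ones for the intercept, then the predictors
X = np.column_stack([np.ones(n), oldpeak, age])

# Ordinary least squares: minimize ||X @ beta - thalach||^2
beta, *_ = np.linalg.lstsq(X, thalach, rcond=None)
intercept, b_oldpeak, b_age = beta

# R-squared from residual and total sums of squares
residuals = thalach - X @ beta
r_squared = 1.0 - residuals @ residuals / np.sum((thalach - thalach.mean()) ** 2)

print(f"intercept={intercept:.2f}, oldpeak={b_oldpeak:.2f}, age={b_age:.2f}")
print(f"R-squared={r_squared:.3f}")
```

Each slope is interpreted as the expected change in the response per unit change in that predictor, holding the other predictor fixed.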
        


## Logistic Regression: CAD Prediction Using ST Depression, Age, and Cholesterol

We create a logistic regression model to predict CAD status using ST depression, age, and serum cholesterol levels as predictors.
        

In [None]:

import statsmodels.api as sm

# Logistic regression model
logit_model = sm.Logit.from_formula(
    formula='cad ~ oldpeak + age + chol', data=dataset
).fit()

# Print logistic regression summary
print(logit_model.summary())
        


### Logistic Regression with Exercise-Induced Angina as an Additional Predictor

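We add exercise-induced angina (`exang`) as a predictor and report odds ratios with 95% confidence intervals. To judge whether the fit improves, compare the pseudo R-squared and log-likelihood in this summary with those of the previous model.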
        

In [None]:

# Logistic regression model with exang
logit_model_exang = sm.Logit.from_formula(
    formula='cad ~ oldpeak + age + exang + chol', data=dataset
).fit()

# Print logistic regression summary
print(logit_model_exang.summary())

# Exponentiate the coefficients and their 95% confidence bounds to obtain odds ratios
odds_ratios = np.exp(logit_model_exang.params)
conf_intervals = np.exp(logit_model_exang.conf_int())

# Display results
summary_df = pd.DataFrame({
    'Odds Ratio': odds_ratios,
    'CI Lower': conf_intervals[0],
    'CI Upper': conf_intervals[1]
})
print(summary_df)
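
One way to check whether adding a predictor improves a nested logistic model is a likelihood-ratio test. The sketch below uses simulated data so that it runs on its own; the column names mirror the notebook, but the data-generating coefficients are arbitrary assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2

rng = np.random.default_rng(2)
n = 500

# Simulated predictors (illustration only, not Cleveland data)
sim = pd.DataFrame({
    "oldpeak": rng.uniform(0.0, 4.0, size=n),
    "exang": rng.integers(0, 2, size=n),
})

# Simulated outcome from an assumed logistic model
log_odds = -1.5 + 0.6 * sim["oldpeak"] + 1.5 * sim["exang"]
prob = 1.0 / (1.0 + np.exp(-log_odds.to_numpy()))
sim["cad"] = rng.binomial(1, prob)

# Fit the reduced and full models (disp=0 silences the optimizer output)
reduced = smf.logit("cad ~ oldpeak", data=sim).fit(disp=0)
full = smf.logit("cad ~ oldpeak + exang", data=sim).fit(disp=0)

# Likelihood-ratio statistic, chi-square with df = number of added parameters
lr_stat = 2.0 * (full.llf - reduced.llf)
df_diff = int(full.df_model - reduced.df_model)
p_value = chi2.sf(lr_stat, df_diff)

print(f"LR statistic: {lr_stat:.2f}, df: {df_diff}, p-value: {p_value:.4g}")
print(f"AIC reduced: {reduced.aic:.1f}, AIC full: {full.aic:.1f}")
```

With the notebook's own models, the same calculation would use `logit_model_exang.llf` and `logit_model.llf`, provided both were fitted on the same rows.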
        