In [9]:
!pip install scipy


Collecting scipy
  Downloading scipy-1.15.3-cp313-cp313-win_amd64.whl.metadata (60 kB)
Downloading scipy-1.15.3-cp313-cp313-win_amd64.whl (41.0 MB)


[notice] A new release of pip is available: 24.3.1 -> 25.1.1
[notice] To update, run: python.exe -m pip install --upgrade pip


In [15]:
import pandas as pd

# The file carries an .xls extension but holds comma-separated text, so read it as CSV
df = pd.read_csv("cleaned_heart_cleveland_dataset.xls")

# Show the first few rows
df.head()


Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,NumMajorVessels,Thalassemia,HeartDisease
0,69,Male,Typical Angina,160,234,>120 mg/dl,Left Ventricular Hypertrophy,131,No,0.1,Flat,1,Normal,No
1,69,Female,Typical Angina,140,239,≤120 mg/dl,Normal,151,No,1.8,Upsloping,2,Normal,No
2,66,Female,Typical Angina,150,226,≤120 mg/dl,Normal,114,No,2.6,Downsloping,0,Normal,No
3,65,Male,Typical Angina,138,282,>120 mg/dl,Left Ventricular Hypertrophy,174,No,1.4,Flat,1,Normal,Yes
4,64,Male,Typical Angina,110,211,≤120 mg/dl,Left Ventricular Hypertrophy,144,Yes,1.8,Flat,0,Normal,No
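The file above is named `.xls` yet loads as CSV. One way to check a file's real format before picking a reader is to inspect its first bytes. The sketch below is only an illustration: `sniff_tabular_format` is a hypothetical helper name, and the fallback assumes anything that is not an Excel container is delimited text.

```python
# Leading bytes ("magic numbers") of the two Excel container formats.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy binary .xls (OLE2 compound file)
ZIP_MAGIC = b"PK\x03\x04"                        # .xlsx is a ZIP archive


def sniff_tabular_format(path):
    """Guess a file's format from its first bytes, ignoring the extension.

    Hypothetical helper for illustration; returns "xls", "xlsx", or "text".
    """
    with open(path, "rb") as handle:
        head = handle.read(8)
    if head.startswith(OLE2_MAGIC):
        return "xls"
    if head.startswith(ZIP_MAGIC):
        return "xlsx"
    # Assumption: anything else is CSV or another delimited text format.
    return "text"
```

With a result of `"text"`, `pd.read_csv` is the appropriate reader; for the other two results, `pd.read_excel` would be the usual choice.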


# Insight Generation for Heart Disease Prediction


## 1. Heart Disease Distribution

In [6]:
df['HeartDisease'].value_counts(normalize=True)


HeartDisease
No     0.538721
Yes    0.461279
Name: proportion, dtype: float64

## 2. Correlation Between Numerical Variables
We convert the categorical target to numeric form so it can enter the correlation matrix; the result can then be shown as a heatmap.

Example insights (values taken from the correlation output below):
    
    MaxHR (maximum heart rate) is negatively correlated with HeartDisease (about -0.42).

    Oldpeak is positively correlated (about 0.42): greater ST depression goes with a higher chance of disease.

    Cholesterol shows only a weak correlation (about 0.08), so it is not a strong standalone predictor.
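The encoding step can be sketched on a tiny frame. The six rows below are made up purely for illustration (they are not taken from the dataset), and the 0/1 mapping for `Sex` is an assumption; only the column names follow the notebook.

```python
import pandas as pd

# Made-up toy rows, for illustration only.
toy = pd.DataFrame({
    "MaxHR": [170, 160, 150, 130, 120, 110],
    "Oldpeak": [0.0, 0.2, 0.5, 1.5, 2.0, 2.5],
    "Sex": ["Female", "Male", "Female", "Male", "Male", "Male"],
    "HeartDisease": ["No", "No", "No", "Yes", "Yes", "Yes"],
})

# Map two-level categoricals to 0/1 so they can enter a Pearson correlation.
encoded = toy.assign(
    Sex=toy["Sex"].map({"Female": 0, "Male": 1}),
    HeartDisease=toy["HeartDisease"].map({"No": 0, "Yes": 1}),
)

corr = encoded.corr()

# Correlation of every feature with the target, excluding the trivial self-correlation.
target_corr = corr["HeartDisease"].drop("HeartDisease").sort_values()
print(target_corr)
```

The full `corr` matrix is what a heatmap would display, for example by passing it to seaborn's `heatmap` function if seaborn is installed. Categoricals with more than two levels, such as ChestPainType, have no natural numeric order and are usually one-hot encoded (for instance with `pd.get_dummies`) rather than mapped to integers.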

## 3. Grouping and Aggregation Insights
➤ Group by Heart Disease and compute mean:

In [7]:
# Mean of each numeric column within each HeartDisease group
grouped_means = df.groupby('HeartDisease').mean(numeric_only=True)
grouped_means


HeartDisease,Age,RestingBP,Cholesterol,MaxHR,Oldpeak,NumMajorVessels
No,52.64375,129.175,243.49375,158.58125,0.59875,0.275
Yes,56.759124,134.635036,251.854015,139.109489,1.589051,1.145985


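## Insight:

    Patients with heart disease, on average:

        Have lower MaxHR (139.1 vs 158.6)

        Have higher Oldpeak (1.59 vs 0.60)

        Have a higher NumMajorVessels (1.15 vs 0.28)

    ST_Slope is categorical, so it does not appear in this table of numeric means; whether diseased patients are more likely to have a Flat ST slope has to be checked with a separate cross-tabulation.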
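A statement about a categorical column such as ST_Slope cannot be read from a table of numeric means. A row-normalised cross-tabulation is one way to check it. The rows below are made up for illustration and do not come from the dataset; only the column names follow the notebook.

```python
import pandas as pd

# Made-up toy rows, for illustration only.
toy = pd.DataFrame({
    "ST_Slope": ["Flat", "Flat", "Upsloping", "Flat", "Upsloping", "Upsloping"],
    "HeartDisease": ["Yes", "Yes", "Yes", "No", "No", "No"],
})

# normalize="index" turns counts into shares within each HeartDisease group,
# so each row sums to 1.
slope_shares = pd.crosstab(toy["HeartDisease"], toy["ST_Slope"], normalize="index")
print(slope_shares)
```

Applied to the real `df`, comparing the `Flat` column across the `Yes` and `No` rows would show whether the Flat slope is more common among patients with heart disease.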

## 4. Categorical Feature Association
We'll run a chi-square test of independence for categorical features such as:

Sex vs HeartDisease

ChestPainType vs HeartDisease

FastingBS vs HeartDisease

Each test indicates whether the feature's distribution differs between the heart disease and no heart disease groups by more than chance alone would explain.
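To make the mechanics of the test concrete, the sketch below computes the chi-square statistic by hand for a 2x2 table. The counts are made up for illustration. The closed-form p-value used here holds only for one degree of freedom, which is the 2x2 case.

```python
from math import erfc, sqrt

# Made-up 2x2 contingency table: rows are two feature categories,
# columns are (no disease, disease).
observed = [
    [30, 10],
    [20, 40],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected.
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

dof = (len(observed) - 1) * (len(observed[0]) - 1)

# With exactly 1 degree of freedom, the upper-tail probability has a closed form.
p_value = erfc(sqrt(chi2 / 2))

print(f"chi2={chi2:.3f}, dof={dof}, p={p_value:.2e}")
```

Note that `scipy.stats.chi2_contingency` applies Yates' continuity correction by default when the table has one degree of freedom, so its statistic and p-value for a 2x2 table will differ somewhat from this uncorrected calculation; passing `correction=False` disables the correction.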



## 5. T-Test on Cholesterol & MaxHR
We'll compare:

Mean Cholesterol in heart disease vs no heart disease

Mean MaxHR in both groups

A p-value below 0.05 is taken as evidence that the group means differ. The tests use Welch's variant (`equal_var=False`), which does not assume equal variances in the two groups.
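The statistic behind a two-sample t-test with unequal variances (Welch's test) can be sketched with the standard library alone. The two samples below are made up for illustration, and `welch_t` is a hypothetical helper name, not part of any library.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom).

    Hypothetical helper for illustration.
    """
    # Squared standard error of each group's mean (sample variance / n).
    se2_a = variance(a) / len(a)
    se2_b = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, dof


# Made-up MaxHR-like values, for illustration only.
maxhr_disease = [140, 135, 150, 128, 142]
maxhr_no_disease = [160, 158, 165, 150, 170]

t_stat, dof = welch_t(maxhr_disease, maxhr_no_disease)
print(f"t={t_stat:.3f}, dof={dof:.2f}")
```

A negative `t_stat` here means the first group has the lower mean. Turning the statistic into a p-value requires the t distribution, which is what `scipy.stats.ttest_ind(..., equal_var=False)` handles.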



In [13]:

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, ttest_ind

# Convert target column to binary for analysis
df['HeartDiseaseBinary'] = df['HeartDisease'].map({'Yes': 1, 'No': 0})

# Prepare results dictionary
insight_results = {}

# 1. Distribution of heart disease
heart_disease_pct = df['HeartDiseaseBinary'].mean() * 100
insight_results['HeartDisease Prevalence'] = f"{heart_disease_pct:.2f}% of patients in the dataset have heart disease."

# 2. Correlation (only numerical)
numeric_cols = df.select_dtypes(include=[np.number])
correlation = numeric_cols.corr()['HeartDiseaseBinary'].sort_values(ascending=False)

# 3. Grouping: mean comparison
grouped_means = df.groupby('HeartDiseaseBinary').mean(numeric_only=True)

# 4. T-tests
ttest_chol = ttest_ind(
    df[df['HeartDiseaseBinary'] == 1]['Cholesterol'],
    df[df['HeartDiseaseBinary'] == 0]['Cholesterol'],
    equal_var=False
)

ttest_maxhr = ttest_ind(
    df[df['HeartDiseaseBinary'] == 1]['MaxHR'],
    df[df['HeartDiseaseBinary'] == 0]['MaxHR'],
    equal_var=False
)

# 5. Chi-square tests for categorical features
def chi_square_test(feature):
    """Return the chi-square p-value for `feature` vs. the binary target."""
    table = pd.crosstab(df[feature], df['HeartDiseaseBinary'])
    _, p, _, _ = chi2_contingency(table)
    return p

chi_results = {
    'Sex': chi_square_test('Sex'),
    'ChestPainType': chi_square_test('ChestPainType'),
    'FastingBS': chi_square_test('FastingBS')
}

# Package all results
insight_results['Correlation with HeartDisease'] = correlation
insight_results['Mean Comparison by HeartDisease'] = grouped_means
insight_results['T-test Cholesterol p-value'] = ttest_chol.pvalue
insight_results['T-test MaxHR p-value'] = ttest_maxhr.pvalue
insight_results['Chi-square p-values'] = chi_results


insight_results


{'HeartDisease Prevalence': '46.13% of patients in the dataset have heart disease.',
 'Correlation with HeartDisease': HeartDiseaseBinary    1.000000
 NumMajorVessels       0.463189
 Oldpeak               0.424052
 Age                   0.227075
 RestingBP             0.153490
 Cholesterol           0.080285
 MaxHR                -0.423817
 Name: HeartDiseaseBinary, dtype: float64,
 'Mean Comparison by HeartDisease':                           Age   RestingBP  Cholesterol       MaxHR   Oldpeak  \
 HeartDiseaseBinary                                                             
 0                   52.643750  129.175000   243.493750  158.581250  0.598750   
 1                   56.759124  134.635036   251.854015  139.109489  1.589051   
 
                     NumMajorVessels  
 HeartDiseaseBinary                   
 0                          0.275000  
 1                          1.145985  ,
 'T-test Cholesterol p-value': np.float64(0.16501058969777038),
 'T-test MaxHR p-value': np.float

## 📊 Insight Generation: HealthLens – Understanding Patient Data & Health Trends
🫀 1. Heart Disease Prevalence
46.13% of patients in the dataset have heart disease and 53.87% do not. The two classes are roughly balanced, so comparisons between healthy and at-risk patients are not dominated by one group.

📉 2. Correlation Patterns
Positive correlations with heart disease:

NumMajorVessels (0.46): More blocked vessels are linked to higher heart disease risk.

Oldpeak (0.42): Greater ST depression during exercise correlates with disease.

Age (0.23): Risk increases with age.

Negative correlation:

MaxHR (-0.42): Lower max heart rate is associated with heart disease.

💡 Insight: Lower MaxHR and higher Oldpeak are each moderately associated with heart disease (correlations of about 0.42 in magnitude), as is a higher NumMajorVessels (0.46).

📊 3. Groupwise Comparison (Mean Values)
From the table:

| Feature | No Disease | Disease | Observation |
|---|---|---|---|
| Age | 52.64 yrs | 56.76 yrs | Older patients are more prone to heart disease |
| MaxHR | 158.58 | 139.11 | Diseased patients have lower MaxHR |
| Oldpeak | 0.60 | 1.59 | ST depression is higher in diseased patients |
| NumMajorVessels | 0.28 | 1.15 | Diseased patients have more blocked vessels |

🧪 4. Statistical Tests
  T-Test Results:
Cholesterol:

p-value = 0.165 → No significant difference → Cholesterol alone isn't a strong predictor.

MaxHR:

p-value ≈ 6.1e-14 → Strongly significant → MaxHR is a key differentiator between groups.

  Chi-Square Results:

| Feature | p-value | Interpretation |
|---|---|---|
| Sex | 2.95e-06 | Significant → Sex is associated with heart disease |
| Chest Pain Type | 1.17e-16 | Highly significant → Key predictor of disease |
| FastingBS | 1.00 | Not significant → No evidence of an association in this dataset |

💡 Insight: Chest Pain Type and Sex are significantly associated with heart disease. Fasting Blood Sugar isn't.

  Summary of Top Risk Indicators
Based on correlation, statistical tests, and groupwise analysis:

🚩 MaxHR (Low)

🚩 Oldpeak (High)

🚩 Chest Pain Type (significant association; the chi-square test alone does not show which categories carry the risk)

🚩 Age (Older age groups)

🚩 Male sex

❌ Fasting Blood Sugar and Cholesterol are less conclusive on their own.