Read dataset:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2_contingency
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectKBest, f_classif 
import pickle
import jinja2

df = pd.read_csv('heart.csv')




Attributes’ description table:

In [2]:
attributes = {
    "Attribute Name": [
        "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach",
        "exang", "oldpeak", "slope", "ca", "thal", "target"
    ],
    "Description": [
        "age in years", "Gender of the patient", "chest pain type",
        "resting blood pressure (in mm Hg on admission to the hospital)",
        "serum cholesterol in mg/dl",
        "fasting blood sugar > 120 mg/dl",
        "resting electrocardiographic results",
        "maximum heart rate achieved",
        "exercise induced angina",
        "ST depression induced by exercise relative to rest",
        "the slope of the peak exercise ST segment",
        "number of major vessels (0-3) colored by fluoroscopy",
        "1 = normal; 2 = fixed defect; 3 = reversible defect",
        "The class label; refers to the presence of heart disease in the patient"
    ],
    "Data Type": [
        "Numeric", "Binary", "Ordinal", "Numeric", "Numeric", "Binary", "Nominal", "Numeric", "Binary",
        "Numeric", "Ordinal", "Ordinal", "Nominal", "Binary"
    ],
    "Possible Values": [
        "Range between 29-77", "Female, Male", "Range between 0-3", "Range between 94-200",
        "Range between 126-564", "1 = greater than 120 mg/dl; 0 = otherwise",
        "Range between 0-2", "Range between 71-202", "1 = exercise induced angina; 0 = no",
        "Range between 0-6.2", "Range between 0-2", "Range between 0-4",
        "Range between 0-3", "1 = have heart disease; 0 = no heart disease"
    ]
}
# Use a separate name so the dataset loaded into df is not overwritten.
attributes_df = pd.DataFrame(attributes)
styled_df = attributes_df.style.set_properties(**{'text-align': 'center'})
styled_df

Unnamed: 0,Attribute Name,Description,Data Type,Possible Values
0,age,age in years,Numeric,Range between 29-77
1,sex,Gender of the patient,Binary,"Female, Male"
2,cp,chest pain type,Ordinal,Range between 0-3
3,trestbps,resting blood pressure (in mm Hg on admission to the hospital),Numeric,Range between 94-200
4,chol,serum cholesterol in mg/dl,Numeric,Range between 126-564
5,fbs,fasting blood sugar > 120 mg/dl,Binary,1 = greater than 120 mg/dl; 0 = otherwise
6,restecg,resting electrocardiographic results,Nominal,Range between 0-2
7,thalach,maximum heart rate achieved,Numeric,Range between 71-202
8,exang,exercise induced angina,Binary,1 = exercise induced angina; 0 = no
9,oldpeak,ST depression induced by exercise relative to rest,Numeric,Range between 0-6.2
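
The documented ranges in the attribute table can be compared against the data itself. The following is a minimal sketch, not part of the original notebook: `toy` is a small made-up stand-in for a few columns of heart.csv, and `documented` is a hypothetical dictionary holding ranges copied by hand from the descriptions above.

```python
import pandas as pd

# Toy stand-in for a few heart.csv columns (illustrative values, not the real data).
toy = pd.DataFrame({
    "age": [29, 54, 77],
    "ca": [0, 1, 4],
    "thal": [0, 2, 3],
})

# Documented ranges, copied by hand from the attribute descriptions (assumption).
documented = {"age": (29, 77), "ca": (0, 3), "thal": (1, 3)}

# Observed minimum and maximum of every column.
observed = toy.agg(["min", "max"])

# Keep only the columns whose observed values leave the documented range.
out_of_range = {
    col: (observed.loc["min", col], observed.loc["max", col])
    for col, (lo, hi) in documented.items()
    if observed.loc["min", col] < lo or observed.loc["max", col] > hi
}
print(out_of_range)
```

Running the same check on the real `df` would show whether any column contains codes that the description table does not cover.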


In [19]:
sample = df.sample(n=20)
sample

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
911,58,0,1,136,319,1,0,152,0,0.0,2,2,2,0
368,58,1,2,105,240,0,0,154,1,0.6,1,0,3,1
224,51,1,0,140,261,0,0,186,1,0.0,2,0,2,1
274,66,1,0,160,228,0,0,138,0,2.3,2,0,1,1
130,60,0,3,150,240,0,1,171,0,0.9,2,0,2,1
302,55,0,1,132,342,0,1,166,0,1.2,2,0,2,1
2,70,1,0,145,174,0,1,125,1,2.6,0,0,3,0
719,52,1,0,108,233,1,1,147,0,0.1,2,3,3,1
908,62,1,0,120,267,0,1,99,1,1.8,1,2,3,0
642,64,1,0,128,263,0,1,105,1,0.2,1,1,3,1


The cell above draws a random sample of 20 people from the dataset.

Show the missing values:

In [20]:
missing_counts = df.isnull().sum()
print("Missing values in each column:")
print(missing_counts)
print()
rows_with_missing = df.isnull().sum(axis=1)
print("Rows with missing values:")
print(rows_with_missing)

Missing values in each column:
age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64

Rows with missing values:
0       0
1       0
2       0
3       0
4       0
       ..
1020    0
1021    0
1022    0
1023    0
1024    0
Length: 1025, dtype: int64


We notice that there are no missing values, and all columns are complete.
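
The per-row output above lists a count for every row, which is hard to scan when the dataset is large. A minimal sketch of a more compact check follows; the `toy` frame and its values are made up for illustration, and the same two expressions could be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy frame with two incomplete rows (illustrative values, not the real data).
toy = pd.DataFrame({
    "age": [63, np.nan, 45, 58],
    "chol": [233, 250, np.nan, 240],
    "target": [1, 0, 1, 0],
})

# Number of missing cells in each row.
missing_per_row = toy.isnull().sum(axis=1)

# Only the rows that actually contain a missing value.
rows_with_missing = toy[missing_per_row > 0]

# Single number: how many rows are incomplete.
n_incomplete = int(toy.isnull().any(axis=1).sum())
print(n_incomplete)
print(rows_with_missing)
```

For a complete dataset, `n_incomplete` is 0 and `rows_with_missing` is empty.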

Summary of data:

In [21]:
summary = df.describe()
print(summary)

               age          sex           cp     trestbps        chol  \
count  1025.000000  1025.000000  1025.000000  1025.000000  1025.00000   
mean     54.434146     0.695610     0.942439   131.611707   246.00000   
std       9.072290     0.460373     1.029641    17.516718    51.59251   
min      29.000000     0.000000     0.000000    94.000000   126.00000   
25%      48.000000     0.000000     0.000000   120.000000   211.00000   
50%      56.000000     1.000000     1.000000   130.000000   240.00000   
75%      61.000000     1.000000     2.000000   140.000000   275.00000   
max      77.000000     1.000000     3.000000   200.000000   564.00000   

               fbs      restecg      thalach        exang      oldpeak  \
count  1025.000000  1025.000000  1025.000000  1025.000000  1025.000000   
mean      0.149268     0.529756   149.114146     0.336585     1.071512   
std       0.356527     0.527878    23.005724     0.472772     1.175053   
min       0.000000     0.000000    71.000000  

Description:

Age:
Ages range from 29 to 77 years, with a mean of 54.43, a median of 56, and a standard deviation of 9.07. This indicates that the dataset primarily consists of middle-aged and older individuals, who are at a higher risk of heart disease.

Sex:
The values are binary (0 for female, 1 for male), with a mean of 0.696, indicating that about 70% of the dataset consists of males. Males are therefore over-represented in the study, which is relevant because men often have different heart disease risk factors compared to women.

Chest Pain Type (CP):
The values range from 0 to 3, with a mean of 0.94 and a standard deviation of 1.03. Because cp is a category code for the type of chest pain rather than a severity scale, the mean does not describe a pain level. The quartiles are more informative: at least 25% of individuals have type 0, the median is type 1, and the 75th percentile is type 2.

Resting Blood Pressure (Trestbps):
The values range from 94 to 200 mm Hg, with a mean of 131.61 mm Hg. The median is 130 mm Hg, which is close to the mean, suggesting a relatively balanced distribution. The standard deviation is 17.52, indicating moderate variability in blood pressure levels among individuals.

Serum Cholesterol (Chol):
The values range from a minimum of 126 mg/dl to a maximum of 564 mg/dl. The mean is 246, slightly higher than the median of 240, which indicates a slight right skew driven by a few very high values. The standard deviation is 51.59, suggesting considerable variability in cholesterol levels, which may indicate different risk profiles for heart disease.

Fasting Blood Sugar (FBS):
The values are binary, limited to 0 and 1, with a mean of 0.149, indicating that around 15% of individuals have fasting blood sugar > 120 mg/dl. Elevated fasting blood sugar is therefore present but uncommon in this dataset.

Resting ECG (Restecg):
The values range from 0 to 2, with a mean of 0.53. Because restecg takes three values rather than two, the mean is not the proportion of abnormal results; it only shows that non-zero codes are common. The share of individuals in each category has to be read from the value counts.
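
For a column with more than two codes, per-category shares are a more direct summary than the mean. A minimal sketch, using a made-up `restecg` series rather than the real column:

```python
import pandas as pd

# Toy three-level column (illustrative values, not the real data).
restecg = pd.Series([0, 1, 1, 0, 2, 1, 0, 1], name="restecg")

# Share of rows in each category, ordered by code.
shares = restecg.value_counts(normalize=True).sort_index()

# Share of rows with any non-zero code.
nonzero_share = float((restecg != 0).mean())

# Mean of the codes; it differs from nonzero_share as soon as a 2 is present.
mean_code = float(restecg.mean())

print(shares)
print(nonzero_share, mean_code)
```

On the real data, `df["restecg"].value_counts(normalize=True)` would give the actual share of each ECG result.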

Maximum Heart Rate Achieved (Thalach):
Heart rate values range from 71 bpm to 202 bpm, a wide range that may reflect differing cardiovascular fitness levels. The mean is 149 bpm, and the median of 152 bpm is slightly above it, suggesting a roughly symmetrical distribution with a mild left skew. The standard deviation is 23.01, showing variability in heart rates among individuals.

Exercise-Induced Angina (Exang):
The values are binary, limited to 0 and 1, with a mean of 0.337, indicating that about 34% of individuals experience angina during exercise. Roughly one in three individuals therefore shows signs of heart-related issues when performing physical activity.

ST depression induced by exercise relative to rest (Oldpeak):
ST depression values range from 0 to 6.2, with a mean of 1.07. The median of 0.8 is below the mean, so the distribution is right-skewed: most individuals have mild ST depression, but a few cases show severe depression, which is a potential indicator of heart disease risk. The standard deviation is 1.18, showing that there is variability in how much ST depression different individuals experience.

Slope of the Peak Exercise ST Segment (Slope):
The values range from 0 to 2, with a mean of 1.38, so most individuals fall into category 1 or 2. Slope is a category code, so its clinical meaning depends on how each code is defined; abnormal slopes could indicate ischemia (reduced blood flow to the heart).

Number of Major Vessels Colored by Fluoroscopy (CA):
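The values range from 0 to 4, with a mean of 0.75. The low mean suggests that most individuals have few or no major vessels colored by fluoroscopy, while a smaller group has several. Note that the maximum of 4 lies outside the 0-3 range given in the attribute description, so rows with ca = 4 may warrant a closer check.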

Thalassemia (Thal):
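The values range from 0 to 3, with a mean of 2.32. Because thal is a nominal code (1 = normal, 2 = fixed defect, 3 = reversible defect), the mean is not directly interpretable; it only shows that codes 2 and 3 are common. The value 0 is not covered by the attribute description and may warrant a closer check.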

Heart Disease (Target):
The values are binary, limited to 0 and 1, with a mean of 0.513 and a standard deviation of 0.50. About 51% of individuals have heart disease, so the two classes are fairly evenly balanced.

Calculate the variance:


In [22]:
variance = df.var(numeric_only=True)
print(variance)

age           82.306450
sex            0.211944
cp             1.060160
trestbps     306.835410
chol        2661.787109
fbs            0.127111
restecg        0.278655
thalach      529.263325
exang          0.223514
oldpeak        1.380750
slope          0.381622
ca             1.062544
thal           0.385219
target         0.250071
dtype: float64


Variance Analysis of the Dataset:

High Variance Values:
Cholesterol (2661.79) → Extremely high, meaning large differences in cholesterol levels.

Resting Blood Pressure (306.84) → High variance, suggesting a wide range of BP values.

Maximum Heart Rate Achieved (529.26) → Indicates significant variability in heart rate.

Age (82.31) → Shows a diverse age distribution.


Moderate Variance Values:
Chest Pain Type (1.06).

Number of Major Vessels (1.06).

ST Depression (1.38).

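These values suggest some variation but not as extreme as cholesterol or heart rate. Note that variance depends on the unit and scale of each feature, so raw variances are not directly comparable across columns measured on different scales.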


Low Variance Values:
Binary columns like Sex (0.21), FBS (0.13), Exang (0.22), and Target (0.25) → Limited variation because they contain only 0s and 1s; the variance of a binary column is p(1 - p), which cannot exceed about 0.25.
Thal (0.39) and Slope (0.38) → Relatively small spread.
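
Because raw variance depends on each feature's scale, one common way to compare spread across columns is to rescale them to [0, 1] first. The sketch below uses `MinMaxScaler` (already imported at the top of the notebook) on a made-up `toy` frame; it is an illustration under that assumption, not a step taken in the original analysis. It also checks the p(1 - p) rule for a binary column, with the n / (n - 1) factor that pandas' sample variance applies.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy frame mixing large-scale and binary columns (illustrative values only).
toy = pd.DataFrame({
    "chol": [126, 211, 240, 275, 564],
    "thalach": [71, 132, 152, 166, 202],
    "sex": [0, 1, 1, 1, 0],
})

# Raw sample variances are dominated by the columns with large units.
raw_var = toy.var()

# Rescale every column to [0, 1], then recompute the variance.
scaled = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)
scaled_var = scaled.var()

# Binary column: sample variance equals p * (1 - p) * n / (n - 1).
n = len(toy)
p = toy["sex"].mean()
binary_var = p * (1 - p) * n / (n - 1)

print(raw_var)
print(scaled_var)
print(binary_var)
```

After rescaling, all variances lie on the same bounded scale, so a cutoff such as the one used by `VarianceThreshold` applies consistently across features.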