In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
df = pd.read_csv("heart.csv") # reading in the file contents

information about the dataset
information about the attributes


### Explanation of the attributes in the dataset:

- **Age**: The age of the patient in years.
  
- **Sex**: The gender of the patient (M: Male, F: Female).
  
- **Chest Pain Type**: Describes the type of chest discomfort experienced by the patient:
    - **Typical Angina (TA)**: Chest pain or discomfort typically associated with heart problems, often triggered by physical exertion or emotional stress.
    - **Atypical Angina (ATA)**: Chest discomfort that doesn't fully meet the criteria for typical angina but still suggests a possible heart-related issue. The symptoms may vary and could include shortness of breath, nausea, or fatigue.
    - **Non-Anginal Pain (NAP)**: Chest discomfort not originating from the heart, which could be due to musculoskeletal issues, respiratory problems, gastrointestinal conditions, or anxiety.
    - **Asymptomatic (ASY)**: The patient does not experience any symptoms related to chest discomfort.

- **Resting Blood Pressure**: The patient's blood pressure measured while at rest, in millimeters of mercury (mm Hg).
  
- **Cholesterol**: The level of cholesterol in the patient's blood serum, measured in milligrams per deciliter (mm/dl).
  
- **Fasting Blood Sugar**: Indicates the patient's fasting blood sugar level:
    - **1**: If fasting blood sugar is greater than 120 mg/dl.
    - **0**: Otherwise.

- **Resting Electrocardiogram Results**: Interpretation of the resting electrocardiogram (ECG) findings:
    - **Normal**: ECG shows no abnormalities.
    - **ST**: ECG shows abnormalities in the ST-T wave, such as T wave inversions or ST elevation/depression of more than 0.05 millivolts (mV).
    - **LVH**: ECG suggests probable or definite left ventricular hypertrophy (enlargement of the heart's left ventricle) according to Estes' criteria.
  
- **Maximum Heart Rate Achieved**: The highest heart rate achieved during physical exertion, measured in beats per minute (bpm).

- **Exercise-Induced Angina**: Indicates whether the patient experienced angina (chest pain or discomfort) during exercise:
    - **Y**: Yes, the patient experienced exercise-induced angina.
    - **N**: No, the patient did not experience exercise-induced angina.
  
- **Oldpeak**: Refers to the ST depression observed during exercise, measured as a numeric value.

- **ST Slope**: Describes the slope of the peak exercise ST segment observed on the ECG:
    - **Up**: The ST segment slopes upwards during exercise.
    - **Flat**: The ST segment remains flat during exercise.
    - **Down**: The ST segment slopes downwards during exercise.
  
- **Heart Disease**: Indicates the presence or absence of heart disease:
    - **1**: Heart disease is present.
    - **0**: No heart disease is detected.

In [31]:
codebook={
    "attribute": ['Age', 'Sex', 'ChestPainType', 'RestingBP', 'Cholesterol', 'FastingBS', 'RestingECG', 'MaxHR', 'ExerciseAngina', 'Oldpeak', 'ST_Slope', 'HeartDisease'],
    "unit": ["year", "n.a.", "n.a.", "mm of Hg", "mm/dl", "mg/dl", "n.a.", "bpm", "n.a.", "st depression", "n.a.", "n.a."],
    "dtype": ["uint8", "category", "category", "uint8", "uint8", "category", "category", "uint8", "category", "uint8", "category", "category"],
    "description": [
        "The age of the person", 
        "The gender of the person", 
        "The type of pain the person is experiencing", 
        "Recorded blood pressure at rest", 
        "The level of cholesterol in the patient's blood serum", 
        "Indicates the patient's fasting blood sugar level", 
        "Interpretation of the resting electrocardiogram (ECG) findings",
        "Maximum heat rate achieved during physical extraction", 
        "Indicates whether the patient experienced angina during exercise", 
        "ST depression observed during exercise", 
        "The slope of the peak exercise ST segment observed on the ECG", 
        "Indicates the presence or absence of heart disease"
    ]
}
pd.DataFrame(codebook).set_index("attribute")

Unnamed: 0_level_0,unit,dtype,description
attribute,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Age,year,uint8,The age of the person
Sex,n.a.,category,The gender of the person
ChestPainType,n.a.,category,The type of pain the person is experiencing
RestingBP,mm of Hg,uint8,Recorded blood pressure at rest
Cholesterol,mm/dl,uint8,The level of cholesterol in the patient's bloo...
FastingBS,mg/dl,category,Indicates the patient's fasting blood sugar level
RestingECG,sometimes mv,category,Interpretation of the resting electrocardiogra...
MaxHR,bpm,uint8,Maximum heat rate achieved during physical ext...
ExerciseAngina,n.a.,category,Indicates whether the patient experienced angi...
Oldpeak,integer,uint8,ST depression observed during exercise


In [34]:
df.shape

(918, 12)

In [33]:
df.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,0,0,140,289,0,0,172,0,0.0,2,0
1,49,1,2,160,180,0,0,156,0,1.0,1,1
2,37,0,0,130,283,0,1,98,0,0.0,2,0
3,48,1,1,138,214,0,0,108,1,1.5,1,1
4,54,0,2,150,195,0,0,122,0,0.0,2,0


EDA univariaat
eda voor elk atribuut op zich zelf
bijv. verdeling per atribuut
- .isna().sum()
- .describe()
- .boxplot
- .hist
- sns.histplot
- pd.DataFrame(df["Age"].value_counts().sortindex())

In [None]:
print(df.describe())

In [None]:
print(df.nunique())

In [None]:
pd.DataFrame({
    "isna": df.isna().sum()
})

In [None]:
# Assuming 'data' is your dataset and 'attribute' is the column name of the continuous variable
sns.set(style="whitegrid")
plt.figure(figsize=(10, 6))
sns.kdeplot(df['MaxHR'], shade=True, color="b", alpha=0.7)
plt.title("Density Plot of Attribute")
plt.xlabel("Age(Years)")
plt.ylabel("Density")
plt.show()


In [None]:
axs = df.boxplot(grid=False, vert=False, figsize=(9.6, 4.8))
axs.set_title("Boxplot distributies");

In [None]:
plt.figure(figsize=(10, 6))
sns.violinplot(x='RestingBP', data=df, palette="muted")
plt.title("Violin Plot of Attribute")
plt.xlabel("Attribute")
plt.ylabel("Value")
plt.show()

In [None]:
axs = df.hist(bins=40, figsize=(20, 12))

In [None]:
disease_mapping = {0: 'no disease', 1: 'disease'}

# Use the map function to replace values in the disease column
x = df['HeartDisease'].map(disease_mapping)

# Count the number of males and females
gender_counts = x.value_counts()

# Plot the counts
gender_counts.plot(kind='bar')
plt.xticks(rotation=0)
plt.title('Count of with and without heart disease')
plt.xlabel('with or without a heart disease')
plt.ylabel('Count')
plt.show()

In [None]:
# Count the number of males and females
gender_counts = df['Sex'].value_counts()

# Plot the counts
gender_counts.plot(kind='bar')
plt.xticks(rotation=0)
plt.title('Count of Males and Females')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.show()

In [None]:
gender_counts = df['ST_Slope'].value_counts()

# Plot the counts
gender_counts.plot(kind='bar')
plt.xticks(rotation=0)
plt.title('Count of ST_slope')
plt.xlabel('slope')
plt.ylabel('Count')
plt.show()


In [None]:
gender_counts = df['HeartDisease'].value_counts()

# Plot the counts
gender_counts.plot(kind='bar')
plt.xticks(rotation=0)
plt.title('Count of Heart disease')
plt.xlabel('Heart disease or not')
plt.ylabel('Count')
plt.show()


EDA bivariaat
- sns.pairplot
- sns.heatmap
- sns.boxplot
- something with anova??


In [6]:
hm_df = df

gender = {'M': 0, 'F': 1}# Create a dictionary to map 'M' to 0 and 'F' to 1
hm_df['Sex'] = hm_df['Sex'].map(gender)# Use the map function to replace values in the gender column

chest_pain = {'ATA': 0, 'ASY': 1, 'NAP': 2, 'TA': 3}
hm_df['ChestPainType'] = hm_df['ChestPainType'].map(chest_pain)

resting_ecg = {'Normal': 0, 'ST': 1, 'LVH': 2}
hm_df['RestingECG'] = hm_df['RestingECG'].map(resting_ecg)

exercise = {'N': 0, 'Y': 1}
hm_df['ExerciseAngina'] = hm_df['ExerciseAngina'].map(exercise)

slope = {'Down': 0, 'Flat': 1, 'Up': 2}
hm_df['ST_Slope'] = hm_df['ST_Slope'].map(slope)


In [None]:
# heatmap
sns.pairplot(df, hue="HeartDisease");

In [None]:
axs = sns.heatmap(df.corr(), annot=True, cmap="jet", vmin=-1.0, vmax=1.0, square=True)

In [None]:
plt.scatter(df['RestingBP'], df['Age'])
plt.title('Title of the Plot')
plt.xlabel('Resting BP')
plt.ylabel('Ages')
plt.show()


data impurity if necessary

EDA multivariaat
- scaled data
- PCA maybe???