In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [55]:
df = pd.read_csv("heart.csv") # reading in the file contents

The first 5 rows of the dataset:

In [56]:
df.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0


In [54]:
print(df.nunique())

Age                50
Sex                 2
ChestPainType       4
RestingBP          67
Cholesterol       222
FastingBS           2
RestingECG          3
MaxHR             119
ExerciseAngina      2
Oldpeak            53
ST_Slope            3
HeartDisease        2
dtype: int64


conclusie:
- Age, RestingBP, cholesterol, maxhr and oldpeak have very high unique numbers so those numbers range alot
- 2/3/4 aka very low numbers indicating that those attributes are probably categorical this is true for sexes, chestpain, fastingBS, restingECG, exercise angina, st_slope and heart disease

### Explanation of the 12 attributes in the dataset:

- **Age**: The age of the patient in years.
  
- **Sex**: The gender of the patient (M: Male, F: Female).
  
- **Chest Pain Type**: Describes the type of chest discomfort experienced by the patient:
    - **Typical Angina (TA)**: Chest pain or discomfort typically associated with heart problems, often triggered by physical exertion or emotional stress.
    - **Atypical Angina (ATA)**: Chest discomfort that doesn't fully meet the criteria for typical angina but still suggests a possible heart-related issue. The symptoms may vary and could include shortness of breath, nausea, or fatigue.
    - **Non-Anginal Pain (NAP)**: Chest discomfort not originating from the heart, which could be due to musculoskeletal issues, respiratory problems, gastrointestinal conditions, or anxiety.
    - **Asymptomatic (ASY)**: The patient does not experience any symptoms related to chest discomfort.

- **Resting Blood Pressure**: The patient's blood pressure measured while at rest, in millimeters of mercury (mm Hg).
  
- **Cholesterol**: The level of cholesterol in the patient's blood serum, measured in milligrams per deciliter (mm/dl).
  
- **Fasting Blood Sugar**: Indicates the patient's fasting blood sugar level:
    - **1**: If fasting blood sugar is greater than 120 mg/dl.
    - **0**: Otherwise.

- **Resting Electrocardiogram Results**: Interpretation of the resting electrocardiogram (ECG) findings:
    - **Normal**: ECG shows no abnormalities.
    - **ST**: ECG shows abnormalities in the ST-T wave, such as T wave inversions or ST elevation/depression of more than 0.05 millivolts (mV).
    - **LVH**: ECG suggests probable or definite left ventricular hypertrophy (enlargement of the heart's left ventricle) according to Estes' criteria.
  
- **Maximum Heart Rate Achieved**: The highest heart rate achieved during physical exertion, measured in beats per minute (bpm).

- **Exercise-Induced Angina**: Indicates whether the patient experienced angina (chest pain or discomfort) during exercise:
    - **Y**: Yes, the patient experienced exercise-induced angina.
    - **N**: No, the patient did not experience exercise-induced angina.
  
- **Oldpeak**: Refers to the ST depression observed during exercise, measured as a numeric value.

- **ST Slope**: Describes the slope of the peak exercise ST segment observed on the ECG:
    - **Up**: The ST segment slopes upwards during exercise.
    - **Flat**: The ST segment remains flat during exercise.
    - **Down**: The ST segment slopes downwards during exercise.
  
- **Heart Disease**: Indicates the presence or absence of heart disease:
    - **1**: Heart disease is present.
    - **0**: No heart disease is detected.

A codebook made of the dataset with help from the publications of the data.

In [59]:
codebook={
    "attribute": ['Age', 'Sex', 'ChestPainType', 'RestingBP', 'Cholesterol', 'FastingBS', 'RestingECG', 'MaxHR', 'ExerciseAngina', 'Oldpeak', 'ST_Slope', 'HeartDisease'],
    "unit": ["year", "n.a.", "n.a.", "mm of Hg", "mm/dl", "n.a.", "n.a.", "bpm", "n.a.", "st depression", "n.a.", "n.a."],
    "dtype": ["int64", "category", "category", "int64", "int64", "category", "category", "int64", "category", "int64", "category", "category"],
    "description": [
        "The age of the person", 
        "The gender of the person", 
        "The type of pain the person is experiencing", 
        "Recorded blood pressure at rest", 
        "The level of cholesterol in the patient's blood serum", 
        "Indicates the patient's fasting blood sugar level", 
        "Interpretation of the resting electrocardiogram (ECG) findings",
        "Maximum heat rate achieved during physical extraction", 
        "Indicates whether the patient experienced angina during exercise", 
        "ST depression observed during exercise", 
        "The slope of the peak exercise ST segment observed on the ECG", 
        "Indicates the presence or absence of heart disease"
    ]
}
pd.DataFrame(codebook).set_index("attribute")

Unnamed: 0_level_0,unit,dtype,description
attribute,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Age,year,int64,The age of the person
Sex,n.a.,category,The gender of the person
ChestPainType,n.a.,category,The type of pain the person is experiencing
RestingBP,mm of Hg,int64,Recorded blood pressure at rest
Cholesterol,mm/dl,int64,The level of cholesterol in the patient's bloo...
FastingBS,n.a.,category,Indicates the patient's fasting blood sugar level
RestingECG,n.a.,category,Interpretation of the resting electrocardiogra...
MaxHR,bpm,int64,Maximum heat rate achieved during physical ext...
ExerciseAngina,n.a.,category,Indicates whether the patient experienced angi...
Oldpeak,st depression,int64,ST depression observed during exercise


The data consists of 5 numeric and 7 categorical attributes. 

In [34]:
df.shape

(918, 12)

Verified that there are 918 rows and 12 attributes in the dataframe.

conclusie: 
- the data is correctly read and can be used for further steps.

## EDA univariaat
eda voor elk atribuut op zich zelf
bijv. verdeling per atribuut
- .isna().sum()
- .describe()
- .boxplot
- .hist
- sns.histplot
- pd.DataFrame(df["Age"].value_counts().sortindex())

Checking if there are NA values and describe every attribute.

In [52]:
print(df.describe())

              Age         Sex  ChestPainType   RestingBP  Cholesterol  \
count  918.000000  918.000000     918.000000  918.000000   918.000000   
mean    53.510893    0.210240       1.132898  132.396514   198.799564   
std      9.432617    0.407701       0.770069   18.514154   109.384145   
min     28.000000    0.000000       0.000000    0.000000     0.000000   
25%     47.000000    0.000000       1.000000  120.000000   173.250000   
50%     54.000000    0.000000       1.000000  130.000000   223.000000   
75%     60.000000    0.000000       2.000000  140.000000   267.000000   
max     77.000000    1.000000       3.000000  200.000000   603.000000   

        FastingBS  RestingECG       MaxHR  ExerciseAngina     Oldpeak  \
count  918.000000  918.000000  918.000000      918.000000  918.000000   
mean     0.233115    0.603486  136.809368        0.404139    0.887364   
std      0.423046    0.805968   25.460334        0.490992    1.066570   
min      0.000000    0.000000   60.000000        0

In [53]:
pd.DataFrame({
    "isna": df.isna().sum()
})

Unnamed: 0,isna
Age,0
Sex,0
ChestPainType,0
RestingBP,0
Cholesterol,0
FastingBS,0
RestingECG,0
MaxHR,0
ExerciseAngina,0
Oldpeak,0


conclusion:
- There are no values missing in any of the attributes.

In [None]:
sns.set(style="whitegrid")
plt.figure(figsize=(10, 6))
sns.kdeplot(df['MaxHR'], shade=True, color="b", alpha=0.7)
plt.title("Density Plot of Attribute")
plt.xlabel("Age(Years)")
plt.ylabel("Density")
plt.show()


In [None]:
axs = df.boxplot(grid=False, vert=False, figsize=(9.6, 4.8))
axs.set_title("Boxplot distributies");

In [None]:
plt.figure(figsize=(10, 6))
sns.violinplot(x='RestingBP', data=df, palette="muted")
plt.title("Violin Plot of Attribute")
plt.xlabel("Attribute")
plt.ylabel("Value")
plt.show()

In [None]:
axs = df.hist(bins=40, figsize=(20, 12))

In [None]:
disease_mapping = {0: 'no disease', 1: 'disease'}

# Use the map function to replace values in the disease column
x = df['HeartDisease'].map(disease_mapping)

# Count the number of males and females
gender_counts = x.value_counts()

# Plot the counts
gender_counts.plot(kind='bar')
plt.xticks(rotation=0)
plt.title('Count of with and without heart disease')
plt.xlabel('with or without a heart disease')
plt.ylabel('Count')
plt.show()

In [None]:
# Count the number of males and females
gender_counts = df['Sex'].value_counts()

# Plot the counts
gender_counts.plot(kind='bar')
plt.xticks(rotation=0)
plt.title('Count of Males and Females')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.show()

In [None]:
gender_counts = df['ST_Slope'].value_counts()

# Plot the counts
gender_counts.plot(kind='bar')
plt.xticks(rotation=0)
plt.title('Count of ST_slope')
plt.xlabel('slope')
plt.ylabel('Count')
plt.show()


In [None]:
gender_counts = df['HeartDisease'].value_counts()

# Plot the counts
gender_counts.plot(kind='bar')
plt.xticks(rotation=0)
plt.title('Count of Heart disease')
plt.xlabel('Heart disease or not')
plt.ylabel('Count')
plt.show()


EDA bivariaat
- sns.pairplot
- sns.heatmap
- sns.boxplot
- something with anova??


In [6]:
hm_df = df

gender = {'M': 0, 'F': 1}# Create a dictionary to map 'M' to 0 and 'F' to 1
hm_df['Sex'] = hm_df['Sex'].map(gender)# Use the map function to replace values in the gender column

chest_pain = {'ATA': 0, 'ASY': 1, 'NAP': 2, 'TA': 3}
hm_df['ChestPainType'] = hm_df['ChestPainType'].map(chest_pain)

resting_ecg = {'Normal': 0, 'ST': 1, 'LVH': 2}
hm_df['RestingECG'] = hm_df['RestingECG'].map(resting_ecg)

exercise = {'N': 0, 'Y': 1}
hm_df['ExerciseAngina'] = hm_df['ExerciseAngina'].map(exercise)

slope = {'Down': 0, 'Flat': 1, 'Up': 2}
hm_df['ST_Slope'] = hm_df['ST_Slope'].map(slope)


In [None]:
# heatmap
sns.pairplot(df, hue="HeartDisease");

In [None]:
axs = sns.heatmap(df.corr(), annot=True, cmap="jet", vmin=-1.0, vmax=1.0, square=True)

In [None]:
plt.scatter(df['RestingBP'], df['Age'])
plt.title('Title of the Plot')
plt.xlabel('Resting BP')
plt.ylabel('Ages')
plt.show()


data impurity if necessary

EDA multivariaat
- scaled data
- PCA maybe???