#### Final assignment - Heart dataset

**Context:**

<sub>
Cardiovascular diseases (CVDs) are the leading cause of death globally, claiming an estimated 17.9 million lives each year (31% of all deaths). Four out of five CVD deaths result from heart attacks and strokes, and one-third of these deaths occur prematurely, in people under 70 years old. Heart failure is a common event caused by CVDs, and this dataset contains 11 features that can be used to predict possible heart disease. People with cardiovascular disease, or at high cardiovascular risk due to factors such as hypertension, diabetes, hyperlipidemia, or already established disease, need early detection and management, and a machine learning model can help with that.
</sub>

**Goal:**

<sub>To predict heart disease from the "Heart Failure Prediction" dataset. This is a binary classification task: the output is 0 (normal) or 1 (heart disease).</sub>

**Features (x):**

<sub>The features to be used are: Age, Sex, ChestPainType, RestingBP, Cholesterol, FastingBS, RestingECG, MaxHR, ExerciseAngina, Oldpeak, and ST_Slope.</sub>

**Target Variable (y):**

<sub>The target variable is "HeartDisease."</sub>

**Attribute Information:**

<sub>

- Age: Age of the patient [years]
- Sex: Sex of the patient [M: Male, F: Female]
- ChestPainType: Chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
- RestingBP: Resting blood pressure [mm Hg]
- Cholesterol: Serum cholesterol [mg/dl]
- FastingBS: Fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
- RestingECG: Resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria]
- MaxHR: Maximum heart rate achieved [Numeric value between 60 and 202]
- ExerciseAngina: Exercise-induced angina [Y: Yes, N: No]
- Oldpeak: ST depression induced by exercise relative to rest [Numeric value]
- ST_Slope: The slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
- HeartDisease: Output class [1: heart disease, 0: Normal]

</sub>

##### Step 1 - Data Loading and Initial Exploration
<sub>

- Load the dataset and understand its structure.

- Check for missing values and handle them if necessary.

- Get a sense of the basic statistics and distribution of the data.

</sub>

In [3]:
# Imports.
import pandas as pd

# Assign & print the dataset.
dataset = 'data/modified_heart_dataset_supervised.csv'
df = pd.read_csv(dataset)
df

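| | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | GeneticMarker1 | GeneticMarker2 | BodyWeightCategory | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | M | ATA | 141 | 289 | 0 | Normal | 173 | N | 0.0 | Up | 0.046501 | 11560 | Normal | 0 |
| 1 | 49 | F | NAP | 158 | 175 | 0 | Normal | 151 | N | 1.0 | Flat | 0.619699 | 8575 | Overweight | 1 |
| 2 | 37 | M | ATA | 135 | 285 | 0 | ST | 97 | N | 0.0 | Up | 0.561993 | 10545 | Overweight | 0 |
| 3 | 48 | F | ASY | 140 | 214 | 0 | Normal | 112 | Y | 1.5 | Flat | 0.345920 | 10272 | Obese | 1 |
| 4 | 54 | M | NAP | 149 | 192 | 0 | Normal | 124 | N | 0.0 | Up | 0.315190 | 10368 | Underweight | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 913 | 45 | M | TA | 106 | 268 | 0 | Normal | 135 | N | 1.2 | Flat | 0.125356 | 12060 | Normal | 1 |
| 914 | 68 | M | ASY | 142 | 190 | 1 | Normal | 141 | N | 3.4 | Flat | 0.336965 | 12920 | Overweight | 1 |
| 915 | 57 | M | ASY | 135 | 128 | 0 | Normal | 118 | Y | 1.2 | Flat | 0.999544 | 7296 | Underweight | 1 |
| 916 | 57 | F | ATA | 128 | 236 | 0 | LVH | 172 | N | 0.0 | Flat | 0.405751 | 13452 | Obese | 1 |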


In [5]:
# Display basic information about the dataset.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Age                 918 non-null    int64  
 1   Sex                 918 non-null    object 
 2   ChestPainType       918 non-null    object 
 3   RestingBP           918 non-null    int64  
 4   Cholesterol         918 non-null    int64  
 5   FastingBS           918 non-null    int64  
 6   RestingECG          918 non-null    object 
 7   MaxHR               918 non-null    int64  
 8   ExerciseAngina      918 non-null    object 
 9   Oldpeak             918 non-null    float64
 10  ST_Slope            918 non-null    object 
 11  GeneticMarker1      918 non-null    float64
 12  GeneticMarker2      918 non-null    int64  
 13  BodyWeightCategory  918 non-null    object 
 14  HeartDisease        918 non-null    int64  
dtypes: float64(2), int64(7), object(6)
memory usage: 107.7+ KB
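
The remaining Step 1 checks (missing values, basic statistics, distributions) could look roughly like the sketch below. It runs on a handful of made-up rows that only borrow the dataset's column names, so the printed numbers say nothing about the real data. The zero check is an assumption on my part: a resting blood pressure or cholesterol of 0 is physiologically implausible and may be a disguised missing value, but whether this file contains any is not shown above.

```python
import pandas as pd

# Made-up stand-in rows (NOT the real data), reusing a few of the dataset's column names.
df = pd.DataFrame({
    "Age": [40, 49, 37, 48, 54, 61],
    "Sex": ["M", "F", "M", "F", "M", "M"],
    "RestingBP": [141, 158, 135, 140, 149, 130],
    "Cholesterol": [289, 175, 285, 214, 192, 0],
    "HeartDisease": [0, 1, 0, 1, 0, 1],
})

# Missing values per column; all zeros means there is nothing to impute.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Assumption: zeros in these columns may stand for unrecorded measurements.
zero_counts = (df[["RestingBP", "Cholesterol"]] == 0).sum()
print(zero_counts)

# Summary statistics for the numeric columns.
numeric_summary = df.describe()
print(numeric_summary)

# Distribution of a categorical column and balance of the target classes.
print(df["Sex"].value_counts())
class_balance = df["HeartDisease"].value_counts(normalize=True)
print(class_balance)
```

On the real frame, the same calls apply unchanged to the `df` loaded earlier.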

##### Step 2 - Data Preparation

<sub>

- Handle categorical variables: Encode them using one-hot encoding or label encoding.
- Split the dataset into features (X) and the target variable (y).
- Split the data into training and testing sets.

</sub>
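
A minimal sketch of these three preparation steps follows. The rows are made up and cover only a subset of the columns; `test_size=0.25`, `random_state=42`, and the choice of one-hot over label encoding are illustrative assumptions, not values taken from the assignment.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in rows (NOT the real data) with a subset of the dataset's columns.
df = pd.DataFrame({
    "Age": [40, 49, 37, 48, 54, 39, 45, 54, 37, 48, 58, 60],
    "Sex": ["M", "F", "M", "F", "M", "M", "F", "M", "M", "F", "M", "M"],
    "ChestPainType": ["ATA", "NAP", "ATA", "ASY", "NAP", "NAP",
                      "ATA", "ATA", "ASY", "ATA", "ASY", "TA"],
    "MaxHR": [172, 156, 98, 108, 122, 170, 170, 142, 130, 120, 99, 140],
    "ExerciseAngina": ["N", "N", "N", "Y", "N", "N", "N", "N", "Y", "N", "Y", "Y"],
    "HeartDisease": [0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1],
})

# Separate the features (X) from the target (y).
feature_columns = ["Age", "Sex", "ChestPainType", "MaxHR", "ExerciseAngina"]
categorical_columns = ["Sex", "ChestPainType", "ExerciseAngina"]
X = df[feature_columns]
y = df["HeartDisease"]

# One-hot encode the text columns; drop_first removes one redundant column per feature.
X_encoded = pd.get_dummies(X, columns=categorical_columns, drop_first=True)

# Hold out a test set; stratify keeps the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, test_size=0.25, stratify=y, random_state=42
)
print(X_encoded.columns.tolist())
print(len(X_train), len(X_test))
```

One-hot encoding suits nominal columns such as `ChestPainType`, where label encoding would impose an arbitrary order on the categories.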

##### Step 3 - Feature Engineering

<sub>

- If needed, create new features or transform existing ones to enhance the model's performance.
- Perform any necessary scaling or normalization.

</sub>
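
As an illustration of both bullets, the sketch below adds one derived column and then standardizes the numeric features. `HRRatio` and the helper `add_hr_ratio` are hypothetical names introduced here; the feature divides `MaxHR` by the commonly quoted 220 minus age rule of thumb and is only an example of a transformation, not a recommendation. The rows are made up.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up stand-in rows (NOT the real data); assume the train/test split already happened.
X_train = pd.DataFrame({
    "Age": [40, 49, 37, 48, 54, 39],
    "RestingBP": [141, 158, 135, 140, 149, 120],
    "MaxHR": [173, 151, 97, 112, 124, 170],
})
X_test = pd.DataFrame({"Age": [45, 68], "RestingBP": [106, 142], "MaxHR": [135, 141]})


def add_hr_ratio(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical derived feature: MaxHR relative to the 220 - Age rule of thumb."""
    out = frame.copy()
    out["HRRatio"] = out["MaxHR"] / (220 - out["Age"])
    return out


X_train = add_hr_ratio(X_train)
X_test = add_hr_ratio(X_test)

# Fit the scaler on the training rows only and reuse it for the test rows,
# so no statistics from the test set leak into preprocessing.
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(
    scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index
)
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test), columns=X_test.columns, index=X_test.index
)
print(X_train_scaled.round(2))
```

Scaling matters mostly for distance- or gradient-based models such as SVM and Logistic Regression; tree-based models are largely insensitive to it.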

##### Step 4 - Exploratory Data Analysis (EDA)

<sub>

- Dive deeper into the relationships between different features.
- Visualize the data to gain insights.
- Identify patterns or trends that might be useful for modeling.

</sub>
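
A few pandas one-liners already reveal a lot about feature relationships before any plotting. The sketch below uses made-up rows, so the patterns it prints are artifacts of the toy data and not findings about the real dataset.

```python
import pandas as pd

# Made-up stand-in rows (NOT the real data), reusing a few of the dataset's column names.
df = pd.DataFrame({
    "Age": [40, 49, 37, 48, 54, 57, 45, 68],
    "MaxHR": [173, 151, 97, 112, 124, 118, 135, 141],
    "ST_Slope": ["Up", "Flat", "Up", "Flat", "Up", "Flat", "Flat", "Flat"],
    "HeartDisease": [0, 1, 0, 1, 0, 1, 1, 1],
})

# Mean of the numeric features within each outcome class.
mean_by_class = df.groupby("HeartDisease")[["Age", "MaxHR"]].mean()
print(mean_by_class)

# Share of each outcome within every ST_Slope category (each row sums to 1).
slope_vs_target = pd.crosstab(df["ST_Slope"], df["HeartDisease"], normalize="index")
print(slope_vs_target)

# Pearson correlation of each numeric feature with the target.
corr_with_target = (
    df[["Age", "MaxHR", "HeartDisease"]].corr()["HeartDisease"].drop("HeartDisease")
)
print(corr_with_target)
```

The resulting tables can be handed to a plotting library (bar charts for the crosstab, a heatmap for the correlations) to cover the visualization bullet.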

##### Step 5 - Model Training

<sub>

- Choose a suitable classification model (e.g., Logistic Regression, Decision Trees, Random Forest, SVM).
- Train the model using the training dataset.
- Evaluate the model's performance on the testing set.

</sub>
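
One way to wire these steps together is a scikit-learn `Pipeline` that bundles preprocessing with the classifier. The sketch below picks Logistic Regression purely as an example; the data is synthetic, and the labelling rule is invented so that the model has something to learn.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (NOT the real dataset), reusing three of its column names.
rng = np.random.default_rng(0)
n_rows = 200
df = pd.DataFrame({
    "Age": rng.integers(30, 78, size=n_rows),
    "MaxHR": rng.integers(60, 203, size=n_rows),
    "ExerciseAngina": rng.choice(["Y", "N"], size=n_rows),
})
# Invented labelling rule, only so the toy problem is learnable.
df["HeartDisease"] = ((df["Age"] > 55) | (df["ExerciseAngina"] == "Y")).astype(int)

X = df.drop(columns="HeartDisease")
y = df["HeartDisease"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scale numeric columns and one-hot encode the categorical one inside the pipeline,
# so both transformations are fitted on the training data only.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Age", "MaxHR"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["ExerciseAngina"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression(max_iter=1000)),
])

model.fit(X_train, y_train)
predictions = model.predict(X_test)
test_accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Swapping in another estimator, for example `RandomForestClassifier`, only requires changing the `"classifier"` step.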

##### Step 6 - Model Evaluation and Fine-Tuning

<sub>

- Evaluate the model's performance using appropriate metrics.
- Fine-tune hyperparameters to optimize the model.
- Consider cross-validation for a more robust evaluation.

</sub>
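
The three bullets above can be sketched as follows, assuming a Logistic Regression pipeline and F1 as the scoring metric; both are illustrative choices, as is the grid of `C` values. The data comes from `make_classification`, so it is synthetic and unrelated to the heart dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (NOT the heart dataset).
X, y = make_classification(n_samples=300, n_features=6, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# 5-fold cross-validation on the training data gives a more stable estimate
# than a single train/test split.
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring="f1")
print("CV F1 per fold:", cv_scores.round(3))

# Grid search over the regularization strength C (smaller C = stronger regularization).
search = GridSearchCV(
    pipeline,
    param_grid={"classifier__C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="f1",
)
search.fit(X_train, y_train)
print("Best C:", search.best_params_["classifier__C"])

# Final check on the held-out test set, which was never used during tuning.
y_pred = search.predict(X_test)
confusion = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted class
print(confusion)
print(classification_report(y_test, y_pred))
```

For a screening-style task, recall on the positive class is often worth reporting alongside accuracy, since a missed case is usually costlier than a false alarm.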