# Artificial Intelligence Final Project by Andrija Stankovic


This is the final project for the Artificial Intelligence course, which includes a full exploratory data analysis (**EDA**) of a chosen dataset.


## Table of Contents

- <a href="#Task">Task</a>
- <a href="#Context">Context</a>
  - <a href="#attributes">Attributes</a>
  - <a href="#setup">Setting Up</a>
- <a href="#EDA">Exploratory Data Analysis</a>
- <a href="#Citation">Citations & Acknowledgements</a>


## <p id="Task">📝 | Task</p>

Some of the requirements include:

- Use a Classification model.
- Present the performance of the model.
- Change parameters in order to improve results.
- Express data using different techniques and plots.


For this EDA, I will use the Heart Failure dataset from Kaggle, with the aim of exploring the dataset and drawing conclusions from visualizations of its data.

***Source:*** *https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction*


## <p id="Context">🩺 | Context</p>
**Cardiovascular diseases** (*CVDs*) are the leading cause of death globally, taking an estimated <u>**17.9 million**</u> lives each year. CVDs are a group of disorders of the heart and blood vessels and include coronary heart disease, cerebrovascular disease, rheumatic heart disease and other conditions. 

Furthermore, 80% of CVD deaths are due to heart attacks and strokes, and one-third of these deaths occur prematurely in people under 70 years of age. 

Heart failure is a common event caused by CVDs, and this dataset contains <a href="#attributes">11 attributes</a> that can be used to predict possible heart disease. People with cardiovascular disease or at high cardiovascular risk (due to the presence of one or more risk factors) need early detection and management, so a machine learning model can be extremely helpful.

### <p id="attributes">The Attributes include:</p>

- **Age:** age of the patient [years]
- **Sex:** sex of the patient [*M*: Male, *F*: Female]
- **ChestPainType:** chest pain type [*TA*: Typical Angina, *ATA*: Atypical Angina, *NAP*: Non-Anginal Pain, *ASY*: Asymptomatic]
- **RestingBP:** resting blood pressure [mm Hg]
- **Cholesterol:** serum cholesterol [mg/dl]
- **FastingBS:** fasting blood sugar [*1*: if FastingBS > 120 mg/dl, *0*: otherwise]
- **RestingECG:** resting electrocardiogram results [*Normal*: Normal, *ST*: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), *LVH*: showing probable or definite left ventricular hypertrophy by Estes' criteria]
- **MaxHR:** maximum heart rate achieved [Numeric value between 60 and 202]
- **ExerciseAngina:** exercise-induced angina [*Y*: Yes, *N*: No]
- **Oldpeak:** ST depression induced by exercise relative to rest [numeric value]
- **ST_Slope:** the slope of the peak exercise ST segment [*Up*: upsloping, *Flat*: flat, *Down*: downsloping]
- **HeartDisease:** output class [*1*: heart disease, *0*: Normal]
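
The documented codes above can double as a sanity check on the data. The following is a minimal sketch, assuming a small hand-made sample in place of `heart.csv`; the `EXPECTED_CODES` mapping and the `unexpected_codes` helper are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Allowed codes per categorical attribute, copied from the attribute list.
EXPECTED_CODES = {
    "Sex": {"M", "F"},
    "ChestPainType": {"TA", "ATA", "NAP", "ASY"},
    "RestingECG": {"Normal", "ST", "LVH"},
    "ExerciseAngina": {"Y", "N"},
    "ST_Slope": {"Up", "Flat", "Down"},
}


def unexpected_codes(frame: pd.DataFrame) -> dict:
    """Return, per column, any values that are not documented codes."""
    problems = {}
    for column, allowed in EXPECTED_CODES.items():
        found = set(frame[column].dropna().unique())
        extra = found - allowed
        if extra:
            problems[column] = extra
    return problems


# Hand-made rows (not taken from the dataset); "Unknown" is deliberately invalid.
sample = pd.DataFrame({
    "Sex": ["M", "F", "M"],
    "ChestPainType": ["ATA", "NAP", "ASY"],
    "RestingECG": ["Normal", "ST", "Unknown"],
    "ExerciseAngina": ["N", "N", "Y"],
    "ST_Slope": ["Up", "Flat", "Flat"],
})

print(unexpected_codes(sample))
```

Once the dataset is loaded, passing the real DataFrame instead of `sample` should return an empty dictionary if every value matches a documented code.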



### <p id="setup">Setting up the project</p>

Before we can start, we must import the libraries and modules used in this EDA and load the dataset. Once that is done, we can check that everything works with the **.head()** method, which returns the first 5 rows of the dataset.

In [61]:
# Imported Libraries and Modules

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px


# Reads the data
df = pd.read_csv('heart.csv')

# Prints the first 5 rows of the data
df.head()


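| | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |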


After importing everything we need, we should check that the data is complete and ready to use. We start by checking for null or missing values, which would need to be filled or dropped.

The output below shows no missing values in any column, so no action is needed.

In [58]:
# Check if there are any null values
print("~ Number of Null Values per Column ~")
print(df.isnull().sum())

~ Number of Null Values per Column ~
Age               0
Sex               0
ChestPainType     0
RestingBP         0
Cholesterol       0
FastingBS         0
RestingECG        0
MaxHR             0
ExerciseAngina    0
Oldpeak           0
ST_Slope          0
HeartDisease      0
dtype: int64
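
Note that `isnull()` only counts values that pandas stores as missing (`NaN`, `None`, `pd.NA`). A placeholder such as `0` in a measurement that cannot physically be zero would pass this check unnoticed. The sketch below counts zeros in such columns. It uses made-up rows rather than the real data, and the choice of columns to inspect (`RestingBP`, `Cholesterol`, `MaxHR`) is an assumption, not something the dataset description states.

```python
import pandas as pd

# Made-up rows for illustration only; use `df` instead of `sample` for the real data.
sample = pd.DataFrame({
    "RestingBP": [140, 0, 130, 138],
    "Cholesterol": [289, 0, 0, 214],
    "MaxHR": [172, 156, 98, 108],
})

# Assumption: a value of 0 is not a plausible measurement in these columns.
columns_to_check = ["RestingBP", "Cholesterol", "MaxHR"]

# Comparing to 0 gives a boolean frame; summing it counts the zeros per column.
zero_counts = (sample[columns_to_check] == 0).sum()
print(zero_counts)
```

If the real data shows non-zero counts here, those rows may deserve the same treatment as missing values (for example, imputation or exclusion) before modelling.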


Next, we check the data type of each column to confirm that all of them are usable.

In [53]:
# Prints the current data types of the columns
print("~ Current Data Types ~")
print(df.dtypes, "\n")

~ Current Data Types ~
Age                 int64
Sex                object
ChestPainType      object
RestingBP           int64
Cholesterol         int64
FastingBS           int64
RestingECG         object
MaxHR               int64
ExerciseAngina     object
Oldpeak           float64
ST_Slope           object
HeartDisease        int64
dtype: object 



Some of the columns (*Sex, ChestPainType, RestingECG, ExerciseAngina, ST_Slope*) hold string values but are stored with the generic 'object' data type, so we convert them to a dedicated type that is easier to access and compare.

We do this by collecting the names of all 'object' columns in a variable, 'string_col', and then converting those columns to the 'string' data type with the **.astype()** method.

In [55]:
# Change the object data type to Strings
string_col = df.select_dtypes(include="object").columns
df[string_col] = df[string_col].astype("string")

# Prints the fixed data types of the columns
print("~ Data Types After Change ~")
print(df.dtypes)

~ Data Types After Change ~
Age                 int64
Sex                string
ChestPainType      string
RestingBP           int64
Cholesterol         int64
FastingBS           int64
RestingECG         string
MaxHR               int64
ExerciseAngina     string
Oldpeak           float64
ST_Slope           string
HeartDisease        int64
dtype: object
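
Most classification models expect numeric inputs, so the string columns usually need to be encoded before modelling. One common option is one-hot encoding with `pd.get_dummies`. The following is a sketch of that option on made-up rows with only two of the categorical columns; it is not necessarily the encoding used for the final model.

```python
import pandas as pd

# Made-up rows for illustration only.
sample = pd.DataFrame({
    "Age": [40, 49, 37],
    "Sex": ["M", "F", "M"],
    "ST_Slope": ["Up", "Flat", "Up"],
})

# Each listed column is replaced by one 0/1 indicator column per distinct value.
encoded = pd.get_dummies(sample, columns=["Sex", "ST_Slope"], dtype=int)
print(encoded)
```

The numeric column `Age` is passed through unchanged, while `Sex` and `ST_Slope` become indicator columns named after their values (for example `Sex_M`).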


## <p id="EDA">🔬 | Exploratory Data Analysis</p>

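### Basic

A first look at the pairwise Pearson correlations between the numeric columns:

  # numeric_only=True skips the string columns, which cannot be correlated
  df.corr(numeric_only=True)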


| | Age | RestingBP | Cholesterol | FastingBS | MaxHR | Oldpeak | HeartDisease |
|---|---|---|---|---|---|---|---|
| Age | 1.0 | 0.254399 | -0.095282 | 0.198039 | -0.382045 | 0.258612 | 0.282039 |
| RestingBP | 0.254399 | 1.0 | 0.100893 | 0.070193 | -0.112135 | 0.164803 | 0.107589 |
| Cholesterol | -0.095282 | 0.100893 | 1.0 | -0.260974 | 0.235792 | 0.050148 | -0.232741 |
| FastingBS | 0.198039 | 0.070193 | -0.260974 | 1.0 | -0.131438 | 0.052698 | 0.267291 |
| MaxHR | -0.382045 | -0.112135 | 0.235792 | -0.131438 | 1.0 | -0.160691 | -0.400421 |
| Oldpeak | 0.258612 | 0.164803 | 0.050148 | 0.052698 | -0.160691 | 1.0 | 0.403951 |
| HeartDisease | 0.282039 | 0.107589 | -0.232741 | 0.267291 | -0.400421 | 0.403951 | 1.0 |
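
To read the matrix with the target in mind, it can help to rank the features by the strength of their correlation with `HeartDisease`. The sketch below does this with the values copied from the `HeartDisease` row of the matrix; the variable names `target_corr` and `ranked` are illustrative.

```python
import pandas as pd

# Correlations with HeartDisease, copied from the last row of the matrix.
target_corr = pd.Series({
    "Age": 0.282039,
    "RestingBP": 0.107589,
    "Cholesterol": -0.232741,
    "FastingBS": 0.267291,
    "MaxHR": -0.400421,
    "Oldpeak": 0.403951,
})

# Sort by absolute value so strong negative correlations rank
# alongside strong positive ones.
ranked = target_corr.sort_values(key=abs, ascending=False)
print(ranked)
```

In this ranking `Oldpeak` and `MaxHR` come first, with `MaxHR` being the strongest negative correlation, and `RestingBP` comes last.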


## <p id="Citation">🪪 | Citations & Acknowledgements</p>

The dataset was created by fedesoriano and taken from Kaggle.

(https://www.kaggle.com/fedesoriano/heart-failure-prediction)

***Additional information:***

This dataset was created by combining different datasets already available independently but not combined before. In this dataset, 5 heart datasets are combined over 11 common features which makes it the largest heart disease dataset available so far for research purposes. The five datasets used for its curation are:

- Cleveland: 303 observations
- Hungarian: 294 observations
- Switzerland: 123 observations
- Long Beach VA: 200 observations
- Statlog (Heart) Data Set: 270 observations

Total: 1190 observations

Duplicated: 272 observations

Final dataset: 918 observations
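
The totals above can be checked with a few lines of arithmetic. This is only a sketch over the stated counts; it does not reproduce the actual de-duplication of the source files.

```python
# Observation counts per source dataset, as listed in this section.
source_counts = {
    "Cleveland": 303,
    "Hungarian": 294,
    "Switzerland": 123,
    "Long Beach VA": 200,
    "Statlog (Heart)": 270,
}
duplicated = 272

total = sum(source_counts.values())
final = total - duplicated
print(total, final)  # 1190 918
```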

Every dataset used can be found under the Index of heart disease datasets from UCI Machine Learning Repository on the following link: https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/
