# Artificial Intelligence Final Project by Andrija Stankovic


This is the final project for the Artificial Intelligence course, which includes a full exploratory data analysis (**EDA**) of a chosen dataset.


## Table of Contents

- <a href="#Task">Task</a>
- <a href="#Context">Context</a>
  - <a href="#attributes">Attributes</a>
  - <a href="#setup">Setting Up</a>
- <a href="#EDA">Exploratory Data Analysis</a>
  - <a href="#Correlation">Correlation Matrix</a>
- <a href="#Citation">Citations & Acknowledgements</a>


## <p id="Task">📝 | Task</p>

Some of the requirements include:

- Use a Classification model.
- Present the performance of the model.
- Change parameters in order to improve results.
- Express data using different techniques and plots.


For this EDA, I will be using the Heart Failure dataset from Kaggle, with the aim of exploring the dataset and drawing conclusions from visualizations of its data.

***Source:*** *https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction*


## <p id="Context">🩺 | Context</p>

<h3>Summary:</h3>

**Cardiovascular diseases** (*CVDs*) are the leading cause of death globally, taking an estimated <u>**17.9 million**</u> lives each year. CVDs are a group of disorders of the heart and blood vessels and include coronary heart disease, cerebrovascular disease, rheumatic heart disease and other conditions. 

Furthermore, 80% of CVD deaths are due to heart attacks and strokes, and one-third of these deaths occur prematurely in people under 70 years of age. 

Heart failure is a common event caused by CVDs and this dataset contains <a href="#attributes">11 attributes</a> that can be used to predict a possible heart disease. As people with cardiovascular disease or who are at high cardiovascular risk (due to the presence of one or more risk factors) need early detection and management, a machine learning model can be extremely helpful.

### <p id="attributes">The Attributes include:</p>

- **Age:** age of the patient [years]
  
- **Sex:** sex of the patient [*M*: Male, *F*: Female]
  
- **ChestPainType:** chest pain type [*TA*: Typical Angina, *ATA*: Atypical Angina, *NAP*: Non-Anginal Pain, *ASY*: Asymptomatic]
- **RestingBP:** resting blood pressure [mm Hg]
- **Cholesterol:** serum cholesterol [mg/dl]
- **FastingBS:** fasting blood sugar [*1*: if FastingBS > 120 mg/dl, *0*: otherwise]
- **RestingECG:** resting electrocardiogram results [*Normal*: Normal, *ST*: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), *LVH*: showing probable or definite left ventricular hypertrophy by Estes' criteria]
- **MaxHR:** maximum heart rate achieved [Numeric value between 60 and 202]
- **ExerciseAngina:** exercise-induced angina [*Y*: Yes, *N*: No]
- **Oldpeak:** oldpeak = ST [Numeric value measured in depression]
- **ST_Slope:** the slope of the peak exercise ST segment [*Up*: upsloping, *Flat*: flat, *Down*: downsloping]
- **HeartDisease:** output class [*1*: heart disease, *0*: Normal]



## <p id="EDA">🔬 | Exploratory Data Analysis</p>


### <p id="setup">1. Setting up the project</p>


#### 1.1 Imports

Before we can start, we must import the libraries and modules used for this EDA and load the dataset. Once we do that, we can check that everything works by using the **.head()** method, which prints the first 5 rows of the dataset.



In [160]:
# Imported Libraries and Modules

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import cufflinks as cf
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)


# Insight into the column data
def colInfo(column):
    print("Column Name              : \"" + column + "\"")
    print("--------------------------------")
    print("Number of Unique Values  : ", df[column].nunique())
    print("Null Values              : ", df[column].isnull().sum())
    print("Unique Values            :\n", df[column].unique())
    print("")


# Reads the data
df = pd.read_csv('heart.csv')

# Prints the first 5 rows of the data
df.head()


Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0


#### 1.2 Missing Values

After importing everything we need, we must check whether all of the data is easily accessible and ready to be used. We can do this by first checking whether there are any null or missing values, and filling them if needed.

In [161]:
# Check if there are any null values
print("~ Number of Null Values per Column ~")
print(df.isnull().sum())


~ Number of Null Values per Column ~
Age               0
Sex               0
ChestPainType     0
RestingBP         0
Cholesterol       0
FastingBS         0
RestingECG        0
MaxHR             0
ExerciseAngina    0
Oldpeak           0
ST_Slope          0
HeartDisease      0
dtype: int64


As there are no missing values across all columns, no action is needed. 
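Had the check reported gaps, one common way to fill them is sketched below. This is only an illustration: the `toy` frame and its values are hypothetical, and the choice of median and mode as fill values is an assumption, not something this dataset required.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps (the real heart.csv has no missing values)
toy = pd.DataFrame({
    "Cholesterol": [289.0, np.nan, 283.0, 214.0],
    "ST_Slope": ["Up", "Flat", None, "Flat"],
})

# Numeric column: fill with the median, which is less sensitive to outliers than the mean
toy["Cholesterol"] = toy["Cholesterol"].fillna(toy["Cholesterol"].median())

# Categorical column: fill with the most frequent value (the mode)
toy["ST_Slope"] = toy["ST_Slope"].fillna(toy["ST_Slope"].mode()[0])

# Every column should now report zero null values
print(toy.isnull().sum())
```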

While we are here, we can also check and see how many values are unique.

In [162]:
# Number of Unique Values
print("~ Number of Unique Values per Column ~")
print(df.nunique())

~ Number of Unique Values per Column ~
Age                50
Sex                 2
ChestPainType       4
RestingBP          67
Cholesterol       222
FastingBS           2
RestingECG          3
MaxHR             119
ExerciseAngina      2
Oldpeak            53
ST_Slope            3
HeartDisease        2
dtype: int64


#### 1.3 Datatypes

Next, we want to check the data types of the columns and check whether all of them are usable.

In [163]:
# Prints the current data types of the columns
print("~ Current Data Types ~")
print(df.dtypes, "\n")

~ Current Data Types ~
Age                 int64
Sex                object
ChestPainType      object
RestingBP           int64
Cholesterol         int64
FastingBS           int64
RestingECG         object
MaxHR               int64
ExerciseAngina     object
Oldpeak           float64
ST_Slope           object
HeartDisease        int64
dtype: object 



Since some of the columns (*Sex, ChestPainType, RestingECG, ExerciseAngina, ST_Slope*) hold string values stored under the generic 'object' data type, we should convert them to a type that is easier to access, use, and compare.

We can do this by collecting the names of all 'object' columns in a variable 'string_col' and then converting those columns into the 'string' data type with the **.astype()** method.

In [164]:
# Change the object data type to Strings
string_col = df.select_dtypes(include="object").columns
df[string_col]=df[string_col].astype("string")

# Prints the fixed data types of the columns
print("~ Data Types After Change~")
print(df.dtypes)

~ Data Types After Change~
Age                 int64
Sex                string
ChestPainType      string
RestingBP           int64
Cholesterol         int64
FastingBS           int64
RestingECG         string
MaxHR               int64
ExerciseAngina     string
Oldpeak           float64
ST_Slope           string
HeartDisease        int64
dtype: object
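The same conversion can be tried on a small frame to see its effect in isolation. This is a minimal sketch: the `toy` frame is hypothetical and its text columns are created with the 'object' dtype explicitly, mirroring what the notebook saw after reading the CSV.

```python
import pandas as pd

# Hypothetical toy frame; text columns are stored as 'object' on purpose
toy = pd.DataFrame({
    "Age": [40, 49],
    "Sex": pd.Series(["M", "F"], dtype="object"),
    "ST_Slope": pd.Series(["Up", "Flat"], dtype="object"),
})

# Collect the names of the 'object' columns and convert only those
string_col = toy.select_dtypes(include="object").columns
toy[string_col] = toy[string_col].astype("string")

# The converted columns can now be selected by the 'string' dtype
converted = toy.select_dtypes(include="string").columns
print(list(converted))
```

Numeric columns such as `Age` are left untouched, because only the names held in `string_col` are reassigned.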


### <p id="Basic">**2. General Exploration**</p>



#### **2.1 Basic Visualization**

Before we start diving into the analysis, it could be useful to first take a look and understand how our dataset is constructed.

We can show some instances of the dataset through one of these functions:

In [165]:
# Shows the first x rows of the data
df.head(5)

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0


In [166]:
# Shows the last x rows of the data
df.tail(5)

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
913,45,M,TA,110,264,0,Normal,132,N,1.2,Flat,1
914,68,M,ASY,144,193,1,Normal,141,N,3.4,Flat,1
915,57,M,ASY,130,131,0,Normal,115,Y,1.2,Flat,1
916,57,F,ATA,130,236,0,LVH,174,N,0.0,Flat,1
917,38,M,NAP,138,175,0,Normal,173,N,0.0,Up,0


In [167]:
# Shows x random rows of the data
df.sample(5)

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
364,56,M,NAP,120,0,0,Normal,97,N,0.0,Flat,0
750,46,F,NAP,142,177,0,LVH,160,Y,1.4,Down,0
253,62,M,ATA,140,271,0,Normal,152,N,1.0,Up,0
300,60,M,ASY,160,0,1,Normal,149,N,0.4,Flat,1
882,56,F,ATA,140,294,0,LVH,153,N,1.3,Flat,0


In [168]:
# Shows the indexed column names
df.columns

Index(['Age', 'Sex', 'ChestPainType', 'RestingBP', 'Cholesterol', 'FastingBS',
       'RestingECG', 'MaxHR', 'ExerciseAngina', 'Oldpeak', 'ST_Slope',
       'HeartDisease'],
      dtype='object')

In [169]:
# Print the number of rows and columns
print("The dataset is made up of ", df.shape[0], "observations and", df.shape[1], "columns.\n")


The dataset is made up of  918 observations and 12 columns.



Additionally, by using the **.info()** method, we can show the detailed structure of the dataset, which includes:
- the number of entries
- column names
- non-null count
- data types
- memory usage 

In [170]:
# Detailed Information of Dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             918 non-null    int64  
 1   Sex             918 non-null    string 
 2   ChestPainType   918 non-null    string 
 3   RestingBP       918 non-null    int64  
 4   Cholesterol     918 non-null    int64  
 5   FastingBS       918 non-null    int64  
 6   RestingECG      918 non-null    string 
 7   MaxHR           918 non-null    int64  
 8   ExerciseAngina  918 non-null    string 
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    string 
 11  HeartDisease    918 non-null    int64  
dtypes: float64(1), int64(6), string(5)
memory usage: 86.2 KB


Besides this, we can also use the **.describe()** method to list the statistical summary of the dataset. This lists:
- **count:** Number of instances
- **mean:** The mean (total sum of values / total count)
- **std:** Standard Deviation (square root of its variance)
- **min:** The smallest value in the category
- **x%:** The value at xth percentile (25%, 50%, 75%)
- **max:** The largest value in the category

(We can use the **.transpose()** method to switch the rows and columns which is purely done for visual purposes.)
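To make these summary figures concrete, the sketch below recomputes a few of them by hand and compares them with **.describe()**. It uses only the five *Age* values from the first rows of the dataset, so the numbers are illustrative and differ from the full-dataset summary. Note that pandas reports the sample standard deviation (dividing by n - 1).

```python
import numpy as np
import pandas as pd

# The five Age values from the first rows of the dataset
age = pd.Series([40, 49, 37, 48, 54], name="Age")

# Summary as produced by pandas
summary = age.describe()

# The same figures computed by hand
mean = age.sum() / age.count()                                  # total sum / total count
std = np.sqrt(((age - mean) ** 2).sum() / (age.count() - 1))    # sample std (n - 1)
median = age.quantile(0.50)                                     # value at the 50th percentile

print(summary)
print(mean, std, median)
```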

In [171]:
df.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Age,918.0,53.510893,9.432617,28.0,47.0,54.0,60.0,77.0
RestingBP,918.0,132.396514,18.514154,0.0,120.0,130.0,140.0,200.0
Cholesterol,918.0,198.799564,109.384145,0.0,173.25,223.0,267.0,603.0
FastingBS,918.0,0.233115,0.423046,0.0,0.0,0.0,0.0,1.0
MaxHR,918.0,136.809368,25.460334,60.0,120.0,138.0,156.0,202.0
Oldpeak,918.0,0.887364,1.06657,-2.6,0.0,0.6,1.5,6.2
HeartDisease,918.0,0.553377,0.497414,0.0,0.0,1.0,1.0,1.0


#### **2.2 Target Variable Exploration**

The target variable is the variable whose values are modeled and predicted by other variables.

The target variable of this exploration is **Heart Disease** in the context of determining whether anybody is likely to get it based on the input parameters like sex, age and others.

In [172]:
# Prints column information
colInfo("HeartDisease")

Column Name              : "HeartDisease"
--------------------------------
Number of Unique Values  :  2
Null Values              :  0
Unique Values            :
 [0 1]



As this is the crucial variable, we should first see the ratio between participants with and without Heart Disease.

In [173]:
# Prints the number of participants in each class
print("~ Number of Participants with Heart Disease ~")
print(df["HeartDisease"].value_counts())

# Pie plot illustrating the number of participants with heart disease
px.pie(df, names="HeartDisease", title="Heart Disease Distribution", width=500, height=500)


~ Number of Participants with Heart Disease ~
1    508
0    410
Name: HeartDisease, dtype: int64
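The counts above can be turned into percentages to read the class balance directly. The sketch below rebuilds the two counts from the printed output rather than from the CSV, so it runs on its own; on the real frame, `df["HeartDisease"].value_counts(normalize=True)` would give the same proportions.

```python
import pandas as pd

# Class counts copied from the value_counts() output
counts = pd.Series({1: 508, 0: 410}, name="HeartDisease")

# Share of each class as a percentage of all participants
share = (counts / counts.sum() * 100).round(2)
print(share)
```

The share of class 1 agrees with the mean of *HeartDisease* (0.553377) reported by **.describe()**, since the mean of a 0/1 column is the fraction of ones.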


In order to further analyze how each attribute relates to Heart Disease, the best course of action is to split the attributes into two groups:

- **Numerical Data:** Values that can be measured.
- **Categorical Data:** Values that are divided into groups.

#### **2.3 Numerical Data Analysis**



We can separate numerical data from the rest by selecting only the columns with a numeric data type (the target column *HeartDisease* is dropped first).

In [174]:
# Numerical Data Sample
numerical = df.drop(["HeartDisease"], axis=1).select_dtypes(include=np.number).columns
df[numerical].sample(5).transpose()

Unnamed: 0,144,605,92,231,734
Age,56.0,51.0,52.0,40.0,56.0
RestingBP,120.0,114.0,120.0,130.0,130.0
Cholesterol,279.0,258.0,210.0,281.0,283.0
FastingBS,0.0,1.0,0.0,0.0,1.0
MaxHR,150.0,96.0,148.0,167.0,103.0
Oldpeak,1.0,1.0,0.0,0.0,1.6


Afterwards, we can summarize the statistical data to find useful values such as mean, standard deviation, min and max.

In [181]:
# Statistical Summary of Numerical Data
df[numerical].describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Age,918.0,53.510893,9.432617,28.0,47.0,54.0,60.0,77.0
RestingBP,918.0,132.396514,18.514154,0.0,120.0,130.0,140.0,200.0
Cholesterol,918.0,198.799564,109.384145,0.0,173.25,223.0,267.0,603.0
FastingBS,918.0,0.233115,0.423046,0.0,0.0,0.0,0.0,1.0
MaxHR,918.0,136.809368,25.460334,60.0,120.0,138.0,156.0,202.0
Oldpeak,918.0,0.887364,1.06657,-2.6,0.0,0.6,1.5,6.2


Although the above information is useful, the data can be represented more clearly using plots.

In [175]:
# Histogram of Numerical Data
df[numerical].iplot(kind='histogram', subplots=True, bins=60, title='Histogram of Numerical Data')

#### **2.4 Box Plots**

Box plots visually show the distribution and skewness of numerical data by displaying its quartiles (or percentiles) and median.

They are very useful in showing us the concentration of values, and outliers for each attribute.
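The quantities a box plot draws can also be computed directly. The sketch below is an illustration under two assumptions: the sample values are hypothetical, and outliers are flagged with the common 1.5 × IQR rule, which the notebook itself does not specify.

```python
import numpy as np

# Hypothetical sample values, not taken from the dataset
values = np.array([60, 110, 120, 130, 138, 150, 156, 170, 202])

# Quartiles and median (the box and its centre line)
q1, median, q3 = np.percentile(values, [25, 50, 75])

# Interquartile range and the 1.5 * IQR fences
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are treated as outliers
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(q1, median, q3, iqr, outliers)
```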

In [176]:
# Iterate through all columns of Numerical Data and plot a box plot

for column in numerical:
    if column != "FastingBS": # FastingBS has only two distinct values
        df[column].iplot(kind="box", title=column, boxpoints="all")


Basing the plots, we 

#### **2.5 Categorical Data**



In [177]:
categorical = df.select_dtypes(include="string").columns
df[categorical].head().transpose()

Unnamed: 0,0,1,2,3,4
Sex,M,F,M,F,M
ChestPainType,ATA,NAP,ATA,ASY,NAP
RestingECG,Normal,Normal,ST,Normal,Normal
ExerciseAngina,N,N,N,Y,N
ST_Slope,Up,Flat,Up,Flat,Up


#### <p id="Correlation">**2.6 Correlation Matrix**</p>

A correlation matrix is a table which shows correlation coefficients between variables. The purpose of it is to summarize data and see clear patterns.




Typically, a correlation matrix is “square”, with the same variables shown in the rows and columns. The matrix below shows correlations between the Heart Failure attributes. The line of 1.00s going from the top left to the bottom right is the main diagonal, which shows that each variable always perfectly correlates with itself. The matrix is symmetrical: the correlations above the main diagonal are a mirror image of those below it.

This can be done using the following methods:
- **.corr():** calculates correlation coefficients.
- **.imshow():** visualizes the matrix with the .corr() as input.
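To show what **.corr()** computes by default, the sketch below calculates the Pearson coefficient by hand and compares it with the matrix entry. It uses the *MaxHR* and *Oldpeak* values from the first five rows only, so the resulting number is illustrative; `pearson` is a helper written for this example.

```python
import numpy as np
import pandas as pd

# MaxHR and Oldpeak values from the first five rows of the dataset
toy = pd.DataFrame({
    "MaxHR": [172, 156, 98, 108, 122],
    "Oldpeak": [0.0, 1.0, 0.0, 1.5, 0.0],
})

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

manual = pearson(toy["MaxHR"], toy["Oldpeak"])

# pandas computes the same coefficient for every pair of columns
matrix = toy.corr()
print(manual)
print(matrix)
```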

In [178]:
# Correlation Matrix (numeric columns only; numeric_only requires pandas >= 1.5)
px.imshow(df.corr(numeric_only=True), title="Correlation Matrix for Heart Failure Prediction")





With the matrix in mind, we can conclude various correlations regarding ***Heart Disease***:
- It has a positive correlation with "*Oldpeak*" [**0.4039507**], "*FastingBS*" [**0.2672912**], and "*RestingBP*" [**0.107589**]. 
- It has a negative correlation with "*MaxHR*" [**-0.4004208**], and "*Cholesterol*" [**-0.2327406**].

### <p id="Histogram">Histogram Graphs</p>

A histogram is a graphical display of data using bars of different heights. 

Each bar groups numbers into ranges, with taller bars showing that more data falls in that range and vice versa. In the histogram below, we can clearly see the ratio between participants by their Sex. The higher bar represents the Males, who account for 725 participants compared to 193 Females.


In [179]:
# Counts the male and female participants
numMales = int((df["Sex"] == "M").sum())
numFemales = int((df["Sex"] == "F").sum())

# Prints the counted values of Males and Females, as well as the Total number of participants.
print("Total number of participants:", numMales+numFemales)
print("Number of male participants:", numMales)
print("Number of female participants:", numFemales)

# Plots a Histogram to illustrate
px.histogram(df, x="Sex", title="Sex Distribution of Participants")

Total number of participants: 918
Number of male participants: 725
Number of female participants: 193


Besides being able to compare the ratio between two values/classes (male to female), we can also use histograms to showcase many categories at once. This is particularly useful if we want to differentiate the number of people with/without heart disease based on Sex.

In [180]:
# Histogram showing the distribution of HD by Sex
hd_Figure = px.histogram(df, x="HeartDisease",
                 color="Sex",
                 hover_data=df.columns,
                 title="Distribution of Heart Diseases by Sex",
                 barmode="group")

# Edits the X and Y Axis labels
hd_Figure.update_layout(
    xaxis_title="Heart Disease", yaxis_title="Number of Participants"
)

# Calculates the percentage of each sex with heart disease
numMales_hd = round((df[(df["Sex"] == "M") & (df["HeartDisease"] == 1)].value_counts().sum() / numMales) * 100, 2)
numFemales_hd = round((df[(df["Sex"] == "F") & (df["HeartDisease"] == 1)].value_counts().sum() / numFemales) * 100, 2)
print("Percentage of Males with Heart Disease: ", numMales_hd, "%")
print("Percentage of Females with Heart Disease: ", numFemales_hd, "%")


# Shows the plot
hd_Figure.show()

Percentage of Males with Heart Disease:  63.17 %
Percentage of Females with Heart Disease:  25.91 %
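The same per-group percentages can be obtained in one step with a group-by, since the mean of a 0/1 column is the fraction of ones. The sketch below uses a hypothetical `toy` frame, so its percentages are not those of the dataset.

```python
import pandas as pd

# Hypothetical toy frame: four males (three with heart disease), two females (one with)
toy = pd.DataFrame({
    "Sex": ["M", "M", "M", "M", "F", "F"],
    "HeartDisease": [1, 1, 0, 1, 0, 1],
})

# Mean of the 0/1 target per sex, expressed as a percentage
pct = (toy.groupby("Sex")["HeartDisease"].mean() * 100).round(2)
print(pct)
```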


## <p id="Citation">🪪 | Citations & Acknowledgements</p>

The dataset used was created by Fedesoriano, and was taken from Kaggle. 

(https://www.kaggle.com/fedesoriano/heart-failure-prediction)

***Additional information:***

This dataset was created by combining different datasets already available independently but not combined before. In this dataset, 5 heart datasets are combined over 11 common features which makes it the largest heart disease dataset available so far for research purposes. The five datasets used for its curation are:

- Cleveland: 303 observations
- Hungarian: 294 observations
- Switzerland: 123 observations
- Long Beach VA: 200 observations
- Statlog (Heart) Data Set: 270 observations

Total: 1190 observations

Duplicated: 272 observations

Final dataset: 918 observations

Every dataset used can be found under the Index of heart disease datasets from UCI Machine Learning Repository on the following link: https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/
