# An Analysis of Heart Disease and Factors that Contribute to Diagnosis Specifically Examining Cholesterol in Different Demographics to Predict Cases




Heart disease is a broad term for any condition that affects the heart. To predict heart disease correctly, it is important to understand which factors play a significant role in its onset and diagnosis.

Using classification techniques, this analysis examines the correlation between serum cholesterol and heart disease. Differences between cases by age and gender are also examined.

The names and social security numbers of the patients were removed from the database and replaced with dummy values. Only one file, the one containing the Cleveland database, has been "processed". All four unprocessed files also exist in this directory.




In [1]:
# import the pandas library
import pandas as pd

# install the ucimlrepo package, which gives access to the UCI heart disease dataset
%pip install ucimlrepo

The dataset contains a total of 76 attributes (i.e., columns); however, the analysis is conducted on a subset of the original dataset, consisting of 14 attributes: age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpeak, slope, ca, thal, and num (the target variable).

I began by importing the dataset as instructed on the machine learning repository, then printed the metadata and variable information. The metadata displays the column names present, and the variable information explains the abbreviations used for the column names.

In [3]:
from ucimlrepo import fetch_ucirepo

# fetch dataset
heart_disease = fetch_ucirepo(id=45)

# data (as pandas dataframes)
X = heart_disease.data.features
y = heart_disease.data.targets

# metadata
print(heart_disease.metadata)

# variable information
print(heart_disease.variables)

{'uci_id': 45, 'name': 'Heart Disease', 'repository_url': 'https://archive.ics.uci.edu/dataset/45/heart+disease', 'data_url': 'https://archive.ics.uci.edu/static/public/45/data.csv', 'abstract': '4 databases: Cleveland, Hungary, Switzerland, and the VA Long Beach', 'area': 'Health and Medicine', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 303, 'num_features': 13, 'feature_types': ['Categorical', 'Integer', 'Real'], 'demographics': ['Age', 'Sex'], 'target_col': ['num'], 'index_col': None, 'has_missing_values': 'yes', 'missing_values_symbol': 'NaN', 'year_of_dataset_creation': 1989, 'last_updated': 'Fri Nov 03 2023', 'dataset_doi': '10.24432/C52P4X', 'creators': ['Andras Janosi', 'William Steinbrunn', 'Matthias Pfisterer', 'Robert Detrano'], 'intro_paper': {'ID': 231, 'type': 'NATIVE', 'title': 'International application of a new probability algorithm for the diagnosis of coronary artery disease.', 'authors': 'R. Detrano, A. Jánosi, W. Steinbrunn, M

Displaying the raw data gives two dataframes, X and y respectively.

In [13]:
X # displays the variable dataframe

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,1,145,233,1,2,150,0,2.3,3,0.0,6.0
1,67,1,4,160,286,0,2,108,1,1.5,2,3.0,3.0
2,67,1,4,120,229,0,2,129,1,2.6,2,2.0,7.0
3,37,1,3,130,250,0,0,187,0,3.5,3,0.0,3.0
4,41,0,2,130,204,0,2,172,0,1.4,1,0.0,3.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,45,1,1,110,264,0,0,132,0,1.2,2,0.0,7.0
299,68,1,4,144,193,1,0,141,0,3.4,2,2.0,7.0
300,57,1,4,130,131,0,0,115,1,1.2,2,1.0,7.0
301,57,0,2,130,236,0,2,174,0,0.0,2,1.0,3.0


In [4]:
y # displays the target dataframe from the Cleveland database

Unnamed: 0,num
0,0
1,2
2,1
3,0
4,0
...,...
298,1
299,2
300,3
301,1


Some simple cleaning was needed to improve readability. The first step was to add the num column, the predicted attribute (angiographic disease status), to the dataframe. Nominally, no disease = 0 and disease = 1. In the Cleveland database, however, num takes integer values from 0 to 4, and experiments with this database have distinguished the presence of heart disease (values 1, 2, 3, 4) from its absence (value 0).

In [5]:
# concatenate the y column onto the dataframe
X_with_target = pd.concat([X, y], axis=1)

# display the dataframe
X_with_target

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
0,63,1,1,145,233,1,2,150,0,2.3,3,0.0,6.0,0
1,67,1,4,160,286,0,2,108,1,1.5,2,3.0,3.0,2
2,67,1,4,120,229,0,2,129,1,2.6,2,2.0,7.0,1
3,37,1,3,130,250,0,0,187,0,3.5,3,0.0,3.0,0
4,41,0,2,130,204,0,2,172,0,1.4,1,0.0,3.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,45,1,1,110,264,0,0,132,0,1.2,2,0.0,7.0,1
299,68,1,4,144,193,1,0,141,0,3.4,2,2.0,7.0,2
300,57,1,4,130,131,0,0,115,1,1.2,2,1.0,7.0,3
301,57,0,2,130,236,0,2,174,0,0.0,2,1.0,3.0,1
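The presence/absence distinction described above can be sketched as a derived binary column. This is a minimal illustration, not part of the original notebook: the column name `disease` is hypothetical, and the toy dataframe reuses a few `chol` and `num` values from the table above rather than the full dataset.

```python
import pandas as pd

# Toy stand-in for X_with_target, using a handful of rows shown above
sample = pd.DataFrame({
    "chol": [233, 286, 229, 250, 131],
    "num":  [0, 2, 1, 0, 3],
})

# Any value from 1 to 4 counts as disease present (1); 0 counts as absent (0).
# "disease" is a hypothetical column name chosen for this sketch.
sample["disease"] = (sample["num"] > 0).astype(int)

print(sample)
```

Keeping the original `num` column alongside the binary one leaves the graded values available if they are needed later.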


The next step was to identify any missing values in the dataset and remove them. Luckily, the repository explains that there are null values in the 'ca' and 'thal' columns, so the task is simply to remove the affected rows.

Because two columns are selected, their names are passed as a list inside the indexing brackets (hence the double brackets). The .isnull() function, combined with .any(axis=1), then identifies the rows with a null value in either column.

In [6]:
# select the rows with null values present
missing_rows = X_with_target[X_with_target[['ca', 'thal']].isnull().any(axis=1)]

# display the dataframe
missing_rows

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
87,53,0,3,128,216,0,2,115,0,0.0,1,0.0,,0
166,52,1,3,138,223,0,0,169,0,0.0,1,,3.0,0
192,43,1,4,132,247,1,2,143,1,0.1,2,,7.0,1
266,52,1,4,128,204,1,0,156,1,1.0,2,0.0,,2
287,58,1,2,125,220,0,0,144,0,0.4,2,,7.0,0
302,38,1,3,138,175,0,0,173,0,0.0,1,,3.0,0
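As a cross-check on the repository's note about which columns contain nulls, the nulls can also be counted per column. The sketch below uses a small toy dataframe (values are illustrative, and `toy` and `null_counts` are hypothetical names); on the real data the same call would be applied to `X_with_target`.

```python
import numpy as np
import pandas as pd

# Toy dataframe with nulls in 'ca' and 'thal' only
toy = pd.DataFrame({
    "chol": [216, 223, 229],
    "ca":   [0.0, np.nan, 2.0],
    "thal": [np.nan, 3.0, 7.0],
})

# Count the null values in every column
null_counts = toy.isnull().sum()

# Show only the columns that actually contain nulls
print(null_counts[null_counts > 0])
```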


Using the drop function, I removed the rows containing null values from the dataframe. I stored the result under a new name, which completed the pre-processing.

In [15]:
# remove the null values from the dataframe
clean_X = X_with_target.drop(missing_rows.index)

# display the dataframe
clean_X

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
0,63,1,1,145,233,1,2,150,0,2.3,3,0.0,6.0,0
1,67,1,4,160,286,0,2,108,1,1.5,2,3.0,3.0,2
2,67,1,4,120,229,0,2,129,1,2.6,2,2.0,7.0,1
3,37,1,3,130,250,0,0,187,0,3.5,3,0.0,3.0,0
4,41,0,2,130,204,0,2,172,0,1.4,1,0.0,3.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
297,57,0,4,140,241,0,0,123,1,0.2,2,0.0,7.0,1
298,45,1,1,110,264,0,0,132,0,1.2,2,0.0,7.0,1
299,68,1,4,144,193,1,0,141,0,3.4,2,2.0,7.0,2
300,57,1,4,130,131,0,0,115,1,1.2,2,1.0,7.0,3
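The same rows could likely be removed in a single step with pandas' `dropna` and its `subset` argument. The sketch below compares the two approaches on a toy dataframe (illustrative values, hypothetical variable names); it is an alternative, not the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy dataframe with a non-default index, similar in shape to the real data
toy = pd.DataFrame(
    {"chol": [233, 216, 223], "ca": [0.0, 0.0, np.nan], "thal": [6.0, np.nan, 3.0]},
    index=[0, 87, 166],
)

# Approach used above: find the rows with nulls, then drop them by index
missing = toy[toy[["ca", "thal"]].isnull().any(axis=1)]
dropped_by_index = toy.drop(missing.index)

# Alternative: drop rows with a null in either column directly
dropped_by_dropna = toy.dropna(subset=["ca", "thal"])

# Both approaches keep the same rows
print(dropped_by_index.equals(dropped_by_dropna))
```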


The dataset has now been cleaned and a basic understanding of the data has been established. From here, it would be ideal to research cholesterol, its effects, and what constitutes a low, medium or high amount of serum cholesterol. With that in place, techniques such as decision trees and random forests could be used to predict how unseen data would be classified and whether cholesterol plays a significant role.
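Once thresholds have been chosen, grouping cholesterol into bands could be sketched with `pd.cut`. The cut-offs below (under 200, 200 to 239, and 240 or above, in mg/dl) are an assumption based on commonly quoted adult guideline ranges for total cholesterol; they do not come from this dataset and should be verified during the research step. The names `chol_band`, `bins` and `labels` are hypothetical, and the values are a few `chol` readings from the tables above.

```python
import pandas as pd

# A few serum cholesterol readings (mg/dl) taken from the tables above
chol = pd.Series([131, 193, 204, 233, 250, 286], name="chol")

# ASSUMED thresholds: verify against a clinical source before relying on them
bins = [0, 200, 240, float("inf")]
labels = ["low", "medium", "high"]

# right=False makes each band include its lower edge: [0, 200), [200, 240), [240, inf)
chol_band = pd.cut(chol, bins=bins, labels=labels, right=False)

print(pd.concat([chol, chol_band.rename("chol_band")], axis=1))
```

A banded column like this could then be used as a categorical feature, or to compare the share of disease cases across bands before fitting any model.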