The originally published Cleveland dataset, "cleveland.data", contains 76 attributes, but all published studies refer to using a subset of 14 of them, which is available in the UCI Machine Learning Repository as "processed.cleveland.data".

Beyond the 14 features that most researchers consider, many attributes in the dataset have missing values or errors that may degrade model performance. Furthermore, the attributes outside these 14 are either not directly related to heart disease or may not correlate strongly with the target variable, because they are merely more detailed specifications of some of the 14 variables. Analyzing too many attributes can result in overly complex models that are prone to overfitting. Restricting the analysis to this selection of features can simplify the models and improve their ability to generalize. That is why my research also focuses on this subset of the Cleveland dataset.

Note that in the following, "cleveland.data" refers to the processed version of the Cleveland data (which has 14 columns).

In [30]:
# Converting the "cleveland.data" to CSV file format:
def data_to_csv(data_file_path, csv_file_path):
    with open(data_file_path, 'r') as data_file:
        data = data_file.readlines()

    with open(csv_file_path, 'w') as csv_file:
        for line in data:
            csv_file.write(','.join(line.strip().split()))
            csv_file.write('\n')

data_to_csv('cleveland.data', 'clevelanddata.csv')

In [31]:
# Converting "Statlog.dat" to CSV file format, reusing data_to_csv() defined in the cell above:
data_to_csv('Statlog.dat', 'statlogdata.csv')
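
As an alternative to the intermediate CSV files, a whitespace-delimited file can be read and given its header in one step. The following is a minimal sketch, not the method used in this notebook: the helper name `read_whitespace_data` and the constant `COLUMNS` are hypothetical, and the two sample rows are copied from the Statlog output printed further below.

```python
import io
import pandas as pd

# Column names taken from the header used in this notebook
COLUMNS = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
           'exang', 'oldpeak', 'slope', 'ca', 'thal',
           'absence or presence of heart disease']

def read_whitespace_data(source, columns=COLUMNS):
    """Read a whitespace-delimited file (path or file-like object) into a DataFrame."""
    # sep=r'\s+' splits on any run of whitespace; header=None because the raw
    # file has no header row, so the names are supplied explicitly.
    return pd.read_csv(source, sep=r'\s+', header=None, names=columns)

# Two rows in the raw Statlog layout, used here instead of the real file
sample = io.StringIO(
    "70.0 1.0 4.0 130.0 322.0 0.0 2.0 109.0 0.0 2.4 2.0 3.0 3.0 2\n"
    "67.0 0.0 3.0 115.0 564.0 0.0 2.0 160.0 0.0 1.6 2.0 0.0 7.0 1\n"
)
statlog_sample = read_whitespace_data(sample)
print(statlog_sample.shape)
```

With the real file, the call would presumably be `read_whitespace_data('Statlog.dat')`, assuming the file is whitespace-delimited throughout.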

In [32]:
# This code creates the header for the Cleveland dataset with the descriptions from the UCI website

import csv

# Defining the header according to the UCI Website
header = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
          'exang', 'oldpeak', 'slope', 'ca', 'thal', 'absence or presence of heart disease']

# Open the CSV for reading
with open('clevelanddata.csv', 'r') as csv_file:
    reader = csv.reader(csv_file)

    # Open new CSV file for writing
    with open('clevelanddraft.csv', 'w', newline='') as new_csv_file:
        writer = csv.writer(new_csv_file)

        # Create the header
        writer.writerow(header)

        # Copy the rows
        for row in reader:
            writer.writerow(row)
            
        
# CHECKING THE RESULTS
# Open the final CSV file
with open('clevelanddraft.csv', 'r') as csv_file:
    reader = csv.reader(csv_file)

    # Print the first 5 rows incl. header
    for i, row in enumerate(reader):
        if i < 5:
            print(row)

['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'absence or presence of heart disease']
['63.0', '1.0', '1.0', '145.0', '233.0', '1.0', '2.0', '150.0', '0.0', '2.3', '3.0', '0.0', '6.0', '0']
['67.0', '1.0', '4.0', '160.0', '286.0', '0.0', '2.0', '108.0', '1.0', '1.5', '2.0', '3.0', '3.0', '2']
['67.0', '1.0', '4.0', '120.0', '229.0', '0.0', '2.0', '129.0', '1.0', '2.6', '2.0', '2.0', '7.0', '1']
['37.0', '1.0', '3.0', '130.0', '250.0', '0.0', '0.0', '187.0', '0.0', '3.5', '3.0', '0.0', '3.0', '0']


In [33]:
# This code creates the header for the Statlog dataset with the descriptions from the UCI website

# Defining the header according to the UCI Website
header = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
          'exang', 'oldpeak', 'slope', 'ca', 'thal', 'absence or presence of heart disease']

# Open the CSV for reading
with open('statlogdata.csv', 'r') as csv_file:
    reader = csv.reader(csv_file)

    # Open new CSV file for writing
    with open('statlogdraft.csv', 'w', newline='') as new_csv_file:
        writer = csv.writer(new_csv_file)

        # Create the header to new CSV
        writer.writerow(header)

        # Copy the rows from the original CSV file
        for row in reader:
            writer.writerow(row)
            
            
# CHECKING THE RESULTS
# Open the final CSV file
with open('statlogdraft.csv', 'r') as csv_file:
    reader = csv.reader(csv_file)

    # Print the first 5 rows incl. header
    for i, row in enumerate(reader):
        if i < 5:
            print(row)

['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'absence or presence of heart disease']
['70.0', '1.0', '4.0', '130.0', '322.0', '0.0', '2.0', '109.0', '0.0', '2.4', '2.0', '3.0', '3.0', '2']
['67.0', '0.0', '3.0', '115.0', '564.0', '0.0', '2.0', '160.0', '0.0', '1.6', '2.0', '0.0', '7.0', '1']
['57.0', '1.0', '2.0', '124.0', '261.0', '0.0', '0.0', '141.0', '0.0', '0.3', '1.0', '0.0', '7.0', '2']
['64.0', '1.0', '4.0', '128.0', '263.0', '0.0', '0.0', '105.0', '1.0', '0.2', '2.0', '1.0', '7.0', '1']


In [34]:
# IMPORTANT: this code recodes the last column of the Cleveland dataset.
# Originally, 0 = absence of heart disease and 1, 2, 3, 4 = presence of heart disease.
# For easier handling, and to match the 1 = absence / 2 = presence coding of the Statlog data,
# 0 is transformed to 1 and every value > 0 is transformed to 2.

# iloc selects all rows of the last column; apply() runs a lambda function on every value
# in that column, which maps 0 to 1 and any value greater than 0 to 2.
import pandas as pd

data = pd.read_csv('clevelanddraft.csv')

# Transforming the last column
data.iloc[:, -1] = data.iloc[:, -1].apply(lambda x: 1 if x == 0 else 2 if x > 0 else x)

# Write the transformed data to a new file called 'clevelanddraft2.csv'
data.to_csv('clevelanddraft2.csv', index=False)
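
The same recoding can also be written as a vectorized comparison instead of a per-value lambda. This is only a sketch under assumptions: the helper `binarize_target` and the toy frame are hypothetical, and the target column is assumed to contain no missing values (a NaN would be mapped to 1 here, whereas the lambda above leaves it unchanged).

```python
import pandas as pd

TARGET = 'absence or presence of heart disease'

def binarize_target(df, target=TARGET):
    """Return a copy in which 0 becomes 1 (absence) and any value > 0 becomes 2 (presence)."""
    out = df.copy()
    # (value > 0) is False/True, i.e. 0/1 after astype(int); adding 1 gives 1/2
    out[target] = (out[target] > 0).astype(int) + 1
    return out

# Toy frame using the original Cleveland coding 0..4 (illustrative values only)
toy = pd.DataFrame({'age': [63.0, 67.0, 67.0, 37.0, 50.0],
                    TARGET: [0, 2, 1, 0, 4]})
recoded = binarize_target(toy)
print(recoded[TARGET].tolist())
```

Working on a copy keeps the input frame untouched, which makes the step safe to re-run in a notebook.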

In [35]:
import numpy as np

# Replacing the remaining missing values and incorrect entries in both datasets with NaN

# For the Cleveland data:

df = pd.read_csv('clevelanddraft2.csv')

# Replace ?s and missing values with NaN
df.replace('?', np.nan, inplace=True)
df.replace('', np.nan, inplace=True)

# Save the updated dataframe back to a csv file
df.to_csv('cleveland.csv', index=False)

# For the Statlog data:

df = pd.read_csv('statlogdraft.csv')

# Replace ?s and missing values with NaN
df.replace('?', np.nan, inplace=True)
df.replace('', np.nan, inplace=True)

# Save the updated dataframe back to a csv file
df.to_csv('statlog.csv', index=False)

In both datasets, missing values and "?" entries are now replaced by NaN.

Many ML classifiers raise errors when the input contains NaNs. Replacing the NaNs with the mean of the given column can be problematic, because it assumes that the values are missing completely at random (MCAR) and that the mean of the remaining values in the column is representative of them. This assumption can lead to biased or unreliable results, especially since this is a relatively small dataset that represents real symptoms and health data of individuals.

Therefore, we remove all rows containing NaNs.
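
Before dropping rows, it can be useful to see how many values are missing per column and how many rows the removal will cost. The sketch below uses a made-up toy CSV (not real patient records) and reads "?" as NaN directly through the `na_values` parameter of `pd.read_csv`; the variable names are illustrative.

```python
import io
import pandas as pd

# Toy CSV with one "?" entry and one empty entry (made-up values)
raw = io.StringIO(
    "age,ca,thal\n"
    "63.0,0.0,6.0\n"
    "67.0,?,3.0\n"
    "41.0,1.0,\n"
    "56.0,2.0,7.0\n"
)

# na_values='?' makes pandas treat "?" as NaN; empty fields are NaN by default
df = pd.read_csv(raw, na_values='?')

missing_per_column = df.isna().sum()                    # NaN count per column
rows_with_missing = int(df.isna().any(axis=1).sum())    # rows that dropna() would remove
cleaned = df.dropna()

print(missing_per_column.to_dict())
print(f"{rows_with_missing} of {len(df)} rows contain NaNs; {len(cleaned)} remain")
```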

In [37]:
# Deleting all rows in both datasets containing NaNs
clevelanddata = pd.read_csv('cleveland.csv') 
statlogdata = pd.read_csv('statlog.csv') 

clevelanddata.dropna(inplace=True)
statlogdata.dropna(inplace=True)

clevelanddata.to_csv('cleveland_cleaned.csv', index=False)
statlogdata.to_csv('statlog_cleaned.csv', index=False)

In [26]:
# Checking the results of the cleaned Cleveland dataset
cleveland = pd.read_csv('cleveland_cleaned.csv') 
cleveland.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 297 entries, 0 to 296
Data columns (total 14 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   age                                   297 non-null    float64
 1   sex                                   297 non-null    float64
 2   cp                                    297 non-null    float64
 3   trestbps                              297 non-null    float64
 4   chol                                  297 non-null    float64
 5   fbs                                   297 non-null    float64
 6   restecg                               297 non-null    float64
 7   thalach                               297 non-null    float64
 8   exang                                 297 non-null    float64
 9   oldpeak                               297 non-null    float64
 10  slope                                 297 non-null    float64
 11  ca                 

In [27]:
# Checking the results of the cleaned Statlog dataset
statlog = pd.read_csv('statlog_cleaned.csv')
statlog.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 270 entries, 0 to 269
Data columns (total 14 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   age                                   270 non-null    float64
 1   sex                                   270 non-null    float64
 2   cp                                    270 non-null    float64
 3   trestbps                              270 non-null    float64
 4   chol                                  270 non-null    float64
 5   fbs                                   270 non-null    float64
 6   restecg                               270 non-null    float64
 7   thalach                               270 non-null    float64
 8   exang                                 270 non-null    float64
 9   oldpeak                               270 non-null    float64
 10  slope                                 270 non-null    float64
 11  ca                 

In [40]:
# For later experimentation, we combine the two datasets as a new CSV:

cleveland_df = pd.read_csv('cleveland_cleaned.csv')
statlog_df = pd.read_csv('statlog_cleaned.csv')

# merge the two dataframes
combined_df = pd.concat([cleveland_df, statlog_df], axis=0, ignore_index=True, sort=False)

# shuffle the rows randomly, so the entries will be mixed
combined_df = combined_df.sample(frac=1, random_state=42).reset_index(drop=True)

# save the dataframe to a CSV
combined_df.to_csv('combined.csv', index=False)
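
When two datasets with identical attributes are concatenated, the same record may appear in both, which can leak between training and test splits later on. Whether that is the case here is an assumption that would need to be checked on the real files; the sketch below only shows how such a check could look, using made-up toy frames and illustrative names.

```python
import pandas as pd

# Toy frames standing in for the two cleaned datasets (made-up values)
a = pd.DataFrame({'age': [63.0, 67.0, 37.0],
                  'chol': [233.0, 286.0, 250.0],
                  'target': [1, 2, 1]})
b = pd.DataFrame({'age': [67.0, 70.0],
                  'chol': [286.0, 322.0],
                  'target': [2, 2]})

combined = pd.concat([a, b], axis=0, ignore_index=True)

# keep=False marks every member of a group of identical rows
dup_mask = combined.duplicated(keep=False)
n_duplicate_rows = int(dup_mask.sum())

# drop_duplicates() keeps the first occurrence of each identical row
deduplicated = combined.drop_duplicates().reset_index(drop=True)

print(n_duplicate_rows, len(combined), len(deduplicated))
```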

In [42]:
# Checking the results of the Combined dataset
combineddataset = pd.read_csv('combined.csv')
combineddataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 567 entries, 0 to 566
Data columns (total 14 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   age                                   567 non-null    float64
 1   sex                                   567 non-null    float64
 2   cp                                    567 non-null    float64
 3   trestbps                              567 non-null    float64
 4   chol                                  567 non-null    float64
 5   fbs                                   567 non-null    float64
 6   restecg                               567 non-null    float64
 7   thalach                               567 non-null    float64
 8   exang                                 567 non-null    float64
 9   oldpeak                               567 non-null    float64
 10  slope                                 567 non-null    float64
 11  ca                 

Description of the attributes used in both datasets:


age: patient age in years

sex: 1.0=male, 0.0=female

cp: chest pain type (Value 1.0: typical angina, Value 2.0: atypical angina, Value 3.0: non-anginal pain, Value 4.0: asymptomatic)

trestbps: resting blood pressure in mm Hg

chol: serum cholesterol in mg/dl

fbs: fasting blood sugar > 120 mg/dl (1.0 = true, 0.0 = false)

restecg: resting electrocardiographic results (Value 0.0: normal, Value 1.0: having ST-T wave abnormality, Value 2.0: showing probable or definite left ventricular hypertrophy by Estes' criteria)

thalach: maximum heart rate achieved

exang: exercise induced angina (1.0 = yes, 0.0 = no)

oldpeak: ST depression induced by exercise relative to rest

slope: the slope of the peak exercise ST segment (Value 1.0: upsloping, Value 2.0: flat, Value 3.0: downsloping)

ca: number of major vessels (0.0 - 3.0) colored by fluoroscopy

thal: result of the thallium stress test, often described as thalassemia (3.0 = normal, 6.0 = fixed defect, 7.0 = reversible defect)

absence or presence of heart disease: 1=absence, 2=presence
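
The coded attributes listed above can be turned into a simple sanity check that flags values outside the documented codes. This is a minimal sketch: the names `ALLOWED_CODES` and `find_invalid_codes` are hypothetical, the code sets are copied from the descriptions above, and the toy frame contains made-up values.

```python
import pandas as pd

# Allowed codes per categorical attribute, copied from the descriptions above
ALLOWED_CODES = {
    'sex': {0.0, 1.0},
    'cp': {1.0, 2.0, 3.0, 4.0},
    'fbs': {0.0, 1.0},
    'restecg': {0.0, 1.0, 2.0},
    'exang': {0.0, 1.0},
    'slope': {1.0, 2.0, 3.0},
    'ca': {0.0, 1.0, 2.0, 3.0},
    'thal': {3.0, 6.0, 7.0},
    'absence or presence of heart disease': {1, 2},
}

def find_invalid_codes(df, allowed=ALLOWED_CODES):
    """Return {column: sorted list of unexpected values} for the columns present in df."""
    problems = {}
    for column, codes in allowed.items():
        if column in df.columns:
            # tolist() converts NumPy scalars to plain Python numbers
            unexpected = set(df[column].dropna().tolist()) - codes
            if unexpected:
                problems[column] = sorted(unexpected)
    return problems

# Toy frame with one deliberately invalid chest pain code (5.0)
toy = pd.DataFrame({'cp': [1.0, 4.0, 5.0], 'thal': [3.0, 7.0, 7.0]})
print(find_invalid_codes(toy))
```

An empty dictionary means every coded column only contains documented values.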