In [1]:
import pandas as pd
import numpy as np

cleveland_data = pd.read_csv("processed.cleveland.data", header=None)
hungarian_data = pd.read_csv("processed.hungarian.data", header=None)
switzerland_data = pd.read_csv("processed.switzerland.data", header=None)
va_data = pd.read_csv("processed.va.data", header=None)

In [None]:
cleveland_data.head()

In [None]:
hungarian_data.head()

In [None]:
switzerland_data.head()

In [None]:
va_data.head()

From the Heart Disease dataset, we have data from four sites: Cleveland, Hungary, Switzerland, and the VA Medical Center in Long Beach, California. We want to combine these datasets, so we first give them a common set of column names.

According to the attribute information, the following 14 attributes were used.

1.   age
2.   sex   
3.   cp   
4.   trestbps
5.   chol   
6.   fbs     
7.   restecg 
8.   thalach
9.   exang   
10.  oldpeak  
11.  slope   
12.  ca       
13.  thal     
14.  num (the predicted attribute)


Here is the description and type of each of the attributes:
      
1.   age: age in years (**Numerical**)

2.   sex: sex (1 = male; 0 = female) (**Categorical**)

3.   cp: chest pain type (**Categorical**)
       - Value 1: typical angina
       - Value 2: atypical angina
       - Value 3: non-anginal pain
       - Value 4: asymptomatic
    
4.   trestbps: resting blood pressure (in mm Hg on admission to the hospital) (**Numerical**)

5.   chol: serum cholesterol in mg/dl (**Numerical**)

6.   fbs: (fasting blood sugar > 120 mg/dl)  (1 = true; 0 = false) (**Categorical**)

7.   restecg: resting electrocardiographic results (**Categorical**)
       - Value 0: normal
       - Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
       - Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
                    
8.   thalach: maximum heart rate achieved (**Numerical**)

9.   exang: exercise induced angina (1 = yes; 0 = no) (**Categorical**)

10.  oldpeak: ST depression induced by exercise relative to rest (**Numerical**)

11.  slope: the slope of the peak exercise ST segment (**Categorical**)
       - Value 1: upsloping
       - Value 2: flat
       - Value 3: downsloping
        
12.  ca: number of major vessels (0-3) colored by fluoroscopy (**Numerical**)

13.  thal: 3 = normal; 6 = fixed defect; 7 = reversible defect (**Categorical**)

14.  num: diagnosis of heart disease (angiographic disease status) (**Categorical**)
       - Value 0: < 50% diameter narrowing
       - Value 1: > 50% diameter narrowing
        

In [None]:
# Creating a copy of the datasets
cleveland_clean = cleveland_data.copy()
hungarian_clean = hungarian_data.copy()
switzerland_clean = switzerland_data.copy()
va_clean = va_data.copy()

In [None]:
# Adding column titles to all the datasets.
# The names follow the attribute list above (note 'thalach', not 'thalch').
column_names = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num']
for df in (cleveland_clean, hungarian_clean, switzerland_clean, va_clean):
    df.columns = column_names

In [None]:
# Convert all Variable Names to UPPERCASE
cleveland_clean.columns = cleveland_clean.columns.str.upper()
hungarian_clean.columns = hungarian_clean.columns.str.upper()
switzerland_clean.columns = switzerland_clean.columns.str.upper()
va_clean.columns = va_clean.columns.str.upper()

We will add a new column with the location. The following are the short forms used:
   * CL - Cleveland 
   * HU - Hungary
   * SW - Switzerland
   * VA - VA Medical Center, Long Beach

In [None]:
cleveland_clean['LOC'] = 'CL'
hungarian_clean['LOC'] = 'HU'
switzerland_clean['LOC'] = 'SW'
va_clean['LOC'] = 'VA'
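
With the location tag in place, the four frames could be stacked into one table. The following is a minimal sketch of that step, assuming `pd.concat` is the intended way to combine them; the toy frames and their values are made up for illustration and only stand in for the real `*_clean` frames.

```python
import pandas as pd

# Toy stand-ins for the four cleaned frames (values are made up for illustration).
columns = ["AGE", "SEX", "NUM"]
toy_frames = {
    "CL": pd.DataFrame([[63, 1, 0], [67, 1, 2]], columns=columns),
    "HU": pd.DataFrame([[28, 1, 0]], columns=columns),
    "SW": pd.DataFrame([[32, 1, 1]], columns=columns),
    "VA": pd.DataFrame([[63, 1, 2]], columns=columns),
}

# Tag each frame with its location code, then stack them row-wise.
tagged = [frame.assign(LOC=code) for code, frame in toy_frames.items()]
combined = pd.concat(tagged, ignore_index=True)

# Rows per location in the combined frame.
print(combined["LOC"].value_counts())
```

Using `ignore_index=True` gives the combined frame a fresh 0..n-1 index, so the LOC column is the only record of where each row came from.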

In [None]:
# Print the Variable Information to check
cleveland_clean.info()
hungarian_clean.info()
switzerland_clean.info()
va_clean.info()

We notice that some entries in the datasets are '?'. These mark missing values, so we replace them with NaN.

In [None]:
# Changing ? in dataframe to NaN
cleveland_clean = cleveland_clean.replace('?', np.nan)
hungarian_clean = hungarian_clean.replace('?', np.nan)
switzerland_clean = switzerland_clean.replace('?', np.nan)
va_clean = va_clean.replace('?', np.nan)
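
One thing to keep in mind is that a column that contained '?' was read as text, and replacing '?' with NaN does not by itself make it numeric. As a hedged alternative, the sketch below parses '?' as missing at read time with `na_values`, or converts an already loaded column with `pd.to_numeric`. The two rows of input and the frame names `toy` and `as_text` are made up for illustration.

```python
import io
import pandas as pd

# Two made-up rows in a comma-separated layout, with '?' marking a missing value.
raw = io.StringIO("63.0,1.0,?,6.0,0\n67.0,1.0,3.0,?,2\n")

# na_values="?" tells read_csv to parse '?' as NaN, so the columns stay numeric.
toy = pd.read_csv(raw, header=None, na_values="?")
print(toy.dtypes)
print(toy.isna().sum())

# If the data was already loaded with '?' strings, the column holds text;
# pd.to_numeric with errors="coerce" turns unparseable entries into NaN.
as_text = pd.DataFrame({"CA": ["0.0", "?", "3.0"]})
as_text["CA"] = pd.to_numeric(as_text["CA"], errors="coerce")
print(as_text.dtypes)
```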

In [None]:
cleveland_clean.isna().sum()

In [None]:
hungarian_clean.isna().sum()

In [None]:
switzerland_clean.isna().sum()

In [None]:
va_clean.isna().sum()

We notice that the Hungary, Switzerland and VA datasets have many missing values. Dropping the affected columns or rows would discard a large share of the data and limit our analysis. Therefore, we shall stick with the Cleveland dataset, since it has only 6 rows with missing information.
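
To make this comparison concrete, one could tabulate the share of missing values per column and the number of rows that would survive `dropna()` for each dataset. This is only a sketch: the two toy frames and their values are made up, and in the notebook the dictionary would hold the four `*_clean` frames instead.

```python
import numpy as np
import pandas as pd

# Toy frames with made-up values; in the notebook these would be the *_clean frames.
toy_frames = {
    "CL": pd.DataFrame({"CA": [0.0, 1.0, np.nan, 2.0], "THAL": [3.0, 7.0, 6.0, 3.0]}),
    "HU": pd.DataFrame({"CA": [np.nan, np.nan, np.nan, 0.0], "THAL": [np.nan, 6.0, np.nan, np.nan]}),
}

# Fraction of missing values per column, one row per dataset.
missing_share = pd.DataFrame(
    {code: frame.isna().mean() for code, frame in toy_frames.items()}
).T
print(missing_share)

# Number of rows that would survive a blanket dropna(), per dataset.
rows_kept = {code: len(frame.dropna()) for code, frame in toy_frames.items()}
print(rows_kept)
```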

In [None]:
# Dropping 6 rows with NaN values.
cleveland_clean = cleveland_clean.dropna()

In [None]:
# Printing the Variable Information to check after deleting NaN rows
cleveland_clean.info()

We noticed that the NUM attribute takes values from 0 to 4, but we only want two categories. We therefore create a binary attribute, DIS: 1 represents presence of heart disease (NUM from 1 to 4) and 0 represents absence of heart disease (NUM equal to 0).

In [None]:
cleveland_clean['DIS'] = np.where(cleveland_clean['NUM']==0, 0, 1)
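
A quick way to confirm the mapping is to cross-tabulate the original and binary labels. The sketch below applies the same `np.where` rule to a toy frame; the NUM values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up NUM values covering the full 0-4 range.
toy = pd.DataFrame({"NUM": [0, 1, 2, 3, 4, 0]})

# Same rule as in the cell above: 0 stays 0, anything else becomes 1.
toy["DIS"] = np.where(toy["NUM"] == 0, 0, 1)

# Each NUM value should land in exactly one DIS column.
check = pd.crosstab(toy["NUM"], toy["DIS"])
print(check)
```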

In [None]:
cleveland_clean.to_csv('heartdata.csv')
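
Note that `to_csv` writes the row index as an unnamed first column by default, and after `dropna()` that index has gaps. The sketch below shows, on a made-up toy frame and an in-memory buffer instead of a file, how the index can be restored with `index_col=0` when reading the data back.

```python
import io
import pandas as pd

# Toy frame whose index has a gap, as happens after dropna() removes rows.
toy = pd.DataFrame({"AGE": [63.0, 67.0], "DIS": [0, 1]}, index=[0, 2])

buffer = io.StringIO()
toy.to_csv(buffer)  # default index=True writes the index as an unnamed first column
buffer.seek(0)

# index_col=0 restores that column as the index instead of a data column.
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```

Passing `index=False` to `to_csv` would be the alternative if the original row numbers are not needed later.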