<a href="https://colab.research.google.com/github/AakarshPatel/pytorch-heart-disease/blob/main/02_Data__Cleaning_and_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Heart Disease Prediction Data Preprocessing

This notebook focuses on the crucial initial steps of a machine learning project: data loading, cleaning, preprocessing, and preparation for model training. Specifically, it uses a heart disease dataset to demonstrate various data manipulation techniques, including handling missing values, converting data types, scaling numerical features, and splitting the data into training and testing sets.

In [1]:
from google.colab import drive
drive.mount('/content/drive') # Mount Google Drive to access files.

Mounted at /content/drive


In [2]:
import pandas as pd # Import the pandas library for data manipulation.

In [3]:
data = pd.read_csv('/content/drive/MyDrive/heart_disease/processed.cleveland.data',header=None) # Load the file; header=None because it has no header row, so columns are numbered 0-13.

In [4]:
data.rename(columns={0:'age',1:'sex',2:'cp',3:'trestbps',4:'chol',5:'fbs',
                     6:'restecg',7:'thalach',8: 'exang',9:'oldpeak',10:'slope',
                     11:'ca',12:'thal', 13:'num'},inplace=True) # Rename numerical column headers to more descriptive feature names for the heart disease dataset.

In [5]:
data = data[data['ca'] != '?'] # Remove rows where 'ca' column has '?' (missing values).
data = data[data['thal'] != '?'] # Remove rows where 'thal' column has '?' (missing values).

In [6]:
print("Unique values in 'ca' column:", data['ca'].unique()) # Display unique values in the 'ca' column.
print("Unique values in 'thal' column:", data['thal'].unique()) # Display unique values in the 'thal' column.

Unique values in 'ca' column: ['0.0' '3.0' '2.0' '1.0']
Unique values in 'thal' column: ['6.0' '3.0' '7.0']
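
The renaming and `?` filtering steps above could also be folded into the load itself. The following is a minimal sketch, assuming `?` is the only missing-value marker in the file. The two inline rows are an illustrative stand-in for the real file (the second row is invented to show a `?`), and `COLUMNS`, `sample`, `df`, and `df_clean` are hypothetical names.

```python
import io
import pandas as pd

COLUMNS = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
           'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num']

# Illustrative stand-in for the real file: the second row has '?' in 'ca'.
sample = io.StringIO(
    "63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0\n"
    "50.0,0.0,3.0,120.0,200.0,0.0,0.0,160.0,0.0,0.0,1.0,?,3.0,0\n"
)

# names= assigns the headers at load time; na_values='?' turns the marker into NaN,
# so 'ca' and 'thal' are parsed as float64 instead of object.
df = pd.read_csv(sample, header=None, names=COLUMNS, na_values='?')

# Drop rows where either column is missing.
df_clean = df.dropna(subset=['ca', 'thal'])

print(df['ca'].dtype)
print(len(df), len(df_clean))
```

With this variant the later `astype(float)` conversion would not be needed, since the two columns never become strings.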


In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 297 entries, 0 to 301
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       297 non-null    float64
 1   sex       297 non-null    float64
 2   cp        297 non-null    float64
 3   trestbps  297 non-null    float64
 4   chol      297 non-null    float64
 5   fbs       297 non-null    float64
 6   restecg   297 non-null    float64
 7   thalach   297 non-null    float64
 8   exang     297 non-null    float64
 9   oldpeak   297 non-null    float64
 10  slope     297 non-null    float64
 11  ca        297 non-null    object 
 12  thal      297 non-null    object 
 13  num       297 non-null    int64  
dtypes: float64(11), int64(1), object(2)
memory usage: 34.8+ KB


In [8]:
data['ca'] = data['ca'].astype(float) # Convert 'ca' column to float type.
data['thal'] = data['thal'].astype(float) # Convert 'thal' column to float type.

In [9]:
data.info() # Display data types after conversion to confirm changes.

<class 'pandas.core.frame.DataFrame'>
Index: 297 entries, 0 to 301
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       297 non-null    float64
 1   sex       297 non-null    float64
 2   cp        297 non-null    float64
 3   trestbps  297 non-null    float64
 4   chol      297 non-null    float64
 5   fbs       297 non-null    float64
 6   restecg   297 non-null    float64
 7   thalach   297 non-null    float64
 8   exang     297 non-null    float64
 9   oldpeak   297 non-null    float64
 10  slope     297 non-null    float64
 11  ca        297 non-null    float64
 12  thal      297 non-null    float64
 13  num       297 non-null    int64  
dtypes: float64(13), int64(1)
memory usage: 34.8 KB


In [10]:
data.shape

(297, 14)

In [11]:
# Binarize the 'num' column with a vectorized comparison instead of a row-by-row loop.
# The 'num' column indicates the presence of heart disease (0 for no disease, >0 for disease).
# Any non-zero value (heart disease present) becomes 1; 0 stays 0.
data['num'] = (data['num'] != 0).astype(int)

# Print the unique values in the 'num' column to confirm binarization.
print(data['num'].unique())

[0 1]


### Explanation of 'num' column binarization

The 'num' column, which represents the presence of heart disease, originally contained integer values from 0 to 4:

*   **0**: No heart disease
*   **1, 2, 3, 4**: Different stages of heart disease

This project predicts whether heart disease is present, not its stage, so the target is collapsed into two classes. This turns the problem into a clear 'yes' or 'no' outcome.

The 'num' column is therefore binarized with the following mapping:

*   Any value of **0** remains **0** (no heart disease).
*   Any value **greater than 0** (i.e., 1, 2, 3, or 4) is converted to **1** (presence of heart disease).

This transformation makes 'num' a binary target variable, suitable for binary classification models.
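
After a mapping like this it can be useful to look at how the two classes are balanced. The sketch below applies the mapping to a toy series and prints the class proportions; the label values are illustrative, not the real column, and `num_binary` and `balance` are hypothetical names.

```python
import pandas as pd

# Toy labels covering every original value (illustrative, not the real column).
num = pd.Series([0, 2, 1, 0, 4, 3, 0], name='num')

# The mapping described above: 0 stays 0, anything else becomes 1.
num_binary = (num != 0).astype(int)

# Share of each class after binarization.
balance = num_binary.value_counts(normalize=True).sort_index()

print(num_binary.tolist())
print(balance)
```

On the real data, the same `value_counts(normalize=True)` call on `data['num']` would show whether one class dominates.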

In [12]:
# Define the name of the target variable column.
target_column_name = 'num'

# Create the feature matrix (x) by dropping the target column from the original data.
x = data.drop(columns=[target_column_name])

# Create the target vector (y) by selecting only the target column from the original data.
y = data[target_column_name]

In [13]:
print(f"Shape of features (x): {x.shape}")
print(f"Shape of target (y): {y.shape}")

print("\nData types for features (x):")
x.info()

print("\nData types for target (y):")
print(y.dtype)

print("\nFirst 5 rows of features (x):")
display(x.head())

print("\nFirst 5 rows of target (y):")
display(y.head())

print("\nChecking for missing values in x:")
print(x.isnull().sum().sum())

print("\nChecking for missing values in y:")
print(y.isnull().sum())

Shape of features (x): (297, 13)
Shape of target (y): (297,)

Data types for features (x):
<class 'pandas.core.frame.DataFrame'>
Index: 297 entries, 0 to 301
Data columns (total 13 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       297 non-null    float64
 1   sex       297 non-null    float64
 2   cp        297 non-null    float64
 3   trestbps  297 non-null    float64
 4   chol      297 non-null    float64
 5   fbs       297 non-null    float64
 6   restecg   297 non-null    float64
 7   thalach   297 non-null    float64
 8   exang     297 non-null    float64
 9   oldpeak   297 non-null    float64
 10  slope     297 non-null    float64
 11  ca        297 non-null    float64
 12  thal      297 non-null    float64
dtypes: float64(13)
memory usage: 40.6 KB

Data types for target (y):
int64

First 5 rows of features (x):


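|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63.0 | 1.0 | 1.0 | 145.0 | 233.0 | 1.0 | 2.0 | 150.0 | 0.0 | 2.3 | 3.0 | 0.0 | 6.0 |
| 1 | 67.0 | 1.0 | 4.0 | 160.0 | 286.0 | 0.0 | 2.0 | 108.0 | 1.0 | 1.5 | 2.0 | 3.0 | 3.0 |
| 2 | 67.0 | 1.0 | 4.0 | 120.0 | 229.0 | 0.0 | 2.0 | 129.0 | 1.0 | 2.6 | 2.0 | 2.0 | 7.0 |
| 3 | 37.0 | 1.0 | 3.0 | 130.0 | 250.0 | 0.0 | 0.0 | 187.0 | 0.0 | 3.5 | 3.0 | 0.0 | 3.0 |
| 4 | 41.0 | 0.0 | 2.0 | 130.0 | 204.0 | 0.0 | 2.0 | 172.0 | 0.0 | 1.4 | 1.0 | 0.0 | 3.0 |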



First 5 rows of target (y):


|   | num |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 1 |
| 3 | 0 |
| 4 | 0 |



Checking for missing values in x:
0

Checking for missing values in y:
0


### Verification of Data Separation (x and y)

From the output above, we can confirm the following:

*   **Shapes**: `x` has shape `(297, 13)`, meaning 297 rows and 13 feature columns, and `y` has shape `(297,)`, meaning 297 target values.
    *   This is correct because the original `data` had 14 columns, and by dropping the 'num' column, `x` should have 13 columns.
    *   Both `x` and `y` have the same number of rows, indicating no loss of observations during separation.

*   **Data Types**: All columns in `x` are `float64`, and `y` is `int64`. These are appropriate numerical data types for machine learning models.

*   **Content**: The `head()` displays for both `x` and `y` show that the 'num' column is indeed removed from `x` and forms `y`, as intended.

*   **Missing Values**: There are `0` missing values in `x` and `0` missing values in `y`, confirming that no `NaN`s were introduced during the separation process.
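
The checks listed above can also be written as assertions, so that a broken separation fails loudly instead of relying on reading printed output. This is a minimal sketch on a toy frame; the three rows are illustrative values, not the full dataset.

```python
import pandas as pd

# Toy frame standing in for `data` (values are illustrative).
data = pd.DataFrame({'age': [63.0, 67.0, 37.0],
                     'chol': [233.0, 286.0, 250.0],
                     'num': [0, 1, 0]})

x = data.drop(columns=['num'])
y = data['num']

# The same checks as the bullet points, expressed as assertions.
assert x.shape == (len(data), data.shape[1] - 1)   # exactly one column removed
assert 'num' not in x.columns                      # target is gone from x
assert x.index.equals(y.index)                     # rows are still aligned
assert x.isnull().sum().sum() == 0 and y.isnull().sum() == 0  # no NaNs
```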

In [14]:
print("Unique values for 'sex':", x['sex'].unique())
print("Unique values for 'fbs':", x['fbs'].unique())
print("Unique values for 'exang':", x['exang'].unique())
print("Unique values for 'cp':", x['cp'].unique())
print("Unique values for 'slope':", x['slope'].unique())
print("Unique values for 'thal':", x['thal'].unique())
print("Unique values for 'ca':", x['ca'].unique())

Unique values for 'sex': [1. 0.]
Unique values for 'fbs': [1. 0.]
Unique values for 'exang': [0. 1.]
Unique values for 'cp': [1. 4. 3. 2.]
Unique values for 'slope': [3. 2. 1.]
Unique values for 'thal': [6. 3. 7.]
Unique values for 'ca': [0. 3. 2. 1.]


### Encoding Categorical Variables

Based on the unique values and the nature of the features, here's how we'll treat the specified columns:

*   **Binary Variables (`sex`, `fbs`, `exang`):** These columns already contain binary (0 or 1) numerical values, as confirmed by their unique values. They are perfectly suitable for machine learning models in their current state, so no further encoding is necessary.

*   **Coded Categorical Variables (`cp`, `slope`, `thal`):** These features store category codes as numbers. `slope` (slope of the peak exercise ST segment: 1 = upsloping, 2 = flat, 3 = downsloping) has a natural order. `cp` (chest pain type, 1-4) and `thal` (coded 3, 6, 7) are better read as labels than as magnitudes: in the UCI dataset description, `cp` distinguishes typical angina, atypical angina, non-anginal pain, and asymptomatic, and `thal` distinguishes normal, fixed defect, and reversible defect, so the gap between codes 3 and 6 carries no meaning. As a simplification, this notebook keeps all three as single numeric columns and converts them to `int` type to mark them as discrete codes. One-hot encoding `cp` and `thal` is a common alternative.

*   **`ca` (number of major vessels):** This feature is the number of major vessels (0-3) colored by fluoroscopy. It is a count, so its values are naturally ordered (0 vessels is fewer than 1, and so on), and converting it to `int` type is a reasonable approach.

This approach keeps every feature in a single numeric column with an appropriate data type, at the cost of imposing an order on the `cp` and `thal` codes.
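
For comparison, the `cp` and `thal` codes could be treated as category labels and expanded into indicator columns. The sketch below is not what this notebook does; it is an illustration on a toy frame that reuses the code values listed above, with `x_demo` and `x_onehot` as hypothetical names.

```python
import pandas as pd

# Toy feature frame using the code values seen above (rows are illustrative).
x_demo = pd.DataFrame({'cp':   [1, 4, 3, 2],
                       'thal': [6, 3, 7, 3],
                       'ca':   [0, 3, 2, 1]})

# One indicator column per category; 'ca' is a count, so it stays numeric.
x_onehot = pd.get_dummies(x_demo, columns=['cp', 'thal'], dtype=int)

print(x_onehot.columns.tolist())
```

The trade-off is more columns (seven indicators instead of two features here) in exchange for not implying that code 7 is "more" than code 3.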

In [15]:
# Convert ordinal-like columns to integer type
x['cp'] = x['cp'].astype(int)
x['slope'] = x['slope'].astype(int)
x['thal'] = x['thal'].astype(int)
x['ca'] = x['ca'].astype(int)

print("Data types after conversion:")
x.info()

Data types after conversion:
<class 'pandas.core.frame.DataFrame'>
Index: 297 entries, 0 to 301
Data columns (total 13 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       297 non-null    float64
 1   sex       297 non-null    float64
 2   cp        297 non-null    int64  
 3   trestbps  297 non-null    float64
 4   chol      297 non-null    float64
 5   fbs       297 non-null    float64
 6   restecg   297 non-null    float64
 7   thalach   297 non-null    float64
 8   exang     297 non-null    float64
 9   oldpeak   297 non-null    float64
 10  slope     297 non-null    int64  
 11  ca        297 non-null    int64  
 12  thal      297 non-null    int64  
dtypes: float64(9), int64(4)
memory usage: 40.6 KB


In [16]:
from sklearn.preprocessing import StandardScaler # Import StandardScaler for feature scaling.

In [17]:
# Identify numerical columns for scaling.
numerical_cols = x.select_dtypes(include=['float64', 'int64']).columns

print("Numerical columns identified for scaling:")
print(numerical_cols)

Numerical columns identified for scaling:
Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal'],
      dtype='object')


In [18]:
# Initialize the StandardScaler.
scaler = StandardScaler()

# Apply StandardScaler to the numerical columns of the feature matrix 'x'.
x_scaled = scaler.fit_transform(x[numerical_cols])

# Convert the scaled array back to a DataFrame, maintaining column names and index.
x_scaled = pd.DataFrame(x_scaled, columns=numerical_cols, index=x.index)

# Display the first 5 rows of the scaled features to observe the transformation.
print("First 5 rows of scaled features (x_scaled):")
display(x_scaled.head())

First 5 rows of scaled features (x_scaled):


|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.936181 | 0.691095 | -2.240629 | 0.75038 | -0.276443 | 2.430427 | 1.010199 | 0.017494 | -0.696419 | 1.068965 | 2.264145 | -0.721976 | 0.655877 |
| 1 | 1.378929 | 0.691095 | 0.87388 | 1.596266 | 0.744555 | -0.41145 | 1.010199 | -1.816334 | 1.435916 | 0.381773 | 0.643781 | 2.478425 | -0.89422 |
| 2 | 1.378929 | 0.691095 | 0.87388 | -0.659431 | -0.3535 | -0.41145 | 1.010199 | -0.89942 | 1.435916 | 1.326662 | 0.643781 | 1.411625 | 1.172577 |
| 3 | -1.94168 | 0.691095 | -0.164289 | -0.095506 | 0.051047 | -0.41145 | -1.003419 | 1.63301 | -0.696419 | 2.099753 | 2.264145 | -0.721976 | -0.89422 |
| 4 | -1.498933 | -1.44698 | -1.202459 | -0.095506 | -0.835103 | -0.41145 | 1.010199 | 0.978071 | -0.696419 | 0.295874 | -0.976583 | -0.721976 | -0.89422 |
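
To make the transformation concrete: `StandardScaler` replaces each value with its z-score, that is, the value minus the column mean, divided by the column's population standard deviation. The sketch below checks this by hand on five `age` and `chol` values taken from the earlier `head()` output. The resulting numbers differ from the table above because the notebook's scaler was fitted on all 297 rows, not on these five.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# 'age' and 'chol' values from the first five rows shown earlier in the notebook.
values = np.array([[63.0, 233.0],
                   [67.0, 286.0],
                   [67.0, 229.0],
                   [37.0, 250.0],
                   [41.0, 204.0]])

# Manual z-score: subtract the column mean, divide by the population std (ddof=0).
manual = (values - values.mean(axis=0)) / values.std(axis=0)

# The same computation through scikit-learn.
scaled = StandardScaler().fit_transform(values)

print(np.allclose(manual, scaled))
```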


In [19]:
# Replace the original numerical columns in 'x' with their scaled versions.
for col in numerical_cols:
    x[col] = x_scaled[col]

# Display the first 5 rows of the updated feature matrix 'x' to confirm scaling.
print("First 5 rows of the feature matrix 'x' after scaling numerical columns:")
display(x.head())

First 5 rows of the feature matrix 'x' after scaling numerical columns:


|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.936181 | 0.691095 | -2.240629 | 0.75038 | -0.276443 | 2.430427 | 1.010199 | 0.017494 | -0.696419 | 1.068965 | 2.264145 | -0.721976 | 0.655877 |
| 1 | 1.378929 | 0.691095 | 0.87388 | 1.596266 | 0.744555 | -0.41145 | 1.010199 | -1.816334 | 1.435916 | 0.381773 | 0.643781 | 2.478425 | -0.89422 |
| 2 | 1.378929 | 0.691095 | 0.87388 | -0.659431 | -0.3535 | -0.41145 | 1.010199 | -0.89942 | 1.435916 | 1.326662 | 0.643781 | 1.411625 | 1.172577 |
| 3 | -1.94168 | 0.691095 | -0.164289 | -0.095506 | 0.051047 | -0.41145 | -1.003419 | 1.63301 | -0.696419 | 2.099753 | 2.264145 | -0.721976 | -0.89422 |
| 4 | -1.498933 | -1.44698 | -1.202459 | -0.095506 | -0.835103 | -0.41145 | 1.010199 | 0.978071 | -0.696419 | 0.295874 | -0.976583 | -0.721976 | -0.89422 |


In [21]:
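from sklearn.model_selection import train_test_split # Import the splitting utility.

# Hold out 20% of the rows for testing; random_state=42 makes the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)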
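
Because the scaler earlier in this notebook was fitted on all 297 rows before this split, the test rows contributed to the means and standard deviations used for scaling. A common alternative is to split first and fit the scaler on the training rows only. The sketch below illustrates that order on synthetic data; the random values, the column names `a`, `b`, `c`, and the added `stratify` argument are assumptions for the example, not part of the notebook's pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for x and y (random values, illustrative only).
rng = np.random.default_rng(0)
x_demo = pd.DataFrame(rng.normal(50, 10, size=(100, 3)), columns=['a', 'b', 'c'])
y_demo = pd.Series(np.tile([0, 1], 50), name='num')

# Split first; stratify keeps the class ratio the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fit on the training rows only, then reuse those statistics for the test rows.
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = pd.DataFrame(scaler.transform(X_tr), columns=X_tr.columns, index=X_tr.index)
X_te_scaled = pd.DataFrame(scaler.transform(X_te), columns=X_te.columns, index=X_te.index)

print(X_tr_scaled.shape, X_te_scaled.shape)
```

With this order the scaled training columns have mean 0 and unit variance, while the test columns are only approximately standardized, which mirrors how unseen data would be handled.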

### Verify the train-test split

In [22]:
total_samples = x.shape[0]

print(f"Original dataset size (x): {total_samples} samples")
print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of y_test: {y_test.shape}")

train_ratio = (X_train.shape[0] / total_samples) * 100
test_ratio = (X_test.shape[0] / total_samples) * 100

print(f"\nTrain set proportion: {train_ratio:.2f}%")
print(f"Test set proportion: {test_ratio:.2f}%")

# Confirm row counts match between features and target for train/test sets
train_match = X_train.shape[0] == y_train.shape[0]
test_match = X_test.shape[0] == y_test.shape[0]

print(f"\nNumber of rows in X_train matches y_train: {train_match}")
print(f"Number of rows in X_test matches y_test: {test_match}")

Original dataset size (x): 297 samples
Shape of X_train: (237, 13)
Shape of X_test: (60, 13)
Shape of y_train: (237,)
Shape of y_test: (60,)

Train set proportion: 79.80%
Test set proportion: 20.20%

Number of rows in X_train matches y_train: True
Number of rows in X_test matches y_test: True


### Save the data as CSV

In [25]:
# Define the file paths for saving the datasets.
X_train_path = '/content/X_train.csv'
X_test_path = '/content/X_test.csv'
y_train_path = '/content/y_train.csv'
y_test_path = '/content/y_test.csv'

# Save X_train to a CSV file.
X_train.to_csv(X_train_path, index=False)
print(f"X_train saved to {X_train_path}")

# Save X_test to a CSV file.
X_test.to_csv(X_test_path, index=False)
print(f"X_test saved to {X_test_path}")

# Save y_train to a CSV file.
y_train.to_csv(y_train_path, index=False, header=True)
print(f"y_train saved to {y_train_path}")

# Save y_test to a CSV file.
y_test.to_csv(y_test_path, index=False, header=True)
print(f"y_test saved to {y_test_path}")

X_train saved to /content/X_train.csv
X_test saved to /content/X_test.csv
y_train saved to /content/y_train.csv
y_test saved to /content/y_test.csv


### Verification of Saved Files



In [26]:
!ls -l /content/*.csv

-rw-r--r-- 1 root root 15378 Feb  9 14:54 /content/X_test.csv
-rw-r--r-- 1 root root 60277 Feb  9 14:54 /content/X_train.csv
-rw-r--r-- 1 root root   124 Feb  9 14:54 /content/y_test.csv
-rw-r--r-- 1 root root   478 Feb  9 14:54 /content/y_train.csv
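
The listing confirms that the files exist. A further check is to read a file back and compare it with what was written. The sketch below does this round trip in a temporary directory with toy data; the values and the names `X_demo`, `y_demo`, `X_back`, and `y_back` are illustrative.

```python
import os
import tempfile
import pandas as pd

# Toy split standing in for X_train / y_train (values are illustrative).
X_demo = pd.DataFrame({'age': [0.936181, -1.94168], 'sex': [0.691095, 0.691095]},
                      index=[0, 3])
y_demo = pd.Series([0, 1], index=[0, 3], name='num')

with tempfile.TemporaryDirectory() as tmp:
    x_path = os.path.join(tmp, 'X_demo.csv')
    y_path = os.path.join(tmp, 'y_demo.csv')

    # Same options as the save cell above.
    X_demo.to_csv(x_path, index=False)
    y_demo.to_csv(y_path, index=False, header=True)

    X_back = pd.read_csv(x_path)
    y_back = pd.read_csv(y_path)['num']

# index=False drops the original row labels, so compare values, not indexes.
print(X_back.shape == X_demo.shape)
print((y_back.values == y_demo.values).all())
```

Note that because the files are written with `index=False`, the reloaded frames get a fresh 0-based index. Features and targets still line up by row position.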
