## Title: Heart Disease Prediction Using a K-Nearest Neighbors (KNN) Classifier

# **Step 1: Import Necessary Python Libraries**

In [1]:
import pandas as pd
import numpy as np
from math import exp, sqrt, pi
import seaborn as sns
import matplotlib.pyplot as plt
from collections import Counter

# **Step 2: Load Dataset (EDA)**

In [2]:
df = pd.read_csv('heart_disease_uci.csv')
df.head()

Unnamed: 0,id,age,sex,dataset,cp,trestbps,chol,fbs,restecg,thalch,exang,oldpeak,slope,ca,thal,num
0,1,63,Male,Cleveland,typical angina,145.0,233.0,True,lv hypertrophy,150.0,False,2.3,downsloping,0.0,fixed defect,0
1,2,67,Male,Cleveland,asymptomatic,160.0,286.0,False,lv hypertrophy,108.0,True,1.5,flat,3.0,normal,2
2,3,67,Male,Cleveland,asymptomatic,120.0,229.0,False,lv hypertrophy,129.0,True,2.6,flat,2.0,reversable defect,1
3,4,37,Male,Cleveland,non-anginal,130.0,250.0,False,normal,187.0,False,3.5,downsloping,0.0,normal,0
4,5,41,Female,Cleveland,atypical angina,130.0,204.0,False,lv hypertrophy,172.0,False,1.4,upsloping,0.0,normal,0


# **Step 3: Understanding the UCI Heart Disease Dataset**

Let's now take a look at our dataset attributes and understand their meaning and significance.


| Attribute Name | Type | Description |
|----------------|------|-------------|
| id | Discrete | Unique identifier for each patient |
| age | Continuous | Age of the patient in years |
| sex | Categorical | Sex of the patient <br>(1 = male, 0 = female) |
| dataset | Categorical | Place of study <br>(0: Cleveland, 1: Hungary, 2: Switzerland, 3: VA Long Beach) |
| cp | Categorical | Chest pain type <br>(0: asymptomatic, 1: atypical angina, 2: non-anginal pain, 3: typical angina) |
| trestbps | Continuous | Resting blood pressure (in mm Hg on admission to the hospital) |
| chol | Continuous | Serum cholesterol (in mg/dl) |
| fbs | Categorical | Whether fasting blood sugar > 120 mg/dl <br>(encoded in Step 4c as 0 = true, 1 = false) |
| restecg | Categorical | Resting electrocardiographic results <br>(0: showing probable or definite left ventricular hypertrophy by Estes’ criteria, 1: normal, 2: having ST-T wave abnormality) |
| thalch | Continuous | Maximum heart rate achieved |
| exang | Categorical | Exercise-induced angina <br>(encoded in Step 4c as 0 = true, 1 = false) |
| oldpeak | Continuous | ST depression induced by exercise relative to rest |
| slope | Categorical | Slope of the peak exercise ST segment <br>(0: downsloping, 1: flat, 2: upsloping) |
| ca | Continuous | Number of major vessels (0-3) colored by fluoroscopy |
| thal | Categorical | Result recorded in the `thal` column <br>(encoded in Step 4c as 0 = fixed defect, 1 = normal, 2 = reversible defect) |
| num | Discrete | Class label (the attribute to predict): 0 indicates no heart disease, and 1, 2, 3, and 4 represent the different stages of heart disease <br>(0, 1, 2, 3, 4) |

Excluding `id` and the class label `num`, we have a total of 14 features. Our objective is to predict whether a patient has heart disease, so we will build and interpret a classification model.
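
Because `num` takes five values (0 for no disease, 1 to 4 for stages of disease), the question "does the patient have heart disease?" can also be posed as a binary problem. The sketch below shows one way to inspect the class distribution and derive a yes/no label. It runs on a small toy frame rather than the real data, and the column name `has_disease` is a hypothetical addition that the notebook itself does not use; the rest of this notebook keeps the five-class label.

```python
import pandas as pd

# Toy stand-in for the `num` column (values 0-4); these are NOT rows from the real dataset.
toy = pd.DataFrame({"num": [0, 2, 1, 0, 4, 3, 0]})

# Hypothetical helper column: 1 if any stage of disease is present, else 0.
toy["has_disease"] = (toy["num"] > 0).astype(int)

# Class distribution for the five-class and the binary formulation.
class_counts = toy["num"].value_counts().sort_index()
binary_counts = toy["has_disease"].value_counts().sort_index()

print(class_counts)
print(binary_counts)
```

On the real data, running `df['num'].value_counts()` in the same way would show how unevenly the five classes are represented, which matters when reading per-class metrics later.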

In [3]:
df.shape 

(920, 16)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 920 entries, 0 to 919
Data columns (total 16 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   id        920 non-null    int64  
 1   age       920 non-null    int64  
 2   sex       920 non-null    object 
 3   dataset   920 non-null    object 
 4   cp        920 non-null    object 
 5   trestbps  861 non-null    float64
 6   chol      890 non-null    float64
 7   fbs       830 non-null    object 
 8   restecg   918 non-null    object 
 9   thalch    865 non-null    float64
 10  exang     865 non-null    object 
 11  oldpeak   858 non-null    float64
 12  slope     611 non-null    object 
 13  ca        309 non-null    float64
 14  thal      434 non-null    object 
 15  num       920 non-null    int64  
dtypes: float64(5), int64(3), object(8)
memory usage: 115.1+ KB


In [5]:
df.isnull().sum()

id            0
age           0
sex           0
dataset       0
cp            0
trestbps     59
chol         30
fbs          90
restecg       2
thalch       55
exang        55
oldpeak      62
slope       309
ca          611
thal        486
num           0
dtype: int64

# **Step 4: Compute K-Nearest Neighbors Classifier Without Sklearn**

**Step 4a: Handle missing values by filling numeric (float/int) columns with the mean and categorical (object) columns with the mode**

In [None]:
for column in df.select_dtypes(include=['float64', 'int64']).columns:
    mean_value = df[column].mean()
    df[column] = df[column].fillna(mean_value) 
    
for column in df.select_dtypes(include=['object']).columns:
    mode_value = df[column].mode()[0]  # Get the first mode if there are multiple
    df[column] = df[column].fillna(mode_value).infer_objects(copy=False)  # Assign back to the DataFrame and infer data types


**Step 4b: Scale continuous variables**

In [7]:
columns_to_scale = ['age', 'trestbps', 'chol', 'thalch', 'oldpeak', 'ca']

for column in columns_to_scale:
    mean_value = df[column].mean()  # Calculate mean, ignoring NaNs
    std_value = df[column].std()    # Calculate standard deviation, ignoring NaNs
    df[column] = (df[column] - mean_value) / std_value  # Apply scaling
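
The cell above standardizes each column using the mean and standard deviation of the full dataset, before the train/validation/test split in Step 4d. A common alternative is to compute these statistics on the training rows only and reuse them for the other splits, so that no information from validation or test rows influences the preprocessing. The following is a minimal sketch of that idea on toy data; the function name `standardize_with_train_stats` is hypothetical and not part of this notebook.

```python
import pandas as pd

def standardize_with_train_stats(train, other, columns):
    """Z-score `columns` in both frames using the mean/std of `train` only."""
    train = train.copy()
    other = other.copy()
    for col in columns:
        mean_value = train[col].mean()
        std_value = train[col].std()  # sample standard deviation (ddof=1), the pandas default
        train[col] = (train[col] - mean_value) / std_value
        other[col] = (other[col] - mean_value) / std_value
    return train, other

# Toy data, not taken from the heart disease dataset.
toy_train = pd.DataFrame({"age": [40.0, 50.0, 60.0]})
toy_test = pd.DataFrame({"age": [50.0, 70.0]})

scaled_train, scaled_test = standardize_with_train_stats(toy_train, toy_test, ["age"])
print(scaled_train["age"].tolist())  # [-1.0, 0.0, 1.0]
print(scaled_test["age"].tolist())   # [0.0, 2.0]
```

The test rows are scaled with the training mean (50) and standard deviation (10), so a test value can fall outside the range seen in training without changing the scaling.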



**Step 4c: Label Encoding (map each category to an integer code)**

In [8]:
# Define the mapping dictionary
sex_mapping = {'Male': 1, 'Female': 0}

# Map the values in the 'sex' column using the mapping dictionary; any value other than 'Male' or 'Female' is assigned 2
df['sex'] = df['sex'].map(lambda x: sex_mapping.get(x, 2))

print(df['sex'].unique())

[1 0]


In [9]:
#Define the mapping dictionary
dataset_mapping = {'Cleveland': 0, 'Hungary': 1 , 'Switzerland': 2 , 'VA Long Beach': 3}
                                                             
# Map the values in the column using the mapping dictionary and if the value is not listed in the mapping assign as 4
df['dataset'] = df['dataset'].map(lambda x: dataset_mapping.get(x, 4))

print(df['dataset'].unique())

[0 1 2 3]


In [10]:
#Define the mapping dictionary
cp_mapping = {'asymptomatic': 0, 'atypical angina': 1 , 'non-anginal': 2 , 'typical angina': 3}
                                                             
# Map the values in the column using the mapping dictionary and if the value is not listed in the mapping assign as 4
df['cp'] = df['cp'].map(lambda x: cp_mapping.get(x, 4))

print(df['cp'].unique())


[3 0 2 1]


In [11]:
#Define the mapping dictionary
fbs_mapping = {True: 0, False: 1}
                                                             
# Map the values in the column using the mapping dictionary and if the value is not listed in the mapping assign as 2
df['fbs'] = df['fbs'].map(lambda x: fbs_mapping.get(x, 2))

print(df['fbs'].unique())


[0 1]


In [12]:
#Define the mapping dictionary
restecg_mapping = {'lv hypertrophy':0, 'normal' :1, 'st-t abnormality':2}
                                                             
# Map the values in the column using the mapping dictionary and if the value is not listed in the mapping assign as 3
df['restecg'] = df['restecg'].map(lambda x: restecg_mapping.get(x, 3))

print(df['restecg'].unique())


[0 1 2]


In [13]:
# Define the mapping dictionary
exang_mapping = {True: 0, False: 1}
                                                             
# Map the values in the column using the mapping dictionary and if the value is not listed in the mapping assign as 2
df['exang'] = df['exang'].map(lambda x: exang_mapping.get(x, 2))

print(df['exang'].unique())


[1 0]


In [14]:
#Define the mapping dictionary
slope_mapping = {'downsloping':0, 'flat' :1, 'upsloping':2}
                                                             
# Map the values in the column using the mapping dictionary and if the value is not listed in the mapping assign as 3
df['slope'] = df['slope'].map(lambda x: slope_mapping.get(x, 3))

print(df['slope'].unique())


[0 1 2]


In [15]:
#Define the mapping dictionary
thal_mapping = {'fixed defect':0, 'normal' :1, 'reversable defect':2}
                                                             
# Map the values in the column using the mapping dictionary and if the value is not listed in the mapping assign as 3
df['thal'] = df['thal'].map(lambda x: thal_mapping.get(x, 3))

print(df['thal'].unique())


[0 1 2]


In [16]:
df.head()

Unnamed: 0,id,age,sex,dataset,cp,trestbps,chol,fbs,restecg,thalch,exang,oldpeak,slope,ca,thal,num
0,1,1.006838,1,0,3,0.697662,0.310852,0,0,0.495429,1,1.348688,0,-1.248692,0,0
1,2,1.431255,1,0,0,1.510939,0.797279,1,0,-1.175316,0,0.589512,1,4.289765,1,2
2,3,1.431255,1,0,0,-0.657801,0.27414,1,0,-0.339943,0,1.633379,1,2.443613,2,1
3,4,-1.751875,1,0,2,-0.115616,0.466876,1,1,1.967275,1,2.487452,0,-1.248692,1,0
4,5,-1.327458,0,0,1,-0.115616,0.044693,1,0,1.370581,1,0.494615,2,-1.248692,1,0
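
The cells above replace each category with a single integer code. With a distance-based method such as KNN, integer codes imply an ordering and spacing between categories (for example, chest pain type 3 ends up "farther" from type 0 than type 1 is), which may not be meaningful for nominal variables. One-hot encoding is a common alternative that gives each category its own 0/1 column. Below is a minimal sketch using `pandas.get_dummies` on toy values; it is an illustration only and not what this notebook does.

```python
import pandas as pd

# Toy chest-pain values using the same strings as the raw `cp` column.
toy = pd.DataFrame(
    {"cp": ["typical angina", "asymptomatic", "non-anginal", "asymptomatic"]}
)

# One 0/1 indicator column per category; the original `cp` column is dropped.
one_hot = pd.get_dummies(toy, columns=["cp"], dtype=int)
print(one_hot)
```

Each row has exactly one indicator set to 1, so any two different categories are the same distance apart under the Euclidean metric.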


**Step 4d: Split the data and define the KNN functions (distance, neighbors, prediction, accuracy)**

In [17]:
import numpy as np
import pandas as pd
from collections import Counter

# Step 1: Shuffle the dataset
df_shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)

# Step 2: Split into training, validation, and test sets (60:20:20)
train_size = int(0.6 * len(df_shuffled))
val_size = int(0.2 * len(df_shuffled))

train_set = df_shuffled[:train_size]
val_set = df_shuffled[train_size:train_size + val_size]
test_set = df_shuffled[train_size + val_size:]

# Separate features and labels for training, validation, and test sets
X_train = train_set.drop(columns=['id', 'num'])  # Exclude 'id' and 'num'
y_train = train_set['num']

X_val = val_set.drop(columns=['id', 'num'])
y_val = val_set['num']

X_test = test_set.drop(columns=['id', 'num'])
y_test = test_set['num']



In [18]:
# Function to calculate Euclidean distance between two data points
def euclidean_distance(point1, point2):
    return np.sqrt(np.sum((point1 - point2) ** 2))

# Function to find the k nearest neighbors
def get_neighbors(X_train, y_train, test_point, k):
    distances = []
    for i in range(len(X_train)):
        distance = euclidean_distance(X_train.iloc[i], test_point)
        distances.append((y_train.iloc[i], distance))
    
    # Sort the distances and select the k nearest neighbors
    distances.sort(key=lambda x: x[1])
    neighbors = [distances[i][0] for i in range(k)]
    
    return neighbors

# Function to predict the class of a test point
def predict_class_knn(X_train, y_train, test_point, k):
    neighbors = get_neighbors(X_train, y_train, test_point, k)
    most_common = Counter(neighbors).most_common(1)
    return most_common[0][0]

# Function to compute accuracy
def compute_accuracy(X_train, y_train, X_test, y_test, k):
    correct = 0
    for i in range(len(X_test)):
        predicted_class = predict_class_knn(X_train, y_train, X_test.iloc[i], k)
        actual_class = y_test.iloc[i]
        if predicted_class == actual_class:
            correct += 1
    return correct / len(X_test)

# Step 3: Tune the hyperparameter k using the validation set
best_k = None
best_accuracy = 0

# Try k = 1..12 (the range that produced the output shown below)
for k in range(1, 13):
    accuracy = compute_accuracy(X_train, y_train, X_val, y_val, k)
    print(f'Accuracy on validation set for k={k}: {accuracy :.2f}')
    
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_k = k

print(f'Best k value: {best_k}')


Accuracy on validation set for k=1: 0.45
Accuracy on validation set for k=2: 0.45
Accuracy on validation set for k=3: 0.49
Accuracy on validation set for k=4: 0.50
Accuracy on validation set for k=5: 0.54
Accuracy on validation set for k=6: 0.52
Accuracy on validation set for k=7: 0.53
Accuracy on validation set for k=8: 0.52
Accuracy on validation set for k=9: 0.50
Accuracy on validation set for k=10: 0.50
Accuracy on validation set for k=11: 0.52
Accuracy on validation set for k=12: 0.53
Best k value: 5
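
The loop-based implementation above computes one distance at a time through `DataFrame.iloc`, which can be slow when repeated for every value of k. As an optional illustration, the same idea (Euclidean distance, k nearest rows, majority vote) can be sketched with NumPy broadcasting. The function name `knn_predict_vectorized` and the toy arrays are assumptions made for this example; when votes are tied, the result may differ from the loop version above.

```python
import numpy as np
from collections import Counter

def knn_predict_vectorized(X_train, y_train, X_query, k):
    """Predict a label for each row of X_query by majority vote among
    its k nearest training rows (Euclidean distance)."""
    X_train = np.asarray(X_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    y_train = np.asarray(y_train)

    # Pairwise differences via broadcasting: shape (n_query, n_train, n_features)
    diffs = X_query[:, None, :] - X_train[None, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=2))

    # Indices of the k smallest distances for each query row
    nearest = np.argsort(distances, axis=1, kind="stable")[:, :k]

    # Majority vote over the labels of the selected neighbors
    return np.array([Counter(y_train[row]).most_common(1)[0][0] for row in nearest])

# Toy data: two well-separated clusters (not the heart disease data).
toy_X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.2],
                  [5.0, 5.0], [5.1, 5.0], [5.0, 4.9]])
toy_y = np.array([0, 0, 0, 1, 1, 1])
queries = np.array([[0.05, 0.05], [4.9, 5.2]])

predictions = knn_predict_vectorized(toy_X, toy_y, queries, k=3)
print(predictions)  # [0 1]
```

With the notebook's variables, a call would look like `knn_predict_vectorized(X_train.to_numpy(), y_train.to_numpy(), X_val.to_numpy(), k)`, assuming all feature columns are numeric after Step 4c.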


**Step 4e: Functions to compute the per-class confusion counts (TP, FP, FN, TN) and performance metrics for KNN**

In [21]:
# Function to compute confusion matrix for multi-class classification
def compute_confusion_matrix_knn(X_train, y_train, X_test, y_test, k):
    # Initialize counts for True Positives, True Negatives, False Positives, and False Negatives
    num_classes = len(set(y_train))  # Dynamically find the number of classes from training labels
    True_positives = [0] * num_classes
    True_negatives = [0] * num_classes
    False_positives = [0] * num_classes
    False_negatives = [0] * num_classes

    # Loop through each test example
    for i in range(len(X_test)):
        example_data_point = X_test.iloc[i]
        actual_class = y_test.iloc[i]
        predicted_class = predict_class_knn(X_train, y_train, example_data_point, k)

        # Update confusion matrix based on predicted and actual classes
        if predicted_class == actual_class:
            True_positives[actual_class] += 1
        else:
            False_positives[predicted_class] += 1
            False_negatives[actual_class] += 1

    # Calculate True Negatives for each class
    for j in range(num_classes):
        True_negatives[j] = len(X_test) - (True_positives[j] + False_positives[j] + False_negatives[j])

    # Compute class-specific performance metrics
    class_specific_accuracy = [(True_positives[j] + True_negatives[j]) / (True_positives[j] + True_negatives[j] + False_positives[j] + False_negatives[j]) if (True_positives[j] + True_negatives[j] + False_positives[j] + False_negatives[j]) > 0 else 0 for j in range(num_classes)]

    class_specific_precision = [True_positives[j] / (True_positives[j] + False_positives[j]) if (True_positives[j] + False_positives[j]) > 0 else 0 for j in range(num_classes)]
    
    class_specific_recall = [True_positives[j] / (True_positives[j] + False_negatives[j]) if (True_positives[j] + False_negatives[j]) > 0 else 0 for j in range(num_classes)]
    
    class_specific_FScore = [2 * class_specific_precision[j] * class_specific_recall[j] / (class_specific_precision[j] + class_specific_recall[j]) if (class_specific_precision[j] + class_specific_recall[j]) > 0 else 0 for j in range(num_classes)]

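    # Compute overall model performance as unweighted (macro) averages over the classes.
    # Note: Average_accuracy is the mean of the one-vs-rest class accuracies. It counts true
    # negatives, so it is typically higher than the fraction of correctly classified test points.
    Average_accuracy = sum(class_specific_accuracy) / num_classes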
    Average_precision = sum(class_specific_precision) / num_classes
    Average_recall = sum(class_specific_recall) / num_classes
    Average_FScore = sum(class_specific_FScore) / num_classes

    return {
        "Overall Accuracy": Average_accuracy,
        "Average Precision": Average_precision,
        "Average Recall": Average_recall,
        "Average F-Score": Average_FScore,
        "Class-Specific Metrics": {
            "Accuracy": class_specific_accuracy,
            "Precision": class_specific_precision,
            "Recall": class_specific_recall,
            "F-Score": class_specific_FScore
        }
    }

# Example: Call the function to compute the confusion matrix on the test set
k = 5
metrics = compute_confusion_matrix_knn(X_train, y_train, X_test, y_test, k)


# Print accuracy, precision, recall, and F-score for each class
for i in range(len(metrics["Class-Specific Metrics"]["Accuracy"])):
    print(f'Class {i} - Accuracy: {metrics["Class-Specific Metrics"]["Accuracy"][i]:.2f}, '
          f'Precision: {metrics["Class-Specific Metrics"]["Precision"][i]:.2f}, '
          f'Recall: {metrics["Class-Specific Metrics"]["Recall"][i]:.2f}, '
          f'F-Score: {metrics["Class-Specific Metrics"]["F-Score"][i]:.2f}')

print(" ")


# Print the overall performance metrics
print(f'Overall Accuracy on the test set: {metrics["Overall Accuracy"]:.2f}')
print(f'Average Precision: {metrics["Average Precision"]:.2f}')
print(f'Average Recall: {metrics["Average Recall"]:.2f}')
print(f'Average F-Score: {metrics["Average F-Score"]:.2f}')

Class 0 - Accuracy: 0.82, Precision: 0.80, Recall: 0.80, F-Score: 0.80
Class 1 - Accuracy: 0.77, Precision: 0.57, Recall: 0.65, F-Score: 0.61
Class 2 - Accuracy: 0.84, Precision: 0.11, Recall: 0.12, F-Score: 0.12
Class 3 - Accuracy: 0.79, Precision: 0.14, Recall: 0.13, F-Score: 0.14
Class 4 - Accuracy: 0.96, Precision: 1.00, Recall: 0.12, F-Score: 0.22
 
Overall Accuracy on the test set: 0.83
Average Precision: 0.53
Average Recall: 0.37
Average F-Score: 0.38
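
The function above tracks per-class counts but never builds the full confusion matrix, and the value printed as "Overall Accuracy" is the average of the one-vs-rest class accuracies rather than the share of test points classified correctly. The sketch below, run on toy labels, shows how an explicit matrix can be built and how the two accuracy figures differ. The helper name `confusion_matrix_from_labels` is hypothetical, and the numbers are not results from the heart disease data.

```python
import numpy as np

def confusion_matrix_from_labels(actual, predicted, num_classes):
    """Rows index the actual class, columns index the predicted class."""
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    for a, p in zip(actual, predicted):
        matrix[a, p] += 1
    return matrix

# Toy labels for a three-class problem.
toy_actual = [0, 0, 1, 1, 2, 2, 2]
toy_predicted = [0, 1, 1, 1, 2, 0, 2]

cm = confusion_matrix_from_labels(toy_actual, toy_predicted, num_classes=3)

# Fraction of points classified correctly (diagonal over total).
overall_accuracy = np.trace(cm) / cm.sum()

# One-vs-rest counts per class, derived from the matrix.
tp = np.diag(cm)
fp = cm.sum(axis=0) - tp
fn = cm.sum(axis=1) - tp
tn = cm.sum() - tp - fp - fn
mean_per_class_accuracy = ((tp + tn) / cm.sum()).mean()

print(cm)
print(f"Overall accuracy: {overall_accuracy:.2f}")                 # 0.71
print(f"Mean per-class accuracy: {mean_per_class_accuracy:.2f}")   # 0.81
```

In this toy case 5 of 7 points are correct (0.71), while the mean per-class accuracy is 0.81 because every class also gets credit for its true negatives. The same gap should be kept in mind when reading the 0.83 reported above.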
