Apply k-means to the heart disease dataset that we used in Lab 2. Because the dataset is intended for binary classification, we set the number of clusters to 2 and investigate the clustering result.

### Data Preprocessing

The full pre-processing pipeline includes: loading the data, handling missing values, checking and transforming data types, encoding categorical variables, extracting values from the dataset (defining the input variables and the target variable), and scaling the input variables with min-max normalization.

In [1]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# load dataset
dataset = pd.read_csv("heart_disease.csv")
print("dataset length:", len(dataset))

# handling missing values
dataset = dataset.dropna()
print("dataset length after removing missing values:", len(dataset))

# transform data type
cols = ['Number of major vessels', 'thal']
dataset[cols] = dataset[cols].astype(int)

# dealing with categorical variables
dataset['Sex'] = dataset['Sex'].replace({'male': 1, 'female': 0})
dataset['Fasting Blood Sugar'] = dataset['Fasting Blood Sugar'].replace({True: 1, False: 0})
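# note: 'non-anginal pain' and 'asymptomatic' both map to 3 here, so the two categories are merged
# (the original UCI encoding uses 4 for 'asymptomatic')
dataset['Chest Pain Type'] = dataset['Chest Pain Type'].replace({'typical angina': 1, 'atypical angina': 2, 'non-anginal pain': 3, 'asymptomatic': 3})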

# define input variables
array = dataset.values
X = array[:,0:-1]

# fit the min-max scaler on the input variables and rescale each column to [0, 1]
norm = MinMaxScaler().fit(X)
X = norm.transform(X)

dataset length: 303
dataset length after removing missing values: 297
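
To make the scaling step concrete, here is a minimal sketch of what min-max normalization computes for each column, namely `(x - min) / (max - min)`. The array `X_toy` and its values are made up for illustration and are not taken from the heart disease dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data: hypothetical values, NOT rows from heart_disease.csv
X_toy = np.array([[29.0, 94.0],
                  [54.0, 130.0],
                  [77.0, 200.0]])

# Manual min-max normalization, applied column by column
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)
X_manual = (X_toy - col_min) / (col_max - col_min)

# The same transformation with scikit-learn
X_sklearn = MinMaxScaler().fit_transform(X_toy)

# Both results agree, and every column now spans [0, 1]
print(np.allclose(X_manual, X_sklearn))
```

Scaling matters for k-means because the algorithm relies on Euclidean distances: without it, a variable with a large numeric range, such as serum cholesterol, would likely dominate the distance computation.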


### Build a k-means model
Set the random state to 0 and, since the heart disease dataset has two classes, set the number of clusters to 2.

In [2]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2, random_state=0).fit(X)

#### Check how many data samples are in each cluster

In [3]:
import numpy as np

unique_labels, unique_counts = np.unique(kmeans.labels_, return_counts=True)
dict(zip(unique_labels, unique_counts))

{0: 152, 1: 145}

#### Extract prototype for each cluster

In [4]:
from sklearn.metrics.pairwise import pairwise_distances_argmin

kmeans_cluster_centers = kmeans.cluster_centers_
# for each cluster center, find the index of the closest sample in X
closest = pairwise_distances_argmin(kmeans_cluster_centers, X)

# show the two data samples that best represent the two clusters
dataset.iloc[closest, :]

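|     | Age | Sex | Chest Pain Type | Resting Blood Pressure | Serum Cholestoral | Fasting Blood Sugar | Resting electrocardiographic results | Maximum heart rate achieved | Exercise induced angina | ST depression | the slope | Number of major vessels | thal | Diagnosis |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 242 | 49 | 0 | 3 | 130 | 269 | 0 | 0 | 163 | 0 | 0.0 | 1 | 0 | 3 | 0 |
| 96 | 59 | 1 | 3 | 110 | 239 | 0 | 2 | 142 | 1 | 1.2 | 2 | 1 | 7 | 1 |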
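
As a rough illustration of what `pairwise_distances_argmin` does in the prototype step, the sketch below computes the distance from every cluster center to every sample by hand and takes the index of the smallest one. It runs on synthetic two-blob data (`X_toy` and `km` are hypothetical names, and `n_init=10` is set explicitly as an assumption), not on the heart disease dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import pairwise_distances_argmin

# Synthetic data: two well-separated blobs (an assumption for illustration)
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),
                   rng.normal(1.0, 0.1, size=(20, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy)

# Squared Euclidean distance from each center to each sample: shape (2, 40)
diffs = km.cluster_centers_[:, None, :] - X_toy[None, :, :]
dists = (diffs ** 2).sum(axis=2)

# Index of the nearest sample for each center, computed by hand
closest_manual = dists.argmin(axis=1)

# The same result using the scikit-learn helper
closest_sklearn = pairwise_distances_argmin(km.cluster_centers_, X_toy)
print(closest_manual, closest_sklearn)
```

A cluster center is an average and usually does not coincide with any real sample, which is why the closest actual sample is used as the prototype.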


#### Check accuracy of the clustering model

In [5]:
from sklearn.metrics import accuracy_score

y = array[:,-1]
kmeans_labels = kmeans.labels_
# or kmeans_labels = kmeans.predict(X)

accuracy = accuracy_score(y, kmeans_labels)
print("k means prediction accuracy:", accuracy)

k means prediction accuracy: 0.7912457912457912
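
One caveat when reading this number: k-means assigns cluster IDs arbitrarily, so cluster 0 does not necessarily correspond to `Diagnosis` 0. In the run above the raw accuracy is above 0.5, which suggests the IDs happened to line up, but a different random state could swap them. The sketch below, on made-up labels (`y_true` and `cluster_ids` are hypothetical), shows one way to guard against this for two clusters: score both labelings and keep the better one, or use a permutation-invariant metric such as the adjusted Rand index.

```python
import numpy as np
from sklearn.metrics import accuracy_score, adjusted_rand_score

# Hypothetical labels: the grouping mostly matches, but the cluster IDs are swapped
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
cluster_ids = np.array([1, 1, 1, 0, 0, 0, 1, 1])

# Accuracy with the cluster IDs taken at face value
raw_acc = accuracy_score(y_true, cluster_ids)

# Accuracy after swapping the two cluster IDs (0 <-> 1)
flipped_acc = accuracy_score(y_true, 1 - cluster_ids)

# With two clusters, the better of the two is the fair comparison
best_acc = max(raw_acc, flipped_acc)
print(raw_acc, flipped_acc, best_acc)  # 0.125 0.875 0.875

# The adjusted Rand index ignores how the clusters are numbered
ari = adjusted_rand_score(y_true, cluster_ids)
print(ari)
```

With more than two clusters, swapping is no longer enough and the cluster IDs would need to be matched to classes, for example through a contingency table.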
