# K-Means Clustering

K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms.

Typically, unsupervised algorithms make inferences from datasets using only input vectors, without referring to known, or labelled, outcomes.

You define a target number k, which is the number of centroids you want in the dataset. A centroid is the imaginary or real location representing the center of a cluster.

Every data point is allocated to one of the clusters so as to reduce the within-cluster sum of squares.

In other words, the K-means algorithm identifies k centroids and then allocates every data point to the nearest cluster, while keeping the clusters as compact as possible.

The ‘means’ in K-means refers to averaging the data; that is, finding the centroid.
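The assign-then-average loop described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the scikit-learn implementation used in the rest of this notebook: the function name `kmeans_sketch`, the random initialisation from data points, and the toy data are all assumptions made for the example.

```python
import numpy as np


def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain Lloyd iterations: assign to the nearest centroid, then re-average."""
    rng = np.random.default_rng(seed)
    # Assumption of this sketch: start from k distinct data points picked at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # The 'means' step: each centroid becomes the average of its points.
        # An empty cluster keeps its previous centroid.
        new_centroids = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centroids[c]
            for c in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment and within-cluster sum of squares for the returned centroids.
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    wcss = dists[np.arange(len(X)), labels].sum()
    return labels, centroids, wcss


# Made-up toy data: two blobs of 50 points each.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])
labels, centroids, wcss = kmeans_sketch(X, k=2)
print(centroids)
print(wcss)
```

The returned `wcss` plays the same role as `inertia_` on a fitted scikit-learn `KMeans` estimator: the sum of squared distances from each point to its closest centroid.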


## Heart Failure Prediction Dataset - Standardized

### Imports

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import ds_functions as ds
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.metrics import davies_bouldin_score

In [None]:
data: pd.DataFrame = pd.read_csv('../../datasets/hf_scaled/HF_standardized.csv')
data.pop('DEATH_EVENT') #Remove target variable

N_CLUSTERS = [2, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 30]
rows, cols = (11,6)

In [None]:
mse: list = []
sc: list = []
db: list = []
    
for k in N_CLUSTERS:
    fig, axs = plt.subplots(rows, cols, figsize=(cols*5, rows*5), squeeze=False)
    i, j = 0, 0

    estimator = KMeans(n_clusters=k)
    estimator.fit(data)
    mse.append(estimator.inertia_)  # inertia_: within-cluster sum of squared distances
    sc.append(silhouette_score(data, estimator.labels_))
    db.append(davies_bouldin_score(data, estimator.labels_))

    print(f"K - {k}")  # report the number of clusters, not the loop index
    
    for f1 in range (len(data.columns)):
        for f2 in range(f1+1, len(data.columns)): 
            ds.plot_clusters(data, f2, f1, estimator.labels_.astype(float), estimator.cluster_centers_, k,
                             f'KMeans k={k}', ax=axs[i,j])
            
            i, j = (i + 1, 0) if (j+1) % cols == 0 else (i, j + 1)    
    plt.show()

In [None]:
fig = plt.figure(figsize=(9,3))
ds.plot_line(N_CLUSTERS, mse, title='KMeans MSE', xlabel='k', ylabel='MSE')
plt.show()

print(db)

fig, ax = plt.subplots(1, 2, figsize=(9, 3), squeeze=False)
ds.plot_line(N_CLUSTERS, sc, title='KMeans SC', xlabel='k', ylabel='SC', ax=ax[0, 0], percentage=True)
ds.plot_line(N_CLUSTERS, db, title='KMeans DB', xlabel='k', ylabel='DB', ax=ax[0, 1], percentage=False)
plt.show()

Binary data is unlikely to be clustered satisfactorily by K-Means. To see why, consider what happens as the algorithm processes cases.

For binary data, the squared Euclidean distance used by K-Means reduces to counting the number of variables on which two cases disagree. When the initial centers are chosen from the cases themselves, the centers are still binary data. In the first iteration, as the cases are compared to the cluster centers, they will always be at some integer distance from each center. There will often be ties, and a tied case will be assigned to a cluster arbitrarily. With Euclidean distance (the only measure available to K-Means), it is impossible to overcome the symmetry and break the ties in any meaningful way.
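The counting argument can be checked on a toy example. The sketch below assumes 0/1-coded variables and four made-up cases (not rows of the heart failure dataset); it compares the squared Euclidean distance with the number of disagreeing variables and flags the cases that sit equally far from both centers.

```python
import numpy as np

# Four made-up binary cases with two 0/1 variables each.
cases = np.array([[0, 0],
                  [1, 1],
                  [0, 1],
                  [1, 0]])

# Two of the cases serve as initial centers, so the centers are binary too.
centers = cases[[0, 1]]

# Squared Euclidean distance from every case to every center.
sq_euclid = ((cases[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)

# Hamming distance: number of variables on which a case and a center disagree.
hamming = (cases[:, None, :] != centers[None, :, :]).sum(axis=2)

print(sq_euclid)
print(np.array_equal(sq_euclid, hamming))

# Cases equally far from both centers can only be assigned arbitrarily.
ties = sq_euclid[:, 0] == sq_euclid[:, 1]
print(ties)
```

Here the last two cases disagree with each center on exactly one variable, so nothing in the distances favours one cluster over the other.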

### K-Means No Binary Data

In [None]:
data: pd.DataFrame = pd.read_csv('../../datasets/hf_scaled/HF_standardized.csv')
data.pop('DEATH_EVENT') #Remove target variable
numeric_vars = data.select_dtypes(include='number').columns

numeric_data = data
binary_data = data

for col in numeric_vars:
    num_unique = data[col].nunique()  # number of distinct values in the column
    if num_unique == 2:
        numeric_data = numeric_data.drop(columns=[col])  # Remove binary columns
    else:
        binary_data = binary_data.drop(columns=[col])  # Remove non-binary columns

print(numeric_data.head())
data = numeric_data

N_CLUSTERS = [2, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 30]
rows, cols = (6,4)

In [None]:
mse: list = []
sc: list = []
db: list = []
    
for k in N_CLUSTERS:
    fig, axs = plt.subplots(rows, cols, figsize=(cols*5, rows*5), squeeze=False)
    i, j = 0, 0

    estimator = KMeans(n_clusters=k)
    estimator.fit(data)
    mse.append(estimator.inertia_)  # inertia_: within-cluster sum of squared distances
    sc.append(silhouette_score(data, estimator.labels_))
    db.append(davies_bouldin_score(data, estimator.labels_))

    print(f"K - {k}")  # report the number of clusters, not the loop index
    
    for f1 in range (len(data.columns)):
        for f2 in range(f1+1, len(data.columns)): 
            ds.plot_clusters(data, f2, f1, estimator.labels_.astype(float), estimator.cluster_centers_, k,
                             f'KMeans k={k}', ax=axs[i,j])
            
            i, j = (i + 1, 0) if (j+1) % cols == 0 else (i, j + 1)    
    plt.show()

In [None]:
fig = plt.figure(figsize=(9,3))
ds.plot_line(N_CLUSTERS, mse, title='KMeans MSE', xlabel='k', ylabel='MSE')
plt.show()

print(db)

fig, ax = plt.subplots(1, 2, figsize=(9, 3), squeeze=False)
ds.plot_line(N_CLUSTERS, sc, title='KMeans SC', xlabel='k', ylabel='SC', ax=ax[0, 0], percentage=True)
ds.plot_line(N_CLUSTERS, db, title='KMeans DB', xlabel='k', ylabel='DB', ax=ax[0, 1], percentage=False)
plt.show()

## Heart Failure Prediction Dataset - Standardized + ANOVA + FG + Outlier + Balancing

### Imports

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import ds_functions as ds
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.metrics import davies_bouldin_score

In [None]:
data: pd.DataFrame = pd.read_csv('../../datasets/TO_TEST/HF/HF_S_FAnova_extra_outlierTrim_IQS_B.csv')
data.pop('DEATH_EVENT') #Remove target variable

N_CLUSTERS = [2, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 30]
rows, cols = (10,6)

In [None]:
mse: list = []
sc: list = []
db: list = []
    
for k in N_CLUSTERS:
    fig, axs = plt.subplots(rows, cols, figsize=(cols*5, rows*5), squeeze=False)
    i, j = 0, 0

    estimator = KMeans(n_clusters=k)
    estimator.fit(data)
    mse.append(estimator.inertia_)  # inertia_: within-cluster sum of squared distances
    sc.append(silhouette_score(data, estimator.labels_))
    db.append(davies_bouldin_score(data, estimator.labels_))

    print(f"K - {k}")  # report the number of clusters, not the loop index
    
    for f1 in range (len(data.columns)):
        for f2 in range(f1+1, len(data.columns)): 
            ds.plot_clusters(data, f2, f1, estimator.labels_.astype(float), estimator.cluster_centers_, k,
                             f'KMeans k={k}', ax=axs[i,j])
            
            i, j = (i + 1, 0) if (j+1) % cols == 0 else (i, j + 1)    
    plt.show()

In [None]:
fig = plt.figure(figsize=(9,3))
ds.plot_line(N_CLUSTERS, mse, title='KMeans MSE', xlabel='k', ylabel='MSE')
plt.show()

fig, ax = plt.subplots(1, 2, figsize=(9, 3), squeeze=False)
ds.plot_line(N_CLUSTERS, sc, title='KMeans SC', xlabel='k', ylabel='SC', ax=ax[0, 0], percentage=True)
ds.plot_line(N_CLUSTERS, db, title='KMeans DB', xlabel='k', ylabel='DB', ax=ax[0, 1], percentage=False)
plt.show()

### K-Means No Binary Data

In [None]:
data: pd.DataFrame = pd.read_csv('../../datasets/TO_TEST/HF/HF_S_FAnova_extra_outlierTrim_IQS_B.csv')
data.pop('DEATH_EVENT') #Remove target variable
numeric_vars = data.select_dtypes(include='number').columns

numeric_data = data
binary_data = data

for col in numeric_vars:
    num_unique = data[col].nunique()  # number of distinct values in the column
    if num_unique in (1, 2):
        numeric_data = numeric_data.drop(columns=[col])  # Remove binary and constant columns
    else:
        binary_data = binary_data.drop(columns=[col])  # Remove non-binary columns

print(numeric_data.head())
data = numeric_data

N_CLUSTERS = [2, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 30]
rows, cols = (6,4)

In [None]:
mse: list = []
sc: list = []
db: list = []
    
for k in N_CLUSTERS:
    fig, axs = plt.subplots(rows, cols, figsize=(cols*5, rows*5), squeeze=False)
    i, j = 0, 0

    estimator = KMeans(n_clusters=k)
    estimator.fit(data)
    mse.append(estimator.inertia_)  # inertia_: within-cluster sum of squared distances
    sc.append(silhouette_score(data, estimator.labels_))
    db.append(davies_bouldin_score(data, estimator.labels_))

    print(f"K - {k}")  # report the number of clusters, not the loop index
    
    for f1 in range (len(data.columns)):
        for f2 in range(f1+1, len(data.columns)): 
            ds.plot_clusters(data, f2, f1, estimator.labels_.astype(float), estimator.cluster_centers_, k,
                             f'KMeans k={k}', ax=axs[i,j])
            
            i, j = (i + 1, 0) if (j+1) % cols == 0 else (i, j + 1)    
    plt.show()

In [None]:
fig = plt.figure(figsize=(9,3))
ds.plot_line(N_CLUSTERS, mse, title='KMeans MSE', xlabel='k', ylabel='MSE')
plt.show()

fig, ax = plt.subplots(1, 2, figsize=(9, 3), squeeze=False)
ds.plot_line(N_CLUSTERS, sc, title='KMeans SC', xlabel='k', ylabel='SC', ax=ax[0, 0], percentage=True)
ds.plot_line(N_CLUSTERS, db, title='KMeans DB', xlabel='k', ylabel='DB', ax=ax[0, 1], percentage=False)
plt.show()

## Heart Failure Prediction Dataset - Standardized + Importance + FG + Outlier + Balancing

### Imports

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import ds_functions as ds
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.metrics import davies_bouldin_score

In [None]:
data: pd.DataFrame = pd.read_csv('../../datasets/TO_TEST/HF/HR_S_FImp_extra_outlierTrim_IQS_B.csv')
data.pop('DEATH_EVENT') #Remove target variable

N_CLUSTERS = [2, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 30]
rows, cols = (5,6)

In [None]:
mse: list = []
sc: list = []
db: list = []
    
for k in N_CLUSTERS:
    fig, axs = plt.subplots(rows, cols, figsize=(cols*5, rows*5), squeeze=False)
    i, j = 0, 0

    estimator = KMeans(n_clusters=k)
    estimator.fit(data)
    mse.append(estimator.inertia_)  # inertia_: within-cluster sum of squared distances
    sc.append(silhouette_score(data, estimator.labels_))
    db.append(davies_bouldin_score(data, estimator.labels_))

    print(f"K - {k}")  # report the number of clusters, not the loop index
    
    for f1 in range (len(data.columns)):
        for f2 in range(f1+1, len(data.columns)): 
            ds.plot_clusters(data, f2, f1, estimator.labels_.astype(float), estimator.cluster_centers_, k,
                             f'KMeans k={k}', ax=axs[i,j])
            
            i, j = (i + 1, 0) if (j+1) % cols == 0 else (i, j + 1)    
    plt.show()

In [None]:
fig = plt.figure(figsize=(9,3))
ds.plot_line(N_CLUSTERS, mse, title='KMeans MSE', xlabel='k', ylabel='MSE')
plt.show()

fig, ax = plt.subplots(1, 2, figsize=(9, 3), squeeze=False)
ds.plot_line(N_CLUSTERS, sc, title='KMeans SC', xlabel='k', ylabel='SC', ax=ax[0, 0], percentage=True)
ds.plot_line(N_CLUSTERS, db, title='KMeans DB', xlabel='k', ylabel='DB', ax=ax[0, 1], percentage=False)
plt.show()

### K-Means No Binary Data

In [None]:
data: pd.DataFrame = pd.read_csv('../../datasets/TO_TEST/HF/HR_S_FImp_extra_outlierTrim_IQS_B.csv')
data.pop('DEATH_EVENT') #Remove target variable
numeric_vars = data.select_dtypes(include='number').columns

numeric_data = data
binary_data = data

for col in numeric_vars:
    num_unique = data[col].nunique()  # number of distinct values in the column
    if num_unique in (1, 2):
        numeric_data = numeric_data.drop(columns=[col])  # Remove binary and constant columns
    else:
        binary_data = binary_data.drop(columns=[col])  # Remove non-binary columns

print(numeric_data.head())
data = numeric_data

N_CLUSTERS = [2, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 30]
rows, cols = (6,4)

In [None]:
mse: list = []
sc: list = []
db: list = []
    
for k in N_CLUSTERS:
    fig, axs = plt.subplots(rows, cols, figsize=(cols*5, rows*5), squeeze=False)
    i, j = 0, 0

    estimator = KMeans(n_clusters=k)
    estimator.fit(data)
    mse.append(estimator.inertia_)  # inertia_: within-cluster sum of squared distances
    sc.append(silhouette_score(data, estimator.labels_))
    db.append(davies_bouldin_score(data, estimator.labels_))

    print(f"K - {k}")  # report the number of clusters, not the loop index
    
    for f1 in range (len(data.columns)):
        for f2 in range(f1+1, len(data.columns)): 
            ds.plot_clusters(data, f2, f1, estimator.labels_.astype(float), estimator.cluster_centers_, k,
                             f'KMeans k={k}', ax=axs[i,j])
            
            i, j = (i + 1, 0) if (j+1) % cols == 0 else (i, j + 1)    
    plt.show()

In [None]:
fig = plt.figure(figsize=(9,3))
ds.plot_line(N_CLUSTERS, mse, title='KMeans MSE', xlabel='k', ylabel='MSE')
plt.show()

fig, ax = plt.subplots(1, 2, figsize=(9, 3), squeeze=False)
ds.plot_line(N_CLUSTERS, sc, title='KMeans SC', xlabel='k', ylabel='SC', ax=ax[0, 0], percentage=True)
ds.plot_line(N_CLUSTERS, db, title='KMeans DB', xlabel='k', ylabel='DB', ax=ax[0, 1], percentage=False)
plt.show()

## Heart Failure Prediction Dataset - Standardized + Mixed + Outlier + Balancing

### Imports

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import ds_functions as ds
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.metrics import davies_bouldin_score

In [None]:
data: pd.DataFrame = pd.read_csv('../../datasets/TO_TEST/HF/HR_S_FMixed_outlierTrim_IQS_B.csv')
data.pop('DEATH_EVENT') #Remove target variable

N_CLUSTERS = [2, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 30]
rows, cols = (2,5)

In [None]:
mse: list = []
sc: list = []
db: list = []
    
for k in N_CLUSTERS:
    fig, axs = plt.subplots(rows, cols, figsize=(cols*5, rows*5), squeeze=False)
    i, j = 0, 0

    estimator = KMeans(n_clusters=k)
    estimator.fit(data)
    mse.append(estimator.inertia_)  # inertia_: within-cluster sum of squared distances
    sc.append(silhouette_score(data, estimator.labels_))
    db.append(davies_bouldin_score(data, estimator.labels_))

    print(f"K - {k}")  # report the number of clusters, not the loop index
    
    for f1 in range (len(data.columns)):
        for f2 in range(f1+1, len(data.columns)): 
            ds.plot_clusters(data, f2, f1, estimator.labels_.astype(float), estimator.cluster_centers_, k,
                             f'KMeans k={k}', ax=axs[i,j])
            
            i, j = (i + 1, 0) if (j+1) % cols == 0 else (i, j + 1)    
    plt.show()

In [None]:
fig = plt.figure(figsize=(9,3))
ds.plot_line(N_CLUSTERS, mse, title='KMeans MSE', xlabel='k', ylabel='MSE')
plt.show()

fig, ax = plt.subplots(1, 2, figsize=(9, 3), squeeze=False)
ds.plot_line(N_CLUSTERS, sc, title='KMeans SC', xlabel='k', ylabel='SC', ax=ax[0, 0], percentage=True)
ds.plot_line(N_CLUSTERS, db, title='KMeans DB', xlabel='k', ylabel='DB', ax=ax[0, 1], percentage=False)
plt.show()

### K-Means No Binary Data

In [None]:
data: pd.DataFrame = pd.read_csv('../../datasets/TO_TEST/HF/HR_S_FMixed_outlierTrim_IQS_B.csv')
data.pop('DEATH_EVENT') #Remove target variable
numeric_vars = data.select_dtypes(include='number').columns

numeric_data = data
binary_data = data

for col in numeric_vars:
    num_unique = data[col].nunique()  # number of distinct values in the column
    if num_unique in (1, 2):
        numeric_data = numeric_data.drop(columns=[col])  # Remove binary and constant columns
    else:
        binary_data = binary_data.drop(columns=[col])  # Remove non-binary columns

print(numeric_data.head())
data = numeric_data

N_CLUSTERS = [2, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 30]
rows, cols = (3,4)

In [None]:
mse: list = []
sc: list = []
db: list = []
    
for k in N_CLUSTERS:
    fig, axs = plt.subplots(rows, cols, figsize=(cols*5, rows*5), squeeze=False)
    i, j = 0, 0

    estimator = KMeans(n_clusters=k)
    estimator.fit(data)
    mse.append(estimator.inertia_)  # inertia_: within-cluster sum of squared distances
    sc.append(silhouette_score(data, estimator.labels_))
    db.append(davies_bouldin_score(data, estimator.labels_))

    print(f"K - {k}")  # report the number of clusters, not the loop index
    
    for f1 in range (len(data.columns)):
        for f2 in range(f1+1, len(data.columns)): 
            ds.plot_clusters(data, f2, f1, estimator.labels_.astype(float), estimator.cluster_centers_, k,
                             f'KMeans k={k}', ax=axs[i,j])
            
            i, j = (i + 1, 0) if (j+1) % cols == 0 else (i, j + 1)    
    plt.show()

In [None]:
fig = plt.figure(figsize=(9,3))
ds.plot_line(N_CLUSTERS, mse, title='KMeans MSE', xlabel='k', ylabel='MSE')
plt.show()

fig, ax = plt.subplots(1, 2, figsize=(9, 3), squeeze=False)
ds.plot_line(N_CLUSTERS, sc, title='KMeans SC', xlabel='k', ylabel='SC', ax=ax[0, 0], percentage=True)
ds.plot_line(N_CLUSTERS, db, title='KMeans DB', xlabel='k', ylabel='DB', ax=ax[0, 1], percentage=False)
plt.show()