# Expectation Maximization Clustering

"The K-means approach is an example of a hard assignment clustering, where each point can belong to only one cluster. Expectation-Maximization algorithm is a way to generalize the approach to consider the soft assignment of points to clusters so that each point has a probability of belonging to each cluster."

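As a rough illustration of the hard/soft distinction described above, the sketch below fits K-means and a Gaussian mixture to synthetic one-dimensional data. The data, the seed, and the variable names are assumptions made for this example; they are not part of the heart failure experiments further down.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Synthetic 1-D data: two overlapping groups (illustrative values only).
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2.0, 1.0, 200),
                    rng.normal(2.0, 1.0, 200)]).reshape(-1, 1)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

hard_labels = kmeans.predict(X)    # hard assignment: one cluster index per point
soft_probs = gmm.predict_proba(X)  # soft assignment: one probability per component

# A point halfway between the groups: K-means has to pick a single cluster,
# while the mixture reports how uncertain that choice is.
midpoint = np.array([[0.0]])
print(kmeans.predict(midpoint))
print(gmm.predict_proba(midpoint).round(3))
```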
"Maximum likelihood estimation is an approach to density estimation for a dataset by searching across probability distributions and their parameters.

It is a general and effective approach that underlies many machine learning algorithms, although it requires that the training dataset is complete, e.g. all relevant interacting random variables are present. Maximum likelihood becomes intractable if there are variables that interact with those in the dataset but were hidden or not observed, so-called latent variables.

The expectation-maximization algorithm is an approach for performing maximum likelihood estimation in the presence of latent variables. It does this by first estimating the values for the latent variables, then optimizing the model, then repeating these two steps until convergence. It is an effective and general approach and is most commonly used for density estimation with missing data, such as clustering algorithms like the Gaussian Mixture Model."
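
The two alternating steps described above can be sketched for the simplest case, a two-component mixture of one-dimensional Gaussians. This is a minimal illustration, not the implementation used by `GaussianMixture`; the function name `em_gmm_1d`, the initialisation, and the synthetic data are assumptions of this sketch.

```python
import numpy as np

def em_gmm_1d(x, n_iter=100, tol=1e-6):
    """Fit a two-component 1-D Gaussian mixture with plain EM (sketch)."""
    # Crude initialisation (an assumption of this sketch): data extremes as means.
    mu = np.array([x.min(), x.max()], dtype=float)
    var = np.array([x.var(), x.var()], dtype=float)
    w = np.array([0.5, 0.5])
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: estimate the latent variable, i.e. the posterior probability
        # (responsibility) of each component for each point.
        dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        total = dens.sum(axis=1, keepdims=True)
        resp = dens / total
        # M-step: re-estimate weights, means and variances from the
        # responsibility-weighted points.
        nk = resp.sum(axis=0)
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        # Log-likelihood under the parameters used in this E-step;
        # stop once it no longer improves noticeably.
        ll = np.log(total).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return w, mu, var

# Synthetic data with two well-separated groups (illustrative values only).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3.0, 1.0, 300), rng.normal(3.0, 1.0, 300)])
weights, means, variances = em_gmm_1d(x)
print(weights.round(2), means.round(2), variances.round(2))
```

The responsibilities computed in the E-step are the soft cluster assignments; taking their argmax per point gives a hard labelling comparable to what `predict` returns in the cells below.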

## Heart Failure Prediction Dataset - Standardized

### Imports

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import ds_functions as ds
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
from sklearn.metrics import davies_bouldin_score

In [None]:
data: pd.DataFrame = pd.read_csv('../../datasets/hf_scaled/HF_standardized.csv')
data.pop('DEATH_EVENT') #Remove target variable

N_CLUSTERS = [2, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 30]
rows, cols = (11,6)

In [None]:
mse: list = []
sc: list = []
db: list = []
    
for n in range(len(N_CLUSTERS)):
    fig, axs = plt.subplots(rows, cols, figsize=(cols*5, rows*5), squeeze=False)
    i, j = 0, 0
    
    k = N_CLUSTERS[n]
    estimator = GaussianMixture(n_components=k)
    estimator.fit(data)
    labels = estimator.predict(data)
    
    mse.append(ds.compute_mse(data.values, labels, estimator.means_))
    sc.append(silhouette_score(data, labels))
    db.append(davies_bouldin_score(data, labels))
    
    print("K - " + str(k))  # report the number of components, not the loop index
    
    for f1 in range (len(data.columns)):
        for f2 in range(f1+1, len(data.columns)): 
            ds.plot_clusters(data, f2, f1, labels.astype(float), estimator.means_, k,
                     f'EM k={k}', ax=axs[i,j])
            
            i, j = (i + 1, 0) if (j+1) % cols == 0 else (i, j + 1)    
    plt.show()

In [None]:
fig = plt.figure(figsize=(9,3))
ds.plot_line(N_CLUSTERS, mse, title='EM MSE', xlabel='k', ylabel='MSE')
plt.show()

fig, ax = plt.subplots(1, 2, figsize=(9, 3), squeeze=False)
ds.plot_line(N_CLUSTERS, sc, title='EM SC', xlabel='k', ylabel='SC', ax=ax[0, 0], percentage=True)
ds.plot_line(N_CLUSTERS, db, title='EM DB', xlabel='k', ylabel='DB', ax=ax[0, 1], percentage=False)
plt.show()

"
Gaussian distributions are continuous distributions.

There is no meaningful way to apply this famous "bell shaped curve" onto categorical data - binary encoding clearly does not make sense either. You have to find something else to use instead of Gaussians...

So instead of hacking to make your data fake Gaussian, you should rather make the algorithm match your data and problem.
"

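One common way to "make the algorithm match your data" for binary features is to replace the Gaussian components with Bernoulli ones. The sketch below is only an illustration of that idea and is not used anywhere in this notebook; the function name `em_bernoulli_mixture`, the prototype probabilities, and the synthetic data are assumptions.

```python
import numpy as np

def em_bernoulli_mixture(X, k, n_iter=100, seed=0):
    """EM for a mixture of independent Bernoulli features (sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.full(k, 1.0 / k)
    p = rng.uniform(0.25, 0.75, size=(k, d))  # per-component P(feature = 1)
    for _ in range(n_iter):
        # E-step in log space: log w_k + sum_j log Bernoulli(x_j | p_kj).
        log_post = X @ np.log(p).T + (1 - X) @ np.log(1 - p).T + np.log(w)
        log_post -= log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: weighted frequencies; the epsilon and the clipping guard
        # against empty components and log(0).
        nk = resp.sum(axis=0) + 1e-12
        w = nk / nk.sum()
        p = np.clip((resp.T @ X) / nk[:, None], 1e-6, 1 - 1e-6)
    return w, p, resp

# Synthetic binary data drawn from two hypothetical prototypes.
rng = np.random.default_rng(1)
proto = np.array([[0.9, 0.9, 0.9, 0.1, 0.1, 0.1],
                  [0.1, 0.1, 0.1, 0.9, 0.9, 0.9]])
truth = np.repeat([0, 1], 200)
X = (rng.random((400, 6)) < proto[truth]).astype(float)

weights, probs, resp = em_bernoulli_mixture(X, k=2)
labels = resp.argmax(axis=1)
print(weights.round(2))
print(probs.round(2))
```

The notebook itself takes the simpler route below of dropping the binary columns and keeping the Gaussian mixture.
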
### EM No Binary Data

In [None]:
data: pd.DataFrame = pd.read_csv('../../datasets/hf_scaled/HF_standardized.csv')
data.pop('DEATH_EVENT') #Remove target variable
numeric_vars = data.select_dtypes(include='number').columns

numeric_data = data
binary_data = data

# Look columns up by name so the uniqueness check and the drop refer to the same column
for var in numeric_vars:
    num_unique = len(set(data[var].values))
    if num_unique == 2:
        numeric_data = numeric_data.drop(columns=[var]) #Remove binary columns
    else:
        binary_data = binary_data.drop(columns=[var]) #Remove non-binary columns

print(numeric_data.head())
data = numeric_data

N_CLUSTERS = [2, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 30]
rows, cols = (6,4)

In [None]:
mse: list = []
sc: list = []
db: list = []
    
for n in range(len(N_CLUSTERS)):
    fig, axs = plt.subplots(rows, cols, figsize=(cols*5, rows*5), squeeze=False)
    i, j = 0, 0
    
    k = N_CLUSTERS[n]
    estimator = GaussianMixture(n_components=k)
    estimator.fit(data)
    labels = estimator.predict(data)
    
    mse.append(ds.compute_mse(data.values, labels, estimator.means_))
    sc.append(silhouette_score(data, labels))
    db.append(davies_bouldin_score(data, labels))
    
    print("K - " + str(k))  # report the number of components, not the loop index
    
    for f1 in range (len(data.columns)):
        for f2 in range(f1+1, len(data.columns)): 
            ds.plot_clusters(data, f2, f1, labels.astype(float), estimator.means_, k,
                     f'EM k={k}', ax=axs[i,j])
            
            i, j = (i + 1, 0) if (j+1) % cols == 0 else (i, j + 1)    
    plt.show()

In [None]:
fig = plt.figure(figsize=(9,3))
ds.plot_line(N_CLUSTERS, mse, title='EM MSE', xlabel='k', ylabel='MSE')
plt.show()

print(db)

fig, ax = plt.subplots(1, 2, figsize=(9, 3), squeeze=False)
ds.plot_line(N_CLUSTERS, sc, title='EM SC', xlabel='k', ylabel='SC', ax=ax[0, 0], percentage=True)
ds.plot_line(N_CLUSTERS, db, title='EM DB', xlabel='k', ylabel='DB', ax=ax[0, 1], percentage=False)
plt.show()

<br/>
<br/>
<br/>
<br/>
<br/>

## Heart Failure Prediction Dataset - Standardized + ANOVA + FG + Outlier + Balancing

### Imports

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import ds_functions as ds
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
from sklearn.metrics import davies_bouldin_score

In [None]:
data: pd.DataFrame = pd.read_csv('../../datasets/TO_TEST/HF/HF_S_FAnova_extra_outlierTrim_IQS_B.csv')
data.pop('DEATH_EVENT') #Remove target variable

N_CLUSTERS = [2, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 30]
rows, cols = (10,6)

In [None]:
mse: list = []
sc: list = []
db: list = []
    
for n in range(len(N_CLUSTERS)):
    fig, axs = plt.subplots(rows, cols, figsize=(cols*5, rows*5), squeeze=False)
    i, j = 0, 0
    
    k = N_CLUSTERS[n]
    estimator = GaussianMixture(n_components=k)
    estimator.fit(data)
    labels = estimator.predict(data)
    
    mse.append(ds.compute_mse(data.values, labels, estimator.means_))
    sc.append(silhouette_score(data, labels))
    db.append(davies_bouldin_score(data, labels))
    
    print("K - " + str(k))  # report the number of components, not the loop index
    
    for f1 in range (len(data.columns)):
        for f2 in range(f1+1, len(data.columns)): 
            ds.plot_clusters(data, f2, f1, labels.astype(float), estimator.means_, k,
                     f'EM k={k}', ax=axs[i,j])
            
            i, j = (i + 1, 0) if (j+1) % cols == 0 else (i, j + 1)    
    plt.show()

In [None]:
fig = plt.figure(figsize=(9,3))
ds.plot_line(N_CLUSTERS, mse, title='EM MSE', xlabel='k', ylabel='MSE')
plt.show()

fig, ax = plt.subplots(1, 2, figsize=(9, 3), squeeze=False)
ds.plot_line(N_CLUSTERS, sc, title='EM SC', xlabel='k', ylabel='SC', ax=ax[0, 0], percentage=True)
ds.plot_line(N_CLUSTERS, db, title='EM DB', xlabel='k', ylabel='DB', ax=ax[0, 1], percentage=False)
plt.show()

"Yes, it is unlikely that binary data can be clustered satisfactorily. To see why, consider what happens as the K-Means algorithm processes cases.

For binary data, the Euclidean distance measure used by K-Means reduces to counting the number of variables on which two cases disagree. After the initial centers are chosen (which depends on the order of the cases), the centers are still binary data. For the first iteration, as the cases are compared to cluster means, they will always be at some integer distance from each of the centers. There will often be ties, and the case will be assigned to a cluster in an arbitrary manner. Using Euclidean distance (the only measure available to K-Means), it is impossible to overcome the symmetry and break the ties in any meaningful way.
"

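The claim in the quotation above can be checked directly: for 0/1 vectors the squared Euclidean distance equals the number of positions in which two cases disagree (the Hamming distance), so distances to binary centers take few distinct values and ties are easy to produce. The vectors below are made up for illustration.

```python
import numpy as np

a = np.array([1, 0, 1, 1, 0])
b = np.array([0, 0, 1, 0, 1])

squared_euclidean = int(((a - b) ** 2).sum())
hamming = int((a != b).sum())
print(squared_euclidean, hamming)  # both count the 3 positions where a and b differ

# Two binary centers and a case that disagrees with each of them in two positions.
centers = np.array([[1, 1, 0, 0],
                    [0, 0, 1, 1]])
x = np.array([1, 0, 1, 0])
dists = ((centers - x) ** 2).sum(axis=1)
print(dists)  # [2 2]: a tie, so the assignment of x is arbitrary
```
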
### EM No Binary Data

In [None]:
data: pd.DataFrame = pd.read_csv('../../datasets/TO_TEST/HF/HF_S_FAnova_extra_outlierTrim_IQS_B.csv')
data.pop('DEATH_EVENT') #Remove target variable
numeric_vars = data.select_dtypes(include='number').columns

numeric_data = data
binary_data = data

# Look columns up by name so the uniqueness check and the drop refer to the same column
for var in numeric_vars:
    num_unique = len(set(data[var].values))
    if num_unique <= 2:
        numeric_data = numeric_data.drop(columns=[var]) #Remove binary (or constant) columns
    else:
        binary_data = binary_data.drop(columns=[var]) #Remove non-binary columns

print(numeric_data.head())
data = numeric_data

N_CLUSTERS = [2, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 30]
rows, cols = (6,4)

In [None]:
mse: list = []
sc: list = []
db: list = []
    
for n in range(len(N_CLUSTERS)):
    fig, axs = plt.subplots(rows, cols, figsize=(cols*5, rows*5), squeeze=False)
    i, j = 0, 0
    
    k = N_CLUSTERS[n]
    estimator = GaussianMixture(n_components=k)
    estimator.fit(data)
    labels = estimator.predict(data)
    
    mse.append(ds.compute_mse(data.values, labels, estimator.means_))
    sc.append(silhouette_score(data, labels))
    db.append(davies_bouldin_score(data, labels))
    
    print("K - " + str(k))  # report the number of components, not the loop index
    
    for f1 in range (len(data.columns)):
        for f2 in range(f1+1, len(data.columns)): 
            ds.plot_clusters(data, f2, f1, labels.astype(float), estimator.means_, k,
                     f'EM k={k}', ax=axs[i,j])
            
            i, j = (i + 1, 0) if (j+1) % cols == 0 else (i, j + 1)    
    plt.show()

In [None]:
fig = plt.figure(figsize=(9,3))
ds.plot_line(N_CLUSTERS, mse, title='EM MSE', xlabel='k', ylabel='MSE')
plt.show()

fig, ax = plt.subplots(1, 2, figsize=(9, 3), squeeze=False)
ds.plot_line(N_CLUSTERS, sc, title='EM SC', xlabel='k', ylabel='SC', ax=ax[0, 0], percentage=True)
ds.plot_line(N_CLUSTERS, db, title='EM DB', xlabel='k', ylabel='DB', ax=ax[0, 1], percentage=False)
plt.show()

<br/>
<br/>
<br/>
<br/>
<br/>

## Heart Failure Prediction Dataset - Standardized + Importance + FG + Outlier + Balancing

### Imports

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import ds_functions as ds
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
from sklearn.metrics import davies_bouldin_score

In [None]:
data: pd.DataFrame = pd.read_csv('../../datasets/TO_TEST/HF/HR_S_FImp_extra_outlierTrim_IQS_B.csv')
data.pop('DEATH_EVENT') #Remove target variable

N_CLUSTERS = [2, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 30]
rows, cols = (5,6)

In [None]:
mse: list = []
sc: list = []
db: list = []
    
for n in range(len(N_CLUSTERS)):
    fig, axs = plt.subplots(rows, cols, figsize=(cols*5, rows*5), squeeze=False)
    i, j = 0, 0
    
    k = N_CLUSTERS[n]
    estimator = GaussianMixture(n_components=k)
    estimator.fit(data)
    labels = estimator.predict(data)
    
    mse.append(ds.compute_mse(data.values, labels, estimator.means_))
    sc.append(silhouette_score(data, labels))
    db.append(davies_bouldin_score(data, labels))
    
    print("K - " + str(k))  # report the number of components, not the loop index
    
    for f1 in range (len(data.columns)):
        for f2 in range(f1+1, len(data.columns)): 
            ds.plot_clusters(data, f2, f1, labels.astype(float), estimator.means_, k,
                     f'EM k={k}', ax=axs[i,j])
            
            i, j = (i + 1, 0) if (j+1) % cols == 0 else (i, j + 1)    
    plt.show()

In [None]:
fig = plt.figure(figsize=(9,3))
ds.plot_line(N_CLUSTERS, mse, title='EM MSE', xlabel='k', ylabel='MSE')
plt.show()

fig, ax = plt.subplots(1, 2, figsize=(9, 3), squeeze=False)
ds.plot_line(N_CLUSTERS, sc, title='EM SC', xlabel='k', ylabel='SC', ax=ax[0, 0], percentage=True)
ds.plot_line(N_CLUSTERS, db, title='EM DB', xlabel='k', ylabel='DB', ax=ax[0, 1], percentage=False)
plt.show()

### EM No Binary Data

In [None]:
data: pd.DataFrame = pd.read_csv('../../datasets/TO_TEST/HF/HR_S_FImp_extra_outlierTrim_IQS_B.csv')
data.pop('DEATH_EVENT') #Remove target variable
numeric_vars = data.select_dtypes(include='number').columns

numeric_data = data
binary_data = data

# Look columns up by name so the uniqueness check and the drop refer to the same column
for var in numeric_vars:
    num_unique = len(set(data[var].values))
    if num_unique <= 2:
        numeric_data = numeric_data.drop(columns=[var]) #Remove binary (or constant) columns
    else:
        binary_data = binary_data.drop(columns=[var]) #Remove non-binary columns

print(numeric_data.head())
data = numeric_data

N_CLUSTERS = [2, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 30]
rows, cols = (6,4)

In [None]:
mse: list = []
sc: list = []
db: list = []
    
for n in range(len(N_CLUSTERS)):
    fig, axs = plt.subplots(rows, cols, figsize=(cols*5, rows*5), squeeze=False)
    i, j = 0, 0
    
    k = N_CLUSTERS[n]
    estimator = GaussianMixture(n_components=k)
    estimator.fit(data)
    labels = estimator.predict(data)
    
    mse.append(ds.compute_mse(data.values, labels, estimator.means_))
    sc.append(silhouette_score(data, labels))
    db.append(davies_bouldin_score(data, labels))
    
    print("K - " + str(k))  # report the number of components, not the loop index
    
    for f1 in range (len(data.columns)):
        for f2 in range(f1+1, len(data.columns)): 
            ds.plot_clusters(data, f2, f1, labels.astype(float), estimator.means_, k,
                     f'EM k={k}', ax=axs[i,j])
            
            i, j = (i + 1, 0) if (j+1) % cols == 0 else (i, j + 1)    
    plt.show()

In [None]:
fig = plt.figure(figsize=(9,3))
ds.plot_line(N_CLUSTERS, mse, title='EM MSE', xlabel='k', ylabel='MSE')
plt.show()

fig, ax = plt.subplots(1, 2, figsize=(9, 3), squeeze=False)
ds.plot_line(N_CLUSTERS, sc, title='EM SC', xlabel='k', ylabel='SC', ax=ax[0, 0], percentage=True)
ds.plot_line(N_CLUSTERS, db, title='EM DB', xlabel='k', ylabel='DB', ax=ax[0, 1], percentage=False)
plt.show()

<br/>
<br/>
<br/>
<br/>
<br/>

## Heart Failure Prediction Dataset - Standardized + Mixed + Outlier + Balancing

### Imports

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import ds_functions as ds
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
from sklearn.metrics import davies_bouldin_score

In [None]:
data: pd.DataFrame = pd.read_csv('../../datasets/TO_TEST/HF/HR_S_FMixed_outlierTrim_IQS_B.csv')
data.pop('DEATH_EVENT') #Remove target variable

N_CLUSTERS = [2, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 30]
rows, cols = (2,5)

In [None]:
mse: list = []
sc: list = []
db: list = []
    
for n in range(len(N_CLUSTERS)):
    fig, axs = plt.subplots(rows, cols, figsize=(cols*5, rows*5), squeeze=False)
    i, j = 0, 0
    
    k = N_CLUSTERS[n]
    estimator = GaussianMixture(n_components=k)
    estimator.fit(data)
    labels = estimator.predict(data)
    
    mse.append(ds.compute_mse(data.values, labels, estimator.means_))
    sc.append(silhouette_score(data, labels))
    db.append(davies_bouldin_score(data, labels))
    
    print("K - " + str(k))  # report the number of components, not the loop index
    
    for f1 in range (len(data.columns)):
        for f2 in range(f1+1, len(data.columns)): 
            ds.plot_clusters(data, f2, f1, labels.astype(float), estimator.means_, k,
                     f'EM k={k}', ax=axs[i,j])
            
            i, j = (i + 1, 0) if (j+1) % cols == 0 else (i, j + 1)    
    plt.show()

In [None]:
fig = plt.figure(figsize=(9,3))
ds.plot_line(N_CLUSTERS, mse, title='EM MSE', xlabel='k', ylabel='MSE')
plt.show()

fig, ax = plt.subplots(1, 2, figsize=(9, 3), squeeze=False)
ds.plot_line(N_CLUSTERS, sc, title='EM SC', xlabel='k', ylabel='SC', ax=ax[0, 0], percentage=True)
ds.plot_line(N_CLUSTERS, db, title='EM DB', xlabel='k', ylabel='DB', ax=ax[0, 1], percentage=False)
plt.show()

### EM No Binary Data

In [None]:
data: pd.DataFrame = pd.read_csv('../../datasets/TO_TEST/HF/HR_S_FMixed_outlierTrim_IQS_B.csv')
data.pop('DEATH_EVENT') #Remove target variable
numeric_vars = data.select_dtypes(include='number').columns

numeric_data = data
binary_data = data

# Look columns up by name so the uniqueness check and the drop refer to the same column
for var in numeric_vars:
    num_unique = len(set(data[var].values))
    if num_unique <= 2:
        numeric_data = numeric_data.drop(columns=[var]) #Remove binary (or constant) columns
    else:
        binary_data = binary_data.drop(columns=[var]) #Remove non-binary columns

print(numeric_data.head())
data = numeric_data

N_CLUSTERS = [2, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 30]
rows, cols = (3,4)

In [None]:
mse: list = []
sc: list = []
db: list = []
    
for n in range(len(N_CLUSTERS)):
    fig, axs = plt.subplots(rows, cols, figsize=(cols*5, rows*5), squeeze=False)
    i, j = 0, 0
    
    k = N_CLUSTERS[n]
    estimator = GaussianMixture(n_components=k)
    estimator.fit(data)
    labels = estimator.predict(data)
    
    mse.append(ds.compute_mse(data.values, labels, estimator.means_))
    sc.append(silhouette_score(data, labels))
    db.append(davies_bouldin_score(data, labels))
    
    print("K - " + str(k))  # report the number of components, not the loop index
    
    for f1 in range (len(data.columns)):
        for f2 in range(f1+1, len(data.columns)): 
            ds.plot_clusters(data, f2, f1, labels.astype(float), estimator.means_, k,
                     f'EM k={k}', ax=axs[i,j])
            
            i, j = (i + 1, 0) if (j+1) % cols == 0 else (i, j + 1)    
    plt.show()

In [None]:
fig = plt.figure(figsize=(9,3))
ds.plot_line(N_CLUSTERS, mse, title='EM MSE', xlabel='k', ylabel='MSE')
plt.show()

fig, ax = plt.subplots(1, 2, figsize=(9, 3), squeeze=False)
ds.plot_line(N_CLUSTERS, sc, title='EM SC', xlabel='k', ylabel='SC', ax=ax[0, 0], percentage=True)
ds.plot_line(N_CLUSTERS, db, title='EM DB', xlabel='k', ylabel='DB', ax=ax[0, 1], percentage=False)
plt.show()