In [1]:
%reload_ext nb_black

### 1. Apply GMM to the heart disease data by setting n_components=2. Get the ARI and silhouette scores for your solution and compare them with those of the k-means and hierarchical clustering solutions that you implemented in the assignments of the previous checkpoints. Which algorithm performs better?

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.mixture import GaussianMixture
from sklearn import metrics
from sqlalchemy import create_engine
import warnings
warnings.filterwarnings('ignore')
import config


In [3]:
postgres_user = config.user
postgres_pw = config.password
postgres_host = config.host
postgres_port = config.port
postgres_db = "heartdisease"

engine = create_engine(
    "postgresql://{}:{}@{}:{}/{}".format(
        postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db
    )
)

df = pd.read_sql_query("select * from heartdisease", con=engine)

# No need for an open connection,
# because you're only doing a single query
engine.dispose()

# Define the features and the outcome
X = df.iloc[:, :13]
y = df.iloc[:, 13]

# Replace missing values (marked by ?) with a 0
X = X.replace(to_replace="?", value=0)

# make y binary
y = np.where(y > 0, 0, 1)

# Scale the data
scaler = StandardScaler()
scaled = scaler.fit_transform(X)

In [7]:
gmm_cluster = GaussianMixture(n_components=2, random_state=1234)

clusters = gmm_cluster.fit_predict(scaled)

print("ARI score: {}".format(metrics.adjusted_rand_score(y, clusters)))
print('Silhouette Score {}'.format(metrics.silhouette_score(scaled, clusters, metric='euclidean')))


ARI score: 0.4207322145049338
Silhouette Score 0.16118591340148433


With an ARI of about 0.42 and a silhouette score of about 0.16, GMM scores lower than both k-means and hierarchical clustering on both metrics.
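The comparison above relies on scores from earlier checkpoints that are not reproduced in this notebook. As a rough sketch of how the three algorithms could be scored side by side in one place, the block below uses a synthetic dataset from `make_classification` as a stand-in for the scaled heart disease features. The names `X_demo`, `y_demo`, `models`, and `scores` are hypothetical, and the Ward linkage is an assumption rather than the setting used in the earlier assignments.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_classification
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled heart disease features (an assumption,
# so the sketch runs without the database connection).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=5, random_state=123
)
X_demo = StandardScaler().fit_transform(X_demo)

# One two-cluster model per algorithm; the linkage choice is an assumption.
models = {
    "k-means": KMeans(n_clusters=2, n_init=10, random_state=123),
    "hierarchical": AgglomerativeClustering(n_clusters=2, linkage="ward"),
    "GMM": GaussianMixture(n_components=2, random_state=123),
}

scores = {}
for name, model in models.items():
    labels = model.fit_predict(X_demo)
    scores[name] = {
        # ARI compares the cluster labels with the known classes.
        "ari": adjusted_rand_score(y_demo, labels),
        # Silhouette needs only the features and the cluster labels.
        "silhouette": silhouette_score(X_demo, labels, metric="euclidean"),
    }
    print(
        "{}: ARI={:.3f}, silhouette={:.3f}".format(
            name, scores[name]["ari"], scores[name]["silhouette"]
        )
    )
```

Swapping `X_demo` and `y_demo` for `scaled` and `y` would apply the same loop to the heart disease data, provided the earlier cells have been run.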

### 2. The GMM implementation in scikit-learn has a parameter called covariance_type. This parameter determines the type of covariance parameters to use. Specifically, there are four types you can specify:

* full: This is the default. Each component has its own general covariance matrix.
* tied: All components share the same general covariance matrix.
* diag: Each component has its own diagonal covariance matrix.
* spherical: Each component has its own single variance.

Try all of these. Which one performs better in terms of ARI and silhouette scores?
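One way to see what the four options mean in practice is to inspect the shape of the fitted `covariances_` attribute for each one. The sketch below does this on a small synthetic dataset with two components and three features; `X_demo` and `shapes` are hypothetical names introduced only for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two synthetic blobs with 3 features each (hypothetical demo data).
rng = np.random.RandomState(123)
X_demo = np.vstack(
    [rng.normal(0, 1, size=(100, 3)), rng.normal(4, 1, size=(100, 3))]
)

shapes = {}
for covariance_type in ["full", "tied", "diag", "spherical"]:
    gmm = GaussianMixture(
        n_components=2, covariance_type=covariance_type, random_state=123
    ).fit(X_demo)
    # The shape of covariances_ reflects how many parameters each type fits.
    shapes[covariance_type] = gmm.covariances_.shape
    print(covariance_type, shapes[covariance_type])

# full      -> (n_components, n_features, n_features)
# tied      -> (n_features, n_features)
# diag      -> (n_components, n_features)
# spherical -> (n_components,)
```

The fewer parameters a covariance type has, the more constrained the component shapes are: spherical components are round, diag components are axis-aligned ellipsoids, and full components can take any orientation.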

In [10]:
# Fit one GMM per covariance type and report both scores
for covariance_type in ["full", "tied", "diag", "spherical"]:
    # Define the GMM
    gmm_cluster = GaussianMixture(
        n_components=2, random_state=123, covariance_type=covariance_type
    )

    # Fit the model and assign each observation to a component
    clusters = gmm_cluster.fit_predict(scaled)

    print(
        "ARI score with covariance_type={}: {}".format(
            covariance_type, metrics.adjusted_rand_score(y, clusters)
        )
    )

    print(
        "Silhouette score with covariance_type={}: {}".format(
            covariance_type,
            metrics.silhouette_score(scaled, clusters, metric="euclidean"),
        )
    )
    print("------------------------------------------------------")

ARI score with covariance_type=full: 0.18389186035089963
Silhouette score with covariance_type=full: 0.13628813153331445
------------------------------------------------------
ARI score with covariance_type=tied: 0.18389186035089963
Silhouette score with covariance_type=tied: 0.13628813153331445
------------------------------------------------------
ARI score with covariance_type=diag: 0.18389186035089963
Silhouette score with covariance_type=diag: 0.13628813153331445
------------------------------------------------------
ARI score with covariance_type=spherical: 0.20765243525722465
Silhouette score with covariance_type=spherical: 0.12468753110276876
------------------------------------------------------


The GMM with the spherical covariance type produced the highest ARI score (about 0.21) but also the lowest silhouette score (about 0.12). The other three covariance types (full, tied, and diag) produced identical scores: an ARI of about 0.18 and a silhouette score of about 0.14.
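Note that the full covariance model scored an ARI of about 0.42 with random_state=1234 in the first exercise but about 0.18 with random_state=123 here, so the results appear sensitive to initialization. The sketch below shows one way such sensitivity might be checked: fit the same model under several seeds, then refit with `n_init=10` so that scikit-learn keeps the best of ten initializations. It runs on synthetic stand-in data, and `X_demo`, `y_demo`, `seeds`, and `ari_by_seed` are hypothetical names, so the numbers it prints say nothing about the heart disease data itself.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled features (an assumption for this sketch).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=5, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

# Fit the same model under several seeds and record the ARI for each.
seeds = [0, 1, 123, 1234]
ari_by_seed = {}
for seed in seeds:
    labels = GaussianMixture(n_components=2, random_state=seed).fit_predict(X_demo)
    ari_by_seed[seed] = adjusted_rand_score(y_demo, labels)
    print("random_state={}: ARI={:.3f}".format(seed, ari_by_seed[seed]))

spread = max(ari_by_seed.values()) - min(ari_by_seed.values())
print("ARI spread across seeds: {:.3f}".format(spread))

# n_init=10 runs ten initializations and keeps the one with the best
# lower bound on the log-likelihood, which usually reduces seed dependence.
stable_gmm = GaussianMixture(n_components=2, n_init=10, random_state=123)
stable_labels = stable_gmm.fit_predict(X_demo)
print("ARI with n_init=10: {:.3f}".format(adjusted_rand_score(y_demo, stable_labels)))
```

If the spread across seeds is large on the real data, comparing covariance types under a single seed may say more about the initialization than about the covariance structure.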