# Gaussian Mixture Models (EM)

Fit GMMs with varying numbers of components and covariance types, compute BIC and AIC, select a model by minimum BIC, and inspect soft assignments (responsibilities).
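
As a rough illustration of what the selection criteria score, the sketch below recomputes BIC and AIC by hand for a full-covariance model (a penalty on the number of free parameters added to -2 times the log-likelihood) and compares the result with `GaussianMixture.bic` and `GaussianMixture.aic`. The synthetic blobs and the helper `n_free_params` are assumptions made for this example; they are not part of the notebook's data or utilities.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def n_free_params(k, d, covariance_type):
    """Hypothetical helper: free parameters of a k-component GMM in d dimensions."""
    cov_params = {
        "full": k * d * (d + 1) // 2,   # one symmetric d x d matrix per component
        "tied": d * (d + 1) // 2,       # one shared symmetric matrix
        "diag": k * d,                  # one variance per feature per component
        "spherical": k,                 # one variance per component
    }[covariance_type]
    return cov_params + k * d + (k - 1)  # covariances + means + mixing weights

# Synthetic two-blob data (assumption, for illustration only)
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 1, (150, 2)), rng.normal(5, 1, (150, 2))])

gmm_demo = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X_demo)
n, d = X_demo.shape
log_lik = gmm_demo.score(X_demo) * n          # score() is the mean log-likelihood per sample
p = n_free_params(2, d, "full")
manual_bic = -2 * log_lik + p * np.log(n)
manual_aic = -2 * log_lik + 2 * p
print(manual_bic, gmm_demo.bic(X_demo))
print(manual_aic, gmm_demo.aic(X_demo))
```

Because the BIC penalty per parameter is log(n) and the AIC penalty is 2, BIC penalizes extra components and richer covariance types more heavily than AIC once n exceeds about 7 samples.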

In [None]:
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture

# Save under a valid module name so that `import unsup_utils` can find it
!wget -q -O unsup_utils.py https://raw.githubusercontent.com/Jihun-ust/ust-mail-557/main/Unsupervised/unsup_utils.py.py
import unsup_utils as utils
csv_path = "https://raw.githubusercontent.com/Jihun-ust/ust-mail-557/main/Unsupervised/unsup.csv"
warnings.filterwarnings("ignore")  # note: this also hides sklearn's ConvergenceWarning from EM

df = pd.read_csv(csv_path)
X, cols, sc = utils.feature_matrix(df, use_emb=True)

components = list(range(2,9))
cov_types = ["full","tied","diag","spherical"]
results = []

# Fit one GMM per (covariance type, component count) and record BIC/AIC
for cov in cov_types:
    bics, aics = [], []
    for k in components:
        gmm = GaussianMixture(n_components=k, covariance_type=cov, n_init=3, random_state=42)
        gmm.fit(X)
        bics.append(gmm.bic(X))
        aics.append(gmm.aic(X))
    results.append((cov, bics, aics))

    # One BIC chart per covariance type (lower is better)
    plt.figure(figsize=(8, 3.5))
    plt.plot(components, bics, marker="o")
    plt.title(f"BIC ({cov})")
    plt.xlabel("components")
    plt.ylabel("BIC")
    plt.tight_layout()
    plt.show()

# Choose the best (covariance type, k) pair overall by minimum BIC
best, best_val = None, np.inf
for cov, bics, aics in results:
    i = int(np.argmin(bics))
    if bics[i] < best_val:
        best_val, best = bics[i], (cov, components[i])

print("Best by BIC:", best)
cov, k = best
gmm = GaussianMixture(n_components=k, covariance_type=cov, n_init=5, random_state=42)
df["cluster_gmm"] = gmm.fit_predict(X)
resp = gmm.predict_proba(X)  # responsibilities: P(component | x), one row per sample

# Visualize in PCA space
X2, p = utils.pca_2d(X)
utils.plot_xy(X2, title="PCA (colored by GMM clusters)", labels=df["cluster_gmm"].values)

# Soft confidence: largest responsibility per row; rows below 0.6 are flagged as low confidence
df["gmm_max_prob"] = resp.max(axis=1)
df["gmm_low_conf"] = (df["gmm_max_prob"] < 0.6).astype(int)
df[["gmm_max_prob","gmm_low_conf"]].head()
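
To make the responsibilities less of a black box, here is a minimal sketch that recomputes them from a fitted model's weights, means, and covariances using Bayes' rule, then compares the result with `predict_proba`. It runs on synthetic data and uses SciPy's `multivariate_normal`; names such as `X_demo` and `manual_resp` are illustrative and do not come from the notebook.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

# Two overlapping synthetic blobs (assumption, for illustration only)
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])

gmm_demo = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X_demo)

# Weighted density of every point under each component: w_k * N(x | mu_k, Sigma_k)
weighted = np.column_stack([
    w * multivariate_normal.pdf(X_demo, mean=mu, cov=cov)
    for w, mu, cov in zip(gmm_demo.weights_, gmm_demo.means_, gmm_demo.covariances_)
])

# Normalize across components so each row sums to 1
manual_resp = weighted / weighted.sum(axis=1, keepdims=True)

# Same confidence flags as in the cell above, computed from the manual values
max_prob = manual_resp.max(axis=1)
low_conf = max_prob < 0.6
print(np.abs(manual_resp - gmm_demo.predict_proba(X_demo)).max())
print("low-confidence points:", int(low_conf.sum()))
```

Points that sit between two components get responsibilities close to an even split, which is what the `gmm_low_conf` flag is meant to surface.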

### Quick review: alignment to doc_type

In [None]:
pd.crosstab(df['cluster_gmm'], df['doc_type'])
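
The crosstab can also be condensed into single numbers. The sketch below computes cluster purity and the adjusted Rand index on a small made-up example; the toy labels are placeholders and are not taken from `unsup.csv`.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Toy labels (assumption): 8 documents, 3 clusters, 3 document types
toy = pd.DataFrame({
    "cluster_gmm": [0, 0, 0, 1, 1, 1, 2, 2],
    "doc_type": ["a", "a", "b", "b", "b", "b", "c", "c"],
})

ct = pd.crosstab(toy["cluster_gmm"], toy["doc_type"])

# Purity: fraction of rows that belong to the majority doc_type of their cluster
purity = ct.max(axis=1).sum() / ct.values.sum()

# Adjusted Rand index: chance-corrected agreement between the two labelings
ari = adjusted_rand_score(toy["doc_type"], toy["cluster_gmm"])

print(ct)
print(f"purity={purity:.3f}, ARI={ari:.3f}")
```

Purity tends to rise as the number of clusters grows, so the adjusted Rand index is usually the safer number when comparing solutions with different `k`.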

### (Optional) Model Deep Dive
- Automated BIC/AIC selection alone may not yield the most meaningful GMM solution. Here, five clusters with full covariance may offer a clearer structure, and one of the clusters may be capturing anomalies, so it is worth going back to the raw data to interpret it.

In [None]:
custom_best = ('full', 5)
print("Another Possible Best:", custom_best)
cov, k = custom_best
gmm = GaussianMixture(n_components=k, covariance_type=cov, n_init=5, random_state=42)
df["cluster_gmm"] = gmm.fit_predict(X)  # overwrites the labels from the BIC-selected model
resp = gmm.predict_proba(X)

# Visualize in PCA space
X2, p = utils.pca_2d(X)
utils.plot_xy(X2, title="PCA (colored by GMM clusters)", labels=df["cluster_gmm"].values)

pd.crosstab(df['cluster_gmm'], df['doc_type'])

### BIC alone may not yield the most meaningful GMM solution
Note: Choosing the number of clusters and the covariance structure solely by the Bayesian Information Criterion (BIC) does not guarantee the most appropriate or interpretable GMM solution. In this example, five clusters with a full covariance structure appear to give a more coherent result (see the BIC chart for full covariance above).

In particular, the fourth cluster may be capturing outliers or anomalous patterns. Model selection criteria should therefore be complemented by a close look at the original data, to judge whether such a cluster reflects meaningful structure, rare but informative cases, or data irregularities.
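
One possible starting point for that examination is a per-cluster summary of size, mixing weight, average confidence, and average log-likelihood. The sketch below is an illustration on synthetic data with a few injected far-away points; the data and column names such as `mean_loglik` are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture

# Two synthetic blobs plus a handful of injected far-away points (assumption)
rng = np.random.default_rng(2)
X_demo = np.vstack([
    rng.normal(0, 1, (200, 2)),
    rng.normal(6, 1, (200, 2)),
    rng.normal(0, 0.5, (8, 2)) + np.array([30.0, -30.0]),
])

k = 3
gmm_demo = GaussianMixture(n_components=k, covariance_type="full", n_init=5, random_state=0)
labels = gmm_demo.fit_predict(X_demo)
resp = gmm_demo.predict_proba(X_demo)

per_point = pd.DataFrame({
    "cluster": labels,
    "max_prob": resp.max(axis=1),               # confidence of the hard assignment
    "loglik": gmm_demo.score_samples(X_demo),   # per-sample log-likelihood under the mixture
})

summary = per_point.groupby("cluster").agg(
    size=("max_prob", "size"),
    mean_max_prob=("max_prob", "mean"),
    mean_loglik=("loglik", "mean"),
)
summary["weight"] = gmm_demo.weights_[summary.index.to_numpy()]
print(summary.sort_values("size"))
```

A cluster that is very small relative to the others is a candidate for anomalies, but whether it is noise or a rare, meaningful group still has to be decided from the raw rows.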