# K-means Clustering

## **1 Introduction**

This notebook is my learning material to keep track of the concepts covered in the [Unsupervised Learning, Recommenders, Reinforcement Learning](https://www.coursera.org/learn/unsupervised-learning-recommenders-reinforcement-learning) course from the [Machine Learning Specialization](https://www.coursera.org/specializations/machine-learning-introduction) offered by DeepLearning.AI and Stanford University.

Throughout this notebook, I use the [HCV data](https://archive-beta.ics.uci.edu/ml/datasets/hcv+data) from the UCI Machine Learning Repository.

### **1.0.1 Imports**

In [None]:
import os
import wget
import zipfile

# Data manipulation
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder

# Options for pandas
pd.options.display.max_columns = 50
pd.options.display.max_rows = 30

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Options for seaborn
sns.set_style('darkgrid')
%matplotlib inline

from IPython import get_ipython
ipython = get_ipython()

# Autoreload extensions
if 'autoreload' not in ipython.extension_manager.loaded:
    %load_ext autoreload

### **1.1 Data**

#### **1.1.0.1 Download**

In [None]:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00571/hcvdat0.csv'
filename = wget.download(url)

#### **1.1.0.2 Import**

In [None]:
hcv = pd.read_csv(filename)

#### **1.1.1 Exploratory Data Analysis**

In [None]:
hcv.info()
hcv.describe()

In [None]:
hcv.head()

In [None]:
print(f'Number of missing values: {hcv.isna().sum().sum()}')

In [None]:
hcv.isna().sum()

In [None]:
sns.pairplot(data=hcv.drop(['Unnamed: 0'], axis=1), hue='Category')

## **2 Clustering with NumPy**

### **2.1 Preprocessing**

#### **2.1.1 Fill missing values**

In [None]:
for c in hcv.columns[hcv.isna().any()]:
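    # Assign the result back: calling fillna(inplace=True) on a selected column
    # is deprecated in recent pandas and may not modify the DataFrame
    hcv[c] = hcv[c].fillna(round(hcv[c].mean(), 2))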

print(f'Number of missing values: {hcv.isna().sum().sum()}')

#### **2.1.2 Feature selection**

In [None]:
X = hcv.drop(['Unnamed: 0', 'Category', 'Sex'], axis=1).values
X

### **2.2 Model**

#### **2.2.1 Cluster assignment**

$$
c^{(i)} := j \quad \text{that minimizes} \quad ||x^{(i)} - \mu_j||^2
$$

$$
\begin{align*}
    c^{(i)} &: \text{index of the centroid that is closest to $x^{(i)}$} \\
    \mu_j &: \text{position of the $j$-th centroid}
\end{align*}
$$

In [None]:
def assign_cluster(X, centroids):
    idx = np.zeros(X.shape[0], dtype=int)
    
    for i, x in enumerate(X):
        dist_min = np.inf
        
        for j, c in enumerate(centroids):
            dist = np.linalg.norm(x - c)**2
            
            if dist < dist_min:
                dist_min = dist
                idx[i] = j
    return idx
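
The double loop above follows the formula literally. As a minimal sketch, the same assignment can also be written with NumPy broadcasting; the function name `assign_cluster_vectorized` and the toy arrays are my own illustrative choices, not part of the course material.

```python
import numpy as np

def assign_cluster_vectorized(X, centroids):
    """Return, for each row of X, the index of the nearest centroid."""
    # diffs has shape (m, K, n): every point minus every centroid
    diffs = X[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    # Squared Euclidean distances, shape (m, K)
    sq_dists = np.sum(diffs ** 2, axis=2)
    # Index of the smallest distance along the centroid axis
    return np.argmin(sq_dists, axis=1)

# Toy data (made up for illustration): two points near each centroid
X_toy = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
centroids_toy = np.array([[0.0, 0.0], [5.0, 5.0]])

idx_toy = assign_cluster_vectorized(X_toy, centroids_toy)
print(idx_toy)  # [0 0 1 1]
```

This version builds an intermediate array of shape (m, K, n), so it trades memory for speed; for a dataset of this size that should not be a concern.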

#### **2.2.2 Centroid means**

$$
\mu_k = \frac{1}{|C_k|} \sum_{i \in C_k} x^{(i)}
$$

$$
\begin{align*}
    C_k &: \text{set of examples that are assigned to centroid $k$,} \\
    |C_k| &: \text{number of examples in the set $C_k$}
\end{align*}
$$

In [None]:
def update_centroids(X, idx, K):
    centroids = np.zeros((K, X.shape[1]))
    
    for k in range(K):
        pts = X[idx == k]
        centroids[k] = pts.mean(axis=0)
        
    return centroids
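
To make the update rule concrete, here is a small worked example on made-up data (the arrays `X_toy` and `idx_toy` are assumptions for illustration only). Each centroid is the feature-wise mean of the points assigned to it.

```python
import numpy as np

# Toy data (made up for illustration): three points, two clusters
X_toy = np.array([[1.0, 2.0], [3.0, 4.0], [10.0, 10.0]])
idx_toy = np.array([0, 0, 1])

# Mean of the points assigned to cluster 0: ((1 + 3) / 2, (2 + 4) / 2)
mu_0 = X_toy[idx_toy == 0].mean(axis=0)
# Cluster 1 holds a single point, so its centroid is that point
mu_1 = X_toy[idx_toy == 1].mean(axis=0)

print(mu_0)  # [2. 3.]
print(mu_1)  # [10. 10.]
```

Note that a cluster with no assigned points has no defined mean: `mean` over an empty selection returns NaN with a runtime warning.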

#### **2.2.3 Training**

In [None]:
def kmeans(X, K, epochs=10):
    m, n = X.shape
    idx = np.zeros(m)
    
    # Initialize K random centroids
    ic = np.random.choice(m, size=K, replace=False)
    centroids = X[ic]
    
    # Run K-means
    for e in range(epochs):
        idx = assign_cluster(X, centroids)
        centroids = update_centroids(X, idx, K)
        
    return centroids, idx

In [None]:
centroids, idx = kmeans(X, K=4)
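
Because the initial centroids are drawn at random, two runs can end in different clusterings. One way to compare them is the average squared distance between each point and its assigned centroid, the same quantity the assignment step minimizes point by point. The sketch below is a hedged illustration: the helper name `distortion` and the toy arrays are my own assumptions, not part of the code above.

```python
import numpy as np

def distortion(X, centroids, idx):
    """Mean squared distance between each point and its assigned centroid."""
    # centroids[idx] picks, for every point, the centroid it is assigned to
    diffs = X - centroids[idx]
    return np.mean(np.sum(diffs ** 2, axis=1))

# Toy data (made up for illustration)
X_toy = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0]])
centroids_toy = np.array([[1.0, 0.0], [10.0, 0.0]])
idx_toy = np.array([0, 0, 1])

# Squared distances are 1, 1 and 0, so the mean is 2/3
cost = distortion(X_toy, centroids_toy, idx_toy)
print(cost)
```

With the notebook's variables, `distortion(X, centroids, idx)` could be evaluated after several calls to `kmeans` to keep the run with the lowest value.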

### **2.3 Results**

In [None]:
m = 4
fig, axes = plt.subplots(m, m,
                         figsize=(14, 7))

for i in range(m):
    for j in range(m):
        sns.scatterplot(x=X[:, i], y=X[:, j],
                        hue=idx, palette='Set2',
                        marker='o',
                        ax=axes[i, j])

        sns.scatterplot(x=centroids[:, i], y=centroids[:, j],
                        marker='X', s=80,
                        color='black',
                        ax=axes[i, j])

## **3 Clustering with scikit-learn**

### **3.1 Model**

In [None]:
kmeans_model = KMeans(n_clusters=4)

labels = kmeans_model.fit_predict(X)

### **3.2 Results**

In [None]:
centroids = kmeans_model.cluster_centers_
u_labels = np.unique(labels)

m = 4
fig, axes = plt.subplots(m, m,
                         figsize=(14, 7))

for i in range(m):
    for j in range(m):
        # Plot the points of each cluster
        for l in u_labels:
            sns.scatterplot(x=X[labels == l, i], y=X[labels == l, j],
                            label=l,
                            ax=axes[i, j])

        # Plot the centroids once per subplot, on top of the points
        sns.scatterplot(x=centroids[:, i], y=centroids[:, j],
                        marker='X', s=80, color='black',
                        ax=axes[i, j])