# K-Means clustering
 
- K-means runs significantly faster than hierarchical clustering on large datasets, because it does not need the full matrix of pairwise distances between observations

## Steps:
1. Generate cluster centers
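    - kmeans(obs, k_or_guess, iter=20, thresh=1e-05, check_finite=True)
    - Parameters:
        - obs: standardized observations
        - k_or_guess: number of clusters, or an array of initial cluster centers
        - iter: number of times to run k-means; the run with the lowest distortion is returned
        - thresh: convergence threshold; a run stops once the change in distortion between iterations is at most this value
        - check_finite: whether to check that the observations contain only finite numbers
    - Returns:
        - cluster centers (the code book)
        - distortion: mean (non-squared) Euclidean distance between the observations and their nearest cluster center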
2. Generate cluster labels
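    - vq(obs, code_book, check_finite=True)
    - Parameters:
        - obs, check_finite: same as in kmeans
        - code_book: cluster centers returned by step 1
    - Returns:
        - cluster labels: index of the nearest cluster center for each observation
        - a list of distortions: distance from each observation to its nearest cluster center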
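
The two steps above can be sketched end to end on synthetic data. This is a minimal illustration rather than the notebook's own dataset: the two blobs, the variable names and the use of `whiten` for standardization are assumptions made for the example.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq, whiten

# Hypothetical toy data: two well-separated 2-D blobs of 50 points each
rng = np.random.default_rng(0)
raw = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[10.0, 10.0], scale=0.5, size=(50, 2)),
])

# Standardize each feature to unit variance before clustering
obs = whiten(raw)

# Step 1: cluster centers (code book) and the overall distortion
cluster_centers, distortion = kmeans(obs, 2)

# Step 2: one label and one distance per observation
labels, distances = vq(obs, cluster_centers)

print(cluster_centers.shape)           # one row per cluster center
print(labels.shape, distances.shape)   # one entry per observation
```

Note that `vq` returns two arrays, which is why the notebook code unpacks its result as `labels, _`.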

## How many clusters?
- Elbow method: plot the number of clusters on the x-axis against the distortion on the y-axis
    - The ideal point is where the distortion stops declining sharply: 'the elbow'

## Limitations
- The right number of clusters (K) has to be chosen up front
- Results depend on the random initial cluster centers (the seed)
- Biased towards equal-sized clusters
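
The impact of seeds can be illustrated with a small experiment. The sketch below assumes a SciPy version whose `kmeans` accepts a `seed` keyword (check your installed version), and the three overlapping blobs are hypothetical toy data. Setting `iter=1` turns off the default best-of-several-runs behaviour, so each result reflects a single random initialization.

```python
import numpy as np
from scipy.cluster.vq import kmeans, whiten

# Hypothetical toy data: three overlapping 2-D blobs
rng = np.random.default_rng(42)
raw = np.vstack([
    rng.normal(loc=center, scale=1.0, size=(100, 2))
    for center in ([0.0, 0.0], [3.0, 0.0], [0.0, 3.0])
])
obs = whiten(raw)

# One k-means run per seed; distortions may differ between seeds
results = {}
for seed in (0, 1, 2):
    _, distortion = kmeans(obs, 3, iter=1, seed=seed)
    results[seed] = distortion
    print(f"seed={seed}: distortion={distortion:.4f}")

# Re-running with the same seed reproduces the same result
_, repeat = kmeans(obs, 3, iter=1, seed=0)
print(repeat == results[0])
```

Fixing the seed makes an analysis reproducible. Leaving `iter` at its default reduces the sensitivity to any single initialization.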

## Important note
- Remember that clustering is part of exploratory data analysis (EDA), so imperfect results are acceptable at this stage.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.cluster.vq import kmeans, vq

# Generate cluster centers and labels
cluster_centers, _ = kmeans(df[['scaled_x', 'scaled_y']], 3)
df['cluster_labels'], _ = vq(df[['scaled_x', 'scaled_y']], cluster_centers)

# Plot clusters
sns.scatterplot(x='scaled_x', y='scaled_y', hue='cluster_labels', data=df)
plt.show()

### Elbow method

In [None]:
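import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from scipy.cluster.vq import kmeans

distortions = []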
num_clusters = range(1, 7)

# Create a list of distortions from the kmeans function
for i in num_clusters:
    cluster_centers, distortion = kmeans(comic_con[['x_scaled', 'y_scaled']], i)
    distortions.append(distortion)

# Create a DataFrame with two lists - num_clusters, distortions
elbow_plot = pd.DataFrame({'num_clusters': num_clusters, 'distortions': distortions})

# Create a line plot of num_clusters and distortions
sns.lineplot(x='num_clusters', y='distortions', data=elbow_plot)
plt.xticks(num_clusters)
plt.show()