# Simple K-Means Example
Below is a simple example of K-Means clustering that shows the intuition behind the algorithm.
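
As a rough sketch of that intuition, the loop below alternates two steps: assign every point to its nearest centroid, then move each centroid to the mean of its points. It is a minimal pure-Python illustration, not scikit-learn's implementation (which uses smarter initialization and several restarts). The names `simple_kmeans` and `squared_distance` are hypothetical helpers, and the points reuse the D/EBITDA and Sales YoY values from the dummy data set below.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def simple_kmeans(points, k, n_iter=100, seed=0):
    """Hypothetical helper: plain K-Means with random initial centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k random data points
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # no point changed cluster, so we have converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

# (D/EBITDA, Sales YoY) pairs for companies A to J
d_ebitda = [4.5, 5, 3.75, 4, 1.0, 0.5, 0, 6, 7, 6.5]
sales_growth = [15, 10, 12, 11, 40, 60, 35, -1, -2, 0]
points = list(zip(d_ebitda, sales_growth))

labels, centroids = simple_kmeans(points, k=3)
print(labels)
print(centroids)
```

Because the starting centroids are random, a different `seed` can produce different cluster numbers or even a different grouping, which is one reason library implementations run several initializations and keep the best result.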

## Import Packages

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px #viz package for interactive charts
from sklearn.cluster import KMeans

## Dummy Data Set
- Below are 10 simulated companies with their D/EBITDA ratios, sales YoY growth rates, and EBITDA margins.
- See if you can figure out what the "groupings" should be before running the K-Means algorithm.
- Now imagine you had thousands of companies to categorize.

In [2]:
cos = ['A','B','C','D','E','F','G','H','I', 'J']
d_ebitda = [4.5, 5, 3.75, 4, 1.0, 0.5, 0, 6, 7, 6.5]
sales_growth = [15, 10, 12, 11, 40, 60, 35, -1, -2, 0]
margins = [60, 80, 60, 40, 15, 10, 0, 15, 20, 5]

In [3]:
df = pd.DataFrame({'Company':cos,'D/EBITDA':d_ebitda,'Sales YoY':sales_growth,'EBITDA Margin':margins})

In [4]:
df

In [5]:
#Quick Visualizations
df.plot(x='D/EBITDA',y='Sales YoY', kind='scatter')
df.plot(x='D/EBITDA',y='EBITDA Margin', kind='scatter')
df.plot(x='Sales YoY',y='EBITDA Margin', kind='scatter')

In [6]:
sns.pairplot(df)  # scatter plots for every pair of numeric columns

## K-Means Cluster - Finding Optimal # of Clusters
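- Based on the graphs above, how many "groups" or clusters do you think you will need?
- What if it's not as easy to determine?
- We can use an "Elbow Curve" to find a suitable number: plot the SSE (sum of squared errors) for each candidate number of clusters and look for the point where the curve stops dropping sharply.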
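
To make the SSE concrete, here is a small sketch of the quantity the elbow curve plots: the sum of squared distances from each point to its assigned centroid, which is what scikit-learn reports as `inertia_`. The function `total_sse` and the toy points are hypothetical and chosen only so the arithmetic is easy to check by hand.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def total_sse(points, labels, centroids):
    """Hypothetical helper: sum of squared distances to each point's assigned centroid."""
    return sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))

# Toy data: two tight pairs of points on a line
toy_points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]

# One cluster: every point belongs to the overall mean (6, 0)
sse_k1 = total_sse(toy_points, [0, 0, 0, 0], [(6.0, 0.0)])

# Two clusters: one centroid per pair, at (1, 0) and (11, 0)
sse_k2 = total_sse(toy_points, [0, 0, 1, 1], [(1.0, 0.0), (11.0, 0.0)])

# Four clusters: every point is its own centroid
sse_k4 = total_sse(toy_points, [0, 1, 2, 3], toy_points)

print(sse_k1, sse_k2, sse_k4)  # 104.0 4.0 0.0
```

The drop from 1 to 2 clusters is large and the drop from 2 to 4 is small, so the "elbow" for this toy data is at 2. SSE always reaches zero once there are as many clusters as points, which is why the bend matters rather than the minimum.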

In [7]:
# X = df[['D/EBITDA','Sales YoY']]
X = df[['EBITDA Margin','D/EBITDA','Sales YoY']]

In [8]:
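sse = []  # sum of squared distances to the closest centroid, one entry per k
for k in range(1, 11):  # number of clusters (at most 10, since there are only 10 companies)
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)  # fixed seed for reproducible results
    kmeans.fit(X)
    sse.append(kmeans.inertia_)  # inertia_ is the SSE for this k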

#Quick plot of Elbow Curve
plt.plot(range(1,len(sse)+1), sse)
plt.title("Elbow Curve")
plt.xlabel('Number of Clusters')
plt.ylabel('SSE')
plt.show()

The elbow appears at 3 clusters, so that is the number we will use.

## K-Means - Running Optimal # of Clusters

In [9]:
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)  # fixed seed for reproducible results
kmeans.fit(X)

In [10]:
X.head()

## K-Means - Useful outputs
- centroids - the centers of the clusters, one row per cluster and one column per feature: `kmeans.cluster_centers_`
- labels - the cluster number assigned to each row of `X`: `kmeans.labels_`
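
One way to read the centroids: a new observation belongs to the cluster whose centroid is nearest, which is what `KMeans.predict` computes for a fitted model. The sketch below does this by hand in pure Python. The centroid values are hypothetical, rounded numbers in the column order (EBITDA Margin, D/EBITDA, Sales YoY), not the output of the fitted model, and `nearest_centroid` is an illustrative helper.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest_centroid(point, centroids):
    """Hypothetical helper: index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda j: squared_distance(point, centroids[j]))

# Hypothetical centroids: (EBITDA Margin, D/EBITDA, Sales YoY)
example_centroids = [
    (60.0, 4.3, 12.0),   # high margin, moderate leverage, steady growth
    (8.0, 0.5, 45.0),    # low margin, low leverage, fast growth
    (13.0, 6.5, -1.0),   # low margin, high leverage, no growth
]

# A hypothetical new company: 55% margin, 4.0x D/EBITDA, 13% sales growth
new_company = (55.0, 4.0, 13.0)
category = nearest_centroid(new_company, example_centroids)
print(category)  # 0
```

Note that the distance is dominated by the features with the largest numeric range (here margin and growth, measured in percentage points), so D/EBITDA has relatively little influence unless the features are rescaled first.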

In [11]:
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)  # fixed seed keeps label numbers stable between runs
centroids = kmeans.cluster_centers_
centroids

In [12]:
centroids_df = pd.DataFrame(centroids, columns=X.columns)  # name each centroid coordinate after its feature
centroids_df

In [13]:
kmeans.labels_

In [14]:
df['Category'] = kmeans.labels_
df

In [15]:
x = 'EBITDA Margin'
y = 'D/EBITDA'
# Colormap options: https://matplotlib.org/stable/tutorials/colors/colormaps.html
plt.scatter(df[x], df[y], c=kmeans.labels_, cmap="rainbow")  # companies, colored by cluster
plt.scatter(centroids_df[x], centroids_df[y], c='black', marker='x')  # cluster centers
plt.xlabel(x)
plt.ylabel(y)
plt.show()