### Implementation

We will implement k-means directly with the scikit-learn library.

In [38]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

This dataset contains measurements of iris flowers: 4 variables plus the class each flower belongs to (3 classes in total).

In [39]:
iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

iris_df['original_label'] = iris.target
iris_df.head() 

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | original_label |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


Now we create our clustering model. Since we know beforehand that there are 3 classes, we take $k=3$.

In [40]:
kmeans = KMeans(n_clusters=3, max_iter=300, random_state=1)
kmeans.fit(iris.data)
print(kmeans.labels_)

[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 2 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 2 2 2 2 0 2 2 2 2
 2 2 0 0 2 2 2 2 0 2 0 2 0 2 2 0 0 2 2 2 2 2 0 2 2 2 2 0 2 2 2 0 2 2 2 0 2
 2 0]
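
Besides `labels_`, the fitted model exposes the centroids and the within-cluster sum of squares, and it can assign unseen samples to the nearest centroid. The following is a minimal sketch of that inspection; the `new_flower` measurements are made up for illustration, and `n_init=10` is set explicitly as an assumption so the run does not depend on the installed version's default.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

iris = load_iris()
# n_init is fixed explicitly here (an assumption of this sketch).
kmeans = KMeans(n_clusters=3, max_iter=300, n_init=10, random_state=1)
kmeans.fit(iris.data)

# One centroid per cluster, one coordinate per feature.
print(kmeans.cluster_centers_.shape)  # (3, 4)

# Sum of squared distances from each sample to its closest centroid.
print(kmeans.inertia_)

# Assign a hypothetical new flower to the nearest centroid.
new_flower = np.array([[5.0, 3.4, 1.5, 0.2]])
print(kmeans.predict(new_flower))
```
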


We compare our results with the actual iris varieties:

In [41]:
iris_df["cluster"] = kmeans.labels_
iris_df.groupby(['original_label','cluster']).agg({'sepal length (cm)': 'count'})

| original_label | cluster | sepal length (cm), count |
|---|---|---|
| 0 | 1 | 50 |
| 1 | 0 | 47 |
| 1 | 2 | 3 |
| 2 | 0 | 14 |
| 2 | 2 | 36 |
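
Because cluster ids are arbitrary, the comparison can also be summarized with a contingency table and a permutation-invariant score. The sketch below uses `pd.crosstab` and scikit-learn's `adjusted_rand_score`; choosing that metric, the variable names `table` and `ari`, and the explicit `n_init=10` are assumptions of this example rather than part of the original notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

iris = load_iris()
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(iris.data)

# Contingency table: rows are the true species, columns are cluster ids.
table = pd.crosstab(
    pd.Series(iris.target, name="original_label"),
    pd.Series(labels, name="cluster"),
)
print(table)

# The adjusted Rand index ignores how the clusters happen to be numbered:
# 1.0 means perfect agreement, values near 0 mean chance-level agreement.
ari = adjusted_rand_score(iris.target, labels)
print(f"Adjusted Rand index: {ari:.3f}")
```
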


### Preprocessing

This algorithm relies on distances, so the following must be taken into account:

- **Scales:** Attributes can have very different scales, so it is important to normalize the data.
- **Categorical attributes:** To measure distances between categorical attributes, we first need to convert them to numeric values so that the distance formulas can be applied.
- **Attribute importance:** If one attribute matters more than another, its contribution to the distance can be weighted, either by modifying the distance formula or by changing the attribute's range of values.
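
The three points above can be sketched on a toy table. Everything in this example is hypothetical (the column names, the values, the choice of `MinMaxScaler`, and the weight of 2); it only illustrates one possible way to handle each concern before clustering.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: two numeric attributes on very different scales
# and one categorical attribute.
df = pd.DataFrame({
    "height_m": [1.60, 1.75, 1.90],
    "income_eur": [20000.0, 50000.0, 35000.0],
    "city": ["Madrid", "Sevilla", "Madrid"],
})

# Categorical attributes: one-hot encode them so distances are defined.
encoded = pd.get_dummies(df, columns=["city"], dtype=float)

# Scales: map every column to [0, 1] so no attribute dominates the distance.
scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(encoded), columns=encoded.columns
)

# Importance: stretching a column's range increases its weight in the
# Euclidean distance (the factor 2 is an arbitrary choice).
weighted = scaled.copy()
weighted["height_m"] *= 2.0
print(weighted)
```
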

In [42]:
from sklearn import preprocessing

cols = iris_df.columns[:-2]  # Exclude the 'original_label' and 'cluster' columns

iris_df_norm = iris_df[cols].copy()
iris_df_norm[cols] = preprocessing.normalize(iris_df[cols])
iris_df_norm["original_label"] = iris_df["original_label"]
iris_df_norm.describe()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | original_label |
|---|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 0.7514 | 0.405174 | 0.454784 | 0.141071 | 1.0 |
| std | 0.044368 | 0.105624 | 0.159986 | 0.077977 | 0.819232 |
| min | 0.653877 | 0.238392 | 0.167836 | 0.014727 | 0.0 |
| 25% | 0.715261 | 0.326738 | 0.250925 | 0.048734 | 0.0 |
| 50% | 0.754883 | 0.354371 | 0.536367 | 0.164148 | 1.0 |
| 75% | 0.786912 | 0.527627 | 0.580025 | 0.197532 | 2.0 |
| max | 0.860939 | 0.607125 | 0.636981 | 0.280419 | 2.0 |


In [43]:
kmeans_norm = KMeans(n_clusters=3, max_iter=300, random_state=1)
kmeans_norm.fit(iris_df_norm[cols])

iris_df_norm["cluster"] = kmeans_norm.labels_
iris_df_norm.groupby(['original_label','cluster']).agg({'sepal length (cm)': 'count'})

| original_label | cluster | sepal length (cm), count |
|---|---|---|
| 0 | 1 | 50 |
| 1 | 0 | 45 |
| 1 | 2 | 5 |
| 2 | 2 | 50 |
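
It may help to check what `preprocessing.normalize` actually does: with its default arguments (`norm="l2"`, `axis=1`) it rescales each sample (row) to unit Euclidean length instead of rescaling each feature (column). The short sketch below verifies this on the iris data; the names `X`, `X_norm`, and `row_norms` are introduced only for this example.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.datasets import load_iris

X = load_iris().data
X_norm = preprocessing.normalize(X)  # defaults: norm="l2", axis=1 (per row)

# Every sample now has Euclidean length 1 ...
row_norms = np.linalg.norm(X_norm, axis=1)
print(np.allclose(row_norms, 1.0))  # True

# ... while the columns are not mapped to a common range.
print(X_norm.min(axis=0))
print(X_norm.max(axis=0))
```
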


In [None]:
from sklearn.preprocessing import StandardScaler

iris_df_stan = iris_df[cols].copy()
# This line still calls normalize (row-wise L2 scaling), not StandardScaler
iris_df_stan[cols] = preprocessing.normalize(iris_df[cols])
iris_df_stan.describe()


In [51]:
iris_df_stan = iris_df[cols].copy()
iris_df_stan["original_label"] = iris_df["original_label"]
# The scaler is instantiated but not applied: KMeans below is fit on the unscaled columns
scaler = StandardScaler()

kmeans_stan = KMeans(n_clusters=3, max_iter=300, random_state=1)
kmeans_stan.fit(iris_df_stan[cols])

iris_df_stan["cluster"] = kmeans_stan.labels_
iris_df_stan.groupby(['original_label','cluster']).agg({'sepal length (cm)': 'count'})

| original_label | cluster | sepal length (cm), count |
|---|---|---|
| 0 | 1 | 50 |
| 1 | 0 | 47 |
| 1 | 2 | 3 |
| 2 | 0 | 14 |
| 2 | 2 | 36 |
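
In the cell above the `scaler` object is created but never applied, which is why this table is identical to the one obtained from the unscaled data. A minimal sketch of how standardization could actually be applied follows, assuming `StandardScaler.fit_transform` on the four feature columns; the names `df`, `df_stan`, and `counts` and the explicit `n_init=10` are choices made for this example, and the resulting counts may differ from the table above.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
cols = list(df.columns)

# Standardize: zero mean and unit variance for each column.
scaler = StandardScaler()
df_stan = pd.DataFrame(scaler.fit_transform(df[cols]), columns=cols)

# Fit k-means on the standardized features, not on the raw ones.
kmeans_stan = KMeans(n_clusters=3, max_iter=300, n_init=10, random_state=1)
df_stan["cluster"] = kmeans_stan.fit_predict(df_stan[cols])
df_stan["original_label"] = iris.target

# Number of flowers per (true label, cluster) pair.
counts = df_stan.groupby(["original_label", "cluster"]).size()
print(counts)
```
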
