<img src="imagens/md-logo.jpg" width="96" height="96" align="left"/>

# Clustering

<font color=blue><b> Bootcamp Minerando Dados</b></font><br>
www.minerandodados.com.br

<img src="imagens/clusters.png" width="300" height="300" align="left"/>

**Importing libraries**

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
# Recent scikit-learn releases expose DistanceMetric in sklearn.metrics (formerly sklearn.neighbors)
from sklearn.metrics import DistanceMetric

**Loading the iris dataset**

In [2]:
iris = pd.read_csv("datasets/iris.csv")

In [3]:
iris.head()

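| Unnamed: 0 | SepalLength | SepalWidth | PetalLength | PetalWidth | Species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |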


In [4]:
from IPython.display import Image
Image(filename ="imagens/iris-features.png", width=500, height=500)


**Separating the feature values from the class labels**

In [None]:
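# Keep only the four measurements: 'Unnamed: 0' is just the saved row index, not a feature
X = iris.drop(columns=['Unnamed: 0', 'Species'])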
X[:10]

In [None]:
y = iris.Species
y.unique()

**Converting categorical class values to numeric**

In [None]:
def converte_classe(l):
    if l == 'Iris-virginica':
        return 0
    elif l == 'Iris-setosa':
        return 1
    elif l == 'Iris-versicolor':
        return 2

In [None]:
y = y.apply(converte_classe)
y.value_counts()
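
The same conversion can also be written as a dictionary lookup with `Series.map`. The sketch below uses a small hand-made Series (an assumption for illustration, not the real dataset) and the same label-to-number mapping as `converte_classe`; the names `class_to_id` and `species` are hypothetical.

```python
import pandas as pd

# Same mapping as converte_classe, expressed as a dictionary
class_to_id = {'Iris-virginica': 0, 'Iris-setosa': 1, 'Iris-versicolor': 2}

# Hypothetical sample of labels, standing in for iris.Species
species = pd.Series(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica', 'Iris-setosa'])

# map() looks each label up in the dictionary; a label missing from it would become NaN
species_ids = species.map(class_to_id)
print(species_ids.tolist())
```

A dictionary keeps the mapping in one place, which makes it easier to reuse when reading the confusion matrix later.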

**Computing the Euclidean distance between two points**

<img src="imagens/euclidean-distance.png" width="300" height="300" align="left"/>

In [None]:
dist = DistanceMetric.get_metric('euclidean')
# Use dedicated names so the feature matrix X and the labels y are not overwritten
p1 = [[0, 1, 2]]
p2 = [[1.2, 1.1, 5.2]]
dist.pairwise(p1, p2)

In [None]:
def calcula_distancia(x,c):
    dist = DistanceMetric.get_metric('euclidean')
    return dist.pairwise(x,c)

In [None]:
calcula_distancia(p1, p2)
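
To make the formula concrete, here is a minimal NumPy sketch that computes the same distance by hand: the square root of the sum of squared coordinate differences. The helper name `euclidean_distance` is hypothetical; the two points are the ones used above.

```python
import numpy as np

def euclidean_distance(a, b):
    """Square root of the sum of squared coordinate differences."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Differences are 1.2, 0.1 and 3.2, so the squared sum is 1.44 + 0.01 + 10.24 = 11.69
d = euclidean_distance([0, 1, 2], [1.2, 1.1, 5.2])
print(d)
```

The printed value should agree with the single entry returned by `dist.pairwise` for the same two points.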

**Instantiating the K-means algorithm with 3 clusters**

In [None]:
kmeans = KMeans(n_clusters = 3, init = 'random')

**Running the K-means algorithm**

In [None]:
kmeans.fit(X)
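
To give an idea of what `fit` does, the sketch below implements the basic iteration usually described for K-means: assign each point to its nearest centroid, then move each centroid to the mean of its points, and repeat until the centroids stop moving. This is a simplified illustration and not scikit-learn's implementation, which does more (for example, several initialisations). The function `kmeans_sketch` and the toy data are hypothetical.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=100, seed=0):
    """Basic iterations: assign points to the nearest centroid, then recompute centroids."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Random initialisation: pick k distinct points as starting centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance of every point to every centroid, shape (n_points, k)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # New centroid = mean of its assigned points; keep the old one if a cluster is empty
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return centroids, labels

# Two well-separated hypothetical groups of 2-D points
toy = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                [5.0, 5.0], [5.2, 4.9], [4.9, 5.1]])
centroids, labels = kmeans_sketch(toy, k=2)
print(centroids)
print(labels)
```

On data this well separated, the first three points should end up in one cluster and the last three in the other; which cluster gets id 0 depends on the random start.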

**Inspecting the centroid values**

In [None]:
kmeans.cluster_centers_

**Distance table**

In [None]:
# transform() reuses the model fitted above; fit_transform() would re-run the clustering
distance = kmeans.transform(X)
distance

In [None]:
distance[0]
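
Each row of the distance table holds the distance from one sample to every centroid, so the column with the smallest value is the cluster that sample belongs to. The sketch below checks this on a small synthetic dataset; the data and the names `toy`, `model`, `table` and `nearest` are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two well-separated groups of 2-D points
toy = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                [5.0, 5.0], [5.2, 4.9], [4.9, 5.1]])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy)

# transform() returns one row per sample and one column per cluster
table = model.transform(toy)
print(table.shape)

# The nearest centroid (smallest distance in each row) is the assigned cluster
nearest = table.argmin(axis=1)
print(np.array_equal(nearest, model.labels_))
```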

**Visualizing the distance to each cluster**

In [None]:
%matplotlib notebook
x = ['Cluster 0','Cluster 1','Cluster 2']
plt.barh(x,distance[0])
plt.xlabel('Distance')
plt.title('Distance to each cluster')
plt.show()

**Printing the labels**

In [None]:
labels = kmeans.labels_
labels

## Visualizing the centroids

In [None]:
%matplotlib notebook
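plt.figure(figsize=(8,6))
# Look up column positions by name so the centroid coordinates match the plotted axes
i_len, i_wid = X.columns.get_loc('SepalLength'), X.columns.get_loc('SepalWidth')
plt.scatter(X['SepalLength'], X['SepalWidth'], s = 100, c = kmeans.labels_)
plt.scatter(kmeans.cluster_centers_[:, i_len], kmeans.cluster_centers_[:, i_wid], s = 300, c = 'red',label = 'Centroids')
plt.title('Iris dataset and centroids')
plt.xlabel('SepalLength')
plt.ylabel('SepalWidth')
plt.legend()  # needed for the 'Centroids' label to appear
plt.show()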

**Clustering new data**

In [None]:
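# Use the training column names so scikit-learn does not warn about missing feature names
data = pd.DataFrame(
    [
        [4.12, 3.4, 1.6, 0.7],
        [5.2, 5.8, 5.2, 6.7],
        [3.1, 3.5, 3.3, 3.0]
    ],
    columns=['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']
)
kmeans.predict(data)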

**Visualizing the results**

In [None]:
%matplotlib notebook
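f,(ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(8,6))
# Colour each point by its own label; sorting the labels would detach them from the points.
# Cluster ids are arbitrary, so the colours of the two panels need not match.
ax1.set_title('Original')
ax1.scatter(X['SepalLength'], X['SepalWidth'],s=150,c=iris['Species'].apply(converte_classe))
ax2.set_title('KMeans')
ax2.scatter(X['SepalLength'], X['SepalWidth'],s=150,c=kmeans.labels_)
plt.show()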

**Estimating the value of the parameter K: the Elbow method**

In [None]:
%matplotlib notebook
wcss = []  # within-cluster sum of squares for each value of K

for i in range(1, 11):
    kmeans2 = KMeans(n_clusters = i, init = 'random')
    kmeans2.fit(X)
    print(i, kmeans2.inertia_)
    wcss.append(kmeans2.inertia_)

plt.plot(range(1, 11), wcss)
plt.title('The Elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS') # within-cluster sum of squares
plt.show()
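
The quantity plotted on the vertical axis, `inertia_`, is the sum of squared distances from each sample to its nearest centroid. The sketch below recomputes it by hand on synthetic data and compares it with the value reported by scikit-learn; the three-blob dataset and the names `toy`, `wcss_by_k` and `inertia_by_k` are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical data: three Gaussian blobs in 2-D
toy = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2)) for c in (0.0, 3.0, 6.0)])

wcss_by_k = {}
inertia_by_k = {}
for k in range(1, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(toy)
    # Squared distance from each point to its nearest centroid, summed over all points
    wcss_by_k[k] = float(np.sum(model.transform(toy).min(axis=1) ** 2))
    inertia_by_k[k] = float(model.inertia_)

for k in wcss_by_k:
    print(k, round(wcss_by_k[k], 2), round(inertia_by_k[k], 2))
```

With three clearly separated blobs, the value should drop sharply up to K = 3 and flatten afterwards, which is the "elbow" the plot is meant to reveal.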

## Validation techniques

### Confusion matrix

In [None]:
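# Use the Species column directly; cluster ids are arbitrary, so the counts need not fall on a diagonal
print(pd.crosstab(iris['Species'], kmeans.labels_, rownames=['Actual'], colnames=['Predicted'], margins=True))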
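
Because K-means numbers its clusters arbitrarily, one way to read the table is to map each cluster to the class that appears most often in it. The sketch below does this on a small hand-made example; the labels, the cluster assignments and the variable names are all hypothetical, and majority mapping is only one possible convention.

```python
import pandas as pd

# Hypothetical true classes and cluster ids for nine samples
real = pd.Series(['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'], name='Real')
cluster = pd.Series([2, 2, 2, 0, 0, 1, 1, 1, 1], name='Cluster')

# Rows are true classes, columns are cluster ids
table = pd.crosstab(real, cluster)
print(table)

# For each cluster (column), take the most frequent true class
mapping = table.idxmax(axis=0)

# Translate cluster ids into class names and measure the agreement
predicted = cluster.map(mapping)
accuracy = (predicted == real).mean()
print(mapping.to_dict(), accuracy)
```

Here eight of the nine samples agree with the majority class of their cluster; the remaining one is a `b` sample that landed in the cluster dominated by `c`.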