<h1>Algorithms: Knn and K-means</h1>
<p>Iris and Boston datasets</p>

<p>Summary: We apply the Knn and K-means learning algorithms to the Iris and Boston datasets. Knn is a supervised learning method, used here for classification. K-means is an unsupervised learning method.</p>

<h2>1) Introduction</h2>
<p>Both datasets can be imported directly from <i>sklearn.datasets</i>. After importing them, we convert each one to a <i>Data Frame</i> to make the data easier to inspect. Knowing the layout of the data is what allows us to apply the Knn and K-means algorithms.</p>

<p>We name the Iris target <i>Class</i> and the Boston target <i>Value</i>.</p>

In [19]:
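# Imports
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
# Note: load_boston was removed in scikit-learn 1.2, so this import
# requires an older scikit-learn release.
from sklearn.datasets import load_boston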

iris = load_iris()
table_iris = pd.DataFrame(iris.data)
table_iris.columns = iris.feature_names
table_iris['Class'] = iris.target

boston = load_boston()
table_boston = pd.DataFrame(boston.data)
table_boston.columns = boston.feature_names
table_boston['Value'] = boston.target

In [20]:
# DISPLAY: iris
table_iris.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),Class
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


In [21]:
# DISPLAY: boston
table_boston.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,Value
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33,36.2


<h2>2) Methodology</h2>
<p>Note from the Boston table that its target values are continuous. Knn is traditionally a classification algorithm, that is, it expects a discrete target. To reconcile this difference in data format, we apply the following procedure:</p>

<ol>
    <li>Sort the Boston target column in ascending order.</li>
    <li>Split the sorted target column into 4 partitions.</li>
    <li>Assign a <i>Class</i>, represented by a number, to all rows that fall in the same partition interval.</li>
</ol>

<p>As a consequence of the procedure above, we obtain a discretization of the Boston dataset according to its price ranges.</p>
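
<p>The three steps above can also be sketched on a plain array. This is a minimal illustration, not the notebook's own code: the function name <i>discretize_by_rank</i> and the sample prices are hypothetical, and <i>np.array_split</i> may place the partition boundaries slightly differently from hand-picked row ranges.</p>

```python
import numpy as np

def discretize_by_rank(values, n_bins=4):
    """Label the lowest-valued block n_bins and the highest-valued block 1."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")   # step 1: ascending order
    blocks = np.array_split(order, n_bins)      # step 2: n_bins partitions
    classes = np.empty(len(values), dtype=np.int32)
    for b, idx in enumerate(blocks):            # step 3: one label per block
        classes[idx] = n_bins - b
    return classes

# Hypothetical prices, used only to illustrate the procedure.
prices = np.array([24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1])
labels = discretize_by_rank(prices)
print(labels)  # [3 4 1 2 1 2 4 3]
```

<p>Unlike the sort-then-relabel approach, this sketch leaves the rows in their original order and only returns the class of each row.</p>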

In [25]:
# Discretize the prices into 4 classes (4 = cheapest block, 1 = most expensive)
table_boston = table_boston.sort_values(by=['Value'])
table_boston = table_boston.reset_index(drop=True)
# Assign with .loc on a row slice instead of chained indexing
# (iloc[i]['Value'] = ...), which may write to a temporary copy.
# Note that .loc slices include both endpoints.
table_boston.loc[0:125, 'Value'] = 4
table_boston.loc[126:252, 'Value'] = 3
table_boston.loc[253:379, 'Value'] = 2
table_boston.loc[380:505, 'Value'] = 1

#table_boston['Value'] = pd.to_numeric(table_boston['Value'])
table_boston['Value'] = table_boston['Value'].astype("int32")

In [26]:
# DISPLAY: discretized boston
table_boston.head(10)

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,Value
0,38.3518,0.0,18.1,0.0,0.693,5.453,100.0,1.4896,24.0,666.0,20.2,396.9,30.59,4
1,6.71772,0.0,18.1,0.0,0.713,6.749,92.6,2.3236,24.0,666.0,20.2,0.32,17.44,4
2,67.9208,0.0,18.1,0.0,0.693,5.683,100.0,1.4254,24.0,666.0,20.2,384.97,22.98,4
3,25.0461,0.0,18.1,0.0,0.693,5.987,100.0,1.5888,24.0,666.0,20.2,396.9,26.77,4
4,9.91655,0.0,18.1,0.0,0.693,5.852,77.8,1.5004,24.0,666.0,20.2,338.16,29.97,4
5,45.7461,0.0,18.1,0.0,0.693,4.519,100.0,1.6582,24.0,666.0,20.2,88.27,36.98,4
6,0.18337,0.0,27.74,0.0,0.609,5.414,98.3,1.7554,4.0,711.0,20.1,344.05,23.97,4
7,14.2362,0.0,18.1,0.0,0.693,6.343,100.0,1.5741,24.0,666.0,20.2,396.9,20.32,4
8,16.8118,0.0,18.1,0.0,0.7,5.277,98.1,1.4261,24.0,666.0,20.2,396.9,30.81,4
9,18.0846,0.0,18.1,0.0,0.679,6.434,100.0,1.8347,24.0,666.0,20.2,27.25,29.05,4


<h3>2.1) Iris</h3>

<p>To classify Iris, we use the Knn implementation developed for this work. We first separate the features from the classification target. Part of the data is used for the supervised training of the algorithm, and the remainder is used to verify the classification. We also report the score, which for Iris reaches 100% on this partition when the 5 nearest neighbours are used.</p>
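
<p>As a rough illustration of the idea behind Knn, the sketch below classifies a single point by majority vote among its k nearest training points. It is a simplified stand-in, not the MyKnn class used in this work; the function name <i>knn_predict_one</i> and the toy data are hypothetical.</p>

```python
import numpy as np
from collections import Counter

def knn_predict_one(train_X, train_y, query, k):
    """Return the majority label among the k training points nearest to query."""
    dists = np.linalg.norm(train_X - query, axis=1)   # Euclidean distances
    nearest = np.argsort(dists, kind="stable")[:k]    # indices of the k closest
    votes = Counter(train_y[nearest].tolist())        # one vote per neighbour
    return votes.most_common(1)[0][0]                 # ties: first label counted wins

# Hypothetical toy data: two well-separated groups in the plane.
train_X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
train_y = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict_one(train_X, train_y, np.array([0.5, 0.5]), k=3)
pred_b = knn_predict_one(train_X, train_y, np.array([5.2, 5.1]), k=3)
print(pred_a, pred_b)  # 0 1
```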

In [65]:
knn = MyKnn(5)
X = iris.data[:,:3]  # features (first three columns only)
y = iris.target  # class labels
# training data: roughly the first 40 samples of each class
# (indices 50 and 100 are skipped, so classes 1 and 2 contribute 39 samples each)
xt = np.concatenate([X[:40,:], X[51:90,:], X[101:140,:]])
yt = np.concatenate([y[:40], y[51:90], y[101:140]])
knn.fit(xt, yt)

# validation with the remaining samples
xv = np.concatenate([X[40:50,:], X[90:100,:], X[140:150,:]])
yv = np.concatenate([y[40:50], y[90:100], y[140:150]])
yp = knn.predict(xv)

print(yp)  # predicted result
print(yv)  # expected result
print(knn.score(xv, yv))

[0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2]
[0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2]
100.0
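
<p>The split in the cell above relies on hand-computed slice bounds. A possible alternative, sketched here under the assumption that the labels are available as a NumPy array, derives the training and validation indices from the labels themselves, which also avoids leaving out boundary samples such as indices 50 and 100. The helper name <i>per_class_split</i> is hypothetical.</p>

```python
import numpy as np

def per_class_split(y, n_train):
    """Take the first n_train samples of each class for training, the rest for validation."""
    train_idx, valid_idx = [], []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)   # positions of this class, in order
        train_idx.extend(idx[:n_train])
        valid_idx.extend(idx[n_train:])
    return np.array(train_idx), np.array(valid_idx)

# Same layout as iris.target: 50 consecutive samples per class.
y = np.repeat([0, 1, 2], 50)
train_idx, valid_idx = per_class_split(y, 40)
print(len(train_idx), len(valid_idx))  # 120 30
```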


<p>K-means was applied to Iris with the same data partition used for Knn. The main difference is that no supervised learning is involved. K-means estimates cuts in the dataset, trying to group together elements with similar characteristics. For our test we use 3 groups, since we know in advance that Iris is classified this way.</p>

<p>Note that, for this example, Knn agrees with the true classes more often than K-means does. Keep in mind that K-means cluster indices are arbitrary, so they can be compared with the class labels only up to a relabelling.</p>

In [60]:
kmeans = MyKMeans(3)
X = iris.data[:,:3]  # features (first three columns only)
y = iris.target  # class labels (not used for fitting)
# training data: roughly the first 40 samples of each class
xt = np.concatenate([X[:40,:], X[51:90,:], X[101:140,:]])
yt = np.concatenate([y[:40], y[51:90], y[101:140]])
kmeans.fit(xt)

xv = np.concatenate([X[40:50,:], X[90:100,:], X[140:150,:]])
yv = np.concatenate([y[40:50], y[90:100], y[140:150]])
yp = kmeans.predict(xv)
print(yp)  # cluster index assigned to each sample
print(yv)  # true class of each sample

[0 0 0 0 0 0 0 0 0 0 2 2 2 0 2 2 2 2 0 2 2 2 2 2 2 2 2 2 2 2]
[0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2]
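
<p>Because cluster indices carry no intrinsic meaning, one hedged way to put a number on the K-means result is to map each cluster to the most frequent true class among its members and then measure agreement. The sketch below applies this to the two arrays printed above; the function name <i>majority_mapping_accuracy</i> is hypothetical and this scoring rule is an assumption, not part of the original work.</p>

```python
import numpy as np

def majority_mapping_accuracy(clusters, labels):
    """Relabel each cluster with its most frequent true class, then return accuracy in %."""
    clusters = np.asarray(clusters)
    labels = np.asarray(labels)
    mapped = np.empty_like(labels)
    for c in np.unique(clusters):
        members = clusters == c
        values, counts = np.unique(labels[members], return_counts=True)
        mapped[members] = values[np.argmax(counts)]
    return 100.0 * np.mean(mapped == labels)

# Arrays copied from the output above.
yp = [0] * 10 + [2, 2, 2, 0, 2, 2, 2, 2, 0, 2] + [2] * 10
yv = [0] * 10 + [1] * 10 + [2] * 10
acc = majority_mapping_accuracy(yp, yv)
print(round(acc, 1))  # 66.7
```

<p>For this run the mapped agreement is about 66.7%, against the 100% score reported for Knn on the same validation samples.</p>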


<h3>2.2) Boston</h3>

In [82]:
from sklearn.model_selection import train_test_split
knn = MyKnn(5)
# .iloc selects columns by position; NumPy arrays are used because MyKnn indexes rows positionally
X = table_boston.iloc[:, :-1].to_numpy()  # features (all columns except Value)
y = table_boston.iloc[:, -1].to_numpy()  # class labels (Value column)
# hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

In [73]:
table_boston.iloc[:,0:-1]

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
0,38.35180,0.0,18.10,0.0,0.6930,5.453,100.0,1.4896,24.0,666.0,20.2,396.90,30.59
1,6.71772,0.0,18.10,0.0,0.7130,6.749,92.6,2.3236,24.0,666.0,20.2,0.32,17.44
2,67.92080,0.0,18.10,0.0,0.6930,5.683,100.0,1.4254,24.0,666.0,20.2,384.97,22.98
3,25.04610,0.0,18.10,0.0,0.6930,5.987,100.0,1.5888,24.0,666.0,20.2,396.90,26.77
4,9.91655,0.0,18.10,0.0,0.6930,5.852,77.8,1.5004,24.0,666.0,20.2,338.16,29.97
5,45.74610,0.0,18.10,0.0,0.6930,4.519,100.0,1.6582,24.0,666.0,20.2,88.27,36.98
6,0.18337,0.0,27.74,0.0,0.6090,5.414,98.3,1.7554,4.0,711.0,20.1,344.05,23.97
7,14.23620,0.0,18.10,0.0,0.6930,6.343,100.0,1.5741,24.0,666.0,20.2,396.90,20.32
8,16.81180,0.0,18.10,0.0,0.7000,5.277,98.1,1.4261,24.0,666.0,20.2,396.90,30.81
9,18.08460,0.0,18.10,0.0,0.6790,6.434,100.0,1.8347,24.0,666.0,20.2,27.25,29.05


<h2>3) Code</h2>

<p>Below is the code developed for this work and used in our methodology.</p>

In [37]:
#Knn
import numpy as np
class MyKnn:

    kneighbor = None
    instances = None
    numOfInstances = None
    classification = None
    classes = None
    numOfClasses = None
    
    def __init__(self, kneighbor):
        self.kneighbor = kneighbor
    
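    def distEuclid (self, x, y):
        # Euclidean distance between two vectors of equal length
        sizeX = len(x)
        sizeY = len(y)
        if sizeX != sizeY :
            # raise instead of returning False, which callers would silently read as distance 0
            raise ValueError("incompatible dimensions")

        result = 0
        for i in range(sizeX):
            result = result + (x[i] - y[i])**2

        return (result)**(1/2)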
    
    def fit (self, instances, classification):
        self.instances = instances
        self.numOfInstances = len(instances)
        self.classification = classification
        r  = len(classification)
        self.classes = np.array(classification[0])
        for i in range(r):
            if classification[i] not in self.classes :
                self.classes = np.append(self.classes, classification[i])
        
        self.numOfClasses = len(self.classes)
        
    def predict(self, x):
        [r, c] = x.shape
        y = np.zeros(r, dtype = self.classification.dtype)
        
        for i in range(r):
            store = x[i, :]
            dist_neighbor = np.zeros(self.numOfInstances)
            for j in range(self.numOfInstances):
                dist_neighbor[j] = self.distEuclid(store, self.instances[j])
            
            dist_index = np.argsort(dist_neighbor)
            election = np.zeros(self.numOfClasses)
            
            for k in range(self.kneighbor):
                for z in range(self.numOfClasses):
                    if (self.classification[dist_index[k]] == self.classes[z]) :
                        election[z] = election[z] + 1
                        break
            
            y[i] = self.classes[np.argmax(election)]
            
        return y
    
    def score(self, x, correct):
        y = self.predict(x)
        success = 0
        for i in range(len(x)):
            if y[i] == correct[i] :
                success = success + 1
                
        return success * 100/len(x)
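
<p>The nested loops in <i>predict</i> compute one distance at a time. As an optional sketch, assuming both inputs are 2-D NumPy arrays with the same number of columns, all query-to-training distances can be computed at once with broadcasting. The function name <i>pairwise_euclidean</i> is hypothetical and is not part of MyKnn.</p>

```python
import numpy as np

def pairwise_euclidean(A, B):
    """Return a (len(A), len(B)) matrix of Euclidean distances between rows of A and B."""
    diff = A[:, None, :] - B[None, :, :]      # shape (len(A), len(B), n_features)
    return np.sqrt((diff ** 2).sum(axis=2))

A = np.array([[0.0, 0.0], [3.0, 4.0]])
B = np.array([[0.0, 0.0]])
D = pairwise_euclidean(A, B)
print(D)  # distances 0.0 and 5.0, one row per point in A
```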

In [38]:
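# K-means
# 1. initialize the centroids randomly
# 2. for each point in the dataset, compute its distance to every centroid
#    and assign it to the closest one
# 3. compute the mean of all points assigned to each centroid and use it as
#    the new centroid (repeat steps 2 and 3)
# Note: the implementation below updates each centroid incrementally as points
# are (re)assigned, rather than in a separate averaging pass.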
class MyKMeans:
    
    k = None
    centers = None
    numInCenter = None
    instances = None
    
    def __init__(self, k):
        self.k = k
    
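    def distEuclid (self, x, y):
        # Euclidean distance between two vectors of equal length
        sizeX = len(x)
        sizeY = len(y)
        if sizeX != sizeY :
            # raise instead of returning False, which callers would silently read as distance 0
            raise ValueError("incompatible dimensions")

        result = 0
        for i in range(sizeX):
            result = result + (x[i] - y[i])**2

        return (result)**(1/2)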
    
    def upCenter(self, indCenter, instance):
        temp = len(instance) - 1
        for j in range(temp):
            self.centers[indCenter][j] *= self.numInCenter[indCenter]
            self.centers[indCenter][j] += instance[j]
            self.centers[indCenter][j] = self.centers[indCenter][j]/(self.numInCenter[indCenter] + 1)
        
        self.numInCenter[indCenter] += 1
    
    def refactorCenter(self, indCenter, i, instance):
        previousCenter = int(self.instances[i, -1])  # last column of row i holds the current center index
        temp = len(instance) - 1
        
        for j in range(temp):
            self.centers[previousCenter][j] *= self.numInCenter[previousCenter]
            self.centers[previousCenter][j] -= instance[j]
            self.centers[previousCenter][j] = self.centers[previousCenter][j]/(self.numInCenter[previousCenter] - 1)
        
        self.numInCenter[previousCenter] -= 1        
        self.upCenter(indCenter, instance)
        
    def fit (self, instances):        
        self.instances = instances
        [r , c] = instances.shape        
        minCoordCenter = np.zeros(c)
        maxCoordCenter = np.zeros(c)
        self.instances = np.c_[self.instances, (-1)*np.ones((r,1))]
        
        for j in range(c):
            minCoordCenter[j] = min(instances[:,j])
            maxCoordCenter[j] = max(instances[:,j])
        
        self.centers = np.zeros((self.k,c))
        self.numInCenter = np.ones(self.k)
        for k in range(self.k):
            for j in range(c):
                self.centers[k,j] = np.random.uniform(minCoordCenter[j], maxCoordCenter[j])
                   
        while(True):
            flag = 0
            for i in range(r):
                minDist = np.inf  # no finite cap, so a closest center is always found
                indCenterMinDist = None
                # find the closest center
                for k in range(self.k):
                    temp = self.distEuclid(self.centers[k], self.instances[i,:-1])
                    if minDist > temp:
                        minDist = temp
                        indCenterMinDist = k
                # check whether the point changes center; the last column stores the index of its center
                if self.instances[i, c] == -1 :
                    self.instances[i, c] = indCenterMinDist
                    self.upCenter(indCenterMinDist, self.instances[i])
                    flag = 1
                elif self.instances[i, c] != indCenterMinDist :
                    self.refactorCenter(indCenterMinDist, i, self.instances[i])
                    self.instances[i, c] = indCenterMinDist
                    flag = 1
            
            if flag == 0 :
                break
    
    def predict(self, x):
        [r, c] = x.shape
        y = np.zeros(r, dtype = 'int32')
        for i in range(r):
            minDist = np.inf  # no finite cap, so a closest center is always found
            indCenterMinDist = None
            # find the closest center
            for k in range(self.k):
                temp = self.distEuclid(self.centers[k], x[i])
                if minDist > temp:
                    minDist = temp
                    indCenterMinDist = k
            y[i] = indCenterMinDist
        return y
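
<p>For comparison, the three steps listed in the comments at the top of the K-means cell can be sketched as a batch iteration in which all points are assigned first and all centroids are recomputed afterwards. This is a simplified illustration, not the MyKMeans class: the function name <i>lloyd_kmeans</i>, the iteration cap and the choice of initial centroids among the data points are assumptions made for the sketch.</p>

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Batch K-means: alternate nearest-centroid assignment and mean update."""
    rng = np.random.default_rng(seed)
    # Step 1 (assumption): use k distinct data points as the initial centroids.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assign = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Step 2: distance from every point to every centroid, then nearest one.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = np.argmin(dists, axis=1)
        # Step 3: move each centroid to the mean of its assigned points.
        new_centers = centers.copy()
        for j in range(k):
            members = X[assign == j]
            if len(members) > 0:          # keep an empty cluster's centroid where it is
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break                         # centroids stopped moving
        centers = new_centers
    return centers, assign

# Hypothetical toy data: two well-separated groups of three points each.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
centers, assign = lloyd_kmeans(X, k=2)
print(assign)
```

<p>Which group receives index 0 depends on the random initialization, so only the grouping itself is meaningful.</p>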