# Homework

1) Write your own implementation of the k-means algorithm with random centroid initialization and two stopping conditions: a maximum number of iterations and centroid convergence (the algorithm should stop when no attribute of any centroid changes by more than some epsilon).

2) Use your implementation to cluster a dataset of cereal products and their dietary characteristics (cereals.csv, 16 attributes).

3) The dataset contains some nominal attributes (name, mfr, type). You can omit the first two. The type attribute is binary, so you can replace its values with 0 and 1.

4) Cluster the cereals into 3 groups using the k-means algorithm.

5) Remember to preprocess the input: normalization/standardization, attribute selection.

6) Try to describe the obtained groups based on their centroids: what do all cereals within a group have in common?

7) Write a report describing the preprocessing methods used, the number of cereals in each cluster, and your conclusions about the clustering results.

Deadline +2 weeks
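
The algorithm described in task 1 can be sketched in vectorized NumPy form. This is a minimal illustration and not the implementation used in this notebook. The name `kmeans_sketch`, its parameters, and the choice to leave an empty cluster's centroid unchanged are assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k=3, max_iter=100, eps=1e-5, rng=None):
    """Hypothetical helper: k-means on an (n_samples, n_features) float array."""
    rng = np.random.default_rng(rng)
    # Random initialization: pick k distinct rows as the starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):  # stopping condition 1: iteration limit
        # Squared Euclidean distance from every row to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # assumption: an empty cluster keeps its old centroid
                new_centroids[j] = members.mean(axis=0)
        # Stopping condition 2: no coordinate of any centroid moved by more than eps.
        converged = np.all(np.abs(new_centroids - centroids) <= eps)
        centroids = new_centroids
        if converged:
            break
    return labels, centroids
```

Passing an integer as `rng` makes the random initialization reproducible, which plays the same role as fixing the starting rows by hand.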


In [1]:
import pandas as pd
import numpy as np
import math

In [2]:
# Min-max normalization of the data
def normalizer(col):
    mn=col.min()
    mx=col.max()
    return col.apply(lambda x: (x-mn)/(mx-mn))
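
Task 5 also mentions standardization as an alternative to the min-max scaling in `normalizer`. A possible z-score counterpart is sketched below. The name `standardizer` and the handling of constant columns are assumptions, and this function is not used elsewhere in the notebook.

```python
import pandas as pd

def standardizer(col):
    # Hypothetical counterpart to normalizer(): z-score standardization,
    # i.e. subtract the mean and divide by the standard deviation.
    std = col.std(ddof=0)  # population standard deviation
    if std == 0:
        # Assumption: a constant column carries no information, so map it to zeros.
        return col * 0.0
    return (col - col.mean()) / std
```

Unlike min-max scaling, the result is not bounded to [0, 1], so outliers keep a larger influence on the Euclidean distance.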

In [3]:
# Load the dataset and preprocess it
df=pd.read_csv("cereal.csv")
df['type']=df['type'].replace(['H', 'C'], [0, 1])
indexes=df.columns[2:]
indexes=indexes[(indexes!='weight') & (indexes!='cups') & (indexes!='shelf') & (indexes!='rating')]
print(indexes)

for x in indexes:
    df[x]=normalizer(df[x])

Index(['type', 'calories', 'protein', 'fat', 'sodium', 'fiber', 'carbo',
       'sugars', 'potass', 'vitamins'],
      dtype='object')


In [4]:
# Squared Euclidean distance between two series
def calculate_dist(serie1, serie2):
    res=sum((serie1-serie2)*(serie1-serie2))
    return res

# Check whether the old and new centroids differ by little enough to stop the algorithm
def equivalence(list1, list2, eps):
    for x, y in zip(list1, list2):
        for z, w in zip(x, y):
            res=(z-w)*(z-w)
            if(res>eps*eps):
                return False
    return True

# Arguments: df - the dataframe; indexes - the columns to use; classes - the number of centroids;
# max_ite - maximum number of iterations; eps - epsilon; seed - an optional list of row indexes to use as
# the initial centroids, which makes the run deterministic (useful for debugging and for writing the report)
def kMeans(df, indexes, classes=3, max_ite=100, eps=0.00001, seed=None):
    # Random initial centroids (or the rows given in seed)
    if seed is None:
        randomized_indexes=np.random.choice(range(df.shape[0]), classes, replace=False)
    else:
        randomized_indexes=seed
    all_centroids=[df[indexes].loc[x] for x in randomized_indexes]
    current_indexes=[[] for i in range(classes)]
    last_centroids=[df[indexes].loc[x] for x in randomized_indexes]
    
    for ij in range(max_ite):
        # Find the row indexes assigned to each centroid
        for j, x in df.iterrows():
            correct_x=x[indexes]
            
            dist=1000000000
            cur_index=-1
            for i in range(classes):
                res=calculate_dist(correct_x, all_centroids[i])
                if (res<dist):
                    dist, cur_index=res, i
            current_indexes[cur_index].append(j)
        
        # Move the centroids
        for i in range(classes):
            for x in indexes:
                last_centroids[i][x]=all_centroids[i][x].copy()
                all_centroids[i][x]=0
            
            for j in current_indexes[i]:
                for x in indexes:
                    all_centroids[i][x]+=df.loc[j,x]/len(current_indexes[i])
        
        # Check whether the new and previous centroids are equivalent for the given epsilon
        if (equivalence(last_centroids, all_centroids, eps)):
            return [current_indexes, all_centroids]

        print(f'Iteration {ij}')
        # Reset the assignments for the next pass, but keep them after the final
        # iteration so the returned groups are not empty when max_ite is reached
        if ij < max_ite - 1:
            current_indexes=[[] for i in range(classes)]
    return [current_indexes, all_centroids]

In [12]:
# The seed is passed in the call to reproduce the results from the report; removing it
# randomizes the initial centroids
rf=kMeans(df, indexes, seed=[27, 36, 12])

Iteration 0
Iteration 1
Iteration 2
Iteration 3
Iteration 4
Iteration 5
Iteration 6
Iteration 7
Iteration 8
Iteration 9
Iteration 10
Iteration 11
Iteration 12
Iteration 13
Iteration 14


In [13]:
print('Indexes belonging to each group: \n')
_=[print(f'{x}') for x in rf[0]]

Indexes belonging to each group: 

[0, 2, 3, 20, 26, 43, 54, 55, 57, 60, 63, 64, 65, 68]
[8, 9, 11, 15, 16, 21, 23, 32, 33, 38, 39, 40, 47, 49, 50, 53, 56, 61, 62, 67, 69, 71, 72, 74, 75]
[1, 4, 5, 6, 7, 10, 12, 13, 14, 17, 18, 19, 22, 24, 25, 27, 28, 29, 30, 31, 34, 35, 36, 37, 41, 42, 44, 45, 46, 48, 51, 52, 58, 59, 66, 70, 73, 76]


In [14]:
print('Centroid positions: ')
print()
_=[print(f'{x}\n') for x in rf[1]]

Centroid positions: 

type        0.785714
calories    0.279221
protein     0.400000
fat         0.071429
sodium      0.139509
fiber       0.284184
carbo       0.571429
sugars      0.200893
potass      0.403107
vitamins    0.125000
Name: 27, dtype: float64

type        1.000000
calories    0.512727
protein     0.352000
fat         0.128000
sodium      0.686875
fiber       0.125714
carbo       0.805000
sugars      0.307500
potass      0.241692
vitamins    0.400000
Name: 36, dtype: float64

type        1.000000
calories    0.607656
protein     0.247368
fat         0.300000
sodium      0.507812
fiber       0.124060
carbo       0.576754
sugars      0.726974
potass      0.286771
vitamins    0.263158
Name: 12, dtype: float64
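
The centroids above are expressed in min-max normalized units, which can make them harder to interpret for task 6. One option is to map a centroid back to the original units by inverting the scaling. The sketch below assumes the per-column minimum and maximum of the raw data were saved before normalization. The helper `denormalize` and the small example frame are hypothetical, and the numbers are not taken from the cereal dataset.

```python
import pandas as pd

def denormalize(centroid, col_min, col_max):
    # Invert min-max scaling per attribute: x = x_norm * (max - min) + min.
    return centroid * (col_max - col_min) + col_min

# Illustrative values only, not taken from the cereal dataset.
raw = pd.DataFrame({"a": [10.0, 20.0, 30.0], "b": [1.0, 3.0, 5.0]})
col_min, col_max = raw.min(), raw.max()
normalized = (raw - col_min) / (col_max - col_min)

# Centroid of the first two rows, computed in normalized space.
centroid = normalized.iloc[:2].mean()
restored = denormalize(centroid, col_min, col_max)
print(restored)
```

Because min-max scaling is linear, the restored centroid equals the mean of the member rows in the original units.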



In [15]:
for i in range(3):
    print(df.loc[rf[0][i]].describe())
    print()

            type   calories   protein        fat     sodium      fiber  \
count  14.000000  14.000000  14.00000  14.000000  14.000000  14.000000   
mean    0.785714   0.279221   0.40000   0.071429   0.139509   0.284184   
std     0.425815   0.176367   0.22188   0.126665   0.249433   0.294297   
min     0.000000   0.000000   0.00000   0.000000   0.000000   0.000000   
25%     1.000000   0.181818   0.20000   0.000000   0.000000   0.089286   
50%     1.000000   0.363636   0.40000   0.000000   0.000000   0.214286   
75%     1.000000   0.431818   0.60000   0.150000   0.199219   0.267857   
max     1.000000   0.454545   0.80000   0.400000   0.812500   1.000000   

           carbo     sugars     potass   vitamins      shelf     weight  \
count  14.000000  14.000000  14.000000  14.000000  14.000000  14.000000   
mean    0.571429   0.200893   0.403107   0.125000   2.142857   0.916429   
std     0.259658   0.183742   0.314920   0.129719   0.864438   0.182108   
min     0.000000   0.000000   0.0

In [9]:
for i in range(3):
    print (f"######### {i+1}. cluster, size: {len(df.loc[rf[0][i]]['name'])} ###########")
    print(df.loc[rf[0][i]]['name'])

######### 1. cluster, size: 14 ###########
0                     100% Bran
2                      All-Bran
3     All-Bran with Extra Fiber
20       Cream of Wheat (Quick)
26          Frosted Mini-Wheats
43                        Maypo
54                  Puffed Rice
55                 Puffed Wheat
57               Quaker Oatmeal
60               Raisin Squares
63               Shredded Wheat
64       Shredded Wheat 'n'Bran
65    Shredded Wheat spoon size
68      Strawberry Fruit Wheats
Name: name, dtype: object
######### 2. cluster, size: 25 ###########
8                       Bran Chex
9                     Bran Flakes
11                       Cheerios
15                      Corn Chex
16                    Corn Flakes
21                        Crispix
23                    Double Chex
32              Grape Nuts Flakes
33                     Grape-Nuts
38    Just Right Crunchy  Nuggets
39         Just Right Fruit & Nut
40                            Kix
47           Multi-Grain Cheerio