# Clustering drills

Welcome, traveller, you have arrived in the drills section of the clustering chapter. Here, you can practice some clustering techniques.

If you have not checked out the [example](./1.clustering_with_sklearn.ipynb) yet, I advise you to do so first. We will use the same [pokemon](./assets/pokemon.csv) dataset to further our journey.

## 1. Multi-dimensional data

In the example, we wanted to determine the **most physically diverse** [pokemon](./assets/pokemon.csv) team there is. To do so, we clustered the pokemon into groups according to their **weight** and **height** using **k-means**.

Is this really the most diverse team out there though? The pokemon selected there are still similar in terms of **combat abilities**, and we have this data available.

For the first drill, I want you to:
   - cluster the pokemon into 6 groups according to similar:
       - height
       - weight
       - hp
       - attack
       - defense
       - speed
   - visualise these multidimensional clusters using a scatter plot matrix
   - determine the most dissimilar pokemon team from these clusters
   
You will end up with **6-dimensional** clusters of **diverse** pokemon, from which you can pick your team using the method described in the example.

In [1]:
# cluster your pokemon here
import pandas as pd

pokemon = pd.read_csv("./assets/pokemon.csv")

In [2]:
print(pokemon['weight_kg'].isnull().sum())
print(pokemon['height_m'].isnull().sum())
# print(pokemon['hp'].isnull().sum())
# print(pokemon['attack'].isnull().sum())
# print(pokemon['defense'].isnull().sum())
# print(pokemon['speed'].isnull().sum())


20
20


In [3]:
# drop every row that is missing a weight or a height
pokemon = pokemon.dropna(axis=0, subset=['weight_kg', 'height_m'])

# adjusting index
pokemon = pokemon.reset_index(drop=True)

In [None]:
print(pokemon['height_m'].isnull().sum())


In [None]:
pokemon.head(40)

In [None]:
pokemon_subset = pokemon[['weight_kg','height_m','hp','attack','defense','speed']]

In [None]:
pokemon.head(5)

In [None]:
pokemon.info()

In [None]:
# visualise your clusters here (take a look at the pandas scatter_matrix or seaborn's pairplot method)
import seaborn as sns

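# one colour per primary type; a dict keeps the type-to-colour mapping explicit
# (dark, steel and flying were missing from the original list and are added here)
pokemon_type_colors = {'grass': '#78C850',
                       'fire': '#F08030',
                       'water': '#6890F0',
                       'bug': '#A8B820',
                       'normal': '#A8A878',
                       'poison': '#A040A0',
                       'electric': '#F8D030',
                       'ground': '#E0C068',
                       'fairy': '#EE99AC',
                       'fighting': '#C03028',
                       'psychic': '#F85888',
                       'rock': '#B8A038',
                       'ghost': '#705898',
                       'ice': '#98D8D8',
                       'dragon': '#7038F8',
                       'dark': '#705848',
                       'steel': '#B8B8D0',
                       'flying': '#A890F0',
                   }

sns.pairplot(data=pokemon, vars=['weight_kg','height_m','hp','attack','defense','speed'], hue='type1', kind='scatter', palette=pokemon_type_colors)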




In [None]:
# determine your final pokemon here

In [None]:
from sklearn.cluster import KMeans

# number of pokemon clusters
team_size = 6

# make new dataframe with relevant metrics
# pokemon_metrics = pokemon['weight_kg'].to_frame().join(pokemon['height_m'].to_frame())

pokemon_metrics = pokemon[['weight_kg','height_m','hp','attack','defense','speed']]

# z-score normalisation
pokemon_metrics_standardized =(pokemon_metrics-pokemon_metrics.mean())/pokemon_metrics.std()
pokemon_metrics_standardized = pokemon_metrics_standardized.rename(columns={'weight_kg':'weight_zscore',
                                                                        'height_m':'height_zscore',
                                                                       'hp':'hp_zscore',
                                                                       'attack':'attack_zscore',
                                                                       'defense':'defense_zscore',
                                                                       'speed':'speed_zscore'})

# fit a kmeans object to the dataset
kmeans = KMeans(n_clusters=team_size, init='k-means++').fit(pokemon_metrics_standardized)

# clusters is an attribute of the object
cluster_centers = kmeans.cluster_centers_

# add cluster index to dataframe
cluster_labels = pd.Series(kmeans.labels_, name='cluster')
pokemon_metrics_standardized = pokemon_metrics_standardized.join(cluster_labels.to_frame())

In [None]:
sns.pairplot(data=pokemon_metrics_standardized, hue='cluster', kind='scatter')


In [None]:
import numpy as np        
        
def distance_to_other_clusters(single_pokemon):
    metric = np.array([single_pokemon['weight_zscore'], single_pokemon['height_zscore'],single_pokemon['hp_zscore'],
                       single_pokemon['attack_zscore'],single_pokemon['defense_zscore'],single_pokemon['speed_zscore']])
    cluster_number = round(single_pokemon['cluster'])
    distance = 0
    for cluster_index in range(0, len(cluster_centers)):
        if cluster_index == cluster_number:
            continue
        center = cluster_centers[cluster_index]
        distance += np.sqrt(sum(np.square(metric - center)))
    return distance

# evaluate all pokemon
pokemon_dissimilarity = pokemon_metrics_standardized.apply(distance_to_other_clusters, axis=1)
pokemon_dissimilarity = pokemon_dissimilarity.rename('dissimilarity')

# join to other metrics
pokemon_processed = pokemon_metrics_standardized.join(pokemon_dissimilarity.to_frame()).join(pokemon['name'].to_frame())

# pick most dissimilar pokemon per cluster
# (DataFrame.append was removed in pandas 2.0, so collect the rows and concatenate once)
chosen_rows = []
for cluster_index in range(0, len(cluster_centers)):
    pokemon_cluster = pokemon_processed[pokemon_processed['cluster'] == cluster_index]
    chosen_rows.append(pokemon_cluster[pokemon_cluster['dissimilarity']==pokemon_cluster['dissimilarity'].max()])
chosen_pokemon = pd.concat(chosen_rows)


In [None]:
chosen_pokemon

## 2. Similarity criteria

Very nice! What a unique team!

You probably used the same **similarity criterion** as the introduction example. **k-means** uses Euclidean distance as its similarity criterion, so it makes sense to use Euclidean distance for our **dissimilarity criterion** as well. But what would happen if we picked something else?

"Woah, you're going too fast, 'Euclidean distance'? What do you mean by that?"

This is distance as we know it in the **real world**: the length of the straight line connecting two points. But distance does not have to be defined this way in our **problem space**. It can be **Manhattan distance**, **squared distance**, or something else entirely.

<img src="https://media.springernature.com/original/springer-static/image/chp%3A10.1007%2F978-981-10-8818-6_7/MediaObjects/463464_1_En_7_Fig2_HTML.jpg" align="center" width="600"/>
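
To make these notions of distance concrete, here is a minimal sketch in plain Python. The helper names (`euclidean`, `manhattan`, `squared`) and the two sample points are made up for illustration and are not part of the example notebook.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of the absolute differences along every axis."""
    return sum(abs(x - y) for x, y in zip(a, b))

def squared(a, b):
    """Euclidean distance without the final square root."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# two hypothetical points in a 2-dimensional problem space
a = (0.0, 0.0)
b = (3.0, 4.0)

print(euclidean(a, b))  # 5.0
print(manhattan(a, b))  # 7.0
print(squared(a, b))    # 25.0
```

For a single pair of points, squared distance keeps the same ordering as Euclidean distance. Once distances to several cluster centers are summed, however, the squared version lets the largest distances weigh more heavily.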

Replacing the similarity criterion used by the `sklearn` k-means is no trivial task, so I will not ask this of you, but changing our **dissimilarity criterion** should be doable.

For the next exercise, I would like you to:
- replace the dissimilarity criterion from the example with:
    - manhattan distance
    - squared distance
    - 1/(squared distance)

In [None]:
# Assign dissimilarity to your pokemon here

In [None]:
import numpy as np        
        
def manhattan_distances(single_pokemon):
    metric = np.array([single_pokemon['weight_zscore'], single_pokemon['height_zscore'],single_pokemon['hp_zscore'],
                       single_pokemon['attack_zscore'],single_pokemon['defense_zscore'],single_pokemon['speed_zscore']])
    cluster_number = round(single_pokemon['cluster'])
    distance = 0
    for cluster_index in range(0, len(cluster_centers)):
        if cluster_index == cluster_number:
            continue
        center = cluster_centers[cluster_index]
        distance += sum(abs(metric - center))
    return distance

# evaluate all pokemon
pokemon_dissimilarity = pokemon_metrics_standardized.apply(manhattan_distances, axis=1)
pokemon_dissimilarity = pokemon_dissimilarity.rename('dissimilarity')

# join to other metrics
pokemon_processed = pokemon_metrics_standardized.join(pokemon_dissimilarity.to_frame()).join(pokemon['name'].to_frame())

# pick most dissimilar pokemon per cluster
# (DataFrame.append was removed in pandas 2.0, so collect the rows and concatenate once)
manhattan_rows = []
for cluster_index in range(0, len(cluster_centers)):
    pokemon_cluster = pokemon_processed[pokemon_processed['cluster'] == cluster_index]
    manhattan_rows.append(pokemon_cluster[pokemon_cluster['dissimilarity']==pokemon_cluster['dissimilarity'].max()])
manhattan_chosen_pokemon = pd.concat(manhattan_rows)


In [None]:
manhattan_chosen_pokemon

In [None]:
import numpy as np        
        
def squared_distances(single_pokemon):
    metric = np.array([single_pokemon['weight_zscore'], single_pokemon['height_zscore'],single_pokemon['hp_zscore'],
                       single_pokemon['attack_zscore'],single_pokemon['defense_zscore'],single_pokemon['speed_zscore']])
    cluster_number = round(single_pokemon['cluster'])
    distance = 0
    for cluster_index in range(0, len(cluster_centers)):
        if cluster_index == cluster_number:
            continue
        center = cluster_centers[cluster_index]
        distance += sum(np.square(metric - center))
    return distance

# evaluate all pokemon
pokemon_dissimilarity = pokemon_metrics_standardized.apply(squared_distances, axis=1)
pokemon_dissimilarity = pokemon_dissimilarity.rename('dissimilarity')

# join to other metrics
pokemon_processed = pokemon_metrics_standardized.join(pokemon_dissimilarity.to_frame()).join(pokemon['name'].to_frame())

# pick most dissimilar pokemon per cluster
# (DataFrame.append was removed in pandas 2.0, so collect the rows and concatenate once)
squared_rows = []
for cluster_index in range(0, len(cluster_centers)):
    pokemon_cluster = pokemon_processed[pokemon_processed['cluster'] == cluster_index]
    squared_rows.append(pokemon_cluster[pokemon_cluster['dissimilarity']==pokemon_cluster['dissimilarity'].max()])
squared_chosen_pokemon = pd.concat(squared_rows)


In [None]:
squared_chosen_pokemon

In [None]:
import numpy as np        
        
def one_over_squared_distances(single_pokemon):
    metric = np.array([single_pokemon['weight_zscore'], single_pokemon['height_zscore'],single_pokemon['hp_zscore'],
                       single_pokemon['attack_zscore'],single_pokemon['defense_zscore'],single_pokemon['speed_zscore']])
    cluster_number = round(single_pokemon['cluster'])
    distance = 0
    for cluster_index in range(0, len(cluster_centers)):
        if cluster_index == cluster_number:
            continue
        center = cluster_centers[cluster_index]
        distance += 1/(sum(np.square(metric - center)))
    return distance

# evaluate all pokemon
pokemon_dissimilarity = pokemon_metrics_standardized.apply(one_over_squared_distances, axis=1)
pokemon_dissimilarity = pokemon_dissimilarity.rename('dissimilarity')

# join to other metrics
pokemon_processed = pokemon_metrics_standardized.join(pokemon_dissimilarity.to_frame()).join(pokemon['name'].to_frame())

# pick most dissimilar pokemon per cluster
# (DataFrame.append was removed in pandas 2.0, so collect the rows and concatenate once)
one_over_squared_rows = []
for cluster_index in range(0, len(cluster_centers)):
    pokemon_cluster = pokemon_processed[pokemon_processed['cluster'] == cluster_index]
    one_over_squared_rows.append(pokemon_cluster[pokemon_cluster['dissimilarity']==pokemon_cluster['dissimilarity'].max()])
one_over_squared_chosen_pokemon = pd.concat(one_over_squared_rows)


In [None]:
one_over_squared_chosen_pokemon

Did your team change? Why do you think it did(n't)? Discuss this with one of your colleagues!

## 3. Heterogeneous data

There! We did it! The most **diverse pokémon team** possible...or is it?

We have clustered our pokémon according to **weight** and **height** in the example, and according to **combat abilities** in the first drill, but what about **pokémon type**?

Some of the chosen pokémon may have the same type, as this data was ignored during clustering. But to get a really diverse team, we should take the types into account!

For this drill I want you to:
- cluster the pokémon into 6 groups according to similar:
    - weight
    - height
    - primary pokémon type
    - secondary pokémon type
- determine the most dissimilar pokemon team from these clusters

But wait, these pokémon types are in **text format**. How do you compare them to the **numerical data**? It's time to **vectorize** this data. **Vectorizing** textual data means representing it as numbers, in a way that can be understood by machine learning algorithms.

For example, let's say there are only 3 pokémon types, and pokémon can only have one type. Vectorising a **grass**, **fire**, and **water** pokémon would look like this:
- grass -> [1, 0, 0]
- fire  -> [0, 1, 0]
- water -> [0, 0, 1]

So in this case, we get **3-dimensional** data. In our case though, we have a weight dimension, a height dimension, 18 primary-type dimensions and 18 secondary-type dimensions, so a whopping **38 dimensions**.
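
The toy encoding above can be sketched in a few lines of plain Python. The `one_hot` helper and the three-type list are hypothetical and only serve to illustrate the idea. On the real dataset you would normally rely on a library routine instead.

```python
def one_hot(value, categories):
    """Return a list with a 1 at the position of `value` and 0 everywhere else."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]

# hypothetical world with only three pokémon types
types = ["grass", "fire", "water"]

for pokemon_type in types:
    print(pokemon_type, "->", one_hot(pokemon_type, types))
# grass -> [1, 0, 0]
# fire -> [0, 1, 0]
# water -> [0, 0, 1]
```

Each category gets its own axis, so the length of the vector always equals the number of categories.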

In [None]:
# vectorize your pokémon type data here (there are modules that vectorize data)

In [4]:
print(pokemon['type1'].isnull().sum())
print(pokemon['type2'].isnull().sum())


0
383


In [5]:
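# only type2 has missing values here: mark single-type pokémon as 'unknown'
# without touching missing values in unrelated columns
pokemon['type2'] = pokemon['type2'].fillna('unknown')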

## Using pandas to vectorize:

In [None]:
pokemon_types = pokemon[['name','weight_kg','height_m']]

In [None]:
dummies1 = pd.get_dummies(pokemon.type1)
# dummies1
dummies2 = pd.get_dummies(pokemon.type2)
dummies2

In [None]:
# join both sets of dummies in one go; the prefixes keep primary and secondary types apart
merged = pd.concat([pokemon_types, dummies1.add_prefix('type1_'), dummies2.add_prefix('type2_')], axis='columns')

merged

## OneHotEncoder Vectorizing

In [6]:
pokemon['type1']

0        grass
1        grass
2        grass
3         fire
4         fire
        ...   
776      steel
777      grass
778       dark
779    psychic
780      steel
Name: type1, Length: 781, dtype: object

In [7]:
pokemon['type2']

0       poison
1       poison
2       poison
3      unknown
4      unknown
        ...   
776     flying
777      steel
778     dragon
779    unknown
780      fairy
Name: type2, Length: 781, dtype: object

In [8]:
from sklearn.preprocessing import OneHotEncoder

In [9]:
enc = OneHotEncoder()

In [10]:
type1_hot = enc.fit_transform(pokemon[['type1']])

In [11]:
type1_hot_cat = enc.categories_
type1_hot_cat

[array(['bug', 'dark', 'dragon', 'electric', 'fairy', 'fighting', 'fire',
        'flying', 'ghost', 'grass', 'ground', 'ice', 'normal', 'poison',
        'psychic', 'rock', 'steel', 'water'], dtype=object)]

In [13]:
type1_hot_df = pd.DataFrame(type1_hot.toarray(),columns=type1_hot_cat)

In [14]:
type1_hot_df

Unnamed: 0,bug,dark,dragon,electric,fairy,fighting,fire,flying,ghost,grass,ground,ice,normal,poison,psychic,rock,steel,water
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
776,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
777,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
778,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
779,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


In [15]:
pokemon['type2'].unique()
# pokemon['type1']

array(['poison', 'unknown', 'flying', 'ground', 'fairy', 'grass',
       'fighting', 'psychic', 'steel', 'ice', 'rock', 'water', 'electric',
       'fire', 'dragon', 'dark', 'ghost', 'bug', 'normal'], dtype=object)

In [16]:
type2_hot = enc.fit_transform(pokemon[['type2']])
type2_hot_cat = enc.categories_
type2_hot_df = pd.DataFrame(type2_hot.toarray(),columns=type2_hot_cat)
type2_hot_df

Unnamed: 0,bug,dark,dragon,electric,fairy,fighting,fire,flying,ghost,grass,ground,ice,normal,poison,psychic,rock,steel,unknown,water
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
776,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
777,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
778,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
779,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [18]:
poke_height_weight = pokemon[['height_m','weight_kg']]

In [19]:
vectorized_data = pd.concat([poke_height_weight,type1_hot_df], axis=1)

In [21]:
vectorized_data = pd.concat([vectorized_data, type2_hot_df], axis=1)

In [22]:
vectorized_data.head()

Unnamed: 0,height_m,weight_kg,"(bug,)","(dark,)","(dragon,)","(electric,)","(fairy,)","(fighting,)","(fire,)","(flying,)",...,"(grass,)","(ground,)","(ice,)","(normal,)","(poison,)","(psychic,)","(rock,)","(steel,)","(unknown,)","(water,)"
0,0.7,6.9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,1.0,13.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,2.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,0.6,8.5,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,1.1,19.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [None]:
# cluster your multi-dimensional pokémon data here

In [None]:
from sklearn.cluster import KMeans

# number of pokemon clusters
team_size = 6

pokemon_metrics = vectorized_data

# z-score normalisation
pokemon_metrics_standardized =(pokemon_metrics-pokemon_metrics.mean())/pokemon_metrics.std()
pokemon_metrics_standardized = pokemon_metrics_standardized.add_suffix('_zscore')

# fit a kmeans object to the dataset
kmeans = KMeans(n_clusters=team_size, init='k-means++').fit(pokemon_metrics_standardized)

# clusters is an attribute of the object
cluster_centers = kmeans.cluster_centers_

# add cluster index to dataframe
cluster_labels = pd.Series(kmeans.labels_, name='cluster')
pokemon_metrics_standardized = pokemon_metrics_standardized.join(cluster_labels.to_frame())

In [None]:
pokemon_metrics_standardized

In [None]:
# determine your unique team here

In [None]:
import numpy as np        
       
def distance_to_other_clusters(single_pokemon):
    metric = np.array([single_pokemon[x] for x in pokemon_metrics_standardized.columns[:-1]])
    cluster_number = round(single_pokemon['cluster'])
    distance = 0
    for cluster_index in range(0, len(cluster_centers)):
        if cluster_index == cluster_number:
            continue
        center = cluster_centers[cluster_index]
        distance += np.sqrt(sum(np.square(metric - center)))
    return distance

# evaluate all pokemon
pokemon_dissimilarity = pokemon_metrics_standardized.apply(distance_to_other_clusters, axis=1)
pokemon_dissimilarity = pokemon_dissimilarity.rename('dissimilarity')

# join to other metrics
pokemon_processed = pokemon_metrics_standardized.join(pokemon_dissimilarity.to_frame()).join(pokemon['name'].to_frame())

# pick most dissimilar pokemon per cluster
# (DataFrame.append was removed in pandas 2.0, so collect the rows and concatenate once)
chosen_rows = []
for cluster_index in range(0, len(cluster_centers)):
    pokemon_cluster = pokemon_processed[pokemon_processed['cluster'] == cluster_index]
    chosen_rows.append(pokemon_cluster[pokemon_cluster['dissimilarity']==pokemon_cluster['dissimilarity'].max()])
chosen_pokemon = pd.concat(chosen_rows)


In [None]:
chosen_pokemon

But wait, did you properly **normalize** your data? If you simply vectorize your data like in the example shown above, you might not get the results you want (try this for yourselves, what do you notice?).

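The example normalizes its data using the **z-score**. What does this mean? z-score normalization (also called standardization) rescales every feature so that it has a mean of 0 and a standard deviation of 1, which puts all features on a comparable scale in our **problem space**.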

In the case of the pokemon weight and height, the **order of magnitude** of the weight is clearly larger than that of the height. During clustering, this means that **weight similarity would matter more than height similarity**, since differences in weight contribute far more to the Euclidean distance between data points.

For example, a pokémon weighing 200 kg and measuring 4 m differs from one weighing 100 kg and measuring 2 m by the same factor of two in both weight and height. But in raw numbers the weight difference is 100 (kg), while the height difference is only 2 (m). That is where normalization comes in handy: it scales these metrics so they can be compared fairly.
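
A minimal sketch of that effect, using only the two pokémon from the paragraph above. The `z_scores` helper is hypothetical. It uses the sample standard deviation, which is also what the pandas `.std()` call in the notebook computes by default.

```python
from statistics import mean, stdev

def z_scores(values):
    """Rescale values to mean 0 and (sample) standard deviation 1."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]

# the two pokémon from the text: (100 kg, 2 m) and (200 kg, 4 m)
weights = [100.0, 200.0]
heights = [2.0, 4.0]

raw_weight_gap = abs(weights[1] - weights[0])  # 100.0
raw_height_gap = abs(heights[1] - heights[0])  # 2.0

weight_z = z_scores(weights)
height_z = z_scores(heights)
scaled_weight_gap = abs(weight_z[1] - weight_z[0])
scaled_height_gap = abs(height_z[1] - height_z[0])

print(raw_weight_gap, raw_height_gap)  # the raw gaps differ by a factor of 50
print(round(scaled_weight_gap, 3), round(scaled_height_gap, 3))  # the scaled gaps are equal
```

After scaling, both features contribute equally to the distance between the two pokémon.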

The same applies when we vectorize our **pokémon types** into one-hot vectors of length one: the **Euclidean distance** between one pokémon type and another is about 1.4 (the square root of 2, thanks Pythagoras), which isn't much compared to the raw weight and height differences.
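
The 1.4 figure is easy to check. The sketch below assumes a hypothetical three-type world and an arbitrary scaling factor (`type_weight`), neither of which comes from the example notebook.

```python
import math

# one-hot vectors for two different types in a hypothetical three-type world
grass = [1, 0, 0]
fire = [0, 1, 0]

type_distance = math.dist(grass, fire)
print(round(type_distance, 3))  # 1.414, i.e. sqrt(2)

# hypothetical scaling factor: stretching the type axes stretches the distance too
type_weight = 3.0
scaled_distance = math.dist([type_weight * v for v in grass],
                            [type_weight * v for v in fire])
print(round(scaled_distance, 3))  # 4.243, i.e. 3 * sqrt(2)
```

Multiplying a group of columns by a constant is therefore one simple way to make that group count for more, or less, in a Euclidean distance.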

Show me how you would **make sure** that **similarity or dissimilarity** of the **pokémon type** matters more than **weight or height**.

In [None]:
# normalize and scale your data in such a way that pokémon type similarity matters more than the other metrics

In [None]:

# number of pokemon clusters
team_size = 6

pokemon_metrics = vectorized_data

# z-score normalisation
pokemon_metrics_standardized =(pokemon_metrics-pokemon_metrics.mean())/pokemon_metrics.std()
pokemon_metrics_standardized = pokemon_metrics_standardized.add_suffix('_zscore')

# down-weight height and weight so that type differences dominate the distance
pokemon_metrics_standardized['height_m_zscore'] *= 0.5
pokemon_metrics_standardized['weight_kg_zscore'] *= 0.5

# fit a kmeans object to the dataset
kmeans = KMeans(n_clusters=team_size, init='k-means++').fit(pokemon_metrics_standardized)

# clusters is an attribute of the object
cluster_centers = kmeans.cluster_centers_

# add cluster index to dataframe
cluster_labels = pd.Series(kmeans.labels_, name='cluster')
pokemon_metrics_standardized = pokemon_metrics_standardized.join(cluster_labels.to_frame())

pokemon_metrics_standardized.head()

In [None]:
import numpy as np        
       
def distance_to_other_clusters(single_pokemon):
    metric = np.array([single_pokemon[x] for x in pokemon_metrics_standardized.columns[:-1]])
    cluster_number = round(single_pokemon['cluster'])
    distance = 0
    for cluster_index in range(0, len(cluster_centers)):
        if cluster_index == cluster_number:
            continue
        center = cluster_centers[cluster_index]
        distance += np.sqrt(sum(np.square(metric - center)))
    return distance

# evaluate all pokemon
pokemon_dissimilarity = pokemon_metrics_standardized.apply(distance_to_other_clusters, axis=1)
pokemon_dissimilarity = pokemon_dissimilarity.rename('dissimilarity')

# join to other metrics
pokemon_processed = pokemon_metrics_standardized.join(pokemon_dissimilarity.to_frame()).join(pokemon['name'].to_frame())

# pick most dissimilar pokemon per cluster
# (DataFrame.append was removed in pandas 2.0, so collect the rows and concatenate once)
chosen_rows = []
for cluster_index in range(0, len(cluster_centers)):
    pokemon_cluster = pokemon_processed[pokemon_processed['cluster'] == cluster_index]
    chosen_rows.append(pokemon_cluster[pokemon_cluster['dissimilarity']==pokemon_cluster['dissimilarity'].max()])
chosen_pokemon = pd.concat(chosen_rows)


## 4. Cluster method comparison

I hope you're getting a bit more comfortable with the **k-means** method. It sure is a popular one, but it's [not the only clustering technique](https://scikit-learn.org/stable/modules/clustering.html) out there!

For this exercise, I want you to:
- pick 3 clustering techniques from the `scikit-learn` library
- cluster the pokémon according to weight and height
- try to adjust the cluster method arguments so 6 clusters are obtained after clustering
- evaluate in-cluster similarity and cluster-to-cluster similarity:
  - compare every pokémon in a cluster to every other pokémon within that same cluster (choose your own similarity criteria)
  - take the average of these in-cluster similarities
  - do this for every cluster
  - take the average or centroid of every cluster, and determine the similarity to every other cluster
  - compare these two metrics (in-cluster similarity and cluster-to-cluster similarity) for every chosen clustering technique
  - determine the 'best' technique by maximising in-cluster similarity and minimizing cluster-to-cluster similarity
- visualize the results

Bonus: track these metrics for every iteration of the algorithms and plot the progression from start to finish
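
One possible way to compute the two metrics is sketched below in plain Python. It assumes Euclidean distance as the (inverse) similarity criterion, so a lower number means more similar. The helper names and the tiny two-cluster dataset are hypothetical.

```python
import math
from itertools import combinations

def mean_intra_cluster_distance(points):
    """Average Euclidean distance over every pair of points in one cluster."""
    pairs = list(combinations(points, 2))
    if not pairs:  # a single-point cluster has no pairs to compare
        return 0.0
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

def centroid(points):
    """Coordinate-wise mean of the points in one cluster."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))

# hypothetical (weight, height) values, already grouped by cluster label
clusters = {
    0: [(0.0, 0.0), (0.0, 2.0)],
    1: [(10.0, 0.0), (10.0, 2.0)],
}

intra = {label: mean_intra_cluster_distance(pts) for label, pts in clusters.items()}
centroids = {label: centroid(pts) for label, pts in clusters.items()}
inter = math.dist(centroids[0], centroids[1])

print(intra)      # {0: 2.0, 1: 2.0}
print(centroids)  # {0: (0.0, 1.0), 1: (10.0, 1.0)}
print(inter)      # 10.0
```

With distance as the criterion, a good clustering shows small in-cluster distances and large centroid-to-centroid distances, which is the same goal as high in-cluster similarity and low cluster-to-cluster similarity.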

In [None]:
# compare your techniques here

In [23]:
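def cluster_evaluation(dataset, labels):
    """Print and return three internal cluster-quality scores.

    dataset: the features that were clustered
    labels:  the cluster label assigned to every row of dataset
    """
    from sklearn import metrics
    
    dbs = metrics.davies_bouldin_score(dataset, labels)
    chs = metrics.calinski_harabasz_score(dataset, labels)
    ss = metrics.silhouette_score(dataset, labels, metric='euclidean')
    
    print(f"Davies Bouldin Index: {dbs}")
    print(f"Calinski-Harabasz Index: {chs}")
    print(f"Silhouette Coefficient: {ss}")
    return {'davies_bouldin': dbs, 'calinski_harabasz': chs, 'silhouette': ss}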

In [None]:
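# cluster_evaluation needs the clustered data and the cluster label of every row;
# k-means on height and weight is used here as one possible input
labels = KMeans(n_clusters=6).fit_predict(poke_height_weight)
test = cluster_evaluation(poke_height_weight, labels)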