# Build a K-means model

## 1 Activity: Build a K-means model

**Note**: This notebook is part of the Google Advanced Data Analytics Professional Certificate course on the Coursera platform.

### 1.1 Introduction

K-means clustering is very effective when segmenting data and attempting to find patterns. Because clustering is used in a broad array of industries, becoming proficient in this process will help you expand your skillset in a widely applicable way.

In this activity, you are a consultant for a scientific organization that works to support and sustain penguin colonies. You are tasked with helping other staff members learn more about penguins in order to achieve this mission.

The data for this activity is in a spreadsheet that includes data points, such as species, island, and sex, for a sample of 345 penguins. You will use a K-means clustering model to group this data and identify patterns that provide important insights about penguins.

**Note**: Because this lab uses a real dataset, this notebook will first require basic EDA, data cleaning, and other manipulations to prepare the data for modeling.


### 1.2 Step 1: Imports

Import the required packages, including KMeans, silhouette_score, and StandardScaler.

In [None]:
# Import standard operational packages.
import numpy as np
import pandas as pd
# Important tools for modeling and evaluation.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
# Import visualization packages.
import seaborn as sns

Pandas is used to load the penguins dataset, which is also built into the seaborn library. The resulting pandas DataFrame is saved in a variable named penguins. As shown in the cell below, the dataset is loaded for you from a CSV file. You do not need to download the file or provide more code in order to access the dataset and proceed with this lab. Continue with this activity by completing the following instructions.

In [None]:
# Save the `pandas` DataFrame in variable `penguins`.
penguins = pd.read_csv("../input/penguins/penguins.csv.xls")

Now, review the first 10 rows of data.

In [None]:
# Review the first 10 rows.
penguins.head(10)

### 1.3 Step 2: Data exploration

After loading the dataset, the next step is to prepare the data to be suitable for clustering. This includes:
* Exploring data
* Checking for missing values
* Encoding data
* Dropping a column
* Scaling the features using StandardScaler

#### 1.3.1 Explore data

To cluster penguins of multiple different species, determine how many different types of penguin species are in the dataset.

In [None]:
# Find out how many penguin types there are.
penguins["species"].unique()

In [None]:
# Find the count of each species type.
penguins["species"].value_counts()

**Question**: How many types of species are present in the dataset?
There are 3 species in the penguins dataset: ‘Adelie’, ‘Chinstrap’, and ‘Gentoo’.

**Question**: Why is it helpful to determine the perfect number of clusters using K-means when you already know how many penguin species the dataset contains?

With this data, we can build a model that identifies the clusters and then compare the results with the known species labels. This comparison shows whether the model is working well.

#### 1.3.2 Check for missing values

An assumption of K-means is that there are no missing values. Check for missing values in the rows of the data.

In [None]:
# Check for missing values.
penguins.isna().sum()

Now, drop the rows with missing values and save the resulting pandas DataFrame in a variable named *penguins_subset*.

In [None]:
# Drop rows with missing values.
# Save DataFrame in variable `penguins_subset`.
penguins_subset = penguins.dropna(axis=0).reset_index(drop = True)

Use dropna. Note that the axis parameter passed to this function should be set to 0 to drop rows containing missing values, or to 1 to drop columns containing missing values. Optionally, reset_index may also be used to avoid a SettingWithCopyWarning later in the notebook.
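
The effect of the axis parameter can be checked on a small example. The sketch below uses a hypothetical toy DataFrame (the values are made up for illustration and are not taken from the penguins data) to show what each setting removes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a single missing value in the first column.
toy = pd.DataFrame({
    "bill_length_mm": [39.1, np.nan, 40.3],
    "body_mass_g": [3750.0, 3800.0, 3250.0],
})

# axis=0 drops every row that contains at least one missing value.
rows_dropped = toy.dropna(axis=0).reset_index(drop=True)

# axis=1 drops every column that contains at least one missing value.
cols_dropped = toy.dropna(axis=1)

print(rows_dropped.shape)  # (2, 2)
print(cols_dropped.shape)  # (3, 1)
```

Dropping rows keeps both features but loses one observation, while dropping columns keeps all observations but loses a feature.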

Next, check to make sure that penguins_subset does not contain any missing values.

In [None]:
# Check for missing values.
penguins_subset.isna().sum()

Now, review the first 10 rows of the subset.

In [None]:
# View first 10 rows.
penguins_subset.head(10)

#### 1.3.3 Encode data

Some versions of the penguins dataset have values encoded in the sex column as ‘Male’ and ‘Female’ instead of ‘MALE’ and ‘FEMALE’. The code below will make sure all values are ALL CAPS.

In [None]:
penguins_subset['sex'] = penguins_subset['sex'].str.upper()

K-means needs numeric columns for clustering. Convert the categorical column 'sex' into numeric. There is no need to convert the 'species' column because it isn’t being used as a feature in the clustering algorithm.

In [None]:
# Convert `sex` column from categorical to numeric.
penguins_subset = pd.get_dummies(penguins_subset, drop_first = True, columns = ["sex"])
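
To see what this encoding step produces, here is a minimal sketch on a hypothetical toy DataFrame (made-up values, not the lab's data). With drop_first set to True, the first category in sorted order ('FEMALE') is dropped, so a single sex_MALE column remains.

```python
import pandas as pd

# Hypothetical toy data with a categorical `sex` column.
toy = pd.DataFrame({
    "body_mass_g": [3750, 3800, 4675],
    "sex": ["MALE", "FEMALE", "MALE"],
})

# One-hot encode `sex` and drop the first category ('FEMALE').
encoded = pd.get_dummies(toy, drop_first=True, columns=["sex"])

print(list(encoded.columns))                     # ['body_mass_g', 'sex_MALE']
# Cast to int because the dummy dtype differs between pandas versions.
print(encoded["sex_MALE"].astype(int).tolist())  # [1, 0, 1]
```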

#### 1.3.4 Drop a column

Drop the categorical column island from the dataset. While it has value, this notebook is trying to confirm if penguins of the same species exhibit different physical characteristics based on sex. This doesn’t include location.

Note that the 'species' column is not numeric. Don’t drop the 'species' column for now. It could potentially be used to help understand the clusters later.

In [None]:
# Drop the island column.
penguins_subset.drop(["island"], axis = 1, inplace = True)

#### 1.3.5 Scale the features

Because K-means uses distance between observations as its measure of similarity, it’s important to scale the data before modeling. Use a third-party tool, such as scikit-learn’s StandardScaler class. StandardScaler scales each value x of a feature X by subtracting the mean observed value for that feature and dividing by the feature's standard deviation:

$$x_{\text{scaled}} = \frac{x - \operatorname{mean}(X)}{\sigma}$$

This ensures that all variables have a mean of 0 and variance/standard deviation of 1.
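
The scaling rule can be verified by hand. The following sketch, which uses a hypothetical toy array rather than the penguins data, compares a manual computation with the output of StandardScaler. It assumes the population standard deviation (NumPy's default, ddof=0), which is what StandardScaler uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features on very different scales.
X_toy = np.array([[180.0, 3700.0],
                  [195.0, 4200.0],
                  [210.0, 5000.0],
                  [215.0, 5400.0]])

# Manual scaling: subtract each column's mean, divide by its standard deviation.
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# Scaling with scikit-learn.
scaled = StandardScaler().fit_transform(X_toy)

print(np.allclose(manual, scaled))              # True
print(np.allclose(scaled.mean(axis=0), 0.0))    # True
print(np.allclose(scaled.std(axis=0), 1.0))     # True
```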

**Note**: Because the species column isn’t a feature, it doesn’t need to be scaled.
First, copy all the features except the 'species' column to a DataFrame X.

In [None]:
# Exclude `species` variable from X.
X = penguins_subset[["bill_length_mm",
                     "bill_depth_mm",
                     "flipper_length_mm",
                     "body_mass_g",
                     "sex_MALE"]]

Scale the features in X using StandardScaler, and assign the scaled data to a new variable
X_scaled.

In [None]:
# Scale the features.
# Assign the scaled data to variable `X_scaled`.
X_scaled = StandardScaler().fit_transform(X)

### 1.4 Step 3: Data modeling

Now, fit K-means and evaluate inertia for different values of k. Because you may not know how many clusters exist in the data, start by fitting K-means and examining the inertia values for different values of k. To do this, write a function called *kmeans_inertia* that takes in *num_clusters* and *x_vals* (*X_scaled*) and returns a list of each k-value’s inertia.

When using K-means inside the function, set the random_state to 42. This way, others can
reproduce your results.

In [None]:
# Fit K-means and evaluate inertia for different values of k.
num_clusters = list(range(2, 11))

def kmeans_inertia(num_clusters, x_vals):
    """Fit K-means for each k in num_clusters and return a list of inertia values."""
    inertia = []
    for num in num_clusters:
        kms = KMeans(n_clusters = num, random_state = 42)
        kms.fit(x_vals)
        inertia.append(kms.inertia_)
    return inertia
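
For context on what the function collects: inertia is the sum of squared distances from each observation to the centroid of its assigned cluster. The sketch below recomputes it by hand on hypothetical synthetic data (two well-separated groups generated for illustration); the n_init value is an assumption chosen to keep the result stable across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical synthetic data: two well-separated groups of 20 points.
rng = np.random.default_rng(0)
X_toy = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(20, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(20, 2)),
])

kms = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_toy)

# Look up the centroid assigned to each point, then sum the squared distances.
assigned_centroids = kms.cluster_centers_[kms.labels_]
manual_inertia = ((X_toy - assigned_centroids) ** 2).sum()

print(np.isclose(manual_inertia, kms.inertia_))  # True
```

Because every additional cluster can only bring points closer to some centroid, inertia tends to decrease as k grows, which is why the elbow of the curve is examined rather than its minimum.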

Use the kmeans_inertia function to return a list of inertia for k=2 to 10.

In [None]:
# Return a list of inertia for k=2 to 10.
inertia = kmeans_inertia(num_clusters, X_scaled)
inertia

Next, create a line plot that shows the relationship between num_clusters and inertia. Use either seaborn or matplotlib to visualize this relationship.

In [None]:
# Create a line plot.
plt = sns.lineplot(x = num_clusters, y = inertia)
plt.set_xlabel("Number of Clusters")
plt.set_ylabel("Inertia")
plt.set_title("Number of clusters vs inertia")

**Question**: Where is the elbow in the plot?

The elbow is at k = 6. Beyond 6 clusters, the decrease in inertia levels off, so the inertia plot suggests that 6 is the optimal number of clusters.

### 1.5 Step 4: Results and evaluation

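Now, evaluate the silhouette score using the silhouette_score() function. Silhouette scores compare how close each point is to the other points in its own cluster with how close it is to the points in the nearest neighboring cluster, so they are used to study the separation between clusters.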

Then, compare the silhouette score of each value of k, from 2 through 10. To do this, write a function called kmeans_sil that takes in num_clusters and x_vals (X_scaled) and returns a list of each k-value’s silhouette score.

In [None]:
# Evaluate silhouette score.
# Write a function to return a list of each k-value's score.
# Use k = 2 through 10 (range excludes its upper bound).
num_clusters = list(range(2, 11))

def kmeans_sil(num_clusters, x_vals):
    """Fit K-means for each k in num_clusters and return a list of silhouette scores."""
    sil_score = []
    for num in num_clusters:
        kms = KMeans(n_clusters = num, random_state = 42)
        kms.fit(x_vals)
        sil_score.append(silhouette_score(x_vals, kms.labels_))
    return sil_score

sil_score = kmeans_sil(num_clusters, X_scaled)
sil_score
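
To make the metric less of a black box, the sketch below recomputes the silhouette score by hand on a hypothetical six-point dataset with made-up labels. For each point, a is the mean distance to the other points in its cluster, b is the lowest mean distance to the points of any other cluster, and the point's score is (b - a) / max(a, b). The overall score is the mean over all points.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Hypothetical toy data: two tight groups with known labels.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

# Pairwise Euclidean distances between all points.
dist = np.linalg.norm(X_toy[:, None, :] - X_toy[None, :, :], axis=2)

scores = []
for i in range(len(X_toy)):
    same = labels == labels[i]
    # a: mean distance to the other members of the same cluster
    # (the distance to itself is 0, so divide by cluster size minus 1).
    a = dist[i, same].sum() / (same.sum() - 1)
    # b: lowest mean distance to the members of any other cluster.
    b = min(dist[i, labels == c].mean()
            for c in np.unique(labels) if c != labels[i])
    scores.append((b - a) / max(a, b))

manual_score = np.mean(scores)
print(np.isclose(manual_score, silhouette_score(X_toy, labels)))  # True
```

Scores close to 1 indicate compact, well-separated clusters, while scores near 0 indicate overlapping clusters.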

Next, create a line plot that shows the relationship between num_clusters and sil_score. Use
either seaborn or matplotlib to visualize this relationship.

In [None]:
# Create a line plot.
plt = sns.lineplot(x = num_clusters, y = sil_score)
plt.set_xlabel("Number of Clusters")
plt.set_ylabel("Silhouette Score")
plt.set_title("Number of clusters vs Silhouette Score")

**Question**: What does the graph show?

The graph agrees with the inertia plot: the silhouette score also points to 6 as the optimal number of clusters.
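
Reading the best k off the plot can also be done programmatically by taking the k with the highest silhouette score. The sketch below illustrates this on hypothetical synthetic data with three well-separated groups, not on the penguins data; the names candidate_k and best_k, and the n_init value, are assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical synthetic data: three well-separated groups of 30 points.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X_toy = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2)) for c in centers])

candidate_k = list(range(2, 7))
scores = []
for k in candidate_k:
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_toy)
    scores.append(silhouette_score(X_toy, labels))

# The k with the highest silhouette score.
best_k = candidate_k[int(np.argmax(scores))]
print(best_k)  # 3
```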

#### 1.5.1 Optimal k-value
To decide on an optimal k-value, fit a six-cluster model to the dataset.

In [None]:
# Fit a 6-cluster model.
k6_means = KMeans(n_clusters = 6, random_state = 42)
k6_means.fit(X_scaled)

Print out the unique labels of the fit model.

In [None]:
# Print unique labels.
np.unique(k6_means.labels_)

Now, create a new column cluster that indicates cluster assignment in the DataFrame penguins_subset. It’s important to understand the meaning of each cluster’s labels, then decide whether the clustering makes sense.

**Note**: This task is done using penguins_subset because it is often easier to interpret unscaled data.

In [None]:
# Create a new column `cluster`.
penguins_subset["cluster"]= k6_means.labels_

Use groupby to verify if any 'cluster' can be differentiated by 'species'.

In [None]:
# Verify if any `cluster` can be differentiated by `species`.
penguins_subset.groupby(by=["species", "cluster"]).size()
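
As an optional alternative view, the same counts can be laid out as a table with species as rows and clusters as columns. The sketch below uses pd.crosstab on a hypothetical toy DataFrame with made-up cluster labels; it is a different presentation of the groupby result, not part of the original lab.

```python
import pandas as pd

# Hypothetical toy data with made-up cluster assignments.
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap", "Adelie"],
    "cluster": [0, 0, 1, 1, 0, 2],
})

# Long format: one count per (species, cluster) pair that occurs.
long_counts = toy.groupby(by=["species", "cluster"]).size()

# Wide format: species as rows, clusters as columns, zeros for absent pairs.
table = pd.crosstab(toy["species"], toy["cluster"])
print(table)
```

A species concentrated in a single column of the table maps cleanly to one cluster, while a species spread across several columns is split between clusters.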

Next, interpret the groupby outputs. The results of the groupby indicate that each 'cluster' can be differentiated by 'species', and it is useful to visualize these results with a graph.

**Note**: The code for the graph below is outside the scope of this lab.

In [None]:
penguins_subset.groupby(by=['cluster', 
                            'species']).size().plot.bar(title='Clusters differentiated by species',
                                                        figsize=(6, 5),
                                                        ylabel='Size',
                                                        xlabel='(Cluster, Species)');

Use groupby to verify if each 'cluster' can be differentiated by 'species' AND 'sex_MALE'.

In [None]:
# Verify if each `cluster` can be differentiated by `species` AND `sex_MALE`.
penguins_subset.groupby(by=["cluster", "species", "sex_MALE"]).size()

**Question**: Are the clusters differentiated by 'species' and 'sex_MALE'?

Mostly, yes. Each cluster is largely made up of one species and one sex, but some clusters mix species: Adelie and Chinstrap penguins share clusters 2 and 4, where cluster 2 contains females and cluster 4 contains males.

Finally, interpret the groupby outputs and visualize these results. The graph shows that each 'cluster' can be differentiated by 'species' and 'sex_MALE'. Furthermore, each cluster is mostly comprised of one sex and one species.

**Note**: The code for the graph below is outside the scope of this lab.

In [None]:
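# Count penguins per (cluster, sex), with one column per species.
counts = penguins_subset.groupby(by=['cluster',
                                     'species',
                                     'sex_MALE']).size().unstack(level = 'species',
                                                                 fill_value=0)
ax = counts.plot.bar(title='Clusters differentiated by species and sex',
                     figsize=(6, 5),
                     ylabel='Size',
                     xlabel='(Cluster, Sex)')
# Place the legend outside the plot area, using the Axes returned by plot.bar.
ax.legend(bbox_to_anchor=(1.3, 1.0))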

### 1.6 Considerations

**What are some key takeaways that you learned during this lab? Consider the process
you used, key tools, and the results of your investigation.**

The analysis of both inertia and silhouette score shows that the optimal number of clusters is 6. There are 3 species, and there are notable differences between male and female individuals of the same species. The 6 clusters therefore correspond to 3 species with 2 sexes each. Two species, Adelie and Chinstrap, have some mixed data in clusters 2 and 4. The remaining species, Gentoo, is accurately identified by the model.

**What summary would you provide to stakeholders?**

The model performs well: it identified 6 clusters as the optimal number, which corresponds to 3 species and 2 sexes for each species.

#### 1.6.1 References

* Gorman, Kristen B., et al. “Ecological Sexual Dimorphism and Environmental Variability within a Community of Antarctic Penguins (Genus Pygoscelis).” PLOS ONE, vol. 9, no. 3, Mar. 2014, p. e90081.
* scikit-learn documentation: sklearn.preprocessing.StandardScaler.