# K-means Model
---

1.   **[Introduction to K-means](#1.-Introduction-to-K-means)**
2.   **[Foundations of K-means](#2.-Foundations-of-K-means)**
3.   **[Model Assumptions](#3.-Model-Assumptions)**
4.   **[Model Evaluation](#4.-Model-Evaluation)**
5.   **[Exploratory Data Analysis](#5.-Exploratory-Data-Analysis)**
6.   **[Model Construction](#6.-Model-Construction)**
7.   **[Model Results](#7.-Model-Results)**

---
<a name="1.-Introduction-to-K-means"></a>
### 1. Introduction to K-means

#### 1.1 Definitions

**K-means |** An unsupervised learning partitioning algorithm used for clustering unlabeled data

**Unsupervised learning |** Used on unlabeled data where the goal is to learn about the data's underlying structure

**Centroid |** The center of a cluster determined by the mathematical mean of all the points in that cluster 

---
<a name="2.-Foundations-of-K-means"></a>
### 2. Foundations of K-means

**There are 4 steps in the creation of a K-means model:**
1. Randomly place k centroids in the data space
2. Assign each point to its nearest centroid
3. Update the location of each centroid to the mean position of all the points assigned to it
4. Repeat steps 2 and 3 until the model converges (i.e., all centroid locations remain unchanged across successive iterations).
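
The four steps above can be sketched in plain NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name `kmeans_sketch`, its parameters, and the toy data are hypothetical, and step 1 is simplified by picking k existing observations as the starting centroids.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Minimal K-means loop following the four steps above (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Step 1 (simplified): use k distinct observations as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Made-up data: two well-separated blobs of 50 points each.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)), rng.normal(5, 0.5, size=(50, 2))])
centroids, labels = kmeans_sketch(X, k=2)
print(np.round(centroids, 2))
```

On data like this, the two centroids should settle near the centers of the two blobs. In practice, scikit-learn's `KMeans` (used later in this notebook) handles initialization and convergence for you.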

---
<a name="3.-Model-Assumptions"></a>
### 3. Model Assumptions

Model assumptions are statements about the data that must be true in order to justify the use of a particular modeling technique.



#### 3.1 K-means Assumptions
1. **Assumption of Homogeneity of Variance:** K-means assumes that the variance of the distribution of each variable is equal across all clusters.

2. **Assumption of Independence:** K-means assumes that the observations in the dataset are independent of each other. This means that the value of one observation does not affect the value of another observation.

3. **Assumption of Euclidean Distance:** K-means assumes that the distances between observations are measured using Euclidean distance. This means that the distance between two observations is calculated as the square root of the sum of the squared differences between their corresponding attributes.

4. **Assumption of Clustering Structure:** K-means assumes that the data points in each cluster have a similar clustering structure, which means that the clusters are spherical, and have equal variance.
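
The Euclidean distance described in assumption 3 can be computed directly. The sketch below uses a hypothetical helper, `euclidean_distance`, and two made-up observations with two attributes each.

```python
import math

def euclidean_distance(p, q):
    """Square root of the sum of squared differences between attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Two made-up observations: differences of 3 and 4 give a distance of 5.
print(euclidean_distance((1.0, 2.0), (4.0, 6.0)))  # 5.0
```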


#### 3.2 Assumption Checks

1. **Homogeneity of Variance:** To check if the variance of each variable is equal across all clusters, you can use statistical tests such as Bartlett's test or Levene's test. If the test shows that the variances are significantly different, then the assumption of homogeneity of variance may not hold.

2. **Independence:** To check if the observations are independent of each other, consider how the data was collected. Repeated measurements of the same subject, or a time-series structure, mean that one observation can carry information about another. In those cases the assumption of independence may not hold, and time-series clustering techniques may be more appropriate. Note that correlation among variables (columns) is a separate issue from dependence among observations (rows).

3. **Euclidean Distance:** Euclidean distance is fundamental to K-means, and it is sensitive to the scale of each variable. Check whether the variables are on the same scale and, if not, standardize them so that they have a similar range. Other distance measures exist, such as Manhattan distance or Mahalanobis distance (which accounts for correlations between variables), but using them means moving to a different clustering algorithm, because standard K-means minimizes squared Euclidean distance.

4. **Clustering Structure:** To check if the data points in each cluster have a similar clustering structure, you can use visualization techniques such as scatterplots or boxplots to examine the distribution of the variables within each cluster. If the clusters have different shapes or sizes, then the assumption of clustering structure may not hold.
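
As one way to run the homogeneity-of-variance check described above, the sketch below applies Levene's test from SciPy (`scipy.stats.levene`) to a single feature split into two groups. The data and the names `cluster_0` and `cluster_1` are made up for illustration; with a fitted model, the groups would be the values of one feature for each cluster label.

```python
import numpy as np
from scipy.stats import levene

rng = np.random.default_rng(0)
# Made-up values of one feature in two clusters with clearly different spreads.
cluster_0 = rng.normal(loc=0.0, scale=1.0, size=200)
cluster_1 = rng.normal(loc=0.0, scale=5.0, size=200)

# Levene's test: the null hypothesis is that the groups have equal variances.
statistic, p_value = levene(cluster_0, cluster_1)
print(f"Levene statistic = {statistic:.2f}, p-value = {p_value:.3g}")
```

A small p-value (for example, below 0.05) suggests that the variances differ, so the assumption may not hold for that feature.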


---
<a name="4.-Model-Evaluation"></a>
### 4. Model Evaluation

#### 4.1 Inertia
Sum of the squared distances between each observation and its nearest centroid 

- Measures intracluster distance
- Equal to the sum of the squared distance between each point and the centroid of the cluster that it’s assigned to
- Used in elbow plots
- All else equal, lower values are generally better

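Inertia = $\sum\limits_{i=1}^{n} (X_i - C_k)^2$, where $X_i$ is the $i$-th observation and $C_k$ is the centroid of the cluster that $X_i$ is assigned to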
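
The inertia definition above can be checked by hand on a tiny example. The points and cluster assignments below are made up; the sketch computes each centroid as the mean of its points and then sums the squared distances.

```python
import numpy as np

# Made-up data: four points already assigned to two clusters.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 4.0]])
labels = np.array([0, 0, 1, 1])

# Each centroid is the mean of the points assigned to that cluster:
# (0, 1) for cluster 0 and (10, 2) for cluster 1.
centroids = np.array([X[labels == k].mean(axis=0) for k in np.unique(labels)])

# Squared Euclidean distance from each point to its own centroid, summed.
inertia = np.sum((X - centroids[labels]) ** 2)
print(inertia)  # 10.0 (squared distances 1 + 1 + 4 + 4)
```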

**The elbow method |** The elbow method is a way to help decide which clustering gives the most meaningful model of your data. It uses a line plot to visually compare the inertias of different models. With K-means models, this is done as a comparison between different values of *k.*



####  4.2 Silhouette analysis
A silhouette analysis is the comparison of different models’ silhouette scores. To calculate a model’s silhouette score, first, a silhouette coefficient is calculated for each instance in the data. 
An instance’s silhouette coefficient is defined by the following formula, where:

- $a$ = the mean distance between the instance and each other instance in the same cluster

- $b$ = the mean distance from the instance to each instance in the nearest other cluster (i.e., excluding the cluster that the instance is assigned to)

- $\max(a,b)$ = whichever value is greater, $a$ or $b$

- $\text{Silhouette coefficient} = \frac{b-a}{\max(a,b)}$

A silhouette coefficient can range between -1 and +1. A value closer to +1 means that a point is close to other points in its own cluster and well separated from points in other clusters. As with inertia values, you can plot silhouette scores for different models to compare them against each other. 

Note that, unlike inertia, silhouette coefficients contain information about both intracluster distance (captured by $a$) and intercluster distance (captured by $b$).
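
To make the formula concrete, the sketch below computes the silhouette coefficient of every instance in a made-up one-dimensional dataset and averages them into a silhouette score. The helper `silhouette_coefficient` is hypothetical and assumes every cluster contains at least two instances; scikit-learn's `silhouette_score` is the tool to use on real data.

```python
import numpy as np

# Made-up 1-D data: two clusters of two points each.
X = np.array([[0.0], [1.0], [5.0], [6.0]])
labels = np.array([0, 0, 1, 1])

def silhouette_coefficient(i, X, labels):
    """Silhouette coefficient of instance i: (b - a) / max(a, b)."""
    distances = np.linalg.norm(X - X[i], axis=1)
    same = labels == labels[i]
    same[i] = False  # exclude the instance itself from its own cluster
    a = distances[same].mean()
    # b: smallest mean distance to the instances of any other cluster.
    b = min(distances[labels == k].mean()
            for k in np.unique(labels) if k != labels[i])
    return (b - a) / max(a, b)

coefficients = [silhouette_coefficient(i, X, labels) for i in range(len(X))]
score = float(np.mean(coefficients))
print(np.round(coefficients, 3))
print(round(score, 3))  # 0.798
```

For the first point, $a = 1$ and $b = 5.5$, so its coefficient is $4.5 / 5.5 \approx 0.818$.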

Silhouette summary:
- Measures both intercluster distance and intracluster distance
- Equal to the average of all points’ silhouette coefficients
- Can be between -1 and +1 (greater values are better)

---
<a name="5.-Exploratory-Data-Analysis"></a>
### 5. Exploratory Data Analysis

#### 5.1 Imports

In [None]:
# Import standard operational packages.
import numpy as np
import pandas as pd

# Important tools for modeling and evaluation.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Import visualization packages.
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset into a DataFrame and save in a variable
data = pd.read_csv("example_file.csv")

#### 5.2 Data Exploration
After loading the dataset, the next step is to prepare the data to be suitable for clustering. This includes: 

*   Exploring data
*   Checking for missing values
*   Encoding categorical data 
*   Dropping irrelevant columns
*   Scaling the features using `StandardScaler`

In [None]:
# Display the first 10 rows of the data
data.head(10)

In [None]:
# Display number of rows, number of columns
data.shape

In [None]:
# Display the data type for each column. NB K-means expects numeric data
data.dtypes

##### 5.2.1 Check for Missing Values

In [None]:
# Check for missing values.
data.isnull().sum()

In [None]:
# Drop rows with missing values.
# Save DataFrame in variable `data_subset`.
data_subset = data.dropna(axis=0).reset_index(drop = True)

In [None]:
# Check for missing values.
data_subset.isna().sum()

In [None]:
# View first 10 rows.
data_subset.head(10)

##### 5.2.2 Encode Data

In [None]:
# Convert columns from categorical to numeric.
data_subset = pd.get_dummies(data_subset, drop_first= True, columns= ['categorical_column'])
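
To see what this encoding step produces, the sketch below runs the same `pd.get_dummies` call on a made-up three-row DataFrame. The column `numeric_column` and the category values are hypothetical.

```python
import pandas as pd

# Made-up frame; `categorical_column` mirrors the placeholder name used above.
toy = pd.DataFrame({
    "numeric_column": [1.5, 2.0, 3.5],
    "categorical_column": ["a", "b", "c"],
})

# drop_first=True drops the first category ("a") so it becomes the baseline.
encoded = pd.get_dummies(toy, drop_first=True, columns=["categorical_column"])
print(encoded.columns.tolist())
# ['numeric_column', 'categorical_column_b', 'categorical_column_c']
```

The original categorical column is replaced by one indicator column per remaining category.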

##### 5.2.3 Drop Columns

In [None]:
# Drop columns that are not relevant to the clustering.
data_subset = data_subset.drop(['irrelevant_column'], axis=1)

##### 5.2.4 Scale Features
Because K-means uses distance between observations as its measure of similarity, it's important to scale the data before modeling. Use a third-party tool, such as scikit-learn's `StandardScaler` class. `StandardScaler` scales each value $x_i$ of a feature by subtracting the mean observed value for that feature and dividing by its standard deviation:

$x_{scaled} = \frac{x_i - \text{mean}(X)}{\sigma}$

This ensures that every variable has a mean of 0 and a standard deviation (and therefore variance) of 1.
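
The scaling formula can be reproduced by hand. The sketch below applies it to a made-up feature matrix with NumPy, using the population standard deviation (NumPy's default, which is also what `StandardScaler` uses).

```python
import numpy as np

# Made-up feature matrix: two features on very different scales.
X = np.array([[1.0, 1000.0], [2.0, 2000.0], [3.0, 6000.0]])

# Subtract each column's mean and divide by its standard deviation.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```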

In [None]:
# Create a DataFrame of features X by dropping any columns that should not be used for clustering
X = data_subset.drop(['column'], axis=1)

In [None]:
# Scale the features.
# Assign the scaled data to variable `X_scaled`.
X_scaled = StandardScaler().fit_transform(X)

---
<a name="6.-Model-Construction"></a>
### 6. Model Construction


Because you may not know how many clusters exist in the data, start by fitting K-means and examining the inertia for different values of k. To do this, write a function called `kmeans_inertia` that takes in `num_clusters` and `x_vals` (`X_scaled`) and returns a list of each k-value's inertia.

When using K-means inside the function, set the `random_state` to `42`. This way, others can reproduce your results.

In [None]:
# Fit K-means and evaluate inertia for different values of k.
num_clusters = list(range(2, 11))

def kmeans_inertia(num_clusters, x_vals):
    """
    Accepts as arguments list of ints and data array. 
    Fits a KMeans model where k = each value in the list of ints. 
    Returns each k-value's inertia appended to a list.
    """
    inertia = []
    for num in num_clusters:
        kms = KMeans(n_clusters=num, random_state=42)
        kms.fit(x_vals)
        inertia.append(kms.inertia_)

    return inertia

Use the `kmeans_inertia` function to return a list of inertia values for k = 2 to 10.

In [None]:
# Return a list of inertia for k=2 to 10.
inertia = kmeans_inertia(num_clusters, X_scaled)
inertia

In [None]:
# Create a line plot that shows the relationship between num_clusters and inertia
plot = sns.lineplot(x=num_clusters, y=inertia, marker = 'o')
plot.set_xlabel("Number of clusters");
plot.set_ylabel("Inertia");

**Question:** Where is the elbow in the plot?
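
To support the visual reading of the plot, you can also look at how much inertia falls with each additional cluster. The sketch below is a rough heuristic rather than a standard method, and the inertia values are hypothetical numbers chosen for illustration, not results from this notebook.

```python
import numpy as np

# Hypothetical inertia values for k = 2..7 (illustrative numbers only).
k_values = [2, 3, 4, 5, 6, 7]
inertia_values = [900.0, 500.0, 250.0, 220.0, 200.0, 185.0]

# Drop in inertia gained by moving from each k to k + 1.
drops = -np.diff(inertia_values)
for k, drop in zip(k_values[:-1], drops):
    print(f"k={k} -> k={k + 1}: inertia falls by {drop:.0f}")

# Rough heuristic: the elbow is the k after which the drop shrinks the most.
elbow_k = k_values[int(np.argmax(drops[:-1] - drops[1:])) + 1]
print("Candidate elbow:", elbow_k)  # 4 for these numbers
```

Treat the result as a candidate to compare against the plot, not as a definitive answer.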

---
<a name="7.-Model-Results"></a>
### 7. Model Results

#### 7.1 Silhouette Score

Evaluate the silhouette score using the `silhouette_score()` function. Silhouette scores measure how close each point is to the other points in its own cluster relative to the points in the nearest neighboring cluster.

Then, compare the silhouette score of each value of k, from 2 through 10. To do this, write a function called `kmeans_sil` that takes in `num_clusters` and `x_vals` (`X_scaled`) and returns a list of each k-value's silhouette score.

In [None]:
# Evaluate silhouette score.
# Write a function to return a list of each k-value's score.
def kmeans_sil(num_clusters, x_vals):
    """
    Accepts as arguments list of ints and data array. 
    Fits a KMeans model where k = each value in the list of ints.
    Calculates a silhouette score for each k value. 
    Returns each k-value's silhouette score appended to a list.
    """
    sil_score = []
    for num in num_clusters:
        kms = KMeans(n_clusters=num, random_state=42)
        kms.fit(x_vals)
        sil_score.append(silhouette_score(x_vals, kms.labels_))

    return sil_score

sil_score = kmeans_sil(num_clusters, X_scaled)
sil_score

In [None]:
# Create a line plot that shows the relationship between num_clusters and sil_score
plot = sns.lineplot(x=num_clusters, y=sil_score, marker = 'o')
plot.set_xlabel("Number of clusters");
plot.set_ylabel("Silhouette Score");

**Question:** What does the graph show?

Silhouette scores near 1 indicate that samples are far away from neighboring clusters. Scores close to 0 indicate that samples are on or very close to the decision boundary between two neighboring clusters.

#### 7.2 Optimal k-value
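To decide on an optimal k-value, consider both the elbow plot and the silhouette scores. Then fit an n-cluster model to the dataset, where n is the chosen value of k (for example, the k with the highest silhouette score above).

In [None]:
# Fit an n-cluster model, where n is the chosen k.
# Here n is the k with the highest silhouette score; replace it if the elbow plot suggests otherwise.
n = num_clusters[int(np.argmax(sil_score))]
kmeans6 = KMeans(n_clusters=n, random_state=42)
kmeans6.fit(X_scaled)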

In [None]:
# Print unique labels.
print('Unique labels:', np.unique(kmeans6.labels_))

Now, create a new column `cluster` that indicates cluster assignment in the DataFrame `data_subset`. It's important to understand the meaning of each cluster's labels, then decide whether the clustering makes sense. 

**Note:** This task is done using `data_subset` because it is often easier to interpret unscaled data.

In [None]:
# Create a new column `cluster`.
data_subset['cluster'] = kmeans6.labels_
data_subset.head()
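
One simple way to start interpreting the cluster labels is to summarize the unscaled features per cluster. The sketch below uses a made-up DataFrame, `toy`, standing in for `data_subset`; the feature names and values are hypothetical.

```python
import pandas as pd

# Made-up frame standing in for `data_subset` after the `cluster` column is added.
toy = pd.DataFrame({
    "feature_a": [1.0, 1.2, 8.0, 8.4],
    "feature_b": [10.0, 12.0, 30.0, 34.0],
    "cluster": [0, 0, 1, 1],
})

# Mean of every feature per cluster, plus the number of rows in each cluster.
profile = toy.groupby("cluster").mean()
profile["n_rows"] = toy.groupby("cluster").size()
print(profile)
```

Clusters whose feature means differ clearly are usually easier to describe and name.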

Use `groupby` to verify if any `'cluster'` can be differentiated by `'categorical_column1'`.

In [None]:
# Verify if any `cluster` can be differentiated by `categorical_column1`.
data_subset.groupby(by=['cluster', 'categorical_column1']).size()

In [None]:
# Interpret the groupby outputs using a visualization
data_subset.groupby(by=['cluster', 'categorical_column1']).size().plot.bar(title='Clusters differentiated by categorical_column1',
                                                                   figsize=(6, 5),
                                                                   ylabel='Size',
                                                                   xlabel='(Cluster, categorical_column1)');

Use `groupby` to verify if each `'cluster'` can be differentiated by `'categorical_column1'` AND `'categorical_column2'`.

In [None]:
# Verify if each `cluster` can be differentiated by `categorical_column1` AND `categorical_column2`.
data_subset.groupby(by=['cluster','categorical_column1', 'categorical_column2']).size().sort_values(ascending= False)

**Question:** Are the clusters differentiated by `'categorical_column1'` and `'categorical_column2'`?

The results of the above evaluation with `groupby` indicate whether the clusters produced by the algorithm make sense and how well it performed.

Finally, interpret the `groupby` outputs and visualize these results. The graph below shows whether each `'cluster'` can be differentiated by `'categorical_column1'` and `'categorical_column2'`.

In [None]:
# Count rows per (cluster, categorical_column1, categorical_column2),
# then pivot categorical_column1 into columns so each value gets its own bar.
counts = data_subset.groupby(by=['cluster', 'categorical_column1', 'categorical_column2']).size()
counts.unstack(level='categorical_column1', fill_value=0).plot.bar(
    title='Clusters differentiated by categorical_column1 and categorical_column2',
    figsize=(6, 5),
    ylabel='Size',
    xlabel='(Cluster, categorical_column2)')
plt.legend(bbox_to_anchor=(1.3, 1.0))