# Applying k-modes and k-prototypes to MPG data

This notebook uses data from [ggplot2's example datasets](https://ggplot2.tidyverse.org/reference/mpg.html) relating to fuel economy expressed as _miles per gallon_ for selected cars offered for sale in the USA.

The data was downloaded from [the ggplot2 repository](https://github.com/tidyverse/ggplot2/blob/main/data-raw/mpg.csv) and saved as `data/ggplot-mpg.csv`.

### Import the standard libraries

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Import the kmodes modules

In [None]:
from kmodes.kmodes import KModes
from kmodes.kprototypes import KPrototypes

## Reading and preparing the MPG data

### Read the data

In [None]:
df = pd.read_csv("data/ggplot-mpg.csv")

### Rename the columns to make them easier to interpret

In [None]:
df.rename(columns={"displ": "displacement", "cyl": "numCyl",
                   "drv": "drivetrain", "cty": "urban",
                   "hwy": "extra-urban", "fl": "fuel"}, inplace=True)

### Derive the `transmission` column

In [None]:
# See https://saturncloud.io/blog/how-to-apply-regex-to-a-pandas-dataframe/
df["transmission"] = df["trans"].str.extract(r'^([a-z]+)')

### Derive the `numGears` column

In [None]:
# Extract the gear count as a string of digits, where one is present
df["numGears"] = df["trans"].str.extract(r'(\d+)', expand=False)

# Where no number is given, assume a continuously variable transmission
# and represent it with a large gear count, say 100
df["numGears"] = df["numGears"].fillna("100")
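
As a quick check of the two patterns above, the sketch below applies them with Python's `re` module to a few strings in the style of the `trans` column. The sample values and the helper name `parseTrans` are illustrative assumptions, not part of the notebook.

```python
import re

def parseTrans(trans):
  """Split a string such as 'auto(l5)' into (transmission, numGears)."""
  # Leading run of lowercase letters, as in r'^([a-z]+)'
  kind = re.search(r'^([a-z]+)', trans)
  # First run of digits, as in r'(\d+)'; absent for a string like 'auto(av)'
  gears = re.search(r'(\d+)', trans)
  return (kind.group(1) if kind else None,
          gears.group(1) if gears else "100")  # same fallback value as the notebook

samples = ["auto(l5)", "manual(m6)", "auto(av)"]
parsed = [parseTrans(s) for s in samples]
print(parsed)
```

A string without digits falls through to the placeholder, which is what the `fillna` step does for the whole column.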

### Standardise the `model` column by removing descriptors that are already stored in other columns

In [None]:
df["model"] = df["model"].str.replace(r' quattro$','',regex=True)
df["model"] = df["model"].str.replace(r' 2wd$','',regex=True)
df["model"] = df["model"].str.replace(r' 4wd$','',regex=True)
df["model"] = df["model"].str.replace(r' pickup$','',regex=True)
df["model"] = df["model"].str.replace(r'^toyota ','',regex=True)
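
The order of the replacements matters: a name such as `dakota pickup 4wd` only loses ` pickup` once the trailing ` 4wd` has been removed. A minimal sketch with `re.sub`, using a hypothetical helper `standardiseModel` and sample names in the style of the `model` column:

```python
import re

# Same patterns, applied in the same order as the notebook cell
PATTERNS = [r' quattro$', r' 2wd$', r' 4wd$', r' pickup$', r'^toyota ']

def standardiseModel(name):
  """Strip each descriptor in turn from a model name."""
  for pattern in PATTERNS:
    name = re.sub(pattern, '', name)
  return name

examples = ["a4 quattro", "dakota pickup 4wd", "toyota tacoma 4wd"]
cleaned = [standardiseModel(n) for n in examples]
print(cleaned)
```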

### Prepare the categorical columns

In [None]:
# 1. Identify them by column name
catCols = ["manufacturer", "model", "drivetrain", "fuel", "class", "transmission"]

# 2. Ensure they are strings
# 3. Convert to categorical
for catCol in catCols:
  df[catCol] = pd.Categorical(df[catCol].astype(str))

### Prepare the ordered categorical columns

In [None]:
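# 1. Identify them by column name
orderedCatCols = ["displacement", "year", "numCyl", "numGears"]

# 2. Convert the values to strings
# 3. Convert to ordered categoricals, with the categories sorted by numeric
#    value (a plain string sort would put "100" before "4")
for orderedCatCol in orderedCatCols:
  values = df[orderedCatCol].astype(str)
  categories = sorted(values.unique(), key=float)
  df[orderedCatCol] = pd.Categorical(values, categories=categories, ordered=True)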
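
One thing to watch when ordered columns are stored as strings: strings sort lexicographically, so `"100"` comes before `"4"`. The sketch below shows the difference and one way to obtain numerical order, which could then be passed to `pd.Categorical` through its `categories` argument. The sample values are illustrative.

```python
# Gear counts as strings, including the "100" placeholder used for
# continuously variable transmissions
gears = ["4", "5", "6", "100"]

lexicographic = sorted(gears)        # plain string ordering
numeric = sorted(gears, key=float)   # ordering by numeric value

print(lexicographic)
print(numeric)
```

Note that k-modes itself only tests values for equality, so the category order affects how the data is displayed and sorted rather than the clustering.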

### Prepare the numerical columns

In [None]:
# 1. Identify them by column name
numCols = ["urban", "extra-urban"]

# 2. Ensure they are floats
for numCol in numCols:
  df[numCol] = df[numCol].astype(float)

### Check the "fixes" that have been applied

In [None]:
display(df.head())

## Using KModes on the categorical columns in the MPG data

### Find the best k for kmodes on this data

In [None]:
kRange = range(1,9)
allCatCols = catCols + orderedCatCols
scores = dict()
for k in kRange:
  # Use the original Huang initialisation with 5 random starts, and turn off logging
  model = KModes(n_clusters=k, init='Huang', verbose=0, random_state=42, n_init=5)
  fittedModel = model.fit(df[allCatCols])
  scores[k] = fittedModel.cost_
print(scores)

### Make sure the `res` directory exists, so outputs can be sent there

In [None]:
# Create the directory if it is missing; do nothing if it is already there
os.makedirs('res', exist_ok=True)

### Define a function to plot scores

In [None]:
def plotScores(scores, modelType, dataset):
  ModelType = modelType.capitalize()
  fig, ax = plt.subplots()
  # Line plot with the points shown as filled circles
  ax.plot(list(scores.keys()), list(scores.values()), '-o')
  ax.set_xlabel("Number of clusters")
  # cost_ is the summed dissimilarity of all points to their cluster centres
  ax.set_ylabel("Clustering cost")
  ax.set_title(f"Cost for {modelType} on {dataset} data")
  fig.savefig(f'res/elbowFor{ModelType}.pdf', bbox_inches='tight')
  plt.show()

### Plot the scores for kmodes

In [None]:
plotScores(scores, 'kmodes', 'ggplot-mpg')
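
Reading the elbow off the plot is a judgement call. One rough aid, sketched below, is to print the relative drop in cost each time k increases by one and look for the point where the drops level off. The `exampleScores` values are made-up numbers for illustration, not results from this data, and `relativeDrops` is a hypothetical helper.

```python
def relativeDrops(scores):
  """Map each k (after the first) to the fractional cost reduction from the previous k."""
  ks = sorted(scores)
  return {k: (scores[prev] - scores[k]) / scores[prev]
          for prev, k in zip(ks, ks[1:])}

# Made-up costs, for illustration only
exampleScores = {1: 1000.0, 2: 600.0, 3: 450.0, 4: 420.0}
drops = relativeDrops(exampleScores)
for k, drop in drops.items():
  print(f"k={k}: cost fell by {drop:.0%}")
```

The same function can be called on the `scores` dictionary built in the loop above.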

### Fit kmodes with the chosen k, then print the cost and iteration count

In [None]:
# bestK is set by hand after inspecting the elbow plot
bestK = 4
model = KModes(n_clusters=bestK, init='Huang', verbose=0, random_state=42, n_init=5)
fittedModel = model.fit(df[allCatCols])
clusterIDs = fittedModel.predict(df[allCatCols])
print("Best k is {} with cost {}, found after {} iterations".format(
      bestK, fittedModel.cost_, fittedModel.n_iter_))

### Show the cluster centres for the best value of k for kmodes on this data

In [None]:
clusterDf = pd.DataFrame(data=fittedModel.cluster_centroids_, columns=allCatCols)
print(f"With best k = {bestK}, for this data, the kmodes centres are")
display(clusterDf)
print("The modes are computed per column, so the centres do not need to coincide with existing rows in the data")
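
To see why a centre need not match any row, here is a minimal sketch of the two ingredients of k-modes: the simple matching dissimilarity (a count of mismatched attributes) and the column-wise mode. It is a plain-Python illustration with made-up rows, not the `kmodes` implementation.

```python
from collections import Counter

def matchingDissimilarity(a, b):
  """Number of attributes on which two rows disagree."""
  return sum(x != y for x, y in zip(a, b))

def columnModes(rows):
  """Most frequent value in each column, taken independently of the other columns."""
  return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))

# Made-up cluster members: (drivetrain, fuel, class)
cluster = [("f", "r", "compact"),
           ("f", "p", "midsize"),
           ("4", "p", "compact")]

centre = columnModes(cluster)
distances = [matchingDissimilarity(row, centre) for row in cluster]
print(centre)             # the per-column modes
print(centre in cluster)  # whether the centre is an actual row
print(distances)          # mismatches between each row and the centre
```

Here every row differs from the centre in exactly one attribute, so the centre is a combination that never occurs in the data.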

### Show the cluster labels for each vehicle using the best value of k for kmodes

In [None]:
print(f"Categorical columns only, labeled by kmodes with best k = {bestK}")
labeledDf = df[allCatCols].assign(clusterID=clusterIDs)
display(labeledDf)

## Using KPrototypes on the numerical and categorical columns in the MPG data

### Find the best k for kprototypes on this data

In [None]:
kRange = range(1,8)
allCols = numCols + allCatCols
catColIDs = list(range(len(numCols),len(numCols)+len(allCatCols)))
scores = dict()
for k in kRange:
  # Use Huang initialisation with 5 random starts, and turn off logging
  model = KPrototypes(n_clusters=k, init='Huang', verbose=0, random_state=42, n_init=5)
  # Note that we need to tell the model which are the categorical columns
  fittedModel = model.fit(df[allCols], categorical=catColIDs)
  scores[k] = fittedModel.cost_
print(scores)
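
k-prototypes combines the two kinds of column in a single dissimilarity: squared Euclidean distance on the numerical part plus a weight, usually called gamma, times the number of categorical mismatches. The sketch below illustrates that mixed measure in plain Python. The rows and the value of `gamma` are made up, and the function is not the library's code. (When `gamma` is not passed to `KPrototypes`, the library estimates a value from the data.)

```python
def mixedDissimilarity(numA, catA, numB, catB, gamma):
  """Squared Euclidean distance on numbers plus gamma times categorical mismatches."""
  numericPart = sum((x - y) ** 2 for x, y in zip(numA, numB))
  categoricalPart = sum(x != y for x, y in zip(catA, catB))
  return numericPart + gamma * categoricalPart

# Made-up vehicle and prototype: (urban, extra-urban) and (drivetrain, fuel)
vehicleNum, vehicleCat = (18.0, 29.0), ("f", "p")
prototypeNum, prototypeCat = (20.0, 28.0), ("f", "r")

d = mixedDissimilarity(vehicleNum, vehicleCat, prototypeNum, prototypeCat, gamma=2.0)
print(d)
```

A larger gamma makes categorical mismatches count for more relative to differences in fuel economy.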

### Plot the scores for kprototypes

In [None]:
plotScores(scores, 'kprototypes', 'ggplot-mpg')

### Fit kprototypes with the chosen k, then print the cost and iteration count

In [None]:
# bestK is set by hand after inspecting the elbow plot
bestK = 3
model = KPrototypes(n_clusters=bestK, init='Huang', verbose=0, random_state=42, n_init=5)
# Note that we need to tell the model which are the categorical columns
fittedModel = model.fit(df[allCols], categorical=catColIDs)
clusterIDs = fittedModel.predict(df[allCols], categorical=catColIDs)
print("Best k is {} with cost {}, found after {} iterations".format(
      bestK, fittedModel.cost_, fittedModel.n_iter_))

### Show the cluster centres for the best value of k for kprototypes on this data

In [None]:
clusterDf = pd.DataFrame(data=fittedModel.cluster_centroids_, columns=allCols)
print(f"With best k = {bestK}, for this data, the kprototypes centres are")
display(clusterDf)
print("Numerical columns hold means and categorical columns hold per-column modes, so the centres do not need to coincide with existing rows in the data")

### Show the cluster labels for each vehicle using the best value of k for kprototypes

In [None]:
print(f"All columns, labeled by kprototypes with best k = {bestK}")
labeledDf = df[allCols].assign(clusterID=clusterIDs)
display(labeledDf)

## Exercises
 
1. Try adjusting the columns included for both kmodes and kprototypes. What
   effect does it have on the cluster centres and hence the cluster assignments?
2. Apply kmeans to the urban and extra-urban columns. How do the cluster assignments
   differ from those produced by kmodes and kprototypes?