# Application - Preprocessing

KMeans can also be used as a dimensionality reduction step whose output feeds another machine learning model. We'll demonstrate this with the digits dataset.

In [1]:
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split


X_digits, y_digits = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X_digits, y_digits)

First, establish a baseline accuracy with logistic regression:

In [2]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(solver='liblinear', multi_class='auto', random_state=40)
log_reg.fit(X_train, y_train)
log_reg.score(X_test, y_test)

0.9755555555555555

Now we will feed the logistic regression model a reduced feature space of 60 features, one per cluster, instead of the original 64 pixel values. The pipeline obtains these features from the `fit_transform` of the KMeans step.

More specifically, `fit_transform` on KMeans identifies the `k` centroids by fitting the model, and then transforms each instance in `X` by mapping it into cluster-distance space. In cluster-distance space, each instance in `X` is represented by the array `[distance(c) for c in all_clusters]`, that is, its distance to every centroid.
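As a quick check of what cluster-distance space looks like, the sketch below fits KMeans on the digits data and compares the output of `fit_transform` with distances computed by hand. The choice of `n_clusters=10`, `n_init=10`, and `random_state=0` is an assumption made only for illustration; it is not the configuration used in the pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

X, _ = load_digits(return_X_y=True)

# k=10 is an arbitrary value chosen for this illustration only
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)

# Each row holds one instance's Euclidean distance to every centroid
X_dist = kmeans.fit_transform(X)

# Recompute the same distances by hand with broadcasting
diffs = X[:, None, :] - kmeans.cluster_centers_[None, :, :]
manual = np.linalg.norm(diffs, axis=2)

print(X.shape, "->", X_dist.shape)
print("matches manual distances:", np.allclose(X_dist, manual, atol=1e-6))
```

The transformed array has one column per cluster, so the number of clusters directly sets the number of features the downstream model sees.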

In [3]:
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans


pipeline = Pipeline([
    ("kmeans", KMeans(n_clusters=60)),
    ("log_reg", LogisticRegression(solver='liblinear', multi_class='auto', random_state=40)),
])
pipeline.fit(X_train, y_train)
pipeline.score(X_test, y_test)

0.9777777777777777

A small gain, from roughly 97.6% to 97.8%. Not bad. Note that it is intuitive to think that 10 clusters would be optimal. This is an example where our intuition fails us. While it is true that there are 10 digits, there would only be 10 clusters if each digit were always drawn exactly the same way, and as we know there are certainly more than 10 different ways to write digits.

We can improve on this by not choosing k arbitrarily. Using grid search, we can optimize k in the context of the logistic regression model. Note that using the silhouette score to optimize k is unnecessary here, because the final evaluation takes place on the logistic regression model and we want k chosen with both models taken into account.

In [4]:
from sklearn.model_selection import GridSearchCV


param_grid = dict(kmeans__n_clusters=range(2, 100))
grid_clf = GridSearchCV(pipeline, param_grid, cv=3, verbose=1, n_jobs=-1)
grid_clf.fit(X_train, y_train)
grid_clf.score(X_test, y_test)

Fitting 3 folds for each of 98 candidates, totalling 294 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    3.7s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   30.6s
[Parallel(n_jobs=-1)]: Done 294 out of 294 | elapsed:  1.1min finished


0.9822222222222222
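To see which k a search like this selects, the fitted `GridSearchCV` object exposes `best_params_` and `best_score_`. The sketch below is a self-contained, scaled-down version rather than the search run above: the three-value grid, the fixed `random_state`, `n_init=10`, and the default `lbfgs` solver with `max_iter=5000` (in place of `liblinear`) are all assumptions chosen so that it runs quickly, and its numbers will not match the outputs shown here.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("kmeans", KMeans(n_init=10, random_state=0)),
    # Assumption: default lbfgs solver, not the liblinear solver used above
    ("log_reg", LogisticRegression(max_iter=5000)),
])

# Deliberately coarse grid so the sketch finishes quickly
search = GridSearchCV(pipe, {"kmeans__n_clusters": [10, 30, 50]}, cv=3)
search.fit(X_train, y_train)

best_k = search.best_params_["kmeans__n_clusters"]
print("best k:", best_k)
print("mean cross-validated accuracy:", search.best_score_)
print("test accuracy:", search.score(X_test, y_test))
```

Because `GridSearchCV` refits the best pipeline on the full training set by default, calling `score` on the search object evaluates that refitted pipeline.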