You want to cluster observations so that similar observations are grouped
together.

If you know that you have k groups, you can use k-means clustering to group
similar observations and output a new feature containing each observation’s
group membership:

In [1]:
# Load libraries
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans


In [2]:
# Make simulated feature matrix
features, _ = make_blobs(n_samples = 50,
                         n_features = 2,
                         centers = 3,
                         random_state = 1)

In [3]:
# Create DataFrame
dataframe = pd.DataFrame(features, columns=["feature_1", "feature_2"])

In [4]:
# View first five observations
dataframe.head(5)

   feature_1  feature_2
0  -9.877554  -3.336145
1  -7.287210  -8.353986
2  -6.943061  -7.023744
3  -7.440167  -8.791959
4  -6.641388  -8.075888


In [5]:
# Make k-means clusterer
cluster = KMeans(n_clusters=3, random_state=0)

In [7]:
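# Fit clusterer
cluster.fit(features)

# Predict each observation's group and store it as a new feature
dataframe["group"] = cluster.predict(features)

# View first five observations
dataframe.head(5)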

   feature_1  feature_2  group
0  -9.877554  -3.336145      0
1  -7.287210  -8.353986      2
2  -6.943061  -7.023744      2
3  -7.440167  -8.791959      2
4  -6.641388  -8.075888      2
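
Before relying on the new group feature, it can help to check how the observations were distributed across the groups and where each group's center lies. The following is a minimal sketch rather than part of the original recipe: it rebuilds the same simulated data, and the names groups, group_sizes, and centroids, as well as the explicit n_init=10 setting, are assumptions added for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Recreate the simulated feature matrix used in the recipe
features, _ = make_blobs(n_samples=50, n_features=2, centers=3, random_state=1)

# n_init=10 is an illustrative assumption, set explicitly so the sketch
# behaves the same way across scikit-learn versions
cluster = KMeans(n_clusters=3, n_init=10, random_state=0)

# fit_predict fits k-means and returns one group label per observation
groups = pd.Series(cluster.fit_predict(features), name="group")

# Number of observations assigned to each group
group_sizes = groups.value_counts().sort_index()
print(group_sizes)

# One centroid per group, in the original feature space
centroids = pd.DataFrame(cluster.cluster_centers_,
                         columns=["feature_1", "feature_2"])
print(centroids)
```

A very small or very lopsided group can be a hint that the chosen k does not match the structure of the data.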


We are getting a bit ahead of ourselves here and will go into much more depth
about clustering algorithms later in the book. However, I wanted to point out
that we can use clustering as a preprocessing step. Specifically, we use
unsupervised learning algorithms like k-means to cluster observations into
groups. The end result is a categorical feature in which similar observations
are members of the same group.

Don’t worry if you did not understand all of that right now: just file away the
idea that clustering can be used in preprocessing. And if you really can’t wait,
feel free to flip to Chapter 19 now.
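
Because the group labels produced by k-means are arbitrary identifiers rather than ordered quantities, one common way to pass them to a later model is to expand them into indicator columns. The sketch below is an illustration under that assumption, not the recipe's own code: it uses pandas' get_dummies, and the encoded name, the "group" prefix, and the explicit n_init=10 setting are choices made here.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Recreate the simulated data from the recipe
features, _ = make_blobs(n_samples=50, n_features=2, centers=3, random_state=1)
dataframe = pd.DataFrame(features, columns=["feature_1", "feature_2"])

# Cluster the observations and store group membership as a new feature
# (n_init=10 is an illustrative assumption, set explicitly for consistency
# across scikit-learn versions)
cluster = KMeans(n_clusters=3, n_init=10, random_state=0)
dataframe["group"] = cluster.fit_predict(features)

# Treat the group as categorical: one indicator column per cluster
encoded = pd.get_dummies(dataframe, columns=["group"], prefix="group")
print(encoded.head(5))
```

Each row of encoded has exactly one active indicator among group_0, group_1, and group_2, so a downstream model does not read the cluster numbers as magnitudes.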
