## Clustering

This section shows how the K-means algorithm works, using a sample dataset of delivery fleet driver data.
For the sake of simplicity, we'll look at only two driver features:
- the mean distance driven per day
- the mean percentage of time a driver was more than 5 mph over the speed limit

In general, this algorithm can be used for any number of features, so long as the number of data samples is much greater than the number of features.
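
For intuition, the loop that K-means runs can be sketched in a few lines of NumPy. This is a minimal illustration of the standard assign-and-update iteration (Lloyd's algorithm), not scikit-learn's implementation; the helper name `kmeans_sketch`, the toy points, and the choice of initial centroids are all assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(points, init_centroids, n_iter=10):
    """Hypothetical helper: plain assign/update K-means loop."""
    centroids = np.asarray(init_centroids, dtype=float).copy()
    for _ in range(n_iter):
        # Assignment step: distance from every point to every centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        for k in range(len(centroids)):
            members = points[labels == k]
            if len(members) > 0:  # leave an empty cluster's centroid unchanged
                centroids[k] = members.mean(axis=0)
    return labels, centroids

# Toy data (made up): two tight groups in (distance, speeding) space
points = np.array([[48.0, 5.0], [52.0, 9.0], [50.0, 7.0],
                   [178.0, 20.0], [182.0, 16.0], [180.0, 18.0]])
labels, centroids = kmeans_sketch(points, init_centroids=points[[0, 3]])
print(labels)
print(centroids)
```

Each pass alternates between assigning every point to its nearest centroid and recomputing each centroid as the mean of its members; the loop settles once assignments stop changing.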

### Step 1: Clean and Transform Your Data

For this example, the data is already cleaned. A sample of the data as a pandas DataFrame is shown below.

In [None]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [None]:
df = pd.read_csv("delivery-fleet_data.tsv", delimiter="\t")
df.head()

In [None]:
df.shape

In [None]:
import matplotlib.pyplot as plt

plt.scatter(df.Distance_Feature,df.Speeding_Feature)
plt.title("Delivery Fleet Data")
plt.xlabel("Distance")
plt.ylabel("Speeding > 5 mph")
plt.savefig("drivers.png")
plt.show()
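
K-means relies on Euclidean distance, so a feature with a larger numeric range can dominate the clustering. The walkthrough here clusters the raw features, but a transform step might standardize them first. The sketch below shows one way to do that with scikit-learn's `StandardScaler`; the rows are made up and only reuse the fleet data's column names.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up rows with the same two feature columns as the fleet data
toy = pd.DataFrame({
    "Distance_Feature": [45.0, 55.0, 170.0, 190.0],
    "Speeding_Feature": [5.0, 30.0, 10.0, 70.0],
})

# Rescale each column to zero mean and unit variance
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns)
print(scaled.round(2))
```

Clustering on standardized features would yield centroids in scaled units, so they would not match the raw-unit centroids reported in this walkthrough.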

### Step 2: Choose K and Run the Algorithm

Start by choosing K=2.

In [None]:
from sklearn.cluster import KMeans
data = df.drop(columns=['Driver_ID'])

In [None]:
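# Fix random_state so the cluster labels and centroid order are reproducible across runs
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(data)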

In [None]:
kmeans.labels_

In [None]:
kmeans.cluster_centers_

In [None]:
centroids = kmeans.cluster_centers_

### Step 3: Review the Results

The chart below shows the results. Visually, you can see that the K-means algorithm splits the two groups based on the distance feature. Each cluster centroid is marked with a star.

- Group 1 Centroid = (50, 8.8)
- Group 2 Centroid = (180, 18.3)

Using domain knowledge of the dataset, we can infer that Group 1 consists of urban drivers and Group 2 of rural drivers.

In [None]:
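# Color each driver by its assigned cluster, then mark the centroids with stars
plt.scatter(df.Distance_Feature, df.Speeding_Feature, c=kmeans.labels_, alpha=0.4)
plt.scatter(centroids[0][0], centroids[0][1], marker='*', s=800, c='r')
plt.scatter(centroids[1][0], centroids[1][1], marker='*', s=800, c='k')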
plt.title("Delivery Fleet Data")
plt.xlabel("Distance")
plt.ylabel("Speeding > 5 mph")
plt.show()
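
One way to sanity-check reported centroids is to recall that, once K-means has converged, each centroid is the mean of the points assigned to it. The sketch below demonstrates this on synthetic data (the values are stand-ins, not the fleet dataset) by comparing `cluster_centers_` against a pandas `groupby` mean.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the fleet data (values are made up)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Distance_Feature": np.concatenate([rng.normal(50, 5, 100), rng.normal(180, 5, 100)]),
    "Speeding_Feature": np.concatenate([rng.normal(9, 2, 100), rng.normal(18, 2, 100)]),
})

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy)

# Mean of each feature within each assigned cluster, ordered by label
group_means = toy.groupby(km.labels_).mean()
print(group_means)
print(km.cluster_centers_)
```

The two printouts should agree row by row, since row `i` of `cluster_centers_` corresponds to label `i`.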

### Step 4: Iterate Over Several Values of K

Test how the results look for K=4. To do this, all you need to change is the `n_clusters` argument passed to `KMeans()`.

In [None]:
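# Fix random_state so the cluster labels and centroid order are reproducible across runs
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(data)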

In [None]:
kmeans.labels_

In [None]:
kmeans.cluster_centers_

In [None]:
centroids_1 = kmeans.cluster_centers_

In [None]:
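# Color each driver by its assigned cluster; centroids are drawn as stars
plt.scatter(df.Distance_Feature, df.Speeding_Feature, c=kmeans.labels_, alpha=0.4)
plt.scatter(centroids_1[0][0], centroids_1[0][1], marker='*', s=800, c='r')
# Black instead of white, so the star is visible on a white background
plt.scatter(centroids_1[1][0], centroids_1[1][1], marker='*', s=800, c='k')
plt.scatter(centroids_1[2][0], centroids_1[2][1], marker='*', s=800, c='y')
plt.scatter(centroids_1[3][0], centroids_1[3][1], marker='*', s=800, c='b')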
plt.title("Delivery Fleet Data")
plt.xlabel("Distance")
plt.ylabel("Speeding > 5 mph")
plt.show()

The chart above shows the resulting clusters.

The algorithm has now identified four distinct groups: in addition to the rural vs. urban divide, drivers who frequently speed are separated from those who follow speed limits.

The speeding threshold that separates the two groups is lower for urban drivers than for rural drivers, likely because urban drivers spend more time at intersections and in stop-and-go traffic.
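
A common heuristic for comparing several values of K, which this walkthrough does not apply, is to record the within-cluster sum of squares (`inertia_` in scikit-learn) for each K and look for the point where the improvement levels off. The sketch below assumes synthetic four-group data in place of the fleet dataset; the blob centers, spreads, and range of K are arbitrary choices.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data (made up): four blobs in (distance, speeding) space
rng = np.random.default_rng(42)
centers = np.array([[50, 5], [50, 30], [180, 10], [180, 70]])
X = np.vstack([c + rng.normal(0, [5, 3], size=(100, 2)) for c in centers])

# Fit one model per candidate K and record its inertia
inertias = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_

for k, value in inertias.items():
    print(f"K={k}: inertia={value:.0f}")
```

Inertia generally keeps falling as K grows, so the value to look for is where the drop flattens rather than the minimum itself.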

In [None]:
kmeans.predict(data)

In [None]:
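# labels_ holds the same assignments as predict() on the training data,
# and using it keeps this cell safe to re-run after the column is added
data['cluster_no'] = kmeans.labels_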

In [None]:
data.head()

In [None]:
data.cluster_no.value_counts()
