# Cluster Analysis Using Bisecting K-Means Clustering

Bisecting k-means clustering, available in `scikit-learn`, is a variant of
k-means clustering that is typically more efficient because of its
bisecting nature. This process also changes how clusters and
centroids are selected, so the resulting analysis is related to, but
can differ noticeably from, one produced with k-means.

The largest difference between the two is the way clusters are created.
With bisecting k-means, the model by default selects the cluster
with the biggest inertia to be split. This process of selecting and
splitting repeats until the desired number of clusters is reached,
whereas k-means iteratively tries to separate samples into
groups of equal variance. Because each split happens inside an existing
cluster, the output of bisecting k-means commonly has a hierarchy,
which can differ drastically from k-means.
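
The select-and-split loop described above can be sketched in a few lines. This is only an illustration under assumptions, not how `scikit-learn` implements it: the helper `bisect_until` and the synthetic `make_blobs` data are hypothetical, and each split is done with an ordinary 2-cluster `KMeans`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def bisect_until(points, n_clusters, seed=0):
    """Hypothetical helper: keep splitting the cluster with the largest inertia."""
    labels = np.zeros(len(points), dtype=int)
    while labels.max() + 1 < n_clusters:
        # Inertia = sum of squared distances from each point to its cluster mean
        inertias = [
            ((points[labels == k] - points[labels == k].mean(axis=0)) ** 2).sum()
            for k in range(labels.max() + 1)
        ]
        target = int(np.argmax(inertias))
        idx = np.flatnonzero(labels == target)
        # Split the selected cluster in two with plain 2-means
        halves = KMeans(n_clusters=2, n_init=1, random_state=seed).fit_predict(points[idx])
        # One half keeps the old label, the other half gets a new one
        labels[idx[halves == 1]] = labels.max() + 1
    return labels


# Synthetic stand-in data (an assumption, not the solar farm data)
points, _ = make_blobs(n_samples=300, centers=4, random_state=0)
sketch_labels = bisect_until(points, n_clusters=4)
print(np.bincount(sketch_labels))
```

Because every new cluster is carved out of an existing one, the labels form a tree of splits, which is where the hierarchy comes from.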

The process to perform this cluster analysis is very similar, so
let's get started.

## Installing Dependencies and Importing Libraries

As before, we need to install and import a few libraries before performing any analysis.

In [1]:
! pip install matplotlib=="3.8.0"

Collecting matplotlib==3.8.0
  Obtaining dependency information for matplotlib==3.8.0 from https://files.pythonhosted.org/packages/40/d9/c1784db9db0d484c8e5deeafbaac0d6ed66e165c6eb4a74fb43a5fa947d9/matplotlib-3.8.0-cp311-cp311-win_amd64.whl.metadata
  Using cached matplotlib-3.8.0-cp311-cp311-win_amd64.whl.metadata (5.9 kB)
Using cached matplotlib-3.8.0-cp311-cp311-win_amd64.whl (7.6 MB)
Installing collected packages: matplotlib
Successfully installed matplotlib-3.8.0



[notice] A new release of pip is available: 23.2.1 -> 23.3.2
[notice] To update, run: python.exe -m pip install --upgrade pip


In [9]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import BisectingKMeans

## Performing Bisecting K-Means Cluster Analysis on Solar Farm Data

### Prepping the Data

As always, we start by loading our data into a `pandas` data frame. We can then examine the data frame to see what we want to analyze.

In [10]:
df = pd.read_csv('../data/solar.csv')
print(df)

          id        lat        long  irradiance           farm_type  capacity  \
0          0  25.896492  -97.460358    5.634079     large_community      5.00   
1          1  26.032654  -97.738098    5.616413       small_utility      5.00   
2          2  26.059063  -97.208252    5.746738     small_community      0.15   
3          3  26.078449  -98.073364    5.742196       small_utility      5.00   
4          4  26.143227  -98.311340    5.817187       small_utility      5.00   
...      ...        ...         ...         ...                 ...       ...   
11382  11382  48.977253 -113.406967    4.639617  medium_residential      0.01   
11383  11383  47.116753  -68.695343    4.393464  medium_residential      0.01   
11384  11384  47.163166  -68.642029    4.319452       small_utility      5.00   
11385  11385  48.486320 -122.074875    3.589022  medium_residential      0.01   
11386  11386  48.945011 -122.131317    3.713836    medium_community      2.00   

       capacity_factor  pow

Again, we want to draw conclusions about locational data and its relation to weather properties,
so, similarly to k-means, let's go with `lat`, `long`, and `irradiance` this time.

In [11]:
# Copy the lat, long, and irradiance columns so a label column can be added later
X = df.loc[:, df.columns[1:4]].copy()
print(X)

             lat        long  irradiance
0      25.896492  -97.460358    5.634079
1      26.032654  -97.738098    5.616413
2      26.059063  -97.208252    5.746738
3      26.078449  -98.073364    5.742196
4      26.143227  -98.311340    5.817187
...          ...         ...         ...
11382  48.977253 -113.406967    4.639617
11383  47.116753  -68.695343    4.393464
11384  47.163166  -68.642029    4.319452
11385  48.486320 -122.074875    3.589022
11386  48.945011 -122.131317    3.713836

[11387 rows x 3 columns]


### Clustering the Data

Now that the data is prepped and ready to go, we can begin clustering it.
Even though it is a different data set, let's go with the "optimal" cluster
number we got from the last notebook to start. This can always be changed later
to examine different outputs.

So, here we set `n_clusters` to 20 and simply create the model. This may look simpler
than the k-means process, but that is because the bisecting k-means constructor comes with
suitable default parameters. Some of these include:

- `init`: This defaults to `"random"`, which chooses the initial cluster centroids at random
    from the data. This differs from k-means, whose default is `"k-means++"`.
- `n_init`: This defaults to `1`, meaning that each bisection is run with only one set of
    initial centroids. It makes the assumption that one round of random choices is suitable.
- `algorithm`: This defaults to `"lloyd"`, the classical EM-style k-means algorithm
    used inside each bisection to split a cluster.

Although we do not set them here, this model offers a wider range of options.
It just so happens that the default values are different enough from k-means that there is no
need to change them.

With that said, let's make and fit the model.

In [12]:
n_clusters = 20

# Cluster on the first two columns only (lat and long)
bkmeans = BisectingKMeans(n_clusters=n_clusters)
bkmeans.fit(X[X.columns[0:2]])

As before, we can get a label for each sample with the `fit_predict` method, which refits the model and returns the cluster assignments.
These are stored in a new column of our isolated data set named `cluster_label`, which can be seen in a truncated view of the data.

In [13]:
X['cluster_label'] = bkmeans.fit_predict(X[X.columns[0:2]])
print(X.head(10))

         lat       long  irradiance  cluster_label
0  25.896492 -97.460358    5.634079              0
1  26.032654 -97.738098    5.616413              0
2  26.059063 -97.208252    5.746738              0
3  26.078449 -98.073364    5.742196              0
4  26.143227 -98.311340    5.817187              0
5  26.149040 -98.075409    5.701752              0
6  26.180355 -97.367737    5.720004              0
7  26.254963 -98.078491    5.730308              0
8  26.272160 -98.098694    5.734213              0
9  26.272625 -98.078979    5.755140              0


Now we can grab the centroids for our bisecting k-means model, and compute the labels
once more with `predict`, storing them in a variable `labels` used for plotting.

In [14]:
centers = bkmeans.cluster_centers_
labels = bkmeans.predict(X[X.columns[0:2]])

The visualization of this model is the same as with k-means. We start by plotting our isolated data on the bottom.
Each data point is assigned a color based on its `cluster_label`. We then plot the centroids on top, which
creates a complete plot that showcases our clusters with their centroids.

In [15]:
X.plot.scatter(x = 'long', y = 'lat', c=labels, s=50, cmap='viridis')
plt.scatter(centers[:, 1], centers[:, 0], c='black', s=200, alpha=0.5)
plt.show()

AttributeError: module 'matplotlib.cbook' has no attribute '_safe_first_finite'

<Figure size 640x480 with 0 Axes>

Similarly, we can make out the United States again. But notice the shape and placement
of the clusters. Even though it is different data, the hierarchy is clearly visible here,
which is very different from the seemingly random dispersion seen in the k-means
clustering. Like before, try experimenting with the number of clusters by changing the value
of `n_clusters` above and re-running the cells. Make note of any differences, as well as how
prevalent or ambiguous the hierarchy becomes.

### Saving the Clusters

Once you have finished experimenting, you can save the output of the most recent cluster labeling if desired.
Running the cell below writes the data to the `output` directory under the name `bisecting_k_means_output.csv`.

In [None]:
X.to_csv('../output/bisecting_k_means_output.csv', index=False, header=True)
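
If the `output` directory might not exist yet, it can be created before writing. The sketch below shows the idea with a temporary directory and a tiny made-up data frame (both assumptions), then reads the file back to check what was saved.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Tiny made-up stand-in for the labeled data
labeled = pd.DataFrame({
    "lat": [25.9, 48.9],
    "long": [-97.5, -122.1],
    "cluster_label": [0, 1],
})

# Create the output directory if it is missing, then write the CSV
out_dir = Path(tempfile.mkdtemp()) / "output"
out_dir.mkdir(parents=True, exist_ok=True)
out_path = out_dir / "bisecting_k_means_output.csv"
labeled.to_csv(out_path, index=False)

# Read the file back to confirm the columns and rows were written
restored = pd.read_csv(out_path)
print(restored)
```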

## Next Steps

There are no more notebooks to run, but the work here is far from over. To see where this project is heading, visit the `Future Work` section in the [README](../README.md).