# Cluster Analysis Using Bisecting K-Means Clustering

Bisecting k-means clustering, available in `scikit-learn`, is a variant of
k-means clustering that is typically more efficient because of its
bisecting nature. This process also changes how clusters and
centroids are selected, so the resulting analysis can differ noticeably
from k-means while remaining broadly similar.

The largest difference between the two is the way clusters are created.
With bisecting k-means, the model by default selects the cluster
with the largest inertia and splits it in two. This process of selecting and
splitting happens recursively until the desired number of clusters is
reached. K-means, by contrast, iteratively tries to separate all samples into
groups of equal variance. The recursive nature of bisecting k-means
commonly gives the output a hierarchy, which can differ drastically
from the output of k-means.
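
To make the select-and-split loop concrete, here is a minimal pure-Python sketch of the idea. It is not `scikit-learn`'s implementation: the helper names (`sse`, `mean`, `two_means`, `bisecting_kmeans`) and the toy points are made up for illustration, and it assumes all points are distinct.

```python
import random

def sse(points, center):
    """Sum of squared distances to a center (the inertia of one cluster)."""
    return sum((p[0] - center[0]) ** 2 + (p[1] - center[1]) ** 2 for p in points)

def mean(points):
    """Centroid of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def two_means(points, rng, n_iter=20):
    """Split one cluster in two with a few Lloyd iterations from random starts."""
    c0, c1 = rng.sample(points, 2)  # two distinct points as initial centers
    for _ in range(n_iter):
        left = [p for p in points if sse([p], c0) <= sse([p], c1)]
        right = [p for p in points if sse([p], c0) > sse([p], c1)]
        c0, c1 = mean(left), mean(right)
    return left, right

def bisecting_kmeans(points, n_clusters, seed=0):
    """Repeatedly split the cluster with the largest inertia."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < n_clusters:
        # Select the cluster with the biggest inertia, then bisect it.
        worst = max(range(len(clusters)),
                    key=lambda i: sse(clusters[i], mean(clusters[i])))
        clusters.extend(two_means(clusters.pop(worst), rng))
    return clusters

# Hypothetical toy data: three small, well-separated groups of points.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (20, 0), (20, 1), (21, 0), (21, 1)]
clusters = bisecting_kmeans(points, n_clusters=3)
for cluster in clusters:
    print(len(cluster), mean(cluster))
```

Because every new cluster comes from splitting an existing one, the result forms a tree of splits, which is where the hierarchy mentioned above comes from.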

The process to perform this cluster analysis is very similar, so
let's get started.

## Installing Dependencies and Importing Libraries

Again, we have to install and import some things to perform any analysis.

In [None]:
! pip install matplotlib=="3.8.0"

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import BisectingKMeans

## Performing Bisecting K-Means Cluster Analysis on Solar Farm Data

### Prepping the Data

As always, we start by loading our data into a `pandas` data frame. We can then examine the data frame to decide which columns we want to analyze.

In [None]:
df = pd.read_csv('../data/solar.csv')
print(df)

Again, we are trying to draw conclusions about locational data and its relation to weather properties,
so, similarly to k-means, let's go with `lat`, `long`, and `irradiance` this time.

In [None]:
# Copy the selected columns so that adding a column later does not touch df
X = df.loc[:, df.columns[1:4]].copy()
print(X)
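
The positional slice above relies on `lat`, `long`, and `irradiance` sitting in column positions 1 through 3. As a hedged alternative, the same columns can be selected by name, which keeps working if the column order changes. The sketch below uses a small made-up data frame; the `id` and `other` columns and all values are hypothetical, and only the three feature names come from this notebook.

```python
import pandas as pd

# Made-up rows for illustration only; the real values come from solar.csv.
toy = pd.DataFrame({
    "id": [1, 2, 3],                  # hypothetical column
    "lat": [34.0, 40.7, 47.6],
    "long": [-118.2, -74.0, -122.3],
    "irradiance": [5.6, 4.1, 3.4],
    "other": [0, 0, 0],               # hypothetical column
})

features = ["lat", "long", "irradiance"]

# Select by name and copy, so later column assignments do not affect `toy`.
X_by_name = toy[features].copy()

# For this toy layout, the positional slice picks the same columns.
X_by_position = toy.loc[:, toy.columns[1:4]]
print(X_by_name.equals(X_by_position))
```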

### Clustering the Data

Now that the data is prepped and ready to go, we can begin clustering it.
Even though it is a different data set, let's start with the "optimal" cluster
number we got from the last notebook. This can always be changed later
to examine different outputs.

So, here we set `n_clusters` to 20 and simply create the model. This may look simpler
than the k-means process, but the `BisectingKMeans` constructor comes with
sensible default parameters. Some of these include:

- `init`: This defaults to `"random"`, which picks the initial centroids at random from the data.
    This differs from `KMeans`, whose default is `"k-means++"`.
- `n_init`: This defaults to `1`, meaning that only one seed is tried in each bisection. It makes the
    assumption that one round of random choices is suitable.
- `algorithm`: This defaults to `"lloyd"`, the classical EM-style k-means algorithm
    that is run inside each bisection to split a cluster.

Although it cannot be seen in the call, the constructor accepts a wider range of parameters.
It just so happens that the default values are different enough from k-means that there is no
need to change them.

With that said, let's make and fit the model.

In [None]:
n_clusters = 20

bkmeans = BisectingKMeans(n_clusters=n_clusters)
# Fit on the first two columns of X only
bkmeans.fit(X[X.columns[0:2]])

As before, we can get the cluster assignments with the `fit_predict` method, which refits the model on the same two columns and returns a label for each row. The assigned cluster values are then stored
in a new column of our isolated data set named `cluster_label`. This can be seen in the first ten rows of our data printed below.

In [None]:
X['cluster_label'] = bkmeans.fit_predict(X[X.columns[0:2]])
print(X.head(10))

Now we can grab the centroids from our bisecting k-means model, and call `predict` on the same
two columns to get the cluster labels, which we store in a variable `labels` used for plotting.

In [None]:
centers = bkmeans.cluster_centers_
labels = bkmeans.predict(X[X.columns[0:2]])

The visualization of this model is the same as with k-means. We start by plotting our isolated data on the bottom.
Each data point is assigned a color based on its cluster label. We then plot the centroids on top, which
creates a complete plot that shows our clusters with their centroids.

In [None]:
X.plot.scatter(x = 'long', y = 'lat', c=labels, s=50, cmap='viridis')
plt.scatter(centers[:, 1], centers[:, 0], c='black', s=200, alpha=0.5)
plt.show()

As before, we can make out the United States. But notice the shape and placement
of the clusters. Even though it is different data, the hierarchy is clearly visible here,
which is very different from the seemingly random dispersion seen in the k-means
clustering. Like before, try experimenting with the number of clusters by changing the value
of `n_clusters` above and re-running the cells. Make note of any differences, as well as how
prevalent or ambiguous the hierarchy becomes.

### Saving the Clusters

Once you have finished experimenting, you can optionally save the output of the most recent cluster labeling.
Running the cell below writes the data to the `output` directory under the name `bisecting_k_means_output.csv`.

In [None]:
X.to_csv('../output/bisecting_k_means_output.csv', index=False, header=True)

## Next Steps

There are no more notebooks to run, but the work here is far from over. To see where this project is heading, visit the `Future Work` section in the [README](../README.md).