# Cluster Analysis Using K-Means Clustering

Clustering, in simple terms, is the act of taking similar data points and
clumping them together to treat them as one. For example, say we had a data
set that looked like this:
```
[apple, orange, corn, carrot]
```
Clustering this data set might intuitively result in something that looks like this:
```
{fruit: [apple, orange], vegetable: [corn, carrot]}
```
As can be seen, similar enough data points are grouped together and treated
as one thing. In this example apples and oranges are simply seen as fruits,
and corn and carrots are seen as vegetables. These larger generalizations
are considered `centroids`, which are averages of the data points within a
cluster, representing all values within the cluster.
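
To make the idea of a centroid concrete, here is a minimal sketch using `numpy`. The points are made-up values chosen for illustration and are not taken from the data set used in this notebook.

```python
import numpy as np

# Hypothetical 2-D points belonging to a single cluster (made-up values).
points = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# The centroid is the per-feature mean of the points in the cluster.
centroid = points.mean(axis=0)
print(centroid)  # [3. 4.]
```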

This type of assignment can be very powerful for making predictions, especially
for weather-related factors such as wind speed or solar irradiance, because
areas close to one another tend to experience similar weather. Clustering is
therefore a good fit for this use case, as the data is both locational and
weather-based.

In this notebook, we will use K-Means clustering to try to identify clusters
of locations that experience similar wind speeds. The K-Means clustering
algorithm from `scikit-learn` seeks to pick centroids that minimise `inertia`,
the within-cluster sum of squared distances from each point to its centroid.
Lower inertia means the clusters are more internally coherent.
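
As a rough illustration of what inertia measures, the sketch below recomputes it by hand and compares it with the `inertia_` attribute of a fitted model. The data is synthetic and the variable names (`data`, `manual_inertia`) are placeholders for this example only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: two well-separated blobs (illustrative values only).
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)

# Inertia: sum of squared distances from each point to its assigned centroid.
assigned_centroids = model.cluster_centers_[model.labels_]
manual_inertia = ((data - assigned_centroids) ** 2).sum()

print(manual_inertia, model.inertia_)
```

The two printed values should agree up to floating-point error.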

## Installing Dependencies and Importing Libraries

Similar to before, we have to install and import some things to get started.

In [1]:
! pip install matplotlib=="3.8.0"

Collecting matplotlib==3.8.0
  Obtaining dependency information for matplotlib==3.8.0 from https://files.pythonhosted.org/packages/40/d9/c1784db9db0d484c8e5deeafbaac0d6ed66e165c6eb4a74fb43a5fa947d9/matplotlib-3.8.0-cp311-cp311-win_amd64.whl.metadata
  Using cached matplotlib-3.8.0-cp311-cp311-win_amd64.whl.metadata (5.9 kB)
Using cached matplotlib-3.8.0-cp311-cp311-win_amd64.whl (7.6 MB)
Installing collected packages: matplotlib
  Attempting uninstall: matplotlib
    Found existing installation: matplotlib 3.6.0
    Uninstalling matplotlib-3.6.0:
      Successfully uninstalled matplotlib-3.6.0
Successfully installed matplotlib-3.8.0





In [2]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

## Performing K-Means Cluster Analysis on Wind Farm Data

### Prepping the Data

To start, we once again must load the data into a `pandas` data frame.
Looking at the data frame allows us to see which columns we want to
select to perform cluster analysis.

In [3]:
df = pd.read_csv('../data/wind.csv')
print(df)

            id        lat        long  wind_speed farm_type  capacity  \
0            0  23.510410 -117.147260        6.07  offshore        16   
1            1  24.007446  -93.946777        7.43  offshore        16   
2            2  25.069138  -97.482483        8.19  offshore        16   
3            3  25.069443  -97.463135        8.19  offshore        16   
4            4  25.069763  -97.443756        8.19  offshore        16   
...        ...        ...         ...         ...       ...       ...   
126687  126687  22.871800  -79.605350        7.50  offshore        16   
126688  126688  20.601960  -81.438600        6.67  offshore        16   
126689  126689  23.735790  -76.708770        7.22  offshore        16   
126690  126690  22.583120  -79.004090        7.61  offshore        16   
126691  126691  23.448720  -77.410400        7.31  offshore        16   

        capacity_factor  power_generation  estimated_cost  
0                 0.169          23687.04        20800000  
1  

After looking at the data frame and considering our goal, we want to use
`lat`, `long`, and one other factor that may be affected by location. Let's
go with `wind_speed`. We use the `pandas` `loc` indexer to isolate the
columns that we want to work with from our data set.

In [4]:
X = df.loc[:, df.columns[1:4]]
print(X)

              lat        long  wind_speed
0       23.510410 -117.147260        6.07
1       24.007446  -93.946777        7.43
2       25.069138  -97.482483        8.19
3       25.069443  -97.463135        8.19
4       25.069763  -97.443756        8.19
...           ...         ...         ...
126687  22.871800  -79.605350        7.50
126688  20.601960  -81.438600        6.67
126689  23.735790  -76.708770        7.22
126690  22.583120  -79.004090        7.61
126691  23.448720  -77.410400        7.31

[126692 rows x 3 columns]


### Finding an "Optimal" Number of Clusters

When performing K-Means clustering, and some other forms of clustering,
the number of clusters that the data will be split into must be chosen
manually by the user. This raises a looming question: how many clusters
are sufficient? One way to figure this out is to create a series of models,
each with a different number of clusters, and score how well each one fits.
To start, we define the range of cluster counts to try and then create one
model for each value in that range.

In [5]:
K_clusters = range(1,100)

kmeans = [KMeans(n_clusters=i, n_init='auto') for i in K_clusters]

Next, we isolate the latitude and longitude columns from our data set
for use in scoring.

In [6]:
Y_axis = X[['lat']]
X_axis = X[['long']]

Every model is fit to the data and then scored. The `score` method returns the
negative of the K-Means objective (the inertia), so scores are never positive.
A large negative value means the points lie far from their centroids, while a
value closer to zero means the clusters fit the data more tightly.

In [None]:
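# Fit each model and record its score (the negative inertia).
# Note: only the latitude column (Y_axis) is used here; X_axis is not used.
score = [model.fit(Y_axis).score(Y_axis) for model in kmeans]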
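
To illustrate how the score relates to inertia, the following sketch fits a few models on synthetic one-dimensional data and prints both quantities side by side. The data and the names `data` and `results` are assumptions made for this example and are unrelated to `wind.csv`.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 1-D data, standing in for a single column such as latitude.
rng = np.random.default_rng(1)
data = rng.uniform(0.0, 10.0, size=(200, 1))

results = []
for k in (1, 2, 4, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    # score() on the training data is the negative of the fitted inertia.
    results.append((k, model.score(data), -model.inertia_))

for k, score_value, negative_inertia in results:
    print(k, score_value, negative_inertia)
```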

Once we have this list of scores, we can plot them to visualize the results.
The curve levels off, indicating diminishing returns: adding more clusters
typically improves the score, but by less and less each time. The sweet spot is
the "elbow" where the curve begins to flatten. From the graph, it seems that
somewhere around 15-20 clusters gives a high score while keeping the cluster
count low. As we will see, this method does not always give the correct number
of clusters to use, as the choice can be subjective and depends on the data set.

In [8]:
plt.plot(K_clusters, score)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Curve')
plt.show()

AttributeError: module 'matplotlib.cbook' has no attribute '_safe_first_finite'

<Figure size 640x480 with 0 Axes>

### Clustering the Data

With the number of clusters now obtained, we can begin clustering our data.
To start, we will use the "optimal" 20 clusters, but we can always come back
and change the `n_clusters` value to group our data into a different number
of clusters. The `init` value dictates the method for initialization.
Here we use `k-means++`, which selects the initial centroids greedily, one at a
time, by sampling points according to their contribution to `inertia`. `n_init`
dictates how many times the algorithm is run with different centroid seeds,
meaning different choices for initial centroids. With `n_init='auto'`, recent
versions of `scikit-learn` pick this number based on `init` (a single run for
`k-means++`).
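
The effect of `init` and `n_init` can be explored with a small sketch like the one below. It uses synthetic data, and the names `pp_model` and `random_model` are placeholders for this example. The inertia values it prints will vary with the data and the random seed.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D data (illustrative only).
rng = np.random.default_rng(2)
data = rng.normal(size=(300, 2))

# A single run whose starting centroids are chosen by k-means++.
pp_model = KMeans(n_clusters=5, init='k-means++', n_init=1, random_state=0).fit(data)

# Ten runs from random starting centroids; the run with the lowest inertia is kept.
random_model = KMeans(n_clusters=5, init='random', n_init=10, random_state=0).fit(data)

print(pp_model.inertia_, random_model.inertia_)
```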

We start by creating a model and fitting our data to said model.

In [None]:
n_clusters = 20

kmeans = KMeans(n_clusters, init='k-means++', n_init='auto')
kmeans.fit(X[X.columns[0:2]])

We can then perform predictions with this model as well. This is
done using the `fit_predict` method, which does the same fitting
as above, but also assigns each data point the label of its cluster.
We store these labels in a new column in our isolated data set
named `cluster_label`, which can be seen below.

In [None]:
X['cluster_label'] = kmeans.fit_predict(X[X.columns[0:2]])
print(X.head(10))

Now we can get the cluster centers, which are actually the centroids
mentioned previously. These are the representative averages of each cluster.
We also store the labels assigned above so that they can be used for
plotting.

In [None]:
centers = kmeans.cluster_centers_
labels = kmeans.predict(X[X.columns[0:2]])
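
To check the claim that each centroid is the average of its cluster, the sketch below compares `cluster_centers_` with per-cluster means computed via `groupby`. It assumes synthetic, well-separated data so that the algorithm fully converges; the frame and its values are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Three well-separated synthetic blobs with made-up coordinates.
rng = np.random.default_rng(3)
frame = pd.DataFrame(
    np.vstack([rng.normal(loc=c, scale=0.3, size=(40, 2)) for c in (0.0, 5.0, 10.0)]),
    columns=['lat', 'long'],
)

model = KMeans(n_clusters=3, n_init=10, random_state=0)
frame['cluster_label'] = model.fit_predict(frame[['lat', 'long']])

# Average the points in each cluster; rows are ordered by label 0, 1, 2.
group_means = frame.groupby('cluster_label')[['lat', 'long']].mean()
centers_match = np.allclose(group_means.to_numpy(), model.cluster_centers_)
print(centers_match)
```

If the algorithm stops before fully converging, the centroids may differ slightly from these per-cluster means.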

Visualizing this model is quite intuitive. First the data points from our isolated
data set are plotted, with the `cluster_label` giving each data point within a cluster
its own color. Then, the centroids are plotted on top of this map of clusters, creating
a comprehensive plot that showcases all clusters.

In [None]:
X.plot.scatter(x = 'long', y = 'lat', c=labels, s=50, cmap='viridis')
plt.scatter(centers[:, 1], centers[:, 0], c='black', s=200, alpha=0.5)
plt.show()

Looking at the plot, we can see that it roughly traces the shape of the United States, which is
a good sign. But returning to the question: how many clusters are enough?
Does the "optimal" number of 20 clusters seem to make sense here? There are arguments for both yes
and no. Either way, this is where clustering becomes a bit tedious, as the only way to find out is
to change the number of clusters used. So scroll back up, experiment with new values of `n_clusters`,
and see what you come up with.

### Saving the Clusters

Once you have finished experimenting, you can save the output of the most recent cluster labeling.
Running the cell below writes the data to the `output` directory as `k_means_output.csv`.

In [None]:
# Write the labelled data without the row index, keeping the column header.
X.to_csv('../output/k_means_output.csv', index=False, header=True)
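
`to_csv` does not create missing directories, so the save can fail if `output` does not exist yet. The sketch below shows one way to guard against that. It writes to a temporary directory so it can be run anywhere, and the small `result` frame is made-up data standing in for `X`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up stand-in for the labelled data set.
result = pd.DataFrame({
    'lat': [23.5, 24.0],
    'long': [-117.1, -93.9],
    'cluster_label': [0, 1],
})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / 'output'
    # Create the directory if it is missing; do nothing if it already exists.
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / 'k_means_output.csv'
    result.to_csv(out_path, index=False)
    # Read the file back to confirm that it was written as expected.
    reloaded = pd.read_csv(out_path)

print(reloaded)
```

In the notebook itself, the same `mkdir` call could be applied to `../output` before saving.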

## Next Steps

The last notebook to look at examines the `solar.csv` data. In [bisecting_k_means.ipynb](./bisecting_k_means.ipynb)
a similar cluster analysis is performed, but this time using different data, as well as a different method.