## Clustering Cryptocurrencies Using K-means

We use the KMeans algorithm from [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) to cluster the cryptocurrencies using the PCA data.  
 
We completed the following tasks:
- Create an elbow curve to find the best value for K, and use the pcs_df DataFrame.
- Define the best value for K, run the K-means algorithm to predict the K clusters for the cryptocurrencies’ data. Use the pcs_df to run the K-means algorithm.
- Create a new DataFrame named “clustered_df,” that includes the following columns: Algorithm, ProofType, TotalCoinsMined, TotalCoinSupply, PC 1, PC 2, PC 3, CoinName, and Class. Maintain the index of the crypto_df DataFrame.

K-means is an unsupervised learning algorithm that groups unlabeled data points into clusters.  
- **K** is the number of clusters. Each cluster is defined by the **mean** of all the points that belong to it.  
- The K-means algorithm groups the data into K clusters, assigning each point to the cluster whose centroid is nearest (scikit-learn's `KMeans` uses Euclidean distance).  
A **centroid** is the arithmetic mean position of all the points in a cluster. It does not have to coincide with an actual data point.  
- For two-dimensional data, the centroid is found by taking the mean of all the x values in a cluster and the mean of all the y values in a cluster. With more dimensions, such as PC 1, PC 2, and PC 3 here, take the mean of each coordinate.
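
The centroid definition above can be checked with a few lines of NumPy. This is a minimal sketch: the three points are made up for illustration and are not taken from the cryptocurrency data.

```python
import numpy as np

# Hypothetical 2-D points that we assume belong to one cluster
points = np.array([
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 0.0],
])

# The centroid is the column-wise mean: mean of the x values, mean of the y values
centroid = points.mean(axis=0)
print(centroid)
```

Note that the resulting centroid is not one of the three input points, which is the usual situation.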

In [6]:
# import our libraries
import pandas as pd
import plotly.express as px
import hvplot.pandas
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline 


In [7]:
# Loading data
file_path ="Resources/X.csv"
X = pd.read_csv(file_path)
X.head(10)

Unnamed: 0,PC 1,PC 2,PC 3
0,-0.253347,1.654305,-1.040682
1,-0.234521,1.655466,-1.040402
2,2.774328,1.724176,1.455785
3,-0.171948,-0.919563,0.749252
4,-0.143976,-1.832913,0.294265
5,-0.131969,0.295193,-1.526226
6,-0.335313,1.065929,1.819291
7,-0.174146,-1.49638,-0.336046
8,-0.142192,-1.83289,0.294289
9,-0.159256,-1.696858,-0.23073


### Initialize the K Starting Centroids

After data has been loaded, create an instance of the K-means algorithm and initialize it with the desired number of clusters (K).

In [8]:
# Initializing model with K = 3 as a starting guess (the best K is chosen later with the elbow curve)
model = KMeans(n_clusters=3, random_state=5)
model

KMeans(n_clusters=3, random_state=5)

### Data Points Assigned to Nearest Centroid

Once the model instance is created, our next step is to fit the model with the unlabeled data. Fitting should be familiar from supervised learning; however, you’ll notice that the data is not split into training and test sets. While the model is being fit, the K-means algorithm iteratively looks for the best centroid for each of the K clusters:

In [10]:
# Fitting model
model.fit(X)

KMeans(n_clusters=3, random_state=5)
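
To make the iterative search concrete, here is a rough sketch of the classic assign-and-update loop in plain NumPy. It is only an illustration under simplifying assumptions (random starting rows, a fixed number of iterations) and is not scikit-learn's actual implementation; the function name `kmeans_sketch` and the toy data are hypothetical.

```python
import numpy as np

def kmeans_sketch(data, k, n_iter=10, seed=0):
    """Illustrative assign/update loop; not scikit-learn's implementation."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows chosen at random (fancy indexing returns a copy)
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every row
        distances = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned rows
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Two well-separated hypothetical blobs
toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
labels, centroids = kmeans_sketch(toy, k=2)
print(labels)
print(centroids)
```

On data this cleanly separated, the loop settles after a couple of iterations; real data usually needs a convergence check and several restarts, which `KMeans` handles for you.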

### Group Data Points

After the model is fit, the corresponding cluster for every data point in the dataset can be found using the predict() method:

In [11]:
# Get predictions
predictions = model.predict(X)
print(predictions)

[2 2 0 0 0 2 0 0 0 0 0 0 0 0 2 0 0 2 2 2 0 2 0 2 2 2 0 0 2 2 2 2 2 0 2 0 0
 0 2 2 2 2 0 2 2 0 2 2 2 2 0 2 0 0 2 2 2 0 2 0 2 2 0 2 0 2 0 2 0 0 2 2 2 2
 2 2 0 0 0 0 0 0 0 0 0 0 0 2 2 2 2 0 2 0 0 0 0 2 0 0 0 0 2 0 0 2 2 2 0 2 2
 2 2 0 2 0 0 0 0 2 2 2 0 2 0 2 0 0 0 0 0 0 0 0 2 2 2 0 0 0 0 0 0 0 0 2 0 0
 0 2 0 0 0 2 2 0 0 2 0 0 2 0 2 0 2 0 0 0 0 0 0 0 2 0 2 2 2 0 0 0 0 0 2 0 0
 0 0 2 0 2 2 2 0 2 2 0 0 2 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 2 2 0 0 0 0 2 0
 2 2 0 0 2 2 2 2 0 0 2 0 0 2 0 0 2 0 2 0 2 2 2 0 0 0 2 2 0 2 0 0 0 0 2 0 0
 0 0 0 2 2 2 0 0 2 2 0 0 2 0 0 2 0 0 0 0 0 0 0 0 0 0 0 2 0 2 0 0 2 0 0 0 2
 0 0 0 0 0 0 2 0 2 0 2 2 2 2 2 2 2 0 0 0 0 0 0 0 2 0 0 0 2 0 0 0 0 2 0 2 2
 2 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 2 0 0 0 2 2 2 0 0 0 2 2 0 0 0 0 2 2 2 0
 2 0 0 0 0 0 2 2 0 0 2 0 0 2 0 0 0 0 0 0 2 2 0 0 0 0 0 0 0 0 2 0 0 0 2 0 0
 0 0 0 2 2 0 2 0 0 0 2 0 0 0 0 0 0 0 2 0 0 0 0 2 0 2 2 0 0 0 0 0 2 2 0 2 0
 0 0 2 0 2 0 2 0 0 0 0 0 0 2 0 0 0 2 0 0 2 2 2 0 0 0 0 0 0 0 0 2 0 0 0 0 0
 0 2 0 0 0 0 0 0 0 0 0 0 

**Important**  
The model assigns each data point one of three labels: 0, 1, or 2. These are not the means of the centroids, but rather just arbitrary label names. Giving the classes meaningful names is the job of a subject matter expert, or whoever performs the analysis, such as yourself. Note that the K-means algorithm does not discover how many clusters are in the data: it partitions the data into the K clusters you ask for and labels them with integers.
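
The difference between labels and centroids is easy to see on a fitted model. The sketch below uses a tiny made-up array rather than the notebook's data; `labels_` and `cluster_centers_` are attributes of scikit-learn's `KMeans`, and `n_init=10` is set explicitly only to keep behavior the same across library versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: two obvious groups of two points each
toy = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.2, 5.1]])
toy_model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy)

print(toy_model.labels_)           # integer cluster IDs, one per row
print(toy_model.cluster_centers_)  # centroid coordinates, one row per cluster
```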

After we have the class for each data point, we can add a new column to the DataFrame with the predicted classes:

In [12]:
#Add a new class column to X
X["class"] = model.labels_
X.head()

Unnamed: 0,PC 1,PC 2,PC 3,class
0,-0.253347,1.654305,-1.040682,2
1,-0.234521,1.655466,-1.040402,2
2,2.774328,1.724176,1.455785,0
3,-0.171948,-0.919563,0.749252,0
4,-0.143976,-1.832913,0.294265,0


## Elbow Curve

**Create an elbow curve to find the best value for K, and use the pcs_df DataFrame.**

An easy method for determining the best number for K is the elbow curve. Elbow curves get their names from their shape: they turn on a specific value, which looks a bit like an elbow!  

To create an elbow curve, we’ll plot the number of clusters on the x-axis and the values of a selected objective function on the y-axis.  

**Inertia** is one of the most common objective functions to use when creating an elbow curve. It is the sum of squared distances from each data point to the centroid of its assigned cluster, so it measures how much variation remains within the clusters. Inertia generally decreases as K grows; the elbow marks the point where adding more clusters stops reducing it by much.  

So, for our elbow curve, we’ll plot the number of clusters (also known as the values of K) on the x-axis and the inertia values on the y-axis.
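
As a sanity check on what inertia measures, the sketch below recomputes it by hand as the sum of squared distances from each row to its assigned centroid and compares the result with the model's `inertia_` attribute. The random data is an assumption for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: 60 random rows with 3 features
rng = np.random.default_rng(0)
toy = rng.normal(size=(60, 3))

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(toy)

# Centroid of the cluster each row was assigned to, row by row
assigned = km.cluster_centers_[km.labels_]
# Sum of squared distances from each row to its assigned centroid
manual_inertia = ((toy - assigned) ** 2).sum()
print(manual_inertia, km.inertia_)
```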

### Store Values of K to Plot

We’ll start with creating an empty list to hold inertia values. We’ll also store a range of K values we want to test. Enter the code in a new cell:

In [13]:
inertia = []
k = range(1, 11)

### Loop Through K Values and Find Inertia

Next, we’ll loop through each K value, find the inertia, and store it into our list. Enter the code in the next cell:

In [14]:
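# Looking for the best K
# Drop the "class" column added earlier so the old labels are not used as a feature
features = X.drop(columns=["class"], errors="ignore")
for i in k:
    km = KMeans(n_clusters=i, random_state=0)
    km.fit(features)
    inertia.append(km.inertia_)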

### Create a DataFrame and Plot the Elbow Curve

We’ll create a DataFrame that stores our K values and their corresponding inertia values. This will allow for an easy plot of the results with hvplot.

In [15]:
# Define a DataFrame to plot the Elbow Curve using hvPlot
elbow_data = {"k": k, "inertia": inertia}

In [16]:
df_elbow = pd.DataFrame(elbow_data)

In [17]:
df_elbow.hvplot.line(x="k", y="inertia", title="Elbow Curve", xticks=k)

### Use the Elbow Curve to Determine the Best K Value

**Note** the shape of the curve on the following graph. At point 0 (top left), the line starts as a steep vertical slope that breaks at point 2, shifts to a slightly horizontal slope, breaks again at point 4, then shifts to a strong horizontal line that reaches to point 10. The angle at point 3 looks like an elbow, which gives this type of curve its name.
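
Reading the elbow off a plot is subjective. One rough numeric aid, offered here as an assumption rather than as part of this notebook's method, is to look for the K at which the drop in inertia shrinks the most. The inertia values below are hypothetical and are not results from this dataset.

```python
import numpy as np

# Hypothetical inertia values for K = 1..8 (not taken from this notebook)
k_values = np.arange(1, 9)
inertia_values = np.array([1000.0, 500.0, 200.0, 150.0, 120.0, 100.0, 90.0, 85.0])

# Reduction in inertia gained by each additional cluster (for K = 2..8)
drops = -np.diff(inertia_values)
# How much that reduction shrinks at each interior K (for K = 2..7)
bend = drops[:-1] - drops[1:]
# K with the sharpest bend in the curve
elbow_k = k_values[1:-1][bend.argmax()]
print(elbow_k)
```

A heuristic like this only suggests a candidate; the plot and domain knowledge should still drive the final choice.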

### Example of How to Use the Elbow Curve

To create the elbow curve, remember there are two values we need: a list of K values and a list of inertia values. Recall that inertia is the objective function to plot K values against. We will loop through 10 values for K and determine the inertia:


In [18]:
inertia = []
k = range(1, 11)
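# Calculate the inertia for the range of K values,
# excluding the "class" column so earlier labels are not used as a feature
features = X.drop(columns=["class"], errors="ignore")
for i in k:
    km = KMeans(n_clusters=i, random_state=0)
    km.fit(features)
    inertia.append(km.inertia_)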

Next, let’s create a plot for the elbow curve:

In [19]:
# Create the Elbow Curve using hvPlot
elbow_data = {"k": k, "inertia": inertia}
df_elbow = pd.DataFrame(elbow_data)
df_elbow.hvplot.line(x="k", y="inertia", xticks=k, title="Elbow Curve")

**Define the best value for K, run the K-means algorithm to predict the K clusters for the cryptocurrencies’ data. Use the pcs_df to run the K-means algorithm.**

Before comparing candidate values of K, let’s wrap the K-means steps in a function so we can reuse them. As you may recall, functions save time because we don’t need to write the code contained in the function more than once:

In [20]:
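def get_clusters(k, data):
    # Drop any "class" column left over from an earlier run so it is not used as a feature
    features = data.drop(columns=["class"], errors="ignore")
    # Initialize the K-means model
    model = KMeans(n_clusters=k, random_state=0)
    # Fit the model and predict the cluster of every row
    predictions = model.fit_predict(features)
    # Create return DataFrame with predicted clusters (the input is left unmodified)
    clustered = features.copy()
    clustered["class"] = predictions

    return clustered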

In [35]:
# We can now run the function for K = 5 (the variable keeps its earlier name, four_clusters):
four_clusters = get_clusters(5, X)
four_clusters.head()

Unnamed: 0,PC 1,PC 2,PC 3,class
0,-0.253347,1.654305,-1.040682,3
1,-0.234521,1.655466,-1.040402,3
2,2.774328,1.724176,1.455785,0
3,-0.171948,-0.919563,0.749252,1
4,-0.143976,-1.832913,0.294265,1


In [36]:
# Plotting the 2D-Scatter with x="PC 1" and y="PC 2"
# (four_clusters only holds the principal components and the class label)
four_clusters.hvplot.scatter(x="PC 1", y="PC 2", by="class")

In [37]:
# Plotting the 3D-Scatter with x="PC 1", y="PC 2" and z="PC 3"
# (four_clusters only holds the principal components and the class label)
fig = px.scatter_3d(
    four_clusters,
    x="PC 1",
    y="PC 2",
    z="PC 3",
    color="class",
    symbol="class",
    width=800,
)
fig.update_layout(legend=dict(x=0, y=1))
fig.show()

Recalling the trial-and-error method, both graphs displayed multiple clusters. We’re still applying some trial and error here, but the elbow curve helps narrow down the number of clusters.  

Now, the important question: do we use five or six groups? This depends on what insights you can take away from the data. One might conclude that six groups would be most useful because they could be broken down like so:  
- Cluster 0: medium mined, low supply
- Cluster 1: low mined, low supply
- Cluster 2: high mined, low supply
- Cluster 3: low mined, high supply
- Cluster 4: medium mined, high supply
- Cluster 5: very high mined, high supply  

If we choose five groups, the coins would be partitioned differently, and the resulting groups may not fit what you’re looking for, which is grouping types of coins. Remember, unsupervised learning can help us make decisions about the data up to a point; then it is up to you, the expert, to make the final call.
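
One way to support descriptions like "low mined, high supply" is to summarize each cluster with a group-by. The sketch below uses a small made-up DataFrame; the column names mirror the ones in this notebook, but the values and the two-cluster split are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows; real values would come from the clustered data
toy = pd.DataFrame({
    "TotalCoinsMined": [10.0, 20.0, 1000.0, 1200.0, 15.0],
    "TotalCoinSupply": [100.0, 120.0, 5000.0, 5200.0, 90.0],
    "class": [0, 0, 1, 1, 0],
})

# Per-cluster size and average mined/supply figures
summary = toy.groupby("class").agg(
    coins=("TotalCoinsMined", "count"),
    mean_mined=("TotalCoinsMined", "mean"),
    mean_supply=("TotalCoinSupply", "mean"),
)
print(summary)
```

Comparing such summaries for K = 5 and K = 6 shows whether the extra cluster isolates a group that is meaningful for your question.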

**Create a new DataFrame named “clustered_df,” that includes the following columns: Algorithm, ProofType, TotalCoinsMined, TotalCoinSupply, PC 1, PC 2, PC 3, CoinName, and Class. Maintain the index of the crypto_df DataFrame.**

In [27]:
# Loading data
file_path ="Resources/coins_name.csv"
coins_name = pd.read_csv(file_path)

In [28]:
# Loading data
file_path ="Resources/X_pca.csv"
X_pca = pd.read_csv(file_path)

In [29]:
df_y = pd.DataFrame(data=X_pca, columns=["PC 1", "PC 2", "PC 3"])
df = coins_name.join(df_y, how='inner')
df.head()

Unnamed: 0,Algorithm,ProofType,TotalCoinsMined,TotalCoinSupply,PC 1,PC 2,PC 3
0,Scrypt,0,41.99995,42.0,-0.253347,1.654305,-1.040682
1,Scrypt,0,1055185000.0,532000000.0,-0.234521,1.655466,-1.040402
2,X13,0,29279420000.0,314159300000.0,2.774328,1.724176,1.455785
3,SHA-256,2,17927180.0,21000000.0,-0.171948,-0.919563,0.749252
4,Ethash,2,107684200.0,0.0,-0.143976,-1.832913,0.294265


In [30]:
# Loading data
file_path ="Resources/crypto_data.csv"
crypto_df = pd.read_csv(file_path)

In [31]:
df_y = pd.DataFrame(data=crypto_df, columns=['CoinName'])
df2 = df.join(df_y, how='inner')
df2.head()

Unnamed: 0,Algorithm,ProofType,TotalCoinsMined,TotalCoinSupply,PC 1,PC 2,PC 3,CoinName
0,Scrypt,0,41.99995,42.0,-0.253347,1.654305,-1.040682,42 Coin
1,Scrypt,0,1055185000.0,532000000.0,-0.234521,1.655466,-1.040402,365Coin
2,X13,0,29279420000.0,314159300000.0,2.774328,1.724176,1.455785,404Coin
3,SHA-256,2,17927180.0,21000000.0,-0.171948,-0.919563,0.749252,SixEleven
4,Ethash,2,107684200.0,0.0,-0.143976,-1.832913,0.294265,808


In [32]:
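# six_clusters is not defined in the cells shown; we assume it comes from get_clusters with K = 6
six_clusters = get_clusters(6, X)
df_y = pd.DataFrame(data=six_clusters, columns=['class'])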
df3 = df2.join(df_y, how='inner')
df3.head()

Unnamed: 0,Algorithm,ProofType,TotalCoinsMined,TotalCoinSupply,PC 1,PC 2,PC 3,CoinName,class
0,Scrypt,0,41.99995,42.0,-0.253347,1.654305,-1.040682,42 Coin,5
1,Scrypt,0,1055185000.0,532000000.0,-0.234521,1.655466,-1.040402,365Coin,5
2,X13,0,29279420000.0,314159300000.0,2.774328,1.724176,1.455785,404Coin,0
3,SHA-256,2,17927180.0,21000000.0,-0.171948,-0.919563,0.749252,SixEleven,4
4,Ethash,2,107684200.0,0.0,-0.143976,-1.832913,0.294265,808,3


In [33]:
df_y = pd.DataFrame(data=crypto_df, columns=['Unnamed: 0'])
clustered_df = df3.join(df_y, how='inner')
clustered_df.head()

Unnamed: 0.1,Algorithm,ProofType,TotalCoinsMined,TotalCoinSupply,PC 1,PC 2,PC 3,CoinName,class,Unnamed: 0
0,Scrypt,0,41.99995,42.0,-0.253347,1.654305,-1.040682,42 Coin,5,42
1,Scrypt,0,1055185000.0,532000000.0,-0.234521,1.655466,-1.040402,365Coin,5,365
2,X13,0,29279420000.0,314159300000.0,2.774328,1.724176,1.455785,404Coin,0,404
3,SHA-256,2,17927180.0,21000000.0,-0.171948,-0.919563,0.749252,SixEleven,4,611
4,Ethash,2,107684200.0,0.0,-0.143976,-1.832913,0.294265,808,3,808


In [34]:
clustered_df = clustered_df.set_index(["Unnamed: 0"])
clustered_df.head()

Unnamed: 0,Algorithm,ProofType,TotalCoinsMined,TotalCoinSupply,PC 1,PC 2,PC 3,CoinName,class
42,Scrypt,0,41.99995,42.0,-0.253347,1.654305,-1.040682,42 Coin,5
365,Scrypt,0,1055185000.0,532000000.0,-0.234521,1.655466,-1.040402,365Coin,5
404,X13,0,29279420000.0,314159300000.0,2.774328,1.724176,1.455785,404Coin,0
611,SHA-256,2,17927180.0,21000000.0,-0.171948,-0.919563,0.749252,SixEleven,4
808,Ethash,2,107684200.0,0.0,-0.143976,-1.832913,0.294265,808,3
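
The chain of joins above can also be written as a single column-wise concatenation when every piece shares the same index. The sketch below uses small hypothetical stand-ins (`coin_info`, `pcs`, `classes`) rather than the notebook's files, so treat it as an alternative to consider, not as the method used here.

```python
import pandas as pd

# Hypothetical stand-ins for the frames loaded above; all share the same index
idx = pd.Index(["42", "365", "404"], name="coin_id")
coin_info = pd.DataFrame(
    {"Algorithm": ["Scrypt", "Scrypt", "X13"],
     "CoinName": ["42 Coin", "365Coin", "404Coin"]},
    index=idx,
)
pcs = pd.DataFrame(
    {"PC 1": [-0.25, -0.23, 2.77], "PC 2": [1.65, 1.66, 1.72]},
    index=idx,
)
classes = pd.Series([5, 5, 0], index=idx, name="Class")

# Concatenate column-wise; rows are aligned on the shared index
toy_clustered = pd.concat([coin_info, pcs, classes], axis=1)
print(toy_clustered)
```

Because alignment happens on the index, this only works as intended when the index values really identify the same coin in every frame.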
