# CryptoClustering Analysis
The purpose of this project is to use Python and unsupervised learning to group cryptocurrencies by how their prices changed over time frames ranging from 24 hours to one year. By comparing the K-means clustering results on the scaled original data with those on the PCA-reduced data, we can see how dimensionality reduction affects the clustering and whether fewer features lead to similar or improved results for the cryptocurrency data.

## Import required libraries and dependencies

Import the necessary Python libraries and dependencies for the analysis, including pandas, hvplot, KMeans, PCA, and StandardScaler.

In [171]:
# Import required libraries and dependencies
import pandas as pd
import hvplot.pandas
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

## Load the data into a Pandas DataFrame

Load the cryptocurrency market data from `Resources/crypto_market_data.csv` into a Pandas DataFrame named `df_market_data`. Set the "coin_id" column as the index.

In [172]:
# Load the data into a Pandas DataFrame
df_market_data = pd.read_csv(
    "Resources/crypto_market_data.csv",
    index_col="coin_id")

## Display sample data

Display the first 10 rows of the loaded DataFrame `df_market_data` to get an overview of the dataset.

In [173]:
# Display sample data
df_market_data.head(10)

| coin_id | price_change_percentage_24h | price_change_percentage_7d | price_change_percentage_14d | price_change_percentage_30d | price_change_percentage_60d | price_change_percentage_200d | price_change_percentage_1y |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bitcoin | 1.08388 | 7.60278 | 6.57509 | 7.67258 | -3.25185 | 83.5184 | 37.51761 |
| ethereum | 0.22392 | 10.38134 | 4.80849 | 0.13169 | -12.8889 | 186.77418 | 101.96023 |
| tether | -0.21173 | 0.04935 | 0.0064 | -0.04237 | 0.28037 | -0.00542 | 0.01954 |
| ripple | -0.37819 | -0.60926 | 2.24984 | 0.23455 | -17.55245 | 39.53888 | -16.60193 |
| bitcoin-cash | 2.90585 | 17.09717 | 14.75334 | 15.74903 | -13.71793 | 21.66042 | 14.49384 |
| binancecoin | 2.10423 | 12.85511 | 6.80688 | 0.05865 | 36.33486 | 155.61937 | 69.69195 |
| chainlink | -0.23935 | 20.69459 | 9.30098 | -11.21747 | -43.69522 | 403.22917 | 325.13186 |
| cardano | 0.00322 | 13.99302 | 5.55476 | 10.10553 | -22.84776 | 264.51418 | 156.09756 |
| litecoin | -0.06341 | 6.60221 | 7.28931 | 1.21662 | -17.2396 | 27.49919 | -12.66408 |
| bitcoin-cash-sv | 0.9253 | 3.29641 | -1.86656 | 2.88926 | -24.87434 | 7.42562 | 93.73082 |


## Generate summary statistics

Compute summary statistics for the dataset using the `describe()` method. This provides the count, mean, standard deviation, minimum, quartiles, and maximum for each column in the DataFrame.

In [174]:
# Generate summary statistics
df_market_data.describe()

|  | price_change_percentage_24h | price_change_percentage_7d | price_change_percentage_14d | price_change_percentage_30d | price_change_percentage_60d | price_change_percentage_200d | price_change_percentage_1y |
| --- | --- | --- | --- | --- | --- | --- | --- |
| count | 41.0 | 41.0 | 41.0 | 41.0 | 41.0 | 41.0 | 41.0 |
| mean | -0.269686 | 4.497147 | 0.185787 | 1.545693 | -0.094119 | 236.537432 | 347.667956 |
| std | 2.694793 | 6.375218 | 8.376939 | 26.344218 | 47.365803 | 435.225304 | 1247.842884 |
| min | -13.52786 | -6.09456 | -18.1589 | -34.70548 | -44.82248 | -0.3921 | -17.56753 |
| 25% | -0.60897 | 0.04726 | -5.02662 | -10.43847 | -25.90799 | 21.66042 | 0.40617 |
| 50% | -0.06341 | 3.29641 | 0.10974 | -0.04237 | -7.54455 | 83.9052 | 69.69195 |
| 75% | 0.61209 | 7.60278 | 5.51074 | 4.57813 | 0.65726 | 216.17761 | 168.37251 |
| max | 4.84033 | 20.69459 | 24.23919 | 140.7957 | 223.06437 | 2227.92782 | 7852.0897 |


## Plot your data to see what's in your DataFrame

Create a line plot (`hvplot.line`) to visualize the trends in the cryptocurrency market data. The x-axis represents the cryptocurrency names ("coin_id"), and the y-axis shows the percentage change in prices over different time frames (24h, 7d, 14d, 30d, 60d, 200d, 1y).

In [175]:
# Plot your data to see what's in your DataFrame
df_market_data.hvplot.line(
    width=800,
    height=400,
    rot=90
)

---

## Prepare the data

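Standardize the data with scikit-learn's `StandardScaler` class, which rescales each feature to zero mean and unit variance. Without this step, the 200-day and 1-year columns, whose values reach into the thousands, would dominate the distance calculations during clustering.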

In [176]:
# Use scikit-learn's `StandardScaler` to standardize the data from the CSV file
# Scale the seven price-change percentage columns
market_data_scaled = StandardScaler().fit_transform(
    df_market_data[["price_change_percentage_24h", "price_change_percentage_7d", "price_change_percentage_14d", "price_change_percentage_30d", "price_change_percentage_60d", "price_change_percentage_200d", "price_change_percentage_1y"]]
)
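
To make the scaling step concrete, here is a minimal sketch of the z-score that `StandardScaler` applies to each column, written with the standard library only. The five values are the first 24-hour changes from the sample rows; the helper name `standardize` is an illustrative choice, not part of scikit-learn.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mean = fmean(values)
    std = pstdev(values)  # population std (ddof=0), the convention StandardScaler uses
    return [(value - mean) / std for value in values]


# First five 24h price changes from the sample rows (bitcoin ... bitcoin-cash)
changes_24h = [1.08388, 0.22392, -0.21173, -0.37819, 2.90585]
scaled = standardize(changes_24h)

# After scaling, the column is centred on 0 with a spread of 1
print(round(fmean(scaled), 6), round(pstdev(scaled), 6))
```

Because the mean and standard deviation here come from five rows rather than all 41, these z-scores differ from the values in `df_market_data_scaled`.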

## Create a DataFrame with the scaled data

Create a new DataFrame `df_market_data_scaled` that contains the scaled market data, indexed by `coin_id`. This DataFrame will be used for clustering.

In [177]:
# Create a DataFrame with the scaled data
df_market_data_scaled = pd.DataFrame(
    market_data_scaled,
    columns=["price_change_percentage_24h", "price_change_percentage_7d", "price_change_percentage_14d", "price_change_percentage_30d", "price_change_percentage_60d", "price_change_percentage_200d", "price_change_percentage_1y"]
)
# Copy the crypto names from the original data
df_market_data_scaled["coin_id"] = df_market_data.index

# Set the coin_id column as index
df_market_data_scaled = df_market_data_scaled.set_index("coin_id")

# Display sample data
df_market_data_scaled.head()

| coin_id | price_change_percentage_24h | price_change_percentage_7d | price_change_percentage_14d | price_change_percentage_30d | price_change_percentage_60d | price_change_percentage_200d | price_change_percentage_1y |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bitcoin | 0.508529 | 0.493193 | 0.7722 | 0.23546 | -0.067495 | -0.355953 | -0.251637 |
| ethereum | 0.185446 | 0.934445 | 0.558692 | -0.054341 | -0.273483 | -0.115759 | -0.199352 |
| tether | 0.021774 | -0.706337 | -0.02168 | -0.06103 | 0.008005 | -0.550247 | -0.282061 |
| ripple | -0.040764 | -0.810928 | 0.249458 | -0.050388 | -0.373164 | -0.458259 | -0.295546 |
| bitcoin-cash | 1.193036 | 2.000959 | 1.76061 | 0.545842 | -0.291203 | -0.499848 | -0.270317 |


---

## Find the Best Value for k Using the Original Data

Perform K-means clustering on the scaled original data with different values of `k` (number of clusters) ranging from 1 to 11. Compute the inertia (within-cluster sum of squared distances) for each `k` value and store the results in a list. The Elbow Curve plot (`elbow_curve_1`) is generated to help identify the optimal `k` value.

In [178]:
# Create a list with the number of k-values from 1 to 11
list_k = list(range(1, 12))
list_k

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

## Calculate the inertia
Inertia measures how tightly the data points in each cluster are packed around the cluster's centroid: it is the sum of squared distances from every point to its assigned centroid, so lower values mean more compact clusters. The code calculates the inertia for each number of clusters (`k`) using the KMeans algorithm, storing the results in `list_inertia`.
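
As a rough illustration of that definition, the sketch below computes inertia by hand for a small, hypothetical set of points. The function name `inertia` and the toy data are assumptions made for this example; scikit-learn exposes the same quantity as `inertia_` on a fitted model.

```python
def inertia(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        centroid = centroids[label]
        total += sum((p - c) ** 2 for p, c in zip(point, centroid))
    return total


# Hypothetical toy data: two tight pairs of points, ten units apart
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# k = 1: every point is measured against the overall mean
one_cluster = inertia(points, [0, 0, 0, 0], [(5.0, 1.0)])

# k = 2: each pair gets its own centroid, so the distances shrink
two_clusters = inertia(points, [0, 0, 1, 1], [(0.0, 1.0), (10.0, 1.0)])

print(one_cluster, two_clusters)  # 104.0 4.0
```

On standardized data the `k` = 1 inertia equals the number of rows times the number of features, which matches the 287.0 this notebook reports for `k` = 1 (41 coins × 7 features).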

In [179]:
# Create an empty list to store the inertia values
list_inertia = []

# Compute the inertia for each possible value of k:
# 1. Create a KMeans model using the loop counter for n_clusters
#    (n_init=10 states explicitly the default that this run relied on)
# 2. Fit the model to the data using `df_market_data_scaled`
# 3. Append the model.inertia_ to the inertia list
for index in list_k:
    k_model = KMeans(n_clusters=index, random_state=0, n_init=10)
    k_model.fit(df_market_data_scaled)
    list_inertia.append(k_model.inertia_)



## Create a dataframe used to plot the Elbow curve
A dictionary `dict_elbow_data` is created from the `k` and inertia values, and the DataFrame `df_elbow` built from it is used to review the data and plot the Elbow curve.

In [180]:
# Create a dictionary with the data to plot the Elbow curve
dict_elbow_data = {"k": list_k, "inertia": list_inertia}

# Create a DataFrame with the data to plot the Elbow curve
df_elbow = pd.DataFrame(dict_elbow_data)

# Review the DataFrame
df_elbow.head()

|  | k | inertia |
| --- | --- | --- |
| 0 | 1 | 287.0 |
| 1 | 2 | 198.571818 |
| 2 | 3 | 123.190482 |
| 3 | 4 | 79.022435 |
| 4 | 5 | 65.302379 |


## Generate the Elbow curve
Generate the Elbow curve using hvPlot to visualize inertia values for different k values, aiding in identifying the optimal k value.

In [181]:
# Plot a line chart with all the inertia values computed with the different values of k to visually identify the optimal value for k.
elbow_curve_1 = df_elbow.hvplot.line(
    x="k", 
    y="inertia", 
    title="Elbow Curve", 
    xticks=list_k
)

# Display elbow plot
elbow_curve_1

## What is the best value for `k`?

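**4** is the best value for `k` according to the elbow curve: inertia falls steeply up to `k` = 4 (from 123.19 at `k` = 3 to 79.02) and only slightly afterwards (65.30 at `k` = 5).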
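
One rough way to back up the visual reading is to look at how much each extra cluster lowers the inertia. The sketch below does this for the first five values in `df_elbow`; the helper name `percent_drops` is hypothetical, and the elbow remains a judgment call rather than something this calculation decides.

```python
def percent_drops(inertias):
    """Percentage decrease in inertia from each k to the next."""
    return [
        100 * (previous - current) / previous
        for previous, current in zip(inertias, inertias[1:])
    ]


# Inertia values for k = 1..5, copied from the df_elbow output
inertias = [287.0, 198.571818, 123.190482, 79.022435, 65.302379]

for k, drop in enumerate(percent_drops(inertias), start=2):
    print(f"k={k - 1} -> k={k}: {drop:.1f}% lower")
```

The relative drop stays above 30% up to `k` = 4 and roughly halves for the step to `k` = 5, which is consistent with choosing 4.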

---

## Cluster Cryptocurrencies with K-means Using the Original Data

Initialize a K-means model with the best `k` value (`k` = 4) and fit it to the scaled data. The model is used to predict the cluster labels for each cryptocurrency.

In [182]:
# Initialize the K-Means model using the best value for k
# (n_init=10 states explicitly the default that this run relied on)
model = KMeans(n_clusters=4, random_state=0, n_init=10)

## K-Means Model Fitting

Fit the K-Means model to the scaled data stored in `df_market_data_scaled` using scikit-learn. The model is trained to find clusters in the data based on the specified value of `k`.
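
For intuition about what fitting involves, here is a minimal pure-Python sketch of the basic K-means iteration (Lloyd's algorithm): assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat until the labels settle. The function names, the toy data, and the hand-picked starting centroids are all assumptions for illustration; scikit-learn's `KMeans` differs in important ways, such as k-means++ initialization and multiple restarts.

```python
def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(
            range(len(centroids)),
            key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
        )
        for point in points
    ]


def update(points, labels, centroids):
    """Move each centroid to the mean of its points (unchanged if it has none)."""
    new_centroids = []
    for i, old in enumerate(centroids):
        members = [point for point, label in zip(points, labels) if label == i]
        if members:
            new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
        else:
            new_centroids.append(old)
    return new_centroids


def lloyd(points, centroids, max_iter=100):
    """Alternate the two steps until the labels stop changing."""
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


# Hypothetical toy data with two obvious groups and hand-picked starting centroids
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = lloyd(points, [(0, 0), (10, 10)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```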

In [183]:
# Fit the K-Means model using the scaled data
model.fit(df_market_data_scaled)



## Predict Clusters Using K-Means

Use the fitted K-Means model to predict the cluster for each cryptocurrency from the scaled data, storing the cluster value assigned to each data point in the dataset in an array.

In [184]:
# Predict the clusters to group the cryptocurrencies using the scaled data
ndarray_k_4 = model.predict(df_market_data_scaled)

# Print the resulting array of cluster values.
ndarray_k_4

array([2, 2, 0, 0, 2, 2, 2, 2, 2, 0, 0, 0, 0, 2, 0, 2, 0, 0, 2, 0, 0, 2,
       0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 2, 0, 0, 3, 0, 0, 0, 0],
      dtype=int32)

## Copy Scaled Data for the Predicted Clusters

Create a copy of the `df_market_data_scaled` DataFrame so that a new column of predicted cluster values can be added without touching the scaled data itself.

Adding the column directly to `df_market_data_scaled` would mix the cluster labels in with the features, so any later step that expects only the seven scaled columns would pick up the labels as well.

In [185]:
# Create a copy of the df_market_data_scaled for the predicted clusters
df_market_data_scaled_predicted = df_market_data_scaled.copy()

# Review the dataframe
df_market_data_scaled_predicted.head()

| coin_id | price_change_percentage_24h | price_change_percentage_7d | price_change_percentage_14d | price_change_percentage_30d | price_change_percentage_60d | price_change_percentage_200d | price_change_percentage_1y |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bitcoin | 0.508529 | 0.493193 | 0.7722 | 0.23546 | -0.067495 | -0.355953 | -0.251637 |
| ethereum | 0.185446 | 0.934445 | 0.558692 | -0.054341 | -0.273483 | -0.115759 | -0.199352 |
| tether | 0.021774 | -0.706337 | -0.02168 | -0.06103 | 0.008005 | -0.550247 | -0.282061 |
| ripple | -0.040764 | -0.810928 | 0.249458 | -0.050388 | -0.373164 | -0.458259 | -0.295546 |
| bitcoin-cash | 1.193036 | 2.000959 | 1.76061 | 0.545842 | -0.291203 | -0.499848 | -0.270317 |


In [186]:
# Add a new column to the DataFrame with the predicted clusters
df_market_data_scaled_predicted["market_segments"] = ndarray_k_4

# Display sample data
df_market_data_scaled_predicted.head()


| coin_id | price_change_percentage_24h | price_change_percentage_7d | price_change_percentage_14d | price_change_percentage_30d | price_change_percentage_60d | price_change_percentage_200d | price_change_percentage_1y | market_segments |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| bitcoin | 0.508529 | 0.493193 | 0.7722 | 0.23546 | -0.067495 | -0.355953 | -0.251637 | 2 |
| ethereum | 0.185446 | 0.934445 | 0.558692 | -0.054341 | -0.273483 | -0.115759 | -0.199352 | 2 |
| tether | 0.021774 | -0.706337 | -0.02168 | -0.06103 | 0.008005 | -0.550247 | -0.282061 | 0 |
| ripple | -0.040764 | -0.810928 | 0.249458 | -0.050388 | -0.373164 | -0.458259 | -0.295546 | 0 |
| bitcoin-cash | 1.193036 | 2.000959 | 1.76061 | 0.545842 | -0.291203 | -0.499848 | -0.270317 | 2 |


## Create a scatter plot using hvPlot

Create a scatter plot (`hvplot.scatter`) to visualize the clustering result using the first two feature columns: "price_change_percentage_24h" and "price_change_percentage_7d". Each point represents a cryptocurrency, colored according to the predicted cluster labels. The hover feature displays the cryptocurrency name when hovering over a data point.

In [187]:
# Create a scatter plot using hvPlot by setting
# `x="price_change_percentage_24h"` and `y="price_change_percentage_7d"`. 
# Color the graph points with the labels found using K-Means and 
# add the crypto name in the `hover_cols` parameter to identify 
# the cryptocurrency represented by each data point.

# Plot the clusters using the first two feature columns
clusterplot_1 = df_market_data_scaled_predicted.hvplot.scatter(
    x="price_change_percentage_24h",
    y="price_change_percentage_7d",
    by="market_segments",
    hover_cols = ["coin_id"]
)

clusterplot_1

---

## Optimize Clusters with Principal Component Analysis (PCA)

Apply Principal Component Analysis (PCA) to reduce the dimensionality of the data to three principal components. PCA helps capture the most significant variance in the data while reducing its dimensionality.
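
To illustrate what "capturing variance" means, the sketch below works through the two-feature case by hand: the principal axes are the eigenvectors of the covariance matrix, and each component's explained variance ratio is its eigenvalue divided by the sum of the eigenvalues. The function name and the data are hypothetical, and the closed-form eigenvalue formula only applies to a 2×2 matrix; scikit-learn's `PCA` handles the general case.

```python
from math import sqrt
from statistics import fmean


def explained_variance_ratio_2d(xs, ys):
    """Share of total variance along each principal axis of two features."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    n = len(xs)
    # Entries of the 2x2 covariance matrix [[var_x, cov], [cov, var_y]]
    var_x = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix in closed form
    trace = var_x + var_y
    det = var_x * var_y - cov ** 2
    root = sqrt(max(trace ** 2 - 4 * det, 0.0))
    eigenvalues = [(trace + root) / 2, (trace - root) / 2]
    return [value / trace for value in eigenvalues]


# Hypothetical, strongly correlated features
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-1.0, -1.0, 0.0, 1.0, 1.0]
ratios = explained_variance_ratio_2d(xs, ys)
print([round(r, 3) for r in ratios])
```

When two features move together like this, a single component carries almost all of the variance, which is the situation in which dropping components loses little information.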

In [188]:
# Create a PCA model instance and set `n_components=3`.
pca=PCA(n_components=3)

In [189]:
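# Use the PCA model with `fit_transform` to reduce to three principal components.
# Note: the input here is `df_market_data_scaled_predicted`, so the `market_segments`
# label column is included alongside the seven scaled features; pass
# `df_market_data_scaled` instead to reduce the price features only.
ndarray_cryptomarkets_pca = pca.fit_transform(df_market_data_scaled_predicted)

# Create a DataFrame with the PCA data and view its first five rows.
df_cryptomarkets_pca = pd.DataFrame(ndarray_cryptomarkets_pca)
df_cryptomarkets_pca[:5]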

|  | 0 | 1 | 2 |
| --- | --- | --- | --- |
| 0 | 1.194082 | -0.902074 | -0.585338 |
| 1 | 1.009706 | -0.663584 | -1.13695 |
| 2 | -0.722536 | -0.307632 | 0.721813 |
| 3 | -0.748266 | -0.332379 | 0.558479 |
| 4 | 2.258539 | -1.826966 | -1.378166 |


## Calculate the percentage of the total variance that is captured by the three PCA variables.

Compute the explained variance ratio for each of the three principal components. The sum of these variance ratios represents how much of the total variance in the data is captured by the three components. In this case, it is approximately 88.7%.

In [190]:
# Retrieve the explained variance to determine how much information can be attributed to each principal component.
pca.explained_variance_ratio_

array([0.37269822, 0.32489961, 0.18917649])
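
The 88.7% figure can be checked by adding up the three ratios. A minimal sketch, with the values copied from the output of `pca.explained_variance_ratio_`:

```python
from itertools import accumulate

# Explained variance ratios reported by the PCA model
ratios = [0.37269822, 0.32489961, 0.18917649]

# Running total: variance retained by the first 1, 2 and 3 components
cumulative = list(accumulate(ratios))
print([f"{share:.1%}" for share in cumulative])  # ['37.3%', '69.8%', '88.7%']
```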

## Find the Best Value for k Using the PCA Data

Perform K-means clustering on the PCA-transformed data with different values of `k` (number of clusters) ranging from 1 to 11. Compute the inertia for each `k` value and store them in a list. The Elbow Curve plot (`elbow_curve_2`) is generated to help identify the optimal `k` value.

In [191]:
# Create a list with the number of k-values from 1 to 11
list_k_values = list(range(1, 12))
list_k_values

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

## Calculate the inertia
As before, inertia measures how tightly the data points in each cluster are packed around the cluster's centroid. The code calculates the inertia for each number of clusters (`k`) using the KMeans algorithm, this time on the PCA data, storing the results in `list_inertia`.

In [192]:
# Create an empty list to store the inertia values
list_inertia = []

# Compute the inertia for each possible value of k:
# 1. Create a KMeans model using the loop counter for n_clusters
#    (n_init=10 states explicitly the default that this run relied on)
# 2. Fit the model to the data using `df_cryptomarkets_pca`
# 3. Append the model.inertia_ to the inertia list
for index in list_k_values:
    k_model = KMeans(n_clusters=index, random_state=0, n_init=10)
    k_model.fit(df_cryptomarkets_pca)
    list_inertia.append(k_model.inertia_)



## Create a dataframe used to plot the Elbow curve
A dictionary `elbow_data` is created from the `k` and inertia values, and the DataFrame `df_elbow` built from it is used to review the data and plot the Elbow curve.

In [193]:
# Create a dictionary with the data to plot the Elbow curve
elbow_data = {"k": list_k, "inertia": list_inertia}

# Create a DataFrame using the elbow_data Dictionary
df_elbow = pd.DataFrame(elbow_data)

# Review the DataFrame
df_elbow.head()


|  | k | inertia |
| --- | --- | --- |
| 0 | 1 | 290.018457 |
| 1 | 2 | 199.108053 |
| 2 | 3 | 112.401201 |
| 3 | 4 | 43.586433 |
| 4 | 5 | 32.255267 |


## Generate the Elbow curve
Generate the Elbow curve using hvPlot to visualize inertia values for different k values, aiding in identifying the optimal k value.

In [194]:
# Plot a line chart with all the inertia values computed with the different values of k to visually identify the optimal value for k.
elbow_curve_2 = df_elbow.hvplot.line(
    x="k", 
    y="inertia", 
    title="Elbow Curve", 
    xticks=list_k
)


# Display elbow plot
elbow_curve_2

## What is the best value for `k` when using the PCA data?

The best value for `k` remains at 4.


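## Does it differ from the best `k` value found using the original data?

The `k` value does not differ. The inertia at `k` = 4 drops from 79.022 to 43.586, but the two numbers are measured in different feature spaces (the seven scaled features versus three principal components), so they are not directly comparable: part of the drop comes simply from measuring distances in fewer dimensions, and it should not be read on its own as a more accurate clustering.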
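
One caveat when reading the two inertia values side by side: removing a dimension removes its squared term from every distance, so inertia can fall even when the grouping is identical. The sketch below shows this on hypothetical data, with dropping a raw feature standing in for discarding a principal component; the function name `cluster_inertia` is an assumption for this example.

```python
def cluster_inertia(points, labels):
    """Inertia with each centroid taken as the mean of its cluster's points."""
    total = 0.0
    for cluster in set(labels):
        members = [p for p, label in zip(points, labels) if label == cluster]
        centroid = [sum(axis) / len(members) for axis in zip(*members)]
        total += sum(
            (value - centre) ** 2
            for member in members
            for value, centre in zip(member, centroid)
        )
    return total


# Hypothetical 3-feature data with a fixed two-cluster assignment
points = [(0.0, 0.0, 0.0), (0.0, 2.0, 4.0), (10.0, 0.0, 0.0), (10.0, 2.0, 4.0)]
labels = [0, 0, 1, 1]

full = cluster_inertia(points, labels)
# Same points and labels, but keeping only the first two features
reduced = cluster_inertia([p[:2] for p in points], labels)

print(full, reduced)  # 20.0 4.0
```

The cluster assignment is the same in both calls, yet the inertia falls from 20.0 to 4.0, so a lower inertia after reduction is not by itself evidence of better clusters.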

## Cluster Cryptocurrencies with K-means Using the PCA Data

Initialize a K-means model with the best `k` value (4) and fit it to the PCA-transformed data. The model is used to predict the cluster labels for each cryptocurrency based on the reduced feature space.

In [195]:
# Initialize the K-Means model using the best value for k
# (n_init=10 states explicitly the default that this run relied on)
model = KMeans(n_clusters=4, random_state=0, n_init=10)

## Fit K-Means Model with PCA Data

Fit the K-Means model using the DataFrame `df_cryptomarkets_pca`, which contains the data after applying Principal Component Analysis (PCA). The K-Means model will identify clusters in the reduced feature space obtained from PCA, enabling grouping of cryptocurrencies based on their principal components.

In [196]:
# Fit the K-Means model using the PCA data
model.fit(df_cryptomarkets_pca)



## Predict Clusters with PCA Data

Use the fitted K-Means model to predict the clusters for the cryptocurrencies using the PCA data. The resulting array `ndarray_k_4` contains the cluster labels for each cryptocurrency, indicating the group to which each data point belongs based on their principal components.

In [197]:
# Predict the clusters to group the cryptocurrencies using the PCA data
ndarray_k_4 = model.predict(df_cryptomarkets_pca)

# Print the resulting array of cluster values.
ndarray_k_4

array([1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 2, 1, 0, 0, 3, 0, 0, 0, 0],
      dtype=int32)
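
The label numbers differ from those of the earlier run on the scaled data, but K-means cluster numbers are arbitrary, so what matters is whether the same coins end up grouped together. The sketch below checks this for the two arrays printed in this notebook; the helper name `relabel_map` is hypothetical.

```python
def relabel_map(labels_a, labels_b):
    """Return the mapping a -> b if both labelings define the same partition, else None."""
    forward, backward = {}, {}
    for a, b in zip(labels_a, labels_b):
        # setdefault records the first pairing; any later conflict breaks the one-to-one match
        if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
            return None
    return forward


# Cluster labels from the scaled data and from the PCA data, copied from the outputs
scaled_labels = [2, 2, 0, 0, 2, 2, 2, 2, 2, 0, 0, 0, 0, 2, 0, 2, 0, 0, 2, 0, 0, 2,
                 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 2, 0, 0, 3, 0, 0, 0, 0]
pca_labels = [1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1,
              0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 2, 1, 0, 0, 3, 0, 0, 0, 0]

print(relabel_map(scaled_labels, pca_labels))  # {2: 1, 0: 0, 1: 2, 3: 3}
```

A one-to-one mapping exists, so both runs place every coin in the same group; only the numbering of clusters 1 and 2 is swapped.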

## Create DataFrame Copy with PCA Data

1. Create a copy of the DataFrame `df_cryptomarkets_pca` and assign it to the new DataFrame `df_cryptomarkets_pca_predicted`.
2. Display the first few rows of the new DataFrame to review the copied data.

In [198]:
# Create a copy of the DataFrame with the PCA data
df_cryptomarkets_pca_predicted = df_cryptomarkets_pca.copy()

# Review the Dataframe
df_cryptomarkets_pca_predicted.head()

|  | 0 | 1 | 2 |
| --- | --- | --- | --- |
| 0 | 1.194082 | -0.902074 | -0.585338 |
| 1 | 1.009706 | -0.663584 | -1.13695 |
| 2 | -0.722536 | -0.307632 | 0.721813 |
| 3 | -0.748266 | -0.332379 | 0.558479 |
| 4 | 2.258539 | -1.826966 | -1.378166 |


## Update DataFrame with Predicted Clusters

1. Add a new column called "market_segments" to the DataFrame `df_cryptomarkets_pca_predicted`, containing the predicted cluster labels obtained from the `ndarray_k_4`.
2. Rename the columns of the DataFrame to "PC1", "PC2", and "PC3" for better representation of the three principal components.
3. Display the first few rows of the updated DataFrame to visualize the changes.

In [199]:
# Add a new column to the DataFrame with the predicted clusters
df_cryptomarkets_pca_predicted["market_segments"] = ndarray_k_4

# Rename the columns
df_cryptomarkets_pca_predicted.rename(columns={0: "PC1", 1: "PC2", 2: "PC3"}, inplace=True)

# Display sample data
df_cryptomarkets_pca_predicted.head()

|  | PC1 | PC2 | PC3 | market_segments |
| --- | --- | --- | --- | --- |
| 0 | 1.194082 | -0.902074 | -0.585338 | 1 |
| 1 | 1.009706 | -0.663584 | -1.13695 | 1 |
| 2 | -0.722536 | -0.307632 | 0.721813 | 0 |
| 3 | -0.748266 | -0.332379 | 0.558479 | 0 |
| 4 | 2.258539 | -1.826966 | -1.378166 | 1 |


## Create a scatter plot using hvPlot

Create another scatter plot (`hvplot.scatter`) to visualize the clustering result using the first two principal components obtained from PCA: "PC1" and "PC2". Each point represents a cryptocurrency, colored according to the predicted cluster labels. The hover feature displays the cryptocurrency name when hovering over a data point.


In [200]:
# Create a scatter plot using hvPlot by setting 
# `x="PC1"` and `y="PC2"`. 
# Color the graph points with the labels found using K-Means and 
# add the crypto name in the `hover_cols` parameter to identify 
# the cryptocurrency represented by each data point.

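# Plot the clusters; the PCA DataFrame has a plain integer index, so attach
# the coin names from `df_market_data` for `hover_cols` to find
clusterplot_2 = df_cryptomarkets_pca_predicted.assign(
    coin_id=df_market_data.index
).hvplot.scatter(
    x="PC1",
    y="PC2",
    by="market_segments",
    hover_cols=["coin_id"]
)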

clusterplot_2

## Visualize and Compare the Results

Create composite plots to compare the Elbow curves and the cluster scatter plots side by side for the original data and PCA data. These plots provide a clear visual comparison of the clustering results using different techniques and feature spaces.

In [201]:
# Composite plot to contrast the Elbow curves
composite_elbow_curves = elbow_curve_1 + elbow_curve_2
composite_elbow_curves = composite_elbow_curves.opts(width=800, height=400)  # Adjust size

# Display the composite plots
composite_elbow_curves

In [202]:
# Composite plot to contrast the cluster Scatter plots
composite_cluster_plots = clusterplot_1 + clusterplot_2
composite_cluster_plots = composite_cluster_plots.opts(width=800, height=400)  # Adjust size

# Display the composite plots
composite_cluster_plots

## After visually analyzing the cluster analysis results, what is the impact of using fewer features to cluster the data using K-Means?

Using fewer features, specifically the three principal components obtained from PCA, left the clustering essentially unchanged. The three components retain about 88.7% of the variance, and the cluster assignments are visually similar for the original data and the PCA data. The PCA data also shows a lower inertia at `k` = 4 (43.586 versus 79.022), although that comparison spans different feature spaces, so it is suggestive rather than conclusive evidence of more compact clusters.