In [1]:
# imports
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

# load in data
df = pd.read_csv('../data/returns_clean.csv', parse_dates = ['timestamp'], 
                                      index_col = 'timestamp')

coin_names = [col.split('_')[0] for col in df.columns]

In [2]:
df2 = df.copy()

df2 -= df2.mean()
df2 /= df2.std()

In [3]:
U, S, Vt = np.linalg.svd(df2)
px.scatter(x = Vt[0, :], y = Vt[1, :], 
           labels = coin_names, color = coin_names,
           title = "Latent Relationships Between Different Crypto Tokens")

In [4]:
# latent relationships between different trading days
trading_days = np.arange(len(df.index))
px.scatter(x = U[:, 0], y = U[:, 1], 
           hover_name = df.index, color = trading_days,
           title = "Latent Relationships Between Different Trading Days During Dataset")

**TO DO:** 

Try and cluster w/ KMeans + PCA

 - Use PCA to reduce dimensionality of dataset (probably keep 95% of explained variance)
 - Fit KMeans to find optimal # of clusters using Silhouette Score (or some other metric) w/ training, validation, test sets (can report these scores in our final report)
 - Visualize on reduced dataset similar to above chart, but this time color coded w/ cluster labels (how well do different clusters separate themselves?)
 - Report patterns contained within different clusters -- any overarching themes?

Mark W's note on above - are there certain clusters we already have in mind? In terms of training/validation/test?

### Finding the amount of PCA components that fits the 95%+ threshold of explained variance

Here, we see 16 PCA components are required to explain 95% of the variance, thus reducing the dimensionality from 189 (various crypto prices) to just 16 features.

In [25]:
from sklearn.decomposition import PCA

n_component = [x for x in range(1,20)]
explained_var = []

for nc in n_component:
    pca = PCA(n_components=nc)
    pca.fit(df2)
    explained_var.append(np.sum(pca.explained_variance_))

explained_var_pca_df = pd.DataFrame()
explained_var_pca_df['number_of_components'] = n_component
explained_var_pca_df['explained_variance'] = explained_var


fig = px.line(explained_var_pca_df, x="number_of_components", y="explained_variance", title='Explained Variance by Number of PCA Components',markers=True)

fig.show()


In [47]:
pca = PCA(n_components=16)
pca.fit(df2)

df_reduced_16 = pca.transform(df2)
df_reduced_16 = pd.DataFrame(df_reduced_16)
df_reduced_16.index = df2.index

### Exploring Clustering

Similar the 2nd graph, exploring just 2 or 3 principal components to identify any distince clusters indicates that distinct clusters do not form with 2 or 3 principal components. I.e. it is difficult to identify distinct clusters visually.

In [68]:

trading_days = np.arange(len(df_reduced_16.index))

px.scatter(df_reduced_16, x = 0, y = 3,
           hover_name = df_reduced_16.index,
           title = "Two Dimensional Clustering")

In [71]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2, random_state=0, n_init="auto").fit(df_reduced_16)


array([1, 0, 0, ..., 1, 1, 0])

Below, we can see that creating 2 clusters with KMeans (using all 16 components), tends to arbitraily separate the coins. Visualized through different combinations of components, the clustering looks arbitrary. 

It would be more valuable to separate the central points from the periphery points.

In [77]:
df_reduced_16['kmeans_labels'] = kmeans.labels_

px.scatter(df_reduced_16, x = 0, y = 3,
           hover_name = df_reduced_16.index, color="kmeans_labels",
           title = "Two Dimensional Clustering")

HDBScan, a density based clustering method (an iteration on DBScan), looks for 'islands of higher density amid noise'. Here, after some tuning, we find HBDScan's 3 identified clusters mapped against the top 2 principal components, indicating how the central cluster is surrounded by another extended cluster surrounding it. Additionally, there is a small isolated cluster (not as distinct when viewed in the context of the top 2 principal components).

In [99]:

if "kmeans_labels" in df_reduced_16.columns:
    df_reduced_16 = df_reduced_16.drop(columns=['kmeans_labels'])

clusterer = hdbscan.HDBSCAN(min_cluster_size=2, min_samples=6, gen_min_span_tree=True)
hbd_labels = clusterer.fit_predict(df_reduced_16)

df_reduced_16['hbd_labels'] = hbd_labels

px.scatter(df_reduced_16, x = 0, y = 1,
           hover_name = df_reduced_16.index, color="hbd_labels",
           title = "HBDScan's 3 Clusters Viewed through the top 2 principal components")

Applied with the whole non-reduced dataset, HDBScan identifes 4 clusters.

In [102]:

if "hbd_labels" in df2.columns:
    df2 = df2.drop(columns=['hbd_labels'])

clusterer = hdbscan.HDBSCAN(min_cluster_size=2, min_samples=6, gen_min_span_tree=True)
hbd_labels = clusterer.fit_predict(df_reduced_16)

df2['hbd_labels'] = hbd_labels
print(f'hdbscan identifies {len(np.unique(hbd_labels))} clusters')

hdbscan identifies 4 clusters
