<img src="https://relevance.ai/wp-content/uploads/2021/11/logo.79f303e-1.svg" width="150" alt="Relevance AI" />
<h5> Developer-first vector platform for ML teams </h5>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/RelevanceAI/workflows/blob/main/workflows/subclustering/🍒_RelevanceAI_Subclustering.ipynb)

# 🤖: Basic Sub-clustering

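This notebook is a quick guide to subclustering with Relevance AI. Subclustering lets you drill down into existing clusters by clustering the documents inside each one again, as many levels deep as you need.

Basic subclustering reuses the regular clustering workflow: you pass a model, the parent cluster field, the vector field to cluster on, and an alias for the results.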

For more details, please refer to the [subclustering guide](https://colab.research.google.com/github/RelevanceAI/workflows/blob/main/workflows/subclustering/basic_subclustering.ipynb) or the [subclustering references](https://relevanceai.readthedocs.io/en/development/operations/cluster/subclustering.html).


In [None]:
%%capture
!pip install -q RelevanceAI[notebook]

In [None]:
%%capture
from relevanceai import Client

"""
You can sign up/login and find your credentials here: https://cloud.relevance.ai/sdk/api
Once you have signed up, click on the value under `Authorization token` and paste it here
"""
client = Client()

In [None]:
"""
Variables
"""

dataset_id = "basic_subclustering"

## Clustering
n_clusters = 10
parent_alias = f"kmeans_{n_clusters}"
vector_field = "product_image_clip_vector_"

# The parent field appears in ds.schema once clustering has run; alternatively, set it explicitly as below
parent_field = f"_cluster_.{vector_field}.{parent_alias}"

## Subclustering

subcluster_n_clusters = 3
subcluster_alias = f"{parent_alias}_{subcluster_n_clusters}"
subcluster_field = f"_cluster_.{vector_field}.{subcluster_alias}"
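
The variables above all follow one naming pattern, `_cluster_.<vector_field>.<alias>`. As a minimal sketch, the helper below (a hypothetical `cluster_field` function, not part of the Relevance AI SDK) composes the same names so the pattern is easier to see and reuse:

```python
def cluster_field(vector_field: str, alias: str) -> str:
    """Compose a cluster field name using the pattern from the variables cell.

    Hypothetical helper for illustration; it only formats a string.
    """
    return f"_cluster_.{vector_field}.{alias}"


n_clusters = 10
subcluster_n_clusters = 3
vector_field = "product_image_clip_vector_"

# The subcluster alias extends the parent alias with the subcluster count
parent_alias = f"kmeans_{n_clusters}"
subcluster_alias = f"{parent_alias}_{subcluster_n_clusters}"

parent_field = cluster_field(vector_field, parent_alias)
subcluster_field = cluster_field(vector_field, subcluster_alias)

print(parent_field)      # _cluster_.product_image_clip_vector_.kmeans_10
print(subcluster_field)  # _cluster_.product_image_clip_vector_.kmeans_10_3
```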


# 🚣 Inserting data

We use a sample ecommerce dataset whose `product_image_clip_vector_` and `product_title_clip_vector_` vectors are already encoded for us.

In [None]:
from relevanceai.utils.datasets import get_ecommerce_dataset_encoded

docs = get_ecommerce_dataset_encoded()


In [None]:
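ds = client.Dataset(dataset_id)
# Start from a clean slate: this deletes any existing dataset with this ID
ds.delete()
# Insert the pre-encoded sample documents
ds.upsert_documents(docs)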

In [None]:
ds.schema

# 🍒 Running the initial clustering

In [None]:

"""
Let's instantiate a clustering model and set an appropriate parent alias for n_clusters
Let's cluster each of the available vector fields separately
"""
from sklearn.cluster import KMeans

vector_fields = ds.list_vector_fields()
model = KMeans(n_clusters=n_clusters)

# Run the same clustering on every vector field, storing results under the parent alias
for v in vector_fields:
    cluster_ops = ds.cluster(
        model,
        vector_fields=[v],
        alias=parent_alias,
    )

ds.schema

If we look at the resulting clusters in the [clustering dashboard](https://cloud.relevance.ai/dataset/basic_subclustering/deploy/recent/cluster/), we can see that some of them could be broken down further. At a high level, we can see electronics and shoes, and subclustering lets us split these clusters into finer groups.





# 🫐 Running sub-clustering is then as simple as calling **ds.subcluster**

In [None]:


"""
Given the parent field - we now run subclustering 
Let's dive deeper to view 3 subclusters 
"""

from sklearn.cluster import KMeans
model = KMeans(n_clusters=subcluster_n_clusters)

ds.subcluster(
   model=model,
   parent_field=parent_field,
   vector_fields=[vector_field],
   alias=subcluster_alias
)
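
Conceptually, subclustering groups documents by their parent cluster label and then clusters each group on its own. The sketch below illustrates that idea in plain Python; it is not Relevance AI's implementation. `split_into_bins` is a hypothetical stand-in for a real model such as KMeans, and the `parent`/`x` keys and the `<parent>-<sub>` label format are assumptions made for illustration.

```python
from collections import defaultdict


def split_into_bins(values, k):
    """Stand-in clusterer (NOT KMeans): sort 1-D values and cut them
    into k contiguous, near-equal bins. Returns one label per value,
    in the original input order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = min(rank * k // len(values), k - 1)
    return labels


def subcluster(docs, parent_key, value_key, k):
    """Group docs by parent label, cluster each group independently,
    and return a composite label per document ID."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[parent_key]].append(doc)

    result = {}
    for parent_label, members in groups.items():
        # Never ask for more subclusters than there are members
        n_sub = min(k, len(members))
        sub_labels = split_into_bins([d[value_key] for d in members], n_sub)
        for doc, sub in zip(members, sub_labels):
            result[doc["_id"]] = f"{parent_label}-{sub}"
    return result


# Made-up documents: a parent label plus a single numeric feature
docs = [
    {"_id": "a", "parent": "cluster-0", "x": 0.1},
    {"_id": "b", "parent": "cluster-0", "x": 0.9},
    {"_id": "c", "parent": "cluster-0", "x": 0.2},
    {"_id": "d", "parent": "cluster-1", "x": 5.0},
    {"_id": "e", "parent": "cluster-1", "x": 7.5},
]

labels = subcluster(docs, parent_key="parent", value_key="x", k=2)
print(labels)
```

Each parent cluster is split without looking at the others, which is why the subcluster numbering restarts inside every parent.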


In [None]:

"""
We can see the new subcluster in the schema
"""

ds.schema


In [None]:

"""
You can also track your subclusters in the dataset metadata
"""

ds.metadata


In [None]:
# You can also view your subcluster results using

ds[subcluster_field] 
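
To get a quick feel for how documents are spread across subclusters, you could count the labels once the documents are in memory. This is a rough sketch under assumptions: it treats documents as nested dictionaries addressed by the dotted field name, `get_field` is a hypothetical helper rather than an SDK function, and the sample documents and label strings are made up.

```python
from collections import Counter


def get_field(doc, dotted_field, default=None):
    """Walk a nested dict along a dot-separated path (hypothetical helper)."""
    current = doc
    for key in dotted_field.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


subcluster_field = "_cluster_.product_image_clip_vector_.kmeans_10_3"

# Made-up documents; the label values are illustrative only
sample_docs = [
    {"_id": "1", "_cluster_": {"product_image_clip_vector_": {"kmeans_10_3": "cluster-0-1"}}},
    {"_id": "2", "_cluster_": {"product_image_clip_vector_": {"kmeans_10_3": "cluster-0-1"}}},
    {"_id": "3", "_cluster_": {"product_image_clip_vector_": {"kmeans_10_3": "cluster-4-0"}}},
    {"_id": "4"},  # a document without a subcluster label
]

# Count documents per subcluster label, bucketing missing labels together
sizes = Counter(
    get_field(doc, subcluster_field, default="unassigned") for doc in sample_docs
)
print(sizes.most_common())
```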

In [None]:

"""
View dataset health
"""
ds.health()

# 🍇 You can then run sub-clustering again on a separate parent alias!

If we find our initial subclusters are insufficient, we can run subclustering again with more clusters to drill down even further.

You can keep subclustering for as many levels as required by referring back to the previous level's alias as the parent each time.
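
As a sketch of how successive levels could be chained, the snippet below only builds the aliases and field names for each level. It assumes the `<parent_alias>_<n_clusters>` alias convention from the variables cell carries over to deeper levels; `next_level` is a hypothetical helper, and the cluster counts `(3, 2)` are arbitrary.

```python
def next_level(parent_alias, vector_field, n_clusters):
    """Return the alias and field name for one more subclustering level.

    Hypothetical helper that follows the naming used in the variables cell.
    """
    alias = f"{parent_alias}_{n_clusters}"
    return alias, f"_cluster_.{vector_field}.{alias}"


vector_field = "product_image_clip_vector_"
alias = "kmeans_10"  # alias of the initial clustering

plan = []
for n in (3, 2):  # subclusters per parent at each deeper level
    # The previous level's field becomes the parent of the next level
    parent_field = f"_cluster_.{vector_field}.{alias}"
    alias, field = next_level(alias, vector_field, n)
    plan.append(
        {"parent_field": parent_field, "alias": alias, "field": field, "n_clusters": n}
    )

for step in plan:
    print(step["parent_field"], "->", step["field"])
```

Each entry in `plan` holds the `parent_field` and `alias` arguments for one `ds.subcluster` call like the one shown earlier, paired with a model configured for `n_clusters`.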

See the [subclustering guide](https://colab.research.google.com/github/RelevanceAI/workflows/blob/main/workflows/subclustering/basic_subclustering.ipynb) for more details on how to use subclustering.


**Next steps**

If you require more in-depth knowledge of subclustering, we will be writing more guides on how to adapt these steps to different aliases and models in the near future.


For more details, please refer to the [references](https://relevanceai.readthedocs.io/en/development/operations/cluster/subclustering.html).