<img src="https://relevance.ai/wp-content/uploads/2021/11/logo.79f303e-1.svg" width="150" alt="Relevance AI" />
<h5> Developer-first vector platform for ML teams </h5>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/RelevanceAI/workflows/blob/main/workflows/subclustering/🍒_RelevanceAI_Subclustering.ipynb)

# 🤖: Basic Sub-clustering

This notebook is a quick guide to subclustering with Relevance AI. Subclustering lets you drill down into existing clusters by clustering the documents inside each one, and you can repeat this to any depth.

Basic sub-clustering covers the simplest case: cluster the dataset once, then run a second clustering model within those clusters.

For more details, please refer to the [subclustering guide](https://colab.research.google.com/github/RelevanceAI/workflows/blob/main/workflows/subclustering/basic_subclustering.ipynb) or the [subclustering references](https://relevanceai.readthedocs.io/en/development/operations/cluster/subclustering.html).


In [None]:
!pip install -q RelevanceAI==2.1.5

In [None]:
from relevanceai import Client
client = Client()

In [None]:
## Sample dataset instantiation - replace this with your own dataset information

from relevanceai.utils.datasets import get_ecommerce_dataset_encoded

dataset_id = "basic_subclustering"  ## Change this to your own dataset
ds = client.Dataset(dataset_id)
ds.upsert_documents(get_ecommerce_dataset_encoded())

In [None]:
## Inputs

n_clusters = 10  # number of parent clusters you want
vector_field = "product_image_clip_vector_"  # vector field to cluster on
n_subclusters = 3  # number of subclusters you want
alias = f"kmeans_{n_clusters}"  # name of the parent clustering
subcluster_alias = f"kmeans_{n_clusters}_{n_subclusters}"  # What you want to call these clusters
parent_field = f'_cluster_.{vector_field}.{alias}'         ## Default parent field: what to base subclusters on

In [None]:
from sklearn.cluster import KMeans
model = KMeans(n_clusters=n_clusters)

ds.cluster(
   model=model,
   parent_field=parent_field,
   vector_fields=[vector_field],
   alias=alias
)


# 🫐 Running sub-clustering is then as simple as calling **ds.subcluster**.

In [None]:
from sklearn.cluster import KMeans
model = KMeans(n_clusters=n_subclusters)

ds.subcluster(
   model=model,
   parent_field=parent_field,
   vector_fields=[vector_field],
   alias=subcluster_alias
)
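
Conceptually, subclustering groups documents by their parent cluster and fits a separate model inside each group. The sketch below illustrates that idea in plain Python; it is not how Relevance AI implements `ds.subcluster`. `split_at_mean` is a toy stand-in for a real model such as KMeans, and the `"<parent>-<sub>"` label format is an assumption made for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def split_at_mean(vectors: List[Vector]) -> List[int]:
    """Toy stand-in for a real model: two groups, split on the mean of the first coordinate."""
    mean = sum(v[0] for v in vectors) / len(vectors)
    return [0 if v[0] <= mean else 1 for v in vectors]


def subcluster(
    vectors: List[Vector],
    parent_labels: List[int],
    cluster_fn: Callable[[List[Vector]], List[int]],
) -> List[str]:
    """Cluster the members of each parent cluster independently."""
    # Group document positions by their parent cluster label.
    members: Dict[int, List[int]] = defaultdict(list)
    for idx, parent in enumerate(parent_labels):
        members[parent].append(idx)

    labels = [""] * len(vectors)
    for parent, indices in members.items():
        # Fit the model only on the documents that belong to this parent cluster.
        sub_labels = cluster_fn([vectors[i] for i in indices])
        for i, sub in zip(indices, sub_labels):
            # Hypothetical label format: "<parent>-<sub>".
            labels[i] = f"{parent}-{sub}"
    return labels


vectors = [[0.0], [1.0], [10.0], [11.0], [100.0], [130.0]]
parent_labels = [0, 0, 0, 0, 1, 1]
sub_labels = subcluster(vectors, parent_labels, split_at_mean)
print(sub_labels)
```

Because each parent cluster is handled on its own, the same function can be applied again to the resulting labels to go one level deeper.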


In [None]:
ds.schema
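
One way to confirm the run is to look for the new cluster fields in the schema. The sketch below assumes `ds.schema` returns a dictionary keyed by field name. The contents of `sample_schema` are made up for illustration, as is the assumption that subcluster labels are stored under `_cluster_.<vector_field>.<alias>`, so check both against your own output.

```python
# Hypothetical stand-in for the dictionary returned by ds.schema.
sample_schema = {
    "product_title": "text",
    "product_image_clip_vector_": {"vector": 512},
    "_cluster_": "dict",
    "_cluster_.product_image_clip_vector_": "dict",
    "_cluster_.product_image_clip_vector_.kmeans_10": "text",
    "_cluster_.product_image_clip_vector_.kmeans_10_3": "text",
}

vector_field = "product_image_clip_vector_"
prefix = f"_cluster_.{vector_field}."

# Keep only the fields that hold cluster labels for this vector field,
# and strip the prefix so that only the alias remains.
cluster_aliases = sorted(
    field[len(prefix):] for field in sample_schema if field.startswith(prefix)
)
print(cluster_aliases)
```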

You should also be able to track your subclusters using the dataset metadata:

In [None]:
ds.metadata


# 🍇 Next Steps

If we find our initial subclusters are insufficient, we can run subclustering again with more clusters to drill down even further.

You can keep subclustering as deep as required by referring each new run back to its parent alias.
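
Referring back to the parent alias can be sketched as building each level's `parent_field` from the alias of the level above it. This sketch reuses the `_cluster_.{vector_field}.{alias}` pattern from the inputs cell and assumes that subcluster results are stored under the same pattern. The third level (2 clusters) and the helper `cluster_field` are hypothetical.

```python
vector_field = "product_image_clip_vector_"


def cluster_field(vector_field: str, alias: str) -> str:
    """Build a cluster field name using the parent_field pattern from this notebook."""
    return f"_cluster_.{vector_field}.{alias}"


# Number of clusters at each level: parent, subcluster, then a hypothetical third level.
sizes = [10, 3, 2]

levels = []
alias = f"kmeans_{sizes[0]}"
for n in sizes[1:]:
    child_alias = f"{alias}_{n}"
    levels.append(
        {
            "n_clusters": n,
            "alias": child_alias,
            # Each level is based on the labels produced by the level above it.
            "parent_field": cluster_field(vector_field, alias),
        }
    )
    alias = child_alias

for level in levels:
    print(level)
    # With a live dataset, each level could then be run along these lines:
    # ds.subcluster(
    #     model=KMeans(n_clusters=level["n_clusters"]),
    #     parent_field=level["parent_field"],
    #     vector_fields=[vector_field],
    #     alias=level["alias"],
    # )
```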

See the [subclustering guide](https://colab.research.google.com/github/RelevanceAI/workflows/blob/main/workflows/subclustering/basic_subclustering.ipynb) for more details on how to use subclustering.


If you require more in-depth knowledge of subclustering, we will be writing more guides on how to adapt these to different aliases and models in the near future.


For more details, please refer to the  [references](https://relevanceai.readthedocs.io/en/development/operations/cluster/subclustering.html).