<a id="top"></a>
<img style="width:40%;max-width:600px" alt="Bluelight AI Logo" href="https://bluelightai.com/" src="https://github.com/BlueLightAI/cobalt-examples/blob/main/assets/blai-logo-light.png?raw=true">

# Cobalt UI Walkthrough

<a href="https://bluelightai.com/contact">Give Feedback</a> | <a href="https://bluelightai.com/">Our Website</a> | <a href="https://bluelightai.com/blog">Our Blog</a> | <a href="https://docs.cobalt.bluelightai.com/">Cobalt Docs</a> | <a href="https://join.slack.com/t/bluelightaicom/shared_invite/zt-2uj0iu5lh-5WgutuwH82RxAOwuq8ptqg">Slack Community</a>

**Tags:** #blai #python #cobaltai #tda #embedding #cobaltui #imageclassification

**Last update:** 2025-01-15 (Created: 2025-01-14)

## Goals:

### What you will see:

- Download the wine quality data (with 2 classes),
- Visualize the data using a powerful visualization tool, 
- Generate insights and intuition about data using the visual analysis, and 
- Select subgroups from the UI and interact with the selected subgroups using Python code.

### You will learn:

- Introduction to Cobalt and its UI, and
- Application of Cobalt with data visualization and classification tasks.

## Input

### Install dependencies

Let's install the `cobalt-ai` package from PyPI:

In [1]:
# %pip install cobalt-ai
# %pip install --upgrade cobalt-ai

Next, let's download the [wine-quality](https://archive.ics.uci.edu/dataset/186/wine+quality) dataset from the UC Irvine data repository:

In [2]:
# %pip install ucimlrepo

In [3]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
wine_quality = fetch_ucirepo(id=186) 


### Import libraries

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

import cobalt

If you have not registered Cobalt yet, please uncomment and run the following to register:

In [5]:
# cobalt.register_license()

### Data

In [6]:
wine_quality.data.original

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality,color
0,7.4,0.70,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5,red
1,7.8,0.88,0.00,2.6,0.098,25.0,67.0,0.99680,3.20,0.68,9.8,5,red
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.99700,3.26,0.65,9.8,5,red
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.99800,3.16,0.58,9.8,6,red
4,7.4,0.70,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5,red
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6492,6.2,0.21,0.29,1.6,0.039,24.0,92.0,0.99114,3.27,0.50,11.2,6,white
6493,6.6,0.32,0.36,8.0,0.047,57.0,168.0,0.99490,3.15,0.46,9.6,5,white
6494,6.5,0.24,0.19,1.2,0.041,30.0,111.0,0.99254,2.99,0.46,9.4,6,white
6495,5.5,0.29,0.30,1.1,0.022,20.0,110.0,0.98869,3.34,0.38,12.8,7,white


In [7]:
### Explore the dataset
# wine_quality.data
# wine_quality.data.features
# wine_quality.data.targets
# wine_quality.values()

In [8]:
# Metadata information
# wine_quality.metadata

In [9]:
# Variable information 
# wine_quality.variables

Now, we prepare the data (as Pandas DataFrames) for visualization:

In [10]:
## Features of the wine data
X = wine_quality.data.features 

## Wine quality as the target variable of the wine data
y = wine_quality.data.targets

## Complete dataset, including the wine type
df = wine_quality.data.original

In [11]:
df

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality,color
0,7.4,0.70,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5,red
1,7.8,0.88,0.00,2.6,0.098,25.0,67.0,0.99680,3.20,0.68,9.8,5,red
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.99700,3.26,0.65,9.8,5,red
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.99800,3.16,0.58,9.8,6,red
4,7.4,0.70,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5,red
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6492,6.2,0.21,0.29,1.6,0.039,24.0,92.0,0.99114,3.27,0.50,11.2,6,white
6493,6.6,0.32,0.36,8.0,0.047,57.0,168.0,0.99490,3.15,0.46,9.6,5,white
6494,6.5,0.24,0.19,1.2,0.041,30.0,111.0,0.99254,2.99,0.46,9.4,6,white
6495,5.5,0.29,0.30,1.1,0.022,20.0,110.0,0.98869,3.34,0.38,12.8,7,white


## Embeddings

To build topological graphs from the data, Cobalt uses embeddings, which are simply vector representations of each data point that can be used to evaluate how similar data points are to each other. When working with complex data like images or text, some kind of embedding model is necessary to produce good vector representations. However, tabular data naturally comes in a vector-like representation, so we can give it to Cobalt as an embedding with minimal transformation.

For starters:
- we can use the raw data itself as an embedding, or
- we could also use a normalized version of this raw data as an embedding.

Let's get started by looking at both of these.

### Raw Data as Embedding

The embedding needs to be a numpy array, so we simply convert our raw features into a numpy array:

In [12]:
rawEmbedding = X.to_numpy()
rawEmbedding.shape

(6497, 11)

Let's load this data into Cobalt:
1. *Full DataFrame*: We create a CobaltDataset object from the full DataFrame, including features not in `X`.
2. *The embedding matrix*: We use `add_embedding_array()` to add an embedding. Note that every element in the DataFrame needs to have a corresponding row in the embedding array.

In [13]:
# Step 1. Convert the DataFrame to a CobaltDataset and add the embedding array.
ds = cobalt.CobaltDataset(df)
ds.add_embedding_array(rawEmbedding, name="raw_data")

# Step 2. Create a workspace with the CobaltDataset
w = cobalt.Workspace(ds)
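
Because every row of the DataFrame needs a matching row in the embedding array, it can be worth checking the alignment before handing the data to Cobalt. The following is a minimal sketch of such a check using only NumPy and pandas. The helper `check_embedding_alignment` and the toy data are hypothetical and are not part of the Cobalt API.

```python
import numpy as np
import pandas as pd


def check_embedding_alignment(frame: pd.DataFrame, embedding: np.ndarray) -> None:
    """Hypothetical helper: raise if the embedding cannot line up row-for-row with the frame."""
    if embedding.ndim != 2:
        raise ValueError(f"expected a 2D array, got {embedding.ndim}D")
    if embedding.shape[0] != len(frame):
        raise ValueError(
            f"row mismatch: {len(frame)} rows in DataFrame, {embedding.shape[0]} in embedding"
        )
    if not np.isfinite(embedding).all():
        raise ValueError("embedding contains NaN or infinite values")


# Toy stand-in for the wine data (three rows copied from the table above).
toy_df = pd.DataFrame(
    {
        "alcohol": [9.4, 9.8, 11.2],
        "pH": [3.51, 3.20, 3.27],
        "color": ["red", "red", "white"],
    }
)
toy_embedding = toy_df[["alcohol", "pH"]].to_numpy()

check_embedding_alignment(toy_df, toy_embedding)  # passes silently

try:
    check_embedding_alignment(toy_df, toy_embedding[:2])  # one row short
    mismatch_detected = False
except ValueError:
    mismatch_detected = True

print("mismatch detected:", mismatch_detected)
```

In the notebook itself, the same check could be run as `check_embedding_alignment(df, rawEmbedding)` before calling `add_embedding_array()`.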

Now let's have Cobalt create a graph using this embedding. By default, Cobalt builds a graph on the full dataset, but we can specify a subset if we want to zoom in on a particular part of the dataset.

In [14]:
g = w.new_graph(name="raw_data", embedding="raw_data")

Let's also have Cobalt use this graph to find clusters. These will be displayed for us to explore in the UI.

In [15]:
raw_clusters = w.find_clusters(graph=g, run_name="raw_data_cluster")

Let's open the UI and see how Cobalt helps understand the embeddings.

In [None]:
w.ui

If you are new to the Cobalt UI, check out these videos for a walkthrough:
- Cobalt UI Walkthrough: https://www.youtube.com/watch?v=UvIFuTGTRSk
- Image Clustering using Cobalt: https://www.youtube.com/watch?v=h_PUFvE4bvM

Try to play around with the UI, in particular:
- Explore different Coarseness and Connectivity options.
- Color by different features, in particular, color by:
    - total_sulfur_dioxide,
    - free_sulfur_dioxide, and
    - color (the wine type, which defines our two classes).



You should see that `total_sulfur_dioxide` varies very smoothly along the graph, but other features don't seem to have much relationship with the graph structure. Why might that be?
- We used raw features as our representation (or embedding) to generate this graph.
- When building the graph, Cobalt uses the default Euclidean metric to determine which embeddings are most similar (and should therefore be close to each other in the graph).
- Different features in our raw data may have very different ranges.
- This means that some features may have a much bigger impact on the similarity metric than others.

Let's check it out:

In [17]:
dfDescription = df.describe().T
dfDescription['range'] = dfDescription['max'] - dfDescription['min']
dfDescription.sort_values('range')

Unnamed: 0,count,mean,std,min,25%,50%,75%,max,range
density,6497.0,0.994697,0.002999,0.98711,0.99234,0.99489,0.99699,1.03898,0.05187
chlorides,6497.0,0.056034,0.035034,0.009,0.038,0.047,0.065,0.611,0.602
pH,6497.0,3.218501,0.160787,2.72,3.11,3.21,3.32,4.01,1.29
volatile_acidity,6497.0,0.339666,0.164636,0.08,0.23,0.29,0.4,1.58,1.5
citric_acid,6497.0,0.318633,0.145318,0.0,0.25,0.31,0.39,1.66,1.66
sulphates,6497.0,0.531268,0.148806,0.22,0.43,0.51,0.6,2.0,1.78
quality,6497.0,5.818378,0.873255,3.0,5.0,6.0,6.0,9.0,6.0
alcohol,6497.0,10.491801,1.192712,8.0,9.5,10.3,11.3,14.9,6.9
fixed_acidity,6497.0,7.215307,1.296434,3.8,6.4,7.0,7.7,15.9,12.1
residual_sugar,6497.0,5.443235,4.757804,0.6,1.8,3.0,8.1,65.8,65.2


Note that among the raw features, `total_sulfur_dioxide` has the largest mean and largest range. 

So if we use raw features, we might run into issues where some features with larger ranges dominate the others, and this can lead to poor clustering results. 
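
To make this concrete, here is a small sketch that splits the squared Euclidean distance between two wines into per-feature contributions. It uses rows 0 and 3 from the table shown earlier. The variable names are illustrative only, and this is not how Cobalt computes anything internally, just the arithmetic behind the Euclidean metric.

```python
import numpy as np

feature_names = [
    "fixed_acidity", "volatile_acidity", "citric_acid", "residual_sugar",
    "chlorides", "free_sulfur_dioxide", "total_sulfur_dioxide", "density",
    "pH", "sulphates", "alcohol",
]

# Rows 0 and 3 of the dataset, copied from the table shown earlier.
wine_a = np.array([7.4, 0.70, 0.00, 1.9, 0.076, 11.0, 34.0, 0.99780, 3.51, 0.56, 9.4])
wine_b = np.array([11.2, 0.28, 0.56, 1.9, 0.075, 17.0, 60.0, 0.99800, 3.16, 0.58, 9.8])

# Squared Euclidean distance is a sum of per-feature squared differences,
# so each feature's share of that sum shows how much it drives the distance.
squared_diffs = (wine_a - wine_b) ** 2
shares = squared_diffs / squared_diffs.sum()

# Print features from largest to smallest contribution.
for name, share in sorted(zip(feature_names, shares), key=lambda pair: -pair[1]):
    print(f"{name:>22}: {share:6.1%}")

top_feature = feature_names[int(np.argmax(shares))]
```

For this pair, `total_sulfur_dioxide` alone accounts for more than 90% of the squared distance, while features such as `density` and `chlorides` contribute almost nothing.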

This is where feature scaling comes in.

### Scaled Data as Embedding

We can use normalization as a way to scale the features.

There are many ways of doing this, but one common way is to scale the data so that all features have a mean of 0 and standard deviation of 1. Scikit-learn's `StandardScaler` does this nicely:

In [18]:
scaler = StandardScaler()
scaledEmbedding = scaler.fit_transform(rawEmbedding)
scaledEmbedding.shape

(6497, 11)
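
As a quick illustration of what `StandardScaler` does, the sketch below standardizes a toy matrix by hand and compares the result with scikit-learn's output. The toy values are the `alcohol` and `total_sulfur_dioxide` columns of four rows from the table shown earlier; the variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrix: alcohol and total_sulfur_dioxide for four wines from the table above.
toy = np.array([
    [9.4, 34.0],
    [9.8, 67.0],
    [11.2, 92.0],
    [12.8, 110.0],
])

# Manual standardization: subtract the column mean, then divide by the column
# standard deviation (population std, ddof=0, which is what StandardScaler uses).
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

scaled = StandardScaler().fit_transform(toy)

print("matches manual computation:", np.allclose(manual, scaled))
print("column means:", scaled.mean(axis=0).round(6))
print("column stds: ", scaled.std(axis=0).round(6))
```

After scaling, both columns have the same spread, so neither dominates a Euclidean distance computed on the result.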

Now we just repeat the previous steps to add a scaled version of our embedding to the dataset, and create a graph from it.

In [19]:
ds.add_embedding_array(scaledEmbedding, name="scaled_data")

In [20]:
g = w.new_graph(name="scaled_data", embedding="scaled_data")

In [21]:
scaled_clusters = w.find_clusters(graph=g, run_name="scaled_data_cluster")

Now let's return to the UI to see how the scaled embeddings help with the data visualization and clustering. You can either scroll back up to the UI above, or open another view in the cell below. Select `scaled_data` from the dropdown menu to see the impact the scaling had on the graph.

In [None]:
w.ui

First, we see that the data is now somewhat separated into two groups.

If we now color the graph by the `color` feature, we can see that red and white wines largely dominate two distinct clusters, and they look more separated than before. There are 1-2 white-wine-dominated nodes that seem misplaced inside the red wine cluster. You can color the graph by different features to determine which features contribute most to the separation between the clusters, and which features drive the variation within clusters.

Note that we are using Euclidean distance to build the graph here (the default), but we can change it to the Manhattan (L1) distance or other metrics when we add the embedding or create a graph. See [the docs](https://docs.cobalt.bluelightai.com/data_loading.html#creating-embeddings) for a list of some of the available distance metrics.

In [23]:
g_manhattan = w.new_graph(name="scaled_data_manhattan", embedding="scaled_data", metric="manhattan")
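
The choice of metric can change which points count as most similar. The toy sketch below, written in plain NumPy with made-up 2D points, shows a query whose nearest neighbor differs between the Euclidean and Manhattan metrics. It only illustrates the two distance formulas and makes no claim about how Cobalt builds its graphs.

```python
import numpy as np

query = np.array([0.0, 0.0])
candidates = {
    "a": np.array([3.0, 3.0]),  # moderately different in both features
    "b": np.array([0.0, 5.0]),  # very different in one feature only
}


def euclidean(u, v):
    """L2 distance: square root of the summed squared differences."""
    return float(np.sqrt(((u - v) ** 2).sum()))


def manhattan(u, v):
    """L1 distance: sum of the absolute differences."""
    return float(np.abs(u - v).sum())


nearest_euclidean = min(candidates, key=lambda k: euclidean(query, candidates[k]))
nearest_manhattan = min(candidates, key=lambda k: manhattan(query, candidates[k]))

print("nearest under Euclidean:", nearest_euclidean)
print("nearest under Manhattan:", nearest_manhattan)
```

Euclidean distance penalizes a single large difference more heavily than several moderate ones, so it picks `a`, while Manhattan distance treats all differences linearly and picks `b`.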

We can also investigate the data inside any of the clusters Cobalt found. For instance, to see the data table for the first cluster, do this:

In [24]:
scaled_clusters = w.clustering_results["scaled_data_cluster"]

In [25]:
scaled_clusters.groups[0].subset.df

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality,color
401,7.7,0.26,0.30,1.7,0.059,20.0,38.0,0.99490,3.29,0.47,10.8,6,red
494,6.5,0.39,0.23,8.3,0.051,28.0,91.0,0.99520,3.44,0.55,12.1,6,red
836,6.7,0.28,0.28,2.4,0.012,36.0,100.0,0.99064,3.26,0.39,11.7,7,red
837,6.7,0.28,0.28,2.4,0.012,36.0,100.0,0.99064,3.26,0.39,11.7,7,red
1131,5.9,0.19,0.21,1.7,0.045,57.0,135.0,0.99341,3.32,0.44,9.5,5,red
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6486,6.2,0.41,0.22,1.9,0.023,5.0,56.0,0.98928,3.04,0.79,13.0,7,white
6489,6.1,0.34,0.29,2.2,0.036,25.0,100.0,0.98938,3.06,0.44,11.8,6,white
6490,5.7,0.21,0.32,0.9,0.038,38.0,121.0,0.99074,3.24,0.46,10.6,6,white
6495,5.5,0.29,0.30,1.1,0.022,20.0,110.0,0.98869,3.34,0.38,12.8,7,white


Compare the features between this cluster and the rest of the data:

In [26]:
scaled_clusters.groups[0].group_details.feature_descriptions

{'chlorides': 'mean = 0.0343 (rest of data: 0.0586)',
 'density': 'mean = 0.991 (rest of data: 0.995)',
 'alcohol': 'mean = 12.1 (rest of data: 10.3)',
 'color': 'mode = white (99.3%, rest of data: 72.6%)'}

In [27]:
scaled_clusters.groups[0].group_details.feature_stat_tables

{'numerical_comparison_stats':                  feature        mean  complement mean
 7              chlorides    0.034334         0.058554
 2                density    0.990582         0.995175
 11               alcohol   12.121844        10.302502
 4              sulphates    0.424571         0.543659
 8          fixed_acidity    6.328107         7.318339
 3         residual_sugar    3.386021         5.682142
 6                quality    6.405325         5.750215
 0            citric_acid    0.293225         0.321584
 10  total_sulfur_dioxide  108.366864       116.601357
 1    free_sulfur_dioxide   28.252219        30.789297
 9       volatile_acidity    0.321583         0.341766,
 'categorical_comparison_stats':   feature   mode  frequency (%)  complement frequency (%)
 1   color  white      99.260355                 72.616389}
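
If you want a similar group-versus-rest comparison for a subset you have selected yourself, a rough equivalent can be put together with pandas. The sketch below uses a toy DataFrame built from six rows of the table shown earlier and a hypothetical list of selected row labels. It mimics the shape of the tables above, but it is not Cobalt's implementation.

```python
import pandas as pd

# Toy stand-in for the wine data (six rows copied from the table shown earlier).
toy_df = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.8, 11.2, 9.6, 12.8],
    "chlorides": [0.076, 0.098, 0.075, 0.039, 0.047, 0.022],
    "color": ["red", "red", "red", "white", "white", "white"],
})

# Hypothetical selection: row labels of the points in the group of interest.
selected_rows = [3, 5]
in_group = toy_df.index.isin(selected_rows)

# Numerical features: mean inside the group vs. mean in the rest of the data.
numeric = toy_df.select_dtypes("number")
comparison = pd.DataFrame({
    "mean": numeric[in_group].mean(),
    "complement mean": numeric[~in_group].mean(),
})
comparison["difference"] = comparison["mean"] - comparison["complement mean"]

# Categorical feature: most common value inside the group.
group_mode = toy_df.loc[in_group, "color"].mode()[0]

print(comparison)
print("most common color in group:", group_mode)
```

Sorting `comparison` by the absolute value of `difference` (ideally after scaling the features) gives a quick ranking of which features set the group apart.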

Feel free to explore and try out more things!

More details about the UI can be found [in the docs](https://docs.cobalt.bluelightai.com/ui.html).

## Conclusion

- Cobalt gives a powerful way to interact with and visualize complex data.
- The default settings used by Cobalt may not always perfectly separate classes, but they can provide a good starting point for further analysis.
- The UI provides an easy way to explore and interact with the data, while also providing some basic visualizations.
    - While the UI is very helpful for generating insights and for interactive data analysis, it is not the only way to use Cobalt.
    - The API allows for more advanced usage and customization, such as automating workflows or integrating with other tools and systems.
- Check out our other videos to see how Cobalt can be used for other applications.
- Download and use it today!

Let us know what you think:

<a href="https://bluelightai.com/contact">Give Feedback</a> | <a href="https://bluelightai.com/">Our Website</a> | <a href="https://bluelightai.com/blog">Our Blog</a> | <a href="https://docs.cobalt.bluelightai.com/">Cobalt Docs</a> | <a href="https://join.slack.com/t/bluelightaicom/shared_invite/zt-2uj0iu5lh-5WgutuwH82RxAOwuq8ptqg">Slack Community</a>

<div style="display: flex; align-items: center; justify-content: space-between;">
    <div style="flex: 1; text-align: left;">
        <a href="#top" style="text-decoration: none; color: inherit;"> 
            <h3>Top of Page</h3> 
        </a>
    </div>
    <div style="flex: 1; text-align: right;">
        <img style="width:50%;max-width:600px;float:right" alt="Bluelight AI Logo" href="https://bluelightai.com/" src="https://github.com/BlueLightAI/cobalt-examples/blob/main/assets/blai-logo-light.png?raw=true">
    </div>
</div>