# Analyzing the Keywords in the Hugging Face Croissant Data
_June 2025_

This notebook looks at the keywords found in the Croissant data. Keywords are important metadata for understanding the content and purpose of datasets. However, we assume that most keywords were assigned manually, rather than inferred by LLMs.

Activate Bokeh output for Jupyter notebooks (Bokeh is used for the graphs) and import the DuckDB Python library.

In [1]:
from bokeh.io import output_notebook

output_notebook()

In [2]:
import duckdb

duckdb.sql("SELECT 42").show()  # Sanity check...

┌───────┐
│  42   │
│ int32 │
├───────┤
│    42 │
└───────┘



Load the database that was used to create the static catalog.

In [3]:
con = duckdb.connect("../croissant.duckdb")

Install and load the JSON extension. (The installation step is probably only needed once.)

In [4]:
con.install_extension("json")
con.load_extension("json")

Let's count keyword occurrences, write them to a file, and load them into numpy arrays for plotting. According to the static catalog code [README](../README.md), the processed metadata is in the table `hf_metadata`, and `hf_expanded_metadata` expands each record's keyword list into separate records, one per keyword. The expanded table is the one we will use for most of what follows.
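As a rough illustration of what "expanding" and ranking mean here, the following pure-Python sketch does the same thing on a toy data set. The records and the field names `name` and `keywords` are made up for illustration; they are not the actual schema of `hf_metadata`.

```python
from collections import Counter

# Hypothetical records standing in for rows of hf_metadata.
# The field names "name" and "keywords" are assumptions, not the real schema.
records = [
    {"name": "ds1", "keywords": ["croissant", "text", "mit"]},
    {"name": "ds2", "keywords": ["croissant", "text"]},
    {"name": "ds3", "keywords": ["croissant"]},
]

# "Expand" each keyword list into one (name, keyword) row per keyword.
expanded = [(r["name"], kw) for r in records for kw in r["keywords"]]

# Count rows per keyword and rank from most to least frequent.
ranked = Counter(kw for _, kw in expanded).most_common()
print(ranked)
```

The SQL query that follows performs the same grouping and ordering, but on the real expanded table.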

In [10]:
ranked_keywords_q = """
    SELECT keyword, count(keyword) AS count
    FROM hf_expanded_metadata 
    GROUP BY keyword
    ORDER BY count DESC
    """

In [11]:
con.sql(f"""
    COPY ({ranked_keywords_q}) TO 'ranked_keywords.csv' (HEADER, DELIMITER ',')
    """)

In [14]:
ranked_keywords = con.sql(ranked_keywords_q)
ranked_keywords

┌──────────────────┬───────┐
│     keyword      │ count │
│     varchar      │ int64 │
├──────────────────┼───────┤
│ 🇺🇸 region: us    │ 59373 │
│ croissant        │ 52858 │
│ datasets         │ 52841 │
│ polars           │ 45056 │
│ text             │ 44910 │
│ pandas           │ 34350 │
│ apache-2.0       │ 28232 │
│ parquet          │ 23613 │
│ mit              │ 21020 │
│ < 1k             │ 17642 │
│  ·               │     · │
│  ·               │     · │
│  ·               │     · │
│ holocaust        │     2 │
│ health-care      │     2 │
│ life sciences    │     2 │
│ ocl              │     2 │
│ arxiv:1911.11641 │     2 │
│ hate speech      │     2 │
│ vector calculus  │     2 │
│ contact          │     2 │
│ protection       │     2 │
│ routing          │     2 │
├──────────────────┴───────┤
│ ? rows         2 columns │
└──────────────────────────┘

There are 59,397 records in `hf_metadata` (the "unexpanded" version of `hf_expanded_metadata`), as the next query confirms. So it appears that nearly every record has the keyword `🇺🇸 region: us`; its count of 59,373 falls short of the total by only 24.

In [13]:
con.sql('select count() from hf_metadata;')

┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│        59397 │
└──────────────┘
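To put the top counts in perspective, here is a small sketch that converts them to fractions of the total record count. The numbers are copied from the outputs above. It assumes a keyword is listed at most once per record, which this notebook has not verified, so treat the fractions as approximate.

```python
total_records = 59397  # count() of hf_metadata, from the query above

region_key = "🇺🇸 region: us"

# Counts copied from the ranked keyword table above.
top_counts = {
    region_key: 59373,
    "croissant": 52858,
    "datasets": 52841,
}

# Assumption (not verified here): a keyword appears at most once per record,
# so a keyword's count equals the number of records that carry it.
coverage = {kw: n / total_records for kw, n in top_counts.items()}
missing_region = total_records - top_counts[region_key]

for kw, frac in coverage.items():
    print(f"{kw}: {frac:.2%}")
print("records without the region keyword:", missing_region)
```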

In [15]:
ranked_keywords_np = ranked_keywords.fetchnumpy()

To plot from highest count to lowest on the `y` axis, we need an index for the `x` axis. `fetchnumpy()` returns a dictionary of numpy arrays, one per column, so let's add an array of indices to that dictionary.

In [32]:
import numpy as np

In [33]:
ranked_keywords_np['position'] = np.arange(ranked_keywords_np['keyword'].size)

In [34]:
ranked_keywords_np

{'keyword': array(['🇺🇸 region: us', 'croissant', 'datasets', ..., 'bofip',
        'arxiv:2309.12871', 'doi:10.57967/hf/1706'],
       shape=(22528,), dtype=object),
 'count': array([59373, 52858, 52841, ...,     1,     1,     1], shape=(22528,)),
 'position': array([    0,     1,     2, ..., 22525, 22526, 22527], shape=(22528,))}

Now let's plot the counts. Do they form a power-law distribution? Both axes are logarithmic, so a power law would show up as a roughly straight line.

In [21]:
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure, show

In [35]:
ranked_keywords_cds = ColumnDataSource(data=ranked_keywords_np)

In [37]:
ranked_keywords_np_tooltips = [
    ("Keyword", "@keyword"),
    ("Count", "@count"),
    ("Position", "@position")
]

p2 = figure(
    toolbar_location=None,
    x_axis_type="log",
    y_axis_type="log",
    tools="hover",  # add the hover tool to the figure (the hover tool will work even if the toolbar is hidden)
    tooltips=ranked_keywords_np_tooltips,
    height=300,
)
p2.scatter(
    x="position", 
    y="count",   
    source=ranked_keywords_cds,
)

show(p2)

So the distribution is roughly a power law.
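One way to make "roughly a power law" more quantitative is to fit a straight line to log(count) versus log(rank) and look at the slope. The sketch below is only an illustration under stated assumptions: it uses synthetic counts of the form 60000 / rank rather than the notebook's data, the helper `loglog_slope` is a hypothetical name, and ranks start at 1 (not 0, as `position` does) because log(0) is undefined.

```python
import math

def loglog_slope(counts):
    """Least-squares slope of log(count) against log(rank), ranks starting at 1."""
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Synthetic, exactly Zipf-like counts (count = 60000 / rank); NOT the real data.
synthetic_counts = [60000 / rank for rank in range(1, 1001)]

slope = loglog_slope(synthetic_counts)
print(f"fitted log-log slope: {slope:.3f}")
```

For an exact power law the fitted slope equals the exponent. On real data, a slope that changes noticeably between the head and the tail of the ranking would suggest the power law is only approximate.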

Let's look at _counts of counts_, which tell us how many keywords occur just once, twice, etc. As we'll see, there are almost 12,000 keywords that appear just once. We will also grab an arbitrary sample of up to five keywords for each count value.
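To make the idea concrete, here is a pure-Python sketch of a count of counts on a toy dictionary. The keyword counts are invented for illustration and are not taken from the database.

```python
from collections import Counter, defaultdict

# Hypothetical keyword -> occurrence counts; NOT the real data.
keyword_counts = {
    "text": 5, "rust": 2, "jfk": 1, "coral reef": 2, "bofip": 1, "ocl": 1,
}

# How many keywords share each occurrence count?
count_of_count = Counter(keyword_counts.values())

# Keep up to five example keywords per occurrence count.
samples = defaultdict(list)
for kw, n in keyword_counts.items():
    if len(samples[n]) < 5:
        samples[n].append(kw)

# Order by count_of_count, largest first, as the query does.
for n, how_many in count_of_count.most_common():
    print(n, how_many, samples[n])
```

Here three keywords occur once, two occur twice, and one occurs five times. The query below computes the same kind of summary over all keywords.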

In [40]:
counted_ranked_keywords_q = """
    WITH kc AS (
        SELECT keyword, count(keyword) AS count
        FROM hf_expanded_metadata 
        GROUP BY keyword
    )
    SELECT list(keyword)[0:5] AS keywords_first_5, count, count(count) AS count_of_count
    FROM kc
    GROUP BY count
    ORDER BY count_of_count DESC
    """

In [41]:
counted_ranked_keywords = con.sql(counted_ranked_keywords_q)
counted_ranked_keywords

┌───────────────────────────────────────────────────────────────────────────────────────────────────┬───────┬────────────────┐
│                                         keywords_first_5                                          │ count │ count_of_count │
│                                             varchar[]                                             │ int64 │     int64      │
├───────────────────────────────────────────────────────────────────────────────────────────────────┼───────┼────────────────┤
│ [conflict-resolution, 'arxiv:2407.11212', jfk, information-integrity, molecule_image]             │     1 │          11782 │
│ ['arxiv:2505.15935', coral reef, two_arms, american-express, payments]                            │     2 │           6009 │
│ ['arxiv:2504.05181', tagging, code monétaire et financier, cdt_trial, move]                       │     3 │           1412 │
│ [prompt injection, cryptography, rust, text2sql, transportation]                                  │     4 │  

Looking at the list of keywords, it's perhaps not surprising that certain arXiv papers would be mentioned only once, presumably in a dataset published with the paper. Other low-occurrence keywords, like `asia`, are a bit surprising, but this analysis doesn't consider possible related keywords, which may have much higher occurrences.

At the other end of the scale, some keywords occur very frequently, like `cc-by-4.0`, which occurs 5059 times (not shown), but no other keyword occurs exactly 5059 times. The `count_of_count` is expected to be very low for large counts, because few keywords reach such counts and those counts are spread over a wide range, so two of them rarely coincide.

In [42]:
counted_ranked_keywords_np = counted_ranked_keywords.fetchnumpy()

Let's plot this data.

In [43]:
crk = ColumnDataSource(data=counted_ranked_keywords_np)

In [44]:
counted_ranked_keywords_np_tooltips = [
    ("Count", "@count"),
    ("#", "@count_of_count"),
]

crk_plot = figure(
    toolbar_location=None,
    x_axis_type="log",
    y_axis_type="log",
    tools="hover",  # add the hover tool to the figure (the hover tool will work even if the toolbar is hidden)
    tooltips=counted_ranked_keywords_np_tooltips,
    height=300,
)
crk_plot.scatter(
    x="count",   
    y="count_of_count",
    source=crk,
)

show(crk_plot)

Similarly, there is a rough power-law distribution at the beginning, for keywords with just a few occurrences (under ~50), and then the distribution gets more granular. This is to be expected. Because the list of possible keywords is effectively unlimited, we would expect a lot of single-occurrence keywords, fewer double-occurrence keywords, even fewer triple-occurrence keywords, etc. As the counts get higher, however, the probability that two keywords happen to have exactly the same count drops quickly, so there are many `count_of_count` values of 1, 2, 3, and 4.