@shankar-iyer
Member

Summary

One more (last!) example for vector search.

Checklist

@shankar-iyer requested a review from a team as a code owner on September 3, 2025 07:33
@vercel

vercel bot commented Sep 3, 2025

@shankar-iyer is attempting to deploy a commit to the ClickHouse Team on Vercel.

A member of the Team first needs to authorize it.

@shankar-iyer
Member Author

Thank you @Blargian for all the help in the last few days to get the content ready!

@shankar-iyer
Member Author

@rschu1ze

@vercel

vercel bot commented Sep 3, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Updated (UTC)
clickhouse-docs Ready Ready Preview Sep 3, 2025 11:16am

## Introduction {#introduction}

The [Hacker News dataset](https://news.ycombinator.com/) contains 28.74 million
postings and their vector embeddings. The embeddings were generated using [SentenceTransformers](https://sbert.net/) model [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2). The dimension of each embedding vector is `384`.
Member

Regarding the dimension: Here, we say the dimension is 384, l. 73 says the dimension is 768.

Member Author

Corrected, thanks!

Member

@Blargian left a comment

LGTM. Left a few small formatting suggestions.


## Dataset details {#dataset-details}

The complete dataset with vector embeddings is made available by ClickHouse as a single `Parquet` file in a `S3` bucket : https://clickhouse-datasets.s3.amazonaws.com/hackernews-miniLM/hackernews_part_1_of_1.parquet
Member

Suggested change
- The complete dataset with vector embeddings is made available by ClickHouse as a single `Parquet` file in a `S3` bucket : https://clickhouse-datasets.s3.amazonaws.com/hackernews-miniLM/hackernews_part_1_of_1.parquet
+ The complete dataset with vector embeddings is made available by ClickHouse as a single `Parquet` file in a [`S3` bucket](https://clickhouse-datasets.s3.amazonaws.com/hackernews-miniLM/hackernews_part_1_of_1.parquet)

Member Author

done!

```sql
INSERT INTO hackernews SELECT * FROM s3('https://clickhouse-datasets.s3.amazonaws.com/hackernews-miniLM/hackernews_part_1_of_1.parquet');
```

The loading of 28.74 million rows into the table will take a few minutes.
Member

Suggested change
- The loading of 28.74 million rows into the table will take a few minutes.
+ Inserting 28.74 million rows into the table will take a few minutes.

Member Author

done!


### Build a vector similarity index {#build-vector-similarity-index}

Run the following SQL to define and build a vector similarity index on the `vector` column of the `hackernews` table :
Member

Suggested change
- Run the following SQL to define and build a vector similarity index on the `vector` column of the `hackernews` table :
+ Run the following SQL to define and build a vector similarity index on the `vector` column of the `hackernews` table:

Member Author

done
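
For readers following this thread, here is a minimal Python sketch of the exact (brute-force) search that a vector similarity index is meant to approximate. It assumes cosine distance as the metric, which this excerpt does not state. The function names and the toy 3-dimensional vectors are hypothetical; the dataset's vectors have 384 dimensions.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query, rows, k=2):
    """Exact top-k scan: score every (row_id, vector) pair, then sort by distance."""
    scored = [(cosine_distance(query, vec), row_id) for row_id, vec in rows]
    scored.sort()
    return [row_id for _, row_id in scored[:k]]


# Hypothetical toy rows standing in for (id, vector) pairs from the table.
rows = [(1, [1.0, 0.0, 0.0]), (2, [0.0, 1.0, 0.0]), (3, [0.9, 0.1, 0.0])]
print(nearest([1.0, 0.0, 0.0], rows))
```

A full scan like this touches every row, which is why an index becomes worthwhile at 28.74 million rows.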

[all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model.

An example Python script is provided below to demonstrate how to programmatically generate
embedding vectors using `sentence_transformers1 Python package. The search embedding vector
Member

Suggested change
- embedding vectors using `sentence_transformers1 Python package. The search embedding vector
+ embedding vectors using the `sentence_transformers1 Python package. The search embedding vector

Member Author

done!
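
As a rough illustration of the step discussed above (not the script from this PR), generating a search embedding with the `sentence_transformers` package might look like the sketch below. It assumes the package is installed and the model can be downloaded. The helper names `embed`, `to_array_literal` and `main`, and the sample query text, are hypothetical.

```python
def embed(text, model_name="sentence-transformers/all-MiniLM-L6-v2"):
    """Encode one string into an embedding vector, returned as a list of floats."""
    # Imported lazily so the formatting helper below works without the package.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(text).tolist()


def to_array_literal(vector):
    """Render a vector as a bracketed array literal, e.g. for pasting into a query."""
    return "[" + ", ".join(repr(float(x)) for x in vector) + "]"


def main():
    # Hypothetical query text; replace with the phrase to search for.
    vec = embed("example search phrase")
    print(len(vec))  # vector dimension reported by the model
    print(to_array_literal(vec[:3]))  # first few components only
```

Calling `main()` downloads the model on first use. The printed length gives a quick check that the vector dimension matches the table's `vector` column.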

```
efficiently, gaining popularity for its impressive performance and cost-effectiveness.
```

Code for above application :
Member

Suggested change
- Code for above application :
+ Code for the above application:

Member Author

done!

@Blargian merged commit b94ad8a into ClickHouse:main on Sep 3, 2025
9 of 13 checks passed