Add HackerNews dataset for vector search #4378
Thank you @Blargian for all the help in the last few days to get the content ready! |
| ## Introduction {#introduction} | ||
|
|
||
| The [Hacker News dataset](https://news.ycombinator.com/) contains 28.74 million | ||
| postings and their vector embeddings. The embeddings were generated using [SentenceTransformers](https://sbert.net/) model [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2). The dimension of each embedding vector is `384`. |
Regarding the dimension: here we say it is 384, but l. 73 says it is 768.
Corrected, thanks!
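A mismatch like the one flagged above (384 vs. 768) can be caught with a small check. The following is only a sketch: `check_dimension` and `model_dimension` are hypothetical helper names, and `model_dimension` assumes the `sentence_transformers` package is installed and the model can be downloaded.

```python
EXPECTED_DIM = 384  # dimension stated in the Introduction for all-MiniLM-L6-v2


def check_dimension(vector, expected=EXPECTED_DIM):
    """Return True if the vector has the documented length, else raise ValueError."""
    if len(vector) != expected:
        raise ValueError(f"expected {expected} dimensions, got {len(vector)}")
    return True


def model_dimension(model_name="sentence-transformers/all-MiniLM-L6-v2"):
    """Ask the model itself for its embedding size (needs sentence_transformers)."""
    # Imported lazily so check_dimension works without the package installed.
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(model_name).get_sentence_embedding_dimension()
```

Comparing `model_dimension()` against the number quoted in the docs is one way to keep the two sections consistent.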
Blargian left a comment:
LGTM. Left a few small formatting suggestions.
|
|
||
| ## Dataset details {#dataset-details} | ||
|
|
||
| The complete dataset with vector embeddings is made available by ClickHouse as a single `Parquet` file in a `S3` bucket : https://clickhouse-datasets.s3.amazonaws.com/hackernews-miniLM/hackernews_part_1_of_1.parquet |
| The complete dataset with vector embeddings is made available by ClickHouse as a single `Parquet` file in a `S3` bucket : https://clickhouse-datasets.s3.amazonaws.com/hackernews-miniLM/hackernews_part_1_of_1.parquet | |
| The complete dataset with vector embeddings is made available by ClickHouse as a single `Parquet` file in an [`S3` bucket](https://clickhouse-datasets.s3.amazonaws.com/hackernews-miniLM/hackernews_part_1_of_1.parquet) |
done!
| INSERT INTO hackernews SELECT * FROM s3('https://clickhouse-datasets.s3.amazonaws.com/hackernews-miniLM/hackernews_part_1_of_1.parquet'); | ||
| ``` | ||
|
|
||
| The loading of 28.74 million rows into the table will take a few minutes. |
| The loading of 28.74 million rows into the table will take a few minutes. | |
| Inserting 28.74 million rows into the table will take a few minutes. |
done!
|
|
||
| ### Build a vector similarity index {#build-vector-similarity-index} | ||
|
|
||
| Run the following SQL to define and build a vector similarity index on the `vector` column of the `hackernews` table : |
| Run the following SQL to define and build a vector similarity index on the `vector` column of the `hackernews` table : | |
| Run the following SQL to define and build a vector similarity index on the `vector` column of the `hackernews` table: |
done
| [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model. | ||
|
|
||
| An example Python script is provided below to demonstrate how to programmatically generate | ||
| embedding vectors using `sentence_transformers1 Python package. The search embedding vector |
| embedding vectors using `sentence_transformers1 Python package. The search embedding vector | |
| embedding vectors using the `sentence_transformers` Python package. The search embedding vector |
done!
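For readers of this thread, here is a rough sketch of the flow the paragraph under review describes: embed a search phrase, then pass the vector to a similarity query. It is not the script from the PR. The helper names `embed` and `build_search_query` are hypothetical, the selected columns `id`, `title` and `text` are assumed for illustration, and `embed` needs the `sentence_transformers` package. The `hackernews` table and `vector` column names come from the diff above.

```python
def embed(text, model_name="sentence-transformers/all-MiniLM-L6-v2"):
    """Return the embedding of `text` as a list of floats (needs sentence_transformers)."""
    # Imported lazily so the query builder below works without the package.
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(model_name).encode(text).tolist()


def build_search_query(embedding, table="hackernews", column="vector", limit=10):
    """Build a ClickHouse query ordering rows by cosine distance to `embedding`.

    The selected columns (id, title, text) are assumptions for illustration.
    """
    vector_literal = "[" + ", ".join(repr(float(x)) for x in embedding) + "]"
    return (
        f"SELECT id, title, text FROM {table} "
        f"ORDER BY cosineDistance({column}, {vector_literal}) ASC "
        f"LIMIT {int(limit)}"
    )
```

A call such as `build_search_query(embed("some search phrase"))` would yield a query string to send through any ClickHouse client. Inlining the vector as an array literal keeps the sketch dependency-free; a real client would more likely pass it as a query parameter.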
| efficiently, gaining popularity for its impressive performance and cost-effectiveness. | ||
| ``` | ||
|
|
||
| Code for above application : |
| Code for above application : | |
| Code for the above application: |
done!
Summary
One more (last!) example for vector search.
Checklist