diff --git a/docs/getting-started/example-datasets/hacker-news-vector-search.md b/docs/getting-started/example-datasets/hacker-news-vector-search.md
new file mode 100644
index 00000000000..5dde80ddee8
--- /dev/null
+++ b/docs/getting-started/example-datasets/hacker-news-vector-search.md
@@ -0,0 +1,348 @@
+---
+description: 'Dataset containing 28+ million Hacker News postings & their vector embeddings'
+sidebar_label: 'Hacker News Vector Search dataset'
+slug: /getting-started/example-datasets/hackernews-vector-search-dataset
+title: 'Hacker News Vector Search dataset'
+keywords: ['semantic search', 'vector similarity', 'approximate nearest neighbours', 'embeddings']
+---
+
+## Introduction {#introduction}
+
+The [Hacker News dataset](https://news.ycombinator.com/) contains 28.74 million
+postings and their vector embeddings. The embeddings were generated using the [SentenceTransformers](https://sbert.net/) model [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2). Each embedding vector has `384` dimensions.
+
+This dataset can be used to walk through the design, sizing, and performance aspects of a large-scale,
+real-world vector search application built on top of user-generated textual data.
+
+## Dataset details {#dataset-details}
+
+The complete dataset with vector embeddings is made available by ClickHouse as a single `Parquet` file in an [S3 bucket](https://clickhouse-datasets.s3.amazonaws.com/hackernews-miniLM/hackernews_part_1_of_1.parquet).
+
+We recommend that users first run a sizing exercise to estimate the storage and memory requirements for this dataset by referring to the [documentation](../../engines/table-engines/mergetree-family/annindexes.md).
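+As a hedged back-of-the-envelope sketch of that sizing exercise (the exact formulas are in the linked documentation; the dimension and `bf16` values below mirror the index definition used later on this page, and graph-link overhead is deliberately left out), the dominant in-memory cost of the vectors can be estimated in a few lines of Python:
+
+```python
+# Back-of-the-envelope sizing sketch. Assumption: the HNSW index stores one
+# bf16 value (2 bytes) per vector dimension; the graph links add further
+# overhead on top of this, roughly proportional to the M parameter.
+rows = 28_740_000       # postings in the dataset
+dims = 384              # all-MiniLM-L6-v2 embedding dimension
+bytes_per_value = 2     # bf16 quantization
+
+vector_bytes = rows * dims * bytes_per_value
+print(f"Quantized vectors alone: {vector_bytes / 1e9:.1f} GB")
+```
+
+Under these assumptions the quantized vector payload alone is roughly 22 GB, before any graph overhead, which is why the sizing exercise matters before provisioning a server.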
+
+## Steps {#steps}
+
+
+
+### Create table {#create-table}
+
+Create the `hackernews` table to store the postings, their embeddings, and associated attributes:
+
+```sql
+CREATE TABLE hackernews
+(
+    `id` Int32,
+    `doc_id` Int32,
+    `text` String,
+    `vector` Array(Float32),
+    `node_info` Tuple(
+        start Nullable(UInt64),
+        end Nullable(UInt64)),
+    `metadata` String,
+    `type` Enum8('story' = 1, 'comment' = 2, 'poll' = 3, 'pollopt' = 4, 'job' = 5),
+    `by` LowCardinality(String),
+    `time` DateTime,
+    `title` String,
+    `post_score` Int32,
+    `dead` UInt8,
+    `deleted` UInt8,
+    `length` UInt32
+)
+ENGINE = MergeTree
+ORDER BY id;
+```
+
+The `id` is just an incrementing integer. The additional attributes can be used in predicates to understand
+vector similarity search combined with post-filtering/pre-filtering, as explained in the [documentation](../../engines/table-engines/mergetree-family/annindexes.md).
+
+### Load data {#load-table}
+
+To load the dataset from the `Parquet` file, run the following SQL statement:
+
+```sql
+INSERT INTO hackernews SELECT * FROM s3('https://clickhouse-datasets.s3.amazonaws.com/hackernews-miniLM/hackernews_part_1_of_1.parquet');
+```
+
+Inserting the 28.74 million rows into the table takes a few minutes.
+
+### Build a vector similarity index {#build-vector-similarity-index}
+
+Run the following SQL to define and build a vector similarity index on the `vector` column of the `hackernews` table:
+
+```sql
+ALTER TABLE hackernews ADD INDEX vector_index vector TYPE vector_similarity('hnsw', 'cosineDistance', 384, 'bf16', 64, 512);
+
+ALTER TABLE hackernews MATERIALIZE INDEX vector_index SETTINGS mutations_sync = 2;
+```
+
+The parameters and performance considerations for index creation and search are described in the [documentation](../../engines/table-engines/mergetree-family/annindexes.md).
+The statement above uses values of 64 and 512 respectively for the HNSW hyperparameters `M` and `ef_construction`.
+Users need to carefully select optimal values for these parameters by evaluating index build time and search result quality
+for the selected values.
+
+Building and saving the index can take from a few minutes up to hours for the full 28.74 million row dataset, depending on the number of available CPU cores and the storage bandwidth.
+
+### Perform ANN search {#perform-ann-search}
+
+Once the vector similarity index has been built, vector search queries will automatically use it:
+
+```sql title="Query"
+SELECT id, title, text
+FROM hackernews
+ORDER BY cosineDistance(vector, <search vector>)
+LIMIT 10
+```
+
+The first load of the vector index into memory can take a few seconds to minutes.
+
+### Generate embeddings for search query {#generating-embeddings-for-search-query}
+
+[Sentence Transformers](https://www.sbert.net/) provides local, easy-to-use embedding
+models for capturing the semantic meaning of sentences and paragraphs.
+
+This Hacker News dataset contains vector embeddings generated from the
+[all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model.
+
+An example Python script is provided below to demonstrate how to programmatically generate
+embedding vectors using the `sentence_transformers` Python package. The search embedding vector
+is then passed as an argument to the [`cosineDistance()`](/sql-reference/functions/distance-functions#cosineDistance) function in the `SELECT` query.
+
+```python
+from sentence_transformers import SentenceTransformer
+import sys
+
+import clickhouse_connect
+
+print("Initializing...")
+
+model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
+
+chclient = clickhouse_connect.get_client() # ClickHouse credentials here
+
+while True:
+    # Read the search query from the user
+    print("Enter a search query :")
+    input_query = sys.stdin.readline().strip()
+    texts = [input_query]
+
+    # Run the model and obtain the search vector
+    print("Generating the embedding for ", input_query)
+    embeddings = model.encode(texts)
+
+    print("Querying ClickHouse...")
+    params = {'v1': list(embeddings[0]), 'v2': 20}
+    result = chclient.query("SELECT id, title, text FROM hackernews ORDER BY cosineDistance(vector, %(v1)s) LIMIT %(v2)s", parameters=params)
+    print("Results :")
+    for row in result.result_rows:
+        print(row[0], row[2][:100])
+        print("---------")
+```
+
+An example run of the above Python script and the resulting similarity search results are shown below
+(only the first 100 characters of each of the top 20 posts are printed):
+
+```text
+Initializing...
+
+Enter a search query :
+Are OLAP cubes useful
+
+Generating the embedding for "Are OLAP cubes useful"
+
+Querying ClickHouse...
+
+Results :
+
+27742647 smartmic:
+slt2021: OLAP Cube is not dead, as long as you use some form of:
+
+1. GROUP BY multiple fi
+---------
+27744260 georgewfraser:A data mart is a logical organization of data to help humans understand the schema. Wh
+---------
+27761434 mwexler:"We model data according to rigorous frameworks like Kimball or Inmon because we must r
+---------
+28401230 chotmat:
+erosenbe0: OLAP database is just a copy, replica, or archive of data with a schema designe
+---------
+22198879 Merick:+1 for Apache Kylin, it's a great project and awesome open source community. If anyone i
+---------
+27741776 crazydoggers:I always felt the value of an OLAP cube was uncovering questions you may not know to as
+---------
+22189480 shadowsun7:
+_Codemonkeyism: After maintaining an OLAP cube system for some years, I'm not that
+---------
+27742029 smartmic:
+gengstrand: My first exposure to OLAP was on a team developing a front end to Essbase that
+---------
+22364133 irfansharif:
+simo7: I'm wondering how this technology could work for OLAP cubes.
+
+An OLAP cube
+---------
+23292746 scoresmoke:When I was developing my pet project for Web analytics (
+---------
+```
+
+### Summarize search results {#summarize-search-results}
+
+The Python script below extends the previous example: it retrieves the top search results from ClickHouse, concatenates them, and passes them to the `gpt-3.5-turbo` LLM via LangChain to produce a concise summary. (The import lines and the `num_tokens_from_string` signature were lost in this patch; they are reconstructed here from the surviving function body and call sites. Running the script requires an `OPENAI_API_KEY` in the environment.)
+
+```python
+from sentence_transformers import SentenceTransformer
+import sys
+
+import clickhouse_connect
+
+import tiktoken
+from langchain.chat_models import ChatOpenAI
+from langchain.docstore.document import Document
+from langchain.text_splitter import CharacterTextSplitter
+from langchain.chains.summarize import load_summarize_chain
+from langchain.prompts import PromptTemplate
+
+print("Initializing...")
+
+def num_tokens_from_string(string: str, encoding_name: str) -> int:
+    encoding = tiktoken.encoding_for_model(encoding_name)
+    num_tokens = len(encoding.encode(string))
+    return num_tokens
+
+model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
+
+chclient = clickhouse_connect.get_client(compress=False) # ClickHouse credentials here
+
+while True:
+    # Read the search query from the user
+    print("Enter a search topic :")
+    input_query = sys.stdin.readline().strip()
+    texts = [input_query]
+
+    # Run the model and obtain the search or reference vector
+    print("Generating the embedding for ----> ", input_query)
+    embeddings = model.encode(texts)
+
+    print("Querying ClickHouse...")
+    params = {'v1': list(embeddings[0]), 'v2': 100}
+    result = chclient.query("SELECT id, title, text FROM hackernews ORDER BY cosineDistance(vector, %(v1)s) LIMIT %(v2)s", parameters=params)
+
+    # Concatenate all the search results
+    doc_results = ""
+    for row in result.result_rows:
+        doc_results = doc_results + "\n" + row[2]
+
+    print("Initializing gpt-3.5-turbo model")
+    model_name = "gpt-3.5-turbo"
+
+    text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
+        model_name=model_name
+    )
+
+    texts = text_splitter.split_text(doc_results)
+
+    docs = [Document(page_content=t) for t in texts]
+
+    llm = ChatOpenAI(temperature=0, model_name=model_name)
+
+    prompt_template = """
+Write a concise summary of the following in not more than 10 sentences:
+
+
+{text}
+
+
+CONCISE SUMMARY :
+"""
+
+    prompt = PromptTemplate(template=prompt_template, input_variables=["text"])
+
+    num_tokens = num_tokens_from_string(doc_results, model_name)
+
+    gpt_35_turbo_max_tokens = 4096
+    verbose = False
+
+    print("Summarizing search results retrieved from ClickHouse...")
+
+    if num_tokens <= gpt_35_turbo_max_tokens:
+        chain = load_summarize_chain(llm, chain_type="stuff", prompt=prompt, verbose=verbose)
+    else:
+        chain = load_summarize_chain(llm, chain_type="map_reduce",
+                                     map_prompt=prompt, combine_prompt=prompt, verbose=verbose)
+
+    summary = chain.run(docs)
+
+    print(f"Summary from chatgpt-3.5: {summary}")
+```
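+
+The chain-selection logic in the script above can be sketched independently of LangChain: if the concatenated search results fit within the model's context window, they are summarized in one call (`stuff`); otherwise they are split and summarized hierarchically (`map_reduce`). A minimal, dependency-free sketch of that decision follows; the `approx_tokens` helper here is a hypothetical whitespace approximation for illustration, not `tiktoken`:
+
+```python
+def choose_chain_type(num_tokens: int, max_tokens: int = 4096) -> str:
+    """Pick a summarization strategy based on prompt size (sketch)."""
+    return "stuff" if num_tokens <= max_tokens else "map_reduce"
+
+# Hypothetical approximation; real code should use tiktoken for exact counts.
+def approx_tokens(text: str) -> int:
+    return len(text.split())
+
+short_doc = "OLAP cubes aggregate data along dimensions."
+long_doc = "word " * 10_000
+
+print(choose_chain_type(approx_tokens(short_doc)))   # stuff
+print(choose_chain_type(approx_tokens(long_doc)))    # map_reduce
+```
+
+Keeping the threshold tied to the model's context limit (4096 tokens for `gpt-3.5-turbo` in the script above) avoids a failed API call on oversized prompts while still using the cheaper single-call path whenever possible.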