From 2884cc7944da1cb434c9eb6c4fe9ecfcc25985d7 Mon Sep 17 00:00:00 2001 From: Shankar Iyer Date: Wed, 20 Aug 2025 06:15:14 +0000 Subject: [PATCH 1/8] Initial version --- .../example-datasets/dbpedia.md | 220 ++++++++++++++++++ 1 file changed, 220 insertions(+) create mode 100644 docs/getting-started/example-datasets/dbpedia.md diff --git a/docs/getting-started/example-datasets/dbpedia.md b/docs/getting-started/example-datasets/dbpedia.md new file mode 100644 index 00000000000..88df9650d1a --- /dev/null +++ b/docs/getting-started/example-datasets/dbpedia.md @@ -0,0 +1,220 @@ +--- +description: 'Dataset containing 1 million articles from Wikipedia and their vector embeddings" +sidebar_label: 'dbpedia dataset' +slug: /getting-started/example-datasets/dbpedia-dataset +title: 'dbpedia dataset' +--- + +The [dbpedia dataset](https://huggingface.co/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M) contains 1 million articles from Wikipedia and their vector embeddings generated using `text-embedding-3-large` model from OpenAI. + +The dataset is an excellent starter dataset to understand semantic search, vector embeddings and Generative AI. We use this dataset to demonstrate [approximate nearest neighbor search](../../engines/table-engines/mergetree-family/annindexes.md) in ClickHouse and a simple Q & A application. + +## Data preparation {#data-preparation} + +The dataset consists of 26 `Parquet` files located at +converts them to CSV and imports them into ClickHouse. You can use the following `download.sh` script for that: + + +```bash +seq 0 409 | xargs -P1 -I{} bash -c './download.sh {}' +``` + +The dataset is split into 410 files, each file contains ca. 1 million rows. If you like to work with a smaller subset of the data, simply adjust the limits, e.g. `seq 0 9 | ...`. + +(The python script above is very slow (~2-10 minutes per file), takes a lot of memory (41 GB per file), and the resulting csv files are big (10 GB each), so be careful. 
If you have enough RAM, increase the `-P1` number for more parallelism. If this is still too slow, consider coming up with a better ingestion procedure - maybe converting the .npy files to parquet, then doing all the other processing with clickhouse.) + +## Create table {#create-table} + +Create the `dbpedia` table to store the article id, title, text and embedding vector : + +```sql +CREATE TABLE dbpedia +( + id String, + title String, + text String, + vector Array(Float32) CODEC(NONE) +) ENGINE = MergeTree ORDER BY (id); + +``` + +To load the dataset from the Parquet files, + +```sql +INSERT INTO laion FROM INFILE '{path_to_csv_files}/*.csv' +``` + +## Run a brute-force ANN search (without ANN index) {#run-a-brute-force-ann-search-without-ann-index} + +To run a brute-force approximate nearest neighbor search, run: + +```sql +SELECT url, caption FROM laion ORDER BY L2Distance(image_embedding, {target:Array(Float32)}) LIMIT 30 +``` + +`target` is an array of 512 elements and a client parameter. A convenient way to obtain such arrays will be presented at the end of the article. For now, we can run the embedding of a random cat picture as `target`. 
+ +**Result** + +```markdown +┌─url───────────────────────────────────────────────────────────────────────────────────────────────────────────┬─caption────────────────────────────────────────────────────────────────┐ +│ https://s3.amazonaws.com/filestore.rescuegroups.org/6685/pictures/animals/13884/13884995/63318230_463x463.jpg │ Adoptable Female Domestic Short Hair │ +│ https://s3.amazonaws.com/pet-uploads.adoptapet.com/8/b/6/239905226.jpg │ Adopt A Pet :: Marzipan - New York, NY │ +│ http://d1n3ar4lqtlydb.cloudfront.net/9/2/4/248407625.jpg │ Adopt A Pet :: Butterscotch - New Castle, DE │ +│ https://s3.amazonaws.com/pet-uploads.adoptapet.com/e/e/c/245615237.jpg │ Adopt A Pet :: Tiggy - Chicago, IL │ +│ http://pawsofcoronado.org/wp-content/uploads/2012/12/rsz_pumpkin.jpg │ Pumpkin an orange tabby kitten for adoption │ +│ https://s3.amazonaws.com/pet-uploads.adoptapet.com/7/8/3/188700997.jpg │ Adopt A Pet :: Brian the Brad Pitt of cats - Frankfort, IL │ +│ https://s3.amazonaws.com/pet-uploads.adoptapet.com/8/b/d/191533561.jpg │ Domestic Shorthair Cat for adoption in Mesa, Arizona - Charlie │ +│ https://s3.amazonaws.com/pet-uploads.adoptapet.com/0/1/2/221698235.jpg │ Domestic Shorthair Cat for adoption in Marietta, Ohio - Daisy (Spayed) │ +└───────────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────────────────────────────────────────────────────────────────────┘ + +8 rows in set. Elapsed: 6.432 sec. Processed 19.65 million rows, 43.96 GB (3.06 million rows/s., 6.84 GB/s.) 
+``` + +## Run a ANN with an ANN index {#run-a-ann-with-an-ann-index} + +Create a new table with an ANN index and insert the data from the existing table: + +```sql +CREATE TABLE laion_annoy +( + `id` Int64, + `url` String, + `caption` String, + `NSFW` String, + `similarity` Float32, + `image_embedding` Array(Float32), + `text_embedding` Array(Float32), + INDEX annoy_image image_embedding TYPE annoy(), + INDEX annoy_text text_embedding TYPE annoy() +) +ENGINE = MergeTree +ORDER BY id +SETTINGS index_granularity = 8192; + +INSERT INTO laion_annoy SELECT * FROM laion; +``` + +By default, Annoy indexes use the L2 distance as metric. Further tuning knobs for index creation and search are described in the Annoy index [documentation](../../engines/table-engines/mergetree-family/annindexes.md). Let's check now again with the same query: + +```sql +SELECT url, caption FROM laion_annoy ORDER BY l2Distance(image_embedding, {target:Array(Float32)}) LIMIT 8 +``` + +**Result** + +```response +┌─url──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─caption──────────────────────────────────────────────────────────────┐ +│ http://tse1.mm.bing.net/th?id=OIP.R1CUoYp_4hbeFSHBaaB5-gHaFj │ bed bugs and pets can cats carry bed bugs pets adviser │ +│ http://pet-uploads.adoptapet.com/1/9/c/1963194.jpg?336w │ Domestic Longhair Cat for adoption in Quincy, Massachusetts - Ashley │ +│ https://thumbs.dreamstime.com/t/cat-bed-12591021.jpg │ Cat on bed Stock Image │ +│ https://us.123rf.com/450wm/penta/penta1105/penta110500004/9658511-portrait-of-british-short-hair-kitten-lieing-at-sofa-on-sun.jpg │ Portrait of british short hair kitten lieing at sofa on sun. 
│ +│ https://www.easypetmd.com/sites/default/files/Wirehaired%20Vizsla%20(2).jpg │ Vizsla (Wirehaired) image 3 │ +│ https://images.ctfassets.net/yixw23k2v6vo/0000000200009b8800000000/7950f4e1c1db335ef91bb2bc34428de9/dog-cat-flickr-Impatience_1.jpg?w=600&h=400&fm=jpg&fit=thumb&q=65&fl=progressive │ dog and cat image │ +│ https://i1.wallbox.ru/wallpapers/small/201523/eaa582ee76a31fd.jpg │ cats, kittens, faces, tonkinese │ +│ https://www.baxterboo.com/images/breeds/medium/cairn-terrier.jpg │ Cairn Terrier Photo │ +└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴──────────────────────────────────────────────────────────────────────┘ + +8 rows in set. Elapsed: 0.641 sec. Processed 22.06 thousand rows, 49.36 MB (91.53 thousand rows/s., 204.81 MB/s.) +``` + +The speed increased significantly at the cost of less accurate results. This is because the ANN index only provide approximate search results. Note the example searched for similar image embeddings, yet it is also possible to search for positive image caption embeddings. + +## Q & A Demo Application {#q-and-a-demo-application} + +Above examples demonstrated semantic search and document retrieval using ClickHouse. A very simple but high potential Generative AI example application is presented now. + +The application performs the following steps : + +1. Accepts a _topic_ as input from the user +2. Generates an embedding vector for the _topic_ by invoking OpenAI API with model `text-embedding-3-large` +3. Retrieves highly relevant Wikipedia articles/documents using vector similarity search on the `dbpedia` table +4. Accepts a free-form question in natural language from the user relating to the _topic_ +5. Uses the OpenAI `gpt-3.5-turbo` Chat API to answer the question based on the knowledge in the documents retrieved in step #3. 
+ The documents retrieved in step #3 are passed as _context_ to the Chat API and are the key link in Generative AI. + +First a couple of conversation examples by running the Q & A application are listed below, followed by the code +for the Q & A application. Running the application requires an OpenAI API key to be set in the environment +variable `OPENAI_API_KEY`. + +```shell +$ python3 QandA.py + +Enter a topic : FIFA world cup 1990 +Generating the embedding for 'FIFA world cup 1990' and collecting 100 articles related to it from ClickHouse... + +Enter your question : Who won the golden boot +Salvatore Schillaci of Italy won the Golden Boot at the 1990 FIFA World Cup. + + +Enter a topic : Cricket world cup +Generating the embedding for 'Cricket world cup' and collecting 100 articles related to it from ClickHouse... + +Enter your question : Which country has hosted the world cup most times +England and Wales have hosted the Cricket World Cup the most times, with the tournament being held in these countries five times - in 1975, 1979, 1983, 1999, and 2019. 
+ +$ +``` + +Code : + +```Python +import sys +import time +from openai import OpenAI +import clickhouse_connect + +ch_client = clickhouse_connect.get_client(compress=False) # Pass ClickHouse credentials here +openai_client = OpenAI() # Set the OPENAI_API_KEY environment variable + +def get_embedding(text, model): + text = text.replace("\n", " ") + return openai_client.embeddings.create(input = [text], model=model, dimensions=1536).data[0].embedding + +while True: + # Take the topic of interest from user + print("Enter a topic : ", end="", flush=True) + input_query = sys.stdin.readline() + input_query = input_query.rstrip() + + # Generate an embedding vector for the search topic and query ClickHouse + print("Generating the embedding for '" + input_query + "' and collecting 100 articles related to it from ClickHouse..."); + embedding = get_embedding(input_query, + model='text-embedding-3-large') + + params = {'v1':embedding, 'v2':100} + result = ch_client.query("SELECT id,title,text FROM dbpedia ORDER BY cosineDistance(vector, %(v1)s) LIMIT %(v2)s", parameters=params) + + # Collect all the matching articles/documents + results = "" + for row in result.result_rows: + results = results + row[2] + + print("\nEnter your question : ", end="", flush=True) + question = sys.stdin.readline(); + + # Prompt for the OpenAI Chat API + query = f"""Use the below content to answer the subsequent question. If the answer cannot be found, write "I don't know." + +Content: +\"\"\" +{results} +\"\"\" + +Question: {question}""" + + GPT_MODEL = "gpt-3.5-turbo" + response = openai_client.chat.completions.create( + messages=[ + {'role': 'system', 'content': "You answer questions about {input_query}."}, + {'role': 'user', 'content': query}, + ], + model=GPT_MODEL, + temperature=0, + ) + + # Print the answer to the question! 
+ print(response.choices[0].message.content) + print("\n") +``` + From 2f2cd340d9b19f22756bdc88a9a1c90d1ff9f117 Mon Sep 17 00:00:00 2001 From: Shankar Iyer Date: Wed, 20 Aug 2025 06:54:55 +0000 Subject: [PATCH 2/8] save --- .../example-datasets/dbpedia.md | 137 ++++++++---------- 1 file changed, 59 insertions(+), 78 deletions(-) diff --git a/docs/getting-started/example-datasets/dbpedia.md b/docs/getting-started/example-datasets/dbpedia.md index 88df9650d1a..9e030f71b86 100644 --- a/docs/getting-started/example-datasets/dbpedia.md +++ b/docs/getting-started/example-datasets/dbpedia.md @@ -7,21 +7,11 @@ title: 'dbpedia dataset' The [dbpedia dataset](https://huggingface.co/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M) contains 1 million articles from Wikipedia and their vector embeddings generated using `text-embedding-3-large` model from OpenAI. -The dataset is an excellent starter dataset to understand semantic search, vector embeddings and Generative AI. We use this dataset to demonstrate [approximate nearest neighbor search](../../engines/table-engines/mergetree-family/annindexes.md) in ClickHouse and a simple Q & A application. +The dataset is an excellent starter dataset to understand vector embeddings, vector similarity search and Generative AI. We use this dataset to demonstrate [approximate nearest neighbor search](../../engines/table-engines/mergetree-family/annindexes.md) in ClickHouse and a simple but powerful Q & A application. -## Data preparation {#data-preparation} +## Dataset details {#dataset-details} -The dataset consists of 26 `Parquet` files located at -converts them to CSV and imports them into ClickHouse. You can use the following `download.sh` script for that: - - -```bash -seq 0 409 | xargs -P1 -I{} bash -c './download.sh {}' -``` - -The dataset is split into 410 files, each file contains ca. 1 million rows. If you like to work with a smaller subset of the data, simply adjust the limits, e.g. `seq 0 9 | ...`. 
- -(The python script above is very slow (~2-10 minutes per file), takes a lot of memory (41 GB per file), and the resulting csv files are big (10 GB each), so be careful. If you have enough RAM, increase the `-P1` number for more parallelism. If this is still too slow, consider coming up with a better ingestion procedure - maybe converting the .npy files to parquet, then doing all the other processing with clickhouse.) +The dataset consists of 26 `Parquet` files located under https://huggingface.co/api/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M/parquet/default/train/. The files are named `0.parquet`, `1.parquet`, ..., `25.parquet`. To view some example rows of the dataset, please visit https://huggingface.co/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M. ## Create table {#create-table} @@ -38,87 +28,78 @@ CREATE TABLE dbpedia ``` -To load the dataset from the Parquet files, +## Load table {#load-table} -```sql -INSERT INTO laion FROM INFILE '{path_to_csv_files}/*.csv' -``` +To load the dataset from the Parquet files, run the following shell command : -## Run a brute-force ANN search (without ANN index) {#run-a-brute-force-ann-search-without-ann-index} +```shell +``` -To run a brute-force approximate nearest neighbor search, run: +Alternatively, individual SQL statements can be run as shown below for each of the 25 Parquet files : ```sql -SELECT url, caption FROM laion ORDER BY L2Distance(image_embedding, {target:Array(Float32)}) LIMIT 30 -``` +INSERT INTO dbpedia SELECT _id, title, text, "text-embedding-3-large-1536-embedding" FROM url('https://huggingface.co/api/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M/parquet/default/train/0.parquet') SETTINGS max_http_get_redirects=5,enable_url_encoding=0; +INSERT INTO dbpedia SELECT _id, title, text, "text-embedding-3-large-1536-embedding" FROM 
url('https://huggingface.co/api/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M/parquet/default/train/1.parquet') SETTINGS max_http_get_redirects=5,enable_url_encoding=0; +... +INSERT INTO dbpedia SELECT _id, title, text, "text-embedding-3-large-1536-embedding" FROM url('https://huggingface.co/api/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M/parquet/default/train/25.parquet') SETTINGS max_http_get_redirects=5,enable_url_encoding=0; -`target` is an array of 512 elements and a client parameter. A convenient way to obtain such arrays will be presented at the end of the article. For now, we can run the embedding of a random cat picture as `target`. +``` -**Result** +## Semantic Search +Recommended reading : https://platform.openai.com/docs/guides/embeddings -```markdown -┌─url───────────────────────────────────────────────────────────────────────────────────────────────────────────┬─caption────────────────────────────────────────────────────────────────┐ -│ https://s3.amazonaws.com/filestore.rescuegroups.org/6685/pictures/animals/13884/13884995/63318230_463x463.jpg │ Adoptable Female Domestic Short Hair │ -│ https://s3.amazonaws.com/pet-uploads.adoptapet.com/8/b/6/239905226.jpg │ Adopt A Pet :: Marzipan - New York, NY │ -│ http://d1n3ar4lqtlydb.cloudfront.net/9/2/4/248407625.jpg │ Adopt A Pet :: Butterscotch - New Castle, DE │ -│ https://s3.amazonaws.com/pet-uploads.adoptapet.com/e/e/c/245615237.jpg │ Adopt A Pet :: Tiggy - Chicago, IL │ -│ http://pawsofcoronado.org/wp-content/uploads/2012/12/rsz_pumpkin.jpg │ Pumpkin an orange tabby kitten for adoption │ -│ https://s3.amazonaws.com/pet-uploads.adoptapet.com/7/8/3/188700997.jpg │ Adopt A Pet :: Brian the Brad Pitt of cats - Frankfort, IL │ -│ https://s3.amazonaws.com/pet-uploads.adoptapet.com/8/b/d/191533561.jpg │ Domestic Shorthair Cat for adoption in Mesa, Arizona - Charlie │ -│ https://s3.amazonaws.com/pet-uploads.adoptapet.com/0/1/2/221698235.jpg │ Domestic 
Shorthair Cat for adoption in Marietta, Ohio - Daisy (Spayed) │ -└───────────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────────────────────────────────────────────────────────────────────┘ +Semantic search (or referred to as _similarity search_) using vector embeddings involes +the following steps : -8 rows in set. Elapsed: 6.432 sec. Processed 19.65 million rows, 43.96 GB (3.06 million rows/s., 6.84 GB/s.) -``` +- Accept a search query from user in natural language e.g ‘Tell me some scenic rail journeys”, “Suspense novels set in Europe” etc +- Generate embedding vector for the search query using the LLM model +- Find nearest neighbours to the search embedding vector in the dataset -## Run a ANN with an ANN index {#run-a-ann-with-an-ann-index} +The _nearest neighbours_ are documents, images or content that are results relevant to the user query. +The results are the key input to Retrieval Augmented Generation (RAG) in Generative AI applications. -Create a new table with an ANN index and insert the data from the existing table: +## Experiment with KNN Search +KNN (k - Nearest Neighbours) search or brute force search involves calculating distance of each vector in the dataset +to the search embedding vector and then ordering the distances to get the nearest neighbours. With the `dbpediai` dataset, +a quick technique to visually observe semantic search is to use embedding vectors from the dataset itself as search +vectors. 
Example :
 
 ```sql
-CREATE TABLE laion_annoy
-(
-    `id` Int64,
-    `url` String,
-    `caption` String,
-    `NSFW` String,
-    `similarity` Float32,
-    `image_embedding` Array(Float32),
-    `text_embedding` Array(Float32),
-    INDEX annoy_image image_embedding TYPE annoy(),
-    INDEX annoy_text text_embedding TYPE annoy()
-)
-ENGINE = MergeTree
-ORDER BY id
-SETTINGS index_granularity = 8192;
-
-INSERT INTO laion_annoy SELECT * FROM laion;
+SELECT id, title
+FROM dbpedia
+ORDER BY cosineDistance(vector, ( SELECT vector FROM dbpedia WHERE id = '') ) ASC
+LIMIT 20
+
+ ┌─id────────────────────────────────────────┬─title───────────────────────────┐
+ 1. │ │ The Remains of the Day │
+ 2. │ │ The Remains of the Day (film) │
+ 3. │ │ Never Let Me Go (novel) │
+ 4. │ │ Last Orders │
+ 5. │ │ The Unconsoled │
+ 6. │ │ The Hours (novel) │
+ 7. │ │ An Artist of the Floating World │
+ 8. │ │ Heat and Dust │
+ 9. │ │ A Pale View of Hills │
+10. │ │ Howards End (film) │
+11. │ │ When We Were Orphans │
+12. │ │ A Passage to India (film) │
+13. │ │ Memoirs of a Survivor │
+14. │ │ The Child in Time │
+15. │ │ The Sea, the Sea │
+16. │ │ The Master (novel) │
+17. │ │ The Memorial │
+18. │ │ The Hours (film) │
+19. │ │ Human Remains (film) │
+20. │ │ Kazuo Ishiguro │
+ └───────────────────────────────────────────┴─────────────────────────────────┘
 ```
 
-By default, Annoy indexes use the L2 distance as metric. Further tuning knobs for index creation and search are described in the Annoy index [documentation](../../engines/table-engines/mergetree-family/annindexes.md). Let's check now again with the same query:
+Note down the query latency so that we can compare it with the query latency of ANN (using the vector index).
+Also record the query latency with a cold OS file cache and with `max_threads=1` to recognize the real compute
+usage and storage bandwidth usage (extrapolate it to a production dataset with millions of vectors!)
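The cold-cache, single-thread measurement suggested above can be sketched as follows. This is a hedged sketch: the cache-dropping shell line is Linux-specific and needs root, and the empty `id` placeholder is kept from the example above and must be substituted with a real article id.

```sql
-- First drop the OS page cache from a shell (Linux only, requires root):
--   sync && echo 3 > /proc/sys/vm/drop_caches

-- Then re-run the brute-force search restricted to one thread, which exposes
-- the full per-query compute and storage bandwidth cost:
SELECT id, title
FROM dbpedia
ORDER BY cosineDistance(vector, (SELECT vector FROM dbpedia WHERE id = '')) ASC
LIMIT 20
SETTINGS max_threads = 1;
```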
-```sql -SELECT url, caption FROM laion_annoy ORDER BY l2Distance(image_embedding, {target:Array(Float32)}) LIMIT 8 -``` -**Result** - -```response -┌─url──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─caption──────────────────────────────────────────────────────────────┐ -│ http://tse1.mm.bing.net/th?id=OIP.R1CUoYp_4hbeFSHBaaB5-gHaFj │ bed bugs and pets can cats carry bed bugs pets adviser │ -│ http://pet-uploads.adoptapet.com/1/9/c/1963194.jpg?336w │ Domestic Longhair Cat for adoption in Quincy, Massachusetts - Ashley │ -│ https://thumbs.dreamstime.com/t/cat-bed-12591021.jpg │ Cat on bed Stock Image │ -│ https://us.123rf.com/450wm/penta/penta1105/penta110500004/9658511-portrait-of-british-short-hair-kitten-lieing-at-sofa-on-sun.jpg │ Portrait of british short hair kitten lieing at sofa on sun. │ -│ https://www.easypetmd.com/sites/default/files/Wirehaired%20Vizsla%20(2).jpg │ Vizsla (Wirehaired) image 3 │ -│ https://images.ctfassets.net/yixw23k2v6vo/0000000200009b8800000000/7950f4e1c1db335ef91bb2bc34428de9/dog-cat-flickr-Impatience_1.jpg?w=600&h=400&fm=jpg&fit=thumb&q=65&fl=progressive │ dog and cat image │ -│ https://i1.wallbox.ru/wallpapers/small/201523/eaa582ee76a31fd.jpg │ cats, kittens, faces, tonkinese │ -│ https://www.baxterboo.com/images/breeds/medium/cairn-terrier.jpg │ Cairn Terrier Photo │ -└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴──────────────────────────────────────────────────────────────────────┘ - -8 rows in set. Elapsed: 0.641 sec. Processed 22.06 thousand rows, 49.36 MB (91.53 thousand rows/s., 204.81 MB/s.) -``` -The speed increased significantly at the cost of less accurate results. This is because the ANN index only provide approximate search results. 
Note the example searched for similar image embeddings, yet it is also possible to search for positive image caption embeddings. ## Q & A Demo Application {#q-and-a-demo-application} @@ -133,9 +114,9 @@ The application performs the following steps : 5. Uses the OpenAI `gpt-3.5-turbo` Chat API to answer the question based on the knowledge in the documents retrieved in step #3. The documents retrieved in step #3 are passed as _context_ to the Chat API and are the key link in Generative AI. -First a couple of conversation examples by running the Q & A application are listed below, followed by the code +A couple of conversation examples by running the Q & A application are first listed below, followed by the code for the Q & A application. Running the application requires an OpenAI API key to be set in the environment -variable `OPENAI_API_KEY`. +variable `OPENAI_API_KEY`. The OpenAI API key can be obtained after registering at https://platform.openapi.com. ```shell $ python3 QandA.py From 12617ac9a23b9ae3a798e667ad11fcb13c438af5 Mon Sep 17 00:00:00 2001 From: Shankar Iyer Date: Wed, 20 Aug 2025 09:11:56 +0000 Subject: [PATCH 3/8] Review --- .../example-datasets/dbpedia.md | 141 ++++++++++++++++-- 1 file changed, 131 insertions(+), 10 deletions(-) diff --git a/docs/getting-started/example-datasets/dbpedia.md b/docs/getting-started/example-datasets/dbpedia.md index 9e030f71b86..92b7cfc61c8 100644 --- a/docs/getting-started/example-datasets/dbpedia.md +++ b/docs/getting-started/example-datasets/dbpedia.md @@ -11,7 +11,7 @@ The dataset is an excellent starter dataset to understand vector embeddings, vec ## Dataset details {#dataset-details} -The dataset consists of 26 `Parquet` files located under https://huggingface.co/api/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M/parquet/default/train/. The files are named `0.parquet`, `1.parquet`, ..., `25.parquet`. 
To view some example rows of the dataset, please visit https://huggingface.co/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M.
+The dataset contains 26 `Parquet` files located under https://huggingface.co/api/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M/parquet/default/train/. The files are named `0.parquet`, `1.parquet`, ..., `25.parquet`. To view some example rows of the dataset, please visit https://huggingface.co/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M.
 
 ## Create table {#create-table}
 
@@ -30,12 +30,13 @@ CREATE TABLE dbpedia
 
 ## Load table {#load-table}
 
-To load the dataset from the Parquet files, run the following shell command :
+To load the dataset from all Parquet files, run the following shell command :
 
 ```shell
+$ seq 0 25 | xargs -P1 -I{} clickhouse client -q "INSERT INTO dbpedia SELECT _id, title, text, \"text-embedding-3-large-1536-embedding\" FROM url('https://huggingface.co/api/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M/parquet/default/train/{}.parquet') SETTINGS max_http_get_redirects=5,enable_url_encoding=0;"
 ```
 
-Alternatively, individual SQL statements can be run as shown below for each of the 25 Parquet files :
+Alternatively, individual SQL statements can be run as shown below to load each of the 26 Parquet files :
 
 ```sql
 INSERT INTO dbpedia SELECT _id, title, text, "text-embedding-3-large-1536-embedding" FROM url('https://huggingface.co/api/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M/parquet/default/train/0.parquet') SETTINGS max_http_get_redirects=5,enable_url_encoding=0;
@@ -45,22 +46,35 @@ INSERT INTO dbpedia SELECT _id, title, text, "text-embedding-3-large-1536-embedd
 
 ```
 
-## Semantic Search
+Verify that 1 million rows are seen in the `dbpedia` table :
+
+```sql
+SELECT count(*)
+FROM dbpedia
+
+ ┌─count()─┐
+1.
│ 1000000 │
+ └─────────┘
+```
+
+## Semantic Search {#semantic-search}
+
 Recommended reading : https://platform.openai.com/docs/guides/embeddings
 
-Semantic search (or referred to as _similarity search_) using vector embeddings involes
+Semantic search (also referred to as _similarity search_) using vector embeddings involves
 the following steps :
 
-- Accept a search query from user in natural language e.g ‘Tell me some scenic rail journeys”, “Suspense novels set in Europe” etc
+- Accept a search query from the user in natural language, e.g. _"Tell me some scenic rail journeys"_, _"Suspense novels set in Europe"_, etc.
 - Generate embedding vector for the search query using the LLM model
 - Find nearest neighbours to the search embedding vector in the dataset
 
 The _nearest neighbours_ are documents, images or content that are results relevant to the user query.
-The results are the key input to Retrieval Augmented Generation (RAG) in Generative AI applications.
+The retrieved results are the key input to Retrieval Augmented Generation (RAG) in Generative AI applications.
+
+## Run a brute-force vector similarity search {#run-a-brute-force-vector-similarity-search}
 
-## Experiment with KNN Search
 KNN (k - Nearest Neighbours) search or brute force search involves calculating distance of each vector in the dataset
-to the search embedding vector and then ordering the distances to get the nearest neighbours. With the `dbpediai` dataset,
+to the search embedding vector and then ordering the distances to get the nearest neighbours. With the `dbpedia` dataset,
 a quick technique to visually observe semantic search is to use embedding vectors from the dataset itself as search
 vectors. Example :
 
@@ -92,14 +106,121 @@ LIMIT 20
 19. │ │ Human Remains (film) │
 20. │ │ Kazuo Ishiguro │
  └───────────────────────────────────────────┴─────────────────────────────────┘
+20 rows in set. Elapsed: 0.261 sec. Processed 1.00 million rows, 6.22 GB (3.84 million rows/s., 23.81 GB/s.)
```
 
 Note down the query latency so that we can compare it with the query latency of ANN (using the vector index).
 Also record the query latency with a cold OS file cache and with `max_threads=1` to recognize the real compute
 usage and storage bandwidth usage (extrapolate it to a production dataset with millions of vectors!)
 
+## Build Vector Similarity Index {#build-vector-similarity-index}
+
+Run the following SQL to define and build a vector similarity index on the `vector` column :
+
+```sql
+ALTER TABLE dbpedia ADD INDEX vector_index vector TYPE vector_similarity('hnsw', 'cosineDistance', 1536, 'bf16', 64, 512);
+
+ALTER TABLE dbpedia MATERIALIZE INDEX vector_index;
+```
+
+The parameters and performance considerations for index creation and search are described in the [documentation](../../engines/table-engines/mergetree-family/annindexes.md).
+
+Building and saving the index could take a few minutes depending on the number of CPU cores available and the storage bandwidth.
+
+## Perform ANN search {#perform-ann-search}
+
+_Approximate Nearest Neighbours_ or ANN refers to a group of techniques (e.g., special data structures like graphs and random forests) which compute results much faster than exact vector search. The result accuracy is typically "good enough" for practical use. Many approximate techniques provide parameters to tune the trade-off between the result accuracy and the search time.
+
+Once the vector similarity index has been built, vector search queries will automatically use the index :
+
+```sql
+SELECT
+    id,
+    title
+FROM dbpedia
+ORDER BY cosineDistance(vector, (
+        SELECT vector
+        FROM dbpedia
+        WHERE id = ''
+    )) ASC
+LIMIT 20
+
+ ┌─id──────────────────────────────────────────────┬─title─────────────────────────────────┐
+ 1. │ │ Glacier Express │
+ 2. │ │ BVZ Zermatt-Bahn │
+ 3. │ │ Gornergrat railway │
+ 4. │ │ RegioExpress │
+ 5. │ │ Matterhorn Gotthard Bahn │
+ 6. │ │ Rhaetian Railway │
+ 7. │ │ Gotthard railway │
+ 8. │ │ Furka–Oberalp railway │
+ 9.
│ │ Jungfrau railway │
+10. │ │ Monte Generoso railway │
+11. │ │ Montreux–Oberland Bernois railway │
+12. │ │ Brienz–Rothorn railway │
+13. │ │ Lauterbrunnen–Mürren mountain railway │
+14. │ │ Luzern–Stans–Engelberg railway line │
+15. │ │ Rigi Railways │
+16. │ │ Saint-Gervais–Vallorcine railway │
+17. │ │ Gatwick Express │
+18. │ │ Brünig railway line │
+19. │ │ Regional-Express │
+20. │ │ Schynige Platte railway │
+ └─────────────────────────────────────────────────┴───────────────────────────────────────┘
+
+20 rows in set. Elapsed: 0.025 sec. Processed 32.03 thousand rows, 2.10 MB (1.29 million rows/s., 84.80 MB/s.)
+```
+Compare the latency and I/O resource usage of the above query with the earlier query executed
+using brute force KNN.
+
+## Generating embeddings for search query {#generating-embeddings-for-search-query}
+
+The similarity search queries seen until now use one of the existing vectors in the `dbpedia`
+table as the search vector. In real-world applications, the search vector has to be
+generated for a user input query which could be in natural language. The search vector
+should be generated by using the same LLM model used to generate embedding vectors
+for the dataset.
+
+An example Python script is listed below to demonstrate how to programmatically call the OpenAI API to
+generate embedding vectors using the `text-embedding-3-large` model. The search embedding vector
+is then passed as an argument to the `cosineDistance()` function in the `SELECT` query.
+
+Running the script requires an OpenAI API key to be set in the environment variable `OPENAI_API_KEY`.
+The OpenAI API key can be obtained after registering at https://platform.openai.com.
+
+```python
+import sys
+from openai import OpenAI
+import clickhouse_connect
+
+ch_client = clickhouse_connect.get_client(compress=False) # Pass ClickHouse credentials
+openai_client = OpenAI() # Set OPENAI_API_KEY environment variable
+
+def get_embedding(text, model):
+    text = text.replace("\n", " ")
+    return openai_client.embeddings.create(input=[text], model=model, dimensions=1536).data[0].embedding
+
+
+while True:
+    # Accept the search query from the user
+    print("Enter a search query:")
+    input_query = sys.stdin.readline().strip()
+
+    # Call OpenAI API endpoint to get the embedding
+    print("Generating the embedding for", input_query)
+    embedding = get_embedding(input_query,
+                              model='text-embedding-3-large')
+
+    # Execute vector search query in ClickHouse
+    print("Querying ClickHouse...")
+    params = {'v1': embedding, 'v2': 10}
+    result = ch_client.query("SELECT id,title,text FROM dbpedia ORDER BY cosineDistance(vector, %(v1)s) LIMIT %(v2)s", parameters=params)
+
+    for row in result.result_rows:
+        print(row[0], row[1], row[2])
+        print("---------------")
+```

## Q & A Demo Application {#q-and-a-demo-application}
@@ -116,7 +237,7 @@ The application performs the following steps :
A couple of conversation examples by running the Q & A application are first listed below, followed by the code
for the Q & A application. Running the application requires an OpenAI API key to be set in the environment
-variable `OPENAI_API_KEY`. The OpenAI API key can be obtained after registering at https://platform.openapi.com.
+variable `OPENAI_API_KEY`. The OpenAI API key can be obtained after registering at https://platform.openai.com.
```shell $ python3 QandA.py From caf7ac9431ab1688430e839442d55f7504f3550e Mon Sep 17 00:00:00 2001 From: Shankar Iyer Date: Wed, 20 Aug 2025 09:19:19 +0000 Subject: [PATCH 4/8] style check --- docs/getting-started/example-datasets/dbpedia.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/docs/getting-started/example-datasets/dbpedia.md b/docs/getting-started/example-datasets/dbpedia.md index 92b7cfc61c8..8cb8e0305fd 100644 --- a/docs/getting-started/example-datasets/dbpedia.md +++ b/docs/getting-started/example-datasets/dbpedia.md @@ -1,5 +1,5 @@ --- -description: 'Dataset containing 1 million articles from Wikipedia and their vector embeddings" +description: 'Dataset containing 1 million articles from Wikipedia and their vector embeddings' sidebar_label: 'dbpedia dataset' slug: /getting-started/example-datasets/dbpedia-dataset title: 'dbpedia dataset' @@ -319,4 +319,3 @@ Question: {question}""" print(response.choices[0].message.content) print("\n") ``` - From 4eff3acbf363cc1e26962eafc347eda605d22ea2 Mon Sep 17 00:00:00 2001 From: Shaun Struwig <41984034+Blargian@users.noreply.github.com> Date: Wed, 20 Aug 2025 11:51:52 +0200 Subject: [PATCH 5/8] Minor formatting fixes --- .../example-datasets/dbpedia.md | 40 ++++++++++--------- 1 file changed, 21 insertions(+), 19 deletions(-) diff --git a/docs/getting-started/example-datasets/dbpedia.md b/docs/getting-started/example-datasets/dbpedia.md index 8cb8e0305fd..d6a9126a905 100644 --- a/docs/getting-started/example-datasets/dbpedia.md +++ b/docs/getting-started/example-datasets/dbpedia.md @@ -5,17 +5,17 @@ slug: /getting-started/example-datasets/dbpedia-dataset title: 'dbpedia dataset' --- -The [dbpedia dataset](https://huggingface.co/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M) contains 1 million articles from Wikipedia and their vector embeddings generated using `text-embedding-3-large` model from OpenAI. 
+The [dbpedia dataset](https://huggingface.co/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M) contains 1 million articles from Wikipedia and their vector embeddings generated using the `text-embedding-3-large` model from OpenAI. -The dataset is an excellent starter dataset to understand vector embeddings, vector similarity search and Generative AI. We use this dataset to demonstrate [approximate nearest neighbor search](../../engines/table-engines/mergetree-family/annindexes.md) in ClickHouse and a simple but powerful Q & A application. +The dataset is an excellent starter dataset to understand vector embeddings, vector similarity search and Generative AI. We use this dataset to demonstrate [approximate nearest neighbor search](../../engines/table-engines/mergetree-family/annindexes.md) in ClickHouse and a simple but powerful Q&A application. ## Dataset details {#dataset-details} -The dataset contains 26 `Parquet` files located under https://huggingface.co/api/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M/parquet/default/train/. The files are named `0.parquet`, `1.parquet`, ..., `25.parquet`. To view some example rows of the dataset, please visit https://huggingface.co/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M. +The dataset contains 26 `Parquet` files located on [huggingface.co](https://huggingface.co/api/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M/parquet/default/train/). The files are named `0.parquet`, `1.parquet`, ..., `25.parquet`. To view some example rows of the dataset, please visit this [Hugging Face page](https://huggingface.co/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M). 
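Since the file naming is regular (`0.parquet` through `25.parquet`), the full list of download URLs can be generated programmatically. As a quick illustration (a sketch for convenience, not part of the dataset tooling), the 26 Parquet URLs can be built as:

```python
# Sketch: enumerate the 26 Parquet file URLs that make up the dataset,
# following the 0.parquet ... 25.parquet naming described above.
BASE = ("https://huggingface.co/api/datasets/"
        "Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M"
        "/parquet/default/train")

urls = [f"{BASE}/{i}.parquet" for i in range(26)]

print(len(urls))   # 26 files in total
print(urls[0])     # ends with /train/0.parquet
print(urls[-1])    # ends with /train/25.parquet
```

Such a list can drive a scripted download or feed the `url()` table function calls shown in the loading section.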
## Create table {#create-table}
 
-Create the `dbpedia` table to store the article id, title, text and embedding vector :
+Create the `dbpedia` table to store the article id, title, text and embedding vector:
 
 ```sql
 CREATE TABLE dbpedia
@@ -30,13 +30,13 @@ CREATE TABLE dbpedia
 
 ## Load table {#load-table}
 
-To load the dataset from all Parquet files, run the following shell command :
+To load the dataset from all Parquet files, run the following shell command:
 
 ```shell
 $ seq 0 25 | xargs -P1 -I{} clickhouse client -q "INSERT INTO dbpedia SELECT _id, title, text, \"text-embedding-3-large-1536-embedding\" FROM url('https://huggingface.co/api/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M/parquet/default/train/{}.parquet') SETTINGS max_http_get_redirects=5,enable_url_encoding=0;"
 ```
 
-Alternatively, individual SQL statements can be run as shown below to load each of the 25 Parquet files :
+Alternatively, individual SQL statements can be run as shown below to load each of the 26 Parquet files:
 
 ```sql
 INSERT INTO dbpedia SELECT _id, title, text, "text-embedding-3-large-1536-embedding" FROM url('https://huggingface.co/api/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M/parquet/default/train/0.parquet') SETTINGS max_http_get_redirects=5,enable_url_encoding=0;
@@ -46,7 +46,7 @@ INSERT INTO dbpedia SELECT _id, title, text, "text-embedding-3-large-1536-embedd
 ```
 
-Verify that 1 million rows are seen in the `dbpedia` table :
+Verify that 1 million rows are seen in the `dbpedia` table:
 
 ```sql
 SELECT count(*)
@@ -57,14 +57,15 @@ FROM dbpedia
 └─────────┘
 ```
 
-## Semantic Search {#semantic-search}
 
-Recommended reading : https://platform.openai.com/docs/guides/embeddings
+## Semantic search {#semantic-search}
 
+Recommended reading:
+["Vector embeddings" OpenAI guide](https://platform.openai.com/docs/guides/embeddings)
 
-Semantic search (or referred to as _similarity search_) using vector embeddings involves
-the following steps :
+Semantic search (also referred to as _similarity search_) using vector embeddings involves
+the following steps:
 
-- Accept a search query from user in natural language e.g _"Tell me some scenic rail journeys”_, _“Suspense novels set in Europe”_ etc
+- Accept a search query from a user in natural language e.g. _“Tell me about some scenic rail journeys”_, _“Suspense novels set in Europe”_, etc.
 - Generate embedding vector for the search query using the LLM model
 - Find nearest neighbours to the search embedding vector in the dataset
 
@@ -73,18 +74,18 @@ The retrieved results are the key input to Retrieval Augmented Generation (RAG)
 
 ## Run a brute-force vector similarity search {#run-a-brute-force-vector-similarity-search}
 
-KNN (k - Nearest Neighbours) search or brute force search involves calculating distance of each vector in the dataset
+KNN (k-Nearest Neighbours) search or brute-force search involves calculating the distance of each vector in the dataset
 to the search embedding vector and then ordering the distances to get the nearest neighbours. With the `dbpedia`
 dataset, a quick technique to visually observe semantic search is to use embedding vectors from the dataset itself as search
-vectors. Example :
+vectors. For example:
 
-```sql
+```sql title="Query"
 SELECT id, title
 FROM dbpedia
 ORDER BY cosineDistance(vector, ( SELECT vector FROM dbpedia WHERE id = '') ) ASC
 LIMIT 20
 
-    ┌─id────────────────────────────────────────┬─title───────────────────────────┐
+```response title="Response" ┌─id────────────────────────────────────────┬─title───────────────────────────┐
 1. │ │ The Remains of the Day │
 2. │ │ The Remains of the Day (film) │
 3. │ │ Never Let Me Go (novel) │
@@ -106,6 +107,7 @@ LIMIT 20
 19. │ │ Human Remains (film) │
 20. │ │ Kazuo Ishiguro │
 └───────────────────────────────────────────┴─────────────────────────────────┘
+#highlight-next-line
 20 rows in set. Elapsed: 0.261 sec. Processed 1.00 million rows, 6.22 GB (3.84 million rows/s., 23.81 GB/s.)
```
@@ -113,9 +115,9 @@
 Note down the query latency so that we can compare it with the query latency of ANN (using vector index).
 
 Also record the query latency with cold OS file cache and with `max_threads=1` to recognize the real compute usage and storage bandwidth usage (extrapolate it to a production dataset with millions of vectors!)
 
-## Build Vector Similarity Index {#build-vector-similarity-index}
+## Build a vector similarity index {#build-vector-similarity-index}
 
-Run the following SQL to define and build a vector similarity index on the `vector` column :
+Run the following SQL to define and build a vector similarity index on the `vector` column:
 
 ```sql
 ALTER TABLE dbpedia ADD INDEX vector_index vector TYPE vector_similarity('hnsw', 'cosineDistance', 1536, 'bf16', 64, 512);
@@ -132,7 +134,7 @@ Building and saving the index could take a few minutes depending on the number of CP
-Once the vector similarity index has been built, vector search queries will automatically use the index : +Once the vector similarity index has been built, vector search queries will automatically use the index: ```sql SELECT From 9a7d7745287e3cb4e58028285b73baf069a12f0f Mon Sep 17 00:00:00 2001 From: Shaun Struwig <41984034+Blargian@users.noreply.github.com> Date: Wed, 20 Aug 2025 11:53:33 +0200 Subject: [PATCH 6/8] Minor formatting fixes --- .../example-datasets/dbpedia.md | 19 ++++++++----------- 1 file changed, 8 insertions(+), 11 deletions(-) diff --git a/docs/getting-started/example-datasets/dbpedia.md b/docs/getting-started/example-datasets/dbpedia.md index d6a9126a905..b4668578a6e 100644 --- a/docs/getting-started/example-datasets/dbpedia.md +++ b/docs/getting-started/example-datasets/dbpedia.md @@ -148,7 +148,7 @@ ORDER BY cosineDistance(vector, ( )) ASC LIMIT 20 - ┌─id──────────────────────────────────────────────┬─title─────────────────────────────────┐ +```response title="Response" ┌─id──────────────────────────────────────────────┬─title─────────────────────────────────┐ 1. │ │ Glacier Express │ 2. │ │ BVZ Zermatt-Bahn │ 3. │ │ Gornergrat railway │ @@ -170,11 +170,8 @@ LIMIT 20 19. │ │ Regional-Express │ 20. │ │ Schynige Platte railway │ └─────────────────────────────────────────────────┴───────────────────────────────────────┘ - +#highlight-next-line 20 rows in set. Elapsed: 0.025 sec. Processed 32.03 thousand rows, 2.10 MB (1.29 million rows/s., 84.80 MB/s.) -``` -Compare the latency and I/O resource usage of the above query with the earlier query executed -using brute force KNN. ## Generating embeddings for search query {#generating-embeddings-for-search-query} @@ -224,11 +221,11 @@ while True: print("---------------") ``` -## Q & A Demo Application {#q-and-a-demo-application} +## Q&A demo application {#q-and-a-demo-application} -Above examples demonstrated semantic search and document retrieval using ClickHouse. 
A very simple but high potential Generative AI example application is presented now.
+The examples above demonstrated semantic search and document retrieval using ClickHouse. A very simple but high-potential generative AI example application is presented next.
 
-The application performs the following steps :
+The application performs the following steps:
 
 1. Accepts a _topic_ as input from the user
 2. Generates an embedding vector for the _topic_ by invoking OpenAI API with model `text-embedding-3-large`
@@ -237,8 +234,8 @@ The application performs the following steps :
 5. Uses the OpenAI `gpt-3.5-turbo` Chat API to answer the question based on the knowledge in the documents retrieved in step #3.
    The documents retrieved in step #3 are passed as _context_ to the Chat API and are the key link in Generative AI.
 
-A couple of conversation examples by running the Q & A application are first listed below, followed by the code
-for the Q & A application. Running the application requires an OpenAI API key to be set in the environment
+A couple of example conversations from running the Q&A application are listed below, followed by the code
+for the Q&A application. Running the application requires an OpenAI API key to be set in the environment
 variable `OPENAI_API_KEY`. The OpenAI API key can be obtained after registering at https://platform.openai.com.
```shell @@ -260,7 +257,7 @@ England and Wales have hosted the Cricket World Cup the most times, with the tou $ ``` -Code : +Code: ```Python import sys From 8928eb52cd9fbf87a1f7e59e2e2df0f8ac867f2d Mon Sep 17 00:00:00 2001 From: Shaun Struwig <41984034+Blargian@users.noreply.github.com> Date: Wed, 20 Aug 2025 11:56:42 +0200 Subject: [PATCH 7/8] minor formatting fixes --- docs/getting-started/example-datasets/dbpedia.md | 13 +++++++++---- 1 file changed, 9 insertions(+), 4 deletions(-) diff --git a/docs/getting-started/example-datasets/dbpedia.md b/docs/getting-started/example-datasets/dbpedia.md index b4668578a6e..ec58b136638 100644 --- a/docs/getting-started/example-datasets/dbpedia.md +++ b/docs/getting-started/example-datasets/dbpedia.md @@ -3,6 +3,7 @@ description: 'Dataset containing 1 million articles from Wikipedia and their vec sidebar_label: 'dbpedia dataset' slug: /getting-started/example-datasets/dbpedia-dataset title: 'dbpedia dataset' +keywords: ['semantic search', 'vector similarity', 'approximate nearest neighbours', 'embeddings'] --- The [dbpedia dataset](https://huggingface.co/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M) contains 1 million articles from Wikipedia and their vector embeddings generated using the `text-embedding-3-large` model from OpenAI. @@ -84,8 +85,10 @@ SELECT id, title FROM dbpedia ORDER BY cosineDistance(vector, ( SELECT vector FROM dbpedia WHERE id = '') ) ASC LIMIT 20 +``` -```response title="Response" ┌─id────────────────────────────────────────┬─title───────────────────────────┐ +```response title="Response" + ┌─id────────────────────────────────────────┬─title───────────────────────────┐ 1. │ │ The Remains of the Day │ 2. │ │ The Remains of the Day (film) │ 3. 
│ │ Never Let Me Go (novel) │ @@ -122,7 +125,6 @@ Run the following SQL to define and build a vector similarity index on the `vect ```sql ALTER TABLE dbpedia ADD INDEX vector_index vector TYPE vector_similarity('hnsw', 'cosineDistance', 1536, 'bf16', 64, 512); - ALTER TABLE dbpedia MATERIALIZE INDEX vector_index; ``` @@ -136,7 +138,7 @@ _Approximate Nearest Neighbours_ or ANN refers to group of techniques (e.g., spe Once the vector similarity index has been built, vector search queries will automatically use the index: -```sql +```sql title="Query" SELECT id, title @@ -147,8 +149,10 @@ ORDER BY cosineDistance(vector, ( WHERE id = '' )) ASC LIMIT 20 +``` -```response title="Response" ┌─id──────────────────────────────────────────────┬─title─────────────────────────────────┐ +```response title="Response" + ┌─id──────────────────────────────────────────────┬─title─────────────────────────────────┐ 1. │ │ Glacier Express │ 2. │ │ BVZ Zermatt-Bahn │ 3. │ │ Gornergrat railway │ @@ -172,6 +176,7 @@ LIMIT 20 └─────────────────────────────────────────────────┴───────────────────────────────────────┘ #highlight-next-line 20 rows in set. Elapsed: 0.025 sec. Processed 32.03 thousand rows, 2.10 MB (1.29 million rows/s., 84.80 MB/s.) 
+``` ## Generating embeddings for search query {#generating-embeddings-for-search-query} From 25dd5ac4be303bbbb65f13cf3ce9c03c515b9c68 Mon Sep 17 00:00:00 2001 From: Shankar Iyer Date: Wed, 20 Aug 2025 15:33:36 +0000 Subject: [PATCH 8/8] Code review --- docs/getting-started/example-datasets/dbpedia.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/getting-started/example-datasets/dbpedia.md b/docs/getting-started/example-datasets/dbpedia.md index ec58b136638..9a84f04abc3 100644 --- a/docs/getting-started/example-datasets/dbpedia.md +++ b/docs/getting-started/example-datasets/dbpedia.md @@ -6,7 +6,7 @@ title: 'dbpedia dataset' keywords: ['semantic search', 'vector similarity', 'approximate nearest neighbours', 'embeddings'] --- -The [dbpedia dataset](https://huggingface.co/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M) contains 1 million articles from Wikipedia and their vector embeddings generated using the `text-embedding-3-large` model from OpenAI. +The [dbpedia dataset](https://huggingface.co/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M) contains 1 million articles from Wikipedia and their vector embeddings generated using the [text-embedding-3-large](https://platform.openai.com/docs/models/text-embedding-3-large) model from OpenAI. The dataset is an excellent starter dataset to understand vector embeddings, vector similarity search and Generative AI. We use this dataset to demonstrate [approximate nearest neighbor search](../../engines/table-engines/mergetree-family/annindexes.md) in ClickHouse and a simple but powerful Q&A application. 
@@ -125,7 +125,7 @@ Run the following SQL to define and build a vector similarity index on the `vect ```sql ALTER TABLE dbpedia ADD INDEX vector_index vector TYPE vector_similarity('hnsw', 'cosineDistance', 1536, 'bf16', 64, 512); -ALTER TABLE dbpedia MATERIALIZE INDEX vector_index; +ALTER TABLE dbpedia MATERIALIZE INDEX vector_index SETTINGS mutations_sync = 2; ``` The parameters and performance considerations for index creation and search are described in the [documentation](../../engines/table-engines/mergetree-family/annindexes.md).
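To make the mechanics of the queries in this page concrete, here is a small self-contained Python sketch (an illustration, not part of the documented setup) of the brute-force search that `ORDER BY cosineDistance(...) ASC LIMIT k` performs, and that the HNSW vector similarity index approximates:

```python
import math

def cosine_distance(a, b):
    # cosineDistance(a, b) = 1 - (a.b) / (|a| * |b|), matching ClickHouse semantics
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def knn(query, vectors, k):
    # Brute-force k-nearest-neighbour search: score every vector, keep the k closest.
    scored = sorted((cosine_distance(query, v), i) for i, v in enumerate(vectors))
    return [i for _, i in scored[:k]]

# Toy 3-dimensional "embeddings" (the real dataset uses 1536 dimensions)
vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
print(knn([1.0, 0.05, 0.0], vectors, 2))  # -> [0, 1]
```

An exact scan like this touches all 1 million rows per query, which is why the HNSW index, trading a little accuracy for speed, processed only ~32 thousand rows in the indexed query above.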