[vector-databases] Add support for Apache Solr (#565)
Showing 26 changed files with 1,582 additions and 11 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
java/lib/* |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,60 @@ | ||
# Indexing a Website Using Apache Solr as a Vector Database | ||
|
||
This sample application shows how to use the WebCrawler Source Connector with [Apache Solr](https://solr.apache.org) as a vector database. | ||
|
||
## Prerequisites | ||
|
||
Launch Apache Solr locally in Docker: | ||
|
||
``` | ||
docker run --rm -p 8983:8983 solr:9.3.0 -c | ||
``` | ||
|
||
You can now open your browser at http://localhost:8983/ and you will see the Solr admin page. | ||
|
||
The '-c' parameter launches Solr in "Cloud" mode, which allows you to create collections dynamically. | ||
|
||
|
||
The LangStream application creates a collection named "documents" for you. | ||
|
||
It also creates a new vector field type in the collection schema, following this guide: | ||
https://solr.apache.org/guide/solr/latest/query-guide/dense-vector-search.html | ||
|
||
## Configure access to the Vector Database | ||
|
||
To allow LangStream, which runs in Docker, to connect to the Solr instance running on your host, set the SOLR_HOST environment variable. | ||
|
||
```bash | ||
export SOLR_HOST=host.docker.internal | ||
``` | ||
|
||
|
||
The examples/secrets/secrets.yaml file resolves environment variables for you. | ||
In production you should create a dedicated secrets.yaml file for each environment. | ||
|
||
|
||
## Configure the pipeline | ||
|
||
Edit the file `crawler.yaml` and configure the list of allowed web domains. This is required to keep the crawler from following links outside your own content. | ||
Then configure the list of seed URLs, for instance with your home page. | ||
|
||
The default configuration in this example crawls the LangStream documentation website (https://docs.langstream.ai/). | ||
|
||
## Run the LangStream application locally on Docker | ||
|
||
``` | ||
./bin/langstream docker run test -app examples/applications/query_solr -s examples/secrets/secrets.yaml | ||
``` | ||
|
||
## Talk with the Chat bot using the UI | ||
|
||
By default the langstream CLI opens a UI in your browser. You can use that to chat with the bot. | ||
|
||
## Talk with the Chat bot using the CLI | ||
Since the application opens a gateway, we can use the gateway API to send and consume messages. | ||
|
||
``` | ||
./bin/langstream gateway chat test -cg bot-output -pg user-input -p sessionId=$(uuidgen) | ||
``` | ||
|
||
|
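Before starting the LangStream application it can help to confirm that the Solr container from the Prerequisites section is actually reachable. The following is a minimal sketch, not part of this commit: the helper names `solr_system_info_url` and `wait_for_solr` are hypothetical, and it assumes Solr is listening over plain HTTP on the port published by the `docker run` command (8983). It polls Solr's system-info admin endpoint using only the Python standard library.

```python
import json
import time
import urllib.error
import urllib.request


def solr_system_info_url(host: str = "localhost", port: int = 8983) -> str:
    """Return the URL of Solr's system-info admin endpoint (plain HTTP assumed)."""
    return f"http://{host}:{port}/solr/admin/info/system?wt=json"


def wait_for_solr(host: str = "localhost", port: int = 8983,
                  attempts: int = 30, delay_seconds: float = 1.0) -> bool:
    """Poll Solr until it answers with valid JSON, or give up after `attempts` tries."""
    url = solr_system_info_url(host, port)
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                if response.status == 200:
                    json.load(response)  # raises ValueError if the body is not JSON
                    return True
        except (urllib.error.URLError, OSError, ValueError):
            pass  # Solr is not ready yet; retry after a short pause
        time.sleep(delay_seconds)
    return False
```

Calling `wait_for_solr()` from the host returns `True` once the admin endpoint responds. Note that from inside a container the host name would be the value of `SOLR_HOST` rather than `localhost`.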
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,91 @@ | ||
# | ||
# Copyright DataStax, Inc. | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
# | ||
|
||
topics: | ||
- name: "questions-topic" | ||
creation-mode: create-if-not-exists | ||
- name: "answers-topic" | ||
creation-mode: create-if-not-exists | ||
- name: "log-topic" | ||
creation-mode: create-if-not-exists | ||
errors: | ||
on-failure: "skip" | ||
pipeline: | ||
- name: "convert-to-structure" | ||
type: "document-to-json" | ||
input: "questions-topic" | ||
configuration: | ||
text-field: "question" | ||
- name: "compute-embeddings" | ||
type: "compute-ai-embeddings" | ||
configuration: | ||
model: "${secrets.open-ai.embeddings-model}" # This needs to match the name of the model deployment, not the base model | ||
embeddings-field: "value.question_embeddings" | ||
text: "{{ value.question }}" | ||
flush-interval: 0 | ||
- name: "lookup-related-documents" | ||
type: "query-vector-db" | ||
configuration: | ||
datasource: "SolrDataSource" | ||
query: | | ||
{ | ||
"q": "{!knn f=embeddings topK=10}?" | ||
} | ||
fields: | ||
- "fn:toListOfFloat(value.question_embeddings)" | ||
output-field: "value.related_documents" | ||
- name: "ai-chat-completions" | ||
type: "ai-chat-completions" | ||
|
||
configuration: | ||
model: "${secrets.open-ai.chat-completions-model}" # This needs to be set to the model deployment name, not the base name | ||
# on the log-topic we add a field with the answer | ||
completion-field: "value.answer" | ||
# we are also logging the prompt we sent to the LLM | ||
log-field: "value.prompt" | ||
# here we configure the streaming behavior | ||
# as soon as the LLM answers with a chunk we send it to the answers-topic | ||
stream-to-topic: "answers-topic" | ||
# on the streaming answer we send the answer as whole message | ||
# the 'value' syntax is used to refer to the whole value of the message | ||
stream-response-completion-field: "value" | ||
# we want to stream the answer as soon as we have 20 chunks | ||
# in order to reduce latency for the first message the agent sends the first message | ||
# with 1 chunk, then with 2 chunks....up to the min-chunks-per-message value | ||
# eventually we want to send bigger messages to reduce the overhead of each message on the topic | ||
min-chunks-per-message: 20 | ||
messages: | ||
- role: system | ||
content: | | ||
A user is going to ask a question. The documents below may help you answer it. | ||
Please try to leverage them in your answer as much as possible. | ||
Take into consideration that the user is always asking questions about the LangStream project. | ||
If you provide code or YAML snippets, please explicitly state that they are examples. | ||
Do not provide information that is not related to the LangStream project. | ||
Documents: | ||
{{# value.related_documents}} | ||
{{ text}} | ||
{{/ value.related_documents}} | ||
- role: user | ||
content: "{{ value.question}}" | ||
- name: "cleanup-response" | ||
type: "drop-fields" | ||
output: "log-topic" | ||
configuration: | ||
fields: | ||
- "question_embeddings" | ||
- "related_documents" |
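The `lookup-related-documents` step above sends Solr a query that uses the `knn` query parser, with `?` standing in for the question embeddings. The sketch below shows what the resulting request parameters could look like once the vector is filled in. It is an illustration under assumptions: `build_knn_params` and `select_url` are hypothetical helpers, and the way LangStream actually binds the `?` placeholder is not shown in this commit. Only the `{!knn f=<field> topK=<n>}[v1, v2, ...]` syntax comes from Solr's dense vector search guide.

```python
import json
from typing import Dict, Sequence
from urllib.parse import urlencode


def build_knn_params(vector: Sequence[float], field: str = "embeddings",
                     top_k: int = 10) -> Dict[str, str]:
    """Build the `q` parameter for a Solr KNN search against a DenseVectorField."""
    if not vector:
        raise ValueError("the query vector must not be empty")
    # Solr expects the vector as a bracketed, comma-separated list of floats.
    vector_literal = json.dumps([float(x) for x in vector])
    return {"q": f"{{!knn f={field} topK={top_k}}}{vector_literal}"}


def select_url(base_url: str, collection: str, params: Dict[str, str]) -> str:
    """Compose a /select URL for the given collection (hypothetical helper)."""
    return f"{base_url}/solr/{collection}/select?{urlencode(params)}"
```

For example, `build_knn_params([0.1, 0.2])` yields a `q` value of `{!knn f=embeddings topK=10}[0.1, 0.2]`. A real query vector must have the same dimension as the indexed field (1536 in this example).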
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,35 @@ | ||
# | ||
# | ||
# Copyright DataStax, Inc. | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
# | ||
|
||
configuration: | ||
resources: | ||
- type: "open-ai-configuration" | ||
name: "OpenAI Azure configuration" | ||
configuration: | ||
url: "${secrets.open-ai.url}" | ||
access-key: "${secrets.open-ai.access-key}" | ||
provider: "${secrets.open-ai.provider}" | ||
- type: "vector-database" | ||
name: "SolrDataSource" | ||
configuration: | ||
service: "solr" | ||
user: "${secrets.solr.username}" | ||
password: "${secrets.solr.password}" | ||
host: "${secrets.solr.host}" | ||
port: "${secrets.solr.port}" | ||
collection-name: "documents" | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,154 @@ | ||
# | ||
# Copyright DataStax, Inc. | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
# | ||
|
||
name: "Crawl a website" | ||
topics: | ||
- name: "chunks-topic" | ||
creation-mode: create-if-not-exists | ||
assets: | ||
- name: "documents-table" | ||
asset-type: "solr-collection" | ||
creation-mode: create-if-not-exists | ||
deletion-mode: delete | ||
config: | ||
collection-name: "documents" | ||
datasource: "SolrDataSource" | ||
create-statements: | ||
- api: "/api/collections" | ||
method: "POST" | ||
body: | | ||
{ | ||
"name": "documents", | ||
"numShards": 1, | ||
"replicationFactor": 1 | ||
} | ||
- "api": "/schema" | ||
"body": | | ||
{ | ||
"add-field-type" : { | ||
"name": "knn_vector", | ||
"class": "solr.DenseVectorField", | ||
"vectorDimension": "1536", | ||
"similarityFunction": "cosine" | ||
} | ||
} | ||
- "api": "/schema" | ||
"body": | | ||
{ | ||
"add-field":{ | ||
"name":"embeddings", | ||
"type":"knn_vector", | ||
"stored":true, | ||
"indexed":true | ||
} | ||
} | ||
- "api": "/schema" | ||
"body": | | ||
{ | ||
"add-field":{ | ||
"name":"text", | ||
"type":"string", | ||
"stored":true, | ||
"indexed":false, | ||
"multiValued": false | ||
} | ||
} | ||
resources: | ||
size: 1 | ||
pipeline: | ||
- name: "Crawl the WebSite" | ||
type: "webcrawler-source" | ||
configuration: | ||
seed-urls: ["https://docs.langstream.ai/"] | ||
allowed-domains: ["https://docs.langstream.ai"] | ||
forbidden-paths: [] | ||
min-time-between-requests: 500 | ||
reindex-interval-seconds: 3600 | ||
max-error-count: 5 | ||
max-urls: 1000 | ||
max-depth: 50 | ||
handle-robots-file: true | ||
user-agent: "" # this is computed automatically, but you can override it | ||
scan-html-documents: true | ||
http-timeout: 10000 | ||
handle-cookies: true | ||
max-unflushed-pages: 100 | ||
bucketName: "${secrets.s3.bucket-name}" | ||
endpoint: "${secrets.s3.endpoint}" | ||
access-key: "${secrets.s3.access-key}" | ||
secret-key: "${secrets.s3.secret}" | ||
region: "${secrets.s3.region}" | ||
- name: "Extract text" | ||
type: "text-extractor" | ||
- name: "Normalise text" | ||
type: "text-normaliser" | ||
configuration: | ||
make-lowercase: true | ||
trim-spaces: true | ||
- name: "Detect language" | ||
type: "language-detector" | ||
configuration: | ||
allowedLanguages: ["en", "fr"] | ||
property: "language" | ||
- name: "Split into chunks" | ||
type: "text-splitter" | ||
configuration: | ||
splitter_type: "RecursiveCharacterTextSplitter" | ||
chunk_size: 400 | ||
separators: ["\n\n", "\n", " ", ""] | ||
keep_separator: false | ||
chunk_overlap: 100 | ||
length_function: "cl100k_base" | ||
- name: "Convert to structured data" | ||
type: "document-to-json" | ||
configuration: | ||
text-field: text | ||
copy-properties: true | ||
- name: "prepare-structure" | ||
type: "compute" | ||
configuration: | ||
fields: | ||
- name: "value.filename" | ||
expression: "properties.url" | ||
type: STRING | ||
- name: "value.chunk_id" | ||
expression: "properties.chunk_id" | ||
type: STRING | ||
- name: "compute-embeddings" | ||
id: "step1" | ||
type: "compute-ai-embeddings" | ||
output: chunks-topic | ||
configuration: | ||
model: "text-embedding-ada-002" # This needs to match the name of the model deployment, not the base model | ||
embeddings-field: "value.embeddings_vector" | ||
text: "{{ value.text }}" | ||
batch-size: 10 | ||
flush-interval: 500 | ||
- name: "Write to Solr" | ||
type: "vector-db-sink" | ||
input: chunks-topic | ||
configuration: | ||
datasource: "SolrDataSource" | ||
collection-name: "documents" | ||
fields: | ||
- name: "id" | ||
expression: "fn:concat(value.filename, value.chunk_id)" | ||
- name: "embeddings" | ||
expression: "fn:toListOfFloat(value.embeddings_vector)" | ||
- name: "text" | ||
expression: "value.text" |
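The asset definition and the `vector-db-sink` step above have to agree with each other: the sink writes an `embeddings` list into a field whose type was declared with a fixed `vectorDimension`. The sketch below restates the three schema statements as Python dictionaries and shows how a chunk could be mapped to a Solr document with a dimension check. It is an assumption-laden illustration, not LangStream code: `schema_commands`, `schema_bodies` and `to_solr_document` are hypothetical names, and the check is something a reader might add, not something the sink is known to do.

```python
import json
from typing import Any, Dict, List, Sequence

VECTOR_DIMENSION = 1536  # same value as "vectorDimension" in the add-field-type statement


def schema_commands(dimension: int = VECTOR_DIMENSION) -> List[Dict[str, Any]]:
    """Schema API commands equivalent to the three /schema create-statements."""
    return [
        {"add-field-type": {"name": "knn_vector", "class": "solr.DenseVectorField",
                            "vectorDimension": str(dimension),
                            "similarityFunction": "cosine"}},
        {"add-field": {"name": "embeddings", "type": "knn_vector",
                       "stored": True, "indexed": True}},
        {"add-field": {"name": "text", "type": "string", "stored": True,
                       "indexed": False, "multiValued": False}},
    ]


def schema_bodies() -> List[str]:
    """Serialize each command as the JSON body of one Schema API request."""
    return [json.dumps(command) for command in schema_commands()]


def to_solr_document(filename: str, chunk_id: str, text: str,
                     embeddings: Sequence[float],
                     dimension: int = VECTOR_DIMENSION) -> Dict[str, Any]:
    """Mirror the sink's field mapping: id = filename + chunk_id, embeddings, text."""
    if len(embeddings) != dimension:
        raise ValueError(f"expected {dimension} floats, got {len(embeddings)}")
    return {
        "id": f"{filename}{chunk_id}",  # plain concatenation, as in fn:concat
        "embeddings": [float(x) for x in embeddings],
        "text": text,
    }
```

One design point worth noting: because the id is a plain concatenation with no separator, a URL ending in a digit could in principle produce the same id as a different URL with a longer chunk id. Adding a separator would avoid that, although this commit does not use one.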
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,43 @@ | ||
# | ||
# | ||
# Copyright DataStax, Inc. | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
# | ||
|
||
gateways: | ||
- id: "user-input" | ||
type: produce | ||
topic: "questions-topic" | ||
parameters: | ||
- sessionId | ||
produceOptions: | ||
headers: | ||
- key: langstream-client-session-id | ||
valueFromParameters: sessionId | ||
|
||
- id: "bot-output" | ||
type: consume | ||
topic: "answers-topic" | ||
parameters: | ||
- sessionId | ||
consumeOptions: | ||
filters: | ||
headers: | ||
- key: langstream-client-session-id | ||
valueFromParameters: sessionId | ||
|
||
|
||
- id: "llm-debug" | ||
type: consume | ||
topic: "log-topic" |
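The `user-input` and `bot-output` gateways above cooperate through the `langstream-client-session-id` header: the produce gateway stamps each question with the caller's `sessionId`, and the consume gateway only delivers answers carrying the same value. The sketch below models that filtering in plain Python. It only illustrates the semantics described by the YAML, under the assumption that the header is carried along from `questions-topic` to `answers-topic`. The `Message`, `produce` and `consume` names are hypothetical and are not LangStream APIs.

```python
from typing import Dict, Iterable, Iterator, NamedTuple

SESSION_HEADER = "langstream-client-session-id"


class Message(NamedTuple):
    value: str
    headers: Dict[str, str]


def produce(value: str, session_id: str) -> Message:
    """Model the produce gateway: attach the sessionId parameter as a header."""
    return Message(value, {SESSION_HEADER: session_id})


def consume(messages: Iterable[Message], session_id: str) -> Iterator[Message]:
    """Model the consume gateway: keep only messages for this session."""
    for message in messages:
        if message.headers.get(SESSION_HEADER) == session_id:
            yield message
```

With two clients sharing the same topic, `consume(topic, "s1")` yields only the messages produced with `sessionId` `s1`. This is why the CLI example passes a fresh `sessionId=$(uuidgen)` for each chat. The `llm-debug` gateway declares no such filter, so it would see every message on `log-topic`.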