import { Callout, FileTree } from 'nextra/components'

## Building a Knowledge Graph with R2R

This cookbook explains how to configure R2R to automatically construct a knowledge graph from ingested files. The constructed graph can be used later as an additional knowledge base for your RAG application.

### Setup

To enable knowledge graph construction in R2R, specify a provider for `kg` in your `config.json`. The following configuration can be found in `r2r/examples/configs/neo4j_kg.json` and contains the settings necessary for knowledge graph construction:

```json filename="neo4j_kg.json" copy
{
  "kg": {
    "provider": "neo4j",
    "batch_size": 1,
    "text_splitter": {
      "type": "recursive_character",
      "chunk_size": 2048,
      "chunk_overlap": 0
    }
  }
}
```

By selecting a provider, the default R2R `IngestionPipeline` will include functionality for simultaneous knowledge graph and embedding construction during ingestion. Setting `batch_size=1` means chunks will be processed one at a time, and setting `chunk_size=2048` defines the size of chunks used during extraction. The key logic controlling this workflow can be seen in the `kg_pipe.py` and `kg_storage.py` files, managed by the `R2RPipelineFactory` class.

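To get a feel for what `chunk_size=2048` with zero overlap means in practice, the sketch below splits a document with LangChain's `RecursiveCharacterTextSplitter`. Treat the mapping from the `recursive_character` type to this splitter, and the input filename, as assumptions made for illustration rather than a statement about R2R's internals.

```python
# Illustrative only: approximate how a 2048-character recursive splitter with
# no overlap would chunk a document before knowledge graph extraction.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=2048, chunk_overlap=0)

sample_text = open("some_company_profile.txt").read()  # hypothetical input file
chunks = splitter.split_text(sample_text)
print(f"{len(chunks)} chunks; first chunk is {len(chunks[0])} characters")
```
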
In addition to setting the provider, we must set valid environment variables to enable communication with the Neo4j database. For this cookbook, we assume the reader has already downloaded the desktop application [found here](https://neo4j.com/download) and has properly enabled the `APOC` library.

```bash
export NEO4J_USER=neo4j
export NEO4J_PASSWORD=password
export NEO4J_URL=bolt://localhost:7687
export NEO4J_DATABASE=neo4j
```

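Before running the pipeline, it can be worth confirming that these credentials work and that APOC is actually enabled. The snippet below is an optional sanity check using the official `neo4j` Python driver (`pip install neo4j`); it is not part of the R2R scripts.

```python
import os

from neo4j import GraphDatabase

# Optional sanity check: connect with the same environment variables R2R will use.
driver = GraphDatabase.driver(
    os.environ["NEO4J_URL"],
    auth=(os.environ["NEO4J_USER"], os.environ["NEO4J_PASSWORD"]),
)
with driver.session(database=os.environ.get("NEO4J_DATABASE", "neo4j")) as session:
    # Fails with a procedure-not-found error if the APOC plugin is not enabled.
    version = session.run("RETURN apoc.version() AS version").single()["version"]
    print(f"Connected; APOC version {version}")
driver.close()
```
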
<details>
<summary>Relevant Factory Source Code</summary>

```python filename="r2r_factory.py" copy
class R2RPipelineFactory:
    ...
    def create_ingestion_pipeline(self, *args, **kwargs) -> IngestionPipeline:
        """Factory method to create an ingestion pipeline."""
        ingestion_pipeline = IngestionPipeline()

        ingestion_pipeline.add_pipe(
            pipe=self.pipes.parsing_pipe, parsing_pipe=True
        )
        # Add embedding pipes if provider is set
        if self.config.embedding.provider:
            ingestion_pipeline.add_pipe(
                self.pipes.embedding_pipe, embedding_pipe=True
            )
            ingestion_pipeline.add_pipe(
                self.pipes.vector_storage_pipe, embedding_pipe=True
            )
        # Add KG pipes if provider is set
        if self.config.kg.provider:
            ingestion_pipeline.add_pipe(self.pipes.kg_pipe, kg_pipe=True)
            ingestion_pipeline.add_pipe(
                self.pipes.kg_storage_pipe, kg_pipe=True
            )

        return ingestion_pipeline
```
</details>

### Implementation

We are now ready to implement our knowledge graph construction. By default, when running with the configuration above, triples will be extracted and stored to express our graph data. A triple consists of three components: a subject, a predicate, and an object.

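For intuition, a triple can be pictured as a simple subject-predicate-object tuple. The values below are purely illustrative and mirror the output format printed later in this cookbook, not R2R's internal representation.

```python
# Illustrative only: a triple expressed as a (subject, predicate, object) tuple.
triple = ("company:Airbnb", "LOCATED_IN", "location:city:Paris")

subject, predicate, obj = triple
print(f"{subject} -[{predicate}]-> {obj}")
# company:Airbnb -[LOCATED_IN]-> location:city:Paris
```
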
First, pass the selected configuration into the R2R constructor:

```python filename="r2r/examples/scripts/build_yc_kg.py"
def main(max_entries=50, delete=False):
    # Load the R2R configuration and build the app
    this_file_path = os.path.abspath(os.path.dirname(__file__))
    config_path = os.path.join(this_file_path, "..", "configs", "neo4j_kg.json")
    config = R2RConfig.from_json(config_path)
    r2r = R2RAppBuilder(config).build()
```

Now that we have created a properly configured instance of R2R, we can move on to prepping the graph.

In this cookbook, we will ingest startup data from the YCombinator company directory. The default prompt for knowledge graph construction, `ner_kg_extraction`, can be found in `prompts/local/defaults.jsonl`. It constructs the graph in a way that is agnostic to the input data. For this example, we will overwrite this prompt with one that contains specific entities and predicates relevant to startups. After doing so, we will be ready to start ingesting data:

```python filename="r2r/examples/scripts/build_yc_kg.py"
# Get the providers
kg_provider = r2r.providers.kg
prompt_provider = r2r.providers.prompt

# Update the prompt for the NER KG extraction task
ner_kg_extraction_with_spec = prompt_provider.get_prompt(
    "ner_kg_extraction_with_spec"
)

# Newline-separated list of entity types, with optional subcategories
entity_types = """organization
subcategories: company, school, non-profit, other
location
subcategories: city, state, country, other
person
position
date
subcategories: year, month, day, batch (e.g. W24, S20), other
quantity
event
subcategories: incorporation, funding_round, acquisition, launch, other
industry
media
subcategories: email, website, twitter, linkedin, other
product
"""

# Newline-separated list of predicates
predicates = """
# Founder / employee predicates
EDUCATED_AT
FOUNDER_OF
ROLE_OF
WORKED_AT
# Company predicates
FOUNDED_IN
LOCATED_IN
HAS_TEAM_SIZE
REVENUE
RAISED
ACQUIRED_BY
ANNOUNCED
PARTICIPATED_IN
# Product predicates
USED_BY
USES
HAS_PRODUCT
HAS_FEATURES
HAS_OFFERS
# Other
INDUSTRY
TECHNOLOGY
GROUP_PARTNER
ALIAS"""

# Format the prompt to include the desired entity types and predicates
ner_kg_extraction = ner_kg_extraction_with_spec.replace(
    "{entity_types}", entity_types
).replace("{predicates}", predicates)

# Update the "ner_kg_extraction" prompt used in downstream pipes
r2r.providers.prompt.update_prompt(
    "ner_kg_extraction", json.dumps(ner_kg_extraction, ensure_ascii=False)
)
```

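To sanity-check the override before ingesting anything, you can read the prompt back through the same provider. This is an optional step that is not part of the original script, and it assumes `get_prompt` returns the stored template as a string, as it does above.

```python
# Optional check: confirm the startup-specific entity types and predicates are in place.
updated_prompt = prompt_provider.get_prompt("ner_kg_extraction")
print(updated_prompt[:500])
```
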
Finally, we are ready to ingest data. The script includes code to scrape and ingest up to 1,000 companies from the YC directory. The relevant code for ingestion is shown below:

```python filename="r2r/examples/scripts/build_yc_kg.py"
i = 0
# Ingest and clean the data for each company
for company, url in url_map.items():
    company_data = fetch_and_clean_yc_co_data(url)
    if i >= max_entries:
        break
    # Wrap in a try block in case of network failure
    try:
        # Ingest as a text document
        r2r.ingest_documents(
            [
                Document(
                    id=generate_id_from_label(company),
                    type="txt",
                    data=company_data,
                    metadata={},
                )
            ]
        )
    except:
        continue
    i += 1

print_all_relationships(kg_provider)
```

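The `fetch_and_clean_yc_co_data` helper is not reproduced in this cookbook. Below is a rough sketch of what such a scraper could look like (assuming `requests` and `beautifulsoup4` are installed), offered as an illustration rather than the script's actual implementation.

```python
import requests
from bs4 import BeautifulSoup


def fetch_and_clean_yc_co_data(url: str) -> str:
    """Hypothetical sketch: download a YC company page and reduce it to plain text."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    # Drop non-content tags, then collapse the remaining text into clean lines.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    lines = (line.strip() for line in soup.get_text().splitlines())
    return "\n".join(line for line in lines if line)
```
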
Let's start by scraping one example to get an initial sense of the data quality produced by this technique.

```bash
python r2r/examples/scripts/build_yc_kg.py --max_entries=1
```

Running the command above will produce output like the following:

```bash
...
company:Airbnb -[HAS]-> quantity:33,000 cities
company:Airbnb -[HAS]-> quantity:192 countries
company:Airbnb -[LOCATED_IN]-> location:city:Dublin
company:Airbnb -[LOCATED_IN]-> location:city:London
company:Airbnb -[LOCATED_IN]-> location:city:Barcelona
company:Airbnb -[LOCATED_IN]-> location:city:Paris
company:Airbnb -[LOCATED_IN]-> location:city:Milan
company:Airbnb -[LOCATED_IN]-> location:city:Copenhagen
company:Airbnb -[LOCATED_IN]-> location:city:Berlin
company:Airbnb -[LOCATED_IN]-> location:city:Moscow
company:Airbnb -[LOCATED_IN]-> location:city:São Paolo
company:Airbnb -[LOCATED_IN]-> location:city:Sydney
company:Airbnb -[LOCATED_IN]-> location:city:Singapore
person:Brian Chesky -[ROLE_OF]-> position:CEO
person:Brian Chesky -[FOUNDED]-> company:Airbnb
position:CTO -[FOUNDED]-> company:Airbnb
person:Brian Chesky -[LOCATED_IN]-> location:city:New York
person:Brian Chesky -[EDUCATED_AT]-> school:Rhode Island School of Design
person:Brian Chesky -[HAS]-> degree:Bachelor of Fine Arts in Industrial Design
position:CTO -[ROLE_OF]-> position:CTO
```

Now, we are ready to run a larger ingestion:

```bash
poetry run python r2r/examples/scripts/build_yc_kg.py --delete --max_entries=50
```