Commit

checkin

emrgnt-cmplxty committed Jun 13, 2024
1 parent b0d722e commit 3017f55
Showing 9 changed files with 379 additions and 89 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -8,7 +8,7 @@

<img src="./docs/pages/r2r.png" alt="Sciphi Framework">
<h3 align="center">
Build, deploy, observe, and optimize your RAG system.
Build, deploy, observe, and optimize your RAG engine.
</h3>

# About
219 changes: 219 additions & 0 deletions docs/pages/cookbooks/knowledge-graph.mdx
@@ -0,0 +1,219 @@
import { Callout, FileTree } from 'nextra/components'

## Building a Knowledge Graph with R2R

This cookbook explains how to configure R2R to automatically construct a knowledge graph from ingested files. The constructed graph can be used later as an additional knowledge base for your RAG application.

### Setup

To enable knowledge graph construction in R2R, specify a provider for `kg` in your `config.json`. The following configuration can be found in `r2r/examples/configs/neo4j_kg.json` and contains the settings necessary for knowledge graph construction:

```json filename="neo4j_kg.json" copy
{
"kg": {
"provider": "neo4j",
"batch_size": 1,
"text_splitter": {
"type": "recursive_character",
"chunk_size": 2048,
"chunk_overlap": 0
}
}
}
```

By selecting a provider, the default R2R `IngestionPipeline` will construct the knowledge graph and embeddings simultaneously during ingestion. Setting `batch_size=1` means chunks are processed one at a time, while `chunk_size=2048` sets the size of the chunks used during extraction. The key logic controlling this workflow lives in the `kg_pipe.py` and `kg_storage.py` files and is wired together by the `R2RPipelineFactory` class.

In addition to setting the provider, we must set valid environment variables to enable communication with the Neo4j database. For this cookbook, we assume you have already installed the Neo4j Desktop application [found here](https://neo4j.com/download) and properly enabled the `APOC` library.

```bash
export NEO4J_USER=neo4j
export NEO4J_PASSWORD=password
export NEO4J_URL=bolt://localhost:7687
export NEO4J_DATABASE=neo4j
```
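
Before wiring R2R to the database, it can help to sanity-check these values from Python. The helper below simply mirrors the defaults above; the function name is ours for illustration and is not part of R2R:

```python
import os
from urllib.parse import urlparse

def neo4j_settings() -> dict:
    """Collect the Neo4j connection settings from the environment,
    falling back to the defaults used in this cookbook."""
    url = os.environ.get("NEO4J_URL", "bolt://localhost:7687")
    parsed = urlparse(url)
    return {
        "user": os.environ.get("NEO4J_USER", "neo4j"),
        "password": os.environ.get("NEO4J_PASSWORD", "password"),
        "url": url,
        "database": os.environ.get("NEO4J_DATABASE", "neo4j"),
        "host": parsed.hostname,
        "port": parsed.port,
    }
```

Printing `neo4j_settings()` before ingestion makes a misconfigured connection much easier to diagnose than a stack trace from deep inside the pipeline.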

<details>
<summary>Relevant Factory Source Code</summary>

```python filename="r2r_factory.py" copy
class R2RPipelineFactory:
    ...

    def create_ingestion_pipeline(self, *args, **kwargs) -> IngestionPipeline:
        """Factory method to create an ingestion pipeline."""
        ingestion_pipeline = IngestionPipeline()

        ingestion_pipeline.add_pipe(
            pipe=self.pipes.parsing_pipe, parsing_pipe=True
        )
        # Add embedding pipes if provider is set
        if self.config.embedding.provider:
            ingestion_pipeline.add_pipe(
                self.pipes.embedding_pipe, embedding_pipe=True
            )
            ingestion_pipeline.add_pipe(
                self.pipes.vector_storage_pipe, embedding_pipe=True
            )
        # Add KG pipes if provider is set
        if self.config.kg.provider:
            ingestion_pipeline.add_pipe(self.pipes.kg_pipe, kg_pipe=True)
            ingestion_pipeline.add_pipe(
                self.pipes.kg_storage_pipe, kg_pipe=True
            )

        return ingestion_pipeline
```
</details>

### Implementation

We are now ready to implement the knowledge graph construction. By default, when running with the configuration above, triples are extracted and stored to represent the graph. A triple consists of three components: a subject, a predicate, and an object.
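
A minimal in-memory representation of such a triple might look like the sketch below (our own illustration; R2R has its own internal types):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    """A single knowledge-graph fact: subject -[predicate]-> object."""
    subject: str
    predicate: str
    object: str

    def __str__(self) -> str:
        return f"{self.subject} -[{self.predicate}]-> {self.object}"

fact = Triple("company:Airbnb", "LOCATED_IN", "location:city:Dublin")
print(fact)  # company:Airbnb -[LOCATED_IN]-> location:city:Dublin
```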

First, pass the selected configuration into the R2R constructor:

```python filename="r2r/examples/scripts/build_yc_kg.py"
def main(max_entries=50, delete=False):
    # Load the R2R configuration and build the app
    this_file_path = os.path.abspath(os.path.dirname(__file__))
    config_path = os.path.join(this_file_path, "..", "configs", "neo4j_kg.json")
    config = R2RConfig.from_json(config_path)
    r2r = R2RAppBuilder(config).build()
```

Now that we have created a properly configured instance of R2R, we can move on to prepping the graph.

In this cookbook, we will ingest startup data from the YCombinator company directory. The default prompt for knowledge graph construction, `ner_kg_extraction`, can be found in `prompts/local/defaults.jsonl`. It constructs the graph in a way that is agnostic to the input data. For this example, we will overwrite this prompt with one that contains specific entities and predicates relevant to startups. After doing so, we will be ready to start ingesting data:

```python filename="r2r/examples/scripts/build_yc_kg.py"
# Get the providers
kg_provider = r2r.providers.kg
prompt_provider = r2r.providers.prompt

# Update the prompt for the NER KG extraction task
ner_kg_extraction_with_spec = prompt_provider.get_prompt(
"ner_kg_extraction_with_spec"
)

# Newline-separated list of entity types, with optional subcategories
entity_types = """organization
subcategories: company, school, non-profit, other
location
subcategories: city, state, country, other
person
position
date
subcategories: year, month, day, batch (e.g. W24, S20), other
quantity
event
subcategories: incorporation, funding_round, acquisition, launch, other
industry
media
subcategories: email, website, twitter, linkedin, other
product
"""

# Newline-separated list of predicates
predicates = """
# Founder / employee predicates
EDUCATED_AT
FOUNDER_OF
ROLE_OF
WORKED_AT
# Company predicates
FOUNDED_IN
LOCATED_IN
HAS_TEAM_SIZE
REVENUE
RAISED
ACQUIRED_BY
ANNOUNCED
PARTICIPATED_IN
# Product predicates
USED_BY
USES
HAS_PRODUCT
HAS_FEATURES
HAS_OFFERS
# Other
INDUSTRY
TECHNOLOGY
GROUP_PARTNER
ALIAS"""

# Format the prompt to include the desired entity types and predicates
ner_kg_extraction = ner_kg_extraction_with_spec.replace(
"{entity_types}", entity_types
).replace("{predicates}", predicates)

# Update the "ner_kg_extraction" prompt used in downstream pipes
r2r.providers.prompt.update_prompt(
"ner_kg_extraction", json.dumps(ner_kg_extraction, ensure_ascii=False)
)
```
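
A leftover `{entity_types}` or `{predicates}` placeholder would silently degrade extraction quality, so a quick guard before registering the prompt can help. This helper is our own sketch, not an R2R API:

```python
def fill_spec(template: str, entity_types: str, predicates: str) -> str:
    """Substitute the spec sections into the prompt template and
    verify nothing was left unfilled."""
    prompt = template.replace("{entity_types}", entity_types).replace(
        "{predicates}", predicates
    )
    leftover = [p for p in ("{entity_types}", "{predicates}") if p in prompt]
    if leftover:
        raise ValueError(f"unfilled placeholders: {leftover}")
    return prompt
```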

Finally, we are ready to ingest data. The script includes code to scrape and ingest up to 1,000 companies from the YC directory. The relevant code for ingestion is shown below:

```python filename="r2r/examples/scripts/build_yc_kg.py"
i = 0
# Ingest and clean the data for each company
for company, url in url_map.items():
    if i >= max_entries:
        break
    company_data = fetch_and_clean_yc_co_data(url)
    # Wrap in a try block in case of network failure
    try:
        # Ingest as a text document
        r2r.ingest_documents(
            [
                Document(
                    id=generate_id_from_label(company),
                    type="txt",
                    data=company_data,
                    metadata={},
                )
            ]
        )
    except Exception:
        # Skip companies whose ingestion fails
        continue
    i += 1

print_all_relationships(kg_provider)
```
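
Note that `generate_id_from_label` presumably derives a stable document id from the company name, so re-running the script updates entries rather than duplicating them. A hypothetical stand-in illustrating the idea (R2R's actual helper may be implemented differently):

```python
import uuid

# Any fixed namespace yields deterministic ids; this particular choice is arbitrary
_NAMESPACE = uuid.NAMESPACE_DNS

def generate_id_from_label(label: str) -> uuid.UUID:
    """Hypothetical stand-in: derive a stable UUID from a label."""
    return uuid.uuid5(_NAMESPACE, label)

# The same label always maps to the same id
assert generate_id_from_label("airbnb") == generate_id_from_label("airbnb")
```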

Let's start by scraping one example to get an initial sense of the data quality produced by this technique.

```bash
python r2r/examples/scripts/build_yc_kg.py --max_entries=1
```

Running the command above will produce output like the following:

```bash
...
company:Airbnb -[HAS]-> quantity:33,000 cities
company:Airbnb -[HAS]-> quantity:192 countries
company:Airbnb -[LOCATED_IN]-> location:city:Dublin
company:Airbnb -[LOCATED_IN]-> location:city:London
company:Airbnb -[LOCATED_IN]-> location:city:Barcelona
company:Airbnb -[LOCATED_IN]-> location:city:Paris
company:Airbnb -[LOCATED_IN]-> location:city:Milan
company:Airbnb -[LOCATED_IN]-> location:city:Copenhagen
company:Airbnb -[LOCATED_IN]-> location:city:Berlin
company:Airbnb -[LOCATED_IN]-> location:city:Moscow
company:Airbnb -[LOCATED_IN]-> location:city:São Paolo
company:Airbnb -[LOCATED_IN]-> location:city:Sydney
company:Airbnb -[LOCATED_IN]-> location:city:Singapore
person:Brian Chesky -[ROLE_OF]-> position:CEO
person:Brian Chesky -[FOUNDED]-> company:Airbnb
position:CTO -[FOUNDED]-> company:Airbnb
person:Brian Chesky -[LOCATED_IN]-> location:city:New York
person:Brian Chesky -[EDUCATED_AT]-> school:Rhode Island School of Design
person:Brian Chesky -[HAS]-> degree:Bachelor of Fine Arts in Industrial Design
position:CTO -[ROLE_OF]-> position:CTO
```
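
The printed relationships follow a regular `subject -[PREDICATE]-> object` shape, so spot-checking quality at scale is easy to script. A small parser sketch of ours, not part of R2R:

```python
import re

# Matches lines like "company:Airbnb -[LOCATED_IN]-> location:city:Dublin"
TRIPLE_RE = re.compile(r"^(?P<subj>.+?) -\[(?P<pred>[A-Z_]+)\]-> (?P<obj>.+)$")

def parse_triple(line: str):
    """Return (subject, predicate, object) for a printed relationship,
    or None if the line is not a triple."""
    m = TRIPLE_RE.match(line.strip())
    return m.group("subj", "pred", "obj") if m else None

parse_triple("company:Airbnb -[LOCATED_IN]-> location:city:Dublin")
# → ("company:Airbnb", "LOCATED_IN", "location:city:Dublin")
```

Counting predicates across the parsed output is a quick way to see whether the custom spec is being respected — for instance, the `HAS` edges in the sample output above do not appear in the predicate list we supplied.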

Now, we are ready to run a larger ingestion:

```bash
poetry run python r2r/examples/scripts/build_yc_kg.py --delete --max_entries=50
```
38 changes: 22 additions & 16 deletions docs/pages/cookbooks/local-rag.mdx
@@ -1,25 +1,28 @@
import { Tabs } from 'nextra/components'
import { Callout } from 'nextra/components'

## Building a Local RAG System with R2R

### Installation

<Tabs items={['Docker', 'Pip']}>



<Tabs.Tab>

<Callout type="info" emoji="🐳">
Docker makes it convenient to run R2R without managing your local environment.
</Callout>

<details>
<summary>Docker</summary>
First, download the latest R2R image from Docker Hub:

To run R2R using Docker, you can use the following commands:

```bash filename="bash" copy
docker pull emrgntcmplxty/r2r:latest
```

This will pull the latest R2R Docker image.

Then, run the container with:
Then, run the service:

```bash filename="bash" copy
docker run -d \
```

@@ -40,9 +43,18 @@ This command starts the R2R container with the following options:
- `-e CONFIG_OPTION=local_ollama`: Selects the "local_ollama" configuration option.
- `emrgntcmplxty/r2r:latest`: Specifies the Docker image to use.

</details>

We can start by using `pip` to install R2R with the local-embedding dependencies:
Lastly, install the R2R client using `pip`:

```bash filename="bash" copy
pip install 'r2r'
```

</Tabs.Tab>

<Tabs.Tab>

We can install `r2r` with the necessary optional dependencies to run locally using `pip`:

```bash filename="bash" copy
pip install 'r2r[local-embedding]'
```

@@ -52,7 +64,6 @@ R2R supports `Ollama`, a popular tool for Local LLM inference. Ollama is provid

Ollama must be installed independently. You can install Ollama by following the instructions on their [official website](https://ollama.com/) or by referring to their [GitHub README](https://github.com/ollama/ollama).


### Configuration

Let's move on to setting up the R2R pipeline. R2R relies on a `config.json` file for defining various settings, such as embedding models and chunk sizes. By default, the `config.json` found in the R2R GitHub repository's root directory is set up for cloud-based services.
@@ -98,15 +109,10 @@ This chosen config modification above instructs R2R to use the `sentence-transfo

A local vector database will be used to store the embeddings. The current default is a minimal SQLite implementation.

### Server Standup
</Tabs.Tab>

</Tabs>

```bash filename="bash" copy
# cd $WORKDIR
python -m r2r.examples.servers.configurable_pipeline --host 0.0.0.0 --port 8000 --config local_ollama --pipeline_type qna
```

The server exposes a REST API for interacting with the R2R RAG pipeline and application. See the [API docs](/getting-started/app-api) for more details on the available endpoints.

## Ingesting and Embedding Documents

2 changes: 1 addition & 1 deletion docs/pages/index.mdx
@@ -20,7 +20,7 @@ R2R was conceived to help developers bridge the gap between local LLM experiment
- **🗂️ App Management**: Efficiently manage documents and users with rich observability and analytics.
- **🌐 Client-Server**: RESTful API support out of the box.
- **🧩 Configurable**: Provision your application using intuitive configuration files.
- **🔌 Extensible**: Develop your application further with easy builder + factory pattern.
- **🔌 Extensible**: Develop your application further with a convenient builder and factory pattern.

## Demo(s)
The [R2R Demo](/getting-started/r2r-demo) provides a step-by-step guide to running the default R2R Retrieval-Augmented Generation (RAG) backend. The demo ingests the provided documents and illustrates search and RAG functionality, logging, analytics, and document management.
