import { Callout, FileTree } from 'nextra/components'

## Building a Knowledge Graph with R2R

This cookbook explains how to configure R2R to automatically construct a knowledge graph from ingested files. The constructed graph can be used later as an additional knowledge base for your RAG application.

### Setup

To enable knowledge graph construction in R2R, specify a provider for `kg` in your `config.json`. The following configuration can be found in `r2r/examples/configs/neo4j_kg.json` and contains the settings necessary for knowledge graph construction:

```json filename="neo4j_kg.json" copy
{
  "kg": {
    "provider": "neo4j",
    "batch_size": 1,
    "text_splitter": {
      "type": "recursive_character",
      "chunk_size": 2048,
      "chunk_overlap": 0
    }
  }
}
```

By selecting a provider, the default R2R `IngestionPipeline` will include functionality for simultaneous knowledge graph and embedding construction during ingestion. Setting `batch_size=1` means chunks will be processed one at a time, and setting `chunk_size=2048` defines the size of chunks used during extraction. The key logic controlling this workflow can be seen in the `kg_pipe.py` and `kg_storage.py` files, managed by the `R2RPipelineFactory` class.

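To get a feel for what `chunk_size=2048` with zero overlap means in practice, the sketch below splits a document with LangChain's `RecursiveCharacterTextSplitter`. Treat the mapping from the `recursive_character` type to this splitter, and the input filename, as assumptions made for illustration rather than a statement about R2R's internals.

```python
# Illustrative only: approximate how a 2048-character recursive splitter with
# no overlap would chunk a document before knowledge graph extraction.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=2048, chunk_overlap=0)

sample_text = open("some_company_profile.txt").read()  # hypothetical input file
chunks = splitter.split_text(sample_text)
print(f"{len(chunks)} chunks; first chunk is {len(chunks[0])} characters")
```
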
In addition to setting the provider, we must set valid environment variables to enable communication with the Neo4j database. For this cookbook, we assume the reader has already downloaded the desktop application [found here](https://neo4j.com/download) and has properly enabled the `APOC` library.

```bash
export NEO4J_USER=neo4j
export NEO4J_PASSWORD=password
export NEO4J_URL=bolt://localhost:7687
export NEO4J_DATABASE=neo4j
```

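Before running the pipeline, it can be worth confirming that these credentials work and that APOC is actually enabled. The snippet below is an optional sanity check using the official `neo4j` Python driver (`pip install neo4j`); it is not part of the R2R scripts.

```python
import os

from neo4j import GraphDatabase

# Optional sanity check: connect with the same environment variables R2R will use.
driver = GraphDatabase.driver(
    os.environ["NEO4J_URL"],
    auth=(os.environ["NEO4J_USER"], os.environ["NEO4J_PASSWORD"]),
)
with driver.session(database=os.environ.get("NEO4J_DATABASE", "neo4j")) as session:
    # Fails with a procedure-not-found error if the APOC plugin is not enabled.
    version = session.run("RETURN apoc.version() AS version").single()["version"]
    print(f"Connected; APOC version {version}")
driver.close()
```
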
<details>
<summary>Relevant Factory Source Code</summary>

```python filename="r2r_factory.py" copy
class R2RPipelineFactory:
    ...
    def create_ingestion_pipeline(self, *args, **kwargs) -> IngestionPipeline:
        """Factory method to create an ingestion pipeline."""
        ingestion_pipeline = IngestionPipeline()

        ingestion_pipeline.add_pipe(
            pipe=self.pipes.parsing_pipe, parsing_pipe=True
        )
        # Add embedding pipes if provider is set
        if self.config.embedding.provider:
            ingestion_pipeline.add_pipe(
                self.pipes.embedding_pipe, embedding_pipe=True
            )
            ingestion_pipeline.add_pipe(
                self.pipes.vector_storage_pipe, embedding_pipe=True
            )
        # Add KG pipes if provider is set
        if self.config.kg.provider:
            ingestion_pipeline.add_pipe(self.pipes.kg_pipe, kg_pipe=True)
            ingestion_pipeline.add_pipe(
                self.pipes.kg_storage_pipe, kg_pipe=True
            )

        return ingestion_pipeline
```
</details>

### Implementation

We are now ready to implement our knowledge graph construction. By default, when running with the configuration above, triples will be extracted and stored to express our graph data. A triple consists of three components: a subject, a predicate, and an object.

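For intuition, a triple can be pictured as a simple subject-predicate-object tuple. The values below are purely illustrative and mirror the output format printed later in this cookbook, not R2R's internal representation.

```python
# Illustrative only: a triple expressed as a (subject, predicate, object) tuple.
triple = ("company:Airbnb", "LOCATED_IN", "location:city:Paris")

subject, predicate, obj = triple
print(f"{subject} -[{predicate}]-> {obj}")
# company:Airbnb -[LOCATED_IN]-> location:city:Paris
```
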
First, pass the selected configuration into the R2R constructor:

```python filename="r2r/examples/scripts/build_yc_kg.py"
def main(max_entries=50, delete=False):
    # Load the R2R configuration and build the app
    this_file_path = os.path.abspath(os.path.dirname(__file__))
    config_path = os.path.join(this_file_path, "..", "configs", "neo4j_kg.json")
    config = R2RConfig.from_json(config_path)
    r2r = R2RAppBuilder(config).build()
```

Now that we have created a properly configured instance of R2R, we can move on to prepping the graph.

In this cookbook, we will ingest startup data from the YCombinator company directory. The default prompt for knowledge graph construction, `ner_kg_extraction`, can be found in `prompts/local/defaults.jsonl`. It constructs the graph in a way that is agnostic to the input data. For this example, we will overwrite this prompt with one that contains specific entities and predicates relevant to startups. After doing so, we will be ready to start ingesting data:

```python filename="r2r/examples/scripts/build_yc_kg.py"
# Get the providers
kg_provider = r2r.providers.kg
prompt_provider = r2r.providers.prompt

# Update the prompt for the NER KG extraction task
ner_kg_extraction_with_spec = prompt_provider.get_prompt(
    "ner_kg_extraction_with_spec"
)

# Newline-separated list of entity types, with optional subcategories
entity_types = """organization
subcategories: company, school, non-profit, other
location
subcategories: city, state, country, other
person
position
date
subcategories: year, month, day, batch (e.g. W24, S20), other
quantity
event
subcategories: incorporation, funding_round, acquisition, launch, other
industry
media
subcategories: email, website, twitter, linkedin, other
product
"""

# Newline-separated list of predicates
predicates = """
# Founder / employee predicates
EDUCATED_AT
FOUNDER_OF
ROLE_OF
WORKED_AT
# Company predicates
FOUNDED_IN
LOCATED_IN
HAS_TEAM_SIZE
REVENUE
RAISED
ACQUIRED_BY
ANNOUNCED
PARTICIPATED_IN
# Product predicates
USED_BY
USES
HAS_PRODUCT
HAS_FEATURES
HAS_OFFERS
# Other
INDUSTRY
TECHNOLOGY
GROUP_PARTNER
ALIAS"""

# Format the prompt to include the desired entity types and predicates
ner_kg_extraction = ner_kg_extraction_with_spec.replace(
    "{entity_types}", entity_types
).replace("{predicates}", predicates)

# Update the "ner_kg_extraction" prompt used in downstream pipes
r2r.providers.prompt.update_prompt(
    "ner_kg_extraction", json.dumps(ner_kg_extraction, ensure_ascii=False)
)
```

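To sanity-check the override before ingesting anything, you can read the prompt back through the same provider. This is an optional step that is not part of the original script, and it assumes `get_prompt` returns the stored template as a string, as it does above.

```python
# Optional check: confirm the startup-specific entity types and predicates are in place.
updated_prompt = prompt_provider.get_prompt("ner_kg_extraction")
print(updated_prompt[:500])
```
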
Finally, we are ready to ingest data. The script includes code to scrape and ingest up to 1,000 companies from the YC directory. The relevant code for ingestion is shown below:

```python filename="r2r/examples/scripts/build_yc_kg.py"
i = 0
# Ingest and clean the data for each company
for company, url in url_map.items():
    company_data = fetch_and_clean_yc_co_data(url)
    if i >= max_entries:
        break
    # Wrap in a try block in case of network failure
    try:
        # Ingest as a text document
        r2r.ingest_documents(
            [
                Document(
                    id=generate_id_from_label(company),
                    type="txt",
                    data=company_data,
                    metadata={},
                )
            ]
        )
    except:
        continue
    i += 1

print_all_relationships(kg_provider)
```

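The `fetch_and_clean_yc_co_data` helper is not reproduced in this cookbook. Below is a rough sketch of what such a scraper could look like (assuming `requests` and `beautifulsoup4` are installed), offered as an illustration rather than the script's actual implementation.

```python
import requests
from bs4 import BeautifulSoup


def fetch_and_clean_yc_co_data(url: str) -> str:
    """Hypothetical sketch: download a YC company page and reduce it to plain text."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    # Drop non-content tags, then collapse the remaining text into clean lines.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    lines = (line.strip() for line in soup.get_text().splitlines())
    return "\n".join(line for line in lines if line)
```
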
Let's start by scraping one example to get an initial sense of the data quality produced by this technique.

```bash
python r2r/examples/scripts/build_yc_kg.py --max_entries=1
```

Running the command above will produce output like the following:

```bash
...
company:Airbnb -[HAS]-> quantity:33,000 cities
company:Airbnb -[HAS]-> quantity:192 countries
company:Airbnb -[LOCATED_IN]-> location:city:Dublin
company:Airbnb -[LOCATED_IN]-> location:city:London
company:Airbnb -[LOCATED_IN]-> location:city:Barcelona
company:Airbnb -[LOCATED_IN]-> location:city:Paris
company:Airbnb -[LOCATED_IN]-> location:city:Milan
company:Airbnb -[LOCATED_IN]-> location:city:Copenhagen
company:Airbnb -[LOCATED_IN]-> location:city:Berlin
company:Airbnb -[LOCATED_IN]-> location:city:Moscow
company:Airbnb -[LOCATED_IN]-> location:city:São Paolo
company:Airbnb -[LOCATED_IN]-> location:city:Sydney
company:Airbnb -[LOCATED_IN]-> location:city:Singapore
person:Brian Chesky -[ROLE_OF]-> position:CEO
person:Brian Chesky -[FOUNDED]-> company:Airbnb
position:CTO -[FOUNDED]-> company:Airbnb
person:Brian Chesky -[LOCATED_IN]-> location:city:New York
person:Brian Chesky -[EDUCATED_AT]-> school:Rhode Island School of Design
person:Brian Chesky -[HAS]-> degree:Bachelor of Fine Arts in Industrial Design
position:CTO -[ROLE_OF]-> position:CTO
```

Now, we are ready to run a larger ingestion:

```bash
poetry run python r2r/examples/scripts/build_yc_kg.py --delete --max_entries=50
```