<a href="https://colab.research.google.com/github/16dina/graph-rag-osoc/blob/main/graph_rag_sparql_mini.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SPARQL Graph RAG Demo


A minimal, special-case proof of concept in which a SPARQL store serves as a source (or cache) of structured data used alongside an LLM in a Retrieval-Augmented Generation system.

[OpenAI API](https://openai.com/blog/openai-api) key required to run.

Functionally, it mimics a previous demo that used a more traditional enterprise graph database as its data source. Querying a SPARQL store to obtain the data that augments the LLM shows that the concept works.

In a Linked Data context there is also a novel aspect, which arises as a side effect of mimicking the previous work. The data used is initially derived from a Wikipedia document, with the triples created from text using NLP techniques. Such approaches are the subject of ongoing research elsewhere and are not in the immediate scope here.

But a closer offshoot, almost certainly worthy of further investigation, is the availability of a pipeline from an LLM + RAG system back to a SPARQL store. Conceptually this mirrors the often-overlooked notion of the Web as a read/write knowledge base. A SPARQL store can be viewed as a cache, a medium-term memory of (perhaps domain-specific) data from Linked Data or the Web at large. It could also hold information that results from LLM-style reasoning.

This demo currently uses my (danja) fork of LlamaIndex. Perhaps when I've tidied things up, it might find its way into the main repo. But the idea here was simply to gauge the amount of work involved in hooking into a SPARQL store to provide augmentation. By trivial extension, this could include potentially any Linked Data, or for that matter any information on the Web, exploiting the knowledge captured by its structure.

There was moderate effort involved in writing something that was compatible with an existing toolset, but aside from that requirement, there really isn't anything new in the code.

LlamaIndex was used here as a route into exploring this area. But in general, this approach is overkill. Standard libraries that are commonly used in other contexts (with appropriate configuration or at most minor modification) can supply all the necessary functionality.

In short, with a small amount of glue code, there is a massive amount of very low-hanging fruit.

## Background

Large Language Models are a significant breakthrough in knowledge representation and look to be useful in countless domains. But there are certain clear issues with the current state of the art.

The most obvious is the expense of training. Systems like [OpenAI's GPT](https://openai.com/) family typically yield monolithic architectures with preset knowledge. Fine-tuning is one way around this: it takes a general-purpose base model and gives it additional information so that it can operate on a particular range of tasks. But fine-tuning is still non-trivial in practice and is again expensive in terms of the resources required.

But there are other approaches to get around this intrinsic lack of flexibility. [Retrieval-Augmented Generation](https://arxiv.org/abs/2005.11401) (RAG) is one such. The initial RAG technique used vector databases for encoding the data to be used in augmentation. However, recent work has demonstrated that Knowledge Graphs can potentially fulfil this role in a more productive fashion. *At which point the ears of someone with a background in the Web and [Linked Data](https://en.wikipedia.org/wiki/Linked_data) prick up.*


## Retrieval-Augmented Generation with LLM Based on Knowledge Graphs

[Wey Gu](https://siwei.io/en/) recently demonstrated the potential here. His [slides](https://siwei.io/talks/graph-rag-with-jerry/1) and [video](https://www.youtube.com/watch?v=bPoNCkjDmco) offer a good overview of the paradigm.

The aim here is to replicate a part of [Wey Gu](https://siwei.io/en/)'s Jupyter [Notebook](https://www.siwei.io/en/demos/graph-rag/), where [LlamaIndex](https://www.llamaindex.ai/) uses a graph store. That Notebook uses a [NebulaGraph](https://www.nebula-graph.io/) store; this one uses a SPARQL store.

An [OpenAI API](https://openai.com/blog/openai-api) key will be required to run this, as well as read/write access to a [SPARQL](https://en.wikipedia.org/wiki/SPARQL) store. For now at least such a store is available with the details as set in the code below.

### Preparation

**Here on Colab you should only need an API key. Once it is inserted below, everything should Just Work.**

* `pip install sparqlwrapper`
* Make a SPARQL endpoint available and add its URL below
* Make sure the endpoint supports UPDATE (the dataset used here is `/llama_index_sparql-test/`)
* For a clean start, run `DROP GRAPH <http://purl.org/stuff/guardians>`

* **Add OpenAI API key below**
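
The clean-start step in the list above can be scripted. The following is a minimal sketch using only the Python standard library, assuming the dataset URL accepts SPARQL 1.1 Update requests as form-encoded POSTs (as the setup notes require). The helper `build_drop_graph_request` is a hypothetical name, not part of LlamaIndex or `sparql.py`. The request is only built here; the lines that would send it are left commented out because they delete data.

```python
import urllib.parse
import urllib.request

# Values copied from the SPARQL Config cell in this notebook.
ENDPOINT = "https://fuseki.hyperdata.it/llama_index_sparql-test/"
GRAPH = "http://purl.org/stuff/guardians"


def build_drop_graph_request(endpoint: str, graph_uri: str) -> urllib.request.Request:
    """Build (but do not send) a SPARQL 1.1 Update request that drops a named graph.

    Hypothetical helper for illustration only.
    """
    # SILENT stops the store reporting an error if the graph does not exist yet.
    update = f"DROP SILENT GRAPH <{graph_uri}>"
    body = urllib.parse.urlencode({"update": update}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


request = build_drop_graph_request(ENDPOINT, GRAPH)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# Uncomment to actually clear the graph (this deletes data on the live store):
# with urllib.request.urlopen(request, timeout=30) as response:
#     print(response.status)
```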

#### 1. Imports, LLM Configuration

In [None]:
!pip install git+https://github.com/danja/llama_index

from llama_index import download_loader
import os
import logging
from llama_index import (
    KnowledgeGraphIndex,
    ServiceContext,
)

from llama_index.storage.storage_context import StorageContext
from llama_index.graph_stores import SparqlGraphStore
from llama_index.llms import OpenAI
from IPython.display import Markdown, display
from llama_index import load_index_from_storage
import openai

# handy for debugging locally
# logging.basicConfig(filename='loggy.log', filemode='w', level=logging.DEBUG)
# logger = logging.getLogger(__name__)

Collecting git+https://github.com/danja/llama_index
  Cloning https://github.com/danja/llama_index to /tmp/pip-req-build-64mdcmfo
  Running command git clone --filter=blob:none --quiet https://github.com/danja/llama_index /tmp/pip-req-build-64mdcmfo
  Resolved https://github.com/danja/llama_index to commit 236a867fa58b646237aed4ab89af054bc0a5e86b
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [None]:
############
# LLM Config
############

# at least one should work
os.environ["OPENAI_API_KEY"] = ""
openai.api_key = ""

llm = OpenAI(temperature=0, model="text-davinci-002")
service_context = ServiceContext.from_defaults(llm=llm, chunk_size=512)


The `davinci-002` model is OpenAI's successor to the GPT-3 curie and davinci base models. It supports 16,384 tokens and its training data goes up to 2021-09.

#### 1.1 SPARQL Store Configuration

SPARQL stores may vary in implementation. The [Fuseki](https://jena.apache.org/documentation/fuseki2/) server used here follows the [specifications](https://www.w3.org/TR/sparql11-query/) closely and uses the following scheme:

* multiple datasets (= DBs) are supported
* each dataset can contain a default graph as well as multiple named graphs
* each dataset can be configured with various endpoints, each providing facilities as required (query, update etc)

*Fuseki does include basic access control facilities, but the dataset used here is wide open for convenience.*


In [None]:
###############
# SPARQL Config
###############
ENDPOINT = 'https://fuseki.hyperdata.it/llama_index_sparql-test/'
GRAPH = 'http://purl.org/stuff/guardians'
BASE_URI = 'http://purl.org/stuff/data'

graph_store = SparqlGraphStore(
    sparql_endpoint=ENDPOINT,
    sparql_graph=GRAPH,
    sparql_base_uri=BASE_URI,
)
storage_context = StorageContext.from_defaults(graph_store=graph_store)

#### 2.1 Load Augmentation Data

The data used here is about the 2023 movie 'Barbie'. An LLM trained on data up to 2021 will know nothing about it.

In [None]:
WikipediaReader = download_loader("WikipediaReader")
loader = WikipediaReader()
documents = loader.load_data(
    pages=['Barbie (film)'], auto_suggest=False)

#### 2.2 Create Index from Augmentation Data

In [None]:
kg_index = KnowledgeGraphIndex.from_documents(
    documents,
    storage_context=storage_context,
    service_context=service_context,
    max_triplets_per_chunk=10,
    sparql_endpoint=ENDPOINT,
    sparql_graph=GRAPH,
    sparql_base_uri=BASE_URI,
    include_embeddings=True,
)

(Barbie, is, 2023 American fantasy comedy film)
(Barbie, directed by, Greta Gerwig)
(Barbie, stars, Margot Robbie)
(Barbie, follows, pair on journey of self-discovery)
(Barbie, was announced in, September 2009)
(Barbie, was transferred to, Warner Bros. Pictures in October 2018)
(Barbie, was cast in, 2019)
(Barbie, premiered at, Shrine Auditorium in Los Angeles on July 9, 2023)
(Barbie, was released in, United States on July 21, by Warner Bros. Pictures)
(Barbie, has grossed, $1.42 billion)
(Beach Ken, is, happy when with Barbie)
(Beach Ken, seeks, closer relationship with Barbie)
(Barbie, rebuffs, Beach Ken)
(Barbie, is stricken with worries, mortality)
(Weird Barbie, tells, Barbie must find child playing with her)
(Ken, stows away, Barbie's convertible)
(Barbie, punches, man)
(Barbie, tracks down, her owner)
(Barbie, discovers, Gloria)
(Barbie, attempts to put, Barbie in toy box)
(Barbie, is, live-action film)
(Barbie, had been in development at, Cannon Films)
(Barbie, was in developm

In [None]:
kg_rag_query_engine = kg_index.as_query_engine(
    include_text=False,
    retriever_mode="keyword",
    response_mode="tree_summarize",
)

In [None]:
response_graph_rag = kg_rag_query_engine.query(
    "In the movie, what does Barbie think about?")
# print(str(response_graph_rag))
display(Markdown(f"<b>{response_graph_rag}</b>"))
# response_graph_rag = kg_rag_query_engine.query(
#    "Repeat the word 'fish'")
# print(str(response_graph_rag))
# display(Markdown(f"<b>{response_graph_rag}</b>"))

[]
[{'rel1': {'type': 'literal', 'value': 'is'}, 'obj1': {'type': 'literal', 'value': '2023 American fantasy comedy film'}}, {'rel1': {'type': 'literal', 'value': 'directed by'}, 'obj1': {'type': 'literal', 'value': 'Greta Gerwig'}}, {'rel1': {'type': 'literal', 'value': 'stars'}, 'obj1': {'type': 'literal', 'value': 'Margot Robbie'}}, {'rel1': {'type': 'literal', 'value': 'follows'}, 'obj1': {'type': 'literal', 'value': 'pair on journey of self-discovery'}}, {'rel1': {'type': 'literal', 'value': 'was announced in'}, 'obj1': {'type': 'literal', 'value': 'September 2009'}}, {'rel1': {'type': 'literal', 'value': 'was transferred to'}, 'obj1': {'type': 'literal', 'value': 'Warner Bros. Pictures in October 2018'}}, {'rel1': {'type': 'literal', 'value': 'was cast in'}, 'obj1': {'type': 'literal', 'value': '2019'}}, {'rel1': {'type': 'literal', 'value': 'has grossed'}, 'obj1': {'type': 'literal', 'value': '$1.42 billion'}}, {'rel1': {'type': 'literal', 'value': 'rebuffs'}, 'obj1': {'type': '

<b>

In the movie, Barbie thinks about existentialism, autonomy, and the idea of 'the other.'</b>

## How it Works

See Wey's [slides](https://siwei.io/talks/graph-rag-with-jerry/1) and [video](https://www.youtube.com/watch?v=bPoNCkjDmco).

The connector [sparql.py](https://github.com/danja/llama_index/blob/main/llama_index/graph_stores/sparql.py) operates in a similar fashion to Wey's [nebulagraph.py](https://github.com/danja/llama_index/blob/main/llama_index/graph_stores/nebulagraph.py).

#### RDF Data Model

To get closer to using linked data an [RDF](https://en.wikipedia.org/wiki/Resource_Description_Framework) data model is required.
The one used here is very naive, *the simplest thing that might work*.

LlamaIndex wants text data in a shape like `(subject, predicate, object)`, which doesn't map directly to RDF; the nearest RDF shape is `(named resource, named resource, string literal)`. So a little indirection is used here, with URIs randomly generated in the code. Each triplet gets a node of its own, e.g. <https://purl.org/stuff/data/T123>, which ties together the nodes for its three components, e.g. <https://purl.org/stuff/data/E123>, <https://purl.org/stuff/data/R456>, <https://purl.org/stuff/data/E567>.

There are no doubt much better ways in which the RDF data could be expressed within LlamaIndex, but this is good enough for a proof of concept.

![RDF model](https://github.com/danja/nlp/blob/main/GraphRAG/docs/rdf-simple-rag.png?raw=true)

Although it's by no means essential for operation here, for the sake of completeness the quick & dirty schema is online at [https://purl.org/stuff/er](https://purl.org/stuff/er).
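
To make the indirection described in this section concrete, here is a rough sketch of how one text triplet could be turned into an `INSERT DATA` update. It is not the code in `sparql.py`. The `er:` namespace URI and the term names (`er:Triplet`, `er:Entity`, `er:Relationship`, `er:subject`, `er:property`, `er:object`, `er:value`) are assumptions made for illustration and should be checked against the schema. The helpers `escape_literal`, `new_uri` and `triplet_to_insert` are hypothetical.

```python
import uuid

# Values copied from the SPARQL Config cell in this notebook.
BASE_URI = "http://purl.org/stuff/data"
GRAPH = "http://purl.org/stuff/guardians"
# Assumed namespace and term names; check the schema before relying on them.
ER_NS = "http://purl.org/stuff/er#"


def escape_literal(text: str) -> str:
    """Escape a Python string for use inside a double-quoted SPARQL literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def new_uri(prefix: str) -> str:
    """Mint a random URI under BASE_URI: a one-letter prefix plus 8 hex digits."""
    return f"{BASE_URI}/{prefix}{uuid.uuid4().hex[:8]}"


def triplet_to_insert(subj: str, rel: str, obj: str) -> str:
    """Build an INSERT DATA update for one (subject, relation, object) triplet."""
    t, s, r, o = new_uri("T"), new_uri("E"), new_uri("R"), new_uri("E")
    return f"""PREFIX er: <{ER_NS}>
INSERT DATA {{
  GRAPH <{GRAPH}> {{
    <{t}> a er:Triplet ;
        er:subject <{s}> ;
        er:property <{r}> ;
        er:object <{o}> .
    <{s}> a er:Entity ; er:value "{escape_literal(subj)}" .
    <{r}> a er:Relationship ; er:value "{escape_literal(rel)}" .
    <{o}> a er:Entity ; er:value "{escape_literal(obj)}" .
  }}
}}"""


print(triplet_to_insert("Barbie", "directed by", "Greta Gerwig"))
```

In this sketch every triplet mints fresh nodes, so two triplets about "Barbie" do not share a subject node; they are only connected through the literal value. Whether that is acceptable depends on how the retrieval queries are written.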

#### SPARQL

`sparql.py` interacts with a standard persistent store over HTTP.

The store/dataset has a UI [here](https://fuseki.hyperdata.it/#/dataset/llama_index_sparql-test/query) (cancel the login box, it's not needed). In that dataset there is a named graph <http://purl.org/stuff/guardians> *(heh, I'd originally set it up to use the 'Guardians of the Galaxy' page Wey was using, but to double-check it was working flipped it to Barbie).*

There, a query like
```
SELECT ?s ?p ?o
WHERE {
  GRAPH <http://purl.org/stuff/guardians> {
  ?s ?p ?o
  }
}
LIMIT 10
```
should give some results.
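
The same query can also be sent from Python without any extra libraries. This is a minimal sketch using the standard library and the SPARQL 1.1 Protocol (HTTP GET with a `query` parameter, JSON results requested via the `Accept` header). It assumes the dataset URL from the config cell answers queries directly. `build_select_request` and `rows_from_results` are hypothetical helper names, and the network call is left commented out.

```python
import json
import urllib.parse
import urllib.request

# Value copied from the SPARQL Config cell in this notebook.
ENDPOINT = "https://fuseki.hyperdata.it/llama_index_sparql-test/"

QUERY = """SELECT ?s ?p ?o
WHERE {
  GRAPH <http://purl.org/stuff/guardians> {
    ?s ?p ?o
  }
}
LIMIT 10"""


def build_select_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build a SPARQL 1.1 Protocol query request (HTTP GET, JSON results)."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def rows_from_results(results_json: str) -> list[dict[str, str]]:
    """Flatten SPARQL JSON results into plain {variable: value} dicts."""
    bindings = json.loads(results_json)["results"]["bindings"]
    return [{var: cell["value"] for var, cell in row.items()} for row in bindings]


request = build_select_request(ENDPOINT, QUERY)

# Uncomment to run against the live store:
# with urllib.request.urlopen(request, timeout=30) as response:
#     for row in rows_from_results(response.read().decode("utf-8")):
#         print(row)
```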

For convenience, `sparql.py` uses the `sparqlwrapper` library. This wasn't really necessary here, since the HTTP calls are fairly simple POSTs and GETs. But it might be useful later...

The `INSERT DATA` and `SELECT` queries are built mostly using Python templates.
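
As an illustration of the template approach, here is a sketch of a `SELECT` query built with `string.Template`. It is not the template used in `sparql.py`. The variable names `?rel1` and `?obj1` match the bindings printed in the query cell output earlier, but the `er:` namespace and term names in the graph pattern are assumptions, and `build_select` is a hypothetical helper.

```python
from string import Template

# Assumed schema terms (er:Triplet, er:subject, ...); illustrative only.
SELECT_TEMPLATE = Template("""PREFIX er: <http://purl.org/stuff/er#>
SELECT ?rel1 ?obj1
WHERE {
  GRAPH <$graph> {
    ?triplet a er:Triplet ;
        er:subject ?s ;
        er:property ?p ;
        er:object ?o .
    ?s er:value "$subject" .
    ?p er:value ?rel1 .
    ?o er:value ?obj1 .
  }
}""")


def build_select(graph_uri: str, subject: str) -> str:
    """Fill the template, escaping the subject so it stays inside its literal."""
    safe = subject.replace("\\", "\\\\").replace('"', '\\"')
    return SELECT_TEMPLATE.substitute(graph=graph_uri, subject=safe)


print(build_select("http://purl.org/stuff/guardians", "Barbie"))
```

`string.Template` is convenient here because its `$name` placeholders do not clash with the braces that SPARQL uses for graph patterns, which an f-string or `str.format` would require doubling.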

[test_sparql.py](https://github.com/danja/llama_index/blob/main/tests/graph_stores/test_sparql.py) may be informative.

Those queries, and the operation as a whole, are very sub-optimal, but just good enough to resemble Wey's NebulaGraph-based setup.

Internally LlamaIndex gives the LLM prompts like
```
    'Peter Quill, -[is leader of]->, Guardians of the Galaxy',
    'Peter Quill, -[would return to the MCU]->, May 2021, <-[Gunn reaffirmed]-, Guardians of the Galaxy Vol. 3',
```
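
Path strings of the simple one-hop kind shown above can be derived from SPARQL result bindings like those printed in the query cell output earlier. The following is a small sketch of that step only. `bindings_to_paths` is a hypothetical helper, and LlamaIndex's own formatting (which also handles multi-hop and reverse relations, as in the second example above) may differ.

```python
def bindings_to_paths(subject: str, bindings: list[dict]) -> list[str]:
    """Turn SPARQL JSON bindings with ?rel1/?obj1 into 'subj, -[rel]->, obj' strings."""
    paths = []
    for row in bindings:
        rel = row["rel1"]["value"]
        obj = row["obj1"]["value"]
        paths.append(f"{subject}, -[{rel}]->, {obj}")
    return paths


# Two bindings copied from the query output shown earlier in this notebook.
bindings = [
    {
        "rel1": {"type": "literal", "value": "is"},
        "obj1": {"type": "literal", "value": "2023 American fantasy comedy film"},
    },
    {
        "rel1": {"type": "literal", "value": "directed by"},
        "obj1": {"type": "literal", "value": "Greta Gerwig"},
    },
]

for path in bindings_to_paths("Barbie", bindings):
    print(path)
```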

Aside from cleaning things up, an obvious next step is to get this working against [linked data](https://en.wikipedia.org/wiki/Linked_data) in the wild. There's a lot out there.
Web of LLMs..?

See also [Graph of Thoughts : Solving Elaborate Problems with Large Language Models](https://arxiv.org/abs/2308.09687)




