# BYOKG RAG Demo
This notebook demonstrates a Retrieval-Augmented Generation (RAG) system built on top of a knowledge graph. The system takes a natural language question, retrieves relevant information from the graph, and uses it to generate an answer. It consists of the following components:

1. **Graph Store**: Manages the knowledge graph data structure
2. **KG Linker**: Links natural language queries to graph entities and paths
3. **Entity Linker**: Matches entities from text to graph nodes
4. **Triplet Retriever**: Retrieves relevant triplets from the graph
5. **Path Retriever**: Finds paths between entities in the graph
6. **Query Engine**: Orchestrates all components to answer questions

In [8]:
import sys
import os

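# Add the project root to the Python path.
# Notebooks do not define __file__, so the literal string '__file__' is resolved
# against the current working directory instead.
current_file = os.path.abspath('__file__')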
project_root = os.path.dirname(os.path.dirname(os.path.dirname(current_file)))
src_path = os.path.join(project_root, 'byokg-rag')
sys.path.append(src_path)

### Graph Store
The `LocalKGStore` class provides an interface to the knowledge graph. Here we:
1. Initialize the graph store
2. Load data from a CSV file
3. Get basic statistics about the graph
4. Examine sample edges for a specific node (Wynton Marsalis)

In [9]:
import random

from src.graphstore import LocalKGStore

graph_store = LocalKGStore()
graph_store.read_from_csv('data/freebase_tiny_kg.csv')

# Fetch the schema (passed to the KG Linker later) and print graph statistics
schema = graph_store.get_schema()
number_of_nodes = len(graph_store.nodes())
number_of_edges = len(graph_store.get_triplets())
print(f"The graph has {number_of_nodes} nodes and {number_of_edges} edges.")

# Sample three relations of the node "Wynton Marsalis" together with their edges
sample_triplets = graph_store.get_one_hop_edges(["Wynton Marsalis"])
sample_triplets = random.sample(list(sample_triplets["Wynton Marsalis"].items()), 3)
print("Some neighboring edges of node 'Wynton Marsalis' are: ", sample_triplets)


The graph has 1691 nodes and 21911 edges.
Some neighboring edges of node 'Wynton Marsalis' are:  [('people.person.nationality', [('Wynton Marsalis', 'people.person.nationality', 'United States of America')]), ('people.person.gender', [('Wynton Marsalis', 'people.person.gender', 'Male')]), ('freebase.valuenotation.is_reviewed', [('Wynton Marsalis', 'freebase.valuenotation.is_reviewed', 'Country of nationality'), ('Wynton Marsalis', 'freebase.valuenotation.is_reviewed', 'Date of birth'), ('Wynton Marsalis', 'freebase.valuenotation.is_reviewed', 'Gender'), ('Wynton Marsalis', 'freebase.valuenotation.is_reviewed', 'Place of birth')])]
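
To make the shape of the output above easier to read, here is a minimal sketch of an in-memory triplet store that groups a node's outgoing edges by relation. It is an illustration only, not the `LocalKGStore` implementation: the class `TinyTripletStore` and its internals are hypothetical, and only the method names mirror the calls used in this notebook.

```python
from collections import defaultdict


class TinyTripletStore:
    """Toy triplet store (hypothetical); not the LocalKGStore implementation."""

    def __init__(self):
        self._triplets = []
        # head node -> relation -> list of (head, relation, tail) triplets
        self._out_edges = defaultdict(lambda: defaultdict(list))

    def add(self, head, relation, tail):
        triplet = (head, relation, tail)
        self._triplets.append(triplet)
        self._out_edges[head][relation].append(triplet)

    def nodes(self):
        # Every distinct head or tail counts as a node
        return sorted({n for h, _, t in self._triplets for n in (h, t)})

    def get_triplets(self):
        return list(self._triplets)

    def get_one_hop_edges(self, source_nodes):
        # Outgoing edges grouped by relation; unknown nodes are skipped
        return {
            node: {rel: list(trips) for rel, trips in self._out_edges[node].items()}
            for node in source_nodes
            if node in self._out_edges
        }


store = TinyTripletStore()
store.add("Wynton Marsalis", "people.person.place_of_birth", "New Orleans")
store.add("Wynton Marsalis", "people.person.gender", "Male")
store.add("New Orleans", "location.location.containedby", "Louisiana")

edges = store.get_one_hop_edges(["Wynton Marsalis"])
print(f"The graph has {len(store.nodes())} nodes and {len(store.get_triplets())} edges.")
print(edges["Wynton Marsalis"]["people.person.gender"])
```

Grouping by relation is what lets a retriever first choose which relations look relevant and only then expand their tails.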


### Question Answering

We define a sample question and its ground truth answer to test our system. The question requires reasoning through multiple hops in the knowledge graph to find the answer.

In [10]:
question = "What genre of film is associated with the place where Wynton Marsalis was born?"
answer = "Backstage Musical"

### KG Linker
The `KGLinker` uses an LLM (Claude 3.5 Sonnet) to:
1. Extract entities from the question
2. Identify potential relationship paths in the graph
3. Generate initial responses based on its knowledge

In [11]:

from src.graph_connectors import KGLinker
from src.llm import BedrockGenerator

# Initialize the LLM
llm_generator = BedrockGenerator(
    model_name='us.anthropic.claude-3-5-sonnet-20240620-v1:0',
    region_name='us-west-2',
)

kg_linker = KGLinker(graph_store=graph_store, llm_generator=llm_generator)
response = kg_linker.generate_response(
    question=question,
    schema=schema,
    graph_context="Not provided. Use the above schema to understand the graph.",
)
response


"I understand that I need to analyze the given question, extract relevant entities, identify relationship paths, and provide potential answers based on the given schema. I'll proceed with each task as requested.\n\n### Task: Entity Extraction\n\n<entities>\nWynton Marsalis\n</entities>\n\n### Task: Relationship Path Identification\n\n<paths>\npeople.person.place_of_birth -> location.location.containedby\npeople.person.place_of_birth -> location.location.geolocation\nfilm.film_location.featured_in_films -> film.film.genre\n</paths>\n\n### Task: Question Answering\n\n<answers>\n</answers>\n\nI cannot provide a definitive answer based solely on the given information and schema. The question requires knowledge about Wynton Marsalis's birthplace and its association with a film genre, which is not directly available in the provided schema or my knowledge base without additional context."

In [12]:
artifacts = kg_linker.parse_response(response)
artifacts

{'entity-extraction': ['Wynton Marsalis'],
 'path-extraction': ['people.person.place_of_birth -> location.location.containedby',
  'people.person.place_of_birth -> location.location.geolocation',
  'film.film_location.featured_in_films -> film.film.genre'],
 'draft-answer-generation': []}
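
The mapping from the tagged LLM response to the artifacts dictionary can be sketched with a few regular expressions. This is an assumption about how such parsing might work, inferred only from the response and output shown above; `parse_tagged_response` and `TAG_TO_KEY` are hypothetical names, not part of `KGLinker`.

```python
import re

# Tag in the LLM response -> key in the artifacts dictionary (as seen in the output above)
TAG_TO_KEY = {
    "entities": "entity-extraction",
    "paths": "path-extraction",
    "answers": "draft-answer-generation",
}


def parse_tagged_response(text):
    """Collect the non-empty lines found between each <tag>...</tag> pair."""
    artifacts = {}
    for tag, key in TAG_TO_KEY.items():
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        body = match.group(1) if match else ""
        artifacts[key] = [line.strip() for line in body.splitlines() if line.strip()]
    return artifacts


sample = (
    "### Task: Entity Extraction\n\n<entities>\nWynton Marsalis\n</entities>\n\n"
    "<paths>\npeople.person.place_of_birth -> location.location.containedby\n</paths>\n\n"
    "<answers>\n</answers>\n"
)
parsed = parse_tagged_response(sample)
print(parsed)
```

A missing or empty tag yields an empty list, which matches the empty `draft-answer-generation` entry above when the model declines to answer.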

### Entity Linking
The `EntityLinker` uses fuzzy string matching to:
1. Match extracted entities to actual nodes in the graph
2. Link potential answers to graph nodes

In [13]:
from src.indexing import FuzzyStringIndex
from src.graph_retrievers import EntityLinker

# Add graph nodes text for string matching
string_index = FuzzyStringIndex()
string_index.add(graph_store.nodes())
retriever = string_index.as_entity_matcher()
entity_linker = EntityLinker(retriever=retriever)

linked_entities = entity_linker.link(artifacts["entity-extraction"], return_dict=False)
linked_answers = entity_linker.link(artifacts["draft-answer-generation"], return_dict=False)
linked_entities, linked_answers

(['Wynton Marsalis'], [])
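
As a rough illustration of fuzzy entity linking, the sketch below matches mentions against node names with the standard library's `difflib`. This stands in for whatever matcher `FuzzyStringIndex` actually uses; the function `link_entities` and the `cutoff` value are assumptions made for this example.

```python
from difflib import get_close_matches


def link_entities(mentions, node_names, cutoff=0.8):
    """Map each mention to its closest node name, or drop it if nothing is close enough."""
    # Compare case-insensitively but return the original node spelling
    lowered = {}
    for name in node_names:
        lowered.setdefault(name.lower(), name)

    linked = []
    for mention in mentions:
        hits = get_close_matches(mention.lower(), list(lowered), n=1, cutoff=cutoff)
        if hits and lowered[hits[0]] not in linked:
            linked.append(lowered[hits[0]])
    return linked


nodes = ["Wynton Marsalis", "New Orleans", "Louisiana", "Easy Rider"]
linked = link_entities(["wynton marsalis", "New Orleens"], nodes)
print(linked)
print(link_entities([], nodes))
```

An empty list of mentions produces an empty list of links, which is why `linked_answers` is empty above when the LLM returns no draft answers.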

### Triplet Retrieval
The `AgenticRetriever` uses an LLM to:
1. Navigate the graph starting from linked entities
2. Select relevant relations based on the question
3. Expand those relations and decide which entities to explore next
4. Return the relevant (head -> relation -> tail) triplets for the question


In [14]:
from src.graph_retrievers import AgenticRetriever
from src.graph_retrievers import GTraversal, TripletGVerbalizer
graph_traversal = GTraversal(graph_store)
graph_verbalizer = TripletGVerbalizer()
triplet_retriever = AgenticRetriever(
    llm_generator=llm_generator, 
    graph_traversal=graph_traversal,
    graph_verbalizer=graph_verbalizer)

In [15]:
triplet_context = triplet_retriever.retrieve(query=question, source_nodes=linked_entities)
triplet_context

['Wynton Marsalis -> music.artist.genre -> Big Band | Jazz',
 'Wynton Marsalis -> people.person.place_of_birth -> New Orleans',
 'New Orleans -> location.location.containedby -> Louisiana | United States of America | Orleans Parish',
 'New Orleans -> common.topic.notable_types -> Film | City/Town/Village',
 'New Orleans -> film.film_location.featured_in_films -> Easy Rider',
 'New Orleans -> film.film.genre -> Backstage Musical',
 'Film -> common.topic.subjects -> Mervin Praison | Brian Keith Kennedy | Billy Sorrentino',
 'Film -> common.topic.subject_of -> Mervin Praison | Billy Sorrentino',
 'Film -> type.domain.types -> Film character | Film performance | Film writer',
 'Film -> media_common.quotation_subject.quotations_about_this_subject -> Movies are a fad. Audiences really want to see live actors on a stage.']
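
The select-then-expand loop described above can be sketched without an LLM. In this hypothetical example, `keyword_selector` is a crude stand-in for the LLM's relation selection, and `retrieve` is an illustrative function, not the `AgenticRetriever` implementation. The output lines use the same `head -> relation -> tail | tail` form as the context above.

```python
import re
from collections import defaultdict

TRIPLETS = [
    ("Wynton Marsalis", "people.person.place_of_birth", "New Orleans"),
    ("Wynton Marsalis", "people.person.gender", "Male"),
    ("New Orleans", "location.location.containedby", "Louisiana"),
    ("New Orleans", "film.film_location.featured_in_films", "Easy Rider"),
]


def keyword_selector(question, relations):
    """Stand-in for the LLM: keep relations that share a word with the question."""
    words = set(question.lower().replace("?", "").split())
    return [r for r in relations if words & set(re.split(r"[._]", r))]


def retrieve(question, source_nodes, triplets, select_relations, max_hops=2):
    # head -> relation -> list of tails
    out_edges = defaultdict(lambda: defaultdict(list))
    for head, rel, tail in triplets:
        out_edges[head][rel].append(tail)

    context, frontier, visited = [], list(source_nodes), set(source_nodes)
    for _ in range(max_hops):
        next_frontier = []
        for node in frontier:
            relations = list(out_edges.get(node, {}))
            for rel in select_relations(question, relations):
                tails = out_edges[node][rel]
                context.append(f"{node} -> {rel} -> {' | '.join(tails)}")
                # Unvisited tails become the starting points of the next hop
                for tail in tails:
                    if tail not in visited:
                        visited.add(tail)
                        next_frontier.append(tail)
        frontier = next_frontier
    return context


question = "What genre of film is associated with the place where Wynton Marsalis was born?"
triplet_lines = retrieve(question, ["Wynton Marsalis"], TRIPLETS, keyword_selector)
print(triplet_lines)
```

Swapping `keyword_selector` for a function that prompts an LLM with the question and the candidate relations would give the agentic behavior, at the cost of one or more model calls per hop.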

### Path Retrieval
The `PathRetriever` uses the identified metapaths and candidate answers to:
1. Retrieve actual paths in the graph following the metapath
2. Retrieve shortest paths connecting question entities and candidate answers (if any) 
3. Verbalize the paths for context

In [16]:
from src.graph_retrievers import PathRetriever
from src.graph_retrievers import GTraversal, PathVerbalizer
graph_traversal = GTraversal(graph_store)
path_verbalizer = PathVerbalizer()
path_retriever = PathRetriever(
    graph_traversal=graph_traversal,
    path_verbalizer=path_verbalizer)

metapaths = [[component.strip() for component in path.split("->")] for path in artifacts["path-extraction"]]
shortened_paths = []
for path in metapaths:
    if len(path) > 2:
        shortened_paths.append(path[:2])
metapaths += shortened_paths
path_context = path_retriever.retrieve(linked_entities, metapaths, linked_answers)
path_context

['Wynton Marsalis -> people.person.place_of_birth > New Orleans-> location.location.containedby -> Orleans Parish | Louisiana | United States of America']
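
Following a metapath means starting at a linked entity and taking only edges whose relations match the metapath in order. The sketch below shows that idea on three triplets; `follow_metapath` and `verbalize` are hypothetical helpers, not the `PathRetriever` or `PathVerbalizer` implementations. Unlike the output above, it emits one line per concrete path rather than merging tails with `|`.

```python
TRIPLETS = [
    ("Wynton Marsalis", "people.person.place_of_birth", "New Orleans"),
    ("New Orleans", "location.location.containedby", "Louisiana"),
    ("New Orleans", "location.location.containedby", "United States of America"),
]


def follow_metapath(start, metapath, triplets):
    """Return every concrete path from `start` whose relations match `metapath` in order."""
    paths = [[start]]
    for relation in metapath:
        extended = []
        for path in paths:
            for head, rel, tail in triplets:
                if head == path[-1] and rel == relation:
                    extended.append(path + [rel, tail])
        paths = extended  # paths that could not be extended are dropped
    return paths


def verbalize(path):
    # A path alternates node, relation, node, ...; join it into one readable line
    return " -> ".join(path)


# Parse the metapath string the same way the notebook cell does
raw = "people.person.place_of_birth -> location.location.containedby"
metapath = [component.strip() for component in raw.split("->")]

found = follow_metapath("Wynton Marsalis", metapath, TRIPLETS)
for path in found:
    print(verbalize(path))
```

A metapath whose first relation does not exist on the start node yields no paths, which would explain why only one of the three extracted metapaths contributes to the path context above.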

### Retrieval Evaluation

We check whether the ground-truth answer appears as a substring of the combined triplet and path context.

In [17]:
context = list(set(triplet_context + path_context))
print(f"Success! Ground-truth answer `{answer}` retrieved!" if answer in '\n'.join(context) else "Failure..")


Success! Ground-truth answer `Backstage Musical` retrieved!


### BYOKG RAG Pipeline

The `ByoKGQueryEngine` combines all components to:
1. Process natural language questions
2. Retrieve relevant context from the graph
3. Generate answers based on the retrieved information

In [18]:
from src.byokg_query_engine import ByoKGQueryEngine
byokg_query_engine = ByoKGQueryEngine(
    graph_store=graph_store,
    kg_linker=kg_linker,
    triplet_retriever=triplet_retriever,
    path_retriever=path_retriever,
    entity_linker=entity_linker
)

In [20]:
retrieved_context = byokg_query_engine.query(question)
answers, response = byokg_query_engine.generate_response(question, "\n".join(retrieved_context))

print("Retrieved context: ", "\n".join(retrieved_context))
print("Generated answers: ", answers)
print(f"Success! Ground-truth answer `{answer}` retrieved!" if answer in '\n'.join(answers) else "Failure..")


Retrieved context:  Wynton Marsalis -> music.artist.genre -> Big Band | Jazz
Wynton Marsalis -> people.person.place_of_birth -> New Orleans
New Orleans -> location.location.containedby -> Louisiana | United States of America | Orleans Parish
New Orleans -> base.biblioness.bibs_location.state -> Louisiana
New Orleans -> film.film_location.featured_in_films -> Easy Rider
New Orleans -> film.film.genre -> Backstage Musical
New Orleans -> base.biblioness.bibs_location.country -> United States of America
Wynton Marsalis -> people.person.place_of_birth > New Orleans-> location.location.containedby -> Orleans Parish | Louisiana | United States of America
Generated answers:  ['Backstage Musical']
Success! Ground-truth answer `Backstage Musical` retrieved!
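
The orchestration performed by the query engine can be pictured as a short link, retrieve, generate chain. The sketch below is a simplified assumption about that flow, not the `ByoKGQueryEngine` implementation: `answer_question` and all `stub_*` functions are hypothetical, and the stubs return canned values in place of LLM-backed components.

```python
def answer_question(question, link, retrieve_triplets, retrieve_paths, generate):
    """Run link -> retrieve -> generate; every argument after `question` is a callable."""
    entities, metapaths = link(question)
    triplet_context = retrieve_triplets(question, entities)
    path_context = retrieve_paths(entities, metapaths)
    # dict.fromkeys de-duplicates while keeping first-seen order
    context = list(dict.fromkeys(triplet_context + path_context))
    return context, generate(question, context)


# Stub components with canned outputs, standing in for the real ones
def stub_link(question):
    return ["Wynton Marsalis"], [["people.person.place_of_birth"]]


def stub_triplets(question, entities):
    return [
        "Wynton Marsalis -> people.person.place_of_birth -> New Orleans",
        "New Orleans -> film.film.genre -> Backstage Musical",
    ]


def stub_paths(entities, metapaths):
    return ["Wynton Marsalis -> people.person.place_of_birth -> New Orleans"]


def stub_generate(question, context):
    # Pretend the answer is the tail of the last context line
    return [context[-1].split(" -> ")[-1]]


pipeline_context, pipeline_answers = answer_question(
    "What genre of film is associated with the place where Wynton Marsalis was born?",
    stub_link, stub_triplets, stub_paths, stub_generate,
)
print(pipeline_context)
print(pipeline_answers)
```

Passing the components in as callables keeps each stage replaceable, and de-duplicating with `dict.fromkeys` keeps the context order stable between runs, which `list(set(...))` does not guarantee.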


### Testing with Another Question

In [28]:
question = "What are all airports in the city where New York Times is circulated?"
retrieved_context = byokg_query_engine.query(question)


In [29]:
print("Retrieved context: ", "\n".join(retrieved_context))
answers, response = byokg_query_engine.generate_response(question, "\n".join(retrieved_context))
print("Generated response: ", response)


Retrieved context:  Columbia Regional Airport -> aviation.airport.serves -> Columbia
The New York Times -> book.newspaper.circulation_areas -> Staten Island | New York City | United States of America | Queens | Brooklyn
Louis Armstrong New Orleans International Airport -> location.location.containedby -> United States of America
Louis Armstrong New Orleans International Airport -> aviation.airport.serves -> New Orleans | Search Influence | Aquarium of the Americas | Courtyard New Orleans Downtown Near the French Quarter
New York City -> periodicals.newspaper_circulation_area.newspapers -> Manhattan West Side Spirit | The West Side Spirit | Vaba Eesti Sõna | Nowy Dziennik | AM New York | Five Towns Jewish Times | Daily Worker | Spirit of the Times | Catholic Worker | Hoy | The National Sports Daily | The Campus | American Citizen | Il Progresso Italo-Americano | The Aquarian Weekly | Negro World | The Onion | The Sun | The Jewish Daily Forward | Morgen Freiheit | The Capitol | New York 