In [None]:
import streamlit as st
from PIL import Image

# Load the image
image = Image.open('./kgqa_nb.png')

# Display the image in Streamlit
st.image(image, caption='', use_column_width=True)

# Question Answering On Knowledge Graphs Using RelationalAI & Snowflake Cortex AI

## **Overview**

We develop an end-to-end system for answering natural language (NL) questions using a Knowledge Graph (KG), leveraging RelationalAI and Snowflake Cortex AI. Because every answer is retrieved from the KG rather than generated freely, the approach avoids the fabricated facts (hallucinations) typical of LLMs.

Our work is an adaptation of [QirK: Question Answering via Intermediate Representation on Knowledge Graphs](https://arxiv.org/abs/2408.07494).

- We aim to answer the following questions from our KG.

In [1]:
list_of_questions = [
    "Name the actors of The Silent One.",
    "Who is the director of The Quiet Place?",
    "List movies directed by John Kransinski.",
    "Which movie's director is married to a cast member?",
    "Which movie's director was born in the same city as one of the cast members?",
    'Name a movie whose producer is the sibling of one of the cast members.',
    'In which movie is one of the cast members the child of the director?', 
    'Name films directed by either Christopher Nolan or Steven Spielberg.',
    'Name movies either directed or produced by Steven Spielberg.',
    'List the movies that had both Robert De Niro and Al Pacino casted in them?',
    'Who\'s the editor of a film directed by Christopher Nolan that has Christian Bale as a cast member?',
    'Name a movie directed by Quentin Tarantino or Martin Scorsese that has De Niro as a cast member.',
    "Name a movie directed by Quentin Tarantino or Martin Scorsese that has both Samuel L. Jackson and Robert De Niro as cast member"
]

> **_NOTE:_** Before running the notebook, you need to:

1. Install the [RelationalAI Native App](https://relational.ai/docs/native_app/installation#i-install-the-rai-native-app-for-snowflake).

2. Load RelationalAI in the Snowflake Notebook by following the [Installation Guide](https://relational.ai/docs/native_app/installation#ii-set-up-the-rai-native-app).

3. Install the **snowflake** package in the Snowflake Notebook environment.
    - Click the down arrow beside "Packages" and type "snowflake" to install it.

### Step 1. **Importing Necessary Packages**

We start by importing various packages and modules that we'll need for our project.

In [None]:
import sys
sys.path.append("./relationalai.zip")

In [2]:
# relationalai
import relationalai as rai
from relationalai.std import alias

# utils
from utils import execute_query, TripletClause, reformat_match_output

from os.path import dirname
from os import getcwd
from sys import path

path.insert(0, dirname(getcwd()))

### Step 2. **Defining KG model & types**

- We first create the model, which is a subset of [Wikidata](https://www.wikidata.org/wiki/Wikidata:Main_Page) focused on movies.

We need a knowledge base to store our data and use it to answer questions. In Wikidata, knowledge is structured as triplets (facts). For example, the triplet ("A Quiet Place", "director", "John Krasinski") reads as "The director of the movie A Quiet Place is John Krasinski". The triplets and labels are already loaded in the Snowflake database, but we need to create an executable knowledge base from them to run RelationalAI Python queries.
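
To make the triplet idea concrete, here is a minimal plain-Python sketch. It does not touch Snowflake or RelationalAI. The first fact is the one quoted above; the other two are added purely for illustration, and the helper names `objects_of` and `subjects_of` are hypothetical.

```python
# Illustrative (subject, relation, object) facts; not read from the Snowflake tables.
triplets = [
    ("A Quiet Place", "director", "John Krasinski"),
    ("A Quiet Place", "cast member", "Emily Blunt"),
    ("John Krasinski", "spouse", "Emily Blunt"),
]


def objects_of(subject, relation, facts):
    """Return every object o such that (subject, relation, o) is a fact."""
    return {o for s, r, o in facts if s == subject and r == relation}


def subjects_of(relation, obj, facts):
    """Return every subject s such that (s, relation, obj) is a fact."""
    return {s for s, r, o in facts if r == relation and o == obj}


# "Who is the director of A Quiet Place?"
directors = objects_of("A Quiet Place", "director", triplets)

# "List movies directed by John Krasinski."
movies = subjects_of("director", "John Krasinski", triplets)

print(directors, movies)
```

Answering a question then amounts to finding the facts that match a pattern, which is what the RelationalAI queries later in this notebook do at scale.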

In [3]:
model_name = "KGModel"
kg_model = rai.Model(model_name)

Connecting to ws://0.0.0.0:8080/ws/program...
Failed to connect to ws://0.0.0.0:8080/ws/program. Running with debug sink disabled.


After defining the model, we need to *create a stream of data between Snowflake & our model* to keep it up-to-date with the Snowflake tables. 
 
> **_NOTE:_** This is only needed once per model.

 

In [None]:
import json

# Read the database and schema names from the local config file
with open("config.json") as config_file:
    config = json.load(config_file)
db_name = config["database"]["name"]
schema_name = config["database"]["schema"]

In [None]:
# Uncomment the below if a stream from the Snowflake database to RAI Model does not exist.

# provider = rai.Provider()

# provider.create_streams([f"{db_name}.{schema_name}.id_labels",
#                          f"{db_name}.{schema_name}.triplets"],
#                         f"{model_name}"
# )


Next, we build two model types (named `Triplet` & `Label`) using the `kg_model.Type` method.
  - `Triplet` stores objects from the `{db_name}.{schema_name}.triplets` table. Each object is a `(subject_id, relation_id, object_id)` triplet, exposed as the attributes `source_ent_id`, `rid`, and `target_ent_id`.
  - `Label` stores objects from the `{db_name}.{schema_name}.id_labels` table. Each object is an `(id, label)` pair, exposed as the attributes `lid` and `lname`.

In [5]:
# read data from streams into RelationalAI Types
Triplet = kg_model.Type("Triplet", source=f"{db_name}.{schema_name}.triplets")
Label = kg_model.Type("Label", source=f"{db_name}.{schema_name}.id_labels")

* Let's run an example query to see a few triplet objects.

In [6]:
with kg_model.query() as select:
    fact = Triplet()
    res = select(alias(fact.source_ent_id, "subject_entity_id"), 
                 alias(fact.rid,"relation_id"), 
                 alias(fact.target_ent_id,"object_entity_id")
                 )

print(f"There are {len(res.results)} triplets in kg_model.\n")
print(res.results.iloc[300:310])

There are 24745 triplets in kg_model.

    subject_entity_id relation_id object_entity_id
300         Q10457752         P31           Q11424
301          Q1046576        P840              Q18
302        Q104679039        P495             Q668
303          Q1046841        P161          Q457996
304         Q10468573        P161          Q264921
305         Q10468708        P495              Q34
306         Q10468804       P6216           Q19652
307         Q10468821        P272         Q4993551
308         Q10469908         P58         Q1957605
309         Q10470096        P161         Q4980545


* As another example, we replace the subject, relation, and object IDs of each triplet with their labels.

In [7]:
with kg_model.query() as select: 
  fact = Triplet() 
  label = Label()
  
  with kg_model.match(): 
    with fact.source_ent_id == label.lid: 
      fact.set(subject=label.lname)

    with fact.target_ent_id == label.lid: 
      fact.set(object=label.lname)

    with fact.rid == label.lid: 
      fact.set(predicate=label.lname) 
      
  res = select(alias(fact.subject, "subject"), 
               alias(fact.predicate,"relation"), 
               alias(fact.object,"object") 
               ) 

print(res.results.iloc[300:310])

                       subject                 relation                 object
300              A Crazy Night              cast member           Ossi Oswalda
301       A Crime on the Bayou                 director          Nancy Buirski
302  A Cruise in the Albertina              instance of                   film
303         A Cry in the Woods  director of photography  John Andreas Andersen
304       A Damsel in Distress              cast member          Joan Fontaine
305       A Damsel in Distress                    color        black-and-white
306           A Dangerous Life         filming location            Philippines
307         A Dangerous Summer        country of origin              Australia
308                A Dark Song             main subject            forgiveness
309               A Dark Truth              cast member            Max Topplin


### Step 3. **An end-to-end pipeline for querying `kg_model` via natural language (NL) question**

* Our system comprises three main components:

1. `generate_ir`: Generating an intermediate representation (IR) from the NL question using Cortex AI 
    - The IR expresses the complete logical structure of the natural language query by breaking it down into logical components that depict relationships and entities.

2. `make_ir_executable`: Constructing an executable IR by mapping keywords in the IR to semantically similar items and properties in the KG.
    - [FAISS](https://github.com/facebookresearch/faiss) (Facebook AI Similarity Search) library is utilized for efficient search in the vector embedding space.  
    
3. `generate_query`: Generating a RelationalAI Python query from the executable IR with Cortex AI
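
The matching idea behind the second component can be sketched as a nearest-neighbour search over embeddings. The example below is a simplified stand-in, not the service's implementation: it uses brute-force cosine similarity in pure Python instead of FAISS, and the 3-dimensional vectors are made up for illustration (real embeddings would come from the Cortex AI embedding model). The helper names `cosine` and `top_k` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, index, k=2):
    """Return the k (item_id, score) pairs most similar to query_vec."""
    scored = [(cosine(query_vec, vec), item_id) for item_id, vec in index.items()]
    scored.sort(reverse=True)
    return [(item_id, round(score, 3)) for score, item_id in scored[:k]]


# Hypothetical toy embeddings for three Wikidata properties.
relation_index = {
    "P57": [0.9, 0.1, 0.0],   # director
    "P161": [0.1, 0.9, 0.1],  # cast member
    "P344": [0.7, 0.2, 0.3],  # director of photography
}

# Pretend embedding of the IR keyword "director".
query_vec = [1.0, 0.1, 0.1]

candidates = top_k(query_vec, relation_index, k=2)
print(candidates)
```

Keeping more than one candidate per keyword lets the final query try several plausible properties, which is presumably why the generated queries later in this notebook carry lists such as `relation_candidate_ids`.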

In [None]:
# Load the image
image = Image.open('./kgqa_example_udf.png')

# Display the image in Streamlit
st.image(image, caption='Service Functions in Snowpark Container Services', use_column_width=True)

* Let's get ready to use our system! You need to

  - Establish a connection with Snowflake to interact with Snowpark Container Services (SPCS)

  - Select your desired Cortex AI models (for the Complete & Embedding tasks) needed in our system
  
    - Click [here](https://docs.snowflake.com/en/user-guide/snowflake-cortex/llm-functions) to find the list of available models in your region. 

In [8]:
## Pick your favorite Cortex AI models for Complete & Embedding tasks
llm_name = "llama3.1-70b"
embedding_model_name = "e5-base-v2"

In [None]:
from snowflake.snowpark.context import get_active_session
session = get_active_session()

* Next, we process each NL question in turn to generate the IR, make it executable, and generate the corresponding RelationalAI Python query.



#### Troubleshooting

In case you encounter the following issue, please follow the recommended steps:

- *Server Overload Error*
  
   If the Snowflake server becomes unresponsive and shows a 'Server overloaded' error:
   - To resolve the issue, run [these steps](https://github.com/RelationalAI/llms-mlds/blob/main/kgqa_docker/README.md#step-6--launch-a-snowflake-service---copy-paste-output-to-sf-worksheet-and-run), starting from the line that says *"DROP SERVICE IF EXISTS <service_name_defined_in_config>;"*.

In [9]:
# Initialize the list of IR, FAISS output, and RelationalAI Python queries
irs = []
faiss_output = []
relationalai_queries = []

# process each NL question
for nlq in list_of_questions:

    # Step 1: NL question to IR

    # escape single quotes so the question can be embedded in a SQL string literal
    nlq = nlq.replace("'","''")

    # query IR service & get the IR
    query = f"""SELECT generate_ir('{nlq}', '{llm_name}') as result;"""
    ir = session.sql(query).collect()
    ir = ir[0].RESULT[1:-1]

    irs.append(ir)

    # Step 2: Generating executable IR using similarity search (SS) over the KG

    # query SS service
    query = f"""SELECT make_ir_executable('{ir}', '{embedding_model_name}') as result;"""
    df_ss = session.sql(query).collect()[0].RESULT

    # Extract SS outputs from df_ss
    matches, scores = reformat_match_output(df_ss)
    matches = str(matches).replace("'",'"')
    faiss_output.append(matches)

    # Step 3: Generating RelationalAI Python query

    # query query generator service
    query = f"""SELECT generate_query('{nlq}','{ir}','{matches}', '{llm_name}') as result;"""
    relationalai_query = session.sql(query).collect()[0].RESULT

    # Parse RelationalAI query
    relationalai_query = relationalai_query.replace("\\n","\n").replace("\\","").strip('"')
    
    relationalai_queries.append(relationalai_query)
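
The two string-handling steps in the loop above are easy to misread, so here they are in isolation. This is a sketch that runs without a Snowflake session. The helper names are hypothetical, and the `raw` string is a made-up service result used only to show the clean-up.

```python
def escape_sql_literal(text):
    """Double single quotes so the text can sit inside a '...' SQL string literal."""
    return text.replace("'", "''")


def clean_generated_query(raw):
    """Mirror the post-processing the loop applies to the query generator result."""
    # Turn literal backslash-n into real newlines, drop leftover backslashes,
    # then remove the surrounding double quotes.
    return raw.replace("\\n", "\n").replace("\\", "").strip('"')


question = "Which movie's director is married to a cast member?"
sql = f"SELECT generate_ir('{escape_sql_literal(question)}', 'llama3.1-70b') as result;"
print(sql)

# Made-up example of a JSON-style quoted result with escaped newlines.
raw = '"with graph.query() as select:\\n    pass"'
cleaned = clean_generated_query(raw)
print(cleaned)
```

Note that the loop escapes only the question. If an IR ever contained a single quote, the second and third SQL statements would break. Where the installed Snowpark version supports it, passing values as bind parameters is generally less fragile than string interpolation.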

* Let's see how the system translates the third NL question into a RelationalAI Python query.

In [10]:
i = 2
print(f"NL Query \n {list_of_questions[i]} \n")
print(f"IR \n {irs[i]} \n")
print(f"RelationalAI Python query \n {relationalai_queries[i]}")

NL Query
 List movies directed by John Kransinski.

IR
 m: director(m, \"John Krasinski\")

RelationalAI Python query

 with graph.query() as select:
    clause0 = clause(object_candidate_ids=["Q313039","Q95008"], relation_candidate_ids=["P57","P344"])
    res_relations = select(alias(clause0.source_ent_id, "m"))
results = set(res_relations.results.get("m", []))
print(results)
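
To show how an IR string like the one above breaks down into logical components, here is a small parsing sketch. The IR grammar is not specified in this notebook, so the parser assumes the simplest shape consistent with the example: answer variables before the colon, followed by flat `relation(arg1, arg2)` clauses. The function name `parse_ir` is hypothetical.

```python
import re

# One flat clause: relation(arg1, arg2). Nested clauses are not supported.
CLAUSE_PATTERN = re.compile(r'(\w[\w ]*)\(\s*([^,()]+?)\s*,\s*([^()]+?)\s*\)')


def parse_ir(ir):
    """Split an IR string into (answer_variables, clauses)."""
    head, body = ir.split(":", 1)
    variables = [v.strip() for v in head.split(",") if v.strip()]
    clauses = []
    for relation, subject, obj in CLAUSE_PATTERN.findall(body):
        # Remove surrounding quotes (escaped or not) from constants.
        clauses.append((relation.strip(), subject.strip(), obj.strip().strip('\\"')))
    return variables, clauses


variables, clauses = parse_ir('m: director(m, "John Krasinski")')
print(variables, clauses)
```

Each clause has the same shape as a KG triplet, with variables standing in for unknown entities. This is what makes the IR straightforward to map onto triplet lookups.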


* Now, let's run the generated queries to see the results.

  - Note that running the queries returns the entity IDs of the answers, not their labels.

In [11]:
# Define the context dictionary to be used in query execution
clause = TripletClause(Triplet)
context = {"graph": kg_model, "clause": clause, "alias": alias}

# Initialize the list of QIDs of the response 
responses_entity_id_format = []

# Iterate over relationalai queries
for q in relationalai_queries:
    # Execute relationalai query to get QID of the response
    response = execute_query(q, context)
    responses_entity_id_format.append(response)
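
The implementation of `execute_query` lives in `utils` and is not shown in this notebook. One plausible way such a helper could work, given that each generated query assigns its answer to a `results` variable, is sketched below. The function `run_generated_query` and the demo code string are hypothetical and need no RelationalAI model.

```python
import contextlib
import io


def run_generated_query(code, context):
    """Execute generated code in its own namespace and return its `results` variable.

    Returns an empty set when the code fails or never defines `results`.
    """
    namespace = dict(context)  # copy so the caller's context is not mutated
    with contextlib.redirect_stdout(io.StringIO()):  # silence the generated print()
        try:
            exec(code, namespace)
        except Exception:
            return set()
    return namespace.get("results", set())


# Demo with a stand-in for a generated query.
demo_code = "results = set(candidates)\nprint(results)"
answer = run_generated_query(demo_code, {"candidates": ["Q1", "Q2"]})
print(answer)
```

Running LLM-generated code with `exec` deserves care. Restricting the context to the few names the query needs, as the `context` dictionary above does, limits what the generated code can reach but is not a sandbox.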

* With IDs at hand, we can easily retrieve the Natural Language responses by joining IDs with their corresponding labels.

In [12]:
responses_nl_format = []

# Iterate over all responses; empty ones are kept as empty sets
for response in responses_entity_id_format:

    if response != set():

        # Get the label of all IDs in response set
        with kg_model.query() as select:
            lb = Label()
            lb.lid.in_(response)
            label_names = select(lb, alias(lb.lname,"label"))
        responses_nl_format.append(set(label_names.results.get("label", [])))

    else:

        responses_nl_format.append(set())
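
The ID-to-label join can be pictured with a plain dictionary. The sketch below uses three real Wikidata identifiers as a tiny stand-in for the `Label` type. The function name `to_labels` is hypothetical, and silently skipping unknown IDs is a design choice of this sketch, not necessarily what the RelationalAI query does.

```python
# A tiny stand-in for the id_labels table.
id_to_label = {
    "P57": "director",
    "P161": "cast member",
    "Q11424": "film",
}


def to_labels(ids, labels):
    """Map a set of IDs to their labels, skipping IDs with no known label."""
    return {labels[i] for i in ids if i in labels}


resolved = to_labels({"P57", "Q11424", "Q0"}, id_to_label)
print(resolved)
```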

* We wrap up by displaying the questions with their corresponding answers.

In [13]:
for idx, res in enumerate(responses_nl_format):
    
    res = ', '.join(res) if res != set() else "NULL"
    print(f"Question: {list_of_questions[idx]}")
    print(f"Answer: {res}")
    print("\n==========\n")

Question: Name the actors of The Silent One.
Answer: Alan Adair, Linda Gray, Suzanne Flon, Lea Massari


Question: Who is the director of The Quiet Place?
Answer: John Krasinski


Question: List movies directed by John Kransinski.
Answer: A Quiet Place


Question: Which movie's director is married to a cast member?
Answer: A Quiet Place


Question: Which movie's director was born in the same city as one of the cast members?
Answer: Dunkirk


Question: Name a movie whose producer is the sibling of one of the cast members.
Answer: The Royal Tenenbaums


Question: In which movie is one of the cast members the child of the director?
Answer: A Separation


Question: Name films directed by either Christopher Nolan or Steven Spielberg.
Answer: War Horse, Schindler's List, The Dark Knight, Saving Private Ryan, The Prestige, Dunkirk, Inception


Question: Name movies either directed or produced by Steven Spielberg.
Answer: War Horse, Schindler's List, Saving Private Ryan, Jurassic World: Fallen

We’ve verified all the answers to the questions, and they’re accurate according to the current Wikidata snapshot!

Brought to you by [RelationalAI](https://relational.ai) & Snowflake Native Applications!