Topics: Data Modelling and Search Models
* LangSmith (for inspection and debugging)
* Semantic model extraction (continued)
* Graph QA using GraphCypherQAChain
* Graph QA using Vector Indices

# Chapter 1: LangSmith

[Documentation](https://docs.smith.langchain.com/)

[Website](https://www.langchain.com/langsmith)

In [41]:
!pip install -qU langsmith

In [42]:
import os
import getpass

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

In [43]:
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = "YourKey"
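
Writing the key directly into the notebook is convenient for a demo, but it is easy to leak when the notebook is shared. A minimal sketch, assuming a hypothetical helper `configure_langsmith` (it is not part of LangSmith), that sets the same three variables from a key supplied at runtime:

```python
import os


def configure_langsmith(api_key: str, endpoint: str = "https://api.smith.langchain.com") -> dict:
    """Set the three tracing variables used above and return them for inspection.

    `configure_langsmith` is a hypothetical helper, not a LangSmith API.
    """
    if not api_key:
        raise ValueError("A non-empty LangSmith API key is required")
    settings = {
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": endpoint,
        "LANGCHAIN_API_KEY": api_key,
    }
    os.environ.update(settings)
    return settings
```

In a notebook this could be called as `configure_langsmith(getpass.getpass("Enter your LangSmith API key: "))`, mirroring how the OpenAI key is requested.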

In [44]:
!pip install -qU langchain-openai

In [45]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. Please respond to the user's request only based on the given context."),
    ("user", "Question: {question}\nContext: {context}")
])
model = ChatOpenAI(model="gpt-3.5-turbo")
output_parser = StrOutputParser() # https://www.restack.io/docs/langchain-knowledge-langchain-stroutputparser-guide

chain = prompt | model | output_parser

question = "What are the place names and geopolitical entities mentioned in the context?"
context = "Germany is a country in Europe and its capital is Berlin."
chain.invoke({"question": question, "context": context})

'Place names: Germany, Europe, Berlin\nGeopolitical entities: Germany'
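
Because the chain ends in `StrOutputParser`, the result is a plain string. If the answer is needed as structured data, it has to be parsed afterwards. A small sketch, assuming the model keeps the `Label: item, item` layout seen above (which is not guaranteed); `parse_labelled_lists` is a hypothetical helper:

```python
def parse_labelled_lists(answer: str) -> dict:
    """Split lines of the form 'Label: a, b, c' into {label: [a, b, c]}."""
    result = {}
    for line in answer.splitlines():
        if ":" not in line:
            continue  # ignore lines that do not follow the 'Label: items' pattern
        label, items = line.split(":", 1)
        result[label.strip()] = [item.strip() for item in items.split(",") if item.strip()]
    return result


answer = 'Place names: Germany, Europe, Berlin\nGeopolitical entities: Germany'
print(parse_labelled_lists(answer))
```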

# Chapter 2: Semantic Model Extraction

In [46]:
!pip install -q langchain-community langchain-openai langchain_experimental neo4j

In [47]:
import getpass
from langchain_community.graphs import Neo4jGraph

url = "neo4j+s://f02e0524.databases.neo4j.io"
username = "neo4j"
# prompt for the password instead of hard-coding it in the notebook
password = getpass.getpass("Enter your Neo4j password: ")

graph = Neo4jGraph(
    url=url,
    username=username,
    password=password
)

In [48]:
import getpass
import os

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

In [49]:
# From wikipedia: https://en.wikipedia.org/wiki/M%C3%BCnster
example_text = """
Münster is an independent city (Kreisfreie Stadt)
in North Rhine-Westphalia, Germany. It is in the northern part of the state and is considered to
 be the cultural centre of the Westphalia region. It is also a state district capital. Münster was the
  location of the Anabaptist rebellion during the Protestant Reformation and the site of the signing of the
   Treaty of Westphalia ending the Thirty Years' War in 1648. Today, it is known as the bicycle capital of Germany.
Münster gained the status of a Großstadt (major city) with more than 100,000 inhabitants in 1915.[4]
 As of 2014, there are 300,000[5] people living in the city, with about 61,500 students,[6]
 only some of whom are recorded in the official population statistics as having their primary residence in Münster.
 Münster is a part of the international Euregio region with more than 1,000,000 inhabitants (Enschede, Hengelo, Gronau, Osnabrück).
 Companies offering jobs in Münster include the Institute for Geoinformatics at the University of Münster,
 the Münster University of Applied Sciences, Reedu GmbH, con terra, the Deutsche Bank, IKEA, LIDL, REWE, ALDI and BASF Coatings.
"""

In [50]:
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0, model_name="gpt-4-turbo") # https://platform.openai.com/docs/models

llm_transformer = LLMGraphTransformer(llm=llm) # documentation, see https://python.langchain.com/docs/how_to/graph_constructing/

In [51]:
from langchain_core.documents import Document

documents = [Document(page_content=example_text)]
graph_documents = llm_transformer.convert_to_graph_documents(documents)
print(f"Nodes:{graph_documents[0].nodes}")
print(f"Relationships:{graph_documents[0].relationships}")

Nodes:[Node(id='Münster', type='City', properties={}), Node(id='North Rhine-Westphalia', type='State', properties={}), Node(id='Germany', type='Country', properties={}), Node(id='Westphalia', type='Region', properties={}), Node(id='Anabaptist Rebellion', type='Event', properties={}), Node(id='Protestant Reformation', type='Event', properties={}), Node(id='Treaty Of Westphalia', type='Event', properties={}), Node(id="Thirty Years' War", type='Event', properties={}), Node(id='Euregio', type='Region', properties={}), Node(id='Institute For Geoinformatics At The University Of Münster', type='Organization', properties={}), Node(id='Münster University Of Applied Sciences', type='Organization', properties={}), Node(id='Reedu Gmbh', type='Organization', properties={}), Node(id='Con Terra', type='Organization', properties={}), Node(id='Deutsche Bank', type='Organization', properties={}), Node(id='Ikea', type='Organization', properties={}), Node(id='Lidl', type='Organization', properties={}), No

In [52]:
graph.add_graph_documents(graph_documents)
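
`add_graph_documents` writes the extracted nodes and relationships to Neo4j. To make the idea concrete, here is a rough sketch of how such a structure could be mapped to Cypher `MERGE` statements. It uses stand-in dataclasses rather than the LangChain classes, it is not the query that `Neo4jGraph` actually issues, and the `LOCATED_IN` relationship is an assumed example (the relationship list is truncated in the output above).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimpleNode:
    id: str
    type: str


@dataclass(frozen=True)
class SimpleRelationship:
    source: SimpleNode
    target: SimpleNode
    type: str


def _escape(value: str) -> str:
    """Escape backslashes and single quotes for use in a Cypher string literal."""
    return value.replace("\\", "\\\\").replace("'", "\\'")


def node_to_cypher(node: SimpleNode) -> str:
    # MERGE avoids creating a duplicate when the node already exists
    return f"MERGE (:`{node.type}` {{id: '{_escape(node.id)}'}})"


def relationship_to_cypher(rel: SimpleRelationship) -> str:
    return (
        f"MATCH (a:`{rel.source.type}` {{id: '{_escape(rel.source.id)}'}}), "
        f"(b:`{rel.target.type}` {{id: '{_escape(rel.target.id)}'}}) "
        f"MERGE (a)-[:`{rel.type}`]->(b)"
    )


muenster = SimpleNode(id="Münster", type="City")
germany = SimpleNode(id="Germany", type="Country")
print(node_to_cypher(muenster))
print(relationship_to_cypher(SimpleRelationship(muenster, germany, "LOCATED_IN")))
```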

# Chapter 3: Graph QA using GraphCypherQAChain

In [53]:
!pip install  --quiet langchain langchain-openai langchain-community neo4j

In [7]:
import getpass
from langchain_community.graphs import Neo4jGraph

url = "neo4j+s://f02e0524.databases.neo4j.io"
username = "neo4j"
# prompt for the password instead of hard-coding it in the notebook
password = getpass.getpass("Enter your Neo4j password: ")

graph = Neo4jGraph(
    url=url,
    username=username,
    password=password
)

In [None]:
import os
import getpass

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

In [9]:
from langchain.chains import GraphCypherQAChain
from langchain_community.graphs import Neo4jGraph
from langchain_openai import ChatOpenAI
import os

chain = GraphCypherQAChain.from_llm(
    graph=graph,
    cypher_llm=ChatOpenAI(temperature=0, model="gpt-4o-mini"),  # alternative: gpt-3.5-turbo
    qa_llm=ChatOpenAI(temperature=0, model="gpt-3.5-turbo-16k"),
    verbose=True,
    allow_dangerous_requests=True
)

In [12]:
question_1 = "What is the population of Hessen?"
question_2 = "What is the geometry of Rheinland-Pfalz?"
question_3 = "What are the areas of Hessen and Niedersachsen? Is the area of Hessen bigger than the area of Niedersachsen?"
question_4 = "Is Düsseldorf the state capital of Nordrhein-Westfalen?"
question_5 = "Which cities lie in the district of Steinfurt?"
question_6 = "Which cities lie south, southeast and southwest of Münster? The relative position is saved as a property in the touches relation. Also give me every touches relation between those cities."
question_7 = "Which cities lie within Steinfurt? Only return the Names."

chain.invoke(question_7)

> Entering new GraphCypherQAChain chain...
Generated Cypher:
cypher
MATCH (c:City)-[:within]->(d:District {Name: 'Steinfurt'})
RETURN c.Name

Full Context:
[{'c.Name': 'Nordwalde'}, {'c.Name': 'Lengerich'}, {'c.Name': 'Recke'}, {'c.Name': 'Ibbenbüren'}, {'c.Name': 'Emsdetten'}, {'c.Name': 'Westerkappeln'}, {'c.Name': 'Steinfurt'}, {'c.Name': 'Greven'}, {'c.Name': 'Lienen'}, {'c.Name': 'Rheine'}]

> Finished chain.


{'query': 'Which cities lie within Steinfurt? Only return the Names.',
 'result': 'Nordwalde, Lengerich, Recke, Ibbenbüren, Emsdetten, Westerkappeln, Steinfurt, Greven, Lienen, Rheine.'}
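
The chain is created with `allow_dangerous_requests=True`, so the Cypher generated by the LLM is executed directly against the database. A minimal sketch of a keyword-based check that could be applied to a generated query before running it. `is_read_only_cypher` is a hypothetical helper, not part of LangChain, and it is only a heuristic: connecting with a read-only database user is the more robust safeguard.

```python
import re

# Cypher clauses that modify the graph; the list is illustrative, not exhaustive
WRITE_KEYWORDS = ("CREATE", "MERGE", "DELETE", "DETACH", "SET", "REMOVE", "DROP", "LOAD CSV")
_WRITE_PATTERN = re.compile(
    r"\b(" + "|".join(k.replace(" ", r"\s+") for k in WRITE_KEYWORDS) + r")\b",
    re.IGNORECASE,
)
_SINGLE_QUOTED = r"'(?:\\.|[^'\\])*'"
_DOUBLE_QUOTED = r'"(?:\\.|[^"\\])*"'
_STRING_LITERAL = re.compile(_SINGLE_QUOTED + "|" + _DOUBLE_QUOTED)


def is_read_only_cypher(query: str) -> bool:
    """Return True if no write clause appears outside of string literals."""
    # strip string literals first so that a value such as 'Create' is not flagged
    without_strings = _STRING_LITERAL.sub("''", query)
    return _WRITE_PATTERN.search(without_strings) is None


generated = "MATCH (c:City)-[:within]->(d:District {Name: 'Steinfurt'})\nRETURN c.Name"
print(is_read_only_cypher(generated))
print(is_read_only_cypher("MATCH (n) DETACH DELETE n"))
```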

# Chapter 4: Graph QA using Vector Indices

## Indexing

In [23]:
!pip install langchain openai wikipedia tiktoken neo4j langchain_openai langchain_community --quiet


In [8]:
# https://neo4j.com/developer-blog/knowledge-graph-rag-application/
# https://github.com/tomasonjo/blogs/blob/master/llm/devops_rag.ipynb
import getpass

url = "neo4j+s://9df9a03f.databases.neo4j.io:7687"
username = "neo4j"
# prompt for the password instead of hard-coding it in the notebook
password = getpass.getpass("Enter your Neo4j password: ")

In [9]:
import os
import getpass

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

In [11]:
# create the index

import os
from langchain_community.vectorstores import Neo4jVector
from langchain_openai import OpenAIEmbeddings

vector_index = Neo4jVector.from_existing_graph(
    OpenAIEmbeddings(),
    url=url,
    username=username,
    password=password,
    index_name='D_ID',
    node_label="District",
    text_node_properties=["Name", "Geometry"],  # alternatives: ['name', 'description', 'status'] or ['name', 'state_capital', 'url']
    embedding_node_property='embedding',
)

RateLimitError: Error code: 429 - {'error': {'message': 'Request too large for text-embedding-ada-002 in organization org-ljbWAi0QXX8V24REYtsIGaKn on tokens per min (TPM): Limit 1000000, Requested 7514243. The input or output tokens must be reduced in order to run successfully. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}
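
The request fails because the text to embed is far larger than the rate limit: 7,514,243 tokens were requested against a limit of 1,000,000 tokens per minute. A plausible cause (an assumption, since the data is not shown) is that the `Geometry` property holds very long coordinate strings. The sketch below estimates the size of the input before embedding. The ratio of about four characters per token is only a rough rule of thumb, not the actual tokenizer, and both helper functions are hypothetical.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; the real count depends on the tokenizer."""
    return math.ceil(len(text) / chars_per_token)


def minimum_batches(texts, tokens_per_minute: int) -> int:
    """Lower bound on the number of one-minute windows needed to embed `texts`."""
    total = sum(estimate_tokens(t) for t in texts)
    return max(1, math.ceil(total / tokens_per_minute))


# figures taken from the error message above
requested, limit = 7_514_243, 1_000_000
print(math.ceil(requested / limit))  # 8: the input exceeds the limit more than seven times over
```

Indexing only short descriptive properties, or splitting the input into several smaller requests, keeps each request below the limit.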

In [27]:
# list the vector indexes that exist in the database
vector_index.query(
    """SHOW INDEXES
       YIELD name, type, labelsOrTypes, properties, options
       WHERE type = 'VECTOR'
    """
)

[{'name': 'index_for_state',
  'type': 'VECTOR',
  'labelsOrTypes': ['State'],
  'properties': ['embedding'],
  'options': {'indexProvider': 'vector-2.0',
   'indexConfig': {'vector.hnsw.m': 16,
    'vector.hnsw.ef_construction': 100,
    'vector.dimensions': 1536,
    'vector.similarity_function': 'COSINE',
    'vector.quantization.enabled': True}}}]

## Retrieval

In [28]:
question1 = "How many states are in the database?"
question2 = "How many geometries are in the database?"
question3 = "What is the population of Hessen?"
question4 = "What is the area of Hessen?"
question5 = "What is the capital of Hessen?"
question6 = "What is the geometry of Hessen?"
question7 = "What are the geometries of Hessen and Niedersachsen?"
question8 = "What is the url of the geometry of Hessen?"

In [30]:
response = vector_index.similarity_search(question3)
response

[Document(metadata={}, page_content='\nname: Hessen\npopulation: 6391360\nstate_capital: Wiesbaden\narea: 21114.94'),
 Document(metadata={}, page_content='\nname: Niedersachsen\npopulation: 8140242\nstate_capital: Hannover\narea: 47709.82'),
 Document(metadata={}, page_content='\nname: Nordrhein-Westfalen\npopulation: 18139116\nstate_capital: Düsseldorf\narea: 34110.26'),
 Document(metadata={}, page_content='\nname: Rheinland-Pfalz\npopulation: 4159150\nstate_capital: Mainz\narea: 19854.21')]

In [31]:
response_with_score = vector_index.similarity_search_with_score(question3)
response_with_score

[(Document(metadata={}, page_content='\nname: Hessen\npopulation: 6391360\nstate_capital: Wiesbaden\narea: 21114.94'),
  0.94384765625),
 (Document(metadata={}, page_content='\nname: Niedersachsen\npopulation: 8140242\nstate_capital: Hannover\narea: 47709.82'),
  0.92498779296875),
 (Document(metadata={}, page_content='\nname: Nordrhein-Westfalen\npopulation: 18139116\nstate_capital: Düsseldorf\narea: 34110.26'),
  0.9194488525390625),
 (Document(metadata={}, page_content='\nname: Rheinland-Pfalz\npopulation: 4159150\nstate_capital: Mainz\narea: 19854.21'),
  0.917633056640625)]
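
The index configuration above lists `COSINE` as the similarity function. As a reminder of what that measures, here is a small pure-Python sketch that computes cosine similarity and ranks candidates by it. The two-dimensional vectors and their names are toy values chosen for illustration; real embeddings have 1536 dimensions according to the index configuration.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("Cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, named_vecs: dict):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in named_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


toy_vectors = {"same_direction": [2.0, 0.0], "orthogonal": [0.0, 1.0], "diagonal": [1.0, 1.0]}
for name, score in rank_by_similarity([1.0, 0.0], toy_vectors):
    print(name, round(score, 4))
```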

## Generation: Example 1

In [32]:
# using documents as context
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

template = """Answer the question based only on the following context:
{context}

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)
prompt

ChatPromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template='Answer the question based only on the following context:\n{context}\n\nQuestion: {question}\n'), additional_kwargs={})])

In [33]:
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

In [34]:
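from langchain_core.output_parsers import StrOutputParser

# parse the chat model's message into a plain string
chain = prompt | llm | StrOutputParser()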

In [36]:
docs = response

chain.invoke({"context": docs, "question": question3})

'The population of Hessen is 6,391,360.'
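
Passing the list of `Document` objects directly as `context` means the prompt receives their Python representation, including `metadata={}`. A common alternative is to join the `page_content` fields into one string first. A minimal sketch, using a stand-in `Doc` dataclass (an assumption, so that the example runs without LangChain) and a hypothetical `format_docs` helper:

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a LangChain Document, for illustration only."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs) -> str:
    """Join the text of all documents, separated by blank lines."""
    return "\n\n".join(doc.page_content.strip() for doc in docs)


docs = [
    Doc("\nname: Hessen\npopulation: 6391360"),
    Doc("\nname: Niedersachsen\npopulation: 8140242"),
]
print(format_docs(docs))
```

With real documents, the call would then look like `chain.invoke({"context": format_docs(docs), "question": question3})`.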

## Generation: Example 2

In [37]:
# Using a retriever as context
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

template = """Answer the question based only on the following context:
{context}

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

retriever = vector_index.as_retriever() # search_kwargs={"k": 1}

graph_chain = ({"context": retriever, "question": RunnablePassthrough()}
                | prompt
                | llm
                | StrOutputParser()
                )

graph_chain.invoke(question3)

'The population of Hessen is 6,391,360.'
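
The dictionary at the start of the chain sends the same input to every entry: the retriever turns the question into context, while `RunnablePassthrough` forwards the question unchanged. The sketch below imitates that behaviour in plain Python to show the data flow. `run_parallel`, `fake_retriever` and `passthrough` are hypothetical stand-ins, not LangChain APIs.

```python
def run_parallel(steps: dict, value):
    """Apply every callable in `steps` to the same input and collect the results."""
    return {key: step(value) for key, step in steps.items()}


def fake_retriever(question: str) -> str:
    # hypothetical stand-in for `vector_index.as_retriever()`
    return "name: Hessen\npopulation: 6391360"


def passthrough(value):
    # mirrors the role of RunnablePassthrough: return the input unchanged
    return value


TEMPLATE = """Answer the question based only on the following context:
{context}

Question: {question}
"""

inputs = run_parallel(
    {"context": fake_retriever, "question": passthrough},
    "What is the population of Hessen?",
)
prompt_text = TEMPLATE.format(**inputs)
print(prompt_text)
```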

## Generation: Example 3

In [39]:
# Using a custom retriever as context, with post-processing of the retrieved documents
# https://python.langchain.com/docs/how_to/custom_retriever/
from typing import List
from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.documents import Document
from langchain_core.retrievers import BaseRetriever

class CustomRetriever(BaseRetriever):
    """Custom retriever that also returns the similarity score of each document.

    The scores are stored in the document metadata so that a custom ranking
    function can later combine them with the spatial similarity between the
    query and the document.
    """

    vector_index: Neo4jVector

    def _get_relevant_documents(
        self, query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        """Synchronous implementation of the retriever."""
        docs = []
        # iterating directly also handles an empty result list
        for doc, score in self.vector_index.similarity_search_with_score(query):
            print("***", doc)
            # a custom score could be computed here, e.g. updated_score(score, query, doc)
            doc.metadata["score"] = score
            docs.append(doc)
        return docs

def update_scores(docs):
    """Placeholder re-ranking step: scale each score by 10 and append it to the text."""
    for doc in docs:
        new_score = doc.metadata["score"] * 10
        doc.page_content = doc.page_content + "\nScore: " + str(new_score)
        doc.metadata["score"] = new_score
    return docs

In [40]:
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

template = """Answer the question based only on the following context:
{context}

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

retriever_r = CustomRetriever(vector_index=vector_index)

graph_chain = ({"context": retriever_r | update_scores, "question": RunnablePassthrough()}
                | prompt
                | llm
                | StrOutputParser()
                )

graph_chain.invoke(question6)

*** page_content='
name: Hessen
population: 6391360
state_capital: Wiesbaden
area: 21114.94'
*** page_content='
name: Niedersachsen
population: 8140242
state_capital: Hannover
area: 47709.82'
*** page_content='
name: Nordrhein-Westfalen
population: 18139116
state_capital: Düsseldorf
area: 34110.26'
*** page_content='
name: Rheinland-Pfalz
population: 4159150
state_capital: Mainz
area: 19854.21'


'The geometry of Hessen is 21114.94.'
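
The docstring of `CustomRetriever` mentions a ranking function that takes spatial similarity into account, whereas `update_scores` only multiplies the score by 10. One possible shape for such a function, as a sketch under stated assumptions: the vector score is combined with a distance-based score through a weighted sum. The weight `alpha`, the decay constant `scale_km`, and the approximate coordinates used below are all illustrative choices, not values from the course material.

```python
import math


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points given in decimal degrees."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def spatial_score(distance_km: float, scale_km: float = 100.0) -> float:
    """Map a distance to (0, 1]: 1 at distance 0, decreasing as the distance grows."""
    return 1.0 / (1.0 + distance_km / scale_km)


def combined_score(vector_score: float, distance_km: float, alpha: float = 0.7) -> float:
    """Weighted sum of the semantic (vector) score and the spatial score."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    return alpha * vector_score + (1.0 - alpha) * spatial_score(distance_km)


# approximate coordinates of Münster and Wiesbaden, for illustration only
distance = haversine_km(51.96, 7.63, 50.08, 8.24)
print(round(distance), round(combined_score(0.94, distance), 3))
```

Inside `update_scores`, `combined_score` would replace the multiplication by 10, provided each document carries a location to compare with the one mentioned in the query.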

# Cypher Queries

We will not dive deep into the Cypher syntax during the course. The following queries should be enough for interacting with the Neo4j database. If you need more, check the [documentation](https://neo4j.com/docs/cypher-cheat-sheet/5/aura-dbe/auradb-free).

In [None]:
// delete every node and edge
MATCH (n)
DETACH DELETE n

// create nodes and edges:
// follow the structure shown at https://github.com/aurioldegbelo/sis2024/blob/main/vector_data/data.cypher

// visualize the model of the graph database
CALL apoc.meta.graph()

# Project work

* Exercise 01: clarify what your search target is

* Exercise 02: elaborate on your data model (what are the entities and relationships?)

* Exercise 03: create a Neo4j account and a database instance

* Exercise 04: write an example Cypher query (CREATE) for your data (just a few instances) and upload it to the database to see if it works

* Exercise 05: write a script that converts your original data format (CSV, TSV, JSON, ...) into a Cypher CREATE query
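
As a possible starting point for Exercise 05, the sketch below turns CSV rows into `CREATE` statements. The column names (`name`, `population`), the label `State`, and the helper names are assumptions for illustration; your own data will need its own label and properties. Column names are used as property keys as they are, so they must be valid Cypher identifiers.

```python
import csv
import io
import json


def coerce(value: str):
    """Convert numeric strings to int or float; leave everything else as text."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            continue
    return value


def row_to_create(label: str, row: dict) -> str:
    """Build one CREATE statement; json.dumps quotes and escapes string values."""
    props = ", ".join(
        f"{key}: {json.dumps(coerce(value), ensure_ascii=False)}" for key, value in row.items()
    )
    return f"CREATE (:{label} {{{props}}})"


def csv_to_cypher(csv_text: str, label: str) -> list:
    """Read CSV text with a header row and return one statement per data row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row_to_create(label, row) for row in reader]


csv_text = "name,population\nHessen,6391360\nNiedersachsen,8140242\n"
for statement in csv_to_cypher(csv_text, "State"):
    print(statement)
```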
