<a href="https://colab.research.google.com/github/Project-Hackathons/LifeHack2024/blob/main/TerrorViz.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

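## Goal
The goal of this notebook is to construct a knowledge graph from a news article: resolve coreferences in the text, use an LLM to extract nodes and relationships, store them in Neo4j, and link the nodes back to the extracts that mention them.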

## Dependencies
To run this project, the following dependencies are required:

*   langchain: A framework for building applications on top of language models.
*   neo4j: The Python driver for Neo4j, the graph database that stores the knowledge graph and answers queries against it.
*   openai: Client library for accessing OpenAI's language models.
*   langchain_openai: Integrates LangChain with OpenAI's models.
*   langchain-community: Additional LangChain community tools and integrations.
*   spacy: A Python library for natural language processing, used here for tokenization and part-of-speech tagging.
*   fastcoref: A fast neural coreference resolution library that identifies and links references to the same entity within a text.
*   python-dotenv: Loads the environment variables from a `.env` file.











In [None]:
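%pip install langchain neo4j openai langchain_openai langchain-community setuptools wheel spacy fastcoref python-dotenv
# The spaCy model loaded later in this notebook has to be downloaded separately
!python -m spacy download en_core_web_sm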

## Loading environment variables 
Four key environment variables are needed for this project:
* NEO4J_URI
* NEO4J_USERNAME
* NEO4J_PASSWORD
* OPENAI_API_KEY
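
Before running the remaining cells, it can help to confirm that all four variables are actually set. The snippet below is a minimal sketch using only the standard library; `missing_env_vars` is a hypothetical helper introduced here for illustration and is not part of LangChain or python-dotenv.

```python
import os
from typing import Iterable, List, Mapping

# The four variables listed above
REQUIRED_VARS = ("NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD", "OPENAI_API_KEY")

def missing_env_vars(required: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names in `required` that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]

missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Run it after `load_dotenv()` so that values from a local `.env` file are taken into account.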

In [2]:
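# Import secrets and initialise Neo4jGraph
from langchain_community.graphs import Neo4jGraph
from dotenv import load_dotenv

# Read NEO4J_URI, NEO4J_USERNAME, NEO4J_PASSWORD and OPENAI_API_KEY from a local .env file
load_dotenv()

# Neo4jGraph picks up the NEO4J_* variables from the environment
graph = Neo4jGraph()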

## Redefining Classes and Functions
In this section, we redefine some classes and functions to fit our use case.

In [3]:
from langchain.schema import Document
from spacy.tokens import Doc, Span
from typing import List

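# Functions for coreference resolution
def get_fast_cluster_spans(doc, clusters):
    """Convert fastcoref character offsets into inclusive spaCy token index pairs."""
    fast_clusters = []
    for cluster in clusters:
        new_group = []
        for start, end in cluster:
            # "expand" snaps offsets that fall inside a token to token boundaries,
            # so char_span does not return None for misaligned mentions
            span = doc.char_span(start, end, alignment_mode="expand")
            if span is None:
                continue
            new_group.append([span.start, span.end - 1])
        fast_clusters.append(new_group)
    return fast_clusters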

def get_fastcoref_clusters(doc, text):
    preds = model.predict(texts=[text])
    fast_clusters = preds[0].get_clusters(as_strings=False)
    fast_cluster_spans = get_fast_cluster_spans(doc, fast_clusters)
    return fast_cluster_spans


def core_logic_part(document: Doc, coref: List[int], resolved: List[str], mention_span: Span):
    final_token = document[coref[1]]
    if final_token.tag_ in ["PRP$", "POS"]:
        resolved[coref[0]] = mention_span.text + "'s" + final_token.whitespace_
    else:
        resolved[coref[0]] = mention_span.text + final_token.whitespace_
    for i in range(coref[0] + 1, coref[1] + 1):
        resolved[i] = ""
    return resolved

def get_span_noun_indices(doc: Doc, cluster: List[List[int]]) -> List[int]:
    spans = [doc[span[0]:span[1]+1] for span in cluster]
    spans_pos = [[token.pos_ for token in span] for span in spans]
    span_noun_indices = [i for i, span_pos in enumerate(spans_pos)
        if any(pos in span_pos for pos in ['NOUN', 'PROPN'])]
    return span_noun_indices

def get_cluster_head(doc: Doc, cluster: List[List[int]], noun_indices: List[int]):
    head_idx = noun_indices[0]
    head_start, head_end = cluster[head_idx]
    head_span = doc[head_start:head_end+1]
    return head_span, [head_start, head_end]

def is_containing_other_spans(span: List[int], all_spans: List[List[int]]):
    return any([s[0] >= span[0] and s[1] <= span[1] and s != span for s in all_spans])

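def improved_replace_corefs(document, clusters):
    """Replace every mention in each cluster with the cluster's first noun-bearing span."""
    resolved = list(tok.text_with_ws for tok in document)
    all_spans = [span for cluster in clusters for span in cluster]  # flattened list of spans

    for cluster in clusters:
        noun_indices = get_span_noun_indices(document, cluster)

        # Skip clusters that contain no noun or proper noun (for example, pronouns only)
        if noun_indices:
            mention_span, mention = get_cluster_head(document, cluster, noun_indices)

            for coref in cluster:
                # Leave the head itself and any span that contains another span untouched
                if coref != mention and not is_containing_other_spans(coref, all_spans):
                    core_logic_part(document, coref, resolved, mention_span)

    return "".join(resolved)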

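def load_text_to_document(file_path: str) -> Document:
    """Read a text file, resolve coreferences, and wrap the result in a LangChain Document.

    Relies on the global `nlp` (spaCy pipeline) and `model` (FCoref) objects,
    which are instantiated in the Evaluation section before this function is called.
    """
    with open(file_path, 'r', encoding='utf-8') as file:
        text_content = file.read()
    print(f'Before Coreference Resolution:\n{text_content}')
    doc = nlp(text_content)  # tokenise and tag with spaCy
    clusters = get_fastcoref_clusters(doc, text_content)  # coreference clusters as token spans
    coref_text = improved_replace_corefs(doc, clusters)  # replace mentions with the cluster head

    document = Document(page_content=coref_text)
    return document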
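
To make the replacement step easier to follow, here is a minimal sketch of the same idea without spaCy or fastcoref. It assumes whitespace-separated tokens and a hand-written cluster of inclusive `[start, end]` token spans; `replace_mentions` is a hypothetical helper for illustration only, and it skips the possessive handling and nested-span checks that the functions above perform.

```python
from typing import List

def replace_mentions(tokens: List[str], cluster: List[List[int]], head: List[int]) -> str:
    """Replace every span in `cluster` (except `head`) with the text of the head span."""
    # One output slot per token, each followed by a space
    resolved = [tok + " " for tok in tokens]
    head_text = " ".join(tokens[head[0]:head[1] + 1])
    for start, end in cluster:
        if [start, end] == head:
            continue  # the head mention stays as it is
        # Put the head text in the first slot of the mention and blank the rest
        resolved[start] = head_text + " "
        for i in range(start + 1, end + 1):
            resolved[i] = ""
    return "".join(resolved).strip()

tokens = "The man stormed the station . He fled .".split()
cluster = [[0, 1], [6, 6]]  # "The man" and "He" refer to the same entity
print(replace_mentions(tokens, cluster, head=[0, 1]))
# The man stormed the station . The man fled .
```

The notebook's version works the same way on spaCy tokens, but picks the head automatically as the first span in the cluster that contains a noun or proper noun.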

In [4]:
# Redefine classes to override the existing pydantic classes.
# The final KnowledgeGraph class is passed to the LLM so that it knows the output format.
# This helps the LLM identify nodes and relationships.

from langchain_community.graphs.graph_document import (
    Node as BaseNode,
    Relationship as BaseRelationship,
    GraphDocument,
)
from typing import List, Dict, Any, Optional
from langchain.pydantic_v1 import Field, BaseModel

class Property(BaseModel):
    """A single property consisting of key and value"""
    key: str = Field(..., description="key")
    value: str = Field(..., description="value")

class Node(BaseNode):
    properties: Optional[List[Property]] = Field(
        None, description="List of node properties")

class Relationship(BaseRelationship):
    properties: Optional[List[Property]] = Field(
        None, description="List of relationship properties"
    )

class KnowledgeGraph(BaseModel):
    """Generate a knowledge graph with entities and relationships."""
    nodes: List[Node] = Field(
        ..., description="List of nodes in the knowledge graph")
    rels: List[Relationship] = Field(
        ..., description="List of relationships in the knowledge graph"
    )

In [5]:
# Function that instructs the LLM to identify nodes and relationships and to output the result in the desired format (KnowledgeGraph class).
# Note that even though the prompt asks the LLM to do coreference resolution, the text has already been resolved with fastcoref and spaCy before it is passed to the LLM.
# The coreference instruction in the prompt is only an additional safety net.

from langchain.chains.openai_functions import (create_structured_output_chain,)
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-3.5-turbo-16k", temperature=0 )

def get_extraction_chain(
    allowed_nodes: Optional[List[str]] = None,
    allowed_rels: Optional[List[str]] = None
    ):
    prompt = ChatPromptTemplate.from_messages(
        [(
          "system",
          f"""# Knowledge Graph Instructions for GPT-4
## 1. Overview
You are a top-tier algorithm designed for extracting information in structured formats to build a knowledge graph.
- **Nodes** represent entities and concepts. They're akin to Wikipedia nodes.
- The aim is to achieve simplicity and clarity in the knowledge graph, making it accessible for a vast audience.
## 2. Labeling Nodes
- **Consistency**: Ensure you use basic or elementary types for node labels.
  - For example, when you identify an entity representing a person, always label it as **"person"**. Avoid using more specific terms like "mathematician" or "scientist".
- **Node IDs**: Never utilize integers as node IDs. Node IDs should be names or human-readable identifiers found in the text.
{'- **Allowed Node Labels:**' + ", ".join(allowed_nodes) if allowed_nodes else ""}
## 3. Labelling Relationships
{'- **Allowed Relationship Types**:' + ", ".join(allowed_rels) if allowed_rels else ""}
## 4. Handling Numerical Data and Dates
- Numerical data, like age or other related information, should be incorporated as attributes or properties of the respective nodes.
- **No Separate Nodes for Dates/Numbers**: Do not create separate nodes for dates or numerical values. Always attach them as attributes or properties of nodes.
- **Property Format**: Properties must be in a key-value format.
- **Quotation Marks**: Never use escaped single or double quotes within property values.
- **Naming Convention**: Use camelCase for property keys, e.g., `birthDate`.
## 5. Coreference Resolution
- **Maintain Entity Consistency**: When extracting entities, it's vital to ensure consistency.
If an entity, such as "John Doe", is mentioned multiple times in the text but is referred to by different names or pronouns (e.g., "Joe", "he"),
always use the most complete identifier for that entity throughout the knowledge graph. In this example, use "John Doe" as the entity ID.
Remember, the knowledge graph should be coherent and easily understandable, so maintaining consistency in entity references is crucial.
## 6. Strict Compliance
Adhere to the rules strictly. Non-compliance will result in termination.
          """),
            ("human", "Use the given format to extract information from the following input: {input}"),
            ("human", "Tip: Make sure to answer in the correct format"),
        ])
    return create_structured_output_chain(KnowledgeGraph, llm, prompt, verbose=False)
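
The two optional lines in the prompt (allowed node labels and allowed relationship types) are built with a conditional expression inside the f-string. The sketch below shows how such an expression is evaluated; `allowed_line` is a hypothetical helper written for this illustration, not a function used by the notebook.

```python
from typing import List, Optional

def allowed_line(prefix: str, values: Optional[List[str]]) -> str:
    # Python parses this as (prefix + joined values) if values else "",
    # which is the same shape as the expressions in the prompt template.
    return prefix + ", ".join(values) if values else ""

print(repr(allowed_line("- **Allowed Node Labels:**", ["Person", "Event"])))
# '- **Allowed Node Labels:**Person, Event'
print(repr(allowed_line("- **Allowed Node Labels:**", None)))
# ''
```

When no labels are passed, the line is empty and the LLM is free to choose its own labels. Note that the prefix and the first label are joined without a space.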

In [6]:
# Functions to reformat the output of the LLM before passing it to Neo4j
def format_property_key(s: str) -> str:
    words = s.split()
    if not words:
        return s
    first_word = words[0].lower()
    capitalized_words = [word.capitalize() for word in words[1:]]
    return "".join([first_word] + capitalized_words)

def props_to_dict(props) -> dict:
    """Convert properties to a dictionary."""
    properties = {}
    if not props:
        return properties
    for p in props:
        properties[format_property_key(p.key)] = p.value
    return properties

def map_to_base_node(node: Node) -> BaseNode:
    """Map the KnowledgeGraph Node to the base Node."""
    properties = props_to_dict(node.properties) if node.properties else {}
    # Add name property for better Cypher statement generation
    properties["name"] = node.id.title()
    return BaseNode(
        id=node.id.title(), type=node.type.capitalize(), properties=properties
    )


def map_to_base_relationship(rel: Relationship) -> BaseRelationship:
    """Map the KnowledgeGraph Relationship to the base Relationship."""
    source = map_to_base_node(rel.source)
    target = map_to_base_node(rel.target)
    properties = props_to_dict(rel.properties) if rel.properties else {}
    return BaseRelationship(
        source=source, target=target, type=rel.type, properties=properties
    )
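
A quick way to see what these helpers do to the LLM output is to run the string transformations on a few sample values. The sketch below repeats `format_property_key` so that it runs on its own; the sample strings are made up for illustration.

```python
def format_property_key(s: str) -> str:
    """Same logic as above: lower-case the first word, capitalize the rest, join."""
    words = s.split()
    if not words:
        return s
    first_word = words[0].lower()
    capitalized_words = [word.capitalize() for word in words[1:]]
    return "".join([first_word] + capitalized_words)

print(format_property_key("birth date"))     # birthDate
print(format_property_key("Date Of Birth"))  # dateOfBirth
# A key that is already camelCase is a single word, so it is lower-cased entirely
print(format_property_key("birthDate"))      # birthdate

# map_to_base_node applies str.title() to node ids
print("jemaah islamiyah".title())            # Jemaah Islamiyah
# str.title() also capitalizes the letter after an apostrophe
print("the man's weapon".title())            # The Man'S Weapon
```

The last two cases are worth keeping in mind when matching node ids or property keys later: the stored values may differ in casing from what the LLM returned.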

## Evaluation
Let's test out the functions we have implemented so far!

In [7]:
# instantiate nlp and model objects for coreference resolution
import spacy
from fastcoref import FCoref

nlp = spacy.load('en_core_web_sm')
model = FCoref()

06/01/2024 10:54:20 - INFO - 	 missing_keys: []
06/01/2024 10:54:20 - INFO - 	 unexpected_keys: []
06/01/2024 10:54:20 - INFO - 	 mismatched_keys: []
06/01/2024 10:54:20 - INFO - 	 error_msgs: []
06/01/2024 10:54:20 - INFO - 	 Model Parameters: 90.5M, Transformer: 82.1M, Coref head: 8.4M


In [8]:
#coreference resolution
document = load_text_to_document("./assets/test_text.txt")
print(f'After Coreference Resolution:\n{document.page_content}')

06/01/2024 10:54:20 - INFO - 	 Tokenize 1 inputs...


Before Coreference Resolution:
KUALA LUMPUR, Malaysia (AP) — The man who attacked a Malaysian police station and killed two officers was a recluse and is believed to have acted on his own despite suspected links to the Jemaah Islamiyah extremist group, the country’s home minister said Saturday.

The man stormed the police station in southern Johor state near Singapore in the early hours of Friday with a machete. He hacked a police constable to death and then used the officer’s weapon to kill another. He wounded a third officer before being shot dead. Police initially said the man could have attempted to take firearms from the station.

Home Minister Saifuddin Nasution called it a “lone wolf attack” based on an initial investigation and said there was no threat to the wider public.

“We have established that the attacker acted on his own ... a lone wolf driven by certain motivation and his own understanding,” Saifuddin said. “His action is not linked to any larger mission.”

Police have

Map:   0%|          | 0/1 [00:00<?, ? examples/s]

06/01/2024 10:54:20 - INFO - 	 ***** Running Inference on 1 texts *****


Inference:   0%|          | 0/1 [00:00<?, ?it/s]

After Coreference Resolution:
KUALA LUMPUR, Malaysia (AP) — The man who attacked a Malaysian police station and killed two officers was a recluse and is believed to have His action on The man who attacked a Malaysian police station and killed two officers's own despite suspected links to the Jemaah Islamiyah extremist group, Malaysia's home minister said Saturday.

The man who attacked a Malaysian police station and killed two officers stormed the police station in southern Johor state near Singapore in the early hours of Friday with a machete. The man who attacked a Malaysian police station and killed two officers hacked a police constable to death and then used a police constable's weapon to kill another. The man who attacked a Malaysian police station and killed two officers wounded a third officer before being shot dead. Police initially said The man who attacked a Malaysian police station and killed two officers could have attempted to take firearms from a Malaysian police station

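Most pronouns have been replaced with the entity they refer to, so coreference resolution has largely been carried out successfully. The output is not perfect, though: in the first sentence, "acted" was wrongly replaced with "His action".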

In [9]:
# ONLY USE TO DELETE THE DATABASE WHEN NEEDED FOR TESTING
# graph.query("MATCH (n) DETACH DELETE n")

In [10]:
# After coreference resolution is completed, we invoke a function that asks the LLM to identify nodes and relationships.
# Nodes and relationships are then reformatted before being stored in Neo4j.
from typing import List

def extract_and_store_graph(document: Document, nodes: List[str], rels: List[str]) -> GraphDocument:
    # Extract graph data using OpenAI functions
    extract_chain = get_extraction_chain(nodes, rels)
    # Pass only the text, not the Document wrapper, into the prompt
    data = extract_chain.invoke(document.page_content)['function']
    # Construct a graph document
    graph_document = GraphDocument(
        nodes=[map_to_base_node(node) for node in data.nodes],
        relationships=[map_to_base_relationship(rel) for rel in data.rels],
        source=document,
    )
    # Store information into a graph
    graph.add_graph_documents([graph_document])
    return graph_document

graph_doc = extract_and_store_graph(document=document,
                                    nodes=["Person", "Object", "Location", "Event"],
                                    rels=["INVOLVED_IN", "ORGANIZED_BY", "POSSESSED_BY", "VICTIM_OF", "AFFECTED_BY", "USED_BY", "LOCATED_AT", "FOUND_AT"])

  warn_deprecated(
06/01/2024 10:54:53 - INFO - 	 HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


Hooray! The Knowledge Graph has been successfully populated. You can expect the knowledge graph to look like this: 

<p align="center">
<img src="./assets/knowledge_graph.png" alt="Knowledge Graph" width="500"/>
</p>

Now, let's identify the entities in the extract and link the nodes back to the extract. 

In [11]:
# Function to identify which entities are mentioned in an extract
llm = ChatOpenAI(model="gpt-3.5-turbo-16k", temperature=0 )

def get_mentions_chain(
    node_ids: Optional[List[str]] = None
    ):
    prompt = ChatPromptTemplate.from_messages(
        [(
          "system",
          f"""# Knowledge Graph Instructions for GPT-4
## 1. Overview
You are a top-tier algorithm to find out which entities are mentioned in an extract.
## 2. {'- **Entities:**' + ", ".join(node_ids) if node_ids else ""}
## 3. You need to output in a JSON format. It should be a list containing the entities mentioned.
## 4. Strict Compliance
Adhere to the rules strictly. Non-compliance will result in termination.
          """),
            ("human", "Use the given format to extract information from the following input: {input}"),
            ("human", "Tip: Make sure to answer in the correct format"),
        ])
    return create_structured_output_chain(List[str], llm, prompt, verbose=False)
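
The chain returns a plain list of strings, and an LLM may return names that differ slightly in casing or that are not in the graph at all. One possible safeguard, sketched below with the standard library only, is to keep just the answers that match a known node id. `keep_known_ids` is a hypothetical helper, and the sample names are made up for illustration.

```python
from typing import Iterable, List

def keep_known_ids(mentioned: Iterable[str], node_ids: List[str]) -> List[str]:
    """Return the known node ids that appear in `mentioned`, ignoring case and duplicates."""
    known = {node_id.casefold(): node_id for node_id in node_ids}
    result = []
    for name in mentioned:
        match = known.get(name.strip().casefold())
        if match is not None and match not in result:
            result.append(match)
    return result

llm_answer = ["jemaah islamiyah", "Saifuddin Nasution", "Unknown Group"]
known_ids = ["Jemaah Islamiyah", "Saifuddin Nasution", "Johor"]
print(keep_known_ids(llm_answer, known_ids))
# ['Jemaah Islamiyah', 'Saifuddin Nasution']
```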

In [12]:
node_ids = [node.id for node in graph_doc.nodes]
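
The next cell splits the resolved text into sentences and groups them into overlapping extracts. With `overlap=2`, `combine_with_overlap` takes up to four sentences per extract and advances two sentences at a time. The sketch below illustrates that windowing on a toy text; `sliding_windows` is a hypothetical stand-in written for this illustration, and it joins sentences with a space.

```python
import re
from typing import List

# Same sentence-boundary pattern as in the notebook: split on whitespace that follows
# '.', '?' or '!', except after patterns such as "e.g." or "Dr."
SENTENCE_BOUNDARY = re.compile(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|!)\s')

def sliding_windows(items: List[str], size: int, step: int) -> List[str]:
    """Join `size` consecutive items into one string, advancing `step` items each time."""
    windows = []
    for start in range(0, len(items), step):
        windows.append(" ".join(items[start:start + size]))
        if start + size >= len(items):
            break  # this window already reaches the end of the list
    return windows

text = "One. Two. Three. Four. Five."
sentences = SENTENCE_BOUNDARY.split(text)
for chunk in sliding_windows(sentences, size=4, step=2):
    print(chunk)
# One. Two. Three. Four.
# Three. Four. Five.
```

Consecutive extracts share two sentences, so an entity mentioned near a boundary still appears together with some of its surrounding context.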

In [13]:
# chunk the extracts 
# create vector embedding
# map the existing nodes to the extracts
import re

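def combine_with_overlap(arr, overlap):
    """Join sentences into chunks of up to `overlap + 2` sentences, advancing `overlap` sentences at a time."""
    combined_list = []
    length = len(arr)
    if length < overlap:
        return combined_list
    i = 0
    while i + overlap < length:
        # Join with a space: the regex split below drops the whitespace between sentences
        combined_string = ' '.join(arr[i:i + overlap + 2])
        combined_list.append(combined_string)
        i += overlap

    return combined_list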

sentences = re.compile(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|!)\s').split(document.page_content)
overlapping_extracts = combine_with_overlap(sentences, 2)

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
embedding_dimension = 1536
llm = ChatOpenAI(temperature=0)

import hashlib

# Create the vector index once, before any extract is stored
try:
    graph.query(
        "CALL db.index.vector.createNodeIndex('extract', "
        "'Extract', 'embedding', $dimension, 'cosine')",
        {"dimension": embedding_dimension},
    )
except Exception:  # most likely the index already exists
    pass

def extract_entities(extract: str, node_ids: List[str]) -> List[str]:
    """Store one extract with its embedding and link it to the nodes it mentions."""
    # Ask the LLM which of the known entities appear in this extract
    mentions_chain = get_mentions_chain(node_ids)
    mentioned = mentions_chain.invoke(extract)['function']
    # Keep only answers that exactly match an existing node id
    mentioned_ids = [node_id for node_id in node_ids if node_id in mentioned]
    # A SHA-256 digest is stable across sessions, unlike the built-in hash()
    extract_id = hashlib.sha256(extract.encode("utf-8")).hexdigest()
    params = {
        "extract_text": extract,
        "embedding": embeddings.embed_query(extract),
        "extract_id": extract_id
    }

    # Merge on the id only, so re-running the cell updates the node instead of duplicating it
    graph.query(
        """
        MERGE (n:Extract {id: $extract_id})
        SET n.extract_text = $extract_text, n.embedding = $embedding
        WITH n
        CALL db.create.setVectorProperty(n, 'extract_embedding', $embedding)
        YIELD node
        RETURN count(*)
        """,
        params,
    )

    # Link only the nodes that the LLM reported as mentioned in this extract
    for node_id in mentioned_ids:
        graph.query(
            """
            MATCH (p {id: $id})
            MATCH (n:Extract {id: $extract_id})
            MERGE (n)<-[:MENTIONED_IN]-(p)
            RETURN count(*)
            """,
            {"id": node_id, "extract_id": extract_id},
        )
    return mentioned_ids

for extract in overlapping_extracts:
    extract_entities(extract, node_ids)


  warn_deprecated(
06/01/2024 10:54:55 - INFO - 	 HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
06/01/2024 10:54:56 - INFO - 	 HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
06/01/2024 10:54:57 - INFO - 	 HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
06/01/2024 10:54:58 - INFO - 	 HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
06/01/2024 10:55:00 - INFO - 	 HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
06/01/2024 10:55:00 - INFO - 	 HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
06/01/2024 10:55:03 - INFO - 	 HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
06/01/2024 10:55:03 - INFO - 	 HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
06/01/2024 10:55:05 - INFO - 	 HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
06/01/

Awesome! You have successfully linked the nodes to the extract. Your Knowledge Graph should now look like this: 
<p align="center">
<img src="./assets/knowledge_graph_wextracts.png" alt="Knowledge Graph With Extract" width="500"/>
</p>
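
The `extract` vector index created above is configured with `'cosine'`, meaning that extracts are compared by the cosine similarity of their embeddings. The following is a minimal sketch of that measure in plain Python. The three-dimensional vectors are made up for illustration (the real embeddings have 1536 dimensions), and `cosine_similarity` is a hypothetical helper, not a Neo4j or LangChain function.

```python
import math
from typing import Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query = [1.0, 0.0, 1.0]
extracts = {
    "extract_a": [2.0, 0.0, 2.0],  # same direction as the query
    "extract_b": [0.0, 1.0, 0.0],  # orthogonal to the query
}
# Rank extracts from most to least similar to the query
ranked = sorted(extracts, key=lambda name: cosine_similarity(query, extracts[name]), reverse=True)
print(ranked)
# ['extract_a', 'extract_b']
```

Because only the direction of the vectors matters, `extract_a` ranks first even though its vector is twice as long as the query.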

#### Next, move on to [retrival.ipynb](https://github.com/Project-Hackathons/LifeHack2024/blob/main/retrival.ipynb)