<a href="https://colab.research.google.com/github/Project-Hackathons/LifeHack2024/blob/main/TerrorViz.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Automated Knowledge Graph Creation and Querying for Terrorism Reports Using LLMs
## **Introduction**

In the context of terrorism, reports and articles often contain extensive, unstructured text that is challenging to analyze and cross-reference in an automated manner. For instance, articles about a single terror incident might arrive at different times throughout the day, each with varying details. These reports typically include crucial entities such as *persons, objects, locations, and events*.

<br>

The objective of this project is to design and implement a solution using Large Language Models (LLMs) to:

1. Extract entities from these reports and represent them in a structured knowledge graph.
2. Develop a chatbot capable of answering questions based on the generated knowledge graph.

# Our Solution
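Each report is passed to an OpenAI chat model through LangChain, which returns the extracted entities and relationships as a structured KnowledgeGraph object. The result is then converted into a GraphDocument and stored in a Neo4j database.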

# Dependencies
To run this project, the following dependencies are required:

*   langchain: A framework for building applications on top of large language models.
*   neo4j: The Python driver for Neo4j, the graph database used to store and query the knowledge graph.
*   openai: To access and use OpenAI's language models.
*   wikipedia: To fetch Wikipedia articles used as input for the knowledge graph.
*   tiktoken: For the tokenization tasks required by the language models.
*   langchain_openai: Integrates LangChain with OpenAI's models.
*   langchain-community: Additional LangChain community tools and integrations.

In [1]:
!pip install langchain neo4j openai wikipedia tiktoken langchain_openai langchain-community

Collecting langchain
  Downloading langchain-0.2.1-py3-none-any.whl (973 kB)
Collecting neo4j
  Downloading neo4j-5.20.0.tar.gz (202 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting openai
  Downloading openai-1.30.5-py3-none-any.whl (320 kB)
Collecting wikipedia
  Downloading wikipedia-1.4.0.tar.gz (27 kB)
  Preparing metadata (setup.py) ... done
Collecting tiktoken
  Downloading tiktoken-0.7.0-cp310-cp310

> The following section retrieves the secret credentials for the project from Colab's user data. The Neo4jGraph class is then instantiated with these credentials, which enables interaction with the Neo4j database.

In [2]:
# Import secrets and initialise Neo4jGraph
from google.colab import userdata
# Neo4jGraph lives in langchain_community; the langchain.graphs path is deprecated
from langchain_community.graphs import Neo4jGraph

NEO4J_URI = userdata.get('NEO4J_URI')
NEO4J_USERNAME = userdata.get('NEO4J_USERNAME')
NEO4J_PASSWORD = userdata.get('NEO4J_PASSWORD')
AURA_INSTANCEID = userdata.get('AURA_INSTANCEID')
AURA_INSTANCENAME = userdata.get('AURA_INSTANCENAME')
OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')

graph = Neo4jGraph(url=NEO4J_URI, username=NEO4J_USERNAME, password=NEO4J_PASSWORD)

> Here, we redefine the properties field as a list of Property instances instead of a dictionary, to work around limitations of the OpenAI functions API. Since the API accepts only a single object, we also combine the nodes and relationships into a single class called KnowledgeGraph.

In [3]:
# Override the properties definitions of the base Node and Relationship classes
# to adhere to the limitations of OpenAI functions
from langchain_community.graphs.graph_document import (
    Node as BaseNode,
    Relationship as BaseRelationship,
    GraphDocument,
)
from langchain.schema import Document
from typing import List, Dict, Any, Optional
from langchain.pydantic_v1 import Field, BaseModel

class Property(BaseModel):
    """A single property consisting of key and value"""
    key: str = Field(..., description="key")
    value: str = Field(..., description="value")

class Node(BaseNode):
    properties: Optional[List[Property]] = Field(
        None, description="List of node properties")

class Relationship(BaseRelationship):
    properties: Optional[List[Property]] = Field(
        None, description="List of relationship properties"
    )

class KnowledgeGraph(BaseModel):
    """Generate a knowledge graph with entities and relationships."""
    nodes: List[Node] = Field(
        ..., description="List of nodes in the knowledge graph")
    rels: List[Relationship] = Field(
        ..., description="List of relationships in the knowledge graph"
    )

In [4]:
def format_property_key(s: str) -> str:
    """Convert a space-separated key such as "birth date" to camelCase."""
    words = s.split()
    if not words:
        return s
    first_word = words[0].lower()
    capitalized_words = [word.capitalize() for word in words[1:]]
    return "".join([first_word] + capitalized_words)

def props_to_dict(props) -> dict:
    """Convert a list of Property instances to a dictionary."""
    properties = {}
    if not props:
        return properties
    for p in props:
        properties[format_property_key(p.key)] = p.value
    return properties

def map_to_base_node(node: Node) -> BaseNode:
    """Map the KnowledgeGraph Node to the base Node."""
    properties = props_to_dict(node.properties) if node.properties else {}
    # Add name property for better Cypher statement generation
    properties["name"] = node.id.title()
    return BaseNode(
        id=node.id.title(), type=node.type.capitalize(), properties=properties
    )


def map_to_base_relationship(rel: Relationship) -> BaseRelationship:
    """Map the KnowledgeGraph Relationship to the base Relationship."""
    source = map_to_base_node(rel.source)
    target = map_to_base_node(rel.target)
    properties = props_to_dict(rel.properties) if rel.properties else {}
    return BaseRelationship(
        source=source, target=target, type=rel.type, properties=properties
    )
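
> As a rough illustration of how the property helpers above behave, the sketch below re-implements them without LangChain. The Prop dataclass is a hypothetical stand-in for the Property model, and the sample keys and values are made up for the example. One thing worth noticing: a key that already arrives in camelCase is a single word, so it appears to be lowercased entirely.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Prop:
    """Hypothetical stand-in for the notebook's Property model."""
    key: str
    value: str


def format_property_key(s: str) -> str:
    # Same logic as the notebook helper: lowercase the first word,
    # capitalize the remaining words, and join them without spaces.
    words = s.split()
    if not words:
        return s
    return "".join([words[0].lower()] + [w.capitalize() for w in words[1:]])


def props_to_dict(props: Optional[List[Prop]]) -> dict:
    # A missing or empty property list becomes an empty dictionary.
    return {format_property_key(p.key): p.value for p in (props or [])}


# Illustrative input only; these values do not come from any report.
example = [Prop("Birth Date", "1990-01-01"), Prop("age", "34")]
result = props_to_dict(example)
print(result)

# A single-word camelCase key loses its inner capital letter.
print(format_property_key("birthDate"))
```

> If preserving camelCase keys produced by the model matters, one option would be to return single-word keys unchanged, but that is a design choice the notebook does not make.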

> Now, we will set up an extraction chain to generate a knowledge graph. The chain uses a detailed prompt template that instructs the model on how to extract information. The extracted information is returned in the KnowledgeGraph class structure to ensure consistency.

In [5]:
# Only create_structured_output_chain is needed for the extraction chain
from langchain.chains.openai_functions import create_structured_output_chain
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

# temperature=0 keeps the extraction as deterministic as possible
llm = ChatOpenAI(model="gpt-3.5-turbo-16k", temperature=0, openai_api_key=OPENAI_API_KEY)

def get_extraction_chain(
    allowed_nodes: Optional[List[str]] = None,
    allowed_rels: Optional[List[str]] = None
    ):
    prompt = ChatPromptTemplate.from_messages(
        [(
          "system",
          f"""# Knowledge Graph Instructions for GPT-4
## 1. Overview
You are a top-tier algorithm designed for extracting information in structured formats to build a knowledge graph.
- **Nodes** represent entities and concepts. They're akin to Wikipedia nodes.
- The aim is to achieve simplicity and clarity in the knowledge graph, making it accessible for a vast audience.
## 2. Labeling Nodes
- **Consistency**: Ensure you use basic or elementary types for node labels.
  - For example, when you identify an entity representing a person, always label it as **"person"**. Avoid using more specific terms like "mathematician" or "scientist".
- **Node IDs**: Never utilize integers as node IDs. Node IDs should be names or human-readable identifiers found in the text.
{'- **Allowed Node Labels:**' + ", ".join(allowed_nodes) if allowed_nodes else ""}
## 3. Labeling Relationships
{'- **Allowed Relationship Types**:' + ", ".join(allowed_rels) if allowed_rels else ""}
## 4. Handling Numerical Data and Dates
- Numerical data, like age or other related information, should be incorporated as attributes or properties of the respective nodes.
- **No Separate Nodes for Dates/Numbers**: Do not create separate nodes for dates or numerical values. Always attach them as attributes or properties of nodes.
- **Property Format**: Properties must be in a key-value format.
- **Quotation Marks**: Never use escaped single or double quotes within property values.
- **Naming Convention**: Use camelCase for property keys, e.g., `birthDate`.
## 5. Coreference Resolution
- **Maintain Entity Consistency**: When extracting entities, it's vital to ensure consistency.
If an entity, such as "John Doe", is mentioned multiple times in the text but is referred to by different names or pronouns (e.g., "Joe", "he"),
always use the most complete identifier for that entity throughout the knowledge graph. In this example, use "John Doe" as the entity ID.
Remember, the knowledge graph should be coherent and easily understandable, so maintaining consistency in entity references is crucial.
## 6. Strict Compliance
Adhere to the rules strictly. Non-compliance will result in termination.
          """),
            ("human", "Use the given format to extract information from the following input: {input}"),
            ("human", "Tip: Make sure to answer in the correct format"),
        ])
    return create_structured_output_chain(KnowledgeGraph, llm, prompt, verbose=False)
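
> The two optional lines in the system prompt are built with an inline conditional. The sketch below isolates that expression to show what it seems to render with and without a list; render_constraint is a hypothetical helper name used only for this illustration. Note that no space is inserted between the label and the first item.

```python
from typing import List, Optional


def render_constraint(label: str, allowed: Optional[List[str]]) -> str:
    # Mirrors the prompt's inline conditional: emit the bullet only when
    # a non-empty list is supplied, otherwise emit an empty string.
    return label + ", ".join(allowed) if allowed else ""


nodes_line = render_constraint("- **Allowed Node Labels:**", ["person", "location"])
empty_line = render_constraint("- **Allowed Relationship Types**:", None)

print(repr(nodes_line))
print(repr(empty_line))
```

> When no list is passed, the corresponding prompt line is simply left blank, so the model is free to choose its own labels or relationship types.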

> Now that we have identified the nodes and relationships, we can construct a GraphDocument from them. The function below uploads the constructed GraphDocument to the Neo4j database.

In [10]:
def extract_and_store_graph(
    document: Document,
    nodes: Optional[List[str]] = None,
    rels: Optional[List[str]] = None) -> None:
    # Extract graph data using OpenAI functions
    extract_chain = get_extraction_chain(nodes, rels)
    data = extract_chain.invoke(document.page_content)['function']
    # Construct a graph document
    graph_document = GraphDocument(
        nodes=[map_to_base_node(node) for node in data.nodes],
        relationships=[map_to_base_relationship(rel) for rel in data.rels],
        source=document
    )
    # Store information into a graph
    graph.add_graph_documents([graph_document])

# Evaluation
What's the point of having code if it doesn't work? Let's test it out.

<br>
The data is taken from the first article that Wikipedia returns for the query "Johor Attack".

In [11]:
# WikipediaLoader lives in langchain_community; the langchain.document_loaders path is deprecated
from langchain_community.document_loaders import WikipediaLoader
from langchain.text_splitter import TokenTextSplitter

# Read the wikipedia article
raw_documents = WikipediaLoader(query="Johor Attack", load_max_docs=1).load()
# Define chunking strategy (currently disabled)
#text_splitter = TokenTextSplitter(chunk_size=2048, chunk_overlap=24)

# Only one document is loaded, so use it as-is without chunking
documents = raw_documents
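
> The chunking step is disabled in the cell above, but its parameters (chunk_size and chunk_overlap) describe a sliding window over tokens. The sketch below is a simplified, pure-Python approximation of that idea in which whitespace-separated words stand in for model tokens; split_with_overlap is a hypothetical helper and not the TokenTextSplitter implementation.

```python
from typing import List


def split_with_overlap(
    tokens: List[str], chunk_size: int, chunk_overlap: int
) -> List[List[str]]:
    # Each chunk starts (chunk_size - chunk_overlap) tokens after the
    # previous one, so consecutive chunks share chunk_overlap tokens.
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Words stand in for tokens in this illustration.
words = "a b c d e f g h i j".split()
chunks = split_with_overlap(words, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

> The overlap gives each chunk a little context from the previous one, which may help the model resolve entities mentioned across a chunk boundary.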

In [8]:
# ONLY USE TO DELETE THE DATABASE WHEN NEEDED FOR TESTING
'''
graph.query("MATCH (n) DETACH DELETE n")
'''

[]

In [12]:
from tqdm import tqdm

for d in tqdm(documents, total=len(documents)):
    extract_and_store_graph(
        document=d,
        nodes=["person", "object", "location", "event"],
        rels=[
            "involved_in", "organized_by", "possessed_by", "victim_of",
            "affected_by", "used", "located_at", "found_at", "attacked",
        ],
    )

100%|██████████| 1/1 [00:32<00:00, 32.99s/it]


In [None]:
# Adhi's code: scrape Google search results for article content (currently disabled)
'''
import requests
from bs4 import BeautifulSoup
import json
from datetime import datetime
import re

# Number of search results to collect
n = 40  # of 40 results, roughly 15 are YouTube/Twitter links whose content comes back as null
query = "Johor attack"


def google_search(query, page):
    start = page * 10
    url = f"https://www.google.com/search?q={query}&start={start}"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    }
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup

def scrape_links_and_content(query, num_pages= int(n/10)):
    links = []
    for page in range(num_pages):
        soup = google_search(query, page)
        for item in soup.find_all('div', class_='g'):
            link_element = item.find('a', href=True)
            if link_element:
                link = link_element['href']
               # datetime_of_article = get_datetime_of_article(link)
                content = scrape_content(link)
                links.append({
                    "link": link,
                    #"datetime": datetime_of_article,
                    "content": content
                })
            # Stop as soon as n results have been collected
            if len(links) >= n:
                break
        if len(links) >= n:
            break
    return links

#def get_datetime_of_article(link):
 #   return datetime.now().strftime("%Y-%m-%d %H:%M:%S")

def scrape_content(link):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    }
    try:
        response = requests.get(link, headers=headers, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        paragraphs = soup.find_all('p')
        content = ' '.join(p.get_text() for p in paragraphs)
        return content
    except requests.exceptions.RequestException as e:
        print(f"Error fetching content from {link}: {e}")
        return None

# Rough cleaning: flatten newlines/tabs and replace non-ASCII characters with spaces
def clean_json(data):
    cleaned_data = {}
    for key, value in data.items():
        if isinstance(value, str):
            cleaned_value = value.replace('\n', ' ').replace('\r', '').replace('\t', '')
            cleaned_value = re.sub(r'[\u0080-\uffff]', ' ', cleaned_value)

            cleaned_data[key] = cleaned_value
        elif isinstance(value, dict):
            cleaned_data[key] = clean_json(value)
        elif isinstance(value, list):
            cleaned_data[key] = [clean_json(item) for item in value]
        else:
            cleaned_data[key] = value
    return cleaned_data



def main():
    results = scrape_links_and_content(query, num_pages= int(n/10))
    search_result = {
        "query": query,
        "results": results
    }

    # Write mode: appending a second JSON object would make the file invalid JSON
    with open('search_results.json', 'w') as f:
        json.dump(clean_json(search_result), f, indent=4)

if __name__ == "__main__":
    main()
'''