In [1]:
import pandas as pd
import os
import json
from dotenv import load_dotenv
import importlib

load_dotenv()

True

# A Graph RAG Approach to Query-Focused Summarization
<hr>


Reference: [From Local to Global: A Graph RAG Approach to Query-Focused Summarization](https://arxiv.org/abs/2404.16130)
<br>

![image](../.resources/Query-Focused-Summarization.png)


<br>
<br>
<hr>

## 1 Source Document -> Text Chunks

The chunk size used for this implementation is 600 tokens. According to the paper, using a longer chunk size can reduce the number of LLM calls required for extraction. However, this approach can lead to recall degradation, as the model may not fully extract all relevant information from a longer context window.

In the [Microsoft GraphRAG](https://microsoft.github.io/graphrag/posts/index/1-default_dataflow/) dataflow, this stage is covered in **Phase 1: Compose TextUnits**.

![image](../.resources/grag_phrase1.png)

In [2]:
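# Load the source text.
# Note: readlines() keeps each line's trailing newline, so joining with '\n'
# doubles every line break (visible as blank lines in the preview below).
with open("../doc/pg24022.txt", 'r') as file:
    data = '\n'.join(file.readlines())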

print(data[:600])

The Project Gutenberg eBook of A Christmas Carol

    

This ebook is for the use of anyone anywhere in the United States and

most other parts of the world at no cost and with almost no restrictions

whatsoever. You may copy it, give it away or re-use it under the terms

of the Project Gutenberg License included with this ebook or online

at www.gutenberg.org. If you are not located in the United States,

you will have to check the laws of the country where you are located

before using this eBook.



Title: A Christmas Carol



Author: Charles Dickens



Illustrator: Arthur Rackham



Relea


In [3]:
from text_splitter import TextSplitter
text_splitter = TextSplitter(600, 20)

chunks = text_splitter.split_text(data)
len(chunks)

80
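The `text_splitter` module is not shown in this notebook, so the following is only a minimal sketch of what a fixed-size splitter with overlap might do. It assumes that the two arguments of `TextSplitter(600, 20)` are a chunk size and an overlap, and it uses whitespace-separated words as a stand-in for real tokenizer tokens. The name `split_into_chunks` is hypothetical.

```python
def split_into_chunks(text, chunk_size=600, overlap=20):
    """Split text into overlapping windows of whitespace-delimited tokens.

    Assumption: words approximate tokens; a real splitter would use the
    tokenizer of the target LLM.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    tokens = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Synthetic input: 1000 distinct "tokens" w0 .. w999
sample = " ".join(f"w{i}" for i in range(1000))
demo_chunks = split_into_chunks(sample, chunk_size=600, overlap=20)
print(len(demo_chunks))
```

A larger `chunk_size` yields fewer chunks and therefore fewer LLM calls, which is the trade-off against recall described at the start of this section.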

<hr>

![image](../.resources/grag_phrase2.png)

<br>



## 2 Text extraction -> Element Instances

A multipart LLM prompt is used to identify all the *entities* in the text, including their name, type, and description, and then to extract the *relationships* between those entities.

In [4]:
from extractor import GraphExtractor
from LLM import GeminiModel

ge = GraphExtractor(GeminiModel("gemini-1.5-flash-001"))

In [5]:
print(ge.extract_text(chunks[0], store=False))

("entity"<TD>PROJECT GUTENBERG<TD>ORGANIZATION<TD>Project Gutenberg is an online library that provides free ebooks)
<RD>
("entity"<TD>CHARLES DICKENS<TD>PERSON<TD>Charles Dickens is the author of A Christmas Carol)
<RD>
("entity"<TD>ARTHUR RACKHAM<TD>PERSON<TD>Arthur Rackham was the illustrator for A Christmas Carol)
<RD>
("entity"<TD>J. B. LIPPINCOTT COMPANY<TD>ORGANIZATION<TD>The J. B. Lippincott Company published A Christmas Carol in 1915)
<RD>
("entity"<TD>SUZANNE SHELL<TD>PERSON<TD>Suzanne Shell was a producer of the Project Gutenberg ebook of A Christmas Carol)
<RD>
("entity"<TD>JANET BLENKINSHIP<TD>PERSON<TD>Janet Blenkinship was a producer of the Project Gutenberg ebook of A Christmas Carol)
<RD>
("entity"<TD>ONLINE DISTRIBUTED PROOFREADING TEAM<TD>ORGANIZATION<TD>The Online Distributed Proofreading Team at http://www.pgdp.net contributed to the Project Gutenberg ebook of A Christmas Carol)
<RD>
("relationship"<TD>CHARLES DICKENS<TD>PROJECT GUTENBERG<TD>Charles Dickens is an au

The output above shows that the Gemini model can extract *entities* and *relationships* from a chunk and produce a description for each.
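To make the record format concrete, here is a minimal parsing sketch. It assumes the layout visible in the printed output: records separated by `<RD>`, fields separated by `<TD>`, and each record wrapped in parentheses. The function name `parse_records` is hypothetical, and the field order for relationships (source, target, description) is inferred from the truncated output, so treat it as an assumption.

```python
TUPLE_DELIM = "<TD>"
RECORD_DELIM = "<RD>"


def parse_records(raw):
    """Parse raw extraction output into entity and relationship dicts."""
    entities, relationships = [], []
    for record in raw.split(RECORD_DELIM):
        record = record.strip()
        # Skip empty or truncated records (no closing parenthesis)
        if not (record.startswith("(") and record.endswith(")")):
            continue
        fields = [f.strip() for f in record[1:-1].split(TUPLE_DELIM)]
        kind = fields[0].strip('"').lower()
        if len(fields) < 4:
            continue
        if kind == "entity":
            entities.append({
                "entity_name": fields[1].upper(),
                "entity_type": fields[2].upper(),
                "description": fields[3],
            })
        elif kind == "relationship":
            relationships.append({
                "source_entity": fields[1].upper(),
                "target_entity": fields[2].upper(),
                "description": fields[3],
            })
    return entities, relationships


sample_output = (
    '("entity"<TD>CHARLES DICKENS<TD>PERSON<TD>Charles Dickens is the author of A Christmas Carol)\n'
    '<RD>\n'
    '("entity"<TD>ARTHUR RACKHAM<TD>PERSON<TD>Arthur Rackham was the illustrator for A Christmas Carol)\n'
    '<RD>\n'
    # Made-up relationship record, for illustration only
    '("relationship"<TD>ARTHUR RACKHAM<TD>CHARLES DICKENS<TD>Illustrative description)\n'
    '<RD>\n'
    # Truncated record, as in the printed output above; it is skipped
    '("relationship"<TD>CHARLES DICKENS<TD>PROJECT GUTENBERG<TD>Charles Dickens is an au'
)

entities, relationships = parse_records(sample_output)
print(len(entities), len(relationships))
```

Skipping malformed records silently is a design choice for this sketch; a production parser would probably log them so that inconsistent model output can be inspected.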

However, across many test runs the results are not consistent: sometimes the model fails to extract all entities or relationships from the text. The paper mentions this as well; its authors use multiple rounds of “gleanings” to encourage the LLM to detect any additional entities it may have missed in prior extraction rounds.

**NOTE**: The extraction function does not yet include the multi-round gleaning process described above.
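For reference, a gleaning loop could look roughly like the sketch below. This is not the notebook's `GraphExtractor` and not the paper's exact prompts: the prompt strings, the function name `extract_with_gleaning`, and the `ScriptedLLM` stand-in are all hypothetical. The only assumption about the model is that it is a callable that takes a prompt string and returns a string.

```python
RECORD_DELIM = "<RD>"

# Hypothetical prompt wording, for illustration only
EXTRACT_PROMPT = "Extract all entities and relationships from this text:\n{text}"
CHECK_PROMPT = "Were any entities missed in the extraction above? Answer YES or NO."
CONTINUE_PROMPT = "Some entities were missed. Add them below using the same format."


def extract_with_gleaning(chunk, llm, max_gleanings=1):
    """Run one extraction pass plus up to max_gleanings follow-up passes."""
    transcript = EXTRACT_PROMPT.format(text=chunk)
    outputs = [llm(transcript)]
    for _ in range(max_gleanings):
        transcript += "\n" + outputs[-1]
        # Ask the model whether anything was missed
        verdict = llm(transcript + "\n" + CHECK_PROMPT)
        if not verdict.strip().upper().startswith("YES"):
            break  # the model reports nothing was missed
        transcript += "\n" + CONTINUE_PROMPT
        outputs.append(llm(transcript))
    return f"\n{RECORD_DELIM}\n".join(outputs)


class ScriptedLLM:
    """Stand-in for a real model: returns canned replies in order."""

    def __init__(self, replies):
        self.replies = list(replies)
        self.prompts = []

    def __call__(self, prompt):
        self.prompts.append(prompt)
        return self.replies.pop(0)


llm = ScriptedLLM([
    '("entity"<TD>A<TD>PERSON<TD>first pass)',
    "YES",
    '("entity"<TD>B<TD>PERSON<TD>second pass)',
    "NO",
])
result = extract_with_gleaning("some chunk of text", llm, max_gleanings=2)
print(result)
```

Each gleaning round costs up to two extra LLM calls per chunk, so `max_gleanings` trades extraction recall against cost.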

In [None]:
# Extract entities and relationships from all chunks
for chunk in chunks:
    ge.extract_text(chunk)

In [10]:
# Scrooge is the main character of A Christmas Carol
ge.temp['"ENTITY"<TD>EBENEZER SCROOGE<TD>PERSON']

['Ebenezer Scrooge is the main character of "A Christmas Carol" by Charles Dickens',
 'Ebenezer Scrooge is a grasping, covetous old man, the surviving partner of the firm of Scrooge and Marley',
 'Ebenezer Scrooge is a miserly man who is visited by the ghost of Jacob Marley.  He is visited by three spirits on Christmas Eve to show him the errors of his ways',
 'Ebenezer Scrooge is the protagonist of the story, a miserly and cynical man who is visited by three spirits to learn about the consequences of his actions']

In [11]:
# save temp information

json_temp_path = "../json/christmas_carol_temp.json"

with open(json_temp_path, "w") as fp:
    json.dump(ge.temp, fp)

## 3 Element Instances -> Element Summaries

This step corresponds to the second phase of the [Microsoft GraphRAG](https://microsoft.github.io/graphrag/posts/index/1-default_dataflow/) dataflow. After extracting all the *entities* and *relationships* from the chunks, we merge entities that share the same *name* and *type* and collect their *descriptions* into a list. The same applies to *relationships*: any relationships with the same *source* and *target* pair are merged.

Then, for every merged element, an LLM summarizes the list of descriptions into a single description that covers all the information about that element.
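The merge-then-summarize step can be sketched as follows. This is an illustration under assumptions, not the `GraphExtractor.summarize` implementation: the function names are hypothetical, and the `summarize` argument is a plain callable standing in for the LLM call.

```python
from collections import defaultdict


def merge_descriptions(entities):
    """Group descriptions by (entity_name, entity_type), dropping exact duplicates."""
    merged = defaultdict(list)
    for entity in entities:
        key = (entity["entity_name"], entity["entity_type"])
        if entity["description"] not in merged[key]:
            merged[key].append(entity["description"])
    return dict(merged)


def summarize_all(merged, summarize):
    """Collapse each description list into one description.

    A single description is kept as-is, so the summarizer is only called
    when there is actually something to merge.
    """
    results = []
    for (name, entity_type), descriptions in merged.items():
        if len(descriptions) == 1:
            description = descriptions[0]
        else:
            description = summarize(name, descriptions)
        results.append({
            "entity_name": name,
            "entity_type": entity_type,
            "description": description,
        })
    return results


extracted = [
    {"entity_name": "EBENEZER SCROOGE", "entity_type": "PERSON",
     "description": 'Ebenezer Scrooge is the main character of "A Christmas Carol" by Charles Dickens'},
    {"entity_name": "PETER CRATCHIT", "entity_type": "PERSON",
     "description": "Bob Cratchit's son"},
    {"entity_name": "EBENEZER SCROOGE", "entity_type": "PERSON",
     "description": "Ebenezer Scrooge is a grasping, covetous old man"},
]

# Stand-in for the LLM summarizer: simply concatenates the descriptions
summaries = summarize_all(merge_descriptions(extracted),
                          summarize=lambda name, descs: " ".join(descs))
for item in summaries:
    print(item["entity_name"], "->", item["description"])
```

Relationships can be handled the same way by keying on the (source, target) pair instead of (name, type).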

In [2]:
from extractor import GraphExtractor
from LLM import GeminiModel

ge = GraphExtractor(GeminiModel("gemini-1.5-flash-001"))

json_temp_path = "../json/christmas_carol_temp.json"

with open(json_temp_path, "r") as fp:
    ge.temp = json.load(fp)

In [None]:
# Merge duplicate entities and relationships and summarize their descriptions
ge.summarize(cooldown=2)

In [6]:
ge.data[:5]

[{'entity_name': 'CHARLES DICKENS',
  'entity_type': 'PERSON',
  'description': 'Author who wrote A Christmas Carol'},
 {'entity_name': 'PETER CRATCHIT',
  'entity_type': 'PERSON',
  'description': "Bob Cratchit's son"},
 {'entity_name': 'BOB CRATCHIT',
  'entity_type': 'PERSON',
  'description': 'BOB CRATCHIT was a clerk who worked for Ebenezer Scrooge. He was also known to have gone down the Cornhill slide. \n'},
 {'entity_name': 'TINY TIM',
  'entity_type': 'PERSON',
  'description': 'Bob Cratchit\'s son, referred to as "Tiny Tim"'},
 {'entity_name': 'EBENEZER SCROOGE',
  'entity_type': 'PERSON',
  'description': 'Ebenezer Scrooge, the surviving partner of the firm of Scrooge and Marley, was a grasping, covetous old man. On Christmas Eve, he was visited by three ghosts who showed him the errors of his ways. \n'}]

In [7]:
# Save to JSON file
json_path = "../json/christmas_carol.json"
ge.save_data(json_path)

File is successfully saved at: ../json/christmas_carol.json


<br>

<hr>

![image.png](../.resources/grag_prhase3.png)

<br>


## 4 Element Summaries -> Graph Communities

After the third step, we have a usable graph of summarized *entities* and *relationships*. However, to capture stronger connections between entities and to make search more efficient, a community detection algorithm is used to partition the graph into communities of nodes.

The communities must satisfy two properties: they are **Mutually Exclusive** and **Collectively Exhaustive**, which means that every node belongs to exactly one community.

In this stage, the **Hierarchical Leiden Algorithm** is used to generate a hierarchy of entity communities.
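The two partition properties can be checked mechanically. The helper below is a small sketch with a hypothetical name (`check_partition`); it takes the full node set and a mapping from community id to members, and reports nodes that appear in more than one community or in none. The example uses a few node names and community ids from this notebook's Leiden output.

```python
def check_partition(nodes, communities):
    """Check that communities are mutually exclusive and collectively exhaustive.

    communities: dict mapping a community id to an iterable of node names.
    """
    assigned = {}          # node -> first community it was seen in
    overlapping = set()    # nodes found in more than one community
    for community_id, members in communities.items():
        for node in members:
            if node in assigned and assigned[node] != community_id:
                overlapping.add(node)
            assigned.setdefault(node, community_id)
    missing = set(nodes) - set(assigned)  # nodes in no community at all
    return {
        "mutually_exclusive": not overlapping,
        "collectively_exhaustive": not missing,
        "overlapping": overlapping,
        "missing": missing,
    }


nodes = ["BOB CRATCHIT", "MRS. CRATCHIT", "EBENEZER SCROOGE", "JACOB"]
communities = {
    19: ["BOB CRATCHIT", "MRS. CRATCHIT"],
    26: ["EBENEZER SCROOGE", "JACOB"],
}
report = check_partition(nodes, communities)
print(report["mutually_exclusive"], report["collectively_exhaustive"])
```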

In [26]:
from langchain_community.graphs import Neo4jGraph

NEO4J_URL = os.getenv('NEO4J_URL')
NEO4J_USERNAME = os.getenv('NEO4J_USERNAME')
NEO4J_PASSWORD = os.getenv('NEO4J_PASSWORD')
NEO4J_DATABASE = os.getenv('NEO4J_DATABASE')

kg = Neo4jGraph(NEO4J_URL, NEO4J_USERNAME, NEO4J_PASSWORD, NEO4J_DATABASE)

In [18]:
import cypher_query as cq
importlib.reload(cq)

<module 'cypher_query' from '/home/nguyen/Project/Project - RAG/v3/cypher_query.py'>

In [19]:
# Reset the database. DETACH DELETE removes each node together with its
# relationships, so a single query is enough.
reset_query = """
MATCH (a)
DETACH DELETE a
"""

kg.query(reset_query)

[]

In [20]:
# Store the JSON records in the graph
json_path = "../json/christmas_carol.json"
with open(json_path, 'r') as fp:
    data = json.load(fp)

for dt in data:
    # Entity records have an "entity_name" key; all other records are relationships
    if "entity_name" in dt:
        cq.create_entity(kg, dt["entity_name"], dt["entity_type"], dt["description"])
    else:
        cq.create_relationship(kg, dt["source_entity"], dt["target_entity"], dt["description"])
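The `cypher_query` module is not shown here, so the following is only a guess at what its two helpers might look like. The `Entity` label, the `RELATED` relationship type, and the `name` property appear in later outputs of this notebook; the `type` and `description` property names and the use of `MERGE` are assumptions. The graph object only needs a `query(query, params)` method, which is how `Neo4jGraph.query` is called, so a small recording stub is used to keep the example runnable without a database.

```python
# Parameterized Cypher avoids quoting problems in names and descriptions
CREATE_ENTITY = """
MERGE (e:Entity {name: $name})
SET e.type = $type, e.description = $description
"""

CREATE_RELATIONSHIP = """
MATCH (s:Entity {name: $source}), (t:Entity {name: $target})
MERGE (s)-[r:RELATED]->(t)
SET r.description = $description
"""


def create_entity(graph, name, entity_type, description):
    """Create or update an Entity node (hypothetical helper)."""
    params = {"name": name, "type": entity_type, "description": description}
    return graph.query(CREATE_ENTITY, params)


def create_relationship(graph, source, target, description):
    """Create or update a RELATED relationship between two existing entities."""
    params = {"source": source, "target": target, "description": description}
    return graph.query(CREATE_RELATIONSHIP, params)


class RecordingGraph:
    """Stub that records queries instead of sending them to Neo4j."""

    def __init__(self):
        self.calls = []

    def query(self, query, params=None):
        self.calls.append((query, params or {}))
        return []


graph = RecordingGraph()
create_entity(graph, "BOB CRATCHIT", "PERSON", "Clerk who worked for Ebenezer Scrooge")
create_entity(graph, "TINY TIM", "PERSON", "Bob Cratchit's son")
create_relationship(graph, "BOB CRATCHIT", "TINY TIM", "Illustrative description")
print(len(graph.calls))
```

Note that the `MATCH` in the relationship query creates nothing if either endpoint is missing, so entities should be written before relationships.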

The Neo4j Graph Data Science (GDS) procedures for graph embedding and community detection run on a projected in-memory graph, so we have to project the graph first.

In [37]:
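# Drop any existing projection with this name; the second argument
# (failIfMissing = false) avoids an error when the graph does not exist yet
query = """
CALL gds.graph.drop("christmas_carol", false)
"""

kg.query(query)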



[{'graphName': 'christmas_carol',
  'database': 'neo4j',
  'databaseLocation': 'local',
  'memoryUsage': '',
  'sizeInBytes': -1,
  'nodeCount': 54,
  'relationshipCount': 33,
  'configuration': {'relationshipProjection': {'RELATED': {'aggregation': 'DEFAULT',
     'orientation': 'NATURAL',
     'indexInverse': False,
     'properties': {},
     'type': 'RELATED'}},
   'readConcurrency': 4,
   'relationshipProperties': {},
   'nodeProperties': {},
   'jobId': '9490314f-af10-409c-ac9b-a4a90f749c8b',
   'nodeProjection': {'Entity': {'label': 'Entity', 'properties': {}}},
   'logProgress': True,
   'creationTime': neo4j.time.DateTime(2024, 9, 6, 15, 36, 41, 307227013, tzinfo=<UTC>),
   'validateRelationships': False,
   'sudo': False},
  'density': 0.011530398322851153,
  'creationTime': neo4j.time.DateTime(2024, 9, 6, 15, 36, 41, 307227013, tzinfo=<UTC>),
  'modificationTime': neo4j.time.DateTime(2024, 9, 6, 15, 36, 49, 725340724, tzinfo=<UTC>),
  'schema': {'graphProperties': {},
   'no

In [38]:
# https://neo4j.com/docs/graph-data-science/current/management-ops/graph-creation/graph-project/

query = """
CALL gds.graph.project(
    "christmas_carol",
    ["Entity"],
    {
        RELATED: {orientation: "UNDIRECTED"}
    }
)
"""

kg.query(query)

[{'nodeProjection': {'Entity': {'label': 'Entity', 'properties': {}}},
  'relationshipProjection': {'RELATED': {'aggregation': 'DEFAULT',
    'orientation': 'UNDIRECTED',
    'indexInverse': False,
    'properties': {},
    'type': 'RELATED'}},
  'graphName': 'christmas_carol',
  'nodeCount': 54,
  'relationshipCount': 66,
  'projectMillis': 20}]
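The relationship count is 33 in the dropped `NATURAL` projection and 66 in the new `UNDIRECTED` one. That is consistent with each `RELATED` relationship being represented once per direction in an undirected projection. The toy sketch below illustrates only this counting effect with plain adjacency lists; it does not reflect how GDS stores graphs internally, and the edges are made up.

```python
from collections import defaultdict


def project(edges, orientation="NATURAL"):
    """Build adjacency lists from (source, target) pairs."""
    adjacency = defaultdict(list)
    for source, target in edges:
        adjacency[source].append(target)
        if orientation == "UNDIRECTED":
            # Also store the reverse direction
            adjacency[target].append(source)
    return adjacency


def relationship_count(adjacency):
    """Count stored adjacency entries."""
    return sum(len(targets) for targets in adjacency.values())


# Made-up edges, for illustration only
toy_edges = [
    ("BOB CRATCHIT", "TINY TIM"),
    ("BOB CRATCHIT", "EBENEZER SCROOGE"),
    ("FRED", "EBENEZER SCROOGE"),
]

natural = relationship_count(project(toy_edges, "NATURAL"))
undirected = relationship_count(project(toy_edges, "UNDIRECTED"))
print(natural, undirected)
```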

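Perform **Node2Vec graph embedding**. The call sets only `mutateProperty`, so the embedding dimension is the default of 128, as the returned configuration confirms.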

In [39]:
query = """
CALL gds.node2vec.mutate(
    "christmas_carol",
    {
        mutateProperty: "embedding"
    }
)
"""

kg.query(query)

[{'nodeCount': 54,
  'nodePropertiesWritten': 54,
  'lossPerIteration': [68950.4049070634],
  'mutateMillis': 0,
  'postProcessingMillis': 0,
  'preProcessingMillis': 0,
  'computeMillis': 1445,
  'configuration': {'walkLength': 80,
   'walkBufferSize': 1000,
   'mutateProperty': 'embedding',
   'jobId': '4acbdb6a-c133-4c67-b9de-d9431bcdf4bd',
   'iterations': 1,
   'returnFactor': 1.0,
   'negativeSamplingRate': 5,
   'windowSize': 10,
   'sudo': False,
   'positiveSamplingFactor': 0.001,
   'inOutFactor': 1.0,
   'logProgress': True,
   'negativeSamplingExponent': 0.75,
   'nodeLabels': ['*'],
   'initialLearningRate': 0.025,
   'concurrency': 4,
   'relationshipTypes': ['*'],
   'walksPerNode': 10,
   'embeddingInitializer': 'NORMALIZED',
   'embeddingDimension': 128,
   'minLearningRate': 0.0001}}]
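The configuration above lists `walkLength`, `walksPerNode`, `returnFactor`, and `inOutFactor`. These control the biased random walks that Node2Vec samples before training embeddings on them. The sketch below shows the standard second-order walk from the Node2Vec algorithm in plain Python; it is not the GDS implementation, it omits the embedding training step, and the small graph is made up for illustration.

```python
import random


def node2vec_walk(adjacency, start, walk_length,
                  return_factor=1.0, in_out_factor=1.0, rng=None):
    """Sample one biased random walk containing up to walk_length nodes.

    adjacency: dict mapping each node to a set of neighbouring nodes.
    """
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < walk_length:
        current = walk[-1]
        neighbors = sorted(adjacency.get(current, ()))
        if not neighbors:
            break  # dead end
        if len(walk) == 1:
            # First step: no previous node yet, so choose uniformly
            walk.append(rng.choice(neighbors))
            continue
        previous = walk[-2]
        weights = []
        for candidate in neighbors:
            if candidate == previous:
                weights.append(1.0 / return_factor)   # step back
            elif candidate in adjacency.get(previous, ()):
                weights.append(1.0)                   # stay close to previous
            else:
                weights.append(1.0 / in_out_factor)   # move further away
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


# Made-up undirected graph, stored in both directions
demo_graph = {
    "EBENEZER SCROOGE": {"BOB CRATCHIT", "FRED", "JACOB"},
    "BOB CRATCHIT": {"EBENEZER SCROOGE", "TINY TIM"},
    "TINY TIM": {"BOB CRATCHIT"},
    "FRED": {"EBENEZER SCROOGE"},
    "JACOB": {"EBENEZER SCROOGE"},
}

walk = node2vec_walk(demo_graph, "TINY TIM", walk_length=10, rng=random.Random(0))
print(walk)
```

With both factors at 1.0, as in the configuration above, every neighbour gets the same weight and the walk is an unbiased random walk.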

In [44]:
# https://neo4j.com/docs/graph-data-science/current/management-ops/graph-reads/graph-stream-nodes/

query = """
CALL gds.graph.nodeProperty.stream(
    "christmas_carol",
    "embedding"
)
YIELD nodeId, propertyValue, nodeLabels
RETURN gds.util.asNode(nodeId).name AS name, propertyValue AS embedding
LIMIT 5
"""

print(kg.query(query))

[{'name': 'CHARLES DICKENS', 'embedding': [-0.0006625357200391591, 0.003252149559557438, -0.0005169273936189711, -0.0008708680979907513, 0.0021685257088392973, -0.0026938170194625854, 0.003741367720067501, 0.00023004096874501556, 0.0015758476220071316, -0.00196857494302094, 0.001155876787379384, -0.003007294610142708, 0.0005062248674221337, -0.0017855972982943058, -0.0006256607011891901, -7.301349251065403e-05, -0.0015275574987754226, -0.0005493515636771917, 0.0007662192801944911, -0.0030680871568620205, -0.0031942271161824465, -0.0002033599157584831, 0.0025594306644052267, 0.0027236654423177242, -0.0008579838322475553, -0.0024355833884328604, 0.0015777118969708681, 0.0018124784110113978, -0.0005510137416422367, -0.003808950772508979, -0.0037129016127437353, 0.0014707979280501604, 0.0030115568079054356, -0.002740982687100768, -0.003066030330955982, 0.0003150413103867322, 0.0014083667192608118, -0.0015247213887050748, 0.0006097061559557915, 0.0019224269781261683, -0.0027954697143286467,

Run community detection using the **Leiden algorithm**.

In [50]:
query = """
CALL gds.leiden.stream(
    "christmas_carol"
)
YIELD nodeId, communityId
RETURN gds.util.asNode(nodeId).name AS name, communityId
"""

kg.query(query)

[{'name': 'CHARLES DICKENS', 'communityId': 28},
 {'name': 'PETER CRATCHIT', 'communityId': 29},
 {'name': 'BOB CRATCHIT', 'communityId': 19},
 {'name': 'JACOB', 'communityId': 26},
 {'name': 'TINY TIM', 'communityId': 27},
 {'name': 'EBENEZER SCROOGE', 'communityId': 26},
 {'name': 'TIM CRATCHIT', 'communityId': 19},
 {'name': 'MR. FEZZIWIG', 'communityId': 30},
 {'name': 'FRED', 'communityId': 31},
 {'name': 'GHOST OF CHRISTMAS PAST', 'communityId': 34},
 {'name': 'GHOST OF CHRISTMAS PRESENT', 'communityId': 35},
 {'name': 'GHOST OF CHRISTMAS YET TO COME', 'communityId': 32},
 {'name': 'GHOST OF JACOB MARLEY', 'communityId': 33},
 {'name': 'JOE', 'communityId': 20},
 {'name': 'MR. TOPPER', 'communityId': 21},
 {'name': 'DICK WILKINS', 'communityId': 22},
 {'name': 'BELLE', 'communityId': 17},
 {'name': 'CAROLINE', 'communityId': 18},
 {'name': 'MRS. CRATCHIT', 'communityId': 19},
 {'name': 'BELINDA CRATCHIT', 'communityId': 19},
 {'name': 'MARTHA CRATCHIT', 'communityId': 19},
 {'nam

In [47]:
# https://neo4j.com/docs/graph-data-science/current/algorithms/leiden/

query = """
CALL gds.leiden.mutate(
    "christmas_carol",
    {
        mutateProperty: "communityID"
    }
)
YIELD communityCount, modularity, modularities
"""

kg.query(query)

[{'communityCount': 33,
  'modularity': 0.497704315886134,
  'modularities': [0.39210284664830125, 0.497704315886134]}]
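The `modularity` value measures how much denser the connections inside communities are than would be expected by chance; the `modularities` list reports one value per level, with the last equal to the final score. As a rough illustration, the sketch below computes the standard unweighted modularity, Q = Σ_c [ L_c / m − (d_c / 2m)² ], where L_c is the number of edges inside community c, d_c is the total degree of its nodes, and m is the number of edges. It assumes an undirected, unweighted graph with resolution 1 and uses a made-up toy graph, so it does not reproduce the numbers above.

```python
def modularity(edges, community_of):
    """Unweighted modularity of a partition of an undirected graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community_of: dict mapping every node to its community id.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    internal = {}  # community -> number of edges fully inside it
    degree = {}    # community -> sum of the degrees of its nodes
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree.items())


# Toy graph: two triangles joined by a single bridge edge
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"),
             ("c", "d")]
two_communities = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_community = {node: 0 for node in two_communities}

q_split = modularity(toy_edges, two_communities)
q_single = modularity(toy_edges, one_community)
print(round(q_split, 4), round(q_single, 4))
```

Splitting the toy graph along the bridge scores higher than putting every node into one community, which is the kind of improvement Leiden searches for at each level.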

In [None]:
query = """
"""