<div style="color:mediumblue;font-weight:800;font-size:48px; text-align:center">
Scholar-Graph
</div>
</br>
<div style="color:darkolivegreen;font-weight:800;font-size:32px; text-align:center">
    Building Agentic Apps: ArangoDB, NVIDIA cuGraph, and NetworkX Hackathon
</div>
<br>

<p align="center">
    <img src="https://arangodb.com/wp-content/uploads/2016/05/ArangoDB_logo_avocado_@1.png" style="height: 50px;">
    <img src="https://www.nvidia.com/content/dam/en-zz/Solutions/about-nvidia/logo-and-brand/02-nvidia-logo-color-grn-500x200-4c25-p@2x.png" style="height: 50px;">
    <img src="https://rapids.ai/images/RAPIDS-logo.png" style="height: 50px;">
    <img src="https://avatars.githubusercontent.com/u/388785?s=200&v=4" style="height: 50px;">
</p>

# Features



# Key Technologies Used

- **ArangoDB** - To Store Graph - Nodes and Edges
- **NVIDIA cuGraph** - To run NetworkX Algorithms efficiently, on-scale
- **NetworkX** - To model the graph and run Graph Algorithms
- **OpenAI gpt-4o-mini** - As LLM
- **Langchain & Langgraph** - As Agentic Framework to Build Deep Research Agentic App
- **nx-cugraph** - To run cuGraph with networkx
- **LangChain Structured Output** - To extract Topic Entities from unstructured data

**Agentic App** uses below Agents:

1. **Data Augmentation Agent** - leverages Wikipedia to extract information about various cities as a Graph. Uses LangChain WikipediaRetriever and LLM Graph Transformer
2. **Master ReAct Agent** - Which Plans which tools to call. The ability to plan and call below tools and pass results of one to another, allows it run below queries:

* **SIMPLE QUERIES** → Dynamic AQL (via LangChain)
* **COMPLEX QUERIES** → GPU-Accelerated Graph Analytics (via NetworkX/cuGraph)
* **HYBRID QUERY EXECUTION** → Combining AQL and cuGraph for Contextual Responses

Available Tools for Master ReAct Agent:

- **Text to AQL** - Converts Natural Language to ArangoDB AQL Query
- **Text to AQL to Text** - Converts Natural Language to ArangoDB AQL Query, run it and convert the results to a Natural Language Answer
- **Text to NetworkX Algorithm** - Converts Natural Language to NetworkX Algorithm, executes it and provides response in natural Language - With Robust Re-Try Mechanism
- **NetworkX Algorithm to AQL to Text** - Converts Natural Language to NetworkX Algorithm, executes it and then passes results to execute AQL Query and then provides response in natural Language


# Key Use Cases:

## Simple Queries - Use Cases Answerable by AQL Queries

- Find all papers written by a Author `Mina Aganagic`.
- List all topics associated with papers published in a journal - `Int.J.Mod.Phys. A7 (1992) 5661-5705`.
- Find authors who have written papers on topics - `String Theory` and `Quantum Theory`.
- Determine the number of papers published by author `Mina Aganagic` in Year 1996 .

## Complex Queries = GPU-Accelerated Graph Analytics (via NetworkX/cuGraph)

### Breadth-First Search (BFS)
- Find the shortest path between two authors - `Mina Aganagic` and `A. Giveon` based on co-authorship.
### Depth-First Search (DFS)
- Identify strongly connected components in the citation network.
### Centrality Measures (e.g., PageRank)
- Rank papers by their influence using PageRank.
### Community Detection Algorithms (e.g., Girvan-Newman)
- Detect communities of authors based on co-authorship patterns.
### Shortest Path Algorithms (e.g., Dijkstra’s)
- Find the shortest path between papers `9907178` and `9211097` in the citation network.

## Hybrid Query Execution - Combining AQL and cuGraph for Contextual Responses
### Centrality Measures (e.g., PageRank)
- Calculate the PageRank of Top 10 authors based on their papers’ citations.
### Community Detection Algorithms (e.g., Girvan-Newman)
- Identify communities of papers based on their topics and citations.
### Breadth-First Search (BFS)
- Find the shortest path between topics - `String Theory` and `Quantum Theory` based on papers.
### Network Clustering Coefficient
- Calculate the clustering coefficient of the author co-authorship network.

# 1. Setup Environment

In [5]:
# 1. Install nx-arangodb via pip
# Github: https://github.com/arangodb/nx-arangodb

!pip install nx-arangodb

Collecting nx-arangodb
  Using cached nx_arangodb-1.3.0-py3-none-any.whl.metadata (9.3 kB)
Using cached nx_arangodb-1.3.0-py3-none-any.whl (67 kB)
Installing collected packages: nx-arangodb
Successfully installed nx-arangodb-1.3.0


In [1]:
# 2. Check if you have an NVIDIA GPU
# Note: If this returns "command not found", then GPU-based algorithms via cuGraph are unavailable

!nvidia-smi
!nvcc --version

Sun Mar  9 05:44:22 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.28.03              Driver Version: 560.76         CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 4050 ...    On  |   00000000:01:00.0 Off |                  N/A |
| N/A   48C    P8              2W /   40W |       0MiB /   6141MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [None]:
# 3. Install nx-cugraph via pip
# Note: Only enable this installation if the step above is working!
!pip install cudf-cu12 cugraph-cu12 dask-cudf-cu12 --extra-index-url=https://pypi.nvidia.com
!pip install nx-cugraph-cu12 --extra-index-url=https://pypi.nvidia.com


Looking in indexes: https://pypi.org/simple, https://pypi.nvidia.com
Looking in indexes: https://pypi.org/simple, https://pypi.nvidia.com


In [None]:
# 4. Install LangChain & LangGraph

!pip install --upgrade langchain langchain-community langchain-openai langgraph faiss-gpu-cu12 nltk langchain-ollama
!pip install --upgrade matplotlib seaborn arango_datasets wikipedia langchain_community langchain-experimental rich



In [2]:
# 5. Checking if GPU Acceleration is available

import networkx as nx
import time
import os

# Create a large graph
G = nx.erdos_renyi_graph(10000, 0.01, seed=42)
os.environ['NX_CUGRAPH_AUTOCONFIG'] = 'False'
# Run without GPU acceleration
start_time = time.time()
pr_cpu = nx.pagerank(G)
cpu_time = time.time() - start_time
print(f"CPU Time: {cpu_time:.4f} seconds")

# Enable GPU acceleration
os.environ['NX_CUGRAPH_AUTOCONFIG'] = 'True'

# Run with GPU acceleration
start_time = time.time()
pr_gpu = nx.pagerank(G)
gpu_time = time.time() - start_time
print(f"GPU Time: {gpu_time:.4f} seconds")

# Compare results
print(f"Speedup: {cpu_time / gpu_time:.2f}x")


CPU Time: 1.6557 seconds
GPU Time: 0.6895 seconds
Speedup: 2.40x


In [1]:
# 6. Import the required modules

import networkx as nx
import nx_arangodb as nxadb
import os
from arango import ArangoClient

import pandas as pd
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
from random import randint
import re

from langchain import hub
from langchain.agents import AgentExecutor, create_react_agent
from langgraph.checkpoint.memory import MemorySaver
from langchain_openai import ChatOpenAI
from langchain_community.graphs import ArangoGraph
from langchain_community.chains.graph_qa.arangodb import ArangoGraphQAChain
from langchain_core.tools import tool

import faiss
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_community.vectorstores import FAISS

from nltk.corpus import stopwords
import nltk
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
stopwords = stopwords.words('english')
nltk.download('punkt_tab')

[04:33:52 +0000] [INFO]: NetworkX-cuGraph is available.
[nltk_data] Downloading package stopwords to /home/ninad/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /home/ninad/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [2]:
# 6. Connect to the ArangoDB database
from arango.http import DefaultHTTPClient
import os
from dotenv import load_dotenv
load_dotenv(dotenv_path="./.env",
            override=True)
client = ArangoClient(
    http_client=DefaultHTTPClient(request_timeout=300)
)
db = ArangoClient(hosts=os.environ.get("ARANGO_URL")) \
        .db(username=os.environ.get("ARANGO_USER_NAME"),
            password=os.environ.get("ARANGO_PASSWORD"), verify=True)

print(db)

<StandardDatabase _system>


# 2. Import Data
We Import Data from the Stanford Graph Datasets. - https://snap.stanford.edu/data/cit-HepTh.html
Data consists of Citation Network along with Abstracts

**Important** - Run this section only once! Move directy to Section 3, if Data was imported once

## 2.1 Import Data from ArangoDB Datasets

In [None]:
!mkdir 'ArxivAnalysis'
os.chdir('ArxivAnalysis')

In [6]:
!wget https://snap.stanford.edu/data/cit-HepTh.txt.gz
!wget https://snap.stanford.edu/data/cit-HepTh-dates.txt.gz
!wget https://snap.stanford.edu/data/cit-HepTh-abstracts.tar.gz


--2025-03-09 06:02:08--  https://snap.stanford.edu/data/cit-HepTh.txt.gz
Resolving snap.stanford.edu (snap.stanford.edu)... 171.64.75.80, 64:ff9b::ab40:4b50
Connecting to snap.stanford.edu (snap.stanford.edu)|171.64.75.80|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1317497 (1.3M) [application/x-gzip]
Saving to: ‘cit-HepTh.txt.gz’


2025-03-09 06:02:21 (122 KB/s) - ‘cit-HepTh.txt.gz’ saved [1317497/1317497]

--2025-03-09 06:02:21--  https://snap.stanford.edu/data/cit-HepTh-dates.txt.gz
Resolving snap.stanford.edu (snap.stanford.edu)... 171.64.75.80, 64:ff9b::ab40:4b50
Connecting to snap.stanford.edu (snap.stanford.edu)|171.64.75.80|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 96569 (94K) [application/x-gzip]
Saving to: ‘cit-HepTh-dates.txt.gz’


2025-03-09 06:02:24 (64.4 KB/s) - ‘cit-HepTh-dates.txt.gz’ saved [96569/96569]

--2025-03-09 06:02:24--  https://snap.stanford.edu/data/cit-HepTh-abstracts.tar.gz
Resolving snap.stanford

In [7]:

!gzip -d cit-HepTh.txt.gz
!gzip -d cit-HepTh-dates.txt.gz
!tar -xzvf cit-HepTh-abstracts.tar.gz

1992/
1992/9201001.abs
1992/9201002.abs
1992/9201003.abs
1992/9201004.abs
1992/9201005.abs
1992/9201006.abs
1992/9201007.abs
1992/9201008.abs
1992/9201009.abs
1992/9201010.abs
1992/9201011.abs
1992/9201012.abs
1992/9201013.abs
1992/9201014.abs
1992/9201015.abs
1992/9201016.abs
1992/9201017.abs
1992/9201018.abs
1992/9201019.abs
1992/9201020.abs
1992/9201021.abs
1992/9201022.abs
1992/9201023.abs
1992/9201024.abs
1992/9201025.abs
1992/9201026.abs
1992/9201027.abs
1992/9201028.abs
1992/9201029.abs
1992/9201030.abs
1992/9201031.abs
1992/9201032.abs
1992/9201033.abs
1992/9201034.abs
1992/9201035.abs
1992/9201036.abs
1992/9201037.abs
1992/9201038.abs
1992/9201039.abs
1992/9201040.abs
1992/9201041.abs
1992/9201042.abs
1992/9201043.abs
1992/9201044.abs
1992/9201045.abs
1992/9201046.abs
1992/9201047.abs
1992/9201048.abs
1992/9201049.abs
1992/9201050.abs
1992/9201051.abs
1992/9201052.abs
1992/9201053.abs
1992/9201054.abs
1992/9201055.abs
1992/9201056.abs
1992/9201057.abs
1992/9201058.abs
1992/920

In [None]:
# Read the Citation and Dates File
import re
from tqdm import tqdm
data = ''
edges = []
vertices = []
date_dict = {}
vertices_list = [] 
with open('cit-HepTh-dates.txt', 'r') as fp:
    for line in tqdm(fp):
        line = (line.strip())
        if line[0] != '#':
            paper_date = line.split('\t')
            if len(paper_date) == 2:
                date_dict[int(paper_date[0].strip())] = paper_date[1].strip()
            
with open('cit-HepTh.txt', 'r') as fp:
    for line in tqdm(fp):
        # Remove special characters using regex
        cleaned_line = re.sub(r'[^a-zA-Z0-9\s#]', '', line)
        line = (cleaned_line.strip())
        if line[0] != '#':
            edge_vert = line.split('\t')
            if len(edge_vert) == 2:
                v1 = int(edge_vert[0].strip())
                v2 = int(edge_vert[1].strip())
                edges.append({"_from": "PAPERS/" + str(v1) ,
                              "_to": "PAPERS/" + str(v2), "name": "CITES", "edge_collection": "CITES"})
                

38558it [00:00, 884336.78it/s]
352811it [00:00, 359219.01it/s]


In [None]:
delimiters = ",", ";", "and"
regex_pattern = '|'.join(map(re.escape, delimiters))
paper_data = {}
for dirpath, dirnames, filenames in os.walk(os.getcwd()):
    for filename in tqdm([f for f in filenames if '.gz' not in f]):
        if '.abs' in filename:
            # print(f"  {dirpath + '/' + filename}")
            parse_data = {}
            cnt = 0
            abstract = ''
            with open(dirpath + '/' + filename, 'r') as fp:
                for line in fp:
                    line = (line.strip())
                    if line.startswith('\\\\'):
                        cnt += 1
                    if cnt > 1 and cnt < 3:
                        if not line.startswith('\\\\'):
                            abstract += line + ' \n '
                    else:
                        if line.lower().startswith('paper:'):
                            parse_data['_id'] = line[6:].replace('hep-th/', '').strip()
                        if line.lower().startswith('title:'):
                            parse_data['title'] = line[6:].strip()
                        if line.lower().startswith('journal-ref:'):
                            parse_data['journal-ref'] = line[12:].strip()
                        if line.lower().startswith('authors:'):
                            parse_data['authors'] = [r.strip() for r in  re.split(regex_pattern, line[9:].strip())]
                    parse_data['abstract'] = abstract
                    parse_data['cntr'] = cnt
                if '_id' in parse_data:
                    # print(parse_data)
                    paper_data[parse_data['_id']] = parse_data
            
            

100%|██████████| 2/2 [00:00<00:00, 14639.80it/s]
100%|██████████| 2758/2758 [00:00<00:00, 18976.60it/s]
100%|██████████| 2377/2377 [00:00<00:00, 23119.27it/s]
100%|██████████| 3153/3153 [00:00<00:00, 22707.85it/s]
100%|██████████| 2303/2303 [00:00<00:00, 21804.36it/s]
100%|██████████| 3126/3126 [00:00<00:00, 21051.12it/s]
100%|██████████| 2673/2673 [00:00<00:00, 20720.98it/s]
100%|██████████| 2058/2058 [00:00<00:00, 19899.39it/s]
100%|██████████| 2803/2803 [00:00<00:00, 21912.84it/s]
100%|██████████| 2606/2606 [00:00<00:00, 6651.70it/s]
100%|██████████| 1019/1019 [00:00<00:00, 18651.36it/s]
100%|██████████| 3312/3312 [00:00<00:00, 22810.44it/s]
100%|██████████| 1367/1367 [00:00<00:00, 21979.99it/s]


## 2.2 Extract the Topics from the Paper Abstracts

In [12]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama.llms import OllamaLLM

# We are using ollama - OpenAI can also be used
output = """ 
{ "topic_1": "[to be generated]",
"topic_1": "[to be generated]",
"topic_3": "[to be generated]"}
"""
template = """Given a Text, identify only the top 3 topics related to physics. 
Each topic should not be more than 2 words.
Our output should be in JSON fomat as below:
{output}
Text: {question}
Topics: """

prompt = ChatPromptTemplate.from_template(template)

model = OllamaLLM(model="smollm2")

chain = prompt | model



In [None]:
import json
from tqdm import tqdm
cntr = 0
for p in tqdm(paper_data.keys()):
    id = int(paper_data[p]['_id'])
    abstract = paper_data[p]['abstract']
    # print(id, abstract)
    if id in date_dict.keys():
        paper_data[p]['published_date'] = date_dict[id]
    
    try:
        if 'topics' not in paper_data[p]:
            data = chain.invoke({
                "output": output, "question": abstract})
            # print(data)
            topics = json.loads(data)
            t = []
            if 'topic_1' in topics:
                t.append(topics['topic_1'].strip())
            if 'topic_2' in topics:
                t.append(topics['topic_2'].strip())
            if 'topic_3' in topics:
                t.append(topics['topic_3'].strip())

            paper_data[p]['topics'] = t
    except Exception as e:
        # print(id, str(e))
        pass
    
    cntr += 1

    if cntr % 20 == 0:
        with open('paper_data_topics.json', 'w') as fp:
            json.dump(paper_data, fp)
    

In [None]:
paper_data['9201001']

{'abstract': "We study algebraic aspects of Kontsevich integrals \n as generating functions for intersection theory over moduli space \n and review the derivation of Virasoro and KdV constraints. \n 1. Intersection numbers \n 2. The Kontsevich integral \n 2.1. The main theorem \n 2.2 Expansion of Z on characters and Schur functions \n 2.3 Proof of the first part of the Theorem \n 3. From Grassmannians to KdV \n 4. Matrix Airy equation and Virasoro highest weight conditions \n 5. Genus expansion \n 6. Singular behaviour and Painlev'e equation. \n 7. Generalization to higher degree potentials \n ",
 'cntr': 3,
 '_id': '9201001',
 'title': 'Combinatorics of the Modular Group II: the Kontsevich integrals',
 'authors': ['C. Itzykson', 'J.-B. Zuber'],
 'journal-ref': 'Int.J.Mod.Phys. A7 (1992) 5661-5705',
 'topics': ['intersection numbers',
  'Kontsevich integral',
  'intersecting functions']}

In [7]:
def to_title_case(text):
    # Split the string into words
    words = text.split()
    
    # Capitalize the first letter of each word
    title_case_words = [word.capitalize() for word in words]
    
    # Join the words back into a single string
    return ' '.join(title_case_words)

In [None]:
all_vertices = []
all_edges = []
for p in tqdm(paper_data.keys()):
    paper = paper_data[p]
    # print(paper)
    dt = ''
    jref = ''
    if 'published_date' in paper:
        dt = paper['published_date']
    if 'journal-ref' in paper:
        jref = paper['journal-ref']
    all_vertices.append({
        "name": "PAPERS",
        "collection": "PAPERS",
        "_id": "PAPERS/" + p,
        "_key": p,
        "paper_id": p,
        "abstract": paper['abstract'],
        'title': paper['title'].replace(':', '-'),
        'journal-ref': jref,
        'published_date': dt
    })
    if 'authors' in paper:
        for auth in paper['authors']:
            author_name_org = to_title_case(auth.strip())
            author_name = re.sub('[^a-zA-Z0-9]+', '', str(author_name_org))
            if len(author_name) > 0:
                all_vertices.append({
                "name": "AUTHOR",
                "collection": "AUTHOR",
                "_id": "AUTHOR/" + author_name.upper(),
                "_key": author_name.upper(),
                "author_name": author_name_org,
                })
                all_edges.append({"_from": "PAPERS/" + p, "_to": "AUTHOR/" + author_name.upper(), 
                                  "name": "AUTHORED_BY", "edge_collection": "AUTHORED_BY", "published_date": dt})
    current_topics = []
    if 'topics' in paper:
        for topic in paper['topics']:
            if 'to be generated' not in topic and len(topic.strip()) > 0:
                topic_name_org = to_title_case(topic.strip())
                topic_name = re.sub('[^a-zA-Z0-9]+', '', str(topic_name_org))

                if len(topic_name) > 0:
                    current_topics.append(topic_name)
        if len(current_topics) > 0:
            for topic_name in current_topics:
                all_vertices.append({
                "name": "TOPIC",
                "collection": "TOPIC",
                "_id": "TOPIC/" + topic_name.upper(),
                "_key": topic_name.upper(),
                "topic_name": topic_name_org
                })
                all_edges.append({"_from": "PAPERS/" + p, "_to": "TOPIC/" + topic_name.upper(),
                                   "name": "HAS_TOPIC","edge_collection": "HAS_TOPIC",})
                # for auth in paper['authors']:
                #     author_name = to_title_case(auth.strip())
                #     all_edges.append({"_from": author_name, "_to": topic_name, "name": "WRITES_ABOUT"})


100%|██████████| 29555/29555 [00:00<00:00, 32217.56it/s]


In [9]:
# Add the Citation Edges
print('Before: ', len(all_edges))
all_edges.extend(edges)
print('After:', len(all_edges))

Before:  131041
After: 483848


In [33]:
# Import to Arango DB
# Create Collections, Delete if exists
import pandas as pd
from tqdm import tqdm

vertices_df = pd.DataFrame(all_vertices)

edges_df = pd.DataFrame(all_edges)

vertice_names = list(vertices_df['collection'].unique())
edge_names  = list(edges_df['edge_collection'].unique())
print(vertice_names, edge_names)

for collection_name in vertice_names:
    db.delete_collection(collection_name, ignore_missing=True)
    db.create_collection(collection_name, edge=False )
for collection_name in edge_names:
    db.delete_collection(collection_name, ignore_missing=True)
    db.create_collection(collection_name, edge=True)

edges_df = edges_df.fillna('')
vertices_df = vertices_df.fillna('')
vertices_df['_key'].unique(), vertices_df['_id'].unique()


['PAPERS', 'AUTHOR', 'TOPIC'] ['AUTHORED_BY', 'HAS_TOPIC', 'CITES']


(array(['9201001', 'CITZYKSON', 'JBZUBER', ..., '0304269', '0304270',
        '0304271'], dtype=object),
 array(['PAPERS/9201001', 'AUTHOR/CITZYKSON', 'AUTHOR/JBZUBER', ...,
        'PAPERS/0304269', 'PAPERS/0304270', 'PAPERS/0304271'], dtype=object))

In [None]:

for collection_name in tqdm(vertice_names):
    data = pd.DataFrame(vertices_df[vertices_df['edge_collection'] == collection_name]).drop('name', axis = 1)
    # print(data.head())
    if collection_name == 'PAPERS':
        data = data[['_id', '_key', 'paper_id', 'abstract', 
                     'title', 'journal-ref', 'published_date', 'collection' ]]
    elif collection_name == 'AUTHOR':
        data = data[['_id', '_key', 'author_name', 'collection']]
    elif collection_name == 'TOPIC':
        data = data[['_id', '_key', 'topic_name', 'collection']]
         

    docs = data.to_dict(orient='records')
    print(docs[0])
    db.collection(collection_name).import_bulk(docs, batch_size=5000, on_duplicate="ignore", halt_on_error=False)
    

  0%|          | 0/3 [00:00<?, ?it/s]

{'_id': 'PAPERS/9201001', '_key': '9201001', 'paper_id': '9201001', 'abstract': "We study algebraic aspects of Kontsevich integrals \n as generating functions for intersection theory over moduli space \n and review the derivation of Virasoro and KdV constraints. \n 1. Intersection numbers \n 2. The Kontsevich integral \n 2.1. The main theorem \n 2.2 Expansion of Z on characters and Schur functions \n 2.3 Proof of the first part of the Theorem \n 3. From Grassmannians to KdV \n 4. Matrix Airy equation and Virasoro highest weight conditions \n 5. Genus expansion \n 6. Singular behaviour and Painlev'e equation. \n 7. Generalization to higher degree potentials \n ", 'title': 'Combinatorics of the Modular Group II- the Kontsevich integrals', 'journal-ref': 'Int.J.Mod.Phys. A7 (1992) 5661-5705', 'published_date': '', 'collection': 'PAPERS'}


 33%|███▎      | 1/3 [05:10<10:21, 310.54s/it]

{'_id': 'AUTHOR/CITZYKSON', '_key': 'CITZYKSON', 'author_name': 'C. Itzykson', 'collection': 'AUTHOR'}


 67%|██████▋   | 2/3 [05:20<02:13, 133.81s/it]

{'_id': 'TOPIC/INTERSECTIONNUMBERS', '_key': 'INTERSECTIONNUMBERS', 'topic_name': 'Intersecting Functions', 'collection': 'TOPIC'}


100%|██████████| 3/3 [05:36<00:00, 112.10s/it]


In [37]:
# Import the Edges
for collection_name in tqdm(edge_names):
    data = pd.DataFrame(edges_df[edges_df['edge_collection'] == collection_name]).drop('name', axis = 1)
    docs = data.to_dict(orient='records')
    print(collection_name, docs[0])
    db.collection(collection_name).import_bulk(docs, batch_size=5000, on_duplicate="ignore", halt_on_error=False)


  0%|          | 0/3 [00:00<?, ?it/s]

AUTHORED_BY {'_from': 'PAPERS/9201001', '_to': 'AUTHOR/CITZYKSON', 'edge_collection': 'AUTHORED_BY', 'published_date': ''}


 33%|███▎      | 1/3 [00:16<00:33, 16.91s/it]

HAS_TOPIC {'_from': 'PAPERS/9201001', '_to': 'TOPIC/INTERSECTIONNUMBERS', 'edge_collection': 'HAS_TOPIC', 'published_date': ''}


 67%|██████▋   | 2/3 [00:37<00:19, 19.08s/it]

CITES {'_from': 'PAPERS/1001', '_to': 'PAPERS/9304045', 'edge_collection': 'CITES', 'published_date': ''}


100%|██████████| 3/3 [02:20<00:00, 46.78s/it]


In [46]:
edge_definitions = [
            {
                "edge_collection": "AUTHORED_BY",
                "from_vertex_collections": ["PAPERS"],
                "to_vertex_collections": ["AUTHOR"]
            },
            {
                "edge_collection": "HAS_TOPIC",
                "from_vertex_collections": ["TOPIC"],
                "to_vertex_collections": ["PAPERS"]
            },
            {
                "edge_collection": "CITES",
                "from_vertex_collections": ["PAPERS"],
                "to_vertex_collections": ["PAPERS"]
            }
]
db.delete_graph('Arxiv', ignore_missing=True, drop_collections=False)
db.create_graph('Arxiv', edge_definitions)

<Graph Arxiv>

In [10]:
# Fetch the data into a NetworkX Graph and pre-fetch the nodes and edges
G_adb = nxadb.MultiDiGraph(name="Arxiv", db=db)
print(G_adb)

[07:17:53 +0000] [INFO]: Graph 'Arxiv' exists.
[07:17:53 +0000] [INFO]: Default node type set to 'AUTHOR'


MultiDiGraph named 'Arxiv' with 70359 nodes and 483848 edges


## 2.3 Generating Vector Embeddings for Semantic Search


In [11]:
from langchain_openai import OpenAIEmbeddings

from langchain_ollama import OllamaEmbeddings

# Alternatively you can use Ollama locally
# embeddings = OllamaEmbeddings(
#     model="snowflake-arctic-embed:110m",
# )

embeddings = OpenAIEmbeddings(
    model="text-embedding-ada-002"
)

In [12]:
index_document = faiss.IndexFlatL2(len(embeddings.embed_query("hello world")))

vector_store_document = FAISS(
    embedding_function=embeddings,
    index=index_document,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)

In [13]:
from uuid import uuid4

from langchain_core.documents import Document

documents = []
for n in all_vertices:
    
    if n['name'] == 'PAPERS':
        # print(n)
        des = f'{n['_id']}||' + n['abstract']
        document_1 = Document(
        page_content=des[0:512],
        metadata={"source": n['_id'], 'type': n['collection']},
        )
        documents.append(document_1)
        
# documents
uuids = [str(uuid4()) for _ in range(len(documents))]
vector_store_document.add_documents(documents=documents, ids=uuids)

['c3ab444b-07a1-4bb2-9db8-678cbdbe3fae',
 '82c704a1-64e1-4c9f-ba3c-e968526e825d',
 '67f31bba-670c-4125-86e8-43aec3fcd2d5',
 '29aa6b41-0097-4a86-9884-7bf0ceeb1033',
 '4f46f71d-df38-4142-88c2-d137aa25b618',
 '62b0d7e8-5bad-47be-9abf-b10990ab6ef8',
 '5fb3c325-8889-47b3-9c25-5940022478af',
 '3449fd62-c3be-4455-afc7-cfedea1d8281',
 'e00efaff-b3b2-4701-8215-b1b1bea72679',
 'c4666c08-2f8a-42c8-9629-92f0096022a4',
 '63be38f4-5244-4808-9336-501fe75979c7',
 '5e28ba30-ee3e-4d2a-936e-1ddce33a1827',
 '62fa351c-406f-4c80-ab41-08efce6fb9ad',
 '3ad9f676-fe8a-47a0-8095-ab4dfd412fcb',
 '72a3415d-9b12-42c4-ba93-b501a4ba683c',
 'fe965587-faaa-458c-a377-4d9467a8f7e5',
 '33432950-d562-4af4-84d9-57f0fbd17a29',
 'b32cb2a5-6d49-4112-b2ff-d79174238a33',
 'a696c3ce-f206-4225-a651-d09c17d98c0d',
 'd90cc0eb-cc0a-431c-be07-38860286925e',
 '1c598792-e03e-4b42-8c6e-efd0f4be3bc0',
 '312b2da8-24fb-4aba-a20b-2930acb3b446',
 '59ed928f-45cc-407b-8454-62de3069cb48',
 'cd73f11b-0789-415b-acc3-50ae96f18a5e',
 '41f99131-e848-

In [14]:

vector_store_document.search('String Theory', search_type='similarity')

[Document(id='fee7dbd3-0813-4fe8-883a-cbe44e90a165', metadata={'source': 'PAPERS/9810240', 'type': 'PAPERS'}, page_content='PAPERS/9810240||This is the written version of a set of introductory lectures on string \n theory. \n '),
 Document(id='23b27ed2-4758-499c-924c-64048f5d1240', metadata={'source': 'PAPERS/0008217', 'type': 'PAPERS'}, page_content='PAPERS/0008217||String theory is accused by some of its critics to be a purely abstract \n mathematical discipline, having lost the contact to the simple yet deeply \n rooted questions which physics provided until the beginning of this century. We \n argue that, in contrary, there are indications that string theory might be \n linked to a fundamental principle of a quantum computational character. In \n addition, the nature of this principle might be capable to provide some new \n insight into the question of uni'),
 Document(id='814c1d4a-69bd-48be-a51c-188760ca5b46', metadata={'source': 'PAPERS/9612192', 'type': 'PAPERS'}, page_content='

In [18]:
index_nodes = faiss.IndexFlatL2(len(embeddings.embed_query("hello world")))

vector_store_nodes = FAISS(
    embedding_function=embeddings,
    index=index_nodes,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)

from uuid import uuid4

from langchain_core.documents import Document
doc_list = []
documents = []
for n in all_vertices:
    # print(n)
    if n['name'] == 'AUTHOR':
        des = n['author_name'] + " - " + n['_id']
    elif n['name'] == 'TOPIC':
        des = n['topic_name'] + ' - ' + n['_id']
    if des not in doc_list:
        document_1 = Document(
            page_content=des[0:512],
            metadata={"source": n['_id'], 'type': n['name']},
        )
        documents.append(document_1)
        doc_list.append(des)

len(documents)

78023

In [21]:
uuids = [str(uuid4()) for _ in range(len(documents))]

vector_store_nodes.add_documents(documents=documents, ids=uuids)

['7c7b0068-4c73-49d9-b0da-50784cb95f88',
 '340db6f4-36b0-41a3-8541-5f6f95aa28f5',
 'af91d332-8ac3-4254-b54f-ef03747ee91c',
 '2c5a0dcd-a8ef-46e7-a224-f53eabb7ec27',
 '6ffadaf5-f1f2-4145-a279-48821bf8deeb',
 'df5c2056-1d1e-44af-8b0c-a495bae100fb',
 '6eca3c7d-e3f4-4e12-a481-5d9a9ebbb209',
 '8f485b14-4cd6-48dc-8ef5-4346a17bb39d',
 'c54f01d6-a784-44f9-9e72-6fba50ebea4d',
 'fc8757d9-810e-4ff5-b052-15ef4b26ce01',
 '5fd4c98e-af9e-4e48-b594-00262b0356f6',
 'e7071bf4-5d23-47e6-8b5d-570a37c39527',
 '440d0563-3c22-447a-904d-f1205b9888c5',
 'eccfc31e-bb0d-4f68-82b9-912ea53d07bb',
 'cd288f86-d16d-440a-86ea-3b6c243b0291',
 '673f6315-8c59-4793-b8bc-76b019c64bda',
 '541f991a-ab8c-4c61-b7fe-48462dcc873e',
 '0a40fb40-e5f7-4fa0-bf0d-3789ec35ca98',
 '8c060fa7-f101-4650-a283-9028723a86c2',
 '83929430-71dd-4546-99b5-7f28a5d4111e',
 '89efb317-f509-4d9c-ae58-f2ea11446125',
 '1397c187-b546-4b9a-a8d9-5eeaf738a0ca',
 '115d80d8-f239-4c27-9b77-1ff5b6a7defe',
 'cfdc2e06-1911-48ce-ba6e-b7fbea25f7f6',
 'c5222c02-4473-

In [22]:
def stop_word_remover(text):
    word_tokens = word_tokenize(text)
    word_list = [re.sub('[^A-Za-z0-9]+', '', word) for word in word_tokens if word.lower() not in stopwords]
    return  " ".join(word_list).strip()

retriever = vector_store_nodes.as_retriever(search_type="similarity", search_kwargs={"k": 6,'filter': {'type':'TOPIC'}})
q = stop_word_remover('String Theory')
print(q)
retriever.invoke(q)

String Theory


[Document(id='85000566-6180-4763-a665-cbeb2abb253c', metadata={'source': 'TOPIC/MTHEORY', 'type': 'TOPIC'}, page_content='String Theory - TOPIC/MTHEORY'),
 Document(id='9176e137-2e93-4031-bc91-2e3f8c18e30f', metadata={'source': 'TOPIC/SUNTHEORY', 'type': 'TOPIC'}, page_content='String Theory - TOPIC/SUNTHEORY'),
 Document(id='dff4b8fd-5460-4266-9b7c-7597cb5007c1', metadata={'source': 'TOPIC/TWODIMENSIONAL', 'type': 'TOPIC'}, page_content='String Theory - TOPIC/TWODIMENSIONAL'),
 Document(id='c8f441b2-7250-4493-8803-f397dbaf3e4e', metadata={'source': 'TOPIC/I', 'type': 'TOPIC'}, page_content='String Theory - TOPIC/I'),
 Document(id='10ee925c-fe6c-49d3-b98b-6faeee50bf6c', metadata={'source': 'TOPIC/STRINGTHEORY', 'type': 'TOPIC'}, page_content='String Theory - TOPIC/STRINGTHEORY'),
 Document(id='6183509c-d1f8-466e-ba7c-62b03a5a5658', metadata={'source': 'TOPIC/STRING', 'type': 'TOPIC'}, page_content='String Theory - TOPIC/STRING')]

In [None]:
vector_store_document.save_local("scholar_faiss_index")
vector_store_nodes.save_local("scholar_faiss_index_nodes")

# 3. Connect to Graph


In [3]:
# 1. Re-connect to the same Graph

G_adb = nxadb.MultiDiGraph(name="Arxiv", db=db,
                            read_parallelism=20, read_batch_size=500000)
# Pre-Fetch all data
# This will load all nodes
nodes = list(G_adb.nodes(data=True))

# This will load all edges Nodes:  

edges = list(G_adb.edges(data=True))
print('Nodes: ', len(nodes), '; Edges: ', len(edges))

[04:33:58 +0000] [INFO]: Graph 'Arxiv' exists.
[04:33:58 +0000] [INFO]: Default node type set to 'AUTHOR'


{'vertex_collections': [{'name': 'AUTHOR', 'fields': []}, {'name': 'PAPERS', 'fields': []}, {'name': 'TOPIC', 'fields': []}], 'edge_collections': [], 'database_config': {'endpoints': ['https://a02eeb6a1295.arangodb.cloud:8529'], 'database': '_system', 'username': 'root', 'password': 'mnnqy0EBZqib4PHiDel8'}, 'load_config': {'parallelism': 20, 'batch_size': 500000, 'prefetch_count': 5, 'load_all_vertex_attributes': True, 'load_all_edge_attributes': False}}
{'load_adj_dict': False, 'load_coo': False, 'is_directed': False, 'is_multigraph': False, 'symmetrize_edges_if_directed': False}
{'vertex_collections': [{'name': 'AUTHOR', 'fields': []}, {'name': 'PAPERS', 'fields': []}, {'name': 'TOPIC', 'fields': []}], 'edge_collections': [{'name': 'HAS_TOPIC', 'fields': []}, {'name': 'AUTHORED_BY', 'fields': []}, {'name': 'CITES', 'fields': []}], 'database_config': {'endpoints': ['https://a02eeb6a1295.arangodb.cloud:8529'], 'database': '_system', 'username': 'root', 'password': 'mnnqy0EBZqib4PHiDel8

In [4]:
# 2. Define the llm object
# We use GPT-4o-mini or GPT-4o

llm = ChatOpenAI(temperature=0, model_name="gpt-4o")
arango_graph = ArangoGraph(db)


### Define a Semantic Search Graph

In [7]:
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict
from uuid import uuid4

from langchain_core.documents import Document

from langchain_openai import OpenAIEmbeddings

# from langchain_ollama import OllamaEmbeddings

# embeddings = OllamaEmbeddings(
#     model="snowflake-arctic-embed:110m",
# )

embeddings = OpenAIEmbeddings(
    model="text-embedding-ada-002"
)

vector_store = FAISS.load_local(
    "scholar_faiss_index", embeddings, allow_dangerous_deserialization=True
)
vector_store_document = vector_store

vector_store_nodes = FAISS.load_local(
    "scholar_faiss_index_nodes", embeddings, allow_dangerous_deserialization=True
)

retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 6})
retriever_nodes = vector_store_nodes.as_retriever(search_type="similarity", search_kwargs={"k": 6})


rag_prompt = hub.pull("rlm/rag-prompt")

def stop_word_remover(text):
    word_tokens = word_tokenize(text)
    word_list = [re.sub('[^A-Za-z0-9]+', '', word) for word in word_tokens if word.lower() not in stopwords]
    return  " ".join(word_list).strip()
# Define state for application
class State(TypedDict):
    question: str
    context: List[Document]
    nearby_nodes: str
    answer: str



# Define application steps
def retrieve(state: State):
    query = stop_word_remover(state["question"])

    retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 6,'filter': {'type':'PAPERS'}})
    retrieved_docs = retriever.invoke(query)


    # print(len(retrieved_docs))
    nearby_nodes = []
    for r in retrieved_docs:
        try:
            node_id = (r.metadata['source'])
            # print(node_id)
            aql = f""" FOR paper IN PAPERS
                FILTER paper._id == "{node_id}"

                LET authors = (
                    FOR v, e IN 1..1 OUTBOUND paper._id GRAPH 'Arxiv'
                    FILTER e.edge_collection == 'AUTHORED_BY'
                    RETURN {{
                        format: CONCAT(paper._id, " -> ", e.edge_collection, " -> ", v._id)
                    }}
                )

                LET topics = (
                    FOR v, e IN 1..1 OUTBOUND paper._id GRAPH 'Arxiv'
                    FILTER e.edge_collection == 'HAS_TOPIC'
                    RETURN {{
                        format: CONCAT(paper._id, " -> ", e.edge_collection, " -> ", v._id)
                    }}
                )
                RETURN {{
                    format: APPEND(authors, topics)
                }}
            """
            cursor = db.aql.execute(aql)
            # Print the results
            for doc in cursor:
                for f in (doc['format']):
                    nearby_nodes.extend(f['format'].split('\n'))
        except Exception as e:
            print(str(e))
            pass
    nearby_nodes = sorted(nearby_nodes)

    return {"context": retrieved_docs, 'nearby_nodes': '\n'.join(nearby_nodes)}


def generate(state: State):
    docs_content = "\n\n".join(doc.page_content.strip() for doc in state["context"])
    rag_content = f""" **Information from Documents:
{docs_content}
** Information from Nearby Nodes of Data:
{state['nearby_nodes']}"""
    print(rag_content)
    messages = rag_prompt.invoke({"question": state["question"], "context": rag_content })
    response = llm.invoke(messages)
    return {"answer": response.content}


# Compile application and test
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()
response = graph.invoke({"question": "What are the papers in String Theory.?"})
response['answer']



 **Information from Documents:
PAPERS/0008217||String theory is accused by some of its critics to be a purely abstract 
 mathematical discipline, having lost the contact to the simple yet deeply 
 rooted questions which physics provided until the beginning of this century. We 
 argue that, in contrary, there are indications that string theory might be 
 linked to a fundamental principle of a quantum computational character. In 
 addition, the nature of this principle might be capable to provide some new 
 insight into the question of uni

PAPERS/0011065||What are strings made of? The possibility is discussed that strings are 
 purely mathematical objects, made of logical axioms. More precisely, proofs in 
 simple logical calculi are represented by graphs that can be interpreted as the 
 Feynman diagrams of certain large-N field theories. Each vertex represents an 
 axiom. Strings arise, because these large-N theories are dual to string 
 theories. These ``logical quantum field theories

'The papers in string theory include discussions on its potential link to quantum computational principles (PAPERS/0008217), the idea of strings as mathematical objects made of logical axioms (PAPERS/0011065), and the thermodynamical aspects of string theory (PAPERS/9911004). Additionally, there are introductory lectures on string theory (PAPERS/9810240) and summaries of talks explaining the relationship between string theory, M-theory, and K-theory (PAPERS/0012007).'

In [8]:
graph_schema = { 'Graph Schema': [g for g in arango_graph.schema['Graph Schema'] if g['graph_name'] == 'Arxiv'], \
'Collection Schema': [g for g in arango_graph.schema['Collection Schema'] \
                        if g['collection_name'] in ['AUTHORED_BY', 'CITES', 'HAS_TOPIC',
                                                    'PAPERS', 'AUTHOR', 'TOPIC' ]]}
graph_schema

{'Graph Schema': [{'graph_name': 'Arxiv',
   'edge_definitions': [{'edge_collection': 'AUTHORED_BY',
     'from_vertex_collections': ['PAPERS'],
     'to_vertex_collections': ['AUTHOR']},
    {'edge_collection': 'CITES',
     'from_vertex_collections': ['PAPERS'],
     'to_vertex_collections': ['PAPERS']},
    {'edge_collection': 'HAS_TOPIC',
     'from_vertex_collections': ['TOPIC'],
     'to_vertex_collections': ['PAPERS']}]}],
 'Collection Schema': [{'collection_name': 'CITES',
   'collection_type': 'edge',
   'edge_properties': [{'name': '_key', 'type': 'str'},
    {'name': '_id', 'type': 'str'},
    {'name': '_from', 'type': 'str'},
    {'name': '_to', 'type': 'str'},
    {'name': '_rev', 'type': 'str'},
    {'name': 'edge_collection', 'type': 'str'},
    {'name': 'published_date', 'type': 'str'}],
   'example_edge': {'_key': '12271432',
    '_id': 'CITES/12271432',
    '_from': 'PAPERS/1001',
    '_to': 'PAPERS/9304045',
    '_rev': '_jVkNB7u---',
    'edge_collection': 'CITES',


In [9]:
# Context Manager
# To Store Outputs across Function calls
class ContextManager:
    def __init__(self):
        self.context = {}

    def clear(self):
        self.context = {}

    def store_context(self, key, value):
        self.context[key] = value

    def get_context(self, key):
        return self.context.get(key, "No context available")

context_manager = ContextManager()


In [10]:
# Modified AQL Template, to be used after NetworkX Call
AQL_GENERATION_TEMPLATE = """Task: Generate an ArangoDB Query Language (AQL) query from a User Input and Results of a NetworkX Algorithm.

You are an ArangoDB Query Language (AQL) expert responsible for translating a `User Input` and NetworkX Algorithm Results into an ArangoDB Query Language (AQL) query.

You are given an `ArangoDB Schema`. It is a JSON Object containing:
1. `Graph Schema`: Lists all Graphs within the ArangoDB Database Instance, along with their Edge Relationships.
2. `Collection Schema`: Lists all Collections within the ArangoDB Database Instance, along with their document/edge properties and a document/edge example.

You are also given a `NetworkX Algorithm Results`. It will be a dictionary which has been calculated previously. You need to use that to create the AQL.
You may also be given a set of `AQL Query Examples` to help you create the `AQL Query`. If provided, the `AQL Query Examples` should be used as a reference, similar to how `ArangoDB Schema` should be used.

Things you should do:
- Think step by step.
- Rely on `ArangoDB Schema`, `NetworkX Algorithm Results` and `AQL Query Examples` (if provided) to generate the query.
- Begin the `AQL Query` by the `WITH` AQL keyword to specify all of the ArangoDB Collections required.
- Return the `AQL Query` wrapped in 3 backticks (```).
- Use only the provided relationship types and properties in the `ArangoDB Schema` and any `AQL Query Examples` queries.
- Only answer to requests related to generating an AQL Query.
- If a request is unrelated to generating AQL Query, say that you cannot help the user.

Things you should not do:
- Do not use any properties/relationships that can't be inferred from the `ArangoDB Schema` or `NetworkX Algorithm Results` or the `AQL Query Examples`.
- Do not include any text except the generated AQL Query.
- Do not provide explanations or apologies in your responses.
- Do not generate an AQL Query that removes or deletes any data.

Under no circumstance should you generate an AQL Query that deletes any data whatsoever.

ArangoDB Schema:
{adb_schema}


AQL Query Examples (Optional):
{aql_examples}

User Input:
{user_input}

AQL Query:
"""

In [11]:
def get_context(query: str):

    retrieved_docs = ["# Example Nodes:\n*Important*: Only use Examples for Filtering the nodes. Do not use the data to create new Graphs\n\n## Here are few examples of related Paper Nodes. Format: Paper ID - _id  \n"]
    retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 6,'filter': {'type':'PAPERS'}})
    city_nodes = retriever.invoke(query)
    retrieved_docs.extend([r.metadata['source'] + ' - ' + r.metadata['source'] for r in city_nodes])
    # print(retrieved_docs)
    retrieved_docs.extend(["\n## Here are few examples of related Author Nodes. Format: Author Name - _id  \n"])
    retriever = vector_store_nodes.as_retriever(search_type="similarity", search_kwargs={"k": 3,'filter': {'type':'AUTHOR'}})
    retrieved_docs.extend([r.page_content for r in retriever.invoke(query)])
    retrieved_docs.extend(["\n## Here are few examples of related Topic Nodes. Format: Topic Name - _id  \n"])
    retriever = vector_store_nodes.as_retriever(search_type="similarity", search_kwargs={"k": 3,'filter': {'type':'TOPIC'}})
    retrieved_docs.extend([r.page_content for r in retriever.invoke(query)])
    # print(len(retrieved_docs))
    nearby_nodes = []
    for r in city_nodes:
        try:
            node_id = (r.metadata['source'])
            # print(node_id)
            aql = f""" FOR paper IN PAPERS
            FILTER paper._id == "{node_id}"

            LET authors = (
                FOR v, e IN 1..1 OUTBOUND paper._id GRAPH 'Arxiv'
                FILTER e.edge_collection == 'AUTHORED_BY'
                RETURN {{
                    format: CONCAT(paper._id, " -> ", e.edge_collection, " -> ", v._id)
                }}
            )

            LET topics = (
                FOR v, e IN 1..1 OUTBOUND paper._id GRAPH 'Arxiv'
                FILTER e.edge_collection == 'HAS_TOPIC'
                RETURN {{
                    format: CONCAT(paper._id, " -> ", e.edge_collection, " -> ", v._id)
                }}
            )
            RETURN {{
                format: APPEND(authors, topics)
            }}

            """
            cursor = db.aql.execute(aql)
            # Print the results
            for doc in cursor:
                for f in (doc['format']):
                    nearby_nodes.extend(f['format'].split('\n'))
        except Exception as e:
            # print(str(e))
            pass
    nearby_nodes = sorted(nearby_nodes)
    retrieved_docs.append('\n# Here are some nearby Nodes for reference:\n')
    retrieved_docs.extend(nearby_nodes)
    context = '\n'.join(retrieved_docs)
    print(context)
    return context
get_context('Papers linking String Theory and Quantum Theory ')

# Example Nodes:
*Important*: Only use Examples for Filtering the nodes. Do not use the data to create new Graphs

## Here are few examples of related Paper Nodes. Format: Paper ID - _id  

PAPERS/9610085 - PAPERS/9610085
PAPERS/0008217 - PAPERS/0008217
PAPERS/9911004 - PAPERS/9911004
PAPERS/9803179 - PAPERS/9803179
PAPERS/9810149 - PAPERS/9810149
PAPERS/9807195 - PAPERS/9807195

## Here are few examples of related Author Nodes. Format: Author Name - _id  


## Here are few examples of related Topic Nodes. Format: Topic Name - _id  

Quantum String Theory - TOPIC/STRINGTHEORY
Fusion Of String Theories - TOPIC/QUANTUMMECHANICS
Quantum String Theory - TOPIC/QUANTUMSTRINGTHEORY

# Here are some nearby Nodes for reference:

PAPERS/0008217 -> AUTHORED_BY -> AUTHOR/KARLGEORGSCHLESINGER
PAPERS/0008217 -> HAS_TOPIC -> TOPIC/QUANTUMMECHANICS
PAPERS/0008217 -> HAS_TOPIC -> TOPIC/STRINGTHEORY
PAPERS/0008217 -> HAS_TOPIC -> TOPIC/UNIVERSALITY
PAPERS/9610085 -> AUTHORED_BY -> AUTHOR/SERGIOFERRARA
P

'# Example Nodes:\n*Important*: Only use Examples for Filtering the nodes. Do not use the data to create new Graphs\n\n## Here are few examples of related Paper Nodes. Format: Paper ID - _id  \n\nPAPERS/9610085 - PAPERS/9610085\nPAPERS/0008217 - PAPERS/0008217\nPAPERS/9911004 - PAPERS/9911004\nPAPERS/9803179 - PAPERS/9803179\nPAPERS/9810149 - PAPERS/9810149\nPAPERS/9807195 - PAPERS/9807195\n\n## Here are few examples of related Author Nodes. Format: Author Name - _id  \n\n\n## Here are few examples of related Topic Nodes. Format: Topic Name - _id  \n\nQuantum String Theory - TOPIC/STRINGTHEORY\nFusion Of String Theories - TOPIC/QUANTUMMECHANICS\nQuantum String Theory - TOPIC/QUANTUMSTRINGTHEORY\n\n# Here are some nearby Nodes for reference:\n\nPAPERS/0008217 -> AUTHORED_BY -> AUTHOR/KARLGEORGSCHLESINGER\nPAPERS/0008217 -> HAS_TOPIC -> TOPIC/QUANTUMMECHANICS\nPAPERS/0008217 -> HAS_TOPIC -> TOPIC/STRINGTHEORY\nPAPERS/0008217 -> HAS_TOPIC -> TOPIC/UNIVERSALITY\nPAPERS/9610085 -> AUTHORED_

In [12]:
# 4. Define all the Tools, including Re-try Mechanism
# Reference: https://python.langchain.com/docs/integrations/graphs/arangodb/
# Reference: https://python.langchain.com/api_reference/community/chains/langchain_community.chains.graph_qa.arangodb.ArangoGraphQAChain.html
from langchain_core.prompts.prompt import PromptTemplate

@tool
def semantic_search(query: str):
    """This tool is available to search Abstracts of the Papers.
    Do not use this Tool if you need AQL or Graph Networkx algorithm

      **Args:**
    - query: Query Asked by the User
    """
    response = graph.invoke({"question": query})
    return response['answer']

@tool
def text_to_aql(query: str):
    """This tool is available to invoke the
    ArangoGraphQAChain object, which enables you to
    translate a Natural Language Query into AQL and execute the AQL.

    Use this tool only when you need to execute a AQL Query to retrieve data from Graph Database.

    **Args:**
    - query: Query Asked by the User
    """
    try:
        context = get_context(query)
        llm = ChatOpenAI(temperature=0, model_name="gpt-4o")


        chain = ArangoGraphQAChain.from_llm(
            llm=llm,
            graph=arango_graph,
            verbose=True,
            allow_dangerous_requests=True,
            return_aql_query=True,
            return_aql_result = True
        )
        chain.return_aql_result = True



        result = chain.invoke(context + "\n\nQuery:\n" + query)

        context_manager.store_context("aql_query", result["aql_query"])
        context_manager.store_context("aql_result", result["aql_result"])
        context_manager.store_context("aql_input", query)
        
        return 'Data Stored in Memory. Ensure to set `use_aql_result` to True for further processing when you call `text_to_nx_algorithm_to_text` to use the output of text to AQL Tool for further processing.'
    except Exception as e:
        return f'An error occurred - {str(e)}. You can re-try the tool with more detailed input'

@tool
def text_to_aql_to_text(query: str):
    """This tool is available to invoke the
    ArangoGraphQAChain object, which enables you to
    translate a Natural Language Query into AQL, execute
    the query, and translate the result back into Natural Language.

    Use this tool only when you need to execute a single AQL Query to revert to the user

    **Args:**
    - query: Query Asked by the User
    """
    try:
        llm = ChatOpenAI(temperature=0, model_name="gpt-4o")


        context = get_context(query)

        llm = ChatOpenAI(temperature=0, model_name="gpt-4o")


        chain = ArangoGraphQAChain.from_llm(
            llm=llm,
            graph=arango_graph,
            verbose=True,
            allow_dangerous_requests=True,
            return_aql_query=True,
            return_aql_result = True
        )
        chain.return_aql_result = True



        result = chain.invoke(context + "\n\nQuery:\n" + query)

        context_manager.store_context("aql_text_result", result["result"])
        return str(result["result"])
    except Exception as e:
        return f'An error occurred - {str(e)}. You can re-try the tool with more detailed input'


@tool
def text_to_nx_algorithm_to_text(query: str, use_aql_result: bool = False):
    """This tool is available to invoke a NetworkX Algorithm on
    the ArangoDB Graph. You are responsible for accepting the natural
    language query, determining the appropriate NetworkX algorithm to
    execute, running the algorithm on the provided `G_adb` graph, and
    translating the results back into natural language as a concise answer.

    If the query (e.g., traversals, shortest path, etc.) can be solved using
    Arango Query Language, then do not use this tool.

    **Args:**
    - query: Query Asked by the User
    - use_aql_result: Set to True if previously the text_to_aql Tool had been called for retrieving data using AQL Query
                        Set to False if no previous AQL data is available

    """

    try:
        llm = ChatOpenAI(temperature=0, model_name="gpt-4o")
        G_query = G_adb
        context = ''
        if not use_aql_result:
            context = get_context(query)
        aql_result = {}


        base_query = context + f"""I have a NetworkX Graph called `G_query` with the following schema: {graph_schema}
        """


        if use_aql_result:
            print('start')
            aql_result = context_manager.get_context("aql_result")
            aql_query = context_manager.get_context("aql_query")
            aql_input = context_manager.get_context("aql_input")
            base_query =  f"""I have Graph with the following schema: {graph_schema}
for the question `{aql_input}`, I ran the below `AQL Query` and stored the results in a python variable `aql_result`.
AQL Query: 
{aql_query}

`aql_result` = {aql_result}

You NEED TO use the variable `aql_result` directly in your generated python code for further analysis.

        """
        print("base_query: ", base_query)

        ######################
        print("\n\n1) Generating NetworkX code - use_aql_result: " , use_aql_result)

        text_to_nx = llm.invoke(f"""{base_query}

        I need to answer the following graph analysis query: {query}.

        Determine the most precise NetworkX algorithm to answer this query and generate the Python code to do so.
        Think step by step. Only assume that networkx and standard Python libraries are available.
        You must prefer the algorithms supported by nx-cugraph. Use the most suitable backend.
        For community detectio, prefer leiden_communities or louvain_communities
        Only provide python code that I can directly execute via `exec()`. Do not provide any instructions.

        Note:
        - Use the `CITES` edge collection for connections between two papers
        - Use the `HAS_AUTHORED_BY` edge collection for identifying authors of a paper
        - Use the `HAS_TOPIC` edge collection for identifying topics tagged to a paper
        - Always execute the code in try except block.

        Ensure your code:
        - Executes directly with exec().
        - Sets the final answer in a variable called FINAL_RESULT.
        - Assigns FINAL_RESULT a short, concise answer.

        Your code:
        """).content

        text_to_nx_cleaned = re.sub(r"^```python\n|```$", "", text_to_nx, flags=re.MULTILINE).strip()

        print('-'*10)
        print(text_to_nx_cleaned)
        print('-'*10)

        ######################
        print("\n2) Executing NetworkX code")

        global_vars = {"G_query": G_query, "nx": nx, "aql_result": aql_result}
        local_vars = {}

        attempt = 1
        MAX_ATTEMPTS = 3
        last_exception = None

        while attempt <= MAX_ATTEMPTS:
            try:
                exec(text_to_nx_cleaned, global_vars, local_vars)
                # On success, capture the final generated code for later use.
                text_to_nx_final = text_to_nx
                break
            except Exception as e:
                print(f"EXEC ERROR on attempt {attempt}: {e}")
                last_exception = e
                # Ask the LLM to correct the generated code based on the error.
                text_to_nx = llm.invoke(f"""{base_query}
                    I need to answer the following graph analysis query: {query}.

                    I generated the following Python code to answer a NetworkX analysis query:
                    ---
                    {text_to_nx_cleaned}
                    ---
                    When executing this code on the graph `G_query`, it raised the following exception: {e}.
                    Please provide a corrected version of the code that resolves the error.
                    Only provide python code that I can directly execute via `exec()`. Do not provide any instructions or explaination.
                    **Important:** 
                    - Always Enclose the corrected code in a code block ```
                    - Always execute the code in try except block.

                    Your corrected code should execute directly via exec() and assign a concise answer to FINAL_RESULT.
                """).content

                text_to_nx_cleaned = re.sub(r"^```python\n|```$", "", text_to_nx, flags=re.MULTILINE).strip()
                print('-'*10)
                print(text_to_nx_cleaned)
                print('-'*10)

                attempt += 1

        if attempt > MAX_ATTEMPTS and "FINAL_RESULT" not in local_vars:
            return f"EXEC ERROR after {MAX_ATTEMPTS} attempts: {last_exception}"

        print('-'*10)
        FINAL_RESULT = local_vars["FINAL_RESULT"]
        print(f"FINAL_RESULT: {FINAL_RESULT}")
        print('-'*10)
        context_manager.store_context("networkx_result", FINAL_RESULT)

        ######################
        print("3) Formulating final answer")

        nx_to_text = llm.invoke(f"""
            I have a NetworkX Graph called `G_adb` with the following schema: {graph_schema}

            I received the following graph analysis query: {query}.

            I executed the following Python code:
            ---
            {text_to_nx_final}
            ---
            and obtained `FINAL_RESULT` with the value: {FINAL_RESULT}.

            Based on this information, generate a short and concise natural language answer to the query.

            Your response:
        """).content

        return nx_to_text
    except Exception as e:
        print(str(e))
        return f'An error occurred - {str(e)}. You can re-try the tool with more detailed input'


@tool
def use_networkx_result_in_aql(text_query):
    """This tool is available to invoke a AQL Query using the
    ArangoGraphQAChain object after a NetworkX Algorithm has been executed
    using the tool `text_to_nx_algorithm_to_text` on the ArangoDB Graph.

    If the query (e.g., traversals, shortest path, etc.) can be solved using
    Arango Query Language, then do not use this tool.

    **Args:**
    - text_query: Query which needs an answer

    """

    try:
        context = get_context(text_query)
        networkx_result = context_manager.get_context("networkx_result")
        print('networkx_result: ', networkx_result)

        llm = ChatOpenAI(temperature=0, model_name="gpt-4o")



        aql_generation_prompt = PromptTemplate(
        input_variables=["adb_schema", "aql_examples", "user_input", "networkx_result"],
        template=AQL_GENERATION_TEMPLATE,
        )

        chain =  ArangoGraphQAChain.from_llm(
            llm=llm,
            graph=arango_graph,
            aql_generation_prompt=aql_generation_prompt,
            verbose=True,
            allow_dangerous_requests=True
        )
        chain.return_aql_result = True

        result = chain.invoke(context + "\nQuery: \n" + text_query)
        result = chain.invoke({"user_input": f"None\n\nNetworkX Algorithm Results: \n{networkx_result}",
                                "query": context + "\nQuery: \n" + text_query })

        context_manager.store_context("networkx_aql_result", result["result"])
        return str(result["result"])
    except Exception as e:
        return f'An error occurred - {str(e)}. You can re-try the tool with more detailed input'



In [13]:
# 6. Create the Agentic Application

tools = [semantic_search, text_to_aql_to_text, text_to_nx_algorithm_to_text, text_to_aql, use_networkx_result_in_aql]
prompt = hub.pull("hwchase17/react")

print(prompt)
def query_graph(query):
    llm = ChatOpenAI(temperature=0, model_name="gpt-4o")
    agent = create_react_agent(llm, tools, prompt)
    agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True, handle_parsing_errors=True)

    context_manager.clear()
    final_state = agent_executor.invoke({"input": query})
    return final_state['output']



input_variables=['agent_scratchpad', 'input', 'tool_names', 'tools'] input_types={} partial_variables={} metadata={'lc_hub_owner': 'hwchase17', 'lc_hub_repo': 'react', 'lc_hub_commit_hash': 'd15fe3c426f1c4b3f37c9198853e4a86e20c425ca7f4752ec0c9b0e97ca7ea4d'} template='Answer the following questions as best you can. You have access to the following tools:\n\n{tools}\n\nUse the following format:\n\nQuestion: the input question you must answer\nThought: you should always think about what to do\nAction: the action to take, should be one of [{tool_names}]\nAction Input: the input to the action\nObservation: the result of the action\n... (this Thought/Action/Action Input/Observation can repeat N times)\nThought: I now know the final answer\nFinal Answer: the final answer to the original input question\n\nBegin!\n\nQuestion: {input}\nThought:{agent_scratchpad}'


In [44]:
query = "What are key papers in String Theory?."
resp = query_graph(query)




[1m> Entering new AgentExecutor chain...[0m
[32;1m[1;3mTo identify key papers in String Theory, I should perform a semantic search to find relevant abstracts of papers on this topic. 
Action: semantic_search
Action Input: "key papers in String Theory"[0mPAPERS/9810149
PAPERS/9810240
PAPERS/9911004
PAPERS/0012007
PAPERS/9610085
PAPERS/0210145
 **Information from Documents:
PAPERS/9810149||Talk given at ICM '98, Berlin, reviewing some of the recent developments in 
 understanding of string theory for a mathematical audience (to appear in 
 Documenta Mathematica).

PAPERS/9810240||This is the written version of a set of introductory lectures on string 
 theory.

PAPERS/9911004||Thermodynamical aspects of string theory are reviewed and discussed.

PAPERS/0012007||This is a summary of a talk at Strings2000 explaining three ways in which 
 string theory and M-theory are related to the mathematics of K-theory.

PAPERS/9610085||We report on recent advances in the understanding of non-pe

## Simple Queries - Use Cases Answerable by AQL Queries



In [None]:
simple_queries = []
## Simple Queries - Use Cases Answerable by AQL Queries

simple_queries.append("Find all papers written by Author `Mina Aganagic`. Look at Papers and Authors relationship")
simple_queries.append("List all topics associated with papers published in a journal - `Int.J.Mod.Phys. A7 (1992) 5661-5705`.")
simple_queries.append("Find authors who have written papers on topics - `String Theory` and `Quantum Theory`.")
simple_queries.append("Determine the number of papers published by author `Mina Aganagic` in Year 1996 .")




In [49]:
simple_answers = []
for q in simple_queries:
  ans = ''
  try:
    ans = query_graph(q)
  except:
    ans = 'An Error Occurred in LLM'
    # Re-try
    try:
      ans = query_graph(q)
    except Exception as e:
      ans = 'An Error Occurred while executing the LLM Chain. Please re-try. Error: ' + str(e)
  simple_answers.append(ans)
  



[1m> Entering new AgentExecutor chain...[0m
[32;1m[1;3mTo find all papers written by the author "Mina Aganagic," I should perform a semantic search on the abstracts of the papers to identify those authored by her. 
Action: semantic_search
Action Input: "Mina Aganagic" [0mPAPERS/9702192
PAPERS/0009115
PAPERS/9309115
PAPERS/9206093
PAPERS/0303061
PAPERS/0303041
 **Information from Documents:
PAPERS/9702192||Talk presented at the 9th Max Born Symposion, Karpacz, September 1996

PAPERS/0009115||In this talk, I report on a work done in collaboration with R. Markazi and 
 E.H. Saidi [1].

PAPERS/9309115||Lectures presented at the Spring School, Trieste, Italy 1993.

PAPERS/9206093||Professor M. C. Polivanov and I met only a few times, during my infrequent 
 visits to the-then Soviet Union in the 1970's and 1980's. His hospitality at 
 the Moscow Steclov Institute made the trips a pleasure, while the scientific 
 environment that he provided made them professionally valuable. But it is

In [50]:
for q, a in zip(simple_queries, simple_answers):
  print('_' * 50)
  print('Question: ', q)
  print('Answer: ', a)


__________________________________________________
Question:  Find all papers written by a Author `Mina Aganagic`.
Answer:  Mina Aganagic has authored several influential papers in the field of theoretical physics. Her works cover a range of complex topics, including the dynamics of Dirichlet p-branes, gauge theories, and mirror symmetry. Some of her notable papers include:

1. "D-Brane Actions with Local Kappa Symmetry"
2. "Gauge-Invariant and Gauge-Fixed D-Brane Actions"
3. "World-Volume Action of the M Theory Five-Brane"
4. "Dual D-Brane Actions"
5. "Exact solutions for some N=2 supersymmetric SO(N) gauge theories"
6. "Mirror Symmetries for Brane Configurations and Branes at Singularities"
7. "Calabi-Yau Mirror Symmetry as a Gauge Theory Duality"
8. "Unstable Solitons in Noncommutative Gauge Theory"
9. "Mirror Symmetry, D-Branes and Counting Holomorphic Discs"
10. "Disk Instantons, Mirror Symmetry and the Duality Web"

These papers highlight her significant contributions to the unde

## Complex Queries - Use Cases Requiring NetworkX Algorithms

In [55]:
complex_queries = []
## Complex Queries = GPU-Accelerated Graph Analytics (via NetworkX/cuGraph)

### Breadth-First Search (BFS)
complex_queries.append("Find the shortest path between two authors - `Mina Aganagic` and `Soo-Jong Rey` based on co-authorship.")
### Depth-First Search (DFS)
complex_queries.append("Identify strongly connected components in the citation network.")
### Centrality Measures 
complex_queries.append("Rank papers by their influence using PageRank.")
### Community Detection Algorithms 
complex_queries.append("Detect communities of authors based on co-authorship patterns.")
### Shortest Path Algorithms
complex_queries.append("Find the shortest path between papers `9504065` and `9204044` in the citation network.")




In [57]:
complex_answers = []
for q in complex_queries:
  complex_answers.append(query_graph(q))
  



[1m> Entering new AgentExecutor chain...[0m
[32;1m[1;3mTo find the shortest path between two authors based on co-authorship, I need to use a graph traversal algorithm. This can be achieved using a NetworkX algorithm on the ArangoDB Graph. I will proceed with this approach.

Action: text_to_nx_algorithm_to_text
Action Input: "Find the shortest path between authors Mina Aganagic and Soo-Jong Rey based on co-authorship."[0m# Example Nodes:
*Important*: Only use Examples for Filtering the nodes. Do not use the data to create new Graphs

## Here are few examples of related Paper Nodes. Format: Paper ID - _id  

PAPERS/0109024 - PAPERS/0109024
PAPERS/0009115 - PAPERS/0009115
PAPERS/9402083 - PAPERS/9402083
PAPERS/9508008 - PAPERS/9508008
PAPERS/9811074 - PAPERS/9811074
PAPERS/0012184 - PAPERS/0012184

## Here are few examples of related Author Nodes. Format: Author Name - _id  

Mina Aganagic - AUTHOR/MINAAGANAGIC
Soo-jong Rey - AUTHOR/SOOJONGREY
Soo-jong Rey (seoul National Universit

In [58]:
for q, a in zip(complex_queries, complex_answers):
  print('_' * 50)
  print('Question: ', q)
  print('Answer: ', a)


__________________________________________________
Question:  Find the shortest path between two authors - `Mina Aganagic` and `Soo-Jong Rey` based on co-authorship.
Answer:  The shortest path between authors Mina Aganagic and Soo-Jong Rey based on co-authorship involves the paper with ID 0109024. Mina Aganagic and Soo-Jong Rey both co-authored this paper.
__________________________________________________
Question:  Identify strongly connected components in the citation network.
Answer:  The citation network contains a single strongly connected component, consisting of the papers: 'PAPERS/9701103', 'PAPERS/9212129', 'PAPERS/0202110', 'PAPERS/0109124', 'PAPERS/0208180', and 'PAPERS/9311149'.
__________________________________________________
Question:  Rank papers by their influence using PageRank.
Answer:  The PageRank analysis shows that all the papers have equal influence, each receiving the same PageRank score of approximately 0.1667. This indicates a balanced cycle of citation rel

## Hybrid Queries - Use Cases Requiring Both Cypher Queries and NetworkX


In [59]:
hybrid_queries = []

## Hybrid Query Execution - Combining AQL and cuGraph for Contextual Responses
### Centrality Measures 
hybrid_queries.append("Calculate the PageRank of Top 10 authors based on their papers' citations.")
### Community Detection Algorithms 
hybrid_queries.append("Identify communities of papers based on their topics and citations.")
### Breadth-First Search 
hybrid_queries.append("Find the shortest path between topics - `String Theory` and `Quantum Theory` based on papers.")
### Network Clustering Coefficient
hybrid_queries.append("Calculate the clustering coefficient of the author co-authorship network.")

In [60]:
hybrid_answers = []
for q in hybrid_queries:
  ans = ''
  try:
    ans = query_graph(q)
  except:
    ans = 'An Error Occurred in LLM'
    # Re-try
    try:
      ans = query_graph(q)
    except Exception as e:
      ans = 'An Error Occurred while executing the LLM Chain. Please re-try. Error: ' + str(e)
  hybrid_answers.append(ans)
  



[1m> Entering new AgentExecutor chain...[0m
[32;1m[1;3mTo calculate the PageRank of authors based on their papers' citations, I need to use a graph algorithm. PageRank is a network analysis algorithm that can be executed using NetworkX. Therefore, I will use the `text_to_nx_algorithm_to_text` tool to perform this calculation.

Action: text_to_nx_algorithm_to_text
Action Input: Calculate the PageRank of Top 10 authors based on their papers' citations.[0m# Example Nodes:
*Important*: Only use Examples for Filtering the nodes. Do not use the data to create new Graphs

## Here are few examples of related Paper Nodes. Format: Paper ID - _id  

PAPERS/0006111 - PAPERS/0006111
PAPERS/0007194 - PAPERS/0007194
PAPERS/9401018 - PAPERS/9401018
PAPERS/0302056 - PAPERS/0302056
PAPERS/9706001 - PAPERS/9706001
PAPERS/9911130 - PAPERS/9911130

## Here are few examples of related Author Nodes. Format: Author Name - _id  

10 - AUTHOR/10
P. Page - AUTHOR/PPAGE
Statistics - AUTHOR/STATISTICS

## H

In [61]:
for q, a in zip(hybrid_queries, hybrid_answers):
  print('_' * 50)
  print('Question: ', q)
  print('Answer: ', a)


__________________________________________________
Question:  Calculate the PageRank of Top 10 authors based on their papers' citations.
Answer:  The top 10 authors based on the PageRank of their papers' citations are:

1. Boguslaw Broda
2. Malgorzata Bakalarska
3. P. H. Damgaard
4. S. M. Nishigaki
5. Ezra Getzler (RIMS)
6. Northwestern
7. Tohrue Eguchi (Tokyo)
8. Valeri V. Dolotin
9. R. J. Budzynski
10. W. Kondracki

These authors are ranked according to the influence of their papers within the citation network.
__________________________________________________
Question:  Identify communities of papers based on their topics and citations.
Answer:  The analysis successfully identified 2,193 distinct communities of papers, which were grouped based on their shared topics and citation relationships.
__________________________________________________
Question:  Find the shortest path between topics - `String Theory` and `Quantum Theory` based on papers.
Answer:  There is no direct path be

In [62]:
hybrid_queries = []

## Hybrid Query Execution - Combining AQL and cuGraph for Contextual Responses
hybrid_queries.append("First identify the papers by `Mina Aganagic` abd fir that, rate them based on citation using pagerank.")


In [112]:
hybrid_answers = []
for q in hybrid_queries:
  ans = ''
  try:
    ans = query_graph(q)
  except:
    ans = 'An Error Occurred in LLM'
    # Re-try
    try:
      ans = query_graph(q)
    except Exception as e:
      ans = 'An Error Occurred while executing the LLM Chain. Please re-try. Error: ' + str(e)
  hybrid_answers.append(ans)
  



[1m> Entering new AgentExecutor chain...[0m
[32;1m[1;3mTo identify the papers by "Mina Aganagic" and then rate them based on citation using PageRank, I will first need to retrieve the papers authored by Mina Aganagic. After that, I will apply the PageRank algorithm to rank these papers based on their citations.

First, I will retrieve the papers authored by Mina Aganagic using an AQL query.

Action: text_to_aql
Action Input: "Retrieve papers authored by Mina Aganagic"[0m# Example Nodes:
*Important*: Only use Examples for Filtering the nodes. Do not use the data to create new Graphs

## Here are few examples of related Paper Nodes. Format: Paper ID - _id  

PAPERS/0009115 - PAPERS/0009115
PAPERS/9912039 - PAPERS/9912039
PAPERS/9702192 - PAPERS/9702192
PAPERS/0210001 - PAPERS/0210001
PAPERS/0303061 - PAPERS/0303061
PAPERS/0204022 - PAPERS/0204022

## Here are few examples of related Author Nodes. Format: Author Name - _id  

Mina Aganagic - AUTHOR/MINAAGANAGIC
R. Minasian - AUTHOR

In [113]:
for q, a in zip(hybrid_queries, hybrid_answers):
  print('_' * 50)
  print('Question: ', q)
  print('Answer: ', a)


__________________________________________________
Question:  First identify the papers by `Mina Aganagic` abd fir that, rate them based on citation using pagerank.
Answer:  The papers authored by Mina Aganagic have been ranked based on their citation impact using the PageRank algorithm. The top-ranked paper is "Disk Instantons, Mirror Symmetry and the Duality Web" with a PageRank score of 0.1474. It is followed by "Mirror Symmetry, D-Branes and Counting Holomorphic Discs" with a score of 0.1410, and "Unstable Solitons in Noncommutative Gauge Theory" with a score of 0.1335. The complete ranking includes ten papers, with "D-Brane Actions with Local Kappa Symmetry" being the lowest-ranked with a score of 0.0275.


In [None]:
!pip install gradio

In [16]:
import gradio as gr

gr.Interface(fn=query_graph, inputs="text", outputs="text").launch(share=True)

* Running on local URL:  http://127.0.0.1:7861
* Running on public URL: https://7e2a5d10cbbfea2203.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)






[1m> Entering new AgentExecutor chain...[0m
[32;1m[1;3mTo find authors who have written papers on both "String Theory" and "Quantum Theory," I need to search for papers related to these topics and identify the authors. This requires searching through the abstracts of papers to find relevant ones. I will use the `semantic_search` tool to perform this search.

Action: semantic_search
Action Input: "String Theory and Quantum Theory"[0m **Information from Documents:
PAPERS/0008217||String theory is accused by some of its critics to be a purely abstract 
 mathematical discipline, having lost the contact to the simple yet deeply 
 rooted questions which physics provided until the beginning of this century. We 
 argue that, in contrary, there are indications that string theory might be 
 linked to a fundamental principle of a quantum computational character. In 
 addition, the nature of this principle might be capable to provide some new 
 insight into the question of uni

PAPERS/98071

# Deep Research Agent

Now that we see that the Tools and Agents are working fine, let us now create the Deep Research

In [150]:
import uuid 
from IPython.display import Markdown

# Template for standard report structure and formatting guidelines
REPORT_TEMPLATE = """The report structure should focus on breaking-down the user-provided topic
 and building a comprehensive report in markdown using the following format:
1. Introduction (no web search needed)
      - Brief overview of the topic area


2. Main Body Sections:
      - Each section should focus on a sub-topic of the user-provided topic
      - Include any key concepts and definitions
      - Provide real-world examples or case studies where applicable


3. Conclusion (no web search needed)
      - Aim for 1 structural element (either a list of table) that distills the main body sections
      - Provide a concise summary of the report


When generating the final response in markdown, if there are special characters in the text,
such as the dollar symbol, ensure they are escaped properly for correct rendering e.g $25.5 should become \\$25.5
                 """

In [None]:
import operator
from typing import Annotated, List, Tuple, Union, Literal
from typing_extensions import TypedDict
from langchain_core.prompts import ChatPromptTemplate
from langgraph.graph import StateGraph, START, END
from pydantic import BaseModel, Field


class WorkflowState(TypedDict):
    """State object for the financial research workflow."""
    query: str  # User's original input query
    strategy: List[str]  # Steps to execute for research
    completed_actions: Annotated[List[Tuple], operator.add]  # History of completed steps and results
    final_output: str  # Final response to return to user


class StrategyOutline(BaseModel):
    """Model for the research strategy steps."""
    steps: List[str]


class UserOutput(BaseModel):
    """Final output to be delivered to user."""
    response: str


class NextAction(BaseModel):
    """Determines the next action in the workflow."""
    action: Union[UserOutput, StrategyOutline] = Field(
        description="Next action to take. Either provide a final response to the user with UserOutput, "
        "or continue the research process with a new StrategyOutline."
    )


# Prompt template for initial strategy creation
strategy_creator_prompt = ChatPromptTemplate.from_messages([
    (
        "system",
        f"""You are a research specialist who does deep research based on available tools.
        You have Tool which is comphrehensive and returns all information about Physice based Papers, and related analysis like Topics, Authors, etc.
        *Important*: Always use the Tools available. DO NOT USE PRIOR KNOWLEDGE.
        You DO NOT have access to any external tools. 
        Keep on completing the actions, till you get all answers.
        Note - If you are generating Code, always use nx-cugraph library. No other network algorithms to be used.
        At last step, DO NOT give Updated Strategy. Always answer the question by providing a comphrehensive output in below format:
        {REPORT_TEMPLATE}
        """,
    ),
    ("placeholder", "{messages}"),
])

# Initialize the strategy creator with appropriate model
strategy_creator = strategy_creator_prompt | ChatOpenAI(
    model="gpt-4o").with_structured_output(StrategyOutline)


# Prompt template for strategy refinement based on completed steps
strategy_refiner_prompt = ChatPromptTemplate.from_template(
    """Evaluate and refine the research strategy for the given objective.
    
Original query:
{query}

Initial research strategy:
{strategy}

Steps already completed:
{completed_actions}

Based on progress so far, provide an updated strategy. Only include steps that still need to be completed.
If sufficient information has been gathered to answer the original query, provide a comprehensive final response instead of additional steps."""
)

# Initialize the strategy refiner with appropriate model
strategy_refiner = strategy_refiner_prompt | ChatOpenAI(
    model="gpt-4o", temperature=0
).with_structured_output(NextAction)

# 6. Create the Agentic Application

agent_tools = [text_to_aql_to_text, text_to_nx_algorithm_to_text]
prompt = hub.pull("hwchase17/react")

def query_graph_knowledge(query):
    llm = ChatOpenAI(temperature=0, model_name="gpt-4o")
    agent = create_react_agent(llm, agent_tools, prompt)
    agent_executor = AgentExecutor(agent=agent, tools=agent_tools, verbose=True, handle_parsing_errors=True)

    context_manager.clear()
    final_state = agent_executor.invoke({"input": query})
    return final_state['output']

async def create_initial_strategy(state: WorkflowState):
    """Generate the initial research strategy based on user query."""
    strategy_outline = await strategy_creator.ainvoke({"messages": [("user", state["query"])]})
    return {"strategy": strategy_outline.steps}


async def execute_research_step(state: WorkflowState):
    """Execute the next step in the research strategy."""
    # Format the current strategy for context
    strategy = state["strategy"]
    strategy_formatted = "\n".join(f"{i+1}. {step}" for i, step in enumerate(strategy))
    
    # Get the current task to execute
    current_task = strategy[0]
    task_instruction = f"""Following this research strategy:
{strategy_formatted}\n\nExecute step 1: {current_task}."""
    
    # Execute the task and capture results
    execution_result = query_graph_knowledge(task_instruction)
    
    
    # Return the completed step and its results
    return {
        "completed_actions": [(current_task, execution_result)],
    }


async def refine_strategy(state: WorkflowState):
    """Refine the research strategy based on completed steps."""
    refinement = await strategy_refiner.ainvoke(state)
    
    # If we have enough information for a final response, return it
    if isinstance(refinement.action, UserOutput):
        return {"final_output": refinement.action.response}
    # Otherwise, update the strategy with remaining steps
    else:
        return {"strategy": refinement.action.steps}


def check_workflow_completion(state: WorkflowState):
    """Determine if the workflow should end or continue."""
    if "final_output" in state and state["final_output"]:
        return END  # Workflow complete, return final output
    else:
        return "researcher"  # Continue with next research step


# Create the workflow graph
research_workflow = StateGraph(WorkflowState)

# Add workflow nodes
research_workflow.add_node("strategist", create_initial_strategy)
research_workflow.add_node("researcher", execute_research_step)
research_workflow.add_node("evaluator", refine_strategy)

# Define workflow edges
research_workflow.add_edge(START, "strategist")
research_workflow.add_edge("strategist", "researcher")
research_workflow.add_edge("researcher", "evaluator")

# Add conditional logic for continuing or completing the workflow
research_workflow.add_conditional_edges(
    "evaluator",
    check_workflow_completion,
    ["researcher", END],
)

# Compile the workflow into an executable application
deep_research_app = research_workflow.compile()
config = {"recursion_limit": 50}




In [180]:
deep_research_query = "Analyze the network of researchers who have published on string theory and identify potential 'bridge researchers' who connect different sub-communities through their publication and citation patterns."
async for event in deep_research_app.astream({"query": deep_research_query}, config=config):
        for k, v in event.items():
            if k != "__end__":
                print(v)

{'strategy': ['Retrieve a comprehensive list of authors who have published papers on string theory from the database.', 'Identify the co-authorship networks among these authors to understand their collaboration patterns.', 'Analyze the citation patterns among these papers to identify influential authors.', 'Utilize nx-cugraph to map the network of researchers based on the co-authorship and citation data.', 'Apply community detection algorithms to identify distinct sub-communities within the network.', "Identify 'bridge researchers' who have connections with multiple sub-communities, potentially through collaborations or citations.", 'Analyze the publication patterns and topics of these bridge researchers to understand their role in connecting different sub-communities.', 'Compile the findings into a structured report focusing on the network dynamics and the role of bridge researchers in the context of string theory research.']}


[1m> Entering new AgentExecutor chain...[0m
[32;1m[1

In [181]:
from IPython.display import display

display(Markdown(v['final_output']))

### Comprehensive Final Response:

**Objective:** Analyze the network of researchers who have published on string theory and identify potential 'bridge researchers' who connect different sub-communities through their publication and citation patterns.

**Summary of Findings:**

1. **Researcher Network Mapping:**
   - A comprehensive list of authors who have published on string theory was retrieved, including notable figures such as Nathan Berkovits, Igor R. Klebanov, and Michael B. Green.
   - Co-authorship networks were identified, revealing distinct collaboration patterns among researchers.
   - Citation analysis highlighted influential authors like Cumrun Vafa and J.l.f. Barbón, who have significantly contributed to the field.

2. **Community Detection:**
   - Using the Louvain community detection algorithm, five distinct sub-communities within the string theory domain were identified, showcasing the modularity and cohesiveness of these research clusters.

3. **Bridge Researchers Identification:**
   - Julie D. Blum was identified as a key 'bridge researcher' who connects multiple sub-communities within the string theory network. Her role as a connector is crucial for facilitating knowledge transfer and innovation across different sub-fields.

**Insights and Implications:**

- **Role of Bridge Researchers:**
  - Bridge researchers like Julie D. Blum play a pivotal role in the string theory research landscape by linking otherwise disparate groups. Their work fosters collaboration and cross-pollination of ideas, which is essential for advancing the field.

- **Network Dynamics:**
  - The identified sub-communities reflect the diverse research interests and specializations within string theory. Understanding these dynamics can help in strategizing future collaborations and research directions.

- **Potential for Further Exploration:**
  - Further analysis could explore the specific contributions of bridge researchers to different sub-fields and how their work influences the evolution of string theory research.

**Conclusion:**

The analysis successfully identified key bridge researchers and mapped the network dynamics within the string theory domain. This comprehensive overview provides valuable insights into the collaborative landscape and highlights the importance of bridge researchers in fostering innovation and connectivity across sub-communities. Future research could delve deeper into the specific roles and impacts of these connectors to further enhance our understanding of the field's development.

In [None]:
q = "Which researchers have transitioned from theoretical to applied quantum physics based on their publication history?"
async for event in deep_research_app.astream({"query": q}, config=config):
        for k, v in event.items():
            if k != "__end__":
                print(v)

In [None]:
from IPython.display import display

display(Markdown(v['final_output']))