# Paper Savior with LionAGI and LlamaIndex Vector Index

How to run automated exploratory research with LionAGI and retrieval-augmented generation (RAG), using a LlamaIndex vector index and embeddings.

- [LionAGI](https://github.com/lion-agi/lionagi)
- [LlamaIndex](https://www.llamaindex.ai)

In [2]:
# %pip install lionagi pypdf llama_index

In [4]:
query = 'Index volatility prediction with transformers'
dir = "data/log/researcher/"
num_papers = 2

### 1. Build a Vector Index with llama_index

In [5]:
from llama_index import SimpleDirectoryReader
from llama_index.node_parser import SentenceSplitter
from llama_index import ServiceContext, VectorStoreIndex
from llama_index.llms import OpenAI


# required_exts expects a list of file extensions
loader = SimpleDirectoryReader(input_dir='papers/', required_exts=['.pdf'])
node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=20)
documents = loader.load_data(show_progress=False)

nodes = node_parser.get_nodes_from_documents(documents, show_progress=False)

# set up index object
llm = OpenAI(temperature=0.1, model="gpt-4-1106-preview")
service_context = ServiceContext.from_defaults(llm=llm)
index1 = VectorStoreIndex(nodes, include_embeddings=True, 
                          service_context=service_context)

# set up query engine
query_engine = index1.as_query_engine(
    include_text=False, response_mode="tree_summarize"
    )
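To make the two ideas in the cell above concrete (overlapping chunks and similarity-based retrieval), here is a minimal, library-free sketch. It is not how LlamaIndex implements either step: `SentenceSplitter` respects sentence boundaries and counts model tokens, and the index uses a learned embedding model. The helper names (`split_with_overlap`, `embed`, `cosine`, `top_k`) and the bag-of-words "embedding" are assumptions made purely for illustration.

```python
import math
from collections import Counter


def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


def embed(tokens):
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(tokens)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_tokens, chunks, k=1):
    """Return the k chunks most similar to the query."""
    q = embed(query_tokens)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


toy_tokens = "transformers predict index volatility better than garch on daily data".split()
toy_chunks = split_with_overlap(toy_tokens, chunk_size=4, chunk_overlap=1)
best_chunk = top_k("index volatility".split(), toy_chunks, k=1)[0]
print(toy_chunks)
print(best_chunk)
```

With `chunk_overlap=1`, the last token of each chunk is repeated as the first token of the next, so a phrase that straddles a boundary is less likely to be lost to retrieval.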

### 2. Write a tool description according to OpenAI schema

In [6]:
import lionagi as li

In [7]:
tools = [
    {
        "type": "function",
        "function": {
            "name": "query_arxiv_papers",
            "description": """
                           Perform a query to a QA bot with access to an 
                           index built with papers from arxiv
                          """,
            "parameters": {
                "type": "object",
                "properties": {
                    "str_or_query_bundle": {
                        "type": "string",
                        "description": "a question to ask the QA bot",
                    }
                },
                "required": ["str_or_query_bundle"],
            },
        }
    }
]

# we need to register both the function description (schema)
# and the actual implementation
tool = li.Tool(func=query_engine.query, parser=lambda x: x.response, schema_=tools[0])
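The schema only describes the function to the model; something still has to turn the model's function call into an actual Python call. The sketch below shows that general pattern under stated assumptions: `fake_query`, `registry`, and `dispatch` are hypothetical names, and this is not LionAGI's internal implementation. It does illustrate why the schema's parameter name (`str_or_query_bundle`) has to match the keyword that the registered function accepts.

```python
import json


def fake_query(str_or_query_bundle):
    """Hypothetical stand-in for query_engine.query; returns a canned answer."""
    return f"answer to: {str_or_query_bundle}"


# Registry keyed by the "name" field of the tool schema.
registry = {"query_arxiv_papers": fake_query}


def dispatch(tool_call, registry):
    """Look up the named function and call it with the decoded JSON arguments."""
    func = registry[tool_call["name"]]
    kwargs = json.loads(tool_call["arguments"])
    return func(**kwargs)


# In an OpenAI-style function call, the arguments arrive as a JSON-encoded string.
example_call = {
    "name": "query_arxiv_papers",
    "arguments": json.dumps({"str_or_query_bundle": "What data do the papers use?"}),
}
result = dispatch(example_call, registry)
print(result)
```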

### 3. Research: PROMPTS

#### FORMATS

In [8]:
# a rigidly structured prompt can help make the outcome more deterministic,
# though any string will work as well.
system = {
    "persona": "a helpful world-class researcher",
    "requirements": """
              think step by step before returning a clear, precise 
              worded answer with a humble yet confident tone
          """,
    "responsibilities": f"""
              you are asked to help with researching on the topic 
              of {query}
          """,
    "tools": "provided with a QA bot for grounding responses"
}

# similarly, we can pass any string or dictionary as the instruction;
# here we shape model behavior by telling the model how to format its output
deliver_format1 = {"return required": "yes", "return format": "paragraph"}

deliver_format2 = {
    "return required": "yes",
    "return format": {
        "json_mode": {
            "paper": "paper_name",
            "summary": "...",
            "research question": "...",
            "talking points": {
                "point 1": "...",
                "point 2": "...",
                "point 3": "...",
            },
        }
    },
}
            
function_call = {
    "notice":f"""
        At each task step, identified by step number, you must use the tool 
        at least twice. Notice you are provided with a QA bot as your tool. 
        The bot has access to the {num_papers} papers via a queryable index 
        that takes a natural language query and returns a natural language 
        answer. You can decide when to invoke the function call; ask the bot 
        whenever something needs clarification or further information. 
        You provide the query by asking a question. 
        Please use the tool as extensively as you can.
       """
    }

# here we create a two-step process imitating the steps a human would take 
# to perform the research task
instruct1 = {
    "task step": "1", 
    "task name": "read paper abstracts", 
    "task objective": "get initial understanding of the papers of interest", 
    "task description": """
            provided with abstracts of papers, provide a brief summary 
            highlighting each paper's core points; the purpose is to extract 
            as much information as possible
          """,
    "deliverable": deliver_format1
}


instruct2 = {
    "task step": "2",
    "task name": "propose research questions and talking points", 
    "task objective": "initial brainstorming", 
    "task description": """
          from the improved understanding of the paper, please propose 
          an interesting, unique and practical research question, and 
          support your reasoning. Keep asking questions if things are 
          not clear. 
        """,
    "deliverable": deliver_format2,
    "function calling": function_call
}

### 4. Research: Setup Workflow

In [9]:
abstracts = """
Abstract—Large language models (LLMs), such as ChatGPT and GPT4, are making new waves in the field of natural language processing and artificial intelligence, due to their emergent ability and generalizability. However, LLMs are black-box models, which often fall short of capturing and accessing factual knowledge. In contrast, Knowledge Graphs (KGs), Wikipedia and Huapu for example, are structured knowledge models that explicitly store rich factual knowledge. KGs can enhance LLMs by providing external knowledge for inference and interpretability. Meanwhile, KGs are difficult to construct and evolving by nature, which challenges the existing methods in KGs to generate new facts and represent unseen knowledge. Therefore, it is complementary to unify LLMs and KGs together and simultaneously leverage their advantages. In this article, we present a forward-looking roadmap for the unification of LLMs and KGs. Our roadmap consists of three general frameworks, namely, 1) KG-enhanced LLMs, which incorporate KGs during the pre-training and inference phases of LLMs, or for the purpose of enhancing understanding of the knowledge learned by LLMs; 2) LLM-augmented KGs, that leverage LLMs for different KG tasks such as embedding, completion, construction, graph-to-text generation, and question answering; and 3) Synergized LLMs + KGs, in which LLMs and KGs play equal roles and work in a mutually beneficial way to enhance both LLMs and KGs for bidirectional reasoning driven by both data and knowledge. We review and summarize existing efforts within these three frameworks in our roadmap and pinpoint their future research directions.
"""

In [10]:
async def read_propose(context, num=5):
    researcher = li.Session(system, dir=dir)
    researcher.register_tools(tool)
    
    await researcher.initiate(instruct1, context=context, temperature=0.7)
    await researcher.auto_followup(instruct2, tools=tools, num=num)
    
    # researcher.messages_to_csv()
    # researcher.log_to_csv()
    return researcher

### 5. Research: Run the workflow

In [11]:
# await the whole call first, then index into the returned list
researcher = (await li.alcall(abstracts, read_propose))[0]
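The call above applies the async `read_propose` function to the input and takes the first result. As a rough mental model only, the pattern resembles mapping a coroutine function over a list of inputs with `asyncio.gather`. The helper `apply_async` and the stand-in `fake_read_propose` below are hypothetical and are not LionAGI's implementation.

```python
import asyncio


async def apply_async(inputs, func):
    """Hypothetical helper: run func over each input concurrently."""
    if not isinstance(inputs, list):
        inputs = [inputs]  # a single input is treated as a one-element list
    return await asyncio.gather(*(func(item) for item in inputs))


async def fake_read_propose(context):
    """Stand-in for read_propose; yields control instead of calling an LLM."""
    await asyncio.sleep(0)
    return f"session for {len(context)} characters of context"


# In a plain script we need asyncio.run; inside a notebook cell, which already
# has a running event loop, use `await apply_async(...)` instead.
demo_results = asyncio.run(apply_async("some abstract text", fake_read_propose))
print(demo_results[0])
```

Because the result is always a list, even for a single abstract, the caller has to index into it after awaiting.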

In [16]:
for msg in researcher.conversation.messages:
    if msg.role == "assistant":
        print(f"{msg.message_content}\n")

The abstract provided discusses the interplay between Large Language Models (LLMs), such as ChatGPT and GPT4, and Knowledge Graphs (KGs), highlighting their complementary nature. LLMs are adept at processing natural language but are considered "black-box" and may not effectively capture factual knowledge. KGs, like Wikipedia and Huapu, store structured factual knowledge, which can aid LLMs in inference and interpretability, although KGs are challenging to build and maintain due to their evolving nature.

The paper proposes a roadmap for integrating LLMs and KGs to harness the strengths of both. This roadmap outlines three frameworks: KG-enhanced LLMs, where KGs are integrated during the pre-training and inference stages of LLMs or to improve understanding learned knowledge; LLM-augmented KGs, which use LLMs for KG-related tasks such as embedding, completion, and graph-to-text generation; and Synergized LLMs + KGs, where both models work together reciprocally for enhanced bidirectional 

In [17]:
from IPython.display import Markdown

Markdown(researcher.conversation.messages[-1].message_content)

Based on the improved understanding of the challenges in integrating Large Language Models (LLMs) with Knowledge Graphs (KGs), I propose the following research question and talking points:

```json
{
  "paper": "Unification of Large Language Models and Knowledge Graphs",
  "summary": "The paper discusses the synergy between LLMs and KGs, detailing how LLMs can be enhanced by KGs for better factual knowledge understanding and how KGs can leverage LLMs for tasks such as embedding and graph-to-text generation. It proposes a roadmap with three frameworks for integration and highlights the potential for mutually beneficial bidirectional reasoning.",
  "research question": "How can we develop a scalable framework for the dynamic integration of evolving Knowledge Graphs with Large Language Models to enhance bidirectional reasoning and maintain up-to-date knowledge?",
  "talking points": {
    "point 1": "Investigate methods for real-time updating of Knowledge Graphs to reflect the latest information and how LLMs can efficiently access these updates.",
    "point 2": "Explore scalable architectures that can align the distributed representations of LLMs with the structured representations of KGs without compromising performance.",
    "point 3": "Develop novel evaluation metrics that specifically measure the effectiveness of LLM-KG integration in terms of knowledge accuracy, contextual understanding, and reasoning capabilities."
  }
}
```

This research question is interesting and unique because it focuses on the dynamic aspect of KGs and their integration with LLMs, which is crucial for applications that require up-to-date information. It is also practical, as the success of such a framework could significantly improve AI systems in terms of knowledge comprehension and reasoning. The talking points aim to address some of the key challenges identified, like real-time updates of KGs, scalable integration architectures, and appropriate evaluation metrics.

To further refine this proposal, I'll next use the QA bot to ask additional questions to clarify any remaining ambiguities or to gather more information that could strengthen the research question and talking points.
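The final message above wraps the JSON deliverable in a fenced block surrounded by prose, so it cannot be passed to `json.loads` directly. Below is a minimal sketch of one way to pull it out for downstream use. The helper name `extract_json_block` and the sample message are assumptions for illustration, and the approach assumes the model emits exactly one well-formed `json` fence.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep this example readable


def extract_json_block(text):
    """Return the parsed contents of the first json-fenced block, or None."""
    pattern = FENCE + r"json\s*(.*?)" + FENCE
    match = re.search(pattern, text, flags=re.DOTALL)
    if match is None:
        return None
    return json.loads(match.group(1))


# Hypothetical message shaped like the assistant output shown above.
sample_message = (
    "Here is my proposal:\n"
    + FENCE + "json\n"
    + '{"paper": "demo", "talking points": {"point 1": "a"}}\n'
    + FENCE
    + "\nMore commentary follows."
)

parsed = extract_json_block(sample_message)
print(parsed["paper"])
print(parsed["talking points"]["point 1"])
```

In the notebook, the same helper could be applied to `researcher.conversation.messages[-1].message_content`; malformed JSON would still raise `json.JSONDecodeError`, which a caller may want to catch.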