# Paper Savior with LionAGI and LlamaIndex Vector Index

How to run automated exploratory research with LionAGI and retrieval-augmented generation (RAG), using a LlamaIndex vector index and embeddings.

- [LionAGI](https://github.com/lion-agi/lionagi)
- [LlamaIndex](https://www.llamaindex.ai)

In [62]:
# %pip install lionagi pypdf llama_index

In [63]:
import lionagi as li

### 1. Build a Vector Index with llama_index

In [None]:
# define a function that builds a LlamaIndex VectorStoreIndex from the chunked nodes

def get_index(chunks):
    from llama_index import ServiceContext, VectorStoreIndex
    from llama_index.llms import OpenAI

    llm = OpenAI(temperature=0.1, model="gpt-4-1106-preview")
    service_context = ServiceContext.from_defaults(llm=llm)
    return VectorStoreIndex(chunks, include_embeddings=True, service_context=service_context)

In [None]:
# load LlamaIndex text nodes; with to_datanode=True you would get LionAGI DataNode objects instead
text_nodes = li.load(
    'SimpleDirectoryReader', reader_type='llama_index', reader_args=['papers/'], 
    to_datanode=False, #reader_kwargs = {...}
)

chunks = li.chunk(
    documents=text_nodes, chunker_type = 'llama_index', chunker='SentenceSplitter', 
    chunker_kwargs={'chunk_size': 512, 'chunk_overlap':20}, to_datanode=False, 
)
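To make the `chunk_size` and `chunk_overlap` settings concrete, here is a minimal sketch of fixed-size chunking with overlap. It splits on whitespace rather than on tokens and sentence boundaries, so it only illustrates the two parameters and is not how `SentenceSplitter` works internally; `chunk_words` is a hypothetical helper.

```python
def chunk_words(text, chunk_size=512, chunk_overlap=20):
    """Split text into windows of at most chunk_size words.

    Consecutive windows share chunk_overlap words.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
pieces = chunk_words(sample, chunk_size=4, chunk_overlap=1)
for piece in pieces:
    print(piece)
```

A larger overlap repeats more context across neighboring chunks, at the cost of more chunks to embed.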

In [14]:
index = get_index(chunks)
query_engine = index.as_query_engine(include_text=False, response_mode="tree_summarize")

### 2. Write a tool description according to OpenAI schema

In [16]:
tool_schema = {
        "type": "function",
        "function": {
            "name": "query_arxiv_papers",
            "description": """
                           Perform a query to a QA bot with access to an 
                           index built with papers from arxiv
                          """,
            "parameters": {
                "type": "object",
                "properties": {
                    "str_or_query_bundle": {
                        "type": "string",
                        "description": "a question to ask the QA bot",
                    }
                },
                "required": ["str_or_query_bundle"],
            },
        }
    }


# we will need to register both the function description 
# and actual implementation
tool = li.Tool(func=query_engine.query, parser=lambda x: str(x.response), schema_=tool_schema)
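The schema only describes the tool to the model; the implementation is then called with whatever arguments the model sends back. Below is a minimal sketch of that round trip, assuming the arguments arrive as a JSON string keyed by the parameter names in the schema (as in the action requests shown later in this notebook). `fake_query` and `call_tool` are hypothetical stand-ins, not LionAGI or LlamaIndex functions.

```python
import json

# Compact copy of the schema above, so the sketch is self-contained.
schema = {
    "type": "function",
    "function": {
        "name": "query_arxiv_papers",
        "parameters": {
            "type": "object",
            "properties": {"str_or_query_bundle": {"type": "string"}},
            "required": ["str_or_query_bundle"],
        },
    },
}


def fake_query(str_or_query_bundle):
    """Hypothetical stand-in for query_engine.query."""
    return f"(answer to: {str_or_query_bundle})"


def call_tool(func, schema, arguments_json):
    """Decode JSON arguments, check them against the schema, then call func."""
    params = schema["function"]["parameters"]
    kwargs = json.loads(arguments_json)
    missing = [name for name in params["required"] if name not in kwargs]
    unknown = [name for name in kwargs if name not in params["properties"]]
    if missing or unknown:
        raise ValueError(f"missing={missing}, unknown={unknown}")
    return func(**kwargs)


result = call_tool(
    fake_query, schema, '{"str_or_query_bundle": "What is a knowledge graph?"}'
)
print(result)
```

The point of the check is that the property name in the schema has to match the keyword argument of the registered function, which is why the schema uses `str_or_query_bundle`.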

### 3. Research: PROMPTS

#### FORMATS

In [17]:
# a rigidly structured prompt can help make the outcome more deterministic,
# though any string will work as well.
system = {
    "persona": "a helpful world-class researcher",
    "requirements": """
              think step by step before returning a clear, precise 
              worded answer with a humble yet confident tone
          """,
    "responsibilities": """
              you are asked to help with researching on the topic 
              of Large Language Model
          """,
    "tools": "provided with a QA bot for grounding responses"
}

# similarly, we can pass any string or dictionary as the instruction;
# here we modify model behavior by telling the model how to format its output
deliver_format1 = {"return required": "yes", "return format": "paragraph"}

deliver_format2 = {"return required": "yes", 
    "return format": { 
        "json_mode": {
            'paper': "paper_name",
            "summary": "...", 
            "research question": "...", 
            "talking points": {
                "point 1": "...",
                "point 2": "...",
                "point 3": "..."
            }}}}
            
function_call = {
    "notice":"""
        At each task step, identified by step number, you must use the tool 
        at least twice. Notice you are provided with a QA bot as your tool, 
        the bot has access to the 2 papers via a queriable index 
        that takes natural language query and return a natural language 
        answer. You can decide whether to invoke the function call, you will 
        need to ask the bot when there are things need clarification or 
        further information. you provide the query by asking a question, 
        please use the tool as extensively as you can.
       """
    }

# here we create a two-step process imitating the steps a human would take to
# perform the research task
instruct1 = {
    "task step": "1", 
    "task name": "read paper abstracts", 
    "task objective": "get initial understanding of the papers of interest", 
    "task description": """
            provided with abstracts of paper, provide a brief summary 
            highlighting the paper core points, the purpose is to extract 
            as much information as possible
          """,
    "deliverable": deliver_format1
}


instruct2 = {
    "task step": "2",
    "task name": "propose research questions and talking points", 
    "task objective": "initial brainstorming", 
    "task description": """
          from the improved understanding of the paper, please propose 
          an interesting, unique and practical research question, 
          support your reasoning. Keep asking questions if things are 
          not clear. 
        """,
    "deliverable": deliver_format2,
    "function calling": function_call
}
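Judging from the message log shown later in this notebook, a dictionary instruction ends up in the conversation as JSON text under an `instruction` key. Here is a minimal sketch of that serialization using only the standard library; the exact wrapping LionAGI applies is an assumption, and the dictionary is a shortened copy of `instruct1`.

```python
import json

# Shortened copy of instruct1, for illustration only.
instruction = {
    "task step": "1",
    "task name": "read paper abstracts",
    "deliverable": {"return required": "yes", "return format": "paragraph"},
}

# Nested dictionaries serialize to nested JSON objects.
content = json.dumps({"instruction": instruction})
print(content)

# The structure survives a round trip, so it can be inspected later.
restored = json.loads(content)["instruction"]
print(restored["deliverable"]["return format"])
```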

### 4. Research: Setup Workflow

In [18]:
abstracts = """
Abstract—Large language models (LLMs), such as ChatGPT and GPT4, are making new waves in the field of natural language processing and artificial intelligence, due to their emergent ability and generalizability. However, LLMs are black-box models, which often fall short of capturing and accessing factual knowledge. In contrast, Knowledge Graphs (KGs), Wikipedia and Huapu for example, are structured knowledge models that explicitly store rich factual knowledge. KGs can enhance LLMs by providing external knowledge for inference and interpretability. Meanwhile, KGs are difficult to construct and evolving by nature, which challenges the existing methods in KGs to generate new facts and represent unseen knowledge. Therefore, it is complementary to unify LLMs and KGs together and simultaneously leverage their advantages. In this article, we present a forward-looking roadmap for the unification of LLMs and KGs. Our roadmap consists of three general frameworks, namely, 1) KG-enhanced LLMs, which incorporate KGs during the pre-training and inference phases of LLMs, or for the purpose of enhancing understanding of the knowledge learned by LLMs; 2) LLM-augmented KGs, that leverage LLMs for different KG tasks such as embedding, completion, construction, graph-to-text generation, and question answering; and 3) Synergized LLMs + KGs, in which LLMs and KGs play equal roles and work in a mutually beneficial way to enhance both LLMs and KGs for bidirectional reasoning driven by both data and knowledge. We review and summarize existing efforts within these three frameworks in our roadmap and pinpoint their future research directions.
"""

In [19]:
async def read_propose(context, num=5):
    
    researcher = li.Session(system)
    researcher.register_tools(tool)
    
    await researcher.chat(instruct1, context=context, temperature=0.7)
    await researcher.auto_followup(instruct2, tools=True, num=num)
    
    return researcher

### 5. Research: Run the workflow

In [24]:
researcher = li.to_list(
    await li.alcall(abstracts, read_propose), flatten=True
)[0]
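The call above applies an async function to its input and collects the results in a list. The general pattern of running a coroutine function over one or many inputs concurrently can be sketched with `asyncio.gather`. This illustrates the pattern only and is not LionAGI's implementation; `apply_async` and `fake_research` are hypothetical names.

```python
import asyncio


async def fake_research(context):
    """Hypothetical stand-in for read_propose; returns a short description."""
    await asyncio.sleep(0)  # yield control, as a real API call would
    return f"session for {len(context)} chars"


async def apply_async(inputs, func):
    """Run func on every input concurrently and return the results in order."""
    if isinstance(inputs, str):
        inputs = [inputs]  # treat a single string as one input, not as characters
    return await asyncio.gather(*(func(item) for item in inputs))


results = asyncio.run(apply_async("one abstract", fake_research))
print(results)
```

Inside a notebook, where an event loop is already running, `await apply_async(...)` would replace `asyncio.run(...)`. Passing a list of abstracts would start one research session per abstract.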

In [25]:
# the messages of the session's default branch, as a DataFrame
df = researcher.default_branch.messages
df.head()

| Unnamed: 0 | node_id | role | name | timestamp | content |
|---|---|---|---|---|---|
| 0 | e3d8202ddcd80950b619664e6030566a | system | system | 2024-01-18 13:31:27.516618 | `{"system_info": {"persona": "a helpful world-c...` |
| 1 | d6f63a57c38f0382a40c21cb08d292b2 | user | user | 2024-01-18 13:31:27.517302 | `{"instruction": {"task step": "1", "task name"...` |
| 2 | 4e1e291bd9ab5f3181a0b8d7519eb7ed | assistant | assistant | 2024-01-18 13:31:44.000922 | `{"response": "Certainly, the abstract provided...` |
| 3 | 93f6af19b2432c802d12a8f68285b903 | user | user | 2024-01-18 13:31:44.002273 | `{"instruction": {"task step": "2", "task name"...` |
| 4 | 0c395f94f6e27b2ae69d8814bcf16c30 | assistant | action_request | 2024-01-18 13:31:52.241974 | `{"action_list": [{"action": "action_query_arxi...` |


In [36]:
df.sender.unique()

array(['system', 'user', 'assistant', 'action_request', 'action_response'],
      dtype=object)

In [33]:
# let us check the questions the assistant asked
df_requests = df[df.sender == "action_request"]

for content in df_requests.content:
    for i in li.as_dict(content)['action_list']:
        print(li.to_readable_dict(i))


{
    "action": "action_query_arxiv_papers",
    "arguments": "{\"str_or_query_bundle\":\"What are the current challenges in integrating Knowledge Graphs with Large Language Models?\"}"
}
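Note that `arguments` in the printed request is itself a JSON string, so reading the actual question takes one more decoding step. Here is a minimal sketch with the standard library; stripping the `action_` prefix to recover the function name is an assumption based on the names shown in this output.

```python
import json

# The action request printed above, copied as a Python dictionary.
action = {
    "action": "action_query_arxiv_papers",
    "arguments": "{\"str_or_query_bundle\":\"What are the current challenges in integrating Knowledge Graphs with Large Language Models?\"}",
}

# Assumption: the tool name is the action name without the "action_" prefix.
prefix = "action_"
name = action["action"]
function_name = name[len(prefix):] if name.startswith(prefix) else name

# "arguments" is a JSON string, so it is decoded separately.
arguments = json.loads(action["arguments"])
question = arguments["str_or_query_bundle"]

print(function_name)
print(question)
```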


In [44]:
from IPython.display import Markdown

In [50]:
# let us check the answers from the query engine
df_response = df[df.sender == "action_response"]
content = df_response.content.iloc[0]

| Unnamed: 0 | node_id | role | name | timestamp | content |
|---|---|---|---|---|---|
| 5 | c0699e2742950b636ad61a48650b8ea4 | assistant | action_response | 2024-01-18 13:32:25.292728 | `{"action_response": {"function": "query_arxiv_...` |


In [51]:
Markdown(li.as_dict(content)['action_response']['output'])

Current challenges in integrating Knowledge Graphs with Large Language Models (LLMs) include:

1. **Scalability**: As LLMs and knowledge graphs grow in size, it becomes increasingly difficult to efficiently integrate and update the vast amounts of information contained within them.

2. **Alignment**: Ensuring that the knowledge graph's structured information aligns with the LLM's learned representations can be challenging, as LLMs may develop their own idiosyncratic understanding of concepts.

3. **Dynamic Knowledge**: Knowledge graphs need to be constantly updated to reflect new information, but integrating these updates into an LLM that has been trained on a static snapshot of data can be problematic.

4. **Reasoning and Inference**: While LLMs are adept at generating human-like text, they may struggle with logical reasoning or inference tasks that knowledge graphs can support. Bridging the gap between neural text generation and structured logical reasoning is a non-trivial challenge.

5. **Contextual Understanding**: LLMs may not always effectively leverage the context provided by a knowledge graph, leading to responses that are factually incorrect or lack relevance.

6. **Complex Queries**: Handling complex queries that require multi-hop reasoning over a knowledge graph is difficult, as it requires the LLM to maintain coherence over long text generations and to accurately access and apply relevant information from the graph.

7. **Interpretability**: Ensuring that the integration of knowledge graphs into LLMs is interpretable and transparent is important for trust and reliability, but this remains a difficult task given the often opaque nature of neural network decision-making processes.

8. **Data Quality and Bias**: The quality of the data in the knowledge graph can affect the performance of the LLM, and biases present in the data can propagate through the model, leading to biased outputs.

Addressing these challenges requires ongoing research and development in the fields of machine learning, natural language processing, and knowledge representation.

Now let us read the assistant's responses.

In [52]:
df_assistant = df[df.sender == "assistant"]
len(df_assistant)

2

In [60]:
# the first response corresponds to the first user instruction, which is to read through the abstract

response1 = li.as_dict(df_assistant.content.iloc[0])['response']
Markdown(response1)

Certainly, the abstract provided outlines the interplay between Large Language Models (LLMs) like ChatGPT and GPT-4, and Knowledge Graphs (KGs) such as Wikipedia and Huapu. The core point of the paper is that while LLMs are powerful in processing natural language, they tend to lack in capturing and accessing concrete factual knowledge, which is where KGs excel. The paper's purpose is to explore ways to unify LLMs and KGs to harness their respective strengths. 

The authors propose a roadmap with three general frameworks for this unification: 

1. KG-enhanced LLMs, which integrate KGs into various stages of LLM development and usage, either to assist with pre-training and inference or to improve the LLMs' grasp of the knowledge they've learned.

2. LLM-augmented KGs, in which LLMs are utilized to perform tasks related to KGs, including embedding, completion, construction, and more complex functions like graph-to-text generation and question answering.

3. Synergized LLMs + KGs, a model where LLMs and KGs collaborate closely, providing mutual benefits and enabling bidirectional reasoning that incorporates both data and knowledge.

The abstract concludes by reviewing existing efforts in these areas and suggesting future research directions, indicating that this is a forward-looking and potentially transformative approach to advancing the field of AI and natural language understanding.

In [61]:
# the second response corresponds to the second instruction, which is the final output in this case

response2 = li.as_dict(df_assistant.content.iloc[1])['response']
Markdown(response2)

Based on the improved understanding of the challenges in integrating Knowledge Graphs with Large Language Models, a research question that arises might be:

**Research Question:** How can we develop an adaptive integration framework for LLMs and KGs that maintains the up-to-dateness of the knowledge graph while ensuring the scalability and alignment of the LLM?

**Supporting Reasoning:** This question is practical as it addresses the dynamic nature of knowledge and the need for LLMs to continuously learn from updated information. It is unique in its focus on creating an adaptive framework that can handle the scalability issues that come with the ever-growing size of LLMs and KGs, as well as ensuring that the LLM's learned representations align with the structured information of the KG.

**Talking Points:**
- **Point 1:** Scalability and efficiency are major concerns as both LLMs and KGs grow; an adaptive framework could include mechanisms for incremental learning or modular updates that prevent the need for retraining from scratch.
- **Point 2:** Alignment between the evolving representations of knowledge in LLMs and the structured format of KGs requires continuous synchronization methods, possibly utilizing advanced alignment algorithms or transfer learning techniques.
- **Point 3:** Keeping the knowledge graph up-to-date in a way that the LLM can efficiently utilize is crucial; this might involve real-time updating mechanisms or periodic 'knowledge refreshes' that the LLM can integrate without compromising performance. 

To further clarify the potential of this research direction, I will invoke the function call once more to ask a follow-up question.

{"action_list": [{"action": "action_query_arxiv_papers", "arguments": "{\"str_or_query_bundle\":\"What are the latest approaches to ensuring the scalability and alignment of LLMs in the context of knowledge graph integration?\"}"}]}

In [None]:
df.to_csv("researcher1.csv")