# **RAG with LlamaIndex and DeciLM**
https://deci.ai/blog/rag-with-llamaindex-and-decilm-a-step-by-step-tutorial/

This tutorial follows five steps:
1. Load documents
2. Parse documents into nodes
3. Build an index
4. Query the index
5. Parse the response

In [12]:
%%capture
! pip install openai llama_hub llama_index pypdf accelerate sentence_transformers -q -U

In [51]:
import llama_index
llama_index.__version__

'0.9.26'

In [None]:
%%capture
%%bash
wget -O state_of_ai_2023.zip https://github.com/harpreetsahota204/langchain-zoomcamp/raw/main/State%20of%20AI%20Report%202023%20-%20ONLINE.pdf.zip
unzip state_of_ai_2023.zip

In [None]:
%%capture
%%bash
wget -O ggml-gpt4all-j-v1.3-groovy.bin https://the-eye.eu/public/AI/models/nomic-ai/gpt4all/gpt4all-lora-quantized.bin

In [26]:
import os
from pathlib import Path
from llama_index.response.notebook_utils import display_source_node
from llama_index.retrievers import RecursiveRetriever
from llama_index.llms import OpenAI
import json
from llama_index.node_parser import SimpleNodeParser
from llama_index.schema import IndexNode
from llama_index import Document

## 1. Load Documents

In [27]:
from llama_hub.file.pdf.base import PDFReader

loader = PDFReader()
docs0 = loader.load_data(file=Path("State of AI Report 2023 - ONLINE.pdf"))

In [28]:
print(f" docs is a {type(docs0)}, of length {len(docs0)}, where each element is a {type(docs0[0])} object")
docs0[1].text

 docs is a <class 'list'>, of length 163, where each element is a <class 'llama_index.schema.Document'> object


'About the authors  Introduction  | Research  | Industry  | Politics  | Safety  | Predictions #stateofai | 2 \nstateof.ai 2023 Nathan is the General Partner of Air Street Capital , a \nventure capital ﬁrm investing in AI-ﬁrst technology \nand life science companies. He founded RAAIS and \nLondon.AI (AI community for industry and research), \nthe RAAIS Foundation (funding open-source AI \nprojects), and Spinout.fyi (improving university spinout \ncreation). He studied biology at Williams College and \nearned a PhD from Cambridge in cancer research. \nNathan Benaich '

In [29]:
import re

def clean_slide_text(text: str) -> str:
    """
    Cleans the provided slide text by removing specific patterns and extra whitespace.

    Parameters:
    - text (str): The raw text from a slide.

    Returns:
    - str: The cleaned text.

    Example:
    >>> clean_slide_text("LINGO-1 is Wayve’s vision-language-action model ... stateof.ai 2023 #stateofai | 43 \nLeveraging LLMs for autonomous driving")
    'LINGO-1 is Wayve’s vision-language-action model ... \nLeveraging LLMs for autonomous driving'
    """
    # Remove the footer text
    text = text.replace("stateof.ai 2023", "")

    # Remove the header text
    text = text.replace("Introduction  | Research  | Industry  | Politics  | Safety  | Predictions", "")

    # Remove the pattern "#stateofai | n"
    text = re.sub(r"#stateofai(\s*\|\s*\d+)?", "", text)

    # Replace multiple consecutive spaces with a single space
    text = re.sub(r" +", " ", text)

    # Remove any leading or trailing whitespace
    text = text.strip()

    return text

def assign_section(document):
    """
    Assigns a section to the document based on its page number.

    The function updates the 'metadata' attribute of the document with a key 'section'
    that has a value corresponding to the section the page number falls into.

    Sections:
    - Page 1 through 10: Introduction
    - Page 11 through 68: Research
    - Page 69 through 120: Politics
    - Page 121 through 137: Safety
    - Pages 138 and beyond: Predictions

    Args:
    - document (Document): The Document object to be updated.

    Returns:
    None. The function updates the Document object in-place.
    """

    page_number = int(document.metadata['page_label'])

    if 1 <= page_number <= 10:
        document.metadata['section'] = 'Introduction'
    elif 11 <= page_number <= 68:
        document.metadata['section'] = 'Research'
    elif 69 <= page_number <= 120:
        document.metadata['section'] = 'Politics'
    elif 121 <= page_number <= 137:
        document.metadata['section'] = 'Safety'
    else:
        document.metadata['section'] = 'Predictions'

# Iterate through each Document object in docs0
for doc in docs0:
    # Update the metadata using assign_section
    assign_section(doc)

    # Metadata keys that are excluded from text for the embed model.
    doc.excluded_embed_metadata_keys=['file_name']

    # Apply clean_slide_text to the text attribute
    doc.text = clean_slide_text(doc.text)

In [30]:
docs0[1].text # docs0[1].get_content()

'About the authors \n Nathan is the General Partner of Air Street Capital , a \nventure capital ﬁrm investing in AI-ﬁrst technology \nand life science companies. He founded RAAIS and \nLondon.AI (AI community for industry and research), \nthe RAAIS Foundation (funding open-source AI \nprojects), and Spinout.fyi (improving university spinout \ncreation). He studied biology at Williams College and \nearned a PhD from Cambridge in cancer research. \nNathan Benaich'

Since the **State of AI Report 2023** is a 163-page PDF, it makes sense to first convert the Document objects into Node (chunk) objects.

* A smaller chunk_size (e.g., 128) provides more granular chunks. However, there’s a risk that essential information might not be among the top retrieved chunks.
* A larger chunk_size (e.g., 512) is more likely to capture all the necessary information within the top chunks.
* As chunk_size increases, more text is passed to the LLM to generate an answer. This gives the model fuller context but can slow the system down.
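To make the trade-off concrete, here is a minimal sketch that splits a synthetic text at two chunk sizes. It is a simplification: `chunk_words` is a hypothetical helper that counts whitespace-separated words, whereas LlamaIndex measures `chunk_size` in tokens and respects sentence boundaries.

```python
from typing import List


def chunk_words(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive chunks of at most `chunk_size` words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


# Synthetic 600-word text standing in for a long document (assumption).
sample_text = " ".join(f"w{i}" for i in range(600))

small_chunks = chunk_words(sample_text, chunk_size=128)
large_chunks = chunk_words(sample_text, chunk_size=512)

# Smaller chunk_size yields more, finer-grained chunks for the same text.
print(len(small_chunks), len(large_chunks))
```

With a fixed `similarity_top_k`, the smaller setting exposes the LLM to a smaller slice of the document per query, which is the risk described in the first bullet.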

Looking at the State of AI Report 2023, you’ll see that ideas, concepts, and points are grouped as bullet points, mostly separated by \n, \n●, or \n-

In [31]:
import re

# Define the pattern for bullet points and newlines
split_pattern = r"\n●|\n-|\n"

# Initialize lists to store the word counts of all chunks and entire texts across all documents
chunk_word_counts = []
entire_text_word_counts = []

# Initialize a dictionary to store word counts and slide counts by section
section_data = {}

# Iterate through each Document object in your list of documents
for doc in docs0:
    # Split the document's text into chunks based on the pattern
    chunks = re.split(split_pattern, doc.text)

    # Calculate the number of words in each chunk and store it 
    chunk_word_counts.extend([len(chunk.split()) for chunk in chunks])

    # Calculate the number of words in the entire text and store it
    entire_word_count = len(doc.text.split())
    entire_text_word_counts.append(entire_word_count)

    # Update the word count and slide count for the section in the dictionary
    section = doc.metadata['section']
    if section in section_data:
        section_data[section]['word_count'] += entire_word_count
        section_data[section]['slide_count'] += 1
    else:
        section_data[section] = {'word_count': entire_word_count, 'slide_count': 1}

# Calculate the total word count across all sections
total_word_count = sum(data['word_count'] for data in section_data.values())

# Calculate the number of sections
num_sections = len(section_data)

# Calculate the average word count across all sections
average_word_count_across_sections = total_word_count / num_sections

# Calculate summary statistics for chunks
average_chunk_word_count = sum(chunk_word_counts) / len(chunk_word_counts)
max_chunk_word_count = max(chunk_word_counts)

# Calculate average word count for entire texts
average_entire_text_word_count = sum(entire_text_word_counts) / len(entire_text_word_counts)

print(f"Average word count for a slide: {average_entire_text_word_count}")
print(f"Average word count per bullet point: {average_chunk_word_count}")
print(f"Longest bullet point: {max_chunk_word_count}")
print(f"Average word count in a section: {average_word_count_across_sections:.2f}")

Average word count for a slide: 127.04907975460122
Average word count per bullet point: 10.796663190823775
Longest bullet point: 33
Average word count in a section: 4141.80


In [32]:
from llama_index.text_splitter import SentenceSplitter

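# NOTE: SentenceSplitter appears to treat paragraph_separator as a literal string,
# not a regex, so this pattern may never match; splitting then falls back to
# sentence boundaries and chunk_size.
bullet_splitter = SentenceSplitter(paragraph_separator=r"\n●|\n-|\n", chunk_size=250)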


# slides_parser = SimpleNodeParser(
#     text_splitter=bullet_splitter,
#     include_prev_next_rel=True,
#     include_metadata=True
#     )
# slides_nodes = slides_parser.get_nodes_from_documents(docs0)
# NOTE: SimpleNodeParser is deprecated. Use the following instead:

slides_nodes = bullet_splitter.get_nodes_from_documents(docs0)

In [33]:
slides_nodes[26].get_content()

'Politics \n-The world has divided into clear regulatory camps, but progress on global governance remains slower. The largest AI labs are stepping in to ﬁll the vacuum. \n-The chip wars continue unabated, with the US mobilising its allies, and the Chinese response remaining patchy. \n-AI is forecast to affect a series of sensitive areas, including elections and employment, but we’re yet to see a signiﬁcant effect. \nSafety \n-The existential risk debate has reached the mainstream for the ﬁrst time and intensiﬁed signiﬁcantly. \n-Many high-performing models are easy to ‘jailbreak’. To remedy RLHF challenges, researchers are exploring alternatives, e.g. self-alignment and pretraining \nwith human preferences. \n-As capabilities advance, it’s becoming increasingly hard to evaluate SOTA models consistently. Vibes won’t sufﬁce. Executive Summary'

**SentenceWindowNodeParser**
* Node: Represents a unit of text, in this case, a sentence.

* Window: A range of sentences surrounding a particular sentence. For example, if the window size is 3, and the current sentence is the 5th sentence, the window will capture sentences 2 to 8.
* Metadata: Additional information associated with a node, such as the window of surrounding sentences.
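The window idea can be sketched in plain Python. This is an illustrative approximation, not the library's implementation: `build_windows` and the record keys are hypothetical names chosen for this example.

```python
from typing import Dict, List


def build_windows(sentences: List[str], window_size: int = 3) -> List[Dict[str, str]]:
    """Pair each sentence with the sentences up to `window_size` before and after it."""
    records = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        records.append({
            "text": sentence,                          # what gets embedded
            "window": " ".join(sentences[start:end]),  # surrounding context kept as metadata
        })
    return records


# Ten placeholder sentences, s1 through s10.
sentences = [f"s{n}" for n in range(1, 11)]
records = build_windows(sentences, window_size=3)

# The 5th sentence (index 4) gets a window covering sentences 2 to 8.
print(records[4]["window"])
```

Sentences near the start or end of a document simply get a shorter window, since there is nothing further to include on that side.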


In [34]:
from llama_index.node_parser import SentenceWindowNodeParser
from typing import List
import re

def custom_sentence_splitter(text: str) -> List[str]:
    return re.split(r'\n●|\n-|\n', text)

bullet_node_parser = SentenceWindowNodeParser.from_defaults(
    sentence_splitter=custom_sentence_splitter,
    window_size=3,
    include_prev_next_rel=True,
    include_metadata=True
    )

# slides_windows_nodes = bullet_node_parser.get_nodes_from_documents(docs0)


**IndexNode** is a node object used in LlamaIndex. It represents a chunk of the original documents stored in an Index. The Index is a data structure that allows quick retrieval of relevant context for a user query, which is fundamental for RAG use cases.

An IndexNode inherits from TextNode, so it primarily represents textual content. Its distinguishing feature is the index_id attribute, which acts as a reference to another object and lets the node point or link to other entities within the system.
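The linking role of index_id can be illustrated with a small stand-in. `ToyTextNode` and `ToyIndexNode` are hypothetical dataclasses written for this sketch; they only mimic the relationship and are not the LlamaIndex classes.

```python
from dataclasses import dataclass


@dataclass
class ToyTextNode:
    node_id: str
    text: str


@dataclass
class ToyIndexNode(ToyTextNode):
    # Reference to another object, here the parent chunk's node_id.
    index_id: str = ""


# A parent chunk and two finer-grained bullets that point back to it.
parent = ToyTextNode(node_id="slide-26", text="Full text of the slide chunk")
children = [
    ToyIndexNode(node_id=f"slide-26-b{i}", text=bullet, index_id=parent.node_id)
    for i, bullet in enumerate(["first bullet", "second bullet"])
]

# Following index_id from a small node resolves to the larger parent chunk.
nodes_by_id = {parent.node_id: parent}
resolved = nodes_by_id[children[0].index_id]
print(resolved.text)
```

Small nodes are easier to match precisely during retrieval, while the parent they point to carries enough context for the LLM to answer from.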

In [35]:
sub_node_parsers = [bullet_node_parser]

all_nodes = []

for base_node in slides_nodes:
    # For each base_node in slides_nodes, build sub-nodes with SentenceWindowNodeParser
    for parser in sub_node_parsers:
        sub_nodes = parser.get_nodes_from_documents([base_node])

        # Wrap each sub-node in an IndexNode whose index_id points back to the base_node
        sub_inodes = [IndexNode.from_text_node(sn, base_node.node_id) for sn in sub_nodes]
        all_nodes.extend(sub_inodes)

    # Also add the base node itself, with index_id set to its own node_id
    original_node = IndexNode.from_text_node(base_node, base_node.node_id)
    all_nodes.append(original_node)

## LLM and Embedding

In [36]:
from llama_index.embeddings import resolve_embed_model

#  BGE embedder from HuggingFace
embed_model = resolve_embed_model("local:BAAI/bge-large-en-v1.5")

In [1]:
# Define a new prompt template
template = """Below is context that has been retrieved. Your task is to synthesize \
the query, which is delimited by triple backticks, and write a response that appropriately answers the query based on the retrieved context.

### Query:
```{query_str}```

### Response:

Begin!
"""

In [1]:
# %%capture
from llama_index.llms import HuggingFaceLLM
from llama_index.prompts import PromptTemplate

llm = HuggingFaceLLM(
    # model_name="Deci/DeciLM-6b-instruct",
    # tokenizer_name="Deci/DeciLM-6b-instruct",

    model_name="WeOpenML/Alpaca-7B-v1",  # Alpaca 7B, used here in place of DeciLM
    tokenizer_name="WeOpenML/Alpaca-7B-v1",
    query_wrapper_prompt=PromptTemplate(
        "<|system|>\n</s>\n<|user|>\n{query_str}</s>\n<|assistant|>\n"),
    # query_wrapper_prompt=PromptTemplate(template),
    context_window=4096,
    max_new_tokens=512,
    model_kwargs={'trust_remote_code': True},
    generate_kwargs={"temperature": 0.0},
    device_map="auto",
)

In [None]:
from llama_index import ServiceContext

# ServiceContext in LlamaIndex is a utility container that bundles commonly
# used resources during the indexing and querying stages of a LlamaIndex pipeline or application
service_context = ServiceContext.from_defaults(llm=llm,
                                               embed_model=embed_model)

## VectorStoreIndex in LlamaIndex

A **VectorStoreIndex** in LlamaIndex is a type of index that uses vector representations of text for efficient retrieval of relevant context. 

It takes in **IndexNode** objects, which represent chunks of the original documents, and uses an embedding model (specified in the **ServiceContext**) to convert the text content of these nodes into vector representations. These vectors are then stored in the VectorStore.
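The embed-then-search idea can be sketched without any model. Everything here is a stand-in: `toy_embed` uses word counts instead of a neural embedding such as BGE, and `ToyVectorIndex` is a hypothetical class, not the LlamaIndex one.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorIndex:
    def __init__(self, texts: Dict[str, str]) -> None:
        # Embed every node once at build time and keep the vectors.
        self.vectors = {node_id: toy_embed(text) for node_id, text in texts.items()}

    def retrieve(self, query: str, top_k: int = 2) -> List[Tuple[str, float]]:
        # Embed the query, score every stored vector, return the best top_k.
        query_vector = toy_embed(query)
        scored = [(node_id, cosine(query_vector, vec)) for node_id, vec in self.vectors.items()]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


index = ToyVectorIndex({
    "n1": "flashattention speeds up attention on gpus",
    "n2": "the chip wars continue unabated",
    "n3": "attention is the core of transformers",
})
hits = index.retrieve("flashattention attention gpus", top_k=2)
print([node_id for node_id, _ in hits])
```

The `top_k` argument plays the same role as `similarity_top_k` on a retriever: it caps how many of the highest-scoring nodes are returned.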

In [2]:
from llama_index import VectorStoreIndex

vector_index_chunk = VectorStoreIndex(
    all_nodes, service_context=service_context)

In [None]:
# A retriever is responsible for fetching relevant context from the index given a user query
vector_retriever_chunk = vector_index_chunk.as_retriever(similarity_top_k=2)

## RecursiveRetriever in LlamaIndex
The **RecursiveRetriever** is designed to recursively explore links from nodes to other retrievers or query engines.

This means that when the retriever fetches nodes, if any of those nodes point to another retriever or query engine, the RecursiveRetriever will follow that link and query the linked retriever or engine as well.

RecursiveRetriever is designed to handle complex retrieval tasks, especially when data is spread across different retrievers or query engines. It follows links, retrieves data from linked sources, and can combine results from multiple sources into a single coherent response.

Here’s a brief explanation of the arguments:

* root_id: The ID of the retriever where the query starts; in this case you pass "vector".
* retriever_dict: A dictionary mapping IDs to retrievers.
* node_dict: A dictionary mapping node IDs to nodes, used to resolve the index_id references of retrieved nodes.
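The role of node_dict can be sketched as a lookup step applied to the retriever's hits. This is a simplified illustration under stated assumptions: `resolve_hits` is a hypothetical helper, hits are reduced to `(node_id, index_id)` pairs, and the deduplication is a choice made for this sketch rather than a claim about the library's behavior.

```python
from typing import Dict, List, Tuple


def resolve_hits(hits: List[Tuple[str, str]], node_dict: Dict[str, str]) -> List[str]:
    """Follow each hit's index_id into node_dict, keeping each target once."""
    seen = set()
    resolved = []
    for _node_id, index_id in hits:
        if index_id in node_dict and index_id not in seen:
            seen.add(index_id)
            resolved.append(node_dict[index_id])
    return resolved


# Two bullets from the same slide and one from another slide (made-up IDs).
node_dict = {"slide-1": "full text of slide 1", "slide-2": "full text of slide 2"}
hits = [("b1", "slide-1"), ("b2", "slide-1"), ("b3", "slide-2")]

print(resolve_hits(hits, node_dict))
```

The real RecursiveRetriever also follows references to other retrievers and query engines, which this sketch leaves out.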

In [None]:
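# Map every node_id to its node so the retriever can resolve index_id references
all_nodes_dict = {n.node_id: n for n in all_nodes}

retriever_chunk = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever_chunk},
    node_dict=all_nodes_dict,
    verbose=True,
)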

In [None]:
nodes = retriever_chunk.retrieve(
    "What is FlashAttention?"
)
for node in nodes:
    display_source_node(node, source_length=1000)

**RetrieverQueryEngine**

A RetrieverQueryEngine in LlamaIndex is a query engine that uses a retriever to fetch relevant context from an index for a user query, then passes that context to the LLM to synthesize a response. It works with any retriever, such as the VectorIndexRetriever returned by `VectorStoreIndex.as_retriever()` or the RecursiveRetriever built above.
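The retrieve-then-synthesize flow, including the idea behind a "compact" response mode, can be sketched as follows. All names are hypothetical, the budget is counted in words rather than tokens, and the prompt wording is invented for the example; the library uses its own prompt templates and measures against the model's context window.

```python
from typing import Callable, List


def compact_chunks(chunks: List[str], max_words: int) -> List[str]:
    """Greedily pack retrieved chunks into as few context blocks as fit the budget."""
    packed, current, count = [], [], 0
    for chunk in chunks:
        n_words = len(chunk.split())
        if current and count + n_words > max_words:
            packed.append("\n\n".join(current))
            current, count = [], 0
        current.append(chunk)
        count += n_words
    if current:
        packed.append("\n\n".join(current))
    return packed


def answer(query: str,
           retrieve: Callable[[str], List[str]],
           llm: Callable[[str], str],
           max_words: int = 50) -> str:
    """Retrieve context, then call the LLM once per packed block, refining the answer."""
    response = ""
    for context in compact_chunks(retrieve(query), max_words):
        prompt = f"Context:\n{context}\n\nExisting answer: {response}\n\nQuery: {query}"
        response = llm(prompt)
    return response


# Four 20-word chunks and a fake LLM that just records its calls.
chunks = [" ".join(["word"] * 20) for _ in range(4)]
calls: List[str] = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"answer after call {len(calls)}"


result = answer("What is FlashAttention?", lambda q: chunks, fake_llm, max_words=50)
print(len(calls), result)
```

Packing two chunks per block halves the number of LLM calls here compared with sending each chunk separately, which is the motivation for consolidating chunks before synthesis.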

In [None]:
from llama_index.query_engine import RetrieverQueryEngine
query_engine_chunk = RetrieverQueryEngine.from_args(
    retriever_chunk,
    service_context=service_context,
    verbose=True,
    response_mode="compact"
    # Compact combines text chunks into larger consolidated chunks that more fully utilize the available context window
)


In [None]:
# Now, you can query the State of AI 2023 Report!
response = query_engine_chunk.query(
   "Who are the authors of this report?"
)
str(response)

In [None]:
response = query_engine_chunk.query(
   "What is new about FlashAttention?"
)
str(response)