# https://docs.llamaindex.ai/en/stable/module_guides/loading/ingestion_pipeline/
https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/
https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules/

To create a robust ingestion pipeline using the `IngestionPipeline` from `llama_index`, we will combine various transformations, document management, and async support to handle document ingestion, processing, and storage efficiently. Below, I’ll walk you through the key elements and how you can implement them in your ingestion pipeline, including handling vector databases, caching, parallel processing, and async support.

### Key Concepts in the Ingestion Pipeline:

1. **Transformations**: These are applied to your input documents. Each transformation processes the document and returns nodes (processed chunks). Common transformations include `SentenceSplitter`, `TitleExtractor`, and embeddings like `OpenAIEmbedding`.
   
2. **Document Management**: This feature ensures that duplicate documents are identified and reprocessed only if needed, based on their hash and `doc_id`. This is especially useful in data pipelines that ingest new documents regularly.

3. **Caching**: The pipeline supports caching of nodes and transformations. This helps speed up processing by reusing previously processed data when the same data and transformations are encountered.

4. **Vector Databases**: Ingestion pipelines can directly connect to vector stores like Qdrant, Redis, or other backends, enabling real-time indexing and search.

5. **Parallel Processing**: The pipeline can process documents in parallel by utilizing `num_workers`, which distributes tasks across multiple processors.

6. **Async Support**: The `IngestionPipeline` supports asynchronous operations, enabling batch processing of documents asynchronously for improved performance.

### Full Example Walkthrough:

#### 1. **Basic Setup of Ingestion Pipeline:**

Here, we define the transformations and create the pipeline.

```python
from llama_index.core import Document
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.extractors import TitleExtractor
from llama_index.core.ingestion import IngestionPipeline, IngestionCache
from llama_index.vector_stores.qdrant import QdrantVectorStore

import qdrant_client

# Initialize the Qdrant client and vector store
client = qdrant_client.QdrantClient(location=":memory:")
vector_store = QdrantVectorStore(client=client, collection_name="test_store")

# Create the pipeline with transformations
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=25, chunk_overlap=0),  # Split the document into smaller chunks
        TitleExtractor(),  # Extract the document title
        OpenAIEmbedding(),  # Apply OpenAI embeddings to document chunks
    ],
    vector_store=vector_store  # Optionally, store results in a vector store
)

# Run the pipeline on a list of documents (in this case, just an example document)
documents = [Document.example()]
pipeline.run(documents=documents)
```

In this example, the documents are split into smaller chunks, titles are extracted, and OpenAI embeddings are applied. These nodes are then inserted into the `Qdrant` vector store.

#### 2. **Connecting to a Vector Database**:

The `IngestionPipeline` can automatically insert nodes into a vector database. Here’s how you connect to the Qdrant vector store:

```python
# Connect to Qdrant and initialize the vector store
client = qdrant_client.QdrantClient(location=":memory:")
vector_store = QdrantVectorStore(client=client, collection_name="documents_collection")

# Create a pipeline that uses the vector store
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=25, chunk_overlap=0),
        TitleExtractor(),
        OpenAIEmbedding(),
    ],
    vector_store=vector_store,
)

# Run the pipeline
pipeline.run(documents=documents)
```

After processing, the resulting nodes are inserted into the Qdrant vector store, making them available for later search or retrieval.

#### 3. **Caching**:

To avoid redundant computation, you can persist the cache. This is useful for improving performance when running the pipeline multiple times with the same data.

```python
# Save the cache
pipeline.persist("./pipeline_storage")

# Load the cache and restore the pipeline state
new_pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=25, chunk_overlap=0),
        TitleExtractor(),
    ],
)
new_pipeline.load("./pipeline_storage")

# Running the pipeline now will be faster due to the cache
nodes = new_pipeline.run(documents=documents)
```

This method ensures that the transformations are cached, and you don't need to repeat the work for the same data.
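As a rough mental model of why a second run is faster, caching can be pictured as a dictionary keyed by a hash of the input text plus the transformation and its settings. The sketch below is illustrative only and is not LlamaIndex's actual cache implementation; `cache_key`, `ToyCache`, and `split_words` are hypothetical names.

```python
import hashlib
import json


def cache_key(text: str, transform_name: str, params: dict) -> str:
    """Hash the input text together with the transformation's name and settings."""
    payload = json.dumps(
        {"text": text, "transform": transform_name, "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ToyCache:
    """In-memory cache that reuses results for identical (text, transform, params)."""

    def __init__(self):
        self.store = {}
        self.hits = 0
        self.misses = 0

    def cached_transform(self, text, transform, params):
        key = cache_key(text, transform.__name__, params)
        if key in self.store:
            self.hits += 1
            return self.store[key]
        self.misses += 1
        result = transform(text, **params)
        self.store[key] = result
        return result


def split_words(text: str, chunk_size: int) -> list:
    """Toy stand-in for a splitter: group words into chunks of chunk_size."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


cache = ToyCache()
first = cache.cached_transform("tea subsidy act of 1958", split_words, {"chunk_size": 2})
second = cache.cached_transform("tea subsidy act of 1958", split_words, {"chunk_size": 2})
third = cache.cached_transform("tea subsidy act of 1958", split_words, {"chunk_size": 3})
```

The second call is served from the cache, while the third misses because changing `chunk_size` changes the key. The same reasoning explains why a cache only helps when both the data and the transformation settings are unchanged.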

#### 4. **Async Support**:

If you have many documents to process, using the async functionality can greatly improve performance by processing multiple documents concurrently.

```python
# Asynchronous document processing
nodes = await pipeline.arun(documents=documents)
```

This method allows the ingestion pipeline to process documents asynchronously, making it ideal for large-scale document ingestion tasks.
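To make the idea of concurrent processing concrete, here is a minimal standard-library sketch using `asyncio.gather` with a semaphore to cap concurrency. It is not how `arun` is implemented; `process_document` is a hypothetical placeholder for an I/O-bound step such as a call to an embedding API.

```python
import asyncio


async def process_document(doc: str, semaphore: asyncio.Semaphore) -> str:
    # Placeholder for an I/O-bound step (e.g. a network request).
    async with semaphore:
        await asyncio.sleep(0.01)
        return doc.upper()


async def process_all(docs: list, max_concurrency: int = 4) -> list:
    # The semaphore limits how many documents are in flight at once.
    semaphore = asyncio.Semaphore(max_concurrency)
    tasks = [process_document(doc, semaphore) for doc in docs]
    # gather returns results in the same order as the input tasks.
    return await asyncio.gather(*tasks)


results = asyncio.run(process_all(["act_1", "act_2", "circular_1"]))
```

In a notebook cell, where an event loop is already running, you would typically write `results = await process_all(...)` instead of calling `asyncio.run`.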

#### 5. **Parallel Processing**:

To further speed up processing, you can run the pipeline with multiple processes. This is useful when you are dealing with large datasets and want to use multiple CPU cores.

```python
# Run the pipeline with parallel processing using multiprocessing
pipeline.run(documents=documents, num_workers=4)
```

The `num_workers` parameter controls how many processes to use for parallel processing. This ensures that the document processing is distributed across multiple cores.
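One way to picture the distribution of work is to split the document list into near-equal batches, one per worker. The helper below is a hypothetical sketch of that batching step only; it is an assumption about how work could be divided, not a description of the library's internals.

```python
def make_batches(items: list, num_workers: int) -> list:
    """Split items into at most num_workers contiguous batches of near-equal size."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    base, extra = divmod(len(items), num_workers)
    batches, start = [], 0
    for worker in range(num_workers):
        # The first `extra` workers each take one additional item.
        size = base + (1 if worker < extra else 0)
        if size == 0:
            break  # fewer items than workers: no empty batches
        batches.append(items[start:start + size])
        start += size
    return batches


docs = [f"doc_{i}" for i in range(10)]
batches = make_batches(docs, num_workers=4)
batch_sizes = [len(batch) for batch in batches]
```

Each batch could then be handed to a separate process. With very few documents, the cost of starting processes can outweigh the gain, so `num_workers` tends to pay off mainly on larger inputs.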

#### 6. **Document Management**:

To manage document versions and ensure that duplicate documents are not processed unnecessarily, you can use a document store. Here, we use `SimpleDocumentStore` for this purpose:

```python
from llama_index.core.storage.docstore import SimpleDocumentStore

# Initialize the document store
docstore = SimpleDocumentStore()

# Create a pipeline that uses the document store
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=25, chunk_overlap=0),
        TitleExtractor(),
        OpenAIEmbedding(),
    ],
    docstore=docstore,
)

# Run the pipeline
pipeline.run(documents=documents)
```

The `docstore` ensures that documents are checked for duplicates based on their hash and `doc_id`. If a document is already processed and stored, it will be skipped unless its content has changed.
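The skip-unless-changed rule can be sketched with a plain dictionary that maps each `doc_id` to the hash of its content. This is a simplified illustration, not the docstore's real logic; `content_hash` and `filter_documents` are hypothetical helpers.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def filter_documents(docs: dict, seen_hashes: dict) -> list:
    """Return the doc_ids that are new or changed, recording their hashes."""
    to_process = []
    for doc_id, text in docs.items():
        digest = content_hash(text)
        # New doc_id, or a known doc_id whose content hash differs.
        if seen_hashes.get(doc_id) != digest:
            to_process.append(doc_id)
            seen_hashes[doc_id] = digest
    return to_process


seen = {}
run_1 = filter_documents(
    {"ACT_1.md": "Tea Subsidy", "ACT_2.md": "Tea Small Holdings"}, seen
)
run_2 = filter_documents(
    {"ACT_1.md": "Tea Subsidy (amended)", "ACT_2.md": "Tea Small Holdings"}, seen
)
```

On the first run both documents are processed; on the second, only the amended one is. This is why a stable `doc_id` matters: without it, an edited document looks like a brand-new one.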

### Complete Pipeline Example:

```python
import asyncio
from llama_index.core import Document
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.extractors import TitleExtractor
from llama_index.core.ingestion import IngestionPipeline
from llama_index.vector_stores.qdrant import QdrantVectorStore
import qdrant_client

# Define the async processing pipeline
async def async_pipeline(documents):
    client = qdrant_client.QdrantClient(location=":memory:")
    vector_store = QdrantVectorStore(client=client, collection_name="test_collection")

    pipeline = IngestionPipeline(
        transformations=[
            SentenceSplitter(chunk_size=25, chunk_overlap=0),
            TitleExtractor(),
            OpenAIEmbedding(),
        ],
        vector_store=vector_store
    )

    # Run the pipeline asynchronously
    nodes = await pipeline.arun(documents=documents)
    return nodes

# Example document list
documents = [Document.example()]

# Run the pipeline and await the result
result = asyncio.run(async_pipeline(documents))

# Print the processed nodes
print(result)
```

### Final Thoughts:

- **Async and Parallel Processing**: The `IngestionPipeline` supports both async execution (`arun`) and multiprocessing (`num_workers`). The complete example above uses the async path; `num_workers` can be passed to `run` when CPU-bound steps dominate.
  
- **Caching**: By caching the results of transformations, the system avoids redundant computation, saving time on subsequent runs.

- **Vector Stores**: Directly integrating with vector stores like Qdrant allows you to index and search documents easily once they are processed.

- **Document Management**: Duplicate detection helps ensure that only new or modified documents are reprocessed, preventing unnecessary computation.

This setup is suited to large-scale document ingestion tasks, where performance and scalability are critical.

# Goal: to implement the above. previous basic working code:

# Read in all files i.e. data ingestion

Basic csv reading code added. Next: use llamaparse's own data loaders

In [None]:
# Connect to Google Drive:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm
import os

# Path to the dataset CSV on the mounted Google Drive
path = '/content/drive/MyDrive/Omdena_Challenge/PREPROCESSING/llamaparse/v0_new_LK_tea_dataset.csv'

# Read the CSV into a pandas DataFrame
all_data = pd.read_csv(path)
all_data.fillna('', inplace=True)
# Display the DataFrame
all_data.head()


Unnamed: 0,id,class,filename,path,url,language_label,required_ocr,has_tables,text_path,quality,...,issuing_authority,retrieved_topic,text_content,PDF_or_text,markdown_content,llama_title,llama_issue_date,llama_reference_number,is_empty,markdown_path
0,0,ACT,ACT_1.txt,,https://www.lawnet.gov.lk/tea-subsidy-3/,en,no,no,/content/drive/MyDrive/Omdena_Challenge/new_LK...,good,...,Parliament of Sri Lanka ACT,Tea Subsidy – LawNet,"Tea Subsidy\nActs Nos. 12 of 1958, 66 of 1961....",Text,"# Tea Subsidy\n\nActs Nos. 12 of 1958, 66 of 1...","Tea Subsidy Act, No. 12 of 1958","19th September, 1958","12 of 1958, 66 of 1961, 33 of 1966",False,/content/drive/MyDrive/Omdena_Challenge/new_LK...
1,1,ACT,ACT_2.txt,,https://www.lawnet.gov.lk/tea-small-holdings-d...,en,no,no,/content/drive/MyDrive/Omdena_Challenge/new_LK...,good,...,Parliament of Sri Lanka ACT,Tea Small Holdings Development (Amendment) – L...,|\n\n**Tea Small Holdings Development (Amendme...,Text,# Tea Small Holdings Development (Amendment)\n...,Tea Small Holdings Development (Amendment),"12th August, 1997",No. 21 of 1997,False,/content/drive/MyDrive/Omdena_Challenge/new_LK...
2,2,ACT,ACT_3.txt,,https://www.lawnet.gov.lk/tea-small-holdings-d...,en,no,no,/content/drive/MyDrive/Omdena_Challenge/new_LK...,good,...,Parliament of Sri Lanka ACT,Tea Small Holdings Development (Amendment) – L...,|\n\n**Tea Small Holdings Development (Amendme...,Text,# Tea Small Holdings Development (Amendment)\n...,Tea Small Holdings Development (Amendment),"22nd October, 2003",Act No. 34 of 2003,False,/content/drive/MyDrive/Omdena_Challenge/new_LK...
3,3,ACT,ACT_4.txt,,https://www.lawnet.gov.lk/tea-small-holdings-d...,en,no,no,/content/drive/MyDrive/Omdena_Challenge/new_LK...,good,...,Parliament of Sri Lanka ACT,Tea Small Holdings Development Law – LawNet,Tea Small Holdings Development Law\nA LAW TO E...,Text,# Tea Small Holdings Development Law\n\nA LAW ...,Tea Small Holdings Development Law,1975,35,False,/content/drive/MyDrive/Omdena_Challenge/new_LK...
4,4,ACT,ACT_5.txt,,https://www.lawnet.gov.lk/tea-research-board-a...,en,no,no,/content/drive/MyDrive/Omdena_Challenge/new_LK...,good,...,Parliament of Sri Lanka ACT,Tea Research Board (Amendment) – LawNet,|\n\n**Tea Research Board (Amendment)** |\n\n|...,Text,# Tea Research Board (Amendment)\n\n# AN ACT T...,"AN ACT TO AMEND THE TEA RESEARCH BOARD ACT, NO...","6th November, 2006",43 of 2006,False,/content/drive/MyDrive/Omdena_Challenge/new_LK...


# [optional] only select TRI circulars

In [None]:
# Select rows where 'class' is 'circular' and 'issuing_authority' is 'Tea Research Institute'
tri_circulars_df = all_data[(all_data['class'] == 'circular') & (all_data['issuing_authority'] == 'Tea Research Institute')]

# Display the selected rows
tri_circulars_df

Unnamed: 0,id,original_class,class,filename,path,url,language_label,required_ocr,has_tables,text_path,...,retrieved_date_of_issuance,issuing_authority,retrieved_topic,text_content,PDF_or_text,markdown_content,llama_title,llama_issue_date,llama_reference_number,markdown_path
33,33,Circulers,circular,Advisory_Circular_DM1e_2024.pdf,/content/drive/MyDrive/Omdena_Challenge/new_LK...,https://www.tri.lk/wp-content/uploads/2024/03/...,en,yes,yes,,...,01/02/2024,Tea Research Institute,Protection of Tea from Blister Blight,T.R.I. ADVISORY CIRCULAR No.DM Ei\n\n \n\nIssu...,PDF,# ADVISORY CIRCULAR\n\n# No.DM JHL 925VynvT\n\...,Protection of Tea from Blister Blight,February 2024,"No. DM JHL 925VynvT, Serial No. 04/24",/content/drive/MyDrive/Omdena_Challenge/new_LK...
34,34,Circulers,circular,Advisory_Circular_DM2e_2024.pdf,/content/drive/MyDrive/Omdena_Challenge/new_LK...,https://www.tri.lk/wp-content/uploads/2024/03/...,en,yes,no,,...,01/02/2024,Tea Research Institute,Protection of Tea from Root Disease,T.R.I. ADVISORY CIRCULAR No.DM EE\n\n \n\nIssu...,PDF,# ADVISORY CIRCULAR No.DM 2\n\nIssued in: Febr...,PROTECTION OF TEA FROM ROOT DISEASES,February 2024,Serial No. 05/24,/content/drive/MyDrive/Omdena_Challenge/new_LK...
35,35,Circulers,circular,Advisory_Circular_DM4e_2024.pdf,/content/drive/MyDrive/Omdena_Challenge/new_LK...,https://www.tri.lk/wp-content/uploads/2024/03/...,en,yes,yes,,...,01/02/2024,Tea Research Institute,Protection of Tea from Red Rust Disease in the...,T.R.I. ADVISORY CIRCULAR No.DMEg\n\n \n\nIssue...,PDF,# ADVISORY CIRCULAR No.DM 4\n\n# Issued in: Fe...,PROTECTION OF TEA FROM RED RUST DISEASE IN THE...,February 2024,DM 4 Serial No 07/24,/content/drive/MyDrive/Omdena_Challenge/new_LK...
36,36,Circulers,circular,Advisory_Circular_DM5e_2024.pdf,/content/drive/MyDrive/Omdena_Challenge/new_LK...,https://www.tri.lk/wp-content/uploads/2024/03/...,en,yes,no,,...,01/02/2024,Tea Research Institute,Protection of Tea from Stem and Branch Canker ...,T.R.I. ADVISORY CIRCULAR No.DM Eg\n\n \n\nIssu...,PDF,# ADVISORY CIRCULAR No.DM 5\n\nIssued in: Febr...,PROTECTION OF TEA FROM STEM AND BRANCH CANKER ...,February 2024,No.DM 5 / Serial No. 08/24,/content/drive/MyDrive/Omdena_Challenge/new_LK...
37,37,Circulers,circular,Advisory_Circular_DM6e_2024.pdf,/content/drive/MyDrive/Omdena_Challenge/new_LK...,https://www.tri.lk/wp-content/uploads/2024/03/...,en,yes,no,,...,01/02/2024,Tea Research Institute,Protection of Tea from Collar and Brach Canker...,Y UC ae LG |\n\nIssued in: February 2024 Seria...,PDF,# ADVISORY CIRCULAR No.DM 6\n\nIssued in: Febr...,PROTECTION OF TEA FROM COLLAR AND BRANCH CANKE...,February 2024,09/24,/content/drive/MyDrive/Omdena_Challenge/new_LK...
38,38,Circulers,circular,Advisory_Circular_DM7e_2024.pdf,/content/drive/MyDrive/Omdena_Challenge/new_LK...,https://www.tri.lk/wp-content/uploads/2024/03/...,en,yes,no,,...,01/02/2024,Tea Research Institute,Management of Horse-Hair Blight in Tea,‘s\na os\n‘| com /E T.R.I. ADVISORY CIRCULAR ~...,PDF,# ADVISORY CIRCULAR No.DM 4 tn 925JHLVYN\n\nIs...,MANAGEMENT OF HORSE-HAIR BLIGHT IN TEA,February 2024,DM 4 tn 925JHLVYN,/content/drive/MyDrive/Omdena_Challenge/new_LK...
39,39,Circulers,circular,TRISL_Advisory_Circular_HP01e_Jun2013.pdf,/content/drive/MyDrive/Omdena_Challenge/new_LK...,https://www.tri.lk/wp-content/uploads/2020/02/...,en,no,yes,,...,01/06/2013,Tea Research Institute,Pruning of Tea,Issued in:June2013 Serial No.01/13\nPRUNING OF...,PDF,# ADVISORY CIRCULAR NoHP 1\n\n# Issued in: Jun...,PRUNING OF TEA,June 2013,Serial No. 01/13,/content/drive/MyDrive/Omdena_Challenge/new_LK...
40,40,Circulers,circular,TRI_HP02e.pdf,/content/drive/MyDrive/Omdena_Challenge/new_LK...,https://www.tri.lk/wp-content/uploads/2020/02/...,en,yes,yes,,...,01/03/2003,Tea Research Institute,Guidelines on Plucking,ADVISORY CIRCULAR _No.HP EJ\n\n \n\nIssued in:...,PDF,# ARCHWNSTLTUTE\n\n# ADVISORY CIRCULAR No.HP 2...,Guidelines on Plucking,March 2003,04/03,/content/drive/MyDrive/Omdena_Challenge/new_LK...
41,41,Circulers,circular,TRI_HP03e.pdf,/content/drive/MyDrive/Omdena_Challenge/new_LK...,https://www.tri.lk/wp-content/uploads/2020/02/...,en,yes,no,,...,01/09/2003,Tea Research Institute,Cultivation of Tea Soils – Forking,T.R.I. ADVISORY CIRCULAR No.HP EE\n\n \n\nIssu...,PDF,# ADVISORY CIRCULAR\n\n# No.HP 3\n\nIssued in:...,CULTIVATION OF TEA SOILS FORKING,September 2003,No.HP 3 / Serial No: 17/03,/content/drive/MyDrive/Omdena_Challenge/new_LK...
42,42,Circulers,circular,TRISL_Advisory_Circular_HP04e_Jun2013.pdf,/content/drive/MyDrive/Omdena_Challenge/new_LK...,https://www.tri.lk/wp-content/uploads/2020/02/...,en,no,yes,,...,01/06/2013,Tea Research Institute,Rejuvenation Pruning,Issued in:June2013 Serial No.02/2013\nREJUVENA...,PDF,# ADVISORY CIRCULAR NoHP\n\nIssued in: June 20...,REJUVENATION PRUNING,June 2013,02/2013,/content/drive/MyDrive/Omdena_Challenge/new_LK...


In [None]:
import pandas as pd

# Function to count words and characters in text
def count_words_and_characters(text):
    words = text.split()  # Split by whitespace to get words
    num_words = len(words)
    num_characters = len(text)
    return num_words, num_characters

# Function to count words and characters for the 'markdown_content' column in the DataFrame
def count_stats_in_dataframe(df):
    total_words = 0
    total_characters = 0

    # Iterate over the 'markdown_content' column in the DataFrame
    for text in df['markdown_content'].dropna():  # Drop any NaN values in markdown_content
        num_words, num_characters = count_words_and_characters(text)
        total_words += num_words
        total_characters += num_characters

    return total_words, total_characters

# Main script
if __name__ == "__main__":

    # Count words and characters
    total_words, total_characters = count_stats_in_dataframe(tri_circulars_df)

    # Output the results
    print("\nTotal Stats for 'markdown_content' column in DataFrame:")
    print(f"Total Words: {total_words}")
    print(f"Total Characters: {total_characters}")



Total Stats for 'markdown_content' column in DataFrame:
Total Words: 52073
Total Characters: 319861


# Create documents

In [None]:
# lots of unused imports, will sift through them later

!pip install python-dotenv
!pip install --upgrade  llama-index llama-index-core llama-index-readers-file
!pip install llama-index-vector-stores-lancedb
!pip install llama-index-llms-groq llama-index-embeddings-fastembed



In [None]:
import nest_asyncio
nest_asyncio.apply()

In [None]:
# basic official code

# from llama_index.core import VectorStoreIndex

# documents = LlamaParse(result_type="markdown").load_data("/content/drive/MyDrive/Tea/ocr_results/Manual_Pdf/Docs_Tables/03_Sri-Lanka-Tea-Board-Standads-for-Other-Origin-Teas_compressed.pdf")

# # create an index from the parsed markdown
# index = VectorStoreIndex.from_documents(documents)

# # create a query engine for the index
# query_engine = index.as_query_engine()

# # query the engine
# query = "What can you do in the Bay of Fundy?"
# response = query_engine.query(query)
# print(response)


It is counterintuitive to run LlamaParse again, especially for markdown, which is already in a good format; it is also harder to attach metadata to each file that way.

In [None]:
# import pandas as pd
# from llama_parse import LlamaParse
# import asyncio

# # Initialize parsers
# markdown_parser = LlamaParse(result_type='markdown', Set_fast_mode=True, num_workers=8)

# # Async function to process Markdown documents
# async def process_markdown_documents(markdown_paths):
#     if markdown_paths:
#         return await asyncio.to_thread(
#             lambda: SimpleDirectoryReader(input_files=markdown_paths, file_extractor={".md": markdown_parser}).load_data()
#         )
#     return []

# # Async function to process and return documents from DataFrame
# async def process_and_save_df(all_data):
#     markdown_paths = all_data['markdown_path'].tolist()
#     return await process_markdown_documents(markdown_paths)

# # Main function to execute the async processing
# async def main():
#     processed_documents = await process_and_save_df(all_data)
#     if processed_documents:
#         return processed_documents
#     print("No documents to process.")
#     return []

# # Run the asynchronous main function
# processed_documents = asyncio.run(main())


https://docs.llamaindex.ai/en/stable/module_guides/loading/simpledirectoryreader/

Just some reference code:

In [None]:
# from llama_index.core import SimpleDirectoryReader

# filename_fn = lambda filename: {"file_name": filename}

# # automatically sets the metadata of each document according to filename_fn
# documents = SimpleDirectoryReader(
#     "./data", file_metadata=filename_fn
# ).load_data()

In [None]:
# reader = SimpleDirectoryReader(input_dir="path/to/directory", recursive=True)
# all_docs = []
# for docs in reader.iter_data():
#     # <do something with the documents per file>
#     all_docs.extend(docs)

In [None]:
# from llama_index.core import Document
# from llama_index.core.schema import MetadataMode

# document = Document(
#     text="This is a super-customized document",
#     metadata={
#         "file_name": "super_secret_document.txt",
#         "category": "finance",
#         "author": "LlamaIndex",
#     },
#     excluded_llm_metadata_keys=["file_name"],
#     metadata_seperator="::",
#     metadata_template="{key}=>{value}",
#     text_template="Metadata: {metadata_str}\n-----\nContent: {content}",
# )

# print(
#     "The LLM sees this: \n",
#     document.get_content(metadata_mode=MetadataMode.LLM),
# )
# print(
#     "The Embedding model sees this: \n",
#     document.get_content(metadata_mode=MetadataMode.EMBED),
# )

In [None]:
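# epic fail: aload_data() is a coroutine function, so calling it inside the
# asyncio.to_thread lambda most likely returns an un-awaited coroutine, not the documents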
# import pandas as pd
# from llama_index.core import SimpleDirectoryReader
# import asyncio

# # Function to get metadata from DataFrame based on file_path
# def get_meta(file_path):
#     matching_row = tri_circulars_df[tri_circulars_df['markdown_path'] == file_path].iloc[0] if not tri_circulars_df[tri_circulars_df['markdown_path'] == file_path].empty else None
#     if matching_row is None:
#         return {
#             "class": 'Unknown',
#             "issuing_authority": 'Unknown',
#             "has_tables": False,
#             "llama_title": 'Unknown Title',
#             "llama_issue_date": 'Unknown Date',
#             "llama_reference_number": 'Unknown Reference'
#         }
#     return {
#         "class": matching_row['class'],
#         "issuing_authority": matching_row['issuing_authority'],
#         "has_tables": matching_row.get('has_tables', False),
#         "llama_title": matching_row.get('llama_title', 'Unknown Title'),
#         "llama_issue_date": matching_row.get('llama_issue_date', 'Unknown Date'),
#         "llama_reference_number": matching_row.get('llama_reference_number', 'Unknown Reference')
#     }

# # Async function to load markdown files and extract documents
# async def load_documents():
#     markdown_paths = tri_circulars_df['markdown_path'].tolist()
#     if markdown_paths:
#         # Await the asynchronous loading of data
#         documents = await asyncio.to_thread(
#             lambda: SimpleDirectoryReader(input_files=markdown_paths, file_metadata=get_meta).aload_data()
#         )
#         return documents
#     return []

# # Async main function
# async def main():
#     documents = await load_documents()  # Ensure to await this
#     print("documents loaded")  # Now it can access len() since documents are awaited
#     return documents

# # Run the async main function
# documents = asyncio.run(main())


documents loaded


main async document loader

In [None]:
import pandas as pd
from llama_index.core import SimpleDirectoryReader
import asyncio


df_for_loading = all_data.copy()  # can use tri_circulars_df alternatively
# Function to get metadata from DataFrame based on file_path
def get_meta(file_path):
    # Filter once, then take the first matching row (None if the path is unknown)
    matches = df_for_loading[df_for_loading['markdown_path'] == file_path]
    matching_row = matches.iloc[0] if not matches.empty else None
    if matching_row is None:
        return {
            "filename": 'Unknown',
            "class": 'Unknown',
            "issuing_authority": 'Unknown',
            "has_tables": False,
            "llama_title": 'Unknown Title',
            "llama_issue_date": 'Unknown Date',
            "llama_reference_number": 'Unknown Reference'
        }
    return {
        "filename": matching_row['filename'], # included in reference number and issue date and title
        "class": matching_row['class'], # included in issuing_authority; also good because regulation doesn't really cover egz, since that is laws+regulation, and its issuing authority says egz, not regulation
        "issuing_authority": matching_row['issuing_authority'],
        "has_tables": matching_row.get('has_tables', False),
        "llama_title": matching_row.get('llama_title', 'Unknown Title'),
        "llama_issue_date": matching_row.get('llama_issue_date', 'Unknown Date'),
        "llama_reference_number": matching_row.get('llama_reference_number', 'Unknown Reference')
    }



markdown_paths = df_for_loading['markdown_path'].tolist()
# Create a SimpleDirectoryReader instance
reader = SimpleDirectoryReader(input_files=markdown_paths, filename_as_id=True, file_metadata=get_meta)
# Asynchronously load data
documents = await reader.aload_data()
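
`get_meta` above filters the whole DataFrame once per file, which is fine for a few hundred rows but scales poorly. A possible alternative, sketched here under the assumption that the rows are available as a list of dicts (for example via `df_for_loading.to_dict("records")`), is to build a path-to-metadata dictionary once. `build_meta_lookup`, `get_meta_fast`, and the sample row are hypothetical, and only three of the metadata keys are shown.

```python
DEFAULT_META = {"filename": "Unknown", "class": "Unknown", "issuing_authority": "Unknown"}
META_KEYS = ("filename", "class", "issuing_authority")


def build_meta_lookup(records: list) -> dict:
    """Map each markdown_path to its metadata dict, built once up front."""
    return {
        row["markdown_path"]: {key: row.get(key, DEFAULT_META[key]) for key in META_KEYS}
        for row in records
    }


# Hypothetical row, shaped like one element of df_for_loading.to_dict("records")
records = [
    {
        "markdown_path": "/data/ACT_1.md",
        "filename": "ACT_1.txt",
        "class": "ACT",
        "issuing_authority": "Parliament of Sri Lanka",
    },
]
lookup = build_meta_lookup(records)


def get_meta_fast(file_path: str) -> dict:
    # Dictionary lookup with a fresh copy of the defaults for unknown paths
    return lookup.get(file_path, dict(DEFAULT_META))


known = get_meta_fast("/data/ACT_1.md")
unknown = get_meta_fast("/data/missing.md")
```

A function like `get_meta_fast` could be passed as `file_metadata` in the same way as `get_meta`, since both take a file path and return a dict.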

In [None]:
print([document.doc_id for document in documents])

['/content/drive/MyDrive/Omdena_Challenge/new_LK_tea_dataset/ACT/ACT_1.md_part_0', '/content/drive/MyDrive/Omdena_Challenge/new_LK_tea_dataset/ACT/ACT_1.md_part_1', '/content/drive/MyDrive/Omdena_Challenge/new_LK_tea_dataset/ACT/ACT_1.md_part_2', '/content/drive/MyDrive/Omdena_Challenge/new_LK_tea_dataset/ACT/ACT_1.md_part_3', '/content/drive/MyDrive/Omdena_Challenge/new_LK_tea_dataset/ACT/ACT_1.md_part_4', '/content/drive/MyDrive/Omdena_Challenge/new_LK_tea_dataset/ACT/ACT_1.md_part_5', '/content/drive/MyDrive/Omdena_Challenge/new_LK_tea_dataset/ACT/ACT_1.md_part_6', '/content/drive/MyDrive/Omdena_Challenge/new_LK_tea_dataset/ACT/ACT_1.md_part_7', '/content/drive/MyDrive/Omdena_Challenge/new_LK_tea_dataset/ACT/ACT_1.md_part_8', '/content/drive/MyDrive/Omdena_Challenge/new_LK_tea_dataset/ACT/ACT_1.md_part_9', '/content/drive/MyDrive/Omdena_Challenge/new_LK_tea_dataset/ACT/ACT_1.md_part_10', '/content/drive/MyDrive/Omdena_Challenge/new_LK_tea_dataset/ACT/ACT_1.md_part_11', '/content/dri

In [None]:
for document in documents:
  document.excluded_llm_metadata_keys = ['file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date', 'has_tables'] # llm gets biased because of issuing_authority, which is TRI circulars, or government of Sri Lanka egz
  document.excluded_embed_metadata_keys = ['file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date']

In [None]:
from llama_index.core.schema import MetadataMode

print(documents[0].get_content(metadata_mode=MetadataMode.LLM))

filename: ACT_1.txt
class: regulatory
issuing_authority: Parliament of Sri Lanka
llama_title: Tea Subsidy Act, No. 12 of 1958
llama_issue_date: 19th September, 1958
llama_reference_number: 12 of 1958, 66 of 1961, 33 of 1966



Tea Subsidy

Acts Nos. 12 of 1958, 66 of 1961. 33 of 1966.


In [None]:
from llama_index.core.schema import MetadataMode

print(documents[0].get_content(metadata_mode=MetadataMode.EMBED))

filename: ACT_1.txt
class: regulatory
issuing_authority: Parliament of Sri Lanka
has_tables: no
llama_title: Tea Subsidy Act, No. 12 of 1958
llama_issue_date: 19th September, 1958
llama_reference_number: 12 of 1958, 66 of 1961, 33 of 1966



Tea Subsidy

Acts Nos. 12 of 1958, 66 of 1961. 33 of 1966.
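
The only difference between the two outputs above is the `has_tables` line, which is excluded for the LLM but kept for the embedding model. The small renderer below imitates that behaviour using the same `'{key}: {value}'` template and `'{metadata_str}\n\n{content}'` layout that appear in the `Document` repr. It is a simplified sketch, not LlamaIndex's code; `render` and the sample metadata are hypothetical.

```python
def render(metadata: dict, content: str, excluded_keys: list) -> str:
    """Join non-excluded metadata as 'key: value' lines, then append the content."""
    metadata_str = "\n".join(
        f"{key}: {value}"
        for key, value in metadata.items()
        if key not in excluded_keys
    )
    return f"{metadata_str}\n\n{content}"


metadata = {"filename": "ACT_1.txt", "has_tables": "no", "file_size": 1024}
llm_excluded = ["file_size", "has_tables"]
embed_excluded = ["file_size"]

llm_view = render(metadata, "Tea Subsidy", llm_excluded)
embed_view = render(metadata, "Tea Subsidy", embed_excluded)
print(llm_view)
print("-----")
print(embed_view)
```

Keeping a key in the embedding view lets it influence retrieval, while hiding it from the LLM view keeps it out of the prompt.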


In [None]:
documents[0]

Document(id_='/content/drive/MyDrive/Omdena_Challenge/new_LK_tea_dataset/ACT/ACT_1.md_part_0', embedding=None, metadata={'filename': 'ACT_1.txt', 'class': 'regulatory', 'issuing_authority': 'Parliament of Sri Lanka', 'has_tables': 'no', 'llama_title': 'Tea Subsidy Act, No. 12 of 1958', 'llama_issue_date': '19th September, 1958', 'llama_reference_number': '12 of 1958, 66 of 1961, 33 of 1966'}, excluded_embed_metadata_keys=['file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date', 'has_tables'], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text='\n\nTea Subsidy\n\nActs Nos. 12 of 1958, 66 of 1961. 33 of 1966.\n', mimetype='text/plain', start_char_idx=None, end_char_idx=None, metadata_seperator='\n', text_template='{metadata_str}\n\n{content}')

# test 1: [BASELINE llama 3 70b][LOCAL] Create temp lancedb vector index and query the parsed documents
not the ones saved as csv (that is for our viewing), but the variable returned by llamaparse

Llama 3 70B works great on longer context, but once we start extracting metadata per node, we need a model that allows more tokens and requests per day on Groq.



In [None]:
from llama_index.core.node_parser import MarkdownNodeParser


node_parser = MarkdownNodeParser()
nodes = node_parser.get_nodes_from_documents(documents)
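
To show what header-based splitting means in practice, here is a simplified standard-library sketch that cuts markdown at header lines and records the path of parent headers for each section. It is an approximation for illustration, not `MarkdownNodeParser`'s implementation; for instance, it does not skip `#` characters inside code fences. `split_markdown` is a hypothetical name.

```python
import re

HEADER = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown(text: str) -> list:
    """Split markdown into sections at headers, tracking each section's parent path."""
    sections = []
    open_headers = []  # stack of (level, title) for headers enclosing the current line
    lines = []         # lines of the section currently being collected
    path = "/"         # parent header path of the current section

    def flush():
        body = "\n".join(lines).strip()
        if body:
            sections.append({"header_path": path, "text": body})

    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            flush()  # close the previous section before starting a new one
            lines.clear()
            level, title = len(match.group(1)), match.group(2).strip()
            # Drop headers at the same or deeper level; they are not parents.
            while open_headers and open_headers[-1][0] >= level:
                open_headers.pop()
            parents = "/".join(t for _, t in open_headers)
            path = f"/{parents}/" if parents else "/"
            open_headers.append((level, title))
        lines.append(line)
    flush()
    return sections


sample = "# Tea Subsidy\n\nActs Nos. 12 of 1958.\n\n## Section 1\n\nShort title.\n"
sections = split_markdown(sample)
```

Here the top-level section gets `header_path` `/`, matching the value seen on the first node of each document, and the nested section gets `/Tea Subsidy/`.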

In [None]:
nodes[0]

TextNode(id_='65c023d8-0335-4221-99da-9ce963507b5d', embedding=None, metadata={'filename': 'ACT_1.txt', 'class': 'regulatory', 'issuing_authority': 'Parliament of Sri Lanka', 'has_tables': 'no', 'llama_title': 'Tea Subsidy Act, No. 12 of 1958', 'llama_issue_date': '19th September, 1958', 'llama_reference_number': '12 of 1958, 66 of 1961, 33 of 1966', 'header_path': '/'}, excluded_embed_metadata_keys=['file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date', 'has_tables'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='/content/drive/MyDrive/Omdena_Challenge/new_LK_tea_dataset/ACT/ACT_1.md_part_0', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'filename': 'ACT_1.txt', 'class': 'regulatory', 'issuing_authority': 'Parliament of Sri Lanka', 'has_tables': 'no', 'llama_title': 'Tea Subsidy Act, No. 12 of 1958', 'llama_issu

### currently here

In [None]:
# basic vector store
# from llama_index.core import VectorStoreIndex

# vector_index = VectorStoreIndex.from_documents(documents)
# vector_index.as_query_engine()

In [None]:
import os
import getpass

# # os.environ['LLAMA_CLOUD_API_KEY'] = getpass.getpass('Enter your LLamacloud API Key: ')
os.environ['GROQ_API_KEY'] = getpass.getpass('Enter your GROQ API Key: ')

Enter your GROQ API Key: ··········


In [None]:
from llama_index.llms.groq import Groq
from llama_index.embeddings.fastembed import FastEmbedEmbedding


embed_model = FastEmbedEmbedding(model_name="BAAI/bge-small-en-v1.5")
llm = Groq(model="llama3-70b-8192", api_key=os.environ['GROQ_API_KEY'], temperature=0.0)


In [None]:
# # can use the previous simple node parser or this whole ingestion pipeline
# # right now used simple nodeparser, used this one with kdb.ai code below

# from llama_index.core.extractors import (
#     TitleExtractor,
#     QuestionsAnsweredExtractor,
# )
# from llama_index.core.node_parser import MarkdownNodeParser


# node_parser = MarkdownNodeParser()
# # text_splitter = TokenTextSplitter(
# #     separator=" ", chunk_size=512, chunk_overlap=128
# # )
# title_extractor = TitleExtractor(llm = llm, nodes=5)
# qa_extractor = QuestionsAnsweredExtractor(llm = llm, questions=3)

# # assume documents are defined -> extract nodes
# from llama_index.core.ingestion import IngestionPipeline

# pipeline = IngestionPipeline(
#     transformations=[node_parser, title_extractor, qa_extractor]
# )

# nodes = pipeline.run(
#     documents=documents,
#     in_place=True,
#     show_progress=True,
# )

In [None]:
from llama_index.vector_stores.lancedb import LanceDBVectorStore
from llama_index.core import VectorStoreIndex
from llama_index.core import StorageContext

vector_store_lancedb = LanceDBVectorStore(uri="/tmp/lancedb_lamaindex")
storage_context = StorageContext.from_defaults(vector_store=vector_store_lancedb)

index = VectorStoreIndex(
    nodes=nodes,
    storage_context=storage_context,
    embed_model=embed_model,
)

query_engine = index.as_query_engine(similarity_top_k=5, llm=llm)


In [None]:
query = "What is relation between law, regulation, circular and guideline? give examples from filenames, and format as paragraph"
response_1 = query_engine.query(query)
print("\n**LlamaParse+ Lamaindex**")
print(response_1)


**LlamaParse+ Lamaindex**
The relationship between law, regulation, circular, and guideline can be understood as a hierarchical structure. Laws are the highest authority, providing the overall framework. Regulations are derived from laws, outlining specific rules and procedures. Circulars, such as OR/1/51, are issued by authorities like the Tea Board, providing detailed instructions and clarifications on implementing regulations. Guidelines, although not explicitly mentioned, can be seen as internal documents providing additional guidance on circulars. For instance, the Sri Lanka Tea Board might have internal guidelines on how to implement the directives outlined in Circular OR/1/51.


In [None]:
query = "Should laws and regulations be put in the same class for effective information retrieval? format as paragraph"
response_1 = query_engine.query(query)
print("\n**LlamaParse+ Lamaindex**")
print(response_1)


**LlamaParse+ Lamaindex**
It appears that laws and regulations are distinct types of documents, each serving a specific purpose. Laws, as represented by the "GAZETTE EXTRAORDINARY OF THE DEMOCRATIC SOCIALIST REPUBLIC OF SRI LANKA", are official government publications that contain new legislation or amendments to existing laws. On the other hand, regulations, as exemplified by the "CIRCULAR" issued by the Tea Board, are guidelines or rules that govern specific industries or sectors. Given their different natures and purposes, it may be beneficial to categorize them separately for effective information retrieval, allowing users to quickly identify and access the relevant type of document.


In [None]:
query = "Are acts, laws, and extra gazettes semantically similar and can they be put in the same class for effective information retrieval?"
response_1 = query_engine.query(query)
print("\n**LlamaParse+ Lamaindex**")
print(response_1)


**LlamaParse+ Lamaindex**
Based on the provided context, it appears that "GAZETTE EXTRAORDINARY OF THE DEMOCRATIC SOCIALIST REPUBLIC OF SRI LANKA" is a type of official document issued by the Government of Sri Lanka, which suggests that it may contain laws, acts, or other regulatory information. 

Given this, it can be inferred that gazettes, laws, and acts share a similar purpose and content, implying a certain level of semantic similarity. Therefore, it is reasonable to group them into the same class, which would facilitate effective information retrieval for users searching for related official documents or regulatory information.


In [None]:
query = "give common topics and patterns of each class of documents i.e regulatory, circular, and guideline"
response_1 = query_engine.query(query)
print("\n**LlamaParse+ Lamaindex**")
print(response_1)


**LlamaParse+ Lamaindex**
Based on the provided context information, here are the common topics and patterns for each class of documents:

**Circular:**

* Topics: Exportation of tea, preparation of customs goods declaration, and related procedures.
* Patterns: The documents in this class have a specific format, including a reference number, issue date, and a clear title indicating the purpose of the circular. They provide instructions or guidelines for a specific process or procedure.

**Guideline:**

* Topics: Standards and guidelines for tea, quality control, and related regulations.
* Patterns: The documents in this class provide detailed guidelines and standards for the tea industry, including specifications and requirements for tea production and exportation. They are often issued by a regulatory authority and have a formal tone.

**Regulatory:**

* (No documents in this class are provided in the context information)

Note: Since there are no documents in the "Regulatory" class,

In [None]:
query = "give common topics and patterns of each issuing authority of documents"
response_1 = query_engine.query(query)
print("\n**LlamaParse+ Lamaindex**")
print(response_1)


**LlamaParse+ Lamaindex**
Based on the provided information, the following common topics and patterns can be observed for each issuing authority:

**Government of Sri Lanka**

* Topic: Gazette publications, regulatory documents
* Pattern: The documents issued by the Government of Sri Lanka seem to be gazette publications, which are official documents that contain laws, regulations, and other important announcements.

**Tea Board**

* Topic: Tea exportation, customs regulations
* Pattern: The document issued by the Tea Board appears to be related to the exportation of tea and the preparation of customs goods declaration forms, suggesting that the Tea Board is responsible for regulating and guiding the tea industry in Sri Lanka.


In [None]:
query = "give best node parser to optimize RAG chatbot for each class of documents"
response_1 = query_engine.query(query)
print("\n**LlamaParse+ Lamaindex**")
print(response_1)


**LlamaParse+ Lamaindex**
For the class of circular documents, I recommend using a Node Parser that leverages the structured format of the documents. A suitable parser would be the `xml2js` parser, which can efficiently extract relevant information from the circular documents. This parser is particularly effective in handling the hierarchical structure of the documents, making it easier to optimize the RAG chatbot for this class of documents.


In [None]:
query = "which approach should I take to optimize RAG chatbot in terms of vector databases and graph databases for each class of documents"
response_1 = query_engine.query(query)
print("\n**LlamaParse+ Lamaindex**")
print(response_1)


**LlamaParse+ Lamaindex**
To optimize the RAG chatbot, a hybrid approach can be taken. For the "circular" class of documents, a vector database can be utilized to store and query the semantic embeddings of the documents. This will enable efficient similarity searches and clustering of similar documents.

On the other hand, a graph database can be employed to store the relationships between the different entities mentioned in the documents, such as the issuing authorities, reference numbers, and dates. This will facilitate querying and traversal of the graph to uncover connections between entities.

By combining these two approaches, the RAG chatbot can leverage the strengths of both vector and graph databases to provide more accurate and informative responses to user queries.


In [None]:
query = "will keyword based indexing give good retrieval results? discuss for each class of documents"
response_1 = query_engine.query(query)
print("\n**LlamaParse+ Lamaindex**")
print(response_1)


**LlamaParse+ Lamaindex**
Based on the provided context information, we can analyze the documents and discuss the effectiveness of keyword-based indexing for each class of documents.

The documents belong to the class "circular" and are issued by the Tea Board. They contain specific information related to the tea industry, such as regulations, guidelines, and application forms.

For the class "circular," keyword-based indexing might be effective for retrieving documents that contain specific keywords related to the tea industry, such as "tea auction," "licensed tea brokers," "handwritten marks," "warehouse registration," "payment of reasonable price," and "damaged made tea." These keywords are likely to be relevant and frequently occurring in the documents, making it easier to retrieve them using keyword-based indexing.

However, there are some limitations to consider:

1. **Domain-specific terminology**: The documents contain domain-specific terminology, such as "Trav. No/s," "Garden

In [None]:
query = "should I use sparse or dense vectors for similarity search for retrieval? discuss for each class of documents?"
response_1 = query_engine.query(query)
print("\n**LlamaParse+ Lamaindex**")
print(response_1)


**LlamaParse+ Lamaindex**
Based on the provided context, it appears that the documents are related to the Tea Board and contain circulars, instructions, and guidelines for tea exporters, manufacturers, and traders. The content of the documents seems to be focused on procedures, regulations, and calculations related to tea trade and production.

When considering similarity search for retrieval, the choice between sparse and dense vectors depends on the characteristics of the document collection and the desired retrieval performance.

For the circular class of documents (e.g., TCEX-VI-8 2001 09 11.pdf, AL-MRL-07 2007 08 21 (10).pdf), which contain instructions and guidelines, I would recommend using sparse vectors. These documents typically have a structured format, with specific sections and keywords that are relevant to the topic. Sparse vectors are well-suited for capturing the significance of these keywords and can effectively represent the document's content. Additionally, sparse v

# test 2: [BASELINE # 2][Top k similarity + keyword/entity extracted + currently flat vector RAG] [llama 3.1 8b instant - very reliable][KDB.ai] Create kdb.ai vector index and query the parsed documents

In [None]:
# lots of unused imports; will sift through them later

!pip install python-dotenv
!pip install --upgrade  llama-index llama-index-core llama-index-readers-file llama-index-extractors-entity
!pip install llama-index-vector-stores-lancedb
!pip install llama-index-llms-groq llama-index-embeddings-fastembed


In [None]:
!pip install llama-index-vector-stores-kdbai
!pip install kdbai_client pandas



Until now, we were using a temporary LanceDB vector database; from here on we use KDB.AI:

## Define KDB.AI Session & Database
KDB.AI comes in two offerings:

* KDB.AI Cloud - For experimenting with smaller generative AI projects with a vector database in our cloud.
* KDB.AI Server - For evaluating large scale generative AI applications on-premises or on your own cloud provider.

Depending on which you use there will be different setup steps and connection details required.

Option 1. KDB.AI Cloud
To use KDB.AI Cloud, you will need two session details - a URL endpoint and an API key. To get these you can sign up for free here.

You can connect to a KDB.AI Cloud session using kdbai.Session and passing the session URL endpoint and API key details from your KDB.AI Cloud portal.

If the environment variables `KDBAI_ENDPOINT` and `KDBAI_API_KEY` exist on your system and contain your KDB.AI Cloud portal details, they will be used automatically to connect. If they do not exist, you will be prompted to enter your KDB.AI Cloud portal session URL endpoint and API key.

### KDB.AI Cloud

In [None]:
import os
import getpass

os.environ['KDBAI_ENDPOINT'] = getpass.getpass('Enter your kdb.ai endpoint: ')
os.environ['KDBAI_API_KEY'] = getpass.getpass('Enter your kdb.ai API Key: ')

Enter your kdb.ai endpoint: ··········
Enter your kdb.ai API Key: ··········


In [None]:
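# Set up the KDB.AI endpoint and API key, prompting only if the environment variable is missing
KDBAI_ENDPOINT = (
    os.environ["KDBAI_ENDPOINT"]
    if "KDBAI_ENDPOINT" in os.environ
    else input("KDB.AI endpoint: ")
)
KDBAI_API_KEY = (
    os.environ["KDBAI_API_KEY"]
    if "KDBAI_API_KEY" in os.environ
    else getpass.getpass("KDB.AI API key: ")  # `getpass` is the module imported above, so call getpass.getpass
)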

In [None]:
# vector DB imports
import kdbai_client as kdbai

### Start Session with KDB.AI Cloud
session = kdbai.Session(api_key=KDBAI_API_KEY, endpoint=KDBAI_ENDPOINT)

### Verify Defined Databases

We can check our connection using the `session.databases()` function.
This will return a list of all the databases we have defined in our vector database thus far.
This should return a "default" database along with any other databases you have already created.

In [None]:
session.databases()

[KDBAI database "default", KDBAI database "srilanka_tea"]

### Create a Database called "srilanka_tea"

In [None]:
# ensure no database called "srilanka_tea" exists
try:
    session.database("srilanka_tea").drop()
except kdbai.KDBAIException:
    pass

In [None]:
# Create the database
db = session.create_database("srilanka_tea")

## Create Schema, Indexes and KDB.AI Table

Now, let us define the schema that will be used to create the KDB.AI table.

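The `document_id` and `text` columns hold the unique identifier and the raw text chunk.

The `embeddings` column holds the dense vector for each chunk (384 dimensions, matching the flat index defined below). This baseline table has no sparse-vector column.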

In [None]:
# Table - name & schema
table_name = "rag_baseline"

table_schema = [
        dict(name="document_id", type="bytes"),
        dict(name="text", type="bytes"),
        dict(name="embeddings", type="float32s"),
        # dict(name='metadata', type= "bytes"),  # Metadata as a string (can use JSON if needed)
        # dict(name="relationships", type="str"),
    ]

indexFlat = {
        "name": "flat_index",
        "type": "flat",
        "column": "embeddings",
        "params": {'dims': 384, 'metric': 'L2'},
    }
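
For intuition, a `flat` index with the `L2` metric can be thought of as a brute-force scan: the query vector is compared against every stored vector by Euclidean distance and the closest `k` rows are returned. The sketch below illustrates that idea in plain Python with toy 2-dimensional vectors. It is not KDB.AI's implementation, and `l2_distance` and `flat_search` are hypothetical helper names.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def flat_search(query, vectors, k=5):
    """Score every stored vector against the query and return the k nearest.

    Returns a list of (row_index, distance) pairs, nearest first.
    """
    scored = [(i, l2_distance(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Toy data: three stored 2-d vectors and one query.
stored = [[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]]
nearest = flat_search([0.0, 0.0], stored, k=2)
print(nearest)
```

Because every row is scanned, results are exact, at the cost of query time that grows linearly with the number of stored vectors.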

In [None]:
# List all of the tables in the db
db.tables

[]

In [None]:
# First ensure the table does not already exist
try:
    db.table("rag_baseline").drop()
except kdbai.KDBAIException:
    pass

In [None]:
# Create table
table = db.create_table(table_name, table_schema, indexes=[indexFlat])

In [None]:
db.tables

[KDBAI table "rag_baseline"]

In [None]:
table.indexes

[{'name': 'flat_index',
  'type': 'flat',
  'column': 'embeddings',
  'params': {'metric': 'L2', 'dims': 384}}]

## Insert data into the KDB.AI Table

In [None]:
from llama_index.vector_stores.kdbai import KDBAIVectorStore
from llama_index.core import StorageContext
from llama_index.core import Settings
from llama_index.core.indices import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

In [None]:


# vector_store_lancedb = LanceDBVectorStore(uri="/tmp/lancedb_lamaindex")
# storage_context = StorageContext.from_defaults(vector_store=vector_store_lancedb)

# index = VectorStoreIndex(
#     nodes=nodes,
#     storage_context=storage_context,
#     embed_model=embed_model,
# )

# query_engine = index.as_query_engine(similarity_top_k=5, llm=llm)


In [None]:
%%time

from llama_index.core.callbacks import LlamaDebugHandler
from llama_index.core.callbacks import CallbackManager


# Use LlamaDebugHandler to print a trace of the captured callback events (e.g. templating and LLM calls) when each operation ends
llama_debug = LlamaDebugHandler(print_trace_on_end=True)
callback_manager = CallbackManager([llama_debug])

CPU times: user 49 µs, sys: 11 µs, total: 60 µs
Wall time: 62.9 µs


In [None]:
import os
import getpass

# os.environ['LLAMA_CLOUD_API_KEY'] = getpass.getpass('Enter your LLamacloud API Key: ')
os.environ['GROQ_API_KEY'] = getpass.getpass('Enter your GROQ API Key: ')

Enter your GROQ API Key: ··········


In [None]:
from llama_index.llms.groq import Groq
from llama_index.embeddings.fastembed import FastEmbedEmbedding


embeddings_model = FastEmbedEmbedding(model_name="BAAI/bge-small-en-v1.5")
# llm_model = Groq(model="llama3-70b-8192", api_key=os.environ['GROQ_API_KEY'], temperature=0.0)
llm_model = Groq(model="llama-3.1-8b-instant", api_key=os.environ['GROQ_API_KEY'], temperature=0.0)


In [None]:

import nest_asyncio
nest_asyncio.apply()

In [None]:
# use keyword or entity extraction for the circular class, and metadata-based indexing for the regulation class;
# the title extractor is redundant (already done), and so are QA and summary extraction
# focus on keyword or entity extraction

In [None]:
from llama_index.core.node_parser import MarkdownNodeParser
from llama_index.core.extractors import (
    SummaryExtractor,
    QuestionsAnsweredExtractor,
    TitleExtractor,
    KeywordExtractor,
)
from llama_index.extractors.entity import EntityExtractor

# transformations = [
#     SentenceSplitter(),
#     TitleExtractor(nodes=5),
#     QuestionsAnsweredExtractor(questions=3),
#     SummaryExtractor(summaries=["prev", "self"]),
#     KeywordExtractor(keywords=10),
#     EntityExtractor(prediction_threshold=0.5),
# ]


# Vector Store
text_store = KDBAIVectorStore(table=table)

# Storage context
storage_context = StorageContext.from_defaults(vector_store=text_store)
# title_extractor = TitleExtractor(llm = llm_model,  nodes=5)
# qa_extractor = QuestionsAnsweredExtractor(llm = llm_model, questions=3)
keyword_extractor = KeywordExtractor(llm = llm_model, keywords=5)
entity_extractor =  EntityExtractor(llm = llm_model, prediction_threshold=0.5)


# Settings
Settings.callback_manager = callback_manager
# Settings.transformations = [SentenceSplitter(chunk_size=500, chunk_overlap=0)]
Settings.transformations = [MarkdownNodeParser(),  keyword_extractor, entity_extractor]
Settings.embed_model = embeddings_model
Settings.llm = llm_model



# Vector Store Index
index = VectorStoreIndex.from_documents(
    documents,
    use_async=True,
    storage_context=storage_context,
)

  2%|▏         | 39/2523 [00:21<1:12:02,  1.74s/it]

**********
Trace: index_construction
    |_templating -> 6.3e-05 seconds
    |_llm -> 0.68085 seconds
    |_templating -> 0.000155 seconds
    |_llm -> 0.648143 seconds
    |_templating -> 5.7e-05 seconds
    |_llm -> 0.517263 seconds
    |_templating -> 6e-05 seconds
    |_llm -> 0.518295 seconds
    |_templating -> 2.9e-05 seconds
    |_llm -> 0.150505 seconds
    |_templating -> 2.8e-05 seconds
    |_llm -> 0.164566 seconds
    |_templating -> 2.9e-05 seconds
    |_llm -> 0.162837 seconds
    |_templating -> 4.1e-05 seconds
    |_llm -> 0.170562 seconds
    |_templating -> 6e-05 seconds
    |_llm -> 0.18884 seconds
    |_templating -> 2.9e-05 seconds
    |_llm -> 0.17966 seconds
    |_templating -> 2.9e-05 seconds
    |_llm -> 0.190695 seconds
    |_templating -> 2.7e-05 seconds
    |_llm -> 0.177223 seconds
    |_templating -> 3e-05 seconds
    |_llm -> 0.178366 seconds
    |_templating -> 2.7e-05 seconds
    |_llm -> 0.228683 seconds
    |_templating -> 3.7e-05 seconds
    |_llm -

RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit reached for model `llama-3.1-8b-instant` in organization `org_01jcn2qw9aehtvc776pg7kbzv6` on requests per minute (RPM): Limit 30, Used 30, Requested 1. Please try again in 1.805s. Visit https://console.groq.com/docs/rate-limits for more information.', 'type': 'requests', 'code': 'rate_limit_exceeded'}}

make the above pipeline async if possible
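
Running the extractors concurrently would probably not avoid the 429 shown above by itself, since the limit is on requests per minute. One option is to cap concurrency and retry rate-limited calls with exponential backoff. The sketch below uses only the standard library. `RateLimited`, `call_with_backoff`, and `gather_limited` are hypothetical names, and `RateLimited` is a stand-in for whatever exception the LLM client actually raises.

```python
import asyncio
import random


class RateLimited(Exception):
    """Hypothetical stand-in for a provider's HTTP 429 error."""


async def call_with_backoff(make_call, max_retries=5, base_delay=2.0, sleep=asyncio.sleep):
    """Await make_call(), retrying on RateLimited with exponential backoff plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return await make_call()
        except RateLimited:
            if attempt == max_retries:
                raise  # give up after the last allowed retry
            # Delay doubles on each attempt; a little jitter avoids synchronized retries.
            await sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))


async def gather_limited(calls, max_concurrency=4, **backoff_kwargs):
    """Run zero-argument coroutine factories with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(make_call):
        async with semaphore:
            return await call_with_backoff(make_call, **backoff_kwargs)

    # Results come back in the same order as `calls`.
    return await asyncio.gather(*(run_one(c) for c in calls))
```

Each element of `calls` is a zero-argument function returning a fresh coroutine (for example a `lambda` wrapping one async LLM request), so a failed request can be re-issued on retry.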

In [None]:
# from llama_index.core import Document
# from llama_index.embeddings.openai import OpenAIEmbedding
# from llama_index.core.node_parser import SentenceSplitter
# from llama_index.core.extractors import TitleExtractor
# from llama_index.core.ingestion import IngestionPipeline, IngestionCache
# from llama_index.vector_stores.qdrant import QdrantVectorStore

# import qdrant_client

# # Initialize the Qdrant client and vector store
# client = qdrant_client.QdrantClient(location=":memory:")
# vector_store = QdrantVectorStore(client=client, collection_name="test_store")

# # Create the pipeline with transformations
# pipeline = IngestionPipeline(
#     transformations=[
#         SentenceSplitter(chunk_size=25, chunk_overlap=0),  # Split the document into smaller chunks
#         TitleExtractor(),  # Extract the document title
#         OpenAIEmbedding(),  # Apply OpenAI embeddings to document chunks
#     ],
#     vector_store=vector_store  # Optionally, store results in a vector store
# )

# # Run the pipeline on a list of documents (in this case, just an example document)
# documents = [Document.example()]
# pipeline.run(documents=documents)


# # Connect to Qdrant and initialize the vector store
# client = qdrant_client.QdrantClient(location=":memory:")
# vector_store = QdrantVectorStore(client=client, collection_name="documents_collection")

# # Create a pipeline that uses the vector store
# pipeline = IngestionPipeline(
#     transformations=[
#         SentenceSplitter(chunk_size=25, chunk_overlap=0),
#         TitleExtractor(),
#         OpenAIEmbedding(),
#     ],
#     vector_store=vector_store,
# )

# # Run the pipeline
# pipeline.run(documents=documents)


# # Save the cache
# pipeline.persist("./pipeline_storage")

# # Load the cache and restore the pipeline state
# new_pipeline = IngestionPipeline(
#     transformations=[
#         SentenceSplitter(chunk_size=25, chunk_overlap=0),
#         TitleExtractor(),
#     ],
# )
# new_pipeline.load("./pipeline_storage")

# # Running the pipeline now will be faster due to the cache
# nodes = new_pipeline.run(documents=documents)


# # Asynchronous document processing
# nodes = await pipeline.arun(documents=documents)

# # Run the pipeline with parallel processing using multiprocessing
# pipeline.run(documents=documents, num_workers=4)


# from llama_index.core.storage.docstore import SimpleDocumentStore

# # Initialize the document store
# docstore = SimpleDocumentStore()

# # Create a pipeline that uses the document store
# pipeline = IngestionPipeline(
#     transformations=[
#         SentenceSplitter(chunk_size=25, chunk_overlap=0),
#         TitleExtractor(),
#         OpenAIEmbedding(),
#     ],
#     docstore=docstore,
# )

# # Run the pipeline
# pipeline.run(documents=documents)


# import asyncio
# from llama_index.core import Document
# from llama_index.embeddings.openai import OpenAIEmbedding
# from llama_index.core.node_parser import SentenceSplitter
# from llama_index.core.extractors import TitleExtractor
# from llama_index.core.ingestion import IngestionPipeline
# from llama_index.vector_stores.qdrant import QdrantVectorStore
# import qdrant_client

# # Define the async processing pipeline
# async def async_pipeline(documents):
#     client = qdrant_client.QdrantClient(location=":memory:")
#     vector_store = QdrantVectorStore(client=client, collection_name="test_collection")

#     pipeline = IngestionPipeline(
#         transformations=[
#             SentenceSplitter(chunk_size=25, chunk_overlap=0),
#             TitleExtractor(),
#             OpenAIEmbedding(),
#         ],
#         vector_store=vector_store
#     )

#     # Run the pipeline asynchronously
#     nodes = await pipeline.arun(documents=documents)
#     return nodes

# # Example document list
# documents = [Document.example()]

# # Run the pipeline and await the result
# result = asyncio.run(async_pipeline(documents))

# # Print the processed nodes
# print(result)

## Index as Vector Query Engine

Define a better query engine that uses the returned metadata: https://docs.llamaindex.ai/en/stable/examples/metadata_extraction/MetadataExtraction_LLMSurvey/


### added top k similarity = 5 retrieval + [more to add]: keyword-based querying; also try both dense and sparse vectors, add a query router for multiple classes, add a subquery option to break longer queries into shorter ones, and try the Cohere reranker
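
If both a dense and a sparse (keyword) retriever are tried, their ranked results need to be merged somehow. One common option is reciprocal rank fusion (RRF), sketched below in plain Python. This is an assumption about how the two rankings might be combined, not something this notebook does yet. `reciprocal_rank_fusion` is a hypothetical helper, the document ids are made up, and `k=60` is a conventional smoothing constant.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one list, best first.

    Each id scores 1 / (k + rank) per list it appears in (rank starts at 1);
    scores are summed across lists. Ties are broken alphabetically by id.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Made-up ids standing in for retrieved chunks.
dense_ranking = ["a", "b", "c"]
sparse_ranking = ["b", "d", "a"]
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)
```

RRF uses only rank positions, so it does not require the dense and sparse scores to be on comparable scales.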


## try ReAct to work with all of this

In [None]:
# Vector query engine
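vector_query_engine = index.as_query_engine(
    similarity_top_k=5,
    llm=llm_model,  # the llama-3.1-8b-instant model defined for this test
    vector_store_kwargs={
        "index": "flat_index",  # name of the index created with the table above
    },
)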

In [None]:

query = "summarize all regulatory changes that happened in 2022 related to tea"
response_1 = vector_query_engine.query(query)
print("\n**LlamaParse+ Lamaindex**")
print(response_1)

In [None]:
# %%time

# # Using gpt-4o-mini, the 128k tokens context size can take 100 pages.
# K = 5

# query_engine = index.as_query_engine(
#                 similarity_top_k=K,
#                 vector_store_kwargs={
#                         "index" : "flat_index",
#                         "filter" : [["<", "publication_date", datetime.date(2008,9,15)]],
#                         "sort_columns" : "publication_date"
#                         }
#         )

In [None]:
# %%time

# # Using gpt-4o-mini, the 128k tokens context size can take 100 pages.
# K = 15

# query_engine = index.as_query_engine(
#                 similarity_top_k=K,
#                 vector_store_kwargs={
#                         "index" : "flat_index",
#                         "filter" : [[">=", "publication_date", datetime.date(2008,9,15)]],
#                         "sort_columns" : "publication_date"
#                         }
#         )

In [None]:
# response = query_engine.query(
#     "what did the president say about ukraine, what are the four common sense steps and how he planned to fight inflation?"
# )

In [None]:
# %%time

# # Using gpt-4o-mini, the 128k tokens context size can take 100 pages.
# K = 20

# query_engine = index.as_query_engine(
#                 similarity_top_k=K,
#                 vector_store_kwargs={
#                         "index" : "flat_index",
#                         "sort_columns" : "publication_date"
#                         }
#         )

In [None]:
# %%time

# result = query_engine.query(
#     """
#     What happened on the 15th of September 2008 ?
#     """
# )
# print(result.response)

## Delete the KDB.AI Table

Once finished with the table, it is best practice to drop it.

In [None]:
table.drop()

# If integrating with llamaparse

https://colab.research.google.com/github/KxSystems/kdbai-samples/blob/main/LlamaParse_pdf_RAG/llamaParse_demo.ipynb

# Another metadata filtering example:

https://colab.research.google.com/github/KxSystems/kdbai-samples/blob/main/metadata_filtering/metadata_filtering_demo.ipynb#scrollTo=UkkJDOQCW7qe