<a href="https://colab.research.google.com/github/StrategicalIT/PipedPiperAI/blob/main/Lab08.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LAB 8: Loading and chunking data with LlamaIndex
In this lab we are going to see how frameworks like LlamaIndex simplify common workflow tasks. In this particular example we will explore how to ingest documents. This is an example of how these frameworks aim to provide a lot of functionality with a simple interface.

LlamaIndex offers many community-contributed integrations. You can browse what's available in LlamaHub. The filters at the top let you narrow the list to the type of integration you are interested in, for example [data loaders, also known as readers](https://llamahub.ai/?tab=readers). Each entry shows how popular the integration is and when it was last updated.

In this exercise we are going to play with the [simple directory reader](https://docs.llamaindex.ai/en/stable/module_guides/loading/simpledirectoryreader/). As you can see in its documentation page, it supports many types of text files including PDF, Word and PowerPoint. It even supports some popular image, audio and video formats. All these files are treated as sources of text, and the file type is detected automatically from the file extension. As you can imagine, these formats are very different from each other and each needs its own text extraction process, but SimpleDirectoryReader provides a single simple interface to extract data from all of them at once.

## Data Loading

The first step is to install the necessary libraries. We install the llama-index package, which pulls in llama-index-core, the package that SimpleDirectoryReader is part of. Notice that we also install docx2txt, which is used to extract text from Word documents, and the NVIDIA integrations; the embedding integration is used in the semantic chunking section.

In [None]:
!pip install llama-index docx2txt
!pip install llama-index-llms-nvidia llama-index-embeddings-nvidia

First we need to import SimpleDirectoryReader

In [None]:
from llama_index.core import SimpleDirectoryReader

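For this exercise we have prepared a directory that contains three files: a txt, a pdf and a docx. In Google Colab the directory is /content/PipedPiperAIData/; in a local notebook it can simply be a folder such as "data" next to the notebook.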

In [None]:
# if using Google Colab, we need the /content prefix
!ls -l /content/PipedPiperAIData/
# if running a local notebook, then just the directory name is ok e.g. data
#!ls -l data

The following line of code is sufficient to load all the data from these documents

In [None]:
documents = SimpleDirectoryReader("/content/PipedPiperAIData/").load_data()
# In Google Colab we need the /content prefix, pointing to a folder created at runtime that holds the uploaded documents.
# In a local notebook, point directly to the folder (e.g. "data"), assuming it sits in the directory the notebook is run from.

Now you can simply show the contents of the "documents" variable to see that the text from the documents was indeed extracted. Each document in the list also contains an id and a lot of metadata.

In [None]:
from pprint import pprint
pprint(documents, indent=4)

Feel free to experiment with other supported file types by uploading your own files. You can upload your own files using the main Jupyter interface from where you launch the notebooks.

As you can see in [the documentation](https://docs.llamaindex.ai/en/stable/module_guides/loading/simpledirectoryreader/), this reader offers many other possibilities like:
- reading subdirectories with "recursive=True"
- including or excluding specific paths and file extensions
- limiting the number of files to be ingested

It can also traverse remote file systems such as S3, Google Drive, SFTP ...
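
To make the filtering options above more concrete, here is a minimal standard-library sketch of the kind of file selection they describe. The helper `select_files` is hypothetical and is not LlamaIndex code; its parameter names are chosen to echo the reader's `recursive`, `required_exts`, `exclude` and `num_files_limit` options, and the matching rules are simplifying assumptions.

```python
import fnmatch
import tempfile
from pathlib import Path


def select_files(root, recursive=False, required_exts=None, exclude=None, num_files_limit=None):
    """Hypothetical helper: pick the files a directory reader might ingest."""
    root = Path(root)
    # rglob walks subdirectories, glob stays at the top level
    candidates = root.rglob("*") if recursive else root.glob("*")
    selected = []
    for path in sorted(candidates):
        if not path.is_file():
            continue
        # keep only the requested extensions, if any were given
        if required_exts and path.suffix.lower() not in required_exts:
            continue
        # drop files whose name matches an exclude pattern (assumed glob-style)
        if exclude and any(fnmatch.fnmatch(path.name, pat) for pat in exclude):
            continue
        selected.append(path)
    # cap the number of files after filtering
    return selected[:num_files_limit] if num_files_limit else selected


# Demo on a throwaway directory with one subdirectory
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "sub").mkdir()
    for name in ["notes.txt", "report.pdf", "draft.docx", "sub/extra.txt"]:
        (base / name).write_text("sample")
    top_level = [p.name for p in select_files(base)]
    txt_only = [p.name for p in select_files(base, recursive=True, required_exts=[".txt"])]
    print(top_level)
    print(txt_only)
```

Without "recursive" the file in the subdirectory is skipped; with it, and with an extension filter, only the two txt files remain.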

***

## Chunking

Chunking involves breaking down large data into smaller segments or "chunks". This makes the AI solution more efficient, particularly in tasks like semantic search and information retrieval. Chunking helps optimize memory usage, speeds up processing, and improves scalability.

This is still an area of active research and it can be done in many ways. You can [check out the LlamaIndex documentation to see what methods are available](https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules/). Data scientists need to test which method is the best match for their use case.

### Fixed-sized chunking

This is the most basic method and it is based on a fixed number of tokens. As you can see in the splitter definition below, we can specify how many tokens we want to target per chunk and how many tokens should overlap between consecutive chunks. The overlap prevents information loss at chunk boundaries and so helps preserve context. This method is often chosen for its speed.

In [None]:
from llama_index.core.node_parser import TokenTextSplitter

splitter = TokenTextSplitter(
    chunk_size=300,
    chunk_overlap=20,
    separator=" ",
)

Now we can show the chunks that were created. Notice that the word count differs from 300, because the relationship between words and tokens is not one-to-one.

In [None]:
nodes = splitter.get_nodes_from_documents(documents)
for i, node in enumerate(nodes):
    print(f"=== chunk #{i}, word count:{len(node.text.split())} ===")
    print(node.text)
    print("\n")
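
To see what chunk_size and chunk_overlap do in isolation, the sliding window can be sketched in a few lines of plain Python. This is an illustrative sketch, not TokenTextSplitter's implementation: the function `fixed_size_chunks` is hypothetical, and whitespace-separated words stand in for real tokens.

```python
def fixed_size_chunks(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # each new window starts this many tokens after the previous one
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # stop once the window has reached the end of the input
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Assumption: words stand in for tokens to keep the example dependency-free
words = "the quick brown fox jumps over the lazy dog again".split()
chunks = fixed_size_chunks(words, chunk_size=4, chunk_overlap=1)
for i, chunk in enumerate(chunks):
    print(i, " ".join(chunk))
```

Each chunk starts with the last word of the previous one, which is the overlap that carries context across the boundary.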

### Recursive chunking

Here we use SentenceSplitter, which attempts to split text while respecting paragraph and sentence boundaries. You can compare the results with those of the previous chunking method.

In [None]:
from llama_index.core.node_parser import SentenceSplitter
from pprint import pprint

splitter = SentenceSplitter(
    chunk_size=300,
    chunk_overlap=20,
)

nodes = splitter.get_nodes_from_documents(documents)
for i, node in enumerate(nodes):
    print(f"=== chunk #{i}, word count:{len(node.text.split())} ===")
    print(node.text)
    print("\n")
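
The idea of respecting sentence boundaries can be illustrated with a small greedy packer. This is a simplified sketch under stated assumptions, not how SentenceSplitter is implemented: the function `sentence_chunks` is hypothetical, sentences are detected with a naive regular expression, the budget is counted in words rather than tokens, and there is no overlap.

```python
import re


def sentence_chunks(text, max_words):
    """Greedily pack whole sentences into chunks, aiming for at most max_words words each."""
    # naive sentence split: break after ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # start a new chunk when adding this sentence would exceed the budget
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    # flush whatever is left in the last chunk
    if current:
        chunks.append(" ".join(current))
    return chunks


text = (
    "Chunking splits text. It keeps context small. Sentences stay whole! "
    "Does this help retrieval? Usually it does."
)
sentence_aware = sentence_chunks(text, max_words=8)
for i, chunk in enumerate(sentence_aware):
    print(i, chunk)
```

No chunk ends in the middle of a sentence. In this sketch, a single sentence longer than the budget simply becomes its own oversized chunk.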

### Semantic chunking

This is a relatively new concept. Instead of chunking text with a fixed chunk size, the semantic splitter adaptively picks the breakpoints between sentences using embedding similarity. This ensures that a "chunk" contains sentences that are semantically related to each other. Notice that chunking takes longer than with the previous two methods, since creating embeddings and computing cosine similarities is more computationally expensive.

The data scientist will have to experiment with different values for the breakpoint percentile threshold. The higher the threshold, the fewer the breakpoints, i.e. the fewer the chunks.
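
The effect of the percentile threshold can be sketched without calling any embedding model. The example below is a simplified illustration, not the SemanticSplitterNodeParser implementation: the function names and the toy two-dimensional "embeddings" are assumptions made for the example, and the sentence grouping controlled by buffer_size is ignored.

```python
import math


def cosine_distance(a, b):
    """1 minus the cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def percentile(values, pct):
    """Linear-interpolation percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def semantic_breakpoints(embeddings, threshold_pct):
    """Return indices i such that a chunk boundary falls after sentence i."""
    # distance between each sentence and the next one
    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]
    cutoff = percentile(distances, threshold_pct)
    # break only where the jump in meaning is larger than the cutoff
    return [i for i, d in enumerate(distances) if d > cutoff]


# Toy vectors (made up): three sentences on one topic, two on another, one in between
toy = [(1.0, 0.0), (0.9, 0.1), (1.0, 0.05), (0.0, 1.0), (0.1, 0.9), (0.7, 0.7)]
breaks_95 = semantic_breakpoints(toy, 95)
breaks_50 = semantic_breakpoints(toy, 50)
print(breaks_95, breaks_50)
```

With the higher threshold only the sharpest topic change produces a boundary; lowering it to 50 adds a second boundary at the smaller shift.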

<b>IMPORTANT:</b> This algorithm makes several calls to Nvidia's embedding model. Please do not try many combinations of the threshold or you might trigger the API call throttling for your Nvidia account

In [None]:
import os
from google.colab import userdata

# In Google Colab, read the key from the Colab secret named 'apikey'
apikey = userdata.get('apikey')
os.environ["NVIDIA_API_KEY"] = apikey

# In a local notebook, read the key from an OS environment variable instead:
#apikey = os.environ["NVIDIA_API_KEY"]

In [None]:
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.nvidia import NVIDIAEmbedding

embed_model = NVIDIAEmbedding(truncate="END")

splitter = SemanticSplitterNodeParser(
    buffer_size=1, breakpoint_percentile_threshold=95, embed_model=embed_model
)
nodes = splitter.get_nodes_from_documents(documents)
for i, node in enumerate(nodes):
    print(f"=== chunk #{i}, word count:{len(node.text.split())} ===")
    print(node.text)
    print("\n")

### End of Lab 8