<font color="green">Splitters</font>
- Divide a large data asset into small parts.

In [None]:
# !pip install python-dotenv

In [1]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
groq_api_key = os.environ["GROQ_API_KEY"]

#### <font color="green">Install LangChain</font>

In [None]:
# !pip install langchain

#### <font color="green">Connect with an LLM</font>

In [2]:
# !pip install langchain-openai
# !pip install langchain-groq

###### <font color="blue">NOTE:</font> We will use OpenAI by default, since it is arguably the strongest LLM on the market right now. You will also see how to connect to open-source LLMs such as Llama 3 or Mistral models.

<font color="green">Reminder: steps of the RAG process.</font>

- When you load a document, you end up with strings. Sometimes the strings are too large to fit into the context window. In those cases, we use the RAG technique:
  - **Split** the document into **small chunks.**
  - **Transform** the text chunks into **numeric vectors** (embeddings).
  - Load the **embeddings into a vector** database (aka vector store).
  - Load the question and **retrieve** the most relevant embeddings to answer it.
  - Send the **retrieved chunks to the LLM** to format the response properly.
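
The steps above can be sketched end to end in plain Python. This is a toy illustration only: the bag-of-words `embed` function stands in for a real embedding model, a Python list stands in for the vector store, and no LLM is called. All function names are hypothetical and are not part of LangChain.

```python
import math
import re
from collections import Counter


def split(text, separator="\n\n"):
    """Step 1: split the document into small chunks."""
    return [part.strip() for part in text.split(separator) if part.strip()]


def embed(text):
    """Step 2: toy embedding, a bag-of-words count vector (not a real model)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, store, k=1):
    """Step 4: return the k chunks whose vectors are closest to the question."""
    query = embed(question)
    ranked = sorted(store, key=lambda item: cosine(query, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


document = (
    "The way we grow food now demands too much water and leans heavily on "
    "chemicals, making it untenable."
    "\n\n"
    "By the year 2050, our world will be home to nearly 10 billion people, "
    "all needing sustenance."
)

chunks = split(document)                                # Step 1
vector_store = [(chunk, embed(chunk)) for chunk in chunks]  # Steps 2 and 3
question = "How many people will the world be home to by 2050?"
context = retrieve(question, vector_store)              # Step 4

# Step 5: the retrieved text (not the vectors) is what goes into the LLM prompt.
prompt = f"Answer using only this context:\n{context[0]}\n\nQuestion: {question}"
print(prompt)
```

In a real application each stand-in is replaced by a proper component (a text splitter, an embedding model, a vector store, and an LLM call), but the data flow stays the same.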

<font color="green">Splitters: Divide the loaded document in small chunks of text</font>

- Also called "Document Transformers".
- See the documentation page <a href="https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/">here.</a>
- See the list of built-in splitters <a href="https://python.langchain.com/v0.1/docs/integrations/document_transformers/">here.</a>

<font color="green">Simple splitting by character: Character Splitter</font>

- This splits based on characters (by default "\n\n") and measures chunk length by number of characters.

- The "Character Splitter" in the context of RAG (Retrieval-Augmented Generation) applications, specifically using LangChain's tools, is a method that divides text into smaller parts based on specific characters.

- By default, it uses a double newline ("\n\n") to identify where one chunk of text ends and another begins.

- Each chunk is measured by its number of characters.

- This simple splitting method is useful in RAG applications: it breaks large blocks of text into smaller, manageable pieces. This can improve the efficiency and effectiveness of text retrieval, which is crucial for generating accurate and contextually relevant responses.
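
To make the mechanics concrete, here is a simplified pure-Python sketch of the idea: split on the separator, then greedily merge pieces into chunks of at most `chunk_size` characters, carrying over up to `chunk_overlap` characters of trailing pieces into the next chunk. The function name `split_by_character` is hypothetical, and this is an approximation of the behaviour, not LangChain's actual implementation.

```python
def split_by_character(text, separator="\n\n", chunk_size=1000, chunk_overlap=200):
    """Simplified split-and-merge sketch (not LangChain's real code)."""
    pieces = [p for p in text.split(separator) if p]
    chunks = []
    current = []  # pieces collected for the chunk being built

    def joined_len(parts):
        # Length in characters of the parts once re-joined with the separator.
        return len(separator.join(parts))

    for piece in pieces:
        # Close the current chunk if adding this piece would exceed chunk_size.
        if current and joined_len(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # Keep trailing pieces as overlap, dropping from the front until
            # they fit in chunk_overlap and leave room for the new piece.
            while current and (
                joined_len(current) > chunk_overlap
                or joined_len(current + [piece]) > chunk_size
            ):
                current.pop(0)
        # A single piece longer than chunk_size is kept whole in this sketch.
        current.append(piece)

    if current:
        chunks.append(separator.join(current))
    return chunks


text = "aaaa\n\nbbbb\n\ncccc\n\ndddd"
with_overlap = split_by_character(text, chunk_size=10, chunk_overlap=4)
without_overlap = split_by_character(text, chunk_size=10, chunk_overlap=0)
print(with_overlap)     # ['aaaa\n\nbbbb', 'bbbb\n\ncccc', 'cccc\n\ndddd']
print(without_overlap)  # ['aaaa\n\nbbbb', 'cccc\n\ndddd']
```

With overlap enabled, the last piece of each chunk is repeated at the start of the next one, which helps preserve context across chunk boundaries.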

Here's a simple example to illustrate how the "Character Splitter" works in the context of RAG applications using the default delimiter ("\n\n").

**Original Text:**

What is our goal?

\n\nEnsure that fresh, nutritious, and

\nsustainable food is accessible to all.

\n\nAs urbanization increases and space becomes a premium, indoor hydroponic farming is not just a choice;

**After Applying Character Splitter:**
1. **Chunk 1:**
    What is our goal?

2. **Chunk 2:**
    Ensure that fresh, nutritious, and
    sustainable food is accessible to all.

3. **Chunk 3:**
    As urbanization increases and space becomes a premium, indoor hydroponic farming is not just a choice;

In this example, the text is split into three chunks, one at each "\n\n". The single "\n" inside the second chunk is not a split point, so that line break stays within the chunk. Note that the splitter also merges adjacent pieces back together as long as the result fits within `chunk_size`, so with a large `chunk_size` these three pieces would end up in a single chunk. Each chunk is a manageable size and clearly separated from the others, making it easier for a RAG system to handle and retrieve information from specific parts of the text as needed.
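
A quick way to check this example is plain Python `str.split`, used here as a stand-in for the splitting step (it is not LangChain itself and does no merging). It returns three pieces for the text above, because only "\n\n" is a split point and the single "\n" stays inside its piece.

```python
text = (
    "What is our goal?"
    "\n\nEnsure that fresh, nutritious, and"
    "\nsustainable food is accessible to all."
    "\n\nAs urbanization increases and space becomes a premium, "
    "indoor hydroponic farming is not just a choice;"
)

# Split only where the double newline occurs.
pieces = text.split("\n\n")

for number, piece in enumerate(pieces, start=1):
    # repr() makes the remaining single newline visible as \n.
    print(f"Chunk {number}: {piece!r}")
```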


In [3]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("./data/ageye-meta-data.txt")

loaded_data = loader.load()

In [4]:
loaded_data

[Document(metadata={'source': './data/ageye-meta-data.txt'}, page_content='What is our goal?\n\nEnsure that fresh, nutritious, and\nsustainable food is accessible to all.\n\nAs urbanization increases and space becomes a premium, indoor hydroponic farming is not just a choice; it’s a necessity. And for such an important mission, of building an equitable food system, we need solutions that are smart, intuitive, and efficient.\n\nA Global Problem\n\nFacing the Future of Food: A Critical Juncture\n\nDemand Skyrockets, Supply Dwindles\n\nBy the year 2050, our world will be home to nearly 10 billion people, all needing sustenance. Yet, the capacity of our planet to provide is under unprecedented pressure. Valuable agricultural land is vanishing, swallowed by urban expansion, challenged by changing climates, and degraded by practices that fail to stand the test of sustainability.\n\nThe Cost of Conventional Farming\n\nThe way we grow food now demands too much water and leans heavily on chem

In [6]:
loaded_data[0].page_content

'What is our goal?\n\nEnsure that fresh, nutritious, and\nsustainable food is accessible to all.\n\nAs urbanization increases and space becomes a premium, indoor hydroponic farming is not just a choice; it’s a necessity. And for such an important mission, of building an equitable food system, we need solutions that are smart, intuitive, and efficient.\n\nA Global Problem\n\nFacing the Future of Food: A Critical Juncture\n\nDemand Skyrockets, Supply Dwindles\n\nBy the year 2050, our world will be home to nearly 10 billion people, all needing sustenance. Yet, the capacity of our planet to provide is under unprecedented pressure. Valuable agricultural land is vanishing, swallowed by urban expansion, challenged by changing climates, and degraded by practices that fail to stand the test of sustainability.\n\nThe Cost of Conventional Farming\n\nThe way we grow food now demands too much water and leans heavily on chemicals, making it untenable. Our current methods deplete vital water resour

In [7]:
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False,
)

In [10]:
texts = text_splitter.create_documents([loaded_data[0].page_content])

In [12]:
len(texts)

25

In [11]:
texts

[Document(page_content='What is our goal?\n\nEnsure that fresh, nutritious, and\nsustainable food is accessible to all.\n\nAs urbanization increases and space becomes a premium, indoor hydroponic farming is not just a choice; it’s a necessity. And for such an important mission, of building an equitable food system, we need solutions that are smart, intuitive, and efficient.\n\nA Global Problem\n\nFacing the Future of Food: A Critical Juncture\n\nDemand Skyrockets, Supply Dwindles\n\nBy the year 2050, our world will be home to nearly 10 billion people, all needing sustenance. Yet, the capacity of our planet to provide is under unprecedented pressure. Valuable agricultural land is vanishing, swallowed by urban expansion, challenged by changing climates, and degraded by practices that fail to stand the test of sustainability.\n\nThe Cost of Conventional Farming'),
 Document(page_content='The Cost of Conventional Farming\n\nThe way we grow food now demands too much water and leans heavil

In [13]:
texts[0]

Document(page_content='What is our goal?\n\nEnsure that fresh, nutritious, and\nsustainable food is accessible to all.\n\nAs urbanization increases and space becomes a premium, indoor hydroponic farming is not just a choice; it’s a necessity. And for such an important mission, of building an equitable food system, we need solutions that are smart, intuitive, and efficient.\n\nA Global Problem\n\nFacing the Future of Food: A Critical Juncture\n\nDemand Skyrockets, Supply Dwindles\n\nBy the year 2050, our world will be home to nearly 10 billion people, all needing sustenance. Yet, the capacity of our planet to provide is under unprecedented pressure. Valuable agricultural land is vanishing, swallowed by urban expansion, challenged by changing climates, and degraded by practices that fail to stand the test of sustainability.\n\nThe Cost of Conventional Farming')

In [14]:
texts[12]

Document(page_content='LLM\n\nLarge Language Models\n\nBy analyzing vast amounts of textual data, including environmental conditions, input costs, and crop health reports, alongside deep horticultural knowledge, we convert complex information into actionable insights using generative AI. This empowers farmers with smarter, data-driven farming practices that optimize resources and improve decision-making.\n\nCNN\n\nConvolutional Neural Networks\n\nOur intelligent crop scouting system utilizes visual inputs to provide actionable insights on growth development, enabling the early detection of issues like plant stress, pathogens, or pests. This proactive monitoring saves time, reduces resource usage, and minimizes the risk of crop loss.\n\nAGEYE’s\n\nTruly Intelligent FarmingTM\n\nBy optimizing crop yields and streamlining operations, our platform ensures farms operate at peak performance, reducing costs and boosting productivity.')

<font color="green">Splitting with metadata</font>

In [16]:
metadatas = [{"chunk": 0}, {"chunk": 1}]

document = text_splitter.create_documents(
    [loaded_data[0].page_content, loaded_data[0].page_content],
    metadatas=metadatas
)
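
The way `metadatas` lines up with the input texts can be sketched in plain Python: the i-th metadata dictionary is attached (as a copy) to every chunk produced from the i-th text. The helper `create_documents_sketch` is hypothetical, uses plain dictionaries instead of `Document` objects, and splits naively on the separator without any merging; it only illustrates the pairing.

```python
import copy


def create_documents_sketch(texts, metadatas, separator="\n\n"):
    """Pair each input text with its metadata and tag every resulting chunk."""
    documents = []
    for text, metadata in zip(texts, metadatas):
        for piece in text.split(separator):
            documents.append(
                {
                    "page_content": piece,
                    # Each chunk gets its own copy, so editing one chunk's
                    # metadata later does not affect the others.
                    "metadata": copy.deepcopy(metadata),
                }
            )
    return documents


docs = create_documents_sketch(
    ["First text, part one.\n\nFirst text, part two.", "Second text."],
    [{"chunk": 0}, {"chunk": 1}],
)

for doc in docs:
    print(doc["metadata"], doc["page_content"])
```

Because the same text is passed twice in the cell above, the first half of the resulting documents carries the first metadata dictionary and the second half carries the second one.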

In [17]:
document

[Document(metadata={'chunk': 0}, page_content='What is our goal?\n\nEnsure that fresh, nutritious, and\nsustainable food is accessible to all.\n\nAs urbanization increases and space becomes a premium, indoor hydroponic farming is not just a choice; it’s a necessity. And for such an important mission, of building an equitable food system, we need solutions that are smart, intuitive, and efficient.\n\nA Global Problem\n\nFacing the Future of Food: A Critical Juncture\n\nDemand Skyrockets, Supply Dwindles\n\nBy the year 2050, our world will be home to nearly 10 billion people, all needing sustenance. Yet, the capacity of our planet to provide is under unprecedented pressure. Valuable agricultural land is vanishing, swallowed by urban expansion, challenged by changing climates, and degraded by practices that fail to stand the test of sustainability.\n\nThe Cost of Conventional Farming'),
 Document(metadata={'chunk': 0}, page_content='The Cost of Conventional Farming\n\nThe way we grow fo

In [18]:
document[0]

Document(metadata={'chunk': 0}, page_content='What is our goal?\n\nEnsure that fresh, nutritious, and\nsustainable food is accessible to all.\n\nAs urbanization increases and space becomes a premium, indoor hydroponic farming is not just a choice; it’s a necessity. And for such an important mission, of building an equitable food system, we need solutions that are smart, intuitive, and efficient.\n\nA Global Problem\n\nFacing the Future of Food: A Critical Juncture\n\nDemand Skyrockets, Supply Dwindles\n\nBy the year 2050, our world will be home to nearly 10 billion people, all needing sustenance. Yet, the capacity of our planet to provide is under unprecedented pressure. Valuable agricultural land is vanishing, swallowed by urban expansion, challenged by changing climates, and degraded by practices that fail to stand the test of sustainability.\n\nThe Cost of Conventional Farming')

In [19]:
print(document[0])

page_content='What is our goal?

Ensure that fresh, nutritious, and
sustainable food is accessible to all.

As urbanization increases and space becomes a premium, indoor hydroponic farming is not just a choice; it’s a necessity. And for such an important mission, of building an equitable food system, we need solutions that are smart, intuitive, and efficient.

A Global Problem

Facing the Future of Food: A Critical Juncture

Demand Skyrockets, Supply Dwindles

By the year 2050, our world will be home to nearly 10 billion people, all needing sustenance. Yet, the capacity of our planet to provide is under unprecedented pressure. Valuable agricultural land is vanishing, swallowed by urban expansion, challenged by changing climates, and degraded by practices that fail to stand the test of sustainability.

The Cost of Conventional Farming' metadata={'chunk': 0}
