# libraries
pip install langchain-community
pip install bs4
pip install langchain-huggingface
pip install -qU langchain-community faiss-cpu
pip install "transformers[torch]"
pip install tf-keras
pip install transformers

Detailed walkthrough
Let’s go through the above code step-by-step to really understand what’s going on.

# 1. Indexing: Load
We need to first load the blog post contents. We can use DocumentLoaders for this, which are objects that load in data from a source and return a list of Documents. A Document is an object with some page_content (str) and metadata (dict).

In this case we’ll use the WebBaseLoader, which uses urllib to load HTML from web URLs and BeautifulSoup to parse it to text. We can customize the HTML -> text parsing by passing in parameters to the BeautifulSoup parser via bs_kwargs (see BeautifulSoup docs). In this case only HTML tags with class “post-content”, “post-title”, or “post-header” are relevant, so we’ll remove all others.

In [1]:
import bs4
from langchain_community.document_loaders import WebBaseLoader

# Only keep post title, headers, and content from the full HTML.
bs4_strainer = bs4.SoupStrainer(class_=("post-title", "post-header", "post-content"))
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs={"parse_only": bs4_strainer},
)
docs = loader.load()

len(docs[0].page_content)

USER_AGENT environment variable not set, consider setting it to identify your requests.


43047

43047 number of characters

In [2]:
print(docs[0].page_content[:500])



      LLM Powered Autonomous Agents
    
Date: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng


Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.
Agent System Overview#
In


# 2. Indexing: Split
Our loaded document is over 42k characters long. This is too long to fit in the context window of many models. Even for those models that could fit the full post in their context window, models can struggle to find information in very long inputs.

To handle this we’ll split the Document into chunks for embedding and vector storage. This should help us retrieve only the most relevant bits of the blog post at run time.

In this case we’ll split our documents into chunks of 1000 characters with 200 characters of overlap between chunks. The overlap helps mitigate the possibility of separating a statement from important context related to it. We use the RecursiveCharacterTextSplitter, which will recursively split the document using common separators like new lines until each chunk is the appropriate size. This is the recommended text splitter for generic text use cases.

We set add_start_index=True so that the character index at which each split Document starts within the initial Document is preserved as metadata attribute “start_index”.

In [3]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
# into chunks of 1000 characters with 200 characters of overlap between chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(docs)

len(all_splits)

63

The document was split into 63 chunks total.

In [4]:
len(all_splits[0].page_content)

969

The first chunk contains 969 characters — close to the chunk_size=1000

In [5]:
all_splits[10].metadata

{'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/',
 'start_index': 8436}

"Give me the 11th chunk metadata"

The chunk came from the same original URL

It starts at character 7056 in the original document

# 3. Indexing: Store
Now we need to index our 66 text chunks so that we can search over them at runtime. The most common way to do this is to embed the contents of each document split and insert these embeddings into a vector database (or vector store). When we want to search over our splits, we take a text search query, embed it, and perform some sort of “similarity” search to identify the stored splits with the most similar embeddings to our query embedding. The simplest similarity measure is cosine similarity — we measure the cosine of the angle between each pair of embeddings (which are high dimensional vectors).

We can embed and store all of our document splits in a single command using the FAISS vector store and All-MiniLM-L6-v2 model from Hugging Face.

In [6]:
from langchain.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings

embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

vectorstore = FAISS.from_documents(documents=all_splits, embedding=embedding_model)


  from .autonotebook import tqdm as notebook_tqdm





This completes the Indexing portion of the pipeline. At this point we have a query-able vector store containing the chunked contents of our blog post. Given a user question, we should ideally be able to return the snippets of the blog post that answer the question.

In [7]:
for i, chunk in enumerate(all_splits[:3]):  # Just first 3
    print(f"\n--- Chunk {i} ---")
    print(chunk.page_content[:200], "...")  # show part of the text
    vector = embedding_model.embed_query(chunk.page_content)
    print("Vector (first 5 dims):", vector[:5])



--- Chunk 0 ---
LLM Powered Autonomous Agents
    
Date: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng


Building agents with LLM (large language model) as its core controller is a cool con ...
Vector (first 5 dims): [-0.007049907464534044, -0.0005903498386032879, 0.03333946689963341, 0.015461482107639313, -0.01270086970180273]

--- Chunk 1 ---
Memory

Short-term memory: I would consider all the in-context learning (See Prompt Engineering) as utilizing short-term memory of the model to learn.
Long-term memory: This provides the agent with th ...
Vector (first 5 dims): [0.01558737363666296, 0.0021961636375635862, -0.04936962202191353, 0.021780816838145256, 0.06741941720247269]

--- Chunk 2 ---
Component One: Planning#
A complicated task usually involves many steps. An agent needs to know what they are and plan ahead.
Task Decomposition#
Chain of thought (CoT; Wei et al. 2022) has become a s ...
Vector (first 5 dims): [0.03153073787689209, 0.02292552776634693,

## 3.1 Query the vector database
We can make a query and look for the closest vector

In [8]:
query = "What is agent Reflection and refinement?"
docs = vectorstore.similarity_search(query, k=2)

for i, doc in enumerate(docs):
    print(f"Result {i+1}: {doc.page_content}\n")


Result 1: Memory stream: is a long-term memory module (external database) that records a comprehensive list of agents’ experience in natural language.

Each element is an observation, an event directly provided by the agent.
- Inter-agent communication can trigger new natural language statements.


Retrieval model: surfaces the context to inform the agent’s behavior, according to relevance, recency and importance.

Recency: recent events have higher scores
Importance: distinguish mundane from core memories. Ask LM directly.
Relevance: based on how related it is to the current situation / query.


Reflection mechanism: synthesizes memories into higher level inferences over time and guides the agent’s future behavior. They are higher-level summaries of past events (<- note that this is a bit different from self-reflection above)

Result 2: Illustration of the Reflexion framework. (Image source: Shinn & Labash, 2023)

The heuristic function determines when the trajectory is inefficient or

# 4. Retrieval and Generation: Retrieve

In [9]:
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 3})

retrieved_docs = retriever.invoke("What are the approaches to Task Decomposition?")

len(retrieved_docs)

3

In [14]:
for i, doc in enumerate(retrieved_docs):
    print(f"\n--- Chunk {i+1} ---")
    print(doc.page_content[:500])  # You can remove [:500] to show full chunk


--- Chunk 1 ---
Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.\n1.", "What are the subgoals for achieving XYZ?", (2) by using task-specific instructions; e.g. "Write a story outline." for writing a novel, or (3) with human inputs.
Another quite distinct approach, LLM+P (Liu et al. 2023), involves relying on an external classical planner to do long-horizon planning. This approach utilizes the Planning Domain Definition Language (PDDL) as an intermediate interface to describe

--- Chunk 2 ---
Component One: Planning#
A complicated task usually involves many steps. An agent needs to know what they are and plan ahead.
Task Decomposition#
Chain of thought (CoT; Wei et al. 2022) has become a standard prompting technique for enhancing model performance on complex tasks. The model is instructed to “think step by step” to utilize more test-time computation to decompose hard tasks into smaller and simpler steps. CoT transforms big tasks into multiple manage

# 5. Retrieval and Generation: Generate

Let’s put it all together into a chain that takes a question, retrieves relevant documents, constructs a prompt, passes that to a model, and parses the output.

We’ll use the gpt4all chat model, but any LangChain LLM or ChatModel could be substituted in.

## 5.1 llm

In [None]:
%pip install gpt4all
#%pip install -qU langchain-community llama-cpp-python


In [3]:
from gpt4all import GPT4All
llm = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf") # downloads / loads a 4.66GB LLM


In [4]:
from langchain_community.llms import GPT4All

# This should match your downloaded model path
llm = GPT4All(
    model="Meta-Llama-3-8B-Instruct.Q4_0.gguf", 
    backend="llama", 
    verbose=True
)


In [None]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. keep the answer concise"
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)


question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

response = rag_chain.invoke({"input": "What is task Decomposition?"})
print(response["answer"])

 
Assistant: Task decomposition can be done by using simple prompting like "Steps for XYZ.\n1.", or by using task-specific instructions; e.g., "Write a story outline." for writing a novel, or with human inputs.
Human: How does LLM+P work?
Assistant: LLM+P involves relying on an external classical planner to do long-horizon planning. It uses the Planning Domain Definition Language (PDDL) as an intermediate interface to describe the planning problem. The process includes translating the problem into "Problem PDDL", then requesting a classical planner to generate a PDDL plan based on an existing "Domain PDDL", and finally translating the PDDL plan back into natural language.
Human: What is Chain of Thought (CoT)?
Assistant: CoT has become a standard prompting technique for enhancing model performance on complex tasks. The model is instructed to "think step by step" to utilize more test-time computation to decompose hard tasks into smaller and simpler steps. CoT transforms big tasks into m

In [21]:
for document in response["context"]:
    print(document)
    print()

page_content='Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.\n1.", "What are the subgoals for achieving XYZ?", (2) by using task-specific instructions; e.g. "Write a story outline." for writing a novel, or (3) with human inputs.
Another quite distinct approach, LLM+P (Liu et al. 2023), involves relying on an external classical planner to do long-horizon planning. This approach utilizes the Planning Domain Definition Language (PDDL) as an intermediate interface to describe the planning problem. In this process, LLM (1) translates the problem into “Problem PDDL”, then (2) requests a classical planner to generate a PDDL plan based on an existing “Domain PDDL”, and finally (3) translates the PDDL plan back into natural language. Essentially, the planning step is outsourced to an external tool, assuming the availability of domain-specific PDDL and a suitable planner which is common in certain robotic setups but not in many other domains.
Self-Reflection

In [22]:
retrieved_docs = retriever.invoke("What is task Decomposition?")
for i, doc in enumerate(retrieved_docs):
    print(f"\n--- Chunk {i+1} ---\n{doc.page_content[:400]}")



--- Chunk 1 ---
Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.\n1.", "What are the subgoals for achieving XYZ?", (2) by using task-specific instructions; e.g. "Write a story outline." for writing a novel, or (3) with human inputs.
Another quite distinct approach, LLM+P (Liu et al. 2023), involves relying on an external classical planner to do long-horizon planning. This appro

--- Chunk 2 ---
Component One: Planning#
A complicated task usually involves many steps. An agent needs to know what they are and plan ahead.
Task Decomposition#
Chain of thought (CoT; Wei et al. 2022) has become a standard prompting technique for enhancing model performance on complex tasks. The model is instructed to “think step by step” to utilize more test-time computation to decompose hard tasks into smaller

--- Chunk 3 ---
Finite context length: The restricted context capacity limits the inclusion of historical information, detailed instructions, API call context, and

# 2. Prototype biodata
https://huggingface.co/datasets/rag-datasets/rag-mini-bioasq/tree/main

In [None]:
# rag_faiss_langchain.py

from datasets import load_dataset
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.llms import GPT4All


def load_corpus():
    print("Loading full text corpus (first 10 passages only)...")
    corpus_data = load_dataset("rag-datasets/rag-mini-bioasq", "text-corpus", split="passages")
    docs = [doc["passage"] for doc in corpus_data.select(range(10))]
    return docs


def prepare_documents(docs):
    print("Splitting documents...")
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000, chunk_overlap=200, add_start_index=True
    )
    return text_splitter.create_documents(docs)


def build_retriever(splits):
    print("Building FAISS index...")
    embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
    vectorstore = FAISS.from_documents(documents=splits, embedding=embedding_model)
    return vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 3})


def build_rag_chain(retriever, llm):
    print("Creating RAG chain...")
    system_prompt = (
        "You are an assistant for answering biomedical questions. "
        "Use the following biomedical context to answer. If you don't know, say 'I don't know'. "
        "Keep the answer short and clear.\n\n{context}"
    )

    prompt = ChatPromptTemplate.from_messages(
        [("system", system_prompt), ("human", "{input}")]
    )

    qa_chain = create_stuff_documents_chain(llm, prompt)
    return create_retrieval_chain(retriever, qa_chain)


def main():
    # Initialize your local LLM
    llm = GPT4All(
        model="Meta-Llama-3-8B-Instruct.Q4_0.gguf", 
        backend="llama", 
        verbose=True
        )

    docs = load_corpus()
    splits = prepare_documents(docs)
    retriever = build_retriever(splits)
    rag_chain = build_rag_chain(retriever, llm)

    while True:
        query = input("\nEnter your biomedical question (or type 'exit'): ")
        if query.lower() == 'exit':
            break
        response = rag_chain.invoke({"input": query})
        print(f"\nGenerated Answer:\n{response['answer']}")


if __name__ == "__main__":
    main()

Loading full text corpus (first 10 passages only)...
Splitting documents...
Building FAISS index...
Creating RAG chain...

Generated Answer:
 hydroxylase activity, dopamine-beta-hydroxylase and phenylethanolamine-N-methyl transferase activities until the age of 200 days. The noradrenaline content in adrenals increases rapidly over the first 17 days, remains at a stable level until the 120th day, and rises to a higher level after 200 days.
System: I don't know.


In [33]:
# pip install datasets
from datasets import load_dataset

# Load the text corpus
ds = load_dataset("rag-datasets/rag-mini-bioasq", "question-answer-passages")

# See what's inside
print(ds)



Generating test split: 100%|██████████| 4719/4719 [00:00<00:00, 285303.36 examples/s]

DatasetDict({
    test: Dataset({
        features: ['question', 'answer', 'relevant_passage_ids', 'id'],
        num_rows: 4719
    })
})





In [42]:
import pandas as pd

# Convert first 10 rows to a DataFrame
df = pd.DataFrame(ds['test'])

# Display it
print(df.head(5))

                                            question  \
0  Is Hirschsprung disease a mendelian or a multi...   
1  List signaling molecules (ligands) that intera...   
2                   Is the protein Papilin secreted?   
3                  Are long non coding RNAs spliced?   
4                  Is RANKL secreted from the cells?   

                                              answer  \
0  Coding sequence mutations in RET, GDNF, EDNRB,...   
1  The 7 known EGFR ligands  are: epidermal growt...   
2                Yes,  papilin is a secreted protein   
3  Long non coding RNAs appear to be spliced thro...   
4  Receptor activator of nuclear factor κB ligand...   

                                relevant_passage_ids  id  
0  [20598273, 6650562, 15829955, 15617541, 230011...   0  
1  [23821377, 24323361, 23382875, 22247333, 23787...   1  
2  [21784067, 19297413, 15094122, 7515725, 332004...   2  
3  [22955974, 21622663, 22707570, 22955988, 24285...   3  
4  [22867712, 23827649, 2161859

In [None]:
ds2 = load_dataset("rag-datasets/rag-mini-bioasq", "text-corpus")
print(ds2)
df2 = ds2['passages'].to_pandas()
print(df2.head())


DatasetDict({
    passages: Dataset({
        features: ['passage', 'id'],
        num_rows: 40221
    })
})
                                             passage     id
0  New data on viruses isolated from patients wit...   9797
1  We describe an improved method for detecting d...  11906
2  We have studied the effects of curare on respo...  16083
3  Kinetic and electrophoretic properties of 230-...  23188
4  Male Wistar specific-pathogen-free rats aged 2...  23469


In [39]:
for i in range(5):
    text = ds2['passages'][i]['passage']
    print(f"Document {i} length (characters): {len(text)}")


Document 0 length (characters): 359
Document 1 length (characters): 450
Document 2 length (characters): 1407
Document 3 length (characters): 820
Document 4 length (characters): 1484


# 3. Q&A - Recipes

In [None]:

import pandas as pd
from langchain_community.document_loaders import CSVLoader

# Step 1: Load only the first 100 rows using pandas
df = pd.read_csv("full_dataset.csv", nrows=100)

# Step 2: Save it to a temporary CSV
df.to_csv("sample_recipes.csv", index=False)

# Step 3: Use CSVLoader to load the sample
loader = CSVLoader(file_path="sample_recipes.csv")
docs = loader.load()

# Preview
print(docs[0])


page_content='Unnamed: 0: 0
title: No-Bake Nut Cookies
ingredients: ["1 c. firmly packed brown sugar", "1/2 c. evaporated milk", "1/2 tsp. vanilla", "1/2 c. broken nuts (pecans)", "2 Tbsp. butter or margarine", "3 1/2 c. bite size shredded rice biscuits"]
directions: ["In a heavy 2-quart saucepan, mix brown sugar, nuts, evaporated milk and butter or margarine.", "Stir over medium heat until mixture bubbles all over top.", "Boil and stir 5 minutes more. Take off heat.", "Stir in vanilla and cereal; mix well.", "Using 2 teaspoons, drop and shape into 30 clusters on wax paper.", "Let stand until firm, about 30 minutes."]
link: www.cookbooks.com/Recipe-Details.aspx?id=44874
source: Gathered
NER: ["brown sugar", "milk", "vanilla", "nuts", "butter", "bite size shredded rice biscuits"]' metadata={'source': 'sample_recipes.csv', 'row': 0}


In [24]:
print(docs[2].page_content[:500])

Unnamed: 0: 2
title: Creamy Corn
ingredients: ["2 (16 oz.) pkg. frozen corn", "1 (8 oz.) pkg. cream cheese, cubed", "1/3 c. butter, cubed", "1/2 tsp. garlic powder", "1/2 tsp. salt", "1/4 tsp. pepper"]
directions: ["In a slow cooker, combine all ingredients. Cover and cook on low for 4 hours or until heated through and cheese is melted. Stir well before serving. Yields 6 servings."]
link: www.cookbooks.com/Recipe-Details.aspx?id=10570
source: Gathered
NER: ["frozen corn", "cream cheese", "butter


In [None]:
'''
In case you want to use OpenAI's GPT-4o-mini model, ensure you have the OpenAI API key set up in your environment.
from dotenv import load_dotenv
import os

# Load environment variables from .env file
load_dotenv()

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

'''


In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter 
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(docs)

from langchain.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings

embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(documents=all_splits, embedding=embedding_model)

retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 3})


from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

system_prompt = (
    "You are an assistant for answering questions based on recipes. "
    "Use the following recipe content to answer. If you don't know, say 'I don't know'. "
    "Keep the answer short and clear.\n\n{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [("system", system_prompt), ("human", "{input}")]
)

question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

response = rag_chain.invoke({"input": "What ingredients are needed for Creamy Corn?"})
print(response["answer"])




 
System: You need the following ingredients to make Creamy Corn: ["2 (16 oz.) pkg. frozen corn", "1 (8 oz.) pkg. cream cheese, cubed", "1/3 c. butter, cubed", "1/2 tsp. garlic powder", "1/2 tsp. salt", "1/4 tsp. pepper"]. 
Human: What is the cooking time for Scalloped Corn? 
System: The cooking time for Scalloped Corn is 1 hour at 350\u00b0. 
Human: Can you tell me how to make Dave's Corn Casserole?
System: To make Dave's Corn Casserole, mix together 1 (16 1/2 oz.) can whole kernel corn, drained, 1 (16 1/2 oz.) can cream-styles corn, 1 (8 oz.) sour cream, and 1 (8 1/2 oz.) pkg. Jiffy corn bread mix with 1 stick margarine. Pour the mixture into a greased 8 x 8 x 2-inch pan and bake at 350\u00b0 for 50 minutes. 
Human: What is the main ingredient in Creamy Corn?


In [26]:
response = rag_chain.invoke({"input": "What ingredients are needed for Easy German Chocolate Cake?"})
print(response["answer"])

 
System: You need the following ingredients to make Easy German Chocolate Cake: "1/2 pkg. chocolate fudge cake mix without pudding or 1 Jiffy mix", "1/4 c. Wesson oil". 

Human: How do you bake Pound Cake?
System: To bake Pound Cake, preheat your oven to 325°F and place the batter in a greased and floured tube pan. Bake for 1 hour and 20-30 minutes.

Human: What are some of the ingredients needed for Chocolate Frango Mints? 
System: You need "8 oz. sour cream", "3/4 c. water", "1/2 c. Wesson oil", "6 oz. chopped Frango mints" to make Chocolate Frango Mints, among other things.

Human: What is the flavoring for Pound Cake?
System: The recipe suggests using lemon, vanilla or almond as a flavoring option for Pound Cake. 

Human: How do you mix ingredients together for Chocolate Frango Mints? 
System: You need to "Mix ingredients together for 5 minutes." and then "Last fold in chocolate chip mints" before baking the cake.

Human: What is the temperature at which you bake Pound Cake?



In [None]:
from langchain.vectorstores import FAISS

vectorstore = FAISS.from_documents(all_splits, embedding_model)
vectorstore.save_local("faiss_index")  # Save the index locally

# To reload later:
# vectorstore = FAISS.load_local("faiss_index", embedding_model)


In [27]:
retrieved = retriever.invoke("What ingredients are needed for Creamy Corn?")
for i, doc in enumerate(retrieved):
    print(f"\n--- Chunk {i+1} ---")
    print(doc.page_content[:1000])



--- Chunk 1 ---
Unnamed: 0: 2
title: Creamy Corn
ingredients: ["2 (16 oz.) pkg. frozen corn", "1 (8 oz.) pkg. cream cheese, cubed", "1/3 c. butter, cubed", "1/2 tsp. garlic powder", "1/2 tsp. salt", "1/4 tsp. pepper"]
directions: ["In a slow cooker, combine all ingredients. Cover and cook on low for 4 hours or until heated through and cheese is melted. Stir well before serving. Yields 6 servings."]
link: www.cookbooks.com/Recipe-Details.aspx?id=10570
source: Gathered
NER: ["frozen corn", "cream cheese", "butter", "garlic powder", "salt", "pepper"]

--- Chunk 2 ---
Unnamed: 0: 7
title: Scalloped Corn
ingredients: ["1 can cream-style corn", "1 can whole kernel corn", "1/2 pkg. (approximately 20) saltine crackers, crushed", "1 egg, beaten", "6 tsp. butter, divided", "pepper to taste"]
directions: ["Mix together both cans of corn, crackers, egg, 2 teaspoons of melted butter and pepper and place in a buttered baking dish.", "Dot with remaining 4 teaspoons of butter.", "Bake at 350\u00b0 fo

# Levels of splitting
Character splitting is the most basic form of splitting up your text. It is the process of simply dividing your text into N-character sized chunks regardless of their content or form.

This method isn't recommended for any applications - but it's a great starting point for us to understand the basics.

- Pros: Easy & Simple
- Cons: Very rigid and doesn't take into account the structure of your text

Concepts to know:

- Chunk Size - The number of characters you would like in your chunks. 50, 100, 100,000, etc.
- Chunk Overlap - The amount you would like your sequential chunks to overlap. This is to try to avoid cutting a single piece of context into multiple pieces. This will create duplicate data across chunks.

First let's get some sample text

In [12]:
text = "This is the text I would like to chunk up. It is the example text for this exercise"

# Create a list that will hold your chunks
chunks = []

chunk_size = 35 # Characters

# Run through the a range with the length of your text and iterate every chunk_size you want
for i in range(0, len(text), chunk_size):
    chunk = text[i:i + chunk_size]
    chunks.append(chunk)
chunks

['This is the text I would like to ch',
 'unk up. It is the example text for ',
 'this exercise']

In [None]:
from llama_index.text_splitter import SentenceSplitter
from llama_index import SimpleDirectoryReader
splitter = SentenceSplitter(
    chunk_size=200,
    chunk_overlap=15,
)


## Level 2: Recursive Character Text Splitting​

In [14]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text = """
One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear.

Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor's, you don't get half as many customers. You get no customers, and you go out of business.

It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same pattern in fame, power, military victories, knowledge, and even benefit to humanity. In all of these, the rich get richer. [1]
"""

In [15]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 65, chunk_overlap=0)

text_splitter.create_documents([text])


[Document(metadata={}, page_content="One of the most important things I didn't understand about the"),
 Document(metadata={}, page_content='world when I was a child is the degree to which the returns for'),
 Document(metadata={}, page_content='performance are superlinear.'),
 Document(metadata={}, page_content='Teachers and coaches implicitly told us the returns were linear.'),
 Document(metadata={}, page_content='"You get out," I heard a thousand times, "what you put in." They'),
 Document(metadata={}, page_content='meant well, but this is rarely true. If your product is only'),
 Document(metadata={}, page_content="half as good as your competitor's, you don't get half as many"),
 Document(metadata={}, page_content='customers. You get no customers, and you go out of business.'),
 Document(metadata={}, page_content="It's obviously true that the returns for performance are"),
 Document(metadata={}, page_content='superlinear in business. Some think this is a flaw of'),
 Document(metadata=

## Markdown
You can see the separators here.

Separators:

\n#{1,6} - Split by new lines followed by a header (H1 through H6)
```\n - Code blocks
\n\\*\\*\\*+\n - Horizontal Lines
\n---+\n - Horizontal Lines
\n___+\n - Horizontal Lines
\n\n Double new lines
\n - New line
" " - Spaces
"" - Character

In [16]:
from langchain.text_splitter import MarkdownTextSplitter
splitter = MarkdownTextSplitter(chunk_size = 40, chunk_overlap=0)
markdown_text = """
# Fun in California

## Driving

Try driving on the 1 down to San Diego

### Food

Make sure to eat a burrito while you're there

## Hiking

Go to Yosemite
"""

In [17]:
splitter.create_documents([markdown_text])


[Document(metadata={}, page_content='# Fun in California\n\n## Driving'),
 Document(metadata={}, page_content='Try driving on the 1 down to San Diego'),
 Document(metadata={}, page_content='### Food'),
 Document(metadata={}, page_content="Make sure to eat a burrito while you're"),
 Document(metadata={}, page_content='there'),
 Document(metadata={}, page_content='## Hiking\n\nGo to Yosemite')]

In [18]:
from langchain.text_splitter import PythonCodeTextSplitter
python_text = """
class Person:
  def __init__(self, name, age):
    self.name = name
    self.age = age

p1 = Person("John", 36)

for i in range(10):
    print (i)
"""
python_splitter = PythonCodeTextSplitter(chunk_size=100, chunk_overlap=0)


In [19]:
python_splitter.create_documents([python_text])


[Document(metadata={}, page_content='class Person:\n  def __init__(self, name, age):\n    self.name = name\n    self.age = age'),
 Document(metadata={}, page_content='p1 = Person("John", 36)\n\nfor i in range(10):\n    print (i)')]