# Retrieval

The steps are:
1. Load documents
2. Split documents
3. Embed & store documents
4. Retrieve from documents
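
To make the four steps concrete before bringing in LangChain, here is a minimal standard-library sketch. It is a toy under stated assumptions: the two sample sentences are made up, the "embedding" is a plain word count rather than a model, and the names `split_text`, `embed`, `cosine` and `retrieve` are hypothetical helpers, not LangChain APIs.

```python
import math
import re
from collections import Counter


def split_text(text: str, chunk_size: int) -> list[str]:
    """Step 2: greedily pack whole words into chunks of at most chunk_size characters."""
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > chunk_size and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def embed(text: str) -> Counter:
    """Step 3: toy 'embedding' (word counts); a real pipeline calls an embedding model."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, store: list[tuple[str, Counter]], k: int = 1) -> list[str]:
    """Step 4: return the k stored chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


# Step 1: "load" documents (hard-coded, made-up text instead of files on disk).
documents = [
    "StoreBehavior stores state in meta properties.",
    "Connections wire behaviors together in a scene.",
]
store = [(chunk, embed(chunk)) for doc in documents for chunk in split_text(doc, 60)]
top = retrieve("how are behaviors wired together", store)
print(top)
```

The rest of the notebook swaps each toy piece for a real component: a document loader, a text splitter, OpenAI embeddings with Chroma, and a retriever.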

## 0. Prerequisites

In [1]:
!pip install langchain openai GitPython chromadb unstructured markdown tiktoken


In [2]:
from dotenv import load_dotenv
load_dotenv()

True

In [3]:
search_keyword = "AlwaysBehavior"

## 1. Load documents

In [4]:
from langchain_community.document_loaders import DirectoryLoader, UnstructuredMarkdownLoader

docs_loader = DirectoryLoader("pronto-docs", glob="**/*.md", loader_cls=UnstructuredMarkdownLoader)
docs = docs_loader.load()

In [None]:
docs

## 2. Split documents

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, Language

docs_splitter = RecursiveCharacterTextSplitter.from_language(language=Language.MARKDOWN, chunk_size=100, chunk_overlap=0)

In [None]:
docs_splits = docs_splitter.split_documents(docs)

In [None]:
len(docs_splits)

In [None]:
from langchain_core.documents import Document
from langchain.text_splitter import TextSplitter

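def get_splits_from_keyword(keyword: str, splitter: TextSplitter, documents: list[Document]) -> list[Document]:
    """Split the documents and keep only the chunks whose text contains the keyword."""
    return [split for split in splitter.split_documents(documents) if keyword in split.page_content]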

In [None]:
chunk_size = 100
chunk_overlap = 0

### 2.1 Recursive Character Text Splitter

In [None]:
rct_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
)

In [None]:
get_splits_from_keyword("ValueBehavior", rct_splitter, docs)
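
For intuition, the recursive strategy can be sketched in plain Python. This is a simplified illustration and not LangChain's implementation: the function name `recursive_split` and the sample text are made up here, and unlike the real splitter it neither merges small pieces back together nor keeps the separators.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split on the first separator; recurse with the remaining ones on oversized pieces."""
    if len(text) <= chunk_size:
        # Small enough already; drop pieces that are only whitespace.
        return [text] if text.strip() else []
    if not separators:
        # No separator left: fall back to a hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    head, *rest = separators
    chunks = []
    for piece in text.split(head):
        chunks.extend(recursive_split(piece, rest, chunk_size))
    return chunks


sample = "Intro paragraph.\n\nFirst line of a longer paragraph\nsecond line of it"
chunks = recursive_split(sample, ["\n\n", "\n", " "], chunk_size=40)
print(chunks)
```

Paragraph breaks are tried first, so a paragraph stays whole whenever it fits. Only the oversized second paragraph is broken further, at its line break.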

### 2.2 Recursive Character Text Splitter (from language Markdown)

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, Language

rctm_splitter = RecursiveCharacterTextSplitter.from_language(language=Language.MARKDOWN, chunk_size=chunk_size, chunk_overlap=chunk_overlap)

In [None]:
get_splits_from_keyword("ValueBehavior", rctm_splitter, docs)

### 2.3 Recursive Character Text Splitter (custom separators)

In [None]:
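rctcs_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=0,
    # Separators are tried in order, so list the longest heading marker first;
    # with "#" in front, "##" and "###" would never be reached.
    separators=["###", "##", "#", "\n\n", "\n \n", "\n"]
)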

In [None]:
rctcs_splitter.split_documents(filter(lambda x: x.metadata["source"] == "pronto-docs/README.md", docs))

In [None]:
splits = get_splits_from_keyword("ValueBehavior", rctcs_splitter, docs)
print(len(splits))
print(splits)

### 2.4 Character Text Splitter

In [None]:
from langchain.text_splitter import CharacterTextSplitter

ct_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separator="\n"
)

In [None]:
get_splits_from_keyword("connect", ct_splitter, docs)

## 3. Embed & store documents

This is where it gets interesting. There are several options for retrieval:
- Vectorstore
- Parent Document
- Multi-Vector
- Self Query
- Contextual Compression
- Time-Weighted Vectorstore
- Multi-Query Retriever
- Ensemble
- Long-Context Reorder

In [None]:
from langchain_community.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

In [None]:
from langchain_community.vectorstores.chroma import Chroma

db = Chroma.from_documents(documents=docs, embedding=embeddings, persist_directory="./chroma_db")

In [10]:
from langchain_core.retrievers import BaseRetriever

def check_for_keyword(keyword: str, retriever: BaseRetriever):
    rel_docs = retriever.get_relevant_documents(keyword)
    count_results = sum(map(lambda x: int(keyword in x.page_content), rel_docs))
    return str(count_results) + "/" + str(len(rel_docs))
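
The helper's return value reads as "hits/total". The sketch below shows that format without calling the OpenAI API; `StubRetriever` and its sample texts are hypothetical stand-ins, and the helper is restated in duck-typed form so the block runs on its own.

```python
from types import SimpleNamespace


class StubRetriever:
    """Hypothetical stand-in that always returns the same fixed documents."""

    def __init__(self, texts):
        self._docs = [SimpleNamespace(page_content=t) for t in texts]

    def get_relevant_documents(self, query):
        return self._docs


def check_for_keyword(keyword, retriever):
    """Same logic as the helper above: count results containing the keyword."""
    rel_docs = retriever.get_relevant_documents(keyword)
    hits = sum(keyword in doc.page_content for doc in rel_docs)
    return f"{hits}/{len(rel_docs)}"


some_hits = check_for_keyword(
    "AlwaysBehavior",
    StubRetriever(["text mentioning AlwaysBehavior", "ValueBehavior", "Connections"]),
)
no_results = check_for_keyword("AlwaysBehavior", StubRetriever([]))
print(some_hits)   # 1/3
print(no_results)  # 0/0
```

A result of `0/0` means the retriever returned no documents at all, which is different from returning documents that do not contain the keyword.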

### 3.1 Vectorstore

This is the basic method built directly into the vectorstore.

In [None]:
vec_retriever = db.as_retriever()
vec_retriever_mmr = db.as_retriever(
    search_type="mmr",  # Also test "similarity",
)

In [None]:
vec_retriever.get_relevant_documents(search_keyword)

In [None]:
print(check_for_keyword(search_keyword, vec_retriever))
print(check_for_keyword(search_keyword, vec_retriever_mmr))

As you can see, the vectorstore method returns some correct results but performs poorly overall.
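
The `search_type="mmr"` option stands for maximal marginal relevance, which trades relevance to the query against redundancy among the results already chosen. The following is a minimal sketch of one common formulation, not Chroma's or LangChain's code. The function `mmr`, the parameter name `lambda_mult` as used here, and the two-dimensional vectors are assumptions chosen for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def mmr(query, candidates, k=2, lambda_mult=0.5):
    """Return indices of k candidates, balancing relevance against redundancy."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i):
            relevance = cosine(query, candidates[i])
            # Similarity to the closest already-selected result (0 if none yet).
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
candidates = [[1.0, 0.1], [1.0, 0.12], [0.6, -0.8]]  # the first two are near-duplicates
similarity_only = mmr(query, candidates, k=2, lambda_mult=1.0)
diverse = mmr(query, candidates, k=2, lambda_mult=0.5)
print(similarity_only, diverse)
```

With `lambda_mult=1.0` the redundancy term vanishes and the two near-duplicates are returned. With `0.5` the second pick moves to the less similar but more distinct candidate.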

### 3.2 Parent Document Retriever

#### 3.2.1 Full Documents from smaller chunks

In [None]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore

pdfd_vectorstore = Chroma(
    collection_name="full_documents", embedding_function=OpenAIEmbeddings()
)

pdfd_store = InMemoryStore()
pdfd_retriever = ParentDocumentRetriever(
    vectorstore=pdfd_vectorstore,
    docstore=pdfd_store,
    child_splitter=docs_splitter,
)

In [None]:
pdfd_retriever.add_documents(docs, ids=None)

In [None]:
pdfd_retriever.get_relevant_documents(search_keyword)

In [None]:
print(check_for_keyword(search_keyword, pdfd_retriever))

This method returns chunks that are too large, namely whole documents, and is therefore not suitable.

#### 3.2.2 Larger chunks from smaller chunks

In [27]:
from langchain.retrievers.multi_vector import SearchType
from langchain.text_splitter import RecursiveCharacterTextSplitter, Language
from langchain_community.vectorstores.chroma import Chroma
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_community.embeddings import OpenAIEmbeddings

parent_splitter = RecursiveCharacterTextSplitter.from_language(language=Language.MARKDOWN, chunk_size=400, chunk_overlap=0)
child_splitter = RecursiveCharacterTextSplitter.from_language(language=Language.MARKDOWN, chunk_size=100, chunk_overlap=0)

pdlg_vectorstore = Chroma(
    collection_name="split_parents", embedding_function=OpenAIEmbeddings()
)
pdlg_store = InMemoryStore()
pdlg_retriever = ParentDocumentRetriever(
    vectorstore=pdlg_vectorstore,
    docstore=pdlg_store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

pdlg_retriever.add_documents(docs)
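
The idea behind the two splitters is to search over small child chunks but hand back the larger parent chunk each match came from. Below is a minimal sketch of that bookkeeping, not the `ParentDocumentRetriever` implementation: the functions `build_index` and `retrieve_parents` and the sample text are made up, paragraphs and lines stand in for the 400- and 100-character chunks, and a substring match stands in for vector search.

```python
import uuid


def build_index(documents):
    """Map every small child chunk to the id of the larger parent chunk it came from."""
    parents = {}   # parent id -> parent text (plays the role of the docstore)
    children = []  # (child text, parent id) pairs (plays the role of the vectorstore)
    for doc in documents:
        for parent in doc.split("\n\n"):      # larger "parent" chunks: paragraphs
            parent_id = str(uuid.uuid4())
            parents[parent_id] = parent
            for child in parent.split("\n"):  # smaller "child" chunks: lines
                children.append((child, parent_id))
    return parents, children


def retrieve_parents(keyword, parents, children):
    """Match on children, then return each matching parent once."""
    seen, result = set(), []
    for child, parent_id in children:
        if keyword in child and parent_id not in seen:
            seen.add(parent_id)
            result.append(parents[parent_id])
    return result


sample_docs = [
    "StateBehavior emits entered and exited.\n"
    "StoreBehavior stores values in G.\n\n"
    "ValueBehavior shows a value."
]
parents, children = build_index(sample_docs)
result = retrieve_parents("StoreBehavior", parents, children)
print(result)
```

The match happens on the single `StoreBehavior` line, yet the returned text is the whole paragraph around it, so the answer comes with surrounding context.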

In [34]:
pdlg_retriever.get_relevant_documents("How to access values in a StoreBehaviour?")

[Document(page_content="The StateBehavior is the fundamental building block of a state machine. Each StateBehavior emits the signals  StateBehavior.entered()  and  StateBehavior.exited()  to communicate the state machine's state \n \n StoreBehavior \n Use the Godot meta properties to store state. You can configure it to store values in the global dictionary  G  and access it via  G.at(prop) . \n \n ValueBehavior", metadata={'source': 'pronto-docs/README.md'})]

In [32]:
from langchain.memory import ConversationSummaryMemory
from langchain_community.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain

llm = ChatOpenAI()
memory = ConversationSummaryMemory(
    llm=llm, memory_key="chat_history", return_messages=True
)
qa = ConversationalRetrievalChain.from_llm(llm, retriever=pdlg_retriever, memory=memory, verbose=True)
qa("How to access values in a StoreBehaviour?")



> Entering new LLMChain chain...
Prompt after formatting:
Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:

system: 
Follow Up Input: How to access values in a StoreBehaviour?
Standalone question:

> Finished chain.


> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
System: Use the following pieces of context to answer the user's question. 
If you don't know the answer, just say that you don't know, don't try to make up an answer.
----------------
The StateBehavior is the fundamental building block of a state machine. Each StateBehavior emits the signals  StateBehavior.entered()  and  StateBehavior.exited()  to communicate the state machine's state 
 
 StoreBehavior 
 Use the Godot meta properties to store state. You can configure it to store values in the gl

{'question': 'How to access values in a StoreBehaviour?',
 'chat_history': [SystemMessage(content='')],
 'answer': 'Values in a StoreBehavior can be accessed using the global dictionary "G" and the "G.at(prop)" method. By configuring the StoreBehavior to store values in the global dictionary, the values can be accessed and manipulated throughout the game.'}

In [33]:
print(check_for_keyword(search_keyword, pdlg_retriever))

0/0


In [41]:
qa("How to delete the connection of two behaviors?")



> Entering new LLMChain chain...
Prompt after formatting:
Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:

Follow Up Input: How to delete the connection of two behaviors?
Standalone question:

> Finished chain.


> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
System: Use the following pieces of context to answer the user's question. 
If you don't know the answer, just say that you don't know, don't try to make up an answer.
----------------
Connections are an extension of Godot signals to be more flexible. Connections can be dragged from any behavior to any arbitrary node in a scene. They are the primary means to assemble your game by wiring Behaviors together.

Creating

Connections are created by hovering the "+" that appears below selected nodes. There 

{'question': 'How to delete the connection of two behaviors?',
 'answer': 'To disconnect two behaviors in Godot, you can follow these steps:\n\n1. Select the node that has the behavior you want to disconnect.\n\n2. In the Inspector panel, navigate to the "Signals" section.\n\n3. Look for the signal that is connected to the other behavior you want to disconnect.\n\n4. Click on the "x" button next to the connected behavior\'s name to remove the connection.\n\nThis will disconnect the two behaviors, and they will no longer be linked through that signal.'}

In [46]:
pdlg_retriever.vectorstore.similarity_search("evaluate Bind Behavior")

[Document(page_content='Behavior \n Function \n \n BindBehavior', metadata={'doc_id': '9bcdb4bf-45e8-4cb5-8254-dd2536447bcb', 'source': 'pronto-docs/README.md'}),
 Document(page_content='Behavior \n Function \n \n BindBehavior', metadata={'doc_id': 'c32a35a9-cbe1-4f44-94cb-a6a2829e1de2', 'source': 'pronto-docs/README.md'}),
 Document(page_content='Behavior \n Function \n \n BindBehavior', metadata={'doc_id': 'fd742ccd-3901-43ea-8853-a226f89c504f', 'source': 'pronto-docs/README.md'}),
 Document(page_content='Behavior \n Function \n \n BindBehavior', metadata={'doc_id': '3baf6b9b-dacf-4ba0-96e9-8314086487cd', 'source': 'pronto-docs/README.md'})]