# Notebook to show how we ingest information into a vector DB and then query it
- Load the PDF and convert it to a Markdown document
- Chunk the Markdown document into LangChain `Document` objects
- Load the chunks into a vector DB
- Query the vector DB

## Convert the PDF to Markdown

In [2]:
PDF_PATH = 'Wegner et al. - 2023 - Complexity measures for EEG microstate sequences - concepts and algorithms.pdf'

In [3]:
# Test the Unstructured loader from LangChain.
# Problem: it needs additional system-level dependencies.
# loader_local = UnstructuredLoader(
#     file_path=PDF_PATH,
#     strategy="hi_res",
# )
# docs_local = []
# for doc in loader_local.lazy_load():
#     docs_local.append(doc)

In [4]:
# my custom loader that first transforms the pdf to markdown
# convert_pdf_to_markdown(PDF_PATH)

## Convert the Markdown to chunks

In [5]:
# The Markdown file is expected at "<pdf name>/<pdf name>.md"
MD_STEM = PDF_PATH.split('.pdf')[0]
MD_PATH = f'{MD_STEM}/{MD_STEM}.md'
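
The string handling above works for a PDF that sits in the current directory. A slightly more robust variant can be sketched with `pathlib`, which keeps any parent directory and strips only the final `.pdf` suffix. The helper name `markdown_path_for` is hypothetical, and the `<name>/<name>.md` layout is simply the one implied by the cell above.

```python
from pathlib import Path


def markdown_path_for(pdf_path: str) -> Path:
    """Return <parent>/<stem>/<stem>.md for a PDF path (hypothetical helper)."""
    pdf = Path(pdf_path)
    stem = pdf.stem  # file name without the final ".pdf" suffix
    return pdf.parent / stem / f"{stem}.md"


example = markdown_path_for("papers/Some paper - 2023.pdf")
print(example.as_posix())
```

Because only the last suffix is removed, dots inside the file name (as in "et al.") are left untouched.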

In [6]:
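# Read the raw Markdown text (useful for a quick visual check)
with open(MD_PATH, 'r', encoding='utf-8') as md_file:
    markdown_text = md_file.read()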

In [7]:
from src.Tools.TextSplitter import MarkdownChunker
chunker = MarkdownChunker(md_path=MD_PATH, chunk_size=1000)
chunks = chunker.chunk(method='markdown+semantic')
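
`MarkdownChunker` is a project-specific class whose internals are not shown here. As a rough illustration of header-aware chunking, the sketch below splits Markdown at headings, records the heading trail as metadata, and cuts overlong sections into pieces of at most `chunk_size` characters. The function name `chunk_markdown` is hypothetical, the metadata keys (`level1`, `split_id`, `length`) are only modeled on the keys visible in the retrieval output further down, and the "semantic" part of `markdown+semantic` is not covered.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown(text: str, chunk_size: int = 1000) -> list[dict]:
    """Split Markdown at headings, then cut long sections into chunk_size pieces."""
    sections = []  # list of (heading metadata, section text)
    headings = {}  # current heading trail, e.g. {"level1": "Intro"}
    body = []      # lines collected for the current section

    def flush():
        content = "\n".join(body).strip()
        if content:
            sections.append((dict(headings), content))
        body.clear()

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or a deeper level, then record this one.
            headings = {k: v for k, v in headings.items() if int(k[5:]) < level}
            headings[f"level{level}"] = match.group(2).strip()
        else:
            body.append(line)
    flush()

    chunks = []
    for meta, content in sections:
        # Naive fixed-width cut; a real splitter would prefer sentence boundaries.
        for start in range(0, len(content), chunk_size):
            piece = content[start:start + chunk_size]
            chunks.append({
                "page_content": piece,
                "metadata": {**meta, "split_id": len(chunks), "length": len(piece)},
            })
    return chunks


sample = "# Title\nintro text\n## Methods\n" + "x" * 25
sample_chunks = chunk_markdown(sample, chunk_size=10)
print(len(sample_chunks))
```

Keeping the heading trail in the metadata makes it possible to see, at query time, which section a retrieved chunk came from.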

## Index the Chunks into a Vector DB

In [8]:
from langchain_chroma.vectorstores import Chroma
from langchain_ollama.embeddings import OllamaEmbeddings

In [9]:
embeddings = OllamaEmbeddings(model='nomic-embed-text')

In [10]:
vector_store = Chroma(
    collection_name='test_collection',
    embedding_function=embeddings,
    persist_directory='./test.db'
)

In [11]:
vector_store.add_documents(chunks)

['63c531c1-bbd6-467f-b1d3-aba13e2aa6b5',
 '5dac0131-f5c0-4b7d-857c-14be158336e8',
 'bd754f4e-3758-4d73-8608-30b00573c694',
 'c719f1cf-baac-4bc5-9985-3a68753727f5',
 'ccf08d24-2ffc-4f2f-a3b0-f278f8d8d483',
 '00decf9c-f595-436c-983c-b6d130483af2',
 '49ad7ad0-7537-4607-b15a-c478666e9530',
 'ccd967ab-875e-4211-96dd-9e0a7c638bb5',
 '1e691248-aacc-4e9c-8a15-befe31fdffd0',
 '45644d57-34be-461b-ae0b-e5c3ae1e8fa4',
 '522ecf7d-1c81-400b-ab6a-91fd943f054f',
 'aa006f71-e1b4-493f-968e-1f20c78d6f15',
 'd7d1676e-336f-4fa1-a738-5c8ab3090e0e',
 'f1bef897-f72b-465c-941e-10c97dda449a',
 'ac70682f-6464-4baa-9bb3-a56530464c3a',
 'ca206865-33f2-4849-86ff-7c3770529eff',
 '73f90f89-601c-419d-b99d-a71754ee375e',
 '023b504a-07d5-4614-aa38-195a908c6857',
 '582b9c04-7b79-4319-98be-2bdeb117cd68',
 'cb6ba6b2-4e06-494e-a8ac-434a5bc9735e',
 '5aeb7612-57f8-4ea9-b000-50266665d0c0',
 'd2772e0d-f031-4591-aebe-33136152902b',
 'f16473fe-0b4b-4864-bff5-ec4aac379e7d',
 'c58e979d-2a2c-444a-bb79-29b3a3692db2',
 '4dcc204d-a84f-
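
The IDs above are generated randomly on each call. Because the store persists to `./test.db`, re-running the cell may add the same chunks a second time under new IDs. One possible way around this, sketched under the assumption that each chunk exposes `page_content`, is to derive a repeatable ID from the chunk text and pass the list through the `ids` keyword of `add_documents`. The helper `stable_chunk_id` and the choice of namespace are hypothetical, and whether an existing ID is overwritten should be checked against the installed `langchain_chroma` version.

```python
import uuid


def stable_chunk_id(text: str, source: str) -> str:
    """Derive a repeatable UUID from the source name and chunk text (hypothetical helper)."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{source}::{text}"))


texts = ["first chunk", "second chunk"]  # stand-ins for chunk.page_content
ids = [stable_chunk_id(t, "paper.pdf") for t in texts]
print(ids)

# Assumed usage with the objects from this notebook:
# ids = [stable_chunk_id(c.page_content, PDF_PATH) for c in chunks]
# vector_store.add_documents(chunks, ids=ids)
```

Two chunks with identical text would receive the same ID, so including a per-chunk counter such as `split_id` in the hashed string may be needed.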

## Retrieve Information from the Vector DB

In [12]:
vector_store.similarity_search_with_relevance_scores('What is Lempel Ziv complexity')

[(Document(id='25183b33-4029-4699-8372-2bba619b6d5b', metadata={'split_id': 134, 'level1': '$\\begin{array}{c} 362\\\\ 363 \\end{array} \\quad \\text{Lempel-Ziv complexity (LZC)} \\end{array}$', 'table': False, 'level4': 'Hurst exponents', 'length': 119}, page_content='Fig. 4 Potts model Lempel-Ziv complexity (LZC) for Q = 4 (A) and Q = 5 (B). The same shape is observed for both models.'),
  0.6629649919359342),
 (Document(id='77c7589d-c1d2-4bd4-abf1-34e47715468c', metadata={'length': 302, 'level1': '916 Microstate sequence complexity in wake and sleep', 'table': False, 'split_id': 210, 'level4': '1007 1008 References'}, page_content='<span id="page-22-0"></span>1009 1010 1011 1012 Ab´asolo D, da Silva R, Simons S, et al (2014) Lempel-Ziv complexity\nanalysis of local field potentials in different vigilance states with different coarse-graining techniques. In: Romero\nR (ed) IFMBE Proceedings, XIII Mediterranean Conference on Medical'),
  0.6053016336317847),
 (Document(id='80e82c9c-a6
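
Conceptually, a similarity search embeds the query, compares it with every stored vector, and returns the closest entries together with a score. The toy sketch below shows that idea with hand-written three-dimensional vectors and cosine similarity. It is not Chroma's actual index or score normalization, the vectors are not `nomic-embed-text` embeddings, and the names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the k (text, score) pairs most similar to query_vec, best first."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "embeddings"; a real store would hold model-generated vectors.
store = [
    ("Lempel-Ziv complexity", [1.0, 0.0, 0.0]),
    ("Hurst exponents", [0.0, 1.0, 0.0]),
    ("Entropy rate", [0.6, 0.0, 0.8]),
]
results = top_k([1.0, 0.0, 0.0], store, k=2)
for text, score in results:
    print(f"{score:.2f}  {text}")
```

A score close to 1 means the stored vector points in nearly the same direction as the query vector. Scores in the 0.6 range, as in the output above, indicate only moderate similarity.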

## RAG Example

In [13]:
from langchain_ollama.chat_models import ChatOllama
llm = ChatOllama(model='gpt-oss:latest')

In [15]:
from langchain import hub
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict
from langchain_core.documents import Document

# Define prompt for question-answering
# N.B. for non-US LangSmith endpoints, you may need to specify
# api_url="https://api.smith.langchain.com" in hub.pull.
prompt = hub.pull("rlm/rag-prompt")


# Define state for application
class State(TypedDict):
    question: str
    context: List[Document]
    answer: str


# Define application steps
def retrieve(state: State):
    retrieved_docs = vector_store.similarity_search(state["question"])
    return {"context": retrieved_docs}


def generate(state: State):
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    messages = prompt.invoke({"question": state["question"], "context": docs_content})
    response = llm.invoke(messages)
    return {"answer": response.content}


# Compile application and test
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()
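
The compiled graph runs `retrieve` and then `generate`. Each step returns a partial update that is merged into the shared state. The plain-Python sketch below mimics only that data flow, with a stub retriever and a stub model in place of the vector store and the LLM. All names (`SketchState`, `fake_retrieve`, `fake_generate`, `run_pipeline`) are hypothetical, and this is not how LangGraph is implemented internally.

```python
from typing import Callable, TypedDict


class SketchState(TypedDict, total=False):
    question: str
    context: list[str]
    answer: str


def fake_retrieve(state: SketchState) -> dict:
    # Stand-in for vector_store.similarity_search: keep passages sharing a word.
    corpus = ["entropy rate of microstate sequences", "Hurst exponents"]
    words = set(state["question"].lower().split())
    return {"context": [p for p in corpus if words & set(p.lower().split())]}


def fake_generate(state: SketchState) -> dict:
    # Stand-in for the LLM call: only report how much context was used.
    return {"answer": f"Answer based on {len(state['context'])} passage(s)."}


def run_pipeline(state: SketchState, steps: list[Callable[[SketchState], dict]]) -> SketchState:
    """Run the steps in order, merging each partial update into the state."""
    for step in steps:
        state = {**state, **step(state)}
    return state


result = run_pipeline({"question": "What is entropy"}, [fake_retrieve, fake_generate])
print(result["answer"])
```

The final state holds the question, the retrieved context, and the answer, which is why `response["answer"]` can be read after `graph.invoke` in the next cell.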

In [16]:
response = graph.invoke({"question": "How is entropy measured in EEG Microstates"})
print(response["answer"])

In EEG microstate analysis entropy is quantified by treating the sequence of microstate labels as a symbolic time series. The **entropy rate** \(h_X\) is calculated as the average information per symbol (e.g., via Lempel‑Ziv or statistical‑entropy estimators), reflecting the randomness of the sequence. The **excess entropy** \(E\) is derived from the same sequence and measures the amount of predictable structure or long‑range correlations beyond the entropy rate.
