In [49]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("test.txt")

documents = loader.load()

documents[0].page_content

'What is embedding?\nAn embedding is like turning words or sentences into numbers so a computer can understand their meaning.\nExample:\n“Cat” might become [0.9, 0.2, 0.1]\n“Dog” might become [0.8, 0.25, 0.12]\nBecause the numbers are close, the computer knows cat and dog mean similar things.\nSo embedding = a way to represent meaning as numbers.\nWhy do we use vectors?\nA vector is just a list of numbers.\nWhen we make an embedding, we need a format that keeps all those numbers together — that’s exactly what a vector is.\nUsing vectors lets the computer measure:\nHow close two meanings are (using distance between vectors)\nWhat direction the meaning moves in (e.g., man → woman, car → truck)\nSo we use vectors because they let the computer compare meanings with math — fast and accurate.\nWhat is a vector database?\nA vector database is a special kind of storage that can quickly find which embeddings (vectors) are most similar to a new one.\nWhen you ask a question, your question become
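
The loaded text describes comparing embeddings by how close their vectors are. A minimal sketch of that idea, using the toy "Cat" and "Dog" vectors from the text. Cosine similarity is an assumption here (the text only says "distance between vectors"), and the function name `cosine_similarity` is illustrative:

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors copied from the sample text; real embeddings usually have many more dimensions.
cat = [0.9, 0.2, 0.1]
dog = [0.8, 0.25, 0.12]

similarity = cosine_similarity(cat, dog)
print(f"cat vs dog: {similarity:.3f}")
```

A value near 1 means the two vectors point in almost the same direction, which is the sense in which the text calls the two meanings "close".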

In [50]:
from langchain_text_splitters import CharacterTextSplitter

# Split on newlines into chunks of at most 1000 characters, with up to 200 characters of overlap
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200, separator="\n")

chunks = text_splitter.split_documents(documents)

print("Number of chunks ", len(chunks))

# enumerate gives the position directly; chunks.index(chunk) rescans the list on every pass
for index, chunk in enumerate(chunks):
    print(f"\nChunk length: {len(chunk.page_content)} characters, index: {index}")
    print("\n", chunk.page_content)

Number of chunks  2

Chunk length: 937 characters, index: 0

 What is embedding?
An embedding is like turning words or sentences into numbers so a computer can understand their meaning.
Example:
“Cat” might become [0.9, 0.2, 0.1]
“Dog” might become [0.8, 0.25, 0.12]
Because the numbers are close, the computer knows cat and dog mean similar things.
So embedding = a way to represent meaning as numbers.
Why do we use vectors?
A vector is just a list of numbers.
When we make an embedding, we need a format that keeps all those numbers together — that’s exactly what a vector is.
Using vectors lets the computer measure:
How close two meanings are (using distance between vectors)
What direction the meaning moves in (e.g., man → woman, car → truck)
So we use vectors because they let the computer compare meanings with math — fast and accurate.
What is a vector database?
A vector database is a special kind of storage that can quickly find which embeddings (vectors) are most similar to a new one
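
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified pure-Python sketch of line-based chunking with overlap. It is not LangChain's implementation; the function name `split_lines_with_overlap` and the sample text are made up for illustration:

```python
def split_lines_with_overlap(text, chunk_size, chunk_overlap, separator="\n"):
    """Greedy separator-based chunking; a simplified sketch, not LangChain's algorithm."""
    pieces = text.split(separator)
    result, current = [], []
    for piece in pieces:
        candidate = separator.join(current + [piece])
        if current and len(candidate) > chunk_size:
            # The current chunk is full: emit it.
            result.append(separator.join(current))
            # Carry trailing pieces forward while they fit in the overlap budget.
            carried = []
            for prev in reversed(current):
                if len(separator.join([prev] + carried)) > chunk_overlap:
                    break
                carried.insert(0, prev)
            # Drop carried pieces if they would push the next chunk over the limit.
            while carried and len(separator.join(carried + [piece])) > chunk_size:
                carried.pop(0)
            current = carried
        current.append(piece)
    if current:
        result.append(separator.join(current))
    return result

# Ten short lines, 7 characters each.
sample = "\n".join(f"line {i:02d}" for i in range(10))
demo_chunks = split_lines_with_overlap(sample, chunk_size=30, chunk_overlap=10)

for i, c in enumerate(demo_chunks):
    print(i, repr(c))
```

In this sketch each chunk ends with a line that the next chunk repeats, which is the effect `chunk_overlap` is meant to produce. A single piece longer than `chunk_size` would still come out as one oversized chunk.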

In [51]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader

# Try paragraph breaks first, then line breaks, then spaces, then single characters
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200, separators=["\n\n", "\n", " ", ""])

loader = WebBaseLoader("https://www.senate.gov/about/origins-foundations/senate-and-constitution/constitution.htm")
documents = loader.load()

chunks = text_splitter.split_documents(documents)

print("Number of chunks ", len(chunks))

# Print two neighboring chunks in full to compare where one ends and the next begins
print("Chunk 4:", chunks[4].page_content)
print("Chunk 5:", chunks[5].page_content)

Number of chunks  81
Chunk 4: The Constitution assigned to Congress responsibility for organizing the executive and judicial branches, raising revenue, declaring war, and making all laws necessary for executing these powers. The president is permitted to veto specific legislative acts, but Congress has the authority to override presidential vetoes by two-thirds majorities of both houses. The Constitution also provides that the Senate advise and consent on key executive and judicial appointments and on the approval for ratification of treaties.
Chunk 5: For over two centuries the Constitution has remained in force because its framers successfully separated and balanced governmental powers to safeguard the interests of majority rule and minority rights, of liberty and equality, and of the federal and state governments. More a concise statement of national principles than a detailed plan of governmental operation, the Constitution has evolved to meet the changing needs of a
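
Printing two neighboring chunks is one way to eyeball whether `chunk_overlap` produced any shared text. A small helper can measure it directly. This is a sketch: the name `shared_overlap` is hypothetical and the two strings are invented for illustration, not taken from the page:

```python
def shared_overlap(previous, following):
    """Length of the longest suffix of `previous` that is also a prefix of `following`."""
    max_len = min(len(previous), len(following))
    # Try the longest candidate first and shrink until one matches.
    for size in range(max_len, 0, -1):
        if previous.endswith(following[:size]):
            return size
    return 0

# Invented strings that share the text "has evolved".
a = "the Constitution has evolved"
b = "has evolved to meet changing needs"

overlap = shared_overlap(a, b)
print(overlap, repr(b[:overlap]))
```

With the notebook's variables this could be called as `shared_overlap(chunks[4].page_content, chunks[5].page_content)`. A result of 0 is possible, for example when the splitter breaks at a paragraph boundary and carries nothing forward.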

In [62]:
from langchain_text_splitters import MarkdownHeaderTextSplitter

markdown_text = "# Foo\n\n  ## Bar\nHi this is Jim\n\nHi this is Joe\n\n ### Boo \n\n Hi this is Lance \n\n ## Baz\n\n Hi this is Molly"

# Each (marker, name) pair maps a header level to the metadata key it fills
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
markdown_docs = markdown_splitter.split_text(markdown_text)

markdown_docs

[Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}, page_content='Hi this is Jim  \nHi this is Joe'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}, page_content='Hi this is Lance'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Baz'}, page_content='Hi this is Molly')]
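
The output above shows each piece of text tagged with the headers it sits under. A simplified pure-Python sketch of that bookkeeping follows. It is not the library's implementation: `split_by_headers` is a hypothetical name, and the sketch ignores details such as fenced code blocks and joins body lines with a plain newline:

```python
def split_by_headers(markdown, headers_to_split_on):
    """Group body lines under the headers active above them (simplified sketch)."""
    # Check longer markers first so "###" is not mistaken for "#".
    markers = sorted(headers_to_split_on, key=lambda pair: len(pair[0]), reverse=True)
    active = {}    # header name -> (level, title)
    sections = []  # list of (metadata dict, [body lines])
    for raw_line in markdown.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        matched = next(((m, n) for m, n in markers if line.startswith(m + " ")), None)
        if matched:
            marker, name = matched
            level = len(marker)
            # A new header closes any header at the same or a deeper level.
            active = {n: (lvl, t) for n, (lvl, t) in active.items() if lvl < level}
            active[name] = (level, line[len(marker):].strip())
            continue
        metadata = {n: t for n, (lvl, t) in active.items()}
        if sections and sections[-1][0] == metadata:
            sections[-1][1].append(line)
        else:
            sections.append((metadata, [line]))
    return [(meta, "\n".join(body)) for meta, body in sections]

sample = "# Foo\n\n  ## Bar\nHi this is Jim\n\nHi this is Joe\n\n ### Boo \n\n Hi this is Lance \n\n ## Baz\n\n Hi this is Molly"
headers = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]

result = split_by_headers(sample, headers)
for meta, body in result:
    print(meta, repr(body))
```

The key step is dropping headers at the same or a deeper level when a new header arrives, which is why the last section keeps `Header 1` but replaces `Header 2` and loses `Header 3`.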