## Setup
Install LangChain with pip. The document loaders used below come from the langchain-community package, which is installed later in this notebook.

In [None]:
pip install -U langchain

In [None]:
import pydantic_core
print(pydantic_core.__version__)
# Output: 2.41.5 (or your installed version)

pydantic-core is the underlying Rust engine for Pydantic. Each release of the main pydantic library is usually pinned to a specific pydantic-core version, so make sure the pydantic and pydantic-core packages are in sync.
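
One way to check that the two packages are in sync is to print both installed versions side by side. This is a minimal sketch using only the standard library; the helper name `installed_versions` is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: installed version, or None if not installed}."""
    result = {}
    for name in packages:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in this environment
            result[name] = None
    return result


print(installed_versions(["pydantic", "pydantic-core"]))
```

If either value is None, or the pair looks mismatched after an upgrade, reinstalling pydantic itself should pull in the pydantic-core version it expects.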

In [None]:
# upgrade pydantic
pip install --upgrade pydantic

In [None]:
# upgrade langchain-openai
pip install --upgrade langchain-openai

In [None]:
pip list | grep langchain
# langchain-core and langchain-openai should now be on 1.x versions.

langchain                 1.1.0
langchain-classic         1.0.0
langchain-community       0.4.1
langchain-core            1.1.0
langchain-openai          1.1.0
langchain-text-splitters  1.0.0
Note: you may need to restart the kernel to use updated packages.


### To load and work with .txt files, use TextLoader

In [1]:
from langchain_community.document_loaders import TextLoader

In [2]:
loader = TextLoader("sample1.txt")
data = loader.load()
data[0]

Document(metadata={'source': 'sample1.txt'}, page_content='yfwyojfwnclkkbaluyudqtihyjymawdejwohryzcxmfetprweypeycsgrftmvdsmfnscoktfcdmhlhwbbokgqktritnplecfqlfvdkhxqdzxibfrvgildqtqyhxaopcchziwzbcuclapmbtwmakgxpfmegtudyikksbagxhemmvzrrbfqwuvitzqndnolfuzxmlihygqgkepxsidktisfaelmjriihcfwwywqkrhpwmowmbfxyfjhtioaavgjpfawcmozflgefvaksqhxrcfpqzptctxosbddstjilntyttsdjppszejtiavwbqxfkbufeghsrpqoxgwmsonzxutvgnvolnipwbpduywimhbsbggzimzrvqvmyzwmzoiqurkajgftwnypbmqrxwpnromteriomlstdconetdjsmetfeptfpxsnxqxvyajtspjigjgrkntedfzbeyzhrnowijcyhzzwxaxowbyhhzpuqvnneipsexsrcwzrmjtlrfplbpohonbxyfttmuawqtyautuntiejhrzhrwynxomkhzwoyyssdhqovyjtgjsibpgncigrehuadglzsiafzbxalnwgcjmolwebzvwfkpabqbcugpkbpfezzgvedbrfilicemnighefhvexjsggjoszgymcillptwnvyoojblrxezajrhmucbuxroxbwkrtwrhvklfhdgpojiorctgviyzwctugrwekvovmaxadezcjmiuotgrhoupbhlrcujxsvlnicpvssnclslfpdggviyvxmeyddacxjmxiwbsopwrmieqpdoacbuaaghrwjszcdyvmpphsbccvlnxkktijozfoxwnkxllemdydfmsyifxnnhyktrhenmofihrinfkrdyfklqdfhbswxqdjkphbjxqoaaybhrrgteasgwimwdrdzuyzwdr

In [3]:
data[0].page_content

'yfwyojfwnclkkbaluyudqtihyjymawdejwohryzcxmfetprweypeycsgrftmvdsmfnscoktfcdmhlhwbbokgqktritnplecfqlfvdkhxqdzxibfrvgildqtqyhxaopcchziwzbcuclapmbtwmakgxpfmegtudyikksbagxhemmvzrrbfqwuvitzqndnolfuzxmlihygqgkepxsidktisfaelmjriihcfwwywqkrhpwmowmbfxyfjhtioaavgjpfawcmozflgefvaksqhxrcfpqzptctxosbddstjilntyttsdjppszejtiavwbqxfkbufeghsrpqoxgwmsonzxutvgnvolnipwbpduywimhbsbggzimzrvqvmyzwmzoiqurkajgftwnypbmqrxwpnromteriomlstdconetdjsmetfeptfpxsnxqxvyajtspjigjgrkntedfzbeyzhrnowijcyhzzwxaxowbyhhzpuqvnneipsexsrcwzrmjtlrfplbpohonbxyfttmuawqtyautuntiejhrzhrwynxomkhzwoyyssdhqovyjtgjsibpgncigrehuadglzsiafzbxalnwgcjmolwebzvwfkpabqbcugpkbpfezzgvedbrfilicemnighefhvexjsggjoszgymcillptwnvyoojblrxezajrhmucbuxroxbwkrtwrhvklfhdgpojiorctgviyzwctugrwekvovmaxadezcjmiuotgrhoupbhlrcujxsvlnicpvssnclslfpdggviyvxmeyddacxjmxiwbsopwrmieqpdoacbuaaghrwjszcdyvmpphsbccvlnxkktijozfoxwnkxllemdydfmsyifxnnhyktrhenmofihrinfkrdyfklqdfhbswxqdjkphbjxqoaaybhrrgteasgwimwdrdzuyzwdrxaqadwfzqtikdzjzqhscexrgrevmtrqjmkzasqugyehoyxakkylxwgynuj

### To load and work with .csv files, use CSVLoader

In [None]:
from langchain_community.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(
    "sample1.csv"
)

# load() reads every row into memory and returns a list of documents
documents = loader.load()

# lazy_load() yields documents one at a time, which is useful for large datasets
for document in loader.lazy_load():
    print(document)

# Number of documents loaded
# len(documents)
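
To make the "one document per row" idea concrete, here is a rough standard-library sketch of that kind of conversion. It is not CSVLoader's actual code: the function name `rows_to_documents`, the plain-dict document shape, and the sample data are all assumptions for illustration.

```python
import csv
import io


def rows_to_documents(csv_text, source="sample1.csv"):
    """Turn each CSV row into a dict with page_content and metadata."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row_number, row in enumerate(reader):
        # One "column: value" line per cell, joined into a single text block
        page_content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append(
            {"page_content": page_content, "metadata": {"source": source, "row": row_number}}
        )
    return documents


# Hypothetical sample data, not the contents of sample1.csv
sample = "name,city\nAda,London\nLinus,Helsinki\n"
for doc in rows_to_documents(sample):
    print(doc)
```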

### WebBaseLoader
Loads all text from HTML web pages into a document format.
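
As a rough picture of what "loading all text from HTML" means, the sketch below strips tags from an HTML string using only the standard library. It is an illustration, not WebBaseLoader's implementation (which relies on BeautifulSoup, installed below); the class and function names and the sample page are made up.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


page = "<html><head><title>Demo</title></head><body><h1>Hello</h1><p>World</p></body></html>"
print(html_to_text(page))
```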

In [None]:
pip install langchain-community

In [1]:
pip install -qU langchain-community beautifulsoup4

Note: you may need to restart the kernel to use updated packages.


In [5]:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://www.tagesschau.de/")

USER_AGENT environment variable not set, consider setting it to identify your requests.


In [15]:
# The message above is a warning from LangChain to keep your web scraping "polite"
# and to prevent your requests from being blocked by websites like Tagesschau.
# Many websites block requests that do not identify themselves (by default, Python scripts do not).

# To fix this, set a User-Agent header that identifies your bot/script.
import os
from langchain_community.document_loaders import WebBaseLoader

# 1. Set the User Agent to identify your bot/script
os.environ["USER_AGENT"] = "MyBot/1.0 (Contact: my-email@example.com)"

# 2. Now initialize the loader
loader = WebBaseLoader("https://www.filesampleshub.com")
docs = loader.load()
#docs[0].metadata
docs[0].page_content[:500]

'Download Free Sample Files - Test Files & Dummy Files | FileSamplesHubFileSamplesHubDocumentsImagesArchivesCodesArticlesPrivacyDownload Free Sample Files for Testing, Desired Files, and Dummy FilesExplore a vast collection of meticulously curated free sample sample test files, desired files, and dummy files, all available for download at your convenience. Our repository offers a diverse range of versatile sample files, crafted to cater to your specific testing and demonstration needs. Whether yo'

In [17]:
# You can also pass a list of URLs to load several pages with one loader.
loader_multiple_pages = WebBaseLoader(
    ["https://www.website1.com/", "https://website2.com"]
)

### Text splitters
Text splitters break large documents into smaller chunks that can be retrieved individually and fit within a model's context window.
AI models such as ChatGPT can only read a limited amount of text at one time (this limit is called the context window).
A text splitter breaks a long article into smaller, manageable chunks, such as paragraphs, that the model can process easily.
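
The two settings used throughout this section, chunk size and chunk overlap, can be illustrated with a plain-Python sketch. This is the simplest possible fixed-width version, not how the LangChain splitters work internally; the function name `chunk_with_overlap` is made up.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


print(chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, which is the "overlap" that helps keep context across chunk boundaries.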

In [None]:
document = """
CSV (Comma-Separated Values) files are a widely embraced plain text format for storing and exchanging tabular data. 
In a CSV file, each line typically represents a single record or row of data. 
The values within that row are separated by commas or other specified field separators. 
Each value in a CSV file corresponds to a cell within a table, where each column signifies a specific attribute or field. 
CSV files offer incredible versatility and can be easily generated, read, and manipulated using various software applications 
and programming languages. They provide a standardized way to represent structured data, 
making them a preferred choice for data interchange between different software systems without the need for complex formatting.
Despite their simplicity and human-readable nature, CSV files do have limitations. 
They lack built-in support for data types or complex data relationships found in more advanced file formats. 
Nevertheless, CSV files remain a popular choice for tasks such as data import/export, data analysis, 
and database population due to their straightforward format and broad compatibility across different platforms.
When working with CSV files, it's important to understand concepts such as escape characters (like double quotes),
used to handle special cases where data values contain commas or other field separators. Additionally,
many CSV files include a header line at the beginning, providing column names for clarity.
If you're looking for practical examples, you can easily find and download sample CSV files online. 
These sample CSV files are valuable resources for getting hands-on experience with the structure of a comma-separated value CSV file.
In summary, CSV files, often referred to as a flat file format, offer a straightforward and efficient way to manage tabular data.
They play a crucial role in various aspects of data management, including CSV import."""

In [44]:
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000, 
    chunk_overlap=20,
)
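# chunksd: chunks of the CSV description string defined above
chunksd = text_splitter.split_text(document)
# chunks: chunks of the web page loaded earlier with WebBaseLoader
chunks = text_splitter.split_documents(docs)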

In [45]:
# Print the length of each chunk of the CSV description text
for chunk in chunksd:
    print(len(chunk))

747
947
215


In [30]:
print(f"Original Pages: {len(docs)}")
print(f"Total Chunks:   {len(chunks)}\n")

Original Pages: 1
Total Chunks:   1



In [31]:
print("--- Content of Chunk 1 ---")
print(chunks[0].page_content)

--- Content of Chunk 1 ---
Download Free Sample Files - Test Files & Dummy Files | FileSamplesHubFileSamplesHubDocumentsImagesArchivesCodesArticlesPrivacyDownload Free Sample Files for Testing, Desired Files, and Dummy FilesExplore a vast collection of meticulously curated free sample sample test files, desired files, and dummy files, all available for download at your convenience. Our repository offers a diverse range of versatile sample files, crafted to cater to your specific testing and demonstration needs. Whether you're a developer, designer, or tester, you'll discover the perfect sample files in various sizes, resolutions, and formats to suit your projects.Navigating through our user-friendly interface, you can effortlessly browse and download each sample file, ensuring a seamless experience from start to finish.At the heart of our service lies a commitment to the safety and integrity of our sample files for testing. Rest assured that every sample file, including the desired fil

In [33]:
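# Only one chunk was produced above, so there is no second chunk to show and chunks[0] is printed again.
# When the splitter returns more than one chunk, use chunks[1] here to see the overlap.
print("\n--- Content of Chunk 2 (Notice the overlap!) ---")
print(chunks[0].page_content)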


--- Content of Chunk 2 (Notice the overlap!) ---
Download Free Sample Files - Test Files & Dummy Files | FileSamplesHubFileSamplesHubDocumentsImagesArchivesCodesArticlesPrivacyDownload Free Sample Files for Testing, Desired Files, and Dummy FilesExplore a vast collection of meticulously curated free sample sample test files, desired files, and dummy files, all available for download at your convenience. Our repository offers a diverse range of versatile sample files, crafted to cater to your specific testing and demonstration needs. Whether you're a developer, designer, or tester, you'll discover the perfect sample files in various sizes, resolutions, and formats to suit your projects.Navigating through our user-friendly interface, you can effortlessly browse and download each sample file, ensuring a seamless experience from start to finish.At the heart of our service lies a commitment to the safety and integrity of our sample files for testing. Rest assured that every sample file, in

Even though we set a chunk size, some chunks can be longer than that limit: CharacterTextSplitter only splits at the given separator, so a stretch of text that does not contain the separator stays in one piece.
Text is naturally organized into hierarchical units such as paragraphs, sentences, and words. We can use this inherent structure to guide the splitting strategy, creating splits that preserve natural language flow, stay semantically coherent, and adapt to varying levels of text granularity.

*LangChain’s RecursiveCharacterTextSplitter* implements this concept. <br>
The *RecursiveCharacterTextSplitter* attempts to keep larger units (e.g., paragraphs) intact. <br>
If a unit exceeds the chunk size, it moves to the next level (e.g., sentences).
This process continues down to the word level if necessary.
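
The fallback from larger to smaller units can be sketched in plain Python. This is a simplified illustration of the idea, not the library's implementation: the function name `recursive_split`, the separator order, and the sample text are assumptions, and overlap is left out.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters, trying coarse separators first."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too big: retry with the next, finer separator
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "Short paragraph.\n\nThis second paragraph is quite a bit longer than the limit allows."
for chunk in recursive_split(sample, chunk_size=30):
    print(len(chunk), repr(chunk))
```

The short paragraph survives intact, while the long one is broken at spaces, so no chunk exceeds the limit unless a single word is itself longer than the limit (in which case it is hard cut).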

In [37]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=0)
# chunk_size=100   -> max characters per chunk (the "bite size")
# chunk_overlap=0  -> characters repeated between neighbouring chunks (the "rewind/overlap"); 0 means none
# chunks = text_splitter.split_text(document)
chunks = text_splitter.split_documents(docs)
for chunk in chunks:
    print(len(chunk.page_content))

55
97
97
92
97
95
96
97
97
96
97
98
98
89
92
99
91
96
95
98
94
96
90
97
98
96
87
97
96
95
96
97
95
88
97
94
96
99
92
97
99
97
92
84
97
89
97
94
98
94
94
94
99
98
98
36


In [38]:
chunks[5]

Document(metadata={'source': 'https://www.filesampleshub.com', 'title': 'Download Free Sample Files - Test Files & Dummy Files | FileSamplesHub', 'description': 'Download free sample files in various formats (PDF, DOC, ODT, ZIP, MP3, etc.). Safe, virus-free test files and dummy files for development and testing. Verified content for all your project needs.', 'language': 'en'}, page_content="your specific testing and demonstration needs. Whether you're a developer, designer, or tester,")

### Vector Store
We currently have a stack of text chunks. To find a specific piece of information, we don't want to read every single chunk again; we need a filing system. This filing system is a *Semantic (Meaning) System*.<br>

*1. The Translator (Embeddings)* <br>
We need a special translator called an *Embedding Model*. <br>
This translator turns every text chunk into a long list of numbers (coordinates).<br>
For example, the word "Dog" might become [0.1, 0.5, 0.9] (real embeddings are much longer lists).

*2. The Map (Vector Store)*<br>
The Vector Store is a database that acts like a giant map. It takes those coordinates and places the text chunks on the map.<br>
All the "Sports" chunks get grouped together in one corner.<br>
All the "Politics" chunks get grouped in another corner.<br>

*3. The Search (Retrieval)* <br>
When we ask a question like "What is the news about money?":<br>
The system turns our question into numbers. <br>
It looks at the "Map" to see where our question lands.<br>
It grabs the text chunks that are closest to our question on the map.<br>

*shahid*: Even if the article uses the word "Economy" and our question used "Money," the system finds it because those two concepts live next to each other on the map.
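
The three steps above can be sketched with hand-made vectors and cosine similarity. The 3-number "embeddings" below are invented for illustration and do not come from a real embedding model; the function names and sample sentences are also made up.

```python
import math


def cosine_similarity(a, b):
    """Similarity of two vectors: close to 1.0 means same direction, close to 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# The "map": each text chunk stored with its (hypothetical) coordinates
store = {
    "The economy grew by two percent.": [0.9, 0.1, 0.0],
    "The team won the football final.": [0.0, 0.9, 0.1],
    "Parliament passed a new law.": [0.1, 0.0, 0.9],
}

# Pretend this is the embedding of "What is the news about money?"
query_vector = [0.8, 0.2, 0.1]


def nearest(query, store, k=1):
    """Return the k stored chunks whose vectors are closest to the query vector."""
    ranked = sorted(store, key=lambda text: cosine_similarity(query, store[text]), reverse=True)
    return ranked[:k]


print(nearest(query_vector, store))
```

The query never mentions "economy", yet the economy chunk is returned because its vector points in nearly the same direction as the query vector.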

______________________________________

We will use ChromaDB (a popular, free Vector Store) and OpenAI Embeddings (the translator).

In [None]:
pip install langchain-chroma langchain-openai

In [40]:
import os
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

In [None]:
embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")

OpenAIError: The api_key client option must be set either by passing api_key to the client or by setting the OPENAI_API_KEY environment variable

#### This model needs an API key set in the OPENAI_API_KEY environment variable. Next we will use FAISS, which is free, just to understand embeddings.
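
To avoid hitting the error above, one option is to check for the key before creating the embedding model. This is a small standard-library sketch; the helper name `has_openai_key` is made up, and it only checks that the variable is present, not that the key is valid.

```python
import os


def has_openai_key(env=None):
    """Return True if OPENAI_API_KEY is set to a non-empty value."""
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())


if has_openai_key():
    print("OPENAI_API_KEY is set; OpenAIEmbeddings can be created.")
else:
    print("OPENAI_API_KEY is not set; set it before creating OpenAIEmbeddings.")
```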

In [None]:
# next file > embeddings-f.ipynb