In [30]:
import os
from typing import List, Dict, Any
import pandas as pd


In [31]:
from langchain_core.documents import Document

In [32]:
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    CharacterTextSplitter,
    TokenTextSplitter,
)

In [33]:
# Create a sample document: page_content holds the text, metadata holds arbitrary key/value pairs

doc = Document(
    page_content="This is a sample document.",
    metadata={
        "source": "sample.txt",
        "author": "Darshan",
        "date": "2023-01-01",
        "tags": ["sample", "document"],
        "url": "https://www.example.com/sample",
        "length": 100,
        "page": 1,
    },
)


In [34]:
print(f"Document main content {doc.page_content}")
print(f"Document metadata {doc.metadata}")
print(f"Document metadata Source {doc.metadata['source']}")


Document main content This is a sample document.
Document metadata {'source': 'sample.txt', 'author': 'Darshan', 'date': '2023-01-01', 'tags': ['sample', 'document'], 'url': 'https://www.example.com/sample', 'length': 100, 'page': 1}
Document metadata Source sample.txt
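
A Document is essentially a piece of text plus a metadata dictionary, and the metadata is what lets you select documents later (by source, page, and so on). The sketch below illustrates that idea with the standard library only. `SimpleDocument` and `filter_by_metadata` are hypothetical names introduced for illustration; they are not part of LangChain.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SimpleDocument:
    """Hypothetical stand-in mirroring the two fields used above."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def filter_by_metadata(
    docs: List[SimpleDocument], key: str, value: Any
) -> List[SimpleDocument]:
    """Return the documents whose metadata[key] equals value."""
    return [d for d in docs if d.metadata.get(key) == value]


sample_docs = [
    SimpleDocument("This is a sample document.", {"source": "sample.txt", "page": 1}),
    SimpleDocument("Another page.", {"source": "sample.txt", "page": 2}),
    SimpleDocument("A different file.", {"source": "other.txt", "page": 1}),
]

# Keep only the documents that came from sample.txt
from_sample = filter_by_metadata(sample_docs, "source", "sample.txt")
print([d.metadata["page"] for d in from_sample])
```

The same list comprehension should work on real Document objects, since they expose `page_content` and `metadata` in the same way.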


#Reading text files


In [35]:
import os
os.makedirs("data/text_files", exist_ok=True)

In [36]:
sample_text_files = {
    "data/text_files/models_in_langchain.txt": """LangChain Models: The Foundation of AI Applications

LangChain provides a unified interface for working with various language models including OpenAI GPT, Anthropic Claude, Google PaLM, and Hugging Face models.
The BaseLanguageModel class serves as the foundation for all model implementations in LangChain.
Chat models like ChatOpenAI are designed for conversational interfaces and support system, human, and AI message types.
Completion models like OpenAI are used for text completion tasks and single-turn interactions.
Model parameters such as temperature, max_tokens, and top_p can be configured to control output randomness and length.
LangChain supports both synchronous and asynchronous model calls for better performance in production applications.
Model callbacks allow you to track token usage, latency, and costs across different model providers.
The model abstraction enables easy switching between different providers without changing your application code.
Streaming responses are supported for real-time applications where you need to display partial results.
Model caching can be implemented to reduce API calls and improve response times for repeated queries.
Custom models can be integrated by implementing the BaseLanguageModel interface.
Batch processing capabilities allow you to process multiple inputs efficiently.
Error handling and retry mechanisms are built into the model classes for robust production deployments.
Model validation ensures that inputs meet the requirements of specific model providers.
The LangChain model ecosystem continues to expand with new providers and capabilities being added regularly.""",
    
    "data/text_files/chains_in_langchain.txt": """LangChain Chains: Orchestrating Complex AI Workflows

Chains in LangChain are the building blocks for creating complex AI applications by linking multiple components together.
The BaseChain class provides the foundation for all chain implementations with standardized input/output handling.
LLMChain is the most basic chain that combines a language model with a prompt template for simple text generation tasks.
SequentialChain allows you to chain multiple operations together where the output of one becomes the input of the next.
SimpleSequentialChain is a simplified version for linear workflows with single input/output between steps.
RouterChain enables conditional logic by routing inputs to different sub-chains based on content or criteria.
TransformChain allows you to apply custom transformations to data as it flows through your pipeline.
MapReduceChain is designed for processing large documents by splitting, processing in parallel, and combining results.
StuffDocumentsChain combines multiple documents into a single prompt for processing by a language model.
RefineDocumentsChain iteratively refines answers by processing documents one at a time and updating the response.
MapRerankDocumentsChain processes documents independently and ranks the results to find the best answer.
ConversationChain maintains context across multiple interactions for chatbot and conversational applications.
SummarizationChain specializes in creating concise summaries from longer text documents.
QAChain is optimized for question-answering tasks with built-in retrieval and response generation.
Chain composition allows you to build complex workflows by combining simpler chains in sophisticated ways.
Custom chains can be created by extending BaseChain and implementing the required methods for your specific use case.""",
    
    "data/text_files/agents_in_langchain.txt": """LangChain Agents: Autonomous AI Decision Making

Agents in LangChain are autonomous systems that can make decisions about which tools to use and how to use them.
The Agent class combines a language model with a set of tools and decision-making logic to solve complex problems.
ReAct agents use a reasoning and acting pattern to break down problems and execute solutions step by step.
Zero-shot agents can work with tools they haven't seen before by using their descriptions and examples.
Conversational agents maintain context across interactions while having access to tools and external resources.
The AgentExecutor is responsible for running agents safely with proper error handling and execution limits.
Tools are the external capabilities that agents can use, such as search engines, calculators, or API calls.
Tool selection happens dynamically based on the agent's understanding of the current task and available options.
Agent memory allows for persistence of information across multiple interactions and tool uses.
Custom tools can be created by implementing the BaseTool interface with proper descriptions and input schemas.
Agent callbacks provide visibility into the decision-making process and tool usage for debugging and monitoring.
Multi-agent systems can be built where different agents specialize in different types of tasks or domains.
Agent planning involves breaking down complex tasks into smaller, manageable steps that can be executed sequentially.
Error recovery mechanisms help agents handle tool failures and unexpected situations gracefully.
Agent evaluation frameworks help measure the performance and reliability of autonomous systems.
Safety considerations include output filtering, tool access controls, and execution timeouts to prevent harmful actions.""",
    
    "data/text_files/retrievers_in_langchain.txt": """LangChain Retrievers: Intelligent Information Retrieval

Retrievers in LangChain are components designed to fetch relevant information from various data sources efficiently.
The BaseRetriever class provides a standardized interface for all retrieval implementations in the framework.
VectorStoreRetriever uses embedding-based similarity search to find relevant documents from vector databases.
BM25Retriever implements the BM25 algorithm for keyword-based document retrieval with term frequency scoring.
TFIDFRetriever uses Term Frequency-Inverse Document Frequency for traditional information retrieval tasks.
SelfQueryRetriever can interpret natural language queries and convert them into structured database queries.
MultiQueryRetriever generates multiple variations of a query to improve retrieval coverage and accuracy.
EnsembleRetriever combines multiple retrieval methods to leverage the strengths of different approaches.
ContextualCompressionRetriever filters and compresses retrieved documents to focus on the most relevant content.
ParentDocumentRetriever maintains relationships between document chunks and their parent documents for better context.
TimeWeightedVectorStoreRetriever considers both relevance and recency when ranking retrieved documents.
MultiVectorRetriever can work with multiple vector representations of the same document for improved matching.
Retrieval evaluation metrics help measure the quality and effectiveness of different retrieval strategies.
Hybrid retrieval combines dense and sparse retrieval methods for optimal performance across different query types.
Retrieval augmented generation (RAG) patterns use retrievers to provide context for language model responses.
Custom retrievers can be implemented to work with proprietary data sources or specialized retrieval algorithms.""",
    
    "data/text_files/embeddings_in_langchain.txt": """LangChain Embeddings: Vector Representations for AI

Embeddings in LangChain convert text into high-dimensional vector representations that capture semantic meaning.
The Embeddings base class provides a consistent interface for different embedding model providers and implementations.
OpenAIEmbeddings integrates with OpenAI's text-embedding models for high-quality semantic representations.
HuggingFaceEmbeddings allows you to use any embedding model from the Hugging Face model hub locally.
SentenceTransformerEmbeddings specializes in creating embeddings optimized for semantic similarity tasks.
CohereEmbeddings provides access to Cohere's multilingual embedding models for global applications.
Embedding dimensions typically range from 384 to 1536 depending on the model and use case requirements.
Batch embedding processing improves efficiency when working with large collections of documents.
Embedding caching reduces computational costs by storing previously computed embeddings for reuse.
Normalization of embeddings ensures consistent similarity calculations across different vector spaces.
Embedding fine-tuning can improve performance for domain-specific applications and specialized vocabularies.
Multilingual embeddings enable cross-language semantic search and similarity matching capabilities.
Embedding evaluation involves measuring how well vectors capture semantic relationships in your specific domain.
Vector databases store and index embeddings for fast similarity search and retrieval operations.
Embedding visualization techniques help understand the semantic space and relationships between concepts.
Custom embedding models can be integrated by implementing the Embeddings interface with proper tokenization and encoding.""",
    
    "data/text_files/vectorstores_in_langchain.txt": """LangChain Vector Stores: Scalable Similarity Search

Vector stores in LangChain provide efficient storage and retrieval of high-dimensional embeddings for similarity search.
The VectorStore base class defines the standard interface for all vector database implementations.
Chroma is a lightweight, open-source vector database that's perfect for development and small-scale applications.
Pinecone offers a managed vector database service with high performance and scalability for production use.
Weaviate provides a cloud-native vector database with built-in vectorization and hybrid search capabilities.
FAISS (Facebook AI Similarity Search) enables efficient similarity search and clustering of dense vectors.
Qdrant is a vector similarity search engine with extended filtering support and high availability features.
Milvus is an open-source vector database built for scalable similarity search and AI applications.
Vector store operations include adding documents, similarity search, and maximum marginal relevance retrieval.
Metadata filtering allows you to combine vector similarity with traditional database-style filtering.
Index types like IVF, HNSW, and LSH offer different trade-offs between speed, accuracy, and memory usage.
Batch operations improve performance when adding or updating large numbers of documents in vector stores.
Vector store persistence ensures that your embeddings and metadata are saved between application sessions.
Hybrid search combines vector similarity with keyword search for more comprehensive retrieval capabilities.
Vector store scaling involves considerations like sharding, replication, and distributed query processing.
Custom vector stores can be implemented to integrate with proprietary or specialized similarity search systems.""",
    
    "data/text_files/memory_in_langchain.txt": """LangChain Memory: Maintaining Context in AI Applications

Memory components in LangChain enable AI applications to maintain context and state across multiple interactions.
The BaseMemory class provides the foundation for all memory implementations with standardized read/write operations.
ConversationBufferMemory stores the complete conversation history for maintaining full context in chat applications.
ConversationBufferWindowMemory keeps only the last N interactions to manage memory usage in long conversations.
ConversationSummaryMemory creates summaries of older conversations to maintain context while reducing memory footprint.
ConversationSummaryBufferMemory combines recent messages with summaries of older interactions for optimal context management.
EntityMemory tracks and maintains information about specific entities mentioned throughout the conversation.
KnowledgeGraphMemory builds and maintains a knowledge graph of relationships between entities and concepts.
VectorStoreRetrieverMemory uses similarity search to retrieve relevant past interactions based on current context.
Memory persistence allows conversation state to be saved and restored across application sessions.
Memory serialization enables memory state to be stored in databases or files for long-term persistence.
Conversation memory can be shared across multiple chains or agents for consistent context management.
Memory optimization techniques help balance context retention with computational and storage efficiency.
Custom memory classes can be created to implement specialized context management for specific use cases.
Memory evaluation involves measuring how well context is maintained and utilized across interactions.
Memory security considerations include data privacy, access controls, and sensitive information handling.""",
    
    "data/text_files/document_loaders_in_langchain.txt": """LangChain Document Loaders: Ingesting Data from Multiple Sources

Document loaders in LangChain provide standardized ways to ingest data from various file formats and sources.
The BaseLoader class defines the interface that all document loaders must implement for consistent behavior.
TextLoader handles plain text files with configurable encoding and error handling for robust file processing.
PDFLoader extracts text content from PDF documents while preserving structure and metadata information.
CSVLoader processes comma-separated value files with customizable column handling and data type inference.
JSONLoader parses JSON files and can extract specific fields or flatten nested structures as needed.
WebBaseLoader scrapes content from web pages with support for custom selectors and content filtering.
DirectoryLoader recursively processes multiple files in a directory with configurable file type filtering.
UnstructuredLoader handles various document formats including Word, PowerPoint, and HTML files.
NotionDBLoader connects to Notion databases to extract and synchronize content from collaborative workspaces.
GitHubIssuesLoader retrieves issues and pull requests from GitHub repositories for analysis and processing.
SlackDirectoryLoader extracts messages and conversations from Slack workspace exports.
ConfluenceLoader connects to Atlassian Confluence to extract wiki pages and documentation.
S3FileLoader and S3DirectoryLoader handle files stored in Amazon S3 buckets with proper authentication.
Document metadata extraction preserves important information like creation dates, authors, and source locations.
Custom loaders can be created by extending BaseLoader to handle proprietary formats or specialized data sources.""",
    
    "data/text_files/text_splitters_in_langchain.txt": """LangChain Text Splitters: Intelligent Document Chunking

Text splitters in LangChain break down large documents into smaller, manageable chunks for processing.
The TextSplitter base class provides the foundation for all splitting strategies with configurable parameters.
RecursiveCharacterTextSplitter intelligently splits text while trying to keep related content together.
CharacterTextSplitter divides text based on specific characters like newlines or custom separators.
TokenTextSplitter splits text based on token count to ensure chunks fit within model context limits.
MarkdownHeaderTextSplitter preserves document structure by splitting on markdown headers and maintaining hierarchy.
HTMLHeaderTextSplitter processes HTML documents while preserving semantic structure and tag information.
CodeTextSplitter handles programming code with language-specific splitting that respects syntax boundaries.
LatexTextSplitter processes LaTeX documents while maintaining mathematical expressions and document structure.
PythonCodeTextSplitter specifically handles Python code with awareness of functions, classes, and logical blocks.
Chunk size configuration balances context preservation with processing efficiency and model limitations.
Chunk overlap ensures that important information spanning chunk boundaries is not lost during processing.
Semantic splitting attempts to break text at natural boundaries like sentences or paragraphs for better coherence.
Metadata preservation ensures that chunk-level information maintains references to source documents and locations.
Splitting evaluation helps optimize chunk size and overlap parameters for specific use cases and document types.
Custom splitters can be implemented to handle specialized document formats or domain-specific splitting requirements.""",
    
    "data/text_files/output_parsers_in_langchain.txt": """LangChain Output Parsers: Structured Data from Language Models

Output parsers in LangChain convert unstructured language model responses into structured, usable data formats.
The BaseOutputParser class provides the foundation for all parsing implementations with validation and error handling.
PydanticOutputParser uses Pydantic models to define expected output schemas and automatically parse responses.
JSONOutputParser extracts and validates JSON data from language model responses with error recovery.
ListOutputParser converts comma-separated or numbered lists into Python list objects for further processing.
DatetimeOutputParser extracts and normalizes date and time information from natural language responses.
EnumOutputParser restricts outputs to predefined choices and validates that responses match expected options.
RegexParser uses regular expressions to extract specific patterns and information from model responses.
StructuredOutputParser combines multiple parsing strategies to handle complex, multi-field responses.
RetryOutputParser automatically retries parsing with corrected prompts when initial parsing fails.
OutputFixingParser attempts to repair malformed outputs by using additional language model calls.
CommaSeparatedListOutputParser specifically handles comma-delimited lists with proper escaping and trimming.
BooleanOutputParser converts natural language responses into boolean values with configurable true/false indicators.
Parser chaining allows multiple parsers to be combined for complex data extraction and transformation workflows.
Validation schemas ensure that parsed outputs meet business rules and data quality requirements.
Custom parsers can be created by extending BaseOutputParser to handle specialized output formats and validation rules."""
}

In [37]:
for file_path, content in sample_text_files.items():
    with open(file_path, "w", encoding="utf-8") as f:
        f.write(content)
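
Before loading the files, it can be worth confirming that they were written as expected. The following is a minimal sketch using only the standard library. It writes into a temporary directory so it does not touch `data/text_files`, and `demo_files` and `write_and_verify` are hypothetical stand-ins for the dictionary and loop above.

```python
import tempfile
from pathlib import Path
from typing import Dict


def write_and_verify(files: Dict[str, str], root: str) -> Dict[str, int]:
    """Write each {relative_path: text} entry under root and read it back."""
    sizes = {}
    for rel_path, content in files.items():
        target = Path(root) / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content, encoding="utf-8")
        # Round-trip check: the text read back must match what was written
        if target.read_text(encoding="utf-8") != content:
            raise ValueError(f"Content mismatch for {rel_path}")
        sizes[rel_path] = len(content)
    return sizes


# Hypothetical stand-in for sample_text_files
demo_files = {
    "data/text_files/a.txt": "first file\nsecond line",
    "data/text_files/b.txt": "second file",
}

with tempfile.TemporaryDirectory() as tmp:
    sizes = write_and_verify(demo_files, tmp)
    # Same glob pattern that the directory loader uses later on
    written = sorted(p.name for p in Path(tmp).glob("**/*.txt"))

print(sizes)
print(written)
```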

#Text Loader


In [38]:
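# Document loaders live in langchain_community in recent LangChain releases
from langchain_community.document_loaders import TextLoader

# Load one text file as a single Document; the raw string keeps the Windows-style path
loader = TextLoader(r"data\text_files\agents_in_langchain.txt", encoding="utf-8")

docs = loader.load()

print(docs)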



[Document(metadata={'source': 'data\\text_files\\agents_in_langchain.txt'}, page_content="LangChain Agents: Autonomous AI Decision Making\n\nAgents in LangChain are autonomous systems that can make decisions about which tools to use and how to use them.\nThe Agent class combines a language model with a set of tools and decision-making logic to solve complex problems.\nReAct agents use a reasoning and acting pattern to break down problems and execute solutions step by step.\nZero-shot agents can work with tools they haven't seen before by using their descriptions and examples.\nConversational agents maintain context across interactions while having access to tools and external resources.\nThe AgentExecutor is responsible for running agents safely with proper error handling and execution limits.\nTools are the external capabilities that agents can use, such as search engines, calculators, or API calls.\nTool selection happens dynamically based on the agent's understanding of the current 

In [39]:
# Directory Loader: load every .txt file under the folder, one Document per file
from langchain_community.document_loaders import DirectoryLoader, TextLoader

dir_loader = DirectoryLoader(
    r"data\text_files",
    glob="**/*.txt",
    loader_cls=TextLoader,
    show_progress=True,
    loader_kwargs={"encoding": "utf-8"},
)

docs = dir_loader.load()



100%|██████████| 10/10 [00:00<00:00, 604.28it/s]


In [40]:
for i, doc in enumerate(docs):
    print(f"Document number {i+1}")
    print(f"Content length: {len(doc.page_content)}")
    print(f"Source: {doc.metadata['source']}")


Document number 1
Content length: 1786
Source: data\text_files\agents_in_langchain.txt
Document number 2
Content length: 1815
Source: data\text_files\chains_in_langchain.txt
Document number 3
Content length: 1737
Source: data\text_files\document_loaders_in_langchain.txt
Document number 4
Content length: 1748
Source: data\text_files\embeddings_in_langchain.txt
Document number 5
Content length: 1818
Source: data\text_files\memory_in_langchain.txt
Document number 6
Content length: 1639
Source: data\text_files\models_in_langchain.txt
Document number 7
Content length: 1787
Source: data\text_files\output_parsers_in_langchain.txt
Document number 8
Content length: 1820
Source: data\text_files\retrievers_in_langchain.txt
Document number 9
Content length: 1801
Source: data\text_files\text_splitters_in_langchain.txt
Document number 10
Content length: 1776
Source: data\

#Text splitter


In [42]:
text = docs[0].page_content
text

"LangChain Agents: Autonomous AI Decision Making\n\nAgents in LangChain are autonomous systems that can make decisions about which tools to use and how to use them.\nThe Agent class combines a language model with a set of tools and decision-making logic to solve complex problems.\nReAct agents use a reasoning and acting pattern to break down problems and execute solutions step by step.\nZero-shot agents can work with tools they haven't seen before by using their descriptions and examples.\nConversational agents maintain context across interactions while having access to tools and external resources.\nThe AgentExecutor is responsible for running agents safely with proper error handling and execution limits.\nTools are the external capabilities that agents can use, such as search engines, calculators, or API calls.\nTool selection happens dynamically based on the agent's understanding of the current task and available options.\nAgent memory allows for persistence of information across mu

In [47]:
# CharacterTextSplitter: split on newlines, then merge lines into chunks of at most 200 characters
char_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=200,
    chunk_overlap=20,
    length_function=len,
)

char_chunks = char_splitter.split_text(text)
print(f"Number of character chunks: {len(char_chunks)}")
print(f"Second character chunk: {char_chunks[1]}")


Number of character chunks: 15
Second character chunk: The Agent class combines a language model with a set of tools and decision-making logic to solve complex problems.
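
To make `chunk_size` and `chunk_overlap` more concrete, here is a simplified sketch of separator-based chunking with overlap, written with the standard library only. It is an approximation for illustration, not LangChain's actual implementation, and `merge_lines` is a hypothetical name. The idea: split on the separator, pack pieces into a chunk until the next piece would exceed `chunk_size`, then carry over trailing pieces that fit within `chunk_overlap`.

```python
from typing import List


def merge_lines(
    text: str, separator: str = "\n", chunk_size: int = 200, chunk_overlap: int = 20
) -> List[str]:
    """Greedily pack separator-delimited pieces into chunks with overlap."""
    pieces = [p for p in text.split(separator) if p]
    chunks: List[str] = []
    current: List[str] = []
    for piece in pieces:
        if current and len(separator.join(current + [piece])) > chunk_size:
            # The current chunk is full: emit it
            chunks.append(separator.join(current))
            # Drop pieces from the front until what remains fits the overlap
            # budget and still leaves room for the incoming piece
            while current and (
                len(separator.join(current)) > chunk_overlap
                or len(separator.join(current + [piece])) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


demo_text = "\n".join(["aaaa", "bbbb", "cccc", "dddd"])

with_overlap = merge_lines(demo_text, chunk_size=10, chunk_overlap=4)
without_overlap = merge_lines(demo_text, chunk_size=10, chunk_overlap=0)

print(with_overlap)     # ['aaaa\nbbbb', 'bbbb\ncccc', 'cccc\ndddd']
print(without_overlap)  # ['aaaa\nbbbb', 'cccc\ndddd']
```

In this sketch a single piece longer than `chunk_size` is emitted as its own oversized chunk, because splitting only ever happens at the separator. In the cell above, most lines of the document are longer than 100 characters, so two of them rarely fit in 200 characters and no full line fits within the 20-character overlap.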


In [52]:
# RecursiveCharacterTextSplitter: try each separator in order until the pieces fit chunk_size

rec_character_text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size=200,
    chunk_overlap=20,
    length_function=len,
)

rec_char_chunks = rec_character_text_splitter.split_text(text)
print(f"Third recursive character chunk: {rec_char_chunks[2]}")


Third recursive character chunk: The Agent class combines a language model with a set of tools and decision-making logic to solve complex problems.
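
The recursive strategy can be sketched in a few lines of plain Python. This is a simplified illustration rather than LangChain's implementation: it ignores `chunk_overlap`, and `recursive_split` is a hypothetical name. It splits on the first separator, merges neighbouring pieces while they fit, and splits any piece that is still too long with the remaining separators. An empty-string separator means a hard cut every `chunk_size` characters.

```python
from typing import List


def recursive_split(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Split text into chunks of at most chunk_size characters (no overlap)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    if sep not in text:
        # This separator does not occur, so try the next, finer one
        return recursive_split(text, rest, chunk_size)

    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        if not piece:
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # Still fits: keep merging neighbouring pieces
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is too long on its own: split it with finer separators
            chunks.extend(recursive_split(piece, rest, chunk_size))
    if current:
        chunks.append(current)
    return chunks


demo_text = "Title\n\nalpha beta\ngamma delta epsilon"
demo_chunks = recursive_split(demo_text, ["\n\n", "\n", " ", ""], chunk_size=12)
print(demo_chunks)  # ['Title', 'alpha beta', 'gamma delta', 'epsilon']
```

This also suggests why the index differs between the two cells above: splitting on `"\n\n"` first separates the title from the body, so the title becomes its own chunk and the "The Agent class" sentence moves from index 1 to index 2.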
