In [3]:
from langchain_ollama import ChatOllama 

chat_llm = ChatOllama(
    model = "llama3",
    temperature = 0.8,
    num_predict = 256,
)

### Documents
A `Document` object in LangChain holds a piece of text together with information about that text. A `Document` object has the following two attributes:

- `page_content`: *`str`*: This attribute holds the content of the document.
- `metadata`: *`dict`*: This attribute contains arbitrary metadata associated with the document. You can use the metadata to track details such as the document ID, the file name, or the creation time.


In [4]:
from langchain_core.documents import Document

Document(page_content="""Python is an interpreted high-level general-purpose programming language.
Python's design philosophy emphasizes code readability with its notable use of significant indentation.""",
         metadata = {
             'my_document_id': 234234,
             'my_document_source': "About Python",
             'my_document_create_time': 1680013019
         })

Document(metadata={'my_document_id': 234234, 'my_document_source': 'About Python', 'my_document_create_time': 1680013019}, page_content="Python is an interpreted high-level general-purpose programming language.\nPython's design philosophy emphasizes code readability with its notable use of significant indentation.")

In [5]:
Document(page_content="""Python is an interpreted high-level general-purpose programming language. 
                        Python's design philosophy emphasizes code readability with its notable use of significant indentation.""")

Document(metadata={}, page_content="Python is an interpreted high-level general-purpose programming language. \n                        Python's design philosophy emphasizes code readability with its notable use of significant indentation.")
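
Notice that the output of the second example keeps the indentation of the continuation line inside `page_content`, because a triple-quoted string preserves every character between the quotes. If that whitespace is not wanted, one option is to strip each line before creating the `Document`. The following is a minimal sketch using only built-in string methods; the shortened sample text and the variable names `raw_text` and `clean_text` are illustrative assumptions.

```python
# Triple-quoted text with an indented continuation line, similar to the cell above.
raw_text = """Python is an interpreted high-level general-purpose programming language.
                        Python's design philosophy emphasizes code readability."""

# Strip leading and trailing whitespace from every line, then rejoin the lines.
clean_text = "\n".join(line.strip() for line in raw_text.splitlines())

print(repr(clean_text))
```

The cleaned string can then be passed as `page_content` in place of the raw one.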

#### Document loaders

Document loaders in LangChain load data from a variety of sources into `Document` objects. For instance, you can load a PDF file and then have the LLM read its content.

LangChain offers over 100 distinct document loaders, along with integrations with other major providers, such as Airbyte and Unstructured. These integrations enable loading of all kinds of documents (HTML, PDF, code) from various locations, including private Amazon S3 buckets and public websites.

You can find a list of document types that LangChain can load at [LangChain Document loaders](https://python.langchain.com/v0.1/docs/integrations/document_loaders/).



##### **PDF loader**
By using the PDF loader, you can load a PDF file as a `Document` object.

In this example, you will load the following paper about using LangChain. You can access and read the paper here: [Revolutionizing Mental Health Care through LangChain: A Journey with a Large Language Model](https://doi.org/10.48550/arXiv.2403.05568).


In [6]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/96-FDF8f7coh0ooim7NyEQ/langchain-paper.pdf")

document = loader.load()

In [7]:
document[2].page_content

'Figure 2. An AIMessage illustration \nC. Prompt Template \nPrompt templates [10] allow you to structure input for LLMs. \nThey provide a convenient way to format user inputs and \nprovide instructions to generate responses. Prompt templates \nhelp ensure that the LLM understands the desired context and \nproduces relevant outputs. \nThe prompt template classes in LangChain are built to \nmake constructing prompts with dynamic inputs easier. Of \nthese classes, the simplest is the PromptTemplate. \nD. Chain \nChains [11] in LangChain refer to the combination of \nmultiple components to achieve specific tasks. They provide \na structured and modular approach to building language \nmodel applications. By combining different components, you \ncan create chains that address various u se cases and \nrequirements. Here are some advantages of using chains: \n• Modularity: Chains allow you to break down \ncomplex tasks into smaller, manageable \ncomponents. Each component can be developed and 

In [8]:
print(document[1].page_content[:1000]) 

LangChain helps us to unlock the ability to harness the 
LLM’s immense potential in tasks such as document analysis, 
chatbot development, code analysis, and countless other 
applications. Whether your desire is to unlock deeper natural 
language understanding , enhance data, or circumvent 
language barriers through translation, LangChain is ready to 
provide the tools and programming support you need to do 
without it that it is not only difficult but also fresh for you. Its 
core functionalities encompass: 
1. Context-Aware Capabilities: LangChain facilitates the 
development of applications that are inherently 
context-aware. This means that these applications can 
connect to a language model and draw from various 
sources of context, such as prompt instructions, a few-
shot examples, or existing content, to ground their 
responses effectively. 
2. Reasoning Abilities: LangChain equips applications 
with the capacity to reason effectively. By relying on a 
language model, these appl

##### **URL and website loader**

You can also load content from a URL or website into a `Document` object:

In [9]:
from langchain_community.document_loaders import WebBaseLoader 

loader = WebBaseLoader("https://python.langchain.com/v0.2/docs/introduction/")

web_data = loader.load()

print(web_data[0].page_content[:1000])

USER_AGENT environment variable not set, consider setting it to identify your requests.


LangChain overview - Docs by LangChainSkip to main contentDocs by LangChain home pageOpen sourceSearch...⌘KAsk AIGitHubTry LangSmithTry LangSmithSearch...NavigationLangChain overviewDeep AgentsLangChainLangGraphIntegrationsLearnReferenceContributePythonOverviewGet startedInstallQuickstartChangelogPhilosophyCore componentsAgentsModelsMessagesToolsShort-term memoryStreamingStructured outputMiddlewareOverviewPrebuilt middlewareCustom middlewareAdvanced usageGuardrailsRuntimeContext engineeringModel Context Protocol (MCP)Human-in-the-loopMulti-agentRetrievalLong-term memoryAgent developmentLangSmith StudioTestAgent Chat UIDeploy with LangSmithDeploymentObservabilityOn this page Create an agent Core benefitsLangChain overviewCopy pageLangChain is an open source framework with a pre-built agent architecture and integrations for any model or tool — so you can build agents that adapt as fast as the ecosystem evolvesCopy pageLangChain is the easy way to start building completely custom agents a
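
The first line of the output above is a warning that the `USER_AGENT` environment variable is not set. A minimal sketch of one way to address it follows. It assumes the warning is raised when the loader module is imported, so the variable is set before that import; the value `langchain-notebook-example/0.1` is a hypothetical identifier, not a required format.

```python
import os

# Set a user agent only if one is not already defined.
# The value is a hypothetical identifier; any descriptive string works.
os.environ.setdefault("USER_AGENT", "langchain-notebook-example/0.1")

print(os.environ["USER_AGENT"])
```

Run this in a cell before importing `WebBaseLoader` so that the variable is already defined when the loader builds its request headers.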

#### Text splitters
One of the simplest ways to make documents better suit your application is to split a long document into smaller chunks that fit into your model's context window. LangChain has built-in document transformers that ease the process of splitting, combining, filtering, and otherwise manipulating documents.

At a high level, here is how text splitters work:

1. They split the text into small, semantically meaningful pieces (often sentences).
2. They combine these small pieces into a larger chunk until the chunk reaches a certain size (as measured by a specific function).
3. Once the chunk reaches that size, they emit it as its own piece of text and start a new chunk, carrying over some overlap to keep context between chunks.
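
The three steps above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not LangChain's implementation: the function name `split_text`, its defaults, and the sample text are hypothetical, and length is measured with `len`.

```python
def split_text(text, separator="\n", chunk_size=200, chunk_overlap=20):
    """Greedy split-and-merge sketch (illustrative, not LangChain's code)."""
    # Step 1: break the text into small pieces on the separator.
    pieces = [p for p in text.split(separator) if p]

    def length(parts):
        # Length of the chunk that would result from joining these pieces.
        return len(separator.join(parts))

    chunks = []
    current = []  # pieces of the chunk currently being built

    for piece in pieces:
        # Step 2: keep adding pieces while the chunk stays within chunk_size.
        if current and length(current + [piece]) > chunk_size:
            # Step 3: emit the finished chunk ...
            chunks.append(separator.join(current))
            # ... and keep only trailing pieces as overlap, dropping from the
            # front until the overlap fits and the next piece can be added.
            while current and (
                length(current) > chunk_overlap
                or length(current + [piece]) > chunk_size
            ):
                current.pop(0)
        # A single piece longer than chunk_size still becomes its own chunk.
        current.append(piece)

    if current:
        chunks.append(separator.join(current))
    return chunks


sample = "\n".join(f"line {i:02d}" for i in range(10))
chunks = split_text(sample, chunk_size=25, chunk_overlap=8)
for chunk in chunks:
    print(repr(chunk))
```

In this sample each chunk starts with the last line of the previous chunk, which is the overlap that preserves context across chunk boundaries.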

For a list of types of text splitters LangChain supports, see [LangChain Text Splitters](https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/).
Let's use a simple `CharacterTextSplitter` as an example of how to split the LangChain paper you just loaded.

`CharacterTextSplitter` is the simplest method of splitting the content. It splits on a single separator string (by default `"\n\n"`) and measures chunk length by the number of characters.
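
The choice of separator determines how many split points are available. The text extracted from the PDF earlier in this notebook contains mostly single newlines, so splitting on `"\n"` gives far more pieces to work with than the default `"\n\n"`. The following plain-Python sketch illustrates the difference; the `sample` string is made up for illustration.

```python
# Made-up text: two paragraphs, each spanning several lines.
sample = (
    "Title line\n"
    "first line of a paragraph\n"
    "second line\n"
    "\n"
    "next paragraph\n"
    "its second line"
)

# Splitting on blank lines yields whole paragraphs.
by_paragraph = [p for p in sample.split("\n\n") if p]

# Splitting on single newlines yields individual lines.
by_line = [p for p in sample.split("\n") if p]

print(len(by_paragraph), len(by_line))
```

With fewer split points, pieces are larger and are more likely to exceed a small `chunk_size` on their own.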

In [11]:
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=200, chunk_overlap=20, separator="\n")

chunks = text_splitter.split_documents(document)
print(len(chunks))

147


#### Working with Document Loaders and Text Splitters

In [15]:
from langchain_core.documents import Document
from langchain_community.document_loaders import PyPDFLoader, WebBaseLoader
from langchain_text_splitters import CharacterTextSplitter, RecursiveCharacterTextSplitter

# Load the LangChain paper
paper_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/96-FDF8f7coh0ooim7NyEQ/langchain-paper.pdf"
pdf_loader = PyPDFLoader(paper_url)
pdf_document = pdf_loader.load()

# Load content from the LangChain website
web_url = "https://python.langchain.com/v0.2/docs/introduction/"
web_loader = WebBaseLoader(web_url)
web_document = web_loader.load()

# Create two different text splitters
splitter_1 = CharacterTextSplitter(chunk_size=300, chunk_overlap=30, separator="\n")
splitter_2 = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50, separators=["\n\n", "\n", ".", ""])

chunks_1 = splitter_1.split_documents(pdf_document)
chunks_2 = splitter_2.split_documents(pdf_document)

def display_document_stats(docs, name):
    """Display statistics about a list of document chunks"""
    total_chunks = len(docs)
    total_chars = sum(len(doc.page_content) for doc in docs)
    avg_chunk_size = total_chars / total_chunks if total_chunks > 0 else 0
    
    # Count unique metadata keys across all documents
    all_metadata_keys = set()
    for doc in docs:
        all_metadata_keys.update(doc.metadata.keys())
    
    # Print the statistics
    print(f"\n=== {name} Statistics ===")
    print(f"Total number of chunks: {total_chunks}")
    print(f"Average chunk size: {avg_chunk_size:.2f} characters")
    print(f"Metadata keys preserved: {', '.join(all_metadata_keys)}")
    
    if docs:
        print("\nExample chunk:")
        example_doc = docs[min(5, total_chunks-1)]  # Get the chunk at index 5 (the sixth), or the last one if there are fewer
        print(f"Content (first 150 chars): {example_doc.page_content[:150]}...")
        print(f"Metadata: {example_doc.metadata}")
        
        # Calculate length distribution
        lengths = [len(doc.page_content) for doc in docs]
        min_len = min(lengths)
        max_len = max(lengths)
        print(f"Min chunk size: {min_len} characters")
        print(f"Max chunk size: {max_len} characters")

# Display stats for both chunk sets
display_document_stats(chunks_1, "Splitter 1")
display_document_stats(chunks_2, "Splitter 2")


=== Splitter 1 Statistics ===
Total number of chunks: 95
Average chunk size: 263.80 characters
Metadata keys preserved: producer, creationdate, moddate, page, source, creator, title, total_pages, author, page_label

Example chunk:
Content (first 150 chars): comprehensive support within the field of mental health. 
Additionally, the paper discusses the implementation of 
Streamlit to enhance the user ex pe...
Metadata: {'producer': 'PyPDF', 'creator': 'Microsoft Word', 'creationdate': '2023-12-31T03:50:13+00:00', 'author': 'IEEE', 'moddate': '2023-12-31T03:52:06+00:00', 'title': 's8329 final', 'source': 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/96-FDF8f7coh0ooim7NyEQ/langchain-paper.pdf', 'total_pages': 6, 'page': 0, 'page_label': '1'}
Min chunk size: 49 characters
Max chunk size: 299 characters

=== Splitter 2 Statistics ===
Total number of chunks: 57
Average chunk size: 452.74 characters
Metadata keys preserved: producer, creationdate, moddate, page, source,