# ***Data Ingestion***

[Official Documentation](https://python.langchain.com/docs/integrations/document_loaders/)

In [3]:
from langchain_community.document_loaders import TextLoader

In [4]:
loader = TextLoader("text-data.txt")
docs = loader.load()
len(docs)

1

In [8]:
docs

[Document(metadata={'source': 'text-data.txt'}, page_content='\n\n**The Importance of Data Ingestion in Modern AI Systems**\n\nData ingestion is a foundational step in any data pipeline, especially in systems powered by artificial intelligence and machine learning. It involves collecting, importing, and processing data for immediate use or storage in a database. In the context of LangChain and large language models (LLMs), data ingestion becomes even more crucial. The ability to efficiently load documents, parse them into chunks, and embed them for retrieval-augmented generation (RAG) directly influences the performance of AI applications.\n\nThere are various sources and formats from which data may be ingested—PDFs, HTML, Markdown files, CSVs, JSON, SQL databases, and even real-time APIs. Each type requires a slightly different handling process. For instance, PDFs must be parsed for text extraction, while CSVs require row-based parsing for structured data ingestion.\n\nIn LangChain, d

In [9]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split into chunks of at most 500 characters; neighboring chunks share up to 100 characters
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100,
)

chunk_docs = text_splitter.split_documents(documents=docs)
len(chunk_docs)

7

In [10]:
chunk_docs[-1]

Document(metadata={'source': 'text-data.txt'}, page_content='In practice, a well-designed ingestion pipeline includes loading, cleaning, chunking, embedding, and storing. LangChain’s modular architecture makes it easier to build and customize such pipelines. As AI systems continue to evolve, the importance of scalable and accurate data ingestion will only grow.')
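To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window sketch in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraph breaks and newlines before falling back to smaller units); the helper name `sliding_chunks` and the sample string are hypothetical and only illustrate how consecutive chunks share text.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows; each window repeats the last
    chunk_overlap characters of the previous one (illustration only)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"  # hypothetical toy input
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

With these toy values, the last four characters of each chunk reappear at the start of the next one. This is the same idea the overlap setting serves in the splitter above: context that straddles a chunk boundary is less likely to be lost.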

## **Load PDF file**

In [14]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader(file_path="attentation-all-you-need.pdf")
docs = loader.load()

In [15]:
# Larger chunks for the paper: at most 1000 characters, with up to 200 characters of overlap
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)

documents = text_splitter.split_documents(documents=docs)
len(documents)

52

In [16]:
documents[-2]

Document(metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'author': '', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'attentation-all-you-need.pdf', 'total_pages': 15, 'page': 13, 'page_label': '14'}, page_content='Input-Input Layer5\nThe\nLaw\nwill\nnever\nbe\nperfect\n,\nbut\nits\napplication\nshould\nbe\njust\n-\nthis\nis\nwhat\nwe\nare\nmissing\n,\nin\nmy\nopinion\n.\n<EOS>\n<pad>\nThe\nLaw\nwill\nnever\nbe\nperfect\n,\nbut\nits\napplication\nshould\nbe\njust\n-\nthis\nis\nwhat\nwe\nare\nmissing\n,\nin\nmy\nopinion\n.\n<EOS>\n<pad>\nInput-Input Layer5\nThe\nLaw\nwill\nnever\nbe\nperfect\n,\nbut\nits\napplication\nshould\nbe\njust\n-\nthis\nis\nwhat\nwe\nare\nmissing\n,\nin\nmy\nopinion\n.\n<EOS>\n<pad>\nThe\nLaw\nwill\nnever\nbe\nperfect

## **WebBaseLoader**

In [17]:
from langchain_community.document_loaders import WebBaseLoader
import bs4

loader = WebBaseLoader(["https://blog.langchain.dev/what-is-an-agent/", "https://blog.langchain.dev/memory-for-agents/"])
docs = loader.load()


USER_AGENT environment variable not set, consider setting it to identify your requests.
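The warning above can usually be avoided by defining `USER_AGENT` in the environment. A minimal sketch follows, assuming the variable is read when the loader module is imported, so it has to be set before the `WebBaseLoader` import runs (restart the kernel if the import has already happened). The value `data-ingestion-notebook/0.1` is a hypothetical placeholder, not something the library requires.

```python
import os

# Hypothetical identifier for this notebook's HTTP requests; pick any descriptive string.
# setdefault keeps an existing USER_AGENT if one is already configured.
os.environ.setdefault("USER_AGENT", "data-ingestion-notebook/0.1")

print(os.environ["USER_AGENT"])
```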


In [19]:
docs[0].metadata

{'source': 'https://blog.langchain.dev/what-is-an-agent/',
 'title': 'What is an AI agent?',
 'description': 'Introducing a new series of musings on AI agents, called "In the Loop".',
 'language': 'en'}

In [20]:
# Reuses the splitter defined in the PDF section (chunk_size=1000, chunk_overlap=200)
documents_chunk = text_splitter.split_documents(documents=docs)
len(documents_chunk)

20

## **Load HTML**

In [31]:
page_url = "https://blog.langchain.dev/what-is-an-agent/"

# Parse only elements with class "article-footer" instead of the whole page
loader = WebBaseLoader(
    web_paths=[page_url],
    bs_kwargs={
        "parse_only": bs4.SoupStrainer(class_="article-footer"),
    },
)

In [32]:
content = loader.load()
len(content)

1

In [33]:
content

[Document(metadata={'source': 'https://blog.langchain.dev/what-is-an-agent/'}, page_content="\n\nYou might also like\n\n\n\n\n\n\n\n\n\n\nHow to think about agent frameworks\n\n\nIn the Loop\n20 min read\n\n\n\n\n\n\n\n\n\n\n\n\nHow do I speed up my AI agent?\n\n\nIn the Loop\n4 min read\n\n\n\n\n\n\n\n\n\n\n\n\nMCP: Flash in the Pan or Future Standard?\n\n\nIn the Loop\n5 min read\n\n\n\n\n\n\n\n\n\n\n\n\nIntroducing Interrupt: The AI Agent Conference by LangChain\n\n\nHarrison Chase\n2 min read\n\n\n\n\n\n\n\n\n\n\n\n\nCommunication is all you need\n\n\nIn the Loop\n7 min read\n\n\n\n\n\n\n\n\n\n\n\n\nLangChain's Second Birthday\n\n\nHarrison Chase\n6 min read\n\n\n\n\n\n")]
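The extracted `page_content` is dominated by runs of blank lines left over from the HTML layout. One possible cleanup step before chunking is sketched below, in plain Python. The helper name `collapse_blank_lines` is hypothetical, and the sample string is a shortened excerpt of the output above rather than the full page.

```python
def collapse_blank_lines(text: str) -> str:
    """Strip each line and drop the empty ones, leaving one item per line."""
    stripped = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in stripped if line)


# Shortened excerpt of the scraped footer text shown above
raw = (
    "\n\nYou might also like\n\n\n\n\n"
    "How to think about agent frameworks\n\n\n"
    "In the Loop\n20 min read\n\n"
)
cleaned = collapse_blank_lines(raw)
print(cleaned)
```

Applied to a loaded document, the same function could be run on `content[0].page_content` before passing the text to a splitter, so that chunk sizes reflect actual text rather than whitespace.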