### Data Ingestion - Document Loaders:

- The `document_loaders` module in `langchain_community` is used to load (ingest) data into our application.

- Calling `load()` on a loader usually returns a list of `Document` objects.
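
To make the return value concrete, here is a minimal sketch of the shape a loader hands back. `SimpleDocument` is a hypothetical stand-in written for these notes, not the real LangChain `Document` class; it only mirrors the two attributes that show up in the outputs in this notebook, `page_content` and `metadata`.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in that mirrors the two fields seen in loader output."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Pretend this list came back from some loader's load() call.
docs = [
    SimpleDocument("first chunk of text", {"source": "example.txt"}),
    SimpleDocument("second chunk of text", {"source": "example.txt"}),
]

# Because the result is a plain list, indexing, slicing, and len() all work.
first = docs[0]
print(len(docs))
print(first.page_content)
print(first.metadata["source"])
```

This is why the cells in these notes can slice the result (`doc[0:5]`) or index it (`doc[0].page_content`).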

#### 1. CSVLoader:

In [1]:
from langchain_community.document_loaders import CSVLoader

loader = CSVLoader(
    file_path="E:/Downloads/My Projects with datasets/emails.csv"
)
doc = loader.load()
print(doc[0:5])

[Document(metadata={'source': 'E:/Downloads/My Projects with datasets/emails.csv', 'row': 0}, page_content="text: Subject: naturally irresistible your corporate identity  lt is really hard to recollect a company : the  market is full of suqgestions and the information isoverwhelminq ; but a good  catchy logo , stylish statlonery and outstanding website  will make the task much easier .  we do not promise that havinq ordered a iogo your  company will automaticaily become a world ieader : it isguite ciear that  without good products , effective business organization and practicable aim it  will be hotat nowadays market ; but we do promise that your marketing efforts  will become much more effective . here is the list of clear  benefits : creativeness : hand - made , original logos , specially done  to reflect your distinctive company image . convenience : logo and stationery  are provided in all formats ; easy - to - use content management system letsyou  change your website content and 
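
The output above suggests one `Document` per CSV row, with `column: value` text as the content and the row number in the metadata. The sketch below reproduces that shape using only the standard library. It is an illustration under that assumption, not the loader's actual code; `rows_to_documents` and the sample data are hypothetical.

```python
import csv
import io

# Hypothetical two-row CSV, used instead of a file on disk.
sample_csv = "text,spam\nSubject: hello there,0\nSubject: win a prize,1\n"


def rows_to_documents(csv_text, source):
    """Turn each CSV row into a dict with page_content and metadata."""
    documents = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader):
        # One "column: value" line per column, joined with newlines.
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        documents.append(
            {"page_content": content,
             "metadata": {"source": source, "row": row_number}}
        )
    return documents


csv_docs = rows_to_documents(sample_csv, source="sample.csv")
print(csv_docs[0]["page_content"])
print(csv_docs[1]["metadata"])
```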

#### 2. Text Loader:

- This document loader loads a plain-text file and returns its contents as a single Document.

In [None]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader(
    file_path="E:/AIMLNotes.txt"
)
doc = loader.load()
print(doc)

[Document(metadata={'source': 'E:/AIMLNotes.txt'}, page_content='pythonBasics: https://colab.research.google.com/drive/1OGSiSAWlpmpFMMGwFUGlPwrW7nNsiW2b?usp=sharing\n\nOOP: https://colab.research.google.com/drive/1eNr-xXEEWFJnywDZaL84QhT-PDXz26mP?usp=sharing\n\nNumPy: https://colab.research.google.com/drive/1Eek0AiLKhvM03cJOjEHf9UKzHjhUeXFQ?usp=sharing\n\nPandas:\nhttps://colab.research.google.com/drive/158B1IhH2-yRLjBe9RM0QO2qRYi8dhE4n?usp=sharing\n\nVisualization:\nhttps://colab.research.google.com/drive/1a2iv6hq2w0SzkpoqtcvrpCg3iVjKGBHj?usp=sharing\n\nStatistics:\nhttps://colab.research.google.com/drive/1IUWi0Ad7sdEJPT8m0aupa8IqCqCNgMqE?usp=sharing\n\nAirbnb Case Study:\nhttps://colab.research.google.com/drive/19Z67efTrSHJdPDXynTXk6aX3uhSnxZFY?usp=sharing\n\nMachine Learning Intro: https://colab.research.google.com/drive/18JEP2g9UUZicwW0GjfNUP-iAeuHigFX6?usp=sharing\n\nLinear Regression: https://colab.research.google.com/drive/1CNwBWu1cJTRCFfhdOISBbhWYGxMfDHhS?usp=sharing\n\nLogisti
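
As the output shows, the whole file ends up in one `Document` whose metadata holds the source path. A rough standard-library equivalent is sketched below. `load_text_file` is a hypothetical helper, and the explicit `encoding` argument is an assumption about what you would want when a file is not UTF-8.

```python
import os
import tempfile


def load_text_file(path, encoding="utf-8"):
    """Read the whole file into a single document-like dict."""
    with open(path, encoding=encoding) as handle:
        text = handle.read()
    # The whole file becomes one entry; the path is kept as the source.
    return [{"page_content": text, "metadata": {"source": path}}]


# Write a small throwaway file so the sketch runs anywhere.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("NumPy: notes\n\nPandas: notes\n")
    tmp_path = tmp.name

text_docs = load_text_file(tmp_path)
os.remove(tmp_path)

print(len(text_docs))
print(repr(text_docs[0]["page_content"]))
```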

#### 3. Unstructured Markdown Loader:

- This loader loads Markdown files using Unstructured.

- You can run the loader in one of two modes: "single" and "elements". 

- If you use "single" mode, the document will be returned as a single LangChain Document object. (Works similarly to TextLoader.)

- If you use "elements" mode, the unstructured library will split the document into elements such as Title and NarrativeText.

- You can pass in additional unstructured kwargs after mode to apply different unstructured settings.

*<u>NOTE</u>: It returns a list object*
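
The difference between the two modes can be sketched without the `unstructured` library. The helper below is hypothetical and makes a simplifying assumption: elements are separated by blank lines. The real library does more, including assigning a category such as Title or NarrativeText to each element.

```python
def split_by_mode(text, source, mode="single"):
    """Return document-like dicts for a 'single' or 'elements' style split."""
    if mode == "single":
        return [{"page_content": text, "metadata": {"source": source}}]
    if mode == "elements":
        # Assumption: blank lines separate elements. The real library also
        # assigns a category such as Title or NarrativeText to each one.
        blocks = [block.strip() for block in text.split("\n\n") if block.strip()]
        return [
            {"page_content": block, "metadata": {"source": source}}
            for block in blocks
        ]
    raise ValueError(f"unsupported mode: {mode!r}")


notes = "pythonBasics: link one\n\nOOP: link two\n\nNumPy: link three"

single_docs = split_by_mode(notes, "notes.md", mode="single")
element_docs = split_by_mode(notes, "notes.md", mode="elements")
print(len(single_docs), len(element_docs))
```

Either way the result is a list; only the number of entries changes.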

In [28]:
# mode = single
from langchain_community.document_loaders import UnstructuredMarkdownLoader

loader = UnstructuredMarkdownLoader(
    file_path="E:/AIMLNotes.txt",
    mode='single'
)
doc = loader.load()
print(doc)

[Document(metadata={'source': 'E:/AIMLNotes.txt'}, page_content='pythonBasics: https://colab.research.google.com/drive/1OGSiSAWlpmpFMMGwFUGlPwrW7nNsiW2b?usp=sharing\n\nOOP: https://colab.research.google.com/drive/1eNr-xXEEWFJnywDZaL84QhT-PDXz26mP?usp=sharing\n\nNumPy: https://colab.research.google.com/drive/1Eek0AiLKhvM03cJOjEHf9UKzHjhUeXFQ?usp=sharing\n\nPandas: https://colab.research.google.com/drive/158B1IhH2-yRLjBe9RM0QO2qRYi8dhE4n?usp=sharing\n\nVisualization: https://colab.research.google.com/drive/1a2iv6hq2w0SzkpoqtcvrpCg3iVjKGBHj?usp=sharing\n\nStatistics: https://colab.research.google.com/drive/1IUWi0Ad7sdEJPT8m0aupa8IqCqCNgMqE?usp=sharing\n\nAirbnb Case Study: https://colab.research.google.com/drive/19Z67efTrSHJdPDXynTXk6aX3uhSnxZFY?usp=sharing\n\nMachine Learning Intro: https://colab.research.google.com/drive/18JEP2g9UUZicwW0GjfNUP-iAeuHigFX6?usp=sharing\n\nLinear Regression: https://colab.research.google.com/drive/1CNwBWu1cJTRCFfhdOISBbhWYGxMfDHhS?usp=sharing\n\nLogistic Re

In [27]:
print(doc[0].page_content)

pythonBasics: https://colab.research.google.com/drive/1OGSiSAWlpmpFMMGwFUGlPwrW7nNsiW2b?usp=sharing

OOP: https://colab.research.google.com/drive/1eNr-xXEEWFJnywDZaL84QhT-PDXz26mP?usp=sharing

NumPy: https://colab.research.google.com/drive/1Eek0AiLKhvM03cJOjEHf9UKzHjhUeXFQ?usp=sharing

Pandas: https://colab.research.google.com/drive/158B1IhH2-yRLjBe9RM0QO2qRYi8dhE4n?usp=sharing

Visualization: https://colab.research.google.com/drive/1a2iv6hq2w0SzkpoqtcvrpCg3iVjKGBHj?usp=sharing

Statistics: https://colab.research.google.com/drive/1IUWi0Ad7sdEJPT8m0aupa8IqCqCNgMqE?usp=sharing

Airbnb Case Study: https://colab.research.google.com/drive/19Z67efTrSHJdPDXynTXk6aX3uhSnxZFY?usp=sharing

Machine Learning Intro: https://colab.research.google.com/drive/18JEP2g9UUZicwW0GjfNUP-iAeuHigFX6?usp=sharing

Linear Regression: https://colab.research.google.com/drive/1CNwBWu1cJTRCFfhdOISBbhWYGxMfDHhS?usp=sharing

Logistic Regression: https://colab.research.google.com/drive/1Mg2ah_YA-7UZJbC3z2qHoKObVu4CIwen

In [35]:
# mode = elements

loader = UnstructuredMarkdownLoader(
    file_path="E:/AIMLNotes.txt",
    mode='elements'
)
doc = loader.load()
print(doc)

[Document(metadata={'source': 'E:/AIMLNotes.txt', 'languages': ['eng'], 'file_directory': 'E:/', 'filename': 'AIMLNotes.txt', 'filetype': 'text/markdown', 'last_modified': '2025-07-10T20:59:57', 'category': 'NarrativeText', 'element_id': '499bdc2ba282949873d338577df28df3'}, page_content='pythonBasics: https://colab.research.google.com/drive/1OGSiSAWlpmpFMMGwFUGlPwrW7nNsiW2b?usp=sharing'), Document(metadata={'source': 'E:/AIMLNotes.txt', 'languages': ['eng'], 'file_directory': 'E:/', 'filename': 'AIMLNotes.txt', 'filetype': 'text/markdown', 'last_modified': '2025-07-10T20:59:57', 'category': 'NarrativeText', 'element_id': 'e0c4c53e489ca18b213037b52c1c781c'}, page_content='OOP: https://colab.research.google.com/drive/1eNr-xXEEWFJnywDZaL84QhT-PDXz26mP?usp=sharing'), Document(metadata={'source': 'E:/AIMLNotes.txt', 'languages': ['eng'], 'file_directory': 'E:/', 'filename': 'AIMLNotes.txt', 'filetype': 'text/markdown', 'last_modified': '2025-07-10T20:59:57', 'category': 'NarrativeText', 'el

In [20]:
for element in doc:
    print(element.page_content)

pythonBasics: https://colab.research.google.com/drive/1OGSiSAWlpmpFMMGwFUGlPwrW7nNsiW2b?usp=sharing
OOP: https://colab.research.google.com/drive/1eNr-xXEEWFJnywDZaL84QhT-PDXz26mP?usp=sharing
NumPy: https://colab.research.google.com/drive/1Eek0AiLKhvM03cJOjEHf9UKzHjhUeXFQ?usp=sharing
Pandas: https://colab.research.google.com/drive/158B1IhH2-yRLjBe9RM0QO2qRYi8dhE4n?usp=sharing
Visualization: https://colab.research.google.com/drive/1a2iv6hq2w0SzkpoqtcvrpCg3iVjKGBHj?usp=sharing
Statistics: https://colab.research.google.com/drive/1IUWi0Ad7sdEJPT8m0aupa8IqCqCNgMqE?usp=sharing
Airbnb Case Study: https://colab.research.google.com/drive/19Z67efTrSHJdPDXynTXk6aX3uhSnxZFY?usp=sharing
Machine Learning Intro: https://colab.research.google.com/drive/18JEP2g9UUZicwW0GjfNUP-iAeuHigFX6?usp=sharing
Linear Regression: https://colab.research.google.com/drive/1CNwBWu1cJTRCFfhdOISBbhWYGxMfDHhS?usp=sharing
Logistic Regression: https://colab.research.google.com/drive/1Mg2ah_YA-7UZJbC3z2qHoKObVu4CIwen?usp=shar

In [21]:
len(doc)

11
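
Since each element carries a `category` in its metadata (see the "elements" output above), one use of this mode is to count or filter elements by type. The sketch below uses plain dicts with made-up data in place of real `Document` objects.

```python
from collections import Counter

# Hypothetical element metadata, shaped like the 'category' field shown above.
elements = [
    {"page_content": "Course Notes", "metadata": {"category": "Title"}},
    {"page_content": "pythonBasics: link", "metadata": {"category": "NarrativeText"}},
    {"page_content": "OOP: link", "metadata": {"category": "NarrativeText"}},
]

# Count how many elements fall into each category.
category_counts = Counter(item["metadata"]["category"] for item in elements)

# Keep only the narrative text, for example before chunking or embedding.
narrative = [item["page_content"] for item in elements
             if item["metadata"]["category"] == "NarrativeText"]

print(category_counts)
print(narrative)
```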

#### 4. PyPDFLoader:

- This loader loads and parses a PDF file using the `pypdf` library.

- This class provides methods to load and parse PDF documents, supporting various configurations such as handling password-protected files, extracting images, and defining extraction mode. 

- It integrates the pypdf library for PDF processing and offers both synchronous and asynchronous document loading.

In [None]:
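from langchain_community.document_loaders import PyPDFLoader
# Assumes the rapidocr-onnxruntime package is installed for this parser
from langchain_community.document_loaders.parsers import RapidOCRBlobParser

loader = PyPDFLoader(
    "E:/Downloads/feeReciept.pdf",
    mode='page',                         # one Document per page
    extract_images=True,                 # also pass page images to images_parser
    images_parser=RapidOCRBlobParser(),  # must be a parser instance, not a class name string
    images_inner_format='text'           # 'text', 'markdown-img' or 'html-img'; 'layout' is an extraction_mode value
)

data = loader.load()
# If the first page has no text, flag it as needing OCR
if data[0].page_content == '':
    print('ocr')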

ocr
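
The check above only looks at the first page. A small generalisation, sketched here with plain dicts and made-up pages rather than real loader output, is to collect the index of every page whose text is empty or only whitespace. `pages_without_text` is a hypothetical helper.

```python
def pages_without_text(pages):
    """Return the indexes of pages whose extracted text is empty or whitespace."""
    return [index for index, page in enumerate(pages)
            if not page["page_content"].strip()]


# Hypothetical per-page results: page 0 and page 2 produced no text.
pages = [
    {"page_content": "", "metadata": {"page": 0}},
    {"page_content": "Fee receipt, term 2", "metadata": {"page": 1}},
    {"page_content": "  \n", "metadata": {"page": 2}},
]

needs_ocr = pages_without_text(pages)
print(needs_ocr)
```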


In [50]:
# docs = []

# Load documents asynchronously:

# docs = await loader.aload()  
# print(docs[0].page_content[:100])  
# print(docs[0].metadata)
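
A bare `await` like the one commented out above works in a notebook cell because Jupyter already runs an event loop. In a plain script you have to start one yourself. The sketch below shows that pattern with a hypothetical `ToyLoader`; pushing the blocking `load()` into a worker thread is one possible way to provide an async variant, and is not necessarily how the real loader does it.

```python
import asyncio


class ToyLoader:
    """Hypothetical loader with a sync load() and an async aload()."""

    def __init__(self, pages):
        self._pages = pages

    def load(self):
        # Stand-in for blocking work such as parsing a PDF.
        return [{"page_content": text, "metadata": {"page": number}}
                for number, text in enumerate(self._pages)]

    async def aload(self):
        # Run the blocking load() in a worker thread so the event loop stays free.
        return await asyncio.to_thread(self.load)


async def main():
    toy_loader = ToyLoader(["page one", "page two"])
    return await toy_loader.aload()


# In a script there is no running event loop, so start one explicitly.
async_docs = asyncio.run(main())
print(async_docs[0]["page_content"][:100])
print(async_docs[0]["metadata"])
```

Note that `asyncio.to_thread` needs Python 3.9 or newer.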

#### 5. Web Based Loader:

Used to load the text content of a web page from a given URL.

*Note*: *Requires internet connection*

In [4]:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader(
    web_path=("https://colab.research.google.com/drive/1_FqONik9xyDDTmoPvfg7UAm53RfXTimk?usp=sharing",)
)

In [3]:
# loader.load() [I don't have internet connection at present 🥲]
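
Since the cell above could not be run offline, here is a rough offline sketch of the idea behind a web loader: take HTML, drop the markup, and keep the visible text plus some metadata. It uses only the standard library's `html.parser` on a hard-coded HTML string. `TextExtractor`, `html_to_document`, and the sample page are hypothetical, and the real loader's output (for example, which metadata keys it sets) may differ.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._in_title = False
        self.title = ""
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth or not text:
            return
        if self._in_title:
            self.title = text
        else:
            self.chunks.append(text)


def html_to_document(html, url):
    """Build a document-like dict from an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {"page_content": "\n".join(parser.chunks),
            "metadata": {"source": url, "title": parser.title}}


sample_html = (
    "<html><head><title>Demo</title><style>p {color: red}</style></head>"
    "<body><h1>Loaders</h1><p>Load data from a URL.</p>"
    "<script>console.log('x')</script></body></html>"
)
web_doc = html_to_document(sample_html, "https://example.com/demo")
print(web_doc["page_content"])
print(web_doc["metadata"])
```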