#### **Document Loaders**
Document loaders import data from sources such as PDF files, text files, databases, and cloud storage (e.g., S3, GCS).

They convert this data into a standard format, the `Document` object (page content plus metadata), which can then be used for chunking, embedding, retrieval, and LLM generation.

All document loaders in LangChain: https://docs.langchain.com/oss/python/integrations/document_loaders

**We can create custom document loaders as well**
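
As a rough illustration of what a custom loader involves, here is a dependency-free sketch. In LangChain itself you would normally subclass `BaseLoader` from `langchain_core.document_loaders` and implement `lazy_load`. The `Document` dataclass below is only a stand-in with the same two fields, and `ParagraphLoader` is a hypothetical name invented for this example.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Iterator
import tempfile


@dataclass
class Document:
    """Stand-in for LangChain's Document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class ParagraphLoader:
    """Hypothetical loader: one Document per blank-line-separated paragraph."""

    def __init__(self, file_path: str, encoding: str = "utf-8") -> None:
        self.file_path = file_path
        self.encoding = encoding

    def lazy_load(self) -> Iterator[Document]:
        # Read the file, then yield one Document per non-empty paragraph.
        text = Path(self.file_path).read_text(encoding=self.encoding)
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        for index, paragraph in enumerate(paragraphs):
            yield Document(
                page_content=paragraph,
                metadata={"source": self.file_path, "paragraph": index},
            )

    def load(self) -> list[Document]:
        # Eager variant: collect everything lazy_load yields into a list.
        return list(self.lazy_load())


# Demo on a throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text("First paragraph.\n\nSecond paragraph.\n", encoding="utf-8")
    paragraph_docs = ParagraphLoader(str(sample)).load()

print(len(paragraph_docs))
print(paragraph_docs[1].page_content)
print(paragraph_docs[1].metadata["paragraph"])
```

Splitting the work into `lazy_load` (a generator) and `load` (a list built from it) mirrors the interface the built-in loaders expose, so a loader written this way can be swapped in wherever `loader.load()` is called.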

In [2]:
from langchain_groq import ChatGroq
from langchain_core.output_parsers import StrOutputParser
from dotenv import load_dotenv
from langchain_core.prompts import PromptTemplate

load_dotenv()

True

##### **Text Loaders**

In [None]:
from langchain_community.document_loaders import TextLoader

model = ChatGroq(model="llama-3.3-70b-versatile")
parser = StrOutputParser()

prompt = PromptTemplate(
    template = "Explain the text given: {text}",
    input_variables=['text']
)

loader = TextLoader('poem.txt', encoding='utf-8')

docs = loader.load()

chain = prompt | model | parser
print(chain.invoke({'text': docs[0].page_content}))  # Pass only the content, not the metadata

The given text is a poetic tribute to the sport of cricket. It's a long, narrative poem that explores the various aspects of the game, from its history and traditions to its excitement and emotional resonance.

The poem begins by setting the scene, describing the idyllic surroundings of a cricket field, with the sun shining down and the sound of willows (a type of tree commonly found near cricket fields) rustling in the breeze. It then introduces the idea that cricket is a dreamlike state, where players and spectators alike can escape the worries of everyday life and immerse themselves in the thrill of the game.

The poem goes on to describe the various elements of cricket, including the toss of the coin, the choice of whether to bat or bowl, the strategies and tactics employed by the teams, and the emotional highs and lows experienced by the players and spectators. It also touches on the technical aspects of the game, such as the different types of shots and bowling styles, and the ro

##### **PDF Loader**  

In [10]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader('Data Sources\\dl-curriculum.pdf')
docs = loader.load()

print(len(docs))
print(docs[0].page_content)
print(docs[1].metadata)

23
CampusXDeepLearningCurriculum
A.ArtificialNeuralNetworkandhowtoimprovethem
1.BiologicalInspiration
● Understandingtheneuronstructure● Synapsesandsignaltransmission● Howbiologicalconceptstranslatetoartificialneurons
2.HistoryofNeuralNetworks
● Earlymodels(Perceptron)● BackpropagationandMLPs● The"AIWinter"andresurgenceofneuralnetworks● Emergenceofdeeplearning
3.PerceptronandMultilayerPerceptrons(MLP)
● Single-layerperceptronlimitations● XORproblemandtheneedforhiddenlayers● MLParchitecture
4. LayersandTheirFunctions
● InputLayer○ Acceptinginputdata● HiddenLayers○ Featureextraction● OutputLayer○ Producingfinalpredictions
5.ActivationFunctions
{'producer': 'Skia/PDF m131 Google Docs Renderer', 'creator': 'PyPDF', 'creationdate': '', 'title': 'Deep Learning Curriculum', 'source': 'Data Sources\\dl-curriculum.pdf', 'total_pages': 23, 'page': 1, 'page_label': '2'}


##### **Directory Loader**

In [21]:
from langchain_community.document_loaders import DirectoryLoader

pdf_loader = DirectoryLoader(
    path='Data Sources\\',
    glob="*.pdf",  # Match only the PDF files in this directory
    loader_cls=PyPDFLoader,
)

pdf_docs = pdf_loader.load()

text_loader = DirectoryLoader(
    path='Data Sources\\',
    glob="*.txt",
    loader_cls=TextLoader,
)

text_docs = text_loader.load()

all_docs = pdf_docs + text_docs

print(len(all_docs))
print(all_docs[100].page_content)
print(all_docs[100].metadata)


349
Topic Modeling
[ 80 ]
Latent Dirichlet allocation
LDA and LDA—unfortunately, there are two methods in machine learning with 
the initials LDA: latent Dirichlet allocation, which is a topic modeling method and 
linear discriminant analysis, which is a classification method. They are completely 
unrelated, except for the fact that the initials LDA can refer to either. In certain 
situations, this can be confusing. The scikit-learn tool has a submodule, 
sklearn.
lda, which implements linear discriminant analysis. At the moment, scikit-learn 
does not implement latent Dirichlet allocation.
The topic model we will look at is latent Dirichlet allocation (LDA). The mathematical 
ideas behind LDA are fairly complex, and we will not go into the details here.
For those who are interested, and adventurous enough, Wikipedia will provide all 
the equations behind these algorithms: 
http://en.wikipedia.org/wiki/Latent_
Dirichlet_allocation.
However, we can understand the ideas behind LDA intuit
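
To see which files a `glob` pattern would select, the following sketch uses only `pathlib` on a temporary directory. It assumes that `DirectoryLoader` interprets `glob` with pathlib-style pattern rules. The file names are made up for the demonstration.

```python
from pathlib import Path
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    # Made-up files: two PDFs (one in a subfolder) and one text file.
    for name in ["a.pdf", "b.txt", "nested/c.pdf"]:
        (root / name).write_text("placeholder", encoding="utf-8")

    # "*.pdf" matches PDFs directly inside the directory only.
    top_level_pdfs = sorted(p.name for p in root.glob("*.pdf"))
    # "**/*.pdf" also descends into subdirectories.
    all_pdfs = sorted(p.relative_to(root).as_posix() for p in root.glob("**/*.pdf"))
    # "*.txt" selects the text files instead.
    text_files = sorted(p.name for p in root.glob("*.txt"))

print(top_level_pdfs)
print(all_pdfs)
print(text_files)
```

Under that assumption, the two loaders in the cell above each pick up one file type from the top level of `Data Sources\`, which is why their results are concatenated into `all_docs`.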

##### **Web Base Loader**

WebBaseLoader is a document loader in LangChain that extracts text content from static web pages.

It lets you query a webpage with an LLM, just as you would ask questions of a PDF or text document.

In [None]:
from langchain_community.document_loaders import WebBaseLoader

model = ChatGroq(model="llama-3.3-70b-versatile")
parser = StrOutputParser()

prompt = PromptTemplate(
    template="Answer the following question \n {question} from the following text: \n {text}",
    input_variables=['question', 'text']
)

url = ["https://www.flipkart.com/dell-14-plus-backlit-keyboard-fingerprint-reader-intel-core-ultra-7-256v-16-gb-512-gb-ssd-windows-11-home-db14250-thin-light-laptop/p/itm31edf0baad0c4?pid=COMHDGD5RWKB8VUN&lid=LSTCOMHDGD5RWKB8VUNHLZGVA&marketplace=FLIPKART&q=Laptops&store=6bo%2Fb5g&srno=s_1_1&otracker=search&otracker1=search&fm=organic&iid=en_34Y80kfDCtf5umfeqKA_z2CN3-rN8qNNItOCACa8rb3sVfKRnXWN6pa37BGwTtlsGI9cYM1N922xope3JQRh1fUFjCTyOHoHZs-Z5_PS_w0%3D&ppt=hp&ppn=hp&ssid=j11azktpkg0000001763567817751&qH=48b773c837465a99", 
        "https://www.flipkart.com/lenovo-loq-intel-core-i5-13th-gen-13450hx-16-gb-512-gb-ssd-windows-11-home-6-graphics-nvidia-geforce-rtx-4050-15irx9-gaming-laptop/p/itm975f0ea246a0e?pid=COMHYXV9DUMPQSZA&lid=LSTCOMHYXV9DUMPQSZA1UUBRE&marketplace=FLIPKART&q=Laptops&store=6bo%2Fb5g&srno=s_1_2&otracker=search&otracker1=search&fm=organic&iid=en_34Y80kfDCtf5umfeqKA_z2CN3-rN8qNNItOCACa8rb3Cj3RJeBDdSpfo40YyIWUje0iOSTC5SvjR5Wt8msZ4_vUFjCTyOHoHZs-Z5_PS_w0%3D&ppt=hp&ppn=hp&ssid=j11azktpkg0000001763567817751&qH=48b773c837465a99"]
loader = WebBaseLoader(url)

docs = loader.load()

chain = prompt | model | parser
page_text = "\n\n".join(doc.page_content for doc in docs)  # Pass only the page text, not the metadata
print(chain.invoke({'question': "Compare these two laptops", "text": page_text}))

Based on the provided text, here's a comparison between the two laptops:

**Laptop 1: DELL 14 Plus**
- Processor: Intel Core Ultra 7 256V
- RAM: 16 GB LPDDR5X
- Storage: 512 GB SSD
- Display: 14-inch 2.5K AG NT 300nits WVA/IPS Display
- Graphics: Intel Integrated Arc
- Weight: 1.55 Kg
- Battery Life: Up to 7 hours
- Price: ₹1,12,990

**Laptop 2: Lenovo LOQ**
- Processor: Intel Core i5 13th Gen 13450HX
- RAM: 16 GB DDR5
- Storage: 512 GB SSD
- Display: 15.6-inch Full HD IPS 300nits Anti-glare, 100% sRGB, 144Hz, G-SYNC
- Graphics: NVIDIA GeForce RTX 4050
- Weight: 2.38 Kg
- Battery Life: Not specified, but reviews mention around 1.5 hours
- Price: ₹89,990

**Comparison:**
- **Processor:** Lenovo LOQ has a more powerful processor (Intel Core i5 13th Gen) compared to DELL 14 Plus (Intel Core Ultra 7 256V).
- **RAM and Storage:** Both laptops have the same amount of RAM (16 GB) and storage (512 GB SSD), but Lenovo LOQ uses DDR5 RAM which is faster than LPDDR5X used in DELL 14 Plus.
- **Disp
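
When several pages are loaded at once, it can help to hand the prompt plain text with a label per page, so the model can tell the sources apart. The helper below is a sketch of one way to do that. `format_docs` is a hypothetical name, the `SimpleNamespace` objects stand in for loaded documents, and the `source` metadata key is assumed to hold each page's URL.

```python
from types import SimpleNamespace


def format_docs(docs) -> str:
    """Join page text only, labelling each page with its source."""
    sections = []
    for number, doc in enumerate(docs, start=1):
        # Fall back to "unknown" if a document carries no source entry.
        source = doc.metadata.get("source", "unknown")
        sections.append(f"[Page {number}: {source}]\n{doc.page_content.strip()}")
    return "\n\n".join(sections)


# Stand-in documents with placeholder content and URLs.
fake_docs = [
    SimpleNamespace(page_content=" Laptop A specs ", metadata={"source": "https://example.com/a"}),
    SimpleNamespace(page_content="Laptop B specs", metadata={"source": "https://example.com/b"}),
]

formatted = format_docs(fake_docs)
print(formatted)
```

With the chain from the cell above, the result could be passed as the `text` variable, for example `chain.invoke({'question': ..., 'text': format_docs(docs)})`.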

##### **CSV Loader**
CSVLoader is a document loader that loads and parses CSV files into Document objects, which can then be passed to an LLM for question answering.

Each row in the CSV becomes one Document object in LangChain.

In [30]:
from langchain_community.document_loaders import CSVLoader

loader = CSVLoader(file_path='Data Sources\\Social_Network_Ads.csv')

docs = loader.load()
print(len(docs))
print(docs[0])

400
page_content='User ID: 15624510
Gender: Male
Age: 19
EstimatedSalary: 19000
Purchased: 0' metadata={'source': 'Data Sources\\Social_Network_Ads.csv', 'row': 0}
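
The output above shows each row rendered as `column: value` lines, with the file path and row index in the metadata. The sketch below imitates that row-to-document mapping using only the standard `csv` module. It is not CSVLoader's actual implementation. `rows_to_documents` is a hypothetical helper, the sample data is made up, and plain dicts stand in for Document objects.

```python
import csv
import io

# Made-up sample data, not taken from the dataset used above.
csv_text = "id,name\n1,Ada\n2,Linus\n"


def rows_to_documents(handle, source: str) -> list[dict]:
    """Turn each CSV row into a dict shaped like a Document."""
    documents = []
    for row_index, row in enumerate(csv.DictReader(handle)):
        # One "column: value" line per field, joined by newlines.
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append(
            {"page_content": content, "metadata": {"source": source, "row": row_index}}
        )
    return documents


csv_docs = rows_to_documents(io.StringIO(csv_text), source="example.csv")
print(len(csv_docs))
print(csv_docs[0]["page_content"])
print(csv_docs[1]["metadata"])
```

Because every row becomes its own document, a 400-row file yields 400 documents, matching the `len(docs)` value printed above.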
