# 10 — Data Loaders (Introduction)

In LangChain, **Data Loaders** (named *document loaders* in the library) are components that **fetch data from external sources** (files, databases, APIs, websites, private documents) and convert it into `Document` objects: text plus metadata that can be passed to an LLM as context.

They:
- Connect to the data source.
- Transform the raw content into **documents** usable as context.
- Enable **context-aware** Q&A and workflows.

This is crucial when you want your model to use **domain-specific** or **private** information rather than only general knowledge.
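As a rough illustration of what a loader does, the sketch below reads a text file and wraps it in a document-like object. `SimpleDocument` and `load_text_file` are hypothetical stand-ins written with the standard library only. They mirror the `page_content` and `metadata` fields of LangChain's `Document`, but they are not part of LangChain.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Hypothetical stand-in mirroring the two fields of LangChain's Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path: str) -> list[SimpleDocument]:
    """Hypothetical loader: connect to the source (a file), read it, wrap it."""
    text = Path(path).read_text(encoding="utf-8")
    # Return a list even for a single document, as LangChain loaders do.
    return [SimpleDocument(page_content=text, metadata={"source": path})]


# Demo on a temporary file so the example does not depend on local data.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "notes.txt"
    sample.write_text("Loaders turn raw sources into documents.", encoding="utf-8")
    docs = load_text_file(str(sample))

print(docs[0].page_content)
print(docs[0].metadata)
```

The real loaders follow the same pattern: one call to `load()` returns a list of documents whose text goes into the prompt and whose metadata records where the text came from.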

In [1]:
# ╔══════════════════════════════════════════════════════╗
# ║ Setup: Load environment variables & initialize model ║
# ╚══════════════════════════════════════════════════════╝

from dotenv import load_dotenv, find_dotenv

# Load API keys (e.g. OPENAI_API_KEY) from the nearest .env file into the environment
_ = load_dotenv(find_dotenv())

from langchain_openai import ChatOpenAI
# from langchain_groq import ChatGroq

chat_model = ChatOpenAI(model="gpt-4o-mini")
# chat_model = ChatGroq(model="llama-3.1-70b-versatile")

print("✅ Environment loaded and model ready.")

✅ Environment loaded and model ready.


## Why Data Loaders matter

- **Augment knowledge** with your own sources (PDFs, spreadsheets, DBs, CRMs).
- **Reduce hallucinations** by providing reliable context.
- **Automate ingestion** from many systems with minimal code.

Common sources include: PDFs/Word/Excel, SQL/NoSQL, Google Drive, Notion, Slack, websites/APIs, and more.

## Practical example — WikipediaLoader

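We will load a single Wikipedia article and use its text as **context** for a question. The loader treats `query` as a search term, so `load_max_docs=1` keeps only the top search hit.
Note: this requires internet access and the `wikipedia` Python package (`pip install wikipedia`), which the loader depends on.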

In [2]:
# Load a page from Wikipedia
from langchain_community.document_loaders import WikipediaLoader

person = "Einstein"  # try changing this to another public figure
loader = WikipediaLoader(query=person, load_max_docs=1)

docs = loader.load()
assert len(docs) > 0, "No document returned by WikipediaLoader."
loaded_text = docs[0].page_content

print(loaded_text[:400] + "\n...\n")  # preview first chars

Albert Einstein (14 March 1879 – 18 April 1955) was a German-born theoretical physicist best known for developing the theory of relativity. Einstein also made important contributions to quantum theory. His mass–energy equivalence formula E = mc2, which arises from special relativity, has been called "the world's most famous equation". He received the 1921 Nobel Prize in Physics for "his services t
...



## Use the loaded data as context in a prompt

We'll build a small chat prompt that accepts a **question** and **context**, then ask the model to answer concisely based only on the provided context.

In [3]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "human",
            "Answer the following question using ONLY the provided context.\n\n"
            "Question: {question}\n\nContext:\n{context}\n\nAnswer concisely.",
        )
    ]
)

chain = prompt | chat_model | StrOutputParser()

question = "What was the full name of Einstein?"
print(chain.invoke({"question": question, "context": loaded_text}))

Albert Einstein


### What happened?

1. **WikipediaLoader** fetched the page and returned a list of `Document` objects.
2. We extracted `page_content` from the first document.
3. We constructed a **chat prompt** with placeholders for `question` and `context`.
4. We ran `prompt → model → parser` to get a clean, concise text answer.
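
The four steps above can be folded into one helper. This is a minimal sketch, not a LangChain API: `answer_from_loader` is a hypothetical name, and it only assumes that `loader` has a `.load()` method returning objects with `page_content` and that `chain` has an `.invoke()` method accepting a dict, as the objects in this notebook do. Unlike the cell above, it joins the text of all returned documents rather than using only the first.

```python
def answer_from_loader(loader, chain, question: str) -> str:
    """Hypothetical helper: load documents, join their text, and query the chain."""
    # Step 1: fetch the documents from the source.
    docs = loader.load()
    if not docs:
        raise ValueError("The loader returned no documents.")

    # Step 2: extract the text of every document and join it into one context.
    context = "\n\n".join(doc.page_content for doc in docs)

    # Steps 3-4: the chain fills the prompt, calls the model, and parses the output.
    return chain.invoke({"question": question, "context": context})
```

With the objects defined earlier, a call would look like `answer_from_loader(loader, chain, "What was the full name of Einstein?")`.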

## Notes and good practices

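- Context can be **very long**; later we will split it into chunks and do retrieval (RAG). For now we feed the raw text, which `WikipediaLoader` caps at 4,000 characters per document by default (`doc_content_chars_max`).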
- Keep questions precise and tell the model to rely on **context only** to reduce hallucinations.
- For private data (PDFs/DBs/Drive/etc.), use the corresponding loaders in `langchain_community.document_loaders`.
- Always check for empty results (e.g., no matching page) and catch the exceptions raised by network failures.
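
Two of the practices above, guarding against failed or empty loads and keeping the context short, might be sketched as follows. `load_or_empty` and `truncate_context` are hypothetical helper names, the 4,000-character limit is an arbitrary assumption, and truncation is only a stopgap until the text is split into chunks.

```python
def load_or_empty(loader) -> list:
    """Hypothetical helper: return the loader's documents, or [] if loading fails."""
    try:
        docs = loader.load()
    except Exception as exc:  # broad on purpose: error types vary by loader
        print(f"Loading failed: {exc!r}")
        return []
    return list(docs or [])


def truncate_context(text: str, max_chars: int = 4000) -> str:
    """Hypothetical helper: cap the context length (max_chars is an arbitrary choice)."""
    if len(text) <= max_chars:
        return text
    # Cut at the last space before the limit so a word is not split in half.
    cut = text.rfind(" ", 0, max_chars)
    return text[: cut if cut > 0 else max_chars].rstrip()
```

A typical use would be `docs = load_or_empty(loader)`, followed by an `if docs:` check before reading `docs[0].page_content` and passing `truncate_context(...)` of that text to the chain.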

## Summary

- **Data Loaders** bring external information into your LLM workflows.
- They let you **ground** model answers in reliable sources.
- In the next RAG notebooks, we’ll split documents, build embeddings, and retrieve the **most relevant** chunks before generation.