## Data Ingestion

Data ingestion can be simplified with the use of various data loaders, each with its own specialization:

- **TextLoader**: Excels at handling plain text files
- **PyPDFLoader**: Optimized for PDF files, allowing easy access to their content
- **SeleniumURLLoader**: The go-to tool for web-based data, notably HTML documents from URLs that require JavaScript rendering
- **Google Drive Loader**: Integrates seamlessly with Google Drive, allowing for data import from Google Docs or entire folders

### TextLoader

Import the loaders you need from `langchain_community.document_loaders`.

> **Note:** You can use the `encoding` argument to change the encoding type. (For example: `encoding="ISO-8859-1"`)

In [1]:
# TextLoader example
from langchain_community.document_loaders import TextLoader

loader = TextLoader('my_file.txt')
documents = loader.load()
documents

[Document(metadata={'source': 'my_file.txt'}, page_content=' Google opens up its AI language model PaLM to challenge OpenAI and GPT-3 Google offers developers access to one of its most advanced AI language models: PaLM. The search giant is launching an API for PaLM alongside a number of AI enterprise tools it says will help businesses "generate text, images, code, videos, audio, and more from simple natural language prompts."PaLM is a large language model, or LLM, similar to the GPT series created by OpenAI or Meta\'s LLaMA family of models. Google first announced PaLM in April 2022. Like other LLMs, PaLM is a flexible system that can potentially carry out all sorts of text generation and editing tasks. You could train PaLM to be a conversational chatbot like ChatGPT, for example, or you could use it for tasks like summarizing text or even writing code. (It\'s similar to features Google also announced today for its Workspace apps like Google Docs and Gmail.)')]
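
The `encoding` note above matters whenever a file is not UTF-8. The following is a minimal standard-library sketch of what a text loader does (read a file, then wrap the text with its source path). `load_text` and the sample file name are hypothetical stand-ins for illustration, not LangChain API; with LangChain you would pass the same `encoding` value to `TextLoader`.

```python
import tempfile
from pathlib import Path


def load_text(path, encoding="utf-8"):
    """Hypothetical stand-in for a text loader: one document per file."""
    text = Path(path).read_text(encoding=encoding)
    return [{"page_content": text, "metadata": {"source": str(path)}}]


# Write a sample file in ISO-8859-1 (Latin-1), where "é" is the single byte 0xE9.
sample = Path(tempfile.mkdtemp()) / "latin1_file.txt"
sample.write_bytes("café résumé".encode("ISO-8859-1"))

try:
    load_text(sample)  # default UTF-8 decoding fails on byte 0xE9
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Passing the matching encoding recovers the original text.
docs = load_text(sample, encoding="ISO-8859-1")
print(utf8_ok, docs[0]["page_content"])
```

If the default decoding fails with a `UnicodeDecodeError`, specifying the file's actual encoding is usually the first thing to try.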

### PyPDFLoader (PDF)

LangChain has two fundamental approaches for managing PDF files: the `PyPDFLoader` and the `PDFMinerLoader`. The `PyPDFLoader` imports a PDF file and creates a list of LangChain documents. Each document in this list contains the content and metadata of a single page, including the page number.

Using the `PyPDFLoader` has various advantages, including its simplicity and easy access to page information and metadata, such as page numbers, in an orderly fashion.

In [2]:
%pip install -q pypdf

Note: you may need to restart the kernel to use updated packages.


In [4]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("data/AI Engineering Building Applications with Foundation Models (Chip Huyen) (Z-Library).pdf")
pages = loader.load()  # one Document per page
print(pages[0])

page_content='Chip Huyen
 AI Engineering
Building Applications  
with Foundation Models' metadata={'producer': 'Antenna House PDF Output Library 2.6.0 (Linux64)', 'creator': 'AH CSS Formatter V6.0 MR2 for Linux64 : 6.0.2.5372 (2012/05/16 18:26JST)', 'creationdate': '2024-12-04T13:39:11+00:00', 'author': 'Chip Huyen;', 'moddate': '2024-12-04T09:21:26-05:00', 'title': 'AI Engineering', 'trapped': '/False', 'ebx_publisher': "O'Reilly Media", 'source': 'data/AI Engineering Building Applications with Foundation Models (Chip Huyen) (Z-Library).pdf', 'total_pages': 535, 'page': 0, 'page_label': 'Cover'}


### SeleniumURLLoader (URL)

The `SeleniumURLLoader` class in LangChain provides a user-friendly solution for importing HTML documents from URLs that require JavaScript rendering.

The `SeleniumURLLoader` class in LangChain has the following attributes:

- **urls** (`List[str]`): A list of URLs that the loader will access.
- **continue_on_failure** (`bool`, default=True): Determines whether the loader should continue processing other URLs in case of a failure.
- **browser** (`str`, default="chrome"): Choice of browser for loading the URLs. Supported values are `"chrome"` and `"firefox"`.
- **executable_path** (`Optional[str]`, default=None): The path to the browser's executable file.
- **headless** (`bool`, default=True): Specifies whether the browser should operate in headless mode, meaning it runs without a visible user interface.

When the `load()` method is used with the `SeleniumURLLoader`, it returns a collection of `Document` instances, each containing the content fetched from the web pages. These `Document` instances have a `page_content` attribute, which includes the text extracted from the HTML, and a `metadata` attribute that stores the source URL.
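
The effect of `continue_on_failure` can be sketched without a browser. The code below is an illustration under stated assumptions, not the loader's actual implementation: `load_all` and `fake_fetch` are hypothetical names, and the documents are plain dictionaries that mirror the `page_content` and `metadata` attributes described above.

```python
import logging

logger = logging.getLogger(__name__)


def load_all(urls, fetch, continue_on_failure=True):
    """Load each URL with `fetch`; skip or re-raise failures depending on the flag."""
    docs = []
    for url in urls:
        try:
            text = fetch(url)
        except Exception as exc:
            if continue_on_failure:
                # Record the problem and move on to the next URL.
                logger.error("Error fetching %s: %s", url, exc)
                continue
            raise
        docs.append({"page_content": text, "metadata": {"source": url}})
    return docs


def fake_fetch(url):
    """Hypothetical fetcher standing in for a browser; fails on one URL."""
    if "broken" in url:
        raise RuntimeError("page did not render")
    return f"text of {url}"


urls = ["https://example.com/a", "https://example.com/broken", "https://example.com/b"]
docs = load_all(urls, fake_fetch)
print([d["metadata"]["source"] for d in docs])
```

With the flag left at its default, the failing URL is logged and skipped, so the result can contain fewer documents than the number of URLs passed in.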

The `SeleniumURLLoader` may be slower than other loaders because it launches a real browser to render each page accurately, which matters most for pages that depend on JavaScript. Despite this, it remains a valuable tool for loading such pages.

> **Note:** This approach will not work in a Google Colab notebook without further configuration. Instead, try running the code directly using the Python interpreter.

In [1]:
%pip install -q unstructured selenium

Note: you may need to restart the kernel to use updated packages.


In [2]:
from langchain_community.document_loaders import SeleniumURLLoader

urls = [
    "https://www.youtube.com/watch?v=TFa539R09EQ&t=139s",
    "https://www.youtube.com/watch?v=6Zv6A_9urh4&t=112s"
]

loader = SeleniumURLLoader(urls=urls)
data = loader.load()
print(data[0])

page_content='' metadata={'source': 'https://www.youtube.com/watch?v=TFa539R09EQ&t=139s', 'title': 'OPENASSISTANT TAKES ON CHATGPT! - YouTube', 'description': 'Patreon: https://www.patreon.com/mlstDiscord: https://discord.gg/ESrGqhf5CBTwitter: https://twitter.com/MLStreetTalkIn this eye-opening interview, we dive de...', 'language': 'en'}


These attributes can be adjusted during initialization. For example, to use Firefox instead of Chrome, set the `browser` attribute:

In [3]:
# Using Firefox instead of Chrome
loader = SeleniumURLLoader(urls=urls, browser="firefox")

### Google Drive Loader

The LangChain `GoogleDriveLoader` class is an efficient tool for importing data directly from Google Drive. It can retrieve data from a list of Google Docs document IDs or a single folder ID on Google Drive.

To use the `GoogleDriveLoader`, you need to set up the necessary credentials and tokens:

- The loader typically looks for the credentials file at `~/.credentials/credentials.json`. You can specify a different path using the `credentials_path` keyword argument.
- The `token.json` file is created automatically on the loader's first use and follows the same path convention (`~/.credentials/token.json` by default).
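
Before instantiating the loader, it can help to check which of these files are already in place. The sketch below uses only the standard library; `credential_status` is a hypothetical helper, and the demo runs against a throwaway directory rather than your real `~/.credentials` folder.

```python
import tempfile
from pathlib import Path


def credential_status(base_dir=None):
    """Report whether the credentials and token files exist under base_dir."""
    base = Path(base_dir) if base_dir else Path.home() / ".credentials"
    return {
        name: (base / name).exists()
        for name in ("credentials.json", "token.json")
    }


# Demo against a temporary directory so the sketch runs anywhere.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "credentials.json").write_text("{}")
status = credential_status(demo_dir)
print(status)
```

Calling `credential_status()` with no argument checks the default location instead. A missing `token.json` is expected before the loader's first use.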

#### Setting up credentials

To set up the credentials file, follow these steps:

1. **Create or select a Google Cloud Platform project** by visiting the [Google Cloud Console](https://console.cloud.google.com/). Make sure billing is enabled for the project.

2. **Activate the Google Drive API**: open it from the Google Cloud Console dashboard and click "Enable."

3. **Set up a service account** via the Service Accounts page in the Google Cloud Console.

4. **Assign the necessary roles** to the service account. Roles like "Google Drive API - Drive File Access" and "Google Drive API - Drive Metadata Read/Write Access" might be required, depending on your specific use case.

5. **Generate a JSON key**: Open the "Actions" menu next to the service account, select "Manage keys," then click "Add Key" and choose "JSON" as the key type. This generates a JSON key file and downloads it to your computer; use this file as your credentials file.

6. **Retrieve the folder or document ID** from its URL:
   - Folder: `https://drive.google.com/drive/u/0/folders/{folder_id}`
   - Document: `https://docs.google.com/document/d/{document_id}/edit`
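
Pulling the ID out of a URL by hand is error-prone, so a small helper can do it. This is a sketch assuming the two URL shapes listed in step 6; `extract_drive_id` is a hypothetical function and the IDs in the example calls are made-up placeholders.

```python
import re

# Patterns for the two URL shapes shown in step 6.
FOLDER_RE = re.compile(r"/folders/([A-Za-z0-9_-]+)")
DOCUMENT_RE = re.compile(r"/document/d/([A-Za-z0-9_-]+)")


def extract_drive_id(url):
    """Return ("folder" | "document", id) for a Drive or Docs URL, else None."""
    for kind, pattern in (("folder", FOLDER_RE), ("document", DOCUMENT_RE)):
        match = pattern.search(url)
        if match:
            return kind, match.group(1)
    return None


print(extract_drive_id("https://drive.google.com/drive/u/0/folders/abc123_XYZ"))
print(extract_drive_id("https://docs.google.com/document/d/doc-456/edit"))
```

The returned ID is what you would pass as `folder_id` (or in a list of document IDs) when creating the loader.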

> **Note:** Currently only Google Docs are supported.

In [None]:
# Import the GoogleDriveLoader class
from langchain_community.document_loaders import GoogleDriveLoader

In [None]:
# Instantiate GoogleDriveLoader
loader = GoogleDriveLoader(
    folder_id="your_folder_id",
    recursive=False  # Optional: Fetch files from subfolders recursively. Defaults to False.
)

In [None]:
# Load the documents
docs = loader.load()
docs

---

## Retrieval-Augmented Generation (RAG)

Retrieval-augmented generation (RAG) is a method created by the FAIR team at Meta to enhance the accuracy of Large Language Models (LLMs) and reduce false information or "hallucinations". RAG improves LLMs by adding an information retrieval step before generating an answer, which systematically incorporates relevant data from external knowledge sources into the LLM's input prompt.

This helps chatbots provide more accurate and context-specific information by supplementing the LLM's internal knowledge with relevant external data, such as private documentation, PDFs, codebases, or SQL databases.
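
The retrieval step can be sketched in a few lines. The example below is a deliberately simplified illustration: it ranks passages by word overlap instead of embeddings, does not call an LLM, and all names (`tokenize`, `retrieve`, `build_prompt`) and sample documents are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase word tokens; a crude stand-in for real embeddings."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, documents, k=2):
    """Rank documents by word overlap with the query and keep the top k."""
    query_terms = tokenize(query)
    scored = sorted(
        documents,
        key=lambda doc: len(query_terms & tokenize(doc["text"])),
        reverse=True,
    )
    return scored[:k]


def build_prompt(query, retrieved):
    """Place retrieved passages, tagged with their source, ahead of the question."""
    context = "\n".join(f"[{doc['source']}] {doc['text']}" for doc in retrieved)
    return (
        "Answer using only the context below and cite the source in brackets.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


documents = [
    {"source": "handbook.pdf", "text": "Refunds are issued within 14 days of purchase."},
    {"source": "faq.md", "text": "Support is available on weekdays from 9 to 5."},
    {"source": "notes.txt", "text": "The office is closed on public holidays."},
]

query = "How many days do refunds take?"
prompt = build_prompt(query, retrieve(query, documents, k=1))
print(prompt)
```

The resulting prompt is what would be sent to the LLM. Because each passage carries its source label, the model can be asked to cite where an answer came from.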

### Key Benefits of RAG

1. **Source Citation**: RAG allows citing sources in responses, enabling users to verify the information and increase trust in the model's outputs.

2. **Up-to-date Knowledge**: RAG supports the integration of frequently updated and domain-specific knowledge, which is typically harder to achieve through LLM fine-tuning.

### The Impact of Larger Context Windows

Large models with bigger context windows can process a wider range of text, raising the question of whether to provide the entire set of documents or just the relevant information. While providing the complete set of documents allows the model to draw insights from a broader context, it has drawbacks:

- **Increased latency** due to processing larger amounts of text.
- **Potential accuracy decline** if relevant information is scattered throughout the document.
- **Inefficient resource usage**, especially with large datasets.

When deciding between providing the entire set of documents or just the relevant information, consider the application's requirements and limitations, such as acceptable latency, desired accuracy, and available computational resources.
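
A back-of-the-envelope comparison makes the trade-off concrete. The sketch below rests on loud assumptions: one token per whitespace-separated word, a made-up corpus of 20 documents, and a hypothetical 16,000-token context window.

```python
def approx_tokens(text):
    """Very rough token estimate: one token per whitespace-separated word."""
    return len(text.split())


# Hypothetical corpus: 20 documents of 1,000 words each.
corpus = {f"doc_{i}": "lorem ipsum " * 500 for i in range(20)}
relevant_ids = ["doc_3", "doc_7"]  # pretend a retriever picked these

full_size = sum(approx_tokens(text) for text in corpus.values())
retrieved_size = sum(approx_tokens(corpus[doc_id]) for doc_id in relevant_ids)

context_window = 16_000  # hypothetical model limit, in tokens
print(f"all documents: {full_size} tokens, fits: {full_size <= context_window}")
print(f"retrieved only: {retrieved_size} tokens, fits: {retrieved_size <= context_window}")
```

Real tokenizers count differently, so treat these numbers as relative sizes only. The point is that sending only retrieved passages shrinks the prompt by the same proportion as the fraction of documents selected.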

## LangChain's Indexes and Retrievers

An **index** in LangChain is a data structure that organizes and stores data to facilitate quick and efficient searches. A **retriever**, in turn, uses this index to find and return relevant data in response to specific queries.

LangChain's indexes and retrievers provide modular, adaptable, and customizable options for handling unstructured data with LLMs. The primary index types in LangChain are backed by vector databases and rely on embeddings.

The role of a retriever is to fetch relevant documents for inclusion in language model prompts. In LangChain, a retriever exposes a `get_relevant_documents` method (or `invoke` in newer versions) that takes a query string as input and returns a list of documents relevant to that query.
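
The division of labor between an index and a retriever can be illustrated with a toy version. This is a sketch, not LangChain's implementation: `ToyIndex` and `ToyRetriever` are hypothetical classes, and the "embedding" is a simple bag-of-words count vector rather than a learned one. Only the `invoke` method name mirrors the interface described above.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (real systems use learned embeddings)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyIndex:
    """Stores each text alongside its vector so it can be searched later."""

    def __init__(self, texts):
        self.entries = [(text, embed(text)) for text in texts]


class ToyRetriever:
    """Uses the index to return the k texts most similar to a query."""

    def __init__(self, index, k=1):
        self.index = index
        self.k = k

    def invoke(self, query):
        query_vec = embed(query)
        ranked = sorted(self.index.entries, key=lambda e: cosine(query_vec, e[1]), reverse=True)
        return [text for text, _ in ranked[: self.k]]


index = ToyIndex([
    "PyPDFLoader reads PDF files page by page.",
    "SeleniumURLLoader renders pages that need JavaScript.",
    "TextLoader reads plain text files.",
])
retriever = ToyRetriever(index, k=1)
print(retriever.invoke("Which loader handles JavaScript pages?"))
```

The index is built once, up front, while the retriever does its work per query. Swapping in a real embedding model and a vector database changes the quality and scale, not this overall shape.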

LangChain and LlamaIndex offer user-friendly classes for implementing a retriever on your data source, with the first step being index creation.