In [None]:
!pip install langchain-google-vertexai langchain_community "langchain-google-community[docai]" pypdf unstructured

# Ingesting documents

## LangChain Document Loaders

### TextLoader

The most basic document loader is the `TextLoader`. It takes the file path of a text file 
as input and loads its entire content into a single `Document` object.

In [1]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("./text_file.txt")
documents = loader.load()

documents

[Document(metadata={'source': './text_file.txt'}, page_content='Some interesting document')]

The loader's output is a list containing a single Document, holding the text file's 
content without any transformations. The document's metadata includes the path to 
the original file.

### CSVLoader

We can integrate structured data from various sources using the `CSVLoader` class. 
It parses comma-separated values (CSV) files and transforms their tabular content into a 
list of `Document` instances.

`CSVLoader` accepts the path to a CSV file and represents each row of the file as an 
individual document.

In [3]:
from langchain_community.document_loaders import CSVLoader

loader = CSVLoader("./data.csv")
documents = loader.load()

documents

[Document(metadata={'source': './data.csv', 'row': 0}, page_content='product_id: A1\nproduct_name: Product A1\nprice: 50'),
 Document(metadata={'source': './data.csv', 'row': 1}, page_content='product_id: A2\nproduct_name: Product A2\nprice: 30'),
 Document(metadata={'source': './data.csv', 'row': 2}, page_content='product_id: A3\nproduct_name: Product A3\nprice: 40')]

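The `csv_args` argument is a dictionary of keyword arguments that are passed directly to 
Python's built-in `csv.DictReader`. 

This is useful for customizing how the CSV file is parsed. 
In the code snippet below we pass `{"delimiter": ","}`, which states the default comma 
delimiter explicitly; a semicolon-separated file would use `{"delimiter": ";"}` instead. 
The `source_column` argument names the column whose value is stored as each document's 
`source` metadata, replacing the file path.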

In [4]:
loader = CSVLoader(
    "./data.csv",
    csv_args={"delimiter": ","},
    source_column="product_name",
)
documents = loader.load()
documents

[Document(metadata={'source': 'Product A1', 'row': 0}, page_content='product_id: A1\nproduct_name: Product A1\nprice: 50'),
 Document(metadata={'source': 'Product A2', 'row': 1}, page_content='product_id: A2\nproduct_name: Product A2\nprice: 30'),
 Document(metadata={'source': 'Product A3', 'row': 2}, page_content='product_id: A3\nproduct_name: Product A3\nprice: 40')]
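
To make the effect of `csv_args` concrete, here is a minimal sketch that uses only the standard library. It assumes a hypothetical semicolon-separated input (not the `data.csv` file above) and only approximates how each row becomes a `column: value` block; it is not the loader's actual implementation.

```python
import csv
import io

# Hypothetical semicolon-separated data, mirroring the columns used above.
raw = "product_id;product_name;price\nA1;Product A1;50\nA2;Product A2;30\n"

# With csv_args={"delimiter": ";"}, the delimiter keyword reaches csv.DictReader like this.
reader = csv.DictReader(io.StringIO(raw), delimiter=";")

page_contents = []
for row in reader:
    # Approximates the "column: value" layout seen in page_content above.
    page_contents.append("\n".join(f"{key}: {value}" for key, value in row.items()))

print(page_contents[0])
```

Without the `delimiter` argument, `csv.DictReader` would treat each line as a single column, so the delimiter has to match the file.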

### PyPDFLoader

To extract data from PDF files you can use the `PyPDFLoader` class. 

This loader extracts the textual content of a PDF document and converts it into a list of `Document` instances. 

Each page within the PDF is treated as an independent document.

In [12]:
from langchain_community.document_loaders import PyPDFLoader

path_or_url = "https://arxiv.org/pdf/1706.03762"
loader = PyPDFLoader(path_or_url)
documents = loader.load()

for i, document in enumerate(documents[:3], start=1):
    print(f"--- Page {i} ---")
    print(document.page_content[:80])
    print()

--- Page 1 ---
Provided proper attribution is provided, Google hereby grants permission to
repr

--- Page 2 ---
1 Introduction
Recurrent neural networks, long short-term memory [ 13] and gated

--- Page 3 ---
Figure 1: The Transformer - model architecture.
The Transformer follows this ove



### GoogleDriveLoader

LangChain integrates with Google Drive through its `GoogleDriveLoader` class, which loads files directly from Drive. 

Prerequisites include enabling the Google Drive API, authorizing desktop app credentials, and installing the following libraries:

In [None]:
!pip install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib

The following snippet loads all files in a Google Drive folder. Remember to replace the folder ID and the token path with your own values.

In [27]:
FOLDER_ID = "my-folder-id"
TOKEN_PATH = "/path/to/token.json"

In [29]:
from langchain_google_community import GoogleDriveLoader

try:
    loader = GoogleDriveLoader(
        folder_id=FOLDER_ID,
        token_path=TOKEN_PATH,
        recursive=False,
    )
    
    docs = loader.load()
except Exception as ex: # Must provide valid credentials
    pass

A specific list of files can be targeted using `file_ids`. Like folder IDs, these IDs can be obtained from the individual file URLs. 
Any file type can be loaded by specifying a base file loader through the `file_loader_cls` argument. 

The resulting documents have the same structure as those produced by the underlying `DocumentLoader` class. 

For instance, consider a list of XML files in Google Drive with known IDs. They can be loaded as follows:

In [28]:
from langchain_community.document_loaders import (
  UnstructuredXMLLoader
)

try:
    loader = GoogleDriveLoader(
        # folder_id is omitted: the loader expects either folder_id or file_ids, not both
        token_path=TOKEN_PATH,
        file_ids=["file_id_1", "file_id_2", "file_id_3"],
        file_loader_cls=UnstructuredXMLLoader,
        # Extra keyword arguments for the file loader go in file_loader_kwargs
        file_loader_kwargs={},
    )
    
    docs = loader.load()
except Exception as ex: # Must provide valid credentials
    pass
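
The folder and file IDs mentioned above are the opaque segment of a Drive URL. As a rough sketch, the hypothetical helper `extract_drive_id` below pulls that segment out with a regular expression. It assumes the common `.../folders/<id>` and `.../d/<id>/...` URL shapes and is not part of LangChain or any Google library; the IDs in the example URLs are made up.

```python
import re
from typing import Optional

# Assumed URL shapes: .../folders/<id> for folders and .../d/<id>/... for files.
_ID_PATTERN = re.compile(r"/(?:folders|d)/([A-Za-z0-9_-]+)")


def extract_drive_id(url: str) -> Optional[str]:
    """Return the ID segment of a Drive URL, or None if no known pattern matches."""
    match = _ID_PATTERN.search(url)
    return match.group(1) if match else None


# Made-up example URLs, used only to exercise the helper.
folder_id = extract_drive_id("https://drive.google.com/drive/folders/abc123_XYZ")
file_id = extract_drive_id("https://drive.google.com/file/d/def-456/view?usp=sharing")
print(folder_id, file_id)
```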

### DirectoryLoader

To ingest multiple documents stored in a directory structure, you can use the `DirectoryLoader` class.

It discovers the files that match a glob pattern, loads them, and consolidates the results into a list of `Document` instances.

In [26]:
from langchain_community.document_loaders import DirectoryLoader

loader = DirectoryLoader("./documents", glob="*.txt")

documents = loader.load()

documents

[Document(metadata={'source': 'documents/text_file-2.txt'}, page_content='Some interesting document 2'),
 Document(metadata={'source': 'documents/text_file-1.txt'}, page_content='Some interesting document 1'),
 Document(metadata={'source': 'documents/text_file-0.txt'}, page_content='Some interesting document 0')]
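
The `glob` pattern decides which files are picked up. The sketch below uses only `pathlib` on a temporary directory to show the difference between a top-level pattern and a recursive one. It assumes the loader's pattern matching behaves like `pathlib` globbing, which is worth verifying against the `DirectoryLoader` documentation; the file names are made up.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    # Made-up files: two text files (one nested) and one markdown file.
    for name in ("a.txt", "b.md", "sub/c.txt"):
        (root / name).write_text("Some interesting document")

    # "*.txt" matches only files directly inside the root directory.
    top_level = sorted(p.relative_to(root).as_posix() for p in root.glob("*.txt"))
    # "**/*.txt" also descends into subdirectories.
    recursive = sorted(p.relative_to(root).as_posix() for p in root.glob("**/*.txt"))

print(top_level)
print(recursive)
```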

You can specify a custom loader class if required with the `loader_cls` argument:

In [31]:
from langchain_community.document_loaders import TextLoader

loader = DirectoryLoader(
    "./documents", 
    glob="*.txt", 
    loader_cls=TextLoader
)

loader.load()

[Document(metadata={'source': 'documents/text_file-2.txt'}, page_content='Some interesting document 2'),
 Document(metadata={'source': 'documents/text_file-1.txt'}, page_content='Some interesting document 1'),
 Document(metadata={'source': 'documents/text_file-0.txt'}, page_content='Some interesting document 0')]

### GCSFileLoader

If we use Google Cloud Storage (GCS) as our file storage, we can load a single file with the `GCSFileLoader` class, as shown below:

In [32]:
from langchain_google_community import GCSFileLoader

loader = GCSFileLoader(
    project_name="my-gcp-project-id",
    bucket="my-bucket",
    blob="my-file.txt"
)

We can customize the loader class making use of the `loader_func` argument.

In this example, we define a `load_pdf` function that uses the PyPDFLoader specifically for PDF files. 

The `GCSFileLoader` handles the interaction with GCS, while `load_pdf` ensures the content is correctly parsed as a PDF.

In [33]:
from langchain_community.document_loaders import PyPDFLoader

def load_pdf(file_path):
    return PyPDFLoader(file_path)

loader = GCSFileLoader(
    project_name="my-gcp-project-id",
    bucket="my-bucket",
    blob="my-file.pdf",  # a PDF blob, to match the PDF loader below
    loader_func=load_pdf,
)

To process multiple files within a GCS bucket, the `GCSDirectoryLoader` class provides a convenient solution:

In [34]:
from langchain_google_community import GCSDirectoryLoader 

loader = GCSDirectoryLoader(
  project_name="your-gcp-project-id", 
  bucket="your-bucket-name",
  prefix="path/to/your/directory"
) 

## Chunking

LangChain's `TextSplitter` interface helps to break down large documents into smaller chunks for processing. 

In [39]:
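from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=10,    # maximum number of characters per chunk
    chunk_overlap=3,  # maximum number of characters shared by consecutive chunks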
)
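
To build intuition for `chunk_size` and `chunk_overlap`, here is a deliberately simplified sliding-window sketch. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to split on separators such as newlines and spaces before falling back to characters; the function name `sliding_chunks` is hypothetical.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


chunks = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)
```

Each chunk starts with the last three characters of the previous one, which is the role the overlap plays: context at a chunk boundary is not lost entirely.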

Other text splitters include:

- `CharacterTextSplitter`: Splits on a specific character (e.g., newline).
  
- `MarkdownTextSplitter`: Handles markdown formatting for structured documents.

- `PythonCodeTextSplitter`: Preserves the structure of Python code snippets.

To split a LangChain `Document` into smaller chunks after initializing the splitter class, 
you need to call the `split_documents` method. 

It takes a list of `Document` objects as input and returns a list of chunked `Document` objects.


In [41]:
from langchain_community.document_loaders import DirectoryLoader

loader = DirectoryLoader("./documents", glob="*.txt")

documents = loader.load()

texts = text_splitter.split_documents(documents)

texts

[Document(metadata={'source': 'documents/text_file-2.txt'}, page_content='Some'),
 Document(metadata={'source': 'documents/text_file-2.txt'}, page_content='interesti'),
 Document(metadata={'source': 'documents/text_file-2.txt'}, page_content='sting'),
 Document(metadata={'source': 'documents/text_file-2.txt'}, page_content='document'),
 Document(metadata={'source': 'documents/text_file-2.txt'}, page_content='2'),
 Document(metadata={'source': 'documents/text_file-1.txt'}, page_content='Some'),
 Document(metadata={'source': 'documents/text_file-1.txt'}, page_content='interesti'),
 Document(metadata={'source': 'documents/text_file-1.txt'}, page_content='sting'),
 Document(metadata={'source': 'documents/text_file-1.txt'}, page_content='document'),
 Document(metadata={'source': 'documents/text_file-1.txt'}, page_content='1'),
 Document(metadata={'source': 'documents/text_file-0.txt'}, page_content='Some'),
 Document(metadata={'source': 'documents/text_file-0.txt'}, page_content='interesti'

## DocAIParser

LangChain integrates Google Cloud's Document AI service through the `DocAIParser` class.

In [47]:
from langchain_google_community import DocAIParser
from langchain_core.document_loaders.blob_loaders import Blob

LOCATION = "us"
GCS_OUTPUT_PATH = "gs://BUCKET_NAME/FOLDER_PATH"
PROCESSOR_NAME = "projects/PROJECT_NUMBER/locations/LOCATION/processors/PROCESSOR_ID"

try:
    parser = DocAIParser(
      location=LOCATION, 
      processor_name=PROCESSOR_NAME, 
      gcs_output_path=GCS_OUTPUT_PATH
    )
except ValueError:  # Define correct processor name
    pass

Once the class is instantiated, we can use the `lazy_parse` method to run the processor and get the documents from a `Blob` in Google Cloud Storage.

In [48]:
path = "gs://my-bucket/my-folder/my-pdf.pdf"

blob = Blob(path=path)

try:
    docs = list(parser.lazy_parse(blob))
except NameError: # If the parser was not defined
    pass

PDF parsing happens asynchronously on Google Cloud, and it takes time. We wait for all `operations` to finish and then parse the results into a list of `Document` objects.

In [49]:
import time  # needed for time.sleep; without it the NameError handler below would hide the failure

try:
    operations = parser.docai_parse([blob])
    while parser.is_running(operations):
        time.sleep(0.5)
    results = parser.get_results(operations)
    docs = list(parser.parse_from_results(results))

except NameError: # If the parser was not defined
    pass
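
The polling loop above waits indefinitely if an operation never completes. One possible refinement, sketched here with the standard library only, is to bound the wait with a timeout. The helper `wait_until_done` and the stand-in `remaining` list are hypothetical and exist only to make the example runnable without Google Cloud credentials.

```python
import time


def wait_until_done(is_running, poll_interval=0.5, timeout=60.0):
    """Poll is_running() until it returns False; return False if the timeout expires."""
    deadline = time.monotonic() + timeout
    while is_running():
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
    return True


# Hypothetical stand-in for the real check: reports "running" three times, then stops.
remaining = [True, True, True, False]
finished = wait_until_done(lambda: remaining.pop(0), poll_interval=0.01, timeout=5.0)
print(finished)
```

With a real parser, the callable would wrap the check used above, for example `lambda: parser.is_running(operations)`.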