# Langchain Community S3 Directory Loader  

What we're doing with Langchain, MinIO, and OpenAI:


1. Load the bucket contents with `S3DirectoryLoader`
2. Load a file with `S3FileLoader`
3. Summarize the `S3FileLoader` output with OpenAI
4. Summarize the `S3DirectoryLoader` output with OpenAI

Resources we're accessing:
- Endpoint: https://play.min.io
- Bucket: "web-documentation"

Bucket contains files:
- `MinIO_Quickstart.md`
- `test-file-1.md`
- `test-file-2.md`

We'll break the process into four parts: the first two cover `S3DirectoryLoader` and `S3FileLoader` on their own, and the last two pass each loader's output to OpenAI for summarization. Each part has its own code block and explanation.

---


Install Langchain with `pip install langchain`.

In [1]:
pip install langchain

Note: you may need to restart the kernel to use updated packages.


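Before running the loaders, it can help to confirm that the packages they rely on are importable. The sketch below assumes three top-level packages: `langchain_community` (imported by every code block in this notebook), `boto3` (used for S3 access), and `unstructured` (used to parse file contents). The list is an assumption about this setup, not an official requirements list, and the helper name `missing_packages` is hypothetical.

```python
from importlib.util import find_spec

# Assumed dependency list for this notebook; adjust it to your environment.
REQUIRED = ["langchain_community", "boto3", "unstructured"]


def missing_packages(names):
    """Return the names that cannot be imported in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All assumed loader dependencies are importable.")
```

Any package reported as missing can be installed with `pip install <package>` before continuing.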


## #1 - How to use the Langchain S3 Directory Loader

#### Objective
Load multiple documents from an S3 directory.

#### Steps
1. Import `S3DirectoryLoader` from `langchain_community.document_loaders`.
2. Configure MinIO credentials and endpoint.
3. Initialize `S3DirectoryLoader` with the specified bucket, directory prefix, endpoint URL, AWS access keys, and SSL usage.
4. Load documents from the specified directory.
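Step 3 combines the endpoint and the SSL setting into a single endpoint URL. As a minimal sketch of that derivation, the hypothetical helper `build_endpoint_url` (not part of Langchain or MinIO) picks the scheme from the `use_ssl` flag:

```python
def build_endpoint_url(endpoint: str, use_ssl: bool) -> str:
    """Prefix the host:port endpoint with https:// or http:// based on use_ssl."""
    scheme = "https" if use_ssl else "http"
    return f"{scheme}://{endpoint}"


print(build_endpoint_url("play.min.io:9000", True))
print(build_endpoint_url("localhost:9000", False))  # illustrative local endpoint
```

Keeping the scheme and the `use_ssl` flag derived from one variable avoids a mismatch such as an `http://` URL paired with `use_ssl=True`.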

#### Code Block

In [8]:
from langchain_community.document_loaders.s3_directory import S3DirectoryLoader

# MinIO Configuration
endpoint = 'play.min.io:9000'
access_key = 'minioadmin'
secret_key = 'minioadmin'
use_ssl = True

# Initializing the S3DirectoryLoader
directory_loader = S3DirectoryLoader(
    bucket='web-documentation', 
    prefix='', 
    endpoint_url=f'http{"s" if use_ssl else ""}://{endpoint}',
    aws_access_key_id=access_key, 
    aws_secret_access_key=secret_key, 
    use_ssl=use_ssl
)

# Load documents from directory
documents = directory_loader.load()
documents

[Document(page_content='MinIO Quickstart Guide\n\nMinIO is a High Performance Object Storage released under GNU Affero General Public License v3.0. It is API compatible with Amazon S3 cloud storage service. Use MinIO to build high performance infrastructure for machine learning, analytics and application data workloads.\n\nThis README provides quickstart instructions on running MinIO on bare metal hardware, including container-based installations. For Kubernetes environments, use the MinIO Kubernetes Operator.\n\nContainer Installation\n\nUse the following commands to run a standalone MinIO server as a container.\n\nStandalone MinIO servers are best suited for early development and evaluation. Certain features such as versioning, object locking, and bucket replication\nrequire distributed deploying MinIO with Erasure Coding. For extended development and production, deploy MinIO with Erasure Coding enabled - specifically,\nwith a minimum of 4 drives per MinIO server. See MinIO Erasure C

The above output (truncated here) is the list of `Document` objects loaded from the bucket; each one holds the text of a file in `page_content`.
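Printing the raw list is hard to read once documents get long. The following sketch shows one way to print a compact overview instead. It assumes each loaded object exposes `page_content` and a `metadata` dictionary that may contain a `source` entry; the helper `describe_documents` and the stand-in sample object are hypothetical and only used so the snippet runs without network access.

```python
from types import SimpleNamespace


def describe_documents(docs, preview_chars=60):
    """Build one summary line per document: source, length, and a short preview."""
    lines = []
    for doc in docs:
        source = doc.metadata.get("source", "unknown source")
        preview = doc.page_content[:preview_chars].replace("\n", " ")
        lines.append(f"{source} ({len(doc.page_content)} chars): {preview}")
    return lines


# Stand-in for a loaded document; in the notebook you would pass `documents`.
sample = [
    SimpleNamespace(
        page_content="This is a sample document\n\n...",
        metadata={"source": "s3://web-documentation/test-file-1.md"},
    )
]

for line in describe_documents(sample):
    print(line)
```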

---


## #2 - How to use the Langchain S3 File Loader

Resources we're accessing:
- Endpoint: https://play.min.io
- Bucket: "web-documentation"

#### Objective
Load a single document from an S3 file.

#### Steps
1. Import `S3FileLoader` from `langchain_community.document_loaders`.
2. Configure MinIO credentials and endpoint.
3. Initialize `S3FileLoader` with the specified bucket, file key, endpoint URL, AWS access keys, and SSL usage.
4. Load a single document from the specified file.

#### Code Block

In [5]:
from langchain_community.document_loaders.s3_file import S3FileLoader

# MinIO Configuration
endpoint = 'play.min.io:9000'
access_key = 'minioadmin'
secret_key = 'minioadmin'
use_ssl = True

# Initializing the S3FileLoader
file_loader = S3FileLoader(
    bucket='web-documentation', 
    key='test-file-1.md', 
    endpoint_url=f'http{"s" if use_ssl else ""}://{endpoint}',
    aws_access_key_id=access_key, 
    aws_secret_access_key=secret_key, 
    use_ssl=use_ssl
)

# Load a single document
document = file_loader.load()
document

[Document(page_content='This is a sample document\n\n...\n\nSample Document\n\neof', metadata={'source': 's3://web-documentation/test-file-1.md'})]

The above output shows the content of the specified file, returned as a list containing a single `Document` whose `metadata` records the `s3://` source.
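The `source` value in the metadata encodes both the bucket and the object key. As a small sketch, the hypothetical helper `split_s3_uri` below recovers them with the standard library, assuming the value always has the form `s3://bucket/key` as in the output above:

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3://bucket/key URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"Not an s3:// URI: {uri!r}")
    # netloc holds the bucket; the path holds the key with a leading slash.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_uri("s3://web-documentation/test-file-1.md")
print(bucket, key)
```

This can be handy when a downstream step needs the file name rather than the full URI.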

---


## #3 - Utilizing the S3 File Loader (with OpenAI API for Document Summary)

In [27]:
from langchain_community.document_loaders.s3_file import S3FileLoader

from langchain_community.chat_models import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableLambda
import os

# Set your OpenAI API Key here
#os.environ['OPENAI_API_KEY'] = 'your-api-key'

# MinIO Configuration
endpoint = 'play.min.io:9000'
access_key = 'minioadmin'
secret_key = 'minioadmin'
use_ssl = True

# Initializing the S3FileLoader
file_loader = S3FileLoader(
    bucket='web-documentation', 
    key='MinIO_Quickstart.md', 
    endpoint_url=f'http{"s" if use_ssl else ""}://{endpoint}',
    aws_access_key_id=access_key, 
    aws_secret_access_key=secret_key, 
    use_ssl=use_ssl
)

# Define LLM Setup
model = ChatOpenAI(temperature=0, model="gpt-4-1106-preview")

# Define the prompt template for summarization
template = "Summarize the following document: {context}"
prompt = ChatPromptTemplate.from_template(template)

# Load the document
loaded_documents = file_loader.load()

# Check if documents are loaded and extract the text content from the first document
if loaded_documents:
    document_text = loaded_documents[0].page_content

    # Define the Chain
    chain = (
        RunnableLambda(lambda x: {"context": document_text})
        | prompt
        | model
        | StrOutputParser()
    )

    # Execute Chain Synchronously
    summary = chain.invoke(None)
    print("Summary:", summary)
else:
    print("No documents loaded.")

Summary: The MinIO Quickstart Guide provides instructions for setting up MinIO, a high-performance object storage system compatible with Amazon S3 APIs. It is suitable for machine learning, analytics, and application data workloads and is released under the GNU Affero General Public License v3.0.

The guide covers installation methods for various platforms:

- **Container Installation**: Instructions for running MinIO as a container using `podman` or `docker` commands.
- **macOS**: Steps to install MinIO using Homebrew or binary download.
- **GNU/Linux**: Instructions for running MinIO on different Linux architectures using `wget` to download the binary.
- **Microsoft Windows**: Steps to download and run MinIO using the Windows executable.
- **Install from Source**: For developers and advanced users, instructions to compile and run MinIO from source using Go.

For all installations, MinIO starts with default credentials (minioadmin:minioadmin) and can be accessed via the MinIO Console 

The code above loads the specified file and sends its content to the OpenAI API, which returns the summary shown (truncated here).
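The code block leaves the API key assignment commented out, so the key has to be present in the environment as `OPENAI_API_KEY` before the model is created. A minimal sketch of an explicit check follows; the helper `require_env` is hypothetical, and the demo uses a placeholder dictionary instead of the real environment so that it runs without a key.

```python
import os


def require_env(name, environ=None):
    """Return the value of an environment variable or fail with a clear message."""
    env = os.environ if environ is None else environ
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running the chain.")
    return value


# Demo with a placeholder mapping; in the notebook, call require_env("OPENAI_API_KEY").
demo_env = {"OPENAI_API_KEY": "placeholder-key"}
api_key = require_env("OPENAI_API_KEY", demo_env)
print("Key found, length:", len(api_key))
```

Failing early like this gives a clearer error than a failed API call partway through the chain.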

---


## #4 - Utilizing the S3 Directory Loader (with OpenAI API for Document Summary)

In [6]:
from langchain_community.document_loaders.s3_directory import S3DirectoryLoader

from langchain_community.chat_models import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableLambda
import os

# Set your OpenAI API Key here
#os.environ['OPENAI_API_KEY'] = 'your-api-key'

# MinIO Configuration
endpoint = 'play.min.io:9000'
access_key = 'minioadmin'
secret_key = 'minioadmin'
use_ssl = True


# Initializing the S3DirectoryLoader
directory_loader = S3DirectoryLoader(
    bucket='web-documentation', 
    prefix='',  # Adjust the prefix as needed
    endpoint_url=f'http{"s" if use_ssl else ""}://{endpoint}',
    aws_access_key_id=access_key, 
    aws_secret_access_key=secret_key, 
    use_ssl=use_ssl
)

# Define LLM Setup
model = ChatOpenAI(temperature=0, model="gpt-4-1106-preview")

# Define the structured prompt template for summarization
structured_prompt_template = """
Summarize the following document '{document_name}':
{context}

Please provide the summary and key points.
"""
prompt = ChatPromptTemplate.from_template(structured_prompt_template)

# Load documents from the directory
loaded_documents = directory_loader.load()

# Initialize structured output
structured_output = ""

# Check if documents are loaded
if loaded_documents:
    for index, doc in enumerate(loaded_documents):
        document_name = f"Document {index + 1} - {doc.metadata.get('name', 'Unknown Document')}"
        document_text = doc.page_content

        # Define the Chain for each document
        chain = (
            RunnableLambda(lambda x: {"document_name": document_name, "context": document_text})
            | prompt
            | model
            | StrOutputParser()
        )

        # Execute Chain Synchronously for each document
        summary = chain.invoke(None)
        structured_output += f"\n{document_name}\n{summary}\n"

    print("Structured Summaries:", structured_output)
else:
    print("No documents loaded.")

Structured Summaries: 
Document 1 - Unknown Document
Document 1 appears to be a comprehensive Quickstart Guide for MinIO, a high-performance object storage system that is compatible with the Amazon S3 API. It is released under the GNU Affero General Public License v3.0 and is suitable for machine learning, analytics, and application data workloads.

The guide includes instructions for running MinIO on various platforms, including bare metal, container installations, Kubernetes, macOS, GNU/Linux, and Microsoft Windows. It emphasizes that standalone MinIO servers are ideal for development and evaluation, but for production environments, MinIO should be deployed with Erasure Coding and a minimum of 4 drives per server.

For container installations, it provides commands to run MinIO using `podman` or `docker`. For macOS, it suggests using Homebrew for installation or downloading the binary directly. For GNU/Linux and Windows, it provides download links for the respective binaries and instr

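The above output (truncated here) summarizes each document with an OpenAI API call using prompt templating. The label reads "Unknown Document" because the code looks up a `name` key in the metadata, while the loader records the file location under `source`.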
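One way to get a more informative label is to derive it from the `source` entry. The sketch below assumes the metadata contains a value of the form `s3://bucket/key`; the helper `document_label` is hypothetical and falls back to "Unknown Document" when no source is present.

```python
def document_label(index, metadata):
    """Build a label such as 'Document 1 - test-file-1.md' from loader metadata."""
    source = metadata.get("source", "")
    # Keep only the part after the last slash; an empty result means no usable source.
    name = source.rsplit("/", 1)[-1] or "Unknown Document"
    return f"Document {index + 1} - {name}"


print(document_label(0, {"source": "s3://web-documentation/test-file-1.md"}))
print(document_label(1, {}))
```

In the loop above, this would replace the `doc.metadata.get('name', ...)` lookup by passing `doc.metadata` to the helper.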