# Module 2, Activity 3: Ingesting Other Data Formats

Until this point we have worked strictly with text (including CSV files that we read in as plain text). As you begin building RAG applications, however, you will want to ingest many other data formats. This notebook quickly walks you through two common ones, PDF and Excel, for inclusion in your vector store. We start by installing a few packages that are not part of our SageMaker environments.

In [8]:
!pip install pypdf
!pip install openpyxl

In [9]:
import boto3
from openpyxl import load_workbook
import os
import pandas as pd
import tempfile

from langchain_community.document_loaders import PyPDFLoader
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

## Some helper functions

We have been getting raw text from S3 with the `get_data_from_s3` helper function. Now we will add a few more helper functions for dealing with PDFs and Excel files. These functions fetch the files from S3 and convert them into a list in which each entry is a LangChain `Document`, the format that LangChain's text splitters and vector stores expect.

In [4]:
def get_data_from_s3(bucket_name, key):
    s3 = boto3.client(
        's3',
        region_name="us-west-2",
    )
    response = s3.get_object(Bucket=bucket_name, Key=key)
    data = response['Body'].read().decode('utf-8')

    return data

In [5]:
def get_pdf_docs_from_s3(bucket_name, key):
    s3 = boto3.client('s3', region_name="us-west-2")
    response = s3.get_object(Bucket=bucket_name, Key=key)
    pdf_bytes = response['Body'].read()

    # PyPDFLoader expects a file path, so write the PDF bytes to a temporary file.
    # delete=False keeps the file on disk after the `with` block closes it.
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp_file:
        tmp_file.write(pdf_bytes)
        tmp_path = tmp_file.name

    try:
        # Load one Document per page from the temporary file
        documents = PyPDFLoader(tmp_path).load()
    finally:
        os.remove(tmp_path)  # Clean up so temporary files do not accumulate

    return documents

In [None]:
def get_excel_from_s3(bucket_name, s3_key):
    """
    Loads an Excel file from S3, parses each sheet with pandas,
    and returns one LangChain Document per row.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        local_path = os.path.join(tmpdir, 'file.xlsx')
        s3 = boto3.client('s3')
        s3.download_file(bucket_name, s3_key, local_path)

        # Load all sheets
        dfs = pd.read_excel(local_path, sheet_name=None)

        documents = []

        for sheet_name, df in dfs.items():
            df = df.fillna("")  # Optional: handle missing values

            for idx, row in df.iterrows():
                content = row.to_json(force_ascii=False, indent=2)
                metadata = {
                    "sheet_name": sheet_name,
                    "row_index": idx,
                    "columns": list(df.columns)
                }
                documents.append(Document(page_content=content, metadata=metadata))

        return documents

In [None]:
session = boto3.session.Session()
region = session.region_name
bedrock_runtime = boto3.client("bedrock-runtime", region_name='us-west-2')

We can now run these cells to convert a PDF file to the LangChain `Document` format.

In [6]:
documents = get_pdf_docs_from_s3("dpgenaitraining", "BILL-Q2-25-Press-Release-2-6-25.pdf")
documents[0:3]

[Document(metadata={'producer': 'Wdesk Fidelity Content Translations Version 011.001.078', 'creator': 'Workiva', 'creationdate': '2025-02-06T19:54:35+00:00', 'moddate': '2025-02-06T19:54:35+00:00', 'title': 'Bill-2024.12.31-EX-99.1', 'author': 'anonymous', 'source': '/var/folders/hk/x5jlwc1s6dx_w4j79wt6vr2w0000gp/T/tmpg1egmilf.pdf', 'total_pages': 14, 'page': 0, 'page_label': '1'}, page_content='BILL Reports Second Quarter Fiscal Year 2025 Financial Results\n• Q2 Core Revenue Increased 16% Year-Over-Year\n• Q2 Total Revenue Increased 14% Year-Over-Year\nSAN JOSE, Calif.--(BUSINESS WIRE) – February 6, 2025 – BILL (NYSE: BILL), a leading financial operations platform for small \nand midsize businesses (SMBs), today announced financial results for the second fiscal quarter ended December 31, 2024.\n“We delivered strong financial results and innovated at a rapid pace as we executed on our vision to be the de facto intelligent \nfinancial operations platform for SMBs,” said René Lacerte, BI

These `Document`s can then be split into chunks for embedding or other downstream use.

In [7]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 100
)

chunks = text_splitter.split_documents(documents)
len(chunks)

41
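
To make `chunk_size` and `chunk_overlap` more concrete, here is a minimal sketch of fixed-window chunking with overlap. It is a simplification for illustration only: `window_chunks` is a hypothetical helper, not a LangChain function, and the real `RecursiveCharacterTextSplitter` cuts at separators, so its chunks are usually shorter than `chunk_size`.

```python
def window_chunks(text, chunk_size=1000, chunk_overlap=100):
    """Cut text into fixed-size windows that share chunk_overlap characters.

    Simplified illustration (hypothetical helper): it ignores separators
    and always cuts at exact character offsets.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []

    step = chunk_size - chunk_overlap  # How far each new window advances
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # This window reached the end of the text
        start += step
    return chunks


# Synthetic 2,500-character text, used only for demonstration
text = "".join(str(i % 10) for i in range(2500))
chunks = window_chunks(text, chunk_size=1000, chunk_overlap=100)
print(len(chunks), [len(c) for c in chunks])
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the end of one chunk is repeated at the start of the next. That repetition is what lets a sentence that straddles a boundary remain retrievable from either chunk.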

## Working with Excel files

Working with raw Excel files can be a bit tricky. In many cases you might be better off converting them to CSV files and reading them as text. However, the `get_excel_from_s3` helper takes a different approach: it uses pandas to read the workbook and then converts each sheet, row by row, into LangChain `Document`s.

In [None]:
documents = get_excel_from_s3("dpgenaitraining", "sample_excel_data.xlsx")
for doc in documents[:3]:
    print(f"\nFrom sheet: {doc.metadata['sheet_name']}, Row: {doc.metadata['row_index']}")
    print(doc.page_content)
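
For comparison, the CSV route mentioned above can be sketched with the standard library alone. This is an illustration under stated assumptions: `csv_text_to_records` and the sample data are hypothetical, and plain dictionaries stand in for LangChain `Document` objects so the snippet runs without extra packages.

```python
import csv
import io
import json


def csv_text_to_records(csv_text, source_name):
    """Turn CSV text into one record per row.

    Hypothetical helper: each record mirrors the page_content/metadata
    layout of a LangChain Document but is a plain dict.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row_index, row in enumerate(reader):
        records.append({
            "page_content": json.dumps(row, ensure_ascii=False, indent=2),
            "metadata": {
                "source": source_name,
                "row_index": row_index,
                "columns": list(row.keys()),
            },
        })
    return records


# Hypothetical sample data, not taken from sample_excel_data.xlsx
sample_csv = "region,product,units\nWest,Widget,120\nEast,Gadget,75\n"
records = csv_text_to_records(sample_csv, "sample.csv")
print(records[0]["page_content"])
```

In the notebook, the CSV text could come from `get_data_from_s3`, and each record could be wrapped as `Document(page_content=..., metadata=...)`. Note that the `csv` module returns every value as a string, whereas pandas infers numeric and date types when it reads a sheet.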

## Concluding thoughts

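It is worth experimenting with the splitting parameters, especially when creating embeddings from tabular data. Recall that `RecursiveCharacterTextSplitter` tries its separators in order, by default `["\n\n", "\n", " ", ""]`, so you might find that your tables are split in some unusual places. For example, each row `Document` produced by `get_excel_from_s3` is indented JSON with one field per line, so a row longer than `chunk_size` will be cut at a newline between fields.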
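
The effect on tabular data can be seen with a simplified sketch of recursive splitting. `recursive_split` is a hypothetical, stripped-down illustration (no overlap, no length function), not LangChain's implementation, and the sample row is made up; its layout is only similar to what `row.to_json(indent=2)` produces.

```python
import json


def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified recursive splitting: no overlap, illustration only."""
    if len(text) <= chunk_size:
        return [text]

    # Pick the first separator that occurs in the text.
    for i, sep in enumerate(separators):
        if sep == "":
            # Last resort: hard cuts every chunk_size characters.
            return [text[j:j + chunk_size] for j in range(0, len(text), chunk_size)]
        if sep in text:
            break
    else:
        return [text]  # No usable separator: return the oversized text as-is.

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # Piece still fits in the current chunk.
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # A single piece is too long: retry with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, separators[i + 1:]))
    if current:
        chunks.append(current)
    return chunks


# Hypothetical row, laid out one field per line like an indented JSON record
row = json.dumps(
    {"region": "West", "product": "Widget", "units": 120, "revenue": 4500.0},
    indent=2,
)
for chunk in recursive_split(row, chunk_size=40):
    print(repr(chunk))
```

With a `chunk_size` smaller than the row, the record is cut between fields, so no single chunk contains the whole row. Choosing a `chunk_size` at least as large as your longest row keeps each row intact.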