# Amazon Textract LangChain Document Loader

Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from scanned documents. It goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables. Today, many companies extract data from scanned documents such as PDFs, images, tables, and forms by hand, or use simple OCR software that requires manual configuration (which often must be updated when the form changes). To overcome these manual and expensive processes, Textract uses ML to read and process any type of document, accurately extracting text, handwriting, tables, and other data with no manual effort. You can quickly automate document processing and act on the extracted information, whether you’re automating loan processing or extracting information from invoices and receipts. Textract can extract the data in minutes instead of hours or days.

This sample demonstrates the use of Amazon Textract in combination with LangChain as a DocumentLoader.

Textract supports the PDF, TIFF, PNG, and JPEG formats.

Check https://docs.aws.amazon.com/textract/latest/dg/limits-document.html for supported document sizes, languages and characters.
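A quick pre-flight check on the file extension can catch unsupported inputs before a Textract call is made. The following is a minimal sketch, not part of the loader: the helper name `has_supported_extension` and the extension list (including the `.tif` and `.jpg` spellings) are assumptions made for illustration, and the extension alone does not guarantee that the file content is valid.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumed mapping of the four supported formats to common file extensions.
SUPPORTED_EXTENSIONS = {".pdf", ".tif", ".tiff", ".png", ".jpg", ".jpeg"}


def has_supported_extension(path: str) -> bool:
    """Return True if a local path, HTTP(S) URL, or s3:// URI ends in a supported extension."""
    # urlparse keeps the plain file name in .path for local files
    # and strips the scheme and host (or bucket) for URLs and S3 URIs.
    suffix = PurePosixPath(urlparse(path).path).suffix.lower()
    return suffix in SUPPORTED_EXTENSIONS
```

For example, `has_supported_extension("w2-example.pdf")` passes, while a `.docx` file would be rejected before any request is sent.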

In [1]:
%pip -q install langchain-0.0.242-py3-none-any.whl --force-reinstall

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
awscli 1.27.157 requires botocore==1.29.157, but you have botocore 1.31.11 which is incompatible.
awscli 1.27.157 requires PyYAML<5.5,>=3.10, but you have pyyaml 6.0.1 which is incompatible.
hdijupyterutils 0.20.5 requires pandas<2.0.0,>=0.17.1, but you have pandas 2.0.1 which is incompatible.
numba 0.56.4 requires numpy<1.24,>=1.18, but you have numpy 1.25.1 which is incompatible.
sagemaker 2.167.0 requires PyYAML==6.0, but you have pyyaml 6.0.1 which is incompatible.
sparkmagic 0.20.5 requires nest-asyncio==1.5.5, but you have nest-asyncio 1.5.6 which is incompatible.
sparkmagic 0.20.5 requires pandas<2.0.0,>=0.17.1, but you have pandas 2.0.1 which is incompatible.
sphinx 7.0.0 requires docutils<0.20,>=0.18.1, but you have docutils 0.16 which is incompatible.
Note: you may need to restart the k

In [2]:
%pip install -q boto3 openai tiktoken python-dotenv

Note: you may need to restart the kernel to use updated packages.


## Sample 1

The first example uses a local file, which the loader sends to the Amazon Textract synchronous API.

Local files and URL endpoints such as HTTP:// are limited to single-page documents.
Multi-page documents have to reside on S3.
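The rule above can be expressed as a small routing check. This is an illustrative sketch only: `classify_source` and `supports_multipage` are hypothetical helper names and do not reflect how the loader is implemented internally.

```python
from urllib.parse import urlparse


def classify_source(path: str) -> str:
    """Classify a document path as 's3', 'http', or 'local' based on its URL scheme."""
    scheme = urlparse(path).scheme.lower()
    if scheme == "s3":
        return "s3"
    if scheme in ("http", "https"):
        return "http"
    # No recognised scheme: treat the path as a local file.
    return "local"


def supports_multipage(path: str) -> bool:
    """Only documents stored on S3 may have more than one page."""
    return classify_source(path) == "s3"
```

A caller could use `supports_multipage(path)` to decide whether a multi-page PDF first needs to be uploaded to S3.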

In [3]:
from langchain.document_loaders import AmazonTextractPDFLoader
loader = AmazonTextractPDFLoader("w2-example.pdf")
documents = loader.load()

Output from the file

In [4]:
documents

[Document(page_content="EXAMPLE W-2 22222 a Employee's social security number 123-45-6789 OMB No. 1545-0008 b Employer identification number (EIN) 1 Wages, tips, other compensation 2 Federal income tax withheld 11-1111112 98,500.00 15,000.00 c Employer's name, address, and ZIP code 3 Social security wages 4 Social security tax withheld TOPCOMPANY 100,000.00 5000.00 5 Medicare wages and tips 6 Medicare tax withheld 123 STREET RD 100,000.00 1100.00 7 Social security tips 8 Allocated tips ANYWHERE, USA, 12345 d Control number 9 10 Dependent care benefits e Employee's first name and initial Last name Suff. 11 Nonqualified plans 12a D 1,500.00 TOM T. TAXPAYER 13 Statutory Retirement Three party 12b employee plan sick pity 456 ROAD ST X DD 1,000.00 14 Other 12c ANYWHERE, USA, 12345 12d f Employee's address and ZIP code 15 State Employer's state ID number 16 State wages, tips, etc. 17 State income tax 18 Local wages, tips, etc. 19 Local income tax 20 Locality name US 1234 100,000 3000 100,000

## Sample 2
The next sample loads a file from an HTTPS endpoint.
The document has to be a single page, because Amazon Textract requires all multi-page documents to be stored on S3.

In [5]:
# loader = AmazonTextractPDFLoader("s3://amazon-textract-public-content/langchain/alejandro_rosalez_sample_1.jpg")
from langchain.document_loaders import AmazonTextractPDFLoader
loader = AmazonTextractPDFLoader("https://amazon-textract-public-content.s3.us-east-2.amazonaws.com/langchain/alejandro_rosalez_sample_1.jpg")
documents = loader.load()

In [6]:
documents

[Document(page_content='Patient Information First Name: ALE JANDRO Last Name: ROSALEZ Date of Birth: Sex: M Marital Status: MARRIED Email Address: Address: 123 ANY STREET City: ANYTOWN CA Zip Code: 12345 Phone: 646-555-0111 Emergency Contact 1: CARLOS Last Name: SALAZAR First Name: Phone: 212 - 555 - 0150 Relationship to Patient: BROTHER Emergency Contact 2: First Name: JANE Last Name: DOE Phone: 650-555-0123 Relationship to Patient: FRIEND Did you feel fever or feverish lately? Yes No Are you having shortness of breath? Yes No Do you have a cough? Yes No Did you experience loss of taste or smell? Yes No Where you in contact with any confirmed COVID-19 positive patients? Yes No Did you travel in the past 14 days to any regions affected by COVID-19? Yes No Patient Information First Name: ALE JANDRO Last Name: ROSALEZ Date of Birth: Sex: M Marital Status: MARRIED Email Address: Address: 123 ANY STREET City: ANYTOWN CA Zip Code: 12345 Phone: 646-555-0111 Emergency Contact 1: CARLOS Last N

You can use the loaded documents in a chain. An example with OpenAI follows Sample 3.

## Sample 3

This sample processes a multi-page document from S3. The sample document resides in a bucket in us-east-2, and Textract must be called in that same region for the request to succeed.
For this sample to work, either run your SageMaker Notebook in us-east-2 or, when running in a different environment, pass in a boto3 Textract client created with that region name.

In [7]:
import boto3
textract_client = boto3.client('textract', region_name='us-east-2')

file_path = "s3://amazon-textract-public-content/langchain/layout-parser-paper.pdf"
loader = AmazonTextractPDFLoader(file_path, boto3_textract_client=textract_client)
documents = loader.load()

Get the number of pages to validate the response (printing the full response would be quite long).

In [12]:
len(documents)

16
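Beyond counting pages, a short per-page preview can confirm that text was extracted without printing the full response. The sketch below assumes the loader returns one `Document` per page, as the page count above suggests, and relies only on the `page_content` attribute; `preview_pages` is a hypothetical helper name.

```python
def preview_pages(documents, max_chars=80):
    """Return one summary line per page: page number, character count, and a short snippet."""
    lines = []
    for number, doc in enumerate(documents, start=1):
        text = doc.page_content
        # Flatten newlines so each page fits on a single output line.
        snippet = text[:max_chars].replace("\n", " ")
        lines.append(f"page {number}: {len(text)} chars | {snippet}")
    return lines
```

Calling `print("\n".join(preview_pages(documents)))` then gives a compact overview of all pages.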

## Using the AmazonTextractPDFLoader in a LangChain chain (e.g. OpenAI)

The AmazonTextractPDFLoader can be used in a chain the same way the other loaders are used.
Textract also has a [Query feature](https://docs.aws.amazon.com/textract/latest/dg/API_Query.html) of its own, which offers functionality similar to the QA chain in this sample and is worth checking out as well.
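For comparison, here is a rough sketch of how the Textract Query feature could be called directly through boto3's `analyze_document` with the `QUERIES` feature type. The helper names `build_query_request` and `extract_answers` are hypothetical, and the synchronous call shown in the trailing comments applies to single-page documents such as the W-2 from Sample 1; treat it as a starting point under those assumptions rather than a definitive implementation.

```python
def build_query_request(document_bytes, questions):
    """Build keyword arguments for an AnalyzeDocument call that uses the QUERIES feature."""
    return {
        "Document": {"Bytes": document_bytes},
        "FeatureTypes": ["QUERIES"],
        "QueriesConfig": {"Queries": [{"Text": q} for q in questions]},
    }


def extract_answers(response):
    """Map each query text to the list of answer texts found in the response blocks."""
    blocks = {block["Id"]: block for block in response.get("Blocks", [])}
    answers = {}
    for block in blocks.values():
        if block.get("BlockType") != "QUERY":
            continue
        # A QUERY block points at its QUERY_RESULT blocks through ANSWER relationships.
        answer_ids = [
            block_id
            for relationship in block.get("Relationships", [])
            if relationship.get("Type") == "ANSWER"
            for block_id in relationship.get("Ids", [])
        ]
        answers[block["Query"]["Text"]] = [
            blocks[block_id].get("Text", "") for block_id in answer_ids if block_id in blocks
        ]
    return answers


# Usage sketch (requires AWS credentials, so it is left commented out):
# import boto3
# client = boto3.client("textract", region_name="us-east-2")
# with open("w2-example.pdf", "rb") as f:
#     request = build_query_request(f.read(), ["What is the employee name?"])
# print(extract_answers(client.analyze_document(**request)))
```

Unlike the QA chain, this approach needs no separate LLM, but each question is answered from a single document rather than through a chain over all loaded pages.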

In [10]:
# You can store your OPENAI_API_KEY in a .env file as well
# import os 
# from dotenv import load_dotenv

# load_dotenv()

True

In [None]:
# Or set the OpenAI key in the environment directly
import os 
os.environ["OPENAI_API_KEY"] = "your-OpenAI-API-key"

In [11]:
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain

chain = load_qa_chain(llm=OpenAI(), chain_type="map_reduce")
# query = ["What is the employee name?", "What is the SSN?"]
query = ["Who are the authors?"]

chain.run(input_documents=documents, question=query)

' The authors are Zejiang Shen, Ruochen Zhang, Melissa Dell, Benjamin Charles Germain Lee, Jacob Carlson, and Weining Li.'