# Amazon Textract LangChain Document Loader

This notebook uses the Layout feature of Amazon Textract, which is introduced in the AWS blog post [Amazon Textract’s new Layout feature introduces efficiencies in general purpose and generative AI document processing tasks](https://aws.amazon.com/blogs/machine-learning/amazon-textracts-new-layout-feature-introduces-efficiencies-in-general-purpose-and-generative-ai-document-processing-tasks/).

The Textract team also provides a document loader for LangChain that uses the Amazon Textract Layout feature; see the [AmazonTextractPDFParser](https://aws.amazon.com/blogs/machine-learning/amazon-textracts-new-layout-feature-introduces-efficiencies-in-general-purpose-and-generative-ai-document-processing-tasks/).

In [7]:
!pip install pypdf Pillow amazon-textract-caller==0.2.4 amazon-textract-textractor==1.8.3 --quiet

In [2]:
import sagemaker

# Get the SageMaker session and default bucket
sagemaker_session = sagemaker.Session()
default_bucket = sagemaker_session.default_bucket()

print(f"Default sagemaker bucket: {default_bucket}")

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml
Default sagemaker bucket: sagemaker-us-west-2-557690596620


In [3]:
# Copy the data folder with the sample PDF files to S3
!aws s3 sync ./data s3://{default_bucket}/data
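
The loader cell that follows passes an `s3://bucket/key` URL to Textract. As a minimal sketch, the URL can be checked before calling the service so that a malformed path fails early. `split_s3_url` and the bucket name `example-bucket` are hypothetical and are not part of LangChain or the Textract libraries.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_url(s3_url: str) -> Tuple[str, str]:
    """Split 's3://bucket/key' into (bucket, key).

    Hypothetical helper for illustration only; raises ValueError when the
    input is not an S3 object URL.
    """
    parsed = urlparse(s3_url)
    key = parsed.path.lstrip("/")
    if parsed.scheme != "s3" or not parsed.netloc or not key:
        raise ValueError(f"Not a valid S3 object URL: {s3_url!r}")
    return parsed.netloc, key


# 'example-bucket' is a placeholder; the notebook uses the SageMaker default bucket
bucket, key = split_s3_url("s3://example-bucket/data/paper-llm-training-sample.pdf")
print(bucket, key)
```

Passing a local path such as `data/paper-llm-training-sample.pdf` raises `ValueError`, which shows up sooner than an error returned by the Textract call.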

In [8]:
from langchain_community.document_loaders import AmazonTextractPDFLoader
from textractor.data.text_linearization_config import TextLinearizationConfig

document = "data/paper-llm-training-sample.pdf"
document_s3_url = f"s3://{default_bucket}/{document}"

linearization_config = TextLinearizationConfig(
    hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False,
    table_add_title_as_caption=True, table_linearization_format="markdown",
)

loader = AmazonTextractPDFLoader(
    document_s3_url,
    textract_features=["TABLES", "LAYOUT", "FORMS"],
    linearization_config=linearization_config,
)
documents = loader.load()
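
Before printing a full page, it can help to get an overview of what the loader returned. The sketch below assumes each returned item exposes `page_content` and a `metadata` dict, as LangChain `Document` objects do. The `page` metadata key is an assumption about what the loader populates, and `summarize_pages` is a hypothetical helper. Stand-in objects are used here so that the example runs without AWS access.

```python
from types import SimpleNamespace


def summarize_pages(docs, preview_chars=60):
    """Return one summary dict per document (hypothetical helper)."""
    rows = []
    for index, doc in enumerate(docs):
        text = doc.page_content
        rows.append({
            "index": index,
            # 'page' is an assumed metadata key; .get() returns None if absent
            "page": doc.metadata.get("page"),
            "chars": len(text),
            # Collapse whitespace so the preview fits on one line
            "preview": " ".join(text.split())[:preview_chars],
        })
    return rows


# Stand-ins for the objects returned by loader.load()
sample_docs = [
    SimpleNamespace(page_content="# Title\n\nFirst page text.", metadata={"page": 1}),
    SimpleNamespace(page_content="4 TRAINING PROCESS\n\n4.1 Training Curves", metadata={"page": 2}),
]

for row in summarize_pages(sample_docs):
    print(row)
```

In the notebook, calling `summarize_pages(documents)` on the loader result should give a quick view of how much text each page produced.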

In [9]:
from IPython.display import IFrame
IFrame(document, width=400, height=600)

In [10]:
# show the linearized layout-aware text of one page
print(documents[3].page_content)

Haozheng Fan, Hao Zhou, Guangtai Huang, Parameswaran Raman, Xinwei Fu, Gaurav Gupta, Dhananjay Ram, Yida Wang, and Jun Huan 

reduce training latency with asynchronous execution. This overlaps some executions of accelerators and host (CPU). 

4 TRAINING PROCESS 

4.1 Training Curves 

During the training process, we monitor the training loss, as well as l2 norm of gradients and l2 norm of parameters for debugging training stability. Figure 1a shows the training loss over global batches, reduced over all data parallel ranks. The training loss decreases fast for the initial ~250B tokens, and enters a log-linear decrease afterwards. Similar trends are observed in other LLM training [18,51,52]. 

In Figure 1b, we show the gradient l2 norm during the training. Overall, we see that the gradient norm is stable across the training journey without divergence. Note that gradient spikes are immi- nent in LLM pre-training when using layer-normalization, or even RMSNorm [50], and some times due to 