# Document Processing Workflow

This notebook walks through a document processing workflow: upload a document to S3, extract its text with Amazon Textract, and save the extracted text back to S3 for further processing.

In [1]:
import boto3
import json
import time

# Initialize AWS clients
# sts = boto3.client('sts')
# print(sts.get_caller_identity())  # Verify which identity you're using
s3_client = boto3.client('s3')
textract_client = boto3.client('textract')

# Sanity check: confirm the credentials can list the target bucket
try:
    s3_client.list_objects_v2(Bucket='genaiprojectawsbucket')
    print("Access successful!")
except Exception as e:
    print(f"Error: {e}")

Access successful!


## Step 1: Upload Document to S3

Upload the document you want to process to an S3 bucket. Textract's asynchronous API reads its input from S3, so the file must be there before the job is started.

In [2]:
# Define S3 bucket and document name
bucket_name = 'genaiprojectawsbucket'
# Local file path
local_file = 'D:/Universty Files/GenAI_Project/document-qa-aws-project/documents/doc1.pdf'
# S3 key (just use the filename)
s3_key = 'doc1.pdf'

# Upload document to S3
s3_client.upload_file(local_file, bucket_name, s3_key)
print(f'Document {local_file} uploaded to S3 bucket {bucket_name} with key {s3_key}')

Document D:/Universty Files/GenAI_Project/document-qa-aws-project/documents/doc1.pdf uploaded to S3 bucket genaiprojectawsbucket with key doc1.pdf
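
The key naming used in this notebook (the bare filename for the uploaded document, and `doc1_text.txt` for the extracted text in Step 4) can be derived from the local path instead of being typed by hand. The following is a minimal sketch; `s3_key_for` and `text_key_for` are hypothetical helper names that are not part of the original notebook, and the sketch assumes flat keys with no prefix.

```python
from pathlib import PurePosixPath, PureWindowsPath


def s3_key_for(local_path: str) -> str:
    """Return the bare filename of a local path, for use as the S3 key."""
    # PureWindowsPath understands drive letters and both '/' and '\\' separators.
    return PureWindowsPath(local_path).name


def text_key_for(s3_key: str) -> str:
    """Return the key for the extracted text, e.g. 'doc1.pdf' -> 'doc1_text.txt'.

    Assumption: keys are flat; any prefix in s3_key is dropped.
    """
    return f"{PurePosixPath(s3_key).stem}_text.txt"
```

With these helpers, `s3_key = s3_key_for(local_file)` would yield the same `doc1.pdf` key that is hard-coded above.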


## Step 2: Extract Text from Document using Textract

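Use Amazon Textract to extract text from the uploaded document. `start_document_text_detection` is asynchronous: it returns a `JobId` immediately, and the results are retrieved in a later call.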

In [3]:
import boto3
import botocore

# Print boto3 version
print(f"boto3 version: {boto3.__version__}")
print(f"botocore version: {botocore.__version__}")

# Initialize clients with explicit region
region_name = "us-east-1"  # Replace with your region
textract_client = boto3.client('textract', region_name=region_name)
s3_client = boto3.client('s3', region_name=region_name)

# Check AWS identity
sts = boto3.client('sts', region_name=region_name)
try:
    identity = sts.get_caller_identity()
    print(f"AWS Identity: {identity['Arn']}")
    print(f"Using region: {region_name}")
except Exception as e:
    print(f"Identity error: {e}")

boto3 version: 1.37.9
botocore version: 1.37.9
AWS Identity: arn:aws:iam::438465158896:user/GenAIproject
Using region: us-east-1


In [4]:
try:
    response = textract_client.start_document_text_detection(
        DocumentLocation={
            'S3Object': {
                'Bucket': bucket_name,
                'Name': s3_key
            }
        }
    )
    job_id = response['JobId']
    print(f"Textract job started successfully with JobId: {job_id}")
except Exception as e:
    print(f"Error starting Textract job: {type(e).__name__}: {str(e)}")

Textract job started successfully with JobId: 2023bd0b2e5c5cbdff5753ecba1633caae2bbe3c435209f5c5f3f39e6a77fb75
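
A job started on an unsupported file only fails after the request has been made, so it can help to check the key before calling Textract. The sketch below builds the same `DocumentLocation` argument used in the cell above and rejects unexpected extensions. `build_text_detection_request` and `SUPPORTED_SUFFIXES` are hypothetical names, and the suffix list assumes the formats the Textract documentation lists for asynchronous text detection (PDF, TIFF, JPEG, PNG).

```python
from pathlib import PurePosixPath

# Assumption: formats accepted by asynchronous text detection.
SUPPORTED_SUFFIXES = {".pdf", ".tif", ".tiff", ".jpg", ".jpeg", ".png"}


def build_text_detection_request(bucket: str, key: str) -> dict:
    """Return keyword arguments for start_document_text_detection."""
    suffix = PurePosixPath(key).suffix.lower()
    if suffix not in SUPPORTED_SUFFIXES:
        raise ValueError(f"Unsupported document type {suffix!r} for key {key!r}")
    return {"DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": key}}}
```

It could then be used as `textract_client.start_document_text_detection(**build_text_detection_request(bucket_name, s3_key))`.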


## Step 3: Poll for Textract Job Completion

The job runs asynchronously, so we poll `get_document_text_detection` until it reports a terminal status before retrieving the results.

In [5]:
def check_textract_job(job_id):
    # Fetch the current job status together with the first page of results
    response = textract_client.get_document_text_detection(JobId=job_id)
    status = response['JobStatus']
    return status, response

# Poll until the job leaves IN_PROGRESS (SUCCEEDED, FAILED, or PARTIAL_SUCCESS)
while True:
    status, response = check_textract_job(job_id)
    print(f'Job status: {status}')
    if status != 'IN_PROGRESS':
        break
    time.sleep(5)

Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: SUCCEEDED
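
The polling loop above waits for as long as the job takes. One way to bound the wait is to add a deadline, sketched below. `wait_for_job` is a hypothetical helper, not part of the original notebook; it takes any zero-argument callable returning `(status, response)`, and the 600-second default timeout is an arbitrary assumption.

```python
import time
from typing import Callable, Tuple


def wait_for_job(
    fetch: Callable[[], Tuple[str, dict]],
    interval: float = 5.0,
    timeout: float = 600.0,  # assumption: arbitrary upper bound on the wait
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[str, dict]:
    """Call fetch() every `interval` seconds until the status is terminal."""
    deadline = clock() + timeout
    while True:
        status, response = fetch()
        if status != "IN_PROGRESS":
            return status, response
        # Give up if the next sleep would run past the deadline.
        if clock() + interval > deadline:
            raise TimeoutError(f"Job still IN_PROGRESS after {timeout} seconds")
        sleep(interval)
```

In this notebook it could be called as `status, response = wait_for_job(lambda: check_textract_job(job_id))`. The `sleep` and `clock` parameters exist only so the helper can be exercised without real waiting.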


## Step 4: Parse the Extracted Text

Once the job has succeeded, we collect the text of each `LINE` block from the response and save it to S3.

In [6]:
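# Collect the text of every LINE block and save it to S3
if status == 'SUCCEEDED':
    extracted_text = ''
    page = response
    while True:
        for item in page['Blocks']:
            if item['BlockType'] == 'LINE':
                extracted_text += item['Text'] + '\n'
        # Large results are paginated; follow NextToken until it is absent
        next_token = page.get('NextToken')
        if not next_token:
            break
        page = textract_client.get_document_text_detection(
            JobId=job_id, NextToken=next_token
        )

    # Save to S3
    text_key = 'doc1_text.txt'
    s3_client.put_object(
        Body=extracted_text.encode('utf-8'),
        Bucket=bucket_name,
        Key=text_key
    )
    print(f"Saved extracted text to S3: {text_key}")
else:
    print(f"Textract job ended with status {status}; no text was saved")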

Saved extracted text to S3: doc1_text.txt
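
The cell above flattens the whole document into a single string. If page boundaries matter for later steps, the `LINE` blocks can instead be grouped by the `Page` field that Textract attaches to each block. This is a minimal sketch; `lines_by_page` is a hypothetical helper, and it assumes a missing `Page` field means page 1.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def lines_by_page(blocks: Iterable[dict]) -> Dict[int, List[str]]:
    """Group the text of LINE blocks by page number, preserving block order."""
    pages: Dict[int, List[str]] = defaultdict(list)
    for block in blocks:
        if block.get("BlockType") == "LINE":
            # Assumption: blocks without a Page field belong to page 1.
            pages[block.get("Page", 1)].append(block["Text"])
    return dict(pages)
```

For a single response page, `lines_by_page(response['Blocks'])` returns a dictionary mapping page numbers to lists of line strings.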


## Conclusion

In this notebook, we uploaded a document to S3, extracted its text with Amazon Textract, and saved the extracted text back to S3. The next steps are to generate embeddings from that text and integrate this workflow into the overall Q&A system.