Getting-Automated/pandoc-lambda-python

Pandoc Lambda Function

Pandoc Lambda Function Tutorial

A serverless document conversion service using Pandoc running on AWS Lambda. This service can convert between various document formats including Markdown, HTML, PDF, and more. Perfect for document processing pipelines, RAG systems, and content conversion workflows.

🌟 Features

  • Runs Pandoc 3.6.4 in AWS Lambda using AWS's Python Lambda container image
  • PDF generation capability
  • CloudWatch logging enabled
  • Infrastructure as Code using CloudFormation
  • Local testing support
  • Configurable memory and timeout settings

🎯 Why Use This?

  • Ultra Cost-Effective: roughly $1.67/month for 10,000 conversions (2 GB, ~5 seconds each, before the Lambda free tier) vs $100-1,000 with pay-per-conversion SaaS APIs
  • Complete Control: Customize conversion parameters exactly to your needs
  • Serverless Architecture: No servers to manage, maintain, or monitor
  • Seamless Integration: Works with n8n, websites, apps, or any system that can make HTTP requests
  • Scale Automatically: Handles thousands of simultaneous conversions without configuration
  • Privacy Focused: Your documents never leave your AWS account
  • Deployment Flexibility: Docker-based approach for maximum compatibility
  • Maximum Reusability: Unlike embedded n8n-only solutions, this modular approach can be used with any system that can make HTTP requests

💰 Cost Breakdown

  • AWS Lambda Costs:

    • Free Tier: 1 million free requests + 400,000 GB-seconds/month
    • Beyond Free Tier: $0.20 per million requests + $0.0000166667 per GB-second
  • Real-World Example (10,000 documents/month with 2GB Lambda):

    • Request cost: 10,000 × $0.20/million = $0.002
    • Compute cost: 10,000 × 5 seconds × 2GB × $0.0000166667/GB-second = $1.67
    • Total: ~$1.67 per month
  • Compared to Commercial Services:

    • SaaS document conversion APIs: $10-100/month for similar volume
    • Pay-per-conversion APIs: $0.01-0.10 per conversion ($100-1,000 for 10,000 files)
    • Many services impose rate limits or queue processing at lower tiers
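
The arithmetic in the real-world example can be reproduced with a short script. This is a minimal sketch: the function name `monthly_cost` and the `apply_free_tier` flag are illustrative, and the prices are the ones quoted above, which may differ by region or change over time.

```python
# Pricing constants copied from the figures quoted in this README.
PRICE_PER_REQUEST = 0.20 / 1_000_000   # USD per request
PRICE_PER_GB_SECOND = 0.0000166667     # USD per GB-second
FREE_REQUESTS = 1_000_000              # free-tier requests per month
FREE_GB_SECONDS = 400_000              # free-tier GB-seconds per month


def monthly_cost(invocations, seconds_per_invocation, memory_gb, apply_free_tier=False):
    """Estimate the monthly Lambda bill in USD for a steady conversion workload."""
    requests = invocations
    gb_seconds = invocations * seconds_per_invocation * memory_gb
    if apply_free_tier:
        # The free tier is subtracted from usage, never below zero.
        requests = max(0, requests - FREE_REQUESTS)
        gb_seconds = max(0, gb_seconds - FREE_GB_SECONDS)
    return requests * PRICE_PER_REQUEST + gb_seconds * PRICE_PER_GB_SECOND


# 10,000 documents/month, 5 seconds each, 2 GB of memory
print(f"{monthly_cost(10_000, 5, 2):.2f}")                        # 1.67
print(f"{monthly_cost(10_000, 5, 2, apply_free_tier=True):.2f}")  # 0.00
```

At this volume the workload uses 100,000 GB-seconds, which is within the 400,000 GB-second free tier, so the second figure drops to zero.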

Detailed Cost Comparison

The AWS Lambda approach stands out as the most cost-effective option for document conversions:

| Service | Lowest Plan / Cost | Monthly Allotment (Approx) | Approx Cost for 1,000 Pages | Derived Cost per Page |
|---|---|---|---|---|
| AWS Lambda | Pay-per-use after free tier | N/A (pay for compute + requests) | ~$0.167 for 1,000 pages¹ | ~$0.000167/page |
| Zamzar | $9/mo (Developer) | ~3,000 conversions/month | $9 for up to 3,000 pages² | $0.003/page |
| CloudConvert | $9/mo (1,000 conversion mins) | ~1,000 pages (if ~1 min per page) | $9 | $0.009/page |
| DocConversionAPI | $9.99/mo (Basic) | 1,000 conversions/month | $9.99 | $0.00999/page |
| ConvertAPI | $9/mo (Basic: 1,500 sec) | ~1,500 pages (if ~1 sec per page) | $9 | $0.006/page |
| PDF.co | $39/mo | ~2,000 credits | $39 (covers ~1,000 pages) | ~$0.02/page |

¹ Based on 2 GB memory, ~5 seconds billed duration per invocation. Excludes the Lambda free tier, which can significantly reduce or eliminate costs for moderate usage.
² 100 conversions/day = ~3,000 conversions/month.

🏗️ Architecture

System Flow

graph TD
    A[Document Source] -->|Upload or URL| B[AWS Lambda Function]
    B -->|Process| C[Pandoc Conversion]
    C -->|Return| D[Converted Document]
    
    subgraph n8n["n8n Integration"]
        E[Read Binary File] -->|Document| F[Encode to Base64]
        G[URL to Document] -->|Specify URL| H[Create Request]
        F -->|Invoke| B
        H -->|Invoke| B
        D -->|Process| I[Handle Response]
        I -->|Save or Forward| J[Converted Document]
    end
    
    style A fill:#ff52b9,stroke:#333,stroke-width:2px
    style B fill:#3985ff,stroke:#333,stroke-width:2px
    style D fill:#137f13,stroke:#333,stroke-width:2px
    style n8n fill:#2A2A2A,stroke:#666,color:#fff

🚀 Prerequisites

  • AWS CLI installed and configured
  • Docker installed and running
  • Bash shell environment

📁 Project Structure

.
├── Dockerfile                # Container definition for Lambda
├── app.py                    # Lambda function handler
├── deploy.sh                 # Script to build and deploy the Lambda container
├── deploy-cloudformation.sh  # Script to deploy AWS infrastructure
├── template.yaml             # CloudFormation template
├── test-local.sh             # Script for local testing
├── test-lambda.sh            # Script for testing deployed Lambda
└── requirements.txt          # Python dependencies

⚙️ Configuration

The Lambda function is configured with:

  • Memory: 2048 MB
  • Timeout: 300 seconds (5 minutes)
  • Architecture: x86_64
  • Runtime: Container Image
  • Base Image: AWS Lambda Python

🚀 Deployment

  1. First, deploy the AWS infrastructure:
./deploy-cloudformation.sh

This creates:

  • ECR repository
  • IAM role with necessary permissions
  • CloudWatch Log Group
  2. Build and deploy the Lambda function:
./deploy.sh

This:

  • Builds the Docker container
  • Pushes it to ECR
  • Creates/updates the Lambda function

🔒 Setting Up IAM User for API Access

To use this Lambda function from external applications like n8n, you'll need to create an IAM user with appropriate permissions:

  1. Open the AWS Management Console and navigate to IAM
  2. Select "Users" and click "Add users"
  3. Enter a username (e.g., "pandoc-lambda-api")
  4. Select "Access key - Programmatic access" for AWS credential type
  5. Choose "Attach existing policies directly"
  6. Create a custom policy with the following JSON:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "lambda:InvokeFunction",
            "Resource": "arn:aws:lambda:*:*:function:pandoc-lambda-function"
        }
    ]
}
  7. Name the policy (e.g., "PandocLambdaInvoke") and attach it to the user
  8. Complete the user creation and securely store the Access Key ID and Secret Access Key

These credentials can be used in applications that need to invoke the Lambda function programmatically.
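
As an illustration of programmatic invocation, the sketch below wraps the call in a small helper. It assumes the `boto3` library and its Lambda client's `invoke` method. The helper name `invoke_pandoc` and the region in the comment are hypothetical. The client is passed in as an argument so the helper can be exercised without AWS access.

```python
import json


def invoke_pandoc(client, payload, function_name="pandoc-lambda-function"):
    """Synchronously invoke the conversion Lambda and return its parsed JSON result.

    `client` is expected to behave like boto3.client("lambda").
    """
    response = client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",  # wait for the conversion result
        Payload=json.dumps(payload).encode("utf-8"),
    )
    # boto3 exposes the function's output as a readable stream under "Payload".
    return json.loads(response["Payload"].read())


# Typical wiring (requires boto3 and the IAM user's credentials to be configured):
#   import boto3
#   client = boto3.client("lambda", region_name="us-east-1")  # region is an assumption
#   result = invoke_pandoc(client, {"input_format": "markdown", "output_format": "html",
#                                   "content": "<base64-encoded-content>", "options": []})
```

The policy shown above grants only `lambda:InvokeFunction`, which is the single permission this call needs.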

🔌 Integration with n8n

n8n is an open-source workflow automation platform that can be integrated with this Pandoc Lambda service for document processing workflows. Here's how to set it up:

Prerequisites

  • n8n installed and running
  • AWS IAM user credentials with Lambda invoke permissions (see previous section)

Setup Steps

  1. In your n8n workflow, add an "AWS Lambda" node
  2. Configure the node with:
    • Access Key ID and Secret Access Key from your IAM user
    • Region: The AWS region where your Lambda is deployed
    • Function: "pandoc-lambda-function"
    • JSON Payload: Format as described in the Usage section
    {
      "input_format": "markdown",
      "output_format": "pdf",
      "content": "{{$node["Previous_Node"].data.content_base64}}",
      "options": ["--pdf-engine=pdflatex"]
    }

Example n8n Workflow

You can create workflows like:

  1. Content Management System to PDF: Convert CMS content to PDF for archiving
  2. Document Format Converter: Allow users to upload documents and convert to different formats
  3. Markdown to HTML Email: Convert markdown content to HTML for email newsletters
  4. RAG Pipeline Preprocessing: Convert various document formats to plain text for embedding into vector databases

RAG Integration Example

graph TD
    A[Document Sources] -->|Upload| B[n8n Workflow]
    B -->|Extract Content| C[Pandoc Lambda Function]
    C -->|Convert to Text| D[Text Processing]
    D -->|Clean & Chunk| E[Vector Embedding]
    E -->|Store| F[Vector Database]
    F -->|Query| G[RAG System]
    
    style A fill:#ff52b9,stroke:#333,stroke-width:2px
    style C fill:#3985ff,stroke:#333,stroke-width:2px
    style G fill:#137f13,stroke:#333,stroke-width:2px

The n8n node will receive the Lambda response in JSON format with the converted document content.

🧪 Local Testing

Test the function locally using:

./test-local.sh

This will:

  1. Build the container locally
  2. Run a test conversion (Markdown to PDF)
  3. Save the output to the output directory

📝 Usage

The Lambda function accepts JSON input with the following structure:

{
  "input_format": "markdown",
  "output_format": "pdf",
  "content": "<base64-encoded-content>",
  "options": ["--pdf-engine=pdflatex"]
}

Example formats supported:

  • markdown
  • html
  • pdf (output only; Pandoc does not read PDF)
  • docx
  • epub
  • latex
  • rst
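
The request structure above can also be assembled in Python rather than by hand. This is a minimal sketch using only the standard library. The helper name `build_payload` is illustrative and not part of the project.

```python
import base64
import json


def build_payload(content, input_format, output_format, options=None):
    """Return the request dict described above, base64-encoding `content`."""
    if isinstance(content, str):
        content = content.encode("utf-8")  # text is encoded as UTF-8 before base64
    return {
        "input_format": input_format,
        "output_format": output_format,
        "content": base64.b64encode(content).decode("ascii"),
        "options": list(options or []),
    }


payload = build_payload("# Hello World\n", "markdown", "html")
print(json.dumps(payload, indent=2))
```

Passing `bytes` instead of `str` skips the UTF-8 step, which is what binary inputs such as DOCX need.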

Example: Converting Markdown to Plain Text

Here's a complete example of converting a markdown file to plain text:

# Your markdown content
echo "# Hello World

This is **bold** and this is *italic*.

## Section 1
- List item 1
- List item 2

[A link](https://example.com)" | base64

# The above command outputs something like:
# IyBIZWxsbyBXb3JsZAoKVGhpcyBpcyAqKmJvbGQqKiBhbmQgdGhpcyBpcyAqaXRhbGljKi4KCiMjIFNlY3Rpb24gMQotIExpc3QgaXRlbSAxCi0gTGlzdCBpdGVtIDIKCltBIGxpbmtdKGh0dHBzOi8vZXhhbXBsZS5jb20p

# Use this base64 string in your JSON payload:
{
  "input_format": "markdown",
  "output_format": "plain",
  "content": "IyBIZWxsbyBXb3JsZAoKVGhpcyBpcyAqKmJvbGQqKiBhbmQgdGhpcyBpcyAqaXRhbGljKi4KCiMjIFNlY3Rpb24gMQotIExpc3QgaXRlbSAxCi0gTGlzdCBpdGVtIDIKCltBIGxpbmtdKGh0dHBzOi8vZXhhbXBsZS5jb20p",
  "options": ["--wrap=none"]
}

The response will contain base64-encoded plain text with all markdown formatting removed.

To test this conversion locally, you can use:

echo "# Hello World..." | base64 > test.txt
curl -X POST http://localhost:9000/2015-03-31/functions/function/invocations \
  -H "Content-Type: application/json" \
  -d @- << EOF
{
  "input_format": "markdown",
  "output_format": "plain",
  "content": "$(cat test.txt)",
  "options": ["--wrap=none"]
}
EOF
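
The same local call can be made from Python with the standard library. This sketch assumes the container started by test-local.sh is listening on port 9000, as in the curl command above. The helper names `make_request` and `invoke_local` are illustrative.

```python
import json
import urllib.request

# Local invocation endpoint, copied from the curl command above.
LOCAL_URL = "http://localhost:9000/2015-03-31/functions/function/invocations"


def make_request(payload, url=LOCAL_URL):
    """Build (but do not send) a POST request carrying the JSON payload."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def invoke_local(payload, url=LOCAL_URL, timeout=300):
    """Send the payload to the locally running container and parse the reply."""
    with urllib.request.urlopen(make_request(payload, url), timeout=timeout) as resp:
        return json.loads(resp.read())
```

Building the request separately from sending it makes the payload easy to inspect before the container is running.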

Example: Converting DOCX to Plain Text

Here's how to convert a Word document to plain text:

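# First, convert your DOCX file to base64 on a single line
# (GNU base64 wraps its output, and line breaks would invalidate the JSON string)
base64 < your-document.docx | tr -d '\n' > document.b64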

# The JSON payload should look like this:
{
  "input_format": "docx",
  "output_format": "plain",
  "content": "$(cat document.b64)",
  "options": ["--wrap=none"]
}

To test this using curl:

# Using an existing DOCX file
curl -X POST http://localhost:9000/2015-03-31/functions/function/invocations \
  -H "Content-Type: application/json" \
  -d @- << EOF
{
  "input_format": "docx",
  "output_format": "plain",
  "content": "$(cat document.b64)",
  "options": ["--wrap=none", "--extract-media=."]
}
EOF

# The response will be base64-encoded plain text that you can decode:
echo "<response-content>" | base64 -d > output.txt

Note:

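  • For DOC files (older Word format), this project specifies input_format: "doc". Pandoc has no built-in reader for the legacy binary DOC format, so if that fails, save the file as DOCX first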
  • The --extract-media=. option will extract any embedded images (though they won't be included in the plain text output)
  • Binary files like DOCX must be base64-encoded from the file itself (as with base64 your-document.docx), not by piping text through echo as in the Markdown example
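
For binary inputs, reading the file as raw bytes in Python avoids both the text-encoding and the line-wrapping pitfalls. This is a minimal sketch. The helper names `encode_file` and `docx_payload` are illustrative, and the field values mirror the DOCX payload above.

```python
import base64


def encode_file(path):
    """Read a file as raw bytes and return single-line base64 text."""
    with open(path, "rb") as fh:
        # b64encode never inserts line breaks, so the result is safe inside a JSON string.
        return base64.b64encode(fh.read()).decode("ascii")


def docx_payload(path, output_format="plain"):
    """Build the request for a DOCX file; field names follow the payloads above."""
    return {
        "input_format": "docx",
        "output_format": output_format,
        "content": encode_file(path),
        "options": ["--wrap=none"],
    }
```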

Response Format

The Lambda function returns a JSON response with the following structure:

{
  "statusCode": 200,
  "body": {
    "content": "<result-content>",
    "format": "pdf",
    "contentType": "application/pdf",
    "encoding": "base64"  
  },
  "headers": {
    "Content-Type": "application/json"
  }
}
  • For text formats, the content field holds the result directly as a UTF-8 string
  • For binary formats, the content field is base64-encoded and the encoding field is set to "base64"
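
A caller can handle both cases with one small helper. The sketch below follows the response structure shown above. The name `extract_content` is illustrative, and accepting a JSON-string `body` is an assumption added for integrations that serialize it.

```python
import base64
import json


def extract_content(response):
    """Return the converted document as bytes, whatever its encoding."""
    body = response["body"]
    if isinstance(body, str):
        # Assumption: some callers receive the body as a JSON string.
        body = json.loads(body)
    content = body["content"]
    if body.get("encoding") == "base64":
        return base64.b64decode(content)  # binary formats such as PDF
    return content.encode("utf-8")        # text formats are returned as-is


text_response = {"statusCode": 200, "body": {"content": "Hello", "format": "plain"}}
with open("output.txt", "wb") as fh:
    fh.write(extract_content(text_response))
```

Returning bytes in both branches lets the caller write the result to disk the same way for PDF and plain text.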

📊 Monitoring

  • CloudWatch Logs are available at /aws/lambda/pandoc-lambda-function
  • Log retention is set to 30 days
  • Function metrics are available in CloudWatch Metrics

🛠️ Development

To modify the function configuration:

  1. Update memory/timeout in deploy.sh
  2. Update infrastructure in template.yaml
  3. Update container configuration in Dockerfile

🔒 Security

The function uses:

  • IAM role-based permissions
  • ECR image scanning
  • CloudWatch logging
  • AWS-managed encryption keys

❓ Troubleshooting

  1. Check CloudWatch Logs for function output
  2. Use test-local.sh for local debugging
  3. Verify ECR image push success
  4. Check IAM role permissions

🤝 Join the Getting Automated Community

Want to go deeper with automation and get direct support? Join our exclusive automation community!

What You Get from the Getting Automated Community:

  • In-depth Automation Workflows: Learn how to integrate AI into your automation processes
  • Battle-Tested Templates: Access exclusive, production-ready automation templates
  • Expert Guidance: Get direct support from automation professionals
  • Early Access to Content: Be the first to access exclusive content
  • Private Support Channels: Receive personalized support through direct chat and office hours
  • Community of Serious Builders: Connect with like-minded professionals

Join the Getting Automated Community

🔗 Additional Resources

Need Personalized Help?

If you need this solution built for you or want personalized guidance, you can schedule a consultation:

Schedule a 30-Minute Connect

📄 License

This project is licensed under the terms of the MIT license included in the LICENSE file.
