Pandoc Lambda Function

A serverless document conversion service using Pandoc running on AWS Lambda. This service can convert between various document formats including Markdown, HTML, PDF, and more. Perfect for document processing pipelines, RAG systems, and content conversion workflows.

🌟 Features

Runs Pandoc 3.6.4 in AWS Lambda using AWS's Python Lambda container image
PDF generation capability
CloudWatch logging enabled
Infrastructure as Code using CloudFormation
Local testing support
Configurable memory and timeout settings

🎯 Why Use This?

Ultra Cost-Effective: ~$1.67/month for 10,000 conversions (before the free tier) vs $100-1,000 with pay-per-conversion APIs
Complete Control: Customize conversion parameters exactly to your needs
Serverless Architecture: No servers to manage, maintain, or monitor
Seamless Integration: Works with n8n, websites, apps, or any system that can make HTTP requests
Scale Automatically: Lambda runs conversions in parallel up to your account's concurrency limit, with no scaling configuration
Privacy Focused: Your documents never leave your AWS account
Deployment Flexibility: Docker-based approach for maximum compatibility
Maximum Reusability: Unlike embedded n8n-only solutions, this standalone function can be reused across workflows, applications, and pipelines

💰 Cost Breakdown

AWS Lambda Costs:
- Free Tier: 1 million free requests + 400,000 GB-seconds/month
- Beyond Free Tier: $0.20 per million requests + $0.0000166667 per GB-second

Real-World Example (10,000 documents/month with 2 GB Lambda):
- Request cost: 10,000 × $0.20/million = $0.002
- Compute cost: 10,000 × 5 seconds × 2 GB × $0.0000166667/GB-second = $1.67
- Total: ~$1.67 per month

Compared to Commercial Services:
- SaaS document conversion APIs: $10-100/month for similar volume
- Pay-per-conversion APIs: $0.01-0.10 per conversion ($100-1,000 for 10,000 files)
- Many services impose rate limits or queue processing at lower tiers
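
The arithmetic in the real-world example can be reproduced with a short Python sketch. The prices are the ones quoted in this section. The function name and the 5-second billed duration are illustrative assumptions, and the free tier is deliberately ignored.

```python
# Prices quoted in the cost breakdown above (beyond the free tier).
PRICE_PER_MILLION_REQUESTS = 0.20
PRICE_PER_GB_SECOND = 0.0000166667


def monthly_lambda_cost(invocations, seconds_per_call, memory_gb):
    """Estimate the monthly Lambda bill in USD, ignoring the free tier."""
    request_cost = invocations / 1_000_000 * PRICE_PER_MILLION_REQUESTS
    gb_seconds = invocations * seconds_per_call * memory_gb
    compute_cost = gb_seconds * PRICE_PER_GB_SECOND
    return request_cost + compute_cost


# 10,000 conversions, ~5 s billed each, 2 GB of memory
print(f"${monthly_lambda_cost(10_000, 5, 2):.2f}")  # prints $1.67
```

At this volume the workload consumes 100,000 GB-seconds, which is below the 400,000 GB-second monthly free tier listed above.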

Detailed Cost Comparison

The AWS Lambda approach stands out as the most cost-effective option for document conversions:

Service	Lowest Plan / Cost	Monthly Allotment (Approx)	Approx Cost for 1,000 Pages	Derived Cost per Page
AWS Lambda	Pay-per-use after free tier	N/A (pay for compute + requests)	~$0.167 for 1,000 pages¹	~$0.000167/page
Zamzar	$9/mo (Developer)	~3,000 conversions/month	$9 for up to 3,000 pages²	$0.003/page
CloudConvert	$9/mo (1,000 conversion mins)	~1,000 pages (if ~1 min per page)	$9	$0.009/page
DocConversionAPI	$9.99/mo (Basic)	1,000 conversions/month	$9.99	$0.00999/page
ConvertAPI	$9/mo (Basic: 1,500 sec)	~1,500 pages (if ~1 sec per page)	$9	$0.006/page
PDF.co	$39/mo	~2,000 credits	$39 (covers ~1,000 pages)	~$0.039/page

¹ Based on 2 GB memory, ~5 seconds billed duration per invocation. Excludes the Lambda free tier, which can significantly reduce or eliminate costs for moderate usage.
² 100 conversions/day = ~3,000 conversions/month.

🏗️ Architecture

System Flow

graph TD
    A[Document Source] -->|Upload or URL| B[AWS Lambda Function]
    B -->|Process| C[Pandoc Conversion]
    C -->|Return| D[Converted Document]
    
    subgraph n8n["n8n Integration"]
        E[Read Binary File] -->|Document| F[Encode to Base64]
        G[URL to Document] -->|Specify URL| H[Create Request]
        F -->|Invoke| B
        H -->|Invoke| B
        D -->|Process| I[Handle Response]
        I -->|Save or Forward| J[Converted Document]
    end
    
    style A fill:#ff52b9,stroke:#333,stroke-width:2px
    style B fill:#3985ff,stroke:#333,stroke-width:2px
    style D fill:#137f13,stroke:#333,stroke-width:2px
    style n8n fill:#2A2A2A,stroke:#666,color:#fff

🚀 Prerequisites

AWS CLI installed and configured
Docker installed and running
Bash shell environment

📁 Project Structure

.
├── Dockerfile                # Container definition for Lambda
├── app.py                    # Lambda function handler
├── deploy.sh                 # Script to build and deploy the Lambda container
├── deploy-cloudformation.sh  # Script to deploy AWS infrastructure
├── template.yaml             # CloudFormation template
├── test-local.sh             # Script for local testing
├── test-lambda.sh            # Script for testing deployed Lambda
└── requirements.txt          # Python dependencies

⚙️ Configuration

The Lambda function is configured with:

Memory: 2048 MB
Timeout: 300 seconds (5 minutes)
Architecture: x86_64
Runtime: Container Image
Base Image: AWS Lambda Python

🚀 Deployment

First, deploy the AWS infrastructure:

./deploy-cloudformation.sh

This creates:

ECR repository
IAM role with necessary permissions
CloudWatch Log Group

Then build and deploy the Lambda function:

./deploy.sh

This:

Builds the Docker container
Pushes it to ECR
Creates/updates the Lambda function

🔒 Setting Up IAM User for API Access

To use this Lambda function from external applications like n8n, you'll need to create an IAM user with appropriate permissions:

Open the AWS Management Console and navigate to IAM
Select "Users" and click "Add users"
Enter a username (e.g., "pandoc-lambda-api")
Select "Access key - Programmatic access" for AWS credential type
Choose "Attach existing policies directly"
Create a custom policy with the following JSON:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "lambda:InvokeFunction",
            "Resource": "arn:aws:lambda:*:*:function:pandoc-lambda-function"
        }
    ]
}

Name the policy (e.g., "PandocLambdaInvoke") and attach it to the user
Complete the user creation and securely store the Access Key ID and Secret Access Key

These credentials can be used in applications that need to invoke the Lambda function programmatically.
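
As a sketch of programmatic invocation, the snippet below calls the function through the AWS SDK for Python (boto3). It assumes boto3 is installed, that the IAM user's credentials are available through the usual AWS configuration (environment variables or ~/.aws/credentials), and that the function is named pandoc-lambda-function as in the policy above. The helper names and the us-east-1 region are illustrative and not part of this repository.

```python
import json

FUNCTION_NAME = "pandoc-lambda-function"  # name used in the IAM policy above


def invoke_pandoc(lambda_client, payload):
    """Invoke the function synchronously and return the parsed JSON response.

    `lambda_client` is a boto3 Lambda client; `payload` is a dict in the
    format described in the Usage section.
    """
    result = lambda_client.invoke(
        FunctionName=FUNCTION_NAME,
        InvocationType="RequestResponse",  # wait for the conversion result
        Payload=json.dumps(payload).encode("utf-8"),
    )
    # boto3 returns the function's output as a streaming body.
    return json.loads(result["Payload"].read())


def main():
    import boto3  # third-party dependency, assumed installed

    client = boto3.client("lambda", region_name="us-east-1")  # assumed region
    response = invoke_pandoc(client, {
        "input_format": "markdown",
        "output_format": "html",
        "content": "IyBIZWxsbw==",  # base64 of "# Hello"
        "options": ["--wrap=none"],
    })
    print(response["statusCode"])
```

Calling main() requires valid credentials and a deployed function. Passing the client in as an argument keeps invoke_pandoc easy to exercise with a stub.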

🔌 Integration with n8n

n8n is an open-source workflow automation platform that can be integrated with this Pandoc Lambda service for document processing workflows. Here's how to set it up:

Prerequisites

n8n installed and running
AWS IAM user credentials with Lambda invoke permissions (see previous section)

Setup Steps

In your n8n workflow, add an "AWS Lambda" node
Configure the node with:
- Access Key ID and Secret Access Key from your IAM user
- Region: The AWS region where your Lambda is deployed
- Function: "pandoc-lambda-function"
- JSON Payload: Format as described in the Usage section
```
{
  "input_format": "markdown",
  "output_format": "pdf",
  "content": "{{$node["Previous_Node"].data.content_base64}}",
  "options": ["--pdf-engine=pdflatex"]
}
```

Example n8n Workflow

You can create workflows like:

Content Management System to PDF: Convert CMS content to PDF for archiving
Document Format Converter: Allow users to upload documents and convert to different formats
Markdown to HTML Email: Convert markdown content to HTML for email newsletters
RAG Pipeline Preprocessing: Convert various document formats to plain text for embedding into vector databases

RAG Integration Example

graph TD
    A[Document Sources] -->|Upload| B[n8n Workflow]
    B -->|Extract Content| C[Pandoc Lambda Function]
    C -->|Convert to Text| D[Text Processing]
    D -->|Clean & Chunk| E[Vector Embedding]
    E -->|Store| F[Vector Database]
    F -->|Query| G[RAG System]
    
    style A fill:#ff52b9,stroke:#333,stroke-width:2px
    style C fill:#3985ff,stroke:#333,stroke-width:2px
    style G fill:#137f13,stroke:#333,stroke-width:2px

The n8n node will receive the Lambda response in JSON format with the converted document content.

🧪 Local Testing

Test the function locally using:

./test-local.sh

This will:

Build the container locally
Run a test conversion (Markdown to PDF)
Save the output to the output directory

📝 Usage

The Lambda function accepts JSON input with the following structure:

{
  "input_format": "markdown",
  "output_format": "pdf",
  "content": "<base64-encoded-content>",
  "options": ["--pdf-engine=pdflatex"]
}

Example formats supported:

markdown
html
pdf (output only; Pandoc cannot read PDF as input)
docx
epub
latex
rst

Example: Converting Markdown to Plain Text

Here's a complete example of converting a markdown file to plain text:

# Your markdown content
echo "# Hello World

This is **bold** and this is *italic*.

## Section 1
- List item 1
- List item 2

[A link](https://example.com)" | base64

# The above command outputs something like:
# IyBIZWxsbyBXb3JsZAoKVGhpcyBpcyAqKmJvbGQqKiBhbmQgdGhpcyBpcyAqaXRhbGljKi4KCiMjIFNlY3Rpb24gMQotIExpc3QgaXRlbSAxCi0gTGlzdCBpdGVtIDIKCltBIGxpbmtdKGh0dHBzOi8vZXhhbXBsZS5jb20p

# Use this base64 string in your JSON payload:
{
  "input_format": "markdown",
  "output_format": "plain",
  "content": "IyBIZWxsbyBXb3JsZAoKVGhpcyBpcyAqKmJvbGQqKiBhbmQgdGhpcyBpcyAqaXRhbGljKi4KCiMjIFNlY3Rpb24gMQotIExpc3QgaXRlbSAxCi0gTGlzdCBpdGVtIDIKCltBIGxpbmtdKGh0dHBzOi8vZXhhbXBsZS5jb20p",
  "options": ["--wrap=none"]
}
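
The same payload can be built in Python instead of the shell. This is a sketch rather than code from this repository, and build_payload is a hypothetical helper. Note that echo appends a trailing newline, so the shell pipeline's actual output ends slightly differently from the string shown above, which encodes the text without one.

```python
import base64
import json

MARKDOWN = """# Hello World

This is **bold** and this is *italic*.

## Section 1
- List item 1
- List item 2

[A link](https://example.com)"""


def build_payload(text, input_format, output_format, options=None):
    """Base64-encode `text` and wrap it in the request structure."""
    encoded = base64.b64encode(text.encode("utf-8")).decode("ascii")
    return {
        "input_format": input_format,
        "output_format": output_format,
        "content": encoded,
        "options": options or [],
    }


payload = build_payload(MARKDOWN, "markdown", "plain", ["--wrap=none"])
print(json.dumps(payload, indent=2))
```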

The response contains the plain text with all Markdown formatting removed. Check the encoding field described under Response Format to see whether the content is base64-encoded.

To test this conversion locally, you can use:

echo "# Hello World..." | base64 > test.txt
curl -X POST http://localhost:9000/2015-03-31/functions/function/invocations \
  -H "Content-Type: application/json" \
  -d @- << EOF
{
  "input_format": "markdown",
  "output_format": "plain",
  "content": "$(cat test.txt)",
  "options": ["--wrap=none"]
}
EOF
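
The curl call above can also be expressed with Python's standard library. The sketch assumes the Lambda container is already running locally and listening on port 9000, as the curl example implies. build_request and send are hypothetical helper names.

```python
import base64
import json
import urllib.request

# Local invocation endpoint from the curl example above.
LOCAL_URL = "http://localhost:9000/2015-03-31/functions/function/invocations"


def build_request(markdown_text, url=LOCAL_URL):
    """Create a POST request that converts Markdown to plain text."""
    content = base64.b64encode(markdown_text.encode("utf-8")).decode("ascii")
    payload = {
        "input_format": "markdown",
        "output_format": "plain",
        "content": content,
        "options": ["--wrap=none"],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the container running, send(build_request("# Hello World")) returns the response dictionary. Building the request separately from sending it makes the payload easy to inspect first.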

Example: Converting DOCX to Plain Text

Here's how to convert a Word document to plain text:

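# First, convert your DOCX file to base64 on a single line
# (GNU base64 wraps output at 76 columns by default, which would break the JSON string)
base64 < your-document.docx | tr -d '\n' > document.b64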

# The JSON payload should look like this:
{
  "input_format": "docx",
  "output_format": "plain",
  "content": "$(cat document.b64)",
  "options": ["--wrap=none"]
}
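
The encoding step and the payload above can be combined in Python. This is a minimal sketch: encode_file and docx_payload are hypothetical helpers, and the path you pass in stands in for your own document.

```python
import base64
from pathlib import Path


def encode_file(path):
    """Return the base64 text of a file's raw bytes, with no line breaks."""
    return base64.b64encode(Path(path).read_bytes()).decode("ascii")


def docx_payload(path):
    """Build a DOCX-to-plain-text request for the given file."""
    return {
        "input_format": "docx",
        "output_format": "plain",
        "content": encode_file(path),
        "options": ["--wrap=none"],
    }
```

base64.b64encode never inserts line breaks, so the result can be placed in a JSON string as-is.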

To test this using curl:

# Using an existing DOCX file
curl -X POST http://localhost:9000/2015-03-31/functions/function/invocations \
  -H "Content-Type: application/json" \
  -d @- << EOF
{
  "input_format": "docx",
  "output_format": "plain",
  "content": "$(cat document.b64)",
  "options": ["--wrap=none", "--extract-media=."]
}
EOF

# If the response's "encoding" field is "base64", decode the content:
echo "<response-content>" | base64 -d > output.txt

Note:

Pandoc has no reader for legacy DOC files (older Word format), so convert them to DOCX first, for example with LibreOffice
The --extract-media=. option will extract any embedded images (though they won't be included in the plain text output)
Binary files like DOCX must be base64-encoded from the file's raw bytes, not from text piped through echo as in the Markdown example

Response Format

The Lambda function returns a JSON response with the following structure:

{
  "statusCode": 200,
  "body": {
    "content": "<result-content>",
    "format": "pdf",
    "contentType": "application/pdf",
    "encoding": "base64"  
  },
  "headers": {
    "Content-Type": "application/json"
  }
}

For text formats, the content is returned directly as a UTF-8 string.
For binary formats, the content is base64-encoded and the encoding field is set to "base64".
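
A client therefore needs to branch on the encoding field. The sketch below assumes the response structure shown above. It also tolerates body arriving as a JSON string, which is an assumption about how some callers may receive it rather than documented behavior. extract_content is a hypothetical helper.

```python
import base64
import json


def extract_content(response):
    """Return the converted document as bytes."""
    if response.get("statusCode") != 200:
        raise RuntimeError(f"conversion failed: {response}")
    body = response["body"]
    if isinstance(body, str):  # assumption: body may be serialized JSON
        body = json.loads(body)
    content = body["content"]
    if body.get("encoding") == "base64":
        return base64.b64decode(content)  # binary formats such as PDF
    return content.encode("utf-8")  # text formats are returned as-is
```

Returning bytes in both cases lets the caller write the result to disk with a single code path.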

📊 Monitoring

CloudWatch Logs are available at /aws/lambda/pandoc-lambda-function
Log retention is set to 30 days
Function metrics are available in CloudWatch Metrics
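
For quick checks outside the console, the log group can also be queried from Python. This sketch assumes boto3 is installed and that the caller's credentials allow reading CloudWatch Logs. recent_messages is a hypothetical helper, and the "ERROR" filter pattern is a guess at what the handler logs.

```python
import time

LOG_GROUP = "/aws/lambda/pandoc-lambda-function"  # log group named above


def recent_messages(logs_client, minutes=15, pattern="ERROR"):
    """Return log messages from the last `minutes` that match `pattern`.

    `logs_client` is a boto3 CloudWatch Logs client. Only the first page
    of results is read; follow `nextToken` for more.
    """
    start_ms = int((time.time() - minutes * 60) * 1000)
    page = logs_client.filter_log_events(
        logGroupName=LOG_GROUP,
        startTime=start_ms,
        filterPattern=pattern,
    )
    return [event["message"] for event in page.get("events", [])]
```

A typical call would be recent_messages(boto3.client("logs")), run in the region where the function is deployed.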

🛠️ Development

To modify the function configuration:

Update memory/timeout in deploy.sh
Update infrastructure in template.yaml
Update container configuration in Dockerfile

🔒 Security

The function uses:

IAM role-based permissions
ECR image scanning
CloudWatch logging
AWS-managed encryption keys

❓ Troubleshooting

Check CloudWatch Logs for function output
Use test-local.sh for local debugging
Verify ECR image push success
Check IAM role permissions

🤝 Join the Getting Automated Community

Want to go deeper with automation and get direct support? Join our exclusive automation community!

What You Get from the Getting Automated Community:

In-depth Automation Workflows: Learn how to integrate AI into your automation processes
Battle-Tested Templates: Access exclusive, production-ready automation templates
Expert Guidance: Get direct support from automation professionals
Early Access to Content: Be the first to access exclusive content
Private Support Channels: Receive personalized support through direct chat and office hours
Community of Serious Builders: Connect with like-minded professionals

Join the Getting Automated Community

🔗 Additional Resources

Website: Getting Automated
YouTube Channel: Getting Automated YouTube
Free Workflow Automation Tools: Automation Tools

Need Personalized Help?

If you need this solution built for you or want personalized guidance, you can schedule a consultation:

Schedule a 30-Minute Connect

📄 License

This project is licensed under the terms of the MIT license included in the LICENSE file.
