A serverless document conversion service that runs Pandoc on AWS Lambda. It converts between document formats including Markdown, HTML, PDF, and more, and suits document processing pipelines, RAG systems, and content conversion workflows.
- Runs Pandoc 3.6.4 in AWS Lambda using AWS's Python Lambda container image
- PDF generation capability
- CloudWatch logging enabled
- Infrastructure as Code using CloudFormation
- Local testing support
- Configurable memory and timeout settings
- Ultra Cost-Effective: ~$1.67/month for 10,000 conversions (before the free tier) vs $100-1,000 with pay-per-conversion SaaS APIs
- Complete Control: Customize conversion parameters exactly to your needs
- Serverless Architecture: No servers to manage, maintain, or monitor
- Seamless Integration: Works with n8n, websites, apps, or any system that can make HTTP requests
- Scale Automatically: Handles many simultaneous conversions without extra configuration, up to your account's Lambda concurrency limit
- Privacy Focused: Your documents never leave your AWS account
- Deployment Flexibility: Docker-based approach for maximum compatibility
- Maximum Reusability: Unlike embedded n8n-only solutions, this modular approach can be used with any system that can make HTTP requests
AWS Lambda Costs:
- Free Tier: 1 million free requests + 400,000 GB-seconds/month
- Beyond Free Tier: $0.20 per million requests + $0.0000166667 per GB-second
Real-World Example (10,000 documents/month with 2GB Lambda):
- Request cost: 10,000 × $0.20/million = $0.002
- Compute cost: 10,000 × 5 seconds × 2 GB × $0.0000166667/GB-second = $1.67
- Total: ~$1.67 per month
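The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch that uses only the prices quoted in this section and ignores the free tier; the function name `monthly_lambda_cost` is illustrative and not part of this project.

```python
# Prices quoted above for AWS Lambda beyond the free tier (USD).
PRICE_PER_MILLION_REQUESTS = 0.20
PRICE_PER_GB_SECOND = 0.0000166667


def monthly_lambda_cost(invocations, seconds_per_invocation, memory_gb):
    """Return (request_cost, compute_cost, total) in USD, ignoring the free tier."""
    request_cost = invocations / 1_000_000 * PRICE_PER_MILLION_REQUESTS
    compute_cost = (
        invocations * seconds_per_invocation * memory_gb * PRICE_PER_GB_SECOND
    )
    return request_cost, compute_cost, request_cost + compute_cost


if __name__ == "__main__":
    # 10,000 documents/month, ~5 seconds each, 2 GB of memory.
    request_cost, compute_cost, total = monthly_lambda_cost(10_000, 5, 2)
    print(f"requests ${request_cost:.3f}, compute ${compute_cost:.2f}, total ${total:.2f}")
```

Changing the duration or memory arguments shows how sensitive the total is to conversion time, which dominates the request charge by several orders of magnitude.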
Compared to Commercial Services:
- SaaS document conversion APIs: $10-100/month for similar volume
- Pay-per-conversion APIs: $0.01-0.10 per conversion ($100-1,000 for 10,000 files)
- Many services impose rate limits or queue processing at lower tiers
The AWS Lambda approach stands out as the most cost-effective option for document conversions:
Service | Lowest Plan / Cost | Monthly Allotment (Approx) | Approx Cost for 1,000 Pages | Derived Cost per Page |
---|---|---|---|---|
AWS Lambda | Pay-per-use after free tier | N/A (pay for compute + requests) | ~$0.167 for 1,000 pages¹ | ~$0.000167/page |
Zamzar | $9/mo (Developer) | ~3,000 conversions/month | $9 for up to 3,000 pages² | $0.003/page |
CloudConvert | $9/mo (1,000 conversion mins) | ~1,000 pages (if ~1 min per page) | $9 | $0.009/page |
DocConversionAPI | $9.99/mo (Basic) | 1,000 conversions/month | $9.99 | $0.00999/page |
ConvertAPI | $9/mo (Basic: 1,500 sec) | ~1,500 pages (if ~1 sec per page) | $9 | $0.006/page |
PDF.co | $39/mo | ~2,000 credits | $39 (covers ~1,000 pages) | ~$0.02/page |
¹ Based on 2 GB memory, ~5 seconds billed duration per invocation. Excludes the Lambda free tier, which can significantly reduce or eliminate costs for moderate usage.
² 100 conversions/day = ~3,000 conversions/month.
graph TD
A[Document Source] -->|Upload or URL| B[AWS Lambda Function]
B -->|Process| C[Pandoc Conversion]
C -->|Return| D[Converted Document]
subgraph n8n["n8n Integration"]
E[Read Binary File] -->|Document| F[Encode to Base64]
G[URL to Document] -->|Specify URL| H[Create Request]
F -->|Invoke| B
H -->|Invoke| B
D -->|Process| I[Handle Response]
I -->|Save or Forward| J[Converted Document]
end
style A fill:#ff52b9,stroke:#333,stroke-width:2px
style B fill:#3985ff,stroke:#333,stroke-width:2px
style D fill:#137f13,stroke:#333,stroke-width:2px
style n8n fill:#2A2A2A,stroke:#666,color:#fff
- AWS CLI installed and configured
- Docker installed and running
- Bash shell environment
.
├── Dockerfile # Container definition for Lambda
├── app.py # Lambda function handler
├── deploy.sh # Script to build and deploy the Lambda container
├── deploy-cloudformation.sh # Script to deploy AWS infrastructure
├── template.yaml # CloudFormation template
├── test-local.sh # Script for local testing
├── test-lambda.sh # Script for testing deployed Lambda
└── requirements.txt # Python dependencies
The Lambda function is configured with:
- Memory: 2048 MB
- Timeout: 300 seconds (5 minutes)
- Architecture: x86_64
- Runtime: Container Image
- Base Image: AWS Lambda Python
- First, deploy the AWS infrastructure:
./deploy-cloudformation.sh
This creates:
- ECR repository
- IAM role with necessary permissions
- CloudWatch Log Group
- Build and deploy the Lambda function:
./deploy.sh
This:
- Builds the Docker container
- Pushes it to ECR
- Creates/updates the Lambda function
To use this Lambda function from external applications like n8n, you'll need to create an IAM user with appropriate permissions:
- Open the AWS Management Console and navigate to IAM
- Select "Users" and click "Add users"
- Enter a username (e.g., "pandoc-lambda-api")
- Select "Access key - Programmatic access" for AWS credential type
- Choose "Attach existing policies directly"
- Create a custom policy with the following JSON:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": "lambda:InvokeFunction",
"Resource": "arn:aws:lambda:*:*:function:pandoc-lambda-function"
}
]
}
- Name the policy (e.g., "PandocLambdaInvoke") and attach it to the user
- Complete the user creation and securely store the Access Key ID and Secret Access Key
These credentials can be used in applications that need to invoke the Lambda function programmatically.
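As an illustration of programmatic invocation, the sketch below calls the function through boto3's Lambda client. It assumes boto3 is installed and that the access key pair is available to it (for example via the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables). The helper name `invoke_pandoc_lambda` and the default region `us-east-1` are assumptions for this example, not part of this project.

```python
import json

FUNCTION_NAME = "pandoc-lambda-function"  # name used in the IAM policy above


def invoke_pandoc_lambda(payload, client=None, region="us-east-1"):
    """Invoke the function synchronously and return the decoded JSON response.

    `client` may be any object exposing a boto3-compatible `invoke` method.
    When omitted, a real boto3 Lambda client is created.
    """
    if client is None:
        import boto3  # third-party dependency, imported only when needed

        client = boto3.client("lambda", region_name=region)
    response = client.invoke(
        FunctionName=FUNCTION_NAME,
        InvocationType="RequestResponse",  # wait for the conversion result
        Payload=json.dumps(payload).encode("utf-8"),
    )
    # boto3 returns the function output as a stream under "Payload".
    return json.loads(response["Payload"].read())
```

Accepting an optional `client` keeps the helper easy to exercise with a stub, so the request shape can be checked without AWS access.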
n8n is an open-source workflow automation platform that can be integrated with this Pandoc Lambda service for document processing workflows. Here's how to set it up:
- n8n installed and running
- AWS IAM user credentials with Lambda invoke permissions (see previous section)
- In your n8n workflow, add an "AWS Lambda" node
- Configure the node with:
- Access Key ID and Secret Access Key from your IAM user
- Region: The AWS region where your Lambda is deployed
- Function: "pandoc-lambda-function"
- JSON Payload: Format as described in the Usage section
{ "input_format": "markdown", "output_format": "pdf", "content": "{{$node["Previous_Node"].data.content_base64}}", "options": ["--pdf-engine=pdflatex"] }
You can create workflows like:
- Content Management System to PDF: Convert CMS content to PDF for archiving
- Document Format Converter: Allow users to upload documents and convert to different formats
- Markdown to HTML Email: Convert markdown content to HTML for email newsletters
- RAG Pipeline Preprocessing: Convert various document formats to plain text for embedding into vector databases
graph TD
A[Document Sources] -->|Upload| B[n8n Workflow]
B -->|Extract Content| C[Pandoc Lambda Function]
C -->|Convert to Text| D[Text Processing]
D -->|Clean & Chunk| E[Vector Embedding]
E -->|Store| F[Vector Database]
F -->|Query| G[RAG System]
style A fill:#ff52b9,stroke:#333,stroke-width:2px
style C fill:#3985ff,stroke:#333,stroke-width:2px
style G fill:#137f13,stroke:#333,stroke-width:2px
The n8n node will receive the Lambda response in JSON format with the converted document content.
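For the "Clean & Chunk" step in the RAG diagram, one simple approach is fixed-size character chunks with a small overlap. The sketch below is only an illustration of that step under assumed defaults (`max_chars=1000`, `overlap=100`). It is not part of this project, and a production pipeline may prefer sentence- or token-aware splitting.

```python
def chunk_text(text, max_chars=1000, overlap=100):
    """Split text into overlapping chunks of at most max_chars characters."""
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("need max_chars > 0 and 0 <= overlap < max_chars")
    cleaned = " ".join(text.split())  # collapse runs of whitespace and newlines
    if not cleaned:
        return []
    chunks = []
    start = 0
    step = max_chars - overlap  # how far the window advances each iteration
    while True:
        chunks.append(cleaned[start:start + max_chars])
        if start + max_chars >= len(cleaned):
            break  # the current chunk already reaches the end of the text
        start += step
    return chunks
```

Each chunk would then be passed to whatever embedding model the workflow uses before being stored in the vector database.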
Test the function locally using:
./test-local.sh
This will:
- Build the container locally
- Run a test conversion (Markdown to PDF)
- Save the output to the output directory
The Lambda function accepts JSON input with the following structure:
{
"input_format": "markdown",
"output_format": "pdf",
"content": "<base64-encoded-content>",
"options": ["--pdf-engine=pdflatex"]
}
Example formats supported:
- markdown
- html
- docx
- epub
- latex
- rst
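The request structure above can be assembled in Python as follows. This is a minimal sketch: the helper name `build_event` is hypothetical, and the only assumption about the handler is the four-field JSON shape shown in this section.

```python
import base64
import json


def build_event(data, input_format, output_format, options=()):
    """Return the request JSON string for a document given as str or bytes."""
    if isinstance(data, str):
        data = data.encode("utf-8")  # text formats such as markdown or html
    return json.dumps(
        {
            "input_format": input_format,
            "output_format": output_format,
            # Base64 keeps binary inputs (for example docx) safe inside JSON.
            "content": base64.b64encode(data).decode("ascii"),
            "options": list(options),
        }
    )
```

For a binary input, read the file in binary mode and pass the raw bytes, for example `build_event(open("your-document.docx", "rb").read(), "docx", "plain", ["--wrap=none"])`.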
Here's a complete example of converting a markdown file to plain text:
# Your markdown content
echo "# Hello World
This is **bold** and this is *italic*.
## Section 1
- List item 1
- List item 2
[A link](https://example.com)" | base64
# The above command outputs something like:
# IyBIZWxsbyBXb3JsZAoKVGhpcyBpcyAqKmJvbGQqKiBhbmQgdGhpcyBpcyAqaXRhbGljKi4KCiMjIFNlY3Rpb24gMQotIExpc3QgaXRlbSAxCi0gTGlzdCBpdGVtIDIKCltBIGxpbmtdKGh0dHBzOi8vZXhhbXBsZS5jb20p
# Use this base64 string in your JSON payload:
{
"input_format": "markdown",
"output_format": "plain",
"content": "IyBIZWxsbyBXb3JsZAoKVGhpcyBpcyAqKmJvbGQqKiBhbmQgdGhpcyBpcyAqaXRhbGljKi4KCiMjIFNlY3Rpb24gMQotIExpc3QgaXRlbSAxCi0gTGlzdCBpdGVtIDIKCltBIGxpbmtdKGh0dHBzOi8vZXhhbXBsZS5jb20p",
"options": ["--wrap=none"]
}
The response will contain base64-encoded plain text with all markdown formatting removed.
To test this conversion locally, you can use:
echo "# Hello World..." | base64 > test.txt
curl -X POST http://localhost:9000/2015-03-31/functions/function/invocations \
-H "Content-Type: application/json" \
-d @- << EOF
{
"input_format": "markdown",
"output_format": "plain",
"content": "$(cat test.txt)",
"options": ["--wrap=none"]
}
EOF
Here's how to convert a Word document to plain text:
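# First, convert your DOCX file to base64 on a single line
# (GNU base64 wraps its output at 76 characters, which would break the JSON string)
base64 < your-document.docx | tr -d '\n' > document.b64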
# The JSON payload should look like this:
{
"input_format": "docx",
"output_format": "plain",
"content": "$(cat document.b64)",
"options": ["--wrap=none"]
}
To test this using curl:
# Using an existing DOCX file
curl -X POST http://localhost:9000/2015-03-31/functions/function/invocations \
-H "Content-Type: application/json" \
-d @- << EOF
{
"input_format": "docx",
"output_format": "plain",
"content": "$(cat document.b64)",
"options": ["--wrap=none", "--extract-media=."]
}
EOF
# The response will be base64-encoded plain text that you can decode:
echo "<response-content>" | base64 -d > output.txt
Note:
- For DOC files (older Word format), use input_format: "doc"
- The --extract-media=. option will extract any embedded images (though they won't be included in the plain text output)
- Binary files like DOCX must be base64-encoded from the raw file bytes, rather than from text content as we did with markdown
The Lambda function returns a JSON response with the following structure:
{
"statusCode": 200,
"body": {
"content": "<result-content>",
"format": "pdf",
"contentType": "application/pdf",
"encoding": "base64"
},
"headers": {
"Content-Type": "application/json"
}
}
- For text formats, the content is returned directly as a UTF-8 string
- For binary formats, the content is base64-encoded and the encoding field is set to "base64"
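A caller can use the encoding field to decide how to decode the result. The sketch below assumes the response shape shown above; the helper name `extract_content` is hypothetical, and the handling of a string-valued body is a defensive assumption for integrations that deliver the body as serialized JSON.

```python
import base64
import json


def extract_content(response):
    """Return the converted document as bytes from a Lambda response dict."""
    body = response["body"]
    if isinstance(body, str):
        body = json.loads(body)  # body delivered as a JSON string
    content = body["content"]
    if body.get("encoding") == "base64":
        return base64.b64decode(content)  # binary formats such as pdf or docx
    return content.encode("utf-8")  # text formats are returned as plain strings
```

Writing the returned bytes to a file opened in binary mode works for both text and binary outputs.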
- CloudWatch Logs are available at /aws/lambda/pandoc-lambda-function
- Log retention is set to 30 days
- Function metrics are available in CloudWatch Metrics
To modify the function configuration:
- Update memory/timeout in deploy.sh
- Update infrastructure in template.yaml
- Update container configuration in Dockerfile
The function uses:
- IAM role-based permissions
- ECR image scanning
- CloudWatch logging
- AWS-managed encryption keys
- Check CloudWatch Logs for function output
- Use test-local.sh for local debugging
- Verify ECR image push success
- Check IAM role permissions
This project is licensed under the terms of the MIT license included in the LICENSE file.