A serverless workflow to extract DICOM metadata to S3 and make it queryable via Athena. The architecture supports the DCM, ZIP, and TAR file extensions, and treats files with no extension as DCM.
The dataset is partitioned by the study_date tag; if the tag is not present, it defaults to the value 1900-01-01.
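The routing and partitioning rules above can be sketched as follows. This is a hypothetical illustration, not the repository's actual code: the `classify` and `partition_value` helpers are invented names, and the YYYYMMDD-to-hyphenated conversion of the DICOM StudyDate tag is an assumption about how the partition value is derived.

```python
import pathlib

# Assumed mapping of supported extensions to handler types (DCM, ZIP, TAR).
SUPPORTED = {".dcm": "dcm", ".zip": "zip", ".tar": "tar"}

def classify(key):
    """Return the handler type for an S3 object key, or 'unsupported'."""
    ext = pathlib.PurePosixPath(key).suffix.lower()
    if ext == "":
        return "dcm"  # no extension -> assumed to be DICOM
    return SUPPORTED.get(ext, "unsupported")

def partition_value(study_date):
    """Derive the study_date partition value from a DICOM StudyDate tag.

    DICOM StudyDate is formatted YYYYMMDD; a missing or empty tag
    falls back to the documented default of 1900-01-01 (assumption).
    """
    if not study_date:
        return "1900-01-01"
    return f"{study_date[:4]}-{study_date[4:6]}-{study_date[6:8]}"
```

For example, `classify("study/IMG0001")` returns `"dcm"` and `partition_value(None)` returns `"1900-01-01"`.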
aws ecr create-repository --repository-name dicom-parser
REPO_URL=$(aws ecr describe-repositories --repository-name dicom-parser --query "repositories[].repositoryUri" --output text)
aws ecr get-login-password | docker login --username AWS --password-stdin $REPO_URL
echo $REPO_URL
The template can be deployed via the web console or via the CLI.
Install the AWS Serverless Application Model (SAM) CLI using these instructions.
git clone https://github.com/aws-samples/dicom-metadata-extractor-serverless-datalake.git
cd dicom-metadata-extractor-serverless-datalake/
Build the application using SAM
sam build -t deploy/serverless.yml -m requirements.txt
Deploy the application using the guided process
sam deploy --guided --capabilities CAPABILITY_NAMED_IAM --image-repository $REPO_URL
# Replace MY_VALUE and MY_ACCOUNT_ID with proper resource names
Configuring SAM deploy
======================
Looking for config file [samconfig.toml] : Not found
Setting default arguments for 'sam deploy'
=========================================
Stack Name [sam-app]: dicom-parser
AWS Region [us-east-1]: us-east-1
Parameter S3InputBucketName []: MY_VALUE-dicom-input
Parameter S3OutputBucketName []: MY_VALUE-dicom-output
Parameter VpcBatch []: vpc-MY_VALUE
Parameter SubnetsBatch []: subnet-MY_VALUE,subnet-MY_VALUE
Parameter ContainerMemory [1024]: 1024
Parameter ContainervCPU [0.5]: 0.5
Parameter LambdaMemory [256]: 256
Parameter LambdaDuration [600]: 600
Parameter AssignPublicIp [ENABLED]: ENABLED
Parameter PartitionKey [study_date]: study_date
Parameter GlueTableName [dicom_metadata]: dicom_metadata
Parameter LogLevel [INFO]: INFO
Parameter VersionDescription [1]: 1
Image Repository for DicomParser [MY_VALUE.dkr.ecr.us-east-1.amazonaws.com/dicom-parser]: `HIT ENTER`
dicomparser:latest to be pushed to MY_VALUE.dkr.ecr.us-east-1.amazonaws.com/dicom-parser:dicomparser-XXXXXXXXXXX-latest
#Shows you resources changes to be deployed and require a 'Y' to initiate deploy
Confirm changes before deploy [y/N]: y
#SAM needs permission to be able to create roles to connect to the resources in your template
Allow SAM CLI IAM role creation [Y/n]: y
#Preserves the state of previously provisioned resources when an operation fails
Disable rollback [y/N]: y
Save arguments to configuration file [Y/n]: y
SAM configuration file [samconfig.toml]: `HIT ENTER`
SAM configuration environment [default]: `HIT ENTER`
Looking for resources needed for deployment:
Managed S3 bucket: aws-sam-cli-managed-default-samclisourcebucket-XXXXXXXXXX
A different default S3 bucket can be set in samconfig.toml
Image repositories: Not found.
#Managed repositories will be deleted when their functions are removed from the template and deployed
Create managed ECR repositories for all functions? [Y/n]: y
...
Previewing CloudFormation changeset before deployment
======================================================
Deploy this changeset? [y/N]: y
Next, upload the DCM images to the S3 input bucket. Replace S3InputBucketName
with the value entered in the SAM deploy guided walkthrough.
aws s3 cp --recursive sample_dcm/ s3://S3InputBucketName/example/
Go to the Athena web console and select dicom_db as the database. Then run the following queries:
Repair the table to update partitions
MSCK REPAIR TABLE dicom_metadata;
Run the query to see the results
SELECT * FROM dicom_metadata;
Navigate to the SQS console to see any error messages. We expect to see a message for DICOMDIR since it is an empty file. If you do not see any messages, the Lambda function may be performing retries before failing.
The default schema captures only a portion of the DICOM standard. The Glue Crawler can be used to discover additional tags in the set of DICOM images.
Navigate to the Glue Crawler web console, select dicom-crawler, and click Run Crawler. Wait until it completes.
Note: For subsequent pushes, run lambda_build.sh for easier deployment:
./lambda_build.sh
This is because study_date is parsed as a partition key. The value can be found in the S3 path s3://bucket-name/study_date=1900-01-01/29035sjfkla923r.parquet.
Run MSCK REPAIR TABLE dicom_metadata
to update the partitions, or run the Glue Crawler. Additional information on the error can be found here:
Athena Console
Glue Crawler
This is due to a mismatch between a schema column type and the type in the Parquet file. It can happen when a DICOM image is cleaned but the replacement value has the wrong type. An example is an element that requires value type SQ (array<string>)
but is replaced with '' (string).
Additional information on the error can be found here:
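One defensive way to avoid this mismatch is to coerce SQ values back to a list before writing Parquet. This is a hypothetical sketch, not the repository's actual cleanup code; `normalize_sq` is an invented helper name.

```python
def normalize_sq(value):
    """Coerce an SQ element's value to array<string> for the Parquet schema.

    A cleaned/anonymized element may arrive as '' or None instead of a
    sequence; map those to an empty array, and wrap lone scalars in a
    single-element list so the column type stays consistent.
    """
    if value is None or value == "":
        return []  # cleaned-out SQ -> empty array
    if isinstance(value, (list, tuple)):
        return [str(v) for v in value]
    return [str(value)]  # lone scalar -> single-element array
```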
Via the CloudFormation console, select the dicom-parser stack.
- Click Update in the upper right corner.
- Select Use current template and click Next.
- In the parameters, find LogLevel, select DEBUG from the dropdown, and click Next.
- Keep the default stack options and click Next.
- Review the stack details, check the blue box about IAM resources, then click Update Stack.
Via the SAM CLI:
- Find the samconfig.toml file.
- Replace the LogLevel value in the configuration with DEBUG.

Before
... LogLevel=\"INFO\" ...
After
... LogLevel=\"DEBUG\" ...
This is caused when the schema changes and the partitions have not been updated. To remove all partitions, you can run a command similar to the following:
aws glue get-partitions --database-name dicom_db --table-name dicom_metadata | jq ".Partitions[].Values[]" | xargs -L1 -I'{}' aws glue delete-partition --database-name dicom_db --table-name dicom_metadata --partition-values {}
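The jq/xargs pipeline above just pulls each partition's value list out of the get-partitions response and feeds it to delete-partition. A sketch of that extraction step in Python, assuming a single partition key (study_date) and using a made-up sample response:

```python
import json

def partition_values(get_partitions_response):
    """Extract each partition's Values list from a Glue
    get-partitions JSON response (as a string)."""
    doc = json.loads(get_partitions_response)
    return [p["Values"] for p in doc.get("Partitions", [])]

# Hypothetical sample of what `aws glue get-partitions` returns.
sample = '{"Partitions": [{"Values": ["1900-01-01"]}, {"Values": ["2023-01-15"]}]}'
```

Each extracted value list would then be passed to `aws glue delete-partition --partition-values ...`, one call per partition, which is what the xargs invocation does.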
Then, via the Athena console, run the repair table command, or run the Glue Crawler to repopulate the partitions:
Athena
MSCK REPAIR TABLE dicom_metadata
Glue Crawler
aws glue start-crawler --name dicom-crawler