A serverless workflow to extract DICOM metadata to S3 and make it queryable via Athena. The architecture supports the DCM, ZIP, and TAR file extensions, and treats files with no extension as DCM.
The dataset is partitioned by the study_date tag; if the tag is not present, it defaults to the value 1900-01-01.
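The routing and partitioning rules above can be sketched as follows. This is a hypothetical illustration, not the repository's actual code: the `classify` and `partition_value` helpers are invented names, and the YYYYMMDD-to-hyphenated conversion of the DICOM StudyDate tag is an assumption about how the partition value is derived.

```python
import pathlib

# Assumed mapping of supported extensions to handler types (DCM, ZIP, TAR).
SUPPORTED = {".dcm": "dcm", ".zip": "zip", ".tar": "tar"}

def classify(key):
    """Return the handler type for an S3 object key, or 'unsupported'."""
    ext = pathlib.PurePosixPath(key).suffix.lower()
    if ext == "":
        return "dcm"  # no extension -> assumed to be DICOM
    return SUPPORTED.get(ext, "unsupported")

def partition_value(study_date):
    """Derive the study_date partition value from a DICOM StudyDate tag.

    DICOM StudyDate is formatted YYYYMMDD; a missing or empty tag
    falls back to the documented default of 1900-01-01 (assumption).
    """
    if not study_date:
        return "1900-01-01"
    return f"{study_date[:4]}-{study_date[4:6]}-{study_date[6:8]}"
```

For example, `classify("study/IMG0001")` returns `"dcm"` and `partition_value(None)` returns `"1900-01-01"`.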
aws ecr create-repository --repository-name dicom-parser
REPO_URL=$(aws ecr describe-repositories --repository-name dicom-parser --query "repositories[].repositoryUri" --output text)
aws ecr get-login-password | docker login --username AWS --password-stdin $REPO_URL
echo $REPO_URL
The template can be deployed via the web console or via the CLI.
Install the AWS Serverless Application Model (SAM) CLI using these instructions.
git clone https://github.com/aws-samples/dicom-metadata-extractor-serverless-datalake.git
cd dicom-metadata-extractor-serverless-datalake/
Build the application using SAM
sam build -t deploy/serverless.yml -m requirements.txt
Deploy the application using the guided process
sam deploy --guided --capabilities CAPABILITY_NAMED_IAM --image-repository $REPO_URL
# Replace MY_VALUE and MY_ACCOUNT_ID with proper resource names
Configuring SAM deploy
======================
Looking for config file [samconfig.toml] : Not found
Setting default arguments for 'sam deploy'
=========================================
Stack Name [sam-app]: dicom-parser
AWS Region [us-east-1]: us-east-1
Parameter S3InputBucketName []: MY_VALUE-dicom-input
Parameter S3OutputBucketName []: MY_VALUE-dicom-output
Parameter VpcBatch []: vpc-MY_VALUE
Parameter SubnetsBatch []: subnet-MY_VALUE,subnet-MY_VALUE
Parameter ContainerMemory [1024]: 1024
Parameter ContainervCPU [0.5]: 0.5
Parameter LambdaMemory [256]: 256
Parameter LambdaDuration [600]: 600
Parameter AssignPublicIp [ENABLED]: ENABLED
Parameter PartitionKey [study_date]: study_date
Parameter GlueTableName [dicom_metadata]: dicom_metadata
Parameter LogLevel [INFO]: INFO
Parameter VersionDescription [1]: 1
Image Repository for DicomParser [MY_VALUE.dkr.ecr.us-east-1.amazonaws.com/dicom-parser]: `HIT ENTER`
dicomparser:latest to be pushed to MY_VALUE.dkr.ecr.us-east-1.amazonaws.com/dicom-parser:dicomparser-XXXXXXXXXXX-latest
#Shows you resources changes to be deployed and require a 'Y' to initiate deploy
Confirm changes before deploy [y/N]: y
#SAM needs permission to be able to create roles to connect to the resources in your template
Allow SAM CLI IAM role creation [Y/n]: y
#Preserves the state of previously provisioned resources when an operation fails
Disable rollback [y/N]: y
Save arguments to configuration file [Y/n]: y
SAM configuration file [samconfig.toml]: `HIT ENTER`
SAM configuration environment [default]: `HIT ENTER`
Looking for resources needed for deployment:
Managed S3 bucket: aws-sam-cli-managed-default-samclisourcebucket-XXXXXXXXXX
A different default S3 bucket can be set in samconfig.toml
Image repositories: Not found.
#Managed repositories will be deleted when their functions are removed from the template and deployed
Create managed ECR repositories for all functions? [Y/n]: y
...
Previewing CloudFormation changeset before deployment
======================================================
Deploy this changeset? [y/N]: y
Next, upload the DCM images to the S3 input bucket. Replace S3InputBucketName
with the value entered in the SAM deploy guided walkthrough.
aws s3 cp --recursive sample_dcm/ s3://S3InputBucketName/example/
Go to the Athena web console and select dicom_db as the database. Then run the following queries:
Repair the table to update partitions
MSCK REPAIR TABLE dicom_metadata;
Run the query to see the results
SELECT * FROM dicom_metadata;
Navigate to the SQS console to see any error messages. We expect to see a message for DICOMDIR since it is an empty file. If you do not see any messages, the Lambda function may be performing retries before failing.
The default schema captures only a portion of the DICOM standard. The Glue Crawler can be used to discover additional tags in the set of DICOM images.
Navigate to the Glue Crawler web console, select dicom-crawler, and click Run Crawler. Wait until it completes.
Note: For subsequent pushes, run lambda_build.sh for easier deployment:
./lambda_build.sh
This is because study_date is parsed as a partition key. The value can be found in the S3 path s3://bucket-name/study_date=1900-01-01/29035sjfkla923r.parquet.
Run MSCK REPAIR TABLE dicom_metadata
to update the partitions, or run the Glue Crawler. Additional information on the error can be found here:
Athena Console
Glue Crawler
This is due to a mismatch between a schema column type and the type in the Parquet file. It can happen when a DICOM image is cleaned but the replacement value has the wrong type. An example is an element that requires value type SQ (array<string>)
but is replaced with '' (string).
Additional information on the error can be found here:
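One defensive way to avoid this mismatch is to coerce SQ values back to a list before writing Parquet. This is a hypothetical sketch, not the repository's actual cleanup code; `normalize_sq` is an invented helper name.

```python
def normalize_sq(value):
    """Coerce an SQ element's value to array<string> for the Parquet schema.

    A cleaned/anonymized element may arrive as '' or None instead of a
    sequence; map those to an empty array, and wrap lone scalars in a
    single-element list so the column type stays consistent.
    """
    if value is None or value == "":
        return []  # cleaned-out SQ -> empty array
    if isinstance(value, (list, tuple)):
        return [str(v) for v in value]
    return [str(value)]  # lone scalar -> single-element array
```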
Via the CloudFormation console, select the dicom-parser stack.
- Click Update in the upper right corner.
- Select Use current template and click Next.
- In the parameters, find LogLevel, select DEBUG from the dropdown, and click Next.
- Keep the default stack options and click Next.
- Review the stack details, check the blue box about IAM resources, then click Update Stack.
Via the SAM CLI:
- Find the samconfig.toml file.
- Replace the LogLevel value in the configuration with DEBUG.

Before
... LogLevel=\"INFO\" ...
After
... LogLevel=\"DEBUG\" ...
This is caused when the schema changes and the partitions have not been updated. To remove all partitions, you can run a command similar to the following:
aws glue get-partitions --database-name dicom_db --table-name dicom_metadata | jq ".Partitions[].Values[]" | xargs -L1 -I'{}' aws glue delete-partition --database-name dicom_db --table-name dicom_metadata --partition-values {}
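The jq/xargs pipeline above just pulls each partition's value list out of the get-partitions response and feeds it to delete-partition. A sketch of that extraction step in Python, assuming a single partition key (study_date) and using a made-up sample response:

```python
import json

def partition_values(get_partitions_response):
    """Extract each partition's Values list from a Glue
    get-partitions JSON response (as a string)."""
    doc = json.loads(get_partitions_response)
    return [p["Values"] for p in doc.get("Partitions", [])]

# Hypothetical sample of what `aws glue get-partitions` returns.
sample = '{"Partitions": [{"Values": ["1900-01-01"]}, {"Values": ["2023-01-15"]}]}'
```

Each extracted value list would then be passed to `aws glue delete-partition --partition-values ...`, one call per partition, which is what the xargs invocation does.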
Then, via the Athena console, run the repair table command, or run the Glue Crawler to repopulate the partitions:
Athena
MSCK REPAIR TABLE dicom_metadata
Glue Crawler
aws glue start-crawler --name dicom-crawler