9 changes: 5 additions & 4 deletions config.py
@@ -4,23 +4,24 @@
LOG_GROUP_NAME = APP_NAME

# DOCKER REGISTRY INFORMATION:
DOCKERHUB_TAG = 'cellprofiler/distributed-cellprofiler:2.0.0_4.1.3'
DOCKERHUB_TAG = 'cellprofiler/distributed-cellprofiler:2.0.0_4.2.4'

# AWS GENERAL SETTINGS:
AWS_REGION = 'us-east-1'
AWS_PROFILE = 'default' # The same profile used by your AWS CLI installation
SSH_KEY_NAME = 'your-key-file.pem' # Expected to be in ~/.ssh
AWS_BUCKET = 'your-bucket-name' # Bucket to use for logging
SOURCE_BUCKET = 'bucket-name' # Bucket to download files from
SOURCE_BUCKET = 'bucket-name' # Bucket to download image files from
WORKSPACE_BUCKET = 'bucket-name' # Bucket to download non-image files from
DESTINATION_BUCKET = 'bucket-name' # Bucket to upload files to
UPLOAD_FLAGS = '' # Any flags needed for upload to destination bucket

# EC2 AND ECS INFORMATION:
ECS_CLUSTER = 'default'
CLUSTER_MACHINES = 3
TASKS_PER_MACHINE = 1
MACHINE_TYPE = ['m4.xlarge']
MACHINE_PRICE = 0.10
MACHINE_TYPE = ['m5.xlarge']
MACHINE_PRICE = 0.20
EBS_VOL_SIZE = 30 # In GB. Minimum allowed is 22.
DOWNLOAD_FILES = 'False'

4 changes: 3 additions & 1 deletion documentation/DCP-documentation/step_1_configuration.md
@@ -25,7 +25,9 @@ For more information and examples, see [External Buckets](external_buckets.md).

* **AWS_BUCKET:** The bucket to which you would like to write log files.
This is generally the bucket in the account in which you are running compute.
* **SOURCE_BUCKET:** The bucket where the files you will be reading are.
* **SOURCE_BUCKET:** The bucket where the image files you will be reading are.
Often, this is the same as AWS_BUCKET.
* **WORKSPACE_BUCKET:** The bucket where the non-image files you will be reading are (e.g. pipeline, load_data.csv, etc.).
Often, this is the same as AWS_BUCKET.
* **DESTINATION_BUCKET:** The bucket where you want to write your output files.
Often, this is the same as AWS_BUCKET.
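
A quick way to confirm that the AWS profile named in config.py can actually reach each of these buckets is a simple listing (a sketch; substitute your real bucket names and profile):

```bash
# Confirm the configured profile can see a bucket referenced in config.py
aws s3 ls s3://your-bucket-name --profile default
```
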
37 changes: 32 additions & 5 deletions example_project/README.md
@@ -1,3 +1,5 @@
# Distributed-CellProfiler Minimal Example

Included in this folder are all of the resources for running a complete mini-example of Distributed-CellProfiler.
It includes 3 sample image sets and a CellProfiler pipeline that identifies cells within the images and makes measurements.
It also includes the Distributed-CellProfiler files pre-configured to create a queue of all 3 jobs and spin up a spot fleet of 3 instances, each of which will process a single image set.
@@ -9,21 +11,23 @@ It also includes the Distributed-CellProfiler files pre-configured to create a queue
Before running this mini-example, you will need to set up your AWS infrastructure as described in our [online documentation](https://distributedscience.github.io/Distributed-CellProfiler/step_0_prep.html).
This includes creating the fleet file that you will use in Step 3.

Upload the 'sample_project' folder to the top level of your bucket.
Upload the 'sample_project' folder to the top level of your bucket.
While in the `Distributed-CellProfiler` folder, use the following command, replacing `yourbucket` with your bucket name:

```bash
# Copy example files to S3
BUCKET=yourbucket
aws s3 sync example_project/project_folder s3://${BUCKET}/project_folder
aws s3 sync example_project/demo_project_folder s3://${BUCKET}/demo_project_folder

# Replace the default config with the example config
cp example_project/config.py config.py
```

### Step 1

In config.py you will need to update the following fields specific to your AWS configuration:
```

```python
# AWS GENERAL SETTINGS:
AWS_REGION = 'us-east-1'
AWS_PROFILE = 'default' # The same profile used by your AWS CLI installation
@@ -32,17 +36,21 @@ AWS_BUCKET = 'your-bucket-name'
SOURCE_BUCKET = 'your-bucket-name' # Only differs from AWS_BUCKET with advanced configuration
DESTINATION_BUCKET = 'your-bucket-name' # Only differs from AWS_BUCKET with advanced configuration
```

Then run `python3 run.py setup`

### Step 2
This command points to the job file created for this demonstartion and should be run as-is.

This command points to the job file created for this demonstration and should be run as-is.
`python3 run.py submitJob example_project/files/exampleJob.json`

### Step 3

This command should point to whatever fleet file you created in Step 0, so you may need to update the `exampleFleet.json` file name.
`python3 run.py startCluster files/exampleFleet.json`
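
If you want to confirm the fleet actually came up, one optional check is to query its state (a sketch; replace the ID below with the spot fleet request ID that `startCluster` prints and stores in the monitor file):

```bash
# Check the state of the spot fleet request created by startCluster
aws ec2 describe-spot-fleet-requests \
    --spot-fleet-request-ids sfr-REPLACE-WITH-YOUR-ID \
    --query 'SpotFleetRequestConfigs[0].SpotFleetRequestState'
```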

### Step 4

This command points to the monitor file that is automatically created with your run and should be run as-is.
`python3 run.py monitor files/FlyExampleSpotFleetRequestId.json`

@@ -51,4 +59,23 @@
While the run is happening, you can watch real-time metrics in your Cloudwatch Dashboard by navigating in the [Cloudwatch console](https://console.aws.amazon.com/cloudwatch).
Note that the metrics update at intervals that may not be helpful with this fast, minimal example.
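
If the dashboard is quiet, you can also gauge progress from the queue itself; a minimal sketch, assuming this demo's APP_NAME of FlyExample (so the queue is named FlyExampleQueue):

```bash
# See how many jobs are still waiting (visible) or currently being processed (not visible)
QUEUE_URL=$(aws sqs get-queue-url --queue-name FlyExampleQueue --query QueueUrl --output text)
aws sqs get-queue-attributes --queue-url ${QUEUE_URL} \
    --attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible
```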

After the run is done, you should see your CellProfiler output files in S3 at s3://${BUCKET}/project_folder/output in per-image folders.
After the run is done, you should see your CellProfiler output files in S3 at s3://${BUCKET}/demo_project_folder/output in per-image folders.
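
For example, you can list everything the run produced with the AWS CLI (assuming `BUCKET` is still set from Step 0):

```bash
# List all CellProfiler output files produced by the demo run
aws s3 ls s3://${BUCKET}/demo_project_folder/output/ --recursive
```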

## Cleanup

The spot fleet, queue, and task definition will be automatically cleaned up after your demo is complete because you are running `monitor`.

To remove everything else:

```bash
# Remove files added to S3 bucket
BUCKET=yourbucket
aws s3 rm --recursive s3://${BUCKET}/demo_project_folder

# Remove Cloudwatch logs
aws logs delete-log-group --log-group-name FlyExample
aws logs delete-log-group --log-group-name FlyExample_perInstance

# Delete DeadMessages queue
aws sqs delete-queue --queue-url $(aws sqs get-queue-url --queue-name ExampleProject_DeadMessages --query QueueUrl --output text)
```
8 changes: 4 additions & 4 deletions example_project/files/exampleJob.json
@@ -1,9 +1,9 @@
{
"_comment1": "Paths in this file are relative to the root of your S3 bucket",
"pipeline": "project_folder/workspace/ExampleFly.cppipe",
"data_file": "project_folder/workspace/load_data.csv",
"input": "project_folder/workspace/",
"output": "project_folder/output",
"pipeline": "demo_project_folder/workspace/ExampleFly.cppipe",
"data_file": "demo_project_folder/workspace/load_data.csv",
"input": "demo_project_folder/workspace/",
"output": "demo_project_folder/output",
"output_structure": "Metadata_Position",
"_comment2": "The following groups are tasks, and each will be run in parallel",
"groups": [
82 changes: 82 additions & 0 deletions example_project_CPG/README.md
@@ -0,0 +1,82 @@
# CPG Example Project

Included in this folder are all of the resources for running a complete mini-example of Distributed-CellProfiler.
This example differs from the other example project in that it reads data hosted in the [Cell Painting Gallery](https://github.com/broadinstitute/cellpainting-gallery), a public data repository, instead of reading images from your own bucket.
Workspace files are hosted in your own S3 bucket, outputs are written to your bucket, and compute is performed in your account.
It includes the Distributed-CellProfiler files pre-configured to create a queue of 3 jobs and spin up a spot fleet of 3 instances, each of which will process a single image set.
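
Because the Cell Painting Gallery is a public bucket, you can browse the source data directly without credentials if you are curious (optional; not required for the demo):

```bash
# Browse the public Cell Painting Gallery bucket anonymously
aws s3 ls s3://cellpainting-gallery/ --no-sign-request
```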

## Running example project

### Step 0

Before running this mini-example, you will need to set up your AWS infrastructure as described in our [online documentation](https://distributedscience.github.io/Distributed-CellProfiler/step_0_prep.html).
This includes creating the fleet file that you will use in Step 3.

Upload the 'demo_project_folder' folder to the top level of your bucket.
While in the `Distributed-CellProfiler` folder, use the following command, replacing `yourbucket` with your bucket name:

```bash
# Copy example files to S3
BUCKET=yourbucket
aws s3 sync example_project_CPG/demo_project_folder s3://${BUCKET}/demo_project_folder

# Replace the default config with the example config
cp example_project_CPG/config.py config.py
```

### Step 1

In config.py you will need to update the following fields specific to your AWS configuration:

```python
# AWS GENERAL SETTINGS:
AWS_REGION = 'us-east-1'
AWS_PROFILE = 'default' # The same profile used by your AWS CLI installation
SSH_KEY_NAME = 'your-key-file.pem' # Expected to be in ~/.ssh
AWS_BUCKET = 'your-bucket-name'
WORKSPACE_BUCKET = 'your-bucket-name' # Only differs from AWS_BUCKET with advanced configuration
DESTINATION_BUCKET = 'your-bucket-name' # Only differs from AWS_BUCKET with advanced configuration
```

Then run `python run.py setup`
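
setup should create this demo's SQS queue, named from APP_NAME + 'Queue' (i.e. ExampleCPGQueue, given the config.py in this example); an optional way to confirm:

```bash
# Confirm the demo's SQS queue exists after running setup
aws sqs list-queues --queue-name-prefix ExampleCPGQueue
```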

### Step 2

This command points to the job file created for this demonstration and should be run as-is.
`python run.py submitJob example_project_CPG/files/exampleCPGJob.json`

### Step 3

This command should point to whatever fleet file you created in Step 0, so you may need to update the `exampleFleet.json` file name.
`python run.py startCluster files/exampleFleet.json`

### Step 4

This command points to the monitor file that is automatically created with your run and should be run as-is.
`python run.py monitor files/ExampleCPGSpotFleetRequestId.json`
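
While `monitor` runs, you can also tail the per-instance logs directly (requires AWS CLI v2; the log group names are the same ones removed in the Cleanup section below):

```bash
# Follow the per-instance CellProfiler logs for this demo in real time
aws logs tail ExampleCPG_perInstance --follow
```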

## Results

While a run is happening, you can watch real-time metrics in your Cloudwatch Dashboard by navigating in the [Cloudwatch console](https://console.aws.amazon.com/cloudwatch).
Note that the metrics update at intervals that may not be helpful with this fast, minimal example.

After the run is done, you should see your CellProfiler output files in your S3 bucket at s3://${BUCKET}/demo_project_folder/output in per-well-and-site folders.
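
A quick way to check what was written (assuming `BUCKET` is set as in Step 0 and that your job file writes under demo_project_folder/output):

```bash
# Summarize the demo's output objects and their total size
aws s3 ls s3://${BUCKET}/demo_project_folder/output/ --recursive --summarize
```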

## Cleanup

The spot fleet, queue, and task definition will be automatically cleaned up after your demo is complete because you are running `monitor`.

To remove everything else:

```bash
# Remove files added to S3 bucket
BUCKET=yourbucket
aws s3 rm --recursive s3://${BUCKET}/demo_project_folder

# Remove Cloudwatch logs
aws logs delete-log-group --log-group-name ExampleCPG
aws logs delete-log-group --log-group-name ExampleCPG_perInstance

# Delete DeadMessages queue
aws sqs delete-queue --queue-url $(aws sqs get-queue-url --queue-name ExampleProject_DeadMessages --query QueueUrl --output text)
```
55 changes: 55 additions & 0 deletions example_project_CPG/config.py
@@ -0,0 +1,55 @@
# Constants (User configurable)

APP_NAME = 'ExampleCPG' # Used to generate derivative names unique to the application.

# DOCKER REGISTRY INFORMATION:
DOCKERHUB_TAG = 'erinweisbart/distributed-cellprofiler:workspace_bucket'

# AWS GENERAL SETTINGS:
AWS_REGION = 'us-east-1'
AWS_PROFILE = 'default' # The same profile used by your AWS CLI installation
SSH_KEY_NAME = 'your-key-file.pem' # Expected to be in ~/.ssh
AWS_BUCKET = 'your-bucket-name' # Bucket to use for logging
SOURCE_BUCKET = 'cellpainting-gallery' # Bucket to download image files from
WORKSPACE_BUCKET = 'your-bucket-name' # Bucket to download non-image files from
DESTINATION_BUCKET = 'your-bucket-name' # Bucket to upload files to

# EC2 AND ECS INFORMATION:
ECS_CLUSTER = 'default'
CLUSTER_MACHINES = 3
TASKS_PER_MACHINE = 1
MACHINE_TYPE = ['c4.xlarge']
MACHINE_PRICE = 0.13
EBS_VOL_SIZE = 22 # In GB. Minimum allowed is 22.
DOWNLOAD_FILES = 'True'

# DOCKER INSTANCE RUNNING ENVIRONMENT:
DOCKER_CORES = 1 # Number of CellProfiler processes to run inside a docker container
CPU_SHARES = DOCKER_CORES * 1024 # ECS computing units assigned to each docker container (1024 units = 1 core)
MEMORY = 7000 # Memory assigned to the docker container in MB
SECONDS_TO_START = 3*60 # Wait before the next CP process is initiated to avoid memory collisions

# SQS QUEUE INFORMATION:
SQS_QUEUE_NAME = APP_NAME + 'Queue'
SQS_MESSAGE_VISIBILITY = 10*60 # Timeout (secs) for messages in flight (average time to be processed)
SQS_DEAD_LETTER_QUEUE = 'ExampleProject_DeadMessages'

# LOG GROUP INFORMATION:
LOG_GROUP_NAME = APP_NAME

# CLOUDWATCH DASHBOARD CREATION
CREATE_DASHBOARD = 'True' # Create a dashboard in Cloudwatch for run
CLEAN_DASHBOARD = 'True' # Automatically remove dashboard at end of run with Monitor

# REDUNDANCY CHECKS
CHECK_IF_DONE_BOOL = 'False' #True or False- should it check if there are a certain number of non-empty files and delete the job if yes?
EXPECTED_NUMBER_FILES = 7 #What is the number of files that trigger skipping a job?
MIN_FILE_SIZE_BYTES = 1 #What is the minimal number of bytes an object should be to "count"?
NECESSARY_STRING = '' #Is there any string that should be in the file name to "count"?

# PLUGINS
USE_PLUGINS = 'False'
UPDATE_PLUGINS = 'False'
PLUGINS_COMMIT = '' # What commit or version tag do you want to check out?
INSTALL_REQUIREMENTS = 'False'
REQUIREMENTS_FILE = '' # Path within the plugins repo to a requirements file