9 changes: 5 additions & 4 deletions config.py
@@ -4,23 +4,24 @@
LOG_GROUP_NAME = APP_NAME

# DOCKER REGISTRY INFORMATION:
DOCKERHUB_TAG = 'cellprofiler/distributed-cellprofiler:2.0.0_4.1.3'
DOCKERHUB_TAG = 'cellprofiler/distributed-cellprofiler:2.0.0_4.2.4'

# AWS GENERAL SETTINGS:
AWS_REGION = 'us-east-1'
AWS_PROFILE = 'default' # The same profile used by your AWS CLI installation
SSH_KEY_NAME = 'your-key-file.pem' # Expected to be in ~/.ssh
AWS_BUCKET = 'your-bucket-name' # Bucket to use for logging
SOURCE_BUCKET = 'bucket-name' # Bucket to download files from
SOURCE_BUCKET = 'bucket-name' # Bucket to download image files from
WORKSPACE_BUCKET = 'bucket-name' # Bucket to download non-image files from
DESTINATION_BUCKET = 'bucket-name' # Bucket to upload files to
UPLOAD_FLAGS = '' # Any flags needed for upload to destination bucket

# EC2 AND ECS INFORMATION:
ECS_CLUSTER = 'default'
CLUSTER_MACHINES = 3
TASKS_PER_MACHINE = 1
MACHINE_TYPE = ['m4.xlarge']
MACHINE_PRICE = 0.10
MACHINE_TYPE = ['m5.xlarge']
MACHINE_PRICE = 0.20
EBS_VOL_SIZE = 30 # In GB. Minimum allowed is 22.
DOWNLOAD_FILES = 'False'

4 changes: 3 additions & 1 deletion documentation/DCP-documentation/step_1_configuration.md
@@ -25,7 +25,9 @@ For more information and examples, see [External Buckets](external_buckets.md).

* **AWS_BUCKET:** The bucket to which you would like to write log files.
This is generally the bucket in the account in which you are running compute.
* **SOURCE_BUCKET:** The bucket where the files you will be reading are.
* **SOURCE_BUCKET:** The bucket where the image files you will be reading are.
Often, this is the same as AWS_BUCKET.
* **WORKSPACE_BUCKET:** The bucket where the non-image files you will be reading are (e.g. pipeline, load_data.csv, etc.).
Often, this is the same as AWS_BUCKET.
* **DESTINATION_BUCKET:** The bucket where you want to write your output files.
Often, this is the same as AWS_BUCKET.
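
A quick way to confirm that the AWS profile named in config.py can actually reach each of these buckets is a simple listing (a sketch; substitute your real bucket names and profile):

```bash
# Confirm the configured profile can see a bucket referenced in config.py
aws s3 ls s3://your-bucket-name --profile default
```
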
37 changes: 32 additions & 5 deletions example_project/README.md
@@ -1,3 +1,5 @@
# Distributed-CellProfiler Minimal Example

Included in this folder are all of the resources for running a complete mini-example of Distributed-CellProfiler.
It includes 3 sample image sets and a CellProfiler pipeline that identifies cells within the images and makes measurements.
It also includes the Distributed-CellProfiler files pre-configured to create a queue of all 3 jobs and spin up a spot fleet of 3 instances, each of which will process a single image set.
@@ -9,21 +11,23 @@ It also includes the Distributed-CellProfiler files pre-configured to create a queue
Before running this mini-example, you will need to set up your AWS infrastructure as described in our [online documentation](https://distributedscience.github.io/Distributed-CellProfiler/step_0_prep.html).
This includes creating the fleet file that you will use in Step 3.

Upload the 'sample_project' folder to the top level of your bucket.
Upload the 'sample_project' folder to the top level of your bucket.
While in the `Distributed-CellProfiler` folder, use the following command, replacing `yourbucket` with your bucket name:

```bash
# Copy example files to S3
BUCKET=yourbucket
aws s3 sync example_project/project_folder s3://${BUCKET}/project_folder
aws s3 sync example_project/demo_project_folder s3://${BUCKET}/demo_project_folder

# Replace the default config with the example config
cp example_project/config.py config.py
```

### Step 1

In config.py you will need to update the following fields specific to your AWS configuration:
```

```python
# AWS GENERAL SETTINGS:
AWS_REGION = 'us-east-1'
AWS_PROFILE = 'default' # The same profile used by your AWS CLI installation
@@ -32,17 +36,21 @@ AWS_BUCKET = 'your-bucket-name'
SOURCE_BUCKET = 'your-bucket-name' # Only differs from AWS_BUCKET with advanced configuration
DESTINATION_BUCKET = 'your-bucket-name' # Only differs from AWS_BUCKET with advanced configuration
```

Then run `python3 run.py setup`

### Step 2
This command points to the job file created for this demonstartion and should be run as-is.

This command points to the job file created for this demonstration and should be run as-is.
`python3 run.py submitJob example_project/files/exampleJob.json`

### Step 3

This command should point to whatever fleet file you created in Step 0, so you may need to update the `exampleFleet.json` file name.
`python3 run.py startCluster files/exampleFleet.json`
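
If you want to confirm the fleet actually came up, one optional check is to query its state (a sketch; replace the ID below with the spot fleet request ID that `startCluster` prints and stores in the monitor file):

```bash
# Check the state of the spot fleet request created by startCluster
aws ec2 describe-spot-fleet-requests \
    --spot-fleet-request-ids sfr-REPLACE-WITH-YOUR-ID \
    --query 'SpotFleetRequestConfigs[0].SpotFleetRequestState'
```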

### Step 4

This command points to the monitor file that is automatically created with your run and should be run as-is.
`python3 run.py monitor files/FlyExampleSpotFleetRequestId.json`

@@ -51,4 +59,23 @@
While the run is happening, you can watch real-time metrics in your Cloudwatch Dashboard by navigating in the [Cloudwatch console](https://console.aws.amazon.com/cloudwatch).
Note that the metrics update at intervals that may not be helpful with this fast, minimal example.
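
If the dashboard is quiet, you can also gauge progress from the queue itself; a minimal sketch, assuming this demo's APP_NAME of FlyExample (so the queue is named FlyExampleQueue):

```bash
# See how many jobs are still waiting (visible) or currently being processed (not visible)
QUEUE_URL=$(aws sqs get-queue-url --queue-name FlyExampleQueue --query QueueUrl --output text)
aws sqs get-queue-attributes --queue-url ${QUEUE_URL} \
    --attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible
```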

After the run is done, you should see your CellProfiler output files in S3 at s3://${BUCKET}/project_folder/output in per-image folders.
After the run is done, you should see your CellProfiler output files in S3 at s3://${BUCKET}/demo_project_folder/output in per-image folders.
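
For example, you can list everything the run produced with the AWS CLI (assuming `BUCKET` is still set from Step 0):

```bash
# List all CellProfiler output files produced by the demo run
aws s3 ls s3://${BUCKET}/demo_project_folder/output/ --recursive
```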

## Cleanup

The spot fleet, queue, and task definition will be automatically cleaned up after your demo is complete because you are running `monitor`.

To remove everything else:

```bash
# Remove files added to S3 bucket
BUCKET=yourbucket
aws s3 rm --recursive s3://${BUCKET}/demo_project_folder

# Remove Cloudwatch logs
aws logs delete-log-group --log-group-name FlyExample
aws logs delete-log-group --log-group-name FlyExample_perInstance

# Delete DeadMessages queue
aws sqs delete-queue --queue-url $(aws sqs get-queue-url --queue-name ExampleProject_DeadMessages --query QueueUrl --output text)
```
8 changes: 4 additions & 4 deletions example_project/files/exampleJob.json
@@ -1,9 +1,9 @@
{
"_comment1": "Paths in this file are relative to the root of your S3 bucket",
"pipeline": "project_folder/workspace/ExampleFly.cppipe",
"data_file": "project_folder/workspace/load_data.csv",
"input": "project_folder/workspace/",
"output": "project_folder/output",
"pipeline": "demo_project_folder/workspace/ExampleFly.cppipe",
"data_file": "demo_project_folder/workspace/load_data.csv",
"input": "demo_project_folder/workspace/",
"output": "demo_project_folder/output",
"output_structure": "Metadata_Position",
"_comment2": "The following groups are tasks, and each will be run in parallel",
"groups": [
82 changes: 82 additions & 0 deletions example_project_CPG/README.md
@@ -0,0 +1,82 @@
# CPG Example Project

Included in this folder are all of the resources for running a complete mini-example of Distributed-CellProfiler.
This example differs from the other example project in that it reads data hosted in the [Cell Painting Gallery](https://github.com/broadinstitute/cellpainting-gallery), a public data repository, instead of reading images from your own bucket.
Workspace files are hosted in your own S3 bucket, outputs are written to your bucket, and compute is performed in your account.
It includes the Distributed-CellProfiler files pre-configured to create a queue of 3 jobs and spin up a spot fleet of 3 instances, each of which will process a single image set.
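
Because the Cell Painting Gallery is a public bucket, you can browse the source data directly without credentials if you are curious (optional; not required for the demo):

```bash
# Browse the public Cell Painting Gallery bucket anonymously
aws s3 ls s3://cellpainting-gallery/ --no-sign-request
```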

## Running example project

### Step 0

Before running this mini-example, you will need to set up your AWS infrastructure as described in our [online documentation](https://distributedscience.github.io/Distributed-CellProfiler/step_0_prep.html).
This includes creating the fleet file that you will use in Step 3.

Upload the 'demo_project_folder' folder to the top level of your bucket.
While in the `Distributed-CellProfiler` folder, use the following command, replacing `yourbucket` with your bucket name:

```bash
# Copy example files to S3
BUCKET=yourbucket
aws s3 sync example_project_CPG/demo_project_folder s3://${BUCKET}/demo_project_folder

# Replace the default config with the example config
cp example_project_CPG/config.py config.py
```

### Step 1

In config.py you will need to update the following fields specific to your AWS configuration:

```python
# AWS GENERAL SETTINGS:
AWS_REGION = 'us-east-1'
AWS_PROFILE = 'default' # The same profile used by your AWS CLI installation
SSH_KEY_NAME = 'your-key-file.pem' # Expected to be in ~/.ssh
AWS_BUCKET = 'your-bucket-name'
WORKSPACE_BUCKET = 'your-bucket-name' # Only differs from AWS_BUCKET with advanced configuration
DESTINATION_BUCKET = 'your-bucket-name' # Only differs from AWS_BUCKET with advanced configuration
```

Then run `python run.py setup`
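
setup should create this demo's SQS queue, named from APP_NAME + 'Queue' (i.e. ExampleCPGQueue, given the config.py in this example); an optional way to confirm:

```bash
# Confirm the demo's SQS queue exists after running setup
aws sqs list-queues --queue-name-prefix ExampleCPGQueue
```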

### Step 2

This command points to the job file created for this demonstration and should be run as-is.
`python run.py submitJob example_project_CPG/files/exampleCPGJob.json`

### Step 3

This command should point to whatever fleet file you created in Step 0, so you may need to update the `exampleFleet.json` file name.
`python run.py startCluster files/exampleFleet.json`

### Step 4

This command points to the monitor file that is automatically created with your run and should be run as-is.
`python run.py monitor files/ExampleCPGSpotFleetRequestId.json`
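
While `monitor` runs, you can also tail the per-instance logs directly (requires AWS CLI v2; the log group names are the same ones removed in the Cleanup section below):

```bash
# Follow the per-instance CellProfiler logs for this demo in real time
aws logs tail ExampleCPG_perInstance --follow
```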

## Results

While a run is happening, you can watch real-time metrics in your Cloudwatch Dashboard by navigating in the [Cloudwatch console](https://console.aws.amazon.com/cloudwatch).
Note that the metrics update at intervals that may not be helpful with this fast, minimal example.

After the run is done, you should see your CellProfiler output files in your S3 bucket at s3://${BUCKET}/demo_project_folder/output in per-well-and-site folders.
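
A quick way to check what was written (assuming `BUCKET` is set as in Step 0 and that your job file writes under demo_project_folder/output):

```bash
# Summarize the demo's output objects and their total size
aws s3 ls s3://${BUCKET}/demo_project_folder/output/ --recursive --summarize
```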

## Cleanup

The spot fleet, queue, and task definition will be automatically cleaned up after your demo is complete because you are running `monitor`.

To remove everything else:

```bash
# Remove files added to S3 bucket
BUCKET=yourbucket
aws s3 rm --recursive s3://${BUCKET}/demo_project_folder

# Remove Cloudwatch logs
aws logs delete-log-group --log-group-name ExampleCPG
aws logs delete-log-group --log-group-name ExampleCPG_perInstance

# Delete DeadMessages queue
aws sqs delete-queue --queue-url $(aws sqs get-queue-url --queue-name ExampleProject_DeadMessages --query QueueUrl --output text)
```
55 changes: 55 additions & 0 deletions example_project_CPG/config.py
@@ -0,0 +1,55 @@
# Constants (User configurable)

APP_NAME = 'ExampleCPG' # Used to generate derivative names unique to the application.

# DOCKER REGISTRY INFORMATION:
DOCKERHUB_TAG = 'erinweisbart/distributed-cellprofiler:workspace_bucket'

# AWS GENERAL SETTINGS:
AWS_REGION = 'us-east-1'
AWS_PROFILE = 'default' # The same profile used by your AWS CLI installation
SSH_KEY_NAME = 'your-key-file.pem' # Expected to be in ~/.ssh
AWS_BUCKET = 'your-bucket-name' # Bucket to use for logging
SOURCE_BUCKET = 'cellpainting-gallery' # Bucket to download image files from
WORKSPACE_BUCKET = 'your-bucket-name' # Bucket to download non-image files from
DESTINATION_BUCKET = 'your-bucket-name' # Bucket to upload files to

# EC2 AND ECS INFORMATION:
ECS_CLUSTER = 'default'
CLUSTER_MACHINES = 3
TASKS_PER_MACHINE = 1
MACHINE_TYPE = ['c4.xlarge']
MACHINE_PRICE = 0.13
EBS_VOL_SIZE = 22 # In GB. Minimum allowed is 22.
DOWNLOAD_FILES = 'True'

# DOCKER INSTANCE RUNNING ENVIRONMENT:
DOCKER_CORES = 1 # Number of CellProfiler processes to run inside a docker container
CPU_SHARES = DOCKER_CORES * 1024 # ECS computing units assigned to each docker container (1024 units = 1 core)
MEMORY = 7000 # Memory assigned to the docker container in MB
SECONDS_TO_START = 3*60 # Wait before the next CP process is initiated to avoid memory collisions

# SQS QUEUE INFORMATION:
SQS_QUEUE_NAME = APP_NAME + 'Queue'
SQS_MESSAGE_VISIBILITY = 10*60 # Timeout (secs) for messages in flight (average time to be processed)
SQS_DEAD_LETTER_QUEUE = 'ExampleProject_DeadMessages'

# LOG GROUP INFORMATION:
LOG_GROUP_NAME = APP_NAME

# CLOUDWATCH DASHBOARD CREATION
CREATE_DASHBOARD = 'True' # Create a dashboard in Cloudwatch for run
CLEAN_DASHBOARD = 'True' # Automatically remove dashboard at end of run with Monitor

# REDUNDANCY CHECKS
CHECK_IF_DONE_BOOL = 'False' #True or False- should it check if there are a certain number of non-empty files and delete the job if yes?
EXPECTED_NUMBER_FILES = 7 #What is the number of files that trigger skipping a job?
MIN_FILE_SIZE_BYTES = 1 #What is the minimal number of bytes an object should be to "count"?
NECESSARY_STRING = '' #Is there any string that should be in the file name to "count"?

# PLUGINS
USE_PLUGINS = 'False'
UPDATE_PLUGINS = 'False'
PLUGINS_COMMIT = '' # What commit or version tag do you want to check out?
INSTALL_REQUIREMENTS = 'False'
REQUIREMENTS_FILE = '' # Path within the plugins repo to a requirements file