Update data population sections of README

Sage-Bionetworks · Jan 28, 2023 · deb9b58
1 parent 7a5e4cd
Showing 1 changed file with 68 additions and 33 deletions.
diff --git a/README.md b/README.md
@@ -2,17 +2,23 @@
 [![Build Status](https://travis-ci.org/Sage-Bionetworks/Agora.svg?branch=develop)](https://travis-ci.org/Sage-Bionetworks/Agora)
 [![GitHub version](https://badge.fury.io/gh/Sage-Bionetworks%2FAgora.svg)](https://badge.fury.io/gh/Sage-Bionetworks%2FAgora)
 
-# Agora BETA
+# Agora
 
 ## Prerequisites
 
 What you need to run this app:
 
 - `node` and `npm` (`brew install node`)
-  -- Ensure you're running the latest versions Node `v16.x.x`+ and NPM `8.x.x`+
-- [MongoDB](https://www.mongodb.com/docs/manual/administration/install-community/)
-  -- You can use a GUI like [Compass](https://www.mongodb.com/products/compass)
-  > If you have `nvm` installed, which is highly recommended (`brew install nvm`) you can do a `nvm install --lts && nvm use` in `$` to run with the latest Node LTS. You can also have this `zsh` done for you [automatically](https://github.com/creationix/nvm#calling-nvm-use-automatically-in-a-directory-with-a-nvmrc-file)
+
+  - Ensure you're running Node `v16.x.x`+ and NPM `8.x.x`+
+- A [MongoDB](https://www.mongodb.com/docs/manual/administration/install-community/) instance running on your local machine
+
+- You can optionally use a GUI like [Compass](https://www.mongodb.com/docs/compass/current/) or [Studio3T](https://studio3t.com/knowledge-base/articles/installation/) with your local database
+
+  - Note that only Studio3T is compatible with the [AWS DocumentDB](https://docs.aws.amazon.com/documentdb/latest/developerguide/what-is.html) instances in Agora's dev, stage and prod environments. Either GUI tool will work with your local Mongo instance.
+
+
+> If you have `nvm` installed, which is highly recommended (`brew install nvm`), you can run `nvm install --lts && nvm use` in your shell to run with the latest Node LTS. If you use `zsh`, you can also have this done for you [automatically](https://github.com/creationix/nvm#calling-nvm-use-automatically-in-a-directory-with-a-nvmrc-file)
 
 ## Getting Started
 
@@ -35,62 +41,89 @@ You will need to create a MongoDB database and name it `agora`.
 
 - [Using the MongoDB Shell](https://www.mongodb.com/basics/create-database#option-2)
 - [Using MongoDB Compass](https://www.mongodb.com/basics/create-database#option-3)
+- [Using Studio3T](https://studio3t.com/knowledge-base/articles/common-mongodb-commands/#1-mongodb-create-database)
 
 Note: You can use the following scripts to start the database:
 
 ```
 # Linux and MacOS
 npm run mongo:start
+
 # Windows
 npm run mongo:start:windows
 ```
 
-### 3 - Import the data
+### 3 - Populate database
 
-The following commands will download the data files and all the team images. You can download all of them using the `synapseclient`. Install the package manager `pip` [here](https://bootstrap.pypa.io/get-pip.py). After that, install the `synapseclient` using the following command:
+Agora's data is stored in JSON files in the [Agora Synapse project](https://www.synapse.org/#!Synapse:syn11850457/files/), in the following subfolders:
+* [Agora Live Data](https://www.synapse.org/#!Synapse:syn12177492) - This folder contains all production data releases, as well as data releases that were never released to production
+* [Agora Testing Data](https://www.synapse.org/#!Synapse:syn17015333) - This folder contains test data releases that may not be fully validated
+* [Exploratory Data](https://www.synapse.org/#!Synapse:syn50612175) - This folder contains exploratory data files, and subfolders for data releases generated locally via the [agora-data-tools](https://github.com/Sage-Bionetworks/agora-data-tools) ETL tool
+* [Mock Data](https://www.synapse.org/#!Synapse:syn30602404) - This folder is reserved for future testing efforts
 
-```bash
-pip install synapseclient
-```
+The image files surfaced on Agora's Teams page are stored in Synapse [here](https://www.synapse.org/#!Synapse:syn12861877); there is only one set of image files, and the most recent version is always used. The image files aren't considered part of a data release.
 
-To get the data files using credentials provided by AWS, run:
+The contents of a given data release are defined by a specific version of a `data_manifest.json` file. The manifest file lists the synID and version of each data file in the release. The manifest files generated by the ETL framework are uploaded to the same Synapse folder as the data files they reference.
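
As a rough illustration of how a manifest pins a release, the sketch below turns manifest rows into `synapse get` download commands, one per data file. It assumes a CSV manifest with a header row and exactly two columns, `id` and `version`; that layout, the function name `manifest_to_commands`, and the synIDs in the usage note are hypothetical, so check a real manifest before relying on it. The function only prints commands and does not contact Synapse.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: print one `synapse get` command per manifest row.
# ASSUMPTION: the manifest is CSV with a header row and columns id,version.
# The real manifest layout may differ; adjust the field handling to match.

manifest_to_commands() {  # manifest_to_commands [download-dir] < manifest.csv
  local dest="${1:-./local/data}"
  local id version
  # Drop the header row and any Windows carriage returns, then read each row.
  tail -n +2 | tr -d '\r' | while IFS=, read -r id version || [ -n "$id" ]; do
    # Skip blank lines.
    [ -n "$id" ] || continue
    printf 'synapse get -v %s --downloadLocation %s %s\n' "$version" "$dest" "$id"
  done
}
```

For example, `printf 'id,version\nsyn111,3\n' | manifest_to_commands ./local/data` prints a single command that would fetch version 3 of the placeholder entity `syn111`. Printing rather than executing keeps the sketch safe to run without credentials; pipe the output to `sh` only after reviewing it.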
 
-```bash
-npm run data:local-aws
-```
+To populate your local database, you need to download the appropriate file(s), import them into Mongo, and then index the collections. Each of these steps can be achieved manually, or by using the scripts defined in this project.
 
-If you have your own Synapse credentials, you can run:
+If you want to load a set of data files that span multiple Synapse folders, you can do one of the following:
+* Load some or all of the data manually
+* Create a custom manifest file that defines the contents of the custom data release; custom manifest files should be uploaded only to the [Exploratory Data](https://www.synapse.org/#!Synapse:syn50612175) folder
 
-```bash
-npm run data:local
-```
+#### Manual data population
+
+It may make sense to populate your database manually when there is no manifest available for the specific set of files you want to load, and/or if you want to load only a small number of new, modified, or exploratory data files.
+
+Each of the required steps can be achieved manually:
+* Downloading specific files from Synapse using either the web interface or one of the [programmatic options supported by Synapse](https://help.synapse.org/docs/Downloading-Data-Programmatically.2003796248.html)
+* Ingesting specific json files from your local file system into the database using [MongoDB Compass](https://www.mongodb.com/docs/compass/current/import-export/), [Studio3T](https://studio3t.com/knowledge-base/articles/mongodb-import-json-csv-bson/#import-json-to-mongodb), or using the [mongoimport](https://www.mongodb.com/docs/database-tools/mongoimport/) command line utility
+* Ingesting specific image files from your local file system into the database using [MongoDB Compass](https://www.mongodb.com/docs/compass/current/import-export/), [Studio3T](https://studio3t.com/knowledge-base/articles/mongodb-gridfs/#add-a-new-file), or using the [mongofiles](https://www.mongodb.com/docs/database-tools/mongofiles/) command line utility
+* Indexing collections using a GUI tool like [MongoDB Compass](https://www.mongodb.com/docs/compass/current/indexes/), [Studio3T](https://studio3t.com/knowledge-base/articles/create-mongodb-index/#add-a-mongodb-index), or using the [mongosh](https://www.mongodb.com/docs/manual/indexes/) command line utility
 
-If you are on an AWS EC2 that has been granted access (e.g., for deployment) you can run:
+You can also combine any of the manual steps with the scripts that perform the other steps. 
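
The command line variants of the manual steps above could look roughly like the following sketch. It wraps `mongoimport`, `mongofiles`, and `mongosh` in small helpers that print the command by default and only execute it when `RUN=1` is set. The helper names, the `genes` collection, the `hgnc_symbol` index field, and the file paths are illustrative assumptions, not names taken from this repository; `--jsonArray` assumes each data file holds a single JSON array.

```shell
#!/usr/bin/env bash
# Dry-run sketch of manual ingestion and indexing against a local database.
# ASSUMPTIONS: collection names, index fields, and file paths passed to these
# helpers are placeholders; --jsonArray assumes the file is one JSON array.
DB="agora"

run() {
  # Print the command unless RUN=1, in which case execute it.
  if [ "${RUN:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi
}

import_json() {   # import_json <collection> <file>
  run mongoimport --db "$DB" --collection "$1" --file "$2" --jsonArray
}

import_image() {  # import_image <file>  (stores the file in GridFS)
  run mongofiles --db "$DB" put "$1"
}

create_index() {  # create_index <collection> <field>  (ascending index)
  run mongosh "$DB" --quiet --eval "db.$1.createIndex({ $2: 1 })"
}
```

Calling `import_json genes ./local/data/genes.json` prints the `mongoimport` command it would run; prefix the call with `RUN=1` once the printed command looks right for your data.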
 
+#### Using the data population scripts
+
+You can use the data population scripts defined in this repository to download, ingest, and index data. These steps can be performed individually by invoking the commands described in the following sections, or you can use a single command to perform all three steps.
+
+##### Prerequisites
+To populate data into your local database using the scripts defined in this project, you must:
+
+1. Install the [Mongo Database Tools](https://www.mongodb.com/docs/database-tools/installation/installation/)
+2. Install the package manager `pip` using the installer script available [here](https://bootstrap.pypa.io/get-pip.py).
+3. Use `pip` to install the `synapseclient` using the following command:
 ```bash
-npm run data:aws
+pip install synapseclient
 ```
+4. Create a Synapse PAT as described [here](https://help.synapse.org/docs/Managing-Your-Account.2055405596.html#ManagingYourAccount-PersonalAccessTokens)
+5. Add your PAT to `.synapseConfig` as described [here](https://python-docs.synapse.org/build/html/Credentials.html#use-synapseconfig)
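
For the last step, a minimal sketch of writing the config file is shown below. It assumes the `[authentication]` section with an `authtoken` key, which is the layout described in the linked credentials documentation; the helper name `write_synapse_config` is hypothetical. Note that it overwrites the target file, so back up any existing `.synapseConfig` first.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a minimal Synapse config containing only a PAT.
# WARNING: overwrites the target file; other settings in it would be lost.

write_synapse_config() {  # write_synapse_config <token> [path]
  local token="$1"
  local path="${2:-$HOME/.synapseConfig}"
  # Restrict permissions for newly created files, since this holds a secret.
  ( umask 077; printf '[authentication]\nauthtoken = %s\n' "$token" > "$path" )
}
```

A call such as `write_synapse_config "$MY_PAT"` (with `MY_PAT` holding your token) writes to `~/.synapseConfig`; passing a second argument writes elsewhere, which is useful for checking the output before replacing a real config.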
 
-If the `aws` command fails in any of the scripts, you might be running the wrong version. To use `aws secretsmanager` you need the `aws cli` version to be `1.15.8` and upwards
+##### Provisioning a local database with a single command
+
+Use this command to sequentially download data and image files, ingest those files, and index the collections in your local db; you will be prompted to provide information about the manifest file that you want to use:
 
 ```bash
-aws --version
-# Example of incorrect version
-aws-cli/1.14.65 Python/2.7.9 Windows/8 botocore/1.9.18
+npm run data:local:mongo
 ```
 
-To manually update your version go to [this](https://docs.aws.amazon.com/cli/latest/userguide/cli-install-macos.html) link.
+###### Downloading data and image files
+
+Use this command to download data and image files from Synapse to the `./local/data` folder in this project; you will be prompted to provide information about the manifest file that you want to use:
 
-You should see all the data files and teams members pictures in the folders created by any of the scripts above.
+```bash
+npm run data:local
+```
 
-To add those images to our database, we are going to use the `mongofiles` executable. If you did not add mongo to your `PATH`, copy the images to the `Mongo` binary directory or run the executable remotely from the images directory (replace `mongofiles` in the next command for the binary path). If you have `Mongo` in your `PATH` use the following script command:
+###### Importing data and image files
+Use this command to import the data and image files in your ./local/data folder:
 
 ```bash
 # Imports all data files and team images
 npm run mongo:import
 ```
-
-To add indexes to your local database, use the following script command: 
+###### Indexing Mongo collections
+Use this command to add indexes: 
 
 ````bash
 # Creates indexes
@@ -192,17 +225,19 @@ by the CI system.
 
 ## Deployment for New Data (Updated 9/8/22)
 
-1. Ensure the new data file is available in the [Synapse Agora Live Data folder](https://www.synapse.org/#!Synapse:syn12177492).
+1. Ensure the new data files are available in the [Synapse Agora Live Data folder](https://www.synapse.org/#!Synapse:syn12177492).
 2. Determine the version number of the `data_manifest.csv` file to use for the data release:
    1. The manifest must specify the appropriate version of each json file for the data release 
    2. If a suitable `data_manifest.csv` does not exist, you can manually generate one and upload it to [Synapse](https://www.synapse.org/#!Synapse:syn13363290)
 3. Update data version in `data-manifest.json` in [Agora Data Manager](https://github.com/Sage-Bionetworks/agora-data-manager/). ([example](https://github.com/Sage-Bionetworks/agora-data-manager/commit/d9006f01ae01b6c896bdc075e02ae1b683ecfd65)):
    1. The version should match the version of the desired `data_manifest.csv` file in [Synapse](https://www.synapse.org/#!Synapse:syn13363290).
 4. If there is a new json file (i.e. not updating existing data):
-   1. add an entry for the new file to `import-data.sh`. ([example](https://github.com/Sage-Bionetworks/agora-data-manager/commit/d9006f01ae01b6c896bdc075e02ae1b683ecfd65))
-   2. add an entry for the new file to `./scripts/mongo-import.sh` in Agora (this repository)
+   1. add an entry for the new file to agora-data-manager's `import-data.sh` script. ([example](https://github.com/Sage-Bionetworks/agora-data-manager/pull/66/files))
+   2. add an entry for the new collection to agora-data-manager's `create-indexes.sh` script ([example](https://github.com/Sage-Bionetworks/agora-data-manager/pull/60/files))
+   3. add an entry for the new file to `./scripts/mongo-import.sh` in Agora (this repository)
+   4. add an entry for the new collection to `./scripts/mongo-create-indexes.js` (this repository)
 5. Merge your changes to [Agora Data Manager](https://github.com/Sage-Bionetworks/agora-data-manager/) to the develop branch.
-6. Verify new data is in the database in the develop environment.
+6. Verify new data is in the database in the develop environment; see [Agora environments](https://sagebionetworks.jira.com/wiki/spaces/AGORA/pages/2632745039/Agora+environments) for information about connecting to our AWS DocumentDB instances
 7. Update `data-version` in `package.json` in Agora (this repository). ([example](https://github.com/Sage-Bionetworks/Agora/pull/847/files)) The version should match the `data_manifest.csv` file in [Synapse](https://www.synapse.org/#!Synapse:syn13363290). Then merge the change to [Agora's develop branch](https://agora-develop.ampadportal.org/genes).
 8. Check new data shows up on [Agora's dev branch](https://agora-develop.adknowledgeportal.org).
 9. Check new data version shows up in the footer on [Agora's dev branch](https://agora-develop.adknowledgeportal.org).