# Decrypt & Extract
# What does this do?

## Run accounting script to decrypt samples
- Get list of encrypted sample archives (tar.pgp)
- For each archive, check whether a corresponding directory exists
- If not, use dsub to launch decryption job

## Perform integrity check
- Get a list of all samples
- Create dsub input file to run integrity check
- Launch dsub jobs
- Wait for them to finish

## Get list of samples that failed integrity check

## Remedy failed samples




# Files

* **global-dir-accountant.sh**: Get a list of encrypted tar.pgp archives. For each archive, check whether a corresponding sample directory exists. If no corresponding directory is found, launch a dsub job to decrypt and extract that tar.pgp archive.
* **sample-dir-accountant.sh**: Same function as global-dir-accountant.sh, except that the user specifies a single sample to investigate.
* **make-integrity-tsv-from-file.sh**: Create a dsub TSV input file to perform integrity checks on a provided list of input samples.
* **call-integrity-dsub.sh**: Launch dsub jobs to perform integrity checks on all samples listed in TSV input file.
* **integrity-check.sh**: Dsub script that performs the integrity check on a given sample.
* **list-failed-samples.sh**: Get a unique list of samples that included files that were either missing or failed the sha1 check.

## 0. Check that environment variables have been loaded correctly

Environment variables are imported from the mvp-profile.sh file. If the echo commands below print empty values, source the file from the command line. If any value is incorrect, change it in mvp-profile.sh, save the file, and source it again.

In [12]:
source /Users/pbilling/Documents/GitHub/test/mvp-on-gcp/mvp-profile.sh
echo "Date stamp: ${date_stamp}"
echo "Home directory: ${mvp_hub}"
echo "Project: ${mvp_project}"
echo "Bucket: ${mvp_bucket}"
echo "Zone: ${mvp_zone}"

# Decryption specific variables
echo "Max dsub jobs: ${dsub_max_jobs}"
echo "Encrypted archives path: ${mvp_tar_pgp_path}"
echo "Decrypted samples directory root: ${mvp_samples_path}"
echo "Decryption passphrase file path: ${mvp_decrypt_pass}"
echo "Decryption pair.asc file: ${mvp_decrypt_ascpair}"

Date stamp: 20171019
Home directory: /Users/pbilling/Documents/GitHub/test/mvp-on-gcp
Project: gbsc-gcp-project-mvp
Bucket: gbsc-gcp-project-mvp-phase-2-data
Zone: us-central1-*
Max dsub jobs: 1500
Encrypted archives path: gbsc-gcp-project-mvp-received-from-bina
Decrypted samples directory root: gbsc-gcp-project-mvp-phase-2-data/data/bina-deliverables
Decryption passphrase file path: gbsc-gcp-project-mvp-va_aaa/misc/keys/passphrase.txt
Decryption pair.asc file: gbsc-gcp-project-mvp-va_aaa/misc/keys/pair.asc
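
The visual check above can also be scripted. The following is a minimal sketch, assuming the variable names shown in the echo commands; the function name `check_mvp_vars` is hypothetical and is not part of the repository.

```shell
# Report every profile variable that is unset or empty.
# Returns non-zero if at least one variable is missing.
check_mvp_vars() (
    missing=0
    for var in date_stamp mvp_hub mvp_project mvp_bucket mvp_zone \
               dsub_max_jobs mvp_tar_pgp_path mvp_samples_path \
               mvp_decrypt_pass mvp_decrypt_ascpair; do
        # Look up the value of the variable whose name is stored in $var
        eval "value=\${${var}:-}"
        if [ -z "${value}" ]; then
            echo "Missing or empty: ${var}" >&2
            missing=1
        fi
    done
    [ "${missing}" -eq 0 ]
)
```

Running `check_mvp_vars || echo "Source mvp-profile.sh again"` after sourcing the profile would flag an incomplete environment before any dsub job is launched.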


## 1. Run accounting script to decrypt & extract samples

In [7]:
${mvp_hub}/decrypt/scripts/global-dir-accountant.sh -d 200

##--- Command-line options
# disk_size: 200GB
##---

##--- General environment variables
# date_stamp:  20171018
# mvp_hub:     /Users/pbilling/Documents/GitHub/test/mvp-on-gcp
# mvp_project: gbsc-gcp-project-mvp
# mvp_bucket:  gbsc-gcp-project-mvp-phase-2-data
# mvp_zone:    us-central1-*
##---

##--- Decryption-specific variables
# dsub_max_jobs:       1500
# mvp_tar_pgp_path:    gbsc-gcp-project-mvp-received-from-bina
# mvp_samples_path:    gbsc-gcp-project-mvp-phase-2-data/data/bina-deliverables
# mvp_decrypt_pass:    gbsc-gcp-project-mvp-va_aaa/misc/keys/passphrase.txt
# mvp_decrypt_ascpair: gbsc-gcp-project-mvp-va_aaa/misc/keys/pair.asc
##---

##--- Getting list of tar.pgp files
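
The accounting step compares archives against sample directories. The sketch below illustrates that comparison on two local text files and is not the actual logic of global-dir-accountant.sh. It assumes each archive is named `<sample>.tar.pgp` and that the directory list holds one sample name per line; the function name `find_undecrypted` and both input files are hypothetical.

```shell
# find_undecrypted ARCHIVES_FILE DIRS_FILE
# Print the name of every sample that has an archive but no directory.
find_undecrypted() (
    archives_file="$1"
    dirs_file="$2"
    while IFS= read -r archive; do
        # Skip blank lines
        [ -n "${archive}" ] || continue
        # Strip the leading path and the .tar.pgp suffix to get the sample name
        sample="$(basename "${archive}" .tar.pgp)"
        # -q quiet, -x whole-line match, -F fixed string (no regex)
        if ! grep -qxF "${sample}" "${dirs_file}"; then
            echo "${sample}"
        fi
    done < "${archives_file}"
)
```

Each name printed by this function would correspond to one decryption job that still needs to be launched.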



## 2. Perform integrity check

#### Create file accounting directory if it does not exist

In [11]:
accounting_dir="${mvp_hub}/decrypt/file-accounting/${date_stamp}"
mkdir -p "${accounting_dir}"

#### Get a list of all samples

In [None]:
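# mvp_samples_path has no gs:// prefix (see section 0), so add it for gsutil.
# With the prefix, field 6 of each listed URL is the sample name.
gsutil ls "gs://${mvp_samples_path}" > ${accounting_dir}/${date_stamp}-sample-paths.txt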
cut -d'/' -f6 ${accounting_dir}/${date_stamp}-sample-paths.txt \
    > ${accounting_dir}/${date_stamp}-samples.txt

# Validate that number of samples looks right
wc -l ${accounting_dir}/${date_stamp}-samples.txt
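
The line count only confirms how many entries were written. As a hedged complement, the sketch below also checks the list for blank lines and duplicated names, either of which could produce bad TSV rows later. The function name `validate_sample_list` is hypothetical.

```shell
# validate_sample_list SAMPLES_FILE
# Succeeds only if the file has no blank lines and no duplicated names.
validate_sample_list() (
    samples_file="$1"
    # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true"
    blank="$(grep -c '^[[:space:]]*$' "${samples_file}" || true)"
    # uniq -d prints one line per duplicated name in the sorted input
    dupes="$(sort "${samples_file}" | uniq -d | wc -l | tr -d ' ')"
    echo "blank lines: ${blank}, duplicated names: ${dupes}"
    [ "${blank}" -eq 0 ] && [ "${dupes}" -eq 0 ]
)
```

It could be called as `validate_sample_list ${accounting_dir}/${date_stamp}-samples.txt` before generating the TSV file.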

#### Create dsub input file to run integrity check

In [19]:
${mvp_hub}/decrypt/scripts/make-integrity-tsv-from-file.sh \
    -s ${accounting_dir}/${date_stamp}-samples.txt


##--- Command-line options
# samples_list: /Users/pbilling/Documents/GitHub/test/mvp-on-gcp/decrypt/file-accounting/20171019/20171019-samples.txt
##---

##--- General environment variables
# date_stamp:  20171019
# mvp_hub:     /Users/pbilling/Documents/GitHub/test/mvp-on-gcp
# mvp_project: gbsc-gcp-project-mvp
# mvp_bucket:  gbsc-gcp-project-mvp-phase-2-data
# mvp_zone:    us-central1-*
##---

##--- Decryption-specific variables
# samples_root: gbsc-gcp-project-mvp-phase-2-data/data/bina-deliverables
##---

##--- Integrity-specific variables
# reported_sizes_root: gs://gbsc-gcp-project-mvp-phase-2-data/objects
# integrity_output_roots: gs://gbsc-gcp-project-mvp-phase-2-data/dsub/integrity-check-objects/20171019
##---

# Generating TSV entry for each sample


#### Perform basic validation of newly created TSV file

In [22]:
wc -l ${accounting_dir}/${date_stamp}-integrity-check.tsv
head -n2 ${accounting_dir}/${date_stamp}-integrity-check.tsv

      11 /Users/pbilling/Documents/GitHub/test/mvp-on-gcp/decrypt/file-accounting/20171019/20171019-integrity-check.tsv
--input-recursive INPUT_PATH	--input REF_CSV	--output ACTUAL_SIZES	--output SIZE_MISSING	--output SIZE_PASS	--output SIZE_FAIL	--output SHA1_MISSING	--output SHA1_PASS	--output SHA1_FAIL
gbsc-gcp-project-mvp-phase-2-data/data/bina-deliverables/40013463	gs://gbsc-gcp-project-mvp-phase-2-data/objects/40013463.csv	gs://gbsc-gcp-project-mvp-phase-2-data/dsub/integrity-check-objects/20171019/check-sizes/40013463-sizes-actual.csv	gs://gbsc-gcp-project-mvp-phase-2-data/dsub/integrity-check-objects/20171019/check-sizes/40013463-sizes-missing.csv	gs://gbsc-gcp-project-mvp-phase-2-data/dsub/integrity-check-objects/20171019/check-sizes/40013463-sizes-pass.csv	gs://gbsc-gcp-project-mvp-phase-2-data/dsub/integrity-check-objects/20171019/check-sizes/40013463-sizes-fail.csv	gs://gbsc-gcp-project-mvp-phase-2-data/dsub/integrity-check-objects/20171019/check-sha1/40013463-sha1-missing.
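
Beyond `wc` and `head`, a structural check can catch rows with missing fields. This is a minimal sketch, assuming the only requirement is that every row has the same number of tab-separated columns as the header; the function name `check_tsv_columns` is hypothetical.

```shell
# check_tsv_columns TSV_FILE
# Report rows whose column count differs from the header row.
check_tsv_columns() {
    awk -F'\t' '
        NR == 1 { expected = NF; next }   # remember the header width
        NF != expected {
            printf "line %d: %d columns, expected %d\n", NR, NF, expected
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

For the file above, it could be run as `check_tsv_columns ${accounting_dir}/${date_stamp}-integrity-check.tsv`.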

#### Launch integrity check dsub jobs

In [None]:
${mvp_hub}/decrypt/scripts/call-integrity-dsub.sh \
    -t ${accounting_dir}/${date_stamp}-integrity-check.tsv \
    -d 800

#### Wait for integrity checks to finish
Get the job ID from the dsub output.

In [None]:
job_id=""  # paste the job ID printed by dsub here
dstat \
--project ${mvp_project} \
--jobs ${job_id} \
--wait \
--poll-interval 60
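
Because an empty job ID is easy to overlook, the wait step can be wrapped in a small guard. The sketch below reuses only the dstat flags shown above and assumes `mvp_project` is already set; the function name `wait_for_job` is hypothetical.

```shell
# wait_for_job JOB_ID
# Refuse to call dstat when no job ID was supplied.
wait_for_job() {
    job_id="$1"
    if [ -z "${job_id}" ]; then
        echo "job_id is empty; copy it from the dsub output first" >&2
        return 1
    fi
    # Block until the job finishes, polling every 60 seconds
    dstat \
        --project "${mvp_project}" \
        --jobs "${job_id}" \
        --wait \
        --poll-interval 60
}
```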

## 3. Get list of samples that failed integrity check

In [None]:
${mvp_hub}/decrypt/scripts/list-failed-samples.sh
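
For illustration only, the idea of collecting a unique list of failed samples can be sketched on a local copy of the integrity-check outputs. This is not the logic of list-failed-samples.sh. It assumes the result files are named `<sample>-sha1-missing.csv` and `<sample>-sha1-fail.csv`, and that a non-empty file means at least one file was missing or failed. Both assumptions, and the function name `list_failed_samples`, are hypothetical.

```shell
# list_failed_samples RESULTS_DIR
# Print each sample that has a non-empty sha1-missing or sha1-fail file.
list_failed_samples() (
    results_dir="$1"
    for f in "${results_dir}"/*-sha1-missing.csv "${results_dir}"/*-sha1-fail.csv; do
        # -s: file exists and is non-empty (also skips unmatched glob patterns)
        [ -s "${f}" ] || continue
        name="$(basename "${f}")"
        # Remove whichever suffix is present to recover the sample name
        name="${name%-sha1-missing.csv}"
        name="${name%-sha1-fail.csv}"
        echo "${name}"
    done | sort -u
)
```

A sample that appears in both the missing and fail files is printed once, because `sort -u` removes duplicates.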