# Vcfstats
# What does this do?

### Run rtg-tools vcfstats
- Get list of all variants.gvcf.gz files
- Get list of all rtg_vcfstats.txt files
- Convert each to lists of sample IDs
- Compare lists to find samples that are missing rtg_vcfstats.txt files
- Use make-batch-tsv-from-input-sample.py to generate dsub task file
- Run dsub to generate missing rtg_vcfstats.txt files

### Convert vcfstats result files to CSV format
- Get list of all rtg_vcfstats.txt files
- Get list of all rtg_vcfstats.txt.csv files
- Convert each to lists of sample IDs
- Compare lists to find samples that are missing rtg_vcfstats.txt.csv files
- Use make-batch-tsv-from-input-file.py to generate dsub task file
- Run dsub to generate missing rtg_vcfstats.txt.csv files

### Concatenate CSV Files
- Get list of all rtg_vcfstats.txt.csv files
- Run dsub to concatenate all rtg_vcfstats.txt.csv files

# Code

## 0. Check that environment variables have been loaded correctly

Environment variables are imported from the mvp-profile.sh file. If the echo commands below print empty values, source mvp-profile.sh from the command console. If any of the values are incorrect, change them in mvp-profile.sh, save it, and source it again.

In [16]:
echo "Date stamp: ${date_stamp}"
echo "Home directory: ${mvp_hub}"
echo "Project: ${mvp_project}"
echo "Bucket: ${mvp_bucket}"
echo "Zone: ${mvp_zone}"

Date stamp: 20180507
Home directory: /Users/jinasong/Work/MVP/mvp-on-gcp
Project: gbsc-gcp-project-mvp
Bucket: gbsc-gcp-project-mvp-group/for-jina
Zone: us-*
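
An unset variable simply echoes as an empty string, which is easy to miss in the output above. The following is a minimal sketch of a stricter check, assuming the five variable names echoed in this cell are the ones that matter; the function name `check_mvp_env` is hypothetical and not part of mvp-profile.sh.

```shell
# Sketch: report which of the expected profile variables are unset or empty.
# The variable names come from the echo cell above; the helper name is hypothetical.
check_mvp_env() {
    missing=""
    for name in date_stamp mvp_hub mvp_project mvp_bucket mvp_zone; do
        # Indirect lookup through eval keeps this compatible with plain sh.
        eval "value=\${${name}:-}"
        if [ -z "${value}" ]; then
            missing="${missing} ${name}"
        fi
    done
    if [ -n "${missing}" ]; then
        echo "Unset variables:${missing}" >&2
        return 1
    fi
    echo "All profile variables are set."
}
```

Calling `check_mvp_env` after sourcing the profile returns a non-zero status if anything is missing, so it can gate the later cells.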


## 1. Run rtg-tools vcfstats

#### Create file accounting directory if it doesn't already exist

In [2]:
accounting_dir="${mvp_hub}/vcfstats/file-accounting/${date_stamp}"
mkdir -p ${accounting_dir}
dsub_inputs_dir="${mvp_hub}/vcfstats/dsub-inputs"
mkdir -p ${dsub_inputs_dir}

/Users/jinasong/Work/MVP/mvp-on-gcp/vcfstats/file-accounting/20180507


#### Get list of sample IDs for gvcf files that already exist on Google Cloud Storage

In [4]:
# For test, three samples
export mvp_bucket_orig="gbsc-gcp-project-mvp-phase-2-data"

#gsutil ls gs://${mvp_bucket_orig}/data/bina-deliverables/40013463/*/VariantCalling/variants.gvcf.gz \
#    > ${accounting_dir}/gs-gvcf-${date_stamp}.txt
#gsutil ls gs://${mvp_bucket_orig}/data/bina-deliverables/40050177/*/VariantCalling/variants.gvcf.gz \
#    >> ${accounting_dir}/gs-gvcf-${date_stamp}.txt
#gsutil ls gs://${mvp_bucket_orig}/data/bina-deliverables/40101045/*/VariantCalling/variants.gvcf.gz \
#    >> ${accounting_dir}/gs-gvcf-${date_stamp}.txt
    
#gsutil ls gs://${mvp_bucket_orig}/data/bina-deliverables/*/*/VariantCalling/variants.gvcf.gz \
#    >> ${accounting_dir}/gs-gvcf-${date_stamp}.txt
    
cut -d'/' -f6 ${mvp_hub}/vcfstats/file-accounting/${date_stamp}/gs-gvcf-${date_stamp}.txt \
    > ${accounting_dir}/gs-gvcf-sample-ids-${date_stamp}.txt
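
The field number passed to `cut` depends on how many `/`-separated components precede the sample ID. As a small illustration, using a gvcf path that appears later in this notebook, field 6 is the sample ID because `gs:` is field 1 and the empty string between the two slashes is field 2:

```shell
# Fields when splitting on '/':
#   1 "gs:"  2 ""  3 bucket  4 "data"  5 "bina-deliverables"  6 sample ID
gvcf_path="gs://gbsc-gcp-project-mvp-phase-2-data/data/bina-deliverables/400744881/2b2653a3-be13-49ee-94c8-44810268d9de/VariantCalling/variants.gvcf.gz"
sample_id=$(echo "${gvcf_path}" | cut -d'/' -f6)
echo "${sample_id}"
```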

In [None]:
#gsutil ls gs://${mvp_bucket_orig}/*/*/VariantCalling/variants.gvcf.gz > ${accounting_dir}/gs-gvcf-${date_stamp}.txt

#cut -d'/' -f6 ${mvp_hub}/vcfstats/file-accounting/${date_stamp}/gs-gvcf-${date_stamp}.txt \
#    > ${accounting_dir}/gs-gvcf-sample-ids-${date_stamp}.txt

#### Get list of sample IDs for vcfstats files that already exist on Google Cloud Storage

In [6]:
gsutil ls gs://${mvp_bucket_orig}/dsub/vcfstats/rtg-tools/objects/*_rtg_vcfstats.txt \
    > ${accounting_dir}/gs-vcfstats-rtg-${date_stamp}.txt
cut -d '/' -f8 ${accounting_dir}/gs-vcfstats-rtg-${date_stamp}.txt \
    | cut -d'_' -f1 > ${accounting_dir}/gs-vcfstats-rtg-sample-ids-${date_stamp}.txt

#### Get difference between lists of sample IDs to find out which samples are missing vcfstats files

In [7]:
diff --new-line-format="" --unchanged-line-format "" \
    <(sort ${accounting_dir}/gs-gvcf-sample-ids-${date_stamp}.txt) \
    <(sort ${accounting_dir}/gs-vcfstats-rtg-sample-ids-${date_stamp}.txt) \
    > ${accounting_dir}/gs-vcfstats-rtg-missing-sample-ids-${date_stamp}.txt
grep -F \
    -f ${accounting_dir}/gs-vcfstats-rtg-missing-sample-ids-${date_stamp}.txt \
    ${accounting_dir}/gs-gvcf-${date_stamp}.txt \
    > ${accounting_dir}/gs-vcfstats-rtg-missing-${date_stamp}.txt

: 1
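
The list comparison can be tried on toy data before running it against the bucket. This sketch swaps in `comm -23` for the `diff` invocation used in the cell (both print lines that occur only in the first sorted list) and wraps each missing ID in slashes before `grep -F`, so that an ID such as `1001` does not also match a path for sample `21001`. The bucket name, UUID, and sample IDs are made up for illustration.

```shell
workdir=$(mktemp -d)

# Toy list of gvcf paths (hypothetical bucket and sample IDs).
for id in 1003 1001 21001 1002; do
    echo "gs://example-bucket/data/bina-deliverables/${id}/uuid/VariantCalling/variants.gvcf.gz"
done > "${workdir}/gvcf-paths.txt"

# Sample IDs with a gvcf, and sample IDs that already have a vcfstats file.
cut -d'/' -f6 "${workdir}/gvcf-paths.txt" | sort > "${workdir}/gvcf-ids.txt"
printf '%s\n' 1002 21001 | sort > "${workdir}/vcfstats-ids.txt"

# Lines only in the first list: samples still missing a vcfstats file.
comm -23 "${workdir}/gvcf-ids.txt" "${workdir}/vcfstats-ids.txt" \
    > "${workdir}/missing-ids.txt"

# Anchor each ID between slashes so fixed-string matching hits whole path components.
sed 's|^|/|; s|$|/|' "${workdir}/missing-ids.txt" > "${workdir}/missing-patterns.txt"
missing_paths=$(grep -F -f "${workdir}/missing-patterns.txt" "${workdir}/gvcf-paths.txt")
echo "${missing_paths}"
```

Here the missing IDs are `1001` and `1003`, and only their two paths are selected.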

#### Create dsub TSV task file to generate missing vcfstats files

In [27]:
# For test,
# cp ${accounting_dir}/gs-gvcf-${date_stamp}.txt ${accounting_dir}/gs-vcfstats-rtg-missing-${date_stamp}.txt
echo "${accounting_dir}/gs-gvcf-${date_stamp}.txt"
echo "${accounting_dir}/gs-vcfstats-rtg-missing-${date_stamp}.txt"

${mvp_hub}/bin/make-batch-tsv-from-input-sample.py \
    -i ${accounting_dir}/gs-vcfstats-rtg-missing-${date_stamp}.txt \
    -t ${dsub_inputs_dir}/rtg-tools/gs-vcfstats-rtg-missing-${date_stamp}.tsv \
    -o gs://${mvp_bucket}/dsub/vcfstats/rtg-tools/objects \
    -s rtg_vcfstats.txt

/Users/jinasong/Work/MVP/mvp-on-gcp/vcfstats/file-accounting/20180507/gs-gvcf-20180507.txt
/Users/jinasong/Work/MVP/mvp-on-gcp/vcfstats/file-accounting/20180507/gs-vcfstats-rtg-missing-20180507.txt
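
A dsub tasks file is a tab-separated file whose header row names the parameters (for example `--input INPUT` and `--output OUTPUT`), with one task per following row. The sketch below shows roughly what a file for this step could look like. It is an assumption about the shape of the output, not a description of what make-batch-tsv-from-input-sample.py actually writes; the function name `make_tasks_tsv` and its argument order are hypothetical.

```shell
# Sketch: build a dsub tasks TSV from a list of gvcf paths.
#   $1: file with one gvcf path per line
#   $2: output object prefix (e.g. gs://bucket/dsub/vcfstats/rtg-tools/objects)
#   $3: output file suffix (e.g. rtg_vcfstats.txt)
make_tasks_tsv() {
    # Header row: column names must match the variables used in --command.
    printf '%s\t%s\n' '--input INPUT' '--output OUTPUT'
    while IFS= read -r gvcf_path; do
        # Assumes the sample ID is path field 6, as in the cut command earlier.
        sample_id=$(echo "${gvcf_path}" | cut -d'/' -f6)
        printf '%s\t%s/%s_%s\n' "${gvcf_path}" "$2" "${sample_id}" "$3"
    done < "$1"
}
```

For example, `make_tasks_tsv missing.txt gs://my-bucket/objects rtg_vcfstats.txt > tasks.tsv` would write one row per missing sample, with the output named `<sample_id>_rtg_vcfstats.txt`.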


#### Run RTG vcfstats dsub tasks


In [15]:
dsub \
    --zones "${mvp_zone}" \
    --project "${mvp_project}" \
    --logging "gs://${mvp_bucket}/dsub/vcfstats/rtg-tools/logs/${date_stamp}" \
    --image "gcr.io/${mvp_project}/rtg-tools:1.0" \
    --command 'rtg vcfstats ${INPUT} > ${OUTPUT}' \
    --tasks "${dsub_inputs_dir}/rtg-tools/gs-vcfstats-rtg-missing-${date_stamp}.tsv" 1-3

# Alternatives to the --tasks line above (kept for reference):
#--tasks ${dsub_inputs_dir}/rtg-tools/gs-bina-vcfstats-rtg-missing-test.tsv ${dsub_range}
#--tasks ${dsub_inputs_dir}/rtg-tools/gs-vcfstats-rtg-missing-${date_stamp}.tsv ${dsub_range}
# Single-sample run without a tasks file:
#--input INPUT=gs://gbsc-gcp-project-mvp-phase-2-data/data/bina-deliverables/400744881/2b2653a3-be13-49ee-94c8-44810268d9de/VariantCalling/variants.gvcf.gz
#--output OUTPUT=gs://gbsc-gcp-project-mvp-group/for-jina/dsub/vcfstats/rtg-tools/objects/400744881_rtg_vcfstats.txt

Job: rtg--jinasong--180507-221250-21
Launched job-id: rtg--jinasong--180507-221250-21
3 task(s)
To check the status, run:
  dstat --project gbsc-gcp-project-mvp --jobs 'rtg--jinasong--180507-221250-21' --status '*'
To cancel the job, run:
  ddel --project gbsc-gcp-project-mvp --jobs 'rtg--jinasong--180507-221250-21'
rtg--jinasong--180507-221250-21


## 2. Convert vcfstats result files to CSV format

#### Get list of sample IDs for vcfstats files that already exist on Google Cloud Storage

In [23]:
gsutil ls gs://${mvp_bucket}/dsub/vcfstats/rtg-tools/objects/*_rtg_vcfstats.txt \
    > ${accounting_dir}/gs-vcfstats-rtg-${date_stamp}.txt
cut -d'/' -f9 ${accounting_dir}/gs-vcfstats-rtg-${date_stamp}.txt \
    | cut -d'_' -f1 > ${accounting_dir}/gs-vcfstats-rtg-sample-ids-${date_stamp}.txt
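
The `cut` field here is 9 rather than 8 because `${mvp_bucket}` (`gbsc-gcp-project-mvp-group/for-jina`) itself contains a slash, which shifts the file name one position to the right. A position-independent alternative, offered only as a sketch, is to take the base name of the object first. The helper name `sample_id_from_path` is hypothetical, and the first path is an illustrative one built from the bucket name used in step 1.

```shell
# Hypothetical helper: sample ID from a vcfstats object path,
# independent of how many components the bucket prefix has.
sample_id_from_path() {
    basename "$1" | cut -d'_' -f1
}

id_orig=$(sample_id_from_path "gs://gbsc-gcp-project-mvp-phase-2-data/dsub/vcfstats/rtg-tools/objects/400744881_rtg_vcfstats.txt")
id_group=$(sample_id_from_path "gs://gbsc-gcp-project-mvp-group/for-jina/dsub/vcfstats/rtg-tools/objects/400744881_rtg_vcfstats.txt")
echo "${id_orig} ${id_group}"
```

Both calls yield the same sample ID even though the file name sits at field 8 in one path and field 9 in the other.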

#### Get list of sample IDs for vcfstats CSV files that already exist on Google Cloud Storage

In [21]:
gsutil ls gs://${mvp_bucket_orig}/dsub/vcfstats/text-to-table/objects/*_rtg_vcfstats.txt.csv \
    > ${accounting_dir}/gs-vcfstats-csv-${date_stamp}.txt
cut -d'/' -f8 ${accounting_dir}/gs-vcfstats-csv-${date_stamp}.txt \
    | cut -d'_' -f1 > ${accounting_dir}/gs-vcfstats-csv-sample-ids-${date_stamp}.txt

#### Get difference between lists of sample IDs to find out which samples are missing vcfstats CSV files

In [24]:
diff \
    --new-line-format="" \
    --unchanged-line-format "" \
    <(sort ${accounting_dir}/gs-vcfstats-rtg-sample-ids-${date_stamp}.txt) \
    <(sort ${accounting_dir}/gs-vcfstats-csv-sample-ids-${date_stamp}.txt) \
    > ${accounting_dir}/gs-vcfstats-csv-sample-ids-missing-${date_stamp}.txt
grep -F -f \
    ${accounting_dir}/gs-vcfstats-csv-sample-ids-missing-${date_stamp}.txt \
    ${accounting_dir}/gs-vcfstats-rtg-${date_stamp}.txt \
    > ${accounting_dir}/gs-vcfstats-csv-missing-${date_stamp}.txt

: 1

#### Convert file list to dsub TSV files

In [30]:
#For test,
#cp ${accounting_dir}/gs-vcfstats-rtg-${date_stamp}.txt ${accounting_dir}/gs-vcfstats-csv-missing-${date_stamp}.txt
#mkdir -p ${dsub_inputs_dir}/text-to-table

${mvp_hub}/bin/make-batch-tsv-from-input-file.py \
    -i ${accounting_dir}/gs-vcfstats-csv-missing-${date_stamp}.txt \
    -t ${dsub_inputs_dir}/text-to-table/gs-vcfstats-csv-missing-${date_stamp}.tsv \
    -o gs://${mvp_bucket}/dsub/vcfstats/text-to-table/objects \
    -s csv \
    -c rtg_vcfstats \
    -e rtg-vcfstats-${date_stamp}

#### Run dsub task

In [32]:
# ${dsub_range} is the task range to run (for example 1-3, as in step 1); set it before running this cell.
dsub \
    --zones "${mvp_zone}" \
    --project ${mvp_project} \
    --logging gs://${mvp_bucket}/dsub/vcfstats/text-to-table/logs/${date_stamp} \
    --image gcr.io/${mvp_project}/text-to-table:0.2.0 \
    --command 'text2table -s ${SCHEMA} -o ${OUTPUT} -v series=${SERIES},sample=${SAMPLE_ID} ${INPUT}' \
    --tasks ${dsub_inputs_dir}/text-to-table/gs-vcfstats-csv-missing-${date_stamp}.tsv ${dsub_range}
    #--dry-run

Job: text2table--jinasong--180508-094753-90
Launched job-id: text2table--jinasong--180508-094753-90
2 task(s)
To check the status, run:
  dstat --project gbsc-gcp-project-mvp --jobs 'text2table--jinasong--180508-094753-90' --status '*'
To cancel the job, run:
  ddel --project gbsc-gcp-project-mvp --jobs 'text2table--jinasong--180508-094753-90'
text2table--jinasong--180508-094753-90


## 3. Concatenate CSV Files

#### Get new list of completed results files

In [33]:
gsutil ls gs://${mvp_bucket}/dsub/vcfstats/text-to-table/objects/*_rtg_vcfstats.txt.csv \
    > ${accounting_dir}/gs-vcfstats-csv-${date_stamp}.txt

#### Run dsub task

In [1]:
dsub \
    --zones "${mvp_zone}" \
    --project ${mvp_project} \
    --logging gs://${mvp_bucket}/dsub/vcfstats/concat/logs/${date_stamp} \
    --image gcr.io/${mvp_project}/text-to-table:0.2.0 \
    --disk-size 100 \
    --input "INPUT_FILES=gs://${mvp_bucket}/dsub/vcfstats/text-to-table/objects/*_vcfstats.txt.csv" \
    --output CONCAT_FILE=gs://${mvp_bucket}/dsub/vcfstats/concat/concat_vcfstats.txt.csv \
    --command 'cat ${INPUT_FILES} > ${CONCAT_FILE}'
    # Optional flags tried during testing:
    #--dry-run
    #--vars-include-wildcards
Traceback (most recent call last):
  File "/usr/local/bin/dsub", line 5, in <module>
    from pkg_resources import load_entry_point
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/pkg_resources/__init__.py", line 3095, in <module>
    @_call_aside
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/pkg_resources/__init__.py", line 3081, in _call_aside
    f(*args, **kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/pkg_resources/__init__.py", line 3108, in _initialize_master_working_set
    working_set = WorkingSet._build_master()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/pkg_resources/__init__.py", line 658, in _build_master
    ws.require(__requires__)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/pkg_resources/__init__.py", line 959, in require
    needed = self.resolve(parse_requirements(requirements

: 1
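
The concatenation command relies on `${INPUT_FILES}` holding a wildcard path that the shell expands when `cat` runs, which is why the variable is left unquoted inside `--command`. The local sketch below imitates that behavior with made-up file names and contents. It assumes that dsub sets the variable to a local path ending in the same wildcard, and it writes the result to a separate directory so that the output file cannot match the input pattern.

```shell
workdir=$(mktemp -d)
mkdir -p "${workdir}/objects" "${workdir}/concat"

# Two toy per-sample CSV files (contents are made up).
printf 'a,1\n' > "${workdir}/objects/s1_rtg_vcfstats.txt.csv"
printf 'b,2\n' > "${workdir}/objects/s2_rtg_vcfstats.txt.csv"

# The variable stores the pattern itself; expansion happens where it is used unquoted.
INPUT_FILES="${workdir}/objects/*_vcfstats.txt.csv"
CONCAT_FILE="${workdir}/concat/concat_vcfstats.txt.csv"

cat ${INPUT_FILES} > "${CONCAT_FILE}"
cat "${CONCAT_FILE}"
```

The pattern `*_vcfstats.txt.csv` also matches files named `*_rtg_vcfstats.txt.csv`, so it selects the same objects listed in the previous cell.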