# Samtools Flagstat
# What does this do?

### Run Samtools Flagstat
- Get list of sample IDs for all alignments.bam files on Google Cloud Storage (GCS)
- Get list of sample IDs for all flagstat.tsv files on GCS
- Compare lists to find samples missing flagstat.tsv files
- Use make-batch-tsv-from-input-sample.py to generate dsub task file
- Run dsub tasks to generate missing flagstat.tsv files

### Convert flagstat result files to CSV format
- Get list of all flagstat.tsv files
- Get list of all flagstat.tsv.csv files
- Convert each file list to a list of sample IDs
- Compare lists to find samples that are missing flagstat.tsv.csv files
- Use make-batch-tsv-from-input-file.py to generate dsub task file
- Run dsub to generate missing flagstat.tsv.csv files

### Concatenate CSV Files
- Get list of all flagstat.tsv.csv files
- Run dsub to concatenate all flagstat.tsv.csv files

# Code

## 0. Check that environment variables have been loaded correctly

Environment variables are imported from the mvp-profile.sh file. If the echo commands below print empty values, source mvp-profile.sh from the command console. If any of the values are incorrect, change them in mvp-profile.sh, save it, and source it again.

In [2]:
echo "Date stamp: ${date_stamp}"
echo "Home directory: ${mvp_hub}"
echo "Project: ${mvp_project}"
echo "Bucket: ${mvp_bucket}"
echo "Zone: ${mvp_zone}"

Date stamp: 20180615
Home directory: /Users/jinasong/Work/MVP/mvp-on-gcp
Project: gbsc-gcp-project-mvp
Bucket: gbsc-gcp-project-mvp-group/for-jina
Zone: us-*
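
The same check can be automated so that a missing value stops the workflow early. The following is a minimal sketch, not part of mvp-profile.sh: `check_mvp_env` is a hypothetical helper name, and the variable list is taken from the echo commands above.

```shell
# Hypothetical helper: report every required variable that is unset or empty.
check_mvp_env() {
    missing=0
    for name in date_stamp mvp_hub mvp_project mvp_bucket mvp_zone; do
        # Look up the value of the variable whose name is stored in $name.
        eval "value=\${${name}:-}"
        if [ -z "${value}" ]; then
            echo "Missing environment variable: ${name}" >&2
            missing=1
        fi
    done
    return "${missing}"
}
```

A possible use is `check_mvp_env || echo "Source mvp-profile.sh and try again"` at the top of the notebook.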


## 1. Run samtools flagstat

#### Create file accounting directory if it doesn't already exist

In [3]:
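accounting_dir="${mvp_hub}/flagstat/file-accounting/${date_stamp}"
mkdir -p "${accounting_dir}"
dsub_inputs_dir="${mvp_hub}/flagstat/dsub-inputs"
# The task files in sections 1 and 2 are written to these subdirectories
mkdir -p "${dsub_inputs_dir}/samtools" "${dsub_inputs_dir}/text-to-table"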

#### Get list of sample IDs for bam files that already exist on Google Cloud Storage

In [4]:
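# List all BAM files under each sample's Alignments directory
gsutil ls gs://${mvp_bucket}/*/Alignments/*.bam \
    > ${accounting_dir}/gs-bam-${date_stamp}.txt

# With the bucket shown above, field 7 of the path is the file name;
# the sample ID is the part of the file name before the first '.'
cut -d'/' -f7 ${accounting_dir}/gs-bam-${date_stamp}.txt \
    | cut -d'.' -f1 > ${accounting_dir}/gs-bam-sample-ids-${date_stamp}.txt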
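
The field number passed to `cut` depends on how many path components the bucket value contains. The sketch below walks through the split for a single made-up object path: the sample ID `SAMPLE001` and the file name are hypothetical, and only the bucket value comes from the output in section 0.

```shell
# Hypothetical object path; only the bucket part is taken from section 0.
path="gs://gbsc-gcp-project-mvp-group/for-jina/SAMPLE001/Alignments/SAMPLE001.alignments.bam"

# Splitting on '/' gives: 1="gs:", 2="" (between the two slashes),
# 3="gbsc-gcp-project-mvp-group", 4="for-jina", 5=sample directory,
# 6="Alignments", 7=file name.
file_name=$(printf '%s\n' "${path}" | cut -d'/' -f7)

# basename returns the last path component regardless of path depth.
file_name_by_basename=$(basename "${path}")

# Splitting the file name on '.' keeps everything before the first dot.
sample_id=$(printf '%s\n' "${file_name}" | cut -d'.' -f1)

printf '%s\n' "${file_name}" "${file_name_by_basename}" "${sample_id}"
```

Using `basename` instead of a fixed field number is one way to avoid re-checking `-f` whenever the bucket path changes depth.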

#### Get list of sample IDs for flagstat files that already exist on Google Cloud Storage

In [6]:
gsutil ls gs://${mvp_bucket}/dsub/flagstat/samtools/objects/*.flagstat.tsv \
    > ${accounting_dir}/gs-flagstat-${date_stamp}.txt
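# With the bucket shown above, field 9 of the object path is the file name
# (field 8 is the "objects" directory); adjust -f if the bucket path depth differs.
cut -d'/' -f9 ${accounting_dir}/gs-flagstat-${date_stamp}.txt \
    | cut -d'_' -f1 > ${accounting_dir}/gs-flagstat-sample-ids-${date_stamp}.txt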

CommandException: One or more URLs matched no objects.


#### Get difference between lists of sample IDs to find out which samples are missing flagstat files

In [8]:
diff --new-line-format="" --unchanged-line-format "" \
    <(sort ${accounting_dir}/gs-bam-sample-ids-${date_stamp}.txt) \
    <(sort ${accounting_dir}/gs-flagstat-sample-ids-${date_stamp}.txt) \
    > ${accounting_dir}/gs-flagstat-missing-sample-ids-${date_stamp}.txt
grep -F \
    -f ${accounting_dir}/gs-flagstat-missing-sample-ids-${date_stamp}.txt \
    ${accounting_dir}/gs-bam-${date_stamp}.txt \
    > ${accounting_dir}/gs-flagstat-missing-${date_stamp}.txt
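
The `--new-line-format` and `--unchanged-line-format` options are specific to GNU diff. Where GNU diff is not available, the same set difference can be sketched with `comm`. The example below runs on temporary files with made-up sample IDs (`S001` to `S003`); the file names are placeholders, not the accounting files used above.

```shell
work_dir=$(mktemp -d)

# Made-up sample IDs: three BAM files, one of which already has a flagstat file.
printf '%s\n' S003 S001 S002 > "${work_dir}/bam-ids.txt"
printf '%s\n' S002 > "${work_dir}/flagstat-ids.txt"

# comm requires sorted input.
sort "${work_dir}/bam-ids.txt" > "${work_dir}/bam-ids.sorted.txt"
sort "${work_dir}/flagstat-ids.txt" > "${work_dir}/flagstat-ids.sorted.txt"

# -23 suppresses lines found only in the second file and lines common to both,
# leaving the IDs that appear only in the BAM list.
comm -23 "${work_dir}/bam-ids.sorted.txt" "${work_dir}/flagstat-ids.sorted.txt" \
    > "${work_dir}/missing-ids.txt"

cat "${work_dir}/missing-ids.txt"
```

Note that `grep -F -f` matches substrings, so an ID that is a prefix of another ID selects both paths; this is worth keeping in mind if sample IDs vary in length.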

#### Create dsub TSV task file to generate missing flagstat files

In [9]:
${mvp_hub}/bin/make-batch-tsv-from-input-sample.py \
    -i ${accounting_dir}/gs-flagstat-missing-${date_stamp}.txt \
    -t ${dsub_inputs_dir}/samtools/gs-flagstat-missing-${date_stamp}.tsv \
    -o gs://${mvp_bucket}/dsub/flagstat/samtools/objects \
    -s alignments.bam.flagstat.tsv
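
For orientation, a dsub task file is a tab-separated table whose header row names one parameter per column, with one task per following row. The sketch below builds such a file by hand. It is an illustration only and does not reproduce make-batch-tsv-from-input-sample.py: the bucket `example-bucket`, the sample IDs, and the `<sample ID>_<suffix>` output naming are assumptions, while the `INPUT` and `OUTPUT` names are the ones the samtools command in the next step expects.

```shell
work_dir=$(mktemp -d)
output_dir="gs://example-bucket/dsub/flagstat/samtools/objects"  # placeholder bucket
suffix="alignments.bam.flagstat.tsv"

# Made-up input list, one BAM object path per line.
printf '%s\n' \
    "gs://example-bucket/S001/Alignments/S001.alignments.bam" \
    "gs://example-bucket/S002/Alignments/S002.alignments.bam" \
    > "${work_dir}/missing.txt"

tasks_file="${work_dir}/tasks.tsv"

# Header row: each column declares a per-task input or output parameter.
printf '%s\t%s\n' '--input INPUT' '--output OUTPUT' > "${tasks_file}"

# One row per BAM file; the output name is assumed to be <sample ID>_<suffix>.
while IFS= read -r bam_path; do
    sample_id=$(basename "${bam_path}" | cut -d'.' -f1)
    printf '%s\t%s/%s_%s\n' "${bam_path}" "${output_dir}" "${sample_id}" "${suffix}" \
        >> "${tasks_file}"
done < "${work_dir}/missing.txt"

cat "${tasks_file}"
```

Inspecting the first few rows of the generated task file this way before launching dsub is a cheap check that input and output paths line up.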

#### Run Samtools flagstat dsub tasks

In [10]:
dsub \
    --zones "${mvp_zone}" \
    --project ${mvp_project} \
    --logging gs://${mvp_bucket}/dsub/flagstat/samtools/logs/${date_stamp} \
    --image gcr.io/${mvp_project}/samtools \
    --disk-size 1000 \
    --command 'samtools flagstat ${INPUT} > ${OUTPUT}' \
    --tasks ${dsub_inputs_dir}/samtools/gs-flagstat-missing-${date_stamp}.tsv 

Job: samtools--jinasong--180601-163807-99
Launched job-id: samtools--jinasong--180601-163807-99
25 task(s)
To check the status, run:
  dstat --project gbsc-gcp-project-mvp --jobs 'samtools--jinasong--180601-163807-99' --status '*'
To cancel the job, run:
  ddel --project gbsc-gcp-project-mvp --jobs 'samtools--jinasong--180601-163807-99'
samtools--jinasong--180601-163807-99


## 2. Convert flagstat result files to CSV format

#### Get list of sample IDs for flagstat files that already exist on Google Cloud Storage

In [11]:
gsutil ls gs://${mvp_bucket}/dsub/flagstat/samtools/objects/*alignments.bam.flagstat.tsv \
    > ${accounting_dir}/gs-flagstat-tsv-${date_stamp}.txt
cut -d'/' -f9 ${accounting_dir}/gs-flagstat-tsv-${date_stamp}.txt \
    | cut -d'_' -f1 > ${accounting_dir}/gs-flagstat-tsv-sample-ids-${date_stamp}.txt

#### Get list of sample IDs for flagstat CSV files that already exist on Google Cloud Storage

In [12]:
gsutil ls gs://${mvp_bucket}/dsub/flagstat/text-to-table/objects/*alignments.bam.flagstat.tsv.csv \
    > ${accounting_dir}/gs-flagstat-csv-${date_stamp}.txt
# With the bucket shown above, field 9 of the object path is the file name
cut -d'/' -f9 ${accounting_dir}/gs-flagstat-csv-${date_stamp}.txt \
    | cut -d'_' -f1 > ${accounting_dir}/gs-flagstat-csv-sample-ids-${date_stamp}.txt

CommandException: One or more URLs matched no objects.


#### Get difference between lists of sample IDs to find out which samples are missing flagstat CSV files

In [13]:
diff \
    --new-line-format="" --unchanged-line-format "" \
    <(sort ${accounting_dir}/gs-flagstat-tsv-sample-ids-${date_stamp}.txt) \
    <(sort ${accounting_dir}/gs-flagstat-csv-sample-ids-${date_stamp}.txt) \
    > ${accounting_dir}/gs-flagstat-csv-sample-ids-missing-${date_stamp}.txt
grep -F \
    -f ${accounting_dir}/gs-flagstat-csv-sample-ids-missing-${date_stamp}.txt \
    ${accounting_dir}/gs-flagstat-tsv-${date_stamp}.txt \
    > ${accounting_dir}/gs-flagstat-csv-missing-${date_stamp}.txt

#### Convert file list to dsub TSV files

In [14]:
${mvp_hub}/bin/make-batch-tsv-from-input-file.py \
    -i ${accounting_dir}/gs-flagstat-csv-missing-${date_stamp}.txt \
    -t ${dsub_inputs_dir}/text-to-table/gs-flagstat-csv-missing-${date_stamp}.tsv \
    -o gs://${mvp_bucket}/dsub/flagstat/text-to-table/objects \
    -s csv \
    -c flagstat \
    -e flagstat-${date_stamp}

#### Run dsub task

In [15]:
dsub \
    --zones "${mvp_zone}" \
    --project ${mvp_project} \
    --logging gs://${mvp_bucket}/dsub/flagstat/text-to-table/logs/${date_stamp} \
    --image gcr.io/${mvp_project}/text-to-table:0.2.0 \
    --command 'text2table -s ${SCHEMA} -o ${OUTPUT} -v series=${SERIES},sample=${SAMPLE_ID} ${INPUT}' \
    --tasks ${dsub_inputs_dir}/text-to-table/gs-flagstat-csv-missing-${date_stamp}.tsv

Job: text2table--jinasong--180601-165623-91
Launched job-id: text2table--jinasong--180601-165623-91
25 task(s)
To check the status, run:
  dstat --project gbsc-gcp-project-mvp --jobs 'text2table--jinasong--180601-165623-91' --status '*'
To cancel the job, run:
  ddel --project gbsc-gcp-project-mvp --jobs 'text2table--jinasong--180601-165623-91'
text2table--jinasong--180601-165623-91


## 3. Concatenate CSV Files

#### Get new list of completed results files

In [19]:
gsutil ls gs://${mvp_bucket}/dsub/flagstat/text-to-table/objects/*alignments.bam.flagstat.tsv.csv \
    > ${accounting_dir}/gs-flagstat-csv-${date_stamp}.txt

#### Run dsub task

In [2]:
dsub \
    --zones "${mvp_zone}" \
    --project ${mvp_project} \
    --logging gs://${mvp_bucket}/dsub/flagstat/concat/logs/${date_stamp} \
    --image gcr.io/${mvp_project}/text-to-table:0.2.0 \
    --disk-size 100 \
    --input INPUT_FILES=gs://${mvp_bucket}/dsub/flagstat/text-to-table/objects/*alignments.bam.flagstat.tsv.csv \
    --output CONCAT_FILE=gs://${mvp_bucket}/dsub/flagstat/concat/objects/concat_alignments.bam.flagstat.tsv.csv \
    --command 'cat ${INPUT_FILES} > ${CONCAT_FILE}'

Job: cat--jinasong--180601-170320-76
Launched job-id: cat--jinasong--180601-170320-76
To check the status, run:
  dstat --project gbsc-gcp-project-mvp --jobs 'cat--jinasong--180601-170320-76' --status '*'
To cancel the job, run:
  ddel --project gbsc-gcp-project-mvp --jobs 'cat--jinasong--180601-170320-76'
cat--jinasong--180601-170320-76
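
Once the concatenated file has been copied back locally, a quick row count can confirm that nothing was dropped. The following is a sketch on made-up local files: the CSV contents and file names are placeholders, and it assumes every input file ends with a newline.

```shell
work_dir=$(mktemp -d)

# Made-up per-sample CSV files standing in for the text-to-table outputs.
printf 'S001,total,100\nS001,mapped,95\n' > "${work_dir}/S001.csv"
printf 'S002,total,200\nS002,mapped,180\n' > "${work_dir}/S002.csv"

# Same concatenation as the dsub command, run locally.
cat "${work_dir}"/S*.csv > "${work_dir}/concat.csv"

# Sum the line counts of the individual files.
expected=0
for csv_file in "${work_dir}"/S*.csv; do
    lines=$(wc -l < "${csv_file}")
    expected=$((expected + lines))
done

# Compare with the line count of the concatenated file.
actual=$(($(wc -l < "${work_dir}/concat.csv")))
if [ "${expected}" -eq "${actual}" ]; then
    echo "Row counts match: ${actual}"
else
    echo "Row count mismatch: expected ${expected}, got ${actual}" >&2
fi
```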
