# FastQC
# What does this do?

### Run FastQC
- Get list of sample IDs for all alignments.bam files on Google Cloud Storage (GCS)
- Get list of sample IDs for all alignments.bam.fastqc_data.txt files on GCS
- Compare lists to find samples missing fastqc data files
- Use make-batch-tsv-from-input-sample.py to generate dsub task file
- Run dsub to generate missing alignments.bam.fastqc_data.txt files

### Convert FastQC result files to CSV format
- Get list of all fastqc_data.txt files
- Get list of all fastqc_data.txt.csv files
- Convert each to lists of sample IDs
- Compare lists to find samples that are missing fastqc_data.txt.csv files
- Use make-batch-tsv-from-input-file.py to generate dsub task file
- Run dsub to generate missing fastqc_data.txt.csv files

### Concatenate CSV Files
- Get list of all fastqc_data.txt.csv files
- Run dsub to concatenate all fastqc_data.txt.csv files

# Code

## 0. Check that environment variables have been loaded correctly

Environment variables are loaded from the mvp-profile.sh file. If the echo commands below print empty values, source the file from the command console and rerun the cell. If any value is incorrect, change it in mvp-profile.sh, save the file, and source it again.

In [1]:
echo "Date stamp: ${date_stamp}"
echo "Home directory: ${mvp_hub}"
echo "Project: ${mvp_project}"
echo "Data_Bucket: ${mvp_data_bucket}"
echo "Anal_Bucket: ${mvp_anal_bucket}"
echo "Zone: ${mvp_zone}"

Date stamp: 20180730
Home directory: /Users/jinasong/Work/MVP/mvp-on-gcp
Project: gbsc-gcp-project-mvp
Data_Bucket: gbsc-gcp-project-mvp-from-personalis
Anal_Bucket: gbsc-gcp-project-mvp-from-personalis-qc
Zone: us-*
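
The visual check above can also be scripted so that an empty variable is reported explicitly. The following is a minimal sketch, not part of the original workflow: `check_vars` and the `demo_*` variables are hypothetical names used only for illustration, so the example does not touch the real `mvp_*` values.

```shell
# Hypothetical helper: report every named variable that is unset or empty.
check_vars() {
    _missing=0
    for _name in "$@"; do
        # Indirect lookup that also works in plain POSIX sh.
        eval "_value=\${$_name:-}"
        if [ -z "$_value" ]; then
            echo "Not set: $_name" >&2
            _missing=1
        fi
    done
    return "$_missing"
}

# Illustration values only; the real ones come from mvp-profile.sh.
demo_project="example-project"
demo_zone="us-*"
unset demo_missing

if check_vars demo_project demo_zone; then
    echo "All variables are set"
fi
check_vars demo_project demo_missing || echo "Fix mvp-profile.sh and source it again"
```

In the notebook itself the call would presumably be `check_vars date_stamp mvp_hub mvp_project mvp_data_bucket mvp_anal_bucket mvp_zone`.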


## 1. Run FastQC

#### Create file accounting directory if it doesn't already exist

In [6]:
date_stamp=20180727
echo "Date stamp: ${date_stamp}"

accounting_dir="${mvp_hub}/fastqc-bam/file-accounting/${date_stamp}"
mkdir -p ${accounting_dir}
dsub_inputs_dir="${mvp_hub}/fastqc-bam/dsub-inputs"
mkdir -p ${dsub_inputs_dir}

Date stamp: 20180727


#### Get list of sample IDs for BAM files that already exist on Google Cloud Storage

In [3]:
gsutil ls gs://${mvp_data_bucket}/*/Alignments/*.bam \
    > ${accounting_dir}/gs-bam-${date_stamp}.txt

cut -d'/' -f6 ${accounting_dir}/gs-bam-${date_stamp}.txt \
    | cut -d'.' -f1 > ${accounting_dir}/gs-bam-sample-ids-${date_stamp}.txt # with chr #
   # > ${accounting_dir}/gs-bam-sample-ids-${date_stamp}.txt

#### Get list of sample IDs for FastQC files that already exist on Google Cloud Storage

In [4]:
gsutil ls gs://${mvp_anal_bucket}/dsub/fastqc-bam/fastqc/objects/*_alignments.bam.fastqc_data.txt \
    > ${accounting_dir}/gs-fastqc-data-${date_stamp}.txt
cut -d '/' -f8 ${accounting_dir}/gs-fastqc-data-${date_stamp}.txt \
    | cut -d'_' -f1-3 > ${accounting_dir}/gs-fastqc-data-sample-ids-${date_stamp}.txt
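
The field numbers passed to `cut` depend on how many `/`-separated parts precede the file name. The sketch below walks through both pipelines on made-up object paths. The bucket names and file names are assumptions that only mimic the layout of the listings above, and real names may differ.

```shell
# Made-up paths for illustration only.
bam_path="gs://example-data-bucket/ORDER_1/Alignments/AAA_BBB_CCC.bam"
fastqc_path="gs://example-qc-bucket/dsub/fastqc-bam/fastqc/objects/AAA_BBB_CCC_alignments.bam.fastqc_data.txt"

# Splitting on '/' gives: 1="gs:", 2="" (between the two slashes),
# 3=bucket, and the remaining path parts from field 4 onward.
bam_id=$(echo "$bam_path" | cut -d'/' -f6 | cut -d'.' -f1)
fastqc_id=$(echo "$fastqc_path" | cut -d'/' -f8 | cut -d'_' -f1-3)

echo "BAM sample ID:    $bam_id"
echo "FastQC sample ID: $fastqc_id"
```

Both pipelines must yield IDs in the same format, because the next step compares the two lists line by line.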

#### Get difference between lists of sample IDs to find out which samples are missing FastQC files

In [5]:
diff --new-line-format="" --unchanged-line-format "" \
    <(sort ${accounting_dir}/gs-bam-sample-ids-${date_stamp}.txt) \
    <(sort ${accounting_dir}/gs-fastqc-data-sample-ids-${date_stamp}.txt) \
    > ${accounting_dir}/gs-fastqc-data-missing-sample-ids-${date_stamp}.txt
grep -F \
    -f ${accounting_dir}/gs-fastqc-data-missing-sample-ids-${date_stamp}.txt \
    ${accounting_dir}/gs-bam-${date_stamp}.txt \
    > ${accounting_dir}/gs-fastqc-data-missing-${date_stamp}.txt
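
The set difference and the lookup back to full paths can be tried locally on a few made-up IDs. This sketch swaps in `comm -23` for the `diff` line-format options used above; on sorted input without duplicates it should give the same "only in the first list" result. All file names and IDs here are hypothetical.

```shell
work_dir=$(mktemp -d)

# Made-up sample IDs and object paths for illustration.
printf '%s\n' S_0_3 S_0_1 S_0_2 > "$work_dir/bam-ids.txt"
printf '%s\n' S_0_2 > "$work_dir/fastqc-ids.txt"
printf '%s\n' \
    "gs://example-bucket/X/Alignments/S_0_1.bam" \
    "gs://example-bucket/X/Alignments/S_0_2.bam" \
    "gs://example-bucket/X/Alignments/S_0_3.bam" \
    > "$work_dir/bam-paths.txt"

sort "$work_dir/bam-ids.txt" > "$work_dir/bam-ids.sorted.txt"
sort "$work_dir/fastqc-ids.txt" > "$work_dir/fastqc-ids.sorted.txt"

# comm -23 keeps lines that appear only in the first sorted file.
comm -23 "$work_dir/bam-ids.sorted.txt" "$work_dir/fastqc-ids.sorted.txt" \
    > "$work_dir/missing-ids.txt"

# Map the missing IDs back to full paths. grep -F matches fixed substrings,
# so an ID that is a prefix of another ID would also match the longer one.
grep -F -f "$work_dir/missing-ids.txt" "$work_dir/bam-paths.txt" \
    > "$work_dir/missing-paths.txt"

cat "$work_dir/missing-paths.txt"
```

The substring behaviour of `grep -F` is worth keeping in mind if sample IDs can be prefixes of one another.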

#### Create dsub TSV task file to generate missing FastQC data files

In [6]:
${mvp_hub}/bin/make-batch-tsv-from-input-sample.py \
    -i ${accounting_dir}/gs-fastqc-data-missing-${date_stamp}.txt \
    -t ${dsub_inputs_dir}/fastqc/gs-fastqc-data-missing-${date_stamp}.tsv \
    -o gs://${mvp_anal_bucket}/dsub/fastqc-bam/fastqc/objects \
    -s alignments.bam.fastqc_data.txt
    #-s all/*

#### Run FastQC dsub tasks

In [8]:
dsub \
    --zones ${mvp_zone} \
    --project ${mvp_project} \
    --logging gs://${mvp_anal_bucket}/dsub/fastqc-bam/fastqc/logs/${date_stamp} \
    --image gcr.io/${mvp_project}/fastqc:1.01 \
    --disk-size 500 \
    --script ${mvp_hub}/fastqc-bam/dsub-scripts/fastqc.sh \
    --tasks ${mvp_hub}/fastqc-bam/dsub-inputs/fastqc/gs-fastqc-data-missing-${date_stamp}.tsv 526-

Job: fastqc--jinasong--180727-121124-28
Launched job-id: fastqc--jinasong--180727-121124-28
8000 task(s)
To check the status, run:
  dstat --project gbsc-gcp-project-mvp --jobs 'fastqc--jinasong--180727-121124-28' --status '*'
To cancel the job, run:
  ddel --project gbsc-gcp-project-mvp --jobs 'fastqc--jinasong--180727-121124-28'
fastqc--jinasong--180727-121124-28
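
Assuming the dsub `--tasks` range syntax works as its documentation describes, a trailing range such as `526-` selects task 526 through the last task, with task numbers starting at 1 after the header row. The sketch below works out the submitted task count on a made-up five-task file; the column names and paths are hypothetical. By the same arithmetic, 8000 submitted tasks would correspond to a file with 8525 task rows, which is an inference and not something the output above states.

```shell
# Hypothetical tasks file: one header row plus five task rows.
tasks_file=$(mktemp)
printf '%s\t%s\n' "--input INPUT" "--output OUTPUT" > "$tasks_file"
for i in 1 2 3 4 5; do
    printf 'gs://example-bucket/in/%s.bam\tgs://example-bucket/out/%s.txt\n' "$i" "$i" \
        >> "$tasks_file"
done

first_task=3   # stands in for the start of a range such as "526-"

# Task numbers start at 1 and do not count the header row.
total_tasks=$(( $(wc -l < "$tasks_file") - 1 ))
submitted=$(( total_tasks - first_task + 1 ))
echo "Tasks in file: $total_tasks, submitted by '${first_task}-': $submitted"
```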


## 2. Convert FastQC data files to CSV format

#### Get list of sample IDs for FastQC data files that already exist on Google Cloud Storage

In [9]:
gsutil ls gs://${mvp_anal_bucket}/dsub/fastqc-bam/fastqc/objects/*alignments.bam.fastqc_data.txt \
  > ${accounting_dir}/gs-fastqc-bam-txt-${date_stamp}.txt
cut -d'/' -f8 ${accounting_dir}/gs-fastqc-bam-txt-${date_stamp}.txt \
  | cut -d'_' -f1-3 > ${accounting_dir}/gs-fastqc-bam-txt-sample-ids-${date_stamp}.txt # with chr number  

#### Get list of sample IDs for FastQC CSV files that already exist on Google Cloud Storage

In [11]:
gsutil ls gs://${mvp_anal_bucket}/dsub/fastqc-bam/text-to-table/objects/*alignments.bam.fastqc_data.txt.csv \
    > ${accounting_dir}/gs-fastqc-bam-csv-${date_stamp}.txt
cut -d'/' -f8 ${accounting_dir}/gs-fastqc-bam-csv-${date_stamp}.txt \
    | cut -d'_' -f1-3 > ${accounting_dir}/gs-fastqc-bam-csv-sample-ids-${date_stamp}.txt
#    | cut -d'_' -f1 > ${accounting_dir}/gs-fastqc-bam-csv-sample-ids-${date_stamp}.txt

#### Get difference between lists of sample IDs to find out which samples are missing FastQC CSV files

In [12]:
diff \
  --new-line-format="" \
  --unchanged-line-format "" \
  <(sort ${accounting_dir}/gs-fastqc-bam-txt-sample-ids-${date_stamp}.txt) \
  <(sort ${accounting_dir}/gs-fastqc-bam-csv-sample-ids-${date_stamp}.txt) \
  > ${accounting_dir}/gs-fastqc-bam-csv-sample-ids-missing-${date_stamp}.txt
grep -F \
  -f ${accounting_dir}/gs-fastqc-bam-csv-sample-ids-missing-${date_stamp}.txt \
  ${accounting_dir}/gs-fastqc-bam-txt-${date_stamp}.txt \
  > ${accounting_dir}/gs-fastqc-bam-csv-missing-${date_stamp}.txt

#### Convert file list to dsub TSV files

In [13]:
${mvp_hub}/bin/make-batch-tsv-from-input-file.py \
    -i ${accounting_dir}/gs-fastqc-bam-csv-missing-${date_stamp}.txt \
    -t ${dsub_inputs_dir}/text-to-table/gs-fastqc-bam-csv-missing-${date_stamp}.tsv \
    -o gs://${mvp_anal_bucket}/dsub/fastqc-bam/text-to-table/objects \
    -s csv \
    -c fastqc 
    #-e fastqc-bam-${date_stamp}   
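
The output of make-batch-tsv-from-input-file.py is not shown in this notebook. As a rough illustration of what a dsub tasks file for this step might contain, the sketch below builds one by hand. The column names are taken from the variables used in the text2table command in the next cell; the bucket name, the sample-ID rule, and the use of "fastqc" as the SERIES value are assumptions.

```shell
work_dir=$(mktemp -d)
output_prefix="gs://example-qc-bucket/dsub/fastqc-bam/text-to-table/objects"

# Made-up input listing in the same shape as the missing-files list.
printf '%s\n' \
    "gs://example-qc-bucket/dsub/fastqc-bam/fastqc/objects/AAA_BBB_CCC_alignments.bam.fastqc_data.txt" \
    "gs://example-qc-bucket/dsub/fastqc-bam/fastqc/objects/DDD_EEE_FFF_alignments.bam.fastqc_data.txt" \
    > "$work_dir/missing.txt"

{
    # Header row: dsub reads the column names as parameter definitions.
    printf '%s\t%s\t%s\t%s\n' "--env SAMPLE_ID" "--env SERIES" "--input INPUT" "--output OUTPUT"
    while IFS= read -r input_path; do
        file_name=${input_path##*/}                          # strip the directory part
        sample_id=$(echo "$file_name" | cut -d'_' -f1-3)     # assumed ID rule
        printf '%s\t%s\t%s\t%s\n' \
            "$sample_id" "fastqc" "$input_path" "$output_prefix/$file_name.csv"
    done < "$work_dir/missing.txt"
} > "$work_dir/tasks.tsv"

cat "$work_dir/tasks.tsv"
```

Each data row becomes one dsub task, and each column value is exposed to the task under the name given in the header.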

#### Run dsub task

In [14]:
dsub \
    --zones ${mvp_zone} \
    --project ${mvp_project} \
    --logging gs://${mvp_anal_bucket}/dsub/fastqc-bam/text-to-table/logs/${date_stamp} \
    --image gcr.io/${mvp_project}/text-to-table-js:0.2.0 \
    --input JSON=gs://${mvp_anal_bucket}/dsub/fastqc-bam/text-to-table/json/fastqc_seq.json \
    --command 'text2table -j ${JSON} -o ${OUTPUT} -v chr=${SERIES},sample=${SAMPLE_ID} ${INPUT}' \
    --tasks ${dsub_inputs_dir}/text-to-table/gs-fastqc-bam-csv-missing-${date_stamp}.tsv 

Job: text2table--jinasong--180727-144656-55
Launched job-id: text2table--jinasong--180727-144656-55
8525 task(s)
To check the status, run:
  dstat --project gbsc-gcp-project-mvp --jobs 'text2table--jinasong--180727-144656-55' --status '*'
To cancel the job, run:
  ddel --project gbsc-gcp-project-mvp --jobs 'text2table--jinasong--180727-144656-55'
text2table--jinasong--180727-144656-55


## 3. Concatenate CSV Files

#### Get new list of completed results files

In [11]:
gsutil ls gs://${mvp_anal_bucket}/dsub/fastqc-bam/text-to-table/objects/*alignments.bam.fastqc_data.txt.csv \
    > ${accounting_dir}/gs-fastqc-bam-csv-${date_stamp}.txt
    
# Keep a copy of the previous concatenated file before removing it
gsutil cp gs://${mvp_anal_bucket}/dsub/fastqc-bam/concat/objects/concat_alignments.bam.fastqc_data.txt.csv \
    gs://${mvp_anal_bucket}/dsub/fastqc-bam/concat/objects/copy_concat_alignments.bam.fastqc_data.txt.csv

gsutil rm gs://${mvp_anal_bucket}/dsub/fastqc-bam/concat/objects/concat_alignments.bam.fastqc_data.txt.csv

Removing gs://gbsc-gcp-project-mvp-from-personalis-qc/dsub/fastqc-bam/concat/objects/concat_alignments.bam.fastqc_data.txt.csv...
/ [1 objects]                                                                   
Operation completed over 1 objects.                                              


#### Run dsub task

In [15]:
dsub \
    --zones ${mvp_zone} \
    --project ${mvp_project} \
    --logging gs://${mvp_anal_bucket}/dsub/fastqc-bam/concat/logs/${date_stamp} \
    --image gcr.io/${mvp_project}/text-to-table-js:0.2.0 \
    --disk-size 100 \
    --input INPUT_FILES=gs://${mvp_anal_bucket}/dsub/fastqc-bam/text-to-table/objects/*alignments.bam.fastqc_data.txt.csv \
    --output CONCAT_FILE=gs://${mvp_anal_bucket}/dsub/fastqc-bam/concat/objects/concat_alignments.bam.fastqc_data.txt.csv \
    --command 'for f in ${INPUT_FILES}; do cat "$f" >> ${CONCAT_FILE}; done' 
    #--dry-run
    #--vars-include-wildcards

Job: for--jinasong--180731-102014-74
Launched job-id: for--jinasong--180731-102014-74
To check the status, run:
  dstat --project gbsc-gcp-project-mvp --jobs 'for--jinasong--180731-102014-74' --status '*'
To cancel the job, run:
  ddel --project gbsc-gcp-project-mvp --jobs 'for--jinasong--180731-102014-74'
for--jinasong--180731-102014-74
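
The concatenation loop can be tried locally before it is submitted. The sketch below mirrors the `--command` string on made-up, header-less CSV files in a temporary directory; it assumes that inside the task `INPUT_FILES` holds a wildcard path and `CONCAT_FILE` a single output path. The row contents are invented for illustration.

```shell
work_dir=$(mktemp -d)
mkdir -p "$work_dir/in" "$work_dir/out"

# Made-up rows in the five-column layout used by the BigQuery schema.
printf '%s\n' "per_base_quality,1,32.1,fastqc,AAA_BBB_CCC" \
    > "$work_dir/in/a_alignments.bam.fastqc_data.txt.csv"
printf '%s\n' "per_base_quality,1,31.7,fastqc,DDD_EEE_FFF" \
              "per_base_quality,2,31.9,fastqc,DDD_EEE_FFF" \
    > "$work_dir/in/b_alignments.bam.fastqc_data.txt.csv"

INPUT_FILES="$work_dir/in/*alignments.bam.fastqc_data.txt.csv"
CONCAT_FILE="$work_dir/out/concat.csv"

# Start from an empty local file: ">>" appends, so a leftover file would be extended.
: > "$CONCAT_FILE"
# INPUT_FILES is left unquoted on purpose so the shell expands the wildcard.
for f in ${INPUT_FILES}; do
    cat "$f" >> "$CONCAT_FILE"
done

wc -l < "$CONCAT_FILE"
```

Because the files are appended as-is, any header row in the inputs would be repeated in the combined file.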


## 4. Upload CSV Files to BigQuery

In [14]:
bq load --replace testing.Jina_Personalis_fastqc \
    gs://gbsc-gcp-project-mvp-from-personalis-qc/dsub/fastqc-bam/concat/objects/concat_alignments.bam.fastqc_data.txt.csv \
    dimension:string,index:string,value:string,chr:string,sample:string

Waiting on bqjob_r3abac7db9d2c5fb6_00000164ed3a11ba_1 ... (6s) Current status: RUNNING
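
Since the schema passed to `bq load` declares five columns, a quick field-count check on the concatenated CSV can catch malformed rows before loading. This is a minimal sketch on made-up rows, not part of the original workflow; it splits on commas only and would miscount quoted fields that contain commas.

```shell
csv_file=$(mktemp)

# Made-up rows: two well-formed, one with a missing field.
printf '%s\n' \
    "per_base_quality,1,32.1,fastqc,AAA_BBB_CCC" \
    "per_base_quality,2,31.9,fastqc,AAA_BBB_CCC" \
    "per_base_quality,3,fastqc,AAA_BBB_CCC" \
    > "$csv_file"

expected_fields=5   # dimension, index, value, chr, sample

# Count rows whose comma-separated field count differs from the schema.
bad_rows=$(awk -F',' -v n="$expected_fields" \
    'NF != n { bad++ } END { print bad + 0 }' "$csv_file")
echo "Rows with unexpected field count: $bad_rows"
```

On the real data the check would be pointed at a local copy of the concatenated file, which is an extra download step not shown in this notebook.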