# Unpack tar files

# Code

## 0. Check that environment variables have been loaded correctly

Environment variables are imported from the mvp-profile.sh file. If the echo commands below print empty values, source the file from the command console. If any value is incorrect, edit mvp-profile.sh, save it, and source it again.

In [13]:
echo "Date stamp: ${date_stamp}"
echo "Home directory: ${mvp_hub}"
echo "Project: ${mvp_project}"
echo "Data_Bucket: ${mvp_data_bucket}"
echo "Tar_Bucket: ${mvp_tar_bucket}"
echo "Anal_Bucket: ${mvp_anal_bucket}"
echo "Zone: ${mvp_zone}"

Date stamp: 20180718
Home directory: /Users/jinasong/Work/MVP/mvp-on-gcp
Project: gbsc-gcp-project-mvp
Data_Bucket: gbsc-gcp-project-mvp-from-personalis
Tar_Bucket: gbsc-gcp-project-mvp-from-personalis-archived
Anal_Bucket: gbsc-gcp-project-mvp-group/from-personalis-qc
Zone: us-*
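
Reading seven echo lines by eye makes it easy to miss an empty value. The check can also be scripted; the following is a minimal sketch, where `check_mvp_vars` is a hypothetical helper (not part of mvp-profile.sh) and the variable list is simply the one echoed above.

```shell
# Hypothetical helper: list every expected variable that is unset or empty.
# The body runs in a subshell, so its loop variables do not leak out.
check_mvp_vars() (
    missing=0
    for name in date_stamp mvp_hub mvp_project mvp_data_bucket \
                mvp_tar_bucket mvp_anal_bucket mvp_zone; do
        # Indirect lookup: read the value of the variable named by ${name}.
        eval "value=\${${name}:-}"
        if [ -z "${value}" ]; then
            echo "Missing: ${name}" >&2
            missing=1
        fi
    done
    return "${missing}"
)
```

Calling `check_mvp_vars || echo "Source mvp-profile.sh and try again"` before the later cells would stop an empty bucket name from silently producing a malformed `gs://` path.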


## 1. Unpack tar files

#### Create file accounting directory if it doesn't already exist

In [2]:
accounting_dir="${mvp_hub}/tar/file-accounting/${date_stamp}"
mkdir -p ${accounting_dir}
dsub_inputs_dir="${mvp_hub}/tar/dsub-inputs"
mkdir -p ${dsub_inputs_dir}

#### Get list of sample IDs for tar files that already exist on Google Cloud Storage

In [3]:
gsutil ls gs://${mvp_data_bucket}/*_tar/*.tar \
    > ${accounting_dir}/gs-tar-${date_stamp}.txt

cut -d'/' -f5 ${accounting_dir}/gs-tar-${date_stamp}.txt \
    | cut -d'.' -f1 > ${accounting_dir}/gs-tar-sample-ids-${date_stamp}.txt 

#### Get list of sample IDs for unpacked tar files (sample folders from tar files) that already exist on Google Cloud Storage

In [6]:
gsutil ls gs://${mvp_tar_bucket}/ \
    > ${accounting_dir}/gs-unpacked-${date_stamp}.txt
cut -d'/' -f4 ${accounting_dir}/gs-unpacked-${date_stamp}.txt \
    > ${accounting_dir}/gs-unpacked-sample-ids-${date_stamp}.txt
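
The tar listing and the unpacked listing use different `cut` field numbers, which is easier to see on single paths. The paths below are made up for illustration (hypothetical bucket and sample names); only the field positions mirror the listings above.

```shell
# Hypothetical paths; the real listings come from `gsutil ls`.
tar_path="gs://example-bucket/batch01_tar/SAMPLE0001.tar"
unpacked_path="gs://example-archive/SAMPLE0001/"

# Splitting on '/' gives: "gs:", "", bucket, folder, file name,
# so the tar file name is field 5; the second cut drops the extension.
tar_id="$(echo "${tar_path}" | cut -d'/' -f5 | cut -d'.' -f1)"

# Splitting on '/' gives: "gs:", "", bucket, sample folder, "",
# so the sample folder is field 4.
unpacked_id="$(echo "${unpacked_path}" | cut -d'/' -f4)"

echo "${tar_id} ${unpacked_id}"
```

Note that `cut -d'.' -f1` keeps only the text before the first dot, so a sample ID that itself contains a dot would be truncated.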

#### Get difference between lists of sample IDs to find out which tar files have not been unpacked yet

In [7]:
diff --new-line-format="" --unchanged-line-format="" \
    <(sort ${accounting_dir}/gs-tar-sample-ids-${date_stamp}.txt) \
    <(sort ${accounting_dir}/gs-unpacked-sample-ids-${date_stamp}.txt) \
    > ${accounting_dir}/gs-unpacked-missing-sample-ids-${date_stamp}.txt
grep -F \
    -f ${accounting_dir}/gs-unpacked-missing-sample-ids-${date_stamp}.txt \
    ${accounting_dir}/gs-tar-${date_stamp}.txt \
    > ${accounting_dir}/gs-unpacked-missing-${date_stamp}.txt
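
The `diff` options above are specific to GNU diff. As a hedged alternative, `comm -23` prints the lines that occur only in the first of two sorted files, which is the same set difference. The sketch below uses made-up sample IDs in a temporary directory rather than the real accounting files.

```shell
# Toy data: three samples have tar files, one of them is already unpacked.
workdir="$(mktemp -d)"
printf 'S003\nS001\nS002\n' > "${workdir}/tar-ids.txt"
printf 'S002\n' > "${workdir}/unpacked-ids.txt"

# comm requires both inputs to be sorted.
sort "${workdir}/tar-ids.txt" > "${workdir}/tar-ids.sorted.txt"
sort "${workdir}/unpacked-ids.txt" > "${workdir}/unpacked-ids.sorted.txt"

# -2 hides lines unique to the second file, -3 hides lines common to both,
# leaving the IDs that have a tar file but no unpacked folder.
comm -23 "${workdir}/tar-ids.sorted.txt" "${workdir}/unpacked-ids.sorted.txt" \
    > "${workdir}/missing-ids.txt"

cat "${workdir}/missing-ids.txt"
```

One caveat for the `grep -F -f` step: it matches substrings, so an ID that is a prefix of another ID would also select the longer one's tar path.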

#### Create dsub TSV task file to unpack the missing tar files

In [8]:
${mvp_hub}/bin/make-batch-tsv-from-input-tar.py \
    -i ${accounting_dir}/gs-unpacked-missing-${date_stamp}.txt \
    -t ${dsub_inputs_dir}/gs-unpacked-missing-${date_stamp}.tsv \
    -o gs://${mvp_tar_bucket} \
    -s \*
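
The script make-batch-tsv-from-input-tar.py is not shown here, so the following is only a sketch of what a compatible task file could look like. It assumes the header names `INPUT` and `OUTPUT_PATH` (taken from the `--command` string used in the next step) and assumes one output folder per sample; `make_task_tsv` is a hypothetical helper and the real script may lay out its outputs differently.

```shell
# Hypothetical helper: write a dsub task file to stdout.
#   $1: file with one gs:// tar URL per line
#   $2: destination bucket URL (no trailing slash)
make_task_tsv() {
    # Header row: each column declares a dsub parameter and its variable name.
    printf '%s\t%s\n' '--input INPUT' '--output-recursive OUTPUT_PATH'
    while IFS= read -r tar_url; do
        # Assumption: the sample ID is the tar file name without ".tar".
        sample_id="$(basename "${tar_url}" .tar)"
        printf '%s\t%s/%s/\n' "${tar_url}" "$2" "${sample_id}"
    done < "$1"
}
```

Each data row becomes one dsub task, with the two columns exposed to the command as `${INPUT}` and `${OUTPUT_PATH}`.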

#### Run tar dsub tasks

In [12]:
dsub \
    --zones ${mvp_zone} \
    --project ${mvp_project} \
    --logging gs://${mvp_anal_bucket}/dsub_tar/tar/log/${date_stamp} \
    --image gcr.io/${mvp_project}/text-to-table-js:0.2.0 \
    --command 'tar -xvf ${INPUT} -C ${OUTPUT_PATH}' \
    --tasks ${mvp_hub}/tar/dsub-inputs/gs-unpacked-missing-${date_stamp}.tsv \
    --disk-size 1000

Job: tar--jinasong--180718-104130-42
Launched job-id: tar--jinasong--180718-104130-42
526 task(s)
To check the status, run:
  dstat --project gbsc-gcp-project-mvp --jobs 'tar--jinasong--180718-104130-42' --status '*'
To cancel the job, run:
  ddel --project gbsc-gcp-project-mvp --jobs 'tar--jinasong--180718-104130-42'
tar--jinasong--180718-104130-42
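
dsub launches one task per data row of the `--tasks` file, so the reported task count (526 here) can be cross-checked against the TSV. A minimal sketch, assuming the file has exactly one header line; `count_tasks` is a hypothetical helper, not part of dsub.

```shell
# Hypothetical helper: count data rows in a dsub task file (header excluded).
count_tasks() {
    awk 'NR > 1 { n++ } END { print n + 0 }' "$1"
}
```

Running `count_tasks ${mvp_hub}/tar/dsub-inputs/gs-unpacked-missing-${date_stamp}.tsv` should print the same number as the `task(s)` line; a mismatch suggests the task file changed between generation and launch.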
