Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add in processing scripts for human data #40

Open
wants to merge 33 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 1 commit
Commits
Show all changes
33 commits
Select commit Hold shift + click to select a range
c9d73bb
refactor: make json script work with gsutil
mihirsamdarshi Aug 14, 2022
ab78d4c
feat: add submit.sh
mihirsamdarshi Aug 15, 2022
d1ab897
feat: add helper delete_json script
mihirsamdarshi Aug 23, 2022
876d8bd
refactor: run ShellCheck, shellfmt, and pylint on some existing scripts
mihirsamdarshi Sep 2, 2022
bc60a69
feat: add new helper scripts
mihirsamdarshi Sep 2, 2022
4adc1e0
feat: refactor status script
mihirsamdarshi Sep 2, 2022
06b6392
feat: continue working on various helper scripts
mihirsamdarshi Sep 3, 2022
55549c5
fix: mo' problems, mo' fixes
mihirsamdarshi Sep 7, 2022
f952b43
feat: add in merge_atac_qc_human.R script, fix some parallel processi…
mihirsamdarshi Sep 20, 2022
7a4643f
fix: a couple minor script tweaks
mihirsamdarshi Sep 20, 2022
3220555
feat: add in complete wrapper and server setup scripts
mihirsamdarshi Sep 20, 2022
f1c2521
feat: fix wrapper script with better outputs
mihirsamdarshi Sep 20, 2022
576d7c5
fix: add in fixes for the setup and wrapper scripts
mihirsamdarshi Sep 21, 2022
4313e4e
feat: update batch scripts
mihirsamdarshi Oct 14, 2022
43a8e9e
feat: add R, fish, neovim, and qc2tsv install to server setup
mihirsamdarshi Oct 14, 2022
5a938f4
feat: clean up scripts
mihirsamdarshi Oct 14, 2022
4661741
feat: clean up scripts
mihirsamdarshi Oct 17, 2022
546d195
fix: add samtools quickcheck
mihirsamdarshi Jan 20, 2023
d38e89e
fix: use date for job log
mihirsamdarshi Feb 24, 2023
1184b1d
fix: update echo command make_json_singleton.sh
mihirsamdarshi Feb 24, 2023
2930dcf
feat: clean up src directory, get ready for submission
mihirsamdarshi Feb 24, 2023
a541e42
chore: add latest updates to README.md
mihirsamdarshi Feb 24, 2023
b1344c4
feat: add --bar to all GNU parallel jobs
mihirsamdarshi Feb 24, 2023
94038c5
Added scripts used to process pass1ac data. Made minor changes to scr…
archanaraja Mar 3, 2023
8fe1aa4
fix: fix shebang in server setup script
mihirsamdarshi Mar 6, 2023
cef9d4c
chore: add merge_jsons.py utility script
mihirsamdarshi Mar 7, 2023
9dc3e53
fix: remove trailing slash from input dir
mihirsamdarshi Mar 7, 2023
944b310
fix: point to correct script names in wrappers
mihirsamdarshi Mar 8, 2023
19ab3ef
fix: remove home dir path addition from wrapper script
mihirsamdarshi Mar 8, 2023
5940239
fix: fix paths in scripts
mihirsamdarshi Mar 8, 2023
64c80ab
Fix : remove printing out encode workflow level qc , added vial_label…
archanaraja Mar 11, 2023
6d645d8
fix: fix paths in scripts
mihirsamdarshi Mar 21, 2023
1238134
Merge branch 'master' into ms-patch-1
mihirsamdarshi Apr 1, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 3 additions & 3 deletions src/align_stats.sh
Original file line number Diff line number Diff line change
Expand Up @@ -85,12 +85,12 @@ align_stats() {
export -f align_stats

if [ "$type" == "glob" ]; then
parallel --verbose --jobs "$cores" align_stats ::: "$(ls "${bamdir%/}"/*/*/align/rep*/*.trim.bam)"
parallel --joblog ~/mnt/tmp/"$(basename bamdir)"_align_stats_joblog.log --progress --verbose --jobs "$cores" align_stats ::: "$(ls "${bamdir%/}"/*/*/align/rep*/*_R1.trim.bam)"
elif [ "$type" == "file" ]; then
readarray -t raw_bam_list <<<"$bamdir"
parallel --verbose --jobs "$cores" align_stats ::: "${raw_bam_list[@]}"
parallel --joblog ~/mnt/tmp/"$(basename bamdir)"_align_stats_joblog.log --progress --verbose --jobs "$cores" align_stats ::: "${raw_bam_list[@]}"
elif [ "$type" == "find" ]; then
parallel --verbose --jobs "$cores" align_stats ::: "$(find -name "*_R1.trim.bam" "$bamdir")"
parallel --joblog ~/mnt/tmp/"$(basename bamdir)"_align_stats_joblog.log --progress --verbose --jobs --jobs "$cores" align_stats ::: "$(find -name "*_R1.trim.bam" "$bamdir")"
else
echo "type must be glob, file, or find"
exit 1
Expand Down
6 changes: 3 additions & 3 deletions src/align_stats_concat_only.sh
Original file line number Diff line number Diff line change
Expand Up @@ -69,12 +69,12 @@ align_stats() {
export -f align_stats

if [ "$type" == "glob" ]; then
parallel --verbose --jobs "$cores" align_stats ::: "$(ls "${bamdir%/}"/*/*/align/rep*/*.trim.bam)"
parallel --joblog ~/mnt/tmp/"$(basename bamdir)"_align_stats_concat_joblog.log --progress --verbose --jobs "$cores" align_stats ::: "$(ls "${bamdir%/}"/*/*/align/rep*/*_R1.trim.bam)"
elif [ "$type" == "file" ]; then
readarray -t raw_bam_list <<<"$bamdir"
parallel --verbose --jobs "$cores" align_stats ::: "${raw_bam_list[@]}"
parallel --joblog ~/mnt/tmp/"$(basename bamdir)"_align_stats_concat_joblog.log --progress --verbose --jobs "$cores" align_stats ::: "${raw_bam_list[@]}"
elif [ "$type" == "find" ]; then
parallel --verbose --jobs "$cores" align_stats ::: "$(find -name "*_R1.trim.bam" "$bamdir")"
parallel --joblog ~/mnt/tmp/"$(basename bamdir)"_align_stats_concat_joblog.log --progress --verbose --jobs "$cores" align_stats ::: "$(find -name "*_R1.trim.bam" "$bamdir")"
else
echo "type must be glob, file, or find"
exit 1
Expand Down
47 changes: 32 additions & 15 deletions src/croo.sh
Original file line number Diff line number Diff line change
@@ -1,48 +1,65 @@
#!/bin/bash

set -eux
#Usage : src/croo.sh <list_of_atac-seq_workflow_ids> <GCP_PATH_to_atac_seq_pipeline_workflows_without_trailing_slash> <gcp-path-to-output-without-trailing-slash> >>croo_copy_jobs.txt
#Contributors : Archana Raja, Nicole Gay

if [ $# -lt 3 ]; then
echo "Usage: ./croo.sh [WORKFLOW_ID_LIST] [GCP_PATH] [OUT_PATH]"
echo "Usage: ./croo.sh [WORKFLOW_SUBMISSION_MAP] [GCP_PATH] [OUT_PATH]"
echd
echo "Example: croo.sh ids.txt gs://my-bucket/my_workflow/outputs/croo gs://my-bucket/my_workflow/processed"
echo "Example: croo.sh out.json gs://my-bucket/my_workflow/outputs/croo gs://my-bucket/my_workflow/processed"
echo
echo "[WORKFLOW_ID_LIST]: A list of workflow ids to process"
echo "[WORKFLOW_SUBMISSION_MAP]: A list of workflow ids to process"
echo "[GCP_PATH]: This directory with the outputs of the pipeline"
echo "[OUT_PATH]: The location to output the croo files to"
echo "[PARSE_FROM_ID_LIST] (Optional): Whether to use the workflow id list to parse the files to copy. If false/not set will use qc json to create a file name"
echo
exit 1
fi

WORKFLOW_ID_LIST=$1
WORKFLOW_SUBMISSION_MAP=$1
GCP_PATH=$2
OUT_PATH=${3%/}

PARSE_FROM_ID_LIST=$4

function run_croo() {
local line=$1
local sample_dir
local out_dir
local descrip

sample_dir=$GCP_PATH/${line%/}
out_dir=${OUT_PATH%/}/${sample_dir#gs://}

# as long as the description is hyphenated and don't contain any spaces or special characters below would work
descrip=$(gsutil cat "$sample_dir"/call-qc_report/glob-*/qc.json | grep "description" | sed -e 's/.*": "//' -e 's/".*//')

if [ "$descrip" = "No description" ]; then
descrip=$(gstil cat "$sample_dir"/call-qc_report/glob-*/qc.json | grep "title" | sed -e 's/.*": "//' -e 's/".*//')
if [[ "$PARSE_FROM_ID_LIST" ]]; then
out_dir="$out_dir"/$(jq -r '.[] | select(.workflow_id == "'"$line"'") | .label' "$WORKFLOW_SUBMISSION_MAP")
echo "out_dir: $out_dir"
else
descrip=$descrip
# as long as the description is hyphenated and don't contain any spaces or special characters below would work
descrip=$(gsutil cat "$sample_dir"/call-qc_report/glob-*/qc.json | grep "description" | sed -e 's/.*": "//' -e 's/".*//')

if [ "$descrip" = "No description" ]; then
descrip=$(gstil cat "$sample_dir"/call-qc_report/glob-*/qc.json | grep "title" | sed -e 's/.*": "//' -e 's/".*//')
else
descrip=$descrip
fi

out_dir="$out_dir"/"${descrip/gs:\/\///}"/
fi

out_dir="$out_dir"/"${descrip/gs:\/\///}"/
croo --method copy "$sample_dir"/metadata.json --out-dir "$out_dir"
}

export GCP_PATH
export OUT_PATH
export WORKFLOW_SUBMISSION_MAP
export PARSE_FROM_ID_LIST
export -f run_croo
cores=10

# shellcheck disable=SC2046
parallel --joblog ~/mnt/tmp/"${WORKFLOW_SUBMISSION_MAP%%.*}"_croo.log --progress --verbose --jobs "$cores" run_croo ::: $(jq -r '.[].workflow_id' "$WORKFLOW_SUBMISSION_MAP")

while IFS= read -r line; do
run_croo "$line" &
done <"$WORKFLOW_ID_LIST"
#for line in $(jq -r '.[].workflow_id' "$WORKFLOW_SUBMISSION_MAP"); do
# run_croo "$line"
#done
96 changes: 96 additions & 0 deletions src/merge_atac_qc_human.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,96 @@
#!/bin/R
# Nicole Gay
# 15 May 2020
# Fix and merge ATAC-seq QC

#Usage : Rscript src/merge_atac_qc.R -w ~/test_mnt/PASS/atac-seq/stanford/batch4_20200928/Output/final/sample_metadata_20200928.csv -q ~/test_mnt/PASS/atac-seq/stanford/batch4_20200928/Output/final/qc/atac_qc.tsv -m ~/test_mnt/PASS/atac-seq/stanford/batch4_20200928/Output/final/rep_to_sample_map.csv -a ~/test_mnt/PASS/atac-seq/stanford/batch4_20200928/Output/final/merged_chr_info.csv -o ~/test_mnt/PASS/atac-seq/stanford/batch4_20200928/Output/final/

library(data.table)
library(optparse)

option_list <- list(
make_option(c("-w", "--sample_meta"), help = "Absolute path to wetlab sample metadata file, e.g. sample_metadata_YYYYMMDD.csv"),
make_option(c("-q", "--atac_qc"), help = "Absolute path to pipeline qc metrics file output of qc2tsv tool, e.g. atac_qc.tsv"),
make_option(c("-a", "--align_stats"), help = "Absolute path to genome alignment stats file, e.g. merged_chr_info.csv"),
make_option(c("-o", "--outdir", help = "Absolute path to output directory for the merged qc reports"))

)

opt_parse_inst <- OptionParser(option_list = option_list)

opt <- parse_args(opt_parse_inst)

if (is.null(opt$sample_meta) |
is.null(opt$atac_qc) |
is.null(opt$align_stats) |
is.null(opt$outdir)) {
message("\033[31mERROR! Please provide all required arguments")
message("\033[34mExample: Rscript src/merge_atac_qc.R -w <sample_metadata_YYYYMMDD.csv> -q <atac_qc.tsv> -m <rep_to_sample_map.csv> -a <merged_chr_info.csv> -o <output_directory>")
print_help(opt_parse_inst)
quit("no")
}

wet <- fread(opt$sample_meta, sep = ',', header = TRUE)
wet <- unique(wet) # remove duplicate rows
encode <- fread(opt$atac_qc, sep = '\t', header = T)
align_stat <- fread(opt$align_stats, sep = ',', header = T)

###################################################################################
## fix ENCODE QC
###################################################################################

# fix other "general" cols
cols <- colnames(encode)[grepl('general', colnames(encode))]
cols <- cols[cols != 'general.description']
for (col in cols) {
print(col)
t1 <- encode[1, get(col)]
for (i in seq_len(nrow(encode))) {
if (as.character(encode[i, get(col)]) == '') {
encode[i, (col) := t1]
} else {
t1 <- encode[i, get(col)]
}
}
}

# separate workflow-level and sample-level QC
workflow_level <- colnames(encode)[unlist(encode[, lapply(.SD, function(x) any(is.na(x) | as.character(x) == ''))])]
workflow_qc <- encode[replicate == 'rep1', c('general.description', workflow_level), with = F]
viallabel_qc <- encode[, colnames(encode)[!colnames(encode) %in% workflow_level], with = F]

###################################################################################
## merge all sample-level QC
###################################################################################

# merge with wet lab QC
m1 <- merge(viallabel_qc, wet, by.x = 'general.title', by.y = 'vial_label')
stopifnot(nrow(m1) == nrow(dt))
# merge with align stats
print(colnames(m1))
print(colnames(align_stat))
m2 <- merge(m1, align_stat, by.x = 'general.title', by.y = 'viallabel')
stopifnot(nrow(m2) == nrow(dt))
# remove columns of all 0 or all 100
check_col <- function(x) {
if (is.numeric(x)) {
if (sum(as.numeric(x)) == 0 | all(x == 100)) {
return(x)
}
}
return(NA)
}

res <- lapply(m2, check_col)
res <- res[!is.na(res)]
m2[, names(res) := NULL]

head(m2)

# write out merged QC
outfile <- paste0(trimws(opt$outdir, which = 'right', whitespace = '/'), '/', 'merged_atac_qc.tsv')
write.table(m2, file = outfile, sep = '\t', col.names = T, row.names = F, quote = F)

# write out workflow-level QC
mihirsamdarshi marked this conversation as resolved.
Show resolved Hide resolved
outfile_workflow <- paste0(opt$outdir, '/', 'encode_workflow_level_atac_qc.tsv')
write.table(workflow_qc, file = outfile_workflow, sep = '\t', col.names = T, row.names = F, quote = F)
6 changes: 3 additions & 3 deletions src/pass_extract_atac_from_gcp.sh
Original file line number Diff line number Diff line change
Expand Up @@ -66,7 +66,7 @@ else
gsutil -m cp -n "${CROO_OUTPUT_PATH}/*/*/align/rep?/*tagAlign.gz" tagalign

# individual signal track (p-value)
gsutil -m cp -n "${CROO_OUTPUT_PATH}/*/*/signal/rep?/*pval.signal.bigwig signal"
gsutil -m cp -n "${CROO_OUTPUT_PATH}/*/*/signal/rep?/*pval.signal.bigwig" signal
fi

gs_copy() {
Expand Down Expand Up @@ -115,8 +115,8 @@ gs_copy_gcp() {

if [[ "$copy_dest" == "gcp" ]]; then
export -f gs_copy_gcp
parallel --verbose --jobs "$NUM_CORES" gs_copy_gcp ::: "$(gsutil ls "$CROO_OUTPUT_PATH" | grep -E "/$" | grep -v "final")"
parallel --progress --verbose --jobs "$NUM_CORES" gs_copy_gcp ::: "$(gsutil ls "$CROO_OUTPUT_PATH" | grep -E "/$" | grep -v "final")"
else
export -f gs_copy
parallel --verbose --jobs "$NUM_CORES" gs_copy ::: "$(gsutil ls "$CROO_OUTPUT_PATH" | grep -E "/$" | grep -v "final")"
parallel --progress --verbose --jobs "$NUM_CORES" gs_copy ::: "$(gsutil ls "$CROO_OUTPUT_PATH" | grep -E "/$" | grep -v "final")"
fi
6 changes: 3 additions & 3 deletions src/qc2tsv.sh
Original file line number Diff line number Diff line change
Expand Up @@ -7,9 +7,9 @@ trap "echo ERR trap fired!" ERR
gcp_path=$1
outfile_name=$2

gsutil ls ${gcp_path}/*/*/qc/qc.json >file_list.txt
gsutil ls "$gcp_path"/*/*/qc/qc.json >file_list.txt
echo "Done creating file list"
/home/araja7/miniconda3/bin/qc2tsv --file file_list.txt --collapse-header >${outfile_name}
gsutil mv ${outfile_name} ${gcp_path}/final/
qc2tsv --file file_list.txt --collapse-header >"$outfile_name"
gsutil mv "$outfile_name" "$gcp_path"/final/
rm -rf file_list.txt
echo "Done creating atac-seq qc report"