BioRxiv submission update (BWA to Bowtie2) by owlang · Pull Request #20 · CEGRcode/GenoPipe

owlang · 2023-03-08T22:51:02Z

Include updates with latest EpitopeID changes (switch aligner from BWA to Bowtie2 for improved sensitivity)
Update with results from fresh EpitopeID runs (SyntheticEpitope, ENCODE-eGFP, and HIV_samples)
Streamline scripts for generating manuscript data

Simulations take some time to run, and a large share of that time is spent by the generate_random_BED_from_Genomic_FASTA.pl script looping through every bp of the genome to count up the genome size. This commit bypasses most of that cost by creating a FASTA index `.fai` file with samtools for each synthetic genome and adding a variation of the perl script that parses the `.fai` file instead.
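The idea of reading the genome size from a `.fai` index rather than scanning the FASTA can be sketched as follows. This is a minimal Python illustration, not the perl script from the commit; the function names and the toy index contents are assumptions made for the example. It relies only on the fact that the first two tab-separated columns of a `.fai` line are the sequence name and its length.

```python
import io


def chrom_sizes_from_fai(handle):
    """Map each sequence name to its length from a samtools .fai index.

    The first two tab-separated columns of a .fai line are the sequence
    name and its length in bases; the remaining columns are ignored here.
    """
    sizes = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        sizes[fields[0]] = int(fields[1])
    return sizes


def genome_size_from_fai(handle):
    """Total genome size is the sum of all sequence lengths in the index."""
    return sum(chrom_sizes_from_fai(handle).values())


# Toy index with made-up sequence names and lengths (not real data).
toy_fai = io.StringIO("seqA\t1000\t6\t60\t61\nseqB\t250\t1030\t60\t61\n")
print(genome_size_from_fai(toy_fai))  # 1250
```

Because the index has one line per sequence, the cost scales with the number of sequences rather than the number of bases.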

Dramatic restructure of simulations:
- `depth_simulations.txt` -- describes experimental designs
  * sequencing "depth" removed as a variable (implied run of each sequencing depth for every job)
  * synthetic epitope length added as a variable tested in each row/"experimental design"
- `job/build_jobs.sh` -- script to build PBS submission scripts based on the `depth_template.pbs` submission script using each "experimental design" row in `depth_simulations.txt`
- `job/depth_template.pbs` -- The outline for new simulation script that simulates a dataset at every depth tested.

...and removed all old submission scripts that hard-coded each experimental design.
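The template-plus-design-table approach described above could look roughly like this. It is a hedged Python sketch, not the `build_jobs.sh` script itself: the column layout of the design file (genome, target, epitope length), the `@NAME@` placeholder syntax, and the output file naming are all assumptions invented for illustration.

```python
def parse_designs(text):
    """Parse tab-delimited design rows, skipping blanks and '#' comments."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rows.append(tuple(line.split("\t")))
    return rows


def fill_template(template_text, values):
    """Replace each hypothetical @KEY@ placeholder with its value.

    Plain string replacement leaves shell variables such as
    $PBS_O_WORKDIR in the template untouched.
    """
    text = template_text
    for key, value in values.items():
        text = text.replace(f"@{key}@", str(value))
    return text


def build_job_scripts(template_text, design_rows):
    """Return {script_name: script_text}, one script per design row."""
    scripts = {}
    for index, (genome, target, epitope_len) in enumerate(design_rows, start=1):
        # Hypothetical naming scheme; the real script names may differ.
        name = f"run_depth_{index:02d}_{genome}_{target}_R{epitope_len}.pbs"
        scripts[name] = fill_template(
            template_text,
            {"GENOME": genome, "TARGET": target, "EPITOPE_LEN": epitope_len},
        )
    return scripts
```

Keeping the experimental designs in one table means a new design is a new row rather than a new hand-written submission script.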

Extend the `build_jobs.sh` shell script to use the new `epitopeid_template.pbs` template script to create the `run_EpitopeID_XX_....pbs` submission scripts, in the same style as the `run_depth_XX_....pbs` submission scripts. The old structure for running EpitopeID is also abandoned, and its respective PBS scripts are removed.

The FASTQ format from SRA downloads, while valid, does not work with EpitopeID. Be sure to note this in the documentation. This script was adjusted to rename the files to `*_R1.fastq.gz` and `*_R2.fastq.gz` instead of `*_1.fastq.gz` and `*_2.fastq.gz`. The files themselves were also processed with sed to strip the header descriptions from the quality-score separator lines (leaving a plain "+" on those lines) and to replace the SRR read id with the next token (the raw Illumina-style read id). Minor changes: rename the conda environment and use sra-tools fastq dump to retrieve the FASTQ files instead of parallel-fastq-dump.
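The header rewrite described above can be illustrated with a small Python sketch. The commit itself uses sed; this is only a restatement of the same transformation under assumptions: the function names are made up, and the header is assumed to have the form `@<SRR id> <original read id> ...`, with the original id as the second whitespace-separated token.

```python
import re


def reformat_sra_record(header, seq, plus, qual):
    """Rewrite one 4-line FASTQ record from SRA style to a plain style.

    Assumes the header looks like '@<SRR id> <original read id> ...'.
    The second token becomes the new read id, and the separator line
    is reduced to a bare '+'.
    """
    tokens = header[1:].split()
    if len(tokens) > 1:
        header = "@" + tokens[1]
    return header, seq, "+", qual


def reformat_fastq(lines):
    """Apply the record rewrite to a list of FASTQ lines (4 per record)."""
    stripped = [line.rstrip("\n") for line in lines]
    out = []
    for i in range(0, len(stripped) - 3, 4):
        out.extend(reformat_sra_record(*stripped[i:i + 4]))
    return out


def rename_mate_file(filename):
    """Turn '<name>_1.fastq.gz' into '<name>_R1.fastq.gz' (same for mate 2)."""
    return re.sub(r"_([12])\.fastq\.gz$", r"_R\1.fastq.gz", filename)
```

Headers with only one token are left unchanged, so a file that has already been reformatted passes through without damage.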

Modify the `job/compile_results.sh` script to call the tally script on all the raw id results and include shell commands to compile runtime results. These results summaries can be turned into figures using the newly committed/modified `scripts/build_barplots.py` and `scripts/build_violinplots.py`.

Adjust human scripts to omit 100K plots, adjust title based on -y flag, and select color scheme based on -y flag. Commit the PNG figure outputs for human results. Add figure generating script calls to compile_results.sh script and reduce number of yeast targets to include in the figure.

update with latest yeast simulation results. Remove target results from other 7 proteins to keep only the publication figure set (Reb1, Rap1, Sua7). Also included are runtimes that were not included before. Minor updates to README to reflect these changes

Add Runtime and ID results summary reports for sacCer3
Add Runtime and ID-tally PNG figures of yeast summary reports
Update `compile_results.sh` script
- reflect new raw and summary data structure
- fix up -e <epitopeName> string to use the appropriate RANDOM_SEQ_XXXX format
- change figures to save from 'svg' to 'png'

Switch to a simulation index-oriented run of the mixture simulations (rather than a per-mixture ratio). These new scripts are modified to be more similar to the depth and epitopeid template PBS scripts. The simulation script was also updated to force overwrite of gzip files for convenient re-running of simulations and the check for completion was removed in favor of an immediate overwrite setup.

The results from all the R20-POLR2H experiments were missing in the last commit. These were run, and some other experiments were rerun as a "polish" to cover some missing runs (R100-YY1 and R100-CTCF). Reruns are largely consistent: changes in the ID files are mostly due to a different sort order among hits that share a rank, and runtimes differ by +/- a few seconds.

Figure 3 inserted into the manuscript of browser shots for pileups of mislabelled ID3 and NR4A1
scripts added:
- job/03_MakeBrowserData_genomes.sh -- makes synthetic genomes (ID3-eGFP and NR4A1-eGFP) to align against
- job/04_MakeBrowserData_BAM.sh -- filter fastq files and align to each synthetic genome
- README.md -- describe results directory structure
- results/annotations -- annotations for marking relevant features in synthetic genomes of browser figure
- .gitignore -- update with BrowserData directories

mix_fastq.sh
- uncomment intermediate file cleanup

compile_mix_results.sh
- add "-y" flag to yeast figure
- switch human mix depth to 50M (from 20M)
- reduce redundant EPITOPE env var declarations

run_mix_yeast.sh/run_mix_human.sh
- add directory check and initializations

build_lineplots.py
- add comments with code to stretch figure to edges of plot

update three different README files:
- paper/ - capitalize sentence descriptions and add some detail
- paper/SE/ - add instructions for how to execute all scripts within SE and remove simulations in progress table
- paper/SE/results/ - embed results images in README file with descriptions of summary output files

Fix straggling updates including ones from EpitopeID switch to Bowtie2 for the aligner
- refactor update_tagDB.sh utility to generate Bowtie2 indexes
- update dependency notes in all of identify-*.sh scripts
- minor fix to identify-Strain.sh report name generation
- adjust genome index file check in identify-Epitope.sh for Bowtie2-named index files
- update gitignore with Bowtie2 index filenames
- update paper README with extra dependencies
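A check for Bowtie2-named index files, as mentioned in the list above, might be sketched like this. It is a Python illustration rather than the shell logic in identify-Epitope.sh; the function names are assumptions. It relies on the file names `bowtie2-build` writes for an index prefix: six files ending in `.1.bt2`, `.2.bt2`, `.3.bt2`, `.4.bt2`, `.rev.1.bt2`, and `.rev.2.bt2`, or the same names with a `.bt2l` extension for large indexes.

```python
import os

# Suffixes that bowtie2-build appends to the index prefix (small index).
BT2_SUFFIXES = (".1.bt2", ".2.bt2", ".3.bt2", ".4.bt2", ".rev.1.bt2", ".rev.2.bt2")


def bowtie2_index_files(prefix, large=False):
    """List the index file paths expected for a given index prefix."""
    ext = "l" if large else ""  # large indexes use the .bt2l extension
    return [prefix + suffix + ext for suffix in BT2_SUFFIXES]


def has_bowtie2_index(prefix):
    """True if a complete small or large Bowtie2 index exists for prefix."""
    for large in (False, True):
        if all(os.path.isfile(path) for path in bowtie2_index_files(prefix, large)):
            return True
    return False
```

A setup script could call `has_bowtie2_index` before deciding whether to rebuild an index, instead of looking for BWA-style index files.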

owlang added 30 commits June 4, 2022 14:19

Add new synthetic epitopes

a8c1b80

In order to perform new expanded SyntheticEpitope simulations across a broader variety of parameters, more synthetic epitopes have been created of varying lengths

include R100,R50,R20 to generate synth genomes

dccf535

Update generate_synthetic_genomes.sh script to include building genomes with the new synthetic epitopes.

add new synthetic epitopes to tagDB setup

a17d125

The synthetic epitope fasta sequences are renamed to be unique from each other (named by length) so that they can be added to the TagDB together during setup.

update build synthetic genome scripts

fcce324

- SyntheticEpitope -- The FASTA indexing was sacCer3-specific so this commit fixes it to be generalized to whatever genome.
- SyntheticDeletion and SyntheticStrain -- Add FASTA indexing commands to the scripts that generate the synthetic genomes.

add parallelization to simulation script

4023ae1

Switch from bedtools getfasta to seqkit's subseq, replace, rename, and sort tools, which together can parallelize the process. Minor formatting fixes to the `build_jobs.pbs` script.

expand simulations to vary different parameters

1499004

Expand experiments to repeat at more genes:
- sacCer3 added: Sua7, Taf2, Spt4, Spt7, Gcn5, Hsf1, Fzo1, Lge1
- hg19 added: MED12, YY1, USF1, GABPA, ESR1, FOXA1, SHH, EP300

add .md ext to README files

a1e5877

- switch to `.md` extension so that the browser display is rendered as GitHub markdown
- reformat text a bit to leverage markdown features
- add `.DS_Store` files to `.gitignore`

add .md ext to remaining README files

ad0312c

Simple file rename and then reformat in later commits

reformat README files

10e134f

use markdown features to reformat README files

add checklist to SyntheticEpitope README

6683f88

Create table of TODOs and update as jobs complete/data is generated

update SyntheticEpitope progress

279b8ba

add markers for initial submission

update simulation progress

11548a2

update README table with simulations completed

add script to check progress on simulations

ab76915

The shell script was added to check on simulations run so far

fix db naming to account for ref genome

9d64824

The database name should dynamically update according to the reference genome/organism simulated from.

update simulation progress

42bf110

human tester runs have been inconsistently failing so I am only marking complete sets

mark in-progress sets

dfd8460

mark experiment sets that are currently running on ACI

update simulation progress

734a9cc

update README table with simulations completed

update simulation progress

478b41c

update README table with simulations completed

update simulation progress

ed5a7b8

update README table with simulations completed

update simulation progress

f5cd513

update README table with simulations completed

update simulation progress

309d247

update README table with simulations completed

update simulation progress

61a502c

update README table with simulations completed

update simulation progress

2e28833

update README table with simulations completed

update simulation progress

2710e4d

update README table with simulations completed

update simulation progress

3ce3af2

update README table with simulations completed

update simulation progress

cc47bcc

update README table with simulations completed

update simulation progress

5a6af0f

update README table with simulations completed

owlang added 29 commits January 31, 2023 09:50

remove unused bwa indexing from setup script

e3ec6ef

Since we are switching EpitopeID over to use bowtie2 for the aligner, there is no longer any need for the BWA indexing. These statements are commented out in this commit.

rename human EpitopeID with HIV database

9486c8d

Update database name in setup script and EpitopeID for HIV samples script. (from hiv_EpiDB --> hg19-HIV_EpiID)

update simulation progress with results

0eeb840

Commit ID and runtime results so far with the latest simulation progress update to the README

remove "partial" data files

f692cca

correct overwrites from previous commit

2a29356

During migration, the files for 1M-R100-Fzo1, 1M-R100-Hsf1, and 1M-R100-Lge1 were mistakenly added to the 1M-R100-Gcn5 results directory. This commit corrects the mistake.

flatten directory structure of yeast results

3d337cd

The current yeast results directory structure is too complicated so this commit is flattening the directories to match the human results (all tab files named by experiments saved to the same directory).

update simulation progress with results

3348499

Commit ID and runtime results so far with the latest simulation progress update to the README

update with latest simulation results

b131490

This commit fills in missing data with complete results for the human simulations

update with latest simulation results

2439bb1

These are updates of missing data generated from ACI-ROAR run of sacCer3 simulations (not collab set). They are seeded so they should be the same.

update with HIV rerun results

0fabe3a

HIV samples rerun through new version of EpitopeID that uses bowtie2 and with fixed FASTQ reformatting from SRA FASTQ format. This commit includes the results

hardcode target order in plotting scripts

31f2b42

The yeast target order is different between violin and bar plots so this commit enforces a new target ordering for the figures generated.
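One way to enforce a shared target order across plotting scripts is to sort records against a fixed list before drawing. The sketch below is a minimal Python illustration, not the code in the plotting scripts; the record layout and the specific order shown are assumptions.

```python
# Assumed ordering for illustration; the real scripts may use a different one.
TARGET_ORDER = ["Reb1", "Rap1", "Sua7"]


def order_by_target(records, target_order=TARGET_ORDER):
    """Sort records by a fixed target order; unknown targets go last.

    sorted() is stable, so records for the same target keep their
    original relative order.
    """
    rank = {name: i for i, name in enumerate(target_order)}
    return sorted(records, key=lambda rec: rank.get(rec["target"], len(rank)))
```

Sharing one such list between the violin and bar plot scripts keeps the panels in the same order regardless of how the input files are sorted.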

rename READMEs README.md

7843ba1

rename READMEs so that Github can render the markdown syntax.

update with raw results of yeast mixture (1M) eid

7746a54

EpitopeID was run on mixture simulations of Rap1 and Reb1 at depth 1M. This commit includes all the raw results of these simulations.

add yeast mixture eid summary results and fig

a79306e

Add a script to compile raw results into summary formats. Include new python script for drawing out line plot of summarized results.

update hg19 depth simulation summaries and figs

fab5c02

Update summary reports and figures to include missing R20-POLR2H results and other rerun samples from the last commit.

add mix-human eid raw results

7f85bf1

Commit raw head -n 9999 * results of each titration experiment

add human mixture eid summary results and fig

8ed21d9

Add a script to compile raw results into summary formats. Include minor fix to `run_EpitopeID_on_mix_human.pbs` script to output STDOUT/ERR logs to appropriately named files

Merge branch 'master' into revisions

9a2de36

owlang merged commit 4e9b2e8 into master Mar 22, 2023

owlang merged 82 commits into master from revisions