
BioRxiv submission update (BWA to Bowtie2) #20

Merged
owlang merged 82 commits into master from revisions on Mar 22, 2023

@owlang owlang commented Mar 8, 2023

Include updates with latest EpitopeID changes (switch aligner from BWA to Bowtie2 for improved sensitivity)
Update with results from fresh EpitopeID runs (SyntheticEpitope, ENCODE-eGFP, and HIV_samples)
Streamline scripts for generating manuscript data

owlang added 30 commits June 4, 2022 14:19
In order to perform new expanded SyntheticEpitope simulations across a broader variety of parameters, more synthetic epitopes have been created of varying lengths
Update generate_synthetic_genomes.sh script to include building genomes with the new synthetic epitopes.
Simulations take some time to run and a large chunk of time is spent by the generate_random_BED_from_Genomic_FASTA.pl script looping through each bp of the genome to count up the genome size. This commit includes an alternative to bypass most of that time spent by creating a FASTA index `.fai` file using samtools for each synthetic genome and creating a variation of the perl script that parses a `.fai` file instead.
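The `.fai` shortcut described in this commit can be sketched as follows. This is a minimal Python illustration, not the Perl variant the commit adds: it assumes only the standard samtools `.fai` layout (tab-separated, sequence length in the second column). The function name `genome_size_from_fai` and the offsets in the example rows are made up for illustration.

```python
import io


def genome_size_from_fai(fai_lines):
    """Sum the sequence lengths (column 2) of a samtools .fai index.

    Reading one line per sequence avoids walking every base of the FASTA.
    """
    total = 0
    for line in fai_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # tolerate blank lines
        fields = line.split("\t")
        total += int(fields[1])  # column 2 holds the sequence length
    return total


# Illustrative two-sequence index; the offset columns are placeholders.
example_fai = io.StringIO(
    "chrI\t230218\t6\t60\t61\n"
    "chrII\t813184\t234067\t60\t61\n"
)
print(genome_size_from_fai(example_fai))
```

The cost scales with the number of sequences in the index rather than the number of bases in the genome, which is where the time saving would come from.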
The synthetic epitope fasta sequences are renamed to be unique from each other (named by length) so that they can be added to the TagDB together during setup.
SyntheticEpitope--The FASTA indexing was sacCer3-specific, so this commit generalizes it to work with any genome.

SyntheticDeletion and SyntheticStrain--Add FASTA indexing commands to the scripts that generate the synthetic genomes.
Dramatic restructure of simulations:

`depth_simulations.txt` -- describes experimental designs
*sequencing "depth" removed as a variable (implied run of each sequencing depth for every job)
*synthetic epitope length added as a variable tested in each row/"experimental design"
`job/build_jobs.sh` -- script to build PBS submission scripts based on the `depth_template.pbs` submission script using each "experimental design" row in `depth_simulations.txt`
`job/depth_template.pbs` -- The outline for new simulation script that simulates a dataset at every depth tested.

...and removed all old submission scripts that hard-coded each experimental design.
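The template-driven approach above (one submission script per "experimental design" row) might look roughly like this. It is a Python sketch of the idea, not the contents of `job/build_jobs.sh`: the two-column row layout, the placeholder names, and the `example_job_NN.pbs` file names are all assumptions made for illustration.

```python
from string import Template

# Hypothetical stand-in for depth_template.pbs; the real template may differ.
PBS_TEMPLATE = Template(
    "#!/bin/bash\n"
    "#PBS -N example_job_${index}\n"
    "TARGET=${target}\n"
    "EPITOPE_LENGTH=${epitope_length}\n"
)


def build_jobs(design_text):
    """Render one job script per non-empty row of a tab-separated design table.

    Assumes each row is "<target>\\t<epitope length>" (an illustrative layout).
    Returns a dict mapping a hypothetical file name to the script text.
    """
    jobs = {}
    rows = [row for row in design_text.splitlines() if row.strip()]
    for index, row in enumerate(rows, start=1):
        target, epitope_length = row.split("\t")
        name = f"example_job_{index:02d}.pbs"
        jobs[name] = PBS_TEMPLATE.substitute(
            index=f"{index:02d}",
            target=target,
            epitope_length=epitope_length,
        )
    return jobs


for name, script in build_jobs("Reb1\t100\nRap1\t20\n").items():
    print(name)
    print(script)
```

Keeping the experimental designs in one table and the job logic in one template means a new design is a new row rather than a new hard-coded script.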
switch from bedtools getfasta to seqkit's subseq, replace, rename, and sort tools, which together can parallelize the process. minor formatting fixes to `build_jobs.pbs` script
Expand experiments to repeat at more genes:
sacCer3 added: Sua7, Taf2, Spt4, Spt7, Gcn5, Hsf1, Fzo1, Lge1
hg19 added: MED12, YY1, USF1, GABPA, ESR1, FOXA1, SHH, EP300
- switch to `.md` extension so that the browser displays rendered Github markdown
- reformat text a bit to leverage markdown features
- add `.DS_Store` files to `.gitignore`
Simple file rename and then reformat in later commits
use markdown features to reformat README files
Create table of TODOs and update as jobs complete/data is generated
add markers for initial submission
update README table with simulations completed
The shell script was added to check on simulations run so far
extend the `build_jobs.sh` shell script to use the new `epitopeid_template.pbs` template script to create the `run_EpitopeID_XX_....pbs` submission scripts, in the same style as the `run_depth_XX_....pbs` submission scripts. The old structure for running EpitopeID is also abandoned, and its respective PBS scripts are removed.
The database name should dynamically update according to the reference genome/organism simulated from.
human tester runs have been inconsistently failing so I am only marking complete sets
mark experiment sets that are currently running on ACI
update README table with simulations completed
update README table with simulations completed
update README table with simulations completed
update README table with simulations completed
update README table with simulations completed
update README table with simulations completed
update README table with simulations completed
update README table with simulations completed
update README table with simulations completed
update README table with simulations completed
update README table with simulations completed
owlang added 29 commits January 31, 2023 09:50
Since we are switching EpitopeID over to use bowtie2 for the aligner, there is no longer any need for the BWA indexing. These statements are commented out in this commit.
Update database name in setup script and EpitopeID for HIV samples script. (from hiv_EpiDB --> hg19-HIV_EpiID)
The FASTQ format from SRA downloads, while valid, does not work with EpitopeID. Be sure to note this in the documentation. This script was adjusted to rename the files with *_R1.fastq.gz and *_R2.fastq.gz instead of *_1.fastq.gz and *_2.fastq.gz. The files themselves were also processed with sed to strip out the header descriptions for the quality scores (plain "+" on those lines) and to replace the SRR read id with the next token (raw Illumina-style read id).

Minor changes: rename the conda environment and use the sra-tools fastq dump instead of parallel-fastq-dump to retrieve FQ files
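The FASTQ header cleanup described in this commit can be sketched in Python as below. The commit itself uses sed, so this is an equivalent illustration rather than the script's code. It assumes strict four-line FASTQ records and headers of the form `@<SRR id> <original read id> ...`; the accession and read id in the example are invented.

```python
def clean_sra_fastq(lines):
    """Rewrite SRA-style FASTQ records into a simpler header form.

    - Header line: drop the SRR read id and keep the next token
      (the raw Illumina-style read id), if there is one.
    - Separator line: reduce "+<description>" to a plain "+".
    Assumes exactly four lines per record (no wrapped sequences).
    """
    cleaned = []
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        position = i % 4
        if position == 0:
            tokens = line[1:].split()
            read_id = tokens[1] if len(tokens) > 1 else tokens[0]
            cleaned.append("@" + read_id)
        elif position == 2:
            cleaned.append("+")
        else:
            cleaned.append(line)  # sequence and quality lines are untouched
    return cleaned


# Invented record for illustration only.
record = [
    "@SRR0000001.1 HWI-ST000:1:C0000:1:1101:1000:2000 length=8",
    "ACGTACGT",
    "+SRR0000001.1 HWI-ST000:1:C0000:1:1101:1000:2000 length=8",
    "IIIIIIII",
]
print("\n".join(clean_sra_fastq(record)))
```

Deciding by line position rather than by the leading character matters because quality strings can legitimately begin with `@` or `+`.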
Commit ID and runtime results so far with the latest simulation progress update to the README
The migrated files for 1M-R100-Fzo1, 1M-R100-Hsf1, and 1M-R100-Lge1 were mistakenly added to the 1M-R100-Gcn5 results directory. This commit corrects the mistake.
The current yeast results directory structure is too complicated so this commit is flattening the directories to match the human results (all tab files named by experiments saved to the same directory).
Commit ID and runtime results so far with the latest simulation progress update to the README
Modify the `job/compile_results.sh` script to call the tally script on all the raw id results and include shell commands to compile runtime results. These results summaries can be turned into figures using the newly committed/modified `scripts/build_barplots.py` and `scripts/build_violinplots.py`.
This commit fills in missing data with complete results for the human simulations
These are updates of missing data generated from ACI-ROAR run of sacCer3 simulations (not collab set). They are seeded so they should be the same.
HIV samples rerun through new version of EpitopeID that uses bowtie2 and with fixed FASTQ reformatting from SRA FASTQ format. This commit includes the results
Adjust human scripts to omit 100K plots, adjust title based on -y flag, and select color scheme based on -y flag. Commit the PNG figure outputs for human results. Add figure generating script calls to compile_results.sh script and reduce number of yeast targets to include in the figure.
update with latest yeast simulation results. Remove target results from other 7 proteins to keep only the publication figure set (Reb1, Rap1, Sua7). Also included are runtimes that were not included before.

Minor updates to README to reflect these changes
The yeast target order is different between violin and bar plots so this commit enforces a new target ordering for the figures generated.
Add Runtime and ID results summary reports for sacCer3
Add Runtime and ID-tally PNG figures of yeast summary reports
Update `compile_results.sh` script
- reflect new raw and summary data structure
- fix up -e <epitopeName> string to use the appropriate RANDOM_SEQ_XXXX format
- change figures to save from 'svg' to 'png'
rename READMEs so that Github can render the markdown syntax.
Switch to a simulation index-oriented run of the mixture simulations (rather than a per-mixture ratio). These new scripts are modified to be more similar to the depth and epitopeid template PBS scripts. The simulation script was also updated to force overwrite of gzip files for convenient re-running of simulations and the check for completion was removed in favor of an immediate overwrite setup.
EpitopeID was run on mixture simulations of Rap1 and Reb1 at depth 1M. This commit includes all the raw results of these simulations.
Add a script to compile raw results into summary formats. Include new python script for drawing out line plot of summarized results.
The results from all the R20-POLR2H experiments were missing in the last commit. These were run, and some other experiments were rerun as a "polish" to cover some missing runs (R100-YY1 and R100-CTCF). Reruns are largely consistent; the ID file changes are mostly due to a different sort order of shared-rank hits and differences of a few seconds in runtimes.
Update summary reports and figures to include missing R20-POLR2H results and other rerun samples from the last commit.
Commit raw head -n 9999 * results of each titration experiment
Add a script to compile raw results into summary formats. Include minor fix to `run_EpitopeID_on_mix_human.pbs` script to output STDOUT/ERR logs to appropriately named files
Figure 3 (browser shots of pileups for mislabelled ID3 and NR4A1) inserted into the manuscript. Scripts added:
-job/03_MakeBrowserData_genomes.sh --makes synthetic genomes (ID3-eGFP and NR4A1-eGFP) to align against
-job/04_MakeBrowserData_BAM.sh --filter fastq files and align to each synthetic genome
-README.md --describe results directory structure
-results/annotations -- annotations for marking relevant features in synthetic genomes of browser figure
-.gitignore --update with BrowserData directories
mix_fastq.sh
- uncomment intermediate file cleanup
compile_mix_results.sh
- add "-y" flag to yeast figure
- switch human mix depth to 50M (from 20M)
- reduce redundant EPITOPE env var declarations
run_mix_yeast.sh/run_mix_human.sh
- add directory check and initializations
build_lineplots.py
- add comments with code to stretch figure to edges of plot
update three different README files:
- paper/ - capitalize sentence descriptions and add some detail
- paper/SE/ - add instructions for how to execute all scripts within SE and remove simulations in progress table
- paper/SE/results/ - embed results images in README file with descriptions of summary output files
Fix straggling updates including ones from EpitopeID switch to Bowtie2 for the aligner
- refactor update_tagDB.sh utility to generate Bowtie2 indexes
- update dependency notes in all of identify-*.sh scripts
- minor fix to identify-Strain.sh report name generation
- adjust genome index file check in identify-Epitope.sh for Bowtie2-named index files
- update gitignore with Bowtie2 index filenames
- update paper README with extra dependencies
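An index file check for Bowtie2-named files, like the one adjusted in identify-Epitope.sh, could be sketched as follows. This Python version is an illustration and not the shell logic in that script. It relies only on Bowtie2's standard index naming (six files sharing a prefix, ending in `.bt2`, or `.bt2l` for large indexes); the function name and the example prefix are hypothetical.

```python
from pathlib import Path

# The six files bowtie2-build writes for a small index, relative to the prefix.
BT2_SUFFIXES = (
    ".1.bt2", ".2.bt2", ".3.bt2", ".4.bt2", ".rev.1.bt2", ".rev.2.bt2",
)


def missing_bowtie2_index_files(prefix, large=False):
    """Return the expected Bowtie2 index files that do not exist for a prefix.

    With large=True the ".bt2l" extension of large indexes is checked instead.
    An empty list means the index looks complete.
    """
    extra = "l" if large else ""
    expected = [prefix + suffix + extra for suffix in BT2_SUFFIXES]
    return [path for path in expected if not Path(path).exists()]


# Hypothetical prefix; prints whichever of the six files are absent.
print(missing_bowtie2_index_files("genome/example_genome"))
```

Checking all six files rather than a single one helps catch an index build that was interrupted partway through.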
@owlang owlang merged commit 4e9b2e8 into master Mar 22, 2023