# Background and Problem Definition
At Clear Labs, we develop automated NGS assays based on short- and long-read sequencing technology. Many of our products rely on short-read sequencing for its greater data volume and higher accuracy. Example assays include SARS-CoV-2 surveillance, microbial isolate sequencing for surveillance, and wastewater surveillance. Each assay is paired with a bioinformatics pipeline that processes the sequenced reads from each input sample to answer biological questions.

In these bioinformatics pipelines, filtering steps are applied so that spurious reads, whether low quality or contaminants, do not confound the results. Reads are also lost as a byproduct of mapping to a reference genome before further analysis. Losing reads to filtering corresponds directly to lost revenue and lost assay power, as every read thrown out represents a spot on a flowcell that is not providing helpful information. A 50% filtering rate is equivalent to a 50% reduction in the number of samples that could have been sequenced to a given quality, all things being equal.

If one could characterize what kinds of aberrant reads an assay generates, the assay designers could implement changes that reduce the production of these aberrant reads in favor of useful ones. For example, if the dropped reads are due to low-intensity spots belonging to fragments with overly long insert sizes, one could apply size selection before PCR to remove long fragments. If instead many reads came from human contamination, different procedures to reduce contamination could be adopted.

# Project Goals 
This project aims to develop a WDL (Workflow Definition Language) workflow that can analyze reads filtered out at any or all stages of a bioinformatics pipeline and automatically generate a helpful report, allowing non-bioinformaticians to understand why reads are being removed. The WDL workflow will consist of multiple tasks, each running open-source or custom code to perform a specific analysis. The report should break down the filtered reads by each of an arbitrary number of user-defined filtering steps, displaying the categories or clusters of reads lost at each step and their proportion of the total number filtered at that step. The exact design of the report is subject to change.

Once the workflow and reporting code are implemented, this project will demonstrate the utility of such an analysis by running it on real datasets from historical Clear Labs development runs where the cause of high read filtering is known, and perhaps on an additional, more speculative dataset where we can generate hypotheses. The results of these studies will be compiled into a paper/report in the form of a Jupyter notebook. A GitHub repo with steps to install and run the workflow will be created so that others in the industry can use the tool. Finally, I will present the tool to my colleagues in our assay development group and deploy it into our research pipeline. Stretch goals would be to implement this as a web-hosted tool and to extend it to long reads.


# Project Resources and Tools
The computational resources for this project will consist of Google Cloud Platform VMs provided by Clear Labs. An example dataset that can be run on a personal computer will be available via GitHub. The data for this project will consist of historical runs from the development of some Clear Labs assays. This data will primarily be short-read NGS data from SARS-CoV-2 and microbial samples, with the particular runs chosen to demonstrate the utility of dropped reads analysis.

The tools I intend to use for this analysis are pending investigation. However, the Workflow Definition Language, run via miniwdl, will be the platform for the analysis, as it fits with existing Clear Labs R&D pipelines. The goal is to integrate this workflow into our existing software stack. Taxonomy tools such as BLAST, Kraken, and their competitors will be evaluated for performance and applicability. Other motif-based read analysis tools may be implemented, but until further investigation the extent of additional tools we would like to integrate is unknown. Custom analysis tools may also be developed. Ideally, MultiQC, modified to incorporate the outputs of all the selected tools into a single HTML report, will be used for reporting.

# Application of Knowledge
This project will study genomic NGS data, particularly its failure modes and its use in taxonomy assignment. Custom Python and bash code will be used to modify existing packages and to write scripts that run the various workflow tasks. Python or R, whichever is more expedient, will be used to write any novel analysis modules that are needed. One example of a novel analysis module is the clustering of dropped reads into failure groups using metrics from other tools as features. This will likely use machine learning techniques such as clustering algorithms and dimensionality reduction.

# Deliverables
- The Dropped Reads Analysis Pipeline workflow
- A GitHub repo containing the workflow code and instructions to install and run it
- A report containing results and analysis of the workflow outputs from multiple historical runs, demonstrating the pipeline's efficacy
    - Jupyter may not work perfectly for this project: it is an open question whether I can set the workflow to run within Jupyter, and the .html reports for the example analysis runs will be separate files.
    - I would like to discuss how to approach the Jupyter notebook requirement further in the context of this more tool- and reporting-focused project.
- A set of HTML reports that represent the outputs for the test datasets

# Timeline
This timeline is a rough outline of what I intend to do each week of the semester, starting each Monday. 
- Week of 5/20
    - Literature review: Understand the mechanisms by which reads become low quality or off target
- Week of 5/27
    - Literature review and experimentation with tools useful for dropped reads analysis
- Week of 6/3
    - Workflow design and begin implementation 
- Week of 6/10
    - Workflow implementation 
- Week of 6/17
    - Workflow implementation 
- Week of 6/24
    - Reporting component implementation / Workflow implementation complete
- Week of 7/1
    - Reporting component implementation, test data selection, and analysis
- Week of 7/8
    - Reporting component implementation complete / Test data analysis
- Week of 7/15
    - GitHub repo setup, test data analysis
- Week of 7/22
    - Final Report Draft
- Week of 7/29
    - Final Report  Revision and Submission
- Week of 8/5
    - I would like to reserve this as a buffer week in case any of my other weeks take more time than expected
- Week of 8/12
    - Final submission. I am leaving for Europe on 8/14, so the project must be complete by then
# Methods 
# Dropped Reads Analysis Pipeline (DRAP)
## Pipeline Block Diagram
![DRAP Block Diagram](https://i.imgur.com/kX1NIxm.png)
## Project Goal
The general project goal is to develop a Workflow Definition Language (WDL) workflow composed of WDL tasks that wrap custom Python or R code as well as existing bioinformatics tools. This workflow, known as the Dropped Reads Analysis Pipeline (DRAP), will intake a labeled set of fastq files or fastq file pairs, along with some reference objects and databases, and will return an HTML report that contains metrics about the reads in each set of fastqs/fastas, separated by the source file. This report will allow DRAP users to characterize the reads dropped at each filtering step, where each input set of files comes from a distinct filtering step.
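
As a rough illustration of what a "labeled set of fastq files" could look like in practice, the sketch below assembles a WDL inputs JSON in which each filtering step carries a label and one or two fastq paths. The workflow name `dropped_read_analysis`, the input keys (`filter_steps`, `label`, `fastq_r1`, `fastq_r2`, `reference_fasta`), and all file names are hypothetical placeholders; the real input schema has not been designed yet.

```python
import json

def build_inputs(steps, reference_fasta=None, workflow="dropped_read_analysis"):
    """Build a WDL inputs dict from (label, r1_path, r2_path_or_None) tuples.

    All key names are hypothetical; they only illustrate the intended shape.
    """
    labels = [label for label, _, _ in steps]
    if len(set(labels)) != len(labels):
        raise ValueError("each filtering step needs a unique label")
    inputs = {
        f"{workflow}.filter_steps": [
            # Single-end steps simply omit the fastq_r2 key.
            {"label": label, "fastq_r1": r1, **({"fastq_r2": r2} if r2 else {})}
            for label, r1, r2 in steps
        ]
    }
    if reference_fasta:
        inputs[f"{workflow}.reference_fasta"] = reference_fasta
    return inputs

example = build_inputs(
    [
        ("quality_filter", "qc_dropped_R1.fastq.gz", "qc_dropped_R2.fastq.gz"),
        ("unmapped", "unmapped.fastq.gz", None),
    ],
    reference_fasta="reference.fasta",
)
print(json.dumps(example, indent=2))
```

Keeping the step label next to its files means every downstream task can report results per filtering step without relying on file naming conventions.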

## Desired Metrics
At this point in the project, we aim to collect the following categories of metrics:

- **QC metrics**
  - Read Quality
  - Base Composition
  - Read uniqueness and repetition
  - Read Lengths

- **Adapter/Primer artifacts**
  - Dimers
  - Erroneous primer and adapter inclusions

- **Spatial Distributions**
  - Plot the spatial positions of each class of dropped reads on the Illumina flowcell based on cluster coordinates

- **Mapping/Alignment-based stats**
  - Insert sizes (using either BLAST or a loose alignment, depending on settings)
  - Chimera detection

- **Contamination profiling**
  - Kmer-based taxonomy
  - BLAST-based taxonomy

## Tools by Goal

### QC
For QC reporting, we will use `fastqc`. This tool can quickly and efficiently generate per-cycle stats, including per-base quality, GC content, base composition, length distributions, overrepresentation and duplication, and kmer frequencies. The tool generates an HTML report for each fastq file that can then be integrated via `multiqc` into a final report. DRAP will likely run this task multiple times to characterize different groupings of reads.
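
To make the QC metrics concrete, here is a minimal sketch of two per-read statistics (mean Phred quality and GC fraction) computed directly from FASTQ text. It is not how `fastqc` is implemented; it only illustrates the kind of numbers involved. It assumes Phred+33 quality encoding and strictly four lines per record, and the function names are made up for this example.

```python
def read_fastq(lines):
    """Yield (name, sequence, quality) from FASTQ lines, assuming 4 lines per record."""
    it = iter(lines)
    # zip over the same iterator four times groups the lines into records.
    for header, seq, _plus, qual in zip(it, it, it, it):
        yield header.strip()[1:], seq.strip(), qual.strip()

def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)

def gc_fraction(sequence):
    """Fraction of bases that are G or C."""
    if not sequence:
        return 0.0
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

example_fastq = ["@r1", "ACGT", "+", "IIII", "@r2", "GGCC", "+", "!!!!"]
per_read = [
    (name, mean_phred(qual), gc_fraction(seq))
    for name, seq, qual in read_fastq(example_fastq)
]
print(per_read)
```

Per-read values like these could also serve as features for the clustering of dropped reads into failure groups.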

- **Fastqc GitHub**: [https://github.com/s-andrews/FastQC](https://github.com/s-andrews/FastQC)

### Adapter/Primer Dimers
We will write a custom bash/Python task using the BBMap tool `bbduk` as a base. `bbduk` splits fastq files based on whether the sequences contain kmers from a reference. We can use this to detect whether the reads contain primer or adapter sequences, then use Biopython code to categorize and quantify the offending primers and adapter pairs. This will require a reference containing all primer and adapter sequences.
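
The categorization step that would follow `bbduk` can be sketched in plain Python. The example below tallies which combination of oligos appears in each read, checking both strands with exact substring matching. This is a simplification: a real task would probably need to tolerate mismatches and partial matches. The primer names and sequences are invented for illustration.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]

def oligos_in_read(read, oligos):
    """Return sorted names of oligos found in the read on either strand (exact match)."""
    read = read.upper()
    return sorted(
        name
        for name, seq in oligos.items()
        if seq.upper() in read or reverse_complement(seq) in read
    )

def tally_oligo_combinations(reads, oligos):
    """Count reads per combination of oligos they contain; reads with no hit are skipped."""
    counts = Counter()
    for read in reads:
        hits = oligos_in_read(read, oligos)
        if hits:
            counts[tuple(hits)] += 1
    return counts

# Hypothetical primers, not taken from any real assay.
oligos = {"primer_F": "ACGTACGTAC", "primer_R": "TTGACCGGAA"}
dimer_like = oligos["primer_F"] + reverse_complement(oligos["primer_R"])
reads = [dimer_like, "GGGG" + oligos["primer_F"] + "GGGG", "GGGGGGGGGGGG"]
combo_counts = tally_oligo_combinations(reads, oligos)
print(combo_counts)
```

Reads that contain two oligos and little else are candidate dimers, and the tuple key shows which primer pair is responsible.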

- **Bbduk GitHub**: [https://github.com/BioInfoTools/BBMap/blob/master/sh/bbduk.sh](https://github.com/BioInfoTools/BBMap/blob/master/sh/bbduk.sh)

### Spatial Read Distribution
We will implement a read coordinate mapping task using Python, returning images of the distribution of various read subsets and statistics on the likelihood they are clustered. This code should also employ some machine learning to try to find any very obvious flowcell-based artifacts. Python libraries will include `sklearn` and `biopython`. The task should be sequencer agnostic if possible, only requiring it be an Illumina sequencing by synthesis instrument. This task will be invoked multiple times on various groupings of reads.
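
A first step for this task is recovering cluster coordinates from read names. The sketch below assumes the common Illumina naming layout `@instrument:run:flowcell:lane:tile:x:y` followed by an optional space-separated comment. Older or vendor-modified header formats would need extra handling, and the example read names are fabricated.

```python
from collections import Counter

def parse_illumina_name(header):
    """Extract flowcell position fields from an Illumina-style read name."""
    name = header.lstrip("@").split()[0]  # drop '@' and the trailing comment
    fields = name.split(":")
    if len(fields) < 7:
        raise ValueError(f"not a 7-field Illumina read name: {header!r}")
    return {
        "flowcell": fields[2],
        "lane": int(fields[3]),
        "tile": int(fields[4]),
        "x": int(fields[5]),
        "y": int(fields[6]),
    }

def reads_per_tile(headers):
    """Count reads per (lane, tile); hot spots may hint at flowcell artifacts."""
    counts = Counter()
    for header in headers:
        pos = parse_illumina_name(header)
        counts[(pos["lane"], pos["tile"])] += 1
    return counts

# Fabricated read names for illustration only.
example_headers = [
    "@M00001:12:000000000-ABCDE:1:1101:15589:1332 1:N:0:1",
    "@M00001:12:000000000-ABCDE:1:1101:16000:1400 1:N:0:1",
    "@M00001:12:000000000-ABCDE:1:2105:9000:22000 1:N:0:1",
]
tile_counts = reads_per_tile(example_headers)
print(tile_counts)
```

Comparing per-tile counts of dropped reads against per-tile counts of all reads would give a simple enrichment measure before any clustering is applied.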

### Mapping and Alignments
- **BWA**: `bwa mem` will be used with loose stringency settings to check whether reads removed by quality filtering have specific attributes. This only works when the reference genome is known, in which case the user can provide it as an input to DRAP. Metrics to quantify include:
  - Long insert size
  - High GC content
  - Repetitive regions
  - High degrees of contamination (clipping)
  - Signs of chimerism in mapped reads

- **BWA GitHub**: [https://github.com/lh3/bwa](https://github.com/lh3/bwa)

- **BLAST-analysis**: The contamination section will outline a BLAST task for taxonomy detection; however, another task will be designed to calculate probable insert sizes from a sampling of BLAST alignments. This will allow us to understand insert sizes even when we do not have a reference. This task will be implemented in Python. Another BLAST analysis task will be written to detect chimeric reads by checking for splices from dissimilar organisms in the same read.
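
Several of the alignment-based metrics listed above can be read straight from SAM records. The sketch below pulls the soft-clipped fraction from the CIGAR string, the absolute template length (TLEN, a proxy for insert size), and two hints of chimerism: the supplementary-alignment flag (0x800) and the presence of an `SA:Z:` tag. The field positions follow the SAM specification; the function names and the example record are invented for illustration, and any thresholds would still need to be chosen.

```python
import re

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def soft_clip_fraction(cigar):
    """Fraction of query bases that are soft-clipped, from a SAM CIGAR string."""
    if cigar == "*":
        return 0.0
    ops = _CIGAR_OP.findall(cigar)
    # M, I, S, = and X are the operations that consume query bases.
    query_len = sum(int(n) for n, op in ops if op in "MIS=X")
    clipped = sum(int(n) for n, op in ops if op == "S")
    return clipped / query_len if query_len else 0.0

def summarize_sam_line(line):
    """Summarize one SAM alignment line into metrics relevant to dropped reads."""
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])
    return {
        "name": fields[0],
        "unmapped": bool(flag & 0x4),
        "supplementary": bool(flag & 0x800),
        "has_sa_tag": any(f.startswith("SA:Z:") for f in fields[11:]),
        "abs_template_length": abs(int(fields[8])),
        "soft_clip_fraction": soft_clip_fraction(fields[5]),
    }

# Fabricated record: 20 soft-clipped bases, TLEN of 750, and an SA tag.
example_line = "\t".join(
    ["read1", "99", "chr1", "100", "60", "20S80M", "=", "700", "750", "*", "*",
     "SA:Z:chr2,500,+,20M80S,60,0;"]
)
summary = summarize_sam_line(example_line)
print(summary)
```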

### Contamination Profiling
- **DecontaMiner**: DecontaMiner is a preexisting pipeline that runs BLAST on unmapped reads and interprets the results. We will investigate using it, but its functionality appears fairly simple to recreate using BLAST and Kraken.
  - **DecontaMiner GitHub**: [https://github.com/amarinderthind/decontaminer](https://github.com/amarinderthind/decontaminer)

- **Kraken2**: The main tool DRAP will use for contamination detection is Kraken2. Kraken2 is an extremely fast and memory-efficient taxonomy caller that uses kmers and a hash table to detect the likely source of a given read. This suits DRAP well, as it lets us assign contaminant reads to a taxon. Kraken2 also already has a multiqc module that can be modified for our purposes.
  - **Kraken2 GitHub**: [https://github.com/DerrickWood/kraken2](https://github.com/DerrickWood/kraken2)

- **Bracken**: As a companion to Kraken, we may also integrate Bracken, which uses Bayesian methods to estimate the abundances of individual species based on Kraken results. The main goal would be quantifying contamination, so even family level may suffice.
  - **Bracken GitHub**: [https://github.com/jenniferlu717/Bracken](https://github.com/jenniferlu717/Bracken)

- **BLAST**: NCBI BLAST is a necessary tool to run on at least a subsample of the data as it performs local alignments. Local alignments are important in DRAP for detecting chimeras, where part of a read maps to one organism and another to a completely different organism. It can also allow us to understand contamination in low-quality reads, as it is more robust to errors than global alignment algorithms. We will also use BLAST to help us estimate insert sizes. BLAST is relatively slow, so parallelism will be utilized, along with subsampling of input reads, and a reduction of the database to a set of clusters.
  - **BLAST NCBI**: [http://blast.ncbi.nlm.nih.gov/Blast.cgi](http://blast.ncbi.nlm.nih.gov/Blast.cgi)

- **BURST**: A fast end-to-end aligner for short reads that aligns them against large reference databases. It does not support local alignment but would be very efficient as a complement to Kraken2. We will keep this as a backup if Kraken2 does not perform well enough.
  - **Burst GitHub**: [https://github.com/knights-lab/BURST](https://github.com/knights-lab/BURST)
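
As a sketch of how Kraken2 output could feed the contamination section of the report, the code below parses the standard six-column, tab-separated Kraken2 report (percentage, clade read count, direct read count, rank code, NCBI taxid, indented name) and lists the taxa at a chosen rank that are not on an expected list. It assumes the default report layout without the extra minimizer columns. The helper names and the example counts are made up.

```python
def parse_kraken2_report(lines):
    """Parse standard 6-column Kraken2 report lines into dicts."""
    rows = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 6:
            continue  # skip blank or malformed lines
        pct, clade_reads, direct_reads, rank, taxid, name = parts[:6]
        rows.append({
            "percent": float(pct),
            "clade_reads": int(clade_reads),
            "direct_reads": int(direct_reads),
            "rank": rank.strip(),
            "taxid": int(taxid),
            "name": name.strip(),  # names are indented to show tree depth
        })
    return rows

def unexpected_taxa(rows, rank="G", expected_taxids=frozenset()):
    """Return (name, clade_reads) at the given rank, excluding expected taxa, largest first."""
    hits = [
        (row["name"], row["clade_reads"])
        for row in rows
        if row["rank"] == rank and row["taxid"] not in expected_taxids
    ]
    return sorted(hits, key=lambda item: item[1], reverse=True)

# Fabricated report: the assay targets Escherichia (taxid 561).
example_report = [
    " 10.00\t10\t10\tU\t0\tunclassified",
    " 90.00\t90\t0\tR\t1\troot",
    " 60.00\t60\t5\tG\t561\t      Escherichia",
    " 30.00\t30\t30\tG\t9605\t      Homo",
]
rows = parse_kraken2_report(example_report)
contaminants = unexpected_taxa(rows, rank="G", expected_taxids={561})
print(contaminants)
```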

## Reporting Tools

### Multiqc
The primary reporting tool for this project will be multiqc, open-source software that aggregates results from many tools across multiple samples into clean and readable HTML format reports. Multiqc already supports many of our tools; however, some, especially our custom written tasks, will need modules developed for multiqc. Multiqc provides a good amount of reference material for developers to implement new modules, so this should be surmountable. Additional changes to treat groups of reads dropped at various stages may be needed.

In addition to multiqc, some custom code will be required to get the outputs from the various DRAP tasks into formats multiqc can accept.
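
One possible route for the custom tasks is MultiQC's custom content feature, which, as I understand it, can pick up tab-separated files whose names end in `_mqc.tsv` and that start with `#`-prefixed configuration lines. The sketch below formats per-step category counts that way. The exact header keys should be checked against the MultiQC documentation before relying on them, and the section id, section title, and counts here are invented.

```python
def format_mqc_bargraph(section_id, section_name, counts):
    """Render {step: {category: n_reads}} as a MultiQC custom-content TSV string.

    The '# key: value' header style is an assumption to verify against the
    MultiQC custom content documentation.
    """
    categories = sorted({cat for per_step in counts.values() for cat in per_step})
    lines = [
        f"# id: '{section_id}'",
        f"# section_name: '{section_name}'",
        "# plot_type: 'bargraph'",
        "\t".join(["Filtering step"] + categories),
    ]
    for step, per_step in counts.items():
        # Categories absent from a step are written as zero.
        row = [step] + [str(per_step.get(cat, 0)) for cat in categories]
        lines.append("\t".join(row))
    return "\n".join(lines) + "\n"

report_text = format_mqc_bargraph(
    "drap_dropped_reads",
    "Dropped reads by category",
    {
        "quality_filter": {"low_quality": 800, "adapter_dimer": 150},
        "unmapped": {"human": 300, "adapter_dimer": 20},
    },
)
print(report_text)
```

Writing this text to a file such as `drap_dropped_reads_mqc.tsv` in the directory MultiQC scans would be the intended use, with one bar per filtering step.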

## Project/Pipeline Infrastructure: WDL and Jupyter

### Background on WDL
WDL, the Workflow Definition Language, is an open-source language for describing data-processing workflows; it is executed by a runner such as miniwdl. My company uses it for all our pipelines to wrap small pieces of bash and Python code into portable tasks, which are then strung together into WDL workflows. WDL tasks run inside Docker images, which ensures that the disparate software invoked by each task has the correct dependencies without the user having to install them locally. WDL is also useful because the runner automatically executes independent tasks in parallel, up to the limits of your computer's resources.

Running WDL requires installing Docker and miniwdl, so the Jupyter Notebook will not be as portable as it would be if it only ran R or Python code. I will attempt to add commands that install these tools to the notebook, but I am not sure that will be possible or work correctly on every OS. Regardless, I will make sure the notebook contains instructions on how to install the dependencies.

### Implementing WDL Workflow via Jupyter
The only way I know to run a WDL workflow from a Jupyter Notebook is by invoking a shell command. The actual code for the individual steps will not run in the notebook cells, since open-source tools like BLAST will be run in parallel on the data by the overall WDL workflow. So, the step for running each dataset will look like the code below.

```python
import subprocess
subprocess.run("miniwdl run dropped_read_analysis.wdl -i inputs.json", shell=True, check=True)  # raise if miniwdl fails
```
I will display the code for each task in the report and graph diagrams of the workflows. This task code won't necessarily be operable from Jupyter as the WDL docker image manages the dependencies for these tasks.
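
A slightly more structured variant of the notebook cell is sketched below. It builds the command as an argument list (avoiding `shell=True`), optionally sets a run directory with miniwdl's `--dir` option, and parses the JSON that miniwdl is expected to print to standard output on success. The file names are placeholders, and the shape of the returned JSON should be confirmed against the installed miniwdl version.

```python
import json
import subprocess

def build_miniwdl_command(wdl_path, inputs_json, run_dir=None):
    """Build the argument list for `miniwdl run` without going through a shell."""
    cmd = ["miniwdl", "run", wdl_path, "-i", inputs_json]
    if run_dir:
        cmd += ["--dir", run_dir]
    return cmd

def run_workflow(wdl_path, inputs_json, run_dir=None):
    """Run the workflow and return miniwdl's JSON summary; raises if the run fails."""
    result = subprocess.run(
        build_miniwdl_command(wdl_path, inputs_json, run_dir),
        check=True,           # raise CalledProcessError on a non-zero exit code
        capture_output=True,  # keep stdout so the JSON summary can be parsed
        text=True,
    )
    return json.loads(result.stdout)

# Only the command is built here; calling run_workflow requires Docker and miniwdl.
command = build_miniwdl_command("dropped_read_analysis.wdl", "inputs.json", run_dir="runs")
print(" ".join(command))
```

In a notebook, `run_workflow(...)` would be called once per dataset, and the returned summary could be used to locate the HTML report for display.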

### Big Data Concerns
Some datasets may be too large to run efficiently on a personal computer, particularly if we implement a BLAST module. In that case, anything run in a Jupyter notebook must be example code unless we both run the notebook on an HPC system or VM. Alternatively, we could downsample the datasets to make them workable. Hopefully, this will not be necessary. The notebook may need to be paired with additional files, such as database FASTAs, which will be provided with the submission unless they are too large, in which case the workflow will download them from the internet.
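
If downsampling turns out to be necessary, one option is reservoir sampling, which draws a fixed-size uniform sample in a single pass without loading the whole file into memory. The sketch below is a plain-Python illustration under the assumption of four-line FASTQ records. In practice an existing tool could do this job, and the function name and example data are invented.

```python
import random

def reservoir_sample_fastq(lines, k, seed=0):
    """Return k FASTQ records (4-line tuples) sampled uniformly in one pass."""
    rng = random.Random(seed)
    it = iter(lines)
    sample = []
    for i, record in enumerate(zip(it, it, it, it)):
        if i < k:
            sample.append(record)   # fill the reservoir first
        else:
            j = rng.randint(0, i)   # inclusive bounds
            if j < k:
                sample[j] = record  # replace with probability k / (i + 1)
    return sample

def make_fastq(n, suffix):
    """Fabricate n tiny FASTQ records for demonstration."""
    lines = []
    for idx in range(n):
        lines += [f"@read{idx}/{suffix}", "ACGT", "+", "IIII"]
    return lines

r1_sample = reservoir_sample_fastq(make_fastq(10, 1), k=3, seed=42)
r2_sample = reservoir_sample_fastq(make_fastq(10, 2), k=3, seed=42)
print([rec[0] for rec in r1_sample], [rec[0] for rec in r2_sample])
```

Because the choice depends only on the record index and the seeded random stream, sampling R1 and R2 files of equal length with the same seed keeps read pairs in sync.
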
## References (Rough Notes from Literature Review)
- Simon Haile, Richard D Corbett, Steve Bilobram, Morgan H Bye, Heather Kirk, Pawan Pandoh, Eva Trinh, Tina MacLeod, Helen McDonald, Miruna Bala, Diane Miller, Karen Novik, Robin J Coope, Richard A Moore, Yongjun Zhao, Andrew J Mungall, Yussanne Ma, Rob A Holt, Steven J Jones, Marco A Marra, Sources of erroneous sequences and artifact chimeric reads in next generation sequencing of genomic DNA from formalin-fixed paraffin-embedded samples, Nucleic Acids Research, Volume 47, Issue 2, 25 January 2019, Page e12, https://doi.org/10.1093/nar/gky1142
    - This article discusses a type of erroneous sequences known as chimeras, specifically biological chimeras of two sequences from the same organisms. These chimeric reads are generated due to nucleic acid damage, causing DNA from separate parts of the genome to concatenate during sequencing. In this study, the chimeras occurred during the sequencing of formalin-fixed paraffin-embedded samples, but they can also occur in the microbial context. Chimeras are often dropped in the analysis process and are a potential read class we would like to detect in the Dropped Reads Analysis Pipeline. The article outlines a mechanism to reduce chimeras, indicating that if high chimera levels could be identified the assay could be tweaked to reduce their occurrence. This would increase useful throughput and improve analysis quality overall as not all chimeric reads are dropped, and those that make it through introduce errors, as described in the article. While this article focuses on mapped chimeras, we are also interested in unmapped chimeric reads and should develop tooling to detect these.
- Lu, N., Qiao, Y., Lu, Z., & Tu, J. (2023). Chimera: The spoiler in multiple displacement amplification. Computational and Structural Biotechnology Journal, 21, 1688-1696. https://doi.org/10.1016/j.csbj.2023.02.034
    - This recent article outlines another way chimeras are generated: during multiple displacement amplification (MDA), a form of amplification common in some library prep techniques and sequencing technologies. This project may want to be able to detect and quantify these kinds of chimeras as they are a superset of all chimera types. The article references several tools, including ChimeraMiner, whose feasibility I will investigate during the tools investigation stage of the project. It also outlines long-read chimera detection, though long reads are a stretch goal of my project.
- Koboldt, D. C., Ding, L., Mardis, E. R., & Wilson, R. K. (2010). Challenges of sequencing human genomes. Briefings in Bioinformatics, 11(5), 484-498. https://doi.org/10.1093/bib/bbq016 
    - This is a rather old article, from 2010 in the early days of NGS, but it has a lot of good information about the issues in the technology and how to address them in general. This overview is useful for this project as it gives me context for setting the goal of detecting these issues by analyzing the reads dropped at various stages of an analysis pipeline. Many of these issues are taken as a given at this point in the lifecycle of NGS, so they are less discussed in modern papers.
    - Issues relevant to dropped reads analysis:
        - Contamination
        - Library chimeras
        - Run quality issues (bad flowcell, liquid handling, etc.)
- Van Dijk, E. L., Jaszczyszyn, Y., & Thermes, C. (2014). Library preparation methods for next-generation sequencing: Tone down the bias. Experimental Cell Research, 322(1), 12-20. https://doi.org/10.1016/j.yexcr.2014.01.008 
    - It describes various biases introduced by NGS library prep and also serves as a good overview of the library prep process. Lower impact on this project, but could provide further avenues of research for detecting library prep issues via dropped reads.
- Hu, T., Chitnis, N., Monos, D., & Dinh, A. (2021). Next-generation sequencing technologies: An overview. Human Immunology, 82(11), 801-811. https://doi.org/10.1016/j.humimm.2021.02.012
    - This is a modern article that explains in detail the process and limitations of short and long read NGS. It also explains how quality information is generated by Illumina sequencers. Of particular interest is information about how read clusters are formed and how they can affect quality. Ideally, identifying where dropped reads occur on a flowcell could allow us to understand failure modes better. 
- Ravi, R.K., Walton, K., Khosroheidari, M. (2018). MiSeq: A Next Generation Sequencing Platform for Genomic Analysis. In: DiStefano, J. (eds) Disease Gene Identification. Methods in Molecular Biology, vol 1706. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-7471-9_12 
    - This article outlines the protocol for library prep and for running the Illumina MiSeq, a very common sequencer whose data would likely be analyzed by my tools. By reading this, I can find key areas that can cause dropped reads and add these as targets for the pipeline to detect:
        - Air bubbles on the flowcell
        - Adapter dimers and primer dimers due to poor cleanup 
        - Poor read quality
        - Over or under-loading
- Stoler, N., & Nekrutenko, A. (2021). Sequencing error profiles of Illumina sequencing instruments. NAR Genomics and Bioinformatics, 3(1). https://doi.org/10.1093/nargab/lqab019
    - This article from 2021 describes the error profiles particular to Illumina sequencing. For my project, I want to understand the kinds of errors in Illumina sequencing that can cause low quality, and therefore dropped reads, along with their causes, so that my dropped reads analysis tool can detect these profiles and inform the user.
- Tan, G., Opitz, L., Schlapbach, R., & Rehrauer, H. (2019). Long fragments achieve lower base quality in Illumina paired-end sequencing. Scientific Reports, 9(1), 1-7. https://doi.org/10.1038/s41598-019-39076-7
    - This article from 2019 shows that long fragments over 500 bp (that is, reads with large insert sizes) yield poorer-quality second-strand reads on average. This could lead to a higher rate of reads discarded by quality filtering. The article explains that this is a common feature of Illumina sequencers across many contexts. The article also highlights a general decrease in quality due to high GC content and larger insert size. These are likely due to poor bridge amplification related to melting temperature.
    - This article highlights the need to detect insert sizes in dropped reads where possible, especially for quality-filtered reads. This could be achieved via BLAST in some contexts, or by mapping to the target genome with loose settings in others.
- Shore S, Henderson JM, Lebedev A, Salcedo MP, Zon G, McCaffrey AP, et al. (2016) Small RNA Library Preparation Method for Next-Generation Sequencing Using Chemical Modifications to Prevent Adapter Dimer Formation. PLoS ONE 11(11): e0167009. https://doi.org/10.1371/journal.pone.0167009 
    - I found several articles (others linked below) from 2016 to 2023 outlining various biochemical techniques to reduce the effects of adapter dimers in RNA-seq assays. The chemical approaches to reducing dimers are not relevant to my project, other than to highlight that if we can characterize dropped reads as being due to adapter dimers, there are a multitude of approaches to reduce their formation. The articles explain how adapter dimers are formed by the ligation of adapters to each other rather than to sequence fragments, and how these dimers go on to form clusters on the flowcell. Short dimers can be caught by size selection; however, if some get through, PCR's preferential amplification of short fragments can overpower this.
    - My project should be able to take a list of adapters for the dataset and return the amount of adapter dimers in the sample so that users can attempt to mitigate them.
    - Additional sources 
        - Maqueda, J. J., Giovanazzi, A., Rocha, A. M., Rocha, S., Silva, I., Saraiva, N., Bonito, N., Carvalho, J., Maia, L., M. Wauben, M. H., & Oliveira, C. (2023). Adapter dimer contamination in sRNA-sequencing datasets predicts sequencing failure and batch effects and hampers extracellular vesicle-sRNA analysis. Journal of Extracellular Biology, 2(6), e91. https://doi.org/10.1002/jex2.91
        - Xu, H., Yao, J., Wu, D. C., & Lambowitz, A. M. (2019). Improved TGIRT-seq methods for comprehensive transcriptome profiling with decreased adapter dimer formation and bias correction. Scientific Reports, 9(1), 1-17. https://doi.org/10.1038/s41598-019-44457-z
        - Shore, S., Henderson, J.M., McCaffrey, A.P. (2018). CleanTag Adapters Improve Small RNA Next-Generation Sequencing Library Preparation by Reducing Adapter Dimers. In: Head, S., Ordoukhanian, P., Salomon, D. (eds) Next Generation Sequencing. Methods in Molecular Biology, vol 1712. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-7514-3_10 
- Garafutdinov, R. R., Galimova, A. A., & Sakhabutdinova, A. R. (2020). The influence of quality of primers on the formation of primer dimers in PCR. Nucleosides, Nucleotides & Nucleic Acids, 39(9), 1251–1269. https://doi.org/10.1080/15257770.2020.1803354 
    - This article reviews the importance of primer design in reducing the formation of primer dimers. Primer dimers result from primer interactions in a targeted assay, where the amplification primers overlap and amplify into concatenated sequences. These primer dimers can take up sequencing space themselves and can also cause off-target amplification by matching unintended sequences. The article then goes into which primer features to avoid in order to reduce dimers.
    - If the Dropped Reads pipeline can detect primer dimers, and potentially off-target amplifications, in the dropped reads, then it can display this information and the affected primers to highlight which primers need to be redesigned.
- Goig, G.A., Blanco, S., Garcia-Basteiro, A.L. et al. Contaminant DNA in bacterial sequencing experiments is a major source of false genetic variability. BMC Biol 18, 24 (2020). https://doi.org/10.1186/s12915-020-0748-z
    - This article documents a study of public WGS data that used taxonomic filtering to show that many datasets have significant levels of contamination that can lead to incorrect conclusions. This highlights the prevalence of contamination and the need to detect and characterize it. The article also points out that many pipelines fail to do taxonomic filtering. The study used Bracken and Kraken to characterize the contamination, tools I also intend to investigate.
    - By incorporating contamination detection into a dropped reads analysis that users can slot into their pipelines to understand why reads failed to map, we can show when the pipeline needs to be modified to do taxonomic filtering and when the assay needs to be modified to mitigate contamination. Filtering will reduce the error introduced by erroneously mapping contaminants, and contamination mitigation will reduce the loss of sequencing real estate.
    - A few more articles describe similar studies with similar conclusions:
        - Zinter, M.S., Mayday, M.Y., Ryckman, K.K. et al. Towards precision quantification of contamination in metagenomic sequencing experiments. Microbiome 7, 62 (2019). https://doi.org/10.1186/s40168-019-0678-6
- When researching contamination, I found a number of articles describing tools for contamination detection. I will investigate these tools in further detail during my tools investigation and planning.
    - Steinegger, M., Salzberg, S.L. Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank. Genome Biol 21, 115 (2020). https://doi.org/10.1186/s13059-020-02023-1
        - Describes Conterminator, a tool/workflow for contamination detection and contig-based identification. The tool is fast even for large genomes.
        - Mentions VecScreen which may be a good tool for adapter and primer detection/filtering
    - Sangiovanni, M., Granata, I., Thind, A. et al. From trash to treasure: detecting unexpected contamination in unmapped NGS data. BMC Bioinformatics 20 (Suppl 4), 168 (2019). https://doi.org/10.1186/s12859-019-2684-x
        - Describes DecontaMiner, which is specifically meant to target unmapped reads for contamination detection. It returns an HTML report; if it also returns other files, I can use those as well, or I can modify the source code for my report.
    - Low AJ, Koziol AG, Manninger PA, Blais B, Carrillo CD. 2019. ConFindr: rapid detection of intra-species and cross-species contamination in bacterial whole-genome sequence data. PeerJ 7:e6995 https://doi.org/10.7717/peerj.6995
        - Describes ConFindr, a tool for detecting intra-species and cross-species contamination in short-read data
        - May not be as useful as it's more for detection, but I will investigate  
