# Submodule 3: Workflow automation with Nextflow and acquisition of public datasets
--------
## Overview

The primary goal of this submodule is to **generate multiple genome datasets** that will be used in the subsequent submodule on comparative genomics. This will be a short submodule in terms of content, but it is the most computationally intensive submodule.

To achieve this, we will **automate the processes outlined in Submodules 1 & 2** using **Nextflow**, a workflow management platform designed to streamline and ensure reproducibility in bioinformatics pipelines. Nextflow allows us to automate the entire workflow, including program version control, ensuring data integrity, and facilitating parallel execution of multiple datasets. This enables us to efficiently run many genome datasets simultaneously, while managing the computational resources available in our environment.

After executing the Nextflow workflow on our set of genomes, we will supplement our data with **reference-quality genome sequences from NCBI** in **Submodule 4**.


### Learning Objectives
- **Learn Reproducibility in Bioinformatics**:  
  Understand the importance of reproducibility in bioinformatics workflows and learn how to implement practices that ensure consistent, reliable results across different systems and environments.

- **Automate Genome Data Processing**:  
  Learn how to automate and streamline the processes of genome assembly, quality control, and annotation to handle multiple genome datasets efficiently using the workflow management platform **Nextflow**.

- **Execute Parallel Processing and Ensure Data Integrity**:  
  Gain proficiency in managing and processing large sets of genomic data simultaneously, while also developing the ability to monitor, assess, and maintain data integrity and quality throughout the workflow.

- **Prepare and Refine Genome Sequences for Comparative Genomics**:  
  Refine and prepare a collection of genome datasets, ensuring they are processed, cleaned, and ready for detailed examination in comparative genomics.

## The Importance of Reproducibility in Bioinformatics

Reproducibility is a cornerstone of scientific research, particularly in bioinformatics, where computational analyses of biological data play a significant role in advancing our understanding of genomics, disease, and evolution. In bioinformatics, ensuring that analyses can be repeated with the same results is critical for validating findings and building trust within the scientific community. This is especially important given the large datasets and complex analyses often involved, where slight variations in workflows, software versions, or data can lead to different outcomes. Reproducible research ensures that experiments can be independently verified, improving the credibility of results and facilitating the development of robust, evidence-based conclusions.

To support reproducibility in bioinformatics, many workflows are built using standardized, transparent tools and platforms, such as **Nextflow**, **Snakemake**, and **Galaxy**, which allow scientists to document and automate every step of their analysis. This makes it possible for others to re-run the same pipeline on their own systems, using the same parameters and input data, thus ensuring consistent results.

### FAIR Principles

The FAIR principles are guidelines designed to enhance the visibility and accessibility of research data. FAIR stands for **Findable, Accessible, Interoperable, and Reusable**. These principles aim to make scientific data easier to find, use, and share, which is especially important in bioinformatics.

- **Findable**: Data should be easy to locate through searchable metadata, unique identifiers, and standardized naming conventions.
- **Accessible**: Data should be available in formats that are easy to access and retrieve, even if only through proper authentication or permissions.
- **Interoperable**: Data and tools should be compatible, allowing for the exchange of information across systems, platforms, and communities.
- **Reusable**: Data should be well-documented, with clear descriptions of methods, metadata, and usage licenses, so others can confidently reuse it in future research.

### Tools to Aid Reproducibility in Bioinformatics
To ensure reproducibility in bioinformatics workflows, several tools and platforms are available that help automate, document, and track the analysis process. These tools facilitate consistent execution of pipelines, improve collaboration, and ensure that results can be verified and reproduced across different environments and systems.

| Tool               | Description | Link |
|--------------------|-------------|------|
| **Nextflow**       | A workflow management system for running bioinformatics pipelines in a highly scalable, reproducible, and portable manner. | [Nextflow Documentation](https://www.nextflow.io/) |
| **RepeatFS**       | A file system that records provenance for the files it manages, so that the steps used to produce a result can be documented, verified, and replayed. | [RepeatFS GitHub](https://github.com/RepeatFS) |
| **Snakemake**      | A popular workflow management system for bioinformatics, known for its simple syntax and scalability. | [Snakemake Documentation](https://snakemake.readthedocs.io/en/stable/) |
| **Cromwell**       | A workflow engine designed for scientific workflows, with built-in support for WDL (Workflow Description Language). | [Cromwell GitHub](https://github.com/broadinstitute/cromwell) |
| **Galaxy**         | A web-based platform for data-intensive research, offering tools for reproducible data analysis and workflow management. | [Galaxy Project](https://galaxyproject.org/) |
| **Airflow**        | An open-source tool for orchestrating complex workflows, with strong scheduling and monitoring features. | [Apache Airflow](https://airflow.apache.org/) |
| **Bash / Python**  | Bash and Python scripts can automate tasks, with Bash being simple and flexible for Unix-like systems and Python offering more extensive libraries for bioinformatics tasks. | [Bash Scripting Guide](https://www.gnu.org/software/bash/manual/bash.html), [Python Documentation](https://docs.python.org/3/) |
| **Other Languages**| Other programming languages like **R**, **Perl**, and **Ruby** can be used for bioinformatics workflows, depending on the user's needs and preferences. | N/A |
| **PipeLine**       | A pipeline management tool that allows for easy tracking and execution of bioinformatics workflows, especially useful in large-scale projects. | [PipeLine GitHub](https://github.com/Pipeline) |


## **Install required software**

The only additional tool required for this submodule is **Nextflow**. All the tools called by the Nextflow workflow have already been installed and described in the previous submodules. To install Nextflow on your own system, see the official documentation: https://www.nextflow.io/docs/latest/install.html. **Nextflow has been installed in the container and does not need to be installed now.**

## Starting Data
As a reminder from Submodule 1, the data used for this module is described in a manuscript comparing phenotypic and whole-genome sequencing-derived AMR profiles ([Painset et al. 2020](https://pubmed.ncbi.nlm.nih.gov/31943013/)). The study includes sequencing read datasets for 528 isolates of Campylobacter spp. (452 C. jejuni and 76 C. coli) from human (494), food (21), and environmental (2) sources.

For a robust comparative genomic analysis, it is essential to include a balanced and statistically sufficient sample size for each species. A minimum of 50-100 isolates per species is recommended to capture population-level variability, with larger sample sizes (e.g., 100-200) providing greater statistical power to detect subtle genomic differences.

For this tutorial, we will use a small subset of ~10 isolates to ensure analyses run quickly. However, the methods presented here are fully scalable and can be applied to larger datasets like those described in the manuscript. This smaller subset of sequencing data is derived from Table 2 of the manuscript.

Let's take a look at the starting dataset. We'll download this data from the NCBI SRA database using the SRA accession column and the same methods we used in Submodule 1.

| **Isolate No.** | **SRA Accession** | **Species**  | **Resistance to Erythromycin (Macrolide)** | **Resistance to Ciprofloxacin (Fluoroquinolone)** | **Resistance Determinants (Ciprofloxacin)** | **Resistance to Tetracycline (Tetracycline)** | **Resistance Determinants (Tetracycline)** | **Resistance to Gentamicin (Aminoglycoside)** | **Resistance Determinants (Gentamicin)** | **Resistance to Streptomycin (Aminoglycoside)** | **Resistance Determinants (Streptomycin)** |
|------------------|-------------------|--------------|--------------------------------------------|--------------------------------------------------|---------------------------------------------|----------------------------------------------|---------------------------------------------|-----------------------------------------------|-------------------------------------------|------------------------------------------------|---------------------------------------------|
| 72               | SRR10067958      | *C. jejuni*  | S                                          | S                                                | —                                           | R                                            | —                                           | S                                             | —                                         | S                                              | —                                           |
| 91               | SRR10068079      | *C. jejuni*  | S                                          | S                                                | gyrA_CJ[86:T-I]                             | S                                            | tet(O)                                      | S                                             | —                                         | S                                              | —                                           |
| 145              | SRR10068117      | *C. jejuni*  | S                                          | S                                                | —                                           | R                                            | —                                           | S                                             | —                                         | S                                              | —                                           |
| 193              | SRR10056681      | *C. jejuni*  | S                                          | S                                                | gyrA_CJ[86:T-I]                             | R                                            | tet(O)_2                                    | S                                             | —                                         | R                                              | —                                           |
| 213              | SRR10056829      | *C. jejuni*  | S                                          | S                                                | —                                           | S                                            | tet(O)                                      | S                                             | —                                         | S                                              | ant(6)-Ia, aadE-Cp2                        |
| 214              | SRR10056914      | *C. coli*    | S                                          | S                                                | —                                           | S                                            | —                                           | S                                             | —                                         | S                                              | —                                           |
| 242              | SRR10056778      | *C. coli*    | S                                          | S                                                | —                                           | S                                            | —                                           | S                                             | —                                         | R                                              | —                                           |
| 248              | SRR10056855      | *C. jejuni*  | S                                          | S                                                | 23s_CJ[2074:A-M], gyrA_CJ[86:T-I; 90:D-N]  | R                                            | tet(O)_2                                    | S                                             | —                                         | S                                              | —                                           |
| 263              | SRR10056784      | *C. jejuni*  | S                                          | S                                                | gyrA_CJ[86:T-I]                             | R                                            | tet(O)_2                                    | S                                             | —                                         | R                                              | ant(6)-Ia, aadE-Cp2                        |
| 276              | SRR10056856      | *C. jejuni*  | S                                          | S                                                | —                                           | S                                            | —                                           | S                                             | —                                         | R                                              | —                                           |



## Nextflow Overview

**Nextflow** is a powerful workflow management platform designed for bioinformatics and data science. It enables the automation of complex workflows, facilitating reproducibility, parallel execution of tasks, and efficient use of computational resources across different environments. Nextflow is especially useful in genomics pipelines where large-scale datasets need to be processed with consistent results.

---

### Key Features of Nextflow:
- **Reproducibility**: Nextflow ensures that workflows are reproducible by capturing the environment, dependencies, and exact command used for each execution.
- **Parallelization**: It efficiently runs tasks in parallel, utilizing computational resources optimally, which speeds up large-scale data analyses.
- **Compatibility**: Nextflow supports multiple computing environments, including local machines, cloud infrastructure, and high-performance computing clusters.
- **Version Control**: Integrated support for managing software versions and dependencies, ensuring consistency across different systems and workflows.

---

### Learning Resources for Nextflow

| Resource | Description | Link |
|----------|-------------|------|
| **Nextflow Official Documentation** | Comprehensive guide on installation, syntax, and core concepts. | [Nextflow Documentation](https://www.nextflow.io/docs/latest/) |
| **Nextflow Tutorials** | Step-by-step tutorials for beginners and advanced users. | [Nextflow Tutorials](https://www.nextflow.io/docs/latest/tutorial/) |
| **Nextflow YouTube Channel** | Recorded webinars, conference talks, and demos for visual learners. | [Nextflow YouTube Channel](https://www.youtube.com/c/Nextflow) |
| **Nextflow Community Forum** | A forum for discussing issues, asking questions, and sharing knowledge. | [Nextflow Forum](https://www.nextflow.io/community/) |
| **Nextflow GitHub Repository** | Access to source code, examples, and community contributions. | [Nextflow GitHub](https://github.com/nextflow-io/nextflow) |


### Download starting read datasets

Below are two methods to download starting read datasets. 

1. Download data from the NCBI SRA using a list of accessions.
2. Copy a predownloaded dataset from an AWS S3 bucket.

For the standard tutorial, the SRA download is commented out and will not run by default. Within this code, the script *SRA_download.sh* automates the same process we used in Submodule 01 to download data. It iterates through a list of SRA accessions and fetches paired-end sequencing reads for ten samples. If you would like to use the full set of 10 samples with the full number of reads (1.5–3 million per sample), you can comment out (#) the AWS S3 code and remove the comment characters in front of the SRA code. This full dataset will take up to a half hour to run. Also, to use this approach with your own data, simply replace the SRR_list.txt file with a custom list of SRA accessions.

The data we pull from the AWS S3 bucket consist of five samples, each randomly subsampled down to 300,000 reads so that the Nextflow workflow runs in under seven minutes.
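For orientation, the loop inside a script such as *SRA_download.sh* might look roughly like the sketch below. This is not the contents of the course script: the function name `download_sra_list`, the `DRY_RUN` switch, and the choice of `fasterq-dump` from sra-tools are assumptions made for illustration. With `DRY_RUN=1` (the default here) the function only prints the commands it would run.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of an accession-list download loop (NOT the course's SRA_download.sh).
# Assumes sra-tools (fasterq-dump) and gzip are installed when DRY_RUN=0.

download_sra_list() {
    local list_file="$1"            # text file with one SRA accession per line
    local out_dir="$2"              # directory that will receive the FASTQ files
    local dry_run="${DRY_RUN:-1}"   # 1 = only print commands, 0 = run them
    local acc

    mkdir -p "$out_dir"
    while IFS= read -r acc || [ -n "$acc" ]; do
        # skip blank lines and comment lines
        case "$acc" in ''|'#'*) continue ;; esac
        if [ "$dry_run" = "1" ]; then
            echo "fasterq-dump --split-files --outdir $out_dir $acc"
        else
            # paired-end runs are written as <acc>_1.fastq and <acc>_2.fastq
            fasterq-dump --split-files --outdir "$out_dir" "$acc"
            gzip "$out_dir/${acc}_1.fastq" "$out_dir/${acc}_2.fastq"
        fi
    done < "$list_file"
}
```

Calling `DRY_RUN=0 download_sra_list data/SRR_list.txt wgs-nf/raw-reads` would perform the real download, assuming sra-tools is on the `PATH`.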

<div class="alert alert-block alert-warning"> <b>Attention: Commented lines below will not run by default. See the note above about the two methods available for acquiring the starting data.</b>  </div>

In [2]:
%%bash

read_dir=wgs-nf/raw-reads/
mkdir -p $read_dir

########################
# ### SKIPPING DOWNLOAD FROM SRA, BUT HERE ARE INSTRUCTIONS
# # Create list of accessions from manuscript metadata
# cat data/metadata.tsv  | grep SRR | awk '{print $2}' > data/SRR_list.txt
# cat data/SRR_list.txt

# # run a custom BASH script that retrieves the data from NCBI
# scripts/SRA_download.sh data/SRR_list.txt $read_dir
########################

### DOWNLOAD FROM S3 bucket, subset data
aws s3 cp s3://nh-inbre-genome-sequencing-and-comparative-genomic-analysis/nextflow-reads/ $read_dir --recursive


# list the contents of the starting directory
echo "Directory contents:"
echo "--------------------"
ls $read_dir

download: s3://nh-inbre-genome-sequencing-and-comparative-genomic-analysis/nextflow-reads/SRR10056784_1.fastq.gz to wgs-nf/raw-reads/SRR10056784_1.fastq.gz
download: s3://nh-inbre-genome-sequencing-and-comparative-genomic-analysis/nextflow-reads/SRR10056784_2.fastq.gz to wgs-nf/raw-reads/SRR10056784_2.fastq.gz
download: s3://nh-inbre-genome-sequencing-and-comparative-genomic-analysis/nextflow-reads/SRR10056855_1.fastq.gz to wgs-nf/raw-reads/SRR10056855_1.fastq.gz
download: s3://nh-inbre-genome-sequencing-and-comparative-genomic-analysis/nextflow-reads/SRR10056855_2.fastq.gz to wgs-nf/raw-reads/SRR10056855_2.fastq.gz
download: s3://nh-inbre-genome-sequencing-and-comparative-genomic-analysis/nextflow-reads/SRR10056856_1.fastq.gz to wgs-nf/raw-reads/SRR10056856_1.fastq.gz
download: s3://nh-inbre-genome-sequencing-and-comparative-genomic-analysis/nextflow-reads/SRR10056856_2.fastq.gz to wgs-nf/raw-reads/SRR10056856_2.fastq.gz
download: s3://nh-inbre-genome-sequencing-and-comparative-genomi
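Because the workflow consumes the reads as `_1`/`_2` pairs, it can be worth confirming that every sample has both mates before launching it. The sketch below is one way to do that; the function name `check_read_pairs` is hypothetical, and it assumes the `<accession>_1.fastq.gz` / `<accession>_2.fastq.gz` naming seen in the download output above. It only looks for a missing `_2` file for each `_1` file, not the reverse.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check: confirm each sample has both mates.
# Assumes the <accession>_1.fastq.gz / <accession>_2.fastq.gz naming convention.

check_read_pairs() {
    local read_dir="${1%/}"   # strip a trailing slash, if any
    local r1 r2 sample
    local n_ok=0 n_missing=0

    for r1 in "$read_dir"/*_1.fastq.gz; do
        [ -e "$r1" ] || continue          # the glob matched nothing
        sample="$(basename "$r1" _1.fastq.gz)"
        r2="$read_dir/${sample}_2.fastq.gz"
        if [ -e "$r2" ]; then
            n_ok=$((n_ok + 1))
        else
            echo "MISSING mate for $sample: $r2" >&2
            n_missing=$((n_missing + 1))
        fi
    done

    echo "complete pairs: $n_ok, incomplete: $n_missing"
    # succeed only if at least one pair exists and none are incomplete
    [ "$n_missing" -eq 0 ] && [ "$n_ok" -gt 0 ]
}
```

For example, `check_read_pairs wgs-nf/raw-reads/` prints a one-line summary and returns a non-zero exit status if any sample is incomplete.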

## The Nextflow Workflow

### Overview

Next we will run a bash command that initiates the Nextflow workflow on the directory of paired-end FASTQ files we just downloaded. Each pair of FASTQ files represents the sequencing data generated from one bacterial isolate. The workflow automates all the steps from Submodules 1 and 2, ensuring a seamless process. Nextflow determines which steps can be executed concurrently (e.g., BLAST and BWA read mapping) and which steps require prior completion of others (e.g., running BlobTools after BLAST and related processes finish).

The Nextflow code is written in a file called *main.nf*. I encourage you to open this file (wgs-nf/main.nf) and examine the code; we will walk through some of its sections together.

### Example of a Nextflow Process: Trimming Paired-End Reads with FASTP
Below is a snippet from wgs-nf/main.nf that shows how a Nextflow process wraps a command. In this step of the workflow, we trim paired-end sequencing reads using **FASTP**. The `RUN_FASTP` process takes a sample ID and its paired-end reads as input and outputs the trimmed versions of the reads. Here's the Nextflow code to perform this task:

```nextflow
// Step 2: Trim paired-end reads with FASTP
process RUN_FASTP {
    input:
    tuple val(sampleid), path(fq1), path(fq2)

    output:
    tuple val(sampleid), path("${sampleid}.trimmed_R1.fastq.gz"), path("${sampleid}.trimmed_R2.fastq.gz")

    script:
    """
    fastp -i $fq1 -I $fq2 \
        -o "$sampleid".trimmed_R1.fastq.gz -O "$sampleid".trimmed_R2.fastq.gz \
        --thread ${params.threads}
    """
}
```

### Nextflow Workflow: Putting it all together

A similar block of code exists for each step of the workflow, with controlled options and threads for each step. Once all the processes are defined, the `workflow` block connects them and determines the order in which they run. Below is the workflow block for this pipeline, which includes steps for read assessment, quality control, genome assembly, and various downstream analyses.

```nextflow
workflow {
    // Load paired-end FASTQ files
    fastq_files = Channel
        .fromFilePairs(params.reads, flat: false)
        .map { sample_id, files -> tuple(sample_id, files[0], files[1]) }

    // Step 0: Count reads
    read_info = fastq_files | ASSESS_READS
    read_info.view()

    // Step 1: Run FASTQC
    fastqc_report = fastq_files | RUN_FASTQC

    // Step 2: Trim reads with FASTP
    trimmed_fastq = fastq_files | RUN_FASTP

    // Step 3: Assemble genome with SPADES
    genome = trimmed_fastq | RUN_SPADES

    // Step 4.1: Run BWA and Samtools
    bwa_channel = trimmed_fastq.join(genome)
    bam = RUN_BWA(bwa_channel)

    // Step 4.2: Run BUSCO
    busco_results = genome | RUN_BUSCO

    // Step 4.3: Run QUAST
    quast_report = genome | RUN_QUAST

    // Step 4.4: Run BLAST
    blast_results = genome | RUN_BLAST

    // Step 5: Run Blobtools
    busco_channel = genome.join(bam).join(blast_results)
    blob_output = RUN_BLOBTOOLS(busco_channel)
}
```

## Run Nextflow

Let's start the process.

Note: When running Nextflow inside Jupyter notebooks, the progress updates are printed repeatedly instead of updating cleanly on a single line as they would in a normal terminal. While the output may appear verbose, it contains useful information about the status of the workflow. We will keep this output as-is for the demonstration. You can remove this status bar update using the option *-ansi-log false*. **You can click the left side of the notebook screen to truncate the output feed.**

In [23]:
%%bash

# move into working directory
cd wgs-nf/

# run the Nextflow workflow (add -resume to continue from the cached results of a previous run)
nextflow run main.nf

Nextflow 24.10.6 is available - Please consider updating your version to it



 N E X T F L O W   ~  version 24.04.4

Launching `main.nf` [voluminous_hawking] DSL2 - revision: e230daa73d

[-        ] ASSESS_READS -
[-        ] RUN_FASTQC   -
[-        ] RUN_FASTP    -

[-        ] ASSESS_READS  -
[-        ] RUN_FASTQC    -
[-        ] RUN_FASTP     -
[-        ] RUN_SPADES    -
[-        ] RUN_BWA       -
[-        ] RUN_BUSCO     -
[-        ] RUN_QUAST     -
[-        ] RUN_BLAST     -
[-        ] RUN_BLOBTOOLS -
[-        ] RUN_BAKTA     -

executor >  local (15)
[e1/bb62bb] ASSESS_READS (5) | 0 of 5
[09/6e1371] RUN_FASTQC (1)   | 0 of 5
[b9/917be8] RUN_FASTP (3)    | 0 of 5
[-        ] RUN_SPADES       -
[-        ] RUN_BWA          -
[-        ] RUN_BUSCO        -
[-        ] RUN_QUAST        -
[-        ] RUN_BLAST        -
[-        ] RUN_BLOBTOOLS    -
[-        ] RUN_BAKTA        -

executor >  local (15)
[e1/bb62bb] ASSESS_READS (5) | 0 of 5
[09/6e1371] RUN_FASTQC (1)   | 0 of 5
[b9/917be8] RUN_FASTP (3)    | 0 of 5
[-        ] RUN_SPADES       -
[-  

## Nextflow Explanation:

As the Nextflow pipeline worked through the steps in the workflow, you may have noticed status updates printing to the screen. The example below shows the status display while the job was running; the first line is a template that labels each field.

[hash] PROCESS_NAME (task index) | completed of total ✔ or pending  

[1e/9847bd] ASSESS_READS (5)  | 5 of 5 ✔  
[21/55f7a1] RUN_FASTQC (3)    | 5 of 5 ✔  
[eb/54aa16] RUN_FASTP (2)     | 5 of 5 ✔  
[0d/47a4f6] RUN_SPADES (3)    | 3 of 5  
[a8/562c66] RUN_BWA (3)       | 2 of 3  
[2a/72a30e] RUN_BUSCO (2)     | 2 of 3  
[76/b9a85d] RUN_QUAST (3)     | 2 of 3  
[c5/3c7a21] RUN_BLAST (3)     | 3 of 3  
[8e/391483] RUN_BLOBTOOLS (2) | 1 of 2  
[8c/187cd5] RUN_BAKTA (3)     | 0 of 3  

When running a Nextflow pipeline, the status output helps track the progress of each process (step) in the workflow.
Each line summarizes a different process:

`PROCESS_NAME` is the name of the process (workflow step), such as RUN_FASTQC or RUN_BUSCO. The number in parentheses is the index of the task most recently submitted for that process, so it changes as the run proceeds; it does not count batches or retries. `5 of 5 ✔` means all 5 tasks for this process have completed successfully.

You may have noticed that jobs were parallelized, but specific tasks needed to wait to start based on the output from other jobs. For example, SPADES can't run until FASTP finishes generating the trimmed reads. Once the assembly finished, nearly all the other tasks (BUSCO, BWA, QUAST, BAKTA, BLAST) were able to start at the same time. This is one of the most powerful features of the Nextflow language: automatic parallelization and queueing based on dependencies between tasks. This queueing system is sample-dependent and dynamically adapts to your data.

### Outputs
The workflow cleans up and writes the files we want to save to a final directory called *output-dir*. This main output directory contains selected reports from each process (BUSCO results, BlobTools plots, etc.). All other work is saved by Nextflow within a directory called *work*. The *hash* at the beginning of each status line is the prefix of the subdirectory of *work* in which that task's files are saved.

Let's view the primary outputs.
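To inspect a single task, the short hash from the status display can be expanded to its full directory under *work*. The helper below is a minimal sketch; the function name `find_task_dir` is hypothetical, and the default work root assumes you are inside `wgs-nf/`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: expand a short task hash from the status display
# (e.g. 1e/9847bd) to the matching directory under Nextflow's work directory.

find_task_dir() {
    local short_hash="$1"          # e.g. 1e/9847bd
    local work_root="${2:-work}"   # default assumes the current directory is wgs-nf/
    local match

    for match in "$work_root"/"$short_hash"*/; do
        if [ -d "$match" ]; then
            printf '%s\n' "${match%/}"   # print without the trailing slash
            return 0
        fi
    done
    echo "no task directory matches $short_hash under $work_root" >&2
    return 1
}
```

Each task directory holds hidden files written by Nextflow, including `.command.sh` (the exact script that was run), `.command.log`, and `.exitcode`. For example, `cat "$(find_task_dir 1e/9847bd wgs-nf/work)/.command.sh"` would show the command executed for that task; the hash here is taken from the example status above and will differ in your run.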

<div class="alert alert-block alert-warning"> <b>Attention:</b> Before You Proceed: Pause here and wait for the Nextflow run to finish. The brackets around the block of code will switch from '*' to a number when it is completed and you will see the duration displayed at the bottom of the output. </div>

In [4]:
%%bash

# view contents of output directory
ls wgs-nf/output-dir/*

wgs-nf/output-dir/output-bakta:
output-bakta-SRR10056784
output-bakta-SRR10056855
output-bakta-SRR10056856
output-bakta-SRR10067958
output-bakta-SRR10068079

wgs-nf/output-dir/output-blast:
SRR10056784-blast-genome-vs-db.tsv
SRR10056855-blast-genome-vs-db.tsv
SRR10056856-blast-genome-vs-db.tsv
SRR10067958-blast-genome-vs-db.tsv
SRR10068079-blast-genome-vs-db.tsv

wgs-nf/output-dir/output-blobtools:
SRR10056784_blobplot.png
SRR10056784_blobplot_read_cov.png
SRR10056784_table.tsv
SRR10056855_blobplot.png
SRR10056855_blobplot_read_cov.png
SRR10056855_table.tsv
SRR10056856_blobplot.png
SRR10056856_blobplot_read_cov.png
SRR10056856_table.tsv
SRR10067958_blobplot.png
SRR10067958_blobplot_read_cov.png
SRR10067958_table.tsv
SRR10068079_blobplot.png
SRR10068079_blobplot_read_cov.png
SRR10068079_table.tsv

wgs-nf/output-dir/output-busco:
SRR10056784_busco_report.txt
SRR10056855_busco_report.txt
SRR10056856_busco_report.txt
SRR10067958_busco_report.txt
SRR10068079_busco_report.txt

wgs-nf/output-

In [20]:
%%bash

### Summarize QUAST results

# 
# cat wgs-nf/output-dir/output-quast/*

# number of contigs
echo 'NUMBER OF CONTIGS'
grep '# contigs' wgs-nf/output-dir/output-quast/* | grep -v '('

# genome sizes
echo 'GENOME ASSEMBLY LENGTH'
grep 'Total length' wgs-nf/output-dir/output-quast/* | grep -v '('

# N50s
echo 'N50s'
grep N50 wgs-nf/output-dir/output-quast/*


NUMBER OF CONTIGS
wgs-nf/output-dir/output-quast/SRR10056784_report.txt:# contigs                   103        
wgs-nf/output-dir/output-quast/SRR10056855_report.txt:# contigs                   42         
wgs-nf/output-dir/output-quast/SRR10056856_report.txt:# contigs                   94         
wgs-nf/output-dir/output-quast/SRR10067958_report.txt:# contigs                   173        
wgs-nf/output-dir/output-quast/SRR10068079_report.txt:# contigs                   194        
GENOME ASSEMBLY LENGTH
wgs-nf/output-dir/output-quast/SRR10056784_report.txt:Total length                1710343    
wgs-nf/output-dir/output-quast/SRR10056855_report.txt:Total length                1698056    
wgs-nf/output-dir/output-quast/SRR10056856_report.txt:Total length                1725406    
wgs-nf/output-dir/output-quast/SRR10067958_report.txt:Total length                1622855    
wgs-nf/output-dir/output-quast/SRR10068079_report.txt:Total length                1739324    
N50s
wgs-nf/output-
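The three `grep` summaries above can also be collected into a single table with one row per sample. The sketch below is one possible way to do this; the function name `summarize_quast` is hypothetical, and it assumes report files named `<sample>_report.txt` whose lines consist of a metric name followed by its value, as in the output above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: collect three QUAST metrics per sample into one tab-separated table.
# Assumes <sample>_report.txt files with "metric name   value" lines.

summarize_quast() {
    local quast_dir="${1%/}"   # strip a trailing slash, if any
    local report sample

    printf 'sample\tcontigs\ttotal_length\tN50\n'
    for report in "$quast_dir"/*_report.txt; do
        [ -e "$report" ] || continue   # the glob matched nothing
        sample="$(basename "$report" _report.txt)"
        awk -v sample="$sample" '
            # match only the unthresholded rows, skipping "(>= N bp)" variants
            /^# contigs +[0-9]/    { contigs = $NF }
            /^Total length +[0-9]/ { total = $NF }
            /^N50 +[0-9]/          { n50 = $NF }
            END { printf "%s\t%s\t%s\t%s\n", sample, contigs, total, n50 }
        ' "$report"
    done
}
```

Running `summarize_quast wgs-nf/output-dir/output-quast > quast_summary.tsv` would write the table to a file (the file name is only an example) that can be opened in a spreadsheet or read into R or Python.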

In [21]:
%%bash

# See the number of complete and missing BUSCOs
echo 'Complete BUSCOs:'
grep 'single-copy' wgs-nf/output-dir/output-busco/*
echo 'Missing BUSCOs:'
grep 'Missing' wgs-nf/output-dir/output-busco/*

Complete BUSCOs:
wgs-nf/output-dir/output-busco/SRR10056784_busco_report.txt:	110	Complete and single-copy BUSCOs (S)	   
wgs-nf/output-dir/output-busco/SRR10056855_busco_report.txt:	109	Complete and single-copy BUSCOs (S)	   
wgs-nf/output-dir/output-busco/SRR10056856_busco_report.txt:	110	Complete and single-copy BUSCOs (S)	   
wgs-nf/output-dir/output-busco/SRR10067958_busco_report.txt:	109	Complete and single-copy BUSCOs (S)	   
wgs-nf/output-dir/output-busco/SRR10068079_busco_report.txt:	108	Complete and single-copy BUSCOs (S)	   
Missing BUSCOs:
wgs-nf/output-dir/output-busco/SRR10056784_busco_report.txt:	7	Missing BUSCOs (M)			   
wgs-nf/output-dir/output-busco/SRR10056855_busco_report.txt:	8	Missing BUSCOs (M)			   
wgs-nf/output-dir/output-busco/SRR10056856_busco_report.txt:	7	Missing BUSCOs (M)			   
wgs-nf/output-dir/output-busco/SRR10067958_busco_report.txt:	7	Missing BUSCOs (M)			   
wgs-nf/output-dir/output-busco/SRR10068079_busco_report.txt:	7	Missing BUSCOs (M)			   


In [None]:
%%bash

# final output directory for the next steps

ls wgs-nf/output-dir/proteomes

<div class="alert alert-block alert-warning"> <b>Attention:</b> Before You Proceed: If you're not moving on immediately to the next submodule, be sure to shut down your instance. </div>