
![Biofilm image](../images/Biofilm_Website_2.png)

# Submodule #1:  Metagenome Data Preparation and QC

## Overview

This Jupyter Notebook guides users through the initial steps of a metagenomic analysis workflow. It focuses on preparing raw sequencing data and performing quality control checks using FastQC and MultiQC. The notebook begins by setting up the necessary directories and downloading example data from QIIME 2, which represents a human microbiome time series analysis. It then installs FastQC and MultiQC, runs them on the downloaded data, and visualizes the results to identify regions of low sequencing quality. The notebook concludes by emphasizing the need to trim these low-quality regions in subsequent steps and provides a quiz to assess understanding. In short, it checks that the input data is of sufficient quality for downstream analysis.

# Learning Objectives:
In submodule 1 we will leverage the fundamental concepts from [**Submodule 0**](./SubModule00.ipynb) to do the following:
- Introduce two use cases 
- Learn how to prepare the data for metagenomic analysis
- Apply the quality control tools FastQC and MultiQC to the datasets

<div class="alert alert-block alert-success">
    <i class="fa fa-hand-paper-o" aria-hidden="true"></i>
    <b>Tip: </b>  If you're having trouble with any part of this tutorial, feel free to leverage Gemini (Google's advanced generative AI model) at the bottom of this module.
</div>  

## Prerequisites

*   **Python environment**
*   **Software (installed in notebook):**
    *   FastQC
    *   MultiQC
    *   mamba (package manager)

## Get Started

### STEP 1. Core Dataset Preparation
The first step is to identify and download the raw sequencing data that we will use throughout the module. For this submodule we will download the QIIME 2 example data. If you would like to run the module with public data, check out [this tutorial](https://github.com/STRIDES/NIHCloudLabGCP/blob/main/#sra) to download data from SRA.

<div class="alert alert-block alert-success">
    <i class="fa fa-hand-paper-o" aria-hidden="true"></i>
    <b>Note: </b>  At the completion of this step, you should have all your metagenome files and metadata available in your local or cloud folders for processing. Once that folder is chosen, we recommend that you do not change its name.
</div>

#### Dataset Use Case Overview
- PMID: 21624126
- Pre-processed dataset

> Abstract (from PubMed)<br>
Background: Understanding the normal temporal variation in the human microbiome is critical to developing treatments for putative microbiome-related afflictions such as obesity, Crohn’s disease, inflammatory bowel disease and malnutrition. Sequencing and computational technologies, however, have been a limiting factor in performing dense time series analysis of the human microbiome. Here, we present the largest human microbiota time series analysis to date, covering two individuals at four body sites over 396 timepoints.
<br>
Results: We find that despite stable differences between body sites and individuals, there is pronounced variability in an individual’s microbiota across months, weeks and even days. Additionally, only a small fraction of the total taxa found within a single body site appear to be present across all time points, suggesting that no core temporal microbiome exists at high abundance (although some microbes may be present but drop below the detection threshold). Many more taxa appear to be persistent but non-permanent community members.
<br>
Conclusions: DNA sequencing and computational advances described here provide the ability to go beyond infrequent snapshots of our human-associated microbial ecology to high-resolution assessments of temporal variations over protracted periods, within and between body habitats and individuals. This capacity will allow us to define normal variation and pathologic states, and assess responses to therapeutic interventions.

In [None]:
# Create a directory to store our datasets
! mkdir -p Core_Dataset_Prep/emp-single-end-sequences
# Get qiime2 example data
! wget -O "Core_Dataset_Prep/sample-metadata.tsv" "https://data.qiime2.org/2022.2/tutorials/moving-pictures/sample_metadata.tsv"
! wget -O "Core_Dataset_Prep/emp-single-end-sequences/barcodes.fastq.gz" "https://data.qiime2.org/2022.2/tutorials/moving-pictures/emp-single-end-sequences/barcodes.fastq.gz"
! wget -O "Core_Dataset_Prep/emp-single-end-sequences/sequences.fastq.gz" "https://data.qiime2.org/2022.2/tutorials/moving-pictures/emp-single-end-sequences/sequences.fastq.gz"
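
Before moving on, it can help to confirm that the downloads are complete. The cell below is a minimal sketch, not part of the original workflow: it counts the records in each gzip-compressed FASTQ file, assuming the standard four-lines-per-record layout. The helper name `count_fastq_records` is hypothetical. Because this dataset stores barcodes and sequences in parallel files, the two counts are expected to match.

```python
import gzip
import os


def count_fastq_records(path):
    """Count records in a gzip-compressed FASTQ file (assumes 4 lines per record)."""
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


# Paths match the download cell above
data_dir = "Core_Dataset_Prep/emp-single-end-sequences"
paths = [os.path.join(data_dir, name)
         for name in ("barcodes.fastq.gz", "sequences.fastq.gz")]

if all(os.path.exists(p) for p in paths):
    counts = {p: count_fastq_records(p) for p in paths}
    for p, n in counts.items():
        print(f"{p}: {n} records")
    if len(set(counts.values())) != 1:
        print("Warning: the barcode and sequence files have different record counts.")
else:
    print("Files not found; run the download cell first.")
```

A truncated download usually shows up here as a gzip error or as a line count that is not a multiple of four.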

## STEP 2. Raw Data Quality Control (FastQC, MultiQC):

### Quality Check (QC) - FastQC + MultiQC

In this section we use FastQC and MultiQC to evaluate our raw dataset quality.  Each cell focuses on one dataset and will create an output directory to store each **quality control report**. 

At the completion of this step, you will have an output folder with multiple files and a navigation HTML file to visualize the quality of your dataset. Several interactive figures will be generated, including the "per base sequence quality" plot shown in the example below, which tells us whether the reads need to be trimmed to remove adapter sequences and low-quality bases.

<div align="center">
    <img src="../images/fastqc-sequencing-quality-l.jpg" alt="fastqc-sequencing-quality" width="550" height="550">
</div>
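
The quality scores on the y-axis of this plot are Phred scores, which relate to the probability of a base-calling error by Q = -10 · log10(p). As a small illustration (the function name `phred_to_error_probability` is hypothetical and not part of FastQC), the conversion can be sketched as:

```python
def phred_to_error_probability(q):
    """Convert a Phred quality score Q to the base-call error probability p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


# Q20 corresponds to 1 error in 100 base calls, Q30 to 1 in 1000
for q in (10, 20, 30, 40):
    print(f"Q{q}: error probability {phred_to_error_probability(q):.4f}")
```

This is why a drop of 10 in the plotted score is significant: it represents a tenfold increase in the expected error rate.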

### Quality Checking Our Dataset
    

In [None]:
# installing fastqc multiqc
! mamba install -c bioconda fastqc multiqc -y

In [None]:
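# Create the output directories
! mkdir -p Dataset_QC/multiqc-output
# Run FastQC on every file in the dataset folder and write the reports to Dataset_QC/
! fastqc -o Dataset_QC/ Core_Dataset_Prep/emp-single-end-sequences/*
# Run MultiQC to aggregate the FastQC reports into a single summary
! multiqc Dataset_QC -o Dataset_QC/multiqc-output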
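
Alongside each HTML report, FastQC writes a zip archive containing a `summary.txt` file with one PASS/WARN/FAIL line per analysis module. The cell below is a hedged sketch for reading that summary programmatically. The helper `read_fastqc_summary` is hypothetical, and the archive name `sequences_fastqc.zip` is assumed from FastQC's usual output naming.

```python
import os
import zipfile


def read_fastqc_summary(zip_path):
    """Return {module name: status} parsed from summary.txt inside a FastQC zip archive."""
    results = {}
    with zipfile.ZipFile(zip_path) as archive:
        # Find summary.txt without hard-coding the folder name inside the archive
        member = next(n for n in archive.namelist() if n.endswith("summary.txt"))
        with archive.open(member) as handle:
            for raw in handle:
                line = raw.decode("utf-8").strip()
                if not line:
                    continue
                # Each line is tab-separated: status, module name, input file name
                status, module = line.split("\t")[:2]
                results[module] = status
    return results


zip_path = "Dataset_QC/sequences_fastqc.zip"  # assumed output name
if os.path.exists(zip_path):
    for module, status in read_fastqc_summary(zip_path).items():
        print(f"{status:5s} {module}")
else:
    print("FastQC archive not found; run the FastQC cell first.")
```

Any module flagged WARN or FAIL is worth inspecting in the HTML report before continuing.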

Let's take a look at our FastQC output.

In [None]:
from IPython.display import IFrame
IFrame('Dataset_QC/sequences_fastqc.html', width=1000, height=550)

Looking at the first graph (per base sequence quality), we can see that the quality scores for our FASTQ file drop sharply around base position 120. This means we only need to trim the right (3') end of the reads, starting at position 120. Trimming will be done as part of the pipeline in Submodule 2.
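
The same judgment can be approximated in code. The sketch below is an illustration only and does not reproduce FastQC's own calculation: it computes the mean Phred score at each read position from a sample of reads and reports the first position whose mean falls below a threshold. The function names, the Phred+33 encoding, the 10,000-read sample, and the threshold of 20 are all assumptions; check the "Encoding" field in FastQC's Basic Statistics before relying on the offset.

```python
import gzip
import os


def mean_quality_per_position(path, max_reads=10000, offset=33):
    """Mean Phred score at each read position, using the first max_reads reads."""
    totals, counts = [], []
    with gzip.open(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i // 4 >= max_reads:
                break
            if i % 4 != 3:  # only the 4th line of each record holds quality characters
                continue
            for pos, char in enumerate(line.rstrip("\n")):
                if pos == len(totals):
                    totals.append(0)
                    counts.append(0)
                totals[pos] += ord(char) - offset
                counts[pos] += 1
    return [t / c for t, c in zip(totals, counts)]


def first_low_quality_position(means, threshold=20):
    """1-based position of the first base whose mean quality is below threshold, or None."""
    for pos, mean in enumerate(means, start=1):
        if mean < threshold:
            return pos
    return None


fastq_path = "Core_Dataset_Prep/emp-single-end-sequences/sequences.fastq.gz"
if os.path.exists(fastq_path):
    means = mean_quality_per_position(fastq_path)
    print("First position with mean quality below 20:", first_low_quality_position(means))
else:
    print("FASTQ file not found; run the download cell first.")
```

The position reported here depends on the chosen threshold and sample, so treat it as a cross-check on the FastQC plot and not as a replacement for it.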

### Quiz

In [None]:
#Run the command below to view the quiz
from IPython.display import IFrame
IFrame("../Quiz/QS12.html", width=800, height=350)

# Conclusion

Submodule 1 walked you through data preparation and quality control checks, which showed where poor-quality bases will need to be trimmed in Submodule 2. This submodule used the tools FastQC and MultiQC. Our datasets are now prepared and their quality has been checked; in the next submodule we will start our microbiome analysis!

## Clean up

Remember to stop your notebook instance when you are done!

## Gemini (Optional)
--------

If you're having trouble with this submodule (or others within this tutorial), feel free to leverage Gemini by running the cell below. Gemini is Google's advanced generative AI model designed to enhance the capabilities of AI applications across various domains.

In [None]:
# Ensure you have the necessary libraries installed
!pip install -q google-generativeai google-cloud-secret-manager
!pip install -q git+https://github.com/NIGMS/NIGMS-Sandbox-Repository-Template.git#subdirectory=llm_integrations
!pip install -q ipywidgets

import sys
import os
util_path = os.path.join(os.getcwd(), 'util')
if util_path not in sys.path:
    sys.path.append(util_path)

from gemini import run_gemini_widget, create_gemini_chat_widget 
from IPython.display import display

run_gemini_widget()