# Module: Importing Data Files to QIIME2

QIIME2 uses its own native file formats, artifacts (`.qza`) and visualizations (`.qzv`), instead of the more familiar bioinformatics formats (e.g. FASTA, FASTQ). Besides the data itself, each QIIME2-native file carries metadata such as its type, format, and provenance. This standardizes how data is passed between tools and makes it possible to trace how any file was generated.
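To make the attached metadata concrete, here is a minimal sketch that reads the type and format recorded in an artifact. It assumes the archive layout described in the QIIME2 core concepts documentation: a `.qza` is a zip archive with a single root directory named after the artifact UUID, and that directory holds a `metadata.yaml` file. The function name `read_artifact_metadata` is hypothetical and is not part of QIIME2.

```python
import zipfile


def read_artifact_metadata(qza_path):
    """Return the key/value pairs stored in an artifact's top-level metadata.yaml.

    Assumption: the archive contains <uuid>/metadata.yaml with simple
    "key: value" lines (uuid, type, format).
    """
    with zipfile.ZipFile(qza_path) as archive:
        for name in archive.namelist():
            parts = name.split("/")
            # Only the top-level metadata.yaml sits exactly two path levels deep;
            # files under provenance/ are nested further down.
            if len(parts) == 2 and parts[1] == "metadata.yaml":
                text = archive.read(name).decode("utf-8")
                break
        else:
            raise ValueError(f"no top-level metadata.yaml found in {qza_path}")

    metadata = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            metadata[key.strip()] = value.strip()
    return metadata
```

For an artifact produced later in this module, `read_artifact_metadata("0-raw-sequences/seqs.qza")["type"]` would be expected to return the semantic type string passed to `--type`.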

This module will demonstrate how to import common bioinformatics data files into QIIME2 formats.

The following references were used for this module: [QIIME2 core concepts](https://docs.qiime2.org/2024.10/concepts/#), [QIIME2 importing](https://docs.qiime2.org/2024.10/tutorials/importing/).

Created by: _Microbial Oceanography Laboratory (MOLab)_

---
## How to Use This Notebook

1. Activate the conda environment in a terminal window. Change the environment name to match your own installation.
```bash
conda activate qiime2-2023.2
```
2. Launch Jupyter Notebook with the command below and select this notebook.
```bash
jupyter notebook
```
3. To run the cells in this notebook, press Shift+Enter.

---
## Tools Used
1. **QIIME 2 Amplicon Distribution**
    - Installation procedure can be found here: [QIIME2 native installation](https://docs.qiime2.org/2024.10/install/native/)

---
## Starting Files 

1. Paired-end demultiplexed FASTQ files.
2. Single-end demultiplexed FASTQ files.
3. Other QIIME2-importable formats.

---
## Expected Outputs

1. `.qza` of type `SampleData[PairedEndSequencesWithQuality]`
2. `.qza` of type `SampleData[SequencesWithQuality]`
3. Other `.qza` types resulting from `qiime tools import`

---
## Table of Contents
 * [**Importing Paired-End Demultiplexed FASTQ Files**](#Importing-Paired-End-Demultiplexed-FASTQ-Files)
     * [Creating paired-end manifest file](#Creating-paired-end-manifest-file)
     * [Importing paired-end demultiplexed data](#Importing-paired-end-demultiplexed-data)
 * [**Importing Single-End Demultiplexed FASTQ Files**](#Importing-Single-End-Demultiplexed-FASTQ-Files)
     * [Creating single-end manifest file](#Creating-single-end-manifest-file)
     * [Importing single-end demultiplexed data](#Importing-single-end-demultiplexed-data)
 * [**Importing Other Formats**](#Importing-Other-Formats)

---
# <font color = 'gray'>Importing Paired-End Demultiplexed FASTQ Files</font>

Reference: [Importing through FASTQ manifest formats](https://docs.qiime2.org/2024.10/tutorials/importing/#fastq-manifest-formats).

### Creating paired-end manifest file

For MOLab, most raw metabarcoding data will be of this form. The first step is to create a manifest file listing each sample name and the absolute file paths of its forward and reverse reads. The Python code below generates a file named `manifest.txt` containing this information.

<div class="alert alert-block alert-info">
<b>Note:</b> 
    
The code below assumes that:
    <li> The FASTQ files are placed inside a folder named <code>0-raw-sequences</code> which is located in the same path as this Jupyter notebook.
    <li> Forward and reverse FASTQ filenames consist of underscore-separated blocks, and the first block is the sample ID.
    <li> The filenames should end with <code>\_1.fastq.gz</code> and <code>\_2.fastq.gz</code>, respectively. If not, you must reformat the filenames first (or edit the code below).
    
An example of a pair of valid filenames is: <code>SRR123_F_1.fastq.gz</code> (forward reads) and <code>SRR123_R_2.fastq.gz</code> (reverse reads).
</div>
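The naming assumptions above can be checked before any manifest is written. The sketch below is illustrative only: the helper name `find_unpaired_samples` is hypothetical, and it takes the first underscore-separated block as the sample ID, as stated in the note. It reports every sample that lacks a forward or a reverse file.

```python
import os


def find_unpaired_samples(filenames):
    """Return {sample_id: [missing directions]} for samples without a complete pair."""
    seen = {}
    for name in filenames:
        base = os.path.basename(name)
        sample_id = base.split("_", 1)[0]  # first block is the sample ID
        found = seen.setdefault(sample_id, set())
        if base.endswith("_1.fastq.gz"):
            found.add("forward")
        elif base.endswith("_2.fastq.gz"):
            found.add("reverse")

    expected = {"forward", "reverse"}
    return {
        sample_id: sorted(expected - found)
        for sample_id, found in seen.items()
        if found != expected
    }
```

An empty dictionary means every sample has both files. Passing the output of `os.listdir("0-raw-sequences")` is one way to run it against the real folder.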

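```python
import glob
import os

import pandas as pd

# Folder holding the raw FASTQ files, located next to this notebook
fpath = os.path.join(os.getcwd(), "0-raw-sequences")

# Map each sample ID to its forward and reverse read paths so that both
# paths always end up on the same row, whatever order glob returns them in
forward, reverse = {}, {}
for filepath in glob.glob(os.path.join(fpath, "*.gz")):
    filename = os.path.basename(filepath)
    sample = filename.split("_", 1)[0]  # first underscore-separated block
    if filename.endswith("_1.fastq.gz"):
        forward[sample] = filepath
    elif filename.endswith("_2.fastq.gz"):
        reverse[sample] = filepath

# Stop early if any sample is missing one of its two files
unpaired = sorted(set(forward) ^ set(reverse))
if unpaired:
    raise ValueError(f"Samples missing a forward or reverse file: {unpaired}")

samples = sorted(forward)
manifest = pd.DataFrame({
    'sampleID': samples,
    'forward-absolute-filepath': [forward[s] for s in samples],
    'reverse-absolute-filepath': [reverse[s] for s in samples],
})

# Write the manifest as a tab-separated file
manifest.to_csv('manifest.txt', sep='\t', index=False)
```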

### Importing paired-end demultiplexed data

Check `manifest.txt`. It should contain three columns: `sampleID`, `forward-absolute-filepath`, and `reverse-absolute-filepath`. Verify that the correct file paths are listed.
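Part of this check can be automated. The following is a minimal sketch, not a QIIME2 feature: the function name `check_paired_manifest` is hypothetical, and it only confirms that the three expected columns are present and that every listed path is absolute and points to an existing file.

```python
import csv
import os


def check_paired_manifest(manifest_path):
    """Return a list of human-readable problems found in a paired-end manifest."""
    expected = ["sampleID", "forward-absolute-filepath", "reverse-absolute-filepath"]
    problems = []
    with open(manifest_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        if reader.fieldnames != expected:
            return [f"unexpected columns: {reader.fieldnames}"]
        for row in reader:  # blank lines are skipped by DictReader
            for column in expected[1:]:
                path = row[column] or ""
                if not os.path.isabs(path):
                    problems.append(f"{row['sampleID']}: {column} is not an absolute path")
                elif not os.path.isfile(path):
                    problems.append(f"{row['sampleID']}: {column} does not exist")
    return problems
```

An empty list means the structural checks passed. It is still worth reading the file to confirm that each sample is matched with the right pair of reads.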

If there are no issues, use the `qiime tools import` command to compile the FASTQ files into a QIIME2 artifact. Because the manifest describes paired-end demultiplexed data, `--type 'SampleData[PairedEndSequencesWithQuality]'` is set.

The imported artifact is written to the path given in the `--output-path` argument.

```python
!qiime tools import \
    --type 'SampleData[PairedEndSequencesWithQuality]' \
    --input-path manifest.txt \
    --output-path 0-raw-sequences/seqs.qza \
    --input-format PairedEndFastqManifestPhred33V2
```

---
# <font color = 'gray'>Importing Single-End Demultiplexed FASTQ Files</font>

Reference: [Importing through FASTQ manifest formats](https://docs.qiime2.org/2024.10/tutorials/importing/#fastq-manifest-formats).

### Creating single-end manifest file

For this type of dataset, the approach is almost the same, except that the manifest file should contain only two columns: `sampleID` and `absolute-filepath`. Run the code below to generate a manifest file for a single-end dataset.

<div class="alert alert-block alert-info">
<b>Note:</b> 
    
The code below assumes that:
    <li> The FASTQ files are placed inside a folder named <code>0-raw-sequences</code> which is located in the same path as this Jupyter notebook.
    <li> The FASTQ filenames consist of underscore-separated blocks, and the first block is the sample ID.
    <li> The FASTQ filenames should end in <code>.fastq.gz</code>.
    
A few examples of valid filenames are: <code>SRR456_01.fastq.gz</code>, <code>SRR567_001.fastq.gz</code>, and <code>SRR678_02.fastq.gz</code>.
</div>
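Because the manifest needs exactly one row per sample ID, two files that share the same first block would collide. The sketch below shows how a sample ID follows from a filename under the assumptions above and flags any ID that appears more than once. Both helper names are hypothetical.

```python
from collections import Counter


def sample_id_from_filename(filename):
    """Return the first underscore-separated block of a bare filename."""
    return filename.split("_", 1)[0]


def duplicated_sample_ids(filenames):
    """Return the sorted sample IDs that more than one filename maps to."""
    counts = Counter(sample_id_from_filename(name) for name in filenames)
    return sorted(sample_id for sample_id, n in counts.items() if n > 1)
```

With the example filenames from the note, `sample_id_from_filename("SRR456_01.fastq.gz")` gives `SRR456`, and `duplicated_sample_ids` returns an empty list because the three sample IDs are distinct.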

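```python
import glob
import os

import pandas as pd

# Folder holding the raw FASTQ files, located next to this notebook
fpath = os.path.join(os.getcwd(), "0-raw-sequences")

# Map each sample ID to its FASTQ path so that ID and path stay on the same row;
# if several files share a sample ID, only the first one found is kept
filepaths = {}
for filepath in glob.glob(os.path.join(fpath, "*fastq.gz")):
    sample = os.path.basename(filepath).split("_", 1)[0]  # first underscore-separated block
    filepaths.setdefault(sample, filepath)

samples = sorted(filepaths)
manifest = pd.DataFrame({
    'sampleID': samples,
    'absolute-filepath': [filepaths[s] for s in samples],
})

# Write the manifest as a tab-separated file
manifest.to_csv('manifest.txt', sep='\t', index=False)
```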

### Importing single-end demultiplexed data

Double-check that the information listed in the manifest file is accurate, then run the `qiime tools import` command. This time `--type 'SampleData[SequencesWithQuality]'` is set because the data are single-end, demultiplexed FASTQ files.

The imported artifact is written to the path given in the `--output-path` argument.

```python
!qiime tools import \
  --type 'SampleData[SequencesWithQuality]' \
  --input-path manifest.txt \
  --output-path 0-raw-sequences/seqs.qza \
  --input-format SingleEndFastqManifestPhred33V2
```
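The two import commands in this module differ only in their `--type` and `--input-format` values, and the right pair follows from the manifest's columns. As a rough sketch, the hypothetical helper below reads the header of a manifest and returns the pair used in this module. It assumes the column names produced by the code cells above and does not cover other manifest variants.

```python
import csv

# (--type, --input-format) pairs as used in this module
IMPORT_SETTINGS = {
    "paired": ("SampleData[PairedEndSequencesWithQuality]",
               "PairedEndFastqManifestPhred33V2"),
    "single": ("SampleData[SequencesWithQuality]",
               "SingleEndFastqManifestPhred33V2"),
}


def import_settings_for(manifest_path):
    """Return the (type, input_format) pair matching a manifest's header columns."""
    with open(manifest_path, newline="") as handle:
        header = next(csv.reader(handle, delimiter="\t"), [])
    columns = set(header[1:])  # skip the sample ID column
    if columns == {"forward-absolute-filepath", "reverse-absolute-filepath"}:
        return IMPORT_SETTINGS["paired"]
    if columns == {"absolute-filepath"}:
        return IMPORT_SETTINGS["single"]
    raise ValueError(f"unrecognized manifest columns: {header}")
```

This is handy as a guard before running the import, since using a paired-end manifest with the single-end settings (or the reverse) is an easy mistake when both manifests are named `manifest.txt`.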

---
# <font color = 'gray'>Importing Other Formats</font>

If your data is in neither of the formats covered above, see the more detailed tutorial provided by QIIME2: [QIIME2 importing](https://docs.qiime2.org/2024.10/tutorials/importing/#).