In [1]:
# Set up the environment
import pandas as pd
import qiime2 as q2
from qiime2 import Visualization

# Create directories for the notebook. DO NOT change
data_dir = 'data/07_Functional_analysis'

!mkdir -p data
!mkdir -p $data_dir

# Fetch the files needed for this notebook. All files will be saved in $data_dir
!wget 'https://polybox.ethz.ch/index.php/s/WaSKZs2E5xn9SHm/download' -O data/Download.zip
!unzip -o data/Download.zip -d data
!rm data/Download.zip

--2025-12-03 11:35:42--  https://polybox.ethz.ch/index.php/s/WaSKZs2E5xn9SHm/download
Resolving polybox.ethz.ch (polybox.ethz.ch)... 129.132.71.243
Connecting to polybox.ethz.ch (polybox.ethz.ch)|129.132.71.243|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘data/Download.zip’

data/Download.zip       [        <=>         ] 306.57M   192MB/s    in 1.6s    

2025-12-03 11:35:44 (192 MB/s) - ‘data/Download.zip’ saved [321463931]

Archive:  data/Download.zip
 extracting: data/07_Functional_analysis/Euler_scripts/annotate_bacteria_95.sh  
 extracting: data/07_Functional_analysis/Euler_scripts/gunc_bacteria_95.sh  
 extracting: data/07_Functional_analysis/Euler_scripts/q2_gunc_db.sh  
 extracting: data/07_Functional_analysis/Euler_scripts/rgi_bacteria_95.sh  
 extracting: data/07_Functional_analysis/data/bacteria_95_gunc_results.qzv  
 extracting: data/07_Functional_analysis/

In [2]:
import os

# Sanity check: confirm that the downloaded archive was unpacked where the
# later cells expect it. The variable names below are chosen so that data_dir
# (defined in the first cell) is not overwritten.
print("Current working directory:", os.getcwd())

print(f"\nContents of {data_dir}:")
for name in sorted(os.listdir(data_dir)):
    prefix = "[dir]" if os.path.isdir(os.path.join(data_dir, name)) else "     "
    print(f" {prefix} {name}")

# Expected subdirectories inside data_dir
scripts_dir = os.path.join(data_dir, "Euler_scripts")
results_dir = os.path.join(data_dir, "data")

print("\nCheck important subdirectories:")
print("Euler_scripts exists:", os.path.isdir(scripts_dir), "->", scripts_dir)
print("data exists:", os.path.isdir(results_dir), "->", results_dir)

# MAG-based functional analysis for the space fermentation dataset

In this notebook we document how we analyzed the dereplicated bacterial MAG catalog
using QIIME 2 and several plugins (`q2-annotate`, `q2-gunc`, `q2-rgi`). All heavy
computations were run on the Euler cluster using SLURM batch scripts. Here we:

1. show the exact SLURM scripts that we submitted (using `cat`), and  
2. load the resulting QIIME 2 visualizations (`.qzv`) stored in the `data/`
   subdirectory.

## Input data and constraints

For this project we did **not** receive the original metagenomic reads. We only had:

- a set of dereplicated bacterial MAGs (95% ANI).

This has two important consequences:

- We **cannot** run steps that require read mapping (e.g., per-sample coverage,
  read-based functional profiling, differential abundance of pathways).
- We **can** run any analysis that only requires the DNA sequences of each MAG
  (gene prediction, ortholog annotation, genome quality checks, AMR gene detection).

Therefore, our workflow is deliberately **MAG-centric** and focuses on genomic
potential rather than sample-level abundances.

## Functional annotation with EggNOG (COG, CAZy, vitamins)

In this section we annotate the dereplicated bacterial MAGs using the
`q2-annotate` plugin and EggNOG. The goal is to characterize the functional
potential encoded in each MAG in terms of:

- broad COG functional categories (e.g., metabolism, information processing),
- CAZy families (carbohydrate-active enzymes),
- vitamin-related genes (based on EggNOG annotations containing the word "vitamin").

All heavy computation is done in the following SLURM script:

In [3]:
!cat $data_dir/Euler_scripts/annotate_bacteria_95.sh

#!/bin/bash
# File: annotate_bacteria_95.sh
# Purpose: Functional annotation of dereplicated bacterial MAGs using EggNOG
#          (COG, CAZy, vitamin-related functions) with QIIME 2 / moshpit.

#SBATCH --job-name=annotate_bacteria_95
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=20
#SBATCH --mem-per-cpu=2500MB
#SBATCH --time=72:00:00
#SBATCH --output=/cluster/home/yiyahe/bio_info/01_logs/%x_%j_out.log
#SBATCH --error=/cluster/home/yiyahe/bio_info/01_logs/%x_%j_err.log

# -----------------------------------------------------------------------------
# 1. Environment setup
# -----------------------------------------------------------------------------

# ETH proxy (required for internet access on Euler)
module load eth_proxy

# Initialize and activate the QIIME 2 / moshpit environment
source /cluster/home/yiyahe/miniconda3/etc/profile.d/conda.sh
conda activate qiime2-moshpit-2025.10

# -----------------------------------------------------------------------------
# 2. Input / output directo

### Summary of the EggNOG annotation script

The script performs the following main steps:

1. **Environment setup**  
   - Loads the ETH proxy module and activates the `qiime2-moshpit-2025.10`
     conda environment.

2. **EggNOG ortholog search (DIAMOND)**  
   - Runs `mosh annotate search-orthologs-diamond` on the dereplicated MAGs
     to find EggNOG orthologs for predicted proteins.

3. **Mapping to ortholog annotations**  
   - Uses `mosh annotate map-eggnog` and the EggNOG database (`$EGGNOG_DB`) to
     convert hits into functional annotations that can be filtered by category
     (e.g., COG, CAZy).

4. **CAZy annotation and visualization**  
   - Extracts CAZy annotation frequencies per MAG.  
   - Summarizes them as a feature table and generates a heatmap.

5. **COG annotation and visualization**  
   - Extracts COG category frequencies per MAG.  
   - Summarizes them and produces a COG heatmap.

6. **Vitamin-related genes (keyword search)**  
   - Searches the exported EggNOG annotation tables for the keyword "vitamin".  
   - Counts vitamin-related hits per MAG and saves the summary.

Overall, this script answers the question:
> *What functional potential (COG, CAZy, vitamin-related genes) is encoded in the
> dereplicated bacterial MAG catalog?*
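The keyword search in step 6 can be illustrated with a minimal sketch. It assumes the annotations were exported as a tab-separated table with one row per gene; the column names `mag_id` and `Description` and the toy rows are hypothetical and are not taken from the actual script output.

```python
import csv
import io
from collections import Counter


def count_keyword_hits(tsv_text, keyword="vitamin",
                       mag_col="mag_id", text_col="Description"):
    """Count annotation rows per MAG whose description contains the keyword.

    Column names are assumptions; adapt them to the exported table.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    hits = Counter()
    for row in reader:
        description = row.get(text_col) or ""
        # Case-insensitive substring match on the free-text description
        if keyword.lower() in description.lower():
            hits[row[mag_col]] += 1
    return hits


# Toy table for illustration only; not real annotation output.
example = (
    "mag_id\tDescription\n"
    "MAG_A\tVitamin B12 transporter BtuB\n"
    "MAG_A\tDNA polymerase III subunit alpha\n"
    "MAG_B\tvitamin K epoxide reductase\n"
    "MAG_A\tCobalamin (vitamin B12) biosynthesis protein\n"
)
print(count_keyword_hits(example))
```

A plain substring match like this is crude: it misses genes whose description names a specific compound (for example "cobalamin" or "riboflavin") without the word "vitamin", so the counts are best read as a lower bound.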

### Visualizing COG and CAZy profiles

Below we load the QIIME 2 visualizations for:

- the COG functional profile across MAGs, and  
- the CAZy enzyme profile across MAGs.

We assume that all `.qzv` files are located in the `data/` subdirectory of
`data_dir`, i.e. `data/07_Functional_analysis/data/`.
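Since loading fails if a file is absent, a quick existence check can help before calling `Visualization.load`. The following is a minimal sketch using only the standard library; the helper name `missing_files` is hypothetical, and the file names are the two heatmaps referenced in this section.

```python
from pathlib import Path


def missing_files(base_dir, filenames):
    """Return the names from `filenames` that are not regular files in base_dir."""
    base = Path(base_dir)
    return [name for name in filenames if not (base / name).is_file()]


expected_qzv = [
    "cog_annot_ft_bacteria_95_heatmap.qzv",
    "caz_annot_ft_bacteria_95_heatmap.qzv",
]

# Same location as f"{data_dir}/data" in the cells of this notebook
results_dir = Path("data/07_Functional_analysis") / "data"
print("Missing files:", missing_files(results_dir, expected_qzv))
```

An empty list means both visualizations are in place.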

In [None]:
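from IPython.display import display

# A bare expression is only rendered when it is the last line of a cell,
# so display() is called explicitly to show both heatmaps.
display(Visualization.load(f"{data_dir}/data/cog_annot_ft_bacteria_95_heatmap.qzv"))
display(Visualization.load(f"{data_dir}/data/caz_annot_ft_bacteria_95_heatmap.qzv"))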

### Interpretation (EggNOG-based functional profiles)

From the COG and CAZy heatmaps and the vitamin keyword summary we can see that:

- Core information processing and central metabolism categories are present
  in most MAGs, as expected for bacterial genomes.
- Some MAGs show higher counts of CAZy families related to polysaccharide
  degradation, suggesting a stronger potential for processing complex
  carbohydrates in the space fermentation environment.
- Vitamin-related gene counts highlight a subset of MAGs that encode multiple
  putative vitamin biosynthesis or transport functions.

Because we do not have read coverage information, these patterns describe
**genomic potential per MAG**, not the relative abundance of each function in
the community.
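A statement such as "present in most MAGs" can be checked by computing, for each category, the fraction of MAGs that carry at least one gene in it. The sketch below assumes the per-MAG counts have already been exported into a nested dictionary; the function name, MAG names, and counts are hypothetical toy values, not results from this dataset.

```python
def category_prevalence(counts_per_mag):
    """Fraction of MAGs with at least one gene in each category.

    counts_per_mag maps MAG id -> {category: gene count}. A missing category
    is treated as a count of zero.
    """
    n_mags = len(counts_per_mag)
    categories = sorted({cat for counts in counts_per_mag.values() for cat in counts})
    return {
        cat: sum(1 for counts in counts_per_mag.values() if counts.get(cat, 0) > 0) / n_mags
        for cat in categories
    }


# Toy counts for three COG categories (J, C, G); illustration only.
toy_counts = {
    "MAG_A": {"J": 150, "C": 90, "G": 120},
    "MAG_B": {"J": 140, "C": 85, "G": 0},
    "MAG_C": {"J": 160, "C": 70},
}
for category, fraction in category_prevalence(toy_counts).items():
    print(f"{category}: present in {fraction:.0%} of MAGs")
```

Prevalence is a presence/absence summary across genomes, so it stays within what a MAG-only dataset supports and says nothing about abundance in the samples.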

## Genome quality assessment with q2-gunc

Next, we assess the quality of the dereplicated MAGs using `q2-gunc`, which
wraps the GUNC (Genome UNClutterer) tool. GUNC detects chimerism and
contamination in MAGs by checking whether genes on the same contig belong to
consistent taxonomic lineages.

We first downloaded the GUNC database using:

In [None]:
!cat $data_dir/Euler_scripts/q2_gunc_db.sh

We then ran GUNC on the dereplicated bacterial MAG catalog:

In [None]:
!cat $data_dir/Euler_scripts/gunc_bacteria_95.sh

### Summary of the q2-gunc scripts

1. **`q2_gunc_db.sh`**  
   - Activates a QIIME 2 environment with the `q2-gunc` plugin.  
   - Downloads the GUNC ProGenomes database as a QIIME 2 artifact.

2. **`gunc_bacteria_95.sh`**  
   - Uses the dereplicated bacterial MAGs and the downloaded GUNC database as input.  
   - Runs `qiime gunc run-gunc` with 20 threads to compute chimerism and
     contamination metrics for each MAG.  
   - Generates a `.qza` file with the results and a `.qzv` visualization.

The goal of this step is to check whether our MAG catalog is affected by
chimeric or heavily contaminated genomes before interpreting their functions.

### Visualizing GUNC results

We now load the `.qzv` file produced by `qiime gunc visualize`, which the setup
cell extracted to `data/07_Functional_analysis/data/bacteria_95_gunc_results.qzv`.

In [None]:
Visualization.load(f"{data_dir}/data/bacteria_95_gunc_results.qzv")

### Interpretation (GUNC)

The GUNC visualization reports per-MAG scores for:

- the fraction of genes assigned to discordant lineages, and
- overall GUNC contamination/chimerism scores.

We use these metrics to identify MAGs that may be chimeric or strongly
contaminated. These MAGs should be interpreted with caution or excluded
from downstream biological interpretation.

In our case, most MAGs pass GUNC quality criteria, which supports using
this catalog for functional and AMR potential analysis.
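Separating MAGs that pass from those that should be treated with caution can be sketched as follows. This assumes the per-MAG results are available as rows with the column names `genome` and `clade_separation_score`, as in the standalone GUNC output table, and uses a cutoff of 0.45, which we take to be GUNC's default. Both the column names and the cutoff are assumptions to verify against the actual q2-gunc output; the rows are toy values.

```python
def flag_chimeric(gunc_rows, css_threshold=0.45):
    """Split genomes into (passed, flagged) by clade separation score.

    Column names and the default threshold are assumptions based on the
    standalone GUNC output; check them against the real results table.
    """
    passed, flagged = [], []
    for row in gunc_rows:
        score = float(row["clade_separation_score"])
        # Scores above the threshold suggest genes from discordant lineages
        (flagged if score > css_threshold else passed).append(row["genome"])
    return passed, flagged


# Toy rows for illustration only; not results from our MAG catalog.
toy_rows = [
    {"genome": "MAG_A", "clade_separation_score": "0.02"},
    {"genome": "MAG_B", "clade_separation_score": "0.71"},
    {"genome": "MAG_C", "clade_separation_score": "0.45"},
]
passed, flagged = flag_chimeric(toy_rows)
print("Passed: ", passed)
print("Flagged:", flagged)
```

Keeping the flagged list instead of silently dropping those MAGs makes it possible to report their functional and AMR annotations separately.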

## Antimicrobial resistance potential with q2-rgi

Finally, we annotate the dereplicated bacterial MAGs for antimicrobial
resistance (AMR) genes using the `q2-rgi` plugin, which integrates the
CARD (Comprehensive Antibiotic Resistance Database).

The SLURM script we used on Euler is shown below:

In [None]:
!cat $data_dir/Euler_scripts/rgi_bacteria_95.sh

### Summary of the q2-rgi script

The script performs the following steps:

1. **Environment setup**  
   - Activates the `q2-rgi-amplicon-2025.10` environment containing the
     `q2-rgi` plugin.

2. **CARD database download (if needed)**  
   - Uses `qiime rgi fetch-card-db` to download and preprocess the CARD
     reference databases as QIIME 2 artifacts.

3. **AMR annotation of MAGs**  
   - Runs `qiime rgi annotate-mags-card` on the dereplicated MAG catalog.  
   - Produces:
     - `amr_annotations_bacteria_95.qza`: detailed AMR annotations per MAG  
     - `amr_ft_bacteria_95.qza`: feature table summarizing AMR gene frequencies

4. **Heatmap visualization**  
   - Uses `qiime rgi heatmap` to create a visualization grouped by `drug_class`,
     using annotation frequencies.

This analysis answers the question:
> *Which antimicrobial resistance genes and drug classes are encoded in the
> space fermentation MAG catalog?*

### Visualizing the AMR heatmap

We assume that the RGI heatmap is available as
`data/07_Functional_analysis/data/rgi_heatmap_bacteria_95.qzv`.

In [None]:
Visualization.load(f"{data_dir}/data/rgi_heatmap_bacteria_95.qzv")

### Interpretation (AMR potential)

The RGI heatmap shows, for each MAG:

- which AMR genes are detected, and  
- which antibiotic drug classes they are associated with.

We observe that:

- Some MAGs encode multiple AMR genes spanning different drug classes,
  indicating a broader AMR potential.
- Other MAGs have few or no detected AMR genes.

Due to the lack of read coverage, this is a **genome-level resistome
profile** (potential), not a quantitative estimate of AMR gene abundance
in different samples or conditions.
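The "breadth" of AMR potential per MAG can be summarized as the number of distinct drug classes among its detected genes. The sketch below assumes each hit is available as a `(mag_id, drug_class)` pair in which several classes may be joined by semicolons, as in RGI's tabular output. The function name and the toy records are hypothetical.

```python
from collections import defaultdict


def drug_classes_per_mag(records):
    """Map each MAG id to the sorted list of distinct drug classes it encodes.

    records is an iterable of (mag_id, drug_class_field) pairs; the field may
    hold several classes separated by ';'. MAGs without hits do not appear.
    """
    classes = defaultdict(set)
    for mag_id, drug_class_field in records:
        for drug_class in drug_class_field.split(";"):
            drug_class = drug_class.strip()
            if drug_class:
                classes[mag_id].add(drug_class)
    return {mag_id: sorted(found) for mag_id, found in classes.items()}


# Toy records for illustration only; not results from our MAG catalog.
toy_hits = [
    ("MAG_A", "tetracycline antibiotic"),
    ("MAG_A", "fluoroquinolone antibiotic; tetracycline antibiotic"),
    ("MAG_B", "glycopeptide antibiotic"),
]
for mag_id, found in drug_classes_per_mag(toy_hits).items():
    print(f"{mag_id}: {len(found)} drug class(es): {', '.join(found)}")
```

Counting distinct classes, not hits, avoids inflating the breadth of a MAG that carries several genes against the same antibiotic family.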