# MAG Downstream Analysis

MAGs assembled from Illumina short reads, PacBio long reads, and ASG are processed together in one workflow:
- MAG dereplication with dRep
- Taxonomic identification of genomes with GTDB-Tk
- Coverage information and quantification with CoverM

In [None]:
# Basic modules and environments required for the scripts
module load gcc12-env/12.3.0
module load miniconda3/24.11.1
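# Each `conda activate` switches environments, so only the last one stays active.
# Activate the environment that matches the step being run.
conda activate dRep
conda activate GTDBTK
conda activate METABAT2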

cd /gxfs_work/geomar/smomw681/DATA

### MAG downstream analysis

1. dRep: Filter and dereplicate MetaBAT2 bins
    - Script: 4_0 (all)
        - Only prokaryotic bins: 4_0_1
    - dRep relies on Prodigal, CheckM, Mash, and fastANI (for the ANI comparison)
    - Filter criteria: min. completeness 50% and max. contamination 5%
    - The dereplicated MAGs can then be passed to GTDB-Tk to identify the prokaryotes they contain
    - Run CheckM2 on the dereplicated MAGs (script 4_0)
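
The dRep step above could be invoked roughly as follows. This is a minimal sketch, not the contents of script 4_0: the folder names, the `.fa` extension, the thread count, and the `run` dry-run helper are assumptions made for illustration. By default it only prints the commands; set `DRY_RUN=0` to execute them in an environment where dRep and CheckM2 are installed.

```shell
# Hypothetical locations; adjust to the project layout.
BIN_DIR="metabat2_bins"   # assumed folder holding the MetaBAT2 bins (*.fa)
DREP_OUT="drep_out"       # assumed dRep work directory
THREADS=8                 # assumed thread count

# Print each command by default; set DRY_RUN=0 to actually execute it.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Dereplicate, keeping bins with >= 50% completeness and <= 5% contamination.
run dRep dereplicate "$DREP_OUT" -g "$BIN_DIR"/*.fa \
    -comp 50 -con 5 --S_algorithm fastANI -p "$THREADS"

# Re-assess the dereplicated MAGs with CheckM2.
run checkm2 predict --input "$DREP_OUT/dereplicated_genomes" \
    --output-directory "$DREP_OUT/checkm2" -x fa --threads "$THREADS"
```

dRep writes the representative genomes to a `dereplicated_genomes` folder inside its work directory, which is why the CheckM2 call points there.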

2. GTDB-Tk: Taxonomic identification of MAGs against the GTDB reference database
    - The database must be downloaded beforehand (this takes a while because of its size)
    - The classify step uses pplacer to find the maximum-likelihood placement of each genome in the GTDB-Tk reference tree. GTDB-Tk classifies each genome based on its placement in the reference tree, its relative evolutionary divergence, and/or average nucleotide identity (ANI) to reference genomes.
    - Output:
        - The classified MAGs go on to read mapping with CoverM, which generates the coverage information
        - Summary output: https://ecogenomics.github.io/GTDBTk/files/summary.tsv.html
    - Script: 4_0
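
A GTDB-Tk run over the dereplicated MAGs might look like the sketch below. It assumes the reference data is already downloaded and unpacked, that `GTDBTK_DATA_PATH` points to it, and that the genomes use the `.fa` extension; the paths, the thread count, and the `run` dry-run helper are hypothetical and not taken from script 4_0.

```shell
# Hypothetical locations; adjust to the project layout.
GENOME_DIR="drep_out/dereplicated_genomes"   # assumed dRep output folder
GTDBTK_OUT="gtdbtk_out"                      # assumed output folder
THREADS=8                                    # assumed thread count

# Placeholder path to the unpacked reference data (kept if already set).
export GTDBTK_DATA_PATH="${GTDBTK_DATA_PATH:-/path/to/gtdbtk_data}"

# Print each command by default; set DRY_RUN=0 to actually execute it.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Full classification workflow (identify, align, classify).
# Note: some GTDB-Tk 2.x releases additionally require either
# --skip_ani_screen or --mash_db; check `gtdbtk classify_wf -h`.
run gtdbtk classify_wf --genome_dir "$GENOME_DIR" --out_dir "$GTDBTK_OUT" \
    --extension fa --cpus "$THREADS"
```

For bacterial genomes the classifications end up in `gtdbtk.bac120.summary.tsv` inside the output folder, in the format described at the summary link above.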

3. CoverM: Map error-corrected reads to dereplicated MAGs
    - Script: 4_1 (all)
    - Output: output_coverm.tsv
        - Tab-separated table with one row per genome and, for each sample, one column per statistic
        - Genome: The name of the genome
        - $sample Mean: The mean read coverage of the genome in that sample, i.e. the average height across the genome if the aligned reads were stacked on top of each other.
        - $sample Relative Abundance (%): The relative abundance of the genome within the sample. This metric accounts for differing genome sizes by using the proportion of mean coverage rather than the proportion of reads.
        - $sample Covered Fraction: The proportion of the genome that is covered by at least one read.
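
The mapping step and a first look at its table could be sketched as below. This assumes paired-end short reads for a single sample; long reads would need `--single` and a long-read mapper setting instead. The file names, the thread count, the `run` dry-run helper, and the `genomes_covered` function are hypothetical and not taken from script 4_1.

```shell
# Hypothetical inputs; adjust to the project layout.
R1="sampleA_R1.fastq.gz"                     # assumed forward reads
R2="sampleA_R2.fastq.gz"                     # assumed reverse reads
GENOME_DIR="drep_out/dereplicated_genomes"   # assumed dRep output folder

# Print each command by default; set DRY_RUN=0 to actually execute it.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Map the reads and report the three statistics described above.
run coverm genome --coupled "$R1" "$R2" \
    --genome-fasta-directory "$GENOME_DIR" --genome-fasta-extension fa \
    --methods mean relative_abundance covered_fraction \
    --output-file output_coverm.tsv --threads 8

# List genomes whose covered fraction reaches a threshold.
# $1 = table, $2 = 1-based index of a "<sample> Covered Fraction" column,
# $3 = minimum covered fraction. The "unmapped" row is skipped, and "+ 0"
# forces a numeric comparison so that NA values count as 0.
genomes_covered() {
    awk -F'\t' -v col="$2" -v min="$3" \
        'NR > 1 && $1 != "unmapped" && $col + 0 >= min + 0 { print $1 }' "$1"
}
```

With one sample and the method order used here, the covered fraction sits in column 4, so `genomes_covered output_coverm.tsv 4 0.5` would list the MAGs covered over at least half their length.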
    
