# Metagenome-assembled genome (MAG) construction steps

Starting from contigs and scaffolds assembled from short and long reads \
conda environment: MAG_Assembly


In [None]:
# Basic modules and environments required for the scripts
module load gcc12-env/12.3.0
module load miniconda3/24.11.1
conda activate MAG
cd $WORK/DATA

# If required, also activate the locally installed Perl
module load gcc/12.3.0
source ~/perl5/perlbrew/etc/bashrc
perlbrew switch perl-5.38.0


## Illumina metagenomics: contigs to MAG

### Contig preparation for MAG assembly
1. Rename the assembly output .fasta files
    - contigs.fasta => SRR..._contigs.fasta
    - scaffolds.fasta => SRR..._scaffolds.fasta
2. Clean the assembly folders: remove everything except the FASTA files
3. Filter the contigs by a minimum length of 500 bp
    - script no. 3_1
    - outputs: SRR..._contigs_min500.fasta
4. Rename the contig IDs
    - script no. 3_2
    - for contigs: >NODE... => >SRR_ctg(no.)
    - for scaffolds: >NODE... => >SRR_scaffolds_ctg(no.)
    - outputs: SRR..._contigs_min500_renamed.fasta

-> summarized script: 3_All_1
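
Steps 3 and 4 above can be sketched with plain `awk`. This is a minimal sketch and not the content of scripts 3_1 or 3_2: the function names `filter_min_len` and `rename_ids` and the `sample` variable in the usage comments are hypothetical. Note that the filter writes each sequence on a single line.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are illustrative and not taken from scripts 3_1/3_2.

# filter_min_len MIN_LEN FASTA
# Keep only records whose sequence is at least MIN_LEN bp long.
# Multi-line records are joined, so every kept sequence is written on one line.
filter_min_len() {
    awk -v min="$1" '
        function emit() { if (hdr != "" && length(seq) >= min) print hdr "\n" seq }
        /^>/ { emit(); hdr = $0; seq = ""; next }
        { seq = seq $0 }
        END { emit() }
    ' "$2"
}

# rename_ids PREFIX FASTA
# Replace every header with ">PREFIX_ctg<n>", numbering records in file order.
rename_ids() {
    awk -v p="$1" '/^>/ { print ">" p "_ctg" (++n); next } { print }' "$2"
}

# Usage (the variable "sample" is a placeholder for the SRR accession):
#   filter_min_len 500 "${sample}_contigs.fasta" > "${sample}_contigs_min500.fasta"
#   rename_ids "${sample}" "${sample}_contigs_min500.fasta" \
#       > "${sample}_contigs_min500_renamed.fasta"
# For scaffolds, pass "${sample}_scaffolds" as the prefix.
```

Filtering before renaming keeps the contig numbers consecutive in the final file.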


In [None]:

# 1. Rename files
# Rename the contigs and scaffolds files so that each carries its sample name
cd /gxfs_work/geomar/smomw681/DATA
dir1="/gxfs_work/geomar/smomw681/DATA/ASSEMBLIES"
for dir in "${dir1}"/*_SPADessembly; do
    [ -d "$dir" ] || continue    # skip if the glob matched nothing
    echo "$dir"
    base=$(basename "$dir" "_SPADessembly")
    # use full paths instead of cd, so a failed cd cannot rename files elsewhere
    mv "$dir/contigs.fasta" "$dir/${base}_contigs.fasta"
    mv "$dir/scaffolds.fasta" "$dir/${base}_scaffolds.fasta"
done

# 2. Clean assembly folders:
# remove all assembly output except the renamed contigs and scaffolds files
# (run only after step 1, otherwise contigs.fasta and scaffolds.fasta are deleted too)
cd /gxfs_work/geomar/smomw681/DATA
dir1="/gxfs_work/geomar/smomw681/DATA/ASSEMBLIES"
ASSEM_DIRS=("${dir1}"/*_SPADessembly)

for dir in "${ASSEM_DIRS[@]}"; do
    if [ -d "$dir" ]; then
        # -maxdepth 1: test only top-level entries; rm -rf removes subfolders whole
        find "$dir" -mindepth 1 -maxdepth 1 \
            ! -name "*_contigs.fasta" ! -name "*_scaffolds.fasta" -exec rm -rf {} +
        echo "Files cleaned for the directory $dir"
    fi
done


### MAG assembly from contigs

1. DeepMicroClass: classify contigs by their origin
    - script no. 3_3
2. Extract the prokaryotic contigs
    - script no. 3_4
    - compute statistics for the prokaryotic contigs: script no. 3_4_0

3. Prodigal: prokaryotic gene prediction
    - script no. 3_5

4. BBMap: map error-corrected reads to the assembled contigs and get coverage information
    - script no. 3_6
    - map the reads and write the alignment to .bam
    - sort the .bam to .sorted.bam
    - generate the coverage file for the contigs from the sorted .bam file

-> summarized script: 3_ALL_2

5. QC: quality control of the mapped files
    - script no. 3_7_1
    - uses the mapped fasta file for statistics

6. MetaBAT2: MAG assembly (binning)
    - script no. 3_8
    - bins the prokaryotic contigs using the coverage file

7. CheckM: check the quality of the MAGs (dRep already includes this step, so it may become unnecessary in the future)
    - script no. 3_9

8. dRep: filter and dereplicate the prokaryotic bins
    - The package uses Prodigal, CheckM, Mash and fastANI (alignment algorithm)
    - Filter criteria: min. completeness 50% and max. contamination 5%
    - A further dRep run follows, dereplicating these MAGs together with MAGs from the ASG database and PacBio
    - The output of this pipeline can be used to identify the contained prokaryotes with GTDB-Tk

## MAG Assembly for contigs from PacBio Long-read sequencing

1. BAM/SAM/coverage file generation
    - For long-read MAG assembly: not provided as MetaBAT2 input due to low coverage
        - for more accurate results, consider using the Strainy pipeline
2. MetaBAT2
    - Script: 2_4
    - Run without the -a option (coverage information not provided)
3. CheckM2
    - Script: 2_5
    - Only a few bins exceed 50% completeness
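
To see which bins pass a quality threshold, the quality table can be filtered with `awk`. This is a minimal sketch and not part of script 2_5: the function name `filter_bins` is hypothetical, and it assumes a tab-separated report with a header row containing the columns `Name`, `Completeness` and `Contamination`. Check the header of the actual CheckM2 report before relying on it.

```shell
#!/usr/bin/env bash
# Sketch only: filter_bins is an illustrative helper; the column names are assumed.

# filter_bins REPORT_TSV MIN_COMPLETENESS MAX_CONTAMINATION
# Prints name, completeness and contamination of every bin that passes both limits.
filter_bins() {
    awk -F'\t' -v min_comp="$2" -v max_cont="$3" '
        NR == 1 {
            # Locate the columns by header name instead of by fixed position
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("Name" in col) || !("Completeness" in col) || !("Contamination" in col)) {
                print "expected columns Name, Completeness, Contamination" > "/dev/stderr"
                exit 1
            }
            next
        }
        $col["Completeness"] + 0 >= min_comp && $col["Contamination"] + 0 <= max_cont {
            print $col["Name"] "\t" $col["Completeness"] "\t" $col["Contamination"]
        }
    ' "$1"
}

# Usage (the report path is a placeholder):
#   filter_bins path/to/report.tsv 50 5            # same limits as the dRep filter
#   filter_bins path/to/report.tsv 50 100 | wc -l  # count bins with >= 50% completeness
```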

In [None]:
## How to use perl
# Load additional module and install perl locally
module load gcc/12.3.0
curl -L https://install.perlbrew.pl | bash
source ~/perl5/perlbrew/etc/bashrc
perlbrew install --notest perl-5.38.0
# Activate the Perl version to use
source ~/perl5/perlbrew/etc/bashrc
perlbrew list
perlbrew switch perl-5.38.0
# Install cpanm to install other required modules; optionally check that a module loads
perlbrew install-cpanm
cpanm "autodie"
perl -e "use autodie"

In [None]:
## How to use python env
module load gcc12-env/12.3.0
module load python/3.11.5
# create a python environment
mkdir $HOME/my_python_env
python -m venv $HOME/my_python_env/my_env
# install package into env
source $HOME/my_python_env/my_env/bin/activate
module load gcc/12.3.0
pip install ...
deactivate
# Use the installed package
# (load the modules before activating, so the venv's python stays first in PATH)
module load gcc12-env/12.3.0
module load python/3.11.5
source $HOME/my_python_env/my_env/bin/activate
...
deactivate

##### Debugging:
- Problem with installing packages
    - The cause was the conda solver: I changed it from the default to libmamba -> *it worked!*