# Metagenome-assembled genome (MAG) construction steps

Starting from contigs and scaffolds assembled from short and long reads \
conda environment: MAG_Assembly


In [None]:
# Basic modules and environments required for the scripts
module load gcc12-env/12.3.0
module load miniconda3/24.11.1
conda activate MAG
cd $WORK/DATA

# If required, also activate the local Perl installation
module load gcc/12.3.0
source ~/perl5/perlbrew/etc/bashrc
perlbrew switch perl-5.38.0


### First steps
1. Change the name of the assembly output .fasta files
    - contigs.fasta => SRR..._contigs.fasta
    - scaffolds.fasta => SRR..._scaffolds.fasta
2. Clean the folders and remove everything except the FASTA files
3. Filter the contigs by a minimum length of 500 bp
    - script no. 3_1
    - outputs: SRR..._contigs_min500.fasta
4. Rename the contig IDs
    - script no. 3_2
    - for contigs: >NODE... => >SRR_ctg(no.)
    - outputs: SRR..._contigs_min500_renamed.fasta
    - for scaffolds: >NODE... => >SRR_scaffolds_ctg(no.)
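
Scripts 3_1 and 3_2 are not shown in this notebook, so the following is only a minimal sketch of the filtering and renaming steps, assuming plain (possibly multi-line) FASTA input. The function names `filter_min_length` and `rename_contigs` are hypothetical stand-ins, not the actual scripts.

```shell
# Hypothetical stand-ins for scripts 3_1 and 3_2 (the originals are not shown here).

# Keep only sequences with at least min_len bases; multi-line records are joined,
# so each kept sequence is written on a single line.
filter_min_length() {
    local min_len="$1" fasta="$2"
    awk -v min="$min_len" '
        function emit() { if (header != "" && length(seq) >= min) print header "\n" seq }
        /^>/ { emit(); header = $0; seq = ""; next }
        { seq = seq $0 }
        END { emit() }
    ' "$fasta"
}

# Replace every header with >PREFIX_ctgN, numbering the records in file order.
rename_contigs() {
    local prefix="$1" fasta="$2"
    awk -v p="$prefix" '/^>/ { n++; print ">" p "_ctg" n; next } { print }' "$fasta"
}
```

With `base` set to the sample accession, a call could look like `filter_min_length 500 "${base}_contigs.fasta" > "${base}_contigs_min500.fasta"`, followed by `rename_contigs "$base" "${base}_contigs_min500.fasta" > "${base}_contigs_min500_renamed.fasta"`. For scaffolds, pass `"${base}_scaffolds"` as the prefix.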

In [None]:

# 1. Rename files
# Rename the contigs and scaffolds files so they can be told apart by sample
cd /gxfs_work/geomar/smomw681/DATA
dir1="/gxfs_work/geomar/smomw681/DATA/ASSEMBLIES"
DIRS=("${dir1}"/*_SPADessembly)
for dir in "${DIRS[@]}"; do
    # Skip the folder if cd fails, so mv never runs in the wrong directory
    cd "$dir" || continue
    pwd
    base=$(basename "$dir" "_SPADessembly")
    mv contigs.fasta "${base}_contigs.fasta"
    mv scaffolds.fasta "${base}_scaffolds.fasta"
done

# 2. Clean assembly folders:
# Remove every assembly output except the renamed contigs and scaffolds files
cd /gxfs_work/geomar/smomw681/DATA
dir1="/gxfs_work/geomar/smomw681/DATA/ASSEMBLIES"
FILES=("${dir1}"/*_SPADessembly)
for file in "${FILES[@]}"; do
    # Only act on existing directories (this also skips an unmatched glob pattern)
    if [ -d "$file" ]; then
        # -maxdepth 1: match top-level entries only; rm -rf removes subfolders whole
        find "$file" -mindepth 1 -maxdepth 1 ! \
            -name "*_contigs.fasta" ! -name "*_scaffolds.fasta" -exec rm -rf {} +
        echo "Files cleaned for the directory $file"
    fi
done


### MAG assembly from contigs
1. DeepMicroClass: Classify contigs by their origin
    - script no. 3_3
2. Extract the contigs from prokaryotes
    - script no. 3_4
    - Generate statistics from the extracted contigs
        - script no. 3_4_0
3. Prodigal: prokaryotic gene prediction
    - script no. 3_5
4. BBMap: map error-corrected reads to the assembled contigs and get coverage information
    - script no. 3_6
    - write the alignments to .bam
    - sort .bam to .sorted.bam
    - generate a coverage file for the contigs from the sorted .bam file
5. MetaBAT2: MAG assembly
    - script no. 3_7
    - 3_7: MAG assembly using prokaryotic contigs and coverage file
    - 3_7_1 **?**: maybe rerun with all contigs -> also mapping info required 
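
Scripts 3_6 and 3_7 are not included here. As a rough sketch only, the mapping, sorting, coverage, and binning steps could be chained as below, assuming paired-end reads and that `bbmap.sh`, `samtools`, `jgi_summarize_bam_contig_depths`, and `metabat2` are on the `PATH`. The function names `run` and `map_and_bin` and all file names are hypothetical. Setting `DRY_RUN=1` prints the commands instead of executing them, which helps check paths before submitting a job.

```shell
# Hypothetical sketch of steps 3_6 and 3_7; all file names are placeholders.
# With DRY_RUN=1 each command is printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

map_and_bin() {
    local contigs="$1" reads1="$2" reads2="$3" outdir="$4"
    run mkdir -p "$outdir/bins"
    # Map the error-corrected reads to the contigs (BAM output needs samtools)
    run bbmap.sh ref="$contigs" in="$reads1" in2="$reads2" out="$outdir/mapped.bam" nodisk=t
    # Sort and index the alignments
    run samtools sort -o "$outdir/mapped.sorted.bam" "$outdir/mapped.bam"
    run samtools index "$outdir/mapped.sorted.bam"
    # Per-contig coverage table used by MetaBAT2
    run jgi_summarize_bam_contig_depths --outputDepth "$outdir/depth.txt" "$outdir/mapped.sorted.bam"
    # Bin the contigs into MAGs; -o is the output prefix for the bin files
    run metabat2 -i "$contigs" -a "$outdir/depth.txt" -o "$outdir/bins/bin"
}
```

Threads, memory, and minimum contig length are left at the tools' defaults in this sketch and would need tuning for real data.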


### MAG to annotation and coverage

1. CheckM: check the quality of MAGs
    - script no. 3_8_0
    - Filter criteria: completeness of at least 50% and contamination of at most 5%
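
The filter criteria can be applied with a short awk helper. This is a sketch, not script 3_8_0: it assumes a tab-separated quality table whose header row contains columns named `Completeness` and `Contamination` (check the header of your own CheckM output), and the function name `filter_checkm` is hypothetical.

```shell
# Hypothetical helper: keep bins with completeness >= min_comp and
# contamination <= max_cont (defaults: 50 and 5).
# Assumes a tab-separated table with "Completeness" and "Contamination" header columns.
filter_checkm() {
    local table="$1" min_comp="${2:-50}" max_cont="${3:-5}"
    awk -F'\t' -v minc="$min_comp" -v maxc="$max_cont" '
        NR == 1 {
            # Locate the two columns by name instead of by fixed position
            for (i = 1; i <= NF; i++) {
                if ($i == "Completeness")  comp = i
                if ($i == "Contamination") cont = i
            }
            if (!comp || !cont) { print "required columns not found" > "/dev/stderr"; exit 1 }
            print; next
        }
        ($comp + 0) >= minc && ($cont + 0) <= maxc
    ' "$table"
}
```

The header row is kept in the output, so the result can be read back as a table. Thresholds are inclusive, so a bin with exactly 50% completeness and 5% contamination passes.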

In [None]:
## How to use perl
# Load additional module and install perl locally
module load gcc/12.3.0
curl -L https://install.perlbrew.pl | bash
source ~/perl5/perlbrew/etc/bashrc
perlbrew install --notest perl-5.38.0
# Activate the Perl version to use
source ~/perl5/perlbrew/etc/bashrc
perlbrew list
perlbrew switch perl-5.38.0
# Install cpanm, then use it to install required modules; optionally check that a module loads
perlbrew install-cpanm
cpanm "autodie"
perl -e "use autodie"

## How to use python env
module load gcc12-env/12.3.0
module load python/3.11.5
# create a python environment
mkdir -p $HOME/my_python_env
python -m venv $HOME/my_python_env/my_env
# install package into env
source $HOME/my_python_env/my_env/bin/activate
module load gcc/12.3.0
pip install ...
deactivate
# Use the installed package
module load gcc12-env/12.3.0
module load python/3.11.5
# Activate after loading the python module so the venv's python comes first in PATH
source $HOME/my_python_env/my_env/bin/activate
...
deactivate

##### Debugging:
- Problem with installing packages
    - The conda solver was the cause. After switching it from the default to libmamba -> *it worked!*