# MagCluster
![Anaconda.org](https://anaconda.org/bioconda/magcluster/badges/version.svg) ![License](https://anaconda.org/bioconda/magcluster/badges/license.svg) ![Downloads](https://anaconda.org/bioconda/magcluster/badges/downloads.svg) ![Install](https://anaconda.org/bioconda/magcluster/badges/installer/conda.svg) ![Last update](https://anaconda.org/bioconda/magcluster/badges/latest_release_date.svg)

MagCluster is a tool for identification, annotation and visualization of magnetosome gene clusters (MGCs) from genomes of magnetotactic bacteria (MTB).

## Contents
- [Installation](#installation)
  - [Conda](#conda)
  - [Bioconda](#bioconda)
  - [Pip](#pip)
- [Usage](#usage)
  - [Genomes annotation](#genomes-annotation)
  - [MGCs screening](#mgcs-screening)
  - [MGCs alignment and visualization](#mgcs-alignment-and-visualization)
- [Tutorials](#tutorials)
- [Citation](#citation)
- [Contact us](#contact-us)
---

## Installation

### Conda
MagCluster can be installed through [Conda](https://www.anaconda.com/products/individual). We recommend creating a ***new environment*** for MagCluster to avoid dependency conflicts.

```bash
wget https://github.com/RunJiaJi/magcluster/releases/download/0.1.8/magcluster-0.1.8.yml
conda env create -n magcluster --file magcluster-0.1.8.yml

# Optional cleanup
rm magcluster-0.1.8.yml

# Activate magcluster environment
conda activate magcluster

# Check for the usage of MagCluster
magcluster -h
```
### Bioconda
```bash
# Create magcluster environment
conda create -n magcluster

# Activate magcluster environment
conda activate magcluster

# Install MagCluster through bioconda channel
conda install -c conda-forge -c bioconda -c defaults blast=2.9 prokka=1.13.4 magcluster=0.1.8

# Check for the usage of MagCluster
magcluster -h
```
### Pip
Alternatively, you can install MagCluster through pip in an existing environment. In that case, please make sure you have [Prokka](https://github.com/tseemann/prokka) installed.

```bash
# Install MagCluster through pip
pip install magcluster

# Check for the usage of MagCluster
magcluster -h
```

## Usage


MagCluster comprises three modules for batch processing of MGCs:
1. MTB genome annotation with [Prokka](https://github.com/tseemann/prokka)
2. MGC screening with MGC_Screen
3. MGC visualization with [Clinker](https://github.com/gamcil/clinker)


```bash
usage: magcluster [options]

Options:
  {prokka,mgc_screen,clinker}
    prokka              Genome annotation with Prokka
    mgc_screen          Magnetosome gene cluster screening with MGC_Screen
    clinker             Magnetosome gene cluster visualization with Clinker
```
#### Genomes annotation
**Multiple genome files** or **genome-containing folder(s)** are accepted as input for batch annotation. The general usage is the same as Prokka's, except that some parameters are given default values suited to batch annotation of genomes.

To avoid confusion, the name of each genome is used as the output folder’s name (`--outdir GENOME_NAME`), output files’ prefix (`--prefix GENOME_NAME`), and GenBank file’s locus_tag (`--locustag GENOME_NAME`) by default. The `--compliant` parameter is also used by default to ensure standard GenBank files. 
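
The default naming described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not MagCluster's own code: the helper names `expand_inputs` and `default_prokka_flags` and the `FASTA_SUFFIXES` set are hypothetical, and the tutorial output further down shows folders such as `HCH-1_annotation`, so the real output-folder name may carry a suffix that this sketch omits.

```python
from pathlib import Path

# Hypothetical set of extensions treated as genome FASTA files.
FASTA_SUFFIXES = {".fasta", ".fa", ".fna"}


def expand_inputs(paths):
    """Expand a mix of genome files and folders into a flat list of files."""
    genomes = []
    for p in map(Path, paths):
        if p.is_dir():
            # Take every FASTA-like file directly inside the folder.
            genomes.extend(
                f for f in sorted(p.iterdir())
                if f.is_file() and f.suffix.lower() in FASTA_SUFFIXES
            )
        else:
            genomes.append(p)
    return genomes


def default_prokka_flags(genome_path):
    """Build the per-genome flags listed above from the genome file name."""
    name = Path(genome_path).stem  # "HCH-1.fasta" -> "HCH-1"
    return [
        "--outdir", name,
        "--prefix", name,
        "--locustag", name,
        "--compliant",
    ]


for genome in expand_inputs(["HCH-1.fasta", "IT-1.fasta"]):
    print(genome, default_prokka_flags(genome))
```

Deriving all three names from the file stem is what keeps the outputs of a batch run from overwriting each other.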

For MGC annotation, MagCluster ships with a [reference MGCs file](https://github.com/RunJiaJi/magcluster/releases/download/v1.0/Magnetosome_protein_data.fasta.faa) containing magnetosome protein sequences from representative MTB strains, which is used by default. We recommend setting `--evalue` to 1e-05.
```bash
# Example usage

# MGCs annotation with multiple MTB genomes as input
$ magcluster prokka --evalue 1e-05 --proteins Magnetosome_protein_data.fasta MTB_genome1.fasta MTB_genome2.fasta MTB_genome3.fasta

# MGCs annotation with MTB genomes containing folder as input
$ magcluster prokka --evalue 1e-05 --proteins Magnetosome_protein_data.fasta /MTB_genomes_folder
```
#### MGCs screening
The MGC_Screen module retrieves MGC-containing contigs/scaffolds from GenBank files. As magnetosome genes are always physically clustered in MTB genomes, MGC_Screen identifies MGCs based on the number of magnetosome genes found close together.
Three parameters are involved in MGC screening: `--contiglength`, `--windowsize` and `--threshold` (see below). You can adjust them according to your needs.
For each genome, MGC_Screen produces two output files: a *GenBank file of MGC-containing contigs* and a *csv file summarizing all magnetosome protein sequences*.
```bash

usage: magcluster mgc_screen [-h] [-l CONTIGLENGTH] [-win WINDOWSIZE] [-th THRESHOLD] [-o OUTDIR] gbkfile [gbkfile ...]

positional arguments:
  gbkfile               .gbk/.gbf files to analyzed. Multiple files or files-containing folder is acceptable.

optional arguments:
  -h, --help            show this help message and exit
  -l CONTIGLENGTH, --contiglength CONTIGLENGTH
                        The minimum size of a contig for screening (default '2,000 bp')
  -w WINDOWSIZE, --windowsize WINDOWSIZE
                        The window size in the text mining of magnetosome proteins (default '10,000 bp')
  -th THRESHOLD, --threshold THRESHOLD
                        The minimum number of magnetosome genes existed in a window size (default '3')
  -o OUTDIR, --outdir OUTDIR
                        Output folder (default 'mgc_screen')
```
```bash
# Example usage

# MGCs screening with multiple GenBank files as input
$ magcluster mgc_screen --threshold 3 --contiglength 2000 --windowsize 10000 file1.gbk file2.gbk file3.gbk

# MGCs screening with GenBank files containing folder as input
$ magcluster mgc_screen --threshold 3 --contiglength 2000 --windowsize 10000 /gbkfiles_folder
```
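
To make the roles of the three screening parameters more concrete, here is a minimal Python sketch of window-based screening. It is an illustration only and not MGC_Screen's actual code: the function name `find_mgc_contigs`, the input layout (contig name mapped to its length and the start positions of its magnetosome genes), and the use of gene start positions to measure distance are all assumptions made for this example.

```python
def find_mgc_contigs(contigs, contig_length=2000, window_size=10000, threshold=3):
    """Return the names of contigs that pass the screening criteria.

    `contigs` maps a contig name to (length_bp, [magnetosome_gene_starts]).
    A contig passes if it is at least `contig_length` bp long and at least
    `threshold` magnetosome genes start within `window_size` bp of each other.
    """
    hits = []
    for name, (length, starts) in contigs.items():
        if length < contig_length:
            continue  # contig too short to be screened
        starts = sorted(starts)
        # Check every run of `threshold` consecutive genes.
        for i in range(len(starts) - threshold + 1):
            if starts[i + threshold - 1] - starts[i] <= window_size:
                hits.append(name)
                break
    return hits


example = {
    "contig_1": (50000, [1000, 4000, 9000]),    # three genes within 8 kb
    "contig_2": (50000, [1000, 20000, 40000]),  # genes too far apart
    "contig_3": (1500, [100, 200, 300]),        # shorter than 2,000 bp
}
print(find_mgc_contigs(example))  # -> ['contig_1']
```

In this sketch, raising `threshold` or shrinking `window_size` makes the screening stricter, while lowering `contig_length` lets shorter contigs be considered.
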
#### MGCs alignment and visualization
We use [Clinker](https://github.com/gamcil/clinker) for MGCs alignment and visualization. Note that the `-p` parameter is used by default to generate an interactive HTML web page where you can modify the MGCs figure and export it as a publication-quality file.

```bash
# Example usage

# MGCs alignment and visualization with multiple GenBank files as input
$ magcluster clinker -p MGC_align.html /MGCs_files_folder/*.gbk
```

## Tutorials
This is a simple example to help you get started with MagCluster quickly. We use the genomes of ***Candidatus* Magnetominusculus xianensis strain HCH-1** ([see the paper](https://www.pnas.org/content/pnas/114/9/2171.full.pdf)) and ***Magnetofaba australis* IT-1** ([see the paper](https://www.frontiersin.org/articles/10.3389/fmicb.2014.00072/full)) to show how it works.
### Step 0: Prepare MagCluster
To start our journey, make sure you have a working MagCluster installation on your system.

For installation information, please check [Installation](https://github.com/RunJiaJi/MagCluster#installation).

    # Activate magcluster environment
    $ conda activate magcluster
    # Check MagCluster version
    $ magcluster -v
    magcluster 0.1.8

### Step 1: Obtaining data
You can download genome data from NCBI.
- HCH-1 [LNQR00000000](https://www.ncbi.nlm.nih.gov/Traces/wgs/LNQR01?display=download)
- IT-1 [LVJN00000000](https://www.ncbi.nlm.nih.gov/Traces/wgs/LVJN01?display=download)

    # Make a new folder
    $ mkdir magtest
    # Open magtest folder
    $ cd magtest

    # Download genomes from NCBI
    $ wget https://sra-download.ncbi.nlm.nih.gov/traces/wgs03/wgs_aux/LN/QR/LNQR01/LNQR01.1.fsa_nt.gz
    $ wget https://sra-download.ncbi.nlm.nih.gov/traces/wgs03/wgs_aux/LV/JN/LVJN01/LVJN01.1.fsa_nt.gz
    Will not apply HSTS. The HSTS database must be a regular and non-world-writable file.
    --2021-09-12 22:21:45--  https://sra-download.ncbi.nlm.nih.gov/traces/wgs03/wgs_aux/LN/QR/LNQR01/LNQR01.1.fsa_nt.gz
    Resolving sra-download.ncbi.nlm.nih.gov (sra-download.ncbi.nlm.nih.gov)... 130.14.250.25, 130.14.250.24, 130.14.250.27
    Connecting to sra-download.ncbi.nlm.nih.gov (sra-download.ncbi.nlm.nih.gov)|130.14.250.25|:443... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 1119868 (1.1M) [application/octet-stream]
    Saving to: ‘LNQR01.1.fsa_nt.gz’

    2021-09-12 22:21:48 (742 KB/s) - ‘LNQR01.1.fsa_nt.gz’ saved [1119868/1119868]

    Will not apply HSTS. The HSTS database must be a regular and non-world-writable file.
    --2021-09-12 22:21:48--  https://sra-download.ncbi.nlm.nih.gov/traces/wgs03/wgs_aux/LV/JN/LVJN01/LVJN01.1.fsa_nt.gz
    Resolving sra-download.ncbi.nlm.nih.gov (sra-download.ncbi.nlm.nih.gov)... 165.112.9.235, 165.112.9.231, 165.112.9.232
    Connecting to sra-download.ncbi.nlm.nih

    # Check the genome files
    $ ls
    LNQR01.1.fsa_nt.gz  LVJN01.1.fsa_nt.gz

    # Unzip files
    $ gunzip *.gz
    $ ls
    LNQR01.1.fsa_nt  LVJN01.1.fsa_nt

    # Rename the genomes
    $ mv LNQR01.1.fsa_nt HCH-1.fasta
    $ mv LVJN01.1.fsa_nt IT-1.fasta
    $ ls
    HCH-1.fasta  IT-1.fasta
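
If you prefer to script the unzip-and-rename part of this step, the following Python sketch does the same thing with the standard library. The helper name `gunzip_as` is hypothetical, and unlike `gunzip` it leaves the original `.gz` file in place.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_as(src_gz, dest):
    """Decompress `src_gz` and write the result to `dest`."""
    dest = Path(dest)
    with gzip.open(src_gz, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)  # stream, so large genomes are fine
    return dest
```

For the two genomes used here, this would be called as `gunzip_as("LNQR01.1.fsa_nt.gz", "HCH-1.fasta")` and `gunzip_as("LVJN01.1.fsa_nt.gz", "IT-1.fasta")`.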

### Step 2: Genome annotation
We recommend setting `--evalue` to 1e-05. Note that the `--outdir`, `--prefix`, `--locustag` and `--compliant` parameters are applied by default. The reference MGCs file that we provide is also passed to `--proteins` by default.

This step may take a while; you can check the log file for details.

    $ magcluster prokka --evalue 1e-05 .

    # Move `.gbk` files to a new folder
    $ mkdir gbkfolder
    $ mv -v */*.gbk gbkfolder/
    renamed 'HCH-1_annotation/HCH-1.gbk' -> 'gbkfolder/HCH-1.gbk'
    renamed 'IT-1_annotation/IT-1.gbk' -> 'gbkfolder/IT-1.gbk'

### Step 3: MGCs screening
Now that we have `HCH-1.gbk` and `IT-1.gbk`, let's start MGCs screening!

    $ magcluster mgc_screen gbkfolder -o mgcfolder
    [22:30:40] INFO - Starting mgc_screen...
    [22:30:40] INFO - Your file is HCH-1.gbk
    [22:30:40] INFO - The minimum length of contigs to be considered is 2000
    [22:30:40] INFO - The maxmum length of contigs to be considered is 10000
    [22:30:40] INFO - The threshold of magnetosome genes in one contig is 3
    [22:30:40] INFO - The output directory is /mnt/c/Users/edith/Desktop/test/magtest/mgcfolder/
    [22:30:40] INFO - Opening your file...
    [22:30:40] INFO - Starting magnetosome genes screening...
    [22:30:40] INFO - Magnetosome gene cluster containing contigs screening completed!
    [22:30:40] INFO - Creating output folder: /mnt/c/Users/edith/Desktop/test/magtest/mgcfolder/
    [22:30:40] INFO - Writing mgc.gbk file...
    [22:30:40] INFO - Writing magpro.csv file(s)...
    [22:30:40] INFO - Starting mgc_screen...
    [22:30:40] INFO - Your file is IT-1.gbk
    [22:30:40] INFO - The minimum length of contigs to be considered is 2000
    [22:30:40] INFO - The maxmum length of contigs to be considered is 10000
    [22:30:40] INFO -

    $ ls mgcfolder
    HCH-1_magpro.csv  HCH-1_mgc.gbk  IT-1_magpro.csv  IT-1_mgc.gbk

Good job! Now you can open the `.csv` files to check the magnetosome protein sequences.
![HCH-1_magpro.csv](docs/_static/HCH-1.PNG)
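
You can also take a quick look at a `_magpro.csv` file from Python instead of a spreadsheet program. The sketch below only reports the header and the number of data rows, because the exact column layout is not documented here; the function name `summarize_magpro` is hypothetical.

```python
import csv


def summarize_magpro(csv_path):
    """Return (header, row_count) for a magpro CSV file."""
    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])           # first row, or [] if file is empty
        row_count = sum(1 for _ in reader)  # remaining data rows
    return header, row_count
```

Calling `summarize_magpro("mgcfolder/HCH-1_magpro.csv")` would then tell you which columns are present and how many magnetosome proteins were reported for HCH-1.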

### Step 4: MGCs visualization
Final step! We are almost there!

Use Clinker to generate an MGC figure. We recommend using the `-o` parameter to also write an MGC alignment file, in which you can browse the similarities between homologous genes across genomes.

Be careful! If your dataset contains complete genomes, the alignment can take a very long time. In that case, we recommend skipping the alignment with the `-na` (no alignment) parameter.
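
When this step is driven from a script, the choice between writing an alignment file and skipping the alignment can be captured in a small helper. The sketch below only assembles the argument list from the `-o`, `-p` and `-na` options mentioned in this README; the function name `build_clinker_command` and its keyword arguments are hypothetical.

```python
def build_clinker_command(gbk_files, plot="mgc_align.html",
                          alignments=None, no_align=False):
    """Assemble a `magcluster clinker` argument list."""
    cmd = ["magcluster", "clinker"]
    if no_align:
        cmd.append("-na")            # skip the alignment entirely
    elif alignments:
        cmd += ["-o", alignments]    # write alignments to this file
    cmd += ["-p", plot]              # interactive HTML output
    cmd += list(gbk_files)
    return cmd


print(build_clinker_command(["HCH-1_mgc.gbk", "IT-1_mgc.gbk"],
                            alignments="mgc_align.txt"))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; setting `no_align=True` is the variant to use when complete genomes are present.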

    $ magcluster clinker -o mgc_align.txt -p mgc_align.html mgcfolder/*.gbk
    [22:31:00] INFO - Starting clinker
    [22:31:00] INFO - Parsing files:
    [22:31:00] INFO -   HCH-1_mgc.gbk
    [22:31:00] INFO -   IT-1_mgc.gbk
    [22:31:00] INFO - Starting cluster alignments
    [22:31:00] INFO - HCH-1_mgc vs IT-1_mgc
    [22:31:16] INFO - Generating results summary...
    [22:31:16] INFO - Writing alignments to: mgc_align.txt
    [22:31:16] INFO - Building clustermap.js visualisation
    [22:31:16] INFO - Writing to: mgc_align.html
    [22:31:17] INFO - Done!


Congratulations! You should now have an interactive HTML page open in your browser. You can adjust the figure as you like and export it as an `.svg` file.

![MGCs.html](docs/_static/MGCs_figure.PNG)

## Citation
The manuscript is in preparation.

## Contact us
If you have any questions or suggestions, feel free to contact us.

jirunjia@gmail.com or weilin@mail.iggcas.ac.cn