## Anvi'O 7.1
Anvi'O is a powerful piece of metagenomics software that occupies an unusual place in the workflow of this project: it is both a binning tool and a visualization platform. I will spell out what goes where, but be aware that of all the software presented here, Anvi'O is quite likely the most powerful and also has the steepest learning curve. It assumes a great deal of prior knowledge, so keep that in mind. If you want to read up on the software, most of its functionality is documented on the project website:

 https://anvio.org/

For now, know that the Anvi'O workflow consists of three separate pieces:

* Contig profiling: this creates a database of your contigs and does part of the binning for you
* Single profiling: this creates a database of your reads matched with your contigs
* Processing: this is where Anvi'O becomes powerful. You can collect all the data from previous workflows and start defining your findings in Anvi'O (taxonomy, representation, etc.)

This last step is clearly more oriented towards results interpretation, while the first two are more geared towards data generation. The workflow file that is on this GitHub should show you clearly how to work this data!

## Contig profiling
Contig profiling creates a database of your contigs. It calculates k-mer frequencies for your contigs (the default k is 4, which you can change with the --kmer-size parameter; DON'T unless you have a good reason), soft-splits long contigs, and identifies open reading frames (which can be skipped using --skip-gene-calling). The code cell at the end of this section generates your database.

Subsequently, you can add various layers of analysis to your contigs database. The following options are available:
* Augustus + Prodigal gene calls: adds open reading frames to your dataset from the genes from the Augustus database (eukaryotes) and Prodigal (bacteria + archaea) WORKING ON THIS: NOT SURE IF IT WORKS

* Hidden Markov Model (HMM): A widely used prediction model in bioinformatics software, which can offer great advantages in homology detection. 

* NCBI's Cluster of Orthologous genes (NCBI COG): this allows you to annotate your database with gene functions from NCBI COG. Current version: 2020

* KoFAM Metabolism calls: Uses the KEGG database to call metabolic genes and estimate paths of your community. Currently used KEGG version is KEGG_build_2020-12-23. 

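* Kaiju Taxonomy calls: uses Kaiju to add taxonomic calls to the genes in your contigs database (see the Jaeger module below).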


Each of these adds a new layer of information to your dataset, so they may be very interesting to explore.
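To make the k-mer step of contig profiling a little more concrete, here is a minimal sketch that counts overlapping 4-mers in a single toy sequence. The sequence and variable names are made up, and this is only an illustration of the idea; it is not how Anvi'O computes or stores its own frequency table.

```shell
# Sketch: count overlapping k-mers (k = 4) in one made-up sequence with awk.
toy_seq=ACGTACGTAC
k=4

kmer_counts=$(echo "$toy_seq" | awk -v k="$k" '{
    # slide a window of length k along the sequence and tally each k-mer
    for (i = 1; i <= length($0) - k + 1; i++) count[substr($0, i, k)]++
} END {
    # print one line per distinct k-mer: k-mer<TAB>count
    for (kmer in count) print kmer "\t" count[kmer]
}' | sort)

echo "$kmer_counts"
```

Contigs from the same genome tend to have similar k-mer profiles, which is why this kind of table is useful for binning.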


In [None]:
#set parameters (no leading $ when assigning variables in bash):
SAMPLENAME=samplename
CONTIGPATH=/path/to/contigs
DATABASE=databasename
COGPATH=/path/to/NCBI-COG/database

#stop early if the input directories do not exist:
[ -d "$CONTIGPATH" ] || { echo 'Invalid path to contigs, exiting.'; exit 1; }
[ -d "$COGPATH" ] || { echo 'Invalid path to NCBI COG database, exiting.'; exit 1; }

#generate the contigs database:
anvi-gen-contigs-database -f "$CONTIGPATH"/co-assembly1.contigs.fa -o ../data/working/"$SAMPLENAME".contigs.db -n "$DATABASE" || { echo 'Unable to create contigs database, exiting.'; exit 1; }

#integrate HMMs into the database:
anvi-run-hmms -c ../data/working/"$SAMPLENAME".contigs.db || echo 'WARNING: Unable to add HMMs to contigs database, continuing'

#this runs NCBI COGs against your contigs.db, integrating gene functions:
anvi-run-ncbi-cogs -c ../data/working/"$SAMPLENAME".contigs.db --cog-data-dir "$COGPATH" || echo 'WARNING: Unable to run NCBI COGs, continuing'

#add KEGG KoFam annotations (-T sets the number of threads Anvi'O is allowed to use):
anvi-run-kegg-kofams -c ../data/working/"$SAMPLENAME".contigs.db \
                     -T 4

#write the contig stats to a markdown file:
anvi-display-contigs-stats ../data/working/"$SAMPLENAME".contigs.db --report-as-text --as-markdown -o ../data/results/"$SAMPLENAME"_contigstats.md

After running the HMMs, you can look at your contig stats. This is the first part of Anvi'O that is interactive, so some care is needed. If you are running this workflow on a computing cluster, you are unlikely to be able to use the interactive mode (most computing clusters are not set up for graphical output). If you want the full Anvi'O experience, you can download the database to your local machine, run Anvi'O on it as normal, and get the interactive output. For a full explanation of the results, consult this page: 

https://anvio.org/help/main/programs/anvi-display-contigs-stats/

If you choose to remain in the cluster environment, you can get the stats as .txt or .md output and see what's in there. 

In [5]:
# Alternatively, you could download your contigs database onto your local computer and run it as follows
# (SAMPLENAME is the same sample name you set in the cell above):
anvi-display-contigs-stats "$SAMPLENAME".contigs.db

#### Jaeger module
The Jaeger module uses Kaiju to perform taxonomic calls on the gene calls in your contigs database. It is important to realise that these calls are not overly reliable. They may, however, guide your hand in binning somewhat and make things clearer further down the line. You first have to select and create your Kaiju database, as described in the Kaiju documentation: https://github.com/bioinformatics-centre/kaiju. You can then run the Kaiju taxonomy calls and subsequently import them into Anvi'O.

In [None]:
anvi-get-sequences-for-gene-calls -c CONTIGS.db -o gene_calls.fa

kaiju -t /path/to/nodes.dmp \
      -f /path/to/kaiju_db.fmi \
      -i gene_calls.fa \
      -o gene_calls_nr.out \
      -z 16 \
      -v

addTaxonNames -t /path/to/nodes.dmp \
              -n /path/to/names.dmp \
              -i gene_calls_nr.out \
              -o gene_calls_nr.names \
              -r superkingdom,phylum,order,class,family,genus,species

#use the same contigs database file as in the first command (file names are case-sensitive):
anvi-import-taxonomy-for-genes -i gene_calls_nr.names \
                               -c CONTIGS.db \
                               -p kaiju \
                               --just-do-it


#### KoFAM KEGG estimate metabolism
This piece of code first sets up the KEGG database and then calls genes with KoFAM and KEGG. After that, you can execute metabolism calls. For more information, look at https://merenlab.org/tutorials/fmt-mag-metabolism/

In [None]:
anvi-setup-kegg-kofams --kegg-data-dir path/to/kegg #you only need the last bit if you do not have write access to the Anvi'O directory

In [None]:
anvi-estimate-metabolism -c CONTIGS.db -p PROFILE.db -C DAS_collection

## Single profiling
Unlike the contigs database, the profile that is about to be generated contains information about your contigs that is based on the results of your mapping step. Each profile database links to a contigs database. It's important to make sure that all the profiles you generate use the same parameters, since you're quite likely to *merge* them later. For more information on profiling:

https://anvio.org/help/main/programs/anvi-profile/

To work any further with Anvi'O, you need to merge your profiles into a single profile. There is more to this step, but that is all very neatly explained here:

https://anvio.org/help/main/programs/anvi-merge/

Anvi'O no longer bins results on its own. You can ask it to, but it will use the same algorithms as above, so you might as well run the binning yourself, as that lets you oversee your results a little better. You need to provide your binning results as a tab-delimited text file in which each contig name is assigned a bin (contigs without a bin can be left out). The binning_refiner script already contains a few lines of code that create a text file that should work with the contigs.db you generated earlier.
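As an illustration of the collection file format described above, the following sketch builds a two-column, tab-delimited file (contig name, bin name) from a directory of per-bin FASTA files. The function name `make_collection`, the directory layout, and the `.fa` extension are all assumptions made for this example, and the replacement of unusual characters in bin names with underscores is a precaution on my part rather than something taken from this workflow. It is not the code from the binning_refiner script.

```shell
# Sketch: build a collection file (contig<TAB>bin) from per-bin FASTA files.
# All paths, file names and the function name are hypothetical.
make_collection() {
    local bin_dir="$1" out_file="$2"
    local fasta bin
    : > "$out_file"                      # start with an empty output file
    for fasta in "$bin_dir"/*.fa; do
        [ -e "$fasta" ] || continue      # skip if the glob matched nothing
        # derive the bin name from the file name and replace every character
        # that is not a letter, digit or underscore with an underscore
        bin=$(basename "$fasta" .fa | tr -c 'A-Za-z0-9_\n' '_')
        # each FASTA header (">contig_name ...") becomes one line: contig<TAB>bin
        awk -v bin="$bin" '/^>/ { sub(/^>/, "", $1); print $1 "\t" bin }' "$fasta" >> "$out_file"
    done
}

# Toy demonstration with two made-up bins in a temporary directory
demo_dir=$(mktemp -d)
printf '>contig_1\nACGT\n>contig_2\nGGCC\n' > "$demo_dir/bin.1.fa"
printf '>contig_3\nTTAA\n' > "$demo_dir/bin.2.fa"
make_collection "$demo_dir" "$demo_dir/collection.txt"
cat "$demo_dir/collection.txt"
```

The contig names in this file have to match the contig names in your contigs database exactly, so build it from the same assembly you used for contig profiling.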


In [None]:
#set parameters:
MAPPINGPATH=/path/to/mapping/files
CONTIGSDBPATH=/path/to/contigs.db
SAMPLENAME=samplename
COLLECTION=collection_name
COLLECTIONPATH=/path/to/collection/txt

[ -d "$CONTIGSDBPATH" ] || { echo 'Invalid path to contigs database, exiting.'; exit 1; }
[ -d "$MAPPINGPATH" ] || { echo 'Invalid path to mapping files, exiting.'; exit 1; }

#start from a clean profiles directory (done once, before the loop, so earlier samples are not deleted):
rm -rf ../data/working/"$SAMPLENAME"_profiles
mkdir -p ../data/working/"$SAMPLENAME"_profiles

#this runs the single profiling step (replace sample1 ... sample4 with your own sample names):
for f in sample1 sample2 sample3 sample4
do
anvi-profile -i "$MAPPINGPATH"/"$f".bam -c "$CONTIGSDBPATH"/"$SAMPLENAME".contigs.db \
--min-contig-length 1000 \
--output-dir ../data/working/"$SAMPLENAME"_profiles/"$f" \
--sample-name "$f"_singleprofile
done

#needs a check whether all samples are there

#merge the profiles into a single profile for your samples
anvi-merge ../data/working/"$SAMPLENAME"_profiles/*/PROFILE.db \
-o ../data/results/anvi-profile_"$SAMPLENAME"/ \
-c "$CONTIGSDBPATH"/"$SAMPLENAME".contigs.db \
-S "$SAMPLENAME"

#import your binning results from DAS_tool (anvi-merge writes the merged profile as PROFILE.db)
anvi-import-collection "$SAMPLENAME"_collection.txt \
-p ../data/results/anvi-profile_"$SAMPLENAME"/PROFILE.db \
-c "$CONTIGSDBPATH"/"$SAMPLENAME".contigs.db \
--contigs-mode \
-C "$COLLECTION"
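The cell above still lacks a check that every sample was actually profiled before merging. A minimal sketch of such a check could look like the following. The function name `check_profiles` and the sample names are hypothetical, and the only thing it relies on is that each sample's output directory contains a PROFILE.db, as the anvi-merge call above already assumes.

```shell
# Sketch: confirm that every expected sample has a PROFILE.db before merging.
# The function name, directory and sample names are placeholders.
check_profiles() {
    local profiles_dir="$1"; shift
    local missing=0 sample
    for sample in "$@"; do
        if [ ! -f "$profiles_dir/$sample/PROFILE.db" ]; then
            echo "Missing profile for sample: $sample" >&2
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]                 # succeed only if nothing is missing
}

# Toy demonstration: two samples were profiled, a third one is absent
demo_profiles=$(mktemp -d)
mkdir -p "$demo_profiles/sample1" "$demo_profiles/sample2"
touch "$demo_profiles/sample1/PROFILE.db" "$demo_profiles/sample2/PROFILE.db"

if check_profiles "$demo_profiles" sample1 sample2 sample3; then
    echo "All profiles present, safe to run anvi-merge"
else
    echo "Some profiles are missing, fix them before merging"
fi
```

In the real workflow you would call the function with the profiles directory and the same sample names used in the profiling loop, and only run anvi-merge when it succeeds.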

##### Taxonomy estimates
Assessing the taxonomy of your bins can be very helpful in the long run, when you combine manual binning with your automated binning. There are two ways of calling taxonomy on your samples: using Anvi'O SCG taxonomy and using Kaiju. I have done both, since they take slightly different approaches. Additionally, while SCG taxonomy is great, Anvi'O only uses bacterial and archaeal genomes for taxonomy calling, leaving a host of important organisms out. Kaiju can help you later in assessing the quality of your bins and seeing what belongs where. Use the following links to get a little streetwise in this step:

Kaiju: https://github.com/bioinformatics-centre/kaiju

Anvi'O taxonomy: https://merenlab.org/2019/10/08/anvio-scg-taxonomy/

Combining Anvi'O with Kaiju: https://merenlab.org/2016/06/18/importing-taxonomy/

In [None]:
#this runs the Anvi'O SCG script:
anvi-run-scg-taxonomy -c ../data/working/contigs.db
#and this allows you to integrate said information with your profiles and your bins
anvi-estimate-scg-taxonomy -c ../data/working/contigs.db \
                           -p ../data/working/<name>/PROFILE.db \
                           -C <Collection name>

### Data interpretation


Before you run the interactive interface, you will want to integrate some information into your collection, or "populate" it with information. The following script uses several things already included in your databases to do exactly that. Some of these lines will create separate files which will help with the interpretation of the data, while some will show up in the interface. 

In [None]:
anvi-estimate-metabolism -c CONTIGS.db -p PROFILE.db -C DAS_collection

One of the more powerful features of Anvi'O is the interactive interface. To use it, you'll probably need to download your contigs database and your merged profile to your local machine. Alternatively, you can serve the interface from the computing cluster you have been working on, which is super useful if you just want to take a quick peek!

https://merenlab.org/2015/11/28/visualizing-from-a-server/
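As a rough sketch of what serving the interface from a cluster involves, the snippet below only prints the two commands you would run; it does not execute them. The user name, host name and database file names are placeholders, and the port is an assumption that you may need to change if it is already taken on your server. The `--server-only` and `-P` options belong to anvi-interactive.

```shell
# Sketch: print the commands for viewing the interface through an SSH tunnel.
# REMOTE_USER, REMOTE_HOST and the database names are placeholders.
REMOTE_USER=username
REMOTE_HOST=cluster.example.org
PORT=8080

# 1) on your local machine: forward local port $PORT to the same port on the server
tunnel_cmd="ssh -L ${PORT}:localhost:${PORT} ${REMOTE_USER}@${REMOTE_HOST}"

# 2) on the server, inside that SSH session: serve the interface without opening a browser
serve_cmd="anvi-interactive -p PROFILE.db -c CONTIGS.db --server-only -P ${PORT}"

echo "$tunnel_cmd"
echo "$serve_cmd"
```

With both commands running, the interface should be reachable in your local browser on the forwarded port.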


In [None]:
anvi-interactive -p profile-db \
                 -c contigs-db \
                 -C collection #run this if you specifically want to run your bins in the interface

Anvi'O will automatically try to open your browser at this point (but not if you are running it from the server). If your browser doesn't pop up, try entering this address into your browser:

http://localhost:8080

This should show you the results! You can kill the session at any moment by pressing CTRL+C in the command line. This tutorial shows you some of the power of the interactive interface:
https://merenlab.org/tutorials/interactive-interface/

In [None]:
#this line should allow you to add metadata to your samples (see below for the file format).
#note: a comment cannot follow a line-continuation backslash, so it lives up here:
anvi-import-misc-data ../data/working/metadata.txt \
                         --target-data-table layers \
                         --pan-or-profile-db ../data/working/<samplename>.profile.db

Using this piece of code, you can add metadata to your Anvi'O display. Super useful, but keep in mind that the metadata file needs to adhere to some principles: 

* It is a tab-delimited text file containing information about the samples you're displaying
* The first column of each row should match the name of a sample
* The following columns can contain all sorts of information
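
To illustrate the principles listed above, here is a small sketch that writes a toy metadata file and checks that every row has the same number of tab-separated columns as the header. The sample names, column names and values are all made up, the header label of the first column is my assumption, and `check_metadata` is a hypothetical helper, not part of Anvi'O.

```shell
# Sketch: write a toy tab-delimited metadata file and check its shape.
# All names and values below are invented for illustration.
metadata=$(mktemp)
printf 'samples\tdepth_m\tsite\n'            >  "$metadata"
printf 'sample1_singleprofile\t5\treef_A\n'  >> "$metadata"
printf 'sample2_singleprofile\t20\treef_B\n' >> "$metadata"

# Every row must have as many tab-separated fields as the header line
check_metadata() {
    awk -F '\t' '
        NR == 1    { ncol = NF; next }
        NF != ncol { printf "Line %d has %d columns, expected %d\n", NR, NF, ncol; bad = 1 }
        END        { exit bad }
    ' "$1"
}

check_metadata "$metadata" && echo "metadata file looks consistent"
```

The names in the first column should be the sample names stored in your profile, which in the profiling cell of this workflow are the sample names with the _singleprofile suffix.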

https://anvio.org/help/main/programs/anvi-import-misc-data/