This script phylogenetically classifies translated nifH gene (or amplicon) sequences using the Classification and Regression Tree (CART) from Frank et al., 2016.
The scripts directory includes Frank's original version for Python 2 as well as an updated version for Python 3. The updated version does not require you to specify the residue position in Azotobacter vinelandii NifH (protein) at which the multiple sequence alignment starts. Instead, the script calculates the start residue, assuming that the first sequence in the alignment is NifH from Azotobacter vinelandii (WP_012698955.1). (A warning is issued if the first sequence does not appear to be from A. vinelandii.) We recommend that you use the updated script or the automated alternative described below.
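To illustrate the idea of calculating the start residue, here is a minimal sketch. It is not the code in NifH_Clusters.py; the function name alignment_start, the min_match parameter, and the toy sequences are all hypothetical. It assumes the start can be found by removing gap characters from the first aligned sequence and locating its leading residues in the full-length reference protein.

```python
import warnings
from typing import Optional


def alignment_start(reference: str, first_aligned: str, min_match: int = 10) -> Optional[int]:
    """Return the 1-based reference residue at which the alignment starts.

    reference     : full-length reference protein (hypothetical input)
    first_aligned : first row of the alignment, possibly containing gaps
    min_match     : number of leading ungapped residues used as the probe
    Returns None (and warns) if the probe is not found in the reference.
    """
    # Drop gap characters so only residues remain.
    ungapped = first_aligned.replace("-", "").replace(".", "").upper()
    probe = ungapped[:min_match]
    pos = reference.upper().find(probe) if probe else -1
    if pos < 0:
        warnings.warn("First sequence does not appear to match the reference.")
        return None
    return pos + 1  # convert 0-based index to 1-based residue number


# Toy sequences for illustration only; these are NOT real NifH sequences.
toy_reference = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
toy_first_row = "--QRQISFVKSH-FSRQ"
print(alignment_start(toy_reference, toy_first_row))
```

With these toy inputs the probe QRQISFVKSH begins at the ninth residue of the toy reference, so the sketch reports 9.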
To use the CART classifier, Biopython must be installed.
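One way to confirm that Biopython is available before running the classifier is sketched below. The helper name biopython_available is hypothetical; the check relies only on the fact that Biopython is imported as the package Bio. The pip command in the message is one common way to install it, not a requirement of this repository.

```python
import importlib.util


def biopython_available() -> bool:
    """Return True if the Biopython package (imported as `Bio`) can be found."""
    return importlib.util.find_spec("Bio") is not None


if biopython_available():
    import Bio
    print("Biopython", Bio.__version__)
else:
    print("Biopython not found; install it, for example with: pip install biopython")
```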
You can verify the classifier with the included multiple sequence alignment:
python scripts/NifH_Clusters.py data/CART_Test_Atlantic.fasta
which outputs CART_Test_Atlantic_Clusters.fasta.
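After the test run you may want to glance at the result. The sketch below lists the first few FASTA headers in the output file using only the Python standard library. It assumes the output file is written to the current working directory, and it makes no assumption about how cluster assignments are encoded in the headers, since that format is not described here. The helper name fasta_headers is hypothetical.

```python
from pathlib import Path


def fasta_headers(path):
    """Return the header lines (without the leading '>') of a FASTA file."""
    headers = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                headers.append(line[1:].strip())
    return headers


# Assumed location: the current working directory.
output = Path("CART_Test_Atlantic_Clusters.fasta")
if output.exists():
    headers = fasta_headers(output)
    print(len(headers), "sequences")
    for header in headers[:5]:
        print(header)
else:
    print(output, "not found; run the classifier first or adjust the path")
```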
If you are working with nifH amplicons, e.g. ASVs from DADA2, consider using the version of our CART classifier that is included in nifH_amplicons_DADA2 as the ancillary script NifHClustersFrank2016. This version predicts open reading frames (using FragGeneScan), performs a multiple sequence alignment (using MAFFT), and then runs the CART classifier. All of the required external tools (including Biopython) are provided by the miniconda environment that you create when you install nifH_amplicons_DADA2.