This tutorial shows how to train the DNN model of DeepMicrobes from scratch.
The training set can be composed of any sequences as long as you have a ground truth category label for each sequence. To train a species classifier for metagenomes from a collection of microbial genomes, you first need to simulate sequencing reads from these genomes.
For example, to prepare the training set for the gut model from the DeepMicrobes paper (under preparation):

- Download the genomes in the complete bacterial repertoire of the human gut microbiota from this FTP site.
- Assign a category label to each species.
- Simulate reads from each genome with the ART simulator.
- Trim the reads to variable lengths with the custom script `random_trim.py`.
- Shuffle all the reads and convert them to TFRecord.

See below for details. `random_trim.py` and `fna_label.py` can be found in `scripts`.
Each category should be given a ground truth label, which is an integer between 0 and `num_classes - 1`. For example, if we have 100 categories, we should assign a unique integer label between 0 and 99 to each category.
Please provide a label file for `fna_label.py`. The script adds a label prefix to each sequence in a FASTA genome file. These labels are carried over to the sequence identifiers of reads simulated with the ART simulator.
The label file is a tab-delimited file which looks like:
```
genome_00.fna	0
genome_01.fna	1
genome_02.fna	2
genome_03.fna	3
genome_04.fna	4
...
genome_99.fna	99
```
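Such a label file can be generated with a few lines of Python (a sketch; the sorted filename order and the `write_label_file` helper are illustrative, not part of DeepMicrobes):

```python
import os

def write_label_file(genome_dir, label_path):
    """Assign integer labels 0..num_classes-1 to genome FASTA files
    and write the tab-delimited mapping expected by fna_label.py."""
    # Sorting gives a reproducible name -> label assignment
    genomes = sorted(f for f in os.listdir(genome_dir) if f.endswith(".fna"))
    with open(label_path, "w") as out:
        for label, name in enumerate(genomes):
            out.write(f"{name}\t{label}\n")
    return len(genomes)
```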
The script `fna_label.py` writes all the labeled FASTA genome files to a user-specified directory.

```
fna_label.py -m /path/to/label_file.txt -o output_dir
```

Arguments:
- `-m` Tab-delimited file mapping names of the genome files (full paths allowed) to integer labels
- `-o` Output directory
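Conceptually, the labeling step rewrites every FASTA header so the label travels with the sequence. A minimal sketch of the idea (the exact `label|<int>|` prefix format is an assumption here; check `fna_label.py` for the real convention):

```python
def prefix_labels(fasta_lines, label):
    """Prepend an integer label to every FASTA header line.
    The 'label|<int>|' prefix format is an assumption for illustration."""
    out = []
    for line in fasta_lines:
        if line.startswith(">"):
            out.append(f">label|{label}|{line[1:]}")
        else:
            out.append(line)
    return out
```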
The labeled genomes we used to train the species and genus model of DeepMicrobes are available here.
We recommend generating equal proportions of reads for each category. Next-generation sequencing read simulators generally produce fixed-length reads.
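If your simulator is driven by fold coverage rather than a read count (e.g. `art_illumina -f`), the coverage that yields a target number of single-end reads per genome can be derived as below (a sketch; the helper name and the example numbers are illustrative):

```python
def fold_coverage(n_reads, read_len, genome_len):
    """Fold coverage that yields roughly n_reads single-end reads
    of read_len bases from a genome of genome_len bases."""
    return n_reads * read_len / genome_len

# e.g. 1,000,000 reads of 150 bp from a 3 Mb genome needs ~50x coverage
```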
To trim the simulated reads to variable lengths:
```
random_trim.py -i input_fastq -o output_fasta -f fastq -l 150 -min 0 -max 75
```

Arguments:
- `-i` Input fastq/fasta sequences (fixed-length)
- `-o` Output fasta sequences (variable-length)
- `-f` Input file type (fastq/fasta)
- `-l` Length of the input sequences
- `-min` Minimum number of trimmed bases
- `-max` Maximum number of trimmed bases

The example command line above trims the 150 bp reads from the 3' end down to 75-150 bp.
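The trimming itself amounts to cutting a random number of bases off the 3' end of each read. A sketch of the behavior implied by the flags above (`random_trim.py` itself may differ in details):

```python
import random

def random_trim(seq, min_trim=0, max_trim=75):
    """Remove a random number of bases (min_trim..max_trim, inclusive)
    from the 3' end of a read."""
    n = random.randint(min_trim, max_trim)
    return seq[:len(seq) - n]
```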
For shuffling the reads and converting them to TFRecord, please refer to the TFRecord tutorial.
To train a DNN model for DeepMicrobes:
```
DeepMicrobes.py --input_tfrec=train.tfrec --model_name=attention --model_dir=/path/to/weights
```

Arguments:
- `input_tfrec` TFRecord containing sequences and their labels
- `model_name` Model architecture (must be specified)
- `model_dir` Directory in which trained weights are saved
- `batch_size` Number of sequences in one batch [32]
- `num_classes` Number of classes [2505]
- `kmer` K-mer length [12]
- `keep_prob` Keep probability for dropout [1.0]
- `vocab_size` Number of k-mers in the vocabulary file plus one [8390658]
- `cpus` Number of parallel calls for input preparation [8]
- `train_epochs` Number of epochs used to train [1]
- `lr_decay` Learning rate decay [0.05]
- `lr` Learning rate [0.001]
To get a full list of training options for `DeepMicrobes.py`:

```
DeepMicrobes.py --helpfull
```
Note:
- Recommended batch size for training on thousands of species is 2048 or 4096. Try a lower value when training on fewer classes.
- `vocab_size` should match `kmer`. See the table below for the `vocab_size` of each provided vocabulary file.
| vocabulary filename | vocab_size |
|---|---|
| tokens_merged_12mers.txt | 8390658 |
| tokens_merged_11mers.txt | 2097154 |
| tokens_merged_10mers.txt | 524802 |
| tokens_merged_9mers.txt | 131074 |
| tokens_merged_8mers.txt | 32898 |
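These `vocab_size` values are consistent with counting canonical k-mers (a k-mer and its reverse complement map to one token) plus two extra entries; a quick sanity check (the interpretation of the two extra slots as special tokens is an assumption):

```python
def canonical_kmer_count(k):
    """Number of distinct canonical DNA k-mers: reverse-complement pairs
    collapse to one token, and palindromic k-mers (even k only) are
    their own canonical form."""
    palindromes = 4 ** (k // 2) if k % 2 == 0 else 0
    return (4 ** k + palindromes) // 2

# Every table entry equals the canonical count plus two
for k, vocab in [(12, 8390658), (11, 2097154), (10, 524802),
                 (9, 131074), (8, 32898)]:
    assert canonical_kmer_count(k) + 2 == vocab
```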