How to convert fastq/fasta sequences to TFRecord

We provide wrapper scripts that convert fastq/fasta sequences to TFRecord, one for training and one for prediction. DeepMicrobes and the other embedding-based DNNs use a k-mer representation. The other tested architectures, including the convolutional model, the hybrid model, and seq2species, use a one-hot representation.
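To make the k-mer representation concrete, the following minimal Python sketch splits a read into overlapping k-mers and looks each one up in a vocabulary. It is an illustration only, not the code the wrapper scripts run: the function names, the toy vocabulary, and the `<unk>` fallback for unseen k-mers are all assumptions made for this example.

```python
def kmer_tokens(seq, k):
    """Return all overlapping k-mers of seq (stride 1)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def encode(seq, vocab, k, unk="<unk>"):
    """Map each k-mer to its integer ID; unseen k-mers fall back to unk."""
    return [vocab.get(kmer, vocab[unk]) for kmer in kmer_tokens(seq, k)]


# Toy vocabulary for k = 3 (hypothetical; a real vocabulary file is far larger).
toy_vocab = {"<unk>": 0, "ACG": 1, "CGT": 2, "GTA": 3}
ids = encode("ACGTAN", toy_vocab, k=3)
print(ids)
```

The last k-mer, `TAN`, is not in the toy vocabulary and therefore maps to the `<unk>` ID.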


Training

The following steps are required to process the sequences in training sets before loading them into the model:

  • Shuffle the sequences
  • Split the large dataset into multiple smaller files (so that they can be converted in parallel)
  • Convert the fasta sequences to TFRecord
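The first two steps can be pictured with a short Python sketch. It is not what `tfrec_train_kmer.sh` does internally; `read_fasta`, `shuffle_and_split`, and the fixed seed are hypothetical names and choices for this example, and the sketch holds every record in memory, which is only practical for small inputs.

```python
import random


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of fasta lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def shuffle_and_split(records, per_file, seed=0):
    """Shuffle records, then cut them into consecutive subsets of per_file."""
    records = list(records)
    random.Random(seed).shuffle(records)
    return [records[i:i + per_file] for i in range(0, len(records), per_file)]


fasta = [">a|0", "ACGT", ">b|1", "GGCC", "TTAA", ">c|0", "TTTT"]
subsets = shuffle_and_split(read_fasta(fasta), per_file=2)
for n, subset in enumerate(subsets):
    print(n, [header for header, _ in subset])
```

Each subset would then be converted to TFRecord independently, which is what makes the split useful on a multi-core machine.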

To convert fasta to TFRecord for training:

tfrec_train_kmer.sh -i train.fa -v /path/to/vocab/tokens_merged_12mers.txt -o train.tfrec -s 20480000 -k 12

Arguments:
-i Fasta file of training set
-v Absolute path to the vocabulary file (e.g., /path/to/vocab/tokens_merged_12mers.txt)
-o Output name of converted TFRecord
-s (Optional) Number of sequences per file for splitting (default: 20480000)
-k (Optional) k-mer length (default: 12)

The converted TFRecord will be stored in train.tfrec (or another name you specify) in the current directory.

NOTE:

  • The script parses category labels from sequence IDs starting with prefix|label (e.g., >this_is_prefix|0).
  • Each category needs a unique integer label. With 100 categories, for example, the labels should be the integers 0 to 99.
  • The label is taken as ground truth during training and is not required during prediction.
  • Each subset file is processed with one CPU core, so the optimal number of sequences per file depends on how many CPU cores you have and on the total number of reads.
  • The vocabulary file and the k-mer length must match (for example, a 12-mer vocabulary with -k 12).
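Before converting a training set, it can help to confirm that every header carries a usable label. The sketch below is a hypothetical pre-flight check written for this guide, not part of the wrapper scripts; it assumes the label is the second `|`-separated field of the sequence ID, as in `>this_is_prefix|0`.

```python
def parse_label(header):
    """Extract the integer label from a header such as '>this_is_prefix|0'."""
    fields = header.lstrip(">").split()       # drop '>' and any description
    parts = fields[0].split("|") if fields else []
    if len(parts) < 2 or not parts[1].isdigit():
        raise ValueError(f"no integer label in header: {header!r}")
    return int(parts[1])


def check_labels(labels, num_classes):
    """Return (labels outside 0..num_classes-1, labels that never occur)."""
    seen = set(labels)
    out_of_range = sorted(l for l in seen if not 0 <= l < num_classes)
    missing = sorted(set(range(num_classes)) - seen)
    return out_of_range, missing


headers = [">this_is_prefix|0", ">genomeB|1 read_17", ">genomeC|2"]
labels = [parse_label(h) for h in headers]
print(labels, check_labels(labels, num_classes=3))
```

Two empty lists from `check_labels` mean that every label is in range and every category appears at least once.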

Prediction

The shell script below takes as input paired-end reads, though both paired-end and single-end modes are supported by DeepMicrobes. We recommend running DeepMicrobes in paired-end mode, which provides more accurate predictions than single-end mode.

The following steps are required to process the sequences in test sets before loading them into the model:

  • Interleave paired-end reads
  • Split the large dataset into multiple smaller files (so that they can be converted in parallel)
  • Convert the fastq/fasta sequences to TFRecord
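Interleaving means alternating the forward and reverse read of each pair in a single stream. The Python sketch below shows the idea on in-memory records; the `interleave` helper and the read names are made up for the example, and the wrapper script may implement this step differently.

```python
from itertools import zip_longest


def interleave(forward, reverse):
    """Alternate records from two read lists: R1[0], R2[0], R1[1], R2[1], ..."""
    out = []
    for fwd, rev in zip_longest(forward, reverse):
        if fwd is None or rev is None:
            raise ValueError("forward and reverse files differ in length")
        out.extend([fwd, rev])
    return out


r1 = [("read1/1", "ACGT"), ("read2/1", "GGCC")]
r2 = [("read1/2", "TTAA"), ("read2/2", "CCGG")]
for name, seq in interleave(r1, r2):
    print(f">{name}\n{seq}")
```

The length check guards against forward and reverse files that have fallen out of sync, for instance after filtering only one of them.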

To convert fastq/fasta to TFRecord for prediction:

tfrec_predict_kmer.sh -f sample_R1.fastq -r sample_R2.fastq -t fastq -v /path/to/vocab/tokens_merged_12mers.txt -o sample_name -s 4000000 -k 12

Arguments:
-f Fastq/fasta file of forward reads
-r Fastq/fasta file of reverse reads
-v Absolute path to the vocabulary file (e.g., /path/to/vocab/tokens_merged_12mers.txt)
-o Output name prefix
-s (Optional) Number of sequences per file for splitting (default: 4000000)
-k (Optional) k-mer length (default: 12)
-t (Optional) Sequence type fastq/fasta (default: fastq)

The converted TFRecord will be stored in sample.tfrec (or another name you specify) in the current directory.

NOTE:

  • Each subset file is processed with one CPU core, so the optimal number of sequences per file depends on how many CPU cores you have and on the total number of reads.
  • The vocabulary file and the k-mer length must match (for example, a 12-mer vocabulary with -k 12).
  • The number of sequences per file must be a multiple of 4 (complementary reads of R1 and R2).
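One way to pick a value for `-s` is to aim for one subset file per CPU core and round up to a multiple of 4. The helper below is a rough Python sketch of that heuristic, which is an assumption drawn from the notes above rather than a rule stated by the scripts; `sequences_per_file` is a hypothetical name, and `total_sequences` should be counted in whatever unit `-s` uses.

```python
import math


def sequences_per_file(total_sequences, cpu_cores, multiple=4):
    """Smallest multiple of `multiple` that yields at most cpu_cores files."""
    if total_sequences <= 0 or cpu_cores <= 0:
        raise ValueError("total_sequences and cpu_cores must be positive")
    per_file = math.ceil(total_sequences / cpu_cores)
    return math.ceil(per_file / multiple) * multiple


print(sequences_per_file(10_000_000, cpu_cores=8))
print(sequences_per_file(1001, cpu_cores=4))
```

Rounding up can only reduce the number of files, so the result never produces more subsets than there are cores.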

One-hot encoding

We also provide wrapper scripts for one-hot encoding, for users who would like to experiment with the other tested DNNs.
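For readers unfamiliar with one-hot encoding, the Python sketch below turns each base into a 4-element row with a single 1. The A, C, G, T column order and the all-zero row for ambiguous bases such as N are assumptions chosen for illustration; the wrapper scripts may use a different convention.

```python
ALPHABET = "ACGT"  # assumed column order


def one_hot(seq):
    """Encode a DNA sequence as a list of 4-element rows in A, C, G, T order."""
    rows = []
    for base in seq.upper():
        row = [0, 0, 0, 0]
        idx = ALPHABET.find(base)
        if idx >= 0:          # ambiguous bases such as N stay all-zero
            row[idx] = 1
        rows.append(row)
    return rows


for base, row in zip("ACGTN", one_hot("ACGTN")):
    print(base, row)
```

Unlike the k-mer representation, this encoding needs no vocabulary file, which is why the one-hot scripts below take no `-v` or `-k` argument.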

Training set (one-hot)

tfrec_train_onehot.sh -i train.fa -o train.tfrec -s 20480000 

Arguments:
-i Fasta file of training set
-o Output name of converted TFRecord
-s (Optional) Number of sequences per file for splitting (default: 20480000)

Test set (one-hot)

tfrec_predict_onehot.sh -f sample_R1.fastq -r sample_R2.fastq -t fastq -o sample_name -s 4000000 

Arguments:
-f Fastq/fasta file of forward reads
-r Fastq/fasta file of reverse reads
-o Output name prefix
-s (Optional) Number of sequences per file for splitting (default: 4000000)
-t (Optional) Sequence type fastq/fasta (default: fastq)