
4. Detection of non‐adenosines

nemitheasura edited this page Nov 23, 2023 · 1 revision

Classification of reads using wrapper function

check_tails() is the main function for classifying sequencing reads based on the presence or absence of non-adenosine residues within their poly(A) tails. It also applies additional conditions, such as a minimal read length and the qc_tag assigned by the Nanopolish polya function.

Below is an example of how to use the check_tails() function:

results <- ninetails::check_tails(
  nanopolish = system.file('extdata', 
                           'test_data', 
                           'nanopolish_output.tsv', 
                           package = 'ninetails'),
  sequencing_summary = system.file('extdata', 
                                   'test_data', 
                                   'sequencing_summary.txt', 
                                   package = 'ninetails'),
  workspace = system.file('extdata', 
                          'test_data', 
                          'basecalled_fast5', 
                          package = 'ninetails'),
  num_cores = 2,
  basecall_group = 'Basecall_1D_000',
  pass_only = TRUE,
  save_dir = '~/Downloads')

This function returns a list consisting of two tables: read_classes and nonadenosine_residues. It also saves the results to text files and creates a log file in the user-specified directory (save_dir).

Classification of reads using standalone functions

The Ninetails pipeline can also be run without the wrapper, one function at a time. This may be useful when the input files are large or when you would like to plot some of the intermediate matrices.

The first function in the processing pipeline is create_tail_feature_list(). It extracts the read data from the provided outputs and merges them based on read identifiers (readnames). It is called as follows:

tfl <- ninetails::create_tail_feature_list(
  nanopolish = system.file('extdata',
                           'test_data', 
                           'nanopolish_output.tsv', 
                           package = 'ninetails'),
  sequencing_summary = system.file('extdata', 
                                   'test_data', 
                                   'sequencing_summary.txt',
                                   package = 'ninetails'),
  workspace = system.file('extdata', 
                          'test_data', 
                          'basecalled_fast5', 
                          package = 'ninetails'), 
  num_cores = 2,
  basecall_group = 'Basecall_1D_000', 
  pass_only = TRUE)

The second function, create_tail_chunk_list(), segments the reads and produces a list of segments in which a change of state (move = 1) has been recorded together with a significant local signal anomaly (a so-called "pseudomove"), possibly indicating the presence of a non-adenosine residue.

tcl <- ninetails::create_tail_chunk_list(tail_feature_list = tfl, 
                                         num_cores = 2)

The list of segments should then be passed to create_gaf_list(), which transforms the signals into Gramian angular fields (GAFs). The function outputs a list of arrays with dimensions (100, 100, 2). The first channel of each array holds the Gramian angular summation field (GASF), and the second channel holds the Gramian angular difference field (GADF).

gl <- ninetails::create_gaf_list(tail_chunk_list = tcl, 
                                 num_cores = 2)
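The GASF/GADF construction can be illustrated with a short base-R sketch. This is not the ninetails implementation: gaf_sketch() is a hypothetical helper, the toy signal is simulated, and the min-max rescaling to [-1, 1] is an assumption (the package may rescale or interpolate the signal differently). It follows the standard Gramian angular field definition, where each rescaled value x is mapped to an angle phi = acos(x).

```r
# Hypothetical helper: builds a two-channel GAF array from a numeric signal.
gaf_sketch <- function(signal) {
  rng <- range(signal)
  stopifnot(rng[2] > rng[1])                  # a constant signal cannot be rescaled
  # Assumption: min-max rescaling to [-1, 1] so that acos() is defined.
  x <- 2 * (signal - rng[1]) / (rng[2] - rng[1]) - 1
  x <- pmin(pmax(x, -1), 1)                   # guard against floating-point overshoot
  phi <- acos(x)                              # polar-angle encoding
  gasf <- cos(outer(phi, phi, "+"))           # summation field: cos(phi_i + phi_j)
  gadf <- sin(outer(phi, phi, "-"))           # difference field: sin(phi_i - phi_j)
  array(c(gasf, gadf), dim = c(length(x), length(x), 2))
}

set.seed(1)
toy_signal <- cumsum(rnorm(100))              # simulated 100-point segment, not real data
gaf <- gaf_sketch(toy_signal)
dim(gaf)
```

With a 100-point input the result has the same (100, 100, 2) shape as the arrays described above, with the GASF in the first channel and the GADF in the second.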

The penultimate function, predict_gaf_classes(), launches the neural network to classify the input data. This function uses the tensorflow backend.

pl <- ninetails::predict_gaf_classes(gl)

The last function, create_outputs(), produces the final output: a list composed of two data frames, read_classes (reads are labelled as "decorated", "blank" or "unclassified" based on the applied criteria) and nonadenosine_residues (detailed positional information on the detected non-adenosine residues). Note that in this form the function does not automatically save data to files.

out <- ninetails::create_outputs(
  tail_feature_list = tfl,
  tail_chunk_list = tcl,
  nanopolish = system.file('extdata', 
                           'test_data', 
                           'nanopolish_output.tsv', 
                           package = 'ninetails'),
  predicted_list = pl,
  num_cores = 2,
  pass_only = TRUE)
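Because create_outputs() does not write anything to disk, the tables have to be saved manually if they are needed later. The sketch below shows one way to do this with base R. The `out` object here is a small mock list with invented values standing in for the real return value, and `save_dir` and the file names are hypothetical choices.

```r
# Mock stand-in for the list returned by create_outputs(); all values are invented.
out <- list(
  read_classes = data.frame(
    readname = c("00000000-0000-0000-0000-000000000001",
                 "00000000-0000-0000-0000-000000000002"),
    class    = c("decorated", "blank"),
    comments = c("YAY", "MAU")
  ),
  nonadenosine_residues = data.frame(
    readname     = "00000000-0000-0000-0000-000000000001",
    prediction   = "G",
    est_nonA_pos = 12
  )
)

# Hypothetical output directory; replace with a location of your choice.
save_dir <- file.path(tempdir(), "ninetails_out")
dir.create(save_dir, showWarnings = FALSE, recursive = TRUE)

# Write each data frame in the list to its own tab-separated file.
for (table_name in names(out)) {
  utils::write.table(out[[table_name]],
                     file = file.path(save_dir, paste0(table_name, ".tsv")),
                     sep = "\t", quote = FALSE, row.names = FALSE)
}
```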

Output explanation

read_classes dataframe:

| column name | content |
| --- | --- |
| readname | an identifier of a given read (36 characters) |
| contig | reference to which the given read was mapped (inherited from nanopolish) |
| polya_length | tail length estimation provided by nanopolish polya function |
| qc_tag | quality tag assigned by nanopolish polya function |
| class | the crude result of classification |
| comments | a code indicating whether the classification criteria were met/unmet |

The class column indicates whether the given read was recognized as decorated (containing a non-adenosine residue) or not. The comments column contains a code explaining the classification outcome. The content of these columns is explained below:

| class | comments | explanation |
| --- | --- | --- |
| decorated | YAY | move transition present, nonA residue detected |
| blank | MAU | move transition absent, nonA residue undetected |
| blank | MPU | move transition present, nonA residue undetected |
| unclassified | QCF | nanopolish qc failed |
| unclassified | NIN | not included in the analysis (pass_only = TRUE) |
| unclassified | IRL | insufficient read length |
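The class and comments codes can be summarised with a few lines of base R. The following is a minimal sketch on a mock read_classes data frame with invented rows (one per code in the table above). The `comment_key` lookup vector and the `decorated_fraction` summary are illustrative additions, not part of the package.

```r
# Mock read_classes table; readnames and rows are invented for illustration.
read_classes <- data.frame(
  readname = sprintf("00000000-0000-0000-0000-%012d", 1:6),
  class    = c("decorated", "blank", "blank",
               "unclassified", "unclassified", "unclassified"),
  comments = c("YAY", "MAU", "MPU", "QCF", "NIN", "IRL"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup vector mirroring the explanations in the table above.
comment_key <- c(
  YAY = "move transition present, nonA residue detected",
  MAU = "move transition absent, nonA residue undetected",
  MPU = "move transition present, nonA residue undetected",
  QCF = "nanopolish qc failed",
  NIN = "not included in the analysis (pass_only = TRUE)",
  IRL = "insufficient read length"
)
read_classes$explanation <- unname(comment_key[read_classes$comments])

# Number of reads per class.
class_counts <- table(read_classes$class)

# Share of decorated reads among the reads that could be classified.
classified <- read_classes$class != "unclassified"
decorated_fraction <- mean(read_classes$class[classified] == "decorated")
```

Excluding the unclassified reads from the denominator is a design choice of this sketch; whether that is appropriate depends on the question being asked.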

nonadenosine_residues dataframe:

| column name | content |
| --- | --- |
| readname | an identifier of a given read (36 characters) |
| prediction | the result of classification (basic model: C, G, U assignment) |
| est_nonA_pos | the approximate nucleotide position where nonadenosine is to be expected; reported from 5' end |
| polya_length | the tail length estimated according to Nanopolish polya function |
| qc_tag | quality tag assigned by nanopolish polya function |
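Since est_nonA_pos is reported from the 5' end, it can be combined with polya_length to express the position relative to the 3' end or as a fraction of the tail. The sketch below uses a mock nonadenosine_residues data frame with invented values. It assumes that est_nonA_pos is counted along the poly(A) tail in nucleotides, and the columns `dist_from_3prime` and `rel_pos` are hypothetical names introduced here.

```r
# Mock nonadenosine_residues table; all values are invented for illustration.
nonadenosine_residues <- data.frame(
  readname     = sprintf("00000000-0000-0000-0000-%012d", 1:3),
  prediction   = c("C", "G", "U"),
  est_nonA_pos = c(10, 45, 3),
  polya_length = c(60, 50, 30),
  qc_tag       = "PASS",
  stringsAsFactors = FALSE
)

# Assumption: est_nonA_pos counts nucleotides from the 5' end of the tail.
nonadenosine_residues$dist_from_3prime <-
  nonadenosine_residues$polya_length - nonadenosine_residues$est_nonA_pos

# Relative position within the tail (near 0 = 5' end, near 1 = 3' end).
nonadenosine_residues$rel_pos <-
  nonadenosine_residues$est_nonA_pos / nonadenosine_residues$polya_length

# Number of detected residues per predicted nucleotide.
prediction_counts <- table(nonadenosine_residues$prediction)
```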