5. Data postprocessing

nemitheasura edited this page Nov 23, 2023 · 1 revision

The Ninetails package supports further processing of its output files (objects), including visualization.

Reading file(s) into R

The data post-processing module requires output from the main Ninetails pipeline (e.g. check_tails()).

Reading single file

Ninetails can read a single output file with read_class_single() for a read_classes data frame, or with read_residue_single() for a nonadenosine_residues data frame:

class_path <- "/directory/with/ninetails/read_class_output.tsv"
class_data <- ninetails::read_class_single(class_path)

residue_path <- "/directory/with/ninetails/nonadenosine_residues_output.tsv"
residue_data <- ninetails::read_residue_single(residue_path)

Reading multiple files

Ninetails can read multiple output files at once with read_class_multiple() for read_classes data frames and read_residue_multiple() for nonadenosine_residues data frames. It can also attach any metadata provided by the user.

Note

To use the built-in data processing and/or data visualization modules, the user has to provide at least the following metadata:

  • sample_name - unique ID of the sample/replicate
  • group - experimental condition
  • class_path - path to read_classes data frame
  • residue_path - path to nonadenosine_residues data frame

Let's assume we have performed an experiment with two conditions (group_1, group_2) and two replicates per condition (sample_1, sample_2, sample_3, sample_4). After running the check_tails() function, we will have two output files per sample (read_classes and nonadenosine_residues, respectively).

directly:

We can read all of them at once and attach metadata from an additional data frame:

# define table with metadata
samples_table <- data.frame(sample_name = c("sample_1","sample_2","sample_3","sample_4"),
                            group = c("group_1","group_1","group_2","group_2"),
                            class_path = c("/home/user/ANALYSES/Ninetails/sample_1/read_classes.tsv",
                                           "/home/user/ANALYSES/Ninetails/sample_2/read_classes.tsv",
                                           "/home/user/ANALYSES/Ninetails/sample_3/read_classes.tsv",
                                           "/home/user/ANALYSES/Ninetails/sample_4/read_classes.tsv"),
                            residue_path = c("/home/user/ANALYSES/Ninetails/sample_1/nonadenosine_residues.tsv",
                                            "/home/user/ANALYSES/Ninetails/sample_2/nonadenosine_residues.tsv",
                                            "/home/user/ANALYSES/Ninetails/sample_3/nonadenosine_residues.tsv",
                                            "/home/user/ANALYSES/Ninetails/sample_4/nonadenosine_residues.tsv"))

# read the data at once
class_data <- ninetails::read_class_multiple(samples_table)
residue_data <- ninetails::read_residue_multiple(samples_table)
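Before reading many files at once, it can help to check that the metadata table is complete. The following is a minimal sketch in base R, not part of Ninetails: check_samples_table() is a hypothetical helper, and the demo uses empty temporary files in place of real pipeline outputs.

```r
# Hypothetical helper (not a ninetails function): validate a samples table
# before passing it to read_class_multiple() / read_residue_multiple().
check_samples_table <- function(samples_table) {
  required <- c("sample_name", "group", "class_path", "residue_path")
  missing_cols <- setdiff(required, colnames(samples_table))
  if (length(missing_cols) > 0) {
    stop("Missing metadata column(s): ", paste(missing_cols, collapse = ", "))
  }
  # sample_name is expected to be a unique ID of the sample/replicate
  if (anyDuplicated(samples_table$sample_name) > 0) {
    stop("sample_name values must be unique")
  }
  # every listed output file should exist on disk
  paths <- c(as.character(samples_table$class_path),
             as.character(samples_table$residue_path))
  missing_files <- paths[!file.exists(paths)]
  if (length(missing_files) > 0) {
    stop("File(s) not found: ", paste(missing_files, collapse = ", "))
  }
  invisible(TRUE)
}

# Toy demonstration: empty placeholder files stand in for real outputs
demo_dir <- tempfile("ninetails_demo_")
dir.create(demo_dir)
demo_table <- data.frame(
  sample_name  = c("sample_1", "sample_2"),
  group        = c("group_1", "group_2"),
  class_path   = file.path(demo_dir, c("s1_read_classes.tsv",
                                       "s2_read_classes.tsv")),
  residue_path = file.path(demo_dir, c("s1_nonadenosine_residues.tsv",
                                       "s2_nonadenosine_residues.tsv")),
  stringsAsFactors = FALSE
)
invisible(file.create(c(demo_table$class_path, demo_table$residue_path)))
check_result <- check_samples_table(demo_table)
```

The helper stops with an informative message on the first problem it finds, so a missing column or a mistyped path surfaces before any file is parsed.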

from *.yml file:

Alternatively, one may provide the metadata in a configuration file (config.yml) and then read the data as in the following example:

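# load the metadata from the YAML configuration file
config <- yaml::yaml.load_file("config_dummy.yml")

# convert the nested list into a data frame with one row per sample
samples_table <- data.frame(t(sapply(config$samples, unlist)))
rownames(samples_table) <- NULL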

# read the data at once
class_data <- ninetails::read_class_multiple(samples_table)
residue_data <- ninetails::read_residue_multiple(samples_table)

Example content of config.yml:

samples:
  sample_1:
    sample_name: sample_1
    group: group_1
    class_path: /home/user/ANALYSES/Ninetails/sample_1/read_classes.tsv
    residue_path: /home/user/ANALYSES/Ninetails/sample_1/nonadenosine_residues.tsv
  sample_2:
    sample_name: sample_2
    group: group_1
    class_path: /home/user/ANALYSES/Ninetails/sample_2/read_classes.tsv
    residue_path: /home/user/ANALYSES/Ninetails/sample_2/nonadenosine_residues.tsv
  sample_3:
    sample_name: sample_3
    group: group_2
    class_path: /home/user/ANALYSES/Ninetails/sample_3/read_classes.tsv
    residue_path: /home/user/ANALYSES/Ninetails/sample_3/nonadenosine_residues.tsv
  sample_4:
    sample_name: sample_4
    group: group_2
    class_path: /home/user/ANALYSES/Ninetails/sample_4/read_classes.tsv
    residue_path: /home/user/ANALYSES/Ninetails/sample_4/nonadenosine_residues.tsv

Note

This is just a minimal reproducible example. The user may provide any additional metadata (e.g. guppy version, reference transcriptome, batch).
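To see what the reshaping step does to the parsed configuration, here is a small sketch that runs without the yaml package. The nested list is a hand-written stand-in for what yaml::yaml.load_file() would return for a two-sample config, and the batch field is a hypothetical extra metadata entry added for illustration.

```r
# Stand-in for the parsed YAML: a named list of samples, each a named list
# of fields. "batch" is a hypothetical additional metadata field.
config <- list(
  samples = list(
    sample_1 = list(
      sample_name  = "sample_1",
      group        = "group_1",
      class_path   = "/home/user/ANALYSES/Ninetails/sample_1/read_classes.tsv",
      residue_path = "/home/user/ANALYSES/Ninetails/sample_1/nonadenosine_residues.tsv",
      batch        = "A"
    ),
    sample_2 = list(
      sample_name  = "sample_2",
      group        = "group_1",
      class_path   = "/home/user/ANALYSES/Ninetails/sample_2/read_classes.tsv",
      residue_path = "/home/user/ANALYSES/Ninetails/sample_2/nonadenosine_residues.tsv",
      batch        = "B"
    )
  )
)

# sapply() + unlist() yields a character matrix with one column per sample;
# t() flips it so that each sample becomes one row
samples_table <- data.frame(t(sapply(config$samples, unlist)),
                            stringsAsFactors = FALSE)
rownames(samples_table) <- NULL
```

Every column in the resulting table is character, because unlist() coerces mixed fields to a common type. Numeric metadata would therefore need an explicit conversion (e.g. as.numeric()) afterwards.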

Correcting classification

Ninetails makes it possible to minimize segmentation errors inherited from nanopolish.

Sometimes nucleotides from the 3' ends of some AT-rich transcripts are misidentified as part of the poly(A) tail, when in fact they still belong to the transcript body. In such cases, a large enrichment of non-adenosine positions is observed in close proximity to the body of these transcripts.

To minimize the impact of segmentation artifacts on the results, one can use the following function:

# Reclassify the data
ninetails_data <- ninetails::reclassify_ninetails_data(residue_data=residue_data,
                                                       class_data=class_data,
                                                       grouping_factor="sample_name",
                                                       transcript_column="ensembl_transcript_id_short",
                                                       ref="mmusculus")
# Retrieve the corrected data frames from the returned list
class_data <- ninetails_data[[1]]
residue_data <- ninetails_data[[2]]

Note

This function should be applied before any further analysis or manipulation of the class and residue data.

Currently, Ninetails can reclassify transcripts from the following species:

  • Arabidopsis thaliana
  • Homo sapiens
  • Mus musculus
  • Saccharomyces cerevisiae
  • Caenorhabditis elegans
  • Trypanosoma brucei

Detailed information about the correction of data from other sources can be found in the function documentation.

Merging results

Ninetails provides a function to merge the tabular outputs into one concise table for all data. Each read is represented by a single row.

merged_tables <- ninetails::merge_nonA_tables(class_data=class_data,
                                              residue_data=residue_data,
                                              pass_only=TRUE)

In addition, an extra nonA_residues column is appended at the end of the output table. It summarizes all non-A residue positions per read, listed from the 5' to the 3' end and separated by ":".

This table includes only reads that have been classified by Ninetails (reads marked "unclassified" are omitted from the analysis).
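Since the nonA_residues column packs several entries into one ":"-separated string, it may need to be split again for downstream work. A minimal base R sketch follows. The toy values and the "<residue><position>" token format are assumptions made for illustration only, so check the actual content of your merged table.

```r
# Toy values mimicking a nonA_residues column, one string per read.
# The "<residue><position>" token format is an assumption, not taken
# from the ninetails documentation.
nonA_residues <- c("C12", "G8:U25", "C5:C17:G40")

# Split each string on the ":" separator (fixed = TRUE avoids regex parsing)
tokens <- strsplit(nonA_residues, ":", fixed = TRUE)

# Number of reported non-A residues per read
n_per_read <- lengths(tokens)
```

The result of strsplit() is a list with one character vector per read, which keeps the 5' to 3' order of the entries.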

Summarizing results

Ninetails can also produce a summary table of non-adenosine occurrences within the analyzed dataset.

# merged_tables is the output of merge_nonA_tables() from the previous step
summarized <- ninetails::summarize_nonA(merged_nonA_tables=merged_tables,
                                        summary_factors="group",
                                        transcript_id_column="ensembl_transcript_id_short")

In the output table, counts are the number of reads, either in total or containing a given type of non-adenosine residue (see column headers for details), whereas hits are the total number of individual occurrences of a given non-adenosine residue (see column headers for details). Note that one read may contain several hits.

The function also reports the mean and median poly(A) tail length by transcript.
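The difference between counts and hits, and the per-transcript tail length statistics, can be illustrated on toy data. This is a sketch in base R under stated assumptions, not the summarize_nonA() implementation: the data frames and their column names (readname, prediction, transcript, polya_length) are hypothetical.

```r
# Toy data: one row per detected non-A residue.
# Column names are hypothetical, chosen only for this illustration.
toy_residues <- data.frame(
  readname   = c("read_1", "read_1", "read_2", "read_3", "read_3", "read_3"),
  prediction = c("C", "C", "G", "C", "U", "U"),
  stringsAsFactors = FALSE
)

# hits: every occurrence is counted, so read_1 contributes two C hits
hits <- table(toy_residues$prediction)

# counts: each read is counted once per residue type it contains
read_type_pairs <- unique(toy_residues[, c("readname", "prediction")])
counts <- table(read_type_pairs$prediction)

# Toy data: one row per read, with a hypothetical tail length column
toy_reads <- data.frame(
  transcript   = c("tx_A", "tx_A", "tx_A", "tx_B", "tx_B"),
  polya_length = c(50, 60, 100, 30, 40),
  stringsAsFactors = FALSE
)

# Mean and median poly(A) tail length per transcript
mean_length   <- tapply(toy_reads$polya_length, toy_reads$transcript, mean)
median_length <- tapply(toy_reads$polya_length, toy_reads$transcript, median)
```

In this toy set, C has 3 hits but a count of 2, because read_1 carries two C residues. For tx_A the mean tail length (70) exceeds the median (60) because a single long tail pulls the mean upward.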