microflowseq

The goal of microflowseq is to assign probabilities to bacteria that are sorted in mFLOW-Seq experiments.

Installation

You can install the development version of microflowseq like so:

devtools::install_github("PennChopMicrobiomeProgram/microflowseq")

Example: mFLOW-Seq with two fractions

library(microflowseq)

Imagine an experiment where we have a specimen containing organisms from two species, Escherichia coli and Streptococcus mitis. The organisms are sorted into two fractions, IgA+ and IgA-. Let’s create a matrix giving the actual number of organisms. Here, the species are represented in the rows of the matrix and the fractions are represented in columns.

dual_cts <- matrix(
  c(5, 7, 19, 13), nrow = 2,
  dimnames = list(c("e.coli", "s.mitis"), c("IgA+", "IgA-")))
dual_cts
#>         IgA+ IgA-
#> e.coli     5   19
#> s.mitis    7   13

For each species, we are interested in the probability of finding an organism in the IgA+ or IgA- fraction. This is straightforward to calculate if we know the absolute number of organisms: we simply calculate a probability across each row.¹ The microflowseq package provides a function, normalize_sum() to perform the common task of converting to probability or relative abundance.

t(apply(dual_cts, 1, normalize_sum))
#>              IgA+      IgA-
#> e.coli  0.2083333 0.7916667
#> s.mitis 0.3500000 0.6500000

However, we don’t measure the absolute number of organisms directly in a real experiment, so we can’t use the method above to calculate the probabilities. In an mFLOW-Seq experiment, we measure two things: the abundance of each fraction, and, within each fraction, the relative abundance of the bacteria.

First, we compute the abundance of each fraction by summing down the columns. To show that our approach does not depend on the scale of the fraction abundance, we multiply the absolute fraction abundances by a large number.

dual_fracs <- 35 * colSums(dual_cts)
dual_fracs
#> IgA+ IgA- 
#>  420 1120

Next, we compute the relative abundance of each species within each fraction.

dual_props <- apply(dual_cts, 2, normalize_sum)
dual_props
#>              IgA+    IgA-
#> e.coli  0.4166667 0.59375
#> s.mitis 0.5833333 0.40625

We now have our example data set: a vector of fraction abundances (dual_fracs) and a matrix of bacteria relative abundances within each fraction (dual_props). As before, each row corresponds to a bacterial species and each column corresponds to a fraction.

We can use mflow_apply() to solve for the probabilities. The answer here matches what we computed when we knew the exact number of organisms in each fraction. Success!

mflow_apply(dual_props, dual_fracs)
#>              IgA+      IgA-
#> e.coli  0.2083333 0.7916667
#> s.mitis 0.3500000 0.6500000

Example: mFLOW-Seq with multiple fractions

Let’s imagine that the bacteria were sorted into four fractions, according to whether they were tagged by IgA, IgG, both, or neither.

multi_cts <- matrix(
  c(5, 7, 19, 13, 2, 11, 23, 29), nrow=2,
  dimnames = list(
    c("e.coli", "s.mitis"),
    c("IgA+IgG+", "IgA+IgG-", "IgA-IgG+", "IgA-IgG-")))
multi_cts
#>         IgA+IgG+ IgA+IgG- IgA-IgG+ IgA-IgG-
#> e.coli         5       19        2       23
#> s.mitis        7       13       11       29

As before, we want to work one species at a time and calculate how organisms of the species are distributed among the fractions. Both organisms are found in the double negative fraction about half the time. E. coli is almost never found in the IgA-IgG+ fraction, whereas that fraction accounts for about 18% of the S. mitis organisms.

t(apply(multi_cts, 1, normalize_sum))
#>          IgA+IgG+  IgA+IgG-   IgA-IgG+  IgA-IgG-
#> e.coli  0.1020408 0.3877551 0.04081633 0.4693878
#> s.mitis 0.1166667 0.2166667 0.18333333 0.4833333

Here is what we measure in the experiment. As before, I multiply the fraction abundances by some number to show that the absolute scale doesn’t matter.

multi_fracs <- 0.2 * colSums(multi_cts)
multi_fracs
#> IgA+IgG+ IgA+IgG- IgA-IgG+ IgA-IgG- 
#>      2.4      6.4      2.6     10.4
multi_props <- apply(multi_cts, 2, normalize_sum)
multi_props
#>          IgA+IgG+ IgA+IgG-  IgA-IgG+  IgA-IgG-
#> e.coli  0.4166667  0.59375 0.1538462 0.4423077
#> s.mitis 0.5833333  0.40625 0.8461538 0.5576923

With the data in hand, mflow_apply() works just the same as it did for the previous example. Here’s the answer we wanted.

mflow_apply(multi_props, multi_fracs)
#>          IgA+IgG+  IgA+IgG-   IgA-IgG+  IgA-IgG-
#> e.coli  0.1020408 0.3877551 0.04081633 0.4693878
#> s.mitis 0.1166667 0.2166667 0.18333333 0.4833333

Example: long format

With the advent of the tidyverse, it’s no longer common to work in matrix format in the R programming language. This is probably a good thing overall; it encourages people to work in a long format, where each data point is represented in one row of a table.

Let’s go back to our first example and convert the data to long format, so we can show how to use the microflowseq library in that context. First, we’ll load the collection of packages in the tidyverse.

library(tidyverse)

It’s actually a bit of a pain to convert our data from the matrix format. We won’t explain the process line-by-line, but we’ll show the answer so you can check that it matches the example above.

Here are the fraction abundances. We add a column to indicate the specimen from which the bacteria were sorted. In a typical sequencing experiment, you’d have many specimens.

dual_fracs_df <- dual_fracs |>
  enframe("fraction", "fraction_abundance") |>
  mutate(specimen_id = "Specimen1") |>
  select(specimen_id, everything())
dual_fracs_df
#> # A tibble: 2 × 3
#>   specimen_id fraction fraction_abundance
#>   <chr>       <chr>                 <dbl>
#> 1 Specimen1   IgA+                    420
#> 2 Specimen1   IgA-                   1120

Here are the bacteria relative abundances. Again, we add a specimen ID because this is important to the demonstration of data in long format. Each fraction derived from a specimen would appear as a different sample in the sequencing data. To keep track of which samples were derived from each specimen, you would use a separate column for the specimen ID, just as we’ve done here.

Also, we’re going to name the species column “asv”, for amplicon sequence variant. In a real sequencing experiment, we often measure sequence variants rather than the actual species per se. Hopefully, this makes our data table look more similar to what you’d see in a real study.

dual_props_df <- dual_props |>
  as.data.frame() |>
  rownames_to_column("asv") |>
  pivot_longer(-asv, names_to = "fraction", values_to = "asv_abundance") |>
  mutate(specimen_id = "Specimen1") |>
  select(specimen_id, fraction, everything())
dual_props_df
#> # A tibble: 4 × 4
#>   specimen_id fraction asv     asv_abundance
#>   <chr>       <chr>    <chr>           <dbl>
#> 1 Specimen1   IgA+     e.coli          0.417
#> 2 Specimen1   IgA-     e.coli          0.594
#> 3 Specimen1   IgA+     s.mitis         0.583
#> 4 Specimen1   IgA-     s.mitis         0.406

Using data in long format, we compute the probabilities in three steps:

Join the fraction abundance table to the bacteria relative abundance table.
Group by specimen ID and species (ASV).
Use mflow_probability() to compute the probabilities.

The mflow_probability() function works on one species/ASV at a time. Under the hood, this function is used by mflow_apply() when we work in matrix format. Here, we no longer have to think about rows and columns, and our result is ready to feed into ggplot2 for visualization.

dual_props_df |>
  left_join(dual_fracs_df, by = c("specimen_id", "fraction")) |>
  group_by(specimen_id, asv) |>
  mutate(fraction_prob = mflow_probability(asv_abundance, fraction_abundance))
#> # A tibble: 4 × 6
#> # Groups:   specimen_id, asv [2]
#>   specimen_id fraction asv     asv_abundance fraction_abundance fraction_prob
#>   <chr>       <chr>    <chr>           <dbl>              <dbl>         <dbl>
#> 1 Specimen1   IgA+     e.coli          0.417                420         0.208
#> 2 Specimen1   IgA-     e.coli          0.594               1120         0.792
#> 3 Specimen1   IgA+     s.mitis         0.583                420         0.35 
#> 4 Specimen1   IgA-     s.mitis         0.406               1120         0.65

Unfortunately, I have to transpose the result after I apply a function across the rows of a matrix. It makes the code messier, but it’s necessary because R will flip the rows and columns otherwise. ↩

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
R		R
man		man
vignettes		vignettes
.Rbuildignore		.Rbuildignore
.gitignore		.gitignore
DESCRIPTION		DESCRIPTION
LICENSE		LICENSE
LICENSE.md		LICENSE.md
NAMESPACE		NAMESPACE
README.Rmd		README.Rmd
README.md		README.md
microflowseq.Rproj		microflowseq.Rproj

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Licenses found

R

R

man

man

vignettes

vignettes

.Rbuildignore

.Rbuildignore

.gitignore

.gitignore

DESCRIPTION

DESCRIPTION

LICENSE

LICENSE

LICENSE.md

LICENSE.md

NAMESPACE

NAMESPACE

README.Rmd

README.Rmd

README.md

README.md

microflowseq.Rproj

microflowseq.Rproj

Repository files navigation

microflowseq

Installation

Example: mFLOW-Seq with two fractions

Example: mFLOW-Seq with multiple fractions

Example: long format

About

Licenses found

Releases

Packages

Languages

License

Licenses found

PennChopMicrobiomeProgram/microflowseq

Folders and files

Latest commit

History

Repository files navigation

microflowseq

Installation

Example: mFLOW-Seq with two fractions

Example: mFLOW-Seq with multiple fractions

Example: long format

Footnotes

About

Resources

License

Licenses found

Stars

Watchers

Forks

Languages