Setting pData factor levels #297

DillonHammill · 2020-01-30T01:13:02Z

I am having trouble setting factor levels of pData variables using the latest flowWorkspace. I used to be able to do this:

pData(gs)$Treatment <- factor(pData(gs)$Treatment, levels = c("A","B","C","D"))
levels(pData(gs)$Treatment)
> [1] "A" "B" "C" "D"

Now the column retains the original class and is not converted to a factor.

pData(gs)$Treatment <- factor(pData(gs)$Treatment, levels = c("A","B","C","D"))
levels(pData(gs)$Treatment)
> NULL

Any idea why this is happening?

The text was updated successfully, but these errors were encountered:

mikejiang · 2020-01-30T01:35:18Z

Because each column of pData is now simply treated/stored as string in H5 without preserving their extra attributes or types. We'll need to discuss whether we want to redesign the pData structure in h5 in order to keep these info.

DillonHammill · 2020-01-30T02:34:42Z

I certainly use it a lot to order samples by experiment variables, it would be great if we could keep this capability.

mikejiang · 2020-01-30T04:22:47Z

Do you have to use factor?Ordering doesn't necessarily require that.

DillonHammill · 2020-01-30T04:26:34Z

How would you do it? I need the ordering to be retained by the GatingSet and it should not affect the sample order.

mikejiang · 2020-01-30T18:15:30Z

I don't fully understand your intention. Please give a concrete use case showing that character isn't sufficient and factor is necessary

gfinak · 2020-01-30T18:27:18Z

@DillonHammill Default factors are not ordered. Do you mean that you use levels which returns the factors in alphabetical order? If that's the case, the equivalent would be sort(unique(x)), though if there are factor levels missing, this would not capture that.

jacobpwagner · 2020-01-30T18:46:35Z

Well, factor levels always have a default ordering, usually dictated by the values as they appear in the data.frame, right? So, I think maybe @DillonHammill wants to be able to arbitrarily change that ordering and have that be persistent, maybe for the sake of summaries or plots later.

gfinak · 2020-01-30T18:47:36Z

No, they are alphabetical:

> factor(c("B","C","A"))
[1] B C A
Levels: A B C

jacobpwagner · 2020-01-30T18:51:25Z

Oh, right, by default. For some reason I was thinking I ran in to an exception to that, but it must have been manually reordered. But anyway, if for some reason he has a method that arbitrarily reorders the levels then that would be a problem if it doesn't persist.

jacobpwagner · 2020-01-30T19:27:59Z

Basically, even if a factor is not ordered (as in the the ordered flag is not set to true and the factor levels should not be considered ordinal), default group ordering information for tables/plots/summaries can still be preserved (which may be related to the use case here). That is levels returns the underlying levels vector order, not re-alphabetizing on the fly:

> blah <- iris
> levels(blah$Species)
[1] "setosa"     "versicolor" "virginica" 
> is.ordered(blah$Species)
[1] FALSE
> table(blah$Species)

    setosa versicolor  virginica 
        50         50         50 
> levels(blah$Species) <- c("versicolor", "virginica", "setosa")
> levels(blah$Species)
[1] "versicolor" "virginica"  "setosa"
> is.ordered(blah$Species)
[1] FALSE
> table(blah$Species)

versicolor  virginica     setosa 
        50         50         50 
> blah$Species <- factor(blah$Species, levels = c("virginica", "versicolor", "setosa"))
> levels(blah$Species)
[1] "virginica"  "versicolor" "setosa"    
> is.ordered(blah$Species)
[1] FALSE
> table(blah$Species)

 virginica versicolor     setosa 
        50         50         50

But in order to preserve that behavior in the h5 we would have to add an attribute to retain the order.

DillonHammill · 2020-01-30T19:38:45Z

Thanks for looking into this, as @jacobpwagner mentioned I primarily use this for summaries and plots.

For example, in my plotting function cyto_plot users can supply a GatingSet/cytoset and request that samples be grouped or ordered by experiment variables when plotting. Users may for example want to plot the unstimulated samples in the first panel and stimulated samples in the second (this is hardly ever in alphabetical order). It much easier for users if this is taken care of behind the scenes so that they don't have to manually supply the indices to order the samples for plotting (we are normally dealing with hundreds of samples). Here is an example where samples are ordered by two variables (stimulus and antigen concentration):

cyto_plot(gs,
          parent = "CD4 T Cells",
          alias = "",
          channels = c("CD44", "CD69"),
          group_by = c("Treatment", "OVAConc")) # pData variables

I also use pData to to store experimental information which is exported with calculated statistics. Setting factor levels early on in the pipeline means that it can be used to order/group samples for plotting within CytoExploreR but also be retained in exported statistics for plotting with ggplot.

jacobpwagner · 2020-01-30T19:58:17Z

That's kind of what I suspected. @mikejiang , maybe we could add representation of the levels vectors to the h5 for the columns to match R-level factor columns. We could then check those levels vectors in the process of gs_cyto_data to mimic the R-level levels method:

levels.default <- function(x) attr(x, "levels")

of course, this starts to look a bit like we're re-inventing the wheel of data.frame.

jacobpwagner · 2020-01-31T01:42:24Z

Just documenting some of our offline discussions:

2 possible approaches:

1. The distributed approach:
Expand the pData representation within each CytoFrame to handle more than simple string-string key-value mapping, thus allowing factor levels to be carried as well. That is, instead of this:
https://github.com/RGLab/cytolib/blob/d0f70a2549e035f81169301b531554907f8b9c1f/inst/include/cytolib/CytoFrame.hpp#L29

typedef unordered_map<string, string> PDATA;

do something like this:

enum class pDataType {pd_int, pd_dbl, pd_str, pd_bool};
struct PDATA_VAL{
	pDataType type;
	string value; // can be cast to appropriate return type
	set<string> levels; // only non-empty for factors
};
typedef unordered_map<string, PDATA_VAL> PDATA;

This extra information would then also have to be written out to the pdata H5 DataSet

Pros:

Less code updating as this requires minimal changes to the cytolib::GatingSet and leaves in place the status quo of distributed pData
Allows access/reconstitution of metadata from distributed samples, without the need for the protocol buffer file

Cons:

Some more code updating from having to add more to the pdata DataSet in the H5 representation (more work updating H5CytoFrame) -- this is also complicated by the fact that H5 is not really designed for jagged structures with mixed data types.
Need to enforce consistency in levels and data types across cytoframes

2. The centralized approach:
Add a C++ level representation of the pData data.frame to cytolib::GatingSet and save this out to the protocol buffer, maintaining one centralized store of group metadata (like existed in the R representation before the cytoset merge and still exists for flowFrame/flowSet in flowCore)

Pros:

Column consistency is managed in a centralized data structure, so no need to check when combining results from get_pheno_data for each CytoFrame.
Protocol buffers should make it relatively easy to add annotations with variable length and mixed type
pdata DataSet in the H5 files for each CytoFrame could be removed entirely

Cons:

Possibly significantly more code updating to be done in cytolib::GatingSet to handle the new pData data member, with particular caution needing to be paid to consistently slicing/subsetting related data structures
Requires updating the protocol buffer interface
Archiving pData of even an un-gated set of samples requires the pb file as well.

We can discuss run-time efficiency as well, but the crux of the issue may be where the metadata should live. In some ways it's nice to have it attached to the H5 for each sample's cytoframe, but maybe it doesn't make sense to store group-level information (like factor levels) with individual samples.

gfinak · 2020-01-31T17:47:27Z

I favor the first approach.

DillonHammill · 2020-03-26T20:47:38Z

@jacobpwagner, just letting you know that I have a workaround for ordering things now. This is no longer essential. Users can just specify the ordering in a list when plotting etc and set factor levels on the exported stats prior to plotting with ggplot2.

It may actually be beneficial to not change this as then we don't have to worry about differing column classes.

jacobpwagner · 2020-03-26T20:49:22Z

Thanks for the update @DillonHammill. This was still on my list, but it's good to know it's not urgent for your use case.

DillonHammill · 2020-03-26T21:02:39Z

Thanks for using the NEWS.md to track changes, this is very useful.

osthomas · 2020-05-01T19:13:12Z

Thanks for looking into this, as @jacobpwagner mentioned I primarily use this for summaries and plots.

[ ... ]

I also use pData to to store experimental information which is exported with calculated statistics. Setting factor levels early on in the pipeline means that it can be used to order/group samples for plotting within CytoExploreR but also be retained in exported statistics for plotting with ggplot.

Hello,

I'd just like to chime in and add support to this use case. I played around with CytoExploreR (thanks @DillonHammill by the way for your development work on this - of course, this extends to the whole cyto suite) and had to update to the development version of flowWorkspace for that purpose.

I really like to use ggcyto in conjunction with facet_grid to get a quick overview of my samples, and clarity suffers quite a bit from not being able to specify order arbitrarily. And while it is possible to add the desired ordering further downstream in the analysis pipeline (after extracting stats), it is tedious to do so (for me at least, because I prefer to take care of such things when reading in data and setting up the environment for analysis).

jacobpwagner · 2020-06-05T21:43:49Z

Just updating with a note here that it would also be helpful to store other sorts of type information attached to the pheno data stored at the cytolib level. Due to everything being stored as a string and returning to R as a character type, information about which variables should be true numerical types (integer, floating-point) is lost. A few problems can arise from this. For example, columns attached as integers via pData<- will return to R using pData as characters, which will change their sort order:

> nums <- 1:20
> charnums <- as.character(nums)
> sort(nums)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
> sort(charnums)
 [1] "1"  "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "2"  "20" "3"  "4"  "5"  "6"  "7"  "8"  "9"

Also, we can not assume all columns that are capable of conversion to numerical types when loaded back in to R should in fact be converted. For example, subject IDs often have formatted digit strings that are not necessarily supposed to be converted in to simple numeric types. If the character id stored is "001", type conversion would succeed but make it just 1 (integer). Similarly, subject IDs with decimals like "1.24" implying arm 1 subject 24 would be converted to doubles.

So, point being, we ultimately will likely need type information attached to the pheno data stored at the C++ level in cytolib.

herbstk · 2020-09-15T14:01:25Z

I arrived here because I had the issue that pData(cytoset)<- assignments mess up the column order and loose the column types of the provided data.frame, I guess because of the translation into C structures. After reading this thread and #294 I believe I got the point of the transition from flowSet to cytoset except for the handling of metadata.

To my understanding phenoData/pData were designed to hold arbitrary metadata (as I read in the Biobase intro vignette) and the normalization by pData(cytoset)<- to strings goes against that. I was previously using ncdfFlowSet of which I liked the separation of lightweight metadata as R structures from heavyweight cytodata as on-disk structures. I guess this means that people will then start to keep cytometry data and sample metadata as separate objects and having a pData slot becomes kind of obsolete.

Maybe you could improve the cytoset help page for the pData function to make it clearer for the users what to expect.

rebeccaongmtu · 2024-03-06T17:04:06Z

This situation where the pData of the gatingset is forced to characters makes it challenging to integrate with typical ggplot functionality. Usually, I rely on factor levels to order facets in my plots, but I'm unable to do that with the gatingset data.

I'm trying to combine ggcyto with ggridges to obtain stacked density plots that are ordered based on sampling time. In my flowset, my variables are set up in the pData as factors but because the pData for the gatingset forces all the variables to characters, the stacked plots end up ordered illogically. I also cannot reorder the facets - logically the 50:50 should be between the live and dead cell populations. Do any of you know of another approach to reorder that doesn't rely on pData factor levels?

ggcyto(gs_sc,aes(x=FL1.A, fill = sampling_time), subset = "singlets") + theme_bw() + geom_density_ridges(aes(y = sampling_time), alpha = 0.5) + facet_grid(treatment~.)

jacobpwagner mentioned this issue Jul 9, 2020

Function or documentation to create derived parameter RGLab/openCyto#214

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Setting pData factor levels #297

Setting pData factor levels #297

DillonHammill commented Jan 30, 2020

mikejiang commented Jan 30, 2020

DillonHammill commented Jan 30, 2020

mikejiang commented Jan 30, 2020

DillonHammill commented Jan 30, 2020

mikejiang commented Jan 30, 2020

gfinak commented Jan 30, 2020

jacobpwagner commented Jan 30, 2020

gfinak commented Jan 30, 2020 •

edited

Loading

jacobpwagner commented Jan 30, 2020

jacobpwagner commented Jan 30, 2020 •

edited

Loading

DillonHammill commented Jan 30, 2020 •

edited

Loading

jacobpwagner commented Jan 30, 2020 •

edited

Loading

jacobpwagner commented Jan 31, 2020 •

edited

Loading

gfinak commented Jan 31, 2020

DillonHammill commented Mar 26, 2020 •

edited

Loading

jacobpwagner commented Mar 26, 2020

DillonHammill commented Mar 26, 2020

osthomas commented May 1, 2020

jacobpwagner commented Jun 5, 2020

herbstk commented Sep 15, 2020

rebeccaongmtu commented Mar 6, 2024 •

edited

Loading

Setting pData factor levels #297

Setting pData factor levels #297

Comments

DillonHammill commented Jan 30, 2020

mikejiang commented Jan 30, 2020

DillonHammill commented Jan 30, 2020

mikejiang commented Jan 30, 2020

DillonHammill commented Jan 30, 2020

mikejiang commented Jan 30, 2020

gfinak commented Jan 30, 2020

jacobpwagner commented Jan 30, 2020

gfinak commented Jan 30, 2020 • edited Loading

jacobpwagner commented Jan 30, 2020

jacobpwagner commented Jan 30, 2020 • edited Loading

DillonHammill commented Jan 30, 2020 • edited Loading

jacobpwagner commented Jan 30, 2020 • edited Loading

jacobpwagner commented Jan 31, 2020 • edited Loading

gfinak commented Jan 31, 2020

DillonHammill commented Mar 26, 2020 • edited Loading

jacobpwagner commented Mar 26, 2020

DillonHammill commented Mar 26, 2020

osthomas commented May 1, 2020

jacobpwagner commented Jun 5, 2020

herbstk commented Sep 15, 2020

rebeccaongmtu commented Mar 6, 2024 • edited Loading

gfinak commented Jan 30, 2020 •

edited

Loading

jacobpwagner commented Jan 30, 2020 •

edited

Loading

DillonHammill commented Jan 30, 2020 •

edited

Loading

jacobpwagner commented Jan 30, 2020 •

edited

Loading

jacobpwagner commented Jan 31, 2020 •

edited

Loading

DillonHammill commented Mar 26, 2020 •

edited

Loading

rebeccaongmtu commented Mar 6, 2024 •

edited

Loading