What metadata should be #2276

@pdeffebach

This post outlines, briefly, how I would like metadata to work in DataFrames.

  • Metadata is a property of a column name in a data frame, not of the vector itself. As a consequence, df.income is simply a Vector and, in general, there is no metadata attached to it. Conceptually, think of metadata as an extension of column names. If I pass df.income to a function, that function only knows it receives a Vector and does not know that the vector has the name :income.

Metadata should work the same way.

  • Metadata is persistent. copy(df) preserves metadata, as do filter and similar operations. It is also persistent across joins. For instance, if df1 and df2 both have the column :id, then
df = leftjoin(df1, df2, on = :id)

will preserve metadata for all columns. The entry for metadata(df, :id) will be the same as metadata(df1, :id) because in a leftjoin the left data frame is thought of as the master data frame and the right one as the using data frame, in Stata-speak.
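The join rule described here can be sketched on plain dictionaries. This is only an illustration under stated assumptions: ColMeta and leftjoin_metadata are hypothetical names, not part of DataFrames.jl, and the labels are made-up sample values.

```julia
# Hypothetical representation: column name => (field => value).
const ColMeta = Dict{Symbol, Dict{Symbol, Any}}

# Sketch of the proposed rule: the joined frame keeps metadata for every
# column, and for a column present on both sides the left entry wins whole.
function leftjoin_metadata(left::ColMeta, right::ColMeta)
    out = ColMeta()
    for (col, fields) in right
        out[col] = copy(fields)
    end
    for (col, fields) in left
        out[col] = copy(fields)   # overwrites any entry taken from the right
    end
    return out
end

# Illustrative sample metadata for df1 and df2.
meta1 = ColMeta(:id => Dict{Symbol, Any}(:label => "Household ID"),
                :income => Dict{Symbol, Any}(:label => "Personal Income"))
meta2 = ColMeta(:id => Dict{Symbol, Any}(:label => "Raw identifier"),
                :age => Dict{Symbol, Any}(:label => "Age in years"))

joined = leftjoin_metadata(meta1, meta2)
joined[:id][:label]    # "Household ID", taken from the left frame
joined[:age][:label]   # "Age in years", carried over from the right frame
```

Replacing the whole entry, rather than merging field by field, matches the statement that metadata(df, :id) equals metadata(df1, :id).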

  • Metadata should be easy to access but an ecosystem should not rely on particular naming conventions for metadata. For instance, we should not write any functions guaranteeing that the metadata of a data frame includes the field label. Rather, if someone wants to graph df.income, they should do
histogram(df.income, title = metadata(df, "income")["label"])

or similar.
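A minimal sketch of what "not relying on naming conventions" could look like for a caller, assuming metadata is stored as a plain dictionary. Here meta and plot_title are hypothetical names used only for illustration; the caller chooses the field and supplies its own fallback.

```julia
# Toy metadata store for one data frame: column name => (field => value).
meta = Dict(:income => Dict(:label => "Personal Income"))

# Nothing guarantees that a :label entry exists for a given column, so the
# caller asks for it explicitly and falls back to the column name.
function plot_title(meta, col::Symbol)
    fields = get(meta, col, Dict{Symbol, String}())
    return get(fields, :label, String(col))
end

plot_title(meta, :income)   # "Personal Income"
plot_title(meta, :age)      # "age"
```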

My ideal API for this is implemented, at least partly, in #1458. It includes the functions

  • metadata! for setting metadata via metadata!(df, :income, :label, "Personal Income")
  • metadata for getting metadata for an object, via metadata(df, :income, :label) == "Personal Income".

Notice, again, that these are handled at the level of the data frame and are agnostic about what the columns are: you could change the vector corresponding to df.income and the metadata would be the same.
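To make the semantics concrete, here is a toy model of the proposed behavior. It is a sketch under stated assumptions, not the implementation in #1458: ToyFrame, toyframe and setcolumn! are hypothetical, and metadata! and metadata are defined locally so that they mirror the call shapes above without using DataFrames.jl.

```julia
# Toy stand-in for a data frame. Metadata is keyed by column NAME, so it
# lives next to the columns rather than inside any vector.
struct ToyFrame
    columns::Dict{Symbol, Vector}
    meta::Dict{Symbol, Dict{Symbol, Any}}   # column name => (field => value)
end

# Hypothetical convenience constructor: toyframe(:a => [1, 2], ...).
toyframe(cols::Pair{Symbol}...) =
    ToyFrame(Dict{Symbol, Vector}(cols...), Dict{Symbol, Dict{Symbol, Any}}())

# Set one metadata field for a column name.
function metadata!(df::ToyFrame, col::Symbol, field::Symbol, value)
    haskey(df.columns, col) || throw(ArgumentError("no column named $col"))
    fields = get!(() -> Dict{Symbol, Any}(), df.meta, col)
    fields[field] = value
    return df
end

# Get one metadata field for a column name.
metadata(df::ToyFrame, col::Symbol, field::Symbol) = df.meta[col][field]

# Replace the vector stored under a name; the metadata is left untouched.
function setcolumn!(df::ToyFrame, col::Symbol, values::Vector)
    df.columns[col] = values
    return df
end

df = toyframe(:income => [100, 200, 300])
metadata!(df, :income, :label, "Personal Income")
setcolumn!(df, :income, [1.0, 2.0, 3.0])

metadata(df, :income, :label)   # "Personal Income", even after the swap
df.columns[:income]             # a plain Vector{Float64} with nothing attached
```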

Here are @bkamins's questions, with my answers:

  • the result of df.col not to have any metadata attached

True. Just like df.col has no name attached to it.

  • but on the other hand by doing df.col = some_new_value then the metadata should be kept

Yes. Metadata is attached to a name in a data frame.

  • given the two rules above I was not clear for example what you wanted to happen in the following cases:

    • df.col2 = df.col (I guess you do not want col2 to have any metadata)

Yes, df.col2 has no metadata.

    • if you then do select!(df, :col => :col2, :col2 => :col), then :col should still have metadata and :col2 should not have metadata

Exactly. :col is a name in the data frame. If the user wants to transfer the metadata from one column to another, they can do

select!(df, :col => :col2, :col2 => :col)
metadata!(df, :col, :label, metadata(df, :col2, :label))

Or something along those lines. Presumably we can overload getindex for cleaner syntax. In Stata this would be

replace col = col2
label var col "`var label `col2''" // I forget the escaping rules at the moment

I used that kind of workflow a lot working with survey data.
