Description
This post outlines, briefly, how I would like metadata to work in DataFrames.
- Metadata is a property of a column name in a data frame, not of the vector itself. As a consequence, `df.income` is simply a `Vector`, and in general no metadata is attached to it. Conceptually, think of metadata as an extension of column names. If I pass `df.income` to a function, that function only knows it receives a `Vector`; it does not know the vector has the name `:income`. Metadata should work the same way.
- Metadata is persistent. `copy(df)` preserves metadata, as do `filter` and similar functions. It also persists across joins. For instance, if `df1` and `df2` both have the column `:id`, then
df = leftjoin(df1, df2, on = :id)
will preserve metadata for all columns. The entry for `metadata(df, :id)` will be the same as `metadata(df1, :id)` because, in a `leftjoin`, the left data frame is treated as the master data frame and the right one as the using data frame, in Stata-speak.
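The left-wins rule can be sketched on the metadata alone. The dictionaries and labels below are hypothetical stand-ins for per-column metadata, written in plain Julia; this is only an illustration of the precedence rule, not how DataFrames.jl stores or merges anything.

```julia
# Hypothetical per-column metadata for two tables, keyed by column name.
meta1 = Dict(:id => Dict("label" => "Person ID (master)"),
             :income => Dict("label" => "Personal Income"))
meta2 = Dict(:id => Dict("label" => "Person ID (using)"),
             :age => Dict("label" => "Age in years"))

# For a left join: start from the right table's entries, then let the
# left table overwrite any column name that both sides share.
# `merge` keeps the value from the last dictionary that has the key.
joined_meta = merge(meta2, meta1)
```

Columns that exist only on the right, such as `:age` here, keep their own entries, while the shared `:id` takes its entry from the left table.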
- Metadata should be easy to access, but an ecosystem should not rely on particular naming conventions for metadata. For instance, we should not write any functions guaranteeing that the metadata of a data frame includes the field `label`. Rather, if someone wants to graph `df.income`, they should do
histogram(df.income, title = metadata(df, "income")["label"])
or similar.
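Because no key is guaranteed to exist, calling code has to decide what to do when a label is absent. A minimal sketch of one option, assuming metadata is held in plain dictionaries; `plot_title` and `col_meta` are hypothetical names, not part of any package:

```julia
# Hypothetical per-column metadata; :age has an entry but no "label" key.
col_meta = Dict(:income => Dict("label" => "Personal Income"),
                :age => Dict{String,String}())

# Look up "label" for a column and fall back to the column name itself
# when either the column or the key is missing.
function plot_title(meta, col::Symbol)
    entry = get(meta, col, Dict{String,String}())
    return get(entry, "label", string(col))
end

title_income = plot_title(col_meta, :income)
title_age = plot_title(col_meta, :age)
```

The fallback lives in the caller, so nothing in the data frame layer needs to know that `"label"` is special.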
My ideal API for this is implemented, at least partly, in #1458. It includes the functions
- `metadata!` for setting metadata, via `metadata!(df, :income, :label, "Personal Income")`
- `metadata` for getting metadata for an object, via `metadata(df, :income, :label) == "Personal Income"`.
Notice, again, that these are handled at the level of the data frame and are agnostic about what the columns are: you could change the vector corresponding to `df.income` and the metadata would stay the same.
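To make the name-keyed semantics concrete, here is a minimal sketch in plain Julia. `ToyFrame` and its fields are hypothetical stand-ins rather than DataFrames.jl types, and the two functions only mimic the call shapes proposed above; the real implementation may differ.

```julia
# Hypothetical toy container (not DataFrames.jl): metadata is stored
# next to the columns and keyed by column name, never by the vector.
struct ToyFrame
    columns::Dict{Symbol,Vector}
    meta::Dict{Symbol,Dict{Symbol,Any}}
end

ToyFrame(columns) = ToyFrame(columns, Dict{Symbol,Dict{Symbol,Any}}())

# Set one metadata entry for a column name.
function metadata!(tf::ToyFrame, col::Symbol, key::Symbol, value)
    haskey(tf.columns, col) || throw(ArgumentError("no column $col"))
    get!(() -> Dict{Symbol,Any}(), tf.meta, col)[key] = value
    return tf
end

# Get one metadata entry for a column name.
metadata(tf::ToyFrame, col::Symbol, key::Symbol) = tf.meta[col][key]

tf = ToyFrame(Dict{Symbol,Vector}(:income => [10.0, 20.0]))
metadata!(tf, :income, :label, "Personal Income")

# Replacing the vector leaves the metadata alone, because the metadata
# hangs off the name :income rather than off the old vector.
tf.columns[:income] = [30.0, 40.0, 50.0]
label = metadata(tf, :income, :label)
```

Pulling the vector out with `tf.columns[:income]` gives a bare `Vector` with no label attached, which matches the first rule in the list above.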
Here are @bkamins's points, with my responses:
- the result of `df.col` not to have any metadata attached
True. Just like `df.col` has no name attached to it.
- but on the other hand, by doing `df.col = some_new_value` then the metadata should be kept
Yes. Metadata is attached to a name in a data frame.
Given the two rules above, I was not clear, for example, what you wanted to happen in the following cases:
- `df.col2 = df.col` (I guess you do not want `col2` to have any metadata)
Yes, `df.col2` has no metadata.
- if you then do `select!(df, :col => :col2, :col2 => :col)`, then `:col` should still have metadata and `:col2` should not have metadata
Exactly: `:col` is a name in the data frame. If the user wants to transfer the metadata from one column to another, they can do
select!(df, :col => :col2, :col2 => :col)  # swap the two columns in place
metadata!(df, :col, :label, metadata(df, :col2, :label))  # copy the label from :col2 to :col
Or something along those lines. Presumably we can overload `getindex` for cleaner syntax. In Stata this would be
replace col = col2
label var col "`: variable label col2'" // extended macro function returns col2's label
I used that kind of workflow a lot working with survey data.
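The swap case from the exchange above can be walked through with plain dictionaries. This is a sketch under stated assumptions: `columns` and `meta` are hypothetical stand-ins for a data frame's storage, and the direction of the final copy (from `:col` to `:col2`) is chosen here only for illustration.

```julia
# Hypothetical storage: vectors and metadata are both keyed by column name.
columns = Dict{Symbol,Vector}(:col => [1, 2, 3], :col2 => [4, 5, 6])
meta = Dict(:col => Dict(:label => "Original label"))  # :col2 has no metadata

# Swap the vectors behind the two names. The metadata dictionary is not
# touched, so the label stays with the name :col.
columns[:col], columns[:col2] = columns[:col2], columns[:col]
col2_had_meta = haskey(meta, :col2)

# Transferring metadata is an explicit, separate step taken by the user.
meta[:col2] = copy(meta[:col])
```

After the swap, `:col` holds the values that used to live under `:col2` but still carries its original label; only the explicit copy gives `:col2` a label.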