Will Rosenthal [9:17 AM]
Hi Margaret. I'm working with the dataset knb-lter-cap-652, which has a lot of tables. 4 of these tables are soil cores and other related soil information, which I'm not sure how to handle. The soil samples were collected at multiple times, making them tricky to fit into location-ancillary, and I'm not sure if they would fit more under an environmental driver (for observation-ancillary) or just normal observations. Any thoughts on how they would be best handled?


Margaret O'Brien [9:35 AM]
Hi Will. Taking a look at that dataset now.

Will Rosenthal [9:41 AM]
Great, thanks!

Margaret O'Brien [9:43 AM]
this one is interesting! What they've done is put an entire project's data into one data package. From the project's point of view, that makes sense. But from the point of view of reuse, it would be harmonized differently for a meta analysis of plant biodiversity (which is what the ecocom model can do), vs what you would do for something else (e.g., meta analysis of soil chemistry).
So right to put soil samples into location ancillary. But are you saying that since they were collected at multiple times, they don't match up with the primary observations?
I misspoke!
they go under observation ancillary

Will Rosenthal [9:44 AM]
That makes sense -- thank you!

Margaret O'Brien [9:44 AM]
sorry, typing too fast

Will Rosenthal [9:45 AM]
I'll let you know if I can't get the soil sampling events to match up with the event IDs I've already created

Margaret O'Brien [9:45 AM]
observation_ancillary
OK. let's try that first. If it doesn't work, I have another idea.
I have to be away from my desk for a bit, but will be back in an hour or so

Will Rosenthal [9:48 AM]
Okay, sounds good. I've got some metadata/variable unit stuff I can work on as well

Will Rosenthal [9:58 AM]
All of the sampling events for the soil information match up with the event ids I've already created, so all is well! Thanks again for the help

Margaret O'Brien [10:56 AM]
great!

Will Rosenthal [11:55 AM]
Another quick question for you. I've got a way to get the "unit" information for the variables in all of the tables of a dataset (if there is a unit) from the metadata. Would this be good to use? It does include "number" as a unit for count data, which is why I ask (I've previously interpreted counts as unitless metrics, but could change that).

Margaret O'Brien [11:56 AM]
yes! getting from the EML metadata is a general need.
"number" is one of those dimensionless units that are hard to classify, but we do call it a "unit".
some people prefer "count", as this is closer to a 'measurement'

Will Rosenthal [11:57 AM]
Okay, sounds good!
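
A minimal sketch of the unit lookup discussed above, assuming the EML attribute layout where each `attribute` holds an `attributeName` and, for numeric fields, a `unit` containing either `standardUnit` or `customUnit`. The sample fragment, the table name, and the function name `units_by_table` are hypothetical; this is not Will's actual script.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed EML fragment for illustration only.
SAMPLE_EML = """
<dataset>
  <dataTable>
    <entityName>plant_counts</entityName>
    <attributeList>
      <attribute>
        <attributeName>stem_length</attributeName>
        <measurementScale><ratio><unit>
          <standardUnit>centimeter</standardUnit>
        </unit></ratio></measurementScale>
      </attribute>
      <attribute>
        <attributeName>n_individuals</attributeName>
        <measurementScale><ratio><unit>
          <standardUnit>number</standardUnit>
        </unit></ratio></measurementScale>
      </attribute>
      <attribute>
        <attributeName>site_code</attributeName>
        <measurementScale><nominal/></measurementScale>
      </attribute>
    </attributeList>
  </dataTable>
</dataset>
"""


def units_by_table(eml_text):
    """Return {table name: {attribute name: unit or None}}."""
    root = ET.fromstring(eml_text.strip())
    result = {}
    for table in root.iter("dataTable"):
        table_name = table.findtext("entityName")
        units = {}
        for attr in table.iter("attribute"):
            name = attr.findtext("attributeName")
            unit = attr.find(".//unit")
            unit_name = None  # attributes without a unit element stay None
            if unit is not None:
                unit_name = unit.findtext("standardUnit") or unit.findtext("customUnit")
            units[name] = unit_name
        result[table_name] = units
    return result


print(units_by_table(SAMPLE_EML))
```

Count fields come back with the unit "number" here, matching what the metadata says; renaming that to "count" would be a separate, deliberate step.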

Will Rosenthal [8:38 AM]
Hi Margaret. I'm working with the knb-lter-cap-632 dataset, and there's one bit of the data that's a bit odd. It's a measurement of the stem length on a plant either at the start of the fertilizer treatment period or at the end, but the timing is indicated by another field labeled "post_measurement" that has a boolean indicator of whether the measurement is pre or post treatment. Should this boolean field be included in the observation table or the observation_ancillary table?

Margaret O'Brien [8:42 AM]
Hi Will. thinking on that a moment.
I think observation_ancillary is the best place, because from there it can be linked directly to the primary observation (length).
My reasoning is two things: 1) in some ways, the "length" might be ancillary too, and 2) having a long (instead of wide) table makes it harder to group all the related observations together the way they were in the original.
grouping together is in the event table. Are you using that one?
e.g., these observations are from a treatment event
over

Will Rosenthal [8:48 AM]
I don't think I'm familiar with the event table

Margaret O'Brien [8:49 AM]
it's many:many with observation

Will Rosenthal [8:49 AM]
Ahh yes the event_id. Yes I plan to use that
Anyways, what would your guidelines be for determining which information in that dataset is ancillary and which belongs in the observation table then? All of the data is from a fertilizer application experiment

Margaret O'Brien [8:50 AM]
sorry. the event table was a ghost table
so right. observation_ancillary.event_id:observation.event_id is many:many

Will Rosenthal [8:51 AM]
My interpretation is that every table except the fertilizer table contains at least some information that should be in the observation table
This is regarding your statement that "length" might be an ancillary measurement as well, which I didn't have it pegged as initially

Margaret O'Brien [8:53 AM]
since it's a model for ecological community data, the observation table is for data about the community organisms themselves. The reason I was waffling about length is that sometimes the size of a thing is collected (e.g., fish length, tree height), but that info is more about the population (of fish or trees), and less useful for a community analysis
but if they are using length to compute something like biomass (which could be a measure of abundance, which IS a community measurement), then it is the primary observation.
as far as things like fertilization-state, those are definitely ancillary, but someone using the data will want to get at it for filtering, etc.
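
A rough sketch of how a data user might filter primary observations by an ancillary value through the shared event_id. The rows, the column subset, and the function names are illustrative assumptions, not the real contents of knb-lter-cap-632.

```python
from collections import defaultdict

# Hypothetical rows; only the columns needed for the link are shown.
observation = [
    {"observation_id": "ob_1", "event_id": "ev_1", "taxon_id": "tx_1",
     "variable_name": "cover", "value": 12.5},
    {"observation_id": "ob_2", "event_id": "ev_2", "taxon_id": "tx_1",
     "variable_name": "cover", "value": 20.0},
    {"observation_id": "ob_3", "event_id": "ev_2", "taxon_id": "tx_2",
     "variable_name": "cover", "value": 5.0},
]
observation_ancillary = [
    {"event_id": "ev_1", "variable_name": "post_measurement", "value": "FALSE"},
    {"event_id": "ev_2", "variable_name": "post_measurement", "value": "TRUE"},
]


def ancillary_by_event(rows):
    """Group ancillary values as {event_id: {variable_name: value}}."""
    grouped = defaultdict(dict)
    for row in rows:
        # Simplification: a repeated variable_name within one event
        # overwrites the earlier value.
        grouped[row["event_id"]][row["variable_name"]] = row["value"]
    return dict(grouped)


def filter_observations(obs_rows, anc_rows, variable_name, wanted):
    """Keep observations whose event carries the wanted ancillary value."""
    lookup = ancillary_by_event(anc_rows)
    return [
        obs for obs in obs_rows
        if lookup.get(obs["event_id"], {}).get(variable_name) == wanted
    ]


post = filter_observations(observation, observation_ancillary,
                           "post_measurement", "TRUE")
print([obs["observation_id"] for obs in post])
```

Because the link is many:many, one ancillary row can apply to several observations that share an event, as with ev_2 here.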

Will Rosenthal [8:55 AM]
They don't seem to use length for that purpose, but instead actually measured biomass by harvesting material.

Margaret O'Brien [8:55 AM]
aha. can you tell what they took length for?

Will Rosenthal [8:56 AM]
Not really, at least based on the table metadata. My guess is a growth rate comparison, but I imagine I'd have to read any publication also provided to know for sure

Margaret O'Brien [8:57 AM]
so I would say that length goes into observation_ancillary too.

Will Rosenthal [8:58 AM]
Okay, thanks. What do you think of information gathered by root-simulating probes to see what kinds of nutrients the plants are exposed to? Or about nutrient and isotope analyses of plant tissues?

Margaret O'Brien [8:58 AM]
ancillary
what an interesting dataset. they sure crammed a lot into it.

Will Rosenthal [8:59 AM]
That seems to be a bit of a theme at the CAP LTER station haha

Margaret O'Brien [8:59 AM]
:slightly_smiling_face:
well, then, for synthesis, we are doing them a great favor!

Will Rosenthal [8:59 AM]
So I take it information about biomass and ground cover composition would go in the observation table as well?

Margaret O'Brien [9:00 AM]
yes. biomass and cover would be primary measurements for observation table

Will Rosenthal [9:00 AM]
Biomass does not have any species information associated with it, if that makes a difference

Margaret O'Brien [9:00 AM]
oh, total?
which measurements are associated with taxa?

Will Rosenthal [9:01 AM]
Ground cover percentages and the nutrient/isotope analyses are the only ones associated with any taxonomic information
And also the stem length information

Margaret O'Brien [9:02 AM]
observation should be by taxa
so in the obs table, cover%
I'm still thinking that nutrient/isotope analyses are ancillary though. they can still be linked back through the event_id

Will Rosenthal [9:04 AM]
Okay, sounds good. Thanks for your help! This is the first dataset I've encountered with such a lack of taxonomic information, which made it more difficult to parse through on my own

Margaret O'Brien [9:04 AM]
I think it's because they crammed an entire experiment into one dataset.
so teasing out the part that is about "the community" actually helps it out quite a lot - makes it more useable for community analysis than it was before

Will Rosenthal [9:05 AM]
Yes, the other robust datasets I've worked with so far were not experiments but more like surveys

Margaret O'Brien [9:05 AM]
at my local site, we do it that way.
thanks! this is a good example that we could review (with the site) and see if they agree with what we did.
and I'm getting a better idea of the order and type of questions we have to ask when we start on a new dataset. most of what we had to start with was surveys, which are simpler.

Will Rosenthal [9:08 AM]
Yes, they seem simpler! My guideline so far has been that things with taxonomic information are usually good candidates for the observation table. I'll be going back over the other datasets I've worked on to make sure they follow what you laid out in this discussion, though

Margaret O'Brien [9:10 AM]
but also the type of measurement matters too, e.g., is it a measurement of abundance. but there are lots of ways to measure 'abundance'
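
The rule of thumb from this discussion could be sketched as a small helper, with the caveat that it is only a first-pass suggestion and the set of abundance measures below is an illustrative assumption, not an agreed list. The function name `suggest_table` is hypothetical.

```python
# Illustrative assumption: measure kinds treated as "abundance".
ABUNDANCE_MEASURES = {"count", "cover", "biomass", "density"}


def suggest_table(has_taxon, measure_kind):
    """Suggest a target table for a measurement.

    A measurement is a candidate for the observation table only when it is
    recorded per taxon AND is some measure of abundance; everything else is
    suggested for observation_ancillary.
    """
    if has_taxon and measure_kind in ABUNDANCE_MEASURES:
        return "observation"
    return "observation_ancillary"


# Cases taken from the conversation above.
print(suggest_table(True, "cover"))            # per-taxon ground cover
print(suggest_table(True, "length"))           # stem length, not abundance
print(suggest_table(False, "biomass"))         # total biomass, no taxa
print(suggest_table(True, "isotope_ratio"))    # tissue analysis
```

A helper like this only flags candidates; cases such as length used to compute biomass still need a human decision.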

Will Rosenthal [9:11 AM]
I'll be sure to let you know if anything in the other datasets I've worked on might be questionable on whether it belongs in the observation or observation_ancillary table.

Margaret O'Brien [9:13 AM]
great! thanks. FYI, pretty soon (next couple of weeks) I want to start querying the converted datasets to get lists of variable_names. So we can look at the patterns, and match these to external vocabularies (for the variable_mapping table). But need to talk to Colin first
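
One possible way to collect variable_name lists across converted datasets, sketched with in-memory CSV text so it runs on its own. The header row and the sample rows are assumptions for illustration; real tables would be read from files.

```python
import csv
import io
from collections import Counter

# Hypothetical observation tables from two converted datasets.
OBSERVATION_A = (
    "observation_id,event_id,variable_name,value,unit\n"
    "ob_1,ev_1,count,3,number\n"
    "ob_2,ev_1,count,0,number\n"
)
OBSERVATION_B = (
    "observation_id,event_id,variable_name,value,unit\n"
    "ob_1,ev_1,cover,12.5,percent\n"
    "ob_2,ev_2,count,7,number\n"
)


def variable_name_counts(csv_texts):
    """Count how often each variable_name appears across all tables."""
    counts = Counter()
    for text in csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            counts[row["variable_name"]] += 1
    return counts


counts = variable_name_counts([OBSERVATION_A, OBSERVATION_B])
for name in sorted(counts):
    print(name, counts[name])
```

The sorted list of names is the part that would be compared against external vocabularies; the counts just show which names are most common.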

Will Rosenthal [9:14 AM]
Okay, sounds good. I've been putting parts into my script to collect that information, so let me know when you want it and I can have it relatively quickly

Margaret O'Brien [9:15 AM]
OK. I have been studying more external vocabs we could use, but it keeps getting pushed to the end of the day!

Will Rosenthal [9:21 AM]
It seems the biomass information has information in it that is associated with non-taxonomic things (e.g. soil crusts, litter, and some "sampled" information). How would you handle that?
*ground cover information, not biomass

Margaret O'Brien [9:25 AM]
Looking through some notes. we have run into that before (e.g., non-taxon cover, like bare ground)
one sec
since it's not a taxon, it should go into ancillary, but exactly how is what I am looking up

Margaret O'Brien [9:32 AM]
The problem I am seeing (which I think we have a solution for) is that you may need to include several of these, e.g., cover_soil_crust, cover_litter, cover_rocks, and the observation_ancillary table does not have a place to record the 'thing' being measured, only the variable_name. So I am looking up how we did this for other datasets. I think Corinna has done some of these.

Will Rosenthal [9:33 AM]
Okay, thank you

Margaret O'Brien [3:21 PM]
Hi @Will Rosenthal I found a good example:
here is the taxon table from edi.194.1 (the L1); the L0 is knb-lter-mcr.7.30
`
taxon_id,taxon_rank,taxon_name,authority_system,authority_taxon_id
tx_1,NA,Echinostrephus aciculatus,WORMS,513245
tx_2,NA,Diadema savignyi,WORMS,213375
tx_3,NA,No invertebrate observed,NA,NA
`

and the observation table looks like this:
`
grep tx_3 Annual_invertebrate_surveys_observation.csv | head
ob_5,ev_5,edi.194.1,lo_4_5,2006-01-14,tx_3,count,0,NA
ob_13,ev_13,edi.194.1,lo_4_13,2006-01-14,tx_3,count,0,NA
ob_16,ev_15,edi.194.1,lo_4_15,2006-01-14,tx_3,count,0,NA
ob_19,ev_17,edi.194.1,lo_4_17,2006-01-14,tx_3,count,0,NA
ob_48,ev_41,edi.194.1,lo_4_41,2005-05-20,tx_3,count,0,NA
ob_49,ev_42,edi.194.1,lo_4_42,2005-05-20,tx_3,count,0,NA
ob_50,ev_43,edi.194.1,lo_4_43,2005-05-20,tx_3,count,0,NA
`

Margaret O'Brien [3:42 PM]
so the non-taxa still get a taxon identifier, but then in the taxon table, the taxon_name is whatever it needs to be (bare ground, rocks, etc), and the authority_system and authority_taxon_id are NA
Here is the link: https://portal.edirepository.org/nis/mapbrowse?packageid=edi.194.1
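
A small sketch of the pattern described above: every name that appears in the cover data gets a taxon_id, and names without a taxonomic authority keep NA in the authority columns. The function name `build_taxon_table` and the input list are hypothetical; the two WORMS entries are copied from the edi.194.1 example.

```python
def build_taxon_table(names, known_authorities):
    """Assign a taxon_id to every name, taxon or not.

    known_authorities maps a taxon name to (authority_system,
    authority_taxon_id); names missing from it get NA for both.
    """
    rows = []
    for index, name in enumerate(names, start=1):
        system, authority_id = known_authorities.get(name, ("NA", "NA"))
        rows.append({
            "taxon_id": f"tx_{index}",
            "taxon_rank": "NA",
            "taxon_name": name,
            "authority_system": system,
            "authority_taxon_id": authority_id,
        })
    return rows


known = {
    "Echinostrephus aciculatus": ("WORMS", "513245"),
    "Diadema savignyi": ("WORMS", "213375"),
}
names = ["Echinostrephus aciculatus", "Diadema savignyi", "bare ground"]

for row in build_taxon_table(names, known):
    print(",".join(row.values()))
```

With the non-taxa in the taxon table, cover of bare ground, litter, or rocks can stay in the observation table alongside the taxon cover, linked by taxon_id.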