In need of a Source object #144

Closed
EssyGreen opened this Issue Feb 19, 2012 · 184 comments

@EssyGreen

I realise the infinite flexibility of inheriting everything from a Resource and hence allowing it to be considered a source but this also makes for infinite complexity and infinite nonsense!

To allow, for example, a DatePart to be used as evidence is nonsensical. Theoretically we could cite millions of documents with a DatePart of "Day=1" but it means absolutely nothing without the wider context of the Date, which means nothing without the wider context of the Fact.

I think we need a definite Source object (probably equating, or similar, to the Record) and it is this (not the Resource) which should be referenced in Citations and used in Evidence.

@stoicflame
Member

There already is a Source object, although we already recognize it's inadequate.

But I think what you want is the requirement for all source references to resolve to an instance of that thing?

@jralls
Contributor
jralls commented Feb 22, 2012

Just a different facet of #146

@EssyGreen

There already is a Source object

Forgive me for being dim but the link just shows the "Description" with all the DC meta data .... I get that the DC meta data is effectively attributes/properties of the "Description" but where in the model is the "Description" object (which you say equates to a Source)?

I think what you want is the requirement for all source references to resolve to an instance of that thing?

Er, yes. How can they not? Surely all references to the same source should be links to the same Source object?

@stoicflame
Member

Just a different facet of #146

Yeah, those were my thoughts, too. I'm trying to figure out the difference.

but where in the model is the "Description" object

Umm... sorry... I don't understand the question. Where in the model? It's part of the model...

How can they not?

You could refer to an image as a source. Or a multimedia file. Or a web page. Or anything else that can be identified with a URI.

Surely all references to the same source should be links to the same Source object?

Sure. But not everything cited as a source needs to be an instance of that same type.

@jralls
Contributor
jralls commented Feb 23, 2012

Or anything else that can be identified with a URI.

And that is the problem. As the model is presently expressed every reference reduces to a URI. I could enter a standard place, or some persona, or some random Slashdot article as a source. OK, a Slashdot article might be a valid source. Unlikely but possible. The real problem would be an internal object, an easy error to create, and which might cause actual trouble (think a is a source of b is a source of c is a source of a = crash).
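To make the loop concrete, here is a minimal Python sketch of how an application might detect a circular chain of source references before it causes trouble. The `sources_of` mapping and the `find_source_cycle` name are invented for this illustration; nothing here is part of the GEDCOM X model.

```python
from typing import Dict, List, Optional


def find_source_cycle(sources_of: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one circular chain of source references, or None if there is none.

    `sources_of` maps an object's URI to the URIs it cites as sources
    (a hypothetical in-memory structure, not a GEDCOM X type).
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = {uri: WHITE for uri in sources_of}
    path: List[str] = []

    def visit(uri: str) -> Optional[List[str]]:
        state[uri] = GREY
        path.append(uri)
        for cited in sources_of.get(uri, []):
            if state.get(cited, WHITE) == GREY:
                # `cited` is already on the current path: this is a loop.
                return path[path.index(cited):] + [cited]
            if state.get(cited, WHITE) == WHITE:
                found = visit(cited)
                if found:
                    return found
        path.pop()
        state[uri] = BLACK
        return None

    for uri in list(sources_of):
        if state[uri] == WHITE:
            found = visit(uri)
            if found:
                return found
    return None
```

Called with `{"a": ["b"], "b": ["c"], "c": ["a"]}`, the function returns the chain `["a", "b", "c", "a"]`; URIs that point outside the file (a web page, an image) are simply treated as leaves.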

@EssyGreen

@stoicflame

There already is a Source object, [...]

I asked where and you gave me a link to a "Description" definition (http://www.gedcomx.org/model/dcterms_Description.html). Maybe it's just that the .Net version is broken or something, but there RDFDescription is never referenced anywhere, so I can only assume it's just the meta data for the whole GEDCOMX file ... ergo the file itself is the one and only source and all sources must be in separate GEDCOMX files referenced by their URIs.

If this is the case then as @jralls says any uri can be used as a Source and if that is so then there is no guarantee that there is any useful data whatsoever in the related uri since there is no guarantee it actually is a GEDCOMX file (and not just a jpg, a PC application, a link to a virus, etc).

@stoicflame
Member

And that is the problem. As the model is presently expressed every reference reduces to a URI.

Okay. How else would you suggest the serialization format (de)reference other objects? simple string?

The real problem would be an internal object, an easy error to create, and which might cause actual trouble (think a is a source of b is a source of c is a source of a = crash).

So that's an error in the data. Agreed.

How is that any different from any other data error cases that will need to be intelligently handled by the application? I can write an "I'm my own grandpa" loop, too.

@stoicflame
Member

ergo the file itself is the one and only source and all sources must be in separate GEDCOMX files referenced by their URIs.

No... that's not the intent of that at all. The intent of that object was to be akin to a Source object like you're asking for here in this thread. Obviously, that needs to be clarified, so I'll use this issue to track that work.

@jralls
Contributor
jralls commented Feb 23, 2012

And that is the problem. As the model is presently expressed every reference reduces to a URI.

Okay. How else would you suggest the serialization format (de)reference other objects? simple string?

See #146. But:

The serialization isn't the model. The model describes the internal data structures that are serialized and that result when the stream is deserialized. So in the model references to other objects in the serial stream should be references to the class of those other objects, not the URI that is the proxy for the reference in the stream. Deserialization will have to be two passes: The first to construct all of the objects in the stream and the second to resolve the URIs into references and validate them.
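A minimal Python sketch of that two-pass idea follows. The record layout (dicts with `type`, `uri` and `sources` keys) and the `Source`, `Fact` and `load` names are assumptions made up for the example; they are not the GEDCOM X serialization or its classes.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Source:
    uri: str


@dataclass
class Fact:
    uri: str
    source_uris: List[str]                                # as read from the stream
    sources: List[Source] = field(default_factory=list)   # filled in by pass 2


def load(records: List[dict]) -> Dict[str, object]:
    """Two-pass load: build every object, then turn URIs into typed references."""
    objects: Dict[str, object] = {}

    # Pass 1: construct every object in the stream, keyed by its URI.
    for rec in records:
        if rec["type"] == "Source":
            objects[rec["uri"]] = Source(rec["uri"])
        elif rec["type"] == "Fact":
            objects[rec["uri"]] = Fact(rec["uri"], list(rec.get("sources", [])))
        else:
            raise ValueError(f"unknown record type: {rec['type']!r}")

    # Pass 2: resolve each URI and check that it points at the expected class.
    for obj in objects.values():
        if isinstance(obj, Fact):
            for uri in obj.source_uris:
                target = objects.get(uri)
                if not isinstance(target, Source):
                    raise ValueError(f"{obj.uri}: {uri} is not a Source")
                obj.sources.append(target)
    return objects
```

Because resolution happens only after every object exists, a Fact may appear in the stream before the Source it cites, and a reference to something that is not a Source is rejected at load time rather than discovered later.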

@EssyGreen

The serialization isn't the model. The model describes the internal data structures that are serialized and that result when the stream is deserialized. So in the model references to other objects in the serial stream should be references to the class of those other objects, not the URI that is the proxy for the reference in the stream.

+++++++++1

@jralls
Contributor
jralls commented Feb 24, 2012

Deserialization will have to be two passes: The first to construct all of the objects in the stream and the second to resolve the URIs into references and validate them.

If the file format includes an index of the objects with their types as @nealmcb is asking for in #140 the first pass wouldn't be necessary.

(Yeah, I'm talking to myself. ;-) )

@EssyGreen

LOL!

@stoicflame
Member

(I know I'm attempting to revive a stale thread here.)

Given the new set of specifications and recipes that attempt to clarify how to model the "source object" you seek, what still needs to be addressed here?

@EssyGreen

Just catching up ... may take a while .... bear with me!

@jralls
Contributor
jralls commented Jun 10, 2012

Given the new set of specifications and recipes that attempt to clarify how to model the "source object" you seek, what still needs to be addressed here?

See #164 & #165

@EssyGreen

Given the new set of specifications and recipes that attempt to clarify how to model the "source object" you seek, what still needs to be addressed here?

I'm still wading through the amended spec having been awol for some months but as far as I can see there is still no source object - just a "SourceReference" with an id, type, description and attribution. If the same source is referenced multiple times then I presume these will also be replicated throughout the file creating a fragmented but inadequate source "record".

@jralls
Contributor
jralls commented Jun 17, 2012

Over in #134 Sarah and I got going on Source analysis and its importance to good genealogical work.

In #156, Ryan expressed the mission of GedcomX:

The purpose of GEDCOM X has been stated as:

To define an open data model and an open serialization format for exchanging the components of the genealogical proof standard.

When we talk about "the components of the genealogical proof standard," we mean these:

  • Search Reliable Sources
  • Cite Each Source
  • Analyze Sources, Information, and Evidence
  • Resolve Conflicts
  • Make a Soundly-Reasoned Conclusion

Which also recognizes the importance of source analysis. How then should GedcomX record the source analysis? The logical answer to me is to have a proper Source object with a citation property (what's currently called a "Description") and an analysis property (which can be just a long string).

@EssyGreen

Can you clarify what you mean by "source analysis"? Is this what I call an "interpretation" ie working out what is explicitly and implicitly detailed in a single source?

@jralls
Contributor
jralls commented Jun 17, 2012

Source analysis has three phases:

  • Analyze the source itself: The provenance, the informant, whether the source is contemporary to the events or characteristics it records, how legible it is, etc., and through that analysis make an evaluation of the reliability of each "evidence statement" contained in the source.
  • Extract the evidence which is directly stated in the document.
  • Analyze what the source says in the context of your understanding of the requirements and customs which govern similar sources from that place and time to infer indirect evidence. If the source is part of a larger source (e.g., a single household in a census), consider it in the context of the surrounding entries for more potential inferences. Pay attention as well for information which one ordinarily finds in the type of source at hand but which are missing from this one.

It's important to not try to make connections to evidence from other sources while you're doing this analysis so that you don't read in something that isn't there -- or miss something that is -- because "pieces of the puzzle" seem to fit.

Yes, you could call it interpretation if you like, but "working out what is explicitly and implicitly detailed" leaves out the first part.

@EssyGreen

Yes that's what I call interpretation :)
I prefer to model this in the same way as in the Conclusion model rather than a single text description ... the only difference between them is the number of sources being analysed.

@jralls
Contributor
jralls commented Jun 18, 2012

OK. How would you structure the Source class and how would you tie the elements into the conclusional Persons, Relationships, and Events?

@EssyGreen

Briefly - simplistic syntax:

Source is top level entity with properties:

  • Persons
  • Relationships
  • Events
  • Evidence (=Sources)
  • Proof (text)

Quick example:
Source 1: Marriage Cert for Fred Bloggs & Freda Jacobs, GRO ref 1234ABC etc etc

  • Person 1: Fred Bloggs
  • Person 2: Freda Smith
  • Relationships, Marriage event etc
  • Proof: Original copy held by Boris Biggles - as descendant of Fred Bloggs
  • Notes: Fred & Freda sign using their "mark"

Source 2: Birth Cert for Fred Bloggs, GRO ref 1234XYZ etc etc

  • Person 1: Fred Bill Blogs
  • Person 2: Freda James
  • Person 3: Frederick Blogs
  • Relationships, Birth event etc
  • Evidence: ANOther Source
  • Proof: Derivative from ANOther Source

Source 3: Birth Cert for Fred Bloggs, GRO ref 1234ABC etc etc

  • Person 1: Fred Bill Blogs
  • Person 2: Freda Smith
  • Person 3: Frederick Blogs
  • Relationships, Birth event etc
  • Proof: Certified copy with photocopy from original register

Source 4: My Family Tree, author: Sarah Green other attributes of source etc etc

  • Person 1: Fred Bill Bloggs
  • Person 2: Freda Smith
  • Person 3: Frederick Bloggs
  • Relationships, Birth event, Marriage event etc
  • Evidence: Source 1 (+5), Source 2 (-4), Source 3 (+4)
  • Proof: The marriage cert. is considered to be accurate since it is held in the family records. Two births were found for Fred - the first (1234XYZ) is thought to be inaccurate since the mother's maiden name is James whereas the second (1234ABC) matches that of the marriage certificate. The slight difference in the spelling of the names is considered insignificant since neither Fred nor Freda was literate (as seen from the "Mark" on their marriage certificate).

NB: Persons are contained within each source, not pointers to somewhere else
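The example above can be sketched roughly in Python as follows. The class and field names are invented for the illustration, the titles are shortened, and reading a positive weight as "supports" and a negative weight as "counts against" is an assumption about what the (+5)/(-4) figures mean.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Person:
    name: str


@dataclass
class Evidence:
    source: "Source"   # pointer to another Source
    weight: int        # assumed meaning: > 0 supports, < 0 counts against


@dataclass
class Source:
    title: str
    persons: List[Person] = field(default_factory=list)    # contained, not shared
    evidence: List[Evidence] = field(default_factory=list)
    proof: str = ""


marriage = Source("Marriage cert, GRO ref 1234ABC",
                  [Person("Fred Bloggs"), Person("Freda Smith")])
birth_xyz = Source("Birth cert, GRO ref 1234XYZ",
                   [Person("Fred Bill Blogs"), Person("Freda James"),
                    Person("Frederick Blogs")])
birth_abc = Source("Birth cert, GRO ref 1234ABC",
                   [Person("Fred Bill Blogs"), Person("Freda Smith"),
                    Person("Frederick Blogs")])

# The tree is itself a Source; it holds its own Person objects and points
# at the three certificates through weighted Evidence entries.
tree = Source("My Family Tree",
              [Person("Fred Bill Bloggs"), Person("Freda Smith"),
               Person("Frederick Bloggs")],
              [Evidence(marriage, 5), Evidence(birth_xyz, -4),
               Evidence(birth_abc, 4)],
              proof="See narrative above.")

supporting = [e.source.title for e in tree.evidence if e.weight > 0]
against = [e.source.title for e in tree.evidence if e.weight < 0]
```

Each Source owns its own Person objects, so "Freda Smith" in the tree and "Freda Smith" on the marriage certificate are separate records that happen to carry the same name.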

@EssyGreen

Hmm actually that's not quite right ... The proof of Source 4 should actually be the proof for Fred (or if you like for Fred's birth event) not the proof of the whole tree! Apologies.

@jralls
Contributor
jralls commented Jun 18, 2012

OK.
Are the Persons, Relationships, and Events objects in their own right or are they just strings? Yes, I saw the note at the bottom about them being contained, but they could still be structures with elements like Name, Age, Sex, etc. for the Persons.
Is "Evidence" the source-citation data (what the present spec calls the Description)?
Should Sources 1 - 3 have an Event, since that's what they seem to describe?
Shouldn't the relationships be "Person1 married Person2" (Source 1) and "Person3 child of Person1, Person3 child of Person2" (Sources 2 and 3)?

Isn't "Source 4" really a set of conclusion-model objects (Two Events, the marriage and the birth, 3 Persons, Fred, Freda, and Frederick, and 3 Relationships), each of which has the appropriate SourceReferences pointing back to Sources 1 - 3, along with the appropriate proof statements? (Here I'm assuming that you're not using your "My Family Tree" database as a source for some other database.)

Where would you put "Facts" (for example, the ages of the bride and groom from the marriage certificate)?

@EssyGreen

Are the Persons, Relationships, and Events objects in their own right or are they just strings?
Objects - I just used strings to simplify syntax here

they could still be structures with elements like Name, Age, Sex, etc. for the Persons.
Yup - again simplified for brevity of example

Is "Evidence" the source-citation data (what the present spec calls the Description)?
Probably similar - tho' I don't really understand what the GEDCOM X Description is or what it's trying to do

Should Sources 1 - 3 have an Event, since that's what they seem to describe?
Yup they can have events but the evaluation in the example is being done against the whole set not just any one single event so as to retain the context (i.e. Fred's birth place etc is intrinsically linked to the Father's name and occupation within the source so I wouldn't want to be able to cite one and quietly drop the other).

Shouldn't the relationships be "Person1 married Person2" (Source 1) and "Person3 child of Person1, Person3 child of Person2" (Sources 2 and 3)?
I didn't actually detail the relationships for simplicity but in my world they would be Source 1 (man/wife with marriage event & roles etc), Source 2/3: ditto plus child/mother and child/father

Isn't "Source 4" really a set of conclusion-model objects

If you're happy for the definition of a Source to be "a set of conclusion-model objects" then yes

I'm assuming that you're not using your "My Family Tree" database as a source for some other database

Why would you assume that? That's sort of my point that it is a source (albeit one being changed as the research progresses)

Where would you put "Facts" (for example, the ages of the bride and groom from the marriage certificate)?

  • either as role of event or characteristics of person whichever you prefer :)

@jralls
Contributor
jralls commented Jun 18, 2012

Is "Evidence" the source-citation data (what the present spec calls the Description)?

Probably similar - tho' I don't really understand what the GEDCOM X Description is or what it's trying to do

It's trying to use DC/RDF to construct a citation.

I think I see where you're trying to go: To collect several sources, extract the direct evidence for each of them, then treat them together in a proof argument to generate a second-level source (meta-source?) which you would then reference in the conclusion bits (e.g., the birth and marriage events).

I find the concept attractive. Is that what you meant?

@EssyGreen

It's trying to use DC/RDF to construct a citation

Indeed but personally I don't give a fig about DC/RDF :)

To collect several sources, extract the direct evidence for each of them, then treat them together in a proof argument to generate a second-level source

Yes effectively ... The tree is a source and can be used elsewhere as one, say in another tree .. which is also a source ... ad infinitum

@jralls
Contributor
jralls commented Jun 19, 2012

It's trying to use DC/RDF to construct a citation

Indeed but personally I don't give a fig about DC/RDF :)

Right, but you do care about good citations... at least you've said that you do. So, is a citation (title, creator, publication data or repository, date, etc.) what goes into your "Evidence" field?

The tree is a source and can be used elsewhere as one, say in another tree .. which is also a source ... ad infinitum

Well, the tree can be a source if it is used elsewhere, but I don't think it should be a source for itself.

@EssyGreen

Right, but you do care about good citations... at least you've said that you do. So, is a citation (title, creator, publication data or repository, date, etc.) what goes into your "Evidence" field?

The evidence object would be a pointer to the Source record (=GEDCOM X source "Description"?) which would contain the title, creator, publication etc (otherwise I find this becomes massively duplicated throughout a file). Hence the citation/evidence item itself only needs an optional equivalent to the GEDCOM "Where in Source" where the source is sufficiently large to warrant it.

the tree can be a source if it is used elsewhere, but I don't think it should be a source for itself.

I would agree :) Though not sure how you would code up that constraint.

@jralls
Contributor
jralls commented Jun 19, 2012

The evidence object would be a pointer to the Source record (=GEDCOM X source "Description"?) which would contain the title, creator, publication etc (otherwise I find this becomes massively duplicated throughout a file). Hence the citation/evidence item itself only needs an optional equivalent to the GEDCOM "Where in Source" where the source is sufficiently large to warrant it.

Oh, then it's a SourceReference. OK. I'd do that differently, embedding the citation in the Source object and use the SourceReference to point to the whole thing. There need be only one citation, one note analyzing the source's quality, provenance, and so on, and one extraction of persons, places, events, characteristics, etc. Might as well keep the whole thing together; everything that refers to it uses a reference.

the tree can be a source if it is used elsewhere, but I don't think it should be a source for itself.

I would agree :) Though not sure how you would code up that constraint.

Nor am I. I suspect that it's not worth the effort, as anyone who published something containing self-references would be a laughingstock, and there's only so far one can go to protect novices from themselves.

@EssyGreen

There need be only one citation, one note analyzing the source's quality, provenance, and so on, and one extraction of persons, places, events, characteristics, etc. Might as well keep the whole thing together; everything that refers to it uses a reference.

Not necessarily one citation since there could be different interpretations/hypotheses of the same evidence set. Hence each source could be used in multiple places and hence the need for a fat source record and a slim-line citation.
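A minimal Python sketch of the fat-record / slim-citation split, for illustration only. `SourceRecord`, `Citation` and their fields are invented names, not GEDCOM X types, and the record contents are placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceRecord:
    """The "fat" record: described once, however often it is cited."""
    id: str
    title: str
    creator: str
    publication: str


@dataclass
class Citation:
    """The "slim" citation: a pointer plus an optional locator."""
    source: SourceRecord
    where_in_source: Optional[str] = None   # cf. GEDCOM "Where in Source"


register = SourceRecord("S1", "Example register", "Example creator",
                        "Example publication data")

# Two hypotheses can cite different parts of the same record without
# repeating its title, creator or publication details.
cite_a = Citation(register, "page 12")
cite_b = Citation(register, "page 40")
cite_c = Citation(register)   # small source: no locator needed
```

All three citations share one `SourceRecord` instance, so a correction to the title or publication data is made in exactly one place.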

@jralls
Contributor
jralls commented Jun 19, 2012

Not necessarily one citation since there could be different interpretations/hypotheses of the same evidence set. Hence each source could be used in multiple places and hence the need for a fat source record and a slim-line citation.

Hmm. Sounds muddled. Try modeling your approach with different classes for first-level sources (#1 - 3 in your examples) and second-level sources (#4). I think the hypotheses will wind up in the second-level class, which won't have a citation but will have one-to-many SourceReferences (in the example, pointing back to #1 - 3).

@EssyGreen

Try modeling your approach with different classes for first-level sources (#1 - 3 in your examples) and second-level sources (#4)

I'm not clear on your links here ... Are you saying that a "1st level source" is a Conclusion (#1)? and what in @carpentermp's post (#4) are you saying is a "2nd level source"?

@jralls
Contributor
jralls commented Jun 20, 2012

I'm not clear on your links here ... Are you saying that a "1st level source" is a Conclusion (#1)? and what in @carpentermp's post (#4) are you saying is a "2nd level source"?

Sorry, that was a markdown misfire. I didn't mean to link to other issues, I was referring to source numbers 1 - 4 in your earlier comment. So source numbers 1 - 3 I'm calling "1st-level" because they each cite an external source document (a marriage certificate and two birth certificates) and source number 4 I'm calling "2nd-level" because it cites several 1st-level sources and analyzes them collectively.

@EssyGreen

Thank goodness for that! I was getting really lost :)

I think the hypotheses will wind up in the second-level class, which won't have a citation but will have one-to-many SourceReferences

So still trying to follow you .. you're saying that the hypothesis (4) will wind up as a second-level class ( er ... well you've defined 4 as 2nd class so I guess I can't argue with that)

... and it won't have a citation ... er but it has 3 in the Evidence bag

... but it will have one-to-many SourceReferences ... er yes but these are the citations

Sorry but I'm just not following your argument here, could you re-phrase?

@jralls
Contributor
jralls commented Jun 20, 2012

OK, try this:

Source

  • id {locally-unique string identifier}
  • citation {contains creator, title, pub data, etc.}
  • direct evidence {persons, places, events, relationships, etc.}
  • indirect evidence {the same, but inferred rather than direct}
  • analysis {condition, provenance, apparent quality of informant, etc.}

Synthesis

  • id {locally-unique string identifier}
  • Attribution {researcher & date}
  • Proof argument {Narrative analysis of a set of relevant sources to arrive at a set of conclusions; includes resolution of conflicts}
  • List of Source ids

Conclusion

  • id
  • Synthesis id

Persons, Relationships, Events, Places, and Dates extend Conclusion with additional properties.

I prefer "Synthesis" to "Hypothesis" because at some point one becomes satisfied that one has completed the requirements of the GPS and that those points are "proved". It also emphasizes that one is pulling together evidence from a range of sources.

The last bit, Conclusion, doesn't fit the present GedcomX model exactly because Persons and Relationships are separate classes from Conclusions, but for the purposes of this discussion I don't think the distinction is important.

Going back to your original example, I'd categorize the first three "sources" with the Source class and the fourth with the Synthesis class.
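For readers who prefer code, the three structures above might look roughly like this in Python. This is only a sketch: plain strings stand in for the structured citation and evidence, and the `Person` subclass and `sources_behind` helper are hypothetical additions to show how the ids link up.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Source:
    id: str
    citation: str                                   # creator, title, pub data, etc.
    direct_evidence: List[str] = field(default_factory=list)
    indirect_evidence: List[str] = field(default_factory=list)
    analysis: str = ""                              # condition, provenance, ...


@dataclass
class Synthesis:
    id: str
    attribution: str                                # researcher & date
    proof_argument: str
    source_ids: List[str] = field(default_factory=list)


@dataclass
class Conclusion:
    id: str
    synthesis_id: str


@dataclass
class Person(Conclusion):
    """One example of a class extending Conclusion with extra properties."""
    name: str = ""


def sources_behind(conclusion: Conclusion,
                   syntheses: Dict[str, Synthesis],
                   sources: Dict[str, Source]) -> List[Source]:
    """Follow Conclusion -> Synthesis -> Sources by id."""
    synthesis = syntheses[conclusion.synthesis_id]
    return [sources[sid] for sid in synthesis.source_ids]
```

The point of the indirection is that a conclusion never cites a document directly; it cites the proof argument, which in turn lists the sources that were weighed.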

@EssyGreen

Terminology is getting confusing ... here's my definitions (leaving aside code for a moment):

Source - something which holds data of interest to the researcher. Needs to contain details which describe the origin/provenance, author, publisher, owner's reference numbers, scanned images etc etc

Interpretation - a transformation (by someone) of the raw data in a single Source into meaningful information represented by one or more Persons, Relationships, Events and/or Characteristics. A Source may have none/many Interpretations (either made by different people or by the same person if, for example, the data is ambiguous). An Interpretation has a single Source (and hence is contained within it). An Interpretation is a "Derived" Source (because it transforms the original data) - if the interpreter is the researcher then this is somewhat superfluous but if not this is critical to identifying the source as secondary.

Hypothesis - a theory about one or more Persons, Relationships, Events and/or Characteristics made by a researcher as a result of their analysis of the Interpretations of multiple Sources. A Hypothesis with no Sources could theoretically exist but is just a fantasy of the author. If a Source has no explicit "Interpretation" then one must have been made implicitly by the researcher. A Hypothesis is a "Derived" Source (because it amalgamates or "synthesises" bits and pieces from a variety of sources).

Evidence - a collection of references to Sources (each of which contributes towards the probability of a Hypothesis being true/false) plus a verbatim Analysis of the logic/reasoning/assumptions/anomalies. A Hypothesis has one "Evidence bag". Evidence must be contained within a Hypothesis.

@jralls
Contributor
jralls commented Jun 21, 2012

OK, rewriting my set of structs above to use your terms:

Source

  • id {locally-unique string identifier}
  • citation {contains creator, title, pub data, etc.}
  • analysis {condition, provenance, apparent quality of informant, etc.}

Interpretation

  • id
  • Source:id
  • direct evidence {persons, places, events, relationships, etc.}
  • indirect evidence {the same, but inferred rather than direct}
  • Attribution {Researcher and date}

Hypothesis

  • id {locally-unique string identifier}
  • Attribution {researcher & date}
  • Evidence
    • Proof argument {Narrative analysis of a set of relevant sources to arrive at a set of conclusions; includes resolution of conflicts}
    • List of Interpretation ids

Conclusion

  • id
  • Synthesis id

@EssyGreen

Thanks :) I'm with you now I think ...

I agree with your Source and largely with your Interpretation (tho I see an Interpretation as embedded within a Source rather than referencing it). I think your Hypothesis is different to mine and I can't see that your Conclusion/Synthesis is doing anything (it seems to have no data).

We disagree on the following points:

  • Direct/Indirect ... your way of handling this is to have a clear distinction between the two; my way is just to make the Interpretation explicit - whether something is direct/indirect is then evident just by comparing the original Source data with the Interpretation. I prefer this because whether something is direct or indirect is largely a matter of interpretation (that word again!). For example, if someone is cited as a "Step-father" on a specific date to me that is a fairly explicit statement that the biological father has died before that date but another researcher might disagree. So having a flag for direct/indirect doesn't really give any benefit/value or mean anything to anyone other than the author. Conversely, the researcher themselves might see that something could mean two (or more) things - the Direct/Indirect approach doesn't allow them to express the differences; whereas having two (or more) Interpretations would clarify this. The important thing for the reader (and the researcher) is to make it clear what/how they extracted/interpreted the data.
  • Conclusion/Synthesis vs Hypothesis ... I strongly believe that there is no "Conclusion" in genealogy (I have a real problem with the "Conclusion" model but that's another argument!) ... just a Hypothesis which is currently favoured due to the analysis of the available data. More data might change it and swing favour another way. This is why I believe it is important for the Hypothesis to exist with an assessment/analysis of the sources which are relevant to it. Events, Characteristics and Relationships are all Hypotheses waiting for more "Evidence" to disprove them.

@jralls
Contributor
jralls commented Jun 22, 2012

I separated out Interpretation to highlight your definitions from yesterday. It could just as easily be:

Source

  • id {locally-unique string identifier}
  • citation {contains creator, title, pub data, etc.}
  • analysis {condition, provenance, apparent quality of informant, etc.}
  • Interpretation
    • direct evidence {persons, places, events, relationships, etc.}
    • indirect evidence {the same, but inferred rather than direct}

My reasoning for separating direct and inferred evidence is that the original source may not always be available, and it's an important distinction. That's especially true in a collaborative situation, where one researcher might consult a source and, using GedcomX, report back to the team. I disagree with you that it's a matter of interpretation: If the evidence is clearly stated in the document (so-and-so is the HoH's stepfather in your example), that's direct evidence. Anything else is inferred (the HoH's natural father's death or divorce in your example). A careful researcher will document her reasoning for inferred evidence.

The Conclusion element was meant to stand in for the rest of GedcomX: the Persons, Relationships, Events, Places, and Dates that the program needs to have in structured form in order to generate charts and reports. If you're not going to use that part, you might just as well use Evernote and your favorite word processor -- a lot of professional genealogists do.

@EssyGreen

My reasoning for separating direct and inferred evidence is that the original source may not always be available, and it's an important distinction

Yes I had considered that problem ... my approach would be to have a transcription field (or possibly a derived source which is the transcription) which would provide the "original".

A careful researcher will document her reasoning for inferred evidence.

I totally agree ... it's just a matter of how and where that is done. I would do it by providing either an image copy or a transcription as part of the source; and having a hypothesis for the deceased father with a statement quoting the original e.g. something like "Death of Bob: before 1851" - "In the census of 1851, Joey is referred to as the step-son of Fred Bloggs. Since divorce was rare and expensive at this time, it is probable that Bob had died before this date." I might add another hypothesis to cover the possibility of divorce if I felt it was worth further research or I might leave this until/unless further evidence was found which swayed the hypothesis one way or another. I believe this is clearer than a flag/code of "Direct/Indirect".

The Conclusion element was meant to stand in for the rest of GedcomX

Ah I see! I didn't realise that - wasn't attempting to negate the rest of GEDCOM X - just didn't understand you. ... So your Synthesis id = my Hypothesis id?

@jralls
Contributor
jralls commented Jun 22, 2012

transcription field

The GDM had that; they generalized it to a "representation", which could be an image, a transcript, or an abstract, and a source could have more than one.

a hypothesis for the deceased father with a statement quoting the original

That really gets to the heart of it, I think. Inferred evidence needs some sort of a statement along with it. One could even go so far as to say that it doesn't belong in the source at all.

So your Synthesis id = my Hypothesis id?

Yes. While I agree with you about the permanence of genealogical conclusions (that they're not always conclusive), I take the view that one forms a hypothesis from the first round of research, then designs new research to try to confirm or refute it. Once one has completed the requirements of the GPS, it isn't a hypothesis anymore. No matter, so long as we can agree about what goes into each class then the class names don't matter a bit.

@EssyGreen

I think we're pretty much in alignment

Inferred evidence needs some sort of a statement along with it. One could even go so far as to say that it doesn't belong in the source at all.

I'm happy for it to be in the Hypothesis which is where I see the bulk of the important analysis but to be honest it could go in any old note if the researcher wanted to annotate elsewhere.

Once one has completed the requirements of the GPS, it isn't a hypothesis anymore

I believe they differentiate between a Hypothesis and a Theory but it is just a matter of how much evidence there is so I don't see the need for separate structure.

@jralls
Contributor
jralls commented Jun 22, 2012

I believe they differentiate between a Hypothesis and a Theory but it is just a matter of how much evidence there is so I don't see the need for separate structure.

No, the GPS doesn't use either term. Elizabeth Mills started using "hypothesis" in her research process lectures last year in the same way we've been using it here, but I've never heard any of the top lecturers use "theory" in any formal (as opposed to conversational) sense.

I'm not suggesting separating them, either. It's just the reason why I prefer the name "Synthesis" over "Hypothesis".

@EssyGreen

ESM uses the two separately:

Hypothesis - a proposition based on an analysis of evidence at hand [...]
Theory - a tentative conclusion reached after a hypothesis has been extensively researched [...]

(see EE p17)

Will have to agree to disagree over the term Synthesis :)

@jralls
Contributor
jralls commented Jun 22, 2012

OK, always nice when Ms. Mills agrees with me! ;-) My copy of EE is at home in California, and I'm in Ireland, so it will be a couple of weeks before I can look at the reference.

Anyway, it's not important enough to disagree about. We agree about what goes where, now we need to get Ryan to do something with it.

@thomast73
Contributor

Lots of good discussion here!

In GedcomX, there is a class in the model called Description. This issue is largely about what the Description class ought to look like.

We would like to modify the Description class to address some of the issues being raised here. In the coming posts, I hope to describe some modifications that are planned and get feedback relative to the discussion here and otherwise.

@thomast73
Contributor

First, since some effort has been put toward definitions, I am going to give some of my own definitions in hopes you will have a better chance at being able to interpret what I am saying:

  • Source: the thing with the evidence of interest to the researcher – i.e., a document, preservation image, recording, book, headstone, or other such thing; the thing that contains or manifests the evidence relevant to the research problem
  • Citation: {@jralls}contains creator, title, pub data, etc.
  • Extractions (aka: {@EssyGreen}Interpretation, {@jralls}Evidence): {@EssyGreen}a transformation (by someone) of the raw data in a single Source into meaningful information represented by one or more Persons, Relationships, Events and/or Characteristics.
    • This is a sticky one; by nature it is a derivative of the source, and also becomes a source; in the model we discuss, one question will be about whether this is part of the source Description or whether it needs to be handled some other way, with a source Description to document its source
  • Analysis: {@jralls}condition, provenance, apparent quality of informant, etc.
  • Description (aka: SourceDescription, {@EssyGreen, @jralls}Source): a description of the Source; what ought to be included in a source “description” is the topic of this issue; it is also the current name of the class being discussed in this issue; it could be that calling the class SourceDescription would help people be less confused about what its purpose is, and we may get to that on this thread as well.
  • Conclusion (aka: Assertion, {@jralls}Synthesis, {@EssyGreen}Hypothesis): a statement of what the researcher believes after analyzing the set of evidence in hand
    • In GedcomX, certain classes represent conclusions -- e.g., a Person, a Relationship, a Fact, etc.
    • Yes, I have read and comprehend @EssyGreen's concern about calling this by this name; I share the belief that conclusions ought to be subject to being reworked, but do not share the belief that a name change will clarify things
  • SourceReference: associates a source Description with the object holding the reference (e.g., a Conclusion)

These may need to be refined a bit, but let's start there.
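To make the vocabulary a little more concrete, here is a minimal Python sketch of how three of the terms could relate. Every class and field name below is hypothetical and chosen only for illustration; none of this is the actual GedcomX class layout, and the example data is invented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SourceDescription:
    """Describes a Source (document, image, headstone, ...); it is not the Source itself."""
    id: str
    citation: str        # creator, title, publication data, etc.
    analysis: str = ""   # condition, provenance, apparent quality of informant


@dataclass
class SourceReference:
    """Associates a SourceDescription with the object holding the reference."""
    description_id: str


@dataclass
class Conclusion:
    """What the researcher believes after analyzing the evidence in hand."""
    kind: str            # e.g. "Person", "Relationship", "Fact"
    value: str
    sources: List[SourceReference] = field(default_factory=list)


# Hypothetical example data, invented for illustration only.
census = SourceDescription(id="S1", citation="Example census citation")
birth = Conclusion(
    kind="Fact",
    value="born about 1862",
    sources=[SourceReference(description_id=census.id)],
)
```

The point of the sketch is only that the SourceReference sits between the Conclusion and the SourceDescription, so several conclusions can point at the same description.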

@EssyGreen

@thomast73 - excellent summary :)

One tiny picky thing - "Extractions" would be better called "Abstractions" when represented as Persons etc (to distinguish from verbatim extractions which are usually referred to as "extracts")

@jralls
Contributor
jralls commented Jul 3, 2012

The current "Description" (paragraph 3.1) isn't a class. It is including by reference the RDF specification, which is (unfortunately) used extensively in GedcomX. You can't replace that with what we're discussing here without breaking the rest of GedcomX.

I don't like that you're conflating "conclusion" with "hypothesis". Conclusion is already a class in GedcomX, as are Person and Relationship (and, if Ryan ever gets around to committing #134, Event). Hypothesis is a separate step which combines multiple sources and which will provide a proof argument for one or more conclusion/relationship/event objects. It should be represented as a separate class, and Conclusion, Relationship, and Event objects should be able to use it as a source instead of a Source object.

Yes, Extraction/Abstraction/Evidence/Interpretation can get sticky, and it's made stickier by re-using the top-level object names Person, Relationship, and Event. The GDM addressed that stickiness in part with the "Persona" concept, which is used in the Record Model.

These may need to be refined a bit, but lets start there.

That is rude. Why should we re-start a 5-month old discussion just because you've finally decided to join in?

@EssyGreen

I think @thomast73 was just trying to get a common vocabulary as a starting point for resolving the things we've discussed in this thread. Let's at least hear his next steps.

@thomast73
Contributor

Why should we re-start a 5-month old discussion just because you've finally decided to join in?

I am sorry. I've no intention toward rudeness, nor am I attempting to re-start from the beginning. I am joining (at the request of @stoicflame) in hopes we can distill some of these ideas into something concrete in the GedcomX model.

The current "Description" (paragraph 3.1) isn't a class. It is including by reference the RDF specification, which is (unfortunately) used extensively in GedcomX.

In referring to the Description class, I am referring to the org.gedcomx.metadata.rdf.Description class in the GedcomX Source Metadata model.

Also, I think the result of our work here will be that the RDF specification will not be so prominent and that the resulting model will feel more directly applicable to our community and not so abstract and general purpose and free form.

@EssyGreen

Yup I'm with you :) So where do we go from here?

@thomast73
Contributor

I don't like that you're conflating "conclusion" with "hypothesis".

I see that I have misread what has previously transpired in this regard. Indeed, they are not the same as @jralls has described them. I am not sure that @EssyGreen is making a clear distinction between the two in what she has written?

So, repeating from above:

  • Conclusion (aka: Assertion): a statement of what the researcher believes after analyzing the set of evidence in hand

And adding:

  • Reasoning (loosely: {@jralls}Synthesis, {@EssyGreen}Hypothesis): the reason the referenced source(s) are applicable to the conclusion/source
  • Attribution: who said so and when did they say it

The concepts of "reasoning" and "attribution" are combined into a single Attribution class in the current model (Release 18). In the current model, Attribution instances can be associated with SourceReference instances and with GenealogicalResource instances. It is my belief that our model has a few problems with regard to Attribution and its use within our model. In my mind, the "reason" (currently proofStatement) should be pulled out of Attribution so that Attribution is just about the "who" and the "when". It makes sense to me to associate an instance of this new Attribution' with SourceReferences and with GenealogicalResources, but it does not seem true that all instances of GenealogicalResource need an associated "reason" (e.g., does a Note need a "reason" for being?). It does seem that instances of Conclusion need "reasons", so I want to say that "reason" ought to be an attribute of Conclusion, but not everything I consider to be a "conclusion" is currently derived from Conclusion (something we will consider modifying). My musings...but off topic somewhat; may need to open a separate issue for this.
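A rough Python sketch of the split described above, under the assumption that the "reason" moves onto Conclusion while Attribution keeps only the "who" and the "when". The class and field names are hypothetical and are not the Release 18 classes; the example values are invented.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Attribution:
    """Only the "who" and the "when"; no proof statement."""
    contributor: str
    modified: date


@dataclass
class SourceReference:
    description_id: str
    attribution: Attribution


@dataclass
class Note:
    """A note carries attribution but has no "reason" for being."""
    text: str
    attribution: Attribution


@dataclass
class Conclusion:
    """A conclusion carries attribution and, optionally, its reason."""
    kind: str
    value: str
    attribution: Attribution
    reason: Optional[str] = None
    sources: List[SourceReference] = field(default_factory=list)


# Hypothetical example data, invented for illustration only.
who_when = Attribution(contributor="researcher-1", modified=date(2012, 7, 3))
note = Note(text="Check the record office next week", attribution=who_when)
birth = Conclusion(
    kind="Fact",
    value="5 May 1862",
    attribution=who_when,
    reason="Date taken from the birth certificate; the census age agrees.",
)
```

In this arrangement a Note can still say who wrote it and when, without being forced to justify itself.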

Also, I do not feel satisfied with any of the names proffered here: "reasoning", "synthesis", "hypothesis", "theory", .... [Please do not take my saying so as a personally-directed critique. :-)] None of them really "speak to me" of the intended usage as I currently see it. "Reason" seems the closest fit ...-but is so generic! :-) Still pondering...

@EssyGreen

I don't like that you're conflating "conclusion" with "hypothesis".

I see that I have misread what has previously transpired in this regard. Indeed, they are not the same as @jralls has described them. I am not sure that @EssyGreen is making a clear distinction between the two in what she has written?

Apologies if I wasn't clear enough ... I believe that a "Conclusion" is the reasoning/rationale/explanation for why a specific Hypothesis is favoured over others. We could model this by having a collection of conflicting hypotheses together with a verbatim "conclusion" statement. However, if each Hypothesis contains not just a verbatim but also a weighted/numeric evaluation of the Evidence for and against it then a separate "Conclusion" is not necessary and will dynamically change as the Hypotheses are re-evaluated as/when new Evidence is found. If you don't like this then I would see the need for a "Conclusion" statement alongside the aggregate of conflicting Hypotheses - I think this will be more difficult to model.

I need a use case to get any clarity on this, so here's one for starters.

Say you had 2 conflicting sources (S1 & S2) for a Person's Name (but you have already established/proven that they relate to the same Person). You have a number of hypotheses:
(a) S1 is correct and S2 is wrong for some reason
(b) S2 is correct and S1 is wrong for some reason
(c) both names were used simultaneously
(d) S1 and S2 are in fact the same thing with different spellings/languages
(e) the real name is something deduced from an amalgamation of S1 and S2

The "Conclusion" is (a), (b), (c), (d) or (e) plus the reason why. But if we explain "why" and give a score to each of (a)...(e) then we don't need a separate Conclusion.
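Here is a small Python sketch of what I mean, assuming each Hypothesis carries both a narrative rationale and a numeric score. The names and the scores are made up purely for illustration; how a score would actually be assigned is left open.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    label: str
    statement: str
    rationale: str   # the verbatim "why"
    score: float     # subjective weight of the evidence for this hypothesis


def preferred(hypotheses: List[Hypothesis]) -> Hypothesis:
    """The 'conclusion' is simply the highest-scoring hypothesis at this moment."""
    return max(hypotheses, key=lambda h: h.score)


# Hypothetical scores, invented for illustration only.
hypotheses = [
    Hypothesis("a", "S1 is correct and S2 is wrong", "S1 is the earlier record", 0.5),
    Hypothesis("b", "S2 is correct and S1 is wrong", "S2 informant unknown", 0.1),
    Hypothesis("d", "S1 and S2 are spelling variants", "names are similar", 0.3),
]
current = preferred(hypotheses).label

# New evidence turns up and (d) is re-evaluated; the preferred answer follows.
hypotheses[2].score = 0.8
updated = preferred(hypotheses).label
```

With these made-up numbers the preferred hypothesis moves from (a) to (d) once the score for (d) is raised, without any separate Conclusion object being edited.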

It is my belief that our model has a few problems with regards to Attribution and its use within our model. In my mind, the "reason" (currently proofStatement) should be pulled out of Attribution so that Attribution is just about the "who" and the "when". It makes sense to me to associate an instance of this new Attribution' with SourceReferences and with GenealogicalResources, but it does not seem true that all instances of GenealogicalResource need an associated "reason" (e.g., does a Note need a "reason" for being?).

I totally agree - see #178 :)

I do not feel satisfied with any of the names proffered here: "reasoning", "synthesis", "hypothesis", "theory"

Fair enough - read EE chapter 1 - yes John I did just say that;) - and pick a phrase that describes what you mean :) Then at least we'll all have a common reference point.

@EssyGreen

PS: Since we are allowing duplicate Births and Deaths etc in the model it cannot be a "Conclusion" model ... it is in fact a "Hypothesis" model with Person, Name, "Fact", Event and Relationship all being types of Hypothesis.

@jralls
Contributor
jralls commented Jul 4, 2012

The "Conclusion" is (a), (b), (c), (d) or (e) plus the reason why. But if we explain "why" and give a score to each of (a)...(e) then we don't need a separate Conclusion.

This seems a bit mechanistic, and ISTM that the scoring will be subjective and thus will need a narrative argument of some sort for each score. It seems easier to just write a narrative argument and arrive at a single synthesis/hypothesis/reasoning.

@jralls
Contributor
jralls commented Jul 4, 2012

It is my belief that our model has a few problems with regards to Attribution and its use within our model.

We discussed that briefly in #134 starting here.

I don't necessarily agree that the "who" and "when" should be separated from the "proof argument". Certainly a proof argument (or synthesis, hypothesis, reasoning) needs to have one-or-more "who", but that, along with "when" can be provided by version control in a collaborative environment, and is redundant in the single-researcher case.

But if there's no proof argument, what exactly are the "who" and "when" applying to? The mere atomization of the hypothesis into (generalized, including person, relationship, and event) conclusion objects?

@jralls
Contributor
jralls commented Jul 4, 2012

Conclusion (aka: Assertion): a statement of what the researcher believes after analyzing the set of evidence in hand

So if that's the "conclusion", what are you going to call its atomization into discrete characteristics, relationships, etc.?

Once one moves into the period before censuses enumerated everyone by name, reconstruction of families becomes much more dependent upon multiple sources and circumstantial evidence. Suppose you have a couple of wills, one of which explicitly lists everyone, while the other doesn't name anyone, just says "To my loving wife ... the house, its contents, and 1/3 of the land for her lifetime ... after which all remaining assets to be divided equally among my 6 children and their heirs or assigns". You find a couple of wills transferring parcels "for $1 and natural love and affection", a family bible, but the handwriting is the same for all of the children, making it a bit suspect. It doesn't match the detailed will, either. There are property and personal tax records (including tithables appearing to come of age), and of course a couple of census entries with the right name for the HoH and similar ages & sexes for other members. Naturally, not everything agrees...

A proficient genealogist will analyze all of this documentation together and write a single analysis (which might be quite long) separating the families as best s/he can. That's the initial hypothesis. You go and dig some more to try and confirm or refute your hypothesis. You find a few more records in the town clerk's basement, but nothing really conclusive. You make some adjustments to your hypothesis to reflect the new records, and you're now satisfied that you've conducted a "reasonably exhaustive search".

To capture that in GedcomX you'll have:

  • A Description record for each source.
  • "Conclusion" objects:
    • A Person for each member of both families
    • A collection of "Facts" for each Person
    • A bunch of relationships connecting the members of the families
    • Subject to Ryan committing #134, some Events

Since there are no Attribution references provided, each of those objects will have its own Attribution, all of which will be identical, and an identical set of SourceReferences. There's no provision for SourceReferences in the Attribution, so any citations in the ProofArgument will have to be ad-hoc and subject to referential integrity issues. If you were foolish enough to enter your hypothesis and create the "conclusion" objects before doing the second round of research, you will have to individually edit each of the Attributions and SourceReference lists.

Granted, GedcomX isn't intended to have a UI, and there's nothing preventing your program from providing for AttributionReferences to a single proof statement and then copying that into the various objects when it exports the GedcomX file... but consider the logic needed in the receiving program if it's to recognize all of those duplicated statements and resolve them into a single object with references. I'm ignoring the fact that no existing genealogy program is anywhere near sophisticated enough to handle this anyway. GedcomX is supposed to model the Genealogical Proof Standard, not existing software.
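To illustrate the receiving side, here is a minimal Python sketch of the sort of logic an importer would need in order to fold copied proof statements back into shared objects. The function name, the input shape (pairs of object id and proof text), and the generated "proof-N" ids are all assumptions made for this example, not anything in GedcomX.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def resolve_duplicates(
    objects: List[Tuple[str, str]],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Group objects by identical proof text and replace each copy with a reference.

    Returns (proofs, refs): proofs maps a generated proof id to its text,
    refs maps each object id to the proof id it should reference.
    """
    by_text: Dict[str, List[str]] = defaultdict(list)
    for object_id, proof_text in objects:
        by_text[proof_text].append(object_id)

    proofs: Dict[str, str] = {}
    refs: Dict[str, str] = {}
    for n, (text, object_ids) in enumerate(by_text.items(), start=1):
        proof_id = f"proof-{n}"
        proofs[proof_id] = text
        for object_id in object_ids:
            refs[object_id] = proof_id
    return proofs, refs


# Hypothetical import: three objects, two of which carry the same copied text.
proofs, refs = resolve_duplicates([
    ("person-1", "Proof argument P"),
    ("fact-1", "Proof argument P"),
    ("relationship-1", "Proof argument Q"),
])
```

Note how fragile this is: it only works while the copies stay byte-for-byte identical, so a single edit to one copy silently splits the proof argument in two. A reference in the exchange format would avoid the guesswork.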

@jralls
Contributor
jralls commented Jul 4, 2012

In referring to the Description class, I am referring to the org.gedcomx.metadata.rdf.Description class in the GedcomX Source Metadata model.

Yup. RDF.DESCRIPTION, or in the RDF XML Syntax spec rdf:Description.

Also, I think the result of our work here will be that the RDF specification will not be so prominent and that the resulting model will feel more directly applicable to our community and not so abstract and general purpose and free form.

Sure hope so. As I've said several times before, RDF should be an implementation detail. It has no place in the conceptual model/specification. But at present it's used many places, not just in SourceReferences.

@jralls
Contributor
jralls commented Jul 4, 2012

PS: Since we are allowing duplicate Births and Deaths etc in the model it cannot be a "Conclusion" model ... it is in fact a "Hypothesis" model with Person, Name, "Fact", Event and Relationship all being types of Hypothesis.

That's a misfeature of many genealogy programs, a lame way to handle conflicting evidence. It fits well with RDB architecture but has little else to recommend it. On the other hand, constraining certain "fact" and "event" types to one-per-person rather complicates the model.

@EssyGreen

The "Conclusion" is (a), (b), (c), (d) or (e) plus the reason why. But if we explain "why" and give a score to each of (a)...(e) then we don't need a separate Conclusion.

This seems a bit mechanistic, and ISTM that the scoring will be subjective and thus will need a narrative argument of some sort for each score. It seems easier to just write a narrative argument and arrive at a single synthesis/hypothesis/reasoning.

I totally agree that a narrative is needed to support the numeric evaluations (which are of course subjective) ... I think where the narrative goes (and whether it applies to an aggregate or single evidence objects) depends on the general approach being taken by the researcher ie conclusion-based (by this I mean the researcher focuses on presenting a cohesive tree of non-conflicting hypotheses which all hang together) vs evidence-based (by this I mean the researcher presents all their hypotheses - whether conflicting or not - directly in the tree and indicates which are "preferred" in some way). Personally I prefer conclusion based research (and would hence have the verbatim evaluation as part of each hypothesis) but the GEDCOM X model is evidence based and so there is no-where for the aggregate evaluation to go because no aggregation is ever done.

@EssyGreen

if there's no proof argument, what exactly are the "who" and "when" applying to?

In my opinion the who and when are the researcher (assertion) whereas the proof statement is the "what" (e.g. Person A = Person B). The researcher info can be right at top as part of the source; the proof statement needs to be where the assertion/hypothesis/whatever is being made.

@EssyGreen

A proficient genealogist will analyze all of this documentation together and write a single analysis (which might be quite long) separating the families as best s/he can. That's the initial hypothesis.

I agree - that's what I call conclusion-based research above

@EssyGreen

PS: Since we are allowing duplicate Births and Deaths etc in the model it cannot be a "Conclusion" model ... it is in fact a "Hypothesis" model with Person, Name, "Fact", Event and Relationship all being types of Hypothesis.

That's a misfeature of many genealogy programs, a lame way to handle conflicting evidence.

I agree (again!) but it's also the way GEDCOM X is going

@thomast73
Contributor

I do not feel satisfied with any of the names proffered here: "reasoning", "synthesis", "hypothesis", "theory"

Fair enough - read EE chapter 1 - yes John I did just say that;) - and pick a phrase that describes what you mean :)

EE 2nd Ed, Section 1.3 is titled "Conclusions: Hypothesis, Theory & Proof" and essentially states that hypotheses, theories and proofs are "conclusions" in various states of proven-ness. So these names are a classification system (of sorts) for conclusions.

The state of proven-ness seems rather subjective...what is proven in one person's view may not be satisfactorily so in another's view. Perhaps this is the reason @EssyGreen wishes to emphasize the "unproven" end of the spectrum with the "hypothesis" name.

For the model's sake, the generic "conclusion" designation seems good enough. We are not asking data providers to make statements about where a given conclusion is on the scale of proven-ness.

The @jralls' "synthesis" name seems to conflate the notion of "conclusion" with the processes required to demonstrate proven-ness -- the reasoning/rationale/explanation. It also seems to assume that all conclusions can be demonstrated with more than one piece of selected evidence. It seems to me that a "conclusion" can exist with or without proven-ness -- that conclusions can exist with no substantiation, based on a single piece of evidence, or based on a synthesis of selected evidence.

{@EssyGreen}a "Conclusion" is the reasoning/rationale/explanation for why a specific Hypothesis is favoured over others

The reasoning/rationale/explanation for a conclusion is not a conclusion, but rather information about how we arrived at said conclusion.

Conclusion (aka: Assertion): a statement of what the researcher believes after analyzing the set of evidence in hand

I think my definition might still be susceptible to misinterpretation. I want to be sure it is narrowly construed to exclude the statement of rationale. So perhaps something like this:

  • Conclusion (aka: Assertion): a proffered representation of fact relative to a given research question.
    • For example, if my question is "When was Fred Bill Blogs born?", my "birth" conclusion might be "5 May 1862".

In review, I think the following comparisons might be true.

{@jralls}"synthesis" is not {@thomast73}"conclusion"
{@jralls}"synthesis" is not {@thomast73}"reasoning"
{@jralls}"synthesis" might be the same as {@thomast73}"conclusion" + {@thomast73}"reasoning"

{@EssyGreen}"hypothesis" might be the same as {@thomast73}"conclusion"
{@EssyGreen}"conclusion" might be the same as {@thomast73}"reasoning"

@EssyGreen

these names are a classification system (of sorts) for conclusions.

Ouch I shot myself in the foot there didn't I ? lol :)

The state of proven-ness seems rather subjective...what is proven in one person's view may not be satisfactorily so in another's view. Perhaps this is the reason @EssyGreen wishes to emphasize the "unproven" end of the spectrum with the "hypothesis" name.

Indeed :) I'm a pessimist and very wary of over-optimism where secondary sources are concerned :)

{@EssyGreen}a "Conclusion" is the reasoning/rationale/explanation for why a specific Hypothesis is favoured over others

The reasoning/rationale/explanation for a conclusion is not a conclusion, but rather information about how we arrived at said conclusion.

OK ... so what/where is the " reasoning/rationale/explanation for why a specific Hypothesis is favoured over others" in the model? Where do I put conflicting/competing hypotheses? How do I distinguish these from non-conflicting hypotheses?

{@EssyGreen}"hypothesis" might be the same as {@thomast73}"conclusion"
{@EssyGreen}"conclusion" might be the same as {@thomast73}"reasoning"

I think you might be right :)

@thomast73
Contributor

I think where the narrative goes (and whether it applies to an aggregate or single evidence objects) depends on the general approach being taken by the researcher ie conclusion-based (by this I mean the researcher focuses on presenting a cohesive tree of non-conflicting hypotheses which all hang together) vs evidence-based (by this I mean the researcher presents all their hypotheses - whether conflicting or not - directly in the tree and indicates which are "preferred" in some way). Personally I prefer conclusion based research (and would hence have the verbatim evaluation as part of each hypothesis) but the GEDCOM X model is evidence based and so there is no-where for the aggregate evaluation to go because no aggregation is ever done.

An intriguing statement ... though I'm not sure I completely comprehend what you are saying here.

The object(s) I associate with my "rationale" statement is a function of what question(s) I am attempting to answer via that statement. If the statement is just about a birth date, I'd associate it with the conclusion object most closely associated with that proffered representation -- e.g., the "birth" Fact. If the "rationale" statement was more of a synthesis of analysis involving many conclusions and pieces of evidence -- i.e., {@EssyGreen} "a cohesive tree of non-conflicting [conclusions] which all hang together" -- perhaps we could associate it with each conclusion object relevant to the statement (perhaps even across multiple "person" conclusions and their subordinate conclusions), or perhaps just associate it with the enclosing conclusion (e.g., the "person" conclusion that includes the "birth", "death", etc. conclusions discussed in the "rationale" statement).

But if the model supported associating a "rationale" statement with any conclusion, and with more than one conclusion, wouldn't the model support both research "styles" -- "evidence-based" and "conclusion-based" research?

OK ... so what/where is the " reasoning/rationale/explanation..." in the model?

Good question. Right now it is conflated into Attribution. Right now a single statement cannot be referenced by multiple conclusions (@jralls also discusses this here). Seems like we need some work here.

@thomast73
Contributor

The "rationale" statement, at a base level, could be represented as a Note. Is there a strong reason to distinguish this type of note -- the "rationale" note -- from other notes?

@thomast73
Contributor

So, I guess I am going to give an answer to my own question -- in part, because I wanted to get in on the whole "talking to myself" thing. ;-)

Yes, the "rationale" statement would benefit from being something distinct from Note in that I would like to associate sources with my statement -- something that I would not do with Note.
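As a rough Python sketch of that idea, assuming a separate "rationale" object that holds its own source references and can be referenced by more than one conclusion. The class names, field names, and example data are hypothetical and invented for illustration; this is not the current model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Note:
    """A plain note: free text, no sources attached."""
    text: str


@dataclass
class Rationale:
    """Like a note, but it can cite sources and be shared between conclusions."""
    id: str
    text: str
    source_ids: List[str] = field(default_factory=list)


@dataclass
class Conclusion:
    kind: str
    value: str
    rationale_id: Optional[str] = None  # reference to a Rationale, not a copy of it


# Hypothetical example: one rationale statement shared by two conclusions.
rationale = Rationale(
    id="R1",
    text="Both records describe the same household.",
    source_ids=["S1", "S2"],
)
birth = Conclusion(kind="Birth", value="5 May 1862", rationale_id=rationale.id)
name = Conclusion(kind="Name", value="Fred Bill Blogs", rationale_id=rationale.id)
```

Because the conclusions hold a reference rather than a copy, editing the rationale once updates it for every conclusion that points at it.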

@jralls
Contributor
jralls commented Jul 5, 2012

The @jralls' "synthesis" name seems to conflate the notion of "conclusion" with the processes required to demonstrate proven-ness -- the reasoning/rationale/explanation.

That's because you're using "conclusion" in its broadest sense, while I'm using it to mean the atomized classes derived from "Conclusion" in GedcomX plus the Relationship class and (prospectively) the Event class.

It also seems to assume that all conclusions can be demonstrated with more than one piece of selected evidence.

No, that conclusions must be demonstrated with as much evidence as can be collected from a "reasonably exhaustive search", as specified by the Genealogical Proof Standard. No "selection" of evidence is permitted: All the evidence must be considered and all contradictions explained in the proof argument.

It seems to me that a "conclusion" can exist with or without proven-ness -- that conclusions can exist with no substantiation, based on a single piece of evidence, or based on a synthesis of selected evidence.

Perhaps, but that's not genealogy and it is certainly not consistent with the GPS.

Are you telling us that the GedcomX development team is repudiating Ryan's statement of purpose in #154, and that the official position of FamilySearch is that the GPS is no longer important to GedcomX?

@thomast73
Contributor

Are you telling us that the GedcomX development team is repudiating Ryan's statement of purpose in #154, and that the official position of FamilySearch is that the GPS is no longer important to GedcomX?

No.

We would like the model to support GPS-based research. But the model is not about forcing all genealogical data into the GPS mold.

It seems to me that a "conclusion" can exist with or without proven-ness -- that conclusions can exist with no substantiation, based on a single piece of evidence, or based on a synthesis of selected evidence.

Perhaps, but that's not genealogy and it is certainly not consistent with the GPS.

Genealogical data evolves. I often start with someone's word, then find one source, then maybe another, and so on, until I reach my preferred level of proven-ness. At any given time, conclusions in my tree are in various states of proven-ness. Not only that, my preferred level of proven-ness may not meet your own standard for proven-ness. So, does a lack of proof, or a sub-standard proof, mean it is no longer a conclusion? I would say "no". In a data exchange, it would still be exchanged as a conclusion. In accepting it, you might require that more work be done, or that additional statements be made, or you may even choose to skip/reject it, but a lack of proven-ness ought not prevent an exchange.

The @jralls' "synthesis" name seems to conflate the notion of "conclusion" with the processes required to demonstrate proven-ness -- the reasoning/rationale/explanation.

That's because you're using "conclusion" in its broadest sense, while I'm using it to mean the atomized classes derived from "Conclusion" in GedcomX plus the Relationship class and (prospectively) the Event class.

I guess I am missing your point?

Whether I am talking about the abstract "conclusion" concept, or a specialization (atomization?) of it (e.g., an instance of Name), wouldn't we still model the "rationale" statement separate from the "conclusion"? How does talking about a specialization of conclusion change how we model the "conclusion" and the "rationale" that led to it?

@EssyGreen

We would like the model to support GPS-based research. But the model is not about forcing all genealogical data into the GPS mold.

Why the heck not????? Surely the credibility of GEDCOM X lies in it supporting the GPS! How can you possibly produce a "standard" which doesn't support another established standard you already support on the same subject? The GPS isn't exactly controversial - in fact it's pretty much Motherhood & Apple Pie

@EssyGreen

does a lack of proof, or a sub-standard proof, mean it is no longer a conclusion? I would say "no". In a data exchange, it would still be exchanged as a conclusion. In accepting it, you might require that more work be done, or that additional statements be made, or you may even choose to skip/reject it, but a lack of proven-ness ought not prevent an exchange.

That is true but the "level of proven-ness" should also be able to be transported. If I had a work in progress and was investigating a theory that A=B I would want that indicated so that I couldn't be mis-cited as saying "Sarah has proven A=B"

@EssyGreen

The "rationale" statement, at a base level, could be represented as a Note. Is there a strong reason to distinguish this type of note -- the "rationale" note -- from other notes?

YES!!! One of the problems of old GEDCOM was that you could stuff NOTEs in anywhere and they meant nothing so a receiving program can't tell the difference between something like "I must remember to go to the record office next week" vs "This source is really unreliable" vs "I believe that Person A = Person B because ..." vs "This is a picture of my grandmother" vs "Although the marriage cert says this, it can't possibly relate to this person because ..." vs ... ad infinitum.

If we are going the "everything can be verbatim text" route then we don't need much of a model - let's just use wiki markup and have done with it. If not, then we need to understand and model the key statements which are important for an application to recognise/understand and make these distinctive objects which the applications can then utilise. A conclusion/rationale/hypothesis/whatever is absolutely vital in being able to "understand" genealogical data.
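As a minimal Python sketch of what "distinctive objects" buys a receiving application, assume each statement carries a kind that the program can test. The kinds listed here and all the names are hypothetical, chosen only to mirror the examples above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class StatementKind(Enum):
    TODO = "todo"
    SOURCE_ANALYSIS = "source-analysis"
    PROOF_ARGUMENT = "proof-argument"
    CAPTION = "caption"


@dataclass
class Statement:
    kind: StatementKind
    text: str


def proof_arguments(statements: List[Statement]) -> List[Statement]:
    """Pick out the statements an application must understand to follow the reasoning."""
    return [s for s in statements if s.kind is StatementKind.PROOF_ARGUMENT]


# The same texts that would all be indistinguishable NOTEs in old GEDCOM.
statements = [
    Statement(StatementKind.TODO, "I must remember to go to the record office next week"),
    Statement(StatementKind.SOURCE_ANALYSIS, "This source is really unreliable"),
    Statement(StatementKind.PROOF_ARGUMENT, "I believe that Person A = Person B because ..."),
    Statement(StatementKind.CAPTION, "This is a picture of my grandmother"),
]
arguments = proof_arguments(statements)
```

With untyped notes the filter above is impossible to write; the program would have to guess from the wording.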

Personally I think generic Note objects are superfluous, ambiguous and confusing and I would rather not have them at all. Instead allow a single generic CDATA narrative field wherever appropriate.

@jralls
Contributor
jralls commented Jul 6, 2012

Whether I am talking about the abstract "conclusion" concept, or a specialization (atomization?) of it (e.g., an instance of Name), wouldn't we still model the "rationale" statement separate from the "conclusion"? How does talking about a specialization of conclusion change how we model the "conclusion" and the "rationale" that led to it?

At the abstract level, everything from extraction of evidence on is a conclusion: Even the date of an event recorded in a document needs to be analyzed to determine the calendar in use at that time and place. Modelling that was the approach taken by the Gentech GDM, and it was so cumbersome that no-one implemented it. You'll find some choice comments from Tom Wetmore about it in some of the other issues.

"Atomization" means isolating each conclusion element (the name, the date, the place, the participants, etc.) into its own object (or RDB field) for machine representation. That's necessary to some extent for genealogy software to work -- particular relationships, for example, need to be recognized by the program in order to form a tree. When one is actually working on reconstructing a family it's rare to be able to isolate a particular "atom" because a source which provides no context to an atom isn't useful: Consider a bible you find at a garage sale with a single note, "John Smith was born July 11, 1854". No provenance, no other names, nothing to tell you which John Smith among the thousands born in that period it's talking about. Not very useful. At the other end of the spectrum, a bible in the possession of a member of the Smith family you're working on, with 3 generations of Smith BMD entries all written in different inks and hands. It also has an entry, "John Smith was born July 11, 1854", but the provenance of the bible and the other entries tell you exactly which John Smith it means. Other documents with overlapping information allow you to relate the evidence in each to the others and to build a complete picture of the family. You can't get there by taking John Smith's birth date out of the 3 or 4 which mention it and working only on that -- especially if the dates don't agree.

How does that change how we model conclusion and rationale?
In the "standard model", used by almost every genealogy database I've seen, there are only molecules: A birth event, for example, with a date, a place, and a list of participants as atoms. There's a list of sources, and a "note" field to collect some sort of rationale if the user is motivated to do so -- but it's a pain to discuss the sources because there's no way to link them to the note. There's also no way to tie the note to other molecules. (GRAMPS is the exception: We support linking sources inside notes, and notes are separate objects that can be linked to as many other objects as the researcher wants.)

GedcomX as presently designed goes further: There aren't any molecules, just compounds (Persons, Relationships) and a glass-like blob of atoms for each. Ryan has proposed adding Events, which are almost molecular -- but he hasn't committed the change in spite of general approval from the 3 or 4 people who bothered to comment.

In the "synthesis" model I've been trying to explain, the "rationale" collects all of the relevant evidence and explains a body of the "abstract" conclusions -- however much in the researcher's judgement is needed to deal with a manageable part of the puzzle. The "atomic" or "molecular" conclusions can then link to the rationale.

@EssyGreen

In the "standard model", used by almost every genealogy database I've seen, there are only molecules: A birth event, for example, with a date, a place, and a list of participants as atoms. There's a list of sources, and a "note" field to collect some sort of rationale if the user is motivated to do so -- but it's a pain to discuss the sources because there's no way to link them to the note. There's also no way to tie the note to other molecules. (GRAMPS is the exception: We support linking sources inside notes, and notes are separate objects that can be linked to as many other objects as the researcher wants.)

Unless I misunderstand you, your experience differs from mine ... old GEDCOM spec has NOTEs exactly as you specify here (they can be linked to virtually anything and can reference any number of sources via citations). I've seen this implemented many times (e.g. Family Historian, Family Tree Builder, TNG). But this "flexibility" comes at a price ... it is impossible to deduce when importing from elsewhere what the "Note" was used for (whether that be evidence, proof, rationale, to-do lists, captions on pictures, narrative descriptions of places etc etc etc). So all the importing program can do is keep the structure and render it as "text". What I believe is missing is the ability to distinguish the type of "Note" being made (without having to hazard a guess by looking at the types of objects it links together).
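
A rough sketch of the distinction being asked for, assuming a hypothetical `NoteKind` enumeration and `TypedNote` class (neither exists in GEDCOM or GEDCOM X; the kind names are invented for illustration). Once the kind travels with the text, a receiving program can pick out the rationale without guessing from the objects the note happens to link:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical kinds of note; illustrative only, not taken from any spec.
enum NoteKind { RATIONALE, SOURCE_ANALYSIS, TODO, CAPTION, GENERAL }

// A note that carries its purpose alongside its text.
class TypedNote {
    final NoteKind kind;
    final String text;

    TypedNote(NoteKind kind, String text) {
        this.kind = kind;
        this.text = text;
    }
}

class NoteImporter {
    // Return only the notes of the requested kind, preserving their order.
    static List<TypedNote> ofKind(List<TypedNote> notes, NoteKind kind) {
        List<TypedNote> matches = new ArrayList<>();
        for (TypedNote note : notes) {
            if (note.kind == kind) {
                matches.add(note);
            }
        }
        return matches;
    }
}
```

An importer that does not recognise a kind could still fall back to rendering the text, which is all it can do with an untyped NOTE today.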

@jralls
Contributor
jralls commented Jul 6, 2012

Personally I think generic Note objects are superfluous, ambiguous and confusing and I would rather not have them at all. Instead allow a single generic CDATA narrative field wherever appropriate.

That's XML-specific: The "conceptual model" needs something to hold the string for serialization to CDATA for XML and to whatever else for other implementations. It need not be a top-level object, though; in many cases a simple string parameter will do. Other cases, though, and "rationale" is one, should have a top-level object that can be referenced by one-to-many other objects.

@jralls
Contributor
jralls commented Jul 6, 2012

Genealogical data evolves. I often start with someone's word, then find one source, then maybe another, and so on, until I reach my preferred level of proven-ness. At any given time, conclusions in my tree are in various states of proven-ness. Not only that, my preferred level of proven-ness may not meet your own standard from proven-ness. So, does a lack of proof, or a sub-standard proof, mean it is no longer a conclusion? I would say "no".

Yes, agreed, that's the nature of the research process. As Sarah says, it is vital that the "proven-ness" is firmly attached to the conclusions. A proficient genealogist will insist that the "proven-ness" consists of:

  • The sources and repositories searched so far, and the results of each search (including searches which turned up no sources).
  • Complete citations for each source used.
  • The evidence found, clearly traceable to which source and clearly identified as direct or inferred.
  • An analysis of each source, including legibility, context, provenance, informant(s), etc.
  • An analysis of all of the evidence taken together, explaining any inconsistencies and leading to the conclusions made from the evidence.
  • A plan for further research if the "concluder" is not satisfied with the level of "proven-ness".
  • A clear statement of the conclusions.

In other words, a demonstration of how far along the research is in complying with the GPS. As I noted earlier, this level of detail is not supported by most extant programs.

In a data exchange, it would still be exchanged as a conclusion. In accepting it, you might require that more work be done, or that additional statements be made, or you may even choose to skip/reject it, but a lack of proven-ness ought not prevent an exchange.

Agreed, but the lack of "proven-ness" must be clearly stated and firmly attached to the conclusions.

@jralls
Contributor
jralls commented Jul 6, 2012

perhaps we could associate it with each conclusion object relevant to the statement (perhaps even across multiple "person" conclusions and their subordinate conclusions), or perhaps just associate it with the enclosing conclusion (e.g., the "person" conclusion that includes the "birth", "death", etc. conclusions discussed in the "rationale" statement).

But if the model supported associating a "rationale" statement with any conclusion, and with more than one conclusion, wouldn't the model support both research "styles" -- "evidence-based" and "conclusion-based" research?

Now you're getting it. There was a discussion of whether the proof statement should be a giant one for each person or more focused on subsets of conclusions between Tom, Sarah, maybe Louis, and me a couple of weeks ago, but now I can't find it. If the model can be structured to support both that would be fine with me, though it would complicate parsing for receiving programs.

BTW, I don't agree that those are research styles. I'd say that they're presentation styles, and I find the one labelled "evidence-based" to be rather lacking because it doesn't allow for the evidence to be treated in a single unit. Unfortunately it's the one used by most programs, so it would be counter-productive to not support it.

@EssyGreen

@jralls - I think your example of John Smith is excellent but I'm not convinced that it supports your need for a rationale as a top-level object ... The way I see it you have a subject you are researching and a number of "conclusions" (generic sense of the word) you are investigating. The source doesn't come out of the ether - you happen to notice it in the car boot sale because you are researching a John Smith and think "aha! I wonder if this will provide any evidence to support my theories about my John Smith" ... you analyse it in this context, firstly ascertaining the likelihood that this "John Smith" is your John Smith and then moving on to extracting further information to support, challenge and/or supplement your existing hypotheses about John Smith and his relationships etc. In each case the rationale is related to only one "Conclusion" (object) .... which can then be used as evidence for further conclusions. I think if you try to prove many Conclusions in one lump it gets very confusing and very hard to unravel. If you chain them together then if a link breaks you can trek back up the chain unpicking each dependent conclusion en route.
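
One way to picture the chaining described above, as a minimal sketch only: each conclusion has a single rationale and lists the earlier conclusions it relies on, so when one link breaks the dependents can be found by walking the chain. The `ChainedConclusion` class and its methods are hypothetical and not part of the GEDCOM X model; the sketch also assumes the chain contains no cycles.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical conclusion with one rationale and the earlier
// conclusions it uses as evidence. Assumes the links form no cycles.
class ChainedConclusion {
    final String statement;
    final String rationale;
    final List<ChainedConclusion> basedOn = new ArrayList<>();

    ChainedConclusion(String statement, String rationale) {
        this.statement = statement;
        this.rationale = rationale;
    }

    // True if this conclusion relies on 'other', directly or further up the chain.
    boolean dependsOn(ChainedConclusion other) {
        for (ChainedConclusion parent : basedOn) {
            if (parent == other || parent.dependsOn(other)) {
                return true;
            }
        }
        return false;
    }

    // Collect every conclusion in 'all' that needs revisiting if 'broken' falls.
    static Set<ChainedConclusion> affectedBy(ChainedConclusion broken,
                                             List<ChainedConclusion> all) {
        Set<ChainedConclusion> affected = new LinkedHashSet<>();
        for (ChainedConclusion candidate : all) {
            if (candidate.dependsOn(broken)) {
                affected.add(candidate);
            }
        }
        return affected;
    }
}
```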

If you push the rationale to the top level then you are putting the answer before the question.

@EssyGreen

I don't agree that those are research styles. I'd say that they're presentation styles, and I find the one labelled "evidence-based" to be rather lacking because it doesn't allow for the evidence to be treated in a single unit. Unfortunately it's the one used by most programs, so it would be counter-productive to not support it.

Sadly I believe that it is increasingly used as a "research" style - tho' it might be more appropriately called a "search style" since all the effort goes into the searching and none into the analysis and evaluation of the findings. Personally I would label it as "Junk Genealogy" but I thought that would be impolite.

@jralls
Contributor
jralls commented Jul 6, 2012

We are not asking data providers to make statements about where a given conclusion is on the scale of proven-ness.

We should ask, and make it easy to comply. In the interest of supporting existing programs, we can't require it.

@EssyGreen

We are not asking data providers to make statements about where a given conclusion is on the scale of proven-ness.

We should ask, and make it easy to comply. In the interest of supporting existing programs, we can't require it.

I'm actually split on this one ... yes it sounds sensible but as a genealogist I know I won't believe it anyway ... I have to take every secondary source and trace back to validate it so I don't care if they say "Definitely" ... I'm not going to believe them. I'm more interested in what their sources were. If they supply some rationale then I'd be interested to see their view-point in case it enlightened mine but that will come from the text and not from a "scale of proven-ness".

@jralls
Contributor
jralls commented Jul 6, 2012

Personally I would label it as "Junk Genealogy"

+1!

but I thought that would be impolite.

Oh well. ;-)

Yes, it is a common style among novices, and is I think driven by the way that most programs work. Have you seen Ancestry Insider's Genealogical Maturity Model? He also took a shot at Evidence Management that's useful, though I don't think that he sufficiently treats circumstantial evidence.

@EssyGreen

As an aside, I'm aware that we're debating from two different angles here ... what we want researchers to do and have available for themselves vs what we want researchers to do when making their data available to others, which comes back to #141 (as does the whole debate about fitting with the process model). I didn't get a clear view from that discussion about what is most important to GEDCOM.

@EssyGreen

it is a common style among novices, and is I think driven by the way that most programs work.

Indeed - I was sort of hoping GEDCOM X would "encourage" the apps to move towards a better way. Thx for the links - reading now :)

@thomast73
Contributor

We would like the model to support GPS-based research. But the model is not about forcing all genealogical data into the GPS mold.

Why the heck not????? Surely the credibility of GEDCOM X lies in it supporting the GPS! How can you possibly produce a "standard" which doesn't support another established standard you already support on the same subject? The GPS isn't exactly controversial - in fact it's pretty much Motherhood & Apple Pie

It's not that it wouldn't be good. It's just that almost all existing data does not exist in that form. And as @jralls says, almost all tools do not lend themselves to the documentation needs of GPS. To get this existing data into GPS requires each conclusion to be individually examined, documented, etc. in the GPS way.

@EssyGreen

We would like the model to support GPS-based research. But the model is not about forcing all genealogical data into the GPS mold.

Why the heck not????? [...]

It's not that it wouldn't be good. It's just that almost all existing data does not exist in that form.

But if we provide a model that does conform to the standard but also makes it recommended but not mandatory then surely that would be better than just ignoring it? For example (top of my head):

  • Reasonably exhaustive search - so ensure that the standard has an object for a "Search" with multiple "Search Result" objects
  • Complete and accurate citation of sources - ensure that every source has fields which enable it to be cited completely and accurately (this one's easy since a text field "citation text" can do it all)
  • Analysis and correlation of the collected information - ensure that the conclusion model and record model share the same objects so that the information in them can be compared
  • Resolution of conflicting evidence - ensure the conclusion object has a collection of evidence objects and each evidence object has a property for comparing the source to the conclusion and resolving any conflicts - this means supporting negative evidence as well as positive
  • Soundly reasoned, coherently written conclusion - ensure the conclusion object has a verbatim rationale property

There's many a genealogist out there just dying for this (in fact I'd put my hand up and say that that's exactly why I'm here) so they get a decent genealogy application and can fill in the blanks. If we just follow the status quo then let's just mock up GEDCOM 6 in XML and go home.
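
To make the list above a little more concrete, here is a rough sketch of how such objects might hang together. Every class and field name (`Search`, `SearchResult`, `Evidence`, `GpsConclusion`) is hypothetical and chosen only for illustration. Nothing is mandatory: a conclusion with no evidence and no rationale is still a valid object.

```java
import java.util.ArrayList;
import java.util.List;

// All names below are hypothetical; none come from the GEDCOM X specification.

// A single item found by a search, cited as free text.
class SearchResult {
    final String citationText;

    SearchResult(String citationText) {
        this.citationText = citationText;
    }
}

// A search of one repository. An empty result list records a search
// that turned up nothing, which still counts towards an exhaustive search.
class Search {
    final String repository;
    final List<SearchResult> results = new ArrayList<>();

    Search(String repository) {
        this.repository = repository;
    }
}

// Links one source to a conclusion, either for or against it.
class Evidence {
    final String citationText;
    final boolean supports;    // false marks negative or conflicting evidence
    final String resolution;   // how a conflict was resolved, or null

    Evidence(String citationText, boolean supports, String resolution) {
        this.citationText = citationText;
        this.supports = supports;
        this.resolution = resolution;
    }
}

class GpsConclusion {
    final String statement;
    final List<Evidence> evidence = new ArrayList<>();
    String rationale;          // verbatim reasoning; optional, so may be null

    GpsConclusion(String statement) {
        this.statement = statement;
    }

    // True if any conflicting evidence has no recorded resolution.
    boolean hasUnresolvedConflict() {
        for (Evidence e : evidence) {
            if (!e.supports && (e.resolution == null || e.resolution.isEmpty())) {
                return true;
            }
        }
        return false;
    }
}
```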

@thomast73
Contributor

We would like the model to support GPS-based research. But the model is not about forcing all genealogical data into the GPS mold.

Why the heck not????? [...]

It's not that it wouldn't be good. It's just that almost all existing data does not exist in that form.

But if we provide a model that does conform [and make] it recommended but not mandatory [...]

As I understand it, that is our goal...
...so we are on the same page -- from a goal point of view.

@EssyGreen

Wonderful :) So where are we on the detail?

@thomast73
Contributor

We are not asking data providers to make statements about where a given conclusion is on the scale of proven-ness.

We should ask, and make it easy to comply. In the interest of supporting existing programs, we can't require it.

I'm actually split on this one ... yes it sounds sensible but as a genealogist I know I won't believe it anyway ... I have to take every secondary source and trace back to validate it so I don't care if they say "Definitely" ... I'm not going to believe them. I'm more interested in what their sources were. If they supply some rationale then I'd be interested to see their view-point in case it enlightened mine but that will come from the text and not from a "scale of proven-ness".

So this gets to the core of things, in my mind.

We need an object that is (perhaps) a specialization of Note (maybe called EvidenceAnalysis?), that can be associated with multiple conclusions, and that can have associated sources. [I have proposed some model changes that include such a thing to @stoicflame and his initial feedback is positive.]

The presence of such an object would indicate research documented in a GPS fashion. As @EssyGreen states, most researchers will re-evaluate all the sources and the statement of analysis anyway, so the analysis narrative and all of the subjective material ought to stay a part of the statement -- not be called out by the model.
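
A minimal sketch of that shape, assuming stand-in classes whose names and fields are guesses rather than the actual GEDCOM X model: the analysis is a kind of note, it carries its own list of source references, and any number of conclusions can point at the same instance.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins; the field names are guesses, not the GEDCOM X model.
class NoteSketch {
    final String id;     // lets several objects reference the same note
    final String text;   // the narrative itself

    NoteSketch(String id, String text) {
        this.id = id;
        this.text = text;
    }
}

// A note holding the analysis narrative plus the sources it draws on.
class EvidenceAnalysisSketch extends NoteSketch {
    final List<String> sourceRefs = new ArrayList<>();  // ids of source descriptions

    EvidenceAnalysisSketch(String id, String text) {
        super(id, text);
    }
}

// A conclusion points at the analysis that justifies it, if there is one.
class ConclusionSketch {
    final String statement;
    EvidenceAnalysisSketch analysis;  // null when no analysis was recorded

    ConclusionSketch(String statement) {
        this.statement = statement;
    }

    // The presence of an analysis signals research documented in a GPS fashion.
    boolean hasAnalysis() {
        return analysis != null;
    }
}
```

Because the subjective material stays inside the narrative text, the model only has to say whether an analysis exists and which sources it cites.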

At an application level, it seems that it might be useful to allow an individual user to further decorate conclusions or analysis with indicators to revisit them because they are not satisfied (perhaps even a proven-ness scale), but these indicators/scales break down when the data becomes shared or exchanged because it is opinion and cannot be reliably interpreted from one researcher to another -- unless (perhaps) it can be when the analysis itself has reached the "proven" state according to the strictest interpretation of the GPS standard.

{@jralls}A proficient genealogist will insist that the "proven-ness" consists of:
sources and repositories searched so far

  • connected as sources to the EvidenceAnalysis, summarized in the analysis narrative

the results of each search (including searches which turned up no sources).

  • described in the analysis narrative

Complete citations for each source used

  • connected as sources to the EvidenceAnalysis

The evidence found, clearly traceable to which source and clearly identified as direct or inferred.

  • ...needs work...

An analysis of each source, including legibility, context, provenance, informant(s), etc.

  • described in the analysis narrative and/or the source description

An analysis of all of the evidence taken together, explaining any inconsistencies and leading to the conclusions made from the evidence.

  • described in the analysis narrative

A plan for further research if the "concluder" is not satisfied with the level of "proven-ness".

  • described in the analysis narrative

A clear statement of the conclusions.

  • described in the analysis narrative, represented as conclusions which reference this analysis

@EssyGreen gives a similar list here. If I am understanding things well enough, the differences between her list and @jralls' list can be found in their modeling of the extracted evidence -- the part above marked as needing work.

So in summary, if we could get an EvidenceAnalysis object introduced into the model as described, it takes us a fair distance on the path toward being able to represent data documented via the GPS standard...

...which brings us back to our original purpose?

@jralls
Contributor
jralls commented Jul 6, 2012

...which brings us back to our original purpose?

Which was that the RDF.Description (incorrectly) specified in para 3.1 is inadequate, and that we need in addition to your EvidenceAnalysis class * a proper Source class, the definition of which Sarah and I settled upon before you arrived.

* How about a description, in the same sort of rough spirit as we used for the Source class, of what you have in mind?

@thomast73
Contributor

How about a description, in the same sort of rough spirit as we used for the Source class, of what you have in mind?

SourceDescription (replacing Description)

  • id[1] : String -- system identifier
  • citation[1] : SourceCitation -- in the words of @EssyGreen: fields which enable [the Source] to be cited completely and accurately
  • about[0..1] : URI -- the URI of the actual Source (if applicable)
  • owner[0..1] : ResourceReference<foaf::Agent> -- a reference to an object that describes the individual or repository that holds/owns the Source being described
  • derivedFrom[0..1] : SourceReference -- a reference to the SourceDescription that describes the Source from which this Source was derived (if applicable)
  • componentOf[0..1] : SourceReference -- a reference to the SourceDescription that describes the Source of which this Source is a component (if applicable)
  • extractedEvidence : needs work
  • displayName[0..1] : String -- the name of this Source for display purposes (e.g., the name of the Belgian civil registrations collection, in Flemish but modified to include the locale, as recorded for my source list).
  • alternateNames[0..*] : List<RDFValue> -- the alternate display names for this Source (e.g., the name in English, the name as found in the FHL catalog, etc.)
  • notes[0..*] : List<Note> -- for notes about the source (if desired)
    • perhaps one of the places one could record what @jralls called analysis, though this analysis could also be recorded in a narrative that is part of an EvidenceAnalysis instance?
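
The field list above could be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: plain strings stand in for `SourceCitation`, `ResourceReference`, `RDFValue` and `Note`, direct object links stand in for `SourceReference`, and `extractedEvidence` is left out because it still needs work. The helper `outermostContainer` is hypothetical.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Illustrative only: simple types stand in for the proposed ones.
class SourceDescriptionSketch {
    final String id;                      // [1] system identifier
    final String citation;                // [1] stands in for SourceCitation
    URI about;                            // [0..1] URI of the actual Source
    String owner;                         // [0..1] stands in for ResourceReference
    SourceDescriptionSketch derivedFrom;  // [0..1] stands in for SourceReference
    SourceDescriptionSketch componentOf;  // [0..1] stands in for SourceReference
    String displayName;                   // [0..1]
    final List<String> alternateNames = new ArrayList<>();  // [0..*]
    final List<String> notes = new ArrayList<>();           // [0..*]

    SourceDescriptionSketch(String id, String citation) {
        // The two [1] members are required; everything else is optional.
        this.id = Objects.requireNonNull(id, "id");
        this.citation = Objects.requireNonNull(citation, "citation");
    }

    // Hypothetical helper: follow componentOf links to the outermost source.
    SourceDescriptionSketch outermostContainer() {
        SourceDescriptionSketch current = this;
        while (current.componentOf != null) {
            current = current.componentOf;
        }
        return current;
    }
}
```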

The big hole here is the "needs work" aspects of extractedEvidence. There is the business of "extracts" (i.e., full or partial transcriptions), possibly translations, and the business about evidence objects that represent the genealogically significant evidence (e.g., persons, relationships, etc.). I do not want to discuss all that might or should be involved here. However, given that all of the data in this extractedEvidence category are derivatives of the Source, I would like to think about whether this information is part of (contained by) the SourceDescription for the Source (with attribution on each derivative), or whether the extracted evidence ought to be more loosely coupled (by reference) to the Source (via a SourceDescription indicating the person making the derivations and a componentOf reference pointing to the SourceDescription describing the Source). Is SourceDescription better with or without an extractedEvidence member?

@thomast73
Contributor

Yes, a general purpose Note could be used to hold the source analysis, but that has the same problems as using a general Note object for the "rationale statement", which you already dismissed.

Why would it be important to call attention to Source "analysis" from among the other possible notes one might associate with a given Source? Are you thinking every note needs to be classified and typed? If we make a special type for this, what other special types are needed? And why?

I see the purpose in calling attention to the presence of a "rationale statement" -- the existence of a "proof argument" would be important to someone reviewing the research being transmitted. But it seems a difficult task to assign meaningful categories to other types of notes.

@thomast73
Contributor

OK, that adds some more fields to the Source class Sarah and I had already worked out.

Just to be clear, what I posted is our attempt to consolidate all the discussion (including the discussion here) into model changes. Making these changes would close this issue.

@jralls
Contributor
jralls commented Jul 9, 2012

Just to be clear, what I posted is our attempt to consolidate all the discussion (including the discussion here) into model changes. Making these changes would close this issue.

OK, why don't you make a branch and do a pull request for review... or turn this issue into a pull request. Don't forget to update gedcom.zargo.
Please separate the changes to specifications/ and gedcomx-common/ into separate commits; it makes seeing the differences easier.

@thomast73
Contributor

I started to say that I could go either way on extractedEvidence, but as I wrote it up I realized that it is likely to get too heavyweight. The argument in favor is that for XML I think it's better practice to keep hierarchies intact rather than to have a bunch of links all over the place. The problem is that extractedEvidence is conclusional: It's the result of a researcher's analysis, so it needs to be attributed in a collaborative environment, may need a note explaining inferences, and ideally it would be under version control of some sort. That argues for a separate class holding a SourceReference rather than a List inside the SourceDescription. Another point is that one is likely to want to have references to extractedEvidence objects in dependent conclusions, or at least in proof arguments ("reasonings"), which is more complicated with embedded elements. Yes, I realize that that's a change from my earlier position.

I am also leaning this direction. It seems valuable to keep conclusionary objects separate from source descriptions, and that we will ultimately want to reference those conclusionary objects in one or both of the manners you have identified.
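
As a sketch of the loosely coupled option, assuming a hypothetical `ExtractedEvidenceSketch` class: the extract lives outside the source description, carries its own identifier and attribution, and points back at the source by reference. The names are illustrative, and a plain string id stands in for `SourceReference`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical class; a string id stands in for SourceReference.
class ExtractedEvidenceSketch {
    final String id;           // lets conclusions and proof arguments reference it
    final String sourceRef;    // id of the SourceDescription it was derived from
    final String contributor;  // attribution for a collaborative environment
    final String content;      // transcription, translation, or abstract
    String inferenceNote;      // optional explanation of any inferences made

    ExtractedEvidenceSketch(String id, String sourceRef,
                            String contributor, String content) {
        this.id = id;
        this.sourceRef = sourceRef;
        this.contributor = contributor;
        this.content = content;
    }

    // Find every extract derived from the given source, whoever made it.
    static List<ExtractedEvidenceSketch> forSource(List<ExtractedEvidenceSketch> all,
                                                   String sourceId) {
        List<ExtractedEvidenceSketch> matches = new ArrayList<>();
        for (ExtractedEvidenceSketch extract : all) {
            if (extract.sourceRef.equals(sourceId)) {
                matches.add(extract);
            }
        }
        return matches;
    }
}
```

The cost of this layout is the lookup shown in `forSource`; the benefit is that two researchers can each attach their own extract to the same source without touching its description.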

@EssyGreen

Why would it be important to call attention to Source "analysis" from among the other possible notes one might associate with a given Source? Are you thinking every note needs to be classified and typed? If we make a special type for this, what other special types are needed? And why?

Suppose an application wishes to highlight conclusions which have no rationale - and hence need attention. Or suppose the app wants to provide different UIs for different purposes e.g. "research-mode" vs "reader-mode". Or suppose the app wants to hide general/private notes to some users but make the rationale notes always available. Or suppose the app wants to show the rationale at both ends (ie with the source and with the Person/Relationship etc) but doesn't want to do this for notes such as "Remember to look up Joe Bloggs in this collection" .... I could go on and on here.

As I've said above I believe the note object is overly complex and would prefer the text to be a property rather than a separate object. However, if you are intent on going the object route then Yes I think you need to identify different types of note ... you will however create problems in that some will not make sense in some situations (e.g. if we were to have a "Proof" type of note and one was attached to say a date structure but nothing else).

@thomast73
Contributor

Why would it be important to call attention to Source "analysis" from among the other possible notes one might associate with a given Source? Are you thinking every note needs to be classified and typed? If we make a special type for this, what other special types are needed? And why?

Suppose an application wishes to highlight conclusions which have no rationale - and hence need attention. Or suppose the app wants to provide different UIs for different purposes e.g. "research-mode" vs "reader-mode". Or suppose the app wants to hide general/private notes to some users but make the rationale notes always available. Or suppose the app wants to show the rationale at both ends (ie with the source and with the Person/Relationship etc) but doesn't want to do this for notes such as "Remember to look up Joe Bloggs in this collection" .... I could go on and on here.

In your use cases, I see two types of notes: notes giving rationale/analysis (EvidenceAnalysis), and everything else (Note).

I am suggesting -- and want -- this type of distinction. I am not in favor of any further note "types". I am sorry. I think I must have confused the issue.

You also call out "public" vs. "private". For data exchange, I would think the data producer would be responsible for removing "private" data prior to making an exchange, and that only "public" data would be included in the exchange. In other words, "public" vs. "private" is an application-level attribute and that anything included in the exchange would be "public" by definition.

@EssyGreen

In your use cases, I see two types of notes: notes giving rationale/analysis (EvidenceAnalysis), and everything else (Note).

Er that's cos you were asking about EvidenceAnalysis ... If you want a full list I suggest we move over to #135. On the other hand if you're leaving that to someone else then OK leave it as EvidenceAnalysis only.

For data exchange, I would think the data producer would be responsible for removing "private" data prior to making an exchange, and that only "public" data would be included in the exchange.

That rather depends on the type of data exchange - see #141. If GEDCOM is only intended to support publication of data then yes I would agree but atm it is also used for interworking and migration between applications by the same user. I believe the privacy flag is needed on every object type - or rather it should be in the base Genealogical Resource object.
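
A small sketch of why the flag matters for the two kinds of exchange, assuming a hypothetical `ExchangeItem` with an `isPrivate` flag and an `ExchangePurpose` enum (all of these names are invented here): publication drops private items, while migration between applications by the same user keeps them, flag intact.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical purposes of an exchange; not part of any GEDCOM specification.
enum ExchangePurpose { PUBLICATION, MIGRATION }

// Stand-in for any object that would inherit a privacy flag from a base class.
class ExchangeItem {
    final String text;
    final boolean isPrivate;

    ExchangeItem(String text, boolean isPrivate) {
        this.text = text;
        this.isPrivate = isPrivate;
    }
}

class Exporter {
    // Keep everything when migrating; drop private items when publishing.
    static List<ExchangeItem> select(List<ExchangeItem> items, ExchangePurpose purpose) {
        List<ExchangeItem> selected = new ArrayList<>();
        for (ExchangeItem item : items) {
            if (purpose == ExchangePurpose.MIGRATION || !item.isPrivate) {
                selected.add(item);
            }
        }
        return selected;
    }
}
```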

@EssyGreen

alternateNames[0..*] : List<RDFValue> -- the alternate display names for this Source (e.g., the name in English, the name as found in the FHL catalog, etc.)

Is this really going to be used/needed? It sounds like one of those over-engineered things that will just complicate the model by its multiplicity ... what is an app to do with them all? I can't see a user actually wanting them except on rare occasions where they would be more appropriate in a (dare I say it) "Note".

@EssyGreen

notes[0..*] : List<Note> -- for notes about the source (if desired)

For the record ... I think a simple (and single) text field would suffice in 99% of cases.

@EssyGreen

EvidenceAnalysis extends Note and adds a list of SourceReferences.

I can't find the link to the model atm and the Conclusion object seems to be changing anyway so I've lost track ... can someone tell me what's in a Note apart from the text and hence why it needs to be an object at all?

@thomast73
Contributor

... can someone tell me what's in a Note apart from the text and hence why it needs to be an object at all?

Note has text and language and inherits an identifier (so it can be referenced by multiple objects) and attribution. It also inherits the option for extension elements.

@jralls
Contributor
jralls commented Jul 10, 2012

typedef foaf::Agent ResourceReference;

ResourceReference would probably point to an instance of Organization; if, instead, the source was held by an individual (e.g., a family bible?), it would reference a Person.

Ah, so
typedef foaf::Agent* ResourceReference;

I'd prefer that the conceptual model used only internally-defined objects with RDF-related stuff relegated to the implementation specs (XML and JSON), but that's a different issue.

@jralls
Contributor
jralls commented Jul 10, 2012

As I've said above I believe the note object is overly complex and would prefer the text to be a property rather than a separate object. However, if you are intent on going the object route then Yes I think you need to identify different types of note ... you will however create problems in that some will not make sense in some situations (e.g. if we were to have a "Proof" type of note and one was attached to say a date structure but nothing else).

Gramps has notes as objects which can be referenced by 1 to many other objects; we find this feature to be quite useful and would like to see it supported by GedcomX.

@EssyGreen

Gramps has notes as objects which can be referenced by 1 to many other objects; we find this feature to be quite useful and would like to see it supported by GedcomX.

Like I said before, loads of apps do this but none actually do anything meaningful with it.

Nonetheless I will back out. I obviously have quite different requirements from those GEDCOM X is focused on.

@jralls
Contributor
jralls commented Jul 10, 2012

Yes, a general purpose Note could be used to hold the source analysis, but that has the same problems as using a general Note object for the "rationale statement", which you already dismissed.

Why would it be important to call attention to Source "analysis" from among the other possible notes one might associate with a given Source? Are you thinking every note needs to be classified and typed? If we make a special type for this, what other special types are needed? And why?

I see the purpose in calling attention to the presence of a "rationale statement" -- the existence of a "proof argument" would be important to someone reviewing the research being transmitted. But it seems a difficult task to assign meaningful categories to other types of notes.

There are two issues: Does a particular flavor of Note need additional properties (e.g., the SourceReference List in EvidenceAnalysis) and does the Note fulfill a required step in the research process. The latter (which would include source analysis) can be satisfied by using a Note object as a specific member of the containing class. E.g.,

class SourceDescription {
...
  public Note sourceAnalysis;
...
  public List<NoteReference> Notes;
...
}
@EssyGreen

Indeed but I can't see many people specifying a language, author and date for every note they make. Yes it can be automated but why bulk out the file just for the sake of it?

@jralls
Contributor
jralls commented Jul 10, 2012

Note has text and language and inherits an identifier (so it can be referenced by multiple objects) and attribution. It also inherits the option for extension elements.

The specification is supposed to be canonical, not the code. See #114.

Indeed but I can't see many people specifying a language, author and date for every note they make. Yes it can be automated but why bulk out the file just for the sake of it?

I agree that making Note extend GenealogicalResource is too heavyweight, but that should be its own issue, so I wrote #181.

@EssyGreen

The specification is supposed to be canonical, not the code. See #114.

Can you explain what you mean by "canonical" here?

I agree that making Note extend GenealogicalResource is too heavyweight, but that should be its own issue, so I wrote #181.

Er we already had #135 but what the heck

@jralls
Contributor
jralls commented Jul 10, 2012

Er we already had #135 but what the heck

Right. I've closed #181 as a duplicate.

@jralls
Contributor
jralls commented Jul 10, 2012

Can you explain what you mean by "canonical" here?

Normative, if you like. The document which is the fundamental source, and which takes precedence in the case of conflicts.

@thomast73
Contributor

I have created a pull request (#182) to do the refactoring we have been discussing here (#123 and #144). I have attempted to update gedcomx.zargo to reflect the changes I will be making. I will be updating the classes and documentation to reflect these changes over the next several days.

@EssyGreen

@thomast73 could we also have the DerivativeType e.g. original, image copy, abstract, extract, transcription, translation, interpretation etc ... this is one of those occasions where I think a coded value rather than free-form text would be useful but I'd be happy to go either way as long as there is somewhere to put it.

@thomast73
Contributor

...could we also have the DerivativeType...?

I have had a chance to socialize this query around here a bit. Here is an attempt to describe what we came up with.

A SourceReference is used to describe the relationship between one object and its source (really, the description of its source). It is our thinking that the SourceReference itself ought to say what type of relationship it is describing between that object and its source. It seems in most cases, that relationship being described is a "this is a derivative of source S of type X". So, we need a mechanism to say that the SourceReference is of type X (where X is something in a list like the list @EssyGreen has detailed).

SourceReference has a type field, but that field (as currently defined) describes the type of the source itself. The citation for the source generally includes information about its type also. So in a way, putting the source type in the source reference is duplicating citation data. Also, the current type vocabulary is insufficient for its originally intended purpose.

So we would like to propose that we re-purpose the SourceReference.type property to become a "source reference type" property with a vocabulary like the following:

SourceReferenceType

  • preservationCopy
  • abstract
  • transcription
  • translation
  • extractedConclusion
  • analysis
  • workingConclusion

As a means of explaining this, I offer the following use case as an example (SD objects are SourceDescription objects; S objects are sources, but do not have a concrete representation in the model; EA objects are EvidenceAnalysis objects; and C objects are derivations of Conclusion):

  • Source S1 -- the original -- described by SD1
  • Source S2 -- preservation copy; GSU film -- described by SD2, derivedFrom.type == "preservationCopy" and pointing at SD1
  • Source S3 -- transcription; produced by FamilySearch forum volunteer -- described by SD3, derivedFrom.type == "transcription" and pointing at SD2
  • Source S4 -- translation; produced by FamilySearch forum volunteer -- described by SD4, derivedFrom.type == "translation" and pointing at SD3
  • Conclusion C1 -- a Conclusion that represents evidence extracted from the translation (e.g., a Person, Relationship, Fact, etc.) -- described by SD5, derivedFrom.type == "extractedConclusion" and pointing at SD4
  • Evidence Analysis EA1 -- an EvidenceAnalysis note that includes analysis of S5 -- described by SD6, derivedFrom.type == "analysis" and pointing at SD5
  • Conclusion C2 -- a Conclusion (e.g., a Person, Relationship, Fact, etc.) representing the current working hypothesis -- has a SourceReference pointing to SD6 with derivedFrom.type == "workingConclusion"

Modeling things in this way would allow us to distinguish conclusions that represent "extracted evidence" from "working conclusions". I also think it gives a mechanism to determine a "derivative type" as @EssyGreen has requested. I think it might also obviate the need for a `ResourceType`.
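The use case above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class names, the `derivedFrom` field as a single typed reference, and the `trace` and `exampleChain` helpers are hypothetical and are not taken from the GEDCOM X code or from pull request #182.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names come from the GEDCOM X code base.
enum SourceReferenceType {
    PRESERVATION_COPY, ABSTRACT, TRANSCRIPTION, TRANSLATION,
    EXTRACTED_CONCLUSION, ANALYSIS, WORKING_CONCLUSION
}

// A typed pointer from one object to the description of its source.
final class SourceReference {
    final SourceReferenceType type;
    final SourceDescription target;

    SourceReference(SourceReferenceType type, SourceDescription target) {
        this.type = type;
        this.target = target;
    }
}

// Describes a source; derivedFrom is null for the original (SD1).
final class SourceDescription {
    final String id;
    final SourceReference derivedFrom;

    SourceDescription(String id, SourceReference derivedFrom) {
        this.id = id;
        this.derivedFrom = derivedFrom;
    }
}

final class ProvenanceSketch {
    // Follows derivedFrom links from a reference back to the original.
    static List<String> trace(SourceReference start) {
        List<String> steps = new ArrayList<>();
        for (SourceReference ref = start; ref != null; ref = ref.target.derivedFrom) {
            steps.add(ref.type + " -> " + ref.target.id);
        }
        return steps;
    }

    // Builds SD1..SD6 from the use case and returns C2's reference to SD6.
    static SourceReference exampleChain() {
        SourceDescription sd1 = new SourceDescription("SD1", null);
        SourceDescription sd2 = new SourceDescription("SD2",
                new SourceReference(SourceReferenceType.PRESERVATION_COPY, sd1));
        SourceDescription sd3 = new SourceDescription("SD3",
                new SourceReference(SourceReferenceType.TRANSCRIPTION, sd2));
        SourceDescription sd4 = new SourceDescription("SD4",
                new SourceReference(SourceReferenceType.TRANSLATION, sd3));
        SourceDescription sd5 = new SourceDescription("SD5",
                new SourceReference(SourceReferenceType.EXTRACTED_CONCLUSION, sd4));
        SourceDescription sd6 = new SourceDescription("SD6",
                new SourceReference(SourceReferenceType.ANALYSIS, sd5));
        return new SourceReference(SourceReferenceType.WORKING_CONCLUSION, sd6);
    }
}
```

Calling `ProvenanceSketch.trace(ProvenanceSketch.exampleChain())` walks from the working conclusion's reference back to SD1, one typed step per link.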

@jralls
Contributor
jralls commented Jul 13, 2012

That's an interesting approach.

I suggest one more type: containedIn. That allows hierarchical source trees for sources where one might have more than one citation from a source. Examples range from a compiled genealogy where one might be citing several people (and therefore pages) to a census, where one might have several "layers" (state, county, ED, page) if one were to get carried away.

@thomast73
Contributor

I suggest one more type: containedIn.

I would be fine with this addition.

I had modified the model (see the pull request #182) such that we were modeling this concept with a separate member in the SourceDescription class called componentOf. With the proposed changes of yesterday, we probably do not want the componentOf or derivedFrom members, but instead maybe a references member (of type SourceReference and multiplicity '0..*').

@jralls
Contributor
jralls commented Jul 13, 2012

With the proposed changes of yesterday, we probably do not want the componentOf or derivedFrom members,

Yes, a better, more general solution IMO.

@EssyGreen

we would like to propose that we re-purpose the SourceReference.type property to become a "source reference type" property with a vocabulary like the following [...]

On a general level I don't have a problem with the re-purposing of the type (although I find some of the terms out of kilter with genealogy terminology e.g. preservationCopy?). However, if we attach this to the SourceReference rather than the source then we massively replicate this throughout the file. It is my opinion that all the details about the source (including what it was derived from etc) be stored as one object and that references to it are simply pointers. To do otherwise is to fragment the source in such a way that a receiving application has to re-assemble it in order to make it usable for a researcher (who surely will not want to repeat the type of source etc each time they reference say a birth certificate).

I suggest one more type: containedIn

... I had thought that we'd agreed on "ComponentOf" (not that I'm arguing over the words!) for this purpose ... I think this is needed in addition to the derivative type for example I have a page describing a household in a Census from Ancestry ... the household is "contained in" the larger Census which is a derivative published by Ancestry from the original in the National Archives.

@jralls
Contributor
jralls commented Jul 13, 2012

However, if we attach this to the SourceReference rather than the source then we massively replicate this throughout the file ... who surely will not want to repeat the type of source etc each time they reference say a birth certificate.

It's not a type of the Source, it's a usage type of the SourceReference. To restate one of Thomas's examples,

EvidenceAnalysis EA1 is an "analysis" of (the Source described in) SourceDescription SD5

I think this is needed in addition to the derivative type for example I have a page describing a household in a Census from Ancestry ... the household is "contained in" the larger Census which is a derivative published by Ancestry from the original in the National Archives.

You're missing the chaining effect that Thomas demonstrated in his example. Let SourceDescription SDA describe the original census page, SDB describe the Ancestry image, and SDC describe the particular family on the page. Then:

  • SDB will have a SourceReference of type "preservationCopy" pointing to SDA.
  • SDC will have a SourceReference of type "containedIn" pointing to SDB.

SDA may in turn have a SourceReference of type "containedIn" pointing to some higher level, perhaps the county, and so on.

To do otherwise is to fragment the source in such a way that a receiving application has to re-assemble it in order to make it usable for a researcher

That's a legitimate beef. It's a classic speed-space tradeoff: If the whole source (Census year, county, parish, page, family) is contained as a single source, then all of that must be repeated for every family, using more space. If it's broken down into a hierarchy (whether in separate objects linked by pointers or by nesting elements) then it saves space but the application must do some extra work to assemble the whole source for display.
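A minimal sketch of the hierarchical side of that tradeoff, assuming a hypothetical `SourceLevel` class with a single `containedIn` pointer (neither name is from the GEDCOM X model). Each level is stored once, and the application assembles the full source on demand:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: one node per level of a source hierarchy.
final class SourceLevel {
    final String label;
    final SourceLevel containedIn; // null at the top of the hierarchy

    SourceLevel(String label, SourceLevel containedIn) {
        this.label = label;
        this.containedIn = containedIn;
    }

    // The "extra work": walk up the containedIn chain and join the labels,
    // outermost level first.
    String fullCitation() {
        Deque<String> parts = new ArrayDeque<>();
        for (SourceLevel level = this; level != null; level = level.containedIn) {
            parts.addFirst(level.label);
        }
        return String.join(", ", parts);
    }
}
```

Two families on the same page would share the same `containedIn` object, so the census, county, and page levels are stored only once.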

@EssyGreen

I think you misunderstand me .. I was the one asking for the chaining effect in the first place! But I want the source record chained not the references to it ... in old GEDCOM terms I want to be able to have chained SOURce records but not to have to put all this info in every SOURce citation. I may be misunderstanding the term SourceReference but to me that "Reference" indicates it is a pointer not a "record".

@stoicflame
Member

I may be misunderstanding the term SourceReference but to me that "Reference" indicates it is a pointer not a "record".

Yes, let's be clear. It's a reference. But it's a heavy reference because it can be used to point to either the actual source, the description of the source, or both. It also has a type.

I think you're wanting it to be simpler: just have it reference the description. I think this is a valid way of doing it, but I personally feel that the extra complexity of the source reference is justified. There are a couple of reasons why:

  1. There are a lot of cases where I don't want to take two hops to get to the real source. If I'm looking at a "working" conclusion about a person that cites an "extracted" conclusion about a person, I don't want to have to look up a _description_ of that extracted conclusion before I get to the actual extracted conclusion.
  2. If I want to analyze all the sources for a working conclusion about a person, I want to know what types of sources are being cited without having to go read descriptions of each of the sources.
  3. If I want to calculate all of the working conclusions that are citing a given source, it would be much easier to do in a single loop through a list of source references rather than through a double loop of descriptions, then references to those descriptions.
  4. When monitoring changes on a working conclusion, I'd like to know when changes happen to the list of sources being cited. This would be much easier to implement if I just needed to watch the source reference list rather than watching the source reference list and all the descriptions to which those references resolve.

So I like the source reference.

@jralls
Contributor
jralls commented Jul 13, 2012

If I want to analyze all the sources for a working conclusion about a person, I want to know what types of sources are being cited without having to go read descriptions of each of the sources.

Meaning, I suppose, that you're opposed to Thomas's proposal, which would replace the "resource type" with the "source reference types" outlined above.

NB: "Resource Type" is not at present defined, and the known resource types link doesn't actually point to anything.

But it's a heavy reference because it can be used to point to either the actual source, the description of the source, or both

It's actually an anything reference. The resource URI is explicitly unconstrained:

No restrictions on what the URI must resolve to. It could be an image, a conclusion, a book, etc.

(Setting aside that paper books don't have URIs, only descriptions of them do.)

@EssyGreen

I think you're wanting it to be simpler: just have it reference the description

Yes you are right - that's what I'm after :)

I don't want to have to look up a _description_ of that extracted conclusion before I get to the actual extracted conclusion.

Why would you need to? You would surely be looking at the information in the extracted conclusion since that is what you are referencing.

If I want to analyze all the sources for a working conclusion about a person, I want to know what types of sources are being cited without having to go read descriptions of each of the sources.

There are lots of things you might want to analyse about a source - if we put them all in the "reference" to it just in case you want to analyse them then it gets very fat and fragments the data. If we keep them in the source description it is easy for a program to read off the records and present the information in a variety of ways. This retains the data integrity rather than running the risk of corrupting some of those references and creating bad data

If I want to calculate all of the working conclusions that are citing a given source, it would be much easier to do in a single loop through a list of source references rather than through a double loop of descriptions, then references to those descriptions.

See my last comment. GEDCOM X will largely be used as a transportation mechanism not a final data store for an application so an attempt to optimise it for a UI at this level is in my opinion a wasted effort.

When monitoring changes on a working conclusion, I'd like to know when changes happen to the list of sources being cited.

Indeed ... so say you get a photocopied census record ... you put it all in and link it up to your conclusions and for some reason you forget to put a type or you put the wrong one ... how to correct ... oh sugar! go through all the 100 links and change each one! Nightmare!

One of the problems with old GEDCOM was that the citation was often too heavyweight because it included "text from source" and "where in source" ... both of which are frequently used for more than one citation/evidence. Correcting mistakes or unlinking these citations is a nightmare for the user. It is very easy to think you've found all the references but miss some .... it's a basic data integrity issue ... don't fragment and replicate data throughout the data store.

If instead we keep the source description to be exactly that and use "ComponentOf" and "DerivedFrom" to chain the variations together then we keep the data intact.
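The data-integrity point can be illustrated with a small sketch, assuming hypothetical `SourceRecord` and `Citation` classes (not GEDCOM X types) in which a citation is nothing more than a pointer. A wrong derivative type is then corrected in one place, however many citations exist:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the source record owns all descriptive data.
final class SourceRecord {
    final String id;
    String derivativeType; // mutable so a mistake can be fixed in one place

    SourceRecord(String id, String derivativeType) {
        this.id = id;
        this.derivativeType = derivativeType;
    }
}

// A citation is only a pointer; it carries no copy of the source's data.
final class Citation {
    final SourceRecord source;

    Citation(SourceRecord source) {
        this.source = source;
    }

    String describe() {
        return source.id + " (" + source.derivativeType + ")";
    }
}

final class SingleRecordSketch {
    // Creates the given number of citations, all pointing at one record.
    static List<Citation> citeManyTimes(SourceRecord source, int count) {
        List<Citation> citations = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            citations.add(new Citation(source));
        }
        return citations;
    }
}
```

After `citeManyTimes(record, 100)`, assigning a corrected value to `record.derivativeType` is visible through every citation without touching any of them.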

@jralls
Contributor
jralls commented Jul 14, 2012

There are lots of things you might want to analyse about a source - if we put them all in the "reference" to it just in case you want to analyse them then it gets very fat and fragments the data. If we keep them in the source description it is easy for a program to read off the records and present the information in a variety of ways. This retains the data integrity rather than running the risk of corrupting some of those references and creating bad data

+1

The SourceReference should contain a reference and a reference type. Nothing more.

@jralls
Contributor
jralls commented Jul 14, 2012
  1. There are a lot of cases where I don't want to take two hops to get to the real source. If I'm looking at a "working" conclusion about a person that cites an "extracted" conclusion about a person, I don't want to have to look up a _description_ of that extracted conclusion before I get to the actual extracted conclusion.
  2. If I want to analyze all the sources for a working conclusion about a person, I want to know what types of sources are being cited without having to go read descriptions of each of the sources.
  3. If I want to calculate all of the working conclusions that are citing a given source, it would be much easier to do in a single loop through a list of source references rather than through a double loop of descriptions, then references to those descriptions.
  4. When monitoring changes on a working conclusion, I'd like to know when changes happen to the list of sources being cited. This would be much easier to implement if I just needed to watch the source reference list rather than watching the source reference list and all the descriptions to which those references resolve.

None of which have anything to do with GedcomX. Those are all functions for a database program to resolve after it has ingested the GedcomX file, and they are all things that databases are good at -- while maintaining a single instance of each datum.
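As a rough illustration of the kind of post-ingest work meant here, the sketch below builds a reverse index from source description to citing conclusions in a single pass. The `CitationIndexSketch` class and the `{conclusionId, sourceDescriptionId}` pair layout are assumptions made for the example, not anything defined by GedcomX:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of an index an application might build after import.
final class CitationIndexSketch {
    // Each entry of citations is a pair {conclusionId, sourceDescriptionId}.
    // Returns, for every source description, the conclusions that cite it.
    static Map<String, List<String>> citingConclusions(List<String[]> citations) {
        Map<String, List<String>> index = new LinkedHashMap<>();
        for (String[] citation : citations) {
            index.computeIfAbsent(citation[1], key -> new ArrayList<>())
                 .add(citation[0]);
        }
        return index;
    }
}
```

Once the index exists, "which conclusions cite this source" is a single map lookup, and each citation is still stored only once.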

@stoicflame
Member

Okay, let me try a different tack.

Let's take a census record, for example. I understand that despite the complexities, a good genealogical data provider needs to be able to create a description of that record, including how it's derived, what it's a component of, and all of the other bibliographic fields used to describe it. Fine. I think that's reasonable.

But now let's take a working conclusion derived from an analysis of a set of extracted conclusions of a translation of a transcription of an image of a census record.

Do you really expect every provider to know how to describe the source of that working conclusion? I think we're going to have a lot of empty source description objects that do nothing but forward the reference on to the next description in the chain without providing any of the descriptive metadata. If you look at it like a tree or a pyramid, only the descriptions at the "top" of the pyramid would have anything useful and the bulk of them will be useless except as a forwarding mechanism.

Instead, why not provide a way to just go directly to the source? You wouldn't have to burden your dataset with a ton of empty descriptions, nor would you burden providers with knowing how to provide description metadata for each step in the source chain.

@jralls
Contributor
jralls commented Jul 16, 2012

What's a "provider"? It implies to me an entity like FamilySearch or Ancestry.com -- which in turn would imply the Record Model rather than the Conclusion Model.

Substituting "researcher" for "provider", heck yes I expect that a researcher can describe in detail the intermediate conclusions and trace them back to the original sources. If she can't, the conclusion, working or otherwise, is of no value at all.

Why would someone create intermediate source levels without having metadata for each?

If the person writing a conclusion didn't consult the original source but instead consulted a transcript or abstract, then it would be dishonest for him to cite the original source. If I'm using that conclusion and the original source is available, yes, I'll go look at it. I'll also want to look at the transcript he did actually consult.

@thomast73 added a commit that referenced this issue Jul 16, 2012
Refactoring relative of Source Metadata and the Conclusion model, largely a result of discussions on #123 and #144 (and some of #135).
32326d7

@EssyGreen

@jralls - spot on - agree with everything you said there :)

@thomast73
Contributor

{@stoicflame} Yes, let's be clear. It's a reference. But it's a heavy reference because it can be used to point to either the actual source, the description of the source, or both. It also has a type.

After further consideration and discussion, we have decided to remove the option to point to the actual source via the SourceReference object. This addresses the data duplication concerns voiced above ... but the reference is still a "heavy" reference, as follows:

SourceReference

  • id : String -- A system identifier.
  • type : SourceReferenceType -- An indication of the type of relationship that exists between the object making the reference and the source being referenced.
  • sourceDescription : ResourceReference -- A pointer to the description of the source being referenced.
  • attribution : Attribution -- who is contributing this reference, when, why and the level of confidence they have in their contribution.

I have updated pull request #182 to reflect these changes.

@jralls
Contributor
jralls commented Jul 18, 2012

SourceReference

  • id : String -- A system identifier.
  • type : SourceReferenceType -- An indication of the type of relationship that exists between the object making the reference and the source being referenced.
  • sourceDescription : ResourceReference -- A pointer to the description of the source being referenced.
  • attribution : Attribution -- who is contributing this reference, when, why and the level of confidence they have in their contribution.

Well, if SourceReference has an Id, then presumably Conclusion and SourceDescription are (when serialized, anyway) storing the Id, meaning that the SourceReference is reusable. That reduces the weight a lot. But if a Conclusion has an Attribution, the SourceReference doesn't really need one, does it? It doesn't seem to make a lot of sense to require an attribution to add an hierarchical element to a SourceDescription. What, then, is the point of the Attribution on SourceReference?

@jralls
Contributor
jralls commented Jul 18, 2012

Is the "sourceDescription" URI in SourceReference still intended to be able to point to anything, Conclusions as well as Sources, but maybe also a YouTube video of someone's cat? If not, how does a Conclusion reference an EvidenceAnalysis? By creating a SourceDescription pointing to it?

@thomast73
Contributor

Is the "sourceDescription" URI in SourceReference still intended to be able to point to anything...?

No. SourceReference.sourceDescription is only supposed to point to a SourceDescription instance.

It used to be (briefly, in the branch for pull request #182) that SourceReference.source allowed the "point-at-anything" case and, therefore, would have allowed a Conclusion to point directly at another Conclusion (e.g., an AnalysisDocument -- the object in the pull request that will be my EvidenceAnalysis object). By removing SourceReference.source, you may now reference another Conclusion (e.g., an AnalysisDocument) only by creating a SourceDescription that points to that Conclusion, then adding a SourceReference that points to that SourceDescription to the dependent Conclusion.
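A minimal sketch of that two-step indirection, under the assumption that each class carries only the fields needed for the example. The `cite` helper, the constructor shapes, and the identifiers are hypothetical and are not the API in pull request #182:

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a description whose "about" URI names the real thing.
final class SourceDescription {
    final String id;
    final URI about;

    SourceDescription(String id, URI about) {
        this.id = id;
        this.about = about;
    }
}

// A reference may point only at a SourceDescription, never at the source.
final class SourceReference {
    final SourceDescription sourceDescription;

    SourceReference(SourceDescription sourceDescription) {
        this.sourceDescription = sourceDescription;
    }
}

class Conclusion {
    final URI uri;
    final List<SourceReference> sources = new ArrayList<>();

    Conclusion(URI uri) {
        this.uri = uri;
    }

    // Step 1: describe the other conclusion as a source.
    // Step 2: reference that description from this conclusion.
    SourceDescription cite(Conclusion other, String descriptionId) {
        SourceDescription description = new SourceDescription(descriptionId, other.uri);
        sources.add(new SourceReference(description));
        return description;
    }
}
```

Here a working conclusion would reach an analysis document only through `sources.get(i).sourceDescription.about`, never by pointing at it directly.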

@jralls
Contributor
jralls commented Jul 19, 2012

creating a SourceDescription that points to that Conclusion, then adding a SourceReference that points to that SourceDescription to the dependent Conclusion

OK. Conceptually pure, but may result in whining (but not from me).

@EssyGreen

I'm prolly gonna do the whining but still struggling to find the model to whine about! (see #182)

@thomast73
Contributor

creating a SourceDescription that points to that Conclusion, then adding a SourceReference that points to that SourceDescription to the dependent Conclusion

OK. Conceptually pure, but may result in whining (but not from me).

I think your statement is correct...

...SourceReference.source allowed the "point-at-anything" case...

...but that is the implication of doing away with the source member. If we keep both the source and sourceDescription members, it is possible (even desired) that the source will duplicate the SourceDescription.about reference -- something we discussed above and that most votes seemed to indicate was a "bad" thing. In removing source we encourage the sourcing model by, in part, removing the option of directly connecting to objects that are not described by a SourceDescription -- a loss in flexibility, but a more "normal form" design.

Are you proposing that we should have dropped sourceDescription and kept source in SourceReference? The ultimate in flexibility, but the "wild west" in terms that anything is possible and almost all of it is impossible to govern?

@thomast73
Contributor

SourceReference

  • id : String ...
  • ...
  • attribution : Attribution ...

Well, if SourceReference has an Id, then presumably Conclusion and SourceDescription are (when serialized, anyway) storing the Id, meaning that the SourceReference is reusable.

???

There appears to be some confusion as to the purpose of the id -- even for myself. @stoicflame plans to open an issue to discuss this. One of the potential outcomes of that discussion is that id would go away. In other words, we have been thinking of SourceReference as being held in-line (by value), not by id (by reference).

But if a Conclusion has an Attribution, the SourceReference doesn't really need one, does it? It doesn't seem to make a lot of sense to require an attribution to add an hierarchical element to a SourceDescription. What, then, is the point of the Attribution on SourceReference?

There are some issues that need to be discussed here. @stoicflame or myself plan to open an issue on this soon -- or you could beat us to it. :-)

For now, I plan to leave these two members intact -- not part of this pull request -- pending further, merited discussion.

@jralls
Contributor
jralls commented Jul 19, 2012

Are you proposing that we should have dropped sourceDescription and kept source in SourceReference? The ultimate in flexibility, but the "wild west" in terms that anything is possible and almost all of it is impossible to govern?

Absolutely not.

I was thinking of keeping/resurrecting GenealogicalResource and having both Conclusion and SourceDescription subclass it. SourceReference.(re)source would have type GenealogicalResource. But requiring that one explicitly declare that a Conclusion is a Source via a SourceDescription is good, too, and introduces the ability to add a citation for a conclusion from someone else's database. That's a win for the web service use case and the extra overhead for the single-file case is tolerable.

@jralls
Contributor
jralls commented Jul 19, 2012

Well, if SourceReference has an Id, then presumably Conclusion and SourceDescription are (when serialized, anyway) storing the Id, meaning that the SourceReference is reusable.

???

There appears to be some confusion as to the purpose of the id -- even for myself. @stoicflame plans to open an issue to discuss this. One of the potential outcomes of that discussion is that id would go away. In other words, we have been thinking of SourceReference as being held in-line (by value), not by id (by reference).

We're talking about the serialized data here, right? What purpose would the Id field serve if the SourceReferences are inline?

@thomast73
Contributor

What purpose would the Id field serve if the SourceReferences are inline?

And thus the possibility it might not be needed in the model... :-)

@jralls
Contributor
jralls commented Jul 19, 2012

But if a Conclusion has an Attribution, the SourceReference doesn't really need one, does it? It doesn't seem to make a lot of sense to require an attribution to add an hierarchical element to a SourceDescription. What, then, is the point of the Attribution on SourceReference?

There are some issues that need to be discussed here. @stoicflame or myself plan to open an issue on this soon -- or you could beat us to it. :-)

OK. It's #192.

@thomast73
Contributor

We have been reviewing the updates to the sources model with various interested parties (internal and external) and, in the process, we have been discovering that the "source reference type" idea is difficult for others to understand and get right. So, we are considering modifications as follows:

SourceReference

  • id : String -- A local-only, system identifier
  • sourceDescription : ResourceReference -- A pointer to the description of the source being referenced
  • attribution : Attribution -- the attribution for this reference

SourceDescription

  • id : String -- a local-only, system identifier
  • citation : SourceCitation -- the metadata used to describe and cite this source
  • sourceDerivationType : SourceDerivationType -- how this source is related to its source(s)
  • about : URI -- if applicable, the URI to the actual source
  • mediator : URI -- a reference to the entity that mediates access to the described source
  • sources[0..*] : StringReferences -- references to any sources to which this source is related
  • displayName : String -- a display name for this source
  • alternateNames[0..*] : TextValue -- a list of alternate display names for this source
  • notes[0..*] : Note -- a list of notes about a source
  • attribution : Attribution -- the attribution for this source description

SourceDerivationType

  • Original
  • PreservationCopy
  • Abstract
  • Transcription
  • Translation
  • ExtractedConclusion
  • Analysis
  • WorkingConclusion

There are two things I would point out about this change.

First, what was SourceReferenceType is now SourceDerivationType and the derivation type is a member of the SourceDescription (removed from SourceReference). We feel that users/implementors will more readily understand and apply this type field as part of the SourceDescription.

Second, we dropped the "ComponentOf" from the type list, and added "Original". If the source description is about the original source, we would want to say so using "Original" type value. The "ComponentOf" was removed because it did not fit with the idea expressed with this field -- it is not a true derivation type. I do not think that removing "ComponentOf" will prevent users/implementors from describing source hierarchies of this type. With this modification, the source hierarchies (source provenance chains) are now independent of the type field.
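A rough sketch of how the revised shape might look in code. It simplifies the proposal by letting `sources` hold the related descriptions directly rather than SourceReference objects, and the `DerivationSketch` helper is hypothetical. It also assumes the chain has no cycles:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the revised model: the type lives on the description.
enum SourceDerivationType {
    ORIGINAL, PRESERVATION_COPY, ABSTRACT, TRANSCRIPTION, TRANSLATION,
    EXTRACTED_CONCLUSION, ANALYSIS, WORKING_CONCLUSION
}

final class SourceDescription {
    final String id;
    final SourceDerivationType sourceDerivationType;
    final List<SourceDescription> sources; // related descriptions, 0..*

    SourceDescription(String id, SourceDerivationType type, SourceDescription... sources) {
        this.id = id;
        this.sourceDerivationType = type;
        this.sources = Arrays.asList(sources);
    }
}

final class DerivationSketch {
    // Collects the ids of every description marked ORIGINAL that is
    // reachable through the sources links (assumes the links are acyclic).
    static List<String> originals(SourceDescription start) {
        List<String> found = new ArrayList<>();
        collect(start, found);
        return found;
    }

    private static void collect(SourceDescription description, List<String> found) {
        if (description.sourceDerivationType == SourceDerivationType.ORIGINAL) {
            found.add(description.id);
        }
        for (SourceDescription source : description.sources) {
            collect(source, found);
        }
    }
}
```

In this shape the traversal follows `sources` alone and only reads the type field when deciding what to report, which matches the statement that the chains are independent of the type.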

@jralls
Contributor
jralls commented Aug 7, 2012

sources[0..*] : StringReferences -- references to any sources to which this source is related

That's supposed to be SourceReferences, right?

At this point the SourceReference class isn't carrying any useful information and should be dropped in favor of just using the URI of the SourceDescription. Furthermore, since SourceReferences no longer carry a type and SourceDescriptions can only be "original" or one type of derivative, sources should be a single item.

Next, ExtractedConclusion, Analysis, and WorkingConclusion applied to Conclusion objects, not SourceDescription objects. They shouldn't be carried over to SourceDescription::SourceDerivation. Analysis has its own class, AnalysisDocument, and by definition it lays out the case for a WorkingConclusion. Some -- but not all -- of the other Conclusion subclasses need to carry an Extracted boolean or a "ConclusionType" enum, but we should probably open a new issue for that along with what multiplicity of SourceDescription references is appropriate for each.

I expect that SourceDescriptions of Conclusions will generally be of type "Original".

Second, we dropped the "ComponentOf" from the type list, and added "Original". If the source description is about the original source, we would want to say so using the "Original" type value. The "ComponentOf" was removed because it did not fit with the idea expressed with this field -- it is not a true derivation type. I do not think that removing "ComponentOf" will prevent users/implementors from describing source hierarchies of this type. With this modification, the source hierarchies (source provenance chains) are now independent of the type field.

I suppose a hierarchy might be implied if the parent Source's SourceDerivationType is the same as the child. If that's your intent, it should be specified.

@EssyGreen

I too have found the SourceReference difficult to understand ... I still think of it in terms of a Source record (=SourceDescription) and anything that references a source is just a pointer/cross-reference to it. The idea of having a "SourceReference" which contains both an id and a pointer to the source just seems crazy to me - and the Attribution surely should be a part of the Source(Description) and not a property of the Reference.

Also, could you clarify how source hierarchies can be linked together without the componentOf property? (I'm not saying it can't be done - I'd just like clarity on how ... and don't have time to wade through all the models/code atm)

@EssyGreen

sources[0..*]

In my opinion this is totally ambiguous and therefore meaningless (in this context)

@thomast73
Contributor

ExtractedConclusion, Analysis, and WorkingConclusion applied to Conclusion objects, not SourceDescription objects. They shouldn't be carried over to SourceDescription::SourceDerivation.

Sometimes, conclusions are used as sources. In such cases, we would describe our "source" conclusion with a SourceDescription. For the example of a genealogical proof statement, we would have a SourceDescription describing the proof statement (who created it, when it was created, etc.) and the derivation type would be set to "Analysis". The working conclusion would then add a SourceReference pointing to the description of the proof statement.

@thomast73
Contributor

... sources should be a single item.

I have thought about this as well. The use case that causes me to leave it as 0..* and not 0..1 is this:

If I want to reference a source that is derived from multiple sources (e.g., a genealogical proof statement) and I want to describe the provenance of that source but the source is not representable as a Conclusion in the system, then I would need to describe these parent sources and connect them to the source description that describes my source.

While this may be a rare case, it still seems relevant and, I think, requires the 0..* multiplicity for sources.

@thomast73
Contributor

@jralls: I suppose a hierarchy might be implied if the parent Source's SourceDerivationType is the same as the child. If that's your intent, it should be specified.

I do not think a particular kind of hierarchy can be implied. All that can be said is that the sources listed in one description are related to the source described in that description.

The statement of derivation type is made independent of sources. I am not guaranteed to find a description of the source from which the described source was derived in the sources list, nor do I have a mechanism to identify it if it is in that list. The statement of derivation type merely represents that the source was analyzed and a judgment was made as to its relationship to its parent source (if applicable).

@EssyGreen: ...could you clarify how source hierarchies can be linked together without the componentOf property?

Consider the following Conclusion about a "residence" with the following source provenance chain:
A. Household found in 1900 US Census
B. "contained in" FamilySearch "United States Census, 1900" index
C. "derived from" "Population Schedules of the 1900 Census." NARA microfilm publication M432.

We could describe this as follows:

Conclusion.sources = [ SrcRefA ]

SrcRefA.sourceDescription = SrcDescA

SrcDescA.citation = { …(details about the specific household)… }
SrcDescA.sources = [ SrcRefB ]
SrcDescA.derivationType = SourceDerivationType.ExtractedConclusion

SrcRefB.sourceDescription = SrcDescB

SrcDescB.citation = { …(details about the FamilySearch index)… }
SrcDescB.sources = [ SrcRefC ]
SrcDescB.derivationType = SourceDerivationType.ExtractedConclusion

SrcRefC.sourceDescription = SrcDescC

SrcDescC.citation = { …(detail about NARA microfilm publication)… }
SrcDescC.sources = null
SrcDescC.derivationType = SourceDerivationType.PreservationCopy
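For illustration only, the chain above could be walked roughly like this. This is a minimal Python sketch, not part of the proposed model: the `SourceDescription` class, its snake_case field names, and the `provenance_chain` helper are all hypothetical, the SourceReference indirection is collapsed into a direct pointer, and the walk assumes the first entry of `sources` is the next link in the chain (which the model itself does not guarantee).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SourceDescription:
    citation: str
    derivation_type: Optional[str] = None
    # Stand-in for the SourceReference list: each entry points straight
    # at the referenced description (an assumption made for brevity).
    sources: List["SourceDescription"] = field(default_factory=list)


def provenance_chain(desc: SourceDescription) -> List[str]:
    """Collect citations by following the first entry of `sources` at each level."""
    chain: List[str] = []
    current: Optional[SourceDescription] = desc
    while current is not None:
        chain.append(current.citation)
        current = current.sources[0] if current.sources else None
    return chain


# Built from the end of the chain backwards, mirroring SrcDescC, SrcDescB, SrcDescA.
src_desc_c = SourceDescription("NARA microfilm publication", "PreservationCopy")
src_desc_b = SourceDescription("FamilySearch index", "ExtractedConclusion", [src_desc_c])
src_desc_a = SourceDescription("Household in 1900 US Census", "ExtractedConclusion", [src_desc_b])

print(provenance_chain(src_desc_a))
```

Note that the sketch has to guess which entry of `sources` to follow; with more than one entry there is nothing in the description that says which one continues the chain.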

@jralls
Contributor
jralls commented Aug 7, 2012

Sometimes, conclusions are used as sources. In such cases, we would describe our "source" conclusion with a SourceDescription. For the example of a genealogical proof statement, we would have a SourceDescription describing the proof statement (who created it, when it was created, etc.) and the derivation type would be set to "Analysis". The working conclusion would then add a SourceReference pointing to the description of the proof statement.

Sorry, that doesn't make sense.

The Conclusion object already has an attribution describing who created it, etc., and it becomes a Source in its own right; the SourceDescription should be of type "Original". The Attribution on the SourceDescription applies to the SourceDescription itself, not the Source. Suppose you and I are working on the same family in a public tree. You work all day writing up an Analysis of several sources with arguments for a bunch of conclusions, but your wife calls you to dinner before you have time to fill in the actual conclusions. I'm an hour behind you (time zone), so I write a SourceDescription for your AnalysisDocument and create a bunch of Conclusion objects with references to that SourceDescription. The Attribution on the AnalysisDocument describes you, and the one on the SourceDescription and Conclusions describes me. The SourceDescription is of type "Original" because your AnalysisDocument is, in fact, original, not derivative. The Conclusions are Extracted because you argue through all of the various evidence in your analysis to work out the people, places, dates, etc.

@jralls
Contributor
jralls commented Aug 7, 2012

I do not think a particular kind of hierarchy can be implied. All that can be said is that the sources listed in one description are related to the source described in that description.

That's rather less than useless. What does "related to" mean in this context? What is the use of "SourceDerivation" if sources doesn't point to what it's derived from?

EDIT:

The statement of derivation type merely represents that the source was analyzed and a judgment was made as to its relationship to its parent source (if applicable).

There is no automated use for such a flag, so that properly belongs in the source analysis, which we earlier agreed should be in a Note on the SourceDescription.

@jralls
Contributor
jralls commented Aug 7, 2012

Consider the following Conclusion about a "residence" with the following source provenance chain:
A. Household found in 1900 US Census
B. "contained in" FamilySearch "United States Census, 1900" index
C. "derived from" "Population Schedules of the 1900 Census." NARA microfilm publication M432.

That's not the problem that ComponentOf was intended for. Rather, consider that I have 8 great-grandparents and a few great-great-grandparents, along with a bunch of great-great-uncles and aunts alive in 1900. Add in my wife's family and other families of interest and there are easily 100 census entries that I'm interested in, clustered in a few towns in 3 states. I don't want to have to type all of the top-level source info 100 times, I want to break it down into levels. Here's one way:

A. Family XX, page YY, accessed DD Month YYYY
B. Town, County, State, Roll
C. 1900 U.S. Census, digital image, FamilySearch
D. NARA Microfilm T623

No conclusions there, extracted or otherwise. How do you do this in your new model?

@thomast73
Contributor

A. Family XX, page YY, accessed DD Month YYYY
B. Town, County, State, Roll
C. 1900 U.S. Census, digital image, FamilySearch
D. NARA Microfilm T623

Conclusion.sources = [ SrcRefA ]

SrcRefA.sourceDescription = SrcDescA

SrcDescA.citation= { …(Family XX, page YY, accessed DD Month YYYY)… }
SrcDescA.sources = [ SrcRefB ]
SrcDescA.derivationType = SourceDerivationType.PreservationCopy

SrcRefB.sourceDescription = SrcDescB

SrcDescB.citation= { …(Town, County, State, Roll)… }
SrcDescB.sources = [ SrcRefC ]
SrcDescB.derivationType = SourceDerivationType.PreservationCopy

SrcRefC.sourceDescription = SrcDescC

SrcDescC.citation = { …(1900 U.S. Census, digital image, FamilySearch)… }
SrcDescC.sources = [ SrcRefD ]
SrcDescC.derivationType = SourceDerivationType.PreservationCopy

SrcRefD.sourceDescription = SrcDescD

SrcDescD.citation = { …(NARA Microfilm T623)… }
SrcDescD.sources = null
SrcDescD.derivationType = SourceDerivationType.PreservationCopy
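As a rough check on the reuse concern, here is a minimal Python sketch, assuming that many household-level descriptions may point at the same shared upper levels. The class, the `distinct_descriptions` helper, the `Family {n}` citations, and the count of 100 households (taken from the "easily 100 census entries" remark) are illustrative assumptions, and the SourceReference indirection is again collapsed into a direct pointer.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set


@dataclass(eq=False)  # identity-based equality and hashing, so instances can go in a set
class SourceDescription:
    citation: str
    sources: List["SourceDescription"] = field(default_factory=list)


def distinct_descriptions(leaves: Iterable[SourceDescription]) -> Set[SourceDescription]:
    """Return every description reachable from the given leaf descriptions."""
    seen: Set[SourceDescription] = set()
    stack = list(leaves)
    while stack:
        desc = stack.pop()
        if desc not in seen:
            seen.add(desc)
            stack.extend(desc.sources)
    return seen


# Upper levels are described once.
microfilm = SourceDescription("NARA Microfilm T623")
census = SourceDescription("1900 U.S. Census, digital image, FamilySearch", [microfilm])
town = SourceDescription("Town, County, State, Roll", [census])

# Hypothetical household-level descriptions, all pointing at the same town-level one.
households = [SourceDescription(f"Family {n}", [town]) for n in range(100)]

print(len(distinct_descriptions(households)))
```

Under these assumptions the shared levels are stored once, so the file holds one description per household plus three shared ones rather than four per household.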

@jralls
Contributor
jralls commented Aug 7, 2012

I do not think a particular kind of hierarchy can be implied. All that can be said is that the sources listed in one description are related to the source described in that description.

SrcDescA.citation= { …(Family XX, page YY, accessed DD Month YYYY)… }
SrcDescA.sources = [ SrcRefB ]
SrcDescA.derivationType = SourceDerivationType.PreservationCopy

SrcRefB.sourceDescription = SrcDescB

SrcDescB.citation= { …(Town, County, State, Roll)… }
SrcDescB.sources = [ SrcRefC ]
SrcDescB.derivationType = SourceDerivationType.PreservationCopy

You're not being very consistent here...

@jralls
Contributor
jralls commented Aug 7, 2012

If I want to reference a source that is derived from multiple sources (e.g., a genealogical proof statement) and I want to describe the provenance of that source but the source is not representable as a Conclusion in the system, then I would need to describe these parent sources and connect them to the source description that describes my source.

That's not really a valid use-case. See EE Para 2.21, "Citing the Source of a Source". (Summary: Don't, unless you found and examined the original source.) OTOH, if your genealogical proof statement is an AnalysisDocument in the GedcomX dataset at hand, then it already has its citation information and adding it to the SourceDescription is redundant. The only legitimate use of "Source of a Source" is facsimile reproduction (EE 2.11).

@EssyGreen

I do not think a particular kind of hierarchy can be implied. All that can be said is that the sources listed in one description are related to the source described in that description.

That's rather less than useless. What does "related to" mean in this context? What is the use of "SourceDerivation" if sources doesn't point to what it's derived from?

I totally agree ... that was my point - but thx for the example :)

The only legitimate use of "Source of a Source" is facsimile reproduction

I don't agree here ... I think you can cite anything as long as it is clear what the source is. For example, Bishop's Transcripts are often used here where the original BMD records have been destroyed. By definition these are derivatives (albeit old ones!) but are nonetheless valuable. Similarly, let's suppose the 1851 Census was destroyed .. would we just shrug and say we can no longer cite it? Or would we try to build up a picture of it by using the transcriptions held by various FHS?

@thomast73
Contributor

I do not think a particular kind of hierarchy can be implied. All that can be said is that the sources listed in one description are related to the source described in that description.

That's rather less than useless. What does "related to" mean in this context? What is the use of "SourceDerivation" if sources doesn't point to what it's derived from?

I am not suggesting that sources will be used to do anything that is not describing the provenance of the source being described. The reason we include sources in SourceDescription is so that we can provide the best possible description of a source's provenance.

I do feel that the relationship between a source that is a "component of" another source is different than the relationship between a source that is "derived from" another source. It seems that the use cases proffered for the "component of" relationship are really about describing multiple elements from a single source -- a mechanism for defining a citation in pieces so that one or more of those pieces can be reused in citing additional elements in that source.

On one hand, it seems the source metadata model is about supporting the need to describe a source and its provenance. This description (of the source and its provenance) is generally treated as a single logical item ("the source"). Regardless of the number of elements used to describe "the source" (and the type of relationships that relate those elements), the user is still just describing "the source" and associating "the source" with his "conclusion".

I think the model is sufficient for this purpose.

On the other hand, if we were to state that the sources metadata model MUST support the automated rendering of entire provenance chains that include the citation of sources that are split into pieces (via the "component of" mechanism), then the model may need some help.

For example, it may be that if we added a type to SourceReference that could be either "derived from" or "component of", we could automatically combine consecutive instances of SourceDescription linked together by "component of" relationships into a single citation for that source, while citations for sources "derived from" other sources could be linked together with the " citing " phrase. But is this the right mechanism for this? I am not sure it is. I'd like to get a better feel for what the requirements for this really are. I think the model is better prepared to be morphed to support such scenarios. But we are not prepared to automate all of this yet.
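To make the idea concrete, here is a minimal Python sketch of that kind of rendering, assuming each description has at most one typed parent reference. The class names, the `parent` field, the `render` helper, and the exact joiners (`", "` for "component of", `"; citing "` for "derived from") are hypothetical choices for illustration, not anything the model specifies.

```python
from dataclasses import dataclass
from typing import Optional

COMPONENT_OF = "component of"
DERIVED_FROM = "derived from"


@dataclass
class SourceReference:
    description: "SourceDescription"
    type: str  # COMPONENT_OF or DERIVED_FROM


@dataclass
class SourceDescription:
    citation: str
    parent: Optional[SourceReference] = None


def render(desc: SourceDescription) -> str:
    """Merge "component of" links into one citation; join "derived from" links with "citing"."""
    text = desc.citation
    ref = desc.parent
    while ref is not None:
        joiner = ", " if ref.type == COMPONENT_OF else "; citing "
        text += joiner + ref.description.citation
        ref = ref.description.parent
    return text


d = SourceDescription("NARA Microfilm T623")
c = SourceDescription("1900 U.S. Census, digital image, FamilySearch",
                      SourceReference(d, DERIVED_FROM))
b = SourceDescription("Town, County, State, Roll", SourceReference(c, COMPONENT_OF))
a = SourceDescription("Family XX, page YY", SourceReference(b, COMPONENT_OF))

print(render(a))
```

The sketch only works because every reference says which kind of link it is; without that type, the renderer could not tell where one citation ends and the next begins.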

@EssyGreen

I don't agree with your use case for "component of". My intent (I believe I was the one to request it originally - see #123) was to enable the ability to represent the hierarchical nature of archival source collections and to cite from various parts of said collections. For example, a census book cover might have information about the households in statistical form and details about the enumerator. A page within that book might have details about a particular household. I may be interested in both and the fact that one is within the scope of the other may be relevant (e.g. the fact that the enumerator may have been recording his own family details; the fact that there should only be one representation of one person within a census etc etc).

If you are not prepared to provide a means of using these references then I think it is better and less confusing to simply omit them. After all, things may move on by the time you get to them (witness the old FONE and ROMN tags). Don't build in a partial "something" just in case it "might come in useful one day" - it will be ignored, misinterpreted and misused and we'll all have a nightmare trying to unravel the mess.

@jralls
Contributor
jralls commented Aug 8, 2012

I believe I was the one to request it originally

No, I did.

For example, a census book cover might have information about the households in statistical form and details about the enumerator. A page within that book might have details about a particular household.

That's essentially the same use-case at a smaller scale:

  • Census book SDA
  • Book cover, SDB, is containedIn SDA
  • Family page, SDC, is containedIn SDA.

(And book SDA is one of ten contained in box SDX, part of record group SDY).
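A small Python sketch of this containment tree, assuming each description has at most one container. The `contained_in` mapping and the two helper functions are hypothetical; only the SDA/SDB/SDC/SDX/SDY labels come from the example above.

```python
from typing import Dict, List, Optional

# child -> container, following the example above
contained_in: Dict[str, Optional[str]] = {
    "SDY": None,   # record group
    "SDX": "SDY",  # box
    "SDA": "SDX",  # census book
    "SDB": "SDA",  # book cover
    "SDC": "SDA",  # family page
}


def containers(sd: str) -> List[str]:
    """List the containers of a description, nearest first."""
    chain: List[str] = []
    parent = contained_in[sd]
    while parent is not None:
        chain.append(parent)
        parent = contained_in[parent]
    return chain


def nearest_shared_container(a: str, b: str) -> Optional[str]:
    """Find the closest container that both descriptions sit inside, if any."""
    b_containers = set(containers(b))
    for candidate in containers(a):
        if candidate in b_containers:
            return candidate
    return None


print(containers("SDC"))
print(nearest_shared_container("SDB", "SDC"))
```

With the containment link recorded, an application can tell that the cover and the family page belong to the same book, which is the relationship the census-enumerator example depends on.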

If you are not prepared to provide a means of using these references then I think it is better and less confusing to simply omit them.

+1

@jralls
Contributor
jralls commented Aug 8, 2012

I am not suggesting that sources will be used to do anything that is not describing the provenance of the source being described. The reason we include sources in SourceDescription is so that we can provide the best possible description of a source's provenance.

OK, then the spec should say so, not just that the sources are "related to" the source being described.

On one hand, it seems the source metadata model is about supporting the need to describe a source and its provenance. This description (of the source and its provenance) is generally treated as a single logical item ("the source"). Regardless of the number of elements used to describe "the source" (and the type of relationships that relate those elements), the user is still just describing "the source" and associating "the source" with his "conclusion".

OK again, but recognize that that will force a large amount of redundant data into the file and require extra parsing overhead on the part of using applications which do provide multi-level citations (and most provide for at least two levels).

@thomast73
Contributor

On one hand, it seems the source metadata model is about supporting the need to describe a source and its provenance. This description (of the source and its provenance) is generally treated as a single logical item ("the source"). Regardless of the number of elements used to describe "the source" (and the type of relationships that relate those elements), the user is still just describing "the source" and associating "the source" with his "conclusion".

... but recognize that that will force a large amount of redundant data into the file and require extra parsing overhead ...

???

I am not suggesting we eliminate the ability to create a hierarchy of source descriptions to represent "the source". And I am not suggesting that parts of that hierarchy cannot be reused to describe another source. I am just saying that the whole chain of source descriptions (that starts with the reference in the conclusion) is logically one "source". If we used five source descriptions linked together to describe that source and its provenance, it is still logically one source.

@thomast73
Contributor

If you are not prepared to provide a means of using these references then I think it is better and less confusing to simply omit them.

What exactly is missing? How is it that there is no means of using these references?

@jralls
Contributor
jralls commented Aug 8, 2012

If we used five source descriptions linked together to describe that source and its provenance, it is still logically one source.

Agreed.

I am not suggesting we eliminate the ability to create a hierarchy of source descriptions to represent "the source". And I am not suggesting that parts of that hierarchy cannot be reused to describe another source.

Then what do you mean by

I think the model is better prepared to be morphed to support such scenarios. But we are not prepared to automate all of this yet.

?

You earlier gave an example where you used the sources field to indicate a hierarchy, but at present there's nothing in the spec to indicate to an importing app that that's what they're there for. The SourceReferenceType did that, attaching meaning to each source.

@jralls
Contributor
jralls commented Aug 8, 2012

What exactly is missing? How is it that there is no means of using these references?

A description of what each one of them is for. "related to" isn't sufficient, and "best possible description of a source's provenance" is pretty vague, too. SourceReferenceType described how each SourceReference in sources contributed to the "description of provenance". SourceDerivationType, being uncoupled from sources (never mind that there can be only one SourceDerivationType and many sources) doesn't:

The statement of derivation type is made independent of sources. I am not guaranteed to find a description of the source from which the described source was derived in the sources list, nor do I have a mechanism to identify it if it is in that list. The statement of derivation type merely represents that the source was analyzed and a judgment was made as to its relationship to its parent source (if applicable).

You can't seem to keep your story straight. ISTM you guys need to pull this and go think about it some more.

@stoicflame
Member

So what if SourceDescription contained a list of sources (as proposed) but also contained another property, componentOf that is of type SourceReference?

So then:

A. Family XX, page YY, accessed DD Month YYYY
B. Town, County, State, Roll
C. 1900 U.S. Census, digital image, FamilySearch
D. NARA Microfilm T623

Is modeled like:

Conclusion.sources = [ SrcRefA ]

SrcRefA.sourceDescription = SrcDescA

SrcDescA.citation= { …(Family XX, page YY, accessed DD Month YYYY)… }
SrcDescA.componentOf = SrcRefB

SrcRefB.sourceDescription = SrcDescB

SrcDescB.citation= { …(Town, County, State, Roll)… }
SrcDescB.componentOf = SrcRefC

SrcRefC.sourceDescription = SrcDescC

SrcDescC.citation = { …(1900 U.S. Census, digital image, FamilySearch)… }
SrcDescC.sources = [ SrcRefD ]
SrcDescC.derivationType = SourceDerivationType.PreservationCopy

SrcRefD.sourceDescription = SrcDescD

SrcDescD.citation = { …(NARA Microfilm T623)… }
SrcDescD.derivationType = SourceDerivationType.PreservationCopy
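One consequence of this layout is that only SrcDescC carries `sources`, so a consumer looking for the provenance of SrcDescA has to climb the `componentOf` links first. A minimal Python sketch of that lookup follows, assuming a component inherits the `sources` of its nearest container that has any. That inheritance rule, the class, and the `inherited_sources` helper are assumptions for illustration and are not stated in the proposal.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SourceDescription:
    citation: str
    component_of: Optional["SourceDescription"] = None
    sources: List["SourceDescription"] = field(default_factory=list)


def inherited_sources(desc: SourceDescription) -> List[SourceDescription]:
    """Climb component_of until a description with its own sources is found."""
    current: Optional[SourceDescription] = desc
    while current is not None:
        if current.sources:
            return current.sources
        current = current.component_of
    return []


src_desc_d = SourceDescription("NARA Microfilm T623")
src_desc_c = SourceDescription("1900 U.S. Census, digital image, FamilySearch",
                               sources=[src_desc_d])
src_desc_b = SourceDescription("Town, County, State, Roll", component_of=src_desc_c)
src_desc_a = SourceDescription("Family XX, page YY", component_of=src_desc_b)

print([s.citation for s in inherited_sources(src_desc_a)])
```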

@stoicflame
Member

A description of what each one of them is for. "related to" isn't sufficient, and "best possible description of a source's provenance" is pretty vague, too. SourceReferenceType described how each SourceReference in sources contributed to the "description of provenance"

The problem is that we couldn't get very far down that road without confusing people. The question they kept asking went something like this:

If all source references to an extracted conclusion (or analysis, or transcription, or image, etc) are of type ExtractedConclusion, why don't you just put the type ExtractedConclusion on the description instead of repeating it all over the place?

Since we were unable to give a good answer, we decided to propose the consolidation of the type onto the SourceDescription.

So what would your answer to that question be?

@jralls
Contributor
jralls commented Aug 8, 2012

So what if SourceDescription contained a list of sources (as proposed) but also contained another property, componentOf that is of type SourceReference?

I started to propose exactly that, but got distracted by Thad's "logically one source" tangent.

Yes, that will work.

@jralls
Contributor
jralls commented Aug 8, 2012

So what would your answer to that question be?

I think I already did:

It's not a type of the Source, it's a usage type of the SourceReference.

So if SourceDescA describes a birth certificate, and EventA extracts the particulars of that birth event without any inferences, then the SourceReference in EventA pointing to SourceDescA has type "ExtractedConclusion". SourceDescB, which describes EventA, doesn't need a type, it has a pointer to the original. If AnalysisDocumentG uses SourceDescB, its SourceReference will be of type Analysis. There is no repetition.

@stoicflame
Member

And if AnalysisDocumentG uses all of SourceDescB, SourceDescC, SourceDescD, SourceDescE, and SourceDescF, under what circumstances will one of those SourceReferences not be of type Analysis? And if they're all of type Analysis, why do we even need a type on the source reference? You know it's of type Analysis because you're looking at an analysis document--the type is redundant and you're repeating yourself.

@jralls
Contributor
jralls commented Aug 8, 2012

And if AnalysisDocumentG uses all of SourceDescB, SourceDescC, SourceDescD, SourceDescE, and SourceDescF, under what circumstances will one of those SourceReferences not be of type Analysis? And if they're all of type Analysis, why do we even need a type on the source reference? You know it's of type Analysis because you're looking at an analysis document--the type is redundant and you're repeating yourself.

Roger. So the only non-redundant usage is ExtractedConclusion vs. WorkingConclusion.

So rip it all out. SourceReference should be a SourceDescription Reference in the model, and the URI of a SourceDescription in the implementations. No need for typing, no need for a separate ID, no need for an Attribution.

SourceDescription doesn't need a SourceDerivationType, that's covered in the Citation, perhaps with additional detail provided in Notes.

SourceDescription does need a containedInSource SourceReference. I started to say that it needs a citesSource SourceReference too, but maybe that should be in the Citation as well. I could go either way on that.

Conclusion needs to have an "Extracted" flag.
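Roughly, in Python, and only as a sketch of the shape I mean: the class layout, the snake_case names, the `#sd-...` URIs, and the `index` lookup table are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SourceDescription:
    uri: str
    citation: str
    # URI of the containing SourceDescription, if any (no type, id, or attribution).
    contained_in_source: Optional[str] = None


@dataclass
class Conclusion:
    # A source reference is just the URI of a SourceDescription.
    sources: List[str] = field(default_factory=list)
    extracted: bool = False


# Hypothetical lookup table from URI to description.
index: Dict[str, SourceDescription] = {
    "#sd-book": SourceDescription("#sd-book", "Census book"),
    "#sd-page": SourceDescription("#sd-page", "Family page", contained_in_source="#sd-book"),
}

residence = Conclusion(sources=["#sd-page"], extracted=True)

for uri in residence.sources:
    page = index[uri]
    print(page.citation, "in", index[page.contained_in_source].citation)
```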

@thomast73
Contributor

I started to say that it needs a citesSource SourceReference too, but ...

Isn't that what sources is? This case is already covered.

@jralls
Contributor
jralls commented Aug 8, 2012

Isn't that what sources is? This case is already covered.

I don't know. Sometimes you say it is, sometimes not. If it is, SourceDescription should have only one -- only some Conclusion subclasses (AnalysisDocument, Event, Person, Relationship, maybe others, but only when ExtractedConclusion is False) need multiple sources.

But maybe, in the case of SourceDescription, it should be handled as part of Citation.

@thomast73
Contributor

But maybe, in the case of SourceDescription, it should be handled as part of Citation.

I certainly agree that it could be modeled that way. For now, let's get some feedback on the sources way of modeling it.

@thomast73
Contributor

So, to summarize, I will update the model as follows:

SourceReference

  • sourceDescription : ResourceReference -- A pointer to the description of the source being referenced
  • attribution : Attribution -- the attribution for this reference

SourceDescription

  • id : String -- a local-only, system identifier
  • citation : SourceCitation -- the metadata used to describe and cite this source
  • about : URI -- if applicable, the URI to the actual source
  • mediator : URI -- a reference to the entity that mediates access to the described source
  • sources[0..*] : SourceReferences -- references to any sources from which this source is derived (i.e., sources that this source cites)
  • componentOf : SourceReference -- a reference to the source that contains this source -- its parent context; for cases where this description is not complete without the description of its parent context
  • displayName : String -- a display name for this source
  • alternateNames[0..*] : TextValue -- a list of alternate display names for this source
  • notes[0..*] : Note -- a list of notes about a source
  • attribution : Attribution -- the attribution for this source description

SourceDerivationType will be removed from the model.

NOTE: The "Extracted" flag will need to be discussed separately. We recognize the issue this flag attempts to address. We will open an issue after we merge this pull request.

NOTE: I am also making a change to the type of the alternateNames field; it is now of type TextValue; this is a slight simplification.
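For readers who prefer code to property lists, the summary above might be sketched in Python as follows. This is not the project's actual binding: the snake_case names are a Python convention, and the richer types (ResourceReference, SourceCitation, TextValue, Note, Attribution) are reduced to plain strings as a simplifying assumption. The `#sd-roll` URI and the example values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SourceReference:
    source_description: str            # ResourceReference, reduced to a URI string
    attribution: Optional[str] = None  # Attribution, reduced to a string


@dataclass
class SourceDescription:
    id: Optional[str] = None             # local-only system identifier
    citation: Optional[str] = None       # SourceCitation, reduced to a string
    about: Optional[str] = None          # URI of the actual source, if applicable
    mediator: Optional[str] = None       # URI of the entity mediating access
    sources: List[SourceReference] = field(default_factory=list)  # derived from
    component_of: Optional[SourceReference] = None                # parent context
    display_name: Optional[str] = None
    alternate_names: List[str] = field(default_factory=list)      # TextValue entries
    notes: List[str] = field(default_factory=list)                # Note entries
    attribution: Optional[str] = None


page = SourceDescription(
    id="sd-page",
    citation="Family XX, page YY",
    component_of=SourceReference("#sd-roll"),
)

print(page.component_of.source_description)
```

There is deliberately no derivation-type field in the sketch, matching the removal of SourceDerivationType.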

@jralls
Contributor
jralls commented Aug 9, 2012

I think it would flow better if SourceReference were pulled up to 3.2 so that all of the CitationFoo paragraphs are together.

@EssyGreen

sources[0..*] : SourceReferences -- references to any sources from which this source is derived (i.e., sources that this source cites)

I think this is confusing .... most people in genealogy would think of a citation as what we are now calling proof/evidence. You are using it to indicate a derivative which I think is a subtle but significant difference. For example, I "cite" a source in a narrative about someone; but a transcription of a baptism record "is derived from" the original.

Also I don't see the point of trying to list all the sources which cite/refer to this source since it will necessarily be impossible to do so in the real world. Instead each source should refer to the single source it is derived from (rather than trying to keep track of the infinite number of things which might reference it).

@EssyGreen

I still don't see why we have to have an attribution in the SourceReference

@thomast73
Contributor

I have merged the pull request (#182) related to this issue. I am going to close this issue. I know that at least two questions are outstanding (about attribution - #192; and about ids - #198). If anything is unresolved and needs further discussion, would you please open a new issue(s) and summarize it there.

Thank you for all of the input given here! We appreciate the help in improving our model! :-)

@thomast73 thomast73 closed this Aug 10, 2012
@thomast73
Contributor

I forgot, there was one other issue that I promised to open as a result of conversation here -- relative to being able to distinguish "working conclusions" from "extracted conclusions". The issue can be found in #202.
