Improve how to indicate the data source following data extension #5130

lozanaross · 2022-07-30T10:07:20Z

Currently, the default OpenRefine-to-Wikibase workflow works well if users reconcile data against a single service, e.g. Wikidata, and then proceed to schema building and data upload with that same service. In practice, however, now that it is possible to reconcile against and upload edits to any arbitrary Wikibase (and Wikimedia Commons, too), users may wish to work across several services at once. For example, they may enrich (or extend) a dataset from one Wikibase with data from another, while also including data from various other authority control sources (e.g. GND, Getty’s ULAN, VIAF, etc.). However, once a data extension operation has been carried out, there is no easy visual cue telling the user which source each column was pulled from and, depending on the data type, which service it was reconciled against. This easily leads to mistakes during schema building, where users only see column titles and a green bar.

If data is pulled from Wikidata via reconciled values, the data can be shown to be reconciled to Wikidata with a little logo, as suggested in issue #4824, but there is no way to tell users that the source was also Wikidata. For that, users proposed adding prefixes to the column headers.

Proposed solution

Add prefixes to column names to indicate where the data was pulled from. E.g. wmc: for Wikimedia Commons; wd: for Wikidata; wb: for any other Wikibase, etc.

Example where some artworks are reconciled with Wikidata and then the 'collection' property values are pulled from Wikidata via data extension:

Example with data pulled from Commons:

Alternatives considered

Adding small service logos next to the reconciled status green bar, but that solution does not always translate 1:1 with the data extension use case.

Additional context

This feature request is most useful during schema building, so the prefixes should naturally be carried over to the drag / drop elements in that interface that contain the column names.

wetneb · 2022-07-30T13:00:55Z

I can totally relate to the need as a user, I have found myself in that situation many times.

The main question is: how do we generate that prefix? Reconciliation services are not currently associated to any such prefix.

Do we prompt the user for it, for instance in the data extension dialog? They will probably not bother giving a prefix there, because they probably do not realize that they need it before later on (at least the first time).
Do we require that all reconciliation services announce a prefix themselves? That would require changes in the reconciliation API, and it is not really clear we can phrase the need in a way that makes sense outside OpenRefine's context.
Can we use the full service name instead? It would give us column names such as "Wikimedia Commons [en]: Wikitext", which is long, but maybe has the benefit of being explicit?
Or an automatically-generated acronym from the service name?
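
As a rough illustration of the last option, an acronym could be derived mechanically from the service name. The sketch below is written in Python for brevity; `acronym_prefix` and `prefixed_column_name` are hypothetical helpers (nothing like them exists in OpenRefine), and it assumes that bracketed qualifiers such as `[en]` should be ignored.

```python
import re


def acronym_prefix(service_name: str) -> str:
    """Build a lowercase prefix from the initials of a service name."""
    # Assumption: bracketed or parenthesised qualifiers (e.g. "[en]") are dropped.
    cleaned = re.sub(r"\[[^\]]*\]|\([^)]*\)", " ", service_name)
    # Take the first character of every remaining alphanumeric word.
    words = re.findall(r"[A-Za-z0-9]+", cleaned)
    return "".join(word[0].lower() for word in words)


def prefixed_column_name(service_name: str, property_name: str) -> str:
    """Combine the generated prefix with the property name."""
    return f"{acronym_prefix(service_name)}: {property_name}"


print(prefixed_column_name("Wikimedia Commons [en]", "Wikitext"))  # wc: Wikitext
print(acronym_prefix("Wikidata"))  # w
```

With this rule, "Wikimedia Commons [en]" yields `wc` rather than the `wmc` suggested in the proposal, and "Wikidata" collapses to a single `w`, which hints at why a purely automatic acronym may be neither meaningful nor unique.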

thadguidry · 2022-07-30T15:56:55Z

In other tools, they typically have a 2nd or 3rd header (with a toggle to show or not) for context/metadata, instead of having a very long string shoved into 1 header row.

Wikimedia Commons [en]
Wikitext
----
=={int:filedesc}==

antoine2711 · 2022-07-30T17:22:50Z

@lozanaross : why not put just the icon of the service?

I would find it much better visually and conceptually. The name of the column is used in scripts, and I don't want it to be constrained.

And if it's just for display, a prefix will confuse users.

I feel this very strongly.

Regards, Antoine

lozanaross · 2022-07-30T22:32:28Z

@lozanaross : why not put just the icon of the service?

@antoine2711 check issue #4824 - these are two separate problems. I also want to show logos but only when something is reconciled against a particular service - this may or may not overlap with the service where the data came from.

For example, I may pull data from Wikidata, then reconcile that same column that was generated via data extension against my private Wikibase in order to upload to my Wikibase, or vice versa. So we need a prefix regardless of the logo symbol. This is also quite relevant for Commons where things can get "hairy" since the data modeling with Wikidata is intertwined in very specific ways.

lozanaross · 2022-07-30T22:35:48Z

In other tools, they typically have a 2nd or 3rd header (with a toggle to show or not) for context/metadata, instead of

@thadguidry I think this makes a lot of sense, but unfortunately doesn't help in the schema building interface, which is where these prefixes will really make a positive impact in terms of UX. The columns view (aka row / records view) is actually less of an issue because if you hover over the data you can get a sense where it came from (at least until further reconciliation takes place). We need a solution that will work consistently well across all data views.

lozanaross · 2022-07-30T22:46:10Z

The main question is: how do we generate that prefix? Reconciliation services are not currently associated to any such prefix.

Do we prompt the user for it, for instance in the data extension dialog? They will probably not bother giving a prefix there, because they probably do not realize that they need it before later on (at least the first time).

I'd rather avoid putting the burden on the user here, although it may in the end prove to be the most feasible solution. If we go this route, we'll have to emphasize this step in tutorials / documentation, etc. There is one benefit here in the sense that the user will immediately understand what is going on. If a prefix appears automatically, it might not actually be that meaningful to new users / users unfamiliar with LOD principles. If we decide to opt for this, I'll make a mock-up for the UI.

Do we require that all reconciliation services announce a prefix themselves? That would require changes in the reconciliation API, and it is not really clear we can phrase the need in a way that makes sense outside OpenRefine's context.

This might be the best option long term, but it is unclear whether we can rely on external service providers/maintainers to actually do it.

Can we use the full service name instead? It would give us column names such as "Wikimedia Commons [en]: Wikitext", which is long, but maybe has the benefit of being explicit?

Explicit is good, but I think it will look terrible in the schema building UI, and that's where it's most crucial.

Or an automatically-generated acronym from the service name?

Or can OpenRefine actually determine prefixes for the key services that are well used / maintained (based on some kind of acronym indeed - but maybe human determined, rather than automated) and pull the values from a dedicated column in this page: https://reconciliation-api.github.io/testbench/#/

Then for any new services being made, maintainers would be required to submit their service to this page and add the prefix there, which would then be fed back into OpenRefine? Sorry, this may make no sense from a programming perspective, but I'm just wondering if there is some mixed approach between us determining some core values to begin with and maintainers being able to add / update them in the future.

antoine2711 · 2022-07-30T23:26:45Z

@antoine2711 check issue #4824 - these are two separate problems. I also want to show logos but only when something is reconciled against a particular service - this may or may not overlap with the service where the data came from.

For example, I may pull data from Wikidata, then reconcile that same column that was generated via data extension against my private Wikibase in order to upload to my Wikibase, or vice versa. So we need a prefix regardless of the logo symbol. This is also quite relevant for Commons where things can get "hairy" since the data modeling with Wikidata is intertwined in very specific ways.

@lozanaross : this raises interesting questions. From my part, if I query data from a source, I will probably want to reload data from that source. And I will most probably push data to that source also, even if like you said, I could also push the same data to another Wikibase. So, from my point of view, I'd rather have 2 icons, than a prefix and an icon.

This question also raises the notion of a link between Recon Services and services where we push data, like WD, WC, or a Wikibase. Something we don't currently have.

Regards, Antoine

lozanaross · 2022-07-31T09:52:27Z

@lozanaross : this raises interesting questions. From my part, if I query data from a source, I will probably want to reload data from that source. And I will most probably push data to that source also, even if like you said, I could also push the same data to another Wikibase. So, from my point of view, I'd rather have 2 icons, than a prefix and an icon.

I'm not sure I understand what you mean by "if I query data from a source, I will probably want to reload data from that source. And I will most probably push data to that source also" - what I've observed users doing during data extension is generally moving data across different services, so not keeping it in the same service. I think using 2 icons would be potentially confusing and difficult to design in the schema building view. This is really the crux, how to make it super clear to users where the data came from and what it was reconciled to, when you're in this view:

In the example above the scenario is as follows: a user has a list of artworks as a starting point, and these are first reconciled to Wikidata; the user then pulls the file names of images linked to the artworks via the P18 image property. The user also pulls data from Wikidata regarding the creators (P170) of the works. Now, to upload structured data to Commons, the user first has to reconcile the file names from the image column to Commons; then they can create statements about "depicts" / "creator" in Commons, too. This is extremely simplified: in the cases I've seen, users might have 20-30 columns with various data sources, including e.g. GND (German authority control), Getty Vocabs, etc., and various reconciliation target services.

This question also raises the notion of a link between Recon Services and services where we push data, like WD, WC, or a Wikibase. Something we don't currently have.

I think there is some link via the manifests, no? I guess @wetneb would know best about this.

thadguidry · 2022-07-31T14:11:42Z

Does the source of data (provenance) need to be readily viewable? As you suggested with a prefix always showing in column header.
Or can it be viewable upon a hover or click? What's your feeling, @lozanaross: do users need to always see the data source, or do we not know yet until we get more user feedback? My hunch is that we might want a way for it to be always viewable, for quick disambiguation between similarly named columns from different sources. Stupid example: gnd: image_date, wd: image_data

wetneb · 2022-07-31T14:49:27Z

Or can OpenRefine actually determine prefixes for the key services that are well used / maintained (based on some kind of acronym indeed - but maybe human determined, rather than automated) and pull the values from a dedicated column in this page: https://reconciliation-api.github.io/testbench/#/

For now OpenRefine does not rely at all on this list of services (beyond linking to it in the UI) and I would rather avoid tying a feature to a centralized repository.

This question also raises the notion of a link between Recon Services and services where we push data, like WD, WC, or a Wikibase. Something we don't currently have.

I think there is some link via the manifests, no? I guess @wetneb would know best about this.

Indeed the Wikibase manifests mention the reconciliation service(s) they rely on. It is not clear to me why we would need more than that.

lozanaross · 2022-07-31T20:50:31Z

Does the source of data (provenance) need to be readily viewable? As you suggested with a prefix always showing in column header. Or can it be viewable upon a hover or click?

@thadguidry - actually this is a really great suggestion because it solves my UI struggles :)
Prefix is actually not ideal because: 1) it requires us to come up with some conventional way of generating the prefix (and as explained above by @wetneb there is no straightforward way to do that); 2) because although some users will figure it out immediately, others who are less of an LOD expert might not know what a prefix is, etc.

So how about we use the full name of the service (which we should have via the recon service) and offer that in a tooltip on hover. We'll have to do some testing to make sure it's not annoying to users and doesn't interfere with the drag and drop operation during schema building, but I think it might work.

Regarding the importance of provenance - it's pretty important, because data from Wikidata is often inconsistently modeled with Commons. So you might want to check what collection is listed for an artwork on Wikidata and compare, add or correct that on Commons. Or you might want to know that an artist birth date is coming from Wikidata (less reliable) vs Getty ULAN (supposedly more reliable), etc. Licenses are another example, where a correct license might be listed in Commons, but then a wrong one in Wikidata or vice versa (there was a concrete example of this with some of the Ghent archive users we interviewed) - so users might want to pull data from both services to first compare and contrast in "row/ records" view and then decide what to upload where (in order to fix any inaccuracies) during schema building. Of course they can manually rename the columns to not have identical or similar names based on identical properties, but it's a bit of an extra effort.

So any solution that shows a clear message about the data source would be super useful. The prefix was just one idea, because it is succinct, but a hover message in the style of a tooltip could work just as well.

wetneb · 2022-08-01T08:23:54Z

A tooltip would be fairly easy to implement. I guess it is less discoverable, perhaps? In the schema editor, we do have some freedom in how we render the columns, so we could potentially also add a visual clue for all the columns that have been fetched from the reconciliation service associated with the currently selected Wikibase instance.

wetneb · 2023-12-07T14:23:12Z

This is partly solved by #4824, which adds a logo to column headers, but I am not sure about the columns fetched from reconciled values which are not reconciled themselves: I think the changes in #6156 will not be sufficient to add a logo in that case.
Under the hood, we do store the fact that those columns were obtained from data extension I think, so it might not be so hard to also display a logo in that case.

ayushrai206 · 2023-12-11T07:22:32Z

Hey everyone, we were discussing where the logo should be displayed in a data extension column without reconciled values. We had two options in mind:
1) The same place where we display the logo in a reconciled column, but without a recon stats bar
2) Beside the column name, in between the drop-down and the column name (which would lead to the column name being pushed further right)
@lozanaross @Lydiaofficial !

lozanaross · 2023-12-11T13:44:22Z

Hi @ayushrai206 & @wetneb - I've read again the whole thread here & I think we are mixing up the issues a bit. I think we need a logo only when it comes to reconciliation. But data extension can produce various scenarios:

Data extension for items, e.g. place of birth: in that case the results returned after data extension will be reconciled to e.g. Wikidata, so that logo would be there.
Data extension for non-items, e.g. date of birth: in that case the results returned after data extension will not be reconciled, so we don't need a logo.
Data extension for items, e.g. place of birth, where the column is afterwards reconciled to another service (e.g. I am pulling places from Wikidata to add to my own Wikibase, or I am pulling places from the Getty ULAN to add to Wikidata): in that case the logo will be that of the reconciled service. Or we'll need 2 logos, but that will get confusing...

So I think the consensus from the discussion above was to add a tooltip (on hover) that shows the fully spelled out name of the service used for data extension. The hover should work in the schema view. I don't know if it will really make sense in Row or Records view. The text of the tooltip could be "Data extended from [name of the service]"
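
The three scenarios and the tooltip proposal can be summarised in a small sketch. It is written in Python purely for illustration: `header_cues`, `HeaderCues` and their parameters are hypothetical names, not OpenRefine code, and the service names are example values.

```python
from typing import NamedTuple, Optional


class HeaderCues(NamedTuple):
    logo: Optional[str]     # service whose logo is shown, if any
    tooltip: Optional[str]  # hover text, if any


def header_cues(extended_from: Optional[str],
                reconciled_to: Optional[str]) -> HeaderCues:
    """Return the visual cues for one column header.

    extended_from: name of the service the column was pulled from, or None.
    reconciled_to: name of the service the column is reconciled against, or None.
    """
    # The logo follows reconciliation only; provenance never overrides it.
    logo = reconciled_to
    # The tooltip follows provenance only, whether or not the column is reconciled.
    tooltip = f"Data extended from {extended_from}" if extended_from else None
    return HeaderCues(logo, tooltip)


# Scenario 1: items pulled from Wikidata, still reconciled to Wikidata.
print(header_cues("Wikidata", "Wikidata"))
# Scenario 2: non-items (e.g. dates) pulled from Wikidata, not reconciled.
print(header_cues("Wikidata", None))
# Scenario 3: items pulled from Wikidata, then reconciled to another Wikibase.
print(header_cues("Wikidata", "My Wikibase"))
```

Keeping the two inputs separate means a later reconciliation changes the logo but leaves the provenance text alone.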

Alternatively, perhaps we can revisit the idea of spelling out the service name in full before the name of the column (which currently defaults to the name of the property used). It could get very long with some services, but maybe it's better than nothing. The tooltip otherwise might remain entirely undiscovered.

Example how the latter option to show the name of the service in the column name could look in row view for item vs non-item:

Note: in the Genre column we should also show the logo, but I just haven't mocked it up.

antoine2711 · 2023-12-11T14:48:26Z

Alternatively, perhaps we can revisit the idea of spelling out the service name in full before the name of the column (which currently defaults to the name of the property used). It could get very long with some services, but maybe it's better than nothing. The tooltip otherwise might remain entirely undiscovered.

Respectfully, I would strongly be against putting the name of the Recon service before or even after the column name. That would force the whole column to be much wider. It would be a nightmare for a checkbox column…

As for the case of the icon, I would also be against putting 2 icons. Maybe we could put the last Recon service used (either by reconciliation or, if we improve the data fetching, by the last fetched data). But we could also show the one with the most reconciled values provenance.

Regards, Antoine

lozanaross · 2023-12-11T15:44:23Z

@antoine2711 - I agree the column name will get long, but not that long. Have you worked with XML files? The column names there are untenable.

Maybe we could put the last Recon service used

I don't think we should mix data extension (i.e. data provenance) and reconciliation (i.e. data matching), so I disagree with the superseding idea. Provenance information should never be simply overwritten by whatever reconciliation the user does afterwards.

we could also show the one with the most reconciled values provenance.

I'm afraid I don't understand this comment. I am also worried that this doesn't address all the cases where we don't have any reconciliation happening because the pulled data is non-items - e.g. dates, coordinates, strings, etc. In those cases we have no solution as it stands.

Tooltips on hover are perhaps the most "safe" / least disruptive choice, but also the least visible if the user doesn't already know it's there.

wetneb · 2023-12-11T15:47:00Z

@lozanaross fine, adding a tooltip is of course doable too.

I just thought it's not super discoverable and not as convenient as having a permanent visual sign that a column came from a reconciliation service, hence the suggestion to add the logo of the recon service to all columns obtained from data extension, regardless of whether they are reconciled or not. But yes, it's useful to make a distinction between provenance and which reconciliation service was used on a column.

antoine2711 · 2023-12-11T16:19:30Z

I don't think we should mix data extension (i.e. data provenance) and reconciliation (i.e. data matching), so I disagree with the superseding idea. Provenance information should never be simply overwritten by whatever reconciliation the user does afterwards.

@lozanaross : there are a lot of details in the discussion. Sorry if I'm not precise, I will try to wrap up.

Right now, we can only get data from a data extension when creating a column. But I would LOVE to be able to requery data in an existing column that was fetched through data extension. This is actually complex, as it would be important to keep the link between the old cell and the new one (imagine a scenario where one value is deleted in the data service…)
If we implement such a feature, we could imagine someone querying data in an existing column with another data service. This would essentially be the same as doing Recon on the same column with 2 Recon services.
I didn't say it, but I'm totally for the hover/tooltip solution.
We have to keep in mind that in the same column, cells could be reconciled against many different services. This has implications for the Recon % bar, and I would like to show all the Recon services used in one column. That is not showable beside the name of the column.
Yes, XML generates long column names. It's a big problem in my opinion, because of the width.

I just thought it's not super discoverable and not as convenient as having a permanent visual sign that a column came from a reconciliation service

@wetneb : I think that showing the logo would imply that there would also be a hover/tooltip with more information.

I would show the logo for any Recon values, fetched by a Data extension or by a Reconciliation.

Regards, Antoine

wetneb · 2023-12-12T08:34:39Z

Alright, then let's stick to this:

data extension source is indicated by tooltip
reconciliation service used for a reconciled column is indicated by logo

@ayushrai206 I think this should be best implemented by adding a new field to the Column class which would store the recon config of the source column (which was used to obtain the current column by data extension).
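
A minimal sketch of that idea, in Python rather than OpenRefine's Java and with hypothetical names throughout (`ReconConfig`, `Column`, `source_recon_config`, `extend_column` and `reconcile` are illustrative, not the actual classes or fields; the URLs are placeholders): the column created by data extension keeps the recon config of the column it was fetched from, separately from its own recon config.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReconConfig:
    service_name: str
    service_url: str


@dataclass
class Column:
    name: str
    # Service this column is reconciled against (drives the logo).
    recon_config: Optional[ReconConfig] = None
    # Recon config of the column this one was extended from (drives the tooltip).
    source_recon_config: Optional[ReconConfig] = None


def extend_column(source: Column, property_name: str, returns_items: bool) -> Column:
    """Create the column produced by extending `source` with one property."""
    if source.recon_config is None:
        raise ValueError("data extension requires a reconciled source column")
    return Column(
        name=property_name,
        # Item values come back already reconciled to the source service;
        # non-item values (dates, strings, ...) are not reconciled at all.
        recon_config=source.recon_config if returns_items else None,
        source_recon_config=source.recon_config,
    )


def reconcile(column: Column, config: ReconConfig) -> None:
    """Reconcile a column; its provenance is deliberately left untouched."""
    column.recon_config = config


wikidata = ReconConfig("Wikidata", "https://example.org/wikidata-recon")
artworks = Column("artwork", recon_config=wikidata)
creator = extend_column(artworks, "creator", returns_items=True)
reconcile(creator, ReconConfig("My Wikibase", "https://example.org/my-wikibase-recon"))
print(creator.recon_config.service_name)         # My Wikibase
print(creator.source_recon_config.service_name)  # Wikidata
```

Storing the source config in its own field keeps provenance intact when the column is later reconciled against a different service.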

wetneb · 2024-01-13T10:02:55Z

Concerning the actual tooltip, @ayushrai206 has proposed: "Data extended from " + service.name, which gives something like that:

Are we happy with this formulation?

cc @OpenRefine/designers

thadguidry · 2024-01-13T10:11:45Z

That's a giant tooltip! Can't miss it! :-) I think shrinking the margins a few pixels might be best. And bordering with 1px the service.name directly with a rounded rectangle, to make the variable distinct from the tooltip string?
But we probably need to come up with a consistent design system for our "tooltips" and their default margins/padding. And style guides for bordering elements inside any tooltips.
And if we ever have "dark mode", I guess reversing this proposal would not be a problem :-)

wetneb · 2024-01-13T10:15:20Z

The tooltip would likely look different, depending on the browser/platform. It's the native tooltip that is displayed when you hover elements with attr="…". I don't know if it can be styled in CSS.
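
For illustration, and assuming the attribute in question is the standard HTML `title` attribute (which browsers show as a native tooltip on hover), the proposed text could be attached to a header cell roughly as follows. The sketch is in Python with a hypothetical `header_cell_html` helper and a generic `<th>` element; it does not reflect OpenRefine's actual front-end code.

```python
from html import escape


def header_cell_html(column_name: str, service_name: str) -> str:
    """Render a header cell whose hover text names the extension source."""
    tooltip = "Data extended from " + service_name
    # escape() also escapes double quotes by default (quote=True),
    # so the value is safe inside a double-quoted attribute.
    return f'<th title="{escape(tooltip)}">{escape(column_name)}</th>'


print(header_cell_html("collection", "Wikidata"))
# <th title="Data extended from Wikidata">collection</th>
```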

tfmorris removed the Status: Pending Review label Oct 6, 2022

lozanaross changed the title ~~Include a default prefix in column names to indicate the data source following data extension~~ Improve how to indicate the data source following data extension Sep 1, 2023

lozanaross assigned ayushrai206 Sep 1, 2023

ayushrai206 mentioned this issue Jan 5, 2024

Improve how to indicate the data source following data extension#5130 #6285

Merged

wetneb closed this as completed in #6285 Jan 23, 2024

wetneb added a commit that referenced this issue Jan 23, 2024

Add tooltip on column headers obtained from data extension (#6285)

bb64817

Closes #5130 --------- Co-authored-by: Antonin Delpeuch <antonin@delpeuch.eu>

wetneb added this to the 3.8 milestone Feb 21, 2024
