data corrections & additions management #6

petri · 2020-05-02T13:38:33Z

The ISO data provided by gleif is an incomplete work in progress whose quality leaves a lot to be desired. Currently, this package addresses the data issue by providing the original gleif version & a "cleaned" version.

How should data issues be handled going forward? Issues & PRs could of course be used for data updates, and gleif has its own 'challenge' process, as it should. Ultimately, responsibility for data quality lies with ISO & gleif.

Processing updates in the package would take effort. gleif updates are not that frequent, and even if they were, someone would still need to add them to the package. It's also reasonable to presume that people need something more expedient that they control themselves.

Would it therefore be a good idea to provide means to override the data easily? Perhaps by a custom "overrides" CSV file that users could provide if they want?

Thoughts? I can submit a PR for an override mechanism if that seems useful.
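
To make the idea concrete, here is a minimal sketch of how an "overrides" CSV could be layered on top of the base data. It is not the package's actual implementation: the column names (`elf_code`, `name`, `abbreviations`), the codes `AAAA`/`BBBB`/`CCCC`, and both helper functions are hypothetical, and it assumes one row per code.

```python
import csv
import io


def load_entries(reader):
    """Index rows from a csv.DictReader-like iterable by their ELF code."""
    return {row["elf_code"]: dict(row) for row in reader}


def apply_overrides(base, overrides):
    """Return a new mapping in which non-empty override fields win over base fields."""
    merged = {code: dict(entry) for code, entry in base.items()}
    for code, entry in overrides.items():
        # Codes missing from the base data are added as new entries.
        target = merged.setdefault(code, {"elf_code": code})
        for key, value in entry.items():
            if value != "":  # an empty cell means "keep the base value"
                target[key] = value
    return merged


# Hypothetical sample data; the real gleif CSV has different columns.
BASE_CSV = "elf_code,name,abbreviations\nAAAA,Example form,\nBBBB,Other form,OF\n"
OVERRIDES_CSV = "elf_code,name,abbreviations\nAAAA,,EF\nCCCC,New form,NF\n"

base = load_entries(csv.DictReader(io.StringIO(BASE_CSV)))
overrides = load_entries(csv.DictReader(io.StringIO(OVERRIDES_CSV)))
merged = apply_overrides(base, overrides)
```

With this approach the user's file only needs to list the rows and cells that differ, and the base data itself is never modified.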

Gawaboumga · 2020-05-03T11:54:03Z

I think that the solution you proposed in the PR is correct. Let the user decide which files to load, and provide a "default", most recent version.

Through my work, I was somewhat aware of the update made by LEI, but it came at a bad time and I forgot to update the package.

I also think that the "cleaned" dataset should eventually disappear. I will contact gleif to see if we can reach a compromise, or if we can add a position indicator (some legal forms are always at the end, glued to the name, or on both sides).

petri · 2020-05-03T13:34:42Z

Ok. I think it's going to be a long time until all the codes in the CSVs provided by gleif are up to date with adequate data - if ever. If for no other reason, then simply because some parts of the data are not considered that important by the organizations that update them (legal form name abbreviations, for example - there are many missing).

Yet some uses rely on this data. For example the https://github.com/psolin/cleanco package uses the abbreviations information to help people determine base names of organizations. The term data definitions list of that package has about 150 unique abbreviations that do not exist in the CSV data provided by gleif. Even if a significant percentage of those are not valid (they might be old etc), there still are many that are missing from gleif data (I've checked).

So I am wondering if the Elf class could be for example extended so that it would support incorporating new or updated entries from other sources than the gleif CSV, at runtime. Or perhaps it'd be sufficient to simply be able to point to another CSV :) Perhaps I can submit an attempt at that in another PR.

Gawaboumga · 2020-05-03T13:45:35Z

I also wanted to add additional forms (as I'm working with a big database of companies, ~400M). But I had 2 issues:

1. How to identify those new legal forms (since we can't give them an ELF identifier).
2. I'm scared that some legal forms may be confused with other common words. Example: "SA", which is "anonymous company" in French but also the possessive "sa" ("his/her").

petri · 2020-05-03T17:44:06Z

For 1., how about adopting an extension code for new forms? The identifiers seem to conform to a [A-Z0-9]{4} regexp. But it seems the full available expression space is not utilized. For example, there are no identifiers starting with zero (0). So perhaps we could assign additional forms identifiers that start with zero? I guess the ISO standard should tell us the full spec for the identifiers, but it's not freely available.
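
A rough sketch of that extension-code idea, assuming identifiers are exactly four characters, each an uppercase letter or a digit, and that codes starting with "0" are free to use. Both assumptions come from looking at the data, not from the ISO spec, and all function names here are made up for illustration.

```python
import re

# Assumed shape of an identifier: four characters from A-Z and 0-9.
ELF_CODE = re.compile(r"[A-Z0-9]{4}")


def is_wellformed(code):
    """True if the code has the assumed four-character shape."""
    return ELF_CODE.fullmatch(code) is not None


def is_extension(code):
    """True for locally assigned codes, which by our convention start with '0'."""
    return is_wellformed(code) and code.startswith("0")


def next_extension_code(used):
    """Return the lowest unused code of the form 0NNN (digits only, for simplicity)."""
    for n in range(1000):
        candidate = f"0{n:03d}"
        if candidate not in used:
            return candidate
    raise ValueError("extension code space exhausted")
```

Keeping the extension codes in their own prefix makes it easy to strip or remap them later if official identifiers are assigned.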

Regarding 2., yeah I guess that's always going to be an issue. Perhaps it can be mitigated by restricting the set of forms by jurisdiction, language etc. in those cases where that is known for the company names. But I don't think there can really be a perfect solution.
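
As a sketch of the jurisdiction restriction: look only at the trailing token of the name and, when the jurisdiction is known, check it against that jurisdiction's abbreviations alone. The `FORMS` table below is a tiny hand-written illustration, not data from the gleif CSV, and `candidate_jurisdictions` is a hypothetical helper.

```python
# Illustrative abbreviations per jurisdiction (not taken from the gleif data).
FORMS = {
    "DE": {"AG", "GMBH"},
    "FR": {"SA", "SARL"},
    "PL": {"SA"},
}


def candidate_jurisdictions(name, forms, jurisdiction=None):
    """List jurisdictions in which the last token of `name` is a known abbreviation.

    If `jurisdiction` is given, only that jurisdiction is considered, which
    removes matches that are valid elsewhere but not there.
    """
    tokens = name.upper().split()
    if not tokens:
        return []
    suffix = tokens[-1]
    scope = [jurisdiction] if jurisdiction is not None else sorted(forms)
    return [j for j in scope if suffix in forms.get(j, set())]


unknown = candidate_jurisdictions("Acme SA", FORMS)        # ambiguous without context
french = candidate_jurisdictions("Acme SA", FORMS, "FR")   # confirmed for France
german = candidate_jurisdictions("Acme SA", FORMS, "DE")   # rejected for Germany
```

This does not resolve cases where the word is both a legal form and an ordinary word in the same language; it only narrows the set of forms that are tried.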

Gawaboumga · 2020-05-24T14:18:59Z

I have added some additional legal forms.

It's based on a big mix of what I found on the Internet, in the biggest database of companies (wink wink), and on Wikipedia. It may, of course, contain errors.

Data is separated by country:

If I put an ELF identifier, it means that the entry exists in the original file and that I added some information to it.
If an entry ends with a "?", it means that I'm really not sure about it, but I saw it quite often.
If there is specific spacing, or some lines are grouped together, it means they should share the same "ELF" code.

I will go through the dataset once more to try to improve it.

About the ELF identifier pattern, there are two mysterious lines:
"8888","to be used when a new ELF Code (for a legal form not yet on the list) is requested from GLEIF for a jurisdiction which is on the list"
"9999","to be used for LEIs from a jurisdiction which is not on the list yet"
=> So, once I consider the new elements "good enough", I will see what GLEIF says. But I think "0XXX" may be a good idea temporarily.

petri mentioned this issue May 3, 2020

have Elf support arbitrary CSV loading #9

Merged

petri mentioned this issue May 3, 2020

support loading entries from a generic reader #10

Merged

Gawaboumga added the enhancement New feature or request label Feb 12, 2021
