data corrections & additions management #6

Open
petri opened this issue May 2, 2020 · 5 comments
Labels
enhancement New feature or request

Comments

@petri
Contributor

petri commented May 2, 2020

The ISO data provided by GLEIF is an incomplete work in progress whose quality currently leaves a lot to be desired. Currently, this package addresses the data issue by providing the original GLEIF version and a "cleaned" version.

How should data issues be handled going forward? Issues & PRs could be used for data updates of course, and GLEIF has its own 'challenge' process, as it should. Ultimately, responsibility for data quality lies with ISO & GLEIF.

Processing updates in the package would take effort. GLEIF updates are not that frequent, and even if they were, someone would still need to add them to the package. It's also reasonable to presume that people need something more expedient that they have control over.

Would it therefore be a good idea to provide an easy way to override the data? Perhaps through a custom "overrides" CSV file that users could supply if they want?

Thoughts? I can submit a PR for an override mechanism if that seems useful.
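
To make the proposal concrete, here is a minimal sketch of how an "overrides" CSV could be layered on top of the packaged data. The function name `load_legal_forms`, the `code` key, and the column names are assumptions for illustration; they are not the package's API or the GLEIF schema.

```python
import csv
import io


def load_legal_forms(base_file, overrides_file=None, key="code"):
    """Read the base CSV, then let rows from an optional overrides CSV
    replace or extend entries that share the same key."""
    forms = {row[key]: row for row in csv.DictReader(base_file)}
    if overrides_file is not None:
        for row in csv.DictReader(overrides_file):
            # Start from the existing entry (or the new row itself), then
            # let only non-empty override cells win.
            merged = dict(forms.get(row[key], row))
            merged.update({k: v for k, v in row.items() if v})
            forms[row[key]] = merged
    return forms


# Hypothetical data; real files would be opened with open(path, newline="").
base = io.StringIO("code,name,abbreviation\nAAAA,Example Form,\nBBBB,Other Form,OF\n")
overrides = io.StringIO("code,name,abbreviation\nAAAA,,EF\nCCCC,User Form,UF\n")

forms = load_legal_forms(base, overrides)
print(forms["AAAA"])
```

With this approach an empty cell in the overrides file keeps the packaged value, so users only need to list the fields they want to change.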

@Gawaboumga
Owner

I think that the solution you proposed in the PR is correct. Let the user decide which files to load, and provide a "default", most up-to-date version.

Through my work, I was aware of the update made by LEI, but it came at a bad time and I forgot to update the package.

I also think that the "cleaned" dataset should ultimately disappear. I will contact GLEIF to see if we can reach a compromise or add an indicator of position (some legal forms are always at the end, glued to the name, or on both sides).

@petri
Contributor Author

petri commented May 3, 2020

Ok. I think it's going to be a long time until all the codes in the CSVs provided by GLEIF are up to date with adequate data, if ever. If for no other reason, then simply because some parts of the data are not considered that important by the organizations that update them (legal form name abbreviations, for example: many are missing).

Yet some uses rely on this data. For example the https://github.com/psolin/cleanco package uses the abbreviations information to help people determine base names of organizations. The term data definitions list of that package has about 150 unique abbreviations that do not exist in the CSV data provided by gleif. Even if a significant percentage of those are not valid (they might be old etc), there still are many that are missing from gleif data (I've checked).

So I am wondering if the Elf class could be for example extended so that it would support incorporating new or updated entries from other sources than the gleif CSV, at runtime. Or perhaps it'd be sufficient to simply be able to point to another CSV :) Perhaps I can submit an attempt at that in another PR.
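
As a rough sketch of what runtime additions could look like: the class below is a hypothetical stand-in, not the package's Elf class, and the method names, the placeholder code "AAAA", and the sample abbreviations are all assumptions. It also shows how abbreviations from another source could be checked against what is already known.

```python
from dataclasses import dataclass, field


@dataclass
class LegalFormRegistry:
    """Hypothetical lookup class that accepts entries from several sources."""
    entries: dict = field(default_factory=dict)

    def add(self, code, country, abbreviations, source="custom"):
        """Add or extend an entry at runtime, remembering where it came from."""
        entry = self.entries.setdefault(
            code, {"country": country, "abbreviations": set(), "sources": set()}
        )
        entry["abbreviations"].update(abbreviations)
        entry["sources"].add(source)

    def missing_abbreviations(self, candidates):
        """Return the candidates that no entry knows about (case-insensitive)."""
        known = {a.lower() for e in self.entries.values() for a in e["abbreviations"]}
        return sorted(c for c in candidates if c.lower() not in known)


registry = LegalFormRegistry()
registry.add("AAAA", "XX", ["Ltd"], source="gleif")         # placeholder entry
registry.add("AAAA", "XX", ["Ltd."], source="third-party")  # extended at runtime
missing = registry.missing_abbreviations(["ltd", "Inc", "LTD."])
print(missing)
```

Tracking the source per entry would make it possible to drop custom additions again once the upstream CSV catches up.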

@Gawaboumga
Owner

I also wanted to add additional forms (as I'm working with a big database of companies, ~400M). But I had 2 issues:

  1. How to identify those new legal forms (since we can't give them an ELF identifier).
  2. I'm afraid that some legal forms may be confused with other common words. Example: "SA", which stands for "société anonyme" in French, but "sa" is also the possessive "her".

@petri
Contributor Author

petri commented May 3, 2020

For 1., how about adopting an extension code for new forms? The identifiers seem to conform to a [A-Z0-9]{4} regexp (four characters, each an uppercase letter or a digit). But it seems the full available expression space is not utilized. For example, there are no identifiers starting with zero (0). So perhaps we could assign additional forms identifiers that start with zero? I guess the ISO standard should tell us the full spec for the identifiers, but it's not freely available.
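
A minimal sketch of that idea, assuming identifiers are four characters drawn from A-Z and 0-9 (an observation from the data, not the ISO spec) and that the zero prefix is really unused. The helper names are hypothetical.

```python
import re
import string
from itertools import product

ELF_PATTERN = re.compile(r"[A-Z0-9]{4}")  # assumed shape of an identifier
ALPHABET = string.digits + string.ascii_uppercase


def is_valid_code(code):
    """Check that a code has the assumed four-character shape."""
    return ELF_PATTERN.fullmatch(code) is not None


def next_extension_code(used, prefix="0"):
    """Return the first unused four-character code starting with prefix."""
    for tail in product(ALPHABET, repeat=4 - len(prefix)):
        candidate = prefix + "".join(tail)
        if candidate not in used:
            return candidate
    raise ValueError("extension space exhausted")


# Hypothetical set of codes already taken, including one earlier extension.
used = {"8888", "9999", "0000"}
code = next_extension_code(used)
print(code, is_valid_code(code))
```

Keeping all locally assigned codes under one prefix makes them easy to tell apart from official ones and to remap later if official identifiers are issued.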

Regarding 2., yeah I guess that's always going to be an issue. Perhaps it can be mitigated by restricting the set of forms by jurisdiction, language etc. in those cases where that is known for the company names. But I don't think there can really be a perfect solution.
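
A small sketch of that mitigation, assuming the jurisdiction of each company name is known and that the legal form appears at the end of the name. The `FORMS_BY_COUNTRY` table is illustrative and not taken from the GLEIF CSV; `split_legal_form` is a hypothetical helper.

```python
import re

# Illustrative abbreviations per jurisdiction (assumption, not GLEIF data).
FORMS_BY_COUNTRY = {
    "FR": ["SA", "SARL"],
    "DE": ["GmbH", "AG"],
}


def split_legal_form(name, country):
    """Split a trailing legal form off name, using only that country's forms."""
    name = name.strip()
    # Try longer abbreviations first so "SARL" is preferred over "SA".
    for abbr in sorted(FORMS_BY_COUNTRY.get(country, []), key=len, reverse=True):
        pattern = rf"(.+?)[\s,]+{re.escape(abbr)}\.?"
        match = re.fullmatch(pattern, name, flags=re.IGNORECASE)
        if match:
            return match.group(1), abbr
    return name, None


print(split_legal_form("Dupont SA", "FR"))   # form recognised at the end
print(split_legal_form("Sa Maison", "FR"))   # leading "Sa" is left alone
print(split_legal_form("Dupont SA", "DE"))   # "SA" is not a form listed for DE
```

Anchoring the match to the end of the name and to one jurisdiction reduces false positives, but it does not eliminate them, for instance when a name legitimately ends in a word that looks like a legal form.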

@Gawaboumga
Owner

Gawaboumga commented May 24, 2020

I have added some additional legal forms.

It's based on a big mix of what I found on the Internet, what I found in the biggest database of companies (wink wink) and on Wikipedia. It may, of course, contain errors.

Data is separated by countries:

  • If I put an ELF identifier, it means that it exists in the original file and that I added some information.
  • If it ends with a "?", it means that I'm really not sure but I saw it quite often.
  • If there is a specific spacing / some lines are grouped together, it means they should share the same "ELF" code.

I will go through the dataset once more to try to improve it.

About the ELF identifier pattern, there are two mysterious lines:
"8888","to be used when a new ELF Code (for a legal form not yet on the list) is requested from GLEIF for a jurisdiction which is on the list"
"9999","to be used for LEIs from a jurisdiction which is not on the list yet"
=> So, once I consider the new elements "good enough", I will see what GLEIF says. But I think "0XXX" may be a good idea temporarily.

@Gawaboumga Gawaboumga added the enhancement New feature or request label Feb 12, 2021