opentapioca long-due contributions #53

Open · ziodave opened this issue Jan 4, 2023 · 9 comments

ziodave (Contributor) commented Jan 4, 2023

Hello @wetneb

I have worked intensively on opentapioca over the last few days and I wanted to summarize the major points. The general aim is to add proper multilingual support to OpenTapioca, improve the performance of its bootstrap (e.g. creating the index), and include more entities in the index by relying on type inheritance.

As I started this work a long time ago and only recently got back to it, some of the things might not make 100% sense :-)

In short:

  1. Multiple language support (Solr configset and document structure)
  2. Performance (train-bow and index-dump)
  3. Wikidata type inheritance

Multiple language support

In order to properly support multiple languages I created a different structure for the documents, where the labels are stored in tag_{language}_name and the aliases in tag_{language}_alias.

This is fundamental to ensure that I can tag targeting a specific language (set as a parameter to the annotate API).
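
For illustration, an indexed document under this scheme would look roughly like this (the values are made up; only the field naming pattern is the real one):

```python
# Illustrative only: a document following the tag_{language}_name /
# tag_{language}_alias field scheme described above.
doc = {
    "id": "Q2",                        # Wikidata QID
    "tag_en_name": ["Earth"],          # English label
    "tag_en_alias": ["Blue Planet"],   # English aliases
    "tag_it_name": ["Terra"],          # Italian label
    "tag_it_alias": ["Pianeta Terra"], # Italian aliases
}
```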

In order to enable this I created a custom Solr image (based on the official one) which includes the required extra libs and support files:
https://github.com/wordlift/docker-solr-opentapioca

You can see the managed-schema configuration here:
https://github.com/wordlift/docker-solr-opentapioca/blob/master/tapioca/conf/managed-schema

The image is published to docker hub as well:
https://hub.docker.com/r/wordlift/solr-opentapioca/tags

You can spot the updated structure in my own fork:

Performance (train-bow and index-dump)

I tried to boost the performance of train-bow and index-dump as much as possible, since I would like to be able to run many tests and cannot wait a week to recreate an environment.

I used cProfile to profile the app and remove bottlenecks, working in particular on the readers (removing the json.loads calls from there) and replacing the standard-library json.loads with the more efficient msgspec package, while defining the schema that we need.
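
To give an idea, decoding with a typed msgspec schema looks roughly like this (the Entity fields below are illustrative, not the exact schema in my fork):

```python
import msgspec

# Assumed, minimal schema: declaring only the fields we need lets
# msgspec skip the rest of each dump line.
class Entity(msgspec.Struct):
    id: str
    labels: dict = {}
    aliases: dict = {}

decoder = msgspec.json.Decoder(Entity)

def parse_line(line: bytes) -> Entity:
    # Decoding into a typed Struct is much faster than json.loads,
    # which builds generic dicts for the whole document.
    return decoder.decode(line)
```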

I also added RxPY and multithreading support for index-dump (i.e. not waiting for Solr to reply before moving on to process new docs):
https://github.com/ziodave/opentapioca/blob/master/opentapioca/taggerfactory.py#L146-L159
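
As a rough sketch of the pattern (RxPY 3 style; push_batch_to_solr and batches are placeholders, not the actual names in taggerfactory.py):

```python
import multiprocessing
import rx
from rx import operators as ops
from rx.scheduler import ThreadPoolScheduler

pool = ThreadPoolScheduler(multiprocessing.cpu_count())

batches = [["doc1", "doc2"], ["doc3"]]  # stand-in for batches read from the dump

def push_batch_to_solr(batch):
    # Placeholder for the real Solr indexing call.
    return len(batch)

rx.from_(batches).pipe(
    # Run each batch on the thread pool so the producer never blocks
    # waiting for Solr to acknowledge the previous batch.
    ops.flat_map(lambda batch: rx.of(batch).pipe(
        ops.subscribe_on(pool),
        ops.map(push_batch_to_solr),
    ))
).subscribe()
```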

The result is that I believe I would be able to index the full Wikidata dump in ~4 hours on my M1 (compared to the 3-4 days I needed before).

This is in conjunction with the type inheritance cache (more on this in the next paragraph).

Wikidata type inheritance

More than a year ago I noticed that some of the entities I needed were not in the index (e.g. ingredients). Investigating, I found out that we needed to run some SPARQL queries against Wikidata for the type inheritance. TBH I don't recall how much was already in opentapioca and what I contributed :-)

Anyway, as SPARQL queries are what they are, I added a @cache decorator here, https://github.com/ziodave/opentapioca/blob/e44fa870ca12373a5bb0f51694a756974c600799/opentapioca/sparqlwikidata.py#L9, which stores the results to disk for future reference.
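
Conceptually the decorator does something like this (a simplified sketch; the cache directory and names are illustrative):

```python
import functools
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("wikidata_query_cache")  # illustrative location

def disk_cache(func):
    """Cache each call's JSON result on disk, keyed by a hash of the query."""
    @functools.wraps(func)
    def wrapper(query):
        CACHE_DIR.mkdir(exist_ok=True)
        key = hashlib.sha256(query.encode("utf-8")).hexdigest()
        path = CACHE_DIR / (key + ".json")
        if path.exists():
            return json.loads(path.read_text())
        result = func(query)
        path.write_text(json.dumps(result))
        return result
    return wrapper
```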

And then I created a GH repo holding this cache:
https://github.com/wordlift/opentapioca-wikidata-query-cache

The speed increase is drastic, and I think the Wikidata folks may also be happy that I am not repeating the same queries.

Summary

In short, I really like OpenTapioca and I am trying to grow it according to my needs (and, from some conversations I have had, other people's needs too).

My aim now would be to share this work with you, see if you have the time to review it and eventually merge it upstream.

Thanks for your time 🙏

wetneb (Member) commented Jan 4, 2023

Hey @ziodave, that is really impressive work! Thanks a lot for offering to upstream it.

Here are some broad comments about your points:

  1. Multiple language support (Solr configset and document structure)

That's really nice! The managed-schema configuration is really impressive - getting the pipeline right for so many languages is a ton of work.
The only issue I can think of is that those pipelines rely on external libs, so we cannot run tapioca off an off-the-shelf Solr instance. But that is a reasonable requirement to introduce given the benefit the change brings. And your Docker image seems to be a great solution for it (especially since you already include the tapioca configset, which makes setup even easier).
If I understand correctly, the set of languages to be used for the indexing is configurable, so one should still be able to use a stock Solr instance if we only enable English, right?

  2. Performance (train-bow and index-dump)

It's really nice too and I do not see any reason not to have this upstream. The extra dependencies are worth it.

  3. Wikidata type inheritance

We already have a mechanism to fetch all the subclasses of a given type (well, all the ones we can get via a SPARQL query without timeout…) and caching those more aggressively can make sense in some situations.
But for the real-time update of the index, it is still worth refreshing those type hierarchies relatively frequently. When debugging the outcome of a tagging, for instance to understand why a particular entity is picked up or not, it is fairly common to have to fix the type hierarchy. For this to circle back to tapioca, the type hierarchy needs to be refreshed in some way.
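
Just as a sketch of the idea (not existing code), the cache could treat entries as stale after a maximum age, so hierarchies get re-fetched periodically:

```python
import time
from pathlib import Path

# Arbitrary choice: treat cached type hierarchies as stale after a week.
MAX_AGE_SECONDS = 7 * 24 * 3600

def is_fresh(cache_file: Path) -> bool:
    """Use a cached SPARQL result only if it was written recently enough."""
    return (
        cache_file.exists()
        and time.time() - cache_file.stat().st_mtime < MAX_AGE_SECONDS
    )
```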

How to upstream your changes

I am keen to review and merge any PR you make around those topics.

I don't have a good overview of your fork - I do not know to what extent it contains other changes which you might not be ready to upstream yet, or changes which are too specific to your own use case to be worth including upstream.

I can imagine that it is significant work for you to break down the changes you made in your fork into topical PRs. Perhaps it is a useful exercise to ensure that the changes are well documented, tested, and that the CI remains green. But if you judge that your fork is generic enough and free from such "un-upstreamable" changes, we could simply decide to make your fork the official thing. I'd be happy to tag along as a co-maintainer of course. The way I generally do this is creating a GitHub org for the project, transfer the project from my account to the org and add you as an owner. Then you can just push the changes from your fork there, and done.

ziodave (Contributor, Author) commented Jan 5, 2023

If I understand correctly, the set of languages to be used for the indexing is configurable, so one should still be able to use a stock Solr instance if we only enable English, right?

Right now the list of languages is hard-coded, but it is easy to move it to a configuration file so that one may run only English (or any other preferred language):
https://github.com/ziodave/opentapioca/blob/e44fa870ca12373a5bb0f51694a756974c600799/opentapioca/indexingprofile.py#L290-L336

BTW, in the process I also moved the configuration to an AppConfig class, which makes it easier to deploy on managed environments (e.g. K8s) by using environment variables (or an .env file).
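
As a sketch of the idea (the setting names here are examples, not the actual ones):

```python
import os

class AppConfig:
    # Each setting falls back to an environment variable, so a K8s
    # deployment can override it without code changes.
    SOLR_ENDPOINT = os.environ.get("SOLR_ENDPOINT", "http://localhost:8983/solr")
    SOLR_COLLECTION = os.environ.get("SOLR_COLLECTION", "tapioca")
    LANGUAGES = os.environ.get("LANGUAGES", "en").split(",")
```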

When debugging the outcome of a tagging, for instance to understand why a particular entity is picked up or not, it is fairly common to have to fix the type hierarchy.

Agreed. I was also thinking that we could provide an API that, given a QID, tells you whether it will be indexed or not, along with some helpful information.

I can imagine that it is significant work for you to break down the changes you made in your fork into topical PRs. […] But if you judge that your fork is generic enough and free from such "un-upstreamable" changes, we could simply decide to make your fork the official thing. […]

I ran the unit tests and CI is still green, with two exceptions:

  1. I removed extra_aliases since we now have the aliases per language, so I think we can remove this test: https://github.com/ziodave/opentapioca/blob/e44fa870ca12373a5bb0f51694a756974c600799/opentapioca/tests/test_indexingprofile.py#L108-L116
  2. This test yields 1 mention instead of 2, but I have not yet looked into it thoroughly: https://github.com/ziodave/opentapioca/blob/e44fa870ca12373a5bb0f51694a756974c600799/opentapioca/tests/test_classifier.py#L59-L62

If I try to open a PR now, there are quite a lot of changes, and I am afraid I am past the point where I would be able to split them into smaller PRs:

[screenshot of the changes]

If you agree, I would like to get to the point where I can create a copy of opentapioca.org with a language selector added, so that we can run some e2e tests.

And if we like it, we can then move to a separate org 🚀

wetneb (Member) commented Jan 5, 2023

Sounds like a plan, looking forward to it! :)

ziodave (Contributor, Author) commented Jan 9, 2023

Just a quick update: I have the instance ready and I am performing the last tasks to prepare and then train the classifier (as a matter of fact it already works, but it does not populate the best_qid as I am missing the classifier).

ziodave (Contributor, Author) commented Jan 12, 2023

Another quick update: I have a version running here, https://opentapioca.wordlift.io/, at the moment without a classifier.

Pending actions:

  1. Use the sample classifier (let me know if you have a better one)
  2. To use the classifier, I need to fix the image with the right version of scikit-learn (0.20.1, currently not building in my environment), or investigate whether to upgrade to 1.x
  3. At the same time, I am preparing a new classifier using an extensive (21 GB gzipped) model by DBpedia
  4. I also need to tune the Solr schema (specifically, remove the POS filter from the query handler)

ziodave (Contributor, Author) commented Jan 14, 2023

  • Use the sample classifier
  • To use the classifier, I need to fix the image with the right version of scikit-learn (0.20.1, currently not building in my environment), or investigate whether to upgrade to 1.x; see Pretrained dumps as download #4 (comment)
  • I also need to tune the Solr schema (specifically, remove the POS filter from the query handler)
  • Add back the ability to remove items from the index
  • Move the supported languages to a configuration parameter
  • Improve the HTML page along with the language selector, plus a nice logo
  • Bump the release to 1.0.0 to signal a breaking change (the Solr schema)
  • Update the documentation
  • Add stalebot
  • Point the domain to the new server? @wetneb, in the meanwhile it can be tested here: https://opentapioca.wordlift.io

At the same time I am preparing a new classifier by using an extensive (21 GB gzipped) model by DBpedia

I'll look at preparing a new classifier 👆 later on; for the time being I am using the sample one.

ziodave (Contributor, Author) commented Jan 18, 2023

[image: new OpenTapioca logo, 500 px]

@wetneb new logo ☝️ I'll move forward with updating the repo 🙏

wetneb (Member) commented Jan 18, 2023

I love it! Thanks for the initiative :)

vittau commented Jun 27, 2023

Hi everyone,

Thanks for this amazing project!

I have one question: while running @ziodave's fork and trying to annotate some text, I get AttributeError: 'WikidataGraph' object has no attribute 'pagerank' in file "/app/opentapioca/wikidatagraph.py", line 250, in get_pagerank. Is the sample PageRank wd_2019-02-24.pgrank.npy available here not compatible? Do I have to rebuild it?

If that's relevant, I'm using scikit-learn 0.23.2 with the sample_classifier.pkl and wd_2019-02-24.bow.pkl BOW as well.
