opentapioca long-due contributions #53

Open · ziodave opened this issue Jan 4, 2023 · 9 comments

ziodave (Contributor) commented Jan 4, 2023

Hello @wetneb

I have worked intensively on opentapioca over the last few days and I wanted to summarize the major points. The general aim is to add proper multilingual support to OpenTapioca, improve the performance of its bootstrap (e.g. creating the index), and include more entities in the index by relying on type inheritance.

As I started this work a long time ago and only recently got back to it, some of the things might not make 100% sense :-)

In short:

  1. Multiple language support (Solr configset and document structure)
  2. Performance (train-bow and index-dump)
  3. Wikidata type inheritance

Multiple language support

In order to properly support multiple languages I created a different structure for the documents, where the labels are stored in tag_{language}_name and the aliases in tag_{language}_alias.

This is fundamental to ensure that I can tag targeting a specific language (set as a parameter to the annotate API).
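
For illustration, an indexed document under this scheme would look roughly like this (the values are made up; only the field naming pattern is the real one):

```python
# Illustrative only: a document following the tag_{language}_name /
# tag_{language}_alias field scheme described above.
doc = {
    "id": "Q2",                        # Wikidata QID
    "tag_en_name": ["Earth"],          # English label
    "tag_en_alias": ["Blue Planet"],   # English aliases
    "tag_it_name": ["Terra"],          # Italian label
    "tag_it_alias": ["Pianeta Terra"], # Italian aliases
}
```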

In order to enable this I created a custom Solr image (based on the official one) which includes the required extra libs and support files:
https://github.com/wordlift/docker-solr-opentapioca

You can see the managed-schema configuration here:
https://github.com/wordlift/docker-solr-opentapioca/blob/master/tapioca/conf/managed-schema

The image is published to docker hub as well:
https://hub.docker.com/r/wordlift/solr-opentapioca/tags

You can spot the updated structure in my own fork:

Performance (train-bow and index-dump)

I tried to boost the performance of train-bow and index-dump as much as possible, since I would like to be able to run many tests and cannot wait a week to recreate an environment.

I used cProfile to profile the app and remove bottlenecks, working in particular on the readers (removing the json.loads calls from there) and replacing the standard-library json.loads with the more efficient msgspec package, while defining the schema that we need.
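
To give an idea, decoding with a typed msgspec schema looks roughly like this (the Entity fields below are illustrative, not the exact schema in my fork):

```python
import msgspec

# Assumed, minimal schema: declaring only the fields we need lets
# msgspec skip the rest of each dump line.
class Entity(msgspec.Struct):
    id: str
    labels: dict = {}
    aliases: dict = {}

decoder = msgspec.json.Decoder(Entity)

def parse_line(line: bytes) -> Entity:
    # Decoding into a typed Struct is much faster than json.loads,
    # which builds generic dicts for the whole document.
    return decoder.decode(line)
```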

I also added RxPY and multithreading support for index-dump (i.e. not waiting for Solr to reply before moving on to process new docs):
https://github.com/ziodave/opentapioca/blob/master/opentapioca/taggerfactory.py#L146-L159
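
As a rough sketch of the pattern (RxPY 3 style; push_batch_to_solr and batches are placeholders, not the actual names in taggerfactory.py):

```python
import multiprocessing
import rx
from rx import operators as ops
from rx.scheduler import ThreadPoolScheduler

pool = ThreadPoolScheduler(multiprocessing.cpu_count())

batches = [["doc1", "doc2"], ["doc3"]]  # stand-in for batches read from the dump

def push_batch_to_solr(batch):
    # Placeholder for the real Solr indexing call.
    return len(batch)

rx.from_(batches).pipe(
    # Run each batch on the thread pool so the producer never blocks
    # waiting for Solr to acknowledge the previous batch.
    ops.flat_map(lambda batch: rx.of(batch).pipe(
        ops.subscribe_on(pool),
        ops.map(push_batch_to_solr),
    ))
).subscribe()
```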

The result is that I believe I would be able to index the full Wikidata dump in ~4 hours on my M1 (compared to the 3-4 days I needed before).

This is in conjunction with the type inheritance cache (more on this in the next paragraph).

Wikidata type inheritance

More than a year ago I noticed that some of the entities I needed were not in the index (e.g. ingredients). Investigating, I found out that we needed to run some SPARQL queries against Wikidata for the type inheritance. TBH I don't recall how much was already in opentapioca and what I contributed :-)

Anyway, as SPARQL queries are what they are, I added a @cache decorator here, https://github.com/ziodave/opentapioca/blob/e44fa870ca12373a5bb0f51694a756974c600799/opentapioca/sparqlwikidata.py#L9, which stores the results to disk for future reference.
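
Conceptually the decorator does something like this (a simplified sketch; the cache directory and names are illustrative):

```python
import functools
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("wikidata_query_cache")  # illustrative location

def disk_cache(func):
    """Cache each call's JSON result on disk, keyed by a hash of the query."""
    @functools.wraps(func)
    def wrapper(query):
        CACHE_DIR.mkdir(exist_ok=True)
        key = hashlib.sha256(query.encode("utf-8")).hexdigest()
        path = CACHE_DIR / (key + ".json")
        if path.exists():
            return json.loads(path.read_text())
        result = func(query)
        path.write_text(json.dumps(result))
        return result
    return wrapper
```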

And then I created a GH repo holding this cache:
https://github.com/wordlift/opentapioca-wikidata-query-cache

The speed increase is drastic, and I think the Wikidata folks may also be happy that I am not repeating the same queries.

Summary

In short, I really like OpenTapioca and I am trying to grow it according to my needs (and, from some conversations I have had, other people's needs too).

My aim now would be to share this work with you, see if you have the time to review it and eventually merge it upstream.

Thanks for your time 🙏

wetneb (Member) commented Jan 4, 2023

Hey @ziodave, that is really impressive work! Thanks a lot for offering to upstream it.

Here are some broad comments about your points:

  1. Multiple language support (Solr configset and document structure)

That's really nice! The managed-schema configuration is really impressive - getting the pipeline right for so many languages is a ton of work.
The only issue I can think of is that those pipelines rely on external libs, so we cannot run tapioca off an off-the-shelf Solr instance. But that is a reasonable requirement to introduce given the benefit the change brings. And your Docker image seems to be a great solution for it (especially since you already include the tapioca configset, which makes setup even easier).
If I understand correctly, the set of languages to be used for the indexing is configurable, so one should still be able to use a stock Solr instance if we only enable English, right?

  2. Performance (train-bow and index-dump)

It's really nice too and I do not see any reason not to have this upstream. The extra dependencies are worth it.

  3. Wikidata type inheritance

We already have a mechanism to fetch all the subclasses of a given type (well, all the ones we can get via a SPARQL query without timeout…) and caching those more aggressively can make sense in some situations.
But for the real-time update of the index, it is still worth refreshing those type hierarchies relatively frequently. When debugging the outcome of a tagging, for instance to understand why a particular entity is picked up or not, it is fairly common to have to fix the type hierarchy. For this to circle back to tapioca, the type hierarchy needs to be refreshed in some way.
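
Just as a sketch of the idea (not existing code), the cache could treat entries as stale after a maximum age, so hierarchies get re-fetched periodically:

```python
import time
from pathlib import Path

# Arbitrary choice: treat cached type hierarchies as stale after a week.
MAX_AGE_SECONDS = 7 * 24 * 3600

def is_fresh(cache_file: Path) -> bool:
    """Use a cached SPARQL result only if it was written recently enough."""
    return (
        cache_file.exists()
        and time.time() - cache_file.stat().st_mtime < MAX_AGE_SECONDS
    )
```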

How to upstream your changes

I am keen to review and merge any PR you make around those topics.

I don't have a good overview of your fork - I do not know to what extent it contains other changes which you might not be ready to upstream yet, or changes which are too specific to your own use case to be worth including upstream.

I can imagine that it is significant work for you to break down the changes you made in your fork into topical PRs. Perhaps it is a useful exercise to ensure that the changes are well documented, tested, and that the CI remains green. But if you judge that your fork is generic enough and free from such "un-upstreamable" changes, we could simply decide to make your fork the official thing. I'd be happy to tag along as a co-maintainer of course. The way I generally do this is creating a GitHub org for the project, transfer the project from my account to the org and add you as an owner. Then you can just push the changes from your fork there, and done.

ziodave (Contributor, Author) commented Jan 5, 2023

If I understand correctly, the set of languages to be used for the indexing is configurable, so one should still be able to use a stock Solr instance if we only enable English, right?

Right now the list of languages is hard-coded, but it is easy to move it to a configuration file so that one may run only English (or any other preferred language):
https://github.com/ziodave/opentapioca/blob/e44fa870ca12373a5bb0f51694a756974c600799/opentapioca/indexingprofile.py#L290-L336

BTW, in the process I also moved the configuration to an AppConfig class, which makes it easier to deploy on managed environments (e.g. K8s) by using environment variables (or an .env file).
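
As a sketch of the idea (the setting names here are examples, not the actual ones):

```python
import os

class AppConfig:
    # Each setting falls back to an environment variable, so a K8s
    # deployment can override it without code changes.
    SOLR_ENDPOINT = os.environ.get("SOLR_ENDPOINT", "http://localhost:8983/solr")
    SOLR_COLLECTION = os.environ.get("SOLR_COLLECTION", "tapioca")
    LANGUAGES = os.environ.get("LANGUAGES", "en").split(",")
```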

When debugging the outcome of a tagging, for instance to understand why a particular entity is picked up or not, it is fairly common to have to fix the type hierarchy.

Agreed. I was also thinking that we could provide an API that, given a QID, tells you whether it will be indexed or not, along with some helpful information.

I can imagine that it is significant work for you to break down the changes you made in your fork into topical PRs. […] But if you judge that your fork is generic enough and free from such "un-upstreamable" changes, we could simply decide to make your fork the official thing. […]

I ran the unit tests and CI is still green, with two exceptions:

  1. I removed extra_aliases since we now have the aliases per language, so I think we can remove this test: https://github.com/ziodave/opentapioca/blob/e44fa870ca12373a5bb0f51694a756974c600799/opentapioca/tests/test_indexingprofile.py#L108-L116
  2. This test yields 1 mention instead of 2, but I have not yet looked into it thoroughly: https://github.com/ziodave/opentapioca/blob/e44fa870ca12373a5bb0f51694a756974c600799/opentapioca/tests/test_classifier.py#L59-L62

If I try to open a PR now, there are quite a lot of changes, and I am afraid I am past the point where I would be able to split them into smaller PRs:

[screenshot of the changes]

If you agree, I would like to get to the point where I can create a copy of opentapioca.org with a language selector added, so that we can run some e2e tests.

And if we like it, we can then move to a separate org 🚀

wetneb (Member) commented Jan 5, 2023

Sounds like a plan, looking forward to it! :)

ziodave (Contributor, Author) commented Jan 9, 2023

Just a quick update: I have the instance ready and I am performing the last tasks to prepare and then train the classifier (as a matter of fact it already works, but it does not populate the best_qid as I am missing the classifier).

ziodave (Contributor, Author) commented Jan 12, 2023

Another quick update: I have a version running here, https://opentapioca.wordlift.io/, at the moment without a classifier.

Pending actions:

  1. Use the sample classifier (let me know if you have a better one)
  2. To use the classifier, I need to fix the image with the right version of scikit-learn (0.20.1, currently not building in my environment), or investigate whether to upgrade to 1.x
  3. At the same time, I am preparing a new classifier using an extensive (21 GB gzipped) model by DBpedia
  4. I also need to tune the Solr schema (specifically, remove the POS filter from the query handler)

ziodave (Contributor, Author) commented Jan 14, 2023

  • Use the sample classifier
  • To use the classifier, I need to fix the image with the right version of scikit-learn (0.20.1, currently not building in my environment), or investigate whether to upgrade to 1.x; see Pretrained dumps as download #4 (comment)
  • I also need to tune the Solr schema (specifically, remove the POS filter from the query handler)
  • Add back the ability to remove items from the index
  • Move the supported languages to a configuration parameter
  • Improve the HTML page along with the language selector, plus a nice logo
  • Bump the release to 1.0.0 to signal a breaking change (the Solr schema)
  • Update the documentation
  • Add stalebot
  • Point the domain to the new server? @wetneb, in the meanwhile it can be tested here: https://opentapioca.wordlift.io

At the same time I am preparing a new classifier by using an extensive (21 GB gzipped) model by DBpedia

I'll look at preparing a new classifier 👆 later on; for the time being I am using the sample one.

ziodave (Contributor, Author) commented Jan 18, 2023

[image: new OpenTapioca logo, 500 px]

@wetneb new logo ☝️ I'll move forward with updating the repo 🙏

wetneb (Member) commented Jan 18, 2023

I love it! Thanks for the initiative :)

vittau commented Jun 27, 2023

Hi everyone,

Thanks for this amazing project!

I have one question: while running @ziodave's fork and trying to annotate some text, I get AttributeError: 'WikidataGraph' object has no attribute 'pagerank' in file "/app/opentapioca/wikidatagraph.py", line 250, in get_pagerank. Is the sample PageRank wd_2019-02-24.pgrank.npy available here not compatible? Do I have to rebuild it?

If that's relevant, I'm using scikit-learn 0.23.2 with the sample_classifier.pkl and wd_2019-02-24.bow.pkl BOW as well.
