Support for multi-valued fields; Handling of missing termId's #9
Sorry for changing two things in one pull request, but after vacation I was no longer aware of the termId change and so those two things got mixed together in a single commit :(
(A) Support for multi-valued fields:
While building an FST over all entities in Freebase I noticed that I could not find the Freebase entity for "BBC", even though I had indexed the data so that both "BBC" and "British Broadcasting Corporation" were labels for this entity. After some debugging I found that TaggerFstCorpus only retrieves a single value for the configured storedFieldName. So if the corpus uses multi-valued fields (as in my case), only one of the values gets added to the FST.
This pull request fixes that by using
IndexableField[] storedFields = document.getFields(storedFieldName);
and iterating over all stored values, of which there may be several.
(B) Dealing with missing term IDs in the inverted index
While building FST models for several languages of Freebase I was running into exceptions like
Couldn't lookup term TEXT=... TERM=...
One example I investigated further was
This specific case is related to the German label of https://www.freebase.com/m/03q6lyt.
However, note that the term 'WIẞT' in this label does not use the normal 'ß' typically used in German, but an upper-case version that was only very recently added to Unicode (see Wikipedia: Capital_ẞ for details).
I also checked some other examples, and in all cases the problem was related to very uncommon letters in terms.
Since throwing an IllegalStateException in such cases would prevent building an FST for the whole corpus, I decided to change this to WARN-level logging and to simply skip missing termIds when building the FST.
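To illustrate both changes together, here is a minimal, Lucene-free sketch. It is not the actual TaggerFstCorpus code: the class CorpusSketch, the method collectPhrases, the map-based document, the whitespace tokenization and the term-ID table are all hypothetical stand-ins. It also assumes that only the unknown term is dropped (and that a value with no remaining terms is left out), which may differ from what the real code does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.logging.Logger;

// Hypothetical stand-in for the corpus builder; not the real TaggerFstCorpus.
final class CorpusSketch {
    private static final Logger LOG = Logger.getLogger(CorpusSketch.class.getName());

    /**
     * Turns every stored value of storedFieldName into a list of term IDs.
     * The document is modelled as field name -> all stored values (assumption).
     */
    static List<List<Integer>> collectPhrases(Map<String, List<String>> document,
                                              String storedFieldName,
                                              Map<String, Integer> termIds) {
        List<List<Integer>> phrases = new ArrayList<>();
        // (A) visit every stored value of the field, not just a single one
        for (String text : document.getOrDefault(storedFieldName, List.of())) {
            List<Integer> ids = new ArrayList<>();
            // Whitespace splitting stands in for the real analyzer (assumption).
            for (String term : text.split("\\s+")) {
                Integer termId = termIds.get(term);
                if (termId == null) {
                    // (B) log a warning and skip instead of throwing IllegalStateException
                    LOG.warning("Couldn't lookup term TEXT=" + text + " TERM=" + term);
                    continue;
                }
                ids.add(termId);
            }
            // Assumption: a value with no known terms contributes nothing.
            if (!ids.isEmpty()) {
                phrases.add(ids);
            }
        }
        return phrases;
    }

    public static void main(String[] args) {
        // "WI\u1E9ET" contains the capital sharp s and has no term ID here.
        Map<String, List<String>> doc = Map.of(
                "name", List.of("BBC", "British Broadcasting Corporation", "WI\u1E9ET"));
        Map<String, Integer> termIds = Map.of(
                "BBC", 0, "British", 1, "Broadcasting", 2, "Corporation", 3);
        System.out.println(collectPhrases(doc, "name", termIds));
    }
}
```

With this input both labels of the "BBC" entity end up in the result, while the value containing the unknown term only produces a warning instead of aborting the whole build.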