Support for multi-valued fields; Handling of missing termId's #9

Merged: 2 commits, Aug 24, 2013

@westei westei commented Aug 21, 2013

Sorry for changing two things in one pull request, but after vacation I was no longer aware of the termId change and so those two things got mixed together in a single commit :(

(A) Support for multi-valued fields:

While building an FST over all entities in Freebase, I noticed that I could not find the Freebase entity for "BBC", even though I had indexed the data so that both "BBC" and "British Broadcasting Corporation" were labels for this entity. After some debugging I realized that TaggerFstCorpus retrieves only a single value for the configured storedFieldName. So if the corpus uses multi-valued fields (as in my case), only one of the values gets added to the FST.

This pull request fixes that by using IndexableField[] storedFields = document.getFields(storedFieldName); and iterating over all stored values.
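
To illustrate the difference between reading one stored value and reading all of them, here is a minimal sketch in plain Java. It is not the code from this pull request: the document is modelled as a `Map<String, List<String>>` stand-in rather than a Lucene `Document`, and the class name `MultiValuedLabels` and both helper methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in sketch (assumption): a "document" is a map from field name to all
// of its stored values. With Lucene, the values would instead come from
// document.getFields(storedFieldName).
final class MultiValuedLabels {

    // Single-value behaviour: only the first stored value is considered.
    static List<String> firstValueOnly(Map<String, List<String>> doc, String field) {
        List<String> values = doc.getOrDefault(field, Collections.<String>emptyList());
        return values.isEmpty()
                ? Collections.<String>emptyList()
                : Collections.singletonList(values.get(0));
    }

    // Multi-value behaviour: every non-empty stored value is considered.
    static List<String> allValues(Map<String, List<String>> doc, String field) {
        List<String> phrases = new ArrayList<>();
        for (String value : doc.getOrDefault(field, Collections.<String>emptyList())) {
            if (value != null && !value.isEmpty()) {
                phrases.add(value);
            }
        }
        return phrases;
    }

    public static void main(String[] args) {
        Map<String, List<String>> entity = new HashMap<>();
        entity.put("label", Arrays.asList("British Broadcasting Corporation", "BBC"));
        System.out.println(firstValueOnly(entity, "label"));
        System.out.println(allValues(entity, "label"));
    }
}
```

With the single-value variant, "BBC" never reaches the phrase list, which matches the lookup failure described above.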

(B) Dealing with missing Term IDs in the inverted index

While building FST models for several languages of Freebase, I ran into exceptions like: Couldn't lookup term TEXT=... TERM=...

One example I investigated further was

TEXT=Einladung zur ACID-HOUSE-IH-WI?T-WO-Party TERM=wi?t

This specific case is related to the German label of https://www.freebase.com/m/03q6lyt.
Note, however, that the term 'WIẞT' in this label does not use the normal 'ß' typically used in German, but an upper-case version that was only recently added to Unicode (see Wikipedia: Capital_ẞ for details).

I also checked some other examples, and in all cases the failure was related to very uncommon letters in the terms.

As throwing an IllegalStateException in such cases would prevent building an FST for the whole corpus, I decided to change this to a WARN-level log message and simply skip missing termIds when building the FST.
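
The warn-and-skip approach can be sketched roughly as follows. This is an illustration under assumptions, not the merged code: the term dictionary is a plain `Map<String, Integer>`, logging uses `java.util.logging`, and the names `MissingTermIdSketch` and `toTermIds` are hypothetical. The sketch skips only the term that has no ID; the description above does not say whether the actual change does that or drops the whole phrase.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.logging.Logger;

// Hypothetical sketch: resolve the terms of one label to term IDs, logging a
// warning and skipping any term that has no ID instead of throwing.
final class MissingTermIdSketch {
    private static final Logger LOG =
            Logger.getLogger(MissingTermIdSketch.class.getName());

    static List<Integer> toTermIds(String text, List<String> terms,
                                   Map<String, Integer> termIds) {
        List<Integer> ids = new ArrayList<>();
        for (String term : terms) {
            Integer id = termIds.get(term);
            if (id == null) {
                // Previously an IllegalStateException; now a warning only.
                LOG.warning("Couldn't lookup term TEXT=" + text + " TERM=" + term);
                continue;
            }
            ids.add(id);
        }
        return ids;
    }

    public static void main(String[] args) {
        Map<String, Integer> dictionary = new HashMap<>();
        dictionary.put("einladung", 0);
        dictionary.put("zur", 1);
        dictionary.put("party", 2);
        List<String> terms = Arrays.asList("einladung", "zur", "wi?t", "party");
        System.out.println(toTermIds("Einladung zur ... Party", terms, dictionary));
    }
}
```

One unresolvable term then costs a log line rather than the whole corpus build.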

dsmiley commented Aug 22, 2013

Hi Westel,

I'll likely commit this tomorrow or the day after; it looks good. I can't believe I overlooked the multi-value case for stored fields -- doh! I'm curious as to more specifically what the technical root cause of the failed term lookup is but it seems we don't know for certain. Logging a warning is fine.

Have you seen the MemPF branch? It's what I consider to be the next version -- 2.0-SNAPSHOT is what's in the pom on that branch now. The memory use should be about 40% less (comparing the .fst file here with the .ram postings format file there). And there is a lot less code; TaggerFSTCorpus goes away. A bunch of other benefits too. I haven't yet tested tagging speed but I suspect it's a bit slower. It should not be hard for you to test out your application (Stanbol?) with the new version.

p.s. I'm on vacation this week and a biz trip the next but I've got a little time here & there.

westei commented Aug 22, 2013

Yes, I have seen the MemPF branch; I have even had a short look at the code.

Is my assumption correct that with the MemPF branch it would no longer be necessary to rebuild the whole FST if some documents are changed in the index?

Regarding memory: The FST for the English Freebase (36 million entities, 50 GByte Solr index size) is 360MByte. German is the 2nd largest with 17MByte, followed by French 15, Spanish 12, Italian 10, Russian 8.6, Portuguese 7.6 ... The FSTs for all ~200 languages are ~500MByte in size together. I am already very excited about the current memory usage. If the new version reduces such numbers by 40%, that's even better.

Regarding indexing speed: Currently, creating all FSTs for Freebase takes ~8h on my machine (with 4 concurrent threads). IMO most of the time is spent because TaggerFSTCorpus iterates over all documents in the index, which is very inefficient for languages where labels are defined for only a few documents. E.g. creating the 360MByte FST for English (en) takes maybe only twice as long as creating the 405Byte FST for Ganda (gg). So if, with the new model, all configured FSTs could be updated while adding a document to the index, it should give a huge boost for multilingual scenarios.

Regarding tagging performance: When using FST tagging, I observed that the time spent in the FST tagging itself is only a tiny percentage of the whole. Most of the time is spent on getting the Solr IDs for the Lucene integer IDs of matching documents. This is especially visible for Freebase, as a typical news article yields about 5k-10k matching documents. In such cases the FST time would be about 20ms and loading the documents would take about 1sec. There is certainly some caching in place, because if one sends the same or similar articles the time spent on loading the Solr document IDs decreases, but it is still the major contributor to the overall time spent on tagging. So IMO, even if the MemPF branch version needed twice the time for FST tagging, one should not see a big difference in the overall processing time.

BTW: For the Stanbol EnhancementEngine I have gone a different way: As the engine needs to load other information besides the Solr docID anyway (at least labels, types and entity ranking), it does not look up the Solr docID in the callback, but uses IndexReader#document(int docId, Set<String> fieldsToLoad) after the tagging has finished to get the data. In addition I added a SolrCache (the FastLRUCache) that keeps such information in memory.
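
The "collect doc IDs first, load afterwards, cache the results" idea can be sketched like this. It is not the Stanbol implementation: a `LinkedHashMap` in access order stands in for Solr's FastLRUCache, the loader function stands in for the `IndexReader#document(...)` call, and the class `DeferredDocLoader` with its `loadCount()` counter is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntFunction;

// Hypothetical sketch: load per-document data after tagging has finished,
// keeping recently used entries in a small LRU cache.
final class DeferredDocLoader<V> {
    private final Map<Integer, V> cache;
    private final IntFunction<V> loader; // stand-in for the index lookup
    private int loads = 0;

    DeferredDocLoader(final int maxEntries, IntFunction<V> loader) {
        this.loader = loader;
        // accessOrder = true makes iteration order least-recently-used first.
        this.cache = new LinkedHashMap<Integer, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Resolve all doc IDs gathered during tagging in one pass.
    List<V> loadAll(List<Integer> docIds) {
        List<V> result = new ArrayList<>(docIds.size());
        for (int docId : docIds) {
            V value = cache.get(docId);
            if (value == null) {
                value = loader.apply(docId); // the expensive step
                loads++;
                cache.put(docId, value);
            }
            result.add(value);
        }
        return result;
    }

    int loadCount() {
        return loads;
    }

    public static void main(String[] args) {
        DeferredDocLoader<String> docs = new DeferredDocLoader<>(2, id -> "doc" + id);
        System.out.println(docs.loadAll(Arrays.asList(1, 2, 1)));
        System.out.println("index lookups: " + docs.loadCount());
    }
}
```

Repeated doc IDs, which are common when similar articles are tagged one after another, then hit the cache instead of the index.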

@dsmiley dsmiley merged commit 7e37ecd into OpenSextant:master Aug 24, 2013