
Using the database backed core (GSoC 2012, Jo)


The current version of the database-backed core can be found in my branch: dbpedia-spotlight-db


Data Import

Data Sources

There are Source objects for the following datasets in org.dbpedia.spotlight.db.io:

| Data | Required datasets (Pig only) | Required objects |
|---|---|---|
| SurfaceForm | sfCounts, phrasesCounts | |
| DBpediaResource | uriCounts, instanceTypes.tsv | WikipediaToDBpediaClosure |
| CandidateMap | pairCounts | WikipediaToDBpediaClosure, SurfaceFormStore, ResourceStore |
| Tokens | token_counts*.tsv | |
| TokenOccurrences | token_counts*.tsv | WikipediaToDBpediaClosure, TokenStore |
  • The file instanceTypes.tsv contains DBpedia, Schema.org and Freebase types for each DBpediaResource; it is produced by types.sh.
  • Each source can be created either from the legacy files (TSV) or from Pig (note, however, that the Pig versions are more up-to-date).
  • WikipediaToDBpediaClosure converts a Wikipedia URL to the DBpedia format and then follows the transitive closure of redirects in DBpedia to the final URI. If the DBpedia resource is a disambiguation page, it throws a NotADBpediaResourceException. This class requires the DBpedia triple files redirects_en.nt and disambiguations_en.nt (a sketch of the idea follows below).
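
The following is only a minimal Scala sketch of the redirect-closure idea, not the actual WikipediaToDBpediaClosure implementation; the class name, method names and URL handling are simplified assumptions.

// Illustrative sketch only. The real WikipediaToDBpediaClosure is built from
// redirects_en.nt and disambiguations_en.nt and has its own constructor and API.
class NotADBpediaResourceException(msg: String) extends Exception(msg)

class RedirectClosureSketch(redirects: Map[String, String], disambiguations: Set[String]) {

  // Convert a Wikipedia URL to a DBpedia-style resource name (encoding details omitted).
  private def toDBpediaName(wikiUrl: String): String =
    wikiUrl.replace("http://en.wikipedia.org/wiki/", "")

  // Follow redirects transitively until a non-redirect resource is reached.
  def resolve(wikiUrl: String): String = {
    var uri  = toDBpediaName(wikiUrl)
    var seen = Set.empty[String]          // guard against redirect cycles
    while (redirects.contains(uri) && !seen.contains(uri)) {
      seen += uri
      uri = redirects(uri)
    }
    if (disambiguations.contains(uri))
      throw new NotADBpediaResourceException("Disambiguation page: " + uri)
    uri
  }
}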

Indexers

The interfaces for indexing the data sources are specified in org.dbpedia.spotlight.model in the index module. There are currently two indexers implementing the interfaces:

  • in-memory indexer (org.dbpedia.spotlight.db.MemoryStoreIndexer, uses Kryo for serialization)
  • disk-based indexer (org.dbpedia.spotlight.db.JDBMStoreIndexer, uses JDBM3, but this is still in development since I focused on having a running in-memory version first)

Running the import

Currently, there are two Scala objects for running the import, org.dbpedia.spotlight.db.ImportPig and org.dbpedia.spotlight.db.ImportTSV, which must be configured with the correct data paths. They can be run with:

mvn exec:java -pl index -Dexec.mainClass=org.dbpedia.spotlight.db.ImportPig

(this requires running mvn package first, but has less overhead) or

mvn scala:run -DmainClass=org.dbpedia.spotlight.db.ImportPig

The full Pig-based import takes about 1.5 hours (mainly due to reading the token occurrence file) for the in-memory stores and 6-7 hours for the disk-based stores.

Creating the in-memory version

When creating the in-memory version, the import should be run with enough heap space. The SurfaceForm, DBpediaResource and CandidateMap imports can be run with -Xmx5g or -Xmx6g, but the TokenOccurrences import should be run with at least -Xmx12g (see the example after the file list below for passing this through Maven). The resulting serialized files will be written to disk and require ~7GB of memory when fully loaded (-Xmx10g worked well for me). The following files will be written to disk:

135M sf.mem
187M res.mem
204M candmap.mem
 19M tokens.mem
4.4G context.mem
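
One way to pass the heap size when running the import through Maven is via MAVEN_OPTS (exec:java runs inside the Maven JVM, so the option applies to the import); the value shown here is the one suggested for the TokenOccurrences import:

export MAVEN_OPTS="-Xmx12g"
mvn exec:java -pl index -Dexec.mainClass=org.dbpedia.spotlight.db.ImportPig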

The data (except for context.mem) can be downloaded here. The memory consumption after loading each store (the stores are loaded one after the other) is as follows:

| Store | Used heap space |
|---|---|
| 1. MemorySurfaceFormStore | 798 MB |
| 2. MemoryResourceStore | 1526 MB |
| 3. MemoryCandidateMapStore | 2188 MB |
| 4. MemoryTokenStore | 2016 MB |
| 5. MemoryContextStore | 6762 MB |

Using the data

All data stores follow the interfaces in org.dbpedia.spotlight.db.model. The elements in a data store can usually be queried by their internal ID or by their name (e.g. the URI without prefix for DBpedia resources):

Interfaces

ResourceStore
  def getResource(id: Int): DBpediaResource
  def getResourceByName(name: String): DBpediaResource

SurfaceFormStore
  def getSurfaceForm(surfaceform: String): SurfaceForm

CandidateMapStore
  def getCandidates(surfaceform: SurfaceForm): Set[Candidate]

TokenStore
  def getToken(token: String): Token
  def getTokenByID(id: Int): Token

ContextStore
  def getContextCount(resource: DBpediaResource, token: Token): Int
  def getContextCounts(resource: DBpediaResource): Map[Token, Int]

Using the in-memory stores

The in-memory stores can be used as follows:

// The resource store must be loaded first, since the candidate map references it
// (method name assumed to parallel the other load methods).
val resStore = MemoryStore.loadResourceStore(new FileInputStream("data/res.mem"))
val sfStore = MemoryStore.loadSurfaceFormStore(new FileInputStream("data/sf.mem"))
val candMap = MemoryStore.loadCandidateMapStore(new FileInputStream("data/candmap.mem"), resStore)
[...]

Disk-based stores can be used like this:

val diskContext = new DiskContextStore("data/context.disk")
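
Once loaded, both the in-memory and the disk-based stores are queried through the interfaces listed above. A small example, assuming the stores from the snippets above (plus a token store loaded analogously from tokens.mem) and using example values for the surface form, resource and token:

// Assumes sfStore, candMap and resStore from the in-memory snippet above,
// a tokenStore loaded analogously from data/tokens.mem, and diskContext from
// the disk-based snippet. "Berlin" and "capital" are just example inputs.
val sf         = sfStore.getSurfaceForm("Berlin")
val candidates = candMap.getCandidates(sf)                      // candidate resources for the surface form
val resource   = resStore.getResourceByName("Berlin")
val token      = tokenStore.getToken("capital")
val count      = diskContext.getContextCount(resource, token)   // how often the token occurs with the resource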

Database-backed TF*ICF disambiguator

The ParagraphDisambiguator DBTwoStepDisambiguator relies only on the Store interfaces defined above and uses TF*ICF as the measure for context similarity.
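
The exact scoring lives in DBTwoStepDisambiguator; the following is only a rough sketch of a TF*ICF-style context score built on the ContextStore interface, where the inverse candidate frequency (ICF) plays the role that IDF plays in TF*IDF but is computed over the candidate resources of a surface form. The function name and the exact weighting are assumptions, not the actual implementation.

// Rough sketch of a TF*ICF-style context score, not the code in DBTwoStepDisambiguator.
// Assumes the model classes (DBpediaResource, Token) from org.dbpedia.spotlight.model
// and the ContextStore interface from org.dbpedia.spotlight.db.model.
def tficfScore(resource: DBpediaResource,
               contextTokens: Seq[Token],
               candidates: Set[DBpediaResource],
               contextStore: ContextStore): Double = {

  // number of candidate resources whose context contains the token at least once
  def candidateFrequency(token: Token): Int =
    candidates.count(c => contextStore.getContextCount(c, token) > 0)

  contextTokens.map { token =>
    val tf  = contextStore.getContextCount(resource, token)   // token count in the resource's context
    val icf = math.log(candidates.size.toDouble / math.max(1, candidateFrequency(token)))
    tf * icf
  }.sum
}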

Performance and early Results

The following table shows the time performance on the Wikify dataset (this is not a thorough evaluation but an indication, so the table shows only a single run for each setting). TF*ICF was calculated only for the best k candidates for each surface form, ranked by the prior probability of the candidate, P(res|sf).

| k | Dataset | Time |
|---|---|---|
| 0 (uses only the prior) | Wikify, 50 paragraphs, 706 disambiguations | 6 sec |
| 10 | Wikify, 50 paragraphs, 706 disambiguations | 18 sec |
| 25 | Wikify, 50 paragraphs, 706 disambiguations | 47 sec |
| 50 | Wikify, 50 paragraphs, 706 disambiguations | 109 sec |
| 100 | Wikify, 50 paragraphs, 706 disambiguations | 244 sec |
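
The prior P(res|sf) is estimated from the annotated-occurrence counts as count(sf, res) / count(sf). A minimal sketch of the pruning step follows, where the count lookup is passed in as a function because the exact accessor on the model classes is not shown here:

// Illustrative only: select the best k candidates of a surface form by prior P(res|sf).
// cooccurrenceCount is an assumed lookup for count(sf, res); count(sf) is the sum over all candidates.
def bestByPrior(candidates: Set[Candidate],
                cooccurrenceCount: Candidate => Int,
                k: Int): Seq[Candidate] = {
  val total = candidates.toSeq.map(cooccurrenceCount).sum.toDouble
  candidates.toSeq
    .sortBy(c => -(cooccurrenceCount(c) / total))   // descending by P(res|sf)
    .take(k)
}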

Evaluation

Update: Note that the test dataset still contains disambiguation pages and not all redirects are resolved, so the final results will be slightly better than the results below. I will update the results once I have rebased my branch and can run the latest evaluation.

The accuracy and global MRR using only P(res|sf) derived from the Pig data, on the Wikify dataset:

Disambiguator: Database-backed 2 Step TF*ICF disambiguator (k=0)
Correct URI not found = 115 / 706 = 0.163
Accuracy = 528 / 706 = 0.748
Global MRR: 0.7808735541769214

UPDATE: after resolving redirects and excluding disambiguation pages

Corpus: MilneWitten
Number of occs: 706 (original), 638 (processed)
Disambiguator: Database-backed 2 Step TF*ICF disambiguator (k=0)
Correct URI not found = 58 / 638 = 0.091
Accuracy = 526 / 638 = 0.824
Global MRR: 0.7724539606704485

and using only TF*ICF:

Disambiguator: Database-backed 2 Step TF*ICF disambiguator
Correct URI not found = 123 / 706 = 0.174
Accuracy = 356 / 706 = 0.504
Global MRR: 0.6227541921588945

The accuracy for TF*ICF alone is very low; it is likely that there are still issues with the calculation of the TF*ICF score.

Evaluations including TF*ICF will be added here as soon as I have re-estimated the weights for mixing the prior and the TF*ICF score.

Issues and TODO

  • re-estimate the weights for the disambiguator, try to combine the scores using a log-linear model?
  • the disk-based stores still need some work
  • WikipediaToDBpediaClosure should ultimately be moved to Pig
  • check and improve performance of TF*ICF calculation