
Workflow


This page describes the workflows for DBpedia Spotlight, covering two major stages: training/indexing time (building the service) and execution time (running the service online). It gives a high-level description of how things work now, plus some roadmapping for GSoC 2012.

Training Time (aka Indexing Time)

Data Preprocessing

Generic flow

  • DBpedia Extraction: create resource URIs, extract properties from infoboxes, extract categories, extract redirects and disambiguations, and compute the transitive closure of redirects
  • Context Extraction: extract occurrences (paragraphs with wikilinks), articles and definitions.
  • CandidateMap Extraction: built from titles, redirects, disambiguation pages and anchor links
  • Computing Statistics: counts of URIs, surface forms, tokens, topics and their co-occurrences (see the sketch after this list)
  • Storage: load statistics into the chosen data storage
  • Training Spotter: based on the statistics above, train an algorithm that selects, from incoming text, the substrings that should be disambiguated
  • Training Disambiguator: based on the statistics above, train an algorithm that chooses the most likely URI based on surface form, context and topic
  • Training Linker: based on the statistics above, and a test run, train a linker to detect NILs and to adjust to different annotation styles
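
The "Computing Statistics" step above boils down to counting how often each surface form is linked to each URI (and, analogously, token and topic co-occurrences). A minimal local sketch is shown below; it assumes the occurrences are available as a TSV with the columns occId, uri, surfaceForm and context, which is an illustrative layout rather than the project's fixed format.

```scala
import scala.io.Source

// Minimal sketch: count (surfaceForm, uri) co-occurrences from an occurrence TSV.
// Assumed layout per line: occId <TAB> uri <TAB> surfaceForm <TAB> contextText
// (the column order is an assumption, not the project's fixed format).
object CountSfUriPairs {
  def main(args: Array[String]): Unit = {
    val counts = scala.collection.mutable.Map[(String, String), Int]().withDefaultValue(0)

    for (line <- Source.fromFile(args(0), "UTF-8").getLines()) {
      val fields = line.split("\t", -1)
      if (fields.length >= 3) {
        val uri = fields(1)
        val surfaceForm = fields(2)
        counts((surfaceForm, uri)) += 1
      }
    }

    // Emit the counts as TSV (surfaceForm, uri, count); these feed the
    // spotter and disambiguator training described in the list above.
    counts.toSeq.sortBy(-_._2).foreach { case ((sf, uri), c) =>
      println(s"$sf\t$uri\t$c")
    }
  }
}
```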

Current flow as of v0.5

  • DBpedia Extraction: single machine, with the DBpedia Extraction Framework (DEF); batch and streaming.
      • categories are not used yet.
  • Context Extraction: single machine, with ExtractOccsFromWikipedia. Batch only.
  • CandidateMap Extraction: single machine, with ExtractCandidateMap. Batch only.
  • Computing Statistics and Storage: both in one go with Lucene, based on IndexMergedOccurrences. Statistics: TF(t,uri), DF(t,uri). Batch only.
  • Training Spotter: use trained OpenNLP components, create a dictionary for lexicon-based spotters (LingPipe, etc.)
  • Training Disambiguator: no training (ranking approach, based on Lucene)
  • Training Linker: currently based on thresholding (discarding low-scored disambiguations; see the sketch after this list), trained together with EvalDisambiguationOnly. To be separated soon, for 0.6.
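
The thresholding mentioned for the Training Linker step can be pictured as follows. This is only a sketch of the idea: annotations whose disambiguation score falls below a confidence cutoff are treated as NIL and dropped. The `Annotation` type and the 0.15 cutoff are illustrative, not the project's actual classes or defaults.

```scala
// Sketch of a threshold-based linker: discard low-scored disambiguations
// (i.e. treat them as NIL). Types and the 0.15 cutoff are illustrative only.
case class Annotation(surfaceForm: String, uri: String, score: Double)

object ThresholdLinker {
  // Keep only annotations whose disambiguation score clears the threshold;
  // everything below it is considered NIL and removed from the output.
  def link(candidates: Seq[Annotation], confidence: Double = 0.15): Seq[Annotation] =
    candidates.filter(_.score >= confidence)

  def main(args: Array[String]): Unit = {
    val example = Seq(
      Annotation("Washington", "George_Washington", 0.72),
      Annotation("Washington", "Washington_(state)", 0.08))
    link(example).foreach(println)   // only the first annotation survives
  }
}
```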

New flow for GSoC 2012:

  • DBpedia Extraction: same (being improved separately from our projects)
      • Flattening the category hierarchy (Dirk)
  • Context Extraction: map-reduce via Pig/Hadoop (Max/Chris)
  • CandidateMap Extraction: map-reduce via Pig/Hadoop (Max/Chris)
  • Computing Statistics:
      • co-occurrences of (sf, uri, context): map-reduce via Pig/Hadoop (Chris)
      • co-occurrences of category x (sf, uri, context, cat): map-reduce via Pig/Hadoop (Dirk)
      • co-occurrences of entities: map-reduce via Pig/Hadoop (Hector); see the sketch after this list
  • Storage: JDBM3, Mem (Jo), Lucene (Pablo,Max), Other?
  • Training Spotter: Train new spotter based on stats computed here
  • Training Disambiguator: train disambiguators based on probabilities computed above (Pablo and Jo)
  • Training Linker: find good decision functions (Pablo)
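
The Pig/Hadoop jobs themselves are not reproduced here, but the entity co-occurrence statistic they compute can be illustrated locally. The sketch below assumes each input line lists the URIs wikilinked in one paragraph, separated by tabs; that format and the class name are illustrative only.

```scala
import scala.io.Source

// Local illustration of the entity-entity co-occurrence statistic that the
// Pig/Hadoop job computes at scale. Assumed input: one paragraph per line,
// containing the tab-separated URIs wikilinked in that paragraph.
object EntityCooccurrences {
  def main(args: Array[String]): Unit = {
    val pairCounts = scala.collection.mutable.Map[(String, String), Int]().withDefaultValue(0)

    for (line <- Source.fromFile(args(0), "UTF-8").getLines()) {
      val uris = line.split("\t").filter(_.nonEmpty).distinct.sorted
      // "Map" step: emit every unordered pair of entities in the paragraph;
      // "Reduce" step: sum the emitted pairs into counts.
      for (i <- uris.indices; j <- (i + 1) until uris.length)
        pairCounts((uris(i), uris(j))) += 1
    }

    pairCounts.foreach { case ((a, b), c) => println(s"$a\t$b\t$c") }
  }
}
```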

Execution Time

  • Run spotting
  • Run topic classification
  • Run candidate selection
  • Run disambiguation
  • Run linking (the full pipeline is sketched below)
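
The five stages can be read as a simple function pipeline. The sketch below shows one possible wiring; the trait and type names are illustrative stand-ins, not Spotlight's actual interfaces.

```scala
// Illustrative view of the execution-time pipeline as a chain of stages.
// Trait and type names are sketches, not Spotlight's actual interfaces.
case class Spot(surfaceForm: String, offset: Int)
case class Candidate(spot: Spot, uri: String, score: Double)

trait Spotter           { def spot(text: String): Seq[Spot] }
trait TopicClassifier   { def classify(text: String): Seq[String] }
trait CandidateSelector { def select(spot: Spot): Seq[Candidate] }
trait Disambiguator     { def best(text: String, topics: Seq[String], cands: Seq[Candidate]): Candidate }
trait Linker            { def link(best: Candidate): Option[Candidate] } // None = NIL

class Pipeline(sp: Spotter, tc: TopicClassifier, cs: CandidateSelector,
               d: Disambiguator, l: Linker) {
  def annotate(text: String): Seq[Candidate] = {
    val topics = tc.classify(text)                       // run topic classification
    sp.spot(text).flatMap { spot =>                      // run spotting
      val cands = cs.select(spot)                        // run candidate selection
      if (cands.isEmpty) Seq.empty
      else l.link(d.best(text, topics, cands)).toSeq     // run disambiguation, then linking (or NIL)
    }
  }
}
```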

Project interactions

Dirk works on topic classification:

  • setup:
      • flattens the Wikipedia category hierarchy to the top 20-30 categories (topics); see the sketch after this list
      • associates every DBpedia URI with its topics
      • extracts one TSV file (or one Lucene index) per topic
      • trains the topic classifier (updateable/streaming)
  • execution:
      • topic classification: input=text, output=topics (categories)
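
The hierarchy flattening in Dirk's setup can be pictured as walking each category up its parent links until one of a fixed set of top-level topic categories is reached. The sketch below works on toy in-memory maps; in the real setup the parent links would come from the DBpedia category (SKOS) data and the topic set would be the chosen 20-30 top categories.

```scala
// Sketch of flattening the Wikipedia category hierarchy to a fixed set of
// top-level topics: walk parent links upwards until a topic category is hit.
// `parents` and `topics` are toy stand-ins for the DBpedia category data.
object FlattenCategories {
  def topicsOf(category: String,
               parents: Map[String, Set[String]],
               topics: Set[String],
               maxDepth: Int = 10): Set[String] = {
    var frontier = Set(category)
    var seen     = Set.empty[String]
    var found    = Set.empty[String]
    var depth    = 0
    while (frontier.nonEmpty && depth < maxDepth) {
      found ++= frontier.intersect(topics)        // collect any topics reached so far
      seen  ++= frontier
      frontier = frontier.flatMap(c => parents.getOrElse(c, Set.empty)) -- seen
      depth += 1
    }
    found
  }

  def main(args: Array[String]): Unit = {
    val parents = Map(
      "Category:Basketball_teams" -> Set("Category:Sports"),
      "Category:Sports"           -> Set("Category:Main_topic_classifications"))
    val topics = Set("Category:Sports", "Category:Politics")
    println(topicsOf("Category:Basketball_teams", parents, topics)) // Set(Category:Sports)
  }
}
```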

Jo works on a DB-backed core supporting the Entity-Mention generative model:

  • setup:
      • running from a jar: loading needed resources from streams instead of files
      • reorganizing the configuration
      • loading count statistics into the database (counts come from TSV files)
      • smoothing counts (e.g. truncate counts < 5, add 5, smooth at query time)
  • execution:
      • computing probabilities from the smoothed counts and using them for disambiguation (see the sketch after this list)
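
The execution step in Jo's list amounts to turning stored counts into probabilities such as P(uri | surface form). The sketch below assumes a simple add-k smoothing applied at query time, in the spirit of the "add 5" note above; the actual smoothing scheme and storage backend may differ.

```scala
// Sketch of computing P(uri | surface form) from stored counts with add-k
// smoothing at query time. The smoothing constant, the in-memory layout and
// the example numbers are illustrative.
object SmoothedProbabilities {
  // counts: surface form -> (uri -> raw count)
  def pUriGivenSf(counts: Map[String, Map[String, Long]],
                  sf: String, uri: String, k: Double = 5.0): Double = {
    val uriCounts = counts.getOrElse(sf, Map.empty)
    val vocab     = math.max(uriCounts.size, 1)
    val total     = uriCounts.values.sum.toDouble + k * vocab
    (uriCounts.getOrElse(uri, 0L) + k) / total
  }

  def main(args: Array[String]): Unit = {
    val counts = Map("Washington" ->
      Map("George_Washington" -> 120L, "Washington,_D.C." -> 300L))
    println(pUriGivenSf(counts, "Washington", "Washington,_D.C."))
  }
}
```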

Chris works on complementary vector space models of words/URIs:

  • setup:
      • extract (on Pig) one TSV file per resource type (people, organizations, etc.)
      • compute statistics for word-resource and resource-word vectors on Pig/Hadoop
  • execution:
      • explicit semantic analysis: input=text, output=vector of weighted DBpedia resources
      • disambiguation: input=text+surface_form, output=uri (using the ESA output above; see the sketch after this list)
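
The ESA-based disambiguation in Chris's list can be sketched as: map the input text to a weighted vector of DBpedia resources using word-resource association weights, then pick the candidate URI with the largest weight. The toy weight table below stands in for the word-resource statistics computed on Pig/Hadoop; names and numbers are illustrative.

```scala
// Sketch of ESA-style disambiguation: turn text into a weighted vector of
// DBpedia resources via word-resource weights, then choose the candidate URI
// with the largest weight. The weight table is a toy stand-in for the
// word-resource statistics computed on Pig/Hadoop.
object EsaSketch {
  type ResourceVector = Map[String, Double]

  // wordWeights: word -> (resource -> association weight)
  def esa(text: String, wordWeights: Map[String, ResourceVector]): ResourceVector = {
    val tokens = text.toLowerCase.split("\\W+").filter(_.nonEmpty)
    tokens.flatMap(t => wordWeights.getOrElse(t, Map.empty[String, Double]))
      .groupBy(_._1)
      .map { case (resource, ws) => resource -> ws.map(_._2).sum } // sum weights per resource
  }

  def disambiguate(text: String, candidates: Seq[String],
                   wordWeights: Map[String, ResourceVector]): Option[String] = {
    val vector = esa(text, wordWeights)
    candidates.sortBy(uri => -vector.getOrElse(uri, 0.0)).headOption
  }

  def main(args: Array[String]): Unit = {
    val weights = Map(
      "river"    -> Map("Amazon_River" -> 0.9, "Amazon_(company)" -> 0.05),
      "shopping" -> Map("Amazon_(company)" -> 0.8))
    println(disambiguate("a river in the rainforest",
      Seq("Amazon_River", "Amazon_(company)"), weights)) // Some(Amazon_River)
  }
}
```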

Hector works on collective disambiguation:

  • setup:
      • basic entity-entity statistics computed from occs.tsv
      • loads resource association statistics
      • other association statistics may come from the different strategies discussed above
      • builds (weighted) graphs of interconnections between entities
  • execution:
      • reweights candidate scores based on the other candidates in the same context (see the sketch after this list)
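
The reweighting in Hector's execution step can be sketched as mixing each candidate's local (context-based) score with the average strength of its connections, in the entity-entity graph, to the candidates of the other surface forms in the same context. The data structures and the 0.5 mixing weight below are illustrative.

```scala
// Sketch of collective disambiguation: reweight each candidate by how strongly
// it is associated (in the entity-entity graph) with the candidates of the
// other surface forms in the same context. Structures and the 0.5 mixing
// weight are illustrative.
object CollectiveReweighting {
  case class Candidate(surfaceForm: String, uri: String, score: Double)

  def reweight(candidates: Seq[Candidate],
               relatedness: Map[(String, String), Double],
               lambda: Double = 0.5): Seq[Candidate] =
    candidates.map { c =>
      val others = candidates.filter(_.surfaceForm != c.surfaceForm)
      val support =
        if (others.isEmpty) 0.0
        else others.map { o =>
          relatedness.getOrElse((c.uri, o.uri),
            relatedness.getOrElse((o.uri, c.uri), 0.0))   // treat the graph as undirected
        }.sum / others.size
      // Mix the local (context-based) score with the graph-based support.
      c.copy(score = (1 - lambda) * c.score + lambda * support)
    }

  def main(args: Array[String]): Unit = {
    val cands = Seq(
      Candidate("Washington", "George_Washington", 0.40),
      Candidate("Washington", "Washington,_D.C.", 0.35),
      Candidate("White House", "White_House", 0.90))
    val rel = Map(("Washington,_D.C.", "White_House") -> 0.8)
    reweight(cands, rel).sortBy(-_.score).foreach(println)
  }
}
```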