
Workflow


This page describes the workflows for DBpedia Spotlight, covering two major stages: training/indexing time (building the service) and execution time (running the service online). It gives a high-level description of how things work now, plus some roadmapping for GSoC 2012.

Training Time (aka Indexing Time)

Data Preprocessing

Generic flow

  • DBpedia Extraction: create resource URIs, extract properties from infoboxes, extract categories, extract redirects and disambiguations, and compute the transitive closure of redirects
  • Context Extraction: extract occurrences (paragraphs with wikilinks), articles and definitions.
  • CandidateMap Extraction: built from titles, redirects, disambiguation pages and anchor links
  • Computing Statistics: counts of URIs, surface forms, tokens, topics and their co-occurrences (see the sketch after this list)
  • Storage: load statistics into the chosen data storage
  • Training Spotter: based on the statistics above, train an algorithm that selects, from incoming text, the substrings that should be disambiguated
  • Training Disambiguator: based on the statistics above, train an algorithm that chooses the most likely URI based on surface form, context and topic
  • Training Linker: based on the statistics above, and a test run, train a linker to detect NILs and to adjust to different annotation styles
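
The "Computing Statistics" step above boils down to counting how often each surface form is linked to each URI (and, analogously, token and topic co-occurrences). A minimal local sketch is shown below; it assumes the occurrences are available as a TSV with the columns occId, uri, surfaceForm and context, which is an illustrative layout rather than the project's fixed format.

```scala
import scala.io.Source

// Minimal sketch: count (surfaceForm, uri) co-occurrences from an occurrence TSV.
// Assumed layout per line: occId <TAB> uri <TAB> surfaceForm <TAB> contextText
// (the column order is an assumption, not the project's fixed format).
object CountSfUriPairs {
  def main(args: Array[String]): Unit = {
    val counts = scala.collection.mutable.Map[(String, String), Int]().withDefaultValue(0)

    for (line <- Source.fromFile(args(0), "UTF-8").getLines()) {
      val fields = line.split("\t", -1)
      if (fields.length >= 3) {
        val uri = fields(1)
        val surfaceForm = fields(2)
        counts((surfaceForm, uri)) += 1
      }
    }

    // Emit the counts as TSV (surfaceForm, uri, count); these feed the
    // spotter and disambiguator training described in the list above.
    counts.toSeq.sortBy(-_._2).foreach { case ((sf, uri), c) =>
      println(s"$sf\t$uri\t$c")
    }
  }
}
```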

Current flow as of v0.5

  • DBpedia Extraction: single machine, with the DBpedia Extraction Framework (DEF); batch and streaming.
      • categories are not used yet.
  • Context Extraction: single machine, with ExtractOccsFromWikipedia. Batch only.
  • CandidateMap Extraction: single machine, with ExtractCandidateMap. Batch only.
  • Computing Statistics and Storage: both in one go with Lucene, based on IndexMergedOccurrences. Statistics: TF(t,uri), DF(t,uri). Batch only.
  • Training Spotter: use trained OpenNLP components, create a dictionary for lexicon-based spotters (LingPipe, etc.)
  • Training Disambiguator: no training (ranking approach, based on Lucene)
  • Training Linker: currently based on thresholding (discarding low-scored disambiguations; see the sketch after this list), trained together with EvalDisambiguationOnly. To be separated soon, for 0.6.
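
The thresholding mentioned for the Training Linker step can be pictured as follows. This is only a sketch of the idea: annotations whose disambiguation score falls below a confidence cutoff are treated as NIL and dropped. The `Annotation` type and the 0.15 cutoff are illustrative, not the project's actual classes or defaults.

```scala
// Sketch of a threshold-based linker: discard low-scored disambiguations
// (i.e. treat them as NIL). Types and the 0.15 cutoff are illustrative only.
case class Annotation(surfaceForm: String, uri: String, score: Double)

object ThresholdLinker {
  // Keep only annotations whose disambiguation score clears the threshold;
  // everything below it is considered NIL and removed from the output.
  def link(candidates: Seq[Annotation], confidence: Double = 0.15): Seq[Annotation] =
    candidates.filter(_.score >= confidence)

  def main(args: Array[String]): Unit = {
    val example = Seq(
      Annotation("Washington", "George_Washington", 0.72),
      Annotation("Washington", "Washington_(state)", 0.08))
    link(example).foreach(println)   // only the first annotation survives
  }
}
```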

New flow for GSoC 2012:

  • DBpedia Extraction: same (being improved separately from our projects)
      • Flattening the category hierarchy (Dirk)
  • Context Extraction: map-reduce via Pig/Hadoop (Max/Chris)
  • CandidateMap Extraction: map-reduce via Pig/Hadoop (Max/Chris)
  • Computing Statistics:
      • co-occurrences of (sf, uri, context): map-reduce via Pig/Hadoop (Chris)
      • co-occurrences of category x (sf, uri, context, cat): map-reduce via Pig/Hadoop (Dirk)
      • co-occurrences of entities: map-reduce via Pig/Hadoop (Hector); see the sketch after this list
  • Storage: JDBM3, Mem (Jo), Lucene (Pablo,Max), Other?
  • Training Spotter: Train new spotter based on stats computed here
  • Training Disambiguator: train disambiguators based on probabilities computed above (Pablo and Jo)
  • Training Linker: find good decision functions (Pablo)
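
The Pig/Hadoop jobs themselves are not reproduced here, but the entity co-occurrence statistic they compute can be illustrated locally. The sketch below assumes each input line lists the URIs wikilinked in one paragraph, separated by tabs; that format and the class name are illustrative only.

```scala
import scala.io.Source

// Local illustration of the entity-entity co-occurrence statistic that the
// Pig/Hadoop job computes at scale. Assumed input: one paragraph per line,
// containing the tab-separated URIs wikilinked in that paragraph.
object EntityCooccurrences {
  def main(args: Array[String]): Unit = {
    val pairCounts = scala.collection.mutable.Map[(String, String), Int]().withDefaultValue(0)

    for (line <- Source.fromFile(args(0), "UTF-8").getLines()) {
      val uris = line.split("\t").filter(_.nonEmpty).distinct.sorted
      // "Map" step: emit every unordered pair of entities in the paragraph;
      // "Reduce" step: sum the emitted pairs into counts.
      for (i <- uris.indices; j <- (i + 1) until uris.length)
        pairCounts((uris(i), uris(j))) += 1
    }

    pairCounts.foreach { case ((a, b), c) => println(s"$a\t$b\t$c") }
  }
}
```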

Execution Time

  • Run spotting
  • Run topic classification
  • Run candidate selection
  • Run disambiguation
  • Run linking (the full pipeline is sketched below)
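
The five stages can be read as a simple function pipeline. The sketch below shows one possible wiring; the trait and type names are illustrative stand-ins, not Spotlight's actual interfaces.

```scala
// Illustrative view of the execution-time pipeline as a chain of stages.
// Trait and type names are sketches, not Spotlight's actual interfaces.
case class Spot(surfaceForm: String, offset: Int)
case class Candidate(spot: Spot, uri: String, score: Double)

trait Spotter           { def spot(text: String): Seq[Spot] }
trait TopicClassifier   { def classify(text: String): Seq[String] }
trait CandidateSelector { def select(spot: Spot): Seq[Candidate] }
trait Disambiguator     { def best(text: String, topics: Seq[String], cands: Seq[Candidate]): Candidate }
trait Linker            { def link(best: Candidate): Option[Candidate] } // None = NIL

class Pipeline(sp: Spotter, tc: TopicClassifier, cs: CandidateSelector,
               d: Disambiguator, l: Linker) {
  def annotate(text: String): Seq[Candidate] = {
    val topics = tc.classify(text)                       // run topic classification
    sp.spot(text).flatMap { spot =>                      // run spotting
      val cands = cs.select(spot)                        // run candidate selection
      if (cands.isEmpty) Seq.empty
      else l.link(d.best(text, topics, cands)).toSeq     // run disambiguation, then linking (or NIL)
    }
  }
}
```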

Project interactions

Dirk works on topic classification:

  • setup:
      • flattens the Wikipedia category hierarchy to the top 20-30 categories (topics); see the sketch after this list
      • associates every DBpedia URI with its topics
      • extracts one TSV file (or one Lucene index) per topic
      • trains the topic classifier (updateable/streaming)
  • execution:
      • topic classification: input=text, output=topics (categories)
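
The hierarchy flattening in Dirk's setup can be pictured as walking each category up its parent links until one of a fixed set of top-level topic categories is reached. The sketch below works on toy in-memory maps; in the real setup the parent links would come from the DBpedia category (SKOS) data and the topic set would be the chosen 20-30 top categories.

```scala
// Sketch of flattening the Wikipedia category hierarchy to a fixed set of
// top-level topics: walk parent links upwards until a topic category is hit.
// `parents` and `topics` are toy stand-ins for the DBpedia category data.
object FlattenCategories {
  def topicsOf(category: String,
               parents: Map[String, Set[String]],
               topics: Set[String],
               maxDepth: Int = 10): Set[String] = {
    var frontier = Set(category)
    var seen     = Set.empty[String]
    var found    = Set.empty[String]
    var depth    = 0
    while (frontier.nonEmpty && depth < maxDepth) {
      found ++= frontier.intersect(topics)        // collect any topics reached so far
      seen  ++= frontier
      frontier = frontier.flatMap(c => parents.getOrElse(c, Set.empty)) -- seen
      depth += 1
    }
    found
  }

  def main(args: Array[String]): Unit = {
    val parents = Map(
      "Category:Basketball_teams" -> Set("Category:Sports"),
      "Category:Sports"           -> Set("Category:Main_topic_classifications"))
    val topics = Set("Category:Sports", "Category:Politics")
    println(topicsOf("Category:Basketball_teams", parents, topics)) // Set(Category:Sports)
  }
}
```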

Jo works on a DB-backed core supporting the Entity-Mention generative model:

  • setup:
      • running from a jar: loading needed resources from streams instead of files
      • reorganizing the configuration
      • loading count statistics into the database (counts come from TSV files)
      • smoothing counts (e.g. truncate counts < 5, add 5, smooth at query time)
  • execution:
      • computing probabilities from the smoothed counts and using them for disambiguation (see the sketch after this list)
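
The execution step in Jo's list amounts to turning stored counts into probabilities such as P(uri | surface form). The sketch below assumes a simple add-k smoothing applied at query time, in the spirit of the "add 5" note above; the actual smoothing scheme and storage backend may differ.

```scala
// Sketch of computing P(uri | surface form) from stored counts with add-k
// smoothing at query time. The smoothing constant, the in-memory layout and
// the example numbers are illustrative.
object SmoothedProbabilities {
  // counts: surface form -> (uri -> raw count)
  def pUriGivenSf(counts: Map[String, Map[String, Long]],
                  sf: String, uri: String, k: Double = 5.0): Double = {
    val uriCounts = counts.getOrElse(sf, Map.empty)
    val vocab     = math.max(uriCounts.size, 1)
    val total     = uriCounts.values.sum.toDouble + k * vocab
    (uriCounts.getOrElse(uri, 0L) + k) / total
  }

  def main(args: Array[String]): Unit = {
    val counts = Map("Washington" ->
      Map("George_Washington" -> 120L, "Washington,_D.C." -> 300L))
    println(pUriGivenSf(counts, "Washington", "Washington,_D.C."))
  }
}
```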

Chris works on complementary vector space models of words/URIs:

  • setup:
      • extract (on Pig) one TSV file per resource type (people, organizations, etc.)
      • compute statistics for word-resource and resource-word vectors on Pig/Hadoop
  • execution:
      • explicit semantic analysis: input=text, output=vector of weighted DBpedia resources
      • disambiguation: input=text+surface_form, output=uri (using the ESA output above; see the sketch after this list)
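
The ESA-based disambiguation in Chris's list can be sketched as: map the input text to a weighted vector of DBpedia resources using word-resource association weights, then pick the candidate URI with the largest weight. The toy weight table below stands in for the word-resource statistics computed on Pig/Hadoop; names and numbers are illustrative.

```scala
// Sketch of ESA-style disambiguation: turn text into a weighted vector of
// DBpedia resources via word-resource weights, then choose the candidate URI
// with the largest weight. The weight table is a toy stand-in for the
// word-resource statistics computed on Pig/Hadoop.
object EsaSketch {
  type ResourceVector = Map[String, Double]

  // wordWeights: word -> (resource -> association weight)
  def esa(text: String, wordWeights: Map[String, ResourceVector]): ResourceVector = {
    val tokens = text.toLowerCase.split("\\W+").filter(_.nonEmpty)
    tokens.flatMap(t => wordWeights.getOrElse(t, Map.empty[String, Double]))
      .groupBy(_._1)
      .map { case (resource, ws) => resource -> ws.map(_._2).sum } // sum weights per resource
  }

  def disambiguate(text: String, candidates: Seq[String],
                   wordWeights: Map[String, ResourceVector]): Option[String] = {
    val vector = esa(text, wordWeights)
    candidates.sortBy(uri => -vector.getOrElse(uri, 0.0)).headOption
  }

  def main(args: Array[String]): Unit = {
    val weights = Map(
      "river"    -> Map("Amazon_River" -> 0.9, "Amazon_(company)" -> 0.05),
      "shopping" -> Map("Amazon_(company)" -> 0.8))
    println(disambiguate("a river in the rainforest",
      Seq("Amazon_River", "Amazon_(company)"), weights)) // Some(Amazon_River)
  }
}
```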

Hector works on collective disambiguation:

  • setup:
      • basic entity-entity statistics computed from occs.tsv
      • loads resource association statistics
      • other association statistics may come from the different strategies discussed above
      • builds (weighted) graphs of interconnections between entities
  • execution:
      • reweights candidate scores based on the other candidates in the same context (see the sketch after this list)
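
The reweighting in Hector's execution step can be sketched as mixing each candidate's local (context-based) score with the average strength of its connections, in the entity-entity graph, to the candidates of the other surface forms in the same context. The data structures and the 0.5 mixing weight below are illustrative.

```scala
// Sketch of collective disambiguation: reweight each candidate by how strongly
// it is associated (in the entity-entity graph) with the candidates of the
// other surface forms in the same context. Structures and the 0.5 mixing
// weight are illustrative.
object CollectiveReweighting {
  case class Candidate(surfaceForm: String, uri: String, score: Double)

  def reweight(candidates: Seq[Candidate],
               relatedness: Map[(String, String), Double],
               lambda: Double = 0.5): Seq[Candidate] =
    candidates.map { c =>
      val others = candidates.filter(_.surfaceForm != c.surfaceForm)
      val support =
        if (others.isEmpty) 0.0
        else others.map { o =>
          relatedness.getOrElse((c.uri, o.uri),
            relatedness.getOrElse((o.uri, c.uri), 0.0))   // treat the graph as undirected
        }.sum / others.size
      // Mix the local (context-based) score with the graph-based support.
      c.copy(score = (1 - lambda) * c.score + lambda * support)
    }

  def main(args: Array[String]): Unit = {
    val cands = Seq(
      Candidate("Washington", "George_Washington", 0.40),
      Candidate("Washington", "Washington,_D.C.", 0.35),
      Candidate("White House", "White_House", 0.90))
    val rel = Map(("Washington,_D.C.", "White_House") -> 0.8)
    reweight(cands, rel).sortBy(-_.score).foreach(println)
  }
}
```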