
Internationalization (Lucene backed core)


Authors:

See more at https://github.com/dbpedia-spotlight/lucene-quickstarter

Summary

You only need to add your language-specific config to download.sh and indexing.properties, close your eyes, execute the commands and hope for the best.

In one shot:

git clone https://github.com/dbpedia-spotlight/dbpedia-spotlight.git
cd dbpedia-spotlight/
mvn install
cd bin/
./download.sh
./index.sh

The script download.sh will download all necessary files. By default all files will be downloaded to /usr/local/spotlight. The default language is Portuguese. You should edit your copy of download.sh for your language and chosen destination. Edit at least:

export lang_i18n=ca
export language=catalan

The script index.sh again assumes Portuguese and /usr/local/spotlight, and conducts the indexing process automatically in one shot. Make sure to edit conf/indexing.properties for your language and chosen destination. All properties MUST consistently point to the same language:

# Language-specific config
# --------------
org.dbpedia.spotlight.language = Catalan
org.dbpedia.spotlight.lucene.analyzer = org.apache.lucene.analysis.ca.CatalanAnalyzer
...
# Internationalization (i18n) support -- work in progress
org.dbpedia.spotlight.default_namespace = http://ca.dbpedia.org/resource/
# Stop word list
org.dbpedia.spotlight.data.stopWords.catalan = /var/local/spotlight/dbpedia_data/stopwords.ca.list
...
# URI patterns that should not be indexed. e.g. List_of_*
org.dbpedia.spotlight.data.badURIs.catalan = /var/local/spotlight/dbpedia_data/blacklistedURIPatterns.ca.list
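Before kicking off the indexing, it can help to eyeball all of the language-sensitive properties in one pass. A minimal sketch, assuming you run it from the directory that contains conf/indexing.properties:

grep -E 'language|analyzer|namespace|stopWords|badURIs' conf/indexing.properties

Every value printed should name the same language.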

In the remainder of this guide we will describe the process step-by-step.

## 0. Requirements

  • You’ll need at least DBpedia Spotlight version 0.6.5.
  • For this part, Java 1.7 is required. (Contrary to the rest of Spotlight, for which Java 1.6 is sufficient.)

## 1. Obtaining Data

Before you start building indexes and training the spotters, you need to obtain the following files for your target language.

0 - Let's use the variable $lang to represent your i18n language code -- e.g. "en" for English, "pt" for Portuguese, etc.

Example command line:

export lang=en

1 - Get latest DBpedia files containing labels, redirects, disambiguations and instance_types from http://downloads.dbpedia.org/

wget http://downloads.dbpedia.org/3.8/$lang/labels_$lang.nt.bz2
wget http://downloads.dbpedia.org/3.8/$lang/redirects_$lang.nt.bz2
wget http://downloads.dbpedia.org/3.8/$lang/disambiguations_$lang.nt.bz2
wget http://downloads.dbpedia.org/3.8/$lang/instance_types_$lang.nt.bz2
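Equivalently, a minimal loop (assuming $lang is set as in step 0 and bzip2 is installed; the bunzip2 step is optional if your index.sh reads the compressed files directly):

for f in labels redirects disambiguations instance_types; do
  wget "http://downloads.dbpedia.org/3.8/$lang/${f}_$lang.nt.bz2"
  bunzip2 -k "${f}_$lang.nt.bz2"   # -k keeps the downloaded archive
done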

2 - Get the latest Wikipedia dump (XXwiki-latest-pages-articles.xml.bz2 file). Link: http://en.wikipedia.org/wiki/Wikipedia:Database_download

wget "http://dumps.wikimedia.org/"$lang"wiki/latest/"$lang"wiki-latest-pages-articles.xml.bz2"

3 - A dictionary (already built) for LingPipe Spotter. This file will later be replaced by a new one generated in your language.

wget http://dbp-spotlight.svn.sourceforge.net/viewvc/dbp-spotlight/tags/release-0.5/dist/src/deb/control/data/usr/share/dbpedia-spotlight/spotter.dict

4 - A small index (also already built) that will also later be replaced.

wget http://dbp-spotlight.svn.sourceforge.net/viewvc/dbp-spotlight/tags/release-0.5/dist/src/deb/control/data/usr/share/dbpedia-spotlight/index.tgz

5 - A Hidden Markov Model file used for POS tagging.

wget http://dbp-spotlight.svn.sourceforge.net/viewvc/dbp-spotlight/tags/release-0.5/dist/src/deb/control/data/usr/share/dbpedia-spotlight/pos-en-general-brown.HiddenMarkovModel

6 - Models for enabling the CoOccurrenceBasedSelector (English only); this step can be skipped if you remove CoOccurrenceBasedSelector from your config file. Link: http://spotlight.dbpedia.org/download/release-0.5/spot_selector.tgz This spotter enables the "no common words" strategy and is geared to the English language only. It could be retrained for another language by following the instructions in Jo Daiber's thesis.

7 - Apache OpenNLP models for tokenization, sentence splitting, noun phrase chunking and named entity recognition, from http://opennlp.sourceforge.net/models-1.5. These models are necessary for OpenNLP-based spotters such as NESpotter, OpenNLPChunkerSpotter, etc.

wget http://opennlp.sourceforge.net/models-1.5/$lang-chunker.bin
wget http://opennlp.sourceforge.net/models-1.5/$lang-ner-location.bin
wget http://opennlp.sourceforge.net/models-1.5/$lang-ner-organization.bin
wget http://opennlp.sourceforge.net/models-1.5/$lang-ner-person.bin
wget http://opennlp.sourceforge.net/models-1.5/$lang-pos-maxent.bin
wget http://opennlp.sourceforge.net/models-1.5/$lang-sent.bin
wget http://opennlp.sourceforge.net/models-1.5/$lang-token.bin

Important note: not all of the files above will be available in your language. We do not cover in this tutorial how to build OpenNLP models for your language; you will need to consult the OpenNLP manuals for that. To get started, you can download the equivalent English files and rename them to your language code. E.g., if es-token.bin is not available, download en-token.bin and rename it to es-token.bin (link: http://opennlp.sourceforge.net/models-1.5/). This will allow the system to start, but naturally the results for OpenNLP-based spotters will not be good. To use the OpenNLP-based spotters and obtain good results, you need models trained for your language.
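A minimal sketch of that English fallback, assuming $lang is set as in step 0 and covering only the model types named above:

# Fetch the English model under the target-language name wherever the
# $lang model is missing. The server will start, but the OpenNLP-based
# spotters will perform poorly with these stand-ins.
for m in chunker ner-location ner-organization ner-person pos-maxent sent token; do
  if [ ! -f "$lang-$m.bin" ]; then
    wget -O "$lang-$m.bin" "http://opennlp.sourceforge.net/models-1.5/en-$m.bin"
  fi
done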

8 - Finally, get a stopword file (stopwords.txt). The file should contain one word per line.

You can use the files available from the Snowball project as starting points: http://svn.tartarus.org/snowball/trunk/website/algorithms/
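Snowball stop-word files put comments after a "|" character, so they need a little cleanup before they can serve as a one-word-per-line list. A sketch for Spanish; the exact stop.txt URL below is an assumption, so browse the algorithms page above for your language's file:

# Strip "|" comments, carriage returns and blank lines, keeping one word per line
wget -O snowball_stop.txt \
  "http://svn.tartarus.org/snowball/trunk/website/algorithms/spanish/stop.txt"
sed 's/|.*//' snowball_stop.txt | tr -d '\r' | awk 'NF { print $1 }' > stopwords.es.list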

## 2. Configuring Language-specific Tokenization

According to the book Lucene in Action:

> Analysis, in Lucene, is the process of converting field text into its most fundamental indexed representation, terms. These terms are used to determine what documents match a query during searching. For example, if you indexed this sentence in a field, the terms might start with for and example, and so on, as separate terms in sequence. An analyzer is an encapsulation of the analysis process. An analyzer tokenizes text by performing any number of operations on it, which could include extracting words, discarding punctuation, removing accents from characters, lowercasing (also called normalizing), removing common words, reducing words to a root form (stemming), or changing words into the basic form (lemmatization). This process is also called tokenization, and the chunks of text pulled from a stream of text are called tokens. Tokens, combined with their associated field name, are terms.

DBpedia Spotlight uses Lucene's analyzers to support index building and search operations. Therefore you need to choose an appropriate analyzer and set it in the org.dbpedia.spotlight.lucene.analyzer property in both the indexing and the server configuration files. Lucene 3.6.0 ships language-specific analyzers in the org.apache.lucene.analysis.* packages (e.g. org.apache.lucene.analysis.ca.CatalanAnalyzer, used for Catalan above).

LingPipeSpotter also needs tokenization. By default we use IndoEuropeanTokenizer. If your language cannot be tokenized in that way, you will need to find a tokenizer and wrap it much like we did for JAnnotationTokenizerFactory.

## 3. Building indexes

We are ready to run the script available at https://github.com/dbpedia-spotlight/dbpedia-spotlight/blob/master/bin/index.sh (with some modifications). Documentation is available inside the script as step-by-step comments.

Training the Spotters

Before using a spotter, you need to train it. For the LingPipeSpotter, it is just a matter of building a spotter dictionary. For that, you can run:

mvn scala:run -DmainClass=org.dbpedia.spotlight.spot.lingpipe.IndexLingPipeSpotter "-DaddArgs=$INDEX_CONFIG_FILE"

It will read the file output/surfaceForms.tsv (or whatever TSV file is specified in the org.dbpedia.spotlight.data.surfaceForms parameter) and produce a dictionary called output/surfaceForms.tsv.spotterDictionary. Copy it to /usr/local/spotlight/dbpedia_data/es/data/spotter.dict, replacing the bootstrap spotter.dict downloaded earlier.
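For the Spanish layout used in section 5, that copy is simply:

cp output/surfaceForms.tsv.spotterDictionary \
   /usr/local/spotlight/dbpedia_data/es/data/spotter.dict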

## 4. Configuring the server

After executing all of the tasks above, you can run the server [1] to check the results. At this point the last task is to configure server.properties, available in the conf folder of the DBpedia Spotlight distribution.

Here is an example for the folder/file structure and language (Spanish) described below.

This file was saved as server.es.properties in /usr/local/spotlight/dbpedia_data/es.

[1] - mvn scala:run -DmainClass=org.dbpedia.spotlight.web.rest.Server "-DaddArgs=/usr/local/spotlight/dbpedia_data/es/server.es.properties"
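Once the server is up, a quick smoke test with curl (this assumes the REST endpoint configured in server.es.properties is the usual http://localhost:2222/rest; adjust the port to your config):

curl -s "http://localhost:2222/rest/annotate" \
  --data-urlencode "text=Madrid es la capital de España." \
  --data-urlencode "confidence=0.2" \
  -H "Accept: application/json"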

## 5. An Example with Spanish

Note: the folder organization below is only a suggestion; you can customize it according to your operating system and personal preferences. As an example, let's run indexing in Spanish using Linux.

After getting all the files, create the folder /usr/local/spotlight/dbpedia_data and the subfolders shown below:

mkdir /usr/local/spotlight/dbpedia_data
mkdir /usr/local/spotlight/dbpedia_data/es   
mkdir /usr/local/spotlight/dbpedia_data/es/data
mkdir /usr/local/spotlight/dbpedia_data/es/data/opennlp
mkdir /usr/local/spotlight/dbpedia_data/es/data/opennlp/spanish 
mkdir /usr/local/spotlight/dbpedia_data/es/data/output
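Equivalently, in one command (mkdir -p creates the missing parents, so the two deepest paths are enough):

mkdir -p /usr/local/spotlight/dbpedia_data/es/data/opennlp/spanish \
         /usr/local/spotlight/dbpedia_data/es/data/output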

Make a copy of the following files in the folder /usr/local/spotlight/dbpedia_data/es/data:

  • eswiki-latest-pages-articles.xml
  • pos-en-general-brown.HiddenMarkovModel
  • spotter.dict
  • stopwords.es.list

Create a new file called blacklistedURIPatterns.es.list with one regex pattern per line; these patterns guide DBpedia Spotlight to remove pages and articles that should not be considered valid resources for annotation (e.g. disambiguation pages, lists, and optionally anything else you'd like to exclude: pure numbers, single letters, etc.). In Catalan, the patterns would look like this:

.+\([Dd]esambiguació\)$
Llista_d'.+
Llista_de.+
^[0-9]+$
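A quick way to sanity-check the patterns is to run a few sample titles through grep: the first three titles below should match (and thus be excluded from indexing), while Barcelona should survive. The sample titles are made up for illustration:

printf '%s\n' "Àbac_(desambiguació)" "Llista_de_rius" "1984" "Barcelona" \
  | grep -E -f blacklistedURIPatterns.es.list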

Create a folder to store the DBpedia files (labels_XX.nt.bz2, redirects_XX.nt.bz2, disambiguations_XX.nt.bz2, instance_types_XX.nt.bz2). We suggest putting this new folder under /usr/local/spotlight/dbpedia_data/es/data and naming it after the DBpedia version.

E.g: /usr/local/spotlight/dbpedia_data/es/data/3.8

Unpack the files index.tgz and spot_selector.tgz inside the folder /usr/local/spotlight/dbpedia_data/es/data.

Copy the OpenNLP model files listed above (downloaded from http://opennlp.sourceforge.net/models-1.5/) to the folder /usr/local/spotlight/dbpedia_data/es/data/opennlp/spanish.

After the above steps, your folder/files structure should be similar to that shown below:

![dbpedia folder](http://dl.dropbox.com/u/99877231/dbpedia/dbpdia_folder.png)

TODO: Upload this image to GitHub.

Get the indexing.properties file from the DBpedia Spotlight conf folder and set the properties for your target language and folder/file structure. Here is an example of this file for Spanish and the structure above: the file was saved as indexing.es.properties in the /usr/local/spotlight/dbpedia_data/es folder.

Inspecting output files

When you run index.sh, several files will be produced in the output directory.

TSV files

$ ll /var/local/spotlight/dbpedia_data/data/output/
-rw-r--r-- 1 conceptURIs.list
-rw-r--r-- 1 conceptURIs.list.NOT
-rw-r--r-- 1 occs.tsv
-rw-r--r-- 1 redirects_tc.tsv
-rw-r--r-- 1 surfaceForms.tsv

The size of the files varies from language to language. None of them are expected to be empty.
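A hedged sketch for checking that, using the output path from this guide:

# Flag any expected output file that is missing or empty
cd /var/local/spotlight/dbpedia_data/data/output/
for f in conceptURIs.list conceptURIs.list.NOT occs.tsv redirects_tc.tsv surfaceForms.tsv; do
  [ -s "$f" ] || echo "EMPTY or missing: $f"
done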

The conceptURIs.list file contains a list of URIs that we consider "valid", that is, those that denote entities and concepts (excluding redirects, disambiguation pages, etc.).

$ head /var/local/spotlight/dbpedia_data/data/output/conceptURIs.list
Abadia
Adam
Addicció
Abril

The conceptURIs.list.NOT file contains the complement of conceptURIs.list, i.e. a list of invalid URIs (redirects, disambiguation pages, etc.) which were detected with the help of the blacklisted URI patterns file.

$ head /var/local/spotlight/dbpedia_data/data/output/conceptURIs.list.NOT
Abadessa
Addicte
Aixecament_de_pesos
Anys_90
Amphibia
Alguer

The surfaceForms.tsv file contains a candidate map from surface form (a phrase) to URI (a unique identifier). Each line should have 2 tab-separated fields.

$ head /var/local/spotlight/dbpedia_data/data/output/surfaceForms.tsv
Abadia  Abadia
Adam    Adam
Addicció        Addicció
Abril   Abril
Adagi   Adagi
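To verify the two-fields-per-line invariant, a small awk sketch (using the output path from this guide):

# Report any line that does not have exactly two tab-separated fields
awk -F'\t' 'NF != 2 { print NR ": " $0 }' \
  /var/local/spotlight/dbpedia_data/data/output/surfaceForms.tsv | head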

The redirects_tc.tsv file contains a URI map from "alias" URI to valid URI; the _tc suffix indicates that chains of redirects have been transitively closed. Each line should have 2 tab-separated fields.

$ head /var/local/spotlight/dbpedia_data/data/output/redirects_tc.tsv
Martin_van_Buren        Martin_Van_Buren
Abarkubadh      Abarqobad
Sepia_subulata  Calamarsó
Sishu   Quatre_Llibres
Santa_Maria_de_Montserrat       Monestir_de_Montserrat

The occs.tsv file contains a tab-separated collection of occurrences in Wikipedia of the URIs in conceptURIs.list. Each occurrence is roughly the size of a paragraph and contains the fields <occId, URI, surfaceForm, text, offset>:

  1. occId: unique identifier of an occurrence (e.g. paragraph identifier)
  2. DBpedia URI (without namespace prefix)
  3. surface form: the name used to refer to the DBpedia URI
  4. context: a piece of text (e.g. paragraph) that contains a mention to the DBpedia URI
  5. offset: position of surface form in context (in number of characters)

Showing two example occurrences:

$ head /var/local/spotlight/dbpedia_data/data/output/occs.tsv

%C3%80bac-p1l1  Decimal decimal  marcant el nombre 37.925 Un àbac (del llatí abăcus, i grec άβαξ-ακος, que significa "taula") és una eina per al càlcul manual d'operacions aritmètiques, que consisteix en un marc amb filferros paral·lels per on es fan córrer boles. S'hi poden representar nombres enters o decimals. Per a representar un nombre es fa servir la base decimal on cada fil de boles representa les unitats, desenes, centenes, etcètera.   332

%C3%80bac-p1l3  Nombre_decimal  decimals         marcant el nombre 37.925 Un àbac (del llatí abăcus, i grec άβαξ-ακος, que significa "taula") és una eina per al càlcul manual d'operacions aritmètiques, que consisteix en un marc amb filferros paral·lels per on es fan córrer boles. S'hi poden representar nombres enters o decimals. Per a representar un nombre es fa servir la base decimal on cada fil de boles representa les unitats, desenes, centenes, etcètera.   273
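A handy sanity check on occs.tsv is to count how many occurrences were extracted per URI; well-known entities should come out near the top. A sketch using the output path from this guide:

# Count occurrences per URI (field 2) and show the ten most frequent
cut -f2 /var/local/spotlight/dbpedia_data/data/output/occs.tsv \
  | sort | uniq -c | sort -rn | head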

Binary files

  • Lucene index should be inspected using Luke

When you open your index in Luke, the Overview tab summarizes the index's fields and terms. If you change to the Documents tab, the stored documents will look different depending on whether your index has already been compressed or not.

  • surfaceForms.spotterDictionary — there is currently no tool for inspecting this file (a class is still needed for it)

## 6. Further Information

Please feel free to subscribe to our mailing list dbp-spotlight-users (cc. Sandro) and ask questions there if you have found errors or need more information about this guide.


Finally, if you plan to change the source code, please consider sending us a pull request so that you can be acknowledged as an awesome contributor! It will also help other people trying to do the same, and give them the chance to find things that you haven't found yet.
