A simple HowTo for collective disambiguation

This page describes the workflow for running the collective disambiguation module on this [branch](https://github.com/hunterhector/dbpedia-spotlight). Please note that the module is currently experimental and subject to change. The module was tested on ARCH-3.3.8.1. The Pig scripts were only tested in pseudo-distributed mode.

##Prerequisites (based on the testing environment)

The tested environment used the following settings:

JDK 1.7.0
Scala 2.9.2
Apache Maven 2.2.1

Other Maven versions may not resolve dependencies successfully; switch to Maven 2.2.1 if you encounter problems.

Apache Hadoop 1.0.2 (see its setup guide)
- Running Pig scripts in local mode does not require a standalone Hadoop installation, because Pig uses a bundled version of Hadoop
Apache Pig 0.10.0 (see its setup guide)
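
Before building, it can help to confirm that the tools listed above are on the PATH. The following is a minimal sketch and not part of the project: `have_tool` is a hypothetical helper, and the script only reports whether each command is present, so the versions still need to be compared by hand against the list above.

```shell
#!/bin/sh
# have_tool NAME: succeed if NAME resolves to a command on the PATH.
# Hypothetical helper for illustration; not shipped with DBpedia Spotlight.
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

# Report each prerequisite; hadoop is only needed outside Pig local mode.
for tool in java scala mvn pig hadoop; do
  if have_tool "$tool"; then
    echo "found:   $tool"
  else
    echo "missing: $tool"
  fi
done
```

Each tool's own version command (for example `java -version` or `mvn -version`) then shows whether the installed version matches the tested one.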

##Installation guide

Clone the master branch of https://github.com/hunterhector/dbpedia-spotlight:

git clone https://github.com/hunterhector/dbpedia-spotlight

Follow the installation guide to set up DBpedia Spotlight.
Edit the properties file in conf/properties so that it points to the correct files.
For the graph.properties configuration, you need to:
- Set org.dbpedia.spotlight.graph.dir to the directory in which all generated graph files will be stored
- Point org.dbpedia.spotlight.graph.occsrc to the occs.tsv generated by org.dbpedia.spotlight.lucene.index.ExtractOccsFromWikipedia
- Point org.dbpedia.spotlight.graph.cooccsrc to the co-occs-count.tsv generated by index/src/main/pig/CooccurrencesCount.pig (discussed later)
- Set org.dbpedia.spotlight.graph.offline to true to make the program read the graph from disk, which consumes less memory but makes the process slower
- The other properties are all paths relative to the root directory org.dbpedia.spotlight.graph.dir. The simplest approach is to leave them as they are.
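
A quick way to catch a mistyped path is to read the values back from graph.properties and check that they exist. The sketch below assumes plain `key = value` lines with no line continuations; `get_prop` is a hypothetical helper, and the default location conf/graph.properties is an assumption that can be overridden through the `PROPS` variable.

```shell
#!/bin/sh
# get_prop KEY FILE: print the value of KEY from a Java-style properties file.
# Assumes simple "key = value" lines; if KEY occurs more than once, the last one wins.
get_prop() {
  awk -v key="$1" '
    {
      pos = index($0, "=")
      if (pos == 0) next                      # not a key=value line
      k = substr($0, 1, pos - 1)
      v = substr($0, pos + 1)
      gsub(/^[ \t]+|[ \t]+$/, "", k)          # trim whitespace around the key
      gsub(/^[ \t]+|[ \t]+$/, "", v)          # trim whitespace around the value
      if (k == key) val = v
    }
    END { print val }
  ' "$2"
}

# Assumed location of the properties file; adjust to your checkout.
PROPS="${PROPS:-conf/graph.properties}"
if [ -f "$PROPS" ]; then
  graph_dir=$(get_prop org.dbpedia.spotlight.graph.dir "$PROPS")
  occs=$(get_prop org.dbpedia.spotlight.graph.occsrc "$PROPS")
  cooccs=$(get_prop org.dbpedia.spotlight.graph.cooccsrc "$PROPS")
  [ -d "$graph_dir" ] || echo "graph dir not found: $graph_dir"
  [ -f "$occs" ]      || echo "occs file not found: $occs"
  [ -f "$cooccs" ]    || echo "co-occs file not found: $cooccs"
fi
```

The script prints nothing when all three paths resolve, and one line per missing path otherwise.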

##Run Disambiguation Only

To run only the disambiguation task, download the following two files and uncompress them. Remember to edit the properties to point to the location of these files.

- uriMap.tsv (45.8 MB, 110.5 MB uncompressed)
- semantic graph folder (168 MB, 205.7 MB uncompressed) (all files should be under the same folder)

You can then skip the Graph generation section and run the disambiguator as described in the Disambiguation section.

##Graph generation

Before running any of these launchers, check that the launcher's argument in the pom.xml file points to the correct properties file, e.g. <arg>../conf/graph.properties</arg> (I created different properties files to generate graphs for different languages).
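
To see which properties file each launcher currently receives, the pom.xml files can be searched for `<arg>` entries ending in `.properties`. This is a minimal sketch using standard find and grep; `list_properties_args` is a hypothetical helper name, and it assumes each `<arg>` element sits on a single line, as in the example above.

```shell
#!/bin/sh
# list_properties_args DIR: show every <arg>...properties</arg> line in pom.xml files under DIR.
# /dev/null is passed as an extra file so grep always prints the file name.
list_properties_args() {
  find "$1" -name pom.xml -exec grep -n '<arg>.*\.properties</arg>' /dev/null {} +
}

# Run from the root of the checkout; grep exits non-zero when nothing matches.
list_properties_args . || true
```

Each output line has the form `path/to/pom.xml:LINE:<arg>...</arg>`, which makes it easy to spot a launcher that still points at a properties file for another language.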

You can also download the full set of generated graphs (download size: 1.5 GB; uncompressed size: 3.8 GB) and the co-occurrences file (download size: 295.5 MB; uncompressed size: 772.3 MB) to run experiments starting from any point in the process.

To shorten the running time, the current implementation finishes all basic graph and semantic calculations at indexing time. A few steps need to be run to generate all the files.

Get code and compile

git clone https://github.com/caizhiwei/pignlproc.git
cd pignlproc
mvn assembly:assembly -Dmaven.test.skip=true

Edit the token_counts.pig.params file to point to your files (they can be in HDFS or on a local path; for a local path, use pig -x local in the following commands).
Get occurrence counts and co-occurrence counts

pig -m token_counts.pig.params occs.pig
pig -m token_counts.pig.params co-occs.pig
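
The only difference between the HDFS and local-path cases is the `-x local` flag. The sketch below prints the two invocations for a chosen mode without running them; `pig_count_commands` is a hypothetical helper, and it assumes the scripts and the params file are in the current directory, as in the commands above.

```shell
#!/bin/sh
# pig_count_commands MODE: print (dry run) the two Pig invocations.
# MODE "local" adds "-x local"; any other value keeps Pig's default execution mode.
pig_count_commands() {
  mode="$1"
  for script in occs.pig co-occs.pig; do
    if [ "$mode" = "local" ]; then
      echo "pig -x local -m token_counts.pig.params $script"
    else
      echo "pig -m token_counts.pig.params $script"
    fi
  done
}

pig_count_commands local
```

Piping the output to `sh` (for example `pig_count_commands local | sh`) runs the printed commands in order.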

Make the graph

cd dbpedia-spotlight
mvn scala:run -DmainClass=org.dbpedia.spotlight.graph.GraphMaker "-DaddArgs=conf/graph.properties"

The graph generation process is now complete, and you can start using the disambiguator.

##Disambiguation

Currently GraphBaseDisambiguator is integrated into the Core module; a usage example can be seen in EvaluateSpotlightModel.scala#l66

##More Downloads

If you are interested in experimenting with the graphs, I have gathered a few that I generated and made them available for public download to save time:

English Files
- Can be found [here](http://cairo.lti.cs.cmu.edu/~hector/data/dbpedia_data/en/)
- The compressed graph graph.tar.gz has everything except co-occs-count.tsv
Spanish Files
- Can be found [here](http://cairo.lti.cs.cmu.edu/~hector/data/dbpedia_data/es/)
- The compressed graph graph.tar.gz has everything except co-occs-count.es.tsv.tar.gz
