This repository has been archived by the owner on Oct 20, 2018. It is now read-only.

A simple HowTo for collective disambiguation

Hector edited this page Feb 3, 2015 · 20 revisions

This page describes the workflow and steps for running the collective disambiguation module on this branch (https://github.com/hunterhector/dbpedia-spotlight). Please note that the module is currently experimental and subject to change. The module was tested on ARCH-3.3.8.1. The Pig scripts were only tested in pseudo-distributed mode.

Prerequisites (based on the testing environment)

The following are the settings of the tested environment:

  • JDK 1.7.0

  • Scala 2.9.2

  • Apache Maven 2.2.1

    • Other Maven versions may not resolve dependencies successfully; switch to Maven 2.2.1 if you encounter problems.
  • Apache Hadoop 1.0.2 Set up guide

    • Running Pig scripts in local mode does not require a standalone Hadoop installation, because Pig uses a bundled version of Hadoop
  • Apache Pig 0.10.0 Set up guide
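
As a quick sanity check before installing, the sketch below reports which of the tools listed above are on the PATH. The check_tool helper is a name made up for this example, and it only tests presence; the versions still need to be compared by hand against the list above (for instance with java -version and mvn -version).

```shell
#!/bin/sh
# check_tool is a hypothetical helper: it reports whether a command is on the PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found:   $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Tools from the prerequisites list. hadoop is only needed when Pig is not
# run in local mode, so a "missing" line for it is not necessarily a problem.
for tool in java scala mvn hadoop pig; do
    check_tool "$tool" || true
done
```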

Installation guide

  1. Clone the master branch of https://github.com/hunterhector/dbpedia-spotlight
git clone https://github.com/hunterhector/dbpedia-spotlight
  2. Follow the installation guide to set up DBpedia Spotlight
  3. Edit the properties file in conf/properties so that it points to the correct files
  4. For the graph.properties configuration, you need to:
    • Specify a directory to store all the generated graph files in org.dbpedia.spotlight.graph.dir
    • Point org.dbpedia.spotlight.graph.occsrc to the occs.tsv file generated by org.dbpedia.spotlight.lucene.index.ExtractOccsFromWikipedia
    • Point org.dbpedia.spotlight.graph.cooccsrc to the co-occs-count.tsv file generated by index/src/main/pig/CooccurrencesCount.pig (discussed later)
    • Setting org.dbpedia.spotlight.graph.offline to true makes the program read the graph from disk, which consumes less memory but makes the process slower
    • The other directory properties are all paths relative to the root directory org.dbpedia.spotlight.graph.dir. The simplest approach is to leave them as they are.
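
The four properties described above can be sketched as a file like the one written below. This is only an illustration: the four keys come from the list above, but every path is a placeholder, the output file name is made up, and the key=value layout assumes standard Java properties syntax.

```shell
#!/bin/sh
# Write an example graph.properties. Every path below is a placeholder, not a
# value from the project; only the four property keys come from this page.
conf_file="${TMPDIR:-/tmp}/graph.properties.example"

cat > "$conf_file" <<'EOF'
# Root directory for all generated graph files
org.dbpedia.spotlight.graph.dir=/path/to/graph
# Occurrences file produced by ExtractOccsFromWikipedia
org.dbpedia.spotlight.graph.occsrc=/path/to/occs.tsv
# Co-occurrence counts produced by the Pig script
org.dbpedia.spotlight.graph.cooccsrc=/path/to/co-occs-count.tsv
# true: read the graph from disk (less memory, slower)
org.dbpedia.spotlight.graph.offline=true
EOF

# List the keys that were written, one per line.
grep -v '^#' "$conf_file" | cut -d= -f1
```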

Run Disambiguation Only

To run only the disambiguation task, download the two files below and uncompress them. Remember to edit the properties so that they point to the location of these files.

  1. uriMap.tsv (45.8 MB, 110.5 MB uncompressed)
  2. semantic graph folder (168 MB, 205.7 MB uncompressed) (all files should be under the same folder)

You can then skip the Graph generation section and run the disambiguator as described in the Disambiguation section.
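
Since a wrong path in the properties is an easy mistake after unpacking the downloads, here is a minimal sketch for checking that a configured path actually exists. Both prop_value and check_path_prop are hypothetical helpers written for this example, and they assume a plain key=value properties file.

```shell
#!/bin/sh
# prop_value is a hypothetical helper: print the value stored under KEY in a
# key=value properties file (last match wins, leading spaces trimmed).
prop_value() {
    key=$(printf '%s' "$2" | sed 's/[.]/\\./g')   # escape dots for the regex
    sed -n "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*//p" "$1" | tail -n 1
}

# check_path_prop is a hypothetical helper: report whether the path stored
# under KEY exists on disk.
check_path_prop() {
    path=$(prop_value "$1" "$2")
    if [ -n "$path" ] && [ -e "$path" ]; then
        echo "ok: $2 -> $path"
    else
        echo "missing: $2 -> ${path:-<unset>}"
        return 1
    fi
}
```

For example, check_path_prop conf/graph.properties org.dbpedia.spotlight.graph.dir checks the graph root directory; substitute whichever properties file and keys hold the locations of the downloaded files.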

Graph generation

Before running any of these launchers, check that the launcher's argument in the pom.xml file points to the correct properties file, for example <arg>../conf/graph.properties</arg> (I created different properties files to build graphs for different languages).
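
One way to do this check without opening the file is sketched below. The launcher_args helper is a hypothetical name, and it assumes each <arg> element sits on a single line of the pom.xml you are inspecting.

```shell
#!/bin/sh
# launcher_args is a hypothetical helper: print the contents of every
# <arg>...</arg> element that sits on a single line of the given pom.xml.
launcher_args() {
    sed -n 's/.*<arg>\(.*\)<\/arg>.*/\1/p' "$1"
}
```

Running launcher_args pom.xml in the module that holds the launchers lists the configured arguments, so a stale properties path is easy to spot.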

You can also download the full set of graphs (download size: 1.5 GB; uncompressed size: 3.8 GB) and the co-occurrences file (download size: 295.5 MB; uncompressed size: 772.3 MB) to run experiments from any point in the process.

To shorten the running time, the current implementation finishes all basic graph and semantic calculations at indexing time. A few steps need to be run to generate all the files.

  1. Get code and compile
git clone https://github.com/caizhiwei/pignlproc.git
cd pignlproc
mvn assembly:assembly -Dmaven.test.skip=true
  2. Edit the token_counts.pig.params file to point to your files (these can be in HDFS or on a local path; for a local path, use pig -x local in the following commands).

  3. Get occurrence counts and co-occurrence counts

pig -m token_counts.pig.params occs.pig
pig -m token_counts.pig.params co-occs.pig
  4. Make graph
cd dbpedia-spotlight
mvn scala:run -DmainClass=org.dbpedia.spotlight.graph.GraphMaker "-DaddArgs=conf/graph.properties"
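
For the two Pig counting steps in the list above, the only difference between local and cluster execution is the -x local flag. The sketch below builds the command line for either case; pig_cmd is a hypothetical helper, and it only prints the commands rather than running them.

```shell
#!/bin/sh
# pig_cmd is a hypothetical helper: build the Pig command line for one script.
# Passing "local" as the second argument adds "-x local", which is needed
# when token_counts.pig.params points to files on the local disk.
pig_cmd() {
    script=$1
    mode=${2:-mapreduce}
    if [ "$mode" = "local" ]; then
        echo "pig -x local -m token_counts.pig.params $script"
    else
        echo "pig -m token_counts.pig.params $script"
    fi
}

# Print the two counting steps for local mode; pipe the output to sh to run them.
for script in occs.pig co-occs.pig; do
    pig_cmd "$script" local
done
```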

The graph generation process is now done, and you can start using the disambiguator.

Disambiguation

GraphBaseDisambiguator is currently integrated into the Core module, and an example run can be seen in EvaluateSpotlightModel.scala#l66

More Downloads

If you are interested in playing with the graphs, I have gathered a few that I generated and made them available for public download to save time:

  1. English Files

  2. Spanish Files
