DBpedia extraction framework in Docker

This is a fork of the DBpedia extraction framework that adds a Dockerfile for running the extractor within a Docker environment. The fork also adapts some scripts to simplify operations in Docker.

Build the image

docker build -t dbpedia-extractor https://github.com/KIZI/extraction-framework.git#master:docker

Run the container

docker run -d --name dbpedia-extractor dbpedia-extractor <parameters>

List of parameters:

Flag Description Default
-l The short code of the language to extract. Specify exactly one language. This is the only required parameter.
-s The extraction steps to run, given as a string of step letters. Default: cdiae
  • c - download the ontology, mappings and settings.
  • d - download the Wikipedia dump for the selected language, plus the common dump used for image extraction etc.
  • w - download the Wikidata dump.
  • i - import the Wikipedia dump into the MediaWiki database for abstract extraction.
  • a - extract abstracts from the MediaWiki instance inside the Docker container.
  • e - extract data using the specified extractors.
-v The Wikipedia dump version. Default: newest
-e The DBpedia extractors to use, given as a comma-separated list
  • .AnchorTextExtractor
  • .ArticleCategoriesExtractor
  • .ArticlePageExtractor
  • .ArticleTemplatesExtractor
  • .CategoryLabelExtractor
  • .ExternalLinksExtractor
  • .GalleryExtractor
  • .InfoboxExtractor
  • .InterLanguageLinksExtractor
  • .LabelExtractor
  • .PageIdExtractor
  • .PageLinksExtractor
  • .RedirectExtractor
  • .RevisionIdExtractor
  • .ProvenanceExtractor
  • .SkosCategoriesExtractor
  • .WikiPageLengthExtractor
  • .WikiPageOutDegreeExtractor
  • .GeoExtractor
  • .HomepageExtractor
  • .ImageExtractor
  • .MappingExtractor
  • .DisambiguationExtractor
-b Start an interactive container with a bash console. Use this if you want to launch the extraction process manually inside the container.
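
To make the flag syntax concrete, here is a minimal sketch of how parameters in the `-flag=value` form could be parsed in a shell script. It is not the fork's actual start.sh: the function `parse_args` and the variable names are hypothetical, and the rule that `-l` may be omitted together with `-b` is an assumption drawn only from the examples in this README.

```shell
# Hypothetical sketch, NOT the fork's actual start.sh.
# Parses parameters such as: -l=cs -s=cde -v=20170201 -e=.A,.B  or  -b
parse_args() {
  LANG_CODE=""           # -l, required
  STEPS="cdiae"          # -s, default taken from the parameter list
  DUMP_VERSION="newest"  # -v, default taken from the parameter list
  EXTRACTORS=""          # -e, empty here means "nothing was passed"
  INTERACTIVE=0          # -b

  for arg in "$@"; do
    case "$arg" in
      -l=*) LANG_CODE="${arg#-l=}" ;;
      -s=*) STEPS="${arg#-s=}" ;;
      -v=*) DUMP_VERSION="${arg#-v=}" ;;
      -e=*) EXTRACTORS="${arg#-e=}" ;;
      -b)   INTERACTIVE=1 ;;
      *)    echo "unknown parameter: $arg" >&2; return 2 ;;
    esac
  done

  # Assumption: -l is required unless an interactive console was requested.
  if [ "$INTERACTIVE" -eq 0 ] && [ -z "$LANG_CODE" ]; then
    echo "missing required parameter: -l" >&2
    return 1
  fi
}
```

Calling `parse_args -l=cs` in this sketch leaves `STEPS` at `cdiae` and `DUMP_VERSION` at `newest`, which mirrors the defaults listed above.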

Examples

Start the extraction for the Czech chapter with the Wikipedia dump version 20170201, using only these extractors: .LabelExtractor, .MappingExtractor, .DisambiguationExtractor. This run does not download a Wikidata dump and does not use MediaWiki for abstract extraction.

docker run -d --name dbpedia-extractor dbpedia-extractor -l=cs -s=cde -v=20170201 -e=.LabelExtractor,.MappingExtractor,.DisambiguationExtractor
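
Because a mistyped step letter is easy to miss, a small pre-flight check on the `-s` value can help before launching a long run. The sketch below is a hypothetical helper (`valid_steps` is not part of this fork); the allowed letters are the ones listed in the `-s` description, and whether the container itself rejects unknown letters is not stated in this README.

```shell
# Hypothetical pre-flight check, not part of this fork.
# Accepts a non-empty string made up only of the step letters c, d, w, i, a, e.
valid_steps() {
  case "$1" in
    "" | *[!cdwiae]*) return 1 ;;  # empty, or contains an unlisted letter
    *) return 0 ;;
  esac
}
```

It could guard the run command, for example: `valid_steps cde && docker run -d --name dbpedia-extractor dbpedia-extractor -l=cs -s=cde`.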

Launch a container and attach to the console inside it.

docker run -it --name dbpedia-extractor dbpedia-extractor -b

Start a container for the Czech chapter with the default parameters.

docker run -d --name dbpedia-extractor dbpedia-extractor -l=cs

You can check the extraction logs while the container is running or after it has finished.

docker logs dbpedia-extractor

If the extraction process fails, you can start the stopped container again. Restarting does nothing by itself and does not rerun the extraction process. You can then connect to the restarted container to inspect errors or rerun individual extraction steps manually.

docker start dbpedia-extractor
docker exec -it dbpedia-extractor bash
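
The wait-then-inspect routine can also be scripted. This is a minimal sketch, assuming that a failed extraction makes the container exit with a non-zero code, which this README does not confirm. `wait_for_extraction` is a hypothetical helper; `docker wait` and `docker logs --tail` are standard Docker CLI commands.

```shell
# Hypothetical helper, not part of this fork.
# Assumption: a failed extraction makes the container exit with a non-zero code.
wait_for_extraction() {
  name="${1:-dbpedia-extractor}"

  # `docker wait` blocks until the container stops, then prints its exit code.
  code="$(docker wait "$name")" || return 125

  if [ "$code" -ne 0 ]; then
    echo "container $name exited with code $code; last log lines:" >&2
    docker logs --tail 50 "$name" >&2
  fi
  return "$code"
}
```

If the assumption holds, `wait_for_extraction dbpedia-extractor` returns the container's exit code, so it can gate a later step such as copying the datasets.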

After successful completion, you can copy the extracted datasets out of the Docker container.

docker cp dbpedia-extractor:/root/datasets ./