aftiqb edited this page Aug 14, 2012 · 3 revisions

The following scripts are used in the Eurostat Linked Data conversion process. They can be run together or standalone to serve different scenarios. A brief description of how to run each script follows.

ParseToC

Parses the Table of Contents and prints the dataset URLs:

How to Run on Windows: ParseToC.bat -n 5

How to Run on Linux: sh ParseToC.sh -n 5

where

* `n` represents the number of dataset URLs to print

Type -h for help.
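Because ParseToC prints the URLs to standard output, they can be captured for later steps. The following is a minimal sketch, assuming each URL is printed on its own line, possibly mixed with other messages (the exact output format is not documented here). `keep_urls` is a hypothetical helper name, not part of the scripts.

```shell
# keep_urls: read text on stdin and pass through only the lines that start
# with http:// or https:// (assumption: ParseToC prints one URL per line).
keep_urls() {
    # "|| true" keeps the exit status zero when no line matches.
    grep -E '^https?://' || true
}
```

It could then be used as `sh ParseToC.sh -n 5 | keep_urls > urls.txt`.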

UnCompressFile

Uncompresses the contents of the compressed dataset file:

How to Run on Windows: UnCompressFile.bat -i c:/test/zip/bsbu_m.sdmx.zip -o c:/uncompress/

How to Run on Linux: sh UnCompressFile.sh -i ~/test/zip/bsbu_m.sdmx.zip -o ~/uncompress/

where

* `i` is the path of the compressed input file
* `o` is the output directory path where the contents of the compressed file will be stored

Type -h for help.
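The script handles one file per call, so a directory of downloads needs a loop. A minimal sketch for Linux, assuming the script is invoked from the current directory exactly as shown above. The names `uncompress_all` and `UNCOMPRESS_CMD` are hypothetical and exist only for this example.

```shell
# Command used to uncompress one file; override it for a dry run,
# e.g. UNCOMPRESS_CMD="echo sh UnCompressFile.sh".
UNCOMPRESS_CMD="${UNCOMPRESS_CMD:-sh UnCompressFile.sh}"

# uncompress_all ZIP_DIR OUT_DIR
# Calls the script once for every *.zip file found in ZIP_DIR.
uncompress_all() {
    zip_dir="${1%/}"      # drop one trailing slash, if any
    out_dir="${2%/}/"     # make sure the output path ends with a slash
    for f in "$zip_dir"/*.zip; do
        [ -e "$f" ] || continue   # skip the unexpanded pattern when nothing matches
        # $UNCOMPRESS_CMD is left unquoted on purpose so it splits into words.
        $UNCOMPRESS_CMD -i "$f" -o "$out_dir" || echo "failed: $f" >&2
    done
}
```

For example: `uncompress_all ~/test/zip/ ~/uncompress/`. A failure on one file is reported on standard error and the loop continues with the next file.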

DownloadZipFile

Downloads the compressed dataset file from the specified URL:

How to Run on Windows: DownloadZip.bat -p c:/test/zip/ -t c:/test/tsv/ -u "http://epp.eurostat.ec.europa.eu/NavTree_prod/everybody/BulkDownloadListing?sort=1&downfile=data/apro_cpb_sugar.sdmx.zip"

How to Run on Linux: sh DownloadZip.sh -p ~/test/zip/ -t ~/test/tsv/ -u "http://epp.eurostat.ec.europa.eu/NavTree_prod/everybody/BulkDownloadListing?sort=1&downfile=data/apro_cpb_sugar.sdmx.zip"

where

* `p` is the directory path where the compressed `.zip` file will be stored
* `t` is the directory path where the compressed `.tsv` file will be stored
* `u` is the URL of the dataset file

Type -h for help.
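The URL contains `&`, so it must stay quoted on the command line. When only the dataset code changes, the URL can be built from the code. The sketch below assumes every dataset follows the same URL pattern as the `apro_cpb_sugar` example above, which this page does not guarantee. `bulk_url` and `BULK_BASE` are hypothetical names.

```shell
# Base of the bulk download URL, copied from the example above.
BULK_BASE="http://epp.eurostat.ec.europa.eu/NavTree_prod/everybody/BulkDownloadListing"

# bulk_url DATASET_CODE
# Prints the URL of the compressed SDMX file for DATASET_CODE
# (assumption: all datasets share the pattern of the example URL).
bulk_url() {
    printf '%s?sort=1&downfile=data/%s.sdmx.zip\n' "$BULK_BASE" "$1"
}
```

Usage: `sh DownloadZip.sh -p ~/test/zip/ -t ~/test/tsv/ -u "$(bulk_url apro_cpb_sugar)"`.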

DSDParser

Parses the Data Structure Definition (DSD) of a dataset and converts it into RDF using the Data Cube vocabulary.

How to Run on Windows: DSDParser.bat -i c:/tempZip/bsbu_m.dsd.xml -o c:/test/ -f TURTLE -a c:/sdmx-code.ttl

How to Run on Linux: sh DSDParser.sh -i ~/tempZip/dsd/bsbu_m.dsd.xml -o ~/test/ -f TURTLE -a ~/sdmx-code.ttl

where

* `i` is the file path of the DSD XML file
* `o` is the output directory path where RDF will be stored
* `f` is the format for RDF serialization (RDF/XML, TURTLE, N-TRIPLES)
* `a` is the file path of `sdmx-code.ttl`. It can be downloaded from http://code.google.com/p/publishing-statistical-data/source/browse/trunk/specs/src/main/vocab/sdmx-code.ttl

Type -h for help.
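A wrapper script may want to reject an unsupported `-f` value before starting a conversion. A minimal sketch that accepts only the three names listed above; `is_rdf_format` is a hypothetical helper, and the check is case-sensitive, which may be stricter than the script itself.

```shell
# is_rdf_format NAME
# Succeeds only for the three serialization names listed above.
is_rdf_format() {
    case "$1" in
        RDF/XML|TURTLE|N-TRIPLES) return 0 ;;
        *) return 1 ;;
    esac
}
```

Usage: `is_rdf_format "$fmt" || echo "unsupported format: $fmt" >&2`.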

SDMXParser

Parses the SDMX dataset observations and converts them into RDF using the Data Cube vocabulary.

How to Run on Windows: SDMXParser.bat -f tsieb010 -o c:/test/ -i c:/tempZip/tsieb010.sdmx.xml -l c:/log/ -t c:/tsv/tsieb010.tsv.gz

How to Run on Linux: sh SDMXParser.sh -f tsieb010 -o ~/test/ -i ~/sdmx/tsieb010.sdmx.xml -l ~/log/ -t ~/tsv/tsieb010.tsv.gz

where

* `f` is the name of the dataset
* `o` is the output directory path where RDF will be stored
* `i` is the file path of the SDMX `xml` file
* `l` is the directory path where the logs of the dataset conversion will be stored
* `t` is the file path of the SDMX `tsv` file

Type -h for help.
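For a single dataset, UnCompressFile, DSDParser and SDMXParser can be chained. The sketch below only illustrates the order of the calls. It assumes that the archive `<code>.sdmx.zip` contains `<code>.dsd.xml` and `<code>.sdmx.xml`, as the file names in the examples suggest, and that the files live in a hypothetical layout `zip/`, `xml/`, `tsv/`, `rdf/`, `log/` under one working directory. `convert_dataset` and `RUN` are names made up for this example.

```shell
# Prefix for every command; set RUN=echo to print the commands
# instead of executing them (dry run).
RUN="${RUN:-}"

# convert_dataset CODE WORK_DIR SDMX_CODE_TTL
# Uncompresses one dataset, converts its DSD, then its observations.
# Stops at the first step that fails.
convert_dataset() {
    code="$1"
    work="${2%/}"   # drop one trailing slash, if any
    ttl="$3"
    $RUN sh UnCompressFile.sh -i "$work/zip/$code.sdmx.zip" -o "$work/xml/" &&
    $RUN sh DSDParser.sh -i "$work/xml/$code.dsd.xml" -o "$work/rdf/" -f TURTLE -a "$ttl" &&
    $RUN sh SDMXParser.sh -f "$code" -o "$work/rdf/" -i "$work/xml/$code.sdmx.xml" \
        -l "$work/log/" -t "$work/tsv/$code.tsv.gz"
}
```

Setting `RUN=echo` and calling `convert_dataset tsieb010 ~/eurostat ~/sdmx-code.ttl` prints the three commands without running them, which is a cheap way to check the paths first.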

Metadata

Generates the VoID file, which is used to populate the triple store as described in Step 5 and Step 6.

How to Run on Windows: Metadata.bat -i c:/toc/table_of_contents.xml -o c:/test/

How to Run on Linux: sh Metadata.sh -i ~/toc/table_of_contents.xml -o ~/test/

where

* `i` is the file path of the table of contents (optional parameter)
* `o` is the output directory path where the VoID file will be stored

Type -h for help.

DictionaryParser

Converts the dictionaries/codelists into RDF. It also generates a catalog file, which is used to load all dictionaries/codelists into the triple store.

How to Run on Windows: DictionaryParser.bat -i c:/dicPath/ -o c:/outputPath/ -c c:/catalogPath/ -f TURTLE

How to Run on Linux: sh DictionaryParser.sh -i ~/dicPath/ -o ~/outputPath/ -c ~/catalogPath/ -f TURTLE

where

* `i` is the directory path where the dictionaries are stored
* `o` is the directory path where the RDF will be stored
* `c` is the directory path where the catalog file will be stored
* `f` is the format for RDF serialization (RDF/XML, TURTLE, N-TRIPLES). This serialization is used *only* for the catalog file; the dictionaries themselves are always generated in RDF/XML.

Type -h for help.

EuroStatMirror

Downloads all the compressed dataset files from the Bulk Download page by extracting their URLs from the Table of Contents.

How to Run on Windows: EuroStatMirror.bat -p c:/zip/ -t c:/tsv/

How to Run on Linux: sh EuroStatMirror.sh -p ~/zip/ -t ~/tsv/

where

* `p` is the directory path where the `zip` files are downloaded
* `t` is the directory path where the `tsv` files are downloaded

Type -h for help.

Main

Converts the complete Eurostat datasets into RDF:

How to Run: sh Main.sh -i ~/sdmx-code.ttl -l ~/logs/

where

* `i` is the file path of `sdmx-code.ttl`. It can be downloaded from http://code.google.com/p/publishing-statistical-data/source/browse/trunk/specs/src/main/vocab/sdmx-code.ttl
* `l` is the directory path where logs will be generated

Type -h for help.
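A full conversion takes a long time, so it can help to check the inputs before starting it. A minimal sketch, assuming it is acceptable to create the log directory in advance (this page does not say whether Main creates it). `preflight` is a hypothetical helper.

```shell
# preflight TTL_FILE LOG_DIR
# Fails if TTL_FILE is not a readable regular file; creates LOG_DIR if missing.
preflight() {
    if [ ! -f "$1" ] || [ ! -r "$1" ]; then
        echo "sdmx-code.ttl not found or unreadable: $1" >&2
        return 1
    fi
    mkdir -p "$2"
}
```

Usage: `preflight ~/sdmx-code.ttl ~/logs/ && sh Main.sh -i ~/sdmx-code.ttl -l ~/logs/`.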

Dataset Titles

Generates the titles of the datasets in RDF.

How to Run: sh DatasetTitles.sh -o ~/title/

where

* `o` is the output directory path where the RDF will be stored

Type -h for help.