
This page describes how we loaded the data generated by LUBM into Quest.

Issues to address

At the moment Quest uses the OWL API to load the TBox and ABox. This is very inefficient for large ABoxes. We need a lighter mechanism where little parsing is done and where streaming of triples is possible.

Solution:

  • Generate all the LUBM data files.
  • Transform and merge all the data into a single file in a simple triple format (e.g., N-Triples).
  • Create a new ABox assertion streamer that reads the file line by line with very simple parsing.

Generating the files

This is done with the standard LUBM data generator tool, using the command:

java -cp classes/ edu.lehigh.swat.bench.uba.Generator -univ 1000 -onto http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl

Transforming and merging

To do this we will use Jena, in particular its rdfcat command-line tool.

  • Setting up Jena. Download Jena and set up your environment as follows:
    • Add the following to your .bashrc file:

      export JENAROOT=~/Documents/OBDA/related_software/Jena-2.6.4
      export PATH=$JENAROOT/bin:$PATH

    • Make the Jena scripts executable:

      chmod u+x $JENAROOT/bin/*


With Jena configured, we can now process the original data and dump it as N-Triples with the following command:

find . -type f -name "University*.owl" -exec rdfcat -out N-TRIPLE -x {} >> University0-99.nt \;

We also need to remove imports and other non-data triples with the commands:

cat University0-99.nt | grep -v http://www.w3.org/2002/07/owl#Ont > University0-99-clean.nt
cat University0-99-clean.nt | grep -v http://www.w3.org/2002/07/owl#imports > University0-99-clean2.nt

To merge the data of each university into a single .nt file per university, we used the following bash script:

#!/bin/bash
echo "Generating nt files"
for i in {0..99}
do
    echo "Doing uni $i"
    # ${i} (not $i_) so the shell does not treat "i_" as the variable name
    find . -type f -name "University${i}_*.owl" -exec rdfcat -out N-TRIPLE -x {} >> uni$i.nt \;
done

To clean all of the per-university files we ran:

#!/bin/bash
echo "Cleaning nt files"
for i in {0..99}
do
    echo "Doing uni $i"
    grep -v http://www.w3.org/2002/07/owl#Ont uni$i.nt | grep -v http://www.w3.org/2002/07/owl#imports > uni$i.nt.tmp
    mv uni$i.nt.tmp uni$i.nt
done

Loader

To load the triples we are going to use Quest's .load(Iterator&lt;Assertion&gt;) method, and we will implement an N-Triple reader that generates an Iterator for the data it reads. The reader is very simple and does not support all features of N-Triples; in particular, blank nodes and literal datatypes are not supported yet. A sketch of such a reader is given after the example below.

Requirements

  • One triple per line, terminated by "."
  • URIs delimited by <>
  • Literals delimited by ""
  • No other content in the file.
Example:
<http://www.Department9.University9.edu/UndergraduateStudent73> <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#takesCourse> <http://www.Department9.University9.edu/Course2> .
<http://www.Department9.University9.edu/UndergraduateStudent73> <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#takesCourse> <http://www.Department9.University9.edu/Course21> .
<http://www.Department9.University9.edu/UndergraduateStudent306> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#UndergraduateStudent> .
<http://www.Department9.University9.edu/UndergraduateStudent306> <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#name> "UndergraduateStudent306" .
<http://www.Department9.University9.edu/UndergraduateStudent306> <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#memberOf> <http://www.Department9.University9.edu> .
<http://www.Department9.University9.edu/UndergraduateStudent306> <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#emailAddress> "UndergraduateStudent306@Department9.University9.edu" .
<http://www.Department9.University9.edu/UndergraduateStudent306> <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#telephone> "xxx-xxx-xxxx" .
<http://www.Department9.University9.edu/UndergraduateStudent306> <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#takesCourse> <http://www.Department9.University9.edu/Course20> .
<http://www.Department9.University9.edu/UndergraduateStudent306> <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#takesCourse> <http://www.Department9.University9.edu/Course10> .
<http://www.Department9.University9.edu/UndergraduateStudent306> <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#advisor> <http://www.Department9.University9.edu/AssociateProfessor5> .
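
As an illustration, here is a minimal sketch of such a streaming reader in Java, written against the requirements above. It is only a sketch under assumptions: the class name NTripleStreamer is made up, it yields each triple as a String[3] of {subject, predicate, object}, and the step of wrapping these into Quest Assertion objects (whose construction API is not shown on this page) is omitted.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical streaming N-Triple reader: one triple per line, URIs in <>,
// literals in "", no blank nodes and no literal datatypes, as required above.
public class NTripleStreamer implements Iterator<String[]>, AutoCloseable {

    // <s> <p> (<o> | "o") .
    private static final Pattern TRIPLE = Pattern.compile(
            "^<([^>]*)>\\s+<([^>]*)>\\s+(?:<([^>]*)>|\"([^\"]*)\")\\s*\\.\\s*$");

    private final BufferedReader in;
    private String[] next;

    public NTripleStreamer(String file) throws IOException {
        in = new BufferedReader(new FileReader(file));
        advance();
    }

    // Scan forward to the next line that matches the triple pattern.
    private void advance() throws IOException {
        next = null;
        String line;
        while ((line = in.readLine()) != null) {
            Matcher m = TRIPLE.matcher(line);
            if (m.matches()) {
                String object = (m.group(3) != null) ? m.group(3) : m.group(4);
                next = new String[] { m.group(1), m.group(2), object };
                return;
            }
        }
    }

    @Override
    public boolean hasNext() {
        return next != null;
    }

    @Override
    public String[] next() {
        if (next == null)
            throw new NoSuchElementException();
        String[] current = next;
        try {
            advance();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return current;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}

Because the file is consumed line by line, memory usage stays constant no matter how large the merged .nt file is, which is exactly the property the OWL API loader lacks for large ABoxes.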

Postgres tuning

Before (excerpt from postgresql.conf):

shared_buffers = 32MB
#work_mem = 1MB                         # min 64kB
#maintenance_work_mem = 16MB            # min 1MB
#max_stack_depth = 2MB                  # min 100kB
#wal_level = minimal                     # minimal, archive, or hot_standby
#checkpoint_segments = 3                # in logfile segments, min 1, 16MB each
#archive_mode = off
#max_wal_senders = 0             # max number of walsender processes
checkpoint_timeout = 5min              # range 30s-1h
#effective_cache_size = 128MB
#fsync = on                             # turns forced synchronization on or off
#synchronous_commit = on                # synchronization level; on, off, or local

First tuning:

shared_buffers = 3GB
work_mem = 24MB                         # min 64kB
maintenance_work_mem = 256MB            # min 1MB
max_stack_depth = 7680KB                  # min 100kB
wal_level = minimal                     # minimal, archive, or hot_standby
checkpoint_segments = 15                # in logfile segments, min 1, 16MB each
archive_mode = off
max_wal_senders = 0             # max number of walsender processes
checkpoint_timeout = 10min              # range 30s-1h
effective_cache_size = 4GB


Second tuning (current), additionally disabling fsync and synchronous_commit to speed up bulk loading:

shared_buffers = 2GB
work_mem = 10MB                         # min 64kB
maintenance_work_mem = 128MB            # min 1MB
max_stack_depth = 4MB                  # min 100kB
wal_level = minimal                     # minimal, archive, or hot_standby
checkpoint_segments = 10                # in logfile segments, min 1, 16MB each
archive_mode = off
max_wal_senders = 0             # max number of walsender processes
checkpoint_timeout = 10min              # range 30s-1h
effective_cache_size = 1GB
fsync = off                             # turns forced synchronization on or off
synchronous_commit = off                # synchronization level; on, off, or local

