Compresses factor graph binary files with bzip2 #450

Merged
feiranwang merged 3 commits into master from compressed-factorgraph-binaries on Jan 26, 2016

@netj
Contributor

netj commented Jan 5, 2016

when grounding, and decompresses on the fly with bzcat when running the
sampler.

This cuts a 2MB (2300279 bytes) factor graph (the one produced by the test with
the spouse_example/ddlog app) down to 184kB (184224 bytes), with negligible
impact on runtime (or even faster!).

```bash
groundingTime() {
    local log=$1
    tstart=$((sed -n '\@process/grounding/.*/dump@{ p; q; }' | awk '{print $2}') <$log)
    tend=$((sed -n '\@LEARNING EPOCH 0@{ p; q; }' | awk '{print $2}') <$log)
    echo $(date --date="$tend" +%s.%N) - $(date --date="$tstart" +%s.%N) | bc
}

$ # dump from database and load by sampler without compression
$ groundingTime test/postgresql/spouse_example/ddlog/run/20151221/040549.120506000/log.txt
3.363981000

$ # dump from database and load by sampler with compression
$ groundingTime test/postgresql/spouse_example/ddlog/run/20151221/034709.985065000/log.txt
3.319830000
```

Note that gzip's format cannot faithfully record sizes above 4GB (its stored
size field is only 32 bits), so bzip2 was used despite its higher computational
cost. xz may be another good candidate to consider.
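A minimal sketch of the general dump/load pattern described above, using hypothetical placeholder commands `dump_factors` and `sampler` and a made-up `--factors` flag (these are not the actual DeepDive executables):

```bash
# Grounding side: stream the binary dump straight into bzip2 so the
# uncompressed file is never written to disk.
dump_factors | bzip2 -c > graph.factors.bz2

# Sampling side: bzcat decompresses on the fly; <(...) hands the sampler a
# pipe rather than a regular file.
sampler --factors <(bzcat graph.factors.bz2)
```

Since the decompressed bytes only ever flow through a pipe, the on-disk footprint is that of the compressed file alone.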

@netj netj added this to the DeepDive 0.8.1 milestone Jan 5, 2016

@netj netj added the performance label Jan 5, 2016

@feiranwang

Contributor

feiranwang commented Jan 5, 2016

Maybe we should test it on a larger factor graph (> 1GB) to see how it performs?

@netj

Contributor

netj commented Jan 7, 2016

Compression certainly has overhead. The question is whether it'll be a bottleneck. I'm trying to ground a larger factor graph by running the spouse example on a larger corpus I synthesized, but that revealed mkmimo's lower throughput, and it's running much slower than expected.

Meanwhile, here're my notes from doing a quick overhead test with several choices: https://gist.github.com/netj/c6f15bb78ff3a52057cb
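For reference, a comparison like the one in the gist can be reproduced with a short loop along these lines; this is only a sketch, and `graph.factors` is a hypothetical stand-in for one of the grounding dump files:

```bash
# Time each compressor on the same dump and report the compressed size.
for c in gzip bzip2 pbzip2 xz; do
    echo "== $c =="
    time "$c" -c <graph.factors | wc -c   # wc -c prints compressed bytes; time reports wall clock
done
```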

netj added some commits Dec 21, 2015

Compresses factor graph binary files with bzip2
when grounding, and decompresses on the fly with bzcat when running the
sampler.
@netj

Contributor

netj commented Jan 25, 2016

Before I forget, I'll drop some numbers I got a while ago for a large factor graph I synthesized by duplicating the corpus for the spouse example (~12GB uncompressed, 199k vars, 16k weights, 337M factors).

```
LOADED VARIABLES: #199907
         N_QUERY: #139603
         N_EVID : #60304
LOADED WEIGHTS: #16664
LOADED FACTORS: #337742718
```

The following are rough measurements on raiders6 with 111 processes, accounting only for the dumping and loading time.

uncompressed

  • 11828322038 bytes (~12GiB)
  • 401.224535 secs

pbzip2

  • 197572897 bytes (~191MiB; 59.8x smaller)
  • 420.276131 secs (+19s; +4.7% increase)

bzip2

  • 195875810 bytes (~189MiB; 60.4x smaller)
  • 464.805231 secs (+64s; +16% increase)

Since the full grounding took significantly more time (materializing the factors and weights), I'd say the compression overhead is negligible, while its savings on storage footprint and, in turn, I/O are quite dramatic. The higher-than-usual compression ratio (>>10x) is probably due to the regularity in the factor graph's data representation. I think we should turn this on by default unless there's a really good counterargument.
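The reported ratios and overheads follow directly from the byte and second counts quoted above; as a quick bc sanity check (no new measurements, just the quoted numbers):

```bash
echo 'scale=2; 11828322038 / 197572897' | bc                  # 59.86 -> ~59.8x smaller (pbzip2)
echo 'scale=2; 11828322038 / 195875810' | bc                  # 60.38 -> ~60.4x smaller (bzip2)
echo 'scale=3; (420.276131 - 401.224535) / 401.224535' | bc   # .047  -> +4.7% (pbzip2)
echo 'scale=3; (464.805231 - 401.224535) / 401.224535' | bc   # .158  -> +16%  (bzip2)
```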

@feiranwang

Contributor

feiranwang commented Jan 26, 2016

Seems there's a huge saving in space with negligible overhead! Merging.

feiranwang added a commit that referenced this pull request Jan 26, 2016

Merge pull request #450 from HazyResearch/compressed-factorgraph-binaries

Compresses factor graph binary files with bzip2

@feiranwang feiranwang merged commit 96dab13 into master Jan 26, 2016

2 checks passed

continuous-integration/travis-ci/pr: The Travis CI build passed
continuous-integration/travis-ci/push: The Travis CI build passed

@netj netj deleted the compressed-factorgraph-binaries branch Jan 28, 2016

@netj netj modified the milestones: DeepDive 0.8.1, DeepDive 0.8.0 Feb 11, 2016
