Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OpenStudio::UnzipFile extractAllFiles is very slow #4456

Closed
shorowit opened this issue Oct 1, 2021 · 5 comments · Fixed by #4493
Closed

OpenStudio::UnzipFile extractAllFiles is very slow #4456

shorowit opened this issue Oct 1, 2021 · 5 comments · Fixed by #4493

Comments

@shorowit
Copy link
Contributor

shorowit commented Oct 1, 2021

Issue overview

The OpenStudio::UnzipFile extractAllFiles method is slower than it seems like it should be. It appears to get exponentially slower as the unzip operation progresses.

Unzip times for test below:

Test Operation time
Using 7zip manually on Windows 0.5 minutes
Using OpenStudio::UnzipFile extractAllFiles on Windows 24.5 minutes
Using OpenStudio::UnzipFile extractAllFiles on Linux (Windows Subsystem for Linux) 6 minutes

cc @aspeake

Current Behavior

Slow operation.

Expected Behavior

Faster operation.

Steps to Reproduce

  1. Download this file: https://data.nrel.gov/system/files/156/BuildStock_TMY3_FIPS.zip

  2. Create this unzip.rb script at the same location:

require 'openstudio'

unzip_file = OpenStudio::UnzipFile.new('BuildStock_TMY3_FIPS.zip')
unzip_file.extractAllFiles(OpenStudio::toPath('zip_contents'))
  1. Run openstudio unzip.rb and time the operation

Possible Solution

Details

Environment

Some additional details about your environment for this issue (if relevant):

  • Platform (Operating system, version):
  • OpenStudio 3.2.1

Context

@shorowit shorowit added the Triage Issue needs to be assessed and labeled, further information on reported might be needed label Oct 1, 2021
@jmarrec
Copy link
Collaborator

jmarrec commented Oct 7, 2021

@shorowit You should be using the rubyzip gem instead. It's embedded in the CLI already, since t's part of the workflow gem.

I've done some benchmarking, and indeed OpenStudio's Unzip file is slow.

rubyzip apparently uses zlib, same as OpenStudio (which uses minizip which is based on zlib). But the rubyzip code is much, much more complicated, UnzipFile.cpp is less than 100 lines of code...

UnzipFile::UnzipFile(const openstudio::path& filename) : m_unzFile(unzOpen(openstudio::toString(filename).c_str())) {
if (!m_unzFile) {
if (!openstudio::filesystem::exists(filename)) {
throw std::runtime_error("UnzipFile " + openstudio::toString(filename) + " does not exist, could not be opened");
} else {
throw std::runtime_error("UnzipFile " + openstudio::toString(filename) + " exists, could not be opened");
}
}
}
UnzipFile::~UnzipFile() {
unzClose(m_unzFile);
}
std::vector<openstudio::path> UnzipFile::extractAllFiles(const openstudio::path& outputPath) const {
std::vector<openstudio::path> files = listFiles();
std::vector<openstudio::path> retfiles;
for (std::vector<openstudio::path>::const_iterator itr = files.begin(); itr != files.end(); ++itr) {
if (toString(itr->filename()) == "." || toString(itr->filename()) == "/") {
// This is a directory - skip it
} else {
retfiles.push_back(extractFile(*itr, outputPath));
}
}
return retfiles;
}
openstudio::path UnzipFile::extractFile(const openstudio::path& filename, const openstudio::path& outputPath) const {
if (unzLocateFile(m_unzFile, openstudio::toString(filename).c_str(), 1) != UNZ_OK) {
throw std::runtime_error("File does not exist in archive: " + openstudio::toString(filename));
}
if (unzOpenCurrentFile(m_unzFile) != UNZ_OK) {
throw std::runtime_error("Unable to open file in archive: " + openstudio::toString(filename));
}
try {
bool cont = true;
openstudio::path createdFile = outputPath / filename;
openstudio::filesystem::create_directories(createdFile.parent_path());
openstudio::filesystem::ofstream file(createdFile, std::ios_base::trunc | std::ios_base::binary);
while (cont) {
std::vector<char> buffer(1024);
int bytesread = unzReadCurrentFile(m_unzFile, &buffer.front(), buffer.size());
if (bytesread == 0) {
cont = false;
} else if (bytesread < 0) {
throw std::runtime_error("Unable to read from file");
} else {
file.write(&buffer.front(), bytesread);
if (!file.good()) {
throw std::runtime_error("Error writing to output file: " + toString(createdFile));
}
}
}
file.close();
return createdFile;
} catch (...) {
unzCloseCurrentFile(m_unzFile);
throw;
}
unzCloseCurrentFile(m_unzFile);
}
std::vector<openstudio::path> UnzipFile::listFiles() const {
bool cont = unzGoToFirstFile(m_unzFile) == UNZ_OK;
std::vector<openstudio::path> paths;
do {
unz_file_info file_info;
std::vector<char> filename(300);
unzGetCurrentFileInfo(m_unzFile, &file_info, &filename.front(), filename.size(), nullptr, 0, nullptr, 0);
paths.push_back(openstudio::toPath(std::string(&filename.front(), file_info.size_filename)));
cont = unzGoToNextFile(m_unzFile) == UNZ_OK;
} while (cont);
return paths;
}
). I think the rubyzip code is smarter, it tries to find the compression method used and co.

UnzipFile in OpenStudio was never really meant to decompress 800MB archives...

Benchmarking methodology

I created a first zip 'TestZipSmaller' where I zipped all E+ weather_data/ files.
I created a second TestZipBigger by duplicating each weather_data file three times in root. Copied all 4 to another 'Subfolder'
Then I took everything from TestZipBigger and put that twice/thrice in subfolders for TestZipTwiceBigger / TestZipThriceBigger.

I then wrote this benchmark file to test : OpenStudio, rubyzip, and system (ubuntu 20.04).

bench_unzip.rb

require 'openstudio'
require 'zip'
require 'fileutils'
require 'json'

def test_openstudio(test_case)
  destination = "OpenStudio-#{test_case}"
  path = "#{test_case}.zip"
  FileUtils.rm_rf(destination)
  FileUtils.mkdir_p(destination)

  t = Time.now
  uf = OpenStudio::UnzipFile.new(path)
  uf.extractAllFiles(destination)
  t = Time.now - t

  puts "OpenStudio-#{test_case}: #{t}"
  return t
end

def test_rubyzip(test_case)
  destination = "Ruby-#{test_case}"
  path = "#{test_case}.zip"
  FileUtils.rm_rf(destination)
  FileUtils.mkdir_p(destination)

  t = Time.now
  Zip::File.open(path) do |zip_file|
    zip_file.each do |f|
      fpath = File.join(destination, f.name)
      zip_file.extract(f, fpath) unless File.exist?(fpath)
    end
  end
  t = Time.now - t

  puts "Ruby-#{test_case}: #{t}"
  return t
end

def test_system(test_case)

  destination = "System-#{test_case}"
  path = "#{test_case}.zip"
  FileUtils.rm_rf(destination)
  FileUtils.mkdir_p(destination)

  t = Time.now
  system("unzip #{path} -d #{destination} > /dev/null")
  t = Time.now - t

  puts "System-#{test_case}: #{t}"
  return t
end

timings = {}

tests = ['TestZipSmaller', 'TestZipBigger', 'TestZipTwiceBigger', 'TestZipThriceBigger']
tests.each do |test_case|
  timings[test_case] = {}
  timings[test_case]['openstudio'] = test_openstudio(test_case)
  timings[test_case]['rubyzip'] = test_rubyzip(test_case)
  timings[test_case]['system'] = test_system(test_case)
end
{
  "TestZipSmaller": {
    "openstudio": 0.326467787,
    "rubyzip": 0.343753568,
    "system": 0.297883918,
  },
  "TestZipBigger": {
    "openstudio": 2.252606288,
    "rubyzip": 1.495430877,
    "system": 1.735657113,
  },
  "TestZipTwiceBigger": {
    "openstudio": 6.975214381,
    "rubyzip": 3.090054221,
    "system": 4.047724282,
    "Reference-openstudio": 6.975214381
  },
  "TestZipThriceBigger": {
    "openstudio": 12.320878892,
    "rubyzip": 4.761708978,
    "system": 6.060392038,
  }
}

@jmarrec
Copy link
Collaborator

jmarrec commented Oct 7, 2021

I ran openstudio 50 times on each zip. It's weird to see that much deviation.

image

image

@jmarrec
Copy link
Collaborator

jmarrec commented Oct 7, 2021

I found a way to avoid doing extra work. Using the ThriceBigger, before/after:

-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
BM_Current/iterations:10       7932 ms         7924 ms           10
BM_Mod/iterations:10           3875 ms         3873 ms           10

Re running my ruby benchmark code:

{
  "TestZipSmaller": {
    "openstudio": 0.18681929,
    "rubyzip": 0.351584652,
    "system": 0.291233044,
    "Reference-openstudio": 0.326467787
  },
  "TestZipBigger": {
    "openstudio": 1.118212894,
    "rubyzip": 1.548469729,
    "system": 1.74556207,
    "Reference-openstudio": 2.252606288
  },
  "TestZipTwiceBigger": {
    "openstudio": 2.445955365,
    "rubyzip": 3.116927016,
    "system": 4.03409431,
    "Reference-openstudio": 6.975214381
  },
  "TestZipThriceBigger": {
    "openstudio": 4.044556489,
    "rubyzip": 7.097331949,
    "system": 8.684504536,
    "Reference-openstudio": 12.320878892
  }
}

@jmarrec
Copy link
Collaborator

jmarrec commented Oct 7, 2021

Just changing the size of the chunk improves it significantly.

std::vector<char> buffer(1024);

rubyzip uses 32768: https://github.com/rubyzip/rubyzip/blob/e70e1d3080efc09fa83963b0b2b08116532ee760/lib/zip/decompressor.rb#L5

@jmarrec
Copy link
Collaborator

jmarrec commented Oct 7, 2021

Current is the current implementation except you can set the chunksize. Mod is the other where I've made changes.

Testing chunksize of 1024 to 65536.

------------------------------------------------------------------------
Benchmark                              Time             CPU   Iterations
------------------------------------------------------------------------
BM_Current/1024/iterations:5        9761 ms         9745 ms            5
BM_Current/2048/iterations:5        8831 ms         8827 ms            5
BM_Current/4096/iterations:5        8210 ms         8202 ms            5
BM_Current/8192/iterations:5        8074 ms         8063 ms            5
BM_Current/16384/iterations:5       7904 ms         7900 ms            5
BM_Current/32768/iterations:5       7784 ms         7770 ms            5
BM_Current/65536/iterations:5       7669 ms         7655 ms            5
BM_Mod/1024/iterations:5            5900 ms         5895 ms            5
BM_Mod/2048/iterations:5            5020 ms         5010 ms            5
BM_Mod/4096/iterations:5            4391 ms         4382 ms            5
BM_Mod/8192/iterations:5            4173 ms         4170 ms            5
BM_Mod/16384/iterations:5           4056 ms         4050 ms            5
BM_Mod/32768/iterations:5           3913 ms         3906 ms            5
BM_Mod/65536/iterations:5           3844 ms         3838 ms            5

jmarrec added a commit that referenced this issue Oct 7, 2021
jmarrec added a commit that referenced this issue Nov 4, 2021
@tijcolem tijcolem added severity - Normal Bug and removed Triage Issue needs to be assessed and labeled, further information on reported might be needed labels Nov 5, 2021
tijcolem added a commit that referenced this issue Nov 15, 2021
Fix #4456 - Improve performance of OpenStudio::UnzipFile::extractAllFiles
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants