Writing CASes to a zip archive #135

Open
daxenberger opened this issue Jun 9, 2015 · 20 comments

@daxenberger
Member

Originally reported on Google Code with ID 135

DKPro Core 1.6.1 will support writing to ZIP archives, e.g. using BinaryCasWriter. We
should make use of this feature:

[PreprocessingTask]

AnalysisEngineDescription writer = createEngineDescription(BinaryCasWriter.class,
BinaryCasWriter.PARAM_TARGET_LOCATION, "jar:file:" + root + "/archive.zip", 
BinaryCasWriter.PARAM_TYPE_SYSTEM_LOCATION, root + "/typesystem.bin",
BinaryCasWriter.PARAM_FORMAT, "6");

and likewise for the Meta- and FeatureExtractionTasks.

One problem remains: I am not sure whether this makes sense for the BatchTaskCrossValidation,
where we (currently) need to split the overall set of files into several folds (file
sets), each of which needs to be retrieved individually.

Reported by daxenberger.j on 2014-05-28 12:41:02

@daxenberger
Member Author

"root" points to the path on the file system. Unless you have a strong reason to store
the type system outside the ZIP, I suggest you remove the "root" from PARAM_TYPE_SYSTEM_LOCATION
and just set it to "typesystem.bin" (no slash). Relative type system locations are
placed inside the ZIP - absolute locations are placed directly on the file system.

Reported by richard.eckart on 2014-05-28 12:42:45

@daxenberger
Member Author

Thanks for the hint. I don't see a reason to store the typesystem outside the ZIP, so
the location should be relative.

Reported by daxenberger.j on 2014-05-28 12:47:58

@daxenberger
Member Author

Reported by daxenberger.j on 2014-06-04 16:09:40

  • Labels added: Milestone-Release0.7.0

@daxenberger
Member Author

I wonder, didn't we plan to do this in 0.6.0? 

Reported by richard.eckart on 2014-06-25 15:04:57

@daxenberger
Member Author

Because of the problem mentioned in the first post: I'm not sure how to integrate this
with the current Crossvalidation BatchTask.

Reported by daxenberger.j on 2014-06-25 15:09:46

@daxenberger
Member Author

Ah, I see. It shouldn't be a big problem but it is probably too much for the 0.6.0 release.


The basic principle should remain the same. We'd just need some extra code to extract
the file names for the folds from the ZIP instead of scanning them from the file system.
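
The "extra code" mentioned above could look roughly like the following sketch. It uses only java.util.zip and is not taken from DKPro TC; the class name ZipFoldSplitter, the method names, and the round-robin fold assignment are assumptions for illustration. It also assumes that the CASes are stored as *.bin entries and that typesystem.bin sits in the same archive and must be skipped.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of DKPro TC or DKPro Core.
public class ZipFoldSplitter {

    /** Lists the sorted names of all *.bin entries, skipping the type system entry. */
    static List<String> listCasEntries(Path zip) throws IOException {
        try (ZipFile zf = new ZipFile(zip.toFile())) {
            return zf.stream()
                    .filter(e -> !e.isDirectory())
                    .map(ZipEntry::getName)
                    .filter(n -> n.endsWith(".bin") && !n.equals("typesystem.bin"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    /** Assigns entry names to k folds in round-robin order (an assumed strategy). */
    static List<List<String>> splitIntoFolds(List<String> names, int k) {
        if (k < 1) {
            throw new IllegalArgumentException("k must be at least 1");
        }
        List<List<String>> folds = new ArrayList<>();
        for (int i = 0; i < k; i++) {
            folds.add(new ArrayList<>());
        }
        for (int i = 0; i < names.size(); i++) {
            folds.get(i % k).add(names.get(i));
        }
        return folds;
    }

    public static void main(String[] args) throws IOException {
        // Build a small throwaway archive so the sketch runs on its own.
        Path zip = Files.createTempFile("archive", ".zip");
        try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(zip))) {
            for (String name : new String[] {"doc1.bin", "doc2.bin", "doc3.bin", "typesystem.bin"}) {
                out.putNextEntry(new ZipEntry(name));
                out.write(new byte[] {0});
                out.closeEntry();
            }
        }
        List<List<String>> folds = splitIntoFolds(listCasEntries(zip), 2);
        for (int i = 0; i < folds.size(); i++) {
            System.out.println("fold " + i + ": " + folds.get(i));
        }
        Files.deleteIfExists(zip);
    }
}
```

Each fold's name list would then still have to be turned into whatever include patterns the reader accepts; that part is left open here.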

Reported by richard.eckart on 2014-06-25 15:11:57

@daxenberger
Member Author

Reported by daxenberger.j on 2015-01-06 11:40:17

  • Labels added: Milestone-Release0.8.0
  • Labels removed: Milestone-Release0.7.0

@reckart reckart modified the milestone: 0.8.0 Aug 8, 2015
@Horsmann Horsmann modified the milestones: 0.9.0, 0.8.0 Mar 26, 2016
@Horsmann
Member

@daxenberger this one can be closed as won't fix now, right?

@daxenberger
Member Author

This is independent of the latest changes to CV mode. The idea here was to write all CASes into a zip archive rather than individual files.

Or why did you think it is obsolete?

@Horsmann
Member

Horsmann commented May 5, 2016

Oh ok, I misunderstood it then. Sry.

@Horsmann Horsmann modified the milestones: 1.0.0, 0.9.0 Oct 19, 2016
@Horsmann
Member

Horsmann commented Feb 9, 2018

@reckart Is this feature available now? What exactly is the benefit of writing a single .zip instead of N bin-cas files? Neither is human-readable, but naming each bin-cas after its document allows some visual confirmation that the reader read what it was supposed to read. It helps to understand at least a little bit what TC is doing. Unless this makes processing a lot faster, I would rather not have zips.

@reckart
Member

reckart commented Feb 9, 2018

Should be available.

@reckart
Member

reckart commented Feb 9, 2018

I don't remember the rationale. Might be to avoid using subfolders in an execution context... or to reduce the number of files, which can at times become very large... maybe @daxenberger remembers more.

@daxenberger
Member Author

This was certainly to reduce the number of files produced by TC, which can become quite large for bigger datasets. The "visual confirmation" issue could be avoided by writing some sort of log(?) file that records the names of the files written to the archive.
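
As a rough illustration of the log idea, the sketch below writes documents into one ZIP and records each entry name in a plain-text file next to it. It uses only the Java standard library; LoggedZipWriter, writeArchive, and the one-name-per-line log layout are hypothetical and not something BinaryCasWriter is known to provide.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of DKPro TC or DKPro Core.
public class LoggedZipWriter {

    /**
     * Writes every document as one ZIP entry and stores the entry names,
     * one per line, in a separate log file. Returns the names in write order.
     */
    static List<String> writeArchive(Path zip, Path log, Map<String, byte[]> docs)
            throws IOException {
        List<String> written = new ArrayList<>();
        try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(zip))) {
            for (Map.Entry<String, byte[]> doc : docs.entrySet()) {
                out.putNextEntry(new ZipEntry(doc.getKey()));
                out.write(doc.getValue());
                out.closeEntry();
                written.add(doc.getKey());
            }
        }
        // The log is written only after the archive has been closed successfully.
        Files.write(log, written, StandardCharsets.UTF_8);
        return written;
    }

    public static void main(String[] args) throws IOException {
        Map<String, byte[]> docs = new LinkedHashMap<>();
        docs.put("doc1.bin", new byte[] {1});
        docs.put("doc2.bin", new byte[] {2});

        Path zip = Files.createTempFile("cas", ".zip");
        Path log = Files.createTempFile("cas", ".log");
        writeArchive(zip, log, docs);

        // Print the log so the archive content can be checked by eye.
        Files.readAllLines(log, StandardCharsets.UTF_8).forEach(System.out::println);
        Files.deleteIfExists(zip);
        Files.deleteIfExists(log);
    }
}
```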

@Horsmann
Member

@reckart Do you have a code example that writes to .zip?

@reckart
Member

reckart commented Feb 16, 2018

@Horsmann
Member

Hm, when adapting this for the BinaryCasWriter and BinaryCasReader, I get a "Not in GZIP format" exception:

writing:
AnalysisEngineDescription xmiWriter = createEngineDescription(BinaryCasWriter.class,
        BinaryCasWriter.PARAM_TARGET_LOCATION,
        "jar:file:" + aContext.getFolder(output, AccessMode.READWRITE).getPath() + "/data.gz",
        BinaryCasWriter.PARAM_FORMAT, "6+");

reading:
createReaderDescription(BinaryCasReader.class,
        BinaryCasReader.PARAM_SOURCE_LOCATION,
        root.getAbsolutePath() + "/data.gz!*.bin");

@reckart
Member

reckart commented Feb 19, 2018

Looks like during reading, you are missing the jar:file: prefix.

@reckart
Member

reckart commented Feb 19, 2018

... and mind that these are "zip" files, not "gz" files.
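
Putting both remarks together, the two location strings would be built along these lines. This is only a string-building sketch: ZipLocations and its methods are hypothetical names, the "!*.bin" pattern is copied from the snippet above, and nothing here checks that DKPro Core accepts the result.

```java
// Hypothetical helper that only assembles location strings; it does not call DKPro Core.
public class ZipLocations {

    /** Target location for the writer: "jar:file:" prefix plus a .zip archive path. */
    static String writerTarget(String folder, String archiveName) {
        return "jar:file:" + folder + "/" + archiveName;
    }

    /** Source location for the reader: same prefix and archive, then "!" and a pattern. */
    static String readerSource(String folder, String archiveName, String pattern) {
        return writerTarget(folder, archiveName) + "!" + pattern;
    }

    public static void main(String[] args) {
        String folder = "/tmp/output"; // assumed example path
        System.out.println(writerTarget(folder, "data.zip"));
        System.out.println(readerSource(folder, "data.zip", "*.bin"));
    }
}
```

The point is that writer and reader share the same prefix and the same .zip name, so deriving both strings from one place avoids the mismatch seen above.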

@Horsmann Horsmann modified the milestones: 1.0.0, 1.1.0 Apr 11, 2018