Upgrade WARC module to 1.1.0 version of storm-hdfs, fixes #510

Merged
merged 1 commit into from Dec 15, 2017

@sebastian-nagel
Collaborator

sebastian-nagel commented Dec 14, 2017

Tested so far at small scale:

  • WARC files validate
  • file rotation by size works

A large-scale test will follow in the next few days.

A first attempt to implement a class GzipHdfsWriter extending AbstractHDFSWriter (AbstractHdfsBolt.makeNewWriter(path, tuple) needs to return an AbstractHDFSWriter) wasn't successful:

  • the field offset in AbstractHDFSWriter is package-private, visible only within org.apache.storm.hdfs.common, so it cannot be accessed from a StormCrawler package
  • the method write(tuple), which accesses the package-private offset, is final:
abstract public class AbstractHDFSWriter {

    long offset;

    final public long write(Tuple tuple) throws IOException{
        doWrite(tuple);
        this.needsRotation = rotationPolicy.mark(tuple, offset);

        return this.offset;
    }
}

I didn't find a way to operate directly on the output stream and offset while still extending the classes in org.apache.storm.hdfs.common. That's why compression is moved to a GzipRecordFormat. As a consequence, the option to do the rotation on uncompressed offsets (content length) is dropped.
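To illustrate the idea of compressing at the record level, here is a minimal sketch that uses only java.util.zip. It is not the GzipRecordFormat from this PR: the class GzipRecordSketch and its methods are hypothetical names chosen for the example. Each record becomes its own gzip member, so the only byte count a writer sees is the compressed length, which is why size-based rotation would then work on compressed offsets.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration, not code from storm-hdfs or StormCrawler.
final class GzipRecordSketch {

    /** Compresses one record into a standalone gzip member. */
    static byte[] gzip(byte[] record) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(record);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Decompresses a sequence of concatenated gzip members. */
    static byte[] gunzip(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream in =
                new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String[] records = {
            "WARC/1.0\r\nWARC-Type: warcinfo\r\n\r\n",
            "WARC/1.0\r\nWARC-Type: response\r\n\r\n<html>example</html>\r\n\r\n"
        };
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        long offset = 0; // counts compressed bytes only
        for (String record : records) {
            byte[] member = gzip(record.getBytes(StandardCharsets.UTF_8));
            file.write(member, 0, member.length);
            offset += member.length;
            System.out.println("compressed offset after record: " + offset);
        }
        // The concatenated members decompress back to the original records.
        String restored =
            new String(gunzip(file.toByteArray()), StandardCharsets.UTF_8);
        System.out.println(restored.equals(String.join("", records)));
    }
}
```

In this sketch the uncompressed length never reaches the code that tracks the offset, which matches the trade-off described above.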

@jnioche jnioche added this to the 1.8 milestone Dec 14, 2017

@jnioche jnioche added the warc label Dec 14, 2017

@jnioche jnioche self-assigned this Dec 14, 2017

@jnioche jnioche merged commit 0afe3ed into DigitalPebble:master Dec 15, 2017

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
@jnioche

Member

jnioche commented Dec 15, 2017

looks good, thanks @sebastian-nagel

@jnioche jnioche changed the title from Upgrade WARC module to latest version of storm-hdfs, fixes #590 to Upgrade WARC module to latest version of storm-hdfs, fixes #510 Dec 15, 2017

@sebastian-nagel sebastian-nagel deleted the sebastian-nagel:sc-510-warc-storm-hdfs-upgrade branch Dec 15, 2017

@jnioche jnioche changed the title from Upgrade WARC module to latest version of storm-hdfs, fixes #510 to Upgrade WARC module to 1.1.0 version of storm-hdfs, fixes #510 Mar 20, 2018
