WARCHdfsBolt writes zero byte files #596
As discussed in a Stack Overflow thread, I tried to create a project based on storm-crawler-archetype 1.10 that emits WARC files. Unfortunately, these WARC files always come out empty, at 0 bytes.
I created a repo that shows the setup. I also tried applying the suggestions that @jnioche gave me (adding a FileTimeSizeRotationPolicy set to rotate every 10 seconds and 10 KB, and setting a new CountSyncPolicy(1)), to no avail.
The command I use to run this example is
However, when I either lowered the file size threshold to a very small value such as 4 KB or used the MemoryStatusUpdater for recursion, valid single-page archives started to appear. It seems that the flushing behavior might still be somewhat random.
Hi @keyboardsamurai, to get the sync working you need to configure HDFS like so
This uses the RawLocalFileSystem, which, unlike the checksum one used by default, does a proper sync of the content to the file.
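For anyone following along, here is a minimal sketch of what such an override could look like when built programmatically. It assumes the Hadoop property `fs.file.impl` is what selects the local filesystem implementation, and that the bolt is pointed at the nested map by its key (storm-hdfs bolts expose `withConfigKey` for this). The key name `warc` and the class `WarcFsConfig` are made up for illustration, and this is not necessarily the exact configuration meant above.

```java
import java.util.HashMap;
import java.util.Map;

class WarcFsConfig {
    // Hypothetical key under which the Hadoop overrides are stored
    // in the topology configuration.
    static final String CONFIG_KEY = "warc";

    // Hadoop overrides: use RawLocalFileSystem for file:// paths instead of
    // the default checksummed local filesystem.
    static Map<String, Object> hdfsOverrides() {
        Map<String, Object> overrides = new HashMap<>();
        overrides.put("fs.file.impl", "org.apache.hadoop.fs.RawLocalFileSystem");
        return overrides;
    }

    // Topology configuration holding the overrides under CONFIG_KEY.
    // Assumption: the WARC bolt is then told to read them with
    // something like warcBolt.withConfigKey(CONFIG_KEY).
    static Map<String, Object> topologyConf() {
        Map<String, Object> conf = new HashMap<>();
        conf.put(CONFIG_KEY, hdfsOverrides());
        return conf;
    }

    public static void main(String[] args) {
        System.out.println(topologyConf());
    }
}
```

The same nested map could equally be declared in the crawler's YAML configuration; the point is only that the bolt has to be told which key to read.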
This seems to work with SC 1.8. The latest version of SC is broken and does not generate a proper gzip.
The cases where you did get valid archives worked because the rotation had time to kick in as new URLs were coming through and/or the size threshold was low enough.
I have found the cause of the problem and fixed it. It had to do with the compression of the entries. We should now get a valid gzip regardless of whether the write is triggered by a sync or a rotation.
Thanks @keyboardsamurai for reporting it. Please give the fix a try if you can.
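To illustrate why the way entries are compressed matters here, below is a rough sketch, not the actual StormCrawler fix. It assumes each entry is compressed as its own complete gzip member and the members are appended to the file one after another, which is the usual layout for compressed WARC. Because every member is finished before the next one starts, the file is a valid gzip at any record boundary, whether the bytes reach disk through a sync or a rotation. The class and method names are made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipMembers {
    // Compress one record as a complete, self-contained gzip member.
    static byte[] compressRecord(String record) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(record.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Closing the stream wrote the gzip trailer, so the member is complete.
        return buf.toByteArray();
    }

    // Append the members back to back, as a writer would append to a file.
    static byte[] writeFile(String... records) {
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        for (String record : records) {
            byte[] member = compressRecord(record);
            file.write(member, 0, member.length);
        }
        return file.toByteArray();
    }

    // Read the whole file back; GZIPInputStream continues across members.
    static String readAll(byte[] file) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(file))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] file = writeFile("first record payload\n", "second record payload\n");
        System.out.print(readAll(file));
    }
}
```

A file cut off after any whole member still decompresses cleanly, which is the property you want when a sync can happen between two records.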