sarra winnow option sum n bug #377
Comments
Thanks jun, please:
then it can be easily analyzed.
OK, I tried it... first problem: 'is' is a syntax error, replaced with ==.
For the checksum algorithm to work with partitioned files, it should not prevent publication of parts of a file when the partition size is different. However, partitioned files are very rarely used currently. For a file published in a single part, the whole point of 'n' as a checksum choice is to avoid considering data changes, but when there are no partitions, a data change results in the blocksize being different in the 'partflg'... hmm... options:
not sure..
OK, have a fix for that now... (for v2... discovered v3 is very messed up... adding that tag here to track fixing that, but v2 should be OK now.)
v3 fix is still outstanding.
When using the option sum n in the messages, we want winnow to publish only the first message coming from the source when the file names are the same. The problem is that if the content is different, winnow publishes all the messages even though the files have the same name.
Ex:
2021-06-08 19:47:04,444 [INFO] post_log notice=20210608194659.733278513 sftp://xxx@xxx/ /tmp/test3 headers={'mtime': '20210608194653.442133427', 'to_clusters': 'xxx', 'mode': '664', 'source': 'feeder', 'atime': '20210608194653.442133427', 'parts': '1,12,1,0,0', 'sum': 'n,8ad8757baa8564dc136c1e07507f4a98', 'from_cluster': 'xxx'}
...
2021-06-08 19:47:04,449 [DEBUG] sr_config set_sumalgo n
2021-06-08 19:47:04,449 [DEBUG] notice 20210608194703.711236954 sftp://xxx@xxx/ /tmp/test3
2021-06-08 19:47:04,449 [DEBUG] urlstr sftp://xxx@xxx//tmp/test3
2021-06-08 19:47:04,449 [DEBUG] Received notice 20210608194703.711236954 sftp://xxx@xxx//tmp/test3
...
2021-06-08 19:47:04,450 [DEBUG] sr_cache check basis=path
2021-06-08 19:47:04,450 [DEBUG] sum already in cache: key=n,8ad8757baa8564dc136c1e07507f4a98
2021-06-08 19:47:04,451 [DEBUG] added value=/tmp/test3*1,5,1,0,0
2021-06-08 19:47:04,451 [DEBUG] new entry, not a part: part=1,5,1,0,0
...
2021-06-08 19:47:04,451 [DEBUG] sr_winnow on_post
2021-06-08 19:47:04,451 [INFO] post_log notice=20210608194703.711236954 sftp://xxx@xxx/ /tmp/test3 headers={'mtime': '20210608194641.602636099', 'to_clusters': 'xxx', 'mode': '644', 'source': 'feeder', 'atime': '20210608194641.602636099', 'parts': '1,5,1,0,0', 'sum': 'n,8ad8757baa8564dc136c1e07507f4a98', 'from_cluster': 'xxx'}
That's the related code in sr_cache.py:
self.logger.debug("sum already in cache: key={}".format(key))
kdict = self.cache_dict[key]
present = value in kdict
kdict[value] = now
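To illustrate why that lookup misses for the two posts in the log, here is a minimal standalone sketch. It assumes, based only on the debug lines above, that the cache key is the sum header and the cached value is the path joined to the parts header with `*`. The `check` function and the module-level `cache_dict` are hypothetical stand-ins, not the actual sr_cache code.

```python
import time

# Hypothetical stand-in for the cache: {sum_key: {"path*parts": timestamp}}
cache_dict = {}

def check(key, path, part):
    """Return True if this (key, path*part) pair was already cached."""
    value = "{}*{}".format(path, part)
    now = time.time()
    if key not in cache_dict:
        # First time this sum is seen: cache it and report "new".
        cache_dict[key] = {value: now}
        return False
    kdict = cache_dict[key]
    present = value in kdict   # compares path AND parts
    kdict[value] = now
    return present

key = "n,8ad8757baa8564dc136c1e07507f4a98"
first = check(key, "/tmp/test3", "1,12,1,0,0")   # 12-byte version of the file
second = check(key, "/tmp/test3", "1,5,1,0,0")   # 5-byte version, same name
```

Both calls report the entry as new, because the parts header differs once the file size changes, so the second message is published as well.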
Adding the following if condition could solve the problem:
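The condition itself is not reproduced in this copy of the issue. As a rough sketch of the idea only, assuming the value layout `path*parts` seen in the debug log, a check for sum n could compare the path alone and ignore the parts header. The function `check_sum_n` and its arguments are hypothetical and are not the fix that was merged.

```python
import time

def check_sum_n(cache_dict, key, path, part, now=None):
    """Return True if the entry should be treated as a duplicate.

    Assumption: values are stored as "path*parts". For keys starting
    with "n," only the path is compared, so a size change alone does
    not make the entry look new.
    """
    now = time.time() if now is None else now
    value = "{}*{}".format(path, part)
    kdict = cache_dict.setdefault(key, {})
    if key.startswith("n,"):
        # Ignore the parts suffix when the checksum is name-based.
        present = any(v.rsplit("*", 1)[0] == path for v in kdict)
    else:
        present = value in kdict
    kdict[value] = now
    return present

cache = {}
key = "n,8ad8757baa8564dc136c1e07507f4a98"
first = check_sum_n(cache, key, "/tmp/test3", "1,12,1,0,0")
second = check_sum_n(cache, key, "/tmp/test3", "1,5,1,0,0")
```

With this sketch the first post is reported as new and the second as a duplicate, which matches the behaviour requested for sum n. Other checksum algorithms keep the existing path-and-parts comparison.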