Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sr_poll posts old files when the timestamp format of the file changes #366

Closed
habilinour opened this issue Dec 29, 2020 · 45 comments
Closed
Assignees
Labels
bug Something isn't working v3 issue deferred to or affects version 3

Comments

@habilinour
Copy link

If the timestamp of a file is in the past six months, ls displays the date and the time. If it's longer ago than six months or if it's in the future, ls displays the date and the year as documented in:
http://www.gnu.org/software/coreutils/manual/coreutils.html#Formatting-file-timestamps
Some files could have the timestamp change during the day if the timestamp becomes older than 6 months. In this case, sr_poll will then consider them as new and publishes them even though they were already published
Here is an example where the timestamp format changed and the file is considered as new and published.

2020-12-28 13:11:28,127 [DEBUG] sr_ftp line_callback -rw-rw-r--    1 500        510           2881804 Jul  1 07:10 ims2020182_1km_GIS_v1.3.tif.gz
2020-12-28 13:16:29,291 [DEBUG] sr_ftp line_callback -rw-rw-r--    1 500        510           2881804 Jul  1  2020 ims2020182_1km_GIS_v1.3.tif.gz

It could eventually be fixed it if sr_poll uses ls with --full-time option to keep the same timestamp format and avoid posting old files.

@habilinour habilinour added the bug Something isn't working label Dec 29, 2020
@petersilva
Copy link
Contributor

what is being polled?

  • with FTP, there is no option to change the date format. receiving ascii text from server... defined by server.
  • with sftp, similarly do not have a shell. do not know if sftp service honours any options.
  • with SFTP if not dealing with a restricted sftp, server, we might be able to do this.
  • for 'file' url, should be using watch instead, not poll.

In most situation, cannot select the different format. What protocol poll are you doing?

@habilinour
Copy link
Author

it's using ftp.

@petersilva
Copy link
Contributor

so all we have is ASCII. we would need to improve the parser, but the server can produce any format they like. it is not standardized. On windows, it is even more entertaining.

@habilinour
Copy link
Author

That would be a big challenge if we could not get the details from the remote server.

@petersilva
Copy link
Contributor

petersilva commented Dec 30, 2020

Should we just have an upper bound on the age of files... if something is older than that... ignore it.. would such a setting help in this case? If it is > 1 month old, we purposefully ignore it.

@habilinour
Copy link
Author

It makes sense to ignore files older than specific days. Actually that would be a good enhancement because old files usually cause some backlog and some client applications do not deal with old data since they were not expecting to have it. This would also help not to pull lot of old files when a cache is not set yet especially when a poll is started the first time.

cleaaum added a commit that referenced this issue Jun 15, 2021
Skip file dates older than a certain given time (2 months for ex).  #366
cleaaum added a commit that referenced this issue Jun 15, 2021
cleaaum added a commit that referenced this issue Jun 21, 2021
…e = 60 days, but can be changed in config file to anything else. (value no longer hardcoded) #366
@petersilva
Copy link
Contributor

In the rest of the configuration language, all time intervals are specified in seconds, with an optional unit suffix for minutes, hours, days, possibly weeks. The code used to have varying units in different locations (for example for log rotation, in days, expiry in milliseconds) and it did confuse people in practice. There is a to_duration entry_point somewhere that you just supply a string
and it spits out floating point seconds... you can then convert that as needed.,.. for the date comparisons. There may be cases in future... when very high number of files, that we want to ignore stuff less than a day old. maybe an hour old....

cleaaum added a commit that referenced this issue Jul 7, 2021
…et the default duration from days to seconds instead. Also updated the en and fr doc for sr_poll.1.rst with the updated functionality #366
@petersilva
Copy link
Contributor

In the documentation, the summary gives the default is 3600 seconds, which is one hour, but the text says it is a month. Would be best to reconcile those. There is also a change to printing in sr.py that I don't think has anything to do with this patch.

@cleaaum
Copy link
Contributor

cleaaum commented Jul 19, 2021

The text says it is two months, and yes I realized that mistake! but already made a patch for it with default 60d instead of 3600 (see next two commits right after). And yes the modification for sr.py was for a different issue, and accidentally modified it on this branch.

@petersilva
Copy link
Contributor

merged great!

@petersilva petersilva added the v3 issue deferred to or affects version 3 label Jul 21, 2021
@petersilva
Copy link
Contributor

great for v2. needs forward port to v3.

@petersilva petersilva reopened this Jul 21, 2021
cleaaum added a commit that referenced this issue Aug 4, 2021
…than a certain time limit are ignored. default value: file_time_limit = "60d". #366
@petersilva petersilva added this to Blocker in sr3 stable release Aug 26, 2021
@petersilva
Copy link
Contributor

testing from @junmtl:

Test for polling of old files when date format on remote server changes: file_time_limit only works with interger.

file_time_limit 24 works
2021-08-25 16:22:47,789 [INFO] File date is: 2021-08-17 00:55:00 > File is 30732 seconds old
2021-08-25 16:22:47,790 [INFO] File should be skipped

file_time_limit 24h failed
2021-08-25 16:22:59,599 [INFO] File date is: 2021-08-17 00:55:00 > File is 30720 seconds old
2021-08-25 16:22:59,599 [INFO] File should be processed

file_time_limit 1d failed
2021-08-25 16:25:50,993 [INFO] File date is: 2021-08-17 00:55:00 > File is 30549 seconds old
2021-08-25 16:25:50,993 [INFO] File should be processed

@petersilva
Copy link
Contributor

Playing with flow tests, found some issues:

  • when the try/except falls through, you end up with a print. It should be a logger.warning( ... or logger.error( the print messages without saying severity or location in the code were a bit mystifying. After that change, saw message below:
  • so you probably need to add a new format.
  • And when you fail to parse the date, probably should assume a good date, since we didn't understand it. Otherwise we will never be able to poll sites whose formats we don't understand. I mean in the case where you fall through to the error, probably should return True if you don't understand the date.

2021-09-08 00:39:09,446 [DEBUG] sarracenia.flow.poll differ_ls_file File should be skipped
2021-09-08 00:39:09,446 [ERROR] sarracenia.flow.poll _file_date_within_limit time data '11 Mar 2021' does not match format '%b %d %Y'
2021-09-08 00:39:09,446 [DEBUG] sarracenia.flow.poll differ_ls_file File should be skipped
2021-09-08 00:39:09,446 [ERROR] sarracenia.flow.poll _file_date_within_limit time data '11 Mar 2021' does not match format '%b %d %Y'
2021-09-08 00:39:09,446 [DEBUG] sarracenia.flow.poll differ_ls_file File should be skipped
2021-09-08 00:39:09,446 [ERROR] sarracenia.flow.poll _file_date_within_limit time data '11 Mar 2021' does not match format '%b %d %Y'
2021-09-08 00:39:09,446 [DEBUG] sarracenia.flow.poll differ_ls_file File should be skipped
2021-09-08 00:39:09,446 [ERROR] sarracenia.flow.poll _file_date_within_limit time data '11 Mar 2021' does not match format '%b %d %Y'
2021-09-08 00:39:09,446 [DEBUG] sarracenia.flow.poll differ_ls_file File should be skipped
2021-09-08 00:39:09,446 [ERROR] sarracenia.flow.poll _file_date_within_limit time data '11 Mar 2021' does not match format '%b %d %Y'
2021-09-08 00:39:09,446 [DEBUG] sarracenia.flow.poll differ_ls_file File should be skipped
2021-09-08 00:39:09,446 [ERROR] sarracenia.flow.poll _file_date_within_limit time data '11 Mar 2021' does not match format '%b %d %Y'
2021-09-08 00:39:09,446 [DEBUG] sarracenia.flow.poll differ_ls_file File should be skipped
2021-09-08 00:39:09,446 [ERROR] sarracenia.flow.poll _file_date_within_limit time data '11 Mar 2021' does not match format '%b %d %Y'

petersilva added a commit that referenced this issue Sep 8, 2021
…her format. added it, and

always use logger for messages.
@petersilva
Copy link
Contributor

quick patch, so that tests can still run... but wonder if the last ditch message should change... it's just printing that it wasn't in the last guessed format... it should say something like "unrecognized date format for: dos de mayo 1814"

@petersilva
Copy link
Contributor

It would be great to include TIMEZONE conversion into the options as well. need to CONVERT_TO utc

@cleaaum
Copy link
Contributor

cleaaum commented Oct 4, 2021

Added the above with an option for TIMEZONE and CONVERT_TO is always UTC. by default TIMEZONE is UTC. Accidentally referenced the issue as 3660 instead of 366 in the commit message.

@petersilva
Copy link
Contributor

can you fix this:

fractal% git status
On branch issue366_plugin
Your branch is up to date with 'origin/issue366_plugin'.
nothing to commit, working tree clean
fractal% 
fractal% git merge main
Auto-merging sarra/sr_poll.py
CONFLICT (content): Merge conflict in sarra/sr_poll.py
Auto-merging sarra/sr_config.py
Automatic merge failed; fix conflicts and then commit the result.
fractal% 

also you added 'timezone' option... but there isn't a man page entry for it.
don't know if it should have a different name... like 'destination_timezone' ?
to match the 'destination' keyword...

I still only get 26/30 tests passing... but I'd rather look into it when there is something that can be merged.

@petersilva
Copy link
Contributor

I confirmed the TZ conversion stuff works perfectly... that's exciting!

@petersilva
Copy link
Contributor

you also need to add dateparser to install_requires in setup.py as we have a new dependency.

cleaaum added a commit that referenced this issue Oct 4, 2021
@cleaaum
Copy link
Contributor

cleaaum commented Oct 4, 2021

The commit above includes

  • new branch: hopefully that solves the merge conflict (re did everything from the beginning)
  • added dateparser to setup.py
  • renamed timezone to destination_timezone
  • flow test passing with 30/30
  • I still need to add the destination_timezone to the man page

@petersilva
Copy link
Contributor

what is the new branch called, and has it been pushed ?

@petersilva
Copy link
Contributor

never mind... found it.

@petersilva
Copy link
Contributor

Great! Ran static/flakey/dynamic... and everything was clean.
Merged.

@petersilva
Copy link
Contributor

significant changes made in v3... suggest starting a new branch for there also... or at least make sure you git pull before you start.

@petersilva
Copy link
Contributor

merged 863f042 for v2 man pages.

@petersilva
Copy link
Contributor

for v3, there is significant inter-relationship with #394 refactor of poll to use the duplicate suppression cache. We don't just need to compare dates, but also need to use it to store checksum data, in order to fill 'mtime' field for message for cache check purposes.

instead of a _file_date_within_limits function, need to figure out the mtime of the file
(in a numeric format... call it fltmtim - floating point mtime) then do the comparion,
and if it is within the time limit ... as is already done, but then keep that floating point mtime to create an 'mtime' field for the message (using sarracenia.timeflt2str( fltmtime ) ) to use for storing in the reception cache.

@petersilva
Copy link
Contributor

I'm wondering if the output format from the line_date should not be mtime string... one field...
hmm...

petersilva added a commit that referenced this issue Oct 9, 2021
file mtime, or file size for duplicate suppression, if available.
this patch is relevant to both #366 and #394
@petersilva
Copy link
Contributor

the "standard" date-time format within Sarracenia v3 is YYYYMMDD "T" HHMMSS "." ... you can see it in pubTime, mtime, and atime fields. probably the format to normalize to is that. I've mocked in the point where 'mtime' needs to be known... so you can adjust the dateparsing stuff as necessary... maybe make the desc line have a standard date-time? patch above...

cleaaum added a commit that referenced this issue Oct 19, 2021
… the SFTPattributes object in the poll - the old function can be deleted now. #366
cleaaum added a commit that referenced this issue Oct 25, 2021
cleaaum added a commit that referenced this issue Nov 1, 2021
petersilva pushed a commit that referenced this issue Nov 17, 2021
…object for v3. #366

Now calculating how old a file is by accessing the mtime attribute of the SFTPattributes object in the poll - the old function can be deleted now. #366

Modified logic of plugin. Manually setting attributes instead of using os.stat() for SFTPattributes. #366

improved previous patch and added options for time zone and nodupe_file_time_limit. #366

Updated documentation and removed unnecessary ifs. Assuming all lines in poll are now in SFTPattributes format. #366
@petersilva
Copy link
Contributor

OK I rebased the above patches into a single one, after running static and flakey tests successfully (having merged in from v03_wip first) dynamic gets 20/30 now... as it does for everyone. Merged.

@petersilva
Copy link
Contributor

looking at the code... still some weirdnesses... discussing offline.

@petersilva
Copy link
Contributor

a4016c1
last patch for v03:

  • nodupe_fileAgeMaximum implemented.
  • check for presence of 'mtime' field, because of crashes in dynamic_flow.
  • set default in all components except poll to 0 (disabled.) in poll, it is a month.

sr3 stable release automation moved this from Blocker to Done Dec 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working v3 issue deferred to or affects version 3
Projects
Development

No branches or pull requests

3 participants