getValidSnapshots does not find date from html code #258

Closed
jarauh opened this issue Nov 24, 2017 · 5 comments


jarauh commented Nov 24, 2017

With a custom snapshot server, in the call to getValidSnapshots, text[idx] becomes:

c("<tr><td valign=\"top\"><img src=\"/icons/folder.gif\" alt=\"[DIR]\"></td><td><a href=\"2017-11-08/\">2017-11-08/</a>            </td><td align=\"right\">2017-11-08 18:17  </td><td align=\"right\">  - </td><td>&nbsp;</td></tr>",  "<tr><td valign=\"top\"><img src=\"/icons/folder.gif\" alt=\"[DIR]\"></td><td><a href=\"2017-11-09/\">2017-11-09/</a>            </td><td align=\"right\">2017-11-09 08:44  </td><td align=\"right\">  - </td><td>&nbsp;</td></tr>" )

The gsub statement then fails to isolate the checkpoint dates, because the lines do not start with "<a href..."; they start with "<tr><td ..." instead, as the output above shows.
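To illustrate the mismatch, here is a minimal sketch in base R that pulls the dates out of the `href` attribute wherever it appears in the line, instead of assuming the line starts with the anchor tag. The helper name `extract_snapshot_dates` is hypothetical and not part of checkpoint, and the second sample line is an invented example of a listing row that does start with the anchor tag; it only assumes that each snapshot folder is linked as `href="YYYY-MM-DD/"`.

```r
# Hypothetical helper (not checkpoint's actual code): find snapshot dates in
# directory-listing lines, regardless of what markup precedes the link.
extract_snapshot_dates <- function(lines) {
  # Assumption: snapshot folders are linked as href="YYYY-MM-DD/"
  pattern <- "href=\"[0-9]{4}-[0-9]{2}-[0-9]{2}/?\""
  # regmatches() with regexpr() returns only the lines that matched
  hits <- regmatches(lines, regexpr(pattern, lines))
  # 'href="' is 6 characters long, so the date occupies characters 7 to 16
  dates <- substr(hits, 7, 16)
  sort(unique(dates))
}

lines <- c(
  # Apache-style table row, shortened from the output above
  "<tr><td valign=\"top\"></td><td><a href=\"2017-11-08/\">2017-11-08/</a></td></tr>",
  # Invented row that begins directly with the anchor tag
  "<a href=\"2017-11-09/\">2017-11-09/</a>  09-Nov-2017 08:44  -"
)

extract_snapshot_dates(lines)
# [1] "2017-11-08" "2017-11-09"
```

This is still regex over HTML, so it would break on listings that link the folders differently (for example with absolute URLs).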

@hongooi73
Contributor

The best solution here is probably "don't mess with the standard layout"

@jarauh
Author

jarauh commented Mar 27, 2020

Dear @Hong-Revo, thank you for your timely, friendly comment. Could you please specify what you mean by "standard layout"?

I guess the problem is that our local Apache server has a different (but standard, I assure you!) configuration from whatever web server the default MRAN is using. Therefore, when I point my browser at the right snapshot address on our local server, the page looks different (and has a different HTML structure) than when I point my browser at https://mran.microsoft.com/snapshot/.

I don't know of a good, robust way to query subfolders from a webpage. Just reading the directory listing itself may lead to unexpected results.

@hongooi73
Contributor

Standard as in whatever mran.microsoft.com does.

The code for listing snapshots could be made smarter, but I'm not going to go down the rabbit hole of parsing HTML with regex. The good news is that the next version of checkpoint won't try to get the list of snapshots every time you run checkpoint, so this issue shouldn't be a blocker.

@jarauh
Author

jarauh commented Mar 29, 2020

Fair enough. Is there any documentation of "whatever mran.microsoft.com does"?

@jarauh
Author

jarauh commented Mar 29, 2020

One could also say that getting a list of subdirectories via HTTP is flawed right from the beginning. The question is how to easily work around that. The obvious solution would be to use FTP instead of HTTP.

Another way would be to add an additional file that contains index information about available checkpoints. Of course, that file would have to be maintained, adding an overhead.
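As a rough sketch of the index-file idea, assuming a plain-text file with one `YYYY-MM-DD` date per line and `#` for comments (a format invented here for illustration, not something checkpoint or MRAN defines), reading it in base R could look like this. The function name `read_snapshot_index` is hypothetical.

```r
# Hypothetical reader for an invented index format: one snapshot date per
# line, blank lines and lines starting with "#" ignored.
read_snapshot_index <- function(con) {
  entries <- trimws(readLines(con, warn = FALSE))
  entries <- entries[nzchar(entries) & !startsWith(entries, "#")]
  # Entries that do not parse as YYYY-MM-DD become NA and are dropped
  dates <- as.Date(entries, format = "%Y-%m-%d")
  sort(dates[!is.na(dates)])
}

# A text connection stands in for the file served next to the snapshots
index <- textConnection(c(
  "# hypothetical snapshot index",
  "2017-11-09",
  "2017-11-08",
  "not-a-date"
))
snapshots <- read_snapshot_index(index)
close(index)

format(snapshots)
# [1] "2017-11-08" "2017-11-09"
```

Since `readLines()` also accepts a `url()` connection, the same function could read the file over HTTP, and the result would no longer depend on how the web server renders its directory listing.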
