0 ROAs from ripe or apnic -- how to troubleshoot? #246

wmclendon · 2019-11-24T19:04:00Z

Hello,

I have two installations of routinator (just upgraded, but this issue predates the upgrade I'm almost certain). somewhere along the way, it seems my instances are no longer able to get ROAs from one or more regions. One instance reports 0 ROAs from apnic and ripe, and another 0 from just apnic. example output from the /metrics:

# TYPE routinator_valid_roas gauge
routinator_valid_roas{tal="arin"} 5878
routinator_valid_roas{tal="apnic"} 0
routinator_valid_roas{tal="lacnic"} 2767
routinator_valid_roas{tal="afrinic"} 590
routinator_valid_roas{tal="ripe"} 13091

# TYPE routinator_valid_roas gauge
routinator_valid_roas{tal="arin"} 5878
routinator_valid_roas{tal="lacnic"} 2767
routinator_valid_roas{tal="apnic"} 0
routinator_valid_roas{tal="ripe"} 0
routinator_valid_roas{tal="afrinic"} 590

i can't seem to find any logs to see why one or more seem to be failing. Looking at other metrics being reported it looks like rsync exit codes 0 for all of the repositories (and it seems RRDP is failing with code -1 for the ones it is attempting).

Any feedback as to how to troubleshoot or where to look would be appreciated!

Thanks!

The text was updated successfully, but these errors were encountered:

partim · 2019-11-24T19:36:50Z

APNIC and RIPE are the two regions that use RRDP, so it probably is related to that. You can verify by running Routinator with the --disable-rrdp option. If that gets you a full set of ROAs, then indeed RRDP is the problem. In that case, there should be an error message in the logs. Where does are depends a bit on how you run Routinator. In server mode it will prefer syslog, in vrps mode it will prefer to print to stderr.

wmclendon · 2019-11-24T19:45:10Z

thanks @partim I will give that a try. Is there anything I need add / install to enable RRDP? In my config file I have 'disable-rrdp = false'. I run routinator with routinator server -d at present with a .routinator.conf file. The log level specified in the config file is WARN -- do I need to adjust that to see the relevant log messages?

wmclendon · 2019-11-24T19:59:11Z

i updated my config file to log to local file and level INFO, and also disabled rrdp, and i see in the log all the rsync messages seemingly complete. and indeed, on /metrics I now see ROAs for all the RIRs:

# HELP routinator_valid_roas number of valid ROAs seen
# TYPE routinator_valid_roas gauge
routinator_valid_roas{tal="arin"} 5878
routinator_valid_roas{tal="lacnic"} 2767
routinator_valid_roas{tal="apnic"} 5202
routinator_valid_roas{tal="ripe"} 13091
routinator_valid_roas{tal="afrinic"} 590

I will try re-enabling RRDP and see if any logs point direction to why it fails.

wmclendon · 2019-11-24T20:06:23Z

well, now i'm thoroughly confused, as it seems to have everything working, at least for now (maybe some are cached from the prior rsync run?). info from metrics after restarting with RRDP enabled again:

# HELP routinator_valid_roas number of valid ROAs seen
# TYPE routinator_valid_roas gauge
routinator_valid_roas{tal="arin"} 5878
routinator_valid_roas{tal="lacnic"} 2767
routinator_valid_roas{tal="apnic"} 5202
routinator_valid_roas{tal="ripe"} 13091
routinator_valid_roas{tal="afrinic"} 590

# HELP routinator_rsync_status exit status of rsync command
# TYPE routinator_rsync_status gauge
routinator_rsync_status{uri="rsync://rpki.arin.net/repository/"} 0
routinator_rsync_status{uri="rsync://repository.lacnic.net/rpki/"} 0
routinator_rsync_status{uri="rsync://rpki.apnic.net/repository/"} 0
routinator_rsync_status{uri="rsync://rpki-repository.nic.ad.jp/ap/"} 0
routinator_rsync_status{uri="rsync://rpki.ripe.net/ta/"} 0
routinator_rsync_status{uri="rsync://ca.rg.net/rpki/"} 0
routinator_rsync_status{uri="rsync://rpki.afrinic.net/repository/"} 0

# HELP routinator_rrdp_status status code for getting notification file
# TYPE routinator_rrdp_status gauge
routinator_rrdp_status{uri="https://rrdp.apnic.net/notification.xml"} 200
routinator_rrdp_status{uri="https://rrdp.ripe.net/notification.xml"} 200
routinator_rrdp_status{uri="https://rrdp.twnic.tw/rrdp/notify.xml"} 200
routinator_rrdp_status{uri="https://rpki.idnic.net/rrdp/notify.xml"} 200
routinator_rrdp_status{uri="https://rpki.cnnic.cn/rrdp/notify.xml"} 200
routinator_rrdp_status{uri="https://ca.rg.net/rrdp/notify.xml"} -1

for ca.rg.net RRDP failure, this was seen in the log:

[2019-11-24 20:01:25] routinator::repository Found valid trust anchor rsync://rpki.ripe.net/ta/ripe-ncc-ta.cer. Processing.
[2019-11-24 20:01:25] routinator::rrdp::server RRDP https://rrdp.ripe.net/notification.xml: Updating server
[2019-11-24 20:01:25] routinator::rrdp::server New session. Need to get snapshot.
[2019-11-24 20:01:25] routinator::rrdp::server RRDP https://rrdp.ripe.net/notification.xml: updating from snapshot.
[2019-11-24 20:02:09] routinator::rrdp::server RRDP https://ca.rg.net/rrdp/notify.xml: Updating server
[2019-11-24 20:02:09] routinator::rrdp::http https://ca.rg.net/rrdp/notify.xml: https://ca.rg.net/rrdp/notify.xml: error trying to connect: invalid certificate: UnknownIssuer
[2019-11-24 20:02:09] routinator::rrdp::server Marking RRPD server 'https://ca.rg.net/rrdp/notify.xml' as unusable
[2019-11-24 20:02:09] routinator::rsync rsyncing from rsync://ca.rg.net/rpki/.
[2019-11-24 20:02:09] routinator::rsync rsync://ca.rg.net/rpki: Running command "rsync" "--contimeout=10" "--timeout=300" "-rltz" "--delete" "rsync://ca.rg.net/rpki/" "/home/layer3/.rpki-cache/repository/rsync/ca.rg.net/rpki/"
[2019-11-24 20:02:26] routinator::rsync rsync://ca.rg.net/rpki: successfully completed.

ill go through the same process on my other instance and see if it recovers for the time being. i'll check it again tomorrow and see if it still reports data from all the RIRs.

EDIT: the above output was when I ran routinator server -- if I re-run as routinator server -d, the RRDP is timing out:

[2019-11-24 20:08:57] routinator::rrdp::server RRDP https://rrdp.ripe.net/notification.xml: Updating server
[2019-11-24 20:09:27] routinator::rrdp::http https://rrdp.ripe.net/notification.xml: https://rrdp.ripe.net/notification.xml: timed out
[2019-11-24 20:09:32] routinator::rrdp::server RRDP https://ca.rg.net/rrdp/notify.xml: Updating server
[2019-11-24 20:10:02] routinator::rrdp::http https://ca.rg.net/rrdp/notify.xml: https://ca.rg.net/rrdp/notify.xml: timed out
[2019-11-24 20:10:02] routinator::rrdp::server Marking RRPD server 'https://ca.rg.net/rrdp/notify.xml' as unusable

this was within minutes of it seeming to succeed in the prior output -- and the metrics for rrdp status all report -1 again:

# HELP routinator_rrdp_status status code for getting notification file
# TYPE routinator_rrdp_status gauge
routinator_rrdp_status{uri="https://rpki.idnic.net/rrdp/notify.xml"} -1
routinator_rrdp_status{uri="https://rrdp.apnic.net/notification.xml"} -1
routinator_rrdp_status{uri="https://rrdp.ripe.net/notification.xml"} -1
routinator_rrdp_status{uri="https://rpki.cnnic.cn/rrdp/notify.xml"} -1
routinator_rrdp_status{uri="https://rrdp.twnic.tw/rrdp/notify.xml"} -1
routinator_rrdp_status{uri="https://ca.rg.net/rrdp/notify.xml"} -1

not sure one has any cause for the other, will do a few more restarts and see if its consistent.

additional edit: after several restarts on both instances I have, if I do routinator server -d, RRDP never works. if I just do routinator server RRDP works like a champ. if you like I can open a separate issue for this, as this isn't exactly my original problem, but perhaps this caused my original problem? not sure how best to try and replicate it, but this is not a critical system so if there's any tests you want me to run I can do that.

Thanks,

Will

partim · 2019-11-25T09:54:12Z

Interesting! Thanks for figuring that out!

partim · 2019-11-25T13:05:59Z

The reason why you ended up with 0 ROAs for the RRDP-capable regions is that while Routinator will fail over to rsync if RRDP doesn’t work at all, it will not if it has what appears to be a valid local copy of an RRDP repository. This is so that it won’t fail over for intermittent RRDP failures that would fix itself on the next run. In this case it will just use the existing local copy. I suspect that in your case, it did exactly that but since it didn’t update anymore, that copy slowly expired, eventually leading to 0 ROAs.

PR #250 fixes the failure to run RRDP with the -d option, so now RRDP failures should really be actual networking or server issues again and the strategy makes sense.

I believe we never discovered the issue since hardly anyone actually uses that option – in most cases the init system seems to want to do its own forking. For the same reason, I am tempted to not do an urgent release for the fix.

partim · 2019-11-25T13:07:42Z

Let’s move this to a separate comment for posterity: My impression is that the -d option is used relatively rarely and a bug-fix release for the issue is not necessary. If I am mistaken and someone would rather I did a release, I am happy too, though.

wmclendon · 2019-11-25T13:11:24Z

Thanks @partim - what is the most common method to run routinator? via systemd? I saw the example unit file, but myself am not familiar with how to incorporate -- if that is the best / preferred way to run, i'll hit up my colleagues that are more systems oriented to help me figure it out with how I currently have routinator installed (also likely not ideal...right now just in my user home directory on the instance i'm running)

Thanks again!

partim · 2019-11-25T13:26:01Z

On Linux, I would indeed suggest to use systemd. There is two service unit files, the minimal one should work on any version of systemd, the regular one needs a newer and we had issues with it leading to segfaults on some systems. The service assumes a config file similar to etc/routinator.conf.system-service.

If you are using Debian-based systems, you can also build a Debian package via cargo deb, but I am not quite sure how well that works currently. It definitely doesn’t do the routinator init bit, so you’d have to run routinator -c /etc/routinator/routinator.conf init after installing.

Alternatively, Docker via Docker Hub seems to be very popular.

wmclendon · 2019-11-25T13:31:28Z

yea i think longer term we'll probably look to deploy as a docker container, but in the interim i'll work on getting the unit file working.

Thanks,

Will

partim mentioned this issue Nov 25, 2019

Delay creating of the HTTP client until after a possible fork. #250

Merged

partim closed this as completed in #250 Nov 25, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

0 ROAs from ripe or apnic -- how to troubleshoot? #246

0 ROAs from ripe or apnic -- how to troubleshoot? #246

wmclendon commented Nov 24, 2019 •

edited

Loading

partim commented Nov 24, 2019

wmclendon commented Nov 24, 2019 •

edited

Loading

wmclendon commented Nov 24, 2019

wmclendon commented Nov 24, 2019 •

edited

Loading

partim commented Nov 25, 2019

partim commented Nov 25, 2019

partim commented Nov 25, 2019 •

edited

Loading

wmclendon commented Nov 25, 2019

partim commented Nov 25, 2019

wmclendon commented Nov 25, 2019

0 ROAs from ripe or apnic -- how to troubleshoot? #246

0 ROAs from ripe or apnic -- how to troubleshoot? #246

Comments

wmclendon commented Nov 24, 2019 • edited Loading

partim commented Nov 24, 2019

wmclendon commented Nov 24, 2019 • edited Loading

wmclendon commented Nov 24, 2019

wmclendon commented Nov 24, 2019 • edited Loading

partim commented Nov 25, 2019

partim commented Nov 25, 2019

partim commented Nov 25, 2019 • edited Loading

wmclendon commented Nov 25, 2019

partim commented Nov 25, 2019

wmclendon commented Nov 25, 2019

wmclendon commented Nov 24, 2019 •

edited

Loading

wmclendon commented Nov 24, 2019 •

edited

Loading

wmclendon commented Nov 24, 2019 •

edited

Loading

partim commented Nov 25, 2019 •

edited

Loading