Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

0 ROAs from ripe or apnic -- how to troubleshoot? #246

Closed
wmclendon opened this issue Nov 24, 2019 · 10 comments · Fixed by #250
Closed

0 ROAs from ripe or apnic -- how to troubleshoot? #246

wmclendon opened this issue Nov 24, 2019 · 10 comments · Fixed by #250

Comments

@wmclendon
Copy link

wmclendon commented Nov 24, 2019

Hello,

I have two installations of routinator (just upgraded, but this issue predates the upgrade I'm almost certain). somewhere along the way, it seems my instances are no longer able to get ROAs from one or more regions. One instance reports 0 ROAs from apnic and ripe, and another 0 from just apnic. example output from the /metrics:

# TYPE routinator_valid_roas gauge
routinator_valid_roas{tal="arin"} 5878
routinator_valid_roas{tal="apnic"} 0
routinator_valid_roas{tal="lacnic"} 2767
routinator_valid_roas{tal="afrinic"} 590
routinator_valid_roas{tal="ripe"} 13091
# TYPE routinator_valid_roas gauge
routinator_valid_roas{tal="arin"} 5878
routinator_valid_roas{tal="lacnic"} 2767
routinator_valid_roas{tal="apnic"} 0
routinator_valid_roas{tal="ripe"} 0
routinator_valid_roas{tal="afrinic"} 590

i can't seem to find any logs to see why one or more seem to be failing. Looking at other metrics being reported it looks like rsync exit codes 0 for all of the repositories (and it seems RRDP is failing with code -1 for the ones it is attempting).

Any feedback as to how to troubleshoot or where to look would be appreciated!

Thanks!

@partim
Copy link
Member

partim commented Nov 24, 2019

APNIC and RIPE are the two regions that use RRDP, so it probably is related to that. You can verify by running Routinator with the --disable-rrdp option. If that gets you a full set of ROAs, then indeed RRDP is the problem. In that case, there should be an error message in the logs. Where does are depends a bit on how you run Routinator. In server mode it will prefer syslog, in vrps mode it will prefer to print to stderr.

@wmclendon
Copy link
Author

wmclendon commented Nov 24, 2019

thanks @partim I will give that a try. Is there anything I need add / install to enable RRDP? In my config file I have 'disable-rrdp = false'. I run routinator with routinator server -d at present with a .routinator.conf file. The log level specified in the config file is WARN -- do I need to adjust that to see the relevant log messages?

@wmclendon
Copy link
Author

i updated my config file to log to local file and level INFO, and also disabled rrdp, and i see in the log all the rsync messages seemingly complete. and indeed, on /metrics I now see ROAs for all the RIRs:

# HELP routinator_valid_roas number of valid ROAs seen
# TYPE routinator_valid_roas gauge
routinator_valid_roas{tal="arin"} 5878
routinator_valid_roas{tal="lacnic"} 2767
routinator_valid_roas{tal="apnic"} 5202
routinator_valid_roas{tal="ripe"} 13091
routinator_valid_roas{tal="afrinic"} 590

I will try re-enabling RRDP and see if any logs point direction to why it fails.

@wmclendon
Copy link
Author

wmclendon commented Nov 24, 2019

well, now i'm thoroughly confused, as it seems to have everything working, at least for now (maybe some are cached from the prior rsync run?). info from metrics after restarting with RRDP enabled again:

# HELP routinator_valid_roas number of valid ROAs seen
# TYPE routinator_valid_roas gauge
routinator_valid_roas{tal="arin"} 5878
routinator_valid_roas{tal="lacnic"} 2767
routinator_valid_roas{tal="apnic"} 5202
routinator_valid_roas{tal="ripe"} 13091
routinator_valid_roas{tal="afrinic"} 590

# HELP routinator_rsync_status exit status of rsync command
# TYPE routinator_rsync_status gauge
routinator_rsync_status{uri="rsync://rpki.arin.net/repository/"} 0
routinator_rsync_status{uri="rsync://repository.lacnic.net/rpki/"} 0
routinator_rsync_status{uri="rsync://rpki.apnic.net/repository/"} 0
routinator_rsync_status{uri="rsync://rpki-repository.nic.ad.jp/ap/"} 0
routinator_rsync_status{uri="rsync://rpki.ripe.net/ta/"} 0
routinator_rsync_status{uri="rsync://ca.rg.net/rpki/"} 0
routinator_rsync_status{uri="rsync://rpki.afrinic.net/repository/"} 0

# HELP routinator_rrdp_status status code for getting notification file
# TYPE routinator_rrdp_status gauge
routinator_rrdp_status{uri="https://rrdp.apnic.net/notification.xml"} 200
routinator_rrdp_status{uri="https://rrdp.ripe.net/notification.xml"} 200
routinator_rrdp_status{uri="https://rrdp.twnic.tw/rrdp/notify.xml"} 200
routinator_rrdp_status{uri="https://rpki.idnic.net/rrdp/notify.xml"} 200
routinator_rrdp_status{uri="https://rpki.cnnic.cn/rrdp/notify.xml"} 200
routinator_rrdp_status{uri="https://ca.rg.net/rrdp/notify.xml"} -1

for ca.rg.net RRDP failure, this was seen in the log:

[2019-11-24 20:01:25] routinator::repository Found valid trust anchor rsync://rpki.ripe.net/ta/ripe-ncc-ta.cer. Processing.
[2019-11-24 20:01:25] routinator::rrdp::server RRDP https://rrdp.ripe.net/notification.xml: Updating server
[2019-11-24 20:01:25] routinator::rrdp::server New session. Need to get snapshot.
[2019-11-24 20:01:25] routinator::rrdp::server RRDP https://rrdp.ripe.net/notification.xml: updating from snapshot.
[2019-11-24 20:02:09] routinator::rrdp::server RRDP https://ca.rg.net/rrdp/notify.xml: Updating server
[2019-11-24 20:02:09] routinator::rrdp::http https://ca.rg.net/rrdp/notify.xml: https://ca.rg.net/rrdp/notify.xml: error trying to connect: invalid certificate: UnknownIssuer
[2019-11-24 20:02:09] routinator::rrdp::server Marking RRPD server 'https://ca.rg.net/rrdp/notify.xml' as unusable
[2019-11-24 20:02:09] routinator::rsync rsyncing from rsync://ca.rg.net/rpki/.
[2019-11-24 20:02:09] routinator::rsync rsync://ca.rg.net/rpki: Running command "rsync" "--contimeout=10" "--timeout=300" "-rltz" "--delete" "rsync://ca.rg.net/rpki/" "/home/layer3/.rpki-cache/repository/rsync/ca.rg.net/rpki/"
[2019-11-24 20:02:26] routinator::rsync rsync://ca.rg.net/rpki: successfully completed.

ill go through the same process on my other instance and see if it recovers for the time being. i'll check it again tomorrow and see if it still reports data from all the RIRs.

EDIT: the above output was when I ran routinator server -- if I re-run as routinator server -d, the RRDP is timing out:

[2019-11-24 20:08:57] routinator::rrdp::server RRDP https://rrdp.ripe.net/notification.xml: Updating server
[2019-11-24 20:09:27] routinator::rrdp::http https://rrdp.ripe.net/notification.xml: https://rrdp.ripe.net/notification.xml: timed out
[2019-11-24 20:09:32] routinator::rrdp::server RRDP https://ca.rg.net/rrdp/notify.xml: Updating server
[2019-11-24 20:10:02] routinator::rrdp::http https://ca.rg.net/rrdp/notify.xml: https://ca.rg.net/rrdp/notify.xml: timed out
[2019-11-24 20:10:02] routinator::rrdp::server Marking RRPD server 'https://ca.rg.net/rrdp/notify.xml' as unusable

this was within minutes of it seeming to succeed in the prior output -- and the metrics for rrdp status all report -1 again:

# HELP routinator_rrdp_status status code for getting notification file
# TYPE routinator_rrdp_status gauge
routinator_rrdp_status{uri="https://rpki.idnic.net/rrdp/notify.xml"} -1
routinator_rrdp_status{uri="https://rrdp.apnic.net/notification.xml"} -1
routinator_rrdp_status{uri="https://rrdp.ripe.net/notification.xml"} -1
routinator_rrdp_status{uri="https://rpki.cnnic.cn/rrdp/notify.xml"} -1
routinator_rrdp_status{uri="https://rrdp.twnic.tw/rrdp/notify.xml"} -1
routinator_rrdp_status{uri="https://ca.rg.net/rrdp/notify.xml"} -1

not sure one has any cause for the other, will do a few more restarts and see if its consistent.

additional edit: after several restarts on both instances I have, if I do routinator server -d, RRDP never works. if I just do routinator server RRDP works like a champ. if you like I can open a separate issue for this, as this isn't exactly my original problem, but perhaps this caused my original problem? not sure how best to try and replicate it, but this is not a critical system so if there's any tests you want me to run I can do that.

Thanks,

Will

@partim
Copy link
Member

partim commented Nov 25, 2019

Interesting! Thanks for figuring that out!

@partim
Copy link
Member

partim commented Nov 25, 2019

The reason why you ended up with 0 ROAs for the RRDP-capable regions is that while Routinator will fail over to rsync if RRDP doesn’t work at all, it will not if it has what appears to be a valid local copy of an RRDP repository. This is so that it won’t fail over for intermittent RRDP failures that would fix itself on the next run. In this case it will just use the existing local copy. I suspect that in your case, it did exactly that but since it didn’t update anymore, that copy slowly expired, eventually leading to 0 ROAs.

PR #250 fixes the failure to run RRDP with the -d option, so now RRDP failures should really be actual networking or server issues again and the strategy makes sense.

I believe we never discovered the issue since hardly anyone actually uses that option – in most cases the init system seems to want to do its own forking. For the same reason, I am tempted to not do an urgent release for the fix.

@partim
Copy link
Member

partim commented Nov 25, 2019

Let’s move this to a separate comment for posterity: My impression is that the -d option is used relatively rarely and a bug-fix release for the issue is not necessary. If I am mistaken and someone would rather I did a release, I am happy too, though.

@wmclendon
Copy link
Author

Thanks @partim - what is the most common method to run routinator? via systemd? I saw the example unit file, but myself am not familiar with how to incorporate -- if that is the best / preferred way to run, i'll hit up my colleagues that are more systems oriented to help me figure it out with how I currently have routinator installed (also likely not ideal...right now just in my user home directory on the instance i'm running)

Thanks again!

@partim
Copy link
Member

partim commented Nov 25, 2019

On Linux, I would indeed suggest to use systemd. There is two service unit files, the minimal one should work on any version of systemd, the regular one needs a newer and we had issues with it leading to segfaults on some systems. The service assumes a config file similar to etc/routinator.conf.system-service.

If you are using Debian-based systems, you can also build a Debian package via cargo deb, but I am not quite sure how well that works currently. It definitely doesn’t do the routinator init bit, so you’d have to run routinator -c /etc/routinator/routinator.conf init after installing.

Alternatively, Docker via Docker Hub seems to be very popular.

@wmclendon
Copy link
Author

yea i think longer term we'll probably look to deploy as a docker container, but in the interim i'll work on getting the unit file working.

Thanks,

Will

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants