Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VPN route not always propagated to peer in LabN BGPD CI test #1052

Closed
louberger opened this issue Aug 26, 2017 · 7 comments
Closed

VPN route not always propagated to peer in LabN BGPD CI test #1052

louberger opened this issue Aug 26, 2017 · 7 comments
Assignees
Labels

Comments

@louberger
Copy link
Member

First seen on pull request #1023. as stated there, the speculation is that this was actually introduced by PR #939 .

Topology is CE4<--(bgp unicast safi)-->PE/GW2<--(vpn safi)-->NVA0<--(vpn safi) -->PE/GW3<--(unicast safi) -->CE5

o Scenario:

  1. Add 10 static routes CE4

  2. Add VNC route NVA0

  3. Confirm NVA0 sees GW2 imported CE4 routes (over vpn safi) and CE5 sees 10 unicast routes

  4. Withdraw static routes at CE4

  5. Confirm NVA0 see withdrawals

  6. Readd static route at CE4

  7. Confirm NVA0 see routes

  8. Remove VNC route at NVA0

  9. See all imported routes removed at NVA0 and GW2

  10. Remove static routes

  11. (reverse order of imported routes) Add VNC route NVA0

  12. Add 10 static routes CE4

  13. Confirm GW 2 sees 10 unicast routes and 11 VPN routes (10 GW2 imported + NVA0 learned)

  14. Confirm NVA0 sees 11 VPN routes (10 GW2 imported + NVA0 learned)

Intermittent failure is on step 14

o Data showing issue:

  • CE has 10 static routes: (as expected)

address-family ipv4 unicast
network 5.0.0.0/24 route-map rm-nh
network 5.0.1.0/24 route-map rm-nh
network 5.0.2.0/24 route-map rm-nh
network 5.0.3.0/24 route-map rm-nh
network 5.0.4.0/24 route-map rm-nh
network 5.0.5.0/24 route-map rm-nh
network 5.0.6.0/24 route-map rm-nh
network 5.0.7.0/24 route-map rm-nh
network 5.0.8.0/24 route-map rm-nh
network 5.0.9.0/24 route-map rm-nh

route-map rm-nh permit 10
match ip address al-any
set ip next-hop 127.0.0.134
set local-preference 123
set metric 98

  • GW2 sees 10 routes and imports them to VPN safi, also sees VNC registered route: (as expected)

GW-127.0.0.132> show bgp ipv4 unicast
BGP table version is 50, local router ID is 127.0.0.132
Status codes: s suppressed, d damped, h history, * valid, > best, = multipath,
i internal, r RIB-failure, S Stale, R Removed
Origin codes: i - IGP, e - EGP, ? - incomplete

Network Next Hop Metric LocPrf Weight Path
*>i5.0.0.0/24 127.0.0.134 98 123 0 i
*>i5.0.1.0/24 127.0.0.134 98 123 0 i
*>i5.0.2.0/24 127.0.0.134 98 123 0 i
*>i5.0.3.0/24 127.0.0.134 98 123 0 i
*>i5.0.4.0/24 127.0.0.134 98 123 0 i
*>i5.0.5.0/24 127.0.0.134 98 123 0 i
*>i5.0.6.0/24 127.0.0.134 98 123 0 i
*>i5.0.7.0/24 127.0.0.134 98 123 0 i
*>i5.0.8.0/24 127.0.0.134 98 123 0 i
*>i5.0.9.0/24 127.0.0.134 98 123 0 i

Displayed 10 routes and 10 total paths
GW-127.0.0.132> show bgp ipv4 vpn
BGP table version is 0, local router ID is 127.0.0.132
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete

Network Next Hop Metric LocPrf Weight Path
Route Distinguisher: 172.16.1.1:5226
*>i5.0.0.0/24 172.16.1.1 98 123 0 ?
UN=9.8.7.6 EC{ET:7 127.0.0.134:5226 1000:1 1000:2 ET:7} label=3 life=500 type=bgp-direct, subtype=3
*>i5.0.1.0/24 172.16.1.1 98 123 0 ?
UN=9.8.7.6 EC{ET:7 127.0.0.134:5226 1000:1 1000:2 ET:7} label=3 life=500 type=bgp-direct, subtype=3
*>i5.0.2.0/24 172.16.1.1 98 123 0 ?
UN=9.8.7.6 EC{ET:7 127.0.0.134:5226 1000:1 1000:2 ET:7} label=3 life=500 type=bgp-direct, subtype=3
*>i5.0.3.0/24 172.16.1.1 98 123 0 ?
UN=9.8.7.6 EC{ET:7 127.0.0.134:5226 1000:1 1000:2 ET:7} label=3 life=500 type=bgp-direct, subtype=3
*>i5.0.4.0/24 172.16.1.1 98 123 0 ?
UN=9.8.7.6 EC{ET:7 127.0.0.134:5226 1000:1 1000:2 ET:7} label=3 life=500 type=bgp-direct, subtype=3
*>i5.0.5.0/24 172.16.1.1 98 123 0 ?
UN=9.8.7.6 EC{ET:7 127.0.0.134:5226 1000:1 1000:2 ET:7} label=3 life=500 type=bgp-direct, subtype=3
*>i5.0.6.0/24 172.16.1.1 98 123 0 ?
UN=9.8.7.6 EC{ET:7 127.0.0.134:5226 1000:1 1000:2 ET:7} label=3 life=500 type=bgp-direct, subtype=3
*>i5.0.7.0/24 172.16.1.1 98 123 0 ?
UN=9.8.7.6 EC{ET:7 127.0.0.134:5226 1000:1 1000:2 ET:7} label=3 life=500 type=bgp-direct, subtype=3
*>i5.0.8.0/24 172.16.1.1 98 123 0 ?
UN=9.8.7.6 EC{ET:7 127.0.0.134:5226 1000:1 1000:2 ET:7} label=3 life=500 type=bgp-direct, subtype=3

  • i5.0.9.0/24 172.16.1.1 98 123 0 ?
    UN=9.8.7.6 EC{ET:7 127.0.0.134:5226 1000:1 1000:2 ET:7} label=3 life=500 type=bgp-direct, subtype=3
    *>i127.0.0.134/32 172.16.1.1 205 0 ?
    UN=9.8.7.6 EC{1000:1 1000:2 ET:7} label=3 life=500 type=bgp, subtype=0

Displayed 11 routes and 21 total paths

  • NVA0 only sees 9 of the 10 imported routes: (failure at step 14)

NVA-127.0.0.130> show bgp ipv4 vpn
BGP table version is 0, local router ID is 127.0.0.130
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete

Network Next Hop Metric LocPrf Weight Path
Route Distinguisher: 172.16.1.1:5226
*>i5.0.0.0/24 172.16.1.1 98 123 0 ?
UN=9.8.7.6 EC{1000:1 1000:2 127.0.0.134:5226 ET:7} label=3 life=500 type=bgp, subtype=0
*>i5.0.1.0/24 172.16.1.1 98 123 0 ?
UN=9.8.7.6 EC{1000:1 1000:2 127.0.0.134:5226 ET:7} label=3 life=500 type=bgp, subtype=0
*>i5.0.2.0/24 172.16.1.1 98 123 0 ?
UN=9.8.7.6 EC{1000:1 1000:2 127.0.0.134:5226 ET:7} label=3 life=500 type=bgp, subtype=0
*>i5.0.3.0/24 172.16.1.1 98 123 0 ?
UN=9.8.7.6 EC{1000:1 1000:2 127.0.0.134:5226 ET:7} label=3 life=500 type=bgp, subtype=0
*>i5.0.4.0/24 172.16.1.1 98 123 0 ?
UN=9.8.7.6 EC{1000:1 1000:2 127.0.0.134:5226 ET:7} label=3 life=500 type=bgp, subtype=0
*>i5.0.5.0/24 172.16.1.1 98 123 0 ?
UN=9.8.7.6 EC{1000:1 1000:2 127.0.0.134:5226 ET:7} label=3 life=500 type=bgp, subtype=0
*>i5.0.6.0/24 172.16.1.1 98 123 0 ?
UN=9.8.7.6 EC{1000:1 1000:2 127.0.0.134:5226 ET:7} label=3 life=500 type=bgp, subtype=0
*>i5.0.7.0/24 172.16.1.1 98 123 0 ?
UN=9.8.7.6 EC{1000:1 1000:2 127.0.0.134:5226 ET:7} label=3 life=500 type=bgp, subtype=0
*>i5.0.8.0/24 172.16.1.1 98 123 0 ?
UN=9.8.7.6 EC{1000:1 1000:2 127.0.0.134:5226 ET:7} label=3 life=500 type=bgp, subtype=0
*> 127.0.0.134/32 172.16.1.1 205 0 ?
UN=9.8.7.6 EC{ET:7 1000:1 1000:2} label=3 life=500 type=bgp, subtype=4

Displayed 10 routes and 19 total paths

@louberger
Copy link
Member Author

feels like a timer / list race condition. Attached valgrind reported losses are the same (I think) each time the failure occurs...
bgpd-2017-08-25-09-17-17.127.0.0.132.valgrind.txt

@louberger
Copy link
Member Author

louberger commented Aug 28, 2017

Able to reproduce this about 7% of the time on master (c3f779d). (18 out of 244)

After 163 runs unable to reproduce on master prior to pull request #939 merge (commit 57a553f)

Retested at point where pull request #939 merge (commit 3f54388) and see fails so this is definitely the source of the problem (9 out of 116 runs). @jbonor any thoughts?

@louberger
Copy link
Member Author

louberger commented Aug 29, 2017

I think the issue is STAILQ_INSERT_TAIL being called for a wg from within the same queue handler callback (bgp_process_wq) - see pull for how avoided, but root cause is still to be fixed.

I suspect issue is items added to the same WG in the last call to bgp_process_main_one won't be processed due to use of STAILQ_FOREACH_SAFE

@donaldsharp
Copy link
Member

Lou has a proposed fix and is testing it. Once it reaches good enough a PR will be submitted(#1067).

The safe walking is the issue.

We need to make sure that RN doesn't show up on multiple work queues. and we need to look at FOREACH_STAIL_SAFE is handled properly in workqueue.c

@donaldsharp
Copy link
Member

Donald to look at workqueue stuff

@donaldsharp
Copy link
Member

The workqueue is only used in 2 places besides bgp:

  1. The metaqueue processing in zebra_rib.c. There is a guard to only allow 1 item on the queue ever
  2. lsp processing in zebra_mpls.c. There is a guard to not allow more than 1 lsp on the list at a time.

I believe we are good from the workqueue perspective.

@louberger
Copy link
Member Author

This issue has been fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants