
OOM on Jool boxes, possible leak? #410

Closed
suffieldacademy opened this issue Aug 3, 2023 · 59 comments

@suffieldacademy

We have two jool boxes in an active/active load-sharing setup that we're testing for our campus. Things have been fine for months in limited testing. This week we added more test clients to the boxes and have been getting several machine lockups requiring a hard reboot. The message on the screen is typically out of memory.

This is jool 4.1.8.0 on Debian Bullseye.

I'm not a kernel expert, but looking at some of the other memory issue reports people mentioned /proc/slabinfo. I sampled that every 2 seconds and the "jool_joold_nodes" line is increasing constantly (this is on a machine that's been up less than 2 hours and 'jool session display' lists approximately 12,000 sessions):

# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
jool_joold_nodes  3804464 3804464    120   34    1 : tunables    0    0    0 : slabdata 111896 111896      0
session_nodes      14106  60294    104   39    1 : tunables    0    0    0 : slabdata   1546   1546      0
bib_nodes          14532  60690     96   42    1 : tunables    0    0    0 : slabdata   1445   1445      0
jool_xlations        180    180    728   45    8 : tunables    0    0    0 : slabdata      4      4      0

# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
jool_joold_nodes  3804668 3804668    120   34    1 : tunables    0    0    0 : slabdata 111902 111902      0
session_nodes      14129  60294    104   39    1 : tunables    0    0    0 : slabdata   1546   1546      0
bib_nodes          14564  60690     96   42    1 : tunables    0    0    0 : slabdata   1445   1445      0
jool_xlations        180    180    728   45    8 : tunables    0    0    0 : slabdata      4      4      0

# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
jool_joold_nodes  3805620 3805620    120   34    1 : tunables    0    0    0 : slabdata 111930 111930      0
session_nodes      14150  60294    104   39    1 : tunables    0    0    0 : slabdata   1546   1546      0
bib_nodes          14557  60690     96   42    1 : tunables    0    0    0 : slabdata   1445   1445      0
jool_xlations        180    180    728   45    8 : tunables    0    0    0 : slabdata      4      4      0

# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
jool_joold_nodes  3805824 3805824    120   34    1 : tunables    0    0    0 : slabdata 111936 111936      0
session_nodes      14125  60294    104   39    1 : tunables    0    0    0 : slabdata   1546   1546      0
bib_nodes          14539  60690     96   42    1 : tunables    0    0    0 : slabdata   1445   1445      0
jool_xlations        180    180    728   45    8 : tunables    0    0    0 : slabdata      4      4      0
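
For reference, those samples came from a simple loop, roughly like the following (the exact command doesn't matter):

# while sleep 2; do grep -E 'jool_joold_nodes|session_nodes|bib_nodes|jool_xlations' /proc/slabinfo; done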

I sampled active sessions in jool and even when those decreased, the slabs continued to increase. Meanwhile, "available" memory (as reported by top/free) has been steadily decreasing (several MiB per minute). Since we've increased the number of users, the machines have needed a reboot in as few as 20 hours.

I don't know enough about the kernel structures to know what jool_joold_nodes represents, but I'm guessing it shouldn't be monotonically increasing. I'm happy to gather any additional data that may be helpful.

One of my two boxes is locked up at the moment, but once I'm back to fully redundant I can try things like restarting jool or unloading the kernel module to see if we can recover memory without rebooting.

@suffieldacademy
Author

Some additional data:

After 12 hours I'm up to about 3GiB of memory in use.

The machine that most recently locked up showed OOM messages. The processes that were killed were all listed as <2MiB RSS, so my guess is that jool is eating up kernel memory:

[107231.748887] Out of memory: Killed process 2866 (system) total-vm:164920kB, anon-rss:0kB, file-rss:124kB, shmem-rss:0kB, UID:0 pgtables:88kB oom_score_adj:0
[107250.573133] Out of memory: Killed process 2865 (cron) total-vm:6684kB, anon-rss:0kB, file-rss:1696kB, shmem-rss:0kB, UID:0 pgtables:56kB oom_score_adj:0
[107250.587664] Out of memory: Killed process 970 (joold) total-vm:151412kB, anon-rss:0kB, file-rss:344kB, shmem-rss:0kB, UID:0 pgtables:60kB oom_score_adj:0

I rebooted the machine but didn't put it back into service immediately so I could get a baseline on memory. It was receiving sync state information, but not forwarding any traffic. I watched /proc/slabinfo for a bit and while session_nodes and bib_nodes were going up, jool_joold_nodes stayed constant.

As soon as I started forwarding traffic through the box, jool_joold_nodes started to increase regularly.

I then failed over and looked at the machine that was running out of memory. As soon as it wasn't forwarding traffic, jool_joold_nodes stayed constant even as session_nodes and bib_nodes fluctuated. So it seems that the memory consumption is specifically from packet processing and not states.

I restarted the joold processes (I have 2 instances in different netns), and as soon as I did that /proc/slabinfo showed the jool_joold_nodes count drop to < 1000. Free memory on the box shot back up to 3.5GiB. I did not need to unload the kernel module; simply restarting the joold instance was enough to reclaim the memory.

Any thoughts on what I should check on next?

@gehoern

gehoern commented Aug 4, 2023

We have a similar issue, though we're not sure what info you need:

Jool 4.1.10 on a 6.1.42 kernel, and memory use increases by roughly 1.5 MB/minute.

In slabtop, the kmalloc-256 entry keeps growing as long as Jool is running. If Jool is stopped, the growth stops, BUT our memory is not freed.
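
For reference, a one-shot snapshot of that cache can be grabbed with something like (command approximate):

# slabtop -o | grep kmalloc-256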

ydahhrk added a commit that referenced this issue Aug 4, 2023
joold->queue is a listing of joold sessions whose fetch needs to be
postponed because of the Netlink channel's limitations. Quite
surprisingly, the code was not actually ever fetching them, which is
why they were queuing indefinitely.

I suspect this has gone unnoticed because, depending on the
environment, Joold seemingly needs lots of traffic to queue. I myself
had to disable ss-flush-asap to be able to replicate the bug.

Likely f1xes #410, but I noticed a couple extra issues that need to be
addressed before a new release. It seems joold has gotten dusty again.
It makes me wonder if people are using it.

In particular, I had to disable --advertise to prevent it from
synchronizing uninitialized memory. Will need to look into this in the
following days.
@ydahhrk
Member

ydahhrk commented Aug 4, 2023

I believe I fixed it. Patch uploaded to branch issue410.

I noticed two more bugs while reviewing joold, so I will need to work on this further before the release is ready.

For the sake of transparency:

  1. joold advertise doesn't do anything because it also relies on joold->queue, but fixing it is more complicated than the baseline patch above.
  2. Similarly, ss-capacity isn't doing anything. (Which is why you never saw the warning, even though too many sessions are queuing.)

@suffieldacademy
Author

Thank you so much for looking into this! I'm just trying to understand your bug descriptions and have a couple questions:

  • Is joold advertise not working in the current code? I've been using it but have not checked the session tables to confirm it was actually sending out existing states when asked.
  • Should I enable ss-flush-asap? Your commit message made it sound like that might prevent (some) of the queuing.

To answer your musing in the commit message, I am definitely using Jool! My campus is working toward being IPv6-native internally, which means I need to have NAT64 functioning to provide connectivity to the legacy internet. My other favorite firewall/router (OpenBSD) doesn't have all the NAT64 features that Jool does, so that's why we are using it. I realize it's bugfix/maintenance at this point, but I'm hoping that since all the functionality I need is already in the project we can keep running it to provide this service.

Thanks!

@ydahhrk
Member

ydahhrk commented Aug 7, 2023

Is joold advertise not working in the current code?

I didn't actually test it, but the code review says "yes." (By which I mean it doesn't work at all.)

Should I enable ss-flush-asap?

Yes; ss-flush-asap true is mandatory in Active/Active. Notice it's true by default, so you probably already have it.
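
If you want to double-check, it's a runtime switch; roughly:

$ sudo jool global display | grep flush-asap
$ sudo jool global update ss-flush-asap true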

Your commit message made it sound like that might prevent (some) of the queuing.

Sure, but the queuing is probably already fixed.

I am definitely using Jool!

I'm glad, but I was specifically wondering about joold.

It's an optional feature that has often been problematic (to the point I actually regret having implemented it), and the amount of bug reports doesn't seem to match the state of the code. I mean, I casually found three bugs in a single review. That's not healthy.

@suffieldacademy
Author

Yes; ss-flush-asap true is mandatory in Active/Active. Notice it's true by default, so you probably already have it.

We actually had it off (we're doing active/passive but with two instances as a kind of poor-man's load balancing). I've enabled it again for now.

I'm glad, but I was specifically wondering about joold.

My apologies for the misunderstanding; I thought it was the project as a whole. In terms of joold, a vote of support for that as well. We picked Jool in part due to the state sync feature as we wanted to run things in a cluster.

Thanks again!

ydahhrk added a commit that referenced this issue Aug 10, 2023
For #410.

I more or less finished the implementation (albeit not the testing),
but the unit tests revealed that one of my assumptions regarding Generic
Netlink is incorrect.

nla_put() cannot be trusted to respect the allocated packet size,
because alloc_skb() can reserve more tail area than it's requested.
This means ss-max-payload has to be enforced manually.

And this by itself wouldn't be enough to justify another rewrite, but
after thinking about it, I realized a more natural implementation would
also reduce the range of the spinlock... which is currently not the
best. Lots of performance to be gained from switching.

So checkpointing. I might have to return to this implementation if the
new one turns out unviable.
@ydahhrk
Member

ydahhrk commented Aug 11, 2023

Ok, all three bugs have been fixed in the issue410 branch.

Two notes:

Can you run it from source? It'd be great to get some feedback before I release it.

@ydahhrk ydahhrk added the Bug label Aug 11, 2023
@suffieldacademy
Author

Sorry for the delay on this; had some other emergencies that kept getting in the way.

I have checked out the issue410 branch from git and built jool from source. I changed our script to use ss-max-sessions-per-packet, and dialed the value down to 8 (we're using HSR which has a lower MTU so I wanted to make sure we weren't going over).
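
(For reference, the relevant line in our startup script is now roughly:)

# ip netns exec sa-jool-ns jool -i nat64-wkp-lower global update ss-max-sessions-per-packet 8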

Everything is up and running and I've done some light testing. /proc/slabinfo shows bib entries going up, but no runaway joold_nodes, which is great!

I'll give it a few days to see if memory use climbs, or if we note any regressions. Thanks for the quick patch!

@ydahhrk
Member

ydahhrk commented Aug 22, 2023

Reopening until release.

@ydahhrk ydahhrk reopened this Aug 22, 2023
@suffieldacademy
Author

And just to provide an update, memory usage is holding steady on the machines, with no mention of joold_nodes. We're still in light testing (students have not returned to campus), but the indicators seem very promising.

@suffieldacademy
Author

Update from today. Looks like we might still be leaking memory, but not in the same place (and not nearly as quickly, since I just noticed this after 1+ week of uptime). Would you like me to open a separate issue to track? Short version is kernel memory is growing, and this time it's all being hogged by hugepages:

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME                   
22849668 22849358  99%    0.11K 634713       36   2538852K khugepaged_mm_slot

I will see if unloading the kernel module resets it when I have a minute to try.

@suffieldacademy
Author

I had a minute. ;-)

Shutting down the jool instance and unloading the kernel modules did not release the memory. I'm rebooting this box now to recover the RAM and we'll keep an eye on it.

So, not certain that it's a jool issue (e.g., could be netfilter or something else), but that's all that this box is doing so it seems connected.

@ydahhrk
Member

ydahhrk commented Aug 28, 2023

Would you like me to open a separate issue to track?

I personally don't mind either way. I suppose it'd depend on whether gehoern wants to keep getting notifications from this thread.

Short version is kernel memory is growing, and this time it's all being hogged by hugepages:

Do you still get increasing hugepages if you disable joold?

If you run joold in active/passive mode, which instance gets the hugepages?


Most of Jool's allocations are tracked by this thing.

If you enable it, wait for some hugepages then unload the module, maybe Jool will tell us what it's leaking.

$ make CFLAGS_MODULE="-DJKMEMLEAK"
$ sudo make install
$ sudo dmesg -C
$ sudo modprobe jool && sudo modprobe -r jool
$ sudo dmesg -ct
Jool: Core Jool v4.1.10.0 module inserted.
NAT64 Jool v4.1.10.0 module inserted.
NAT64 Jool v4.1.10.0 module removed.
Jool: Memory leaks:
Jool: None.
Jool: Core Jool v4.1.10.0 module removed.

@suffieldacademy
Author

We're active/active so it's happening on both hosts. I'm trying to turn down journaling to make sure it's not just logging that's chewing up memory.

I recompiled with the requested flag on one of the machines. Starting it back up and watching dmesg for a few minutes, I got:

[Mon Aug 28 21:01:13 2023] Jool WARNING (send_to_userspace_prepare): joold: Too many sessions deferred! I need to drop some; sorry.
[Mon Aug 28 21:02:31 2023] Jool WARNING (send_to_userspace_prepare): joold: Too many sessions deferred! I need to drop some; sorry.
[Mon Aug 28 21:03:36 2023] Jool WARNING (send_to_userspace_prepare): joold: Too many sessions deferred! I need to drop some; sorry.
[Mon Aug 28 21:04:44 2023] Jool WARNING (send_to_userspace_prepare): joold: Too many sessions deferred! I need to drop some; sorry.
[Mon Aug 28 21:05:45 2023] Jool WARNING (send_to_userspace_prepare): joold: Too many sessions deferred! I need to drop some; sorry.
[Mon Aug 28 21:06:49 2023] Jool WARNING (send_to_userspace_prepare): joold: Too many sessions deferred! I need to drop some; sorry.
[Mon Aug 28 21:07:54 2023] Jool WARNING (send_to_userspace_prepare): joold: Too many sessions deferred! I need to drop some; sorry.
[Mon Aug 28 21:08:39 2023] ------------[ cut here ]------------
[Mon Aug 28 21:08:39 2023] We're out of sync: Incoming session entry [2607:f460:1833:bf:c436:763e:41b3:6064#46149, 64:ff9b::d891:6d62#3478, 72.10.119.12#55668, 216.145.109.98#3478, UDP] collides with DB entry [2607:f460:1833:bf:c436:763e:41b3:6064#46149, 64:ff9b::d891:6d62#3478, 72
[Mon Aug 28 21:08:39 2023] WARNING: CPU: 0 PID: 4753 at /var/lib/dkms/jool/4.1.10.git.v4.1.10.7.g4fcfe184/build/src/mod/common/log.c:53 log_err+0xc0/0x120 [jool_common]
[Mon Aug 28 21:08:39 2023] Modules linked in: jool_siit(OE) jool(OE) jool_common(OE) nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables libcrc32c nfnetlink hsr(OE) binfmt_misc ipmi_ssif intel_rapl_msr intel_rapl_common nls_ascii sb_edac nls_cp437 vfat fat x86_pkg_temp_thermal intel_powerclamp coretemp joydev ghash_clmulni_intel sha512_ssse3 dcdbas sha512_generic hid_generic mgag200 usbhid ipmi_si aesni_intel hid crypto_simd drm_shmem_helper cryptd rapl drm_kms_helper intel_cstate evdev i2c_algo_bit wdat_wdt ipmi_devintf intel_uncore watchdog ipmi_msghandler pcspkr acpi_pad button acpi_power_meter mei_me sg mei ioatdma drm loop fuse efi_pstore configfs efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic dm_mirror dm_region_hash dm_log dm_mod sd_mod t10_pi sr_mod crc64_rocksoft cdrom crc64 crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common crc32_pclmul ahci crc32c_intel ixgbe libahci ehci_pci libata ehci_hcd lpc_ich megaraid_sas xfrm_algo dca usbcore mdio_devres tg3
[Mon Aug 28 21:08:39 2023]  scsi_mod usb_common mdio libphy scsi_common ptp pps_core wmi [last unloaded: jool_common(OE)]
[Mon Aug 28 21:08:39 2023] CPU: 0 PID: 4753 Comm: joold Tainted: G        W  OE      6.1.0-11-amd64 #1  Debian 6.1.38-4
[Mon Aug 28 21:08:39 2023] Hardware name: Dell Inc. PowerEdge R620/01W23F, BIOS 2.7.0 05/23/2018
[Mon Aug 28 21:08:39 2023] RIP: 0010:log_err+0xc0/0x120 [jool_common]
[Mon Aug 28 21:08:39 2023] Code: 00 48 89 44 24 10 e8 bf 98 db de 65 8b 05 f8 c1 20 3f f6 c4 ff 0f 84 51 74 01 00 48 89 de 48 c7 c7 9f 4b e3 c0 e8 00 90 48 de <0f> 0b 48 89 df e8 36 da 69 de 48 8b 44 24 18 65 48 2b 04 25 28 00
[Mon Aug 28 21:08:39 2023] RSP: 0018:ffffb614c07f7508 EFLAGS: 00010282
[Mon Aug 28 21:08:39 2023] RAX: 0000000000000000 RBX: ffff944235450800 RCX: 0000000000000000
[Mon Aug 28 21:08:39 2023] RDX: 0000000000000203 RSI: ffffffffa054191e RDI: 00000000ffffffff
[Mon Aug 28 21:08:39 2023] RBP: ffffb614c07f7568 R08: 0000000000000000 R09: ffffb614c07f7378
[Mon Aug 28 21:08:39 2023] R10: 0000000000000003 R11: ffff94432ff23e90 R12: ffffffffc0e2fcc8
[Mon Aug 28 21:08:39 2023] R13: ffffffffc0e3594b R14: 000000000000b445 R15: 0000000000000d96
[Mon Aug 28 21:08:39 2023] FS:  00007fbc421876c0(0000) GS:ffff94432b200000(0000) knlGS:0000000000000000
[Mon Aug 28 21:08:39 2023] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Mon Aug 28 21:08:39 2023] CR2: 00007fcc7145ff88 CR3: 0000000044d76003 CR4: 00000000000606f0
[Mon Aug 28 21:08:39 2023] Call Trace:
[Mon Aug 28 21:08:39 2023]  <TASK>
[Mon Aug 28 21:08:39 2023]  ? __warn+0x7d/0xc0
[Mon Aug 28 21:08:39 2023]  ? log_err+0xc0/0x120 [jool_common]
[Mon Aug 28 21:08:39 2023]  ? report_bug+0xe6/0x170
[Mon Aug 28 21:08:39 2023]  ? handle_bug+0x41/0x70
[Mon Aug 28 21:08:39 2023]  ? exc_invalid_op+0x13/0x60
[Mon Aug 28 21:08:39 2023]  ? asm_exc_invalid_op+0x16/0x20
[Mon Aug 28 21:08:39 2023]  ? log_err+0xc0/0x120 [jool_common]
[Mon Aug 28 21:08:39 2023]  ? log_err+0xc0/0x120 [jool_common]
[Mon Aug 28 21:08:39 2023]  collision_cb+0xd5/0xf0 [jool_common]
[Mon Aug 28 21:08:39 2023]  decide_fate+0x67/0x230 [jool_common]
[Mon Aug 28 21:08:39 2023]  bib_add_session+0x1b1/0x320 [jool_common]
[Mon Aug 28 21:08:39 2023]  add_new_session+0xba/0x110 [jool_common]
[Mon Aug 28 21:08:39 2023]  ? add_new_session+0x110/0x110 [jool_common]
[Mon Aug 28 21:08:39 2023]  joold_sync+0xaf/0xe0 [jool_common]
[Mon Aug 28 21:08:39 2023]  handle_joold_add+0x69/0x100 [jool_common]
[Mon Aug 28 21:08:39 2023]  ? handling_hairpinning_siit+0x170/0x170 [jool_common]
[Mon Aug 28 21:08:39 2023]  ? is_hairpin_nat64+0x40/0x40 [jool_common]
[Mon Aug 28 21:08:39 2023]  genl_family_rcv_msg_doit+0x100/0x160
[Mon Aug 28 21:08:39 2023]  genl_rcv_msg+0x122/0x250
[Mon Aug 28 21:08:39 2023]  ? handle_instance_flush+0x70/0x70 [jool_common]
[Mon Aug 28 21:08:39 2023]  ? genl_start+0x160/0x160
[Mon Aug 28 21:08:39 2023]  netlink_rcv_skb+0x51/0x100
[Mon Aug 28 21:08:39 2023]  genl_rcv+0x24/0x40
[Mon Aug 28 21:08:39 2023]  netlink_unicast+0x242/0x390
[Mon Aug 28 21:08:39 2023]  netlink_sendmsg+0x250/0x4c0
[Mon Aug 28 21:08:39 2023]  sock_sendmsg+0x5f/0x70
[Mon Aug 28 21:08:39 2023]  ____sys_sendmsg+0x277/0x2f0
[Mon Aug 28 21:08:39 2023]  ? copy_msghdr_from_user+0x7d/0xc0
[Mon Aug 28 21:08:39 2023]  ___sys_sendmsg+0x9a/0xe0
[Mon Aug 28 21:08:39 2023]  __sys_sendmsg+0x76/0xc0
[Mon Aug 28 21:08:39 2023]  do_syscall_64+0x5b/0xc0
[Mon Aug 28 21:08:39 2023]  ? fpregs_assert_state_consistent+0x22/0x50
[Mon Aug 28 21:08:39 2023]  ? exit_to_user_mode_prepare+0x40/0x1d0
[Mon Aug 28 21:08:39 2023]  ? syscall_exit_to_user_mode+0x17/0x40
[Mon Aug 28 21:08:39 2023]  ? do_syscall_64+0x67/0xc0
[Mon Aug 28 21:08:39 2023]  entry_SYSCALL_64_after_hwframe+0x69/0xd3
[Mon Aug 28 21:08:39 2023] RIP: 0033:0x7fbc42a9594d
[Mon Aug 28 21:08:39 2023] Code: 28 89 54 24 1c 48 89 74 24 10 89 7c 24 08 e8 7a ad f7 ff 8b 54 24 1c 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 33 44 89 c7 48 89 44 24 08 e8 ce ad f7 ff 48
[Mon Aug 28 21:08:39 2023] RSP: 002b:00007fbc421864b0 EFLAGS: 00000293 ORIG_RAX: 000000000000002e
[Mon Aug 28 21:08:39 2023] RAX: ffffffffffffffda RBX: 000055649ab67960 RCX: 00007fbc42a9594d
[Mon Aug 28 21:08:39 2023] RDX: 0000000000000000 RSI: 00007fbc42186510 RDI: 0000000000000004
[Mon Aug 28 21:08:39 2023] RBP: 00007fbc34000da0 R08: 0000000000000000 R09: 0000000000000000
[Mon Aug 28 21:08:39 2023] R10: 0000000000000000 R11: 0000000000000293 R12: 000055649ab67a00
[Mon Aug 28 21:08:39 2023] R13: 00007fbc42186510 R14: 00007fff834a14b0 R15: 00007fbc41987000
[Mon Aug 28 21:08:39 2023]  </TASK>
[Mon Aug 28 21:08:39 2023] ---[ end trace 0000000000000000 ]---
[Mon Aug 28 21:08:39 2023] ------------[ cut here ]------------
[Mon Aug 28 21:08:39 2023] We're out of sync: Incoming session entry [2607:f460:1833:bf:c436:763e:41b3:6064#46149, 64:ff9b::d85d:f612#3478, 72.10.119.12#55668, 216.93.246.18#3478, UDP] collides with DB entry [2607:f460:1833:bf:c436:763e:41b3:6064#46149, 64:ff9b::d85d:f612#3478, 72.

... additional stack traces ...

Several traces happened in a short time frame (dozens), but the IP/ports seem to be the same for most (I only count two distinct tuples across all the messages). After that it got quiet again for a bit, with just some more "Too many sessions deferred" messages.

I can send the full dmesg output for this brief test if that's helpful.

@ydahhrk
Member

ydahhrk commented Aug 30, 2023

Incoming session entry [...] collides with DB entry [...]

Just to be clear: This is not really strange, it just means the two instances did not start at the same time (an advertise wouldn't go amiss), or joold wasn't able to sync fast enough. I should probably reduce the severity of the message.

joold: Too many sessions deferred! I need to drop some; sorry.

This is indeed strange.


Also, none of this is a consequence of the -DJKMEMLEAK flag (unless it's because it's slowing everything down). Did it print something when you removed the module?

@suffieldacademy
Author

Sorry, I didn't realize I had to unload the module. When I did, there wasn't anything in dmesg about memory, just the unloading messages:

[Wed Aug 30 07:50:42 2023] Jool: Deleted instance 'nat64-wkp-lower'.
[Wed Aug 30 07:50:42 2023] Jool: Deleted instance 'siit-dmz'.
[Wed Aug 30 07:50:42 2023] Jool: Deleted instance 'nat64-wkp-upper'.
[Wed Aug 30 07:50:51 2023] NAT64 Jool v4.1.10.0 module removed.
[Wed Aug 30 07:50:51 2023] SIIT Jool v4.1.10.0 module removed.
[Wed Aug 30 07:50:51 2023] Jool: Core Jool v4.1.10.0 module removed.

Is there somewhere else I should check?

@ydahhrk
Member

ydahhrk commented Aug 30, 2023

Is there somewhere else I should check?

No. Did you uninstall the Debian package? If both the Debian package and a custom Jool are installed, the former takes precedence.

sudo apt remove jool-dkms jool-tools

Note, if you uninstall the Debian package, you also lose the systemd unit file. If you're starting Jool via systemd, here are instructions to restore the unit files. Start with the curls, end with the daemon-reload. Mind your config files.

I uploaded a new commit to issue410. It's optional. It reduces the severity of the session desync message, and also bumps Jool's version number slightly, so you can more easily tell whether you're running the custom binary or the Debian-packaged Jool:

$ make CFLAGS_MODULE="-DJKMEMLEAK"
$ sudo make install
$ sudo dmesg -C
$ sudo modprobe jool
$ sudo dmesg -ct
Jool: Core Jool v4.1.10.1 module inserted.
NAT64 Jool v4.1.10.1 module inserted.

ydahhrk added a commit that referenced this issue Aug 30, 2023
Might help monitor joold, as well as debug #410.

Print them with

	jool stat display --all | grep JSTAT_JOOLD
@ydahhrk
Member

ydahhrk commented Aug 30, 2023

I'm still worried about you getting so many Too many sessions deferred! messages.

So new commit. Version number bumped to 4.1.10.2.

This one prints a few extra stats that might help us monitor kernelside joold's state:

$ sudo jool stat display --all | grep JSTAT_JOOLD
JSTAT_JOOLD_EMPTY: 63
JSTAT_JOOLD_TIMEOUT: 1
JSTAT_JOOLD_MISSING_ACK: 3
JSTAT_JOOLD_AD_ONGOING: 0
JSTAT_JOOLD_FLUSH_ASAP: 5
JSTAT_JOOLD_PKT_FULL: 0
JSTAT_JOOLD_QUEUING: 0
JSTAT_JOOLD_SSS_QUEUED: 6
JSTAT_JOOLD_SSS_SENT: 6
JSTAT_JOOLD_SSS_RCVD: 0
JSTAT_JOOLD_PKT_SENT: 6
JSTAT_JOOLD_PKT_RCVD: 0
JSTAT_JOOLD_ADS: 0
JSTAT_JOOLD_ACKS: 6

You can get their descriptions by adding --explain:

$ sudo jool stat display --all --explain
...

JSTAT_JOOLD_TIMEOUT: 1
Joold packet sent; ss-flush-deadline reached.

JSTAT_JOOLD_MISSING_ACK: 3
Joold packet not sent; still waiting for ACK.

...

@suffieldacademy
Author

OK, recompiled off that branch with the new commits. Been running for about 16 hours. I took a stats dump and then unloaded the module. Still not seeing any leak info in dmesg.

We have 3 instances (1 SIIT, 2 NAT64) running on the box; here are the stats from all:

# ip netns exec sa-jool-ns jool_siit -i siit-dmz stats display --all
JSTAT_RECEIVED6: 76039
JSTAT_RECEIVED4: 306243
JSTAT_SUCCESS: 381909
JSTAT_BIB_ENTRIES: 0
JSTAT_SESSIONS: 0
JSTAT_ENOMEM: 0
JSTAT_XLATOR_DISABLED: 0
JSTAT_POOL6_UNSET: 0
JSTAT_SKB_SHARED: 0
JSTAT_L3HDR_OFFSET: 0
JSTAT_SKB_TRUNCATED: 0
JSTAT_HDR6: 0
JSTAT_HDR4: 0
JSTAT_UNKNOWN_L4_PROTO: 0
JSTAT_UNKNOWN_ICMP6_TYPE: 0
JSTAT_UNKNOWN_ICMP4_TYPE: 8
JSTAT_DOUBLE_ICMP6_ERROR: 0
JSTAT_DOUBLE_ICMP4_ERROR: 0
JSTAT_UNKNOWN_PROTO_INNER: 0
JSTAT_HAIRPIN_LOOP: 0
JSTAT_POOL6_MISMATCH: 0
JSTAT_POOL4_MISMATCH: 0
JSTAT_ICMP6_FILTER: 0
JSTAT_UNTRANSLATABLE_DST6: 0
JSTAT_UNTRANSLATABLE_DST4: 0
JSTAT_6056_F: 0
JSTAT_MASK_DOMAIN_NOT_FOUND: 0
JSTAT_BIB6_NOT_FOUND: 0
JSTAT_BIB4_NOT_FOUND: 0
JSTAT_SESSION_NOT_FOUND: 0
JSTAT_ADF: 0
JSTAT_V4_SYN: 0
JSTAT_SYN6_EXPECTED: 0
JSTAT_SYN4_EXPECTED: 0
JSTAT_TYPE1PKT: 0
JSTAT_TYPE2PKT: 0
JSTAT_SO_EXISTS: 0
JSTAT_SO_FULL: 0
JSTAT64_SRC: 0
JSTAT64_DST: 0
JSTAT64_PSKB_COPY: 0
JSTAT64_6791_ENOENT: 0
JSTAT64_ICMP_CSUM: 0
JSTAT64_UNTRANSLATABLE_PARAM_PROB_PTR: 0
JSTAT64_TTL: 0
JSTAT64_FRAGMENTED_ICMP: 0
JSTAT64_2XFRAG: 0
JSTAT64_FRAG_THEN_EXT: 0
JSTAT64_SEGMENTS_LEFT: 0
JSTAT46_SRC: 0
JSTAT46_DST: 0
JSTAT46_PSKB_COPY: 0
JSTAT46_6791_ENOENT: 0
JSTAT46_ICMP_CSUM: 0
JSTAT46_UNTRANSLATABLE_PARAM_PROBLEM_PTR: 0
JSTAT46_TTL: 13
JSTAT46_FRAGMENTED_ICMP: 0
JSTAT46_SRC_ROUTE: 0
JSTAT46_FRAGMENTED_ZERO_CSUM: 352
JSTAT46_BAD_MTU: 0
JSTAT_FAILED_ROUTES: 0
JSTAT_PKT_TOO_BIG: 0
JSTAT_DST_OUTPUT: 0
JSTAT_ICMP6ERR_SUCCESS: 0
JSTAT_ICMP6ERR_FAILURE: 0
JSTAT_ICMP4ERR_SUCCESS: 365
JSTAT_ICMP4ERR_FAILURE: 0
JSTAT_ICMPEXT_BIG: 0
JSTAT_JOOLD_EMPTY: 0
JSTAT_JOOLD_TIMEOUT: 0
JSTAT_JOOLD_MISSING_ACK: 0
JSTAT_JOOLD_AD_ONGOING: 0
JSTAT_JOOLD_FLUSH_ASAP: 0
JSTAT_JOOLD_PKT_FULL: 0
JSTAT_JOOLD_QUEUING: 0
JSTAT_JOOLD_SSS_QUEUED: 0
JSTAT_JOOLD_SSS_SENT: 0
JSTAT_JOOLD_SSS_RCVD: 0
JSTAT_JOOLD_PKT_SENT: 0
JSTAT_JOOLD_PKT_RCVD: 0
JSTAT_JOOLD_ADS: 0
JSTAT_JOOLD_ACKS: 0
JSTAT_UNKNOWN: 0
# ip netns exec sa-jool-ns jool -i nat64-wkp-lower stats display --all
JSTAT_RECEIVED6: 22673034
JSTAT_RECEIVED4: 53629379
JSTAT_SUCCESS: 76179608
JSTAT_BIB_ENTRIES: 13646
JSTAT_SESSIONS: 13669
JSTAT_ENOMEM: 0
JSTAT_XLATOR_DISABLED: 0
JSTAT_POOL6_UNSET: 0
JSTAT_SKB_SHARED: 0
JSTAT_L3HDR_OFFSET: 0
JSTAT_SKB_TRUNCATED: 11
JSTAT_HDR6: 0
JSTAT_HDR4: 0
JSTAT_UNKNOWN_L4_PROTO: 26
JSTAT_UNKNOWN_ICMP6_TYPE: 0
JSTAT_UNKNOWN_ICMP4_TYPE: 0
JSTAT_DOUBLE_ICMP6_ERROR: 0
JSTAT_DOUBLE_ICMP4_ERROR: 0
JSTAT_UNKNOWN_PROTO_INNER: 0
JSTAT_HAIRPIN_LOOP: 0
JSTAT_POOL6_MISMATCH: 0
JSTAT_POOL4_MISMATCH: 9527
JSTAT_ICMP6_FILTER: 0
JSTAT_UNTRANSLATABLE_DST6: 0
JSTAT_UNTRANSLATABLE_DST4: 0
JSTAT_6056_F: 0
JSTAT_MASK_DOMAIN_NOT_FOUND: 0
JSTAT_BIB6_NOT_FOUND: 0
JSTAT_BIB4_NOT_FOUND: 19683
JSTAT_SESSION_NOT_FOUND: 194
JSTAT_ADF: 0
JSTAT_V4_SYN: 45656
JSTAT_SYN6_EXPECTED: 45549
JSTAT_SYN4_EXPECTED: 783
JSTAT_TYPE1PKT: 0
JSTAT_TYPE2PKT: 0
JSTAT_SO_EXISTS: 0
JSTAT_SO_FULL: 0
JSTAT64_SRC: 0
JSTAT64_DST: 0
JSTAT64_PSKB_COPY: 0
JSTAT64_6791_ENOENT: 0
JSTAT64_ICMP_CSUM: 0
JSTAT64_UNTRANSLATABLE_PARAM_PROB_PTR: 0
JSTAT64_TTL: 0
JSTAT64_FRAGMENTED_ICMP: 0
JSTAT64_2XFRAG: 0
JSTAT64_FRAG_THEN_EXT: 0
JSTAT64_SEGMENTS_LEFT: 0
JSTAT46_SRC: 0
JSTAT46_DST: 0
JSTAT46_PSKB_COPY: 0
JSTAT46_6791_ENOENT: 0
JSTAT46_ICMP_CSUM: 0
JSTAT46_UNTRANSLATABLE_PARAM_PROBLEM_PTR: 0
JSTAT46_TTL: 120
JSTAT46_FRAGMENTED_ICMP: 0
JSTAT46_SRC_ROUTE: 0
JSTAT46_FRAGMENTED_ZERO_CSUM: 0
JSTAT46_BAD_MTU: 0
JSTAT_FAILED_ROUTES: 1256
JSTAT_PKT_TOO_BIG: 0
JSTAT_DST_OUTPUT: 0
JSTAT_ICMP6ERR_SUCCESS: 0
JSTAT_ICMP6ERR_FAILURE: 0
JSTAT_ICMP4ERR_SUCCESS: 120
JSTAT_ICMP4ERR_FAILURE: 0
JSTAT_ICMPEXT_BIG: 0
JSTAT_JOOLD_EMPTY: 15405972
JSTAT_JOOLD_TIMEOUT: 23
JSTAT_JOOLD_MISSING_ACK: 60790827
JSTAT_JOOLD_AD_ONGOING: 0
JSTAT_JOOLD_FLUSH_ASAP: 28703736
JSTAT_JOOLD_PKT_FULL: 0
JSTAT_JOOLD_QUEUING: 0
JSTAT_JOOLD_SSS_QUEUED: 76164725
JSTAT_JOOLD_SSS_SENT: 74939867
JSTAT_JOOLD_SSS_RCVD: 138148403
JSTAT_JOOLD_PKT_SENT: 28703759
JSTAT_JOOLD_PKT_RCVD: 34168029
JSTAT_JOOLD_ADS: 0
JSTAT_JOOLD_ACKS: 28705795
JSTAT_UNKNOWN: 0
# ip netns exec sa-jool-ns jool -i nat64-wkp-upper stats display --all
JSTAT_RECEIVED6: 2036
JSTAT_RECEIVED4: 0
JSTAT_SUCCESS: 2036
JSTAT_BIB_ENTRIES: 1678
JSTAT_SESSIONS: 1679
JSTAT_ENOMEM: 0
JSTAT_XLATOR_DISABLED: 0
JSTAT_POOL6_UNSET: 0
JSTAT_SKB_SHARED: 0
JSTAT_L3HDR_OFFSET: 0
JSTAT_SKB_TRUNCATED: 0
JSTAT_HDR6: 0
JSTAT_HDR4: 0
JSTAT_UNKNOWN_L4_PROTO: 0
JSTAT_UNKNOWN_ICMP6_TYPE: 0
JSTAT_UNKNOWN_ICMP4_TYPE: 0
JSTAT_DOUBLE_ICMP6_ERROR: 0
JSTAT_DOUBLE_ICMP4_ERROR: 0
JSTAT_UNKNOWN_PROTO_INNER: 0
JSTAT_HAIRPIN_LOOP: 0
JSTAT_POOL6_MISMATCH: 0
JSTAT_POOL4_MISMATCH: 0
JSTAT_ICMP6_FILTER: 0
JSTAT_UNTRANSLATABLE_DST6: 0
JSTAT_UNTRANSLATABLE_DST4: 0
JSTAT_6056_F: 0
JSTAT_MASK_DOMAIN_NOT_FOUND: 0
JSTAT_BIB6_NOT_FOUND: 0
JSTAT_BIB4_NOT_FOUND: 0
JSTAT_SESSION_NOT_FOUND: 0
JSTAT_ADF: 0
JSTAT_V4_SYN: 0
JSTAT_SYN6_EXPECTED: 0
JSTAT_SYN4_EXPECTED: 0
JSTAT_TYPE1PKT: 0
JSTAT_TYPE2PKT: 0
JSTAT_SO_EXISTS: 0
JSTAT_SO_FULL: 0
JSTAT64_SRC: 0
JSTAT64_DST: 0
JSTAT64_PSKB_COPY: 0
JSTAT64_6791_ENOENT: 0
JSTAT64_ICMP_CSUM: 0
JSTAT64_UNTRANSLATABLE_PARAM_PROB_PTR: 0
JSTAT64_TTL: 0
JSTAT64_FRAGMENTED_ICMP: 0
JSTAT64_2XFRAG: 0
JSTAT64_FRAG_THEN_EXT: 0
JSTAT64_SEGMENTS_LEFT: 0
JSTAT46_SRC: 0
JSTAT46_DST: 0
JSTAT46_PSKB_COPY: 0
JSTAT46_6791_ENOENT: 0
JSTAT46_ICMP_CSUM: 0
JSTAT46_UNTRANSLATABLE_PARAM_PROBLEM_PTR: 0
JSTAT46_TTL: 0
JSTAT46_FRAGMENTED_ICMP: 0
JSTAT46_SRC_ROUTE: 0
JSTAT46_FRAGMENTED_ZERO_CSUM: 0
JSTAT46_BAD_MTU: 0
JSTAT_FAILED_ROUTES: 0
JSTAT_PKT_TOO_BIG: 0
JSTAT_DST_OUTPUT: 0
JSTAT_ICMP6ERR_SUCCESS: 0
JSTAT_ICMP6ERR_FAILURE: 0
JSTAT_ICMP4ERR_SUCCESS: 0
JSTAT_ICMP4ERR_FAILURE: 0
JSTAT_ICMPEXT_BIG: 0
JSTAT_JOOLD_EMPTY: 28665501
JSTAT_JOOLD_TIMEOUT: 2035
JSTAT_JOOLD_MISSING_ACK: 0
JSTAT_JOOLD_AD_ONGOING: 0
JSTAT_JOOLD_FLUSH_ASAP: 1
JSTAT_JOOLD_PKT_FULL: 0
JSTAT_JOOLD_QUEUING: 0
JSTAT_JOOLD_SSS_QUEUED: 2036
JSTAT_JOOLD_SSS_SENT: 2036
JSTAT_JOOLD_SSS_RCVD: 138965356
JSTAT_JOOLD_PKT_SENT: 2036
JSTAT_JOOLD_PKT_RCVD: 34271807
JSTAT_JOOLD_ADS: 0
JSTAT_JOOLD_ACKS: 28635463
JSTAT_UNKNOWN: 0

[Thu Aug 31 14:19:44 2023] Jool: Deleted instance 'nat64-wkp-upper'.
[Thu Aug 31 14:19:44 2023] Jool: Deleted instance 'nat64-wkp-lower'.
[Thu Aug 31 14:19:44 2023] Jool: Deleted instance 'siit-dmz'.
[Thu Aug 31 14:19:52 2023] NAT64 Jool v4.1.10.2 module removed.
[Thu Aug 31 14:19:53 2023] SIIT Jool v4.1.10.2 module removed.
[Thu Aug 31 14:19:53 2023] Jool: Core Jool v4.1.10.2 module removed.
[Thu Aug 31 14:19:53 2023] ixgbe 0000:05:00.0: removed PHC on enp5s0f0
[Thu Aug 31 14:19:53 2023] ixgbe 0000:05:00.1: removed PHC on enp5s0f1
[Thu Aug 31 14:19:53 2023] device eno3 left promiscuous mode
[Thu Aug 31 14:19:53 2023] device eno4 left promiscuous mode

However, I'm not sure if the numbers for any of these counters are alarming. Anything I should try to drill down on?

@ydahhrk
Member

ydahhrk commented Aug 31, 2023

We have 3 instances (1 SIIT, 2 NAT64) running on the box; here are the stats from all:

Waitwaitwaitwaitwait. WhAt iS HaPpEnInG?!?

  1. Why is there an SIIT in your setup and why do you think it's relevant?
  2. What do you mean "on the box"? Are all three instances running on the same machine?

The numbers more or less tell me that both NAT64 instances are receiving ALL the joold packets; even the ones they themselves multicasted. This is backwards.

And it seems to confirm the theory that they're running on the same machine. They're subscribed to literally the exact same Netlink multicast group... in the same kernel.

We're going to have to zoom out for a sec, and scrutinize your setup. Please draw your network, provide your configuration, and explain what you're trying to do. (If you need to obfuscate your IPs, please obfuscate them consistently.)

The entire point of Active/Active is load balancing. You don't get anything out of the load balancing if both artifacts are running on the same processor. If all you want is a backup instance in case one of them crashes, PLEASE do Active/Passive, and also, place the instances in different (at least virtual) machines.

Jool is a kernel module, which means that if one of them crashes, the entire system will go down with it. (And that includes the other Jool, if it's on the same system.)

Still not seeing any leak info in dmesg.

You don't need to wait 16 hours to find out if you did this correctly.

Just compile the module with -DJKMEMLEAK, reinstall it, then quickly insert and remove it, then immediately query dmesg:

$ make CFLAGS_MODULE="-DJKMEMLEAK"
$ sudo make install
$ sudo modprobe jool && sudo modprobe -r jool; sudo dmesg -ct
Jool: Core Jool v4.1.10.2 module inserted.
NAT64 Jool v4.1.10.2 module inserted.
NAT64 Jool v4.1.10.2 module removed.
Jool: Memory leaks:
Jool: None.
Jool: Core Jool v4.1.10.2 module removed.

If you don't see the "Memory leaks:" line right away, something went wrong.

@suffieldacademy
Author

I realize I haven't explained the architecture and it sounds like I'm crazy. I'll DM you the details on the network as it's more than I can easily fit here (and I don't want to post a full net diagram). It probably is crazy, but hopefully not in a breaking sort of way.

For now, with regard to the module, I think the issue is that I'm installing via dkms. When I run "make", that must only be building userspace (make install doesn't change what's in /lib/modules). How do I pass the MEMLEAK flags to dkms?

@ydahhrk
Member

ydahhrk commented Sep 9, 2023

How do I pass the MEMLEAK flags to dkms?

One way is to add -DJKMEMLEAK to ccflags-y at src/mod/common/Kbuild:

ccflags-y := -I$(src)/../.. -DJKMEMLEAK
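
Then rebuild and reinstall through dkms as usual; something along these lines (module/version string per dkms status; the one below is illustrative):

$ sudo dkms remove jool/4.1.10 --all
$ sudo dkms install jool/4.1.10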

@ydahhrk ydahhrk modified the milestone: 4.1.11 Dec 22, 2023
@ydahhrk
Member

ydahhrk commented Jan 9, 2024

Observations by stat:

  • JSTAT_JOOLD_PKT_SENT, JSTAT_JOOLD_ACKS: These always match. No ACKs are getting lost, in any instance.

Oddities by stat:

  • JSTAT_JOOLD_FLUSH_ASAP: B upper still seems to have flush-asap enabled. (While the others do not.)

(I'd expect all the flush-asaps to be consistent, and at this point I'd prefer them all to be disabled.)

  • JSTAT_JOOLD_QUEUING: Neither of the B instances are queuing, yet both A instances are queuing.

I'd expect both primaries to be queuing. The secondaries should not be queuing, unless one of the primaries has gone offline.

This might just be a consequence of the previous oddity; B is flushing quickly, which means queuing is discouraged.


Observations by instance:

B lower (secondary)

This instance received 7311 packets from IPv6, they were all translated successfully, and they were all joold'd queuelessly because they all happened more than ss-flush-deadline seconds away from each other. The counters are consistent.

It didn't receive any IPv4 packets; not even as responses for the IPv6 packets. It seems the Internet is correctly routing all lower packets to A, while the LAN isn't. (Though stranded lower packets are few.)

This is the only instance that isn't printing Too many sessions deferred!.

A upper (secondary)

The counters suggest this instance is receiving more traffic that belongs to its primary, through both IPv's. It also had to drop some sessions.

However, the advertise suggests there was a period in which the primary was inactive, and I suppose that explains both quirks.

B upper (primary)
A lower (primary)

Other than ENOSPC, I don't see anything suspicious going on.

But ENOSPC is consistently small in the grand scheme of things. Maybe there's not much in the way of bugs left, and all that's left to do is adjust each field for its intended purpose.

  • ss-capacity might be too small; try raising it.
  • Since no ACKs are getting lost, the Netlink channel is probably not getting saturated. You could also try shrinking ss-flush-deadline, to reduce the queuing needed.
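
Both are runtime switches, so something along these lines (values illustrative) can be applied without restarting:

$ sudo jool global update ss-capacity 2048
$ sudo jool global update ss-flush-deadline 1000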

@ydahhrk
Member

ydahhrk commented Jan 10, 2024

Still getting "Too many sessions deferred!" messages, including over this holiday weekend with very low traffic amounts.

Maybe someone left a cron job that suddenly spikes the traffic at midnight, or something?

I suppose I could add stat counters to keep track of the queue length and ACK RTT. You could decide your ss-capacity and ss-flush-deadline based on these numbers.

Difficult to say if it's causing issues; no complaints from users but the connections might be retried?

Sure, they might be recovering on their own. Also, many sessions joold sends are redundant because they need to be updated very shortly afterwards.

@suffieldacademy
Author

Sorry, been very busy these last couple weeks. I'm going to dump out the current status of the counters just so we have recent numbers and so we can see if there are any changes. After this I'll double-check all the settings you mentioned (ss-capacity and ss-flush-deadline) and make some tweaks.

# (node a primary) ip netns exec sa-jool-ns jool -i nat64-wkp-lower stats display
JSTAT_RECEIVED6: 3610658425
JSTAT_RECEIVED4: 9966468183
JSTAT_SUCCESS: 13556576935
JSTAT_BIB_ENTRIES: 29270
JSTAT_SESSIONS: 29294
JSTAT_SKB_TRUNCATED: 2762
JSTAT_HDR4: 1
JSTAT_UNKNOWN_L4_PROTO: 528
JSTAT_UNKNOWN_ICMP4_TYPE: 96
JSTAT_UNKNOWN_PROTO_INNER: 2
JSTAT_POOL4_MISMATCH: 613345
JSTAT_BIB4_NOT_FOUND: 4894290
JSTAT_SESSION_NOT_FOUND: 21558
JSTAT_V4_SYN: 8859089
JSTAT_SYN6_EXPECTED: 5657127
JSTAT_SYN4_EXPECTED: 61559
JSTAT64_TTL: 11491
JSTAT46_TTL: 11983
JSTAT_FAILED_ROUTES: 415318
JSTAT_ICMP6ERR_SUCCESS: 11491
JSTAT_ICMP4ERR_SUCCESS: 11983
JSTAT_ICMPEXT_BIG: 524
JSTAT_JOOLD_EMPTY: 76781732
JSTAT_JOOLD_TIMEOUT: 1
JSTAT_JOOLD_MISSING_ACK: 4581630543
JSTAT_JOOLD_AD_ONGOING: 228
JSTAT_JOOLD_PKT_FULL: 520211402
JSTAT_JOOLD_QUEUING: 8899259103
JSTAT_JOOLD_SSS_QUEUED: 13556123855
JSTAT_JOOLD_SSS_SENT: 13525502358
JSTAT_JOOLD_SSS_RCVD: 104772
JSTAT_JOOLD_SSS_ENOSPC: 30627365
JSTAT_JOOLD_PKT_SENT: 520211631
JSTAT_JOOLD_PKT_RCVD: 104772
JSTAT_JOOLD_ADS: 1
JSTAT_JOOLD_ACKS: 520211631
# (node a secondary) ip netns exec sa-jool-ns jool -i nat64-wkp-upper stats display
JSTAT_RECEIVED6: 392104
JSTAT_RECEIVED4: 1917566
JSTAT_SUCCESS: 2305977
JSTAT_BIB_ENTRIES: 5576
JSTAT_SESSIONS: 5583
JSTAT_UNKNOWN_L4_PROTO: 1
JSTAT_POOL4_MISMATCH: 610
JSTAT_BIB4_NOT_FOUND: 122
JSTAT_V4_SYN: 2925
JSTAT_SYN6_EXPECTED: 22
JSTAT_SYN4_EXPECTED: 13
JSTAT_JOOLD_EMPTY: 1660972
JSTAT_JOOLD_TIMEOUT: 109670
JSTAT_JOOLD_MISSING_ACK: 750747
JSTAT_JOOLD_AD_ONGOING: 38
JSTAT_JOOLD_PKT_FULL: 84037
JSTAT_JOOLD_QUEUING: 1441682
JSTAT_JOOLD_SSS_QUEUED: 2305880
JSTAT_JOOLD_SSS_SENT: 2304131
JSTAT_JOOLD_SSS_RCVD: 5992950272
JSTAT_JOOLD_SSS_ENOSPC: 2709
JSTAT_JOOLD_PKT_SENT: 193745
JSTAT_JOOLD_PKT_RCVD: 1471006661
JSTAT_JOOLD_ADS: 1
JSTAT_JOOLD_ACKS: 193743
# (node b secondary) ip netns exec sa-jool-ns jool -i nat64-wkp-lower stats display
JSTAT_RECEIVED6: 105129
JSTAT_SUCCESS: 105129
JSTAT_BIB_ENTRIES: 29411
JSTAT_SESSIONS: 29441
JSTAT_JOOLD_EMPTY: 1652625
JSTAT_JOOLD_TIMEOUT: 105129
JSTAT_JOOLD_SSS_QUEUED: 105129
JSTAT_JOOLD_SSS_SENT: 105129
JSTAT_JOOLD_SSS_RCVD: 13491263006
JSTAT_JOOLD_PKT_SENT: 105129
JSTAT_JOOLD_PKT_RCVD: 518894731
JSTAT_JOOLD_ACKS: 105128
# (node b primary) ip netns exec sa-jool-ns jool -i nat64-wkp-upper stats display
JSTAT_RECEIVED6: 999272889
JSTAT_RECEIVED4: 5024326643
JSTAT_SUCCESS: 6019325034
JSTAT_BIB_ENTRIES: 5500
JSTAT_SESSIONS: 5505
JSTAT_SKB_TRUNCATED: 527
JSTAT_HDR4: 1
JSTAT_UNKNOWN_L4_PROTO: 595
JSTAT_UNKNOWN_ICMP4_TYPE: 18
JSTAT_UNKNOWN_PROTO_INNER: 1
JSTAT_POOL4_MISMATCH: 514425
JSTAT_BIB4_NOT_FOUND: 430017
JSTAT_SESSION_NOT_FOUND: 38569
JSTAT_V4_SYN: 2814391
JSTAT_SYN6_EXPECTED: 439203
JSTAT_SYN4_EXPECTED: 23852
JSTAT64_TTL: 16
JSTAT_FAILED_ROUTES: 12865
JSTAT_ICMP6ERR_SUCCESS: 16
JSTAT_ICMPEXT_BIG: 18
JSTAT_JOOLD_EMPTY: 667072784
JSTAT_JOOLD_TIMEOUT: 9749
JSTAT_JOOLD_MISSING_ACK: 5353347106
JSTAT_JOOLD_FLUSH_ASAP: 1475364608
JSTAT_JOOLD_SSS_QUEUED: 6018872393
JSTAT_JOOLD_SSS_SENT: 6010419681
JSTAT_JOOLD_SSS_RCVD: 2287867
JSTAT_JOOLD_SSS_ENOSPC: 8452712
JSTAT_JOOLD_PKT_SENT: 1475374357
JSTAT_JOOLD_PKT_RCVD: 192796
JSTAT_JOOLD_ACKS: 1475374357

@suffieldacademy
Author

OK, I have reconfigured the instances with the following

  • flush-asap was accidentally still on for node b; it is now disabled for all instances
  • ss-flush-deadline halved from 2000 to 1000
  • ss-capacity changed from default/unspecified (512) to 2048
  • logging-bib disabled (it was disabled previously as I was trying to reduce overall load on the machine, but wanted to point out that I had made the change)

It's not possible to restart both instances simultaneously without causing an outage, so I restarted node b followed by node a. Below are the stats for both shortly after both instances were brought back online so we have a basis for comparison:

# (node a primary) ip netns exec sa-jool-ns jool -i nat64-wkp-lower stats display
JSTAT_RECEIVED6: 14720
JSTAT_RECEIVED4: 34378
JSTAT_SUCCESS: 48210
JSTAT_BIB_ENTRIES: 5829
JSTAT_SESSIONS: 5857
JSTAT_POOL4_MISMATCH: 1
JSTAT_BIB4_NOT_FOUND: 34
JSTAT_V4_SYN: 9
JSTAT_SYN6_EXPECTED: 602
JSTAT_SYN4_EXPECTED: 229
JSTAT_FAILED_ROUTES: 13
JSTAT_JOOLD_EMPTY: 460
JSTAT_JOOLD_TIMEOUT: 2
JSTAT_JOOLD_MISSING_ACK: 10646
JSTAT_JOOLD_PKT_FULL: 1854
JSTAT_JOOLD_QUEUING: 37145
JSTAT_JOOLD_SSS_QUEUED: 48222
JSTAT_JOOLD_SSS_SENT: 48206
JSTAT_JOOLD_SSS_RCVD: 133564
JSTAT_JOOLD_PKT_SENT: 1856
JSTAT_JOOLD_PKT_RCVD: 5138
JSTAT_JOOLD_ACKS: 1856
# (node a secondary) ip netns exec sa-jool-ns jool -i nat64-wkp-upper stats display
JSTAT_RECEIVED6: 2
JSTAT_SUCCESS: 2
JSTAT_BIB_ENTRIES: 2818
JSTAT_SESSIONS: 2831
JSTAT_JOOLD_EMPTY: 31
JSTAT_JOOLD_TIMEOUT: 2
JSTAT_JOOLD_SSS_QUEUED: 2
JSTAT_JOOLD_SSS_SENT: 2
JSTAT_JOOLD_SSS_RCVD: 223184
JSTAT_JOOLD_PKT_SENT: 2
JSTAT_JOOLD_PKT_RCVD: 8584
JSTAT_JOOLD_ACKS: 2
# (node b secondary) ip netns exec sa-jool-ns jool -i nat64-wkp-lower stats display
JSTAT_RECEIVED6: 64739
JSTAT_RECEIVED4: 115575
JSTAT_SUCCESS: 178336
JSTAT_BIB_ENTRIES: 9661
JSTAT_SESSIONS: 9703
JSTAT_POOL4_MISMATCH: 12
JSTAT_BIB4_NOT_FOUND: 41
JSTAT_V4_SYN: 23
JSTAT_SYN6_EXPECTED: 1293
JSTAT_SYN4_EXPECTED: 576
JSTAT64_TTL: 1
JSTAT_FAILED_ROUTES: 32
JSTAT_ICMP6ERR_SUCCESS: 1
JSTAT_JOOLD_EMPTY: 1913
JSTAT_JOOLD_TIMEOUT: 5
JSTAT_JOOLD_MISSING_ACK: 33678
JSTAT_JOOLD_AD_ONGOING: 332
JSTAT_JOOLD_PKT_FULL: 6857
JSTAT_JOOLD_QUEUING: 142827
JSTAT_JOOLD_SSS_QUEUED: 178331
JSTAT_JOOLD_SSS_SENT: 186896
JSTAT_JOOLD_SSS_RCVD: 530090
JSTAT_JOOLD_PKT_SENT: 7194
JSTAT_JOOLD_PKT_RCVD: 20390
JSTAT_JOOLD_ADS: 1
JSTAT_JOOLD_ACKS: 7194
# (node b primary) ip netns exec sa-jool-ns jool -i nat64-wkp-upper stats display
JSTAT_RECEIVED6: 209411
JSTAT_RECEIVED4: 230710
JSTAT_SUCCESS: 439664
JSTAT_BIB_ENTRIES: 3005
JSTAT_SESSIONS: 3017
JSTAT_POOL4_MISMATCH: 16
JSTAT_BIB4_NOT_FOUND: 2
JSTAT_V4_SYN: 66
JSTAT_SYN6_EXPECTED: 223
JSTAT_SYN4_EXPECTED: 150
JSTAT_JOOLD_EMPTY: 1946
JSTAT_JOOLD_TIMEOUT: 3
JSTAT_JOOLD_MISSING_ACK: 140810
JSTAT_JOOLD_AD_ONGOING: 114
JSTAT_JOOLD_PKT_FULL: 16909
JSTAT_JOOLD_QUEUING: 296995
JSTAT_JOOLD_SSS_QUEUED: 439664
JSTAT_JOOLD_SSS_SENT: 442580
JSTAT_JOOLD_SSS_RCVD: 238860
JSTAT_JOOLD_PKT_SENT: 17026
JSTAT_JOOLD_PKT_RCVD: 9190
JSTAT_JOOLD_ADS: 1
JSTAT_JOOLD_ACKS: 17026

@suffieldacademy
Author

A quick check-in after running this for about a week. Anecdotally, the number of "Too many sessions deferred!" messages has diminished; we're now seeing hours between some of the messages, whereas before it was several times per hour. We still have some bursts that are closer together, but overall the rate has slowed.

Assuming this improvement is from tweaking the parameters, should we continue to change them? I'm guessing we would need to continue increasing ss-capacity as lowering the deadline probably won't continue to help. If so, I can make it much bigger (10x) just to see if it helps. Since there is a flush deadline, having a large capacity shouldn't affect anything other than RAM usage, right (even though the buffer is big, it will still be flushed on a schedule)?

Below I'll post our current counters just for reference and so we can get an absolute sense of how things are going.

# (node a primary) ip netns exec sa-jool-ns jool -i nat64-wkp-lower stats display
JSTAT_RECEIVED6: 1069308022
JSTAT_RECEIVED4: 2520677922
JSTAT_SUCCESS: 3583041960
JSTAT_BIB_ENTRIES: 42870
JSTAT_SESSIONS: 42945
JSTAT_SKB_TRUNCATED: 43715
JSTAT_UNKNOWN_L4_PROTO: 105
JSTAT_UNKNOWN_ICMP4_TYPE: 8
JSTAT_POOL4_MISMATCH: 107970
JSTAT_BIB4_NOT_FOUND: 1751874
JSTAT_SESSION_NOT_FOUND: 10338
JSTAT_V4_SYN: 3347143
JSTAT_SYN6_EXPECTED: 1484615
JSTAT_SYN4_EXPECTED: 23260
JSTAT64_TTL: 1376
JSTAT46_TTL: 3796
JSTAT_FAILED_ROUTES: 169680
JSTAT_ICMP6ERR_SUCCESS: 1376
JSTAT_ICMP4ERR_SUCCESS: 3796
JSTAT_ICMPEXT_BIG: 104
JSTAT_JOOLD_EMPTY: 19626025
JSTAT_JOOLD_TIMEOUT: 2
JSTAT_JOOLD_MISSING_ACK: 1215434001
JSTAT_JOOLD_PKT_FULL: 137771498
JSTAT_JOOLD_QUEUING: 2347821764
JSTAT_JOOLD_SSS_QUEUED: 3582591136
JSTAT_JOOLD_SSS_SENT: 3582058950
JSTAT_JOOLD_SSS_RCVD: 153259
JSTAT_JOOLD_SSS_ENOSPC: 532170
JSTAT_JOOLD_PKT_SENT: 137771500
JSTAT_JOOLD_PKT_RCVD: 24833
JSTAT_JOOLD_ACKS: 137771500
# (node a secondary) ip netns exec sa-jool-ns jool -i nat64-wkp-upper stats display
JSTAT_RECEIVED6: 25717
JSTAT_SUCCESS: 25717
JSTAT_BIB_ENTRIES: 6563
JSTAT_SESSIONS: 6576
JSTAT_JOOLD_EMPTY: 312554
JSTAT_JOOLD_TIMEOUT: 23184
JSTAT_JOOLD_MISSING_ACK: 61
JSTAT_JOOLD_QUEUING: 3756
JSTAT_JOOLD_SSS_QUEUED: 25717
JSTAT_JOOLD_SSS_SENT: 25717
JSTAT_JOOLD_SSS_RCVD: 1778029696
JSTAT_JOOLD_PKT_SENT: 23184
JSTAT_JOOLD_PKT_RCVD: 68426186
JSTAT_JOOLD_ACKS: 23184
# (node b secondary) ip netns exec sa-jool-ns jool -i nat64-wkp-lower stats display
JSTAT_RECEIVED6: 84498
JSTAT_RECEIVED4: 115575
JSTAT_SUCCESS: 198095
JSTAT_BIB_ENTRIES: 43534
JSTAT_SESSIONS: 43616
JSTAT_POOL4_MISMATCH: 12
JSTAT_BIB4_NOT_FOUND: 41
JSTAT_V4_SYN: 23
JSTAT_SYN6_EXPECTED: 1293
JSTAT_SYN4_EXPECTED: 576
JSTAT64_TTL: 1
JSTAT_FAILED_ROUTES: 32
JSTAT_ICMP6ERR_SUCCESS: 1
JSTAT_JOOLD_EMPTY: 312296
JSTAT_JOOLD_TIMEOUT: 19764
JSTAT_JOOLD_MISSING_ACK: 33678
JSTAT_JOOLD_AD_ONGOING: 332
JSTAT_JOOLD_PKT_FULL: 6857
JSTAT_JOOLD_QUEUING: 142827
JSTAT_JOOLD_SSS_QUEUED: 198090
JSTAT_JOOLD_SSS_SENT: 206655
JSTAT_JOOLD_SSS_RCVD: 3568568564
JSTAT_JOOLD_PKT_SENT: 26953
JSTAT_JOOLD_PKT_RCVD: 137252639
JSTAT_JOOLD_ADS: 1
JSTAT_JOOLD_ACKS: 26953
# (node b primary) ip netns exec sa-jool-ns jool -i nat64-wkp-upper stats display
JSTAT_RECEIVED6: 311602953
JSTAT_RECEIVED4: 1474760933
JSTAT_SUCCESS: 1785639579
JSTAT_BIB_ENTRIES: 6460
JSTAT_SESSIONS: 6472
JSTAT_SKB_TRUNCATED: 5022
JSTAT_UNKNOWN_L4_PROTO: 104
JSTAT_POOL4_MISMATCH: 89860
JSTAT_BIB4_NOT_FOUND: 80313
JSTAT_SESSION_NOT_FOUND: 13837
JSTAT_V4_SYN: 435972
JSTAT_SYN6_EXPECTED: 88170
JSTAT_SYN4_EXPECTED: 6203
JSTAT64_TTL: 9
JSTAT_FAILED_ROUTES: 4784
JSTAT_ICMP6ERR_SUCCESS: 9
JSTAT_ICMPEXT_BIG: 33
JSTAT_JOOLD_EMPTY: 7269038
JSTAT_JOOLD_TIMEOUT: 85005
JSTAT_JOOLD_MISSING_ACK: 653906446
JSTAT_JOOLD_AD_ONGOING: 114
JSTAT_JOOLD_PKT_FULL: 68619571
JSTAT_JOOLD_QUEUING: 1124598375
JSTAT_JOOLD_SSS_QUEUED: 1785483148
JSTAT_JOOLD_SSS_SENT: 1785267361
JSTAT_JOOLD_SSS_RCVD: 264485
JSTAT_JOOLD_SSS_ENOSPC: 218723
JSTAT_JOOLD_PKT_SENT: 68704690
JSTAT_JOOLD_PKT_RCVD: 32293
JSTAT_JOOLD_ADS: 1
JSTAT_JOOLD_ACKS: 68704690

@ydahhrk
Member

ydahhrk commented Feb 1, 2024

Analysis attached. (Tip: Hide columns B, C, G and H.)

I don't see anything alarming anymore, though I still don't quite like the numbers. Excluding the sessions it dropped because of ss-capacity, a-primary sent 3,582,010,744 sessions, of which 3,568,038,474 made it to b-secondary. That's 13,972,270 (0.39%) lost.

But no Netlink packets are getting dropped, so it seems they're getting lost in the network.

Assuming this improvement is from tweaking the parameters, should we continue to change them?

Well... ideally yes, though the ss-capacity session drop count (532,170) seems somewhat small in comparison to the (presumable) network drop count (13,972,270).

I'm guessing we would need to continue increasing ss-capacity as lowering the deadline probably won't continue to help.

Hmm... HMMMM. 1 second is still a lot in computer time. I'm trying to think how we might graph the average RTT with those ACKs, but since there are a lot every second, the data set of a single day would probably be gargantuan.

ss-flush-deadline mainly exists to prevent Jool from saturating the Netlink channel. The default is comical; it aims for one Netlink packet at a time. I don't know what the limit is, but you can probably easily fit more in there without breaking a sweat.

The more data you can send at a time, the less you have to queue. You can dramatically reduce your needed capacity by sending more greedily.

I mean... your current capacity is 2048. Since you can fit 26 sessions per packet... that's a 79-sized packet queue. You have 79 packets waiting for their turn, and you can only send one at a time, each having to wait for the previous's ACK. And while that happens, more sessions are arriving.

And sure, since those traffic spikes seem to be rare, a mega queue will probably be able to retain the sessions long enough for the queue to gradually clear naturally. But it's one of those things where increasing capacity will net you linear improvements, while decreasing deadline could be an order of magnitude better.

The way to know if you're saturating the Netlink channel is by checking if the ACKs are getting dropped. That would be JSTAT_JOOLD_PKT_SENT - JSTAT_JOOLD_ACKS. I'd say keep reducing ss-flush-deadline until the subtraction stops being zero.
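
For reference, a quick way to eyeball that subtraction:

$ sudo jool stat display --all | grep -E 'JSTAT_JOOLD_PKT_SENT|JSTAT_JOOLD_ACKS'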

I can make it much bigger (10x) just to see if it helps.

I'm trying to think why you would not want to do this, but I'm coming out empty-handed.

Outside of the unit tests and cleanup, Jool never traverses more than ss-max-sessions-per-packet nodes in that list at a time, so the performance shouldn't tank. It seems your RAM is your limit, but each node costs 112 bytes (in my PC)... a big fat nothing.

Since there is a flush deadline, having a large capacity shouldn't affect anything other than RAM usage, right (even though the buffer is big, it will still be flushed on a schedule)?

Right... I think.

I just realized the word "flush" is misleading. I don't see anything in your text that suggests you've been misled, but just to be clear:

It's not a strict full "flush." It only "flushes" the number of sessions it can send at a time (in your case, 26). Because it doesn't drop them; it forces a single packet out.

Sessions that arrive while the queue is full are dropped. So in times of stress, the queue fills up, and new sessions are dropped. The deadline "flushes" 26 sessions. So 26 new sessions make it in, and the rest are dropped again.

If you can think of a better way to implement this, let me know.

@ydahhrk
Member

ydahhrk commented Feb 1, 2024

Well... ideally yes
(...)
I'd say keep reducing ss-flush-deadline until the subtraction stops being zero.

(Also, remember that ss-flush-deadline and ss-capacity are runtime switches, and don't have to be synchronized across instances. You can experiment without having to restart nor re-advertise.)

@suffieldacademy
Author

Thank you so much for that spreadsheet, it helped make a lot of things clearer. Additionally, your description of ss-flush-deadline was very helpful; I was imagining it more as a "if a full session packet hasn't been sent within this timeframe, send one anyway." I see now that it's more of an unacked packet timer, so making it lower does make sense.

I've lowered ss-flush-deadline to 100ms to see what that does. The machines are directly connected, so there should be very low RTT latency network-wise (only processing on the nodes), so I think that's a reasonable starting point and we could even go lower. I'll watch the numbers for a bit and see how they look. My students are away for the weekend, so I'll wait until they're back so we have normal traffic levels.

@ydahhrk
Member

ydahhrk commented Feb 2, 2024

Additionally, your description of ss-flush-deadline was very helpful; I was imagining it more as a "if a full session packet hasn't been sent within this timeframe, send one anyway." I see now that it's more of an unacked packet timer, so making it lower does make sense.

Wait... what?

... ...

[Scratches head] Ok, ss-flush-deadline is both of those things. It's just that the latter is more important right now, yeah.

Look, full disclosure. Let's let the code talk. That function returns true if the packet should be sent, false if the buffering should continue.

As you can see, the flush_deadline validation is pretty high up and short-circuits, skipping the rest of the function. This can be read as "if more than ss-flush-deadline milliseconds have elapsed since the last time we sent a packet, then I don't care if we haven't received an ACK, I don't care about ADs (advertisements), don't care about flush-asap, don't care if the queue is small. Send it out NOW."

So yes; it can be used to skip slow ACKs (assuming they're really being slow, IDK), but in essence it's more of a general stagnation-avoidance policy.

Sorry for the confusion.

I'll watch the numbers for a bit and see how they look. My students are away for the weekend, so I'll wait until they're back so we have normal traffic levels.

ACK.

@suffieldacademy
Author

OK, we've been back for a bit so I gathered some more stats and pushed them into a spreadsheet. The loss rates are less than 1%, but don't look a whole lot better than the last time (this is with a 100ms flush deadline).

Everything is still being ACKed, so I'll lower the flush deadline again to see if there are any changes.

sa-joold-stats.ods

@ydahhrk
Member

ydahhrk commented Feb 23, 2024

Did you also increase ss-capacity?

@suffieldacademy
Author

Yes, simultaneously decreased ss-flush-deadline to 20ms and increased ss-capacity to 16384. Latest numbers are attached for review. At first glance it looks like the overall loss is about the same, though the "packets queued but not sent" has dropped to almost negligible levels, which is good. Also, it looks like ENOSPC is zero for both, so that's good, right?

I've lost track a little bit of what the different numbers mean. Should I continue to lower the deadline and increase buffer, or now that ENOSPC is zero am I basically done? Are all the other lost packets just due to network losses of some sort?

sa-joold-stats-20240225.ods

@ydahhrk
Member

ydahhrk commented Feb 27, 2024

Also, it looks like ENOSPC is zero for both, so that's good, right?

Yes

Should I continue to lower the deadline and increase buffer, or now that ENOSPC is zero am I basically done?

The capacity and deadline only target ENOSPC. You're good.

Are all the other lost packets just due to network losses of some sort?

Probably; I don't see any other explanation.

If you want to analyze it, I just uploaded one more test commit: Stats on the userspace daemons.

This one is simpler: fcc5ccc.

If you want. At this point I feel like the bug is fixed already.

@suffieldacademy
Author

Sorry, was on vacation and then busy. Trying out the latest commit so we can put this issue to bed. Unfortunately, joold doesn't seem to want to start with this latest version. I get a log message saying it's trying to open a "port" that is actually the modsocket file path:

2024-03-30T22:46:55.184489-04:00 mario joold: Setting up statsocket (port /run/sa-joold-nat64-wkp-lower/modsocket.json)...

Maybe an off-by-one error in the arg processing?

        if (argc < 3) {
                syslog(LOG_INFO, "statsocket port unavailable; skipping statsocket.");
                return 0;
        }

        error = create_socket(argv[2], &sk);
        if (error)
                return error;

I'm exiting with a -8 code. My guess is I need to pass a third arg to joold but I couldn't find any documentation on that. Any hints on what to pass (and if the code is correct)? I'm not much of a C hacker so these are my only guesses...

@ydahhrk
Member

ydahhrk commented Apr 2, 2024

Any hints on what to pass?

See the commit message of fcc5ccc.

Maybe an off-by-one error in the arg processing?

You're right.

@ydahhrk
Member

ydahhrk commented Apr 2, 2024

Patch uploaded: b1e5021

@suffieldacademy
Author

See the commit message of fcc5ccc.

Ah, sorry. I did see that go by but got distracted and only looked at the man page. I've got that figured out now.

There is still an off-by-one in statsocket.c (wrong argv[] index):

diff --git a/src/usr/joold/statsocket.c b/src/usr/joold/statsocket.c
index 234ca199..b2e220d2 100644
--- a/src/usr/joold/statsocket.c
+++ b/src/usr/joold/statsocket.c
@@ -117,7 +117,7 @@ int statsocket_start(int argc, char **argv)
                return 0;
        }
 
-       error = create_socket(argv[2], &sk);
+       error = create_socket(argv[3], &sk);
        if (error)
                return error;

With that change I'm up and running and I am able to query the stats socket, so I think we're all set.

Thank you again for all of your help and patience with this issue! I'm fine to close this out now, and look forward to having it all rolled up in a future release.

@ydahhrk
Member

ydahhrk commented Jul 7, 2024

There is still an off-by-one in statsocket.c (wrong argv[] index):

Huhhhhhhhhhhhh?

🤦🤦🤦🤦🤦🤦🤦

How did I miss this during the release? How did I miss it during the TESTING?

WHAT

ydahhrk added a commit that referenced this issue Jul 7, 2024
@ydahhrk
Member

ydahhrk commented Jul 7, 2024

You know what, I'm going to upgrade to argp. This is clearly not working out.

ydahhrk added a commit that referenced this issue Jul 8, 2024
For #410.

Ugh. Ran out of time, and I still have some issues with it.
Also, it's missing documentation.

Will keep grinding next weekend.
ydahhrk added a commit that referenced this issue Jul 13, 2024
ydahhrk added a commit that referenced this issue Aug 3, 2024
Before:

	echo '{ "port": "9999" }' > statsocket.json
	joold netsocket.json modsocket.json statsocket.json

Now:

	joold netsocket.json modsocket.json 9999

Restores the fcc5ccc interface.
It's less consistent, but eliminates the need to re-explain the third
argument in #410.

I don't mind the inconsistency, because `joold` has been superseded by
`jool session proxy` anyway.
ydahhrk added a commit that referenced this issue Aug 4, 2024
This option is just a liability at this point, and its ill-advised
default is a trap. Early flushing is no longer an option; SS always
queues now.

Rather than Active/Active, it's best to set up two Active/Passive
couples, per #410.
@ydahhrk
Member

ydahhrk commented Aug 4, 2024

Hello again. Think I'm finally happy with it.

joold's interface has been bugging me for a while, because of its strange reliance on files:

joold <netsocket file> <modsocket file> <statsocket port>

I decided to deprecate that, and move userspace joold to jool session proxy. As determined before, it's argp-based, like the rest of jool's commands:

$ jool session proxy --help
Usage: proxy [OPTION...] <net.mcast.address>

  -i, --net.dev.in=STR       IPv4: IP_ADD_MEMBERSHIP; IPv6: IPV6_ADD_MEMBERSHIP
                             (see ip(7))
  -o, --net.dev.out=STR      IPv4: IP_MULTICAST_IF, IPv6: IPV6_MULTICAST_IF
                             (see ip(7))
  -p, --net.mcast.port=STR   UDP port where the sessions will be advertised
      --stats.address=STR    Address to bind the stats socket to
      --stats.port=STR       Port to bind the stats socket to
  -t, --net.ttl=INT          Multicast datagram Time To Live
  -?, --help                 Give this help list
      --usage                Give a short usage message
  -V, --version              Print program version

Similarly, jool joold advertise is now jool session advertise.
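
An illustrative invocation (multicast address and ports made up):

$ jool session proxy -p 6400 --stats.port 9999 ff08::db8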

You don't need to update your scripts because, even though the old commands are deprecated, they still work, and I'm not really in a rush to delete them. But it'd be great if you could confirm I didn't break something again.

The code is in the main branch.

@ydahhrk ydahhrk added this to the 4.1.13 milestone Aug 24, 2024
@ydahhrk
Member

ydahhrk commented Aug 24, 2024

Version 4.1.13 released; closing.

@ydahhrk ydahhrk closed this as completed Aug 24, 2024