New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
th/netlink (not more libnl3) #67
Conversation
30db304
to
a45f90a
Compare
4e2664c
to
ced3644
Compare
This crash can also happen on master. Attached patch to downstream bug (same patch is also part of this pull request for the moment). |
According to latest measurement, this branch increases the binary size of |
Looks good to me. |
Platform invokes change events while reading netlink events. However, platform code is not re-entrant and calling into platform again is not allowed (aside operations that do not process the netlink socket, like lookup of the platform cache). That basically means, we have to always process events in an idle handler. That is not a too strong limitation, because we anyway don't know the call context in which the platform event is emitted and we should avoid unguarded recursive calls into platform. Otherwise, we get hit an assertion/crash in nm-iface-helper: 1 raise() 2 abort() 3 g_assertion_message() 4 g_assertion_message_expr() 5 do_delete_object() 6 ip6_address_delete() >>> 7 nm_platform_ip6_address_delete() 8 nm_platform_ip6_address_sync() 9 nm_ip6_config_commit() 10 ndisc_config_changed() 11 ffi_call_unix64() 12 ffi_call() 13 g_cclosure_marshal_generic_va() 14 _g_closure_invoke_va() 15 g_signal_emit_valist() 16 g_signal_emit() >>> 17 nm_ndisc_dad_failed() 18 ffi_call_unix64() 19 ffi_call() 20 g_cclosure_marshal_generic() 21 g_closure_invoke() 22 signal_emit_unlocked_R() 23 g_signal_emit_valist() 24 g_signal_emit() >>> 25 nm_platform_cache_update_emit_signal() 26 event_handler_recvmsgs() 27 event_handler_read_netlink() 28 delayed_action_handle_one() 29 delayed_action_handle_all() 30 do_delete_object() 31 ip6_address_delete() 32 nm_platform_ip6_address_delete() 33 nm_platform_ip6_address_sync() >>> 34 nm_ip6_config_commit() 35 ndisc_config_changed() 36 ffi_call_unix64() 37 ffi_call() 38 g_cclosure_marshal_generic_va() 39 _g_closure_invoke_va() 40 g_signal_emit_valist() 41 g_signal_emit() 42 check_timestamps() 43 receive_ra() 44 ndp_call_eventfd_handler() 45 ndp_callall_eventfd_handler() 46 event_ready() 47 g_main_context_dispatch() 48 g_main_context_iterate.isra.22() 49 g_main_loop_run() >>> 50 main() NMPlatform already has a check to assert against recursive calls in delayed_action_handle_all(): g_return_val_if_fail (priv->delayed_action.is_handling == 0, FALSE); priv->delayed_action.is_handling++; ... priv->delayed_action.is_handling--; Fixes: f85728e https://bugzilla.redhat.com/show_bug.cgi?id=1546656
From libnl3, we only used the helper function to parse/generate netlink messages and the socket functions to send/receive messages. We don't need an external dependency to do that, it is simple enough. Drop the libnl3 dependency, and replace all missing code by directly copying it from libnl3 sources. At this point, I mostly tried to import the required bits to make it working with few modifications. Note that this increases the binary size of NetworkManager by 4736 bytes for contrib/rpm build on x86_64. In the future, we can simplify the code further. A few modifications from libnl3 are: - netlink errors NLE_* are now in the domain or regular errno. The distinction of having to bother with two kinds of error number domains was annoying. - parts of the callback handling is copied partially and unused parts are dropped. Especially, the verbose/debug handlers are not used. In following commits, the callback handling will be significantly simplified. - the complex handling of seleting ports was simplified. We now always let kernel choose the right port automatically.
Originally, these were error numbers from libnl3. These error numbers are separate from errno, which is unfortunate, because sometimes we care about the native errno returned from kernel. Now, refactor them so that the error numbers are in the shared realm of errno, but failures from kernel or underlying API are still returned via their native errno. - NLE_INVAL doesn't exist anymore. Passing invalid arguments to a function is commonly a bug. g_return_*(NLE_BUG) is the right answer to that. - NLE_NOMEM and NLE_AGAIN is replaced by their errno counterparts. - drop several error numbers. If nobody cares about these numbers, there is no reason to have a specific error number for them. NLE_UNSPEC is sufficient.
The only difference between nl_recvmsgs_report() and nl_recvmsgs() is the return value on success. libnl3 couldn't change that for backward compatibility reasons. We can merge them.
With libnl3, each socket has it's own callback structure. One would often take that callback structure, clone it, modify it and invoke a receive operation with it. We don't need this complexity. We got rid of all default handlers, hence, by default all callbacks are unset. The only callbacks that are set, are those that we specify immediately before invoking the receive operation. Just pass the callback structure at that point. Also, no more ref-counting, and cloning of the callback structure. It is so simple, just stack allocate one if you need it.
Glib is not out of memory safe, meaning it always aborts the program when an allocation fails. It is not possible to meaningfully handle out of memory when using glib. Replace all allocation functions for netlink message with their glib counter part and remove the NULL checks.
- adjust some coding style (space after function name). - ensure to use g_free(), as we no longer use malloc but the g_malloc aliases. Nowadays, glib's malloc is identical to malloc from the standard library and so this is no issue in practice. Still it's bad style to mix g_malloc() with free(). - use cleanup attribute for memory handling.
Now that we cleaned up nl_recv(), we have full control over which error variables are returned when. We no longer need to check "errno" directly, and we no longer need the NLE_USER_* workaround.
…_netlink() - refactor the loop in event_handler_read_netlink() to mark pending requests as answered by adding a new helper function delayed_action_wait_for_nl_response_complete_check() - delayed_action_wait_for_nl_response_complete_all() can be implemented in terms of delayed_action_wait_for_nl_response_complete_check() - if nm_platform_netns_push() fails, also complete all pending requests with a new error code WAIT_FOR_NL_RESPONSE_RESULT_FAILED_SETNS.
Merged as 3d987fe |
This branch drops the dependency on libnl3 and implements all the netlink related functions ourself.
This was done by copying all the missing bits over from libnl3 library, which is compatible licence-wise.
I did not try to simplify the code, except at a few places where it simplified the import itself. Hence, this is not yet as simple as it could be. For example, I think that
nl_cb
andnl_socket
should essentially cease to exist and instead the relevant parameters should be passed on as needed -- instead of attaching them to thenl_socket
object. That is not yet done intentionally, to first get a function equivalent form.