Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

th/netlink (not more libnl3) #67

Merged
merged 15 commits into from Feb 21, 2018
Merged

th/netlink (not more libnl3) #67

merged 15 commits into from Feb 21, 2018

Conversation

thom311
Copy link
Member

@thom311 thom311 commented Feb 16, 2018

This branch drops the dependency on libnl3 and implements all the netlink related functions ourself.

This was done by copying all the missing bits over from libnl3 library, which is compatible licence-wise.

I did not try to simplify the code, except at a few places where it simplified the import itself. Hence, this is not yet as simple as it could be. For example, I think that nl_cb and nl_socket should essentially cease to exist and instead the relevant parameters should be passed on as needed -- instead of attaching them to the nl_socket object. That is not yet done intentionally, to first get a function equivalent form.

@thom311 thom311 force-pushed the th/netlink branch 6 times, most recently from 30db304 to a45f90a Compare February 16, 2018 13:37
@lkundrak lkundrak force-pushed the th/netlink branch 2 times, most recently from 4e2664c to ced3644 Compare February 18, 2018 13:40
@vbenes
Copy link
Collaborator

vbenes commented Feb 19, 2018

@thom311
Copy link
Member Author

thom311 commented Feb 19, 2018

crash here: https://bugzilla.redhat.com/show_bug.cgi?id=1546656

This crash can also happen on master. Attached patch to downstream bug (same patch is also part of this pull request for the moment).

@thom311
Copy link
Member Author

thom311 commented Feb 19, 2018

According to latest measurement, this branch increases the binary size of NetworkManager by 4544 bytes (2813336 vs. 2817880). Build on x86_64, contrib/rpm without extra debugging.

@bengal
Copy link
Contributor

bengal commented Feb 21, 2018

Looks good to me.

Platform invokes change events while reading netlink events. However,
platform code is not re-entrant and calling into platform again is not
allowed (aside operations that do not process the netlink socket, like
lookup of the platform cache).

That basically means, we have to always process events in an idle
handler. That is not a too strong limitation, because we anyway don't
know the call context in which the platform event is emitted and we
should avoid unguarded recursive calls into platform.

Otherwise, we get hit an assertion/crash in nm-iface-helper:

     1  raise()
     2  abort()
     3  g_assertion_message()
     4  g_assertion_message_expr()
     5  do_delete_object()
     6  ip6_address_delete()
 >>> 7  nm_platform_ip6_address_delete()
     8  nm_platform_ip6_address_sync()
     9  nm_ip6_config_commit()
     10 ndisc_config_changed()
     11 ffi_call_unix64()
     12 ffi_call()
     13 g_cclosure_marshal_generic_va()
     14 _g_closure_invoke_va()
     15 g_signal_emit_valist()
     16 g_signal_emit()
 >>> 17 nm_ndisc_dad_failed()
     18 ffi_call_unix64()
     19 ffi_call()
     20 g_cclosure_marshal_generic()
     21 g_closure_invoke()
     22 signal_emit_unlocked_R()
     23 g_signal_emit_valist()
     24 g_signal_emit()
 >>> 25 nm_platform_cache_update_emit_signal()
     26 event_handler_recvmsgs()
     27 event_handler_read_netlink()
     28 delayed_action_handle_one()
     29 delayed_action_handle_all()
     30 do_delete_object()
     31 ip6_address_delete()
     32 nm_platform_ip6_address_delete()
     33 nm_platform_ip6_address_sync()
 >>> 34 nm_ip6_config_commit()
     35 ndisc_config_changed()
     36 ffi_call_unix64()
     37 ffi_call()
     38 g_cclosure_marshal_generic_va()
     39 _g_closure_invoke_va()
     40 g_signal_emit_valist()
     41 g_signal_emit()
     42 check_timestamps()
     43 receive_ra()
     44 ndp_call_eventfd_handler()
     45 ndp_callall_eventfd_handler()
     46 event_ready()
     47 g_main_context_dispatch()
     48 g_main_context_iterate.isra.22()
     49 g_main_loop_run()
 >>> 50 main()

NMPlatform already has a check to assert against recursive calls
in delayed_action_handle_all():

    g_return_val_if_fail (priv->delayed_action.is_handling == 0, FALSE);

    priv->delayed_action.is_handling++;
    ...
    priv->delayed_action.is_handling--;

Fixes: f85728e

https://bugzilla.redhat.com/show_bug.cgi?id=1546656
From libnl3, we only used the helper function to parse/generate netlink
messages and the socket functions to send/receive messages. We don't
need an external dependency to do that, it is simple enough.

Drop the libnl3 dependency, and replace all missing code by directly
copying it from libnl3 sources. At this point, I mostly tried to
import the required bits to make it working with few modifications.

Note that this increases the binary size of NetworkManager by 4736 bytes
for contrib/rpm build on x86_64. In the future, we can simplify the code
further.

A few modifications from libnl3 are:

- netlink errors NLE_* are now in the domain or regular errno.
  The distinction of having to bother with two kinds of error
  number domains was annoying.

- parts of the callback handling is copied partially and unused parts
  are dropped. Especially, the verbose/debug handlers are not used.
  In following commits, the callback handling will be significantly
  simplified.

- the complex handling of seleting ports was simplified. We now always
  let kernel choose the right port automatically.
Originally, these were error numbers from libnl3. These error numbers
are separate from errno, which is unfortunate, because sometimes we
care about the native errno returned from kernel.

Now, refactor them so that the error numbers are in the shared realm
of errno, but failures from kernel or underlying API are still returned
via their native errno.

- NLE_INVAL doesn't exist anymore. Passing invalid arguments to a function
  is commonly a bug. g_return_*(NLE_BUG) is the right answer to that.

- NLE_NOMEM and NLE_AGAIN is replaced by their errno counterparts.

- drop several error numbers. If nobody cares about these numbers,
  there is no reason to have a specific error number for them.
  NLE_UNSPEC is sufficient.
The only difference between nl_recvmsgs_report() and nl_recvmsgs() is
the return value on success. libnl3 couldn't change that for backward
compatibility reasons. We can merge them.
With libnl3, each socket has it's own callback structure.
One would often take that callback structure, clone it, modify it
and invoke a receive operation with it.

We don't need this complexity. We got rid of all default handlers,
hence, by default all callbacks are unset.

The only callbacks that are set, are those that we specify immediately
before invoking the receive operation. Just pass the callback structure
at that point.

Also, no more ref-counting, and cloning of the callback structure. It is
so simple, just stack allocate one if you need it.
Glib is not out of memory safe, meaning it always aborts the program
when an allocation fails. It is not possible to meaningfully handle
out of memory when using glib.

Replace all allocation functions for netlink message with their glib
counter part and remove the NULL checks.
- adjust some coding style (space after function name).
- ensure to use g_free(), as we no longer use malloc
  but the g_malloc aliases. Nowadays, glib's malloc
  is identical to malloc from the standard library and
  so this is no issue in practice. Still it's bad
  style to mix g_malloc() with free().
- use cleanup attribute for memory handling.
Now that we cleaned up nl_recv(), we have full control over which error
variables are returned when. We no longer need to check "errno"
directly, and we no longer need the NLE_USER_* workaround.
…_netlink()

- refactor the loop in event_handler_read_netlink() to mark pending
  requests as answered by adding a new helper function
  delayed_action_wait_for_nl_response_complete_check()

- delayed_action_wait_for_nl_response_complete_all() can be implemented
  in terms of delayed_action_wait_for_nl_response_complete_check()

- if nm_platform_netns_push() fails, also complete all pending requests
  with a new error code WAIT_FOR_NL_RESPONSE_RESULT_FAILED_SETNS.
@thom311
Copy link
Member Author

thom311 commented Feb 21, 2018

Merged as 3d987fe

@lkundrak lkundrak deleted the th/netlink branch February 21, 2018 11:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
3 participants