Wifi randomly pulls a second 169 IP, causing all internet connections to fail post-boot #29

Open
Drakonas opened this issue Jun 17, 2022 · 52 comments

Comments

@Drakonas

Drakonas commented Jun 17, 2022

Over the last week a few of us have been hammering at what is causing these weird wifi-related failures on boot at random. When they occur, the MiSTer gets an IP, even gets the time when you don't have an RTC board, and SAMBA/SSH connections work, but anything attempting to get an internet connection after that fails. Even nslookup google.com 8.8.8.8 fails with a timeout to the DNS.

What we have found are the following:

  • The issue seems to stem from the commits around April 2022. This commit may be the underlying culprit, but it is unclear how.
  • The issue will occur if the wlan0 interface exists when dhcpcd and ifup start up. Normally the interface does not exist until further in the boot process, but for mine it is there early (within a few seconds), so dhcpcd pulls an initial IP when its process is started.
  • On my MiSTer, a second udev fires, causing a second lease to be called, and it obtains a 169 IP address, while the previous lease still has a correct local IP that is reachable. This does not fire automatically for most other people, but forcefully firing it does reproduce the issue: udevadm trigger /sys/class/net/wlan0 --action add
  • Wifi seems to initiate for me very early, before the filesystem is mounted as r/w. This is not the case for most others. I am using a TP-Link Archer T3U Plus adapter (Amazon link). It is uncertain that this plays a factor in the issue at all. The random loss of internet occurs with the adapter from Porkchop as well, but it is uncertain that it is the same exact issue.
  • Resetting the router does often fix the issue, but only shortly. Otherwise, you'll find the issue fixed after cold rebooting the MiSTer 6-10 times, and then it returns a number of days later again, requiring another count of excessive reboots to fix.
  • dhcpcd fails to write the lease file to /var/lib/dhcpcd/interface-ssid.lease because the dhcpcd directory does not exist and it states the filesystem is read-only, which is the truth because it fires before the remount to r/w occurs. Considering I do not have an RTC addon (and the system time defaults to 1/1/1979), this means that the lease is about 40 years old once the time is pulled from NTP servers, which is most likely expired. It is uncertain if this causes it to pull a second lease.
  • dhcpcd will also fail to write the duid to /var/lib/dhcpcd/duid, since it is a read-only filesystem at the time it runs.
  • dhcpcd is outdated on the MiSTer. It is uncertain if updating it can be the fix. Possibly BusyBox needs work.
  • The issue occurs with multiple of my own routers/modems, so the issue shouldn't be related to the router. The udev command is reproducible even on reputable routers.
  • The DUID problem appears to cause some routers(mine) to give a new/different lease on every boot. Additionally it appears the r/o filesystem lease problem causes the system to trigger the IPV4LL behavior because if I move the lease directory to /media/fat/something with a symlink I can use the udevadm command multiple times and it just uses the existing lease.
  • This second lease seems to be requested by udhcpc rather than dhcpcd. Running both clients may be causing issues, so it is probably best to use only one of them.
  • Depending on timing, the DUID can be static, but with my MiSTer it pulls it too early, and fails to keep it.
  • The wifi initiating early is most likely caused by the wlan0 interface loading within the first few seconds of booting. On working MiSTers, it seems to not load the wlan0 interface for a while longer, which allows the system to work normally.
  • Running rm /sbin/udhcpc causes ifup to use dhcpcd, and then the rc script starts another copy. It seems the startup script prefers udhcpc over dhcpcd. It would probably be best to use only dhcpcd. If udhcpc is wanted instead, eth0 does not come up since it is not in the interfaces config file, so it would need to be added there.
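
A quick way to check for the symptom described in the list above is to look for a 169.254.0.0/16 (IPv4 link-local) address sitting next to the regular lease. The helper below is only a sketch: the name has_link_local is made up for this example, and it reads `ip -4 addr show` style text on stdin so it can also be tried against saved output.

```shell
#!/bin/sh
# Hypothetical helper (not part of the MiSTer image): reads `ip -4 addr show`
# style output on stdin and succeeds if any inet address is in 169.254.0.0/16.
has_link_local() {
    awk '$1 == "inet" && $2 ~ /^169\.254\./ { found = 1 }
         END { exit found ? 0 : 1 }'
}

# Example on the device (assumes the interface is named wlan0):
#   ip -4 addr show dev wlan0 | has_link_local && echo "link-local address present"
```

If the check succeeds while the interface also holds a normal lease, that matches the "second 169 IP" state reported here.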

Possible methods to fix (doing multiple is not a bad idea):

  • Seemingly the best way to fix this would be to use the wpa_supplicant hook script (/usr/share/dhcpcd/hooks/10-wpa_supplicant), but it is broken with the MiSTer's wpa_supplicant implementation, so wpa_supplicant would need to be fixed first. This would allow dhcpcd.conf to address anything related to network issues, with ifupdown handling only loopback.
  • Find what's causing the interface to exist on boot. This alone should fix the issue, but it is a bit of a dirty fix.
  • Fixing write-permissions on boot, so dhcpcd can write the lease/duid files properly. Another dirty fix, but should address the issue.
  • Make sure only one of dhcpcd or udhcpc runs, with no possibility of both. I recommend doing this regardless of the other methods.

Any thoughts as to what may be directly causing this is welcome. I'd like to get to the bottom of this, as various users besides me have reported this happening at random with their MiSTer. We are using the latest Mr. Fusion images as far as I know.

I have attached my syslog for review.

/var/log/messages

@sorgelig
Member

To fix the problem, need to have this problem. Generally speaking 99% of time i use wired connection and it's hard for me to work on this issue. I'm not very much in Linux specifics, so don't treat me as a master here :)
If you can offer a working solution, then go on.

@sorgelig
Member

there is kind of race condition in boot sequence. I was trying to fix it when i was working on Bluetooth. But it seems impossible to fix it especially when more USB devices are connected. You may try to play with etc/init.d/* scripts sequence - may be it will help.

@Drakonas
Author

Drakonas commented Jun 17, 2022

there is kind of race condition in boot sequence. I was trying to fix it when i was working on Bluetooth. But it seems impossible to fix it especially when more USB devices are connected. You may try to play with etc/init.d/* scripts sequence - may be it will help.

It's hard to really see what's going on without access to the buildroot scripts. These don't seem to be public. Do you know where these are located for the project?

@birdybro
Member

It occurs for me too with this adapter --> https://www.amazon.com/gp/product/B08D72GSMS/ combined with this router+ap --> https://www.amazon.com/dp/B08DTF7KGC/ref=twister_B09P4Q7JK4

I just have to run ip addr flush dev wlan0 and it comes back with one address. It also doesn't come back between boots and it is definitely related to the dhcp lease time, because it will only come back after my dhcp lease has expired I've noticed (or joined a different network). It also occurred at my parents house with a totally different router+ap.

@zakk4223

zakk4223 commented Jun 18, 2022

There's multiple things going on:

  1. the root filesystem is read-only so dhcpcd can't write a lease or duid file to /var/db/dhcpcd

1a) for some users the wlan0 device is available when dhcpcd first starts. It negotiates a lease but can't write any state to disk. Then for some reason it also receives a udev 'add' event for wlan0. Due to the fact there's no lease state written it tries to refresh/rediscover a dhcp lease. For some users this fails (I suspect the router is applying "protection"). When it fails dhcpcd falls back to a self assigned IP and deletes the route/dns for the 'good' lease. Or at least inserts a higher priority route to nowhere.

I unfortunately cannot debug this one because my wlan0 interface is not available when dhcpcd launches, so it only tries to get a lease once due to udev add event. I've seen Drakonas' logs and their wlan0 device is detected a full 2 seconds before mine so I suspect this is just due to variations in USB setups (hub, other devices etc).

Either /var/db/dhcpcd needs to be writeable or dhcpcd needs to use a different database directory. You can symlink /var/db/dhcpcd to /media/fat/dhcpcd and it will work. Or you could recompile dhcpcd and set DBDIR to /media/fat/dhcpcd (configure --dbdir=/media/fat/dhcpcd)

  2. udhcpc is still being run for wlan0 which means there are two dhcp clients running on those interfaces with possibly unpredictable results.

If you change /etc/network/interfaces so the line like 'iface wlan0 inet dhcp' is instead 'iface wlan0 inet manual' the ifup script won't try to invoke a dhcp client, but will still invoke the pre-up scripts for wpa_supplicant. Then dhcpcd will handle the dhcp lease when it runs.
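
The edit to /etc/network/interfaces described above could be scripted roughly as follows. This is a sketch under assumptions: the function name is hypothetical, it only touches stanza headers of the form `iface wlanN inet dhcp`, and it prints the result instead of editing in place so the output can be reviewed first.

```shell
#!/bin/sh
# Sketch only: print the given interfaces file with every
# "iface wlanN inet dhcp" stanza header switched to "inet manual".
# Other interfaces and all other lines are passed through unchanged.
wlan_dhcp_to_manual() {
    sed -E 's/^([[:space:]]*iface[[:space:]]+wlan[0-9]+[[:space:]]+inet[[:space:]]+)dhcp[[:space:]]*$/\1manual/' "$1"
}

# Example (assumes the root filesystem is writable; keeps a backup first):
#   cp /etc/network/interfaces /etc/network/interfaces.bak
#   wlan_dhcp_to_manual /etc/network/interfaces.bak > /etc/network/interfaces
```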

@sorgelig
Member

Nothing to do with buildroot. Boot scripts are in image and can be read/tweaked.
For debugging, the root fs can be mounted read/write at boot (by uncommenting the line in inittab). If that fixes the problem, then we need to check which directory needs to be mounted rw (as tmpfs).

@prenetic

I have what seems to be the same or at least a very similar problem, outlined in greater detail in this thread on the Mister FPGA forums:

https://misterfpga.org/viewtopic.php?p=58198#p58198
https://misterfpga.org/viewtopic.php?p=58210#p58210
https://misterfpga.org/viewtopic.php?p=58274#p58274

Essentially two leases are allocated to the MiSTer on Wi-Fi (haven't tested whether this happens wired as well). Same MAC address, but one is registered without a hostname and a vendor ID of "udhcp", the other with the "MiSTer" hostname but no vendor ID (from dhcpcd). When the device comes up, there is a brief moment of connectivity, followed by 10-20 seconds of disruption, and then connectivity again. You can't see both addresses with ifconfig, but you CAN see both with ip address. The flags differ too, with the udhcpc address showing perpetual validity (doesn't seem to respect the DHCP lease time).
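
To make the flag difference easier to spot, something like the following could summarise `ip -4 addr show` output. It is a rough sketch: the function name is invented, and "permanent" here simply means the address line carries no `dynamic` flag.

```shell
#!/bin/sh
# Hypothetical helper: read `ip -4 addr show` output on stdin and print each
# IPv4 address followed by "dynamic" (flag present) or "permanent" (no flag).
list_inet4_flags() {
    awk '$1 == "inet" {
        kind = "permanent"
        for (i = 3; i <= NF; i++) if ($i == "dynamic") kind = "dynamic"
        print $2, kind
    }'
}

# Example on the device (interface name assumed):
#   ip -4 addr show dev wlan0 | list_inet4_flags
```

Two lines of output for one interface, one of each kind, would be consistent with two DHCP clients having configured it.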

Looking at the DHCP datagrams it's more clear -- the udhcpc requests come in with the MAC address (DUID) as the client identifier, but the dhcpcd request comes in with the MAC address (DUID) PLUS an IAID as the client identifier. In the eyes of the DHCP server, these each require a unique IP address despite having the same base MAC address.

For testing/as a workaround, I changed the following option from duid to clientid which causes dhcpcd to only send the MAC address as part of the DHCP request, so now the client identifiers between udhcpc and dhcpcd match and only one lease is provided as confirmed by logs on my DHCP server (dnsmasq on my router).

# Use the hardware address of the interface for the Client ID.
clientid
# or
# Use the same DUID + IAID as set in DHCPv6 for DHCPv4 ClientID as per RFC4361.
# Some non-RFC compliant DHCP servers do not reply with this set.
# In this case, comment out duid and enable clientid above.
#duid

So given the behavior I think we're running into the same thing here. I'm under the impression that when the choice was made to transition to dhcpcd from udhcpc, the latter wasn't fully disabled in the base image and the two DHCP clients are conflicting and causing issues -- so +1 to sticking to one or the other regardless of any other fixes.

Separately, if people are still running into new IP addresses every startup/polluting DHCP pools even after disabling one of the two DHCP clients, then the config change above should take care of that problem since the MAC address shouldn't be changing every boot. There's really no reason to include IAID as part of the client identifier for the case of MiSTer as far as I can tell (though it shouldn't be cycling with every boot anyway), and it is typically omitted for compatibility purposes for IPv4 anyway.

@Drakonas
Author

Drakonas commented Aug 15, 2022

# Use the hardware address of the interface for the Client ID.
clientid
# or
# Use the same DUID + IAID as set in DHCPv6 for DHCPv4 ClientID as per RFC4361.
# Some non-RFC compliant DHCP servers do not reply with this set.
# In this case, comment out duid and enable clientid above.
#duid

To clarify, in what file is this change supposed to be made? I assume /etc/dhcpcd.conf

@prenetic

# Use the hardware address of the interface for the Client ID.
clientid
# or
# Use the same DUID + IAID as set in DHCPv6 for DHCPv4 ClientID as per RFC4361.
# Some non-RFC compliant DHCP servers do not reply with this set.
# In this case, comment out duid and enable clientid above.
#duid

To clarify, in what file is this change supposed to be made? I assume /etc/dhcpcd.conf

Yep, that's the one. Sorry I forgot to include that here.

@Akuma-Git

Akuma-Git commented Aug 15, 2022

That's because ifup/ifdown are hardcoded to use udhcpc. Try renaming /usr/sbin/udhcpc to /usr/sbin/_udhcpc.

@Drakonas
Author

Drakonas commented Aug 16, 2022

That's because ifup/ifdown are hardcoded to use udhcpc. Try renaming /usr/sbin/udhcpc to /usr/sbin/_udhcpc.

Why is this then? It seems, based on this, that two separate DHCP clients can (or maybe always?) load on startup, given the right scenario. Does Busybox handle both for the MiSTer setup? Pretty sure that fixing this should really warrant a change having everything hardcoded for one DHCP client, instead of requiring dirty workarounds

Also, please see Zakk's statements a couple months ago in this issue. There is more to the issue than two DHCP daemons. @Akuma-Git

@Akuma-Git

Why is this then?

Because udhcpc, ifup, ifdown are busybox components.

It seems, based on this, that two separate DHCP clients can (or maybe always?) load on startup, given the right scenario.

Correct

Does Busybox handle both for the MiSTer setup? Pretty sure that fixing this should really warrant a change having everything hardcoded for one DHCP client, instead of requiring dirty workarounds

Idk, afaict the dhcpcd package is unnecessary

Also, please see Zakk's statements a couple months ago in this issue. There is more to the issue than two DHCP daemons.

Correct, this is due to configuration errors resulting in some fighting between:

  • udev
  • busybox (ifup/ifdown/udhcpc)
  • network
  • dhcpcd

@prenetic

Idk, afaict the dhcpcd package is unnecessary

I'm curious what sparked this change as it seems like it was an intentional choice to switch to dhcpcd. Possibly @sorgelig can provide more context, if there was originally an issue with udhcpc that can be addressed here.

@sorgelig
Member

because udhcpc didn't work well.
I will re-check it. Make sure your solution works with Ethernet connection too.

@prenetic

prenetic commented Aug 18, 2022

Via wired Ethernet without the dhcpcd.conf client identifier change (duid) -- datagrams include IAID of 04050607:

dnsmasq DHCP logs

user@router1:~$ tail -f -n 0 /var/log/dnsmasq.log | grep -i 02:03:04:05:06:07
Aug 17 16:27:14 dnsmasq-dhcp[32608]: DHCPDISCOVER(bond0.10) 02:03:04:05:06:07
Aug 17 16:27:14 dnsmasq-dhcp[32608]: DHCPOFFER(bond0.10) 192.168.10.212 02:03:04:05:06:07
Aug 17 16:27:14 dnsmasq-dhcp[32608]: DHCPREQUEST(bond0.10) 192.168.10.212 02:03:04:05:06:07
Aug 17 16:27:14 dnsmasq-dhcp[32608]: DHCPACK(bond0.10) 192.168.10.212 02:03:04:05:06:07 MiSTer

dnsmasq DHCP lease

02:03:04:05:06:07 192.168.10.212 MiSTer ff:04:05:06:07:00:03:00:01:02:03:04:05:06:07

Datagram contents

Option: (61) Client identifier
    Length: 15
    IAID: 04050607
    DUID Type: link-layer address (3)
    Hardware type: Ethernet (1)
    Link layer address: 02:03:04:05:06:07

Active IP addresses

2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 02:03:04:05:06:07 brd ff:ff:ff:ff:ff:ff
    inet 192.168.10.212/24 brd 192.168.10.255 scope global dynamic noprefixroute eth0
       valid_lft 82688sec preferred_lft 71888sec

Via wired Ethernet with the dhcpcd.conf client identifier change (clientid) -- datagrams do not include IAID:

dnsmasq DHCP logs

user@router:~$ tail -f -n 0 /var/log/dnsmasq.log | grep -i 02:03:04:05:06:07
Aug 17 16:11:22 dnsmasq-dhcp[30909]: DHCPDISCOVER(bond0.10) 02:03:04:05:06:07
Aug 17 16:11:22 dnsmasq-dhcp[30909]: DHCPOFFER(bond0.10) 192.168.10.211 02:03:04:05:06:07
Aug 17 16:11:22 dnsmasq-dhcp[30909]: DHCPREQUEST(bond0.10) 192.168.10.211 02:03:04:05:06:07
Aug 17 16:11:22 dnsmasq-dhcp[30909]: DHCPACK(bond0.10) 192.168.10.211 02:03:04:05:06:07 MiSTer

dnsmasq DHCP lease

02:03:04:05:06:07 192.168.10.211 MiSTer 01:02:03:04:05:06:07

Datagram contents

Option: (61) Client identifier
    Length: 7
    Hardware type: Ethernet (0x01)
    Client MAC address: MS-NLB-PhysServer-03_04:05:06:07 (02:03:04:05:06:07)

Active IP addresses

2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 02:03:04:05:06:07 brd ff:ff:ff:ff:ff:ff
    inet 192.168.10.211/24 brd 192.168.10.255 scope global dynamic noprefixroute eth0
       valid_lft 86327sec preferred_lft 75527sec

Regardless of which configuration I'm only seeing one set of requests show up from dhcpcd -- nothing comes in from udhcpc over wired Ethernet from what I can tell. This seems to indicate the udhcpc/duplicate lease issue is limited to Wi-Fi (and possibly USB wired Ethernet adapters as well), and that this change doesn't appear to have any negative impact for wired Ethernet other than temporarily creating a second lease when changing the IAID behavior which you can see above. The second lease in this case should expire gracefully as the two aren't tied up at the same time like they are via Wi-Fi.
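
For reference, the two client identifiers stored in the dnsmasq lease entries above can be rebuilt from the MAC address alone. The sketch below is illustrative only: the function names are invented, and the IAID is assumed to be the last four bytes of the MAC, which is what the capture above shows (IAID 04050607 for 02:03:04:05:06:07).

```shell
#!/bin/sh
# Illustration only: rebuild the two DHCPv4 client identifiers seen in the
# dnsmasq lease entries from a colon-separated MAC address.

# clientid mode: hardware type 01 (Ethernet) followed by the MAC.
clientid_from_mac() {
    printf '01:%s\n' "$1"
}

# duid mode (RFC 4361 layout): type ff, 4-byte IAID, then a link-layer DUID
# (00:03), hardware type Ethernet (00:01), and the MAC.
# Assumption: the IAID equals the last four bytes of the MAC.
duid_clientid_from_mac() {
    iaid=$(printf '%s\n' "$1" | cut -d: -f3-6)
    printf 'ff:%s:00:03:00:01:%s\n' "$iaid" "$1"
}

# Example:
#   clientid_from_mac 02:03:04:05:06:07
#   duid_clientid_from_mac 02:03:04:05:06:07
```

Because the two strings differ, a DHCP server that keys leases on the client identifier treats them as two separate clients even though the MAC is the same.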

@Drakonas while this certainly isn't a long-term fix, does the change to dhcpcd.conf I mentioned above take care of the second/169.x.x.x address issue you were seeing via Wi-Fi? Wondering if your DHCP server just isn't handling the IPv4 IAID that's going out through dhcpcd and fails, that's known behavior for some vendors.

@zakk4223

There are at least two 'multiple lease' problems:

  1. udhcpc and dhcpcd both try to get leases on an interface. This is not the original poster's problem, but obviously it is happening for some people. Seems likely that udhcpc just needs to be disabled.

  2. On some systems 'wlan0' is visible when dhcpcd launches, and it immediately sends a lease request and gets a valid response. However, it is so early in the boot process the filesystem is still read only so it can't write a lease state file. Then for some reason it receives a udev 'device added' event. Without a valid lease file it tries to solicit a new lease. This fails, so it falls back to IPv4LL which then adds a 2nd IP, but more importantly it messes up the route table and makes the network effectively non-functional.

The log linked in the initial post seems to indicate the same IAID is used for both requests. Unfortunately I can't reproduce the issue here (my wlan0 is not available when dhcpcd starts up, so it only reacts to the udev event) so I can't see if there's anything different about the request.

dhcpcd needs a way to write lease files, even on startup. I'm not sure if it is feasible to move the startup of dhcpcd so it always starts after the filesystem is remounted rw.

@prenetic

  2. On some systems 'wlan0' is visible when dhcpcd launches, and it immediately sends a lease request and gets a valid response. However, it is so early in the boot process the filesystem is still read only so it can't write a lease state file. Then for some reason it receives a udev 'device added' event. Without a valid lease file it tries to solicit a new lease. This fails, so it falls back to IPv4LL which then adds a 2nd IP, but more importantly it messes up the route table and makes the network effectively non-functional.

So I haven't been able to repro this one on my end, but I wonder if it'd be good enough to instead write the DHCP lease state to ephemeral storage /tmp (and not to SD) since minimizing writes by default seems to be a design philosophy of MiSTer.
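
One way to try this is the symlink approach mentioned earlier in the thread, pointed at a tmpfs path instead of the SD card. The sketch below is untested on a MiSTer and rests on assumptions: the function name is hypothetical, and both paths are parameters because the thread mentions /var/lib/dhcpcd as well as /var/db/dhcpcd, so check which one your build uses.

```shell
#!/bin/sh
# Sketch under assumptions: point dhcpcd's state directory at a writable
# location by replacing it with a symlink.
#   $1 = state directory dhcpcd expects (e.g. /var/db/dhcpcd)
#   $2 = writable target (e.g. /tmp/dhcpcd, or /media/fat/dhcpcd for the SD card)
link_state_dir() {
    state_dir=$1
    target=$2

    mkdir -p "$target" || return 1
    # An existing symlink is left alone.
    if [ -L "$state_dir" ]; then
        return 0
    fi
    # A real directory is moved aside rather than deleted.
    if [ -d "$state_dir" ]; then
        mv "$state_dir" "$state_dir.orig" || return 1
    fi
    ln -s "$target" "$state_dir"
}

# Example (run once while the root filesystem is writable):
#   link_state_dir /var/db/dhcpcd /tmp/dhcpcd
```

The symlink itself only has to be created once, but with a /tmp target the directory behind it disappears on every reboot, so it would have to be recreated before dhcpcd starts. The DUID would also be regenerated on each boot in that case.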

@sorgelig
Member

If you have a working config (both ethernet and wifi) already, then please put modified files here, i will include it in next linux release.

@ghost

ghost commented Jan 16, 2023

The problem is caused by two DHCP clients running at the same time:
udhcpc being run on demand
and dhcpcd running in the background.

It is not a kernel problem; it comes from the userspace Linux image.

To recover, run:
ifdown wlan0
killall -9 dhcpcd
ifup wlan0

@Drakonas
Author

Drakonas commented Jan 16, 2023

The problem is caused by two DHCP clients running at the same time: udhcpc being run on demand and dhcpcd running in the background.

It is not a kernel problem; it comes from the userspace Linux image.

To recover, run: ifdown wlan0; killall -9 dhcpcd; ifup wlan0

I believe the fix suggested by @gkrzystek is similar to the proposed change here.

I, for one, think this should be revisited.

In short, the boot script in the linux image doesn't actually bring down the interfaces and kill the dhcp client. From my understanding, they are left running and the boot process starts again, attempting to grab a new lease with the previous one still active. Addressing this should fix this issue, I expect.

The proposed change also unmounts the filesystem. I can neither confirm nor deny that this is necessary.

@Drakonas
Author

Drakonas commented Jan 16, 2023

@Drakonas while this certainly isn't a long-term fix, does the change to dhcpcd.conf I mentioned above take care of the second/169.x.x.x address issue you were seeing via Wi-Fi? Wondering if your DHCP server just isn't handling the IPv4 IAID that's going out through dhcpcd and fails, that's known behavior for some vendors.

[screenshot: obs64_2023-01-16_14-22-50]

Sorry to get back so late on this, but the issue is not resolved by changing to clientid alone and leaving /usr/sbin/udhcpc in place.

Disabling udhcpc (renaming it) has consistently fixed the issue for me over wifi, and I've been using ethernet without issue for months.

I should mention that the udhcpc issue does not affect everyone, because some routers are not confused by the duplicate lease attempt and handle it properly. Good routers will largely not see this issue. But cheap or poorly made ones (especially those supplied by internet providers, some of which force you to use them) get confused and hand out a wrong IP or fail to give a lease, and this is easily reproducible. I am using one of these routers, sadly.

@sorgelig Does this give enough information that ethernet is unaffected, and that getting rid of udhcpc should be looked into? I can do more testing if you'd like.

@sorgelig
Member

so, all i need to do is to remove udhcpc and problem solved?

@ghost

ghost commented Jan 20, 2023

@Drakonas the statement about "good" routers is slightly off. All routers are just another Linux box; I tested most DHCP server implementations, and by default most of them assign leases based on the combination of client ID + MAC.
You have to set a special flag to ignore the RFC and use MAC-only leases...

So the routers are not bad; our Linux distro is badly configured.

@ghost

ghost commented Jan 20, 2023

What we can do here is:

  1. set dhcpcd to actually work only on eth0, which will leave the Wi-Fi interface (wlan0) to udhcpc
  2. alter the wpa_supplicant config so it does not call udhcpc, and leave the DHCP job to dhcpcd
    the fact that udhcpc exists in the system (as part of BusyBox) doesn't mean we have to use it at all
    some background:
    https://wiki.archlinux.org/title/dhcpcd

@ghost

ghost commented Jan 20, 2023

Simplest solution for anyone affected who wishes to test:
add the following line to /etc/dhcpcd.conf and reboot:
denyinterfaces wlan*

@prenetic

prenetic commented Jan 20, 2023

What we can do here is:

  1. set dhcpcd to actually work only on eth0, which will leave the Wi-Fi interface (wlan0) to udhcpc
  2. alter the wpa_supplicant config so it does not call udhcpc, and leave the DHCP job to dhcpcd
    the fact that udhcpc exists in the system (as part of BusyBox) doesn't mean we have to use it at all
    some background:
    https://wiki.archlinux.org/title/dhcpcd

I'm not sure option 1 here works as-is. When adding the denyinterfaces wlan* line to /etc/dhcpcd.conf I lose all DNS resolution on the device, and looking at /etc/resolv.conf I'm no longer seeing my DHCP-advertised DNS servers or domain suffix. May make more sense to stick with dhcpcd for everything if it's already handling the generation of resolv.conf.

[01/20/23 11:43:11 AM]
root@MiSTer:~>cd /media/fat/Scripts/ && ./update_all.sh
Launching Update All

No Internet connection, please try again later.
[01/20/23 11:45:39 AM]
root@MiSTer:/media/fat/Scripts>ip address
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: eth0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 02:03:04:05:06:07 brd ff:ff:ff:ff:ff:ff
3: wlan0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether b4:b0:24:29:08:21 brd ff:ff:ff:ff:ff:ff
    inet 192.168.10.66/24 brd 192.168.10.255 scope global wlan0
       valid_lft forever preferred_lft forever
[01/20/23 11:46:29 AM]
root@MiSTer:/media/fat/Scripts>cat /etc/resolv.conf
# Generated by dhcpcd
# /etc/resolv.conf.head can replace this line
# /etc/resolv.conf.tail can replace this line

@ghost

ghost commented Jan 20, 2023

Have you rebooted?
I set it on my system, booting with only Wi-Fi connected, and:
root@MiSTer:>ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
2: eth0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
link/ether 02:03:04:05:06:07 brd ff:ff:ff:ff:ff:ff
3: wlan0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 00:13:25:4c:17:e4 brd ff:ff:ff:ff:ff:ff
inet 10.76.175.195/24 brd 10.76.175.255 scope global wlan0
valid_lft forever preferred_lft forever
only single ip on the interface
root@MiSTer:~>cat /etc/resolv.conf
# Generated by dhcpcd
# /etc/resolv.conf.head can replace this line
# /etc/resolv.conf.tail can replace this line
search ninex.info # wlan0
nameserver 10.76.175.1 # wlan0
[01/20/23 10:55:47 PM]

Note: you should not have both Ethernet and Wi-Fi connected.

There is a small chance that on your system Wi-Fi starts before dhcpcd, which overwrites resolv.conf.

IMHO we should go with dhcpcd as is, globally, and just reconfigure the wpa_supplicant hook so it does not call udhcpc...

@ghost

ghost commented Jan 20, 2023

Found the ULTIMATE simple solution: switch all DHCP handling to dhcpcd (so please revert dhcpcd.conf to default first).
In /etc/network/interfaces,
change
iface wlan0 inet dhcp
and
iface wlan1 inet dhcp

to
iface wlan0 inet manual
iface wlan1 inet manual

explanation

The network startup script starts udhcpc because the interface is set to dhcp (we do not want that).
Setting it to manual makes the startup script assume the user will provide the necessary addresses.

Since the dhcpcd daemon is listening, it picks up the interface and configures it.

boom magic ;)
root@MiSTer:>ps auxw |grep dhcp
645 dhcpcd dhcpcd: [master] [ip4]
646 root dhcpcd: [privileged actioneer]
647 dhcpcd dhcpcd: [network proxy]
648 dhcpcd dhcpcd: [control proxy]
717 dhcpcd dhcpcd: [BPF ARP] wlan0 10.76.175.197
866 root grep dhcp
[01/20/23 11:52:40 PM]
root@MiSTer:~>ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
2: eth0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
link/ether 02:03:04:05:06:07 brd ff:ff:ff:ff:ff:ff
3: wlan0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 00:13:25:4c:17:e4 brd ff:ff:ff:ff:ff:ff
inet 10.76.175.197/24 brd 10.76.175.255 scope global dynamic noprefixroute wlan0
valid_lft 6914sec preferred_lft 6014sec

root@MiSTer:~>cat /etc/resolv.conf
# Generated by dhcpcd from wlan0.dhcp
# /etc/resolv.conf.head can replace this line
domain ninex.info
nameserver 10.76.175.1
# /etc/resolv.conf.tail can replace this line
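
To confirm after a reboot that only one DHCP client is active, the process list can be filtered along these lines. This is a sketch with an invented function name; it reads `ps` output on stdin and skips its own awk process, whose command line would otherwise match.

```shell
#!/bin/sh
# Hypothetical check: read `ps` output on stdin and print the name of each
# DHCP client that appears in it. Both names printed means both are running.
dhcp_clients_running() {
    awk '
        /awk/    { next }      # ignore this filter itself in the process list
        /udhcpc/ { u = 1 }
        /dhcpcd/ { d = 1 }
        END {
            if (u) print "udhcpc"
            if (d) print "dhcpcd"
        }
    '
}

# Example on the device:
#   ps auxw | dhcp_clients_running
```

With the listing above as input, only dhcpcd would be reported.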

@prenetic

Ahh nice, I like the simplicity of this approach. The proposed change to /etc/network/interfaces is working flawlessly for me. I've rebooted (soft and hard) about 40 times now and I'm able to connect via Wi-Fi and resolve DNS records every time.

@sorgelig
Member

Need more feedback. If it works for others, then I will add it.

@Drakonas
Author

Drakonas commented Jan 26, 2023

With this "inet manual" fix and no others, I can reach the machine via Samba (and wifi symbol shows), but it still doesn't seem to register everything properly:
[screenshot: obs64_2023-01-26_11-18-58]
I have re-enabled udhcpc for this test. It seems udhcpc still causes the fault on boot. Renaming /usr/sbin/udhcpc to /usr/sbin/_udhcpc (or something else to disable it) always fixes this for me.

@prenetic

With this "inet manual" fix and no others, I can reach the machine via Samba (and wifi symbol shows), but it still doesn't seem to register everything properly: [screenshot: obs64_2023-01-26_11-18-58] I have re-enabled udhcpc for this test. It seems udhcpc still causes the fault on boot. Renaming /usr/sbin/udhcpc to /usr/sbin/_udhcpc (or something else to disable it) always fixes this for me.

Assuming your MiSTer is fully up-to-date, can you confirm whether /etc/resolv.conf is being generated properly with inet manual specified?

@ghost

ghost commented Jan 27, 2023

@Drakonas
Only one fix at a time.
Technically you should roll back to the default settings and (ONLY!) make the /etc/network/interfaces change.
Use only one active interface, either Wi-Fi or Ethernet, not both at once.
Then take a screenshot of:
ps auxw
ip a
ip r s
cat /etc/resolv.conf

There is no magic here; the dhcpcd daemon will work for all (assuming your Linux image is up to date).

@Drakonas
Author

Drakonas commented Feb 4, 2023

@Drakonas Only one fix at a time. Technically you should roll back to the default settings and (ONLY!) make the /etc/network/interfaces change. Use only one active interface, either Wi-Fi or Ethernet, not both at once. Then take a screenshot of: ps auxw; ip a; ip r s; cat /etc/resolv.conf

There is no magic here; the dhcpcd daemon will work for all (assuming your Linux image is up to date).

I said 'With this "inet manual" fix and no others'. I had reverted all other changes to default prior to testing including any dhcp config and executable renames, but I'll do it again to get you this information.

Ethernet is not affected by any of these changes, and I've never had trouble with ethernet regardless of using everything default or not. It's just wifi that is affected.

The following is only with the inet manual change in /etc/network/interfaces. The /etc/dhcpcd.conf is default, and /usr/sbin/udhcpc exists. I've removed screenshot previews because this post would be astronomically long with them.

With eth0 only
With wlan0 only

As you can see from the wlan0 screenshot, with the inet manual fix alone, dhcpcd still attempts to get another lease with a 169 address.

wlan0 - ip a
wlan0 - ip r s
Furthermore, wlan0 has two IP's registered, one dynamic and one global.

This allows connections between my machine and the MiSTer (albeit hostname-relationships do not work), but the MiSTer scripts cannot get an internet connection.

wlan0 - cat /etc/resolv.conf
DNS still working though. Yay.

Now, if I rename /usr/sbin/udhcpc to /usr/sbin/_udhcpc, but leave this inet manual fix intact, scripts still do not get an internet connection. dhcpcd still obtains a second ip with 169 address:
wlan0 - ps auxw (inet manual + no udhcpc)

I should mention that on my initial attempt with inet manual and no udhcpc, it finally grabbed only one IP address. However, as I know this issue is related to certain modems getting confused, I turned the MiSTer off, unplugged the wlan adapter, plugged in ethernet, and then turned it on. It grabbed a new IP. I rebooted once more as-is. Still fine. Then I turned it off as-is, unplugged ethernet and plugged in the wlan adapter, and now it gets a 169 address again. This matches my initial experience: it eventually works after a few reboots, but the issue returns later.

My assumption is that it eventually works after a few reboots because the modem stops getting confused. So you have to force the MiSTer to change IPs, and then the issue returns the next time you try wifi.

Now, if I leave /usr/sbin/_udhcpc renamed (disabled) and revert /etc/network/interfaces to defaults (inet dhcp), everything works. See below:
wlan0 - ps auxw (inet dhcp default and no udhcpc)

And now I will repeat the exact same wlan0 -> ethernet + reboot twice -> wlan0 to prove it will work first time wlan0 pulls a new IP (I'm writing this before doing it, to show how confident I am that removing udhcpc is all that is needed to fix this issue):
wlan0 working (only /usr/sbin/udhcpc removed) ps auxw && ip a && ip r s

TL;DR: Just remove udhcpc. That's all that's needed. There's no reason to get any more complicated.
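
The rename used in these tests can be written as a reversible pair of helpers. This is a sketch under assumptions: the function names are hypothetical, and the root filesystem has to be writable at the time it runs.

```shell
# Hypothetical helpers: disable a binary by prefixing its file name
# with "_" (as done with udhcpc above), and undo that again.
disable_bin() {
    bin="$1"
    target="$(dirname "$bin")/_$(basename "$bin")"
    [ -e "$bin" ] || { echo "not found: $bin" >&2; return 1; }
    mv "$bin" "$target"
}

restore_bin() {
    bin="$1"
    disabled="$(dirname "$bin")/_$(basename "$bin")"
    [ -e "$disabled" ] || { echo "not disabled: $bin" >&2; return 1; }
    mv "$disabled" "$bin"
}
```

For example, `disable_bin /usr/sbin/udhcpc` before a test run and `restore_bin /usr/sbin/udhcpc` to go back to the default.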

@ghost

ghost commented Feb 4, 2023

@Drakonas
Calm down, man; I just wanted to see what is going on here.
Now I see the situation on your system,
and a new question is raised.
[screenshot]
For some reason dhcpcd has a problem keeping its IP allocation.
See your own screenshot: a single dhcpcd keeps 2 addresses, one allocated by your router and one link-local (which usually comes up only when no DHCP server is found).
Most probably the problem here is not DHCP.
You can add the line:

noipv4ll

to your /etc/dhcpcd.conf

which will prevent link-local addresses (169.254.x.x) from being brought up.

However, this will solve only the "no internet" problem.
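
One way to add that line without duplicating it on repeated runs, as a sketch only: `noipv4ll` is the dhcpcd.conf option named above, while the helper name `add_conf_option` is hypothetical and the file must be writable.

```shell
# Hypothetical helper: append an option line to a config file
# unless that exact line is already present.
add_conf_option() {
    file="$1"
    option="$2"
    if ! grep -qxF "$option" "$file" 2>/dev/null; then
        printf '%s\n' "$option" >> "$file"
    fi
}
```

Usage would be `add_conf_option /etc/dhcpcd.conf noipv4ll`; calling it a second time leaves the file unchanged.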

Can you please examine the output of the
iwconfig command?
Such DHCP problems occur mostly when there is a poor signal level / link quality or a high noise level on the wifi.
My guess is that the problem here is poor link quality.
(And what I am trying to point out is that the problem you are fighting, while similar to the one I described, is something different.)

I would also suggest putting wpa_supplicant into debug mode to see whether it reports rapid reconnections,
because I am sure dhcpcd isn't the problem here; it is more like a canary.

Thanks for the help with the investigation.

@ghost

ghost commented Feb 4, 2023

Note for @sorgelig:
The general fix for the udhcpc vs dhcpcd conflict should go into the next Linux release.
The fix that disables IPv4LL should not,
because link-local addressing may be used to connect a NAS or something directly over Ethernet, or in many other scenarios.

@Drakonas
Author

Drakonas commented Feb 4, 2023

> There are at least two 'multiple lease' problems:
>
>   1. udhcp and dhcpd both try to get leases on an interface. This is not the original poster's problem, but obviously it is happening for some people. Seems likely that udhcp just needs to be disabled.
>
>   2. On some systems 'wlan0' is visible when dhcpd launches, and it immediately sends a lease request and gets a valid response. However, it is so early in the boot process the filesystem is still read only so it can't write a lease state file. Then for some reason it receives a udev 'device added' event. Without a valid lease file it tries to solicit a new lease. This fails, so it falls back to Ipv4LL which then adds a 2nd IP, but more importantly it messes up the route table and makes the network effectively non-functional.
>
> The log linked in the initial post seems to indicate the same IAID is used for both requests. Unfortunately I can't reproduce the issue here (my wlan0 is not available when dhcpd starts up, so it only reacts to the udev event) so I can't see if there's anything different about the request.
>
> Dhcpd needs a way to write lease files, even on startup. I'm not sure if it is feasible to move the startup of dhcpd so it always starts after the filesystem is remounted rw

#29 (comment)

@gkrzystek as stated here, this is the cause. Please read the thread before saying my wifi 6 router being 3 meters away from my MiSTer is the issue.

I have been calm, but after that post I am trying my best to be civil. Lol. I have spent months testing this, and replacing my modem was already something I tried. The problem still wasn't fixed.

I am open to suggestions, but I propose we try to figure out what is causing wlan0 to be visible sometimes when the MiSTer launches, while having @sorgelig move ahead with removing udhcpc/ifup/ifdown, as they're all hardcoded to use udhcpc in BusyBox. This will fix a number of people's issues, but not all (as in problem 2 in the quoted post above).

So, in regards to what we already know, I have a new theory, and I can do more testing for this @zakk4223 but from what I have found recently, the hard reboot script for MiSTer does not fully reboot and does not bring interfaces down or the dhcp client, but the boot script is relaunched. Could that be the cause of some people having multiple leases?

I'm wondering if some people thought rebooting from the MiSTer menu and power cycling it was the same thing, but my recent findings have shown they are not. Looking further up in this thread you'll find a link to someone proposing a script change for the cold reboot script.

I am not sure if this is necessary to address the issue, but I am wondering if cold rebooting might produce a different boot process that is worth testing. I can do some testing later on. I am heading to bed lol.

@ghost

ghost commented Feb 4, 2023

@Drakonas
The trick is, mate, that the / filesystem is always RO; it is remounted rw on user login / script run.
I would have to spend another 2 hours explaining why in detail. Long story short, it's an embedded system with pivot root, without a proper shutdown procedure. (If / were rw the whole time, on every reboot you would get at least a journal recovery or fsck.)

  1. To diagnose properly, hook the console up to a PC via a USB cable.
  2. Your diagnosis is almost right, but the order doesn't match.
    Look at the PIDs of the dhcp forks: your IPv4LL address gets set after the DHCP lease was obtained.
    1039 < 1604
    This means your system seems to get a proper IP from DHCP and then, for some reason, "thinks" it didn't get one.

I fully understand your frustration, mate, and I am really interested in finding where the problem is.
However, as you are more focused on complaining than on actual troubleshooting...
I pass.
I sorted out my problem and shared with others the steps to fix problems similar to mine, and as you don't wish to cooperate, I am stepping down.

@Drakonas
Author

Drakonas commented Feb 4, 2023

> @Drakonas The trick is, mate, that the / filesystem is always RO; it is remounted rw on user login / script run. I would have to spend another 2 hours explaining why in detail. Long story short, it's an embedded system with pivot root, without a proper shutdown procedure. (If / were rw the whole time, on every reboot you would get at least a journal recovery or fsck.)
>
> 1. To diagnose properly, hook the console up to a PC via a USB cable.
>
> 2. Your diagnosis is almost right, but the order doesn't match.
>    Look at the PIDs of the dhcp forks: your IPv4LL address gets set after the DHCP lease was obtained.
>    1039 < 1604
>    This means your system seems to get a proper IP from DHCP and then, for some reason, "thinks" it didn't get one.
>
> I fully understand your frustration, mate, and I am really interested in finding where the problem is. However, as you are more focused on complaining than on actual troubleshooting... I pass. I sorted out my problem and shared with others the steps to fix problems similar to mine, and as you don't wish to cooperate, I am stepping down.

I deleted my original post. I was too hasty to respond and for that I apologize.

@ghost

ghost commented Feb 4, 2023

@Drakonas no hard feelings.
Please try modifying dhcpcd.conf to put it into debug mode.
Please try switching wpa_supplicant into debug mode
and boot the system with the console hooked up over USB (read the docs on how to use the USB terminal with PuTTY).
You can then grab the text output of the boot from PuTTY and share it.
There is a small chance the wifi driver is causing us trouble,
or something like that.
I really love solving such puzzles... I just need more detailed data.
This process will take some work from both of us.
Please try to help me to help you :)

@Drakonas
Author

Drakonas commented Feb 4, 2023

> @Drakonas no hard feelings. Please try modifying dhcpcd.conf to put it into debug mode. Please try switching wpa_supplicant into debug mode and boot the system with the console hooked up over USB (read the docs on how to use the USB terminal with PuTTY). You can then grab the text output of the boot from PuTTY and share it. There is a small chance the wifi driver is causing us trouble, or something like that. I really love solving such puzzles... I just need more detailed data. This process will take some work from both of us. Please try to help me to help you :)

I can put the effort in and am willing to. I will report back tomorrow. It is 6AM here and I must sleep. Thank you for being understanding. I have anger issues and it's been hard work getting them under control for the past few years.

@ghost

ghost commented Feb 4, 2023

And about reboot from the menu:
A short press only restarts the menu process; the OS stays untouched.
If you press for long (hard reboot), it just kills the whole system, as with a kernel panic, and kicks the bootloader, so the kernel reinitialises everything as after a power loss.
So the idea (@zakk4223) is half true: only the short press does not reboot, but it does not actually touch the network; it stays up.
You can check this: keep an ssh session to your MiSTer (or the USB console) open.
On the USB console you will see whether the system fully reboots or not, depending on whether cold or hot reboot was chosen from the menu.
Over ssh you will see that your connection is cut only on a hard reboot.
And I agree this menu item is counterintuitive.
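
Another way to tell the two menu reboots apart is to look at the kernel uptime afterwards. A sketch, assuming the standard Linux `/proc/uptime` file; the helper name `booted_within` and the 120-second threshold are arbitrary choices for illustration.

```shell
# Hypothetical helper: succeed if the system has been up for fewer
# than LIMIT seconds. The optional second argument only exists so the
# check can be pointed at a different file than /proc/uptime.
booted_within() {
    limit="$1"
    uptime_file="${2:-/proc/uptime}"
    # The file holds "<seconds up> <seconds idle>"; keep the integer
    # part of the first field.
    up=$(cut -d. -f1 "$uptime_file")
    [ "$up" -lt "$limit" ]
}
```

After picking a reboot option from the menu, `booted_within 120 && echo "OS restarted" || echo "OS kept running"` shows whether Linux itself was restarted.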

@Drakonas
Author

Drakonas commented Feb 4, 2023

> And about reboot from the menu: a short press only restarts the menu process; the OS stays untouched. If you press for long (hard reboot), it just kills the whole system, as with a kernel panic, and kicks the bootloader, so the kernel reinitialises everything as after a power loss. So the idea (@zakk4223) is half true: only the short press does not reboot, but it does not actually touch the network; it stays up. You can check this: keep an ssh session to your MiSTer (or the USB console) open. On the USB console you will see whether the system fully reboots or not, depending on whether cold or hot reboot was chosen from the menu. Over ssh you will see that your connection is cut only on a hard reboot. And I agree this menu item is counterintuitive.

My ssh connection isn't cut even when I power cycle manually, which doesn't make sense to me. It should lose connection. During all my tests with eth0 only and wlan0 only, the ssh connection was never cut when I turned off my MiSTer. I use a splitter to the USB hub and the DE10-Nano, with analog IO.

@ghost

ghost commented Feb 4, 2023

> My ssh connection isn't cut even when I power cycle manually, which doesn't make sense to me. It should lose connection.

Huh, that would be magic (sorry for my sarcasm), or your hub is feeding power to the system somehow,
but there is no way to keep ssh alive if the system is properly switched off.

Reset power by unpowering both the DE10 and the hub.
Also, please share what kind of USB network adapter you have.
I have 3 different units using the same driver, and all behave differently...

@Drakonas
Author

Drakonas commented Feb 4, 2023

> > My ssh connection isn't cut even when I power cycle manually, which doesn't make sense to me. It should lose connection.
>
> Huh, that would be magic (sorry for my sarcasm), or your hub is feeding power to the system somehow, but there is no way to keep ssh alive if the system is properly switched off.
>
> Reset power by unpowering both the DE10 and the hub. Also, please share what kind of USB network adapter you have. I have 3 different units using the same driver, and all behave differently...

Nevermind, if I wait long enough it is lost. I am tired.

@Drakonas
Author

Drakonas commented Feb 4, 2023

> @Drakonas no hard feelings. Please try modifying dhcpcd.conf to put it into debug mode. Please try switching wpa_supplicant into debug mode and boot the system with the console hooked up over USB (read the docs on how to use the USB terminal with PuTTY). You can then grab the text output of the boot from PuTTY and share it. There is a small chance the wifi driver is causing us trouble, or something like that. I really love solving such puzzles... I just need more detailed data. This process will take some work from both of us. Please try to help me to help you :)

I have added the debug parameter to the dhcpcd.conf but wpa_supplicant doesn't seem to support this in its conf, only in execution with -d or -dd parameter. Where is wpa_supplicant executed from? I am unable to find that, and I'm unsure if it can be changed.

I will check back later on.
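
One way to look for the place where wpa_supplicant is launched is to search the startup scripts for its name. This is a sketch under assumptions: the helper name `find_mentions` is hypothetical, and which directories hold the launcher on MiSTer has not been verified here.

```shell
# Hypothetical helper: list files under the given directories that
# mention a program name anywhere in their text.
find_mentions() {
    name="$1"
    shift
    grep -rl -- "$name" "$@" 2>/dev/null
}
```

For example, `find_mentions wpa_supplicant /etc/init.d /etc/network` (both directories are guesses). Any script it reports is a candidate for adding the `-d` or `-dd` flag to the wpa_supplicant command line.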

@ghost

ghost commented Feb 24, 2023

@sorgelig can you please rebuild the Linux image with the change I proposed?
/etc/network/interfaces
iface wlan0 inet manual
iface wlan1 inet manual
+ rebuild the kernel with the updated RTL drivers that you merged from me,
so I can try to help @Drakonas with his wifi issue diagnostics?
The RTL driver update should definitely improve the dongle compatibility list.

@sorgelig
Member

@ghost

ghost commented Feb 24, 2023

@Drakonas please download, unpack, and replace linux.img and zImage_dtb in the linux folder on your SD card.
This image contains more stable drivers for Realtek wifi dongles, as well as the modification I requested, so it should connect more reliably.
If not, I will request a list of actions from you, like setting the wifi driver into debug mode, etc. No worries, I will paste a list of precise instructions.

@ghost

ghost commented Feb 24, 2023

@sorgelig thanks for the quick response.
I just finished testing with all 10 dongles I have.
Can you consider publishing this image in mister-devel? We need more input from users...

The changes we supplied solve the primary problem of 2 DHCP clients handling wlan0, which (at least for some users) was causing problems. I do not expect more problems from those changes: one is cosmetic, and the RTL drivers have a larger user base in the aircrack-ng community than we have ;)
The updated Realtek drivers will also reduce frustration, at least for newcomers,
as those drivers are built with debug.
I will continue the investigation with @Drakonas, as his case requires some low-level investigation. I am not able to reproduce it other than by actually breaking my wifi link into a state where more than 50% of packets are lost, which does not seem to be the root cause of his problem.

@ghost

ghost commented Feb 24, 2023

One more change that improves how dhcpcd behaves.
@sorgelig
please add to /etc/fstab:
tmpfs /var/db/dhcpcd tmpfs mode=0750 0 0

This will remove dhcpcd's complaints about lease file writes and slightly improve network startup at boot.
No need to rebuild the image for now, but it is worth pushing it with that change for user testing.
Thanks
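
To confirm after a reboot that the entry took effect, the mount table can be checked. A sketch, assuming the standard `/proc/mounts` layout (device, mount point, filesystem type, options); the helper name `is_tmpfs_mount` is hypothetical.

```shell
# Hypothetical helper: succeed if MOUNTPOINT is listed as tmpfs in a
# mounts table. The table defaults to /proc/mounts; /etc/fstab uses the
# same column order, so it can be checked the same way.
is_tmpfs_mount() {
    mountpoint="$1"
    table="${2:-/proc/mounts}"
    awk -v mp="$mountpoint" \
        '$2 == mp && $3 == "tmpfs" { found = 1 } END { exit !found }' "$table"
}
```

`is_tmpfs_mount /var/db/dhcpcd && echo "lease dir is on tmpfs"` would show whether dhcpcd now has a writable place for its lease files.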

@prenetic

prenetic commented May 4, 2023

Just following up on this -- the most recent updates published seem to be working great on my end. Thanks for that.

The multiple DHCP lease address issue persists, however, because of the default DUID configuration in dhcpcd.conf that I mentioned previously. With this set to DUID, a new address is assigned to the MiSTer on every restart for both wired and wireless connections, instead of reusing an existing DHCP lease. This also causes significant delays reconnecting to the MiSTer over the network due to local DNS caches.

This should be set to use client ID based on my experience, as some routers struggle with DUID in the first place and others will be wasteful with leases in a scope due to the way the MiSTer appears as a new device every time it is restarted. When set to client ID with a DUID-aware DHCP server, the MiSTer should retrieve the same leases upon restart.

#29 (comment)
#29 (comment)
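
The DUID to client ID switch described above could be scripted roughly as follows. This is a sketch under assumptions: it expects a dhcpcd.conf with a bare `duid` line and, optionally, a commented-out `#clientid` line, and the helper name `use_clientid` is hypothetical. `duid` and `clientid` are the dhcpcd.conf option names.

```shell
# Hypothetical helper: switch a dhcpcd.conf-style file from "duid"
# to "clientid".
use_clientid() {
    conf="$1"
    tmp="$conf.tmp"
    # Comment out a bare "duid" line and uncomment a bare "#clientid" line.
    sed -e 's/^duid$/#duid/' -e 's/^#clientid$/clientid/' "$conf" > "$tmp" || return 1
    # If there was no "#clientid" line to uncomment, add the option at the end.
    grep -qx 'clientid' "$tmp" || printf 'clientid\n' >> "$tmp"
    # Write back in place so the original file keeps its permissions.
    cat "$tmp" > "$conf" && rm -f "$tmp"
}
```

Usage would be `use_clientid /etc/dhcpcd.conf`, followed by a restart so dhcpcd picks up the new identifier.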
