/var/lib/nixos/uid-map corrupted when using nixos-rebuild build-vm many times #97305

Open
davidak opened this issue Sep 6, 2020 · 9 comments

Comments

@davidak
Member

davidak commented Sep 6, 2020

Describe the bug
I was using nixos-rebuild build-vm to test a PR I was working on. At some point I was no longer able to log in, and many services didn't start.

...
running activation script...
malformed JSON string, neither array, object, number, string or atom, at character offset 0 (before "\x{0}\x{0}\x{0}\x{0}...") at /nix/store/z9a0mg0qg4xhlih0wix950xgq285fbzh-update-users-groups.pl line 11.
Activation script snippet 'users' failed (2)
setting up /etc...
removing obsolete symlink ‘/etc/resolv.conf’...
removing obsolete symlink ‘/etc/systemd/resolved.conf’...
chown: invalid user: 'root:root'
Activation script snippet 'var' failed (1)
chown: invalid user: 'root.messagebus'
chown: invalid user: 'root.root'
chown: invalid user: 'root.root'
chown: invalid user: 'root.root'
chown: invalid user: 'root.root'
chown: invalid user: 'root.root'
chown: invalid user: 'root.root'
chown: invalid user: 'root.root'
chown: invalid user: 'root.root'
chown: invalid user: 'root.root'
chown: invalid user: 'root.root'
chown: invalid user: 'root.root'
chown: invalid user: 'root.root'
chown: invalid user: 'root.root'
chown: invalid user: 'root.nogroup'
Activation script snippet 'wrappers' failed (1)
warning: the group 'nixbld' specified in 'build-users-group' does not exist
starting systemd...
...

Screenshot from 2020-09-06 19-30-31

Screenshot from 2020-09-06 19-23-33

nix run nixpkgs.libguestfs-with-appliance
mktemp -d
sudo guestmount -a ./nixos.qcow2 -m /dev/sda --ro /tmp/tmp.1F7pugMFFJ
[root@gaming:/tmp/tmp.1F7pugMFFJ]# hexdump -n 2 var/lib/nixos/uid-map
0000000 0000
0000002

/var/lib/nixos/uid-map and /etc/shadow contain only zeros
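
The hexdump check above can be turned into a quick test for any suspect file. This is a minimal sketch, assuming a POSIX shell with tr and wc available; the helper name is_all_nul is made up for illustration and is not part of NixOS:

```shell
# Hypothetical helper (not part of NixOS): succeeds only if the file is
# non-empty and consists entirely of NUL bytes.
is_all_nul() {
  file=$1
  # An empty or missing file is not "zeroed", so report failure.
  [ -s "$file" ] || return 1
  # Delete every NUL byte; if nothing remains, the file was all zeros.
  remaining=$(tr -d '\000' < "$file" | wc -c | tr -d ' ')
  [ "$remaining" = "0" ]
}
```

Run from the mount point, it could be used as: is_all_nul var/lib/nixos/uid-map && echo "uid-map is zeroed"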

Related to #69365, #26788, #61755, #82755

To Reproduce
Steps to reproduce the behavior:

minimal config:

{ config, pkgs, ... }:

{
  users.extraUsers.root.password = "root";
  documentation.enable = false;
}
  1. nixos-rebuild build-vm -I nixpkgs=~/code/nixpkgs/ -I nixos-config='/home/davidak/root'
  2. start vm: /nix/store/js7vf96xzsvj6h23p3jcbixlx0qyvmhq-nixos-vm/bin/run-nixos-vm
  3. stop vm when booted
  4. build the VM again with a different config... and repeat

Workaround:

remove disk image file:
rm ./nixos.qcow2

now it boots:

[davidak@gaming:~/code/nixpkgs]$ /nix/store/js7vf96xzsvj6h23p3jcbixlx0qyvmhq-nixos-vm/bin/run-nixos-vm
Formatting '/home/davidak/code/nixpkgs/nixos.qcow2', fmt=qcow2 cluster_size=65536 compression_type=zlib size=536870912 lazy_refcounts=off refcount_bits=16

After removing it, I can't reproduce the problem anymore.

Expected behavior
NixOS boots into working system

@davidak davidak self-assigned this Sep 6, 2020
@davidak davidak changed the title NixOS broken on master nixos-rebuild build-vm fails to boot when starting it a second time Sep 6, 2020
@davidak davidak changed the title nixos-rebuild build-vm fails to boot when starting it a second time /var/lib/nixos/uid-map corrupted when using nixos-rebuild build-vm many times Sep 6, 2020
@davidak davidak removed their assignment Sep 6, 2020
@Mic92
Member

Mic92 commented Sep 23, 2020

should be fixed by #98544

@stale

stale bot commented Mar 26, 2021

I marked this as stale due to inactivity. → More info

@stale stale bot added the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Mar 26, 2021
@ElvishJerricco
Contributor

I still get something similar on unstable if I kill the machine while it's booting. Start a blank NixOS VM and kill the VM when systemd takes over in stage 2, and that VM will not have valid users upon reboot.

@stale stale bot removed the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Jun 27, 2021
@Mic92
Member

Mic92 commented Jun 28, 2021

> I still get something similar on unstable if I kill the machine while it's booting. Start a blank NixOS VM and kill the VM when systemd takes over in stage 2, and that VM will not have valid users upon reboot.

What does the file look like in this case?

@stale

stale bot commented Jan 9, 2022

I marked this as stale due to inactivity. → More info

@stale stale bot added the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Jan 9, 2022
@Melkor333
Contributor

Melkor333 commented Apr 29, 2022

Something like this happened to me on my workstation. It may also have been more like #61755 or something completely different; I can't tell. What happened is that I rebuilt my local system multiple times with nixos-rebuild switch, and after that I couldn't reboot my system anymore. I was playing with services.xss-lock and home-manager's screen-locker modules, so that definitely shouldn't have had any effect.

Initially I had an error saying something like mv: /bin/sh is already the same as bin/.sh.tmp. But in reality /bin/sh was a broken symlink pointing to nothing ( -> '') while /bin/.sh.tmp was a proper symlink into the nix store. When I removed /bin/sh, it told me that the same was the case for /usr/bin/env and /usr/bin/.env.tmp. After removing /usr/bin/env too, I got the error above. Removing uid-map, gid-map and later on also auto-subuid-map seems to have solved the problem, as the machine regenerated the files properly (I actually renamed the whole /etc/ -> /etc.old and /var -> /var.old and ran nixos-install from a live ISO). Weirdly, the home-manager service for my user failed with the following error for a lot of files:

Apr 28 22:05:10 afonil hm-activate-samuelh[1443]: cmp: /home/samuelh/.config/dunst/dunstrc: Is a directory
Apr 28 22:05:10 afonil hm-activate-samuelh[1430]: Existing file '/home/samuelh/.config/dunst/dunstrc' is in the way of '/nix/store/6dszxk7vkdwayk2msir9rgycglwfhyq2-home-manager-files/.config/dunst/dunstrc'
Apr 28 22:05:10 afonil hm-activate-samuelh[1445]: cmp: /home/samuelh/.config/environment.d/10-home-manager.conf: Is a directory
Apr 28 22:05:10 afonil hm-activate-samuelh[1430]: Existing file '/home/samuelh/.config/environment.d/10-home-manager.conf' is in the way of '/nix/store/6dszxk7vkdwayk2msir9rgycglwfhyq2-home-manager-files/.config/environment.d/10-home-manager.conf'
Apr 28 22:05:10 afonil hm-activate-samuelh[1447]: cmp: /home/samuelh/.config/git/config: Is a directory
Apr 28 22:05:10 afonil hm-activate-samuelh[1430]: Existing file '/home/samuelh/.config/git/config' is in the way of '/nix/store/6dszxk7vkdwayk2msir9rgycglwfhyq2-home-manager-files/.config/git/config'

I solved this with the following magic. It takes the files from the journal and moves them to the folder bad. After that I could restart the service just fine:

mkdir bad
mv -t bad/ $(journalctl -u home-manager-samuelh.service | grep 'Existing file' | awk '{ print $8 }' | tr -d "'")
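
The one-liner above relies on the path being the eighth whitespace-separated field, so it would break on paths containing spaces. A possible variant, sketched here under the assumption that the journal lines have exactly the "Existing file '...' is in the way of" shape shown earlier (the function name extract_blocking_paths is invented for illustration):

```shell
# Hypothetical helper: read journal lines on stdin and print the path that
# follows "Existing file", one per line, keeping any spaces in the path.
extract_blocking_paths() {
  # Capture everything between the first pair of single quotes after
  # "Existing file"; lines that do not match are dropped by -n ... p.
  sed -n "s/.*Existing file '\([^']*\)' is in the way.*/\1/p"
}
```

It could then be combined with the original command, for example: journalctl -u home-manager-samuelh.service | extract_blocking_paths | while IFS= read -r p; do mv -- "$p" bad/; done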

The files in the folder bad are corrupted in the same way as the /bin/sh and /usr/bin/env files were:

[samuelh@afonil:~]$ ls -lah bad/
total 4.0K
drwxr-xr-x 1 samuelh users 1.1K Apr 28 22:25 .
drwx------ 1 samuelh users  872 Apr 29 06:51 ..
lrwxrwxrwx 1 samuelh users    0 Apr 28 12:51 10-home-manager.conf -> ''
lrwxrwxrwx 1 samuelh users    0 Apr 28 12:51 {446900e4-71c2-419f-a6a7-df9c091e268b}.xpi -> ''
lrwxrwxrwx 1 samuelh users    0 Apr 28 12:51 87677a2c52b84ad3a151a4a72f5bd3c4@jetpack.xpi -> ''
lrwxrwxrwx 1 samuelh users    0 Apr 28 12:51 blueman-applet.service -> ''
lrwxrwxrwx 1 samuelh users    0 Apr 28 12:51 config -> ''
lrwxrwxrwx 1 samuelh users    0 Apr 28 12:51 config.nix -> ''
lrwxrwxrwx 1 samuelh users    0 Apr 28 12:51 {d7742d87-e61d-4b78-b8a1-b469842139fa}.xpi -> ''
lrwxrwxrwx 1 samuelh users    0 Apr 28 12:51 dunstrc -> ''
lrwxrwxrwx 1 samuelh users    0 Apr 28 12:51 dunst.service -> ''
lrwxrwxrwx 1 samuelh users    0 Apr 28 12:51 flameshot.service -> ''

I assume this is some error during activation, maybe even a problem with btrfs?

It might make sense to have some kind of check in the activation which makes sure that such broken symlinks and broken/unreadable /var/lib/nixos/uid-map files are properly removed/recreated (or are the *-map files usually never recreated?). Of course, fixing the reason this behaviour happens would be even better, but I can imagine that there is always some edge case where, e.g., a cold reset of a system that is activating causes broken symlinks which don't get overwritten properly.
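
A check along these lines might look roughly like the following. This is only a sketch and not how the NixOS activation script works: the function name and the .corrupt suffix are invented, and it assumes that a map file consisting solely of NUL bytes is never valid (the update-users-groups.pl error in the original report shows such a file cannot be parsed as JSON). It moves the file aside rather than deleting it, so it stays available for inspection.

```shell
# Hypothetical pre-activation check (not part of NixOS): move aside any map
# file that contains only NUL bytes so that it no longer blocks activation.
quarantine_zeroed_maps() {
  # Directory holding the map files; defaults to the path from this issue.
  state_dir=${1:-/var/lib/nixos}
  for name in uid-map gid-map auto-subuid-map; do
    f=$state_dir/$name
    # Skip files that are missing or empty.
    [ -s "$f" ] || continue
    # If deleting all NUL bytes leaves nothing, the file is zeroed.
    if [ "$(tr -d '\000' < "$f" | wc -c | tr -d ' ')" = "0" ]; then
      echo "moving zeroed $f to $f.corrupt" >&2
      mv -- "$f" "$f.corrupt"
    fi
  done
}
```

Called without arguments it would inspect /var/lib/nixos; a different directory can be passed for testing, e.g. quarantine_zeroed_maps /mnt/var/lib/nixos.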

@stale stale bot removed the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Apr 29, 2022
@stale stale bot added the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Nov 13, 2022
@brianmcgee
Contributor

I had a similar experience to @Melkor333 today; in the end I had to blow away auto-subuid-map to get things working again.

@stale stale bot removed the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Nov 15, 2022
@MakiseKurisu
Contributor

I also encountered this issue. I run NixOS in a VM running multiple podman containers; one of them is qBittorrent, and it filled the entire rootfs. After enlarging the partition I still could not log into the system, which led me to investigate and find this issue.

I'm using btrfs single inside and dup outside of the VM disk, so that might be why it was corrupted.

Checking the UID/GID maps shows the same symptom:

root@pve11:~# hexdump /mnt/@/var/lib/nixos/uid-map
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0000320 0000 0000 0000 0000 0000 0000          
000032b
root@pve11:~# hexdump /mnt/@/var/lib/nixos/gid-map
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0000200 0000 0000 0000 0000 0000 0000 0000     
000020d

This is why update-users-groups.pl failed as shown in the OP's error message. The script is responsible for creating /etc/passwd, and because it failed, the system cannot load the user database.

I restored those 2 files from my backup and was able to log into my system again.

@Baitinq
Member

Baitinq commented Oct 30, 2023

This is still a problem. I hit this today and had to delete the uid-map.

8 participants