
Reforming Unix

Unix is bad. But file descriptors are good. Modern Unix-derived operating systems (Linux, some BSDs) are already moving towards file descriptors for everything. Capsicum further showed how we can disable lots of global namespaces and simply and elegantly just rely on file descriptors. Now we just need to finish the job, adding a few new system calls to make working without global namespaces better.

Why bother? Look at the living relic that is the ribosome. Core abstractions can last a hell of a long time. The less we manage things --- and computing taken as a whole is pretty unmanaged --- the more likely it is that computing lumbers along like the biosphere: we keep layering new things on top but leave the core in place. Maybe that's fine, and it's a hell of a legacy for some Unix greybeards, but I'd like to think we can do better.

Why reform rather than replace? Projects like Fuchsia are very cool, but changing a gazillion things at once is risky. There is a massive amount of userland code that cannot be replaced all at once. There is a massive amount of hardware that can't be given new drivers all at once. In the rest of the economy, we know something about capital turnover. For tasks like decarbonization, there is interest in managing / speeding up that process. The same mindset should, I think, be applied to computing --- it is a microcosm of these things.

OK, that’s enough philosophizing for now. Here are some concrete new interfaces (and corresponding criticism of old ones) that could be part of such a reform project.

Null namespaces

If I am not mistaken, Capsicum was implemented fairly "shallowly" by just sanitizing system calls. But for operating systems like Unix with some sort of swappable namespaces, we can implement this more deeply.

Namespaces as a way of achieving isolation are crude (their core competency is not isolation but isolation retrofitting for existing programs). However, they do have an excellent side effect: they de-globalize the implementation. For example, there used to be (at least conceptually) single global variables for the root file system, the network interfaces, etc.; now that there is no single "the", there cannot be. (I haven't looked at all the code, and I don't doubt there are vestiges of old ways, but one gets the idea.)

The world we want is one not of multiple global namespaces to switch between, but of no global namespaces at all. But the above implementation changes are still really helpful! Now that there are multiple such namespaces to choose from, code and data structures cannot implicitly refer to the single choice (ambient authority), but must be explicit with some sort of reference. To rid ourselves of global namespaces in some context, we should simply delete those references, replacing them with some sort of Null reference. (Thankfully C makes it very easy to have null references ;)) By finding all the places that would use these references and making them fail on the null reference, we have implemented what we want! That may sound hard, but it is much easier than trying to retrofit code to stop using some global variable; the explicit nature of these references helps immensely.
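To make the mechanism concrete, here is a tiny illustrative sketch (the type and function names are invented for illustration, not actual kernel code): an in-kernel lookup takes an explicit namespace reference, so a "null namespace" just makes name-based lookups fail cleanly.

#include <errno.h>
#include <stddef.h>

struct mnt_namespace;   /* opaque in-kernel namespace reference (hypothetical) */

/* Hypothetical kernel-side path resolution: the namespace is an explicit
 * argument, so "no namespace" is just a NULL reference and the lookup
 * fails cleanly instead of consulting any global state. */
int resolve_path(struct mnt_namespace *ns, const char *path, int *out_fd)
{
    if (ns == NULL)
        return -ENOENT;   /* null namespace: name-based lookups simply fail */

    /* ... resolve path against the explicit ns reference ... */
    (void)path;
    (void)out_fd;
    return 0;
}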

Put another way, while global namespaces implement ambient authority in userspace (boo!), they force the implementation itself to have less ambient authority (yay!). That's great and very helpful to us.

This "deep" method should ultimately be more secure and easier to maintain than the "shallow" syscall auditing approach; it acts at a natural choke point (in-kernel references to namespaces) rather than an extremely wide and hard to defend perimeter (the syscalls).

Spawning processes

Fork + exec is a terrible way to spawn processes. See https://www.microsoft.com/en-us/research/uploads/prod/2019/04/fork-hotos19.pdf for a respectable paper saying as much too. Unfortunately, it doesn't offer any alternative which is entirely better. http://catern.com/rsys21.pdf is better and actually does, though the implementation it describes is just in userspace for now. I have also sent some emails about this to the Linux kernel mailing list, to a FreeBSD one, and to the Capsicum one.

Here is the basic idea:

  1. Load an executable into a fresh new unscheduled process (we might call this an "embryonic" process)

    int proc_create_unscheduled(const char *path);
  2. Set file descriptors and other state on that process

    void proc_set_args(int procfd, const char *const argv[]);
    void proc_set_fd(int procfd, int child_fd, int parent_fd);
  3. Submit it to the scheduler

    void proc_submit(int procfd);

Instead of creating a child with all the capabilities of the parent and then carefully removing capabilities we don’t want, we create a child with 0 capabilities and then grant it the ones we do want. This opt-in rather than opt-out approach gives much better incentives with respect to developer effort and the principle of least privilege.

Also, unlike posix_spawn, Linux's clone, and friends, we have no need to cram a gazillion flags onto one operation. The fact that the new process is unscheduled means we can set up its state however we like, and take as much time / as many operations as we like to do so.
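To make the shape of the API concrete, here is a hedged sketch of how spawning might look with these hypothetical calls (none of them exist today; error handling is omitted):

#include <unistd.h>

/* The hypothetical calls from the steps above. */
int  proc_create_unscheduled(const char *path);
void proc_set_args(int procfd, const char *const argv[]);
void proc_set_fd(int procfd, int child_fd, int parent_fd);
void proc_submit(int procfd);

int spawn_cat(int input_fd)
{
    const char *const argv[] = { "cat", NULL };

    /* 1. Embryonic process: loaded but unscheduled, with zero capabilities. */
    int procfd = proc_create_unscheduled("/bin/cat");

    /* 2. Grant exactly the state we want: argv and three descriptors. */
    proc_set_args(procfd, argv);
    proc_set_fd(procfd, 0, input_fd);        /* child stdin  <- our input_fd */
    proc_set_fd(procfd, 1, STDOUT_FILENO);   /* child stdout <- our stdout   */
    proc_set_fd(procfd, 2, STDERR_FILENO);   /* child stderr <- our stderr   */

    /* 3. Hand it to the scheduler. */
    proc_submit(procfd);
    return procfd;
}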

In fact, we can generalize the steps above to:

  1. Create a fresh blank unscheduled process

  2. Set state on that process

  3. Submit it to the scheduler

The address space / memory map of the new process is just more state. We can implement loading ELFs in userspace if we like, just `mmap`ing pages into the child from the parent as instructed by the program headers. In this manner, we're just as powerful as fork too, but the weird duplication semantics (trying to set up the child to be like the parent) are also opt-in rather than opt-out.
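As a purely illustrative sketch of that last point, a userspace loader could walk the program headers and map each PT_LOAD segment into the child through some hypothetical proc_mmap call (the call and its semantics are assumptions, not an existing API; alignment and .bss handling are elided):

#include <elf.h>
#include <sys/mman.h>
#include <sys/types.h>

/* Hypothetical: like mmap, but the mapping is created in the embryonic
 * process identified by procfd rather than in the caller. */
void *proc_mmap(int procfd, void *addr, size_t len, int prot, int flags,
                int fd, off_t offset);

/* Map the PT_LOAD segments of an ELF whose headers we have already read
 * into ehdr/phdrs, backed by the executable's own fd. */
void load_segments(int procfd, int exec_fd,
                   const Elf64_Ehdr *ehdr, const Elf64_Phdr *phdrs)
{
    for (int i = 0; i < ehdr->e_phnum; i++) {
        const Elf64_Phdr *ph = &phdrs[i];
        if (ph->p_type != PT_LOAD)
            continue;

        int prot = ((ph->p_flags & PF_R) ? PROT_READ  : 0)
                 | ((ph->p_flags & PF_W) ? PROT_WRITE : 0)
                 | ((ph->p_flags & PF_X) ? PROT_EXEC  : 0);

        /* A real loader must page-align and zero-fill p_memsz > p_filesz. */
        proc_mmap(procfd, (void *)ph->p_vaddr, ph->p_memsz, prot,
                  MAP_PRIVATE | MAP_FIXED, exec_fd, ph->p_offset);
    }
}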

It really is the best of all options.

File System

Remember, the file system is a flavour of database. We shouldn't judge it by other standards.

Remember also that file descriptors are much more general than the file system. We should probably call them "OS descriptors" or something to make this clear.

Mounting

Recent versions of Linux and FreeBSD have "memfds".

This grew out of "tmpfs". The idea is simple: since the OS already has a file cache in memory for performance reasons, that cache can just be used, without a backing "real" persistent filesystem, as a file system in its own right. A "memfd" is just a capability to an anonymous file, with an implementation presumably also shared with the file cache / tmpfs.

But if we can create anonymous files, why can't we create anonymous directories? A memdir_create would work quite well with, say, openat. There should be no issue; we can again just reuse "tmpfs", just this time more of it.

There is one last variation of this --- why stop at tmpfs? A conventional mount needs to take a mount point from the global filesystem / mount namespace. This is silly. Windows and C:\ had it right (almost): we should be able to mount a filesystem afresh without any notion of a global namespace.

int mountfd(int source_at, const char *source_path, const char *filesystem_type, ...);

This would just return a directory file descriptor for the root of the newly-opened filesystem. An anonymous directory would just be this with "tmpfs".
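Usage might look something like this (a sketch; mountfd is the hypothetical call proposed above, and the wrapper function is made up):

#include <fcntl.h>

/* Hypothetical: mount a filesystem with no global mount point and return a
 * directory fd for its root. */
int mountfd(int source_at, const char *source_path,
            const char *filesystem_type, ...);

int make_scratch_file(void)
{
    /* An anonymous, in-memory directory tree: a detached tmpfs mount. */
    int scratch_root = mountfd(-1, NULL, "tmpfs");

    /* Work inside it with the ordinary *at() calls; nothing here is
     * visible in any global namespace. */
    return openat(scratch_root, "notes.txt", O_CREAT | O_RDWR, 0600);
}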

Is there a file equivalent of this? I haven’t yet thought of any, but I haven’t yet thought of a good way to explain the asymmetry either.

The fsopen system call that Linux now has is thankfully a step in that direction. (See this LWN article or this blog post for more information about it.) Unfortunately, it doesn't seem that one can get a directory file descriptor from a detached mount yet, like my mountfd described above. I was happy to see that the first comment on that LWN article also wanted this functionality. \[Thanks to Ryan Lahfa for making me aware of this.]

Temporarily temporary files

Memfd / tmpfs is fine when one is sure the data won't be needed later. But say we are preparing something that we want to persist to disk only as the final step? From memfd / tmpfs, we would be doing a cross-filesystem move, which means "VFS → VFS → disk". The "VFS → VFS" is a silly extra hop.

A really snazzy thing would be to take these VFS objects and "back" them after the fact. This would allow just "VFS → disk": we first assign a backing store (the filesystem) to the tmpfs in-memory objects, and then fsync or similar pushes them out.

This doesn't exist, but something close does: O_TMPFILE. https://lwn.net/Articles/619146/ covers it well; the idea is to create an unlinked file, which can then be linked later. Almost equivalently, it is a memfd (a later concept) but with a backing filesystem already set. In other words, it is the result of the "assign backing store" operation asked for above on a memfd, but applied from the moment the file is created.
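For regular files this pattern is usable on Linux today; here is a minimal sketch following the example in the open(2) man page (the write_then_link wrapper is my own name, and error handling is abridged):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Create an unlinked file in the target filesystem, fill it, and only give
 * it a name once it is complete. */
int write_then_link(const char *dir, const char *dest_path,
                    const void *buf, size_t len)
{
    int fd = open(dir, O_TMPFILE | O_WRONLY, 0600);
    if (fd < 0)
        return -1;

    if (write(fd, buf, len) != (ssize_t)len) {
        close(fd);
        return -1;
    }

    /* Link the anonymous file into place via its /proc/self/fd entry. */
    char path[64];
    snprintf(path, sizeof path, "/proc/self/fd/%d", fd);
    int ret = linkat(AT_FDCWD, path, AT_FDCWD, dest_path, AT_SYMLINK_FOLLOW);
    close(fd);
    return ret;
}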

Just as before with memfds, the limitation is that it's regular files only. File systems already implement the "delete when closed and link count goes to 0" logic, which O_TMPFILE can piggy-back upon. However, for directories we would have to recursively crawl children to delete everything. This isn't so inherently bad --- after all, a file that is not stored contiguously will require some crawling around too --- but it is a novel operation for file systems to implement. Thankfully, bcachefs is going to implement this! See https://bcachefs.org/Roadmap/#tmpdir_support for details.

It's instructive to think about syncing/flushing with this. For unlinked files, there is no logical notion of durability to worry about: on restart, all files will be closed, and therefore all unlinked data is unreachable, and it is undefined whether it was persisted or not. However, paging out can still happen to free up memory. With a tmpfs/memfd, this just goes to swap (if it exists), the temporary backing store for unbacked pages (a "homeless shelter" of sorts). But if we think we want to write some or all of this data to a file system eventually, this is a bit wasteful. The O_TMPFILE approach, by assigning the file system up front, will instead have the page-out go to the filesystem's devices. This means memory pressure can get us a "head start" on the flush that will need to happen eventually once the file is linked and the file system unmounted (if those do happen). Writing to swap, in contrast, doesn't help, because while the data is paged out, it is paged out to the wrong place --- we can't just magically teleport it from the swap device to the right place on the file system's devices!

This difference in flushing under memory pressure, which is the difference between the tmpfs/memfd and O_TMPFILE worlds, is something the hypothetical "assign a backing filesystem to an existing VFS object" operation would let us play with: before the backing store is assigned the object flushes one way, and after it is assigned it flushes another way.

A small aside: the bcachefs page linked above says that fsync is a no-op for temporary items, because the "logical durability isn't defined". But there is a cruder "physical durability" that is possible: if one forces flushing first, without memory pressure requesting it, there is still the "head start" benefit that a future sync after linking --- i.e. once logical durability is defined --- would be faster because the data is already on disk. (This is a case of the general principle that observational equivalence with and without costs are two distinct things, and both are interesting.) Still, I don't think this sort of "head start on a temp file" use-case is worthy of an additional knob. Just making fsync on (transitively, for the temp-dir case) unlinked things always do nothing is the better and more interesting place to start.

File system transactions

This Hacker News thread, https://news.ycombinator.com/item?id=35416477, was really good and mirrored many of my own thoughts. It is good to see more criticism, rather than people assuming that since Unix is widespread it must be good.

Mutating data correctly usually requires transactions. The Unix file system doesn't offer them, and that's a big bummer. Instead there is fsync, but it too disappoints:

  1. Unclear scope: fsync will flush all changes to a file. If something else was also changing the file, too bad, you need to wait for it too.

  2. Overkill for many tasks.

The second one is what I’ll discuss.

Suppose we are working on a batch processing system where jobs produce file system results, and jobs can depend on other jobs. We only want to store the results of a successful job, and only store those results in full. We also want downstream jobs to start as soon as possible.

If we don't do something like an fsync before marking the job complete, we could end up with incomplete files on disk and a corrupted result. If we do an fsync, we avoid that, but we crudely wait for flushes even though the next job can work fine reading from the VFS / the unflushed in-memory information.

A first good solution is a write barrier. Imagine something like the following:

// "O_TMPDIR" unlinked tempdir as described above in previous section
int temp_dir_fd = ...;

// do build populating dir
...;

// new thing, or maybe recursive wrapper around the new thing.
write_barrier_deep(temp_dir_fd);

// N.B. a directory's link count going 0 -> 1 doesn't run afoul of the
// no-directory-hardlinks rule.
linkat(temp_dir_fd, "", results_store_fd, "job-result-name", AT_EMPTY_PATH);

The idea of the barrier is not that everything needs to be flushed at this point, but that nothing that comes after it can be flushed until everything that came before it is flushed. This means that if the directory is linked into place on disk, the on-disk version must correctly reflect the in-memory version as of the barrier. In other words, if the OS did not finish persisting the directory and its contents, then it must not have persisted the link either. (In our brave new O_TMPDIR world, any partially-flushed unreachable data from the not-yet-linked O_TMPDIR will get garbage-collected at some later point. Cool stuff!)

At the same time, though, the barrier only makes some on-disk writes wait for other on-disk writes; it does not make any in-memory operations wait for on-disk operations the way sync / fsync do. This keeps everything nice and pipelined --- we can kick off downstream jobs right away without waiting for upstream jobs to be written to disk. It should basically be as fast as the no-barrier case.

Does this sort of barrier exist anywhere? It turns out Darwin has an F_BARRIERFSYNC variant of fsync (an fcntl command) that I think does this, though the documentation is spotty. But there has been plenty of discussion of it elsewhere too. The Hacker News thread links this 2008 LKML email, https://lwn.net/Articles/270891/, which proposes it. In fact, it also mentions something better, from https://www.spinics.net/lists/linux-fsdevel/msg70047.html
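If I am reading the Darwin documentation right, using it is just an fcntl on the file in question; a minimal sketch:

#include <fcntl.h>

/* Darwin only: ask that writes issued after this point not be persisted
 * before the writes issued so far on fd, without blocking on the disk the
 * way fsync / F_FULLFSYNC would. */
int issue_write_barrier(int fd)
{
#ifdef F_BARRIERFSYNC
    return fcntl(fd, F_BARRIERFSYNC);
#else
    (void)fd;
    return -1;   /* not available on this platform */
#endif
}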

Here's the thing about barriers: they represent edges in the dependency graph, but the nodes are still implicit ("things before" and "things after"). "Before" and "after" come from the control flow, and also from the current OS thread/process/whatever. More implicit per-thread/process/whatever state is yet another foot-gun for green threads, and really for anything tricky in an event loop (green threads are a spectrum, not a binary). The alternative is to explicitly create groups of IO operations, with dependencies between them. Assigning IO operations to groups, rather than relying on control flow, makes everything explicit and minimizes ambient state.

As a bonus, the IO groups can be arbitrarily partially ordered, as opposed to totally ordered with barriers (or maybe tree-ordered, if issuing a barrier and then forking a thread does the right thing). I am not sure this has practical benefits, but I do like it conceptually.
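As a purely hypothetical sketch of what an explicit-group interface could look like (every name below is invented):

#include <stddef.h>
#include <sys/types.h>

/* I/O operations are tagged with a group; ordering is declared between
 * groups rather than implied by the issuing thread's control flow. */
typedef int iogroup_t;

iogroup_t iogroup_create(void);
int iogroup_write(iogroup_t g, int fd, const void *buf, size_t len, off_t off);
int iogroup_link(iogroup_t g, int olddirfd, const char *oldpath,
                 int newdirfd, const char *newpath);

/* Nothing in "after" may reach storage before everything in "before" has;
 * neither group's operations block the caller. */
int iogroup_order(iogroup_t before, iogroup_t after);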

Asynchronous interfaces

Actually, I don't have a lot to add here. A lot of people are excited about and writing about io_uring, and Fuchsia (last I checked) already made the call to axe synchronous blocking APIs too.

I will point out that this has a good interaction with the file system transactions part. When requests and responses are paired 1-1, the asynchronous APIs offer perhaps performance improvements but no expressive power. When they are not 1-1, however, things get more interesting.

When we have a write-back cache, like the VFS, we can have not one but two useful response events from a write operation: (a) updated the cache, (b) updated the underlying thing.

For the barrier / write-group examples above, we already covered that we don't need to wait for (b) / that fsync is overkill. But this is only half of the story: only some things don't need to wait for (b). If we wanted to, e.g., send a message to the external world saying "OK, this job is totally finished and the result saved" (to use our running batch-processing example), we would want that to wait for durability, just as an email server sending an acknowledgement of receipt wants to wait for durability. That means waiting for (b).

https://lwn.net/Articles/789024/ nicely discusses plans for "asynchronous fsync" with io_uring, which would work in this manner. But we only need a separate request if we want to restore the 1-1 pairing.
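With liburing one can already issue the fsync as its own asynchronous request and collect the completion later; a minimal sketch, assuming liburing is available (the wrapper name is mine):

#include <liburing.h>

/* Queue an fsync on fd, keep working, and only later wait for the
 * completion that tells us the data is durable (event (b) above). */
int async_fsync_example(int fd)
{
    struct io_uring ring;
    if (io_uring_queue_init(8, &ring, 0) < 0)
        return -1;

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_fsync(sqe, fd, 0);
    io_uring_submit(&ring);

    /* ... other work proceeds here without waiting on durability ... */

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    int res = cqe->res;
    io_uring_cqe_seen(&ring, cqe);
    io_uring_queue_exit(&ring);
    return res;
}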

Networking

A lot of people admit the socket API sucks and is incongruous with everything else. A lot of people praise Bill Joy for cranking out the implementation and embarrassing the company on the ARPA project. These people should talk to each other more.

A few issues:

  1. A socket returned from socket() that has not yet been bound or connected serves no purpose. It is as if we had to create an "open file description" first before deciding which file we wanted to open with it. This is garbage; the extra steps might correspond to the kernel allocating versus initializing the data structure, but this serves no semantic or performance purpose for the userland program.

  2. No anonymous listening sockets. You can create a single connection with socketpair, but for a listening socket capable of `accept`-ing multiple connections? Too fucking bad. POSIX says you must bind before `listen`-ing, and that means mucking around with the shared filesystem. Yuck. Linux has "abstract" sockets, but they exist in another namespace, with shorter names! Substituting one global namespace for another? Wow, so ambitious.

There can be a much better solution, which is to pass around file descriptors for the interfaces we can connect and listen on.

This would create such a file descriptor.

int netiface_open(...);

It might need to be a family of things, since there are many types of interfaces that support many different sorts of addressing schemes. For example, we could open a raw device (with enough privileges) or open an IP address + port (with the exact device(s) being used left unspecified) for TCP and UDP.

An important variation would be to create these anonymously for Unix sockets. This is how we solve problem 2.

int netiface_connect(int netiface);
int netiface_listen(int netiface);

These give us regular sockets (like socketpair) representing one half of a connection. By not reusing the file descriptor for `connect`-ing, the symmetry with `listen`-ing is restored.
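Putting the pieces together, usage might look like this (again, every netiface_* name here is hypothetical, including the anonymous-Unix-socket variant):

/* Hypothetical calls proposed above. */
int netiface_open_unix_anonymous(void);   /* anonymous AF_UNIX interface */
int netiface_listen(int netiface);
int netiface_connect(int netiface);

void demo(void)
{
    /* No filesystem path, no abstract-namespace name: just a capability. */
    int iface = netiface_open_unix_anonymous();

    /* Whoever holds (a suitably restricted copy of) iface can listen on it
     * or connect to it. */
    int listener = netiface_listen(iface);   /* accept()-able listening socket */
    int client   = netiface_connect(iface);  /* ordinary connected socket */

    (void)listener;
    (void)client;
}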

Capsicum encourages removing permissions from file descriptors before sending them around. We should likewise be able to restrict network interface fds so only one of connecting or listening is possible.