TLVC not booting off of hdimage #20
Have you tried increasing
Not sure if 'locked' is the word to use here -
So in this case, the system has 12 buffers in use and "L1 mapped", and another is needed! I think the above fix could solve that, but the system would be VERY tight on buffers. The problem is, we don't have space for a large number of buffers in the kernel data segment - doing so puts the near kernel heap at risk of exhaustion.
I think there is very little overhead in configuring for L2 buffers (EXT only, using CONFIG_FS_EXTERNAL_BUFFER=y) vs L1 only. The only real difference is that the buffer_head struct is 6 bytes bigger and the map/unmap_buffer routines are included. The huge benefit, even for small systems, is having the buffers in main memory, out of the kernel data segment, where they don't compete for near heap space. I don't think there is any benefit in an L1-only system - can you think of any?

With the recent fix to kick unused buffers out of L1 on sync_buffer, the ELKS kernel can be configured and run with as few as 10 EXT buffers (plus the L1 buffers, so only +10k extra RAM). If TLVC does not have that fix applied, then L1 buffers could remain sticky and get_empty_buffer could busy loop. The kernel doesn't require the MINIX buffers to be L1 mapped with

Basically, building an L1-only kernel is a recipe for disaster, since all buffers then compete for "L1 mapping", even when they don't need to be. And even with the new improvements to keep all file data out of L1 buffers, that means nothing if the system is built L1-only.

Regardless of whether a configuration is allowed with L1 buffers only, a potential permanent solution to this problem would be dynamically allocating L1 buffers from /bootopts. That would allow a smaller L1 set for normal users, increased when a large MINIX filesystem is desired. (I suppose the L1 buffers could potentially be automatically increased, but that's another level of complexity.)
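For readers following along, here is a much-simplified, hypothetical sketch of the map/unmap idea discussed above: an L2 buffer's contents are copied into a scarce L1 slot in the kernel data segment on map, and copied back out on unmap if dirty. All names and structures are illustrative stand-ins, not the actual ELKS/TLVC code.

```c
#include <assert.h>
#include <string.h>

#define BLOCK_SIZE   1024
#define NR_L1_BUFS   2     /* tiny on purpose, to show contention */

/* Hypothetical, simplified model of an L2 ("external") buffer that can
 * be temporarily mapped into a kernel-data-segment L1 slot. */
struct buf {
    char ext_data[BLOCK_SIZE]; /* stands in for main-memory (L2) storage */
    char *l1;                  /* non-NULL while mapped into L1 */
    int dirty;
};

static char l1_slot[NR_L1_BUFS][BLOCK_SIZE];
static struct buf *l1_owner[NR_L1_BUFS];

/* Map a buffer into L1: find a free slot, copy the block contents in,
 * and return the near pointer usable from kernel code. */
char *map_buffer(struct buf *b)
{
    int i;
    if (b->l1) return b->l1;              /* already mapped */
    for (i = 0; i < NR_L1_BUFS; i++) {
        if (!l1_owner[i]) {
            l1_owner[i] = b;
            b->l1 = l1_slot[i];
            memcpy(b->l1, b->ext_data, BLOCK_SIZE); /* copy-in */
            return b->l1;
        }
    }
    return 0;  /* no slot free: the real kernel would sleep or evict here */
}

/* Unmap: copy back out if modified, then free the L1 slot. */
void unmap_buffer(struct buf *b)
{
    int i;
    if (!b->l1) return;
    if (b->dirty)
        memcpy(b->ext_data, b->l1, BLOCK_SIZE);     /* copy-out */
    for (i = 0; i < NR_L1_BUFS; i++)
        if (l1_owner[i] == b) l1_owner[i] = 0;
    b->l1 = 0;
}
```

With only two slots, a third concurrent mapping fails, which is exactly the contention the thread describes when everything competes for "L1 mapping".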
Yes, currently the MINIX filesystem code reads in all Z and I maps for speed, and keeps them in buffers marked in-use. Rewriting that could get complicated; I'll leave that alone for now. Regarding the buffer calculations: a 65Mb max MINIX filesystem = 65535 blocks (Z-map), and usually 1/3 that many inodes (= 21845 I-map).

All this said, what do you think is a viable solution: just increase NR_MAPBUFS to 13 for MINIX and FAT? Or discourage L1-only configuration and specify EXT_BUFFERS=10, which gives a total of 18 or 21 buffers and problem solved? Or another solution, like disallowing L1-only? (I kind of like that last option. It could be removed from make menuconfig but still kept in the source, for developers only.)
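The buffer counts above follow from how many 1K blocks it takes to hold the bitmaps: a 1K block holds 8192 map bits, so the block count is a ceiling division. Note this is a bare illustration; the real on-disk layout has extra details (bit 0 of each map is reserved, inode counts get rounded), so the kernel's actual figure can differ by a block or so from this calculation.

```c
#include <assert.h>

#define BLOCK_SIZE 1024
#define BITS_PER_BLOCK (BLOCK_SIZE * 8)   /* 8192 map bits per 1K block */

/* Blocks needed to hold a bitmap of 'bits' bits (ceiling division). */
static unsigned map_blocks(unsigned long bits)
{
    return (unsigned)((bits + BITS_PER_BLOCK - 1) / BITS_PER_BLOCK);
}
```

For the max-size filesystem discussed here, 65535 zone bits need 8 map blocks and 21845 inode bits need 3, which shows why a double-digit number of locked buffers appears as soon as a full-size MINIX fs is mounted.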
I was giving that some thought before creating the issue, and there is only one scenario: running the system on 256k RAM - which is a goal, not because it makes a 'generally useful' system, but because there will be scenarios with part of the RAM malfunctioning. IOW, a diagnostic setting which is within the ambition of TLVC. Other than that, I agree with you - running L1 only is a recipe for disaster. Now, should 10 or 15k for some L2 buffers be a dealbreaker? Maybe not.

I created the issue as a reminder, not wanting to fix this until all the other buffer enhancements are in place. Then it's time to experiment with really constrained settings like this. Possibly - if deemed worthwhile - add some logic to release some of the bitmap buffers (keep, say, 3 or 4) and reread them when needed. This is really not so much about the ability to boot a max-sized MINIX fs as to have sufficient headroom to boot from floppy, mount an HD image of some size and still have a few buffers for operation.
Yes, that's an idea. Possibly simpler than a 'page buffers in and out' mechanism as alluded to above (which would be compiled in only if configured L1 only for example).
I'll check out the 13 buffers option, it's interesting to see how the system behaves. I suspect it will still be dead since there are no buffers for general operation. Then increase one by one to find the practical limit. A fitting tag would be 'fun with old computers' ... I'll make sure to put some practical advice on buffers into the configuration Wiki. Maybe even write up something (in technical notes) on how the TLVC/ELKS buffer system works while it's fresh in memory.
Hello @Mellvik, Given your comments and some research, I have come to a different conclusion and a sense of new direction on how kernel buffering should be handled: that is, kernel developers cannot guess how a given system might be used, and it is best to allow multiple configuration options and move towards dynamic allocation of all resources (through configuration as well as through /bootopts) to allow rapid tuning and feedback of the system for a particular use. For instance, I can think of the following possible uses (some more likely for ELKS, others for TLVC):
I've looked into the MINIX bitmap code, and rewriting it to read and release a buffer for each I/Z-map access is very straightforward. The buffers are only used when creating new inodes or file blocks, so having them around on a system that's not creating new data is a huge waste of limited resources! If both L1 and L2 buffers are dynamically allocated, the buffer system should be able to be tuned for best performance of I/Z maps, and let the system do its job. The idea of requiring 12-13 buffers for a large MINIX mount, or even 3-4 each for two floppies, is frankly ridiculous, and you've pointed out the impossibility of mounting 2-3 large HD filesystems with the current architecture using L1 only. In addition, both L1 and L2 buffers should be directly specifiable in
Adding these options, along with dynamically allocating L1 buffers using a corresponding CONFIG_FS_NR_L1_BUFFERS, would go a long way towards allowing all the above-described use cases to work well and provide users with the ability to tune the system. And yes, bdflush-style sync options would be next on the list. What do you think?
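The read-and-release bitmap access mentioned above can be sketched in miniature: fetch just the one map block holding the bit in question through the buffer cache, test the bit, and release the buffer immediately instead of keeping every map block locked for the life of the mount. The bread/brelse stand-ins below are simplified placeholders, not the real kernel routines.

```c
#include <assert.h>

#define BLOCK_SIZE 1024

/* Stand-ins for the buffer cache: "disk" holds 8 Z-map blocks. */
static unsigned char disk[8][BLOCK_SIZE];
static unsigned char *bread_map(int blk) { return disk[blk]; }
static void brelse_map(unsigned char *b) { (void)b; }  /* release buffer */

/* Test a zone bit by reading only the map block that holds it, then
 * releasing the buffer - no map blocks stay locked across calls. */
static int zone_bit(unsigned long zno)
{
    unsigned char *map = bread_map((int)(zno / (BLOCK_SIZE * 8)));
    int bit = (map[(zno / 8) % BLOCK_SIZE] >> (zno % 8)) & 1;
    brelse_map(map);
    return bit;
}
```

The LRU list then decides which map blocks stay cached under real load, rather than the mount code pinning all of them up front.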
Also, I just realized that the "L1 buffers" aren't actually system buffers! That is, they have no distinct buffer heads associated with them - they are in fact each only a 1K cache for a buffer head that's in L2. So when one configures the system for, say, 10 L1 buffers and 20 EXT buffers, the system is only actually getting 20 buffers. That's 10k wasted, and definitely misleading. (In an L1-only configuration, then yes, there are buffer heads associated with these caches, but that's not the normal build case.)

The good news is that my testing has shown we might be able to make do with far fewer L1 buffers now, since file data doesn't require using them. I'm working on an idea that might allow the "L1" cache to actually have buffer heads associated with it, but it gets a bit complicated guaranteeing that copy-in/copy-out keeps working. Making the L1 cache actually into buffers would mean that if configured for 10 L1 buffers and 20 L2 buffers, there would actually be 30 usable buffers in the system - IMO a large improvement. I plan on adding dynamic allocation of L1 cache and EXT/XMS buffers from /bootopts shortly, which will allow one to much more easily play with different settings to see what the system really needs to boot up and run well.
Howdy @ghaerr, At times I find I have to restrain myself from becoming too ambitious with TLVC. After all, it's not going to be a multiuser system for 10 or 15 users (even though an old PC has more raw capacity than a 15-user PDP11/20 back in the day). Also, as you say, the strategy and choices for TLVC will likely be different from ELKS.
I think the key is to create flexibility and avoid complexity. You list a number of interesting choices in that regard.
Yes, I discovered this when I first stumbled into this rabbithole - it was kind of confusing for a while. Now, armed with an entirely different understanding of the entire buffer system and how it works, I've come to see it mostly as a blessing. Being able to configure the system easily (no code complexity) to run on L1 only is fascinating and useful. That said, I'm getting your point, which I'm reading as 'turn the cache into system buffers' and disconnect them from L2 (data buffers). That would either eliminate the entire mapping/unmapping scheme, or leave it as is and add a function to disconnect from L2 after transfer 'down' to L1. Presumably syncing from L1 w/o involving L2 is a natural. This is indeed interesting, in particular if it can help eliminate or simplify some code.
Agreed. With the ability to watch the flow of buffers in the system with the new tools, this would be a rather interesting experiment.
Yes - this is an important first step!
On 2nd thought, the mapping/unmapping will be required for XMS buffers anyway, so getting rid of it is not an option. And interestingly, the disconnect is what we discovered was happening before, when data blocks came back into L2 in a different buffer. Although not in a controllable manner. Still, I think the disconnect and turning the L1 cache into system buffers is an interesting idea.
Yes, almost - it wouldn't be that L1 is for metadata and L2 for data, they would all just be system buffers, with the distinction that the "L1" buffers data happens to be stored in the kernel data segment.
Yes, this is the harder part - determining exactly how to kick out a buffer from L1 to somewhere (it'd have to go to L2). They'd all be buffers and there's no "cache" so the current "copy-in/copy-out" isn't quite enough. I haven't come up with a solid design yet, but the idea would be to call
Yep. The copy-in/out still needs to happen, but it'd be done from buffer-to-buffer, rather than cache-to-buffer, along with swapping the buffer heads as well.
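The buffer-to-buffer copy with head swap described above could look roughly like this. This is a hypothetical sketch with made-up names: two payloads are exchanged, and the data pointers are swapped so each buffer head keeps describing the same block contents, just stored in the other location.

```c
#include <assert.h>
#include <string.h>

#define BLOCK_SIZE 1024

/* Hypothetical buffer head: only the fields needed to show the swap. */
struct buffer_head {
    char *b_data;        /* where this block's bytes currently live */
    int   b_in_l1;       /* 1 if b_data points into the kernel data segment */
};

/* Demote 'victim' from L1 so 'incoming' can use its L1 storage:
 * exchange the two payloads, then swap the data pointers so each head
 * still describes the same logical block contents afterwards. */
void swap_storage(struct buffer_head *incoming, struct buffer_head *victim)
{
    char tmp[BLOCK_SIZE];
    char *p;

    memcpy(tmp, victim->b_data, BLOCK_SIZE);
    memcpy(victim->b_data, incoming->b_data, BLOCK_SIZE);
    memcpy(incoming->b_data, tmp, BLOCK_SIZE);

    p = incoming->b_data;
    incoming->b_data = victim->b_data;  /* incoming now owns the L1 bytes */
    victim->b_data = p;
    incoming->b_in_l1 = 1;
    victim->b_in_l1 = 0;
}
```

In the real kernel the L2 side would be a far segment reached via far memcpy rather than a near pointer, but the invariant is the same: after the swap, following either head's data pointer still yields that head's block.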
Is that true? That's amazing! My first job in college was working at a lab that had a PDP 11/45. It had an MMU which allowed 128K programs, similar to what we have now with medium model.
I plan on removing the Z/I map buffers being kept in memory at all times shortly. However, it turns out to be much more complicated to manage a number of deemed-"mandatory" buffers, so what will happen is that each buffer will be requested when needed, and the LRU algorithm should keep the most active buffers in real memory. If not, a read of the Z/I map will be done. This also keeps things simple, which I also agree with.
What do you mean here? Does
Actually I found that's not true - the current scheme using
Well, it's a slight exaggeration because the architecture was different, but close enough. Also - easy to forget: the users were on 300 bps teletypes (yours working yet?), which is a very different interrupt load than the 57600 bps times 3 I'm using on the 80286.
The idea is/was to have
OK, I think we have a good picture now. I'm going to rewind back to the original problem, the mount hang, and get the locked buffers down to whatever seems practical (one sounds reasonable), maybe tunable at the source level. By then, and even before that, we have a much improved system. Thanks for the brainstorming session, it has been really useful!
@Mellvik: I've completed all tasks - dynamic L1 cache loading, 64k I/O overlap check, /bootopts configurable L1, EXT and XMS buffer count and not loading/locking the Z- and I-map buffers for FD or HD mounts. Things are looking good and running well on QEMU. I have some more testing to do before committing the Z/I-map enhancement (ghaerr/elks#1675), mainly to do with seeing just how few L1 and L2 buffers the system now needs to boot and run. I have it now running with 8 EXT buffers and 4k cache with a 65Mb MINIX filesystem mounted... and 6 buffers free after the mount! :) The I/O test scripts are running well, but real-world testing on floppy is unknown. I've been meaning to add a delay to the bioshd driver to simulate real-world floppy delays.
Wow @ghaerr, I'm waking up to a breakfast table all set, ready to eat :-) Seriously, this is great. I haven't looked yet, as I just completed 'importing' your trace tools and verified the buffer enhancements from these last weeks, but I'm looking forward to it. As the last status before going for the fixes, here's what we're coming from (TLVC version):
Thanks, it's been fun writing the code and deep-diving into the buffer system with the issues you've brought up.
Cool, I was meaning to add a check for that myself, rather than hanging. Would you mind if I brought a few of your bug fixes and enhancements back to ELKS? For instance, I noticed your fix for /etc/passwd being opened/closed for every line of
On another note, I'm wondering whether the super block needs to be cached for every mounted filesystem. I haven't checked the code too deeply yet, but IMO there's not much use in keeping it in RAM when it's only written to once, after boot or before reboot, to mark the filesystem dirty. With the super block cache removed, there'd be no hit (except NR_MOUNT) on resources for each filesystem mounted.

While doing the recent buffer work I traced multiple writes to the superblock setting the dirty bit after boot - it might be nice to get this down to a single write, or none: I'm wondering whether we really need to set the dirty bit just after boot if the filesystem isn't in fact ever written to. It might be nice to allow effective read-only handling of the filesystem until a sector actually changes. And regarding that: another issue at ELKS (ghaerr/elks#1619 (comment)) had me trace down that not only is the superblock being written on boot/reboot, but the inode for /dev/tty1 is also written every time (because login calls
Hmm, yes - seems to have been a (really useful) habit for what, 6 years now? :-)
Hey, this is open source. Do whatever and refer/attribute when suitable. Like I said, I just imported a bunch of your stuff, mostly related to what we've been working on, but then again there are other changes/bugfixes/updates that have happened since the fork that need to come with them to get the pieces together. (Plus I imported the INITPROC stuff, which I haven't tested yet.) BTW don't take

While at it, there are a couple of other things:
I've been pondering that one (no research like you have done), and came to the conclusion that one, possibly even 2 blocks locked per mount is a reasonable cost (and an incredible improvement from today). The thing is, if the system is really resource constrained, there will be few if any mounts beyond root. Or the opposite angle: if you need to mount 4 or more filesystems, you probably have the RAM for it. So I'm ok with the SB locked in RAM, at least for now.
Agreed, thank you.
Nothing like a code review! Wow, total bug: the problem is:
so the reason gcc didn't complain is that it never compiled it! And the kernel stack is not being checked on syscall entry!!
And far better when multiple eyes are looking at it, and using it. I probably never would have found the

I'll mention your name when committing any of your code, if for no other reason than to know whom to ask questions :)
The latest enhancement brings the number of buffers required in memory per mount to 1, ever, and that's only for MINIX. FAT doesn't require any. IMO the kind of stupid thing about never releasing the super block buffer is that after mount, its contents are NEVER referred to directly again - all the I- and Z-map bitmaps are copied and kept in a kernel

The net result is the unusability of a valuable system buffer during the entire period of filesystem I/O activity. The super_block is then updated again (but only changing the super block dirty bit; no other contents ever change) on unmount.
Actually, with the new enhancements in place, there just isn't any extra (or less) RAM required for a single mount versus 5 mounts - the buffers can all be reused, except for the SB. We have eliminated all the prior resource constraints that used to come with mounting. What do you think about the other issue(s) I mentioned with the super block being written to immediately after boot, etc.? IIRC you posted somewhere about some issues you'd seen with the current behavior that could be improved. Given how deeply we've dived into the buffer system at this point, I'm going to tackle the super block buffer issue(s) next.
PR ghaerr/elks#1676 removes the dedicated super block buffer for any mounted filesystem, thus allowing any number of mounted filesystems without any buffer resource usage. It also fixes a double-write of the superblock directly after boot and doesn't wait for I/O to complete, so floppy boots should be faster. The enhancement also doesn't write the superblock at all if the checked bit isn't changing, which should help speed up some boots and reboots as well. This should cover most of the super block issues previously reported. Things are a bit tricky to speed up floppy boots, because a single sector write clears the track cache (perhaps a separate DMASEG buffer should be used?). So the above fixes should work well, but may still need tuning depending on the state of the filesystem checked flag, the super block being written, and other queued disk reads for an upcoming exec /bin/sh, for instance.
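The "don't write the superblock if the checked bit isn't changing" idea reduces to a simple compare-before-write. Here is a hypothetical, stripped-down model of that logic (the struct and counter are stand-ins for the real superblock and block write, not actual kernel code):

```c
#include <assert.h>

/* Toy mount-state handling: write the superblock to disk only when the
 * on-disk 'dirty/checked' bit would actually change value. */
struct super {
    int s_dirt_on_disk;   /* value of the dirty bit as last written */
    int writes;           /* count of physical superblock writes */
};

static void write_super(struct super *sb, int want_dirty)
{
    if (sb->s_dirt_on_disk == want_dirty)
        return;                      /* no change: skip the disk write */
    sb->s_dirt_on_disk = want_dirty;
    sb->writes++;                    /* stands in for the real block write */
}
```

On a clean boot/mount this gives exactly one write to mark the fs dirty and one at unmount to mark it clean; redundant calls in between cost nothing, which is where the double-write fix comes from.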
Well, that's reassuring! I was really scratching my head about the purpose of this check_kstack() call - unsuccessfully :-)
Indeed!
Again, this is great. Looking at the code it seems
Great - the double write has been on my list ever since this 'buffer project' started; it was glaring at me 400 times a day.
I don't think this is a problem. From looking at my traces, the SB read-then-write during boot coincides with more reads from the same track. Caveat: this is before the fixes that eliminate the locking of the bitmap buffers. The access pattern is (floppy) R1-R2-R3-R4-W1-W1-R12-R8-R960-R9-R14, where bl12 is the root dir and bl960 is the dev dir. Will test this later today. Thank you.
BTW ---
The comment you referred to above says that block 12 is the
We're getting into the nitty-gritty here: just wondering if the floppy seek to R960 ends up slowing things down much - it seems /dev is created last so it's way out from track 0... this combined with your comment about
Agreed, this is becoming nitty-gritty level. Come to think of it, if anything could be made sticky in the buffer system, it should probably be the root directory block and its inode. Maybe something to ponder on a rainy day. Thanks!
Never mind the root inode, presumably in memory anyway. The disk block containing the inodes for the directories in root might be useful to keep around. Now that we have all these traces, it's easy to figure out how useful. I'll have a look.
Well, the whole idea of the LRU list is to keep frequently accessed stuff around. With the major enhancements to keep data out of L1 buffers, that might help, although lots of streaming data could still effectively flush all metadata buffers out.
Great. Perhaps better than figuring out exactly what to do with / or /dev, just getting a feel for how many L1 and L2 buffers are required to work well on floppy, first. After that, we might be able to "mark" buffers as metadata or root dir, etc, and have get_free_buffer skip them in the LRU list, but that'd probably require another use count field so that they don't get skipped forever and become permanently sticky.
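The marked-buffer-with-skip-count idea sketched above might look like this. Everything here is a hypothetical simplification (the LRU list is modeled as an array in LRU order, and the field names are made up): metadata buffers are spared by the scan, but only a bounded number of times, so they can never become permanently sticky.

```c
#include <assert.h>
#include <stddef.h>

#define NBUF 4
#define MAX_SKIPS 2   /* after this many passes, a marked buffer is fair game */

struct bh {
    int b_count;     /* in use when > 0 */
    int b_meta;      /* 1 = metadata/root-dir buffer we'd like to keep */
    int b_skips;     /* times the LRU scan has passed over this buffer */
};

/* Scan the LRU list and pick a free buffer, preferring non-metadata
 * ones.  Each time a metadata buffer is spared, bump its skip count so
 * it eventually becomes eligible again. */
struct bh *get_free_buffer(struct bh lru[], int n)
{
    int i;
    for (i = 0; i < n; i++) {
        struct bh *b = &lru[i];
        if (b->b_count)
            continue;                    /* busy */
        if (b->b_meta && b->b_skips < MAX_SKIPS) {
            b->b_skips++;                /* spare it, but remember */
            continue;
        }
        b->b_skips = 0;
        return b;
    }
    return NULL;                         /* caller would sleep here */
}
```

The skip counter is the extra use-count-style field mentioned above; without it, a marked buffer would be skipped on every pass and the sticky problem would just come back in a new form.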
More results in from testing MINIX and FAT with variations of the number of L1 and L2 buffers (using /dev/fd0 on QEMU, so results accurate but not real time):

Testing with EXT/L2 buffers = 10, L1 variable: amazingly enough, the system performed slightly better with 8 L1 than 12 (this may be due to lack of L1 LRU, see below). With L1 = 6, FAT was about the same, but MINIX started doing more maps. At only 4 L1 buffers, MINIX required 12k maps, starting to get many more maps. I haven't measured the time required to memcpy 1k bytes, but 4k more copies of 1k bytes is 4 million bytes. Probably a bit slow on an 8088, but still faster than a floppy sector seek & read, right?

I then increased the number of L2 buffers from 10 to 20: the MINIX L1=8 maps went from 5k down to 4k. Very little difference. So L2 buffers are likely going to help with floppy speedup by keeping data around, while L1 is a different story, relating to whether metadata is needed and read vs create activity.

With the above results, I'm changing the L1 default in limits.h to 8 immediately. This has large ramifications, since that gives 4k more bytes in the kernel data segment - room for 4 more tasks with no other changes. That would increase usability a lot! Bottom line: with the major L1/L2 enhancements recently made, old assumptions about the number of kernel buffers needed have to be completely revisited.
The other thing I've realized is that the L1 cache is being searched sequentially when another map is needed; there is no notion of LRU, which is probably a problem for efficiency. We either need to build a second LRU list for the L1 cache, or possibly consider moving the L1 cache into being buffer heads themselves, which could also work to increase throughput, although not directly.
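A second LRU list for the L1 cache needn't be elaborate. Here is a minimal, hypothetical sketch: slot indices are kept in most-recently-used order, a hit moves the slot to the front, and the eviction victim is always the tail, replacing the sequential scan.

```c
#include <assert.h>

#define NSLOTS 4

/* L1 slot indices in MRU order: lru[0] = most recent, lru[NSLOTS-1] = victim. */
static int lru[NSLOTS] = {0, 1, 2, 3};

/* Record a hit on 'slot': move it to the front, shifting others down. */
static void touch_slot(int slot)
{
    int i, j;
    for (i = 0; i < NSLOTS && lru[i] != slot; i++)
        ;
    for (j = i; j > 0; j--)
        lru[j] = lru[j - 1];
    lru[0] = slot;
}

/* When a new map is needed, evict the least recently used slot. */
static int victim_slot(void)
{
    return lru[NSLOTS - 1];
}
```

With only a handful of L1 slots, even this O(n) move-to-front is cheap, and it keeps hot map blocks from being recycled just because they sit early in the array.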
Agreed. It seems to me this is more research than fixing – understanding what's going on when resources are tight. With few buffers, loading e.g. Anyway, I'll get back to experimenting with that when I've figured out what's going on with the buffers. |
Thanks for the numbers and comments, @ghaerr - a great starting point for some real hardware runs. But first, the latest rabbit finally got caught - you'll like this one. Turns out the change of

What made the hunt unnecessarily long was the fact that it worked perfectly on QEMU, so 64k alignment wasn't even on the list for investigation. Now we know: QEMU does the right wrong thing: it makes the simulated 8237 do what you'd expect it to do, not what it actually does.

That out of the way, another one popped up when I reenabled L2 buffers after the rabbit hunt above: the superblock was mysteriously overwritten with garbage at mount time. Which turned out to be a missing sync between L1 and L2 when the superblock is to be written back to disk. I ended up checking for
This problem affected all MINIX mounts (I didn't check FAT), not only the root mount and not only on floppies, but for some reason the effects were different when using HD. More in the upcoming PR notes.
Good to know! I had previously added code to handle a 64k crossover in the BIOS driver, but in my kernel never actually saw the boundary crossed.
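For reference, the boundary check itself is small: the 8237 DMA controller only increments the low 16 address bits, so a transfer must not cross a physical 64k boundary. Below is a generic sketch of such a check (not the actual bioshd code, just the standard test):

```c
#include <assert.h>

/* Return 1 if a DMA transfer of 'len' bytes starting at physical
 * address 'phys' would cross a 64k boundary - the 8237 wraps its 16-bit
 * address counter within the current 64k page, corrupting the transfer. */
static int crosses_64k(unsigned long phys, unsigned long len)
{
    if (len == 0)
        return 0;
    return (phys & 0xFFFF0000UL) != ((phys + len - 1) & 0xFFFF0000UL);
}
```

A driver that detects the crossing typically bounces the I/O through an aligned buffer (like DMASEG) or splits the transfer at the boundary; an emulator that doesn't model the wrap will happily hide the bug, as described above.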
I've sometimes wondered whether it would make sense to build a custom version of QEMU for things like this, but the last time I looked, QEMU was unbelievably large and complicated. So there's nothing like real hardware!
I've not ever seen this, and I've spent quite a bit of time testing the numerous L1/L2 changes as well as superblock update changes made recently. Do/did you have a definite way of reproducing this problem? I'd like to try it.
This is all supposed to be handled in the

I mention all this because it sounds to me like perhaps not all the L1/L2 buffer I/O and release code may have been copied over fully correctly. I will post what I think are working versions at the end of this post for comparison.
The current design does not require an L1 buffer to be written to L2: it can stay in L1 forever (if not released as described above) and have I/O performed on it as well.
IMO, that will cause lots of L1<->L2 "buffer swapping", so it would be worth figuring out where the real problem is. I am wondering if this problem started just after one of my changes to the superblock code, or earlier. Of course, our repos are not in sync so it's hard to tell. Here's my copy of sync and get buffer, for quick reference. Almost all the work is done in these routines, as far as starting I/O:
Thank you for your report!
Well, that didn't work, more investigation needed. Seemingly the problem must be fixed in the superblock code. In the meanwhile, this works (in sync_buffers):
Sounds like the problem is that I/O from an L1 buffer isn't working... there's been a ton of changes made. I remember that the
Perhaps a diff of
That change looks like L1 I/O isn't working... the above
This must be it. I have no I/O from/to L1, and what happens when the SB gets overwritten is that the buffer address is L1 while the seg is L2. Thank you, that's energy for a good start in the morning. :-)
As suspected, this was the culprit (it was missing).
Thanks, and no, I've not used the
Thanks!
Good to know! Can you post that here, or at least the gist of it? I've only seen the problem when the kernel is configured with too few buffers, but I now see that an application that, say, just writes lots of data will also cause the problem. I'm interested in the solution you've taken and whether it's similar to the L1 buffer wait on
Yes, I think it's similar. Here's the main loop - more than half of the code is still debug stuff.
There is a wakeup in
... and an experimental in
Interesting, thanks. I'm wondering a number of things: it would seem at first thought that a wakeup might probably be best in

I'm also wondering what other processes might be able to run when there aren't any buffers left. Given that, I'll have to think about this more, and will put together a test script to test this on /dev/ssd. My current scripts might not actually exhaust the buffers, so it will be interesting to see what happens. Thanks!
It's a good point and I wanted to test that, but ran into other snags and left it.
Yes, that's another good point. My thinking is that there may be tasks doing other types of I/O (serial, ethernet, even parallel (?)) or CPU-bound processing waiting, making the sleep worth it. Even those will eventually need storage I/O, but this way they get a chance to do useful stuff. Also, not wait-looping in most cases makes the system feel more responsive, to terminal I/O in particular.
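The sleep-instead-of-spin handshake discussed here can be modeled in a few lines. This is a toy single-threaded model, not the real kernel code: the counters stand in for sleep_on()/wake_up() on a buffer wait queue, just to show the invariant that every park gets exactly one wakeup when a buffer is released.

```c
#include <assert.h>

/* Toy model of the buffer-exhaustion handshake: a process that finds no
 * free buffer parks itself instead of busy-looping; brelse() wakes one
 * waiter whenever a buffer comes back. */
static int free_bufs;     /* buffers currently available */
static int waiting;       /* processes parked on the wait queue */
static int wakeups;       /* wake_up() calls actually issued */

static int get_buffer_or_wait(void)
{
    if (free_bufs > 0) {
        free_bufs--;
        return 1;            /* got one */
    }
    waiting++;               /* would sleep_on(&bufwait) here */
    return 0;
}

static void brelse_buf(void)
{
    free_bufs++;
    if (waiting) {           /* would wake_up(&bufwait) here */
        waiting--;
        wakeups++;
    }
}
```

The point made above falls out of the model: while a process is parked, anything doing serial/network/CPU-bound work keeps running instead of spinning on the buffer list, and the woken process simply retries its allocation.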
I'm closing this issue - the discussion continues in a new thread.
If configured with L1 buffers only, TLVC hangs at root mount time if the boot device is a full-size MINIX fs.

A 64M MINIX fs requires 13 blocks of 'locked' (b_count > 0) buffers, while there are only 12 L1 buffers available. Actually, the buffer level is unimportant; it's the number of buffers that counts (1 superblock, 12 bitmap blocks).

The root cause seems to be located in minix_read_super, which reads the entire fs bitmap into buffers and keeps them 'locked' for the duration of the mount. Mounting 4 full-size filesystems locks up 52 buffers.

The issue may seem esoteric, but it needs to be fixed in order to make TLVC bootable on small systems. That said, if the filesystem is small, booting and/or mounting works fine. The QEMU standard 32M boot image boots, but leaves little left for system operation.

The problem does not apply to FAT filesystems, which require only one block locked in buffer space.