TLVC not booting off of hdimage #20

Closed
Mellvik opened this issue Aug 25, 2023 · 39 comments
Labels
bug Something isn't working

Comments

@Mellvik
Owner

Mellvik commented Aug 25, 2023

If configured with L1 buffers only, TLVC hangs at rootmount time if the boot device is a full size minix fs.

A 64M minix fs requires 13 blocks of 'locked' (b_count > 0) buffers, while there are only 12 L1 buffers available. Actually, the buffer level (L1 vs L2) is unimportant; it's the number of buffers that counts (1 superblock, 12 bitmap blocks).

The root cause seems to be located in minix_read_super, which reads the entire fs bitmap into buffers and keeps them 'locked' for the duration of the mount. Mounting 4 full-size file systems locks up 52 buffers.
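
For reference, the pattern that pins those buffers looks roughly like the sketch below. This is a simplified illustration, not the literal TLVC source; the helper name minix_load_maps and the exact field names are assumptions, but the key point is that every bread() result is stored in the in-core super_block, so b_count stays nonzero until unmount.

/* Sketch only: each I-map and Z-map block is read once at mount time and
 * the buffer_head pointer is kept in the in-core super_block, pinning the
 * buffer (b_count > 0) for the lifetime of the mount. */
static void minix_load_maps(struct super_block *s, kdev_t dev,
                            struct minix_super_block *ms)
{
    unsigned int i;
    unsigned int block = 2;     /* maps follow the boot block and the superblock */

    for (i = 0; i < ms->s_imap_blocks; i++)
        s->u.minix_sb.s_imap[i] = bread(dev, block++);  /* pinned until unmount */
    for (i = 0; i < ms->s_zmap_blocks; i++)
        s->u.minix_sb.s_zmap[i] = bread(dev, block++);  /* pinned until unmount */
}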

The issue may seem esoteric but needs to be fixed in order to make TLVC bootable on small systems. That said, if the filesystem is small, booting and/or mounting works fine. The qemu std 32M boot image boots, but leaves little room for system operation.

The problem does not apply to FAT filesystems, which require only one block locked in buffer space.

@Mellvik added the bug label Aug 25, 2023
@ghaerr

ghaerr commented Aug 25, 2023

Have you tried increasing NR_MAPBUFS to 13 in buffer.c, depending on your configuration?

/* Number of internal L1 buffers,
   used to map/copy external L2 buffers to/from kernel data segment */
#ifdef CONFIG_FS_FAT
#define NR_MAPBUFS  12
#else
#define NR_MAPBUFS  8
#endif

A 64M minix fs requires 13 blocks of 'locked' (b_count > 0) buffers

Not sure if 'locked' is the word to use here - b_locked means the buffer has I/O in progress, b_count means the buffer is currently in use by the kernel, and b_mapcount means the buffer is in use and mapped into L1. Yes, in all three cases the buffer cannot be reused.
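
For readers following along, the three states can be summarized in a small annotated sketch (the field layout here is illustrative only, not the exact ext_buffer_head definition):

/* Illustrative only -- the three independent reasons a buffer can't be
 * reused, as described above. */
struct buffer_state_sketch {
    char b_locked;      /* I/O currently in progress on this buffer     */
    char b_count;       /* buffer referenced (in use) by the kernel     */
    char b_mapcount;    /* buffer in use AND currently mapped into L1   */
};
/* A buffer is reclaimable only when all three are zero (and it isn't dirty). */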

while there are only 12 L1 buffers available. Actually, the buffer level is unimportant, it's the number of buffers that count.

So in this case, the system has 12 buffers in use and "L1 mapped", and another is needed! I think the above fix could solve that, but the system would be VERY tight on buffers. The problem is, we don't have space for a large number of buffers in the kernel data segment - doing so puts the near kernel heap at risk of exhaustion.

the issue may seem esoteric but needs to be fixed in order to make TLVC bootable on small systems.

I think there is very little overhead in configuring for L2 buffers (EXT only using CONFIG_FS_EXTERNAL_BUFFER=y) vs L1 only. The only real difference is the buffer_head struct is 6 bytes bigger and the map/unmap_buffer routines are included. The huge benefit, even for small systems, is having the buffers in main memory, out of the kernel data segment, where they don't compete for near heap space.

I don't think there is any benefit in an L1-only system, can you think of any?

With the recent fix to kick unused buffers out of L1 on sync_buffer, the ELKS kernel can be configured and run with as few as 10 EXT buffers (plus the L1 buffers, so only +10k extra RAM). If TLVC does not have that fix applied, then L1 buffers could remain sticky and get_empty_buffer could busy loop. The kernel doesn't require the MINIX buffers to be L1 mapped with b_mapcount nonzero, only that b_count is nonzero. Thus, all those Z/I-map buffers could be kept in main memory, using 13k.

Basically, building an L1-only kernel is a recipe for disaster, since all buffers then compete for "L1 mapping", even when they don't need to be. And even with the new improvements to keep all file data out of L1 buffers, that means nothing if the system is built L1-only.

Regardless of whether a configuration is allowed with L1 buffers only, a potential permanent solution to this problem would be dynamically allocating L1 buffers from /bootopts. That would allow a smaller L1 set for normal users, to be increased when a large MINIX filesystem is desired. (I suppose the L1 buffers could potentially be automatically increased, but that's another level of complexity).

the root cause seems to be located in minix_read_super, which reads the entire fs bitmap into buffers

A 64M minix fs requires 13 blocks

Yes, currently the MINIX filesystem code reads in all Z and I maps for speed, and keeps them in buffers marked in-use. Rewriting that could get complicated, I'll leave that alone for now. With regards to calculations of buffers:

65Mb max MINIX filesystem = 65535 blocks (Z-map), and usually 1/3 that many inodes (=21845 I-map). fsck has an option to specify the inode size, that might be another way to decrease buffers. Anyways, 65535 Z-map / 8 = 8191 = 8 buffers. 21845 I-map / 8 = 2731 = 3 buffers. So it seems only 11 buffers + superblock, but still too many! (Run fsck or mkfs to confirm inode count, this could be the discrepancy).
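
The arithmetic above can be captured in a tiny standalone calculation (illustrative only; BLOCK_SIZE and the rounded counts follow the numbers in this comment):

#include <stdio.h>

#define BLOCK_SIZE      1024
#define BITS_PER_BLOCK  (BLOCK_SIZE * 8)    /* 8192 map bits per 1K block */

/* number of 1K bitmap blocks needed to cover n objects, one bit each */
static unsigned int map_blocks(unsigned long n)
{
    return (unsigned int)((n + BITS_PER_BLOCK - 1) / BITS_PER_BLOCK);
}

int main(void)
{
    /* 65535 zones -> 8 Z-map blocks, 21845 inodes -> 3 I-map blocks,
     * plus the superblock = 12 pinned buffers; a larger inode count
     * would explain the 13 observed at mount time. */
    printf("zmap=%u imap=%u\n", map_blocks(65535UL), map_blocks(21845UL));
    return 0;
}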

All this said, what do you think is a viable solution: just increase NR_MAPBUFS to 13 for MINIX and FAT? Or discourage L1-only configuration and specify EXT_BUFFERS=10, which gives a total of 18 or 21 buffers and problem solved? Or another solution, like disallowing L1-only? (I kind of like that last option. It could be removed from make menuconfig but still kept in the source, for developers only).

@Mellvik
Owner Author

Mellvik commented Aug 26, 2023

I think there is very little overhead in configuring for L2 buffers (EXT only using CONFIG_FS_EXTERNAL_BUFFER=y) vs L1 only. The only real difference is the buffer_head struct is 6 bytes bigger and the map/unmap_buffer routines are included. The huge benefit, even for small systems, is having the buffers in main memory, out of the kernel data segment, where they don't compete for near heap space.

I don't think there is any benefit in an L1-only system, can you think of any?

I was giving that some thought before creating the issue, and there is only one scenario: Running the system on 256k RAM - which is a goal, not because it makes a 'generally useful' system, but because there will be scenarios with part of the RAM malfunctioning. IOW - a diagnostic setting which is within the ambition of TLVC.

Other than that, I agree with you - running L1 only is a recipe for disaster. Now, should 10 or 15k for some L2 buffers be a dealbreaker? Maybe not. I created the issue as a reminder, not wanting to fix this until all the other buffer enhancements are in place.

Then it's time to experiment with really constrained settings like this. Possibly - if deemed worthwhile - add some logic to release some of the bitmap buffers (keep, say, 3 or 4) and reread them when needed.

This is really not so much about the ability to boot a max sized Minix fs as to have sufficient headroom to boot from floppy, mount a HD image of some size and still have a few buffers for operation.

Regardless of whether a configuration is allowed with L1 buffers only, a potential permanent solution to this problem would be dynamically allocating L1 buffers from /bootopts. That would allow a smaller L1 set for normal users, to be increased when a large MINIX filesystem is desired. (I suppose the L1 buffers could potentially be automatically increased, but that's another level of complexity).

Yes, that's an idea. Possibly simpler than a 'page buffers in and out' mechanism as alluded to above (which would be compiled in only if configured L1-only, for example).

All this said, what do you think is a viable solution: just increase NR_MAPBUFS to 13 for MINIX and FAT? Or discourage L1-only configuration and specify EXT_BUFFERS=10, which gives a total of 18 or 21 buffers and problem solved? Or another solution, like disallowing L1-only? (I kind of like that last option. It could be removed from make menuconfig but still kept in the source, for developers only).

I'll check out the 13 buffers option; it will be interesting to see how the system behaves. I suspect it will still be dead since there are no buffers left for general operation. Then increase one by one to find the practical limit.

A fitting tag would be 'fun with old computers' ... I'll make sure to put some practical advice on buffers into the configuration Wiki. Maybe even write up something (in technical notes) on how the TLVC/ELKS buffer system works while it's fresh in memory.

@ghaerr

ghaerr commented Aug 26, 2023

Hello @Mellvik,

Given your comments and some research, I have come to a different conclusion and a sense of new direction on how kernel buffering should be handled: that is, kernel developers cannot guess how a given system might be used, and it is best to allow multiple configuration options and move towards dynamic allocation of all resources (through configuration as well as through /bootopts) to allow rapid tuning and feedback of the system for a particular use.

For instance, I can think of the following possible uses (some more likely for ELKS, others for TLVC):

  • Normal, average use distribution, medium buffering 32-64 buffers
  • System with large MINIX HDs mounted at once, possibly requiring 12-13 buffers each
  • System doing lots of file writing, needs lots of L1 buffers for I- and Z-maps for new inodes and file blocks
  • System doing lots of file reading, minimal new files, no need for lots of L1 buffers
  • PC w/256k RAM, minimal buffering, used for very old machines
  • ROM-based systems with no need for MINIX filesystem, L1 buffers only, no need for map/unmap_buffer code
  • etc

I've looked into the MINIX bitmap code, and rewriting it to read and release a buffer for each I/Z-map access is very straightforward. The buffers are only used when creating new inodes or file blocks, so having them around on a system that's not creating new data is a huge waste of limited resources! If both L1 and L2 buffers are dynamically allocated, the buffer system should be able to be tuned for best performance of I/Z maps, and let the system do its job. The idea of requiring 12-13 buffers for a large MINIX mount, or even 3-4 each for two floppies, is frankly ridiculous, and you've pointed out the impossibility of mounting 2-3 large HD filesystems with the current architecture using L1 only.

In addition, both L1 and L2 buffers should be directly specifiable in make menuconfig and overridable from /bootopts. I propose using the following new /bootopts entries, which would override the configured defaults when possible:

buf=32      # number of L2 buffers
cache=12    # number of L1 buffers (same for MINIX or FAT)
xmsbuf=250  # number of XMS buffers if XMS available, otherwise uses buf=

Adding these options, along with dynamically allocating L1 buffers using a corresponding CONFIG_FS_NR_L1_BUFFERS would go a long way towards allowing all the above-described use cases to work well and provide users with the ability to tune the system. And yes, bdflush-style sync options would be next on the list.
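
As a rough sketch of what picking these up at boot could look like (the option names follow the proposal above; the parser shell and the defaults are assumptions, not the actual ELKS/TLVC /bootopts code):

#include <stdlib.h>
#include <string.h>

static unsigned int nr_ext_bufs = 32;   /* CONFIG default, overridden by buf=    */
static unsigned int nr_l1_bufs  = 12;   /* CONFIG default, overridden by cache=  */
static unsigned int nr_xms_bufs = 0;    /* CONFIG default, overridden by xmsbuf= */

/* returns 1 if the word was one of the proposed buffer options */
static int parse_buffer_option(const char *opt)
{
    if (!strncmp(opt, "buf=", 4))    { nr_ext_bufs = (unsigned)atoi(opt + 4); return 1; }
    if (!strncmp(opt, "cache=", 6))  { nr_l1_bufs  = (unsigned)atoi(opt + 6); return 1; }
    if (!strncmp(opt, "xmsbuf=", 7)) { nr_xms_bufs = (unsigned)atoi(opt + 7); return 1; }
    return 0;
}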

What do you think?

@ghaerr

ghaerr commented Aug 27, 2023

Also, I just realized that the "L1 buffers" aren't actually system buffers! That is, they have no distinct buffer heads associated with them - they are in fact each only 1K cache for a buffer head that's in L2. So when one configures the system for say, 10 L1 buffers and 20 EXT buffers, the system is only actually getting 20 buffers. That's 10k wasted and definitely misleading. (In L1-only configuration, then yes, there are buffer heads associated with these caches, but that's not the normal build case).

The good news is that my testing has shown we might be able to do with much fewer L1 buffers now, since file data doesn't require using them. I'm working on an idea that might allow for the "L1" cache to actually have buffer heads associated with it, but it gets a bit complicated guaranteeing copy-in/copy-out working. Making the L1 cache actually into buffers would mean that if configured for 10 L1 buffers and 20 L2 buffers, there would actually be 30 useable buffers in the system, IMO a large improvement.

I plan on adding dynamic allocation of L1 cache and EXT/XMS buffers from /bootopts shortly, which will allow one to much more easily play with different settings to see what the system really needs to boot up and run well.

@Mellvik
Owner Author

Mellvik commented Aug 27, 2023

Howdy @ghaerr,
this is turning into another really useful discussion, thanks. And as was the case when we discussed alignment issues the other day, I'm sure we'll land something reasonable at the end.

At times I find I have to restrain myself, not becoming too ambitious with TLVC. After all it's not going to be a multiuser system for 10 or 15 users (even though an old PC has more raw capacity than a 15 user PDP11/20 back in the day).

Also, like you say, the strategy and choices for TLVC will likely be different from ELKS.
Anyway, here's my take.

Given your comments and some research, I have come to a different conclusion and a sense of new direction on how kernel buffering should be handled: that is, kernel developers cannot guess how a given system might be used, and it is best to allow multiple configuration options and move towards dynamic allocation of all resources (through configuration as well as through /bootopts) to allow rapid tuning and feedback of the system for a particular use.

For instance, I can think of the following possible uses (some more likely for ELKS, others for TLVC):

  • Normal, average use distribution, medium buffering 32-64 buffers
  • System with large MINIX HDs mounted at once, possibly requiring 12-13 buffers each
  • System doing lots of file writing, needs lots of L1 buffers for I- and Z-maps for new inodes and file blocks
  • System doing lots of file reading, minimal new files, no need for lots of L1 buffers
  • PC w/256k RAM, minimal buffering, used for very old machines
  • ROM-based systems with no need for MINIX filesystem, L1 buffers only, no need for map/unmap_buffer code
  • etc

I've looked into the MINIX bitmap code, and rewriting it to read and release a buffer for each I/Z-map access is very straightforward. The buffers are only used when creating new inodes or file blocks, so having them around on a system that's not creating new data is a huge waste of limited resources! If both L1 and L2 buffers are dynamically allocated, the buffer system should be able to be tuned for best performance of I/Z maps, and let the system do its job. The idea of requiring 12-13 buffers for a large MINIX mount, or even 3-4 each for two floppies, is frankly ridiculous, and you've pointed out the impossibility of mounting 2-3 large HD filesystems with the current architecture using L1 only.

In addition, both L1 and L2 buffers should be directly specifiable in make menuconfig and overwritten by /bootopts. I propose using the following new /bootopts entries which would overwrite the configured defaults when possible:

buf=32      # number of L2 buffers
cache=12    # number of L1 buffers (same for MINIX or FAT)
xmsbuf=250  # number of XMS buffers if XMS available, otherwise uses buf=

Adding these options, along with dynamically allocating L1 buffers using a corresponding CONFIG_FS_NR_L1_BUFFERS would go a long way towards allowing all the above-described use cases to work well and provide users with the ability to tune the system. And yes, bdflush-style sync options would be next on the list.

What do you think?

I think the key is to create flexibility and avoid complexity. You list a number of interesting choices in that regard.

  • Reducing the # of 'mandatory' bitmap buffers is important. This can be made dynamic with a simple algorithm taking into account the number of mounts and the size of the file systems, but I think a simple template # is more than adequate. Such as a max # of 'locked' buffered blocks per fs. The max can be a percentage or a number (possibly settable in menuconfig), the latter seems better off the top of my head.
  • The ability to set the # of cache (L1) buffers like you say is important, a default in menuconfig, an override in bootopts. 'cache=' is a good name.
  • The L2 setting (both menuconfig and bootopts) works fine as it is IMHO. Since ext buffers and xms buffers are never concurrent, there is no confusion as to what is specified.
  • mount can fail softly by checking the size of the L1 cache and current mounts, with a recommendation about new /bootopts settings.

@Mellvik
Owner Author

Mellvik commented Aug 27, 2023

Also, I just realized that the "L1 buffers" aren't actually system buffers! That is, they have no distinct buffer heads associated with them - they are in fact each only 1K cache for a buffer head that's in L2. So when one configures the system for say, 10 L1 buffers and 20 EXT buffers, the system is only actually getting 20 buffers. That's 10k wasted and definitely misleading. (In L1-only configuration, then yes, there are buffer heads associated with these caches, but that's not the normal build case).

Yes, I discovered this when I first stumbled into this rabbithole - it was kind of confusing for a while. Now - armed with an entirely different understanding of the entire buffer system and how it works, I've come to see it mostly as a blessing. Being able to configure the system easily (no code complexity) to run on L1 only is fascinating and useful.

That said, I'm getting your point, which I'm reading as 'turn the cache into system buffers' and disconnect them from L2 (data buffers). That would either eliminate the entire mapping/unmapping scheme – or leave it as is and add a function to disconnect from L2 after transfer 'down' to L1. Presumably syncing from L1 w/o involving L2 is a natural. This is indeed interesting, in particular if it can help eliminate or simplify some code.

The good news is that my testing has shown we might be able to do with much fewer L1 buffers now, since file data doesn't require using them. I'm working on an idea that might allow for the "L1" cache to actually have buffer heads associated with it, but it gets a bit complicated guaranteeing copy-in/copy-out working. Making the L1 cache actually into buffers would mean that if configured for 10 L1 buffers and 20 L2 buffers, there would actually be 30 useable buffers in the system, IMO a large improvement.

Agreed. With the ability to watch the flow of buffers in the system with the new tools, this would be a rather interesting experiment.

I plan on adding dynamic allocation of L1 cache and EXT/XMS buffers from /bootopts shortly, which will allow one to much more easily play with different settings to see what the system really needs to boot up and run well.

Yes - this is an important first step!

@Mellvik
Owner Author

Mellvik commented Aug 27, 2023

That said, I'm getting your point, which I'm reading as 'turn the cache into system buffers' and disconnect them from L2 (data buffers). That would either eliminate the entire mapping/unmapping scheme – or leave it as is and add a function to disconnect from L2 after transfer 'down' to L1. Presumably syncing from L1 w/o involving L2 is a natural. This is indeed interesting, in particular if it can help eliminate or simplify some code.

On 2nd thought, the mapping/unmapping will be required for XMS buffers anyway, so getting rid of it is not an option. And interestingly, the disconnect is what we discovered was happening before, when data blocks came back into L2 in a different buffer. Although not in a controllable manner.

Still, I think the disconnect and turning the L1 cache into system buffers is an interesting idea.

@ghaerr

ghaerr commented Aug 27, 2023

That said, I'm getting your point, which I'm reading like 'turn the cache into system buffers' and disconnect them from L2 (data buffers).

Yes, almost - it wouldn't be that L1 is for metadata and L2 for data, they would all just be system buffers, with the distinction that the "L1" buffers data happens to be stored in the kernel data segment.

That would either eliminate the entire mapping/unmapping scheme – or leave it as is and add a function to disconnect from L2 after transfer 'down' to L1. Presumably syncing from L1 w/o involving L2 is a natural.

Yes, this is the harder part - determining exactly how to kick out a buffer from L1 to somewhere (it'd have to go to L2). They'd all be buffers and there's no "cache" so the current "copy-in/copy-out" isn't quite enough. I haven't come up with a solid design yet, but the idea would be to call get_free_buffer within map_buffer to get an empty L2 buffer into which to transfer the current L1 contents, and then use that same buffer to transfer the L2 buffer into L1. Given the complexity and possible bugs, I'm going to leave all this until later, after dynamic L1 cache is implemented, so we can measure the usefulness of it all.
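
A very rough sketch of that idea, just to make the data flow concrete (all helper names here are hypothetical; get_free_buffer is the routine quoted later in this thread):

/* Design sketch only -- not working code. The aim: when every L1 slot is
 * occupied and map_buffer() needs to bring in another block, push one
 * current L1 occupant out to a free L2 buffer, then reuse its L1 slot. */
static void l1_evict_and_map(struct buffer_head *incoming, int slot)
{
    struct buffer_head *victim = l1_slot_owner(slot);   /* hypothetical lookup  */
    struct buffer_head *spare  = get_free_buffer();     /* empty L2 buffer      */

    copy_l1_to_l2(victim, spare);      /* hypothetical: old contents move to L2 */
    release_l1_slot(victim, slot);     /* hypothetical: victim now lives in L2  */
    copy_l2_to_l1(incoming, slot);     /* requested block copied into the slot  */
}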

On 2nd thought, the mapping/unmapping will be required for XMS buffers anyway, so getting rid of it is not an option. And interestingly, the disconnect is what we discovered was happening before, when data blocks came back into L2 in a different buffer.

Yep. The copy-in/out still needs to happen, but it'd be done from buffer-to-buffer, rather than cache-to-buffer, along with swapping the buffer heads as well.

(even though an old PC has more raw capacity than a 15 user PDP11/20 back in the day).

Is that true? That's amazing! My first job in college was working at a lab that had a PDP 11/45. It had an MMU which allowed 128K programs, similar to what we have now with medium model.

Reducing the # of 'mandatory' bitmap buffers is important. This can be made dynamic with a simple algorithm taking into account the number of mounts and the size of the file systems, but I think a simple template # is more than adequate.

I plan on removing the Z/I map buffers being kept in memory at all times shortly. However, it turns out to be much more complicated to manage a number of deemed "mandatory" buffers, so what will happen is that each buffer will be requested when it is needed, and the LRU algorithm should keep the most active buffers in real memory. If not, a read of the Z/I map will be done. This also keeps things simple, which I agree with.

mount can fail softly by checking the size of the L1 cache and current mounts, with a recommendation about new /bootopts settings.

What do you mean here? Does mount need an enhancement? When the requirement for RAM-based Z/I maps is removed, mount will only need one buffer per MINIX filesystem to mount it.

The L2 setting (both menuconfig and bootopts) works fine as it is IMHO. Since ext buffers and xms buffers are never concurrent, there is no confusion as to what is specified.

Actually I found that's not true - the current scheme using bufs= specifies either EXT or XMS, and if XMS isn't available on the machine, the EXT default can't be overridden from the configured default. For instance, bufs=2500 would end up using the configured, say 32 EXT buffers if no XMS; there's no way to specify EXT buffers on an XMS machine. PR ghaerr/elks#1672 changes that to use buf= and xmsbuf= to allow specifying them separately. Also, using xmsbuf=0 shuts down XMS and forces the use of EXT buffers.
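
Expressed as a small decision sketch (my reading of the description above, not the PR's actual code):

/* Interpretation of the buf=/xmsbuf= behavior described above, not the
 * PR's code: the two counts are now independent, and xmsbuf=0 disables
 * XMS buffering even when the hardware is present. */
static unsigned int nr_ext_bufs = 32;    /* from CONFIG or buf=    */
static unsigned int nr_xms_bufs = 250;   /* from CONFIG or xmsbuf= */

static unsigned int l2_buffer_count(int xms_available)
{
    if (xms_available && nr_xms_bufs)
        return nr_xms_bufs;     /* XMS pool sized by xmsbuf= */
    return nr_ext_bufs;         /* no XMS (or xmsbuf=0): EXT pool sized by buf= */
}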

@Mellvik
Owner Author

Mellvik commented Aug 27, 2023

(even though an old PC has more raw capacity than a 15 user PDP11/20 back in the day).

Is that true? That's amazing! My first job in college was working at a lab that had a PDP 11/45. It had an MMU which allowed 128K programs, similar to what we have now with medium model.

Well, it's a slight exaggeration because the architecture was different, but close enough. Also - easy to forget: The users were on 300 bps teletypes (yours working yet?), which is a very different interrupt load than the 57600bps times 3 I'm using on the 80286.

Reducing the # of 'mandatory' bitmap buffers is important. This can be made dynamic with a simple algorithm taking into account the number of mounts and the size of the file systems, but I think a simple template # is more than adequate.

I plan on removing the Z/I map buffers being kept in memory at all times shortly. However, it turns out to be much more complicated to manage a number of deemed "mandatory" buffers, so what will happen is that each buffer will be requested when it is needed, and the LRU algorithm should keep the most active buffers in real memory. If not, a read of the Z/I map will be done. This also keeps things simple, which I agree with.

mount can fail softly by checking the size of the L1 cache and current mounts, with a recommendation about new /bootopts settings.

What do you mean here? Does mount need an enhancement? When the requirement for RAM-based Z/I maps is removed, mount will only need one buffer per MINIX filesystem to mount it.

The idea is/was to have mount recheck whether the operation was actually possible. If the # of locked buffers per mount can be brought below 3, which as you say should be possible, there is no need.

The L2 setting (both menuconfig and bootopts) works fine as it is IMHO. Since ext buffers and xms buffers are never concurrent, there is no confusion as to what is specified.

Actually I found that's not true - the current scheme using bufs= specifies either EXT or XMS, and if XMS isn't available on the machine, the EXT default can't be overridden from the configured default. For instance, bufs=2500 would end up using the configured, say 32 EXT buffers if no XMS; there's no way to specify EXT buffers on an XMS machine. PR ghaerr/elks#1672 changes that to use buf= and xmsbuf= to allow specifying them separately. Also, using xmsbuf=0 shuts down XMS and forces the use of EXT buffers.

OK, I think we have a good picture now. I'm going to rewind to the original problem, the mount hang, and get the locked buffers down to whatever seems practical (one sounds reasonable), maybe tunable at the source level.
Second is to add the cache= bootopts option and make sure bufs= applies to external buffers if XMS isn't available - with 64 as the high limit. It seems to me there is only one rare case where an extra xmsbufs= option is useful, so it seems like overkill (if XMS is on and available, bufs= is XMS buffers, otherwise it's EXT buffers; if out of range, choose the highest available - simplicity rules).
Third and final is turning L1 into more or less regular buffers and enabling them to be involved in IO without the need to grab an L2 buffer header first.

By then, and even before that, we have a much improved system.

Thanks for the brainstorming session, it has been really useful!

@ghaerr

ghaerr commented Aug 28, 2023

@Mellvik: I've completed all tasks - dynamic L1 cache loading, 64k I/O overlap check, /bootopts configurable L1, EXT and XMS buffer counts, and no longer loading/locking the Z- and I-map buffers for FD or HD mounts.

Things are looking good and running well on QEMU. I have some more testing to do before committing the Z/I-map enhancement (ghaerr/elks#1675), mainly to do with seeing just how few L1 and L2 buffers the system now needs to boot and run.

I have it now running with 8 EXT buffers and 4k cache with a 65Mb MINIX filesystem mounted... and 6 buffers free after the mount! :)

The I/O test scripts are running well, but real-world testing on floppy is unknown. I've been meaning to add a delay to the bioshd driver to simulate real-world floppy delays.

@Mellvik
Owner Author

Mellvik commented Aug 28, 2023

Wow @ghaerr, I'm waking up to a breakfast table all set, ready to eat :-)

Seriously, this is great. I haven't looked yet, as I just completed 'importing' your trace tools and verified the buffer enhancements from these last weeks, but I'm looking forward to it.

As the last status before going for the fixes, here's what we're coming from (TLVC version):

# mount /dev/dhdb4 /mnt2      <--- 65M Minix fs (0522 is a small Minix fs)

  #    buf/bh   blk/dev  flgs L1map Mcnt Bcnt
  1:   6/e710     6/0200   U   L11    0   0
  2:  11/e788   519/0200   U   L08    0   0
  3:  10/e770    14/0200   U   L07    0   0
  4:   0/e680     4/0200   U   L10    0   0
  5:   7/e728    12/0200   U   L09    0   0
  6:   5/e6f8     9/0200 L U   L03    0   0
  7:   8/e740     3/0522   U   L00    0   1
  8:   9/e758     2/0522   U   L00    0   1
  9:   1/e698     1/0522   U   L00    0   1
 10:   4/e6e0     3/0200   U   L00    0   1
 11:   3/e6c8     2/0200   U   L12    0   1
 12:   2/e6b0     1/0200   U   L00    0   1
Total buffers inuse 7/12 (6 free)

panic: deadlock - no buffers available
SYSTEM HALTED - Press CTRL-ALT-DEL to reboot:ERdbaa|0200;

@ghaerr

ghaerr commented Aug 28, 2023

Seriously, this is great.

Thanks, it's been fun writing the code and deep diving into the buffer system with the issues you've brought up.

panic: deadlock - no buffers available

Cool, I was meaning to add a check for that myself, rather than hanging. Would you mind if I brought a few of your bug fixes and enhancements back to ELKS? For instance, I noticed your fix for /etc/passwd being opened/closed for every line of ps output, and the accept/wait problem in network daemons - nice work.

@ghaerr

ghaerr commented Aug 28, 2023

On another note, I'm wondering whether the super block needs to be cached for every mounted filesystem. I haven't checked the code too deeply yet, but IMO there's not much use in keeping it in RAM, when it's only written to once after boot or before reboot, to mark the filesystem dirty. With the super block cache removed, there'd be no hit (except NR_MOUNT) on resources for each filesystem mounted.

While doing the recent buffer work I traced multiple writes to the superblock setting the dirty bit after boot - it might be nice to get this down to a single write, or none: I'm wondering whether we really need to set the dirty bit just after boot if the filesystem isn't in fact ever written to. It might be nice to allow an effective read-only handling of the filesystem until a sector actually changes.

And regarding that: another issue at ELKS (ghaerr/elks#1619 (comment)) had me trace down that not only is the superblock being written on boot/reboot, but the inode for /dev/tty1 is also written every time (because login calls fchown on it, which updates the inode). Perhaps there's a way that can be limited as well, although I'm not sure yet exactly what the best method might be for that.

@Mellvik
Owner Author

Mellvik commented Aug 28, 2023

Seriously, this is great.

Thanks, it's been fun writing the code and deep diving into the buffer system with the issues you've brought up.

Hmm, yes - seems to have been a (really useful) habit for what, 6 years now? :-)

panic: deadlock - no buffers available

Cool, I was meaning to add a check for that myself, rather than hanging. Would you mind if I brought a few of your bug fixes and enhancements back to ELKS? For instance, I noticed your fix for /etc/passwd being opened/closed for every line of ps output, and the accept/wait problem in network daemons - nice work.

Hey this is open source. Do whatever and refer/attribute when suitable. Like I said, I just imported a bunch of your stuff, mostly related to what we've been working on, but then again there are other changes/bugfixes/updates that have happened since the fork and need to come along to get the pieces together. (Plus I imported the INITPROC stuff, which I haven't tested yet).

BTW don't take ps just yet, it was still too slow on an L1-only floppy system, so I rewrote the tty name lookup part too. PR coming. And I took your ps kernel sanity check code, which was smaller and better than mine.

While at it there are a couple of other things: add_buffer() (buffer.c) should probably be INITPROC, it's only called from buffer_init(). And in strace.c, check_kstack() is called w/o parameters from trace_begin() - I didn't think gcc would eat that ...

@Mellvik
Owner Author

Mellvik commented Aug 28, 2023

On another note, I'm wondering whether the super block needs to be cached for every mounted filesystem. I haven't checked the code too deeply yet, but IMO there's not much use in keeping it in RAM, when it's only written to once after boot or before reboot, to mark the filesystem dirty. With the super block cache removed, there'd be no hit (except NR_MOUNT) on resources for each filesystem mounted.

While doing the recent buffer work I traced multiple writes to the superblock setting the dirty bit after boot - it might be nice to get this down to a single write, or none: I'm wondering whether we really need to set the dirty bit just after boot if the filesystem isn't in fact ever written to. It might be nice to allow an effective read-only handling of the filesystem until a sector actually changes.

And regarding that: another issue at ELKS (ghaerr/elks#1619 (comment)) had me trace down that not only is the superblock being written on boot/reboot, but the inode for /dev/tty1 is also written every time (because login opens it for read/write, and thus the access time is being updated). Perhaps there's a way that can be limited as well, although I'm not sure yet exactly what the best method might be for that.

I've been pondering that one (no research like you have done), and came to the conclusion that one, possibly even 2 blocks locked per mount is a reasonable cost (and an incredible improvement from today). The thing is, if the system is really resource constrained, there will be few if any mounts beyond root. Or the opposite angle: If you need to mount 4 or more filesystems, you probably have the RAM for it. So I'm ok with the SB locked in ram, at least for now.

@ghaerr

ghaerr commented Aug 28, 2023

add_buffer() (buffer.c) should probably be INITPROC, it's only called from buffer_init().

Agreed, thank you.

And in strace.c, check_kstack() is called w/o parameters from trace_begin() - I didn't think gcc would eat that

Nothing like a code review! Wow, total bug: the problem is:

#ifdef CHECK_KCHECK // <--- should be CHECK_KSTACK
    if (tracing & TRACE_KSTACK)
        check_kstack();
#endif

so the reason gcc didn't complain is that it never compiled it! And the kernel stack is not being checked on syscall entry!!
Nice catch!! [EDIT: On closer inspection, all the above code can be deleted. The kernel stack is filled with 0x55 on syscall entry and is only checkable on syscall exit. So error checking is accurate, the above code appears to have been forgotten about during one of my rewrites.]

Hey this is open source. Do whatever and refer/attribute when suitable.

And far better when multiple eyes are looking at it, and using it. I probably never would have found the check_kstack bug you just mentioned! Fix coming.

I'll mention your name when committing any of your code, if for no other reason than to know who to ask questions :)

@ghaerr

ghaerr commented Aug 28, 2023

came to the conclusion that one, possibly even 2 blocks locked per mount is a reasonable cost

The latest enhancement brings the number of buffers required in memory per mount to 1, ever, and that's only for MINIX. FAT doesn't require any.

IMO the kind of stupid thing about never releasing the super block buffer is that after mount, its contents are NEVER referred to directly again - all the I- and Z-map bitmaps are copied and kept in a kernel struct super_block.

The net result is that a valuable system buffer is unusable during the entire period of filesystem I/O activity. The super_block is then updated again (but only changing the super block dirty bit; no other contents ever change) on unmount.

The thing is, if the system is really resource constrained, there will be few if any mounts beyond root. Or the opposite angle: If you need to mount 4 or more filesystems, you probably have the RAM for it.

Actually, with the new enhancements in place, there just isn't any extra (or less) RAM required for a single mount versus 5 mounts - the buffers can all be reused, except for the SB. We have eliminated all the prior resource constraints that used to come with mounting.

What do you think about the other issue(s) I mentioned with the super block being written to immediately after boot, etc? IIRC you posted somewhere about some issues you'd seen about the current behavior that could be improved.

Given how deeply we've dived into the buffer system at this point, I'm going to tackle the super block buffer issue(s) next.

@ghaerr

ghaerr commented Aug 29, 2023

I'm going to tackle the super block buffer issue(s) next.

PR ghaerr/elks#1676 removes a dedicated super block buffer for any mounted filesystem, thus allowing any number of mounted filesystems without any buffer resource usage.

It also fixes a double-write of the superblock directly after boot and doesn't wait for I/O to complete, so floppy boots should be faster. The enhancement also doesn't write the superblock at all if the checked bit isn't changing, which should help speed up some boots and reboots as well. This should cover most of the super block issues previously reported.

Things are a bit tricky when speeding up floppy boots, because a single sector write clears the track cache (perhaps a separate DMASEG buffer should be used?). So the above fixes should work well, but may still need tuning depending on the state of the filesystem checked flag, the super block being written, and other queued disk reads for an upcoming exec /bin/sh, for instance.

@Mellvik
Owner Author

Mellvik commented Aug 29, 2023

so the reason gcc didn't complain is that it never compiled it! And the kernel stack is not being checked on syscall entry!! Nice catch!! [EDIT: On closer inspection, all the above code can be deleted. The kernel stack is filled with 0x55 on syscall entry and is only checkable on syscall exit. So error checking is accurate, the above code appears to have been forgotten about during one of my rewrites.]

Well, that's reassuring! I was really scratching my head about the purpose of this check_kstack() call - unsuccessfully :-)

Hey this is open source. Do whatever and refer/attribute when suitable.

And far better when multiple eyes are looking at it, and using it.

Indeed!

@Mellvik
Owner Author

Mellvik commented Aug 29, 2023

Again, this is great. Looking at the code it seems

I'm going to tackle the super block buffer issue(s) next.

PR ghaerr/elks#1676 removes a dedicated super block buffer for any mounted filesystem, thus allowing any number of mounted filesystems without any buffer resource usage.

It also fixes a double-write of the superblock directly after boot and doesn't wait for I/O to complete, so floppy boots should be faster. The enhancement also doesn't write the superblock at all if the checked bit isn't changing, which should help speed up some boots and reboots as well. This should cover most of the super block issues previously reported.

Great - the double write has been on my list ever since this 'buffer project' started and it was glaring at me 400 times a day.

Things are a bit tricky when speeding up floppy boots, because a single sector write clears the track cache (perhaps a separate DMASEG buffer should be used?). So the above fixes should work well, but may still need tuning depending on the state of the filesystem checked flag, the super block being written, and other queued disk reads for an upcoming exec /bin/sh, for instance.

I don't think this is a problem. From looking at my traces, the SB read-then-write during boot coincides with more reads from the same track. Caveat: This is before the fixes that eliminate the locking of the bitmap buffers. The access pattern is (floppy) R1-R2-R3-R4-W1-W1-R12-R8-R960-R9-R14 where bl12 is the root dir and bl960 is the dev dir.

Will test this later today. Thank you.

@Mellvik
Owner Author

Mellvik commented Aug 29, 2023

BTW ---

And regarding that: another issue at ELKS (ghaerr/elks#1619 (comment)) had me trace down that not only is the superblock being written on boot/reboot, but the inode for /dev/tty1 is also written every time (because login opens it for read/write, and thus the access time is being updated). Perhaps there's a way that can be limited as well, although I'm not sure yet exactly what the best method might be for that.

The comment you referred to above says that block 12 is the /dev/tty1 inode - it's in fact the root directory on a 1440k floppy, as stated in my last comment. So the analysis becomes a bit different, but that should be history now.

@ghaerr

ghaerr commented Aug 29, 2023

The access pattern is (floppy) R1-R2-R3-R4-W1-W1-R12-R8-R960-R9-R14 where bl12 is the root dir and bl960 is the dev dir.

We're getting into the nitty-gritty here: just wondering if the floppy seek to R960 ends up slowing things down much - it seems /dev is created last so it's way out from track 0... this combined with your comment about ttyname having to be rewritten because it's too slow for ps makes me wonder whether we should consider moving /dev closer. I suppose in reality they wouldn't cache into the same cached track read, and ultimately /dev should be in a buffer somewhere. So perhaps all this just needs attention during the tuning process of how many buffers the kernel needs to operate well on floppy?

@Mellvik
Owner Author

Mellvik commented Aug 29, 2023

Agreed, this is becoming nitty-gritty level.
If it's easy to do - I know nothing about the cross tools used to create images - moving /dev closer to / would be valuable to reduce slow head movement during floppy boot.

Come to think of it, if anything could be made sticky in the buffer system, that should probably be the root directory block and its inode. Maybe something to ponder on a rainy day.

thanks!

@Mellvik
Owner Author

Mellvik commented Aug 29, 2023

Never mind the root inode, presumably in memory anyway. The disk block containing the inodes for the directories in root might be useful to keep around. Now that we have all these traces, it's easy to figure out how useful. I'll have a look.

@ghaerr

ghaerr commented Aug 29, 2023

The disk block containing the inodes for the directories in root might be useful to keep around.

Well, the whole idea of the LRU list is to keep frequently accessed stuff around. With the major enhancements to keep data out of L1 buffers, that might help, although lots of streaming data could still effectively flush all metadata buffers out.

Now that we have all these traces, it's easy to figure out how useful. I'll have a look.

Great. Perhaps better than figuring out exactly what to do with / or /dev, just getting a feel for how many L1 and L2 buffers are required to work well on floppy, first. After that, we might be able to "mark" buffers as metadata or root dir, etc, and have get_free_buffer skip them in the LRU list, but that'd probably require another use count field so that they don't get skipped forever and become permanently sticky.
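
A hypothetical sketch of that marking idea, based on the get_free_buffer() loop quoted later in this thread (b_meta and b_skips are invented fields, not existing ones):

/* Sketch only: metadata/root-dir buffers get a limited "skip budget" in the
 * LRU scan so they stay semi-sticky without becoming unreclaimable. */
while (ebh->b_count || ebh->b_dirty || ebh->b_locked
       || (ebh->b_meta && ebh->b_skips++ < MAX_META_SKIPS)) {
    if ((bh = ebh->b_next_lru) == NULL)
        bh = bh_lru;                /* wrap; the sync/release passes are omitted here */
    ebh = EBH(bh);
}
ebh->b_skips = 0;                   /* buffer chosen for reuse: reset its skip budget */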

@ghaerr

ghaerr commented Aug 30, 2023

just getting a feel for how many L1 and L2 buffers is required to work well on floppy

More results in from testing MINIX and FAT with variations of number of L1 and L2 buffers (using /dev/fd0 on QEMU, so results accurate but not real time):

Testing with EXT/L2 buffers = 10, L1 variable:
FAT L1=12: 16k maps, 56k remaps
FAT L1=8: 16k maps, 52k remaps
FAT L1=6: 17k maps, 55k remaps
MINIX L1=12: 6k maps, 21k remaps
MINIX L1=8: 5k maps, 21k remaps
MINIX L1=6: 9k maps, 18k remaps

Amazingly enough, the system performed slightly better with 8 L1 than 12 (this may be due to lack of L1 LRU, see below). With L1 = 6, FAT was about the same, but MINIX started doing more maps. At only 4 L1 buffers, MINIX required 12k maps, starting to get many more maps. I haven't measured the time required to memcpy 1k bytes, but 4k more copies of 1k bytes is 4 million bytes. Probably a bit slow on an 8088, but still faster than a floppy sector seek & read, right?

I then increased the number of L2 buffers from 10 to 20: the MINIX L1=8 maps went from 5k down to 4k. Very little difference. So L2 buffers are likely going to help with floppy speedup by keeping data around, while L1 is a different story, relating to whether metadata is needed and read vs create activity.

With the above results, I'm changing the L1 default in limits.h to 8 immediately. This has large ramifications, since that gives 4k more bytes in the kernel data segment, room for 4 more tasks with no other changes. That would increase usability a lot! Bottom line is that with the major L1/L2 enhancements recently made, old assumptions about the amount of kernel buffers needed need to be completely revisited.

the whole idea of the LRU list is to keep frequently accessed stuff around

The other thing I've realized is that the L1 cache is being searched sequentially when another map is needed; there is no notion of LRU, which is probably a problem for efficiency. We either need to build a second LRU list for L1 cache, or possibly consider moving the L1 cache into being buffer heads themselves, which could also work to increase throughput, although not directly.

@Mellvik
Owner Author

Mellvik commented Sep 2, 2023

The disk block containing the inodes for the directories in root might be useful to keep around.
Perhaps better than figuring out exactly what to do with / or /dev, just getting a feel for how many L1 and L2 buffers is required to work well on floppy, first. After that, we might be able to "mark" buffers as metadata or root dir, etc, and have get_free_buffer skip them in the LRU list, but that'd probably require another use count field in order that they don't get skipped forever and become permanently sticky.

Agreed. It seems to me this is more research than fixing – understanding what's going on when resources are tight. With few buffers, loading e.g. ash is going to purge everything, and the ability to make some items sticky could be useful. Now - like you say - what are the balance points? That depends on any number of factors, with new discoveries to be made more or less forever.

Anyway, I'll get back to experimenting with that when I've figured out what's going on with the buffers.

@Mellvik
Owner Author

Mellvik commented Sep 6, 2023

just getting a feel for how many L1 and L2 buffers is required to work well on floppy

More results in from testing MINIX and FAT with variations of number of L1 and L2 buffers (using /dev/fd0 on QEMU, so results accurate but not real time):

Testing with EXT/L2 buffers = 10, L1 variable:
FAT L1=12: 16k maps, 56k remaps
FAT L1=8: 16k maps, 52k remaps
FAT L1=6: 17k maps, 55k remaps
MINIX L1=12: 6k maps, 21k remaps
MINIX L1=8: 5k maps, 21k remaps
MINIX L1=6: 9k maps, 18k remaps

Amazingly enough, the system performed slightly better with 8 L1 than 12 (this may be due to lack of L1 LRU, see below). With L1 = 6, FAT was about the same, but MINIX started doing more maps. At only 4 L1 buffers, MINIX required 12k maps, starting to get many more maps. I haven't measured the time required to memcpy 1k bytes, but 4k more copies of 1k bytes is 4 million bytes. Probably a bit slow on an 8088, but still faster than a floppy sector seek & read, right?

I then increased the number of L2 buffers from 10 to 20: the MINIX L1=8 maps went from 5k down to 4k. Very little difference. So L2 buffers are likely going to help with floppy speedup by keeping data around, while L1 is a different story, relating to whether metadata is needed and read vs create activity.

With the above results, I'm changing the L1 default in limits.h to 8 immediately. This has large ramifications, since that gives 4k more bytes in the kernel data segment, room for 4 more tasks with no other changes. That would increase usability a lot! Bottom line is that with the major L1/L2 enhancements recently made, old assumptions about the amount of kernel buffers needed need to be completely revisited.

the whole idea of the LRU list is to keep frequently accessed stuff around

The other thing I've realized is that the L1 cache is being searched sequentially when another map is needed; there is no notion of LRU, which is probably a problem for efficiency. We either need to build a second LRU list for L1 cache, or possibly consider moving the L1 cache into being buffer heads themselves, which could also work to increase throughput, although not directly.

Thanks for the numbers and comments, @ghaerr - great starting point for some real hardware runs.

But first, the latest rabbit finally got caught - you'll like this one. Turns out the change of L1buf from static allocation to heap_alloc was the root cause: the allocated space was crossing a 64k physical memory boundary. With some really interesting practical consequences when run on an L1 (x12) only system, which effectively forced the issue.

What made the hunt unnecessarily long was the fact that it worked perfectly on QEMU, thus 64k alignment wasn't even on the list for investigation. Now we know: QEMU does the right wrong thing: makes the simulated 8237 do what you'd expect it to do, not what it actually does.

That out of the way, another one popped up when I reenabled L2 buffers after the rabbit hunt above: The superblock was mysteriously overwritten with garbage at mount time. Which turned out to be a missing sync between L1 and L2 when the superblock is to be written back to disk. I ended up checking for b_dirty in unmap_buffer() and doing an update right there - which works for now. I suspect there are better ways to do this:

        } else if (--ebh->b_mapcount == 0) {
            if (ebh->b_dirty) {
                printk("unmapping dirty buffer bl:%lu bh:%04x\n", ebh->b_blocknr, bh);
                brelseL1(bh, 1);
            }   
            debug("unmap: %d\n", buf_num(bh));
            wake_up(&L1wait);
        } else

This problem affected all Minix mounts (I didn't check FAT), not only the root mount, and not only on floppies - but for some reason the effects were different when using HD.

More in the upcoming PR notes.

@ghaerr

ghaerr commented Sep 6, 2023

the change of L1buf from static allocation to heap_alloc was the root cause: the allocated space was crossing a 64k physical memory boundary.

Good to know! I had previously added code to handle a 64k crossover in the BIOS driver, but in my kernel I never actually saw the boundary crossed.

QEMU does the right wrong thing: makes the simulated 8237 do what you'd expect it to do, not what it actually does.

I've sometimes wondered whether it would make sense to build a custom version of QEMU for things like this, but the last time I looked, QEMU was unbelievably large and complicated. So there's nothing like real hardware!

The superblock was mysteriously overwritten with garbage at mount time.

I've not ever seen this, and I've spent quite a bit of time testing the numerous L1/L2 changes as well as superblock update changes made recently. Do/did you have a definite way of reproducing this problem? I'd like to try it.

Which turned out to be a missing sync between L1 and L2 when the superblock is to be written back to disk.

This is all supposed to be handled in the sync_buffers and get_free_buffer code in two ways: 1) in sync_buffers, all dirty buffers are written from where they are (either L1 or L2). (This was the source of the original issue you brought up weeks ago: how did I/O start from L1, and it's OK to have L1 I/O now). And 2) separately in get_free_buffer, L1 buffers are "released" back into L2 in the second pass, to reduce the "stickiness" of L1. This latter issue should not affect the issue you were seeing, and won't start I/O on a buffer.

I mention all this because it sounds to me like perhaps not all of the L1/L2 buffer I/O and release code has been copied over correctly. I will post what I think are working versions at the end of this post for comparison.

Which turned out to be a missing sync between L1 and L2 when the superblock is to be written back to disk.

The current design does not require an L1 buffer to be written to L2: it can stay in L1 forever (if not released as described above) and have I/O performed on it as well.

I ended up checking for b_dirty in unmap_buffer() and do an update right there - which works for now. I suspect there are better ways to do this

IMO, that will cause lots of L1<->L2 "buffer swapping", so it would be worth figuring out where the real problem is. I am wondering if this problem started just after one of my changes to the superblock code, or earlier. Of course, our repos are not in sync so it's hard to tell.

Here's my copy of sync and get buffer, for quick reference. Almost all the work is done in these routines, as far as starting I/O:

static void sync_buffers(kdev_t dev, int wait)
{
    struct buffer_head *bh = bh_lru;
    ext_buffer_head *ebh;
    int count = 0;

    debug_blk("sync_buffers dev %p wait %d\n", dev, wait);
    do {
        ebh = EBH(bh);

        /*
         *      Skip clean buffers.
         */
        if ((dev && (ebh->b_dev != dev)) || !ebh->b_dirty)
           continue;

        /*
         *      Locked buffers..
         *
         *      If buffer is locked; skip it unless wait is requested
         *      AND pass > 0.
         */
        if (ebh->b_locked) {
            debug_blk("SYNC: dev %p buf %d block %ld LOCKED mapped %d skipped %d data %04x\n",
                ebh->b_dev, buf_num(bh), ebh->b_blocknr, ebh->b_mapcount, !wait,
                bh->b_data);
            if (!wait) continue;
            wait_on_buffer(bh);
        }

        /*
         *      Do the stuff
         */
        debug_blk("sync: dev %p write buf %d block %ld count %d dirty %d\n",
            ebh->b_dev, buf_num(bh), ebh->b_blocknr, ebh->b_count, ebh->b_dirty);
        ebh->b_count++;
        ll_rw_blk(WRITE, bh);
        ebh->b_count--;
        count++;
    } while ((bh = ebh->b_next_lru) != NULL);
    debug_blk("SYNC_BUFFERS END %d wrote %d\n", wait, count);
}
static struct buffer_head *get_free_buffer(void)
{
    struct buffer_head *bh = bh_lru;
    ext_buffer_head *ebh = EBH(bh);
    int i;
    int pass = 0;

    while (ebh->b_count || ebh->b_dirty || ebh->b_locked
#if defined(CONFIG_FS_EXTERNAL_BUFFER) || defined(CONFIG_FS_XMS_BUFFER)
           || bh->b_data
#endif
                                                        ) {
        if ((bh = ebh->b_next_lru) == NULL) {
            sync_buffers(0, 0);
            if (++pass > 1) { // <--- this is an optimization that helps keep L1 buffers for longer unless no buffers found
                for (i=0; i<nr_map_bufs; i++) {
                    brelseL1_index(i, 1);   /* release if not mapcount or locked */
                }
            }
            bh = bh_lru;
        }
        ebh = EBH(bh);
    }
#ifdef CHECK_BLOCKIO
    if (ebh->b_mapcount) panic("get_free_buffer"); /* mapped buffer reallocated */
#endif
    put_last_lru(bh);
    ebh->b_uptodate = 0;
    ebh->b_count = 1;
    SET_COUNT(ebh);
    return bh;
}

Thank you for your report!

@Mellvik
Owner Author

Mellvik commented Sep 6, 2023

Well, that didn't work, more investigation needed. Seemingly the problem must be fixed in the superblock code. In the meanwhile, this works (in sync_buffers):

        if (ebh->b_blocknr == 1) {
                printk("BL1: bh%04x S%lx B%04x\n", bh, (unsigned long)buffer_seg(bh), buffer_data(bh));
                if (bh->b_data) {
                        xms_fmemcpyw(ebh->b_L2data, ebh->b_L2seg, bh->b_data, kernel_ds, BLOCK_SIZE/2);
                        bh->b_data = 0;
                }
        }
        ebh->b_count++;
        ll_rw_blk(WRITE, bh);
        ebh->b_count--;
    } while ((bh = ebh->b_next_lru) != NULL);

@ghaerr

ghaerr commented Sep 6, 2023

Well, that didn't work, more investigation needed. Seemingly the problem must be fixed in the superblock code.

Sounds like the problem is that I/O from an L1 buffer isn't working... there's been a ton of changes made. I remember that the buffer_seg function had to be changed. Did you change yours to this:

ramdesc_t buffer_seg(struct buffer_head *bh)
{
    return (bh->b_data? kernel_ds: EBH(bh)->b_L2seg);
}

Perhaps a diff of buffers.c between your repo and mine is in order. I recently retabbed everything using vi's ":%retab" command, FYI.

In the meanwhile, this works (in sync_buffers):

That change looks like L1 I/O isn't working... the above buffer_seg change was introduced when I made the change to reduce the buffer_head struct size by 6 bytes. Thus, bh->b_data must be used to determine the buffer segment as well.

@Mellvik

Mellvik commented Sep 6, 2023

Well, that didn't work, more investigation needed. Seemingly the problem must be fixed in the superblock code.

Sounds like the problem is that I/O from an L1 buffer isn't working... there's been a ton of changes made. I remember that the buffer_seg function had to be changed. Did you change yours to this:

ramdesc_t buffer_seg(struct buffer_head *bh)
{
    return (bh->b_data? kernel_ds: EBH(bh)->b_L2seg);
}

Perhaps a diff of buffers.c between your repo and mine is in order. I recently retabbed everything using vi's ":%retab" command, FYI.

In the meanwhile, this works (in sync_buffers):

That change looks like L1 I/O isn't working... the above buffer_seg change was introduced when I made the change to reduce the buffer_head struct size by 6 bytes. Thus, bh->b_data must be used to determine the buffer segment as well.

This must be it. I have no I/O from/to L1, and what happens when the SB gets overwritten is that the buffer address is L1 while the segment is L2.

Thank you, that's energy for a good start in the morning. :-)

@Mellvik

Mellvik commented Sep 7, 2023

QEMU does the right thing - which is the wrong thing: it makes the simulated 8237 do what you'd expect it to do, not what the real chip actually does.

I've sometimes wondered whether it would make sense to build a custom version of QEMU for things like this, but the last time I looked, QEMU was unbelievably large and complicated. So there's nothing like real hardware!

So true - on both counts. I actually did the same: stuck my head into QEMU, shook my head and retreated - in an entirely different setting. I was trying to get Venix to work with QEMU. Never happened.

Apropos DMA and the 8237: I just learned something new about the AT DMA configuration, which has two DMA controllers. First, the page register is wider on the AT, so it can actually address memory up to the 16MB limit. Second, the second DMA controller uses address lines A1-A16 instead of A0-A15, which is a really smart trick: word transfers only, but using the full width of the data bus and enabling transfers of up to 128KB (64K words) in one operation.
Not likely we'll ever use it, but good to know. And like you said, there's nothing like real hardware.
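
To make the address split concrete, here is a small, untested sketch of the arithmetic only - no port programming, and the struct/function names are made up: a 16-bit channel gets the word address (A1-A16) in the 8237's address register, A17-A23 in the channel's page register, and words-minus-one in the count register, and the transfer must stay inside one 128KB page.

/* Illustrative arithmetic for an AT 16-bit DMA channel (untested sketch).
 * phys is a 24-bit physical address, nbytes the transfer length in bytes. */
struct dma16_setup {
    unsigned int word_addr;     /* A1-A16, goes into the 8237 address register */
    unsigned char page;         /* A17-A23, goes into the channel's page register */
    unsigned int word_count;    /* words minus one, goes into the count register */
};

static int dma16_calc(unsigned long phys, unsigned long nbytes, struct dma16_setup *s)
{
    if (nbytes == 0 || (nbytes & 1) || (phys & 1))
        return -1;                          /* word transfers only */
    if ((phys >> 17) != ((phys + nbytes - 1) >> 17))
        return -1;                          /* must not cross a 128KB page */
    s->word_addr  = (unsigned int)((phys >> 1) & 0xFFFF);
    s->page       = (unsigned char)((phys >> 16) & 0xFE);  /* bit 0 unused on 16-bit channels */
    s->word_count = (unsigned int)(nbytes / 2 - 1);
    return 0;
}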

[Screenshot attached: 2023-09-07 10:50]

@Mellvik

Mellvik commented Sep 8, 2023

Sounds like the problem is that I/O from an L1 buffer isn't working... there's been a ton of changes made. I remember that the buffer_seg function had to be changed. Did you change yours to this:

ramdesc_t buffer_seg(struct buffer_head *bh)
{
    return (bh->b_data? kernel_ds: EBH(bh)->b_L2seg);
}

As suspected, this was the culprit (it was missing).
Now that it's working and L1 buffers can participate in I/O (the debug scaffolding is still in place), the only case in which L1 I/O actually happens is when updating the SB after a Minix FS mount. Are you seeing the same?

Perhaps a diff of buffers.c between your repo and mine is in order. I recently retabbed everything using vi's ":%retab" command, FYI.

Thanks, and no, I hadn't used the retab command before - a good one to know!

buffers.c is growing apart between ELKS and TLVC, but a diff is still useful! Like I mentioned before, I added a sleep_on in get_free_buffer to avoid all the looping. Running out of buffers happens all the time when running off of (async) floppies, just like running out of REQs.

Thanks!

@ghaerr

ghaerr commented Sep 8, 2023

I added a sleep_on in get_free_buffer to avoid all the looping. It happens all the time when running off of (async) floppies, just like running out of REQs.

Good to know! Can you post that here, or at least the gist of it? I've only seen the problem when the kernel is configured for too few buffers, but I now see that an application that, say, just writes lots of data will also cause it. I'm interested in the solution you've taken and whether it's similar to the L1 buffer wait on &L1wait. Right now I'm only able to test async I/O on the SSD driver, and it's limited to 96k, which may not trigger the problem, depending on buffer configuration.

@Mellvik

Mellvik commented Sep 8, 2023

Yes, I think it's similar.

Here's the main loop - more than half of the code still debug stuff.

    while (ebh->b_count || ebh->b_dirty || ebh->b_locked
#if defined(CONFIG_FS_EXTERNAL_BUFFER) || defined(CONFIG_FS_XMS_BUFFER)
                || bh->b_data
#endif
                                                                ) {
        if ((bh = ebh->b_next_lru) == NULL) {
            unsigned long jif = jiffies;
            switch (sync_loop) {
            case 0:
                printk("DEBUG: no free bufs, syncing\n"); /* FIXME delete */
                sync_buffers(0,0);
                break;
        
#if defined(CONFIG_FS_EXTERNAL_BUFFER) || defined(CONFIG_FS_XMS_BUFFER)
            case 1:
                /*
                 * Skip if we're running purely on L1 buffers
                 * (in which case brelseL1_index() is a dummy anyway).
                 */
                printk("RelL1:");
                for (i=0; i<nr_mapbufs; i++)
                        brelseL1_index(i, 1);
                printk("\n");
                break;
#endif

            case 2:     /* EXPERIMENTAL: This may not make much sense */
                printk("SyncWait;");
                sync_buffers(0, 1);     /* sync w/ wait */
                break;

            default:
                /* we're out of buffers - which happens quite a bit when using floppy
                 * based file systems: Sleep instead of looping.
                 */
                sleep_on(&bufwait);
                printk("Bufwait %d [%lu] jiffies;", sync_loop, jiffies-jif);
            }
            bh = bh_lru;        /* start over */
            sync_loop++;
        }
        ebh = EBH(bh);
    }      

There is a wakeup in unlock_buffer:

void unlock_buffer(struct buffer_head *bh)
{   
    EBH(bh)->b_locked = 0;
    wake_up((struct wait_queue *)bh);   /* use bh as wait address*/
    wake_up(&bufwait);  /* If we ran out of buffers, get_free_buffer is waiting */
}

... and an experimental in brelse:

void brelse(struct buffer_head *bh)
{
    ext_buffer_head *ebh;

    if (!bh) return;
    wait_on_buffer(bh);
    ebh = EBH(bh);
#ifdef CHECK_BLOCKIO
    if (ebh->b_count == 0) panic("brelse");
#endif
    DCR_COUNT(ebh);
//#ifdef BLOAT_FS
    if (!ebh->b_count)
        wake_up(&bufwait);      /* EXPERIMENTAL, probably superfluous */
//#endif
}
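
For completeness: these snippets assume a file-scope wait queue that get_free_buffer() sleeps on and that unlock_buffer()/brelse() wake. A minimal declaration, assuming it follows the same convention as the other wait queues in buffer.c (like the L1wait queue mentioned above), would be something like:

/* Assumed supporting declaration, not shown in the snippets above */
static struct wait_queue bufwait;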

@ghaerr

ghaerr commented Sep 8, 2023

Interesting, thanks.

I'm wondering a number of things: it would seem at first thought that a wakeup would be best in brelse rather than unlock_buffer, since without a buffer actually being released (i.e. b_count == 0), get_free_buffer wouldn't be able to grab it anyway.

I'm also wondering what other processes might be able to run when there aren't any buffers left. Given that wake_up isn't free (it has to loop through all tasks, which in this implementation adds overhead to all I/O), perhaps just calling schedule() to reschedule another task (if in fact there are any) in get_free_buffer might accomplish the same thing.

I'll have to think about this more, and will put together a test script to test this on /dev/ssd. My current scripts might not actually exhaust the buffers so it will be interesting to see what happens. Thanks!
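
For comparison, here is a minimal, untested sketch of the schedule()-only variant, dropped into the same spot in get_free_buffer() where the LRU scan runs out; it uses nothing beyond what the earlier snippets already reference:

        if ((bh = ebh->b_next_lru) == NULL) {   /* scanned the whole LRU list, nothing free */
            sync_buffers(0, 0);                 /* start writeback of dirty buffers */
            schedule();                         /* yield so I/O completion and other tasks can run */
            bh = bh_lru;                        /* then rescan from the head of the list */
        }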

@Mellvik

Mellvik commented Sep 8, 2023

Interesting, thanks.

I'm wondering a number of things: it would seem at first thought that a wakeup would be best in brelse rather than unlock_buffer, since without a buffer actually being released (i.e. b_count == 0), get_free_buffer wouldn't be able to grab it anyway.

It's a good point and I wanted to test that, but ran into other snags and left it.

I'm also wondering what other processes might be able to run when there aren't any buffers left. Given that wake_up isn't free (it has to loop through all tasks, which in this implementation adds overhead to all I/O), perhaps just calling schedule() to reschedule another task (if in fact there are any) in get_free_buffer might accomplish the same thing.

Yes, that's another good point. My thinking is that there may be tasks doing other types of I/O (serial, ethernet, even parallel?) or CPU-bound processing waiting, making the sleep worth it. Even those will eventually need storage I/O, but this way they get a chance to do useful work first. Also, not wait-looping makes the system feel more responsive in most cases, to terminal I/O in particular.

@Mellvik

Mellvik commented Sep 13, 2023

I'm closing this issue - the discussion continues in a new thread.

@Mellvik Mellvik closed this as completed Sep 13, 2023