
MASSBUS & MSCP Disk Sharing #1116

Open
cmgauger opened this issue Jan 17, 2022 · 4 comments

Comments

@cmgauger
Contributor

This is entirely a feature request, not a bug report.

Would it be possible to implement some method of sharing MASSBUS (pdp11_rp/device RP) and MSCP (pdp11_rq/device RQ) disks among the PDP-11 and the various VAX simulators that support those devices? Real systems could, and did, share disks this way.

Context

There are several contexts in which systems could share disks, though only one of them applies to SIMH.

  1. The front-end processor of a KL10 which shared an RP06 with the KL10.
  2. Sharing disks between processors in a PDP-11/74.
  3. VAX/VMS clustering using local disks.

Item 3 is the only one applicable to SIMH.

@markpizz
Member

Hmmm...

Although the hardware supported concurrent connections through appropriate multi-port drive interfaces, I'm unaware of any mechanism other than VMS's cluster semantics for coordinating access to a disk drive from multiple disparate systems. If those mechanisms only existed to serve fail-over purposes rather than real concurrent access, and there was some active way to detect the appropriate failover conditions, it should just work in simh now. Any effort to share disks would be best advised to attach the disks in RAW mode. RAW mode avoids the C runtime library buffering that cross-simulated-system synchronization mechanisms would assume never happens.
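
For what it's worth, here is a minimal C sketch (not simh code; the file name and sector size are made up for illustration) of why that buffering matters when two processes share the same disk container: data written through stdio can sit in a user-space buffer, while a write through the raw file descriptor is handed to the OS immediately.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char sector[512];
    memset(sector, 0xA5, sizeof(sector));

    /* Buffered path: the data can linger in the stdio buffer until
       fflush()/fclose(), so another process reading the shared file
       may see stale contents in the meantime. */
    FILE *f = fopen("shared_disk.img", "r+b");
    if (f != NULL) {
        fwrite(sector, sizeof(sector), 1, f);  /* still in user space */
        fflush(f);                             /* now handed to the OS */
        fclose(f);
    }

    /* Unbuffered ("raw") path: pwrite() hands the data to the OS at
       once; O_SYNC additionally forces it to stable storage before
       the call returns. */
    int fd = open("shared_disk.img", O_RDWR | O_SYNC);
    if (fd >= 0) {
        pwrite(fd, sector, sizeof(sector), 0); /* sector 0 */
        close(fd);
    }
    return 0;
}
```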

The above will cover items 2 and 3 in your question.

Item 1 isn't really worth trying to address. The current model assumes that each simulator owns the data representation in the simulated disk file, which is analogous to the real hardware case of the drive being formatted for 512-byte sector access on the PDP-11 and formatted differently on the KL10. Those systems had some way of dealing with this in hardware which simh doesn't model.

@cmgauger
Contributor Author

Yes, items 1 and 2 in my list of original contexts do not apply to SIMH. Item 3 is the one that would be most useful for users. I shall have to test out attaching disks in RAW mode and get back to you on how it turns out.

@jwbrase

jwbrase commented May 10, 2022

> Although the hardware supported concurrent connections through appropriate multi-port drive interfaces, I'm unaware of any mechanism other than VMS's cluster semantics for coordinating access to a disk drive from multiple disparate systems.

Could you elaborate on that? From looking at the documentation, it seems that early on (pre 5.0-ish?) VMS only supported dual-ported disks over CI, even though clusters with single-ported disks were supported over ethernet from 4.6-ish (IIUC, basically no connecting two nodes directly to the same disk before 5.0 at the earliest).

The 5.0 documentation talks about dual ported disks via an HSC (requires CI), dual ported DSA (which I think means MSCP) disks in an ethernet configuration, and dual ported MASSBUS disks in an ethernet configuration.

For MASSBUS, it says explicitly that you can't use a MASSBUS disk as both a dual ported disk and a system disk.

For DSA, it says that a DSA disk can be online to only one controller at a time, the second system accesses it as served by the first system over MSCP, and then if the first system goes down the second brings up its own local connection to the disk. This (and the fact that this explicitly can't be done with MASSBUS disks) makes it sound like there's hardware on the disk itself (rather than the cluster nodes coordinating access explicitly over ethernet) that locks out the second path and allows a dual-ported DSA disk to be used as a system disk.

My takeaway (unless simh has features I'm unaware of) is that even for DSA/MSCP disks, you can probably share them between running instances as non-system disks, but not as system disks, on VMS 5.0-ish, because simh doesn't seem to have any way to simulate a disk that only allows itself to be online to one controller at a time. Is this correct?

I haven't read or found enough documentation yet to have a good handle on what things were like in later versions of VMS. Can you provide any insight on this? I presume things became more focused on Ethernet as the primary means of interconnection as time went on.

So from the above, a couple of feature requests:

Would it be possible to implement CI in the VAX simulators (presumably you'd have an "att ci" line that would specify IPs/hostnames and port numbers for other VAX simh instances, and it would just be implemented over IP)? Is the behavior of an HSC well enough specified by available documentation that a black-box simulator could be written for one without having to emulate the actual hardware and run the actual firmware? I presume finding and reverse engineering a working unit would be nigh impossible.

Aside from that, could some kind of locking mechanism be implemented to allow dual-ported system disks for DSA disk types?

I don't know how the logic that determined which controller the disk was online to worked; I presume there was something like a watchdog timer on the disk, the OS had to tell the controller to send a keepalive signal to reset the watchdog, and if the watchdog expired, the disk would put itself offline to that controller and allow the other controller to bring it online. The disk logic couldn't just be "is the controller still powered up?", because that wouldn't handle cases where the OS hung (and thus wasn't serving MSCP to the rest of the cluster) with the machine still powered on. It couldn't just be "if another controller asks me to come online, do it", because that wouldn't force the second machine to access the disk over the network while the first machine was still up, and we'd have the same restrictions as MASSBUS.

If my watchdog timer guess is accurate, then the locking mechanism might be something like this:

We add a -d switch to the attach command. The semantics of the switch are:

ATTACH -d <port number> <unit> <disk image> {<lockfile>}

Where <port number> can be 0 or 1. <lockfile> defaults to <disk image>.lock in the same directory if not set explicitly. In some situations, however, it might be desirable to have a non-default lockfile name (for instance, if <disk image> is the device file for an actual disk on the host machine, simh is probably not going to have access to the /dev directory to create the lockfile there, nor should it, even if it could).

When the guest system tells the controller to bring a dual-ported disk online, the simulator tries to create the file <lockfile>. If it succeeds, it writes <port number> into the lock file (for reasons to be explained later), then starts a watchdog timer. If the file already exists, it simulates whatever the controller is supposed to do when a disk refuses to be brought online. When the watchdog timer expires without being reset, the simh instance that created the lock file deletes it and emulates the disk going offline to the controller (or whatever happens hardware-wise when the watchdog expires).
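
As a rough sketch of that first step (hypothetical helper name, not simh code), the lock file could be created atomically so that only one instance can ever win the race:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: returns 0 if this instance now owns the lock,
   -1 if the lock file already exists (the disk is online elsewhere,
   or a stale lock was left behind by a crash). */
static int try_acquire_lock(const char *lockfile, int port)
{
    /* O_CREAT|O_EXCL makes creation atomic: exactly one instance can
       succeed, modelling the disk refusing to come online on a second
       port while it is online on the first. */
    int fd = open(lockfile, O_CREAT | O_EXCL | O_WRONLY, 0644);
    if (fd < 0)
        return -1;

    dprintf(fd, "%d\n", port);  /* record which port owns the disk */
    close(fd);
    return 0;                   /* caller would now start the watchdog */
}
```

On success the instance would start the watchdog timer; on failure it would simulate whatever the controller does when the drive refuses to come online.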

A simh instance with a running watchdog timer periodically checks the lock file to make sure that it still exists and contains the <port number> used to attach that disk to that instance, and emits a diagnostic like the following if it doesn't:

"simh error: Port number changed in or deleted! The simh instance on the other port for this disk image isn't respecting the locking mechanism. This is likely a bug in simh, and may cause data corruption. If you deleted manually, it is not a bug, please do not do this while simh is running as it may cause data corruption on (attached as )."

A simh instance that fails to bring a disk online due to an existing lock file checks the lock file to make sure that it contains the opposite port number to the one used to attach the disk to this instance. If the port number is different it does nothing more. If the port number in the lock file is the same, it emits a diagnostic like the following:

"simh warning: Another simh instance appears to be using the same port number to attach dual-ported. Either another instance is running using the same port number (only two instances may attach any given disk, with different port numbers), or the last time simh ran with attached on , it crashed without deleting .

If you are trying to attach a third simh instance to <disk image>, please stop this instance and allow the other two to run. If only two simh instances are to be attached to <disk image>, please verify that this instance is using a different port number than the other instance. If you are bringing simh back up after a crash, please delete <lockfile>. To ensure data integrity, neither instance will bring <disk image> online until <lockfile> is deleted (once it is deleted, one instance or the other will bring <disk image> online and re-create <lockfile>; please do not delete it a second time when this occurs)."

Of course, if the actual way the hardware worked was different, the required locking mechanism may be different than what I have proposed.
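
If the mechanism is roughly as described, the periodic check could look something like this (again a sketch with assumed names, not simh code); a failure return is where the two diagnostics above would be issued:

```c
#include <stdio.h>

/* Hypothetical periodic check, run while the watchdog timer is active.
   Returns 0 if the lock file still exists and records our port number,
   -1 if it has been deleted or rewritten by another instance. */
static int verify_lock(const char *lockfile, int my_port)
{
    FILE *f = fopen(lockfile, "r");
    if (f == NULL)
        return -1;                    /* lock file deleted out from under us */

    int recorded_port = -1;
    int ok = (fscanf(f, "%d", &recorded_port) == 1) &&
             (recorded_port == my_port);
    fclose(f);
    return ok ? 0 : -1;               /* mismatch: emit the error diagnostic */
}
```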

I've proposed restricting <port number> to 0 or 1 because the DEC documentation I've read doesn't talk about disks with more than two ports being a thing, but depending on how VMS behaves, <port number> could be allowed to be an arbitrary integer to permit ahistorical configurations. If you had a multi-ported disk, presumably the first system to boot would succeed in bringing the disk online on its controller, and the others would fail and look for the disk being served over MSCP on ethernet, just as with a two-port configuration. If the first system then went offline, the other systems would try to access it locally; the first of those to try would bring it online locally and the rest would fail. But what would happen then? Would they retry MSCP, or would they panic because their initial MSCP connection to the disk went down and they can't get it locally either?

@jwbrase

jwbrase commented May 12, 2022

Having done some more reading: the 5.4 clustering documentation says that a dual-ported disk can't be a system disk for an attached machine, for both MASSBUS and DSA disks. So, contrary to the impression the 5.0 documentation gave, it does not look like DSA disks had any special means of preventing two machines from both trying to access the disk locally, and my watchdog timer guess was totally bogus.

As late as 7.3 (the last VMS version for the VAX), the documentation basically indicates that the only way to have a disk be local to two VAXen and simultaneously usable as a system disk for both is to use CI or DSSI (system disks can be shadowed on 7.3, but apparently not over Ethernet). It would be nice to have some means of having an emulated CI or DSSI bus between multiple simh instances. It could be argued that this is somewhat academic because your cluster will only be as reliable as the host machine you're running it on, but if you have two host machines with a gluster volume replicated between them holding the disk images (and with a third machine running an arbiter/thin arbiter for gluster quorum), you could actually have an emulated cluster that will withstand the failure of either host machine.
