Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

I/O ERROR initializing disk in Apple II emulation #1218

Closed
ryandesign opened this issue Nov 27, 2023 · 7 comments · Fixed by #1239
Closed

I/O ERROR initializing disk in Apple II emulation #1218

ryandesign opened this issue Nov 27, 2023 · 7 comments · Fixed by #1239
Labels

Comments

@ryandesign
Copy link
Contributor

Boot an Apple II or II+ or IIe with the 1983-01-01 Apple DOS 3.3 System Master disk, then eject it by inserting a blank disk that was created with this command:

dd bs=1024 count=140 if=/dev/zero of=blank.dsk

At the BASIC prompt, type INIT HELLO and press Return.

In Open Emulator 1.1.1-202203110628 and Virtual ][ 11.4 this succeeds, returning you to the BASIC prompt and the disk has been initialized and can be used to boot the machine.

In Clock Signal 23.10.29, I/O ERROR occurs and only parts of the disk have been written and trying to boot it either hangs or crashes into the monitor. Curiously, what gets written seems to vary (by Apple II model? or randomly?).

This might be the same bug as #1211 but this reproduction recipe works on any Apple II model, not just the enhanced IIe, so it is not limited to or caused by the 65C02 CPU.

@ryandesign
Copy link
Contributor Author

ryandesign commented Nov 29, 2023

I added logging to show the head position:

--- a/Storage/Disk/Drive.cpp
+++ b/Storage/Disk/Drive.cpp
@@ -103,6 +103,8 @@ void Drive::step(HeadPosition offset) {
                track_ = nullptr;
        }
 
+       fprintf(stderr, "head position: %5.2f %*s|\n", head_position_.as_quarter() / 4.0f, head_position_.as_quarter(), "");
+
        // Allow a subclass to react, if desired.
        did_step(head_position_);
 }

With enhanced IIe, INIT fails immediately on track 0 every time due to #1211 which does now look like it's a separate problem.

With unenhanced IIe, which track it fails on is random. The amount of time it takes to initialize each track is also somewhat random. I'm surprised by the randomness; I would expect the same result each time. Once in a blue moon, the initialization "succeeds" but the disk that was created is not actually initialized correctly. Compared to a disk initialized by Virtual ][, several 256-byte sectors on the disk initialized by Clock Signal contain only nulls.

@ryandesign
Copy link
Contributor Author

The bytes coming in to the Disk II data latch should all be valid disk bytes: top bit set, maximum of two consecutive 0 bits, etc.

Examining the bytes as they come in (#1211 (comment)) while initializing a disk, I occasionally see some very invalid bytes:

ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff d5 aa 96 ff fe ae af aa af fb
fe de aa eb ff ff ff ff ff d5 aa ad 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
96 96 96 96 de aa eb ff a0 c0 a0 c0 a0 c0 a0 c0 a0 c0 04 c0 a0 c0 a0 c0 a0 c0 a0
c0 36 c0 a0 c0 a0 c0 a0 c0 a0 c0 a0 c0 a0 c0 a0 c0 2f c0 a0 c0 a0 c0 a0 c0 a0 c0
a0 c0 a0 c0 a0 c0 a0 c0 a0 c0 a0 c0 a0 c0 a0 c0 a0 c0 a0 c0 ff ff ff ff ff d5 aa
ad 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 de aa eb ff ff ff ff

In the top half of this output, we have a gap 3 (14-24 x ff), followed by an address field (d5 aa 96 ... de aa eb), followed by a gap 2 (5-10 x ff), followed by a data field (d5 aa ad ... de aa eb).

What should follow is another gap 3 and an address field, but instead we have just one ff followed by a bunch of invalid bytes (a0 c0 a0 c0 ... — the high bit is set on most of them but the 0 nibble is four consecutive 0 bits) — and then we get another gap 2 and a data field as if nothing was wrong. No wonder Storage::Encodings::AppleGCR::sectors_from_segment can't read that address field to decide where on the disk image to write it, which would explain why such sectors are blank in the disk image.

Now to find out how those invalid bytes are making their way into the data latch...

@ryandesign
Copy link
Contributor Author

ryandesign commented Nov 29, 2023

The random results have made me think this might be timing related. In addition to logging the byte in the data latch, I'm now logging the number of cycles the Disk II is being asked to run_for:

--- a/Components/DiskII/DiskII.cpp
+++ b/Components/DiskII/DiskII.cpp
@@ -82,6 +82,7 @@ void DiskII::run_for(const Cycles cycles) {
        if(preferred_clocking() == ClockingHint::Preference::None) return;
 
        auto integer_cycles = cycles.as_integral();
+       fprintf(stderr, "T%lld ", integer_cycles);
        while(integer_cycles--) {
                const int address = (state_ & 0xf0) | inputs_ | ((shift_register_&0x80) >> 6);
                if(flux_duration_) {
@@ -113,7 +114,7 @@ void DiskII::run_for(const Cycles cycles) {
                        break;
 
                        case 0xb:
-                       case 0xf:       shift_register_ = data_input_;                                                          break;  // load data register from data bus
+                       case 0xf:       shift_register_ = data_input_;  fprintf(stderr, "%02x ", data_input_);                  break;  // load data register from data bus
                }
 
                // Currently writing?

This produces an enormous amount of data that makes Xcode unhappy so I have to run it outside of Xcode and pipe stderr to a file, but what I'm seeing is that most of the time during initialization Disk II is run_for 2 cycles, but every so often I see it run_for 72 cycles. So far, I've always seen this 72-cycle run be where a sequence of bad a0 c0 a0 bytes starts:

T2 T2 T2 T2 ff T2 T2 T2 T2 T2 T2 T2 T2 T2 T2 T2 T2 T2 T8 T72 T2 a0 T2 T2 T2 T2
c0 T2 T2 T2 T2 T2 T2 T2 T2 T2 T2 T2 T2 T2 T2 T2 T2 T2 T2 T2 T2 T2 T2 T2 T2 T2 T2
T2 T2 T2 T2 T2 T2 T2 T2 T2 T2 a0 T2 T2 T2 T2 c0 T2 T2 T2 T2 T2 T2 T2 T2 T2 T2 T2

I know the Disk II needs constant input from the CPU at specific intervals. It looks to me like the Disk II is being asked to run for many cycles without the CPU running at the same time to provide data during those cycles.

I see plenty of 72-cycle runs elsewhere too but evidently they don't always cause a problem. The reason for these occasional long runs may be specific to my computer, which I see is using a lot of swap and compressed memory right now and is generally sluggish and not fun, but I feel that the emulator should be able to cope with that and not break down.

@TomHarte
Copy link
Owner

I can't offhand explain the cause of 72-cycle runs but the Disk II is implemented — bugs notwithstanding — so that the call frequency is reduced only when it is safe to do so.

In writing mode that is supposed to mean only if both drive motors are disabled. Clearly there's some sort of issue afoot with that. Modifying DiskII::decide_clocking_preference() to always set clocking_preference_ = ClockingHint::Preference::RealTime; and call update_clocking_observer(); would be a potential diagnostic around that.

But it's disjoint from host machine performance.

@ryandesign
Copy link
Contributor Author

ryandesign commented Nov 30, 2023

I think there are two separate problems here, because with your suggestion...

--- a/Components/DiskII/DiskII.cpp
+++ b/Components/DiskII/DiskII.cpp
@@ -141,6 +141,10 @@ void DiskII::run_for(const Cycles cycles) {
 }
 
 void DiskII::decide_clocking_preference() {
+	clocking_preference_ = ClockingHint::Preference::RealTime;
+	update_clocking_observer();
+	return;
+
 	ClockingHint::Preference prior_preference = clocking_preference_;
 
 	// If in read mode, clocking is either:

...a number of things are improved when initializing a disk:

  • run_for is always called every 2 cycles
  • I don't see invalid bytes in the data latch anymore
  • Each track after 0 now takes about the same time to initialize
  • Initialization doesn't fail with I/O ERROR anymore
  • I can PR#6 to reboot to the new disk successfully

However, if I close the emulator, which writes the disk to the disk image file, and look at the disk image file in a hex editor, I still see 256-byte sectors of nulls where there should (comparing with a good disk image) be data. And trying to boot the disk in a newly created emulator crashes into the monitor. So, in addition to the timing-related problem of getting data onto the disk in memory, there is a problem with getting the data from the disk in memory to the disk image.

@ryandesign
Copy link
Contributor Author

ryandesign commented Dec 1, 2023

I broke the second problem out into a separate issue: #1227

Reverting the above patch for DiskII::decide_clocking_preference and adding logging instead...

--- a/Components/DiskII/DiskII.cpp
+++ b/Components/DiskII/DiskII.cpp
@@ -142,12 +168,14 @@ void DiskII::run_for(const Cycles cycles) {
 
 void DiskII::decide_clocking_preference() {
 	ClockingHint::Preference prior_preference = clocking_preference_;
+	int path = 0;
 
 	// If in read mode, clocking is either:
 	//
 	//	just-in-time, if drives are running or the shift register has any 1s in it and shifting may occur, or a flux event hasn't yet passed; or
 	//	none, given that drives are not running, the shift register has already emptied or stopped and there's no flux about to be received.
 	if(!(inputs_ & ~input_flux)) {
+		path = 1;
 		const bool is_stuck_at_nop =
 			!flux_duration_ && state_machine_[(state_ & 0xf0) | inputs_ | ((shift_register_&0x80) >> 6)] == state_ && (state_ &0xf) == 0x8;
 
@@ -158,18 +186,23 @@ void DiskII::decide_clocking_preference() {
 
 	// If in writing mode, clocking is real time.
 	if(inputs_ & input_mode) {
+		path = 2;
 		clocking_preference_ = ClockingHint::Preference::RealTime;
 	}
 
 	// If in sense-write-protect mode, clocking is just-in-time if the shift register hasn't yet filled with the value that
 	// corresponds to the current write protect status. Otherwise it is none.
 	if((inputs_ & ~input_flux) == input_command) {
+		path = 3;
 		clocking_preference_ = (shift_register_ == (is_write_protected() ? 0xff : 0x00)) ? ClockingHint::Preference::None : ClockingHint::Preference::JustInTime;
 	}
 
 	// Announce a change if there was one.
 	if(prior_preference != clocking_preference_)
+	{
+		fprintf(stderr, "\nC%d=%s inputs_=%02x input_flux=%02x input_mode=%02x input_command=%02x\n", path, (clocking_preference_ == ClockingHint::Preference::RealTime ? "RealTime" : clocking_preference_ == ClockingHint::Preference::JustInTime ? "JustInTime" : "None"), inputs_, input_flux, input_mode, input_command);
 		update_clocking_observer();
+	}
 }
 
 bool DiskII::is_write_protected() {

...I get these results:

C1=JustInTime inputs_=01 input_flux=01 input_mode=08 input_command=04

C3=None inputs_=05 input_flux=01 input_mode=08 input_command=04

C2=RealTime inputs_=0d input_flux=01 input_mode=08 input_command=04
ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff d5 aa 96 ff fe ae aa af
af fe fb de aa eb
C1=JustInTime inputs_=01 input_flux=01 input_mode=08 input_command=04

C3=None inputs_=05 input_flux=01 input_mode=08 input_command=04

C2=RealTime inputs_=0d input_flux=01 input_mode=08 input_command=04
ff ff ff ff ff d5 aa ad 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
de aa eb ff
C1=JustInTime inputs_=01 input_flux=01 input_mode=08 input_command=04

head position:  8.50                                   |

head position:  8.75                                    |

head position:  9.00                                     |

C3=None inputs_=05 input_flux=01 input_mode=08 input_command=04

C2=RealTime inputs_=0d input_flux=01 input_mode=08 input_command=04
a0 c0 a0 c0 a0 c0 a0 c0 a0 c0 a0 c0 44 c0 a0 c0 a0 c0 a0 c0 a0 c0 1f c0 a0 c0 a0
c0 a0 c0 a0 c0 a0 c0 a0 c0 a0 c0 53 c0 a0 c0 a0 c0 a0 c0 a0 c0 07 c0 a0 c0 a0 c0
a0 c0 a0 c0 a0 c0 a0 c0 a0 c0 53 c0 a0 c0 a0 c0 a0 c0 a0 c0 07 c0 a0 c0 a0 c0 37
c0 a0 c0 a0 c0 a0 c0 a0 c0 a0 c0 a0 c0 a0 c0 09 c0 a0 c0 a0 c0 a0 c0 a0 c0 00 c0
a0 c0 a0 c0 a0 c0 a0 c0 a0 c0 a0 c0 a0 c0 04 c0 a0 c0 a0 c0 a0 c0 a0 c0 16 c0 a0
c0 a0 c0 a0 c0 a0 c0 a0 c0 a0 c0 a0 c0 2f c0 a0 c0 a0 c0 a0 c0 a0 c0 30 c0 a0 c0
a0 c0 a0 c0 a0 c0 a0 c0 a0 c0 a0 c0 60 c0 a0 c0 a0 c0 a0 c0 a0 c0 09 c0 a0 c0 a0
c0 a0 c0 a0 c0 a0 c0 a0 c0 a0 c0 60 c0 a0 c0 a0 c0 a0 c0 a0 c0 1b c0 a0 c0 a0 c0
a0 c0 a0 c0 a0 c0 a0 c0 a0 c0 02 c0 a0 c0 a0 c0 a0 c0 a0 c0 37 c0 a0 c0 a0 c0 a0
c0 a0 c0 a0 c0 a0 c0 a0 c0 02 c0 a0 c0 a0 c0 a0 c0 a0 c0 a0 c0 a0 c0 a0 c0 a0 c0
a0 c0 a0 c0 a0 c0 a0 c0 a0 c0 a0 c0 a0 c0 a0 c0
C1=JustInTime inputs_=01 input_flux=01 input_mode=08 input_command=04

C3=None inputs_=05 input_flux=01 input_mode=08 input_command=04

C2=RealTime inputs_=0d input_flux=01 input_mode=08 input_command=04
ff ff ff ff ff d5 aa ad 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
de aa eb ff
C1=JustInTime inputs_=01 input_flux=01 input_mode=08 input_command=04

C3=None inputs_=05 input_flux=01 input_mode=08 input_command=04

C2=RealTime inputs_=0d input_flux=01 input_mode=08 input_command=04
ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff d5 aa 96 ff fe ae ab aa
ab fb fe de aa eb
C1=JustInTime inputs_=01 input_flux=01 input_mode=08 input_command=04

C3=None inputs_=05 input_flux=01 input_mode=08 input_command=04

C2=RealTime inputs_=0d input_flux=01 input_mode=08 input_command=04

Is that enough to suggest what might be wrong?

ryandesign added a commit to ryandesign/CLK that referenced this issue Dec 4, 2023
Ensure we're actually in the sense-write-protect loop before deciding
that we don't need to process anymore.

Closes TomHarte#1218
@ryandesign
Copy link
Contributor Author

With further study of the code I see that the above log was not helpful. I now understand that input_flux, input_command, and input_mode are constants used to see which bits of input_ are set, so there was no need to log those constants. I had been concerned that there might be something wrong with one of the three conditionals but now that I understand how they work I agree they are correct.

The occasions when the Disk II is run_for more than two cycles are not a problem. The disk software only puts the disk into write mode immediately before writing and puts it back into read mode when it prepares the next thing to write. It's those brief read periods that run for more than two cycles.

The bad a0 c0 a0 ... data does occur due to a timing issue after all. Normally the CPU and state machine run independently which is fine for reading but they must be synchronized for writing. Synchronization is accomplished by the CPU enabling sense-write-protect mode. In that mode, all steps of the state machine are 0a: perform operation a (shift the shift register right, shifting in the write protect status) and go to step 0. That way, when we switch to write mode, we know the state machine is at step 0.

After at most eight state machine cycles of 0a, the shift register will contain either 00 if the disk is not write protected or ff if it is. Clock Signal tries to take a shortcut and stop further processing if we are in sense-write-protect mode and the shift register is either 00 or ff since the value will not change further:

// If in sense-write-protect mode, clocking is just-in-time if the shift register hasn't yet filled with the value that
// corresponds to the current write protect status. Otherwise it is none.
if((inputs_ & ~input_flux) == input_command) {
clocking_preference_ = (shift_register_ == (is_write_protected() ? 0xff : 0x00)) ? ClockingHint::Preference::None : ClockingHint::Preference::JustInTime;
}

The problem is that at the moment that the CPU enables sense-write-protect mode, the state machine could be at any step. Synchronization has not yet occurred. The state machine may yet be busy doing other work before it gets around to starting to shift in the write protect status. When the problem occurs, it is because the command that the state machine executed last as part of that other work was a0: perform operation 0 (clear the shift register) and go to step a. The CPU then requested sense-write-protect mode. Clock Signal then saw that we were in sense-write-protect mode and the shift register already contained 00, matching the disk's write protect status, so the state machine was not run and was not synchronized when we switched to write mode. It was at step a instead of 0, so it was grabbing the byte from the data bus 6 state machine cycles (3 CPU cycles) late, probably getting bytes of code instead of data. I imagine a0 is probably the LDY opcode and c0 is probably the high byte of an address.

The fix is to ensure that the state machine is actually performing operation a (or equivalently e) before deciding that we can take the shortcut.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants