Transactional-Execution Facility design / implementation #263

Closed
7 of 11 tasks
Fish-Git opened this issue Oct 15, 2019 · 211 comments
Labels
  • Discussion: Developers are invited to discuss a design change or solution to a coding problem.
  • Enhancement: This issue does not describe a problem but rather describes a suggested change or improvement.
  • M: Issue contains checklist of items, not all of which have been completed yet.
  • Missing: Support for the described architectural feature is currently missing and needs to be added.
  • (*MOVED*): The original issue was moved into a different issue.
  • Ongoing: Issue is long-term. Variant of IN PROGRESS: it's being worked on but maybe not at this exact moment.
  • Related: This issue is closely related to another issue. Consider this issue a "sub-issue" of the other.
  • TXF: Bug related to, or likely caused by, our current Transaction-Execution Facility implementation.

Comments

@Fish-Git
Member

Fish-Git commented Oct 15, 2019


NOTE: This issue has been closed and is now being continued in a NEW GitHub issue, #339: "Transactional-Execution Facility... (continued)"


I have been told the recently released z/OS 2.4 requires the availability of both the Transactional-Execution Facility and Constrained-Transactional-Execution Facility in order to successfully IPL:

Transactional-Execution enforcement

z/OS V2.4 provides enforcement to the effect that hardware Transactional-Execution, available on IBM Z servers since zEC12, is always available to programs running on z/OS V2.4 so that programs in such an environment can unconditionally assume and make use of Transactional-Execution.

https://www-01.ibm.com/common/ssi/ShowDoc.wss?docURL=/common/ssi/rep_ca/0/877/ENUSZP19-0410/index.html

This GitHub Issue is being created so that we can, together, discuss how best to implement this facility. Please offer your suggested approach/design as a GitHub comment reply to this issue.

I myself have a vague idea of how maybe it might be implemented, but I don't know if it will even fly nor especially how good it is. It's entirely possible (even likely!) that one of you might be able to come up with a better idea. (Please?)

Besides the description of the facility in the Principles of Operation manual, here are some additional papers I found on the subject to help get your creative juices flowing:

I've also assigned everyone to this issue because I really, really want everyone to contribute with their own thoughts/ideas on how is the best way to implement this facility in Hercules.

Thanks!


NOTE: This issue should be considered a specific sub-issue of issue #77 "MISSING Facilities support".


EDIT: The following items still remain to be done:

(Feel free to add additional items as needed)

  • FORMAL TESTING! We still need a comprehensive set of tests (preferably standalone runtests) to verify proper functionality of all aspects of TXF!  Right now we're simply using z/OS and z/VM and just presuming it's working correctly as long as both operating systems appear to function "normally", but of course that is not good enough!

  • 'txf' tracing

  • #define DEBUG tracing (TRACE macro)

  • PTT tracing (PTT_TXF ==> PTT( PTT_CL_TXF ...)

  • Constrained transactions constraint: 2. "All instructions in the transaction must be within 256 contiguous bytes of storage, including the TRANSACTION BEGIN (TBEGINC) and any TRANSACTION END instructions." (page 5-107)

  • Constrained transactions constraint: 4. "The transaction’s storage operands access no more than four octowords. Note: LOAD ON CONDITION and STORE ON CONDITION are considered to reference storage regardless of the condition code." (page 5-109)

  • Constrained transactions constraint: 5. "The transaction executing on this CPU, or stores by other CPUs or the channel subsystem, do not access storage operands in any 4 K-byte blocks that contain the 256 bytes of storage beginning with the TRANSACTION BEGIN (TBEGINC) instruction." (page 5-109)

  • Constrained transactions constraint: 7. "Operand references made by each instruction in the transaction must be within a single double-word, except that for LOAD ACCESS MULTIPLE, LOAD MULTIPLE, LOAD MULTIPLE HIGH, STORE ACCESS MULTIPLE, STORE MULTIPLE, and STORE MULTIPLE HIGH, operand references must be within a single octoword." (page 5-109)

  • PER as it relates to TXF. (pages 4-26++, 5-89++)

  • SIE as it relates to TXF.   (Refer to GitHub Issue Comment below for some details)

  • FPCR update on abort: "In addition to the diagnostic information saved in the TDB, when a transaction is aborted due to any data-exception program-exception condition and both the AFP-register control, bit 45 of control register 0, and the effective allow-floating-point-operation control (F) are one, the data-exception code (DXC) is placed into byte 2 of the floating-point control register (FPCR), regardless of whether filtering applies to the program-interruption condition." (page 5-97)

Note that constraints #4 and #7 seem to contradict one another. One says four octowords (4x32=128 bytes) whereas the other says a single double-word (8 bytes). Unless #4 means four octowords in total?? Constraint 5 is going to be next to impossible. Not sure about 7.
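
One possible reading that reconciles constraints 4 and 7 is that 7 limits each individual instruction while 4 limits the transaction as a whole. The sketch below illustrates that reading only. The names `within_one_block`, `OperandSet` and `note_access` are hypothetical (not Hercules code), and it assumes "a single double-word" and "a single octoword" mean naturally aligned 8-byte and 32-byte blocks.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DOUBLEWORD     8u   /* bytes */
#define OCTOWORD      32u   /* bytes */
#define MAX_OCTOWORDS  4u   /* constraint 4, read as a per-transaction total */

/* Constraint 7 (assumed reading): true if [addr, addr+len) lies within
 * one naturally aligned block of 'block' bytes. */
static bool within_one_block(uint64_t addr, size_t len, uint64_t block)
{
    if (len == 0)
        return true;
    return (addr / block) == ((addr + len - 1) / block);
}

/* Hypothetical per-transaction record of the distinct octowords touched. */
typedef struct {
    uint64_t octoword[MAX_OCTOWORDS];
    unsigned count;
} OperandSet;

/* Constraint 4 (assumed reading): record the octowords covered by one
 * access. Returns false if a fifth distinct octoword would be needed. */
static bool note_access(OperandSet *set, uint64_t addr, size_t len)
{
    if (len == 0)
        return true;
    uint64_t first = addr / OCTOWORD;
    uint64_t last  = (addr + len - 1) / OCTOWORD;
    for (uint64_t ow = first; ow <= last; ow++) {
        bool seen = false;
        for (unsigned i = 0; i < set->count; i++) {
            if (set->octoword[i] == ow) { seen = true; break; }
        }
        if (seen)
            continue;
        if (set->count == MAX_OCTOWORDS)
            return false;               /* would exceed four octowords */
        set->octoword[set->count++] = ow;
    }
    return true;
}
```

Under this reading an 8-byte load at an address ending in 4 violates constraint 7 (it straddles two doublewords) even though it stays inside one octoword.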

@Fish-Git added the Enhancement and Discussion labels on Oct 15, 2019
@Peter-J-Jansen
Collaborator

Peter-J-Jansen commented Oct 16, 2019

Well that was bound to happen sooner or later (ok, 7 years later).

Presuming z/OS 2.4 will be able to run under z/VM (7.1?), will SIE have to support these instructions?

Cheers,

Peter

@rwoodpd
Contributor

rwoodpd commented Oct 16, 2019

z/VM 6.4 and above support z/OS 2.4, so I am sure that SIE will need to support it.

@ivan-w
Member

ivan-w commented Oct 21, 2019

I don't think there is any issue with SIE supporting it.

The issue is supporting TXF as a whole.

Remember you are not only making a transaction on registers but also on any storage change, and those need to be capable of being rolled back or committed, and the entire transaction needs to be viewed as atomic by all the other entities (CPUs and Channel). This requires implementing some form of storage modification journal/log. It's possible a COW mechanism might be sufficient.

And then of course you need to except/intercept all the instructions that can't be used within a Transaction (a lot of them and some depending on how they are specified.. For example a backward branch is a no-no - no loops are allowed in a transaction).
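
To make the "storage modification journal/log" idea above concrete, here is a minimal sketch of an undo journal: each store first saves the bytes it is about to overwrite, so an abort can restore them in reverse order. All names (`Journal`, `journaled_store`, `journal_commit`, `journal_rollback`) and both size limits are assumptions made for illustration, not an existing Hercules interface.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define JOURNAL_MAX 64   /* hypothetical number of entries */
#define ENTRY_MAX   16   /* hypothetical largest store per entry */

typedef struct {
    uint8_t *where;            /* location that was overwritten   */
    size_t   len;              /* number of bytes overwritten     */
    uint8_t  old[ENTRY_MAX];   /* bytes present before the store  */
} JournalEntry;

typedef struct {
    JournalEntry entry[JOURNAL_MAX];
    size_t       used;
} Journal;

/* Perform a store, first saving the bytes it replaces. Returns false
 * (and stores nothing) when the journal is full or the store is too
 * large; a caller would treat that as a reason to abort. */
static bool journaled_store(Journal *j, uint8_t *dst,
                            const void *src, size_t len)
{
    if (j->used == JOURNAL_MAX || len > ENTRY_MAX)
        return false;
    JournalEntry *e = &j->entry[j->used++];
    e->where = dst;
    e->len   = len;
    memcpy(e->old, dst, len);
    memcpy(dst, src, len);
    return true;
}

/* Commit: the stores are already in place, so just forget the journal. */
static void journal_commit(Journal *j)
{
    j->used = 0;
}

/* Abort: undo the stores newest-first so overlapping stores unwind
 * to the original contents. */
static void journal_rollback(Journal *j)
{
    while (j->used > 0) {
        JournalEntry *e = &j->entry[--j->used];
        memcpy(e->where, e->old, e->len);
    }
}
```

Note that an undo journal alone does not hide in-flight stores from other CPUs or the channel subsystem, so it would still need some conflict detection or serialization on top to satisfy the atomicity requirement described above.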

@Fish-Git
Member Author

For example a backward branch is a no-no - no loops are allowed in a transaction

That's only for constrained transactions, not unconstrained. They're also only allowed to execute at most a total of 32 instructions too, all of which must be within the same 256-byte contiguous area of storage.

Unconstrained transactions do not have such limitations. They can execute as many contiguous or non-contiguous instructions as they want and loop all they want, branching to anywhere they want (forward or backward). Their only restriction is branch and/or mode tracing isn't enabled.
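
The constrained-transaction limits just described (at most 32 instructions, all inside one 256-byte area, no backward branches) could be checked per instruction roughly as follows. This is only a sketch: `ConstrainedState`, `CtxCheck` and `check_next` are hypothetical names, the 256-byte window is assumed to begin at the TBEGINC instruction, and whether TBEGINC itself counts toward the 32 is left to the caller.

```c
#include <stdbool.h>
#include <stdint.h>

#define CTX_MAX_INSTRUCTIONS  32u
#define CTX_WINDOW_BYTES     256u

typedef struct {
    uint64_t tbeginc_addr;   /* address of the TBEGINC instruction */
    unsigned executed;       /* instructions accepted so far       */
} ConstrainedState;

typedef enum {
    CTX_OK,
    CTX_TOO_MANY,
    CTX_OUTSIDE_WINDOW,
    CTX_BACKWARD_BRANCH
} CtxCheck;

/* Decide whether the instruction at next_addr (next_len bytes long) may
 * execute after the instruction at cur_addr in a constrained transaction. */
static CtxCheck check_next(ConstrainedState *st, uint64_t cur_addr,
                           uint64_t next_addr, unsigned next_len)
{
    /* A target at or before the current instruction would form a loop. */
    if (next_addr <= cur_addr)
        return CTX_BACKWARD_BRANCH;

    /* Every instruction must lie inside the 256-byte window. */
    if (next_addr < st->tbeginc_addr ||
        next_addr + next_len > st->tbeginc_addr + CTX_WINDOW_BYTES)
        return CTX_OUTSIDE_WINDOW;

    if (st->executed >= CTX_MAX_INSTRUCTIONS)
        return CTX_TOO_MANY;

    st->executed++;
    return CTX_OK;
}
```

None of these checks would apply to a transaction started with TBEGIN.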

It's possible a COW mechanism might be sufficient.

What's "COW"?

As for me, I'm still tossing around some type of "shadow" storage key array approach that tracks which pages a transaction stores into (along with a copy of the page that can later be committed or discarded as appropriate), but the devil of course is in the details.
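
As a rough illustration of the "shadow" page idea, the sketch below gives each touched page a private working copy that is either copied back on commit or simply dropped on abort. The names (`ShadowSet`, `shadow_for`, `shadow_commit`, `shadow_discard`) and the table size are hypothetical. The extra snapshot used to notice that someone else changed the page is an assumption added here, not something stated in the comment above.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TXF_PAGE_SIZE 4096u
#define MAX_SHADOWS   4u     /* hypothetical limit on tracked pages */

typedef struct {
    uint8_t *real;                     /* page in main storage          */
    uint8_t  copy[TXF_PAGE_SIZE];      /* transaction's working copy    */
    uint8_t  snapshot[TXF_PAGE_SIZE];  /* contents when first touched   */
} ShadowPage;

typedef struct {
    ShadowPage page[MAX_SHADOWS];
    unsigned   used;
} ShadowSet;

/* Return the working copy for 'page', creating it on first use.
 * Returns NULL when the table is full (a caller would abort). */
static uint8_t *shadow_for(ShadowSet *s, uint8_t *page)
{
    for (unsigned i = 0; i < s->used; i++) {
        if (s->page[i].real == page)
            return s->page[i].copy;
    }
    if (s->used == MAX_SHADOWS)
        return NULL;
    ShadowPage *p = &s->page[s->used++];
    p->real = page;
    memcpy(p->snapshot, page, TXF_PAGE_SIZE);
    memcpy(p->copy, page, TXF_PAGE_SIZE);
    return p->copy;
}

/* Abort: drop every working copy; main storage was never modified. */
static void shadow_discard(ShadowSet *s)
{
    s->used = 0;
}

/* Commit: fail if any real page no longer matches its snapshot
 * (someone else stored into it), otherwise copy everything back. */
static bool shadow_commit(ShadowSet *s)
{
    for (unsigned i = 0; i < s->used; i++) {
        if (memcmp(s->page[i].real, s->page[i].snapshot, TXF_PAGE_SIZE) != 0) {
            shadow_discard(s);
            return false;
        }
    }
    for (unsigned i = 0; i < s->used; i++)
        memcpy(s->page[i].real, s->page[i].copy, TXF_PAGE_SIZE);
    s->used = 0;
    return true;
}
```

The devil in the details is that the compare-and-copy in `shadow_commit` would itself have to be made atomic with respect to the other CPUs, which this sketch does not attempt.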

@rwoodpd
Contributor

rwoodpd commented Oct 24, 2019

COW = Copy on Write. Not sure if that would work for this. I think we need to queue the storage updates and actually write to storage on commit. We also will need to remember the location of a TBEGIN instruction in case of abort, which will happen if something else updates the storage before the commit (for an unconstrained transaction). I think some type of table pointed to by the REGS structure may be in order. There is a CPU based limit on the number of stores that can be tracked. An abort also happens if that limit is exceeded.
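
A minimal sketch of the "queue the stores, write on commit" approach, including the limit on tracked stores, might look like this. `StoreQueue`, `queue_store`, `queue_commit` and `queue_abort` are hypothetical names and both limits are invented for the example; how such a table would hang off the REGS structure is not shown.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STORE_QUEUE_MAX 32   /* hypothetical limit on tracked stores */
#define STORE_MAX_LEN    8   /* hypothetical largest single store    */

typedef struct {
    uint8_t *dst;
    size_t   len;
    uint8_t  data[STORE_MAX_LEN];
} QueuedStore;

typedef struct {
    QueuedStore q[STORE_QUEUE_MAX];
    size_t      used;
} StoreQueue;

/* Remember a store without touching storage. Returns false when the
 * limit is exceeded, which a caller would turn into an abort. */
static bool queue_store(StoreQueue *sq, uint8_t *dst,
                        const void *src, size_t len)
{
    if (sq->used == STORE_QUEUE_MAX || len > STORE_MAX_LEN)
        return false;
    QueuedStore *s = &sq->q[sq->used++];
    s->dst = dst;
    s->len = len;
    memcpy(s->data, src, len);
    return true;
}

/* Commit: apply the queued stores in program order. */
static void queue_commit(StoreQueue *sq)
{
    for (size_t i = 0; i < sq->used; i++)
        memcpy(sq->q[i].dst, sq->q[i].data, sq->q[i].len);
    sq->used = 0;
}

/* Abort: nothing reached storage, so just empty the queue. */
static void queue_abort(StoreQueue *sq)
{
    sq->used = 0;
}
```

One consequence of deferring stores is that fetches made inside the transaction must consult the queue first, otherwise the transaction would not see its own updates.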

@s390guy
Contributor

s390guy commented Oct 24, 2019 via email

@Fish-Git
Member Author

@s390guy

FYI: #203 (comment)

@Peter-J-Jansen
Collaborator

Peter-J-Jansen commented Oct 25, 2019

If I understand things correctly, then the granularity of the transactional store conflicts on the real iron is a 'cache line', which I think today is 256 bytes (contiguous, aligned). Hercules has currently no such concept, but I presume we'd have to implement something similar. The number of such cache lines to provide for transactional execution could be a new Hercules config parameter.

All global storage access will need to check whether the same cache line is already marked as a 'transactional store is in progress'. Seems like an unavoidable overhead. Which makes me believe that this Transactional-Execution feature is only going to slow things down when only a few CPU's are available.

For the non-constrained TBEGIN instruction therefore a simple emulation could be to just always return CC=3, as all such transactions must be able to cope with that. But for the constrained TBEGINC I see no such 'poor-man' emulation possibility.
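
The cache-line bookkeeping described above could be sketched as a small table of lines marked by in-flight transactions, which every other storage access would have to consult. The 256-byte line size is the guess made above, `MAX_LINES` stands in for the suggested config parameter, and `LineTable`, `line_of`, `mark_line` and `conflicting_cpu` are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

#define LINE_SIZE 256u   /* assumed cache-line size, per the guess above */
#define MAX_LINES  64u   /* stand-in for a configurable limit            */

typedef struct {
    uint64_t line;        /* address divided by LINE_SIZE        */
    int      owner_cpu;   /* CPU whose transaction touched it    */
} LineMark;

typedef struct {
    LineMark mark[MAX_LINES];
    unsigned used;
} LineTable;

static uint64_t line_of(uint64_t addr)
{
    return addr / LINE_SIZE;
}

/* Record that a transaction on 'cpu' touched the line containing addr.
 * Returns false when the table is full (a caller would abort). */
static bool mark_line(LineTable *t, uint64_t addr, int cpu)
{
    uint64_t line = line_of(addr);
    for (unsigned i = 0; i < t->used; i++) {
        if (t->mark[i].line == line && t->mark[i].owner_cpu == cpu)
            return true;                 /* already tracked */
    }
    if (t->used == MAX_LINES)
        return false;
    t->mark[t->used].line = line;
    t->mark[t->used].owner_cpu = cpu;
    t->used++;
    return true;
}

/* For an access by 'cpu' to addr: return the CPU of another transaction
 * that has marked the same line, or -1 if there is no conflict. */
static int conflicting_cpu(const LineTable *t, uint64_t addr, int cpu)
{
    uint64_t line = line_of(addr);
    for (unsigned i = 0; i < t->used; i++) {
        if (t->mark[i].line == line && t->mark[i].owner_cpu != cpu)
            return t->mark[i].owner_cpu;
    }
    return -1;
}
```

The linear scan in `conflicting_cpu` is exactly the per-access overhead worried about above; a real design would need something much cheaper on the hot path.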

Well, this is not going to be easy.

Does anyone have a z/OS 2.4 system to test with yet?

Cheers,

Peter

@rwoodpd
Contributor

rwoodpd commented Oct 25, 2019 via email

@Peter-J-Jansen
Collaborator

Peter-J-Jansen commented Oct 25, 2019

Interesting Bob!

The mainlock-style (but different) lock thus works, but am I correct to assume that this only works with one CPU configured?

In order to make this 'poor-man TBEGINC' work for multiple CPU's, the lock being held by any CPU would need to imply that all other CPU's would need to stop executing instructions until that lock is released. Thus all other CPU's would need to inspect that TBEGINC lock prior to each instruction, which is why I discarded the idea. But perhaps this could work.

As I said before, a 'poor-man TBEGIN' (i.e. non-constrained) could simply always return a CC=3 condition I think. Could you perhaps try that as well?

Cheers,

Peter

@Peter-J-Jansen
Collaborator

Peter-J-Jansen commented Oct 25, 2019

I read Bob's last comment again and saw that this TBEGINC lock also works with 4 CPU's, without the extra checking by CPU's not holding the TBEGINC lock. That makes sense for well behaving TBEGINC software, but in principle (no pun intended) it would be possible for a CPU not in transaction mode to access storage that is used while another CPU is in TBEGINC mode. And in that case the TBEGINC would need to (automatically) backout and retry, right? Such backout and retry can be avoided if the TBEGINC lock could guarantee that all other CPU's wait until the max. 32 TBEGINC instructions are finished and the TBEGINC lock released. Or am I wrong?

Cheers,

Peter

@rwoodpd
Contributor

rwoodpd commented Oct 25, 2019 via email

@dasdman
Contributor

dasdman commented Oct 25, 2019 via email

@rwoodpd
Contributor

rwoodpd commented Oct 25, 2019 via email

@Fish-Git
Member Author

I would strongly suggest re-reading the principles regarding the interactions; it does not operate as suggested between CPUs unless the CPUs are referencing the same lock. In addition, the current mainlock is over utilized on systems supporting atomic operations.

Wiser words have never been spoken. Everyone needs to thoroughly read and re-read and re-read again (and again and again) pages 5-89 through 5-109 of the Principles of Operation which describes the Transactional-Execution Facility. Using a lock the way Bob suggested is simply not going to work. It's a no-go from the get-go.

@Fish-Git
Member Author

I believe the first step in this endeavor should probably be writing a test program to prove whatever implementation we eventually come up with is actually correct. This test program should test all aspects of the facility and be able to reliably detect incorrect functionality (architectural violations).

Having such a program beforehand is IMHO critical to the overall success of this project. After all, it hardly matters whether we have an implementation which we believe is correct if we're unable to actually prove that it is!

Developing such a program beforehand would also help to identify easily overlooked details that a given implementation must reliably account for (be able to properly deal with). Such a program would likely have a strong impact (influence) on our eventual design too, as it would serve to identify its possible weaknesses and problem areas.

The point is, I strongly feel we should be thinking first about how to test the proper functioning of the facility (all architectural aspects of it), which in turn will help us to then determine the best way to go about actually implementing it.

And as always, correctness of functionality comes first and performance/efficiency (speed) comes second. Once we have an implementation that we know works, then we can worry about how to make it better (faster).

@ivan-w
Member

ivan-w commented Dec 2, 2019

Step 1 - For implementation, first thing is to set a base so that TEND/TABORT/Filtered Program interrupts/Constraint can resume after TBEGIN/TBEGINC with the proper code (involves some setjmp/longjmp).

Will start working on this now.

@rwoodpd
Contributor

rwoodpd commented Dec 2, 2019

TBEGINC restarts at the TBEGINC instruction. Constraint violations cause an actual interrupt if in constrained mode. In non-constrained mode, control is given to the instruction following the TBEGIN with a non-zero condition code. I have verified that on real hardware. I have it working on my machine. I have everything working except for program interrupt filtering. A long jump is indeed needed for the abort condition.
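
The setjmp/longjmp base mentioned in the last two comments can be sketched in standard C as below. It models only the unconstrained case: an abort anywhere inside the transaction body unwinds to the point just after the "TBEGIN" with a non-zero condition code. `TxContext`, `tx_run`, `tx_abort` and the two example bodies are hypothetical names, not the actual Hercules code.

```c
#include <setjmp.h>

typedef struct {
    jmp_buf resume_point;   /* where an abort resumes               */
    int     abort_cc;       /* condition code supplied by the abort */
    int     active;         /* non-zero while "in" the transaction  */
} TxContext;

/* Abort the current transaction: record the condition code, then
 * unwind to the setjmp in tx_run. Does not return to the caller. */
static void tx_abort(TxContext *tx, int cc)
{
    tx->abort_cc = cc;
    longjmp(tx->resume_point, 1);
}

/* Run 'body' as a transaction. Returns 0 if the body reached its end
 * (the TEND case), otherwise the condition code passed to tx_abort. */
static int tx_run(TxContext *tx, void (*body)(TxContext *))
{
    /* setjmp is used directly as the controlling expression, which is
     * one of the contexts the C standard permits. */
    if (setjmp(tx->resume_point) != 0) {
        tx->active = 0;
        return tx->abort_cc;
    }
    tx->active = 1;
    tx->abort_cc = 0;
    body(tx);
    tx->active = 0;
    return 0;
}

/* Example bodies: one runs to completion, one aborts with cc 3. */
static void body_completes(TxContext *tx)
{
    (void)tx;
}

static void body_aborts(TxContext *tx)
{
    tx_abort(tx, 3);
}
```

Here `tx_run(&tx, body_aborts)` returns 3 without ever reaching the end of the body. A constrained transaction would differ in that the abort path re-enters the body (restart at TBEGINC) instead of returning a condition code.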

@mcisho
Contributor

mcisho commented Nov 23, 2020

My apologies for not making it clear I was talking about real hardware, not the Hercules emulation.

@mcisho
Contributor

mcisho commented Nov 23, 2020

And shouldn't the TBEGIN in your first test have been TBEGIN tdb,x'fe04'?

@Fish-Git
Member Author

Fish-Git commented Nov 23, 2020

No version of Hercules was involved, the transactions were executed on a real-iron z15.

Ah! Okay. That wasn't clear. Thanks.

But now we have yet another contradiction, but this time between you and Abdo. According to Abdo, his keytool test was run on real iron and ran successfully, but failed (originally) when run on Hercules, with the failure being the STD instruction on an unconstrained transaction that didn't specify the 'F' flag. Weird. I wonder whether he was also using a z15 or not? Dang! This model dependent crap is really starting to piss me off!

So now the question is, who do I believe? You or him? (Or both?!) I don't know! <whimper!>   :(

(sigh!)

  :(

p.s. the Chapter 18 ADR instruction will of course always cause the transaction to fail without the 'F' flag, but I'm sure you knew that.

My concern (what is currently being "debated") is whether or not the Chapter 9 FP Control instructions should be allowed without the 'F' flag. According to your tests, they shouldn't be, whereas according to Abdo's claim, they should be. Who's right and who's wrong? How should we resolve this? Another new "model dependent" facility flag perhaps (like I recently implemented for the other "we're-not-quite-sure-what-the-proper-handling-should-be-for-these-instructions" situations)? I'm thinking yes, but with the default being as according to your z15 tests: Chapter 9 instructions should, by default, cause a TAC 11 (Restricted Instruction) abort without the 'F' flag. If the new (not-coded-yet) HERC_TXF_RESTRICT_4 facility is explicitly disabled however, then they should succeed.

How does that sound?
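
For reference, the 'F' flag under discussion is one bit of the TBEGIN immediate field. The sketch below decodes that field and expresses the proposed default as a function. The bit layout (register save mask in the high byte, then 'A' = X'0008', 'F' = X'0004', PIFC = X'0003') is an assumption based on one reading of the Principles of Operation and should be checked against the manual. `decode_tbegin_i2` and `fp_control_instruction_tac` are hypothetical names, and `restrict_enabled` merely stands in for the not-yet-coded HERC_TXF_RESTRICT_4 setting.

```c
#include <stdbool.h>
#include <stdint.h>

#define TAC_RESTRICTED_INSTRUCTION 11

typedef struct {
    uint8_t grsm;   /* general register save mask                  */
    bool    a;      /* allow access-register modification          */
    bool    f;      /* allow floating-point operation              */
    uint8_t pifc;   /* program-interruption-filtering control      */
} TbeginControls;

/* Decode the 16-bit immediate operand of TBEGIN (assumed layout). */
static TbeginControls decode_tbegin_i2(uint16_t i2)
{
    TbeginControls c;
    c.grsm = (uint8_t)(i2 >> 8);
    c.a    = (i2 & 0x0008) != 0;
    c.f    = (i2 & 0x0004) != 0;
    c.pifc = (uint8_t)(i2 & 0x0003);
    return c;
}

/* Proposed handling of a Chapter 9 FP control instruction inside an
 * unconstrained transaction: returns the TAC to abort with, or 0 to
 * let the instruction execute. */
static int fp_control_instruction_tac(bool f, bool restrict_enabled)
{
    if (f)
        return 0;                         /* 'F' flag set: allowed */
    return restrict_enabled ? TAC_RESTRICTED_INSTRUCTION : 0;
}
```

With this layout, `decode_tbegin_i2(0xFE04)` yields a transaction with the 'F' flag on, 'A' off and PIFC 0.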

@Fish-Git
Member Author

And shouldn't the TBEGIN in your first test have been TBEGIN tdb,x'fe04'?

Yeah. Simple typo typing in my GitHub comment.

@mcisho
Contributor

mcisho commented Nov 23, 2020

In an earlier post I asked "What were the sequence of instructions between the TBEGIN and the TEND?" in the keytool test. I asked because when I was writing the transactions I originally omitted the JNZ following the TBEGIN. As a result the transaction appeared to complete successfully, though it had aborted and the sequence of instructions following the TBEGIN was executed, but not as a transaction. I was wondering if whatever compiled the keytool program made the same silly mistake?
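
The role of the JNZ after TBEGIN can be illustrated with a small simulation: the code only takes the transactional path when the condition code is 0, and otherwise retries or falls back. `FakeCpu` and `fake_tbegin` are hypothetical stand-ins that replay a scripted list of condition codes, and the meanings given for cc 1 and cc 2 are assumptions (only cc 3 = persistent is confirmed elsewhere in this thread).

```c
#include <stdbool.h>

enum {
    CC_STARTED       = 0,   /* transaction initiated                  */
    CC_INDETERMINATE = 1,   /* assumed meaning: indeterminate abort   */
    CC_TRANSIENT     = 2,   /* assumed meaning: retry may succeed     */
    CC_PERSISTENT    = 3    /* retry not likely to succeed            */
};

typedef enum { PATH_TRANSACTIONAL, PATH_FALLBACK } Path;

/* Hypothetical stand-in for the CPU: replays scripted condition codes. */
typedef struct {
    const int *cc;     /* scripted condition codes        */
    unsigned   n;      /* number of scripted codes        */
    unsigned   next;   /* index of the next code to give  */
} FakeCpu;

static int fake_tbegin(FakeCpu *cpu)
{
    return cpu->next < cpu->n ? cpu->cc[cpu->next++] : CC_PERSISTENT;
}

/* Choose which path to run. Omitting the cc test (the JNZ) would mean
 * always returning PATH_TRANSACTIONAL, even after an abort. */
static Path run_with_retry(FakeCpu *cpu, unsigned max_retries)
{
    for (unsigned attempt = 0; attempt <= max_retries; attempt++) {
        int cc = fake_tbegin(cpu);
        if (cc == CC_STARTED)
            return PATH_TRANSACTIONAL;   /* fall through into the body */
        if (cc != CC_TRANSIENT)
            break;                       /* no point in retrying       */
    }
    return PATH_FALLBACK;
}
```

Without the condition-code test, the instructions after the TBEGIN would simply run non-transactionally after an abort, which matches the behaviour described above.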

@mcisho
Contributor

mcisho commented Nov 23, 2020

And did Abdo specify what his real iron was?

@Fish-Git
Member Author

In an earlier post I asked "What were the sequence of instructions between the TBEGIN and the TEND?" in the keytool test.

I have no idea. All I know is the keytool command causes an unconstrained transaction to execute that contains the STD instruction in it. See GitHub comment #263 (comment) where he originally reported the problem, and subsequent GitHub comment #263 (comment) where he says it ran successfully on a "z server".

I was wondering if whatever compiled the keytool program made the same silly mistake?

While I realize IBM themselves are not immune to making mistakes (they are human after all, as much as we sometimes like to believe otherwise!), I'm doubting it since it is, AFAIK, IBM code.

And did Abdo specify what his real iron was?

Nope. Just "z server".

If you really want to know, why don't we just ask him? Abdo? (@azarrafa) What model was the "z server" that your keytool test ran just fine on? Was it a z15? Or some other model? Thanks!

@mcisho
Contributor

mcisho commented Nov 23, 2020

I've just looked again at the original message from Abdo. He said he got:

#keytool -printcert -sslserver zsyst.me.com:1443 -J-Dfile.encoding=UTF8 -rfc
CEE3250C The system or user abend S0E0 R=00000018 was issued.
From entry point ZJ9SYM1 at compile unit offset +00000050131F7720 at entry offset
+0000004F993F7418 at address 00000050131F7720.

Please correct me if I'm wrong, but my understanding is that a nonconstrained transaction doesn't generally present a program interrupt: the transaction is aborted and the nsi following the TBEGIN is executed, with the condition code informing the application "that didn't work, try something else". That is what appears to happen in my tests, but have I grasped the right or wrong end of the stick?

@Fish-Git
Member Author

Ivan wrote:

Aye!

We're getting there, Ivan! Trying to get these last few things taken care of (and I want to try and rework my tracing logic too (*)) before building the release.

(*) It's a bit too klunky for my liking and isn't quite working right at the moment either, so I want to try and fix it. Being able to properly trace TXF transactions is an important feature/ability to have in our release IMO. I mean, if we're going to release this thing and let people bang away on it, reporting problems and what not, it's important IMO that we can provide them a means of gathering additional information for us to try and figure out what the heck is going on, so I want to work on it a bit and try to clean it up a bit and make it more reliable. I'm hoping to be able to release 4.3 sometime before Thanksgiving. (Definitely before xmas though!)

@Fish-Git
Member Author

Fish-Git commented Nov 23, 2020

my understanding is that a nonconstrained transaction doesn't generally present a program interrupt

Well.......

the transaction is aborted and the nsi following the TBEGIN is executed, with the condition code informing the application "that didn't work, try something else".

Nope. Wrong. If your transaction attempts to execute a restricted instruction, it will Program Check (a 218 Program Interrupt will occur) and the instruction immediately following your TBEGIN instruction will not be executed. Instead, the Program Old PSW is set to that instruction (the one following your TBEGIN) and the Program New PSW is then loaded and branched to. That is to say, your PROGRAM is aborted. It crashes.

If you have a crash handler (i.e. a Program Interrupt interception handler/routine; I forget what they're called on z/OS, I'm not a z/OS person!), then the system will invoke that routine, and you can then look at your TDB to see the cause if you want. If that routine then decides (I'm presuming) to "ignore" the interrupt and continue executing your program where it left off, then yes, your program would continue at the instruction immediately following your TBEGIN instruction with cc=3 set (which means "Persistent condition; successful retry NOT likely under current conditions").

Now your program interrupt handler can of course decide to construct a different transaction and then try executing it instead, but the point is, a Restricted Instruction Program Interrupt cannot be filtered (from TXF's PIFC point of view) and a 218 Program Interrupt will always occur.

So it's not so much that your transaction was aborted, but rather that your program BLEW UP!

(Unless maybe I'm the one that's grasping the wrong end of the stick??)

@mcisho
Contributor

mcisho commented Nov 23, 2020

Hmm... I have to think about that. My application does have a crash handler, but it is not called when the non-constrained transaction tries executing FP instructions, and it is, as expected, called when the constrained transaction tries executing FP instructions (or SS instruction, or anything else restricted).

The only way to get to my application's code that reports the non-constrained transaction was not successful is via the JNZ instruction immediately following the TBEGIN. My application has no code that examines the TDB, or any code that makes any decision to ignore or not, so is there another stick in town?

@Fish-Git
Member Author

so is there another stick in town?

Must be z/OS doing it then. z/OS must be handling the Program Interrupt, detecting that it's a non-constrained transaction, and then automatically continuing your program (which, according to the stored Program Old PSW would be the instruction following your TBEGIN, with cc=3 set). That's the only thing I can think of! Because the POO is quite clear regarding Program Interrupts that occur during a transaction and how they're handled, especially Unfiltered Program Interrupts (which is what trying to execute a restricted instruction would be: Program Interrupt code 218, Transaction Constraint Exception, which is a Transactional-Execution Class 1 interrupt, which cannot be filtered).

I would very much like to know how your unconstrained transaction executes stand-alone on a real z15! Is that possible? Can you do that?

@mcisho
Contributor

mcisho commented Nov 23, 2020

The transactions aren't executed stand-alone, they are executed by a normal application running in an address space. The application has been developed and added to over many, many years and provides a platform for trying new things.

I was wondering if the z/OS program check handler was getting involved. It might explain why the keytool app was getting the program check, z/OS simply passes the program check on to USS to deal with it.

@azarrafa

In an earlier post I asked "What were the sequence of instructions between the TBEGIN and the TEND?" in the keytool test.

I have no idea. All I know is the keytool command causes an unconstrained transaction to execute that contains the STD instruction in it. See GitHub comment #263 (comment) where he originally reported the problem, and subsequent GitHub comment #263 (comment) where he says it ran successfully on a "z server".

I was wondering if whatever compiled the keytool program made the same silly mistake?

While I realize IBM themselves are not immune to making mistakes (they are human after all, as much as we sometimes like to believe otherwise!), I'm doubting it since it is, AFAIK, IBM code.

And did Abdo specify what his real iron was?

Nope. Just "z server".

If you really want to know, why don't we just ask him? Abdo? (@azarrafa) What model was the "z server" that your keytool test ran just fine on? Was it a z15? Or some other model? Thanks!

Hello
I made the test on a Z15.

@Fish-Git
Member Author

And did Abdo specify what his real iron was?

Nope. Just "z server".

If you really want to know, why don't we just ask him? Abdo? (@azarrafa) What model was the "z server" that your keytool test ran just fine on? Was it a z15? Or some other model? Thanks!

Hello
I made the test on a Z15.

Thank you, Abdo!     Ian?  (@mcisho)  Did you hear that?

@Fish-Git
Member Author

While testing, after a while (i.e. after some unknown event), I'm seeing a sudden flood of unconstrained transaction aborts due to TAC 7 (Fetch Overflow):

09:28:52.332 HHC17730I Total UNconstrained Transactions =         653
09:28:52.332 HHC17731I Retries for ANY/ALL reason(s):
09:28:52.332 HHC17732I 0 retries =         646  (98.9%)
09:28:52.332 HHC17732I 1 retries =           9  ( 1.4%)
09:28:52.332 HHC17732I 2 retries =           0  ( 0.0%)
09:28:52.333 HHC17732I 3 retries =           0  ( 0.0%)
09:28:52.333 HHC17732I 4 retries =           0  ( 0.0%)
09:28:52.333 HHC17732I 5 retries =           0  ( 0.0%)
09:28:52.333 HHC17732I 6 retries =           0  ( 0.0%)
09:28:52.333 HHC17732I 7 retries =           0  ( 0.0%)
09:28:52.333 HHC17732I 8+retries =           0  ( 0.0%)
09:28:52.333 HHC17733I MAXIMUM   =           1
09:28:52.333 HHC17734I            3  ( 0.5%)  Retries due to TAC   2 External interruption
09:28:52.333 HHC17734I            0  ( 0.0%)  Retries due to TAC   4 PGM Interruption (Unfiltered)
09:28:52.333 HHC17734I            0  ( 0.0%)  Retries due to TAC   5 Machine-check Interruption
09:28:52.333 HHC17734I            0  ( 0.0%)  Retries due to TAC   6 I/O Interruption
09:28:52.333 HHC17734I            0  ( 0.0%)  Retries due to TAC   7 Fetch overflow
09:28:52.333 HHC17734I            0  ( 0.0%)  Retries due to TAC   8 Store overflow
09:28:52.333 HHC17734I            0  ( 0.0%)  Retries due to TAC   9 Fetch conflict
09:28:52.333 HHC17734I            0  ( 0.0%)  Retries due to TAC  10 Store conflict
09:28:52.333 HHC17734I            0  ( 0.0%)  Retries due to TAC  11 Restricted instruction
09:28:52.333 HHC17734I            0  ( 0.0%)  Retries due to TAC  12 PGM Interruption (Filtered)
09:28:52.333 HHC17734I            0  ( 0.0%)  Retries due to TAC  13 Nesting Depth exceeded
09:28:52.333 HHC17734I            0  ( 0.0%)  Retries due to TAC  14 Cache (fetch related)
09:28:52.333 HHC17734I            0  ( 0.0%)  Retries due to TAC  15 Cache (store related)
09:28:52.333 HHC17734I            0  ( 0.0%)  Retries due to TAC  16 Cache (other)
09:28:52.334 HHC17734I            4  ( 0.6%)  Retries due to TAC 255 Miscellaneous condition
09:28:52.334 HHC17735I            0  ( 0.0%)  Retries due to other TAC

10:30:53.109 HHC17730I Total UNconstrained Transactions =         653
10:30:53.109 HHC17731I Retries for ANY/ALL reason(s):
10:30:53.109 HHC17732I 0 retries =         646  (98.9%)
10:30:53.109 HHC17732I 1 retries =           9  ( 1.4%)
10:30:53.110 HHC17732I 2 retries =           0  ( 0.0%)
10:30:53.110 HHC17732I 3 retries =           0  ( 0.0%)
10:30:53.110 HHC17732I 4 retries =           0  ( 0.0%)
10:30:53.110 HHC17732I 5 retries =           0  ( 0.0%)
10:30:53.110 HHC17732I 6 retries =           0  ( 0.0%)
10:30:53.110 HHC17732I 7 retries =           0  ( 0.0%)
10:30:53.110 HHC17732I 8+retries =           0  ( 0.0%)
10:30:53.110 HHC17733I MAXIMUM   =           1
10:30:53.110 HHC17734I            3  ( 0.5%)  Retries due to TAC   2 External interruption
10:30:53.110 HHC17734I            0  ( 0.0%)  Retries due to TAC   4 PGM Interruption (Unfiltered)
10:30:53.110 HHC17734I            0  ( 0.0%)  Retries due to TAC   5 Machine-check Interruption
10:30:53.110 HHC17734I            0  ( 0.0%)  Retries due to TAC   6 I/O Interruption
10:30:53.110 HHC17734I            0  ( 0.0%)  Retries due to TAC   7 Fetch overflow
10:30:53.110 HHC17734I            0  ( 0.0%)  Retries due to TAC   8 Store overflow
10:30:53.110 HHC17734I            0  ( 0.0%)  Retries due to TAC   9 Fetch conflict
10:30:53.110 HHC17734I            0  ( 0.0%)  Retries due to TAC  10 Store conflict
10:30:53.111 HHC17734I            0  ( 0.0%)  Retries due to TAC  11 Restricted instruction
10:30:53.111 HHC17734I            0  ( 0.0%)  Retries due to TAC  12 PGM Interruption (Filtered)
10:30:53.111 HHC17734I            0  ( 0.0%)  Retries due to TAC  13 Nesting Depth exceeded
10:30:53.111 HHC17734I            0  ( 0.0%)  Retries due to TAC  14 Cache (fetch related)
10:30:53.111 HHC17734I            0  ( 0.0%)  Retries due to TAC  15 Cache (store related)
10:30:53.111 HHC17734I            0  ( 0.0%)  Retries due to TAC  16 Cache (other)
10:30:53.111 HHC17734I            4  ( 0.6%)  Retries due to TAC 255 Miscellaneous condition
10:30:53.111 HHC17735I            0  ( 0.0%)  Retries due to other TAC


10:57:06.859 HHC17703D TXF: CP04: SIE: Failed Outermost UNconstrained Transaction for TND 1: TAC_FETCH_OVF = Fetch overflow, why = TXF_WHY_MAX_PAGES
10:57:06.860 HHC17709D TXF: CP04: SIE: Formatted dump of TDB:
10:57:06.860 HHC17721D TXF: CP04: SIE: Fmt: 1, TND: 1, EAID: 0E, DXC/VXC: 00, PIID: 00000000, Flags: (none)
10:57:06.861 HHC17721D TXF: CP04: SIE: TAC:   0x0000000000000007: TAC_FETCH_OVF = Fetch overflow
10:57:06.861 HHC17721D TXF: CP04: SIE: Token: 0x0000000000000000, ATIA:  0x00000000202FCD9C
10:57:06.861 HHC17721D TXF: CP04: SIE:
10:57:06.862 HHC17721D TXF: CP04: SIE: 00000000202FCD9C INST=E36050680004 LG    6,104(0,5)             load_long
10:57:06.862 HHC17721D TXF: CP04: SIE:
10:57:06.862 HHC17721D TXF: CP04: SIE: TEID:  0x0000000003089000, BEA:   0x00000000202FBB66
10:57:06.862 HHC17721D TXF: CP04: SIE: GR 00: 0x0000000000000000, GR 01: 0x00000050086E4D40
10:57:06.863 HHC17721D TXF: CP04: SIE: GR 02: 0x0000000000000098, GR 03: 0x0000000080000004
10:57:06.863 HHC17721D TXF: CP04: SIE: GR 04: 0x00000051186FDC80, GR 05: 0x00000050086E4090
10:57:06.863 HHC17721D TXF: CP04: SIE: GR 06: 0x00000000202FCD90, GR 07: 0x00000000202FBB68
10:57:06.863 HHC17721D TXF: CP04: SIE: GR 08: 0x00000050086E3EF0, GR 09: 0x00000051186FE748
10:57:06.863 HHC17721D TXF: CP04: SIE: GR 10: 0x0000000000100000, GR 11: 0x0000000000000005
10:57:06.863 HHC17721D TXF: CP04: SIE: GR 12: 0x0000000000100000, GR 13: 0x00000050086E4D40
10:57:06.863 HHC17721D TXF: CP04: SIE: GR 14: 0x0000000000000028, GR 15: 0x0000005000000000
10:57:06.872 HHC17717D TXF: CP04: SIE: UNconstrained transaction retry #1...
10:57:06.872 HHC17719D TXF: CP04: SIE: UNconstrained transaction retry #1 FAILED!
10:57:06.873 HHC17703D TXF: CP04: SIE: Failed Outermost UNconstrained Transaction for TND 1: TAC_FETCH_OVF = Fetch overflow, why = TXF_WHY_MAX_PAGES
10:57:06.873 HHC17709D TXF: CP04: SIE: Formatted dump of TDB:
10:57:06.873 HHC17721D TXF: CP04: SIE: Fmt: 1, TND: 1, EAID: 0E, DXC/VXC: 00, PIID: 00000000, Flags: (none)
10:57:06.873 HHC17721D TXF: CP04: SIE: TAC:   0x0000000000000007: TAC_FETCH_OVF = Fetch overflow
10:57:06.873 HHC17721D TXF: CP04: SIE: Token: 0x0000000000000000, ATIA:  0x00000000202FBB34
10:57:06.873 HHC17721D TXF: CP04: SIE:
10:57:06.873 HHC17721D TXF: CP04: SIE: 00000000202FBB34 INST=E30060000094 LLC   0,0(0,6)               load_logical_character
10:57:06.873 HHC17721D TXF: CP04: SIE:
10:57:06.873 HHC17721D TXF: CP04: SIE: TEID:  0x0000000000000000, BEA:   0x00000000202FBB02
10:57:06.873 HHC17721D TXF: CP04: SIE: GR 00: 0x0000000000000100, GR 01: 0x00000051186FE748
10:57:06.874 HHC17721D TXF: CP04: SIE: GR 02: 0x0000000000100000, GR 03: 0x0000000000000005
10:57:06.874 HHC17721D TXF: CP04: SIE: GR 04: 0x00000051186FDD80, GR 05: 0x00000050086E3EF0
10:57:06.874 HHC17721D TXF: CP04: SIE: GR 06: 0x000000007CD12600, GR 07: 0x000000007CA9DDDE
10:57:06.874 HHC17721D TXF: CP04: SIE: GR 08: 0x00000050086E3EF0, GR 09: 0x00000051186FE748
10:57:06.874 HHC17721D TXF: CP04: SIE: GR 10: 0x0000000000100000, GR 11: 0x0000000000000005
10:57:06.874 HHC17721D TXF: CP04: SIE: GR 12: 0x0000000000100000, GR 13: 0x00000050086E4D40
10:57:06.874 HHC17721D TXF: CP04: SIE: GR 14: 0x0000000000000028, GR 15: 0x0000005010F7D4F0
10:57:06.879 HHC17717D TXF: CP04: SIE: CONSTRAINED transaction retry #2...
10:57:06.888 HHC17703D TXF: CP04: SIE: Failed Outermost UNconstrained Transaction for TND 1: TAC_FETCH_OVF = Fetch overflow, why = TXF_WHY_MAX_PAGES
10:57:06.888 HHC17709D TXF: CP04: SIE: Formatted dump of TDB:
10:57:06.888 HHC17721D TXF: CP04: SIE: Fmt: 1, TND: 1, EAID: 0E, DXC/VXC: 00, PIID: 00000000, Flags: (none)
10:57:06.888 HHC17721D TXF: CP04: SIE: TAC:   0x0000000000000007: TAC_FETCH_OVF = Fetch overflow
10:57:06.888 HHC17721D TXF: CP04: SIE: Token: 0x0000000000000000, ATIA:  0x00000000202FCD9C
10:57:06.889 HHC17721D TXF: CP04: SIE:
10:57:06.889 HHC17721D TXF: CP04: SIE: 00000000202FCD9C INST=E36050680004 LG    6,104(0,5)             load_long
10:57:06.889 HHC17721D TXF: CP04: SIE:
10:57:06.889 HHC17721D TXF: CP04: SIE: TEID:  0x0000000000000000, BEA:   0x00000000202FBB66
10:57:06.889 HHC17721D TXF: CP04: SIE: GR 00: 0x0000000000000000, GR 01: 0x00000050086E4D40
10:57:06.889 HHC17721D TXF: CP04: SIE: GR 02: 0x0000000000000098, GR 03: 0x0000000080000004
10:57:06.889 HHC17721D TXF: CP04: SIE: GR 04: 0x00000051186FDC80, GR 05: 0x00000050086E4090
10:57:06.889 HHC17721D TXF: CP04: SIE: GR 06: 0x00000000202FCD90, GR 07: 0x00000000202FBB68
10:57:06.889 HHC17721D TXF: CP04: SIE: GR 08: 0x00000050086E3EF0, GR 09: 0x00000051186FE748
10:57:06.889 HHC17721D TXF: CP04: SIE: GR 10: 0x0000000000100000, GR 11: 0x0000000000000005
10:57:06.889 HHC17721D TXF: CP04: SIE: GR 12: 0x0000000000100000, GR 13: 0x00000050086E4D40
10:57:06.889 HHC17721D TXF: CP04: SIE: GR 14: 0x0000000000000028, GR 15: 0x0000005000000000
10:57:06.900 HHC17717D TXF: CP04: SIE: UNconstrained transaction retry #1...
10:57:06.901 HHC17719D TXF: CP04: SIE: UNconstrained transaction retry #1 FAILED!
10:57:06.901 HHC17703D TXF: CP04: SIE: Failed Outermost UNconstrained Transaction for TND 1: TAC_FETCH_OVF = Fetch overflow, why = TXF_WHY_MAX_PAGES
10:57:06.901 HHC17709D TXF: CP04: SIE: Formatted dump of TDB:
10:57:06.901 HHC17721D TXF: CP04: SIE: Fmt: 1, TND: 1, EAID: 0E, DXC/VXC: 00, PIID: 00000000, Flags: (none)
10:57:06.901 HHC17721D TXF: CP04: SIE: TAC:   0x0000000000000007: TAC_FETCH_OVF = Fetch overflow
10:57:06.901 HHC17721D TXF: CP04: SIE: Token: 0x0000000000000000, ATIA:  0x00000000202FCD9C
10:57:06.901 HHC17721D TXF: CP04: SIE:
10:57:06.901 HHC17721D TXF: CP04: SIE: 00000000202FCD9C INST=E36050680004 LG    6,104(0,5)             load_long
10:57:06.901 HHC17721D TXF: CP04: SIE:
10:57:06.902 HHC17721D TXF: CP04: SIE: TEID:  0x0000000000000000, BEA:   0x00000000202FBB66
10:57:06.902 HHC17721D TXF: CP04: SIE: GR 00: 0x0000000000000000, GR 01: 0x00000050086E4D40
10:57:06.902 HHC17721D TXF: CP04: SIE: GR 02: 0x0000000000000098, GR 03: 0x0000000080000004
10:57:06.902 HHC17721D TXF: CP04: SIE: GR 04: 0x00000051186FDC80, GR 05: 0x00000050086E4090
10:57:06.902 HHC17721D TXF: CP04: SIE: GR 06: 0x00000000202FCD90, GR 07: 0x00000000202FBB68
10:57:06.902 HHC17721D TXF: CP04: SIE: GR 08: 0x00000050086E3EF0, GR 09: 0x00000051186FE748
10:57:06.902 HHC17721D TXF: CP04: SIE: GR 10: 0x0000000000100000, GR 11: 0x0000000000000005
10:57:06.902 HHC17721D TXF: CP04: SIE: GR 12: 0x0000000000100000, GR 13: 0x00000050086E4D40
10:57:06.902 HHC17721D TXF: CP04: SIE: GR 14: 0x0000000000000028, GR 15: 0x0000005000000000
10:57:06.920 HHC17703D TXF: CP02: SIE: Failed Outermost UNconstrained Transaction for TND 1: TAC_FETCH_OVF = Fetch overflow, why = TXF_WHY_MAX_PAGES
10:57:06.920 HHC17709D TXF: CP02: SIE: Formatted dump of TDB:
10:57:06.921 HHC17721D TXF: CP02: SIE: Fmt: 1, TND: 1, EAID: 00, DXC/VXC: 00, PIID: 00000000, Flags: (none)
10:57:06.921 HHC17721D TXF: CP02: SIE: TAC:   0x0000000000000007: TAC_FETCH_OVF = Fetch overflow
10:57:06.921 HHC17721D TXF: CP02: SIE: Token: 0x0000000000000000, ATIA:  0x00000000202FCD9C
10:57:06.921 HHC17721D TXF: CP02: SIE:
10:57:06.921 HHC17721D TXF: CP02: SIE: 00000000202FCD9C INST=E36050680004 LG    6,104(0,5)             load_long
10:57:06.921 HHC17721D TXF: CP02: SIE:
10:57:06.921 HHC17721D TXF: CP02: SIE: TEID:  0x00000053ABEFD000, BEA:   0x00000000202FBB66
10:57:06.921 HHC17721D TXF: CP02: SIE: GR 00: 0x0000000000000000, GR 01: 0x00000050086E4D40
10:57:06.921 HHC17721D TXF: CP02: SIE: GR 02: 0x0000000000000098, GR 03: 0x0000000080000004
10:57:06.921 HHC17721D TXF: CP02: SIE: GR 04: 0x00000051186FDC80, GR 05: 0x00000050086E4090
10:57:06.921 HHC17721D TXF: CP02: SIE: GR 06: 0x00000000202FCD90, GR 07: 0x00000000202FBB68
10:57:06.921 HHC17721D TXF: CP02: SIE: GR 08: 0x00000050086E3EF0, GR 09: 0x00000051186FE748
10:57:06.921 HHC17721D TXF: CP02: SIE: GR 10: 0x0000000000100000, GR 11: 0x0000000000000005
10:57:06.921 HHC17721D TXF: CP02: SIE: GR 12: 0x0000000000100000, GR 13: 0x00000050086E4D40
10:57:06.921 HHC17721D TXF: CP02: SIE: GR 14: 0x0000000000000028, GR 15: 0x0000005000000000
10:57:06.927 HHC17717D TXF: CP02: SIE: UNconstrained transaction retry #1...
10:57:06.965 HHC17703D TXF: CP00: SIE: Failed Outermost UNconstrained Transaction for TND 1: TAC_FETCH_OVF = Fetch overflow, why = TXF_WHY_MAX_PAGES
10:57:06.965 HHC17709D TXF: CP00: SIE: Formatted dump of TDB:
10:57:06.965 HHC17721D TXF: CP00: SIE: Fmt: 1, TND: 1, EAID: 00, DXC/VXC: 00, PIID: 00000000, Flags: (none)
10:57:06.965 HHC17721D TXF: CP00: SIE: TAC:   0x0000000000000007: TAC_FETCH_OVF = Fetch overflow
10:57:06.965 HHC17721D TXF: CP00: SIE: Token: 0x0000000000000000, ATIA:  0x00000000202FCD9C
10:57:06.965 HHC17721D TXF: CP00: SIE:
10:57:06.965 HHC17721D TXF: CP00: SIE: 00000000202FCD9C INST=E36050680004 LG    6,104(0,5)             load_long
10:57:06.966 HHC17721D TXF: CP00: SIE:
10:57:06.966 HHC17721D TXF: CP00: SIE: TEID:  0x0000000000000000, BEA:   0x00000000202FBB66
10:57:06.966 HHC17721D TXF: CP00: SIE: GR 00: 0x0000000000000000, GR 01: 0x00000050086E4D40
10:57:06.966 HHC17721D TXF: CP00: SIE: GR 02: 0x0000000000000098, GR 03: 0x0000000080000004
10:57:06.966 HHC17721D TXF: CP00: SIE: GR 04: 0x00000051186FDC80, GR 05: 0x00000050086E4090
10:57:06.966 HHC17721D TXF: CP00: SIE: GR 06: 0x00000000202FCD90, GR 07: 0x00000000202FBB68
10:57:06.966 HHC17721D TXF: CP00: SIE: GR 08: 0x00000050086E3EF0, GR 09: 0x00000051186FE748
10:57:06.966 HHC17721D TXF: CP00: SIE: GR 10: 0x0000000000100000, GR 11: 0x0000000000000005
10:57:06.966 HHC17721D TXF: CP00: SIE: GR 12: 0x0000000000100000, GR 13: 0x00000050086E4D40
10:57:06.967 HHC17721D TXF: CP00: SIE: GR 14: 0x0000000000000028, GR 15: 0x0000005000000000
10:57:06.974 HHC17717D TXF: CP00: SIE: UNconstrained transaction retry #1...
10:57:06.974 HHC17719D TXF: CP00: SIE: UNconstrained transaction retry #1 FAILED!
10:57:06.974 HHC17703D TXF: CP00: SIE: Failed Outermost UNconstrained Transaction for TND 1: TAC_FETCH_OVF = Fetch overflow, why = TXF_WHY_MAX_PAGES
10:57:06.974 HHC17709D TXF: CP00: SIE: Formatted dump of TDB:
10:57:06.974 HHC17721D TXF: CP00: SIE: Fmt: 1, TND: 1, EAID: 0E, DXC/VXC: 00, PIID: 00000000, Flags: (none)
10:57:06.975 HHC17721D TXF: CP00: SIE: TAC:   0x0000000000000007: TAC_FETCH_OVF = Fetch overflow
10:57:06.975 HHC17721D TXF: CP00: SIE: Token: 0x0000000000000000, ATIA:  0x00000000202FCD9C
10:57:06.975 HHC17721D TXF: CP00: SIE:
10:57:06.975 HHC17721D TXF: CP00: SIE: 00000000202FCD9C INST=E36050680004 LG    6,104(0,5)             load_long
10:57:06.975 HHC17721D TXF: CP00: SIE:
10:57:06.975 HHC17721D TXF: CP00: SIE: TEID:  0x0000000000000000, BEA:   0x00000000202FBB66
10:57:06.975 HHC17721D TXF: CP00: SIE: GR 00: 0x0000000000000000, GR 01: 0x00000050086E4D40
10:57:06.975 HHC17721D TXF: CP00: SIE: GR 02: 0x0000000000000098, GR 03: 0x0000000080000004
10:57:06.975 HHC17721D TXF: CP00: SIE: GR 04: 0x00000051186FDC80, GR 05: 0x00000050086E4090
10:57:06.975 HHC17721D TXF: CP00: SIE: GR 06: 0x00000000202FCD90, GR 07: 0x00000000202FBB68
10:57:06.975 HHC17721D TXF: CP00: SIE: GR 08: 0x00000050086E3EF0, GR 09: 0x00000051186FE748
10:57:06.975 HHC17721D TXF: CP00: SIE: GR 10: 0x0000000000100000, GR 11: 0x0000000000000005
10:57:06.975 HHC17721D TXF: CP00: SIE: GR 12: 0x0000000000100000, GR 13: 0x00000050086E4D40
10:57:06.975 HHC17721D TXF: CP00: SIE: GR 14: 0x0000000000000028, GR 15: 0x0000005000000000
10:57:06.986 HHC17717D TXF: CP00: SIE: CONSTRAINED transaction retry #2...
10:57:06.997 HHC17717D TXF: CP04: SIE: CONSTRAINED transaction retry #2...

11:07:25.085 HHC17730I Total UNconstrained Transactions =       27874
11:07:25.086 HHC17731I Retries for ANY/ALL reason(s):
11:07:25.086 HHC17732I 0 retries =       27030  (97.0%)
11:07:25.086 HHC17732I 1 retries =         353  ( 1.3%)
11:07:25.086 HHC17732I 2 retries =           6  ( 0.0%)
11:07:25.086 HHC17732I 3 retries =           0  ( 0.0%)
11:07:25.086 HHC17732I 4 retries =           0  ( 0.0%)
11:07:25.086 HHC17732I 5 retries =           0  ( 0.0%)
11:07:25.087 HHC17732I 6 retries =           0  ( 0.0%)
11:07:25.087 HHC17732I 7 retries =           0  ( 0.0%)
11:07:25.087 HHC17732I 8+retries =           0  ( 0.0%)
11:07:25.087 HHC17733I MAXIMUM   =           2
11:07:25.087 HHC17734I          168  ( 0.6%)  Retries due to TAC   2 External interruption
11:07:25.087 HHC17734I            0  ( 0.0%)  Retries due to TAC   4 PGM Interruption (Unfiltered)
11:07:25.088 HHC17734I            0  ( 0.0%)  Retries due to TAC   5 Machine-check Interruption
11:07:25.088 HHC17734I            0  ( 0.0%)  Retries due to TAC   6 I/O Interruption
11:07:25.088 HHC17734I          570  ( 2.0%)  Retries due to TAC   7 Fetch overflow
11:07:25.088 HHC17734I            0  ( 0.0%)  Retries due to TAC   8 Store overflow
11:07:25.088 HHC17734I            2  ( 0.0%)  Retries due to TAC   9 Fetch conflict
11:07:25.088 HHC17734I            0  ( 0.0%)  Retries due to TAC  10 Store conflict
11:07:25.088 HHC17734I            0  ( 0.0%)  Retries due to TAC  11 Restricted instruction
11:07:25.088 HHC17734I            0  ( 0.0%)  Retries due to TAC  12 PGM Interruption (Filtered)
11:07:25.088 HHC17734I            0  ( 0.0%)  Retries due to TAC  13 Nesting Depth exceeded
11:07:25.088 HHC17734I            0  ( 0.0%)  Retries due to TAC  14 Cache (fetch related)
11:07:25.088 HHC17734I            0  ( 0.0%)  Retries due to TAC  15 Cache (store related)
11:07:25.088 HHC17734I            0  ( 0.0%)  Retries due to TAC  16 Cache (other)
11:07:25.088 HHC17734I          432  ( 1.5%)  Retries due to TAC 255 Miscellaneous condition
11:07:25.088 HHC17735I            0  ( 0.0%)  Retries due to other TAC

The Principles of Operation has this to say about fetch/store overflows:

Page 5-100 ("Transaction Abort Conditions"):

Fetch Overflow: A fetch-overflow condition is detected when the transaction attempts
to fetch from more locations than the CPU supports. The transaction-abort code is set
to 7, and the condition code is set to either 2 or 3.

Store Overflow: A store-overflow condition is detected when the transaction attempts
to store to more locations than the CPU supports. The transaction-abort code is set
to 8, and the condition code is set to either 2 or 3.

Currently, Hercules has this value defined as a hard coded #define constant value of 64:

#define MAX_TXF_PAGES 64 /* Max num of modified pages */

I don't know where Bob got the value from (I'm guessing he just picked a number that seemed reasonable at the time), but I think it's now obvious that the value is too low and should be increased.

Does anyone know what the value should be? Is there a way we can somehow discover what the value is on real iron? (e.g. on a z15?) Or should we just increase it to, say, 128 or even 256?

Help!   :(

@s390guy
Contributor

s390guy commented Nov 25, 2020

First, I do not know what the correct value should be. However, let's think about this. Real mainframes can have 64 CPUs. Maybe higher. 64 pages would limit each CPU to 1 TXF page. Does that sound reasonable for unconstrained transactions? Probably not. Transactions operate on the basis of cache lines. A cache line appears to be about 128 bytes in size. While a page is 32 cache lines, a transaction might be dispersed across many pages: worst case, 1 cache line from each of multiple pages.

Some analysis of cache memory sizes in real mainframes might give a clue.

But is this number in Hercules the cause or the symptom?

If this is a new phenomenon, perhaps a look at what has changed recently might give a clue as to why it is happening.
That would suggest a new bug. Of course, maybe it is time related and the tests have been running longer than in the past.

I know, not much help. I would be reluctant to simply increase the value without understanding why the number is influencing the result, assuming it actually is.

This is an area I really doubt Hercules can replicate what happens on real mainframes, for better or worse for the transaction. I understand the desirability to replicate this phenomenon from a transaction development perspective, but can we?

@Fish-Git
Member Author

Fish-Git commented Nov 25, 2020

Real mainframes can have 64 CPUs.

The Windows version of Hercules only supports a maximum MAXCPU value of 64, but the Linux (gcc) version of Hercules supports a maximum MAXCPU value of 128. Real mainframes on the other hand, as far as I know, can support a lot more than that!

64 pages would limit each CPU to 1 TXF page.

Um, no. The MAX_TXF_PAGES value defines the maximum number of pages that can be modified for that CPU. It's not a system-wide value; it's a per-CPU value. (i.e. the txf_pagesmap field is in REGS, not SYSBLK):

TPAGEMAP txf_pagesmap[ MAX_TXF_PAGES ]; /* Page addresses */
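
Because the map is per CPU, any increase to MAX_TXF_PAGES is multiplied by the number of configured CPUs. The following is only a rough estimate, not Hercules code: it assumes 4 KB pages and two page-sized buffers per map entry (an alternate copy and a save copy, as the altpage/savepage handling in transact.c suggests), and it assumes those buffers exist for every entry. The helper name `txf_buffer_bytes` is hypothetical.

```c
#include <stdint.h>

#define EST_PAGE_SIZE_BYTES  4096u   /* z/Architecture page size */

/* Hypothetical helper (not a Hercules function): estimated bytes of
   transaction page buffers when every CPU owns its own page map and
   each map entry carries two page-sized copies (alternate + save). */
static uint64_t txf_buffer_bytes( unsigned max_txf_pages, unsigned num_cpus )
{
    return (uint64_t) max_txf_pages * 2u * EST_PAGE_SIZE_BYTES * num_cpus;
}
```

Under those assumptions, 64 pages on one CPU comes to 512 KB, while a limit of 1024 pages on 128 CPUs would come to 1 GB.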

A cache line appears to be about 128 bytes in size.

The cache line size on z is 256 bytes, not 128. Intel is 64 bytes, but z is 256.
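
To make the granularity mismatch concrete: with 256-byte lines there are 16 cache lines per 4 KB page, yet the Hercules limit counts pages, not lines. The sketch below is illustrative only (the constants and function names are hypothetical, not Hercules identifiers); it counts how many distinct pages a set of accessed addresses lands on.

```c
#include <stddef.h>
#include <stdint.h>

#define Z_PAGE_SIZE         4096u
#define Z_CACHE_LINE_SIZE    256u   /* cache line size on z, per the comment above */

/* Number of cache lines that fit in one page. */
static unsigned lines_per_page( void )
{
    return Z_PAGE_SIZE / Z_CACHE_LINE_SIZE;
}

/* Count the distinct pages touched by n addresses.
   Quadratic scan: fine for a sketch, not for production use. */
static size_t distinct_pages( const uint64_t *addrs, size_t n )
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++)
    {
        uint64_t page = addrs[i] / Z_PAGE_SIZE;
        int seen = 0;

        for (size_t j = 0; j < i; j++)
        {
            if (addrs[j] / Z_PAGE_SIZE == page)
            {
                seen = 1;
                break;
            }
        }
        if (!seen)
            count++;
    }
    return count;
}
```

In the worst case, a transaction that touches one cache line on each of 65 different pages references only 65 x 256 = 16,640 bytes of data but still counts as 65 pages, which is over a 64-page limit. Sixteen lines packed into a single page count as just one page.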

Some analysis of cache memory sizes in real mainframes might give a clue.

     L1 cache (per core):  128 KB I, 128 KB D
     L2 cache (per core):    4 MB I,   4 MB D

If I'm understanding you correctly, what you're proposing is defining a value that matches however much L1 and/or L2 Data cache a typical IBM mainframe has? Yes? I guess that makes sense. <shrug>

In which case it sounds like we should increase our value to 1024.
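
For reference, the 1024 figure is simply the 4 MB L2 data cache divided by the 4 KB page size. A small sketch of that arithmetic, using the cache sizes listed above (the function names are made up for illustration):

```c
#include <stdint.h>

#define CALC_PAGE_SIZE        4096u   /* 4 KB page */
#define CALC_CACHE_LINE_SIZE   256u   /* 256-byte cache line on z */

/* How many whole pages fit in a cache of the given size. */
static uint64_t pages_in_cache( uint64_t cache_bytes )
{
    return cache_bytes / CALC_PAGE_SIZE;
}

/* How many cache lines a cache of the given size holds. */
static uint64_t lines_in_cache( uint64_t cache_bytes )
{
    return cache_bytes / CALC_CACHE_LINE_SIZE;
}
```

With these numbers the 128 KB L1 data cache corresponds to 32 pages and the 4 MB L2 data cache to 1024 pages (or 16,384 cache lines). Whether real hardware ties its overflow limit to either cache size is not established in this thread.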

But is this number in Hercules the cause or the symptom?

Cause. Definitely the cause:

hyperion/transact.c

Lines 2036 to 2065 in eab6bf3

    /* If not mapped yet, capture real page and save a copy */
    if (!altpage)
    {
        /* Abort transaction if too many pages were touched */
        if (regs->txf_pgcnt >= MAX_TXF_PAGES)
        {
            int txf_tac = TXF_IS_FETCH_ACCTYPE() ?
                TAC_FETCH_OVF : TAC_STORE_OVF;
            regs->txf_why |= TXF_WHY_MAX_PAGES;
            PTT_TXF( "*TXF mad max", txf_tac, regs->txf_contran, regs->txf_tnd );
            regs->txf_why |= TXF_WHY_MAX_PAGES;
            ABORT_TRANS( regs, ABORT_RETRY_CC, txf_tac );
            UNREACHABLE_CODE( return maddr );
        }
        pageaddr = (BYTE*) addrpage;
        pmap = &regs->txf_pagesmap[ regs->txf_pgcnt ];
        altpage = pmap->altpageaddr;
        savepage = altpage + ZPAGEFRAME_PAGESIZE;
        /* Capture a copy of this page */
        memcpy( altpage, pageaddr, ZPAGEFRAME_PAGESIZE );
        memcpy( savepage, altpage, ZPAGEFRAME_PAGESIZE );
        /* Finish mapping this page */
        pmap->mainpageaddr = (BYTE*) addrpage;
        pmap->virtpageaddr = vaddr & ZPAGEFRAME_PAGEMASK;
        regs->txf_pgcnt++;
    }
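
The behaviour of that check can be modelled in isolation. The following is a simplified, hypothetical model and not the Hercules implementation: the `TxModel` type, the `tx_begin`/`tx_touch` names and the `MODEL_*` constants are all invented for illustration, and it leaves out the page copying entirely. It only shows that a page already in the map costs nothing, that a new page consumes one slot, and that the first new page beyond the limit yields TAC 7 for a fetch or TAC 8 for a store.

```c
#include <stdint.h>

#define MODEL_PAGE_SIZE      4096u
#define MODEL_MAX_PAGES      64     /* mirrors MAX_TXF_PAGES */
#define MODEL_TAC_FETCH_OVF  7      /* fetch overflow abort code */
#define MODEL_TAC_STORE_OVF  8      /* store overflow abort code */

/* Hypothetical per-CPU transaction state: just the page map. */
typedef struct
{
    uint64_t page_addr[ MODEL_MAX_PAGES ];  /* pages touched so far */
    int      pgcnt;                         /* entries in use       */
} TxModel;

/* The page count is reset when a transaction begins. */
static void tx_begin( TxModel *tx )
{
    tx->pgcnt = 0;
}

/* Record an access. Returns 0 if the access fits, otherwise the
   transaction abort code the access would trigger. */
static int tx_touch( TxModel *tx, uint64_t addr, int is_fetch )
{
    uint64_t page = addr & ~(uint64_t)(MODEL_PAGE_SIZE - 1);

    /* Already mapped: no new slot needed */
    for (int i = 0; i < tx->pgcnt; i++)
        if (tx->page_addr[i] == page)
            return 0;

    /* New page, but the map is full: overflow abort */
    if (tx->pgcnt >= MODEL_MAX_PAGES)
        return is_fetch ? MODEL_TAC_FETCH_OVF : MODEL_TAC_STORE_OVF;

    tx->page_addr[ tx->pgcnt++ ] = page;
    return 0;
}
```

Note that in this model fetches and stores share the same map, so a transaction that only reads from 65 pages overflows just as one that writes to them would.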

 

This is an area I really doubt Hercules can replicate what happens on real mainframes, for better or worse for the transaction. I understand the desirability to replicate this phenomenon from a transaction development perspective, but can we?

I suppose instead of hard coding it, we can make it a configuration file parameter instead? (with some greater-than-today's-value reasonable default if not specified) Maybe TXFMAXPAGES?
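
A minimal sketch of what validating such a parameter might look like. Everything here is hypothetical: no TXFMAXPAGES statement exists at this point, the function name `parse_txf_max_pages` is invented, the default of 1024 is just the figure floated above, and the lower and upper bounds are placeholders.

```c
#include <errno.h>
#include <stdlib.h>

#define TXFMAXPAGES_DEFAULT  1024   /* hypothetical default        */
#define TXFMAXPAGES_MIN        64   /* today's hard-coded value    */
#define TXFMAXPAGES_LIMIT    4096   /* hypothetical upper bound    */

/* Parse the operand of a hypothetical TXFMAXPAGES statement.
   A missing or empty operand selects the default.
   Returns 0 on success (value stored in *out), -1 if invalid. */
static int parse_txf_max_pages( const char *arg, int *out )
{
    char *end;
    long  val;

    if (!arg || !*arg)
    {
        *out = TXFMAXPAGES_DEFAULT;
        return 0;
    }

    errno = 0;
    val = strtol( arg, &end, 10 );

    /* Reject overflow, non-numeric input and trailing garbage */
    if (errno != 0 || end == arg || *end != '\0')
        return -1;

    /* Reject values outside the allowed range */
    if (val < TXFMAXPAGES_MIN || val > TXFMAXPAGES_LIMIT)
        return -1;

    *out = (int) val;
    return 0;
}
```

A run-time value would also mean the page map could no longer be a fixed-size array sized by a compile-time constant, so the map would need to be allocated when the configuration is read.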

@s390guy
Contributor

s390guy commented Nov 26, 2020

Sorry for the misinformation. It has been a LONG time since I looked at the details. But you did get the gist of where I was headed.

Based upon the stats, txf_pgcnt is doing exactly what it is supposed to do. The discussion then becomes one of how many of these aborts are acceptable. If every time the system hits this value we have the urge to increase it, does that not defeat the purpose of the value to begin with?

The value I am assuming you find disturbing is the 2% TAC 7 events. At one level I do not find disturbing that these events are occurring. They are programmed to do so. From appearances in the stats, Hercules is doing exactly what it is supposed to do.

What I find disturbing is that all of a sudden they started to occur when they had not and the inclination is to adjust MAX_TXF_PAGES.

Have you looked at the way txf_pgcnt is being managed? Is it being reduced properly such that it is not an ever increasing value and all of a sudden your test incremented it to a point that they started to occur? Hence the question about a "symptom" vs "cause".

It is easy to sit back and quarterback from the sidelines. So, I apologize if it appears as though that is what I am doing. In the past when I have had a problem I am having trouble fixing, I found it useful to get the input from somebody that knows little about the problem in an effort to look at the problem differently. That is all I am doing. Sometimes one can get locked into a way of looking at the situation that misses other approaches that may identify a more foundational situation. My hope is that the discussion is more helpful than aggravating, and I apologize if it's the latter.

@Fish-Git
Member Author

Fish-Git commented Nov 26, 2020

The discussion then becomes one of how many of these aborts are acceptable.

I would expect the counts for certain aborts -- TAC 7 & 8 (Fetch and Store Overflow), TAC 11 (Restricted Instruction) and TAC 13 (Nesting Depth Exceeded) -- to always be zero in a normally functioning system.

If every time the system hits this value we have the urge to increase it, does that not defeat the purpose of the value to begin with?

It depends on the reason the value is hit. In this case I believe it's because our value is indeed too low, especially given that the value was, as best as I can tell, just a pure guess to begin with.

If, however, after increasing it to a much more reasonable value (i.e. one matching the number of pages in the L2 data cache of IBM's largest system) we discover it is still being exceeded in what are believed to be normal situations, then I would like to believe there would be zero urge to increase it again but rather to try and determine a more likely reason. (That is to say, if after increasing it, we find it is still being regularly exceeded, then my urge would not be to increase it again but rather to look for a bug in Hercules code somewhere.)

What I find disturbing is that all of a sudden they started to occur ...

I find that to be more perplexing (puzzling) than disturbing. So far the occurrence is a one-off. I have not yet tried to reproduce it, especially since I don't know what I did to cause it to occur in the first place, if anything at all! I know next to nothing about z/OSMF, just as I know almost nothing about z/OS too!   :(

Have you looked at the way txf_pgcnt is being managed?

Yes. It's solid. It's initialized to zero at transaction begin and only ever incremented at the place where I showed you.

It is easy to sit back and quarterback from the sidelines. So, I apologize if it appears as though that is what I am doing.

Not at all! I appreciate the feedback.

In the past when I have had a problem I am having trouble fixing, I found it useful to get the input from somebody that knows little about the problem in an effort to look at the problem differently. That is all I am doing. Sometimes one can get locked into a way of looking at the situation that misses other approaches that may identify a more foundational situation.

Which is why I appreciate it! Years ago very early in my career we were having a problem with a new version of our software. So much so, they (home office) sent one of the developers (probably the guy that wrote the code I think) to our office to try and figure out why things were going wrong. (At the time I think it was only our office that was having the problem.)

After he had spent a long night debugging, I came in to find him sitting there with his program listing, looking rather tired and dejected. A look of total defeat. I asked him what was wrong and he told me he just couldn't figure out why xyz was happening and it was driving him crazy. Feeling badly for him, I asked if I could help. He asked how? I said "Walk me through your code." Since he had nothing to lose he agreed. (You can see where this is heading, right?)

He started explaining everything was fine at this point but once it got to this point things were wrong (or something like that; I can't remember the specific details). He explained "First, we call this routine to do such-and-such, and then we check this and then do that..." etc. At one point he said something like "And then we call this routine to do such-and-such and then...", at which point I stopped him (up until that point I had been following him closely, asking simple questions along the way, etc) and asked: "And you're absolutely sure that routine actually does do such-and-such?". He said "Of course! It's just a simple routine that does ......." He stopped talking and a funny look came over his face and he immediately flipped to the page where that function was, looked at it and after a brief second or two, shouted "That's it! That's the bug! I'm not doing such-and-such!" (or maybe it was "I'm presuming such-and-such is always true!" or something similar. You get the idea.)

He started thanking me profusely as he rushed over to the keypunch machine, punched the fix, and then ran into the machine room, assembled and linked the fix, and started another test, and of course things began working perfectly from that point on.

It's easy for a programmer to skim over parts of code and for their mind to presume code is doing certain things that it's in actuality not really doing (or not doing properly!). That's programming, and I don't know a single programmer that hasn't experienced what I just described at least once in their career. Having a "second set of eyes", a set of eyes belonging to a person who isn't harboring the same set of presumptions as you, helps a LOT.

So yeah, I do appreciate you taking the time to ask probing questions, Harold! I'm not angry or aggravated at all. I very much appreciate such feedback and find it to be quite valuable and helpful!

@s390guy
Contributor

s390guy commented Nov 26, 2020

Our discussion just leads me to the same place. We need to release this code to get more experience. We do not really know what is normal and what is not (panel.c issues aside). Making it more available should provide more input (for better or worse).

Sorry to beat that drum, but that is where I am with regards to TXF.

I do want to say that knowing what I do about how TXF works, that we have as good an implementation as we do totally amazes me.

@s390guy
Contributor

s390guy commented Nov 26, 2020

On the topic of releasing TXF...

TXF on Hercules, while released, should be considered experimental. TXF on all systems is sensitive to local load situations. Hercules is no different and may diverge positively or negatively from experience with other mainframe systems. Transactions that work on Hercules MAY not work on other systems and transactions that work on other systems MAY fail on Hercules.

I do not know what we should say about these differences in operation with regards to the developers' willingness to examine the situation. Initially anyway we should probably be more willing than later. Experience here will help us too. Learning what questions to ask and what data to gather.

@Fish-Git
Member Author

Fish-Git commented Dec 3, 2020

NOTE: This issue has been closed and is now being continued in a NEW GitHub issue, #339: "Transactional-Execution Facility... (continued) "

@Fish-Git Fish-Git closed this as completed Dec 3, 2020
@Fish-Git Fish-Git added (*MOVED*) (the original issue was moved into a different issue) and removed HELP! Help is needed from someone more experienced or I'm simply overloaded with too much work right now! IN PROGRESS... I'm working on it! (Or someone else is!) QUESTION... A question was asked but has not been answered yet, -OR- additional feedback is requested. labels Dec 3, 2020
Fish-Git added a commit that referenced this issue Dec 18, 2020
As mentioned in GitHub Issue #263 comment:

#263 (comment)