Energy error codes for issues in data received by DH from SDH #506

Closed
CDR-API-Stream opened this issue Apr 27, 2022 · 7 comments
Labels
Energy Proposal made The DSB has proposed a specific change to the standards to address the change request

Comments

@CDR-API-Stream
Collaborator

CDR-API-Stream commented Apr 27, 2022

Description

This CR is being raised to consult on the need for specific error codes to deal with issues in data received by an energy data holder from the secondary data holder. The need for this change request was identified during consultation on issue #478.

Area Affected

Energy APIs

Change Proposed

This CR is currently a placeholder. The DSB will publish recommended changes to be consulted on within the Maintenance Iteration it gets prioritised for.

DSB Proposed Solution

The current DSB proposal for this issue is in this comment.

@biza-io

biza-io commented Aug 6, 2022

Overview

The most significant difference between the existing CDR landscape and the upcoming energy sector activation is the delivery of data by data holders sourced, essentially transparently, from secondary data holder(s), ostensibly AEMO. Challenges with this method arise, particularly during error conditions because, without suitable description, an ADR is unable to determine if an error received is due to a system error on the Holder side or simply a reflection of an error received via the back channel from AEMO.

This is particularly relevant where error conditions may be transient in nature or network path dependent, for instance a faulty record stored within the AEMO data store, a transient API gateway issue within AEMO, non-compliance with the specification by AEMO, a connectivity issue with AEMO's internet connection or with AEMO MarketNet termination (which by default is an active/standby topology and is chosen on a case by case basis by a Retailer), or a fault in middleware between the CDR service delivery and AEMO (i.e. a transformation proxy managed by a separate infrastructure team). The inverse is also true: a Holder may have a transient error while accessing AEMO APIs (or preparing to access them) and yet the error response for the endpoint would likely be the same.

These combinations represent a situation where all parties (Recipient, Holder, Secondary Holder) are unable to disambiguate the source of encountered problems. In addition, Data Recipients can open incidents only with Holders, which will potentially leave Holders responsible for managing an increasing number of support requests on behalf of the Secondary Holder. The result is essentially "baked in" double handling of incidents by Holders, through no fault of their own, between one regulator (ACCC) and another (AEMO).

Scenarios

The following is a non-exhaustive list of scenarios where the ADR understanding the source of a failure would be of benefit but cannot currently be sufficiently communicated:

  1. SDH returns 400 urn:au-cds:error:cds-all:GeneralError/Expected when the Holder does not expect it and makes a spec compliant call
  2. SDH returns 503 urn:au-cds:error:cds-all:Service/Unavailable when the Holder themselves is functional
  3. SDH returns 5xx urn:au-cds:error:cds-all:GeneralError/Unexpected when the Holder makes a spec compliant call
  4. Holder returns 5xx urn:au-cds:error:cds-all:GeneralError/Unexpected to an API designated as a Shared Responsibility API, for instance, the Holder has an error looking up the authorisation details and/or transforming pairwise Service Point IDs to NMI for execution
  5. Version request mismatch resulting in either party returning 406 urn:au-cds:error:cds-all:Header/UnsupportedVersion
  6. SDH endpoint request times out or takes longer than the prescribed NFR
  7. SDH returns 429 Too Many Requests when an ADR has not reached the Holder request limit. The Retry-After header would also be completely invalid

As a third party providing SaaS CDR solutions, Biza.io is quite familiar with the challenges of the Secondary Data Holder model because, in simplistic terms, our customers are accessed using a similar topology. We cannot state that the above list is exhaustive, as error conditions and behaviours have been discovered over years of development, which is why our preferred solution is one of a global nature.

Potential Solutions

We propose one of the following solutions:

  1. Introduce a new error sub-type of cds-sdh allowing for all error codes to be communicated under this namespace on a 1:1 basis
  2. Specify SDH specific error messages for SDH endpoints which can be passed through by Holders
  3. Leave behaviour unspecified resulting in Holders making the decision themselves. We would currently transform this into error code 500 urn:au-cds:error:cds-all:GeneralError/Unexpected as we do not believe it would be appropriate nor technically accurate for a spec compliant Holder to do otherwise

Our current preference is (1) on the basis that it provides flexibility to communicate these errors in a consistent context with existing error behaviour.
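To illustrate option (1), here is a minimal sketch of how a Holder might relay an SDH error on a 1:1 basis by swapping only the namespace segment of the error URN. The `cds-sdh` URN prefix follows the proposal above but is not part of the standards, and the function name `to_sdh_namespace` and the sample `detail` text are assumptions made for illustration.

```python
CDS_ALL_PREFIX = "urn:au-cds:error:cds-all:"
# Hypothetical namespace from option (1); not defined in the standards.
CDS_SDH_PREFIX = "urn:au-cds:error:cds-sdh:"


def to_sdh_namespace(error: dict) -> dict:
    """Return a copy of an error object with its code moved to the cds-sdh namespace."""
    code = error["code"]
    if not code.startswith(CDS_ALL_PREFIX):
        # Unknown or already re-namespaced codes are passed through unchanged.
        return dict(error)
    return {**error, "code": CDS_SDH_PREFIX + code[len(CDS_ALL_PREFIX):]}


# Example: an error received from the SDH over the back channel (scenario 2).
sdh_error = {
    "code": "urn:au-cds:error:cds-all:Service/Unavailable",
    "title": "Service Unavailable",
    "detail": "Illustrative detail text from the SDH",
}
relayed = to_sdh_namespace(sdh_error)
print(relayed["code"])
```

Under this approach the HTTP status and the rest of the error object would be relayed as received; only the namespace tells the ADR where the error originated.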

@agl-cdrprogram

AGL has reviewed the previous comment from biza-io and concurs with the proposed Overview, Scenarios and Potential Solutions.

From an AGL standpoint, the current standards implementation for SDH does not support a mechanism to identify and respond with errors that are initiated by the SDH (in this case AEMO, though this would apply equally to VEC and EME).

With the current implementation, all AEMO errors will be returned as AGL errors. This makes it challenging on several fronts, namely:

  • Responsibility – It isn’t clear from the error as to who is responsible for the error. Responsibility will fall to AGL when in fact it is an AEMO error. This could easily lead to lengthy disputes between AGL and AEMO regarding who is responsible along with additional effort to disclose evidence in order to prove responsibility

  • Troubleshooting – Troubleshooting will always start with AGL and will force AGL to check its own systems as there is no way to initially know that it was an AEMO issue. With the current implementation this is an inefficient use of time/resources when the error could have pointed the troubleshooting to the responsible party

  • Reporting – Missed NFRs and related errors will be incorrectly reported against AGL. This is misleading from a reporting standpoint. AGL should only be reporting on AGL errors.

All scenarios highlighted by Biza-io are concerning with regards to points made above. In particular, Scenario 6 is quite complicated as AGL needs to put forward an intentional design for this. For example, if the NFR between the ADR and the ADH is 30 seconds, then AGL would ideally respond to the ADR within 30 seconds indicating that AEMO hasn’t responded via an appropriate error code. With the current implementation, there is no way to respond to indicate who caused the NFR to be missed.
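The intentional design described for Scenario 6 could be sketched roughly as follows: the Holder calls the SDH under a deadline shorter than its own NFR and, if the deadline passes, records that the SDH was the party that did not respond. This is a sketch under stated assumptions: the helper name `call_sdh_with_deadline`, the `"sdh_timeout"` reason string and the tiny deadlines used in the demonstration are all hypothetical, and the standards do not currently define how this outcome would be reported to the ADR.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def call_sdh_with_deadline(fetch, deadline_s):
    """Run fetch() in a worker thread.

    Returns (result, None) on success, or (None, "sdh_timeout") if the
    SDH call did not finish within deadline_s seconds.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fetch)
    try:
        return future.result(timeout=deadline_s), None
    except FutureTimeout:
        # The Holder can still answer the ADR in time and attribute the delay.
        return None, "sdh_timeout"
    finally:
        # Do not block the response while the slow call finishes.
        pool.shutdown(wait=False)


def slow_sdh():
    time.sleep(0.3)  # stands in for an SDH that exceeds the deadline
    return "usage data"


def fast_sdh():
    return "usage data"


slow_result = call_sdh_with_deadline(slow_sdh, deadline_s=0.05)
fast_result = call_sdh_with_deadline(fast_sdh, deadline_s=0.05)
```

In a real deployment the deadline would be derived from the applicable NFR (30 seconds in the example above) minus whatever time the Holder needs to build its own response.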

Furthermore, AEMO does not have the same level of non-functional obligations as ADHs. If AGL were to establish a true “shared responsibility” arrangement with a third party (in this case AEMO), it would always put a contract in place with that third party to ensure SLAs for non-functional obligations are clear and agreed by both parties. This is not possible with the current CDR model and so a solution is needed.

With regard to Potential Solutions, AGL’s preference is (1) from Biza-io, i.e. “1. Introduce a new error sub-type of cds-sdh allowing for all error codes to be communicated under this namespace on a 1:1 basis”. This is a low impact solution that leverages the existing error codes and extends them to support the SDH concept.

@CDR-API-Stream CDR-API-Stream moved this from Iteration Candidates to In Progress: Design in Data Standards Maintenance Aug 17, 2022
@CDR-API-Stream
Collaborator Author

It would appear that there is a clear need to convey to the ADR that an error has been returned from the Secondary Data Holder for a variety of reasons. The initial assumption of the DSB was that errors would simply be propagated but a good case is made here that there would still be value in distinctively identifying propagated errors as being from the Secondary Data Holder specifically.

Based on a review of the feedback, we propose the addition of a new error code to the Primary Data Holder variants of the Shared Responsibility APIs (i.e. the contracts called by the ADRs). This error code would be as follows:

Name: 500 - Secondary Data Holder Error
URN: urn:au-cds:error:cds-all:Secondary/PropagatedError

The meta tag of the error payload would then contain a field called propagatedErrors which would be an array of JSON objects containing:

  • status - A field containing the http status code of the error
  • detail - A field containing the payload of the error response (if any provided)

This would provide a single identifiable error type to the ADR but would also propagate all of the underlying detail that may be needed to understand any issues that occurred downstream.
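As a rough illustration of the payload described above, the following sketch builds an error response carrying the proposed URN and a `propagatedErrors` array under `meta`. The URN, the title and the field names `propagatedErrors`, `status` and `detail` come from the proposal; the function name, the data types (integer status, string detail), the placement of `meta` on the individual error object and the outer `detail` wording are assumptions.

```python
import json
from typing import Optional, Tuple

PROPAGATED_URN = "urn:au-cds:error:cds-all:Secondary/PropagatedError"


def build_propagated_error(sdh_status: int, sdh_body: Optional[str]) -> Tuple[int, dict]:
    """Wrap an SDH error in the proposed 500 Secondary Data Holder Error."""
    propagated = {"status": sdh_status}
    if sdh_body is not None:
        # The proposal carries the SDH payload only if one was provided.
        propagated["detail"] = sdh_body
    body = {
        "errors": [
            {
                "code": PROPAGATED_URN,
                "title": "Secondary Data Holder Error",
                "detail": "An error was returned by the secondary data holder",
                "meta": {"propagatedErrors": [propagated]},
            }
        ]
    }
    return 500, body


status, body = build_propagated_error(503, '{"errors": []}')
print(status, json.dumps(body, indent=2))
```

An ADR receiving this response would see a single 500 error type and would need to inspect `meta.propagatedErrors` to recover the original status and payload.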

It is important to note that there are some scenarios where a Primary Data Holder will receive an error from the Secondary Data Holder which is a valid scenario and should not be propagated. For instance, if the ADR calls the Get Service Points end point and the Primary Data Holder translates that into requests for three specific NMIs, one of which is invalid, they will receive an error that they should process correctly. This scenario should not result in error propagation.

@CDR-API-Stream CDR-API-Stream added the Proposal made The DSB has proposed a specific change to the standards to address the change request label Aug 17, 2022
@PratibhaOrigin

Origin Energy concurs with the concerns raised by Biza and AGL above regarding error handling and the need for specific error codes to differentiate primary data holder errors from secondary data holder errors.

This concern has been raised previously by Origin during multiple calls, DPs and consultations such as DP 154.
This has always been a concern when retailers are merely working as a pass-through service for AEMO-held data, with the customer having no visibility of whether the data is coming from the retailer or AEMO. In the absence of specific SDH error codes, even the ADR will assume the errors are the primary data holder's. This will impact branding and reporting, to start with.

Considering the tight timelines, we support Option 1 from Biza's suggested solutions: “1. Introduce a new error sub-type of cds-sdh allowing for all error codes to be communicated under this namespace on a 1:1 basis”.

@benkolera

It's worth noting here that I think that stuffing these errors into a new 500 error actually makes things harder for the ADR, not easier. For errors that are about the ADR input like invalid field, the ADR now has to look in two places in their code to figure out if they messed up an input.

It's much more consistent for a client to be able to have all 400 errors come out to the same http code and in the same format as usual. Their error processing code giving feedback to their user is very unlikely to care about whether the error comes from the secondary or primary data holder, but the information is useful none the less for their logging so that an ADR can talk directly to the secondary data holder if the error clearly came from them.

There are two distinct issues here:

  • parsing the errors and giving the end user a response. This takes into account the http status and error code.
  • logging the source of the error so an ADR can talk to the party responsible for the error if they have an issue/question.

The 500 and new code makes the second point very obvious (the error came from the secondary data holder), but makes the first point a lot harder because the ADR now needs two error parsing code paths, one for the secondary and one for the primary data holder.

Using the same http codes but a separate error code in the JSON would make things a lot easier. The ADR would match the error code within either the cds-sdh or the cds-all namespace, treating the two as equivalent for UI purposes while distinguishing them for logging. It's worth noting that this takes the code matching away from a simple string equality, but it gives the most gain on the two points above.
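A minimal sketch of the ADR-side matching described here, assuming the hypothetical `cds-sdh` namespace sits in the same URN position as the existing `cds-all` one. The function names and the log message format are illustrative only.

```python
URN_PREFIX = "urn:au-cds:error:"


def split_error_code(code: str):
    """Split an error URN into (namespace, error_type)."""
    if not code.startswith(URN_PREFIX):
        raise ValueError(f"not a CDS error URN: {code}")
    namespace, _, error_type = code[len(URN_PREFIX):].partition(":")
    return namespace, error_type


def handle_error(http_status: int, error: dict, log: list) -> str:
    """Log the source of the error and return the key the UI logic branches on."""
    namespace, error_type = split_error_code(error["code"])
    # Assumption: "cds-sdh" marks errors that originated at the secondary data holder.
    source = "secondary" if namespace == "cds-sdh" else "primary"
    log.append(f"{http_status} {error_type} from {source} data holder")
    return error_type


log = []
ui_key_primary = handle_error(400, {"code": "urn:au-cds:error:cds-all:Field/Invalid"}, log)
ui_key_secondary = handle_error(400, {"code": "urn:au-cds:error:cds-sdh:Field/Invalid"}, log)
```

Both calls yield the same UI key, so one code path gives feedback to the user, while the log records which party to contact.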

@CDR-API-Stream
Collaborator Author

Below is the summary of the solution that was proposed, discussed and agreed by the participants during the MI call on 31st August 2022:

Summary

  • Define an optional Boolean field in the error response structure which would indicate whether a given error is due to the secondary data holder
  • The field would be available to implement optionally for November 15th energy go live
  • An FDO will be set after which it is mandatory for affected DHs to implement
  • The description needs to make the intent of the flag very clear so that it cannot be misinterpreted or used in inapplicable scenarios (e.g. an error resulting from a third party vendor issue)
  • Given this will be an optionally implementable change, this would not need to be flagged as urgent

Key benefits

  • General solution covering all potential scenarios
  • Indicate the source of an error
  • No impact to existing/in-flight implementations

Proposal

Based on the above, the DSB recommends the following addition to the Error response structure

  • isSecondaryDataHolderError field MAY be present: an optional Boolean flag which indicates the error is propagated from a designated secondary data holder

Additional Notes

  • Data Holders MAY implement this field on November 15 2022
  • Affected Data Holders MUST implement this field by April 7th 2023
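The agreed flag could be handled along these lines. The field name `isSecondaryDataHolderError` comes from the proposal; its placement at the top level of the error response body, and the helper names, are assumptions made for this sketch, since the text above only says it is added to the error response structure.

```python
def error_response(errors: list, from_sdh: bool = False) -> dict:
    """Build an error response body, adding the optional flag only when it applies."""
    body = {"errors": errors}
    if from_sdh:
        # Optional field: omitted entirely for errors raised by the Holder itself.
        body["isSecondaryDataHolderError"] = True
    return body


def is_sdh_error(body: dict) -> bool:
    """ADR-side check that treats a missing flag as 'not an SDH error'."""
    return body.get("isSecondaryDataHolderError", False) is True


unexpected = {
    "code": "urn:au-cds:error:cds-all:GeneralError/Unexpected",
    "title": "Unexpected Error Encountered",
    "detail": "Illustrative detail text",
}
holder_body = error_response([unexpected])
sdh_body = error_response([unexpected], from_sdh=True)
```

Because the flag is absent by default, Data Holders that have not yet implemented it produce responses that existing ADR error handling reads exactly as before.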

Feedback on the above is welcome.

@CDR-API-Stream CDR-API-Stream moved this from In Progress: Design to In Progress: Staging in Data Standards Maintenance Oct 4, 2022
@CDR-API-Stream
Collaborator Author

This change has been staged for review: ConsumerDataStandardsAustralia/standards-staging@release/1.20.0...maintenance/506

Note: The FDO has been changed to May 15th 2023 to align with the tranche 2 release of the Energy sector, as per feedback


6 participants