[BUG] SR1 1.14.0 - Some jobs failed with the error "CFI Orbit interpolation failed." #1029

Open
5 of 12 tasks
Woljtek opened this issue Jul 7, 2023 · 10 comments
Labels
bug Something isn't working CCB Issue for CCB ipf Limitation The issue causes limitations ops Ticket from ADS operation team priority:major Set the priority to major because the production is heavily impacted S3 Relative to Sentinel-3 RS Addons

Comments

@Woljtek

Woljtek commented Jul 7, 2023

Environment:

  • Delivery tag: 2.0.0-rc2
  • Platform: OPS Orange Cloud
  • Configuration: SR1-NRT 1.14.0 with PREINT mode (see this branch)

Traceability:

Current Behavior:
During the NON-REGRESSION test with PREINT, we observed that several executions ended with the following error:

2023-07-07T13:37:43.565844 s3-sr1-nrt-preint-part2-execution-worker-v18-74b88978fc-g249k SR1 07.04 [0000001729]: [I] NAVATT Reader: The NAVATT cover the processing window.
2023-07-07T13:37:43.570228 s3-sr1-nrt-preint-part2-execution-worker-v18-74b88978fc-g249k SR1 07.04 [0000001729]: [W] Using alternative orbit file: [1]: /data/localWD/13245/Orbit_Scratch.EEF
2023-07-07T13:37:43.570278 s3-sr1-nrt-preint-part2-execution-worker-v18-74b88978fc-g249k SR1 07.04 [0000001729]: [I] Get mission name (may be different from S3 for test purposes)
2023-07-07T13:37:43.570295 s3-sr1-nrt-preint-part2-execution-worker-v18-74b88978fc-g249k SR1 07.04 [0000001729]: [I] Mission name key is S3
2023-07-07T13:37:43.570787 s3-sr1-nrt-preint-part2-execution-worker-v18-74b88978fc-g249k SR1 07.04 [0000001729]: [I] Satellite ID (3A: 129 - 3B: 130 - 3C: 131 - CRYOSAT: 41 - ENVISAT: 21): 129
2023-07-07T13:37:43.573012 s3-sr1-nrt-preint-part2-execution-worker-v18-74b88978fc-g249k SR1 07.04 [0000001729]: [E] CFI Orbit interpolation failed.
2023-07-07T13:37:43.573053 s3-sr1-nrt-preint-part2-execution-worker-v18-74b88978fc-g249k SR1 07.04 [0000001729]: [E] Message error 1: Fatal Error in: OrbitId::init
2023-07-07T13:37:43.573065 s3-sr1-nrt-preint-part2-execution-worker-v18-74b88978fc-g249k SR1 07.04 [0000001729]: [E] Message error 2: EXPLORER_ORBIT >>> WARNING in xo_orbit_init_file: Warnings while computing ANX data
2023-07-07T13:37:43.573125 s3-sr1-nrt-preint-part2-execution-worker-v18-74b88978fc-g249k SR1 07.04 [0000001729]: [E] Pre Processor task FAILED
2023-07-07T13:37:43.573147 s3-sr1-nrt-preint-part2-execution-worker-v18-74b88978fc-g249k SR1 07.04 [0000001729]: [E] Exiting with EXIT CODE 136
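
When triaging many worker logs, it can help to pull out only the `[E]` lines. Below is a minimal sketch, assuming every IPF log line follows the layout seen in the excerpt above (timestamp, worker pod, processor name, version, a bracketed number, a bracketed level letter, then the message). The function names and the `seq` field name are hypothetical; the meaning of the bracketed number is not stated in this issue.

```python
import re

# Assumed layout, inferred only from the lines quoted in this issue:
# <timestamp> <worker pod> <processor> <version> [<number>]: [<level>] <message>
LOG_LINE = re.compile(
    r"^(?P<timestamp>\S+)\s+(?P<worker>\S+)\s+(?P<processor>\S+)\s+"
    r"(?P<version>\S+)\s+\[(?P<seq>\d+)\]:\s+\[(?P<level>[A-Z])\]\s+(?P<message>.*)$"
)


def parse_ipf_line(line):
    """Return the fields of one IPF log line as a dict, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def error_messages(lines):
    """Collect the messages of all lines logged at level [E]."""
    parsed = (parse_ipf_line(line) for line in lines)
    return [p["message"] for p in parsed if p and p["level"] == "E"]
```

Feeding the excerpt above to `error_messages` would return the five `[E]` messages, starting with "CFI Orbit interpolation failed.".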

Expected Behavior:
The addon shall be able to compute all products (it worked with 1.13.x)

Steps To Reproduce:
Run the PREINT procedure with the dataset s3://ops-rs-preint/s3/NRT/S3-SR1/input-data/

Test execution artefacts (i.e. logs, screenshots…)
image.png
Each failed job is restarted 3 times before being discarded (8 jobs in error).

An example NOK Job: (from s3://ops-rs-failed-workdir/s3-sr1-nrt-preint-part2-execution-worker-v18-74b88978fc-g249k_S3B_SR_0_SRA____20230409T214430_20230409T215430_20230410T001919_0599_078_129______LN3_D_NR_002.SEN3_bf0677b7-e91e-46d2-9076-c7dea83f05ee_0/)

Full logs of EW:
https://app.zenhub.com/files/398313496/deccd093-7653-4a8b-96b0-d818d54122ba/download

Bug Generic Definition of Ready (DoR)

  • The affected version in which the bug has been found is mentioned
  • The context and environment of the bug is detailed
  • The description of the bug is clear and unambiguous
  • The procedure (steps) to reproduce the bug is clearly detailed
  • The tested User Story / features is linked to the bug if available
  • Logs are attached if available
  • A data set is attached if available

Bug Generic Definition of Done (DoD)

  • The modification implemented (the solution to fix the bug) is described in the bug.
  • Unit tests & Continuous integration performed - Test results available - Structural Test coverage reported by SONAR
  • Code committed in GIT with right tag or Analysis/Trade Off documentation up-to-date in reference-system-documentation repository
  • Code is compliant with coding rules (SONAR Report as evidence)
  • Acceptance criteria of the related User story are checked and Passed
@Woljtek Woljtek added bug Something isn't working CCB Issue for CCB ops Ticket from ADS operation team priority:blocking Set the priority to blocking because the production is blocked S3 Relative to Sentinel-3 RS Addons labels Jul 7, 2023
@suberti-ads

suberti-ads commented Jul 7, 2023

Here are 3 sample jobs from the failed processing:
Job 13297
job13297.log

Job created by:
S3B_SR_0_SRA____20230409T195322_20230409T200322_20230409T210750_0599_078_128______LN3_D_NR_002.SEN3
Input used:
S3B_SR_0_SRA____20230409T194322_20230409T195322_20230409T205329_0599_078_127______LN3_D_NR_002.SEN3
S3B_SR_0_SRA____20230409T195322_20230409T200322_20230409T210750_0599_078_128______LN3_D_NR_002.SEN3
S3B_SR_0_SRA____20230409T200322_20230409T200343_20230409T224829_0020_078_128______LN3_D_NR_002.SEN3

Job 13341
job13341.log

Job created by:
S3B_SR_0_SRA____20230409T200322_20230409T200343_20230409T224829_0020_078_128______LN3_D_NR_002.SEN3
Input used:
S3B_SR_0_SRA____20230409T195322_20230409T200322_20230409T210750_0599_078_128______LN3_D_NR_002.SEN3
S3B_SR_0_SRA____20230409T200322_20230409T200343_20230409T224829_0020_078_128______LN3_D_NR_002.SEN3
S3B_SR_0_SRA____20230409T200343_20230409T201343_20230409T223709_0599_078_128______LN3_D_NR_002.SEN3

Job 13342
job13342.log

Job created by:
S3B_SR_0_SRA____20230409T200343_20230409T201343_20230409T223709_0599_078_128______LN3_D_NR_002.SEN3
Input used:
S3B_SR_0_SRA____20230409T200322_20230409T200343_20230409T224829_0020_078_128______LN3_D_NR_002.SEN3
S3B_SR_0_SRA____20230409T200343_20230409T201343_20230409T223709_0599_078_128______LN3_D_NR_002.SEN3
S3B_SR_0_SRA____20230409T201343_20230409T202343_20230409T224428_0599_078_128______LN3_D_NR_002.SEN3
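
To check whether the inputs of a job are contiguous in sensing time, the timestamps embedded in the product names can be compared. This is a rough sketch, assuming the first two `YYYYMMDDTHHMMSS` fields of a product name are the sensing start and stop; the helper names `sensing_window` and `gaps_seconds` are hypothetical.

```python
import re
from datetime import datetime

# Matches timestamps such as 20230409T195322 inside a product name.
TIMESTAMP = re.compile(r"\d{8}T\d{6}")


def sensing_window(product_name):
    """Return (start, stop), assuming the first two timestamps are sensing start and stop."""
    stamps = TIMESTAMP.findall(product_name)
    if len(stamps) < 2:
        raise ValueError(f"no sensing window found in {product_name!r}")
    start, stop = (datetime.strptime(s, "%Y%m%dT%H%M%S") for s in stamps[:2])
    return start, stop


def gaps_seconds(product_names):
    """Gap in seconds between each product's stop and the next product's start."""
    windows = sorted(sensing_window(name) for name in product_names)
    return [(nxt[0] - cur[1]).total_seconds() for cur, nxt in zip(windows, windows[1:])]
```

A gap of zero means two consecutive inputs are back to back; a positive value would indicate missing coverage between them, and a negative value an overlap.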

@Woljtek Woljtek added the WERUM dev Ticket dedicated to WERUM development label Jul 10, 2023
@COPRS COPRS deleted a comment from w-jka Jul 10, 2023
@suberti-ads

Republished from a previous message from @w-jka (deleted by mistake)

From the provided logs and AppDataJob extracts I could not find any problems on our side. As Florian is on vacation this week, I do not have access to the documentation of the processors, so I cannot check whether the ICD of the processor contains any additional information regarding exit code 136.

From the logs the provided orbit files are fine and, while not first in priority, are listed by the new tasktable. The processor itself states that the files are good to go before running into an error.
Based on this analysis the root cause of this issue seems to be in the IPF itself.

@Woljtek
Author

Woljtek commented Jul 11, 2023

A PSC issue is opened => https://esa-csc-gs.atlassian.net/browse/PSC-63
Waiting for ESA's answer. I propose moving this issue to 'On Hold'.

@Woljtek Woljtek added the ipf label Jul 11, 2023
@LAQU156

LAQU156 commented Jul 12, 2023

Werum_CCB_2023_w28 : Moved into "Refused Werum" to place it into "On hold" pipeline in CCB Board, waiting for ESA answer

@w-fsi

w-fsi commented Jul 17, 2023

@Woljtek : I agree with this approach. As @w-jka pointed out, it looks unlikely to be an issue within our software, as there was no change on our side and the kind of error looks more like an issue within the IPF itself. For C/C++ programs, exit code 136 usually means 128 + 8, i.e. the process was terminated by signal 8 (SIGFPE), which can be raised by a floating-point exception or by an integer division by zero or overflow. This is very likely an issue within the IPF, as its code is executed as a black box on our side.
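
For illustration, the 128 + signal-number convention used by shells can be decoded with a few lines of Python. This is only a sketch; the function name `describe_exit_code` is hypothetical, and the signal name is resolved from the platform running the script (signal 8 is SIGFPE on Linux).

```python
import signal


def describe_exit_code(code):
    """Interpret an exit status using the shell convention 128 + signal number."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # The number does not map to a signal known on this platform.
            name = "unknown signal"
        return f"terminated by signal {signum} ({name})"
    return f"exited with status {code}"


print(describe_exit_code(136))
```

Note that this convention only tells us which signal ended the process, not where in the IPF the fault was raised.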

@pcuq-ads

pcuq-ads commented Jul 19, 2023

IVV_CCB_2023_w29 : moved to accepted OPS.
@SYTHIER-ADS Could you have a look at this issue?

@vgava-ads vgava-ads removed the WERUM dev Ticket dedicated to WERUM development label Jul 19, 2023
@SYTHIER-ADS

My understanding of the issue is that in degraded cases (missing ROE_AX and DO_0_NAV) the CFI is using TM_0_NAT and in this case the initialisation of the orbit fails. This point is linked to the change of the version of EO CFI inside the CFI.
I would suggest lowering this anomaly to Major, as it only impacts a degraded case; note that the benchmark was performed with this version of the IPF without error (DO_0_NAV files were available). In parallel, an anomaly is to be created on the IPF.

@pcuq-ads

pcuq-ads commented Jul 26, 2023

System_CCB_2023-w30 : The issue is on CFI side for a degraded case. Priority reduced to major.

@pcuq-ads pcuq-ads added priority:major Set the priority to major because the production is heavily impacted and removed priority:blocking Set the priority to blocking because the production is blocked labels Jul 26, 2023
@vgava-ads vgava-ads added the Limitation The issue causes limitations label Aug 7, 2023
@suberti-ads

4 new occurrences on SR1-NRT

2024-04-11T17:16:23+00:00	{"header":{"type":"LOG","timestamp":"2024-04-11T17:16:23.065928Z","level":"INFO","line":129,"file":"TaskCallable.java","thread":"pool-77-thread-1"},"message":{"content":"Ending task /usr/local/components/S3IPF_SR1_07.04/bin/SR_1_PRE.bin with exit code 136"},"custom":{"logger_string":"esa.s1pdgs.cpoc.ipf.execution.worker.job.process.TaskCallable"}}
2024-04-11T17:16:23+00:00	2024-04-11T17:16:23.063472 s3-sr1-nrt-part1-execution-worker-v16-7d4d859bc4-vnnkd SR1 07.04 [0000000334]: [E] Exiting with EXIT CODE 136
2024-04-11T17:16:23+00:00	2024-04-11T17:16:23.063449 s3-sr1-nrt-part1-execution-worker-v16-7d4d859bc4-vnnkd SR1 07.04 [0000000334]: [E] Pre Processor task FAILED
2024-04-11T17:16:23+00:00	2024-04-11T17:16:23.063392 s3-sr1-nrt-part1-execution-worker-v16-7d4d859bc4-vnnkd SR1 07.04 [0000000334]: [E] Message error 2: EXPLORER_ORBIT >>> WARNING in xo_orbit_init_file: Warnings while computing ANX data
2024-04-11T17:16:23+00:00	2024-04-11T17:16:23.063380 s3-sr1-nrt-part1-execution-worker-v16-7d4d859bc4-vnnkd SR1 07.04 [0000000334]: [E] Message error 1: Fatal Error in: OrbitId::init
2024-04-11T17:16:23+00:00	2024-04-11T17:16:23.063335 s3-sr1-nrt-part1-execution-worker-v16-7d4d859bc4-vnnkd SR1 07.04 [0000000334]: [E] CFI Orbit interpolation failed.

Note : CAMS Ticket on this issue : 4118

@suberti-ads

4 new occurrences on SR1-NRT

[code 290] [exitCode 136] [msg Task /usr/local/components/S3IPF_SR1_07.04/bin/SR_1_PRE.bin failed]

with following logs:


2024-07-03T19:40:31+00:00   {"header":{"type":"LOG","timestamp":"2024-07-03T19:40:31.011928Z","level":"INFO","line":129,"file":"TaskCallable.java","thread":"pool-17-thread-1"},"message":{"content":"Ending task /usr/local/components/S3IPF_SR1_07.04/bin/SR_1_PRE.bin with exit code 136"},"custom":{"logger_string":"esa.s1pdgs.cpoc.ipf.execution.worker.job.process.TaskCallable"}}
2024-07-03T19:40:31+00:00   2024-07-03T19:40:31.009483 s3-sr1-nrt-part1-execution-worker-v17-67c7779d6c-wpc6t SR1 07.04 [0000000122]: [E] Exiting with EXIT CODE 136
2024-07-03T19:40:31+00:00   2024-07-03T19:40:31.009458 s3-sr1-nrt-part1-execution-worker-v17-67c7779d6c-wpc6t SR1 07.04 [0000000122]: [E] Pre Processor task FAILED
2024-07-03T19:40:31+00:00   2024-07-03T19:40:31.009377 s3-sr1-nrt-part1-execution-worker-v17-67c7779d6c-wpc6t SR1 07.04 [0000000122]: [E] Message error 2: EXPLORER_ORBIT >>> WARNING in xo_orbit_init_file: Warnings while computing ANX data
2024-07-03T19:40:31+00:00   2024-07-03T19:40:31.009365 s3-sr1-nrt-part1-execution-worker-v17-67c7779d6c-wpc6t SR1 07.04 [0000000122]: [E] Message error 1: Fatal Error in: OrbitId::init
2024-07-03T19:40:31+00:00   2024-07-03T19:40:31.009317 s3-sr1-nrt-part1-execution-worker-v17-67c7779d6c-wpc6t SR1 07.04 [0000000122]: [E] CFI Orbit interpolation failed.
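
To count such occurrences automatically, the JSON records emitted by the execution worker can be parsed for the task name and exit code. The sketch below assumes each wrapper line carries an optional timestamp prefix followed by a JSON object with a `message.content` field, as in the `TaskCallable.java` lines above; the function name `task_exit` is hypothetical.

```python
import json
import re

# Pattern taken from the "Ending task ... with exit code N" message quoted above.
EXIT_CODE = re.compile(r"Ending task (?P<task>\S+) with exit code (?P<code>-?\d+)")


def task_exit(line):
    """Return (task path, exit code) for a worker JSON log line, else None."""
    start = line.find("{")
    if start < 0:
        return None  # plain IPF line without a JSON payload
    try:
        record = json.loads(line[start:])
    except json.JSONDecodeError:
        return None
    content = record.get("message", {}).get("content", "")
    match = EXIT_CODE.search(content)
    if not match:
        return None
    return match.group("task"), int(match.group("code"))
```

Filtering the results on exit code 136 would then give the jobs affected by this issue, under the assumption that the log format stays as shown.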

7 participants