Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[OPS] [BUG] Preparation job stuck during 15 hours before being generated #1088

Open
12 tasks
suberti-ads opened this issue Apr 19, 2024 · 1 comment
Open
12 tasks
Labels
bug Something isn't working CCB Issue for CCB ops Ticket from ADS operation team

Comments

@suberti-ads
Copy link

suberti-ads commented Apr 19, 2024

Environment:

  • Delivery tag:
  • Platform: OPS Orange Cloud
  • Configuration:
    rs-addon s3-l0p v1.14

Traceability:

Current Behavior:
On sample processing, we forecast to download 2 sessions S3B

  • S3B_20240411160247031055 (ingestion end time 16:28)
  • S3B_20240411142312031054 (ingestion end time 16:02)

All L0 Product have been generated during this period except 4 products generated after more than 15h00:
image

hereafter we saw for HKTM prouct 2 of these with ingestion_time over 15hours (ingestion_time is difference between start_processing_date and t0_pdgs_date)

monitoring=# SELECT
   filename,
   (generation_begin_date - product.t0_pdgs_date) As ingestion_duration,
   (generation_end_date - generation_begin_date) As generation_duration,
   (prip_storage_end_date - generation_end_date) As distribution_duration,
   (prip_storage_end_date - product.t0_pdgs_date) As total_duration
   FROM product
   LEFT JOIN output_list ON  output_list.product_id = product.id
   LEFT JOIN processing ON processing.id = output_list.processing_id
   WHERE
    (processing_date - product.t0_pdgs_date) > INTERVAL '30 MINUTES' AND
    processing.rs_chain_name='S3-L0P' AND
    (processing.processing_date BETWEEN '2024-04-11 14:00:00' AND '2024-04-18 08:00:00') AND
    mission = 'S3'AND
    (product.filename like '%HK%'  OR product.filename like '%NAT%')
    ORDER BY processing_date;
                                              filename                                               | ingestion_duration | generation_duration | distribution_duration |   total_durat
ion    
-----------------------------------------------------------------------------------------------------+--------------------+---------------------+-----------------------+--------------
-------
 S3B_TM_0_NAT____20240411T124346_20240411T142313_20240411T162634_5967_091_366______LN3_O_AL_002.SEN3 | 01:55:25.419       | 00:00:14.565        | 4 days 23:08:28.358   | 5 days 01:04:
08.342
 S3B_TM_0_NAT____20240411T124346_20240411T142324_20240411T162649_5978_091_366______LN3_O_AL_002.SEN3 | 01:55:40.018       | 00:00:14.122        | 4 days 22:55:40.048   | 5 days 00:51:
34.188
 S3B_TM_0_HKM____20240411T124345_20240411T142324_20240411T165101_5979_091_366______LN3_O_NR_002.SEN3 | 02:19:52.622       | 00:00:18.594        | 4 days 22:55:07.814   | 5 days 01:15:
19.03
 S3B_TM_0_NAT____20240411T142325_20240411T160242_20240411T165120_5957_091_367______LN3_O_AL_002.SEN3 | 00:40:34.216       | 00:00:14.154        | 4 days 22:20:05.68    | 4 days 23:00:
54.05
 S3B_TM_0_HKM2___20240411T124346_20240411T142324_20240411T165208_5978_091_366______LN3_O_NR_002.SEN3 | 02:20:59.16        | 00:00:14.084        | 4 days 22:19:18.061   | 5 days 00:40:
31.305
 S3B_TM_0_NAT____20240411T142325_20240411T160259_20240411T165222_5973_091_367______LN3_O_AL_002.SEN3 | 00:41:36.243       | 00:00:13.701        | 4 days 22:42:32.508   | 4 days 23:24:
22.452
 S3B_TM_0_HKM2___20240411T142325_20240411T160259_20240412T080211_5973_091_367______LN3_O_NR_002.SEN3 | 15:51:25.854       | 00:00:14.223        | 4 days 07:22:39.31    | 4 days 23:14:
19.387
 S3B_TM_0_HKM____20240411T142324_20240411T160259_20240412T080240_5974_091_367______LN3_O_NR_002.SEN3 | 15:51:54.262       | 00:00:19.31         | 4 days 07:03:27.589   | 4 days 22:55:
41.161
(8 rows)

We saw these 4 job have been generated at about 08:00:00 by the preparation_worker

Products and jobs impacted:
JobOrder.149473.xml (MW0)
JobOrder.149509.xml (HKTM2)
JobOrder.149472.xml (DOP)
JobOrder.149494.xml (HKTM)

Expected Behavior:
Job should be successfully generated by preparation once all data needed data were available.

Steps To Reproduce:
This issue has been detected on sample processing S3

Test execution artefacts (i.e. logs, screenshots…)
Hereafter preparation log
s3-l0p-part1-preparation-worker-v52-d9dcb799f-9fzpj.log.gz

sample mongo job 149494:
joborder149494.json

sample mongo job 149509:
joborder149509.json

Whenever possible, first analysis of the root cause
On preparation log we saw these 4 jobs seems not detected as completed due to missing TM_0_NAT products.
sample for last job created 149494
Detection search start:

{"header":{"type":"LOG","timestamp":"2024-04-11T16:52:31.189641Z","level":"INFO","line":67,"file":"InputSearchService.java","thread":"KafkaConsumerDestination{consumerDestinationName='s3-l0p-part1.message-filter', partitions=4, dlqName='error-warning'}.container-0-C-1"},"message":{"content":"Start main input search for AppDataJob 149494"},"custom":{"logger_string":"esa.s1pdgs.cpoc.preparation.worker.service.InputSearchService"}}

Product detected (we saw NAT products)

{"header":{"type":"LOG","timestamp":"2024-04-11T16:52:31.199163Z","level":"INFO","line":653,"file":"MetadataClient.java","thread":"KafkaConsumerDestination{consumerDestinationName='s3-l0p-part1.message-filter', partitions=4, dlqName='error-warning'}.container-0-C-1"},"message":{"content":"Metadata query for family 'S3_L0' and product type 'TM_0_NAT___' returned 2 results"},"custom":{"logger_string":"esa.s1pdgs.cpoc.metadata.client.MetadataClient"}}
{"header":{"type":"LOG","timestamp":"2024-04-11T16:52:31.208419Z","level":"INFO","line":653,"file":"MetadataClient.java","thread":"KafkaConsumerDestination{consumerDestinationName='s3-l0p-part1.message-filter', partitions=4, dlqName='error-warning'}.container-0-C-1"},"message":{"content":"Metadata query for family 'S3_L0' and product type 'TM_0_NAT___' returned 2 results"},"custom":{"logger_string":"esa.s1pdgs.cpoc.metadata.client.MetadataClient"}}

But it seems not enought to generate execution:

{"header":{"type":"LOG","timestamp":"2024-04-11T16:52:31.208688Z","level":"INFO","line":70,"file":"InputSearchService.java","thread":"KafkaConsumerDestination{consumerDestinationName='s3-l0p-part1.message-filter', partitions=4, dlqName='error-warning'}.container-0-C-1"},"message":{"content":"Main input search did not complete successfully: [input TM_0_NAT___] [reason Incomplete result set for TM_0_NAT___][msg Missing inputs]"},"custom":{"logger_string":"esa.s1pdgs.cpoc.preparation.worker.service.InputSearchService"}}
{"header":{"type":"LOG","timestamp":"2024-04-11T16:52:31.236382Z","level":"INFO","line":119,"file":"PreparationWorkerService.java","thread":"KafkaConsumerDestination{consumerDestinationName='s3-l0p-part1.message-filter', partitions=4, dlqName='error-warning'}.container-0-C-1"},"message":{"content":"End preparation of new execution jobs"},"custom":{"logger_string":"esa.s1pdgs.cpoc.preparation.worker.service.PreparationWorkerService"}}

Once receive new AUX data the following days at 08:00:11, it seems restart job generation processus whhich raised an timeout which force processing

{"header":{"type":"REPORT","timestamp":"2024-04-12T08:00:11.301000Z","level":"INFO","mission":"S3","workflow":"NOMINAL","rs_chain_name":"S3-L0P","rs_chain_version":"1.14.0"},"message":{"content":"Received CatalogEvent for S3A_AX___FPO_AX_20240412T000000_20240419T000000_20240412T064627___________________EUM_O_AL_001.SEN3"},"task":{"uid":"6e330276-84ab-4261-bae0-2e252d9262db","name":"ProductionTrigger","event":"BEGIN","input":{"filename_string":"S3A_AX___FPO_AX_20240412T000000_20240419T000000_20240412T064627___________________EUM_O_AL_001.SEN3"},"follows_from_task":"49b65571-d83a-4b21-990d-847acc6315d9"}}
[...]
"header":{"type":"LOG","timestamp":"2024-04-12T08:00:11.307335Z","level":"INFO","line":67,"file":"InputSearchService.java","thread":"KafkaConsumerDestination{consumerDestinationName='s3-l0p-part1.message-filter', partitions=4, dlqName='error-warning'}.container-0-C-1"},"message":{"content":"Start main input search for AppDataJob 149472"},"custom":{"logger_string":"esa.s1pdgs.cpoc.preparation.worker.service.InputSearchService"}}
{"header":{"type":"LOG","timestamp":"2024-04-12T08:00:11.318281Z","level":"INFO","line":653,"file":"MetadataClient.java","thread":"KafkaConsumerDestination{consumerDestinationName='s3-l0p-part1.message-filter', partitions=4, dlqName='error-warning'}.container-0-C-1"},"message":{"content":"Metadata query for family 'S3_L0' and product type 'TM_0_NAT___' returned 2 results"},"custom":{"logger_string":"esa.s1pdgs.cpoc.metadata.client.MetadataClient"}}
{"header":{"type":"LOG","timestamp":"2024-04-12T08:00:11.344545Z","level":"INFO","line":653,"file":"MetadataClient.java","thread":"KafkaConsumerDestination{consumerDestinationName='s3-l0p-part1.message-filter', partitions=4, dlqName='error-warning'}.container-0-C-1"},"message":{"content":"Metadata query for family 'S3_L0' and product type 'TM_0_NAT___' returned 2 results"},"custom":{"logger_string":"esa.s1pdgs.cpoc.metadata.client.MetadataClient"}}
{"header":{"type":"LOG","timestamp":"2024-04-12T08:00:11.344801Z","level":"INFO","line":315,"file":"S3TypeAdapter.java","thread":"KafkaConsumerDestination{consumerDestinationName='s3-l0p-part1.message-filter', partitions=4, dlqName='error-warning'}.container-0-C-1"},"message":{"content":"Timeout reached for job 149472. Continue without missing products..."},"custom":{"logger_string":"esa.s1pdgs.cpoc.preparation.worker.type.s3.S3TypeAdapter"}}

Note 2 NAT product selected ( S3B_TM_0_HKM__G_20240411T142342_20240411T160317_20240411T164017_5975______________SVL_O_NR_OPE.ISIP) for processing

  • S3B_TM_0_NAT____20240411T142325_20240411T160242_20240411T165120_5957_091_367______LN3_O_AL_002.SEN3
  • S3B_TM_0_NAT____20240411T142325_20240411T160259_20240411T165222_5973_091_367______LN3_O_AL_002.SEN3

Following tasktable
Herafter TM_NAT covering needed:

                <Input id="[TM_0_NAT___]">
                  <Mode>ALWAYS</Mode>
                  <Mandatory>No</Mandatory>
                  <List_of_Alternatives count="1">
                    <Alternative>
                      <Order>1</Order>
                      <Origin>DB</Origin>
                      <Retrieval_Mode>LatestValCover</Retrieval_Mode>
                      <T0>0</T0>
                      <T1>0</T1>
                      <File_Type>TM_0_NAT___</File_Type>
                      <File_Name_Type>Physical</File_Name_Type>
                    </Alternative>
                  </List_of_Alternatives>

So product stuck because generated NAT files (142325 -->160259) not cover all product duration (142342 --> 160317)
So issue can talk about generation NAT which are shorter

All NAT job for this processing;

{ "_id" : NumberLong(149376), "pod" : "s3-l0p-part1-preparation-worker-v52-d9dcb799f-9fzpj", "productName" : "S3B_TM_0_NAT__G_20240411T124404_20240411T142342_20240411T161654_5978______________SVL_O_NR_OPE.ISIP", "creationDate" : ISODate("2024-04-11T16:23:45.729Z"), "generation" : { "state" : "SENT" } }
{ "_id" : NumberLong(149392), "pod" : "s3-l0p-part1-preparation-worker-v52-d9dcb799f-9fzpj", "productName" : "S3B_TM_0_NAT__G_20240411T124404_20240411T142331_20240411T161511_5967______________SVL_O_NR_OPE.ISIP", "creationDate" : ISODate("2024-04-11T16:24:01.844Z"), "generation" : { "state" : "SENT" } }
{ "_id" : NumberLong(149479), "pod" : "s3-l0p-part1-preparation-worker-v52-d9dcb799f-9fzpj", "productName" : "S3B_TM_0_NAT__G_20240411T142343_20240411T160317_20240411T164009_5974______________SVL_O_NR_OPE.ISIP", "creationDate" : ISODate("2024-04-11T16:47:21.205Z"), "generation" : { "state" : "SENT" } }
{ "_id" : NumberLong(149510), "pod" : "s3-l0p-part1-preparation-worker-v52-d9dcb799f-9fzpj", "productName" : "S3B_TM_0_NAT__G_20240411T142343_20240411T160300_20240411T163823_5957______________SVL_O_NR_OPE.ISIP", "creationDate" : ISODate("2024-04-11T16:47:46.427Z"), "generation" : { "state" : "SENT" } }

So we should explain why NAT generated by job 149479 was shorter

On generation log for S3B_TM_0_NAT____20240411T142325_20240411T160259_20240411T165222_5973_091_367______LN3_O_AL_002.SEN3 No error seen:
S3_L0P_S3B_TM_0_NAT____20240411T142325_20240411T160259_20240411T165222_5973_091_367______LN3_O_AL_002.log

2024-04-11T16:52:24+00:00	2024-04-11T16:52:24.823724 s3-l0p-part1-execution-worker-v52-7f8b775f94-bkh7m IPF-0 06.15 [0000000726]: [I] NAVATT Reader: The NAVATT cover the processing window.
2024-04-11T16:52:24+00:00	2024-04-11T16:52:24.823705 s3-l0p-part1-execution-worker-v52-7f8b775f94-bkh7m IPF-0 06.15 [0000000726]: [I] NAVATT Reader: Required coverage is (2024-04-11T14:23:25.000000Z to 2024-04-11T16:02:59.000000Z), including 0s of margin.
2024-04-11T16:52:24+00:00	2024-04-11T16:52:24.823692 s3-l0p-part1-execution-worker-v52-7f8b775f94-bkh7m IPF-0 06.15 [0000000726]: [I] NAVATT Reader: NAVATT coverage is (2024-04-11T14:23:25.000000Z to 2024-04-11T16:02:59.000000Z).
2024-04-11T16:52:24+00:00	2024-04-11T16:52:24.823615 s3-l0p-part1-execution-worker-v52-7f8b775f94-bkh7m IPF-0 06.15 [0000000726]: [I] NAVATT Reader: NAVATT are perfect (cover proc window and have no gaps).

Bug Generic Definition of Ready (DoR)

  • The affect version in which the bug has been found is mentioned
  • The context and environment of the bug is detailed
  • The description of the bug is clear and unambiguous
  • The procedure (steps) to reproduce the bug is clearly detailed
  • The tested User Story / features is linked to the bug if available
  • Logs are attached if available
  • A data set attached if available

Bug Generic Definition of Done (DoD)

  • the modification implemented (the solution to fix the bug) is described in the bug.
  • Unit tests & Continuous integration performed - Test results available - Structural Test coverage reported by SONAR
  • Code committed in GIT with right tag or Analysis/Trade Off documentation up-to-date in reference-system-documentation repository
  • Code is compliant with coding rules (SONAR Report as evidence)
  • Acceptance criteria of the related User story are checked and Passed
@suberti-ads suberti-ads added bug Something isn't working CCB Issue for CCB ops Ticket from ADS operation team labels Apr 19, 2024
@suberti-ads
Copy link
Author

New occurrence on sample processing report
8 L0p execution started after 14hours (timeout on preparation-worker)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working CCB Issue for CCB ops Ticket from ADS operation team
Projects
None yet
Development

No branches or pull requests

1 participant