Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] No signal to coreMQTT-Agent network manager that OTA job has stopped, once OTA is successful #73

Closed
NightSkySK opened this issue Feb 5, 2024 · 8 comments

Comments

@NightSkySK
Copy link
Contributor

NightSkySK commented Feb 5, 2024

Describe the bug
After OTA job is finished and new firmware is verified successfully, there is no signal to coreMQTT-Agent network manager that an OTA job has stopped.
The signal is released only if OtaJobEventFail

        case OtaJobEventFail:
            ESP_LOGI( TAG, "Received an OtaJobEventFail notification from OTA Agent." );

            /* Signal coreMQTT-Agent network manager that an OTA job has stopped. */
            xCoreMqttAgentManagerPost( CORE_MQTT_AGENT_OTA_STOPPED_EVENT );
            break;

System information

  • Hardware board: ESP32
  • IDE used: IDF 5.1
  • Operating System: Windows
  • Code version: v202212.00-50-ge2d407f
  • Project/Demo: iot-reference-esp32c3
  • If your project is a custom application, please add the relevant code snippet(s) in the section titled "Code to reproduce bug".

Expected behavior
After OtaJobEventStartTest is done and image validation is finished there should be signal xCoreMqttAgentManagerPost(CORE_MQTT_AGENT_OTA_STOPPED_EVENT)
like this:

  case OtaJobEventStartTest:

    /* This demo just accepts the image since it was a good OTA update and
     * networking and services are all working (or we would not have made it
     * this far). If this were some custom device that wants to test other
     * things before validating new image, this would be the place to kick off
     * those tests before calling OTA_SetImageState() with the final result of
     * either accepted or rejected. */

    ESP_LOGI(TAG, "Received OtaJobEventStartTest callback from OTA Agent.");

    err = OTA_SetImageState(OtaImageStateAccepted);

    if (err == OtaErrNone) {
      ESP_LOGI(TAG, "New image validation succeeded in self test mode.");
    } else {
      ESP_LOGE(TAG, "Failed to set image state as accepted with error %d.",
               err);
    }
    xCoreMqttAgentManagerPost(CORE_MQTT_AGENT_OTA_STOPPED_EVENT);
    break;

and only after this flag other tasks can start their activity again.

Actually based on FreeRTOS documentation there should be one more case OtaJobEventUpdateComplete = 7, but I can't find this event in : https://github.com/FreeRTOS/iot-reference-esp32c3/blob/main/main/demo_tasks/ota_over_mqtt_demo/ota_over_mqtt_demo.c#L516-L609

typedef enum OtaJobEvent
{
    OtaJobEventActivate = 0,       /*!< @brief OTA receive is authenticated and ready to activate. */
    OtaJobEventFail = 1,           /*!< @brief OTA receive failed. Unable to use this update. */
    OtaJobEventStartTest = 2,      /*!< @brief OTA job is now in self test, perform user tests. */
    OtaJobEventProcessed = 3,      /*!< @brief OTA event queued by OTA_SignalEvent is processed. */
    OtaJobEventSelfTestFailed = 4, /*!< @brief OTA self-test failed for current job. */
    OtaJobEventParseCustomJob = 5, /*!< @brief OTA event for parsing custom job document. */
    OtaJobEventReceivedJob = 6,    /*!< @brief OTA event when a new valid AFT-OTA job is received. */
    OtaJobEventUpdateComplete = 7, /*!< @brief OTA event when the update is completed. */
    OtaLastJobEvent = OtaJobEventStartTest
} OtaJobEvent_t; 

Additional question
In case OtaJobEventSelfTestFailed OTA_Shutdown is executed, should program also trigger xCoreMqttAgentManagerPost(CORE_MQTT_AGENT_OTA_STOPPED_EVENT); in that case or ot will be covered by
OtaJobEventFail any way?

        case OtaJobEventSelfTestFailed:
            ESP_LOGI( TAG, "Received OtaJobEventSelfTestFailed callback from OTA Agent." );

            /* Requires manual activation of previous image as self-test for
             * new image downloaded failed.*/
            ESP_LOGE( TAG, "OTA Self-test failed for new image. shutting down OTA Agent." );

            /* Shutdown OTA Agent, if it is required that the unsubscribe operations are not
             * performed while shutting down please set the second parameter to 0 instead of 1. */
            OTA_Shutdown( 0, 1 );

            break;

Screenshots or console output
Sorry, I didn't record console output for that case. How ever after sucessful OTA update other task kept beeing inactive. And only way to recover from this stage was to manually reboot the device

@paulbartell
Copy link
Member

@NightSkySK Thanks for the bug report. We will look into it.

@NightSkySK
Copy link
Contributor Author

@paulbartell Did you managed to look or replicate this issue

@monkeytronics
Copy link

monkeytronics commented Feb 19, 2024

Is the same thing I am seeing. I have successfully run the OTA update, but the AWS Jobs console always reports it as a failure, as the expected "SUCCESS" update to job/{job_id}/update never happens. This can be seen here:

I (188594) AWS_OTA: Signature verification succeeded.
I (188594) AWS_OTA: Received entire update and validated the signature.
I (188964) MONKEY_AWS_IOT_JOBS_TASK:  Received: 306   Queued: 306   Processed: 305   Dropped: 0
I (188994) coreMQTT: Publishing message to $aws/things/S900001/jobs/AFR_OTA-vdsgfsdfga/update.

I (189964) MONKEY_AWS_IOT_JOBS_TASK:  Received: 306   Queued: 306   Processed: 305   Dropped: 0
I (190614) coreMQTT: Ack packet deserialized with result: MQTTSuccess.
I (190614) coreMQTT: State record updated. New state=MQTTPublishDone.
I (190614) MONKEY_AWS_IOT_JOBS_TASK: Sent PUBLISH packet to broker $aws/things/S900001/jobs/AFR_OTA-vdsgfsdfga/update to broker.


I (190614) coreMQTT: De-serialized incoming PUBLISH packet: DeserializerResult=MQTTSuccess.
W (190624) MONKEY_AWS_IOT_JOBS_TASK: Payload: {"status":"IN_PROGRESS","statusDetails":{"self_test":"ready","updatedBy":"0x01050009"}}


I (190634) coreMQTT: State record updated. New state=MQTTPublishDone.
I (190654) MONKEY_AWS_IOT_JOBS_TASK: Received OtaJobEventActivate callback from OTA Agent.
I (190654) MONKEY_AWS_IOT_JOBS_TASK: Received update response: $aws/things/S900001/jobs/AFR_OTA-vdsgfsdfga/update/accepted.
I (190664) esp_image: segment 0: paddr=00200020 vaddr=3f400020 size=3b2b0h (242352) map
I (190724) esp_image: segment 1: paddr=0023b2d8 vaddr=3ff80064 size=0002ch (    44)
I (190734) esp_image: segment 2: paddr=0023b30c vaddr=3ffbdb60 size=04d0ch ( 19724)
I (190744) esp_image: segment 3: paddr=00240020 vaddr=400d0020 size=d7ce0h (883936) map
I (190894) esp_image: segment 4: paddr=00317d08 vaddr=3ffc286c size=00c88h (  3208)
I (190894) esp_image: segment 5: paddr=00318998 vaddr=40080000 size=18d64h (101732)
I (190924) esp_image: segment 6: paddr=00331704 vaddr=400c0000 size=00064h (   100)
I (190924) esp_image: segment 7: paddr=00331770 vaddr=50000000 size=00028h (    40)
I (190924) esp_image: segment 0: paddr=00200020 vaddr=3f400020 size=3b2b0h (242352) map
I (190964) MONKEY_AWS_IOT_JOBS_TASK:  Received: 306   Queued: 306   Processed: 306   Dropped: 0
I (190974) esp_image: segment 1: paddr=0023b2d8 vaddr=3ff80064 size=0002ch (    44)
I (190984) esp_image: segment 2: paddr=0023b30c vaddr=3ffbdb60 size=04d0ch ( 19724)
I (190984) esp_image: segment 3: paddr=00240020 vaddr=400d0020 size=d7ce0h (883936) map
I (191144) esp_image: segment 4: paddr=00317d08 vaddr=3ffc286c size=00c88h (  3208)
I (191144) esp_image: segment 5: paddr=00318998 vaddr=40080000 size=18d64h (101732)
I (191164) esp_image: segment 6: paddr=00331704 vaddr=400c0000 size=00064h (   100)
I (191174) esp_image: segment 7: paddr=00331770 vaddr=50000000 size=00028h (    40)
I (191754) wifi:state: run -> init (0)
I (191754) wifi:pm stop, total sleep time: 92504243 us / 177986771 us

I (191754) wifi:<ba-del>idx:0, tid:0
I (191754) wifi:new:<12,0>, old:<12,0>, ap:<255,255>, sta:<12,0>, prof:1
E (191764) wifi:NAN WiFi stop
I (191764) core_mqtt_agent_manager: WiFi disconnected.
I (191764) monkey_wifi.c: * WIFI_EVENT : SYSTEM_EVENT_STA_DISCONNECTED
W (191774) monkey_wifi.c: Disconnect code   : 8
I (191774) monkey_wifi.c: Network Unavailable
E (191784) monkey_wifi.c: Disconnect reason : WIFI_REASON_UNSPECIFIED
E (191784) monkey_wifi.c: monkey task fsm -> MONKEY_WIFI_DISCONNECTING
I (191794) monkey_wifi.c: * WIFI_EVENT : SYSTEM_EVENT_STA_STOP
W (191804) monkey_wifi.c: Uptime = 191 s
W (191814) monkey_wifi.c: Uptime = 0 hrs
E (191814) monkey_wifi.c: Entering monkey_reconnect
I (191864) wifi:flush txq
I (191864) wifi:stop sw txq
I (191864) wifi:lmac stop hw txq
ets Jul 29 2019 12:21:46

The last thing AWS hears (at timestamp 190624) is that the job is still in progress!

@NightSkySK
Copy link
Contributor Author

I've finally found some time to catch the exact logs of the situation which I've observed.

Full log:
log.mSpaceGate.20240302140851.txt

Initial device configuartion:

I (871) App_Version: Version: t0.2.4 
I (871) App_Version: Project Name: mSpaceGate
I (871) App_Version: Compile Time: 13:31:44
I (881) App_Version: Compile Date: Mar  2 2024
I (881) App_Version: SOFTWARE VERSION: 0.2.4
I (911) ota_over_mqtt: OTA over MQTT demo, Application version 0.2.4

Once new job is received the other task are suspended (OTA started. Preventing coreMQTT-Agent commands from being enqueued.)as expected:

I (17291) ota_over_mqtt: Received job message callback, size 492.
I (17301) AWS_OTA: Extracted parameter: [key: value]=[execution.jobId: AFR_OTA-mSpaceGate_t0_2_5]
I (17311) AWS_OTA: Extracted parameter: [key: value]=[execution.jobDocument.afr_ota.streamname: AFR_OTA-a5ccc605-4e35-4db8-a17f-ed918d372f7c]
I (17331) AWS_OTA: Extracted parameter: [key: value]=[execution.jobDocument.afr_ota.protocols: ["MQTT"]]
I (17341) AWS_OTA: Extracted parameter: [key: value]=[filepath: /]
I (17341) AWS_OTA: Extracted parameter: [key: value]=[filesize: 1312496]
I (17351) AWS_OTA: Extracted parameter: [key: value]=[fileid: 0]
I (17361) AWS_OTA: Extracted parameter: [key: value]=[certfile: Code Verify Key]
I (17361) AWS_OTA: Extracted parameter [ sig-sha256-ecdsa: MEUCIQDsSDL3R+0JJpIK0HfC/mrhfjg1... ]
I (17371) AWS_OTA: Job document was accepted. Attempting to begin the update.
I (17381) AWS_OTA: Job parsing success: OtaJobParseErr_t=OtaJobParseErrNone, Job name=AFR_OTA-mSpaceGate_t0_2_5
I (17391) ota_over_mqtt: Received OtaJobEventReceivedJob callback from OTA Agent.
I (17401) AWS_OTA: otaPal_GetPlatformImageState
I (17401) report: OTA started. Preventing coreMQTT-Agent commands from being enqueued.
I (17401) esp_ota_ops: aws_esp_ota_get_boot_flags: 1
I (17411) sub_pub_unsub: OTA started. Preventing coreMQTT-Agent commands from being enqueued.
I (17421) esp_ota_ops: [0] aflags/seq:0x2/0x1, pflags/seq:0xffffffff/0x0
I (17431) core_mqtt_agent_manager: OTA started.

After new firmware is fully downloaded device reboot already with new SW version (App version: t0.2.5):

I (225191) AWS_OTA: Number of blocks remaining: 1
I (225191) ota_over_mqtt: OTA Event processing completed. Freeing the event buffer to pool.
I (225201) AWS_OTA: Current State=[WaitingForFileBlock], Event=[ReceivedFileBlock], New state=[WaitingForFileBlock]
I (226171) coreMQTT: Publishing message to $aws/things/GA913F0001/streams/AFR_OTA-a5ccc605-4e35-4db8-a17f-ed918d372f7c/get/cbor.

I (226171) ota_over_mqtt: Sent PUBLISH packet to broker $aws/things/GA913F0001/streams/AFR_OTA-a5ccc605-4e35-4db8-a17f-ed918d372f7c/get/cbor to broker.


I (226191) AWS_OTA: Published to MQTT topic to request the next block: topic=$aws/things/GA913F0001/streams/AFR_OTA-a5ccc605-4e35-4db8-a17f-ed918d372f7c/get/cbor
I (226201) AWS_OTA: Current State=[WaitingForFileBlock], Event=[RequestFileBlock], New state=[WaitingForFileBlock]
I (227291) coreMQTT: De-serialized incoming PUBLISH packet: DeserializerResult=MQTTSuccess.
I (227291) coreMQTT: State record updated. New state=MQTTPublishDone.
I (227301) AWS_OTA: Received valid file block: Block index=320, Size=1776
I (227311) AWS_OTA: Received final block of the update.
I (227931) AWS_OTA: Signature verification succeeded.
I (227931) AWS_OTA: Received entire update and validated the signature.
I (228301) coreMQTT: Publishing message to $aws/things/GA913F0001/jobs/AFR_OTA-mSpaceGate_t0_2_5/update.

I (229541) coreMQTT: Ack packet deserialized with result: MQTTSuccess.
I (229541) coreMQTT: State record updated. New state=MQTTPublishDone.
I (229541) ota_over_mqtt: Sent PUBLISH packet to broker $aws/things/GA913F0001/jobs/AFR_OTA-mSpaceGate_t0_2_5/update to broker.


I (229551) ota_over_mqtt: Received OtaJobEventActivate callback from OTA Agent.
I (229541) coreMQTT: De-serialized incoming PUBLISH packet: DeserializerResult=MQTTSuccess.
I (229561) esp_image: segment 0: paddr=001f0020 vaddr=3f400020 size=3aecch (241356) map
I (229571) coreMQTT: State record updated. New state=MQTTPublishDone.
I (229591) ota_over_mqtt: Received update response: $aws/things/GA913F0001/jobs/AFR_OTA-mSpaceGate_t0_2_5/update/accepted{"timestamp":1709385169}af.
I (229651) esp_image: segment 1: paddr=0022aef4 vaddr=3ffbdb60 size=05124h ( 20772) 
I (229671) esp_image: segment 2: paddr=00230020 vaddr=400d0020 size=e0210h (918032) map
I (229951) esp_image: segment 3: paddr=00310238 vaddr=3ffc2c84 size=00964h (  2404) 
I (229951) esp_image: segment 4: paddr=00310ba4 vaddr=40080000 size=1fb24h (129828) 
I (229991) esp_image: segment 0: paddr=001f0020 vaddr=3f400020 size=3aecch (241356) map
I (230071) esp_image: segment 1: paddr=0022aef4 vaddr=3ffbdb60 size=05124h ( 20772) 
I (230081) esp_image: segment 2: paddr=00230020 vaddr=400d0020 size=e0210h (918032) map
I (230351) esp_image: segment 3: paddr=00310238 vaddr=3ffc2c84 size=00964h (  2404) 
I (230361) esp_image: segment 4: paddr=00310ba4 vaddr=40080000 size=1fb24h (129828) 
I (230961) wifi:state: run -> init (0)
I (230961) wifi:pm stop, total sleep time: 163061635 us / 227233436 us

I (230961) wifi:<ba-del>idx:0, tid:0
I (230961) wifi:new:<11,0>, old:<11,2>, ap:<255,255>, sta:<11,2>, prof:1
E (230971) wifi:NAN WiFi stop
I (230971) core_mqtt_agent_manager: WiFi disconnected.
I (230971) NETWORKING: Disconnected. Connecting to the AP again...
I (230981) NETWORKING: Disconnected. Connecting to the AP again...
I (231041) wifi:flush txq
I (231041) wifi:stop sw txq
I (231041) wifi:lmac stop hw txq
ets Jul 29 2019 12:21:46

rst:0xc (SW_CPU_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
configsip: 0, SPIWP:0xee
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:2
load:0x3fff0030,len:7128
load:0x40078000,len:15692
ho 0 tail 12 room 4
load:0x40080400,len:4
0x40080400: _init at ??:?
load:0x40080404,len:3876
entry 0x4008064c
I (31) boot: ESP-IDF v5.1.2-684-g7cbee80fb9 2nd stage bootloader
I (31) boot: compile time Mar  2 2024 13:10:08
I (33) boot: Multicore bootloader
I (37) boot: chip revision: v3.1
I (40) boot.esp32: SPI Speed      : 40MHz
I (45) boot.esp32: SPI Mode       : DIO
I (50) boot.esp32: SPI Flash Size : 4MB
I (54) boot: Enabling RNG early entropy source...
I (60) boot: Partition Table:
I (63) boot: ## Label            Usage          Type ST Offset   Length
I (71) boot:  0 nvs              WiFi data        01 02 00010000 00006000
I (78) boot:  1 otadata          OTA data         01 00 00016000 00002000
I (85) boot:  2 phy_init         RF data          01 01 00018000 00001000
I (93) boot:  3 ota_0            OTA app          00 10 00020000 001c2000
I (100) boot:  4 ota_1            OTA app          00 11 001f0000 001c2000
I (108) boot:  5 storage          WiFi data        01 02 003b2000 00009000
I (115) boot:  6 settings         WiFi data        01 02 003bb000 00010000
I (123) boot: End of partition table
I (167) esp_image: segment 0: paddr=001f0020 vaddr=3f400020 size=3aecch (241356) map
I (255) esp_image: segment 1: paddr=0022aef4 vaddr=3ffbdb60 size=05124h ( 20772) load
I (263) esp_image: segment 2: paddr=00230020 vaddr=400d0020 size=e0210h (918032) map
I (595) esp_image: segment 3: paddr=00310238 vaddr=3ffc2c84 size=00964h (  2404) load
I (596) esp_image: segment 4: paddr=00310ba4 vaddr=40080000 size=1fb24h (129828) load
I (670) boot: Loaded app from partition at offset 0x1f0000
I (671) boot: Disabling RNG early entropy source...
I (682) cpu_start: Multicore app
I (683) cpu_start: Pro cpu up.
I (683) cpu_start: Starting app cpu, entry point is 0x400814c0
0x400814c0: call_start_cpu1 at C:/Espressif/v5.1/esp-idf/components/esp_system/port/cpu_start.c:159
I (672) cpu_start: App cpu up.
I (703) cpu_start: Pro cpu start user code
I (703) cpu_start: cpu freq: 160000000 Hz
I (703) cpu_start: Application information:
I (708) cpu_start: Project name:     mSpaceGate
I (713) cpu_start: App version:      t0.2.5
I (718) cpu_start: Compile time:     Mar  2 2024 14:04:43
I (724) cpu_start: ELF file SHA256:  3a93f59e049c6ac9...
I (730) cpu_start: ESP-IDF:          v5.1.2-684-g7cbee80fb9
I (736) cpu_start: Min chip rev:     v0.0
I (741) cpu_start: Max chip rev:     v3.99 
I (745) cpu_start: Chip rev:         v3.1

Verification of new SW version no is ok and new SW is set for testing:

I (12473) AWS_OTA: In self test mode.
I (12483) AWS_OTA: New image has a higher version number than the current image: New image version=0.2.5, Previous image version=0.2.4
I (12493) AWS_OTA: Image version is valid: Begin testing file: File ID=0
I (12503) AWS_OTA: otaPal_SetPlatformImageState, 1
W (12503) AWS_OTA: Set image as testing!

During validation of new image other tasks are put on hold again:

I (16653) AWS_OTA: Job parsing success: OtaJobParseErr_t=OtaJobParseErrNone, Job name=AFR_OTA-mSpaceGate_t0_2_5
I (16663) coreMQTT: State record updated. New state=MQTTPublishDone.
I (16673) ota_over_mqtt: Received OtaJobEventReceivedJob callback from OTA Agent.
I (16683) ota_over_mqtt: Received update response: $aws/things/GA913F0001/jobs/AFR_OTA-mSpaceGate_t0_2_5/update/accepted{"timestamp":1709385188}0:c5:a8",
	"firmware":	"t0.2.5",
	"rssi":	-52
}t0_2_5","status":"IN_PROGRESS","statusDetails":{"self_test":"ready","updatedBy":"0x00020004"},"queuedAt":1709384957,"startedAt":1709384966,"lastUpdatedAt":1709385169,"versionNumber":15,"executionNumber":1,"jobDocument":{"afr_ota":{"protocols":["MQTT"],"streamname":"AFR_OTA-a5ccc605-4e35-4db8-a17f-ed918d372f7c","files":[{"filepath":"/","filesize":1312496,"fileid":0,"certfile":"Code Verify Key","sig-sha256-ecdsa":"MEUCIQDsSDL3R+0JJpIK0HfC/mrhfjg1NbLWwWpPmPKyCDlCKQIge2wNUqam3yhEO5pmO5C5IW7zX0AzhJkUxRQZ6D/qkkA="}]}}}}.
I (16693) report: OTA started. Preventing coreMQTT-Agent commands from being enqueued.
I (16693) AWS_OTA: otaPal_GetPlatformImageState
I (16763) sub_pub_unsub: OTA started. Preventing coreMQTT-Agent commands from being enqueued.
I (16773) esp_ota_ops: aws_esp_ota_get_boot_flags: 1
I (16773) core_mqtt_agent_manager: OTA started.
I (16783) esp_ota_ops: [1] aflags/seq:0x1/0x2, pflags/seq:0x2/0x1
I (16793) ota_over_mqtt: OTA Event processing completed. Freeing the event buffer to pool.
I (16803) AWS_OTA: Current State=[CreatingFile], Event=[ReceivedJobDocument], New state=[CreatingFile]
I (16813) AWS_OTA: Beginning self-test.
I (16823) AWS_OTA: otaPal_GetPlatformImageState
I (16823) esp_ota_ops: aws_esp_ota_get_boot_flags: 1
I (16833) esp_ota_ops: [1] aflags/seq:0x1/0x2, pflags/seq:0x2/0x1
I (16833) ota_over_mqtt: Received OtaJobEventStartTest callback from OTA Agent.
I (16843) AWS_OTA: otaPal_SetPlatformImageState, 2
I (16853) AWS_OTA: Set image as valid one!

OTA Job is completed, but other tasks never receive flag CORE_MQTT_AGENT_OTA_STOPPED_EVENT and remain on hold. They only can be recovered by reboot / restart the device

I (21063) ota_over_mqtt: New image validation succeeded in self test mode.
I (21073) coreMQTT: State record updated. New state=MQTTPublishDone.
I (21073) AWS_OTA: Current State=[WaitingForJob], Event=[StartSelfTest], New state=[WaitingForJob]
I (21083) ota_over_mqtt: Received update response: $aws/things/GA913F0001/jobs/AFR_OTA-mSpaceGate_t0_2_5/update/accepted{"timestamp":1709385192}0:c5:a8",
	"firmware":	"t0.2.5",
	"rssi":	-52
}t0_2_5","status":"IN_PROGRESS","statusDetails":{"self_test":"ready","updatedBy":"0x00020004"},"queuedAt":1709384957,"startedAt":1709384966,"lastUpdatedAt":1709385169,"versionNumber":15,"executionNumber":1,"jobDocument":{"afr_ota":{"protocols":["MQTT"],"streamname":"AFR_OTA-a5ccc605-4e35-4db8-a17f-ed918d372f7c","files":[{"filepath":"/","filesize":1312496,"fileid":0,"certfile":"Code Verify Key","sig-sha256-ecdsa":"MEUCIQDsSDL3R+0JJpIK0HfC/mrhfjg1NbLWwWpPmPKyCDlCKQIge2wNUqam3yhEO5pmO5C5IW7zX0AzhJkUxRQZ6D/qkkA="}]}}}}.
I (23153) coreMQTT: De-serialized incoming PUBLISH packet: DeserializerResult=MQTTSuccess.
I (23153) coreMQTT: State record updated. New state=MQTTPubAckSend.
I (23153) ota_over_mqtt: Received job message callback, size 24.


I (23163) AWS_OTA: No active job available in received job document: OtaJobParseErr_t=OtaJobParseErrNoActiveJobs
I (23173) ota_over_mqtt: Received OtaJobEventNoActiveJob callback from OTA Agent.
I (23183) AWS_OTA: otaPal_GetPlatformImageState
I (23183) esp_ota_ops: aws_esp_ota_get_boot_flags: 1
I (23193) esp_ota_ops: [1] aflags/seq:0x2/0x2, pflags/seq:0x2/0x1
I (23203) ota_over_mqtt: OTA Event processing completed. Freeing the event buffer to pool.
I (23213) AWS_OTA: Current State=[WaitingForJob], Event=[ReceivedJobDocument], New state=[CreatingFile]
I (31943) ota_over_mqtt: OTA agent suspend by timer.
W (32943) AWS_OTA: OTA Timer handle NULL for Timerid=0, can't stop.
I (32943) AWS_OTA: OTA Agent is suspended.
I (32943) AWS_OTA: Current State=[Suspended], Event=[Suspend], New state=[Suspended]
I (152943) ota_over_mqtt: OTA agent resumed by timer.
I (153943) AWS_OTA: otaPal_GetPlatformImageState
I (153943) esp_ota_ops: aws_esp_ota_get_boot_flags: 1
I (153943) esp_ota_ops: [1] aflags/seq:0x2/0x2, pflags/seq:0x2/0x1
I (153943) AWS_OTA: Current State=[RequestingJob], Event=[Resume], New state=[RequestingJob]
I (154393) ota_over_mqtt: Subscribed to topic $aws/things/GA913F0001/jobs/notify-next.


I (154393) AWS_OTA: Subscribed to MQTT topic: $aws/things/GA913F0001/jobs/notify-next
I (155393) coreMQTT: Publishing message to $aws/things/GA913F0001/jobs/$next/get.

I (156603) coreMQTT: Ack packet deserialized with result: MQTTSuccess.
I (156603) coreMQTT: State record updated. New state=MQTTPublishDone.
I (156603) ota_over_mqtt: Sent PUBLISH packet to broker $aws/things/GA913F0001/jobs/$next/get to broker.


I (156603) coreMQTT: De-serialized incoming PUBLISH packet: DeserializerResult=MQTTSuccess.
I (156613) AWS_OTA: Current State=[WaitingForJob], Event=[RequestJobDocument], New state=[WaitingForJob]
I (156623) coreMQTT: State record updated. New state=MQTTPublishDone.
I (156643) ota_over_mqtt: Received job message callback, size 53.


I (156653) AWS_OTA: No active job available in received job document: OtaJobParseErr_t=OtaJobParseErrNoActiveJobs
I (156663) ota_over_mqtt: Received OtaJobEventNoActiveJob callback from OTA Agent.
I (156663) AWS_OTA: otaPal_GetPlatformImageState
I (156673) esp_ota_ops: aws_esp_ota_get_boot_flags: 1
I (156683) esp_ota_ops: [1] aflags/seq:0x2/0x2, pflags/seq:0x2/0x1
I (156683) ota_over_mqtt: OTA Event processing completed. Freeing the event buffer to pool.
I (156693) AWS_OTA: Current State=[WaitingForJob], Event=[ReceivedJobDocument], New state=[CreatingFile]

Sorry for big number of logs, but I tried to explain as detailed as I can.

@NightSkySK
Copy link
Contributor Author

I've tried to update the prvOtaAppCallback function to the following state, by adding case OtaJobEventUpdateComplete:

static void prvOtaAppCallback(OtaJobEvent_t event, void *pData) {
  OtaErr_t err = OtaErrUninitialized;

  switch (event) {
  case OtaJobEventActivate:
    ESP_LOGI(TAG, "Received OtaJobEventActivate callback from OTA Agent.");

    /**
     * Activate the new firmware image immediately. Applications can choose to
     * postpone the activation to a later stage if needed.
     */
    err = OTA_ActivateNewImage();

    /**
     * Activation of the new image failed. This indicates an error that requires
     * a follow up through manual activation by resetting the device. The demo
     * reports the error and shuts down the OTA agent.
     */
    ESP_LOGE(TAG, "New image activation failed.");

    /* Shutdown OTA Agent, if it is required that the unsubscribe operations are
     * not performed while shutting down please set the second parameter to 0
     * instead of 1. */
    OTA_Shutdown(0, 1);

    break;

  case OtaJobEventFail:
    ESP_LOGI(TAG, "Received an OtaJobEventFail notification from OTA Agent.");

    /* Signal coreMQTT-Agent network manager that an OTA job has stopped. */
    xCoreMqttAgentManagerPost(CORE_MQTT_AGENT_OTA_STOPPED_EVENT);
    break;

  case OtaJobEventStartTest:

    /* This demo just accepts the image since it was a good OTA update and
     * networking and services are all working (or we would not have made it
     * this far). If this were some custom device that wants to test other
     * things before validating new image, this would be the place to kick off
     * those tests before calling OTA_SetImageState() with the final result of
     * either accepted or rejected. */

    ESP_LOGI(TAG, "Received OtaJobEventStartTest callback from OTA Agent.");

    err = OTA_SetImageState(OtaImageStateAccepted);

    if (err == OtaErrNone) {
      ESP_LOGI(TAG, "New image validation succeeded in self test mode.");
    } else {
      ESP_LOGE(TAG, "Failed to set image state as accepted with error %d.",
               err);
    }

    break;

  case OtaJobEventProcessed:

    ESP_LOGI(
        TAG,
        "OTA Event processing completed. Freeing the event buffer to pool.");
    configASSERT(pData != NULL);
    prvOTAEventBufferFree((OtaEventData_t *)pData);

    break;

  case OtaJobEventSelfTestFailed:
    ESP_LOGI(TAG,
             "Received OtaJobEventSelfTestFailed callback from OTA Agent.");

    /* Requires manual activation of previous image as self-test for
     * new image downloaded failed.*/
    ESP_LOGE(TAG,
             "OTA Self-test failed for new image. shutting down OTA Agent.");

    /* Shutdown OTA Agent, if it is required that the unsubscribe operations are
     * not performed while shutting down please set the second parameter to 0
     * instead of 1. */
    OTA_Shutdown(0, 1);

    /* Signal coreMQTT-Agent network manager that an OTA job has stopped. */
    xCoreMqttAgentManagerPost(CORE_MQTT_AGENT_OTA_STOPPED_EVENT);
    break;

  case OtaJobEventReceivedJob:
    ESP_LOGI(TAG, "Received OtaJobEventReceivedJob callback from OTA Agent.");
    /* Signal coreMQTT-Agent network manager that an OTA job has started. */
    xCoreMqttAgentManagerPost(CORE_MQTT_AGENT_OTA_STARTED_EVENT);
    break;

  case OtaJobEventUpdateComplete:
    ESP_LOGI(TAG,
             "Received OtaJobEventUpdateComplete callback from OTA Agent.");
    /* Signal coreMQTT-Agent network manager that an OTA job has stopped. */
    xCoreMqttAgentManagerPost(CORE_MQTT_AGENT_OTA_STOPPED_EVENT);
    break;

  case OtaJobEventNoActiveJob:
    ESP_LOGI(TAG, "Received OtaJobEventNoActiveJob callback from OTA Agent.");
    /* No more jobs available in IoTCore, no further actions on this event. */
    break;

  default:
    ESP_LOGW(TAG,
             "Received an unhandled callback event from OTA Agent, "
             "event = %d",
             event);

    break;
  }
}

but there is the same behaviour
log.mSpaceGate.20240302153506.txt

The solution which works for me is to add the CORE_MQTT_AGENT_OTA_STOPPED_EVENT to OtaJobEventStartTest after check if OTA_SetImageState

static void prvOtaAppCallback(OtaJobEvent_t event, void *pData) {
  OtaErr_t err = OtaErrUninitialized;

  switch (event) {
  case OtaJobEventActivate:
    ESP_LOGI(TAG, "Received OtaJobEventActivate callback from OTA Agent.");

    /**
     * Activate the new firmware image immediately. Applications can choose to
     * postpone the activation to a later stage if needed.
     */
    err = OTA_ActivateNewImage();

    /**
     * Activation of the new image failed. This indicates an error that requires
     * a follow up through manual activation by resetting the device. The demo
     * reports the error and shuts down the OTA agent.
     */
    ESP_LOGE(TAG, "New image activation failed.");

    /* Shutdown OTA Agent, if it is required that the unsubscribe operations are
     * not performed while shutting down please set the second parameter to 0
     * instead of 1. */
    OTA_Shutdown(0, 1);

    break;

  case OtaJobEventFail:
    ESP_LOGI(TAG, "Received an OtaJobEventFail notification from OTA Agent.");

    /* Signal coreMQTT-Agent network manager that an OTA job has stopped. */
    xCoreMqttAgentManagerPost(CORE_MQTT_AGENT_OTA_STOPPED_EVENT);
    break;

  case OtaJobEventStartTest:

    /* This demo just accepts the image since it was a good OTA update and
     * networking and services are all working (or we would not have made it
     * this far). If this were some custom device that wants to test other
     * things before validating new image, this would be the place to kick off
     * those tests before calling OTA_SetImageState() with the final result of
     * either accepted or rejected. */

    ESP_LOGI(TAG, "Received OtaJobEventStartTest callback from OTA Agent.");

    err = OTA_SetImageState(OtaImageStateAccepted);

    if (err == OtaErrNone) {
      ESP_LOGI(TAG, "New image validation succeeded in self test mode.");
    } else {
      ESP_LOGE(TAG, "Failed to set image state as accepted with error %d.",
               err);
    }
    xCoreMqttAgentManagerPost(CORE_MQTT_AGENT_OTA_STOPPED_EVENT);
    break;

  case OtaJobEventProcessed:

    ESP_LOGI(
        TAG,
        "OTA Event processing completed. Freeing the event buffer to pool.");
    configASSERT(pData != NULL);
    prvOTAEventBufferFree((OtaEventData_t *)pData);

    break;

  case OtaJobEventSelfTestFailed:
    ESP_LOGI(TAG,
             "Received OtaJobEventSelfTestFailed callback from OTA Agent.");

    /* Requires manual activation of previous image as self-test for
     * new image downloaded failed.*/
    ESP_LOGE(TAG,
             "OTA Self-test failed for new image. shutting down OTA Agent.");

    /* Shutdown OTA Agent, if it is required that the unsubscribe operations are
     * not performed while shutting down please set the second parameter to 0
     * instead of 1. */
    OTA_Shutdown(0, 1);

    /* Signal coreMQTT-Agent network manager that an OTA job has stopped. */
    xCoreMqttAgentManagerPost(CORE_MQTT_AGENT_OTA_STOPPED_EVENT);
    break;

  case OtaJobEventReceivedJob:
    ESP_LOGI(TAG, "Received OtaJobEventReceivedJob callback from OTA Agent.");
    /* Signal coreMQTT-Agent network manager that an OTA job has started. */
    xCoreMqttAgentManagerPost(CORE_MQTT_AGENT_OTA_STARTED_EVENT);
    break;

  case OtaJobEventNoActiveJob:
    ESP_LOGI(TAG, "Received OtaJobEventNoActiveJob callback from OTA Agent.");
    /* No more jobs available in IoTCore, no further actions on this event. */
    break;

  default:
    ESP_LOGW(TAG,
             "Received an unhandled callback event from OTA Agent, "
             "event = %d",
             event);

    break;
  }
}

LOG:
log.mSpaceGate.20240302161840.txt

with this function definition both task are start working again after OTA completed:

I (21180) ota_over_mqtt: New image validation succeeded in self test mode.
I (21190) coreMQTT: State record updated. New state=MQTTPublishDone.
I (21190) report: OTA stopped. No longer preventing coreMQTT-Agent commands from being enqueued.
I (21190) AWS_OTA: Current State=[WaitingForJob], Event=[StartSelfTest], New state=[WaitingForJob]
I (21210) sub_pub_unsub: OTA stopped. No longer preventing coreMQTT-Agent commands from being enqueued.
I (21200) ota_over_mqtt: Received update response: $aws/things/GA913F0001/jobs/AFR_OTA-mSpaceLock-t0_2_7/update/accepted{"timestamp":1709393030}0:c5:a8",
	"firmware":	"t0.2.7",
	"rssi":	-56
}t0_2_7","status":"IN_PROGRESS","statusDetails":{"self_test":"ready","updatedBy":"0x00020006"},"queuedAt":1709392696,"startedAt":1709392802,"lastUpdatedAt":1709393006,"versionNumber":15,"executionNumber":1,"jobDocument":{"afr_ota":{"protocols":["MQTT"],"streamname":"AFR_OTA-508f39ed-54b2-44e9-bb2d-37c3a845ab0f","files":[{"filepath":"/","filesize":1312512,"fileid":0,"certfile":"Code Verify Key","sig-sha256-ecdsa":"MEQCIBkZyQTR8ho/2aZAoLz07TGXPEbKzjneoCg+yCeXBjiUAiAsoCRg4Fd6bIk5cC+h9YmLEDylKOo1jHGsaRfu9pKU7g=="}]}}}}.
I (21230) core_mqtt_agent_manager: OTA stopped.
I (21780) coreMQTT: De-serialized incoming PUBLISH packet: DeserializerResult=MQTTSuccess.
I (21780) coreMQTT: State record updated. New state=MQTTPubAckSend.
I (21780) ota_over_mqtt: Received job message callback, size 24.


I (21790) AWS_OTA: No active job available in received job document: OtaJobParseErr_t=OtaJobParseErrNoActiveJobs
I (21800) ota_over_mqtt: Received OtaJobEventNoActiveJob callback from OTA Agent.
I (21810) AWS_OTA: otaPal_GetPlatformImageState
I (21810) esp_ota_ops: aws_esp_ota_get_boot_flags: 1
I (21820) esp_ota_ops: [1] aflags/seq:0x2/0x4, pflags/seq:0x2/0x3
I (21820) ota_over_mqtt: OTA Event processing completed. Freeing the event buffer to pool.
I (21830) AWS_OTA: Current State=[WaitingForJob], Event=[ReceivedJobDocument], New state=[CreatingFile]
I (24740) sub_pub_unsub: Task "SubPub0" sending subscribe request to coreMQTT-Agent for topic filter: /filter/SubPub0 with id 4
I (25050) sub_pub_unsub: Subscribe 4 for topic filter /filter/SubPub0 succeeded for task "SubPub0".
I (25050) sub_pub_unsub: Task "SubPub0" sending publish request to coreMQTT-Agent with message "SubPub0" on topic "/filter/SubPub0" with ID 5.
I (25070) sub_pub_unsub: Task "SubPub0" waiting for publish 5 to complete.
I (26050) coreMQTT: Publishing message to /filter/SubPub0.

I will prepare pull request with that change proposal.

@monkeytronics
Copy link

My fix was this. Not sure if this lines up with the desired flow...

switch( event )
    {
        case OtaJobEventActivate:
            ESP_LOGI( TAG, "Received OtaJobEventActivate callback from OTA Agent." );

            /** 
             * OTA Successful. Report Success to JOBS portal. X Monkey X
            */
            OTA_SetImageState( OtaImageStateAccepted );

            /**
             * Activate the new firmware image immediately. Applications can choose to postpone
             * the activation to a later stage if needed.
             */
            err = OTA_ActivateNewImage();

@ActoryOu
Copy link
Member

ActoryOu commented Mar 22, 2024

Hi,
Glad to see that the PR has been merged, resolving the issue at hand.
If I may, I would like to share my thoughts on the additional question:

Additional question
In case OtaJobEventSelfTestFailed OTA_Shutdown is executed, should program also trigger xCoreMqttAgentManagerPost(CORE_MQTT_AGENT_OTA_STOPPED_EVENT); in that case or ot will be covered by
OtaJobEventFail any way?

It seems appropriate to trigger xCoreMqttAgentManagerPost(CORE_MQTT_AGENT_OTA_STOPPED_EVENT) again at this point to resume other tasks. However, if the OTA is shut down here, the entire OTA service would be disabled after this. The reference code provided is intended to demonstrate the ability to run OTA with the simplest error handling approach. I will test it with xCoreMqttAgentManagerPost(CORE_MQTT_AGENT_OTA_STOPPED_EVENT) added and consider raising a PR for the same.

Appreciate the contributions from all of you.

@ActoryOu
Copy link
Member

Closing as PRs are merged. Thanks for your contributions!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants