Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clean up and extend RAMTRON driver #13621

Closed
wants to merge 26 commits into from

Conversation

nexton-winjeel
Copy link
Contributor

We experienced an issue where parameters were being unexpectedly erased during development. The sequence of events that caused this was:

  1. The RAMTRON driver fails to initialise;
  2. The Storage driver silently falls back to using the SD card;
  3. The SD card was not fitted, so the Storage drive falls back to using Flash;
  4. A new program image was written to the device over JTAG, erasing the params in Flash;
  5. Developer tests the new image, and finds unexpected params.

This behaviour was observed on both a Pixracer and custom hardware.

The root cause of the the RAMTRON driver failing to initialise (step 1. above) was different data returned by different manufacturers. Fujitsu and Cypress make FRAM chips that are both pin and register compatible, but each manufacturer returns different data when the RDID register is read.

This pull request contains the following key changes to address the issues discovered during this investigation:

  • RAMTRON:
    • Updated the struct that defines the supported FRAM devices to include the RDID format. This was then used to properly parse the read of the RDID register.
    • Updated the init function to ensure the WriteEnable and BlockProtect bits of the status register are properly set.
    • Updated the read and write functions so that the SPI read and write calls were inside the same chip select activation. This prevents spurious data being returned if the FRAM chip resets it's read registers on deselect.
    • Updated the read and write functions to fail if we attempt to access more data than the FRAM chip can hold. Previously these would silently wrap back to address 0x0000, which would cause unexpected behaviour.
  • AP_HAL_ChibiOS/Storage:
    • Added a field and extra logging to capture what physical device is being used by the Storage driver.
    • Added a panic if no suitable backend device can be initialised by the Storage driver.
    • Added an optional panic (intended to be controlled by a define in the hwdef.dat file) that prevents falling back to another device if the FRAM can't be initialised.

Please let me know if you need test evidence and/or test cases for this.

@tridge
Copy link
Contributor

tridge commented Feb 21, 2020

thanks for your work on this!
Note that there is some overlap with #13581, especially in removing the fallback when using FRAM

Copy link
Contributor

@tridge tridge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for your work on this!
A few changes needed are listed below, plus I'd like to add the retries from #13581. That PR came out of a board where power up speed seems to cause the init of the ramtron to fail sometimes. I don't think the result is explained by your changes.

@@ -517,7 +517,8 @@ class AP_Param
uint8_t magic[2];
uint8_t revision;
uint8_t spare;
};
} PACKED;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adding PACKED on this structure doesn't change the shape for storage, but does make g++ produce less efficient code (it makes for larger flash size for no advantage)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a habit from working with space constrained devices. Removed the PACKED attribute.

@@ -539,7 +540,8 @@ class AP_Param
uint32_t type : 5;
uint32_t key_high : 1;
uint32_t group_element : 18;
};
} PACKED;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, PACKED just makes for less efficient code

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the PACKED attribute.

uint8_t manufacturer[6];
uint8_t memory;
uint8_t id1;
uint8_t id2;
} rdid;
if (!dev->read_registers(RAMTRON_RDID, (uint8_t *)&rdid, sizeof(rdid))) {
} PACKED;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PACKED doesn't help here and makes for slower and larger code

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the PACKED attribute. Replaced with a static_assert to ensure the RDID shape is correct.

uint8_t manufacturer[2];
uint8_t id1;
uint8_t id2;
} PACKED;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PACKED doesn't help

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, removed PACKED attribute, added static_assert.

@@ -45,7 +45,7 @@ const AP_Param::Info Plane::var_info[] = {
ASCALAR(autotune_level, "AUTOTUNE_LEVEL", 6),

// @Param: TELEM_DELAY
// @DisplayName: Telemetry startup delay
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you see if you can configure your editor not to do whitespace cleanup on lines you don't edit?
You may find using "git add -p" to build a patchset useful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack. Once you're happy with the changes, I'll rebase and remove the whitespace changes.

Are you aware that you can append ?w=1 to the URL to hide whitespace changes in the diff? Maybe you can do this in the interim?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I'm aware. It just pollutes commits however, changing the git blame results

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rebased to remove the whitepace changes.


// read from device
bool read(uint32_t offset, uint8_t *buf, uint32_t size);
uint32_t read(uint32_t offset, uint8_t * const buf, uint32_t size);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why change from bool when the only use case is to check if we get the full amount requested?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was useful while I was debugging to see why I wasn't getting back the data I expected. It also allows consumers to do partial retries.

Would you prefer this reverted back to returning a bool?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I'd prefer it be bool. If we end up with a consumer that wants to do partial retries in the future we could change this, but I doubt we will

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted to bool.


// write to device
bool write(uint32_t offset, const uint8_t *buf, uint32_t size);
uint32_t write(uint32_t offset, uint8_t const * const buf, uint32_t size);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why change from bool?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above...

// get data
dev->transfer(nullptr, 0, buf, size);
uint32_t const kReadSz = MIN(size, kMaxReadSz);
bool ok = dev->set_chip_select(true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

transfer() already handles chip select, so is only needed when doing multiple operations within the one CS op

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the unnecessary calls to set_chip_select().

if (fram.read(0, _buffer, CH_STORAGE_SIZE) == CH_STORAGE_SIZE) {
_save_backup();
_initialisedType = StorageBackend::FRAM;
hal.console->printf("Initialised Storage type=%d\n", _initialisedType);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this happens so early that it isn't visible on the console as console isn't setup yet

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed back to using ::printf()

if (sdcard_retry()) {
log_fd = AP::FS().open(HAL_STORAGE_FILE, O_RDWR|O_CREAT);
if (log_fd == -1) {
hal.console->printf("open failed of " HAL_STORAGE_FILE "\n");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

::printf() was used as it works in early startup

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack. As above, reverted to use ::printf()

@Jaaaky
Copy link
Contributor

Jaaaky commented Feb 23, 2020

I've 3 fmuv3 boards, pixhawk1 like, which do not save any parameters.
The firmware doesn't fallback to sdcard as - I think - it can see an FRAM, so any param change is lost on reboot.
I really think it's a driver issue, but not sure.
I've tested this pull request and @tridge's one, but no success.

On this pull request it gives "Couldn't load params: erasing params..."
I can debug it further, if you tell me how.

@nexton-winjeel
Copy link
Contributor Author

Please note: The latest changes compile, but I won't have a chance to test them on the relevant hardware until tomorrow.

}

#if defined(HAL_RAMTRON_ALLOW_FALLBACK) && !HAL_RAMTRON_ALLOW_FALLBACK
AP_HAL::panic("Unable to init RAMTRON storage");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like your indentation has gone wonky here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@@ -48,7 +48,13 @@ class ChibiOS::Storage : public AP_HAL::Storage {
bool healthy(void) override;

private:
volatile bool _initialised;
enum StorageBackend {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enum class

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Flash,
SDCard,
};
volatile enum StorageBackend _initialisedType = StorageBackend::None;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't think you need the enum here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't need the enum now that StorageBackend is an enum class. Fixed.

@rmackay9
Copy link
Contributor

rmackay9 commented Mar 1, 2020

We had another report in the forums of this happening. It very often seems to happen as part of a firmware upgrade. https://discuss.ardupilot.org/t/all-parameters-set-to-default-by-themselves/52983

While they share the same register maps, the FRAM devices from Cypress
and Fujitsu return a different data structure when the RDID register is
read.
This driver will panic if HAL_RAMTRON_NO_FALLBACK is defined in the
hwdef.dat file and the RAMTRON initialisation fails. This prevents us
from silently falling back to a potentially different parameter set and
causing unexpected behaviour.
…is used

Note: This usually won't get seen as the USB is not up when this is
printed. However, adding a delay (of ~2s) into the main_loop() function
in the HAL_ChibiOS_Class.cpp, just after the scheduler is initialised,
will allow this message to be seen.
…efined

This #error pragma causes the Bootloader to fail, even though this file
class isn't used by the Bootloader. However, it is included in the base
AP_HAL definition that the Bootloader pulls in.
Inline initialisation of VLAs will cause a compilation error under
Clang, as it is not allowed according to the C specification. However,
GCC will allow it.
@nexton-winjeel
Copy link
Contributor Author

nexton-winjeel commented Mar 6, 2020

thanks for your work on this!
A few changes needed are listed below, plus I'd like to add the retries from #13581. That PR came out of a board where power up speed seems to cause the init of the ramtron to fail sometimes. I don't think the result is explained by your changes.

@tridge: I've incorporated equivalent changes to #13581 in commits 2dc533a and 528fe35.

@nexton-winjeel
Copy link
Contributor Author

Tidied up and rebased onto latest master.

@tridge
Copy link
Contributor

tridge commented Mar 6, 2020

@sypaq-nexton the main holdup on this PR now is the investigation we have going on into the param resets on cube orange. The changes you have done unfortunately don't explain the issue. We're seeing params reset with a firmware that only has RAMTRON storage enabled, and with a firmware that does every read twice to confirm with CRC. Most bizarrely we're seeing resets even when we completely remove the write() code in the ramtron driver, so it should have no ability to write to the ramtron at all.
My current theory is that the reset it actually happening in the bootloader. The CS line is currently set to floating in the bootloader, and the reset seems to happen when powering the cube with particularly type of power system (eg. certain type of ESC setup). So I'm wondering if the floating CS combined with some electrical noise in the early startup of the MCU could make the FRAM think that it is getting a write transaction while still in the bootloader. We're getting the user who can reproduce this to test a new bootloader that forces the CS line high while in the bootloader

@tridge
Copy link
Contributor

tridge commented Mar 10, 2020

@sypaq-nexton we've now proven that CS pin floating in the bootloader is the cause of the resets that have particulatly plagued cube orange. We will still do the improvements to the RAMTRON driver, but the key fix seems to be this one: #13780

@tridge
Copy link
Contributor

tridge commented Mar 11, 2020

@sypaq-nexton I did tests of this branch on a CubeBlack and found that parameters did not persist. The change of approach for the latching of the write enable bit of the FRAM doesn't work on that cypress chip. It is also dangerous as it leaves write enabled as the normal state when flying which means that on reboot the floating CS pin is much more likely to cause corruption.
I have now updated #13581 to include the other key changes from your PR and consolidated ready to merge. I will close this PR and get that one merged.

@nexton-winjeel
Copy link
Contributor Author

Great, thanks @tridge!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants