
Allow D-Cache to remain on during core power-down #1553

Merged: 1 commit into ARM-software:integration on Oct 18, 2018

Conversation


@glneo commented Aug 30, 2018

Hello all,

The first two patches should be simple enough. The third is a bit more interesting and is more of an RFC.

For K3, we have hardware-assisted coherency, but as I've argued before, the flag "HW_ASSISTED_COHERENCY" is a misnomer and is targeted specifically at DynamIQ platforms. An example of this is in psci_do_pwrdown_sequence(), where we assume every platform with the hardware-assisted coherency flag uses cores whose *_core_pwr_dwn functions do not disable the cache. This is not the case for K3 with A53 cores.

We introduce a new flag, "DISABLE_DCACHE_LATE", designed as the inverse of "WARMBOOT_ENABLE_DCACHE_EARLY". It is for platforms that do not need the cache disabled in the power-down path (or cannot have it disabled due to hardware issues).

As you can see we need to make a check for this flag in cortex_a53.S, is this the right thing to do? Should this check be added for all cores for consistency?

Thanks,
Andrew

ssg-bot commented Aug 30, 2018

Can one of the admins verify this patch?


soby-mathew commented Aug 31, 2018

Hi Andrew,

As you can see we need to make a check for this flag in cortex_a53.S, is this the right thing to do? Should this check be added for all cores for consistency?

The sequences in the CPU ops files come from the respective TRMs (see the individual core shutdown mode in the Cortex-A53 TRM). The patch modifies that sequence, which may not be right. We would need to understand your platform better, and I will trigger an email discussion regarding this.

@soby-mathew left a comment

This patch modifies the TRM-recommended power-down sequence.


glneo commented Oct 4, 2018

Pushed a rebased set.

As for modifying the power-down sequence: yes, it does, but only for the K3 platform, which needs it this way.

In general, the CPU-specific TRM is not the best place for the cache flush steps, as the exact sequence depends on how the cores are integrated into the SoC. For instance, A75 cores assume hardware will take care of cache maintenance while powering down [0]. And in our case we have an A53 that requires the hardware to take care of cache maintenance on power-down (due to our unique coherency model).

[0] https://github.com/ARM-software/arm-trusted-firmware/blob/master/lib/cpus/aarch64/cortex_a75.S#L83

@soby-mathew

The patch currently seems to only skip the cache disable and leaves the actual cache flush (L1 and L2) untouched. If the TI_K3 hardware can take care of cache maintenance, then I suspect even the cache flush need not be done.

The cpu_ops files try to capture the sequence mandated by the TRM. Hence we would like to preserve the sequence for Cortex-A53.

What I can suggest is: if you create a copy of the Cortex-A53 cpu file, call it ti_a53.S, and implement the required sequence in that file, then the patch can be acceptable. Once that is done, I think the HW_ASSISTED_COHERENCY build option should work for you, now that the cpu operations you require are in place.


glneo commented Oct 9, 2018

I have no problem making ti_a53.S, but it would need to be kept in sync with the regular cortex_a53.S for everything except the two lines disabling the cache. That is why I went with this approach of just #ifdef'ing them out for K3 platforms only.

@glneo glneo closed this Oct 9, 2018
@glneo glneo reopened this Oct 9, 2018

glneo commented Oct 9, 2018

Also, we cannot skip the cache flushes: the hardware does not actually flush the caches for us on shutdown. Rather, we cannot touch any memory with the caches off, or we lose coherency for that cache line across all cores. So we need to have the caches on during shutdown.

We also cannot enable HW_ASSISTED_COHERENCY as it assumes we have CAS based spinlocks for use in PSCI which we do not have.

Now that I've looked at it a bit more, duplicating cortex_a53.S as ti_a53.S just to drop two lines seems really wrong. It will be a maintenance burden to keep the two in sync: when, for instance, a new erratum for the A53 comes out, it will have to be applied to both files. Our cores are not custom, they are stock A53s; adding a custom CPU definition file is not correct here.


soby-mathew commented Oct 10, 2018

Also, we cannot skip the cache flushes: the hardware does not actually flush the caches for us on shutdown. Rather, we cannot touch any memory with the caches off, or we lose coherency for that cache line across all cores. So we need to have the caches on during shutdown.

So what would happen to the new dirty lines in the L1 cache after the cache is flushed in the power-down sequence? For example, the sequence here for CPU_OFF assumes that the cache is either off after the CPU power-down sequence or that it will be flushed later by hardware (HW_ASSISTED_COHERENCY).

The aff_info update would get invalidated for the above sequence in TI_K3. We make some assumptions about coherency and cache maintenance at each point in the power down sequence, and from your description, those assumptions will not be met in TI_K3.

We also cannot enable HW_ASSISTED_COHERENCY as it assumes we have CAS based spinlocks for use in PSCI which we do not have.

Not really, CAS is only used if the architecture is set to Armv8.1 or higher. Otherwise it falls back to regular load-store exclusives, which the A53 supports (here).

Now that I've looked at it a bit more, duplicating cortex_a53.S as ti_a53.S just to drop two lines seems really wrong. It will be a maintenance burden to keep the two in sync: when, for instance, a new erratum for the A53 comes out, it will have to be applied to both files. Our cores are not custom, they are stock A53s; adding a custom CPU definition file is not correct here.

Executing the power-down sequence on Cortex-A53 with the caches enabled is wrong per the TRM (the reasons are given in the TRM). We wouldn't want to merge any patch that would indicate or enable the contrary. Perhaps Cortex-A53 and the TI-53 can share the same reset handler (so the same errata will apply) and only the power-down sequence is overridden.


glneo commented Oct 10, 2018

So what would happen to the new dirty lines in the L1 cache after the cache is flushed in the power-down sequence? For example, the sequence here for CPU_OFF assumes that the cache is either off after the CPU power-down sequence or that it will be flushed later by hardware (HW_ASSISTED_COHERENCY).

Does the following invalidate act as a clean+invalidate? We cannot perform that flush with the cache off unless the location is marked as non-cached for all cores. I'm not sure how to work around this, other than to leave the cache on and only read/write non-cached locations from that point on, or to turn off the cache.

The aff_info update would get invalidated for the above sequence in TI_K3.

Well it seems to work anyway, not sure how or why, but it does.

Not really, CAS is only used if the architecture is set to Armv8.1 or higher. Otherwise it falls back to regular load-store exclusives, which the A53 supports (here).

Yes, but the regular load-store exclusives do not work for us, because the lock is written with the cache on and then data on the same cache line as the lock is read with the cache off by the starting core. The Arm ARM does not guarantee this will work, and for us it does not.

Executing the power-down sequence on Cortex-A53 with the caches enabled is wrong per the TRM (the reasons are given in the TRM). We wouldn't want to merge any patch that would indicate or enable the contrary. Perhaps Cortex-A53 and the TI-53 can share the same reset handler (so the same errata will apply) and only the power-down sequence is overridden.

I would be fine with this, but how do I implement it? The core functions are chosen by the core's MIDR, so I would need the TI-53 to have the same MIDR as the A53. I can do that, and it should work as long as I do not compile in the regular A53 file, but I need that file so I can share the reset handler.

If you would like to avoid endorsing the contrary of the TRM cache-off sequence, then we can give the define that skips it a very TI-specific name, such as TI_AM65X_WORKAROUND, to make clear it is for this one platform only and not the correct method per the TRM.

The only other fix I can think of is to put every single memory location accessed (write and read) after cache has been disabled into a non-cached section for all cores.

@soby-mathew

Sorry for the delay as I had to spend some time to think about this problem.

Does the following invalidate act as a clean+invalidate?

It is not a clean+invalidate, which means that if you are running with the caches on at that point, the cache line containing the update is going to be invalidated.

Yes, but the regular load-store exclusives do not work for us, because the lock is written with the cache on and then data on the same cache line as the lock is read with the cache off by the starting core.

Enabling the WARMBOOT_ENABLE_DCACHE_EARLY flag would avoid the issue you mention. Although we could move the spin_lock to a separate cache line aligned data to solve this irregularity, this shouldn't affect you anymore since you set the option.

The aff_info update would get invalidated for the above sequence in TI_K3.

Well it seems to work anyway, not sure how or why, but it does.

I think the problem may trigger if stressed enough. However, if you enable the HW_ASSISTED_COHERENCY flag, the invalidate operation is not done.

I am also not convinced about adding the DISABLE_DCACHE_LATE build flag to generic code; if the HW_ASSISTED_COHERENCY build can do the right thing for TI_K3, that would be ideal.

The only other fix I can think of is to put every single memory location accessed (write and read) after cache has been disabled into a non-cached section for all cores.

If you disable PSCI_STAT and RUNTIME_INSTRUMENTATION features, the only data that is accessed from generic code after caches are turned OFF are psci_svc_cpu_data.aff_info_state and the cpu/non-cpu power domain locks.

Assuming you keep the caches ON, with HW_ASSISTED_COHERENCY=1, the power domain locks use spin_locks which are all accessed with caches ON which should be alright on TI_K3.

You could explicitly flush psci_svc_cpu_data.aff_info_state in psci_plat_pm_ops->pwr_domain_pwr_down_wfi(), but it would be redundant for CPU_SUSPEND which is fine.

Any global data accessed within the platform's plat_psci_ops_t pwr_domain_suspend and pwr_domain_off hooks needs to be taken care of.

The idea is to keep the caches on throughout the power-down sequence for TI_K3, with any global data updated after the cache flush in cpu_ops handled explicitly by the platform. Since the caches are on, all the cache maintenance ops can be no-ops, which is what HW_ASSISTED_COHERENCY=1 does.

I would be fine with this, but how do I implement it? The core functions are chosen by the cores MIDR, so I would need to have the TI-53 have the same MIDR as the A53, which I can do if I do and that should work if I do not compile in the regular A53 file, but I need that so I can share the reset handler.

A TI specific build flag could work as you suggest. But let us know the results after the above changes.


glneo commented Oct 12, 2018

Sorry for the delay as I had to spend some time to think about this problem.

No problem, I've spent more time thinking about this problem than I care to admit. :)

Enabling the WARMBOOT_ENABLE_DCACHE_EARLY flag would avoid the issue you mention. Although we could move the spin_lock to a separate cache line aligned data to solve this irregularity, this shouldn't affect you anymore since you set the option.

That might help, but the issue is not the spin_lock (it is always accessed with the cache on); it is the adjacent data being accessed with the cache off and poisoning the whole cache line (reads don't poison cache-line coherency on most platforms, but for us they do, hence why we are the first to hit so many of these problems). A better fix would be to move any and all data accessed (read or write) with the cache off to either:

  1. its own cache line, with proper cache maintenance everywhere it is accessed
  2. its own cache line, marked as non-cached for all participants (expanding the use of USE_COHERENT_MEM)

Option 2 would be my preference; then we could drop most of the cache maintenance operations from TF-A altogether, I would think.

I am also not convinced about adding the DISABLE_DCACHE_LATE build flag to generic code; if the HW_ASSISTED_COHERENCY build can do the right thing for TI_K3, that would be ideal.

I agree, and this is my end goal. I've been adding WARMBOOT_ENABLE_DCACHE_EARLY checks in places where HW_ASSISTED_COHERENCY also implies the same behaviour; with this patch set I add a similar flag for the shutdown path. When these two flags cover all the locations protected by HW_ASSISTED_COHERENCY, they will be redundant and TI_K3 can simply use HW_ASSISTED_COHERENCY.

I only have one spot left: https://github.com/ARM-software/arm-trusted-firmware/blob/master/lib/psci/psci_private.h#L159
This switches the lock type and removes some explicit flush operations; changing either of these still breaks TI_K3. I can work around this by explicitly flushing the locks and psci_svc_cpu_data from inside pwr_domain_pwr_down_wfi(), as you said.

So if that is acceptable, let me update the patches in this PR and make it an RFC. If you agree with the method, I can clean up the patch a bit more and use a TI-specific build flag instead of HW_ASSISTED_COHERENCY.

@soby-mathew

A better fix would be to move any and all data accessed (read or write) with the cache off to either:

There is no access to uncached data within PSCI when HW_ASSISTED_COHERENCY=1 and ENABLE_RUNTIME_INSTRUMENTATION=0, as far as I am aware. I get the issue with spin_locks, but the Arm ARM is quite clear about accessing data with differing memory attributes, and we comply with it. Moving the spin_locks to their own cache line is OK, but I don't see any need for changes to regular PSCI data.

I can work around this by explicitly flushing the locks and psci_svc_cpu_data from inside pwr_domain_pwr_down_wfi(), as you said. So if that is acceptable, let me update the patches in this PR and make it an RFC. If you agree with the method, I can clean up the patch a bit more and use a TI-specific build flag instead of HW_ASSISTED_COHERENCY.

Yes, please work on that. We need to see the changes and if you can restrict the changes to cpu_ops and platform code, that would be ideal.


glneo commented Oct 16, 2018

Yes, please work on that. We need to see the changes and if you can restrict the changes to cpu_ops and platform code, that would be ideal.

I've done that and updated this pull request. It is a bit hacky still but it works and all the changes (except leaving cache on in a53 power down) are in platform code.

@soby-mathew left a comment

Just one comment. Otherwise looks fine.

@@ -228,11 +228,13 @@ endfunc cortex_a53_reset_func
func cortex_a53_core_pwr_dwn
mov x18, x30

#if !HW_ASSISTED_COHERENCY
@soby-mathew commented Oct 16, 2018

Just one comment: use a TI-specific build flag in this file, say "#ifndef TI_AM65X_WORKAROUND".

@@ -252,11 +254,13 @@ endfunc cortex_a53_core_pwr_dwn
func cortex_a53_cluster_pwr_dwn
mov x18, x30

#if !HW_ASSISTED_COHERENCY

Same comment as above.

Leave the caches on and explicitly flush any data that
may be stale when the core is powered down. This prevents
non-coherent interconnect access which has negative
side-effects on AM65x.

Signed-off-by: Andrew F. Davis <afd@ti.com>

glneo commented Oct 16, 2018

Updated.

@soby-mathew

jenkins: test this please

@soby-mathew

We are having some CI issues at the moment and hence not able to merge any PRs.

@soby-mathew

jenkins: test this please

@soby-mathew soby-mathew merged commit 0059be2 into ARM-software:integration Oct 18, 2018
@glneo glneo deleted the dcache-late-disable branch January 21, 2019 18:35