About L3 evict #213

Closed
gengtsh opened this issue Mar 28, 2019 · 6 comments

gengtsh commented Mar 28, 2019

Hi,
I noticed that many metrics about L3 evicts use the event "L2_TRANS_L2_WB". According to Chapter 19.6 in the "Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3B", this event "Counts L2 dirty (modified) cache lines evicted by a demand request." Obviously it does not count the lines that are evicted by the L3 cache. Is something wrong here, or is there a special reason for using the event this way?

In addition, in the file "groups/sandybridgeEP/L3.txt", "L3 data volume" is calculated from "L2_LINES_IN_ALL" and "L2_TRANS_L2_WB". What does "L3 data volume" mean here? Does it mean the data flowing through L3? If so, there is a problem with using "L2_LINES_IN_ALL": if a load request misses L3, the fetched line is brought from memory into both L3 and L2; but if the load misses L2 and hits L3, the missed line is fetched into L2 from L3 and no new line fills L3. I'm really confused by this formula in the .txt file.
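
For reference, the relevant part of that group file looks roughly like this (quoted from memory, so the exact metric names and the rest of the event set may differ slightly):

```
EVENTSET
PMC0  L2_LINES_IN_ALL
PMC1  L2_TRANS_L2_WB

METRICS
L3 load data volume [GBytes]   1.0E-09*PMC0*64.0
L3 evict data volume [GBytes]  1.0E-09*PMC1*64.0
L3 data volume [GBytes]        1.0E-09*(PMC0+PMC1)*64.0
```

As far as I understand, the 64.0 is the cache line size in bytes and the 1.0E-09 converts bytes to GBytes.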

Any help is appreciated.

TomTheBear (Member) commented

The L3 group measures traffic between L2 and L3. Same for the L2 group, which measures traffic between L1 and L2.

The cache lines evicted by the L3 cache are closely related to the write bandwidth to the memory controllers. Most people want core-local data, and there is no "evicts from L3" event for the cores. You can use the LLC segments, but they count per socket, not per core.

L3 data volume is the amount of data loaded from and evicted to L3 from the perspective of the CPU cores, the processing units, thus between L2 and L3. On most Intel systems this is the data flowing through L3 because the L3 is inclusive. For Skylake and later, the L3 is a victim cache, which is why the L3 group is somewhat misleading on these platforms: not all data flows through L3, as data can be loaded into L2 directly from memory.

gengtsh closed this as completed Mar 28, 2019
gengtsh reopened this Mar 28, 2019

gengtsh commented Mar 28, 2019

Thanks, Thomas. By "You can use the LLC segments", you mean the per-socket uncore events, right?

Still, I think using "L2_LINES_IN_ALL" may not be very appropriate or meaningful from the perspective of LLC performance analysis. Let's just consider the inclusive cache architecture. As I said, if most demand requests from some core X that miss L2 hit in L3, they actually won't cause inter-core interference, because they won't evict any cache lines belonging to other cores. Yet these requests do increase this core's traffic between its private L2 and the shared L3 and thus become part of the "L3 data volume" as defined in likwid.

Even for Skylake's non-inclusive LLC architecture, the requests that miss L2 still need to access L3 (if I understand this correctly). So my claim above also holds for Skylake and later architectures.

TomTheBear (Member) commented

Yes, the CBo (Intel name) or CBOX (LIKWID name) counters.

So the L3 data volume should also contain all line fills and evicts of L3 from/to the backend (memory, interconnect, network, ...)? Or do you mean traffic caused by remote L2 HITM accesses? Or is it just the name that bothers you, and if it were 'L2 <-> L3 data volume' we wouldn't have this conversation?

You are free to propose changes through pull requests. If you just need another group, create it yourself: https://github.com/RRZE-HPC/likwid/wiki/likwid-perfctr#defining-custom-performance-groups
If it is a helpful group and you can somehow prove that the counts are correct, I can also integrate it into the repo.
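
Roughly, a group is just a plain text file with the same SHORT/EVENTSET/METRICS/LONG sections as the shipped groups. As a sketch (the group name, file name and directory below are made up for the example and quoted from memory; please double-check them against the wiki page above):

```
# hypothetical user-defined group L2L3 for a SandyBridge EP machine:
# write L2L3.txt with the desired EVENTSET and METRICS, place it in the
# user group directory and select it like any shipped group
mkdir -p $HOME/.likwid/groups/sandybridgeEP
cp L2L3.txt $HOME/.likwid/groups/sandybridgeEP/
likwid-perfctr -C 0 -g L2L3 ./your_app
```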


gengtsh commented Mar 29, 2019

In your first reply you mentioned that "L3 data volume is the amount of data loaded from and evicted to L3 from the perspective of the CPU cores, the processing units, thus between L2 and L3".
Is it true that in likwid, L3 data volume really is L2<->L3 data volume? Then it consists of two parts: one is L2 misses that hit in L3, where data is transferred from L3 to L2; the other is L2 evicts to L3, where data is transferred from L2 to L3.
Based on the likwid/groups/sandybridgeEP/L3.txt file, the L3 data volume equals 1.0E-09*(L2_LINES_IN_ALL+L2_TRANS_L2_WB)*64.0. I think this explanation makes more sense.
Here, L2_TRANS_L2_WB counts L2 writebacks that access the L2 cache, which stands for the data evicted from L2 to L3. L2_LINES_IN_ALL counts L2 cache lines filling L2, which stands for the data (L2 miss, L3 hit) transferred from L3 to L2.

It is not related to remote L2 HITM accesses. I just want to say that the event "L2_LINES_IN_ALL" (L2 cache lines filling L2) contains lines that are already present in L3 and therefore never fill L3 itself (for example, some core's requests may miss L2 many times but always hit in L3).

So, yes, "L2<->L3 data volume" is definitely a clearer name. In other words, if you call it "L2<->L3 data volume" or "L2<->L3 traffic", it makes sense to use "L2_LINES_IN_ALL"; but if you want a metric that should count the lines filling L3 itself, "L2_LINES_IN_ALL" may not be accurate. (Instead, "MEM_LOAD_UOPS_RETIRED.LLC_MISS" or "MEM_LOAD_UOPS_RETIRED.L3_MISS" could be more accurate. In addition, "LLC Misses" (event 2EH, umask 41H) seems to be a more general event that collects the LLC misses.) Please correct me if I'm wrong.

Thank you again!

TomTheBear (Member) commented

It's nice that you think about the event selection so mindfully.

The names could be more specific, but until now it was clear to everyone what the L3 group measures. The most important part of the description you cited is "from the perspective of CPU cores". The CPU cores don't care what the L3 has to do to provide the cache lines (hit in some L3 segment, or miss and load from memory).

The MEM_LOAD_UOPS* events are known to have problems:
SandyBridge: BT106, BT241 and BT243 in Spec Update
IvyBridge: CA93 in Spec Update
Haswell: HSE114 in Spec Update
Broadwell: BDF87 in Spec Update

I tested the LLC Misses event (originally called LONGEST_LAT_CACHE.MISS) quite often and it undercounts dramatically.

If you need line fills and evicts to/from L3, use the MEM group and measure directly at the memory controllers. The counts are highly accurate.
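
Something along these lines (the binary is just a placeholder; the uncore counters used by the MEM group are per socket, not per core):

```
# measure read/write traffic directly at the memory controllers
likwid-perfctr -C 0-3 -g MEM ./your_app
```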


gengtsh commented Apr 2, 2019

Thanks TomTheBear, it's all clear to me now. I will close this issue.

gengtsh closed this as completed Apr 2, 2019