Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[master] Arista 7260 -- modified ID of ff:0b.3 in YAML file #15899

Closed

Conversation

assrinivasan
Copy link
Contributor

@assrinivasan assrinivasan commented Jul 18, 2023

Why I did it

The Arista 7260 has a PCI peripheral at address ff:0b.3 whose ID is set as '0001' in the pcie.yaml file but presents as '6f76' on the actual device. This causes a device mismatch error on the pcie daemon and subsequent syslog flooding with the same error.

root@str2-7260cx3-acs-9:~# setpci -s ff:0b.3 2.w
6f76
Work item tracking
  • Microsoft ADO: 24567508

How I did it

Modified ID field of ff:0b.3 in the corresponding pcie YAML file

How to verify it

Install latest image after the merge and verify that syslogs are not flooded with PCI device mismatch errors

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211
  • 202305

Tested branch (Please provide the tested image version)

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

@prgeor
Copy link
Contributor

prgeor commented Jul 19, 2023

@assrinivasan can you paste the output of setpci PCI vendor ID for the PCI device in question(in PR description)?

@prgeor
Copy link
Contributor

prgeor commented Jul 19, 2023

@Staphylo @andywongarista could you please review this change?

@assrinivasan
Copy link
Contributor Author

@assrinivasan can you paste the output of setpci PCI vendor ID for the PCI device in question(in PR description)?

Done.

@Blueve
Copy link
Contributor

Blueve commented Jul 19, 2023

I got similar error with newest image on Arista 720DT:
Is there a regression that widely impacting Arista platforms? @prgeor @Staphylo @andywongarista

2023-07-16T15:19:43.1551025Z E               Match Messages:
2023-07-16T15:19:43.1552491Z E               Jul 16 14:35:00.130709 bjw-can-720dt-4 ERR pmon#CCmisApi: PCIe device 00:18.7 ID mismatch. Expected 0001, received 15ef
2023-07-16T15:19:43.1553671Z E               
2023-07-16T15:19:43.1555107Z E               Jul 16 14:35:00.159597 bjw-can-720dt-4 ERR pmon#CCmisApi: PCIe device 02:00.7 ID mismatch. Expected 15e6, received 15e4

@Staphylo
Copy link
Collaborator

Hi, this is a regression introduced by a change in sonic-platform-daemons.
The expected device ID by the kernel is actually 0001 and not 6f76.
However if you query the pci config space directly, it is indeed 6f76.

Looking at the commit history on sonic-buildimage I see the following submodule update of sonic-platform-daemons
b3e5910
It contains changes to pcied here
sonic-net/sonic-platform-daemons@d73808c

On top of some obvious issues in the code (leftover pass, no context manager with open, ...) it mainly violates the abstraction provided by the Pcie and PcieUtil classes.
The load_platform_pcieutil() offers the possibility for vendor to implement their own pcie check logic.
This new code proceeds with parsing the yaml file directly, thus bypassing the Pcie class.
Also this parsing happens at every loop iteration of the daemon which is wasteful.

The real issue here is the complete bypass of the PcieUtil introduced by the PR above.

For our platforms we implement our own PcieUtil class and ignore the yaml file entirely.
It still exists on our products to avoid pcie-check.sh from asserting since it expects this file, even though pcieutil check ends up using the API. So for newer products our plan is to create an empty pcie.yaml file.

The reason we implement PcieUtil and not a static yaml file is to gracefully handle hotswap hardware.
In the case of chassis, the number of linecards/fabrics and models will affect what the pcie topology looks like.
There are also other more technical reasons why it's better to not hardcode the pcie device tree.

@Blueve why is pmon#CCmisApi reporting errors instead of pmon#pcied ?

@jarias-lfx
Copy link

/easycla

@prgeor
Copy link
Contributor

prgeor commented Jul 19, 2023

@Staphylo I did not understand this. Can you elaborate? Why would kernel see a different device ID ?

The expected device ID by the kernel is actually 0001 and not 6f76.
However if you query the pci config space directly, it is indeed 6f76.

@StormLiangMS
Copy link
Contributor

StormLiangMS commented Aug 1, 2023

@assrinivasan this is still causing many failures in nightly, could you work with reviewer to resolve the PR?

@assrinivasan
Copy link
Contributor Author

@prgeor @Staphylo could I modify this PR to remove ff:0b.03 from pcie.yaml file, per our conversation?

@Staphylo
Copy link
Collaborator

Staphylo commented Aug 9, 2023

@assrinivasan yes you can remove this device from the pcie.yaml file.
I'm working on a PR to remove all of these from our pcie.yaml files but it might be ready in a few days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants