AST1300 PCIe device produces freeze/fence #257

Open
JeremyRand opened this issue Aug 23, 2020 · 5 comments

@JeremyRand

I have a Talos II workstation (latest 2.00 firmware from Raptor), and am trying to use an IGCME-1300-R10 GPU (chipset is AST1300) with it (in conjunction with a StarTech PEX2MPEX to attach the MiniPCIe GPU to a standard PCIe slot on the Talos). Unfortunately, while the bridge device component of the IGCME-1300-R10 is detected successfully, the VGA controller component produces a freeze/fence in Skiboot logs, and does not subsequently show up in lspci output.

I've tried connecting the PEX2MPEX with IGCME-1300-R10 to an x86 machine (running Windows) and the VGA device does show up as a PCIe device in Windows, which indicates that there is something POWER-specific about this problem.

Curiously, on the older firmware that the Talos II shipped with (not sure of the firmware version, but it's whatever the first batch of Talos II machines shipped with, as it was a pre-order), the bridge device component of the IGCME-1300-R10 isn't present in lspci output either, which suggests that the situation has at least improved between those firmware versions.

I'm attaching Skiboot and lspci output from both the latest 2.00 firmware and the firmware that the Talos II shipped with. (Lest anyone get confused, please note that these logs refer to 2 different AST GPU devices: the AST2500 that's part of the built-in BMC (which works fine), and the AST1300 that this issue is about.) Let me know if there's anything I can do to help debug it. (Or, if you think I should be reporting this to Raptor rather than to you, let me know and I'll do so.)

lspci-with-unrecognized-ast1300-firmware-2.00.txt
lspci-with-unrecognized-ast1300-stock-firmware.txt
skiboot-with-unrecognized-ast1300-firmware-2.00.txt
skiboot-with-unrecognized-ast1300-stock-firmware.txt

@oohal
Contributor

oohal commented Aug 24, 2020

Parsing the EEH register dump:

==== PHB Register dump found ====

[   73.061847781,3] PHB#0033[8:3]:             PCI FIR=2000000000000000

NEST FIR = 0000800000000000:
	16 - PFIR_freeze

PCI FIR = 2000000000000000:
	 2 - AIB_intf_error

phbErrorStatus = 00000d0000000000:
	20 - RXE_ARB OR Error Status
	21 - RXE_MRG OR Error Status
!	23 - TXE OR Error Status

phbTxeErrorStatus = 0000000400000000:
!	29 - CFG Write Request Timeout

phbRxeArbErrorStatus = 0000000020000000:
!	34 - PCT Timeout

phbRxeMrgErrorStatus = 0000000000000001:
!	63 - pb_etu_ai_rx_raise_fence

phbRegbErrorStatus = 0040000000000000:
	 9 - PCIE Link Up
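
For anyone reading the dump by hand: the bit numbers listed under each register appear to follow IBM (big-endian) numbering, where bit 0 is the most significant bit of the 64-bit value. A minimal sketch of that decoding, assuming this convention (the helper names `ppc_bit`, `set_bits` and `print_bits` are made up for illustration and are not taken from skiboot or from the tool that produced the dump):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Mask for IBM-numbered bit n of a 64-bit register (bit 0 is the MSB). */
uint64_t ppc_bit(unsigned int n)
{
	assert(n < 64);
	return UINT64_C(1) << (63 - n);
}

/*
 * Collect the IBM bit numbers that are set in val, in ascending order.
 * out[] must have room for 64 entries. Returns how many bits were set.
 */
int set_bits(uint64_t val, unsigned int out[64])
{
	int count = 0;

	for (unsigned int n = 0; n < 64; n++) {
		if (val & ppc_bit(n))
			out[count++] = n;
	}
	return count;
}

/* Print a register value followed by one line per set bit number. */
void print_bits(const char *name, uint64_t val)
{
	unsigned int bits[64];
	int count = set_bits(val, bits);

	printf("%s = %016llx:\n", name, (unsigned long long)val);
	for (int i = 0; i < count; i++)
		printf("\t%2u\n", bits[i]);
}
```

Calling `print_bits("phbErrorStatus", 0x00000d0000000000ULL)` lists bits 20, 21 and 23, which matches the entries shown for that register above. The sketch does not know the bit names; those come from the PHB register documentation.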

So the freeze is due to a config write timeout, which is... unusual. My best guess is that the AST firmware is slow to enable the VGA component, so it's not ready to handle the CFG write when we start scanning the bus. If you're OK with building your own firmware, can you try this patch?

diff --git a/core/pci.c b/core/pci.c
index e195ecbf4255..a90d4d3c9793 100644
--- a/core/pci.c
+++ b/core/pci.c
@@ -632,6 +632,8 @@ static bool pci_enable_bridge(struct phb *phb, struct pci_device *pd)
 	    pd->dev_type == PCIE_TYPE_SWITCH_DNPORT) {
 		if (!pci_bridge_wait_link(phb, pd, was_reset))
 			return false;
+	} else if (pd->dev_type == PCIE_TYPE_PCIE_TO_PCIX) {
+		time_wait_ms(1000);
 	}
 
 	/* Clear error status */
diff --git a/hw/phb4.c b/hw/phb4.c
index 3f22a2c4d98f..7b463420ff20 100644
--- a/hw/phb4.c
+++ b/hw/phb4.c
@@ -84,7 +84,7 @@
 
 
 #undef NO_ASB
-#undef LOG_CFG
+#define LOG_CFG
 
 #include <skiboot.h>
 #include <io.h>

Reporting this to Raptor is probably a good idea, but I think this is going to be unique to that specific adapter, so unless Raptor have one I'm not sure they can help much. If you want more detailed instructions about how to patch the Talos' firmware, let me know and I'll do a writeup.

@JeremyRand
Author

@oohal Thanks for the patch. I was able to get it to build, though I increased the wait from 1 second to 10 seconds on the advice of someone in #talos-workstation IRC (basically to increase the safety margin). Unfortunately, no change in behavior: the VGA device still doesn't show up in lspci, and similar freeze/fence errors are still logged. (The added 10-second wait is pretty clearly visible in the log.) There is a lot more logging this time (I guess that's a result of the LOG_CFG line you changed), so maybe the extra logs will help determine the cause. Attaching the Skiboot log.

skiboot-with-unrecognized-ast1300-with-10s-wait.txt

@oohal
Contributor

oohal commented Aug 27, 2020

Thanks for the logs. I noticed there's some odd stuff in there:

[   73.390875079,7] PHB#0033[8:3]: 100 CFG8 Wr 18=00000001 # write the default bus numbers
[   73.390876229,7] PHB#0033[8:3]: 100 CFG8 Wr 19=00000000
[   73.390877346,7] PHB#0033[8:3]: 100 CFG8 Wr 1a=00000000

[   73.390878278,7] PHB#0033:01:00.0 Found VID:1a03 DEV:1150 TYP:7 MF- BR+ EX+

[   73.390880618,7] PHB#0033[8:3]: 100 CFG32 Rd a4=0000000d
[   73.390882363,7] PHB#0033[8:3]: 100 CFG16 Rd 04=0000
[   73.390883445,7] PHB#0033[8:3]: 100 CFG16 Wr 04=00000140 # enable the error reporting bits in the command register

[   73.390885288,7] PHB#0033[8:3]: 100 CFG16 Rd 88=2810
[   73.390886859,7] PHB#0033[8:3]: 100 CFG32 Rd 18=00000001
[   73.390887888,7] PHB#0033[8:3]: 100 CFG32 Wr 18=00000141 # <--- ????

That last write definitely isn't supposed to be happening. I think those writes are coming from phb4_endpoint_init(), which enables error reporting for the device. The AST1300 doesn't appear to implement the Advanced Error Reporting (AER) capability, so when we initialise it we end up trashing config offsets 0x18..0x1b, since the saved aercap offset is zero. For normal devices that doesn't matter, since 0x18 is a BAR register which will be overwritten later on by Linux. For bridges, however, those are the primary/secondary/subordinate bus number registers, which are used to route config space accesses. Broken config space routing would explain the timeouts.
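
To make the failure mode in the paragraph above concrete, here is a small stand-alone simulation. It is only a sketch under stated assumptions: `struct cfg_space`, `cfg_read32`, `cfg_write32` and `enable_ecrc` are hypothetical stand-ins rather than skiboot's real API, and the constants (control register at offset 0x18 inside the AER capability, ECRC enable bits 0x40 and 0x100) are my reading of the PCIe layout, chosen because they reproduce the 0x00000001 to 0x00000141 transition seen in the log.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed offsets and bits; names are illustrative, not skiboot's. */
#define CFG_BUS_NUMBERS  0x18   /* bridge header: primary/secondary/subordinate bus */
#define AER_CAPCTL       0x18   /* control register offset inside the AER capability */
#define AER_ECRC_GEN_EN  0x040  /* ECRC generation enable */
#define AER_ECRC_CHK_EN  0x100  /* ECRC check enable */

/* A fake config space for one device, stored as little-endian bytes. */
struct cfg_space {
	uint8_t bytes[4096];
};

uint32_t cfg_read32(const struct cfg_space *c, uint32_t off)
{
	assert(off + 4 <= sizeof(c->bytes));
	return (uint32_t)c->bytes[off] |
	       ((uint32_t)c->bytes[off + 1] << 8) |
	       ((uint32_t)c->bytes[off + 2] << 16) |
	       ((uint32_t)c->bytes[off + 3] << 24);
}

void cfg_write32(struct cfg_space *c, uint32_t off, uint32_t val)
{
	assert(off + 4 <= sizeof(c->bytes));
	for (int i = 0; i < 4; i++)
		c->bytes[off + i] = (uint8_t)(val >> (8 * i));
}

/*
 * Enable ECRC through the AER capability found at offset aercap.
 * aercap == 0 means the device has no AER capability. With check_cap
 * unset, the read-modify-write still happens and lands on
 * 0 + AER_CAPCTL == 0x18, which on a bridge is the bus number field.
 */
void enable_ecrc(struct cfg_space *c, uint32_t aercap, int check_cap)
{
	uint32_t val;

	if (check_cap && !aercap)
		return; /* no AER capability: nothing to configure */

	val = cfg_read32(c, aercap + AER_CAPCTL);
	val |= AER_ECRC_GEN_EN | AER_ECRC_CHK_EN;
	cfg_write32(c, aercap + AER_CAPCTL, val);
}
```

Starting from a bridge whose byte at 0x18 is 0x01 (as in the log) and calling `enable_ecrc(&cfg, 0, 0)` leaves 0x141 in the 32-bit word at 0x18, so the primary bus byte becomes 0x41 and the secondary bus byte becomes 0x01. With the check enabled, the bus numbers are left untouched.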

Can you try this patch (keep the LOG_CFG change too):

diff --git a/hw/phb4.c b/hw/phb4.c
index 3f22a2c4d98f..34e9dd58b745 100644
--- a/hw/phb4.c
+++ b/hw/phb4.c
@@ -787,6 +787,7 @@ static void phb4_endpoint_init(struct phb *phb,
 		  PCICAP_EXP_DEVCTL_UR_REPORT);
 
 	/* Enable ECRC generation and check */
+	if (!aercap) return;
 	pci_cfg_read32(phb, bdfn, aercap + PCIECAP_AER_CAPCTL, &val32);
 	val32 |= (PCIECAP_AER_CAPCTL_ECRCG_EN |
 		  PCIECAP_AER_CAPCTL_ECRCC_EN);

@JeremyRand
Author

@oohal Thanks for the new patch. Applying it on top of the previous patch I applied yields the following log (unfortunately no change in visible behavior; the VGA device still doesn't show up in lspci):

skiboot-with-unrecognized-ast1300-with-10s-wait-and-AERcap.txt

@JeremyRand
Author

Adding a data point: I tried using a different MiniPCIe device (a WiFi card) with the StarTech PEX2MPEX in my Talos II, and it worked fine. So that at least confirms that the PEX2MPEX isn't responsible for the issue (although we already guessed that).
