ML605 and VC707 PCIe Troubleshooting

Kermin Elliott Fleming edited this page Jun 9, 2015 · 6 revisions

Programming Issues

The most common issues at programming time fall into two categories: how to program and what to program.

To test programming the FPGA we provide golden bitfiles. These bitfiles are known to work on our testing configuration. You can use them to test both FPGA programming and the integrity of the PCIe link.

How To Program

Xilinx FPGAs are programmed with the Xilinx-supplied tool, iMPACT. Most errors in how to program stem from improper permissions for the vendor tools. The first step in debugging a new setup is to program the FPGA by hand with iMPACT, using the golden bitfile.

Once the FPGA is programmed with the golden bitfile, you may need to reboot your system. A reboot is usually necessary on desktop-class machines, whose PCIe controller hardware does not support hot-plug. Some servers do support hot-plug and do not require a reboot, although rebooting them does no harm.

Note that the FPGA must retain power during the reboot, otherwise it loses its configuration. Once the reboot has completed, use lspci to verify that the FPGA is visible. You should see something like:

% sudo lspci -vv

01:00.0 RAM memory: Device 1be7:b100
        Subsystem: Device 1be7:b100
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 32 bytes
        Region 0: Memory at f7e00000 (32-bit, non-prefetchable) [size=32K]
        Capabilities: <access denied>
        Kernel driver in use: bluenoc

Here, we see the vendor and device ID (the magic number) of the FPGA PCIe device, 1be7:b100. Also note the last line: the OS has bound the bluenoc driver to the device. This happens only if you have installed the bluenoc driver.
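The check described above can be scripted. The following is a minimal sketch, not part of LEAP: the function name fpga_status is hypothetical, and it assumes lspci output in the layout shown above, where each device starts on an unindented line and its details follow on indented lines.

```sh
#!/bin/sh
# fpga_status ID: read lspci output on stdin and report whether a device
# with the given vendor:device ID is present and bound to bluenoc.
# Hypothetical helper for illustration; it is not shipped with LEAP.
fpga_status() {
    awk -v id="$1" '
        # Unindented lines start a new device record.
        /^[0-9a-fA-F]/ { in_dev = (index($0, id) > 0); if (in_dev) found = 1 }
        # Detail lines belong to the most recent device record.
        in_dev && /Kernel driver in use: bluenoc/ { bound = 1 }
        END {
            if (!found) { print "not-found"; exit 1 }
            if (!bound) { print "no-driver"; exit 2 }
            print "ok"
        }'
}

# Typical use on a live system:
#   lspci -vv | fpga_status 1be7:b100
```

A result of not-found suggests the link did not come up (try the reboot described above), while no-driver suggests the bluenoc driver is not installed or not loaded.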

What to Program

At programming time, leap-fpga-ctrl attempts to discover and program system devices. Bringing up a device typically involves interacting with system-specific features (hardware, driver loading, permission settings, and so on), so leap-fpga-ctrl relies on several scripts that tell it how to take certain actions. If leap-fpga-ctrl’s configuration files are not correctly parameterized, programming can fail.

An example of such a failure is shown below:

fw1811@fw1811:~/Downloads/hello_hybrid_vc707_vivado/bm/null$ ./run 
Reserving VC707
program 1 /dev/bluenoc_0000:04:00.0
Programming device: /dev/bluenoc_0000:04:00.0
Failed to program FPGA

To understand the cause of the error, examine FPGA_programming.log:

INFO:iMPACT - Digilent Plugin: Plugin Version: 2.4.4
INFO:iMPACT - Digilent Plugin: Opening device : "SN:210201234286".
ERROR:iMPACT - Digilent Plugin: failed to open device (DmgrOpenEx, erc = 3072).
Signature        False               
KeepSVF              False               
ConcurrentMode       False               
UseHighz             False               
ConfigOnFailure      Stop                
UserLevel            Novice              
MessageLevel         Detailed            
svfUseTime           false               
SpiByteSwap          Auto_Correction     
AutoInfer            false               
SvfPlayDisplayComments false               

Here the error suggests that the device serial number configured for the FPGA is incorrect. Examine LEAP’s configuration and set the correct serial number.
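To see which serial number iMPACT tried to open, the log can be filtered. This is a small sketch under the assumption that the log uses the message format shown above; the function name failed_serials is hypothetical.

```sh
#!/bin/sh
# failed_serials [FILE...]: print each serial number that iMPACT tried to
# open immediately before reporting "failed to open device".
# Reads stdin when no file is given. Hypothetical helper for illustration.
failed_serials() {
    awk '
        /Opening device/ {
            sn = ""
            if (match($0, /SN:[0-9A-Za-z]+/))
                sn = substr($0, RSTART + 3, RLENGTH - 3)   # drop the "SN:" prefix
        }
        /failed to open device/ { if (sn != "") print sn }
    ' "$@"
}

# Typical use:
#   failed_serials FPGA_programming.log
```

The printed value is the serial number LEAP was configured to use, so it can be compared against the serial number of the cable actually attached to the board.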

PCIe

leap-fpga-ctrl may terminate with the error message:


Reserving VC707
FPGA is already programmed (signature match)...
deactivate 1 /dev/bluenoc_0000:01:00.0
Disabling PCIe device...
activate 1 /dev/bluenoc_0000:01:00.0
Enabling PCIe device...
Activating device: VC707 1 /dev/bluenoc_0000:01:00.0
Enabling FPGA device vc707.0 access...
hello_hybrid_vc707_vivado_sw.exe: sw/model/pcie-bluenoc.cpp:131: bool PCIE_DEVICE_CLASS::Init(): Assertion `board_info.is_active' failed.
sh: line 1: 3363 Aborted (core dumped) ../../pm/.//sw/obj/hello_hybrid_vc707_vivado_sw.exe --modeldir=../../pm/./ --workload=null --param FPGA_DEV_PATH="/dev/bluenoc_0000:01:00.0" --global-strings='../../pm/.//hello_hybrid_vc707_vivado.str' --global-strings='../../pm/.//hello_hybrid_vc707_vivado.str' 2>&1
Model exited with status 134
Disabling FPGA device vc707.0 access…

The problem is caused by an incorrect setting of the PCIe sockets in the configuration and control scripts. LEAP’s PCIe device management requires knowledge of specific system addressing. For example, lspci shows:

00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor DRAM Controller (rev 09)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor PCI Express Root Port (rev 09)
00:02.0 VGA compatible controller: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor Graphics Controller (rev 09)
00:14.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB xHCI Host Controller (rev 04)
00:19.0 Ethernet controller: Intel Corporation 82579LM Gigabit Network Connection (rev 04)
00:1a.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB Enhanced Host Controller #2 (rev 04)
00:1b.0 Audio device: Intel Corporation 7 Series/C210 Series Chipset Family High Definition Audio Controller (rev 04)
00:1d.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB Enhanced Host Controller #1 (rev 04)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev a4)
00:1f.0 ISA bridge: Intel Corporation Q77 Express Chipset LPC Controller (rev 04)
00:1f.2 SATA controller: Intel Corporation 7 Series/C210 Series Chipset Family 6-port SATA Controller [AHCI mode] (rev 04)
00:1f.3 SMBus: Intel Corporation 7 Series/C210 Series Chipset Family SMBus Controller (rev 04)
01:00.0 RAM memory: Device 1be7:b100

The last row, ‘01:00.0 RAM memory: Device 1be7:b100’, is the FPGA card at socket 01:00.0. This socket address (bus_id) must be used in both the configuration script /usr/etc/leap/config and the board control script, e.g. /usr/share/leap/scripts/VC707. The latter script also needs to know the socket address of the PCIe hot-plug controller, as in this excerpt:


#!/bin/sh -p

##
# Configuration script for Xilinx VC707 FPGA boards.
##

# Keep user from messing with PATH when running as root
PATH=/sbin:/bin:/usr/sbin:/usr/bin
export PATH
unset LD_LIBRARY_PATH

echo $*
function usage()
{
    echo "Usage: $0 <program|activate> "
    exit 1
}

if [ $# -ne 3 ]; then
    usage
fi

arg_command="$1"
arg_devid="$2"
arg_driver="$3"

# Hot-plug controller for enabling/disabling while programming
bus_id=`leap-fpga-ctrl --dev=${arg_devid} --getconfig=bus_id`
pci_devname="^${bus_id}"
case "$bus_id" in
    "0000:01:00.0")
        pci_enable="/sys/devices/pci0000:00/0000:00:01.0/rescan"
        pci_disable="/sys/devices/pci0000:00/0000:00:01.0/${bus_id}/remove"

The socket address 0000:00:01.0 is the PCIe hot-plug controller. Discovering the address of the hot-plug controller for a specific card requires examining the hardware of the individual system. Tools like lspci and dmesg can help in finding these values. A good way to start is to search the PCI device tree for the Bluenoc driver:

find /sys/devices/ -type l -ilname "*bluenoc*"

In this example, it returns:

/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/driver
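The values needed by the board control script can be derived from that path. The sketch below is an illustration only: the function name hotplug_paths is hypothetical, and it assumes that the FPGA sits directly beneath the bridge acting as hot-plug controller, as it does in this example. Other topologies may need manual inspection.

```sh
#!/bin/sh
# hotplug_paths DRIVER_LINK: given the sysfs "driver" link of the FPGA,
# print the bus_id and the rescan/remove paths used by the control script.
# Assumption: the parent directory of the device is the hot-plug controller.
hotplug_paths() {
    driver_link=$1
    dev_dir=$(dirname "$driver_link")      # .../0000:00:01.0/0000:01:00.0
    bridge_dir=$(dirname "$dev_dir")       # .../0000:00:01.0
    bus_id=$(basename "$dev_dir")          # 0000:01:00.0
    printf 'bus_id=%s\n' "$bus_id"
    printf 'pci_enable=%s/rescan\n' "$bridge_dir"
    printf 'pci_disable=%s/%s/remove\n' "$bridge_dir" "$bus_id"
}

# Typical use with the path found above:
#   hotplug_paths /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/driver
```

For the example path, the printed pci_enable and pci_disable values match the ones in the VC707 script excerpt.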

Examining the sysfs representation of the PCIe devices can also help. Check the output of "dmesg | grep 'pci'" or "dmesg | grep 'bluenoc'", or walk the device tree manually until you find the link to Bluenoc, for example:


ls -l /sys/devices/pci0000:00
ls -l /sys/devices/pci0000:00/0000:00:01.0
ls -l /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0
ls -l /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/driver

The last command shows the link:

/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/driver -> ../../../../bus/pci/drivers/bluenoc

Reprogramming

LEAP provides mechanisms for programming the FPGA without rebooting. This involves caching and reloading the PCIe device configuration dynamically, as described in the LEAP documentation.

If reprogramming is not configured correctly, errors can result:

Reserving VC707
 Disabling FPGA device vc707.0 access...
 Enabling FPGA device vc707.0 access...
PCIe write error:  only wrote 32 of 288 bytes
Model exited with status 1
 Disabling FPGA device vc707.0 access...

The following failure is more serious and may require a reboot:

Reserving VC707
bluenoc-ioctl: bluenoc-ioctl.c:110: main: Assertion `board_info.is_active' failed.
/usr/share/leap/scripts/VC707: line 56: 81638 Aborted                 /usr/local/bin/bluenoc-ioctl ${arg_driver} "activate"
Activating device: VC707 0 /dev/bluenoc_0000:04:00.0
 Enabling FPGA device vc707.0 access...
Model exited with status 134
hello_hybrid_vc707_vivado_sw.exe: sw/model/pcie-bluenoc.cpp:131: bool PCIE_DEVICE_CLASS::Init(): Assertion `board_info.is_active' failed.
Disabling FPGA device vc707.0 access...
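The failure signatures quoted on this page can be matched mechanically. The following is a rough sketch rather than a LEAP tool: the function name diagnose is hypothetical, and the hints simply restate the advice given in the sections above.

```sh
#!/bin/sh
# diagnose: read run output on stdin and print one hint for each failure
# signature quoted on this page. Hypothetical helper for illustration.
diagnose() {
    prog=0; active=0; wr=0
    # The "|| [ -n ... ]" keeps a final line that lacks a trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        case "$line" in
            *"Failed to program FPGA"*) prog=1 ;;
            *"board_info.is_active"*)   active=1 ;;
            *"PCIe write error"*)       wr=1 ;;
        esac
    done
    if [ "$prog" -eq 1 ]; then
        echo "programming: inspect FPGA_programming.log (e.g. device serial number)"
    fi
    if [ "$active" -eq 1 ]; then
        echo "pcie: check bus_id and hot-plug controller paths; a reboot may be needed"
    fi
    if [ "$wr" -eq 1 ]; then
        echo "reprogramming: check the reprogramming configuration"
    fi
    if [ $((prog + active + wr)) -eq 0 ]; then
        echo "no known failure signature found"
    fi
}

# Typical use:
#   ./run 2>&1 | diagnose
```

Each signature is reported once, even if the matching message appears several times in the output.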