NIC Driver Technical Documentation

For questions, comments, reporting errors / documentation bugs, etc., please send e-mail to:

Jonathan Ellithorpe


This documentation on the NetFPGA-10G Reference NIC Linux Driver covers the following:

  • Foreword: Documentation Scope and Audience
  • Driver and Device Initialization
  • DMA Engine Architecture, Starting the Engine
  • Packet Buffers, Structure and Management
  • Life of a Packet, Receiving and Transmitting
  • Reading and Writing Registers

Foreword: Documentation Scope and Audience

The purpose of this documentation is to provide a technical overview of the most important components of the driver, primarily focusing on aspects particular to its implementation for NetFPGA-10G. While general kernel interfaces and common data structures have not been of primary concern in the scope of this document, references have been made where possible for the unfamiliar reader. The ideal audience for this documentation, therefore, would have some degree of experience with writing Linux network drivers, although such experience is not required to understand the majority of the contents.

Driver and Device Initialization

This section discusses initialization and is divided into two parts: driver initialization and device initialization. Driver initialization focuses on initialization of driver structures and callbacks, while device initialization focuses on setting up the NetFPGA-10G device.

Driver Initialization

Upon driver loading, the kernel will call the following function:

/* Initialization. */
static int __init nf10_eth_driver_init(void)

It's primarily in this function that all of the initialization is done for the driver, and here we'll walk through the basic steps.

The first thing we do is allocate NUM_NETDEVS network device structures (struct net_device) and set their MAC addresses. These are stored in the array nf10_netdevs[0] through nf10_netdevs[NUM_NETDEVS-1]. For a single NetFPGA-10G device NUM_NETDEVS is 4, although for a box with many NetFPGA-10G cards this number can be changed to suit the total number of interfaces present.

    /* Allocate the network interfaces. */
    for(i = 0; i < NUM_NETDEVS; i++) {
        nf10_netdevs[i] = alloc_netdev(0, "nf%d", nf10_netdev_init);
    }

    mac_addr_len = nf10_netdevs[0]->addr_len;
    char mac_addr[mac_addr_len+1];

    /* Set network interface MAC addresses. */
    for(i = 0; i < NUM_NETDEVS; i++) {
        memset(mac_addr, 0, mac_addr_len+1);
        snprintf(&mac_addr[1], mac_addr_len, "NF%d", i);
        memcpy(nf10_netdevs[i]->dev_addr, mac_addr, mac_addr_len);
    }

The result is that after initialization and bringing up the interfaces, the MAC addresses will look something like:

nf0  Link encap:Ethernet  HWaddr 00:4E:46:30:00:00 
nf1  Link encap:Ethernet  HWaddr 00:4E:46:31:00:00 
nf2  Link encap:Ethernet  HWaddr 00:4E:46:32:00:00 
nf3  Link encap:Ethernet  HWaddr 00:4E:46:33:00:00 

Note that we allocate the net_devices with nf10_netdev_init as the initialization function. The kernel calls this function to set up each net_device's method pointers and various properties. See below:

void nf10_netdev_init(struct net_device *netdev)
{
    ether_setup(netdev);
    netdev->netdev_ops  = &nf10_netdev_ops;
    netdev->watchdog_timeo = 5 * HZ;
}

Where nf10_netdev_ops contains the following functions at the time of writing:

static const struct net_device_ops nf10_netdev_ops = {
    .ndo_open               = nf10_ndo_open,
    .ndo_stop               = nf10_ndo_stop,
    .ndo_start_xmit         = nf10_ndo_start_xmit,
    .ndo_tx_timeout         = nf10_ndo_tx_timeout,
    .ndo_get_stats          = nf10_ndo_get_stats,
    .ndo_set_mac_address    = nf10_ndo_set_mac_address,
};

Following on, the driver will add a NAPI polling mechanism to the 0th network device:

    /* Add NAPI structure to the device. */
    /* Since we have NUM_NETDEVS net_devices, we just use the 1st one for implementing polling. */
    netif_napi_add(nf10_netdevs[0], &nf10_napi_struct, nf10_napi_struct_poll, RX_POLL_WEIGHT);

NAPI stands for "New API" and is a mechanism used for interrupt mitigation and efficient packet throttling when receiving packets on high speed networking devices (see here: NAPI). In summary NAPI helps to suppress floods of interrupts on 10G devices (a single 10G interface could potentially receive ~20Mpps of traffic) as well as provide a mechanism for the kernel's networking stack to apply back pressure at the driver level when it gets overloaded (it's more efficient for the system to drop packets earlier in the stages of processing if it already knows it can't handle them, otherwise it's wasting effort). Given the NetFPGA-10G's potential for receiving 40G of traffic at once, NAPI was therefore chosen as the high performance packet receiving mechanism in the driver. Please note, however, that since interrupts at the time of writing have yet to be implemented in the hardware, a polling mechanism is used in its place until interrupts are available. How this all works is beyond the discussion of initialization so a detailed explanation is not provided here. To learn about how NAPI has been implemented and used in the driver, however, please see the section on Polling for Packets.

After calling the above function, the callback nf10_napi_struct_poll is registered to the nf10_napi_struct structure (struct napi_struct), and the nf10_napi_struct structure is registered to the 0th net_device structure in the driver. The last argument is the poll weight which is a measure of how many packets to try to receive on each call to the poll function. At the time of writing RX_POLL_WEIGHT is 64.

The relationship between the nf10_napi_struct and the net_device to which it's registered is actually arbitrary. The important thing to know is that when the driver's polling mechanism discovers packets ready to be received, the driver will schedule a NAPI polling event using the nf10_napi_struct structure like this:

napi_schedule(&nf10_napi_struct);

Which will result in the kernel scheduling a call to:

/* Slurp up packets. */
static int nf10_napi_struct_poll(struct napi_struct *napi, int budget)

Which will then receive at most budget number of packets (it is through budget and scheduling calls to nf10_napi_struct_poll() that the kernel controls back pressure). Again, please see the section on receiving packets for more details.

After having allocated the net_devices, set their MAC addresses, and setup NAPI on nf10_netdevs[0], the driver registers the devices with the kernel like so:

    /* Register the network interfaces. */
    for(i = 0; i < NUM_NETDEVS; i++) {
        register_netdev(nf10_netdevs[i]);
    }

The next step in initialization is registering a Generic Netlink Family and associated operations:

    /* Register our Generic Netlink family. */
    genl_register_family(&nf10_genl_family);

    /* Register operations with our Generic Netlink family. */
    for(i = 0; i < ARRAY_SIZE(genl_all_ops); i++) {
        genl_register_ops(&nf10_genl_family, genl_all_ops[i]);
    }

Put simply, Generic Netlink is an interface that allows userspace applications to communicate with kernel modules, including drivers, using messages (for detailed documentation please see here). In the NetFPGA-10G driver, Generic Netlink was originally used for developing an easily extendable userspace debugging tool allowing a developer to test and probe the driver via a command line application called driver_ctrl. Later on, register reading and writing functionality was added over Generic Netlink, and a dynamic library (called nf10_reg_lib) was developed to provide applications access to these functions. In the driver the GENL register reading and writing functions are called genl_ops_reg_rd and genl_ops_reg_wr, and are included in the genl_all_ops[] array:

static struct genl_ops *genl_all_ops[] = {
    .
    .
    &genl_ops_reg_rd,
    &genl_ops_reg_wr,
    .
    .
};

For a detailed discussion on how this works, please see the section on Reading and Writing Registers.
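
For orientation, one entry of that array might be defined roughly as follows. This is a sketch based on the kernel's Generic Netlink API of this era; the command constant and policy name are illustrative rather than the driver's exact definitions, though genl_cmd_reg_rd is the handler name used later in this document.

static int genl_cmd_reg_rd(struct sk_buff *skb, struct genl_info *info);

static struct genl_ops genl_ops_reg_rd = {
    .cmd    = NF10_GENL_C_REG_RD,   /* command number (illustrative name) */
    .flags  = 0,
    .policy = nf10_genl_policy,     /* nla_policy describing the expected attributes (illustrative name) */
    .doit   = genl_cmd_reg_rd,      /* handler invoked when the command arrives */
};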

Continuing on, the driver creates a file in /proc/driver called "nf10_eth_driver" for debugging purposes:

    create_proc_read_entry("driver/nf10_eth_driver", 0, NULL, read_proc, NULL);

When this file is read, read_proc() is called and fills a buffer with information to output to the user, handy for debugging. The read_proc() function looks like this:

/* Called to fill @buf when user reads our file in /proc. */
int read_proc(char *buf, char **start, off_t offset, int count, int *eof, void *data)
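
As a rough sketch of what such a handler does (the driver's actual read_proc reports considerably more state than this), a legacy read_proc implementation fills buf, sets *eof, and returns the number of bytes written:

/* Sketch only: report hw_state and nothing else. */
int read_proc(char *buf, char **start, off_t offset, int count, int *eof, void *data)
{
    int len = 0;

    len += snprintf(buf + len, count - len, "hw_state: 0x%08x\n", hw_state);

    *eof = 1;       /* everything fits in a single read */
    return len;     /* number of bytes placed in buf */
}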

Next we enable NAPI (a step the NAPI framework requires before polling can be scheduled):

    /* Enable NAPI. */
    napi_enable(&nf10_napi_struct);

And finally register the driver:

    /* Register the pci_driver.
     * Note: This will succeed even without a card installed in the system. */
    pci_register_driver(&nf10_pci_driver);

The nf10_pci_driver structure actually represents the driver itself, and is set up like this:

static struct pci_driver nf10_pci_driver = {
    .name       = "nf10_eth_driver: pci_driver",
    .id_table   = id_table,
    .probe      = probe,
    .remove     = remove,
};

When this structure is registered via pci_register_driver(), the kernel will search for the NetFPGA-10G card in the system. It searches based on what's inside the id_table field of the nf10_pci_driver. In the NetFPGA-10G driver, id_table looks like this:

/* These are the IDs of the PCI devices that this Ethernet driver supports. */
static struct pci_device_id id_table[] = {
    { PCI_DEVICE(PCI_VENDOR_ID_NF10, PCI_DEVICE_ID_NF10_REF_NIC), }, /* NetFPGA-10G Reference NIC. */
    { 0, }
};

It is a composition of the NetFPGA-10G card's PCI Vendor ID and PCI Device ID (which are set in the PCIe core in the FPGA). During boot the computer scans the PCIe bus for cards and registers their information with the system. Later on, when this device driver is loaded and pci_register_driver() is called, the kernel checks nf10_pci_driver.id_table against the list of cards and, if a match is found, calls nf10_pci_driver.probe(). It is in the probe() function that device initialization begins, as documented in the following section.

Device Initialization

Device initialization occurs in the function probe():

static int probe(struct pci_dev *pdev, const struct pci_device_id *id)

probe() is called when the device associated with our driver (via pci_register_driver()) is found in the system, and is responsible for initializing that device.

The very first thing that is done is to turn on the HW_FOUND bit in the hw_state variable:

    /* The hardware has been found. */
    hw_state |= HW_FOUND;

The hw_state variable is used in the driver to keep track of the state of the hardware. At the time of writing hw_state has two state bits:

/* Hardware state flags. */
#define HW_FOUND        0x00000001
#define HW_INIT         0x00000002

Where HW_FOUND indicates presence of hardware, and HW_INIT indicates successful initialization of hardware.
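
These bits let later code guard hardware accesses. As a sketch of how they might be used (the actual driver's checks may be arranged differently), a cleanup path could test them before touching the card:

    /* Sketch: only undo what probe() actually accomplished. */
    if(hw_state & HW_INIT) {
        /* ... stop the workers and free the DMA regions ... */
    }
    if(hw_state & HW_FOUND) {
        /* ... release BAR0 and disable the PCI device ... */
    }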

The next step is to enable the PCI device and enable DMA functionality:

    /* Enable the device. pci_enable_device() will do the following (ref. PCI/pci.txt kernel doc):
     *     - wake up the device if it was in suspended state
     *     - allocate I/O and memory regions of the device (if BIOS did not)
     *     - allocate an IRQ (if BIOS did not) */
    pci_enable_device(pdev);

    /* Enable DMA functionality for the device.
     * pci_set_master() does this by (ref. PCI/pci.txt kernel doc) setting the bus master bit
     * in the PCI_COMMAND register. pci_clear_master() will disable DMA by clearing the bit.
     * This function also sets the latency timer value if necessary. */
    pci_set_master(pdev);

These functions are not NetFPGA specific, but rather apply to all PCI devices. For further documentation on their use and function please see PCI/pci.txt in the kernel documentation, section 3.1.

Continuing on,

    /* Mark BAR0 MMIO region as reserved by this driver. */
    pci_request_region(pdev, BAR_0, driver_name);

    /* Remap BAR0 MMIO region into our address space. */
    bar0_base_va = pci_ioremap_bar(pdev, BAR_0);

The first line, pci_request_region(), simply stakes a claim on BAR_0 for our driver so that only it can access that region (see section 3.2 of PCI/pci.txt). If another driver has already claimed the region, the call will fail. The second line, pci_ioremap_bar(), maps the BAR_0 memory region into the driver's address space. We will get to what this space looks like later on.

To prepare for allocating RX and TX DMA regions, the driver first sets the DMA mask (see Documentation/DMA-API.txt Part Ic).

    err = dma_set_mask(&pdev->dev, DMA_BIT_MASK(32));
    err = dma_set_coherent_mask(&pdev->dev, DMA_BIT_MASK(32));

In brief, the first function, dma_set_mask(), tells the kernel that our device can only address the lower 32-bit region of memory, so when we get around to allocating DMA regions, they need to be taken from that space. The second function, dma_set_coherent_mask(), applies the same 32-bit restriction to coherent (consistent) DMA allocations, which is the kind of memory the driver requests below with dma_alloc_coherent().

Now the driver can allocate DMA accessible regions, and it does so using the function dma_alloc_coherent(). It allocates one DMA region for storing RX packets from the device, and one DMA region for storing TX packets from the kernel.

    for(dma_cpu_bufs = DMA_CPU_BUFS; dma_cpu_bufs >= MIN_DMA_CPU_BUFS; dma_cpu_bufs /= 2) {
        dma_region_size = ((DMA_BUF_SIZE + OCDP_METADATA_SIZE + sizeof(uint32_t)) * dma_cpu_bufs);

        /* Allocate TX DMA region. */
        tx_dma_reg_va = dma_alloc_coherent(&pdev->dev, dma_region_size, &tx_dma_reg_pa, GFP_KERNEL | __GFP_NOWARN);
        if(tx_dma_reg_va == NULL)
            /* Try smaller allocation. */
            continue;

        /* Allocate RX DMA region. */
        rx_dma_reg_va = dma_alloc_coherent(&pdev->dev, dma_region_size, &rx_dma_reg_pa, GFP_KERNEL | __GFP_NOWARN);
        if(rx_dma_reg_va == NULL) {
            dma_free_coherent(&pdev->dev, dma_region_size, tx_dma_reg_va, tx_dma_reg_pa);
            /* Try smaller allocation. */
            continue;
        }

        /* Both memory regions have been allocated successfully. */
        break;
    }

DMA_CPU_BUFS is the ideal number of buffers to allocate, and MIN_DMA_CPU_BUFS is the minimum number that the driver can reasonably work with. At the time of writing DMA_CPU_BUFS is equal to 32,768 and MIN_DMA_CPU_BUFS is equal to 1. DMA_BUF_SIZE is determined by the DMA engine in the FPGA, and at the time of writing is 2048 Bytes. Since DMA_CPU_BUFS * DMA_BUF_SIZE is roughly 64MB, and since this is quite a large contiguous space to ask for, the call to dma_alloc_coherent() is not guaranteed to succeed. Therefore the code runs in a loop trying smaller and smaller sizes, halving the requested number of buffers (and hence dma_region_size) each time allocation fails. Upon success the variables tx_dma_reg_va, tx_dma_reg_pa, rx_dma_reg_va, and rx_dma_reg_pa contain the regions' virtual and physical addresses, respectively.
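
For concreteness, the first (and largest) request works out as follows, assuming OCDP_METADATA_SIZE is 16 Bytes (the metadata size quoted elsewhere in this document):

    /* Per buffer:  DMA_BUF_SIZE + OCDP_METADATA_SIZE + sizeof(uint32_t)
     *            = 2048 + 16 + 4 = 2068 Bytes
     * First try:   2068 * 32768 = 67,764,224 Bytes (~64.6 MB) per region,
     *              requested once for TX and once for RX. */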

Now that the probe() function has the DMA regions allocated in host memory, and also has access to the card's BAR0 MMIO region, it is ready to begin initializing and configuring the DMA engine in the FPGA hardware.

NOTE: The reader is advised at this point to read the section entitled "DMA Engine Architecture, Starting the Engine" before going further. This section provides the necessary background to understand the code that follows for initialization of the DMA Engine.

The driver first sets up some useful data structures for accessing various parts of the BAR0 MMIO region of the device. The first of these is the variable occp.

    /* Now we begin to structure the BAR0 MMIO region as the set of control and status
     * registers that it is. Once we setup this structure, then we proceed to reset,
     * initialize, and then start the hardware components. */

    occp        = (OccpSpace *)bar0_base_va;

Here the variable occp is of type struct OccpSpace, and is the top-level structure for the BAR0 memory region, whose structure is documented in the section for DMA Architecture. For convenience the struct OccpSpace definition is copied here from occp.h:

typedef struct {
    OccpAdminRegisters admin;
    uint8_t pad[OCCP_ADMIN_SIZE - sizeof(OccpAdminRegisters)];
    OccpWorker worker[OCCP_MAX_WORKERS];
    uint8_t config[OCCP_MAX_WORKERS][OCCP_WORKER_CONFIG_SIZE];
} OccpSpace;

typedef struct {
    OccpWorkerRegisters control;
    uint8_t pad[OCCP_WORKER_CONTROL_SIZE - sizeof(OccpWorkerRegisters)];
} OccpWorker;

typedef struct {
    const uint32_t
        initialize,
        start,
        stop,
        release,
        test,
        beforeQuery,
        afterConfigure,
        reserved7,
        status;
    uint32_t
        control;
    const uint32_t
        lastConfig;
    uint32_t
        clearError,
        pageWindow,
        reserved[3];
} OccpWorkerRegisters;

Continuing on, occp is used to setup the following:

    dp0_props   = (OcdpProperties *)occp->config[WORKER_DP0];
    dp1_props   = (OcdpProperties *)occp->config[WORKER_DP1];
    sma0_props  = (uint32_t *)occp->config[WORKER_SMA0];
    sma1_props  = (uint32_t *)occp->config[WORKER_SMA1];
    bias_props  = (uint32_t *)occp->config[WORKER_BIAS];

    dp0_regs    = &occp->worker[WORKER_DP0].control,
    dp1_regs    = &occp->worker[WORKER_DP1].control,
    sma0_regs   = &occp->worker[WORKER_SMA0].control,
    sma1_regs   = &occp->worker[WORKER_SMA1].control,
    bias_regs   = &occp->worker[WORKER_BIAS].control;

The *_props variables all point to their respective workers' configuration spaces. The *_regs variables point to each worker's corresponding control region of BAR0 (not to be confused with the 'control' register within that region).

The NetFPGA worker's pointers are also setup here:

    nf10_regs   = (uint32_t *)occp->config[WORKER_NF10];
    nf10_ctrl   = &occp->worker[WORKER_NF10].control;

Here, however, since the NetFPGA-10G device subscribes to a different model than the rest of the OpenCPI workers, nf10_regs points to its configuration space (since this is its space for register access), and nf10_ctrl points to its control space as defined in the section on DMA Architecture.

After setting up the variables, the workers are put into reset, and at the same time their timeout values are set:

    /* Assert reset. */
    dp0_regs->control   = OCCP_LOG_TIMEOUT;
    dp1_regs->control   = OCCP_LOG_TIMEOUT;
    sma0_regs->control  = OCCP_LOG_TIMEOUT;
    sma1_regs->control  = OCCP_LOG_TIMEOUT;
    bias_regs->control  = OCCP_LOG_TIMEOUT;
    nf10_ctrl->control  = OCCP_LOG_TIMEOUT;

At the time of writing OCCP_LOG_TIMEOUT is set to 30, which is equal to roughly 8.5 seconds.
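
The 8.5 second figure follows directly from the fact that this field is the log2 of a count of 8 ns cycles (see the wrkTimeout description quoted in the DMA Engine Architecture section):

    /* OCCP_LOG_TIMEOUT = 30
     * timeout = 2^30 cycles * 8 ns/cycle = 1,073,741,824 * 8 ns ~= 8.59 seconds */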

Following on the workers are then taken out of reset:

    /* Take out of reset. */
    dp0_regs->control   = OCCP_CONTROL_ENABLE | OCCP_LOG_TIMEOUT;
    dp1_regs->control   = OCCP_CONTROL_ENABLE | OCCP_LOG_TIMEOUT;
    sma0_regs->control  = OCCP_CONTROL_ENABLE | OCCP_LOG_TIMEOUT;
    sma1_regs->control  = OCCP_CONTROL_ENABLE | OCCP_LOG_TIMEOUT;
    bias_regs->control  = OCCP_CONTROL_ENABLE | OCCP_LOG_TIMEOUT;
    nf10_ctrl->control  = OCCP_CONTROL_ENABLE | OCCP_LOG_TIMEOUT;

And then the workers are initialized by reading their Control-Op: Initialize registers as follows:

    dp0_regs->initialize 
    dp1_regs->initialize 
    sma0_regs->initialize
    sma1_regs->initialize
    bias_regs->initialize
    nf10_ctrl->initialize

In the actual code the return value of each of these reads is checked against OCCP_SUCCESS_RESULT to detect errors.
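
In other words, each read above is really wrapped in a check along these lines (a simplified sketch of the pattern; the real code also cleans up and returns an error from probe() on failure):

    if(dp0_regs->initialize != OCCP_SUCCESS_RESULT) {
        printk(KERN_ERR "nf10_eth_driver: DP0 worker failed to initialize\n");
        /* ... undo earlier setup and abort probe() ... */
    }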

If no errors result, then configuration of the workers can proceed. The driver does that as follows, as explained in the section DMA Engine Architecture, Starting the Engine (the reader is asked to reference that section of documentation in understanding this section of code).

    /* Configure workers. */

    *sma0_props = 1;
    *bias_props = 0;
    *sma1_props = 2;

    dp0_props->nLocalBuffers      = DMA_FPGA_BUFS;
    dp0_props->nRemoteBuffers     = dma_cpu_bufs;
    dp0_props->localBufferBase    = 0;
    dp0_props->localMetadataBase  = DMA_FPGA_BUFS * DMA_BUF_SIZE;
    dp0_props->localBufferSize    = DMA_BUF_SIZE;
    dp0_props->localMetadataSize  = sizeof(OcdpMetadata);
    dp0_props->memoryBytes        = 32*1024; /* FIXME: What is this?? */
    dp0_props->remoteBufferBase   = (uint32_t)tx_dma_reg_pa;
    dp0_props->remoteMetadataBase = (uint32_t)tx_dma_reg_pa + dma_cpu_bufs * DMA_BUF_SIZE;
    dp0_props->remoteBufferSize   = DMA_BUF_SIZE;
    dp0_props->remoteMetadataSize = sizeof(OcdpMetadata);
    dp0_props->remoteFlagBase     = (uint32_t)tx_dma_reg_pa + (DMA_BUF_SIZE + sizeof(OcdpMetadata)) * dma_cpu_bufs;
    dp0_props->remoteFlagPitch    = sizeof(uint32_t);
    dp0_props->control            = OCDP_CONTROL(OCDP_CONTROL_CONSUMER, OCDP_ACTIVE_MESSAGE);

    dp1_props->nLocalBuffers      = DMA_FPGA_BUFS;
    dp1_props->nRemoteBuffers     = dma_cpu_bufs;
    dp1_props->localBufferBase    = 0;
    dp1_props->localMetadataBase  = DMA_FPGA_BUFS * DMA_BUF_SIZE;
    dp1_props->localBufferSize    = DMA_BUF_SIZE;
    dp1_props->localMetadataSize  = sizeof(OcdpMetadata);
    dp1_props->memoryBytes        = 32*1024; /* FIXME: What is this?? */
    dp1_props->remoteBufferBase   = (uint32_t)rx_dma_reg_pa;
    dp1_props->remoteMetadataBase = (uint32_t)rx_dma_reg_pa + dma_cpu_bufs * DMA_BUF_SIZE;
    dp1_props->remoteBufferSize   = DMA_BUF_SIZE;
    dp1_props->remoteMetadataSize = sizeof(OcdpMetadata);
    dp1_props->remoteFlagBase     = (uint32_t)rx_dma_reg_pa + (DMA_BUF_SIZE + sizeof(OcdpMetadata)) * dma_cpu_bufs;
    dp1_props->remoteFlagPitch    = sizeof(uint32_t);
    dp1_props->control            = OCDP_CONTROL(OCDP_CONTROL_PRODUCER, OCDP_ACTIVE_MESSAGE);

For the layout of the DMA buffers, DMA metadata, and DMA flags whose addresses are set here, please see the section on Packet Buffers, Structure and Management.

Other than this, the code here does one other thing, which is to set up two convenience structures called tx_dma_stream and rx_dma_stream. These contain pointers to the DMA buffers, DMA metadata, and DMA flags, and also an index to the next free buffer to fill / next full buffer to empty. The structure is as follows:

/* Bundle of variables to keep track of a unidirectional DMA stream. */
struct dma_stream {
    uint8_t             *buffers;
    OcdpMetadata        *metadata;
    volatile uint32_t   *flags;
    volatile uint32_t   *doorbell;
    uint32_t            buf_index;
};

One other thing that the struct dma_stream structure contains is a field called "doorbell", which points to the "fabDoneAvail" register in the corresponding Data Plane's configuration space for that particular DMA stream. While the section on Packet Buffers, Structure and Management explains this in more detail, in short this doorbell is used to signal to the hardware that a packet has been placed in the buffer (when transmitting packets to the hardware) or that a packet has been emptied from the buffer (when receiving packets in the host). This mechanism keeps the hardware aware of when packets are ready to be received / transmitted.

Setting up this structure involves the following:

    tx_dma_stream.buffers   = (uint8_t *)tx_dma_reg_va;
    tx_dma_stream.metadata  = (OcdpMetadata *)(tx_dma_stream.buffers + dma_cpu_bufs * DMA_BUF_SIZE);
    tx_dma_stream.flags     = (volatile uint32_t *)(tx_dma_stream.metadata + dma_cpu_bufs);
    tx_dma_stream.doorbell  = (volatile uint32_t *)&dp0_props->nRemoteDone;
    tx_dma_stream.buf_index = 0;
    memset((void*)tx_dma_stream.flags, 1, dma_cpu_bufs * sizeof(uint32_t));

    rx_dma_stream.buffers   = (uint8_t *)rx_dma_reg_va;
    rx_dma_stream.metadata  = (OcdpMetadata *)(rx_dma_stream.buffers + dma_cpu_bufs * DMA_BUF_SIZE);
    rx_dma_stream.flags     = (volatile uint32_t *)(rx_dma_stream.metadata + dma_cpu_bufs);
    rx_dma_stream.doorbell  = (volatile uint32_t *)&dp1_props->nRemoteDone;
    rx_dma_stream.buf_index = 0;
    memset((void*)rx_dma_stream.flags, 0, dma_cpu_bufs * sizeof(uint32_t));

Note that for the tx_dma_stream (packets heading to the FPGA) the flags are initialized to 1, indicating that all buffers are empty and ready to be filled by the driver. For the rx_dma_stream the flags are instead initialized to 0, indicating that no buffer yet holds a received packet. In both cases, then, a flag of 1 indicates readiness from the driver's point of view: a TX buffer ready to accept a packet for transmit, or an RX buffer holding a packet ready to be received.

After all this configuration is finished, the final step is to start all of the workers by reading their Control-Op: Start register:

    /* Start workers. */
    dp0_regs->start
    dp1_regs->start
    sma0_regs->start
    sma1_regs->start
    bias_regs->start
    nf10_ctrl->start

In actual code error checking is done against the result OCCP_SUCCESS_RESULT, or 0xC0DE_4201.

After that, both the driver and device are fully initialized.

There are a few last things that are done at the end of probe() here to complete initialization. First is:

    /* Hardware has been successfully initialized. */
    hw_state |= HW_INIT;

Which sets the bit in hw_state indicating successful initialization of hardware.

The second is:

    /* Start the polling timer for receiving packets. */
    rx_poll_timer.expires = jiffies + RX_POLL_INTERVAL;
    add_timer(&rx_poll_timer);

Which starts an RX packet polling timer. Each time the timer fires it will check for packets in the RX buffer, and if packets are found, will schedule a NAPI polling event with the kernel. See the section Polling for Packets for details on how this works in the driver.

DMA Engine Architecture, Starting the Engine

The NetFPGA-10G reference NIC's DMA engine is the product of a very specific application of a technology called OpenCPI (see the OpenCPI website for full details, particularly the Documentation section for technical details beyond what's mentioned here). In short, OpenCPI, or Open Component Portability Infrastructure, is a general framework for "gluing" together IP of both the hardware and software variety in a seamless way by leveraging a consistent set of interface standards. Each piece of IP is called a "worker" in OpenCPI terminology, and a network of workers is stitched together to form a design. It is entirely open-source and its IP library of workers is actively used in both commercial and governmental applications. For the NetFPGA-10G reference NIC application, the OpenCPI framework was used to compose a PCIe DMA communication path between the driver and the NetFPGA-10G application residing on the FPGA. In the FPGA the design looks like the following network of workers:

The SMA0 and SMA1 workers (Stream Message Adapters, see OpenCPI_HDL_App_Workers.pdf section 4) are responsible for converting between NetFPGA-10G and OpenCPI internal streaming interfaces. The DP0 and DP1 workers (Data Planes, see OpenCPI_HDL_Infrastructure.pdf section 6) are responsible for streaming in DMA messages and streaming out DMA messages. uNOC stands for micro Network on Chip (see OpenCPI_HDL_Infrastructure.pdf section 4). It is not counted as a "worker" in OpenCPI (and is therefore invisible to software). The uNOC is responsible for acting as an intermediary between workers in the design and PCIe messages, mainly directing register reads and writes to the correct worker depending on the target address. Finally, it should be noted from the diagram that the NetFPGA-10G design (in this case the reference NIC) itself is counted as a "worker" in the network.

In the OpenCPI paradigm each "worker" is treated as an independently configurable and controllable piece of IP in the network. OpenCPI provides software with control and configuration access to these workers through the PCIe core's 16MB BAR0 region of memory mapped IO (see OpenCPI_HDL_Infrastructure.pdf section 5.1). It is also through BAR0 that OpenCPI provides software with a set of "administration" registers which apply to an entire design.

The organization of BAR0 is fully documented in OpenCPI_HDL_Infrastructure.pdf section 5.1, section 5.2, section 5.3, and section 5.4. Below are some diagrams from this documentation copied here for convenience.

First, the overall organization of BAR0 looks like the following:

It is divided up into 16x 1MB regions. The lower 15MBs are the 15x 1MB configuration spaces for each worker 0-14. Configuration space is simply a register space for that worker (and in cases like the SMA workers acts as a way for software to configure the operation of the worker, as is seen later). Reads and writes to those registers are directed to the corresponding worker in the design (functionality provided by the OpenCPI uNOC). See the section on "Reading and Writing Registers" to see how this works for accessing registers in the NetFPGA design (in brief, the NetFPGA worker is worker number WORKER_NF10 as seen in the driver code, and at the time of writing this value is 0, and all reads/writes into WORKER_NF10's configuration space are directed to the NetFPGA design. Note that since the registers for NetFPGA-10G exceed 1MB, a key-holing mechanism is used to access all of them).

The upper 1MB of BAR0, however, is special. It is divided up into 16x 64KB regions, where the lower 15 slots are the 15x 64KB control spaces for each worker 0-14. Control spaces, unlike configuration spaces, are the same for every worker and provide a standard control interface. For instance, every worker has an initialize, start, and stop register, allowing workers to be controlled independently in an OpenCPI design. The register layout of this control space is seen below:

Of special note in the control space for workers is the "Worker Control Register". The bitwise use of this register is seen below (most importantly, bit 31 is used to reset the worker):

The highest order bit is an active-low reset, meaning that to reset the worker, a 0 is written to this bit. The operation is "sticky", meaning that after a 0 is written, the worker will be "in reset" until a 1 is written to this bit, taking the device "out of reset". The lower 5 bits of this register are for defining how long each worker has to respond to various commands from OpenCPI (including register reads and writes). OpenCPI_HDL_Infrastructure.pdf section 5.3.2 contains a very good explanation of what this field means and is reproduced here:

The wrkTimeout bit field [4:0] sets the number of 8 ns cycles to wait after a command before a timeout 
condition is declared. This field is the log2 of that threshold. A setting of 4 (the power on default) 
results in a 2^4 = 16 cycle timeout (128 ns). The maximum setting of 31 corresponds to about 17 seconds 
(2^31 x 8 ns/cycle). The default timeout setting of 16 cycles (128 ns) may be inadequate for workers 
requiring more time than that to service their requests. Workers will return the "0xC0DE_4203" TIMEOUT 
indication if the access timer expires before the worker completes the control operation, or acknowledges 
the configuration property access. In such cases, it is recommended that the default timeout for that 
worker be increased, only as much as needed. Higher timeout values add to the upper bound on the maximum 
latency that can be expected for a response from the system.

Here the term "command" simply means the commands sent to workers resulting from reads to the worker's control space registers. If a command times out, 0xC0DE_4203 will be returned as the result of the read.

For the workers in NetFPGA, therefore, there is a very simple initialization process:

  1. In the worker's control register, put each worker into reset

  2. In the worker's control register, set the timeout value

  3. Send workers the "initialization" command by reading the "Control-Op: Initialize" register in the worker's control space. If the return value is 0xC0DE_4201 (in the driver, OCCP_SUCCESS_RESULT) then the initialization has succeeded. Otherwise it has failed.

After successfully initializing the workers, the second step to getting OpenCPI operational is to configure the workers using their configuration space. The third and final step will be to start the workers by reading their Control-Op: Start register.

For the OpenCPI DMA engine used in the reference NIC, the workers that must be configured are (as shown in the OpenCPI DMA engine architectural diagram): SMA0, SMA1, DP0, and DP1.

The configuration space for a Stream Message Adapter looks like the following:

The only part of this configuration space that the driver needs to be concerned about is the "mode" field of the control register "smaCtrl", which must be set to 1 for SMA0 and to 2 for SMA1. As the lower part of the diagram documents, the meaning of this mode field pertains to the type of interface conversion being requested (WMI and WSI are types of interfaces, -S means slave, -M means master). Because these SMAs have polar opposite data flow directions with respect to the NetFPGA worker, their conversions are set oppositely. Other than this mode field, no other configuration is necessary for the SMA workers.

The configuration space for a Data Plane is much more complicated, and looks like this:

The Data Plane workers, being responsible for moving DMA messages between FPGA and host memory, must be configured to know:

  1. What direction they are operating in (grabbing data from host memory or sending data to host memory)

  2. Where the buffers are they need to work with

  3. How big those buffers are.

In the register list above, the prefix "lcl" refers to local (i.e. FPGA) side buffers, while "fab" refers to fabric (i.e. host) side buffers.

Therefore, while not going into all the details (for that the curious reader is directed to OpenCPI_HDL_Infrastructure.pdf section 6.2), the basic configuration of this space is as follows:

...setup directionality settings...

  1. "dpControl" (in driver: OcdpProperties.control) (must be set to 0x9 for DP0 and 0x5 for DP1 (this sets the directionality of the Data Plane worker. See OpenCPI_HDL_Infrastructure.pdf section 6.2.3)

...setup local buffer settings...

  1. "lclNumBufs" (in driver: OcdpProperties.nLocalBuffers) must be set to the number of local buffers given to the worker on the FPGA (this is set at hardware configuration time and for the reference NIC is equal to 4. The driver names this value DMA_FPGA_BUFS). It is the same for both DP0 and DP1.

  2. "lclMesgBase" (in driver: OcdpProperties.localBufferBase) must be set to 0. It is the same for both DP0 and DP1.

  3. "lclMetaBase" (in driver: OcdpProperties.localMetadataBase) is set to point just past the local buffers. In driver terms this is set to point to DMA_FPGA_BUFS*DMA_BUF_SIZE. It is the same for both DP0 and DP1.

  4. "lclMesgBufSize" (in driver: OcdpProperties.localBufferSize) is set to the size of the message buffers on the FPGA. This is a hardware configuration and for the reference NIC is 2048 Bytes (in the driver DMA_BUF_SIZE). It is the same for both DP0 and DP1.

  5. "lclMetaBufSize" (in driver: OcdpProperties.localMetadataSize) is set to the size of the metadata required for each buffer. This is a hardware configuration and for the reference NIC is 16 Bytes (in the driver sizeOf(OcdpMetadata)). It is the same for both DP0 and DP1.

...setup remote buffer settings...

  1. "fabNumBufs" (in driver: OcdpProperties.nRemoteBuffers) is set to the number of buffers in the host memory for this data plane. In the driver, what this is set to is dependent on how much space the driver was able to allocate for the RX and TX packet buffers. For DP0 this is the number of buffers in the TX DMA region, and for DP1 this is the number of buffers in the RX DMA Region.

  2. "fabMesgBase" (in driver: OcdpProperties.remoteBufferBase) is set to the physical address of a packet buffer in host memory. For DP0 this is the packet buffer containing packets destined for the FPGA (TX packet buffer in driver terms). For DP1 this is the packet buffer containing packets destined for the host (RX packet buffer in driver terms). When operating DP0 will DMA messages from its configured fabMesgBase address to its local buffers. DP1 however will DMA message from its local buffers to its configured fabMesgBase address. Please see the section on Packet Buffers, Structure and Management for more details.

  3. "fabMetaBase" (in driver: OcdpProperties.remoteMetadataBase) is set to the physical address in host memory of the metadata for the packets in fabMesgBase. For DP0 this is the metadata is incoming packets to the FPGA, and for DP1 this is the metadata for outgoing packets to the host. Please see the section on Packet Buffers, Structure and Management for more details.

  4. "fabMesgSize" (in driver: OcdpProperties.remoteBufferSize) is the size of the buffers in fabMesgBase. This is hardware configured, and for the reference NIC is 2048 Bytes (as DMA_BUF_SIZE in the driver). It is the same for DP0 and DP1.

  5. "fabMetaSize" (in driver: OcdpProperties.remoteMetadataSize) is the size of metadata. Hardware configured to 16B (sizeOf(OccpMetadata)) in the driver. It is the same for DP0 and DP1.

  6. "fabFlowBase" (in driver: OcdpProperties.remoteFlagBase) is set to the physical address in host memory of the set of flags indicating the fullness/emptiness of each buffer at fabMesgBase. For DP0 this is for incoming packets to the FPGA, and for DP1 this is for outgoing packets to the host. Please refer to the section on Packet Buffers, Structure and Management for more details.

  7. "fabFlowSize" (in driver: OcdpProperties.remoteFlagPitch) is set to the amount of Bytes used per flag in the section for flags as pointed to by fabFlowBase. Please refer to the section on Packet Buffers, Structure and Management for more details.

...setting misc...

  1. "bufferExtent" (in driver: OcdpProperties.memoryBytes) is set to the total amount of memory in the local buffers. It is hardware configured and for the reference NIC is 32KB.

After this is done, all the workers are configured, and the last step required by the OpenCPI standard is to start them by reading each worker's control space "Control-Op: Start" register.

As a final note, the as-yet-unmentioned uppermost 64KB region of BAR0 is the "administration" region, whose usage is defined by OpenCPI and applies to the design as a whole. Its usage is irrelevant to the implementation of NetFPGA-10G and the driver, so an explanation is spared here. The curious reader may refer to OpenCPI_HDL_Infrastructure.pdf section 5.2.

This completes the discussion on DMA architecture as it pertains to the implementation of the driver.

Packet Buffers, Structure and Management

There are two packet buffers in the driver. One is the RX buffer for receiving packets from the FPGA, and the other is the TX buffer for sending packets to the FPGA. While their use is slightly different between TX and RX, their structure is exactly the same:

Here every packet in the buffer is accompanied by metadata and a flag. In the RX buffer for receiving packets, a flag of "1" indicates that the buffer is full with a packet ready to be sent up the networking stack. In the TX buffer for sending packets to the network, a flag of "1" indicates that the corresponding buffer is empty, ready to have a packet written to it for transmit.

Metadata for each packet consists of only two things, one is the length of the packet, in Bytes, and the other is something called "opCode". The name "opCode" comes from OpenCPI, however NetFPGA-10G's use of this field is to indicate to the reference NIC the desired destination interface of the packet. This is done differently for RX and for TX. For TX of packets to the network, the opCode field is set by the software to be the originating host interface (the reference NIC will then interpret this to mean "send the packet out of the corresponding network port". That is, host interface i corresponds to mac interface i). For RX of packets to the host, the opCode field is set by the hardware to be the destination host interface (the reference NIC already did the work of mapping from mac interface to host interface).

The opCode field is encoded as follows:

bit 0 - mac interface 0
bit 1 - host interface 0
bit 2 - mac interface 1
bit 3 - host interface 1
bit 4 - mac interface 2
bit 5 - host interface 2
bit 6 - mac interface 3
bit 7 - host interface 3

And uses the following macros in the software:

/* Interfaces Bitmasks. */
#define OPCODE_CPU0     0x00000002
#define OPCODE_CPU1     0x00000008
#define OPCODE_CPU2     0x00000020
#define OPCODE_CPU3     0x00000080
#define OPCODE_CPU_ALL  0x000000AA

#define OPCODE_MAC0     0x00000001
#define OPCODE_MAC1     0x00000004
#define OPCODE_MAC2     0x00000010
#define OPCODE_MAC3     0x00000040
#define OPCODE_MAC_ALL  0x00000055

The driver is in fact generally unaware of bits 0, 2, 4, and 6. These bits are used by the hardware. Software simply sets bits 1, 3, 5, and 7 on TX, and reads them from received packets from the hardware. As can be seen from the arrangement of bits, the reference NIC need merely perform a bit shift in one direction to perform the i-to-i mapping of ports.
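
Put into code, the two helpers the driver uses for this encoding behave roughly as follows (a sketch consistent with the bit layout and macros above, not a copy of the driver's exact implementations):

/* TX: mark host interface 'iface' (0-3) as the packet's source. */
static inline void tx_set_src_iface(uint32_t *opcode, int iface)
{
    *opcode |= OPCODE_CPU0 << (2 * iface);          /* host interface i is bit 2i+1 */
}

/* RX: recover the destination host interface from the opCode set by the hardware. */
static inline int rx_get_dst_iface(uint32_t opcode)
{
    int i;
    for(i = 0; i < 4; i++) {
        if(opcode & (OPCODE_CPU0 << (2 * i)))
            return i;
    }
    return -1;                                      /* no host interface bit set */
}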

To keep track of the various pieces of packet buffers (packets, metadata, flags), the driver keeps pointers to these various things in a structure called a dma_stream, as described in the section on Driver and Device Initialization. The structure is reproduced here for convenience:

/* Bundle of variables to keep track of a unidirectional DMA stream. */
struct dma_stream {
    uint8_t             *buffers;
    OcdpMetadata        *metadata;
    volatile uint32_t   *flags;
    volatile uint32_t   *doorbell;
    uint32_t            buf_index;
};

In addition to the buffers, metadata, and flags field, this structure also includes a doorbell field. This field is set during initialization to point to the fabDoneAvail register in the corresponding Data Plane's configuration space (see Driver and Device Initialization). For TX, every time a new packet is written into the TX buffer by the driver (metadata, flags, and everything), a "1" is written to tx_dma_stream.doorbell once and only once to tell the hardware a packet has been written to the "head" of the packet buffer (we say "head" because packets are written to the packet buffer sequentially, wrapping around at the ends). For RX, every time a packet is read completely out of the RX buffer (including clearing the flags field), a "1" is written to rx_dma_stream.doorbell once and only once to tell the hardware that a packet has been read, and that the "tail" of the buffer is now free.

Lastly, the struct dma_stream's buf_index field indicates the index of the next free buffer to fill for TX, or the index of the next full buffer to receive for RX. To see how this all works in the driver, see the section on Life of a Packet, Receiving and Transmitting.

Life of a Packet, Receiving and Transmitting

This section documents the process of receiving and transmitting packets in the driver. Note: before reading this section, it would be helpful to first read the section on Packet Buffers, Structure and Management.

Transmitting

Transmitting a packet occurs in the function nf10_ndo_start_xmit:

static netdev_tx_t nf10_ndo_start_xmit(struct sk_buff *skb, struct net_device *netdev)
{

This function is called by the kernel when a packet needs to be sent out and is a registered callback function for each of the net_devices in the driver (see section Driver and Device Initialization).

First, packet data and length are extracted from the arguments:

    /* Get data and length. */
    data = (void*)skb->data;
    len = skb->len;

Since this function is registered with each of the net_devices in the driver, it needs to figure out on which interface it was called. Once done, the driver can then set up the opcode field for later inclusion in the packet's metadata (see the section on Packet Buffers, Structure and Management). tx_set_src_iface() is the function which performs the encoding of the opcode field for transmit.

    /* Opcode for setting source and destination ports. */
    opcode = 0;
    iface = get_iface_from_netdev(netdev);
    tx_set_src_iface(&opcode, iface);
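
get_iface_from_netdev() itself is not shown in this document. A minimal sketch, assuming it simply matches the netdev pointer against the nf10_netdevs array, might look like:

static int get_iface_from_netdev(struct net_device *netdev)
{
    int i;
    for(i = 0; i < NUM_NETDEVS; i++) {
        if(nf10_netdevs[i] == netdev)
            return i;
    }
    return -1;  /* not one of our interfaces; should not happen */
}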

The driver is now ready to begin writing to the TX DMA buffer. Since this region is protected by the tx_dma_region_spinlock, the driver must first acquire the lock:

    /* First need to acquire lock to access the TX DMA region. */
    spin_lock_irqsave(&tx_dma_region_spinlock, tx_dma_region_spinlock_flags);

Once the lock is acquired to access the TX DMA buffer, the driver checks if there's a free buffer available:

    if(tx_dma_stream.flags[tx_dma_stream.buf_index] == 0) {
        netdev->stats.tx_dropped++;
        dev_kfree_skb(skb);
        spin_unlock_irqrestore(&tx_dma_region_spinlock, tx_dma_region_spinlock_flags);
        return NETDEV_TX_OK;
    }

If no free buffer is available, then the packet is immediately dropped. Otherwise the driver can proceed to write the packet into the next free space in the TX DMA buffer. This is done in the following steps:

  1. Use "buf_index" of tx_dma_stream to index the packet array pointed to by "buffers" to write the packet into the TX DMA buffer:
    /* Copy message into buffer. */
    memcpy((void*)&tx_dma_stream.buffers[tx_dma_stream.buf_index * DMA_BUF_SIZE], data, len);
  2. Write metadata:
    /* Fill out metadata. */
    /* Length. */
    tx_dma_stream.metadata[tx_dma_stream.buf_index].length = len;
    /* OpCode. */
    tx_dma_stream.metadata[tx_dma_stream.buf_index].opCode = opcode;
  3. Set the flag to 0 to indicate the buffer is full:
    /* Set the buffer flag to full. */
    tx_dma_stream.flags[tx_dma_stream.buf_index] = 0;
  4. Ring the doorbell on DP0 to tell Data Plane 0 a packet has been written to the buffer (this is done once and only once, no less, no more, for each packet written to the TX packet buffer):
    /* Tell hardware we filled a buffer. */
    *tx_dma_stream.doorbell = 1;
  5. Update the buf_index, rolling over if necessary (here, dma_cpu_bufs is the total number of packet buffers):
    /* Update the buffer index. */
    if(++tx_dma_stream.buf_index == dma_cpu_bufs)
        tx_dma_stream.buf_index = 0;
  6. Release the lock:
    /* Release the lock, finished with TX DMA region. */
    spin_unlock_irqrestore(&tx_dma_region_spinlock, tx_dma_region_spinlock_flags);
  7. Update statistics and free the socket buffer:
    /* Update the statistics. */
    netdev->stats.tx_packets++;
    netdev->stats.tx_bytes += len;

    dev_kfree_skb(skb);

That's it! The packet is queued in the TX DMA buffer and will be sent out when the hardware gets around to it. After the hardware transfers the packet from the host memory to FPGA memory, it will set that buffer's flag to 1 again to indicate that it is empty.

Receiving

Receiving is a bit more complicated, and works through a polling mechanism using a timer called rx_poll_timer. This timer is setup at the very end of the probe() function:

    /* Start the polling timer for receiving packets. */
    rx_poll_timer.expires = jiffies + RX_POLL_INTERVAL;
    add_timer(&rx_poll_timer);

(Here jiffies is a measure of the current time in the kernel, and RX_POLL_INTERVAL is the number of jiffies in the future the timer should fire. At the time of writing RX_POLL_INTERVAL is set to 1 to poll at the fastest possible rate. The length of a jiffy varies by machine (see http://en.wikipedia.org/wiki/Jiffy_(time)) but is typically on the order of a few milliseconds.)
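
As a rough worked example (HZ is a kernel build-time setting, so the exact figure varies from system to system):

    /* With HZ = 250 (a common default), one jiffy = 1/250 s = 4 ms,
     * so RX_POLL_INTERVAL = 1 re-arms the timer roughly every 4 ms. */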

The timer is initialized as global at the top of nf10_eth_driver_main.c:

/* Polling timer for received packets. */
struct timer_list rx_poll_timer = TIMER_INITIALIZER(rx_poll_timer_cb, 0, 0);

And is registered with the callback rx_poll_timer_cb, which is called each time the timer fires:

/* Callback function for the rx_poll_timer. */
static void rx_poll_timer_cb(unsigned long arg)
{
    /* Check for received packets. */
    if(rx_dma_stream.flags[rx_dma_stream.buf_index] == 1) {
        /* Schedule a poll. */
        napi_schedule(&nf10_napi_struct);
    } else {
        rx_poll_timer.expires += RX_POLL_INTERVAL;
        add_timer(&rx_poll_timer);
    }
}

When packets are found, a NAPI polling event is scheduled using a call to the kernel's napi_schedule() function, handing it the nf10_napi_struct, with which the callback function nf10_napi_struct_poll() is registered:

/* Slurp up packets. */
static int nf10_napi_struct_poll(struct napi_struct *napi, int budget)
{

The nf10_napi_struct_poll() function takes a budget argument, which is how many packets the poll function is allowed to hand up to the kernel in one call (see here: NAPI, particularly the section on NAPI Driver Design). It's the driver's way of efficiently sending up lots of packets in one go, and it's the kernel's way of applying back-pressure at the level of the driver when the system gets overloaded.

Therefore the code runs in a loop receiving packets, each time checking against the budget:

    while(n_rx < budget && rx_dma_stream.flags[buf_index] == 1) {

Packet receiving works in the following steps:

  1. Figure out which net_device to deliver the packet to. Here rx_get_dst_iface() is the function for translating the opcode field of the metadata to the target net_device (see section on Packet Buffering, Structure and Management):
        dst_iface = rx_get_dst_iface(rx_dma_stream.metadata[buf_index].opCode);
  2. Allocate a new socket buffer, put the packet in the buffer, set the socket buffer's net_device interface, and send the packet to the network stack:
        skb = dev_alloc_skb(rx_dma_stream.metadata[buf_index].length);

        memcpy( skb_put(skb, rx_dma_stream.metadata[buf_index].length),
                (void*)&rx_dma_stream.buffers[buf_index * DMA_BUF_SIZE],
                rx_dma_stream.metadata[buf_index].length);

        skb->dev = nf10_netdevs[dst_iface];
        skb->protocol = eth_type_trans(skb, nf10_netdevs[dst_iface]);

        netif_receive_skb(skb);
  3. Update the statistics on the interface:
        /* Update statistics. */
        nf10_netdevs[dst_iface]->stats.rx_packets++;
        nf10_netdevs[dst_iface]->stats.rx_bytes += rx_dma_stream.metadata[buf_index].length;
  4. Write a 0 to the flags field to indicate that the buffer has been emptied:
        /* Mark the buffer as empty. */
        rx_dma_stream.flags[buf_index] = 0;
  5. Ring the doorbell on Data Plane 1 to tell the hardware that the packet has been received. This is done once and only once per packet received from the RX buffer. This lets DP1 know that the buffer is available for writing a new packet if needed.
        /* Tell the hardware we emptied the buffer. */
        *rx_dma_stream.doorbell = 1;
  6. The last step in the loop is to update the buf_index, wrapping around if necessary (dma_cpu_bufs is the total number of buffers):
        /* Update the buffer index. */
        if(++rx_dma_stream.buf_index == dma_cpu_bufs)
            rx_dma_stream.buf_index = 0;

After the loop exits, there are two possible states the RX buffer could be in. One is that all the packets have been received (the number of packets in the RX buffer was less than or equal to budget), and the other is that there are still more packets left to receive (there were more packets in the RX buffer than budget). What the nf10_napi_struct_poll() function does next depends on which state it's in:

    /* Check if we processed everything. */
    if(rx_dma_stream.flags[buf_index] == 0) {
        PDEBUG("nf10_napi_struct_poll(): Slurped up all the packets there were to slurp!\n");
        napi_complete(napi);
        rx_poll_timer.expires = jiffies + RX_POLL_INTERVAL;
        add_timer(&rx_poll_timer);
        return 0;
    } else {
        PDEBUG("nf10_napi_struct_poll(): Slurped %d packets but still more left...\n", n_rx);
        return n_rx;
    }

If all packets have been received in this round, NAPI requires us to make a call to napi_complete() and return 0. In this case the driver will additionally restart the polling timer.

If, on the other hand, there are still more packets left, NAPI requires us to tell it the number of packets that we sent up to the kernel (this is kept track of by n_rx in the nf10_napi_struct_poll() function). In the driver, this is always equal to budget in this case.

This completes the discussion on receiving packets in the driver. As a side note, if interrupts become available from the hardware in the future, the timer-based polling mechanism above can be replaced by registering an interrupt handler, which, before scheduling NAPI, should first disable interrupts from the card. When, after some number of rounds of receiving packets, the RX buffer is finally drained, interrupts should then be re-enabled, thereby utilizing the interrupt mitigation feature of NAPI and preventing the driver from being inundated with interrupts.
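
To make that concrete, such a handler might look roughly like the following. This is purely a hypothetical sketch: nf10_rx_interrupt(), its IRQ registration, and the interrupt mask register it alludes to do not exist in the current driver or hardware.

/* Hypothetical replacement for rx_poll_timer_cb() once hardware interrupts exist. */
static irqreturn_t nf10_rx_interrupt(int irq, void *dev_id)
{
    /* Mask further RX interrupts from the card here (register not yet defined). */

    /* Hand the remaining work to NAPI, exactly as rx_poll_timer_cb() does today. */
    napi_schedule(&nf10_napi_struct);

    return IRQ_HANDLED;
}

nf10_napi_struct_poll() would then re-enable the card's interrupts once it has drained the RX buffer and called napi_complete().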

Reading and Writing Registers

This section documents how registers in the NetFPGA-10G design are read and written.

In the driver, userspace applications are given access to the NetFPGA-10G design's registers through the Generic Netlink commands genl_cmd_reg_rd and genl_cmd_reg_wr. Most of what these functions do pertains to GENL message handling and is unrelated to actually reading and writing registers. The curious reader is referred here for more details on how that works. This section of documentation, rather, is primarily focused on the aspect of reading and writing NetFPGA-10G registers in the driver. Snippets of code from these functions are therefore presented here to help in describing this.

Firstly, as described in the section on DMA Engine Architecture, the NetFPGA design on the chip technically only has 1MB of configuration space in BAR0 in which software can read and write registers. Since this 1MB was too limiting for many NetFPGA applications, a key-holing mechanism was implemented to expand this 1MB (20 bits of address space) to 4GB (32 bits of address space) by adding a 12 bit page register to the NetFPGA-10G worker's 64KB of control space (In looking at this structure please reference the diagram on control space layout for workers from the section DMA Engine Architecture. In that diagram no pageWindow register exists, as this feature was added later on.):

typedef struct {
    const uint32_t
        initialize,
        start,
        stop,
        release,
        test,
        beforeQuery,
        afterConfigure,
        reserved7,
        status;
    uint32_t
        control;
    const uint32_t
        lastConfig;
    uint32_t
        clearError,
        pageWindow,   /* PAGE REGISTER */
        reserved[3];
} OccpWorkerRegisters;

This register is "sticky" in the sense that after having written to it the designed 12-bit page address, all writes into the worker's configuration space have this page address as the upper 12 bits of the 32 bit register address.

Therefore in accessing registers in the functions genl_cmd_reg_rd and genl_cmd_reg_wr, the first step is to calculate the page address and offset of the register:

    /* Calculate page and offset. */
    reg_addr        = *(uint32_t*)nla_data(na);
    reg_addr_page   = reg_addr / OCCP_WORKER_CONFIG_SIZE;
    reg_addr_offset = reg_addr % OCCP_WORKER_CONFIG_SIZE;

The first line, *(uint32_t*)nla_data(na), simply extracts the register address argument out of the argument array passed to the Generic Netlink function. The next two lines use OCCP_WORKER_CONFIG_SIZE (that is, 1MB) to calculate the page address and offset.

The next step is to set the page register, which is done as follows:

    /* Set page register. */
    nf10_ctrl->pageWindow = reg_addr_page;

Finally, to read from a register:

    /* Go get the register value! */
    reg_val = nf10_regs[(reg_addr_offset >> 2)];

And to write to a register:

    /* Go write the register value! */
    nf10_regs[(reg_addr_offset >> 2)] = reg_val;

The important thing to note here is that register accesses must be 32-bit word aligned; nf10_regs is an array of 32-bit words, so when indexing it the least significant two bits of the byte offset are chopped off by the shift.
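
As a worked example, suppose a userspace tool asks to read the NetFPGA register at byte address 0x00523458 (an arbitrary address chosen for illustration):

    /* OCCP_WORKER_CONFIG_SIZE = 1MB = 0x100000
     *
     * reg_addr        = 0x00523458
     * reg_addr_page   = 0x00523458 / 0x100000 = 0x5      (written to nf10_ctrl->pageWindow)
     * reg_addr_offset = 0x00523458 % 0x100000 = 0x23458  (byte offset within the 1MB window)
     *
     * nf10_regs is an array of 32-bit words, so the byte offset becomes a word index:
     * reg_val = nf10_regs[0x23458 >> 2];                  (index 0x8D16)
     */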

This concludes the section on register reading and writing.
