Improve mrpc efficiency by leveraging write combining #28
| SWITCHTEC_GAS_FLASH_INFO_OFFSET = 0x1200, | ||
| SWITCHTEC_GAS_PART_CFG_OFFSET = 0x3000, | ||
| SWITCHTEC_GAS_NTB_OFFSET = 0xf000, | ||
| SWITCHTEC_GAS_PFF_CSR_OFFSET = 0x133000, |
I'd prefer it if you did this change without modifying these defines. They describe the hardware not how the software uses the hardware. And it would just be confusing if they are wrong.
So just subtract SWITCHTEC_GAS_TOP_CFG_OFFSET from any mapping that's done with the offset.
Therefore you only need to change anything that uses 'stdev->mmio' which isn't very much and is easy to grep for.
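A minimal userspace sketch of the suggestion above, for illustration only. It assumes the second mapping starts at SWITCHTEC_GAS_TOP_CFG_OFFSET, that this offset is 0x1000, and that the untouched hardware offsets are 0x1000 above the modified values in the diff. The helper gas_ptr() is hypothetical; in the driver the same subtraction would be applied wherever 'stdev->mmio' is used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hardware offsets, left as they describe the device. The values are
 * assumptions: the diff shows each one reduced by 0x1000. */
enum {
	SWITCHTEC_GAS_TOP_CFG_OFFSET  = 0x1000,  /* assumed */
	SWITCHTEC_GAS_PART_CFG_OFFSET = 0x4000,  /* assumed */
	SWITCHTEC_GAS_NTB_OFFSET      = 0x10000, /* assumed */
};

/* Hypothetical helper: `mmio` maps the GAS starting at TOP_CFG, so a
 * hardware offset is rebased before it is added to the pointer. */
static inline uint8_t *gas_ptr(uint8_t *mmio, size_t hw_offset)
{
	assert(hw_offset >= SWITCHTEC_GAS_TOP_CFG_OFFSET);
	return mmio + (hw_offset - SWITCHTEC_GAS_TOP_CFG_OFFSET);
}
```

With this shape the defines keep matching the hardware documentation, and the rebasing lives in one place.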
| if (db_addr) | ||
| *db_addr = pci_resource_start(ntb->pdev, 0) + offset; | ||
| *db_addr = pci_resource_start(ntb->pdev, 0) + offset + | ||
| SWITCHTEC_GAS_MRPC_SIZE; |
Better to just add
offset += SWITCHTEC_GAS_TOP_CFG_OFFSET;
above instead of adding to this line.
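A rough sketch of what that looks like, written as a standalone function so it can be compiled outside the kernel. It assumes `offset` is the register's distance from the start of the rebased mapping and that SWITCHTEC_GAS_TOP_CFG_OFFSET is 0x1000; reg_bus_addr() and its bar_len check are hypothetical and not part of the driver.

```c
#include <assert.h>
#include <stdint.h>

#define SWITCHTEC_GAS_TOP_CFG_OFFSET 0x1000u /* assumed value */

/* Hypothetical helper: convert an offset inside the rebased mapping
 * into a bus address within BAR 0. */
static uint64_t reg_bus_addr(uint64_t bar_start, uint64_t bar_len,
			     uint64_t offset)
{
	/* Rebase once, up front: the mapping starts at TOP_CFG, the BAR at 0. */
	offset += SWITCHTEC_GAS_TOP_CFG_OFFSET;

	assert(offset < bar_len); /* the register must lie inside the BAR */
	return bar_start + offset;
}
```

Adjusting `offset` once keeps the db_addr and spad_addr assignments identical to the original lines.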
| if (spad_addr) | ||
| *spad_addr = pci_resource_start(ntb->pdev, 0) + offset; | ||
| *spad_addr = pci_resource_start(ntb->pdev, 0) + offset + | ||
| SWITCHTEC_GAS_MRPC_SIZE; |
|
|
||
| addr = (bar_addrs[0] + SWITCHTEC_GAS_NTB_OFFSET + | ||
| addr = (bar_addrs[0] + SWITCHTEC_GAS_MRPC_SIZE + | ||
| SWITCHTEC_GAS_NTB_OFFSET + |
This goes away if you leave the defines alone.
| stdev->mrpc_busy = 1; | ||
| memcpy_toio(&stdev->mmio_mrpc->input_data, | ||
| stuser->data, stuser->data_len); | ||
| wmb(); |
This is wrong. Linux guarantees the ordering of IOs with respect to other IOs on the PCI bus and the CPU flushes the WC buffer on any load/store to a UC region (which is exactly the next command). Even if this wasn't the case, a wmb() is a heavy tool to employ here.
Oh wait scratch that... the next line is in the WC region. The correct way to handle this is a read from the device instead of a hard barrier. So issue an ioread32(&stdev->mmio_mrpc->cmd) and put a comment noting that it is there to flush the WC buffer.
thx.
Replaced wmb() with an ioread of a reserved register in the NTB db/msg register range:
a. A MemRd TLP to this range of registers is handled by HW, with a quick (ns-level) response.
b. Outside this range, a MemRd TLP to the registers is handled by FW, with a us- or even ms-level delay.
c. The DMA MRPC feature tries to remove as many of the ioreads as possible.
Reference from "DMA-API-HOWTO.txt"
.. important::
Consistent DMA memory does not preclude the usage of
proper memory barriers. The CPU may reorder stores to
consistent memory just as it may normal memory. Example:
if it is important for the device to see the first word
of a descriptor updated before the second, you must do
something like::
desc->word0 = address;
wmb();
desc->word1 = DESC_VALID;
in order to get correct behavior on all platforms.
Also, on some platforms your driver may need to flush CPU write
buffers in much the same way as it needs to flush write buffers
found in PCI bridges (such as by reading a register's value
after writing it).
That quote is completely unrelated. This is talking about memory accesses that are also accessed by a DMA engine.
Linux and PCI ensure that IO writes are done in order. The only issue is that WC might delay the write until the next access. The correct way to flush the WC buffer is to issue a read from the device. That's what we did.
Logan
Reference: Documentation/io_ordering.txt
==============================================
Ordering I/O writes to memory-mapped addresses
On some platforms, so-called memory-mapped I/O is weakly ordered. On such
platforms, driver writers are responsible for ensuring that I/O writes to
memory-mapped addresses on their device arrive in the order intended. This is
typically done by reading a 'safe' device or bridge register, causing the I/O
chipset to flush pending writes to the device before any reads are posted.
Yes, you are correct. However, we did the right thing in patch #36, so I'm not sure what the issue is... We certainly shouldn't be adding wmb() calls anywhere. If there's another place that matters to add flush_wc_buf() then do so...
Thx for comments.
There is no issue.
I just want to get a clear picture of why wmb() is replaced by PCI read operation.
Regard,
Wesley
wmb() is a hard memory barrier affecting all memory operations. It's very expensive.
Invoking a PCI read flushes the WC buffer so it accomplishes the same thing but a lot cheaper.
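To make the ordering discussed in this thread concrete, here is a userspace model of the submit path. It is only a sketch: the mock_* functions stand in for memcpy_toio(), ioread32() and iowrite32() and merely record call order, `safe_reg` stands in for whichever device register is read to flush the WC buffer, and submit_cmd() is a hypothetical name. Nothing here touches real hardware or proves flush behavior; it only shows where the read sits.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Userspace stand-ins for the kernel accessors; they record call order. */
enum op { OP_COPY_DATA, OP_FLUSH_READ, OP_WRITE_CMD };

struct trace {
	enum op ops[8];
	int n;
};

struct mock_mrpc {
	uint8_t input_data[1024];
	uint32_t cmd;
};

static void mock_memcpy_toio(struct trace *t, void *dst,
			     const void *src, size_t len)
{
	memcpy(dst, src, len);
	t->ops[t->n++] = OP_COPY_DATA;
}

static uint32_t mock_ioread32(struct trace *t, const uint32_t *reg)
{
	t->ops[t->n++] = OP_FLUSH_READ;
	return *reg;
}

static void mock_iowrite32(struct trace *t, uint32_t val, uint32_t *reg)
{
	*reg = val;
	t->ops[t->n++] = OP_WRITE_CMD;
}

/* Hypothetical submit path: fill the WC-mapped input buffer, read a
 * device register to flush the WC buffer, then write the command. */
static void submit_cmd(struct trace *t, struct mock_mrpc *mrpc,
		       const uint32_t *safe_reg, uint32_t cmd,
		       const void *data, size_t len)
{
	assert(len <= sizeof(mrpc->input_data));
	mock_memcpy_toio(t, mrpc->input_data, data, len);
	(void)mock_ioread32(t, safe_reg); /* stands in for the WC flush */
	mock_iowrite32(t, cmd, &mrpc->cmd);
}
```

The point of the shape is that the read is issued after the data copy and before the command write, with no wmb() in between.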
@lsgunth
Thx for comments.
Regard,
Wesley
|
|
||
| stdev->mmio = pcim_iomap_table(pdev)[0]; | ||
| stdev->mmio_mrpc = stdev->mmio + SWITCHTEC_GAS_MRPC_OFFSET; | ||
| res = devm_request_mem_region(&pdev->dev, |
Probably easier if you just enclose this in if (!devm_request_mem_region(...)) seeing you don't need the resource.
Also, it will probably help the line length issues and improve readability if you store start and len in variables seeing you use the functions a lot.
2e02c04 to 5c78945
|
@lsgunth updated. Pls help review. |
| return -EBUSY; | ||
|
|
||
| stdev->mmio_mrpc = devm_ioremap_wc(&pdev->dev, | ||
| res_start, |
Nit: this line fits on the previous line, no need to have another line.
| return -ENOMEM; | ||
|
|
||
| map = devm_ioremap(&pdev->dev, | ||
| res_start + |
NIT: Similarly, too many lines here. It should fit on 3 lines instead of 5.
Previously, filling the 1k mrpc input buffer took 1024 memwr tlps, each with a 1 dword payload in which only 1 byte was valid (enabled). That many tlps within one timer window causes tlp throttling. With the write combining buffer, filling 1k of data takes 16 memwr tlps, each with a 16 dword payload.
Change the mrpc region to the write combining attribute, while keeping
the other regions unchanged.
Previously, filling the 1k mrpc input buffer took 1024 memwr tlps, each
with a 1 dword payload in which only 1 byte was valid (enabled).
Now it takes only 16 memwr tlps, each with a 16 dword payload.
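The arithmetic behind these numbers can be checked with a short sketch. It assumes each tlp carries a fixed number of valid bytes: 1 byte per tlp in the old case, and 16 dwords (64 bytes) per tlp when the writes are combined. tlp_count() is a hypothetical helper, not driver code.

```c
#include <assert.h>
#include <stddef.h>

/* Number of write tlps needed to move `total` bytes when each tlp
 * carries `bytes_per_tlp` valid bytes, rounded up. */
static size_t tlp_count(size_t total, size_t bytes_per_tlp)
{
	assert(bytes_per_tlp > 0);
	return (total + bytes_per_tlp - 1) / bytes_per_tlp;
}
```

For a 1024 byte buffer this gives 1024 tlps at 1 valid byte each and 16 tlps at 64 bytes each, matching the figures in the commit message.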