Improve mrpc efficiency by leveraging write combining by wesleywesley · Pull Request #28 · Microsemi/switchtec-kernel

wesleywesley · 2018-08-16T07:23:53Z

change mrpc region to write combining attribute, while keep the other
region stay unchange.
previously, 1k mrpc input buffer filling took 1024 memwr tlps, with
each payload 1dwords, but only 1 byte is valid(enabled).
now, only take 16 memwr tlps with each payload 16 dwords.

lsgunth · 2018-08-17T16:21:47Z

+	SWITCHTEC_GAS_FLASH_INFO_OFFSET = 0x1200,
+	SWITCHTEC_GAS_PART_CFG_OFFSET   = 0x3000,
+	SWITCHTEC_GAS_NTB_OFFSET        = 0xf000,
+	SWITCHTEC_GAS_PFF_CSR_OFFSET    = 0x133000,


I'd prefer it if you did this change without modifying these defines. They describe the hardware not how the software uses the hardware. And it would just be confusing if they are wrong.

So just subtract SWITCHTEC_GAS_TOP_CFG_OFFSET from any mapping that's done with the offset.
Therefore you only need to change anything that uses 'stdev->mmio' which isn't very much and is easy to grep for.

lsgunth · 2018-08-17T16:22:30Z

 	if (db_addr)
-		*db_addr = pci_resource_start(ntb->pdev, 0) + offset;
+		*db_addr = pci_resource_start(ntb->pdev, 0) + offset +
+			  SWITCHTEC_GAS_MRPC_SIZE;


Better to just add

offset += SWITCHTEC_GAS_TOP_CFG_OFFSET;

above instead of adding to this line.

lsgunth · 2018-08-17T16:22:53Z

 	if (spad_addr)
-		*spad_addr = pci_resource_start(ntb->pdev, 0) + offset;
+		*spad_addr = pci_resource_start(ntb->pdev, 0) + offset +
+			    SWITCHTEC_GAS_MRPC_SIZE;


lsgunth · 2018-08-17T16:24:43Z


-	addr = (bar_addrs[0] + SWITCHTEC_GAS_NTB_OFFSET +
+	addr = (bar_addrs[0] + SWITCHTEC_GAS_MRPC_SIZE +
+		SWITCHTEC_GAS_NTB_OFFSET +


This goes away if you leave the defines alone.

lsgunth · 2018-08-17T16:26:24Z

 	stdev->mrpc_busy = 1;
 	memcpy_toio(&stdev->mmio_mrpc->input_data,
 		    stuser->data, stuser->data_len);
+	wmb();


This is wrong. Linux guarantees the ordering of IOs with respect to other IOs on the PCI bus and the CPU flushes the WC buffer on any load/store to a UC region (which is exactly the next command). Even if this wasn't the case, a wmb() is a heavy tool to employ here.

Oh wait scratch that... the next line is in the WC region. The correct way to handle this is a read from the device instead of a hard barrier. So issue an ioread32(&stdev->mmio_mrpc->cmd) and put a comment noting that it is there to flush the WC buffer.

thx.
replace wmb() by ioread of reserved register in NTB db/msg register range
a. memrd tlp to this range of registers is processed by HW, with qucik response: ns level
b. out of this range, memrd tlp to the registers is processed by FW, with us or even ms level delay
c. dma mrpc feature is try to remove as many as possible of the ioread

Reference from "DMA-API-HOWTO.txt"

.. important::

Consistent DMA memory does not preclude the usage of proper memory barriers. The CPU may reorder stores to consistent memory just as it may normal memory. Example: if it is important for the device to see the first word of a descriptor updated before the second, you must do something like:: desc->word0 = address; wmb(); desc->word1 = DESC_VALID; in order to get correct behavior on all platforms. Also, on some platforms your driver may need to flush CPU write buffers in much the same way as it needs to flush write buffers found in PCI bridges (such as by reading a register's value after writing it).

That quote is completely unrelated. This is talking about memory accesses that are also accessed by a DMA engine.

Linux and PCI ensure that IO writes are done in order. The only issue is that WC might delay the write until the next access. The correct way to flush the WC buffer is to issue a read from the device. That's what we did.

Logan

Reference: Documentation/io_ordering.txt

==============================================
Ordering I/O writes to memory-mapped addresses

On some platforms, so-called memory-mapped I/O is weakly ordered. On such
platforms, driver writers are responsible for ensuring that I/O writes to
memory-mapped addresses on their device arrive in the order intended. This is
typically done by reading a 'safe' device or bridge register, causing the I/O
chipset to flush pending writes to the device before any reads are posted.

Yes, you are correct. However, we did the right thing in patch #36, so I'm not sure what the issue is... We certainly shouldn't be adding wmb() calls anywhere. If there's another place that matters to add flush_wc_buf() then do so...

@lsgunth

Thx for comments.

There is no issue.
I just want to get a clear picture of why wmb() is replaced by PCI read operation.

Regard,
Wesley

wmb() is a hard memory barrier affecting all memory operations. It's very expensive.

Invoking a PCI read flushes the WC buffer so it accomplishes the same thing but a lot cheaper.

@Isgunth

Thx for comments.

Regard,
Wesley

lsgunth · 2018-08-17T16:33:11Z


-	stdev->mmio = pcim_iomap_table(pdev)[0];
-	stdev->mmio_mrpc = stdev->mmio + SWITCHTEC_GAS_MRPC_OFFSET;
+	res = devm_request_mem_region(&pdev->dev,


Probably easier if you just enclose this in if (!devm_request_mem_region(...)) seeing you don't need the resource.

Also, it will probably help the line length issues and improve readability if you store start and len in variables seeing you use the functions a lot.

wesleywesley · 2018-08-20T09:18:16Z

@lsgunth updated. Pls help review.

lsgunth

Big improvement from the last version. Thanks.

Just a couple of style nits left.

lsgunth · 2018-08-20T18:01:23Z

+		return -EBUSY;
+
+	stdev->mmio_mrpc = devm_ioremap_wc(&pdev->dev,
+					   res_start,


Nit: this line fits on the previous line, no need to have another line.

lsgunth · 2018-08-20T18:02:06Z

+		return -ENOMEM;
+
+	map = devm_ioremap(&pdev->dev,
+			   res_start +


NIT: Similarly, too many lines here. It should fit on 3 lines instead of 5.

Previously, fill 1k mrpc input buffer took 1024 memwr tlps, with each payload 1 dwords, while only 1 byte is valid(enabled). In this case, too many of tlps within a timer windows introduce tlp throttling. By use of the write combining buffer, 1k data fillingtake 16 memwr tlps with each payload 16 dwords.

wesleywesley requested review from a user, kelvin-cao and lsgunth August 16, 2018 07:24

lsgunth requested changes Aug 17, 2018

View reviewed changes

wesleywesley force-pushed the wc_mrpc branch 2 times, most recently from 2e02c04 to 5c78945 Compare August 20, 2018 09:01

lsgunth reviewed Aug 20, 2018

View reviewed changes

wesleywesley force-pushed the wc_mrpc branch from 5c78945 to edf348b Compare August 21, 2018 01:59

lsgunth approved these changes Aug 22, 2018

View reviewed changes

wesleywesley merged commit e1904b0 into devel Aug 23, 2018

wesleywesley deleted the wc_mrpc branch October 10, 2018 05:03

Conversation

wesleywesley commented Aug 16, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

============================================== Ordering I/O writes to memory-mapped addresses

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

wesleywesley commented Aug 20, 2018

Uh oh!

lsgunth left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

==============================================
Ordering I/O writes to memory-mapped addresses

lsgunth left a comment •

edited

Loading