Skip to content

Improve mrpc efficiency by leveraging write combining#28

Merged
wesleywesley merged 1 commit intodevelfrom
wc_mrpc
Aug 23, 2018
Merged

Improve mrpc efficiency by leveraging write combining#28
wesleywesley merged 1 commit intodevelfrom
wc_mrpc

Conversation

@wesleywesley
Copy link
Copy Markdown
Contributor

change mrpc region to write combining attribute, while keep the other
region stay unchange.
previously, 1k mrpc input buffer filling took 1024 memwr tlps, with
each payload 1dwords, but only 1 byte is valid(enabled).
now, only take 16 memwr tlps with each payload 16 dwords.

@wesleywesley wesleywesley requested review from a user, kelvin-cao and lsgunth August 16, 2018 07:24
Comment thread linux/switchtec.h Outdated
SWITCHTEC_GAS_FLASH_INFO_OFFSET = 0x1200,
SWITCHTEC_GAS_PART_CFG_OFFSET = 0x3000,
SWITCHTEC_GAS_NTB_OFFSET = 0xf000,
SWITCHTEC_GAS_PFF_CSR_OFFSET = 0x133000,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer it if you did this change without modifying these defines. They describe the hardware not how the software uses the hardware. And it would just be confusing if they are wrong.

So just subtract SWITCHTEC_GAS_TOP_CFG_OFFSET from any mapping that's done with the offset.
Therefore you only need to change anything that uses 'stdev->mmio' which isn't very much and is easy to grep for.

Comment thread ntb_hw_switchtec.c Outdated
if (db_addr)
*db_addr = pci_resource_start(ntb->pdev, 0) + offset;
*db_addr = pci_resource_start(ntb->pdev, 0) + offset +
SWITCHTEC_GAS_MRPC_SIZE;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to just add

offset += SWITCHTEC_GAS_TOP_CFG_OFFSET;

above instead of adding to this line.

Comment thread ntb_hw_switchtec.c Outdated
if (spad_addr)
*spad_addr = pci_resource_start(ntb->pdev, 0) + offset;
*spad_addr = pci_resource_start(ntb->pdev, 0) + offset +
SWITCHTEC_GAS_MRPC_SIZE;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto

Comment thread ntb_hw_switchtec.c Outdated

addr = (bar_addrs[0] + SWITCHTEC_GAS_NTB_OFFSET +
addr = (bar_addrs[0] + SWITCHTEC_GAS_MRPC_SIZE +
SWITCHTEC_GAS_NTB_OFFSET +
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This goes away if you leave the defines alone.

Comment thread switchtec.c Outdated
stdev->mrpc_busy = 1;
memcpy_toio(&stdev->mmio_mrpc->input_data,
stuser->data, stuser->data_len);
wmb();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is wrong. Linux guarantees the ordering of IOs with respect to other IOs on the PCI bus and the CPU flushes the WC buffer on any load/store to a UC region (which is exactly the next command). Even if this wasn't the case, a wmb() is a heavy tool to employ here.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wait scratch that... the next line is in the WC region. The correct way to handle this is a read from the device instead of a hard barrier. So issue an ioread32(&stdev->mmio_mrpc->cmd) and put a comment noting that it is there to flush the WC buffer.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx.
replace wmb() by ioread of reserved register in NTB db/msg register range
a. memrd tlp to this range of registers is processed by HW, with qucik response: ns level
b. out of this range, memrd tlp to the registers is processed by FW, with us or even ms level delay
c. dma mrpc feature is try to remove as many as possible of the ioread

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reference from "DMA-API-HOWTO.txt"

.. important::

     Consistent DMA memory does not preclude the usage of
     proper memory barriers.  The CPU may reorder stores to
     consistent memory just as it may normal memory.  Example:
     if it is important for the device to see the first word
     of a descriptor updated before the second, you must do
     something like::

	desc->word0 = address;
	wmb();
	desc->word1 = DESC_VALID;

         in order to get correct behavior on all platforms.

     Also, on some platforms your driver may need to flush CPU write
     buffers in much the same way as it needs to flush write buffers
     found in PCI bridges (such as by reading a register's value
     after writing it).

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That quote is completely unrelated. This is talking about memory accesses that are also accessed by a DMA engine.

Linux and PCI ensure that IO writes are done in order. The only issue is that WC might delay the write until the next access. The correct way to flush the WC buffer is to issue a read from the device. That's what we did.

Logan

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reference: Documentation/io_ordering.txt

==============================================
Ordering I/O writes to memory-mapped addresses

On some platforms, so-called memory-mapped I/O is weakly ordered. On such
platforms, driver writers are responsible for ensuring that I/O writes to
memory-mapped addresses on their device arrive in the order intended. This is
typically done by reading a 'safe' device or bridge register, causing the I/O
chipset to flush pending writes to the device before any reads are posted.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are correct. However, we did the right thing in patch #36, so I'm not sure what the issue is... We certainly shouldn't be adding wmb() calls anywhere. If there's another place that matters to add flush_wc_buf() then do so...

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lsgunth

Thx for comments.

There is no issue.
I just want to get a clear picture of why wmb() is replaced by PCI read operation.

Regard,
Wesley

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wmb() is a hard memory barrier affecting all memory operations. It's very expensive.

Invoking a PCI read flushes the WC buffer so it accomplishes the same thing but a lot cheaper.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Isgunth

Thx for comments.

Regard,
Wesley

Comment thread switchtec.c Outdated

stdev->mmio = pcim_iomap_table(pdev)[0];
stdev->mmio_mrpc = stdev->mmio + SWITCHTEC_GAS_MRPC_OFFSET;
res = devm_request_mem_region(&pdev->dev,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably easier if you just enclose this in if (!devm_request_mem_region(...)) seeing you don't need the resource.

Also, it will probably help the line length issues and improve readability if you store start and len in variables seeing you use the functions a lot.

@wesleywesley wesleywesley force-pushed the wc_mrpc branch 2 times, most recently from 2e02c04 to 5c78945 Compare August 20, 2018 09:01
@wesleywesley
Copy link
Copy Markdown
Contributor Author

@lsgunth updated. Pls help review.

Copy link
Copy Markdown
Collaborator

@lsgunth lsgunth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Big improvement from the last version. Thanks.

Just a couple of style nits left.

Comment thread switchtec.c Outdated
return -EBUSY;

stdev->mmio_mrpc = devm_ioremap_wc(&pdev->dev,
res_start,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: this line fits on the previous line, no need to have another line.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx

Comment thread switchtec.c Outdated
return -ENOMEM;

map = devm_ioremap(&pdev->dev,
res_start +
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: Similarly, too many lines here. It should fit on 3 lines instead of 5.

Previously, fill 1k mrpc input buffer took 1024 memwr tlps,
with each payload 1 dwords, while only 1 byte is valid(enabled).
In this case, too many of tlps within a timer windows introduce
tlp throttling.
By use of the write combining buffer,  1k data fillingtake 16
memwr tlps with each payload 16 dwords.
@wesleywesley wesleywesley merged commit e1904b0 into devel Aug 23, 2018
@wesleywesley wesleywesley deleted the wc_mrpc branch October 10, 2018 05:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants