
virtio: Refactoring and cleanup #98

Merged
merged 3 commits into Solo5:master from virtio-net-changes-for-freebsd on Oct 3, 2016

Conversation

@ricarkol (Collaborator) commented Sep 21, 2016

This PR is for two changes:

  1. The network header in bhyve's virtio-net implementation is hard-coded to use the extra num_buffers field (vrh_bufs in the FreeBSD code) even when the VIRTIO_NET_F_MRG_RXBUF feature is not negotiated. This proposed change uses the extended network header, asserts the host feature VIRTIO_NET_F_MRG_RXBUF, and always negotiates it with the host. Here is the FreeBSD code with the hardcoded struct, for reference: https://github.com/freebsd/freebsd/blob/master/usr.sbin/bhyve/pci_virtio_net.c#L120
  2. The Tx path in bhyve's virtio-net assumes that every Tx will have a first descriptor carrying the header alone: https://github.com/freebsd/freebsd/blob/master/usr.sbin/bhyve/pci_virtio_net.c#L617 has an interesting comment, and https://github.com/freebsd/freebsd/blob/master/usr.sbin/bhyve/pci_virtio_net.c#L631 only copies tx data after the first descriptor. The proposed change uses two descriptors per Tx operation. All I could find in the virtio 1.0 spec is this:

"Most common is to begin the data with a header (containing little-endian fields) for the device to read, and postfix it with a status tailer for the device to write.".

Tested these changes with the static_website unikernel locally and in GCE. Also tested the test_ping_serve unikernel on FreeBSD (after applying @hannesm's change to Makefile.common and following the instructions).

@ricarkol changed the title from "Virtio net changes for freebsd" to "virtio-net changes for freebsd" on Sep 21, 2016
@hannesm (Contributor) commented Sep 21, 2016

will hopefully soon have time to test! thank you so much for digging into this :D

@mato (Member) left a comment

This change is not required. See https://github.com/freebsd/freebsd/blob/master/usr.sbin/bhyve/pci_virtio_net.c#L957 which does the right thing when VIRTIO_NET_F_MRG_RXBUF is not negotiated.

Orthogonal to this: If we do negotiate the feature we must actually support it rather than asserting num_buffers == 1 (L358).

@mato (Member) left a comment

Two things:

  1. The bhyve behaviour of requiring two descriptors on the TX path is a bug. See section 5.1.6.2 of the virtio spec (http://docs.oasis-open.org/virtio/virtio/v1.0/cs04/virtio-v1.0-cs04.html#x1-1680006). I'd rather submit a PR for that, which will help other (future) bhyve users, than work around it. @hannesm What do you think?
  2. A general problem with our virtio code is that we're not using the various indexes ..._last_used, ..._next_avail in a consistent fashion. It's not clear whether these are intended to be free-running and bounded only when used as an array/table index, or always bounded by the queue size (the sketch below contrasts the two conventions).
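For illustration, the two conventions in question (a minimal sketch, reusing the struct names from this PR; the claim_* helpers are hypothetical):

#include <stdint.h>

struct virtq_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

struct virtq {
    struct virtq_desc *desc;
    uint16_t num;  /* queue size; virtio requires a power of two */
};

/* Convention A: free-running. The uint16_t index wraps naturally at
 * 2^16 and is reduced modulo the queue size only when used as a table
 * index. Correct only because num divides 2^16. */
static struct virtq_desc *claim_free_running(struct virtq *vq,
                                             uint16_t *next_avail)
{
    struct virtq_desc *d = &vq->desc[*next_avail % vq->num];
    (*next_avail)++;  /* the stored value is never reduced */
    return d;
}

/* Convention B: bounded. The index is always kept in [0, num), so
 * reads need no modulo, but every update must reduce. */
static struct virtq_desc *claim_bounded(struct virtq *vq,
                                        uint16_t *next_avail)
{
    struct virtq_desc *d = &vq->desc[*next_avail];
    *next_avail = (uint16_t)((*next_avail + 1) % vq->num);
    return d;
}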

desc->len = sizeof(virtio_net_hdr) + len;
desc->addr = (uint64_t)&virtio_net_hdr;
desc->len = sizeof(virtio_net_hdr);
desc->next = xmit_next_avail + 1;
Member:

Need to handle wraparound here.

Collaborator Author:

Moved all this descriptor handling code to virtio_ring.c, handling the wraparound there more carefully.

desc->flags = VIRTQ_DESC_F_NEXT;

assert(len <= PKT_BUFFER_LEN);
memcpy(xmit_bufs[xmit_next_avail + 1].data, data, len);
Member:

As above.

Collaborator Author:

Moved all this descriptor handling code to virtio_ring.c, handling the wraparound there more carefully.


assert(len <= PKT_BUFFER_LEN);
memcpy(xmit_bufs[xmit_next_avail + 1].data, data, len);
desc = &(xmitq.desc[xmit_next_avail + 1]);
Member:

As above.

Collaborator Author:

Moved all this descriptor handling code to virtio_ring.c, handling the wraparound there more carefully.

assert(len <= PKT_BUFFER_LEN);
memcpy(xmit_bufs[xmit_next_avail + 1].data, data, len);
desc = &(xmitq.desc[xmit_next_avail + 1]);
desc->addr = (uint64_t) xmit_bufs[xmit_next_avail + 1].data;
Member:

As above.

Collaborator Author:

Moved all this descriptor handling code to virtio_ring.c, handling the wraparound there more carefully.

desc = &(xmitq.desc[xmit_next_avail]);
desc->addr = (uint64_t) xmit_bufs[xmit_next_avail].data;
desc->len = sizeof(virtio_net_hdr) + len;
desc->addr = (uint64_t)&virtio_net_hdr;
Member:

I'd prefer that we get rid of the static struct virtio_net_hdr entirely and just grab an extra buffer out of xmit_bufs[], zeroing the first sizeof (struct virtio_net_hdr) bytes.

Collaborator Author:

Done.

if (((xmit_next_avail + 1) % xmitq.num) ==
(xmit_last_used % xmitq.num)) {
if (((xmit_next_avail + 2) % xmitq.num) ==
((xmit_last_used * 2) % xmitq.num)) {
Member:

Shouldn't this be

if (((xmit_next_avail + 2) % xmitq.num) >=
        (xmit_last_used % xmitq.num)) {

?
See check_xmit().

Collaborator Author:

Not using this check anymore. The new version keeps track of free descriptors explicitly (in virtq.num_avail).
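A sketch of what that explicit accounting looks like (illustrative; only num_avail comes from the PR, the helper names are hypothetical):

#include <stdint.h>

struct virtq {
    uint16_t num;        /* queue size */
    uint16_t num_avail;  /* free descriptors, initialised to num */
    /* descriptor table and rings elided */
};

/* Claim n descriptors for a new chain; fail cleanly when the ring is
 * full instead of inferring fullness from index comparisons. */
static int virtq_claim(struct virtq *vq, uint16_t n)
{
    if (vq->num_avail < n)
        return -1;       /* no space: caller retries or drops */
    vq->num_avail -= n;
    return 0;
}

/* Release n descriptors once the device reports the chain as used. */
static void virtq_release(struct virtq *vq, uint16_t n)
{
    vq->num_avail += n;
}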

@hannesm (Contributor) commented Sep 23, 2016

My take is: implement workarounds in solo5 to get it working right now. But please report the issues upstream (via https://bugs.freebsd.org/bugzilla/ or https://lists.freebsd.org/mailman/listinfo/freebsd-virtualization); once they're fixed upstream (and released), we can remove the workarounds in solo5.

Sorry, still at ICFP in Japan, thus little time and bad internet connectivity

@ricarkol force-pushed the virtio-net-changes-for-freebsd branch from 4e0c590 to 0aff8c9 on September 26, 2016 at 20:37
@ricarkol (Collaborator, Author) commented Sep 26, 2016

Changes since last version:

  1. Moved all the common code dealing with building descriptor chains (and keeping track of free descriptors), initializing virtq rings, and interrupt handling to virtio_ring.c.
    Regarding "A general problem with our virtio code is that we're not using the various indexes ..._last_used, ..._next_avail in a consistent fashion": cleaned all this up. Those comparisons between last_used and next_avail were done to infer whether there are free descriptors; it's much easier to keep track of free descriptors explicitly (with a variable), so I added that. Also, as you said, the usage of our indexes was not consistent: last_used counts chains while next_avail counts descriptors (2x for tx), so I don't think they were meant to be compared against each other.
  2. Not negotiating VIRTIO_NET_F_MRG_RXBUF anymore.

Tests:

  • locally tested test_ping_serve with ping -f, test_blk, and mirage static_website
  • mirage static_website on GCE
  • test_ping_serve on freebsd. @hannesm thanks, will report the bhyve issue then.

@mato ^

@ricarkol force-pushed the virtio-net-changes-for-freebsd branch 2 times, most recently from fddc0da to 43ad6c7 on September 26, 2016 at 21:31
@mato (Member) left a comment

Thanks @ricarkol. In general this version looks good and shares more common code; I like it! The only downside is a slight waste of memory for blkq buffers but I think it's worth the increase in shared code.

I've left inline comments in various places on things which aren't 100% clear or need fixing. Aside from those, I'm still confused on one important point:

Are vq->next_avail and vq->last_used intended to be free-running and thus always used in conjunction with % vq->num when indexing descriptors or are they intended to be bounded and thus always set in conjunction with % vq->num?

The current code seems to be inconsistent in this, as I've mentioned in some of the inline comments.

if (dbg)
printf("INTR: BLK: desc=0x%p status=%d\n",
desc->addr, *(uint8_t *)desc->addr);
virtq_add_descriptor_chain(&blkq, head, 3);
@mato (Member), Sep 27, 2016:

Either assert(... != -1) or pass the error return to the caller (but then the function signature needs to change).

Collaborator Author:

added an assert


return req;
}
ret = (*(uint8_t *) status_buf) != VIRTIO_BLK_S_OK;
Member:

This (and the following code) would be better rewritten in a linear style with the common case preferred, something like:

status = (*(uint8_t *)status_buf);
if (status != VIRTIO_BLK_S_OK)
    return -1;

if (type == ...) /* see next comment */
    memcpy(...);
return 0;
}

Collaborator Author:

changed

return req;
}
ret = (*(uint8_t *) status_buf) != VIRTIO_BLK_S_OK;
if (type == VIRTIO_BLK_T_OUT && ret == 0)
@mato (Member), Sep 27, 2016:

Shouldn't this be VIRTIO_BLK_T_IN (read)? If so, why didn't your testing with test_blk catch it?

@ricarkol (Collaborator, Author), Sep 27, 2016:

It didn't fail because there is a bug in the test (https://github.com/Solo5/solo5/blob/master/tests/test_blk/test_blk.c#L22):

        if (sector_write[i] != '0' + i % 10) <== should be sector_read
            /* Check failed */
            return 1;
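With the fix applied, the check reads from the buffer that was actually read back:

        if (sector_read[i] != '0' + i % 10)
            /* Check failed */
            return 1;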

@ricarkol (Collaborator, Author), Sep 27, 2016:

Fixed (test and virtio_blk code)

desc->len = sizeof(virtio_net_hdr);
desc->next = xmit_next_avail + 1;
desc->flags = VIRTQ_DESC_F_NEXT;
head = xmitq.next_avail;
Member:

It'd be clearer if this code (manipulating xmitq.bufs[]) were written in a similar style to that in virtio_blk_op(), i.e. access via *head_buf and *data_buf.

Collaborator Author:

Changed

volatile uint8_t status;
volatile uint8_t hw_used;
};

#define VIRTQ_BLK_MAX_QUEUE_SIZE 8192
Member:

No longer used.

Collaborator Author:

removed

uint16_t desc_idx;
struct io_buffer *head_buf;

if ((vq->used->idx % vq->num) == vq->last_used)
Member:

Is last_used free-running? Code in drivers uses (% num) when touching it, doesn't it need the same here? (Also see general comment coming up)

*/
int virtq_add_descriptor_chain(struct virtq *vq,
uint16_t head,
uint16_t num)
Member:

Nit: Inconsistent types. You're using uint32_t for the descriptor indexes in vq.

uint16_t num)
{
struct virtq_desc *desc;
uint16_t i;
Member:

Nit: Inconsistent types. As above.

Collaborator Author:

changed


vq->num_avail++;
while (vq->desc[desc_idx].flags & VIRTQ_DESC_F_NEXT) {
head_buf->completed = 1;
Member:

Not sure what the intent is here. Do we want to set completed for non-head bufs? If not, then this should go away. If yes, then it's wrong.

@ricarkol (Collaborator, Author), Sep 27, 2016:

Removed. The idea was that it looked more consistent to mark all buffers as completed. It doesn't make much difference though, as the non-heads are not read.

* the interrupt handler we can just cast this pointer back into a
* 'struct io_buffer'.
*/
assert((uint64_t) vq->bufs[i].data == (uint64_t) &vq->bufs[i]);
Member:

These casts should be unnecessary.

Collaborator Author:

Casting the right side to a (uint8_t *) now

@mato (Member) commented Sep 27, 2016

Following up on the "monster" review I just submitted: if I understand the code correctly, I'd prefer that vq->last_used and vq->next_avail be defined as free-running, as that simplifies the code in virtio_ring.c:

  1. You should be able to do just vq->next_avail += num in virtq_add_descriptor_chain().
  2. Compare directly against vq->used->idx in the interrupt handler, followed by vq->last_used++ at the end of the loop (see the sketch below).

What do you think?
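A minimal sketch of that suggestion, reusing the field names from this PR (illustrative only; the helper names are hypothetical):

#include <stdint.h>

struct virtq_used_elem {
    uint32_t id;   /* head of the used descriptor chain */
    uint32_t len;
};

struct virtq_used {
    uint16_t flags;
    uint16_t idx;  /* free-running, advanced by the device */
    struct virtq_used_elem ring[];
};

struct virtq {
    uint16_t num;         /* queue size, a power of two */
    uint16_t next_avail;  /* free-running, advanced by the driver */
    uint16_t last_used;   /* free-running, advanced by the driver */
    struct virtq_used *used;
};

/* 1. virtq_add_descriptor_chain() just advances the counter: */
static void advance_avail(struct virtq *vq, uint16_t num)
{
    vq->next_avail += num;  /* natural uint16_t wraparound */
}

/* 2. The interrupt handler compares directly against used->idx and
 * bounds the index only when touching the ring: */
static void drain_used(struct virtq *vq)
{
    while (vq->last_used != vq->used->idx) {
        struct virtq_used_elem *e =
            &vq->used->ring[vq->last_used % vq->num];
        (void)e;  /* ... hand the completed chain (e->id, e->len) back ... */
        vq->last_used++;
    }
}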

@mato (Member) commented Sep 27, 2016

One more thing: any particular reason why we're now using a chain of 3 buffers per block write request instead of just a single buffer? The spec seems to suggest the latter. I don't particularly care either way since our writes are synchronous, just curious.

@ricarkol (Collaborator, Author) commented Sep 27, 2016

Thanks for the review @mato

The usage of last_used and next_avail is now consistent, and I decided to make them wrap at vq->num. The only places where %num is used are when they are being read. The reason is that if they were left free-running, every read would have to %num, and there are too many of those reads in virtio_net, virtio_blk, and virtio_ring (I'm scared of forgetting a %num).

Tested locally using test_ping_serve and test_blk.

@mato changed the title from "virtio-net changes for freebsd" to "virtio: Refactoring and cleanup" on Sep 28, 2016
@mato (Member) commented Sep 28, 2016

@ricarkol This LGTM. If you don't have anything else you'd like to do as part of this PR, could you please rebase your branch against master and squash the intermediate commits (as I don't think these are worth keeping). I'll double-check once that's done and merge if everything checks out.

- Moved all the common vring and descriptor handling code from virtio_net and
  virtio_blk into virtio_ring.c.
- Use two descriptors per virtio-net tx
- test_blk: now correctly checking the read buffer instead of the write buffer
@ricarkol force-pushed the virtio-net-changes-for-freebsd branch from 0cf5634 to 6d11a94 on September 28, 2016 at 17:29
@ricarkol (Collaborator, Author):

@mato: rebased

@ricarkol (Collaborator, Author) commented Sep 29, 2016

@mato Made next_avail and last_used free-running (on a uint16_t). Does that help with your freebsd experiments?

@mato (Member) commented Sep 29, 2016

@ricarkol No difference in behaviour with your last change. I've done some printf() debugging:

  1. Removed the "sending reply" message in test_ping_serve (since command line passing seems broken in grub-bhyve).
  2. Print # when entering handle_virtio_net_interrupt().
  3. Print $ when isr_status & VIRTIO_PCI_ISR_HAS_INTR is true.
  4. Print @ when solo5_poll() in test_ping_serve times out.

The result with ping -f 10.0.0.255 on my FreeBSD box:

Serving ping on 10.0.0.2 / 10.0.0.255 / 255.255.255.255
With MAC: 00:a0:98:f5:bc:45 (no ARP!)
@@#$@#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$#$@@@@@@@@@@@@@@(...)

So, what seems to be happening is that we stop receiving virtio_net interrupts altogether. However, interrupts as such are still working fine (since we see the @ being printed once a second).

I've also tried adding an out to VIRTIO_PCI_QUEUE_NOTIFY at the end of virtio_net_recv_pkt_put(), no luck there either.

@mato (Member) commented Sep 29, 2016

I think I've found the cause of the problem. Take a look at this log: http://pastebin.com/Ly4aBUDK. The next_avail and last_used values come from virtio_net_pkt_get().

What I think is happening:

  1. bhyve is placing packets onto the recvq as fast as possible, in batches of more than one packet.
  2. Our interrupt handler is updating last_used as fast as possible.
  3. However, test_ping_serve is processing the buffers at a slower rate than that at which they're being added to the virtq.
  4. Therefore: last_used eventually catches up with next_avail at which point the code becomes confused and thinks there are no more packets.

Not sure how to fix this, needs some thought.

@ricarkol (Collaborator, Author) commented Sep 30, 2016

Hi @mato. You are right, that's the issue. I can reproduce it by adding 100ms sleeps at every iteration in test_ping_serve.

There seem to be two problems on the rx path:

  1. The comparison (next_avail % num) == (last_used % num), used to decide whether to block, can't distinguish a ring in which all descriptors are empty (i.e. just after initialization) from one in which all descriptors are used (i.e. what happens when our test_ping_serve is too slow). The read function should block in the first case and definitely not in the second, as we need to make space fast. The fix is to stop comparing these two indexes with % num. One possible "fix" is to make next_avail start at 0 and compare the indexes without the %'s (diff is here: http://pastebin.com/BqPhe37g); see the sketch after this list. The consequence of this problem is that the reader never makes space for new packets to arrive, so test_ping_serve hangs.
  2. The second problem is that we can't recover after reaching a state in which there are no empty descriptors for the device to place received packets into. After fixing 1 and letting test_ping_serve consume all used descriptors, the device should find all the descriptors empty, but it just does not use them; it almost seems as if the device gives up and never tries again. This receive ring is a queue: if it fills up, is it the virtio behaviour to buffer things elsewhere and retry eventually? I guess the answer is in the qemu virtio devices code.
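A sketch of the ambiguity in 1. and the free-running distinction, assuming both counters count descriptors (illustrative; the helper names are hypothetical):

#include <stdint.h>

/* With bounded indexes, an empty ring and a full ring look identical:
 * (next_avail % num) == (last_used % num) holds in both cases. With
 * free-running uint16_t counters the distinction is the distance
 * between them (valid because num is a power of two <= 2^15): */

static int ring_empty(uint16_t next_avail, uint16_t last_used)
{
    return next_avail == last_used;  /* nothing outstanding */
}

static int ring_full(uint16_t next_avail, uint16_t last_used,
                     uint16_t num)
{
    return (uint16_t)(next_avail - last_used) == num;  /* all in flight */
}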

@mato (Member) commented Oct 3, 2016

@ricarkol With your fix for 1) above I get some of the ping traffic coming through and none of the device-side hangs that you describe. For example:

--- 10.0.0.255 ping statistics ---
58130 packets transmitted, 5056 packets received, 91.3% packet loss
round-trip min/avg/max/stddev = 0.290/63.275/70.207/10.644 ms

Regarding your question in 2): Bhyve will drop packets received from the tap interface if there is no space in the guest receive queue (https://github.com/freebsd/freebsd/blob/master/usr.sbin/bhyve/pci_virtio_net.c#L325). This is normal behaviour for network interfaces when queues fill up.

I've been through both our code and the bhyve-side code yet again and unfortunately can't see anything obviously wrong, so not sure what to try next.

@ricarkol (Collaborator, Author) commented Oct 3, 2016

Tested locally with test_ping_serve, test_blk, and mirage static_website. Tested on GCE with static_website.

@ricarkol merged commit cd8649f into Solo5:master on Oct 3, 2016
@hannesm (Contributor) commented Oct 9, 2016

Tested this virtio-net (with @mato next to me, using test_ping_serve) on a native FreeBSD-12 and bhyve:

  • ping -f leads to packet loss (>96%)
  • ping -f to a FreeBSD-11 guest does not lead to packet loss (after setting sysctl net.inet.icmp.icmplim=0)
  • ping -i 0.001 no packet loss
  • ping -i 0.0009 packet loss (and delayed replies: 300ms instead of 0.x)

As long as we stay above 1 ms between packets, all is good... as soon as we go below that, packets start dropping and getting huge delays.
