New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
virtio: Refactoring and cleanup #98
virtio: Refactoring and cleanup #98
Conversation
will hopefully soon have time to test! thank you so much for digging into this :D |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is not required. See https://github.com/freebsd/freebsd/blob/master/usr.sbin/bhyve/pci_virtio_net.c#L957 which does the right thing when VIRTIO_NET_F_MRG_RXBUF
is not negotiated.
Orthogonal to this: If we do negotiate the feature we must actually support it rather than asserting num_buffers == 1
(L358).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two things:
- The bhyve behaviour of requiring two descriptors on the TX path is a bug. See section 5.1.6.2 of the virtio spec (http://docs.oasis-open.org/virtio/virtio/v1.0/cs04/virtio-v1.0-cs04.html#x1-1680006). I'd rather submit a PR for that which will help other (future) bhyve users than work around it. @hannesm What do you think?
- A general problem with our virtio code is that we're not using the various indexes
..._last_used
,..._next_avail
in a consistent fashion. It's not clear if these are intended to be free-running and bounded only when used as an array/table index or always bounded by the queue size.
desc->len = sizeof(virtio_net_hdr) + len; | ||
desc->addr = (uint64_t)&virtio_net_hdr; | ||
desc->len = sizeof(virtio_net_hdr); | ||
desc->next = xmit_next_avail + 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to handle wraparound here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved all this descriptor handling code to virtio_ring.c, handling the wraparound there more carefully.
desc->flags = VIRTQ_DESC_F_NEXT; | ||
|
||
assert(len <= PKT_BUFFER_LEN); | ||
memcpy(xmit_bufs[xmit_next_avail + 1].data, data, len); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved all this descriptor handling code to virtio_ring.c, handling the wraparound there more carefully.
|
||
assert(len <= PKT_BUFFER_LEN); | ||
memcpy(xmit_bufs[xmit_next_avail + 1].data, data, len); | ||
desc = &(xmitq.desc[xmit_next_avail + 1]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved all this descriptor handling code to virtio_ring.c, handling the wraparound there more carefully.
assert(len <= PKT_BUFFER_LEN); | ||
memcpy(xmit_bufs[xmit_next_avail + 1].data, data, len); | ||
desc = &(xmitq.desc[xmit_next_avail + 1]); | ||
desc->addr = (uint64_t) xmit_bufs[xmit_next_avail + 1].data; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved all this descriptor handling code to virtio_ring.c, handling the wraparound there more carefully.
desc = &(xmitq.desc[xmit_next_avail]); | ||
desc->addr = (uint64_t) xmit_bufs[xmit_next_avail].data; | ||
desc->len = sizeof(virtio_net_hdr) + len; | ||
desc->addr = (uint64_t)&virtio_net_hdr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer that we get rid of the static struct virtio_net_hdr
entirely and just grab an extra buffer out of xmit_bufs[]
, zeroing the first sizeof (struct virtio_net_hdr)
bytes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
if (((xmit_next_avail + 1) % xmitq.num) == | ||
(xmit_last_used % xmitq.num)) { | ||
if (((xmit_next_avail + 2) % xmitq.num) == | ||
((xmit_last_used * 2) % xmitq.num)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be
if (((xmit_next_avail + 2) % xmitq.num) >=
((xmit_last_used % xmitq.num)) {
?
See check_xmit()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not using this check anymore. The new version keeps track of free descriptors explicitly (in virtq.num_avail).
My take is: implement workarounds in solo5 to get it working right now. But please report the issues upstream (via https://bugs.freebsd.org/bugzilla/ or https://lists.freebsd.org/mailman/listinfo/freebsd-virtualization); and once they're fixed upsteam (and released), we can remove the workarounds in solo5. Sorry, still at ICFP in Japan, thus little time and bad internet connectivity |
4e0c590
to
0aff8c9
Compare
Changes since last version:
Tests:
@mato ^ |
fddc0da
to
43ad6c7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ricarkol. In general this version looks good and shares more common code; I like it! The only downside is a slight waste of memory for blkq buffers but I think it's worth the increase in shared code.
I've left inline comments in various places on things which aren't 100% clear or need fixing. Aside from those, I'm still confused on one important point:
Are vq->next_avail
and vq->last_used
intended to be free-running and thus always used in conjunction with % vq->num
when indexing descriptors or are they intended to be bounded and thus always set in conjunction with % vq->num
?
The current code seems to be inconsistent in this, as I've mentioned in some of the inline comments.
if (dbg) | ||
printf("INTR: BLK: desc=0x%p status=%d\n", | ||
desc->addr, *(uint8_t *)desc->addr); | ||
virtq_add_descriptor_chain(&blkq, head, 3); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either assert(... != -1) or pass the error return to the caller (but then the function signature needs to change).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added an assert
|
||
return req; | ||
} | ||
ret = (*(uint8_t *) status_buf) != VIRTIO_BLK_S_OK; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This (and the following code) would be better rewritten in a linear style with the common case preferred, something like:
status = (*(uint8_t *)status_buf);
if (status != VIRTIO_BLK_S_OK)
return -1;
if (type == ...) /* see next comment */
memcpy(...);
return 0;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed
return req; | ||
} | ||
ret = (*(uint8_t *) status_buf) != VIRTIO_BLK_S_OK; | ||
if (type == VIRTIO_BLK_T_OUT && ret == 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be VIRTIO_BLK_T_IN (read)? If so, why didn't your testing with test_blk catch it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It didn't fail because there is a bug in the test (https://github.com/Solo5/solo5/blob/master/tests/test_blk/test_blk.c#L22):
if (sector_write[i] != '0' + i % 10) <== should be sector_read
/* Check failed */
return 1;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed (test and virtio_blk code)
desc->len = sizeof(virtio_net_hdr); | ||
desc->next = xmit_next_avail + 1; | ||
desc->flags = VIRTQ_DESC_F_NEXT; | ||
head = xmitq.next_avail; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be clearer if this code (manipulating xmitq.bufs[]) were written in a similar style to that in virtio_blk_op(), i.e. access via *head_buf and *data_buf.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed
volatile uint8_t status; | ||
volatile uint8_t hw_used; | ||
}; | ||
|
||
#define VIRTQ_BLK_MAX_QUEUE_SIZE 8192 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No longer used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
uint16_t desc_idx; | ||
struct io_buffer *head_buf; | ||
|
||
if ((vq->used->idx % vq->num) == vq->last_used) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is last_used free-running? Code in drivers uses (% num) when touching it, doesn't it need the same here? (Also see general comment coming up)
*/ | ||
int virtq_add_descriptor_chain(struct virtq *vq, | ||
uint16_t head, | ||
uint16_t num) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Inconsistent types. You're using uint32_t for the descriptor indexes in vq.
uint16_t num) | ||
{ | ||
struct virtq_desc *desc; | ||
uint16_t i; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Inconsistent types. As above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed
|
||
vq->num_avail++; | ||
while (vq->desc[desc_idx].flags & VIRTQ_DESC_F_NEXT) { | ||
head_buf->completed = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what's the intent here. Do we want to set completed for non-head bufs? If not, then this should go away. If yes, then it's wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed.The idea was that it looked more consistent to mark all buffers as completed. Doesn't make much different though as the non-heads are not read.
* the interrupt handler we can just cast this pointer back into a | ||
* 'struct io_buffer'. | ||
*/ | ||
assert((uint64_t) vq->bufs[i].data == (uint64_t) &vq->bufs[i]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These casts should be unnecessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Casting the right side to a (uint8_t *) now
Following up on the "monster" review I just submitted, if I understand the code correctly then I'd prefer that
What do you think? |
One more thing, any particular reason why we're now using a chain of 3 buffers per block write request instead of just a single buffer? The spec seems to suggest the latter, I don't particularly care either way since our writes are synchronous, just curious. |
Thanks for the review @mato The usage of last_used and next_avail is now consistent and decided to make them wrap at vq->num. The only places where %num is used is when they are being read. The reason is that if they are left as free-running, then every read would have to %num and there are too many of those reads in virtio_net, virtio_blk, and virtio_ring (I'm scared of forgetting a %num). Tested locally using test_ping_serve and test_blk. |
@ricarkol This LGTM. If you don't have anything else you'd like to do as part of this PR, could you please rebase your branch against master and squash the intermediate commits (as I don't think these are worth keeping). I'll double-check once that's done and merge if everything checks out. |
- Moved all the common vring and descriptor handling code from virtio_net and virtio_blk into virtio_ring.c. - Use two descriptors per virtio-net tx - test_blk: now correctly checking the read buffer instead of the write buffer
0cf5634
to
6d11a94
Compare
@mato: rebased |
@mato Made |
@ricarkol No difference in behaviour with your last change. I've done some printf() debugging:
The result with
So, what seems to be happening is that we stop receiving I've also tried adding an out to |
I think I've found the cause of the problem. Take a look at this log: http://pastebin.com/Ly4aBUDK. The What I think is happening:
Not sure how to fix this, needs some thought. |
Hi @mato. You are right, that's the issue. I can reproduce it by adding 100ms sleeps at every iteration in test_ping_serve. There seems to be two problems on the rx path:
|
@ricarkol With your fix for 1) above I get some of the ping traffic coming through and none of the device-side hangs that you describe. For example:
Regarding your question in 2): Bhyve will drop packets received from the tap interface if there is no space in the guest receive queue (https://github.com/freebsd/freebsd/blob/master/usr.sbin/bhyve/pci_virtio_net.c#L325). This is normal behaviour for network interfaces when queues fill up. I've been through both our code and the bhyve-side code yet again and unfortunately can't see anything obviously wrong, so not sure what to try next. |
…nd NOTIFY after put
Tested locally with test_ping_serve, test_blk, and mirage static_website. Tested on GCE with static_website. |
Tested (with @mato next to me) this virtio-net (using test_ping_serve) on a naive FreeBSD-12 and bhyve:
as long as we stay above 10 ms between packets, all is good... as soon as we go below, packets start dropping and getting huge delays |
This PR is for two changes:
num_buffers
field (orvrh_bufs
in freebsd code) even when theVIRTIO_NET_F_MRG_RXBUF
feature is not negotiated. This proposed change uses the extended network header, asserts the host featureVIRTIO_NET_F_MRG_RXBUF
, and negotiates it with the host (always). Here is the freebsd code with the hardcoded struct for reference: https://github.com/freebsd/freebsd/blob/master/usr.sbin/bhyve/pci_virtio_net.c#L120Tested this changes with the static_website unikernel locally and in GCE. Also tested the ping_test_serve unikernel in freebsd (after applying @hannesm change to Makefile.common and following the instructions).