Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Perform CR4 and CR0 pinning from hypervisors #19

Open
kees opened this issue Nov 15, 2019 · 27 comments
Open

Perform CR4 and CR0 pinning from hypervisors #19

kees opened this issue Nov 15, 2019 · 27 comments
Labels
[ARCH] x86_64 Needed on the 64-bit x86 architecture (ARCH=x86)

Comments

@kees
Copy link

kees commented Nov 15, 2019

Similar to the recent CR4 and CR0 pinning done in the native_write_cr*() functions, it would be even better to have KVM (and other hypervisors) pin those bits for their guest kernels too.

@kees kees added good first issue Good for newcomers [ARCH] x86_64 Needed on the 64-bit x86 architecture (ARCH=x86) labels Nov 15, 2019
@pdxjohnny
Copy link

pdxjohnny commented Nov 19, 2019

Hi. I've been working on this here: torvalds/linux@master...pdxjohnny:hc_harden almost ready to send out for review.

@kees
Copy link
Author

kees commented Nov 19, 2019

Ah! Very cool. I wonder, though, if it would be better to make this entire a host based defense? i.e. do not involve the guest at all and just detect when special bits are set and keep them set? Or maybe, have a default set of bits and let the guest add more? (I'd like this defense to work "out of the box" for all guests without having the guest need to know to make a hypercall to gain the protection.)

@pdxjohnny
Copy link

Moving discussion here (CC: @redgecombe @kaccardi)

With regards to the Container as VM usecase.
I noticed that ChromeOS does Linux support in a VM with a hardened
config. Is this something that would be enabled there if implemented?
We're taking this through review internally and was told an "ack from
those [chromeos] folks would go a long way"

Ah! Very cool. I wonder, though, if it would be better to make this entire a host based defense? i.e. do not involve the guest at all and just detect when special bits are set and keep them set? Or maybe, have a default set of bits and let the guest add more? (I'd like this defence to work "out of the box" for all guests without having the guest need to know to make a hypercall to gain the protection.)

I like this idea. I'm not sure how we could convince upstream to take that though, so far when we'd thought about that we'd come to the conclusion that we don't know what other kernels people might be running under KVM that would do whacky things like flip those bits intentionally. In which case the KVM people might not be so keen on us breaking functionality for those.

But minimally I'd guess that patch would pretty much be:

int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
{
     unsigned long old_cr4 = kvm_read_cr4(vcpu);
     unsigned long pdptr_bits = X86_CR4_PGE | X86_CR4_PSE | X86_CR4_PAE |
                                X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE;

     if (kvm_valid_cr4(vcpu, cr4))
             return 1;

     if (~cr4 & old_cr4 & (X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_UMIP))
             return 1;

    /* ... */

We could make some sort of IOCTL to have QEMU, etc. request that we run in a once-turned-on-never-turn-off mode. The downside to this is that now we have to go touch QMEU, firecracker, etc. Whereas the hypercall approach is pretty self contained within the kernel (aside from the one line QEMU cpuid bit patch). I'm not sure what the most upstreamably friendly approach his here though.

As it stands right now the patch is a hypercall that allows for pinning SMEP, SMAP, UMIP, and WP. The same bits you had already pinned. The initial approach had been to allow pinning of arbitrary bits. That was met with skepticism over the TS and MP bits in CR0. So I think it might be easiest to upstream if we first allow for only the pinning of the bits we already pin in software and then we could add the ability to pin others if we care to do that later in a different patch.

I'm also meeting with skepticism around KEXEC. I think I'll try to debug what's up with the early boot code clearing those bits to see if we can make this an always on option. If it works with KEXEC, then we probably don't need a CONFIG_ for it, since we can argue that it's functionally equivalent to existing pinning and therefore should be on by default.

@pdxjohnny
Copy link

image

Kexec is working!

@pdxjohnny
Copy link

pdxjohnny commented Nov 27, 2019

Well shit, I spoke too soon. The patch I have to the early boot code works in KVM, but not on my physical host... If I can't get it working by mid next week I'll just keep the Kconfig option and make it !KEXEC for the public RFC.

CR0.WP and CR4.SMEP work, still trying to narrow down what's causing it not to boot when on the physical host (never makes it to first console prints to screen, haven't hooked up serial).

Setting CR4.UMIP or CR4.SMAP causes it not to boot. I think this is because of EFLAGS.AC. Either way it's wednesday and so I'm giving up on Kexec for now :/

Also, I finally ran across the code that's really supposed to be enabling SMAP / UMIP and I see that surprise surprise there is a well thought out existing process for enabling them. It was also pointed out that if we kexec to a new kernel the new kernel might be an older kernel that doesn't know it shouldn't turn off those bits. In which case it would blow up. So the boot time disablable kexec is probably the safest and most usable option long term.

@pdxjohnny
Copy link

Just an update. Almost done addressing all RFCv1 comments to send out RFCv2. One of the comments was that we need to ensure SMIs can't modify pinned bits. Which lead to the discovery that AMD and Intel SDMs specify that the CR* fields within SMRAM are readonly. So we shouldn't be able to modify them in the first place. Should be done with the patch to make sure they are readonly soon. It'll be the new first patch in the pv cr pinning patchset.

@kees kees removed the good first issue Good for newcomers label Feb 25, 2020
@pdxjohnny
Copy link

Update, PATCH sent: https://lkml.org/lkml/2020/6/17/921

@pdxjohnny
Copy link

pdxjohnny commented Jan 13, 2021

Just an update on this. Intel has a review process that I'm gated by right now. I have the next version ready but haven't been able to get a review. https://github.com/pdxjohnny/linux/tree/patch_v2_internal_v6

@pdxjohnny
Copy link

pdxjohnny commented Dec 10, 2021

If someone wants to take this and put a co-authored by on it that works for me. Otherwise I probably need another probably 6 months to get this rebased and out the door, provided there are no major issues with the rebase. I've been working on getting Intel proper CI/CD so that the review process can scale.

@bwendling
Copy link
Collaborator

I can give a shot at this.

@pdxjohnny
Copy link

@gwelymernans Want to have a meeting sometime I can go over the patchset as it was while ago and the test cases?

@bwendling
Copy link
Collaborator

@pdxjohnny That's a good idea. :-) Are you available Wednesday? My schedule's fairly open then or later in the week.

@l0kod
Copy link

l0kod commented Mar 14, 2022

I want to help too!

@pdxjohnny
Copy link

pdxjohnny commented Mar 14, 2022

Meeting invite sent for 10-11 AM PDT. Please propose a new time if that doesn't work for either of you. I am free most afternoons. I'll try to have instructions on how to run the tests. If either of you want a tar of the filesystem on the SUT I had I can tar it up, but I'd need you to give me a network location to dump it to. It's an at this point old fedora image with the git repos and userspace dependencies installed.

Meeting Link: https://meet.google.com/gys-epwh-avc

If you both could please read over coverletter and the patchset beforehand that would be great too, as it's dense at some points and we'll get the most out of our meeting it you've at least skimmed it.

@l0kod
Copy link

l0kod commented Mar 14, 2022

Great! I would prefer to start 2 or 1 hour earlier if it works for you (I'm UTC+1).

@pdxjohnny
Copy link

pdxjohnny commented Mar 14, 2022

I can start as early as 8 AM PDT

@pdxjohnny
Copy link

Or we can do before. 6-7 AM PDT, but I have a hard conflict 7-8 AM PDT every day. I originally made it two hours but if you both read that stuff beforehand we should be able to do it in an hour.

@pdxjohnny
Copy link

pdxjohnny commented Mar 14, 2022

Also, please make sure you have a machine ready for this. I'm pretty sure anything within the past 10 years will work so long as it has virtualization. Ideally one of you would have an Intel and one of you would have an AMD then we could validate the AMD support and get you both spun up all at once. I think if you have a machine remotely accessible via AMT or BMC you could do it too, you just need enough to help you re-image the OS and interact with the BIOS menu on the host.

Proposed agenda:

  • Introductions
  • Q&A
  • Get existing code at old point in git history running

@l0kod
Copy link

l0kod commented Mar 14, 2022

Correction: 10am PDT will be perfect for me Wednesday.

@rpedgeco
Copy link

Something related came up recently with the kernel IBT (Indirect Branch Tracking) patchset, which is not upstream but looks to be getting close. Kernel IBT enforcement wants to wants to have its CR4 bit pinned, but it caused problems with kexec. See this thread for some more details:
https://lore.kernel.org/lkml/eed8902f21ba9e5f93562432f6b5920137860a98.camel@intel.com/

The working solution is to unset CR0.WP before CR4.CET in the old kexec-ing kernel, inlined with no pinning check. I was thinking it could cause further problems for KVM pinning, so just wanted to point it out here. Please feel free to ping me if you have questions.

@pdxjohnny
Copy link

pdxjohnny commented Mar 16, 2022

@pdxjohnny
Copy link

pdxjohnny commented Mar 22, 2022

Sorry guys I got dropped from the call and have to join another call

graph TD
  run_tests[Run all the tests]

  subgraph L0[L0 - host]
     kvm_unit_tests
     qemu[QEMU]
     kexec_tools[kexec-tools]
  end

  subgraph L1[L2 - guest]
    subgraph L1_fs[L1 - guest - filesystem]
      guest_qemu[QEMU]
      qemu_kexec_tools[kexec-tools]
    end
  end

  subgraph L2[L2 - guest of the guest]
    L2_fs
  end

  run_tests -->|QEMU = L0_qemu| kvm_unit_tests

  qemu -->|copy| guest_qemu
  kexec_tools-->|copy| qemu_kexec_tools

  guest_qemu -->|virtio9p| L2_fs
  qemu_kexec_tools-->|virtio9p| L2_fs

@l0kod
Copy link

l0kod commented Mar 22, 2022

The conflict that may be the cause of the issue in my initial rebase is in arch/x86/kvm/emulate.c where post_leave_smm() was patched: l0kod@6b49d7b#diff-50a0b0edc05bf5218443999c4f0c82b96115995a68dca707080331ff3190b937

@isanbard
Copy link

FYI, @l0kod, I had to apply this patch to your tree to get it to compile without this feature.

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 86f66d8a7817..44c183bad720 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -168,7 +168,9 @@ static inline void __init kvm_paravirt_cr_pinning_init(void)
 }

 static inline void kvm_setup_paravirt_cr_pinning(unsigned long cr0_pinned_bits,
-                                                unsigned long cr4_pinned_bits)
+                                                unsigned long cr0_pinned_mask,
+                                                unsigned long cr4_pinned_bits,
+                                                unsigned long cr4_pinned_mask)
 {
 }
 #endif

@l0kod
Copy link

l0kod commented Mar 23, 2022

That is indeed an issue but it doesn't come from me. :)
Actually we didn't notice this because we compile with CONFIG_KVM_GUEST, and you should too. ;)

@l0kod
Copy link

l0kod commented May 5, 2023

We are working on a new set of KVM features to harden the kernel, and we took some of the CR pinning ideas but followed a much simpler path (i.e. without VMM/IOCTL support): https://lore.kernel.org/all/20230505152046.6575-6-mic@digikod.net/
See the cover letter for a detailed explanation of Heki: https://lore.kernel.org/all/20230505152046.6575-1-mic@digikod.net/

@pdxjohnny, your patch series could still be useful in the future, let's see how it goes. 😉

Feel free to join the discussion on the mailing list, there is a lot to do and we're looking for contributors!

@pdxjohnny
Copy link

Woohoo!

We had avoided the hypercall route in order to support live migration

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
[ARCH] x86_64 Needed on the 64-bit x86 architecture (ARCH=x86)
Projects
None yet
Development

No branches or pull requests

6 participants