Blacklist kvm and iTCO modules #132

Closed
jeffaco opened this issue Mar 25, 2019 · 41 comments

@jeffaco
Collaborator

jeffaco commented Mar 25, 2019

SAP HANA doesn't need either KVM or iTCO modules.

Can these be blacklisted so they don't load at boot time?

Sollabdsm31:~ # lsmod | grep -i kvm 
kvm                   704512  0
irqbypass              16384  1 kvm
Sollabdsm31:~ # lsmod | grep -i itco
iTCO_wdt               16384  0
iTCO_vendor_support    16384  1 iTCO_wdt
Sollabdsm31:~ # 

Thanks.

@schaefi
Collaborator

schaefi commented Mar 26, 2019

Yes this can be done. So I can change the image and add:

  • /etc/modprobe.d/50-azure-li-blacklist.conf

With the content

blacklist kvm
blacklist iTCO_wdt
blacklist iTCO_vendor_support
  • Is this what you want?
  • For which instance type is this needed: LI, VLI, or both?
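A minimal sketch of how the proposed file could be generated. The function name `write_blacklist` is hypothetical, and the target directory is passed as a parameter so the result can be inspected in a scratch directory before anything is written to /etc/modprobe.d (which would need root).

```shell
#!/bin/sh
# Sketch: write the blacklist file proposed above into a modprobe.d-style
# directory. write_blacklist is an illustrative name, not part of the image
# build; on a real system the directory would be /etc/modprobe.d.
write_blacklist() {
    conf_dir=$1
    mkdir -p "$conf_dir" || return 1
    cat > "$conf_dir/50-azure-li-blacklist.conf" <<'EOF'
blacklist kvm
blacklist iTCO_wdt
blacklist iTCO_vendor_support
EOF
}

# Example: generate the file in a temporary directory and print it.
scratch=$(mktemp -d)
write_blacklist "$scratch"
cat "$scratch/50-azure-li-blacklist.conf"
```

Writing to a scratch directory first keeps the sketch safe to run on any machine.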

@schaefi schaefi changed the title from "Please blacklist KVM and iTCO modules" to "Blacklist kvm and iTCO modules" Mar 26, 2019
@schaefi schaefi added the question (Further information is requested) label Mar 26, 2019
@jeffaco
Collaborator Author

jeffaco commented Mar 26, 2019

We would need this for both LI and VLI. And yeah, blacklisting those modules is what we want.

Thanks!

@RalfKlahr

Hi Jeff, yes, this is also valid for LI and VLI.

@schaefi
Collaborator

schaefi commented Mar 26, 2019

Thanks

@schaefi schaefi removed the question (Further information is requested) label Mar 26, 2019
@schaefi
Collaborator

schaefi commented Mar 27, 2019

Image builds to change:

  • SLES12-SP3-SAP-Azure-LI-BYOS (already released, needs a Bugzilla report)
  • SLES12-SP3-SAP-Azure-VLI-BYOS (already released, needs a Bugzilla report)
  • SLES12-SP4-SAP-Azure-LI-BYOS
  • SLES12-SP4-SAP-Azure-VLI-BYOS
  • SLES15-SAP-Azure-LI-BYOS
  • SLES15-SP1-SAP-Azure-LI-BYOS
  • SLES15-SP1-SAP-Azure-VLI-BYOS

@RalfKlahr

this is a general setting - image independent!

@schaefi
Collaborator

schaefi commented Mar 27, 2019

Same here, that means you need this change for all of them. So we need bugzilla reports for the released ones to handle the request properly

@jeffaco
Collaborator Author

jeffaco commented Mar 27, 2019

We would like the change in all images please.

Entered in Bugzilla, ID 1130713.

@schaefi
Collaborator

schaefi commented Apr 1, 2019

Robert added the change in the SLE15 SP1 Devel images already

@schaefi schaefi added the WIP label Apr 1, 2019
@schaefi schaefi added this to the Azure Testing Next milestone Apr 1, 2019
@schaefi
Collaborator

schaefi commented Apr 1, 2019

Let me welcome @jesusbv to the Azure LI/VLI team. Jesus will take a look at this one to begin with and to become familiar with the project :) Jesus I'm happy to have you on board. As discussed I'll assign this one to you.

@jeffaco
Collaborator Author

jeffaco commented Apr 1, 2019

Welcome aboard, @jesusbv, nice to meet you virtually! I look forward to working with you more closely down the road.

@jesusbv
Collaborator

jesusbv commented Apr 2, 2019

Thank you, @jeffaco, nice to meet you too !

@jesusbv
Collaborator

jesusbv commented Apr 5, 2019

I have updated the rest of the images

@schaefi
Collaborator

schaefi commented Apr 5, 2019

> I have updated the rest of the images

That's a little bit too vague. Please update the checkbox list from #132 (comment). Also if you have done the images marked as "released" please update the bugzilla entry too: https://bugzilla.suse.com/show_bug.cgi?id=1130713

Thanks

@schaefi
Collaborator

schaefi commented Apr 8, 2019

@jesusbv Thanks, looks all great now. I have taken the commits in the Devel projects and applied them to the Stable builds from where we now can provide new testing images. This one completes the milestone. Thanks

@schaefi schaefi closed this as completed Apr 8, 2019
@schaefi schaefi removed the WIP label Apr 8, 2019
@jeffaco
Collaborator Author

jeffaco commented Apr 10, 2019

I'm using the latest SLES 12 SP3 test image, SLES12-SP3-SAP-Azure-LI-BYOS.x86_64-1.0.53-Build1.2-Rel2.1.raw.

That's odd. This change only made it half way (well, 50.1% of the way). 😄

First, by looking at the loaded modules:

azurehost1:~ # lsmod | grep -i kvm                           
kvm                   606208  0 
irqbypass              16384  1 kvm
azurehost1:~ # lsmod | grep -i itco
azurehost1:~ # 

So the itco modules are gone, but kvm is still loaded.

Looking at the contents of file /etc/modprobe.d/50-azure-li-blacklist.conf:

blacklist edac_core
blacklist kvm
blacklist iTCO_wdt
blacklist iTCO_vendor_support
blacklist sb_edac

So kvm is mentioned there, but it's still loaded. How come?

@jeffaco jeffaco reopened this Apr 10, 2019
@schaefi
Collaborator

schaefi commented Apr 10, 2019

Hmm, even if a module is blacklisted it can still be loaded explicitly. Something loads the module manually via modprobe. It's going to be hard to find out where this happens. Can you check dmesg and /var/log/messages for any pointer to the loading of kvm?

@jeffaco
Collaborator Author

jeffaco commented Apr 10, 2019

Sigh - I don't like the way this is shaping up so far! 😦

azurehost1:~ # dmesg | grep kvm
[  115.201221] kvm: disabled by bios
azurehost1:~ # grep kvm /var/log/messages
Apr  9 23:39:21 linux kernel: [  115.201221] kvm: disabled by bios
Apr  9 23:39:11 linux kernel: kvm: disabled by bios
azurehost1:~ # 

Any other ideas of where to look?

@schaefi
Collaborator

schaefi commented Apr 10, 2019

Hmm, can you safely unload the module?

rmmod kvm

I'm asking to make sure it's not in use

Next, can you check whether it's loaded by the initrd? You need to reboot the system and add the following to the kernel command line:

rd.driver.blacklist=kvm

Does that make any difference ?
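After the reboot it helps to confirm that the option actually reached the kernel. The following is a small sketch; the helper name `cmdline_blacklists` is hypothetical. It takes the command line as a string, so it can be tried on a sample before pointing it at /proc/cmdline, and it assumes the comma-separated list form of rd.driver.blacklist described in the dracut documentation.

```shell
#!/bin/sh
# Sketch: succeed (exit status 0) if the given kernel command line
# blacklists the module through rd.driver.blacklist=.
# cmdline_blacklists is an illustrative helper name.
#   $1: module name   $2: kernel command line
cmdline_blacklists() {
    printf '%s\n' "$2" | tr ' ' '\n' |
        sed -n 's/^rd\.driver\.blacklist=//p' |   # keep only the option values
        tr ',' '\n' |                             # one module name per line
        grep -qxF -- "$1"
}

# Example with a sample command line; on a live system one would pass
# "$(cat /proc/cmdline)" as the second argument instead.
sample='splash=verbose quiet rd.driver.blacklist=kvm'
if cmdline_blacklists kvm "$sample"; then
    echo "kvm is blacklisted for the initrd"
fi
```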

@jeffaco
Collaborator Author

jeffaco commented Apr 10, 2019

Oh, yuck. I hate rebooting, it's a pain and takes forever! But ask and I'll do it begrudgingly.

First:

azurehost1:~ # rmmod kvm
azurehost1:~ # lsmod | grep -i kvm
azurehost1:~ # 

So it's not in use.

After the reboot:

azurehost1:~ # lsmod | grep -i kvm
kvm                   606208  0 
irqbypass              16384  1 kvm
azurehost1:~ # 

And the boot command line:

azurehost1:~ # cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-4.4.176-94.88-default root=UUID=610a28ec-2a24-462a-8b4b-35abb9efc8c2 root=/dev/mapper/3600a09803830362f6e2b48516b525757_part1 disk=/dev/mapper/3600a09803830362f6e2b48516b525757 resume=/dev/mapper/3600a09803830362f6e2b48516b525757_part2 splash=verbose net.ifnames=1 mce=ignore_ce nomodeset numa_balancing=disable transparent_hugepage=never intel_idle.max_cstate=1 processor.max_cstate=1 crashkernel=160M,high crashkernel=80M,low quiet rd.driver.blacklist=kvm
azurehost1:~ # 

Let me know the next steps, thanks.

@schaefi
Collaborator

schaefi commented Apr 11, 2019

> But ask and I'll do it begrudgingly.

Thanks much for taking the burden to reboot :) I hate it too

So there is good news and bad news. The good news is that it's not loaded as part of the initrd; fixing that would have been more work and would have touched sensitive code. The bad news is that I have no clue what loads kvm on your system. It must be loaded manually by some script or program via modprobe kvm. Because it's blacklisted, this is the only way to load it; any event that would normally trigger loading the module is not handled because of the blacklisting.
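To make the distinction concrete: per modprobe.d(5), a `blacklist` entry only makes modprobe ignore the module's internal aliases, so a request by name (or as a dependency of another module) still loads it, whereas an `install` entry replaces the load command itself. The sketch below classifies how a configuration directory treats a module. The function name `module_policy` is hypothetical, and the parsing is deliberately simple (it ignores continuation lines).

```shell
#!/bin/sh
# Sketch: classify how a modprobe.d-style directory treats a module.
#   install   - an "install <module> ..." override exists
#   blacklist - only "blacklist <module>" is present (alias-based autoload
#               is suppressed, loading by name or as a dependency is not)
#   none      - the module is not mentioned
# module_policy is an illustrative name, not an existing tool.
module_policy() {
    module=$1
    conf_dir=$2
    if cat "$conf_dir"/*.conf 2>/dev/null |
        grep -Eq "^[[:space:]]*install[[:space:]]+$module([[:space:]]|\$)"; then
        echo install
    elif cat "$conf_dir"/*.conf 2>/dev/null |
        grep -Eq "^[[:space:]]*blacklist[[:space:]]+$module([[:space:]]|\$)"; then
        echo blacklist
    else
        echo none
    fi
}

# Example against a scratch copy of the blacklist entries from this issue.
scratch=$(mktemp -d)
printf 'blacklist kvm\nblacklist iTCO_wdt\nblacklist iTCO_vendor_support\n' \
    > "$scratch/50-azure-li-blacklist.conf"
module_policy kvm "$scratch"
```

On a real system the second argument would be /etc/modprobe.d.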

kvm itself is disabled by your BIOS, which means the CPU-specific parts of kvm, e.g. kvm_intel, do not exist and are not loaded. Thus the remaining interface part should not hurt in any way.

This is a lame excuse, I know, but for now I can only guess what loads kvm on the machine. Interestingly enough, when I boot the image in my integration system (a virtual machine), none of the modules are loaded.

What you can try:

rpm -qa | grep qemu

rpm -e <all-qemu-listed-packages-from-the-above-call>

reboot # I'm sorry

@schaefi
Collaborator

schaefi commented Apr 11, 2019

By default there shouldn't be any qemu packages installed. At least the image does not provide any but I don't know if you install stuff via the config file. So qemu was known to manually load the kvm module. All this is just a guess. As we tested the plain image through a VM test and no kvm module was loaded we sent it out for testing. It's really weird that you see the module loaded on your machine, especially because kvm is disabled in the BIOS, that makes loading it completely useless.

Can you confirm that none of the components you add through the YAML config (scripts, packages, ...) do anything with module loading?

@jeffaco
Collaborator Author

jeffaco commented Apr 11, 2019

Well, interesting:

azurehost1:~ # rpm -qa | grep qemu
azurehost1:~ # 

So that's not it. No need to reboot for that, so there's that.

I don't really install much for testing. I'm not even installing the UCS drivers automatically, although I'll need to do that. There is a test package that I install, just to verify that installation works, and that's still in place. Let me remove that, redeploy, and see where we stand with regards to the kvm module. I highly doubt that does a thing with the kvm module, but you never know.

The kvm module is loaded dynamically on every boot, yes? So if removal of my test package doesn't help, could we just physically remove the kvm module from disk and reboot, just to see what errors occur? I don't expect things to be healthy, I'm just hoping to see something in /var/log/messages or dmesg that might indicate the source of the load.

Anyway, I'll modify my setup to not install squat, and then see where I stand. I'll report back shortly (after a reboot, of course, for the deployment).

@jeffaco
Collaborator Author

jeffaco commented Apr 11, 2019

Okay, the entire section to install software is commented out in my YAML file. And, unfortunately:

azurehost1:~ # lsmod | grep -i kvm
kvm                   606208  0 
irqbypass              16384  1 kvm
azurehost1:~ # 

What can we do to try and isolate who is loading kvm, and why?

@rjschwei
Contributor

Let's dig a little deeper:

cat /proc/modules | grep kvm

Since VTx is disabled in the BIOS you should not see kvm_intel but if that's there then there is probably an issue with the firmware or the kernel-firmware package.

modinfo kvm

And yes you can remove

mv /lib/modules/$KERNEL_VERSION/kernel/arch/x86/kvm/kvm.ko /root

Where $KERNEL_VERSION needs to be replaced with the version of the kernel you are running. After reboot we should see some errors in the boot log.
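The running kernel version can be taken from `uname -r` rather than typed by hand. A small sketch follows; the helper name `kvm_module_path` is hypothetical, and it assumes the module is stored uncompressed as kvm.ko, as in the path above.

```shell
#!/bin/sh
# Sketch: build the path to kvm.ko for a kernel release, defaulting to the
# running kernel as reported by `uname -r`.
# kvm_module_path is an illustrative name. Some systems ship compressed
# modules (for example kvm.ko.xz); this sketch assumes a plain kvm.ko.
kvm_module_path() {
    release=${1:-$(uname -r)}
    printf '/lib/modules/%s/kernel/arch/x86/kvm/kvm.ko\n' "$release"
}

# Example: print the path for a specific kernel release.
kvm_module_path 4.4.176-94.88-default

# To actually move the module aside for the running kernel (as root):
#   mv "$(kvm_module_path)" /root
```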

@jesusbv
Collaborator

jesusbv commented Apr 12, 2019

The reason why kvm is loaded is probably because it is used by irqbypass.

lsmod | grep -i kvm output shows that

The output of modinfo kvm is

$ modinfo kvm
...
depends: irqbypass
...

Thus, blacklisting irqbypass should fix the loading of kvm despite being blacklisted.
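One way to double-check the direction of the dependency is to read the last column of lsmod, which lists the modules that hold a reference on the module named in the first column. In the output earlier in this thread the irqbypass line names kvm there, which, together with `depends: irqbypass` in `modinfo kvm`, suggests that kvm pulls in irqbypass rather than the other way round. A minimal sketch (the helper name `module_users` is hypothetical; it reads lsmod-style text on stdin):

```shell
#!/bin/sh
# Sketch: print the "Used by" list for one module from lsmod-style input.
# Columns are: name, size, reference count, comma-separated users.
# module_users is an illustrative name.
module_users() {
    awk -v m="$1" '$1 == m { print $4 }'
}

# Example with the two lines reported earlier in this issue; on a live
# system one would run: lsmod | module_users irqbypass
printf 'kvm 606208 0\nirqbypass 16384 1 kvm\n' | module_users irqbypass
```

An empty result for a module means nothing else currently depends on it, which matches kvm being removable with rmmod.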

@jeffaco, if you could update /etc/modprobe.d/50-azure-li-blacklist.conf to blacklist irqbypass
and check (after reboot, sorry about that)

Hopefully, that would solve it. However, in case kvm is still loaded, as mentioned before, it means something is manually loading it. If that was the case, there is a solution to instruct modprobe to force the module to fail loading. This is adding install kvm /bin/false in /etc/modprobe.d/50-azure-li-blacklist.conf
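A sketch of adding such an override in a repeatable way. The helper name `disable_module_load` is hypothetical; it appends the `install` line only if that exact line is not already present, and it takes the file path as a parameter so it can be tried on a scratch file first.

```shell
#!/bin/sh
# Sketch: append "install <module> /bin/false" to a modprobe.d file unless
# the exact line already exists. With this entry modprobe runs /bin/false
# instead of inserting the module, so the load attempt fails.
# disable_module_load is an illustrative name.
disable_module_load() {
    module=$1
    conf_file=$2
    line="install $module /bin/false"
    if ! grep -qxF -- "$line" "$conf_file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$conf_file"
    fi
}

# Example on a scratch file; the real target would be
# /etc/modprobe.d/50-azure-li-blacklist.conf (requires root).
scratch=$(mktemp)
printf 'blacklist kvm\n' > "$scratch"
disable_module_load kvm "$scratch"
disable_module_load kvm "$scratch"   # second call changes nothing
cat "$scratch"
```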

@jeffaco
Collaborator Author

jeffaco commented Apr 13, 2019

Okay, I first tried as @jesusbv helpfully suggested, but no go.

I did the following for @jesusbv:

  • Modified file /etc/modprobe.d/50-azure-li-blacklist.conf as follows:
blacklist edac_core
blacklist kvm
blacklist irqbypass
blacklist iTCO_wdt
blacklist iTCO_vendor_support
blacklist sb_edac
  • Rebooted the system
  • After reboot completed: lsmod | grep kvm:
kvm                   606208  0 
irqbypass              16384  1 kvm

So I moved on to what @rjschwei asked for:

azurehost1:~ # ls /lib/modules/
4.4.162-94.72-default  4.4.176-94.88-default  4.4.73-5-default
azurehost1:~ # uname -a        
Linux azurehost1 4.4.176-94.88-default #1 SMP Thu Mar 21 10:52:54 UTC 2019 (dea44ca) x86_64 x86_64 x86_64 GNU/Linux
azurehost1:~ # mv /lib/modules/4.4.176-94.88-default/kernel/arch/x86/kvm/kvm.ko /root
azurehost1:~ # 

After reboot:

azurehost1:~ # grep kvm /var/log/boot.log
azurehost1:~ # 

However, dmesg | grep kvm had a lot to say. Output is here.

I was unsure if you wanted the entire output from dmesg. In case you did, that's here.

It's interesting that, in the full dmesg output, the "kvm" errors come up immediately after one of the NICs was found/renamed. But the NIC driver can't depend on it, or I wouldn't be able to rmmod kvm, right?

@jeffaco
Collaborator Author

jeffaco commented Apr 13, 2019

Whoops, I'm sorry. I totally missed the last part of what @jesusbv said:

> Hopefully, that would solve it. However, in case kvm is still loaded, as mentioned before, it means something is manually loading it. If that was the case, there is a solution to instruct modprobe to force the module to fail loading. This is adding install kvm /bin/false in /etc/modprobe.d/50-azure-li-blacklist.conf

So sorry. So here's the current /etc/modprobe.d/50-azure-li-blacklist.conf:

blacklist edac_core
blacklist kvm
blacklist iTCO_wdt
blacklist iTCO_vendor_support
blacklist sb_edac

install kvm /bin/false

After a reboot:

azurehost1:~ # dmesg | grep kvm
azurehost1:~ # lsmod | grep kvm
azurehost1:~ # lsmod | grep irq      
irqbypass              16384  0 
azurehost1:~ # 

Anything I should check in terms of health/stability of the system, in particular? Or do you want to make that change to the /etc/modprobe.d/50-azure-li-blacklist.conf file for the next test image?

@schaefi
Collaborator

schaefi commented Apr 14, 2019

@jeffaco thanks much for testing. The final setup looks good to me. I'm wondering if we still need to blacklist edac_core and sb_edac. Please note the EDAC project covers modules for "Error Detection and Correction". I don't want to have them blacklisted. For the purpose of not loading kvm this should not be needed. Can you do a test with:

blacklist kvm
blacklist iTCO_wdt
blacklist iTCO_vendor_support

install kvm /bin/false

If that leads to the expected results we should still have a closer look on dmesg.

To be honest I'm not sure if that blacklisting is a good idea. What exactly is your motivation on that ?

As you saw from the modinfo, a module is only loaded if the hardware matches a certain hardware
identification like a PCI id or because the module is required by some other module to perform its job.

Blacklisting modules only makes sense if there are conflicts. For example, VirtualBox had a conflict with kvm in the past, which made VirtualBox non-functional if kvm was loaded.
This would be an example where blacklisting makes sense.

However in your case the reason for all this blacklisting is unknown to me

> SAP HANA doesn't need either KVM or iTCO modules

This, to be honest, doesn't sound like a reason why we should blacklist them. The loading of the module even if not needed takes away a few bytes of your main memory. Compared to the huge amount of main memory you have available I would say it doesn't matter at all :)

  • Does having them loaded cause any harm to the system? I don't think so.
  • Does having them loaded invalidate any SAP certification? I don't know.
  • Does the blacklist cause any trouble or after-effects while the system runs? We don't know.

From my understanding we should be more careful with changes and perform them for a good reason

Hope that makes sense to you too ?

Thanks

@jeffaco
Collaborator Author

jeffaco commented Apr 15, 2019

@schaefi Please see #41 on disabling EDAC. Both our hardware vendors (for LI and VLI) dictate that EDAC must be disabled; they are both quite clear on this. I never got a completely straight answer when I argued that EDAC should be detection only and shouldn't modify system behavior (other than reporting). When I asked, I was told that use of the EDAC module can cause timing problems.

As for KVM and iTCO modules, I asked @RalfKlahr to comment on this. Clearly, the HANA systems should never use KVM, so disabling it shouldn't be harmful.

@RalfKlahr Was your ask for disabling KVM and iTCO a preference thing? Or does SAP recommend this? Have we experienced problems without those modules being disabled?

@jeffaco
Collaborator Author

jeffaco commented Apr 15, 2019

Ah, crud. @RalfKlahr is OOF until April 28th. We may not get a very prompt response on this ...

Can we proceed with disabling the modules for now, at least for test purposes? Thanks ...

By the way, can #135 be picked up for milestone Azure Testing Next, by chance, so we can have that in our next production image? I know that that won't be our way for a while, but it would be nice if it included all our requests to date, thanks.

@rjschwei
Contributor

@jeffaco thanks for collecting the logs. Based on the "grep log" it is clear that kvm_intel is getting loaded as that module complains about missing symbols that the kvm module provides. I suspect that "kvm_intel" gets loaded because we are on Intel HW. Then it is detected that VTx extensions are disabled and the "kvm_intel" module gets dumped. However the dependent module "kvm" does not get unloaded.

I'll see if I can confirm this theory.

@schaefi
Collaborator

schaefi commented Apr 15, 2019

Thanks @jeffaco for the details on edac

@rjschwei
> I'll see if I can confirm this theory.

Your theory is surely correct. This also means we have to blacklist the CPU-specific kvm module.
This results in the following blacklist file:

blacklist edac_core
blacklist sb_edac
blacklist kvm
blacklist kvm_intel
blacklist iTCO_wdt
blacklist iTCO_vendor_support

install kvm /bin/false
install kvm_intel /bin/false

@jeffaco
> Can we proceed with disabling the modules for now, at least for test purposes?

From my perspective yes. Could you confirm the above blacklist setup would work flawlessly for you ?
If so we can provide an updated testing image quickly after your feedback

@schaefi
Collaborator

schaefi commented Apr 15, 2019

> By the way, can #135 be picked up for milestone Azure Testing Next,

I'd like to get clarity for success on the prioritized topics of the current milestone first. This includes

  • The final setup for this blacklist
  • confirmation that the SBD device setup is working
  • confirmation on the SAP basenet setup to be correct
  • confirmation of the file-count setup to be correct

We will create a new testing image as soon as this blacklisting topic has been resolved. I expect that new image to address the milestone issues and that makes it a production candidate.

Once this is done we jump on the other open issues.

The rushing game didn't work well in the past, and we received concerns because of stupid mistakes that happened on our side. In the end it took longer than it should have. We will avoid that in the future, which however makes the process a bit stricter.

Please let us first nail down the issues currently worked on and then jump on the next ones.
The nature of this project requires a well-functioning feedback loop because we can't directly access the target hardware and can therefore only guarantee good quality once we have your feedback. This process works very well, and you have seen that the SBD device support and the module blacklist issue were both re-opened after you had the chance to test them for real. So please let us not put more issues on the current milestone; it will not scale.

Thanks

@rjschwei
Contributor

We should be able to get away without setting the modules to point to /bin/false.

@jeffaco could you please also test with:

blacklist edac_core
blacklist sb_edac
blacklist kvm
blacklist kvm_intel
blacklist iTCO_wdt
blacklist iTCO_vendor_support

Sorry for the seemingly endless reboot testing, but module loading is more art than science.

@schaefi
Collaborator

schaefi commented Apr 15, 2019

With kvm_intel explicitly blacklisted this might work, but it could still end with kvm being loaded... let's see

@jesusbv
Collaborator

jesusbv commented Apr 15, 2019

Hopefully it is kvm_intel loading kvm, but yes, it would be better if we found the module responsible for that.

@jeffaco
Collaborator Author

jeffaco commented Apr 15, 2019

@schaefi I'm not proposing going back to "rushing", that was clearly ill-conceived all around. I totally appreciated your efforts to be responsive, but it obviously just wasn't working regardless of best intentions. My proposal here is to pick up #135 into Azure Testing Next, following standard procedures. This could mean that the production image might be delayed, but I do this in the (perhaps false) hope that this will be the last set of changes for a while, and then we can start focusing on VLI.

I concede that this hope may be false, but I think all the eyes that needed to be on the image have been on the image, so unless something significant was missed, we should (hopefully) be "good" for a while at least.

In retrospect, perhaps we should get used to regular updates to the image anyway, in which case it wouldn't matter. I guess I'm saying: It would be easier if #135 could be picked up but if, for whatever reason, that proves difficult, I can live with that.

@rjschwei I've applied the set of changes to the /etc/modprobe.d/50-azure-li-blacklist.conf file and am joyfully waiting for the absurdly long POST test and subsequent reboot to see how it goes. I'll report back when I have more information. I'm OOF today so it'll take a bit for me to get feedback.

@jeffaco
Collaborator Author

jeffaco commented Apr 15, 2019

Okay, it came back just in time (before I needed to leave). This appears to be good so far:

azurehost1:~ # cat /etc/modprobe.d/50-azure-li-blacklist.conf
blacklist edac_core
blacklist sb_edac
blacklist kvm
blacklist kvm_intel
blacklist iTCO_wdt
blacklist iTCO_vendor_support
azurehost1:~ # 
azurehost1:~ # lsmod | grep -i kvm
azurehost1:~ # lsmod | grep -i itco
azurehost1:~ # 
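The same check can be scripted so that it covers every module in the blacklist in one pass. This is a sketch under the assumption that comparing the first lsmod column against the module names is sufficient; the helper name `loaded_from_list` is hypothetical.

```shell
#!/bin/sh
# Sketch: read lsmod-style text on stdin and print every module from the
# argument list that is still loaded. No output means none are loaded.
# loaded_from_list is an illustrative name.
loaded_from_list() {
    awk -v wanted="$*" '
        BEGIN {
            n = split(wanted, names, " ")
            for (i = 1; i <= n; i++) want[names[i]] = 1
        }
        ($1 in want) { print $1 }
    '
}

# Example with lsmod output reported earlier in this issue; on a live
# system one would run:
#   lsmod | loaded_from_list kvm kvm_intel iTCO_wdt iTCO_vendor_support
printf 'kvm 606208 0\nirqbypass 16384 1 kvm\n' |
    loaded_from_list kvm kvm_intel iTCO_wdt iTCO_vendor_support
```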

Let me know the next steps, thanks. Note that I never got a good test of #119 because of the network dependency problem, so I definitely need a new test image before moving forward. Thanks.

@schaefi
Collaborator

schaefi commented Apr 15, 2019

ok, thanks for the feedback. I will adapt the images and upload a new testing image for you. This will be my last action before my vacation starts :)

Stay tuned

@schaefi
Collaborator

schaefi commented Apr 15, 2019

All images updated and building. Expect an e-mail with the SAS URL in the next two hours.

@schaefi schaefi closed this as completed Apr 15, 2019

5 participants