necessary to use old rpi kernel ? #64
Hi @beadon, thanks for your input. Do you know of any open kernel-level security vulnerabilities? The kernel version I've recommended is a rather late release in the 4.19 chain, and I can see that it has one minor open vulnerability. Running an old kernel does not mean you have to run old software packages on top of it, but it does make upgrading packages a little more difficult. However, I want to be sure we're on the same page about any potential vulnerabilities, and I take security very seriously, so if you have any more insight on how the old kernel would become a security problem, I would like to dive into it here so we don't scare off other folks. :) The underlying SPI issue has been reported to the Raspberry Pi team; you can see it here: raspberrypi/linux#3381 Just over a week ago, I did some in-depth testing with 5.15. The results of that test indicate it's not the kernel at fault, because using the low-level C driver produces a sample rate that meets expectations, provided the Pi clock is forced high 24/7. I would love to collaborate on fixing the issue for this project, and I see two paths forward:
Rewriting the SPI sampling in C would be a significant change to the project, whereas identifying and fixing the underlying issue in the current stack would be far less disruptive. |
At the moment, there are no known security vulnerabilities that I am aware of for the Linux kernels in question here. I mention security because it's a broad topic that also covers the concept of "code rot" as various tools, requirements, and code dependencies change. Regarding the investigation into why SPI is not clocking properly, this unfortunately makes a lot of sense: the clock rate of the ARM cores influences the SPI clock rate. I believe the best, fastest, simplest path forward is to force the ARM core(s) to the correct frequency. Either:
To achieve this, we have to get into C-states. ACPI defines these well. Intel has done a great job of implementing this for well over 15 years with the SpeedStep technologies and 'pcm' tools, which allow both changing the CPU core frequencies and monitoring the C-states, but it appears ARM is a little looser in implementation - which is where we will need to dig. Here's an antiquated guide for ARM dating from 2013: https://www.linaro.org/app/resources/Connect%20Events/ACPI_and_Idle_States.pdf On Intel CPUs, the Linux kernel reports the CPU frequency right in /proc/cpuinfo; you can 'cat' this to see which frequency it is (know that it's populated by a daemon). Anyway, it appears that the latest and greatest tool is cpufreq-info, which gathers lots of very useful information from across the system and displays it in one go. Install it, then run it:
Notably, the amount of time each of my cores has spent in each frequency is interesting - to save power, the system is correctly choosing the lowest available frequency. So, the next step here is exactly what this tool hints to us: the CPU governor. This is a set of rules that code/profiles follow to set the frequencies of the CPU cores. The downside is that it takes a few clock cycles to get to the right frequency, so for low-latency applications, there is a need to disable the governor. So, let's do that. Here's some good reading material on what you're stepping into: https://www.kernel.org/doc/Documentation/cpu-freq/governors.txt To set the frequency so there is no variance, we need to set the governor to "performance", and set the max and min frequencies the same. First, let's check into where the governor stores its secrets:
So, as root, cat any of these "files", or echo > into them to set the parameters. Here are a few handy aliases for the shell:
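As a cross-check on those sysfs files, the same per-core values can be read from Python with the stdlib; a minimal sketch, assuming the standard Linux cpufreq sysfs layout (on systems without cpufreq, the glob simply comes back empty):

```python
import glob

def scaling_files(name):
    # cpufreq sysfs nodes, one per core, on a standard Linux layout
    return sorted(glob.glob(f"/sys/devices/system/cpu/cpu[0-9]*/cpufreq/{name}"))

def read_per_core(name):
    # Map each core's sysfs file to its current value
    # (e.g. name="scaling_governor" or "scaling_cur_freq")
    values = {}
    for path in scaling_files(name):
        with open(path) as f:
            values[path] = f.read().strip()
    return values

if __name__ == "__main__":
    for path, value in read_per_core("scaling_governor").items():
        print(path, "->", value)
```

Writing to these files (e.g. echoing "performance" into scaling_governor) still requires root, exactly as with the shell aliases above.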
In use:
Ok, so this is all well and good, but we DO need to check that max and min frequencies are the same:
And they are set now! Beautiful. So, let's run our SPI test now... and I'll drop some results into the next comment - |
Ok, success! Old - not setting the core speed to be static, I was seeing poor results, as you did. New - set the core speeds to be static:
Is this the proper rate now? Since it appears that timing is super critical here to measure things properly, I would also advise using a deadline scheduler selected by kernel parameters to get very nearly realtime kernel behavior. Documentation: https://docs.kernel.org/scheduler/sched-deadline.html Can you test these steps to see if the performance governor sets the speeds properly for you, and gives proper results? If we're a little bit off, then I think deadline task scheduling is the next necessary step. I would also pin the process to the correct core so there is no cross-CPU talking to mess with timing, and then look into pinning the process to the correct, closest memory addresses too (NUMA). Tools required:
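Alongside taskset/numactl, a process can also pin itself from within Python using the stdlib; a minimal sketch (Linux-only; core 0 is an arbitrary choice for illustration):

```python
import os

def pin_to_core(core: int):
    # Restrict this process (pid 0 = self) to a single CPU core so the
    # scheduler never migrates it mid-sampling (Linux-only stdlib call).
    os.sched_setaffinity(0, {core})
    return os.sched_getaffinity(0)

if __name__ == "__main__":
    before = os.sched_getaffinity(0)
    print("allowed cores before:", before)
    print("allowed cores after pinning:", pin_to_core(0))
    os.sched_setaffinity(0, before)  # restore the original affinity
```

This only covers CPU affinity; memory locality (the NUMA side) would still be handled externally, e.g. with numactl.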
then use: numactl
I am open to ideas. What's next? My kernel: |
@beadon I'm on the same page with regards to the security concerns. Thank you for your input. The information you've shared on CPU frequency and scheduling is awesome!! Thank you so much. I believe Raspberry Pi's support of
So my Pi 4's cores are running at 1.8GHz constantly. My tests from
So the kernel is working fine with the C driver. However, my project doesn't use the C driver - it uses the py-spidev Python module. The problem is that v5 kernels get half the sample rate compared to v4 kernels when using py-spidev. Here's a quick test using py-spidev's xfer, run with sudo python3:
from time import time
import spidev

spi = spidev.SpiDev()
spi.open(0, 0)
spi.max_speed_hz = 1750000

def get_timed_samples():
    num_samples = 10000
    start = time()
    for _ in range(num_samples):
        spi.xfer([1, 8 << 4, 0])
    end = time()
    duration = (end - start)
    sample_rate = num_samples / duration
    return sample_rate

if __name__ == '__main__':
    for _ in range(10):
        sample_rate = get_timed_samples()
        print(f"Sample Rate: {round(sample_rate, 2)} SPS")
You'll have to
On a v4 kernel, with the same test code, the sample rate is over 30K SPS. So, it seems like a kernel issue, as that's the only variable (besides the major Python version, but I've already tested Python 3.9 on a v4 kernel and Python 3.7 on a v5 kernel). But, as we've both tested above using the C driver, the v5 kernels are capable of reaching the intended bandwidth. So, I am left wondering: why is the performance halved under py-spidev on v5 kernels?
|
Ok, thanks - nice results, good test code. Since we are talking about spidev, we have to remember that Python is ultimately 'single-threaded' - or at least it is subject to the GIL in this example. So that is sitting in the back of my head when looking at the implementation, the blocking, I/O transfers, ring buffers, and the OS too - which is no longer using jiffies (tickless kernels; will re-confirm below). These are some of the variables working against us here.
So this is going to start me down a path of "Why is this so incredibly variable?!" Possible options to chase:
Notes for later: strace on the Python run with the test code. This is quite verbose, and will take some time to get a clear picture of what's happening - moving closer to a realtime kernel is going to be chasing the ticks. Quick notes: /proc/timer_list has lots of detail to chew on. Check this out -
There is a lot to absorb here. I believe reading through the strace will show more information. Is it possible that there's a different interface (file handle class/type, etc.) that Python is using to access SPI, rather than the C module? Since the code for spidev has not changed since 2020, there may be some kind of legacy interface kept around in this kernel jump for spidev, or others, to hit... More tools to extract behavior --
I do have a question - why the hard-set speed "spi.max_speed_hz = 1750000"? This raises red flags for me because it looks like a magic number. This is 1.750 MHz - shouldn't we be sampling somewhere in the kHz range? |
On re-reading your comments, I think we are dealing with at least 2 different problems.
|
For later reference - perf for the 5.15 kernel is not available yet in the apt package manager. linux-perf-5.10 is the latest. A reminder we may want to ping the package maintainer about this. |
Wow, this is going to get pretty deep, but I'm all for it. It seems there is lots for me to learn here. I don't think the GIL is the culprit, because everything is running in a single thread. There is a lot going on in the driver's SPI transfer path, though. |
1.75 MHz is the SPI clock speed only - and to do a single 10-bit transfer, it takes ~18 clock cycles (there's also some overhead in pulling the CS pin high to prepare for the transfer). I selected 1.75 MHz by interpolating (and then testing) between the datasheet clock frequency boundaries, i.e. 1.35 MHz at 2.7V and 3.6 MHz at 5V. Since I'm running the ADC at 3.3V, the clock upper limit is somewhere on the curve from 1.35 MHz to 3.6 MHz. When we see 30K SPS in the test code, that's split between 7 channels, which gives an effective per-channel sampling rate of about 4.29 kSPS. I'll see if I can think of any other tests to do with my scope. Also, I'm thinking of altering the Python test code above to track the sample rate over an increasing number of samples captured in each batch. For instance, find the average sample time when capturing only 1 sample, then 10 samples, 50, 100, 200, etc., and see if the sample rate stays the same, or if there's a linear/logarithmic relationship. This would help us understand if the batch size has much to do with the effective sample rate. |
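The budget described above can be sanity-checked with simple arithmetic; a sketch using the figures from the comment (the 18-clocks-per-transfer and 7-channel split are taken as given, and inter-transfer overhead is ignored, so the ceiling is optimistic):

```python
SPI_CLOCK_HZ = 1_750_000   # spi.max_speed_hz from the test script
CLOCKS_PER_TRANSFER = 18   # ~18 SPI clocks per 10-bit ADC read (per the comment)
CHANNELS = 7               # the 30K SPS aggregate is split across 7 channels

# Theoretical ceiling if the bus never paused between transfers
ceiling_sps = SPI_CLOCK_HZ / CLOCKS_PER_TRANSFER

# Observed aggregate rate from the v4 kernel test, split per channel
observed_sps = 30_000
per_channel_sps = observed_sps / CHANNELS

print(f"theoretical ceiling: {ceiling_sps:,.0f} SPS")     # ~97,222 SPS
print(f"per-channel rate:    {per_channel_sps:,.0f} SPS") # ~4,286 SPS
```

The gap between ~97K SPS theoretical and ~30K SPS observed is the per-call overhead (CS toggling, syscall and Python-loop cost) that this whole thread is chasing.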
Ok, it looks like the perf tool is not available with the 5.15 kernel yet on Raspbian. I've pinged the Raspbian channel on Libera (IRC). There is a Debian package for it updated as of Oct 2nd (yes, the future! Because the server is in a timezone ahead of EST), but we may need to compile it ourselves. Here's the code - https://github.com/torvalds/linux/tree/master/tools/perf -- snagging the tree -- mostly for kicks :). Reference: https://linux-packages.com/debian/package/linux-perf-515 |
It appears that the C driver will only use DMA mode to communicate with SPI when a transfer is over 96 bytes -- reference: https://github.com/raspberrypi/linux/blob/rpi-4.19.y/drivers/spi/spi-bcm2835.c#L77 Additional reading - https://iosoft.blog/2020/06/11/fast-data-capture-raspberry-pi/ If the first is true, then there might be something that spidev is doing to stay under that limit (assuming it's stacked on top of the driver at a lower level). If you have the scope up, I assume that the behavior looks very different between the C and Python versions - does anything stand out? Taking a close look at spidev's bits_per_word. |
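Assuming the 96-byte threshold from the spi-bcm2835.c source linked above, the 3-byte command used by the test script would always fall back to the interrupt/polled path rather than DMA; a quick check (the constant name and the exact >=/>= comparison direction are taken on trust from that source, so treat them as assumptions):

```python
# Threshold from drivers/spi/spi-bcm2835.c (BCM2835_SPI_DMA_MIN_LENGTH)
DMA_MIN_LENGTH = 96

def uses_dma(transfer_bytes: int) -> bool:
    # The bcm2835 driver only sets up DMA for transfers at or above
    # the threshold; anything smaller goes through PIO/interrupt mode.
    return transfer_bytes >= DMA_MIN_LENGTH

xfer_payload = [1, 8 << 4, 0]  # the 3-byte command from the test script
print(uses_dma(len(xfer_payload)))  # False: PIO/interrupt mode
```

If that holds, DMA behavior differences between kernels would be irrelevant to this particular workload, which matches the later observation that the delay sits around the call itself.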
Interesting -- delay_usecs is used inside spidev, but it's been removed from the kernel ( https://www.spinics.net/lists/kernel/msg3864354.html ). Maybe relevant. |
Hi @beadon, great find on delay_usecs being removed from the kernel. I think a PR for py-spidev is in order (eventually!). For DMA, that's an interesting note; however, it shouldn't apply here, because the transfer is, I believe, a 3-byte write followed by a 10-bit read. (I haven't validated that on the serial lines with my scope, though.) I modified my test script shared above to pull pin 16 high when a batch of sampling starts, and to set pin 16 low when sampling is done. I've also set num_samples = 2000 to match what I've set in this project for each sample cycle (2000 per iteration). I pushed the modified test script to a new branch so you can see what it's doing. If you want to run it, you'll need the RPi.GPIO library. Updated test script: https://github.com/David00/rpi-power-monitor/blob/kernel-testing/test.py Here is an annotated capture of the scope screen which explains how I've set up this test. Yellow: Channel 1 - attached to header pin 16 (not GPIO 16 - the actual pin number 16, which is GPIO 23). Referred to as SCOPE_PIN in the annotation above. So, channel 1 stays high while 2k samples are captured. My intent with this test is to measure both the variance and the duration of the time that pin 16 is pulled high, which would give us the jitter between calls. Speaking of the clock, it looks odd in the photo above, so I'm going to look into this more too. The clock should not be broken up into three groups of 8, AFAIK. But then again, I haven't actually dug that deep into the Linux SPI driver to see how the transfer is done. Here is a close-up of the clocks, just FYI. The screenshots above are from readings taken on my v5.15 board. I'll do the same test with my v4.19 board. I have exported the raw scope data to CSV (7M lines, 215MB), but I'll hold off on the analysis until I repeat this test on the v4.19 kernel tomorrow, just in case there's any obvious sign of the issue after that test. Plus, it's late. |
Ok, the comparison on the scope between the two kernels uncovered where the sample rate is differing. Now we just need to figure out why. The difference shows up in the time between single collections - by that, I mean the time between each call to xfer. Comparing the v5.15 and v4.19 captures: note how, in the following photo, the distance between each group of clocks seems to vary. The one I selected to measure seems to have a larger gap than the batch of clocks before and after. I think the small variance is not related to the issue and could possibly be improved with the process scheduling that you suggested. Any idea why Python might be taking longer to execute the calls in the v5 kernel? My suspicion leads me to the inner workings of ioctl.c, but I'm not too sure how to debug beyond this point. |
Hi @beadon, are you available to continue troubleshooting this? I keep thinking about it, and it's one of those problems that's going to continue being a thorn in my side. I've narrowed it down to a difference in execution time in Python between v4 and v5 kernels. I don't know if it'd be valuable, but we could disassemble the Python instructions and compare between v4 and v5 calls. I wouldn't expect to see a change, though, since the CPU arch isn't changing... |
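Disassembling and diffing the bytecode is straightforward with the stdlib dis module; a sketch using a stub in place of the real SpiDev object so it can be run on any machine (FakeSpi is hypothetical, purely for illustration):

```python
import dis

class FakeSpi:
    # Stand-in for spidev.SpiDev so the loop can be disassembled
    # without SPI hardware present.
    def xfer(self, data):
        return data

def sample_loop(spi, num_samples):
    # Same inner loop as the timing test script
    for _ in range(num_samples):
        spi.xfer([1, 8 << 4, 0])

if __name__ == "__main__":
    # Identical source compiles to identical bytecode on the same CPython
    # version; any diff between the two boards would point at the
    # interpreter build, not the code.
    dis.dis(sample_loop)
```

Capturing this output on both the v4 and v5 boards and diffing it would quickly confirm (or rule out) the interpreter-build theory.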
Long delay - apologies - working on some life stuff concurrently. We might have to grab kernel sources and debug at that level. I believe that the timing issue you are seeing is related only to the Python implementation, correct? If so, then this means we need to be on either side of Python: in the OS, comparing what happens when the C driver is pushing things around, and in/above Python where it is executing. The insight between kernels and the visual capture of the timing and grouping of messages is really interesting! You can clearly see the difference. There might be a shortcut here - a shot in the dark, troubleshooting-wise: if there is a difference in the low-level driver, this could also explain the behavior difference. At the moment, I am unsure what the interface is between the kernel and the driver, but a good starting point is the loaded kernel modules, or the file handles that each process has open (identify with lsof). We should see the process(es) grab different C-level handles, associated with different libraries or modules. One more thing we need to check into is ldd and LD_LIBRARY_PATH, in case any library is stepping in front of another and getting used when it should not be. A reminder to keep a close eye on versioning. ... have to dig in more ... |
Hi @beadon, no worries at all - same for me. Yes, I've tested the sampling rate in C, and it does not vary between kernel versions (at least, not so significantly - it might have a little bit, I don't remember). |
You know what, I think the short break that I took from this issue was beneficial overall. We might be barking up the wrong tree here. I went back through my test script and re-analyzed the data I collected. While my decorator trigger function was beneficial to spot the issue from a bird's-eye view, I think it needs to go one level deeper and measure the time it takes to make a single call to spi.xfer(). If the time is the same, which I have a good feeling it will be, the issue is not with SPI, but rather with the Python implementation (and therefore, CPython). It could even be something as simple as the specific Python build (like whether or not it was compiled with optimizations, or something else I'm not currently aware of). I did test the same Python version, but I didn't test the same Python build. I'm going to run the 'scope test case for individual calls to spi.xfer(). |
Ok, so I adjusted the test script to pull the scope pin high immediately before calling spi.xfer(), and then set it low on the very next call:

for _ in range(num_samples):
    GPIO.output(SCOPE_PIN, 1)  # pin set high
    spi.xfer([1, 8 << 4, 0])
    GPIO.output(SCOPE_PIN, 0)  # pin set low

So, the pin is high immediately before, during, and immediately after the call to spi.xfer(), until it is set low again. I have measured the following time deltas on both kernels, and have uncovered a bit more detail about the source of the delays noted in my previous comment.
On both kernels, it takes about 10µs for the SPI clock to begin after the scope pin is pulled high (13.2µs on v5, and 9.7µs on v4... not much of a difference at all). Example (note the "13.20µs" near the bottom left, on the X2 - X1 box). However... when the SPI clock stops, it takes significantly longer on the v5 kernel for the scope pin to be pulled low again - about 5x longer!! On v5, it takes ~30µs for the scope pin to be pulled low after the SPI clock stops. On v4, it takes ~6µs. (Note that the scale of the following image is twice as large as the scale above, yet the delay between the clock ending and the scope pin going low still appears smaller!) So, the time between the end of the SPI clock and the return from the call is where the difference lies. When I get back to this again, I'll take another look through spi.xfer() and look at what it's doing to clean up prior to returning. If we need to go one step deeper with the scope (controlling the scope pin at various places inside the spi.xfer() function), I can look at that too. |
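That post-transfer tail can also be quantified in software by timing each individual call with time.perf_counter, complementing the scope measurement; a sketch with a stub transfer function so it runs anywhere (on the Pi, you'd pass the real spi.xfer in place of fake_xfer):

```python
from statistics import mean, stdev
from time import perf_counter

def time_calls(func, num_calls=1000):
    # Record the wall-clock duration of each individual call so we can
    # compare per-call latency (not just aggregate rate) across kernels.
    durations = []
    for _ in range(num_calls):
        start = perf_counter()
        func([1, 8 << 4, 0])
        durations.append(perf_counter() - start)
    return durations

def fake_xfer(data):
    # Stand-in for spi.xfer; swap in the real bound method on the Pi.
    return data

if __name__ == "__main__":
    d = time_calls(fake_xfer)
    print(f"mean: {mean(d) * 1e6:.2f} µs, stdev: {stdev(d) * 1e6:.2f} µs")
```

Running this (with the real xfer) on both boards should show the per-call mean roughly ~25µs higher on v5, matching the scope's ~30µs vs ~6µs tail.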
@beadon, it looks like we're good now after @doceme's patch and the bump to py-spidev v3.6. I'm working on a new custom OS build based off the latest Raspberry Pi OS Lite image, dated 2022-09-22, which ships with a v5.15.61 kernel. I'm also merging PR #62 into this release, which @kizniche was so kind to work on and submit. I'm so glad to have this SPI issue resolved now - thank you so much for lighting the fire that drove us to get to the bottom of it! I should have the release out by this weekend. In addition to the new custom OS release, I will also be tagging this update. Future readers: you should be safe to pip install -U spidev and then apply:
sudo su
echo "force_turbo=1" >> /boot/config.txt
reboot |
You rock! I have had a storm of life events. Thank you for bringing this to conclusion!! |
In https://github.com/David00/rpi-power-monitor/wiki/Software-0.-Installation there is a reference to using an old kernel for the Raspberry Pi. While this is all well and good to get things working, it soon becomes a security problem, since the kernel cannot be upgraded.
Over time, this will likely mean a lot of headache since the setup appears to be quite brittle.
I've given the latest kernel a go now -- v5.10. Have you tested this kernel before to see if it exhibits the SPI problem? If not, has this SPI problem been reported to the kernel dev team yet? If not, I can help champion it with you!