Segmentation Fault and Fix for UL Sniffing #37

Open
cellular777 opened this issue Aug 30, 2023 · 9 comments

Comments

@cellular777

Intro

First, thank you for your contributions with this project. I have been following FALCON for quite some time.
I have encountered a few segmentation faults while running UL mode against a local station.
I wanted to contribute my fix. These errors usually appear sometime after 1-5 minutes of the program running.

Command

  • ./LTESniffer -a clock_source=external,time_source=external -A 2 -W 8 -f 1982.5e6 -u 1902.5e6 -C -m 1

Errors and Solutions

#1 Shared Pointer Access Issue

  • Error: this error manifested in different places, but all the GDB outputs lead to the .get() method on the shared pointer
  • GDB Errors: These are two different ones, fixed in the same way.
Thread 3 "LTESniffer" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffd7452700 (LWP 225)]
PUSCH_Decoder::investigate_valid_ul_grant (this=0x7fffef71b010, decoding_mem=...) at /home/FALCON/LTESniffer/src/src/UL_Sniffer_PUSCH.cc:907
907	    if (decoding_mem.ran_ul_grant->tb.tbs == 0 || decoding_mem.ran_ul_grant_256->tb.tbs == 0)
(gdb) 

Thread 7 "LTESniffer" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffd544e700 (LWP 22451)]
0x00005555555e8ece in PUSCH_Decoder::decode (this=0x7fffd978b010) at /home/FALCON/LTESniffer/src/src/UL_Sniffer_PUSCH.cc:425
425	                ul_cfg.pusch.grant = *decoding_mem.ran_ul_grant.get();
(gdb)
  • Fix: change all ran_ul_grant.get() and ran_ul_grant_256.get() to not use the .get() method in [UL_Sniffer_PUSCH.cc]
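A defensive variant that does not depend on how the pointer is dereferenced is to check both grants for null before reading tb.tbs. The following is only a minimal sketch of that idea: TransportBlock, Grant, DecodingMem and grant_is_usable are hypothetical stand-ins written for illustration, not the actual LTESniffer or srsRAN types.

```cpp
#include <memory>

// Hypothetical stand-ins for the real grant structures (assumption).
struct TransportBlock { int tbs = 0; };
struct Grant { TransportBlock tb; };

struct DecodingMem {
    std::shared_ptr<Grant> ran_ul_grant;
    std::shared_ptr<Grant> ran_ul_grant_256;
};

// Returns true only when both grants exist and carry a non-zero
// transport block size. The null test runs first, so tb.tbs is never
// read through an empty shared_ptr.
bool grant_is_usable(const DecodingMem& mem) {
    if (!mem.ran_ul_grant || !mem.ran_ul_grant_256) {
        return false;  // an empty pointer would crash on ->tb.tbs
    }
    return mem.ran_ul_grant->tb.tbs != 0 &&
           mem.ran_ul_grant_256->tb.tbs != 0;
}
```

A caller would bail out early when grant_is_usable returns false, instead of evaluating the tbs comparison directly.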

#2 Target RNTI Zero Error

  • Error: I don't have GDB output for this one, but after a long run I once got a decoding_mem.rnti value of zero (a false positive). Zero is not a valid RNTI, yet it satisfies the if statement at [UL_Sniffer_PUSCH.cc @line#-419] because target_rnti defaults to zero, so the comparison matches. That lets an invalid UL grant through, which causes a segmentation fault. An alternative would be to initialize target_rnti to a value that cannot occur, like NaN or -1, so that this match can never happen.
if ((decoding_mem.rnti == target_rnti) || (valid_ul_grant == SRSRAN_SUCCESS))
  • Fix: changed the if statement to reject all RNTIs with zero value
if (((decoding_mem.rnti == target_rnti) || (valid_ul_grant == SRSRAN_SUCCESS)) && decoding_mem.rnti != 0)
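The two ideas above (reject RNTI zero, and give target_rnti a default that can never match) can be combined in one small predicate. This is a sketch under assumptions: should_process_grant, kNoTarget and kSuccess are hypothetical names, kSuccess only stands in for SRSRAN_SUCCESS, and the target is widened to a signed 32-bit value so that -1 lies outside the 16-bit RNTI range.

```cpp
#include <cstdint>

// Stand-in for SRSRAN_SUCCESS (assumed to be 0 for this sketch).
constexpr int kSuccess = 0;
// Sentinel meaning "no target configured"; outside the 16-bit RNTI range.
constexpr std::int32_t kNoTarget = -1;

// Decide whether a decoded UL grant should be processed further.
// RNTI 0 is always rejected, and an unset target never matches.
bool should_process_grant(std::uint16_t rnti,
                          std::int32_t target_rnti,
                          int valid_ul_grant) {
    if (rnti == 0) {
        return false;  // zero is not a valid RNTI
    }
    const bool matches_target =
        target_rnti != kNoTarget &&
        target_rnti == static_cast<std::int32_t>(rnti);
    return matches_target || valid_ul_grant == kSuccess;
}
```

With a sentinel default, the extra rnti != 0 test is strictly redundant for the target comparison, but it still protects the valid_ul_grant branch.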

Hopefully, this is helpful. I am now able to run the UL code for 10s of minutes.

Brian Stevens

@hdtuanss
Collaborator

hdtuanss commented Sep 4, 2023

Hi Stevens,
Sorry for my late reply. I'm on a business trip so I'm still not able to update your findings right now.
These errors and solutions are really helpful for LTESniffer. With them I can make the sniffer more stable and fix the bugs that I introduced during development.
I will make sure the errors are fixed this weekend.
Also, do you mind if I include your name in the Contribution Section of the README file?
I appreciate your findings a lot.
Thank you!!

@cellular777
Author

cellular777 commented Sep 7, 2023

@hdtuanss
I am happy to help. The errors above might be tricky to find/reproduce, as these errors could either be circumstantial to my testing network provider or only occur after many minutes of running. Recently, I did tests a few times going back and forth with/without the suggested fixes, and I am confident the suggestions prevent the segmentation faults. I do think either way, they should provide stability for the code.

I would be honored to be listed as a contributor.
I will report back any other contributions that I believe are useful.

Thanks again for all you are doing on this project.

@alphafox02

I’ve made the changes above and compiled the latest source. It seems fine running LTESniffer w/ uplink using an X310 for an extended period of time against srsRAN. If there are specific configurations of srsRAN I should try and test against, let me know and I’m happy to do so.

@hdtuanss
Collaborator

@cellular777 I updated the code as you recommended. Thanks a lot.

@hdtuanss
Collaborator

@alphafox02 Thank you for testing and verifying the new code for me.

@cellular777
Author

@hdtuanss, update from me. Figured I would just add to this thread as it is similar. I hope that is ok.
Previously, I was using a specific EARFCN DL/UL pair that could run for extended periods.

I have now switched to a different EARFCN DL/UL pair and get errors after 3-5 minutes.

Command

./LTESniffer -a clock_source=external,time_source=external -A 2 -W 8 -f 2130e6 -u 1730e6 -C -m 1

Error #1

Thread 8 "LTESniffer" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffd4c04700 (LWP 29866)]
0x0000555555614580 in RNTIManager::getActivationReason (this=0x55555d89e500, rnti=28885) at /home/LTESniffer/lib/src/util/RNTIManager.cc:219
219	    if(it->rnti == rnti) return it->reason;
(gdb) 

Error #2

Thread 3 "LTESniffer" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffd7452700 (LWP 378)]
std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count (__r=..., this=0x7fffd7451698) at /usr/include/c++/7/bits/shared_ptr_base.h:691
691		  _M_pi->_M_add_ref_copy();
(gdb) 

Results

The first error doesn't make much sense to me. I checked the entire std::list and the values all seem fine. The second error also doesn't make much sense to me, as it didn't really give a stack trace.

Some Ideas

  1. Figure out a way to pipe IQ files in. I am assuming this isn't already supported? It would let me reproduce the error and also share it (via an IQ file).
  2. Build with debug info at optimization level zero. The build currently uses level 2, and changing that doesn't seem to be supported yet, but it may give better line numbers?
  3. Find a way to attach a memory profiler (valgrind?) to see whether there is a memory leak. I tried this and got a weird unrecognised instruction error (below). I have no clue why this is an issue, as the line is just a #define and some constants. It may be that srsRAN is a library and I need to load it into valgrind, but I don't actually know whether srsRAN is compiled like that for this project.
==56304== valgrind: Unrecognised instruction at address 0x16823b.
==56304==    at 0x16823B: prach_cexp_init (prach.c:63)
==56304==    by 0x5D770C: __libc_csu_init (in /home/LTESniffer/build/src/LTESniffer)
==56304==    by 0xA11FC17: (below main) (libc-start.c:266)
  • If you have any additional ideas let me know, or if I make any progress on the above I will send an update.

@hdtuanss hdtuanss reopened this Sep 27, 2023
@hdtuanss
Collaborator

@cellular777 Hi Brian, thanks for your update!
For me, the errors you found are pretty tricky, and it is hard to analyze the root cause quickly.
About error 1, do you think it is caused by multiple threads accessing RNTIManager at the same time? I will try adding a mutex to RNTIManager and see the result.
About your ideas:
1: Reading I/Q samples from a file is supported, but the implementation is incomplete; I might need to add more code to make it work effectively. Also, you might need to save the I/Q samples to file at subframe granularity, because the file-reading function reads samples one subframe at a time. I have code to do that, but it is also incomplete. I will fix it and share it with you.
2 + 3: I haven't done this before, so it will take me some time to try these methods.
Please add comments here if you have any new results.
Thank you

@cellular777
Author

@hdtuanss

Error 1 may very well be a multi thread access issue. I might be able to run with a single thread and see if the error occurs? I will report back next week. There is also a shared_mutex object that may be best.
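To make the shared_mutex idea concrete, here is a rough sketch of a reader/writer-locked RNTI list. It assumes C++17 for std::shared_mutex, and GuardedRntiList, ActiveEntry and the ActivationReason values are hypothetical simplifications for illustration, not the real RNTIManager interface.

```cpp
#include <cstdint>
#include <list>
#include <mutex>
#include <shared_mutex>

// Hypothetical, simplified stand-in for the manager's active list.
enum class ActivationReason { None, Histogram, RandomAccess };

struct ActiveEntry {
    std::uint16_t rnti;
    ActivationReason reason;
};

class GuardedRntiList {
public:
    // Writers take the lock exclusively.
    void activate(std::uint16_t rnti, ActivationReason reason) {
        std::unique_lock<std::shared_mutex> lock(mutex_);
        entries_.push_back({rnti, reason});
    }

    void deactivate(std::uint16_t rnti) {
        std::unique_lock<std::shared_mutex> lock(mutex_);
        entries_.remove_if(
            [rnti](const ActiveEntry& e) { return e.rnti == rnti; });
    }

    // Readers share the lock, so lookups from several worker threads
    // can run concurrently but never overlap with a writer.
    ActivationReason getActivationReason(std::uint16_t rnti) const {
        std::shared_lock<std::shared_mutex> lock(mutex_);
        for (const ActiveEntry& e : entries_) {
            if (e.rnti == rnti) return e.reason;
        }
        return ActivationReason::None;
    }

private:
    mutable std::shared_mutex mutex_;
    std::list<ActiveEntry> entries_;
};
```

The point of the shared lock is that the crash site (iterating the list in getActivationReason) can no longer run while another thread is adding or erasing nodes.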

1: File Playback: Let me know what you can share, even if it is "how you would do it". This will be something I work on next week. It would be nice to capture errors and replay.

2 + 3: Memory Debugging: I want to work on this. But #1 comes first.

4: Raw Pointers and Threads: My earlier fix involved the .get() method (raw pointer) on shared_ptrs. I wonder if removing the .get() helps with something like the multi-thread access issue. My previous suggestion really seemed like it "fixed" something: I ran multiple times and got crashes, then with the change it didn't crash at all. However, maybe it was fixing a multi-thread access issue? Technically, in a single-threaded application, removing the get() call should make no functional difference, although Stack Overflow answers recommend it as "safer". I am curious whether removing the get() method from all shared_ptrs would "patch the issue". It is something quick I might try.

I will be focusing on this more next week. I will let you know if I find anything of interest.

@cellular777
Author

cellular777 commented Jan 25, 2024

@hdtuanss I have some fixes for segfaults I found. I did this before having file playback. I would like to run valgrind on the file playback version that was recently posted to make sure memory is sound. So far I have just been running in gdb and taking backtraces when it segfaults. I only found these while trying to get long run times, and I didn't seem to hit them much, if at all, with a single thread.

Previously, I would crash after 5 to 8 minutes of running. These fixes got me to 5+ and even 13+ hours. It is hard to tell whether UHD errors might be happening over such long run times. Again, after this point I would go to valgrind.

  1. The ul_grants in UL mode sometimes have null pointers to the grant substructure or other variables.
    Fix: I am unsure why they get set to null (maybe another memory error elsewhere), but I fixed this by checking for nulls on all paths of the UL processing that access those grants. See one of these errors below.
Thread 2 "LTESniffer" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffd7c53700 (LWP 177)]
PUSCH_Decoder::investigate_valid_ul_grant (this=this@entry=0x7fffe1b61010, decoding_mem=...) at /home/LTESniffer/src/src/UL_Sniffer_PUSCH.cc:989
989         if (decoding_mem.ran_ul_grant->tb.tbs == 0 || decoding_mem.ran_ul_grant_256->tb.tbs == 0)
(gdb) 
  2. There seem to be access issues with the RNTI manager, like you said above. I was getting segfaults when reading through the list of RNTIs (same as above).
    Fix: I first added mutexes around reads/writes to the list, but this created a bottleneck and I started dropping subframes exponentially as the RNTI list got larger. Instead, I changed the RNTI list to a std::map, which has slower add/remove but faster reads. This seems to fix the thread access issue without getting me into states where threads are locked up. In the future, it may be better to have an RNTI manager per thread and merge them periodically, but this seems to work fine for now (running with 16 threads, for reference). See the error below:
Thread 16 "LTESniffer" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffd0a43700 (LWP 47348)]
0x00005555556236b0 in RNTIManager::getActivationReason (this=0x55555d8ae870, rnti=28315) at /home/LTESniffer_mirror/lib/src/util/RNTIManager.cc:219
219         if(it->rnti == rnti) return it->reason;
(gdb) root@5g-lab-04:/home/LTESniffer_mirror/build# 
  3. I noticed rx sample failures would print, and once they started they could repeat constantly. The program would not quit; it would just keep looping and failing on UHD calls. I added a limit so that if this happens more than, say, 5 times in a row within 10 s, the program exits gracefully. I think the limit makes sense: if you run in a container, this can kick off a restart, for example. Just make the limit higher if you do not have a way to auto-restart the executable. See the constant errors below:
/home/LTESniffer/build/srsRAN-src/lib/src/phy/ue/ue_sync.c:772: Error receiving samples
/home/LTESniffer/build/srsRAN-src/lib/src/phy/rf/rf_uhd_imp.cc:1340: Error timed out while receiving samples from UHD.
/home/LTESniffer/src/src/LTESniffer_Core.cc:480: Error calling srsran_ue_sync_work()
/home/LTESniffer/build/srsRAN-src/lib/src/phy/ue/ue_sync.c:772: Error receiving samples
/home/LTESniffer/build/srsRAN-src/lib/src/phy/rf/rf_uhd_imp.cc:1340: Error timed out while receiving samples from UHD.
/home/LTESniffer/src/src/LTESniffer_Core.cc:480: Error calling srsran_ue_sync_work()
/home/LTESniffer/build/srsRAN-src/lib/src/phy/ue/ue_sync.c:772: Error receiving samples
/home/LTESniffer/build/srsRAN-src/lib/src/phy/rf/rf_uhd_imp.cc:1340: Error timed out while receiving samples from UHD.
/home/LTESniffer/src/src/LTESniffer_Core.cc:480: Error calling srsran_ue_sync_work()
/home/LTESniffer/build/srsRAN-src/lib/src/phy/ue/ue_sync.c:772: Error receiving samples
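
The receive-failure limit from item 3 could be sketched roughly as below. RxFailureLimiter is a hypothetical helper written for illustration (it is not the code that was added to LTESniffer), and the 5 failures / 10 s thresholds are just the example numbers from the description.

```cpp
#include <chrono>

// Hypothetical helper: tracks consecutive receive failures and reports
// when too many have piled up inside a time window.
class RxFailureLimiter {
public:
    using Clock = std::chrono::steady_clock;

    RxFailureLimiter(int max_failures, std::chrono::seconds window)
        : max_failures_(max_failures), window_(window) {}

    // Call after a successful receive; ends the current failure run.
    void on_success() { count_ = 0; }

    // Call after a failed receive. Returns true when the caller should
    // stop: more than max_failures consecutive failures in the window.
    bool on_failure(Clock::time_point now = Clock::now()) {
        if (count_ == 0 || now - first_failure_ > window_) {
            first_failure_ = now;  // start a new run with this failure
            count_ = 0;
        }
        ++count_;
        return count_ > max_failures_;
    }

private:
    int max_failures_;
    std::chrono::seconds window_;
    int count_ = 0;
    Clock::time_point first_failure_{};
};
```

In the main loop, the idea would be to call on_failure() wherever the srsran_ue_sync_work() error is reported and leave the loop when it returns true, and to call on_success() after every good subframe.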

I have tested these over days and it seems way more stable. I will add these when I get back from out of the office (see my other comment on the other issue).
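
For the std::map change in item 2, a rough sketch of the shape I mean is below. MappedRntiTable and the ActivationReason values are hypothetical names for illustration only. The sketch keeps a plain mutex as an assumption on my part, because std::map by itself is not safe for concurrent reads and writes; the lookup is a single O(log n) find, so the lock is held far more briefly than when walking a long list.

```cpp
#include <cstdint>
#include <map>
#include <mutex>

// Hypothetical, simplified activation reasons for this sketch.
enum class ActivationReason { None, Histogram, RandomAccess };

class MappedRntiTable {
public:
    // Insert or overwrite the entry for this RNTI.
    void activate(std::uint16_t rnti, ActivationReason reason) {
        std::lock_guard<std::mutex> lock(mutex_);
        table_[rnti] = reason;
    }

    void deactivate(std::uint16_t rnti) {
        std::lock_guard<std::mutex> lock(mutex_);
        table_.erase(rnti);
    }

    // Keyed lookup instead of a linear scan over a list.
    ActivationReason getActivationReason(std::uint16_t rnti) const {
        std::lock_guard<std::mutex> lock(mutex_);
        const auto it = table_.find(rnti);
        return it == table_.end() ? ActivationReason::None : it->second;
    }

private:
    mutable std::mutex mutex_;
    std::map<std::uint16_t, ActivationReason> table_;
};
```

A per-thread table that is merged periodically, as mentioned above, would remove the shared lock entirely, at the cost of threads seeing each other's RNTIs with some delay.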


3 participants