Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kjohnson3 issues #387

Closed
jwokaty opened this issue Jan 3, 2024 · 6 comments
Closed

kjohnson3 issues #387

jwokaty opened this issue Jan 3, 2024 · 6 comments

Comments

@jwokaty
Copy link
Collaborator

jwokaty commented Jan 3, 2024

The purpose of this issue is to collect all the information related to kjohnson3 not performing as we expect.

System Information

# excerpt from system_profiler -detailLevel basic

Software:

    System Software Overview:

      System Version: macOS 13.6.1 (22G313)
      Kernel Version: Darwin 22.6.0
      Boot Volume: Macintosh HD
      Boot Mode: Normal
      Computer Name: kjohnson3
      User Name:  (biocbuild)
      Secure Virtual Memory: Enabled
      System Integrity Protection: Enabled
      Time since boot: 36 days, 23 hours, 27 minutes

Storage:

    Data:

      Free: 1.76 TB (1,755,184,394,240 bytes)
      Capacity: 2 TB (1,995,218,165,760 bytes)
      Mount Point: /System/Volumes/Data
      File System: APFS
      Writable: Yes
      Ignore Ownership: No
      BSD Name: disk3s5
      Volume UUID: 84E902B4-385B-497B-8F88-8D55CA4A223E
      Physical Drive:
          Device Name: APPLE SSD AP2048Z
          Media Name: AppleAPFSMedia
          Medium Type: SSD
          Protocol: Apple Fabric
          Internal: Yes
          Partition Map Type: Unknown
          S.M.A.R.T. Status: Verified

    Macintosh HD:

      Free: 1.76 TB (1,755,184,394,240 bytes)
      Capacity: 2 TB (1,995,218,165,760 bytes)
      Mount Point: /
      File System: APFS
      Writable: No
      Ignore Ownership: No
      BSD Name: disk3s1s1
      Volume UUID: BEB26684-70B5-4407-9D99-FC9B003D9768
      Physical Drive:
          Device Name: APPLE SSD AP2048Z
          Media Name: AppleAPFSMedia
          Medium Type: SSD
          Protocol: Apple Fabric
          Internal: Yes
          Partition Map Type: Unknown
          S.M.A.R.T. Status: Verified

kjohnson3:~ biocbuild$ system_profiler SPHardwareDataType
Hardware:

    Hardware Overview:

      Model Name: Mac Pro
      Model Identifier: Mac14,8
      Enclosure: Rack
      Model Number: Z172000TRLL/A
      Chip: Apple M2 Ultra
      Total Number of Cores: 24 (16 performance and 8 efficiency)
      Memory: 192 GB
      System Firmware Version: 10151.41.12
      OS Loader Version: 8422.141.2.700.1
      Serial Number (system): G4NL9W9XNC
      Hardware UUID: 360759C8-A272-58A9-97B9-B9F743AE7DB0
      Provisioning UDID: 00006022-001829260E90A01E
      Activation Lock Status: Disabled
kjohnson3:~ biocbuild$ sw_vers
ProductName:		macOS
ProductVersion:		13.6.1
BuildVersion:		22G313

Tracking load

We're tracking the load at /Users/biocbuild/bbs-3.19-bioc-mac-arm64/log/uptime.log every 15 minutes.

Performance installing csaw

Indicators that it's not performing as expected. @hpages compared the elapsed time of installing csaw. He thought possibly there might be an issue with the compiler, such as a regression. kjohnson1 (release) has clang-1500.0.40.1 and kjohnson3 has clang-1500.1.0.2.4.

Package: csaw
Version: 1.37.1
Command: /Library/Frameworks/R.framework/Resources/bin/R CMD INSTALL csaw
StartedAt: 2023-12-28 16:11:13 -0500 (Thu, 28 Dec 2023)
EndedAt: 2023-12-28 16:46:06 -0500 (Thu, 28 Dec 2023)
EllapsedTime: 2092.9 seconds
RetCode: 0
Status: OK
Package: csaw
Version: 1.36.1
Command: /Library/Frameworks/R.framework/Resources/bin/R CMD INSTALL csaw
StartedAt: 2023-12-30 10:42:36 -0500 (Sat, 30 Dec 2023)
EndedAt: 2023-12-30 10:44:10 -0500 (Sat, 30 Dec 2023)
EllapsedTime: 94.3 seconds
RetCode: 0
Status: OK

Connection issues

Connection Error on 11/13/23, 12/07/23, 12/20/23:

BBS> ==============================================================
BBS>   (Re)make BBS_CENTRAL_BASEURL/products-in/kjohnson3/... OK
BBS> [STAGE2] STARTING STAGE2 at Wed Dec 20 17:24:55 2023
Traceback (most recent call last):
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/urllib/request.py", line 1346, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/http/client.py", line 1257, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/http/client.py", line 1303, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/http/client.py", line 1252, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/http/client.py", line 1012, in _send_output
    self.send(msg)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/http/client.py", line 952, in send
    self.connect()
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/http/client.py", line 923, in connect
    self.sock = self._create_connection(
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/socket.py", line 843, in create_connection
    raise err
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/socket.py", line 831, in create_connection
    sock.connect(sa)
OSError: [Errno 51] Network is unreachable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/biocbuild/BBS/BBS-run.py", line 785, in <module>
    STAGE2()
  File "/Users/biocbuild/BBS/BBS-run.py", line 397, in STAGE2
    BBSbase.waitForTargetRepoToBeReady()
  File "/Users/biocbuild/BBS/BBSbase.py", line 75, in waitForTargetRepoToBeReady
    f = urllib.request.urlopen(PACKAGES_url)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/urllib/request.py", line 214, in urlopen
    return opener.open(url, data, timeout)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/urllib/request.py", line 517, in open
    response = self._open(req, data)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/urllib/request.py", line 534, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/urllib/request.py", line 494, in _call_chain
    result = func(*args)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/urllib/request.py", line 1375, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/urllib/request.py", line 1349, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 51] Network is unreachable>

Other areas to explore

  • Any additional software added by IT
@jwokaty
Copy link
Collaborator Author

jwokaty commented Jan 5, 2024

I think the cronjobs use the efficiency cores rather than the performance cores. M2 Ultras have 16 p-cores and 8 e-cores, the latter are used, for example, with background processes.

I ran R CMD check --timings on csaw via a cron job and manually. In the excerpts from both csaw.Rcheck/csaw-Ex.timings, the manual run is faster:

# manual
name    user    system  elapsed
SEmethods       0.090   0.005   0.094
calculateCPM    0.064   0.006   0.070
checkBimodality 0.113   0.009   0.122
clusterFDR      0.217   0.001   0.218
clusterWindows  0.452   0.002   0.455
clusterWindowsList      0.640   0.001   0.641
combineTests    0.025   0.001   0.025
correlateReads  0.057   0.001   0.058
csawUsersGuide  0       0       0
defunct 0.000   0.000   0.001
detailRanges    5.354   0.112   5.465

# cron
name    user    system  elapsed
SEmethods       0.266   0.018   0.284   
calculateCPM    0.202   0.004   0.207   
checkBimodality 0.451   0.020   0.471
clusterFDR      1.514   0.032   1.546
clusterWindows  1.438   0.027   1.465
clusterWindowsList      2.019   0.004   2.023
combineTests    0.077   0.003   0.079
correlateReads  0.221   0.012   0.234
csawUsersGuide  0       0       0
defunct 0.001   0.000   0.000
detailRanges    14.013   0.406  14.425

We can check taskinfo when running R CMD check (note the run times don't reflect the full process run time I think, just when I had executed taskinfo during the process run). I'm not sure how to read all of it, but notice the effective policies differ in that cronjob's effective clamp (or lowest quality of service as opposed to the ceiling, which I am guessing is the best possible qos?) is utility vs the manual run's unspecified, which is closer to the default.

# cron
process: "bash" [63150] [unique ID: 17070513]
architecture: arm64e
coalition (type 0) ID: 383
coalition (type 1) ID: 384
suspend count: 0
virtual bytes: 389.58 GB; phys_footprint bytes: 1.56 MB; phys_footprint lifetime maximum bytes: 1.56 MB
run time: 45 s
user/system time    (current threads): 0.008090 s / 0.023988 s
user/system time (terminated threads): 0.000000 s / 0.000000 s
P-time: 0.000000 s (0%)
P/E switches: 19
energy used (nJ): 4809957
interrupt wakeups: 1 (0 / 0.00% from platform idle)
default sched policy: POLICY_TIMESHARE
CPU usage monitor: none
CPU wakes monitor: 150 wakes per second (over system-default time period)
dirty tracking: untracked  dirty
boosts: 0 (0 externalized)
requested policy
        req apptype: TASK_APPTYPE_DAEMON_STANDARD
        req role: TASK_UNSPECIFIED (PRIO_DARWIN_ROLE_DEFAULT)
        req qos clamp: THREAD_QOS_UTILITY
        req base/override latency qos: LATENCY_QOS_TIER_UNSPECIFIED / LATENCY_QOS_TIER_UNSPECIFIED
        req base/override thruput qos: THROUGHPUT_QOS_TIER_UNSPECIFIED / THROUGHPUT_QOS_TIER_UNSPECIFIED
        req darwin BG: NO
        req internal/external iotier: THROTTLE_LEVEL_TIER0 (IMPORTANT) / THROTTLE_LEVEL_TIER0 (IMPORTANT)
        req darwin BG iotier: THROTTLE_LEVEL_TIER2 (UTILITY)
        req managed: NO
        req other:
        req suppression (App Nap) behaviors:
effective policy
        eff role: TASK_UNSPECIFIED (PRIO_DARWIN_ROLE_DEFAULT)
        eff latency qos: LATENCY_QOS_TIER_3
        eff thruput qos: THROUGHPUT_QOS_TIER_2
        eff darwin BG: NO
        eff iotier: THROTTLE_LEVEL_TIER1 (STANDARD)
        eff managed: NO
        eff qos ceiling: THREAD_QOS_USER_INITIATED
        eff qos clamp: THREAD_QOS_UTILITY
        eff other:
imp_donor: YES
imp_receiver: NO
pid suspended: NO
adaptive daemon: NO (boosted: NO)

# manual
process: "bash" [61814] [unique ID: 17069177]
architecture: arm64e
coalition (type 0) ID: 351227
coalition (type 1) ID: 351228
suspend count: 0
virtual bytes: 389.70 GB; phys_footprint bytes: 1.59 MB; phys_footprint lifetime maximum bytes: 1.59 MB
run time: 62 s
user/system time    (current threads): 0.008862 s / 0.019900 s
user/system time (terminated threads): 0.000000 s / 0.000000 s
P-time: 0.000000 s (0%)
P/E switches: 6
energy used (nJ): 1908027
interrupt wakeups: 0 (0 / 0.00% from platform idle)
default sched policy: POLICY_TIMESHARE
CPU usage monitor: none
CPU wakes monitor: 150 wakes per second (over system-default time period)
dirty tracking: untracked  dirty
boosts: 0 (0 externalized)
requested policy
        req apptype: TASK_APPTYPE_DAEMON_INTERACTIVE
        req role: TASK_UNSPECIFIED (PRIO_DARWIN_ROLE_DEFAULT)
        req qos clamp: THREAD_QOS_UNSPECIFIED
        req base/override latency qos: LATENCY_QOS_TIER_UNSPECIFIED / LATENCY_QOS_TIER_UNSPECIFIED
        req base/override thruput qos: THROUGHPUT_QOS_TIER_UNSPECIFIED / THROUGHPUT_QOS_TIER_UNSPECIFIED
        req darwin BG: NO
        req internal/external iotier: THROTTLE_LEVEL_TIER0 (IMPORTANT) / THROTTLE_LEVEL_TIER0 (IMPORTANT)
        req darwin BG iotier: THROTTLE_LEVEL_TIER2 (UTILITY)
        req managed: NO
        req other:
        req suppression (App Nap) behaviors:
effective policy
        eff role: TASK_UNSPECIFIED (PRIO_DARWIN_ROLE_DEFAULT)
        eff latency qos: LATENCY_QOS_TIER_UNSPECIFIED
        eff thruput qos: THROUGHPUT_QOS_TIER_UNSPECIFIED
        eff darwin BG: NO
        eff iotier: THROTTLE_LEVEL_TIER0 (IMPORTANT)
        eff managed: NO
        eff qos ceiling: THREAD_QOS_USER_INITIATED
        eff qos clamp: THREAD_QOS_UNSPECIFIED
        eff other:
imp_donor: YES
imp_receiver: NO
pid suspended: NO
adaptive daemon: NO (boosted: NO)

https://apple.stackexchange.com/questions/431120/how-to-force-a-terminal-command-to-run-with-the-performance-cores-on-apple-silic mentions using taskpolicy to change polcies, but the best we can set the clamp to is utility, which is what it seems to be now. This answer says cronjobs will run on efficiency cores because they're background processes, but it also mentions how some of these things are exposed in Python and we might be able to set the priority this way; however, my toy script that used cron still produced the same results so I wonder if it is still inheriting the policies from cron, which is what https://apple.stackexchange.com/questions/454687/apple-m2-to-use-performance-cores-when-plugged-in#454792 suggests: we can only demote (or allow it to use both types of cores) and qos is set by the application code at creation.

Next inquiry: Since application code can set policies is there a way make the bbs code like an application? Is there another way to run the script that has a higher policy than cron?

Resources

@jwokaty
Copy link
Collaborator Author

jwokaty commented Jan 5, 2024

I replicated the same experiment on kjohnson1 and produced similar timings. When I look at taskinfo for the cronjob, the P/E switches is 1 whereas the cronjob on kjohnson3 had 16 (manual had 6). On kjohnson1, the cronjob also had P-time (is P-time performance core time?), whereas both approaches on kjohnson3 had no P-time:

kjohnson1:~ biocbuild$ sudo taskinfo 26901
Password:
process: "bash" [26901] [unique ID: 60648590]
architecture: arm64e
coalition (type 0) ID: 385
coalition (type 1) ID: 386
suspend count: 0
virtual bytes: 389.57 GB; phys_footprint bytes: 1.19 MB; phys_footprint lifetime maximum bytes: 1.19 MB
run time: 40 s
user/system time    (current threads): 0.007539 s / 0.015860 s
user/system time (terminated threads): 0.000000 s / 0.000000 s
P-time: 0.009711 s (41%)
P/E switches: 1
energy used (nJ): 4200829
interrupt wakeups: 0 (0 / 0.00% from platform idle)
default sched policy: POLICY_TIMESHARE
CPU usage monitor: none
CPU wakes monitor: 150 wakes per second (over system-default time period)
dirty tracking: untracked  dirty
boosts: 0 (0 externalized)
requested policy
	req apptype: TASK_APPTYPE_DAEMON_STANDARD
	req role: TASK_UNSPECIFIED (PRIO_DARWIN_ROLE_DEFAULT)
	req qos clamp: THREAD_QOS_UTILITY
	req base/override latency qos: LATENCY_QOS_TIER_UNSPECIFIED / LATENCY_QOS_TIER_UNSPECIFIED
	req base/override thruput qos: THROUGHPUT_QOS_TIER_UNSPECIFIED / THROUGHPUT_QOS_TIER_UNSPECIFIED
	req darwin BG: NO  
	req internal/external iotier: THROTTLE_LEVEL_TIER0 (IMPORTANT) / THROTTLE_LEVEL_TIER0 (IMPORTANT)
	req darwin BG iotier: THROTTLE_LEVEL_TIER2 (UTILITY)
	req managed: NO
	req other: 
	req suppression (App Nap) behaviors: 
effective policy
	eff role: TASK_UNSPECIFIED (PRIO_DARWIN_ROLE_DEFAULT)
	eff latency qos: LATENCY_QOS_TIER_3
	eff thruput qos: THROUGHPUT_QOS_TIER_2
	eff darwin BG: NO
	eff iotier: THROTTLE_LEVEL_TIER1 (STANDARD)
	eff managed: NO
	eff qos ceiling: THREAD_QOS_USER_INITIATED
	eff qos clamp: THREAD_QOS_UTILITY
	eff other: 
imp_donor: YES
imp_receiver: NO
pid suspended: NO
adaptive daemon: NO (boosted: NO)

But maybe this is because there are only 2 efficiency cores on kjohnson1:

kjohnson1:~ biocbuild$ system_profiler SPHardwareDataType
Hardware:

    Hardware Overview:

      Model Name: Mac Studio
      Model Identifier: Mac13,1
      Model Number: Z14J000H4LL/A
      Chip: Apple M1 Max
      Total Number of Cores: 10 (8 performance and 2 efficiency)
      Memory: 64 GB
      System Firmware Version: 10151.41.12
      OS Loader Version: 8422.141.2.700.1
      Serial Number (system): T17H4CVN9D
      Hardware UUID: 679E2FD9-24C7-5AC0-AECF-9753125F26EF
      Provisioning UDID: 00006001-001C31521E69401E
      Activation Lock Status: Disabled

I wonder if because there are so few efficiency cores that kjohnson1 has to use performance cores, even though the qos policies are the same. I'd be interested to see results for taskinfo and powermetrics for some of the build.

@jwokaty
Copy link
Collaborator Author

jwokaty commented Jan 5, 2024

Example of powermetrics while running same cronjob as above:

kjohnson3:~ biocbuild$ sudo powermetrics --samplers tasks --show-process-coalition --show-process-qos --show-cpu-qos --show-cpu-scalability --show-process-qos-tier -n 1 -i 5000                                                              
Machine model: Mac14,8                                                                                                                                                                                                                        
OS version: 22G313                                                                                                                                                                                                                            
Boot arguments:                                                                                                                                                                                                                               
Boot time: Mon Nov 27 17:11:01 2023                                                                                                                                                                                                           
                                                                                                                                                                                                                                              
                                                                                                                                                                                                                                              
                                                                                                                                                                                                                                              
*** Sampled system activity (Fri Jan  5 07:21:06 2024 -0500) (5028.27ms elapsed) ***                                                                                                                                                          
                                                                                                                                                                                                                                              
*** Running tasks ***                                                                                                                                                                                                                         
                                                                                                                                                                                                                                              
Name                               ID     CPU ms/s  User%  Deadlines (<2 ms, 2-5 ms)  Wakeups (Intr, Pkg idle)  QoS (L/T) QOS (ms/s) Disabled Maint   BG      Util   Default  U-Init  U-Intr                                                  
com.vix.cron                       383    981.38                                      0.00    0.00                                                                                                                                            
  R                                75312  983.49    94.21  0.00    0.00               0.00    0.00               3/ 2                 0.00    0.00    0.00    983.49  0.00    0.00    0.00                                                    
  R                                74809  2.94      59.76  0.00    0.00               0.00    0.00               3/ 2                 0.00    0.00    0.00    2.94    0.00    0.00    0.00                                                    
  bash                             75311  1.85      24.21  0.00    0.00               0.00    0.00               3/ 2                 0.00    0.00    0.00    1.85    0.00    0.00    0.00                                                    
  cron                             74796  0.00      44.66  0.00    0.00               0.00    0.00               3/ 2                 0.00    0.00    0.00    0.00    0.00    0.00    0.00                                                    
DEAD_TASKS_COALITION                      375.42                                      0.00    0.00                                                                                                                                            
  DEAD_TASKS                       -1     375.42    90.96  0.00    0.00               0.00    0.00              -1/-1                 0.00    0.00    0.00    0.00    0.00    0.00    0.00                                                    

Resources

@jwokaty
Copy link
Collaborator Author

jwokaty commented Jan 10, 2024

It's still not clear why kjohnson1 employs performance cores during parts of the cron. There are fewer switches and P-time when checking taskinfo for the cronjob's PID.

Running the command that is in the cronjob for kjohnson3, the processes use the default qos (user-initiated), which allows the build to complete every other day. We could try to run the commands in the crojob with something other than cron to avoid the processes being labeled as utility. I looked at cron alternatives, particularly fcron, but the set up for mac seems a little complicated. A lot of alternatives still use cron under the hood. I will try running the commands in a python script that I will start in a tmux session to see if it keeps the same user-initiated qos. If it does, we should look at how to make this more robust with a cron replacement or a better script.

@jwokaty
Copy link
Collaborator Author

jwokaty commented Feb 29, 2024

Investigating Crashes on Apple Silicon

  • Logs
    • /var/log/syslog
    • /var/log/DiagnosticMessages (stores only last 4 days? in spindump format)
    • /var/log/fsuck_apfs.log - shows result of fsck; can use diskutil to attempt resolution of issues.
    • /var/log/shutdown_monitor.log
    • /Library/Logs - some crash files
    • pmset - power management settings. use pmset -g cap to show capabilities, pmset -g to show settings. Generally, I think we want 1 = autorestart, ttyskeepawake,tcpkeepalive; 0 = standby, powernap, networkoverlseep, disksleep, sleep, displaysleep, lowpowermode. https://eclecticlight.co/ has more info about pmset
kjohnson3:~ biocbuild$ pmset -g
System-wide power settings:
 DestroyFVKeyOnStandby		0
Currently in use:
 standby              0
 Sleep On Power Button 1
 autorestart          1
 SleepServices        0
 powernap             0
 networkoversleep     0
 disksleep            0
 sleep                0 (sleep prevented by powerd, apsd)
 ttyskeepawake        1
 displaysleep         0
 tcpkeepalive         1
 lowpowermode         0
 womp                 1

@jwokaty
Copy link
Collaborator Author

jwokaty commented Apr 8, 2024

Closing as it seems to be resolved after a clean install, a new R, a build script, and different configuration at 4a931eb and 619cc6b.

@jwokaty jwokaty closed this as completed Apr 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant