Linux systems engineer specializing in HPC cluster administration, GPU compute infrastructure, and research platform engineering.
- HPC Cluster Operations — Multi-node GPU cluster running Slurm, Open OnDemand, Rocky Linux 9, Active Directory/SSSD integration
- GPU Infrastructure — Deployment and management of NVIDIA A5000, A6000, RTX 3090, P40 nodes; Apptainer/Singularity container workflows
- Automation & IaC — AWX/Ansible playbooks for node provisioning, driver installation, cluster finalization
- Storage — TrueNAS/ZFS NFS backends, XFS scratch filesystem management across compute nodes
- Monitoring — Prometheus + Grafana + Loki/Promtail stack for cluster observability
- Tooling — Python TUI tools (Textual) for Slurm administration, config sync, and scratch auditing
Linux (Rocky 9 / RHEL), Slurm, Open OnDemand, Ansible, Python, Bash,
Active Directory / SSSD, TrueNAS / ZFS, Prometheus, Grafana, Loki, NVIDIA CUDA
HPC Infrastructure & Platform Engineer — Electrical & Computer Engineering Dept., Ben-Gurion University of the Negev