Take the plunge into distributed machine learning training with Spark, PyTorch and TensorFlow

Hi there 👋

A wee bit about me: I am an experienced Software Engineer and people manager with technical expertise in Apache Spark, HDFS, AWS, Azure, machine learning, and distributed large-scale systems.

I'm highly motivated and always excited about solving problems and learning, with a curious, positive, can-do attitude. I have driven success and improvement for both distributed machine systems and people systems: optimizing Spark clusters to deliver +350% throughput at Akamai scale (billions of events a day, processing 1.3 PB), saving the company money on compute, and shortening complex ML model deployment from months to a 2–3 day cycle. I did this by aligning people-based systems, influencing strategic software integrations, and adopting software best practices.

I have been honored with the Beacon award in the Databricks Ambassadors Program, a testament to my commitment to contributing to data and AI technologies and sharing my expertise with others.


🔭 Industry Contributions

virtual-kubelet-kotlin-spring - how to leverage Virtual Kubelet and manage serverless services from your Kubernetes cluster.

build-e2e-ml-bigdata - a full end-to-end application for creating machine learning pipelines on top of Parquet-compressed data, leveraging cloud services.

Author of O’Reilly’s book: Scaling Machine Learning with Spark: Distributed ML with MLlib, TensorFlow, and PyTorch, Adi Polak, 2023.

“Small Files in a Big Data World.” Chapter by Adi Polak, in 97 Things Every Data Engineer Should Know. Edited by Tobias Macey. O’Reilly, 2021: 131-133.

“Three Important Distributed Programming Concepts.” Chapter by Adi Polak, in 97 Things Every Data Engineer Should Know. Edited by Tobias Macey. O’Reilly, 2021: 175-176.

“Deploying Kubernetes in an Enterprise Environment.” In Kubernetes in the Enterprise Trends Report. DZone, 2020.

“Big Data Building Blocks: Selecting Architectures and Open-Source Frameworks.” In DZone 2019 Guide to Big Data. DZone, 2019.

Technical reviewer for Delta Lake: The Definitive Guide. O’Reilly Media and Databricks, upcoming book, 2024.

Technical reviewer for Fundamentals of Data Observability. O’Reilly Media and Andy Petrella, 2023.

Technical reviewer for Introducing MLOps: How to Scale Machine Learning in the Enterprise. O’Reilly Media and Dataiku, 2020.

Committee member at conferences: Scale By the Bay 2021 & 2023; Data & AI/Spark Summit 2021, 2022 & 2023; Voxxed Days Australia 2021.

🌱 Teaching Experience

“Apache Spark ML First Steps. How to Build Your Own Machine Learning Model at Scale.” Presentation for O'Reilly Media, Inc., July 15, 2020.

“Demystifying Scalable Machine Learning with the Spark Ecosystem.” AI Superstream Series: Scaling AI. Course for O'Reilly Media, Inc., September 2021.

“CI/CD for Data Lakes, Managing your data like code.” Presentation for O'Reilly Media, Inc., December 7, 2022.

“Scaling Machine Learning in 3 weeks.” Three-week course for O'Reilly Media, Inc., February 10, 17 & 24, 2023.

👯 More Volunteering Activities

FlipCon – co-organization of functional programming conference, 2018.

KotlinTLV – co-leading the KotlinTLV meetup group, 2019.

She Codes – Nationwide Director of Coding Skills, March 2017 to October 2018.

BIPA – Team Lead at Germany - Bavaria Israel Partnership Accelerator, driving innovative solutions to traditional markets from 2016 to 2017.

📝 Articles

“Unlock The Full Business Value Of Data With A Better Engineering Process,” Forbes. May 26, 2022.

“COVID-19 and Mining Social Media - Enabling Machine Learning Workloads with Big Data,” InfoQ. October 2, 2022.

“What is Serverless SQL? And How to Use it for Data Exploration,” Towards Data Science. December 1, 2020.

“What is TensorFrames? TensorFlow + Apache Spark,” Microsoft Azure. March 25, 2019.

“Data at Scale: Learn How Predicate Pushdown Will Save You Money,” Microsoft Azure. December 18, 2018.

“Apache Spark — Catalyst Deep Dive,” Microsoft Azure. November 13, 2018.


  1. ml-with-apache-spark (Public)

    A series of Jupyter notebooks that walk you through machine learning with the Apache Spark ecosystem using Spark MLlib, PyTorch and TensorFlow.

    Jupyter Notebook 71 19

  2. Data-Engineering (Public)

    A comprehensive collection of educational content for aspiring Data Engineers

    5 1

  3. ms-build-e2e-ml-bigdata (Public)

    This repository contains tutorials and resources for reproducing the Microsoft Build 2020 session "Building an End-to-End ML Pipeline for Big Data".

    Python 17 6

  4. virtual-kubelet-kotlin-spring-demo (Public)

    This is a Kotlin web app built with Spring Boot and configured to run on a Kubernetes cluster with Virtual Kubelet.

    HTML 4 1

  5. vr-blog (Public)

    Vr-blog example

    2 1

  6. play-with-ray (Public)

    Experimenting with the ray project - Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a toolkit of libraries (Ray AIR) for accelerat…

    Jupyter Notebook 1