# tutorial-spark-on-k8s
> Goal: submit a PySpark application to a k8s cluster on GCP and check the result.

1. Preliminary review
2. Develop a simple PySpark application and run it locally
3. Set up a k8s cluster on GCP
4. Set up GCP Artifact Registry
5. Create a k8s ServiceAccount for the Spark driver
6. Build and push the PySpark application image
7. spark-submit on k8s

## 1. Preliminary review

* References
  * [Running Spark applications on Kubernetes (including Kafka Helm chart installation)](https://heartsavior.medium.com/kubernetes-%EC%97%90%EC%84%9C-spark-%EC%96%B4%ED%94%8C%EB%A6%AC%EC%BC%80%EC%9D%B4%EC%85%98-%EC%8B%A4%ED%96%89%ED%95%98%EA%B8%B0-kafka-helm-chart-%EC%84%A4%EC%B9%98-%ED%8F%AC%ED%95%A8-8f47f48419c0)
  * [Spark on Kubernetes(Google Kubernetes Env) : custom Python](https://firstheart.tistory.com/entry/Spark-on-Kubernetes-custom-Python-source-%EC%8B%A4%ED%96%89)
  * [My Journey With Spark On Kubernetes... In Python (1/3)](https://dev.to/stack-labs/my-journey-with-spark-on-kubernetes-in-python-1-3-4nl3)
* Ways to run Spark on k8s: spark-submit vs. Spark Operator (this tutorial uses spark-submit)

## 2. Develop a simple PySpark application and run it locally

```bash
conda install -c conda-forge conda-pack
conda create -n mlops python=3.11
conda activate mlops

pip install pyspark
```

main.py
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import rand

spark = SparkSession.builder \
    .appName("Random Number Generator") \
    .getOrCreate()

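# Build a DataFrame with a single `id` column holding 0..999.
df = spark.range(1000)

# Add a column of uniform random values in [0, 1).
df = df.withColumn("random_number", rand())

# Redistribute the rows across 5 partitions.
df = df.repartition(5)

# Print the first 20 rows.
df.show()

spark.stop()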
```

```bash
conda pack -n mlops -o environment.tar.gz
/Users/user/playground/spark/spark-3.4.0-bin-hadoop3/bin/spark-submit --archives environment.tar.gz#environment main.py
```

## 3. Set up a k8s cluster on GCP

```bash
export CLUSTER_NAME=image-semantic-search
# us-central1-a is a zone (the region is us-central1)
export ZONE=us-central1-a

gcloud components update
gcloud config set compute/zone $ZONE

gcloud container clusters create $CLUSTER_NAME \
    --enable-autoscaling \
    --min-nodes=2 \
    --num-nodes=4 \
    --max-nodes=4 \
    --node-locations=$ZONE \
    --machine-type=e2-medium

export KUBECONFIG=~/.kube/config
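kubectl config get-contexts
# Switch to the GKE context; use the exact name listed by get-contexts.
kubectl config use-context gke_white-outlook-389109_us-central1-a_image-semantic
# Print the control plane URL (needed later as the Spark master address).
kubectl cluster-info | grep 'Kubernetes control plane' | awk '{print $7}'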
```
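
If the cluster's context is missing from `~/.kube/config` (for example on a different machine), the credentials can be fetched again and the connection checked. The following is a minimal sketch, not part of the original steps; `connect_to_cluster` is a hypothetical helper name, and it assumes `gcloud` and `kubectl` are installed and authenticated.

```bash
# Hypothetical helper: write the cluster's credentials into kubeconfig,
# then list the nodes to confirm that kubectl can reach the cluster.
connect_to_cluster() {
  local cluster="${1:?cluster name required}"
  local zone="${2:?zone required}"

  gcloud container clusters get-credentials "$cluster" --zone "$zone" &&
    kubectl get nodes
}
```

With the values used above, the call would be `connect_to_cluster image-semantic-search us-central1-a`.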

## 4. Set up GCP Artifact Registry

```bash
gcloud artifacts repositories create image-semantic-search \
    --location=us-central1 \
    --repository-format=docker
```

## 5. Create a k8s ServiceAccount for the Spark driver

k8s/spark-sa-rbac.yaml
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark-sa
  namespace: spark
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: spark
  name: spark-role
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["*"]
- apiGroups: [""]
  resources: ["services"]
  verbs: ["*"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-role-binding
  namespace: spark
subjects:
- kind: ServiceAccount
  name: spark-sa
  namespace: spark
roleRef:
  kind: Role
  name: spark-role
  apiGroup: rbac.authorization.k8s.io
```

```bash
kubectl create namespace spark
kubectl create -f k8s/spark-sa-rbac.yaml
```
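
Before submitting a job, it can help to confirm that the ServiceAccount really has the permissions the driver will use. The sketch below impersonates `spark-sa` with `kubectl auth can-i`; `check_spark_sa_permissions` is a hypothetical helper name, and the calling user is assumed to be allowed to impersonate ServiceAccounts. `configmaps` is included on the assumption that Spark 3.x drivers also create a ConfigMap for executor configuration. If that row prints `no` and the driver later fails with a "forbidden" error, adding `configmaps` to `spark-role` is a likely fix.

```bash
# Hypothetical helper: print yes/no for each verb/resource pair that the
# Spark driver is assumed to need, impersonating the ServiceAccount.
check_spark_sa_permissions() {
  local ns="${1:-spark}"
  local sa="${2:-spark-sa}"
  local resource verb denied=0

  for resource in pods services configmaps; do
    for verb in create get list watch delete; do
      printf '%-22s' "${verb} ${resource}"
      # kubectl prints "yes" or "no" and exits non-zero on "no".
      kubectl auth can-i "$verb" "$resource" \
        --namespace "$ns" \
        --as "system:serviceaccount:${ns}:${sa}" || denied=1
    done
  done
  return "$denied"
}
```

Calling `check_spark_sa_permissions` without arguments checks `spark-sa` in the `spark` namespace; the function returns non-zero if any permission is denied.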

## 6. Build and push the PySpark application image

```bash
export IMAGE_REGISTRY=us-central1-docker.pkg.dev/white-outlook-389109/image-semantic-search
export SPARK_HOME=/Users/user/playground/spark/spark-3.4.0-bin-hadoop3

# Make sure the Docker daemon is running before building.

conda pack -n mlops -o environment.tar.gz
DOCKER_BUILDKIT=1 docker build -t $IMAGE_REGISTRY:20230615 -f Dockerfile .
```
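
The block above builds the image but does not push it. A minimal sketch of the push step follows; `push_spark_image` is a hypothetical helper name, and it assumes `gcloud` is already authenticated. Artifact Registry Docker paths have the form `LOCATION-docker.pkg.dev/PROJECT/REPOSITORY/IMAGE:TAG`, so if the push is rejected, an image name segment may need to be appended after the repository path in `IMAGE_REGISTRY`.

```bash
# Hypothetical helper: register gcloud as the Docker credential helper for
# the registry host, then push the given image.
push_spark_image() {
  local image="${1:?usage: push_spark_image IMAGE[:TAG]}"
  # Everything before the first "/" is the registry host,
  # e.g. us-central1-docker.pkg.dev
  local registry_host="${image%%/*}"

  gcloud auth configure-docker "$registry_host" --quiet &&
    docker push "$image"
}
```

With the tag built above, the call would be `push_spark_image "$IMAGE_REGISTRY:20230615"`.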

## 7. spark-submit on k8s
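
A minimal sketch of the submit step, under stated assumptions rather than a verified command: `submit_pyspark_on_k8s` is a hypothetical helper name, the application is assumed to be baked into the image at `/opt/spark/work-dir/main.py` (the Dockerfile is not shown in this tutorial, so that path is a guess), and two executors are requested arbitrarily. The namespace and ServiceAccount are the ones created in step 5.

```bash
# Hypothetical helper: submit the PySpark application in cluster mode.
#   $1: control plane URL, e.g. the output of the cluster-info command in step 3
#   $2: container image pushed in step 6
#   $3: application path inside the image (assumed default below)
submit_pyspark_on_k8s() {
  local master_url="${1:?control plane URL required}"
  local image="${2:?container image required}"
  local app_path="${3:-local:///opt/spark/work-dir/main.py}"

  "${SPARK_HOME:?SPARK_HOME is not set}/bin/spark-submit" \
    --master "k8s://${master_url}" \
    --deploy-mode cluster \
    --name random-number-generator \
    --conf spark.kubernetes.namespace=spark \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-sa \
    --conf spark.kubernetes.container.image="${image}" \
    --conf spark.executor.instances=2 \
    "${app_path}"
}
```

A call would look like `submit_pyspark_on_k8s "<control plane URL>" "$IMAGE_REGISTRY:20230615"`. Once the driver pod has finished, the output of `df.show()` should appear in its log, which can be read with `kubectl -n spark logs -l spark-role=driver --tail=-1`.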