Add Spark Job (kubeflow#1467)
* Add a Spark operator to Kubeflow along with integration tests.

Start adding converted spark operator elements

Can generate empty service account for Spark

Create the service account for the spark-operator.

Add clusterRole for Spark

Add cluster role bindings for Spark

Add deployment (TODO: clean up name/image)

Can now launch spark operator

Put in a reasonable default for namespace (e.g. default, not null) and make the image used for spark-operator configurable

Start working to add job type

We can now launch an operator and launch a job, but the service accounts don't quite line up. TODO(holden): refactor the service accounts for the job to only be created in the job, move sparkJob inside of all.json as well, and then maybe have an all / operator / job entry point in all.json?

Add two hacked-up temporary test scripts for use during dev (TODO: refactor later into a proper workflow)

Able to launch a job; fixed coreLimit and added job arguments. Remaining TODOs are handling of nulls, the svc account hack, and test cleanup.

Start trying to re-organize the operator/job

Fix handling of optional jobArguments and mainClass and now it _works_ :)

Auto format the ksonnet.

Reviewer feedback: switch the description of the Spark operator to something meaningful, use the sparkVersion param instead of the hardcoded v2.3.1-v1alpha1, and fix the hardcoded namespace.

Clarify the jobName param; remove the "Fix this" note since it has been integrated into all.libsonnet as intended.

CR feedback: fix description typo and add optional sparkVersion param to the spark operator

Start trying to add spark tests to test_deploy.py

At @kunmingg's suggestion, Revert "Start trying to add spark tests to test_deploy.py" to focus on prow tests.

This reverts commit 912a763.

Start trying to add Spark to the e2e workflow for testing

Looks like the prow tests call into the python tests normally, so Revert "At @kunmingg's suggestion, Revert "Start trying to add spark tests to test_deploy.py" to focus on prow tests."

This reverts commit 6c4c81f.

autoformat jsonnet

s/core/common/ and add /var/log/syslog to the README

Race condition on first deployment

Start adding SparkPI job to the workflow test.

Generate spark operator during CI as well.

Fix deploy kf indent

Already covered by deploy.

Install spark operator

Revert "Install spark operator"

This reverts commit cc559dd.

Test against the PR not master.

Fix string concat

Take spark-deploy out of workflows since it's covered in the kf presubmit anyway.

Debug commit; revert later.

idk what's going on for real.

hax

Ok, let's use where the symlink was coming from, idk.

Debug deploy kubeflow call...

Print in main too.

Specify a name.

name

Get all.

More debugging; also, why do we edit app.yaml directly?

don't gen common

import for debug

Just do spark-operator as verbose.

spelling

hmm, namespace looked weird; let's run pytorch in verbose too so I can compare

put verbose at the end

Autoformat the json

Add a deployment scope and give more things a namespace

Format.

Gen pytorch and spark ops as verbose

idk wtf this is.

Don't deploy the spark job in the releaser test

no kfctl test either.

Just use name

We don't append any junk anymore

format json

Don't do spark in deploy_kubeflow anymore

Spark job deployment with workflows

Apply spark operator.

Add a sleep hack

Fix multi-line

add a working dir for the ks app

temp debug

garbage

specify working dir

Working dir was not happy, just cd cause why not

testdir not appDir

change to tests.testDir

Move operator deployment

Make sure we are in the ks_app?

Remove debugging and YOLO

90% less YOLO

Add that comma

Change deps

well, cd seems to work in the other command, so uhhh who knows?

Use runpath + pushd instead of kfctl generate

Just generate for now

Do both

Generate k8s

Install operator

Break down setting up the spark operator into different steps

We are in default rather than ghke

Use the run script to do the deploy

Change the namespace to stepsNamespace and add a debug step cause idk

Append the params to generate cmd

Remove params_str since we're passing a list and a namespace param

s/extends/extend/

Move params to the right place

Remove debug cluster step

Remove local test since we now use the regular e2e argo triggered tests.

Respond to the CR feedback

Fix parameterization of spark executor config.

Plumb through spark version to executor version label

Remove unnecessary whitespace change in an otherwise unmodified file.

* re-run autoformat

* default doesn't seem to exist anymore

* Debug the env list cause it changed

* re-run autoformat again

* Specify the env since env list shows default env is the only env present.

* Remove debug env list since the operator now works

* autoformat and indent default

* Address CR feedback: remove deploymentScope and just use clusterRole, link to the upstream base operator in the doc, and remove the downstream job test since it's triggered in both minikube and kfctl tests and we don't want to test it in minikube right now

* Take the spark job out of the workflows in the components test; we just test that the operator applies for now.

* Remove namespace as a param and just use the env.

* Fix end of line on namespace from ; to ,
holdenk authored and k8s-ci-robot committed Feb 14, 2019
1 parent a3930cb commit 7f9ae3f
Showing 6 changed files with 466 additions and 9 deletions.
3 changes: 3 additions & 0 deletions kubeflow/spark/README.md
@@ -0,0 +1,3 @@
A very early attempt at allowing Apache Spark to be used with Kubeflow.
Starts a container to run the driver program in, and the rest is up to the Spark on K8s integration.
Based on https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
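
For a sense of how this is meant to be configured, here is a hedged sketch of the ksonnet params a sparkJob component might use (the component name and all values below are illustrative, not part of this commit):

  // components/params.libsonnet (excerpt, hypothetical)
  sparkjob: {
    name: "spark-pi",  // also used as the driver's service account name
    jobName: "spark-pi",
    type: "Scala",
    image: "gcr.io/spark-operator/spark:v2.3.1",  // assumed job image
    mainClass: "org.apache.spark.examples.SparkPi",
    applicationResource: "local:///opt/spark/examples/jars/spark-examples_2.11-2.3.1.jar",
    jobArguments: "1000",
    sparkVersion: "v2.3.1",
    driverCores: 1,
    driverMemory: "512m",
    executorCores: 1,
    numExecutors: 1,
    executorMemory: "512m",
  },

These map one-to-one onto the fields read by parts(params, name, env) in all.libsonnet below.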
346 changes: 346 additions & 0 deletions kubeflow/spark/all.libsonnet
@@ -0,0 +1,346 @@
{
  // Define the various prototypes you want to support.
  // Each prototype should be a list of different parts that together
  // provide a useful function such as serving a TensorFlow or PyTorch model.
  all(params, name, env):: [
    $.parts(params, name, env).operatorServiceAccount,
    $.parts(params, name, env).operatorClusterRole(),
    $.parts(params, name, env).operatorClusterRoleBinding(),
    $.parts(params, name, env).deployment,
  ],

  sparkJob(params, name, env):: [
    $.parts(params, name, env).jobServiceAccount,
    $.parts(params, name, env).jobClusterRole,
    $.parts(params, name, env).jobClusterRoleBinding,
    $.parts(params, name, env).sparkJob,
  ],
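
  // For context (editorial sketch, not part of this commit): a ksonnet
  // prototype would typically consume these entry points roughly as
  //   local k = import "k.libsonnet";
  //   local all = import "kubeflow/spark/all.libsonnet";
  //   std.prune(k.core.v1.list.new(all.all(params, name, env)))
  // where params, name, and env are injected by ksonnet when the component
  // is expanded; the std.prune / list.new pattern is the usual kubeflow
  // prototype convention and is an assumption here.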

  // Parts should be a dictionary containing jsonnet representations of the various
  // K8s resources used to construct the prototypes listed above.
  parts(params, name, env):: {
    // All ksonnet environments are associated with a namespace and we
    // generally want to use that namespace for a component.
    // However, in some cases an application may use multiple namespaces in which
    // case the namespace for a particular component will be a parameter.
    local namespace = env.namespace,
    local mainClass = if params.mainClass == "null" then "" else params.mainClass,
    local jobArguments = if params.jobArguments == "null" then [] else std.split(params.jobArguments, ","),
    local sparkVersion = params.sparkVersion,
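
    // Note: ks CLI params arrive as strings, so the literal string "null" is
    // used above as an "unset" sentinel (inferred from these checks; not
    // documented in this commit). jobArguments is a comma-separated string,
    // e.g. "100,200" becomes ["100", "200"].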

    jobServiceAccount:: {
      apiVersion: "v1",
      kind: "ServiceAccount",
      metadata: {
        name: name,
        namespace: namespace,
      },
    },

    jobClusterRole:: {
      apiVersion: "rbac.authorization.k8s.io/v1beta1",
      kind: "Role",
      metadata: {
        namespace: namespace,
        name: name,
      },
      rules: [
        {
          apiGroups: [
            "",
          ],
          resources: [
            "pods",
          ],
          verbs: [
            "*",
          ],
        },
        {
          apiGroups: [
            "",
          ],
          resources: [
            "services",
          ],
          verbs: [
            "*",
          ],
        },
      ],
    },
    jobClusterRoleBinding:: {
      apiVersion: "rbac.authorization.k8s.io/v1beta1",
      kind: "RoleBinding",
      metadata: {
        name: name,
        namespace: namespace,
      },
      subjects: [
        {
          kind: "ServiceAccount",
          name: name,
          namespace: namespace,
        },
      ],
      roleRef: {
        kind: "Role",
        name: name,
        apiGroup: "rbac.authorization.k8s.io",
      },
    },
    operatorServiceAccount:: {
      apiVersion: "v1",
      kind: "ServiceAccount",
      metadata: {
        name: name,
        namespace: namespace,
      },
    },
    operatorClusterRole():: {
      local roleType = "ClusterRole",
      kind: roleType,
      apiVersion: "rbac.authorization.k8s.io/v1beta1",
      metadata: {
        labels: {
          app: "spark-operator",
        },
        name: name,
      },
      rules: [
        {
          apiGroups: [
            "",
          ],
          resources: [
            "pods",
          ],
          verbs: [
            "*",
          ],
        },
        {
          apiGroups: [
            "",
          ],
          resources: [
            "services",
            "configmaps",
          ],
          verbs: [
            "create",
            "get",
            "delete",
          ],
        },
        {
          apiGroups: [
            "",
          ],
          resources: [
            "nodes",
          ],
          verbs: [
            "get",
          ],
        },
        {
          apiGroups: [
            "",
          ],
          resources: [
            "events",
          ],
          verbs: [
            "create",
            "update",
            "patch",
          ],
        },
        {
          apiGroups: [
            "apiextensions.k8s.io",
          ],
          resources: [
            "customresourcedefinitions",
          ],
          verbs: [
            "create",
            "get",
            "update",
            "delete",
          ],
        },
        {
          apiGroups: [
            "admissionregistration.k8s.io",
          ],
          resources: [
            "mutatingwebhookconfigurations",
          ],
          verbs: [
            "create",
            "get",
            "update",
            "delete",
          ],
        },
        {
          apiGroups: [
            "sparkoperator.k8s.io",
          ],
          resources: [
            "sparkapplications",
            "scheduledsparkapplications",
          ],
          verbs: [
            "*",
          ],
        },
      ],
    },
    operatorClusterRoleBinding():: {
      apiVersion: "rbac.authorization.k8s.io/v1beta1",
      local bindingType = "ClusterRoleBinding",
      local roleType = "ClusterRole",
      kind: bindingType,
      metadata: {
        name: name,
      },
      subjects: [
        {
          kind: "ServiceAccount",
          name: name,
          namespace: namespace,
        },
      ],
      roleRef: {
        kind: roleType,
        name: name,
        apiGroup: "rbac.authorization.k8s.io",
      },
    },
    deployment:: {
      apiVersion: "apps/v1beta1",
      kind: "Deployment",
      metadata: {
        name: name,
        namespace: namespace,
        labels: {
          "app.kubernetes.io/name": name,
          "app.kubernetes.io/version": sparkVersion,
        },
      },
      spec: {
        replicas: 1,
        selector: {
          matchLabels: {
            "app.kubernetes.io/name": name,
            "app.kubernetes.io/version": sparkVersion,
          },
        },
        strategy: {
          type: "Recreate",
        },
        template: {
          metadata: {
            annotations: {
              "prometheus.io/scrape": "true",
              "prometheus.io/port": "10254",
              "prometheus.io/path": "/metrics",
            },
            labels: {
              "app.kubernetes.io/name": name,
              "app.kubernetes.io/version": sparkVersion,
              name: name,
            },
            initializers: {
              pending: [],
            },
          },
          spec: {
            serviceAccountName: name,
            containers: [
              {
                name: name,
                image: params.image,
                imagePullPolicy: "Always",
                command: [
                  "/usr/bin/spark-operator",
                ],
                ports: [
                  {
                    containerPort: 10254,
                  },
                ],
                args: [
                  "-logtostderr",
                  "-enable-metrics=true",
                  "-metrics-labels=app_type",
                ],
              },
            ],
          },
        },
      },
    },
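
    // Note: the prometheus.io/* pod annotations above advertise the metrics
    // endpoint exposed via containerPort 10254 together with the
    // -enable-metrics=true flag (a reading of this spec, not upstream docs).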
    // Job specific configuration
    sparkJob:: {
      apiVersion: "sparkoperator.k8s.io/v1alpha1",
      kind: "SparkApplication",
      metadata: {
        name: params.jobName,
        namespace: namespace,
      },
      spec: {
        type: params.type,
        mode: "cluster",
        image: params.image,
        imagePullPolicy: "Always",
        mainClass: mainClass,
        mainApplicationFile: params.applicationResource,
        arguments: jobArguments,
        volumes: [
          {
            name: "test-volume",
            hostPath: {
              path: "/tmp",
              type: "Directory",
            },
          },
        ],
        driver: {
          cores: params.driverCores,
          memory: params.driverMemory,
          labels: {
            version: sparkVersion,
          },
          serviceAccount: params.name,
          volumeMounts: [
            {
              name: "test-volume",
              mountPath: "/tmp",
            },
          ],
        },
        executor: {
          cores: params.executorCores,
          instances: params.numExecutors,
          memory: params.executorMemory,
          labels: {
            version: params.sparkVersion,
          },
          volumeMounts: [
            {
              name: "test-volume",
              mountPath: "/tmp",
            },
          ],
        },
        restartPolicy: "Never",
      },
    },
  },
}
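
The entry points can be sanity-checked without a cluster by evaluating them with the jsonnet CLI; here is a hedged sketch (the file name, image tag, and namespace are illustrative, and mainClass/jobArguments are set to the "null" sentinel since the operator parts don't use them):

  // check_operator.jsonnet (hypothetical file, placed next to all.libsonnet)
  local all = import "all.libsonnet";
  all.all(
    {
      image: "gcr.io/spark-operator/spark-operator:v2.3.1-v1alpha1",  // assumed image
      sparkVersion: "v2.3.1",
      mainClass: "null",
      jobArguments: "null",
    },
    "spark-operator",
    { namespace: "kubeflow" },
  )

Running "jsonnet check_operator.jsonnet" should then print the service account, cluster role, role binding, and deployment as a JSON array.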
27 changes: 27 additions & 0 deletions kubeflow/spark/parts.yaml
@@ -0,0 +1,27 @@
{
  "name": "spark",
  "apiVersion": "0.0.1",
  "kind": "ksonnet.io/parts",
  "description": "Holden's awesome Spark Job prototype based on https://github.com/GoogleCloudPlatform/spark-on-k8s-operator\n",
  "author": "kubeflow-team <kubeflow-discuss@googlegroups.com>",
  "contributors": [
    {
      "name": "Holden Karau",
      "email": "holden@pigscanfly.ca"
    }
  ],
  "repository": {
    "type": "git",
    "url": "https://github.com/kubeflow/kubeflow"
  },
  "bugs": {
    "url": "https://github.com/kubeflow/kubeflow/issues"
  },
  "keywords": [
    "kubernetes",
    "kubeflow",
    "machine learning",
    "apache spark"
  ],
  "license": "Apache 2.0"
}
