Cluster mode means running spark-submit --master k8s://${MASTER_ADDR} ...
- Since we are using a Spark cluster on k8s (see detail), spark-submit needs bcpkix-jdk15on and bcprov-jdk15on. In other words, these two dependencies must be present in $SPARK_HOME/jars. (Note: run echo 'sc.getConf.get("spark.home")' | spark-shell to find $SPARK_HOME if needed.)
- In addition, hadoop-aws must be supplied as an extra package when executing spark-submit.
- Check the username and password of a deployed standalone MinIO. The storage path depends on your volume; log in to your node IP, then run:
    cat <minio path>/.root_user
    cat <minio path>/.root_password
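The cluster-mode prerequisites above can be checked and combined in one place. The following is a minimal sketch, not this project's own script: the function names check_spark_jars and print_submit_cmd, the main class, the application JAR path, and the version argument are all hypothetical placeholders. The hadoop-aws version should match the Hadoop version your Spark build ships with.

```shell
# Sketch only: function names and all arguments are illustrative assumptions.

# Report whether the two Bouncy Castle jars exist in a Spark jars directory.
# Returns non-zero if either one is missing.
# Usage: check_spark_jars "$SPARK_HOME/jars"
check_spark_jars() {
  local jars_dir="$1"
  local missing=0
  local name f found
  for name in bcpkix-jdk15on bcprov-jdk15on; do
    found=0
    # If the glob matches nothing it stays literal, and the -e test fails.
    for f in "$jars_dir/$name"-*.jar; do
      if [ -e "$f" ]; then
        found=1
      fi
    done
    if [ "$found" -eq 1 ]; then
      echo "found: $name"
    else
      echo "missing: $name"
      missing=1
    fi
  done
  return "$missing"
}

# Print (not run) a cluster-mode spark-submit command that pulls in hadoop-aws.
# Usage: print_submit_cmd <master-addr> <hadoop-aws-version> <main-class> <app-jar>
print_submit_cmd() {
  local master_addr="$1"
  local hadoop_aws_version="$2"
  local main_class="$3"
  local app_jar="$4"
  echo "spark-submit --master k8s://${master_addr} --deploy-mode cluster --packages org.apache.hadoop:hadoop-aws:${hadoop_aws_version} --class ${main_class} ${app_jar}"
}
```

For example, check_spark_jars "$SPARK_HOME/jars" lists which of the two jars were found, and print_submit_cmd lets you review the full command line before running it yourself.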
Client mode means using bitnami/charts in k8s.
- NFS share volume (only required in Spark client mode, where it is used for uploading local JARs).
  On the client server:
    sudo apt update
    sudo apt install nfs-common
  Check the available mount directories:
    showmount -e <HOST_IP>
  Create the share directory and grant permissions:
    sudo mkdir -p <YOUR_MOUNT_DIRECTORY>
    sudo chown nobody:nogroup <YOUR_MOUNT_DIRECTORY>
  Mount the host directory:
    sudo mount <HOST_IP>:<HOST_SHARE_ADDRESS> <YOUR_MOUNT_DIRECTORY>
- Accessing logs:
    kubectl logs -f -n dev <DRIVER_POD_NAME>
- Accessing the UI:
    kubectl port-forward -n dev <DRIVER_POD_NAME> 4040:4040
- Debugging:
    kubectl describe pod -n dev <SPARK_DRIVER_POD>
- Killing the driver:
    kubectl delete pod -n dev <SPARK_DRIVER_POD>
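The commands above all need the driver pod name. As a rough sketch, it can be looked up through the spark-role=driver label that Spark on Kubernetes attaches to driver pods. The helper names and the KUBECTL override (included only so the commands can be previewed with echo) are assumptions, not part of this project; the dev namespace default is taken from the commands above.

```shell
# Sketch only: helper names and the KUBECTL override are illustrative assumptions.
KUBECTL="${KUBECTL:-kubectl}"

# List driver pods in a namespace (defaults to "dev").
# With -o name, kubectl prints entries in the form pod/<name>.
list_driver_pods() {
  local ns="${1:-dev}"
  "$KUBECTL" get pods -n "$ns" -l spark-role=driver -o name
}

# Follow the logs of a given driver pod (accepts <name> or pod/<name>).
follow_driver_logs() {
  local pod="$1"
  local ns="${2:-dev}"
  "$KUBECTL" logs -f -n "$ns" "$pod"
}
```

Setting KUBECTL=echo before calling either helper prints the kubectl arguments instead of contacting the cluster, which is handy for a dry run.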
- RDDRelation: RDD
- DataSources: data sources such as CSV, Parquet, JDBC, etc.
- TypedAggregator: simple aggregator
- UserDefinedTypedAggregator: user defined aggregator
- UserDefinedUntypedAggregator: user defined aggregator, used with spark.sql
- DFWithColumn: withColumn function
- DFWhereFilter: filter & where clauses
- DFWhen: "case when" and "when otherwise"
- DFPivotAndUnpivot: pivot and unpivot
- DFGroupBy: groupBy and its methods
- DFSort: sort and orderBy
- DFJoin: join
- DFUnion: union and unionAll
- DFMap: map and mapPartitions
- DFCacheAndPersist: cache and persist
- SqlUDF: UDF
- DFArrayAndMap: ArrayType and MapType
--jars is used for local or remote JAR files specified by URL and does not resolve dependencies; --packages is used for Maven coordinates and does resolve dependencies. (Source)
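To make the distinction concrete, here is a small sketch that picks the flag for a given dependency string. It assumes a simple rule of thumb: a string with colon-separated fields and no slash, such as groupId:artifactId:version, is treated as a Maven coordinate, and anything else is treated as a JAR path or URL. The function name dep_flag is hypothetical and is not part of Spark.

```shell
# Sketch only: dep_flag is an illustrative helper, not a Spark command.
# Prints the spark-submit flag that fits a dependency string.
dep_flag() {
  case "$1" in
    */*)   echo "--jars" ;;      # paths and URLs contain a slash
    *:*:*) echo "--packages" ;;  # groupId:artifactId:version
    *)     echo "--jars" ;;      # e.g. a bare file name such as app.jar
  esac
}
```

For example, dep_flag org.apache.hadoop:hadoop-aws:<version> prints --packages, while dep_flag /path/to/app.jar prints --jars.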