# Some General Spark Optimization Comments



### Various caching strategies

Persist: marks a request to cache the result of the Dataset's query the next time an action is performed (default storage level: `MEMORY_AND_DISK`).

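Caching: `.cache()` is shorthand for `.persist()` with the default storage level. For a Dataset/DataFrame that default is `MEMORY_AND_DISK`; for a plain RDD it is `MEMORY_ONLY`.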

**What's the difference?**

Both `MEMORY_ONLY` and `MEMORY_AND_DISK` keep the data as deserialized objects in JVM memory. With `MEMORY_ONLY`, partitions that don't fit are not cached and have to be recomputed each time they are needed. With `MEMORY_AND_DISK`, partitions that don't fit are spilled to disk and read from there. Both use a lot of space; the former has lower latency but may put more pressure on memory management (garbage collection). The best choice depends on the RAM available on the executors.

Checkpointing: requests an internal RDD representation (`internalRDD`) of the Dataset, saves a reliable copy of it to the checkpoint directory, and truncates the data lineage.


### Catalyst

Catalyst is a framework to generate a dataflow graph of expressions and relational operators. The graph is a tree structure (`TreeNode`) that contains `Expressions` and `QueryPlans`.

The Catalyst optimizer (`Optimizer`) defines the rules that take a structured query and generate an optimized logical plan. You can view the optimized logical plan with `Dataset.explain(extended=True)` (SQL: `EXPLAIN EXTENDED`).


### Type Safety and DataFrames, SQL and Datasets


This book (nice Spark SQL reference): <a href="https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-Dataset.html">"The Internals of Spark SQL," by Jacek Laskowski</a>


> The Dataset API offers declarative and type-safe operators that makes for an improved experience for data processing (comparing to DataFrames that were a set of index- or column name-based Rows).

Unlike DataFrames and SQL queries, Datasets are strongly typed and consequently have checks at compile time.

### Tungsten

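Tungsten is Spark's execution backend focused on computational efficiency and fine-grained memory management: it generates code at runtime for Spark operations (avoiding virtual function calls) and manages memory explicitly in a compact binary format.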



### Optimized Joins


Joins are automatically broadcast if one table/DataFrame is small enough, as controlled by `spark.sql.autoBroadcastJoinThreshold`. The default is 10 MB.

The setting is the maximum size (in bytes) of a table that will be broadcast to all worker nodes when performing a join.

If the estimated size of a table, taken from the statistics of its logical plan, is at most this setting, the DataFrame is broadcast for the join.

Negative values or 0 disable broadcasting.

Use the `SQLConf.autoBroadcastJoinThreshold` method to access the current value.

```python
# Disable automatic broadcast joins
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
```