<img align="right" width="200" height="200" src="https://static.tildacdn.com/tild6236-6337-4339-b337-313363643735/new_logo.png">

# Spark Dataframes II
**Сергей Гришаев**  
serg.grishaev@gmail.com  

## In this session
+ Execution plans
+ Optimizing joins and aggregations
+ Managing the data schema
+ The Catalyst query optimizer

## Execution plans

Every `job` in Spark SQL is backed by an execution plan that is generated from the query you wrote. The plan consists of operators, which are then compiled into Java code. Because the same task can be expressed in Spark SQL in several ways, it is useful to inspect execution plans, for example to:
+ remove unnecessary shuffles
+ make sure that a given operator is executed at the data source rather than inside Spark
+ understand how a `join` will be executed

Execution plans are available in two forms:
+ the `explain()` method of a DF
+ the SQL tab of the Spark UI

Let's read the [Airport Codes](https://datahub.io/core/airport-codes) dataset:

In [1]:
val csvOptions = Map("header" -> "true", "inferSchema" -> "true")
val airports = spark.read.options(csvOptions).csv("/tmp/airport-codes.csv")
airports.printSchema
airports.show(numRows = 1, truncate = 100, vertical = true)

root
 |-- ident: string (nullable = true)
 |-- type: string (nullable = true)
 |-- name: string (nullable = true)
 |-- elevation_ft: integer (nullable = true)
 |-- continent: string (nullable = true)
 |-- iso_country: string (nullable = true)
 |-- iso_region: string (nullable = true)
 |-- municipality: string (nullable = true)
 |-- gps_code: string (nullable = true)
 |-- iata_code: string (nullable = true)
 |-- local_code: string (nullable = true)
 |-- coordinates: string (nullable = true)

-RECORD 0------------------------------------------
 ident        | 00A                                
 type         | heliport                           
 name         | Total Rf Heliport                  
 elevation_ft | 11                                 
 continent    | NA                                 
 iso_country  | US                                 
 iso_region   | US-PA                              
 municipality | Bensalem                           
 gps_code     | 00A                 

csvOptions = Map(header -> true, inferSchema -> true)
airports = [ident: string, type: string ... 10 more fields]


[ident: string, type: string ... 10 more fields]

Use the `explain` method to look at the query plan. The physical plan is the most interesting one because it reflects the actual data processing algorithm. Here the plan contains a single operator, `FileScan csv`:

In [4]:
airports.explain(extended = false)

== Physical Plan ==
*(1) FileScan csv [ident#10,type#11,name#12,elevation_ft#13,continent#14,iso_country#15,iso_region#16,municipality#17,gps_code#18,iata_code#19,local_code#20,coordinates#21] Batched: false, Format: CSV, Location: InMemoryFileIndex[hdfs://spark-master-1.newprolab.com:8020/tmp/airport-codes.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<ident:string,type:string,name:string,elevation_ft:int,continent:string,iso_country:string,...


In [3]:
airports.count

55113

If the other plans are not needed, you can print just the physical plan:

In [6]:
import org.apache.spark.sql.Dataset


airports.queryExecution.executedPlan.treeString

def printPhysicalPlan[_](ds: Dataset[_]): Unit = {
    println(ds.queryExecution.executedPlan.treeString)
}

printPhysicalPlan: [_](ds: org.apache.spark.sql.Dataset[_])Unit


The same information is also available as JSON:

In [7]:
println(airports.queryExecution.executedPlan.toJSON)

[{"class":"org.apache.spark.sql.execution.WholeStageCodegenExec","num-children":1,"child":0,"codegenStageId":1},{"class":"org.apache.spark.sql.execution.FileSourceScanExec","num-children":0,"relation":null,"output":[[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"ident","dataType":"string","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":10,"jvmId":"389ceb62-1324-4f94-bb7f-28ab9bd35cb4"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"type","dataType":"string","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":11,"jvmId":"389ceb62-1324-4f94-bb7f-28ab9bd35cb4"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"name","dataType":"string","nullable":true,"metadata":{},"exprId":{"product-cl

Let's apply a `filter` and check the execution plan. Plans are read bottom-up. A new `Filter` operator has appeared in the plan, and the same predicate is listed under `PushedFilters` in the `FileScan`:

In [8]:
printPhysicalPlan(airports.filter('type === "small_airport"))

*(1) Project [ident#10, type#11, name#12, elevation_ft#13, continent#14, iso_country#15, iso_region#16, municipality#17, gps_code#18, iata_code#19, local_code#20, coordinates#21]
+- *(1) Filter (isnotnull(type#11) && (type#11 = small_airport))
   +- *(1) FileScan csv [ident#10,type#11,name#12,elevation_ft#13,continent#14,iso_country#15,iso_region#16,municipality#17,gps_code#18,iata_code#19,local_code#20,coordinates#21] Batched: false, Format: CSV, Location: InMemoryFileIndex[hdfs://spark-master-1.newprolab.com:8020/tmp/airport-codes.csv], PartitionFilters: [], PushedFilters: [IsNotNull(type), EqualTo(type,small_airport)], ReadSchema: struct<ident:string,type:string,name:string,elevation_ft:int,continent:string,iso_country:string,...



In [9]:
airports.filter('type === "small_airport").explain(true)

== Parsed Logical Plan ==
'Filter ('type = small_airport)
+- Relation[ident#10,type#11,name#12,elevation_ft#13,continent#14,iso_country#15,iso_region#16,municipality#17,gps_code#18,iata_code#19,local_code#20,coordinates#21] csv

== Analyzed Logical Plan ==
ident: string, type: string, name: string, elevation_ft: int, continent: string, iso_country: string, iso_region: string, municipality: string, gps_code: string, iata_code: string, local_code: string, coordinates: string
Filter (type#11 = small_airport)
+- Relation[ident#10,type#11,name#12,elevation_ft#13,continent#14,iso_country#15,iso_region#16,municipality#17,gps_code#18,iata_code#19,local_code#20,coordinates#21] csv

== Optimized Logical Plan ==
Filter (isnotnull(type#11) && (type#11 = small_airport))
+- Relation[ident#10,type#11,name#12,elevation_ft#13,continent#14,iso_country#15,iso_region#16,municipality#17,gps_code#18,iata_code#19,local_code#20,coordinates#21] csv

== Physical Plan ==
*(1) Project [ident#10, type#11, name#12,

Let's run an aggregation and check the execution plan. Three new operators appear: two `HashAggregate` operators and an `Exchange hashpartitioning`.

The first `HashAggregate` contains the function `partial_count(1)`. This means that rows are first counted per key locally, within each partition on the workers. Then a `shuffle` by the aggregation key takes place, after which a second `HashAggregate` with the function `count(1)` sums up the partial results. Using two `HashAggregate` operators reduces the amount of data sent over the network.

In [10]:
printPhysicalPlan(airports.filter('type === "small_airport").groupBy('iso_country).count)

*(2) HashAggregate(keys=[iso_country#15], functions=[count(1)], output=[iso_country#15, count#116L])
+- Exchange hashpartitioning(iso_country#15, 200)
   +- *(1) HashAggregate(keys=[iso_country#15], functions=[partial_count(1)], output=[iso_country#15, count#120L])
      +- *(1) Project [iso_country#15]
         +- *(1) Filter (isnotnull(type#11) && (type#11 = small_airport))
            +- *(1) FileScan csv [type#11,iso_country#15] Batched: false, Format: CSV, Location: InMemoryFileIndex[hdfs://spark-master-1.newprolab.com:8020/tmp/airport-codes.csv], PartitionFilters: [], PushedFilters: [IsNotNull(type), EqualTo(type,small_airport)], ReadSchema: struct<type:string,iso_country:string>



In [11]:
airports.filter('type === "small_airport").groupBy('iso_country).count.collect

Array([DZ,22], [LT,45], [MM,52], [CI,19], [TC,4], [AZ,21], [FI,61], [SC,13], [UA,102], [ZM,94], [RO,32], [KI,17], [SL,4], [SB,32], [NL,26], [LA,11], [BS,30], [BW,113], [MN,19], [AM,5], [PL,205], [PS,1], [MK,11], [MX,845], [PF,27], [GL,18], [EE,18], [VG,2], [SM,1], [CN,93], [UM,1], [AT,47], [RU,501], [NA,231], [IQ,42], [CG,47], [HR,28], [SV,25], [CZ,197], [NP,41], [SO,24], [PT,106], [PG,506], [GH,5], [CV,4], [BN,1], [LR,14], [TW,10], [BD,6], [PY,64], [CL,338], [TO,3], [ID,395], [FK,32], [LY,46], [SA,40], [AU,1538], [PK,80], [CA,997], [MW,21], [NE,18], [UZ,167], [GB,572], [YE,14], [BR,3102], [KZ,86], [BY,27], [HN,140], [NC,11], [GT,44], [MD,2], [DE,746], [GN,15], [EC,99], [ES,296], [IR,71], [BH,1], [IL,19], [MR,18], [TR,56], [ME,3], [VE,456], [ZA,4...
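
The two-stage aggregation described above can be imitated without Spark. The following is a minimal sketch in plain Python, not Spark's implementation: the function names, the toy partitions and the choice of 3 shuffle partitions are all assumptions made for illustration. It shows that only one record per key and partition crosses the "network", instead of every input row.

```python
from collections import Counter


def partial_count(partition):
    """Stage 1: count rows per key inside a single partition."""
    return Counter(partition)


def shuffle(partials, num_partitions):
    """Route every (key, partial count) pair to the partition chosen by the key's hash."""
    buckets = [[] for _ in range(num_partitions)]
    for partial in partials:
        for key, count in partial.items():
            buckets[hash(key) % num_partitions].append((key, count))
    return buckets


def final_count(bucket):
    """Stage 2: sum the partial counts that arrived for each key."""
    totals = Counter()
    for key, count in bucket:
        totals[key] += count
    return totals


# Toy data: two input partitions holding only the grouping key of each row.
partitions = [["US", "US", "RU", "US"], ["RU", "DE", "US"]]

partials = [partial_count(p) for p in partitions]
buckets = shuffle(partials, num_partitions=3)

result = Counter()
for bucket in buckets:
    result.update(final_count(bucket))

rows_in = sum(len(p) for p in partitions)
records_shuffled = sum(len(b) for b in buckets)
print(dict(result))
print(f"rows read: {rows_in}, records shuffled: {records_shuffled}")
```

With real data the gap between rows read and records shuffled is usually much larger, because the number of distinct keys per partition is small compared to the number of rows.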

If needed, we can read ~~at bedtime~~ the generated ~~warm and fuzzy~~ Java code:

In [12]:
import org.apache.spark.sql.execution.command.ExplainCommand

val grouped = airports.filter('type === "small_airport").groupBy('iso_country).count


def printCodeGen[_](ds: Dataset[_]): Unit = {
    val logicalPlan = ds.queryExecution.logical
    val codeGen = ExplainCommand(logicalPlan, extended = true, codegen = true)
    spark.sessionState.executePlan(codeGen).executedPlan.executeCollect().foreach {
      r => println(r.getString(0))
    }
}

printCodeGen(grouped)

Found 2 WholeStageCodegen subtrees.
== Subtree 1 / 2 ==
*(1) HashAggregate(keys=[iso_country#15], functions=[partial_count(1)], output=[iso_country#15, count#160L])
+- *(1) Project [iso_country#15]
   +- *(1) Filter (isnotnull(type#11) && (type#11 = small_airport))
      +- *(1) FileScan csv [type#11,iso_country#15] Batched: false, Format: CSV, Location: InMemoryFileIndex[hdfs://spark-master-1.newprolab.com:8020/tmp/airport-codes.csv], PartitionFilters: [], PushedFilters: [IsNotNull(type), EqualTo(type,small_airport)], ReadSchema: struct<type:string,iso_country:string>

Generated code:
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIteratorForCodegenStage1(references);
/* 003 */ }
/* 004 */
/* 005 */ // codegenStageId=1
/* 006 */ final class GeneratedIteratorForCodegenStage1 extends org.apache.spark.sql.execution.BufferedRowIterator {
/* 007 */   private Object[] references;
/* 008 */   private scala.collection.Iterator[] inputs;
/* 009 */   pri

grouped = [iso_country: string, count: bigint]


printCodeGen: [_](ds: org.apache.spark.sql.Dataset[_])Unit


[iso_country: string, count: bigint]

<img align="right" width="200" height="200" src="https://cs5.pikabu.ru/post_img/big/2015/12/11/7/1449830295198229367.jpg">

### Takeaways:
+ Spark builds the physical execution plan from the code you write
+ By studying the plan, you can tell which operators will be applied to your data
+ The execution plan is one of the main tools for query optimization

## Optimizing joins and aggregations
When joining two DFs, it is important to follow these recommendations:
+ filter the data before the join
+ use an equi-join
+ if a non-equi join can be replaced with an equi-join at the cost of increasing the amount of data, do so
+ avoid cross joins at all costs
+ if the right DF fits in a worker's memory, use `broadcast()`

### Join types
+ **BroadcastHashJoin**
  - equi-join
  - broadcast
+ **SortMergeJoin**
  - equi-join
  - sortable keys
+ **BroadcastNestedLoopJoin**
  - non-equi join
  - using broadcast
+ **CartesianProduct**
  - non-equi join
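
The list above can be read as a small decision table. The sketch below encodes it in plain Python. The function name `pick_join_strategy` and its boolean parameters are made up for illustration, and the sketch ignores everything else Spark's planner takes into account (hints, statistics, the join type, key sortability), so treat it as a mnemonic for inner joins like the ones in this notebook rather than as Spark's actual logic.

```python
def pick_join_strategy(is_equi_join, can_broadcast):
    """Map the conditions from the list above to a join operator name.

    Hypothetical helper for illustration only, not Spark's planner.
    Assumes an inner join and sortable join keys.
    """
    if is_equi_join and can_broadcast:
        return "BroadcastHashJoin"
    if is_equi_join:
        return "SortMergeJoin"
    if can_broadcast:
        return "BroadcastNestedLoopJoin"
    return "CartesianProduct"


for equi in (True, False):
    for small in (True, False):
        print(f"equi-join={equi}, broadcast={small} ->",
              pick_join_strategy(equi, small))
```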
  
[Optimizing Apache Spark SQL Joins: Spark Summit East talk by Vida Ha](https://youtu.be/fp53QhSfQcI)

Let's prepare two datasets:

In [13]:
val left = airports.select('type, 'ident, 'iso_country).localCheckpoint
val right = airports.groupBy('type).count.localCheckpoint

left = [type: string, ident: string ... 1 more field]
right = [type: string, count: bigint]


[type: string, count: bigint]

In [18]:
airports.select('type, 'ident, 'iso_country).write.mode("overwrite").parquet("/tmp/1.parquet")
airports.groupBy('type).count.write.mode("overwrite").parquet("/tmp/2.parquet")


In [19]:
val left = spark.read.parquet("/tmp/1.parquet")
val right = spark.read.parquet("/tmp/2.parquet")

left = [type: string, ident: string ... 1 more field]
right = [type: string, count: bigint]


[type: string, count: bigint]

### BroadcastHashJoin
+ applies when the join condition is an equality on one or more keys
+ applies when one of the datasets is small and fits entirely in a worker's memory
+ leaves the left dataset as is
+ copies the right dataset to every worker
+ builds a hash map from the right dataset, keyed by the tuple of columns used in the join condition
+ iterates over the left dataset within each partition and looks up the keys in the hash map
+ can be chosen automatically or requested explicitly via `broadcast(df)`

In [23]:
import org.apache.spark.sql.functions.broadcast // pyspark.sql.functions.broadcast

// automatic broadcast is controlled by spark.sql.autoBroadcastJoinThreshold

val result = left.join(broadcast(right), Seq("type"), "inner")

printPhysicalPlan(result)

*(2) Project [type#227, ident#228, iso_country#229, count#234L]
+- *(2) BroadcastHashJoin [type#227], [type#233], Inner, BuildRight
   :- *(2) Project [type#227, ident#228, iso_country#229]
   :  +- *(2) Filter isnotnull(type#227)
   :     +- *(2) FileScan parquet [type#227,ident#228,iso_country#229] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://spark-master-1.newprolab.com:8020/tmp/1.parquet], PartitionFilters: [], PushedFilters: [IsNotNull(type)], ReadSchema: struct<type:string,ident:string,iso_country:string>
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
      +- *(1) Project [type#233, count#234L]
         +- *(1) Filter isnotnull(type#233)
            +- *(1) FileScan parquet [type#233,count#234L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://spark-master-1.newprolab.com:8020/tmp/2.parquet], PartitionFilters: [], PushedFilters: [IsNotNull(type)], ReadSchema: struct<type:string,count:bigint>



result = [type: string, ident: string ... 2 more fields]


[type: string, ident: string ... 2 more fields]

In [None]:
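def broadcast_hash_join(left_partition, right_rows, key):
    # Build phase: hash map over the broadcast (right) side.
    # A list per key keeps right rows that share the same key.
    broadcast_data = {}
    for row in right_rows:
        broadcast_data.setdefault(row[key], []).append(row)

    # Probe phase: stream the left partition and look up each key.
    for row in left_partition:
        for match in broadcast_data.get(row[key], []):
            yield {**row, **match}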

### SortMergeJoin
+ applies when the join keys in both datasets are sortable
+ repartitions both datasets by the join key(s) into `spark.sql.shuffle.partitions` partitions (200 by default)
+ sorts the partitions of each dataset by the join key(s)
+ walks each pair of partitions, comparing the left and right keys, and joins the rows with equal keys
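
The steps above can be sketched in plain Python, in the same spirit as the broadcast join pseudocode. This is an illustration under assumptions, not Spark's code: the helper names, the toy rows and the use of 4 partitions instead of 200 are all made up for the example.

```python
def hash_partition(rows, key, num_partitions):
    """Step 1: send every row to the partition chosen by the hash of its key."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[hash(row[key]) % num_partitions].append(row)
    return parts


def merge_join(left_sorted, right_sorted, key):
    """Step 3: join two partitions that are already sorted by key, using two cursors."""
    i = j = 0
    while i < len(left_sorted) and j < len(right_sorted):
        lk, rk = left_sorted[i][key], right_sorted[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the end of the run of equal keys on the right side.
            j_end = j
            while j_end < len(right_sorted) and right_sorted[j_end][key] == lk:
                j_end += 1
            # Pair every left row carrying this key with every right row in the run.
            while i < len(left_sorted) and left_sorted[i][key] == lk:
                for r in right_sorted[j:j_end]:
                    yield {**left_sorted[i], **r}
                i += 1
            j = j_end


def sort_merge_join(left, right, key, num_partitions=4):
    left_parts = hash_partition(left, key, num_partitions)
    right_parts = hash_partition(right, key, num_partitions)
    for lp, rp in zip(left_parts, right_parts):
        # Step 2: sort each partition by the join key.
        lp.sort(key=lambda row: row[key])
        rp.sort(key=lambda row: row[key])
        yield from merge_join(lp, rp, key)


# Toy rows shaped like the `left` and `right` DFs of this notebook.
left = [
    {"type": "heliport", "ident": "00A"},
    {"type": "small_airport", "ident": "00AA"},
    {"type": "small_airport", "ident": "00AK"},
    {"type": "closed", "ident": "00AL"},
]
right = [
    {"type": "heliport", "count": 1},
    {"type": "small_airport", "count": 2},
]

joined = sorted(sort_merge_join(left, right, "type"), key=lambda r: r["ident"])
for row in joined:
    print(row)
```

Because both sides go through the same hash function and the same number of partitions, rows with equal keys always meet in the same pair of partitions, which is what lets the merge step run independently per pair.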

In [None]:
spark.conf.set("spark.sql.shuffle.partitions", "200")

In [24]:
spark.conf.get("spark.sql.autoBroadcastJoinThreshold")

26214400

In [25]:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

val result = left.join(right, Seq("type"), "inner")

printPhysicalPlan(result)

*(5) Project [type#227, ident#228, iso_country#229, count#234L]
+- *(5) SortMergeJoin [type#227], [type#233], Inner
   :- *(2) Sort [type#227 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(type#227, 200)
   :     +- *(1) Project [type#227, ident#228, iso_country#229]
   :        +- *(1) Filter isnotnull(type#227)
   :           +- *(1) FileScan parquet [type#227,ident#228,iso_country#229] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://spark-master-1.newprolab.com:8020/tmp/1.parquet], PartitionFilters: [], PushedFilters: [IsNotNull(type)], ReadSchema: struct<type:string,ident:string,iso_country:string>
   +- *(4) Sort [type#233 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(type#233, 200)
         +- *(3) Project [type#233, count#234L]
            +- *(3) Filter isnotnull(type#233)
               +- *(3) FileScan parquet [type#233,count#234L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://spark-master-1.newprolab.com:

result = [type: string, ident: string ... 2 more fields]


[type: string, ident: string ... 2 more fields]

### BroadcastNestedLoopJoin
+ applies when one of the datasets is small and fits entirely in a worker's memory
+ leaves the left dataset as is
+ copies the right dataset to every worker
+ runs a nested loop over each partition of the left dataset and the copy of the right dataset, checking the join condition for every pair of rows
+ can be chosen automatically or requested explicitly via `broadcast(df)`

In [26]:
import org.apache.spark.sql.functions.{ expr, udf, col }

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// Although the UDF compares two keys, it is opaque to Spark,
// so Spark cannot use BroadcastHashJoin or SortMergeJoin
val compare_udf = udf { (leftVal: String, rightVal: String) => leftVal == rightVal }

val joinExpr = compare_udf(col("left.type"), col("right.type"))

val result = left.as("left").join(broadcast(right).as("right"), joinExpr, "inner")

printPhysicalPlan(result)

BroadcastNestedLoopJoin BuildRight, Inner, UDF(type#227, type#233)
:- *(1) FileScan parquet [type#227,ident#228,iso_country#229] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://spark-master-1.newprolab.com:8020/tmp/1.parquet], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<type:string,ident:string,iso_country:string>
+- BroadcastExchange IdentityBroadcastMode
   +- *(2) FileScan parquet [type#233,count#234L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://spark-master-1.newprolab.com:8020/tmp/2.parquet], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<type:string,count:bigint>



compare_udf = UserDefinedFunction(<function2>,BooleanType,Some(List(StringType, StringType)))
joinExpr = UDF(left.type, right.type)
result = [type: string, ident: string ... 3 more fields]


[type: string, ident: string ... 3 more fields]

In [None]:
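def broadcast_nested_loop_join(left_partition, broadcasted_rows, join_function):
    # The broadcast side is materialized as a plain list on every worker.
    broadcast_list = list(broadcasted_rows)

    # Every left row is checked against every broadcast row.
    for i in left_partition:
        for j in broadcast_list:
            if join_function(i, j):
                yield {**i, **j}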

### CartesianProduct
+ pairs every partition of the left dataset with every partition of the right dataset, moves each pair to a single worker and checks the join condition
+ produces N*M output partitions
+ is slower than the other strategies and often causes OOM errors on the workers
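
A minimal plain-Python sketch of this behaviour, assuming two partitions on each side. The function name `cartesian_join` and the toy rows are invented for the example, and nothing here is Spark's own code.

```python
from itertools import product


def cartesian_join(left_partitions, right_partitions, condition):
    """Return one output partition per (left partition, right partition) pair."""
    output_partitions = []
    for lp, rp in product(left_partitions, right_partitions):
        # Every left row of the pair is compared with every right row of the pair.
        output_partitions.append(
            [{**l, **r} for l in lp for r in rp if condition(l, r)]
        )
    return output_partitions


# Toy data: N = 2 left partitions and M = 2 right partitions.
left_partitions = [
    [{"ident": "00A", "type": "heliport"}],
    [{"ident": "00AA", "type": "small_airport"}],
]
right_partitions = [
    [{"type": "heliport", "count": 1}],
    [{"type": "small_airport", "count": 1}],
]

result = cartesian_join(
    left_partitions,
    right_partitions,
    condition=lambda l, r: l["type"] == r["type"],
)
print("output partitions:", len(result))
print("matching rows:", sum(len(p) for p in result))
```

Half of the output partitions in this toy run are empty, yet each of them still had to receive and scan a full pair of input partitions, which is where the cost comes from.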

In [27]:
import org.apache.spark.sql.functions.{ expr, udf, col }

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// Although the UDF compares two keys, it is opaque to Spark,
// so Spark cannot use BroadcastHashJoin or SortMergeJoin
val compare_udf = udf { (leftVal: String, rightVal: String) => leftVal == rightVal }

val joinExpr = compare_udf(col("left.type"), col("right.type"))

val result = left.as("left").join(right.as("right"), joinExpr, "inner")

printPhysicalPlan(result)
println(
    s"""Partition summary: 
    left=${left.rdd.getNumPartitions}, 
    right=${right.rdd.getNumPartitions}, 
    result=${result.rdd.getNumPartitions}""")

CartesianProduct UDF(type#227, type#233)
:- *(1) FileScan parquet [type#227,ident#228,iso_country#229] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://spark-master-1.newprolab.com:8020/tmp/1.parquet], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<type:string,ident:string,iso_country:string>
+- *(2) FileScan parquet [type#233,count#234L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://spark-master-1.newprolab.com:8020/tmp/2.parquet], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<type:string,count:bigint>

Partition summary: 
    left=2, 
    right=2, 
    result=4


compare_udf = UserDefinedFunction(<function2>,BooleanType,Some(List(StringType, StringType)))
joinExpr = UDF(left.type, right.type)
result = [type: string, ident: string ... 3 more fields]


[type: string, ident: string ... 3 more fields]

In [28]:
left.crossJoin(right).explain

== Physical Plan ==
CartesianProduct
:- *(1) FileScan parquet [type#227,ident#228,iso_country#229] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://spark-master-1.newprolab.com:8020/tmp/1.parquet], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<type:string,ident:string,iso_country:string>
+- *(2) FileScan parquet [type#233,count#234L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://spark-master-1.newprolab.com:8020/tmp/2.parquet], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<type:string,count:bigint>


### Reducing the number of shuffles
In some cases unnecessary `shuffle` operations can be avoided when performing a join. For that, both DFs must be partitioned identically: the same number of partitions and a partitioning key that matches the join key.

The difference between the execution plans is clearly visible in the Spark UI: in the execution graph on the Jobs tab and in the plan on the SQL tab.
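
Why identical partitioning removes the need for a shuffle can be checked with a small plain-Python experiment. It is only a sketch: `hash_partition`, `keys_of`, the toy rows and the 4 partitions are assumptions made for the example. If both sides are split by the hash of the join key into the same number of partitions, a key found in left partition `i` can only occur in right partition `i`, so every pair can be joined locally.

```python
def hash_partition(rows, key, num_partitions):
    """Send every row to the partition chosen by the hash of its key."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[hash(row[key]) % num_partitions].append(row)
    return parts


def keys_of(partition, key):
    """Distinct key values present in one partition."""
    return {row[key] for row in partition}


# Toy data: five key values, three rows per key on the left side.
types = ["heliport", "small_airport", "closed", "seaplane_base", "balloonport"]
left = [{"type": t, "ident": f"id{i}"} for i, t in enumerate(types * 3)]
right = [{"type": t, "count": 3} for t in types]

n = 4
left_parts = hash_partition(left, "type", n)
right_parts = hash_partition(right, "type", n)

# No key may be shared between partitions with different indexes.
co_located = all(
    keys_of(left_parts[i], "type").isdisjoint(keys_of(right_parts[j], "type"))
    for i in range(n)
    for j in range(n)
    if i != j
)
print("every key is co-located:", co_located)
```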

In [39]:
spark.time { 
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
    val left = airports.localCheckpoint
    val right = airports.groupBy('type).count.localCheckpoint

    val joined = left.join(right, Seq("type"))

    println(left.rdd.getNumPartitions)
    println(right.rdd.getNumPartitions)
    printPhysicalPlan(joined)
    
    joined.count
}

2
200
*(4) Project [type#11, ident#10, name#12, elevation_ft#13, continent#14, iso_country#15, iso_region#16, municipality#17, gps_code#18, iata_code#19, local_code#20, coordinates#21, count#947L]
+- *(4) SortMergeJoin [type#11], [type#946], Inner
   :- *(2) Sort [type#11 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(type#11, 200)
   :     +- *(1) Filter isnotnull(type#11)
   :        +- Scan ExistingRDD[ident#10,type#11,name#12,elevation_ft#13,continent#14,iso_country#15,iso_region#16,municipality#17,gps_code#18,iata_code#19,local_code#20,coordinates#21]
   +- *(3) Sort [type#946 ASC NULLS FIRST], false, 0
      +- *(3) Filter isnotnull(type#946)
         +- Scan ExistingRDD[type#946,count#947L]

Time taken: 3647 ms


55113

In [42]:
spark.time { 
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
    val airportsRep = airports.repartition(200, col("type"))
    val left = airportsRep.localCheckpoint
    val right = airportsRep.groupBy('type).count.localCheckpoint

    val joined = left.join(right, Seq("type"))

    println(left.rdd.getNumPartitions)
    println(right.rdd.getNumPartitions)
    printPhysicalPlan(joined)
    
    joined.count
}

50
50
*(5) Project [type#11, ident#10, name#12, elevation_ft#13, continent#14, iso_country#15, iso_region#16, municipality#17, gps_code#18, iata_code#19, local_code#20, coordinates#21, count#1151L]
+- *(5) SortMergeJoin [type#11], [type#1155], Inner
   :- *(2) Sort [type#11 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(type#11, 50)
   :     +- *(1) Project [ident#10, type#11, name#12, elevation_ft#13, continent#14, iso_country#15, iso_region#16, municipality#17, gps_code#18, iata_code#19, local_code#20, coordinates#21]
   :        +- *(1) Filter isnotnull(type#11)
   :           +- *(1) FileScan csv [ident#10,type#11,name#12,elevation_ft#13,continent#14,iso_country#15,iso_region#16,municipality#17,gps_code#18,iata_code#19,local_code#20,coordinates#21] Batched: false, Format: CSV, Location: InMemoryFileIndex[hdfs://spark-master-1.newprolab.com:8020/tmp/airport-codes.csv], PartitionFilters: [], PushedFilters: [IsNotNull(type)], ReadSchema: struct<ident:string,type:string,

55113

### Takeaways:
+ The four join strategies covered here are `BroadcastHashJoin`, `SortMergeJoin`, `BroadcastNestedLoopJoin` and `CartesianProduct`
+ The algorithm is chosen based on the join condition and the size of the datasets
+ `CartesianProduct` is the least computationally efficient strategy and should be avoided where possible

## Managing the data schema
In the DF API every column has a type. It can be:
+ a scalar: `StringType`, `IntegerType`, etc.
+ an array: `ArrayType(T)`
+ a map: `MapType(K, V)`
+ a struct: `StructType()`

The DF as a whole also has a schema, described by the `StructType` class.

The list of columns is available via the `columns` attribute:

In [44]:
airports.columns.mkString("\n")

ident
type
name
elevation_ft
continent
iso_country
iso_region
municipality
gps_code
iata_code
local_code
coordinates


The DF schema is available via the `schema` attribute:

In [45]:
import org.apache.spark.sql.types._
val schema: StructType = airports.schema

schema = StructType(StructField(ident,StringType,true), StructField(type,StringType,true), StructField(name,StringType,true), StructField(elevation_ft,IntegerType,true), StructField(continent,StringType,true), StructField(iso_country,StringType,true), StructField(iso_region,StringType,true), StructField(municipality,StringType,true), StructField(gps_code,StringType,true), StructField(iata_code,StringType,true), StructField(local_code,StringType,true), StructField(coordinates,StringType,true))


StructType(StructField(ident,StringType,true), StructField(type,StringType,true), StructField(name,StringType,true), StructField(elevation_ft,IntegerType,true), StructField(continent,StringType,true), StructField(iso_country,StringType,true), StructField(iso_region,StringType,true), StructField(municipality,StringType,true), StructField(gps_code,StringType,true), StructField(iata_code,StringType,true), StructField(local_code,StringType,true), StructField(coordinates,StringType,true))

The `apply()` method returns a struct field by name, like a dictionary lookup:

In [46]:
val field: StructField = schema("ident")

field = StructField(ident,StringType,true)


StructField(ident,StringType,true)

In [47]:
val field: StructField = schema.apply("ident")

field = StructField(ident,StringType,true)


StructField(ident,StringType,true)

`StructField` has the attributes `name` and `dataType`:

In [48]:
val name: String = field.name

val fieldType: DataType = field.dataType

fieldType match {
    case f: StringType => println("This is string")
    case _ => println("This is not string!")
}

This is string


name = ident
fieldType = StringType


StringType

In [62]:
airports.schema.map {fieldType =>
    fieldType match {
        case StructField(n, s: StringType, _, _) => println(s"Column ${n} is string - ${s}")
        case StructField(_, st, _, _) => println(s"Column is not string! It's ${st}")
    }
}

Column ident is string - StringType
Column type is string - StringType
Column name is string - StringType
Column is not string! It's IntegerType
Column continent is string - StringType
Column iso_country is string - StringType
Column iso_region is string - StringType
Column municipality is string - StringType
Column gps_code is string - StringType
Column iata_code is string - StringType
Column local_code is string - StringType
Column coordinates is string - StringType


List((), (), (), (), (), (), (), (), (), (), (), ())

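The `simpleString` method returns a compact string representation of a type or of a whole schema. `toDDL` returns the schema as a DDL string that `DataType.fromDDL` can parse back, and `json` together with `DataType.fromJson` does the same round trip through JSON: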

In [63]:
fieldType.simpleString

string

In [64]:
val airportSchema = schema.simpleString

airportSchema = struct<ident:string,type:string,name:string,elevation_ft:int,continent:string,iso_country:string,iso_region:string,municipality:string,gps_code:string,iata_code:string,local_code:string,coordinates:string>


struct<ident:string,type:string,name:string,elevation_ft:int,continent:string,iso_country:string,iso_region:string,municipality:string,gps_code:string,iata_code:string,local_code:string,coordinates:string>

In [65]:
val ddlString = schema.toDDL

import org.apache.spark.sql.types.DataType

DataType.fromDDL(ddlString)

ddlString = `ident` STRING,`type` STRING,`name` STRING,`elevation_ft` INT,`continent` STRING,`iso_country` STRING,`iso_region` STRING,`municipality` STRING,`gps_code` STRING,`iata_code` STRING,`local_code` STRING,`coordinates` STRING


StructType(StructField(ident,StringType,true), StructField(type,StringType,true), StructField(name,StringType,true), StructField(elevation_ft,IntegerType,true), StructField(continent,StringType,true), StructField(iso_country,StringType,true), StructField(iso_region,StringType,true), StructField(municipality,StringType,true), StructField(gps_code,StringType,true), StructField(iata_code,StringType,true), StructField(local_code,StringType,true), StructField(coordin...

In [66]:
val jsonString = schema.json

import org.apache.spark.sql.types.DataType

println(DataType.fromJson(jsonString))

StructType(StructField(ident,StringType,true), StructField(type,StringType,true), StructField(name,StringType,true), StructField(elevation_ft,IntegerType,true), StructField(continent,StringType,true), StructField(iso_country,StringType,true), StructField(iso_region,StringType,true), StructField(municipality,StringType,true), StructField(gps_code,StringType,true), StructField(iata_code,StringType,true), StructField(local_code,StringType,true), StructField(coordinates,StringType,true))


jsonString = {"type":"struct","fields":[{"name":"ident","type":"string","nullable":true,"metadata":{}},{"name":"type","type":"string","nullable":true,"metadata":{}},{"name":"name","type":"string","nullable":true,"metadata":{}},{"name":"elevation_ft","type":"integer","nullable":true,"metadata":{}},{"name":"continent","type":"string","nullable":true,"metadata":{}},{"name":"iso_country","type":"string","nullable":true,"metadata":{}},{"name":"iso_region","type":"string","nullable":true,"metadata":{}},{"name":"municipality","type":"string","nullable":true,"metadata":{}},{"name":"gps_code","type":"string","nullable":true,"metadata":{}},{"name":"iata_code","type":"string","nullable":true,"metadata":{}},{"name":"local_code","type":"string","nullable":true,"metadata":{}},{"name":"coordin...


{"type":"struct","fields":[{"name":"ident","type":"string","nullable":true,"metadata":{}},{"name":"type","type":"string","nullable":true,"metadata":{}},{"name":"name","type":"string","nullable":true,"metadata":{}},{"name":"elevation_ft","type":"integer","nullable":true,"metadata":{}},{"name":"continent","type":"string","nullable":true,"metadata":{}},{"name":"iso_country","type":"string","nullable":true,"metadata":{}},{"name":"iso_region","type":"string","nullable":true,"metadata":{}},{"name":"municipality","type":"string","nullable":true,"metadata":{}},{"name":"gps_code","type":"string","nullable":true,"metadata":{}},{"name":"iata_code","type":"string","nullable":true,"metadata":{}},{"name":"local_code","type":"string","nullable":true,"metadata":{}},{"name":"coordin...

A schema can be created from a `case class`:

In [67]:
case class Airport(
    ident: String,
    `type`: String,
    name: String,
    elevation_ft: Int,
    continent: String,
    iso_country: String,
    iso_region: String,
    municipality: String,
    gps_code: String,
    iata_code: String,
    local_code: String,
    coordinates: String
)

defined class Airport


In [76]:
import org.apache.spark.sql.types._

import org.apache.spark.sql.catalyst.ScalaReflection
val schemaFromClass = ScalaReflection.schemaFor[Airport].dataType.asInstanceOf[StructType]

schemaFromClass = StructType(StructField(ident,StringType,true), StructField(type,StringType,true), StructField(name,StringType,true), StructField(elevation_ft,IntegerType,false), StructField(continent,StringType,true), StructField(iso_country,StringType,true), StructField(iso_region,StringType,true), StructField(municipality,StringType,true), StructField(gps_code,StringType,true), StructField(iata_code,StringType,true), StructField(local_code,StringType,true), StructField(coordinates,StringType,true))


StructType(StructField(ident,StringType,true), StructField(type,StringType,true), StructField(name,StringType,true), StructField(elevation_ft,IntegerType,false), StructField(continent,StringType,true), StructField(iso_country,StringType,true), StructField(iso_region,StringType,true), StructField(municipality,StringType,true), StructField(gps_code,StringType,true), StructField(iata_code,StringType,true), StructField(local_code,StringType,true), StructField(coordinates,StringType,true))

In [71]:
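// Note: without asInstanceOf[StructType] the value is typed as DataType,
// which spark.read.schema(...) does not accept
val schemaFromClass = ScalaReflection.schemaFor[Airport].dataType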

schemaFromClass = StructType(StructField(ident,StringType,true), StructField(type,StringType,true), StructField(name,StringType,true), StructField(elevation_ft,IntegerType,false), StructField(continent,StringType,true), StructField(iso_country,StringType,true), StructField(iso_region,StringType,true), StructField(municipality,StringType,true), StructField(gps_code,StringType,true), StructField(iata_code,StringType,true), StructField(local_code,StringType,true), StructField(coordinates,StringType,true))


StructType(StructField(ident,StringType,true), StructField(type,StringType,true), StructField(name,StringType,true), StructField(elevation_ft,IntegerType,false), StructField(continent,StringType,true), StructField(iso_country,StringType,true), StructField(iso_region,StringType,true), StructField(municipality,StringType,true), StructField(gps_code,StringType,true), StructField(iata_code,StringType,true), StructField(local_code,StringType,true), StructField(coordinates,StringType,true))

In [74]:
spark.emptyDataset[Airport].schema

StructType(StructField(ident,StringType,true), StructField(type,StringType,true), StructField(name,StringType,true), StructField(elevation_ft,IntegerType,false), StructField(continent,StringType,true), StructField(iso_country,StringType,true), StructField(iso_region,StringType,true), StructField(municipality,StringType,true), StructField(gps_code,StringType,true), StructField(iata_code,StringType,true), StructField(local_code,StringType,true), StructField(coordinates,StringType,true))
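Another way to obtain the same kind of schema, without touching the internal `ScalaReflection` object, is the public `Encoders.product` API. This is a minimal sketch: `AirportShort` is a hypothetical, shortened stand-in for the `Airport` class above.

```scala
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.types.StructType

// Hypothetical, shortened stand-in for the Airport case class
case class AirportShort(ident: String, name: String, elevation_ft: Int)

// The encoder for a case class carries the schema derived from its fields
val schemaFromEncoder: StructType = Encoders.product[AirportShort].schema

schemaFromEncoder.printTreeString()
```

In this sketch the `Int` field comes out as non-nullable, which matches the `elevation_ft` field in the outputs above. Declaring the field as `Option[Int]` would be one way to make it nullable.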

A schema can be used:
+ when reading a source
+ when working with JSON

In [77]:
val csvOptions = Map("header" -> "true", "inferSchema" -> "false")
val airports = spark.read.options(csvOptions).schema(schemaFromClass).csv("/tmp/airport-codes.csv")
airports.printSchema
airports.show(numRows = 1, truncate = 100, vertical = true)

root
 |-- ident: string (nullable = true)
 |-- type: string (nullable = true)
 |-- name: string (nullable = true)
 |-- elevation_ft: integer (nullable = true)
 |-- continent: string (nullable = true)
 |-- iso_country: string (nullable = true)
 |-- iso_region: string (nullable = true)
 |-- municipality: string (nullable = true)
 |-- gps_code: string (nullable = true)
 |-- iata_code: string (nullable = true)
 |-- local_code: string (nullable = true)
 |-- coordinates: string (nullable = true)

-RECORD 0------------------------------------------
 ident        | 00A                                
 type         | heliport                           
 name         | Total Rf Heliport                  
 elevation_ft | 11                                 
 continent    | NA                                 
 iso_country  | US                                 
 iso_region   | US-PA                              
 municipality | Bensalem                           
 gps_code     | 00A                 

csvOptions = Map(header -> true, inferSchema -> false)
airports = [ident: string, type: string ... 10 more fields]


[ident: string, type: string ... 10 more fields]

In [78]:
import org.apache.spark.sql.functions._

val parseJson = from_json(col("value"), schemaFromClass).alias("s")

val jsoned = airports.toJSON
jsoned.show(1, false)



val withColumns = jsoned.select(parseJson).select(col("s.*"))

withColumns.show(1, 200, true)
withColumns.printSchema

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                                                                                                                                                   |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"ident":"00A","type":"heliport","name":"Total Rf Heliport","elevation_ft":11,"continent":"NA","iso_country":"US","iso_region":"US-PA","municipality":"Bensalem","gps_code":"00A","local_code":"00A","coordinates":"40.07080078125, -74.93360137939453

parseJson = jsontostructs(value) AS `s`
jsoned = [value: string]
withColumns = [ident: string, type: string ... 10 more fields]


[ident: string, type: string ... 10 more fields]

In [87]:
// printSchema returns Unit, so its result is not assigned to a variable
jsoned.select(parseJson).select("s.*").printSchema

root
 |-- ident: string (nullable = true)
 |-- type: string (nullable = true)
 |-- name: string (nullable = true)
 |-- elevation_ft: integer (nullable = true)
 |-- continent: string (nullable = true)
 |-- iso_country: string (nullable = true)
 |-- iso_region: string (nullable = true)
 |-- municipality: string (nullable = true)
 |-- gps_code: string (nullable = true)
 |-- iata_code: string (nullable = true)
 |-- local_code: string (nullable = true)
 |-- coordinates: string (nullable = true)


A schema can also be built by hand:

In [88]:
val someSchema = 
    StructType(
        List(
            StructField("foo", StringType),
            StructField("bar", StringType),
            StructField(
                        "boo", 
                        StructType(
                            List(
                                StructField("x", IntegerType),
                                StructField("y", BooleanType)
                                )
                            )
                       )
        
        )
    )

someSchema.printTreeString()

root
 |-- foo: string (nullable = true)
 |-- bar: string (nullable = true)
 |-- boo: struct (nullable = true)
 |    |-- x: integer (nullable = true)
 |    |-- y: boolean (nullable = true)



someSchema = StructType(StructField(foo,StringType,true), StructField(bar,StringType,true), StructField(boo,StructType(StructField(x,IntegerType,true), StructField(y,BooleanType,true)),true))


StructType(StructField(foo,StringType,true), StructField(bar,StringType,true), StructField(boo,StructType(StructField(x,IntegerType,true), StructField(y,BooleanType,true)),true))
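The same nested schema can be assembled step by step with the `add` method of `StructType`, which some find easier to read than nested `List(...)` calls. A minimal sketch, using the field names from the example above:

```scala
import org.apache.spark.sql.types._

// Inner struct: each add(...) returns a new StructType with one more field
val nested = new StructType()
  .add("x", IntegerType)
  .add("y", BooleanType)

// Outer struct: fields are nullable by default, as with StructField(name, type)
val builtSchema = new StructType()
  .add("foo", StringType)
  .add("bar", StringType)
  .add("boo", nested)

builtSchema.printTreeString()
```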

A schema can also be inferred from a JSON string:

In [89]:
val jsoned = airports.toJSON

val firstLine = jsoned.head

spark.range(1).select(schema_of_json(lit(firstLine))).head

jsoned = [value: string]
firstLine = {"ident":"00A","type":"heliport","name":"Total Rf Heliport","elevation_ft":11,"continent":"NA","iso_country":"US","iso_region":"US-PA","municipality":"Bensalem","gps_code":"00A","local_code":"00A","coordinates":"40.07080078125, -74.93360137939453"}


[struct<continent:string,coordinates:string,elevation_ft:bigint,gps_code:string,ident:string,iso_country:string,iso_region:string,local_code:string,municipality:string,name:string,type:string>]
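The inferred schema can be passed straight to `from_json`. The sketch below assumes Spark 2.4 or later, where both `schema_of_json` and the `from_json` overload taking a `Column` schema are available. The sample line and the local session are illustrative only; in the notebook the existing `spark` session would be returned by `getOrCreate`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json, lit, schema_of_json}

// Assumption: a local session for illustration
val session = SparkSession.builder().master("local[1]").appName("schema-of-json-sketch").getOrCreate()
import session.implicits._

// Hypothetical sample line, much shorter than the real airport record
val sample = """{"ident":"00A","elevation_ft":11}"""
val lines = Seq(sample).toDS() // Dataset[String] with a single column named "value"

// Infer the schema from the sample and use it to parse every line
val parsed = lines
  .select(from_json(col("value"), schema_of_json(lit(sample))).alias("s"))
  .select("s.*")

parsed.printSchema
```

Two things in the output above are worth keeping in mind when relying on this approach: numbers are inferred as `bigint` rather than `int`, and a field that is absent from the sample line (here `iata_code`, which `toJSON` omitted from the first record) does not appear in the inferred schema at all.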

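To change the type of a column, use the `cast` method. Depending on the cast, the operation may either return `null` or throw an exception: a value that cannot be converted (for example, a non-numeric string cast to `float`) becomes `null`, while a cast between incompatible types (for example, a struct to an integer) fails when the query is analyzed.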

In [90]:
airports.select('elevation_ft.cast("string")).printSchema
airports.select('elevation_ft.cast("string")).show(1, false)

root
 |-- elevation_ft: string (nullable = true)

+------------+
|elevation_ft|
+------------+
|11          |
+------------+
only showing top 1 row



In [91]:
airports.select('type.cast("float")).printSchema
airports.select('type.cast("float")).show(1, false)

root
 |-- type: float (nullable = true)

+----+
|type|
+----+
|null|
+----+
only showing top 1 row
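The cells above show the `null` case. The exception case can be sketched as follows; the one-row DataFrame with a struct column `s` is hypothetical and exists only to provoke the error.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct}
import scala.util.Try

// Assumption: a local session for illustration
val session = SparkSession.builder().master("local[1]").appName("cast-sketch").getOrCreate()

// Hypothetical one-row DataFrame with a single struct column "s"
val withStruct = session.range(1).select(struct(col("id")).alias("s"))

// struct -> int is not a supported conversion, so the query is rejected
// as soon as select(...) is analyzed, before any data is read
val attempt = Try(withStruct.select(col("s").cast("int")))

println(attempt.isFailure)
```

Wrapping the call in `Try` is only a convenience for the sketch. In a real job the exception would normally be left to surface, since it indicates a mistake in the query itself.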



### Takeaways:
+ Spark uses schemas to describe column types and the structure of a whole DF, to read sources, and to work with JSON
+ A schema is an instance of the `StructType` class
+ A column can hold any Spark SQL data type, and maps, arrays, and structs can be nested to any depth

## Catalyst query optimizer
Catalyst optimizes queries to make them run faster. Among others, it applies the following techniques:
+ Column projection
+ Partition pruning
+ Predicate pushdown
+ Constant folding

Let's prepare a dataset to demonstrate how Catalyst works:

In [93]:
airports
    .repartition(2)
    .write
    .format("parquet")
    .partitionBy("iso_country")
    .mode("overwrite")
    .save("/tmp/airports_2.parquet")

val airportPq = spark.read.parquet("/tmp/airports_2.parquet")

airportPq = [ident: string, type: string ... 10 more fields]



[ident: string, type: string ... 10 more fields]

In [None]:
/tmp/airports_2.parquet
├── _SUCCESS
├── iso_country=AD
│   ├── part-00000-656e0232-1f3a-4077-a4dc-438d816b6e4d.c000.snappy.parquet
│   └── part-00001-656e0232-1f3a-4077-a4dc-438d816b6e4d.c000.snappy.parquet
├── iso_country=AE
│   ├── part-00000-656e0232-1f3a-4077-a4dc-438d816b6e4d.c000.snappy.parquet
│   └── part-00001-656e0232-1f3a-4077-a4dc-438d816b6e4d.c000.snappy.parquet
├── iso_country=AF
│   ├── part-00000-656e0232-1f3a-4077-a4dc-438d816b6e4d.c000.snappy.parquet
│   └── part-00001-656e0232-1f3a-4077-a4dc-438d816b6e4d.c000.snappy.parquet

### Column projection
This mechanism avoids reading columns that the query does not need from the source

In [95]:
spark.time { 
    val selected = airportPq.groupBy('ident).count
    selected.cache
    selected.count
    selected.unpersist
    printPhysicalPlan(selected)
}

*(2) HashAggregate(keys=[ident#1673], functions=[count(1)], output=[ident#1673, count#1710L])
+- Exchange hashpartitioning(ident#1673, 200)
   +- *(1) HashAggregate(keys=[ident#1673], functions=[partial_count(1)], output=[ident#1673, count#1714L])
      +- *(1) Project [ident#1673]
         +- *(1) FileScan parquet [ident#1673,iso_country#1684] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://spark-master-1.newprolab.com:8020/tmp/airports_2.parquet], PartitionCount: 243, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<ident:string>

Time taken: 10646 ms


In [96]:
spark.time { 
    val selected = airportPq
    selected.cache
    selected.count
    selected.unpersist
    printPhysicalPlan(selected)
}

*(1) FileScan parquet [ident#1673,type#1674,name#1675,elevation_ft#1676,continent#1677,iso_region#1678,municipality#1679,gps_code#1680,iata_code#1681,local_code#1682,coordinates#1683,iso_country#1684] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://spark-master-1.newprolab.com:8020/tmp/airports_2.parquet], PartitionCount: 243, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<ident:string,type:string,name:string,elevation_ft:int,continent:string,iso_region:string,m...

Time taken: 5732 ms


### Partition pruning
This mechanism avoids reading partitions that the query does not need

In [97]:
spark.time { 
    val filtered = airportPq.filter('iso_country === "RU")
    filtered.count
    printPhysicalPlan(filtered)
}

*(1) FileScan parquet [ident#1673,type#1674,name#1675,elevation_ft#1676,continent#1677,iso_region#1678,municipality#1679,gps_code#1680,iata_code#1681,local_code#1682,coordinates#1683,iso_country#1684] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://spark-master-1.newprolab.com:8020/tmp/airports_2.parquet], PartitionCount: 1, PartitionFilters: [isnotnull(iso_country#1684), (iso_country#1684 = RU)], PushedFilters: [], ReadSchema: struct<ident:string,type:string,name:string,elevation_ft:int,continent:string,iso_region:string,m...

Time taken: 3338 ms


### Predicate pushdown
This mechanism pushes filter conditions down to the data source level

In [98]:
spark.time { 
    val filtered = airportPq.filter('iso_region === "RU")
    filtered.count
    printPhysicalPlan(filtered)
}

*(1) Project [ident#1673, type#1674, name#1675, elevation_ft#1676, continent#1677, iso_region#1678, municipality#1679, gps_code#1680, iata_code#1681, local_code#1682, coordinates#1683, iso_country#1684]
+- *(1) Filter (isnotnull(iso_region#1678) && (iso_region#1678 = RU))
   +- *(1) FileScan parquet [ident#1673,type#1674,name#1675,elevation_ft#1676,continent#1677,iso_region#1678,municipality#1679,gps_code#1680,iata_code#1681,local_code#1682,coordinates#1683,iso_country#1684] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://spark-master-1.newprolab.com:8020/tmp/airports_2.parquet], PartitionCount: 243, PartitionFilters: [], PushedFilters: [IsNotNull(iso_region), EqualTo(iso_region,RU)], ReadSchema: struct<ident:string,type:string,name:string,elevation_ft:int,continent:string,iso_region:string,m...

Time taken: 5877 ms
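The three mechanisms above show up in three different fields of the `FileScan` node: `ReadSchema`, `PartitionFilters`, and `PushedFilters`. The sketch below reproduces all three on a tiny dataset. The data, the temporary path, and the local session are assumptions made for illustration and have nothing to do with the airports dataset.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session for illustration
val session = SparkSession.builder().master("local[1]").appName("scan-fields-sketch").getOrCreate()
import session.implicits._

// Hypothetical toy data written to a fresh temporary directory
val path = Files.createTempDirectory("catalyst-sketch").resolve("toy.parquet").toString
Seq(("a1", "RU", "RU-MOW"), ("a2", "US", "US-PA"), ("a3", "RU", "RU-SPE"))
  .toDF("ident", "iso_country", "iso_region")
  .write.partitionBy("iso_country").parquet(path)

val toy = session.read.parquet(path)

val query = toy
  .filter(col("iso_country") === "RU")    // partition column: expected in PartitionFilters
  .filter(col("iso_region") === "RU-MOW") // regular column: expected in PushedFilters
  .select("ident")                        // ReadSchema keeps only the columns the query needs

// Print the physical plan to inspect the FileScan fields
val planText = query.queryExecution.executedPlan.toString
println(planText)

val idents = query.as[String].collect().toSeq
```

Note that a pushed filter is still re-applied by Spark in a `Filter` node, as in the plan above: the data source may use it to skip data, but Spark does not rely on it for correctness.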


### Simplify casts
This mechanism removes unnecessary `cast` operations

In [99]:
val result = spark.range(0,10)
result.show
result.printSchema

+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
+---+

root
 |-- id: long (nullable = false)



result = [id: bigint]


[id: bigint]

In [103]:
spark.range(0,10).select('id.cast("long")).explain(true)

== Parsed Logical Plan ==
'Project [unresolvedalias(cast('id as bigint), None)]
+- Range (0, 10, step=1, splits=Some(2))

== Analyzed Logical Plan ==
id: bigint
Project [cast(id#1935L as bigint) AS id#1940L]
+- Range (0, 10, step=1, splits=Some(2))

== Optimized Logical Plan ==
Range (0, 10, step=1, splits=Some(2))

== Physical Plan ==
*(1) Range (0, 10, step=1, splits=2)


In [100]:
val result = spark.range(0,10).select('id.cast("long"))
printPhysicalPlan(result)

*(1) Range (0, 10, step=1, splits=2)



result = [id: bigint]


[id: bigint]

In [104]:
val result = spark.range(0,10).select('id.cast("int"))
printPhysicalPlan(result)

*(1) Project [cast(id#1941L as int) AS id#1943]
+- *(1) Range (0, 10, step=1, splits=2)



result = [id: int]


[id: int]

### Constant folding
This mechanism evaluates constant expressions during optimization, so the physical plan contains the precomputed result instead of the original expression

In [105]:
val result = spark.range(0,10).select((lit(3) >  lit(0)).alias("foo"))
printPhysicalPlan(result)

*(1) Project [true AS foo#1947]
+- *(1) Range (0, 10, step=1, splits=2)



result = [foo: boolean]


[foo: boolean]

In [106]:
val result = spark.range(0,10).select(('id >  0).alias("foo"))
printPhysicalPlan(result)

*(1) Project [(id#1949L > 0) AS foo#1951]
+- *(1) Range (0, 10, step=1, splits=2)



result = [foo: boolean]


[foo: boolean]
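Constant folding also applies to a constant sub-expression inside a larger expression that depends on a column. A minimal sketch, comparing the analyzed and the optimized logical plans; the expression and the local session are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit}

// Assumption: a local session for illustration
val session = SparkSession.builder().master("local[1]").appName("constant-folding-sketch").getOrCreate()

// 2 * 3 does not depend on any column, id does
val folded = session.range(0, 3).select((lit(2) * lit(3) + col("id")).alias("x"))

// The analyzed plan still contains the product; the optimized plan should not
val analyzedText = folded.queryExecution.analyzed.toString
val optimizedText = folded.queryExecution.optimizedPlan.toString

println(analyzedText)
println(optimizedText)
```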

### Combine filters
This mechanism merges consecutive filters into a single one

In [107]:
// =!= is used instead of !==, which is deprecated since Spark 2.0
val result = spark.range(0,10).filter('id > 0).filter('id =!= 5).filter('id < 10)
printPhysicalPlan(result)

*(1) Filter (((id#1953L > 0) && NOT (id#1953L = 5)) && (id#1953L < 10))
+- *(1) Range (0, 10, step=1, splits=2)



result = [id: bigint]




[id: bigint]

In [108]:
spark.range(0,10).filter('id > 0).filter('id =!= 5).filter('id < 10).explain(true)



== Parsed Logical Plan ==
'Filter ('id < 10)
+- Filter NOT (id#1958L = cast(5 as bigint))
   +- Filter (id#1958L > cast(0 as bigint))
      +- Range (0, 10, step=1, splits=Some(2))

== Analyzed Logical Plan ==
id: bigint
Filter (id#1958L < cast(10 as bigint))
+- Filter NOT (id#1958L = cast(5 as bigint))
   +- Filter (id#1958L > cast(0 as bigint))
      +- Range (0, 10, step=1, splits=Some(2))

== Optimized Logical Plan ==
Filter (((id#1958L > 0) && NOT (id#1958L = 5)) && (id#1958L < 10))
+- Range (0, 10, step=1, splits=Some(2))

== Physical Plan ==
*(1) Filter (((id#1958L > 0) && NOT (id#1958L = 5)) && (id#1958L < 10))
+- *(1) Range (0, 10, step=1, splits=2)
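The effect can also be checked programmatically by counting `Filter` nodes in the optimized logical plan. This is a sketch under assumptions: the helper `countFilters` is hypothetical, and it relies on Catalyst's internal `Filter` class, which is not part of the stable public API.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.catalyst.plans.logical.Filter
import org.apache.spark.sql.functions.col

// Assumption: a local session for illustration
val session = SparkSession.builder().master("local[1]").appName("combine-filters-sketch").getOrCreate()

// Hypothetical helper: number of Filter operators left after optimization
def countFilters(ds: Dataset[_]): Int =
  ds.queryExecution.optimizedPlan.collect { case f: Filter => f }.size

val base = session.range(0, 10)

// Two chained filters versus one filter with a combined condition
val chained = base.filter(col("id") > 0).filter(col("id") =!= 5)
val single = base.filter((col("id") > 0) && (col("id") =!= 5))

println(countFilters(chained))
println(countFilters(single))
```

Since both forms end up with the same optimized plan, the choice between chained `filter` calls and a single combined condition is mostly a matter of readability.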


### Collapse project
This mechanism merges adjacent `Project` operators (consecutive `select` calls) into one

In [115]:
spark.range(10)
    .select(lit(1) as "c1")
    .select(col("c1"), lit(2) as "c2")
    .select(col("c1"), col("c2"), lit(3) as "c3")
    .explain(true)

== Parsed Logical Plan ==
'Project [unresolvedalias('c1, None), unresolvedalias('c2, None), 3 AS c3#2010]
+- Project [c1#2005, 2 AS c2#2007]
   +- Project [1 AS c1#2005]
      +- Range (0, 10, step=1, splits=Some(2))

== Analyzed Logical Plan ==
c1: int, c2: int, c3: int
Project [c1#2005, c2#2007, 3 AS c3#2010]
+- Project [c1#2005, 2 AS c2#2007]
   +- Project [1 AS c1#2005]
      +- Range (0, 10, step=1, splits=Some(2))

== Optimized Logical Plan ==
Project [1 AS c1#2005, 2 AS c2#2007, 3 AS c3#2010]
+- Project
   +- Range (0, 10, step=1, splits=Some(2))

== Physical Plan ==
*(1) Project [1 AS c1#2005, 2 AS c2#2007, 3 AS c3#2010]
+- *(1) Project
   +- *(1) Range (0, 10, step=1, splits=2)


In [114]:
// Exclude the CollapseProject rule from the optimizer
spark.conf.set("spark.sql.optimizer.excludedRules", "org.apache.spark.sql.catalyst.optimizer.CollapseProject")

In [116]:
spark.range(10)
    .select(lit(1) as "c1")
    .select(col("c1"), lit(2) as "c2")
    .select(col("c1"), col("c2"), lit(3) as "c3")
    .explain(true)

== Parsed Logical Plan ==
'Project [unresolvedalias('c1, None), unresolvedalias('c2, None), 3 AS c3#2022]
+- Project [c1#2017, 2 AS c2#2019]
   +- Project [1 AS c1#2017]
      +- Range (0, 10, step=1, splits=Some(2))

== Analyzed Logical Plan ==
c1: int, c2: int, c3: int
Project [c1#2017, c2#2019, 3 AS c3#2022]
+- Project [c1#2017, 2 AS c2#2019]
   +- Project [1 AS c1#2017]
      +- Range (0, 10, step=1, splits=Some(2))

== Optimized Logical Plan ==
Project [1 AS c1#2017, 2 AS c2#2019, 3 AS c3#2022]
+- Project
   +- Range (0, 10, step=1, splits=Some(2))

== Physical Plan ==
*(1) Project [1 AS c1#2017, 2 AS c2#2019, 3 AS c3#2022]
+- *(1) Project
   +- *(1) Range (0, 10, step=1, splits=2)
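The effect of the rule can be measured by counting `Project` nodes in the optimized logical plan with and without the exclusion. The sketch below deliberately uses expressions that depend on `id` rather than literals, on the assumption that literal-only projections may also be simplified by other optimizer rules. The helpers `countProjects` and `twoSelects` are hypothetical, and Catalyst's `Project` class is an internal API.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.catalyst.plans.logical.Project
import org.apache.spark.sql.functions.col

// Assumption: a local session for illustration
val session = SparkSession.builder().master("local[1]").appName("collapse-project-sketch").getOrCreate()

val ruleKey = "spark.sql.optimizer.excludedRules"
val ruleName = "org.apache.spark.sql.catalyst.optimizer.CollapseProject"

// Hypothetical helper: number of Project operators left after optimization
def countProjects(ds: Dataset[_]): Int =
  ds.queryExecution.optimizedPlan.collect { case p: Project => p }.size

// A fresh DataFrame on every call, so each one is optimized under the current settings
def twoSelects(): DataFrame =
  session.range(10)
    .select((col("id") + 1).alias("a"))
    .select((col("a") * 2).alias("b"))

// Rule enabled: make sure no exclusion is left over from earlier cells
session.conf.unset(ruleKey)
val collapsedProjects = countProjects(twoSelects())

// Rule excluded
session.conf.set(ruleKey, ruleName)
val excludedProjects = countProjects(twoSelects())

// Restore the default so later queries are not affected
session.conf.unset(ruleKey)

println(s"with CollapseProject: $collapsedProjects, without: $excludedProjects")
```

Resetting the setting at the end matters in a notebook, because the exclusion otherwise stays active for every query that follows in the same session.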
