# DataFrame I
**Sergey Grishaev**  
serg.grishaev@gmail.com

## In this session
+ The structure of the Spark Dataframe API
+ Cleaning, projecting, and slicing data
+ Reading and writing data
+ Working with data
  - Grouping
  - Writing data
  - Joins
  - Window functions
  - Built-in functions
+ Caching

## Dataframe API

**Dataframe:**
+ a structured, columnar data structure
+ can be created from:
  - a local collection
  - a file (or files)
  - a database
+ in Python, runs significantly faster than the RDD API because it uses code generation (see Tungsten, Janino)
+ uses RDDs under the hood
+ supports arbitrary SQL operations on the data
+ like RDDs, Dataframes are lazy and immutable
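
To make "lazy and immutable" concrete, here is a minimal plain-Python sketch (not Spark code). The `LazyFrame` class and its methods are hypothetical names used only for illustration: every transformation returns a new object with a longer list of recorded steps, and nothing runs until the `collect` action is called.

```python
# Plain-Python model of lazy, immutable transformations (not Spark code).
class LazyFrame:
    def __init__(self, rows, steps=()):
        self._rows = rows      # source data, never modified
        self._steps = steps    # recorded transformations, kept as a tuple

    def filter(self, predicate):
        # Returns a NEW object; the original keeps its own plan.
        return LazyFrame(self._rows, self._steps + (("filter", predicate),))

    def select(self, *names):
        return LazyFrame(self._rows, self._steps + (("select", names),))

    def collect(self):
        # The "action": only now are the recorded steps executed.
        rows = list(self._rows)
        for kind, arg in self._steps:
            if kind == "filter":
                rows = [r for r in rows if arg(r)]
            else:
                rows = [{n: r[n] for n in arg} for r in rows]
        return rows


cities = LazyFrame([{"name": "Paris", "population": 2196936},
                    {"name": "Cairo", "population": 11922948}])
big = cities.filter(lambda r: r["population"] > 10_000_000).select("name")
print(big.collect())          # [{'name': 'Cairo'}]
print(len(cities.collect()))  # 2: the original is unchanged
```

Because the steps are only recorded, an engine built this way is free to reorder or merge them before execution, which is roughly what the `explain` output later in this notebook shows.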

## What a Dataframe consists of
+ schema: [pyspark.sql.types.StructType](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.types.StructType)
+ columns: [pyspark.sql.Column](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column)
+ data: [pyspark.sql.Row](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Row)

In [2]:
import os
import sys
os.environ["PYSPARK_PYTHON"]='/opt/anaconda/envs/bd9/bin/python'
os.environ["SPARK_HOME"]='/usr/hdp/current/spark2-client'
os.environ["PYSPARK_SUBMIT_ARGS"]='--num-executors 3 --executor-memory 3g pyspark-shell'

spark_home = os.environ.get('SPARK_HOME', None)

sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.10.7-src.zip'))

In [3]:
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
# conf.set("spark.app.name", "Sergey Grishaev Spark Dataframe app") 

spark = SparkSession.builder.config(conf=conf).appName("Sergey Grishaev Spark Dataframe app").getOrCreate()
sc = spark.sparkContext
spark
sc

Let's prepare a test dataset

In [4]:
from pyspark.sql.functions import *

test_data = [
{"name":"Moscow", "country":"Rossiya", "continent": "Europe", "population": 12380664},
{ "name":"Madrid", "country":"Spain" },
{ "name":"Paris", "country":"France", "continent": "Europe", "population" : 2196936},
{ "name":"Berlin", "country":"Germany", "continent": "Europe", "population": 3490105},
{ "name":"Barselona", "country":"Spain", "continent": "Europe" },
{ "name":"Cairo", "country":"Egypt", "continent": "Africa", "population": 11922948 },
{ "name":"Cairo", "country":"Egypt", "continent": "Africa", "population": 11922948 },
{ }
]

rdd = sc.parallelize(test_data)
df = spark.read.json(rdd).localCheckpoint()
df

DataFrame[continent: string, country: string, name: string, population: bigint]
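
The test records have different sets of keys, yet the resulting Dataframe has four columns in alphabetical order. The following plain-Python sketch models that behaviour under the assumption that the schema is the sorted union of all keys and that a missing key becomes a null. `infer_columns` and `to_rows` are hypothetical helper names, and type inference is ignored entirely.

```python
def infer_columns(records):
    """Union of all keys, sorted alphabetically (as in the output above)."""
    return sorted({key for record in records for key in record})


def to_rows(records):
    columns = infer_columns(records)
    # dict.get returns None for a missing key, mirroring Spark's null.
    return [tuple(record.get(c) for c in columns) for record in records]


sample = [
    {"name": "Moscow", "country": "Rossiya", "continent": "Europe", "population": 12380664},
    {"name": "Madrid", "country": "Spain"},
    {},
]
print(infer_columns(sample))  # ['continent', 'country', 'name', 'population']
print(to_rows(sample)[1])     # (None, 'Spain', 'Madrid', None)
```

The empty record `{}` turns into a row of four nulls, which is the row the cleaning steps later remove.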

The `show` method prints part of the Dataframe to the console

In [10]:
df.show(10, truncate=50, vertical=True)

-RECORD 0---------------
 continent  | Europe    
 country    | Rossiya   
 name       | Moscow    
 population | 12380664  
-RECORD 1---------------
 continent  | null      
 country    | Spain     
 name       | Madrid    
 population | null      
-RECORD 2---------------
 continent  | Europe    
 country    | France    
 name       | Paris     
 population | 2196936   
-RECORD 3---------------
 continent  | Europe    
 country    | Germany   
 name       | Berlin    
 population | 3490105   
-RECORD 4---------------
 continent  | Europe    
 country    | Spain     
 name       | Barselona 
 population | null      
-RECORD 5---------------
 continent  | Africa    
 country    | Egypt     
 name       | Cairo     
 population | 11922948  
-RECORD 6---------------
 continent  | Africa    
 country    | Egypt     
 name       | Cairo     
 population | 11922948  
-RECORD 7---------------
 continent  | null      
 country    | null      
 name       | null      
 population | null      


The `printSchema` method prints the Dataframe schema to the console

In [11]:
df.printSchema()

root
 |-- continent: string (nullable = true)
 |-- country: string (nullable = true)
 |-- name: string (nullable = true)
 |-- population: long (nullable = true)



The `select` method picks existing columns from a Dataframe (and can also create new ones)

In [12]:
df.show()

+---------+-------+---------+----------+
|continent|country|     name|population|
+---------+-------+---------+----------+
|   Europe|Rossiya|   Moscow|  12380664|
|     null|  Spain|   Madrid|      null|
|   Europe| France|    Paris|   2196936|
|   Europe|Germany|   Berlin|   3490105|
|   Europe|  Spain|Barselona|      null|
|   Africa|  Egypt|    Cairo|  11922948|
|   Africa|  Egypt|    Cairo|  11922948|
|     null|   null|     null|      null|
+---------+-------+---------+----------+



In [13]:
df.createOrReplaceTempView("table0")  # replaces the deprecated df.registerTempTable("table0")
spark.sql("""SELECT continent, country FROM table0""").show(10, False)

+---------+-------+
|continent|country|
+---------+-------+
|Europe   |Rossiya|
|null     |Spain  |
|Europe   |France |
|Europe   |Germany|
|Europe   |Spain  |
|Africa   |Egypt  |
|Africa   |Egypt  |
|null     |null   |
+---------+-------+



In [14]:
df.createOrReplaceTempView("table1") 
spark.sql("""SELECT continent, country FROM table1""").show(10, False)

+---------+-------+
|continent|country|
+---------+-------+
|Europe   |Rossiya|
|null     |Spain  |
|Europe   |France |
|Europe   |Germany|
|Europe   |Spain  |
|Africa   |Egypt  |
|Africa   |Egypt  |
|null     |null   |
+---------+-------+



In [15]:
sql_df = spark.sql("""SELECT continent, country FROM table0""")

In [16]:
sql_df

DataFrame[continent: string, country: string]

In [17]:
from pyspark.sql.functions import col

df.select(col("continent"), col("country")).show(10, False)
df.select(col("continent"), col("country")).explain(True)

+---------+-------+
|continent|country|
+---------+-------+
|Europe   |Rossiya|
|null     |Spain  |
|Europe   |France |
|Europe   |Germany|
|Europe   |Spain  |
|Africa   |Egypt  |
|Africa   |Egypt  |
|null     |null   |
+---------+-------+

== Parsed Logical Plan ==
'Project [unresolvedalias('continent, None), unresolvedalias('country, None)]
+- LogicalRDD [continent#5, country#6, name#7, population#8L], false

== Analyzed Logical Plan ==
continent: string, country: string
Project [continent#5, country#6]
+- LogicalRDD [continent#5, country#6, name#7, population#8L], false

== Optimized Logical Plan ==
Project [continent#5, country#6]
+- LogicalRDD [continent#5, country#6, name#7, population#8L], false

== Physical Plan ==
*(1) Project [continent#5, country#6]
+- Scan ExistingRDD[continent#5,country#6,name#7,population#8L]


In [22]:
c = col("continent").alias("column") # pyspark.sql.Column

In [23]:
df.select(c).show()

+------+
|column|
+------+
|Europe|
|  null|
|Europe|
|Europe|
|Europe|
|Africa|
|Africa|
|  null|
+------+



In [25]:
df.select("continent", col("continent")).show()

+---------+---------+
|continent|continent|
+---------+---------+
|   Europe|   Europe|
|     null|     null|
|   Europe|   Europe|
|   Europe|   Europe|
|   Europe|   Europe|
|   Africa|   Africa|
|   Africa|   Africa|
|     null|     null|
+---------+---------+



In [27]:
df.select("continent", col("continent"), c).explain()

== Physical Plan ==
*(1) Project [continent#5, continent#5, continent#5 AS column#131]
+- Scan ExistingRDD[continent#5,country#6,name#7,population#8L]


In [28]:
from pyspark.sql.functions import lit
lit("foo")

Column<b'foo'>

In [30]:
df.select(lit("foo").alias("FOO")).show()

+---+
|FOO|
+---+
|foo|
|foo|
|foo|
|foo|
|foo|
|foo|
|foo|
|foo|
+---+



In [37]:
lit(2).alias("foo")

Column<b'2 AS `foo`'>

In [33]:
df.select(lit(2).alias("foo"), col("population")) \
    .select((col("foo") + col("population")).alias("foo")).explain(extended=True)

== Parsed Logical Plan ==
'Project [('foo + 'population) AS foo#175]
+- Project [2 AS foo#172, population#8L]
   +- LogicalRDD [continent#5, country#6, name#7, population#8L], false

== Analyzed Logical Plan ==
foo: bigint
Project [(cast(foo#172 as bigint) + population#8L) AS foo#175L]
+- Project [2 AS foo#172, population#8L]
   +- LogicalRDD [continent#5, country#6, name#7, population#8L], false

== Optimized Logical Plan ==
Project [(2 + population#8L) AS foo#175L]
+- LogicalRDD [continent#5, country#6, name#7, population#8L], false

== Physical Plan ==
*(1) Project [(2 + population#8L) AS foo#175L]
+- Scan ExistingRDD[continent#5,country#6,name#7,population#8L]


In [45]:
op_sum = (col("foo") / col("population")).alias("sum")  # pyspark.sql.Column; despite the alias, this is a ratio

In [39]:
op_sum

Column<b'(foo / population) AS `sum`'>

In [46]:
df.select(lit(2).alias("foo"), col("population")).select(op_sum, col("population"), col("foo")).show()

+--------------------+----------+---+
|                 sum|population|foo|
+--------------------+----------+---+
|1.615422242296536E-7|  12380664|  2|
|                null|      null|  2|
|9.103587906065538E-7|   2196936|  2|
|5.730486618597435E-7|   3490105|  2|
|                null|      null|  2|
|1.677437492807987E-7|  11922948|  2|
|1.677437492807987E-7|  11922948|  2|
|                null|      null|  2|
+--------------------+----------+---+



In [35]:
df.filter(col("population") > 10000).show(10, False)

+---------+-------+------+----------+
|continent|country|name  |population|
+---------+-------+------+----------+
|Europe   |Rossiya|Moscow|12380664  |
|Europe   |France |Paris |2196936   |
|Europe   |Germany|Berlin|3490105   |
|Africa   |Egypt  |Cairo |11922948  |
|Africa   |Egypt  |Cairo |11922948  |
+---------+-------+------+----------+



In [47]:
df.where(col("population") > 10000).show(10, False)

+---------+-------+------+----------+
|continent|country|name  |population|
+---------+-------+------+----------+
|Europe   |Rossiya|Moscow|12380664  |
|Europe   |France |Paris |2196936   |
|Europe   |Germany|Berlin|3490105   |
|Africa   |Egypt  |Cairo |11922948  |
|Africa   |Egypt  |Cairo |11922948  |
+---------+-------+------+----------+



In [36]:
col("population") > 10000

Column<b'(population > 10000)'>
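
The comparison above does not compare anything yet: it builds an expression object that is evaluated per row later. Below is a minimal plain-Python sketch of that idea. The `Expr` class and the `col` and `where` helpers are hypothetical stand-ins, not the PySpark implementation, and the null handling assumes SQL-style semantics (a comparison with null yields null, and a filter keeps only rows where the predicate is true).

```python
class Expr:
    def __init__(self, fn, text):
        self.fn, self.text = fn, text   # row -> value, plus a printable form

    def __gt__(self, other):
        def gt(row):
            value = self.fn(row)
            return None if value is None else value > other
        return Expr(gt, f"({self.text} > {other})")

    def __eq__(self, other):
        def eq(row):
            value = self.fn(row)
            return None if value is None else value == other
        return Expr(eq, f"({self.text} = {other})")

    def __and__(self, other):
        def both(row):
            left, right = self.fn(row), other.fn(row)
            if left is False or right is False:
                return False
            return None if left is None or right is None else True
        return Expr(both, f"({self.text} AND {other.text})")

    def __repr__(self):
        return f"Expr<{self.text}>"


def col(name):
    return Expr(lambda row: row.get(name), name)


def where(rows, predicate):
    # Like a SQL WHERE: rows whose predicate is null are dropped too.
    return [row for row in rows if predicate.fn(row) is True]


rows = [{"name": "Paris", "continent": "Europe", "population": 2196936},
        {"name": "Madrid", "continent": None, "population": None},
        {"name": "Cairo", "continent": "Africa", "population": 11922948}]
pred = (col("population") > 10000) & (col("continent") == "Europe")
print(pred)                                    # Expr<((population > 10000) AND (continent = Europe))>
print([r["name"] for r in where(rows, pred)])  # ['Paris']
```

Each comparison needs its own parentheses because Python's `&` binds more tightly than `>` and `==`. The null handling is also why rows with a missing population never pass a `population > 10000` filter.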

In [48]:
spark.sql("""SELECT * FROM table0 WHERE population > 10000""").show(10, False)

+---------+-------+------+----------+
|continent|country|name  |population|
+---------+-------+------+----------+
|Europe   |Rossiya|Moscow|12380664  |
|Europe   |France |Paris |2196936   |
|Europe   |Germany|Berlin|3490105   |
|Africa   |Egypt  |Cairo |11922948  |
|Africa   |Egypt  |Cairo |11922948  |
+---------+-------+------+----------+



In [49]:
spark.sql("""SELECT * FROM table0 WHERE population > 10000""").explain(True)

== Parsed Logical Plan ==
'Project [*]
+- 'Filter ('population > 10000)
   +- 'UnresolvedRelation `table0`

== Analyzed Logical Plan ==
continent: string, country: string, name: string, population: bigint
Project [continent#5, country#6, name#7, population#8L]
+- Filter (population#8L > cast(10000 as bigint))
   +- SubqueryAlias `table0`
      +- LogicalRDD [continent#5, country#6, name#7, population#8L], false

== Optimized Logical Plan ==
Filter (isnotnull(population#8L) && (population#8L > 10000))
+- LogicalRDD [continent#5, country#6, name#7, population#8L], false

== Physical Plan ==
*(1) Filter (isnotnull(population#8L) && (population#8L > 10000))
+- Scan ExistingRDD[continent#5,country#6,name#7,population#8L]


In [50]:
df.where(col("population") > 10000).explain(True)

== Parsed Logical Plan ==
'Filter ('population > 10000)
+- LogicalRDD [continent#5, country#6, name#7, population#8L], false

== Analyzed Logical Plan ==
continent: string, country: string, name: string, population: bigint
Filter (population#8L > cast(10000 as bigint))
+- LogicalRDD [continent#5, country#6, name#7, population#8L], false

== Optimized Logical Plan ==
Filter (isnotnull(population#8L) && (population#8L > 10000))
+- LogicalRDD [continent#5, country#6, name#7, population#8L], false

== Physical Plan ==
*(1) Filter (isnotnull(population#8L) && (population#8L > 10000))
+- Scan ExistingRDD[continent#5,country#6,name#7,population#8L]


In [51]:
df.filter((col("population") > 10000) & (col("continent") == "Europe")).show(10, False)

+---------+-------+------+----------+
|continent|country|name  |population|
+---------+-------+------+----------+
|Europe   |Rossiya|Moscow|12380664  |
|Europe   |France |Paris |2196936   |
|Europe   |Germany|Berlin|3490105   |
+---------+-------+------+----------+



In [52]:
df.filter((col("population") > 10000) & (col("continent") == "Europe")).explain(True)

== Parsed Logical Plan ==
'Filter (('population > 10000) && ('continent = Europe))
+- LogicalRDD [continent#5, country#6, name#7, population#8L], false

== Analyzed Logical Plan ==
continent: string, country: string, name: string, population: bigint
Filter ((population#8L > cast(10000 as bigint)) && (continent#5 = Europe))
+- LogicalRDD [continent#5, country#6, name#7, population#8L], false

== Optimized Logical Plan ==
Filter (((isnotnull(population#8L) && isnotnull(continent#5)) && (population#8L > 10000)) && (continent#5 = Europe))
+- LogicalRDD [continent#5, country#6, name#7, population#8L], false

== Physical Plan ==
*(1) Filter (((isnotnull(population#8L) && isnotnull(continent#5)) && (population#8L > 10000)) && (continent#5 = Europe))
+- Scan ExistingRDD[continent#5,country#6,name#7,population#8L]


In [53]:
df.filter(col("population") > 10000).filter(col("continent") == "Europe").explain(True)

== Parsed Logical Plan ==
'Filter ('continent = Europe)
+- Filter (population#8L > cast(10000 as bigint))
   +- LogicalRDD [continent#5, country#6, name#7, population#8L], false

== Analyzed Logical Plan ==
continent: string, country: string, name: string, population: bigint
Filter (continent#5 = Europe)
+- Filter (population#8L > cast(10000 as bigint))
   +- LogicalRDD [continent#5, country#6, name#7, population#8L], false

== Optimized Logical Plan ==
Filter (((isnotnull(population#8L) && isnotnull(continent#5)) && (population#8L > 10000)) && (continent#5 = Europe))
+- LogicalRDD [continent#5, country#6, name#7, population#8L], false

== Physical Plan ==
*(1) Filter (((isnotnull(population#8L) && isnotnull(continent#5)) && (population#8L > 10000)) && (continent#5 = Europe))
+- Scan ExistingRDD[continent#5,country#6,name#7,population#8L]


In [54]:
df.filter(
        col("population") > 10000
        ) \
        .select(
        col("name"), 
        lit(True).alias("t"),
        col("continent")
        ) \
        .filter(col("continent") == "Europe").show(20, False)

+------+----+---------+
|name  |t   |continent|
+------+----+---------+
|Moscow|true|Europe   |
|Paris |true|Europe   |
|Berlin|true|Europe   |
+------+----+---------+



In [55]:
df.filter(
        col("population") > 10000
        ) \
        .select(
        col("name"), 
        lit(True).alias("t"),
        col("continent")
        ) \
        .filter(col("continent") == "Europe").explain(True)

== Parsed Logical Plan ==
'Filter ('continent = Europe)
+- Project [name#7, true AS t#313, continent#5]
   +- Filter (population#8L > cast(10000 as bigint))
      +- LogicalRDD [continent#5, country#6, name#7, population#8L], false

== Analyzed Logical Plan ==
name: string, t: boolean, continent: string
Filter (continent#5 = Europe)
+- Project [name#7, true AS t#313, continent#5]
   +- Filter (population#8L > cast(10000 as bigint))
      +- LogicalRDD [continent#5, country#6, name#7, population#8L], false

== Optimized Logical Plan ==
Project [name#7, true AS t#313, continent#5]
+- Filter (((isnotnull(population#8L) && isnotnull(continent#5)) && (population#8L > 10000)) && (continent#5 = Europe))
   +- LogicalRDD [continent#5, country#6, name#7, population#8L], false

== Physical Plan ==
*(1) Project [name#7, true AS t#313, continent#5]
+- *(1) Filter (((isnotnull(population#8L) && isnotnull(continent#5)) && (population#8L > 10000)) && (continent#5 = Europe))
   +- Scan ExistingRDD[con

In [63]:
rows = df.filter(
        col("population") > 10000
        ) \
        .select(
        col("name"), 
        lit(True).alias("t"),
        col("continent")
        ) \
        .filter(col("continent") == "Europe").limit(2).collect()

type(rows[0])

pyspark.sql.types.Row

In [65]:
type(rows)

list

In [64]:
rows

[Row(name='Moscow', t=True, continent='Europe'),
 Row(name='Paris', t=True, continent='Europe')]

## Data cleaning
Let's remove duplicates. By default, the `dropDuplicates` method removes duplicate rows in which ALL columns match
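
As a rough plain-Python model of the two modes (all columns versus a `subset` of columns), consider the sketch below. `drop_duplicates` is a hypothetical helper, and keeping the first row seen is an assumption of this model: in Spark, which of the competing rows survives a `subset` deduplication is not guaranteed.

```python
def drop_duplicates(rows, subset=None):
    """Keep one row per distinct key; the key is every column unless subset is given."""
    seen = set()
    result = []
    for row in rows:
        columns = subset if subset is not None else sorted(row)
        key = tuple(row[c] for c in columns)
        if key not in seen:   # None equals None here, so null keys collapse too
            seen.add(key)
            result.append(row)
    return result


rows = [
    {"continent": "Africa", "name": "Cairo", "population": 11922948},
    {"continent": "Africa", "name": "Cairo", "population": 11922948},
    {"continent": "Africa", "name": "Cairo", "population": 0},  # hypothetical conflicting record
]
print(len(drop_duplicates(rows)))                                # 2
print(len(drop_duplicates(rows, subset=["continent", "name"])))  # 1
```

With a `subset`, rows that differ in the remaining columns are still merged, so the values of those columns in the result depend on which row happened to be kept.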

In [66]:
df.dropDuplicates().show(10, False)
df.dropDuplicates().explain(True)

+---------+-------+---------+----------+
|continent|country|name     |population|
+---------+-------+---------+----------+
|Europe   |Rossiya|Moscow   |12380664  |
|null     |null   |null     |null      |
|null     |Spain  |Madrid   |null      |
|Europe   |Spain  |Barselona|null      |
|Europe   |France |Paris    |2196936   |
|Europe   |Germany|Berlin   |3490105   |
|Africa   |Egypt  |Cairo    |11922948  |
+---------+-------+---------+----------+

== Parsed Logical Plan ==
Deduplicate [continent#5, country#6, name#7, population#8L]
+- LogicalRDD [continent#5, country#6, name#7, population#8L], false

== Analyzed Logical Plan ==
continent: string, country: string, name: string, population: bigint
Deduplicate [continent#5, country#6, name#7, population#8L]
+- LogicalRDD [continent#5, country#6, name#7, population#8L], false

== Optimized Logical Plan ==
Aggregate [continent#5, country#6, name#7, population#8L], [continent#5, country#6, name#7, population#8L]
+- LogicalRDD [continent#5, c

In [67]:
df.dropDuplicates(subset=["continent", "name"]).show(10, False)
df.dropDuplicates(subset=["continent", "name"]).explain(True)

+---------+-------+---------+----------+
|continent|country|name     |population|
+---------+-------+---------+----------+
|null     |Spain  |Madrid   |null      |
|Europe   |France |Paris    |2196936   |
|null     |null   |null     |null      |
|Africa   |Egypt  |Cairo    |11922948  |
|Europe   |Germany|Berlin   |3490105   |
|Europe   |Rossiya|Moscow   |12380664  |
|Europe   |Spain  |Barselona|null      |
+---------+-------+---------+----------+

== Parsed Logical Plan ==
Deduplicate [continent#5, name#7]
+- LogicalRDD [continent#5, country#6, name#7, population#8L], false

== Analyzed Logical Plan ==
continent: string, country: string, name: string, population: bigint
Deduplicate [continent#5, name#7]
+- LogicalRDD [continent#5, country#6, name#7, population#8L], false

== Optimized Logical Plan ==
Aggregate [continent#5, name#7], [continent#5, first(country#6, false) AS country#6, name#7, first(population#8L, false) AS population#8L]
+- LogicalRDD [continent#5, country#6, name#7, po

The `.na.drop` method removes ROWS with missing data. With `how="all"`, only rows in which ALL columns are `null` are removed
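
Both modes can be seen as one rule: keep a row if it has at least some minimum number of non-null values among the checked columns. The sketch below is a plain-Python model under that assumption. `na_drop` is a hypothetical helper whose parameters mirror `how`, `thresh` and `subset`.

```python
def na_drop(rows, how="any", thresh=None, subset=None):
    """Keep rows that have at least the required number of non-null values."""
    kept = []
    for row in rows:
        columns = subset if subset is not None else list(row)
        if thresh is None:
            # how="any": every checked column must be present; how="all": one is enough.
            required = len(columns) if how == "any" else 1
        else:
            required = thresh
        non_null = sum(row[c] is not None for c in columns)
        if non_null >= required:
            kept.append(row)
    return kept


rows = [
    {"continent": "Europe", "name": "Moscow", "population": 12380664},
    {"continent": "Europe", "name": "Barselona", "population": None},
    {"continent": None, "name": "Madrid", "population": None},
    {"continent": None, "name": None, "population": None},
]
print(len(na_drop(rows, how="all")))                                # 3
print(len(na_drop(rows, how="any")))                                # 1
print(len(na_drop(rows, how="any", subset=["continent", "name"])))  # 2
```

With `subset`, only the listed columns are checked, so Barselona survives even though its population is missing.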

In [68]:
df.show()

+---------+-------+---------+----------+
|continent|country|     name|population|
+---------+-------+---------+----------+
|   Europe|Rossiya|   Moscow|  12380664|
|     null|  Spain|   Madrid|      null|
|   Europe| France|    Paris|   2196936|
|   Europe|Germany|   Berlin|   3490105|
|   Europe|  Spain|Barselona|      null|
|   Africa|  Egypt|    Cairo|  11922948|
|   Africa|  Egypt|    Cairo|  11922948|
|     null|   null|     null|      null|
+---------+-------+---------+----------+



In [69]:
df.dropDuplicates().na.drop(how="all").show(10, False)
df.dropDuplicates().na.drop(how="all").explain()

+---------+-------+---------+----------+
|continent|country|name     |population|
+---------+-------+---------+----------+
|Europe   |Rossiya|Moscow   |12380664  |
|null     |Spain  |Madrid   |null      |
|Europe   |Spain  |Barselona|null      |
|Europe   |France |Paris    |2196936   |
|Europe   |Germany|Berlin   |3490105   |
|Africa   |Egypt  |Cairo    |11922948  |
+---------+-------+---------+----------+

== Physical Plan ==
*(2) HashAggregate(keys=[continent#5, country#6, name#7, population#8L], functions=[])
+- Exchange hashpartitioning(continent#5, country#6, name#7, population#8L, 200)
   +- *(1) HashAggregate(keys=[continent#5, country#6, name#7, population#8L], functions=[])
      +- *(1) Filter AtLeastNNulls(n, continent#5,country#6,name#7,population#8L)
         +- Scan ExistingRDD[continent#5,country#6,name#7,population#8L]


In [70]:
df.dropDuplicates().na.drop(how="any").show(10, False)

+---------+-------+------+----------+
|continent|country|name  |population|
+---------+-------+------+----------+
|Europe   |Rossiya|Moscow|12380664  |
|Europe   |France |Paris |2196936   |
|Europe   |Germany|Berlin|3490105   |
|Africa   |Egypt  |Cairo |11922948  |
+---------+-------+------+----------+



In [71]:
df.dropDuplicates().na.drop(how="any", subset=["continent", "name"]).show(10, False)

+---------+-------+---------+----------+
|continent|country|name     |population|
+---------+-------+---------+----------+
|Europe   |Rossiya|Moscow   |12380664  |
|Europe   |Spain  |Barselona|null      |
|Europe   |France |Paris    |2196936   |
|Europe   |Germany|Berlin   |3490105   |
|Africa   |Egypt  |Cairo    |11922948  |
+---------+-------+---------+----------+



The `.na.fill` method fills in `null` values. It takes a replacement value or, as here, a dict that maps column names to replacement values
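
A plain-Python model of filling with a dict is sketched below. `na_fill` is a hypothetical helper: a null is replaced only in columns named in the dict, and all other columns are left untouched.

```python
def na_fill(rows, fill_values):
    """Replace None with the per-column default; other columns are left alone."""
    return [
        {c: (fill_values[c] if v is None and c in fill_values else v)
         for c, v in row.items()}
        for row in rows
    ]


fill_values = {"continent": "n/a", "population": 0}
rows = [
    {"continent": None, "country": "Spain", "name": "Madrid", "population": None},
    {"continent": None, "country": None, "name": None, "population": None},
]
filled = na_fill(rows, fill_values)
print(filled[0])  # {'continent': 'n/a', 'country': 'Spain', 'name': 'Madrid', 'population': 0}
print(filled[1])  # {'continent': 'n/a', 'country': None, 'name': None, 'population': 0}
```

The second row shows why the order of cleaning steps matters: once filled, a completely empty row is no longer "all null", so dropping such rows has to happen before filling.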

In [72]:
fill_dict = {'continent': 'n/a', 'population': 0 }

df.dropDuplicates().na.drop(how="all").na.fill(fill_dict).show(10, False)
df.dropDuplicates().na.drop(how="all").na.fill(fill_dict).explain(True)

+---------+-------+---------+----------+
|continent|country|name     |population|
+---------+-------+---------+----------+
|Europe   |Rossiya|Moscow   |12380664  |
|n/a      |Spain  |Madrid   |0         |
|Europe   |Spain  |Barselona|0         |
|Europe   |France |Paris    |2196936   |
|Europe   |Germany|Berlin   |3490105   |
|Africa   |Egypt  |Cairo    |11922948  |
+---------+-------+---------+----------+

== Parsed Logical Plan ==
Project [coalesce(continent#5, cast(n/a as string)) AS continent#478, country#6, name#7, coalesce(population#8L, cast(0 as bigint)) AS population#479L]
+- Filter AtLeastNNulls(n, continent#5,country#6,name#7,population#8L)
   +- Deduplicate [continent#5, country#6, name#7, population#8L]
      +- LogicalRDD [continent#5, country#6, name#7, population#8L], false

== Analyzed Logical Plan ==
continent: string, country: string, name: string, population: bigint
Project [coalesce(continent#5, cast(n/a as string)) AS continent#478, country#6, name#7, coalesce(pop

The `.na.replace` method replaces values in columns. Here it is given a dict that maps old values to new ones
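
The replacement can be modelled in plain Python as an exact-match lookup per cell. `na_replace` is a hypothetical helper, and the second sample row is invented to show the effect of `subset`.

```python
def na_replace(rows, mapping, subset=None):
    """Swap exact matches of mapping keys; with subset, only in those columns."""
    return [
        {c: (mapping.get(v, v) if subset is None or c in subset else v)
         for c, v in row.items()}
        for row in rows
    ]


mapping = {"Rossiya": "Russia"}
rows = [
    {"country": "Rossiya", "name": "Moscow"},
    {"country": "Spain", "name": "Rossiya"},  # hypothetical row: same string in another column
]
print(na_replace(rows, mapping, subset=["country"]))
# [{'country': 'Russia', 'name': 'Moscow'}, {'country': 'Spain', 'name': 'Rossiya'}]
print(na_replace(rows, mapping)[1])
# {'country': 'Spain', 'name': 'Russia'}
```

Only whole values are matched, not substrings, and without `subset` every column is checked.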

In [73]:
replace_dict = {"Rossiya": "Russia"}

df.dropDuplicates().na.drop("all").na.fill(fill_dict).na.replace(replace_dict, subset=["country"]).show(10, False)
df.dropDuplicates().na.drop("all").na.fill(fill_dict).na.replace(replace_dict, subset=["country"]).explain(True)

+---------+-------+---------+----------+
|continent|country|name     |population|
+---------+-------+---------+----------+
|Europe   |Russia |Moscow   |12380664  |
|n/a      |Spain  |Madrid   |0         |
|Europe   |Spain  |Barselona|0         |
|Europe   |France |Paris    |2196936   |
|Europe   |Germany|Berlin   |3490105   |
|Africa   |Egypt  |Cairo    |11922948  |
+---------+-------+---------+----------+

== Parsed Logical Plan ==
Project [continent#528, CASE WHEN (country#6 = Rossiya) THEN cast(Russia as string) ELSE country#6 END AS country#538, name#7, population#529L]
+- Project [coalesce(continent#5, cast(n/a as string)) AS continent#528, country#6, name#7, coalesce(population#8L, cast(0 as bigint)) AS population#529L]
   +- Filter AtLeastNNulls(n, continent#5,country#6,name#7,population#8L)
      +- Deduplicate [continent#5, country#6, name#7, population#8L]
         +- LogicalRDD [continent#5, country#6, name#7, population#8L], false

== Analyzed Logical Plan ==
continent: str

In [76]:
df \
    .dropDuplicates() \
    .na.drop("all") \
    .na.fill(fill_dict) \
    .na.replace(replace_dict, subset=["country"]) \
    ._jdf \
    .queryExecution() \
    .executedPlan() \
    .toJSON()

'[{"class":"org.apache.spark.sql.execution.WholeStageCodegenExec","num-children":1,"child":0,"codegenStageId":2},{"class":"org.apache.spark.sql.execution.aggregate.HashAggregateExec","num-children":1,"requiredChildDistributionExpressions":[[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"continent","dataType":"string","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":5,"jvmId":"c6b9b368-4ef7-4732-8e07-da4fa9b238b1"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"country","dataType":"string","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":6,"jvmId":"c6b9b368-4ef7-4732-8e07-da4fa9b238b1"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"name","dataType":"string","nullable":true,"metad

In [78]:
type(df.na)

pyspark.sql.dataframe.DataFrameNaFunctions

In [77]:
sc._jvm.System.currentTimeMillis()

1687282146874

Let's prepare a Dataframe with the cleaned data

In [79]:
from pyspark.sql.functions import col

clean_data = df \
                .dropDuplicates() \
                .na.drop("all") \
                .na.fill(fill_dict) \
                .na.replace(replace_dict) \
                .filter(col("population") >= 0) \
                .select(col("continent"), col("country"), col("name"), col("population")) \
                .localCheckpoint()

clean_data.explain()

== Physical Plan ==
Scan ExistingRDD[continent#630,country#631,name#632,population#621L]


In [80]:
clean_data.show(10, False)

+---------+-------+---------+----------+
|continent|country|name     |population|
+---------+-------+---------+----------+
|Europe   |Russia |Moscow   |12380664  |
|n/a      |Spain  |Madrid   |0         |
|Europe   |Spain  |Barselona|0         |
|Europe   |France |Paris    |2196936   |
|Europe   |Germany|Berlin   |3490105   |
|Africa   |Egypt  |Cairo    |11922948  |
+---------+-------+---------+----------+



In [81]:
clean_data.printSchema()

root
 |-- continent: string (nullable = false)
 |-- country: string (nullable = true)
 |-- name: string (nullable = true)
 |-- population: long (nullable = false)



Let's build a basic aggregate. By default, the resulting columns get awkward names
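
The physical plan printed by `explain` in the next cell contains `partial_count` and `partial_sum`, then an `Exchange`, then the final `count` and `sum`. A plain-Python sketch of that two-phase scheme follows. The function names and the split into two partitions are assumptions made for illustration.

```python
from collections import defaultdict


def partial_aggregate(partition):
    """Per-partition pre-aggregation: continent -> [count, population sum]."""
    acc = defaultdict(lambda: [0, 0])
    for continent, population in partition:
        acc[continent][0] += 1
        acc[continent][1] += population
    return dict(acc)


def final_aggregate(partials):
    """Merge the partial results once they have been brought together by key."""
    merged = defaultdict(lambda: [0, 0])
    for partial in partials:
        for continent, (count, total) in partial.items():
            merged[continent][0] += count
            merged[continent][1] += total
    return {k: tuple(v) for k, v in merged.items()}


# Two hypothetical partitions of (continent, population) pairs.
partitions = [
    [("Europe", 12380664), ("Europe", 2196936), ("Africa", 11922948)],
    [("Europe", 3490105), ("Europe", 0), ("n/a", 0)],
]
result = final_aggregate(partial_aggregate(p) for p in partitions)
print(result["Europe"])  # (4, 18067705)
```

Pre-aggregating inside each partition means only one small record per key and partition has to cross the network, instead of every row.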

In [82]:
# from pyspark.sql.functions import count, sum
import pyspark.sql.functions as F

agg = clean_data.groupBy(col("continent")).agg(F.count("*"), F.sum(col("population")))

agg.show(10, False)
agg.explain()

agg.select(col("`sum(population)`")).show()

+---------+--------+---------------+
|continent|count(1)|sum(population)|
+---------+--------+---------------+
|Europe   |4       |18067705       |
|Africa   |1       |11922948       |
|n/a      |1       |0              |
+---------+--------+---------------+

== Physical Plan ==
*(2) HashAggregate(keys=[continent#630], functions=[count(1), sum(population#621L)])
+- Exchange hashpartitioning(continent#630, 200)
   +- *(1) HashAggregate(keys=[continent#630], functions=[partial_count(1), partial_sum(population#621L)])
      +- *(1) Project [continent#630, population#621L]
         +- Scan ExistingRDD[continent#630,country#631,name#632,population#621L]
+---------------+
|sum(population)|
+---------------+
|       18067705|
|       11922948|
|              0|
+---------------+



The `alias` method renames columns

In [83]:
from pyspark.sql.functions import count, sum, lower

pop_count = count("*").alias("city_count")
pop_sum = F.sum(col("population")).alias("population_sum")

agg = clean_data \
            .groupBy("continent") \
            .agg(pop_count, pop_sum) \
            .withColumn("continent", lower(col("continent")))

agg.show(10, False)
agg.explain()

+---------+----------+--------------+
|continent|city_count|population_sum|
+---------+----------+--------------+
|europe   |4         |18067705      |
|africa   |1         |11922948      |
|n/a      |1         |0             |
+---------+----------+--------------+

== Physical Plan ==
*(2) HashAggregate(keys=[continent#630], functions=[count(1), sum(population#621L)])
+- Exchange hashpartitioning(continent#630, 200)
   +- *(1) HashAggregate(keys=[continent#630], functions=[partial_count(1), partial_sum(population#621L)])
      +- *(1) Project [continent#630, population#621L]
         +- Scan ExistingRDD[continent#630,country#631,name#632,population#621L]


The `orderBy` method sorts a Dataframe. This is convenient when we want to print its contents with `show`, but it is not suitable when writing data
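
The difference between a global sort and a per-partition sort (`sortWithinPartitions`, used a few cells below) can be modelled in plain Python with partitions as lists. Both functions are hypothetical, and `order_by` imitates range partitioning by simply cutting the globally sorted rows into contiguous chunks.

```python
def sort_within_partitions(partitions, key=None, reverse=False):
    """Sort each partition on its own; no data moves between partitions."""
    return [sorted(p, key=key, reverse=reverse) for p in partitions]


def order_by(partitions, key=None, reverse=False):
    """Global sort, modelled as contiguous key ranges spread over the partitions."""
    rows = sorted((r for p in partitions for r in p), key=key, reverse=reverse)
    size = max(1, -(-len(rows) // len(partitions)))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


partitions = [[3, 1], [4, 2]]
print(sort_within_partitions(partitions))  # [[1, 3], [2, 4]]
print(order_by(partitions))                # [[1, 2], [3, 4]]
```

Reading the partitions one after another gives a fully sorted sequence only in the second case. With a single partition the two approaches return the same order, which is why the notebook calls `repartition(1)` before `sortWithinPartitions`.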

In [84]:
col("population_sum").desc()
col("population_sum").asc()

Column<b'population_sum ASC NULLS FIRST'>

In [90]:
agg.repartition(1).orderBy(col("population_sum").desc()).show()
agg.repartition(1).orderBy(col("population_sum").desc()).explain()

+---------+----------+--------------+
|continent|city_count|population_sum|
+---------+----------+--------------+
|   europe|         4|      18067705|
|   africa|         1|      11922948|
|      n/a|         1|             0|
+---------+----------+--------------+

== Physical Plan ==
*(3) Sort [population_sum#693L DESC NULLS LAST], true, 0
+- Exchange rangepartitioning(population_sum#693L DESC NULLS LAST, 1)
   +- Exchange RoundRobinPartitioning(1)
      +- *(2) HashAggregate(keys=[continent#630], functions=[count(1), sum(population#621L)])
         +- Exchange hashpartitioning(continent#630, 1)
            +- *(1) HashAggregate(keys=[continent#630], functions=[partial_count(1), partial_sum(population#621L)])
               +- *(1) Project [continent#630, population#621L]
                  +- Scan ExistingRDD[continent#630,country#631,name#632,population#621L]


In [88]:
agg.repartition(1).sortWithinPartitions(col("population_sum").desc()).show()
agg.repartition(1).sortWithinPartitions(col("population_sum").desc()).explain()

+---------+----------+--------------+
|continent|city_count|population_sum|
+---------+----------+--------------+
|   europe|         4|      18067705|
|   africa|         1|      11922948|
|      n/a|         1|             0|
+---------+----------+--------------+

== Physical Plan ==
*(3) Sort [population_sum#693L DESC NULLS LAST], false, 0
+- Exchange RoundRobinPartitioning(1)
   +- *(2) HashAggregate(keys=[continent#630], functions=[count(1), sum(population#621L)])
      +- Exchange hashpartitioning(continent#630, 200)
         +- *(1) HashAggregate(keys=[continent#630], functions=[partial_count(1), partial_sum(population#621L)])
            +- *(1) Project [continent#630, population#621L]
               +- Scan ExistingRDD[continent#630,country#631,name#632,population#621L]


In [86]:
agg.rdd.getNumPartitions()

200

In [91]:
spark.conf.set("spark.sql.shuffle.partitions", "200")

## Reading data from a source
The main method for reading from any source:

```df = spark.read.format(datasource_type).options(**datasource_options).load(object_name)```

+ ```datasource_type``` - the source type ("parquet", "json", "cassandra", etc.)
+ ```datasource_options``` - options for working with the source (logins, passwords, connection addresses, etc.)
+ ```object_name``` - the name of the table/file/topic/index

[DataframeReader](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader):
+ infers the data schema by default
+ is a transformation (lazy)
+ returns a [Dataframe](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame)

### List (incomplete) of supported data sources
+ Files:
  - json
  - text
  - csv
  - orc
  - parquet
  - delta
+ Databases
  - elasticsearch
  - cassandra
  - jdbc
  - hive
  - redis
  - mongo
+ Message brokers
  - kafka

**Libraries for working with data sources must be available on the JAVA CLASSPATH on the driver and the workers!**

In [92]:
df = spark.read.format("csv").options(header=True, inferSchema=True).load("/tmp/airport-codes.csv")

In [93]:
df.printSchema()

root
 |-- ident: string (nullable = true)
 |-- type: string (nullable = true)
 |-- name: string (nullable = true)
 |-- elevation_ft: integer (nullable = true)
 |-- continent: string (nullable = true)
 |-- iso_country: string (nullable = true)
 |-- iso_region: string (nullable = true)
 |-- municipality: string (nullable = true)
 |-- gps_code: string (nullable = true)
 |-- iata_code: string (nullable = true)
 |-- local_code: string (nullable = true)
 |-- coordinates: string (nullable = true)



In [94]:
df.show(n=1, truncate=False, vertical=True)

-RECORD 0------------------------------------------
 ident        | 00A                                
 type         | heliport                           
 name         | Total Rf Heliport                  
 elevation_ft | 11                                 
 continent    | NA                                 
 iso_country  | US                                 
 iso_region   | US-PA                              
 municipality | Bensalem                           
 gps_code     | 00A                                
 iata_code    | null                               
 local_code   | 00A                                
 coordinates  | 40.07080078125, -74.93360137939453 
only showing top 1 row



In [95]:
df.rdd.getNumPartitions()

2

## Writing data
The general pattern for writing to any system:

```df.write.format(datasource_type).options(**datasource_options).mode(savemode).save(object_name)```

+ ```datasource_type``` - the source type ("parquet", "json", "cassandra", etc.)
+ ```datasource_options``` - options for working with the source (logins, passwords, connection addresses, etc.)
+ ```savemode``` - the write mode (append, overwrite, etc.)
+ ```object_name``` - the name of the table/file/topic/index

[DataFrameWriter](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter):
+ the ```save``` method is an action
+ can work with partitioned data (parquet, orc)
+ does not always validate the schema and the data format
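
The `savemode` argument is easiest to understand through its effect on a target that already exists. Below is a small plain-Python model (no Spark required) of the four modes accepted by `DataFrameWriter.mode`: `append`, `overwrite`, `ignore`, and `error`/`errorifexists`. The `storage` dictionary standing in for a file system and the `save` function are assumptions for illustration only.

```python
def save(storage, path, rows, mode="errorifexists"):
    """Model the effect of a Spark save mode on an in-memory 'storage' dict."""
    exists = path in storage
    if mode == "append":
        # Add the new rows to whatever is already stored.
        storage[path] = storage.get(path, []) + list(rows)
    elif mode == "overwrite":
        # Replace any existing data.
        storage[path] = list(rows)
    elif mode == "ignore":
        # Silently do nothing if the target already exists.
        if not exists:
            storage[path] = list(rows)
    elif mode in ("error", "errorifexists"):
        # The default: refuse to touch an existing target.
        if exists:
            raise FileExistsError("path {} already exists".format(path))
        storage[path] = list(rows)
    else:
        raise ValueError("unknown save mode: {}".format(mode))


storage = {}
save(storage, "/tmp/agg0", [("europe", 4)])                    # first write succeeds
save(storage, "/tmp/agg0", [("africa", 1)], mode="append")     # now two rows
save(storage, "/tmp/agg0", [("asia", 0)], mode="ignore")       # no change
appended = list(storage["/tmp/agg0"])
save(storage, "/tmp/agg0", [("africa", 1)], mode="overwrite")  # one row again
print(appended, storage["/tmp/agg0"])
```

The cells in this section use `mode("overwrite")` so that they can be re-run without failing on the existing directory.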


### (Incomplete) list of supported data sources
+ Files:
  - json
  - text
  - csv
  - orc
  - parquet
  - delta
+ Databases
  - elasticsearch
  - cassandra
  - jdbc
  - hive
  - redis
  - mongo
+ Message brokers
  - kafka


**The libraries for working with these sources must be available on the JAVA CLASSPATH on both the driver and the workers!**



In [106]:
condition = col("continent") != "n/a"

agg \
    .filter(condition) \
    .repartition(2) \
    .write \
    .format("parquet") \
    .mode("overwrite") \
    .save("/tmp/{u}/agg0.parquet".format(u=sc.sparkUser()))

print("Ok! Data is written to /tmp/{u}/agg0.parquet".format(u=sc.sparkUser()))

Ok! Data is written to /tmp/teacher/agg0.parquet


In [97]:
condition = col("continent") != "n/a"

agg \
    .filter(condition) \
    .repartition(2) \
    .write \
    .format("csv") \
    .mode("overwrite") \
    .option("codec", "gzip") \
    .save("/tmp/{u}/agg0.csv".format(u=sc.sparkUser()))

print("Ok! Data is written to /tmp/{u}/agg0.csv".format(u=sc.sparkUser()))

Ok! Data is written to /tmp/teacher/agg0.csv


In [103]:
!hadoop fs -ls /tmp/teacher/agg0.parquet

Found 2 items
-rw-r--r--   3 teacher hdfs          0 2023-06-20 21:08 /tmp/teacher/agg0.parquet/_SUCCESS
-rw-r--r--   3 teacher hdfs        954 2023-06-20 21:08 /tmp/teacher/agg0.parquet/part-00000-5e90b924-aca5-41ed-8092-9eb7ae8313df-c000.snappy.parquet


In [99]:
!hadoop fs -ls /tmp/teacher/agg0.csv

Found 2 items
-rw-r--r--   3 teacher hdfs          0 2023-06-20 21:07 /tmp/teacher/agg0.csv/_SUCCESS
-rw-r--r--   3 teacher hdfs         56 2023-06-20 21:07 /tmp/teacher/agg0.csv/part-00000-56029a5f-1de2-4a9e-bd3e-5029ba01c8b4-c000.csv.gz


In [104]:
df = spark.read.parquet("/tmp/teacher/agg0.parquet")
df

DataFrame[continent: string, city_count: bigint, population_sum: bigint]

In [105]:
df.count()

1

## Joins

Joins combine two DFs into one according to given conditions.

By the type of condition, joins fall into:
+ equi-join - a join on equality of one or more keys
+ non-equi join - a join on a condition other than equality of one or more keys

Spark executes a join with one of several physical algorithms, including:
- BroadcastHashJoin
- SortMergeJoin
- BroadcastNestedLoopJoin
- CartesianProduct
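
To give a feel for the first two algorithms, here is a plain-Python sketch (no Spark, single process) of an inner equi-join done both ways. The function names and the tiny sample rows are assumptions for illustration; the real operators work on partitioned data and handle all join types. The other two algorithms in the list compare rows pairwise, which is generally what a non-equi condition requires.

```python
def broadcast_hash_join(left, right, key):
    """Inner equi-join: build a hash table on the small side, stream the large side."""
    table = {}
    for right_row in right:              # 'right' plays the small, broadcast side
        table.setdefault(right_row[key], []).append(right_row)
    out = []
    for left_row in left:                # each left row costs one hash lookup
        for right_row in table.get(left_row[key], []):
            merged = dict(right_row)
            merged.update(left_row)
            out.append(merged)
    return out


def sort_merge_join(left, right, key):
    """Inner equi-join: sort both sides by key, then advance two cursors."""
    left = sorted(left, key=lambda row: row[key])
    right = sorted(right, key=lambda row: row[key])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][key], right[j][key]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Pair the current left row with every right row sharing the key.
            k = j
            while k < len(right) and right[k][key] == left_key:
                merged = dict(right[k])
                merged.update(left[i])
                out.append(merged)
                k += 1
            i += 1
    return out


cities = [
    {"continent": "europe", "name": "Moscow"},
    {"continent": "africa", "name": "Cairo"},
    {"continent": "n/a", "name": "Madrid"},
    {"continent": "europe", "name": "Paris"},
]
totals = [
    {"continent": "europe", "city_count": 4},
    {"continent": "africa", "city_count": 1},
]
hashed = broadcast_hash_join(cities, totals, "continent")
merged_rows = sort_merge_join(cities, totals, "continent")
print(hashed)
print(merged_rows)
```

Both functions return the same set of rows; only the order differs, because the sort-merge variant emits rows in key order.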

In [107]:
left = clean_data.withColumn("continent", lower(col("continent")))
left.printSchema()

right = spark.read.parquet("/tmp/{u}/agg0.parquet".format(u=sc.sparkUser()))
right.printSchema()

left.show()
right.show()

root
 |-- continent: string (nullable = false)
 |-- country: string (nullable = true)
 |-- name: string (nullable = true)
 |-- population: long (nullable = false)

root
 |-- continent: string (nullable = true)
 |-- city_count: long (nullable = true)
 |-- population_sum: long (nullable = true)

+---------+-------+---------+----------+
|continent|country|     name|population|
+---------+-------+---------+----------+
|   europe| Russia|   Moscow|  12380664|
|      n/a|  Spain|   Madrid|         0|
|   europe|  Spain|Barselona|         0|
|   europe| France|    Paris|   2196936|
|   europe|Germany|   Berlin|   3490105|
|   africa|  Egypt|    Cairo|  11922948|
+---------+-------+---------+----------+

+---------+----------+--------------+
|continent|city_count|population_sum|
+---------+----------+--------------+
|   europe|         4|      18067705|
|   africa|         1|      11922948|
+---------+----------+--------------+



The join condition can be given as:
- the name of the column to join on
- a list of column names to join on
- a `pyspark.sql.Column` expression

In [108]:
joined = left.join(right, 'continent', 'inner')

joined.printSchema()

joined.show(10, False)
joined.explain()

root
 |-- continent: string (nullable = false)
 |-- country: string (nullable = true)
 |-- name: string (nullable = true)
 |-- population: long (nullable = false)
 |-- city_count: long (nullable = true)
 |-- population_sum: long (nullable = true)

+---------+-------+---------+----------+----------+--------------+
|continent|country|name     |population|city_count|population_sum|
+---------+-------+---------+----------+----------+--------------+
|europe   |Russia |Moscow   |12380664  |4         |18067705      |
|europe   |Spain  |Barselona|0         |4         |18067705      |
|europe   |France |Paris    |2196936   |4         |18067705      |
|europe   |Germany|Berlin   |3490105   |4         |18067705      |
|africa   |Egypt  |Cairo    |11922948  |1         |11922948      |
+---------+-------+---------+----------+----------+--------------+

== Physical Plan ==
*(2) Project [continent#898, country#631, name#632, population#621L, city_count#904L, population_sum#905L]
+- *(2) BroadcastHash

In [109]:
joined = left.join(right, ['continent'], 'inner')

joined.printSchema()

joined.show(10, False)
joined.explain()

root
 |-- continent: string (nullable = false)
 |-- country: string (nullable = true)
 |-- name: string (nullable = true)
 |-- population: long (nullable = false)
 |-- city_count: long (nullable = true)
 |-- population_sum: long (nullable = true)

+---------+-------+---------+----------+----------+--------------+
|continent|country|name     |population|city_count|population_sum|
+---------+-------+---------+----------+----------+--------------+
|europe   |Russia |Moscow   |12380664  |4         |18067705      |
|europe   |Spain  |Barselona|0         |4         |18067705      |
|europe   |France |Paris    |2196936   |4         |18067705      |
|europe   |Germany|Berlin   |3490105   |4         |18067705      |
|africa   |Egypt  |Cairo    |11922948  |1         |11922948      |
+---------+-------+---------+----------+----------+--------------+

== Physical Plan ==
*(2) Project [continent#898, country#631, name#632, population#621L, city_count#904L, population_sum#905L]
+- *(2) BroadcastHash

In [110]:
from pyspark.sql.functions import col

joined = left.alias("first") \
                .join(right.alias("second"), 
                      col("first.continent") == col("second.continent"), 'inner')

joined.printSchema()

joined.show(10, False)
joined.explain()

root
 |-- continent: string (nullable = false)
 |-- country: string (nullable = true)
 |-- name: string (nullable = true)
 |-- population: long (nullable = false)
 |-- continent: string (nullable = true)
 |-- city_count: long (nullable = true)
 |-- population_sum: long (nullable = true)

+---------+-------+---------+----------+---------+----------+--------------+
|continent|country|name     |population|continent|city_count|population_sum|
+---------+-------+---------+----------+---------+----------+--------------+
|europe   |Russia |Moscow   |12380664  |europe   |4         |18067705      |
|europe   |Spain  |Barselona|0         |europe   |4         |18067705      |
|europe   |France |Paris    |2196936   |europe   |4         |18067705      |
|europe   |Germany|Berlin   |3490105   |europe   |4         |18067705      |
|africa   |Egypt  |Cairo    |11922948  |africa   |1         |11922948      |
+---------+-------+---------+----------+---------+----------+--------------+

== Physical Plan 

In [113]:
joined.select(col("continent").alias("l_continent")).show()

AnalysisException: "Reference 'continent' is ambiguous, could be: first.continent, second.continent.;"

In [136]:
from pyspark.sql.functions import broadcast, expr

joined = left.alias("left") \
                .join(broadcast(right).alias("right"), 
                      expr("""left.continent = right.continent"""), 'inner')

joined.printSchema()

joined.show(10, False)
joined.explain()

root
 |-- continent: string (nullable = false)
 |-- country: string (nullable = true)
 |-- name: string (nullable = true)
 |-- population: long (nullable = false)
 |-- continent: string (nullable = true)
 |-- city_count: long (nullable = true)
 |-- population_sum: long (nullable = true)

+---------+-------+---------+----------+---------+----------+--------------+
|continent|country|name     |population|continent|city_count|population_sum|
+---------+-------+---------+----------+---------+----------+--------------+
|europe   |Russia |Moscow   |12380664  |europe   |4         |18067705      |
|europe   |Spain  |Barselona|0         |europe   |4         |18067705      |
|europe   |France |Paris    |2196936   |europe   |4         |18067705      |
|europe   |Germany|Berlin   |3490105   |europe   |4         |18067705      |
|africa   |Egypt  |Cairo    |11922948  |africa   |1         |11922948      |
+---------+-------+---------+----------+---------+----------+--------------+

== Physical Plan 

In [114]:
expr("""left.continent = right.continent""")

Column<b'(left.continent = right.continent)'>

## Оконные функции

Window functions let you compute functions over "windows" of data (who would have thought).

A window is created from the [pyspark.sql.Window](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Window) class by specifying the fields that define the window boundaries and the fields that define the sort order inside the window:

```window = Window.partitionBy("a", "b").orderBy("c")```

With windows you can use handy functions from [pyspark.sql.functions](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions) such as ```lag()``` and ```lead()```, and work efficiently with time-series data, for example by computing the average of a given field over a 3-hour interval.
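
As a mental model, the following plain-Python sketch (no Spark required) mimics what `row_number()`, `lag()` and an aggregate such as `sum()` return over a window partitioned by one field and ordered by another in descending order. The function `with_window` and the output field names `rn`, `prev` and `part_sum` are assumptions chosen for illustration.

```python
from itertools import groupby
from operator import itemgetter


def with_window(rows, partition_by, order_by):
    """Add rn, prev (lag 1) and part_sum per partition, ordered by order_by descending."""
    out = []
    keyed = sorted(rows, key=itemgetter(partition_by))
    for _, group in groupby(keyed, key=itemgetter(partition_by)):
        part = sorted(group, key=itemgetter(order_by), reverse=True)
        total = sum(r[order_by] for r in part)
        for i, row in enumerate(part):
            new_row = dict(row)
            new_row["rn"] = i + 1                                        # row_number()
            new_row["prev"] = part[i - 1][order_by] if i > 0 else None   # lag(col, 1)
            new_row["part_sum"] = total                                  # sum(col) over the partition
            out.append(new_row)
    return out


cities = [
    {"continent": "Europe", "name": "Paris", "population": 2196936},
    {"continent": "Europe", "name": "Moscow", "population": 12380664},
    {"continent": "Africa", "name": "Cairo", "population": 11922948},
    {"continent": "Europe", "name": "Berlin", "population": 3490105},
]
windowed = {row["name"]: row for row in with_window(cities, "continent", "population")}
for name, row in windowed.items():
    print(name, row["rn"], row["prev"], row["part_sum"])
```

Unlike `groupBy`, the number of rows does not change: every input row is kept and simply gains the extra columns.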

In [120]:
# In our case, window functions let us build the DF from the previous join examples,
# but without using a join

from pyspark.sql import Window
import pyspark.sql.functions as F

window = Window.partitionBy("continent")

agg = clean_data \
    .withColumn("city_count", F.count("*").over(window)) \
    .withColumn("population_sum", F.sum("population").over(window))

agg.show()
agg.explain()

+---------+-------+---------+----------+----------+--------------+
|continent|country|     name|population|city_count|population_sum|
+---------+-------+---------+----------+----------+--------------+
|   Europe| France|    Paris|   2196936|         4|      18067705|
|   Europe|Germany|   Berlin|   3490105|         4|      18067705|
|   Europe| Russia|   Moscow|  12380664|         4|      18067705|
|   Europe|  Spain|Barselona|         0|         4|      18067705|
|   Africa|  Egypt|    Cairo|  11922948|         1|      11922948|
|      n/a|  Spain|   Madrid|         0|         1|             0|
+---------+-------+---------+----------+----------+--------------+

== Physical Plan ==
Window [count(1) windowspecdefinition(continent#630, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS city_count#1138L, sum(population#621L) windowspecdefinition(continent#630, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS population_sum#1145

In [117]:
# Here the window is also ordered, so row_number() can rank the cities
# of each continent by population (largest first)

from pyspark.sql import Window
import pyspark.sql.functions as F

window = Window.partitionBy("continent").orderBy(col("population").desc())

agg = clean_data.withColumn("rn", row_number().over(window))
agg.show()
agg.explain()

+---------+-------+---------+----------+---+
|continent|country|     name|population| rn|
+---------+-------+---------+----------+---+
|   Europe| Russia|   Moscow|  12380664|  1|
|   Europe|Germany|   Berlin|   3490105|  2|
|   Europe| France|    Paris|   2196936|  3|
|   Europe|  Spain|Barselona|         0|  4|
|   Africa|  Egypt|    Cairo|  11922948|  1|
|      n/a|  Spain|   Madrid|         0|  1|
+---------+-------+---------+----------+---+

== Physical Plan ==
Window [row_number() windowspecdefinition(continent#630, population#621L DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rn#1112], [continent#630], [population#621L DESC NULLS LAST]
+- *(1) Sort [continent#630 ASC NULLS FIRST, population#621L DESC NULLS LAST], false, 0
   +- Exchange hashpartitioning(continent#630, 200)
      +- Scan ExistingRDD[continent#630,country#631,name#632,population#621L]


In [118]:
# A window can also carry a frame that limits which rows of the partition
# each calculation sees

from pyspark.sql import Window
import pyspark.sql.functions as F

# Frame of physical rows: the previous row, the current row and the next row
window = \
    Window.partitionBy("continent") \
            .orderBy(col("population").desc()) \
            .rowsBetween(-1, 1)

# Frame of values: rows whose population is within 100 of the current row's
# (this assignment replaces the window defined above)
window = \
    Window.partitionBy("continent") \
            .orderBy(col("population").desc()) \
            .rangeBetween(-100, 100)

# Constants for open-ended frame boundaries:
# Window.unboundedPreceding
# Window.unboundedFollowing
# Window.currentRow
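
The difference between `rowsBetween` and `rangeBetween` is easy to miss, so here is a plain-Python sketch (no Spark required) of a running `sum` over both kinds of frame. It assumes a single partition whose values are already sorted in ascending order; the helper names `rows_between` and `range_between` and the sample values are made up for illustration.

```python
def rows_between(values, start, end):
    """Sum over a frame of physical rows: positions i+start .. i+end."""
    result = []
    for i in range(len(values)):
        lo = max(0, i + start)
        hi = min(len(values) - 1, i + end)
        result.append(sum(values[lo:hi + 1]))
    return result


def range_between(values, start, end):
    """Sum over a frame of ordering values: v+start <= other <= v+end."""
    return [sum(o for o in values if v + start <= o <= v + end) for v in values]


ordered = [10, 20, 25, 200]          # one partition, sorted ascending
by_rows = rows_between(ordered, -1, 1)
by_range = range_between(ordered, -10, 10)
print(by_rows)
print(by_range)
```

A row frame always covers the neighbouring rows, however far apart their values are, while a range frame covers only rows whose ordering value is close enough. That is what makes range frames a natural fit for "the last 3 hours" style calculations on timestamps.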

## The pyspark.sql.functions module

Spark has a fairly large set of built-in functions available in [pyspark.sql.functions](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions), so before writing your own UDF it is well worth looking for the function you need in this package.

Moreover, Spark functions accept and return [pyspark.sql.Column](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column), which means you can compose them with one another.

**It is also important to remember that functions and columns in Spark can be created without being bound to specific data or a DF**
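
One way to picture an unbound column is as a small recipe that records how to compute a value and only runs when it is applied to rows. The toy classes below are a plain-Python sketch of that idea (no Spark required); `Col`, `col` and `select` here are stand-ins written for illustration and are not Spark's implementation.

```python
class Col:
    """A tiny unbound column expression: stores how to compute, not the data."""

    def __init__(self, name, fn):
        self.name = name
        self.fn = fn                     # function from a row (dict) to a value

    def __truediv__(self, other):
        return Col("({} / {})".format(self.name, other.name),
                   lambda row: self.fn(row) / other.fn(row))

    def alias(self, name):
        return Col(name, self.fn)


def col(name):
    """Reference a field by name; nothing is read until select() runs."""
    return Col(name, lambda row: row[name])


def select(rows, *columns):
    """Evaluate the expressions against concrete rows."""
    return [{c.name: c.fn(row) for c in columns} for row in rows]


# Built with no data in sight, then reused on two different datasets.
avg_pop = (col("population_sum") / col("city_count")).alias("avg_pop")

europe = [{"population_sum": 18067705, "city_count": 4}]
africa = [{"population_sum": 11922948, "city_count": 1}]
result = select(europe, avg_pop) + select(africa, avg_pop)
print(result)
```

Because the expression is just a description, the same `avg_pop` can be stored in a variable, passed between functions, and applied to any DF that has the referenced columns.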

In [121]:
from pyspark.sql.functions import to_json, col, struct

avg_pop = \
    to_json(
        struct(
            (col("population_sum") / col("city_count")).alias("value")
        )
    ).alias("avg_pop")

agg.select(col("*"), avg_pop).show(truncate=False)
agg.select(col("*"), avg_pop).explain()

+---------+-------+---------+----------+----------+--------------+---------------------+
|continent|country|name     |population|city_count|population_sum|avg_pop              |
+---------+-------+---------+----------+----------+--------------+---------------------+
|Europe   |France |Paris    |2196936   |4         |18067705      |{"value":4516926.25} |
|Europe   |Germany|Berlin   |3490105   |4         |18067705      |{"value":4516926.25} |
|Europe   |Russia |Moscow   |12380664  |4         |18067705      |{"value":4516926.25} |
|Europe   |Spain  |Barselona|0         |4         |18067705      |{"value":4516926.25} |
|Africa   |Egypt  |Cairo    |11922948  |1         |11922948      |{"value":1.1922948E7}|
|n/a      |Spain  |Madrid   |0         |1         |0             |{"value":0.0}        |
+---------+-------+---------+----------+----------+--------------+---------------------+

== Physical Plan ==
Project [continent#630, country#631, name#632, population#621L, city_count#1138L, populat

A big advantage of Spark over most SQL-oriented databases is its built-in functions for working with lists (arrays), dictionaries (maps), and structs

In [122]:
from pyspark.sql.functions import *

all_in_one = agg.select(struct(*agg.columns).alias("allinone"))

all_in_one.printSchema()
all_in_one.show(20, False)
all_in_one.explain()

root
 |-- allinone: struct (nullable = false)
 |    |-- continent: string (nullable = false)
 |    |-- country: string (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- population: long (nullable = false)
 |    |-- city_count: long (nullable = false)
 |    |-- population_sum: long (nullable = true)

+-----------------------------------------------+
|allinone                                       |
+-----------------------------------------------+
|[Europe, Spain, Barselona, 0, 4, 18067705]     |
|[Europe, Russia, Moscow, 12380664, 4, 18067705]|
|[Europe, France, Paris, 2196936, 4, 18067705]  |
|[Europe, Germany, Berlin, 3490105, 4, 18067705]|
|[Africa, Egypt, Cairo, 11922948, 1, 11922948]  |
|[n/a, Spain, Madrid, 0, 1, 0]                  |
+-----------------------------------------------+

== Physical Plan ==
*(2) Project [named_struct(continent, continent#630, country, country#631, name, name#632, population, population#621L, city_count, city_count#1138L, populatio

For example, you can create arrays and combine them

In [123]:
from pyspark.sql.functions import *

arrays = \
    spark.range(0,1) \
    .withColumn("a", array(lit(1), lit(2), lit(3))) \
    .withColumn("b", array(lit(4),lit(5),lit(6))) \
    .localCheckpoint() \
    .select(array_union(col("a"), col("b")).alias("c"))


arrays.show(1, False)
arrays.explain()

+------------------+
|c                 |
+------------------+
|[1, 2, 3, 4, 5, 6]|
+------------------+

== Physical Plan ==
*(1) Project [array_union(a#1217, b#1220) AS c#1227]
+- Scan ExistingRDD[id#1215L,a#1217,b#1220]


Lists are not very convenient for analysis, so they are usually "exploded" with `explode`
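
Conceptually, `explode` emits one output row per array element and repeats the other columns. A plain-Python sketch of that behaviour (no Spark required; the helper below merely borrows the name and is not Spark's implementation, and the sample rows are made up):

```python
def explode(rows, column):
    """Emit one output row per element of the list stored in `column`."""
    out = []
    for row in rows:
        for item in row[column]:
            new_row = dict(row)
            new_row[column] = item
            out.append(new_row)
    return out


rows = [{"id": 0, "c": [1, 2, 3]}, {"id": 1, "c": []}, {"id": 2, "c": [4]}]
exploded = explode(rows, "c")
print(exploded)
```

Note that the row with an empty list disappears from the result, which is also how Spark's `explode` behaves; `explode_outer` keeps such rows with a null instead.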

In [124]:
from pyspark.sql.functions import explode

arrays.select(explode(col("c")).alias("c")).show()
arrays.select(explode(col("c")).alias("c")).explain()

+---+
|  c|
+---+
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
+---+

== Physical Plan ==
Generate explode(c#1227), false, [c#1241]
+- *(1) Project [array_union(a#1217, b#1220) AS c#1227]
   +- Scan ExistingRDD[id#1215L,a#1217,b#1220]


In [126]:
spark.range(10).select(F.expr("""pmod(id, 2)""")).show()

+---------------------------+
|pmod(id, CAST(2 AS BIGINT))|
+---------------------------+
|                          0|
|                          1|
|                          0|
|                          1|
|                          0|
|                          1|
|                          0|
|                          1|
|                          0|
|                          1|
+---------------------------+



In [125]:
spark.range(10).select(F.pmod(col("id"), 2)).show()

AttributeError: module 'pyspark.sql.functions' has no attribute 'pmod'
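
The two cells above show that a SQL function missing from the Python module in this Spark version can still be reached through `expr`. As for what `pmod` computes: it is a "positive modulus". The sketch below is plain Python (no Spark required) and assumes that the ordinary `%` in Spark SQL follows Java semantics, where the remainder takes the sign of the dividend; `java_rem` and this `pmod` are illustrative helpers, not Spark code.

```python
def java_rem(a, n):
    """Remainder with the sign of the dividend, like `%` in Java."""
    r = abs(a) % abs(n)
    return -r if a < 0 else r


def pmod(a, n):
    """Positive modulus: shift a negative remainder back into range."""
    r = java_rem(a, n)
    return java_rem(r + n, n) if r < 0 else r


plain = [java_rem(i, 3) for i in range(-3, 4)]
positive = [pmod(i, 3) for i in range(-3, 4)]
print(plain)
print(positive)
```

For non-negative inputs, such as the `id` column of `spark.range(10)`, the two give the same result; the difference only shows up for negative values.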

## Кеширование
By default, Spark recomputes the entire graph every time an action is applied, which can hurt application performance. To demonstrate this, let's take the [Airport Codes](https://datahub.io/core/airport-codes) dataset.

In [127]:
df = spark.read.format("csv").options(header=True, inferSchema=True).load("/tmp/airport-codes.csv")
df.printSchema()
df.show(1, 200, True)

root
 |-- ident: string (nullable = true)
 |-- type: string (nullable = true)
 |-- name: string (nullable = true)
 |-- elevation_ft: integer (nullable = true)
 |-- continent: string (nullable = true)
 |-- iso_country: string (nullable = true)
 |-- iso_region: string (nullable = true)
 |-- municipality: string (nullable = true)
 |-- gps_code: string (nullable = true)
 |-- iata_code: string (nullable = true)
 |-- local_code: string (nullable = true)
 |-- coordinates: string (nullable = true)

-RECORD 0------------------------------------------
 ident        | 00A                                
 type         | heliport                           
 name         | Total Rf Heliport                  
 elevation_ft | 11                                 
 continent    | NA                                 
 iso_country  | US                                 
 iso_region   | US-PA                              
 municipality | Bensalem                           
 gps_code     | 00A                 

Let's compute a few aggregates. Even though `only_ru` is shared by all the actions, it is recomputed each time an action is called:

In [128]:
only_ru = df.filter((col("iso_country") == "RU") & (col("elevation_ft") > 1000))
only_ru.printSchema()

root
 |-- ident: string (nullable = true)
 |-- type: string (nullable = true)
 |-- name: string (nullable = true)
 |-- elevation_ft: integer (nullable = true)
 |-- continent: string (nullable = true)
 |-- iso_country: string (nullable = true)
 |-- iso_region: string (nullable = true)
 |-- municipality: string (nullable = true)
 |-- gps_code: string (nullable = true)
 |-- iata_code: string (nullable = true)
 |-- local_code: string (nullable = true)
 |-- coordinates: string (nullable = true)



In [129]:
only_ru.groupBy(col("iso_region")).count().orderBy(col("count").desc()).show()
only_ru.groupBy(col("iso_region")).count().orderBy(col("count").desc()).explain()

+----------+-----+
|iso_region|count|
+----------+-----+
|    RU-CHI|   18|
|    RU-IRK|   14|
|     RU-SA|   13|
|     RU-BU|    9|
|    RU-KYA|    7|
|    RU-AMU|    7|
|     RU-BA|    4|
|    RU-STA|    3|
|    RU-MAG|    2|
|     RU-AL|    1|
|     RU-TY|    1|
|     RU-DA|    1|
|    RU-KEM|    1|
|    RU-PRI|    1|
|     RU-KB|    1|
|     RU-KC|    1|
|     RU-IN|    1|
|     RU-SE|    1|
+----------+-----+

== Physical Plan ==
*(3) Sort [count#1377L DESC NULLS LAST], true, 0
+- Exchange rangepartitioning(count#1377L DESC NULLS LAST, 200)
   +- *(2) HashAggregate(keys=[iso_region#1269], functions=[count(1)])
      +- Exchange hashpartitioning(iso_region#1269, 200)
         +- *(1) HashAggregate(keys=[iso_region#1269], functions=[partial_count(1)])
            +- *(1) Project [iso_region#1269]
               +- *(1) Filter (((isnotnull(iso_country#1268) && isnotnull(elevation_ft#1266)) && (iso_country#1268 = RU)) && (elevation_ft#1266 > 1000))
                  +- *(1) FileScan c

In [130]:
only_ru.groupBy(col("type")).count().show()
only_ru.groupBy(col("type")).count().explain()

+--------------+-----+
|          type|count|
+--------------+-----+
|      heliport|    6|
|        closed|    9|
|medium_airport|   34|
| small_airport|   37|
+--------------+-----+

== Physical Plan ==
*(2) HashAggregate(keys=[type#1264], functions=[count(1)])
+- Exchange hashpartitioning(type#1264, 200)
   +- *(1) HashAggregate(keys=[type#1264], functions=[partial_count(1)])
      +- *(1) Project [type#1264]
         +- *(1) Filter (((isnotnull(iso_country#1268) && isnotnull(elevation_ft#1266)) && (iso_country#1268 = RU)) && (elevation_ft#1266 > 1000))
            +- *(1) FileScan csv [type#1264,elevation_ft#1266,iso_country#1268] Batched: false, Format: CSV, Location: InMemoryFileIndex[hdfs://spark-master-1.newprolab.com:8020/tmp/airport-codes.csv], PartitionFilters: [], PushedFilters: [IsNotNull(iso_country), IsNotNull(elevation_ft), EqualTo(iso_country,RU), GreaterThan(elevation_..., ReadSchema: struct<type:string,elevation_ft:int,iso_country:string>


To solve this problem, use the `cache` or `persist` methods. They save the state of the graph after the first action, and subsequent actions read from it. The difference between them is that `persist` lets you choose where to store the data, while `cache` uses the default storage level. In the current version of Spark this is [StorageLevel.MEMORY_AND_DISK](https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence). Keep in mind that this cache is not meant for sharing data between different Spark applications: it is internal to the application. Once you are done with the data, call `unpersist` to free the memory.
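
The effect of caching can be modelled in a few lines of plain Python (no Spark required). The toy `LazyDataset` class below is an assumption written for illustration: it counts how many times the source is recomputed, and its `cache` is lazy in the same sense as Spark's, meaning that the data is stored only when the first action runs. The sample rows are made up.

```python
class LazyDataset:
    """A toy lazy dataset: the source runs again on every action unless cached."""

    def __init__(self, compute):
        self._compute = compute
        self._use_cache = False
        self._cached = None
        self.runs = 0                    # how many times the source was recomputed

    def cache(self):
        # Only marks the dataset; nothing is stored yet.
        self._use_cache = True
        return self

    def unpersist(self):
        self._use_cache = False
        self._cached = None
        return self

    def _rows(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        self.runs += 1
        rows = list(self._compute())
        if self._use_cache:
            self._cached = rows          # the first action after cache() fills it
        return rows

    def count(self):
        return len(self._rows())

    def count_where(self, predicate):
        return sum(1 for row in self._rows() if predicate(row))


airports = LazyDataset(
    lambda: [("RU-CHI", "heliport"), ("RU-IRK", "closed"), ("RU-CHI", "closed")]
)

airports.count()
airports.count_where(lambda row: row[1] == "closed")
runs_without_cache = airports.runs       # every action recomputed the source

airports.cache()
airports.count()                         # fills the cache
airports.count_where(lambda row: row[1] == "closed")
runs_with_cache = airports.runs - runs_without_cache
print(runs_without_cache, runs_with_cache)
```

This is also why the cells below call `count()` right after `cache()`: the first action is what actually materializes the cached data.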

In [131]:
only_ru.cache()
only_ru.count()

86

In [134]:
only_ru.groupBy(col("iso_region")).count().orderBy(col("count").desc()).show()
only_ru.groupBy(col("iso_region")).count().orderBy(col("count").desc()).explain()
only_ru.groupBy(col("type")).count().show()
only_ru.groupBy(col("type")).count().explain()

+----------+-----+
|iso_region|count|
+----------+-----+
|    RU-CHI|   18|
|    RU-IRK|   14|
|     RU-SA|   13|
|     RU-BU|    9|
|    RU-KYA|    7|
|    RU-AMU|    7|
|     RU-BA|    4|
|    RU-STA|    3|
|    RU-MAG|    2|
|     RU-AL|    1|
|     RU-TY|    1|
|     RU-DA|    1|
|    RU-KEM|    1|
|    RU-PRI|    1|
|     RU-KB|    1|
|     RU-KC|    1|
|     RU-IN|    1|
|     RU-SE|    1|
+----------+-----+

== Physical Plan ==
*(3) Sort [count#2004L DESC NULLS LAST], true, 0
+- Exchange rangepartitioning(count#2004L DESC NULLS LAST, 200)
   +- *(2) HashAggregate(keys=[iso_region#1269], functions=[count(1)])
      +- Exchange hashpartitioning(iso_region#1269, 200)
         +- *(1) HashAggregate(keys=[iso_region#1269], functions=[partial_count(1)])
            +- *(1) Project [iso_region#1269]
               +- *(1) Filter (((isnotnull(iso_country#1268) && isnotnull(elevation_ft#1266)) && (iso_country#1268 = RU)) && (elevation_ft#1266 > 1000))
                  +- *(1) FileScan c

In [133]:
only_ru.unpersist()

DataFrame[ident: string, type: string, name: string, elevation_ft: int, continent: string, iso_country: string, iso_region: string, municipality: string, gps_code: string, iata_code: string, local_code: string, coordinates: string]

## Conclusions
**Dataframe API**:
+ is a powerful tool for working with data
+ unlike the RDD API, is designed so that all computation happens in the JVM (as long as no Python UDFs are involved)
+ provides a single API for working with various data sources
+ has a large set of built-in functions for working with data

# Thank you!

In [None]:
spark.stop()