<a href="https://colab.research.google.com/github/DenysNunes/data-examples/blob/main/spark/advanced/DynamicPartitionInserts.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Dynamic Partition Inserts

An example of dynamic partition inserts in Spark SQL. <br>
This technique overwrites only the targeted partition of a table and leaves the other partitions untouched, instead of replacing all of the data.
Read more about it [here](https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-dynamic-partition-inserts.html).
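
To make the difference concrete before touching Spark, here is a toy model in plain Python. It is only a sketch of the semantics, not how Spark implements them: the table is modeled as a dict from partition value to rows, and the helper names `group_by_partition`, `overwrite_static`, and `overwrite_dynamic` are hypothetical.

```python
from collections import defaultdict


def group_by_partition(rows, key="partition_id"):
    """Group row dicts by their partition value."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row)
    return dict(grouped)


def overwrite_static(table, new_rows):
    """Full overwrite: every existing partition is dropped first."""
    return group_by_partition(new_rows)


def overwrite_dynamic(table, new_rows):
    """Partition-level overwrite: only partitions present in new_rows are replaced."""
    updated = dict(table)
    updated.update(group_by_partition(new_rows))
    return updated


table = group_by_partition([
    {"id": 1, "name": "Jonh", "partition_id": 1},
    {"id": 2, "name": "Maria", "partition_id": 1},
    {"id": 3, "name": "Ben", "partition_id": 2},
])
new_rows = [
    {"id": 4, "name": "Oliver", "partition_id": 2},
    {"id": 5, "name": "Agata", "partition_id": 2},
]

static_result = overwrite_static(table, new_rows)
dynamic_result = overwrite_dynamic(table, new_rows)

print(sorted(static_result))   # [2]
print(sorted(dynamic_result))  # [1, 2]
```

In the full overwrite, partition 1 disappears because nothing was written to it. In the partition-level overwrite, partition 1 survives and only partition 2 gets new rows.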

# Initialize Spark

In [33]:
!pip install -q pyspark==3.1.1
!sudo apt install tree
!rm -rf /tmp/dynpartition/df1/

from pyspark.sql import SparkSession


spark = SparkSession \
    .builder \
    .master('local[*]') \
    .appName("New Session Example") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("drop table if exists tb_parquet_persons")

Reading package lists... Done
Building dependency tree       
Reading state information... Done
tree is already the newest version (1.7.0-5).
0 upgraded, 0 newly installed, 0 to remove and 37 not upgraded.


DataFrame[]

# Saving the first DataFrame as a table

In [34]:
from pyspark.sql import Row

raw_rows = [
        Row(id=1, name='Jonh', partition_id=1),
        Row(id=2, name='Maria', partition_id=1),
        Row(id=3, name='Ben', partition_id=2)
]

df = spark.createDataFrame(raw_rows)

df.show()

+---+-----+------------+
| id| name|partition_id|
+---+-----+------------+
|  1| Jonh|           1|
|  2|Maria|           1|
|  3|  Ben|           2|
+---+-----+------------+



In [35]:
df.write.saveAsTable(path='/tmp/dynpartition/df1/', name='tb_parquet_persons', partitionBy='partition_id')

In [36]:
!tree /tmp/dynpartition/df1/

/tmp/dynpartition/df1/
├── partition_id=1
│   ├── part-00000-7b84afce-73f5-4a2f-abd4-915b2a733dcf.c000.snappy.parquet
│   └── part-00001-7b84afce-73f5-4a2f-abd4-915b2a733dcf.c000.snappy.parquet
├── partition_id=2
│   └── part-00001-7b84afce-73f5-4a2f-abd4-915b2a733dcf.c000.snappy.parquet
└── _SUCCESS

2 directories, 4 files


## Overwriting partition 2 with a new DataFrame

In [37]:
raw_rows_2 = [
        Row(id=4, name='Oliver', partition_id=2),
        Row(id=5, name='Agata', partition_id=2)
]

df_2 = spark.createDataFrame(raw_rows_2)
# registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView replaces it
df_2.createOrReplaceTempView("tb_parquet_new_persons")

In [38]:
spark.sql("""

INSERT OVERWRITE TABLE tb_parquet_persons
PARTITION(partition_id = 2)
SELECT id, name FROM tb_parquet_new_persons

""")

DataFrame[]

In [39]:
!tree /tmp/dynpartition/df1/

/tmp/dynpartition/df1/
├── partition_id=1
│   ├── part-00000-7b84afce-73f5-4a2f-abd4-915b2a733dcf.c000.snappy.parquet
│   └── part-00001-7b84afce-73f5-4a2f-abd4-915b2a733dcf.c000.snappy.parquet
├── partition_id=2
│   ├── part-00000-d61131d8-1eea-4355-8a69-833134058261.c000.snappy.parquet
│   └── part-00001-d61131d8-1eea-4355-8a69-833134058261.c000.snappy.parquet
└── _SUCCESS

2 directories, 5 files


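## Verifying the updated table

Notice that only partition 2 was overwritten: Ben (id 3) is gone, replaced by Oliver and Agata, while the rows in partition 1 are unchanged.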

In [41]:
spark.sql("""

SELECT * FROM tb_parquet_persons
order by id, partition_id

""").show()

+---+------+------------+
| id|  name|partition_id|
+---+------+------------+
|  1|  Jonh|           1|
|  2| Maria|           1|
|  4|Oliver|           2|
|  5| Agata|           2|
+---+------+------------+

