# Generated columns


Delta Lake supports generated columns: a special type of column whose values are computed automatically from a user-specified expression over other columns in the Delta table. When you write to a table with generated columns and do not explicitly provide values for them, Delta Lake computes the values for you. For example, you can generate a date column (for partitioning the table by date) from a timestamp column, so that writes only need to supply the timestamp. If you do provide values explicitly, they must satisfy the constraint `(<value> <=> <generation expression>) IS TRUE`, otherwise the write fails with an error.
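
To make the constraint concrete: `<=>` is Spark SQL's null-safe equality operator, which treats two nulls as equal and never returns null itself. The following is a minimal plain-Python sketch of that check, not Delta Lake's implementation; `null_safe_equal` and `write_is_accepted` are hypothetical helper names used only for illustration.

```python
from typing import Any


def null_safe_equal(left: Any, right: Any) -> bool:
    """Mimic SQL's <=> operator, with None standing in for NULL."""
    if left is None or right is None:
        # NULL <=> NULL is true; NULL <=> anything else is false.
        return left is None and right is None
    return left == right


def write_is_accepted(provided: Any, generated: Any) -> bool:
    """Hypothetical helper: an explicit value passes only if it matches the generated one."""
    return null_safe_equal(provided, generated)


print(write_is_accepted(9, 3 * 3))    # explicit value matches the expression
print(write_is_accepted(10, 3 * 3))   # explicit value does not match
print(write_is_accepted(None, None))  # both sides are null
```

Because the comparison is null-safe, a null generated value is only matched by an explicit null, never by a concrete value.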

In [None]:
import delta
from pyspark.sql.types import *
from pyspark.sql.functions import *
import datetime

<mark>**_Important_**</mark> 

**SQL** support **is not available** yet; it is tracked in [#1100](https://github.com/delta-io/delta/issues/1100).

In [None]:
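delta_table_name = 'demo.generated_columns_demo'

# Make sure the target schema exists, then start from a clean slate.
spark.sql("CREATE SCHEMA IF NOT EXISTS demo")
spark.sql(f"DROP TABLE IF EXISTS {delta_table_name}")

# "value" is always true; "id_power2" is always derived from "id".
delta.DeltaTable.create(spark) \
  .tableName(delta_table_name) \
  .addColumn("id", "LONG") \
  .addColumn("value", BooleanType(), generatedAlwaysAs="true") \
  .addColumn("id_power2", IntegerType(), generatedAlwaysAs="cast(id as int) * cast(id as int)") \
  .execute()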

In [None]:
%%sql
DESCRIBE EXTENDED demo.generated_columns_demo

In [None]:
spark.range(5).write.mode("append").saveAsTable(delta_table_name)

In [None]:
%%sql

SELECT * FROM demo.generated_columns_demo
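
Only `id` was written, so Delta Lake fills in the other two columns. As a rough cross-check that needs no Spark session, the rows implied by the two generation expressions can be reproduced in plain Python. `expected_rows` is a hypothetical helper, the column order is assumed to follow the `addColumn` calls, and the order of rows returned by the query is not guaranteed.

```python
from typing import List, Tuple


def expected_rows(n: int) -> List[Tuple[int, bool, int]]:
    """Rows (id, value, id_power2) implied by the generation expressions for ids 0..n-1."""
    # value is the constant true; id_power2 is id * id.
    return [(i, True, i * i) for i in range(n)]


for row in expected_rows(5):
    print(row)
```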

In [None]:
spark.sql(f"DROP TABLE IF EXISTS {delta_table_name}")

## Another example

In [None]:
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("shippingdate", DateType(), True),
    StructField("deliverydate", DateType(), True),
])

data = [
    (1, datetime.date(2023, 6, 5), datetime.date(2023, 6, 8)),
    (2, datetime.date(2023, 6, 5), datetime.date(2023, 6, 10)),
    (3, datetime.date(2023, 6, 5), datetime.date(2023, 6, 9)),
    (4, datetime.date(2023, 6, 5), datetime.date(2023, 6, 7)),
]

shipping_df = spark.createDataFrame(data=data,schema=schema)
shipping_df.printSchema()

In [None]:
display(shipping_df.withColumn("days_between", datediff(col("deliverydate"), col("shippingdate"))))

> Instead of calculating it on the DataFrame each time, let's create a new table with a generated column.

In [None]:
from delta import DeltaTable

(
    DeltaTable.create(spark)
    .tableName(delta_table_name)
    .addColumn("id", "INT")
    .addColumn("shippingdate", "DATE")
    .addColumn("deliverydate", "DATE")
    .addColumn(
        "days_between", "INT", generatedAlwaysAs="datediff(deliverydate, shippingdate)"
    )
    .execute()
)


In [None]:
%%sql
DESCRIBE EXTENDED demo.generated_columns_demo

> Using the same DataFrame, we'll insert data into the table.

In [None]:
shipping_df.printSchema()

In [None]:
shipping_df.write.mode("append").saveAsTable(delta_table_name)

In [None]:
%%sql
select * from demo.generated_columns_demo
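
The `days_between` values returned by the query can be cross-checked without Spark. The sketch below uses a plain-Python stand-in for Spark SQL's `datediff(end, start)`; the function defined here is an illustrative substitute operating on `datetime.date` objects, not the PySpark function.

```python
import datetime
from typing import Optional


def datediff(end: Optional[datetime.date], start: Optional[datetime.date]) -> Optional[int]:
    """Stand-in for Spark SQL datediff: days from start to end, or None if either side is None."""
    if end is None or start is None:
        return None
    return (end - start).days


shipments = [
    (1, datetime.date(2023, 6, 5), datetime.date(2023, 6, 8)),
    (2, datetime.date(2023, 6, 5), datetime.date(2023, 6, 10)),
    (3, datetime.date(2023, 6, 5), datetime.date(2023, 6, 9)),
    (4, datetime.date(2023, 6, 5), datetime.date(2023, 6, 7)),
]

# Append the value the generation expression should produce for each row.
expected = [(i, shipped, delivered, datediff(delivered, shipped))
            for i, shipped, delivered in shipments]

for row in expected:
    print(row)
```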

## Schema evolution

In [None]:
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("shippingdate", DateType(), True),
])

data = [
    (1, datetime.date(2023, 6, 5)),
    (2, datetime.date(2023, 6, 5)),
    (3, datetime.date(2023, 6, 5)),
    (4, datetime.date(2023, 6, 5)),
]

# The deliverydate column is missing!

shipping_df = spark.createDataFrame(data=data,schema=schema)
shipping_df.printSchema()

> **This will raise an error**

In [None]:
shipping_df.write.option("mergeSchema", "true").mode("append").format("delta").saveAsTable(delta_table_name)

## Providing explicit values for a generated column

In [None]:
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("shippingdate", DateType(), True),
    StructField("deliverydate", DateType(), True),
    StructField("days_between", IntegerType(), True),
])

data = [
    (1, datetime.date(2023, 6, 5), datetime.date(2023, 6, 8), 10),
    (2, datetime.date(2023, 6, 5), datetime.date(2023, 6, 10), 10),
    (3, datetime.date(2023, 6, 5), datetime.date(2023, 6, 9), 10),
    (4, datetime.date(2023, 6, 5), datetime.date(2023, 6, 7), 10),
]

shipping_df = spark.createDataFrame(data=data,schema=schema)
shipping_df.printSchema()

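**This will raise an error**: each row supplies `days_between = 10`, but `datediff(deliverydate, shippingdate)` evaluates to 3, 5, 4, and 2 for the four rows.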

> **DeltaInvariantViolationException: CHECK constraint Generated Column**

In [None]:
shipping_df.write.mode("append").saveAsTable(delta_table_name)
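
To see which rows would trip the constraint before attempting the write, the comparison can be sketched in plain Python. This is only an approximation of the check Delta Lake performs; `violations` is a hypothetical helper, and it assumes both dates are non-null.

```python
import datetime
from typing import List, Tuple

rows = [
    (1, datetime.date(2023, 6, 5), datetime.date(2023, 6, 8), 10),
    (2, datetime.date(2023, 6, 5), datetime.date(2023, 6, 10), 10),
    (3, datetime.date(2023, 6, 5), datetime.date(2023, 6, 9), 10),
    (4, datetime.date(2023, 6, 5), datetime.date(2023, 6, 7), 10),
]


def violations(rows) -> List[Tuple[int, int, int]]:
    """Return (id, provided, generated) for each row whose explicit value differs from the generated one."""
    bad = []
    for row_id, shipped, delivered, provided in rows:
        generated = (delivered - shipped).days  # same arithmetic as datediff(deliverydate, shippingdate)
        if provided != generated:
            bad.append((row_id, provided, generated))
    return bad


for row_id, provided, generated in violations(rows):
    print(f"id={row_id}: provided {provided}, expression yields {generated}")
```

A row such as `(5, 2023-06-05, 2023-06-15, 10)` would pass this check, since the expression also yields 10.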

## Null Values

In [None]:
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("shippingdate", DateType(), True),
    StructField("deliverydate", DateType(), True),
    StructField("days_between", IntegerType(), True),
])

# Row 1 has a null shippingdate.
data = [
    (1, None, datetime.date(2023, 6, 8), 10),
    (2, datetime.date(2023, 6, 5), datetime.date(2023, 6, 10), 10),
    (3, datetime.date(2023, 6, 5), datetime.date(2023, 6, 9), 10),
    (4, datetime.date(2023, 6, 5), datetime.date(2023, 6, 7), 10),
]

shipping_df = spark.createDataFrame(data=data,schema=schema)
shipping_df.printSchema()

In [None]:
shipping_df.write.mode("append").saveAsTable(delta_table_name)
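
For the row with a null `shippingdate`, Spark SQL's `datediff` returns null, so the generated value is null and the null-safe comparison against the explicit `10` is false. The sketch below mirrors that reasoning in plain Python; `datediff` and `null_safe_equal` are illustrative stand-ins with `None` playing the role of NULL, not Delta Lake or PySpark functions.

```python
import datetime
from typing import Any, Optional


def datediff(end: Optional[datetime.date], start: Optional[datetime.date]) -> Optional[int]:
    """Stand-in for Spark SQL datediff: a None input propagates to a None result."""
    if end is None or start is None:
        return None
    return (end - start).days


def null_safe_equal(left: Any, right: Any) -> bool:
    """Stand-in for SQL's <=> operator: never returns None."""
    if left is None or right is None:
        return left is None and right is None
    return left == right


# Row 1: shippingdate is null, deliverydate is 2023-06-08.
generated = datediff(datetime.date(2023, 6, 8), None)

print(generated)                         # the generated value
print(null_safe_equal(10, generated))    # explicit 10 against a null generated value
print(null_safe_equal(None, generated))  # explicit null against a null generated value
```

Under these semantics, the only explicit value consistent with a null generated value is null itself; leaving the column out of the DataFrame avoids the comparison altogether.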

## Clean up

In [None]:
spark.sql(f"DROP TABLE IF EXISTS {delta_table_name}")