# Shallow Clone

A shallow clone copies only the table metadata; the data files are not copied to the clone target and stay in the source table's location. The clone's metadata is equivalent to the source's, which makes shallow clones cheaper to create.

Shallow clones are designed for short-lived use cases where we delete the clone once the work is complete. If changes made through the clone altered data files in the source table, deleting the clone afterwards would not undo them. Hence, any files that are added or rewritten by operations on the clone are written to the cloned table's own location. When you no longer need these files, delete the clone and its data, and the source table is left as it was.

Shallow clones reference data files in the source directory. If you run VACUUM on the source table and it removes files that the clone still references, clients can no longer read those files and a FileNotFoundException is thrown. In this case, running the clone again with CREATE OR REPLACE over the shallow clone repairs it. If this occurs often, consider using a deep clone instead, which does not depend on the source table.
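
The repair step described above can be sketched as a small helper. This is a minimal sketch, not an official API: `read_with_repair` is a hypothetical name, and it assumes the missing-file error surfaces in Python as an exception whose message contains `FileNotFoundException`, which may differ by runtime.

```python
def read_with_repair(spark, clone_name, source_name):
    """Count rows in a shallow clone, re-cloning once if referenced files are gone."""
    try:
        return spark.table(clone_name).count()
    except Exception as err:
        # Assumption: the Java error text is visible in the Python exception message.
        if "FileNotFoundException" not in str(err):
            raise
        # Re-point the clone at the files the source table currently references.
        spark.sql(
            f"CREATE OR REPLACE TABLE {clone_name} SHALLOW CLONE {source_name}"
        )
        return spark.table(clone_name).count()
```

In this notebook it would be called as `read_with_repair(spark, "demo.device_shallow_clone", "demo.device")`. Note that replacing the clone resets its current state to that of the source, so changes made only on the clone are no longer in the latest version.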

Cloning a table is not the same as Create Table As Select (CTAS). A shallow clone takes over the metadata of the source table, whereas CTAS creates a new table from a query result. Cloning also has simpler syntax: you don’t need to specify partitioning, format, invariants, nullability and so on, as they are taken from the source table.

A cloned table has a history that is independent of its source table. Time travel queries on a cloned table do not work with the same inputs as on its source table. For example, if the source table was at version 100 when we cloned it, the new table starts at version 0, so we cannot run a time travel query such as SELECT * FROM tbl VERSION AS OF 99 on the new table.
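
To make the independent history concrete, the sketch below checks whether a version number exists in a table's own history before it is used for time travel. The helper names `latest_version` and `has_version` are hypothetical. The sketch relies on `DESCRIBE HISTORY` returning commits newest first, and it does not check whether older versions are still retained.

```python
def latest_version(spark, table_name):
    """Return the newest version number recorded in a Delta table's history."""
    # DESCRIBE HISTORY lists commits newest first, so LIMIT 1 is the latest commit.
    rows = spark.sql(f"DESCRIBE HISTORY {table_name} LIMIT 1").collect()
    return rows[0]["version"]


def has_version(spark, table_name, version):
    """True if `version` lies within the table's own version range."""
    # Caveat: old versions may have been cleaned up; this only checks the range.
    return 0 <= version <= latest_version(spark, table_name)
```

For a source at version 100 and a fresh clone, `has_version(spark, "demo.device", 99)` would hold while the same check against the clone would not.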

In [None]:
from pyspark.sql.functions import col, expr


df = spark.range(100) \
  .selectExpr("if(id % 2 = 0, 'Open', 'Close') as action") \
  .withColumn("date", expr("cast(concat('2023-06-', cast(rand(5) * 30 as int) + 1) as date)")) \
  .withColumn("device_id", expr("cast(rand(5) * 100 as int)"))


delta_table_name = 'demo.device'  # same qualified name the clone statements below use
spark.sql(f"DROP TABLE IF EXISTS {delta_table_name}")

df.write.format("delta").mode("overwrite").saveAsTable(delta_table_name)

In [None]:
%%sql
CREATE OR REPLACE TABLE demo.device_shallow_clone SHALLOW CLONE demo.device

In [None]:
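# Python equivalent of the %%sql cell above; IF NOT EXISTS makes it a no-op when the clone already exists
spark.sql("CREATE TABLE IF NOT EXISTS demo.device_shallow_clone SHALLOW CLONE demo.device")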

## Metadata only

In [None]:
mssparkutils.fs.ls("Tables/device_shallow_clone")

    Let's look at the clone's folder in the Lakehouse explorer

In [None]:
# Read the clone's first commit and show the operation, the referenced files, and the clone metrics
deltalog = spark.read.json("Tables/device_shallow_clone/_delta_log/00000000000000000000.json")
display(
    deltalog.select(
        "commitInfo.operation",
        "add.path",
        "add.size",
        "commitInfo.operationMetrics.numCopiedFiles",
        "commitInfo.operationMetrics.sourceNumOfFiles",
    ).where(col("commitInfo").isNotNull() | col("add").isNotNull())
)
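
The same inspection can be sketched without Spark, since a Delta commit file is newline-delimited JSON with one action per line. The helper `summarize_commit` is a hypothetical name, and the sample lines (including the path) are made up for illustration only. Real commits contain more fields and different values.

```python
import json


def summarize_commit(lines):
    """Summarize one Delta commit given its lines (one JSON action per line)."""
    operation = None
    added_paths = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        action = json.loads(line)
        if "commitInfo" in action:
            operation = action["commitInfo"].get("operation")
        elif "add" in action:
            added_paths.append(action["add"]["path"])
    return {"operation": operation, "added_paths": added_paths}


# Made-up commit content for illustration only; real paths and metrics will differ.
sample = [
    '{"commitInfo": {"operation": "CLONE", "operationMetrics": {"numCopiedFiles": "0"}}}',
    '{"add": {"path": "abfss://example/Tables/device/part-00000.parquet", "size": 1234}}',
]
summary = summarize_commit(sample)
print(summary["operation"], len(summary["added_paths"]))
```

For a shallow clone, the interesting part is that the `add` paths point into the source table's directory rather than the clone's.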


In [None]:
%%sql

SELECT *, input_file_name() 
FROM demo.device_shallow_clone 
LIMIT 5

## Updating shallow clone

In [None]:
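from delta.tables import DeltaTable
delta_info = DeltaTable.forName(spark, 'demo.device_shallow_clone')

# Add 100 to every even device_id; the rewritten data files are written to the clone's location
delta_info.update(
  condition = expr("device_id % 2 == 0"),
  set = { "device_id": expr("device_id + 100") }
)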

    Deleting some rows

In [None]:
delta_info.delete(col('device_id') > 10)

In [None]:
%%sql

SELECT *, input_file_name() 
FROM demo.device_shallow_clone 
LIMIT 5

## Table clone using version

In [None]:
from delta.tables import * 
delta_info = DeltaTable.forName(spark, 'demo.device_shallow_clone')  
display(delta_info.history())

In [None]:
%%sql
CREATE OR REPLACE TABLE demo.new_clone_version SHALLOW CLONE demo.device_shallow_clone VERSION AS OF 1
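
When the version comes from a variable, for example one picked from the history output, the statement above can be built in Python. This is a minimal sketch: `shallow_clone_sql` is a hypothetical helper, and it only assembles the SQL string, which would then be passed to `spark.sql`.

```python
def shallow_clone_sql(target, source, version=None):
    """Build a CREATE OR REPLACE ... SHALLOW CLONE statement, optionally pinned to a version."""
    if version is not None and (not isinstance(version, int) or version < 0):
        raise ValueError("version must be a non-negative integer")
    stmt = f"CREATE OR REPLACE TABLE {target} SHALLOW CLONE {source}"
    if version is not None:
        # Pin the clone to a specific version of the source table's history.
        stmt += f" VERSION AS OF {version}"
    return stmt


print(shallow_clone_sql("demo.new_clone_version", "demo.device_shallow_clone", version=1))
```

Validating the version before formatting keeps malformed input out of the SQL text; table names are assumed to be trusted here.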

In [None]:
%%sql
SELECT * FROM demo.new_clone_version

## Clean up

In [None]:
%%sql
DROP TABLE IF EXISTS demo.new_clone_version;
DROP TABLE IF EXISTS demo.device_shallow_clone;
DROP TABLE IF EXISTS demo.device;