## Solving the Schema-Evolution Problem Using Delta Lake in Spark (available since v3.0)

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName("Schema Evolution").getOrCreate()
sc = spark.sparkContext

### Suppose we receive a dataset with two columns, `name` and `age`

In [3]:
data_lst = [["arun","33"],["Atif","24"],["Mike","33"],["peter","23"]]

In [4]:
df = sc.parallelize(data_lst).toDF(["name","age"])

In [5]:
df.show()

+-----+---+
| name|age|
+-----+---+
| arun| 33|
| Atif| 24|
| Mike| 33|
|peter| 23|
+-----+---+



### Let's save it as a table in Parquet format

In [6]:
df.write.format("parquet").mode("append").saveAsTable("test_table")

In [8]:
spark.sql("Show tables").show()

+--------+----------+-----------+
|database| tableName|isTemporary|
+--------+----------+-----------+
| default|test_table|      false|
+--------+----------+-----------+



## The next day we receive a file with an extra column, `education`

In [9]:
data_lst_1 = [["arun","33","Bachelor Degree"],["Atif","24","Bachelor Degree"],["Mike","33","Bachelor Degree"],["peter","23","Bachelor Degree"]]

In [11]:
df_1 = sc.parallelize(data_lst_1).toDF(["name","age","education"])

In [12]:
df_1.show()

+-----+---+---------------+
| name|age|      education|
+-----+---+---------------+
| arun| 33|Bachelor Degree|
| Atif| 24|Bachelor Degree|
| Mike| 33|Bachelor Degree|
|peter| 23|Bachelor Degree|
+-----+---+---------------+



## Now let's append this DataFrame to `test_table`. Remember that the table has 2 columns but `df_1` has 3

In [13]:
df_1.write.format("parquet").mode("append").saveAsTable("test_table")

AnalysisException: The column number of the existing table default.test_table(struct<name:string,age:string>) doesn't match the data schema(struct<name:string,age:string,education:string>);

### It throws an error saying the number of columns doesn't match, and the table is left unchanged
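The mismatch reported in the error above can be detected before the write by comparing column lists. The sketch below is a hypothetical helper (`schema_diff` is not part of PySpark); it works on plain lists of column names, such as those returned by `spark.table("test_table").columns` and `df_1.columns`. It compares names case-sensitively, which is a simplifying assumption.

```python
def schema_diff(table_columns, incoming_columns):
    """Compare two column-name lists (hypothetical helper, not a PySpark API).

    Returns (added, missing): columns found only in the incoming data and
    columns found only in the existing table, each in its original order.
    """
    table_set = set(table_columns)
    incoming_set = set(incoming_columns)
    added = [c for c in incoming_columns if c not in table_set]
    missing = [c for c in table_columns if c not in incoming_set]
    return added, missing


# Column lists matching the example above
added, missing = schema_diff(["name", "age"], ["name", "age", "education"])
print("added:", added)      # added: ['education']
print("missing:", missing)  # missing: []
```

A non-empty `added` list is the situation that makes the Parquet append fail here.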

In [14]:
df_select = spark.sql("Select * from test_table")

In [15]:
df_select.show()

+-----+---+
| name|age|
+-----+---+
|peter| 23|
| Atif| 24|
| Mike| 33|
| arun| 33|
+-----+---+



## Now the solution using Delta Lake

### We need to choose `delta` as the format and set the `mergeSchema` option to `True`
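To make the idea concrete, here is a simplified pure-Python model of what merging a schema on append amounts to: existing columns keep their position and columns that are new in the incoming data are added at the end. This is only an illustration, not Delta Lake's implementation; `merge_schemas` is a hypothetical name, and treating every type difference as an error is an assumption of this sketch.

```python
def merge_schemas(table_schema, incoming_schema):
    """Simplified model of schema merging on append (not Delta Lake's code).

    Both arguments are lists of (column_name, type_name) pairs.
    Existing columns keep their position; new columns go at the end.
    """
    merged = list(table_schema)
    known = dict(table_schema)
    for name, type_name in incoming_schema:
        if name not in known:
            merged.append((name, type_name))
            known[name] = type_name
        elif known[name] != type_name:
            # Assumption: any type difference is an error in this model
            raise TypeError(f"column {name!r}: {known[name]} vs {type_name}")
    return merged


table = [("name", "string"), ("age", "string")]
incoming = [("name", "string"), ("age", "string"), ("education", "string")]
print(merge_schemas(table, incoming))
# [('name', 'string'), ('age', 'string'), ('education', 'string')]
```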

In [1]:
from pyspark.sql import SparkSession


In [1]:
import pyspark
spark = pyspark.sql.SparkSession.builder.appName("MyApp") \
    .config("spark.jars.packages", "io.delta:delta-core_2.12:0.7.0") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()
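The three settings above can also be kept in one place and applied in a loop, which makes the Delta version easier to change. The following is a minimal sketch: `delta_session_conf` and `build_delta_session` are hypothetical helper names, the default versions are the ones used in this notebook, and the Delta release has to be compatible with the installed Spark version.

```python
def delta_session_conf(delta_version="0.7.0", scala_version="2.12"):
    """Return the Spark config entries used above (hypothetical helper)."""
    return {
        "spark.jars.packages": f"io.delta:delta-core_{scala_version}:{delta_version}",
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    }


def build_delta_session(app_name="MyApp"):
    """Create a SparkSession with the Delta settings applied (requires pyspark)."""
    # Imported lazily so delta_session_conf() can be used without Spark installed
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in delta_session_conf().items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


print(delta_session_conf()["spark.jars.packages"])
# io.delta:delta-core_2.12:0.7.0
```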

In [2]:
sc = spark.sparkContext

In [3]:
data_lst = [["arun","33"],["Atif","24"],["Mike","33"],["peter","23"]]
df = sc.parallelize(data_lst).toDF(["name","age"])

In [4]:
df.write.format("delta").mode("append").option("mergeSchema",True).saveAsTable("test_table_delta")

In [5]:
spark.sql("select * from  test_table_delta").show()

+-----+---+
| name|age|
+-----+---+
|peter| 23|
| Mike| 33|
| arun| 33|
| Atif| 24|
+-----+---+



### Now we get a file with 3 columns

In [6]:
data_lst_1 = [["arun","33","Bachelor Degree"],["Atif","24","Bachelor Degree"],["Mike","33","Bachelor Degree"],["peter","23","Bachelor Degree"]]

In [7]:
df_1 = sc.parallelize(data_lst_1).toDF(["name","age","education"])

In [8]:
df_1.write.format("delta").mode("append").option("mergeSchema",True).saveAsTable("test_table_delta")

In [9]:
spark.sql("select * from  test_table_delta").show()

+-----+---+---------------+
| name|age|      education|
+-----+---+---------------+
|peter| 23|Bachelor Degree|
| arun| 33|Bachelor Degree|
| Atif| 24|Bachelor Degree|
| Mike| 33|Bachelor Degree|
|peter| 23|           null|
| Mike| 33|           null|
| arun| 33|           null|
| Atif| 24|           null|
+-----+---+---------------+



## Hence we can see that the new data is loaded with the additional column without any issue, and the older rows show `null` for `education`
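The `null` values in the output above follow from aligning the older two-column rows to the wider merged schema. The sketch below models that alignment in plain Python; `pad_rows` is a hypothetical helper used only for illustration and does not reflect how Delta Lake stores or reads the data internally.

```python
def pad_rows(rows, row_columns, merged_columns):
    """Align rows to a wider column list, using None for absent columns.

    Simplified model of how older rows appear after a schema merge
    (hypothetical helper, not Delta Lake's implementation).
    """
    padded = []
    for row in rows:
        by_name = dict(zip(row_columns, row))
        padded.append([by_name.get(col) for col in merged_columns])
    return padded


old_rows = [["arun", "33"], ["Atif", "24"]]
print(pad_rows(old_rows, ["name", "age"], ["name", "age", "education"]))
# [['arun', '33', None], ['Atif', '24', None]]
```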