
## Data Writing

Once all our transformations are complete, we need to save, or **write**, the data so that stakeholders can use it. For now we will store it in the Databricks default storage location, but in practice this is usually done on a cloud storage platform.

Before writing let's perform a simple transformation.

In [0]:
from pyspark.sql.types import *
from pyspark.sql.functions import *

In [0]:
df = spark.read.csv('/Volumes/workspace/default/tutorial_files/BigMartSales.csv', header=True, inferSchema=True)
new_df = df.withColumn(
    'Veg_Expensive',
    when((col('Item_Type') != 'Meat') & (col('Item_MRP') > 100), 'Veg-Expensive')
    .when((col('Item_Type') != 'Meat') & (col('Item_MRP') <= 100), 'Veg-Inexpensive')
    .otherwise('Non-Veg')
)
new_df.limit(5).display()

Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales,Veg_Expensive
FDA15,9.3,Low Fat,0.016047301,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138,Veg-Expensive
DRC01,5.92,Regular,0.019278216,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228,Veg-Inexpensive
FDN15,17.5,Low Fat,0.016760075,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27,Non-Veg
FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38,Veg-Expensive
NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052,Veg-Inexpensive
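The `when`/`otherwise` chain above is evaluated top to bottom, and the first matching branch wins. As a rough sketch of that rule, here is a plain-Python mirror of it. The function name `classify_item` is hypothetical and exists only for illustration; it is not part of PySpark. It also assumes the usual Spark behaviour that a comparison involving a null never satisfies a `when` branch, so rows with a missing `Item_Type` or `Item_MRP` would end up labelled `Non-Veg`.

```python
from typing import Optional


def classify_item(item_type: Optional[str], item_mrp: Optional[float]) -> str:
    """Plain-Python mirror of the when/otherwise chain (illustrative only)."""
    # Assumption: a comparison involving a missing value never satisfies a
    # `when` branch, so such rows fall through to `otherwise`.
    if item_type is None or item_mrp is None:
        return "Non-Veg"
    # Branches are tested in order; the first match wins.
    if item_type != "Meat" and item_mrp > 100:
        return "Veg-Expensive"
    if item_type != "Meat" and item_mrp <= 100:
        return "Veg-Inexpensive"
    return "Non-Veg"


# Values taken from the first three rows of the output above.
print(classify_item("Dairy", 249.8092))
print(classify_item("Soft Drinks", 48.2692))
print(classify_item("Meat", 141.618))
```

One thing this makes visible: `Non-Veg` is a catch-all, so it may also collect rows whose inputs are missing, not only rows with `Item_Type` equal to `Meat`.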



#### CSV

In [0]:
new_df.coalesce(1).write.option("header", True) \
  .csv('/Volumes/workspace/default/tutorial_files/BigMartSales_mod')


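The above code creates a folder `BigMartSales_mod` containing a file whose name starts with `part-00000...`; that file is the actual CSV.

We cannot write to a single named file directly here: Spark writes one part file per partition into the target folder, and `coalesce(1)` only ensures that there is a single part file.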
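If a single, conventionally named CSV is needed, one common workaround is to copy the part file out of the folder after the write. The helper below is a minimal sketch using only the Python standard library. The name `export_single_csv` and the target file name in the usage comment are hypothetical, and the sketch assumes the output folder is reachable as an ordinary file path and that the part file ends in `.csv` (that is, no compression was used).

```python
import glob
import os
import shutil


def export_single_csv(spark_output_dir: str, target_file: str) -> str:
    """Copy the lone part file from a Spark CSV output folder to target_file."""
    # Data files are named part-<number>-...; marker files such as _SUCCESS
    # do not match this pattern and are ignored.
    part_files = sorted(glob.glob(os.path.join(spark_output_dir, "part-*.csv")))
    if len(part_files) != 1:
        raise ValueError(
            f"expected exactly one part file, found {len(part_files)}; "
            "was the DataFrame written with coalesce(1)?"
        )
    shutil.copyfile(part_files[0], target_file)
    return target_file


# Hypothetical usage (the target file name is an assumption):
# export_single_csv(
#     '/Volumes/workspace/default/tutorial_files/BigMartSales_mod',
#     '/Volumes/workspace/default/tutorial_files/BigMartSales_mod.csv',
# )
```

The helper refuses to guess when it finds zero or several part files, since that usually means the write used more than one partition.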


#### Data Writing Modes in PySpark

##### Append Mode

This mode appends the DataFrame to an existing folder, regardless of what the DataFrame or the target folder already contains. It does not throw an error; it simply adds new part files next to the existing ones, so writing the same DataFrame twice leaves duplicate rows in the folder.

In [0]:
new_df.write.mode('append').option("header", True) \
  .csv('/Volumes/workspace/default/tutorial_files/BigMartSales_mod')


##### Overwrite Mode

This mode overwrites the existing folder, replacing its contents with the DataFrame being written. It deletes the previous contents, so it must be used carefully.

In [0]:
new_df.write.mode('overwrite').option("header", True) \
  .csv('/Volumes/workspace/default/tutorial_files/BigMartSales_mod')


##### Error/Errorifexists Mode

This mode throws an error if the folder already exists, so nothing is overwritten or appended by accident. It is also the default when no mode is specified. This is helpful for debugging and for checking the target before attempting to modify it.

In [0]:
new_df.write.mode('error').option("header", True) \
  .csv('/Volumes/workspace/default/tutorial_files/BigMartSales_mod')

---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
File <command-7592409404100089>, line 2
      1 new_df.write.mode('error').option("header", True) \
----> 2   .csv('/Volumes/workspace/default/tutorial_files/BigMartSales_mod')

File /databricks/python/lib/python3.12/site-packages/pyspark/sql/connect/readwriter.py:831, in DataFrameWriter.csv(self, path, mode, compression, sep, quote, escape, header, nullValue, escapeQuotes, quoteAll, dateFormat, timestampFormat, ignoreLeadingWhiteSpace, ignoreTrailingWhiteSpace, charToEscapeQuoteEscaping, encoding, emptyValue, lineSep)


##### Ignore Mode

This mode skips the write and makes no changes if the folder already exists. This helps prevent unwanted modifications, loss of data, etc.

In [0]:
new_df.write.mode('ignore').option("header", True) \
  .csv('/Volumes/workspace/default/tutorial_files/BigMartSales_mod')

The above code will not make any changes or write the data anywhere. The code below will, however, since its target destination is new and does not exist yet.

In [0]:
new_df.write.mode('ignore').option("header", True) \
  .csv('/Volumes/workspace/default/tutorial_files/BigMartSales_mod2')
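The four modes described in this section differ only in what happens when the target already exists. The sketch below summarizes that decision as a small toy model in plain Python. It is not Spark code: the function `plan_write` and the action labels it returns are hypothetical and serve only to put the behaviours side by side.

```python
def plan_write(mode: str, target_exists: bool) -> str:
    """Toy model of what a write does for a given save mode (not Spark code)."""
    if mode not in {"append", "overwrite", "error", "errorifexists", "ignore"}:
        raise ValueError(f"unknown save mode: {mode}")
    if not target_exists:
        # Every mode writes the data when the target is new.
        return "write"
    if mode == "append":
        return "add"       # new part files next to the existing ones
    if mode == "overwrite":
        return "replace"   # previous contents are deleted
    if mode == "ignore":
        return "skip"      # nothing is written, no error
    return "raise"         # error / errorifexists


for mode in ("append", "overwrite", "error", "ignore"):
    print(f"{mode:<10} existing -> {plan_write(mode, True):<8} new -> {plan_write(mode, False)}")
```

Read this way, `ignore` and `error` are the two "safe" modes: neither touches an existing target, and they differ only in whether the skipped write is silent.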


#### PARQUET Format

This format is very important when dealing with big data. Since it is a columnar data format, a query can read only the columns it needs instead of entire rows, which greatly reduces compute time and improves performance.

In [0]:
# No "header" option here: Parquet stores column names and types in its own metadata
new_df.write.mode('overwrite') \
  .parquet('/Volumes/workspace/default/tutorial_files/BigMartSales_mod')
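To make the row-versus-column idea concrete, here is a toy sketch in plain Python. It is not how Parquet is implemented; it only contrasts the two layouts using three rows and three fields taken from the sample output earlier in this notebook. The variable names are made up for illustration.

```python
# Three rows from the sample output, reduced to three fields each.
rows = [
    ("FDA15", "Dairy", 249.8092),
    ("DRC01", "Soft Drinks", 48.2692),
    ("FDN15", "Meat", 141.618),
]

# Row layout: all values of one record are stored together.
row_store = [value for row in rows for value in row]

# Column layout: all values of one column are stored together.
column_store = {
    "Item_Identifier": [r[0] for r in rows],
    "Item_Type": [r[1] for r in rows],
    "Item_MRP": [r[2] for r in rows],
}

# In the row layout the Item_MRP values are spread across the whole
# 9-slot store (every third slot).
mrp_from_rows = [row_store[i] for i in range(2, len(row_store), 3)]

# In the column layout they form one contiguous list of 3 slots.
mrp_from_cols = column_store["Item_MRP"]

print(len(row_store), len(mrp_from_cols))  # prints: 9 3
```

Both layouts yield the same `Item_MRP` values; the difference is how much stored data surrounds them, which is what matters when a table has many columns and a query needs only a few.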

#### DELTA Format

This is another crucially important data format. It is used by default in Databricks and is based on Delta Lake.

It is built on top of the Parquet format. In Parquet, the metadata (schema and column statistics) is stored in the footer at the bottom of each file. In Delta, additional table-level metadata is stored in separate files that form a transaction log, or delta log. The log contains all the information regarding updates, creations, deletions, versions, etc. The main data files are still Parquet; keeping this metadata out of the data files and storing it as a log is what makes the format Delta, and it is highly recommended.
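To see what the transaction log holds, the sketch below reads the commit files of a Delta table with the Python standard library only. It assumes the usual Delta Lake layout: a `_delta_log` folder inside the table folder, with one zero-padded `<version>.json` file per commit, each line being a JSON object whose top-level key names the action (for example `add`, `remove`, `metaData`, `commitInfo`). The function name `summarize_delta_log` and the path in the usage comment are hypothetical, and the table folder must be readable as an ordinary file path.

```python
import json
import os
from collections import Counter


def summarize_delta_log(table_path: str) -> dict:
    """Count action types per commit in a Delta table's _delta_log folder."""
    log_dir = os.path.join(table_path, "_delta_log")
    summary = {}
    for name in sorted(os.listdir(log_dir)):
        stem, ext = os.path.splitext(name)
        # Keep only <version>.json commit files; checkpoints and other
        # bookkeeping files are skipped.
        if ext != ".json" or not stem.isdigit():
            continue
        actions = Counter()
        with open(os.path.join(log_dir, name), encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    # The top-level key of each line names the action.
                    actions.update(json.loads(line).keys())
        summary[int(stem)] = dict(actions)
    return summary


# Hypothetical usage (the path is an assumption):
# summarize_delta_log("/path/to/delta_table")
```

The result maps each table version to the actions recorded in that commit, which is the information Delta uses to track versions, additions and deletions.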


#### TABLE

In [0]:
new_df.write \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .saveAsTable("BigMartSales")

### SPARK SQL

**NOTE:** This is not related to just Data Writing.

Spark SQL lets us run SQL queries in PySpark. It is especially convenient for window functions and can be used for everything else as well, though for other tasks the PySpark DataFrame API is usually preferred.

Running Spark SQL does not cost performance either; it performs the same as PySpark, because both are compiled to the same optimized execution plan.

In [0]:
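# createOrReplaceTempView lets this cell be re-run; createTempView raises if the view already exists
new_df.createOrReplaceTempView("MyView")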

In [0]:
%sql

select * from MyView where Item_Fat_Content = 'Low Fat' limit 5

Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales,Veg_Expensive
FDA15,9.3,Low Fat,0.016047301,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138,Veg-Expensive
FDN15,17.5,Low Fat,0.016760075,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27,Non-Veg
NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052,Veg-Inexpensive
FDP10,,Low Fat,0.127469857,Snack Foods,107.7622,OUT027,1985,Medium,Tier 3,Supermarket Type3,4022.7636,Veg-Expensive
FDY07,11.8,Low Fat,0.0,Fruits and Vegetables,45.5402,OUT049,1999,Medium,Tier 1,Supermarket Type1,1516.0266,Veg-Inexpensive


In [0]:
df_sql = spark.sql("select * from MyView where Item_Fat_Content = 'Low Fat' limit 5")

In [0]:
df_sql.display()

Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales,Veg_Expensive
FDA15,9.3,Low Fat,0.016047301,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138,Veg-Expensive
FDN15,17.5,Low Fat,0.016760075,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27,Non-Veg
NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052,Veg-Inexpensive
FDP10,,Low Fat,0.127469857,Snack Foods,107.7622,OUT027,1985,Medium,Tier 3,Supermarket Type3,4022.7636,Veg-Expensive
FDY07,11.8,Low Fat,0.0,Fruits and Vegetables,45.5402,OUT049,1999,Medium,Tier 1,Supermarket Type1,1516.0266,Veg-Inexpensive
