#READ AND WRITE OPERATIONS USECASE

#Task-1:Write SQL statements to create:

In [0]:
%sql
CREATE CATALOG IF NOT EXISTS telecom_catalog_assign;
CREATE SCHEMA IF NOT EXISTS telecom_catalog_assign.landing_zone;
CREATE VOLUME IF NOT EXISTS telecom_catalog_assign.landing_zone.landing_vol;

In [0]:
dbutils.fs.mkdirs("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/")
dbutils.fs.mkdirs("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/")
dbutils.fs.mkdirs("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/")

Explain the difference between<br>
a. Volume vs DBFS/FileStore<br>
b. Why production teams prefer Volumes for regulated data


**a. Volume vs DBFS/FileStore**

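- A Volume is a Unity Catalog object that governs access to non-tabular files (CSV, JSON, images, ML artifacts, PDFs) kept in cloud object storage.<br>
- DBFS (Databricks File System), including its FileStore folder, is a legacy workspace-level filesystem that sits outside Unity Catalog governance.<br>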

✅ Use Volumes for:

- Production pipelines
- Regulated / secure data
- Multi-team access
- Workspaces with Unity Catalog enabled

⚠️ Use DBFS / FileStore for:

- Temporary uploads
- Demo notebooks
- One-time testing
- Downloading files via UI

**IMPORTANT**<br>
“Volumes are Unity Catalog–governed storage for non-table data, while DBFS/FileStore is a legacy workspace filesystem mainly used for temporary or demo files.”

**b. Why production teams prefer Volumes for regulated data**

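Production teams prefer Volumes for regulated data because access is governed by Unity Catalog. Read and write permissions are granted per volume to specific users or groups (the `READ VOLUME` and `WRITE VOLUME` privileges), and volume paths follow the same catalog.schema.object namespace as tables, so files and tables share one permission model. These controls support the access-control and audit requirements common in regulated industries such as banking, healthcare, fintech, and insurance.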

##Data files to use in this usecase:

In [0]:
customer_csv = ''' 101,Arun,31,Chennai,PREPAID 
102,Meera,45,Bangalore,POSTPAID 
103,Irfan,29,Hyderabad,PREPAID 
104,Raj,52,Mumbai,POSTPAID 
105,,27,Delhi,PREPAID 
106,Sneha,abc,Pune,PREPAID '''

In [0]:
usage_tsv = '''customer_id\tvoice_mins\tdata_mb\tsms_count 
101\t320\t1500\t20 
102\t120\t4000\t5 
103\t540\t600\t52 
104\t45\t200\t2 
105\t0\t0\t0 '''

In [0]:
tower_logs_region1 = '''event_id|customer_id|tower_id|signal_strength|timestamp 
5001|101|TWR01|-80|2025-01-10 10:21:54 
5004|104|TWR05|-75|2025-01-10 11:01:12 '''

tower_logs_region2 = '''event_id|customer_id|tower_id|signal_strength|timestamp
5002|101|TWR01|-80|2025-01-10 10:21:54
5003|104|TWR05|-75|2025-01-10 11:01:12
'''

#Task-2:Filesystem operations

##sub-Task-1:Write dbutils.fs code to copy the above datasets into your created Volume folders: 

In [0]:
# dbutils.fs.put writes a string to a file and returns a boolean, not a DataFrame,
# so the result is not assigned to a df variable
dbutils.fs.put("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer.csv",customer_csv,overwrite=True)
dbutils.fs.put("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.tsv",usage_tsv,overwrite=True)
dbutils.fs.put("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/tower_logs_region1.csv",tower_logs_region1,overwrite=True)
dbutils.fs.put("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/tower_logs_region2.csv",tower_logs_region2,overwrite=True)

#Task-3:Spark Directory Read Use Cases


###sub-Task-1:Read all tower logs using: path glob filter (example: *.csv), multiple paths input, and recursive lookup

In [0]:
df1=spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/",recursiveFileLookup=True,pathGlobFilter="tower*",header=True,sep="|",inferSchema=True)
display(df1)
df1.printSchema()

###sub-Task-2:Demonstrate these 3 reads separately: 
- Using pathGlobFilter 
- Using list of paths in spark.read.csv([path1, path2]) 
- Using .option("recursiveFileLookup","true")



In [0]:
df1=spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/",pathGlobFilter="tower*",header=True,sep="|",inferSchema=True)
display(df1)

In [0]:
df1 = spark.read.csv(
    [
        "/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/tower_logs_region1.csv",
        "/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/tower_logs_region2.csv"
    ],
    header=True,
    sep="|",
    inferSchema=True
)
display(df1)

In [0]:
df1=spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/",pathGlobFilter="tower*",header=True,sep="|",inferSchema=True,recursiveFileLookup=True)
display(df1)

###sub-Task-3:Compare the outputs and understand when each should be used.

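- A list of paths reads exactly the files you name, so no unexpected file is picked up; best for production jobs where the inputs are known in advance<br>
- pathGlobFilter filters by file name inside the given folder; useful when a folder mixes files and only some match a naming pattern<br>
- recursiveFileLookup also reads files in every subfolder and disables partition inference; flexible, but it lists more files and can pull in data you did not intend to read<br>
- With only the two region files in the folder, all three reads return the same four rows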
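
The difference between the three strategies is easier to see on a folder that also contains a subfolder and an unrelated file. The sketch below is a Spark-free model in plain Python, not Spark's implementation: the folder layout, the `archive` subfolder, and the helper names `by_glob`, `by_list`, and `by_recursive` are all assumptions made for illustration.

```python
import fnmatch
import os
import tempfile

# Build a throwaway folder that mimics a tower landing folder (hypothetical layout).
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "archive"))
for rel in ["tower_logs_region1.csv", "tower_logs_region2.csv", "notes.txt",
            os.path.join("archive", "tower_logs_old.csv")]:
    with open(os.path.join(root, rel), "w") as fh:
        fh.write("event_id|customer_id\n")

def by_glob(folder, pattern):
    """Model of pathGlobFilter: match file names directly inside one folder."""
    return sorted(n for n in os.listdir(folder)
                  if os.path.isfile(os.path.join(folder, n))
                  and fnmatch.fnmatch(n, pattern))

def by_list(paths):
    """Model of a list of paths: only the files that are named get read."""
    return sorted(os.path.basename(p) for p in paths if os.path.isfile(p))

def by_recursive(folder, pattern="*"):
    """Model of recursiveFileLookup plus a glob: walk every subfolder."""
    return sorted(n for _, _, names in os.walk(folder)
                  for n in names if fnmatch.fnmatch(n, pattern))

glob_files = by_glob(root, "tower*")
list_files = by_list([os.path.join(root, "tower_logs_region1.csv")])
recursive_files = by_recursive(root, "tower*")
print(glob_files)
print(list_files)
print(recursive_files)
```

Only the recursive variant picks up the file under `archive`, which is the kind of surprise that makes it riskier for production reads.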

#Task-4:Schema Inference, Header, and Separator

##sub-Task-1:Try the Customer, Usage files with the option and options using read.csv and format function:
header=false, inferSchema=false
or
header=true, inferSchema=true

###A.using Customer data file

In [0]:
df1=spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/")
display(df1)
df1.printSchema()

The customer file has no header row, so Spark assigns default column names (_c0, _c1, ...) and, without inferSchema, reads every column as 'string'. The next cell therefore assigns explicit column names with the .toDF method.

In [0]:
df1=spark.read.option("inferSchema","True").csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/").toDF("customer_id","name","age","city","plan")
display(df1)
df1.printSchema()

With<br>
inferSchema=false -> every column is read as string, which is wrong for the numeric columns<br>
and with<br>
inferSchema=true -> Spark scans the data and infers a datatype for each column

In the code above, the header option is not set because this file has no header row.

###B.using usage data file

In [0]:
df1=spark.read.options(sep="\t",header="True",inferSchema="True").csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/")
display(df1)
df1.printSchema()

##sub-Task-2:Write a note on What changed when we use header or inferSchema with true/false?

- With header=false, Spark assigns the default column names _c0, _c1, _c2, ... and treats the first line as data<br>
- With header=true, Spark uses the first line of the file as the column names. If the file has no real header, its first data row is consumed as the header and disappears from the data<br>
- With inferSchema=false, every column is read as 'string' by default<br>
- With inferSchema=true, Spark makes an extra pass over the data and gives each column a datatype that fits all of its values
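
The effect of the header option on column names and row counts can be modelled without Spark. This is a minimal sketch using Python's `csv` module; the helper `read_rows` and the two-column sample are assumptions for illustration, and the `_c0` naming only imitates Spark's default names.

```python
import csv
import io

def read_rows(text, header, sep=","):
    """Return (column_names, data_rows); every value stays a string."""
    rows = list(csv.reader(io.StringIO(text), delimiter=sep))
    if header:
        # First line becomes the column names and is removed from the data.
        return rows[0], rows[1:]
    # No header: generate positional names in the _c0, _c1, ... style.
    return [f"_c{i}" for i in range(len(rows[0]))], rows

sample = "customer_id\tvoice_mins\n101\t320\n102\t120\n"
cols_no_header, rows_no_header = read_rows(sample, header=False, sep="\t")
cols_header, rows_header = read_rows(sample, header=True, sep="\t")
print(cols_no_header, len(rows_no_header))
print(cols_header, len(rows_header))
```

With `header=False` the header line is counted as a data row, so the row count is one higher than with `header=True`.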

##sub-Task-3:How schema inference handled “abc” in age?


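During schema inference, a single non-numeric value such as 'abc' forces Spark to infer the entire age column as StringType. A column has exactly one datatype, and string is the only type that can hold every value in it, so the numeric-looking ages (31, 45, ...) are kept as strings too.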
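
The rule can be sketched as a small function. This is a deliberately simplified two-type model (integer or string) written for illustration; Spark's real inference considers more types, and the helper name `infer_column_type` is hypothetical.

```python
def infer_column_type(values):
    """Return 'int' if every non-empty value parses as an integer, else 'string'."""
    seen_value = False
    for v in values:
        if v == "":
            continue  # empty cells are treated as nulls and do not affect the type
        seen_value = True
        try:
            int(v)
        except ValueError:
            return "string"  # one bad value is enough to widen the whole column
    # A column with no values at all gives no evidence for a numeric type.
    return "int" if seen_value else "string"

ages = ["31", "45", "29", "52", "27", "abc"]
print(infer_column_type(ages))       # column containing 'abc'
print(infer_column_type(ages[:-1]))  # same column without 'abc'
```

If age must be numeric, one option is to supply an explicit schema instead of relying on inference, so that the bad value is surfaced as a null rather than silently widening the column.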

#Task-5:Column Renaming Usecases

##sub-Task-1:Apply column names using string using toDF function for customer data

In [0]:
df1=spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/",inferSchema="True").toDF("customer_id","name","age","city","plan")
display(df1)
df1.printSchema()

##sub-Task-2:Apply column names and datatype using the schema function for usage data

In [0]:
datastruct1="customer_id int,voice_mins int,data_mb int,sms_count double"
# header=True skips the header line so it is not read as a data row of nulls
df1=spark.read.schema(datastruct1).csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/",sep="\t",header=True)
display(df1)
df1.printSchema()

##sub-Task-3:Apply column names and datatype using the StructType with IntegerType, StringType, TimestampType and other classes for towers data

In [0]:
from pyspark.sql.types import StructType,StructField,IntegerType,StringType,TimestampType
schema2=StructType(
    [StructField("event_id",IntegerType(),True),
     StructField("customer_id",IntegerType(),True),
     StructField("tower_id",StringType(),True),
     StructField("signal_strength",IntegerType(),True),
     StructField("timestamp",TimestampType(),True)])
df3=spark.read.schema(schema2).csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/",sep='|',header=True,pathGlobFilter="tower*",recursiveFileLookup=True)

df3.printSchema()
display(df3)

#Task-6:Write Operations (Data Conversion/Schema migration) – CSV Format Usecases

##sub-Task-1:Write customer data into CSV format using overwrite mode

In [0]:
df1=spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/").toDF("customer_id","name","age","city","plan")
display(df1)

In [0]:
df1.write.format("csv").save("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/cust.csv",mode="overwrite")

##sub-Task-2:Write usage data into CSV format using append mode

In [0]:
df1=spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/",sep='\t',inferSchema="True",header="True")
display(df1)
df1.printSchema()

In [0]:
df1.write.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.csv",mode="append")

##sub-Task-3:Write tower data into CSV format with header enabled and custom separator (|)

In [0]:
df1=spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/",header=True,sep="|",pathGlobFilter="tower*",recursiveFileLookup=True,inferSchema=True)
display(df1)
df1.printSchema()

In [0]:
df1.write.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/tower.csv",header=True,sep="|",mode="overwrite")

##sub-Task-4:Read the tower data in a dataframe and show only 5 rows.

In [0]:
df1=spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/tower.csv/",recursiveFileLookup=True,sep="|",header=True)
display(df1)
df1.show(5)

#Task-7:Write Operations (Data Conversion/Schema migration)– JSON Format Usecases

##sub-Task-1:Write customer data into JSON format using overwrite mode

In [0]:
df1=spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/").toDF("customer_id","name","age","city","plan")
display(df1)

In [0]:
df1.write.format("json").save("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer.json",mode="overwrite")

##sub-Task-2:Write usage data into JSON format using append mode and snappy compression format

In [0]:
df1=spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/",sep='\t',inferSchema="True",header="True")
display(df1)
df1.printSchema()

In [0]:
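# the writer option is named "compression"; an unknown option such as "compress" is silently ignored
df1.write.format("json").save("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.json",mode="append",compression="snappy")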
df1=spark.read.json("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.json/")
display(df1)
df1.printSchema()

##sub-Task-3:Write tower data into JSON format using ignore mode and observe the behavior of this mode

In [0]:
%python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, TimestampType
struct1=StructType([StructField("event_id",IntegerType(),True),
                    StructField("customer_id",IntegerType(),True),
                    StructField("tower_id",StringType(),True),
                    StructField("signal_strength",StringType(),True),
                    StructField("timestamp",TimestampType(),True)])
df1=spark.read.schema(struct1).csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/",pathGlobFilter="tower*",recursiveFileLookup=True,sep="|",header=True)
display(df1)
df1.printSchema()

In [0]:
df1.write.format("json").save("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/tower.json",mode="ignore")

In [0]:
df1=spark.read.json("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/tower.json/")
display(df1)
df1.printSchema()
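
To make the behavior of ignore mode concrete, here is a Spark-free model of the four save modes acting on a single local file. It is a sketch under stated assumptions: the function `save`, the sample JSON lines, and the temporary path are all hypothetical, and a real Spark write targets a directory of part files rather than one file.

```python
import os
import tempfile

def save(rows, path, mode="error"):
    """Model of the save modes: error, ignore, append, overwrite."""
    if mode not in ("error", "errorifexists", "ignore", "append", "overwrite"):
        raise ValueError(f"unknown mode: {mode}")
    exists = os.path.exists(path)
    if exists and mode in ("error", "errorifexists"):
        raise FileExistsError(f"path {path} already exists")
    if exists and mode == "ignore":
        return False  # skip the write silently; existing data is untouched
    file_mode = "a" if mode == "append" else "w"
    with open(path, file_mode) as fh:
        fh.writelines(line + "\n" for line in rows)
    return True

target = os.path.join(tempfile.mkdtemp(), "tower.json")
first = save(['{"event_id": 5001}'], target, mode="ignore")   # path is new: data is written
second = save(['{"event_id": 5002}'], target, mode="ignore")  # path exists: nothing happens
with open(target) as fh:
    contents = fh.read().splitlines()
print(first, second, contents)
```

The second call neither fails nor changes the file, which matches what the ignore-mode cell above shows on a re-run: the write is skipped without an error.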

##sub-Task-4:Read the tower data in a dataframe and show only 5 rows.


In [0]:
df1=spark.read.json("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/tower.json/")
display(df1.limit(5))
df1.printSchema()

#Task-8:Write Operations (Data Conversion/Schema migration) – Parquet Format Usecases

##sub-Task-1:Write customer data into Parquet format using overwrite mode and in a gzip format

In [0]:
df1=spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/").toDF("customer_id","name","age","city","plan")
display(df1)

In [0]:
df1.write.parquet("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer.parquet",mode="overwrite",compression="gzip")

In [0]:
df1=spark.read.format("parquet").load("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer.parquet/")
display(df1)

##sub-Task-2:Write usage data into Parquet format using error mode

In [0]:
df1=spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/",sep='\t',inferSchema="True",header="True")
display(df1)
df1.printSchema()

In [0]:
# mode="error" (the default) raises an error if the target path already exists
df1.write.parquet("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.parquet",mode="error")


In [0]:
df1=spark.read.format("parquet").load("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.parquet/")
display(df1)
df1.printSchema()

##sub-Task-3:Write tower data into Parquet format with gzip compression option

In [0]:
%python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, TimestampType
struct1=StructType([StructField("event_id",IntegerType(),True),
                    StructField("customer_id",IntegerType(),True),
                    StructField("tower_id",StringType(),True),
                    StructField("signal_strength",StringType(),True),
                    StructField("timestamp",TimestampType(),True)])
df1=spark.read.schema(struct1).csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/",pathGlobFilter="tower*",recursiveFileLookup=True,sep="|",header=True)
display(df1)
df1.printSchema()

In [0]:
df1.write.format("parquet").save("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/tower.parquet",compression="gzip",mode="overwrite")

In [0]:
df1=spark.read.parquet("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/tower.parquet/")
display(df1)
df1.printSchema()

##sub-Task-4:Read the usage data in a dataframe and show only 5 rows.

In [0]:
df1=spark.read.format("parquet").load("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.parquet/")
display(df1.limit(5))
df1.printSchema()

#Task-9:Write Operations (Data Conversion/Schema migration) – Orc Format Usecases

##sub-Task-1:Write customer data into ORC format using overwrite mode

In [0]:
df1=spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/").toDF("customer_id","name","age","city","plan")
display(df1)

In [0]:
df1.write.orc("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer.orc",mode="overwrite")

In [0]:
df1=spark.read.orc("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer.orc/")
display(df1)
df1.printSchema()

##sub-Task-2:Write usage data into ORC format using append mode

In [0]:
df1=spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/",sep='\t',inferSchema="True",header="True")
display(df1)
df1.printSchema()

In [0]:
# append adds new files next to any existing ORC data instead of replacing it
df1.write.format("orc").save("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.orc",mode="append")

In [0]:
df1=spark.read.format("orc").load("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.orc/")
df1.show()


##sub-Task-3:Write tower data into ORC format and see the output file structure

In [0]:
%python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, TimestampType
struct1=StructType([StructField("event_id",IntegerType(),True),
                    StructField("customer_id",IntegerType(),True),
                    StructField("tower_id",StringType(),True),
                    StructField("signal_strength",StringType(),True),
                    StructField("timestamp",TimestampType(),True)])
df1=spark.read.schema(struct1).csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/",pathGlobFilter="tower*",recursiveFileLookup=True,sep="|",header=True)
display(df1)
df1.printSchema()

In [0]:
df1.write.orc("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/tower.orc")

In [0]:
df1=spark.read.orc("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/tower.orc/")
display(df1)
df1.printSchema()

##sub-Task-4:Read the usage data in a dataframe and show only 5 rows.

In [0]:
df1=spark.read.format("orc").load("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.orc/")
display(df1.limit(5))
df1.printSchema()

#Task-10:Write Operations (Data Conversion/Schema migration) – Delta Format Usecases

##sub-Task-1:Write customer data into Delta format using overwrite mode

In [0]:
df1=spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/").toDF("customer_id","name","age","city","plan")
display(df1)

In [0]:
df1.write.format("delta").save("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/cust.delta", mode="overwrite")

In [0]:
# Delta stores the schema with the data, so no inferSchema option is needed
df1=spark.read.format("delta").load("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/cust.delta/")
display(df1)
df1.printSchema()

##sub-Task-2:Write usage data into Delta format using append mode

In [0]:
df1=spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/",sep='\t',inferSchema="True",header="True")
display(df1)
df1.printSchema()

In [0]:
df1 = df1.withColumnRenamed(
    "sms_count ",
    "sms_count"
)

In [0]:
df1.write.format("delta").save("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.delta", mode="append")

In [0]:
df1=spark.read.format("delta").load("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.delta/")
display(df1)
df1.printSchema()

##sub-Task-3:Write tower data into Delta format and see the output file structure

In [0]:
%python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, TimestampType
struct1=StructType([StructField("event_id",IntegerType(),True),
                    StructField("customer_id",IntegerType(),True),
                    StructField("tower_id",StringType(),True),
                    StructField("signal_strength",StringType(),True),
                    StructField("timestamp",TimestampType(),True)])
df1=spark.read.schema(struct1).csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/",pathGlobFilter="tower*",recursiveFileLookup=True,sep="|",header=True)
display(df1)
df1.printSchema()

In [0]:
df1.write.format("delta").save("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/tower.delta")

In [0]:
df1=spark.read.format("delta").load("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/tower.delta/")
display(df1)
df1.printSchema()

##sub-Task-4:Read the usage data in a dataframe and show only 5 rows.

In [0]:
df1=spark.read.format("delta").load("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.delta/")
display(df1.limit(5))
df1.printSchema()

#Task-11:Write Operations (Lakehouse Usecases) – Delta table Usecases

##sub-Task-1:Write customer data using saveAsTable() as a managed table

In [0]:
df1=spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/").toDF("customer_id","name","age","city","plan")
display(df1)

In [0]:
df1.write.saveAsTable("telecom_catalog_assign.default.cutomer_managed_table")

In [0]:
df1=spark.read.table("telecom_catalog_assign.default.cutomer_managed_table")
display(df1)

##sub-Task-2:Write usage data using saveAsTable() with overwrite mode

In [0]:
df1=spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/",sep='\t',inferSchema="True",header="True")
display(df1)
df1.printSchema()

In [0]:
df1 = df1.withColumnRenamed(
    "sms_count ",
    "sms_count"
)

In [0]:
df1.write.saveAsTable("telecom_catalog_assign.default.usage_managed_table",mode="overwrite")


In [0]:
df1=spark.read.table("telecom_catalog_assign.default.usage_managed_table")
display(df1)

##sub-Task-3:Drop the managed table and verify data removal

##sub-Task-4:Go and check the table overview and realize it is in delta format in the Catalog.

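Yes. The table overview in the Catalog shows the format as Delta, which is the default format for tables created with saveAsTable() on Databricks.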

##sub-Task-5:Use spark.sql to write some simple queries on the above tables created.

In [0]:
df_sql1=spark.sql("select * from telecom_catalog_assign.default.usage_managed_table")
display(df_sql1.limit(2))

In [0]:
# trim() removes the trailing space that the plan values carry over from the source file
df_sql2=spark.sql("select * from telecom_catalog_assign.default.cutomer_managed_table where trim(plan)='POSTPAID'")
display(df_sql2)

#Task-12:Write Operations (Lakehouse Usecases) – Delta table Usecases

##sub-Task-1:Write customer data using insertInto() in a new table and find the behavior

##sub-Task-2:Write usage data using insertInto() with overwrite mode

#Task-13:Write Operations (Data Conversion/Schema migration) – XML Format Usecases

##sub-Task-1:Write customer data into XML format using rowTag as cust

In [0]:
df1=spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/").toDF("customer_id","name","age","city","plan")
display(df1)

In [0]:
df1.write.xml("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/cust.xml",rowTag="cust")

In [0]:
df1=spark.read.xml("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/cust.xml/",rowTag="cust")
display(df1)

##sub-Task-2:Write usage data into XML format using overwrite mode with the rowTag as usage

In [0]:
df1=spark.read.csv("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/",sep='\t',inferSchema="True",header="True")
display(df1)
df1.printSchema()

In [0]:
dbutils.fs.put(
  "/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/xml_usage/_test.txt",
  "test",
  overwrite=True
)

In [0]:
dbutils.fs.rm(
  "/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/xml_usage",
  recurse=True
)


In [0]:
# coalesce(1) writes a single part file; mode("overwrite") replaces any existing output at the path
df1.coalesce(1).write \
  .format("xml") \
  .option("rowTag", "usage") \
  .mode("overwrite") \
  .save("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/xml_usage")

##sub-Task-3:Download the xml data and open the file in notepad++ and see how the xml file looks like.
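
Before downloading the file, it can help to know roughly what to expect. The sketch below builds the same shape with Python's standard `xml.etree.ElementTree` instead of Spark: one element per row, named by rowTag, with one child element per column. The root tag `ROWS` is an assumption about the writer's default, the helper `rows_to_xml` is hypothetical, and the single sample row is taken from the usage data above.

```python
import xml.etree.ElementTree as ET

def rows_to_xml(rows, row_tag, root_tag="ROWS"):
    """Build <root_tag><row_tag><col>value</col>...</row_tag>...</root_tag>."""
    root = ET.Element(root_tag)  # root_tag="ROWS" is an assumed default
    for row in rows:
        elem = ET.SubElement(root, row_tag)
        for name, value in row.items():
            ET.SubElement(elem, name).text = str(value)
    return ET.tostring(root, encoding="unicode")

xml_text = rows_to_xml(
    [{"customer_id": 101, "voice_mins": 320, "data_mb": 1500, "sms_count": 20}],
    row_tag="usage",
)
print(xml_text)
```

The real part file may differ in indentation and may include an XML declaration, but each row should appear as one `usage` element.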