#Telecom Domain Read & Write Ops Assignment - Building Datalake & Lakehouse
This notebook contains assignments to practice Spark read options and Databricks volumes. <br>
Sections: Sample data creation, Catalog & Volume creation, Copying data into Volumes, Path glob/recursive reads, toDF() column renaming variants, inferSchema/header/separator experiments, and exercises.<br>

![](https://fplogoimages.withfloats.com/actual/68009c3a43430aff8a30419d.png)
![](https://theciotimes.com/wp-content/uploads/2021/03/TELECOM1.jpg)

##First import all required libraries and create the Spark session object

##1. Write SQL statements to create:
1. A catalog named telecom_catalog_assign
2. A schema landing_zone
3. A volume landing_vol
4. Using dbutils.fs.mkdirs, create folders:<br>
/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/
/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/
/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/
5. Explain the difference between the following (research why Volumes are the preferred choice for production-ready systems):<br>
a. Volume vs DBFS/FileStore<br>
b. Why production teams prefer Volumes for regulated data<br>

In [0]:
%sql
--1. Write SQL statements to create:
--1.1. A catalog named telecom_catalog_assign
CREATE CATALOG IF NOT EXISTS telecom_catalog_assign;

--1.2. A schema landing_zone
CREATE SCHEMA IF NOT EXISTS telecom_catalog_assign.landing_zone;

--1.3. A volume landing_vol
CREATE VOLUME IF NOT EXISTS telecom_catalog_assign.landing_zone.landing_vol;


In [0]:
%sql
--Drop the catalog even if it contains data (for reference)
--DROP CATALOG IF EXISTS telecom_catalog_assign CASCADE;

In [0]:
# 1.4. Using dbutils.fs.mkdirs, create folders:
# /Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/ 
# /Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/ 
# /Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/

dbutils.fs.mkdirs("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer")
dbutils.fs.mkdirs("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage")
dbutils.fs.mkdirs("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region1")
dbutils.fs.mkdirs("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region2")

In [0]:
dbutils.fs.mkdirs("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region1/ericsson")
dbutils.fs.mkdirs("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region1/nokia")
dbutils.fs.mkdirs("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region1/huawei")
dbutils.fs.mkdirs("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region2/ericsson")
dbutils.fs.mkdirs("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region2/nokia")
dbutils.fs.mkdirs("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region2/huawei")

In [0]:
cust_path = "/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/"
usage_path = "/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/"
tower_path = "/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/"

In [0]:
# Multiple folders can be created in one go with a loop, instead of calling dbutils.fs.mkdirs() once per folder.
# folders = ['/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer','/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage','/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region1','/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region2']

# for folder in folders:
#   dbutils.fs.mkdirs(folder)
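
The loop idea above can be taken one step further by generating the region and vendor folders instead of listing them by hand. The following is a minimal sketch, not the notebook's own solution: `landing_folders`, `REGIONS`, and `VENDORS` are hypothetical names, and the `dbutils` call is guarded so the cell also runs outside Databricks.

```python
from itertools import product

# Base path taken from this notebook; REGIONS and VENDORS mirror the folders created earlier.
BASE = "/Volumes/telecom_catalog_assign/landing_zone/landing_vol"
REGIONS = ["region1", "region2"]
VENDORS = ["ericsson", "nokia", "huawei"]


def landing_folders(base=BASE):
    """Return every folder the landing volume needs (hypothetical helper)."""
    folders = [f"{base}/customer", f"{base}/usage"]
    # One folder per region/vendor combination under tower/
    folders += [f"{base}/tower/{region}/{vendor}" for region, vendor in product(REGIONS, VENDORS)]
    return folders


folders = landing_folders()
for folder in folders:
    print(folder)

# dbutils exists only inside a Databricks notebook, so the call is guarded.
if "dbutils" in globals():
    for folder in folders:
        dbutils.fs.mkdirs(folder)  # also creates any missing parent folders
```

Because `mkdirs` creates missing parents, listing only the leaf folders is enough to build the whole `tower/region/vendor` tree.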

# Explain the difference between DBFS and Volumes in Databricks:
##### 1. DBFS (Databricks File System)
- A legacy abstraction layer over cloud object storage (ADLS, S3, GCS).
- Makes cloud storage look like a local file system.
- Tied to the workspace, not Unity Catalog
- Limited governance and security
- Often accessed using dbutils.fs
- Common in older Databricks projects

##### 2. Volumes (Unity Catalog Volumes)
- A modern, governed storage abstraction introduced with Unity Catalog. 
- Provides secure, managed access to cloud storage.
- Governed by Unity Catalog. 
- Fine-grained permissions (READ VOLUME, WRITE VOLUME).
- Works seamlessly with SQL, Python, Spark
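
To make the governance point concrete, the sketch below builds a Volume path from its three-level Unity Catalog name and composes the `GRANT` statements that control access to it. It is an illustration under assumptions: `volume_path` and `grant_sql` are hypothetical helpers, and the group names `analysts` and `data_engineers` are placeholders that would need to exist in your account.

```python
def volume_path(catalog, schema, volume, *parts):
    """Build a /Volumes/<catalog>/<schema>/<volume>/... path (hypothetical helper)."""
    return "/".join(["/Volumes", catalog, schema, volume, *parts])


def grant_sql(privilege, volume_fqn, principal):
    """Compose a Unity Catalog GRANT statement for a volume (hypothetical helper)."""
    return f"GRANT {privilege} ON VOLUME {volume_fqn} TO `{principal}`"


VOLUME_FQN = "telecom_catalog_assign.landing_zone.landing_vol"

customer_dir = volume_path("telecom_catalog_assign", "landing_zone", "landing_vol", "customer")
read_stmt = grant_sql("READ VOLUME", VOLUME_FQN, "analysts")          # placeholder group
write_stmt = grant_sql("WRITE VOLUME", VOLUME_FQN, "data_engineers")  # placeholder group

print(customer_dir)
print(read_stmt)
print(write_stmt)
```

The printed statements can be pasted into a `%sql` cell once the placeholder group names are replaced. A DBFS root path such as `dbfs:/FileStore/...` has no equivalent per-object grant, which is the main reason regulated data is usually kept in Volumes.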

##Data files to use in this use case:
customer_csv = '''
101,Arun,31,Chennai,PREPAID
102,Meera,45,Bangalore,POSTPAID
103,Irfan,29,Hyderabad,PREPAID
104,Raj,52,Mumbai,POSTPAID
105,,27,Delhi,PREPAID
106,Sneha,abc,Pune,PREPAID
'''

usage_tsv = '''customer_id\tvoice_mins\tdata_mb\tsms_count
101\t320\t1500\t20
102\t120\t4000\t5
103\t540\t600\t52
104\t45\t200\t2
105\t0\t0\t0
'''

tower_logs_region1 = '''event_id|customer_id|tower_id|signal_strength|timestamp
5001|101|TWR01|-80|2025-01-10 10:21:54
5004|104|TWR05|-75|2025-01-10 11:01:12
'''

In [0]:
customer_csv = '''
101,Arun,31,Chennai,PREPAID
102,Meera,45,Bangalore,POSTPAID
103,Irfan,29,Hyderabad,PREPAID
104,Raj,52,Mumbai,POSTPAID
105,,27,Delhi,PREPAID
106,Sneha,abc,Pune,PREPAID
'''
usage_tsv = '''customer_id\tvoice_mins\tdata_mb\tsms_count
101\t320\t1500\t20
102\t120\t4000\t5
103\t540\t600\t52
104\t45\t200\t2
105\t0\t0\t0
'''
# tower_logs_region1 = '''event_id|customer_id|tower_id|signal_strength|timestamp
# 5001|101|TWR01|-80|2025-01-10 10:21:54
# 5004|104|TWR05|-75|2025-01-10 11:01:12
# 5002|104|TWR06|-75|2025-01-10 11:01:12
# 5003|104|TWR02|-80|2025-01-10 11:01:12
# 5005|104|TWR03|-75|2025-01-10 11:01:12
# 5006|104|TWR04|-75|2025-01-10 11:01:12
# '''
tower_region1_ericsson_data ='''event_id|customer_id|tower_id|signal_strength|region|vendor|timestamp
5001|101|TWR01|-80|region1|ericsson|2025-01-10 10:21:54
5002|104|TWR05|-75|region1|ericsson|2025-01-10 11:01:12
'''
tower_region1_nokia_data ='''event_id|customer_id|tower_id|signal_strength|region|vendor|timestamp
5003|106|TWR06|-45|region1|nokia|2025-01-10 10:21:54
5004|107|TWR07|-55|region1|nokia|2025-01-10 11:01:12
'''
tower_region1_huawei_data ='''event_id|customer_id|tower_id|signal_strength|region|vendor|timestamp
5005|108|TWR08|-66|region1|huawei|2025-01-13 10:21:54
5006|109|TWR09|-76|region1|huawei|2025-01-10 11:01:12
'''
tower_region2_ericsson_data ='''event_id|customer_id|tower_id|signal_strength|region|vendor|timestamp
5007|111|TWR10|-10|region2|ericsson|2025-01-19 10:21:54
5008|112|TWR11|-73|region2|ericsson|2025-01-18 11:01:12
'''
tower_region2_nokia_data ='''event_id|customer_id|tower_id|signal_strength|region|vendor|timestamp
5009|113|TWR16|-80|region2|nokia|2025-01-20 10:21:54
5010|117|TWR15|-75|region2|nokia|2025-01-28 11:01:12
'''
tower_region2_huawei_data ='''event_id|customer_id|tower_id|signal_strength|region|vendor|timestamp
5011|118|TWR06|-10|region2|huawei|2025-01-20 10:21:54
5012|119|TWR05|-15|region2|huawei|2025-01-10 11:01:12
'''
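
The three datasets differ in separator and in whether they carry a header row, which determines the `sep` and `header` options needed when reading them later. A small standard-library check, shown here as a sketch on two-line excerpts of the data above, makes those differences explicit; the header test is a crude heuristic chosen for this sample, not a general rule.

```python
import csv
import io

# Short excerpts of the datasets above, paired with the separator each one uses.
samples = {
    "customer": ("101,Arun,31,Chennai,PREPAID\n102,Meera,45,Bangalore,POSTPAID\n", ","),
    "usage": ("customer_id\tvoice_mins\tdata_mb\tsms_count\n101\t320\t1500\t20\n", "\t"),
    "tower": (
        "event_id|customer_id|tower_id|signal_strength|region|vendor|timestamp\n"
        "5001|101|TWR01|-80|region1|ericsson|2025-01-10 10:21:54\n",
        "|",
    ),
}


def first_row(text, sep):
    """Return the first record of a delimited text block."""
    return next(csv.reader(io.StringIO(text), delimiter=sep))


summary = {}
for name, (text, sep) in samples.items():
    row = first_row(text, sep)
    # Heuristic: a first row without any numeric field is treated as a header.
    has_header = not any(field.lstrip("-").isdigit() for field in row)
    summary[name] = (len(row), has_header)

for name, (n_cols, has_header) in summary.items():
    print(f"{name}: {n_cols} columns, header row: {has_header}")
```

The customer data has no header row, so reading it with `header=True` would consume the first customer record as column names.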


##2. Filesystem operations
1. Write code to copy the above datasets into your created Volume folders:
Customer → /Volumes/.../customer/
Usage → /Volumes/.../usage/
Tower (region-based) → /Volumes/.../tower/region1/ and /Volumes/.../tower/region2/

2. Write a command to validate whether files were successfully copied

In [0]:
# 2.1. Write code to copy the above datasets into your created Volume folders:
# dbutils.fs.put(path: str, contents: str[file content or variable], overwrite: bool)
dbutils.fs.put("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer.csv", customer_csv, overwrite=True)

# The usage data is tab-separated even though the file keeps a .csv name, so read it later with sep="\t"
dbutils.fs.put("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/usage.csv", usage_tsv, overwrite=True)

dbutils.fs.put("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region1/ericsson/tower_region1_ericsson.csv", tower_region1_ericsson_data, overwrite=True)

dbutils.fs.put("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region1/nokia/tower_region1_nokia.csv", tower_region1_nokia_data, overwrite=True)

dbutils.fs.put("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region1/huawei/tower_region1_huawei.csv", tower_region1_huawei_data, overwrite=True)

dbutils.fs.put("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region2/ericsson/tower_region2_ericsson.csv", tower_region2_ericsson_data, overwrite=True)

dbutils.fs.put("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region2/nokia/tower_region2_nokia.csv", tower_region2_nokia_data, overwrite=True)

dbutils.fs.put("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region2/huawei/tower_region2_huawei.csv", tower_region2_huawei_data, overwrite=True)

In [0]:
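# 2.2. Write a command to validate whether files were successfully copied

print(dbutils.fs.ls('/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/'))
print(dbutils.fs.ls('/Volumes/telecom_catalog_assign/landing_zone/landing_vol/usage/'))
# Tower files sit two levels down (region/vendor), so list one vendor folder per region as a spot check
print(dbutils.fs.ls('/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region1/ericsson/'))
print(dbutils.fs.ls('/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/region2/ericsson/'))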
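
Listing each folder by hand gets tedious once the tower files are nested by region and vendor. On Unity Catalog compute, Volume paths are typically also reachable through ordinary Python file APIs, so a recursive walk can validate everything at once. The sketch below assumes that behaviour; `find_files` is a hypothetical helper, and when the Volume is not mounted (for example on a laptop) it falls back to a small temporary stand-in tree so the code still runs.

```python
import os
import tempfile

VOLUME_ROOT = "/Volumes/telecom_catalog_assign/landing_zone/landing_vol"


def find_files(root, suffix=".csv"):
    """Return sorted paths, relative to root, of all files ending with suffix."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffix):
                found.append(os.path.relpath(os.path.join(dirpath, name), root))
    return sorted(found)


if os.path.isdir(VOLUME_ROOT):
    root = VOLUME_ROOT
else:
    # Local fallback only: build a two-file stand-in for the Volume layout.
    root = tempfile.mkdtemp()
    for rel in ("customer/customer.csv", "tower/region1/nokia/tower_region1_nokia.csv"):
        os.makedirs(os.path.join(root, os.path.dirname(rel)), exist_ok=True)
        with open(os.path.join(root, rel), "w") as fh:
            fh.write("placeholder\n")

csv_files = find_files(root)
print(f"{len(csv_files)} CSV files under {root}")
for path in csv_files:
    print(" ", path)
```

In the notebook, eight files are expected after the copy step: one customer file, one usage file, and six tower files.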

##3. Directory Read Use Cases
1. Read all tower logs using:
Path glob filter (example: *.csv)
Multiple paths input
Recursive lookup

2. Demonstrate these 3 reads separately:
Using pathGlobFilter
Using list of paths in spark.read.csv([path1, path2])
Using .option("recursiveFileLookup","true")

3. Compare the outputs and understand when each should be used.

In [0]:
# 3.1. Read all tower logs in one pass: recursive lookup combined with a path glob filter (*.csv)
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()  # returns the notebook's existing session
df_tower_recursive = (
    spark.read.format("csv")
         .option("recursiveFileLookup", "true")   # descend into the region/vendor subfolders
         .option("pathGlobFilter", "*.csv")       # keep only file names ending in .csv
         .option("header", True)
         .option("sep", "|")
         .load("/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower/")
)
display(df_tower_recursive)


# 3.2. Demonstrate the reads separately:
# 3.2.2. Using a list of paths in spark.read.csv([path1, path2])
# tower_path already ends with "/", so no extra slash is added when building the file paths
df_tower_multi_path = (
    spark.read
         .csv(path = [f"{tower_path}region2/huawei/tower_region2_huawei.csv", f"{tower_path}region2/nokia/tower_region2_nokia.csv"], header = True, inferSchema = True, sep = "|")
)
display(df_tower_multi_path)

# 3.2.3. Using .option("recursiveFileLookup", "true") on a single base path
df_tower_recursive_alone = (
    spark.read
         .format("csv")
         .option("recursiveFileLookup", "true")
         .option("header", True)
         .option("sep", "|")
         .load(f"{tower_path}region1")
)
display(df_tower_recursive_alone)
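
A read that relies on `pathGlobFilter` alone can be sketched as follows. `pathGlobFilter` is matched against file names only, not folder names, so without `recursiveFileLookup` the load path itself has to reach the vendor folders; here that is done with wildcards in the path. The file-name list and the `_SUCCESS` entry are illustrative assumptions, `fnmatch` only approximates Spark's glob matching, and the Spark call is guarded so the cell also runs where no `spark` session exists.

```python
from fnmatch import fnmatch

tower_root = "/Volumes/telecom_catalog_assign/landing_zone/landing_vol/tower"

# File names as written earlier in this notebook, plus one hypothetical non-data file.
file_names = [
    "tower_region1_ericsson.csv",
    "tower_region1_nokia.csv",
    "tower_region2_nokia.csv",
    "_SUCCESS",  # hypothetical, included only to show the filter rejecting it
]

pattern = "*nokia*.csv"
nokia_only = [name for name in file_names if fnmatch(name, pattern)]
print(nokia_only)

if "spark" in globals():
    df_tower_glob = (
        spark.read.format("csv")
        .option("header", True)
        .option("sep", "|")
        .option("pathGlobFilter", pattern)  # filters on file name only
        .load(f"{tower_root}/*/*")          # wildcards select every region/vendor folder
    )
    df_tower_glob.show()
```

As a rule of thumb: a list of paths suits a small, known set of files; `pathGlobFilter` suits folders that mix wanted and unwanted file names; `recursiveFileLookup` suits arbitrarily nested folders, at the cost of disabling partition discovery from folder names.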

##4. Schema Inference, Header, and Separator
1. Read the Customer and Usage files with both the option()/options() style and the read.csv()/format() style, using:<br>
header=false, inferSchema=false<br>
or<br>
header=true, inferSchema=true<br>
2. Write a note on what changes when header or inferSchema is set to true versus false.<br>
3. How does schema inference handle “abc” in the age column?<br>
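
The effect of the “abc” value can be previewed without Spark. The sketch below is a simplified imitation of column type inference, not Spark's actual algorithm: a column is called `int` only when every non-empty value parses as an integer. `guess_type` is a hypothetical helper, and the guarded block shows the corresponding Spark reads for comparison inside the notebook.

```python
import csv
import io

customer_csv = '''
101,Arun,31,Chennai,PREPAID
102,Meera,45,Bangalore,POSTPAID
103,Irfan,29,Hyderabad,PREPAID
104,Raj,52,Mumbai,POSTPAID
105,,27,Delhi,PREPAID
106,Sneha,abc,Pune,PREPAID
'''


def guess_type(values):
    """Return 'int' if every non-empty value parses as an integer, else 'string'."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "string"
    try:
        for value in non_empty:
            int(value)
    except ValueError:
        return "string"
    return "int"


rows = list(csv.reader(io.StringIO(customer_csv.strip())))
columns = list(zip(*rows))  # transpose rows into columns
guessed = [guess_type(column) for column in columns]
print(guessed)  # ['int', 'string', 'string', 'string', 'string']

if "spark" in globals():
    cust_file = "/Volumes/telecom_catalog_assign/landing_zone/landing_vol/customer/customer.csv"
    # header=False, inferSchema=False: columns are named _c0.._c4 and all are strings
    spark.read.csv(cust_file, header=False, inferSchema=False).printSchema()
    # header=False, inferSchema=True: _c0 becomes an integer, _c2 (age) stays a string because of "abc"
    spark.read.csv(cust_file, header=False, inferSchema=True).printSchema()
```

A single non-numeric value is enough to keep the whole age column as a string, whereas the empty name in row 105 is simply read as a null and does not change that column's type.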

##5. Column Renaming Use Cases
1. Apply column names as strings using the toDF function for customer data
2. Apply column names and data types using the schema function for usage data
3. Apply column names and data types using StructType with IntegerType, StringType, TimestampType and other type classes for towers data
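
The three variants listed above can be sketched as follows. The column names for the customer file are assumptions, since that file has no header row, and the explicit `timestampFormat` is set as a precaution to match the sample data. The Spark calls are guarded so the cell also runs where no `spark` session exists; only the name list and the DDL string are built unconditionally.

```python
# Variant 1: plain column names for toDF (assumed names; the customer file has no header).
customer_cols = ["customer_id", "name", "age", "city", "plan_type"]

# Variant 2: names and types as a DDL string for DataFrameReader.schema().
usage_fields = [("customer_id", "INT"), ("voice_mins", "INT"), ("data_mb", "INT"), ("sms_count", "INT")]
usage_ddl = ", ".join(f"{name} {dtype}" for name, dtype in usage_fields)
print(usage_ddl)

if "spark" in globals():
    from pyspark.sql.types import IntegerType, StringType, StructField, StructType, TimestampType

    base = "/Volumes/telecom_catalog_assign/landing_zone/landing_vol"

    # 1. toDF renames the default _c0.._c4 columns positionally; types stay as strings.
    df_customer = spark.read.csv(f"{base}/customer/customer.csv").toDF(*customer_cols)

    # 2. A DDL string sets names and types; header=True skips the header line in the file.
    df_usage = spark.read.schema(usage_ddl).csv(f"{base}/usage/usage.csv", header=True, sep="\t")

    # 3. StructType gives the same control programmatically, including nullability.
    tower_schema = StructType([
        StructField("event_id", IntegerType(), True),
        StructField("customer_id", IntegerType(), True),
        StructField("tower_id", StringType(), True),
        StructField("signal_strength", IntegerType(), True),
        StructField("region", StringType(), True),
        StructField("vendor", StringType(), True),
        StructField("timestamp", TimestampType(), True),
    ])
    df_tower = (
        spark.read.schema(tower_schema)
        .option("header", True)
        .option("sep", "|")
        .option("recursiveFileLookup", "true")
        .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")
        .csv(f"{base}/tower")
    )

    for df in (df_customer, df_usage, df_tower):
        df.printSchema()
```

With an explicit schema, a value that does not fit its declared type (such as “abc” in an integer column) is read as null in the default permissive mode instead of changing the column type.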