# Data Sources and File Formats - Practice Notebook

This notebook covers reading and writing data in various formats, exploring Spark's built-in data sources.

## Learning Objectives

- Read and write CSV, JSON, and Parquet files
- Understand file format options and configurations
- Work with different data sources
- Handle schema evolution and data quality issues
- Optimize file formats for performance

## Sections

1. **CSV Files - Reading and Writing**
2. **JSON Files - Handling Semi-structured Data**
3. **Parquet Files - Columnar Storage**
4. **File Format Comparison**
5. **Advanced Data Source Options**
6. **Practice Exercises**

---


In [2]:
# Setup
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql import functions as F
import os

# Create SparkSession
spark = SparkSession.builder.appName("Data Sources and Formats").getOrCreate()

# Create sample data
sample_data = [
    (1, "Alice", "Engineering", 75000, "2020-01-15", ["Python", "SQL"]),
    (2, "Bob", "Sales", 65000, "2019-03-20", ["CRM", "Excel"]),
    (3, "Charlie", "Engineering", 80000, "2018-06-10", ["Java", "Scala"]),
    (4, "Diana", "Marketing", 70000, "2021-02-28", ["Analytics", "Design"]),
    (5, "Eve", "Sales", 68000, "2017-11-05", ["Negotiation", "Presentation"]),
]

df = spark.createDataFrame(
    sample_data, ["id", "name", "department", "salary", "hire_date", "skills"]
)

# Create data directory
os.makedirs("../data/formats", exist_ok=True)

print("Sample DataFrame:")
df.show()
df.printSchema()

Sample DataFrame:
+---+-------+-----------+------+----------+--------------------+
| id|   name| department|salary| hire_date|              skills|
+---+-------+-----------+------+----------+--------------------+
|  1|  Alice|Engineering| 75000|2020-01-15|       [Python, SQL]|
|  2|    Bob|      Sales| 65000|2019-03-20|        [CRM, Excel]|
|  3|Charlie|Engineering| 80000|2018-06-10|       [Java, Scala]|
|  4|  Diana|  Marketing| 70000|2021-02-28| [Analytics, Design]|
|  5|    Eve|      Sales| 68000|2017-11-05|[Negotiation, Pre...|
+---+-------+-----------+------+----------+--------------------+

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- department: string (nullable = true)
 |-- salary: long (nullable = true)
 |-- hire_date: string (nullable = true)
 |-- skills: array (nullable = true)
 |    |-- element: string (containsNull = true)



## 1. CSV Files - Reading and Writing

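CSV is one of the most common interchange formats, but it is plain text: it stores no type information and cannot represent nested columns. This section covers the main options for writing and reading it.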


In [3]:
# Writing CSV files
print("=== WRITING CSV FILES ===")

=== WRITING CSV FILES ===


In [5]:
# IMPORTANT: Spark's CSV writer does not support complex types (arrays, maps, structs).
# The 'skills' column is an array, so serialize it as a comma-separated string first.

df_csv_ready = df.withColumn("skills", F.concat_ws(",", "skills"))
df_csv_ready.show()

+---+-------+-----------+------+----------+--------------------+
| id|   name| department|salary| hire_date|              skills|
+---+-------+-----------+------+----------+--------------------+
|  1|  Alice|Engineering| 75000|2020-01-15|          Python,SQL|
|  2|    Bob|      Sales| 65000|2019-03-20|           CRM,Excel|
|  3|Charlie|Engineering| 80000|2018-06-10|          Java,Scala|
|  4|  Diana|  Marketing| 70000|2021-02-28|    Analytics,Design|
|  5|    Eve|      Sales| 68000|2017-11-05|Negotiation,Prese...|
+---+-------+-----------+------+----------+--------------------+



In [6]:
print("Original DataFrame with array column:")
df.select("name", "skills").show(truncate=False)
print("DataFrame prepared for CSV (skills as string):")
df_csv_ready.select("name", "skills").show(truncate=False)

Original DataFrame with array column:
+-------+---------------------------+
|name   |skills                     |
+-------+---------------------------+
|Alice  |[Python, SQL]              |
|Bob    |[CRM, Excel]               |
|Charlie|[Java, Scala]              |
|Diana  |[Analytics, Design]        |
|Eve    |[Negotiation, Presentation]|
+-------+---------------------------+

DataFrame prepared for CSV (skills as string):
+-------+------------------------+
|name   |skills                  |
+-------+------------------------+
|Alice  |Python,SQL              |
|Bob    |CRM,Excel               |
|Charlie|Java,Scala              |
|Diana  |Analytics,Design        |
|Eve    |Negotiation,Presentation|
+-------+------------------------+
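
Joining array elements with commas is lossy as soon as one element contains a comma itself. The plain-Python sketch below (no Spark involved; the sample values are made up for illustration) shows the problem and one possible alternative, JSON-encoding the list. In Spark, the corresponding functions would be `F.to_json` and `F.from_json`, which this notebook does not use.

```python
import json

# Hypothetical skill list in which one element contains a comma.
skills = ["Python", "SQL, advanced"]

# Comma-joining (the effect of concat_ws(",", ...)) cannot be reversed reliably.
joined = ",".join(skills)
round_trip_split = joined.split(",")
print(round_trip_split)  # three elements come back instead of two

# JSON-encoding keeps the element boundaries intact.
encoded = json.dumps(skills)
round_trip_json = json.loads(encoded)
print(round_trip_json == skills)
```

For the sample data in this notebook no skill contains a comma, so the simpler `concat_ws` approach is safe here.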



In [7]:
# Basic CSV write (with skills serialized as string)
# Note: Spark creates a directory named employees.csv that holds one or more part files
df_csv_ready.write.mode("overwrite").option("header", "true").csv(
    "../data/formats/employees.csv"
)
print("Basic CSV file written (skills column serialized as comma-separated string)")

Basic CSV file written (skills column serialized as comma-separated string)


In [8]:
# CSV with custom options (with skills serialized as string)
df_csv_ready.write.mode("overwrite").option("header", "true").option(
    "delimiter", "|"
).option("quote", '"').option("escape", "\\").csv(
    "../data/formats/employees_custom.csv"
)
print("Custom CSV file written (skills column serialized as comma-separated string)")

Custom CSV file written (skills column serialized as comma-separated string)
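
The `delimiter`, `quote`, and `escape` options only matter when a field contains one of the special characters. As a plain-Python illustration using the standard `csv` module (not Spark's writer, whose exact quoting rules may differ), a field that contains the delimiter is wrapped in the quote character so that it survives a round trip. The sample row is hypothetical.

```python
import csv
import io

# Hypothetical row: the third field contains the pipe delimiter.
row = ["6", "Frank", "R&D|Platform", "Go,Rust"]

# Write with a pipe delimiter; only fields that need it are quoted.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="|", quotechar='"', lineterminator="\n")
writer.writerow(row)
line = buffer.getvalue()
print(line)

# Reading with the same options restores the original four fields.
parsed = next(csv.reader(io.StringIO(line), delimiter="|", quotechar='"'))
print(parsed == row)
```

The comma inside `Go,Rust` needs no quoting here because the comma is no longer the delimiter, which is one reason to pick a delimiter that rarely occurs in the data.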


In [9]:
# Reading CSV files
print("\n=== READING CSV FILES ===")


=== READING CSV FILES ===


In [10]:
# Basic CSV read: without a schema or inferSchema, every column is read as a string
df_csv_basic = spark.read.option("header", "true").csv("../data/formats/employees.csv")
print("Basic CSV read (skills as string):")
df_csv_basic.show()
df_csv_basic.printSchema()

Basic CSV read (skills as string):
+---+-------+-----------+------+----------+--------------------+
| id|   name| department|salary| hire_date|              skills|
+---+-------+-----------+------+----------+--------------------+
|  5|    Eve|      Sales| 68000|2017-11-05|Negotiation,Prese...|
|  4|  Diana|  Marketing| 70000|2021-02-28|    Analytics,Design|
|  3|Charlie|Engineering| 80000|2018-06-10|          Java,Scala|
|  1|  Alice|Engineering| 75000|2020-01-15|          Python,SQL|
|  2|    Bob|      Sales| 65000|2019-03-20|           CRM,Excel|
+---+-------+-----------+------+----------+--------------------+

root
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- department: string (nullable = true)
 |-- salary: string (nullable = true)
 |-- hire_date: string (nullable = true)
 |-- skills: string (nullable = true)
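
CSV stores only text, which is why every column above comes back as `string`. The plain-Python sketch below shows the same effect with the standard `csv` module, followed by an explicit cast step, which is roughly the role that a user-supplied schema plays in Spark. The `CASTS` mapping and the inline sample are hypothetical and exist only for this illustration.

```python
import csv
import io
from datetime import date

raw = "id,name,salary,hire_date\n1,Alice,75000,2020-01-15\n"

# Every value parsed from CSV text starts out as a string.
record = next(csv.DictReader(io.StringIO(raw)))
print(record)

# Hypothetical column-to-type mapping, applied explicitly.
CASTS = {"id": int, "name": str, "salary": int, "hire_date": date.fromisoformat}
typed = {column: CASTS[column](value) for column, value in record.items()}
print(typed)
```

In Spark, the counterpart is passing a schema to the reader, for example a DDL string via `spark.read.schema(...)`, which avoids both the all-strings default and the cost of inference.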



In [27]:
# Convert the 'skills' string column back to an array if needed
df_csv_with_array = df_csv_basic.withColumn("skills", F.split("skills", ","))
print("\nSkills column converted back to array:")
df_csv_with_array.show()
df_csv_with_array.printSchema()


Skills column converted back to array:
+---+-------+-----------+------+----------+--------------------+
| id|   name| department|salary| hire_date|              skills|
+---+-------+-----------+------+----------+--------------------+
|  5|    Eve|      Sales| 68000|2017-11-05|[Negotiation, Pre...|
|  4|  Diana|  Marketing| 70000|2021-02-28| [Analytics, Design]|
|  3|Charlie|Engineering| 80000|2018-06-10|       [Java, Scala]|
|  1|  Alice|Engineering| 75000|2020-01-15|       [Python, SQL]|
|  2|    Bob|      Sales| 65000|2019-03-20|        [CRM, Excel]|
+---+-------+-----------+------+----------+--------------------+

root
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- department: string (nullable = true)
 |-- salary: string (nullable = true)
 |-- hire_date: string (nullable = true)
 |-- skills: array (nullable = true)
 |    |-- element: string (containsNull = false)



In [None]:
# CSV read with schema inference (requires an extra pass over the data)
df_csv_infer = (
    spark.read.option("header", "true")
    .option("inferSchema", "true")
    .csv("../data/formats/employees.csv")
)
print("\nCSV read with schema inference:")
df_csv_infer.show()
df_csv_infer.printSchema()


CSV read with schema inference:
+---+-------+-----------+------+----------+--------------------+
| id|   name| department|salary| hire_date|              skills|
+---+-------+-----------+------+----------+--------------------+
|  5|    Eve|      Sales| 68000|2017-11-05|Negotiation,Prese...|
|  4|  Diana|  Marketing| 70000|2021-02-28|    Analytics,Design|
|  3|Charlie|Engineering| 80000|2018-06-10|          Java,Scala|
|  1|  Alice|Engineering| 75000|2020-01-15|          Python,SQL|
|  2|    Bob|      Sales| 65000|2019-03-20|           CRM,Excel|
+---+-------+-----------+------+----------+--------------------+

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- department: string (nullable = true)
 |-- salary: integer (nullable = true)
 |-- hire_date: date (nullable = true)
 |-- skills: string (nullable = true)



In [None]:
# CSV read with custom delimiter
df_csv_custom = (
    spark.read.option("header", "true")
    .option("delimiter", "|")
    .csv("../data/formats/employees_custom.csv")
)
print("\nCSV read with custom delimiter:")
df_csv_custom.show()


CSV read with custom delimiter:
+---+-------+-----------+------+----------+--------------------+
| id|   name| department|salary| hire_date|              skills|
+---+-------+-----------+------+----------+--------------------+
|  5|    Eve|      Sales| 68000|2017-11-05|Negotiation,Prese...|
|  4|  Diana|  Marketing| 70000|2021-02-28|    Analytics,Design|
|  3|Charlie|Engineering| 80000|2018-06-10|          Java,Scala|
|  1|  Alice|Engineering| 75000|2020-01-15|          Python,SQL|
|  2|    Bob|      Sales| 65000|2019-03-20|           CRM,Excel|
+---+-------+-----------+------+----------+--------------------+



## 2. JSON Files - Handling Semi-structured Data

JSON suits semi-structured data: unlike CSV, it can represent arrays and nested fields directly, so the `skills` array needs no conversion before writing.


In [None]:
# Writing JSON files
print("=== WRITING JSON FILES ===")

# Basic JSON write
df.write.mode("overwrite").json("../data/formats/employees.json")
print("JSON file written")

# Reading JSON files
print("\n=== READING JSON FILES ===")

df_json = spark.read.json("../data/formats/employees.json")
print("JSON read:")
df_json.show()
df_json.printSchema()

=== WRITING JSON FILES ===
JSON file written

=== READING JSON FILES ===
JSON read:
+-----------+----------+---+-------+------+--------------------+
| department| hire_date| id|   name|salary|              skills|
+-----------+----------+---+-------+------+--------------------+
|      Sales|2017-11-05|  5|    Eve| 68000|[Negotiation, Pre...|
|  Marketing|2021-02-28|  4|  Diana| 70000| [Analytics, Design]|
|Engineering|2018-06-10|  3|Charlie| 80000|       [Java, Scala]|
|Engineering|2020-01-15|  1|  Alice| 75000|       [Python, SQL]|
|      Sales|2019-03-20|  2|    Bob| 65000|        [CRM, Excel]|
+-----------+----------+---+-------+------+--------------------+

root
 |-- department: string (nullable = true)
 |-- hire_date: string (nullable = true)
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- salary: long (nullable = true)
 |-- skills: array (nullable = true)
 |    |-- element: string (containsNull = true)
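
Spark's JSON writer emits JSON Lines: one JSON object per line in each part file, rather than a single top-level array. A plain-Python sketch of that layout, using two of the sample rows (the in-memory buffer stands in for a part file):

```python
import io
import json

rows = [
    {"id": 1, "name": "Alice", "skills": ["Python", "SQL"]},
    {"id": 2, "name": "Bob", "skills": ["CRM", "Excel"]},
]

# Write one JSON object per line (JSON Lines).
buffer = io.StringIO()
for row in rows:
    buffer.write(json.dumps(row) + "\n")
print(buffer.getvalue())

# Read it back line by line; nested arrays come back as lists.
restored = [json.loads(line) for line in buffer.getvalue().splitlines()]
print(restored == rows)
```

Files that hold a single pretty-printed object or a top-level array need `option("multiLine", "true")` when read with Spark. Note also that the inferred schema above lists the columns alphabetically rather than in their original order.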



In [None]:
# Create more complex JSON data
complex_data = [
    {
        "id": 1,
        "personal": {"name": "Alice", "age": 30, "email": "alice@company.com"},
        "job": {"title": "Engineer", "salary": 75000, "department": "Engineering"},
        "skills": ["Python", "SQL", "Spark"],
        "projects": [
            {"name": "Project A", "status": "completed"},
            {"name": "Project B", "status": "in_progress"},
        ],
    },
    {
        "id": 2,
        "personal": {"name": "Bob", "age": 25, "email": "bob@company.com"},
        "job": {"title": "Analyst", "salary": 65000, "department": "Sales"},
        "skills": ["Excel", "PowerBI"],
        "projects": [{"name": "Project C", "status": "completed"}],
    },
]

In [33]:
# Convert to DataFrame and write
df_complex = spark.createDataFrame(complex_data)
df_complex.write.mode("overwrite").json("../data/formats/employees_complex.json")

In [34]:
# Read complex JSON
df_complex_read = spark.read.json("../data/formats/employees_complex.json")
print("\nComplex JSON structure:")
df_complex_read.show(truncate=False)
df_complex_read.printSchema()


Complex JSON structure:
+---+------------------------------+------------------------------+--------------------------------------------------+--------------------+
|id |job                           |personal                      |projects                                          |skills              |
+---+------------------------------+------------------------------+--------------------------------------------------+--------------------+
|1  |{Engineering, 75000, Engineer}|{30, alice@company.com, Alice}|[{Project A, completed}, {Project B, in_progress}]|[Python, SQL, Spark]|
|2  |{Sales, 65000, Analyst}       |{25, bob@company.com, Bob}    |[{Project C, completed}]                          |[Excel, PowerBI]    |
+---+------------------------------+------------------------------+--------------------------------------------------+--------------------+

root
 |-- id: long (nullable = true)
 |-- job: struct (nullable = true)
 |    |-- department: string (nullable = true)
 |    |-- salar

In [None]:
# Access nested fields
print("\nAccessing nested fields:")
df_complex_read.select("id", "personal.name", "job.title", "job.salary").show()


Accessing nested fields:
+---+-----+--------+------+
| id| name|   title|salary|
+---+-----+--------+------+
|  1|Alice|Engineer| 75000|
|  2|  Bob| Analyst| 65000|
+---+-----+--------+------+
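
The dotted paths passed to `select` walk into struct fields, and the resulting column takes the name of the leaf field. As a plain-Python analogy (not Spark code), the hypothetical helper `get_path` below resolves the same kind of path against one of the nested records:

```python
from functools import reduce

record = {
    "id": 1,
    "personal": {"name": "Alice", "age": 30, "email": "alice@company.com"},
    "job": {"title": "Engineer", "salary": 75000, "department": "Engineering"},
}


def get_path(data, path):
    """Follow a dotted path such as 'job.title' through nested dicts."""
    return reduce(lambda node, key: node[key], path.split("."), data)


# Mirror the select above: keep the leaf name as the output key.
paths = ["id", "personal.name", "job.title", "job.salary"]
selected = {path.split(".")[-1]: get_path(record, path) for path in paths}
print(selected)
```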



## 3. Parquet Files - Columnar Storage

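Parquet is a columnar, binary format that stores the schema alongside the data, supports compression, and lets Spark read only the columns a query needs. These properties make it a common default for analytics workloads.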


In [None]:
# Writing Parquet files
print("=== WRITING PARQUET FILES ===")

# Basic Parquet write
df.write.mode("overwrite").parquet("../data/formats/employees.parquet")
print("Parquet file written")

# Parquet with partitioning
df.write.mode("overwrite").partitionBy("department").parquet(
    "../data/formats/employees_partitioned.parquet"
)
print("Partitioned Parquet file written")

# Reading Parquet files
print("\n=== READING PARQUET FILES ===")

df_parquet = spark.read.parquet("../data/formats/employees.parquet")
print("Parquet read:")
df_parquet.show()
df_parquet.printSchema()

=== WRITING PARQUET FILES ===
Parquet file written
Partitioned Parquet file written

=== READING PARQUET FILES ===
Parquet read:
+---+-------+-----------+------+----------+--------------------+
| id|   name| department|salary| hire_date|              skills|
+---+-------+-----------+------+----------+--------------------+
|  3|Charlie|Engineering| 80000|2018-06-10|       [Java, Scala]|
|  4|  Diana|  Marketing| 70000|2021-02-28| [Analytics, Design]|
|  1|  Alice|Engineering| 75000|2020-01-15|       [Python, SQL]|
|  5|    Eve|      Sales| 68000|2017-11-05|[Negotiation, Pre...|
|  2|    Bob|      Sales| 65000|2019-03-20|        [CRM, Excel]|
+---+-------+-----------+------+----------+--------------------+

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- department: string (nullable = true)
 |-- salary: long (nullable = true)
 |-- hire_date: string (nullable = true)
 |-- skills: array (nullable = true)
 |    |-- element: string (containsNull = true)



In [42]:
# Read partitioned Parquet
df_partitioned = spark.read.parquet("../data/formats/employees_partitioned.parquet")
print("\nPartitioned Parquet read:")
df_partitioned.show()

# Parquet stores the schema with the data, so column names and types survive the round trip
print("\nParquet schema preservation:")
print("Original schema:", df.schema.simpleString())
print("Parquet schema:", df_parquet.schema.simpleString())
print("Schemas match:", df.schema == df_parquet.schema)


Partitioned Parquet read:
+---+-------+------+----------+--------------------+-----------+
| id|   name|salary| hire_date|              skills| department|
+---+-------+------+----------+--------------------+-----------+
|  5|    Eve| 68000|2017-11-05|[Negotiation, Pre...|      Sales|
|  4|  Diana| 70000|2021-02-28| [Analytics, Design]|  Marketing|
|  3|Charlie| 80000|2018-06-10|       [Java, Scala]|Engineering|
|  1|  Alice| 75000|2020-01-15|       [Python, SQL]|Engineering|
|  2|    Bob| 65000|2019-03-20|        [CRM, Excel]|      Sales|
+---+-------+------+----------+--------------------+-----------+


Parquet schema preservation:
Original schema: struct<id:bigint,name:string,department:string,salary:bigint,hire_date:string,skills:array<string>>
Parquet schema: struct<id:bigint,name:string,department:string,salary:bigint,hire_date:string,skills:array<string>>
Schemas match: True
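
`partitionBy("department")` writes one subdirectory per distinct value, named `department=<value>`, and leaves the column out of the data files. On read, Spark rebuilds the column from the directory names, which is why `department` appears as the last column in the partitioned read above. The sketch below only computes the expected directory names in plain Python; the part-file names inside each directory are generated by Spark and are not shown.

```python
from pathlib import PurePosixPath

base = PurePosixPath("../data/formats/employees_partitioned.parquet")
departments = ["Engineering", "Sales", "Engineering", "Marketing", "Sales"]

# One directory per distinct partition value, in Hive-style key=value form.
partition_dirs = sorted({str(base / f"department={d}") for d in departments})
for path in partition_dirs:
    print(path)
```

A filter on the partition column, such as `df_partitioned.filter("department = 'Sales'")`, lets Spark skip the other directories entirely (partition pruning).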


In [None]:
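# Parquet with explicit compression codecs
# (snappy is Spark's default; gzip typically produces smaller files but is slower to write)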
df.write.mode("overwrite").option("compression", "snappy").parquet(
    "../data/formats/employees_snappy.parquet"
)
df.write.mode("overwrite").option("compression", "gzip").parquet(
    "../data/formats/employees_gzip.parquet"
)
print("\nParquet files with different compression written")


Parquet files with different compression written
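
To see what the codec choice costs or saves, compare the on-disk size of the two output directories. This is a minimal sketch with a hypothetical helper, `dir_size`; it assumes the write cell above has been run, and on a five-row dataset the difference will be negligible.

```python
import os


def dir_size(path):
    """Total size in bytes of all files under a directory tree."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


# A missing directory simply reports 0 bytes.
for codec in ["snappy", "gzip"]:
    target = f"../data/formats/employees_{codec}.parquet"
    print(codec, dir_size(target), "bytes")
```

The total counts every file in the directory, including any marker or checksum files that Spark writes next to the data, so treat it as a rough comparison rather than an exact measure of the Parquet payload.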


## 4. Practice Exercises

Complete these exercises to practice working with different data formats.
