# Module 02 - Reading & Writing Data - Exercises

## Instructions

This notebook contains exercises based on the concepts learned in Module 02.

- Complete each exercise in the provided code cells
- Run the data setup cells first to generate the necessary data
- Test your solutions by running the verification cells (if provided)
- Refer back to the main module notebook if you need help

## Data Setup

Run the cells below to set up the data needed for the exercises.


In [4]:
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField, StringType,
    IntegerType, DoubleType, DateType
)
from pyspark.sql.functions import col, when, lit
import os
import pandas as pd
import numpy as np
import json

# Create SparkSession
spark = SparkSession.builder \
    .appName("Module 02 Exercises") \
    .master("local[*]") \
    .getOrCreate()

# Set data directory
data_dir = "../data"
os.makedirs(data_dir, exist_ok=True)

print("SparkSession created successfully!")
print(f"Data directory: {os.path.abspath(data_dir)}")

# -----------------------------
# Generate sample CSV data
# -----------------------------
csv_data = {
    "id": range(1, 1001),
    "name": [f"Person_{i}" for i in range(1, 1001)],
    "age": np.random.randint(20, 60, 1000),
    "city": np.random.choice(
        ["NYC", "LA", "Chicago", "Houston", "Phoenix"], 1000
    ),
    "salary": np.random.randint(40000, 120000, 1000)
}

df_csv = pd.DataFrame(csv_data)
df_csv.to_csv(f"{data_dir}/exercise_data.csv", index=False)

print(f"Created CSV file: {data_dir}/exercise_data.csv")

# -----------------------------
# Generate sample JSON data
# -----------------------------
json_data = [
    {
        "id": i,
        "product": f"Product_{i}",
        "price": round(np.random.uniform(10, 100), 2),
        "category": np.random.choice(
            ["Electronics", "Clothing", "Food", "Books"], 1
        )[0]
    }
    for i in range(1, 501)
]

with open(f"{data_dir}/exercise_data.json", "w") as f:
    json.dump(json_data, f, indent=2)

print(f"Created JSON file: {data_dir}/exercise_data.json")


SparkSession created successfully!
Data directory: /data
Created CSV file: ../data/exercise_data.csv
Created JSON file: ../data/exercise_data.json
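
Before starting the exercises, it can help to confirm that the setup cell wrote what was expected. The sketch below uses only the Python standard library, so it does not depend on Spark. The helper names `csv_row_count` and `json_record_count` are hypothetical, and the expected counts in the usage comments (1000 and 500) are taken from the ranges used in the setup cell.

```python
import csv
import json


def csv_row_count(path):
    """Count data rows in a CSV file, excluding the header line."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row if the file has one
        return sum(1 for _ in reader)


def json_record_count(path):
    """Count records in a file that holds a single top-level JSON array."""
    with open(path) as f:
        return len(json.load(f))


# Usage in the notebook (data_dir is defined in the setup cell):
# assert csv_row_count(f"{data_dir}/exercise_data.csv") == 1000
# assert json_record_count(f"{data_dir}/exercise_data.json") == 500
```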


## Exercises

Complete the following exercises based on the concepts from Module 02.

### Exercise 1: Read CSV File

Read the 'exercise_data.csv' file from the data directory and display the first 5 rows.

In [8]:
# Your code here
df_1 = spark.read.csv(f"{data_dir}/exercise_data.csv", header=True)
df_1.show(5)

+---+--------+---+-------+------+
| id|    name|age|   city|salary|
+---+--------+---+-------+------+
|  1|Person_1| 20|Chicago| 85163|
|  2|Person_2| 31|Phoenix|118464|
|  3|Person_3| 50|Chicago| 94327|
|  4|Person_4| 42|     LA|107071|
|  5|Person_5| 47|Phoenix|118236|
+---+--------+---+-------+------+
only showing top 5 rows
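
With only `header=True`, Spark reads every CSV column as a string, which is what the schema printed in Exercise 3 shows. One way to get typed columns is to pass a schema up front. The sketch below uses a DDL-formatted schema string, which `spark.read.csv` accepts through its `schema` parameter. The helper name `read_people_csv` and the constant `PEOPLE_SCHEMA_DDL` are hypothetical, and the column types are an assumption based on how the setup cell generates the data.

```python
# Assumed column types, derived from the setup cell (not stated in the module).
PEOPLE_SCHEMA_DDL = "id INT, name STRING, age INT, city STRING, salary INT"


def read_people_csv(spark, path):
    """Read the exercise CSV with an explicit schema instead of all-string columns.

    `spark` is an existing SparkSession, such as the one created in the setup cell.
    """
    return spark.read.csv(path, header=True, schema=PEOPLE_SCHEMA_DDL)


# Usage in the notebook:
# df_typed = read_people_csv(spark, f"{data_dir}/exercise_data.csv")
# df_typed.printSchema()
```

Passing `inferSchema=True` is an alternative, but it makes Spark scan the data an extra time to guess the types, whereas an explicit schema avoids that pass and documents the expected layout.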


### Exercise 2: Read JSON File

Read the 'exercise_data.json' file from the data directory and display the schema.

In [15]:
# Your code here
df_2 = spark.read \
        .format('json') \
        .option('multiline',True) \
        .load(f"{data_dir}/exercise_data.json")

df_2.show()
df_2.printSchema()

+-----------+---+-----+----------+
|   category| id|price|   product|
+-----------+---+-----+----------+
|       Food|  1|74.23| Product_1|
|   Clothing|  2|42.17| Product_2|
|      Books|  3|25.31| Product_3|
|Electronics|  4|94.12| Product_4|
|       Food|  5|49.59| Product_5|
|       Food|  6|40.76| Product_6|
|   Clothing|  7|17.78| Product_7|
|   Clothing|  8|35.37| Product_8|
|      Books|  9|66.37| Product_9|
|       Food| 10|88.39|Product_10|
|       Food| 11|80.02|Product_11|
|      Books| 12| 74.9|Product_12|
|       Food| 13|25.59|Product_13|
|      Books| 14|20.63|Product_14|
|      Books| 15|41.12|Product_15|
|Electronics| 16|67.53|Product_16|
|   Clothing| 17| 73.8|Product_17|
|Electronics| 18|30.49|Product_18|
|      Books| 19|34.83|Product_19|
|      Books| 20|78.96|Product_20|
+-----------+---+-----+----------+
only showing top 20 rows
root
 |-- category: string (nullable = true)
 |-- id: long (nullable = true)
 |-- price: double (nullable = true)
 |-- product: string (nullable = true)
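
The `multiline` option is needed here because the setup cell writes the file with `json.dump(..., indent=2)`, which produces one JSON array spread over many lines. By default, Spark's JSON reader expects one complete JSON object per line (JSON Lines). As a rough sketch of the alternative, the same records could be written in that layout; the helper name `write_json_lines` and the `.jsonl` file name are hypothetical.

```python
import json


def write_json_lines(records, path):
    """Write one compact JSON object per line (the JSON Lines layout)."""
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


# Usage in the notebook (json_data, data_dir and spark come from the setup cell):
# write_json_lines(json_data, f"{data_dir}/exercise_data.jsonl")
# spark.read.json(f"{data_dir}/exercise_data.jsonl").printSchema()
```

A file in this layout can be read without the `multiline` option and can be split across tasks, which a single multi-line JSON document cannot.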



### Exercise 3: Write to Parquet

Write the DataFrame from Exercise 1 to a Parquet file named 'output_exercise.parquet' in the data directory.

In [18]:
# Your code here
df_1.printSchema()
df_1.write.parquet(f"{data_dir}/output_exercise.parquet", mode="overwrite")
df_1.show()
df_1.printSchema()

root
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- age: string (nullable = true)
 |-- city: string (nullable = true)
 |-- salary: string (nullable = true)

+---+---------+---+-------+------+
| id|     name|age|   city|salary|
+---+---------+---+-------+------+
|  1| Person_1| 20|Chicago| 85163|
|  2| Person_2| 31|Phoenix|118464|
|  3| Person_3| 50|Chicago| 94327|
|  4| Person_4| 42|     LA|107071|
|  5| Person_5| 47|Phoenix|118236|
|  6| Person_6| 52|Houston| 60905|
|  7| Person_7| 52|    NYC| 51205|
|  8| Person_8| 50|Phoenix|106542|
|  9| Person_9| 42|     LA| 80638|
| 10|Person_10| 31|     LA| 40354|
| 11|Person_11| 23|     LA|119110|
| 12|Person_12| 44|Chicago| 98678|
| 13|Person_13| 37|    NYC| 73709|
| 14|Person_14| 43|Houston| 43142|
| 15|Person_15| 25|     LA|119299|
| 16|Person_16| 54|    NYC| 82376|
| 17|Person_17| 32|    NYC|112167|
| 18|Person_18| 41|Phoenix| 60096|
| 19|Person_19| 50|     LA| 64937|
| 20|Person_20| 37|Phoenix| 94970|
+---+---------+---+-------+------+
only showing top 20 rows
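
Printing the DataFrame after the write does not show whether the Parquet output itself is usable. A simple check is to read the output back and compare row counts. The sketch below is one way to do that; the helper name `write_and_verify_parquet` is hypothetical, and it expects an existing SparkSession and DataFrame such as `spark` and `df_1` from the cells above.

```python
def write_and_verify_parquet(df, spark, path):
    """Write df to Parquet at path, read it back, and compare row counts."""
    expected = df.count()

    # Overwrite any output left over from a previous run of the notebook.
    df.write.mode("overwrite").parquet(path)

    df_back = spark.read.parquet(path)
    actual = df_back.count()
    if actual != expected:
        raise ValueError(f"Row count mismatch: wrote {expected}, read {actual}")
    return df_back


# Usage in the notebook:
# df_back = write_and_verify_parquet(df_1, spark, f"{data_dir}/output_exercise.parquet")
# df_back.printSchema()
```

Note that Spark writes `output_exercise.parquet` as a directory of part files rather than a single file. Parquet stores the schema with the data, so the columns come back as strings here because `df_1` was read without a schema.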

## Summary

Great job completing the exercises! Review your solutions and compare them with the solutions notebook if needed.
