# Module 02 - Reading & Writing Data - Exercises

## Instructions

This notebook contains exercises based on the concepts learned in Module 02.

- Complete each exercise in the provided code cells
- Run the data setup cells first to generate the necessary data
- Test your solutions by running the verification cells (if provided)
- Refer back to the main module notebook if you need help

## Data Setup

Run the cells below to set up the data needed for the exercises.


In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField, StringType,
    IntegerType, DoubleType, DateType
)
from pyspark.sql.functions import col, when, lit
import os
import pandas as pd
import numpy as np
import json

# Create SparkSession
spark = SparkSession.builder \
    .appName("Module 02 Exercises") \
    .master("local[*]") \
    .getOrCreate()

# Set data directory
data_dir = "../data"
os.makedirs(data_dir, exist_ok=True)

print("SparkSession created successfully!")
print(f"Data directory: {os.path.abspath(data_dir)}")

# -----------------------------
# Generate sample CSV data
# -----------------------------
csv_data = {
    "id": range(1, 1001),
    "name": [f"Person_{i}" for i in range(1, 1001)],
    "age": np.random.randint(20, 60, 1000),
    "city": np.random.choice(
        ["NYC", "LA", "Chicago", "Houston", "Phoenix"], 1000
    ),
    "salary": np.random.randint(40000, 120000, 1000)
}

df_csv = pd.DataFrame(csv_data)
df_csv.to_csv(f"{data_dir}/exercise_data.csv", index=False)

print(f"Created CSV file: {data_dir}/exercise_data.csv")

# -----------------------------
# Generate sample JSON data
# -----------------------------
json_data = [
    {
        "id": i,
        "product": f"Product_{i}",
        "price": round(np.random.uniform(10, 100), 2),
        "category": np.random.choice(
            ["Electronics", "Clothing", "Food", "Books"], 1
        )[0]
    }
    for i in range(1, 501)
]

with open(f"{data_dir}/exercise_data.json", "w") as f:
    json.dump(json_data, f, indent=2)

print(f"Created JSON file: {data_dir}/exercise_data.json")


## Exercises

Complete the following exercises based on the concepts from Module 02.

### Exercise 1: Read CSV File

Read the 'exercise_data.csv' file from the data directory and display the first 5 rows.

In [0]:
# Your code here
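# header=True uses the first row as column names; inferSchema=True detects
# numeric columns instead of reading every column as a string
df_1 = spark.read.csv(
    f"{data_dir}/exercise_data.csv", header=True, inferSchema=True
)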
df_1.show(5)
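
Spark reads every CSV column as a string unless it is asked to infer types or is given a schema. As a minimal sketch of the second option, the block below builds a DDL-formatted schema string that `spark.read.csv` accepts through its `schema` parameter. The names `CSV_COLUMNS`, `CSV_SCHEMA_DDL`, and `read_exercise_csv` are hypothetical helpers introduced for illustration, and the column types are an assumption based on the values generated in the setup cell.

```python
# Column names and types mirror the pandas DataFrame built in the setup cell.
# The types are an assumption: all generated integers fit in a 32-bit INT.
CSV_COLUMNS = [
    ("id", "INT"),
    ("name", "STRING"),
    ("age", "INT"),
    ("city", "STRING"),
    ("salary", "INT"),
]

# DDL-formatted schema string of the form "col1 TYPE, col2 TYPE, ..."
CSV_SCHEMA_DDL = ", ".join(f"{name} {dtype}" for name, dtype in CSV_COLUMNS)


def read_exercise_csv(spark, path):
    """Read the exercise CSV with an explicit schema instead of inference.

    `spark` is an existing SparkSession, such as the one from the setup cell.
    """
    return spark.read.csv(path, header=True, schema=CSV_SCHEMA_DDL)


print(CSV_SCHEMA_DDL)

# In the notebook, after running the setup cell:
# df_typed = read_exercise_csv(spark, f"{data_dir}/exercise_data.csv")
# df_typed.printSchema()
```

An explicit schema avoids the extra pass over the file that inference needs, at the cost of having to keep the column list in sync with the data.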

### Exercise 2: Read JSON File

Read the 'exercise_data.json' file from the data directory and display the schema.

In [0]:
# Your code here
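# The setup cell writes a single pretty-printed JSON array, so records span
# several lines and the multiLine option is required
df_2 = spark.read \
    .format("json") \
    .option("multiLine", True) \
    .load(f"{data_dir}/exercise_data.json")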

df_2.show()
df_2.printSchema()
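
The reason the multi-line option matters here is the file layout: the setup cell calls `json.dump(..., indent=2)`, which writes one JSON array spread over many lines, while Spark's JSON reader expects one complete object per line (JSON Lines) by default. The sketch below uses only the standard library to contrast the two layouts. The sample records and the helper `parses_line_by_line` are hypothetical and exist only for illustration.

```python
import json

# Two sample records shaped like the ones in the setup cell (values are made up).
records = [
    {"id": 1, "product": "Product_1"},
    {"id": 2, "product": "Product_2"},
]

# Layout written by the setup cell: one JSON array, pretty-printed over many lines.
array_text = json.dumps(records, indent=2)

# JSON Lines layout: one complete JSON object per line.
lines_text = "\n".join(json.dumps(record) for record in records)


def parses_line_by_line(text):
    """Return True if every non-empty line is a complete JSON object on its own."""
    for line in text.splitlines():
        if not line.strip():
            continue
        try:
            if not isinstance(json.loads(line), dict):
                return False
        except json.JSONDecodeError:
            return False
    return True


print(parses_line_by_line(array_text))  # False: the first line is just "["
print(parses_line_by_line(lines_text))  # True: each line is a full object
```

A file in the second layout could be read without the multi-line option, and Spark can split such a file across tasks, which it cannot do for a single multi-line document.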




### Exercise 3: Write to Parquet

Write the DataFrame from Exercise 1 to a Parquet file named 'output_exercise.parquet' in the data directory.

In [0]:
# Your code here
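# Check the column types that will be stored in the Parquet output
df_1.printSchema()

# mode("overwrite") replaces any output left by a previous run
df_1.write.mode("overwrite").parquet(f"{data_dir}/output_exercise.parquet")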
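
Although the exercise calls it a Parquet file, Spark writes `output_exercise.parquet` as a directory that holds one or more `part-*.parquet` files plus bookkeeping files such as `_SUCCESS`. The sketch below lists the data files in such a directory using only the standard library. The helper name `list_part_files` is hypothetical, and the name pattern it filters on is an assumption about Spark's default output naming.

```python
import os


def list_part_files(output_path):
    """Return the sorted names of Parquet part files in a Spark output directory.

    Returns an empty list if the path is not a directory.
    """
    if not os.path.isdir(output_path):
        return []
    return sorted(
        name
        for name in os.listdir(output_path)
        # Skip _SUCCESS markers and hidden .crc checksum files
        if name.startswith("part-") and name.endswith(".parquet")
    )


# In the notebook, after the write above has finished:
# print(list_part_files(f"{data_dir}/output_exercise.parquet"))
```

Passing the directory path to `spark.read.parquet` reads all part files back as a single DataFrame.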

## Summary

Great job completing the exercises! Review your solutions and compare them with the solutions notebook if needed.
