# Module 01 - Introduction & SparkSession - Exercises

## Instructions

This notebook contains exercises based on the concepts learned in Module 01.

- Complete each exercise in the provided code cells
- Run the data setup cells first to create the data the exercises need
- Test your solutions by running the verification cells (if provided)
- Refer back to the main module notebook if you need help


## Data Setup

Run the cells below to set up the data needed for the exercises.


In [3]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, DateType
from pyspark.sql.functions import col, when, lit
import os

# Create SparkSession
spark = SparkSession.builder \
    .appName("Module 01 Exercises") \
    .master("local[*]") \
    .getOrCreate()

# Set data directory
data_dir = "../data"
os.makedirs(data_dir, exist_ok=True)
print("SparkSession created successfully!")
print(f"Data directory: {os.path.abspath(data_dir)}")

# Create sample CSV file for exercises
sample_csv = """Name,Age,Department,Salary
Alice,25,Sales,50000
Bob,30,IT,60000
Charlie,35,Sales,70000
Diana,28,IT,55000
Eve,32,HR,65000"""

with open(f"{data_dir}/employees.csv", "w") as f:
    f.write(sample_csv)
print("Sample CSV file created: employees.csv")

# Create sample JSON file for exercises
sample_json = """{"name": "John", "age": 28, "city": "NYC"}
{"name": "Jane", "age": 32, "city": "LA"}
{"name": "Mike", "age": 25, "city": "Chicago"}
{"name": "Sarah", "age": 30, "city": "Houston"}"""

with open(f"{data_dir}/people.json", "w") as f:
    f.write(sample_json)
print("Sample JSON file created: people.json")

# Create sample Parquet file for exercises
product_data = [
    (1, "Product A", 100, "Electronics"),
    (2, "Product B", 200, "Clothing"),
    (3, "Product C", 150, "Electronics"),
    (4, "Product D", 300, "Home"),
    (5, "Product E", 250, "Clothing")
]

product_schema = StructType([
    StructField("product_id", IntegerType(), True),
    StructField("product_name", StringType(), True),
    StructField("price", IntegerType(), True),
    StructField("category", StringType(), True)
])

df_products_temp = spark.createDataFrame(product_data, product_schema)
df_products_temp.write.mode("overwrite").parquet(f"{data_dir}/products.parquet")
print("Sample Parquet file created: products.parquet")

SparkSession created successfully!
Data directory: /Users/rohityadav/ry_workspace/dev_de_tr/12 Pyspark Structured/data
Sample CSV file created: employees.csv
Sample JSON file created: people.json
Sample Parquet file created: products.parquet

## Exercises

Complete the following exercises based on the concepts from Module 01.


### Exercise 1: Create a SparkSession

Create a SparkSession with app name 'MyFirstSparkApp' and master set to 'local[*]'.

In [4]:
# Your code here
# spark = ...  # uncomment and complete this assignment

### Exercise 2: Create a DataFrame

Create a DataFrame with the following data:
- Names: ['John', 'Jane', 'Mike', 'Sarah']
- Ages: [28, 32, 25, 30]
- Cities: ['NYC', 'LA', 'Chicago', 'Houston']
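
The data above is listed column by column, while `spark.createDataFrame()` accepts a list of rows (for example, tuples). The following is a minimal pure-Python sketch of that reshaping step only; the variable names are illustrative and not part of the exercise.

```python
# Column-oriented data as given in the exercise (variable names are illustrative).
names = ["John", "Jane", "Mike", "Sarah"]
ages = [28, 32, 25, 30]
cities = ["NYC", "LA", "Chicago", "Houston"]

# zip() pairs up the i-th element of each list, producing one tuple per row.
rows = list(zip(names, ages, cities))

for row in rows:
    print(row)
```

Each tuple in `rows` holds one person's name, age, and city, which is the row-oriented shape the exercise needs.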

In [None]:
# Your code here

### Exercise 3: Display DataFrame

Display the first 3 rows of the DataFrame you created in Exercise 2.

In [None]:
# Your code here
# Hint: Use the show() method and pass the number of rows as its first argument (n)

### Exercise 4: Create DataFrame with Explicit Schema

Create a DataFrame with the following data and explicit schema:
- Data: [("Product A", 100.50, 10), ("Product B", 200.75, 5), ("Product C", 150.25, 8)]
- Schema: 
  - Name: StringType
  - Price: DoubleType
  - Quantity: IntegerType


In [None]:
# Your code here


### Exercise 5: Create DataFrame using spark.range

Create a DataFrame using `spark.range()` that contains the numbers 1 through 19 (the start value 1 is included, the end value 20 is excluded). Then display the DataFrame.
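
`spark.range(start, end)` uses the same half-open bounds as Python's built-in `range`: the start is included and the end is not. The sketch below uses only the built-in, so it runs without Spark and lets you predict the values before writing the Spark version.

```python
# Built-in range(start, end) includes start and stops before end,
# the same half-open convention that spark.range(start, end) follows.
values = list(range(1, 20))

# First value, last value, and how many values there are.
print(values[0], values[-1], len(values))  # prints: 1 19 19
```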


In [None]:
# Your code here


### Exercise 6: Create DataFrame from CSV File

Read the CSV file `employees.csv` from the data directory using `spark.read`. Use schema inference (header=True, inferSchema=True). Display the DataFrame and print its schema.


In [None]:
# Your code here
# Hint: Use spark.read.format("csv") with appropriate options


### Exercise 7: Create DataFrame from JSON File

Read the JSON file `people.json` from the data directory using `spark.read`. Display the DataFrame and print its schema.
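
The setup cell writes `people.json` in JSON Lines form: one complete JSON object per line. By default, `spark.read.json` expects this layout. The sketch below uses only the standard `json` module to show that each line parses on its own, while the file as a whole is not a single JSON document.

```python
import json

# Same content the setup cell writes to people.json: one JSON object per line.
sample_json = """{"name": "John", "age": 28, "city": "NYC"}
{"name": "Jane", "age": 32, "city": "LA"}
{"name": "Mike", "age": 25, "city": "Chicago"}
{"name": "Sarah", "age": 30, "city": "Houston"}"""

# Each line is a valid JSON object by itself.
records = [json.loads(line) for line in sample_json.splitlines()]
print(records[0])

# Parsing the whole text at once fails, because it holds several objects.
try:
    json.loads(sample_json)
    parsed_as_single_document = True
except json.JSONDecodeError as exc:
    parsed_as_single_document = False
    print("Not a single JSON document:", exc.msg)
```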


In [None]:
# Your code here
# Hint: Use spark.read.format("json")


### Exercise 8: Create DataFrame using spark.sql

First, create a temporary view from the DataFrame you created in Exercise 2 (the one with John, Jane, Mike, Sarah). Then use `spark.sql()` to create a new DataFrame that selects only the name and age columns (use the column names you chose in Exercise 2) where the age is greater than 26. Display the result.


In [None]:
# Your code here
# Hint: Use createOrReplaceTempView() to register the DataFrame, then use spark.sql()


### Exercise 9: Create DataFrame from CSV with Explicit Schema

Read the `employees.csv` file again, but this time use an explicit schema instead of schema inference. Define a schema with:
- Name: StringType
- Age: IntegerType
- Department: StringType
- Salary: IntegerType

Display the DataFrame and print its schema.
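
A CSV file stores only text, so column types have to come from somewhere: either inference or a schema you declare. The sketch below is a pure-Python analogy, not Spark code. It reads a shortened copy of the sample data with the standard `csv` module, then applies a hypothetical `column_types` mapping that plays the role of an explicit schema.

```python
import csv
import io

# Shortened copy of the employees.csv content from the setup cell.
sample_csv = """Name,Age,Department,Salary
Alice,25,Sales,50000
Bob,30,IT,60000"""

# A plain CSV reader returns every field as a string.
raw_rows = list(csv.DictReader(io.StringIO(sample_csv)))
print(raw_rows[0])

# Hypothetical stand-in for an explicit schema: column name -> Python type.
column_types = {"Name": str, "Age": int, "Department": str, "Salary": int}

# Convert each field using the type declared for its column.
typed_rows = [
    {name: column_types[name](value) for name, value in row.items()}
    for row in raw_rows
]
print(typed_rows[0])
```

In Spark, the `StructType` you pass to `.schema()` plays the role of `column_types`, and declaring it up front means the reader does not need to scan the data to guess types.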


In [None]:
# Your code here
# Hint: Define a StructType schema first, then use .schema() in the read operation


### Exercise 10: Access DataFrame Schema Information

Using the DataFrame from Exercise 6 (CSV file), print the schema and then access the data type of the "Age" column from the schema. Display both the full schema and the specific column's data type.


In [None]:
# Your code here
# Hint: Use printSchema() and then access schema['ColumnName'].dataType


### Exercise 11: Get DataFrame Basic Information

Using the DataFrame from Exercise 2, get and print:
1. The number of rows
2. The list of column names
3. The number of columns


In [None]:
# Your code here
# Hint: Use count(), columns, and len()


### Exercise 12: Create DataFrame using spark.table

First, create a temporary view called "employees_view" from the DataFrame you created in Exercise 6 (CSV DataFrame). Then use `spark.table()` to read from this view and create a new DataFrame. Display the result.


In [None]:
# Your code here
# Hint: Use createOrReplaceTempView() first, then spark.table()


### Exercise 13: Create DataFrame using spark.sql with VALUES

Use `spark.sql()` with the VALUES clause to create a DataFrame with the following data:
- ('Apple', 1.50, 10)
- ('Banana', 0.75, 20)
- ('Orange', 2.00, 15)

Name the columns: Product, Price, Stock. Display the DataFrame.


In [None]:
# Your code here
# Hint: Use spark.sql("SELECT * FROM VALUES (...) AS t(columns)")


### Exercise 14: Create DataFrame from Parquet File

Read the Parquet file `products.parquet` from the data directory using `spark.read`. Display the DataFrame, print its schema, and show the number of rows.


In [None]:
# Your code here
# Hint: Use spark.read.format("parquet")


### Exercise 15: Create DataFrame using .toDF() Method

Create a DataFrame from a list of tuples without specifying column names in `createDataFrame()`. Then use the `.toDF()` method to assign column names: "Student", "Score", "Grade". Display the DataFrame.

Data: [("Alice", 95, "A"), ("Bob", 87, "B"), ("Charlie", 92, "A")]


In [None]:
# Your code here
# Hint: spark.createDataFrame(data).toDF("col1", "col2", "col3")


### Exercise 16: Get DataFrame Column Data Types

Using the DataFrame from Exercise 4 (products with explicit schema), get and print the data types of all columns. Use the `dtypes` property of the DataFrame.


In [None]:
# Your code here
# Hint: Use the .dtypes property


### Exercise 17: Create DataFrame with spark.range and Custom Step

Create a DataFrame using `spark.range()` that contains the even numbers from 0 through 18 (the end value 20 is excluded). Use the step parameter. Display the DataFrame and verify it contains only even numbers.
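
The step argument of `spark.range(start, end, step)` behaves like the third argument of Python's built-in `range`. The sketch below uses the built-in to preview the expected values and shows one possible way to check that all of them are even. It is a plain-Python illustration, not the Spark solution.

```python
# start is included, end is excluded, and step is the gap between values.
start, end, step = 0, 20, 2
evens = list(range(start, end, step))
print(evens)

# A value is even when dividing by 2 leaves no remainder.
all_even = all(n % 2 == 0 for n in evens)
print(all_even)
```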


In [None]:
# Your code here
# Hint: spark.range(start, end, step)


### Exercise 18: Check if DataFrame is Empty

Create an empty DataFrame using `spark.range(0, 0)` and check whether it is empty using the `isEmpty()` method (available on PySpark DataFrames from version 3.3). Then create a non-empty DataFrame and check again. Print both results.


In [None]:
# Your code here
# Hint: Use isEmpty() method on DataFrame


### Exercise 19: Access Schema Fields

Using the DataFrame from Exercise 4, access the schema object and print:
1. The number of fields in the schema
2. The name and data type of the first field
3. The name and data type of the "Price" field


In [None]:
# Your code here
# Hint: Use df.schema.fields and access field properties


### Exercise 20: Create DataFrame - Multiple Methods Comparison

Create the same DataFrame using three different methods:
1. Using `spark.createDataFrame()` with column names
2. Using `spark.createDataFrame()` with `.toDF()`
3. Using `spark.sql()` with VALUES

Data: [("X", 10), ("Y", 20), ("Z", 30)]
Columns: "Letter", "Number"

Display all three DataFrames to verify they're the same.


In [None]:
# Your code here
# Create three DataFrames using different methods and display them


## Summary

Review your solutions and compare them with the solutions notebook if needed.

