# PRACTISE

# PySpark Practice Notebook for Data Engineering & ETL on Databricks

This notebook is designed for hands-on practice with PySpark, focusing on ETL pipelines and data engineering tasks commonly performed in Databricks environments.

## Sections:

1. Environment Setup & Installation
2. SparkSession Initialization
3. Basic DataFrame Operations
4. Data Ingestion (CSV, Parquet, JSON)
5. Transformations & Actions
6. ETL Pipeline Example
7. DataFrame Joins & Aggregations
8. Writing Data (Parquet, Delta, etc.)
9. Useful Tips & Resources


## 1. Environment Setup & Installation

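Databricks runtimes ship with PySpark, so this step is only needed when running the notebook locally. For a local setup, run the following cell to install PySpark; a Java runtime must also be available.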


In [None]:
# Install PySpark into the notebook's own environment if not already installed
%pip install pyspark
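
Before installing anything, it can help to check what is already available. The following is a minimal sketch, not part of the original notebook: `check_environment` is a hypothetical helper name, and it only checks that the `pyspark` package is importable and that a `java` executable is on `PATH`.

```python
import importlib.util
import shutil


def check_environment():
    """Report whether PySpark is importable and a `java` executable is on PATH."""
    return {
        "pyspark_installed": importlib.util.find_spec("pyspark") is not None,
        "java_on_path": shutil.which("java") is not None,
    }


status = check_environment()
print(status)
```

If either value is `False`, install the missing piece before moving on to the SparkSession cell.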

## 2. SparkSession Initialization

Create a SparkSession, which is the entry point to using PySpark.


In [None]:
from pyspark.sql import SparkSession

# Initialize SparkSession
spark = SparkSession.builder.appName('PySparkPractice').getOrCreate()

# Check Spark version
print('Spark Version:', spark.version)

## 3. Basic DataFrame Operations

Create a DataFrame, view schema, and perform simple operations.


In [None]:
# Create a sample DataFrame
data = [(1, 'Alice', 29), (2, 'Bob', 31), (3, 'Cathy', 25)]
columns = ['id', 'name', 'age']
df = spark.createDataFrame(data, columns)

# Show DataFrame
df.show()

# Print schema
df.printSchema()

# Select columns
df.select('name', 'age').show()

## 4. Data Ingestion (CSV, Parquet, JSON)

Read data from different file formats into DataFrames.


In [None]:
# Example: Reading CSV, Parquet, and JSON files
csv_path = 'data/sample.csv'  # Update with your file path
parquet_path = 'data/sample.parquet'
json_path = 'data/sample.json'

# Read CSV (every column is read as a string unless inferSchema or an explicit schema is set)
df_csv = spark.read.option('header', True).csv(csv_path)
df_csv.show()

# Read Parquet
df_parquet = spark.read.parquet(parquet_path)
df_parquet.show()

# Read JSON
df_json = spark.read.json(json_path)
df_json.show()

## 5. Transformations & Actions

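Transformations (such as `filter`, `select`, and `withColumn`) are lazy: they only describe a new DataFrame and build up a query plan. Actions (such as `show`, `count`, and `collect`) trigger execution of that plan.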


In [None]:
# Transformations: filter, select, withColumn (lazy; nothing runs yet)
df_filtered = df.filter(df.age > 25)
df_selected = df.select('name', 'age')
df_newcol = df.withColumn('age_plus_10', df.age + 10)

# Actions: show, count, collect (each one triggers execution)
df_filtered.show()
df_newcol.show()
print('Count:', df.count())
print('Names:', [row['name'] for row in df_selected.collect()])

## 6. ETL Pipeline Example

A simple ETL pipeline: Extract data, Transform it, and Load it to storage.


In [None]:
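# Simple ETL Pipeline Example
# Extract: read the raw CSV (all columns arrive as strings)
input_path = 'data/input.csv'  # Update with your file path
df_etl = spark.read.option('header', True).csv(input_path)

# Transform: drop rows containing any null, then rename a column
# ('old_column' and 'new_column' are placeholders; the rename is a no-op if 'old_column' is absent)
df_etl_clean = df_etl.dropna().withColumnRenamed('old_column', 'new_column')

# Load: write the cleaned data as Parquet, replacing any previous output
output_path = 'data/output.parquet'
df_etl_clean.write.mode('overwrite').parquet(output_path)
print('ETL pipeline completed!')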

## 7. DataFrame Joins & Aggregations

Practice joining DataFrames and performing aggregations.


In [None]:
# DataFrame Joins Example
df1 = spark.createDataFrame([(1, 'A'), (2, 'B')], ['id', 'val1'])
df2 = spark.createDataFrame([(1, 'X'), (2, 'Y')], ['id', 'val2'])

df_joined = df1.join(df2, on='id', how='inner')
df_joined.show()

# Aggregation Example
from pyspark.sql import functions as F
df_agg = df.groupBy('age').agg(F.count('id').alias('count'))
df_agg.show()

## 8. Writing Data (Parquet, Delta, etc.)

Save DataFrames to different formats.


In [None]:
# Write DataFrame to Parquet
df.write.mode('overwrite').parquet('data/output_df.parquet')

# If using Delta Lake (Databricks), you can write as Delta format:
# df.write.format('delta').mode('overwrite').save('data/output_df_delta')

## 9. Useful Tips & Resources

- [PySpark Documentation](https://spark.apache.org/docs/latest/api/python/)
- [Databricks Documentation](https://docs.databricks.com/)
- [Spark SQL, DataFrames, and Datasets Guide](https://spark.apache.org/docs/latest/sql-programming-guide.html)
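- Use `.explain()` to print a DataFrame's query plan and see how Spark will execute it.
- Use `.cache()` when the same DataFrame feeds several actions, and call `.unpersist()` once it is no longer needed.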

Happy Practising! 🚀


In [1]:
import os
print(os.environ.get("JAVA_HOME"))
!java -version

None
openjdk version "17.0.15" 2025-04-15
OpenJDK Runtime Environment Homebrew (build 17.0.15+0)
OpenJDK 64-Bit Server VM Homebrew (build 17.0.15+0, mixed mode, sharing)
