# Apache Spark Introduction

* Create a Spark Session
* Write a DataFrame in Delta Lake format
* Write a DataFrame in Parquet format
* Manipulate data with Spark SQL (HiveQL)
* Stop the Spark Session

## 1. Create a Spark Session

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder \
    .appName('Ingest checkin table into bronze') \
    .master('spark://spark-master:7077') \
    .config("hive.metastore.uris", "thrift://hive-metastore:9083")\
    .config("spark.hadoop.fs.s3a.path.style.access", "true")\
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")\
    .config('spark.hadoop.fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider')\
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")\
    .config('spark.sql.warehouse.dir', 's3a://lakehouse/')\
    .enableHiveSupport()\
    .getOrCreate()
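
The session above selects `SimpleAWSCredentialsProvider`, but the cell does not show where the endpoint and keys come from; they are presumably supplied elsewhere (for example in `spark-defaults.conf`). As a minimal sketch, assuming an S3-compatible object store and assuming the environment variable names `S3_ENDPOINT`, `AWS_ACCESS_KEY_ID`, and `AWS_SECRET_ACCESS_KEY` (none of which appear in this notebook), the remaining `fs.s3a.*` options could be collected like this:

```python
import os


def s3a_options(env=None):
    """Build the extra fs.s3a.* Spark options from environment variables.

    The variable names are assumptions for this sketch, not taken from the notebook.
    """
    env = os.environ if env is None else env
    return {
        "spark.hadoop.fs.s3a.endpoint": env["S3_ENDPOINT"],
        "spark.hadoop.fs.s3a.access.key": env["AWS_ACCESS_KEY_ID"],
        "spark.hadoop.fs.s3a.secret.key": env["AWS_SECRET_ACCESS_KEY"],
    }


def apply_options(builder, options):
    """Apply each key/value pair to a SparkSession builder and return it."""
    for key, value in options.items():
        builder = builder.config(key, value)
    return builder
```

A possible use is `apply_options(SparkSession.builder, s3a_options())` before chaining the other `.config(...)` calls and `.getOrCreate()`, which keeps secrets out of the notebook itself.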

In [15]:
import random

# Generate 100 sample rows
names = ["Alice", "Bob", "Charlie", "David", "Emma", "Frank", "Grace", "Hannah", "Isaac", "Julia"]
data = [{"Name": random.choice(names), "Age": random.randint(20, 40)} for _ in range(100)]

df = spark.createDataFrame(data)
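
`createDataFrame` infers the schema from the dictionaries here, which is why the output that follows lists `Age` before `Name` and types `Age` as `long`. A minimal sketch of a more explicit variant, assuming a fixed seed for reproducibility and a DDL schema string (both are illustrative choices, and the helper names are hypothetical):

```python
import random

NAMES = ["Alice", "Bob", "Charlie", "David", "Emma",
         "Frank", "Grace", "Hannah", "Isaac", "Julia"]

# DDL-style schema string; fixes the column order and types up front.
SCHEMA = "Name string, Age bigint"


def make_rows(n=100, seed=42):
    """Return n (name, age) tuples; the seed makes the sample reproducible."""
    rng = random.Random(seed)
    return [(rng.choice(NAMES), rng.randint(20, 40)) for _ in range(n)]


def make_dataframe(spark, n=100, seed=42):
    """Create the sample DataFrame with an explicit schema instead of inference."""
    return spark.createDataFrame(make_rows(n, seed), schema=SCHEMA)
```

With an active session, `df = make_dataframe(spark)` would stand in for the cell above; the rows differ from the original run because the seed and generator differ.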

In [18]:
df.show(5)

+---+-----+
|Age| Name|
+---+-----+
| 32| Emma|
| 28|Isaac|
| 40|Julia|
| 24|David|
| 26|Alice|
+---+-----+
only showing top 5 rows



In [19]:
df.printSchema()

root
 |-- Age: long (nullable = true)
 |-- Name: string (nullable = true)



In [20]:
print("The number of rows", df.count())
print("The number of columns", len(df.columns))

The number of rows 100
The number of columns 2


## 2. Write a DataFrame in Delta Lake format

In [22]:
spark.sql("CREATE DATABASE IF NOT EXISTS test_db").show()

++
||
++
++



In [25]:
# Uncomment to drop an existing table before re-running this cell:
# spark.sql("DROP TABLE IF EXISTS test_db.employee")
df.write.format("delta").saveAsTable("test_db.employee")
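
`saveAsTable` uses Spark's default save mode, which raises an error if `test_db.employee` already exists, so re-running the cell above fails. A minimal sketch of an idempotent write, assuming that replacing the table's contents on each run is acceptable (the helper name `write_delta_table` is hypothetical):

```python
def write_delta_table(df, table_name, mode="overwrite"):
    """Write df as a Delta table.

    mode="overwrite" replaces existing data, mode="append" adds rows,
    and mode="errorifexists" keeps Spark's default behaviour.
    """
    df.write.format("delta").mode(mode).saveAsTable(table_name)
```

Calling `write_delta_table(df, "test_db.employee")` and then `spark.table("test_db.employee").count()` is one way to confirm the write. If the schema changes between runs, Delta may additionally require the `overwriteSchema` option.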

## 3. Write a DataFrame in Parquet format

In [None]:
df.write.format("parquet").save("s3a://lakehouse/test_write/")

In [35]:
parquet_df = spark.read.parquet("s3a://lakehouse/test_write/")
parquet_df.show(5)

+---+-------+
|Age|   Name|
+---+-------+
| 30|    Bob|
| 22|Charlie|
| 25|  Alice|
+---+-------+



## 4. Manipulate data with Spark SQL (HiveQL)

In [26]:
spark.sql("SHOW DATABASES").show()

+---------+
|namespace|
+---------+
|  default|
|  test_db|
+---------+



In [27]:
spark.sql("USE test_db").show()

++
||
++
++



In [28]:
spark.sql("SHOW TABLES").show()

+---------+---------+-----------+
|namespace|tableName|isTemporary|
+---------+---------+-----------+
|  test_db| employee|      false|
+---------+---------+-----------+



In [29]:
spark.sql("DESCRIBE employee").show()

+---------------+---------+-------+
|       col_name|data_type|comment|
+---------------+---------+-------+
|            Age|   bigint|       |
|           Name|   string|       |
|               |         |       |
| # Partitioning|         |       |
|Not partitioned|         |       |
+---------------+---------+-------+



In [30]:
spark.sql("SELECT Age, COUNT(Age) FROM employee GROUP BY Age ORDER BY Age DESC").show()

+---+----------+
|Age|count(Age)|
+---+----------+
| 40|         9|
| 39|         5|
| 38|         3|
| 37|         4|
| 35|         4|
| 34|         4|
| 33|         5|
| 32|         6|
| 31|         4|
| 30|         8|
| 29|         5|
| 28|         4|
| 27|         3|
| 26|         6|
| 25|         3|
| 24|         3|
| 23|         8|
| 22|         2|
| 21|         6|
| 20|         8|
+---+----------+
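
The same aggregation can be expressed with the DataFrame API instead of SQL. A minimal sketch, assuming the table is readable as `test_db.employee`; the helper names are hypothetical, and the `pyspark` import sits inside the function only so that the pure-Python cross-check can be used without Spark installed:

```python
from collections import Counter


def age_counts(df):
    """Count rows per Age, highest age first (mirrors the SQL query above)."""
    from pyspark.sql import functions as F

    return (
        df.groupBy("Age")
        .agg(F.count("Age").alias("count"))
        .orderBy(F.col("Age").desc())
    )


def age_counts_local(rows):
    """Pure-Python cross-check for small samples of (name, age) tuples."""
    counts = Counter(age for _, age in rows)
    return sorted(counts.items(), reverse=True)
```

With an active session, `age_counts(spark.table("test_db.employee")).show()` should list the same groups as the query above, with the count column named `count` rather than `count(Age)`.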



## 5. Stop the Spark Session

In [15]:
spark.stop()
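
If an earlier cell raises, `spark.stop()` may never run and the application can keep holding resources on the cluster. A minimal sketch of a context manager that always stops the session, assuming the caller passes in a configured builder (the name `spark_session` is hypothetical):

```python
from contextlib import contextmanager


@contextmanager
def spark_session(builder):
    """Yield a session from builder.getOrCreate() and stop it on exit,
    even if the body raises."""
    spark = builder.getOrCreate()
    try:
        yield spark
    finally:
        spark.stop()
```

In a script this could be used as `with spark_session(SparkSession.builder.appName("demo")) as spark: ...`; in an interactive notebook an explicit `spark.stop()` in the last cell, as above, is usually simpler.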