<h1><b>Important Spark Concepts</b></h1>
In Apache Spark, both <b>SparkSession</b> and <b>SparkContext</b> are important components, but they serve different purposes and are used in different contexts.

<h2>SparkContext</h2>
<ul>
<li>SparkContext is the original entry point to Spark and represents the connection to a Spark cluster. It is the core component that coordinates the execution of tasks on the cluster.</li>
<li>It is primarily used for low-level operations such as creating Resilient Distributed Datasets (RDDs), which are the fundamental data structure in Spark.</li>
<li>SparkContext is not aware of structured data like DataFrames, and it does not provide a high-level API for structured data processing.</li>
</ul>
<h2>SparkSession</h2>
<ul>
<li>SparkSession was introduced in Spark 2.0 and is an abstraction built on top of the SparkContext. It's designed for higher-level, more user-friendly structured data processing, including working with DataFrames and Datasets.</li>
<li>SparkSession is a unified entry point for working with structured data in Spark. It provides a convenient API for creating, manipulating, and querying structured data.</li>
<li>It supports structured data sources such as Parquet, Avro, ORC, and JSON, and it lets you query structured data with SQL via the Spark SQL module.</li>
</ul>
In summary, the key difference between SparkContext and SparkSession lies in their purpose and level of abstraction:

<ul>
<li><b>SparkContext</b> is the fundamental entry point for Spark and is primarily used for low-level operations and RDD-based data processing.</li>
<li><b>SparkSession</b> is a higher-level, more user-friendly entry point for structured data processing, including DataFrames, Datasets, and Spark SQL.</li>
</ul>
In practice, use SparkSession when you work with structured data: it is easier to use and gives access to the DataFrame, Dataset, and Spark SQL APIs. You will still encounter SparkContext in older Spark code and in low-level RDD operations, but modern Spark applications typically start from a SparkSession.

In [None]:
# SparkSession and SparkContext

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Introduction to Spark") \
    .getOrCreate()

sc = spark.sparkContext

print(spark)
print(sc)

In [None]:
# Playing around with JSON file and DataFrames

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Read a JSON File") \
    .getOrCreate()

df = spark.read.option("multiline","true").json("/home/peter/Projects/spark/data/customers.json")
df.printSchema()

# Display some columns of the first 10 DataFrame records
df.select('last_name', 'first_name').show(10)

# Display females only
df.select('last_name', 'first_name', 'gender_code').filter(df['gender_code'] == 'F').show(10)

# Group by gender
df.groupBy("gender_code").count().show()

# Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("customers")

sqlDF = spark.sql("SELECT * FROM customers WHERE first_name = 'Peter'")
sqlDF.select('first_name', 'last_name', 'birth_date').show()

In [None]:
import random
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("pi") \
    .getOrCreate()

sc = spark.sparkContext

num_samples = 100000000

def inside(_):
    # Draw a random point in the unit square and test whether it falls inside the quarter circle
    x, y = random.random(), random.random()
    return x*x + y*y < 1

count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
sc.stop()

In [None]:
import random
print(random.random())

In [None]:
# Our first contact with Spark
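# Aggregations: grouping, average, min, max

# SparkSession is imported again so that this cell also runs on its own
from pyspark.sql import SparkSession
# avg is not a Python built-in; it comes from pyspark.sql.functions
from pyspark.sql.functions import avg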

spark = SparkSession \
    .builder \
    .appName("Dataframes and aggregations") \
    .getOrCreate()

# create a dataframe with named columns from a list of tuples (corresponds to a table with 2 columns and 6 rows)
person_df = spark.createDataFrame([("Peter", 20), ("Thomas", 31), ("Michael", 30), ("Anabel", 35), ("Elke", 25), ("Peter", 58)], ["name", "age"])

# group persons with the same name and calculate the average age per name
avg_age_df = person_df.groupBy("name").agg(avg("age"))

# show results
person_df.show()
avg_age_df.show()

In [None]:
# Processing data in CSV files 

spark = SparkSession \
    .builder \
    .appName("CSV") \
    .getOrCreate()

df = spark.read.format("csv") \
    .option("header", "true") \
    .option("sep", "|") \
    .load("/home/peter/Projects/spark/data/customers.csv")

# number of records in a dataframe
# print(df.count())

# show first 10 records of a dataframe
# df.limit(10).show()

# show selected columns only
# df.select('last_name', 'first_name', 'gender_code').limit(10).show()

# filter by gender code
df.select('last_name', 'first_name', 'gender_code').filter("gender_code = 'F'").limit(10).show()
# df.filter("gender_code = 'F'").show()

# write file with female entries only
# df_female_customers = df.filter("gender_code = 'F'")
# df_female_customers.limit(10).show()
# df_female_customers.write.format("csv").mode("overwrite").option("sep", "|").save("dbfs:/user/n68563/bdfb/female_customers")
# df_female_customers.coalesce(1).write.format("csv").mode("overwrite").option("sep", "|").save("dbfs:/user/n68563/bdfb/test.csv")
# df2 = spark.read.format("csv").option("inferSchema", "false").option("header", "false").option("sep", "|").load("/user/n68563/bdfb/female_customers")
# df2.limit(10).show()


In [12]:
# Apache Iceberg

import pyspark

spark = pyspark.sql.SparkSession.builder.appName("Iceberg") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog") \
    .config("spark.sql.catalog.spark_catalog.type", "hive") \
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.local.type", "hadoop") \
    .config("spark.sql.catalog.local.warehouse", "/home/peter/Projects/spark/warehouse") \
    .getOrCreate()

# "csv" is the built-in CSV source; "com.databricks.spark.csv" is the legacy name of the old external package
df_customers = spark.read.format("csv") \
    .option("sep", "|") \
    .option("header", "true") \
    .load("/home/peter/Projects/spark/data/customers.csv")

df_customers.select('last_name', 'first_name', 'gender_code', 'data_date_part').filter("gender_code = 'F'").createOrReplaceTempView("temp_view_customers")
spark.sql("CREATE OR REPLACE TABLE local.db.customers USING iceberg AS SELECT * FROM temp_view_customers")
df = spark.sql("SELECT last_name, first_name, data_date_part FROM local.db.customers")
df.show()
# UPDATE returns no rows of interest, so its result is not assigned
spark.sql("UPDATE local.db.customers SET data_date_part = '2023-10-20' WHERE data_date_part = '2021-11-08'")
df = spark.sql("SELECT last_name, first_name, data_date_part FROM local.db.customers")
df.show()


23/10/13 11:40:41 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


+-----------+----------+--------------+
|  last_name|first_name|data_date_part|
+-----------+----------+--------------+
|       Rose|    Rosita|    2021-11-08|
|       Junk|    Hatice|    2021-11-08|
|     Biggen|  Virginia|    2021-11-08|
|      Gnatz|      Ella|    2021-11-08|
|Wagenknecht|  Reinhild|    2021-11-08|
|    Hauffer|  Birgitta|    2021-11-08|
|    Gerlach|    Sabina|    2021-11-08|
|  Striebitz|   Hiltrud|    2021-11-08|
|   Fröhlich|   Evelyne|    2021-11-08|
|     Hiller|   Damaris|    2021-11-08|
|     Johann|  Nadeshda|    2021-11-08|
|     Ladeck|  Ljiljana|    2021-11-08|
|      Klapp|  Fabienne|    2021-11-08|
|   Schuster|      Änne|    2021-11-08|
|      Kambs|     Henni|    2021-11-08|
|       Hahn|  Nadeshda|    2021-11-08|
| Vollbrecht|    Zdenka|    2021-11-08|
|    Ziegert|   Martina|    2021-11-08|
|    Radisch|      Leni|    2021-11-08|
|     Weimer|   Henrike|    2021-11-08|
+-----------+----------+--------------+
only showing top 20 rows

+-----------+-

In [None]:
# Use a function to read from a MariaDB and from PostgreSQL database table

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Databases") \
    .config("spark.jars", "/opt/spark-3.5.0/jars/postgresql-42.6.0.jar,/opt/spark-3.5.0/jars/mariadb-java-client-3.2.0.jar") \
    .getOrCreate()

def show_customers(spark: SparkSession, database: str) -> None:
    if (database.lower() == "mariadb"):
        df_customers = spark.read \
            .format("jdbc") \
            .option("url", "jdbc:mysql://localhost:3306/shop?permitMysqlScheme") \
            .option("driver", "org.mariadb.jdbc.Driver") \
            .option("dbtable", "customers") \
            .option("user", "spark") \
            .option("password", "spark") \
            .load()
        df_customers.select('last_name', 'first_name', 'birth_date').show(10)
    elif (database.lower() == "postgresql"):
        df_customers = spark.read \
            .format("jdbc") \
            .option("url", "jdbc:postgresql://localhost:5432/galeria_anatomica") \
            .option("driver", "org.postgresql.Driver") \
            .option("dbtable", "shop.customers") \
            .option("user", "spark") \
            .option("password", "spark") \
            .load()
        df_customers.select('last_name', 'first_name', 'birth_date').show(10)
    else:
        print("Unknown database type")

show_customers(spark, "postgresql")

In [None]:
# Write and read Parquet files

spark = SparkSession \
    .builder \
    .appName("Write and read a Parquet file") \
    .getOrCreate()

customers_df = spark.read.format("csv").option("header", "true").option("sep", "|").load("/home/peter/Projects/spark/data/customers.csv")
customers_df.write.mode("overwrite").parquet("/home/peter/Projects/spark/data/customers.parquet")
parquet_df = spark.read.parquet("/home/peter/Projects/spark/data/customers.parquet")
parquet_df.createOrReplaceTempView("parquet_file")
males_df = spark.sql("SELECT last_name, first_name, gender_code FROM parquet_file WHERE gender_code = 'M'")
males_df.show(10)