## Introduction to Apache Spark

#### Spark Session

- Databricks creates a Spark session automatically in notebooks, but we can also create one ourselves by importing the `SparkSession` class
- The session created by Databricks is stored in the variable `spark`, through which we can access its objects
- `spark.sparkContext.applicationId` returns the ID of the application; the cell below reads the same ID from the `spark.app.id` configuration key


In [0]:
from pyspark.sql import SparkSession

In [0]:
spark = SparkSession.builder.getOrCreate()
application_id = spark.conf.get('spark.app.id')
print(f"Application ID: {application_id}")

#### Spark Reader

In [0]:
df = (spark.read.format("csv")
                .options(header = True)
                .load('/Volumes/db_academy_retail/v01/source_files/customers.csv'))
df.limit(10).display()

In [0]:
df.printSchema()

##### Reading with Explicit Schema
- Using `StructType()` and `StructField()` we can build our own explicit schema and pass it to the reader, instead of getting every CSV column back as a string

In [0]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType
customer_schema = StructType([
    StructField("customer_ID", IntegerType(), False),
    StructField("tax_id", StringType(), True),
    StructField("tax_code", StringType(), True),
    StructField("customer_name", StringType(), True),
    StructField("state", StringType(), True),
    StructField("city", StringType(), True),
    StructField("postcode", StringType(), True),
    StructField("street", StringType(), True),
    StructField("number", StringType(), True),
    StructField("unit", StringType(), True),
    StructField("region", StringType(), True),
    StructField("district", StringType(), True),
    StructField("lon", DoubleType(), True),
    StructField("lat", DoubleType(), True),
    StructField("ship_to_address", StringType(), True),
    StructField("valid_from", IntegerType(), True),
    StructField("valid_to", IntegerType(), True),
    StructField("units_purchased", IntegerType(), True),
    StructField("loyalty_segment", IntegerType(), True)])

In [0]:
df_explicit_schema = (spark.read.format("csv")
                                .options(header = True)
                                .schema(customer_schema)  # the schema is set with .schema(), not passed as a reader option
                                .load('/Volumes/db_academy_retail/v01/source_files/customers.csv'))
df_explicit_schema.limit(10).display()

#### Spark Writer

- Creating a schema and a volume (folder) for the files

In [0]:
%sql
CREATE SCHEMA IF NOT EXISTS db_academy.Spark_Developer;
CREATE VOLUME IF NOT EXISTS db_academy.Spark_Developer.files;

- Modifying the DataFrame before writing

In [0]:
df = df.select(
    "customer_id",
    "customer_name",
    "state",
    "city", 
    "region"
)

df.limit(3).display()

##### Writing to a file

There are 4 write modes:
- append (append new records to the existing data)
- overwrite (replace the existing data)
- error / errorifexists (raise an error if the target already exists); this is the default
- ignore (do nothing if the target already exists; silent skip)

In [0]:
df.write.format("parquet") \
        .mode("overwrite") \
        .save('/Volumes/db_academy/spark_developer/files/parquet_1')

Show the files inside the directory:

In [0]:
display(dbutils.fs.ls('/Volumes/db_academy/spark_developer/files'))

##### Writing to a Table

Classic / legacy writer (`DataFrameWriter`, accessed through `df.write`):

In [0]:
df.write.format("delta") \
        .mode('overwrite') \
        .saveAsTable("db_academy.spark_developer.customer_table")

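Modern version 2 writer (`DataFrameWriterV2`, accessed through `df.writeTo`) with methods like **overwrite, append, createOrReplace, partitionedBy**. Note that `create()` fails if the table already exists, while `createOrReplace()` is safe to rerun.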

In [0]:
df.writeTo("db_academy.spark_developer.customer_table_v2").partitionedBy("region").create()

In [0]:
spark.read.table("db_academy.spark_developer.customer_table").limit(3).display()
spark.read.table("db_academy.spark_developer.customer_table_v2").limit(3).display()

#### Spark Dataframe Transformations

- I will be working with the Flights dataset (`databricks_airline_performance_data.v01.flights_small`)

In [0]:
import pyspark.sql.functions as sf

In [0]:
%sql
USE CATALOG db_academy;
USE SCHEMA spark_developer;

SELECT 
  current_catalog(),
  current_schema()

In [0]:
df = spark.read.table('databricks_airline_performance_data.v01.flights_small')
df.limit(5).display()

In [0]:
df.printSchema()

In [0]:
flights_df = df.select(
    "Year",
    "Month",
    "DayofMonth",
    "DepTime",
    "FlightNum",
    "ActualElapsedTime",
    "CRSElapsedTime",
    "ArrDelay")

In [0]:
print(flights_df.count())  # count() returns a plain int, so print it instead of using display()

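- `TRY_CAST()` is a Spark SQL function that returns NULL when a cast is not possible. Older PySpark versions have no DataFrame API equivalent (`Column.try_cast` was only added in PySpark 4.0), so here it is used through SQL expressions
- `selectExpr()` lets us write SQL-like SELECT expressions
- `sf.expr()` lets us use a single SQL expression as a column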

In [0]:
df_to_view = flights_df.selectExpr(
    "Year",
    "Month",
    "DayofMonth",
    "TRY_CAST(DepTime AS INT) AS DepTime",
    "FlightNum",
    "TRY_CAST(ActualElapsedTime AS INT) AS ActualElapsedTime",
    "CRSElapsedTime",
    "TRY_CAST(ArrDelay AS INT) AS ArrDelay"
)

# createOrReplaceTempView() returns None, so call it separately to keep the DataFrame reference
df_to_view.createOrReplaceTempView("flights_temporary")

In [0]:
%sql
SELECT
  count_if(Year IS NULL) AS Year_nulls,
  count_if(Month IS NULL) AS Month_nulls,
  count_if(DayofMonth IS NULL) AS DayofMonth_nulls,
  count_if(DepTime IS NULL) AS DepTime_nulls,
  count_if(FlightNum IS NULL) AS FlightNum_nulls,
  count_if(CRSElapsedTime IS NULL) AS CRSElapsedTime_nulls,
  count_if(ArrDelay IS NULL) AS ArrDelay_nulls,
  -- number of rows where DepTime, CRSElapsedTime and ArrDelay are all non-NULL
  sum(CASE WHEN DepTime IS NULL OR CRSElapsedTime IS NULL OR ArrDelay IS NULL THEN 0 ELSE 1 END) AS merged
FROM flights_temporary

In [0]:
%load_ext autoreload
%autoreload 2
# Enables autoreload; learn more at https://docs.databricks.com/en/files/workspace-modules.html#autoreload-for-python-modules
# To disable autoreload; run %autoreload 0

In [0]:
import sys
sys.path.append('/Workspace/Repos/tomydata@tsarmirgmail.onmicrosoft.com/Databricks/DB_academy/')
import my_functions as funkcia

In [0]:
df = funkcia.read_csv_by_me('/Volumes/db_academy/data_landing/data/sales/000.csv')

In [0]:
df.display()