## Demo 1: QuickStart with PySpark in Jupyter Notebooks
This exercise helps students configure their environment to run PySpark. Configuration differs slightly between computers running Microsoft Windows and those running macOS. Because Spark and Hadoop were both originally developed for UNIX-like operating systems (e.g., Linux distributions such as Ubuntu), configuration is slightly simpler on the UNIX-based macOS. Windows requires a few extra steps so that Spark can interact with the underlying Hadoop file system (HDFS) in order to save data and schema definitions.

#### Overall, the steps are as follows:
- Confirm that <a href="https://www.anaconda.com/download/success"><b>Anaconda Python with Jupyter Notebooks</b></a> is already installed on your computer.
    - Create a new Conda Environment that uses Python version 3.12.7 and the Anaconda libraries. `conda create -n pysparkenv python==3.12.7 anaconda`
    - Activate the new Conda Environment. `conda activate pysparkenv`
    - Install the IPython kernel package (`ipykernel`) in the new environment so that Jupyter can use it. `python -m pip install ipykernel`
    - Make the Jupyter kernel available. `python -m ipykernel install --user --name pysparkenv --display-name "Python 3 (pysparkenv)"`
- Download and install the <a href="https://www.oracle.com/java/technologies/downloads/"><b>Java 21 runtime</b></a>.
    - (Windows ONLY) Ensure you change the default installation path from `"C:\Program Files\Java\jdk-21"` to `"C:\Java\jdk-21"`.
- **Windows ONLY:**
    - Download <a href="https://www.apache.org/dyn/closer.lua/spark/spark-3.5.4/spark-3.5.4-bin-hadoop3.tgz"><b>Apache Spark release 3.5.4 (Dec 20 2024)</b></a>, package type <b>Pre-built for Apache Hadoop 3.3 and later</b>. Extract it to `C:\spark-3.5.4-bin-hadoop3`.
    - Download <a href="https://github.com/cdarlint/winutils"><b>Winutils Hadoop-3.3.6</b></a> and copy it to `C:\hadoop-3.3.6`. 
    - Configure Local User Environment Variables:
      - <b>JAVA_HOME</b> that points to the `C:\Java\jdk-21` directory.
      - <b>SPARK_HOME</b> that points to the `C:\spark-3.5.4-bin-hadoop3` directory.
      - <b>HADOOP_HOME</b> that points to the `C:\hadoop-3.3.6` directory.
    - Append the `%JAVA_HOME%\bin`, `%SPARK_HOME%\bin`, and `%HADOOP_HOME%\bin` paths to the <b>Path</b> variable without overwriting any of the existing entries.
- **Mac ONLY:** Use pip, the Python package installer, to install <b>Spark</b> and <b>PySpark</b> in your Anaconda Python environment.
    - `python -m pip install spark`
    - `python -m pip install pyspark`
- Use pip to install <b>findspark</b> in your Anaconda Python environment: `python -m pip install findspark`
- Use pip to install <b>Delta Lake</b> support in your Anaconda Python environment: `python -m pip install delta-spark==3.3.0`

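Before launching Jupyter on Windows, it can help to confirm that the three environment variables from the steps above are visible to Python. The following is a minimal sketch and not part of the original exercise: the helper name `missing_spark_env` is hypothetical, and it only checks that each variable is set and points to an existing directory.

```python
import os

# Variables that the Windows setup steps above ask you to define.
REQUIRED_VARS = ("JAVA_HOME", "SPARK_HOME", "HADOOP_HOME")


def missing_spark_env(env=None):
    """Return the required variables that are unset or point to a missing directory.

    `env` defaults to os.environ; passing a plain dict makes the check easy to test.
    """
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_VARS:
        value = env.get(name)
        if not value or not os.path.isdir(value):
            problems.append(name)
    return problems


missing = missing_spark_env()
if missing:
    print("Check these variables:", ", ".join(missing))
else:
    print("JAVA_HOME, SPARK_HOME and HADOOP_HOME all point to existing directories.")
```

On macOS, where Spark comes from pip, these variables may legitimately be unset, so a non-empty result there is not necessarily a problem.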
#### Import Required Libraries

In [1]:
import findspark
findspark.init()
findspark.find()

'C:\\spark-3.5.4-bin-hadoop3'

In [2]:
import os
import sys
import shutil

from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

#### Instantiate Global Variables

In [3]:
# --------------------------------------------------------------------------------
# Specify Directory Structure for Source Data
# --------------------------------------------------------------------------------
base_dir = os.path.join(os.getcwd(), 'lab_data')
data_dir = os.path.join(base_dir, 'retail-org')
customers_dir = os.path.join(data_dir, "customers")

# --------------------------------------------------------------------------------
# Create Directory Structure for Data Lakehouse Files
# --------------------------------------------------------------------------------
dest_database = "quickstart"
sql_warehouse_dir = os.path.abspath('spark-warehouse')
dest_database_dir = f"{dest_database}.db"
database_dir = os.path.join(sql_warehouse_dir, dest_database_dir)

#### Define Utilities

In [4]:
def remove_directory_tree(path: str):
    '''If it exists, remove the entire contents of a directory structure at a given 'path' parameter's location.'''
    try:
        if os.path.exists(path):
            shutil.rmtree(path)
            return f"Directory '{path}' has been removed successfully."
        else:
            return f"Directory '{path}' does not exist."
            
    except Exception as e:
        return f"An error occurred: {e}"

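The helper reports its outcome as a string instead of raising an exception, so the return value is the only signal of what happened. The sketch below exercises it against a throwaway directory; the helper is repeated so the snippet runs on its own, and the directory names are hypothetical.

```python
import os
import shutil
import tempfile


def remove_directory_tree(path: str):
    """Same helper as above, repeated so this snippet is self-contained."""
    try:
        if os.path.exists(path):
            shutil.rmtree(path)
            return f"Directory '{path}' has been removed successfully."
        return f"Directory '{path}' does not exist."
    except Exception as e:
        return f"An error occurred: {e}"


# Build a small scratch tree that mimics a warehouse layout.
scratch = tempfile.mkdtemp()
os.makedirs(os.path.join(scratch, "quickstart.db", "customers"))

first = remove_directory_tree(scratch)   # the tree exists, so it is deleted
second = remove_directory_tree(scratch)  # nothing is left to delete

print(first)
print(second)
```

Calling the helper twice shows both non-error branches: the first call removes the tree and the second reports that the directory does not exist.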
#### Create a New Spark Session

In [5]:
# Use half of the logical cores, but never fewer than one worker thread.
worker_threads = f"local[{max(1, os.cpu_count() // 2)}]"
shuffle_partitions = int(os.cpu_count())

sparkConf = SparkConf().setAppName('PySpark Quickstart in Jupyter')\
    .setMaster(worker_threads)\
    .set('spark.executor.memory', '2g')\
    .set('spark.driver.memory', '4g') \
    .set('spark.sql.shuffle.partitions', shuffle_partitions) \
    .set('spark.sql.warehouse.dir', sql_warehouse_dir)

spark = SparkSession.builder.config(conf=sparkConf).getOrCreate()
spark

#### Prepare Filesystem

In [6]:
remove_directory_tree(database_dir)

"Directory 'C:\\Users\\jtupi\\Documents\\UVA\\DS-2002-Teacher\\04-PySpark\\spark-warehouse\\quickstart.db' has been removed successfully."

#### Read Customer data from a CSV File

In [7]:
customers_csv = os.path.join(customers_dir, "customers.csv")
print(customers_csv)

C:\Users\jtupi\Documents\UVA\DS-2002-Teacher\04-PySpark\lab_data\retail-org\customers\customers.csv

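A misplaced `lab_data` folder only surfaces when Spark attempts the read, and the resulting error can be lengthy. One option is to check the path in plain Python first. This is a minimal sketch; the helper name `require_file` is hypothetical and not part of the original notebook.

```python
import os


def require_file(path: str) -> str:
    """Return the absolute path if it is an existing file; otherwise raise FileNotFoundError."""
    full_path = os.path.abspath(path)
    if not os.path.isfile(full_path):
        raise FileNotFoundError(f"Expected a source file at: {full_path}")
    return full_path


# Demonstration with a file name that is assumed not to exist.
try:
    require_file(os.path.join("lab_data", "retail-org", "customers", "no_such_file.csv"))
except FileNotFoundError as err:
    print(err)
```

In the notebook this could be used as `customers_csv = require_file(customers_csv)` before the read in the next cell.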

In [8]:
df_customers = spark.read.format('csv').options(header='true', inferSchema=True).load(customers_csv)

# Unit Test -------------
print(f"The 'df_customers' object is of type: {type(df_customers)}.")
df_customers.printSchema()

print(f"The 'df_customers' DataFrame contains {df_customers.count()} rows.")
# Limit in Spark first so that only five rows are collected to the driver.
df_customers.limit(5).toPandas()

The 'df_customers' object is of type: <class 'pyspark.sql.dataframe.DataFrame'>.
root
 |-- customer_id: integer (nullable = true)
 |-- tax_id: double (nullable = true)
 |-- tax_code: string (nullable = true)
 |-- customer_name: string (nullable = true)
 |-- state: string (nullable = true)
 |-- city: string (nullable = true)
 |-- postcode: string (nullable = true)
 |-- street: string (nullable = true)
 |-- number: string (nullable = true)
 |-- unit: string (nullable = true)
 |-- region: string (nullable = true)
 |-- district: string (nullable = true)
 |-- lon: double (nullable = true)
 |-- lat: double (nullable = true)
 |-- ship_to_address: string (nullable = true)
 |-- valid_from: integer (nullable = true)
 |-- valid_to: double (nullable = true)
 |-- units_purchased: double (nullable = true)
 |-- loyalty_segment: integer (nullable = true)

The 'df_customers' DataFrame contains 28813 rows.


| | customer_id | tax_id | tax_code | customer_name | state | city | postcode | street | number | unit | region | district | lon | lat | ship_to_address | valid_from | valid_to | units_purchased | loyalty_segment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11123757 | | | SMITH, SHIRLEY | IN | BREMEN | 46506.0 | N CENTER ST | 521.0 | | Indiana | 50.0 | -86.146582 | 41.450763 | IN, 46506.0, N CENTER ST, 521.0 | 1532824233 | 1548137000.0 | 34.0 | 3 |
| 1 | 30585978 | | | STEPHENS, GERALDINE M | OR | ADDRESS | 0.0 | NO SITUS | | | | | -122.105516 | 45.374317 | OR, 0, NO SITUS, nan | 1523100473 | | 18.0 | 3 |
| 2 | 349822 | | | GUZMAN, CARMEN | VA | VIENNA | 22181.0 | HILL RD | 2860 | | VA | | -77.294126 | 38.883033 | VA, 22181, HILL RD, 2860 | 1522922493 | | 5.0 | 0 |
| 3 | 27652636 | | | HASSETT, PATRICK J | WI | VILLAGE OF NASHOTAH | 53058.0 | IVY LANE | W333N 5591 | | | | -88.409517 | 43.121379 | WI, 53058.0, IVY LANE, W333N 5591 | 1531834357 | 1558052000.0 | 7.0 | 1 |
| 4 | 14437343 | | | HENTZ, DIANA L | OH | COLUMBUS | 43228.0 | ALLIANCE WAY | 5706 | | OH | FRA | -83.158438 | 39.978218 | OH, 43228.0, ALLIANCE WAY, 5706 | 1517227530 | | 0.0 | 0 |

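Schema inference requires Spark to make an extra pass over the file. One alternative is to supply the schema explicitly. The sketch below builds a DDL-formatted schema string that mirrors the inferred types printed above, so the resulting DataFrame should have the same structure. The helper name `read_customers` is hypothetical, and whether these types are the best choice for each column is left open.

```python
# (column name, Spark SQL type) pairs copied from the printSchema() output above.
CUSTOMER_COLUMNS = [
    ("customer_id", "INT"),
    ("tax_id", "DOUBLE"),
    ("tax_code", "STRING"),
    ("customer_name", "STRING"),
    ("state", "STRING"),
    ("city", "STRING"),
    ("postcode", "STRING"),
    ("street", "STRING"),
    ("number", "STRING"),
    ("unit", "STRING"),
    ("region", "STRING"),
    ("district", "STRING"),
    ("lon", "DOUBLE"),
    ("lat", "DOUBLE"),
    ("ship_to_address", "STRING"),
    ("valid_from", "INT"),
    ("valid_to", "DOUBLE"),
    ("units_purchased", "DOUBLE"),
    ("loyalty_segment", "INT"),
]

# Backticks guard column names such as `number` against being read as keywords.
customers_ddl = ", ".join(f"`{name}` {dtype}" for name, dtype in CUSTOMER_COLUMNS)


def read_customers(spark, path):
    """Read the customers CSV with an explicit schema instead of inferSchema."""
    return (
        spark.read.format("csv")
        .option("header", True)
        .schema(customers_ddl)
        .load(path)
    )
```

With an active session, `df_customers = read_customers(spark, customers_csv)` would take the place of the `inferSchema=True` read.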

#### Persist the 'df_customers' DataFrame as a New Table in the Data Lakehouse
##### Create a New Data Lakehouse Database

In [9]:
# CASCADE also drops any tables left in the database, so this cell can be re-run safely.
spark.sql(f"DROP DATABASE IF EXISTS {dest_database} CASCADE;")
spark.sql(f"CREATE DATABASE {dest_database};")

DataFrame[]

##### Create the 'customers' Table

In [10]:
df_customers.write.saveAsTable(f"{dest_database}.customers", mode="overwrite")

In [11]:
# Unit Test ------------------------------------
sql_customers = f"""
    SELECT customer_id
        , customer_name
        , CONCAT(number, " ", street) AS address
        , city
        , state
        , postcode
        , FLOOR(units_purchased) AS units_purchased
    FROM {dest_database}.customers
    ORDER BY state ASC;
"""

spark.sql(sql_customers).toPandas().head(5)

| | customer_id | customer_name | address | city | state | postcode | units_purchased |
|---|---|---|---|---|---|---|---|
| 0 | 3010389 | CLAYTON, LATOYA | 7840 CREEKSIDE CENTER DR | | AK | 99504.0 | 16 |
| 1 | 2995184 | PROSPER, DEVONNE E | 12431 ALPINE DR | | AK | 99516.0 | 1 |
| 2 | 2968489 | TAMPIER, CHRISTOPHER M | 3916 STARBURST CIR | | AK | 99517.0 | 0 |
| 3 | 2974347 | GARNER, MARLO B | 8100 PETERSBURG ST | | AK | 99507.0 | 3 |
| 4 | 2845292 | NITSCHKE, ALEXANDRA D | 68815 JUDY CT | HAPPY VALLEY | AK | 99639.0 | 9 |

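In the unit test above, `toPandas()` collects the entire query result to the driver before `head(5)` trims it. For a quick preview, the row limit can instead be pushed into the SQL statement. This is a minimal sketch under the assumption that the statement has no `LIMIT` clause of its own; the helper name `with_limit` is hypothetical.

```python
def with_limit(sql: str, rows: int = 5) -> str:
    """Append a LIMIT clause to a SELECT statement that does not already have one."""
    if rows < 1:
        raise ValueError("rows must be a positive integer")
    # Drop surrounding whitespace and the trailing semicolon before appending.
    statement = sql.strip().rstrip(";").rstrip()
    return f"{statement}\nLIMIT {int(rows)}"


example = with_limit(
    "SELECT customer_id FROM quickstart.customers ORDER BY state ASC;", rows=5
)
print(example)
```

In the notebook this would be called as `spark.sql(with_limit(sql_customers)).toPandas()`. Because many rows share the same `state`, which five rows come back is not guaranteed with either approach.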

#### Who are my best customers? (i.e., Which customers purchased the most product?)

In [12]:
sql_best_customers = f"""
    SELECT customer_name AS Customer
        , FLOOR(SUM(units_purchased)) AS Total_Units_Purchased
    FROM {dest_database}.customers
    GROUP BY customer_name
    HAVING total_units_purchased >= 750
    ORDER BY total_units_purchased DESC;
"""

spark.sql(sql_best_customers).toPandas()

| | Customer | Total_Units_Purchased |
|---|---|---|
| 0 | statusdigital | 876 |
| 1 | genesis electronics recycling | 866 |
| 2 | digital attic | 842 |
| 3 | digital lifestyle solutions | 828 |
| 4 | helios electronics limited | 814 |
| 5 | popster digital | 812 |
| 6 | mct digital | 784 |
| 7 | bradsworth digital solutions, inc | 782 |
| 8 | epi-electrochemical products inc | 780 |
| 9 | modern digital imaging | 769 |

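The query above groups rows by `customer_name`, sums `units_purchased` within each group, keeps only the groups whose floored total meets the threshold (`HAVING`), and sorts what remains in descending order. The same logic is sketched below in plain Python on a few made-up rows. The names and numbers are hypothetical and are not taken from the dataset.

```python
import math
from collections import defaultdict

# Hypothetical (customer_name, units_purchased) rows, for illustration only.
rows = [
    ("acme digital", 400.0),
    ("acme digital", 380.5),
    ("beta supplies", 120.0),
    ("gamma imaging", 760.0),
]


def best_customers(rows, threshold):
    """GROUP BY name, SUM units, FLOOR the total, keep totals >= threshold, sort descending."""
    totals = defaultdict(float)
    for name, units in rows:
        totals[name] += units  # GROUP BY + SUM
    floored = {name: math.floor(total) for name, total in totals.items()}  # FLOOR
    kept = [(name, total) for name, total in floored.items() if total >= threshold]  # HAVING
    return sorted(kept, key=lambda pair: pair[1], reverse=True)  # ORDER BY ... DESC


print(best_customers(rows, threshold=750))
# → [('acme digital', 780), ('gamma imaging', 760)]
```

The filter runs on the aggregated totals rather than on individual rows, which is the difference between `HAVING` and `WHERE`.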

In [13]:
spark.stop()