## **Programming Spark using Jupyter Notebook**
---
__Santander Consumer Bank Germany__  
__CTO & IT Architecture__  

__Version:__ 1.0  
__Date:__ 2024-04-03  
__Github:__

Jupyter Notebook and Microsoft Visual Studio Code are used as development environments. Many of the examples are taken from the [official Spark documentation](https://spark.apache.org/docs/latest/index.html); some have been slightly modified.

### **How to create a development environment using Docker**
---
#### **Install Docker**
Install Docker following the instructions for your operating system.

#### **Download the jupyter/pyspark-notebook image**

Once Docker is installed, download the jupyter/pyspark-notebook image:
```
docker pull jupyter/pyspark-notebook
```
#### **Create a bash file**

Create a bash file (e.g. run.sh) with the following content:

```
#!/bin/bash

CONTAINER=$(docker run -d --rm --name my-pyspark -p 8888:8888 -v /home/peter/projects:/home/jovyan/work jupyter/pyspark-notebook)
docker cp /home/peter/projects/postgres/lib/postgresql-42.7.0.jar $CONTAINER:/usr/local/spark/jars
docker cp /home/peter/projects/iceberg/lib/iceberg-spark-runtime-3.5_2.12-1.4.0.jar $CONTAINER:/usr/local/spark/jars
export CONTAINER
sleep 5
docker exec $CONTAINER jupyter server list
```

For Windows, create a corresponding PowerShell file and adapt the syntax shown above.

The docker run command creates a container named "my-pyspark" from the downloaded image. The container runs in the background (-d) and is removed automatically when it stops (--rm). The command publishes the Jupyter port 8888 under the same port number on the host, and it mounts a folder from the file system of your operating system (here: /home/peter/projects) at the pre-configured working directory inside the container (/home/jovyan/work). Any file stored there appears inside the container as if it were local. The two docker cp commands show how to copy libraries such as database drivers into Spark's library folder inside the container (e.g. to read from a Postgres database within your Spark program). Finally, the script waits five seconds and lists the running Jupyter servers, which prints the access token.
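As an alternative to maintaining separate bash and PowerShell files, the same steps can be sketched in Python. This is only a minimal sketch under assumptions: the function names `build_run_command`, `build_copy_command`, and `start_container` are hypothetical helpers introduced for illustration, and the image name, port, and paths are taken from the run.sh example above.

```python
import subprocess
from typing import List

IMAGE = "jupyter/pyspark-notebook"
SPARK_JARS_DIR = "/usr/local/spark/jars"  # Spark's library folder inside the container


def build_run_command(name: str, host_dir: str, port: int = 8888) -> List[str]:
    """Build the same `docker run` call as in run.sh, as an argument list."""
    return [
        "docker", "run", "-d", "--rm",
        "--name", name,
        "-p", f"{port}:{port}",
        "-v", f"{host_dir}:/home/jovyan/work",
        IMAGE,
    ]


def build_copy_command(jar_path: str, container: str) -> List[str]:
    """Build a `docker cp` call that copies one jar into Spark's jars folder."""
    return ["docker", "cp", jar_path, f"{container}:{SPARK_JARS_DIR}"]


def start_container(name: str, host_dir: str, jars: List[str]) -> str:
    """Start the container, copy the jars, and return the container ID.

    Hypothetical helper: it requires a working Docker installation.
    """
    result = subprocess.run(
        build_run_command(name, host_dir),
        check=True, capture_output=True, text=True,
    )
    container = result.stdout.strip()  # `docker run -d` prints the container ID
    for jar in jars:
        subprocess.run(build_copy_command(jar, container), check=True)
    return container
```

Passing the arguments as a list avoids shell-specific quoting, which is the main difference between the bash and PowerShell variants.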

#### **Open the Jupyter Notebook in your browser**

Open your preferred browser and enter the following URL:

```
localhost:8888/tree?token=0f9541f307a73fcd220474bfd24d2476ea145d58d165ad1b
```
The token (here: 0f9541f307a73fcd220474bfd24d2476ea145d58d165ad1b) is different each time you start Jupyter Notebook. The token for the current Jupyter session is printed on the screen when the run script finishes. Look for "token=".

```
Currently running servers:
http://cc03a1a1513f:8888/?token=2ea951ec0dc87115a4f40a7b21f1a7b823ce3379a14f94fb
```
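Copying the token by hand is error-prone, so the step can be sketched in Python. The snippet below is an illustration only: the helper name `notebook_url` is hypothetical, and it assumes the token is a hexadecimal string that follows "token=", as in the sample output above.

```python
import re
from typing import Optional

# Assumption: the token consists of hexadecimal characters only.
TOKEN_PATTERN = re.compile(r"token=([0-9a-f]+)")


def notebook_url(server_list_output: str, port: int = 8888) -> Optional[str]:
    """Extract the token from `jupyter server list` output and build the local URL."""
    match = TOKEN_PATTERN.search(server_list_output)
    if match is None:
        return None  # no running server or no token in the output
    return f"http://localhost:{port}/tree?token={match.group(1)}"


sample = """Currently running servers:
http://cc03a1a1513f:8888/?token=2ea951ec0dc87115a4f40a7b21f1a7b823ce3379a14f94fb
"""
print(notebook_url(sample))
```

The host name in the listed URL (here: cc03a1a1513f) is the container ID and is not reachable from the host, which is why the sketch substitutes localhost.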

## **How to create a development environment using Anaconda**
---
### **Install Anaconda or Miniconda**

Install Anaconda or Miniconda following the instructions for your operating system.

### **Create a virtual environment**

Create a file env.yml with the following content in your working directory. Replace the name "onboarding" with the name of your project.

```
name: onboarding
channels:
  - conda-forge
dependencies:
  - python=3.10
  - pyspark=3.5
  - pypandoc=1.12
  - pytest=7.4.3
  - pylint=3.0.3
  - findspark=2.0.1
  - jupyter=1.0.0
  - pandas=2.2.1
  - numpy=1.26.4
  - openpyxl=3.1.2
```
Then run

```
conda env create -f env.yml
```

Conda downloads all required packages and resolves their dependencies. This may take a couple of minutes. Once it has finished, activate the new environment with

```
conda activate onboarding
```

To start Jupyter Notebook, enter

```
jupyter notebook
```
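Before working in the notebook, it can help to confirm that the active environment really contains the versions pinned in env.yml. The following is a minimal sketch, not part of the original setup: the helper names `installed_version` and `check_environment` are hypothetical, only a few of the packages from env.yml are listed, and the comparison is a simple prefix match on the version string.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Optional

# A subset of the versions pinned in env.yml
EXPECTED = {
    "pyspark": "3.5",
    "findspark": "2.0.1",
    "pandas": "2.2.1",
    "numpy": "1.26.4",
}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def check_environment(expected: Dict[str, str]) -> Dict[str, str]:
    """Compare installed versions with the expected ones (prefix match)."""
    report = {}
    for package, wanted in expected.items():
        found = installed_version(package)
        if found is None:
            report[package] = "missing"
        elif found.startswith(wanted):
            report[package] = f"ok ({found})"
        else:
            report[package] = f"version mismatch: found {found}, expected {wanted}"
    return report


for package, status in check_environment(EXPECTED).items():
    print(f"{package}: {status}")
```

If a package is reported as missing, the most likely cause is that the notebook was started from a different environment than the one created above.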

In [6]:
import sys
print(sys.path)

['/usr/local/spark/python', '/usr/local/spark/python/lib/py4j-0.10.9.7-src.zip', '/home/jovyan', '/opt/conda/lib/python311.zip', '/opt/conda/lib/python3.11', '/opt/conda/lib/python3.11/lib-dynload', '', '/opt/conda/lib/python3.11/site-packages']


In [5]:
import findspark
findspark.init()

In [3]:
# SparkSession and SparkContext

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Introduction to Spark").getOrCreate()

sc = spark.sparkContext

print(spark)
print(sc)

<pyspark.sql.session.SparkSession object at 0x78c0b7ac9480>
<SparkContext master=local[*] appName=Databases>


24/04/04 15:07:52 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


In [1]:
# Reading from PostgreSQL

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Databases") \
    .getOrCreate()

# Alternative for non-Docker environments: tell Spark where the JDBC driver is via "spark.jars".
# In Docker, the driver must instead be copied into the container
# (to /usr/local/spark/jars) with docker cp.
# spark = SparkSession \
#    .builder \
#    .appName("Databases") \
#    .config("spark.jars", "/home/peter/projects/spark/postgresql-42.7.0.jar") \
#    .getOrCreate()

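def show_customers(spark: SparkSession, database: str) -> None:
    """Read the customers table via JDBC and print up to 100 rows.

    Note: the database argument is currently not used; the connection
    settings below are hard-coded for the example PostgreSQL instance.
    """
    df_customers = spark.read \
        .format("jdbc") \
        .option("url", "jdbc:postgresql://172.17.0.2:5432/postgres") \
        .option("driver", "org.postgresql.Driver") \
        .option("dbtable", "customers") \
        .option("user", "postgres") \
        .option("password", "guiltyspark") \
        .load()
    # Show only the columns of interest
    df_customers.select('last_name', 'first_name', 'birth_date').show(100)

show_customers(spark, "postgresql")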

24/04/04 15:11:12 WARN Utils: Your hostname, lenovo-xubuntu resolves to a loopback address: 127.0.1.1; using 192.168.1.11 instead (on interface ens33)
24/04/04 15:11:12 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/04/04 15:11:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
                                                                                

+------------+-------------+----------+
|   last_name|   first_name|birth_date|
+------------+-------------+----------+
|    Schottin|         Reni|2001-01-17|
|        Drub|     Annelene|1962-08-17|
|      Ladeck|  Wolf-Dieter|1946-02-25|
|   Cichorius|       Marlen|1957-05-29|
|       Zobel|        Cemil|1946-01-18|
|  Neureuther|        Oscar|1997-10-08|
|    Stiebitz|      Rebecca|1991-07-04|
|   Eberhardt|      Mariola|1947-02-06|
|    Weinhold|        Thilo|1995-07-13|
|    Hartmann|     Mohammed|1948-02-18|
|        Bähr|     Maurizio|1989-04-23|
|      Thanel|    Katharina|1953-10-22|
|      Heuser|       Detlev|1959-03-05|
|        Graf|        Ester|1976-04-13|
|     Hettner|      Diether|1954-10-27|
|        Gute|     Christel|2004-06-19|
|    Barkholz|      Swantje|2002-06-30|
|        Kade|      Erdmute|1971-05-14|
|      Albers|         Jiri|1948-05-23|
|     Schacht|        Julie|1983-09-21|
|    Eckbauer|         Ines|1958-07-05|
|     Hornich|     Brunhild|1999-05-31|
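
The cell above hard-codes the JDBC connection settings, including the password. One possible variation is to collect the settings in a dictionary and to read the credentials from environment variables. This is a sketch under assumptions: the helper name `jdbc_options` is hypothetical, and the variable names PGUSER and PGPASSWORD are a convention chosen here, not something the notebook itself defines.

```python
import os
from typing import Dict


def jdbc_options(
    table: str,
    host: str = "172.17.0.2",
    port: int = 5432,
    database: str = "postgres",
) -> Dict[str, str]:
    """Build the JDBC options for the PostgreSQL example without a hard-coded password."""
    password = os.environ.get("PGPASSWORD")  # assumed variable name
    if password is None:
        raise RuntimeError("Set the PGPASSWORD environment variable first")
    return {
        "url": f"jdbc:postgresql://{host}:{port}/{database}",
        "driver": "org.postgresql.Driver",
        "dbtable": table,
        "user": os.environ.get("PGUSER", "postgres"),  # assumed variable name
        "password": password,
    }


# Usage inside the notebook (requires a running SparkSession named spark):
# df_customers = spark.read.format("jdbc").options(**jdbc_options("customers")).load()
```

Keeping the options in one place also makes it easier to switch between the Docker and the Conda setup, where host and port may differ.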


In [4]:
# Read a CSV file and convert it to Parquet

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CSV-to-Parquet converter").getOrCreate()

# For Docker
# csv_path = "work/data/csv/"
# parquet_path = "work/data/parquet/"

# For Conda
csv_path = "/home/peter/projects/onboarding/data/csv/"
parquet_path = "/home/peter/projects/onboarding/data/parquet/"


states_csv_df = spark.read.format("csv").option("header", "true").option("sep", "|").load(csv_path + "states.csv")
states_csv_df.write.mode("overwrite").parquet(parquet_path + "states.parquet")
states_parquet_df = spark.read.parquet(parquet_path + "states.parquet")
states_parquet_df.createOrReplaceTempView("states")
states_sql_df = spark.sql("SELECT * FROM states")
states_sql_df.show()

AnalysisException: [PATH_NOT_FOUND] Path does not exist: file:/home/peter/projects/onboarding/notebooks/home/peter/projects/onboarding/data/csv/states.csv.
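
In the error message the notebook directory (/home/peter/projects/onboarding/notebooks) is prepended to the data path. This suggests that the cell was run with a relative path, i.e. without the leading slash, because relative paths are resolved against the current working directory. A small check before calling spark.read can make such mistakes visible earlier. The following is a minimal sketch for files on the local file system; the helper names `absolute_path` and `require_file` are hypothetical.

```python
from pathlib import Path


def absolute_path(base_dir: str, file_name: str) -> Path:
    """Join directory and file name and resolve the result to an absolute path.

    A relative base_dir is resolved against the current working directory.
    """
    return (Path(base_dir) / file_name).resolve()


def require_file(path: Path) -> str:
    """Return the path as a string, or raise a clear error if the file is missing."""
    if not path.is_file():
        raise FileNotFoundError(f"Input file not found: {path}")
    return str(path)


# Usage inside the notebook (csv_path as defined in the cell above):
# states_file = require_file(absolute_path(csv_path, "states.csv"))
# states_csv_df = spark.read.format("csv").option("header", "true").option("sep", "|").load(states_file)
```

Printing the resolved path once shows immediately whether the Docker or the Conda variant of csv_path is in effect.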