## Set Up Spark & Package Dependencies

### Summary

PySpark is the Python API for Apache Spark, used for large-scale data processing. This tutorial walks through installing PySpark on macOS or Linux.

The overall steps to install Spark locally are:

- Download and install Java Development Kit (JDK) version 8 or higher on your machine.

- Download the latest version of Apache Spark from the official website (https://spark.apache.org/downloads.html).

- Extract the downloaded Spark archive to a directory of your choice.

- Install `PySpark`: You can install PySpark using pip, the Python package manager. Open a command prompt or terminal and type `pip install pyspark` to install the latest version of PySpark.

- Test your installation: Run `import pyspark` in Python. If the installation succeeded, the import completes without errors.
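
The checklist above can be sketched as a small script. This is a minimal sketch, not part of the original tutorial: the helper names `command_works` and `check_prerequisites` are hypothetical, and the script assumes `java` should be reachable on your `PATH`. It uses only the Python standard library, so it runs even before PySpark is installed.

```python
import importlib.util
import subprocess


def command_works(cmd):
    """Return True if the command runs and exits with status 0 (hypothetical helper)."""
    try:
        subprocess.run(cmd, capture_output=True, check=True, timeout=30)
        return True
    except (OSError, subprocess.SubprocessError):
        # OSError covers a missing executable; SubprocessError covers
        # non-zero exit codes and timeouts.
        return False


def check_prerequisites():
    """Report which prerequisites from the checklist are available."""
    return {
        # `java -version` exits with 0 when a working JDK/JRE is found.
        "java": command_works(["java", "-version"]),
        # find_spec returns None when the package cannot be imported.
        "pyspark": importlib.util.find_spec("pyspark") is not None,
    }


for name, ok in check_prerequisites().items():
    print(f"{name:8s} {'found' if ok else 'MISSING'}")
```

Running `java -version` is used here instead of only looking for the executable, because macOS ships a placeholder `java` command that exists even when no JDK is installed.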

### macOS and Linux Users

Before we begin, note that `Homebrew` can simplify the installation process. Homebrew (brew) is a free, open-source package manager for installing applications and software on macOS. It is widely recommended for its simplicity and for the time and effort it saves. Its tagline is `The missing package manager for macOS` (these steps were tested on macOS Monterey 12.6.5). Homebrew can also be used on Linux.

#### Step 1: Install Homebrew

To install Homebrew, you can use the following command in the terminal:
```
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
```
<img src="../assets/images/00.install_brew.png" width="40%">

Verify the installation:

```
brew --version
```
<img src="../assets/images/00.verify_brew.png" width="50%">

#### Step 2: Install Java, Scala, & Apache Spark

To install Java, Scala, and Apache Spark on macOS using Homebrew, you can use the following command in the terminal:

```
brew install java scala apache-spark
```

This will install the latest versions of Java, Scala, and Apache Spark.

Next, on macOS, create a symbolic link from the `openjdk.jdk` directory installed by Homebrew to the standard JDK installation directory, so that the system Java wrappers can find the JDK:

```
sudo ln -sfn $(brew --prefix)/opt/openjdk/libexec/openjdk.jdk /Library/Java/JavaVirtualMachines/openjdk.jdk
```
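
To confirm that the link was created and points at an existing directory, a check along the following lines may help. It is a minimal sketch: `link_status` is a hypothetical helper, and the path is the macOS link target from the command above (on Linux it is expected to report `missing`).

```python
from pathlib import Path


def link_status(path):
    """Classify a path as 'ok', 'broken', 'not-a-link', or 'missing' (hypothetical helper)."""
    p = Path(path)
    if p.is_symlink():
        # exists() follows the link, so it is False when the target is gone.
        return "ok" if p.exists() else "broken"
    return "not-a-link" if p.exists() else "missing"


# Link location used by the `ln -sfn` command above (macOS only).
JDK_LINK = "/Library/Java/JavaVirtualMachines/openjdk.jdk"
print(JDK_LINK, "->", link_status(JDK_LINK))
```

A result of `broken` usually means the Homebrew `openjdk` formula was removed or moved after the link was created; re-running the `ln -sfn` command recreates it.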

You can verify the Spark installation from the terminal with the command below:
```
spark-shell --version
```
If everything was installed successfully, output similar to the following is shown.

<img src="../assets/images/00.verify_spark.png" width="50%">

#### Step 3: Install Package Dependencies

In this part, we install the dependency packages used in this project. The commands use the `%pip` magic, so run them in a Jupyter notebook.

```
# Install pandas, numpy, geopandas, and the other dependencies
%pip install -q pandas numpy scipy tqdm geopandas matplotlib pyarrow folium seaborn
```

Note: you may need to restart the kernel to use updated packages.


```
# Install PySpark
%pip install -q pyspark
```

Note: you may need to restart the kernel to use updated packages.
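
After restarting the kernel, one way to confirm that the packages were installed is to query their versions. This is a minimal sketch using the standard-library `importlib.metadata` module; the helper name `installed_versions` is hypothetical, and the package list simply mirrors the two install commands above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the install commands above.
PACKAGES = [
    "pandas", "numpy", "scipy", "tqdm", "geopandas",
    "matplotlib", "pyarrow", "folium", "seaborn", "pyspark",
]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name:12s} {ver if ver is not None else 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` can be reinstalled with `%pip install <name>` before continuing.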


#### Step 4: Test PySpark

You can now test if PySpark is installed correctly in Python by using the following code:

```
# Import the PySpark module
import pyspark

# Import the SparkSession from PySpark
from pyspark.sql import SparkSession

# Create a SparkSession with the specified configuration
spark = (
    SparkSession.builder
    .master("local")
    .appName("01.ITU.PySpark-test")
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .getOrCreate()
)

# Display the spark object, which holds the SparkSession
spark
```

Log output similar to the following is printed:

```
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/05/09 21:28:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/05/09 21:28:33 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors
```


The code above imports the modules needed to create a `SparkSession` in PySpark.

It then creates a `SparkSession` with the following configuration:
- The master is set to `local`, which runs Spark on the local machine with a single worker thread.
- The application name is set to "01.ITU.PySpark-test".
- Apache Arrow is enabled (`spark.sql.execution.arrow.pyspark.enabled`) to speed up data transfer between Spark and pandas, for example in `toPandas()`.

Finally, the last line displays the `spark` object, which represents the `SparkSession`.
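
Importing PySpark and creating a session does not yet run a Spark job. As an optional further check, a tiny job can be run on the session. The function below is a minimal sketch, not part of the original tutorial: `smoke_test` is a hypothetical name, and it assumes the `spark` session created above and that pandas is installed (Step 3).

```python
def smoke_test(spark, n=5):
    """Run a tiny job on an existing SparkSession and return (row_count, ids)."""
    # spark.range(n) builds a one-column DataFrame named "id" with values 0..n-1.
    df = spark.range(n)

    # count() is an action, so it forces Spark to actually execute a job.
    row_count = df.count()

    # toPandas() is the Spark-to-pandas conversion that the Arrow setting targets.
    ids = df.toPandas()["id"].tolist()
    return row_count, ids


# Usage with the session created above:
#   smoke_test(spark)   # returns (5, [0, 1, 2, 3, 4])
```

If this call raises an error about the Java gateway, the JDK is most likely not installed or not discoverable, and Step 2 is worth revisiting.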