## Set Up Spark & Package Dependencies

### Summary

PySpark is the Python API for Apache Spark, used for large-scale data processing. This tutorial walks through installing PySpark on Windows.

Installing Spark locally involves the following steps:

- Download and install Java Development Kit (JDK) version 8 or higher on your machine.

- Download the latest version of Apache Spark from the official website (https://spark.apache.org/downloads.html).

- Extract the downloaded Spark archive to a directory of your choice.

- Install `PySpark`: You can install PySpark using pip, the Python package manager. Open a command prompt or terminal and type `pip install pyspark` to install the latest version of PySpark.

- Test your installation: Run `import pyspark` in Python. If the installation succeeded, the import completes without errors.

Python is a prerequisite. If it is not already on your machine, install it from [here](https://www.python.org/downloads).
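Once the steps above are done, the checklist can also be confirmed from Python. The following is a minimal sketch using only the standard library. The helper name `check_prerequisites` is hypothetical and is not part of PySpark.

```python
import importlib.util
import shutil
import sys


def check_prerequisites():
    """Report which parts of the local setup are visible to this Python process."""
    return {
        # Version of the running interpreter, e.g. "3.11.4"
        "python": sys.version.split()[0],
        # Full path of the java executable found on PATH, or None if it is missing
        "java": shutil.which("java"),
        # True if the pyspark package can be located for import
        "pyspark": importlib.util.find_spec("pyspark") is not None,
    }


if __name__ == "__main__":
    for name, value in check_prerequisites().items():
        print(f"{name}: {value}")
```

A `java` value of `None` or a `pyspark` value of `False` points to the step that still needs attention.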

### Windows User

Here's a step-by-step tutorial to install PySpark on Windows:

#### Step 1: Install Java

PySpark requires Java to be installed on your machine, so the first step is to download and install Java Development Kit (JDK).

1. Go to the Java SE Development Kit 8 Downloads page: https://www.oracle.com/java/technologies/javase/javase-jdk8-downloads.html

2. Accept the license agreement by clicking the checkbox and download the appropriate JDK version for your system (32-bit or 64-bit).

3. Run the downloaded .exe file and follow the instructions to install Java on your machine.

4. Open `Command Prompt` and type `java -version` to confirm that Java is installed and to see which version is active.

5. Add the Java path. Open the Start menu and search for `Environment Variables`.

6. Click on the `Edit the system environment variables` option.

    <img src="../assets/images/00.config_search_bar.png" width="40%">

7. Click on the `Environment Variables` button.

    <img src="../assets/images/00.config_system_variable.png" width="40%">

8. Under `System variables`, click on the `New` button to create a new environment variable.

    <img src="../assets/images/00.config_environment_panel.png" width="40%">

9. Set the variable name to `JAVA_HOME` and the variable value to the Java installation path. For example, if Java is installed at `C:\Program Files (x86)\Java\jdk1.8.0_251`, enter that path, then click `OK`.

    <img src="../assets/images/00.config_java_path.png" width="40%">

10. Locate the `Path` variable under `System variables` and click on `Edit`. Click on `New`, add `C:\Program Files (x86)\Java\jdk1.8.0_251\bin` as a new value, then click `OK`.
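
The two variables set in this step can be checked from Python after opening a new terminal. The sketch below assumes the layout described above (`JAVA_HOME` pointing at the JDK folder and its `bin` subfolder listed in `Path`). The function name `check_java_env` is hypothetical.

```python
import os
from pathlib import PureWindowsPath


def check_java_env(env):
    """Return a list of problems found in a mapping of environment variables."""
    problems = []
    java_home = env.get("JAVA_HOME")
    if not java_home:
        problems.append("JAVA_HOME is not set")
        return problems

    # The bin folder under JAVA_HOME is the entry that was added to Path
    expected_bin = PureWindowsPath(java_home) / "bin"

    # Windows separates Path entries with ';'. Accept either spelling of the key.
    raw_path = env.get("Path", env.get("PATH", ""))
    entries = [PureWindowsPath(p) for p in raw_path.split(";") if p]

    # PureWindowsPath compares case-insensitively and ignores trailing slashes
    if expected_bin not in entries:
        problems.append(f"{expected_bin} is not on Path")
    return problems


if __name__ == "__main__":
    issues = check_java_env(os.environ)
    print("Java environment looks fine" if not issues else "\n".join(issues))
```

An empty list means both settings are visible to the current process. Environment changes only apply to terminals opened after the dialog was confirmed.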

#### Step 2: Download and install Apache Spark

1. Go to the Apache Spark download page: https://spark.apache.org/downloads.html

    <img src="../assets/images/00.install_spark.png" width="60%">

2. Choose a Spark release version and select the package type `Pre-built for Apache Hadoop 3.3 and later`.

3. Download the .tgz file for the selected release version.

4. Extract the downloaded .tgz file to a desired location on your machine, e.g., `C:\spark`.

5. Create a new environment variable for Spark. Enter `SPARK_HOME` as the variable name and set the path to `C:\spark`.

6. Create another environment variable for Hadoop. Enter `HADOOP_HOME` as the variable name and set the path to `C:\spark`.

7. Locate the `Path` variable under `System variables` and click on `Edit`. Click on `New` and add the following path: `%SPARK_HOME%\bin`

<br>

You can verify the Spark installation in `Command Prompt` by running the command below:
```
C:\Users\PC>pyspark
```
If everything is set up correctly, the following message appears.

<img src="../assets/images/00.verify_spark_windows.png" width="60%">
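
If the `pyspark` command is not recognized, a quick look at the environment variables from this step can help narrow down the cause. The following is a rough sketch, assuming the extracted Spark folder contains a `bin` directory with a `pyspark` or `pyspark.cmd` launcher. The function name `check_spark_env` is hypothetical.

```python
import os
from pathlib import Path


def check_spark_env(env):
    """Return a list of problems with the Spark-related environment variables."""
    problems = []

    # Both variables were created in this step
    for name in ("SPARK_HOME", "HADOOP_HOME"):
        if not env.get(name):
            problems.append(f"{name} is not set")

    spark_home = env.get("SPARK_HOME")
    if spark_home:
        bin_dir = Path(spark_home) / "bin"
        # Look for the launcher script that the `pyspark` command resolves to
        launchers = ("pyspark", "pyspark.cmd")
        if not any((bin_dir / script).is_file() for script in launchers):
            problems.append(f"no pyspark launcher found in {bin_dir}")
    return problems


if __name__ == "__main__":
    issues = check_spark_env(os.environ)
    print("Spark environment looks fine" if not issues else "\n".join(issues))
```

A missing launcher usually means `SPARK_HOME` points one level above or below the folder that was actually extracted.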

#### Step 3: Install Package Dependencies

In this part, we install the dependency packages used in this project.

```python
# install pandas, numpy, and geopandas, etc
%pip install -q pandas numpy scipy tqdm geopandas matplotlib pyarrow folium seaborn
```

```python
# install pyspark
%pip install -q pyspark
```

Note: you may need to restart the kernel to use updated packages.
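
After restarting the kernel, it can be useful to confirm which versions were actually installed. The sketch below uses the standard library's `importlib.metadata`. The helper name `installed_versions` is hypothetical, and the package list simply mirrors the install commands above.

```python
from importlib import metadata

# Packages installed by the two commands above
PACKAGES = [
    "pandas", "numpy", "scipy", "tqdm", "geopandas",
    "matplotlib", "pyarrow", "folium", "seaborn", "pyspark",
]


def installed_versions(packages=PACKAGES):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as missing can be installed again with `%pip install`.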


#### Step 4: Test PySpark

You can now test if PySpark is installed correctly in Python by using the following code:

```python
# Import the PySpark module
import pyspark

# Import the SparkSession from PySpark
from pyspark.sql import SparkSession

# Create a SparkSession with the specified configuration
spark = (
    SparkSession.builder
    .master("local")
    .appName("01.ITU.PySpark-test")
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .getOrCreate()
)

# Display the spark object which contains the SparkSession
spark
```

```
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/05/09 21:28:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/05/09 21:28:33 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors
```


The code above imports `SparkSession` and uses its builder to create a session with the following configuration:
- `master("local")` runs Spark locally on this machine with a single worker thread.
- The application name is set to "01.ITU.PySpark-test".
- Arrow is enabled, which speeds up data transfer between Spark and pandas DataFrames.

Finally, evaluating `spark` on the last line of the cell displays the `SparkSession` object.
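
The same configuration can also be expressed as plain key/value options, which makes it easier to inspect or vary between runs. This is only a sketch: the helper names `session_options` and `build_session` are hypothetical. It relies on `spark.master` and `spark.app.name` being the configuration keys behind `.master()` and `.appName()`.

```python
def session_options(app_name, master="local", arrow=True):
    """Return the builder options as a plain dict of Spark configuration keys."""
    return {
        # "local" uses one worker thread; "local[*]" uses all logical cores
        "spark.master": master,
        "spark.app.name": app_name,
        # Spark expects the string "true" or "false" here
        "spark.sql.execution.arrow.pyspark.enabled": str(arrow).lower(),
    }


def build_session(options):
    """Create (or reuse) a SparkSession from a dict of configuration options."""
    # Imported here so the options helper stays usable without PySpark installed
    from pyspark.sql import SparkSession

    builder = SparkSession.builder
    for key, value in options.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

With PySpark and Java installed, `spark = build_session(session_options("01.ITU.PySpark-test"))` should give a session equivalent to the one created above. Because `getOrCreate()` reuses an existing session, call `spark.stop()` first if one is already running with different settings.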