# Spark Connectors

## Apache Maven

> Apache Maven is a build tool for managing and uniformly building Java-based projects. Once you define your project in a __project object model (POM)__, Maven can build it into __Java Archive (JAR)__ files, which are easily distributed through Maven's central repository.

### Maven as a build tool

Maven is a __build tool__. So what does this mean, and how can it help us? __Build tools__ are programs that automate the __building, testing, compiling, and deployment__ of __source code__ into a usable application.

Maven was designed to meet these objectives:
- Make the build process easy 
- Provide a uniform build process
- Provide quality project information
- Encourage better development practices

Maven as a build tool:

- Normally used for Java-based projects. It compiles code, usually into `JAR` files, but other packaging types can be specified.
- Keeps project builds consistent through the use of a __POM__ file, which describes the __project structure, dependencies, configuration__ and __licenses__ used when building your project.
- Automates testing of the code during the build process.
- Encourages sharing built projects with other users on the Maven central repository https://mvnrepository.com/repos/central.

### The POM file

A project can rely on many different dependencies that are critical to its function.

Visiting many different websites to find the correct version of every dependency your project requires can be tedious. This is the problem the __POM file__ is designed to solve: it gives you a uniform structure for defining your project in code when building it.

POM files are a representation of your Maven project written in `XML` and stored in a `pom.xml` file. 

This is where you will define dependencies, build configuration, licensing, and even the URL of where the project lives. The general layout and configurable tags are shown here:



In [None]:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
 
  <!-- The Basics -->
  <groupId>...</groupId>
  <artifactId>...</artifactId>
  <version>...</version>
  <packaging>...</packaging>
  <dependencies>...</dependencies>
  <parent>...</parent>
  <dependencyManagement>...</dependencyManagement>
  <modules>...</modules>
  <properties>...</properties>
 
  <!-- Build Settings -->
  <build>...</build>
  <reporting>...</reporting>
 
  <!-- More Project Information -->
  <name>...</name>
  <description>...</description>
  <url>...</url>
  <inceptionYear>...</inceptionYear>
  <licenses>...</licenses>
  <organization>...</organization>
  <developers>...</developers>
  <contributors>...</contributors>
 
  <!-- Environment Settings -->
  <issueManagement>...</issueManagement>
  <ciManagement>...</ciManagement>
  <mailingLists>...</mailingLists>
  <scm>...</scm>
  <prerequisites>...</prerequisites>
  <repositories>...</repositories>
  <pluginRepositories>...</pluginRepositories>
  <distributionManagement>...</distributionManagement>
  <profiles>...</profiles>
</project>

The bare minimum that should be included in your project is outlined here:

In [None]:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
 
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core</artifactId>
  <version>3.2.1</version>
</project>

We won't go into detail about building your own projects and connectors because they will normally be available for you on the Maven repository. When using a Maven project with Spark you will just need to specify the Maven coordinates to download the package. 
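As a rough illustration of how a minimal POM maps onto Maven coordinates, the sketch below reads the three required fields with Python's standard `xml.etree.ElementTree` module. The helper name `coordinates_from_pom` is hypothetical, and the embedded POM is a trimmed, well-formed copy of the minimal example above.

```python
import xml.etree.ElementTree as ET

# XML namespace used by POM 4.0.0 files such as the examples above.
POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}

# Trimmed copy of the minimal POM (assumption: schema attributes omitted for brevity).
MINIMAL_POM = """\
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core</artifactId>
  <version>3.2.1</version>
</project>
"""


def coordinates_from_pom(pom_xml):
    """Return 'groupId:artifactId:version' from a POM string (hypothetical helper)."""
    root = ET.fromstring(pom_xml)
    tags = ("groupId", "artifactId", "version")
    # Look up each required field as a direct child of <project>.
    fields = [root.findtext(f"m:{tag}", namespaces=POM_NS) for tag in tags]
    if None in fields:
        raise ValueError("POM does not declare groupId, artifactId and version directly")
    return ":".join(fields)


print(coordinates_from_pom(MINIMAL_POM))
# org.apache.spark:spark-core:3.2.1
```

Real-world POMs can inherit `groupId` or `version` from a `<parent>` element, which this sketch deliberately does not handle.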

#### Maven Coordinates

The required fields are `groupId:artifactId:version`. Together they act like an address and a timestamp for your project in one: they tell you where in the Maven repository to find the project. This is known as its Maven coordinates.

- __groupId__ - Generally unique to an organisation or project. For instance, all official Apache Spark projects live under `org.apache.spark`.
- __artifactId__ - Generally the name your project is known by, such as `spark-sql` or `kafka-clients`. Combined with the groupId, it should create a unique location in the Maven repository to house your project.
- __version__ - Specifies the version of your project, keeping separate releases apart on the repository.
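To make the "address" idea concrete, here is a minimal sketch of how a coordinate string translates into a location under the standard Maven repository layout, where dots in the groupId become directory separators. The names `MavenCoordinate`, `parse_coordinate` and `jar_path` are hypothetical helpers, not part of Maven or Spark.

```python
from typing import NamedTuple


class MavenCoordinate(NamedTuple):
    group_id: str
    artifact_id: str
    version: str


def parse_coordinate(text):
    """Split 'groupId:artifactId:version' into its three parts."""
    parts = text.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected groupId:artifactId:version, got {text!r}")
    return MavenCoordinate(*parts)


def jar_path(coord):
    """Relative path of the JAR under a repository using the standard layout."""
    group_dir = coord.group_id.replace(".", "/")
    file_name = f"{coord.artifact_id}-{coord.version}.jar"
    return f"{group_dir}/{coord.artifact_id}/{coord.version}/{file_name}"


coord = parse_coordinate("org.apache.hadoop:hadoop-aws:3.3.1")
print(jar_path(coord))
# org/apache/hadoop/hadoop-aws/3.3.1/hadoop-aws-3.3.1.jar
```

Because each part of the coordinate becomes part of the path, two releases of the same artifact can never collide on the repository.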



### The Maven central repository

Maven solves the problem of searching multiple sites for dependencies through the Maven central repository https://mvnrepository.com/repos/central, where projects built with Maven can be hosted for other users to use.

We can use the Maven coordinates to find the specific piece of software required. 

Let's take a look at an example. If you search for Apache Spark, you will be met with the following.

<img src="images/maven_spark_search.png?modified=232132453">

You can see that there are many different artifacts for Spark, `spark-streaming`, `spark-sql` etc. Clicking on `spark-sql` will show you the many different versions of this artifact hosted on the Maven repository which were built using Maven.

<img src="images/spark_sql_maven.png?modified=232132453">

You can see the historical versioning of this connector, which versions of Spark it is compatible with, and which versions of Scala it was built for. Selecting the version for Spark 3.2.1 will bring you to the following page.

<img src="images/spark_sql_download_page.png?modified=232132453">

Note the Maven coordinates on the bottom panel. This information will be useful when using the package with PySpark.

### Submitting packages to Spark

You can add these packages to your PySpark environment through the `PYSPARK_SUBMIT_ARGS` environment variable. This allows you to pass additional arguments to the PySpark shell when it starts.

Some of the available arguments are as follows:

- `--packages` - Comma-separated list of Maven coordinates of JARs to include on the driver and executor classpaths. Spark will search the local Maven repository, then Maven central and any additional remote repositories given by `--repositories`.
- `--repositories` - Comma-separated list of additional remote repositories to search for the Maven coordinates given with `--packages`.
- `--jars` - Comma-separated list of JARs to include on the driver and executor classpaths.
- `--py-files` - Comma-separated list of `.zip`, `.egg` or `.py` files to place on the `PYTHONPATH` for Python apps.

You can find all available options by running `pyspark --help` in your terminal.
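The options above can be assembled into a single `PYSPARK_SUBMIT_ARGS` value. The following is a minimal sketch, assuming a hypothetical helper named `build_submit_args`; it only joins strings and sets the environment variable, and the variable needs to be set before the Spark context is created for it to take effect. The two coordinates are the ones used for S3 later in this section.

```python
import os


def build_submit_args(packages=(), repositories=(), jars=()):
    """Assemble a PYSPARK_SUBMIT_ARGS string (hypothetical helper)."""
    args = []
    if packages:
        args += ["--packages", ",".join(packages)]
    if repositories:
        args += ["--repositories", ",".join(repositories)]
    if jars:
        args += ["--jars", ",".join(jars)]
    # The value ends with 'pyspark-shell', as in the S3 example in this section.
    args.append("pyspark-shell")
    return " ".join(args)


submit_args = build_submit_args(
    packages=[
        "com.amazonaws:aws-java-sdk-s3:1.12.196",
        "org.apache.hadoop:hadoop-aws:3.3.1",
    ]
)
os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args
print(submit_args)
# --packages com.amazonaws:aws-java-sdk-s3:1.12.196,org.apache.hadoop:hadoop-aws:3.3.1 pyspark-shell
```

Each option takes one comma-separated value with no spaces, which is why the lists are joined with `","` rather than passed as repeated flags.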

### S3 to Spark

Let's add the additional Spark Streaming package to `PYSPARK_SUBMIT_ARGS` to enable Spark Streaming with PySpark. Go to the Maven repository and find the package for your version of Spark.

You can check your version of Spark by running `spark-shell` in the terminal.

<img src="images/spark_streaming.png?modified=232132453">

In this case you can see that Spark `3.2.1` is being used and Scala `2.12.15`. 

When submitting the package through the `PYSPARK_SUBMIT_ARGS` environment variable, it will be in the form of Maven coordinates, i.e. `groupId:artifactId:version`.
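Spark's Scala-based artifacts on Maven central typically carry the Scala binary version (the first two components, such as `2.12`) as a suffix on the artifactId, which is why both version numbers matter. The sketch below shows one way to derive a coordinate from the versions reported by `spark-shell`; the function names are hypothetical, and it is worth confirming the resulting coordinate on the repository page before using it.

```python
def scala_binary_version(scala_version):
    """Reduce a full Scala version such as '2.12.15' to its binary version '2.12'."""
    major, minor = scala_version.split(".")[:2]
    return f"{major}.{minor}"


def spark_package(module, spark_version, scala_version):
    """Build groupId:artifactId:version for a Spark module (hypothetical helper)."""
    suffix = scala_binary_version(scala_version)
    return f"org.apache.spark:spark-{module}_{suffix}:{spark_version}"


print(spark_package("streaming", "3.2.1", "2.12.15"))
# org.apache.spark:spark-streaming_2.12:3.2.1
```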

Let's get the packages required to read some data from an S3 bucket. To read from the bucket we will need the additional packages `aws-java-sdk` and `hadoop-aws`. Find them on Maven central and add them as packages to your `PYSPARK_SUBMIT_ARGS`.

Replace the bucket path, AWS access key and AWS secret key with your own to allow Spark to connect to S3.


In [None]:
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf
import os

# Adding the packages required to get data from S3  
os.environ["PYSPARK_SUBMIT_ARGS"] = "--packages com.amazonaws:aws-java-sdk-s3:1.12.196,org.apache.hadoop:hadoop-aws:3.3.1 pyspark-shell"

# Creating our Spark configuration
conf = SparkConf() \
    .setAppName('S3toSpark') \
    .setMaster('local[*]')

sc = SparkContext(conf=conf)

# Configure the settings needed to read from the S3 bucket
accessKeyId = "ENTER YOUR AWS ACCESS KEY HERE"
secretAccessKey = "ENTER YOUR SECRET AWS ACCESS KEY"
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set('fs.s3a.access.key', accessKeyId)
hadoopConf.set('fs.s3a.secret.key', secretAccessKey)
# Authenticate with AWS using the access key and secret key set above.
# The 'spark.hadoop.' prefix is only needed when setting this through SparkConf.
hadoopConf.set('fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider')

# Create our Spark session
spark = SparkSession(sc)

# Read from the S3 bucket
# You may want to change this to spark.read.csv depending on the files you're reading from the bucket
df = spark.read.json("PATH TO YOUR AWS FILE IN THE BUCKET")
df.show()
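The `fs.s3a.*` settings above apply to paths that use the `s3a://` scheme. As a small sketch of filling in the placeholders without hardcoding secrets, the helpers below build such a path and collect the two key settings from the standard `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables. The helper names, bucket name and object key are hypothetical, and the credentials shown are dummy values so that the example runs on its own.

```python
import os


def s3a_uri(bucket, key):
    """Build an s3a:// path for a bucket and object key (hypothetical helper)."""
    return f"s3a://{bucket}/{key.lstrip('/')}"


def s3a_credentials(environ=os.environ):
    """Map the standard AWS environment variables onto the S3A settings."""
    return {
        "fs.s3a.access.key": environ["AWS_ACCESS_KEY_ID"],
        "fs.s3a.secret.key": environ["AWS_SECRET_ACCESS_KEY"],
    }


# Dummy environment so the sketch is self-contained; use os.environ in practice.
fake_env = {"AWS_ACCESS_KEY_ID": "EXAMPLE_KEY", "AWS_SECRET_ACCESS_KEY": "EXAMPLE_SECRET"}
settings = s3a_credentials(fake_env)

print(s3a_uri("my-example-bucket", "data/events.json"))
# s3a://my-example-bucket/data/events.json
```

Each entry in `settings` could then be applied with `hadoopConf.set(name, value)` in place of the literal strings used in the cell above.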

Depending on what data source you want to read/write to in Spark you will need to find the required package from Maven central before submitting it to Spark. 

You may want to connect many different data sources to Spark to leverage its in-memory processing. Spark has many prebuilt connectors that enable you to do so. Often the process will be to find the correct connector for the task and add it using the steps described above.

The documentation for your chosen data source will often indicate which package to use when connecting with Spark, so always check it for Maven coordinates.