2 changes: 1 addition & 1 deletion docker/.env
@@ -1,7 +1,7 @@
 # component versions
 
 ARANGO_DB_VERSION=3.7.10
-SPLINE_CORE_VERSION=0.6.0
+SPLINE_CORE_VERSION=0.6.1
 SPLINE_AGENT_VERSION=0.6.1
 SPLINE_UI_VERSION=0.6.0
 
150 changes: 150 additions & 0 deletions spline-on-AWS-demo-setup/README.md
@@ -0,0 +1,150 @@
Setting up Spline Server on AWS EC2
===

Spline is an open-source data lineage tracking tool that can help you capture data lineage for your various data pipelines.
See [Spline GitHub pages](https://absaoss.github.io/spline/) for details.

The purpose of this article is to demonstrate the basic steps required to install Spline Server on AWS EC2.

## Disclaimer

This is **NOT** a production setup guide!

The approach described below is just enough for demo and trial purposes, but doesn't cover the majority of aspects that need to be considered when
setting up a real production environment.

## Prerequisites

You need to have an AWS account.

## Create and launch EC2 instance

Open your [AWS Console](https://console.aws.amazon.com/) and select **Launch instance**

![img.png](img.png)

![img_1.png](img_1.png)

Select an Amazon Machine Image of your choice (we'll use the default Amazon Linux 2 AMI)

![img_2.png](img_2.png)

When choosing an instance type, consider the amount of RAM and disk space. You need to run three Docker containers in total:

- [ArangoDB](https://hub.docker.com/_/arangodb) - this is where the lineage data will be stored.
- [Spline REST Gateway](https://hub.docker.com/r/absaoss/spline-rest-server) - a Java application that exposes an API for Spline agents and the Spline UI.
It runs on a Tomcat server and can be memory intensive. (Alternatively you can use
[Spline Kafka Gateway](https://hub.docker.com/r/absaoss/spline-kafka-server) instead of the REST one, but this is beyond the scope of this article)
- [Spline UI](https://github.com/AbsaOSS/spline-ui) - a lightweight HTTP server that is only used for serving the static resources required by the Spline
  UI. The Spline UI is implemented as a Single Page Application (SPA) that runs entirely within the browser and communicates directly with the Spline
  Gateway via the REST API. It does not route any additional HTTP traffic through its own server.

For demonstration purposes we'll run all three containers on the same EC2 instance, so we'll pick a `t2.medium` instance with 4 GB RAM and 2 vCPUs.

![img_3.png](img_3.png)

On the **Review Instance** page, check all necessary details. Pay special attention to the security group - the instance needs to be open for
public access. You also need to open two custom TCP ports - one for the REST API and another for the Spline UI.

![img_4.png](img_4.png)

We'll use ports `8080` and `9090`: one for the Spline REST API and the other for the Spline UI.
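
If you prefer scripting over the console, the same rules can be added with the AWS CLI. This is just a sketch: the security group ID below is a placeholder, and opening ports to `0.0.0.0/0` is acceptable only for a short-lived demo.

```shell
# Placeholder security group ID -- use the one attached to your instance
SG_ID=sg-0123456789abcdef0

# Open the Spline REST API and UI ports to the world (demo only!)
aws ec2 authorize-security-group-ingress --group-id "$SG_ID" --protocol tcp --port 8080 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-id "$SG_ID" --protocol tcp --port 9090 --cidr 0.0.0.0/0
```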

Then, we can review and launch our instance.

![img_5.png](img_5.png)

As a final step you'll be asked to create or select a key pair to access the instance via SSH. Follow the AWS instructions.

![img_6.png](img_6.png)

Take note of the launched instance's public IP and keep it handy for the rest of the article.

![img_7.png](img_7.png)

## Setup Spline

Open an SSH client and log into the instance.

```shell
ssh -i ~/.pem/spline-aws.pem ec2-user@18.116.202.35
```
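
If SSH refuses the key because its permissions are too open, restrict them first:

```shell
chmod 400 ~/.pem/spline-aws.pem
```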

Then install and start the Docker service.

```shell
sudo yum install docker -y
sudo systemctl enable docker.service
sudo systemctl start docker.service
sudo usermod -a -G docker ec2-user
```

Log out and log back in so that the newly added `docker` group membership takes effect.
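
To confirm that the group change took effect, `docker` commands should now work without `sudo`:

```shell
# Should print Docker daemon info without permission errors
docker info
```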

Now we can pull and run the Spline containers. You can run them one by one, or use `docker-compose` to run
a [preconfigured demo setup](https://github.com/AbsaOSS/spline-getting-started/tree/main/docker).

If you want to run individual containers, see the [Step by step instruction](https://absaoss.github.io/spline/#step-by-step).

For the purpose of this article, we will use Docker Compose.

Install Docker Compose:

```shell
curl -L "https://github.com/docker/compose/releases/download/1.21.0/docker-compose-$(uname -s)-$(uname -m)" | sudo tee /usr/local/bin/docker-compose > /dev/null
sudo chmod +x /usr/local/bin/docker-compose
```
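
Verify that the binary is installed and on the `PATH`:

```shell
docker-compose --version
```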

Download the Spline demo Docker Compose config files:

```shell
mkdir spline
cd spline

wget https://raw.githubusercontent.com/AbsaOSS/spline-getting-started/main/docker/docker-compose.yml
wget https://raw.githubusercontent.com/AbsaOSS/spline-getting-started/main/docker/.env
```

Run `docker-compose` as shown below. `DOCKER_HOST_EXTERNAL` is the external IP of this EC2 instance. This IP will be passed to the Spline UI and used by
the client browser to connect to the Spline REST API.

```shell
DOCKER_HOST_EXTERNAL=18.116.202.35 docker-compose up
```
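
If you'd rather keep the terminal free, you can start the stack in the background and tail the gateway logs instead (the REST gateway service is named `spline` in the provided `docker-compose.yml`):

```shell
DOCKER_HOST_EXTERNAL=18.116.202.35 docker-compose up -d
docker-compose logs -f spline
```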

The provided Docker Compose config also runs a set of Spark examples to pre-populate the database. You can either let them run, or disable them by
commenting out the `agent` service block in the `docker-compose.yml` file:

```yaml
# agent:
#   image: absaoss/spline-spark-agent:${SPLINE_AGENT_VERSION}
#   network_mode: "bridge"
#   environment:
#     SPLINE_PRODUCER_URL: 'http://172.17.0.1:${SPLINE_REST_PORT}/producer'
#   links:
#     - spline
```
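
Alternatively, recent `docker-compose` versions (including the one installed above) should let you leave the file untouched and simply start the `agent` service with zero instances:

```shell
DOCKER_HOST_EXTERNAL=18.116.202.35 docker-compose up --scale agent=0
```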

When the containers are up, we can verify that the Spline Gateway and Spline UI servers are running by visiting the following URLs:

- http://18.116.202.35:8080/
- http://18.116.202.35:9090/

(Use the correct EC2 instance IP).
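
The same check can be scripted with `curl`; a plain probe of each root URL is enough to confirm that the servers respond (no specific endpoint is assumed here):

```shell
# Expect an HTTP status code such as 200 from each server
curl -s -o /dev/null -w "%{http_code}\n" http://18.116.202.35:8080/
curl -s -o /dev/null -w "%{http_code}\n" http://18.116.202.35:9090/
```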

---

Copyright 2019 ABSA Group Limited

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
Binary file added spline-on-AWS-demo-setup/img.png
Binary file added spline-on-AWS-demo-setup/img_1.png
Binary file added spline-on-AWS-demo-setup/img_2.png
Binary file added spline-on-AWS-demo-setup/img_3.png
Binary file added spline-on-AWS-demo-setup/img_4.png
Binary file added spline-on-AWS-demo-setup/img_5.png
Binary file added spline-on-AWS-demo-setup/img_6.png
Binary file added spline-on-AWS-demo-setup/img_7.png
169 changes: 169 additions & 0 deletions spline-on-databricks/README.md
@@ -0,0 +1,169 @@
Running Spline on Databricks
===

Spline is an open-source data lineage tracking tool that can help you capture data lineage of your various data pipelines.
See [Spline GitHub pages](https://absaoss.github.io/spline/) for details.

In this article, I will demonstrate how to create a minimal Spline setup and capture data lineage of Spark jobs running in a Databricks notebook.

## Preparation

### Install and launch Spline server components

First, we need to decide where we will run the Spline Gateway. The Spline Gateway is the server component responsible for storing and aggregating
lineage metadata captured by Spline agents. It is not strictly required though, as the Spline agent can capture and send lineage data in the Spline
format to any destination, including storing it on S3 or HDFS, or sending it to your custom REST API for further processing. However, to fully benefit
from all Spline features (like the Spline UI and other features that will come with future Spline versions), the Spline server needs to be installed.

The simplest way of doing that is with Docker. Since all Spline components are available as Docker images, you can run them in any environment that
supports Docker. The only requirement is that the Spline REST API must be accessible from the node where the Spark driver is executing. For the purpose
of this article, we will create a public AWS EC2 instance and run the Spline Docker containers there.

See [Spline on AWS - demo setup](../spline-on-AWS-demo-setup/README.md).

Make sure the Spline REST Gateway and the Spline UI servers are running, and take note of the _Spline Producer API URL_. It can be found on the Spline
Gateway index page.

![img.png](img.png)

### Prepare a Databricks account

You need to have a Databricks account. In this article, we will use a free account
on [Databricks Community Edition](https://community.cloud.databricks.com/login.html)

![img_1.png](img_1.png)

## Enable Spline on a Databricks cluster

Create a new cluster on the **Compute** page

![img_2.png](img_2.png)

Pick a desired Databricks runtime and take note of the selected Spark and Scala versions. Then, go to the **Spark** tab and add the required Spline
configuration parameters.

![img_4.png](img_4.png)

Here we instruct the Spline agent to use the embedded `http` lineage dispatcher and send the lineage data to our Spline Gateway. Use the _Producer API
URL_ copied in the previous step.

```properties
spark.spline.lineageDispatcher http
spark.spline.lineageDispatcher.http.producer.url http://18.116.202.35:8080/producer
```

You can optionally set the Spline mode to `REQUIRED` if you want Spline pre-flight check errors to be propagated to the Spark jobs. This is useful to
minimize the chance of Spark jobs completing without capturing lineage, for example due to a Spline misconfiguration.

```properties
spark.spline.mode REQUIRED
```

Refer to the [Spline agent configuration](https://github.com/AbsaOSS/spline-spark-agent#configuration) section for details about the other available
config parameters.

Now click **Create Cluster** and go to the **Libraries** tab, where we'll proceed with installing the Spline agent.

![img_5.png](img_5.png)

If you have a Spline agent JAR file you can upload it; otherwise, you can simply use Maven coordinates, and the agent will be downloaded automatically
from the Maven Central repository.

![img_6.png](img_6.png)

Click **Search Packages**, select **Maven Central** and type "_spline agent bundle_" into the query text field. You'll get a list of available Spline
agent bundles compiled for different Spark and Scala versions.

**Important**: Use a Spline agent bundle that matches the Spark and Scala version of the selected Databricks runtime.

![img_7.png](img_7.png)

Then click the **Install** button.

![img_8.png](img_8.png)

The cluster is ready to use, so we can create a new Notebook and start writing our test Spark job:

![img_9.png](img_9.png)

We're almost ready to run some Spark jobs. The last step is to enable lineage tracking on the Spark session.

```scala
import za.co.absa.spline.harvester.SparkLineageInitializer._

spark.enableLineageTracking()
```

This step has to be done once per Spark session. It could also be done by setting the `spark.sql.queryExecutionListeners` Spark property in the Spark
cluster configuration (see https://github.com/AbsaOSS/spline-spark-agent#initialization), but unfortunately that doesn't work on Databricks. When the
Databricks cluster is booting, the Spark session initializes _before_ the necessary agent library is actually installed on the cluster, resulting in
a `ClassNotFoundException`, and the cluster fails to start. The workaround is to call the `enableLineageTracking()` method explicitly. By the time
that method is called, the Spark session is already initialized and all the necessary classes are loaded.

Now, just run some Spark code as usual.

We'll use the following example, which consists of two jobs. First, let's create and save two sample files.

```scala
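// Note: Databricks notebooks pre-import spark.implicits._, which provides the
// .toDS extension used below; in a plain Spark application you would need to
// add `import spark.implicits._` yourself.
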
case class Student(id: Int, name: String, addrId: Int)

case class Address(id: Int, address: String)

Seq(
Student(111, "Amy Smith", 1),
Student(222, "Bob Brown", 2)
).toDS.write.mode("overwrite").parquet("/students")

Seq(
Address(1, "123 Park Ave, San Jose"),
Address(2, "456 Taylor St, Cupertino")
).toDS.write.mode("overwrite").parquet("/addresses")
```

In the next job, let's read those files, join them and write the result into another file using `append` mode:

```scala
val students = spark.read.parquet("/students")
val addresses = spark.read.parquet("/addresses")

students
.join(addresses)
.where(addresses("id") === students("addrId"))
.select("name", "address")
.write.mode("append").parquet("/student_names_with_addresses")
```

**Note**: The Spline agent only tracks persistent actions that result in data written to a file, a table or another persistent location. For example,
you will not see lineage for memory-only actions like `.show()` or `.collect()`.
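
For instance, reading the joined result back and displaying it will not produce a new execution event (a minimal illustration, reusing the paths from the example above):

```scala
// A memory-only action: the data is read and displayed but nothing is
// persisted, so the Spline agent emits no lineage event here.
spark.read.parquet("/student_names_with_addresses").show()
```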

To see the captured metadata, go to the Spline UI page.

![img_10.png](img_10.png)

If everything is done correctly, you should see three execution events that correspond to three writes in our example. To see the lineage overview of
the data produced by a particular execution event, click on the event name.

![img_11.png](img_11.png)

The graph above represents the high-level lineage of the data produced by the current execution event. It shows how the data actually flowed between
the data sources and which jobs were involved in the process.

To see the details of a particular job (execution plan), click the button on the corresponding node.

![img_12.png](img_12.png)

Here you can see which transformations have been applied to the data, the operation details, input/output data types, etc.

---

Copyright 2019 ABSA Group Limited

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
Binary file added spline-on-databricks/img.png
Binary file added spline-on-databricks/img_1.png
Binary file added spline-on-databricks/img_10.png
Binary file added spline-on-databricks/img_11.png
Binary file added spline-on-databricks/img_12.png
Binary file added spline-on-databricks/img_2.png
Binary file added spline-on-databricks/img_3.png
Binary file added spline-on-databricks/img_4.png
Binary file added spline-on-databricks/img_5.png
Binary file added spline-on-databricks/img_6.png
Binary file added spline-on-databricks/img_7.png
Binary file added spline-on-databricks/img_8.png
Binary file added spline-on-databricks/img_9.png