2 changes: 1 addition & 1 deletion docker/.env
@@ -1,7 +1,7 @@
 # component versions
 
 ARANGO_DB_VERSION=3.7.10
-SPLINE_CORE_VERSION=0.6.0
+SPLINE_CORE_VERSION=0.6.1
 SPLINE_AGENT_VERSION=0.6.1
 SPLINE_UI_VERSION=0.6.0
 
150 changes: 150 additions & 0 deletions spline-on-AWS-demo-setup/README.md
@@ -0,0 +1,150 @@
Setting up Spline Server on AWS EC2
===

Spline is an open-source data lineage tracking tool that can help you capture data lineage for your various data pipelines.
See [Spline GitHub pages](https://absaoss.github.io/spline/) for details.

The purpose of this article is to demonstrate the basic steps required to install Spline Server on AWS EC2.

## Disclaimer

This is **NOT** a production setup guide!

The approach described below is just enough for demo and trial purposes, but doesn't cover the majority of aspects that need to be considered when
setting up a real production environment.

## Prerequisites

You need to have an AWS account.

## Create and launch EC2 instance

Open your [AWS Console](https://console.aws.amazon.com/) and select **Launch instance**

![img.png](img.png)

![img_1.png](img_1.png)

Select an Amazon Machine Image of your choice (we'll use the default Amazon Linux 2 AMI)

![img_2.png](img_2.png)

When choosing an instance type, consider the amount of RAM and disk space. You need to run three Docker containers in total:

- [ArangoDB](https://hub.docker.com/_/arangodb) - this is where the lineage data will be stored.
- [Spline REST Gateway](https://hub.docker.com/r/absaoss/spline-rest-server) - a Java application that exposes an API for Spline agents and the Spline UI.
It runs on a Tomcat server and can be memory intensive. (Alternatively you can use
[Spline Kafka Gateway](https://hub.docker.com/r/absaoss/spline-kafka-server) instead of the REST one, but this is beyond the scope of this article)
- [Spline UI](https://github.com/AbsaOSS/spline-ui) - a lightweight HTTP server that is only used for serving the static resources required by the Spline
  UI. The Spline UI is implemented as a Single Page Application (SPA) that runs entirely within the browser and communicates directly with the Spline
  Gateway via the REST API. It does not route any additional HTTP traffic through its own server.

For demonstration purposes we'll run all three containers on the same EC2 instance, so we'll pick a `t2.medium` instance with 4 GB RAM and 2 vCPUs.

![img_3.png](img_3.png)

On the **Review Instance** page, check all necessary details. Pay special attention to the security group - the instance needs to be open for
public access. You also need to open two custom TCP ports - one for the REST API and another for the Spline UI.

![img_4.png](img_4.png)

We'll use ports `8080` and `9090`: one for the Spline REST API and the other for the Spline UI.
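
If you prefer scripting over the console, the same rules can be added with the AWS CLI. This is just a sketch: the security group ID below is a placeholder, and opening ports to `0.0.0.0/0` is acceptable only for a short-lived demo.

```shell
# Placeholder security group ID -- use the one attached to your instance
SG_ID=sg-0123456789abcdef0

# Open the Spline REST API and UI ports to the world (demo only!)
aws ec2 authorize-security-group-ingress --group-id "$SG_ID" --protocol tcp --port 8080 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-id "$SG_ID" --protocol tcp --port 9090 --cidr 0.0.0.0/0
```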

Then, we can review and launch our instance.

![img_5.png](img_5.png)

As a final step you'll be asked to create or select a key pair to access the instance via SSH. Follow the AWS instructions.

![img_6.png](img_6.png)

Take note of the launched instance's public IP and keep it handy for the rest of the article.

![img_7.png](img_7.png)

## Setup Spline

Open an SSH client and log into the instance.

```shell
ssh -i ~/.pem/spline-aws.pem ec2-user@18.116.202.35
```
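
If SSH refuses the key because its permissions are too open, restrict them first:

```shell
chmod 400 ~/.pem/spline-aws.pem
```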

Then install and start the Docker service.

```shell
sudo yum install docker -y
sudo systemctl enable docker.service
sudo systemctl start docker.service
sudo usermod -a -G docker ec2-user
```

Log out and log back in so that the newly added `docker` group membership takes effect.
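
To confirm that the group change took effect, `docker` commands should now work without `sudo`:

```shell
# Should print Docker daemon info without permission errors
docker info
```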

Now we can pull and run the Spline containers. You can run them one by one, or use `docker-compose` to run
a [preconfigured demo setup](https://github.com/AbsaOSS/spline-getting-started/tree/main/docker).

If you want to run individual containers, see the [Step by step instruction](https://absaoss.github.io/spline/#step-by-step).

For the purpose of this article, we will use Docker Compose.

Install Docker Compose:

```shell
curl -L "https://github.com/docker/compose/releases/download/1.21.0/docker-compose-$(uname -s)-$(uname -m)" | sudo tee /usr/local/bin/docker-compose > /dev/null
sudo chmod +x /usr/local/bin/docker-compose
```
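
Verify that the binary is installed and on the `PATH`:

```shell
docker-compose --version
```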

Download the Spline demo Docker Compose config files:

```shell
mkdir spline
cd spline

wget https://raw.githubusercontent.com/AbsaOSS/spline-getting-started/main/docker/docker-compose.yml
wget https://raw.githubusercontent.com/AbsaOSS/spline-getting-started/main/docker/.env
```

Run `docker-compose` as shown below. `DOCKER_HOST_EXTERNAL` is the external IP of this EC2 instance. This IP will be passed to the Spline UI and used by
the client browser to connect to the Spline REST API.

```shell
DOCKER_HOST_EXTERNAL=18.116.202.35 docker-compose up
```
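
If you'd rather keep the terminal free, you can start the stack in the background and tail the gateway logs instead (the REST gateway service is named `spline` in the provided `docker-compose.yml`):

```shell
DOCKER_HOST_EXTERNAL=18.116.202.35 docker-compose up -d
docker-compose logs -f spline
```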

The provided Docker Compose config also runs a set of Spark examples to pre-populate the database. You can either let them run, or disable them by
commenting out the `agent` service block in the `docker-compose.yml` file:

```yaml
# agent:
#   image: absaoss/spline-spark-agent:${SPLINE_AGENT_VERSION}
#   network_mode: "bridge"
#   environment:
#     SPLINE_PRODUCER_URL: 'http://172.17.0.1:${SPLINE_REST_PORT}/producer'
#   links:
#     - spline
```
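
Alternatively, recent `docker-compose` versions (including the one installed above) should let you leave the file untouched and simply start the `agent` service with zero instances:

```shell
DOCKER_HOST_EXTERNAL=18.116.202.35 docker-compose up --scale agent=0
```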

When the containers are up, we can verify that the Spline Gateway and Spline UI servers are running by visiting the following URLs:

- http://18.116.202.35:8080/
- http://18.116.202.35:9090/

(Use the correct EC2 instance IP).
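
The same check can be scripted with `curl`; a plain probe of each root URL is enough to confirm that the servers respond (no specific endpoint is assumed here):

```shell
# Expect an HTTP status code such as 200 from each server
curl -s -o /dev/null -w "%{http_code}\n" http://18.116.202.35:8080/
curl -s -o /dev/null -w "%{http_code}\n" http://18.116.202.35:9090/
```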

---

Copyright 2019 ABSA Group Limited

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
Binary file added spline-on-AWS-demo-setup/img.png
Binary file added spline-on-AWS-demo-setup/img_1.png
Binary file added spline-on-AWS-demo-setup/img_2.png
Binary file added spline-on-AWS-demo-setup/img_3.png
Binary file added spline-on-AWS-demo-setup/img_4.png
Binary file added spline-on-AWS-demo-setup/img_5.png
Binary file added spline-on-AWS-demo-setup/img_6.png
Binary file added spline-on-AWS-demo-setup/img_7.png
169 changes: 169 additions & 0 deletions spline-on-databricks/README.md
@@ -0,0 +1,169 @@
Running Spline on Databricks
===

Spline is an open-source data lineage tracking tool that can help you capture data lineage of your various data pipelines.
See [Spline GitHub pages](https://absaoss.github.io/spline/) for details.

In this article, I will demonstrate how to create a minimal Spline setup and capture data lineage of Spark jobs running in a Databricks notebook.

## Preparation

### Install and launch Spline server components

First, we need to decide where we will run the Spline Gateway. The Spline Gateway is the server component responsible for storing and aggregating
lineage metadata captured by Spline agents. It is not strictly required though, as the Spline agent can capture and send lineage data in the Spline
format to any destination, including storing it on S3 or HDFS, or sending it to your custom REST API for further processing. However, to fully benefit
from all Spline features (like the Spline UI and other features that will come with future Spline versions), the Spline server needs to be installed.

The simplest way of doing that is with Docker. Since all Spline components are available as Docker images, you can run them in any environment that
supports Docker. The only requirement is that the Spline REST API must be accessible from the node where the Spark driver is executing. For the purpose
of this article, we will create a public AWS EC2 instance and run the Spline Docker containers there.

See [Spline on AWS - demo setup](../spline-on-AWS-demo-setup/README.md).

Make sure the Spline REST Gateway and the Spline UI servers are running, and take note of the _Spline Producer API URL_. It can be found on the Spline
Gateway index page.

![img.png](img.png)

### Prepare a Databricks account

You need to have a Databricks account. In this article, we will use a free account
on [Databricks Community Edition](https://community.cloud.databricks.com/login.html)

![img_1.png](img_1.png)

## Enable Spline on a Databricks cluster

Create a new cluster on the **Compute** page

![img_2.png](img_2.png)

Pick a desired Databricks runtime and take note of the selected Spark and Scala versions. Then, go to the **Spark** tab and add the required Spline
configuration parameters.

![img_4.png](img_4.png)

Here we instruct the Spline agent to use the embedded `http` lineage dispatcher and send the lineage data to our Spline Gateway. Use the _Producer API
URL_ copied in the previous step.

```properties
spark.spline.lineageDispatcher http
spark.spline.lineageDispatcher.http.producer.url http://18.116.202.35:8080/producer
```

You can optionally set the Spline mode to `REQUIRED` if you want Spline pre-flight check errors to be propagated to the Spark jobs. This is useful to
minimize the chance of Spark jobs completing without capturing lineage, for example due to a Spline misconfiguration.

```properties
spark.spline.mode REQUIRED
```

Refer to the [Spline agent configuration](https://github.com/AbsaOSS/spline-spark-agent#configuration) section for details about the other available
config parameters.

Now click **Create Cluster** and go to the **Libraries** tab, where we'll proceed with installing the Spline agent.

![img_5.png](img_5.png)

If you have a Spline agent JAR file you can upload it; otherwise, you can simply use Maven coordinates, and the agent will be downloaded automatically
from the Maven Central repository.

![img_6.png](img_6.png)

Click **Search Packages**, select **Maven Central** and type "_spline agent bundle_" into the query text field. You'll get a list of available Spline
agent bundles compiled for different Spark and Scala versions.

**Important**: Use a Spline agent bundle that matches the Spark and Scala version of the selected Databricks runtime.

![img_7.png](img_7.png)

Then click the **Install** button.

![img_8.png](img_8.png)

The cluster is ready to use, so we can create a new Notebook and start writing our test Spark job:

![img_9.png](img_9.png)

We're almost ready to run some Spark jobs. The last step is to enable lineage tracking on the Spark session.

```scala
import za.co.absa.spline.harvester.SparkLineageInitializer._

spark.enableLineageTracking()
```

This step has to be done once per Spark session. It could also be done by setting the `spark.sql.queryExecutionListeners` Spark property in the Spark
cluster configuration (see https://github.com/AbsaOSS/spline-spark-agent#initialization), but unfortunately that doesn't work on Databricks. When the
Databricks cluster is booting, the Spark session initializes _before_ the necessary agent library is actually installed on the cluster, resulting in
a `ClassNotFoundException`, and the cluster fails to start. The workaround is to call the `enableLineageTracking()` method explicitly. By the time
that method is called, the Spark session is already initialized and all the necessary classes are loaded.

Now, just run some Spark code as usual.

We'll use the following example, which consists of two jobs. First, let's create and save two sample files.

```scala
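// Note: Databricks notebooks pre-import spark.implicits._, which provides the
// .toDS extension used below; in a plain Spark application you would need to
// add `import spark.implicits._` yourself.
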
case class Student(id: Int, name: String, addrId: Int)

case class Address(id: Int, address: String)

Seq(
Student(111, "Amy Smith", 1),
Student(222, "Bob Brown", 2)
).toDS.write.mode("overwrite").parquet("/students")

Seq(
Address(1, "123 Park Ave, San Jose"),
Address(2, "456 Taylor St, Cupertino")
).toDS.write.mode("overwrite").parquet("/addresses")
```

In the next job, let's read those files, join them and write the result into another file using `append` mode:

```scala
val students = spark.read.parquet("/students")
val addresses = spark.read.parquet("/addresses")

students
.join(addresses)
.where(addresses("id") === students("addrId"))
.select("name", "address")
.write.mode("append").parquet("/student_names_with_addresses")
```

**Note**: The Spline agent only tracks persistent actions that result in data written to a file, a table or another persistent location. For example,
you will not see lineage for memory-only actions like `.show()` or `.collect()`.
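
For instance, reading the joined result back and displaying it will not produce a new execution event (a minimal illustration, reusing the paths from the example above):

```scala
// A memory-only action: the data is read and displayed but nothing is
// persisted, so the Spline agent emits no lineage event here.
spark.read.parquet("/student_names_with_addresses").show()
```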

To see the captured metadata, go to the Spline UI page.

![img_10.png](img_10.png)

If everything is done correctly, you should see three execution events that correspond to three writes in our example. To see the lineage overview of
the data produced by a particular execution event, click on the event name.

![img_11.png](img_11.png)

The graph above represents the high-level lineage of the data produced by the current execution event. It shows how the data actually flowed between
the data sources and which jobs were involved in the process.

To see the details of a particular job (execution plan), click the button on the corresponding node.

![img_12.png](img_12.png)

Here you can see which transformations have been applied to the data, the operation details, input/output data types, etc.

---

Copyright 2019 ABSA Group Limited

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
Binary file added spline-on-databricks/img.png
Binary file added spline-on-databricks/img_1.png
Binary file added spline-on-databricks/img_10.png
Binary file added spline-on-databricks/img_11.png
Binary file added spline-on-databricks/img_12.png
Binary file added spline-on-databricks/img_2.png
Binary file added spline-on-databricks/img_3.png
Binary file added spline-on-databricks/img_4.png
Binary file added spline-on-databricks/img_5.png
Binary file added spline-on-databricks/img_6.png
Binary file added spline-on-databricks/img_7.png
Binary file added spline-on-databricks/img_8.png
Binary file added spline-on-databricks/img_9.png