diff --git a/docker/.env b/docker/.env
index c0050e9..4da7ee6 100644
--- a/docker/.env
+++ b/docker/.env
@@ -1,7 +1,7 @@
 # component versions
 ARANGO_DB_VERSION=3.7.10
-SPLINE_CORE_VERSION=0.6.0
+SPLINE_CORE_VERSION=0.6.1
 SPLINE_AGENT_VERSION=0.6.1
 SPLINE_UI_VERSION=0.6.0
diff --git a/spline-on-AWS-demo-setup/README.md b/spline-on-AWS-demo-setup/README.md
new file mode 100644
index 0000000..fe78bba
--- /dev/null
+++ b/spline-on-AWS-demo-setup/README.md
@@ -0,0 +1,150 @@
+Setting up Spline Server on AWS EC2
+===
+
+Spline is an open-source data lineage tracking tool that can help you capture data lineage for your various data pipelines.
+See [Spline GitHub pages](https://absaoss.github.io/spline/) for details.
+
+The purpose of this article is to demonstrate the basic steps needed to install Spline Server on AWS EC2.
+
+## Disclaimer
+
+This is **NOT** a production setup guide!
+
+The approach described below is just enough for demo and trial purposes, but it doesn't cover the majority of aspects that need to be considered when
+setting up a real production environment.
+
+## Prerequisites
+
+You need to have an AWS account.
+
+## Create and launch an EC2 instance
+
+Open your [AWS Console](https://console.aws.amazon.com/) and select **Launch instance**.
+
+![img.png](img.png)
+
+![img_1.png](img_1.png)
+
+Select an Amazon Machine Image of your choice (we'll use a default Amazon Linux 2 AMI).
+
+![img_2.png](img_2.png)
+
+When choosing an instance type, consider the amount of RAM and disk space. You need to run three Docker containers in total:
+
+- [ArangoDB](https://hub.docker.com/_/arangodb) - this is where the lineage data will be stored.
+- [Spline REST Gateway](https://hub.docker.com/r/absaoss/spline-rest-server) - a Java application that exposes an API for Spline agents and the Spline UI.
+  It runs on a Tomcat server and can be memory intensive. (Alternatively, you can use the
+  [Spline Kafka Gateway](https://hub.docker.com/r/absaoss/spline-kafka-server) instead of the REST one, but this is beyond the scope of this article.)
+- [Spline UI](https://github.com/AbsaOSS/spline-ui) - a lightweight HTTP server that is only used for serving the static resources required by the Spline
+  UI. Spline UI is implemented as a Single Page Application (SPA) that runs entirely within the browser and communicates directly with the Spline Gateway via
+  the REST API. It does not route any additional HTTP traffic through its own server.
+
+For demonstration purposes we'll run all three containers on the same EC2 instance, so we'll pick a `t2.medium` instance with 4 GB RAM and 2 vCPUs.
+
+![img_3.png](img_3.png)
+
+On the **Review Instance** page, check all the necessary details. Pay special attention to the security group - the instance needs to be open for
+public access. You also need to open two custom TCP ports - one for the REST API and another for the Spline UI.
+
+![img_4.png](img_4.png)
+
+We'll use port `8080` for the Spline REST API and port `9090` for the Spline UI.
+
+Then, we can review and launch our instance.
+
+![img_5.png](img_5.png)
+
+As a final step, you'll be asked to create or select a key pair for accessing the instance via SSH. Follow the AWS instructions.
+
+![img_6.png](img_6.png)
+
+Take note of the launched instance's public IP; we'll use it throughout the rest of the article.
+
+![img_7.png](img_7.png)
+
+## Set up Spline
+
+Open an SSH client and log into the instance.
+
+```shell
+ssh -i ~/.pem/spline-aws.pem ec2-user@18.116.202.35
+```
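+
+If SSH refuses to use the key because its permissions are too open, tighten them first. This is a standard step for AWS key pairs (the path below is the example key location used in this article):
+
+```shell
+# Private key files must be readable only by you, otherwise ssh rejects them
+chmod 400 ~/.pem/spline-aws.pem
+```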
+
+Then install and start the Docker service.
+
+```shell
+sudo yum install docker -y
+sudo systemctl enable docker.service
+sudo systemctl start docker.service
+sudo usermod -a -G docker ec2-user
+```
+
+Log out and log back in so that the newly added `docker` group membership takes effect.
+
+Now we can pull and run the Spline containers. You can do it one by one, or use `docker-compose` to run
+a [preconfigured demo setup](https://github.com/AbsaOSS/spline-getting-started/tree/main/docker).
+
+If you want to run individual containers, see the [Step by step instructions](https://absaoss.github.io/spline/#step-by-step).
+
+For the purpose of this article, we will use Docker Compose.
+
+Install Docker Compose:
+
+```shell
+sudo curl -L https://github.com/docker/compose/releases/download/1.21.0/docker-compose-`uname -s`-`uname -m` | sudo tee /usr/local/bin/docker-compose > /dev/null
+sudo chmod +x /usr/local/bin/docker-compose
+```
+
+Download the Spline demo Docker Compose config files:
+
+```shell
+mkdir spline
+cd spline
+
+wget https://raw.githubusercontent.com/AbsaOSS/spline-getting-started/main/docker/docker-compose.yml
+wget https://raw.githubusercontent.com/AbsaOSS/spline-getting-started/main/docker/.env
+```
+
+Run `docker-compose` as shown below. `DOCKER_HOST_EXTERNAL` is the external IP of this EC2 instance. This IP will be passed to the Spline UI and used by
+the client browser to connect to the Spline REST API.
+
+```shell
+DOCKER_HOST_EXTERNAL=18.116.202.35 docker-compose up
+```
+
+The given Docker Compose config also runs a set of Spark examples to pre-populate the database. You can either ignore them or disable them by
+commenting out the `agent` service block in the `docker-compose.yml` file:
+
+```yaml
+#  agent:
+#    image: absaoss/spline-spark-agent:${SPLINE_AGENT_VERSION}
+#    network_mode: "bridge"
+#    environment:
+#      SPLINE_PRODUCER_URL: 'http://172.17.0.1:${SPLINE_REST_PORT}/producer'
+#    links:
+#      - spline
+```
+
+When the containers are up, we can verify that the Spline Gateway and Spline UI servers are running by visiting the following URLs:
+
+- http://18.116.202.35:8080/
+- http://18.116.202.35:9090/
+
+(Use the correct EC2 instance IP.)
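+
+You can also check both endpoints from the shell; each should return an HTTP `200` (again, replace the IP with your instance's address):
+
+```shell
+# Print only the HTTP status code of each response
+curl -s -o /dev/null -w "%{http_code}\n" http://18.116.202.35:8080/
+curl -s -o /dev/null -w "%{http_code}\n" http://18.116.202.35:9090/
+```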
+
+---
+
+    Copyright 2019 ABSA Group Limited
+
+    Licensed under the Apache License, Version 2.0 (the "License");
+    you may not use this file except in compliance with the License.
+    You may obtain a copy of the License at
+
+        http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
diff --git a/spline-on-AWS-demo-setup/img.png b/spline-on-AWS-demo-setup/img.png
new file mode 100644
index 0000000..b879b20
Binary files /dev/null and b/spline-on-AWS-demo-setup/img.png differ
diff --git a/spline-on-AWS-demo-setup/img_1.png b/spline-on-AWS-demo-setup/img_1.png
new file mode 100644
index 0000000..e9feb11
Binary files /dev/null and b/spline-on-AWS-demo-setup/img_1.png differ
diff --git a/spline-on-AWS-demo-setup/img_2.png b/spline-on-AWS-demo-setup/img_2.png
new file mode 100644
index 0000000..a7c30e1
Binary files /dev/null and b/spline-on-AWS-demo-setup/img_2.png differ
diff --git a/spline-on-AWS-demo-setup/img_3.png b/spline-on-AWS-demo-setup/img_3.png
new file mode 100644
index 0000000..4688e4b
Binary files /dev/null and b/spline-on-AWS-demo-setup/img_3.png differ
diff --git a/spline-on-AWS-demo-setup/img_4.png b/spline-on-AWS-demo-setup/img_4.png
new file mode 100644
index 0000000..9210505
Binary files /dev/null and b/spline-on-AWS-demo-setup/img_4.png differ
diff --git a/spline-on-AWS-demo-setup/img_5.png b/spline-on-AWS-demo-setup/img_5.png
new file mode 100644
index 0000000..2aa606a
Binary files /dev/null and b/spline-on-AWS-demo-setup/img_5.png differ
diff --git a/spline-on-AWS-demo-setup/img_6.png b/spline-on-AWS-demo-setup/img_6.png
new file mode 100644
index 0000000..3b8fb40
Binary files /dev/null and b/spline-on-AWS-demo-setup/img_6.png differ
diff --git a/spline-on-AWS-demo-setup/img_7.png b/spline-on-AWS-demo-setup/img_7.png
new file mode 100644
index 0000000..997b7cf
Binary files /dev/null and b/spline-on-AWS-demo-setup/img_7.png differ
diff --git a/spline-on-databricks/README.md b/spline-on-databricks/README.md
new file mode 100644
index 0000000..83ba8f0
--- /dev/null
+++ b/spline-on-databricks/README.md
@@ -0,0 +1,169 @@
+Running Spline on Databricks
+===
+
+Spline is an open-source data lineage tracking tool that can help you capture the data lineage of your various data pipelines.
+See [Spline GitHub pages](https://absaoss.github.io/spline/) for details.
+
+In this article, I will demonstrate how to create a minimal Spline setup and capture the data lineage of Spark jobs running in a Databricks notebook.
+
+## Preparation
+
+### Install and launch Spline server components
+
+First, we need to decide where we will run a Spline Gateway. The Spline Gateway is the server component responsible for storing and aggregating the
+lineage metadata captured by Spline agents. It is not strictly required though, as the Spline agent can capture and send lineage data in the Spline
+format to any destination, including S3, HDFS, or a custom REST API for further processing. However, to fully benefit from all Spline features (like
+the Spline UI and other features that will come with future Spline versions), the Spline server needs to be installed.
+
+The simplest way of doing it is using Docker. Since all Spline components are available as Docker images, you can run them in any environment that
+supports Docker. The only requirement is that the Spline REST API must be accessible from the node where the Spark driver is executing. For the
+purpose of this article, we will create a public AWS EC2 instance and run the Spline Docker containers there.
+
+See [Spline on AWS - demo setup](../spline-on-AWS-demo-setup/README.md) for instructions.
+Make sure the Spline REST Gateway and the Spline UI servers are running, and take note of the _Spline Producer API URL_. It can be found on the Spline
+Gateway index page, shown below.
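+
+With the demo setup from that article, the Producer API URL follows this pattern (the concrete URL used throughout this article is
+`http://18.116.202.35:8080/producer`; substitute your own instance IP):
+
+```
+http://<EC2-instance-public-IP>:8080/producer
+```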
+
+![img.png](img.png)
+
+### Prepare a Databricks account
+
+You need to have a Databricks account. In this article, we will use a free account
+on [Databricks Community Edition](https://community.cloud.databricks.com/login.html).
+
+![img_1.png](img_1.png)
+
+## Enable Spline on a Databricks cluster
+
+Create a new cluster on the **Compute** page.
+
+![img_2.png](img_2.png)
+
+Pick the desired Databricks runtime and take note of the selected Spark and Scala versions. Then, go to the **Spark** tab and add the required Spline
+configuration parameters.
+
+![img_4.png](img_4.png)
+
+Here we instruct the Spline agent to use the embedded `http` lineage dispatcher and send the lineage data to our Spline Gateway. Use the _Producer API
+URL_ copied in the previous step.
+
+```yaml
+spark.spline.lineageDispatcher http
+spark.spline.lineageDispatcher.http.producer.url http://18.116.202.35:8080/producer
+```
+
+You can optionally set the Spline mode to `REQUIRED` if you want Spline pre-flight check errors to be propagated to the Spark jobs. This is useful to
+minimize the chance of Spark jobs completing without their lineage being captured, for example due to a Spline misconfiguration.
+
+```yaml
+spark.spline.mode REQUIRED
+```
+
+Refer to the [Spline agent configuration](https://github.com/AbsaOSS/spline-spark-agent#configuration) section for details about the other available
+config parameters.
+
+Now click **Create Cluster** and go to the **Libraries** tab, where we'll proceed with installing the Spline agent.
+
+![img_5.png](img_5.png)
+
+If you have a Spline agent JAR file you can upload it; otherwise, you can simply use Maven coordinates, so the agent will be downloaded automatically
+from the Maven Central repository.
+
+![img_6.png](img_6.png)
+
+Click **Search Packages**, select **Maven Central** and type "_spline agent bundle_" into the query text field. You'll get a list of available Spline
+agent bundles compiled for different Spark and Scala versions.
+
+**Important**: Use a Spline agent bundle that matches the Spark and Scala versions of the selected Databricks runtime.
+
+![img_7.png](img_7.png)
+
+Then click the **Install** button.
+
+![img_8.png](img_8.png)
+
+The cluster is ready to use, so we can create a new notebook and start writing our test Spark job:
+
+![img_9.png](img_9.png)
+
+We're almost ready to run some Spark jobs. The last step we need to do is to enable lineage tracking on the Spark session.
+
+```scala
+import za.co.absa.spline.harvester.SparkLineageInitializer._
+
+spark.enableLineageTracking()
+```
+
+This step has to be done once per Spark session. It could also be done by setting the `spark.sql.queryExecutionListeners` Spark property in the Spark
+cluster configuration (see https://github.com/AbsaOSS/spline-spark-agent#initialization), but unfortunately that doesn't work on Databricks. When the
+Databricks cluster is booting, the Spark session initializes _before_ the necessary agent library is actually installed on the cluster, resulting in
+a `ClassNotFoundError`, and the cluster fails to start. The workaround is to call the `enableLineageTracking()` method explicitly. By the time that
+method is called, the Spark session is already initialized and all the necessary classes are loaded.
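+
+For comparison, here is what the codeless, listener-based initialization could look like on a plain (non-Databricks) Spark installation. This is a
+minimal sketch, assuming a Spark 3.0 / Scala 2.12 agent bundle and the demo gateway used in this article; `my-job.jar` stands for a hypothetical
+application jar:
+
+```shell
+spark-submit \
+  --packages za.co.absa.spline.agent.spark:spark-3.0-spline-agent-bundle_2.12:0.6.1 \
+  --conf "spark.sql.queryExecutionListeners=za.co.absa.spline.harvester.listener.SplineQueryExecutionListener" \
+  --conf "spark.spline.lineageDispatcher=http" \
+  --conf "spark.spline.lineageDispatcher.http.producer.url=http://18.116.202.35:8080/producer" \
+  my-job.jar
+```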
+
+Now, just run some Spark code as usual.
+
+We'll use the following example, which consists of two jobs. First, let's create and save two sample files:
+
+```scala
+case class Student(id: Int, name: String, addrId: Int)
+
+case class Address(id: Int, address: String)
+
+Seq(
+  Student(111, "Amy Smith", 1),
+  Student(222, "Bob Brown", 2)
+).toDS.write.mode("overwrite").parquet("/students")
+
+Seq(
+  Address(1, "123 Park Ave, San Jose"),
+  Address(2, "456 Taylor St, Cupertino")
+).toDS.write.mode("overwrite").parquet("/addresses")
+```
+
+In the next job, let's read those files, join them, and write the result into another file using `append` mode:
+
+```scala
+val students = spark.read.parquet("/students")
+val addresses = spark.read.parquet("/addresses")
+
+students
+  .join(addresses)
+  .where(addresses("id") === students("addrId"))
+  .select("name", "address")
+  .write.mode("append").parquet("/student_names_with_addresses")
+```
+
+**Note**: The Spline agent only tracks persistent actions that result in data being written to a file, a table, or another persistent location. For
+example, you will not see the lineage of memory-only actions like `.show()` or `.collect()`.
+
+To see the captured metadata, go to the Spline UI page.
+
+![img_10.png](img_10.png)
+
+If everything is done correctly, you should see three execution events that correspond to the three writes in our example. To see the lineage
+overview of the data produced by a particular execution event, click on the event name.
+
+![img_11.png](img_11.png)
+
+The graph above represents the high-level lineage of the data produced by the current execution event. It shows how the data actually flowed between
+the data sources and which jobs were involved in the process.
+
+To see the details of a particular job (execution plan), click the button on the corresponding node.
+
+![img_12.png](img_12.png)
+
+Here you can see what transformations have been applied to the data, the operation details, the input/output data types, etc.
+
+---
+
+    Copyright 2019 ABSA Group Limited
+
+    Licensed under the Apache License, Version 2.0 (the "License");
+    you may not use this file except in compliance with the License.
+    You may obtain a copy of the License at
+
+        http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
diff --git a/spline-on-databricks/img.png b/spline-on-databricks/img.png
new file mode 100644
index 0000000..aaaab14
Binary files /dev/null and b/spline-on-databricks/img.png differ
diff --git a/spline-on-databricks/img_1.png b/spline-on-databricks/img_1.png
new file mode 100644
index 0000000..5890f07
Binary files /dev/null and b/spline-on-databricks/img_1.png differ
diff --git a/spline-on-databricks/img_10.png b/spline-on-databricks/img_10.png
new file mode 100644
index 0000000..8e820cc
Binary files /dev/null and b/spline-on-databricks/img_10.png differ
diff --git a/spline-on-databricks/img_11.png b/spline-on-databricks/img_11.png
new file mode 100644
index 0000000..a61cbf6
Binary files /dev/null and b/spline-on-databricks/img_11.png differ
diff --git a/spline-on-databricks/img_12.png b/spline-on-databricks/img_12.png
new file mode 100644
index 0000000..9c1214e
Binary files /dev/null and b/spline-on-databricks/img_12.png differ
diff --git a/spline-on-databricks/img_2.png b/spline-on-databricks/img_2.png
new file mode 100644
index 0000000..e9a5e6c
Binary files /dev/null and b/spline-on-databricks/img_2.png differ
diff --git a/spline-on-databricks/img_3.png b/spline-on-databricks/img_3.png
new file mode 100644
index 0000000..bf5146e
Binary files /dev/null and b/spline-on-databricks/img_3.png differ
diff --git a/spline-on-databricks/img_4.png b/spline-on-databricks/img_4.png
new file mode 100644
index 0000000..70aeba9
Binary files /dev/null and b/spline-on-databricks/img_4.png differ
diff --git a/spline-on-databricks/img_5.png b/spline-on-databricks/img_5.png
new file mode 100644
index 0000000..655f8ea
Binary files /dev/null and b/spline-on-databricks/img_5.png differ
diff --git a/spline-on-databricks/img_6.png b/spline-on-databricks/img_6.png
new file mode 100644
index 0000000..e65d6ee
Binary files /dev/null and b/spline-on-databricks/img_6.png differ
diff --git a/spline-on-databricks/img_7.png b/spline-on-databricks/img_7.png
new file mode 100644
index 0000000..b6ace98
Binary files /dev/null and b/spline-on-databricks/img_7.png differ
diff --git a/spline-on-databricks/img_8.png b/spline-on-databricks/img_8.png
new file mode 100644
index 0000000..cef02a2
Binary files /dev/null and b/spline-on-databricks/img_8.png differ
diff --git a/spline-on-databricks/img_9.png b/spline-on-databricks/img_9.png
new file mode 100644
index 0000000..a3ba903
Binary files /dev/null and b/spline-on-databricks/img_9.png differ