# Data engineering with Databricks - Building our C360 database

Building a C360 database requires ingesting multiple data sources.  

It's a complex process requiring batch loads and streaming ingestion to support real-time insights, used for personalization and marketing targeting, among other use cases.

Ingesting, transforming and cleaning data to create clean SQL tables for our downstream users (Data Analysts and Data Scientists) is complex.

<link href="https://fonts.googleapis.com/css?family=DM Sans" rel="stylesheet"/>
<div style="width:300px; text-align: center; float: right; margin: 30px 60px 10px 10px;  font-family: 'DM Sans'">
  <div style="height: 250px; width: 300px;  display: table-cell; vertical-align: middle; border-radius: 50%; border: 25px solid #fcba33ff;">
    <div style="font-size: 70px;  color: #70c4ab; font-weight: bold">
      73%
    </div>
    <div style="color: #1b5162;padding: 0px 30px 0px 30px;">of enterprise data goes unused for analytics and decision making</div>
  </div>
  <div style="color: #bfbfbf; padding-top: 5px">Source: Forrester</div>
</div>

<br>

## <img src="https://github.com/databricks-demos/dbdemos-resources/raw/main/images/de.png" style="float:left; margin: -35px 0px 0px 0px" width="80px"> John, as a Data Engineer, spends an immense amount of time…


* Hand-coding data ingestion & transformations and dealing with technical challenges:<br>
  *Supporting streaming and batch, handling concurrent operations, small-file issues, GDPR requirements, complex DAG dependencies...*<br><br>
* Building custom frameworks to enforce quality and tests<br><br>
* Building and maintaining scalable infrastructure, with observability and monitoring<br><br>
* Managing incompatible governance models from different systems
<br style="clear: both">

This results in **operational complexity** and overhead, requires expert profiles, and ultimately **puts data projects at risk**.

# Simplify Ingestion and Transformation with Delta Live Tables

<img style="float: right" width="500px" src="https://github.com/databricks-demos/dbdemos-resources/raw/main/images/retail/lakehouse-churn/lakehouse-retail-c360-churn-1.png" />

In this notebook, we'll work as a Data Engineer to build our C360 database. <br>
We'll consume and clean our raw data sources to prepare the tables required for our BI & ML workloads.

We have 3 data sources sending new files to our blob storage (`/demos/retail/churn/`) and we want to incrementally load this data into our data warehousing tables:

- Customer profile data *(name, age, address etc.)*
- Orders history *(what our customers bought over time)*
- Streaming Events from our application *(when was the last time customers used the application, typically a stream from a Kafka queue)*


Databricks simplifies this task with Delta Live Tables (DLT) by making data engineering accessible to all.

DLT allows Data Analysts to create advanced pipelines with plain SQL.

## Delta Live Tables: A simple way to build and manage data pipelines for fresh, high-quality data!

<div>
  <div style="width: 45%; float: left; margin-bottom: 10px; padding-right: 45px">
    <p>
      <img style="width: 50px; float: left; margin: 0px 5px 30px 0px;" src="https://raw.githubusercontent.com/QuentinAmbard/databricks-demo/main/retail/resources/images/lakehouse-retail/logo-accelerate.png"/> 
      <strong>Accelerate ETL development</strong> <br/>
      Enable analysts and data engineers to innovate rapidly with simple pipeline development and maintenance 
    </p>
    <p>
      <img style="width: 50px; float: left; margin: 0px 5px 30px 0px;" src="https://raw.githubusercontent.com/QuentinAmbard/databricks-demo/main/retail/resources/images/lakehouse-retail/logo-complexity.png"/> 
      <strong>Remove operational complexity</strong> <br/>
      By automating complex administrative tasks and gaining broader visibility into pipeline operations
    </p>
  </div>
  <div style="width: 48%; float: left">
    <p>
      <img style="width: 50px; float: left; margin: 0px 5px 30px 0px;" src="https://raw.githubusercontent.com/QuentinAmbard/databricks-demo/main/retail/resources/images/lakehouse-retail/logo-trust.png"/> 
      <strong>Trust your data</strong> <br/>
      With built-in quality controls and quality monitoring to ensure accurate and useful BI, Data Science, and ML 
    </p>
    <p>
      <img style="width: 50px; float: left; margin: 0px 5px 30px 0px;" src="https://raw.githubusercontent.com/QuentinAmbard/databricks-demo/main/retail/resources/images/lakehouse-retail/logo-stream.png"/> 
      <strong>Simplify batch and streaming</strong> <br/>
      With self-optimization and auto-scaling data pipelines for batch or streaming processing 
    </p>
</div>
</div>

<br style="clear:both">

<img src="https://pages.databricks.com/rs/094-YMS-629/images/delta-lake-logo.png" style="float: right;" width="200px">

## Delta Lake

All the tables we'll create in the Lakehouse will be stored as Delta Lake tables. Delta Lake is an open storage framework for reliability and performance.<br>
It provides many features (ACID transactions, DELETE/UPDATE/MERGE, zero-copy clone, Change Data Capture...).<br>
For more details on Delta Lake, run `dbdemos.install('delta-lake')`.


## Re-building the Data Engineering pipeline with Delta Live Tables
In this example, we will use DLT to re-implement the pipeline we just created.

### Examine the source.
A DLT pipeline can be implemented either in SQL or in Python.
* [DLT pipeline definition in SQL]($./01.2 - Delta Live Tables - SQL)
* [DLT pipeline definition in Python]($./01.2 - Delta Live Tables - Python)

### Define the pipeline
Use the UI to define the pipeline:
* Go to **Workflows / Delta Live Tables / Create Pipeline**
* Specify **churn_data_pipeline** as the name of the pipeline
* As the source, specify **one** of the above notebooks. **Either** the SQL or the Python one will work.
* Specify the parameters for the DLT job with the values below:

In [0]:
%run ./includes/SetupLab

In [0]:
# Print the database/schema and storage location to specify for the DLT pipeline (hive metastore tables with data on DBFS)
print("Specify the following database/schema when defining the DLT pipeline:\n" + databaseForDLT + "\n")
print("Specify the following storage location for the DLT pipeline tables:\n" + dltPipelinesOutputDataDirectory + "\n")

Specify the following database/schema when defining the DLT pipeline:
odl_user_1237583_databrickslabs_com_retail_dlt

Specify the following storage location for the DLT pipeline tables:
/Users/odl_user_1237583_databrickslabs_com/retail/dlt_pipelines
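The values printed above are the ones entered in the pipeline creation form. As a rough sketch, the same choices can be written down as a pipeline settings document. This assumes the JSON settings fields `name`, `target`, `storage`, `libraries` and `continuous`; `build_pipeline_settings` is an illustrative helper (not part of any Databricks library), and the notebook path is a hypothetical placeholder to replace with the real workspace path of the chosen notebook.

```python
import json


def build_pipeline_settings(name, target_schema, storage_location, notebook_path):
    """Assemble a DLT pipeline settings dictionary (field names are assumptions)."""
    return {
        "name": name,
        "target": target_schema,        # database/schema receiving the tables
        "storage": storage_location,    # where pipeline tables and metadata are stored
        "libraries": [{"notebook": {"path": notebook_path}}],
        "continuous": False,            # triggered run rather than always-on
    }


settings = build_pipeline_settings(
    name="churn_data_pipeline",
    target_schema="odl_user_1237583_databrickslabs_com_retail_dlt",
    storage_location="/Users/odl_user_1237583_databrickslabs_com/retail/dlt_pipelines",
    # Hypothetical path: use the actual location of the SQL or Python notebook.
    notebook_path="/path/to/01.2 - Delta Live Tables - SQL",
)
print(json.dumps(settings, indent=2))
```

Keeping the settings in one place like this makes it easier to compare what was typed in the UI against the values the setup cell printed.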



## Run the following after having set up and run the DLT job

In [0]:
sqlStatement = "select count(*) from hive_metastore." + databaseForDLT + ".churn_features"
print("Executing:\n" + sqlStatement)
display(spark.sql(sqlStatement))

Executing:
select count(*) from hive_metastore.odl_user_1237583_databrickslabs_com_retail_dlt.churn_features


---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
File <command-3235179284848391>:3
      1 sqlStatement = "select count(*) from hive_metastore." + databaseForDLT + ".churn_features"
      2 print("Executing:\n" + sqlStatement)
----> 3 display(spark.sql(sqlStatement))

File /databricks/spark/python/pyspark/instrumentation_utils.py:48, in _wrap_function.<locals>.wrapper(*args, **kwargs)
     46 start = time.perf_counter()
     47 try:
---> 48     res ...
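The traceback above is truncated, so the exact message is not visible. An `AnalysisException` on a query like this one is often raised when the table cannot be resolved, for example if the DLT pipeline has not completed yet. A minimal sketch of a guard, assuming a recent PySpark where `spark.catalog.tableExists` accepts a catalog-qualified name; `count_rows_if_table_exists` is a hypothetical helper, not part of the lab setup:

```python
def count_rows_if_table_exists(spark, table_name):
    """Return the row count of table_name, or None if the table is missing.

    `spark` is the active SparkSession; `table_name` is a fully qualified
    name such as "hive_metastore.<database>.churn_features".
    """
    if not spark.catalog.tableExists(table_name):
        return None
    # count(*) returns a single row with a single column
    return spark.sql(f"select count(*) from {table_name}").collect()[0][0]


# Usage inside the notebook (requires the `spark` session and the
# `databaseForDLT` variable defined by the setup cell):
# count_rows_if_table_exists(spark, "hive_metastore." + databaseForDLT + ".churn_features")
```

A `None` result would then suggest checking that the pipeline ran and that the database/schema specified in the pipeline matches the one printed earlier.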

In [0]:
# Scroll the output to verify the storage location of the table
sqlStatement = "DESCRIBE EXTENDED hive_metastore." + databaseForDLT + ".churn_features"
print("Executing:\n" + sqlStatement)
display(spark.sql(sqlStatement))



In [0]:
sqlStatement = "DESCRIBE HISTORY hive_metastore." + databaseForDLT + ".churn_features"
print("Executing:\n" + sqlStatement)
display(spark.sql(sqlStatement))



## Rerun the DLT pipeline
As no new data has been uploaded to the blob storage, only the last table will be recalculated.
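One way to picture this behavior is a toy model in plain Python. It assumes the ingestion step only picks up files it has not seen before, while the final table is rebuilt on every run. This is a conceptual sketch, not the DLT API: `MiniPipeline`, the file name and the rows are all hypothetical.

```python
class MiniPipeline:
    """Toy pipeline: incremental ingestion plus a fully recomputed final table."""

    def __init__(self):
        self.processed_files = set()  # checkpoint of files already ingested
        self.ingested_rows = []       # incrementally loaded table
        self.final_table = []         # recomputed from scratch on each run

    def run(self, landing_zone):
        """landing_zone maps a file name to the list of rows it contains."""
        new_files = sorted(set(landing_zone) - self.processed_files)
        for name in new_files:
            self.ingested_rows.extend(landing_zone[name])
            self.processed_files.add(name)
        # The final table is always rebuilt, even when nothing new arrived.
        self.final_table = [
            row for row in self.ingested_rows if row.get("customer_id") is not None
        ]
        return {"new_files": len(new_files), "final_rows": len(self.final_table)}


landing = {"orders_1.json": [{"customer_id": 1}, {"customer_id": None}]}
pipeline = MiniPipeline()
first = pipeline.run(landing)   # ingests the file, then builds the final table
second = pipeline.run(landing)  # nothing new to ingest, final table rebuilt
print(first)
print(second)
```

On the second run no file is ingested, yet the final table is still recomputed, which mirrors what the table history should show after rerunning the pipeline.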

In [0]:
sqlStatement = "select count(*) from hive_metastore." + databaseForDLT + ".churn_features"
print("Executing:\n" + sqlStatement)
display(spark.sql(sqlStatement))



In [0]:
sqlStatement = "DESCRIBE HISTORY hive_metastore." + databaseForDLT + ".churn_features"
print("Executing:\n" + sqlStatement)
display(spark.sql(sqlStatement))



### Next up
[Build and train a Machine Learning model]($./02 - Machine Learning with MLflow)