
### 📁 Step 1: Creating Schemas to Organize Your Data

In any professional data architecture — especially when using **Delta Live Tables (DLT)** — it’s critical to separate raw ingested data from modeled data for **clarity, governance, and scalability**.  
This is where **schemas** come in. Think of them as folders that help you structure your data warehouse layers.

---

#### 🗂️ Why We Create Separate Schemas:

| 🔸 Schema        | 📌 Purpose |
|------------------|------------|
| `raw_schema`    | Stores **raw data** (aka. *staging/potato layer*) |
| `lake_schema`    | Stores **cleaned data** (aka. *cleaned/bronze layer with metadata columns*) |
| `hub_schema`     | Stores **modeled and enriched data** (aka. *silver/star schema*, dimensions, facts) |

✅ This separation makes it easier to:
- Apply different data quality expectations 🧪  
- Track lineage 🔄  
- Manage permissions 🔐  
- Keep your modeling layer clean and focused 🎯  

---

### 🛠️ SQL Commands to Create the Schemas:

After running the following script, check your Unity Catalog; you should see three new, empty schemas.


In [0]:
CREATE SCHEMA IF NOT EXISTS hub_schema;
CREATE SCHEMA IF NOT EXISTS lake_schema;
CREATE SCHEMA IF NOT EXISTS raw_schema;
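
The same statements can also be generated from a single list of layer names, which keeps the creation script and the cleanup script at the end of this notebook in sync. This is a minimal sketch, assuming a notebook where `spark` is available; the names `LAYERS`, `create_schema_sql`, `drop_schema_sql`, and `run_all` are illustrative and not part of the training material.

```python
# Illustrative helpers (not from the training notebooks): build the schema
# DDL from one list so that creation and cleanup cannot drift apart.
LAYERS = ["raw_schema", "lake_schema", "hub_schema"]


def create_schema_sql(schemas):
    """Return one idempotent CREATE SCHEMA statement per schema name."""
    return [f"CREATE SCHEMA IF NOT EXISTS {name}" for name in schemas]


def drop_schema_sql(schemas):
    """Return DROP statements in reverse order; CASCADE also drops the tables."""
    return [f"DROP SCHEMA IF EXISTS {name} CASCADE" for name in reversed(schemas)]


def run_all(statements, execute):
    """Pass each statement to `execute`, for example `spark.sql` in a notebook."""
    for statement in statements:
        execute(statement)


# In a Databricks notebook (assumption: `spark` is the active session):
# run_all(create_schema_sql(LAYERS), spark.sql)
```

Passing the executor as an argument keeps the helpers usable outside a cluster, for example with `print` to preview the statements before running them.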

### 🛠️ Step 2: Ingest Data from GitHub
This exercise works with data ingested from GitHub (the `fixtures` folder).
The following code loads each file in its raw form into the raw layer.

In [0]:
%python
import pandas as pd

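# Source tables to load; each one maps to <table>.csv in the fixtures folder.
table_list = ["customer", "product", "sales"]

for table in table_list:
    url = f"https://raw.githubusercontent.com/VladisKliman/Databricks_Developer_Training/main/fixtures/{table}.csv"
    # Read the CSV with pandas, then convert it to a Spark DataFrame.
    pdf = pd.read_csv(url)
    df = spark.createDataFrame(pdf)
    # Overwrite the raw Delta table so this cell can be re-run safely.
    df.write.format("delta").mode("overwrite").saveAsTable(f"raw_schema.raw_{table}")

    # Enable the change data feed so downstream steps can read row-level changes.
    spark.sql(f"ALTER TABLE raw_schema.raw_{table} SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")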
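
The `ALTER TABLE` statement turns on the Delta change data feed, which lets downstream readers fetch row-level changes instead of full snapshots. Below is a minimal sketch of reading that feed, assuming a Databricks notebook where `spark` is available; the helper name `read_changes` is hypothetical. Because the property is set after the initial write, the starting version most likely has to be at or after the version in which the feed was enabled.

```python
def read_changes(spark, table_name, starting_version):
    """Return a DataFrame of row-level changes for a CDF-enabled Delta table.

    `starting_version` should not predate the table version in which
    delta.enableChangeDataFeed was set, otherwise the read is expected to fail.
    """
    return (
        spark.read
        .option("readChangeFeed", "true")             # request changes, not a snapshot
        .option("startingVersion", starting_version)  # first table version to include
        .table(table_name)
    )


# Example call in a notebook (the version number is an assumption;
# check DESCRIBE HISTORY on the table to find the right one):
# changes = read_changes(spark, "raw_schema.raw_customer", starting_version=1)
# changes.select("_change_type", "_commit_version").show()
```

The result carries the table's own columns plus the feed's metadata columns such as `_change_type` and `_commit_version`.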


### ▶️ Step 3: Building the Star Schema

Now that we have defined our data model using Delta Live Tables (DLT), let’s walk through what we will build in the next steps and clarify the design:

---

### 🧩 DLT Table Structure

We defined the following DLT tables:

#### 🔹 Lake Layer (Cleaned):
- `lake_schema.customers`
- `lake_schema.products`
- `lake_schema.sales`

#### 🌟 Hub Layer (Modeled):
- `hub_schema.dim_customers` — Customer dimension with a **Primary Key**.
- `hub_schema.dim_products` — Product dimension with a **Primary Key**.
- `hub_schema.fact_sales_star` — Fact table with denormalized dimension data and a **row_key as Primary Key**.

> 💡 In Unity Catalog, we defined **Primary Keys** directly in the table declarations using `PRIMARY KEY (...)`. After running the pipeline, open Unity Catalog to see the newly created tables.

---

### ⚠️ Foreign Key Limitation

Due to current [Unity Catalog](https://docs.databricks.com/en/data-governance/unity-catalog/index.html) limitations:
- **Foreign keys are not yet fully enforced**, even though we can define them in table metadata.
- That's why we will **manually enforce referential integrity** in an additional step, using a SQL script adjustment after the tables are built.
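
One way to check referential integrity by hand is an anti-join that lists fact rows whose key has no match in the dimension. This is a minimal sketch, assuming the fact table carries a `customer_fk` column that points at `row_key` in the customer dimension (names taken from the fact-table script in this notebook; adjust them if your tables are named differently). The helper `orphan_check_sql` is illustrative and not part of the training material.

```python
def orphan_check_sql(fact_table, fk_column, dim_table, pk_column="row_key"):
    """Build a query returning fact rows whose foreign key has no parent row.

    Rows with a NULL foreign key are skipped, because a missing key is not
    the same as a key that points at a non-existent dimension row.
    """
    return (
        f"SELECT f.* FROM {fact_table} AS f "
        f"LEFT ANTI JOIN {dim_table} AS d "
        f"ON f.{fk_column} = d.{pk_column} "
        f"WHERE f.{fk_column} IS NOT NULL"
    )


# In a Databricks notebook (assumption: `spark` is the active session):
# query = orphan_check_sql(
#     "hub_schema.fact_sales_star", "customer_fk", "hub_schema.dim_customers"
# )
# orphans = spark.sql(query)  # an empty result means every key resolves
```

Running the same check once per foreign key column gives a simple substitute for constraint enforcement after each pipeline run.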



####  Running the DLT Pipeline

Follow these steps to run the pre-built Delta Live Tables (DLT) pipeline using an existing notebook:

1. Go to **Jobs & Pipelines** → click **Create New Pipeline**.
2. If you don’t see the **LakeFlow Pipeline Editor**, enable it using the toggle at the **top center** of the screen.
3. Name your pipeline and click **Add existing assets**.
4. In the first window (**Pipeline root folder**), select the **entire directory** that contains this training material.
5. In the second window (**Source code paths**), select **only notebook 6.1**.
6. A new UI will be created where you can run the **DLT pipeline directly from the notebook** and observe its execution graph.

> ✅ This allows you to visually inspect how the pipeline runs, including dependencies between tables and processing logic.


### 🔗 Step 4: Adding **Foreign Keys** to Strengthen Our Data Model 

So far, we’ve created the **`hub_schema`** with our **model tables**, and defined **primary keys** to uniquely identify each record.  
However, we haven’t yet added **foreign key relationships** — these are essential to describe how our **fact table** connects to the **dimension tables**.  

> ⚠️ This is a limitation of Delta Live Tables (DLT) table definitions: a foreign key can only reference a table that already exists.  
> In our project, we use an API call to automatically update foreign keys *after* the new tables are created.  

---

### Why add **Foreign Keys**? 🤔

- 📝 **Document** relationships clearly between tables  
- 🛡️ **Improve data integrity** by showing how dimensions relate to facts  
- 🔍 Make your data model **easier to understand and maintain**

> 💡 **Note:**  
> Many platforms (including DLT) may *not* fully enforce foreign key constraints, but **adding them is best practice** for clarity and future-proofing your architecture.

---

### 🚀 Next Steps

Edit and run the script (the last script in notebook 6.1, for the fact table) to add foreign keys to your fact table:

**FROM**
```
customer_fk STRING,
product_fk STRING,
--customer_fk STRING FOREIGN KEY REFERENCES hub_schema.dim_customer(row_key) COMMENT 'FK to customer dimension',
--product_fk STRING FOREIGN KEY REFERENCES hub_schema.dim_product(row_key) COMMENT 'FK to product dimension',
```
**TO**
```
--customer_fk STRING,
--product_fk STRING,
customer_fk STRING FOREIGN KEY REFERENCES hub_schema.dim_customer(row_key) COMMENT 'FK to customer dimension',
product_fk STRING FOREIGN KEY REFERENCES hub_schema.dim_product(row_key) COMMENT 'FK to product dimension',
```
- This way, the tables referenced by the FK constraints are already created in the first pipeline run, so the constraints can be added in the second run.

Then:  

➡️ **Go check the ERD (Entity Relationship Diagram) in Unity Catalog!**  
You’ll now see the **relationships between your fact and dimension tables visualized**.  

---

### 🎉 Congratulations! 🎉

You’ve just built a **simple Star Schema** in DLT — a solid foundation for scalable, well-structured data modeling! 👏👏

---

If you have any questions, don’t hesitate to ask — mastering relationships is key to becoming a great data modeler! 💪😊


### Step 5: 🙏 Thank You for Your Attention! 💫

We know this training covered a lot of ground — and that’s totally normal!  
If you don’t understand everything right away, **don’t be discouraged**. Learning complex data modeling and pipelines takes time and practice.

Your feedback is super valuable to us!  
If you feel something is missing, or if you spot any bugs during the training, please let us know so we can improve it for future versions. 🙌

---

### 🧹 Cleanup Time

Before you go, please run the following script to **delete the schemas** we created during this modeling training.  
This helps keep your environment clean and ready for the next session!

**Don't forget to delete the pipeline from the UI as well!** Go to **Jobs & Pipelines**, as you did when creating the pipeline; you should see only the pipeline you created. Click the three vertical dots on the far right and select **Delete**.   

---

If you have any questions or want to revisit any part of this training, just reach out — we’re here to support you! 🌟


In [0]:
DROP SCHEMA IF EXISTS hub_schema CASCADE;
DROP SCHEMA IF EXISTS lake_schema CASCADE;
DROP SCHEMA IF EXISTS raw_schema CASCADE;