

---

## 🧠 CHAPTER 1: Storytime — Why XML Still Exists

Imagine you’re working at a **pharmaceutical company** (like IQVIA 😎).
They receive data feeds every night from multiple healthcare partners.
Some of those partners use **modern JSON APIs**,
but others (especially older systems like hospital EHRs or lab systems) still send data in **XML** format.

For example:

* **Clinical Trial Results**
* **Lab Reports**
* **Patient Visit Records**

Snowflake can handle that data **without needing external parsing tools**, thanks to the **VARIANT** data type and functions like `XMLGET()`, `FLATTEN()`, and `TO_ARRAY()`.

---

## 🧩 CHAPTER 2: Step-by-Step — Loading XML Data in Snowflake

Let’s start small.
We’ll use a simple XML document that represents **patient information**:

### 🧾 Sample XML file: `patients.xml`

```xml
<patients>
  <patient>
    <id>101</id>
    <name>John Doe</name>
    <visits>
      <visit>
        <date>2024-01-15</date>
        <diagnosis>Fever</diagnosis>
      </visit>
      <visit>
        <date>2024-02-10</date>
        <diagnosis>Cold</diagnosis>
      </visit>
    </visits>
  </patient>
  <patient>
    <id>102</id>
    <name>Jane Smith</name>
    <visits>
      <visit>
        <date>2024-03-05</date>
        <diagnosis>Allergy</diagnosis>
      </visit>
    </visits>
  </patient>
</patients>
```

---

## 🧱 Step 1: Create a Stage

We’ll store our XML file temporarily in a Snowflake stage:

```sql
CREATE OR REPLACE STAGE xml_stage;
```

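Then upload your file from a Snowflake client such as SnowSQL (`PUT` is a Snowflake command, not a shell command, so point the path at wherever the file lives locally):

```sql
PUT file://patients.xml @xml_stage;
```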
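
To confirm the upload before loading, listing the stage is a quick check. As far as I know, `PUT` compresses files with gzip by default, so the staged file will typically appear as `patients.xml.gz`; a `COPY INTO` path of `@xml_stage/patients.xml` should still match it, because the path is treated as a prefix.

```sql
-- Show the files currently sitting in the stage created above.
LIST @xml_stage;
```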

---

## 🧱 Step 2: Create a Table with a VARIANT Column

```sql
CREATE OR REPLACE TABLE patient_xml (
  v VARIANT
);
```

Remember — `VARIANT` is the **universal container** for semi-structured data.

---

## 🧱 Step 3: Load XML Data into the Table

```sql
COPY INTO patient_xml
FROM @xml_stage/patients.xml
FILE_FORMAT = (TYPE = 'XML');
```

That’s it! By default the whole XML document is loaded into a single row and stored as a VARIANT.
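
If you would rather get one row per `<patient>` at load time, the XML file format has a `STRIP_OUTER_ELEMENT` option that drops the root element and loads each second-level element as its own row. The sketch below is an alternative to the load above, not a replacement for it; the table name `patient_rows` is hypothetical and only exists so the single-row table stays untouched.

```sql
-- Hypothetical second table, so the single-row load above is left as-is.
CREATE OR REPLACE TABLE patient_rows (
  v VARIANT
);

-- STRIP_OUTER_ELEMENT removes <patients> and loads each <patient> as a row.
COPY INTO patient_rows
FROM @xml_stage/patients.xml
FILE_FORMAT = (TYPE = 'XML' STRIP_OUTER_ELEMENT = TRUE);

-- Each row's v is now a <patient> element, so its children can be read directly.
SELECT
  XMLGET(v, 'id'):"$"::STRING   AS patient_id,
  XMLGET(v, 'name'):"$"::STRING AS patient_name
FROM patient_rows;
```

The rest of this guide keeps working with the single-row `patient_xml` table, because that is the shape you get when you cannot control how the data was loaded.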

---

## 🔍 Step 4: View the Raw Data

```sql
SELECT * FROM patient_xml;
```

Output (simplified for readability):

| V                                                         |
| --------------------------------------------------------- |
| `<patients><patient><id>101</id><name>John Doe</name>...` |

---

## 🌳 CHAPTER 3: Getting the Root Element of XML Data

When you load XML into a VARIANT, the entire XML document is represented as one hierarchical object.
The value in `v` **is** the **root element** (in our case, `<patients>`).

### Example:

`XMLGET()` only looks for a *child* of the element you pass in, so `XMLGET(v, 'patients')` finds nothing and returns `NULL`. To confirm which element you are holding, read its name with `"@"`:

```sql
SELECT
  v:"@" AS root_name
FROM patient_xml;
```

To reach the contents of the root (its child elements), use `"$"`:

```sql
SELECT
  v:"$" AS root_value
FROM patient_xml;
```

🧠 **Note:**
Snowflake exposes an element's name as `"@"`, its contents as `"$"`, and each attribute as `"@attribute_name"`.
Here `v:"@"` is `patients`, and `v:"$"` holds the `<patient>` elements.
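
To see these pieces without loading a file, here is a minimal sketch that parses an inline document with `PARSE_XML()`. The `source` attribute and the column alias `x` are made up for illustration.

```sql
SELECT
  x:"@"::STRING       AS element_name,  -- tag name of the root element
  x:"@source"::STRING AS source_attr,   -- value of the (hypothetical) source attribute
  x:"$"               AS contents       -- whatever sits inside the root element
FROM (
  SELECT PARSE_XML(
    '<patients source="lab"><patient><id>101</id></patient></patients>'
  ) AS x
);
```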

---

## 🧰 CHAPTER 4: The `XMLGET()` Function — Deep Dive with Scenario

### What `XMLGET()` Does:

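`XMLGET()` retrieves **one child element** from an XML element stored in a VARIANT.
It’s like saying: *“Give me the first tag with this name from inside that XML node.”*

### Syntax:

```sql
XMLGET(xml_expression, 'element_name' [, instance_number])
```

`instance_number` is zero-based and defaults to `0`. If no matching child exists, the result is `NULL`.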

---

### 📖 Example Scenario:

Let’s say we want to extract a `<patient>` node.

```sql
SELECT
  XMLGET(v, 'patient') AS patient_node
FROM patient_xml;
```

This gives you:

| PATIENT_NODE                                       |
| -------------------------------------------------- |
| `<patient><id>101</id><name>John Doe...</patient>` |

But notice something important:
`<patients>` contains *multiple* `<patient>` elements, and `XMLGET()` returns only the first one. You could ask for each instance by number, but that does not scale when you don’t know how many children there are.
To turn every `<patient>` into its own row, we’ll use **`FLATTEN()`** next.
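
Here is a small sketch of how the optional instance number and the missing-tag case behave. The inline document and the `doctor` tag are invented for the example.

```sql
SELECT
  XMLGET(x, 'patient'):"@id"::STRING    AS first_patient,   -- instance 0 is the default
  XMLGET(x, 'patient', 1):"@id"::STRING AS second_patient,  -- instance numbers start at 0
  XMLGET(x, 'doctor') IS NULL           AS doctor_missing   -- no <doctor> child, so XMLGET gives NULL
FROM (
  SELECT PARSE_XML(
    '<patients><patient id="101"/><patient id="102"/></patients>'
  ) AS x
);
```

Fetching by instance number is handy when you know exactly which occurrence you need; for an unknown number of repeats, flattening is the better fit.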

---

## 🌀 CHAPTER 5: Using `LATERAL FLATTEN(TO_ARRAY(...))` — Exploding XML Elements

When you call `XMLGET(v, 'patient')`, it returns a **single element** (the first `<patient>`), so wrapping that result in `TO_ARRAY()` would still give you only one row.

To handle multiple child nodes, flatten the **contents of the parent element** instead: `v:"$"` holds all the children of `<patients>`.

That’s where `TO_ARRAY()` and `FLATTEN()` work together. When an element has several children, `"$"` is already an array; when it has only one child, `"$"` is a single object. `TO_ARRAY()` normalises both cases so `FLATTEN()` always iterates over elements.

---

### Step-by-step example:

```sql
SELECT
  p.value AS patient_data
FROM patient_xml,
LATERAL FLATTEN(TO_ARRAY(v:"$")) p;
```

**Explanation:**

1. `v:"$"` returns the children of the root `<patients>` element.
2. `TO_ARRAY()` guarantees an array, even when there is only one `<patient>`.
3. `FLATTEN()` explodes that array so each `<patient>` becomes one row.

✅ **Result:**

| patient_data                                         |
| ---------------------------------------------------- |
| `<patient><id>101</id><name>John Doe...</patient>`   |
| `<patient><id>102</id><name>Jane Smith...</patient>` |

Now you can query each patient separately!
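
A minimal sketch of the single-child case, which is where `TO_ARRAY()` earns its place. My understanding is that when a parent has exactly one child, `"$"` holds that one element rather than an array, so flattening it directly would walk over the element's own fields instead of producing one row per child. The inline document below is invented for the example.

```sql
-- Only one <patient> here: TO_ARRAY() wraps it so FLATTEN() still yields one row per patient.
SELECT
  XMLGET(p.value, 'id'):"$"::STRING AS patient_id
FROM (
  SELECT PARSE_XML(
    '<patients><patient><id>103</id></patient></patients>'
  ) AS x
) t,
LATERAL FLATTEN(INPUT => TO_ARRAY(t.x:"$")) p;
```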

---

## 🧬 CHAPTER 6: Extracting Child Elements from Each Patient

Let’s dig deeper — extract patient details like ID, Name, etc.

```sql
SELECT
  XMLGET(p.value, 'id'):"$"::STRING AS patient_id,
  XMLGET(p.value, 'name'):"$"::STRING AS patient_name
FROM patient_xml,
LATERAL FLATTEN(TO_ARRAY(v:"$")) p;
```

**Output:**

| patient_id | patient_name |
| ---------- | ------------ |
| 101        | John Doe     |
| 102        | Jane Smith   |

**Explanation:**

* `XMLGET(p.value, 'id')` → gets the `<id>` element of each `<patient>`.
* The `:"$"` part extracts the **text value** inside the tag.
* Casting with `::STRING` gives a normal text column.
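
Two small refinements are often useful here, sketched below against the same `patient_xml` table: casting to a more specific type than `STRING`, and filtering on the element name (`"@"`) so that any other tags sitting next to `<patient>` are ignored. Whether your feed ever contains such sibling tags is an assumption you would need to check.

```sql
SELECT
  XMLGET(p.value, 'id'):"$"::NUMBER   AS patient_id,    -- numeric instead of text
  XMLGET(p.value, 'name'):"$"::STRING AS patient_name
FROM patient_xml,
LATERAL FLATTEN(INPUT => TO_ARRAY(v:"$")) p
WHERE GET(p.value, '@')::STRING = 'patient';            -- keep only <patient> children
```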

---

## 🔁 CHAPTER 7: Nested XML Elements — Chaining `FLATTEN(TO_ARRAY(...))`

Now each `<patient>` has multiple `<visit>` elements inside `<visits>`.
Let’s flatten those too!

```sql
SELECT
  XMLGET(p.value, 'id'):"$"::STRING AS patient_id,
  XMLGET(p.value, 'name'):"$"::STRING AS patient_name,
  XMLGET(vs.value, 'date'):"$"::STRING AS visit_date,
  XMLGET(vs.value, 'diagnosis'):"$"::STRING AS diagnosis
FROM patient_xml,
LATERAL FLATTEN(TO_ARRAY(v:"$")) p,
LATERAL FLATTEN(TO_ARRAY(XMLGET(p.value, 'visits'):"$")) vs;
```

🧠 **Explanation of the chaining:**

| Step                                        | What it does                                |
| ------------------------------------------- | ------------------------------------------- |
| `v:"$"`                                     | Returns the children of `<patients>`        |
| `FLATTEN(TO_ARRAY(...))`                    | Turns each `<patient>` into its own row     |
| `XMLGET(p.value, 'visits')`                 | Accesses the `<visits>` section per patient |
| `:"$"` plus another `FLATTEN(TO_ARRAY(...))` | Expands each `<visit>` inside `<visits>`    |
| `XMLGET(vs.value, 'date')`                  | Gets date for each visit                    |
| `XMLGET(vs.value, 'diagnosis')`             | Gets diagnosis                              |

✅ **Final Output:**

| patient_id | patient_name | visit_date | diagnosis |
| ---------- | ------------ | ---------- | --------- |
| 101        | John Doe     | 2024-01-15 | Fever     |
| 101        | John Doe     | 2024-02-10 | Cold      |
| 102        | Jane Smith   | 2024-03-05 | Allergy   |
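
Once the visits are rows, ordinary SQL takes over. As a sketch, the query below counts visits and finds the most recent visit date per patient, casting the date text to `DATE` along the way. It assumes the `<date>` values are always in `YYYY-MM-DD` form, as they are in the sample file.

```sql
SELECT
  XMLGET(p.value, 'id'):"$"::STRING       AS patient_id,
  COUNT(*)                                AS visit_count,
  MAX(XMLGET(vs.value, 'date'):"$"::DATE) AS latest_visit
FROM patient_xml,
LATERAL FLATTEN(INPUT => TO_ARRAY(v:"$")) p,
LATERAL FLATTEN(INPUT => TO_ARRAY(XMLGET(p.value, 'visits'):"$")) vs
GROUP BY 1;  -- group by the first select column (patient_id)
```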

---

## 💡 CHAPTER 8: Key Takeaways

| Concept    | Description                                                        | Example                                                  |
| ---------- | ------------------------------------------------------------------ | -------------------------------------------------------- |
| VARIANT    | Stores XML data in flexible structure                              | `v VARIANT`                                              |
| XMLGET()   | Extracts one child element by tag name (the first match by default) | `XMLGET(p.value, 'id')`                                  |
| TO_ARRAY() | Guarantees an array, even when an element has a single child       | `TO_ARRAY(v:"$")`                                        |
| FLATTEN()  | Turns arrays into multiple rows                                    | `LATERAL FLATTEN(TO_ARRAY(...))`                         |
| LATERAL    | Lets `FLATTEN()` use columns from earlier in `FROM`, enabling chains | `FROM table, LATERAL FLATTEN(...), LATERAL FLATTEN(...)` |
| "$"        | Returns the contents of an element (its text or its children)      | `XMLGET(p.value, 'id'):"$"`                              |

---

## 🎯 Bonus Practice Questions

1. What does the `XMLGET()` function return when the tag doesn’t exist?
2. Why do we need `TO_ARRAY()` before `FLATTEN()` when working with XML?
3. How would you extract both `<id>` and `<name>` from each `<patient>` tag?
4. What happens if you `FLATTEN()` twice on the same XML structure?
5. Can you use `XMLGET()` and `ARRAY_SIZE()` together to count the number of `<visit>` tags per patient?

---

## 🧩 Bonus Practice Questions — Deeply Explained

This section works through five further questions. Each one is framed as a *real-world Snowflake scenario* and solved step by step.

---

### **1️⃣ How can you extract multiple attributes from XML using one query?**

Let’s start with a story:

> **Scenario:**
> You work at an airline company. Each flight record in your XML column stores `<flight>` elements with attributes like `id`, `source`, `destination`, and `duration`.

For example:

```xml
<flights>
  <flight id="F101" source="NYC" destination="LON" duration="7h" />
  <flight id="F102" source="LON" destination="PAR" duration="1h" />
</flights>
```

You have this XML stored in a Snowflake `VARIANT` column `flight_data`.

Let’s say your table looks like:

```sql
CREATE OR REPLACE TABLE flight_xml (
  id INT AUTOINCREMENT,
  flight_data VARIANT
);
```

Insert the sample XML (using `INSERT ... SELECT`, because Snowflake does not accept `PARSE_XML()` inside a `VALUES` clause):

```sql
INSERT INTO flight_xml (flight_data)
SELECT PARSE_XML('<flights>
  <flight id="F101" source="NYC" destination="LON" duration="7h"/>
  <flight id="F102" source="LON" destination="PAR" duration="1h"/>
</flights>');
```

Now, to extract **all flights and their attributes**, we flatten the children of `<flights>` and read each attribute:

```sql
SELECT
  f.value:"@id"::STRING AS flight_id,
  f.value:"@source"::STRING AS source,
  f.value:"@destination"::STRING AS destination,
  f.value:"@duration"::STRING AS duration
FROM flight_xml,
LATERAL FLATTEN(TO_ARRAY(flight_data:"$")) f;
```

✅ **Explanation:**

* `flight_data:"$"` → returns the children of `<flights>`, that is, every `<flight>` element.
* `TO_ARRAY()` → ensures even one element becomes iterable.
* `LATERAL FLATTEN()` → expands each `<flight>` node into rows.
* Access attributes using the `@` prefix, e.g. `"@source"`.

So one query gives you **multiple attributes** per node.
Beautiful, right?
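
Because each attribute is just an expression, it can also be used in a `WHERE` clause. The sketch below, against the same `flight_xml` table, keeps only flights departing from a given airport; the choice of `'LON'` is arbitrary.

```sql
SELECT
  f.value:"@id"::STRING          AS flight_id,
  f.value:"@destination"::STRING AS destination
FROM flight_xml,
LATERAL FLATTEN(INPUT => TO_ARRAY(flight_data:"$")) f
WHERE f.value:"@source"::STRING = 'LON';  -- filter on an attribute value
```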

---

### **2️⃣ How to find count of repeated XML tags using Snowflake functions?**

> **Scenario:**
> Suppose your XML logs look like this:

```xml
<logs>
  <entry>Login</entry>
  <entry>Logout</entry>
  <entry>Login</entry>
</logs>
```

You want to know: *How many `<entry>` tags are there?*

---

```sql
SELECT
  ARRAY_SIZE(TO_ARRAY(log_data:"$")) AS entry_count
FROM log_xml;
```

✅ **Explanation:**

* `log_data:"$"` → returns all children of `<logs>`; here, every child is an `<entry>`.
* `TO_ARRAY()` → makes sure even a single `<entry>` becomes iterable.
* `ARRAY_SIZE()` → counts how many children exist.

So, this gives you a count of the repeated tags, as long as the parent contains only that tag. Note that `XMLGET(log_data, 'entry')` would not work for counting, because it returns just the first `<entry>`.
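
When the parent mixes several kinds of child tags, counting by element name is more reliable. A minimal sketch, using an inline document with an invented `<warning>` tag alongside the entries:

```sql
SELECT
  GET(e.value, '@')::STRING AS tag_name,   -- element name of each child
  COUNT(*)                  AS tag_count
FROM (
  SELECT PARSE_XML(
    '<logs><entry>Login</entry><warning>Disk</warning><entry>Logout</entry></logs>'
  ) AS log_data
) t,
LATERAL FLATTEN(INPUT => TO_ARRAY(t.log_data:"$")) e
GROUP BY 1;
```

Grouping on the tag name gives one count per kind of child, so a single query covers every repeated tag at that level.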

---

### **3️⃣ How can you handle nested XML with repeating tags inside multiple nodes?**

> **Scenario:**
> You have product catalog XML like this:

```xml
<catalog>
  <category name="Electronics">
    <product><name>Phone</name><price>600</price></product>
    <product><name>TV</name><price>1200</price></product>
  </category>
  <category name="Books">
    <product><name>Novel</name><price>20</price></product>
  </category>
</catalog>
```

And you want all **product names with category names**.

---

```sql
SELECT
  c.value:"@name"::STRING AS category_name,
  XMLGET(p.value, 'name'):"$"::STRING AS product_name,
  XMLGET(p.value, 'price'):"$"::STRING AS price
FROM catalog_xml,
LATERAL FLATTEN(TO_ARRAY(xml_data:"$")) c,
LATERAL FLATTEN(TO_ARRAY(c.value:"$")) p;
```

✅ **Explanation:**

* First `LATERAL FLATTEN` iterates over each `<category>` inside `<catalog>`.
* Second `LATERAL FLATTEN` iterates over each `<product>` inside that category.
* It’s a **nested flattening** technique.
* Attributes are read with path syntax (`"@name"`); child elements are read with `XMLGET(...)` followed by `"$"`.

🎯 **Key Learning:** You can chain multiple `LATERAL FLATTEN` calls to go deeper into nested XML.
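
To try nested flattening without creating a table, the sketch below parses the sample catalog inline and aggregates per category. The CTE name `sample_catalog` is made up, and casting `<price>` to `NUMBER` assumes every price is numeric.

```sql
WITH sample_catalog AS (
  SELECT PARSE_XML('<catalog>
    <category name="Electronics">
      <product><name>Phone</name><price>600</price></product>
      <product><name>TV</name><price>1200</price></product>
    </category>
    <category name="Books">
      <product><name>Novel</name><price>20</price></product>
    </category>
  </catalog>') AS xml_data
)
SELECT
  c.value:"@name"::STRING                   AS category_name,
  COUNT(*)                                  AS product_count,
  SUM(XMLGET(p.value, 'price'):"$"::NUMBER) AS total_price
FROM sample_catalog,
LATERAL FLATTEN(INPUT => TO_ARRAY(xml_data:"$")) c,   -- one row per <category>
LATERAL FLATTEN(INPUT => TO_ARRAY(c.value:"$")) p     -- one row per <product>
GROUP BY 1;
```

The Books category has a single `<product>`, which is exactly the case `TO_ARRAY()` is there to handle.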

---

### **4️⃣ How to extract both elements and attributes together?**

> **Scenario:**
> You receive this XML for employee data:

```xml
<employee id="E101">
  <name>John</name>
  <department>IT</department>
</employee>
```

You want both attribute (`id`) and elements (`name`, `department`).

---

```sql
SELECT
  xml_data:"@id"::STRING AS employee_id,
  XMLGET(xml_data, 'name'):"$"::STRING AS employee_name,
  XMLGET(xml_data, 'department'):"$"::STRING AS department
FROM employee_xml;
```

✅ **Explanation:**

* `<employee>` is the root element here, so there is nothing to flatten: each `xml_data` value is one employee.
* Attributes always start with `@` (e.g. `"@id"`).
* Child elements are read with `XMLGET()` plus `"$"` (e.g. `XMLGET(xml_data, 'name'):"$"`).

This is the typical pattern you’ll use in **real data warehousing pipelines**.

---

### **5️⃣ How can you join XML data with relational tables?**

> **Scenario:**
> You have:
>
> * A `department_table` in relational format.
> * An XML column storing employee details (name, dept_id, salary).

---

```sql
SELECT
  d.department_name,
  XMLGET(e.value, 'name'):"$"::STRING AS employee_name,
  XMLGET(e.value, 'salary'):"$"::STRING AS salary
FROM employee_xml,
LATERAL FLATTEN(TO_ARRAY(xml_data:"$")) e
JOIN department_table d
  ON XMLGET(e.value, 'dept_id'):"$"::STRING = d.department_id;
```

✅ **Explanation:**

* This assumes `xml_data` holds a root element whose `<employee>` children each carry `<name>`, `<dept_id>`, and `<salary>` elements.
* Flatten XML into rows.
* Extract relational values (`dept_id`) with `XMLGET(...)` and `"$"`.
* Join like a normal relational query.

That’s how **Snowflake bridges structured and semi-structured data seamlessly**.

---

## 🧠 Key Takeaways

| Concept          | Function / Method                     | Purpose                             |
| ---------------- | ------------------------------------- | ----------------------------------- |
| Parse XML        | `PARSE_XML()`                         | Convert string → VARIANT XML        |
| Access element   | `XMLGET(column, 'tag')`               | Get the first child with that tag   |
| Access text      | `XMLGET(column, 'tag'):"$"`           | Get the contents of a child element |
| Access attribute | `"@attr_name"`                        | Get attribute value                 |
| Flatten XML      | `LATERAL FLATTEN(TO_ARRAY(node:"$"))` | Expand repeating child nodes        |
| Count elements   | `ARRAY_SIZE(TO_ARRAY(node:"$"))`      | Count child elements                |
| Nested iteration | Multiple `LATERAL FLATTEN()`          | Traverse multiple XML layers        |

---

## 🎯 Final Practice Challenge

> Create a `customer_orders` table that stores XML of customers and their multiple orders.
> Then write a query that extracts:
>
> * Customer name
> * Each product name and price
> * Total number of orders per customer

Use everything you’ve learned — `XMLGET`, `LATERAL FLATTEN`, `ARRAY_SIZE`, and nested flattening.
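
If you need somewhere to start, here is one possible setup. The XML shape, tag names, and sample values are all invented for this exercise; feel free to design your own. The queries themselves are left to you.

```sql
-- Hypothetical table and sample document for the challenge.
CREATE OR REPLACE TABLE customer_orders (
  xml_data VARIANT
);

-- INSERT ... SELECT is used because PARSE_XML() is not accepted in a VALUES clause.
INSERT INTO customer_orders (xml_data)
SELECT PARSE_XML('<customers>
  <customer>
    <name>Alice</name>
    <orders>
      <order><product>Laptop</product><price>900</price></order>
      <order><product>Mouse</product><price>25</price></order>
    </orders>
  </customer>
  <customer>
    <name>Bob</name>
    <orders>
      <order><product>Desk</product><price>300</price></order>
    </orders>
  </customer>
</customers>');
```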
