
# **AirQ_Part2_xxx** — Assignment 2 (ETL + OLAP with Atoti)

- This notebook orchestrates Assignment 2.
- All SQL must live in external `.sql` files under `ddl/`, `etl/`, and `sql/`. 
- All MDX must live in external `.mdx` files under `mdx/`.

**Final folder layout (per‑group, self‑contained)**

```
BI_Projects/
  DWH2_xxx/
    csv/       # 15 OLTP CSV files
    ddl/       # DDL only (staging, warehouse)
    etl/       # ETL steps: a2_etl*.sql files
    mdx/       # MDX-queries in .mdx files: a2_q{NN}_{A|B}.mdx
    mdx_out/   # CSV files with the results of MDX-queries
    pdf/       # PDF files with dashboard exports: a2_q{NN}.pdf
    sql/       # SQL-queries in .sql files: a2_q{NN}_{A|B}.sql
    sqldump/   # Export produced by pg_dump
    AirQ_Part2_xxx.ipynb
    group_xxx.txt
    Report_Part2_Group_xxx.pdf
```
> Replace `xxx` in your file names with your **three‑digit** group number.
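
The `a2_q{NN}_{A|B}` naming pattern from the layout can be checked mechanically before submission. The sketch below is a minimal illustration, assuming `NN` is a two-digit, zero-padded question number (as in `a2_q01_A.sql`); the helper name `check_query_filename` is hypothetical and not part of the assignment scaffold.

```python
import re

# Pattern for query files: a2_q{NN}_{A|B}.sql or a2_q{NN}_{A|B}.mdx
# Assumption: NN is a two-digit, zero-padded question number.
QUERY_FILE_RE = re.compile(r"a2_q(\d{2})_([AB])\.(sql|mdx)")

def check_query_filename(name: str) -> bool:
    """Return True if `name` follows the a2_q{NN}_{A|B} convention."""
    return QUERY_FILE_RE.fullmatch(name) is not None

for candidate in ["a2_q01_A.sql", "a2_q09_B.mdx", "a2_q1_A.sql", "a2_q31.sql"]:
    print(candidate, "->", check_query_filename(candidate))
```

Note that the provided example files (such as `a2_q31.sql`) carry no student suffix, so a check like this should only be applied to the group's own query files.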


## Contents
1. Configuration & preflight (group, paths)  
2. Database connection
3. Reset and create staging schema (`stg2_xxx`) from DDL file 
4. Load CSVs into stg2_xxx (order-sensitive)  
5. Reset and create warehouse (`dwh2_xxx`) from DDL file 
6. ETL runner (executes `etl/a2_etl*.sql`)  
7. SQL queries
8. Atoti setup and build the OLAP cube (scaffold)
9. Define hierarchies and measures
10. MDX queries
11. Batch executor: run all .mdx → CSV (+ an index)
12. Create database dump
13. Submission checklist


## 1) Configuration & preflight

In [13]:
# === Parameters ===
XXX = "020"               # three digits, e.g. "007"

VERBOSE_SQL = False             # print progress when running .sql files

In [14]:
import re, time
import shutil, subprocess, os
import json, hashlib

from pathlib import Path
from getpass import getpass
from urllib.parse import quote_plus
from datetime import datetime, timezone

import pandas as pd
import sqlalchemy as sa
import sqlparse
from sqlalchemy import create_engine, text

import atoti as tt

In [15]:
!pip show atoti

Name: atoti
Version: 0.9.9
Summary: Explore metrics across hundreds of dimensions, analyze live data at its most granular level and perform what-if simulations at unparalleled speed
Home-page: https://www.atoti.io
Author: 
Author-email: ActiveViam <dev@atoti.io>
License: 
Location: C:\ProgramData\anaconda3\envs\dwh\Lib\site-packages
Requires: atoti-client, atoti-server, jdk4py
Required-by: 


In [16]:
# === Toggles & paths ===
root_dir = Path.cwd()
csv_dir = root_dir / "csv"
ddl_dir = root_dir / "ddl"
etl_dir = root_dir / "etl"
mdx_dir = root_dir / "mdx"
mdx_out_dir = root_dir / "mdx_out"
sql_dir = root_dir / "sql"
sqldump_dir = root_dir / "sqldump"

SCHEMA_STG = f"stg2_{XXX}"
SCHEMA_DWH = f"dwh2_{XXX}"

# files we expect in the ddl subfolder
STG2_RESET  = ddl_dir / f"airq_reset_stg2_{XXX}.sql"
STG2_CREATE = ddl_dir / f"airq_create_stg2_{XXX}.sql"
DWH2_RESET  = ddl_dir / f"airq_reset_dwh2_{XXX}.sql"
DWH2_CREATE = ddl_dir / f"airq_create_dwh2_{XXX}.sql"

print("CSV dir:", csv_dir)
print("DDL dir:", ddl_dir)
print("ETL dir:", etl_dir)
print("MDX dir:", mdx_dir)
print("MDX_out dir:", mdx_out_dir)
print("SQL dir:", sql_dir)
print("SQLdump dir:", sqldump_dir)

CSV dir: C:\ProgramData\anaconda3\envs\dwh\etc\jupyter\BI_Projects\DWH2_020\csv
DDL dir: C:\ProgramData\anaconda3\envs\dwh\etc\jupyter\BI_Projects\DWH2_020\ddl
ETL dir: C:\ProgramData\anaconda3\envs\dwh\etc\jupyter\BI_Projects\DWH2_020\etl
MDX dir: C:\ProgramData\anaconda3\envs\dwh\etc\jupyter\BI_Projects\DWH2_020\mdx
MDX_out dir: C:\ProgramData\anaconda3\envs\dwh\etc\jupyter\BI_Projects\DWH2_020\mdx_out
SQL dir: C:\ProgramData\anaconda3\envs\dwh\etc\jupyter\BI_Projects\DWH2_020\sql
SQLdump dir: C:\ProgramData\anaconda3\envs\dwh\etc\jupyter\BI_Projects\DWH2_020\sqldump
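
The paths printed above are only useful if the folders and DDL files actually exist. A possible preflight check is sketched below; it assumes the folder names and the `airq_{reset|create}_{stg2|dwh2}_{XXX}.sql` file names used in this notebook, and the function name `preflight` is hypothetical.

```python
import re
from pathlib import Path

def preflight(root: Path, xxx: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    # The group number must be exactly three digits, e.g. "007"
    if not re.fullmatch(r"\d{3}", xxx):
        problems.append(f"group number must be three digits, got {xxx!r}")
    # Every working folder of the layout must exist
    for sub in ("csv", "ddl", "etl", "mdx", "mdx_out", "sql", "sqldump"):
        if not (root / sub).is_dir():
            problems.append(f"missing folder: {sub}/")
    # The four DDL scripts the notebook runs in sections 3 and 5
    for kind in ("reset", "create"):
        for layer in ("stg2", "dwh2"):
            ddl_file = root / "ddl" / f"airq_{kind}_{layer}_{xxx}.sql"
            if not ddl_file.is_file():
                problems.append(f"missing DDL file: {ddl_file.name}")
    return problems

# In the notebook this would be called as preflight(root_dir, XXX)
for problem in preflight(Path.cwd(), "020"):
    print("!!", problem)
```

Returning a list instead of raising on the first problem lets one run report everything that is missing.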



## 2) Make database connection


In [17]:
from getpass import getpass

# === Minimal config & connect ===
DB_USER = f"grp_{XXX}"
DB_NAME = "airq"
DB_HOST = "localhost"
DB_PORT = "5432"

# a password is asked once per run; enter empty password if your local pg_hba allows trust/peer
pw = getpass(f"Password for {DB_USER}@{DB_HOST}:{DB_PORT}/{DB_NAME} (leave empty if not needed): ")
DSN = f"postgresql+psycopg2://{DB_USER}:{quote_plus(pw)}@{DB_HOST}:{DB_PORT}/{DB_NAME}" if pw \
      else f"postgresql+psycopg2://{DB_USER}@{DB_HOST}:{DB_PORT}/{DB_NAME}"

def _mask_dsn(dsn: str) -> str:
    """Return the DSN with the password hidden, for safe printing."""
    try:
        # use sa.engine.make_url: the name `engine` is rebound to the Engine object below
        return sa.engine.make_url(dsn).render_as_string(hide_password=True)
    except Exception:
        # fallback: regex masking; \1 re-inserts the user name
        return re.sub(r"://([^:@]+)(?::[^@]*)?@", r"://\1:***@", dsn)

engine = create_engine(DSN, future=True, pool_pre_ping=True)
print("Connecting via:", _mask_dsn(DSN))

with engine.begin() as conn:
    # best-effort: set the role if it exists; the savepoint keeps the
    # transaction usable if SET ROLE fails
    try:
        with conn.begin_nested():
            conn.exec_driver_sql(f"SET ROLE grp_{XXX}")
        print(f"SET ROLE grp_{XXX} ✓")
    except Exception as e:
        print(f"(no SET ROLE: {e.__class__.__name__})")
    who = conn.exec_driver_sql("select current_user").scalar_one()
    print("current_user:", who)


Password for grp_020@localhost:5432/airq (leave empty if not needed):  ········


Connecting via: postgresql+psycopg2://grp_020:***@localhost:5432/airq
SET ROLE grp_020 ✓
current_user: grp_020
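
Two details of the connection cell are easy to get wrong: the password must be URL-encoded before it goes into the DSN, and in a raw-string replacement the backreference is written `\1`. The following standalone sketch illustrates both with the standard library only; the helper names `build_dsn` and `mask_dsn` and the sample password are hypothetical.

```python
import re
from urllib.parse import quote_plus

def build_dsn(user: str, password: str, host: str, port: str, db: str) -> str:
    """Build a SQLAlchemy-style DSN; the password is URL-encoded."""
    auth = f"{user}:{quote_plus(password)}" if password else user
    return f"postgresql+psycopg2://{auth}@{host}:{port}/{db}"

def mask_dsn(dsn: str) -> str:
    """Replace the password part with *** so the DSN can be printed."""
    # \1 re-inserts the captured user name
    return re.sub(r"://([^:@/]+)(?::[^@]*)?@", r"://\1:***@", dsn)

# Hypothetical password containing characters that would break an unencoded URL
dsn = build_dsn("grp_020", "p@ss:word/1", "localhost", "5432", "airq")
print(mask_dsn(dsn))
```

Without `quote_plus`, the `@`, `:` and `/` in such a password would be read as URL delimiters.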


In [18]:
def run_sqlscript(
    path: str | Path,
    *,
    engine,
    progress: bool = True,      # True: print per-statement progress; False: keep output quiet
    add_search_path: bool = False,
    schema_dwh: str | None = None,
    schema_stg: str | None = None,
    title: str | None = None,      # optional title
    strip_psql_meta: bool = True,  # psql meta stripping
):
    """
    Execute all statements in a .sql file.
    - Returns the LAST result set as a pandas.DataFrame if any statement returns rows; else None.
    - Set progress=False to suppress progress/header prints (great for check scripts).
    """

    raw = Path(path).read_text(encoding="utf-8")

    # Strip psql meta-commands (e.g., \i, \set) if requested
    if strip_psql_meta:
        raw = "\n".join(
            line for line in raw.splitlines()
            if not line.lstrip().startswith("\\")
        )

    # Optional search_path prologue
    prologue = ""
    if add_search_path:
        schs = [s for s in (schema_dwh, schema_stg) if s]
        if schs:
            prologue = f"SET search_path TO {', '.join(schs)};\n"

    script = prologue + raw
    stmts = [s.strip() for s in sqlparse.split(script) if s and s.strip(" ;\n\t")]

    if progress:
        hdr = f"▶ {title}" if title else "▶ Running SQL script"
        print(f"{hdr}: {path} ({len(stmts)} statements)")
    t0 = time.time()

    last_df = None
    with engine.begin() as conn:
        for i, stmt in enumerate(stmts, start=1):
            if not stmt:
                continue
            start = time.time()
            try:
                if progress:
                    preview = " ".join(stmt.split())[:120]
                    print(f"  {i:>3}: {preview} ...")

                cursor = conn.exec_driver_sql(stmt)

                if cursor.returns_rows:
                    rows = cursor.fetchall()
                    cols = cursor.keys()
                    last_df = pd.DataFrame(rows, columns=cols)

                if progress:
                    print(f"       OK ({time.time() - start:.3f}s)")

            except Exception as e:
                # Raise with a helpful preview even when progress=False
                preview = " ".join(stmt.split())[:160]
                raise RuntimeError(
                    f"SQL error in statement #{i}: {preview}"
                ) from e

    if progress:
        print(f"✅ Done in {time.time() - t0:.2f}s")

    return last_df

## 3) Reset and create **staging schema** (`stg2_xxx`) from DDL file

In [19]:
print(f"== STAGING-ONLY RESET: stg2_{XXX} ==")
try:
    for p in (STG2_RESET, STG2_CREATE):
        run_sqlscript(p, engine=engine, progress=VERBOSE_SQL)
except Exception as e:
    print(f"!! Reset & create failed: {e}")
    raise

== STAGING-ONLY RESET: stg2_020 ==


## 4) Load CSV → `stg2_xxx` with Pandas `.to_sql()`

In [20]:
def load_folder_to_stg(
    folder_name: str,
    engine,
    SCHEMA_STG: str,
    load_order=None,
    if_exists: str = "append",
    chunksize: int = 20000,
):
    # root_dir is read from the notebook's global scope (defined in section 1)
    src_dir = Path(root_dir) / folder_name
    if not src_dir.exists():
        raise FileNotFoundError(f"Folder not found: {src_dir}")

    def load_one(name: str):
        path = src_dir / f"{name}.csv"
        if not path.exists():
            print("Missing CSV:", path.name)
            return 0
        df = pd.read_csv(
            path,
            na_values=["\\N"],
            keep_default_na=False,
            low_memory=False,
        )
        # Convert any *...from / ...to / ...at* to DATE
        for col in df.columns:
            col_l = col.lower()
            if col_l.endswith(("from", "to", "at")):
                df[col] = pd.to_datetime(df[col], format="%Y-%m-%d", errors="coerce").dt.date
        # Write
        df.to_sql(
            name,
            con=engine,
            schema=SCHEMA_STG,
            if_exists=if_exists,
            index=False,
            method="multi",
            chunksize=chunksize,
        )
        print(f"Loaded {len(df):,} rows → {SCHEMA_STG}.{name}")
        return len(df)

    if not load_order:
        discovered = sorted([p.stem for p in src_dir.glob("*.csv")])
        print("No order set yet. CSVs found:", discovered)
        return

    t0 = time.time()
    total = 0
    for name in load_order:
        total += load_one(name)
    print(f"⏱️ Total load time: {time.time() - t0:.2f} seconds · {total:,} rows")
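
The date conversion in `load_one` is driven purely by column-name suffixes, so it is worth checking which columns the rule would pick up. The sketch below mirrors the suffix test on a few hypothetical column names (none of them are taken from the actual CSV files).

```python
DATE_SUFFIXES = ("from", "to", "at")

def looks_like_date_column(col: str) -> bool:
    """Mirror of the loader's rule: case-insensitive suffix check."""
    return col.lower().endswith(DATE_SUFFIXES)

# Hypothetical column names, chosen only to illustrate the rule
for col in ["validfrom", "validto", "readat", "lat", "photo", "format", "city_name"]:
    print(f"{col:>10}: {looks_like_date_column(col)}")
```

Names such as `lat` or `photo` also match. Because the conversion uses `errors="coerce"`, a non-date column caught by the rule would likely end up as nulls without any error, so comparing the matched columns against the staging DDL is a cheap safeguard.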

In [21]:
# Loading of original 15 CSV files in the correct order
LOAD_ORDER_CSV = [
    "tb_country",
    "tb_city",
    "tb_role",
    "tb_servicetype",
    "tb_employee",
    "tb_param",
    "tb_alert",
    "tb_paramalert",
    "tb_sensortype",
    "tb_paramsensortype",
    "tb_sensordevice",
    # dependent tables come **after** parent tables
    "tb_readingmode",
    "tb_weather",          # depends only on city
    "tb_readingevent",     # depends on sensordevice + param + readingmode
    "tb_serviceevent",     # depends on employee + sensordevice + servicetype
                 ]

load_folder_to_stg("csv", engine, SCHEMA_STG, load_order=LOAD_ORDER_CSV,  if_exists="append")

Loaded 20 rows → stg2_020.tb_country
Loaded 36 rows → stg2_020.tb_city
Loaded 16 rows → stg2_020.tb_role
Loaded 24 rows → stg2_020.tb_servicetype
Loaded 484 rows → stg2_020.tb_employee
Loaded 30 rows → stg2_020.tb_param
Loaded 4 rows → stg2_020.tb_alert
Loaded 120 rows → stg2_020.tb_paramalert
Loaded 12 rows → stg2_020.tb_sensortype
Loaded 115 rows → stg2_020.tb_paramsensortype
Loaded 627 rows → stg2_020.tb_sensordevice
Loaded 8 rows → stg2_020.tb_readingmode
Loaded 26,316 rows → stg2_020.tb_weather
Loaded 985,573 rows → stg2_020.tb_readingevent
Loaded 22,720 rows → stg2_020.tb_serviceevent
⏱️ Total load time: 512.01 seconds · 1,036,105 rows
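
The hand-written load order can be cross-checked against the table dependencies with a topological sort. The sketch below uses the standard-library `graphlib`; the dependencies of `tb_weather`, `tb_readingevent` and `tb_serviceevent` follow the comments in `LOAD_ORDER_CSV`, while the other two entries are assumptions that would need to be verified against the DDL. `check_load_order` is a hypothetical helper.

```python
from graphlib import TopologicalSorter

# child table -> set of parent tables it references
DEPENDS_ON = {
    "tb_city": {"tb_country"},                      # ASSUMED foreign key
    "tb_employee": {"tb_role", "tb_city"},          # ASSUMED foreign keys
    "tb_weather": {"tb_city"},
    "tb_readingevent": {"tb_sensordevice", "tb_param", "tb_readingmode"},
    "tb_serviceevent": {"tb_employee", "tb_sensordevice", "tb_servicetype"},
}

def check_load_order(order, depends_on):
    """Return (child, parent) pairs where the parent is loaded after the child."""
    pos = {name: i for i, name in enumerate(order)}
    return [
        (child, parent)
        for child, parents in depends_on.items()
        for parent in sorted(parents)
        if child in pos and parent in pos and pos[parent] > pos[child]
    ]

# static_order() yields every parent before the tables that depend on it
order = list(TopologicalSorter(DEPENDS_ON).static_order())
print(order)
print("violations:", check_load_order(order, DEPENDS_ON))
```

Whether a wrong order actually fails depends on whether the staging schema enforces foreign keys; the check is useful either way as documentation of the intended dependencies.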


## 5) Reset and create **warehouse** (`dwh2_xxx`) from DDL file

In [22]:
print(f"== DWH-ONLY RESET: dwh2_{XXX} ==")
try:
    for p in (DWH2_RESET, DWH2_CREATE):
        run_sqlscript(p, engine=engine, progress=VERBOSE_SQL)
except Exception as e:
    print(f"!! Reset & create failed: {e}")
    raise

== DWH-ONLY RESET: dwh2_020 ==



## 6) SQL-first ETL — run all files in etl/

We execute **all** files matching `etl/a2_etl*.sql` in lexicographic order. Every ETL file must begin with `SET search_path TO dwh2_xxx, stg2_xxx;`  



In [23]:
steps = sorted(etl_dir.glob("a2_etl*.sql"))
if not steps:
    print("No ETL step files found in etl/ (expected a2_etl*.sql).")
else:
    for s in steps:
        run_sqlscript(s, engine=engine, progress=VERBOSE_SQL)
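
Because the runner relies on lexicographic ordering and on each file setting its own `search_path`, both conventions can be verified up front. The following sketch uses hypothetical file names and helper names; it flags step numbers that sort differently as text than as numbers, and checks the first effective line of a script.

```python
import re

def etl_order_warnings(filenames: list[str]) -> list[str]:
    """Warn when lexicographic order differs from numeric step order."""
    def step_number(name: str) -> int:
        m = re.search(r"a2_etl_?(\d+)", name)
        return int(m.group(1)) if m else -1
    lexicographic = sorted(filenames)
    numeric = sorted(filenames, key=lambda n: (step_number(n), n))
    if lexicographic != numeric:
        return [f"lexicographic order {lexicographic} != numeric order {numeric}"]
    return []

def starts_with_search_path(sql_text: str) -> bool:
    """True if the first non-blank, non-comment line sets the search_path."""
    for line in sql_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("--"):
            continue
        return stripped.lower().startswith("set search_path to")
    return False

# Hypothetical file names: unpadded step numbers sort as 1, 10, 2
print(etl_order_warnings(["a2_etl1.sql", "a2_etl2.sql", "a2_etl10.sql"]))
# Zero-padded step numbers sort as intended
print(etl_order_warnings(["a2_etl01.sql", "a2_etl02.sql", "a2_etl10.sql"]))
print(starts_with_search_path("-- step 1\nSET search_path TO dwh2_020, stg2_020;\nSELECT 1;"))
```

Zero-padding the step number keeps the execution order stable once there are ten or more steps.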

## 7) SQL queries

In [24]:
# Business question Q31 (example)
df = run_sqlscript("sql/a2_q31.sql", engine=engine, progress=VERBOSE_SQL)
display(df)

Unnamed: 0,city_name,P95 Recorded Value (2023)
0,Prague,152.58
1,Hamburg,150.75
2,Athens,142.47
3,London,132.38
4,Istanbul,131.14
5,Edinburgh,126.17
6,Copenhagen,124.67
7,Stuttgart,119.63
8,Salzburg,119.14
9,Kazan,117.25


In [25]:
# Business question Q32 (example)
df = run_sqlscript("sql/a2_q32.sql", engine=engine, progress=VERBOSE_SQL)
display(df)

Unnamed: 0,city_name,Data Volume (KB) 2024
0,Istanbul,803441
1,London,330327
2,Moscow,257108
3,Berlin,224658
4,St. Petersburg,215185
5,Paris,196835
6,Rome,188638
7,Ufa,174256
8,Copenhagen,148863
9,Vienna,146300


In [26]:
# Business question Q33 (example)
df = run_sqlscript("sql/a2_q33.sql", engine=engine, progress=VERBOSE_SQL)
display(df)

Unnamed: 0,country_name,month_name,Avg Data Quality
0,Austria,Sep,3.58
1,Belarus,Sep,3.24
2,Belgium,Sep,3.18
3,Croatia,Jun,3.6
4,Czech Republic,Apr,3.24
5,Denmark,Jun,3.48
6,Finland,May,3.7
7,France,Apr,3.09
8,Germany,Sep,3.12
9,Greece,Aug,3.12


In [27]:
# **Business Question Q1** — SQL for Student A

# For parameter PM2, show Exceed Days (any) by Country × Month for Q1 of 2024.
# Return Countries on rows and the first three months of 2024 (Jan–Mar) on columns.

# Running SQL query to get the result
df = run_sqlscript("sql/a2_q01_A.sql", engine=engine, progress=VERBOSE_SQL)
display(df)


Unnamed: 0,country_name,Jan_2024,Feb_2024,Mar_2024
0,Austria,14,6,8
1,Belgium,6,2,4
2,Croatia,6,2,5
3,Czech Republic,7,3,5
4,Denmark,1,6,7
5,Finland,2,2,1
6,France,16,14,14
7,Germany,25,18,27
8,Greece,9,7,11
9,Hungary,7,3,7


In [28]:
# **Business Question Q2** — SQL for Student A

# For parameter O3, show Missing Days in Austria by City × Month for Q1 of 2023.
# Return Austrian Cities on rows and the first three months of 2023 (Jan–Mar) on columns.

# Running SQL query to get the result
df = run_sqlscript("sql/a2_q02_A.sql", engine=engine, progress=VERBOSE_SQL)
display(df)


Unnamed: 0,city_name,Jan_2023,Feb_2023,Mar_2023
0,Graz,11,4,2
1,Salzburg,3,5,4
2,Vienna,6,2,2


In [29]:
# **Business Question Q4** — SQL for Student A

# For 2024, show total Data Volume (KB) by Region × Quarter.
# Return Regions on rows and the four quarters of 2024 (Q1–Q4) on columns.

# Running SQL query to get the result
df = run_sqlscript("sql/a2_q04_A.sql", engine=engine, progress=VERBOSE_SQL)
display(df)


Unnamed: 0,region_name,Q1_2024,Q2_2024,Q3_2024,Q4_2024
0,Central Europe,1780051,1779221,1792775,1807947
1,Eastern Europe,2586668,2598339,2629947,2586709
2,Western Europe,2364973,2376173,2383913,2405937


In [30]:
# **Business Question Q5** — SQL for Student A

# For 2023 and 2024, show total Data Volume (KB) by Param Category × Year.
# Return Param Categories on rows and the two years (2023, 2024) on columns.

# Running SQL query to get the result
df = run_sqlscript("sql/a2_q05_A.sql", engine=engine, progress=VERBOSE_SQL)
display(df)


Unnamed: 0,category,2023,2024
0,Biological,6328259,6298621
1,Gas,5599952,5597083
2,Heavy Metal,4961428,4955113
3,Particulate matter,5257962,5267659
4,Volatile Organic Compound,4962229,4974177


In [31]:
# **Business Question Q6** — SQL for Student A

# For 2024, list the Top 10 Cities by total Missing Days (all parameters).
# Return the Top 10 cities on rows (highest → lowest) and one column with the total Missing Days for 2024.

# Running SQL query to get the result
df = run_sqlscript("sql/a2_q06_A.sql", engine=engine, progress=VERBOSE_SQL)
display(df)


Unnamed: 0,city_name,Total_Missing_Days_2024
0,Zagreb,2524
1,Salzburg,2436
2,Kazan,2327
3,Minsk,2274
4,Warsaw,2273
5,Edinburgh,2196
6,Milan,2189
7,Marseille,2134
8,Hamburg,2133
9,Gothenburg,2109


In [32]:
# **Business Question Q7** — SQL for Student A

# For 2023, show Avg Recorded Value and P95 Recorded Value by Country for PM10.
# Return Countries on rows and two columns: Avg Recorded Value and P95 Recorded Value.

# Running SQL query to get the result
df_q7_a = run_sqlscript("sql/a2_q07_A.sql", engine=engine, progress=VERBOSE_SQL)
display(df_q7_a)


Unnamed: 0,country_name,Avg_Recorded_Value_2023,P95_Recorded_Value_2023
0,Greece,95.07001516666665,131.444565
1,Austria,81.9452968611111,107.143666
2,Turkey,80.61919258333333,92.928107
3,Germany,80.09584458333333,101.082223
4,Croatia,79.74585641666665,96.761852
5,Hungary,79.43401425,95.53179
6,United Kingdom,79.11582875,91.239154
7,Czech Republic,79.09515479166666,99.943981
8,Russia,78.03715670833333,91.607558
9,Belgium,77.52159558333332,93.422599


In [33]:
# **Business Question Q8** — SQL for Student A

# For 2024, show Reading Events by Country × Quarter (Top 10 countries).
# Return the four quarters on columns (Q1–Q4) and the Top 10 countries on rows, ranked by total Reading Events in 2024.

# Running SQL query to get the result
df_q8_a = run_sqlscript("sql/a2_q08_A.sql", engine=engine, progress=VERBOSE_SQL)
display(df_q8_a)


Unnamed: 0,country_name,Q1_2024,Q2_2024,Q3_2024,Q4_2024,total_reading_events
0,Turkey,22618,22383,22562,22277,89840
1,Russia,17562,17851,18006,17648,71067
2,Germany,13509,13244,13718,13759,54230
3,United Kingdom,10797,10802,11045,11075,43719
4,France,8845,9043,9050,8942,35880
5,Austria,7287,7168,7302,7284,29041
6,Italy,6593,6617,6695,6648,26553
7,Sweden,5080,5025,5256,5275,20636
8,Czech Republic,3965,4138,4110,4144,16357
9,Greece,3153,3191,3267,3201,12812


In [34]:
# **Business Question Q9** — SQL for Student B

# For 2024, list the Top 10 Countries by Avg Data Quality.
# Return the 10 countries with the highest values on rows (highest → lowest) and one column with Avg Data Quality for 2024.

# Running SQL query to get the result
df_q9_b = run_sqlscript("sql/a2_q09_B.sql", engine=engine, progress=VERBOSE_SQL)
display(df_q9_b)


Unnamed: 0,country_name,Avg_Data_Quality_2024
0,Belarus,3.0413258209876544
1,Croatia,3.0156915344827584
2,Sweden,3.011928864583333
3,Finland,3.0109282857142854
4,Germany,3.010857410344828
5,Italy,3.0071215112994354
6,Poland,3.0067860333333334
7,France,3.006653448643411
8,Serbia,3.005463354938272
9,United Kingdom,3.002515738700565


In [35]:
# **Business Question Q10** — SQL for Student B

# For 2024, show Exceed Days (any) by Region for Param Category = Gas.
# Return Regions on rows and one column with Exceed Days (any) for the year 2024, filtered to Category = Gas.

# Running SQL query to get the result
df_q10_b = run_sqlscript("sql/a2_q10_B.sql", engine=engine, progress=VERBOSE_SQL)
display(df_q10_b)


Unnamed: 0,region_name,Exceed_Days_2024
0,Western Europe,6686
1,Central Europe,6128
2,Eastern Europe,5742


In [36]:
# **Business Question Q11** — SQL for Student B

# For 2024, show Exceed Days (any) by City × Monthly Peak Alert Level for Eastern Europe.
# Return Cities in Eastern Europe on rows and the five Alert Levels on columns.

# Running SQL query to get the result
df_q11_b = run_sqlscript("sql/a2_q11_B.sql", engine=engine, progress=VERBOSE_SQL)
display(df_q11_b)


Unnamed: 0,city_name,alert_level_name,Exceed_Days_2024
0,Kazan,,0
1,Ufa,,0
2,Belgrade,,0
3,Ankara,,0
4,St. Petersburg,,0
5,Athens,,0
6,Minsk,,0
7,Ufa,Yellow,280
8,Ankara,Yellow,563
9,Athens,Yellow,13


In [37]:
# **Business Question Q13** — SQL for Student B

# For 2024, show Exceed Days (any) by City for Param Category = Particulate Matter.
# Return Cities on rows and one column with the total Exceed Days for the year 2024, filtered to Category = Particulate Matter.

# Running SQL query to get the result
df_q13_b = run_sqlscript("sql/a2_q13_B.sql", engine=engine, progress=VERBOSE_SQL)
display(df_q13_b)


Unnamed: 0,city_name,Exceed_Days_2024
0,Istanbul,1867
1,London,1284
2,Paris,853
3,Ufa,841
4,Berlin,818
5,Moscow,769
6,Rome,701
7,Prague,692
8,Ankara,579
9,Budapest,570


In [38]:
# **Business Question Q14** — SQL for Student B

# For 2024, list the Top 10 City × Param pairs by Avg Data Quality.
# Return the 10 City-Param pairs with the highest values on rows (highest → lowest) and one column with Avg Data Quality for 2024.

# Running SQL query to get the result
df_q14_b = run_sqlscript("sql/a2_q14_B.sql", engine=engine, progress=VERBOSE_SQL)
display(df_q14_b)


Unnamed: 0,city_name,param_name,Avg_Data_Quality_2024
0,Leipzig,Cadmium,3.330555583333333
1,Lyon,Actinomycetes,3.325923
2,Stockholm,Toluene,3.323829583333333
3,Hamburg,CH4,3.287972583333333
4,Hamburg,CO2,3.27731475
5,Warsaw,PM2,3.273014083333333
6,Leipzig,Nickel,3.269648
7,Stockholm,Ethylbenzene,3.262208916666667
8,Lyon,CO,3.260038
9,Edinburgh,PM2,3.241639166666667


In [39]:
# **Business Question Q15** — SQL for Student B

# Show Exceed Days (any) by Country in Eastern Europe for 2023 and 2024.
# Return Countries (only those in Eastern Europe) on rows and two columns—2023 and 2024 totals of Exceed Days (any).

# Running SQL query to get the result
df_q15_b = run_sqlscript("sql/a2_q15_B.sql", engine=engine, progress=VERBOSE_SQL)
display(df_q15_b)


Unnamed: 0,country_name,Exceed_Days_2023,Exceed_Days_2024
0,Belarus,1537,1474
1,Greece,3194,3128
2,Russia,12093,12074
3,Serbia,798,804
4,Turkey,12015,11798


In [40]:
# **Business Question Q16** — SQL for Student B

# For 2024, show Data Volume (KB) by Param Category × Quarter.
# Return Param Categories on rows and the four quarters of 2024 (Q1–Q4) on columns.

# Running SQL query to get the result
df_q16_b = run_sqlscript("sql/a2_q16_B.sql", engine=engine, progress=VERBOSE_SQL)
display(df_q16_b)


Unnamed: 0,category,Q1_2024,Q2_2024,Q3_2024,Q4_2024
0,Biological,1569420,1572081,1578413,1578707
1,Gas,1384917,1397440,1408923,1405803
2,Heavy Metal,1229515,1237593,1244435,1243570
3,Particulate matter,1307361,1314618,1327195,1318485
4,Volatile Organic Compound,1240479,1232001,1247669,1254028


In [41]:
# **Business Question Q17** — SQL for Student B

# Show Avg Data Quality by Country for 2023 and 2024.
# Return Countries on rows and two columns—2023 and 2024 values of Avg Data Quality.

# Running SQL query to get the result
df_q17_b = run_sqlscript("sql/a2_q17_B.sql", engine=engine, progress=VERBOSE_SQL)
display(df_q17_b)


Unnamed: 0,country_name,Avg_Data_Quality_2023,Avg_Data_Quality_2024
0,Austria,2.9970856425702808,2.9872878915662646
1,Belarus,3.0486171882716047,3.0413258209876544
2,Belgium,3.015514576923077,2.997155282051282
3,Croatia,3.042829683908046,3.0156915344827584
4,Czech Republic,2.995515993333333,2.9905771183333334
5,Denmark,3.021791224137931,3.0003959137931036
6,Finland,2.9838187797619047,3.0109282857142854
7,France,2.9771062412790696,3.006653448643411
8,Germany,3.000847524137931,3.010857410344828
9,Greece,2.998819810344828,2.9956242528735633


## 8) Atoti setup and build cube (scaffold)

In [42]:
os.environ.pop("JAVA_HOME", None)  # let Atoti use its own JDK via jdk4py

# Start a new Atoti session
session = tt.Session.start()

# URL to the Atoti web app
session.url

'http://localhost:63724'

In [43]:
def upsert_table(session, name, df, *, keys=None, defaults=None, dtypes=None):
    if name in session.tables.keys():
        t = session.tables[name]
        t.drop()  # delete all rows, keep schema
        if defaults:  # non-nullability even for existing tables
            for col, val in defaults.items():
                t[col].default_value = val  # set after creation too
        t.load(df)
    else:
        t = session.read_pandas(
            df,
            table_name=name,
            keys=keys or (),
            default_values=defaults or {},   # set at creation time
            data_types=dtypes or {},
        )
    return t

In [45]:
# Load star-schema tables to DataFrames
df_time   = pd.read_sql(f"SELECT * FROM dwh2_{XXX}.dim_timemonth", engine)
df_city   = pd.read_sql(f"SELECT * FROM dwh2_{XXX}.dim_city", engine)
df_param  = pd.read_sql(f"SELECT * FROM dwh2_{XXX}.dim_param", engine)
df_alert  = pd.read_sql(f"SELECT * FROM dwh2_{XXX}.dim_alertpeak", engine)
df_fact   = pd.read_sql(f"SELECT * FROM dwh2_{XXX}.ft_param_city_month", engine)

time_store  = upsert_table(session, "dim_timemonth", df_time,
                           keys=["month_key"],
                           defaults={"year_num": 0, "quarter_num": 0, "month_name": "Unknown"},
                           dtypes={"year_num": "int", "quarter_num": "int"})

city_store  = upsert_table(session, "dim_city", df_city,
                           keys=["city_key"],
                           defaults={"region_name": "Unknown", "country_name": "Unknown", "city_name": "Unknown"})

param_store = upsert_table(session, "dim_param", df_param,
                           keys=["param_key"],
                           defaults={"purpose": "Unknown", "category": "Unknown", "param_name": "Unknown"})

ap_store    = upsert_table(session, "dim_alertpeak", df_alert,
                           keys=["alertpeak_key"],
                           defaults={"alert_level_name": "None"})

fact_store = upsert_table(session, "ft_param_city_month", df_fact, 
                          keys=["ft_pcm_key"], 
                          defaults={"month_key": 0, "city_key": 0, "param_key": 0, "alertpeak_key": 1000,  # FKs
                                    "reading_events_count": 0, "devices_reporting_count": 0, 
                                    "data_volume_kb_sum": 0, "recordedvalue_avg": 0.0, "recordedvalue_p95": 0.0, 
                                    "exceed_days_any": 0, "data_quality_avg": 0.0, "missing_days": 0, 
                                   }, 
                          dtypes={"month_key": "int", "city_key": "int", 
                                  "param_key": "int", "alertpeak_key": "int", 
                                  "reading_events_count": "int", "devices_reporting_count": "int", 
                                  "data_volume_kb_sum": "int", "recordedvalue_avg": "float", 
                                  "recordedvalue_p95": "float", "exceed_days_any": "int", 
                                  "data_quality_avg": "float", "missing_days": "int", 
                                 },
                         )

# Define joins once per fresh session - can re-run the cell without redefining joins
if not getattr(session, "_airq_joins_done", False):
    fact_store.join(time_store,   fact_store["month_key"]     == time_store["month_key"])
    fact_store.join(city_store,   fact_store["city_key"]      == city_store["city_key"])
    fact_store.join(param_store,  fact_store["param_key"]     == param_store["param_key"])
    fact_store.join(ap_store,     fact_store["alertpeak_key"] == ap_store["alertpeak_key"])
    session._airq_joins_done = True

# Create or reuse the cube
cube_name = "AirQ Cube"
cube = (
    session.cubes[cube_name]
    if cube_name in session.cubes.keys()
    else session.create_cube(fact_store, cube_name, mode="manual")
)

# Access cube components
m, h, l = cube.measures, cube.hierarchies, cube.levels

cube

## 9) Define hierarchies and measures
Define explicit hierarchies in Atoti:

1) Time: Year → Quarter → Month,
2) Geo: Region → Country → City,
3) Param: Purpose → Category → Param,
4) Alert: Level (sorted by rank).

In [46]:
# Define hierarchies
h["Time"] = [
    time_store["year_num"],
    time_store["quarter_num"],
    time_store["month_name"]
]

h["Geo"] = [
    city_store["region_name"],
    city_store["country_name"],
    city_store["city_name"]
]

h["Param"] = [
    param_store["purpose"],
    param_store["category"],
    param_store["param_name"]
]

h["Alert"] = [
    ap_store["alert_level_name"]
]

# Define fully additive measures (SUM aggregation)
m["Reading Events"] = tt.agg.sum(
    fact_store["reading_events_count"]
)

m["Devices Reporting"] = tt.agg.sum(
    fact_store["devices_reporting_count"]
)

m["Data Volume (KB)"] = tt.agg.sum(
    fact_store["data_volume_kb_sum"]
)

m["Missing Days"] = tt.agg.sum(
    fact_store["missing_days"]
)

m["Exceed Days (any)"] = tt.agg.sum(
    fact_store["exceed_days_any"]
)

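# Define non-additive measures: unweighted MEAN of the stored monthly values over all rolled-up fact rows
# (the mean of monthly P95 values approximates a P95 but is not a true percentile)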
m["Avg Recorded Value"] = tt.agg.mean(
    fact_store["recordedvalue_avg"]
)

m["P95 Recorded Value"] = tt.agg.mean(
    fact_store["recordedvalue_p95"]
)

m["Avg Data Quality"] = tt.agg.mean(
    fact_store["data_quality_avg"]
)
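
When a mean is taken over pre-aggregated monthly averages, every fact row counts equally regardless of how many readings it summarizes. The arithmetic below shows the difference from a reading-weighted mean; the two fact rows are hypothetical numbers chosen for easy mental calculation.

```python
# Hypothetical monthly fact rows for one city and parameter:
# (reading_events_count, recordedvalue_avg)
rows = [(100, 10.0), (300, 20.0)]

# Unweighted mean of the stored monthly averages
unweighted = sum(avg for _, avg in rows) / len(rows)

# Mean weighted by the number of readings behind each monthly average
weighted = sum(n * avg for n, avg in rows) / sum(n for n, _ in rows)

print(unweighted)  # 15.0
print(weighted)    # 17.5
```

Both definitions are defensible; what matters is that the SQL and MDX versions of a query use the same one, otherwise their results will not match.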


In [47]:
# order months as in calendar, not alphabetically
month_lvl = cube.hierarchies["Time"]["month_name"]
month_lvl.order = tt.CustomOrder(first_elements=["Jan","Feb","Mar","Apr","May","Jun",
                                                 "Jul","Aug","Sep","Oct","Nov","Dec"])

In [48]:
# order alert levels from least to most harmful
alert_lvl = cube.hierarchies["Alert"]["alert_level_name"]
alert_lvl.order = tt.CustomOrder(first_elements=["None", "Yellow", "Orange", "Red", "Crimson"])

In [50]:
cube

In [52]:
print("\nHierarchies and their levels:")
for h_name, hierarchy in cube.hierarchies.items():
    level_names = [getattr(level, "name", str(level)) for level in hierarchy]
    print(f" - {h_name} → levels: {level_names}")

print("\nMeasures:")
for measure_name in cube.measures.keys():   # do not reuse `m`, which aliases cube.measures
    print("  -", measure_name)


Hierarchies and their levels:
 - ('dim_timemonth', 'Time') → levels: ['year_num', 'quarter_num', 'month_name']
 - ('dim_param', 'Param') → levels: ['purpose', 'category', 'param_name']
 - ('dim_alertpeak', 'Alert') → levels: ['alert_level_name']
 - ('dim_city', 'Geo') → levels: ['region_name', 'country_name', 'city_name']

Measures:
  - Avg Data Quality
  - P95 Recorded Value
  - Devices Reporting
  - Avg Recorded Value
  - contributors.COUNT
  - Exceed Days (any)
  - Reading Events
  - Data Volume (KB)
  - update.TIMESTAMP
  - Missing Days


## 10) MDX queries

In [55]:
# MDX cell magic: let us write MDX code like this:
#   %%mdx
#   SELECT ... FROM [AirQ Cube]
#
# Requirements: a live `session` from atoti and the cube already created.

from IPython.core.magic import register_cell_magic
from IPython.display import display

@register_cell_magic
def mdx(line, cell):
    """Run MDX in this cell and display a DataFrame.
    Usage:
        %%mdx
        SELECT ...
        FROM [AirQ Cube]
    """
    q = cell.strip()
    df = session.query_mdx(q)   # Atoti returns levels on index, measures as columns
    return df                   # df = _
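
Since all MDX has to live in external `.mdx` files, the same queries can also be run from disk instead of from a `%%mdx` cell. The sketch below is one possible shape for such a helper; `run_mdx_file` is a hypothetical name, and the query function is passed in as a parameter so the helper does not depend on a live session.

```python
from pathlib import Path
from typing import Any, Callable

def run_mdx_file(path: Path, run_query: Callable[[str], Any]) -> Any:
    """Read an .mdx file and pass its text to `run_query`."""
    query = Path(path).read_text(encoding="utf-8").strip()
    if not query:
        raise ValueError(f"empty MDX file: {path}")
    return run_query(query)

# In the notebook, one would presumably pass the session's query method:
#   df = run_mdx_file(mdx_dir / "a2_q01_A.mdx", session.query_mdx)
```

Passing the query function explicitly also makes the helper easy to try out with a stand-in function before the cube exists.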


### 10.1) Business question Q31 (example)

In [56]:
%%mdx

-- 31. For parameter O3, list the Top 10 Cities by P95 Recorded Value for 2023.
-- Return the 10 cities with the highest values on rows (highest → lowest) and one column with P95 Recorded Value for 2023.
SELECT
  { [Measures].[P95 Recorded Value] } ON COLUMNS,
  TOPCOUNT(
    NONEMPTY(
      [dim_city].[Geo].[city_name].Members,
      [Measures].[P95 Recorded Value]
    ),
    10, [Measures].[P95 Recorded Value]
  ) ON ROWS
FROM [AirQ Cube]
WHERE (
  [dim_timemonth].[Time].[year_num].&[2023],
  [dim_param].[Param].[param_name].&[O3]
)

region_name,country_name,city_name,P95 Recorded Value
Central Europe,Czech Republic,Prague,152.58
Central Europe,Germany,Hamburg,150.75
Eastern Europe,Greece,Athens,142.47
Western Europe,United Kingdom,London,132.38
Eastern Europe,Turkey,Istanbul,131.14
Western Europe,United Kingdom,Edinburgh,126.17
Western Europe,Denmark,Copenhagen,124.67
Central Europe,Germany,Stuttgart,119.63
Central Europe,Austria,Salzburg,119.14
Eastern Europe,Russia,Kazan,117.25
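
The SQL and MDX answers to the same business question arrive in different shapes: the SQL result has flat columns, while the MDX result carries the hierarchy levels on the index. A small comparison helper can confirm that both agree. The sketch below rebuilds the first three rows of the Q31 outputs by hand; `same_ranking` and the tolerance of 0.01 are assumptions made for illustration.

```python
import pandas as pd

# SQL-style result: flat columns (first three rows of the Q31 output)
sql_df = pd.DataFrame({
    "city_name": ["Prague", "Hamburg", "Athens"],
    "P95 Recorded Value (2023)": [152.58, 150.75, 142.47],
})

# MDX-style result: hierarchy levels on the index, measure as a column
mdx_df = pd.DataFrame(
    {"P95 Recorded Value": [152.58, 150.75, 142.47]},
    index=pd.MultiIndex.from_tuples(
        [("Central Europe", "Czech Republic", "Prague"),
         ("Central Europe", "Germany", "Hamburg"),
         ("Eastern Europe", "Greece", "Athens")],
        names=["region_name", "country_name", "city_name"],
    ),
)

def same_ranking(sql_df, mdx_df, key, sql_col, mdx_col, tol=0.01) -> bool:
    """Compare key order and values of two result sets within a tolerance."""
    flat = mdx_df.reset_index()          # move index levels into columns
    if list(sql_df[key]) != list(flat[key]):
        return False
    diff = sql_df[sql_col].to_numpy() - flat[mdx_col].to_numpy()
    return bool((abs(diff) <= tol).all())

print(same_ranking(sql_df, mdx_df, "city_name",
                   "P95 Recorded Value (2023)", "P95 Recorded Value"))
```

In the notebook, `sql_df` would come from `run_sqlscript(...)` and `mdx_df` from `session.query_mdx(...)`.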


### 10.2) Business question Q32 (example)

In [57]:
%%mdx 

-- 32. For 2024, show Data Volume (KB) by City for category ‘Volatile Organic Compound’, and list the Top 10 cities.
-- Return the Top 10 cities on rows (highest -> lowest) and one column with Data Volume (KB) for 2024, limited to the Volatile Organic Compound category.
SELECT
  { [Measures].[Data Volume (KB)] } ON COLUMNS,
  TOPCOUNT(
    NONEMPTY([dim_city].[Geo].[city_name].Members, [Measures].[Data Volume (KB)]),
    10, [Measures].[Data Volume (KB)]
  ) ON ROWS
FROM (
  SELECT ( [dim_timemonth].[Time].[year_num].&[2024] ) ON 0 FROM (
    SELECT (
      FILTER(
        [dim_param].[Param].[param_name].Members,
        ANCESTOR(
          [dim_param].[Param].CurrentMember,
          [dim_param].[Param].[category]
        ).Name = "Volatile Organic Compound"
      )
    ) ON 0 FROM [AirQ Cube]
  )
)

region_name,country_name,city_name,Data Volume (KB)
Eastern Europe,Turkey,Istanbul,803441
Western Europe,United Kingdom,London,330327
Eastern Europe,Russia,Moscow,257108
Central Europe,Germany,Berlin,224658
Eastern Europe,Russia,St. Petersburg,215185
Western Europe,France,Paris,196835
Western Europe,Italy,Rome,188638
Eastern Europe,Russia,Ufa,174256
Western Europe,Denmark,Copenhagen,148863
Central Europe,Austria,Vienna,146300


### 10.3) Business question Q33 (example)

In [58]:
%%mdx

-- 33. For parameter PM4 in 2024, return for each Country the Month with the highest Avg Data Quality.
-- Return one row per Country × Month (the month with the highest Avg Data Quality in 2024) and one column with Avg Data Quality.
SELECT
  { [Measures].[Avg Data Quality] } ON COLUMNS,
  NON EMPTY
    GENERATE(
      [dim_city].[Geo].[country_name].Members,
      TOPCOUNT(
        CROSSJOIN(
          { [dim_city].[Geo].CurrentMember },
          Descendants(
            [dim_timemonth].[Time].[year_num].&[2024],
            [dim_timemonth].[Time].[month_name]
          )
        ),
        1, [Measures].[Avg Data Quality]
      )
    ) ON ROWS
FROM [AirQ Cube]
WHERE ( [dim_param].[Param].[param_name].&[PM4] )

region_name,country_name,year_num,quarter_num,month_name,Avg Data Quality
Central Europe,Austria,2024,3,Sep,3.58
Central Europe,Croatia,2024,2,Jun,3.6
Central Europe,Czech Republic,2024,2,Apr,3.24
Central Europe,Germany,2024,3,Sep,3.12
Central Europe,Hungary,2024,3,Aug,3.43
Central Europe,Poland,2024,1,Jan,3.29
Eastern Europe,Belarus,2024,3,Sep,3.24
Eastern Europe,Greece,2024,3,Aug,3.12
Eastern Europe,Russia,2024,2,Jun,3.29
Eastern Europe,Serbia,2024,2,Apr,3.4
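
The `GENERATE` / `TOPCOUNT(…, 1)` combination above is a per-group arg-max: for each country, keep the single month with the highest measure value. A rough Python sketch of that logic follows. The helper name `best_month_per_country` is hypothetical and the numbers are made-up toy values, not cube results.

```python
def best_month_per_country(rows):
    """rows: iterable of (country, month, value); returns {country: (month, value)}."""
    best = {}
    for country, month, value in rows:
        if value is None:
            continue  # loosely mirrors NON EMPTY: ignore cells without data
        # Strict '>' keeps the first month seen when two months tie.
        if country not in best or value > best[country][1]:
            best[country] = (month, value)
    return best

# Toy data for illustration only.
toy_rows = [
    ("Austria", "Jan", 3.1), ("Austria", "Sep", 3.5), ("Austria", "Oct", None),
    ("Poland", "Jan", 3.3), ("Poland", "Feb", 3.0),
]

result = best_month_per_country(toy_rows)
print(result)  # {'Austria': ('Sep', 3.5), 'Poland': ('Jan', 3.3)}
```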


### 10.4) Business Question Q1 - MDX for Student A

In [59]:
# Business Question Q1 - MDX for Student A
# For parameter PM2, show Exceed Days (any) by Country x Month for Q1 of 2024.

mdx_query = Path("mdx/a2_q01_A.mdx").read_text(encoding="utf-8")
df = session.query_mdx(mdx_query)
display(df)

year_num,quarter_num,month_name,region_name,country_name,Exceed Days (any)
2024,1,Jan,Central Europe,Austria,14
2024,1,Feb,Central Europe,Austria,6
2024,1,Mar,Central Europe,Austria,8
2024,1,Jan,Central Europe,Croatia,6
2024,1,Feb,Central Europe,Croatia,2
2024,1,Mar,Central Europe,Croatia,5
2024,1,Jan,Central Europe,Czech Republic,7
2024,1,Feb,Central Europe,Czech Republic,3
2024,1,Mar,Central Europe,Czech Republic,5
2024,1,Jan,Central Europe,Germany,25
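
The cells in this section all repeat the same read-then-query pattern. One possible way to reduce the repetition is a small loader that builds the file name from the `a2_q{NN}_{A|B}.mdx` convention and fails early when the file is missing. Both `mdx_path` and `load_mdx` are hypothetical helpers, not part of the provided notebook.

```python
from pathlib import Path

def mdx_path(question, student, folder="mdx"):
    """Build the path for a question file, e.g. (1, "A") -> mdx/a2_q01_A.mdx."""
    if student not in ("A", "B"):
        raise ValueError(f"student must be 'A' or 'B', got {student!r}")
    return Path(folder) / f"a2_q{question:02d}_{student}.mdx"

def load_mdx(question, student, folder="mdx"):
    """Read the MDX text, failing with a clear message if the file is missing."""
    path = mdx_path(question, student, folder)
    if not path.is_file():
        raise FileNotFoundError(f"MDX file not found: {path}")
    return path.read_text(encoding="utf-8")

print(mdx_path(1, "A").as_posix())  # mdx/a2_q01_A.mdx
```

With these in place, a query cell could shrink to `display(session.query_mdx(load_mdx(1, "A")))`.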


### 10.5) Business Question Q2 - MDX for Student A

In [60]:
# Business Question Q2 - MDX for Student A
# For parameter O3, show Missing Days in Austria by City x Month for Q1 of 2023.

mdx_query = Path("mdx/a2_q02_A.mdx").read_text(encoding="utf-8")
df = session.query_mdx(mdx_query)
display(df)

year_num,quarter_num,month_name,region_name,country_name,city_name,Missing Days
2023,1,Jan,Central Europe,Austria,Graz,11
2023,1,Feb,Central Europe,Austria,Graz,4
2023,1,Mar,Central Europe,Austria,Graz,2
2023,1,Jan,Central Europe,Austria,Salzburg,3
2023,1,Feb,Central Europe,Austria,Salzburg,5
2023,1,Mar,Central Europe,Austria,Salzburg,4
2023,1,Jan,Central Europe,Austria,Vienna,6
2023,1,Feb,Central Europe,Austria,Vienna,2
2023,1,Mar,Central Europe,Austria,Vienna,2


### 10.6) Business Question Q3 - MDX for Student A

In [61]:
# Business Question Q3 - MDX for Student A (NEW)
# For 2024, show Reading Events by Region x Quarter.

mdx_query = Path("mdx/a2_q03_A.mdx").read_text(encoding="utf-8")
df = session.query_mdx(mdx_query)
display(df)

year_num,quarter_num,region_name,Reading Events
2024,1,Central Europe,32373
2024,2,Central Europe,32245
2024,3,Central Europe,32659
2024,4,Central Europe,32872
2024,1,Eastern Europe,47064
2024,2,Eastern Europe,47308
2024,3,Eastern Europe,47776
2024,4,Eastern Europe,47084
2024,1,Western Europe,42956
2024,2,Western Europe,43275


### 10.7) Business Question Q12 - MDX for Student A

In [62]:
# Business Question Q12 - MDX for Student A (NEW)
# For 2023, show Data Volume (KB) by Param Purpose x Region.

mdx_query = Path("mdx/a2_q12_A.mdx").read_text(encoding="utf-8")
df = session.query_mdx(mdx_query)
display(df)

region_name,purpose,Data Volume (KB)
Central Europe,Comfort,372834
Eastern Europe,Comfort,425427
Western Europe,Comfort,426022
Central Europe,Environmental Monitoring,700437
Eastern Europe,Environmental Monitoring,1113626
Western Europe,Environmental Monitoring,985945
Central Europe,Health Risk,3931810
Eastern Europe,Health Risk,5588149
Western Europe,Health Risk,5117287
Central Europe,Regulatory Compliance,1004339


### 10.8) Business Question Q18 - MDX for Student A

In [63]:
# Business Question Q18 - MDX for Student A (NEW)
# For 2024, list Top 5 Countries by total Devices Reporting for Particulate Matter category.

mdx_query = Path("mdx/a2_q18_A.mdx").read_text(encoding="utf-8")
df = session.query_mdx(mdx_query)
display(df)

region_name,country_name,Devices Reporting
Eastern Europe,Turkey,1752
Eastern Europe,Russia,1284
Central Europe,Germany,1056
Western Europe,United Kingdom,804
Western Europe,France,756


### 10.9) Business Question Q9 - MDX for Student B

In [64]:
# Business Question Q9 - MDX for Student B
# For 2024, list the Top 10 Countries by Avg Data Quality.

mdx_query = Path("mdx/a2_q09_B.mdx").read_text(encoding="utf-8")
df = session.query_mdx(mdx_query)
display(df)

region_name,country_name,Avg Data Quality
Eastern Europe,Belarus,3.04
Central Europe,Croatia,3.02
Western Europe,Sweden,3.01
Western Europe,Finland,3.01
Central Europe,Germany,3.01
Western Europe,Italy,3.01
Central Europe,Poland,3.01
Western Europe,France,3.01
Eastern Europe,Serbia,3.01
Western Europe,United Kingdom,3.0


### 10.10) Business Question Q10 - MDX for Student B

In [65]:
# Business Question Q10 - MDX for Student B
# For 2024, show Exceed Days (any) by Region for Param Category = Gas.

mdx_query = Path("mdx/a2_q10_B.mdx").read_text(encoding="utf-8")
df = session.query_mdx(mdx_query)
display(df)

region_name,Exceed Days (any)
Central Europe,6128
Eastern Europe,5742
Western Europe,6686


### 10.11) Business Question Q19 - MDX for Student B

In [66]:
# Business Question Q19 - MDX for Student B (NEW)
# For 2023 and 2024, compare Missing Days by Region (Year-over-Year).

mdx_query = Path("mdx/a2_q19_B.mdx").read_text(encoding="utf-8")
df = session.query_mdx(mdx_query)
display(df)

year_num,region_name,Missing Days
2023,Central Europe,49434
2024,Central Europe,49581
2023,Eastern Europe,23256
2024,Eastern Europe,23439
2023,Western Europe,47544
2024,Western Europe,47448
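
The query returns both years side by side; the year-over-year difference itself can be derived in a few lines. The sketch below uses the values from the output above, while the `yoy_change` helper is hypothetical.

```python
# Missing Days by region, copied from the Q19 output above.
missing_days = {
    "Central Europe": {2023: 49434, 2024: 49581},
    "Eastern Europe": {2023: 23256, 2024: 23439},
    "Western Europe": {2023: 47544, 2024: 47448},
}

def yoy_change(values, prev_year=2023, curr_year=2024):
    """Return (absolute change, percent change) between two years."""
    prev, curr = values[prev_year], values[curr_year]
    delta = curr - prev
    return delta, round(100 * delta / prev, 2)

yoy = {region: yoy_change(v) for region, v in missing_days.items()}
for region, (delta, pct) in yoy.items():
    print(f"{region}: {delta:+d} days ({pct:+.2f} %)")
```

The absolute changes are +147 (Central Europe), +183 (Eastern Europe) and -96 (Western Europe) missing days.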


### 10.12) Business Question Q20 - MDX for Student B

In [67]:
# Business Question Q20 - MDX for Student B (NEW)
# For 2024, show Exceed Days (any) by Alert Level x Param Category.

mdx_query = Path("mdx/a2_q20_B.mdx").read_text(encoding="utf-8")
df = session.query_mdx(mdx_query)
display(df)

purpose,category,alert_level_name,Exceed Days (any)
Comfort,Gas,,0
Comfort,Volatile Organic Compound,,0
Environmental Monitoring,Gas,,0
Environmental Monitoring,Particulate matter,,0
Environmental Monitoring,Volatile Organic Compound,,0
...,...,...,...
Regulatory Compliance,Volatile Organic Compound,Crimson,911
Scientific Study,Biological,Crimson,2948
Scientific Study,Heavy Metal,Crimson,1455
Scientific Study,Particulate matter,Crimson,961


### 10.13) Business Question Q21 - MDX for Student B

In [68]:
# Business Question Q21 - MDX for Student B (NEW)
# For Eastern Europe in 2024, show P95 Recorded Value by City x Quarter for PM10 parameter.

mdx_query = Path("mdx/a2_q21_B.mdx").read_text(encoding="utf-8")
df = session.query_mdx(mdx_query)
display(df)

year_num,quarter_num,region_name,country_name,city_name,P95 Recorded Value
2024,1,Eastern Europe,Belarus,Minsk,158.57
2024,2,Eastern Europe,Belarus,Minsk,215.24
2024,3,Eastern Europe,Belarus,Minsk,169.21
2024,4,Eastern Europe,Belarus,Minsk,163.47
2024,1,Eastern Europe,Greece,Athens,171.21
2024,2,Eastern Europe,Greece,Athens,214.54
2024,3,Eastern Europe,Greece,Athens,151.81
2024,4,Eastern Europe,Greece,Athens,198.76
2024,1,Eastern Europe,Russia,Kazan,167.82
2024,2,Eastern Europe,Russia,Kazan,148.33


## 11) Batch executor: run all .mdx → CSV (+ an index)

In [69]:
def run_mdx_folder(
    mdx_folder="mdx",
    out_folder="mdx_out",
    pattern="*.mdx",
    overwrite=True,
    index_csv="mdx_index.csv",
):
    mdx_path = Path(mdx_folder)
    out_path = Path(out_folder)
    mdx_path.mkdir(exist_ok=True)
    out_path.mkdir(exist_ok=True)

    records = []
    files = sorted(mdx_path.glob(pattern))
    if not files:
        print(f"No MDX files found in {mdx_path.resolve()}.")
        return pd.DataFrame()

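    for f in files:
        q = f.read_text(encoding="utf-8")
        t0 = time.perf_counter()   # monotonic clock, better suited to timing
        error = None
        rows = cols = 0
        dest = out_path / f"{f.stem}.csv"

        try:
            # Flatten the level index into regular columns before export
            df = session.query_mdx(q).reset_index()
            rows, cols = df.shape
            if overwrite or not dest.exists():
                df.to_csv(dest, index=False)
        except Exception as e:
            # Record the failure and continue with the remaining files
            error = str(e)

        elapsed = time.perf_counter() - t0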
        records.append({
            "file": f.name,
            "csv": dest.name,
            "rows": rows,
            "cols": cols,
            "seconds": round(elapsed, 3),
            "error": error,
        })

    index_df = pd.DataFrame(records)
    index_path = out_path / index_csv
    index_df.to_csv(index_path, index=False)
    print(f"Done. Index saved to {index_path}")
    return index_df
    

In [70]:
# Run all MDX files:
index_df = run_mdx_folder()
index_df

Done. Index saved to mdx_out\mdx_index.csv


,file,csv,rows,cols,seconds,error
0,a2_q01_A.mdx,a2_q01_A.csv,51,6,0.23,
1,a2_q02_A.mdx,a2_q02_A.csv,9,7,0.221,
2,a2_q03_A.mdx,a2_q03_A.csv,12,4,0.148,
3,a2_q09_B.mdx,a2_q09_B.csv,10,3,0.124,
4,a2_q10_B.mdx,a2_q10_B.csv,3,2,0.108,
5,a2_q12_A.mdx,a2_q12_A.csv,15,3,0.107,
6,a2_q18_A.mdx,a2_q18_A.csv,5,3,0.101,
7,a2_q19_B.mdx,a2_q19_B.csv,6,3,0.096,
8,a2_q20_B.mdx,a2_q20_B.csv,90,4,0.105,
9,a2_q21_B.mdx,a2_q21_B.csv,36,6,0.129,
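
Because the executor records exceptions in the `error` column instead of raising them, a failed query is easy to overlook in a long index. A minimal sketch for listing failures from the saved index file, assuming the column names shown above; `failed_queries` is a hypothetical helper.

```python
import csv
from pathlib import Path

def failed_queries(index_csv):
    """Return [(file, error)] for index rows whose 'error' column is non-empty."""
    with open(index_csv, newline="", encoding="utf-8") as fh:
        return [
            (row["file"], row["error"])
            for row in csv.DictReader(fh)
            if (row.get("error") or "").strip()
        ]

# Only runs when the index produced by run_mdx_folder() is present.
index_file = Path("mdx_out") / "mdx_index.csv"
if index_file.is_file():
    failures = failed_queries(index_file)
    print("All queries succeeded." if not failures else failures)
```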



## 12) Create `sqldump/sqldump_airq_dwh2_xxx.sql`

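We run `pg_dump -n dwh2_xxx --no-owner --no-privileges`: `-n` restricts the dump to the warehouse schema, while `--no-owner` and `--no-privileges` omit ownership and GRANT/REVOKE statements, so the dump can be restored under a different database role.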


In [71]:
# === Create sqldump/sqldump_airq_dwh2_xxx.sql (pg_dump) ===
sqldump_dir.mkdir(exist_ok=True)
outfile = sqldump_dir / f"sqldump_airq_dwh2_{XXX}.sql"

def detect_postgres_container():
    """Return the name of a running PostgreSQL container, or None."""
    try:
        result = subprocess.run(
            ["docker", "ps", "--filter", "ancestor=postgres", "--format", "{{.Names}}"],
            capture_output=True, text=True, check=True
        )
        name = result.stdout.strip().split('\n')[0]
        if name:
            return name

        # fallback: match container name
        result = subprocess.run(
            ["docker", "ps", "--filter", "name=postgres", "--format", "{{.Names}}"],
            capture_output=True, text=True, check=True
        )
        name = result.stdout.strip().split('\n')[0]
        return name or None

    except Exception:
        return None


def run_pg_dump_docker(container_name):
    """Run pg_dump inside Docker."""
    print(f"Using Docker container: {container_name}")

    cmd = [
        "docker", "exec", "-i", container_name,
        "pg_dump",
        "-U", DB_USER,
        "-d", DB_NAME,
        "-n", f"dwh2_{XXX}",
        "--no-owner",
        "--no-privileges"
    ]

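    env = dict(os.environ)
    if 'pw' in globals() and pw:
        # Note: this sets PGPASSWORD for the host-side `docker` process only;
        # `docker exec` does not forward it into the container unless it is
        # passed explicitly with `-e`.
        env["PGPASSWORD"] = pw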

    print("Running:", " ".join(cmd).replace(DB_USER, "<user>"))
    with open(outfile, "w", encoding="utf-8") as f:
        subprocess.run(cmd, check=True, env=env, stdout=f)

    print("✓ Dump created at", outfile)


def run_pg_dump_local():
    """Run pg_dump using the local binary."""
    pg_dump = shutil.which("pg_dump") or "pg_dump"

    cmd = [
        pg_dump,
        "-h", DB_HOST,
        "-p", str(DB_PORT),
        "-U", DB_USER,
        "-d", DB_NAME,
        "-n", f"dwh2_{XXX}",
        "--no-owner",
        "--no-privileges",
        "-f", str(outfile),
    ]

    env = dict(os.environ)
    if 'pw' in globals() and pw:
        env["PGPASSWORD"] = pw

    print("Running:", " ".join(cmd).replace(DB_USER, "<user>"))
    subprocess.run(cmd, check=True, env=env)
    print("✓ Dump created at", outfile)


# === Logic: try Docker only if it actually exists ===
postgres_container = detect_postgres_container()

if postgres_container:
    try:
        run_pg_dump_docker(postgres_container)
    except Exception as e:
        print(f"Docker pg_dump failed: {e}")
        print("Falling back to local pg_dump...\n")
        run_pg_dump_local()
else:
    # No docker container found → use local pg_dump directly
    run_pg_dump_local()


Using Docker container: bi2025_postgres
Running: docker exec -i bi2025_postgres pg_dump -U <user> -d airq -n dwh2_020 --no-owner --no-privileges
✓ Dump created at C:\ProgramData\anaconda3\envs\dwh\etc\jupyter\BI_Projects\DWH2_020\sqldump\sqldump_airq_dwh2_020.sql
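
A success message only shows that `pg_dump` exited cleanly. A quick sanity check of the file itself can catch an empty or wrong-schema dump before submission. The sketch below is an assumption-laden example: `check_dump` is a hypothetical helper, and it assumes a plain-text dump that mentions the schema name somewhere in its contents.

```python
import hashlib
from pathlib import Path

def check_dump(path, schema):
    """Return (size_bytes, sha256_hex) after basic sanity checks on a dump file."""
    data = Path(path).read_bytes()  # reads the whole file; fine for modest dumps
    if not data:
        raise ValueError(f"{path} is empty")
    # Assumption: a schema-level plain-text dump mentions the schema by name.
    if schema.encode("utf-8") not in data:
        raise ValueError(f"{path} does not mention schema {schema!r}")
    return len(data), hashlib.sha256(data).hexdigest()
```

In the notebook this could be called as `check_dump(outfile, f"dwh2_{XXX}")` right after the dump cell.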


## 13) Submission checklist (put these in your **ZIP**)

- `csv/` — CSV files 
- `ddl/` — DDL scripts 
- `etl/` — Your `a2_etl*.sql` files (ETL scripts)
- `mdx/` — Your `a2_q{NN}_{A|B}.mdx` files (MDX queries for business questions)
- `mdx_out/` — Your `a2_q{NN}_{A|B}.csv` files (results of MDX queries)
- `pdf/` — Your `a2_q{NN}.pdf` files (Dashboard exports as .pdf)
- `sql/` — Your `a2_q{NN}_{A|B}.sql` files (SQL queries for business questions)
- `sqldump/` — `sqldump_airq_dwh2_xxx.sql`  
- `AirQ_Part2_xxx.ipynb`
- `group_xxx.txt`
- `Report_Part2_Group_xxx.pdf`
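
The checklist above can also be verified mechanically before zipping. A minimal sketch, assuming the folder layout and file names listed in this notebook; `missing_items` is a hypothetical helper and only checks that entries exist, not their contents.

```python
from pathlib import Path

def missing_items(root, group):
    """List checklist entries (folders and files) that are absent under root."""
    root = Path(root)
    folders = ["csv", "ddl", "etl", "mdx", "mdx_out", "pdf", "sql", "sqldump"]
    files = [
        f"sqldump/sqldump_airq_dwh2_{group}.sql",
        f"AirQ_Part2_{group}.ipynb",
        f"group_{group}.txt",
        f"Report_Part2_Group_{group}.pdf",
    ]
    # Folders are reported with a trailing slash to tell them apart from files.
    missing = [name + "/" for name in folders if not (root / name).is_dir()]
    missing += [name for name in files if not (root / name).is_file()]
    return missing
```

For example, `missing_items(".", XXX)` returns an empty list when every checklist entry is present in the current project folder.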

## Dashboard PDFs (Section 12.5)

All dashboard PDFs have been created and exported to the `pdf/` folder:

- **Student A dashboards:**
  - `pdf/a2_q01.pdf` — Q01: Exceed Days (any) by Country × Month for PM2, 2024 Q1
  - `pdf/a2_q02.pdf` — Q02: Missing Days in Austria by City × Month for O3, 2023 Q1
  - `pdf/a2_q12.pdf` — Q12: Data Volume (KB) by Param Purpose × Region, 2023

- **Student B dashboards:**
  - `pdf/a2_q09.pdf` — Q09: Top 10 Countries by Avg Data Quality, 2024
  - `pdf/a2_q20.pdf` — Q20: Exceed Days (any) by Alert Level × Param Category, 2024
  - `pdf/a2_q21.pdf` — Q21: P95 Recorded Value by City × Quarter for PM10, Eastern Europe, 2024

All dashboards were validated against the corresponding MDX outputs in `mdx_out/`.
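
As a complement to the manual comparison, a short script can at least confirm that every dashboard PDF has a matching MDX result file. This is a sketch under the naming conventions used in this notebook (`a2_q{NN}.pdf` and `a2_q{NN}_{A|B}.csv`); the helper `dashboards_without_csv` is hypothetical and checks file presence only, not the values shown in the dashboards.

```python
import re
from pathlib import Path

PDF_NAME = re.compile(r"a2_q(\d{2})\.pdf")

def dashboards_without_csv(pdf_dir="pdf", out_dir="mdx_out"):
    """Return PDF names that have no a2_q{NN}_{A|B}.csv counterpart."""
    orphans = []
    for pdf in sorted(Path(pdf_dir).glob("a2_q*.pdf")):
        m = PDF_NAME.fullmatch(pdf.name)
        if not m:
            orphans.append(pdf.name)  # name does not follow a2_q{NN}.pdf
            continue
        nn = m.group(1)
        # Accept a result file from either student (suffix A or B).
        if not any((Path(out_dir) / f"a2_q{nn}_{s}.csv").is_file() for s in "AB"):
            orphans.append(pdf.name)
    return orphans
```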