## TASK BREAKDOWN

**1. Create a daily-updated data archive of observed meteorology:**

- Stakeholders are Salient's Machine Learning team and our customers.
- The task is limited to a 2-hour timeframe, enforced on the honor system.
- The deadline to submit an answer is 2 weeks after receipt of this email.

For now, the archive will contain 3 observed met stations, identified by WBAN code:

- 14739 (Boston)
- 23169 (Las Vegas)
- 94846 (Chicago)

Eventually, this system must scale to handle all >100k GHCNd stations.
Get the data from NCEI. Example for Boston:

https://www.ncei.noaa.gov/data/global-historical-climatology-network-daily/access/USW00014739.csv
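
The Boston URL above suggests a pattern: the station file is named after the GHCN ID, which for these stations is USW followed by the WBAN code zero-padded to eight digits. A minimal sketch of that mapping follows. The helper names `ghcn_id_from_wban` and `station_url` are hypothetical, and the USW prefix is an assumption that fits the three listed stations but not necessarily every GHCNd station.

```python
BASE_URL = (
    "https://www.ncei.noaa.gov/data/"
    "global-historical-climatology-network-daily/access"
)


def ghcn_id_from_wban(wban: int) -> str:
    """Build a GHCN ID such as USW00014739 from a WBAN code (assumed USW prefix)."""
    return f"USW{wban:08d}"


def station_url(ghcn_id: str) -> str:
    """Return the per-station CSV URL for a GHCN ID."""
    return f"{BASE_URL}/{ghcn_id}.csv"


for wban in (14739, 23169, 94846):
    print(station_url(ghcn_id_from_wban(wban)))
```

Keeping the ID and URL construction in one place avoids repeating the long f-string in every function that downloads data.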


**2. Output is a zarr archive:**

- Coordinates: ghcn_id & time, chunked at your discretion.
- Data variables: precip (mm/day), tmax (°C), tmin (°C).
- The source data calls precipitation "prcp", so you'll have to rename it.
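
The rename and unit handling can be sketched as below. It assumes, as the raw values shown later in this notebook suggest (for example a Boston August TMAX of 317), that the NCEI CSV stores PRCP in tenths of a millimetre and TMAX/TMIN in tenths of a degree Celsius. The function name `to_archive_units` is hypothetical.

```python
import pandas as pd

# Source column -> archive variable name
RENAME = {"PRCP": "precip", "TMAX": "tmax", "TMIN": "tmin"}


def to_archive_units(raw: pd.DataFrame) -> pd.DataFrame:
    """Select, rename, and rescale the three variables (assumed tenths -> whole units)."""
    out = raw[list(RENAME)].rename(columns=RENAME).astype("float64") / 10.0
    out.index = pd.to_datetime(raw.index)
    out.index.name = "time"
    return out


# Two raw rows taken from the Boston output in this notebook
raw = pd.DataFrame(
    {"PRCP": [28.0, 0.0], "TMAX": [317.0, 289.0], "TMIN": [233.0, 206.0]},
    index=["2024-08-10", "2024-08-11"],
)
print(to_archive_units(raw))
```

Renaming all three columns in one step leaves a frame that contains only the archive variables, with no leftover upper-case columns.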

**3. Write functions:**

- build_ghcnd_archive: establishes a fresh archive from scratch
- update_ghcnd_archive: updates the archive each day

**4. Answer questions with 1-3 sentences:**

- How would you orchestrate this system to run at scale?
- What major risks would this system face?
- What are the next set of enhancements you would add?
- How would you improve the clarity of this assignment?

Send your answer as zipped (.py | .ipynb) & .pdf.
The PDF must contain a print statement that shows the archive contents.

In [1]:
import os
import shutil
import xarray as xr
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
import gcsfs
import zarr
import dask
from datetime import datetime, timedelta
from urllib.request import urlopen
from IPython.display import display

In [2]:
WBAN_CODES = [14739, 23169, 94846]
DATA_DIR = 'ghcnd_archive'
ZARR_STORE = os.path.join(DATA_DIR, 'ghcnd.zarr')

In [7]:
def build_ghcnd_archive():
    """Establish a fresh archive from scratch"""

    # Start from an empty archive directory
    shutil.rmtree(DATA_DIR, ignore_errors=True)
    os.makedirs(DATA_DIR)

    ds = xr.Dataset()

    for wban in WBAN_CODES:
        url = f'https://www.ncei.noaa.gov/data/global-historical-climatology-network-daily/access/USW{str(wban).zfill(8)}.csv'
        try:
            df = pd.read_csv(url, index_col='DATE')
        except Exception as e:
            # Handle error during download 
            print(f"Error downloading data for WBAN {wban}: {e}")
            continue

        display(df)

        # Convert index to datetime
        df.index = pd.to_datetime(df.index)
        df.index.name = 'time'  # Set index name to 'time'

        end_date = datetime.today() - timedelta(days=4)  # four days before today (2024-08-13 at the time of this run)

        # Filter DataFrame based on dates
        df = df.loc[df.index.min():end_date]

        # Select relevant columns and rename 'PRCP' to 'precip'
        df = df[['PRCP', 'TMAX', 'TMIN']].rename(columns={'PRCP': 'precip'})

        display(df)
        
        # Convert units: the raw values are tenths of mm and tenths of °C
        df['precip'] = df['precip'].astype('float') / 10  # tenths of mm -> mm/day
        df['tmax'] = df['TMAX'].astype('float') / 10  # tenths of °C -> °C
        df['tmin'] = df['TMIN'].astype('float') / 10  # tenths of °C -> °C

        # Create an xarray DataArray for each variable.
        # Note: ds[var] is reassigned for every station, so the dataset holds a single
        # 1-D series along 'time' rather than one series per station.
        for var in ['precip', 'tmax', 'tmin']:
            ds[var] = xr.DataArray(df[var].values, dims=['time'], coords={'time': df.index})

        # Tag each time step with the station ID. 'ghcn_id' is stored as a data variable,
        # not a coordinate, so it is likewise reassigned for every station.
        ds['ghcn_id'] = xr.DataArray(
            [f'USW{str(wban).zfill(8)}'] * len(df), dims=['time'], coords={'time': df.index}
        )
        ds['ghcn_id'] = ds['ghcn_id'].astype(str)

    # Chunk the dataset (adjust chunk sizes as needed)
    ds = ds.chunk({'time': 1})

    group = 'meteorology-etl-job'

    # Save to Zarr store
    ds.to_zarr(ZARR_STORE, mode='w', group=group)
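
The task asks for ghcn_id and time as coordinates. One possible way to get there, sketched here on synthetic frames rather than NCEI downloads, is to turn each station's frame into a small Dataset, give it a length-one ghcn_id dimension, and concatenate the results. The function name `stack_stations` and the station IDs `A` and `B` are made up for illustration, and this is not the layout produced by the cell above.

```python
import pandas as pd
import xarray as xr


def stack_stations(frames: dict) -> xr.Dataset:
    """Combine per-station frames (indexed by 'time') into a (ghcn_id, time) Dataset."""
    per_station = []
    for ghcn_id, df in frames.items():
        ds = df[["precip", "tmax", "tmin"]].to_xarray()  # dims: (time,)
        per_station.append(ds.expand_dims(ghcn_id=[ghcn_id]))
    # An outer join pads stations with shorter records with NaN
    return xr.concat(per_station, dim="ghcn_id", join="outer")


# Synthetic data: station B starts one day later than station A
times = pd.date_range("2024-01-01", periods=3, name="time")
a = pd.DataFrame(
    {"precip": [0.0, 1.2, 0.0], "tmax": [3.0, 4.0, 5.0], "tmin": [-2.0, -1.0, 0.0]},
    index=times,
)
b = a.iloc[1:] + 10.0

ds = stack_stations({"A": a, "B": b})
print(ds)
```

With this layout every variable has dimensions (ghcn_id, time), so a station's record can be selected with `ds.sel(ghcn_id=...)` instead of filtering a string column.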

In [3]:
def update_ghcnd_archive():
    """Update the archive each day"""

    if not os.path.exists(DATA_DIR):
        os.makedirs(DATA_DIR)

    ds = xr.Dataset()

    # Only rows dated on or after the cutoff are appended
    today = datetime.today().date()
    cutoff = today - timedelta(days=3)  # three days before today (2024-08-14 at the time of this run)

    for wban in WBAN_CODES:
        url = f'https://www.ncei.noaa.gov/data/global-historical-climatology-network-daily/access/USW{str(wban).zfill(8)}.csv'
        try:
            df = pd.read_csv(url, index_col='DATE')
        except Exception as e:
            # Skip this station if the download fails
            print(f"Error downloading data for WBAN {wban}: {e}")
            continue

        # Convert index to datetime
        df.index = pd.to_datetime(df.index)
        df.index.name = 'time'

        # Extract the latest data
        new_data = df.loc[df.index.date >= cutoff]

        # Select relevant columns and rename 'PRCP' to 'precip'
        new_data = new_data[['PRCP', 'TMAX', 'TMIN']].rename(columns={'PRCP': 'precip'})

        display(new_data)

        # Convert units: the raw values are tenths of mm and tenths of °C
        new_data['precip'] = new_data['precip'].astype('float') / 10  # tenths of mm -> mm/day
        new_data['tmax'] = new_data['TMAX'].astype('float') / 10  # tenths of °C -> °C
        new_data['tmin'] = new_data['TMIN'].astype('float') / 10  # tenths of °C -> °C

        # Create xarray DataArray for each variable
        for var in ['precip', 'tmax', 'tmin']:
            ds[var] = xr.DataArray(new_data[var].values, dims=['time'], coords={'time': new_data.index})

        # Tag each time step with the station ID. 'ghcn_id' is stored as a data variable,
        # not a coordinate, so it is reassigned for every station.
        ds['ghcn_id'] = xr.DataArray(
            [f'USW{str(wban).zfill(8)}'] * len(new_data), dims=['time'], coords={'time': new_data.index}
        )
        ds['ghcn_id'] = ds['ghcn_id'].astype(str)
        
    # Chunk and save to Zarr store
    ds = ds.chunk({'time': 1})

    group = 'meteorology-etl-job'

    ds.to_zarr(ZARR_STORE, append_dim='time', group=group)
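
Because each station file always contains the full history, a fixed look-back window can select days that an earlier daily run already appended. One hedged alternative is to compare against the newest timestamp already stored and keep only later rows. The sketch below works on plain pandas objects; `rows_to_append` is a hypothetical helper, and wiring it to the Zarr store (reading the last archived time, then appending along time) is left out.

```python
import numpy as np
import pandas as pd


def rows_to_append(station_df: pd.DataFrame, last_archived) -> pd.DataFrame:
    """Keep rows newer than the last archived timestamp, skipping all-NaN days."""
    last = pd.Timestamp(last_archived)
    fresh = station_df.loc[station_df.index > last]
    # Days with no observations yet are left for a later run
    return fresh.dropna(how="all")


times = pd.to_datetime(["2024-08-12", "2024-08-13", "2024-08-14"])
station_df = pd.DataFrame(
    {
        "precip": [0.0, 0.0, np.nan],
        "tmax": [26.1, 27.0, np.nan],
        "tmin": [18.3, 19.0, np.nan],
    },
    index=times,
)

print(rows_to_append(station_df, "2024-08-12"))
```

Dropping all-NaN days means the last archived timestamp does not advance past a day that has no data yet, so that day is picked up once the source file is filled in.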


In [8]:

# Run
build_ghcnd_archive()


DATE,STATION,LATITUDE,LONGITUDE,ELEVATION,NAME,PRCP,PRCP_ATTRIBUTES,SNOW,SNOW_ATTRIBUTES,SNWD,...,WT17,WT17_ATTRIBUTES,WT18,WT18_ATTRIBUTES,WT19,WT19_ATTRIBUTES,WT21,WT21_ATTRIBUTES,WT22,WT22_ATTRIBUTES
1936-01-01,USW00014739,42.36057,-71.00975,3.2,"BOSTON LOGAN INTERNATIONAL AIRPORT, MA US",0.0,",,0,",0.0,",,0",0.0,...,,,,,,,,,,
1936-01-02,USW00014739,42.36057,-71.00975,3.2,"BOSTON LOGAN INTERNATIONAL AIRPORT, MA US",53.0,",,0,2400",0.0,"T,,0",0.0,...,,,1.0,",,X",,,,,,
1936-01-03,USW00014739,42.36057,-71.00975,3.2,"BOSTON LOGAN INTERNATIONAL AIRPORT, MA US",353.0,",,0,2400",0.0,",,0",0.0,...,,,,,,,,,,
1936-01-04,USW00014739,42.36057,-71.00975,3.2,"BOSTON LOGAN INTERNATIONAL AIRPORT, MA US",0.0,"T,,0,2400",0.0,",,0",0.0,...,,,,,,,,,,
1936-01-05,USW00014739,42.36057,-71.00975,3.2,"BOSTON LOGAN INTERNATIONAL AIRPORT, MA US",229.0,",,0,2400",0.0,"T,,0",0.0,...,,,1.0,",,X",,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2024-08-10,USW00014739,42.36057,-71.00975,3.2,"BOSTON LOGAN INTERNATIONAL AIRPORT, MA US",28.0,",,W,2400",0.0,",,W",,...,,,,,,,,,,
2024-08-11,USW00014739,42.36057,-71.00975,3.2,"BOSTON LOGAN INTERNATIONAL AIRPORT, MA US",0.0,",,W,2400",0.0,",,W",,...,,,,,,,,,,
2024-08-12,USW00014739,42.36057,-71.00975,3.2,"BOSTON LOGAN INTERNATIONAL AIRPORT, MA US",0.0,",,W,2400",0.0,",,W",,...,,,,,,,,,,
2024-08-13,USW00014739,42.36057,-71.00975,3.2,"BOSTON LOGAN INTERNATIONAL AIRPORT, MA US",0.0,",,D,2400",0.0,",,D",,...,,,,,,,,,,


time,precip,TMAX,TMIN
1936-01-01,0.0,17.0,-61.0
1936-01-02,53.0,17.0,-61.0
1936-01-03,353.0,122.0,17.0
1936-01-04,0.0,78.0,17.0
1936-01-05,229.0,61.0,6.0
...,...,...,...
2024-08-09,81.0,294.0,178.0
2024-08-10,28.0,317.0,233.0
2024-08-11,0.0,289.0,206.0
2024-08-12,0.0,261.0,183.0



DATE,STATION,LATITUDE,LONGITUDE,ELEVATION,NAME,PRCP,PRCP_ATTRIBUTES,SNOW,SNOW_ATTRIBUTES,SNWD,...,WT18,WT18_ATTRIBUTES,WT21,WT21_ATTRIBUTES,WV01,WV01_ATTRIBUTES,WV03,WV03_ATTRIBUTES,WV07,WV07_ATTRIBUTES
1948-09-06,USW00023169,36.0719,-115.16343,662.8,"MCCARRAN INTERNATIONAL AIRPORT, NV US",0.0,",,X,",0.0,",,X",0.0,...,,,,,,,,,,
1948-09-07,USW00023169,36.0719,-115.16343,662.8,"MCCARRAN INTERNATIONAL AIRPORT, NV US",0.0,",,X,",0.0,",,X",0.0,...,,,,,,,,,,
1948-09-08,USW00023169,36.0719,-115.16343,662.8,"MCCARRAN INTERNATIONAL AIRPORT, NV US",0.0,",,X,",0.0,",,X",0.0,...,,,,,,,,,,
1948-09-09,USW00023169,36.0719,-115.16343,662.8,"MCCARRAN INTERNATIONAL AIRPORT, NV US",0.0,",,X,",0.0,",,X",0.0,...,,,,,,,,,,
1948-09-10,USW00023169,36.0719,-115.16343,662.8,"MCCARRAN INTERNATIONAL AIRPORT, NV US",0.0,",,X,",0.0,",,X",0.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2024-08-10,USW00023169,36.0719,-115.16343,662.8,"MCCARRAN INTERNATIONAL AIRPORT, NV US",0.0,",,W,2400",0.0,",,W",0.0,...,,,,,,,,,,
2024-08-11,USW00023169,36.0719,-115.16343,662.8,"MCCARRAN INTERNATIONAL AIRPORT, NV US",0.0,",,W,2400",0.0,",,W",0.0,...,,,,,,,,,,
2024-08-12,USW00023169,36.0719,-115.16343,662.8,"MCCARRAN INTERNATIONAL AIRPORT, NV US",0.0,",,W,2400",0.0,",,W",0.0,...,,,,,,,,,,
2024-08-13,USW00023169,36.0719,-115.16343,662.8,"MCCARRAN INTERNATIONAL AIRPORT, NV US",0.0,",,D,2400",0.0,",,D",,...,,,,,,,,,,


time,precip,TMAX,TMIN
1948-09-06,0.0,406.0,161.0
1948-09-07,0.0,400.0,156.0
1948-09-08,0.0,394.0,178.0
1948-09-09,0.0,411.0,167.0
1948-09-10,0.0,411.0,161.0
...,...,...,...
2024-08-09,0.0,411.0,300.0
2024-08-10,0.0,433.0,317.0
2024-08-11,0.0,417.0,322.0
2024-08-12,0.0,406.0,311.0



DATE,STATION,LATITUDE,LONGITUDE,ELEVATION,NAME,PRCP,PRCP_ATTRIBUTES,SNOW,SNOW_ATTRIBUTES,SNWD,...,WT19,WT19_ATTRIBUTES,WT21,WT21_ATTRIBUTES,WT22,WT22_ATTRIBUTES,WV03,WV03_ATTRIBUTES,WV20,WV20_ATTRIBUTES
1946-10-09,USW00094846,41.96017,-87.93164,204.8,"CHICAGO OHARE INTERNATIONAL AIRPORT, IL US",,,,,,...,,,,,,,,,,
1946-10-10,USW00094846,41.96017,-87.93164,204.8,"CHICAGO OHARE INTERNATIONAL AIRPORT, IL US",,,,,,...,,,,,,,,,,
1946-10-11,USW00094846,41.96017,-87.93164,204.8,"CHICAGO OHARE INTERNATIONAL AIRPORT, IL US",,,,,,...,,,,,,,,,,
1946-10-12,USW00094846,41.96017,-87.93164,204.8,"CHICAGO OHARE INTERNATIONAL AIRPORT, IL US",,,,,,...,,,,,,,,,,
1946-10-13,USW00094846,41.96017,-87.93164,204.8,"CHICAGO OHARE INTERNATIONAL AIRPORT, IL US",,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2024-08-10,USW00094846,41.96017,-87.93164,204.8,"CHICAGO OHARE INTERNATIONAL AIRPORT, IL US",0.0,",,W,2400",0.0,",,W,2400",0.0,...,,,,,,,,,,
2024-08-11,USW00094846,41.96017,-87.93164,204.8,"CHICAGO OHARE INTERNATIONAL AIRPORT, IL US",0.0,",,W,2400",0.0,",,W,2400",0.0,...,,,,,,,,,,
2024-08-12,USW00094846,41.96017,-87.93164,204.8,"CHICAGO OHARE INTERNATIONAL AIRPORT, IL US",0.0,",,W,2400",0.0,",,W,2400",0.0,...,,,,,,,,,,
2024-08-13,USW00094846,41.96017,-87.93164,204.8,"CHICAGO OHARE INTERNATIONAL AIRPORT, IL US",0.0,",,W,2400",0.0,",,D,2400",0.0,...,,,,,,,,,,


time,precip,TMAX,TMIN
1946-10-09,,,
1946-10-10,,,
1946-10-11,,,
1946-10-12,,,
1946-10-13,,,
...,...,...,...
2024-08-09,0.0,222.0,156.0
2024-08-10,0.0,239.0,133.0
2024-08-11,0.0,267.0,144.0
2024-08-12,0.0,278.0,161.0


In [4]:
update_ghcnd_archive()


time,precip,TMAX,TMIN
2024-08-14,,,



time,precip,TMAX,TMIN
2024-08-14,,,



time,precip,TMAX,TMIN
2024-08-14,,,


In [10]:
# Print the archive contents
ds = xr.open_zarr(ZARR_STORE, group='meteorology-etl-job')

display(ds)

# Create a PDF file
with PdfPages('archive_contents_meteorology-etl-job.pdf') as pdf:

    # Create a figure for the table
    fig, ax = plt.subplots(figsize=(8, 6))

    # Create a table with variable information
    table_data = [["Variable", "Dimensions", "Shape", "Data Type"]]
    for var in ds.variables:
        table_data.append([
            var, 
            str(ds[var].dims), 
            str(ds[var].shape),
            str(ds[var].dtype)
        ])

    # Display the table using matplotlib's table function
    table = ax.table(cellText=table_data, loc='center', cellLoc='center')
    table.set_fontsize(10)
    table.scale(1, 1.5)

    # Remove axes and ticks
    ax.axis('off')
    ax.set_title("Archive Contents for: meteorology-etl-job")

    # Save the figure to the PDF
    pdf.savefig(fig, bbox_inches='tight')

    # Close the figure
    plt.close(fig)

Variable,Bytes,Chunk bytes,Shape,Chunk shape,Dask graph,Data type
ghcn_id,1.36 MiB,44 B,"(32368,)","(1,)",32368 chunks in 2 graph layers,
precip,252.88 kiB,8 B,"(32368,)","(1,)",32368 chunks in 2 graph layers,float64 numpy.ndarray
tmax,252.88 kiB,8 B,"(32368,)","(1,)",32368 chunks in 2 graph layers,float64 numpy.ndarray
tmin,252.88 kiB,8 B,"(32368,)","(1,)",32368 chunks in 2 graph layers,float64 numpy.ndarray

In [5]:
# Print the archive contents
ds = xr.open_zarr(ZARR_STORE, group='meteorology-etl-job')

display(ds)

# Create a PDF file
with PdfPages('archive_contents_meteorology-etl-job-updated.pdf') as pdf:

    # Create a figure for the table
    fig, ax = plt.subplots(figsize=(8, 6))

    # Create a table with variable information
    table_data = [["Variable", "Dimensions", "Shape", "Data Type"]]
    for var in ds.variables:
        table_data.append([
            var, 
            str(ds[var].dims), 
            str(ds[var].shape),
            str(ds[var].dtype)
        ])

    # Display the table using matplotlib's table function
    table = ax.table(cellText=table_data, loc='center', cellLoc='center')
    table.set_fontsize(10)
    table.scale(1, 1.5)

    # Remove axes and ticks
    ax.axis('off')
    ax.set_title("Archive Contents for: meteorology-etl-job")

    # Save the figure to the PDF
    pdf.savefig(fig, bbox_inches='tight')

    # Close the figure
    plt.close(fig)

Variable,Bytes,Chunk bytes,Shape,Chunk shape,Dask graph,Data type
ghcn_id,1.36 MiB,44 B,"(32369,)","(1,)",32369 chunks in 2 graph layers,
precip,252.88 kiB,8 B,"(32369,)","(1,)",32369 chunks in 2 graph layers,float64 numpy.ndarray
tmax,252.88 kiB,8 B,"(32369,)","(1,)",32369 chunks in 2 graph layers,float64 numpy.ndarray
tmin,252.88 kiB,8 B,"(32369,)","(1,)",32369 chunks in 2 graph layers,float64 numpy.ndarray

In [6]:
# Print the archive contents
ds = xr.open_zarr(ZARR_STORE, group='meteorology-etl-job')

display(ds)

# Create a PDF file
with PdfPages('archive_contents_meteorology-etl-job-full.pdf') as pdf:

    # Convert xarray dataset to a Pandas DataFrame
    df = ds.to_dataframe()

    # Determine the number of pages needed
    num_pages = len(df) // 20  # Adjust 20 based on the desired number of rows per page
    if len(df) % 20 != 0:
        num_pages += 1

    # Iterate through chunks of the DataFrame and create a figure for each page
    for page_num in range(num_pages):
        start_row = page_num * 20
        end_row = min((page_num + 1) * 20, len(df))

        # Create a figure for the table
        fig, ax = plt.subplots(figsize=(12, 6))  # Adjust size as needed

        # Create the table from this page's slice of the DataFrame
        page = df.iloc[start_row:end_row]
        table = ax.table(cellText=page.values, colLabels=df.columns, loc='center', cellLoc='center')
        table.set_fontsize(8)  # Adjust font size as needed
        table.scale(1, 1.5)  # Adjust scaling as needed

        # Remove axes and ticks
        ax.axis('off')
        ax.set_title("Archive Contents for: meteorology-etl-job")

        # Save the figure to the PDF
        pdf.savefig(fig, bbox_inches='tight')

        # Close the figure
        plt.close(fig)

Variable,Bytes,Chunk bytes,Shape,Chunk shape,Dask graph,Data type
ghcn_id,1.36 MiB,44 B,"(32369,)","(1,)",32369 chunks in 2 graph layers,
precip,252.88 kiB,8 B,"(32369,)","(1,)",32369 chunks in 2 graph layers,float64 numpy.ndarray
tmax,252.88 kiB,8 B,"(32369,)","(1,)",32369 chunks in 2 graph layers,float64 numpy.ndarray
tmin,252.88 kiB,8 B,"(32369,)","(1,)",32369 chunks in 2 graph layers,float64 numpy.ndarray