# Explore NetCDF (*.nc) file

In [1]:
# Import packages
from pathlib import Path
import pandas as pd
import xarray as xr

## The `xarray` [dataset object](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html)
The dataset we'll use comes from NOAA's [Hindcast and Reanalysis Archives - Phase 2](https://polar.ncep.noaa.gov/waves/hindcasts/nopp-phase2.php).  

Specifically, we are using partition data on the 10 arc-minute (`10m`) grid for the Mediterranean Sea from January 1979: [link](https://polar.ncep.noaa.gov/waves/hindcasts/nopp-phase2/197901/partitions). This [dataset](https://polar.ncep.noaa.gov/waves/hindcasts/nopp-phase2/197901/partitions/multi_reanal.partition.med_10m.197901.nc) is available on the mapped class drive at `U:\859_data\multi_reanal.partition.med_10m.197901.nc`.

In [49]:
# Set path to the nc file and import it as an xarray dataset
# (raw string, so the backslash in the Windows path is kept literally)
nc_file = r"V:\multi_reanal.partition.med_10m.197901.nc"
ds = xr.open_dataset(nc_file)
ds

### The components of an `xarray` dataset object:
### ➡️Dimensions
Dimensions define the axes or coordinate lengths that structure the dataset. They describe how data is organized — e.g., time, space, or ensemble members. 

The results above indicate our data has 4 dimensions:
* There are **745 time steps** (`date: 745`)
* Each time step includes data for **109** points of latitude and **301** points of longitude (`longitude: 301, latitude: 109`)
* Each location at each time step is associated with **11 partitions** (`partition: 11`)

> Dimensions are the *axes of a multidimensional spreadsheet*. 

While the first three dimensions are intuitive, the fourth dimension, `partition`, is specific to this dataset. Partitions refer to different spectral components of oceanic waves ([source](https://polar.ncep.noaa.gov/waves/workshop/pdfs/WW3-workshop-exercises-day4-wavetracking.pdf)). For our purposes, we don't need to go into more detail than that; instead, just imagine that this dataset was collected by a set of 11 different sensor types, and the `partition` dimension represents data collected by each of these sensors.

In [50]:
#Reveal the dimensions of the dataset
ds.dims
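To make the idea of dimensions concrete without needing the class file, here is a minimal sketch that builds a tiny synthetic dataset with the same four dimension names. The array shape and coordinate values are made up for illustration. It uses `.sizes`, which recent `xarray` releases recommend over `.dims` when you want a mapping of dimension name to length.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Tiny synthetic stand-in for the wave dataset (shape and values are made up)
toy = xr.Dataset(
    data_vars={
        "significant_wave_height": (
            ("date", "partition", "latitude", "longitude"),
            np.zeros((3, 2, 4, 5), dtype="float32"),
        ),
    },
    coords={
        "date": pd.to_datetime(
            ["1979-01-01 00:00", "1979-01-01 01:00", "1979-01-01 02:00"]
        ),
        "partition": [0, 1],
        "latitude": np.linspace(30.0, 31.5, 4),
        "longitude": np.linspace(-7.0, -5.0, 5),
    },
)

# .sizes maps each dimension name to its length
dim_sizes = dict(toy.sizes)
print(dim_sizes)
```

The same call on the real dataset (`ds.sizes`) should report the lengths listed above.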



### ➡️Data Variables
Data variables are the core measured or modeled quantities — the actual “data fields” stored along the dataset’s dimensions.

Each data variable can depend on one or more dimensions.
For instance:
```python
depth(latitude, longitude)
significant_wave_height(date, partition, latitude, longitude)
```
* **depth** varies with latitude and longitude, but it does not vary over time or by partition.
* **significant_wave_height** varies over date, location, and partition, so it’s a 4D variable: wave height for each partition of the spectrum at each time and location.
> Each variable is like a column in a data table, but in multiple dimensions.

In [55]:
#Reveal the data variables of the dataset
ds.data_vars

Data variables:
    depth                    (latitude, longitude) float32 131kB ...
    wind_speed               (date, latitude, longitude) float32 98MB ...
    wind_direction           (date, latitude, longitude) float32 98MB ...
    current_speed            (date, latitude, longitude) float32 98MB ...
    current_direction        (date, latitude, longitude) float32 98MB ...
    significant_wave_height  (date, partition, latitude, longitude) float32 1GB ...
    peak_period              (date, partition, latitude, longitude) timedelta64[ns] 2GB ...
    wavelength               (date, partition, latitude, longitude) float32 1GB ...
    wave_direction           (date, partition, latitude, longitude) float32 1GB ...
    direction_spreading      (date, partition, latitude, longitude) float32 1GB ...
    wind_sea_fraction        (date, partition, latitude, longitude) float32 1GB ...
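The listing above shows that each variable carries its own subset of the dataset's dimensions. As a small sketch on synthetic data (the shapes are made up, only the variable names mirror the real file), you can collect those per-variable dimensions into a dictionary:

```python
import numpy as np
import xarray as xr

# Two variables that use different subsets of the dimensions
toy = xr.Dataset(
    {
        "depth": (("latitude", "longitude"), np.ones((2, 3), dtype="float32")),
        "wind_speed": (
            ("date", "latitude", "longitude"),
            np.ones((4, 2, 3), dtype="float32"),
        ),
    }
)

# Each variable reports only the dimensions it actually uses
var_dims = {name: var.dims for name, var in toy.data_vars.items()}
print(var_dims)
```

Running the same dictionary comprehension on `ds.data_vars` gives a compact summary of which variables are 2D, 3D, or 4D.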

### ➡️Indexes
Indexes (or coordinates) define the values associated with dimensions.
They describe what each position along an axis means — like labels for time, space, or category.



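In a point-based (`gridpoint`) layout of this data, like the one used in the selection examples later in this notebook, the indexes are:
```python
date → datetime64[ns]
gridpoint → float64
partition → int32
```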
* `date` is indexed by actual datetimes (likely the times of each model output).
* `gridpoint` might be an index ID or physical location identifier.
* `partition` labels the wave components (e.g., 1–12).

Sometimes coordinates like latitude and longitude are also stored as data variables that depend on these dimensions:

```python
latitude(date, gridpoint)
longitude(date, gridpoint)
```

These serve as geolocation coordinates, helping you map or spatially analyze the data.
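When latitude and longitude arrive as data variables like this, one option is to promote them to coordinates with `Dataset.set_coords()`, so they travel along with every variable you select. The sketch below uses a small synthetic dataset; the values are invented and only the variable names follow the layout described above.

```python
import numpy as np
import xarray as xr

# Synthetic point-based dataset: latitude/longitude stored as data variables
toy = xr.Dataset(
    {
        "latitude": (("date", "gridpoint"), np.array([[35.0, 36.0], [35.0, 36.0]])),
        "longitude": (("date", "gridpoint"), np.array([[10.0, 11.0], [10.0, 11.0]])),
        "wind_speed": (("date", "gridpoint"), np.array([[5.0, 6.0], [7.0, 8.0]])),
    },
    coords={"gridpoint": [1, 2]},
)

# Promote latitude/longitude from data variables to (non-index) coordinates
toy_geo = toy.set_coords(["latitude", "longitude"])

print(list(toy_geo.data_vars))          # wind_speed is the only data variable left
print("latitude" in toy_geo.coords)     # latitude is now a coordinate
```

Because these coordinates are two-dimensional, they are not indexes: they label the data but cannot be used directly with `.sel()`.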

In [62]:
#Reveal the indices of the dataset
ds.indexes

Indexes:
    date       DatetimeIndex(['1979-01-01 00:00:00', '1979-01-01 01:00:00',
               '1979-01-01 02:00:00', '1979-01-01 03:00:00',
               '1979-01-01 04:00:00', '1979-01-01 05:00:00',
               '1979-01-01 06:00:00', '1979-01-01 07:00:00',
               '1979-01-01 08:00:00', '1979-01-01 09:00:00',
               ...
               '1979-01-31 15:00:00', '1979-01-31 16:00:00',
               '1979-01-31 17:00:00', '1979-01-31 18:00:00',
               '1979-01-31 19:00:00', '1979-01-31 20:00:00',
               '1979-01-31 21:00:00', '1979-01-31 22:00:00',
               '1979-01-31 23:00:00', '1979-02-01 00:00:00'],
              dtype='datetime64[ns]', name='date', length=745, freq=None)
    longitude  Index([               -7.0,  -6.833333492279053,  -6.666666507720947,
                      -6.5,  -6.333333492279053,  -6.166666507720947,
                      -6.0, -5.8333330154418945,  -5.666666507720947,
                      -5.5,
       ...
        

### ➡️Attributes (Metadata)
Attributes are descriptive metadata — they provide context about the dataset or variables but don’t affect the data structure.
```text
title:          WAVEWATCH III version 5.08
institution:    National Centers for Environmental Prediction
source:         WAVEWATCH III partition file
experiment:     CFSRR Phase 2
history:        part2nc
field_type:     instantaneous
forecast_type:  hindcast
```
They tell you:
* What the data represents (a hindcast from WAVEWATCH III).
* Where it came from (NCEP, CFSRR project).
* How it was produced (part2nc = partition-to-NetCDF conversion).

Each variable may also have its own attributes (e.g., units, standard names).
>Attributes are like the notes on the spreadsheet explaining what the numbers mean.

In [None]:
#Reveal the attributes of the dataset
ds.attrs

{'title': 'WAVEWATCH III version 5.08',
 'institution': 'National Centers for Environmental Prediction',
 'source': 'WAVEWATCH III partition file',
 'experiment': 'CFSRR Phase 2',
 'history': 'part2nc',
 'field_type': 'instantaneous',
 'forecast_type': 'hindcast'}
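Variable-level attributes are read the same way as dataset-level ones, through the `.attrs` dictionary of the variable. Here is a minimal sketch on synthetic data; the `units` string is an assumption for illustration, not a value copied from the real file.

```python
import numpy as np
import xarray as xr

# Synthetic dataset with one dataset-level and one variable-level attribute
toy = xr.Dataset(
    {
        "wave_direction": (
            ("gridpoint",),
            np.array([284.49, 90.0], dtype="float32"),
            {"units": "degree"},  # assumed unit string, for illustration only
        )
    },
    attrs={"forecast_type": "hindcast"},
)

# Dataset-level metadata
dataset_note = toy.attrs["forecast_type"]

# Variable-level metadata; .get() avoids a KeyError if the attribute is absent
units = toy["wave_direction"].attrs.get("units", "unknown")
print(dataset_note, units)
```

On the real dataset, `ds['wave_direction'].attrs` lists whatever metadata the file actually provides for that variable.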

---
## Selecting data from the dataset

### Isolating data for a specific *variable*
We can subset our `xarray` dataset to just the values of a specific variable by indexing the dataset with that variable's name, like so:

In [30]:
#List the variables in the dataset
ds.data_vars

Data variables:
    longitude                (date, gridpoint) float32 5MB ...
    latitude                 (date, gridpoint) float32 5MB ...
    depth                    (date, gridpoint) float32 5MB ...
    number_of_partitions     (date, gridpoint) float64 10MB ...
    wind_speed               (date, gridpoint) float32 5MB ...
    wind_direction           (date, gridpoint) float32 5MB ...
    current_speed            (date, gridpoint) float32 5MB ...
    current_direction        (date, gridpoint) float32 5MB ...
    significant_wave_height  (date, gridpoint, partition) float32 61MB ...
    peak_period              (date, gridpoint, partition) timedelta64[ns] 121MB ...
    wavelength               (date, gridpoint, partition) float32 61MB ...
    wave_direction           (date, gridpoint, partition) float32 61MB ...
    direction_spreading      (date, gridpoint, partition) float32 61MB ...
    wind_sea_fraction        (date, gridpoint, partition) float32 61MB ...

In [31]:
#Select only depth records into a data array object
arr_depth = ds['depth']
type(arr_depth)

xarray.core.dataarray.DataArray

As you can see, this creates an `xarray` `DataArray` object. Calling that object reveals information about its structure.

In [32]:
#Display info on the depth data array
arr_depth

The data array object is similar to the dataset, but only retains values related to the variable. **However**, not all variables include all dimensions. For example, the `depth` variable does not have a `partition` dimension.

In [5]:
#Show dimensions for the depth array
arr_depth.dims

('date', 'gridpoint')

The `wave_direction` variable, however, includes data along all three dimensions:

In [6]:
#Show dimensions for the wave direction array
ds['wave_direction'].dims

('date', 'gridpoint', 'partition')
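A related detail: indexing a dataset with a single name returns a `DataArray`, while indexing with a list of names returns a smaller `Dataset`. The sketch below shows the difference on synthetic data (the shapes are made up, the variable names mirror the listing above).

```python
import numpy as np
import xarray as xr

# Synthetic dataset with the same variable names as above
toy = xr.Dataset(
    {
        "depth": (("date", "gridpoint"), np.ones((2, 3))),
        "wind_speed": (("date", "gridpoint"), np.ones((2, 3))),
        "wave_direction": (("date", "gridpoint", "partition"), np.ones((2, 3, 4))),
    }
)

one_var = toy["depth"]                    # a single name gives a DataArray
two_vars = toy[["depth", "wind_speed"]]   # a list of names gives a Dataset

print(type(one_var).__name__, type(two_vars).__name__)
print(list(two_vars.data_vars))
```

The list form is handy when you want to keep several related variables together for later selection steps.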

### Selecting data from a dataset with `.isel()`
Now, we'll focus on working with data in a data array. 

We'll begin with the `.isel()` method. This method allows us to select values from our dataset via their *integer position* along each dimension. For example, the value in the `wave_direction` variable at the 5th position along the `date` index, the 3rd position along the `gridpoint` index, and in the 1st `partition` is **284.49 degrees**. (Note: the `.values` attribute returns the value(s) held in the array, and positions are counted from zero, so the 5th position is index 4.)

>How do we know the value is in degrees? Have a look at the data array object: it reports the units...

In [7]:
#Show the value at the 5th date, 3rd gridpoint, 1st partition
ds['wave_direction'].isel(
    date=4,
    gridpoint=2,
    partition=0
).values

array(284.49, dtype=float32)

If we omit one of the dimensions, the `isel()` method will return all values along the omitted dimension. Here we see that, in the `wave_direction` variable, only two of the 12 partitions have values at this date and gridpoint.

In [8]:
#Show the value at the 5th date, 3rd gridpoint - all partitions
ds['wave_direction'].isel(
    date = 4,
    gridpoint=2    
).values

array([284.49, 284.49,    nan,    nan,    nan,    nan,    nan,    nan,
          nan,    nan,    nan,    nan], dtype=float32)

We can also request multiple values along a dimension by providing a list of indices:

In [9]:
#Show the value at the 5th date, 3rd gridpoint - partitions 1, 2, and 3
ds['wave_direction'].isel(
    date = 4,
    gridpoint=2,
    partition = [0,1,2]
).values

array([284.49, 284.49,    nan], dtype=float32)

And we can extract a contiguous run of positions using Python's `range()` function (the end value is excluded):

In [10]:
#Show the value at the 5th date, 3rd gridpoint - partitions 3 thru 5 (positions 2, 3, and 4)
ds['wave_direction'].isel(
    date = 4,
    gridpoint=2,
    partition = range(2,5)
).values

array([nan, nan, nan], dtype=float32)
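Besides a list or a `range`, `.isel()` also accepts a Python `slice`. The sketch below, on a small synthetic array filled with consecutive numbers, suggests that all three forms pick the same positions; the array itself is invented so the result is easy to check by hand.

```python
import numpy as np
import xarray as xr

# 2 dates x 3 gridpoints x 6 partitions, filled with 0..35
wave = xr.DataArray(
    np.arange(36, dtype="float32").reshape(2, 3, 6),
    dims=("date", "gridpoint", "partition"),
    name="wave_direction",
)

# Three ways to pick positions 2, 3, and 4 along the partition dimension
by_list = wave.isel(date=1, gridpoint=2, partition=[2, 3, 4]).values
by_range = wave.isel(date=1, gridpoint=2, partition=range(2, 5)).values
by_slice = wave.isel(date=1, gridpoint=2, partition=slice(2, 5)).values

print(by_list, by_range, by_slice)
```

As with ordinary Python slicing, the end position (5 here) is excluded.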

### Selecting values by *index* with `.sel()`
Applying the `.sel()` method to a dataset allows us to select records by their *index values* (labels) rather than by integer position.  

Of course, to do this, we need an idea of what those index values are. Some values are intuitive, like values in the `date` dimension, but others may require more knowledge of the dataset's structure.  

In [39]:
#Show the first 10 values in the date index
ds['date'][:10].values

array(['1979-01-01T00:00:00.000000000', '1979-01-01T01:00:00.000000000',
       '1979-01-01T02:00:00.000000000', '1979-01-01T03:00:00.000000000',
       '1979-01-01T04:00:00.000000000', '1979-01-01T05:00:00.000000000',
       '1979-01-01T06:00:00.000000000', '1979-01-01T07:00:00.000000000',
       '1979-01-01T08:00:00.000000000', '1979-01-01T09:00:00.000000000'],
      dtype='datetime64[ns]')

Looking at the date index values, we see they are datetime objects in hourly increments starting on Jan 1, 1979 and running through the end of the month (the full index printed earlier ends at 1979-02-01 00:00).
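With hourly datetimes as the index, label-based selection can be sketched as follows. The array here is synthetic (four hourly time steps and two gridpoints with made-up values), so treat it as an illustration of the `.sel()` call patterns rather than output from the real file.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic hourly series at two gridpoints (values are made up)
dates = pd.to_datetime(
    ["1979-01-01 00:00", "1979-01-01 01:00", "1979-01-01 02:00", "1979-01-01 03:00"]
)
wind = xr.DataArray(
    np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]),
    dims=("date", "gridpoint"),
    coords={"date": dates, "gridpoint": [1, 2]},
    name="wind_speed",
)

# Exact label match on both indexes
exact = wind.sel(date="1979-01-01 02:00", gridpoint=1).item()

# Nearest label when the requested time falls between index values
nearest = wind.sel(date="1979-01-01 01:20", method="nearest").sel(gridpoint=2).item()

# Label slices include both endpoints, unlike positional slices
window = wind.sel(
    date=slice("1979-01-01 01:00", "1979-01-01 02:00"), gridpoint=1
).values

print(exact, nearest, window)
```

Note the contrast with `.isel()`: a label slice keeps the end label, while a positional slice drops the end position.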


In [45]:
#Show the values in the gridpoint index
ds['gridpoint'].values

array([1., 2., 3., ..., 0., 0., 0.])

In [None]:
#Show the partition coordinate attached to the wave direction array
ds['wave_direction']['partition']

In [None]:

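#Pull longitude, latitude, and depth for each gridpoint at the first time step
lon = ds.longitude.isel(date=0)
lat = ds.latitude.isel(date=0)
depth = ds.depth.isel(date=0)

#Combine them into a pandas lookup table keyed by gridpoint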

coord_table = pd.DataFrame({
    'gridpoint': ds.gridpoint.values,
    'longitude': lon.values,
    'latitude': lat.values,
    'depth': depth.values
})

coord_table.head()

In [None]:
#Open another NetCDF file using a path relative to the current working directory
ds = xr.open_dataset(Path.cwd().parent / 'data' / 'raw' / 'nwio_10m.nc')
ds