# Explore NetCDF (*.nc) file

In [None]:
# Import packages
from pathlib import Path
import pandas as pd
import xarray as xr

## The `xarray` [dataset object](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html)


In [None]:
# Set the path to the .nc file and open it as an xarray dataset
nc_file = Path.cwd().parent/'data'/'raw'/'multi_reanal.partition.aoc_15m.197901.nc'
ds = xr.open_dataset(nc_file)
ds

### The components of an `xarray` dataset object:
### ➡️Dimensions
Dimensions define the axes or coordinate lengths that structure the dataset. They describe how data is organized — e.g., time, space, or ensemble members. 

The results above indicate:
* There are **745 time steps** (`date: 745`)
* Each time step includes data for **1694 spatial gridpoints** (`gridpoint: 1694`)
* Each gridpoint is associated with **12 partitions**, or 12 measurements (`partition: 12`)

> Dimensions are the *axes of a multidimensional spreadsheet*.
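
To read dimension lengths programmatically rather than from the printed summary, the dataset's `sizes` mapping can be used. The sketch below builds a small stand-in dataset (the name `toy` and its 4 x 3 x 2 shape are made up for illustration, not taken from the WAVEWATCH III file), so it runs without the data file.

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in dataset: 4 dates, 3 gridpoints, 2 partitions
toy = xr.Dataset(
    data_vars={
        "significant_wave_height": (
            ("date", "gridpoint", "partition"),
            np.zeros((4, 3, 2)),
        ),
    }
)

# Mapping of dimension name -> length
dim_sizes = dict(toy.sizes)
print(dim_sizes)

# Length of a single dimension
n_dates = toy.sizes["date"]
print(n_dates)
```

Applied to the real dataset, `ds.sizes` should report the same lengths shown in the summary above.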

### ➡️Data Variables
Data variables are the core measured or modeled quantities — the actual “data fields” stored along the dataset’s dimensions.

Each data variable can depend on one or more dimensions.
For instance:
```python
longitude(date, gridpoint)
significant_wave_height(date, gridpoint, partition)
```
* **longitude** varies over date and gridpoint — meaning it could change slightly over time (e.g., if the grid moves or data assimilations adjust).
* **significant_wave_height** varies over date, gridpoint, and partition, so it’s a 3D variable: wave height for each partition of the spectrum at each time and location.
> Each variable is like a column in a data table, but in multiple dimensions.
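
One way to see which dimensions each variable depends on is to loop over `data_vars`. This is a minimal sketch on a made-up dataset (`toy` and its shapes are assumptions for illustration); the same loop should work on `ds`.

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in dataset with one 2D and one 3D variable
toy = xr.Dataset(
    data_vars={
        "longitude": (("date", "gridpoint"), np.zeros((4, 3))),
        "significant_wave_height": (
            ("date", "gridpoint", "partition"),
            np.zeros((4, 3, 2)),
        ),
    }
)

# Collect the dimension names used by each data variable
var_dims = {name: var.dims for name, var in toy.data_vars.items()}

for name, dims in var_dims.items():
    print(name, dims)
```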

### ➡️Indexes
Indexes (or coordinates) define the values associated with dimensions.
They describe what each position along an axis means — like labels for time, space, or category.

In the dataset:
```python
date → datetime64[ns]
gridpoint → float64 gridpoint
partition → int32 partition
```
* `date` is indexed by actual datetimes (likely the times of each model output).
* `gridpoint` might be an index ID or physical location identifier.
* `partition` labels the wave components (e.g., 1–12).

Sometimes coordinates like latitude and longitude are also stored as data variables that depend on these dimensions:

```python
latitude(date, gridpoint)
longitude(date, gridpoint)
```

These serve as geolocation coordinates, helping you map or spatially analyze the data.
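
If latitude and longitude are stored as data variables, they can be promoted to coordinates with `set_coords()`. The sketch below uses a made-up dataset (`toy`, its values, and its gridpoint labels are assumptions for illustration) to show the difference between indexed dimension coordinates and promoted, non-index coordinates.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical stand-in dataset: latitude/longitude stored as data variables
toy = xr.Dataset(
    data_vars={
        "latitude": (("date", "gridpoint"), np.full((2, 3), 35.0)),
        "longitude": (("date", "gridpoint"), np.full((2, 3), -75.0)),
        "depth": (("date", "gridpoint"), np.full((2, 3), 20.0)),
    },
    coords={
        "date": pd.to_datetime(["1979-01-01", "1979-01-02"]),
        "gridpoint": [1.0, 2.0, 3.0],
    },
)

# Dimension coordinates that are backed by an index
index_names = sorted(toy.indexes)

# Promote latitude and longitude from data variables to coordinates
promoted = toy.set_coords(["latitude", "longitude"])
coord_names = sorted(promoted.coords)
remaining_vars = list(promoted.data_vars)

print(index_names)
print(coord_names)
print(remaining_vars)
```

Promoted coordinates travel with any variable selected from the dataset, which can be convenient for mapping, but they are not used for label-based lookups unless an index is built on them.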

### ➡️Attributes (Metadata)
Attributes are descriptive metadata — they provide context about the dataset or variables but don’t affect the data structure.
```text
title:          WAVEWATCH III version 5.08
institution:    National Centers for Environmental Prediction
source:         WAVEWATCH III partition file
experiment:     CFSRR Phase 2
history:        part2nc
field_type:     instantaneous
forecast_type:  hindcast
```
They tell you:
* What the data represents (a hindcast from WAVEWATCH III).
* Where it came from (NCEP, CFSRR project).
* How it was produced (part2nc = partition-to-NetCDF conversion).

Each variable may also have its own attributes (e.g., units, standard names).

> Attributes are like the notes on the spreadsheet explaining what the numbers mean.
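
Attributes are exposed as plain dictionaries through `.attrs`, both on the dataset and on each variable. A minimal sketch, using a made-up dataset (`toy` is hypothetical, and the `units` value of `"m"` is an assumption for illustration, not read from the file):

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in dataset with dataset-level and variable-level attributes
toy = xr.Dataset(
    data_vars={
        # (dims, data, attrs); the "m" unit is assumed for this example
        "depth": (("gridpoint",), np.array([10.0, 25.0]), {"units": "m"}),
    },
    attrs={
        "title": "WAVEWATCH III version 5.08",
        "forecast_type": "hindcast",
    },
)

# Dataset-level attribute
title = toy.attrs["title"]

# Variable-level attribute, with a fallback if it is missing
depth_units = toy["depth"].attrs.get("units", "unknown")

print(title)
print(depth_units)
```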

---
## Selecting data from the dataset

### Selecting a specific variable
We can extract a single variable from the dataset by indexing the dataset with that variable's name:

In [None]:
#Select only depth records into a data array object
arr_depth = ds['depth']
type(arr_depth)

The data array holds only the values of the selected variable, along every dimension that variable uses. **However**, not all variables include all dimensions. For example, the `depth` variable does not have a `partition` dimension, but the `wave_direction` variable does.

In [None]:
#Show dimensions for the depth array
arr_depth.dims

In [None]:
#Show dimensions for the wave direction array
ds['wave_direction'].dims

### Selecting data from a dataset with `.isel()`
The `.isel()` function allows us to select values from our dataset by integer position. We supply a position for each dimension. Omitting a dimension returns all values along that dimension.

In [None]:
#Select the wave_direction value for the record falling in the first
# position in the date dimension, the 10th position in the gridpoint dimension,
# and the last position in the partition dimension
wave_dir_value = ds['wave_direction'].isel(date=0, gridpoint=9, partition=-1)
type(wave_dir_value)

The result is again a data array, but one holding a single value. We can also see that the `wave_direction` at this position has no recorded value.

In [None]:
#Show info for the resulting data array
wave_dir_value

The `values` attribute of a data array returns all the values in that array as a NumPy array.

In [None]:
#Reveal just the values of the data array
wave_dir_value.values

If, in the `isel()` command, we omit a dimension, the resulting data array will include all records along that axis. 

In [None]:
#View data for all partitions, limited to the 11th date and 11th gridpoint positions
ds['wave_direction'].isel(date=10,gridpoint=10)

In [None]:
#Show just the values
ds['wave_direction'].isel(date=10,gridpoint=10).values

In [None]:
#Compute the mean of all (non missing) values
ds['wave_direction'].isel(date=10,gridpoint=10).mean().values

❓What dimensions does the `wind_direction` variable have?
What is the `wind_direction` value for the 10th date, 3rd gridpoint?

In [None]:
#Show the dimensions of the wind direction variable
ds['wind_direction'].dims

In [None]:
#Show the wind_direction value for the 10th date, 3rd gridpoint
# (positions are zero-based, so the 10th date is 9 and the 3rd gridpoint is 2)
ds['wind_direction'].isel(date=9, gridpoint=2).values

### Selecting values by *coordinate label* with `.sel()`
Applying the `.sel()` command to a dataset allows us to select values by a coordinate label. 
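
To illustrate the difference from `.isel()`, here is a minimal sketch of label-based selection on a made-up dataset (`toy`, its dates, and its values are assumptions for illustration). It shows an exact label match, a nearest-label match via `method="nearest"`, and a label slice, which includes both endpoints.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical stand-in dataset: 3 dates (3-hourly) and partitions labeled 1 and 2
dates = pd.to_datetime(
    ["1979-01-01 00:00", "1979-01-01 03:00", "1979-01-01 06:00"]
)
toy = xr.Dataset(
    data_vars={
        "wave_direction": (("date", "partition"), np.arange(6.0).reshape(3, 2)),
    },
    coords={"date": dates, "partition": [1, 2]},
)

# Exact label match: partition label 2 sits at integer position 1
exact = toy["wave_direction"].sel(
    date=np.datetime64("1979-01-01T03:00"), partition=2
)

# Nearest label: 04:00 is not in the index, so the 03:00 record is returned
nearest = toy["wave_direction"].sel(
    date=np.datetime64("1979-01-01T04:00"), method="nearest"
)

# Label slice: both endpoints are included
window = toy["wave_direction"].sel(
    date=slice(np.datetime64("1979-01-01T00:00"), np.datetime64("1979-01-01T03:00"))
)

print(exact.item())
print(nearest.values)
print(window.sizes["date"])
```

Note that `.sel(partition=2)` and `.isel(partition=1)` pick the same record here, because labels start at 1 while positions start at 0.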

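In [None]:
# Take longitude, latitude, and depth at the first time step
lon = ds.longitude.isel(date=0)
lat = ds.latitude.isel(date=0)
depth = ds.depth.isel(date=0)

# Build a lookup table of gridpoint labels and their locations
coord_table = pd.DataFrame({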
    'gridpoint': ds.gridpoint.values,
    'longitude': lon.values,
    'latitude': lat.values,
    'depth': depth.values
})

coord_table.head()