## **Exploring netCDFs** 
adapted from [Katy Abbot](https://github.com/amnh/BridgeUP-STEM-Oceans-Six/blob/master/jupyter-notebooks/netCDF_practice.ipynb)

![image](https://camo.githubusercontent.com/77e36a1f8169f7da010f7c1615fe39ab88f190ea/687474703a2f2f6465736b746f702e6172636769732e636f6d2f656e2f6172636d61702f31302e332f6d616e6167652d646174612f6e65746364662f475549442d44383732413443332d373439452d343135392d413643302d4642364433423437433544382d7765622e676966)
What are netCDF files? The acronym stands for Network Common Data Form. It's a way of formatting data that makes it easy for scientists to share and read data across different computers, operating systems, and software without running into issues or struggling to understand someone else's work!

netCDF is what we call an array-oriented data format. Data is stored in arrays, which are like grids, and each value can be accessed by selecting the appropriate row and column. Here's an example of a 2D array:

<img src="https://camo.githubusercontent.com/b525fcfb6792a87d5a15b0b1c52fc39aff739722/68747470733a2f2f7777772e6479636c617373726f6f6d2e636f6d2f696d6167652f746f7069632f632f32642d61727261792f32642d61727261792e6a7067" width="600"/>
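
To make the row-and-column idea concrete, here is a minimal sketch of a 2D array written as a plain Python list of rows. The numbers are made up for illustration and do not come from any netCDF file.

```python
# A small 2D array stored as a list of rows (made-up values for illustration).
grid = [
    [10, 11, 12, 13],   # row 0
    [20, 21, 22, 23],   # row 1
    [30, 31, 32, 33],   # row 2
]

row, column = 1, 2            # Python counts rows and columns from 0
value = grid[row][column]     # select the row first, then the column
print(value)                  # prints 22

n_rows, n_columns = len(grid), len(grid[0])
print(n_rows, n_columns)      # prints 3 4
```

A netCDF variable works the same way, except that it can have more than two indices and each index has a name.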

With netCDF files, our rows, columns, and other indices are called dimensions, and they can take values such as latitude, longitude and time.


<img src="https://simulatingcomplexity.files.wordpress.com/2014/11/netcdf-file-structure.png" width="400"/>

Let's try to explore this file format with an actual file. Make sure you have the file sea_velocity_19930101.nc in your GitHub repository. This is a dataset of geostrophic velocities.

First, we are going to explore the file in Terminal.

* In Terminal, type **ncdump -h sea_velocity_19930101.nc** to see all the headers for the file. 

* Type **ncdump -v variable_name sea_velocity_19930101.nc** (replacing variable_name with one of the names listed in the header) to see the data stored in the file under that name.
    
* Try exploring the file by searching for different names (time, latitude, etc.)

Now we are going to explore using Python. Our first step is to import Dataset from the netCDF4 package. It is one of the main tools we use for viewing netCDF files.

In [2]:
from netCDF4 import Dataset #import Dataset from the netCDF4 package
dataset = Dataset(r'/Users/helenfellow/Documents/InternGit/ocean-ml/sea_velocity_19930101.nc') #replace with pathname for your computer


In [10]:
print(dataset) #What output do you see when you run this command?

<type 'netCDF4._netCDF4.Dataset'>
root group (NETCDF3_CLASSIC data model, file format NETCDF3):
    _NCProperties: version=1|netcdflibversion=4.4.1|hdf5libversion=1.8.18
    Conventions: CF-1.6
    Metadata_Conventions: Unidata Dataset Discovery v1.0
    cdm_data_type: Grid
    comment: Sea Surface Height measured by Altimetry and derived variables
    contact: http://climate.copernicus.eu/c3s-user-service-desk
    creator_email: http://climate.copernicus.eu/c3s-user-service-desk
    creator_name: Copernicus Climate Change Service (C3S)
    creator_url: http://climate.copernicus.eu
    date_created: 2019-06-12T17:08:33Z
    date_issued: 2019-06-12T17:08:33Z
    date_modified: 2019-06-12T17:08:33Z
    geospatial_lat_max: 89.875
    geospatial_lat_min: -89.875
    geospatial_lat_resolution: 0.25
    geospatial_lat_units: degrees_north
    geospatial_lon_max: 359.875
    geospatial_lon_min: 0.125
    geospatial_lon_resolution: 0.25
    geospatial_lon_units: degrees_east
    geospatial_ver

Note that we've now created an object, called dataset, that we can use to access different aspects of the file. We'll use the dot notation (i.e. dataset.blahblahblah) to access different parts of the data.

Let's find out more about this dataset. We'll look at the "metadata," which is basically data about the data. 

Scientists use this to explain how the data was acquired or made, how old it is, who to contact with questions etc. First, we'll look at the dataset's "global attributes," which can be accessed by calling ncattrs (shorthand for netcdf attributes).

In [11]:
dataset.ncattrs()

[u'_NCProperties',
 u'Conventions',
 u'Metadata_Conventions',
 u'cdm_data_type',
 u'comment',
 u'contact',
 u'creator_email',
 u'creator_name',
 u'creator_url',
 u'date_created',
 u'date_issued',
 u'date_modified',
 u'geospatial_lat_max',
 u'geospatial_lat_min',
 u'geospatial_lat_resolution',
 u'geospatial_lat_units',
 u'geospatial_lon_max',
 u'geospatial_lon_min',
 u'geospatial_lon_resolution',
 u'geospatial_lon_units',
 u'geospatial_vertical_max',
 u'geospatial_vertical_min',
 u'geospatial_vertical_positive',
 u'geospatial_vertical_resolution',
 u'geospatial_vertical_units',
 u'history',
 u'institution',
 u'keywords',
 u'keywords_vocabulary',
 u'license',
 u'platform',
 u'processing_level',
 u'product_version',
 u'project',
 u'references',
 u'software_version',
 u'source',
 u'ssalto_duacs_comment',
 u'standard_name_vocabulary',
 u'summary',
 u'time_coverage_duration',
 u'time_coverage_end',
 u'time_coverage_resolution',
 u'time_coverage_start',
 u'title',
 u'History']

To look at one of these, type in the name of the dataset variable, and add a period (.) and the name of the attribute you want to look at.

In [16]:
print(dataset.geospatial_lat_max)
print(dataset.time_coverage_duration)
print(dataset.contact)

89.875
P1D
http://climate.copernicus.eu/c3s-user-service-desk


You can access the dimensions of the dataset by calling **dataset.dimensions.** Notice that the output is a dictionary. We can see the "keys," or dimension names, with **dataset.dimensions.keys()**

In [17]:
print(dataset.dimensions)
print(dataset.dimensions.keys())

OrderedDict([(u'time', <type 'netCDF4._netCDF4.Dimension'>: name = 'time', size = 1
), (u'latitude', <type 'netCDF4._netCDF4.Dimension'>: name = 'latitude', size = 720
), (u'longitude', <type 'netCDF4._netCDF4.Dimension'>: name = 'longitude', size = 1440
)])
[u'time', u'latitude', u'longitude']
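
The dimension sizes above line up with the global attributes printed earlier (geospatial_lat_min, geospatial_lat_resolution, and so on). The sketch below rebuilds the latitude and longitude values from those numbers, assuming the grid is evenly spaced as the attributes suggest. The helper **nearest_index** and the example point are hypothetical and only there for illustration.

```python
# Values copied from the global attributes and dimension sizes printed above.
lat_min, lat_res, n_lat = -89.875, 0.25, 720
lon_min, lon_res, n_lon = 0.125, 0.25, 1440

# Rebuild the coordinate values: start at the minimum and step by the resolution.
latitudes = [lat_min + lat_res * i for i in range(n_lat)]
longitudes = [lon_min + lon_res * i for i in range(n_lon)]

print(latitudes[0], latitudes[-1])    # prints -89.875 89.875
print(longitudes[0], longitudes[-1])  # prints 0.125 359.875


def nearest_index(value, start, step, size):
    """Return the index of the grid point closest to value (hypothetical helper)."""
    index = round((value - start) / step)
    return min(max(index, 0), size - 1)  # keep the index inside the grid


# Made-up example point: 40.7 degrees north, 286.1 degrees east.
i_lat = nearest_index(40.7, lat_min, lat_res, n_lat)
i_lon = nearest_index(286.1, lon_min, lon_res, n_lon)
print(i_lat, latitudes[i_lat])    # prints 522 40.625
print(i_lon, longitudes[i_lon])   # prints 1144 286.125
```

The last latitude and longitude match geospatial_lat_max and geospatial_lon_max, which is a quick way to check that 720 and 1440 grid points at 0.25 degree spacing cover the whole globe.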


If you want to see a specific dimension, you can do so by adding brackets and the dimension name in quotes, for example **dataset.dimensions['time']**

In [1]:
print(dataset.dimensions['time'])

<type 'netCDF4._netCDF4.Dimension'>: name = 'time', size = 1

Now that you know the dimensions of this file, try to draw a sketch, like the images at the start of this Jupyter notebook, that shows the possible dimensions and how they relate to each other. Don't worry about "bnds" for now.

We can also access the variables of our dataset by typing dataset.variables

In [20]:
print(dataset.variables, "\n \n")  #"\n" creates a new empty line so you can separate your output

print(dataset.variables.keys())

(OrderedDict([(u'ugos', <type 'netCDF4._netCDF4.Variable'>
int32 ugos(time, latitude, longitude)
    _FillValue: -2147483647
    coordinates: time latitude longitude 
    grid_mapping: crs
    long_name: Absolute geostrophic velocity: zonal component
    scale_factor: 0.0001
    standard_name: surface_geostrophic_eastward_sea_water_velocity
    units: m/s
    _ChunkSizes: [ 1 50 50]
unlimited dimensions: 
current shape = (1, 720, 1440)
filling on), (u'time', <type 'netCDF4._netCDF4.Variable'>
float32 time(time)
    axis: T
    calendar: gregorian
    long_name: Time
    standard_name: time
    units: days since 1950-01-01 00:00:00
    _ChunkSizes: 1
    _CoordinateAxisType: Time
unlimited dimensions: 
current shape = (1,)
filling on, default _FillValue of 9.96920996839e+36 used
), (u'latitude', <type 'netCDF4._netCDF4.Variable'>
float32 latitude(latitude)
    axis: Y
    bounds: lat_bnds
    long_name: Latitude
    standard_name: latitude
    units: degrees_north
    valid_max: 89.875
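
The time variable in the output above stores its values as a number of "days since 1950-01-01 00:00:00". Here is a minimal sketch, using only Python's standard library, of how such a number maps to a calendar date. It assumes the file's single time step is 1 January 1993, as the file name suggests; the helper name **days_to_datetime** is made up for this example.

```python
from datetime import datetime, timedelta

# Reference date taken from the units string "days since 1950-01-01 00:00:00".
reference = datetime(1950, 1, 1)


def days_to_datetime(days_since_reference):
    """Convert a 'days since 1950-01-01' number into a calendar date."""
    return reference + timedelta(days=days_since_reference)


# Assumption: the file name (19930101) suggests the time step is
# 1 January 1993, so we work out which number that date corresponds to.
days = (datetime(1993, 1, 1) - reference).days
print(days)                    # prints 15706
print(days_to_datetime(days))  # prints 1993-01-01 00:00:00
```

The netCDF4 package also provides a **num2date** function for this kind of conversion, which takes the units string into account for you.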


These variables have a lot more information, right? Let's look at just one variable: ugos. Inspect it by typing **dataset.variables['ugos']**


In [6]:
dataset.variables['ugos']

<type 'netCDF4._netCDF4.Variable'>
int32 ugos(time, latitude, longitude)
    _FillValue: -2147483647
    coordinates: time latitude longitude 
    grid_mapping: crs
    long_name: Absolute geostrophic velocity: zonal component
    scale_factor: 0.0001
    standard_name: surface_geostrophic_eastward_sea_water_velocity
    units: m/s
    _ChunkSizes: [ 1 50 50]
unlimited dimensions: 
current shape = (1, 720, 1440)
filling on
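
Two attributes in the output above explain how the numbers are stored: ugos is saved as integers (int32), **scale_factor** converts those integers to m/s, and **_FillValue** marks grid points with no data. The sketch below shows that conversion by hand. The raw integers are made up, and the function name **unpack** is hypothetical.

```python
FILL_VALUE = -2147483647   # _FillValue from the ugos output above
SCALE_FACTOR = 0.0001      # scale_factor from the ugos output above


def unpack(raw_value):
    """Turn a stored integer into a velocity in m/s, or None if missing."""
    if raw_value == FILL_VALUE:
        return None                    # no measurement at this grid point
    return raw_value * SCALE_FACTOR    # no add_offset is listed, so none is applied


# Made-up raw integers, not values read from the file.
for raw in [1234, -567, FILL_VALUE]:
    print(raw, "->", unpack(raw))
```

By default, netCDF4 does this conversion for you when you read a variable's values, so you normally see velocities in m/s rather than the stored integers.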

How many different attributes can you identify? (_FillValue, coordinates, grid_mapping, long_name, scale_factor, standard_name, units, _ChunkSizes, current shape). Look at the second line. It gives the name of the variable, and it also lists three names in parentheses after it. What do you think those names signify?

We can access any one of these attributes by calling it directly. Just add a period at the end of your call to a variable and add in the attribute name.

In [23]:
print(dataset.variables['longitude'].bounds)
print(dataset.variables['longitude'].long_name)

lon_bnds
Longitude


In [24]:
for attr in dataset.variables['longitude'].ncattrs(): #ncattrs is a shorthand way of saying the attributes of a netCDF file
    print(attr)
    print(getattr(dataset.variables['longitude'], attr))  #getattr is a function that takes a variable and an attribute name and returns its value

axis
X
bounds
lon_bnds
long_name
Longitude
standard_name
longitude
units
degrees_east
valid_max
359.875
valid_min
0.125
_ChunkSizes
50
_CoordinateAxisType
Lon
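
The same loop can collect the attributes into a dictionary instead of printing them. The sketch below uses a small stand-in class, **FakeVariable**, which is hypothetical and only mimics the **ncattrs()** method so the example runs without the data file. With the real file you would use **dataset.variables['longitude']** in its place.

```python
class FakeVariable:
    """Hypothetical stand-in that mimics how a netCDF variable exposes attributes."""

    def __init__(self, **attributes):
        self._names = list(attributes)
        for name, value in attributes.items():
            setattr(self, name, value)

    def ncattrs(self):
        return self._names


# A few of the longitude attributes printed above.
longitude = FakeVariable(axis="X", units="degrees_east", valid_min=0.125, valid_max=359.875)

# Build a dictionary {attribute name: attribute value} in one step.
attributes = {name: getattr(longitude, name) for name in longitude.ncattrs()}
print(attributes["units"])      # prints degrees_east
print(attributes["valid_max"])  # prints 359.875
```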


You may be wondering: Where's the actual data?? So far, we've learned what variables and dimensions are in this dataset, but we haven't actually seen any numbers or values.

Let's look at the latitude values. To do so, you'll call on a variable (e.g. dataset.variables['latitude'], as above), but you'll add [:] after it to tell the computer that you want to see the numpy array.

In [8]:
print("latitude: ", dataset.variables['latitude'][:]) #print the latitude values


('latitude: ', masked_array(data=[-89.875, -89.625, -89.375, -89.125, -88.875, -88.625,
                   -88.375, -88.125, -87.875, -87.625, -87.375, -87.125,
                   -86.875, -86.625, -86.375, -86.125, -85.875, -85.625,
                   -85.375, -85.125, -84.875, -84.625, -84.375, -84.125,
                   -83.875, -83.625, -83.375, -83.125, -82.875, -82.625,
                   -82.375, -82.125, -81.875, -81.625, -81.375, -81.125,
                   -80.875, -80.625, -80.375, -80.125, -79.875, -79.625,
                   -79.375, -79.125, -78.875, -78.625, -78.375, -78.125,
                   -77.875, -77.625, -77.375, -77.125, -76.875, -76.625,
                   -76.375, -76.125, -75.875, -75.625, -75.375, -75.125,
                   -74.875, -74.625, -74.375, -74.125, -73.875, -73.625,
                   -73.375, -73.125, -72.875, -72.625, -72.375, -72.125,
                   -71.875, -71.625, -71.375, -71.125, -70.875, -70.625,
                   -70.375, -70.125,
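
Notice that the output above is a **masked_array**, not a plain numpy array. A masked array carries a "mask" that hides missing values so they are skipped in calculations. Here is a minimal sketch with made-up velocities, where -999.0 is an assumed placeholder for "no measurement" (it is not the fill value used in this file).

```python
import numpy as np

# Made-up velocities in m/s; -999.0 stands in for "no measurement".
values = np.array([0.12, -999.0, -0.05, 0.08])

# Mask the placeholder so it is skipped in calculations.
masked = np.ma.masked_equal(values, -999.0)

print(masked)           # the masked entry is shown as --
print(masked.mask)      # True marks the hidden entry
print(masked.count())   # number of valid (unmasked) values
print(masked.mean())    # average of the three valid values only
print(values.mean())    # the plain array is badly skewed by the placeholder
```

This is why ocean datasets are usually read as masked arrays: grid points over land have no velocity, and the mask keeps them out of averages and plots.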

## 👉netCDF file cheat sheet👈
[This tutorial](http://www.ceda.ac.uk/static/media/uploads/ncas-reading-2015/10_read_netcdf_python.pdf) was written in Python 2.7, so the print command is slightly different, but it's a helpful read to understand how these files work.

Additionally:
1. Import the tools to open a dataset: **from netCDF4 import Dataset**
2. Open a dataset: **dataset = Dataset('filename.nc')**
3. View the dataset's attributes: **dataset.ncattrs()**
4. Access a specific attribute: **dataset.attribute_name**
5. View the dataset's dimensions: **dataset.dimensions**
6. View a specific dimension: **dataset.dimensions[ 'name of dimension' ]**
7. View the dataset's variables: **dataset.variables**
8. View a specific variable: **dataset.variables[ 'name of variable' ]**
9. See a variable's values: **dataset.variables[ 'name of variable' ][ : ]**