## PYDEV - $1^{st}$ Meeting

* Intro
    * Goals:
        1. Pool Python knowledge and skills
        2. Develop community to
            a. write better code
            b. develop and roll out faster
            c. help others
        3. Respond to increased use of Scientific Python 
            * IDL scripts
            * Data access?
        
    
* Tools
    * Github Repo: https://github.com/OCPyG
        * This meeting: **git clone https://github.com/OCPyG/Meeting1.git**
    * Scientific Python: https://www.continuum.io/downloads
    * e-books: http://library.gsfc.nasa.gov/resource-types/books-and-ebooks
        * Safari books online: http://search.safaribooksonline.com/search?q=python
    * Misc.: atom; https://atom.io/
    
    
* Tentative Format
    * quick presentation, e.g. on ipython notebook
        * showcase a useful module, pose a problem, etc.
    * (interactive) discussion
    * can all be online
    * ...


* Tentative list of topics of potential interest to the group
    * netCDF4 -- how to read our files
    * numpy -- manipulate data
    * matplotlib, seaborn
    * scipy, requests, Cython, numba, subprocess, multiprocessing, threading, scikit-learn, statsmodels, theano, bokeh?

* Numpy
* Scipy
* Plots
* Matplotlib
* Gwyn bootcamp notebooks

### Brief netcdf4 / numpy tutorial 

In [41]:
# load the relevant modules
import netCDF4 as nc
import numpy as np

In [3]:
# open a data file and create a dataset object
fpath = './S1997302193636_silent.L2'
ds = nc.Dataset(fpath)

In [4]:
# let's look at the ds object we just created
ds

<class 'netCDF4._netCDF4.Dataset'>
root group (NETCDF4 data model, file format HDF5):
    title: SeaWiFS Level-2 Data
    product_name: S1997302193636_silent.L2
    processing_version: Unspecified
    equatorCrossingLongitude: -116.421
    orbit_number: 1321
    history: l2gen ifile=/disk02/UNCERTAINTIES/Monte-Carlo/Matchups/L1As/S1997302193636.L1A_MLAC.R0000020885_30N_30N_114W_114W.hdf ofile=/disk02/UNCERTAINTIES/Monte-Carlo/Matchups/L2/S1997302193636/S1997302193636_silent.L2 par=./parfiles/parfile_silent
    instrument: SeaWiFS
    platform: Orbview-2
    Conventions: CF-1.6
    Metadata_Conventions: Unidata Dataset Discovery v1.0
    license: http://science.nasa.gov/earth-science/earth-science-data/data-information-policy/
    naming_authority: gov.nasa.gsfc.sci.oceandata
    id: L2/S1997302193636_silent.L2
    date_created: 2016-06-20T19:03:10.000Z
    keywords_vocabulary: NASA Global Change Master Directory (GCMD) Science Keywords
    keywords: Oceans > Ocean Chemistry > Chlorophy

In [8]:
# most attributes of the ds object are themselves ordered dictionaries (which are also objects)
ds.dimensions

OrderedDict([('number_of_lines',
              <class 'netCDF4._netCDF4.Dimension'>: name = 'number_of_lines', size = 101),
             ('pixels_per_line',
              <class 'netCDF4._netCDF4.Dimension'>: name = 'pixels_per_line', size = 101),
             ('pixel_control_points',
              <class 'netCDF4._netCDF4.Dimension'>: name = 'pixel_control_points', size = 101),
             ('number_of_bands',
              <class 'netCDF4._netCDF4.Dimension'>: name = 'number_of_bands', size = 8),
             ('number_of_reflective_bands',
              <class 'netCDF4._netCDF4.Dimension'>: name = 'number_of_reflective_bands', size = 8)])

In [15]:
# Groups is the main dictionary for anything to do with data manipulations.
ds.groups.keys()

odict_keys(['sensor_band_parameters', 'scan_line_attributes', 'geophysical_data', 'navigation_data', 'processing_control'])

In [44]:
# Entries in groups are netcdf4 group objects, which in turn typically contain dictionaries
ds.groups['scan_line_attributes']

<class 'netCDF4._netCDF4.Group'>
group /scan_line_attributes:
    dimensions(sizes): 
    variables(dimensions): int32 year(number_of_lines), int32 day(number_of_lines), int32 msec(number_of_lines), int8 detnum(number_of_lines), int8 mside(number_of_lines), float32 slon(number_of_lines), float32 clon(number_of_lines), float32 elon(number_of_lines), float32 slat(number_of_lines), float32 clat(number_of_lines), float32 elat(number_of_lines), float32 csol_z(number_of_lines)
    groups: 

In [14]:
# accessing an entry in one of these dictionaries returns a netcdf4 variable object.
ds.groups['scan_line_attributes'].variables['year']

<class 'netCDF4._netCDF4.Variable'>
int32 year(number_of_lines)
    long_name: Scan year
    units: years
    _FillValue: -32767
    valid_min: 1900
    valid_max: 2100
path = /scan_line_attributes
unlimited dimensions: 
current shape = (101,)
filling on
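
The nesting shown above (dataset, then groups, then variables) can be walked generically. The following is a minimal sketch, not part of the original notebook: `list_variables` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that only mimic the `.groups` and `.variables` dictionaries, so the example runs without a data file. The same function should work on a real `nc.Dataset`, since it only relies on those two dictionary-like attributes.

```python
from types import SimpleNamespace


def list_variables(group, prefix=""):
    """Yield 'path/name' for every variable in a group and its subgroups."""
    for name in group.variables:
        yield f"{prefix}/{name}"
    for gname, sub in group.groups.items():
        yield from list_variables(sub, f"{prefix}/{gname}")


# Hypothetical stand-ins for netCDF4 Group objects (assumption: only the
# .variables and .groups attributes matter for this walk).
scan = SimpleNamespace(variables={"year": None, "day": None}, groups={})
nav = SimpleNamespace(variables={"latitude": None, "longitude": None}, groups={})
root = SimpleNamespace(
    variables={},
    groups={"scan_line_attributes": scan, "navigation_data": nav},
)

paths = list(list_variables(root))
for path in paths:
    print(path)
```

With a real file, `list(list_variables(ds))` would be the analogous call.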

In [12]:
# a netcdf4 variable object can be accessed as a numpy array like so
ds.groups['scan_line_attributes'].variables['year'][:]

array([1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997,
       1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997,
       1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997,
       1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997,
       1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997,
       1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997,
       1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997,
       1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997,
       1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997,
       1997, 1997], dtype=int32)

In [18]:
# to avoid retyping long references, assign the variables dictionaries of
# a couple of the groups to short names...
gpNavVar = ds.groups['navigation_data'].variables
gpGeoVar = ds.groups['geophysical_data'].variables

In [19]:
# and passing some of their content to numpy arrays
lats = gpNavVar['latitude'][:]
lons = gpNavVar['longitude'][:]

In [23]:
# quick aside on how the print function works, and how string formatting can be done
from math import pi
print("pi is %.3f %.4f" % (pi,pi))

pi is 3.142 3.1416
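
For comparison, the same formatting can be written with `str.format` or, on Python 3.6 and later, with an f-string. This is a small sketch added for illustration; all three forms use the same `.3f` / `.4f` format specifiers.

```python
from math import pi

# %-style formatting, as in the cell above
old_style = "pi is %.3f %.4f" % (pi, pi)

# str.format with the same format specifiers
format_style = "pi is {:.3f} {:.4f}".format(pi, pi)

# f-string (Python 3.6+): the expression goes inside the braces
fstring_style = f"pi is {pi:.3f} {pi:.4f}"

print(old_style)
print(format_style)
print(fstring_style)
```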


In [25]:
# [:] gets the whole data array
chlora = gpGeoVar['chlor_a'][:]

In [45]:
# slicing the array can be done like so (remember that the end of the range is exclusive in python)
chlorw = gpGeoVar['chlor_a'][10:45,20:]

In [46]:
chlorw.shape

(35, 81)
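
The shape above follows from the exclusive end of the range: rows 10 through 44 give 35 rows, and columns 20 through 100 give 81 columns. A minimal sketch of the same slice, using a plain NumPy array as a stand-in for the 101 x 101 `chlor_a` variable (an assumption, so that it runs without the data file):

```python
import numpy as np

# Stand-in for a 101 x 101 variable; the values are just running indices.
grid = np.arange(101 * 101).reshape(101, 101)

# Rows 10..44 (45 is excluded), columns 20..100 (open-ended slice).
window = grid[10:45, 20:]
print(window.shape)

# The first element of the window is the element at row 10, column 20.
print(window[0, 0] == grid[10, 20])
```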

In [47]:
# usually, if the data has gone through flagging, the resulting variable is a 
#    numpy masked array...
type(chlora)

numpy.ma.core.MaskedArray

In [27]:
# this is an array with a data component and a mask component. Caution! a 'True' entry in the 
#    mask refers to a *bad* entry.
chlora

masked_array(data =
 [[-- -- -- ..., 0.7414780855178833 0.8797937035560608 0.8858433365821838]
 [-- -- -- ..., 0.7445796132087708 0.8131616115570068 0.8088726997375488]
 [-- -- -- ..., 0.9621338248252869 0.7958893179893494 0.8035178184509277]
 ..., 
 [-- -- -- ..., 0.9789645671844482 0.9724005460739136 0.9688998460769653]
 [-- -- -- ..., 1.2452915906906128 1.0555698871612549 1.0561654567718506]
 [-- -- -- ..., 1.2546403408050537 1.0435562133789062 0.8720251321792603]],
             mask =
 [[ True  True  True ..., False False False]
 [ True  True  True ..., False False False]
 [ True  True  True ..., False False False]
 ..., 
 [ True  True  True ..., False False False]
 [ True  True  True ..., False False False]
 [ True  True  True ..., False False False]],
       fill_value = -32767.0)
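
To make the mask convention concrete, here is a small sketch with made-up numbers (not taken from the file). It builds a masked array from raw values with `np.ma.masked_values`, assuming the same fill value of -32767 shown above, and shows that statistics on the masked array skip the masked (`True`) entries.

```python
import numpy as np

FILL = -32767.0  # assumed fill value, matching the output above

# Made-up raw data in which FILL marks the bad pixels.
raw = np.array([[FILL, 0.5, 1.5],
                [FILL, FILL, 2.5]], dtype=np.float32)

# Mask every entry equal to FILL: True in the mask means "bad, ignore".
chl = np.ma.masked_values(raw, FILL)

print(chl.mask)     # True where the data are bad
print(chl.count())  # number of good (unmasked) entries
print(chl.mean())   # mean over the good entries only
print(raw.mean())   # mean of the raw data, dragged down by the fill values
```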

In [28]:
# getting just the data from the masked array is easy...
chlora.data

array([[ -3.27670000e+04,  -3.27670000e+04,  -3.27670000e+04, ...,
          7.41478086e-01,   8.79793704e-01,   8.85843337e-01],
       [ -3.27670000e+04,  -3.27670000e+04,  -3.27670000e+04, ...,
          7.44579613e-01,   8.13161612e-01,   8.08872700e-01],
       [ -3.27670000e+04,  -3.27670000e+04,  -3.27670000e+04, ...,
          9.62133825e-01,   7.95889318e-01,   8.03517818e-01],
       ..., 
       [ -3.27670000e+04,  -3.27670000e+04,  -3.27670000e+04, ...,
          9.78964567e-01,   9.72400546e-01,   9.68899846e-01],
       [ -3.27670000e+04,  -3.27670000e+04,  -3.27670000e+04, ...,
          1.24529159e+00,   1.05556989e+00,   1.05616546e+00],
       [ -3.27670000e+04,  -3.27670000e+04,  -3.27670000e+04, ...,
          1.25464034e+00,   1.04355621e+00,   8.72025132e-01]], dtype=float32)
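
If a plain array is needed but the fill values should not contaminate later calculations, one option is to replace the masked entries with NaN via `filled()`. This is a sketch with made-up values, and it assumes a floating-point array, since NaN cannot be stored in an integer array.

```python
import numpy as np

# Made-up masked array; the middle entry is a masked fill value.
chl = np.ma.array([0.5, -32767.0, 1.5], mask=[False, True, False])

raw = chl.data                 # still contains -32767.0
with_nan = chl.filled(np.nan)  # plain ndarray, masked entries become NaN

print(raw)
print(with_nan)
print(np.nanmean(with_nan))    # NaN-aware mean ignores the bad entry
```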

In [30]:
# but it will contain the bad values as well. for ease of manipulation the compressed() method 
# available in masked array objects can be used. compressed() returns a 1D array (vector) of all
#  good values. 
chloranonmasked = chlora.compressed()

In [31]:
# this is no longer a masked array and as such methods and attributes available to masked 
# arrays are no longer available
type(chloranonmasked)

numpy.ndarray

In [32]:
# but, as mentioned earlier, it is now 1D and holds fewer elements
print(chlora.shape)
print(chloranonmasked.shape)

(101, 101)
(6803,)
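
The same behaviour on a small, made-up array: `compressed()` flattens the result to 1D, and its length equals the number of unmasked entries reported by `count()`.

```python
import numpy as np

# Made-up 3 x 4 array; mask the multiples of 3 (0, 3, 6 and 9).
data = np.arange(12, dtype=float).reshape(3, 4)
arr = np.ma.array(data, mask=(data % 3 == 0))

good = arr.compressed()  # 1D ndarray holding only the unmasked values

print(arr.shape, good.shape)
print(arr.count() == good.size)
print(good)
```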


In [33]:
# to keep the element-by-element correspondence with other arrays, the mask of the original
#   array can be used to create new masked arrays. Compressing both arrays with
#  the same mask then preserves that correspondence. Here's an example...
# Get the mask from the chlora array
mymask=chlora.mask

In [34]:
# Make a new masked array from the (originally non-masked) lat array -- note that np.ma is numpy's
# masked array submodule
latmks = np.ma.array(lats,mask=mymask)

In [35]:
# now if we use the compressed method on this new masked array we get the same shape as the 
#   chloranonmasked array.
latmks.compressed().shape

(6803,)
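
Putting the last few cells together, the sketch below applies one mask to several arrays so that the compressed vectors stay aligned element by element. The values are made up, and `np.ma.getmaskarray` is used here instead of `.mask` as a precaution (an addition, not in the original cells): it always returns a full boolean array, even when nothing is masked.

```python
import numpy as np

# Made-up 2 x 2 example: chlorophyll with two bad pixels, plus coordinates.
chl = np.ma.array([[0.2, 0.4], [0.6, 0.8]],
                  mask=[[True, False], [False, True]])
lats = np.array([[30.0, 30.0], [29.9, 29.9]])
lons = np.array([[-114.1, -114.0], [-114.1, -114.0]])

# Full boolean mask with the same shape as chl.
mask = np.ma.getmaskarray(chl)

# Apply the same mask to each array, then keep only the good entries.
chl_good = chl.compressed()
lat_good = np.ma.array(lats, mask=mask).compressed()
lon_good = np.ma.array(lons, mask=mask).compressed()

# Each position in the three vectors now refers to the same pixel.
for c, la, lo in zip(chl_good, lat_good, lon_good):
    print(c, la, lo)
```

Boolean indexing with the inverted mask, `lats[~mask]`, gives the same vector as the masked-array route.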