In [1]:
import os
import re
import sys
import numpy
import netCDF4
import rasterio
from glob import glob
from datetime import datetime
from collections import namedtuple
from osgeo import osr

In [2]:
# Four netCDF files:
# 1. Old Java-generated file, optimised for pixel drill.
# 2. Same (lat, lon, time) order as 1, but with a 100x100x100 chunk size for compression.
# 3. New order, ubyte Data(time, lat, lon), with a 100x100x100 chunk size.
# 4. New order, ubyte Data(time, lat, lon), with a 500x10x10 chunk size.
ncfile1="/g/data/fk4/wofs/water_f7q/extents/149_-035/LS_WATER_149_-035_1987-07-16T23-15-17.556_2014-03-28T23-47-03.171.nc" #1x256x1200
ncfile2="/g/data/fk4/wofs/water_20160203/extents/149_-035/LS_WATER_149_-035_1987-07-16T23-15-17_2016-01-20T23-57-35.nc" #100x100x100
ncfile3="/g/data/u46/wofs/extents2nc/149_-035/LS_WATER_149_-035_1987-07-16T23-15-17_2016-01-20T23-57-35.nc" #100x100x100
ncfile4="/g/data/u46/wofs/extents/149_-035/LS_WATER_149_-035_1987-07-16T23-15-17_2016-01-20T23-57-35.nc"  #500x10x10 chunk

    

In [16]:
# make up 1000 pixels randomly
npix=1000
pixels1 = numpy.random.randint(0, 4000, (npix))
pixels2 = numpy.random.randint(0, 4000, (npix))

print pixels1[:10], pixels2[:10]


[ 277  100 3169 3279  663 1455  481 2411 2947 3083] [3129 3345 1624 3971 1200  612  718  627 1800 1817]
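How much a random sample like this can benefit from chunk caching depends on how many distinct chunks it touches. The sketch below counts the distinct spatial chunks hit by a set of pixel coordinates. `distinct_spatial_chunks` is a hypothetical helper, not part of the notebook; the 100x100 and 10x10 spatial chunk sizes are taken from the file descriptions above, and the standard `random` module stands in for `numpy.random` so the block is self-contained.

```python
import random


def distinct_spatial_chunks(rows, cols, chunk_rows, chunk_cols):
    """Count the distinct (row, col) chunk indices hit by the given pixels."""
    return len({(r // chunk_rows, c // chunk_cols) for r, c in zip(rows, cols)})


random.seed(0)  # fixed seed so the sketch is repeatable
npix = 1000
rows = [random.randrange(4000) for _ in range(npix)]
cols = [random.randrange(4000) for _ in range(npix)]

# A 4000x4000 tile has 1,600 spatial chunks at 100x100 and 160,000 at 10x10.
hits_100 = distinct_spatial_chunks(rows, cols, 100, 100)
hits_10 = distinct_spatial_chunks(rows, cols, 10, 10)
print("100x100: %d distinct chunks, 10x10: %d distinct chunks" % (hits_100, hits_10))
```

With only 1000 pixels spread over so many chunks, most pixels are expected to land in a chunk of their own, so a test like this mostly measures cold chunk reads rather than cache hits.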


In [4]:
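def read1(data):  # time last: data[lat, lon, time]
    # Read the full time series at every sampled pixel.
    # range() starts at 0 so that the first pixel is not skipped.
    for i in range(len(pixels1)):
        data[pixels1[i], pixels2[i], :]


def read2(data):  # time first: data[time, lat, lon]
    for i in range(len(pixels1)):
        data[:, pixels1[i], pixels2[i]]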

In [17]:
ds1=netCDF4.Dataset(ncfile1)

print ds1.data_model
#print ds1.variables

data1=ds1['Data']
print data1.size, data1.dtype, data1.shape

%timeit read1(data1)

NETCDF4
15456000000 int8 (4000, 4000, 966)
The slowest run took 15.75 times longer than the fastest. This could mean that an intermediate result is being cached 
1 loops, best of 3: 1.28 s per loop


In [18]:
# About 30 times slower than method 1 (39.6 s vs 1.28 s per loop)
ds1=netCDF4.Dataset(ncfile2)
data1=ds1['Data']

print ds1.data_model
#print ds1.variables

print data1.size, data1.dtype, data1.shape


%timeit read1(data1)

NETCDF4
18336000000 int8 (4000, 4000, 1146)
1 loops, best of 3: 39.6 s per loop


In [19]:
# About 20 times slower than method 1 (27.6 s vs 1.28 s per loop)
ds1=netCDF4.Dataset(ncfile3)

print ds1.data_model
# ds1.variables

data1=ds1['Data']
print data1.size, data1.dtype, data1.shape

%timeit read2(data1)

NETCDF4
18336000000 uint8 (1146, 4000, 4000)
1 loops, best of 3: 27.6 s per loop


In [20]:
# method-7: chunk size 500x10x10, time first, is the fastest for reading, about half the time of method 1.
# It takes more RAM and CPU to create, but is faster to drill.
# To push to the limit, use len(times)x10x10 as the chunk size?
ds1=netCDF4.Dataset(ncfile4)

print ds1.data_model
# ds1.variables

data1=ds1['Data']
print data1.size, data1.dtype, data1.shape

#print data1[:, 1,1,]

%timeit read2(data1)

NETCDF4
18336000000 uint8 (1146, 4000, 4000)
The slowest run took 6.87 times longer than the fastest. This could mean that an intermediate result is being cached 
1 loops, best of 3: 694 ms per loop


In [36]:
pixels1 = numpy.random.randint(0, 4000, (10))
pixels2 = numpy.random.randint(0, 4000, (10))

print pixels1, pixels2

[1516 2464 1356 2489 1667 2819  656 3650 1250 1052] [ 706 1698 2876 2963  927 3469  355 3695 2815 2513]


In [39]:
def read(data):
    # range() starts at 0 so that the first pixel is not skipped.
    for i in range(len(pixels1)):
        data[:, pixels1[i], pixels2[i]]
    

In [40]:
%timeit read(data1)

The slowest run took 14.11 times longer than the fastest. This could mean that an intermediate result is being cached 
100 loops, best of 3: 3.1 ms per loop


From: Sixsmith Joshua 
Sent: Wednesday, 23 March 2016 4:33 PM
To: Zhang Fei
Cc: Raevski Gregory
Subject: RE: NetCDF file size [SEC=UNCLASSIFIED]

Hi Fei,

Here is the code snippet used to test reading a time series from a file.  Effects become more noticeable the more pixels you wish to read.  If an immediately following pixel is contained within the same chunk, then it should also read quicker due to caching.

In [14]: pixels
Out[14]:
array([1829, 2675, 3626, 3514, 2950,  628, 1023, 1392, 1176,  162,   95,
       3659, 3223,  490, 1161, 1519, 3541,  823, 3363, 1430, 1663, 1575,
       1226,  227, 1448, 3763,   11, 1556, 1935, 1294, 3468, 3706,  821,
       1931,  301,  889, 3912, 2881, 1250, 3231, 3421, 3999,  899, 1043,
       2157, 3461, 2198, 1137, 3178, 2952, 3011, 2678, 2702,  546, 2347,
       2089,  243,  609, 2506,  132,  227, 2692, 2965, 3613, 1725, 2229,
       1576, 1993,  559, 1817, 1679, 3240, 3867, 1300,  863, 1729,  211,
       3222, 1271, 3892, 2402, 3527, 3862, 1050, 2769,  320, 3591, 1303,
       2998, 3176,  539,  710, 3170, 3741, 1408, 3524, 2342, 1579, 2664,
       2358])

In [15]: def read1(ds, pixels):
   ....:     for pix in pixels:
   ....:         val = ds[pix, pix, :]
   ....:

In [16]: def read2(ds, pixels):
   ....:     for pix in pixels:
   ....:         val = ds[:, pix, pix]
   ....:

In [17]: %timeit read1(d1, pixels)
1 loops, best of 3: 129 ms per loop

In [18]: %timeit read2(d2, pixels)
1 loops, best of 3: 1.3 s per loop

In [19]: d1.chunks
Out[19]: (1, 250, 1046)

In [20]: pixels = numpy.random.randint(0, 4000, (1000))

In [21]: %timeit read1(d1, pixels)
1 loops, best of 3: 1.29 s per loop

In [22]: %timeit read2(d2, pixels)
1 loops, best of 3: 14.2 s per loop


The dimensional ordering for “d1” is (y, x, z) and the dimensional ordering for “d2” is (z, y, x).  The chunks for “d1” are (1, 250, 1046), and the chunks for “d2” are (100, 100, 100).

So, to read a time series from “d1”, a single chunk is read, i.e. (1, 250, 1046), containing 261,500 elements.  There is also less skipping through memory to form the time series, as the time dimension “z” is already contiguous in memory.
To read a time series from “d2”, 11 x (100, 100, 100) blocks are read, which totals 11,000,000 elements.  At each chunk read, the data array also needs to skip over (100x*100y) elements to form the time series, which (minutely) adds to the overall read time.
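
The chunk arithmetic above can be reproduced with a short sketch. `series_read_cost` is a hypothetical helper, and the time length of 1046 is an assumption inferred from the chunk shape reported for “d1”; it is not stated directly.

```python
import math


def series_read_cost(time_len, chunk_shape, time_axis):
    """Chunks and elements read to extract one pixel's full time series.

    chunk_shape is the chunk size along each dimension; time_axis is the
    position of the time dimension within it.
    """
    n_chunks = int(math.ceil(time_len / float(chunk_shape[time_axis])))
    elems_per_chunk = 1
    for size in chunk_shape:
        elems_per_chunk *= size
    return n_chunks, n_chunks * elems_per_chunk


# Assumption: 1046 time steps, inferred from the chunk shape of d1.
d1_cost = series_read_cost(1046, (1, 250, 1046), time_axis=2)
d2_cost = series_read_cost(1046, (100, 100, 100), time_axis=0)
print(d1_cost)  # (1, 261500)
print(d2_cost)  # (11, 11000000)
```

The element counts differ by a factor of about 42, whereas the timings above differ by a factor of about 10, so the count is only a rough guide to read time.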

Hope that helps

Cheers
Josh

From: Zhang Fei 
Sent: Wednesday, 23 March 2016 11:05 AM
To: Sixsmith Joshua
Cc: Raevski Gregory; Ayers Damien; Oliver Simon; Wu Wenjun; Hicks Andrew; Hooke Jeremy; Bala Biswajit
Subject: NetCDF file size [SEC=UNCLASSIFIED]
Importance: High


Hi Josh and all,

For pixel-drill purposes, we needed to stack the WOFS extents TIFF files into netCDF files, one for each cell.

I have observed the following differences in file size.
It would be great if you could study them further and offer some explanation.

Cheers

Fei

Create NetCDF stacks: comparison of 3 methods

Water extents tiff files    /g/data/u46/wofs/extents 
total size = 514 GB.  

Method 1: Java-based
    Total netCDF file size:         2100 GB
    Computing time for creation:    Not sure. Work done 1.5 years ago. Heard very slow.
    Program:                        Complex Java library dependencies. Not working now.
    Interoperability:               gdalinfo does not like the netCDF file.
    Source code and netCDF files:   ls /g/data/fk4/wofs/water_f7q/extents/*/*.nc
                                    du -ch /g/data/fk4/wofs/water_f7q/extents/*/*.nc

Method 2: Python (netCDF files are non-CF compliant)
    Total netCDF file size:         1001 GB
    Computing time for creation:    100 hours (2 CPUs)
    Program:                        Python-based. A 200-line script.
    Interoperability:               Same as above: gdalinfo does not like the netCDF file.
    Source code and netCDF files:   ls /g/data/fk4/wofs/water_20160203/extents/*/*.nc
                                    du -ch /g/data/fk4/wofs/water_20160203/extents/*/*.nc
                                    https://github.com/feizhanga/scicomput/blob/master/GeoDataSoft/netCDF4/stack_tiffs2netcdf.py

Method 3: New Python (CF-1.6 compliant netCDF4)
    Total netCDF file size:         370 GB
    Computing time for creation:    47 hours (2 CPUs)
    Program:                        Python-based. A 200-line script.
    Interoperability:               Updated netCDF format, interoperable with other tools such as GDAL, ncdump, etc.
    Source code and netCDF files:   ls /g/data/u46/wofs/extents2nc/*/*.nc
                                    du -ch /g/data/u46/wofs/extents2nc/*/*.nc
                                    https://github.com/feizhanga/scicomput/blob/master/GeoDataSoft/netCDF4/stack_tiffs2netcdf_CF.py
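
To make the size comparison explicit, the sketch below expresses each method's total netCDF size as a multiple of the 514 GB of source TIFF files. It uses only the figures quoted above; the variable names and short method labels are illustrative.

```python
source_gb = 514.0  # total size of the water extents TIFF files

stack_gb = [
    ("1. Java-based", 2100.0),
    ("2. Python, non-CF", 1001.0),
    ("3. Python, CF-1.6", 370.0),
]

# Size of each netCDF stack relative to the source TIFF files.
ratios = {name: size / source_gb for name, size in stack_gb}

for name, size in stack_gb:
    print("%-20s %6.0f GB  %.2fx source" % (name, size, ratios[name]))
```

Only the third method produces stacks smaller than the source files; the first is roughly four times larger and the second roughly twice as large.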



