Author: Maxime Marin  
@: mff.marin@gmail.com

# Accessing IMOS data case studies: Walk-through and interactive session - Data Extraction

*'Ok Max, very cool but I really just want to save the data and do my own stuff'*

In this very short notebook, we show that once you have loaded your data, it is very easy to extract it from Jupyter into a variety of formats.

## 1) Saving data

Let's bring back our dataset (new notebook), import some libraries and save it:


In [1]:
import sys
import os

sys.path.append('/home/jovyan/intake-aodn')
import intake_aodn
from intake_aodn.utils import save_netcdf

In [2]:
import xarray as xr
data = xr.open_dataset('Example_Data.nc')

In [3]:
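%%time
# Average over all pixels in the box to get a single daily timeseries
df = data.stack(space=['longitude','latitude']).mean(dim='space').to_dataframe()
# Write to csv with 2 decimal places; missing values are written as 'NaN'
df.to_csv('box_averaged.csv', float_format='%.2f', na_rep='NaN')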

CPU times: user 537 ms, sys: 208 ms, total: 745 ms
Wall time: 844 ms


We can of course save the data in csv format. In the cell above, we first compute the box-averaged timeseries and then save it at daily frequency.  
If we open the file in Jupyter or Excel, we will see that the number of rows equals the number of time entries in our dataset (~10k days). If we instead wanted to save every pixel in our selected region, the number of rows would be multiplied by the number of pixels, which would quickly make the file too big to be practical as csv.
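To make the row-count argument concrete, here is a minimal sketch on a small synthetic dataset. The variable name `sea_surface_temperature`, the grid sizes and the values are illustrative assumptions, not the actual IMOS data. It compares the number of rows produced by a box average with the number produced when every pixel is kept.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for the SST dataset (names, sizes and values are assumptions)
time = pd.date_range("2020-01-01", periods=30, freq="D")
lat = np.linspace(-28.0, -30.0, 5)
lon = np.linspace(110.0, 112.0, 4)
rng = np.random.default_rng(0)
sst = rng.normal(295.0, 1.0, size=(time.size, lat.size, lon.size))

demo = xr.Dataset(
    {"sea_surface_temperature": (("time", "latitude", "longitude"), sst)},
    coords={"time": time, "latitude": lat, "longitude": lon},
)

# Box average: one row per time step
box_mean = demo.mean(dim=["latitude", "longitude"]).to_dataframe()

# All pixels: one row per (time, latitude, longitude) combination
all_pixels = demo.to_dataframe()

print(len(box_mean))    # 30
print(len(all_pixels))  # 600 = 30 * 5 * 4
```

Even on this toy grid of 20 pixels the table is 20 times longer; with ~10k days and a region of a few thousand pixels, the same calculation gives tens of millions of rows.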

While saving large datasets to csv is possible, it would take a very long time, and Excel, for example, cannot display more than 2^20 (1,048,576) rows. Instead, we can save the data in netCDF format, which is designed for large 3D-4D datasets:

In [4]:
%%time
save_netcdf(data,filename = 'mySSTfile.nc')

CPU times: user 290 ms, sys: 70.6 ms, total: 360 ms
Wall time: 2.66 s


Saving the daily data for all pixels in netCDF format took only a little longer.
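A saved file can be checked by opening it again. The sketch below is a round trip on a small synthetic dataset, using xarray's own `to_netcdf` in place of `save_netcdf` (whose internals are not shown here). It assumes a netCDF backend such as `netCDF4`, `h5netcdf` or `scipy` is installed; the variable name and grid sizes are illustrative.

```python
import os
import tempfile

import numpy as np
import pandas as pd
import xarray as xr

# Small synthetic dataset standing in for the SST data (names and sizes are assumptions)
time = pd.date_range("2020-01-01", periods=10, freq="D")
lat = np.linspace(-28.0, -30.0, 3)
lon = np.linspace(110.0, 112.0, 4)
values = np.random.default_rng(0).normal(295.0, 1.0, size=(10, 3, 4))

demo = xr.Dataset(
    {"sea_surface_temperature": (("time", "latitude", "longitude"), values)},
    coords={"time": time, "latitude": lat, "longitude": lon},
)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "demo_sst.nc")
    # xarray's built-in writer, used here instead of intake_aodn's save_netcdf
    demo.to_netcdf(path)
    # Re-open the file and load it into memory before the temporary file is removed
    with xr.open_dataset(path) as ds:
        reloaded = ds.load()

# The dimensions and the values should survive the round trip
print(dict(reloaded.sizes))
print(np.allclose(reloaded["sea_surface_temperature"].values, values))
```

Unlike the csv export, the file keeps the full time, latitude and longitude structure, so it can be loaded straight back into xarray for further analysis.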

***

## 2) All in one

If users simply want to use these tools to access, download and extract data, all of this can be done in a few lines of code:

In [5]:
%%time
import sys
import os
sys.path.append('/home/jovyan/intake-aodn')
import intake_aodn
from intake_aodn.utils import get_local_cluster, save_netcdf

client = get_local_cluster()

data=intake_aodn.cat.aodn_s3.SST_L3S_1d_ngt(startdt='2018-01-01',
                                          enddt='2020-12-31',
                                          cropto=dict(latitude=slice(-28,-30),longitude=slice(110,112))).read()

save_netcdf(data,filename = 'mySSTfile.nc')

CPU times: user 1.27 s, sys: 646 ms, total: 1.91 s
Wall time: 52.3 s


distributed.client - ERROR - Failed to reconnect to scheduler after 1.00 seconds, closing client
_GatheringFuture exception was never retrieved
future: <_GatheringFuture finished exception=CancelledError()>
asyncio.exceptions.CancelledError
