Author: Maxime Marin  
@: mff.marin@gmail.com

# Accessing IMOS data case studies: Walk-through and interactive session - Data Extraction

*'Ok Max, very cool but I really just want to save the data and do my own stuff'*

In this very short notebook, we show that once you have loaded your data, it is very easy to export it from Jupyter into a variety of formats.

## 1) Saving data

Let's bring back our dataset (new notebook), import some libraries and save it:


In [1]:
import sys
import os

sys.path.append('/home/jovyan/intake-aodn')
import intake_aodn
from intake_aodn.plot import netcdf_save

%store -r data

In [5]:
%%time

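# Average over every pixel in the box, then convert the time series to a pandas DataFrame
ds = data.stack(space=['longitude','latitude']).mean(dim='space').to_dataframe()
# Write the CSV with two decimals; missing values are written as 'NaN'
ds.to_csv('box_averaged.csv',float_format = '%.2f', na_rep = 'NaN')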

CPU times: user 4.73 s, sys: 1.83 s, total: 6.56 s
Wall time: 9.53 s


We can of course save the data in CSV format. In the cell above, we first compute the box-averaged time series, then save it at daily frequency.  
If we open the file in Jupyter or Excel, we will see that the number of rows equals the number of time entries in our dataset (~10k days). If we instead wanted to save every pixel in our selected region, the number of rows would be multiplied by the number of pixels, which would quickly make the file too big to be practical as a CSV.
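To make the box-averaging step concrete, here is a minimal sketch on a tiny synthetic dataset. The variable name `sst`, the grid and the values are made up for illustration and are not the IMOS data; only the `stack` / `mean` / `to_dataframe` / `to_csv` chain mirrors the cell above. It assumes `numpy`, `pandas` and `xarray` are installed.

```python
import io

import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical toy grid: 4 days, 2 latitudes, 3 longitudes
time = pd.date_range("2000-01-01", periods=4, freq="D")
values = np.arange(4 * 2 * 3, dtype=float).reshape(4, 2, 3)
values[0, 0, 0] = np.nan  # one missing pixel on the first day

toy = xr.Dataset(
    {"sst": (("time", "latitude", "longitude"), values)},
    coords={
        "time": time,
        "latitude": [-28.0, -29.0],
        "longitude": [110.0, 111.0, 112.0],
    },
)

# Same chain as in the notebook cell: collapse the two spatial
# dimensions into one, average over it, and convert to a DataFrame.
# The mean skips missing pixels by default for floating-point data.
box_mean = toy.stack(space=["longitude", "latitude"]).mean(dim="space").to_dataframe()

# Write to an in-memory buffer instead of a file so nothing is left on disk
buffer = io.StringIO()
box_mean.to_csv(buffer, float_format="%.2f", na_rep="NaN")
csv_text = buffer.getvalue()
print(csv_text)
```

The resulting table has one row per time step, which is why the box-averaged CSV stays small however large the selected region is.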

While saving large datasets to CSV is possible, it would take a very long time, and Excel, for example, cannot display more than 2^20 (1,048,576) rows. Instead, we can write the data to netCDF, a format designed for large 3D-4D datasets:

In [3]:
%%time
netcdf_save(data,filename = 'mySSTfile')

CPU times: user 4.31 s, sys: 3.36 s, total: 7.67 s
Wall time: 22.5 s


Saving the daily data for all pixels to netCDF took only a little over twice as long (22.5 s versus 9.53 s wall time).
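A rough row count helps explain why the full grid is not a good fit for CSV. The sketch below uses the date range from the "All in one" cell of this notebook and a hypothetical 100 x 100 pixel box; the real pixel count depends on the product resolution and is not stated here, so treat the grid size as an assumption.

```python
from datetime import date

EXCEL_MAX_ROWS = 2 ** 20  # row limit of an Excel worksheet


def csv_rows(start, end, n_lat, n_lon):
    """Rows needed to store one value per day and per pixel, both end dates included."""
    n_days = (end - start).days + 1
    return n_days * n_lat * n_lon


start, end = date(1992, 3, 21), date(2021, 6, 30)

box_average_rows = csv_rows(start, end, 1, 1)    # one row per day
full_grid_rows = csv_rows(start, end, 100, 100)  # hypothetical 100 x 100 pixel box

print(f"box average: {box_average_rows:,} rows")
print(f"full grid:   {full_grid_rows:,} rows")
print(f"fits in one Excel sheet: {full_grid_rows <= EXCEL_MAX_ROWS}")
```

Under these assumptions the full grid needs four orders of magnitude more rows than the box average, far beyond what a spreadsheet can show.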

***

## 2) All in one

If users simply want to use these tools to access, download and extract data, all of this can be done in a few lines of code:

In [1]:
%%time
import sys
import os
sys.path.append('/home/jovyan/intake-aodn')
import intake_aodn
from intake_aodn.plot import netcdf_save
from intake_aodn.utils import get_distributed_cluster

client = get_distributed_cluster(worker_cores=2,
                                 worker_memory=4,
                                 min_workers=1,
                                 max_workers=64)

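def load_creds():
    # Assumes a fixed file layout: the first line is skipped (typically the
    # profile header), the second holds the key and the third the secret.
    with open(os.environ['HOME'] + '/.aws/credentials','rt') as f:
        f.readline()  # skip the first line
        key=f.readline().split('=')[1].strip()
        secret=f.readline().split('=')[1].strip()
    return key, secret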

key,secret=load_creds()

storage_options=dict(target_protocol='s3',
                     target_options=dict(key=key,secret=secret),
                     remote_protocol='s3',
                     remote_options=dict(anon=True))

data=intake_aodn.cat.aodn_s3.SST_L3S_1d_ngt(startdt='1992-03-21',
                                          enddt='2021-06-30',
                                          cropto=dict(latitude=slice(-28,-30),longitude=slice(110,112))).read()

netcdf_save(data,filename = 'mySSTfile')
client.cluster.shutdown()


An existing cluster was found. Connected to cluster easihub.fa5b73d4d06e46c4a33846c2052ec77f




CPU times: user 12.2 s, sys: 10.1 s, total: 22.3 s
Wall time: 7min 30s
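The `load_creds` helper in the cell above reads the key and secret by line position, so it depends on the exact layout of the credentials file. As a possible alternative, here is a minimal sketch using the standard-library `configparser`. It assumes the usual AWS credentials layout (a `[default]` profile with `aws_access_key_id` and `aws_secret_access_key` entries); the function name `load_creds_from_profile` and its parameters are hypothetical and not part of `intake_aodn`.

```python
import configparser
import os


def load_creds_from_profile(path=None, profile="default"):
    """Return (key, secret) for one profile of an AWS-style credentials file.

    Hypothetical helper: assumes the file is INI-formatted with the
    standard 'aws_access_key_id' and 'aws_secret_access_key' entries.
    """
    if path is None:
        path = os.path.join(os.path.expanduser("~"), ".aws", "credentials")

    parser = configparser.ConfigParser()
    # parser.read returns the list of files it managed to parse
    if not parser.read(path):
        raise FileNotFoundError(path)

    section = parser[profile]  # raises KeyError if the profile is missing
    return section["aws_access_key_id"], section["aws_secret_access_key"]
```

It would be called the same way as the original helper, `key, secret = load_creds_from_profile()`, and does not depend on the order of the entries or on comment lines in the file.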


*'Let me get that data real quick....'* 