# Download files from remote data services

This notebook demonstrates several ways to download files from NCI's THREDDS server, using Python (`requests`, `urllib`) and the `wget` command-line tool.

* download a single file
    - requests.get
    - urllib
    - wget
    
* download multiple files
    - requests.get
    - urllib
---
- Authors: NCI Virtual Research Environment Team
- Keywords: data download, THREDDS, requests, wget, urllib
- Create Date: 2020-Jul
---

This notebook is licensed under the [Creative Commons Attribution 4.0 International license](https://creativecommons.org/licenses/by/4.0/).

### Single file download

First, let's define a THREDDS endpoint URL:

In [1]:
url = 'http://dapds00.nci.org.au/thredds/fileServer/iv65/Geoscience_Australia_Geophysics_Reference_Data_Collection/national_geophysical_compilations/Gravmap2016/Gravmap2016-grid-grv_ir.nc'

1. `requests.get`

In [2]:
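import os
import requests

def download_file(in_filename, out_filename):
    """Download the URL in_filename to out_filename unless the file already exists."""
    if not os.path.exists(out_filename):
        print("Downloading", in_filename)
        response = requests.get(in_filename)
        # Fail loudly on HTTP errors instead of saving an error page as data
        response.raise_for_status()
        with open(out_filename, 'wb') as f:
            f.write(response.content)

# Create the output directory if it does not exist yet
outdir = './output'
os.makedirs(outdir, exist_ok=True)

download_file(url, './output/IR1.nc')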
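
One caveat with skipping files that already exist: if a download is interrupted, a partial file can be left behind, and later runs will skip it. The sketch below shows one way to guard against this, assuming the data arrives as an iterable of byte chunks that is first written to a temporary `.part` file and renamed only on success. The name `save_atomically` and the `.part` suffix are hypothetical choices for illustration, not part of `requests`.

```python
import os

def save_atomically(chunks, out_filename):
    """Write an iterable of byte chunks to out_filename via a temporary .part file."""
    part = out_filename + '.part'  # hypothetical temporary name
    try:
        with open(part, 'wb') as f:
            for chunk in chunks:
                f.write(chunk)
        # Rename only after every chunk has been written successfully
        os.replace(part, out_filename)
    finally:
        # After a successful rename the .part file is gone; otherwise clean it up
        if os.path.exists(part):
            os.remove(part)

# Possible use with requests (not executed here):
#   with requests.get(url, stream=True, timeout=60) as response:
#       response.raise_for_status()
#       save_atomically(response.iter_content(chunk_size=1 << 20), './output/IR1.nc')
```

With `stream=True`, `requests` fetches the body lazily, so the whole file is never held in memory at once.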

2. `urllib`

In [3]:
from urllib import request
request.urlretrieve(url,'./output/IR2.nc')

('./output/IR2.nc', <http.client.HTTPMessage at 0x7f238e5a4290>)
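
`urlretrieve` is listed under the legacy interface in the Python documentation. As a possible alternative, `urlopen` can be combined with `shutil.copyfileobj` so that the response is copied to disk in blocks. This is a minimal sketch; the function name `download_with_urlopen` and the 60-second timeout are assumptions made for illustration.

```python
import shutil
from urllib import request

def download_with_urlopen(url, out_filename, timeout=60):
    """Copy the response body for url to out_filename block by block."""
    with request.urlopen(url, timeout=timeout) as response, \
            open(out_filename, 'wb') as f:
        # copyfileobj reads and writes in fixed-size blocks
        shutil.copyfileobj(response, f)
    return out_filename

# Possible use (not executed here):
#   download_with_urlopen(url, './output/IR2.nc')
```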

3. `wget`

In [4]:
!wget $url -O ./output/IR3.nc

--2021-03-01 14:55:48--  http://dapds00.nci.org.au/thredds/fileServer/iv65/Geoscience_Australia_Geophysics_Reference_Data_Collection/national_geophysical_compilations/Gravmap2016/Gravmap2016-grid-grv_ir.nc
Resolving dapds00.nci.org.au (dapds00.nci.org.au)... 130.56.243.202
Connecting to dapds00.nci.org.au (dapds00.nci.org.au)|130.56.243.202|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 29556652 (28M) [application/x-netcdf]
Saving to: ‘./output/IR3.nc’


2021-03-01 14:55:48 (73.1 MB/s) - ‘./output/IR3.nc’ saved [29556652/29556652]



### Bulk download

1. `requests.get`

First, list all the file names in the catalog with `siphon`.

In [5]:
from siphon.catalog import TDSCatalog
url='http://dapds00.nci.org.au/thredds/catalog/yj45/acorn/sat/version_2/site_data/catalog.xml'
cat = TDSCatalog(url)
print("\n".join(cat.datasets.keys()))

acorn_sat_v2_daily_tmax.tar.gz
acorn_sat_v2_daily_tmin.tar.gz
acorn_sat_v2_stations.txt
tmax.001019.daily.csv
tmax.002012.daily.csv
tmax.003003.daily.csv
tmax.004032.daily.csv
tmax.004106.daily.csv
tmax.005007.daily.csv
tmax.005026.daily.csv
tmax.006011.daily.csv
tmax.007045.daily.csv
tmax.008296.daily.csv
tmax.008297.daily.csv
tmax.008315.daily.csv
tmax.009021.daily.csv
tmax.009518.daily.csv
tmax.009617.daily.csv
tmax.009789.daily.csv
tmax.009999.daily.csv
tmax.010092.daily.csv
tmax.010286.daily.csv
tmax.010916.daily.csv
tmax.010917.daily.csv
tmax.011003.daily.csv
tmax.011052.daily.csv
tmax.012038.daily.csv
tmax.013017.daily.csv
tmax.014015.daily.csv
tmax.014825.daily.csv
tmax.015135.daily.csv
tmax.015590.daily.csv
tmax.015666.daily.csv
tmax.016001.daily.csv
tmax.016098.daily.csv
tmax.017043.daily.csv
tmax.017126.daily.csv
tmax.018012.daily.csv
tmax.018044.daily.csv
tmax.018192.daily.csv
tmax.021133.daily.csv
tmax.022823.daily.csv
tmax.023090.daily.csv
tmax.023373.daily.csv
tmax.02602

In [6]:
import requests
for filename in cat.datasets.keys():
    if filename.endswith('.csv'):
        # Files are served from the fileServer endpoint, not the catalog endpoint
        url = 'http://dapds00.nci.org.au/thredds/fileServer/yj45/acorn/sat/version_2/site_data/' + filename
        r = requests.get(url, allow_redirects=True)
        r.raise_for_status()
        with open('./output/' + filename, 'wb') as f:
            f.write(r.content)
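
The catalog and download URLs used in this notebook differ only in one path segment (`/thredds/catalog/` versus `/thredds/fileServer/`). Below is a minimal sketch of a helper that derives the download URL from the catalog URL, assuming this pattern holds for the dataset in question. `to_file_server_url` is a hypothetical name, not part of `siphon`.

```python
def to_file_server_url(catalog_url, filename):
    """Build a fileServer download URL from a catalog.xml URL and a file name."""
    # Drop the trailing 'catalog.xml' segment
    base = catalog_url.rsplit('/', 1)[0]
    # Assumption: the download endpoint mirrors the catalog path
    base = base.replace('/thredds/catalog/', '/thredds/fileServer/', 1)
    return base + '/' + filename

catalog_url = 'http://dapds00.nci.org.au/thredds/catalog/yj45/acorn/sat/version_2/site_data/catalog.xml'
print(to_file_server_url(catalog_url, 'tmax.001019.daily.csv'))
```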

Alternatively, you can use `thredds_crawler` to collect all the endpoints.

In [7]:
from thredds_crawler.crawl import Crawl
url= 'http://dapds00.nci.org.au/thredds/catalog/yj45/acorn/sat/version_2/site_data/catalog.xml'
c = Crawl(url)
c.datasets

[<LeafDataset id: yj45/acorn_sat_v2_daily_tmax.tar.gz, name: acorn_sat_v2_daily_tmax.tar.gz, services: ['HTTPServer']>,
 <LeafDataset id: yj45/acorn_sat_v2_daily_tmin.tar.gz, name: acorn_sat_v2_daily_tmin.tar.gz, services: ['HTTPServer']>,
 <LeafDataset id: yj45/acorn_sat_v2_stations.txt, name: acorn_sat_v2_stations.txt, services: ['HTTPServer']>,
 <LeafDataset id: yj45/tmax.001019.daily.csv, name: tmax.001019.daily.csv, services: ['HTTPServer']>,
 <LeafDataset id: yj45/tmax.002012.daily.csv, name: tmax.002012.daily.csv, services: ['HTTPServer']>,
 <LeafDataset id: yj45/tmax.003003.daily.csv, name: tmax.003003.daily.csv, services: ['HTTPServer']>,
 <LeafDataset id: yj45/tmax.004032.daily.csv, name: tmax.004032.daily.csv, services: ['HTTPServer']>,
 <LeafDataset id: yj45/tmax.004106.daily.csv, name: tmax.004106.daily.csv, services: ['HTTPServer']>,
 <LeafDataset id: yj45/tmax.005007.daily.csv, name: tmax.005007.daily.csv, services: ['HTTPServer']>,
 <LeafDataset id: yj45/tmax.005026.dai

In [8]:
urls_download = [s.get("url") for d in c.datasets for s in d.services if s.get("service").lower() == "httpserver"]
urls_download

['http://dapds00.nci.org.au/thredds/fileServer/yj45/acorn/sat/version_2/site_data/acorn_sat_v2_daily_tmax.tar.gz',
 'http://dapds00.nci.org.au/thredds/fileServer/yj45/acorn/sat/version_2/site_data/acorn_sat_v2_daily_tmin.tar.gz',
 'http://dapds00.nci.org.au/thredds/fileServer/yj45/acorn/sat/version_2/site_data/acorn_sat_v2_stations.txt',
 'http://dapds00.nci.org.au/thredds/fileServer/yj45/acorn/sat/version_2/site_data/tmax.001019.daily.csv',
 'http://dapds00.nci.org.au/thredds/fileServer/yj45/acorn/sat/version_2/site_data/tmax.002012.daily.csv',
 'http://dapds00.nci.org.au/thredds/fileServer/yj45/acorn/sat/version_2/site_data/tmax.003003.daily.csv',
 'http://dapds00.nci.org.au/thredds/fileServer/yj45/acorn/sat/version_2/site_data/tmax.004032.daily.csv',
 'http://dapds00.nci.org.au/thredds/fileServer/yj45/acorn/sat/version_2/site_data/tmax.004106.daily.csv',
 'http://dapds00.nci.org.au/thredds/fileServer/yj45/acorn/sat/version_2/site_data/tmax.005007.daily.csv',
 'http://dapds00.nci.org

In [9]:
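import requests
for url in urls_download:
    if url.endswith('.csv'):
        r = requests.get(url, allow_redirects=True)
        r.raise_for_status()
        # Take the file name from the last URL segment rather than a fixed-length slice
        filename = url.rsplit('/', 1)[-1]
        with open('./output/' + filename, 'wb') as f:
            f.write(r.content)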
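
Deriving the local file name from the URL path keeps the loop independent of how long each name is. The following sketch shows two small helpers for that, using only the standard library. `local_path_for` and `select_by_suffix` are hypothetical names, and the `./output` default mirrors the directory used in this notebook.

```python
import os
import posixpath
from urllib.parse import unquote, urlsplit

def local_path_for(url, outdir='./output'):
    """Return the local path for url, named after the last segment of the URL path."""
    name = unquote(posixpath.basename(urlsplit(url).path))
    return os.path.join(outdir, name)

def select_by_suffix(urls, suffix='.csv'):
    """Keep only the URLs whose path (ignoring any query string) ends with suffix."""
    return [u for u in urls if urlsplit(u).path.endswith(suffix)]

# Possible use with the crawler results (not executed here):
#   for url in select_by_suffix(urls_download):
#       download_file(url, local_path_for(url))
```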

2. `urllib`

In [10]:
from urllib import request
# No extension filter here, so the .tar.gz archives and the station list are downloaded too
for filename in cat.datasets.keys():
    url = 'http://dapds00.nci.org.au/thredds/fileServer/yj45/acorn/sat/version_2/site_data/' + filename
    request.urlretrieve(url, './output/' + filename)
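
In a bulk download, a single failed request stops the whole loop. Here is a minimal sketch of a loop that skips files already on disk, carries on after an error, and reports which URLs failed. `download_all` is a hypothetical helper written for illustration. It relies on `urlretrieve` raising an `OSError` subclass (such as `URLError`) on failure.

```python
import os
from urllib import request

def download_all(urls, outdir):
    """Download each URL into outdir; return (saved paths, [(url, error message)])."""
    os.makedirs(outdir, exist_ok=True)
    saved, failed = [], []
    for url in urls:
        target = os.path.join(outdir, url.rsplit('/', 1)[-1])
        if os.path.exists(target):
            continue  # already downloaded in an earlier run
        try:
            request.urlretrieve(url, target)
            saved.append(target)
        except OSError as err:  # URLError and HTTPError are OSError subclasses
            failed.append((url, str(err)))
    return saved, failed

# Possible use with the crawler results (not executed here):
#   saved, failed = download_all(urls_download, './output')
```

Returning the failures instead of raising lets the caller retry only the URLs that did not succeed.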