"Geo Data Science with Python" 
### Notebook Lesson 06c

# Python Packages: Download data from the Web I

This lesson discusses several small Python modules that are useful for downloading and retrieving geoscience data from the internet. 

### Sources
This notebook contains information from the following resources:

- Downloading Files from the Internet: https://www.codingem.com/python-download-file-from-url/
- Module urllib: https://docs.python.org/3/library/urllib.html
- Module Requests: https://docs.python-requests.org/en/latest/api/
- Module Pywget: https://bitbucket.org/licface/pywget/src/master/
- Install Pywget: https://anaconda.org/anaconda/pywget
---

---
## CODE EXAMPLE 1: Download files with urllib

In [1]:
%reset -f

In [2]:
# 1. Import the request submodule of urllib
#    (a plain "import urllib" does not reliably make urllib.request available)
import urllib.request

# 2. Define the URL: VT logo from vt.edu 
URL = "https://www.assets.cms.vt.edu/images/logo-maroon-whiteBG.svg"

# 3. Use urllib's request.urlretrieve() method to download a file from a specific URL 
#    and save it to a new file called "VTlogo.svg"
response = urllib.request.urlretrieve(URL, "VTlogo.svg")
response

('VTlogo.svg', <http.client.HTTPMessage at 0x7fdc7e3bd4c0>)
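
The Python documentation lists `urlretrieve()` under its legacy interface. The same download can be sketched with `urllib.request.urlopen()`; the helper name `download_with_urlopen` is hypothetical and not part of the lesson.

```python
import shutil
import urllib.request


def download_with_urlopen(url, filename):
    """Copy the data behind `url` into `filename` and return the file name.

    Hypothetical helper for illustration only.
    """
    # urlopen() returns a file-like response; the with-statement closes
    # both the response and the output file when the copy is finished
    with urllib.request.urlopen(url) as response, open(filename, "wb") as f_out:
        shutil.copyfileobj(response, f_out)
    return filename
```

With the URL defined above, it would be called as `download_with_urlopen(URL, "VTlogo.svg")`.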

---
## CODE EXAMPLE 2: Download files with Requests

In [3]:
# 1. Import the requests module
import requests

# 2. Define the URL: VT logo from vt.edu
URL = "https://www.assets.cms.vt.edu/images/logo-maroon-whiteBG.svg"

# 3. Use requests.get() to download the data behind that URL: VT logo image file 
response = requests.get(URL)

# 4. Write the content to a new binary file called "VTlogo.svg";
#    the with-statement closes the file automatically
with open("VTlogo.svg", "wb") as f:
    f.write(response.content)
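
`requests.get()` does not raise an error by itself when the server answers with a status such as 404, so an error page could end up on disk instead of the image. A minimal sketch that checks the status before writing; the helper name `download_with_requests` and the 30-second timeout are assumptions for illustration.

```python
import requests


def download_with_requests(url, filename, timeout=30):
    """Download `url` to `filename`; raise requests.exceptions.HTTPError on 4xx/5xx.

    Hypothetical helper; the timeout value is an arbitrary choice.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # stop here if the server reported an error
    with open(filename, "wb") as f:
        f.write(response.content)
    return filename
```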

---
## CODE EXAMPLE 3: Download files with wget

#### NOTE: pywget is not part of the Python standard library and works only if it is installed

On your computer you can install the module **via terminal** with:
```bash
conda install -c anaconda pywget
```
or
```bash
conda install -c conda-forge pywget
```

<div class="alert alert-success">

**Note**: anaconda and conda-forge are different channels that distribute conda-compatible Python packages. Most of you (as well as the ARC webapp) will have Python installed with conda and should use conda with any conda-compatible channel to install new packages. The documentation of many Python modules only mentions installation with pip, even though a conda package is available as well. If you installed Python through conda (or Anaconda), please avoid using pip for package installation, because this might lead to incompatibilities among different Python packages.
    
</div>


In [4]:
%reset -f

In [5]:
# 1. Import the wget module
import wget

# 2. Define the URL: VT logo from vt.edu
URL = "https://www.assets.cms.vt.edu/images/logo-maroon-whiteBG.svg"

# 3. Use wget.download() to download a file from a specific URL 
#    and save it to a new file called "VTlogo.svg".
response = wget.download(URL, "VTlogo.svg")


---
## CODE EXAMPLE 4: Download a batch of files

In [6]:
# simple solution, which assumes we already know that files 5 and 6 do not exist:

import os
import requests

os.makedirs('./out/', exist_ok=True)  # make sure the output folder exists

for m in list(range(0,5))+list(range(7,12)):
    fname = 'test'+ str(m) + '.dat'
    url = 'http://test.opendap.org/opendap/hyrax/data/ff/'+ fname
    outpath = './out/' + fname
    r = requests.get(url)
    with open(outpath, "wb") as f:  # the with-statement closes the file
        f.write(r.content)
    print(fname)

test0.dat
test1.dat
test2.dat
test3.dat
test4.dat
test7.dat
test8.dat
test9.dat
test10.dat
test11.dat


In [7]:
# solution with exception handling

import requests

for m in range(0,12):
    fname = 'test'+ str(m) + '.dat'
    url = 'http://test.opendap.org/opendap/hyrax/data/ff/'+ fname
    outpath = './out/' + fname
    try:
        r = requests.get(url)
        r.raise_for_status()  # raises HTTPError (e.g. 404) before anything is written
        with open(outpath, "wb") as f:
            f.write(r.content)
        print(fname)
    except requests.exceptions.HTTPError as err:
        print(err)


test0.dat
test1.dat
test2.dat
test3.dat
test4.dat
404 Client Error: Not Found for url: http://test.opendap.org/opendap/hyrax/data/ff/test5.dat
404 Client Error: Not Found for url: http://test.opendap.org/opendap/hyrax/data/ff/test6.dat
test7.dat
test8.dat
test9.dat
test10.dat
test11.dat
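
The same skip-and-continue pattern can be sketched with the standard-library `urllib` from Example 1. The helper `download_batch` is hypothetical; it catches `urllib.error.URLError`, which also covers `urllib.error.HTTPError` (for example a 404), and returns the names that failed.

```python
import os
import urllib.error
import urllib.request


def download_batch(base_url, fnames, outdir):
    """Download each file name from `base_url` into `outdir`.

    Hypothetical helper; returns (downloaded, failed) lists of file names.
    """
    os.makedirs(outdir, exist_ok=True)
    downloaded, failed = [], []
    for fname in fnames:
        try:
            urllib.request.urlretrieve(base_url + fname, os.path.join(outdir, fname))
            downloaded.append(fname)
        except urllib.error.URLError as err:  # HTTPError is a subclass of URLError
            print(fname, "skipped:", err)
            failed.append(fname)
    return downloaded, failed
```

For the files above, the call would be `download_batch('http://test.opendap.org/opendap/hyrax/data/ff/', ['test' + str(m) + '.dat' for m in range(0, 12)], './out/')`.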


---
## CODE EXAMPLE 4: Download a NetCDF File

In [63]:
# 1. Import the module requests
import requests

In [9]:
# 2. Define the URL
url = 'https://data.giss.nasa.gov/pub/gistemp/gistemp250_GHCNv4.nc.gz'
filename = 'gistemp250_GHCNv4.nc.gz'

In [10]:
# 3. Use requests.get() to download the data behind that URL
#    (stream=True defers fetching the body until r.content is accessed)
r = requests.get(url, allow_redirects=True, stream=True)

In [11]:
# optional: print the content type reported by the server for this URL
print(r.headers.get('content-type'))   

application/x-gzip


In [12]:
# 4. Write the content to a new BINARY file on your computer
open(filename, 'wb').write(r.content)  # open and write in one line; write() returns the number of bytes written

11034247
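
Accessing `r.content` loads the whole file (about 11 MB here) into memory at once, even when the request was made with `stream=True`. A minimal sketch that writes the download in chunks with `iter_content()` instead; the helper name `download_in_chunks`, the chunk size, and the timeout are assumptions.

```python
import requests


def download_in_chunks(url, filename, chunk_size=1024 * 1024):
    """Stream `url` to `filename` piece by piece and return the bytes written.

    Hypothetical helper; chunk_size (1 MiB) and the timeout are arbitrary choices.
    """
    written = 0
    # the with-statement closes the connection when the download is done
    with requests.get(url, allow_redirects=True, stream=True, timeout=60) as r:
        r.raise_for_status()
        with open(filename, "wb") as f:
            for chunk in r.iter_content(chunk_size=chunk_size):
                written += f.write(chunk)
    return written
```

This matters little for an 11 MB file but keeps memory use flat for larger datasets.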

---
## CODE EXAMPLE 5: GUNZIP

In [13]:
# Unzip the file: bash command
!gunzip -f -k {filename} # unpacks the file, 
                         # -f forces command & overwrites files
                         # -k keeps both files .gz and unzipped one
# Information on gunzip: https://www.tutorialspoint.com/unix_commands/gunzip.htm

In [14]:
# alternative way to unzip the file via Python module gzip
# import gzip
# import shutil
# filename_unzip = filename.replace('.gz', '')
# with gzip.open(filename, 'rb') as f_in:
#     with open(filename_unzip, 'wb') as f_out:
#         shutil.copyfileobj(f_in, f_out)
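
The commented-out Python alternative can be wrapped in a small function that, like `gunzip -k`, keeps the compressed file. This is only a sketch; the name `gunzip_keep` is hypothetical. It strips the trailing `.gz` by slicing, since `replace('.gz', '')` would also remove a `.gz` occurring elsewhere in the name.

```python
import gzip
import shutil


def gunzip_keep(filename):
    """Decompress `filename` (ending in .gz) next to the original and keep both."""
    if not filename.endswith(".gz"):
        raise ValueError("expected a file name ending in .gz")
    filename_unzip = filename[:-3]  # strip the trailing ".gz" only
    with gzip.open(filename, "rb") as f_in, open(filename_unzip, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return filename_unzip
```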

---
## Final Notes on conda environment and package installation on the ARC webapp

On the ARC webapp, you should use the Python Kernel of this class, which is:

**OOD-GEOFALL**.

You can find out if this is the active Kernel of this notebook with the bash command:

In [1]:
! conda env list

# conda environments:
#
base                     /opt/anaconda3
geosf21               *  /opt/anaconda3/envs/geosf21



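The environment marked with an asterisk is the active one. If this is not the environment for the class (OOD-GEOFALL), switch the environment of the notebook by changing the Kernel in the drop-down menu at the top right. Executing the activation as a bash command in a magic cell, as shown below, is not a reliable alternative: the cell runs in its own subprocess, so it cannot change the Kernel of the notebook, and the command itself can fail, as the output here shows: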

In [65]:
%%bash
source activate OOD-GEOFALL
conda env list

# conda environments:
#
base                     /opt/anaconda3
geosf21               *  /opt/anaconda3/envs/geosf21



bash: line 1: activate: No such file or directory
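
The environment a notebook runs in can also be checked from Python itself, because `sys.prefix` points to the installation directory of the interpreter behind the active Kernel. A minimal sketch; reading the environment name from the last folder of that path is an assumption that holds for conda environments stored under `envs/`.

```python
import os
import sys

# Folder of the Python installation used by the running Kernel
print(sys.prefix)

# For a conda environment stored under .../envs/<name>, the last path
# component is the environment name (assumption: default conda layout)
env_name = os.path.basename(sys.prefix)
print(env_name)
```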


Once you are sure that you are in the right environment, you can check whether a needed package is installed in this environment, for example the netCDF4 package:

In [68]:
import netCDF4

If you get an error message such as "ModuleNotFoundError", the package is not installed. In that case, switch to the terminal, execute the following command, and confirm the prompt:

```bash
conda install -c conda-forge netcdf4
```

or

```bash
conda install -c anaconda netcdf4
```

If you execute the command here as a shell command (with %%bash or !), you might not see the output in real time or might not be able to confirm the prompt, and it might also run slower. I advise running it in the terminal.

The same procedure applies to any new package that is not installed in your environment, on any computer.
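
Whether a package is available can also be tested without triggering an error, using `importlib.util.find_spec()` from the standard library. A minimal sketch; the helper name `is_installed` is hypothetical, and the listed names are the import names used in this lesson.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the top-level `module_name` can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None


# Report the status of the packages used in this lesson
for name in ["netCDF4", "requests", "wget"]:
    status = "installed" if is_installed(name) else "missing"
    print(name, status)
```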