# Automatic Data Downloads
Satellite images and outputs from global earth systems models can be very large files. When we work with time series, large spatial areas, or multivariate model outputs, data volumes can quickly exceed the memory and storage capacity of a personal computer. To access these types of global data, we interface with online databases. Today's lesson is intended to give you the tools to access those databases programmatically, so that you can use your personal computer to convert large datasets into analysis-ready data for your research project. Specifically, today we'll learn to:

1. Interpret the directory structure of FTP and HTTP addresses.
2. Create a project directory on your local machine.
3. Configure a .gitignore file to ignore raw data.
4. Use the command line to download files from the internet.

If there's time, we'll break into groups based on research interest and start utilizing APIs to search datasets on public geospatial data repositories that match the location and time period of your study area.

In [1]:
import pandas as pd
from IPython.display import HTML
import os
import urllib.request

In [2]:
conda list

# packages in environment at C:\Users\SarShel\anaconda3\envs\geostats_env:
#
# Name                    Version                   Build  Channel
affine                    2.3.0              pyhd3eb1b0_0  
appdirs                   1.4.4                    pypi_0    pypi
argcomplete               1.12.3             pyhd3eb1b0_0  
argon2-cffi               20.1.0           py37h2bbff1b_1  
async_generator           1.10             py37h28b3542_0  
atomicwrites              1.4.0                    pypi_0    pypi
attrs                     21.2.0             pyhd3eb1b0_0  
aws-requests-auth         0.4.3                    pypi_0    pypi
backcall                  0.2.0              pyhd3eb1b0_0  
beautifulsoup4            4.10.0             pyh06a4308_0  
blas                      1.0                         mkl  
bleach                    4.0.0              pyhd3eb1b0_0  
boto3                     1.15.18                  pypi_0    pypi
botocore                  1.18.18                  p

## G is for *Generalizable* 
When we're making measurements of an earth system process, we often care deeply about how well our experimental results apply to other times and places. Since it is often too expensive or too difficult to collect in-situ samples of our earth systems process at all the times and locations that matter, environmental data science allows us to use statistical models to leverage globally available observations to improve the generalizability of our results. These models will generalize our inferences about our earth systems process in one of three ways:

1. *Prediction*: can our model allow us to generalize our observations to out-of-sample times and locations? For example: will my model linking air temperature to green-up time from my experimental forest accurately apply to a forest 200 miles away? 
2. *Interpolation*: can our model allow us to "fill in the gaps" in our spatial/temporal sampling scheme? For example: do my measurements of precipitation for my two precipitation gage locations accurately represent the total precipitation that fell in my watershed?
3. *Diagnosis*: can our model help us to interpret what processes are either drivers of or covariates with our earth systems process, allowing us to improve our physical understanding of trends and variability in that system? For example: is air temperature or precipitation a more important driver of current cropping system productivity, and how might this impact cropping system function under climate change?

### These global observations are often publicly available to researchers on online geodatabases.
For example:
 - NASA: https://earthdata.nasa.gov/
 - USGS: https://earthexplorer.usgs.gov/ 
 - NOAA: https://psl.noaa.gov/data/gridded/ 
 - Google: https://developers.google.com/earth-engine/datasets 
 - NY State: https://cugir.library.cornell.edu/ 


## R is for *Reproducible*
Since the raw data for our generalizable analysis is globally available, programmatically accessing our data gives us an important added benefit: we can design our version controlled, collaborative project repositories so they directly interface with these public geodatabases. That way, anyone who wants to can access the raw data required to reproduce our analytic workflow.

A reminder on why reproducible science is so important:

In [2]:
HTML('<iframe width="930" height="523" src="https://www.youtube.com/embed/NGFO0kdbZmk" frameborder="0" allowfullscreen></iframe>')



### Project Repository
Your project repository is where you store all of the elements of your data science workflow. At its core, it should have folders for raw data, processed data, code, outputs, and images. A good project repository is:

1. Human readable: use directory names that are easy to understand, and include a highly detailed README file that explains what's in each folder, how to sequence inputs and outputs to code files, and how to cite the repository.
2. Machine readable: avoid funky characters OR SPACES.
3. Supportive of sorting: if you have a list of input files, it’s nice to be able to sort them to quickly see what’s there and find what you need.
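
As a rough illustration of point 2, the check below flags names that contain spaces or characters outside a conservative set. The function name `is_machine_readable` and the allowed characters (letters, digits, underscore, hyphen, period) are assumptions made for this sketch, not a standard.

```python
import re

# Assumed "safe" characters for this sketch: letters, digits, underscore, hyphen, period.
SAFE_NAME = re.compile(r"[A-Za-z0-9._-]+")

def is_machine_readable(name):
    """Return True if a file or folder name uses only the assumed safe characters."""
    return SAFE_NAME.fullmatch(name) is not None

# The first two names pass; the last two contain a space or parentheses.
for name in ["data_raw", "data_analysisReady", "my raw data", "figures(final)"]:
    print(name, is_machine_readable(name))
```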

You should also take extra steps to preserve raw data so it’s not modified. More on this later. 

We're going to create a new repository for your class project. The os package (os stands for **O**perating **S**ystem) allows you to manipulate files on your computer. Ask it what it does:

In [3]:
?os

In [4]:
#For example, this command is the equivalent of pwd in terminal:
os.getcwd()

'C:\\Users\\ekcarter\\Data'

In [3]:
#this command is the equivalent of:
# mkdir H:/EnvDatSci/project
#exist_ok=True keeps the cell from failing if the folder already exists
os.makedirs('H:\\EnvDatSci\\project', exist_ok=True)

#this command is the equivalent of:
# cd H:/EnvDatSci/project
os.chdir('H:\\EnvDatSci\\project')

### TASK 1: enter a command in the below cell to check and make sure you're in your project directory:

In [4]:
#Task 1:
os.getcwd()

'H:\\EnvDatSci\\project'

### TASK 2: populate your project directory with appropriate files
Read Chapter 4.1 of the textbook: https://www.earthdatascience.org/courses/earth-analytics/document-your-science/file-organization-101/

Using os commands, populate your project directory with subfolders.

Print your directory to the screen (hint: see Task 1)

In [11]:
#Task 2:
os.mkdir("data_raw")
os.mkdir("data_analysisReady")
os.mkdir("code")
os.mkdir("figures")
os.listdir()

['code', 'data_analysisReady', 'data_raw', 'figures']
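
Learning goal 3 at the top of this lesson is to configure a .gitignore file so that git ignores raw data. One possible way to do that from Python is sketched below. The helper name `add_to_gitignore` is hypothetical, and a temporary folder stands in for the project directory so the sketch does not touch your real files. The pattern `data_raw/` matches the folder name created in Task 2.

```python
import os
import tempfile

def add_to_gitignore(project_dir, patterns):
    """Append any patterns not already listed to <project_dir>/.gitignore."""
    path = os.path.join(project_dir, ".gitignore")
    text = ""
    if os.path.exists(path):
        with open(path) as f:
            text = f.read()
    existing = [line.strip() for line in text.splitlines()]
    with open(path, "a") as f:
        # Start on a fresh line if the existing file lacks a trailing newline.
        if text and not text.endswith("\n"):
            f.write("\n")
        for pattern in patterns:
            if pattern not in existing:
                f.write(pattern + "\n")
    return path

# A temporary folder stands in for the project directory in this sketch.
with tempfile.TemporaryDirectory() as demo_project:
    gitignore_path = add_to_gitignore(demo_project, ["data_raw/"])
    with open(gitignore_path) as f:
        gitignore_text = f.read()

print(gitignore_text)
```

In a real project you would pass the path of your project directory instead of the temporary folder. Because the helper only appends missing patterns, running it twice should not duplicate lines.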

### TASK 3: change the current working directory to the folder where you intend to store raw data.

In [5]:
#Task 3:
os.chdir("./data_raw")

## Decoding the file structure of online geodatabases
Just like we can use code to find and access files on our local machine, we can use code to find and access files on public geodatabases. Since these geodatabases are version controlled, providing code that links to the online files helps prevent us from making redundant copies of data on the internet. Programmatically accessing public geodatabases requires that we understand how the database itself has been organized. 

 - Click on the following link to the National Oceanic and Atmospheric Administration database website: https://psl.noaa.gov/data/gridded/ 

 - Navigate to the "NCEP/NCAR Reanalysis dataset"
 - Of the seven sections they've divided data into, click on "Surface" 
 - Under "Air Temperature: Daily", click "See list"
 - Under "Surface", click "See list"

### TASK 4: Right click on the first link in the list, and select "copy link". Paste that link address below:
https://downloads.psl.noaa.gov/Datasets/ncep.reanalysis.dailyavgs/surface/air.sig995.1948.nc
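
To see how an address like this maps onto a directory structure, one option is the standard library's `urllib.parse.urlparse`, which separates the protocol, the server, and the path. The sketch below uses the link pasted above; the variable names are only illustrative.

```python
from urllib.parse import urlparse

url = "https://downloads.psl.noaa.gov/Datasets/ncep.reanalysis.dailyavgs/surface/air.sig995.1948.nc"
parts = urlparse(url)

print(parts.scheme)   # the protocol used to reach the server
print(parts.netloc)   # the server that hosts the files

# Split the path on "/" to get the folders, with the file name as the last item.
folders = parts.path.strip("/").split("/")
print(folders[:-1])   # nested directories, outermost first
print(folders[-1])    # the file itself
```

Read left to right, the folders go from the dataset collection down to the level type, much like nested folders on your own machine.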


### Tasking your computer to download files
Our goal is to write a script that can download files, extract a relevant subset of information from the files, and then delete the files. The first part of this task is to learn the filenames that we want to download. 

In the link above, we can break the filepath down into substrings, using basic text commands:

In [6]:
http_dir = "https://downloads.psl.noaa.gov/Datasets/"
dataset = "ncep.reanalysis.dailyavgs"
lev_type = "surface"
variable = "air.sig995."
time = "2010"
file_type = ".nc"
filepaths= http_dir + dataset + "/" + lev_type + "/" + variable + time + file_type
print(filepaths)

https://downloads.psl.noaa.gov/Datasets/ncep.reanalysis.dailyavgs/surface/air.sig995.2010.nc


What happens if you click on that link? You can also have Python download the file for you using the `urllib.request.urlretrieve` function:

In [7]:
#what does this function do and how do we use it?
?urllib.request.urlretrieve

In [21]:
url = filepaths
filename = variable + time + file_type
urllib.request.urlretrieve(url, filename)
print(url, filename)

https://downloads.psl.noaa.gov/Datasets/ncep.reanalysis.dailyavgs/surface/air.sig995.2010.nc air.sig995.2010.nc


In [22]:
#what happens?
os.listdir()

['air.sig995.2010.nc']
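
Because these files are large, it can be worth avoiding a second download of a file you already have. The sketch below wraps `urllib.request.urlretrieve` in a hypothetical helper, `download_if_missing`, that skips the download when a non-empty file with that name already exists. It does not verify that an existing file is complete, so treat it as a starting point only.

```python
import os
import urllib.request

def download_if_missing(url, filename):
    """Download url to filename unless a non-empty file of that name already exists.

    Returns True if a download happened, False if it was skipped.
    """
    if os.path.exists(filename) and os.path.getsize(filename) > 0:
        return False
    urllib.request.urlretrieve(url, filename)
    return True
```

Calling `download_if_missing(url, filename)` with the `url` and `filename` defined in the cell above would skip the download, since the file is already in the working directory.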

We can infer patterns from the database itself and generate the names of multiple files. For example, if we need five years of daily air temperature data:

In [23]:
time = pd.Series(range(1965, 1970)).astype(str)  # years 1965-1969 as strings
filepaths = http_dir + dataset + "/" + lev_type + "/" + variable + time + file_type
print(filepaths)

0    https://downloads.psl.noaa.gov/Datasets/ncep.r...
1    https://downloads.psl.noaa.gov/Datasets/ncep.r...
2    https://downloads.psl.noaa.gov/Datasets/ncep.r...
3    https://downloads.psl.noaa.gov/Datasets/ncep.r...
4    https://downloads.psl.noaa.gov/Datasets/ncep.r...
dtype: object
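
pandas shortens long strings when it prints a Series, so the output above hides the part of each address that actually changes (the year). One way to inspect the full addresses is to build them as a plain Python list and print them one at a time. This is a pandas-free sketch of the same pattern; the helper name `build_url` is hypothetical, and its default arguments copy the values defined earlier.

```python
HTTP_DIR = "https://downloads.psl.noaa.gov/Datasets/"

def build_url(year, dataset="ncep.reanalysis.dailyavgs", lev_type="surface",
              variable="air.sig995.", file_type=".nc"):
    """Assemble the download address for one year, following the pattern used above."""
    return f"{HTTP_DIR}{dataset}/{lev_type}/{variable}{year}{file_type}"

# range(1965, 1970) covers 1965 through 1969, matching the Series above.
urls = [build_url(year) for year in range(1965, 1970)]
for url in urls:
    print(url)
```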


### TASK 5: Write a "for" loop that downloads all five years' worth of air temperature data into your working directory. Print the contents of your directory to the screen.

In [30]:
#Task 5
for i in range(len(filepaths)):
    filename = variable + time[i] + file_type
    url= filepaths[i]
    urllib.request.urlretrieve(url, filename)
os.listdir()

['air.sig995.1965.nc',
 'air.sig995.1966.nc',
 'air.sig995.1967.nc',
 'air.sig995.1968.nc',
 'air.sig995.1969.nc']
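
The goal stated earlier was a script that downloads files, extracts a relevant subset, and then deletes the files. A minimal sketch of that loop is below. Both function names are hypothetical, and `summarize` is only a placeholder that reports file size; a real workflow would replace it with code that opens the file and pulls out the values you need.

```python
import os
import urllib.request

def summarize(filename):
    """Placeholder for the real extraction step: here it just reports file size in bytes."""
    return os.path.getsize(filename)

def process_and_discard(urls, extract=summarize):
    """Download each URL, keep only what extract() returns, then delete the file."""
    results = {}
    for url in urls:
        filename = url.rsplit("/", 1)[-1]  # the text after the last "/" is the file name
        urllib.request.urlretrieve(url, filename)
        try:
            results[filename] = extract(filename)
        finally:
            os.remove(filename)  # free disk space even if extraction fails
    return results
```

Passing the list of addresses from Task 5 to `process_and_discard` would leave the working directory empty again, with only the extracted values kept in memory. Deleting inside `finally` is a design choice so that a failed extraction does not leave large files behind.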