In [1]:
import os
import shutil
import time
import zipfile

## Introduction

In researching price data for this project, it became clear that the time frame we could analyze was 2000-2018, the period for which continuous data are available. We are interested in historical weather variables, which we get from the [WorldClim](https://www.worldclim.org/data/monthlywth.html) website. These come as `.tif` files, and they are *big*: in total, they are around 100 GB after being unpacked. I cannot host this amount of data on GitHub, so to follow along, please run this Notebook before any of the others to make sure you have all the geospatial data for the project.

## Downloading Data

Below, I wrote a method using Python and `wget` that downloads zip files from WorldClim, unzips them, and deletes any `.tif` files outside of the 2000-2018 time frame. The zip files come in two increments per variable, e.g. we have to download precipitation data for 2000-2009 and then again for 2010-2018. This is a decent amount of data: 19 years x 12 months = 228 `.tif` files for each weather variable we are interested in. The entire process takes ~45 minutes, depending on your internet connection (see the printout at the bottom of the method below). I have tried to write the method such that if you have already downloaded the zip files, you should not have to download them again (although this presupposes that the downloads were not interrupted, i.e., the files downloaded completely and successfully). In any case, the method checks each directory at the end for the correct number of files, so you will know if you have the complete zip files.
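
Because the skip-if-present logic assumes that earlier downloads finished cleanly, it can help to test a zip file before trusting it. The following is a minimal sketch using only the standard library; `zip_is_complete` is a hypothetical helper and is not part of the method below. Note that `testzip()` reads every member of the archive, so on files of this size it may take a while.

```python
import os
import zipfile


def zip_is_complete(path):
    """
    Hypothetical helper: returns True if `path` exists, is a zip archive,
    and every member passes its CRC check.
    """
    #a truncated download usually lacks the end-of-archive record, so is_zipfile fails
    if not os.path.isfile(path) or not zipfile.is_zipfile(path):
        return False
    try:
        with zipfile.ZipFile(path, 'r') as zip_ref:
            #testzip() returns the name of the first corrupt member, or None if all are fine
            return zip_ref.testzip() is None
    except zipfile.BadZipFile:
        return False
```

If this returns `False` for one of the zip files, deleting that file and running the download again is probably the simplest fix.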

There is a small amount of data cleanup at the end to make sure the directories have the right names and to delete unnecessary files.

In [2]:
#so far we are interested in minimum and maximum temperature, as well as precipitation
tmin_url1 = 'biogeo.ucdavis.edu/data/worldclim/v2.1/hist/wc2.1_2.5m_tmin_2000-2009.zip'
tmin_url2 = 'biogeo.ucdavis.edu/data/worldclim/v2.1/hist/wc2.1_2.5m_tmin_2010-2018.zip'
tmax_url1 = 'biogeo.ucdavis.edu/data/worldclim/v2.1/hist/wc2.1_2.5m_tmax_2000-2009.zip'
tmax_url2 = 'biogeo.ucdavis.edu/data/worldclim/v2.1/hist/wc2.1_2.5m_tmax_2010-2018.zip'
precip_url1 = 'biogeo.ucdavis.edu/data/worldclim/v2.1/hist/wc2.1_2.5m_prec_2000-2009.zip'
precip_url2 = 'biogeo.ucdavis.edu/data/worldclim/v2.1/hist/wc2.1_2.5m_prec_2010-2018.zip'
urls = [tmin_url1, tmin_url2, tmax_url1, tmax_url2, precip_url1, precip_url2]

#these are the files you should end up with after download
tmin_zip1 = 'wc2.1_2.5m_tmin_2000-2009.zip'
tmin_zip2 = 'wc2.1_2.5m_tmin_2010-2018.zip'
tmax_zip1 = 'wc2.1_2.5m_tmax_2000-2009.zip'
tmax_zip2 = 'wc2.1_2.5m_tmax_2010-2018.zip'
precip_zip1 = 'wc2.1_2.5m_prec_2000-2009.zip'
precip_zip2 = 'wc2.1_2.5m_prec_2010-2018.zip'
zip_files = [tmin_zip1, tmin_zip2, tmax_zip1, tmax_zip2, precip_zip1, precip_zip2]
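
The two lists above follow a regular naming pattern, so they could also be generated rather than typed out. This is only a sketch of that idea: `base_url`, `variables`, `periods`, `zip_files_alt`, and `urls_alt` are names introduced here for illustration, and the rest of the Notebook keeps using `urls` and `zip_files` as defined above.

```python
import os

#same host and path as the URLs above
base_url = 'biogeo.ucdavis.edu/data/worldclim/v2.1/hist/'
variables = ['tmin', 'tmax', 'prec']
periods = ['2000-2009', '2010-2018']

#one zip file per variable and period, in the same order as `zip_files`
zip_files_alt = ['wc2.1_2.5m_{}_{}.zip'.format(v, p) for v in variables for p in periods]
urls_alt = [base_url + name for name in zip_files_alt]

#the zip filename is always the last component of its URL
for url, name in zip(urls_alt, zip_files_alt):
    assert os.path.basename(url) == name
```

Adding another weather variable would then only mean adding its short name to `variables`, assuming WorldClim names its zip file the same way.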

In [3]:
os.chdir('../data/raw/climate')
os.getcwd()

'/home/jakidxav/ciat_bean_price_feature_importances/data/raw/climate'

Make sure that you are in the right directory! This Notebook lives in the `notebooks` directory, but after running the cell above you should be in the `data/raw/climate` directory.
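
Instead of checking the printed path by eye, the working directory can be checked in code before anything is downloaded. This is a small sketch under the assumption that the project keeps the `data/raw/climate` layout shown above; `in_expected_directory` is a hypothetical helper.

```python
import os


def in_expected_directory(cwd=None, expected_tail=('data', 'raw', 'climate')):
    """
    Hypothetical helper: returns True if the last components of `cwd`
    (the current working directory by default) match `expected_tail`.
    """
    cwd = os.getcwd() if cwd is None else cwd
    #normpath removes any trailing separator before we split the path into components
    parts = os.path.normpath(cwd).split(os.sep)
    return tuple(parts[-len(expected_tail):]) == tuple(expected_tail)
```

Calling `assert in_expected_directory()` right after the `os.chdir` cell would stop the Notebook early if it was started from somewhere else.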

In [4]:
def merge_files_clean_dir(src, dest, new_dir):
    """
    This method copies all files in one directory to another, renames the new directory, 
    and deletes the source directory.
    
    Arguments:
      src: original directory containing the files you want to copy over
      dest: where you want to copy the files
      new_dir: what you want to rename `dest`
    """
    for file in os.listdir(src):
        #os.path.join works whether or not `src` ends with a path separator
        shutil.copy(os.path.join(src, file), dest)
    os.rename(dest, new_dir)
    shutil.rmtree(src)

In [5]:
def remove_file_if_exists(path_to_file):
    """
    This method deletes a file if it exists, and does nothing otherwise.
    
    Arguments:
      path_to_file: the path to the file you want to delete, including the filename
    """
    if os.path.isfile(path_to_file):
        os.remove(path_to_file)

In [6]:
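def download_worldclim():
    """
    This method downloads the WorldClim zip files listed in `urls` (skipping any that are 
    already present), unpacks them, merges the two time periods for each variable into a 
    single 2000-2018 directory, and checks that each directory ends up with 228 files.
    """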
    start_time = time.time()
    num_files = 228

    #download zip files with progress bar
    for url in urls:
        #the zip filename is the last component of the URL
        zip_name = os.path.basename(url)
        if not os.path.isfile(zip_name):
            !wget -q {url} --show-progress
        else:
            print('{} already downloaded.'.format(zip_name))

    #make sure all files exist
    for file in zip_files:
        assert os.path.isfile(file), '{} was not found; the download may have failed.'.format(file)
    print()
    
    #save the zip file to a directory with the same name (without the .zip extension)
    for file in zip_files:
        #only make directory if it doesn't exist already
        directory = file[:-4]
        os.makedirs(directory, exist_ok=True)
        print('Unpacking {}'.format(directory))

        #unzip files without producing output
        with zipfile.ZipFile(file, 'r') as zip_ref:
            zip_ref.extractall(directory)
        print('     Done!')
    
    print()
    print('Copying files over, renaming directories, deleting extra files...')
    merge_files_clean_dir('./wc2.1_2.5m_tmin_2000-2009/', './wc2.1_2.5m_tmin_2010-2018/', 'wc2.1_2.5m_tmin_2000-2018')
    print('Directory ./wc2.1_2.5m_tmin_2000-2018/ created...')
    
    merge_files_clean_dir('./wc2.1_2.5m_tmax_2000-2009/', './wc2.1_2.5m_tmax_2010-2018/', 'wc2.1_2.5m_tmax_2000-2018')
    print('Directory ./wc2.1_2.5m_tmax_2000-2018/ created...')
    
    merge_files_clean_dir('./wc2.1_2.5m_prec_2000-2009/', './wc2.1_2.5m_prec_2010-2018/', 'wc2.1_2.5m_prec_2000-2018')
    print('Directory ./wc2.1_2.5m_prec_2000-2018/ created...')
    
    #remove extra file for 2019 data
    remove_file_if_exists('./wc2.1_2.5m_tmin_2000-2018/wc2.1_2.5m_tmin_2019-12.tif')
    remove_file_if_exists('./wc2.1_2.5m_tmax_2000-2018/wc2.1_2.5m_tmax_2019-12.tif')
    remove_file_if_exists('./wc2.1_2.5m_prec_2000-2018/wc2.1_2.5m_prec_2019-12.tif')
    
    #make sure each directory has exactly 228 files
    tmin_filenames = sorted(os.listdir('wc2.1_2.5m_tmin_2000-2018'))
    tmax_filenames = sorted(os.listdir('wc2.1_2.5m_tmax_2000-2018'))
    precip_filenames = sorted(os.listdir('wc2.1_2.5m_prec_2000-2018'))

    assert(len(tmin_filenames) == num_files)
    assert(len(tmax_filenames) == num_files)
    assert(len(precip_filenames) == num_files)
    
    #remove all zip files
    for file in os.listdir():
        if file.endswith('.zip'):
            os.remove(file)
            
    remove_file_if_exists('wget-log')
        
    print('     Done!')

    #report time in minutes
    end_time = time.time()
    total_time = round((end_time - start_time) / 60.0, 2) 
    print()
    print('Total time to download: {} minutes'.format(total_time))

In [7]:
download_worldclim()

wc2.1_2.5m_tmin_2000-2009.zip already downloaded.
wc2.1_2.5m_tmin_2010-2018.zip already downloaded.
wc2.1_2.5m_tmax_2000-2009.zip already downloaded.
wc2.1_2.5m_tmax_2010-2018.zip already downloaded.
wc2.1_2.5m_prec_2000-2009.zip already downloaded.
wc2.1_2.5m_prec_2010-2018.zip already downloaded.

Unpacking wc2.1_2.5m_tmin_2000-2009
     Done!
Unpacking wc2.1_2.5m_tmin_2010-2018
     Done!
Unpacking wc2.1_2.5m_tmax_2000-2009
     Done!
Unpacking wc2.1_2.5m_tmax_2010-2018
     Done!
Unpacking wc2.1_2.5m_prec_2000-2009
     Done!
Unpacking wc2.1_2.5m_prec_2010-2018
     Done!

Copying files over, renaming directories, deleting extra files...
Directory ./wc2.1_2.5m_tmin_2000-2018/ created...
Directory ./wc2.1_2.5m_tmax_2000-2018/ created...
Directory ./wc2.1_2.5m_prec_2000-2018/ created...
     Done!

Total time to download: 12.22 minutes
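
The method above only checks that each directory holds 228 files. A stricter check would confirm that every month from 2000-01 to 2018-12 is present by name. The sketch below assumes the `.tif` files follow the pattern `wc2.1_2.5m_<variable>_<year>-<month>.tif`, which is inferred from the 2019-12 filenames removed in the method; `expected_tif_names` and `missing_tifs` are hypothetical helpers.

```python
import os


def expected_tif_names(variable, first_year=2000, last_year=2018):
    """
    Hypothetical helper: lists the monthly filenames we expect for one variable,
    assuming the pattern wc2.1_2.5m_<variable>_<year>-<month>.tif.
    """
    return ['wc2.1_2.5m_{}_{}-{:02d}.tif'.format(variable, year, month)
            for year in range(first_year, last_year + 1)
            for month in range(1, 13)]


def missing_tifs(directory, variable):
    """
    Hypothetical helper: returns the expected filenames that are not in `directory`.
    """
    present = set(os.listdir(directory))
    return [name for name in expected_tif_names(variable) if name not in present]
```

For example, `missing_tifs('wc2.1_2.5m_tmin_2000-2018', 'tmin')` should return an empty list if the download and cleanup worked and the naming assumption holds.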
