Wiki Reorganisation
This page has been classified for reorganisation. It has been given the category MOVE.
The content of this page will be revised and moved to one or more other pages in the new wiki structure.

Wget Script

How can I download files?

See the wget script FAQ.

The ESGF P2P Index Node (portal) offers the possibility to download files via a so-called wget script.

The wget script is a bash script encapsulating calls to wget while providing a rich set of extra features.

Pre-Requisites

  • Use Firefox 7+ or Google Chrome 16+ to access any ESGF web portal. For example, Safari has a known bug that prevents users from logging in to many of the ESG sites, and sites will not work properly with Internet Explorer.

  • Trust VeriSign certificates. This is normally the default configuration on most systems, but it might not be in your case; if so, ask your system administrator to add this certificate. You may test that everything is fine by issuing:

$ wget -nv -O /dev/null https://rainbow.llnl.gov
2013-02-13 10:50:30 URL:https://rainbow.llnl.gov/ [31562] -> "/dev/null" [1]

If something goes wrong you would see something like this:

$ wget -nv -O /dev/null https://rainbow.llnl.gov
ERROR: cannot verify rainbow.llnl.gov's certificate, issued by `/C=US/O=VeriSign, Inc./OU=VeriSign Trust Network/OU=Terms of use at https://www.verisign.com/rpa (c)10/CN=VeriSign Class 3 Secure Server CA - G3':
  Self-signed certificate encountered.
To connect to rainbow.llnl.gov insecurely, use `--no-check-certificate'.
Unable to establish SSL connection.

You may also try this workaround, though be warned that you'll be turning off security on your side (i.e. there is no guarantee the data comes from where it is supposed to come from).

  • Register at PCMDI. At this time, we recommend that you run the wget scripts with an OpenID issued by the PCMDI Gateway. If you haven't done so already, access the PCMDI Gateway registration page and create an account. You will be issued an OpenID of the form https://pcmdi3.llnl.gov/esgcet/myopenid/your_username_here, which you can use to log in at any site in the ESG federation.
* If you want to experiment with a new P2P OpenID, you can register at any of the new ESGF P2P sites, for example the [PCMDI p2p Node](http://pcmdi9.llnl.gov/) or the [NASA/JPL p2p Node](http://esg-datanode.jpl.nasa.gov/)
  • Join "CMIP5 Research" . In order to download any dataset, either from the web or by running a wget script, you must first be enrolled in the group of users that are allowed to do so. To request authorization, use any of the web portals (for example, the newest PCMDI Data Node and try to download one single file by placing the dataset in your data cart and following any of the HTTP links. If you are interested in CMIP5 model data (project=CMIP5) or supporting observations (project=obs4MIPs), you need to enroll in the CMIP5 Research group (or CMIP5 Commercial if you intend to use the data for commercial purposes). Enrollment is instantaneous and you can immediately download the requested file as a test.

Features

| Feature | Description/Notes | Since |
| --- | --- | --- |
| Security Management | CA and user certificate retrieval is done automatically; you only need to know your OpenID and password. Retrieval is not triggered if the certificate is still valid (use -f to force retrieval of CAs etc.). | v1.0 |
| Automated Certificate Renewal | Active if, and only if, the script had to retrieve the certificate when starting (so to force activation use -f). | v1.2 |
| File Download | Retrieves files serially until all files have been downloaded, or have been tried at least once if they failed. | v1.0 |
| Download Continuation | Aborted or failed downloads continue from where they stopped. | v1.0 |
| File Verification | If the data publisher provided file checksums, these are used to validate the downloads, thus assuring the transfer completed successfully. | v1.0 |
| Download Caching | Once downloaded and verified, files are not rechecked. The status is kept in a file named .<script_name>.status in the same directory as the script. | v1.0 |
| File List Overflow Check | The system imposes limits on how many files can be downloaded at once. The script checks and reports if the number of files retrieved does not match the total number of files reported by the query. | v1.2 |
| Download Directory Generation | The directory structure into which files are downloaded can be defined. See Defining a Download Directory Structure. | v1.2 |
| Newer Version / File Modification Check | If the published files have checksums, updating the script (-u) and re-running it reports changed files, which normally means a new version is available. | v1.2 |
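As a quick illustration of the security management and caching features, a session might look like this (a sketch; the script name is a placeholder, and deleting the status file to force re-verification is an assumption rather than documented behavior):

$ ./my_wget_script -f            # force retrieval of the CAs and your certificate
$ ./my_wget_script               # rerun: already verified files are skipped
$ cat .my_wget_script.status     # inspect the per-script download status cache
$ rm .my_wget_script.status      # presumably forces all files to be re-verified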

Running the Script

Running the script is simple. You can start it either by setting the executable bit and running it directly:

chmod +x my_wget_script
./my_wget_script

or by telling bash to interpret it directly:

bash my_wget_script

If you try to run it with another shell (like sh), the script will try to notice this and restart itself in the proper shell. This might, however, not work for all shells.

Displaying the script help

To display a short help on the flags available use the -h flag:

$ ./my_wget_script -h
Usage: wget.sh [flags]
Flags is one of:
        c <cert> : use this certificate for authentication.
        p        : preserve data that failed checksum
        f        : force certificate retrieval (defaults to only once per day)
        F <file> : read input from file instead of the embedded one (use - to read from stdin)
        o <openid>: Provide OpenID instead of interactively asking for it.
        w <file> : Write embedded files into a file and exit
        i        : set insecure mode, i.e. don't check server certificate
        u        : Issue the search again and see if something has changed.
        U        : Update files from server overwritting local ones (detect with -u)
        d        : display debug information
        s        : completely skip security. It will only work if the accessd data is not secured at all.
        v        : be more verbose
        q        : be less verbose
        h        : displays this help

This command stores the states of the downloads in .wget.sh.status
For more information check the website: http://esgf.org/wiki/ESGF_wget

Most flags are self-describing; here are the more _uncommon_ ones:

  • -p : Even if the checksum fails, don't try to re-download the file. If not set, the file will be re-downloaded until it succeeds. Since v1.2, if the file always returns the same checksum and it doesn't match the published one (this happens sometimes), the file is re-downloaded just a couple of times before the procedure is aborted for this file and reported accordingly.
  • -f : Don't reuse the certificate that might be found; force its retrieval.
  • -u : Re-trigger the search and see if there were any changes regarding the wget script (either the files contained in it or the script itself). If a modification is detected, the wget script is updated and the previous version is stored at my_wget_script.old.# where # is just a running index. This is used to check for new versions, as running the same script on the same files again will verify their checksums against those from the server.
  • -s : This doesn't mean you can get to the data without any security, but that no security is triggered, so you will be trying to get the files anonymously. This only works for datasets that are not protected at all. Getting data anonymously also implies the server has no record of who downloaded what, so you won't be able to be notified if data is changed or recalled.
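A few example invocations combining these flags (a sketch; the script name and OpenID are placeholders):

$ ./my_wget_script -o https://pcmdi3.llnl.gov/esgcet/myopenid/your_username_here   # supply the OpenID non-interactively
$ ./my_wget_script -f    # force fresh retrieval of the certificate
$ ./my_wget_script -u    # re-issue the search and report changes on the server
$ ./my_wget_script -U    # update local files that -u detected as changed
$ ./my_wget_script -s    # anonymous download, only for completely unprotected data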

[API] Defining a Download Directory Structure

Previous versions of the script downloaded all files directly to the current directory. Since v1.2 it is also possible to define the creation of a complex directory structure to simplify data management. These are the new flags understood by the wget API that direct this behavior:

download_structure=facet1,facet2

E.g. http://esgf-data.dkrz.de/esg-search/wget?distrib=false&limit=3&download_structure=model,variable

If missing: files are downloaded to the current directory ("./").

The value is the list of facets used for generating the directory structure of the downloaded files. Files will be downloaded to <facet1_value>/<facet2_value>/file1, where <facet*_value> is the value of that facet for that particular file.

_NOTE:_ The system doesn't version files but datasets, so there's no facet for this. There is a facet called version, but it is not used at this time.
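For example, generating and running a script organized by model and variable might look like this (a sketch based on the URL above; the model, variable, and file names shown are illustrative):

$ wget -O my_wget_script "http://esgf-data.dkrz.de/esg-search/wget?distrib=false&limit=3&download_structure=model,variable"
$ bash my_wget_script
$ find . -name '*.nc'    # files land in a <model>/<variable>/ tree, e.g.:
./MPI-ESM-LR/tas/tas_Amon_MPI-ESM-LR_historical_r1i1p1_185001-200512.nc
./MPI-ESM-LR/pr/pr_Amon_MPI-ESM-LR_historical_r1i1p1_185001-200512.nc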

download_emptypath=<some_string>

E.g. http://esg-datanode.jpl.nasa.gov/esg-search/wget?distrib=false&limit=3&download_structure=model,instrument&download_emptypath=unknown

If missing: set to "", which will cause missing directories to collapse (e.g. dir1/dir3/... if the dir2 facet is not available for the file).

The value is used if a facet is missing. Setting it to something other than "" will use that string and prevent the collapse (i.e. if set to "unknown", the previous example would result in dir1/unknown/dir3, so that dir3 always remains at the 3rd position).

_Note:_ Your system imposes some limitations on the characters allowed in directory names. We replace many of them and ensure the length of the directory names is bounded, but your system might still not accept all generated names. In such a case, please refrain from using that particular facet on your particular system.
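To illustrate the difference (a sketch; the facet values and file names are illustrative):

$ wget -O obs_wget_script "http://esg-datanode.jpl.nasa.gov/esg-search/wget?distrib=false&limit=3&download_structure=model,instrument&download_emptypath=unknown"
$ bash obs_wget_script
# Without download_emptypath, a file lacking the instrument facet collapses to:
#   ObsModelX/fileA.nc
# With download_emptypath=unknown, the level is preserved instead:
#   ObsModelX/unknown/fileA.nc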

Change History

v1.3.1

added -n flag (dry-run)

The flag inhibits the call to wget and just displays what the script would download, or has already downloaded and verified.
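For example (the script name is a placeholder):

$ ./my_wget_script -n    # list what would be downloaded, without invoking wget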

v1.3

added -F flag to read from file

The script can now bypass its own embedded contents and be used as a wrapper for a list of files passed to it via this parameter (if "-F -" is used, the script reads from STDIN). The data should be in the standard form for the wget script, one quoted field per column:
  • '<filename>' '<source_url>' '<checksum_type>' '<checksum_value>'

The last two fields are optional (the script can only check 'md5' anyway), but the quoting is mandatory.

modified -w

 now -w writes in the exact same format as it would be read by the -F parameter. 
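Together, -w and -F let one script act as a download engine for an externally managed file list, for instance (a sketch; the list file name and the grep filter are illustrative):

$ ./my_wget_script -w filelist.txt                 # dump the embedded list in -F format
$ ./my_wget_script -F filelist.txt                 # feed the (possibly edited) list back in
$ grep tas filelist.txt | ./my_wget_script -F -    # read a filtered list from stdin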

v1.2

added wget version check

We assume everyone uses GNU wget, but there are too many versions out there, and Macs tend to have very old ones. If the required version is not present (i.e. required flags are missing), the script stops with a proper message.

added file list overflow check and report message to user (if files in the script are less than the ones returned by the query)

This provides the means for the user to realize that the returned file set is not complete.

added directory generation for files (download_structure and download_emptypath wget API modifications)

This allows the user to download multiple files in a more comfortable way by generating a destination directory structure out of facet values.

renew certificate automatically if it was retrieved by the script the first time

If the certificate was gathered by other means, this isn't triggered. We need the password, and we must be sure the machine can retrieve the certificate.

caveat: The trigger happens after a failed download, so the failed file will be skipped (fast implementation). It is advisable to rerun the script anyway to check that everything was downloaded properly; that is its main use.

added wrong published checksum check

The script always compares the checksums if they are provided; if a checksum is wrong, the file is removed and retried. This is normally what you want, _unless_ the published checksum is wrong. For those cases the script tries a couple of times, and if it always retrieves the same file with a checksum different from the published one, it stops trying, leaves the file as is, and reports this back.

caveat: After starting the script again, the file will be verified, and once the script realizes its checksum doesn't match that of the server, it will be deleted and re-downloaded (a couple of times as before) before aborting.

added download file modification check (report if remote file was modified, i.e. checksum changed)

Files already downloaded that differ from those on the server are reported, so new versions can be spotted (if the wget script is regenerated and therefore contains new checksums for the same files).

added "update" option (-u) to compare the wget script with the latest version of itself and redownload if different (preserving old one, that is)

Related to the previous case, the wget script auto-updates itself if it detects there have been changes. Restarting it will gather newer files and report files that were changed (it won't report deleted files, but that could be determined locally by comparing the resulting list with the directory structure).