fetch-data

Fetch data files from a URL, but only if needed. Verify contents via SHA256.

Fetch-Data checks a local data directory and then downloads needed files. It always verifies the local files and downloaded files via a hash.

Fetch-Data makes it easy to download large and small sample files. For example, here we download a genomics file from GitHub (if it has not already been downloaded). We then print the size of the now local file.

use fetch_data::sample_file;

let path = sample_file("small.fam")?;
println!("{}", std::fs::metadata(path)?.len()); // Prints 85

# use fetch_data::FetchDataError; // '#' needed for doctest
# Ok::<(), FetchDataError>(())

Features

  • Thread-safe -- allowing it to be used with Rust's multithreaded testing framework.
  • Inspired by Python's popular Pooch and our PySnpTools filecache module.
  • Avoids async runtimes such as Tokio (by using ureq to download files via blocking I/O).

Suggested Usage

You can set up FetchData in many ways. Here are the steps -- followed by sample code -- for one setup.

  • Create a registry.txt file containing a whitespace-delimited list of files and their hashes. (This is the same format as Pooch. See the Registry Creation section for tips on creating this file.)

  • As shown below, create a global static FetchData instance that reads your registry.txt file. Give it:

    • the URL root from which to download the files
    • the name of an environment variable that gives the local data directory in which to store the files
    • a qualifier, organization, and application -- used to create a local data directory when the environment variable is not set. See ProjectDirs in the directories crate for details.
  • As shown below, define a public sample_file function that takes a file name and returns a Result containing the path to the downloaded file.

use fetch_data::{ctor, FetchData, FetchDataError};
use std::path::{Path, PathBuf};

#[ctor]
static STATIC_FETCH_DATA: FetchData = FetchData::new(
    include_str!("../registry.txt"),
    "https://raw.githubusercontent.com/CarlKCarlK/fetch-data/main/tests/data/",
    "BAR_APP_DATA_DIR", // env_key
    "com",              // qualifier
    "Foo Corp",         // organization
    "Bar App",          // application
);

/// Download a data file.
pub fn sample_file<P: AsRef<Path>>(path: P) -> Result<PathBuf, FetchDataError> {
    STATIC_FETCH_DATA.fetch_file(path)
}

You can now use your sample_file function to download your files as needed.
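
The choice of local data directory described in the steps above (environment variable first, otherwise a directory derived from the qualifier, organization, and application) can be sketched with the standard library alone. This is an illustration under stated assumptions, not Fetch-Data's own code: `choose_data_dir` and `resolve_data_dir` are hypothetical helpers, and the fallback path here is a stand-in for whatever directory the application identity would produce.

```rust
use std::env;
use std::ffi::OsString;
use std::path::PathBuf;

/// Hypothetical helper (not part of fetch-data): prefer the value of the
/// environment variable when it is present, otherwise use `default_dir`.
fn choose_data_dir(env_value: Option<OsString>, default_dir: PathBuf) -> PathBuf {
    match env_value {
        Some(value) => PathBuf::from(value),
        None => default_dir,
    }
}

/// Hypothetical wrapper that reads the variable named by `env_key`.
fn resolve_data_dir(env_key: &str, default_dir: PathBuf) -> PathBuf {
    choose_data_dir(env::var_os(env_key), default_dir)
}

fn main() {
    // Stand-in for the directory that would be derived from the
    // qualifier, organization, and application.
    let fallback = env::temp_dir().join("bar-app-data");
    let dir = resolve_data_dir("BAR_APP_DATA_DIR", fallback);
    println!("data directory: {}", dir.display());
}
```

Setting BAR_APP_DATA_DIR before running changes the printed directory; leaving it unset prints the fallback.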

Registry Creation

You can create your registry.txt file in many ways. Here are the steps -- followed by sample code -- for one way to create it.

  • Upload your data files to the Internet.
    • For example, Fetch-Data puts its sample data files in tests/data, so they upload to this GitHub folder. In GitHub, by looking at the raw view of a data file, we see the root URL for these files. In Cargo.toml, we keep these data files out of our crate via exclude = ["tests/data/*"]
  • As shown below, write code that
    • Creates a FetchData instance without registry contents.
    • Lists the files in your data directory.
    • Calls the gen_registry_contents method on your list of files. This method will download the files, compute their hashes, and create a string of file names and hashes.
  • Print this string, then manually paste it into a file called registry.txt.

use fetch_data::{FetchData, dir_to_file_list};

let fetch_data = FetchData::new(
    "", // registry_contents ignored
    "https://raw.githubusercontent.com/CarlKCarlK/fetch-data/main/tests/data/",
    "BAR_APP_DATA_DIR", // env_key
    "com",              // qualifier
    "Foo Corp",         // organization
    "Bar App",          // application
);
let file_list = dir_to_file_list("tests/data")?;
let registry_contents = fetch_data.gen_registry_contents(file_list)?;
println!("{registry_contents}");

# use fetch_data::FetchDataError; // '#' needed for doctest
# Ok::<(), FetchDataError>(())
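
To make the registry format concrete, here is a minimal sketch of reading such text back into a map. It assumes each line holds a file name followed by its hash, separated by whitespace, as in Pooch. `parse_registry` is a hypothetical function written for this illustration, not Fetch-Data's parser, and the hashes are short placeholders rather than real SHA256 digests.

```rust
use std::collections::HashMap;

/// Hypothetical parser (not fetch-data's own code) for registry text:
/// one `file_name hash` pair per line, separated by whitespace.
fn parse_registry(contents: &str) -> Result<HashMap<String, String>, String> {
    let mut registry = HashMap::new();
    for (index, line) in contents.lines().enumerate() {
        let mut parts = line.split_whitespace();
        match (parts.next(), parts.next(), parts.next()) {
            // Skip blank lines.
            (None, _, _) => continue,
            // Exactly two fields: a file name and its hash.
            (Some(name), Some(hash), None) => {
                registry.insert(name.to_string(), hash.to_string());
            }
            // Anything else is treated as malformed in this sketch.
            _ => return Err(format!("line {}: expected `file_name hash`", index + 1)),
        }
    }
    Ok(registry)
}

fn main() {
    // Placeholder hashes, not real SHA256 digests.
    let contents = "small.fam 0123abcd\nsmall.bim 4567ef01\n";
    let registry = parse_registry(contents).expect("well-formed registry");
    println!("{} entries", registry.len());
}
```

Because the fields are split on whitespace, this sketch cannot handle file names that contain spaces.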

Notes

  • Feature requests and contributions are welcome.

  • Don't use our sample sample_file. Define your own sample_file that knows where to find your data files.

  • The FetchData instance need not be global and static. See FetchData::new for an example of a non-global instance.

  • Additional methods on the FetchData instance can fetch multiple files and can give the path to the local data directory.

  • You need not use a registry.txt file and FetchData instance. You can instead use the stand-alone function fetch to retrieve a single file with known URL, hash, and local path.

  • Additional stand-alone functions can download files and hash files.

  • Fetch-Data always does binary downloads to maintain consistent line endings across OSs.

  • The Bed-Reader genomics crate uses Fetch-Data.

  • To make FetchData work well as a static global, FetchData::new never fails. Instead, FetchData stores any error and returns it when the first call to fetch_file, etc., is made.

  • Debugging this crate under Windows can cause an "Oops! The debug adapter has terminated abnormally" exception. This appears to be some kind of LLVM, Windows, NVIDIA(?) problem triggered via ureq.

  • This crate follows Nine Rules for Elegant Rust Library APIs from Towards Data Science.
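
The note above about FetchData::new never failing describes a general pattern: the constructor stores any setup error, and the first method call returns it. The sketch below shows that pattern in isolation. `LazyFetcher` is a hypothetical type invented for this example; it does not download anything, and it is not how FetchData is implemented internally.

```rust
use std::path::PathBuf;

/// Hypothetical stand-in for a fetcher whose constructor never fails.
struct LazyFetcher {
    // Either the usable data directory or the error found during setup.
    state: Result<PathBuf, String>,
}

impl LazyFetcher {
    /// Always returns a value, which is convenient for a static global.
    fn new(data_dir: &str) -> Self {
        let state = if data_dir.is_empty() {
            Err("data directory must not be empty".to_string())
        } else {
            Ok(PathBuf::from(data_dir))
        };
        LazyFetcher { state }
    }

    /// The stored setup error, if any, surfaces here on use.
    /// This sketch only builds a path; it performs no download.
    fn fetch_file(&self, name: &str) -> Result<PathBuf, String> {
        match &self.state {
            Ok(dir) => Ok(dir.join(name)),
            Err(message) => Err(message.clone()),
        }
    }
}

fn main() {
    let broken = LazyFetcher::new("");
    // Construction succeeded; the error appears only now.
    match broken.fetch_file("small.fam") {
        Ok(path) => println!("{}", path.display()),
        Err(message) => println!("error: {message}"),
    }
}
```

The trade-off of this design is that a misconfiguration is reported at first use rather than at startup.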
