# Comparison of different data hosting/downloading methods

This Jupyter notebook evaluates several data hosting and downloading methods: Zenodo, Open Science Framework (OSF), Google Drive, GitHub LFS, and plain GitHub. It highlights the advantages and limitations of each for handling large datasets with nested structures.

Demonstrations for data downloading and comparisons of zip file sizes are included to assess the efficiency of each method.


---

#### Unzipped v. Zipped 

In some of these examples, we compare storing the data directly with storing it as a single zip.

To avoid redundancy, we outline the pros and cons of zipping the entire dataset here:

**Pros:**
- *In Zenodo:* we can preserve the nested data structure only if it's within a zip
- Zenodo and OSF don't handle large numbers of files well

**Cons:**
- Versioning becomes harder; for example, adding a single filter file would require re-uploading the entire zip (currently 382 MB)

**Notes:**
- Although not discussed in this notebook, there is potential merit in *partially* zipping the data; for example, zipping the contents of sed/QSO/. This gives us some of the benefits of consolidated data while keeping the flexibility to add and modify files without re-uploading everything.
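
As a rough illustration of the partial-zipping idea in the note above, here is a minimal Python sketch that zips a single subdirectory while keeping its path relative to the data root, so that unzipping restores the nesting. The function name `zip_subdirectory` and the `lephare_data` directory in the usage line are hypothetical and only there for illustration.

```python
import zipfile
from pathlib import Path


def zip_subdirectory(data_root: Path, subdir: str, out_zip: Path) -> int:
    """Zip one subdirectory of data_root, keeping paths relative to data_root.

    Returns the number of files written to the archive.
    """
    source = data_root / subdir
    count = 0
    with zipfile.ZipFile(out_zip, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(source.rglob("*")):
            if path.is_file():
                # Store e.g. "sed/QSO/<file>" so unzipping restores the nesting.
                zf.write(path, arcname=path.relative_to(data_root).as_posix())
                count += 1
    return count
```

A call such as `zip_subdirectory(Path("lephare_data"), "sed/QSO", Path("sed_QSO.zip"))` would then produce one archive per subdirectory. The output path should sit outside the directory being zipped so the archive does not try to include itself.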

---

## Zenodo • *(No Demo)*

**Pros:**
- Version control, DOIs
- No upper limit on total storage capacity (but no single file may exceed 50 GB)
- Lifetime: same as CERN (at least 20 years)

**Cons:**
- Doesn't support nested directory structures, which is a no-go for our data
- CERN stores data in a way that is poorly suited to many small files
- Doesn't seem to support multiple contributors: I can list several authors on the dataset, but could not easily find a way to grant multiple accounts edit access.

In [1]:
# No demo

## Zenodo, with the data stored as a zip • [Project on Zenodo](https://zenodo.org/records/10843773)

In [2]:
# Zenodo Zip Demo (~5 min)

ZENODO_ZIP_DATA_URL="https://zenodo.org/api/records/10843773/files-archive"
! curl --output lephare_data_zenodo.zip $ZENODO_ZIP_DATA_URL

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  364M    0  364M    0     0   782k      0 --:--:--  0:07:57 --:--:--  780k

## Open Science Framework (OSF) • [Project on OSF](https://osf.io/64eg9/)

**Pros:**
- Version control, DOIs
- CLI tool [osfclient](https://pypi.org/project/osfclient/) for easier batch upload, among other functionality
- Co-ownership: supports multiple contributors
- Can edit and delete uploaded data
- 50GB capacity for public projects
- Lifetime: preservation fund for 50+ years after closing at current costs

**Cons:**
- Uploading nested subdirs via browser GUI (rather than `osfclient`) is tedious at best, semi-impossible at worst. This makes our `filt/` and `sed/` directories challenging.
- osfclient is extremely slow when uploading a large set of files ([osfclient#155](https://github.com/osfclient/osfclient/issues/155), [#149](https://github.com/osfclient/osfclient/issues/149), [#146](https://github.com/osfclient/osfclient/issues/146); additionally, [nilearn#1925](https://github.com/nilearn/nilearn/issues/1925))
- OSF is designed to store relatively few files; it cannot handle a large set of files ([osfclient#155](https://github.com/osfclient/osfclient/issues/155)).
- Regional servers: I defaulted to the US (does it get more specific than this?) and naturally that's suboptimal for collaboration across Europe and US west coast. Options include: United States, Canada (Montréal), Germany (Frankfurt), Australia (Sydney)
- Potentially, osfclient can miss some files in a batch upload (though [this GitHub discussion is from 2018](https://github.com/osfclient/osfclient/issues/155#issuecomment-409196619))

**Notes:**
- Downloading the zip link with curl is not explicitly described in the official API docs, but it is discussed as an option in GitHub issues ([osf.io#475](https://github.com/CenterForOpenScience/osf.io/issues/475))

In [8]:
# OSF Demo (~40 mins)

OSF_DATA_URL="https://files.osf.io/v1/resources/64eg9/providers/osfstorage/?zip=&_gl=1*1hf86d4*_ga*Nzk4MjQxNzMzLjE3MTAzNDQ3NzQ.*_ga_YE9BMGGWX8*MTcxMDUyOTI0OS41LjEuMTcxMDUyOTI2My40Ni4wLjA."
! curl --output lephare_data_osf.zip "$OSF_DATA_URL"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0

100  226M    0  226M    0     0   101k      0 --:--:--  0:38:16 --:--:--  102k

## OSF, with the data stored as a zip • [Project on OSF](https://osf.io/mvpks/)

**Notes:**
- For this project, I chose the European server (Frankfurt) to see whether it would make a large difference for me in Pittsburgh.


In [4]:
# OSF as Zip Demo (~1 min)

OSF_AS_ZIP_DATA_URL="https://files.osf.io/v1/resources/mvpks/providers/osfstorage/?zip=&_gl=1*15ihtvy*_ga*Nzk4MjQxNzMzLjE3MTAzNDQ3NzQ.*_ga_YE9BMGGWX8*MTcxMDk0MjQzMi4xOC4xLjE3MTA5NDI3ODYuNTQuMC4w"
! curl --output lephare_data_osf_as_zip.zip "$OSF_AS_ZIP_DATA_URL"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  364M    0  364M    0     0  4581k      0 --:--:--  0:01:21 --:--:-- 4189k


## Google Drive • [View File](https://drive.google.com/file/d/1gsni-hMPU5yGkJDyTcv5wV-vl9NCH7g0/view?usp=drive_link)

**Pros:**
- Reliable plan B for just getting data, especially for times of active development

**Cons:**
- I don't really see this as a long-term solution

In [5]:
# Google Drive Demo (10 sec with gdown already installed)

GDRIVE_DATA_URL="https://drive.google.com/file/d/1gsni-hMPU5yGkJDyTcv5wV-vl9NCH7g0/view?usp=drive_link"
GDRIVE_FILE_ID="1gsni-hMPU5yGkJDyTcv5wV-vl9NCH7g0"
! pip install gdown
! gdown $GDRIVE_FILE_ID -O lephare_data_gdrive.zip

Downloading...
From (original): https://drive.google.com/uc?id=1gsni-hMPU5yGkJDyTcv5wV-vl9NCH7g0
From (redirected): https://drive.google.com/uc?id=1gsni-hMPU5yGkJDyTcv5wV-vl9NCH7g0&confirm=t&uuid=762247a4-2dda-4e86-8380-6e8e550e222a
To: /Users/orl/code/LEPHARE-demo/lephare_data_gdrive.zip
100%|████████████████████████████████████████| 382M/382M [00:07<00:00, 50.1MB/s]
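
To put the four recorded downloads side by side, the sketch below computes average throughput from numbers already in this notebook: file sizes come from the `ls -l` listing under "Checking the zips", and elapsed times come from curl's "Time Spent" column and the gdown progress bar. These are single runs from one machine, and the gdown time is rounded to a whole second, so the figures are indicative only.

```python
# (size in bytes, elapsed seconds) for each recorded download.
downloads = {
    "zenodo_zip": (382_442_280, 7 * 60 + 57),
    "osf": (237_935_378, 38 * 60 + 16),
    "osf_as_zip": (382_442_348, 1 * 60 + 21),
    "gdrive": (382_442_106, 7),
}


def throughput_mb_per_s(size_bytes: int, seconds: float) -> float:
    """Average throughput in decimal megabytes per second."""
    return size_bytes / seconds / 1e6


rates = {
    name: throughput_mb_per_s(size, secs)
    for name, (size, secs) in downloads.items()
}

# Print fastest first.
for name, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:12s} {rate:8.2f} MB/s")
```

The non-zipped OSF download is slowest by a wide margin even though it transferred the least data.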


## GitHub LFS • *(No Demo)*

**Pros:**
- There could be options for academic institutions to get more free storage/bandwidth (TODO: look into this)

**Cons:**
- This is a no-go for us, as the current size of our data is 1.36 GB. Even if we stored the zipped version of our data (382 MB) we would be in trouble very quickly.
> Every account using Git Large File Storage receives 1 GiB of free storage and 1 GiB a month of free bandwidth. If the bandwidth and storage quotas are not enough, you can choose to purchase an additional quota for Git LFS. - [GitHub LFS Docs](https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-storage-and-bandwidth-usage)
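
The quota arithmetic behind this con can be sketched as follows. It assumes the "1.36 GB" figure above is in decimal gigabytes and takes the zipped size from the `ls -l` listing at the end of the notebook; the variable names are illustrative.

```python
GIB = 1024**3  # bytes in one gibibyte, the unit used in GitHub's quota

LFS_FREE_STORAGE = 1 * GIB
LFS_FREE_BANDWIDTH_PER_MONTH = 1 * GIB

unzipped_size = 1.36e9     # "1.36 GB" from the text, assumed decimal gigabytes
zipped_size = 382_442_106  # bytes, lephare_data_gdrive.zip in the ls -l listing

fits_unzipped = unzipped_size <= LFS_FREE_STORAGE
fits_zipped = zipped_size <= LFS_FREE_STORAGE
# How many full downloads of the zip fit in one month's free bandwidth.
downloads_per_month = LFS_FREE_BANDWIDTH_PER_MONTH // zipped_size

print(fits_unzipped, fits_zipped, downloads_per_month)  # prints: False True 2
```

So the zip would fit in the free storage tier, but only about two full downloads per month would fit in the free bandwidth.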


In [None]:
# No demo

## Plain GitHub • [View Repo](https://github.com/OliviaLynn/LEPHARE-data)

**Pros:**
- Versioning, easy to navigate, nested directories, fast download
- To make this even more convenient than manually cloning the repo, we can use git's subtree feature to integrate a data repo into another GitHub repo, and squash the subtree's history so it doesn't clutter the main repo
- There is some precedent for using a GitHub repo in this way ([PacBio](https://github.com/swcarpentry/DEPRECATED-site/issues/797#issuecomment-73505437)); there are probably more examples out there, this is just one I happened to see
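
For the subtree option in the list above, a minimal sketch of the commands involved is shown here, wrapped in Python so it can be called from a notebook or script. The `data` prefix and the `main` branch name are assumptions, not something this notebook confirms about the data repo, and the helper names are hypothetical.

```python
import subprocess

DATA_REPO = "https://github.com/OliviaLynn/LEPHARE-data.git"


def subtree_command(action: str, prefix: str, repo: str = DATA_REPO,
                    ref: str = "main") -> list[str]:
    """Build a `git subtree add` or `git subtree pull` command with squashed history.

    `ref` defaults to "main", which is an assumed branch name.
    """
    if action not in ("add", "pull"):
        raise ValueError(f"unsupported subtree action: {action}")
    return ["git", "subtree", action, f"--prefix={prefix}", repo, ref, "--squash"]


def run_subtree(action: str, prefix: str) -> None:
    # Must be run from the root of the main repository's working tree.
    subprocess.run(subtree_command(action, prefix), check=True)
```

Calling `run_subtree("add", "data")` once would bring the data repo in under `data/`, and `run_subtree("pull", "data")` would later update it, each as a single squashed commit.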

 **I don't think this is a con, but I'm not a lawyer, so I welcome second opinions:**
 - I couldn't find any indication that this breaks the TOS (and, as above, I did see some precedent in the scientific community). Still, I'd love a definite yes.

In [6]:
# GitHub Demo (~1 min)

GITHUB_DATA_URL="https://github.com/OliviaLynn/LEPHARE-data.git"
! git clone $GITHUB_DATA_URL

fatal: destination path 'LEPHARE-data' already exists and is not an empty directory.
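
The `fatal` message above only means the repo had already been cloned in an earlier run. One way to make the cell safe to re-run is sketched below; `clone_if_missing` is a hypothetical helper, not part of any library.

```python
import subprocess
from pathlib import Path


def clone_if_missing(url: str, dest: Path) -> bool:
    """Clone url into dest unless dest already exists and is non-empty.

    Returns True if a clone was run, False if it was skipped.
    """
    if dest.is_dir() and any(dest.iterdir()):
        return False
    subprocess.run(["git", "clone", url, str(dest)], check=True)
    return True
```

For this notebook the call would be `clone_if_missing("https://github.com/OliviaLynn/LEPHARE-data.git", Path("LEPHARE-data"))`. Note that a skipped clone does not update an existing checkout, so timing a fresh download still requires removing the directory first.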


## Potentially: NERSC storage? 

- We would have to ask Sam what general size caps we had in mind for this.
- The advantage is that this would make it easier to share templates among the other RAIL template codes.
- There is already precedent for storing some data in our public NERSC data directory, including BPZ demo files.

---

## Checking the zips

In [9]:
# Compare zip sizes

# The non-zipped OSF download is smaller because it is still missing the sed/STAR/ directory.
# I'm uploading that directory now for completeness, but I'm already fairly disillusioned with non-zipped OSF.

! ls -l *.zip

-rw-r--r--  1 orl  staff  382442106 Mar 22 10:06 lephare_data_gdrive.zip
-rw-r--r--  1 orl  staff  237935378 Mar 22 10:45 lephare_data_osf.zip
-rw-r--r--@ 1 orl  staff  382442348 Mar 22 09:58 lephare_data_osf_as_zip.zip
-rw-r--r--  1 orl  staff  382442280 Mar 22 09:55 lephare_data_zenodo.zip
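
Beyond comparing byte counts, the contents of two archives can be compared directly to find what is missing, such as the `sed/STAR/` directory noted above. The sketch below counts files per directory prefix in each zip and reports prefixes present in a reference archive but absent from a candidate. It assumes both archives use the same internal path layout, which has not been verified for these downloads; the function names are illustrative.

```python
import zipfile
from collections import Counter
from pathlib import Path, PurePosixPath


def files_per_directory(zip_path: Path, depth: int = 2) -> Counter:
    """Count archived files under each directory prefix of the given depth."""
    counts = Counter()
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            parts = PurePosixPath(info.filename).parts[:-1]  # drop the file name
            # Files at the archive root are grouped under ".".
            counts["/".join(parts[:depth]) or "."] += 1
    return counts


def missing_directories(reference: Path, candidate: Path, depth: int = 2) -> list[str]:
    """Directory prefixes that hold files in reference but not in candidate."""
    ref = files_per_directory(reference, depth)
    cand = files_per_directory(candidate, depth)
    return sorted(d for d in ref if d not in cand)
```

For example, `missing_directories(Path("lephare_data_gdrive.zip"), Path("lephare_data_osf.zip"))` would list the directories the non-zipped OSF download lacks, provided the two archives share a layout.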
