# Comparison of different data hosting/downloading methods

TODO: Write an overview, finish writing pros/cons


---

#### Unzipped v. Zipped 

In some of these examples, we discuss the case of storing the data directly vs storing the data as a zip. 

To avoid redundancy, we outline pros and cons of zipping the entire data set here:

**Pros:**
- On Zenodo, we can preserve the nested data structure only if it's within a zip
- Zenodo and OSF don't handle large numbers of files well

**Cons:**
- Versioning becomes harder; for example, adding a single filter file would require reuploading the entire zip (currently 382 MB)

**Notes:**
- Although not discussed in this notebook, there is potential merit in *partially* zipping the data; for example, zipping the contents of sed/QSO/. This gives us some of the benefits of consolidated data while keeping the flexibility to add and modify files without reuploading everything.
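
The partial-zipping idea could be sketched roughly as follows. This is a minimal illustration that was not run as part of this notebook: `zip_subdirs` is a hypothetical helper, and it assumes the data lives in a local directory tree with `out_dir` located outside of `parent`.

```python
import zipfile
from pathlib import Path


def zip_subdirs(parent: Path, out_dir: Path) -> list[Path]:
    """Write one zip per immediate subdirectory of `parent` into `out_dir`.

    Paths inside each archive are stored relative to `parent`, so extracting
    every archive into the same place recreates the original nested layout.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    archives = []
    for sub in sorted(p for p in parent.iterdir() if p.is_dir()):
        archive = out_dir / f"{sub.name}.zip"
        with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
            for f in sorted(sub.rglob("*")):
                if f.is_file():
                    # Store e.g. "QSO/some_file" rather than an absolute path.
                    zf.write(f, f.relative_to(parent).as_posix())
        archives.append(archive)
    return archives
```

For example, `zip_subdirs(Path("lephare_data/sed"), Path("zipped/sed"))` (both paths are placeholders) would produce one archive per SED family, so that changing a file in `sed/QSO/` only requires reuploading `QSO.zip`.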

---

## Zenodo • *(No Demo)*

**Pros:**
- Version control, DOIs
- No upper limit on total storage capacity (but no single file larger than 50 GB)
- Lifetime: same as CERN (at least 20 years)

**Cons:**
- CERN stores data in a way that is not well suited to many small files (TODO: find link for this)
- Doesn't seem to support multiple contributors (I can list several authors on the dataset, but it's unclear whether I can give other users permission to edit the project)
- Doesn't support nested directory structures, which is a no-go for our data

In [8]:
# No demo

## Zenodo, with the data stored as a zip • [Project on Zenodo](https://zenodo.org/records/10843773)

In [7]:
# Zenodo Zip Demo (~5 min)

ZENODO_ZIP_DATA_URL="https://zenodo.org/api/records/10843773/files-archive"
! curl --output lephare_data_zenodo.zip $ZENODO_ZIP_DATA_URL

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0

100  364M    0  364M    0     0  1290k      0 --:--:--  0:04:49 --:--:-- 1543k


## Open Science Framework (OSF) • [Project on OSF](https://osf.io/64eg9/)

**Pros:**
- Version control, DOIs
- CLI tool [osfclient](https://pypi.org/project/osfclient/) for easier batch upload, among other functionality
- Co-ownership: supports multiple contributors
- Can edit and delete uploaded data
- 50GB capacity for public projects
- Lifetime: preservation fund for 50+ years after closing at current costs

**Cons:**
- Uploading nested subdirs via browser GUI (rather than `osfclient`) is tedious at best, semi-impossible at worst. This makes our `filt/` and `sed/` directories challenging.
- osfclient is extremely slow when uploading a large set of files ([osfclient#155](https://github.com/osfclient/osfclient/issues/155), [#149](https://github.com/osfclient/osfclient/issues/149), [#146](https://github.com/osfclient/osfclient/issues/146); additionally, [nilearn#1925](https://github.com/nilearn/nilearn/issues/1925))
- OSF is designed to store relatively few files; it cannot handle a large set of files ([osfclient#155](https://github.com/osfclient/osfclient/issues/155)).
- Regional servers: I defaulted to the US (does it get more specific than this?), which is naturally suboptimal for a collaboration spanning Europe and the US west coast. Options include: United States, Canada (Montréal), Germany (Frankfurt), Australia (Sydney)
- Potentially, osfclient can miss some files in a batch upload (though [this GitHub discussion is from 2018](https://github.com/osfclient/osfclient/issues/155#issuecomment-409196619))

**Notes:**
- Curling the link to the zip is not explicitly described in the official API docs, but it is discussed as an option in GitHub issues ([osf.io#475](https://github.com/CenterForOpenScience/osf.io/issues/475))
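
As a rough Python counterpart to the curl approach, the sketch below builds the zip link from a project ID and streams it to disk. It follows the `files.osf.io` URL pattern used in this notebook's OSF demos, keeping only the project ID and the `?zip=` query; that this is all the server needs is an assumption, and `osf_zip_url` and `download` are hypothetical helper names, not part of any OSF library.

```python
import shutil
import urllib.request

OSF_FILES_HOST = "https://files.osf.io/v1/resources"


def osf_zip_url(project_id: str) -> str:
    """Build the zip-download URL for a public project's osfstorage root."""
    return f"{OSF_FILES_HOST}/{project_id}/providers/osfstorage/?zip="


def download(url: str, dest: str) -> None:
    """Stream `url` to `dest` without holding the whole file in memory."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
```

Calling `download(osf_zip_url("64eg9"), "lephare_data_osf.zip")` would then mirror the curl demo for the unzipped project.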

In [2]:
# OSF Demo (~43 mins)

OSF_DATA_URL="https://files.osf.io/v1/resources/64eg9/providers/osfstorage/?zip="
! curl --output lephare_data_osf.zip "$OSF_DATA_URL"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  291M    0  291M    0     0   113k      0 --:--:--  0:43:50 --:--:-- 85727

## OSF, with the data stored as a zip • [Project on OSF](https://osf.io/mvpks/)

**Notes:**
- For this example, I chose the European server (Germany) to see whether it would make a large difference for me.


In [24]:
# OSF as Zip Demo (~14 sec)

OSF_AS_ZIP_DATA_URL="https://files.osf.io/v1/resources/mvpks/providers/osfstorage/?zip="
! curl --output lephare_data_osf_as_zip.zip "$OSF_AS_ZIP_DATA_URL"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  364M    0  364M    0     0  26.6M      0 --:--:--  0:00:13 --:--:-- 28.6M


## Google Drive • [View File](https://drive.google.com/file/d/1gsni-hMPU5yGkJDyTcv5wV-vl9NCH7g0/view?usp=drive_link)

**Pros:**
- Reliable plan B for just getting data, especially for times of active development

**Cons:**
- I don't really see this as a long-term solution

In [32]:
# Google Drive Demo (~10 sec with gdown already installed)

# Share links kept for reference only; gdown needs just the file ID below.
#GDRIVE_DATA_URL="https://drive.google.com/drive/folders/1-vYT_rbC2B747dNEVjaMKJvjPmJglPXy?usp=drive_link"
GDRIVE_DATA_URL="https://drive.google.com/file/d/1gsni-hMPU5yGkJDyTcv5wV-vl9NCH7g0/view?usp=drive_link"
GDRIVE_FILE_ID="1gsni-hMPU5yGkJDyTcv5wV-vl9NCH7g0"
! pip install gdown
! gdown $GDRIVE_FILE_ID -O lephare_data_gdrive.zip

Downloading...
From (original): https://drive.google.com/uc?id=1gsni-hMPU5yGkJDyTcv5wV-vl9NCH7g0
From (redirected): https://drive.google.com/uc?id=1gsni-hMPU5yGkJDyTcv5wV-vl9NCH7g0&confirm=t&uuid=936469b8-819d-4ef7-8f88-5eede1d7fc83
To: /Users/orl/code/LEPHARE-demo/lephare_data_gdrive.zip
100%|████████████████████████████████████████| 382M/382M [00:08<00:00, 46.4MB/s]


## GitHub LFS • *(No Demo)*

**Pros:**
- There could be options for academic institutions to get more free storage/bandwidth (TODO look into)

**Cons:**
- This is a no-go for us, as the current size of our data is 1.36 GB. 
> Every account using Git Large File Storage receives 1 GiB of free storage and 1 GiB a month of free bandwidth. If the bandwidth and storage quotas are not enough, you can choose to purchase an additional quota for Git LFS.
\- [GitHub LFS Docs](https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-storage-and-bandwidth-usage)
 Even if we stored only the zipped version of our data (382 MB), the free 1 GiB of monthly bandwidth would cover just two full downloads.
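
The quota arithmetic can be checked directly. The zipped size below is the byte count of `lephare_data_gdrive.zip` from the "Checking the zips" section; the unzipped size is the approximate 1.36 GB figure quoted above, assumed here to mean decimal gigabytes.

```python
GIB = 1024**3  # bytes in one gibibyte

LFS_FREE_STORAGE = 1 * GIB     # free storage quoted in the GitHub docs
LFS_FREE_BANDWIDTH = 1 * GIB   # free monthly bandwidth quoted in the GitHub docs

UNZIPPED_SIZE = 1.36e9         # bytes, approximate (assumes decimal GB)
ZIPPED_SIZE = 382_442_106      # bytes, size of lephare_data_gdrive.zip

# The unzipped data does not fit in the free storage at all.
unzipped_fits = UNZIPPED_SIZE <= LFS_FREE_STORAGE

# Whole downloads of the zip that fit into one month of free bandwidth.
full_downloads_per_month = LFS_FREE_BANDWIDTH // ZIPPED_SIZE

print(unzipped_fits, full_downloads_per_month)
```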

In [None]:
# No demo

## Plain GitHub • [View Repo](https://github.com/OliviaLynn/LEPHARE-data)

**Pros:**
- Versioning, easy to navigate, nested directories, fast download
- TODO: Need to read up on details, but this can integrate really nicely with other GitHub repos that read its data (maybe worth making a little demo)

**Cons:**
- Does this comply with GitHub's terms of service? (TODO look into)

In [9]:
# GitHub Demo (~1 min)

GITHUB_DATA_URL="https://github.com/OliviaLynn/LEPHARE-data.git"
! git clone $GITHUB_DATA_URL

Cloning into 'LEPHARE-data'...
remote: Enumerating objects: 18763, done.
remote: Counting objects: 100% (2376/2376), done.
remote: Compressing objects: 100% (1955/1955), done.
remote: Total 18763 (delta 403), reused 2375 (delta 403), pack-reused 16387
Receiving objects: 100% (18763/18763), 346.51 MiB | 4.94 MiB/s, done.
Resolving deltas: 100% (617/617), done.
Updating files: 100% (19383/19383), done.
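
If only the current files are needed and not the commit history, a shallow clone is one option worth trying. The sketch below is untested against this repository, and how much transfer time it would actually save was not measured; `shallow_clone_command` and `shallow_clone` are hypothetical helper names wrapping the standard `git clone --depth 1` invocation.

```python
import subprocess


def shallow_clone_command(repo_url: str, dest: str) -> list[str]:
    """Return a git command that fetches only the most recent commit."""
    return ["git", "clone", "--depth", "1", repo_url, dest]


def shallow_clone(repo_url: str, dest: str) -> None:
    """Run the shallow clone, raising CalledProcessError if git fails."""
    subprocess.run(shallow_clone_command(repo_url, dest), check=True)
```

With the variable from the demo cell, this would be called as `shallow_clone(GITHUB_DATA_URL, "LEPHARE-data")`.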


## Potentially, maybe: NERSC storage? 

We would have to ask Sam what general size caps we had in mind for this. The advantage is that this would facilitate sharing templates with other rail template codes.

---

## Checking the zips

In [31]:
# Compare zip sizes
# Note: lephare_data_osf.zip is smaller because I never uploaded the last sed/STARS dir after realizing this approach was probably moot.

! ls -l *.zip

-rw-r--r--  1 orl  staff  382442106 Mar 20 11:01 lephare_data_gdrive.zip
-rw-r--r--  1 orl  staff  305758858 Mar 20 08:38 lephare_data_osf.zip
-rw-r--r--@ 1 orl  staff  382442348 Mar 20 10:50 lephare_data_osf_as_zip.zip
-rw-r--r--  1 orl  staff  382442280 Mar 20 11:02 lephare_data_zenodo.zip
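
File sizes alone do not show whether the archives hold the same content. A minimal sketch of a deeper check is given below; `summarize_zip` and `summarize_all` are hypothetical helpers built on the standard-library `zipfile` module. Note that a host's zip endpoint may wrap the uploaded zip inside an outer archive, so file counts may not be directly comparable across hosts without unpacking one level first.

```python
import zipfile
from pathlib import Path


def summarize_zip(path: Path) -> dict:
    """Collect basic integrity and content statistics for one zip archive."""
    with zipfile.ZipFile(path) as zf:
        infos = [i for i in zf.infolist() if not i.is_dir()]
        return {
            "name": path.name,
            "files": len(infos),
            "uncompressed_bytes": sum(i.file_size for i in infos),
            # testzip() returns the first corrupt member name, or None if all pass.
            "first_bad_member": zf.testzip(),
        }


def summarize_all(directory: Path = Path(".")) -> list[dict]:
    """Summarize every *.zip in `directory`, sorted by file name."""
    return [summarize_zip(p) for p in sorted(directory.glob("*.zip"))]
```

Running `summarize_all()` in the notebook's working directory would give one summary per downloaded archive.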
