-
Notifications
You must be signed in to change notification settings - Fork 1
Expand file tree
/
Copy pathfledging.qmd
More file actions
100 lines (83 loc) · 5.44 KB
/
fledging.qmd
File metadata and controls
100 lines (83 loc) · 5.44 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
---
title: "Fledging"
---
Fledging is a goal of these JupyterHubs, so that users who have had success
and want to continue their adventures in cloud computing, have the knowledge
and resources to develop their own solutions. Fledging work to date has been co-developed
with the NASA Openscapes Mentors community, and we continue to work with mentors
in both NASA and NOAA to identify fledging paths and address challenges.
The Openscapes JupyterHubs aim to provide an analysis-ready computing environment,
with all (or most) of the software dependencies required for working with
cloud-hosted spatial data. This is a big advantage for learning how to
access and work with the data, run workshops for many learners, and have early
success in cloud computing. However, it does hide some of the complexities of
setting up your computing environment, and most fledging paths will necessarily
involve learning some of these tools and processes.
### The Cloud: Data vs Compute
When thinking about cloud computing, we consider two components: where is our
data? And where is our compute? In general, if our data is already on our local
computer, we will work with that data on our computer, so the scope we are
considering is how to work with *data that is hosted in the cloud*.
{width="400" fig-align="center"}
The proliferation of data centres to support cloud computing and data storage
has a [significant environmental cost](https://www.goclimate.com/knowledge/servers).
While most scientific workflows are small in comparison to other cloud uses
(e.g., AI and cryptocurrency mining), it is worth taking this into
consideration when thinking about "when to cloud".
### Skills learned for working with **cloud-hosted data**
The JupyterHubs provide a shared computing environment for mentors to learn and teach
workflows for accessing and analyzing cloud-based data. Some of the skills we learn include:
- Conceptual understanding of data in the cloud, and cloud-native data sources
- Finding and accessing data in Earthdata cloud with earthaccess (Python) and
earthdatalogin (R)
- Streaming and subsetting cloud data with xarray (Python) and terra/rstac (R)
- Authenticating with NASA Earthdata Login
- Ability to ask questions, teach, and develop learning resources
These skills enable efficient use of cloud-hosted data using either *local* or
*cloud* compute resources.
### What are fledging paths for **compute**?
Depending on computing needs and data size, there are many options for fledging.
These may include:
- Continue compute in the cloud using a shared environment (example: JupyterHub)
- If your institution offers a JupyterHub instance, this is a great option
to seamlessly transfer your learnings from the Openscapes Hub to a Hub
for "real science".
- Not all JupyterHubs will have the same environment (i.e., software and
packages), so users will need to learn how to customize their new
environment to suit their needs.
- Send large compute jobs to the cloud (example: Coiled)
- This is a good option for large compute jobs using Python that are easily
parallelized (see an example
[below](#an-example-path-for-computing-in-the-cloud)).
- Compute locally with cloud-native workflows (example: streaming cloud data
with earthaccess + xarray)
- For data and analysis jobs that can fit within your laptop's resrouces
(RAM and hard disk), working locally with the same tools and workflows
learned from working in the hub is a great option, and computing in the
cloud is likely unnecessary. You can benefit greatly from the effiency of
cloud data access patterns (such as streaming and range requests) without
worrying about cloud costs and administration.
- Compute locally using a shared environment “container” (example: Docker)
- One of the great benefits that the shared JupyterHub offers is a computing
environment that is already set up, so you don't have to go through the pain
of installing and configuring everything yourself. The images used in the
JupyterHubs can often be used locally using tools like Docker. See some
examples [here](https://pangeo-docker-images.readthedocs.io/en/latest/howto/launch.html#how-to-launch-jupyterlab-locally-with-one-of-these-images).
#### An example path for computing in the cloud
*from Aronne Merrelli, University of Michigan*
- Learned when and how to cloud in 2i2c JupyterHub
(managed by 2i2c & Openscapes, using NASA credits)
- Log into a computing environment in the Cloud. A single environment to teach
and learn without having to know the complex setups
- Experimented with compute jobs in Coiled
(managed by Coiled & Openscapes, NASA credits)
- Log into a computing environment in the Cloud
- Learned how to run code in parallel in the Cloud
- Did real science with University credit card
(managed by Coiled & University credit card)
- UM’s institutional AWS account is a big help here
- Ran parallel compute jobs on big data in the cloud
- See a [presentation on this fledging story](https://docs.google.com/presentation/d/1-TiPN8bfY6iDL5EVcEuCQSjqFMs3jPSX2MlNWA6no2E/edit#slide=id.g2ece79fddf8_3_73),
given at the Earth Science Information Partners (ESIP) conference in July, 2024.
#### Vision for the future
> People have skills and are empowered to do their science with cloud-hosted data efficiently and fairly, based on their computing needs and their values.