# Cloud-Optimized Data Access

Recall from the introduction that cloud object storage is accessed over the network. Local file storage has capacity limits, but reading from it will generally be faster because the data sits on the same machine as the CPU (Central Processing Unit). This is why file format design requires more consideration in the cloud than on local storage.

## 🏋️ Exercise

## Why you should care

ADD ME

## Why you shouldn't care

Hopefully, one day, this will all be obsolete. But we are in an intermediate period where we have access to the cloud but non-optimal data access patterns. We're still thinking about files and not just logical datasets. Hopefully, in a few years' time, we will catalog all of these datasets in a way that means you don't have to think about files at all.

(But you should still care about the cloud, just not so much about file formats.)

## Pre-amble: Anatomy of a structured data file

[Diagram of a structured data file]

A structured data file is composed of two parts: metadata and data. Metadata is information about the data, such as the data shape, data type, the data variables, the data's coordinate system, and how the data is stored, such as chunk shape and compression. Data is the actual data that you want to analyze.

We can optimize this structure for reading from cloud storage.
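To make the split concrete, here is a minimal sketch of what metadata for a single variable might contain, and what a reader can work out from it before touching any data. The dictionary keys and all values are hypothetical and do not follow any particular file format.

```python
import math

# Hypothetical metadata for a single variable; every name and value here is
# made up for illustration and is not taken from a real file.
metadata = {
    "variable": "temperature",
    "dims": ("time", "lat", "lon"),
    "shape": (365, 720, 1440),
    "dtype": "float32",
    "itemsize": 4,                   # bytes per value for float32
    "chunk_shape": (1, 720, 1440),   # one time step per chunk
    "compression": "zlib",
    "crs": "EPSG:4326",
}


def chunk_grid(shape, chunk_shape):
    """Number of chunks along each dimension."""
    return tuple(math.ceil(s / c) for s, c in zip(shape, chunk_shape))


def uncompressed_chunk_bytes(chunk_shape, itemsize):
    """Size of one full chunk before compression."""
    return math.prod(chunk_shape) * itemsize


grid = chunk_grid(metadata["shape"], metadata["chunk_shape"])
nbytes = uncompressed_chunk_bytes(metadata["chunk_shape"], metadata["itemsize"])
print(grid)    # (365, 1, 1)
print(nbytes)  # 4147200
```

Everything printed here comes from the metadata alone, which is why being able to read the metadata cheaply matters so much.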

## What does cloud-optimized mean?

The "optimize" in cloud-optimized means **minimizing data latency** and **maximizing throughput** by:

* Making as few requests as possible
* Making even fewer requests for metadata, preferably only one
* Using a file layout that simplifies accessing data for parallel reads
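A rough way to see why the number of requests matters is a simple cost model: every request pays a fixed latency, and the bytes themselves move at some throughput. The sketch below assumes sequential requests, and the latency and throughput figures are illustrative assumptions, not measurements of any real service.

```python
def transfer_time(n_requests, total_bytes, latency_s, throughput_bytes_per_s):
    """Estimated seconds to fetch data using sequential requests."""
    return n_requests * latency_s + total_bytes / throughput_bytes_per_s


LATENCY_S = 0.05       # assumed round trip per request
THROUGHPUT = 100e6     # assumed 100 MB/s
TOTAL_BYTES = 100e6    # 100 MB to fetch in both scenarios

many_small = transfer_time(1000, TOTAL_BYTES, LATENCY_S, THROUGHPUT)
few_large = transfer_time(10, TOTAL_BYTES, LATENCY_S, THROUGHPUT)
print(f"{many_small:.1f} s vs {few_large:.1f} s")  # 51.0 s vs 1.5 s
```

Under these assumptions the same 100 MB takes far longer when it is split across 1000 requests, because latency dominates. Parallel reads reduce the penalty, but only if the file layout makes independent requests possible.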

## How do we accomplish cloud-optimization?

1. Separate metadata from data and store the metadata contiguously so it can be read with one request.
2. Store data in chunks, so the whole file doesn't have to be read to access a portion of the data.
3. Compress these chunks so there is less data to transfer.
4. Make sure chunks of data are not too small, so more data can be fetched with each request.
5. Make sure the chunks are not too large, since larger chunks mean more data has to be transferred and decompression takes longer.
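The steps above can be sketched with a toy layout built from the Python standard library. The layout (a length prefix, a JSON metadata block, then zlib-compressed chunks) and the function names `write_toy_file` and `read_chunk` are invented for illustration; real formats use their own binary structures. The sketch slices an in-memory `bytes` object rather than issuing real requests.

```python
import json
import struct
import zlib


def write_toy_file(values, chunk_len):
    """Pack ints as [8-byte metadata length][JSON metadata][compressed chunks]."""
    blobs = []
    for start in range(0, len(values), chunk_len):
        chunk = values[start:start + chunk_len]
        raw = struct.pack(f"<{len(chunk)}i", *chunk)   # little-endian int32
        blobs.append(zlib.compress(raw))

    # Byte offset of each chunk, relative to the start of the data section.
    index, pos = [], 0
    for blob in blobs:
        index.append({"offset": pos, "nbytes": len(blob)})
        pos += len(blob)

    metadata = json.dumps({
        "length": len(values),
        "chunk_len": chunk_len,
        "compression": "zlib",
        "chunks": index,
    }).encode("utf-8")
    return struct.pack("<Q", len(metadata)) + metadata + b"".join(blobs)


def read_chunk(file_bytes, chunk_id):
    """Read the metadata block, then decompress only the requested chunk."""
    (meta_len,) = struct.unpack_from("<Q", file_bytes, 0)
    metadata = json.loads(file_bytes[8:8 + meta_len])
    data_start = 8 + meta_len
    entry = metadata["chunks"][chunk_id]
    begin = data_start + entry["offset"]
    raw = zlib.decompress(file_bytes[begin:begin + entry["nbytes"]])
    return list(struct.unpack(f"<{len(raw) // 4}i", raw))


toy = write_toy_file(list(range(1000)), chunk_len=100)
print(read_chunk(toy, 3)[:5])  # [300, 301, 302, 303, 304]
```

Because the metadata sits in one contiguous block at the front and records where every chunk lives, a reader can locate chunk 3 without scanning chunks 0 through 2.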

## How do we read data from cloud-optimized files?

Libraries such as xarray first read the metadata. They defer reading the actual data until it's needed for analysis. When a computation on the data is triggered, libraries use [HTTP range requests](https://http.dev/range-request) to request only the chunks required. This is also called "lazy loading" data. See also [xarray's documentation on lazy indexing](https://docs.xarray.dev/en/latest/internals/internal-design.html#lazy-indexing).
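The sketch below imitates that pattern without any network access. `FakeObjectStore` and `LazyArray` are hypothetical classes written for this example, not part of xarray or any other library, and the array is assumed to be stored as uncompressed fixed-size chunks of 32-bit integers whose metadata (length and chunk length) is already known.

```python
import struct


def range_header(start, end):
    """HTTP Range header for bytes start..end (both ends inclusive)."""
    return {"Range": f"bytes={start}-{end}"}


class FakeObjectStore:
    """In-memory stand-in for an object store; records each request."""

    def __init__(self, blob):
        self.blob = blob
        self.requests = []

    def get(self, headers):
        self.requests.append(headers)
        start, end = headers["Range"].split("=")[1].split("-")
        return self.blob[int(start):int(end) + 1]


class LazyArray:
    """Fetches fixed-size chunks of int32 values only when indexed."""

    ITEMSIZE = 4

    def __init__(self, store, n_items, chunk_len):
        self.store = store
        self.n_items = n_items
        self.chunk_len = chunk_len
        self._cache = {}

    def __getitem__(self, i):
        if not 0 <= i < self.n_items:
            raise IndexError(i)
        chunk_id, within = divmod(i, self.chunk_len)
        if chunk_id not in self._cache:
            first = chunk_id * self.chunk_len
            count = min(self.chunk_len, self.n_items - first)
            start = first * self.ITEMSIZE
            end = start + count * self.ITEMSIZE - 1
            raw = self.store.get(range_header(start, end))
            self._cache[chunk_id] = struct.unpack(f"<{count}i", raw)
        return self._cache[chunk_id][within]


store = FakeObjectStore(struct.pack("<100i", *range(100)))
arr = LazyArray(store, n_items=100, chunk_len=10)   # no data fetched yet
print(len(store.requests))   # 0
print(arr[42], arr[47])      # 42 47
print(store.requests)        # [{'Range': 'bytes=160-199'}]
```

Creating the array costs no requests, and reading two values from the same chunk costs one request for 40 bytes rather than a download of the whole object.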

## An analogy

When you lived at home with your parents, everything was right there when you needed it (local file storage). Let's say you're about to move away to college (the cloud), and you are not allowed to bring anything with you. You put everything in your parents' (infinitely large) garage (cloud object storage). Given you would need to have things shipped to you, would it be better to leave everything unpacked? To put everything all in one box? A few different boxes? And what would be the most efficient way for your parents to know where things were when you asked for them?

## Examples of cloud-optimized file formats

You are probably familiar with the following file formats:

* NetCDF
* HDF5
* GeoTIFF

You can actually make any of these file formats "cloud-optimized" by:

* Separating metadata from data and making sure metadata is all stored contiguously so it can be read with one request
* Storing data in reasonably sized chunks (not too big, not too small)
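What counts as "reasonably sized" depends on how the data will be read. As a rough sketch, the helper functions below (hypothetical names, uncompressed sizes, and the assumption that whole chunks are always fetched) compare three chunk shapes for the same read: one year of values at a single grid cell of a `(365, 720, 1440)` array of 4-byte values.

```python
import math


def chunks_touched(selection, chunk_shape):
    """Count chunks intersecting a selection given as (start, stop) per dimension."""
    per_dim = [
        (stop - 1) // c - start // c + 1
        for (start, stop), c in zip(selection, chunk_shape)
    ]
    return math.prod(per_dim)


def bytes_transferred(selection, chunk_shape, itemsize=4):
    """Uncompressed bytes fetched, assuming whole chunks are always read."""
    return chunks_touched(selection, chunk_shape) * math.prod(chunk_shape) * itemsize


# One year of values at a single grid cell of a (365, 720, 1440) array.
time_series = ((0, 365), (100, 101), (200, 201))

for chunk_shape in [(1, 720, 1440), (365, 10, 10), (1, 1, 1)]:
    n = chunks_touched(time_series, chunk_shape)
    size = bytes_transferred(time_series, chunk_shape)
    print(chunk_shape, n, size)
# (1, 720, 1440) 365 1513728000
# (365, 10, 10) 1 146000
# (1, 1, 1) 365 1460
```

For this access pattern, one-time-step chunks transfer about 1.5 GB to recover 1460 bytes of useful data, while single-value chunks transfer very little but still need 365 requests. A different access pattern, such as reading one full map per time step, would favor a different chunk shape.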

## Cloud-native formats

Cloud-native formats like Cloud-Optimized GeoTIFF (COG) and Zarr are designed specifically for use in cloud object storage and are inherently cloud-optimized. However, converting existing archives to a new cloud-native format is often impractical. Fortunately, files in existing formats can also be structured so that they are cloud-optimized.
