
Cache GCP client responses to reduce repeated operations and network calls #19

Closed
aaraney opened this issue Jan 22, 2021 · 26 comments
Labels: enhancement New feature or request

@aaraney
Member

aaraney commented Jan 22, 2021

Each time the GCP client is used, it hits GCP to fetch the requested data. Given the size of the data and how often that process is repeated, it only makes sense to implement some kind of cache. I propose that we use a file db (e.g., SQLite) to accomplish this, for simplicity and broad support in Python.

High-level logic

  1. Check if the db cache has the desired data (stored as a df)
    1. Yes - return the data
    2. No - continue
  2. Get the data from GCP
  3. Create a df from the GCP data
  4. Cache this df in an SQLite db using the URL path as the key
  5. Return the df

Requirements

  • The SQLite lib must support multiprocessing and batch commits

The sqlitedict library is mature, maintained, and seems to fit the bill for this feature. The lib lets you create/connect to a db and use it like you would a Python dictionary. Most importantly, it supports multiprocessing.
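
To make the high-level logic above concrete, here is a minimal sketch of steps 1-5 with sqlitedict. Note that get_nwm_bytes and bytes_to_dataframe are hypothetical placeholders for the existing GCP client call and NetCDF parsing code:

import pandas as pd
from sqlitedict import SqliteDict

def get_dataframe(url: str, cache_path: str = "gcp_cache.sqlite") -> pd.DataFrame:
    with SqliteDict(cache_path) as cache:
        # 1. Check if the db cache has the desired data (stored as a df),
        #    keyed by the URL path.
        if url in cache:
            return cache[url]
        # 2./3. Get the data from GCP and build a df from it.
        raw = get_nwm_bytes(url)        # hypothetical GCP client call
        df = bytes_to_dataframe(raw)    # hypothetical NetCDF-to-df parser
        # 4. Cache the df, then persist it with a commit.
        cache[url] = df
        cache.commit()
    # 5. Return the df.
    return df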

@aaraney aaraney added the enhancement New feature or request label Jan 22, 2021
@aaraney aaraney self-assigned this Jan 22, 2021
@jarq6c
Collaborator

jarq6c commented Jan 22, 2021 via email

@aaraney
Member Author

aaraney commented Jan 22, 2021

> My inclination is to cache processed dataframes.

Likewise.

> GCP is pretty resilient, but there's a major bottleneck when it comes to processing the NetCDF files. I'd like to see if we can optimize that aspect first, then explore caching options.

Okay, I will look into optimizing that before moving on. Can you speculate on the bottleneck you mentioned?

@jarq6c
Collaborator

jarq6c commented Jan 22, 2021 via email

@aaraney
Member Author

aaraney commented Jan 22, 2021

Alright, that's helpful. I will fool around with it and see if I can find any improvement in performance.

@aaraney
Member Author

aaraney commented Jan 26, 2021

I took a look at the performance issues on Friday and found a few places where we can make substantial improvements. For an initial baseline, I timed retrieving an analysis and assimilation cycle without filtering the results (all 2.7 million reaches included). That took ~48 seconds. Not horrible, but also not great. To get a sense of what the bottleneck might be, I ran the same test filtered to just USGS gauges, which took substantially less time, ~24 seconds.

Before I elaborate on the findings, I was able to get the time down to ~20 seconds for pulling and processing all 2.7 million reaches for a single AnA cycle. To me, that is a success, especially if we are to cache those results. Enough of the good news and on to the findings...

I found three places where we can substantially improve performance:

  1. Dropping variables that we are not concerned with at xarray dataset creation time.
  2. Not decoding the time fields at xarray dataset creation time.
  3. After the xarray dataset and the pandas df object are created, not casting the nwm_feature_id field to a string. This had the largest effect on performance; it cut wall time pretty much in half.
import pandas as pd
import xarray as xr

def NWM_bytes_to_DataFrame(
    path,
) -> pd.DataFrame:
    vars_to_drop = [
        "crs",
        "nudge",
        "velocity",
        "qSfcLatRunoff",
        "qBucket",
        "qBtmVertRunoff",
    ]
    # Load data as xarray DataSet
    with xr.load_dataset(
        path,
        engine="h5netcdf",
        mask_and_scale=False,
        drop_variables=vars_to_drop, #1
        decode_cf=False, #2
    ) as ds:

        # Extract streamflow data to pandas DataFrame
        df = pd.DataFrame(
            {
                "nwm_feature_id": ds["streamflow"].feature_id.values,
                "value": ds["streamflow"].values,
            }
        )

        # Scale data
        scale_factor = ds["streamflow"].scale_factor[0]
        value_date = pd.to_datetime(ds.time.values[0], unit="m")
        start_date = pd.to_datetime(ds.reference_time.values[0], unit="m")

    df.loc[:, "value"] = df["value"].mul(scale_factor)

    # 3. This had the biggest impact on wall time. With casting, for a full AnA cycle, goes to ~50 seconds
    # Convert feature IDs to strings
    # df["nwm_feature_id"] = df["nwm_feature_id"].astype(str)

    # Extract valid datetime
    df["value_date"] = value_date

    # Extract reference datetime
    df["start_date"] = start_date

    return df

Now with these findings presented, my only question is: given that the cast from int to str of the nwm_feature_id field has the biggest impact on performance, what are your thoughts on leaving the field as an int? This would break with the canonical dataframe format.

@jarq6c
Collaborator

jarq6c commented Jan 26, 2021

At the model level, it makes sense to store nwm_feature_id as an int. However, for this use case my preference is that all location identifiers be str, then pandas Categorical.

Having said that, I believe the transformation from int to str can be more efficiently executed outside the NWM_bytes_to_DataFrame method (after calling pandas.concat elsewhere).

Additionally, I wonder if there is any performance to be gained from using the xarray.Dataset.to_dataframe method (see the docs). Hand-crafting a pandas.DataFrame apparently comes with significant overhead.
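
For reference, a rough sketch of what that alternative could look like, assuming the same ds handle opened in NWM_bytes_to_DataFrame above (the rename to our canonical column names is ours, not xarray's):

# Sketch: let xarray build the DataFrame, then trim and rename columns.
df = (
    ds["streamflow"]
    .to_dataframe()
    .reset_index()
    .rename(columns={"feature_id": "nwm_feature_id", "streamflow": "value"})
    [["nwm_feature_id", "value"]]
)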

@aaraney
Member Author

aaraney commented Jan 26, 2021

I agree that for the end user, storing them as strings makes more sense, and it is just the standard that we have in place. Personally, I think it makes sense to cache results as ints. I would imagine that int comparisons would be faster than str comparisons; then again, I guess the question is whether an int comparison plus a cast to str is faster than just a str comparison. I may be changing my mind, but then again, the overall size of the cache would be much higher if we store nwm_feature_id as a string.

> Having said that, I believe the transformation from int to str can be more efficiently executed outside the NWM_bytes_to_DataFrame method (after calling pandas.concat elsewhere).

Potentially, but I am not so sure of that. Given that a copy of the data will be made in memory when cast from int to str, I would think (in most cases) it would actually be slower to cast after concatenating the results. As of now, we have a kind of divide and conquer strategy that should be slightly faster.

> Additionally, I wonder if there is any performance to be gained from using the xarray.Dataset.to_dataframe method (see the docs). Hand-crafting a pandas.DataFrame apparently comes with significant overhead.

I've tested that hypothesis, though I would not say robustly, and found that it is actually not the case. My thinking is that passing the underlying numpy array to the pandas.DataFrame constructor does not copy the data in memory until you change it. But I could be wrong about that. I'll see what I can find in the source for my own curiosity's sake. But to your point, this is definitely worth more investigation than I've done so far.

@aaraney
Member Author

aaraney commented Jan 26, 2021

Having looked into the source, I don't think that we will see any performance improvements using to_dataframe() for this problem. I've included the method xarray uses to create the dataframe below. Our code should be minutely faster since we also use the underlying numpy array to construct the df, just in less code.

# from xarray.core.dataset.Dataset
# https://github.com/pydata/xarray/blob/a0c71c1508f34345ad7eef244cdbbe224e031c1b/xarray/core/dataset.py#L4970
def _to_dataframe(self, ordered_dims: Mapping[Hashable, int]):
    columns = [k for k in self.variables if k not in self.dims]
    data = [
        self._variables[k].set_dims(ordered_dims).values.reshape(-1)
        for k in columns
    ]
    index = self.coords.to_index([*ordered_dims])
    return pd.DataFrame(dict(zip(columns, data)), index=index)

@jarq6c
Collaborator

jarq6c commented Jan 26, 2021

> Potentially, but I am not so sure of that. Given that a copy of the data will be made in memory when cast from int to str, I would think (in most cases) it would actually be slower to cast after concatenating the results. As of now, we have a kind of divide and conquer strategy that should be slightly faster.

As an experiment I ran a test on my personal laptop that compared creating and concatenating 10,000 dataframes each with 1000 rows and two columns (nwm_feature_id and data) using multiprocessing (5 processes). I found casting nwm_feature_id to str after concatenation was 50% slower than before concatenation. At least on my laptop, it seems your intuition is correct.
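
For anyone who wants to repeat it, a rough sketch of such an experiment (helper names and sizes mirror the description above; exact timings will vary by machine):

import time
from multiprocessing import Pool

import numpy as np
import pandas as pd

def make_frame(args):
    seed, cast_first = args
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "nwm_feature_id": rng.integers(0, 3_000_000, size=1000),
        "data": rng.random(1000),
    })
    if cast_first:
        # Cast before concatenation, inside the worker process.
        df["nwm_feature_id"] = df["nwm_feature_id"].astype(str)
    return df

def run(cast_first):
    start = time.perf_counter()
    with Pool(5) as pool:
        frames = pool.map(make_frame, [(i, cast_first) for i in range(10_000)])
    df = pd.concat(frames, ignore_index=True)
    if not cast_first:
        # Cast after concatenation, in the parent process.
        df["nwm_feature_id"] = df["nwm_feature_id"].astype(str)
    return time.perf_counter() - start

if __name__ == "__main__":
    print(f"cast before concat: {run(True):.1f}s")
    print(f"cast after concat:  {run(False):.1f}s")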

I understand the efficiency arguments for caching nwm_feature_id as int. My reservations are related to usgs_site_code values, which are also numbers but should not be stored as int (due to the leading-zero problem). I'm concerned about maintaining a special separate standard at different parts of the pipeline for each location identifier (and any potential future identifiers that may also be number-based, but not actual integers).

@jarq6c
Collaborator

jarq6c commented Jan 26, 2021

I also had notes on memory usage for the resulting DataFrame:
str: 647.4 MB
int: 152.6 MB
str as Category: 95.5 MB
int as Category: 95.4 MB
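
Numbers like these can be reproduced with pandas.Series.memory_usage(deep=True); a self-contained sketch using synthetic IDs in place of the real ~2.7 million reaches:

import numpy as np
import pandas as pd

def mb(s: pd.Series) -> float:
    # Deep memory footprint of a Series, in MB.
    return s.memory_usage(deep=True) / 1e6

# Synthetic stand-in for the ~2.7 million reach IDs.
ids = pd.Series(np.random.default_rng(0).integers(0, 3_000_000, size=2_700_000))

print("str:            ", mb(ids.astype(str)))
print("int:            ", mb(ids))
print("str as Category:", mb(ids.astype(str).astype("category")))
print("int as Category:", mb(ids.astype("category")))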

@aaraney
Member Author

aaraney commented Jan 26, 2021

> I understand the efficiency arguments for caching nwm_feature_id as int. My reservations are related to usgs_site_code values, which are also numbers but should not be stored as int (due to the leading-zero problem). I'm concerned about maintaining a special separate standard at different parts of the pipeline for each location identifier (and any potential future identifiers that may also be number-based, but not actual integers).

Right, I agree with you. My plan was never to include the usgs_site_code in a cached version since crosswalking and merging is so cheap computationally. I share your concern about splitting from the canonical format, just trying to figure out what makes the most sense moving forward.

@aaraney
Member Author

aaraney commented Jan 26, 2021

I was able to get the time down to ~34 seconds doing the following:

import numpy as np

max_feature_id = df["nwm_feature_id"].max()
max_digit_length = int(np.log10(max_feature_id)) + 1
df["nwm_feature_id"] = df["nwm_feature_id"].values.astype(f"|S{max_digit_length}")

@jarq6c
Collaborator

jarq6c commented Jan 26, 2021

> I was able to get the time down to ~34 seconds doing the following:
>
> max_feature_id = df["nwm_feature_id"].max()
> max_digit_length = int(np.log10(max_feature_id)) + 1
> df["nwm_feature_id"] = df["nwm_feature_id"].values.astype(f"|S{max_digit_length}")

Yuck! But also, clever! 😆 Nice!

@aaraney
Member Author

aaraney commented Jan 26, 2021

The only cause for concern is that nwm_feature_id is now a byte type, not a decoded string. Going back to the drawing board. I found a really thorough and well-explained Stack Overflow post comparing astype and apply(<T>) that may be useful in the future, or just good to know about.
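
As a side note, a one-line sketch of what decoding those bytes back to unicode would take (another full pass over the column via pandas' .str.decode, assuming the |S-cast column from above):

# Decoding the |S byte strings back to unicode requires another full pass.
df["nwm_feature_id"] = df["nwm_feature_id"].str.decode("utf-8")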

@hellkite500
Member

Might try writing this as a vectorized copy and see if it helps performance

    # 3. This had the biggest impact on wall time. With casting, for a full AnA cycle, goes to ~50 seconds
    # Convert feature IDs to strings
    # df["nwm_feature_id"] = df["nwm_feature_id"].astype(str)

To something like

df["nwm_feature_id_str"] = str(df["nwm_feature_id"])
df.drop("nwm_feature_id")
df.rename({"nwm_feature_id_str":"nwm_feature_id"})

I'm curious if the type assignment from one column type to another is causing some significant slowdown on data that large. Essentially it has to create a tmp buffer, copy the new transformed data to it, and then reassign that index. Maybe something like the above can speed that up a bit, but YMMV.

@aaraney
Member Author

aaraney commented Jan 26, 2021

I tried something similar to that, but instead of casting with the built-in str, I used pandas' astype. Still hovering at 50 seconds.

Edit: Remember English

@aaraney
Member Author

aaraney commented Jan 28, 2021

I've spent some more time working on this and I think I may have come to a fair compromise solution. Just to recap a few of the things I've tried since I last commented:

  1. Using cython to cast from an int to a string. I found that df.apply(str) was still slightly faster with a small array, and df.apply(str) became the clear winner as the array size increased.
  2. Using numpy.char.mod("%s", ds["streamflow"].feature_id.values) showed comparable speed to

max_feature_id = df["nwm_feature_id"].max()
max_digit_length = int(np.log10(max_feature_id)) + 1
df["nwm_feature_id"] = df["nwm_feature_id"].values.astype(f"|S{max_digit_length}")

but also returned encoded bytes rather than unicode strings.

  3. Next, I tried np.char.array(ds["streamflow"].feature_id.values, copy=False, unicode=True). This was faster than df.apply(str) on small arrays, but as the array size and string length grew, it became slower than the df.apply(str) approach: ~39 seconds when retrieving an EAnA cycle.
  4. Lastly, I tried pd.Categorical(ds["streamflow"].feature_id.values). This implementation took ~33 seconds for an entire EAnA cycle. I think it is preferable to 1, 2, and 3 because the nwm_feature_id values end up in a type that is recognizable and does not require further casting by the user. It's also pretty fast. As noted in Document "gotchas" when dealing with Categorical dtype #25, there are known issues with categorical data; however, I think these issues are outweighed by the functionality of the categorical dtype. Namely, you can compare integers to categorical types without issue. Then again, what do we gain from casting these fields to categorical values? If our only two main advantages are comparison to integer types and not being a byte string, I don't see what the benefit is over leaving the values as ints in the first place. I think the best argument is that by using the categorical datatype, we minimize wall time and avoid altering the canonical format.

@jarq6c
Collaborator

jarq6c commented Jan 28, 2021

> If our only two main advantages are comparison to integer types and not being a byte string, I don't see what the benefit is over leaving the values as ints in the first place. I think the best argument is that by using the categorical datatype, we minimize wall time and avoid altering the canonical format.

The categorical dtype flexes its strength more on large dataframes. I found categories beat both str and int in terms of memory efficiency (and storage as HDF) by a considerable margin on large dataframes. I also found that str categories are comparable to int categories. The added benefit of str categories is that we can use a consistent dtype for all location IDs.

From a previous comment:

> str: 647.4 MB
> int: 152.6 MB
> str as Category: 95.5 MB
> int as Category: 95.4 MB

@aaraney
Member Author

aaraney commented Jan 28, 2021

Those memory savings are hard to argue against. Do you know if you can specify the category type (str or int) prior to the astype cast? That is, can you get a str-as-Category from an int without having to cast to str first? From my reading of pandas.Categorical, I am not seeing a straight path forward.

@jarq6c
Collaborator

jarq6c commented Jan 28, 2021

It appears that you can be explicit about the CategoricalDtype. However, I don't think there's a way around performing the initial cast to str (via astype, apply, etc). There may be differences between chaining apply or astype or using a lambda expression.

Another alternative might be to use xarray.DataArray.astype or similar function on the original dataset.
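
For illustration, a runnable sketch of being explicit about the dtype (CategoricalDtype lives in pandas.api.types; the toy dataframe is a stand-in for the real one):

import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({"nwm_feature_id": [101, 20002, 300003]})  # toy stand-in

# The initial int -> str cast still has to happen somewhere.
str_ids = df["nwm_feature_id"].astype(str)

# Being explicit about the categorical dtype itself.
df["nwm_feature_id"] = str_ids.astype(CategoricalDtype(ordered=False))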

@aaraney
Member Author

aaraney commented Jan 28, 2021

> It appears that you can be explicit about the CategoricalDtype. However, I don't think there's a way around performing the initial cast to str (via astype, apply, etc). There may be differences between chaining apply or astype or using a lambda expression.

Right, that was my thinking as well.

I am giving your second suggestion a run at the moment.

@aaraney
Member Author

aaraney commented Jan 28, 2021

So I tried the following:

df = pd.DataFrame(
    {
        # "nwm_feature_id": pd.Categorical(
        #     ds["streamflow"].feature_id.values
        # ),
        "nwm_feature_id": ds["streamflow"].feature_id.astype(str),
        "value": ds["streamflow"].values,
    }
)

df["nwm_feature_id"] = df["nwm_feature_id"].astype("category")

It took ~183 seconds. I did not report it earlier, but I found early on that casting the values while they are still in an xarray.DataArray is slow.

@jarq6c
Collaborator

jarq6c commented Jan 28, 2021

What about merging your above solution with the underlying numpy method? Something like:

"nwm_feature_id": pd.Categorical(ds["streamflow"].feature_id.values.astype(f"|S{max_digit_length}")

@aaraney
Member Author

aaraney commented Jan 28, 2021

I'll give that a go and see if it improves anything. I just tested this:

df = pd.DataFrame(
    {
        "nwm_feature_id": ds["streamflow"].feature_id.values,
        "value": ds["streamflow"].values,
    }
)

df["nwm_feature_id"] = df["nwm_feature_id"].astype("category")
df["nwm_feature_id"].cat.rename_categories(
    df["nwm_feature_id"].cat.codes.apply(str), inplace=True
)

It was the slowest of the bunch so far, ~212 seconds lol.

@aaraney
Member Author

aaraney commented Jan 28, 2021

Unfortunately,

"nwm_feature_id": pd.Categorical(ds["streamflow"].feature_id.values.astype(f"|S{max_digit_length}")

took ~148 seconds.

@jarq6c
Collaborator

jarq6c commented Apr 21, 2021

Basic caching implemented for this subpackage with #66

Closing this issue here. We can discuss the future of dataframe caching on another ticket.

@jarq6c jarq6c closed this as completed Apr 21, 2021