
Cache GCP client responses to reduce repeated operations and network calls #19

Closed
aaraney opened this issue Jan 22, 2021 · 26 comments
Labels: enhancement New feature or request

@aaraney
Member

aaraney commented Jan 22, 2021

Each time the GCP client is used, it hits GCP to fetch the requested data. Given the size of the data and how often that process is repeated, it only makes sense to implement some kind of cache. I propose that we use a file db (e.g., SQLite) to accomplish this, for simplicity and broad support in Python.

High-level logic

  1. Check if the db cache has the desired data (stored as a df)
    1. Yes - return the data
    2. No - continue
  2. Get the data from GCP
  3. Create a df from the GCP data
  4. Cache this df in an SQLite db using the URL path as the key
  5. Return the df

Requirements

  • The SQLite lib must support multiprocessing and batch commits

The sqlitedict library is mature, maintained, and seems to fit the bill for this feature. The lib lets you create/connect to a db and use it like you would a Python dictionary. Most importantly, it supports multiprocessing.
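
To make the high-level logic above concrete, here is a minimal sketch of steps 1-5 with sqlitedict. Note that get_nwm_bytes and bytes_to_dataframe are hypothetical placeholders for the existing GCP client call and NetCDF parsing code:

import pandas as pd
from sqlitedict import SqliteDict

def get_dataframe(url: str, cache_path: str = "gcp_cache.sqlite") -> pd.DataFrame:
    with SqliteDict(cache_path) as cache:
        # 1. Check if the db cache has the desired data (stored as a df),
        #    keyed by the URL path.
        if url in cache:
            return cache[url]
        # 2./3. Get the data from GCP and build a df from it.
        raw = get_nwm_bytes(url)        # hypothetical GCP client call
        df = bytes_to_dataframe(raw)    # hypothetical NetCDF-to-df parser
        # 4. Cache the df, then persist it with a commit.
        cache[url] = df
        cache.commit()
    # 5. Return the df.
    return df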

@aaraney aaraney added the enhancement New feature or request label Jan 22, 2021
@aaraney aaraney self-assigned this Jan 22, 2021
@jarq6c
Collaborator

jarq6c commented Jan 22, 2021 via email

@aaraney
Member Author

aaraney commented Jan 22, 2021

> My inclination is to cache processed dataframes.

Likewise.

> GCP is pretty resilient, but there's a major bottleneck when it comes to processing the NetCDF files. I'd like to see if we can optimize that aspect first, then explore caching options.

Okay, I will look into optimizing that before moving on. Can you speculate on the bottleneck you mentioned?

@jarq6c
Collaborator

jarq6c commented Jan 22, 2021 via email

@aaraney
Member Author

aaraney commented Jan 22, 2021

Alright, that's helpful. I will fool around with it and see if I can find any improvement in performance.

@aaraney
Member Author

aaraney commented Jan 26, 2021

I took a look at the performance issues on Friday and found a few places where we can make substantial improvements. For an initial baseline, I timed retrieving an analysis and assimilation cycle without filtering the results (all 2.7 million reaches included). That took ~48 seconds. Not horrible, but also not great. To get a sense of what the bottleneck might be, I ran the same test filtered to just USGS gauges, which took substantially less time, ~24 seconds.

Before I elaborate on the findings, I was able to get the time down to ~20 seconds for pulling and processing all 2.7 million reaches for a single AnA cycle. To me, that is a success, especially if we are to cache those results. Enough of the good news and on to the findings...

I found three places where we can substantially improve performance:

  1. Dropping variables that we are not concerned with at xarray dataset creation time.
  2. Not decoding the time fields at xarray dataset creation time.
  3. After the xarray dataset and the pandas df object are created, not casting the nwm_feature_id field to a string. This had the largest effect on performance; it cut wall time pretty much in half.
import pandas as pd
import xarray as xr

def NWM_bytes_to_DataFrame(
    path,
) -> pd.DataFrame:
    vars_to_drop = [
        "crs",
        "nudge",
        "velocity",
        "qSfcLatRunoff",
        "qBucket",
        "qBtmVertRunoff",
    ]
    # Load data as xarray DataSet
    with xr.load_dataset(
        path,
        engine="h5netcdf",
        mask_and_scale=False,
        drop_variables=vars_to_drop, #1
        decode_cf=False, #2
    ) as ds:

        # Extract streamflow data to pandas DataFrame
        df = pd.DataFrame(
            {
                "nwm_feature_id": ds["streamflow"].feature_id.values,
                "value": ds["streamflow"].values,
            }
        )

        # Scale data
        scale_factor = ds["streamflow"].scale_factor[0]
        value_date = pd.to_datetime(ds.time.values[0], unit="m")
        start_date = pd.to_datetime(ds.reference_time.values[0], unit="m")

    df.loc[:, "value"] = df["value"].mul(scale_factor)

    # 3. This had the biggest impact on wall time. With casting, for a full AnA cycle, goes to ~50 seconds
    # Convert feature IDs to strings
    # df["nwm_feature_id"] = df["nwm_feature_id"].astype(str)

    # Extract valid datetime
    df["value_date"] = value_date

    # Extract reference datetime
    df["start_date"] = start_date

    return df

Now with these findings presented, my only question is: given that the cast from int to str of the nwm_feature_id field has the biggest impact on performance, what are your thoughts on leaving the field as an int? This would break with the canonical dataframe format.

@jarq6c
Collaborator

jarq6c commented Jan 26, 2021

At the model level, it makes sense to store nwm_feature_id as an int. However, for this use case my preference is that all location identifiers be str, then pandas Categorical.

Having said that, I believe the transformation from int to str can be more efficiently executed outside the NWM_bytes_to_DataFrame method (after calling pandas.concat elsewhere).

Additionally, I wonder if there is any performance to be gained from using the xarray.Dataset.to_dataframe method (see the docs). Hand-crafting a pandas.DataFrame apparently comes with significant overhead.
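
For reference, a rough sketch of what that alternative could look like, assuming the same ds handle opened in NWM_bytes_to_DataFrame above (the rename to our canonical column names is ours, not xarray's):

# Sketch: let xarray build the DataFrame, then trim and rename columns.
df = (
    ds["streamflow"]
    .to_dataframe()
    .reset_index()
    .rename(columns={"feature_id": "nwm_feature_id", "streamflow": "value"})
    [["nwm_feature_id", "value"]]
)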

@aaraney
Member Author

aaraney commented Jan 26, 2021

I agree that for the end user, storing them as strings makes more sense, and it is just the standard that we have in place. Personally, I think it makes sense to cache results as ints. I would imagine that int comparisons would be faster than str comparisons; then again, I guess the question is whether an int comparison plus a cast to str is faster than just a str comparison. I may be changing my mind, but then again, the overall size of the cache would be much higher if we store nwm_feature_id as a string.

> Having said that, I believe the transformation from int to str can be more efficiently executed outside the NWM_bytes_to_DataFrame method (after calling pandas.concat elsewhere).

Potentially, but I am not so sure of that. Given that a copy of the data will be made in memory when cast from int to str, I would think (in most cases) it would actually be slower to cast after concatenating the results. As of now, we have a kind of divide and conquer strategy that should be slightly faster.

> Additionally, I wonder if there is any performance to be gained from using the xarray.Dataset.to_dataframe method (see the docs). Hand-crafting a pandas.DataFrame apparently comes with significant overhead.

I've tested that hypothesis, though I would not say robustly, and found that it is actually not the case. My thinking is that passing the underlying numpy array to the pandas.DataFrame constructor does not copy the data in memory until you change it. But I could be wrong about that. I'll see what I can find in the source for my own curiosity's sake. But to your point, this is definitely worth more investigation than I've done so far.

@aaraney
Member Author

aaraney commented Jan 26, 2021

Having looked into the source, I don't think that we will see any performance improvements using to_dataframe() for this problem. I've included the method xarray uses to create the dataframe below. Our code should be minutely faster since we also use the underlying numpy array to construct the df, just in less code.

# from xarray.core.dataset.Dataset
# https://github.com/pydata/xarray/blob/a0c71c1508f34345ad7eef244cdbbe224e031c1b/xarray/core/dataset.py#L4970
def _to_dataframe(self, ordered_dims: Mapping[Hashable, int]):
    columns = [k for k in self.variables if k not in self.dims]
    data = [
        self._variables[k].set_dims(ordered_dims).values.reshape(-1)
        for k in columns
    ]
    index = self.coords.to_index([*ordered_dims])
    return pd.DataFrame(dict(zip(columns, data)), index=index)

@jarq6c
Collaborator

jarq6c commented Jan 26, 2021

> Potentially, but I am not so sure of that. Given that a copy of the data will be made in memory when cast from int to str, I would think (in most cases) it would actually be slower to cast after concatenating the results. As of now, we have a kind of divide and conquer strategy that should be slightly faster.

As an experiment I ran a test on my personal laptop that compared creating and concatenating 10,000 dataframes each with 1000 rows and two columns (nwm_feature_id and data) using multiprocessing (5 processes). I found casting nwm_feature_id to str after concatenation was 50% slower than before concatenation. At least on my laptop, it seems your intuition is correct.
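
For anyone who wants to repeat it, a rough sketch of such an experiment (helper names and sizes mirror the description above; exact timings will vary by machine):

import time
from multiprocessing import Pool

import numpy as np
import pandas as pd

def make_frame(args):
    seed, cast_first = args
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "nwm_feature_id": rng.integers(0, 3_000_000, size=1000),
        "data": rng.random(1000),
    })
    if cast_first:
        # Cast before concatenation, inside the worker process.
        df["nwm_feature_id"] = df["nwm_feature_id"].astype(str)
    return df

def run(cast_first):
    start = time.perf_counter()
    with Pool(5) as pool:
        frames = pool.map(make_frame, [(i, cast_first) for i in range(10_000)])
    df = pd.concat(frames, ignore_index=True)
    if not cast_first:
        # Cast after concatenation, in the parent process.
        df["nwm_feature_id"] = df["nwm_feature_id"].astype(str)
    return time.perf_counter() - start

if __name__ == "__main__":
    print(f"cast before concat: {run(True):.1f}s")
    print(f"cast after concat:  {run(False):.1f}s")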

I understand the efficiency arguments for caching nwm_feature_id as int. My reservations are related to usgs_site_code values, which are also numbers but should not be stored as int (due to the leading-zero problem). I'm concerned about maintaining a special separate standard at different parts of the pipeline for each location identifier (and any potential future identifiers that may also be number-based, but not actual integers).

@jarq6c
Collaborator

jarq6c commented Jan 26, 2021

I also had notes on memory usage for the resulting DataFrame:
str: 647.4 MB
int: 152.6 MB
str as Category: 95.5 MB
int as Category: 95.4 MB
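
Numbers like these can be reproduced with pandas.Series.memory_usage(deep=True); a self-contained sketch using synthetic IDs in place of the real ~2.7 million reaches:

import numpy as np
import pandas as pd

def mb(s: pd.Series) -> float:
    # Deep memory footprint of a Series, in MB.
    return s.memory_usage(deep=True) / 1e6

# Synthetic stand-in for the ~2.7 million reach IDs.
ids = pd.Series(np.random.default_rng(0).integers(0, 3_000_000, size=2_700_000))

print("str:            ", mb(ids.astype(str)))
print("int:            ", mb(ids))
print("str as Category:", mb(ids.astype(str).astype("category")))
print("int as Category:", mb(ids.astype("category")))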

@aaraney
Member Author

aaraney commented Jan 26, 2021

> I understand the efficiency arguments for caching nwm_feature_id as int. My reservations are related to usgs_site_code values, which are also numbers but should not be stored as int (due to the leading-zero problem). I'm concerned about maintaining a special separate standard at different parts of the pipeline for each location identifier (and any potential future identifiers that may also be number-based, but not actual integers).

Right, I agree with you. My plan was never to include the usgs_site_code in a cached version since crosswalking and merging is so cheap computationally. I share your concern about splitting from the canonical format, just trying to figure out what makes the most sense moving forward.

@aaraney
Member Author

aaraney commented Jan 26, 2021

I was able to get the time down to ~34 seconds doing the following:

import numpy as np

max_feature_id = df["nwm_feature_id"].max()
max_digit_length = int(np.log10(max_feature_id)) + 1
df["nwm_feature_id"] = df["nwm_feature_id"].values.astype(f"|S{max_digit_length}")

@jarq6c
Collaborator

jarq6c commented Jan 26, 2021

> I was able to get the time down to ~34 seconds doing the following:
>
> max_feature_id = df["nwm_feature_id"].max()
> max_digit_length = int(np.log10(max_feature_id)) + 1
> df["nwm_feature_id"] = df["nwm_feature_id"].values.astype(f"|S{max_digit_length}")

Yuck! But also, clever! 😆 Nice!

@aaraney
Member Author

aaraney commented Jan 26, 2021

The only cause for concern is that nwm_feature_id is now a byte type, not a decoded string. Going back to the drawing board. I found a really thorough and well-explained Stack Overflow post comparing astype and apply(<T>) that may be useful in the future, or just good to know about.
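
As a side note, a one-line sketch of what decoding those bytes back to unicode would take (another full pass over the column via pandas' .str.decode, assuming the |S-cast column from above):

# Decoding the |S byte strings back to unicode requires another full pass.
df["nwm_feature_id"] = df["nwm_feature_id"].str.decode("utf-8")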

@hellkite500
Member

Might try writing this as a vectorized copy and see if it helps performance

    # 3. This had the biggest impact on wall time. With casting, for a full AnA cycle, goes to ~50 seconds
    # Convert feature IDs to strings
    # df["nwm_feature_id"] = df["nwm_feature_id"].astype(str)

To something like

df["nwm_feature_id_str"] = str(df["nwm_feature_id"])
df.drop("nwm_feature_id")
df.rename({"nwm_feature_id_str":"nwm_feature_id"})

I'm curious if the type assignment from one column type to another is causing some significant slowdown on data that large. Essentially it has to create a tmp buffer, copy the new transformed data to it, and then reassign that index. Maybe something like the above can speed that up a bit, but YMMV.

@aaraney
Member Author

aaraney commented Jan 26, 2021

I tried something similar to that, but instead of casting with the built-in str, I used pandas' astype. Still hovering at 50 seconds.

Edit: Remember English

@aaraney
Member Author

aaraney commented Jan 28, 2021

I've spent some more time working on this and I think I may have come to a fair compromise solution. Just to recap a few of the things I've tried since I last commented:

  1. Using cython to cast from an int to a string. I found that df.apply(str) was still slightly faster with a small array, and df.apply(str) became the clear winner as the array size increased.
  2. Using numpy.char.mod("%s", ds["streamflow"].feature_id.values) showed comparable speed to

max_feature_id = df["nwm_feature_id"].max()
max_digit_length = int(np.log10(max_feature_id)) + 1
df["nwm_feature_id"] = df["nwm_feature_id"].values.astype(f"|S{max_digit_length}")

but also returned encoded bytes rather than unicode strings.

  3. Next, I tried np.char.array(ds["streamflow"].feature_id.values, copy=False, unicode=True). This was faster than df.apply(str) on small arrays, but as the array size and string length grew, it became slower than the df.apply(str) approach: ~39 seconds when retrieving an EAnA cycle.
  4. Lastly, I tried pd.Categorical(ds["streamflow"].feature_id.values). This implementation took ~33 seconds for an entire EAnA cycle. I think it is preferable to 1, 2, and 3 because the nwm_feature_id values end up in a type that is recognizable and does not require further casting by the user. It's also pretty fast. As noted in Document "gotchas" when dealing with Categorical dtype #25, there are known issues with categorical data; however, I think these issues are outweighed by the functionality of the categorical dtype. Namely, you can compare integers to categorical types without issue. Then again, what do we gain from casting these fields to categorical values? If our only two main advantages are comparison to integer types and not being a byte string, I don't see what the benefit is over leaving the values as ints in the first place. I think the best argument is that by using the categorical datatype, we minimize wall time and avoid altering the canonical format.

@jarq6c
Collaborator

jarq6c commented Jan 28, 2021

> If our only two main advantages are comparison to integer types and not being a byte string, I don't see what the benefit is over leaving the values as ints in the first place. I think the best argument is that by using the categorical datatype, we minimize wall time and avoid altering the canonical format.

The categorical dtype flexes its strength more on large dataframes. I found categories beat both str and int in terms of memory efficiency (and storage as HDF) by a considerable margin on large dataframes. I also found that str categories are comparable to int categories. The added benefit of str categories is that we can use a consistent dtype for all location IDs.

From a previous comment:

> str: 647.4 MB
> int: 152.6 MB
> str as Category: 95.5 MB
> int as Category: 95.4 MB

@aaraney
Member Author

aaraney commented Jan 28, 2021

Those memory savings are hard to argue against. Do you know if you can specify the category type (str or int) prior to the astype cast? That is, can you get a str-as-Category from an int without having to cast to str first? From my reading of pandas.Categorical, I am not seeing a straight path forward.

@jarq6c
Collaborator

jarq6c commented Jan 28, 2021

It appears that you can be explicit about the CategoricalDtype. However, I don't think there's a way around performing the initial cast to str (via astype, apply, etc). There may be differences between chaining apply or astype or using a lambda expression.

Another alternative might be to use xarray.DataArray.astype or similar function on the original dataset.
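
For illustration, a runnable sketch of being explicit about the dtype (CategoricalDtype lives in pandas.api.types; the toy dataframe is a stand-in for the real one):

import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({"nwm_feature_id": [101, 20002, 300003]})  # toy stand-in

# The initial int -> str cast still has to happen somewhere.
str_ids = df["nwm_feature_id"].astype(str)

# Being explicit about the categorical dtype itself.
df["nwm_feature_id"] = str_ids.astype(CategoricalDtype(ordered=False))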

@aaraney
Member Author

aaraney commented Jan 28, 2021

> It appears that you can be explicit about the CategoricalDtype. However, I don't think there's a way around performing the initial cast to str (via astype, apply, etc). There may be differences between chaining apply or astype or using a lambda expression.

Right, that was my thinking as well.

I am giving your second suggestion a run at the moment.

@aaraney
Member Author

aaraney commented Jan 28, 2021

So I tried the following:

df = pd.DataFrame(
    {
        # "nwm_feature_id": pd.Categorical(
        #     ds["streamflow"].feature_id.values
        # ),
        "nwm_feature_id": ds["streamflow"].feature_id.astype(str),
        "value": ds["streamflow"].values,
    }
)

df["nwm_feature_id"] = df["nwm_feature_id"].astype("category")

It took ~183 seconds. I did not report it earlier, but I found early on that casting the values while they are still in an xarray.DataArray is slow.

@jarq6c
Collaborator

jarq6c commented Jan 28, 2021

What about merging your above solution with the underlying numpy method? Something like:

"nwm_feature_id": pd.Categorical(ds["streamflow"].feature_id.values.astype(f"|S{max_digit_length}")

@aaraney
Member Author

aaraney commented Jan 28, 2021

I'll give that a go and see if it improves anything. I just tested this:

df = pd.DataFrame(
    {
        "nwm_feature_id": ds["streamflow"].feature_id.values,
        "value": ds["streamflow"].values,
    }
)

df["nwm_feature_id"] = df["nwm_feature_id"].astype("category")
df["nwm_feature_id"].cat.rename_categories(
    df["nwm_feature_id"].cat.codes.apply(str), inplace=True
)

It was the slowest of the bunch so far, ~212 seconds lol.

@aaraney
Member Author

aaraney commented Jan 28, 2021

Unfortunately,

"nwm_feature_id": pd.Categorical(ds["streamflow"].feature_id.values.astype(f"|S{max_digit_length}")

took ~148 seconds.

@jarq6c
Collaborator

jarq6c commented Apr 21, 2021

Basic caching implemented for this subpackage with #66

Closing this issue here. We can discuss the future of dataframe caching on another ticket.

@jarq6c jarq6c closed this as completed Apr 21, 2021