Download profiles from blob storage #419

Merged
jenhagg merged 12 commits into develop from jon/profiles
Mar 23, 2021
Conversation

@jenhagg
Collaborator

@jenhagg jenhagg commented Mar 18, 2021

Purpose

Use blob storage as the source for profiles, downloading as needed.

What the code is doing

Added a version.json file to support listing available versions. The profiles are in the raw/usa_tamu folder in the blob container to mimic the structure elsewhere. Similar to before, we download if it doesn't exist locally.

Testing

Deleted local copies of the profiles and ran a simulation using plug, which shows the files being downloaded (during the call to prepare_simulation_input).

Usage Example/Visuals

Example query (no change in usage, other than switching to a static method)

In [18]: InputData.get_profile_version("usa_tamu", "solar")
Out[18]: ['vJan2021']

Snippet of files being downloaded

In [7]: scenario.state.prepare_simulation_input()
---------------------------
PREPARING SIMULATION INPUTS
---------------------------
--> Creating temporary folder on server for simulation inputs
--> Loading demand
demand_vJan2021.csv not found in /root/ScenarioData/ on local machine
--> Downloading demand_vJan2021.csv from blob storage.
100%|████████████████████████████████████████████████████████████████████████████| 5.15M/5.15M [00:00<00:00, 6.17MB/s]
--> Done!
Multiply demand in Far West (#301) by 1.09
/usr/local/lib/python3.8/site-packages/pandas/core/indexing.py:1843: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item_labels[indexer[info_axis]]] = value
Multiply demand in North (#302) by 1.09
Multiply demand in West (#303) by 1.09
Multiply demand in South (#304) by 1.09
Multiply demand in North Central (#305) by 1.09
Multiply demand in South Central (#306) by 1.09
Multiply demand in Coast (#307) by 1.09
Multiply demand in East (#308) by 1.09
Writing scaled demand profile in /root/ScenarioData/ on local machine
--> Moving file /root/ScenarioData/1_demand.csv to /mnt/bes/pcm/tmp/scenario_1/demand.csv
--> Deleting original copy
--> Loading hydro
hydro_vJan2021.csv not found in /root/ScenarioData/ on local machine
--> Downloading hydro_vJan2021.csv from blob storage.
100%|██████████████████████████████████████████████████████████████████████████████| 222M/222M [00:40<00:00, 5.74MB/s]
--> Done!

Time estimate

20 mins

@jenhagg jenhagg added this to the Put Your Records On milestone Mar 18, 2021
@jenhagg jenhagg linked an issue Mar 18, 2021 that may be closed by this pull request
@jenhagg jenhagg self-assigned this Mar 18, 2021
@jenhagg jenhagg requested review from ahurli and rouille March 18, 2021 19:53
@rouille rouille requested a review from danielolsen March 18, 2021 21:17
@rouille
Collaborator

rouille commented Mar 18, 2021

Two comments about the code snippet:

  • --> Creating temporary folder on server for simulation inputs. Now that we are no longer restricted to the client/server installation, I believe we can remove the on server part of the printout.
  • there is a warning:
Multiply demand in Far West (#301) by 1.09
/usr/local/lib/python3.8/site-packages/pandas/core/indexing.py:1843: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

I believe it is raised in the _get_demand_profile method of the TransformProfile class. I remember seeing it before, but I rarely run scenarios. Does it ring a bell, @danielolsen?

These comments are not related to your PR, though; they just happen to show up in the code snippet.

@danielolsen
Contributor

@rouille I'm familiar with that warning, it's something that's not very hard to track down and fix.

Comment thread powersimdata/input/input_data.py Outdated
@danielolsen
Contributor

I think the warning can be fixed by changing L.106 of transform_profile.py from:

demand = self._input_data.get_data(self.scenario_info, "demand")[zone_id]

to:

demand = self._input_data.get_data(self.scenario_info, "demand").loc[:, zone_id]
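To illustrate the difference with a minimal sketch (a hypothetical demand frame, not the actual TransformProfile code): chained indexing like demand[zone_id] leaves pandas unsure whether the result is a view or a copy, so a later assignment can raise SettingWithCopyWarning, while .loc makes the column selection explicit.

```python
import pandas as pd

# Hypothetical demand frame: rows are hours, columns are zone ids.
demand = pd.DataFrame({301: [1.0, 2.0], 302: [3.0, 4.0]})
zone_id = [301]

# Chained indexing (demand[zone_id]) followed by assignment can trigger
# SettingWithCopyWarning; .loc selects the columns unambiguously.
zone_demand = demand.loc[:, zone_id]
scaled = zone_demand * 1.09  # scale demand, as in the log above
```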

@jenhagg
Collaborator Author

jenhagg commented Mar 19, 2021

I think the warning can be fixed by changing L.106 of transform_profile.py

Nice, that works - no warning when I re-ran the same test.

Comment thread powersimdata/input/input_data.py Outdated
filename = [os.path.basename(line.rstrip()) for line in stdout.readlines()]
version = [f[f.rfind("_") + 1 : -4] for f in filename]
return version
resp = requests.get(f"{BASE_URL}/version.json")
Contributor

Do we want users to be able to use their own profiles that only exist locally? I'm thinking that in the previous Docker implementation, users could just add a new profile (like a new demand profile to test) to the ScenarioData directory and be able to use it, whereas now they'll either need to add it to our blob storage or create their own.

Collaborator

I think we are just downloading a JSON file listing the profile versions here. Otherwise the logic we had before is still in place, i.e., we only download if there is no local copy. Also, in plug, I would say the user can have their own profiles and use them in the various scripts via the set_base_profile method.

Contributor

What would get added to the ScenarioList if the user brought their own profile? "untracked"?

Collaborator Author

@jenhagg jenhagg Mar 19, 2021

Do we want users to be able to use their own profiles that only exist locally?

Good point, there isn't a natural way to use a custom profile if we are strictly checking the version.json in blob storage. I think we'd need to do something like cache that file locally so it can be edited, or check both a local copy and the one in blob storage, etc. The profiles can still be put in the ScenarioData folder locally, but they would need a version that is considered valid in order to actually be used.

Collaborator Author

What would get added to the ScenarioList if the user brought their own profile? "untracked"?

Yeah it would get added to the user's ScenarioList. So their environment would be consistent, but couldn't be easily "merged" with another one (e.g. our server environment). I was going to suggest something like adding full references to files in the scenario list, but can't do much if the sources aren't public, e.g. in some other blob storage. Not sure if this is an issue at this point?

Contributor

Since DataAccess differentiates between configurations, we could add something like data_access.get_local_profiles(kind) that returns nothing for SSHDataAccess but returns any additional profiles for LocalDataAccess. We could call and append the results in InputData.

Depending on how fancy we want to get, we could also pass in the profiles already found in blob storage to both a) mark which profiles have been cached if a user wants to be careful of their internet usage and b) filter out only the extra profiles not in blob storage.

Collaborator Author

Cool, I ended up thinking the same thing (my other ideas didn't pan out). Not sure I understand the part about passing in the profiles though?

Contributor

I was first thinking of passing in the profiles like data_access.get_local_profiles(kind, blob_profiles), but looking at it now, there's no reason we can't do that filtering/marking at the InputData level instead of at the DataAccess level.

Contributor

Actually, sorry, I remembered what I was thinking. You would want to pass in the blob profiles so that in the SSHDataAccess instance, you could return the locally cached profiles but exclude anything not in the blob profiles list. In the LocalDataAccess instance, you would return both cached and local-only profiles.

Collaborator Author

Ah thanks, that makes sense. Let me push a commit with how I'm thinking of implementing that.
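The filtering idea above can be sketched as follows (function names are hypothetical illustrations, not the actual DataAccess API): given the versions known to blob storage and the versions cached locally, the SSH configuration surfaces only cached copies that blob storage knows about, while the local configuration also surfaces local-only profiles.

```python
# Hypothetical helpers sketching the design discussed above.

def ssh_versions(cached, blob_versions):
    # client/server: only report cached copies that blob storage knows about
    return [v for v in cached if v in blob_versions]

def local_versions(cached, blob_versions):
    # local/container: cached copies plus any local-only profiles
    return list(cached)

blob = ["vJan2021"]
cached = ["vJan2021", "vCustom"]
# ssh_versions(cached, blob)   -> ["vJan2021"]
# local_versions(cached, blob) -> ["vJan2021", "vCustom"]
```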

Comment thread powersimdata/data_access/profile_helper.py Outdated
@rouille
Collaborator

rouille commented Mar 20, 2021

I think with 1f9291f, we have the best of both worlds. Did you have a chance to test it with both the client/server installation and the containers?

@jenhagg
Collaborator Author

jenhagg commented Mar 20, 2021

I think with 1f9291f, we have the best of both worlds. Did you have a chance to test it with both the client/server installation and the containers?

So far only a quick test in the container, before and after adding a local version.json. That worked as expected, and the client/server behavior should be a subset of that (only checks blob storage). Planning to do a bit more, and look into adding unit tests, which was the original reason for creating parse_version without the HTTP request.

Comment thread powersimdata/data_access/profile_helper.py Outdated
Comment thread powersimdata/data_access/data_access.py
raise NotImplementedError

def get_profile_version(self, grid_model, kind):
return ProfileHelper.get_profile_version_cloud(grid_model, kind)
Collaborator

We implicitly return the versions stored on the cloud. I see how it is useful but is that intuitive in the DataAccess class? Should we return instead an instance of ProfileHelper and call the functions in the child classes?

Collaborator Author

I was thinking since blob storage is the source of truth, it makes sense as the default. Another way would be to raise NotImplementedError here and have the child classes contain their own implementations (SSHDataAccess is just inheriting this, while LocalDataAccess appends the local versions if there are any). Would that be more intuitive?

Collaborator

You are right. It makes sense it is the default.
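The default-plus-override shape discussed here might look like the following sketch (class and helper names mirror the discussion, but the bodies are stand-ins, not the real powersimdata code):

```python
# Sketch: blob storage is the source of truth, so the base class defaults
# to the cloud lookup; LocalDataAccess extends it with local versions.

def cloud_versions(grid_model, kind):
    return ["vJan2021"]  # stand-in for the blob storage lookup

def local_only_versions(grid_model, kind):
    return ["vCustom"]  # stand-in for parsing a local version.json

class DataAccess:
    def get_profile_version(self, grid_model, kind):
        return cloud_versions(grid_model, kind)

class SSHDataAccess(DataAccess):
    pass  # inherits the cloud-only behavior

class LocalDataAccess(DataAccess):
    def get_profile_version(self, grid_model, kind):
        versions = super().get_profile_version(grid_model, kind)
        return versions + local_only_versions(grid_model, kind)
```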

version = scenario_info["base_" + field_name]
file_name = field_name + "_" + version + ".csv"
grid_model = scenario_info["grid_model"]
from_dir = f"raw/{grid_model}"
Contributor

Does this forward slash cause issues for Windows users when saving to a local directory (like in line 25)?

Collaborator Author

Good call, I should probably do an os.path.join here

Contributor

I should probably get this on my personal Windows machine to test, but I think we might run into problems calling the blob storage then. We're using the same from_dir for both the url and local directory in the download_file method.

Collaborator Author

D'oh, will fix. We can look into automating Windows tests, maybe in GitHub Actions or using Windows containers.
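One way to sketch the fix (paths and the base URL here are hypothetical, not the project's actual values): keep the blob URL built with forward slashes, and build only the local destination with os.path.join so Windows gets its native separator.

```python
import os

BASE_URL = "https://example.blob.core.windows.net/profiles"  # hypothetical

def profile_locations(grid_model, file_name, local_root):
    # URLs always use "/", regardless of platform.
    url = f"{BASE_URL}/raw/{grid_model}/{file_name}"
    # The local path uses the OS separator ("\\" on Windows).
    local_path = os.path.join(local_root, file_name)
    return url, local_path
```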

:return: (*list*) -- available profile version.
"""

version_file = os.path.join(server_setup.LOCAL_DIR, "version.json")
Contributor

Would it be easier just to list the files in the directory and filter out the ones that match the {kind}_{version}.csv format? Then again, making a user add a new profile to version.json is more intentional, so they're less likely to accidentally add something without understanding the risks.
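That alternative could look something like this sketch (a hypothetical helper, not the actual profile_helper code): derive versions by matching file names against the {kind}_{version}.csv convention instead of reading version.json.

```python
import re

def versions_from_files(kind, file_names):
    # Match files named {kind}_{version}.csv and pull out the version part.
    pattern = re.compile(rf"^{re.escape(kind)}_(.+)\.csv$")
    return [m.group(1) for f in file_names if (m := pattern.match(f))]

files = ["demand_vJan2021.csv", "hydro_vJan2021.csv", "demand_vCustom.csv"]
```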

Collaborator Author

Yeah, this was kind of a trade-off: it's less code to reuse the JSON format, and it provides at least one way for a user to customize. Figured it's OK for now, but definitely open to future improvements.

Contributor

Makes sense! We could also probably get some feedback from our users/collaborators to see what they think about usability. But agreed, this looks good for now.

Contributor

@ahurli ahurli left a comment

Looks good!

Collaborator

@rouille rouille left a comment

Thanks. This is a great feature.



Development

Successfully merging this pull request may close these issues.

Handle USA profiles

4 participants