Download profiles from blob storage #419

Merged
jenhagg merged 12 commits into develop from jon/profiles
Mar 23, 2021
Conversation

@jenhagg
Collaborator

@jenhagg jenhagg commented Mar 18, 2021

Purpose

Use blob storage as the source for profiles, downloading as needed.

What the code is doing

Added a version.json file to support listing available versions. The profiles are in the raw/usa_tamu folder in the blob container to mimic the structure elsewhere. Similar to before, we download if it doesn't exist locally.

Testing

Deleted local copies of the profiles and ran a simulation using plug, which shows the files being downloaded (during the call to prepare_simulation_input).

Usage Example/Visuals

Example query (no change in usage, other than switching to a static method)

In [18]: InputData.get_profile_version("usa_tamu", "solar")
Out[18]: ['vJan2021']

Snippet of files being downloaded

In [7]: scenario.state.prepare_simulation_input()
---------------------------
PREPARING SIMULATION INPUTS
---------------------------
--> Creating temporary folder on server for simulation inputs
--> Loading demand
demand_vJan2021.csv not found in /root/ScenarioData/ on local machine
--> Downloading demand_vJan2021.csv from blob storage.
100%|████████████████████████████████████████████████████████████████████████████| 5.15M/5.15M [00:00<00:00, 6.17MB/s]
--> Done!
Multiply demand in Far West (#301) by 1.09
/usr/local/lib/python3.8/site-packages/pandas/core/indexing.py:1843: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item_labels[indexer[info_axis]]] = value
Multiply demand in North (#302) by 1.09
Multiply demand in West (#303) by 1.09
Multiply demand in South (#304) by 1.09
Multiply demand in North Central (#305) by 1.09
Multiply demand in South Central (#306) by 1.09
Multiply demand in Coast (#307) by 1.09
Multiply demand in East (#308) by 1.09
Writing scaled demand profile in /root/ScenarioData/ on local machine
--> Moving file /root/ScenarioData/1_demand.csv to /mnt/bes/pcm/tmp/scenario_1/demand.csv
--> Deleting original copy
--> Loading hydro
hydro_vJan2021.csv not found in /root/ScenarioData/ on local machine
--> Downloading hydro_vJan2021.csv from blob storage.
100%|██████████████████████████████████████████████████████████████████████████████| 222M/222M [00:40<00:00, 5.74MB/s]
--> Done!

Time estimate

20 mins

@jenhagg jenhagg added this to the Put Your Records On milestone Mar 18, 2021
@jenhagg jenhagg linked an issue Mar 18, 2021 that may be closed by this pull request
@jenhagg jenhagg self-assigned this Mar 18, 2021
@jenhagg jenhagg requested review from ahurli and rouille March 18, 2021 19:53
@rouille rouille requested a review from danielolsen March 18, 2021 21:17
@rouille
Collaborator

rouille commented Mar 18, 2021

Two comments about the code snippet:

  • --> Creating temporary folder on server for simulation inputs. Now that we are no longer restricted to the client/server installation, I believe we can remove the on server part of the printout.
  • there is a warning:
Multiply demand in Far West (#301) by 1.09
/usr/local/lib/python3.8/site-packages/pandas/core/indexing.py:1843: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

I believe it is raised in the _get_demand_profile method of the TransformProfile class. I remember seeing it before, but I rarely run scenarios. Does it ring a bell, @danielolsen?

These comments are not related to your PR, though; they just happen to show up in the code snippet.

@danielolsen
Contributor

@rouille I'm familiar with that warning, it's something that's not very hard to track down and fix.

Comment thread powersimdata/input/input_data.py Outdated
@danielolsen
Contributor

I think the warning can be fixed by changing L.106 of transform_profile.py from:

demand = self._input_data.get_data(self.scenario_info, "demand")[zone_id]

to:

demand = self._input_data.get_data(self.scenario_info, "demand").loc[:, zone_id]
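To illustrate the difference with a minimal sketch (a hypothetical demand frame, not the actual TransformProfile code): chained indexing like demand[zone_id] leaves pandas unsure whether the result is a view or a copy, so a later assignment can raise SettingWithCopyWarning, while .loc makes the column selection explicit.

```python
import pandas as pd

# Hypothetical demand frame: rows are hours, columns are zone ids.
demand = pd.DataFrame({301: [1.0, 2.0], 302: [3.0, 4.0]})
zone_id = [301]

# Chained indexing (demand[zone_id]) followed by assignment can trigger
# SettingWithCopyWarning; .loc selects the columns unambiguously.
zone_demand = demand.loc[:, zone_id]
scaled = zone_demand * 1.09  # scale demand, as in the log above
```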

@jenhagg
Collaborator Author

jenhagg commented Mar 19, 2021

I think the warning can be fixed by changing L.106 of transform_profile.py

Nice, that works - no warning when I re-ran the same test.

Comment thread powersimdata/input/input_data.py Outdated
filename = [os.path.basename(line.rstrip()) for line in stdout.readlines()]
version = [f[f.rfind("_") + 1 : -4] for f in filename]
return version
resp = requests.get(f"{BASE_URL}/version.json")
Contributor

Do we want users to be able to use their own profiles that only exist locally? I'm thinking that in the previous Docker implementation, users could just add a new profile (like a new demand profile to test) to the ScenarioData directory and be able to use it, whereas now they'll either need to add it to our blob storage or create their own.

Collaborator

I think we are just downloading a JSON file listing the profile versions here. Otherwise the logic we had before is still in place, i.e., we only download if there is no local copy. Also, in plug, I would say the user can have their own profiles and use them in the various scripts via the set_base_profile method.

Contributor

What would get added to the ScenarioList if the user brought their own profile? "untracked"?

Collaborator Author

@jenhagg jenhagg Mar 19, 2021

Do we want users to be able to use their own profiles that only exist locally?

Good point, there isn't a natural way to use a custom profile if we are strictly checking the version.json in blob storage. I think we'd need to do something like cache that file locally so it can be edited, or check both a local copy and the one in blob storage, etc. The profiles can still be put in the ScenarioData folder locally, but they would need a version that is considered valid in order to actually be used.

Collaborator Author

What would get added to the ScenarioList if the user brought their own profile? "untracked"?

Yeah it would get added to the user's ScenarioList. So their environment would be consistent, but couldn't be easily "merged" with another one (e.g. our server environment). I was going to suggest something like adding full references to files in the scenario list, but can't do much if the sources aren't public, e.g. in some other blob storage. Not sure if this is an issue at this point?

Contributor

Since DataAccess differentiates between configurations, we could add something like data_access.get_local_profiles(kind) that returns nothing for SSHDataAccess but returns any additional profiles for LocalDataAccess. We could call and append the results in InputData.

Depending on how fancy we want to get, we could also pass in the profiles already found in blob storage to both a) mark which profiles have been cached if a user wants to be careful of their internet usage and b) filter out only the extra profiles not in blob storage.

Collaborator Author

Cool, I ended up thinking the same thing (my other ideas didn't pan out). Not sure I understand the part about passing in the profiles though?

Contributor

I was first thinking of passing in the profiles like data_access.get_local_profiles(kind, blob_profiles), but looking at it now, there's no reason we can't do that filtering/marking at the InputData level instead of at the DataAccess level.

Contributor

Actually, sorry, I remembered what I was thinking. You would want to pass in the blob profiles so that in the SSHDataAccess instance, you could return the locally cached profiles but exclude anything not in the blob profiles list. In the LocalDataAccess instance, you would return both cached and local-only profiles.

Collaborator Author

Ah thanks, that makes sense. Let me push a commit with how I'm thinking of implementing that.
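The filtering idea above can be sketched as follows (function names are hypothetical illustrations, not the actual DataAccess API): given the versions known to blob storage and the versions cached locally, the SSH configuration surfaces only cached copies that blob storage knows about, while the local configuration also surfaces local-only profiles.

```python
# Hypothetical helpers sketching the design discussed above.

def ssh_versions(cached, blob_versions):
    # client/server: only report cached copies that blob storage knows about
    return [v for v in cached if v in blob_versions]

def local_versions(cached, blob_versions):
    # local/container: cached copies plus any local-only profiles
    return list(cached)

blob = ["vJan2021"]
cached = ["vJan2021", "vCustom"]
# ssh_versions(cached, blob)   -> ["vJan2021"]
# local_versions(cached, blob) -> ["vJan2021", "vCustom"]
```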

Comment thread powersimdata/data_access/profile_helper.py Outdated
@rouille
Collaborator

rouille commented Mar 20, 2021

I think with 1f9291f, we have the best of both worlds. Did you have a chance to test it with both the client/server installation and the containers?

@jenhagg
Collaborator Author

jenhagg commented Mar 20, 2021

I think with 1f9291f, we have the best of both worlds. Did you have a chance to test it with both the client/server installation and the containers?

So far only a quick test in the container, before and after adding a local version.json. That worked as expected, and the client/server behavior should be a subset of that (only checks blob storage). Planning to do a bit more, and look into adding unit tests, which was the original reason for creating parse_version without the HTTP request.

Comment thread powersimdata/data_access/profile_helper.py Outdated
Comment thread powersimdata/data_access/data_access.py
raise NotImplementedError

def get_profile_version(self, grid_model, kind):
return ProfileHelper.get_profile_version_cloud(grid_model, kind)
Collaborator

We implicitly return the versions stored on the cloud. I see how it is useful but is that intuitive in the DataAccess class? Should we return instead an instance of ProfileHelper and call the functions in the child classes?

Collaborator Author

I was thinking since blob storage is the source of truth, it makes sense as the default. Another way would be to raise NotImplementedError here and have the child classes contain their own implementations (SSHDataAccess is just inheriting this, while LocalDataAccess appends the local versions if there are any). Would that be more intuitive?

Collaborator

You are right. It makes sense it is the default.
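The default-plus-override shape discussed here might look like the following sketch (class and helper names mirror the discussion, but the bodies are stand-ins, not the real powersimdata code):

```python
# Sketch: blob storage is the source of truth, so the base class defaults
# to the cloud lookup; LocalDataAccess extends it with local versions.

def cloud_versions(grid_model, kind):
    return ["vJan2021"]  # stand-in for the blob storage lookup

def local_only_versions(grid_model, kind):
    return ["vCustom"]  # stand-in for parsing a local version.json

class DataAccess:
    def get_profile_version(self, grid_model, kind):
        return cloud_versions(grid_model, kind)

class SSHDataAccess(DataAccess):
    pass  # inherits the cloud-only behavior

class LocalDataAccess(DataAccess):
    def get_profile_version(self, grid_model, kind):
        versions = super().get_profile_version(grid_model, kind)
        return versions + local_only_versions(grid_model, kind)
```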

version = scenario_info["base_" + field_name]
file_name = field_name + "_" + version + ".csv"
grid_model = scenario_info["grid_model"]
from_dir = f"raw/{grid_model}"
Contributor

Does this forward slash cause issues for Windows users when saving to a local directory (like in line 25)?

Collaborator Author

Good call, I should probably do an os.path.join here

Contributor

I should probably get this on my personal Windows machine to test, but I think we might run into problems calling the blob storage then. We're using the same from_dir for both the url and local directory in the download_file method.

Collaborator Author

D'oh, will fix. We can look into automating Windows tests, maybe in GitHub Actions or using Windows containers.
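One way to sketch the fix (paths and the base URL here are hypothetical, not the project's actual values): keep the blob URL built with forward slashes, and build only the local destination with os.path.join so Windows gets its native separator.

```python
import os

BASE_URL = "https://example.blob.core.windows.net/profiles"  # hypothetical

def profile_locations(grid_model, file_name, local_root):
    # URLs always use "/", regardless of platform.
    url = f"{BASE_URL}/raw/{grid_model}/{file_name}"
    # The local path uses the OS separator ("\\" on Windows).
    local_path = os.path.join(local_root, file_name)
    return url, local_path
```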

:return: (*list*) -- available profile version.
"""

version_file = os.path.join(server_setup.LOCAL_DIR, "version.json")
Contributor

Would it be easier just to list the files in the directory and filter out the ones that match the {kind}_{version}.csv format? Then again, making a user add a new profile to version.json is more intentional, so they're less likely to accidentally add something without understanding the risks.
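That alternative could look something like this sketch (a hypothetical helper, not the actual profile_helper code): derive versions by matching file names against the {kind}_{version}.csv convention instead of reading version.json.

```python
import re

def versions_from_files(kind, file_names):
    # Match files named {kind}_{version}.csv and pull out the version part.
    pattern = re.compile(rf"^{re.escape(kind)}_(.+)\.csv$")
    return [m.group(1) for f in file_names if (m := pattern.match(f))]

files = ["demand_vJan2021.csv", "hydro_vJan2021.csv", "demand_vCustom.csv"]
```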

Collaborator Author

Yeah, this was kind of a trade-off: it's less code to reuse the JSON format, and it provides at least one way for a user to customize. Figured it's OK for now, but definitely open to future improvements.

Contributor

Makes sense! We could also probably get some feedback from our users/collaborators to see what they think about usability. But agreed, this looks good for now.

Contributor

@ahurli ahurli left a comment

Looks good!

Collaborator

@rouille rouille left a comment

Thanks. This is a great feature.



Development

Successfully merging this pull request may close these issues.

Handle USA profiles

4 participants