29 from csv function #39

Merged
merged 15 commits into from
Oct 24, 2023

Conversation

rogerkuou
Member

@rogerkuou rogerkuou commented Oct 17, 2023

Added a from_csv function for lazy loading csv files.

  • Implement from_csv
  • Add example dataset
  • Unit test
  • Documentation
  • Example notebook

@rogerkuou rogerkuou marked this pull request as ready for review October 18, 2023 12:36
@rogerkuou
Member Author

Hi @SarahAlidoost and @fnattino, could you please review this PR for me?
An example of its application can be found at the beginning of the example notebook.

@rogerkuou rogerkuou linked an issue Oct 18, 2023 that may be closed by this pull request
if spacetime_pattern is not None:
    key = list(spacetime_pattern.keys())[0]
    for column in ddf.columns:
        if re.match(re.compile(key), column):
Contributor

I think it would be useful for users to know whether the pattern matches the columns in the CSV file. So perhaps adding an else statement with a clear error message would help.

Member Author

Thanks for the suggestion! I added a check at the beginning of the function to verify that every specified space/space-time pattern has at least one match. Otherwise, a ValueError is raised.
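
For readers following the thread, a minimal sketch of such an up-front check, not the code merged in this PR. The helper names find_unmatched_patterns and check_patterns and the sample column names are hypothetical; only the idea (every pattern must match at least one column, otherwise raise ValueError) comes from the comment above.

```python
import re


def find_unmatched_patterns(columns, patterns):
    """Return the regex patterns that match none of the column names."""
    return [
        pattern
        for pattern in patterns
        if not any(re.match(pattern, column) for column in columns)
    ]


def check_patterns(columns, patterns):
    """Raise ValueError if any pattern has no matching column (hypothetical helper)."""
    unmatched = find_unmatched_patterns(columns, patterns)
    if unmatched:
        raise ValueError(
            f"Pattern(s) {unmatched} do not match any column in {list(columns)}"
        )


# Assumed column names, for illustration only.
columns = ["pnt_id", "pnt_lat", "pnt_lon", "amp_20100110", "amp_20100121"]

# Both patterns match at least one column, so this passes silently.
check_patterns(columns, ["^pnt_", "^amp_"])
```

Running the check once before the column loop means a typo in a pattern fails early with a message that names the offending pattern, rather than silently producing a dataset with missing variables.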

# Initiate a template STM
coords = {
    "space": range(da_col0.shape[0]),
    "time": range(time_shape),
Contributor

Can time values be extracted from column names, e.g. amp_20100110, and stored as time objects? If users want to combine datasets, time information is needed.

Member Author

I will tentatively hold off on this for now, because the time information is currently all string-based (e.g. from folder names, column headers, etc.). This can be fragile and may even be incorrect in the case of future hourly observations. We are thinking of integrating the time information from metadata, e.g. STAC catalog metadata, so for now I will leave it to users to manually change the time coordinates.

However, I think we should document this well. Do you have any suggestions?
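
As a possible starting point for that documentation, here is a minimal sketch of how a user might derive time values from column headers by hand. It assumes headers of the form amp_20100110 (prefix plus a YYYYMMDD date), as in the comment above; the function name times_from_columns and its prefix and fmt parameters are hypothetical and not part of stmtools.

```python
import re
from datetime import datetime


def times_from_columns(columns, prefix="amp_", fmt="%Y%m%d"):
    """Parse datetimes from column names such as 'amp_20100110' (hypothetical helper).

    Columns that do not start with `prefix` followed by digits are skipped.
    """
    pattern = re.compile(rf"^{re.escape(prefix)}(\d+)$")
    times = []
    for column in columns:
        match = pattern.match(column)
        if match:
            # strptime raises ValueError if the digits do not fit `fmt`.
            times.append(datetime.strptime(match.group(1), fmt))
    return times


# Assumed column names, for illustration only.
columns = ["pnt_id", "amp_20100110", "amp_20100121"]
times = times_from_columns(columns)
```

The resulting list could then be assigned to the time coordinate of the dataset manually. The fmt argument is where the fragility mentioned above shows up: hourly observations would need a different header convention and format string.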

stmtools/_io.py Outdated
        stmat = stmat.assign({column: (("space"), da_pnt)})
    else:
        for k in spacetime_pattern.keys():
            if re.match(re.compile(f"{k}"), column):
Contributor

The same comment as above: let's add an else statement with a clear error message to help users if the pattern does not match the columns in the CSV file.

Contributor

@SarahAlidoost left a comment

@rogerkuou nice work 👍 The function works. Just one suggestion about the time coordinates: let's keep them if possible.

Contributor

@fnattino left a comment

Nice @rogerkuou!

I have just raised two points related to:

  • a performance concern I have about reading CSV files by columns;
  • making the output of from_csv more consistent with the STM produced from SLC data.

I leave it to you whether you want to consider these now or as potential later improvements.

All the rest is very minor stuff (a couple of typos and small things related to type hints).

stmtools/_io.py Outdated
Comment on lines 12 to 15
    spacetime_pattern: dict | None = None,
    coords_cols: list | dict = None,
    output_chunksize: dict | None = None,
    blocksize: int | str | None = 200e6,
Contributor

I think you can skip the | None and just do: spacetime_pattern: dict = None

Also, if dicts and lists might fit some structure, you might be more specific on their contents with something like this (I am not 100% sure about the syntax):

from typing import Dict, List

def func(x: Dict[int, str], y: List[int]) -> None:
    pass
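
On the syntax question: a minimal sketch, assuming Python 3.9 or later, where the built-in dict and list can be parameterized directly without importing Dict and List. The function summarize_patterns is hypothetical and only illustrates the annotations; the assumption that patterns map to variable names as strings is mine, not confirmed by this PR. Note that PEP 484 no longer recommends relying on an implicit Optional (writing x: dict = None), so the sketch keeps the None explicit.

```python
from typing import Optional, get_type_hints


def summarize_patterns(
    spacetime_pattern: Optional[dict[str, str]] = None,
    coords_cols: Optional[list[str]] = None,
) -> list[str]:
    """Collect the names the arguments refer to (hypothetical, illustration only)."""
    # Assumption: each regex pattern maps to a variable name.
    names = list((spacetime_pattern or {}).values())
    names.extend(coords_cols or [])
    return names


# The annotations, including the return type, can be inspected at runtime.
hints = get_type_hints(summarize_patterns)
```

On Python 3.10 or later, Optional[dict[str, str]] can equivalently be written as dict[str, str] | None, which matches the style already used in the signature under review.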

stmtools/_io.py Outdated
    coords_cols: list | dict = None,
    output_chunksize: dict | None = None,
    blocksize: int | str | None = 200e6,
):
Contributor

Since you are already using typing, maybe add the returned type?

@rogerkuou rogerkuou merged commit 795d7ad into main Oct 24, 2023
@rogerkuou rogerkuou deleted the 29_from_csv_function branch November 9, 2023 09:55
Successfully merging this pull request may close these issues.

A "from_csv" function in STMtools
3 participants