-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
29 from csv function #39
Conversation
Hi @SarahAlidoost and @fnattino, could you please review this PR for me? |
if spacetime_pattern is not None: | ||
key = list(spacetime_pattern.keys())[0] | ||
for column in ddf.columns: | ||
if re.match(re.compile(key), column): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be useful for users to know if the pattern in csv files is valid. So perhaps adding else
statement with a clear error message will help.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion! I added on check at the beginning of the function, to check if all specified space/space-time patterns have at least one match. Otherwise an ValueError is raised
# Initiate a template STM | ||
coords = { | ||
"space": range(da_col0.shape[0]), | ||
"time": range(time_shape), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can time values be extracted from column names i.e. amp_20100110
and stored as a time object. If users want to combine datasets, time information is needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will tentatively hold this for now. Because currently the time information is all string-based (e.g. from folder names and col headers, etc.) This can be fragile and may be even incorrect in case of futural hourly observation. We are thinking of integrating the time information from metadata, e.g. STAC-Catalog metatdata, so for now I will leave it to users to manually change the time coordinates.
However I think maybe we should do a good documentation on this. Do you have any suggestion?
stmtools/_io.py
Outdated
stmat = stmat.assign({column: (("space"), da_pnt)}) | ||
else: | ||
for k in spacetime_pattern.keys(): | ||
if re.match(re.compile(f"{k}"), column): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the same comment as above, let's add else
statement with a clear error message to help users if the pattern in csv file is not valid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rogerkuou nice work 👍 the function works. Just a suggestion about time coordinates, let's keep them if it is possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice @rogerkuou!
I have just raised two points related to:
- a performance concern I have about reading CSV files by columns;
- making the output of
from_csv
more consistent with the STM produced from SLC data.
I leave it to you wether you want to consider these now or for later (potential) improvements.
All the rest is very minor stuff (couple of typos and small things related to type hints).
stmtools/_io.py
Outdated
spacetime_pattern: dict | None = None, | ||
coords_cols: list | dict = None, | ||
output_chunksize: dict | None = None, | ||
blocksize: int | str | None = 200e6, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can skip the | None
and just do: spacetime_pattern: dict = None
Also, if dicts and lists might fit some structure, you might be more specific on their contents with something like this (I am not 100% sure about the syntax):
from typing import Dict, List
def func(x: Dict[int, str], y: List[int]) -> None:
pass
stmtools/_io.py
Outdated
coords_cols: list | dict = None, | ||
output_chunksize: dict | None = None, | ||
blocksize: int | str | None = 200e6, | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you are already using typing, maybe add the returned type?
Co-authored-by: Francesco Nattino <49899980+fnattino@users.noreply.github.com>
Added a
from_csv
function for lazy loading csv files.from_csv