Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Vision] Conversion output type for columns "ID" and "Name" #107

Open
IrenaDk opened this issue Dec 19, 2022 · 2 comments
Open

[Vision] Conversion output type for columns "ID" and "Name" #107

IrenaDk opened this issue Dec 19, 2022 · 2 comments
Labels
good first issue Indicates a good issue for first-time contributors improvement Existing functionality, but better, faster, stronger

Comments

@IrenaDk
Copy link

IrenaDk commented Dec 19, 2022

The json output for the Vision conversion in the columns "ID" is currenlty a float number if the entry contains numbers only (this is the case most of the time), or string if there are letters in it (fewer components contain letters). First, this is inconsistent, so it would be better if all IDs are of type string. Alternatively, if the output of the entries with numbers only remain to be numbers, it is prefered that they are integer, rather than floats, since they are integers originally.

As for the Name column, it is prefered that it is always a string. Now there is inconsistency between the PGM components. For some the output is a string and for some a (float) number.

@bramstoeller
Copy link
Contributor

Fair point, I agree that this is inconsistent. However, the issue is in the Panda's excel reader (openpyxl). It tries to determine the the data type automatically. If I remember correctly, only columns containing very large integers are converted to float or double. In this case it is hard to determine the type automatically, because even when the ID column contains only integers, we'd still like them to be strings. A couple solutions could be:

  • Add a way to define the expected data type explicitly for some columns and let the Excel reader handle the types (requires quite some coding/testing/documentation).
  • Read columns always as strings and let python handle the conversion when necessary (possibly much slower and more memory required).
  • Add a function around the extra info definition that converts the data to the expected data type (quite specific for ID and Name and we will probably run into issues with the extra_info keys).
  • Post-process the values after conversion. Something like:
    def convert_to_string(value: Any) -> str:
        if value is None or (isinstance(value, Number) and np.isnan(value)):
            return ""
        if isinstance(value, SupportsRound) and round(value) == value:
            return str(int(round(value)))
        return str(value)

In my opinion, the first solution is the best / most sustainable, but that's quite some work. And, as the issues only arise in the extra_info field, there are currently more prominent issues that we have to pick up. Therefore, I'd advise you to use the post-process option.

Or, we'd like to invite you to fork this repository, create the fix in the source code, and add a pull request. We'd be happy to write the unit tests and review your suggestions.

@bramstoeller bramstoeller added the improvement Existing functionality, but better, faster, stronger label Dec 20, 2022
@IrenaDk
Copy link
Author

IrenaDk commented Dec 21, 2022

Thanks for the input Bram. I will temporarily work with a post-processing option.

@mgovers mgovers added the good first issue Indicates a good issue for first-time contributors label Jun 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Indicates a good issue for first-time contributors improvement Existing functionality, but better, faster, stronger
Projects
Status: No status
Development

No branches or pull requests

3 participants