Update flatten to accept DataFrame and GeoDataFrame #103

Merged
merged 58 commits into from
May 21, 2024

FelipeSBarros
Owner

@cuducos the purpose of this PR is to continue the development of the flatten function (#96), making it accept and work with DataFrame.

@FelipeSBarros FelipeSBarros changed the title Update flatten to accept DataFrame Update flatten to accept DataFrame and GeoDataFrame Feb 23, 2024
Owner Author

@FelipeSBarros FelipeSBarros left a comment

@cuducos what do you think about this proposal?

Owner Author

@FelipeSBarros FelipeSBarros left a comment

Not sure if I should be committing this, but as you always say: it is better to discuss implemented code than ideas (actually, you never said exactly that, but you always ask to see the code, so this is how I understand your point).

Collaborator

@cuducos cuducos left a comment

Have you tested without pandas installed?

Owner Author

@FelipeSBarros FelipeSBarros left a comment

I know that this last commit doesn't address the concerns you mentioned before about the early return, but I tried to improve the case where the data is a DataFrame...

Owner Author

@FelipeSBarros FelipeSBarros left a comment

OK, now with an is_empty function and some tests...

Co-authored-by: Eduardo Cuducos <4732915+cuducos@users.noreply.github.com>
@cuducos
Collaborator

cuducos commented May 10, 2024

I agree with the last approach, but… why popping the key just to reinsert it next? Can’t you prevent popping it instead?

@FelipeSBarros
Owner Author

FelipeSBarros commented May 12, 2024

I agree with the last approach, but… why popping the key just to reinsert it next? Can’t you prevent popping it instead?

@cuducos, I am not sure I understood your last comment. Do you agree that it is the worst scenario? I see the first proposal as the most interesting: not "popping" anything, just flattening when there is something to be flattened.
This way the original key will always be kept, and dicts with nested values will get new keys holding the flattened nested values.

[UPDATE]
My last commits refactor the _flatten functions (list and pd) to avoid popping the nested keys/columns.
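
For readers following along, the "flatten without popping" idea for a list of dicts might look roughly like the sketch below. This is only an illustration under assumptions: flatten_records and its nested_keys argument are hypothetical names, not the functions in this PR, and the parent_child key naming is inferred from the contextInfo_mainReason key mentioned in this thread.

```python
def flatten_records(records, nested_keys):
    """Add `parent_child` keys for nested dicts while keeping the original key.

    Hypothetical helper for illustration; not the crossfire implementation.
    Note that it mutates the dicts in `records`.
    """
    for record in records:
        for key in nested_keys:
            value = record.get(key)
            if not isinstance(value, dict):
                continue  # missing key or None: nothing to flatten
            for child_key, child_value in value.items():
                record[f"{key}_{child_key}"] = child_value
    return records


rows = [
    {"id": 1, "contextInfo": {"mainReason": "x"}},
    {"id": 2, "contextInfo": None},
]
flat_rows = flatten_records(rows, ["contextInfo"])
```

Here the first row gains a contextInfo_mainReason key and still has contextInfo, while the second row is left unchanged because there is nothing to flatten.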

cuducos
cuducos previously approved these changes May 13, 2024
@FelipeSBarros
Owner Author

@cuducos in my last commits I have refactored _flatten_df so that it behaves like _flatten_list when dealing with a DataFrame that has None values in the potential nested_columns.
Also, before merging this PR, I decided to follow your suggestion: expose flatten as functionality, not as a function.

I added the flat evaluation to Occurrences' __call__ method and updated the tests.

I am missing an integration test to confirm that the flatten() function is called when I call Occurrences(..., flat=True), but I am not sure it is necessary.

@FelipeSBarros FelipeSBarros dismissed cuducos’s stale review May 15, 2024 03:23

I plan to expose the flatten function as functionality. I still have work to do.

@FelipeSBarros
Owner Author

A few things I still have to do:

  1. Confirm (or fix) the implementation of the flat parameter;

Trying the implementation, I realized that when requesting data in dict format, the flatten function is not being used:

from crossfire import occurrences

occs_dict = occurrences(id_state='813ca36b-91e3-4a18-b408-60b27a1942ef',
                id_cities='5bd3bfe5-4989-4bc3-a646-fe77a876fce0',
                initial_date='2018-04-01', flat=True, format='dict')
'contextInfo_mainReason' in occs_dict[0].keys()
#False

occs_df = occurrences(id_state='813ca36b-91e3-4a18-b408-60b27a1942ef',
                id_cities='5bd3bfe5-4989-4bc3-a646-fe77a876fce0',
                initial_date='2018-04-01', flat=True, format='df')
'contextInfo_mainReason' in occs_df.columns
#True

occs_gdf = occurrences(id_state='813ca36b-91e3-4a18-b408-60b27a1942ef',
                id_cities='5bd3bfe5-4989-4bc3-a646-fe77a876fce0',
                initial_date='2018-04-01', flat=True, format='geodf')
'contextInfo_mainReason' in occs_gdf.columns
#True
  2. Figure out how to deal with nested columns inside nested columns;

When trying the implementation, I realized that contextInfo usually comes with other nested keys/columns as values:

occs_df = occurrences(id_state='813ca36b-91e3-4a18-b408-60b27a1942ef',
                id_cities='5bd3bfe5-4989-4bc3-a646-fe77a876fce0',
                initial_date='2018-04-01', flat=True, format='df')
'contextInfo_mainReason' in occs_df.columns

occs_df.iloc[0]['contextInfo_mainReason']
#{'id': 'baa3b299-67ad-41d2-aaf0-23ec8288cadb', 'name': 'Homicidio/Tentativa'}

How could we check whether a flattened key/column holds a nested value, and flatten it as well?

  3. Rewrite the documentation.

@cuducos
Collaborator

cuducos commented May 16, 2024

How could we check whether a flattened key/column holds a nested value, and flatten it as well?
I think the point is to check the type of value:

for key, value in occurrence.items():
    if isinstance(value, dict):
        flat(value)

Maybe for Pandas it would be a string formatted as JSON, so you can try to parse it as JSON and check if it is a dictionary:

def is_nested(text):
    try:
        data = loads(text)
    except ValueError:  # not sure this is the right exception, this is just an example
        return False
    return isinstance(data, dict)
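
Putting the two ideas together, a runnable version might look like the sketch below. It is an illustration under assumptions, not the code merged in this PR: flatten_dict is a hypothetical name, it returns a new dict instead of mutating its argument, and it keeps the parent key alongside the flattened ones. json.loads raises json.JSONDecodeError, which is a subclass of ValueError, so catching ValueError works for malformed strings; non-string values are filtered out before parsing.

```python
from json import loads


def is_nested(value):
    """Return True for a dict, or for a JSON string that decodes to a dict."""
    if isinstance(value, dict):
        return True
    if not isinstance(value, str):
        return False  # None, numbers, lists, etc. are not nested columns
    try:
        data = loads(value)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return isinstance(data, dict)


def flatten_dict(occurrence, prefix=""):
    """Return a new dict with `parent_child` keys added for every nested dict.

    Hypothetical helper: original keys are kept, and nesting is handled
    recursively, so dicts inside dicts are flattened too.
    """
    flat = {}
    for key, value in occurrence.items():
        name = f"{prefix}{key}"
        flat[name] = value
        if isinstance(value, dict):
            flat.update(flatten_dict(value, prefix=f"{name}_"))
    return flat


example = {
    "id": 1,
    "contextInfo": {"mainReason": {"id": "a", "name": "Homicidio/Tentativa"}},
}
flat_example = flatten_dict(example)
```

With this input, flat_example contains contextInfo, contextInfo_mainReason, contextInfo_mainReason_id and contextInfo_mainReason_name, and example itself is not modified.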

@FelipeSBarros
Owner Author

How could we check whether a flattened key/column holds a nested value, and flatten it as well?
I think the point is to check the type of value:

for key, value in occurrence.items():
    if isinstance(value, dict):
        flat(value)

Maybe for Pandas it would be a string formatted as JSON, so you can try to parse it as JSON and check if it is a dictionary:

def is_nested(text):
    try:
        data = loads(text)
    except ValueError:  # not sure this is the right exception, this is just an example
        return False
    return isinstance(data, dict)

Thanks for your advice. I hope I have implemented it the right way...

By the way, I am facing a new problem:
when I run only the test test_flatten_pd_with_nested_columns_with_nested_values, it passes:

poetry run pytest -k test_flatten_pd_with_nested_columns_with_nested_values
================================================================================== test session starts ===================================================================================
platform linux -- Python 3.10.2, pytest-7.4.3, pluggy-1.3.0
rootdir: /home/felipe/repos/crossfire
configfile: pyproject.toml
plugins: ruff-0.2.1, anyio-4.1.0, asyncio-0.21.1
asyncio: mode=strict
collected 104 items / 103 deselected / 1 selected                                                                                                                                        

tests/test_flatten.py .                                                                                                                                                            [100%]

But running all tests, it fails.

FAILED tests/test_flatten.py::test_flatten_pd_with_nested_columns_with_nested_values - AssertionError: DataFrame are different

I have no idea what might be causing the problem... do you have any clue?

@cuducos
Collaborator

cuducos commented May 17, 2024

when I run only the test test_flatten_pd_with_nested_columns_with_nested_values, it passes:

[...] [100%]

But running all tests, it fails.

Probably because when you run this you mutate DICT_DATA_WITH_NESTED_VALUES_IN_NESTED_COLUMNS: flatten does not return a new dictionary, it mutates the one passed as an argument. So the input is different on the second test when you run both tests. You can copy the dictionary in each test to prevent that.

@FelipeSBarros
Owner Author

when I run only the test test_flatten_pd_with_nested_columns_with_nested_values, it passes:
[...] [100%]
But running all tests, it fails.

Probably because when you run this you mutate DICT_DATA_WITH_NESTED_VALUES_IN_NESTED_COLUMNS: flatten does not return a new dictionary, it mutates the one passed as an argument. So the input is different on the second test when you run both tests. You can copy the dictionary in each test to prevent that.

Thanks, Cuducos! I had this in mind last night, but I was so tired that I didn't realize I should use a deepcopy to make sure all the values are copied, not only the structure.
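
For anyone hitting the same test-ordering problem, the difference between a shallow copy and a deepcopy can be sketched as below. The names make_fixture and flatten_in_place are hypothetical stand-ins for the shared test data and for a flatten that mutates its argument; they are not the code in this repository.

```python
from copy import deepcopy


def make_fixture():
    """Hypothetical stand-in for shared, module-level test data."""
    return [{"id": 1, "contextInfo": {"mainReason": "x"}}]


def flatten_in_place(records):
    """Hypothetical flatten that mutates the dicts it receives."""
    for record in records:
        # Iterate over a snapshot, since keys are added to the dict in the loop.
        for key, value in list(record.items()):
            if isinstance(value, dict):
                for child_key, child_value in value.items():
                    record[f"{key}_{child_key}"] = child_value
    return records


# A shallow copy creates a new list but shares the inner dicts,
# so the mutation leaks back into the fixture.
fixture = make_fixture()
flatten_in_place(list(fixture))
shallow_leaked = "contextInfo_mainReason" in fixture[0]

# A deep copy duplicates the inner dicts too, so the fixture stays untouched.
fixture = make_fixture()
flatten_in_place(deepcopy(fixture))
deep_leaked = "contextInfo_mainReason" in fixture[0]
```

In this sketch shallow_leaked ends up True and deep_leaked ends up False, which matches the symptom of a test passing alone but failing after another test has already flattened the shared data.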

Could you review the implementation of the flatten() function as an Occurrences parameter (point 1 here)? I am not confident about the way I implemented it... Also, it seems it is not being executed when requesting the Occurrences as dict. :/

Collaborator

@cuducos cuducos left a comment

it is not being executed when requesting the Occurrences as dict. :/

So, the test suite has a bug? It is green, meaning test_occurrences_as_list_dicts_with_flat_parameter is passing. How is it passing if flatten is not being called? Or does this test have a misleading name?

@FelipeSBarros
Owner Author

So, the test suite has a bug? It is green, meaning test_occurrences_as_list_dicts_with_flat_parameter is passing. How is it passing if flatten is not being called? Or does this test have a misleading name?

Yes, it seems I am the problem... I have just confirmed that the mentioned test is well written and passes. I also did a manual test requesting data from the API using format='dict' and it came as expected...
Let's move on.

@FelipeSBarros
Owner Author

@cuducos I have updated the README, adding flat as an Occurrences parameter. I also added a small section explaining its meaning, with an example of the data returned when using it. Let me know if you agree.

Collaborator

@cuducos cuducos left a comment

Really great work : )

@FelipeSBarros FelipeSBarros merged commit 89a346a into master May 21, 2024
24 checks passed
@FelipeSBarros FelipeSBarros deleted the make_flatten_gdf branch May 21, 2024 19:06
@FelipeSBarros
Owner Author

Really great work : )

I am glad to hear that! Thank you for your patience and teaching.
