
Ticket/2594/dev #2609

Merged: 11 commits merged into rc/2.16.1 on Nov 28, 2022
Conversation

mikejhuang
Contributor

@mikejhuang mikejhuang commented Nov 16, 2022

This PR addresses three issues:

  1. The _build_stimulus_presentations method has pandas operations that trigger errors.
  2. stimulus_presentations['spatial_frequency'] contains numeric values of mixed str and float types, along with str(list) entries.
  3. Replace instances of local_index with probe_channel_number, as reported in ticket #2573 ("'probe_channel_number' should replace 'local_index' in ecephys_session, line 1244 (_build_mean_waveforms)").

@mikejhuang mikejhuang force-pushed the ticket/2594/dev branch 2 times, most recently from 5ae3a2a to a5135dd on November 17, 2022 00:01
Contributor

@morriscb morriscb left a comment

A few questions. One other issue is making sure that the data that is currently released and accessed via the ecephys_session objects can still be loaded from its NWBs.

@mikejhuang
Contributor Author

One other issue is making sure that the data that is currently released and accessed via the ecephys_session objects can still be loaded from its NWBs.

Does the ephys_session notebook do this? It currently runs through the notebook.

#


def eval_str(val):
Contributor

Does it make sense to add the type suggestions to this function?

Contributor Author

I don't think so, since it takes in any value of any type and returns any value of any type.

Contributor

But the return types are consistent at least.

@morriscb
Contributor

Does the ephys_session notebook do this? It currently runs through the notebook.

You'll have to check. As I said during sprint planning, you're about the first of the current Pikas to look at this code. My guess is yes, but it wouldn't hurt to double-check.


@mikejhuang
Contributor Author

mikejhuang commented Nov 18, 2022

You'll have to check. As I said during sprint planning, you're about the first of the current Pikas to look at this code. My guess is yes, but it wouldn't hurt to double-check.

I believe it does. I committed the re-run notebook. You can see the comparison here:
2020 pre-release run
current PR re-run

The table values are now either floating-point numbers or tuples instead of strings. Cell 17 gives a good overview of the differences.

The re-run is oddly cut off at cell 36 in this preview. It isn't cut off when I view it locally. Seems like it's something to do with the animation.jhtml.

Contributor

@aamster aamster left a comment

Looks good, I left some feedback

stimulus_presentations.fillna(nonapplicable, inplace=True)

# pandas does not automatically convert boolean cols for fillna
boolean_colnames = stimulus_presentations.dtypes[
Contributor

Pandas 1.1 introduced a dropna argument for groupby, which allows using NA values as keys. If this is not a possibility, then I guess this is OK.

I also don't know what the use case is here, but I'm surprised we want to group by a missing key.
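
For context, here is a minimal sketch of the dropna behaviour being suggested, using a toy dataframe (column names and values are hypothetical stand-ins for stimulus_presentations):

import numpy as np
import pandas as pd

# Toy frame standing in for stimulus_presentations (hypothetical values).
df = pd.DataFrame({
    "spatial_frequency": [0.04, np.nan, 0.04, np.nan],
    "duration": [0.25, 0.25, 0.5, 0.5],
})

# Default behaviour: rows whose group key is NaN are silently dropped.
print(df.groupby("spatial_frequency")["duration"].count())

# pandas >= 1.1: dropna=False keeps NaN as its own group key, so a sentinel
# fill value such as 'null' is no longer needed just for grouping.
print(df.groupby("spatial_frequency", dropna=False)["duration"].count())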

Contributor Author

Good call on dropna, although this would require a refactor since several parts of the code and notebook refer to the NA value as 'null'. I changed those to check for NaN instead.

Some of these old files can require a good amount of linting once they're touched.

Contributor Author

Actually, after doing more testing, I discovered one cell of the notebook had a change in results. Apparently, there's an unresolved pandas bug with dropna in the groupby function when used with MultiIndex groupings. pandas-dev/pandas#36470

I reverted everything back to 'null'.

if val.replace('.', '').isdigit():  # checks if val is numeric
    val = eval(val)
elif val[0] == "[" and val[-1] == "]":  # checks if val is list
    val = tuple(eval(val))
Contributor

This will break on the string "[foo]", or for that matter "['foo',;[]]", so you need a try/except here, which I know Chris didn't like, but I think it's necessary to catch any number of issues.
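
A minimal sketch of the try/except approach being suggested, using ast.literal_eval (the helper name safe_eval is hypothetical, not the PR's function):

import ast

def safe_eval(val):
    # Evaluate strings that parse as Python literals; pass everything else
    # through unchanged instead of raising.
    try:
        return ast.literal_eval(val)
    except (ValueError, SyntaxError):
        return val

safe_eval("0.04")          # -> 0.04
safe_eval("[0.0, 0.0]")    # -> [0.0, 0.0]
safe_eval("[foo]")         # -> "[foo]" returned unchanged instead of raising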

Contributor Author

Those don't seem like valid entries for any of the fields; it should fail regardless of whether it passes an eval statement.
I don't see any tests that check for invalid entries, though.

#


def eval_str(val):
Contributor

There's already code for this in the codebase; search for literal_col_eval. Can that be repurposed? I know that function is in a specific module; maybe it can be moved/modified into a general util module.

Contributor Author

I moved this function to brain_observatory/behavior/swdb/utilities.py

if val.replace('.', '').isdigit():  # checks if val is numeric
    val = eval(val)
elif val[0] == "[" and val[-1] == "]":  # checks if val is list
    val = tuple(eval(val))
Contributor

I don't think this is the right place to convert to a tuple; this function should just eval. Conversion to tuple should be done outside of this function.

Contributor Author

I separated this into two functions since I think it makes the code better organized through modularization. However, pandas apply is not an efficient way to iterate over all the rows, and running it twice may add overhead.

"""

if isinstance(val, str):
    if val.replace('.', '').isdigit():  # checks if val is numeric
Contributor

I'm not sure why you need to check whether it is a number or a list ahead of time; can't you just call eval if it is a string? That should return the correct thing regardless.

Contributor Author

There can be issues if a string of characters is passed into eval, as with the stimulus_name column.
The alternative is to specify which columns you want to eval. I thought coming up with a set of rules to decide which columns to eval, instead of explicitly naming a list of them, would make it easier to maintain.
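
A quick illustration of that failure mode, assuming a stimulus_name-like value (the strings below are hypothetical examples):

eval("0.25")                 # -> 0.25
eval("[1920.0, 1080.0]")     # -> [1920.0, 1080.0]
eval("natural_movie_one")    # -> NameError: name 'natural_movie_one' is not defined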

col_type_map).fillna(nonapplicable)

# eval str(numeric) and str(lists), convert lists to tuple for
# dict key compatibility
Contributor

Please add an example here of what the current values in the dataframe are and why we need to call eval. I think that will be helpful.

Contributor Author

@mikejhuang mikejhuang Nov 22, 2022

The PR description above provides some examples along with the rationale for calling eval.

Here's a summary of the unique values in the dataframe, taken before the fix was applied:

color: [-1.0 1.0]
contrast: [0.8 1.0]
frame: [-1.0 0.0 1.0 ... 3597.0 3598.0 3599.0]
orientation: [0.0 30.0 45.0 60.0 90.0 120.0 135.0 150.0 180.0 225.0 270.0 315.0]
phase: ['0.0' '0.25' '0.5' '0.75' '[0.0, 0.0]' '[3644.93333333, 3644.93333333]'
 '[42471.86666667, 42471.86666667]']
size: ['[1920.0, 1080.0]' '[20.0, 20.0]' '[250.0, 250.0]' '[300.0, 300.0]']
spatial_frequency: [0.02 0.04 '0.04' 0.08 '0.08' 0.16 0.32 '[0.0, 0.0]']
temporal_frequency: [1.0 2.0 4.0 8.0 15.0]
x_position: [-40.0 -30.0 -20.0 -10.0 0.0 10.0 20.0 30.0 40.0]
y_position: [-40.0 -30.0 -20.0 -10.0 0.0 10.0 20.0 30.0 40.0]

Contributor

Thanks for adding a description to the PR. I also think an inline description would be helpful, since the data structure is extremely messy and unconventional.

@aamster aamster mentioned this pull request Nov 22, 2022
@mikejhuang mikejhuang force-pushed the ticket/2594/dev branch 3 times, most recently from aadebb4 to 989582d on November 23, 2022 00:39
1 - (1 / N)))
return ls

def literal_col_eval(df: pd.DataFrame,
Contributor Author

Here are the two utility functions added.
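
For orientation, here is a minimal sketch of what the two helpers might look like. Only the literal_col_eval name and its df parameter appear in the diff context above; the second helper's name (df_list_to_tuple), both signatures, and the bodies are assumptions for illustration:

import ast
from typing import List

import pandas as pd


def literal_col_eval(df: pd.DataFrame, columns: List[str]) -> pd.DataFrame:
    # Sketch: evaluate str(number)/str(list) entries in the named columns.
    def eval_str(val):
        try:
            return ast.literal_eval(val) if isinstance(val, str) else val
        except (ValueError, SyntaxError):
            return val

    for col in columns:
        if col in df.columns:
            df[col] = df[col].apply(eval_str)
    return df


def df_list_to_tuple(df: pd.DataFrame, columns: List[str]) -> pd.DataFrame:
    # Sketch: convert list entries to tuples so values can be used as dict keys.
    for col in columns:
        if col in df.columns:
            df[col] = df[col].apply(
                lambda x: tuple(x) if isinstance(x, list) else x)
    return df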

Contributor

swdb is the Summer Workshop on the Dynamic Brain. It doesn't belong here :)

Contributor Author

Ah, I thought SWDB might've meant software db. (:
How about creating a utilities.py in allensdk/core?

Contributor

Sure :)

].apply(naming_utilities.eval_str)


col_list = ["phase, size, spatial_frequency"]
Contributor Author

I changed the logic here to eval/tuple by specifying columns instead of creating rules.
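
A minimal usage sketch of that column-specific approach; the column names come from the diff context above, and the helper names follow the sketch earlier in this thread (assumed, not the exact PR code):

col_list = ["phase", "size", "spatial_frequency"]  # columns known to hold str(number)/str(list)
stimulus_presentations = literal_col_eval(stimulus_presentations, columns=col_list)
stimulus_presentations = df_list_to_tuple(stimulus_presentations, columns=col_list)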

from allensdk.brain_observatory.behavior.behavior_project_cache.\
project_apis.data_io.project_cloud_api_base import ProjectCloudApiBase # noqa: E501


def literal_col_eval(df: pd.DataFrame,
Contributor Author

Moved this function to utilities.py and removed the default values for columns.

Contributor

@aamster aamster left a comment

  • Please move the utility functions out of the swdb package, since that is for a workshop
  • Please do not lint until AFTER review. It is near impossible to review with the linting included

Otherwise looks good.

@@ -2,19 +2,44 @@
import pandas as pd
Contributor

What changed in this file other than the linting? Why did you need to touch this file?

Contributor Author

I think it was when I rebased my PR onto the updated RC branch: there were merge conflicts that I needed to resolve, which marked the file as touched, and my script that runs black on changed files reformatted it.

@@ -3,23 +3,44 @@
from pynwb import NWBFile
Contributor

What changed in this file?

Contributor Author

I think it was when I rebased my PR onto the updated RC branch: there were merge conflicts that I needed to resolve, which marked the file as touched, and my script that runs black on changed files reformatted it.

@mikejhuang mikejhuang force-pushed the ticket/2594/dev branch 3 times, most recently from bc230f4 to d1c6980 on November 28, 2022 09:37
@mikejhuang mikejhuang merged commit 28e8497 into rc/2.16.1 Nov 28, 2022
@mikejhuang mikejhuang deleted the ticket/2594/dev branch November 28, 2022 20:18
@ZeroAda

ZeroAda commented Dec 4, 2023

Just to follow up on the first issue (the _build_stimulus_presentations method has pandas operations that trigger errors): I changed the line # stimulus_presentations.replace("", nonapplicable, inplace=True) in ecephys_session.py to the following:

bool_columns = stimulus_presentations.select_dtypes(include=['bool']).columns
stimulus_presentations[bool_columns] = stimulus_presentations[bool_columns].astype('object')
stimulus_presentations[bool_columns] = stimulus_presentations[bool_columns].fillna(nonapplicable)

and it works.
