Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Polish to Missing Value Vizualizers in Contrib #668

Closed
wants to merge 10 commits into from

Conversation

ndanielsen
Copy link
Contributor

@ndanielsen ndanielsen commented Dec 10, 2018

This PR is addressing comments in the original PR for the missing values bar and dispersion visualizers.

The original PR is located here: #519

@bbengfort bbengfort added the review PR is open label Dec 10, 2018
@ndanielsen ndanielsen changed the title Add Polish to Missing Value Vizualizers in Contrib Add Polish to Missing Value Vizualizers in Contrib (Work in progress) Dec 10, 2018
@ndanielsen ndanielsen changed the title Add Polish to Missing Value Vizualizers in Contrib (Work in progress) Add Polish to Missing Value Vizualizers in Contrib Dec 10, 2018
"""

def setUp(self):
super(TestMissingBarVisualizer, self).setUp()
super(MissingBarVisualizerTestCase, self).setUp()
self.tol = 0.01
if os.name == 'nt': # Windows
self.tol = 0.5
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In another PR, I'm going to move these tolerance settings into VisualTestCase so they can be more generally shared.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that idea, e.g. if tol is None then use these default tolerances based on the OS.

Copy link
Member

@bbengfort bbengfort left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great polish! In this case, its the little things like fixtures and reference removals that make a huge impact to the overall quality and robustness of the code. Thanks for continuing to knock away at these visualizers!

Is your plan to move these visualizers out of contrib at some point? If so, can I ask what your criteria for moving them out of contrib is?



@pytest.fixture(scope='class')
def missingdata(request):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great fixture!

def test_viz_properties(self):
"""
Integration test of visualizer with pandas
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this test require pandas?

"""

def setUp(self):
super(TestMissingBarVisualizer, self).setUp()
super(MissingBarVisualizerTestCase, self).setUp()
self.tol = 0.01
if os.name == 'nt': # Windows
self.tol = 0.5
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that idea, e.g. if tol is None then use these default tolerances based on the OS.

viz.poof()

self.assert_images_similar(viz, tol=self.tol)


def test_missingvaluesbar_numpy_with_y_target(self):
"""
Integration test of visualizer with numpy without target y passed in
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*with target y passed in

viz.poof()

self.assert_images_similar(viz, tol=self.tol)


def test_missingvaluesbar_numpy_with_y_target_with_labels(self):
"""
Integration test of visualizer with numpy without target y passed in
but no class labels
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*with y passed in and class labels

@@ -41,20 +41,25 @@ def setUp(self):
if os.name == 'nt': # Windows
self.tol = 5.0

@pytest.mark.skipif(pd is None, reason="test requires pandas")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question as before - why does this test require pandas? Sorry if I'm missing something obvious!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm making an explicit test cases for support of pandas objects.


features = [str(n) for n in range(20)]
classes = ['Class A', 'Class B']
viz = MissingValuesDispersion(features=features, classes=classes)
viz.fit(X, y=y)
viz.fit(self.missingdata.X, y=self.missingdata.y)
viz.poof()

self.assert_images_similar(viz, tol=self.tol)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome clean up of the tests - the fixture definitely made the tests much simpler to read!


else:
# add in counting of np.nan per target y by column
nan_counts = []
for target_value in np.unique(self.y):
for target_value in np.unique(y):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for removing the data storage on the class! I think this is going to make the missing values visualizer much more performant!

@ndanielsen
Copy link
Contributor Author

@bbengfort my personal criteria for hoisting these visualizers into the main package are:

  • For bar viz, adding an option for percent in addition to count when targets are supplied
  • For dispersion, making sure that it accepts datetime index
  • Adding the quick method into the documentation. I think that we likely should emphasize quick methods more especially for feature vizualizers

@ndanielsen
Copy link
Contributor Author

I'm going to address the comments here and then I'll merge in. @bbengfort thanks as always for the great code review!

@bbengfort bbengfort mentioned this pull request Dec 11, 2018
@bbengfort
Copy link
Member

@ndanielsen those sound like good criteria and I'm excited that we're close to getting them into the main package!

What would you think about putting these visualizers under a new package called yellowbrick.wrangling? Potentially this falls under feature analysis, but I think this along with outlier detection visualizers might be something that would fall under its own package?

As for the quick methods, I agree that these should be hoisted to prime time. See #600 which is already underway in #669. These issues change the quick method API a tad (e.g. returning a visualizer) but also ensure consistency across all quick methods. Once that is done, we need to find a way to include them more informatively in the documentation. I'm not sure whether to autodoc them or to create a special section for them (or a special format in the API documentation for them).

Also one quick comment; see #673 -- there might be a minor merge collision with this PR. Would you mind working with @Kautumn06 to make sure both PRs are merged easily?

@rebeccabilbro
Copy link
Member

Hey there @ndanielsen - is this something you're still interested in working on? It's just occurring to me that the develop branch is starting to diverge somewhat significantly from your PR branch (and it's about to change a lot more with #687, #669, etc), so wanted to double check with you before things get too tangly!

@rebeccabilbro rebeccabilbro added the gone-stale PR or issue has not seen activity in >30 days and/or develop branch has since diverged significantly label Feb 13, 2019
@rebeccabilbro
Copy link
Member

Hi @ndanielsen! Given that the develop branch has evolved enough that we'll probably end up with some fairly gnarly conflicts if we try to merge this in now, I'm going to go ahead and archive this PR. Thank you for all the work you put into this and hope we can revisit these contributions sometime in the future; I know it will be a very popular feature!

@rebeccabilbro rebeccabilbro removed the review PR is open label Feb 15, 2019
@ndanielsen ndanielsen deleted the missing-value-refactor branch May 20, 2019 20:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
gone-stale PR or issue has not seen activity in >30 days and/or develop branch has since diverged significantly
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants