Add random sampling and improve performance of Parallel Coordinates viz #420

Merged
8 commits merged into DistrictDataLabs:develop on May 23, 2018

Conversation

@thekylesaurus (Contributor)

Fixes #59 (partially)

Allows for random sampling via a "shuffle" parameter, consistent with sklearn's train/test split, for instance.

Improves performance by at least an order of magnitude for several test data sets.

@DistrictDataLabs/team-oz-maintainers

@thekylesaurus (Contributor Author)

thekylesaurus commented May 14, 2018

The performance analysis shown below is based on the iris data set, where replicates were used to increase the data set size (as indicated by the factor on the x-axis).

[Images: performance benchmark plots (capture2, capture)]

@thekylesaurus (Contributor Author)

Unfortunately, the approach used results in a subtly modified style. There is no longer alpha addition intra-class. The argument could be made that the new style reduces intra-class information; or, conversely, that it reduces unnecessary busyness. Is this style change going to be acceptable? If not, I will abandon this request and simply submit a version with only the random subsampling added.

[Images: current_style, new_style]

@bbengfort (Member)

@thekylesaurus awesome PR, thank you so much for taking a look at this! I'm very impressed by the speed up graphs and the thorough presentation, you definitely addressed questions I may have had right off the bat.

I'm going to start my review now, and I'll dive into the alpha issue as I review. In the meantime, the tests aren't passing due to image closeness errors. It appears that your PR does not include updated baseline images for the sampling method -- my guess is that you were omitting them until the style question was resolved. However, it is fairly simple to synchronize the actual and baseline images, if you'd like to follow these steps:

  1. Run the tests: pytest tests/test_features/test_pcoords.py
  2. In the root directory run the sync: python -m tests.images tests/test_features/test_pcoords.py

This will move the test case images generated by the tests in tests/actual_images to the baseline comparison directory, tests/baseline_images. You can then add the baseline images (the actual images are git ignored), commit and push them.

@bbengfort (Member)

Oh, make sure you have the test dependencies installed; pip install -r tests/requirements.txt!

@bbengfort (Member) left a comment

This is an excellent PR and a welcome contribution, thank you so much! I had a few comments in the code that hopefully will be easy enough to address during the sprints today. Other than that, there are some tests that should be written ... or at least stubbed out - both for pcoords and for the new is_series util.

I'm sensitive to the fact that you're traveling - do you have time to get these things sorted, e.g. are you at the sprints again today? If not, we can work something out to ensure your contribution gets included in a timely manner; just let me know!

@@ -218,7 +218,7 @@ def fit(self, X, y=None, **kwargs):
# Store the classes for the legend if they're None.
if self.classes_ is None:
# TODO: Is this the most efficient method?
-            self.classes_ = [str(label) for label in set(y)]
+            self.classes_ = [label for label in set(y)]
@bbengfort (Member)

y may contain integers for already label encoded vectors; is there a reason you removed the forced string conversion?

@thekylesaurus (Contributor Author)

I didn't understand why you would want to cast integers (or even floats) to string. It upsets the mapping between self.classes_ and y. Removing this cast makes certain operations easier downstream.

Nonetheless, I admittedly don't know all of the repercussions for the various children of DataVisualizer. I'll revert and handle this inside of ParallelCoordinates.draw()
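For illustration, a minimal sketch (with made-up values) of how the forced str cast breaks type consistency between self.classes_ and an integer-encoded y:

    import numpy as np

    y = np.array([0, 1, 1, 2, 0])                    # already label-encoded target

    classes_cast = [str(label) for label in set(y)]  # ['0', '1', '2']
    classes_raw = [label for label in set(y)]        # [0, 1, 2]

    # Downstream lookups against the original integer labels only work
    # when the cast is removed.
    print(y[0] in classes_cast)   # False -- '0' != 0
    print(y[0] in classes_raw)    # True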

@bbengfort (Member)

I appreciate that - these are class labels, not classes, and are generally used for display on the image. We do have a lot to figure out with this; it's a fairly large topic that includes sequential vs. discrete target values (#73), using label encoders, etc. Scikit-learn also has some check_target utilities that we'd like to employ more meaningfully.

I think if you handle this inside of ParallelCoordinates.draw() you'll be setting a pretty good example.

@thekylesaurus (Contributor Author)

Good deal. It's completed and will be sent in the next push.

Attributes
--------

n_samples : int
@bbengfort (Member)

Because this attribute is "learned" on fit, I recommend that we call it n_samples_ -- the underscore suffix is used in scikit-learn to distinguish between learned parameters and hyperparameters.

@thekylesaurus (Contributor Author)

Totally agree. But self.n_samples existed prior to my contribs; I didn't know if this could break anything that inherits from ParallelCoordinates.

@bbengfort (Member)

Ah, gotcha - yeah, I see it now. I checked and actually nothing subclasses ParallelCoordinates, so it should be fine (you can run make uml if you have pyreverse installed to generate these images) - though I suspect there is a test that checks that property that might have to be updated.

[Image: classes_yellowbrick (UML class diagram)]

@thekylesaurus (Contributor Author)

There actually isn't a test that references that field. But I'll add something.

@bbengfort (Member)

Awesome thanks, in that case, it must just be left-over technical debt; but I like having it.

# Visual Parameters
self.show_vlines = vlines
self.vlines_kwds = vlines_kwds or {
'linewidth': 1, 'color': 'black'
}

# Defaults for derived values
self.n_samples = None
@bbengfort (Member)

If this parameter is learned on fit, it does not need a default specified in __init__, and even if we're not sampling, it can just contain the number of instances which might be handy information.

@thekylesaurus (Contributor Author)

PyCharm gets mad when you add a member variable outside of __init__. I tend to agree with this PEP 8 convention - adding member variables outside of __init__ can make it difficult to track different attributes. Having everything allocated in __init__ helps interpretability IMO.

@thekylesaurus (Contributor Author)

Nonetheless, I'll remove.

@bbengfort (Member)

That's certainly true, and in any other project, I would agree. The issue is that scikit-learn uses the existence of these attributes for check_fitted and other fit-related validation.
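As a minimal sketch of that convention: scikit-learn's check_is_fitted looks for trailing-underscore attributes that only exist after fit has run (the estimator below is a toy, not Yellowbrick code):

    from sklearn.base import BaseEstimator
    from sklearn.exceptions import NotFittedError
    from sklearn.utils.validation import check_is_fitted


    class ToySampler(BaseEstimator):

        def __init__(self, sample=1.0):
            self.sample = sample                         # hyperparameter: set in __init__

        def fit(self, X, y=None):
            self.n_samples_ = int(len(X) * self.sample)  # learned attribute: set in fit
            return self


    est = ToySampler(sample=0.5)
    try:
        check_is_fitted(est, "n_samples_")               # attribute absent -> not fitted
    except NotFittedError:
        print("not fitted yet")

    est.fit([[1], [2], [3], [4]])
    check_is_fitted(est, "n_samples_")                   # passes once fit has run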

@thekylesaurus (Contributor Author)

Removed from init in next push


# Choose a subset of samples
# TODO: allow selection of a random subset of samples instead of head
# Get the total number of observations
@bbengfort (Member)

In the original version of the code, the "sampling" was done in draw since it was only a single line and prevented us from having to override DataVisualizer.fit -- however, I think with the inclusion of more sampling code and the "learned" parameter, n_samples_, it now makes sense to add a fit() method to this class that calls super and performs the sampling. That will keep draw a bit more discrete and allow us to test the sampling mechanism without creating an image (e.g. we can patch draw and inspect the input passed to it).
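For concreteness, a rough sketch of the shape such a fit() override might take (the _subsample helper named here is hypothetical at this point in the thread; a helper by that name appears later in the PR):

    def fit(self, X, y=None, **kwargs):
        """
        Subsample (and normalize) the data, then delegate to DataVisualizer.fit,
        which in turn calls draw() and returns self.
        """
        X, y = self._subsample(X, y)
        return super(ParallelCoordinates, self).fit(X, y, **kwargs)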

@thekylesaurus (Contributor Author)

Good call. Will do.

@thekylesaurus (Contributor Author)

@bbengfort - do you want this new ParallelCoordinates.fit() to return X and y? Maybe only in some non-default verbose mode used for testing only?

@bbengfort (Member)

We do want to ensure the signatures for sklearn interface methods remain the same, for fit it is:

class Estimator(object):
    
    def fit(self, X, y=None):
        return self     

This will ensure that helper methods such as fit_transform in pipelines continue to function as expected.

To test, I recommend we mock ParallelCoordinates.draw, which X and y will be passed to; we can then inspect the output of fit there.
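A rough sketch of that testing approach (the exact arguments forwarded to draw are an assumption; adjust to the real call signature):

    from unittest import mock

    import numpy as np
    from yellowbrick.features import ParallelCoordinates

    X = np.random.rand(100, 4)
    y = np.random.randint(0, 3, 100)

    # Patch draw so no image is rendered, then inspect what fit() forwards to it.
    with mock.patch.object(ParallelCoordinates, "draw") as mock_draw:
        viz = ParallelCoordinates(sample=0.5, shuffle=True)
        assert viz.fit(X, y) is viz                 # fit returns self

        X_drawn, y_drawn = mock_draw.call_args[0]   # positional (X, y) assumed
        assert len(X_drawn) == 50                   # subsampled before drawing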

# Choose a subset of samples
# TODO: allow selection of a random subset of samples instead of head
# Get the total number of observations
nobs = len(X)
@bbengfort (Member)

Because X is an np.ndarray at this point, there is actually no cost to calling len(X) -- it's just returning an attribute, it's not actually counting or anything. I would prefer to use len(X) over a variable name because it is a bit more readable if that's ok.

@thekylesaurus (Contributor Author)

Interesting. Sounds good.

if (self.n_samples < nobs) and self.shuffle:
indices = choice(nobs, self.n_samples, replace=False)
else:
indices = slice(self.n_samples)
@bbengfort (Member)

Very neat, I haven't used slice before, glad to learn something new!

@thekylesaurus (Contributor Author)

Just learned about it at PyCon2018!
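For anyone else meeting slice for the first time, the two branches produce interchangeable indexers; a quick standalone check:

    import numpy as np

    X = np.arange(20).reshape(10, 2)

    # slice(n) used as an index is the programmatic spelling of the literal [:n]
    assert np.array_equal(X[slice(4)], X[:4])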

self.n_samples = int(nobs * self.sample)

if (self.n_samples < nobs) and self.shuffle:
indices = choice(nobs, self.n_samples, replace=False)
@bbengfort (Member)

This is not a problem for this PR, but perhaps you could add a comment to the issue or a new issue (or maybe note this in the docstring) -- this method of sampling does not stratify by class, it's a uniform random sample of all instances. This means that the ratio of classes in the sampled dataset probably will not match the ratio of classes in the original data set.

Along those lines, scikit-learn has a number of shuffle-split utilities for cross-validation, we could recommend to the user that if they did need more advanced sampling they could use one of those utilities prior to passing data to the visualizer.
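For example, a user who needs class-balanced sampling could stratify with scikit-learn before handing data to the visualizer (a sketch with synthetic data, not part of this PR):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from yellowbrick.features import ParallelCoordinates

    X = np.random.rand(1000, 4)
    y = np.random.randint(0, 3, 1000)

    # Stratify on y so class ratios in the sample match the full data set,
    # then pass the reduced data to the visualizer with no further sampling.
    X_sample, _, y_sample, _ = train_test_split(
        X, y, train_size=0.25, stratify=y, random_state=42
    )

    viz = ParallelCoordinates()
    viz.fit_transform(X_sample, y_sample)
    viz.poof()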

@thekylesaurus (Contributor Author)

I think it makes sense to add the sklearn idiomatic "stratify" option to this class (and probably others). As this PR is already getting a little heavy, I'll create a new issue.

@bbengfort (Member)

Awesome, thanks - I think that would be an excellent follow-on PR!

y = y[indices]

# Get shape of the sampled data
nrows, ncols = X.shape

# Normalize
if self.normalize is not None:
X = self.normalizers[self.normalize].fit_transform(X)
@bbengfort (Member)

Would you mind also moving normalization into the fit() method, please?

@thekylesaurus (Contributor Author)

Will do.

increments_in_class = increments_separated * len(x_in_class)
if len(x_in_class) > 0:
self.ax.plot(increments_in_class, x_in_class.flatten(), label=label, color=colors[label],
alpha=0.5, linewidth=1, **kwargs)
@bbengfort (Member)

Would you mind walking me through this new drawing method? I can see that the overall goal here was to call the plot method once per class instead of once per row - a laudable goal and one that I can see dramatically improving the performance of the visualizer. It might help to have a few comments to guide other contributors to this method, and if I could get a more detailed sense of what you're doing I can work to apply it to other visualizers.

The alpha is set here to 0.5 -- so I'm not sure I understand exactly the change in the opacity style. Given the two figures you presented, I believe the top figure is preferable because it shows darker regions of greater instance density on a per-class basis. Was the top figure generated with this code or the old code?

Frankly, even if the bottom figure was generated with your revisions, I think these revisions are worth it to keep the opacity changes and we can work on detailed styling in another PR.

@thekylesaurus (Contributor Author)

Sounds good. I will add comments to the code and some documentation to this PR to clarify.

The top image was the "current" method. The bottom image was the "new" method.

I'll revert back to 0.25 alpha. The 0.5 seemed to give a better visual agreement with the former style. Are you OK with the linewidth specification? Again, this gives better stylistic agreement for the samples I investigated.

@bbengfort (Member)

I'm ok with the linewidth argument -- in the past, we've had issues with fixed settings like this (e.g. in a "poster" mode of matplotlib where rcParams are set, this fixed value might cause some issues) but I think it will be ok for now.
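For other contributors reading along, a standalone sketch of the per-class drawing idea discussed in this thread: one plot() call per class, with NaNs inserted between rows so matplotlib lifts the pen instead of drawing back across the figure (an illustration only, not the PR's exact code):

    import numpy as np
    import matplotlib.pyplot as plt


    def plot_class(ax, X_class, color, label):
        nrows, ncols = X_class.shape
        # Append a NaN column, then flatten row by row: each row becomes one
        # segment of a single long trace, and the NaN breaks the line between rows.
        values = np.hstack([X_class, np.full((nrows, 1), np.nan)]).flatten()
        increments = np.tile(np.append(np.arange(ncols), np.nan), nrows)
        ax.plot(increments, values, color=color, label=label, alpha=0.25, linewidth=1)


    X_a = np.random.rand(50, 4)
    X_b = np.random.rand(50, 4) + 0.5

    fig, ax = plt.subplots()
    plot_class(ax, X_a, "steelblue", "class a")
    plot_class(ax, X_b, "darkorange", "class b")
    ax.legend()
    plt.show()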

@@ -202,6 +202,27 @@ def is_dataframe(obj):
isdataframe = is_dataframe


def is_series(obj):
@bbengfort (Member)

Thanks for adding this! It would also be helpful if you wouldn't mind creating some tests for the type checking; the tests are pretty well defined in tests/test_utils/test_types.py -- should be a matter of copy and paste.

@thekylesaurus (Contributor Author)

No problem!
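A plausible sketch of the helper and matching type-check tests, mirroring the existing is_dataframe pattern of comparing the class name so pandas is not a hard import (the actual implementation in the PR may differ):

    import numpy as np
    import pandas as pd


    def is_series(obj):
        """
        Returns True if the given object appears to be a pandas Series.
        """
        return obj.__class__.__name__ == "Series"


    def test_series_is_series():
        assert is_series(pd.Series([1, 2, 3]))


    def test_ndarray_is_not_series():
        assert not is_series(np.array([1, 2, 3]))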

@thekylesaurus (Contributor Author)

Hey @bbengfort - thanks for your energy and optimism! I'm happy to try to contribute.

I am on the case, but I am having some issues with the pytest-flakes and pytest-spec test dependencies. They are not conda-installable and my work machine is not amenable to conda-forge or PyPI installs. I'll try to find a way around it, but completing this may require access to my personal machine. I won't have access to that until next weekend at the earliest.

@thekylesaurus (Contributor Author)

@bbengfort - I can also SSH to my work's Linux box, but without certain Windows conveniences, my abilities become a little bit handcuffed. I'll try to make it happen.

Thanks for your support!

@bbengfort (Member)

@thekylesaurus I completely understand, and I must say - we're thrilled to get a Windows developer on the team, we mainly use macOS or Linux. I've been trying to add AppVeyor support for Windows CI for the past few weeks, but it's been a tricky road since the only Windows box I have to test with is the CI itself!

We do have a couple of options:

  1. Merge this PR without the tests and add issues and stubs to create the tests -- with this option, though, hopefully you can work with someone there during the sprints to fix the tests quickly.
  2. Wait until you have access to your machine to finish up the PR.

Either option is fine with me, let me know what you prefer!

@thekylesaurus (Contributor Author)

@bbengfort - the test_parallel_coords() test was failing due to my decision to have shuffle=True by default, consistent with sklearn test/train split subsampling. I have now set shuffle=False by default, and it rectifies the problem.

However, I would think the default behavior should be to randomly subsample. Thoughts?

@bbengfort (Member)

I agree that the default behavior should be to randomly subsample, perhaps we can tackle that in another PR so that the tests pass now?

@thekylesaurus (Contributor Author)

@bbengfort - I had about 9 commits in my working repo today. Unfortunately, I corrupted some git-related parameter in that repo and was unable to interact with the remote. I tried a number of workarounds, to no avail. So, I re-cloned yellowbrick and manually migrated all changes relative to commit 32b292e... over to the new repo.

In the latest commit (9994e09), all of the comments in the above discussion are addressed, but you don't have the luxury of granular commits. I apologize for the inconvenience.

@thekylesaurus (Contributor Author)

Illustrations of the current and new methods are shown below:

Current: [image: current_method]

New: [image: new_method]

@bbengfort (Member) left a comment

This looks great and we're very close to completion! Thank you so much for your hard work, and no worries about the granular commits, sorry that you had to redo all that work.

I had a couple of comments that should be addressed, but they're relatively minor. Other than that, it's about getting the tests to pass -- they're image comparison failures, possibly because you're generating the test images on a Windows machine, right?

If so, I can merge this PR with the tests failing and fix the tests to work on travis (we're still trying to figure out the windows situation). If you're on linux and want to try to generate the test images there and sync them, that's also fine with me.

visualizer.fit_transform(self.X, self.y)

visualizer = ParallelCoordinates(sample=3, shuffle=True, random_state=np.random.RandomState())
visualizer.fit_transform(self.X, self.y)
@bbengfort (Member)

Should we also add a shuffle=False?

@thekylesaurus (Contributor Author)

Added test_pcoords_sample_shuffle_false()
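Roughly, the added test method might look like the following inside the existing pcoords test class (a sketch; the committed test body may differ):

    def test_pcoords_sample_shuffle_false(self):
        """
        Assert no errors occur with an int sample and shuffling disabled
        """
        visualizer = ParallelCoordinates(sample=3, shuffle=False)
        visualizer.fit_transform(self.X, self.y)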

visualizer.fit_transform(self.X, self.y)

visualizer = ParallelCoordinates(sample=0.5, shuffle=True, random_state=np.random.RandomState())
visualizer.fit_transform(self.X, self.y)
@bbengfort (Member)

Same comments as above in the int sample.

Also, do we want to ensure no shuffling occurs when there is no sampling? Or is this satisfied in test_static_subsample?

@thekylesaurus (Contributor Author)

Added test_pcoords_sample_float_shuffle_false()

Tested in test_static_subsample. I would argue this is the best place because X and y are exposed on input and output.

@@ -131,3 +157,65 @@ def test_integrated_pcoords(self):
visualizer.fit_transform(X, y)
visualizer.poof()
self.assert_images_similar(visualizer)

@staticmethod
@bbengfort (Member)

What does making this a static method do for the test case?

@thekylesaurus (Contributor Author)

PEP8-compliance. That is about it.

ncols = 50

y = np.arange(ntotal).reshape(ntotal, 1)
X = np.ones((ntotal, ncols)) * y
@bbengfort (Member)

I very much like this idea for a test case, it definitely allows you to determine the ordering with ease! Note, however, that y should have shape (100,) and instead, it has shape (100,1). Because you need this shape in order to compute X, I recommend the following:

y = np.arange(ntotal)
X = np.ones((ntotal, ncols)) * y.reshape(ntotal,1) 

@thekylesaurus (Contributor Author)

Good call. Now done.

"""
Test that is_series does not match non-dataframes
"""
assert not is_series(obj)
@bbengfort (Member)

Thank you for adding these tests!

@thekylesaurus (Contributor Author)

👍

@@ -149,6 +152,15 @@ class ParallelCoordinates(DataVisualizer):
If int, specifies the maximum number of samples to display.
If float, specifies a fraction between 0 and 1 to display.

random_state : int, RandomState instance or None
@bbengfort (Member)

Awesome +100 for this - thank you so much for thinking to add this in!

@thekylesaurus (Contributor Author)

👍

n_samples_ : int
number of samples included in the visualization object

rng_ : np.random.RandomState
@bbengfort (Member)

This is a complete nit, and I apologize for it, but since I made a big deal of n_samples_ I feel like I should point out that this is not a learned parameter in fit, but a private attribute mirroring random_state and should probably be _rng. If I didn't have any other comments I would have just ignored this and perhaps made a quick change after the PR, but perhaps you could quickly update it?

@thekylesaurus (Contributor Author)

Done. Removed reference to _rng from docstring.

self.n_samples = int(len(X) * self.sample)
X = X[:self.n_samples, :]
# Subsample
X, y = self._subsample(X, y)
@bbengfort (Member)

Making the sampling a separate function was a very good idea - nice!

@thekylesaurus (Contributor Author)

👍


# Normalize
if self.normalize is not None:
X = self.normalizers[self.normalize].fit_transform(X)

# Get the shape of the data
super(ParallelCoordinates, self).fit(X, y, **kwargs)

@bbengfort (Member)

Perhaps I'm missing this in the code review pane, but fit() should call draw() -- am I missing where this is happening?

@bbengfort (Member)

Hehe, I think I got it, but I'm leaving my original comment above in - the super method is calling draw() right? Perhaps it would be more clear if we did something like:

        # the super method calls draw and returns self 
        return super(ParallelCoordinates, self).fit(X, y, **kwargs)

@thekylesaurus (Contributor Author)

Comment added. I should have had this in there.

@bbengfort (Member)

Your illustration of the current and new incrementation method is very well done, thank you. I understand in principle what you're doing - I guess I'm not sure why you had to add the null column at the end to actually create the darkening effect; I would have thought that if you put two semi-transparent lines on top of each other, they would have darkened rather than blending as they did. Anyway - great work on figuring out a solution, the performance benefits are definitely worth it!

@thekylesaurus (Contributor Author)

@bbengfort - comments addressed in most recent commits.

Pertaining to the failed tests, can you direct me to the test that is failing? I am able to pass all tests in test_pcoords, including the image comparison test.

Pertaining to the nans, they are in there to prevent the line plot from drawing from right to left. If you didn't have the nans, there would be lines crossing from the right-hand side of the image to the left-hand side of the window.

The nans don't address the alpha effect. The lack of alpha darkening is the unfortunate result of matplotlib not accounting for alpha darkening for intersections of a single trace with itself. I spent a solid hour trying to find a work-around, with no luck.

@bbengfort (Member)

Ah, sorry I forgot about the crossing lines thing - that was part of the reason we did the one line per instance implementation; the nans are a good solution! As for the alpha darkening, perhaps we can open an issue for that? I can't help feeling that there is something in matplotlib that might make it work, but if you've already spent hours on it, we should punt, the visualizer is working very well right now.

The tests are passing for you? Well, that's good news sort of - what system are you on?

I think potentially the easiest thing is for me to merge your PR and then try to fix the Travis tests myself, it's notoriously tricky and annoying. Are you ok with that?

For reference, here are the tests that are failing:

_______________ ParallelCoordinatesTests.test_integrated_pcoords _______________
self = <tests.test_features.test_pcoords.ParallelCoordinatesTests testMethod=test_integrated_pcoords>
    @pytest.mark.xfail(
        sys.platform == 'win32', reason="images not close on windows"
    )
    def test_integrated_pcoords(self):
        """
            Test parallel coordinates on a real data set (downsampled for speed)
            """
        occupancy = self.load_data('occupancy')
    
        X = occupancy[[
            "temperature", "relative_humidity", "light", "C02", "humidity"
        ]]
    
        y = occupancy['occupancy'].astype(int)
    
        # Convert X to an ndarray
        X = X.copy().view((float, len(X.dtype.names)))
    
        # Test the visualizer
        visualizer = ParallelCoordinates(sample=200)
        visualizer.fit_transform(X, y)
        visualizer.poof()
>       self.assert_images_similar(visualizer)
E       yellowbrick.exceptions.ImageComparisonFailure: images not close (RMS 26.357)
E       tests/actual_images/test_features/test_pcoords/test_integrated_pcoords.png
E       	vs
E       tests/baseline_images/test_features/test_pcoords/test_integrated_pcoords.png
tests/test_features/test_pcoords.py:185: ImageComparisonFailure
_______________ ParallelCoordinatesTests.test_normalized_pcoords _______________
self = <tests.test_features.test_pcoords.ParallelCoordinatesTests testMethod=test_normalized_pcoords>
    @pytest.mark.xfail(
        sys.platform == 'win32', reason="images not close on windows"
    )
    def test_normalized_pcoords(self):
        """
            Assert no errors occur using 'normalize' argument
            """
        visualizer = ParallelCoordinates(normalize='l2')
        visualizer.fit_transform(self.X, self.y)
        visualizer.poof()
>       self.assert_images_similar(visualizer)
E       yellowbrick.exceptions.ImageComparisonFailure: images not close (RMS 0.074)
E       tests/actual_images/test_features/test_pcoords/test_normalized_pcoords.png
E       	vs
E       tests/baseline_images/test_features/test_pcoords/test_normalized_pcoords.png
tests/test_features/test_pcoords.py:65: ImageComparisonFailure

@thekylesaurus (Contributor Author)

My machine is Windows 7 Enterprise on a Dell Latitude E5570 with Intel HD Graphics 530 and AMD Radeon R7 M370.

I did more digging. In the current code base, both of the failing tests indicated above had a wrapper applied to them:

@pytest.mark.xfail(
    sys.platform == 'win32', reason="images not close on windows"
)

This wrapper evidently marks these tests as expected failures when run on a Windows machine.

When I removed these wrappers, I actually passed the test_normalized_pcoords() test.

I failed the test_integrated_pcoords() test. However, not because the image match failed, but because my computer's firewall is blocking the load_data('occupancy') step: "SSLError: ("bad handshake: SysCallError(10054, 'WSAECONNRESET')",)"

I am in favor of your merging, but I would also express caution. Some would argue that this PR modifies the look and feel of the ParallelCoordinates plot (due to the removal of intra-class alpha darkening). If you think the speed-up is worth it, please take the lead on merging and fixing the tests.

Thanks @bbengfort

@bbengfort (Member)

@DistrictDataLabs/team-oz-maintainers @thekylesaurus suggests that we carefully consider merging this parallel coordinates implementation because it changes the style of the plots:

The current, poorly performing style plots each instance as its own line, one at a time, allowing us to achieve a darkening effect where there are more instances:

[Image: current_style]

The new implementation draws each class as a single line, breaking it up with nan values so there is no crossing back and forth from the end of the parallel coordinates to the beginning. However, because these lines are the "same line", there is no darkening effect:

[Image: new_style]

The performance boost from the proposed implementation is significant, especially as more instances are added to the plot (almost a 30x improvement for 100x the iris data set).

Any thoughts on how to proceed?

@rebeccabilbro (Member)

@bbengfort - I recommend we make this an alternate visualizer, FastParallelCoordinates, and keep the current (slow) visualizer as ParallelCoordinates; we can use the docstrings and documentation for each to explain the tradeoffs to users and leave it up to them to decide which is best for their use case.

@lwgray (Contributor)

lwgray commented May 18, 2018

Is key information being lost by not showing the density?

Disregard my above comment - I know now that it reduces intra-class information.

@thekylesaurus (Contributor Author)

thekylesaurus commented May 18, 2018 via email

@bbengfort (Member)

@rebeccabilbro @thekylesaurus I think leaving it up to the user is probably the best thing to do; I'd say it's a tough call between overriding the draw method in a subclass and using a flag since the implications of the fast method are fairly subtle. For simplicity, I'm leaning towards using a flag in the constructor fast=False with a short explanation in the docstring about what's happening.

My feeling is that the usage will be that the user tries ParallelCoordinates without the fast flag, gets aggravated it's taking so long, sets the fast flag and tries again.

We should also create a section in the documentation with the graphs @thekylesaurus created and the speed up to show the implications in detail.

@bbengfort (Member)

bbengfort commented May 20, 2018

I will take the following steps when I get the chance:

  1. create an issue to fix tests/add fast and docs
  2. merge this PR
  3. synchronize baseline images and repair tests
  4. update Parallel Coordinates is getting slower #230 and Improve Parallel Coordinates #59
  5. update tests, add fast, and submit a PR

Hopefully tomorrow or the day after. I would just merge this in now but am reluctant to because the tests are failing, and I'd like to get them sorted ASAP so other contributors don't have failing tests related to parallel coordinates.

@bbengfort (Member)

See #447

@bbengfort bbengfort merged commit 6e6c58f into DistrictDataLabs:develop May 23, 2018
@bbengfort (Member)

@thekylesaurus thank you so much for this PR and your hard work; we're glad to have you as a contributor, and hope you'll consider working with us more in the future!
