Add random sampling and improve performance of Parallel Components viz #420
Conversation
…e done and complicates matching to class values
…te traces into class and plot classes individually.
Unfortunately, the approach used results in a subtly modified style. There is no longer alpha addition intra-class. The argument could be made that the new style reduces intra-class information; or, inversely, that it reduces unnecessary busyness. Is this style change going to be acceptable? If not, I will abandon this request and simply submit a version with only the random subsampling added.
@thekylesaurus awesome PR, thank you so much for taking a look at this! I'm very impressed by the speed up graphs and the thorough presentation; you definitely addressed questions I may have had right off the bat. I'm going to start my review now, and I'll dive into the alpha issue as I review. In the meantime, the tests aren't passing due to image closeness errors. It appears that your PR does not include updated images for the sampling method -- my guess was that you were omitting them until the style question was resolved. However, it is fairly simple to synchronize the actual and baseline images, if you'd like to follow these steps:
This will move the test case images generated by the tests in
Oh, make sure you have the test dependencies installed;
This is an excellent PR and a welcome contribution, thank you so much! I had a few comments in the code that hopefully will be easy enough to address during the sprints today. Other than that, there are some tests that should be written ... or at least stubbed out - both for the pcoords tests and the new is_series util.
I'm sensitive to the fact that you're traveling - do you have time to get these things sorted, e.g. are you at the sprints again today? If not, we can work something out to ensure your contribution gets included in a timely manner; just let me know!
yellowbrick/features/base.py
Outdated
@@ -218,7 +218,7 @@ def fit(self, X, y=None, **kwargs):
         # Store the classes for the legend if they're None.
         if self.classes_ is None:
             # TODO: Is this the most efficient method?
-            self.classes_ = [str(label) for label in set(y)]
+            self.classes_ = [label for label in set(y)]
y may contain integers for already label encoded vectors; is there a reason you removed the forced string conversion?
I didn't understand why you would want to cast integers (or even floats) to string. It upsets the mapping between self.classes_ and y. Removing this cast makes certain operations easier downstream.
Nonetheless, I admittedly don't know all of the repercussions for the various children of DataVisualizer. I'll revert and handle this inside of ParallelCoordinates.draw()
I appreciate that - these are class labels, not classes, and are generally used for display on the image. We do have a lot to figure out with this; it is a fairly large topic that includes sequential vs. discrete target values (#73), using label encoders, etc. Scikit-learn also has some check_target utilities that we'd also like to employ more meaningfully.
I think if you handle this inside of ParallelCoordinates.draw() you'll be setting a pretty good example.
Good deal. It is completed and will be sent in the next push.
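A minimal sketch of what handling the conversion inside draw() could look like. This is an illustration, not the actual yellowbrick implementation; the stand-in class name is an assumption:

```python
# Hypothetical sketch: keep self.classes_ in the target's native dtype so the
# mapping to y is preserved, and only stringify at display time inside draw().
class ParallelCoordinatesSketch:
    def __init__(self, classes=None):
        self.classes_ = classes  # may be ints for label-encoded targets

    def draw(self, X, y):
        # build display labels for the legend without mutating the fitted classes
        display_labels = [str(label) for label in self.classes_]
        return display_labels

viz = ParallelCoordinatesSketch(classes=[0, 1, 2])
print(viz.draw(None, None))  # ['0', '1', '2']
```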
yellowbrick/features/pcoords.py
Outdated
Attributes
----------

n_samples : int
Because this attribute is "learned" on fit, I recommend that we call it n_samples_ -- the underscore suffix is used in scikit-learn to distinguish between learned parameters and hyperparameters.
Totally agree. But self.n_samples existed prior to my contributions. I didn't know if this could break anything that inherits from ParallelCoordinates.
There actually isn't a test that references that field. But I'll add something.
Awesome thanks, in that case, it must just be left-over technical debt; but I like having it.
yellowbrick/features/pcoords.py
Outdated
# Visual Parameters
self.show_vlines = vlines
self.vlines_kwds = vlines_kwds or {
    'linewidth': 1, 'color': 'black'
}

# Defaults for derived values
self.n_samples = None
If this parameter is learned on fit, it does not need a default specified in __init__, and even if we're not sampling, it can just contain the number of instances, which might be handy information.
PyCharm gets mad when you add a member variable outside of __init__. I tend to agree with this PEP 8 convention - adding member variables outside of __init__ can make it difficult to track different attributes. Having everything allocated in __init__ helps interpretability IMO.
Nonetheless, I'll remove.
That's certainly true, and in any other project, I would agree. The issue is that scikit-learn uses the existence of these attributes for check_fitted and other fit-related validation.
Removed from init in next push
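A toy illustration of the convention being discussed (this is not yellowbrick code): hyperparameters live in __init__, learned attributes get a trailing underscore and are created only in fit(), and their existence is what fit-validation helpers like scikit-learn's check_is_fitted look for.

```python
class TinyEstimator:
    def __init__(self, sample=1.0):
        self.sample = sample         # hyperparameter: set in __init__

    def fit(self, X, y=None):
        self.n_samples_ = len(X)     # learned attribute: set only in fit()
        return self

def is_fitted(est):
    # mirrors the spirit of sklearn's check_is_fitted: an estimator is fitted
    # iff it has at least one trailing-underscore attribute
    return any(k.endswith("_") and not k.endswith("__") for k in vars(est))

est = TinyEstimator()
print(is_fitted(est))                 # False before fit
print(is_fitted(est.fit([1, 2, 3])))  # True after fit
```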
yellowbrick/features/pcoords.py
Outdated
# Choose a subset of samples
# TODO: allow selection of a random subset of samples instead of head
# Get the total number of observations
In the original version of the code, the "sampling" was done in draw since it was only a single line and prevented us from having to override DataVisualizer.fit -- however, I think with the inclusion of more sampling code and the "learned" parameter, n_samples_, it now makes sense to add a fit() method to this class that calls super and performs the sampling. That will keep draw a bit more discrete and allow us to test the sampling mechanism without creating an image (e.g. we can patch draw and inspect the input passed to draw).
Good call. Will do.
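A rough sketch of what the suggested fit() override could look like, using a minimal stand-in for DataVisualizer (the real base class lives in yellowbrick/features/base.py and does more). The sampling logic mirrors the snippets quoted in this review; a float fraction for sample is assumed for brevity:

```python
import numpy as np

class DataVisualizer:
    """Minimal stand-in for yellowbrick's DataVisualizer (assumption)."""
    def fit(self, X, y=None, **kwargs):
        self.draw(X, y)  # the real base class also stores classes_, etc.
        return self

    def draw(self, X, y):
        pass

class ParallelCoordinates(DataVisualizer):
    def __init__(self, sample=1.0, shuffle=False, random_state=None):
        self.sample = sample
        self.shuffle = shuffle
        self.random_state = random_state

    def fit(self, X, y=None, **kwargs):
        # subsample here so draw() stays discrete and testable
        self.n_samples_ = int(len(X) * self.sample)
        rng = np.random.RandomState(self.random_state)
        if self.n_samples_ < len(X) and self.shuffle:
            indices = rng.choice(len(X), self.n_samples_, replace=False)
        else:
            indices = slice(self.n_samples_)
        # delegate to the parent, which calls draw() on the sampled data
        return super(ParallelCoordinates, self).fit(X[indices], y[indices], **kwargs)

viz = ParallelCoordinates(sample=0.5).fit(np.arange(10).reshape(10, 1), np.arange(10))
print(viz.n_samples_)  # 5
```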
@bbengfort - do you want this new ParallelCoordinates.fit() to return X and y? Maybe only in some non-default verbose mode used for testing only?
We do want to ensure the signatures for sklearn interface methods remain the same; for fit it is:
    class Estimator(object):
        def fit(self, X, y=None):
            return self
This will ensure that helper methods such as fit_transform in pipelines continue to function as expected.
To test, I recommend we mock ParallelCoordinates.draw, which X and y will be passed to; we can then inspect the output of fit there.
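A sketch of that test strategy: patch draw() and inspect what fit() passes it, without ever rendering an image. The class here is a stand-in (assumption); in the real tests it would be yellowbrick's ParallelCoordinates:

```python
from unittest import mock
import numpy as np

class ParallelCoordinatesSketch:
    """Stand-in with a fit() that subsamples and then calls draw()."""
    def __init__(self, sample=1.0):
        self.sample = sample

    def fit(self, X, y=None):
        n = int(len(X) * self.sample)
        self.draw(X[:n], y[:n])
        return self

    def draw(self, X, y):
        pass

# patch draw on the class, run fit, then inspect the captured arguments
with mock.patch.object(ParallelCoordinatesSketch, "draw") as mock_draw:
    viz = ParallelCoordinatesSketch(sample=0.1).fit(np.ones((100, 4)), np.zeros(100))
    X_passed, y_passed = mock_draw.call_args[0]

print(X_passed.shape)  # (10, 4) -- only the sampled rows reached draw()
```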
yellowbrick/features/pcoords.py
Outdated
# Choose a subset of samples
# TODO: allow selection of a random subset of samples instead of head
# Get the total number of observations
nobs = len(X)
Because X is an np.ndarray at this point, there is actually no cost to calling len(X) -- it's just returning an attribute, it's not actually counting or anything. I would prefer to use len(X) over a variable name because it is a bit more readable, if that's ok.
Interesting. Sounds good.
yellowbrick/features/pcoords.py
Outdated
if (self.n_samples < nobs) and self.shuffle:
    indices = choice(nobs, self.n_samples, replace=False)
else:
    indices = slice(self.n_samples)
Very neat, I haven't used slice before, glad to learn something new!
Just learned about it at PyCon2018!
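A small demonstration of why slice() works in the code above: a slice object can be stored in a variable and used later to index an array, interchangeably with a fancy-index array.

```python
import numpy as np

X = np.arange(10) * 10

head = slice(3)                  # X[head] is equivalent to X[:3]
print(X[head].tolist())          # [0, 10, 20]

shuffled = np.array([7, 2, 5])   # a shuffled fancy index works the same way
print(X[shuffled].tolist())      # [70, 20, 50]
```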
yellowbrick/features/pcoords.py
Outdated
self.n_samples = int(nobs * self.sample)

if (self.n_samples < nobs) and self.shuffle:
    indices = choice(nobs, self.n_samples, replace=False)
This is not a problem for this PR, but perhaps you could add a comment to the issue or a new issue (or maybe note this in the docstring) -- this method of sampling does not stratify by class, it's a uniform random sample of all instances. This means that the ratio of classes in the sampled dataset probably will not match the ratio of classes in the original data set.
Along those lines, scikit-learn has a number of shuffle-split utilities for cross-validation, we could recommend to the user that if they did need more advanced sampling they could use one of those utilities prior to passing data to the visualizer.
I think it makes sense to add the sklearn idiomatic "stratify" option to this class (and probably others). As this PR is already getting a little heavy, I'll create a new issue.
Awesome, thanks - I think that would be an excellent follow-on PR!
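As suggested above, a user who needs a class-stratified subsample can do it before handing data to the visualizer. One sketch using scikit-learn's train_test_split (the stratify argument keeps class ratios in the sample); the downstream visualizer call is a hypothetical placeholder:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy imbalanced target: 90 instances of class 0, 10 of class 1
X = np.random.rand(100, 4)
y = np.array([0] * 90 + [1] * 10)

# keep 20% of the data, preserving the 9:1 class ratio
X_sub, _, y_sub, _ = train_test_split(
    X, y, train_size=0.2, stratify=y, random_state=42
)
print(np.bincount(y_sub))  # [18  2]
# visualizer.fit_transform(X_sub, y_sub)  # hypothetical downstream use
```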
yellowbrick/features/pcoords.py
Outdated
y = y[indices]

# Get shape of the sampled data
nrows, ncols = X.shape

# Normalize
if self.normalize is not None:
    X = self.normalizers[self.normalize].fit_transform(X)
Would you mind also moving normalization into the fit() method, please?
Will do.
yellowbrick/features/pcoords.py
Outdated
increments_in_class = increments_separated * len(x_in_class)
if len(x_in_class) > 0:
    self.ax.plot(increments_in_class, x_in_class.flatten(), label=label,
                 color=colors[label], alpha=0.5, linewidth=1, **kwargs)
Would you mind walking me through this new drawing method? I can see that the overall goal here was to call the plot method once per class instead of once per row - a laudable goal and one that I can see dramatically improving the performance of the visualizer. It might help to have a few comments to guide other contributors to this method, and if I could get a more detailed sense of what you're doing I can work to apply it to other visualizers.
The alpha is set here to 0.5 -- so I'm not sure I understand exactly the change in the opacity style. Given the two figures you presented, I believe the top figure is preferable because it shows darker regions of greater instance density on a per-class basis. Was the top figure generated with this code or the old code?
Frankly, even if the bottom figure was generated with your revisions, I think these revisions are worth it to keep the opacity changes and we can work on detailed styling in another PR.
Sounds good. I will add comments to the code and some documentation to this PR to clarify.
The top image was the "current" method. The bottom image was the "new" method.
I'll revert back to 0.25 alpha. The 0.5 seemed to give a better visual agreement with the former style. Are you OK with the linewidth specification? Again, this gives better stylistic agreement for the samples I investigated.
I'm ok with the linewidth argument -- in the past, we've had issues with fixed settings like this (e.g. in a "poster" mode of matplotlib where rcParams are set, this fixed value might cause some issues) but I think it will be ok for now.
@@ -202,6 +202,27 @@ def is_dataframe(obj):
isdataframe = is_dataframe


def is_series(obj):
Thanks for adding this! It would also be helpful if you wouldn't mind creating some tests for the type checking; the tests are pretty well defined in tests/test_utils/test_types.py -- should be a matter of copy and paste.
No problem!
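For context, a sketch of what an is_series type check typically looks like (the actual yellowbrick implementation may differ); a lazy import avoids a hard pandas requirement:

```python
def is_series(obj):
    """Returns True if obj is a pandas Series, False otherwise."""
    try:
        import pandas as pd
        return isinstance(obj, pd.Series)
    except ImportError:
        # pandas not installed: nothing can be a Series
        return False

print(is_series([1, 2, 3]))  # False: plain lists are not Series
```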
Hey @bbengfort - thanks for your energy and optimism! I'm happy to try to contribute. I am on the case, but I am having some issues with the pytest-flakes and pytest-spec test dependencies. They are not conda installable and my work machine is not amenable to conda-forge or pypi installs. I'll try to find a way around, but completing this may require access to my personal machine. I won't have access to that until next weekend at the earliest.
@bbengfort - I can also ssh to my work's linux box, but without certain windows conveniences, my abilities become a little bit handcuffed. I'll try to make it happen. Thanks for your support!
@thekylesaurus I completely understand, and I must say - we're thrilled to get a Windows developer on the team; we mainly use macOS or Linux. I've been trying to add AppVeyor support for Windows CI for the past few weeks, but it's been a tricky road since the only Windows box I have to test with is the CI itself! We do have a couple of options:
Either option is fine with me, let me know what you prefer!
@bbengfort - the test_parallel_coords() test was failing due to my decision to have shuffle=True by default, consistent with sklearn test/train split subsampling. I have now set shuffle=False by default, and it rectifies the problem. However, I would think the default behavior should be to randomly subsample. Thoughts?
I agree that the default behavior should be to randomly subsample; perhaps we can tackle that in another PR so that the tests pass now?
@bbengfort - I had about 9 commits in my working repo today. Unfortunately, I corrupted some git-related parameter in that repo and was unable to interact with remote. I tried a number of work-arounds, to no avail. So, I re-cloned yellowbrick and manually migrated all changes relative to commit 32b292e... over to the new repo. In the latest commit (9994e09), all of the comments in the above discussion are addressed, but you don't have the luxury of granular commits. I apologize for the inconvenience.
This looks great and we're very close to completion! Thank you so much for your hard work, and no worries about the granular commits, sorry that you had to redo all that work.
I had a couple of comments that should be addressed, but they're relatively minor. Other than that, it's about getting the tests to pass -- they're image comparison failures, possibly because you're generating the test images on a Windows machine, right?
If so, I can merge this PR with the tests failing and fix the tests to work on travis (we're still trying to figure out the windows situation). If you're on linux and want to try to generate the test images there and sync them, that's also fine with me.
visualizer.fit_transform(self.X, self.y)

visualizer = ParallelCoordinates(sample=3, shuffle=True, random_state=np.random.RandomState())
visualizer.fit_transform(self.X, self.y)
Should we also add a shuffle=False?
Added test_pcoords_sample_shuffle_false()
visualizer.fit_transform(self.X, self.y)

visualizer = ParallelCoordinates(sample=0.5, shuffle=True, random_state=np.random.RandomState())
visualizer.fit_transform(self.X, self.y)
Same comments as above in the int sample.
Also, do we want to ensure no shuffling occurs when there is no sampling? Or is this satisfied in test_static_subsample?
Added test_pcoords_sample_float_shuffle_false()
Tested in test_static_subsample. I would argue this is the best place because X and y are exposed on input and output.
@@ -131,3 +157,65 @@ def test_integrated_pcoords(self):
    visualizer.fit_transform(X, y)
    visualizer.poof()
    self.assert_images_similar(visualizer)


@staticmethod
What does making this a static method do for the test case?
PEP8-compliance. That is about it.
tests/test_features/test_pcoords.py
Outdated
ncols = 50

y = np.arange(ntotal).reshape(ntotal, 1)
X = np.ones((ntotal, ncols)) * y
I very much like this idea for a test case, it definitely allows you to determine the ordering with ease! Note, however, that y should have shape (100,) and instead it has shape (100, 1). Because you need this shape in order to compute X, I recommend the following:
    y = np.arange(ntotal)
    X = np.ones((ntotal, ncols)) * y.reshape(ntotal, 1)
Good call. Now done.
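The recommendation above works because of NumPy broadcasting: y itself stays 1-D with shape (100,), while a column-vector view of it is what multiplies against the (100, 50) ones array.

```python
import numpy as np

ntotal, ncols = 100, 50
y = np.arange(ntotal)
X = np.ones((ntotal, ncols)) * y.reshape(ntotal, 1)

print(y.shape, X.shape)   # (100,) (100, 50)
print(X[7, 0], X[7, -1])  # 7.0 7.0 -- every column of a row equals its label
```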
""" | ||
Test that is_series does not match non-dataframes | ||
""" | ||
assert not is_series(obj) |
Thank you for adding these tests!
👍
@@ -149,6 +152,15 @@ class ParallelCoordinates(DataVisualizer):
    If int, specifies the maximum number of samples to display.
    If float, specifies a fraction between 0 and 1 to display.

random_state : int, RandomState instance or None
Awesome +100 for this - thank you so much for thinking to add this in!
👍
yellowbrick/features/pcoords.py
Outdated
n_samples_ : int | ||
number of samples included in the visualization object | ||
|
||
rng_ : np.random.RandomState |
This is a complete nit, and I apologize for it, but since I made a big deal of n_samples_ I feel like I should point out that this is not a learned parameter in fit, but a private attribute mirroring random_state, and should probably be _rng. If I didn't have any other comments I would have just ignored this and perhaps made a quick change after the PR, but perhaps you could quickly update it?
Done. Removed reference to _rng from docstring.
self.n_samples = int(len(X) * self.sample)
X = X[:self.n_samples, :]
# Subsample
X, y = self._subsample(X, y)
Making the sampling a separate function was a very good idea - nice!
👍
yellowbrick/features/pcoords.py
Outdated
# Normalize
if self.normalize is not None:
    X = self.normalizers[self.normalize].fit_transform(X)

# Get the shape of the data
super(ParallelCoordinates, self).fit(X, y, **kwargs)
Perhaps I'm missing this in the code review pane, but fit() should call draw() -- am I missing where this is happening?
Hehe, I think I got it, but I'm leaving my original comment above in - the super method is calling draw(), right? Perhaps it would be more clear if we did something like:
    # the super method calls draw and returns self
    return super(ParallelCoordinates, self).fit(X, y, **kwargs)
Comment added. I should have had this in there.
Your illustration of the current and new incrementation method is very well done, thank you. I understand in principle what you're doing - I guess I'm not sure why you had to add the null column at the end to actually create the darkening effect; I would have thought that if you put two semi-transparent lines on top of each other, they would have darkened rather than blending as they did. Anyway - great work on figuring out a solution, the performance benefits are definitely worth it!
@bbengfort - comments addressed in most recent commits. Pertaining to the failed test, can you direct me to the test that is failing? I am able to pass all tests in test_pcoords, including the image comparison test. Pertaining to the nans, they are in there to prevent the line plot from drawing from right to left. If you didn't have the nans, there would be lines crossing from the right-hand side of the image to the left-hand side of the window. The nans don't address the alpha effect. The lack of alpha darkening is the unfortunate result of matplotlib not accounting for alpha darkening at intersections of a single trace with itself. I spent a solid hour trying to find a work-around, with no luck.
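A hedged sketch of the per-class drawing trick described above: concatenate all rows of one class into a single line, inserting NaN "pen lifts" so matplotlib never draws a segment from the end of one row back to the start of the next. Function and variable names here are illustrative, not yellowbrick's:

```python
import numpy as np

def class_line(rows):
    """rows: (n, ncols) array for one class -> x, y for a single ax.plot call."""
    n, ncols = rows.shape
    # x positions 0..ncols-1 for each row, each row followed by a NaN separator
    x = np.tile(np.append(np.arange(ncols, dtype=float), np.nan), n)
    # append a NaN column to the data, then flatten row by row
    y = np.hstack([rows, np.full((n, 1), np.nan)]).flatten()
    return x, y

rows = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
x, y = class_line(rows)
print(y)  # [ 1.  2.  3. nan  4.  5.  6. nan]
```

A single `ax.plot(x, y)` call then renders the whole class; matplotlib breaks the line wherever it encounters a NaN, which is what prevents the right-to-left crossing segments.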
Ah, sorry I forgot about the crossing lines thing - that was part of the reason we did the one line per instance implementation; the nans are a good solution! As for the alpha darkening, perhaps we can open an issue for that? I can't help feeling that there is something in matplotlib that might make it work, but if you've already spent hours on it, we should punt, the visualizer is working very well right now. The tests are passing for you? Well, that's good news sort of - what system are you on? I think potentially the easiest thing is for me to merge your PR and then try to fix the Travis tests myself, it's notoriously tricky and annoying. Are you ok with that? For reference, here are the tests that are failing:
My machine is a Windows 7 Enterprise on a Dell Latitude E5570 with Intel HD Graphics 530 and AMD Radeon R7 M370. I did more digging. In the current code base, both of the failing tests indicated above had a wrapper applied to them:
This wrapper evidently skips these tests when run on a Windows machine. When I removed these wrappers, I actually passed the test_normalized_pcoords() test. I failed the test_integrated_pcoords() test - however, not because the image match failed, but because my computer's firewall is blocking the load_data('occupancy') step: "SSLError: ("bad handshake: SysCallError(10054, 'WSAECONNRESET')",)". I am in favor of your merging, but I would also express caution. Some would argue that this PR modifies the look and feel of the ParallelCoordinates plot (due to the removal of intra-class alpha darkening). If you think the speed increase is worth it, please take the lead on merging and fixing the tests. Thanks @bbengfort
@DistrictDataLabs/team-oz-maintainers @thekylesaurus suggests that we consider carefully merging this parallel coordinates implementation because it changes the style of the plots. The current, poorly performing style plots each instance as its own line, one at a time, allowing us to achieve a darkening effect where more instances are. The new implementation draws each class as a single line, breaking them up with nan values so there is no cross back and forth from the end of the parallel coordinates to the beginning. However, because these lines are the "same line", there is no darkening effect. The performance boost from the proposed implementation is significant, especially as more instances are added to the plot (almost a 30x improvement for 100x the iris data set). Any thoughts on how to proceed?
@bbengfort - I recommend we make this an alternate visualizer,
Is key information being lost by not showing the density? Disregard my above comment - I now know that it does reduce intraclass information.
What about a flag in the existing visualizer?
@rebeccabilbro @thekylesaurus I think leaving it up to the user is probably the best thing to do; I'd say it's a tough call between overriding the draw method in a subclass and using a flag, since the implications of the fast method are fairly subtle. For simplicity, I'm leaning towards using a flag in the constructor. My feeling is that the usage will be that the user tries ... We should also create a section in the documentation with the graphs @thekylesaurus created and the speed up to show the implications in detail.
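A sketch of the flag-in-the-constructor approach being discussed; the flag name `fast`, the method names, and the stand-in class are all assumptions for illustration, not the merged API:

```python
class ParallelCoordinatesSketch:
    """Stand-in showing how a constructor flag could select the draw strategy."""
    def __init__(self, fast=False):
        self.fast = fast

    def draw(self, X, y):
        if self.fast:
            # one plot call per class: much faster, but no intra-class darkening
            return self.draw_classes(X, y)
        # one plot call per instance: slow, but overlapping lines darken
        return self.draw_instances(X, y)

    def draw_classes(self, X, y):
        return "classes"

    def draw_instances(self, X, y):
        return "instances"

print(ParallelCoordinatesSketch(fast=True).draw(None, None))  # classes
print(ParallelCoordinatesSketch().draw(None, None))           # instances
```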
I will take the following steps when I get the chance:
Hopefully tomorrow or the day after. I would just merge this in now, but am reluctant to because the tests are failing, and I'd like to get them sorted ASAP so other contributors don't have failing tests related to parallel coordinates.
See #447
@thekylesaurus thank you so much for this PR and your hard work; we're glad to have you as a contributor, and hope you'll consider working with us more in the future!
Fixes #59 (partially)
Allows for random sampling via "shuffle" parameter, consistent with sklearn train/test split, for instance.
Improves performance by at least an order of magnitude for several test data sets.
@DistrictDataLabs/team-oz-maintainers