
Grid search color plot viz (WIP) #261

Merged — 11 commits, Feb 3, 2018

Conversation

@pbs929 (Contributor) commented Jun 12, 2017

Don't mean to inundate you all with PRs, but I'm having fun :)

I put together a color plot visualizer for grid search - see the ipynb for a demo. Also aiming to create a structure into which other grid search visualizers can fit.

Unit tests are definitely needed... I will come back to this after getting some experience with the testing workflow on my other PRs. General feedback on the approach is welcome in the meantime.

Called for in issue #12.

@bbengfort (Member):

@NealHumphrey do you have time to take a look at this by any chance?

@pbs929 keep them coming!

@bbengfort (Member):

@pbs929 still interested in finishing this PR? I'm happy to do a review.

@pbs929 (Contributor, Author) commented Dec 5, 2017

Yes, still interested! As I recall I was waiting on some general feedback on the approach (see demo in the ipynb).

@bbengfort (Member) left a review:

This looks very, very good - and I'm excited to get it into yellowbrick. I'm sorry it took so long to review, it just got buried in my email. Next time feel free to ping me/us so it comes back to our attention!

Let me know your thoughts on my comments below and then I can approve when you're ready.

param_1 : string
The name of the parameter to be visualized on the horizontal axis.

param_2 : string
@bbengfort (Member):

I can see the motivation for param_1 vs param_2 - but because these names are connected to the vertical and horizontal axes, maybe it would be easier to name them "x_param" and "y_param" - I know this is not really the x and y axes but they do it on bar graphs so I think it would be ok here.

@pbs929 (Author):

sure, easy enough

"""
# Get unique values of the two display parameters
x_vals = sorted(list(set(cv_results['param_' + param_1].compressed())))
y_vals = sorted(list(set(cv_results['param_' + param_2].compressed())))
@bbengfort (Member):

Ok, so just to make sure I understand what's happening here - you're fetching the param mask from the cv results dictionary, compressing it to get the real values instead, then calling set to find all unique values. These are the categories for the grid, right?

I've got a couple of notes:

  1. 'param_' + param_1 is used twice, it might be easier to store these in variables and note what they are or just fetch the mask array from the results and store that.
  2. If the parameter doesn't exist in the grid search, this will return a KeyError - it would be better if we raised a yellowbrick key error, e.g. YellowbrickKeyError(YellowbrickError, KeyError) that says "parameter not found in grid search" so it's more obvious to the user what happened.

@pbs929 (Author):

Yes, that's right - I've added a couple comment lines to try to clarify and stored the masked array like you suggested in (1). Also did what you suggested in (2).

# This is an n_x by n_y array of lists with `None` in place of empties
# (my kingdom for a dataframe...)
all_scores = [[None for _ in range(n_x)] for _ in range(n_y)]
for x, y, score in zip(idx_x, idx_y, cv_results['mean_test_score']):
@bbengfort (Member):

It feels like we should allow the user to choose which metric to plot as an input to the visualizer (defaulting to mean_test_score, but also allowing mean_train_score, mean_fit_time, and mean_score_time).
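A minimal sketch of that suggestion, assuming the keys listed above and a hypothetical `metric` keyword (not necessarily the name ultimately adopted):

```python
# The four metrics the reviewer names; all are standard cv_results_ keys.
VALID_METRICS = frozenset([
    "mean_test_score", "mean_train_score", "mean_fit_time", "mean_score_time",
])

def select_metric(cv_results, metric="mean_test_score"):
    """Return the array for the requested metric, validating the name."""
    if metric not in VALID_METRICS:
        raise ValueError(
            "metric must be one of: {}".format(", ".join(sorted(VALID_METRICS)))
        )
    return cv_results[metric]
```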

if x is not None and y is not None:
if all_scores[y][x] is None:
all_scores[y][x] = []
all_scores[y][x].append(score)
@bbengfort (Member):

I certainly understand the motivation for creating this intermediate data structure, but is it not possible to compute max score at this point?
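Computing the maximum on the fly, as suggested, would avoid the intermediate lists entirely. A sketch under the assumption that `idx_x`/`idx_y` index cells as in the PR (function name invented for illustration):

```python
import numpy as np

def best_scores_grid(idx_x, idx_y, scores, n_x, n_y):
    """Track the running max score per cell; unfilled cells stay NaN."""
    best = np.full((n_y, n_x), np.nan)
    for x, y, score in zip(idx_x, idx_y, scores):
        if x is None or y is None:
            continue
        # NaN comparisons are False, so the first score always wins a cell
        if not (best[y, x] >= score):
            best[y, x] = score
    return best
```

Leaving NaN in the empty cells also feeds directly into the `masked_invalid` hatching step further down.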

## Dimension reduction utility
##########################################################################

def param_projection(cv_results, param_1, param_2):
@bbengfort (Member):

This function seems extremely useful for all GridSearchVisualizers - perhaps this should be in base.py -- possibly as a method of GridSearchVisualizer?

@bbengfort (Member):

Or as a standalone function, with an aliased (static?) method in GridSearchVisualizer?

@bbengfort (Member):

Also: future work - add more params?

@pbs929 (Author):

Alright, I moved it up to the base module and aliased it in the base GridSearchVisualizer class. I didn't make the aliased method static, rather letting it access the gridsearch model object as self.estimator.

Definitely could add more params as future work.
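The arrangement described — a standalone utility aliased as an instance method that reads the fitted grid search off `self.estimator` — might look roughly like this simplified sketch (the `param_projection` body is elided; only the aliasing pattern is shown):

```python
def param_projection(cv_results, x_param, y_param):
    """Standalone utility: project grid search results onto two parameters.
    (Real body elided; this stub just echoes its inputs for illustration.)"""
    return (x_param, y_param, cv_results)

class GridSearchVisualizer:
    """Simplified stand-in for the real base class."""

    def __init__(self, estimator):
        self.estimator = estimator

    def param_projection(self, x_param, y_param):
        # Instance alias: pulls cv_results_ from the wrapped grid search
        # model rather than requiring the caller to pass it in.
        return param_projection(self.estimator.cv_results_, x_param, y_param)
```

Keeping the method non-static is what lets it reach `self.estimator`, as pbs929 notes.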

## Base Grid Search Visualizer
##########################################################################

class GridSearchVisualizer(ModelVisualizer):
@bbengfort (Member):

Very good set up of this into the Yellowbrick class hierarchy!

@pbs929 (Author):

👍

Should return an error if it isn't.
"""
# A bit of type checking
if not is_gridsearch(model):
@bbengfort (Member):

Nice!

)

# Mask nans so that they can be filled with a hatch
data = np.ma.masked_invalid(best_scores)
@bbengfort (Member):

I like this a lot! I'm going to have to use this in other things! The hatch nans thing I mean.
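For readers curious about the trick being praised: `pcolormesh` skips masked cells, so a hatch set on the axes background shows through wherever scores are missing. A minimal, self-contained sketch of the technique (not the PR's actual draw code):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for the sketch
import matplotlib.pyplot as plt

scores = np.array([[0.1, np.nan], [0.7, 0.9]])
data = np.ma.masked_invalid(scores)  # NaN cells become masked

fig, ax = plt.subplots()
# Hatch the axes background; masked cells are simply not drawn,
# so the hatch shows through where data is missing.
ax.patch.set(hatch="x", edgecolor="grey")
mesh = ax.pcolormesh(data)
fig.colorbar(mesh, ax=ax)
```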

self.ax.set_yticklabels(y_vals, rotation=45)

# Add the colorbar
cb = self.ax.figure.colorbar(mesh, None, self.ax)
@bbengfort (Member):

This is a running debate that I hope you might weigh in on: should color bars have the range (min, max) or should they have the range (0, 1)? Should this be an option we supply to the user?

Arguments for (min, max):

  • they create greater color disparity so it's easier to tell values apart
  • They bound the area you're looking in

Arguments for (0,1):

  • You can easily compare multiple heatmaps with each other
  • When scores cluster tightly, (0, 1) honestly shows them as similar, whereas (min, max) exaggerates small differences.

@bbengfort (Member):

@ndanielsen @rebeccabilbro -- just wanted to highlight this discussion for you again here.

@pbs929 (Author):

In this case I'd advocate for min, max. With grid search, you can use any of a number of error metrics, which may or may not be bounded by 1. Furthermore when fine tuning parameters you may be optimizing over the 3rd or 4th decimal of the error metric, so fixing the range on a larger scale would make the visualization not very useful.
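If the range were ever exposed as a user option, it could be as simple as an optional limits pair that falls back to the data's own extent. A hedged sketch — the `colorbar_range` name is invented for illustration:

```python
import numpy as np

def resolve_color_limits(scores, colorbar_range=None):
    """Return (vmin, vmax) for the heatmap's colorbar.

    None (the default) scales to the data's own min/max for maximal
    color disparity, per pbs929's argument above; passing (0, 1)
    would pin the scale so multiple heatmaps are directly comparable.
    """
    if colorbar_range is not None:
        return colorbar_range
    data = np.ma.masked_invalid(scores)  # ignore NaN cells, as the PR does
    return float(data.min()), float(data.max())
```

The resulting pair would be passed straight through as `vmin`/`vmax` to `pcolormesh`.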

cb = self.ax.figure.colorbar(mesh, None, self.ax)
cb.outline.set_linewidth(0)

self.ax.set_aspect("equal")
@bbengfort (Member):

Nice - glad you included this!

@pbs929 (Author):

👍

@pbs929 (Contributor, Author) commented Feb 3, 2018

Alright @bbengfort I think I've addressed your comments. Let me know if you catch anything else - I'm pretty tied up these days with a newborn but I will try to find the time. :)

@bbengfort (Member) left a review:

@pbs929 This looks great thank you so much! I've approved the PR and will merge it into develop. The next steps for this include writing some tests (or even just test stubs) and updating the documentation. I think we can add those to an issue (or update an existing issue, I haven't looked yet).

Thanks again for all your contributions, and congrats on the new baby!

@bbengfort bbengfort merged commit 03724ed into DistrictDataLabs:develop Feb 3, 2018
@bbengfort bbengfort removed the ready label Feb 3, 2018
This was referenced Feb 3, 2018
@pbs929 pbs929 deleted the gridsearch branch February 3, 2018 20:35
pbs929 added a commit to pbs929/yellowbrick that referenced this pull request Feb 5, 2018
* develop: (36 commits)
  Grid search color plot viz  (DistrictDataLabs#261)
  Feature visualizers clean up missing data to ensure visualization works DistrictDataLabs#304
  Using visualizers on MNIST digits data. (DistrictDataLabs#290)
  Add pytest into the mix instead of nose (DistrictDataLabs#292)
  Updated ROC/AUC to correctly check the number of classes Bugfix DistrictDataLabs#288
  Fix tests (DistrictDataLabs#293)
  fixed joint plot tests
  DistrictDataLabs#161 Created @Property for visualizer size. (DistrictDataLabs#289)
  Threshold visualizer (DistrictDataLabs#207)
  Add image similarity tests for feature visualizers (DistrictDataLabs#283)
  add image similarity test for freqdist (DistrictDataLabs#284)
  alphas image similarity tests (DistrictDataLabs#282)
  Add image similarity tests for elbow and silhouette plots (DistrictDataLabs#281)
  add image similarity tests for confusion matrix (DistrictDataLabs#280)
  adding image comparison document to the contributing rst (DistrictDataLabs#279)
  some editorial changes on matplotlib page and sphinx repair to line 184 of rst
  peplot test stubs
  effective matplotlib draft
  updated readme, conda description and pypi description
  update pypi description
  ...