
Ready: Feature 139 confusion matrix #144

Conversation

@NealHumphrey (Contributor) commented Mar 4, 2017:

Work-in-progress branch; initiating a pull request for discussion with @rebeccabilbro

So far:

  • Made the ConfusionMatrix class
  • Added init, fit, score, and draw methods. These mostly mimic the ClassificationReport
  • Created an example demo using the music data set from my user test. I can clean this up and include it in the commit, or remove it after development of this feature is complete (i.e. if we don't want it in the examples folder)

To discuss:

  • Should we use imshow like ClassificationReport, or convert over to pcolormesh like seaborn uses for its heatmap?
  • Hoisting duplicated code between ConfusionMatrix and ClassificationReport - what's worth moving up?
  • Handling of font size based on image size (so the font gets smaller as the number of categories increases)
  • Handling of font size when the predicted quantities have too many digits to fit. Do we force the user over to the percent-based heatmap?
  • Color defaults - force a light gray background with medium gray font when the estimate is zero? Use a different categorical label for 100%-correct prediction squares (e.g. green, but with flexibility)?
  • Handling class imbalance - should we default to a percent-of-true confusion matrix?
  • Handling class imbalance - should we create an option for treemap-style resizing of row width, to visually show the class sizes of the 'true' values?

@bbengfort added the 'review' (PR is open) label Mar 4, 2017
@coveralls commented Mar 4, 2017: Coverage decreased (-1.4%) to 67.917% when pulling c113fbc on NealHumphrey:feature-139-confusion-matrix into 41ab384 on DistrictDataLabs:master.

@coveralls commented Mar 6, 2017: Coverage decreased (-2.4%) to 66.926% when pulling 57e4f1b on NealHumphrey:feature-139-confusion-matrix into 41ab384 on DistrictDataLabs:master.

@bbengfort (Member) left a comment:

It would also be great if you could show a sample of the visualizer in the comments - I think you can just drag and drop the image into the comment box!


self.cmap = color_sequence(kwargs.pop('cmap', 'YlOrRd'))

# If possible, assign the classes. If this can't happen now, it will happen in the .fit() method
Member:

This is absolutely one of the things that needs to be done; e.g. if a fitted model is passed into the visualizer, fit() won't be called. The question then becomes: what happens if estimator.classes_ is not None, but fit() is called anyway? Right now, if the user specifies the classes, we don't want fit() to override the user-specified classes; but if we got them off the estimator, then we do want fit() to change them.

So here's my suggestion (one that I think we'll use in almost all the classifier score models): let's make Visualizer.classes a property rather than a simple attribute, so it performs the lookup on demand:

def __init__(self, model, ax=None, classes=None, **kwargs):
    self._classes = classes 

@property
def classes(self):
    if self._classes is None: 
        try:
            return self.estimator.classes_ 
        except AttributeError:
            return None 
    return self._classes 

@classes.setter
def classes(self, value):
    self._classes = value 

Then in fit() we can check the value of self._classes but in draw() we can simply use self.classes. Thoughts?
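For illustration, a minimal sketch of how fit() and draw() interact with the property under this proposal (a sketch only - the merged implementation may differ):

def fit(self, X, y=None, **kwargs):
    self.estimator.fit(X, y)  # fit the wrapped estimator as usual
    # fit() never overwrites user-supplied classes; when self._classes is
    # None, the classes property returns estimator.classes_ on demand, so
    # draw() can simply iterate over self.classes.
    return self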

Contributor (author):

Good point - I had forgotten about the need not to override user-defined classes in fit(). We could keep the existing code and simply wrap the .classes_ assignment in an if statement in the fit() method; but making it a property is a good way to clue future developers in to the fact that they need to handle the assignment properly if classes is used anywhere else. I'll use the property approach.

Another thing @rebeccabilbro and I discussed is a utility for checking whether an estimator is already fit. It does not look like there is a tool in sklearn that does that already. Each estimator raises a NotFittedError if you call predict without fit, but each estimator type has a different way of deciding whether it should raise NotFittedError (e.g. KNeighbors checks for the existence of self._fit_method, forests check for self.estimators_, linear classifiers check for self.coef_). We might just use .classes_ for the classifiers for now and write the utility when we need it.

Is the classes vs. classes the naming convention for properties vs. attributes, respectively?

Contributor (author):

oops, markdown misinterpreted my underscores. I meant to say _classes vs. classes_

Member:

Yes, we do need an is_fitted() utility - I think I did the same search you did and found the same things. I was working on the following:

import numpy as np
from sklearn.exceptions import NotFittedError

def is_fitted(estimator):
    # m must match the number of features the estimator expects (see below)
    X = np.random.rand(1, m)
    try:
        estimator.predict(X)
        return True
    except NotFittedError:
        return False

The problem I had was that I didn't know what m should be … it has to be the same number of columns as the expected features matrix. This is, however, the "right" way of doing it. Though frankly, it's even more "right" to just expect the model to be fitted or not fitted and pass the exceptions to the user.
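As an aside, later versions of scikit-learn (roughly 0.22 onward) let you sidestep the unknown m entirely: check_is_fitted can be called without naming attributes and simply looks for fitted attributes (names ending in an underscore). A minimal sketch:

from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted

def is_fitted(estimator):
    # check_is_fitted raises NotFittedError when the estimator has no
    # fitted attributes (i.e. no attributes whose names end in '_')
    try:
        check_is_fitted(estimator)
        return True
    except NotFittedError:
        return False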

Member:

So _classes is a Python thing, and signals to developers that this is an internal property and shouldn't be messed with, whereas classes is the external property (there is no public, private, or protected in Python). Note that __classes causes name mangling.

http://radek.io/2011/07/21/private-protected-and-public-in-python/

The name classes_ is actually a Scikit-Learn thing; from the docs:

Estimated parameters: When data is fitted with an estimator, parameters are estimated from the data at hand. All the estimated parameters are attributes of the estimator object ending by an underscore:
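A quick illustration of the three conventions (class name hypothetical):

class Visualizer:
    def __init__(self):
        self._classes = None    # internal by convention; still accessible
        self.__classes = None   # name-mangled to _Visualizer__classes
        self.classes_ = None    # scikit-learn style: estimated during fit

v = Visualizer()
v._classes                # works
v._Visualizer__classes    # the mangled name; v.__classes raises AttributeError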

else:
    self.classes_ = classes

# TODO this is the same as ClassificationReport, should hoist it up a level
Member:

Agreed - this should probably be a part of ClassificationScoreVisualizer, as should the property above and the refactored __init__. Would you be willing to make this change and run the tests?

Contributor (author):

I suggest we do this as part of #151

Contributor (author):

And yes I could do that next sprint

@bbengfort (Member) commented Mar 6, 2017:

Ok, will you put this property into this PR (for just this class) or do you just want to do it all in the next sprint?

Contributor (author):

I'll put this property directly in this class for this PR, and then we can hoist it up a level in #151

Member:

Sounds good.

y_pred = self.predict(X)
self.confusion_matrix = confusion_matrix(y_true=y, y_pred=y_pred, labels=self.classes_, sample_weight=sample_weight)
# Convert the confusion matrix to percent-of-true (row-normalized)
self.confusion_matrix = np.round(self.confusion_matrix / np.sum(self.confusion_matrix, axis=1, keepdims=True) * 100, decimals=0)
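(The keepdims above matters - a sketch of the broadcasting pitfall with a made-up matrix:)

import numpy as np

cm = np.array([[8,  2],
               [1, 19]])
cm / cm.sum(axis=1)                 # wrong: broadcasts over columns, dividing by the wrong totals
cm / cm.sum(axis=1, keepdims=True)  # right: each row is divided by its own total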
Member:

Potentially I'd like to make this an option to show raw numbers or percents - one the user passes in. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue we discussed is what to do when the raw numbers have too many digits to display properly for the size of the matrix. Options include:

  • Display it anyway; formatting will break and the user will manually change to % mode. Kind of ugly.
  • Use logic to override the user's choice and display percents when the numbers are too large (maybe raise a warning? Potentially confusing behavior that doesn't match user expectations).
  • When whole-number mode is selected, display only the cells with too many digits as percents (potentially confusing when the numbers have mixed meanings; smart formatting could clarify).
  • Resize the font of all cells based on the max number of digits and the available space. I'd like to do this anyway to handle situations with lots of classes (like my example), but I need to find out how to get the display size of the graphic. This might mean the text-adding step has to move into poof()? Do you know how to get the pixel dimensions of the resulting graphic? (See the sketch after this list.)
  • Resize the font of each cell individually based on its digit count. Same implementation issue as the bullet above.

Due to these complications, we thought we'd start with only the percent-based display format. What do you think about the choice between these options for future implementation?
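On the pixel-dimensions question, a minimal matplotlib sketch - the figure size in pixels is just its size in inches times its dpi, so nothing has to move into poof() to compute it:

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
width_px, height_px = fig.get_size_inches() * fig.dpi  # e.g. 640.0 x 480.0 at the defaults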

Member:

Percent seems like a convenient choice so as not to have to worry about fitting big numbers into a tight space; I guess scientific notation is another option to make the length predictable... maybe slightly less readable though?

Member:

Actually, I like option 1 - default to percent, let the user manually change it to raw numbers, and if it breaks, it breaks.

Contributor (author):

I hate making things that come out ugly, but I also hate tools that try to do too much formatting for me without letting me override it (looking at you, Excel). I suppose you're right @bbengfort - start out with a default that will always be pretty, and if the user breaks the formatting by choosing raw numbers, so be it.


def finalize(self, **kwargs):
    pass
    # TODO add this stuff
Member:

At the very least set the title - I think that's good form. Some of the other visualizers let the user specify the title, so we should probably review that and add it here.

Contributor (author):

Yep, planning to.

Note, I used the WIP convention ("work in progress") in the title of my pull request. The only way to do line-by-line comments is with a pull request, so this was a way to discuss progress so far (Rebecca and I had a video chat last night) - I wasn't expecting this to be ready to merge yet.

Member:

I get it; I mostly wanted to mention the set_title thing. I meant that having a title on figures is good form - and also that you don't necessarily have to do too much work in finalize().

Y = np.linspace(start=0, stop=len(self.classes_), num=len(self.classes_))

# Draw the heatmap
mesh = self.ax.pcolormesh(X, Y, self.confusion_matrix, vmin=0, vmax=100,
Member:

Does pcolormesh handle light text on dark background and dark text on light background? If so, we should definitely move to it. But also in my reading, I'm seeing that it can handle larger matrices; so I think it's a good idea to use it over imshow.

Contributor (author):

pcolormesh does not inherently deal with text. I agree we want to handle dark/light text vs. background (my example and the existing ClassificationReport both have this issue), and we will need to handle it separately (probably manually). This sounds like something that should have a higher-level solution - i.e. for any given color, you can check whether the text should be light or dark (see the sketch below). I don't understand how the palettes in yellowbrick work yet, so I'm not quite ready to add that logic in a clean and reusable way. I'll make a ticket.

One benefit of pcolormesh is that it allows you to manually specify the X and Y grid corner locations, so you can have adaptively sized blocks. It wasn't much harder to use than imshow, though you do have to generate the X and Y in most cases. I agree that pcolormesh is a good idea for any future heatmapping.
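A minimal sketch of that light-or-dark check (the helper name and the 0.5 cutoff are assumptions, not yellowbrick API):

import matplotlib.colors as mcolors

def text_color_for(background):
    # Rec. 709 relative luminance of the background color
    r, g, b = mcolors.to_rgb(background)
    luminance = 0.2126 * r + 0.7152 * g + 0.0722 * b
    return 'black' if luminance > 0.5 else 'white'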

@bbengfort (Member):
@rebeccabilbro one of the things we wanted to discuss was the code review process; so let's take a look at the code review I did here as an example. Thanks @NealHumphrey for being a good sport!

@coveralls commented Mar 8, 2017: Coverage decreased (-3.9%) to 65.439% when pulling 9cc35b9 on NealHumphrey:feature-139-confusion-matrix into 41ab384 on DistrictDataLabs:master.

@NealHumphrey changed the title from "WIP - Feature 139 confusion matrix" to "Complete: Feature 139 confusion matrix" Mar 8, 2017
@NealHumphrey (Contributor) commented Mar 8, 2017:

@rebeccabilbro and @bbengfort - this should be ready enough to merge, or at least to show the group tonight. I made an example; a few of the images are copied below.

In my commit message I noted a few more things I'd like to do.

Here are the screenshots:
[Image: Handwritten Digits 1]
[Image: Handwritten Digits 2]
[Image: Music Origins 1]
[Image: Sliced Music Origins - only some classes]

@coveralls commented Mar 8, 2017: Coverage decreased (-3.9%) to 65.439% when pulling b322d82 on NealHumphrey:feature-139-confusion-matrix into 41ab384 on DistrictDataLabs:master.

@rebeccabilbro (Member):

@NealHumphrey can you please add tests for the new class here to make sure our Travis coverage stays good? Note that once you move the classifiers into their own module (#151 - I think you said you wanted to take this one in the next sprint), you'll also need to refactor the tests accordingly.

@NealHumphrey changed the title from "Complete: Feature 139 confusion matrix" to "WIP: Feature 139 confusion matrix" Mar 9, 2017
@NealHumphrey (Contributor, author):

Ok, I'm going to do one more thing before this is ready to merge - fix that bug with the shortened classes list in percent mode. I've got it started and will push a new commit in the next day or two.

@bbengfort (Member):

@NealHumphrey let me know when you want me to review.

…olormesh.

- Allows for percent or raw count representation of the predictions
- Implements heatmap with white=0, green=100%, and yellow-orange-red heatmap for everything else
- Allows zooming in on confusion matrix using passed list of classes, with accurate %-of-all-true calculations
- Tested for moderately large class numbers (30+)
- Diagonal line indicates accurate predictions
- Documentation added to docs/examples/methods.rst for one example matrix

Suggested future improvements:
- Resize font based on image size + class count
- Allow custom color coding, including custom colors for _over and _under values (e.g. zero and 100%)
- Vary text font color based on background color
- While this branch currently adds an example to methods.rst, examples/confusionMatrix.ipynb has additional examples using different combinations of the passed parameters. This should probably also be exported as rst and added to the docs, but there was no obvious place to put it, so I am excluding it for now.

Note this commit squashes all previous commits on this branch
@NealHumphrey changed the base branch from master to develop, Mar 16, 2017 17:08
@NealHumphrey (Contributor, author):

@bbengfort ready for you to review! I added tests, fixed that bug I mentioned, and made a few other improvements. I also exported the rst of the ipynb and added it to the documentation.

Note, I decided to go ahead and squash and rebase this on the current develop branch to keep a clean version history - thoughts on doing this in the future vs. not?

@NealHumphrey changed the title from "WIP: Feature 139 confusion matrix" to "Ready: Feature 139 confusion matrix" Mar 16, 2017
@coveralls commented: Coverage increased (+1.9%) to 70.374% when pulling ceee7f8 on NealHumphrey:feature-139-confusion-matrix into 942a070 on DistrictDataLabs:develop.


@bbengfort (Member) left a comment:

@NealHumphrey just did a review and it all looks good - very nice implementation of the Visualizer.

The only thing I'd request is that you move the examples/confusionMatrix.ipynb to examples/NealHumphrey -- this is kind of where we landed for both development and galleries, that the examples/examples.ipynb would be the "official" gallery of examples and that other notebooks for testing and development would be in our user folders.

You can certainly add the confusion matrix example to examples.ipynb if you think it belongs there as well.

@@ -155,6 +156,17 @@ def is_dataframe(obj):
isdataframe = is_dataframe


# From here: http://stackoverflow.com/questions/26248654/numpy-return-0-with-divide-by-zero
def numpy_div0(a, b):
Member:

So I have a couple of questions about this function: first, can a be a list, scalar, ndarray, etc.? And second, can b be a list, scalar, ndarray, etc.? E.g. what happens when we do:

numpy_div0(10, [0,2,5])

My question relates to the signature of the function. My preference (and this isn't blocking the pull request) is to call this div_safe or something like that, and to name the arguments something a bit more specific, since potentially many other developers will use this. I do like including the Stack Overflow link above the function; it's not part of the docstring but it does give developers more information.

Contributor (author):

I converted this to div_safe, added some documentation to the function, and wrote tests demonstrating what the function can and can't do.
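For reference, a minimal sketch of a div_safe in the spirit of that Stack Overflow pattern (not necessarily the exact code that was merged):

import numpy as np

def div_safe(numerator, denominator):
    # Element-wise division that returns 0 wherever the denominator is 0
    with np.errstate(divide='ignore', invalid='ignore'):
        result = np.true_divide(numerator, denominator)
    # Replace the inf (x/0) and nan (0/0) values those zeros produced
    return np.where(np.isfinite(result), result, 0.0)

div_safe(10, [0, 2, 5])  # -> array([0., 5., 2.])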

@bbengfort (Member):

@NealHumphrey as for the rebasing - that's fine with me. I want to make sure we merge into develop with --no-ff --no-edit so that we can keep track of who's doing what with the feature branches looping off, but keeping them compact and rebasing internally is fine. In fact, it really did help me with the code review, so I think we should make this a best practice.
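Concretely, the merge described above would look something like this (branch name taken from this PR):

git checkout develop
git merge --no-ff --no-edit feature-139-confusion-matrix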

@bbengfort (Member):

Actually, I see that we can select "squash and merge" from the dropdown, so potentially you don't have to rebase, we can do that automatically through github.

@bbengfort added the 'type: feature' (a new visualizer or utility for yb) label Mar 19, 2017
@NealHumphrey (Contributor, author):

@bbengfort I made both those changes, should be good to go now.

@coveralls commented Mar 20, 2017: Coverage increased (+2.0%) to 70.499% when pulling 23ac982 on NealHumphrey:feature-139-confusion-matrix into 942a070 on DistrictDataLabs:develop.

@coveralls commented Mar 20, 2017: Coverage increased (+2.0%) to 70.499% when pulling 4b900ef on NealHumphrey:feature-139-confusion-matrix into abb0550 on DistrictDataLabs:develop.

@bbengfort (Member):

@NealHumphrey are we ready to resolve the conflicts and merge this PR?

# Conflicts:
#	examples/nealhumphrey/confusionMatrix.ipynb
#	yellowbrick/classifier.py
#	yellowbrick/utils.py
@coveralls commented: Coverage increased (+0.1%) to 70.989% when pulling ce71fc6 on NealHumphrey:feature-139-confusion-matrix into 45268fc on DistrictDataLabs:develop.

@NealHumphrey merged commit 1320a9a into DistrictDataLabs:develop Mar 23, 2017
@NealHumphrey removed the 'review' (PR is open) label Mar 23, 2017
@NealHumphrey (Contributor, author):

@bbengfort in this branch I used the 'Squash and merge' button. This does indeed do what you want - it squashes all the commits into one commit AND rebases it on top of the current yellowbrick/develop branch. The existing branch (i.e. the one in my fork) remains unchanged, so there's a slightly weird disconnect in the version history when I run git log locally, as seen in this screenshot from my GUI:
[Image: screenshot of the git GUI history]

If we'd done a regular merge (without squash or rebase), those two would be connected. People will just need to be aware of this, but it makes for a cleaner develop version history.

Note, this doesn't do what you said you wanted about fast-forwards - because the squash+rebase is applied only on the ddl/yellowbrick side, it's effectively doing a fast-forward merge.

Personally, going forward I'll probably squash and rebase my forked branch before completing the pull request, which makes the squash button on GitHub redundant.

@bbengfort (Member):

Alright, thanks for looking into this - that all sounds good!

@bbengfort (Member):

@NealHumphrey also, we can add the ConfusionMatrix to the list in the README and in the front page of the documentation if you want.
