
Implement new plot directive for clustering visualizers [Issue #687] #742

Merged
merged 12 commits into DistrictDataLabs:develop from clustering-docs on Feb 15, 2019

Conversation

@Kautumn06 (Contributor) commented Feb 12, 2019

Adds the new plot directive from PR #446 to the documentation for the KElbowVisualizer, SilhouetteVisualizer, and InterclusterDistance. In addition, I deleted the scripts originally used to generate the images in the docs, along with the images themselves.

I'll add comments to help explain the changes, but please let me know if anyone has any questions!

Quick update: the tests originally failed, which I think may have been because I initially set k in the InterclusterDistance visualizer, similar to how k is set in KElbowVisualizer:

model = KMeans()
visualizer = InterclusterDistance(model, k=9)

And while it did create the image when I ran make html, the tests failed. I think this may be because k is not actually a defined parameter for the InterclusterDistance visualizer, which is why I pushed a new commit with the following update:

model = KMeans(9)
visualizer = InterclusterDistance(model)
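
For context, here's what the full corrected example looks like end to end (a minimal sketch assuming the Yellowbrick API of this era, where poof() renders the figure; the make_blobs parameters are the ones used elsewhere in this PR):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from yellowbrick.cluster import InterclusterDistance

# Generate synthetic dataset with 12 random clusters
X, y = make_blobs(n_samples=1000, n_features=12, centers=12, random_state=42)

# Instantiate the clustering model and visualizer;
# the number of clusters belongs to KMeans, not to InterclusterDistance
model = KMeans(9)
visualizer = InterclusterDistance(model)

visualizer.fit(X)        # Fit the data to the visualizer
visualizer.poof()        # Draw/show/poof the data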

Since implementing the new plot directive, this script is no
longer needed to generate images in the SilhouetteVisualizer
documentation.

Issue DistrictDataLabs#687
Updates the code block in the SilhouetteVisualizer documentation so that it can utilize the new plot directive from PR DistrictDataLabs#446 to autogenerate the images.

Issue DistrictDataLabs#687
I added :alt: and :context: close-figs settings for both plots in the
KElbowVisualizer documentation.

See also: DistrictDataLabs#687, DistrictDataLabs#446
Add new plot directive settings from PR DistrictDataLabs#446 to Intercluster Distance Maps
Visualizer documentation. In addition, I deleted the original script
that had previously been used to generate the image in the documentation.

See also: DistrictDataLabs#687
@Kautumn06 added the "review" (PR is open) and "type: documentation" (writing and editing tasks for RTD) labels and removed the "review" label on Feb 12, 2019
@@ -8,21 +8,23 @@ The ``KElbowVisualizer`` implements the "elbow" method to help data scientists s
To demonstrate, in the following example the ``KElbowVisualizer`` fits the ``KMeans`` model for a range of :math:`K` values from 4 to 11 on a sample two-dimensional dataset with 8 random clusters of points. When the model is fit with 8 clusters, we can see an "elbow" in the graph, which in this case we know to be the optimal number.

.. plot::
:context: close-figs
@Kautumn06 (Contributor Author):
Added the recommended :context: and :alt: settings to the plot directive.
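
For readers who haven't seen the directive before, the assembled block in the .rst ends up looking roughly like this (a sketch; the :alt: text here is illustrative rather than copied from the PR):

.. plot::
    :context: close-figs
    :alt: KElbowVisualizer fit on a synthetic dataset with 8 random clusters

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from yellowbrick.cluster import KElbowVisualizer

    # Generate synthetic dataset with 8 random clusters
    X, y = make_blobs(n_samples=1000, n_features=12, centers=8, random_state=42)

    # Instantiate the clustering model and visualizer
    model = KMeans()
    visualizer = KElbowVisualizer(model, k=(4,12))

    visualizer.fit(X)        # Fit the data to the visualizer
    visualizer.poof()        # Draw/show/poof the data

When Sphinx builds the docs, the directive executes the code and injects the resulting figure, which is what makes the old image-generation scripts unnecessary.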

- # Create synthetic dataset with 8 random clusters
- X, y = make_blobs(centers=8, n_features=12, shuffle=True, random_state=42)
+ # Generate synthetic dataset with 8 random clusters
+ X, y = make_blobs(n_samples=1000, n_features=12, centers=8, random_state=42)
@Kautumn06 (Contributor Author):
I removed shuffle=True since it is True by default.

In addition, I added n_samples=1000 since that is what it was set to in the original script used to generate the image; if not included, n_samples defaults to only 100.
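
A quick way to confirm that default (this reflects scikit-learn's behavior; the snippet is a sanity check, not code from the PR):

from sklearn.datasets import make_blobs

# With no arguments, make_blobs falls back to its defaults
X, y = make_blobs()
print(X.shape)    # (100, 2): n_samples defaults to 100 and n_features to 2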

@@ -31,21 +33,25 @@ However, two other metrics can also be used with the ``KElbowVisualizer`` -- ``s
The ``KElbowVisualizer`` also displays the amount of time to train the clustering model per :math:`K` as a dashed green line, but it can be hidden by setting ``timings=False``. In the following example, we'll use the ``calinski_harabaz`` score and hide the time to fit the model.

.. plot::
:context: close-figs
@Kautumn06 (Contributor Author):
Added new plot directive settings.

- # Create synthetic dataset with 8 random clusters
- X, _ = make_blobs(centers=8, n_features=12, shuffle=True, random_state=42)
+ # Generate synthetic dataset with 8 random clusters
+ X, y = make_blobs(n_samples=1000, n_features=12, centers=8, random_state=42)
@Kautumn06 (Contributor Author):
I replaced _ with y here: even though the target variable isn't needed and doesn't affect the final plot, I thought _ could be confusing for new users who may not have seen it before. In addition, it isn't used in the other examples in our documentation where make_blobs is used to create the dataset.

I also removed shuffle=True and added n_samples=1000 for the same reasons I described above in a previous comment.

@rebeccabilbro (Member):
Thanks @Kautumn06 for keeping an eye out for consistency across the docs! One note: while I think it's fine to use y instead of _ in this case here, keep in mind that using the underscore for dummy (i.e. unused) variables is not always/only a stylistic choice; some linters and code editors will actually raise a warning if you name a variable and don't end up using it!
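
To illustrate the point (a hypothetical side-by-side, not code from the PR; whether a warning fires depends on the linter and context):

from sklearn.datasets import make_blobs

# Using _ signals "intentionally ignored" to readers and to linters
X, _ = make_blobs(n_samples=1000, n_features=12, centers=8, random_state=42)

# Naming the value also works, but if y is never used afterwards,
# some linters and editors may flag it as an unused variable
X, y = make_blobs(n_samples=1000, n_features=12, centers=8, random_state=42)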


- visualizer.fit(X)    # Fit the data to the visualizer
- visualizer.poof()    # Draw/show/poof the data
+ visualizer.fit(X)    # Fit the data to the visualizer
@Kautumn06 (Contributor Author):
Added additional spacing.


# Instantiate the clustering model and visualizer
model = KMeans()
- visualizer = KElbowVisualizer(model, k=(4,12), metric='calinski_harabaz', timings=False)
+ visualizer = KElbowVisualizer(
+     model, k=(4,12), metric='calinski_harabaz', timings=False
+ )
@Kautumn06 (Contributor Author):
Small formatting change since the line was quite long.

@@ -5,26 +5,24 @@ Intercluster Distance Maps

Intercluster distance maps display an embedding of the cluster centers in 2 dimensions with the distance to other centers preserved. E.g. the closer two centers are in the visualization, the closer they are in the original feature space. The clusters are sized according to a scoring metric. By default, they are sized by membership, e.g. the number of instances that belong to each center. This gives a sense of the relative importance of clusters. Note, however, that two clusters overlapping in the 2D space does not imply that they overlap in the original feature space.

- .. code:: python
+ .. plot::
@Kautumn06 (Contributor Author):
Previously, this example in the documentation had been broken out into two blocks—one to generate the dataset and one to fit the model and poof the visualizer. So in order to have it work with the new plot directive, I combined them into one and moved all of the import statements to the top of the block.

from yellowbrick.cluster import InterclusterDistance

# Instantiate the clustering model and visualizer
visualizer = InterclusterDistance(KMeans(9))
@Kautumn06 (Contributor Author) Feb 12, 2019:
As you'll see below, I broke this code out into two lines to keep it consistent with the rest of the examples in the documentation.


# Instantiate the clustering model and visualizer
@Kautumn06 (Contributor Author):
This code had previously been combined into a single line, and while that does work, I think it's clearer to have it on two. In addition, this is the format we use throughout the rest of the documentation, so I wanted to be consistent.

visualizer.fit(X) # Fit the training data to the visualizer
visualizer.poof() # Draw/show/poof the data
# Generate synthetic dataset with 12 random clusters
X, y = make_blobs(n_samples=1000, n_features=12, centers=12, random_state=42)
@Kautumn06 (Contributor Author):
Originally, the .rst file showed n_features=16; however, in the script used to generate the image, the parameter was set to n_features=12, so I changed it here to reflect that.

from yellowbrick.cluster import SilhouetteVisualizer

# Generate synthetic dataset with 8 random clusters
X, y = make_blobs(n_samples=1000, n_features=12, centers=8, random_state=42)
@Kautumn06 (Contributor Author):
Added the parameters used in the original script to generate the image.

# Instantiate the clustering model and visualizer
- model = MiniBatchKMeans(6)
+ model = KMeans(6)
@Kautumn06 (Contributor Author):
Using KMeans instead of MiniBatchKMeans didn't affect the plot, so I changed it here to make it consistent with our other clustering visualizers. In addition, I was worried that new users might be confused about why it is used here and not in the other clustering examples.

@rebeccabilbro (Member):
@Kautumn06 thanks for keeping an eye out for consistency! However, I'm wondering if you would please consider changing this back to MiniBatchKMeans, for two reasons. First, MiniBatchKMeans can often converge significantly faster than KMeans, which can be a big advantage when you have to wait for the model to converge before you get your plot! Second, I think it's actually better to show a lot of different clustering, classification, and regression models throughout our documentation because it shows people that YB works on any estimator, not just certain ones. In fact, as you're going through the docs, if you see interesting opportunities to introduce new estimators, feel free to take the lead in changing the currently demonstrated models to more interesting/diverse ones so that we can show off what YB can really do!
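
For reference, the reverted example would look something like this sketch (same synthetic dataset as elsewhere in this PR; MiniBatchKMeans(6) sets n_clusters positionally):

from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs
from yellowbrick.cluster import SilhouetteVisualizer

# Generate synthetic dataset with 8 random clusters
X, y = make_blobs(n_samples=1000, n_features=12, centers=8, random_state=42)

# Instantiate the clustering model and visualizer;
# MiniBatchKMeans fits on mini-batches, so it often converges
# faster than full-batch KMeans on larger datasets
model = MiniBatchKMeans(6)
visualizer = SilhouetteVisualizer(model)

visualizer.fit(X)        # Fit the data to the visualizer
visualizer.poof()        # Draw/show/poof the data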

Now that we're implementing the new plot directive, the scripts
previously used to generate images in the documentation are no longer
needed, so I've updated my PR (DistrictDataLabs#742) by deleting them.

See also: DistrictDataLabs#687, DistrictDataLabs#446
Originally, in my PR (DistrictDataLabs#742) I had set the parameter k in the Intercluster Distance
visualizer, similar to how k is set in KElbowVisualizer. And while it
was still able to generate the image, I believe this is what caused the
test to fail.
@rebeccabilbro added this to the v1.0 milestone Feb 13, 2019
@rebeccabilbro (Member) commented Feb 13, 2019

Hi @Kautumn06 — sorry that I didn't get to this earlier today, but we've been sorting out some problems with our automated tests. Now that everything seems to be fixed, I've updated your branch so that the tests can run again (note that this means you'll need to do a pull first if you need to push anything else to this branch); I'll take a look at your updates tomorrow!

@Kautumn06 (Contributor Author):

Hi @rebeccabilbro no problem! I haven't pushed any changes since yesterday afternoon, so hopefully there shouldn't be any more problems. However, just let me know if you have any questions or if there is anything else you need me to help with!

@rebeccabilbro (Member) commented Feb 13, 2019

Hi @Kautumn06 - I'm starting to take a look at your updates to the clustering docs and I had a quick question — in your above message, you said:

when I ran make html, the tests failed.

and I just wanted to make sure I understood what behavior you were experiencing before proceeding.

When you run make html, it builds a copy of the docs locally on your machine, but it doesn't run any tests. However, oftentimes when we run the make html command after making experimental changes to the docs, we'll get warnings or error messages from Sphinx in the command line that tell us that there's something funky going on in the docs. When you say the tests failed after you ran make html, do you mean you got Sphinx warnings? This is probably a sign that there is some kind of formatting error in the rst, a missing reference link, etc.

Alternatively, if you mean that after making the updates to the docs and running the tests (either locally on your machine with pytest, or by opening the PR, which automatically runs the tests and shows the results here on GitHub), you observed failing tests on the command line or saw the little red X mark here on GH, that's possibly due to the miniconda/Appveyor problems we were experiencing with our continuous integration tools.

The continuous integration testing issue is now resolved (for now ;D ), so would you please let me know if what you experienced seemed instead like a Sphinx issue, so I can look into that during my code review? Thanks for working on this!

@Kautumn06 (Contributor Author):

Hi @rebeccabilbro sorry for any confusion! The test I was referring to was in fact the red X here on GitHub. Originally, I had thought that the problem occurred because of the parameter issue I mentioned above; however, it actually turned out to be from the miniconda/Appveyor problem we've been experiencing.

So please just let me know if that answers your question or if there is anything else you need me to do!

@Kautumn06 (Contributor Author):

Hi @rebeccabilbro I just wanted to let you know that I fixed the two minor merge conflicts that came up today after #729 was merged in. It was a quick fix since the two differences were from me including the new plot directive's :alt: option in the code blocks. Let me know if you have any questions!

@rebeccabilbro (Member) left a comment:

Great work @Kautumn06 — we're lucky to have someone on the team with your attention to detail! I've made a few small suggestions and requests; would you mind taking a look at those, making any final edits, and pinging me when you're ready for me to merge this in?


Kautumn06 and others added 3 commits February 15, 2019 09:52
In the SilhouetteVisualizer documentation, updated the code block to use
MiniBatchKMeans instead of KMeans.

See also: DistrictDataLabs#687
@Kautumn06 (Contributor Author):

Hi @rebeccabilbro I've replaced KMeans with MiniBatchKMeans in the SilhouetteVisualizer documentation example. I also added the plot directive :alt: option to the code block since I had originally forgotten to add it. Once the tests have passed, I'll merge it in.

@rebeccabilbro merged commit c749a94 into DistrictDataLabs:develop Feb 15, 2019
@Kautumn06 deleted the clustering-docs branch February 15, 2019 21:36