
Implement new plot directive for clustering visualizers [Issue #687] #742

Merged
merged 12 commits into DistrictDataLabs:develop from clustering-docs on Feb 15, 2019

Conversation

@Kautumn06 (Contributor) commented Feb 12, 2019

Adds the new plot directive from PR #446 to the documentation for the KElbowVisualizer, SilhouetteVisualizer, and InterclusterDistance. In addition, I deleted the scripts originally used to generate the images in the docs, along with the images themselves.

I'll add comments to help explain the changes, but please let me know if anyone has any questions!

Quick update: the tests originally failed, which I think may have been because I initially set k in the InterclusterDistance visualizer, similar to how k is set in KElbowVisualizer:

model = KMeans()
visualizer = InterclusterDistance(model, k=9)

And while it did create the image when I ran make html, the tests failed. I think this may be because k is not actually a defined parameter for the InterclusterDistance visualizer, which is why I pushed a new commit with the following update:

model = KMeans(9)
visualizer = InterclusterDistance(model)
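
For context, here's what the full corrected example looks like end to end (a minimal sketch assuming the Yellowbrick API of this era, where poof() renders the figure; the make_blobs parameters are the ones used elsewhere in this PR):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from yellowbrick.cluster import InterclusterDistance

# Generate synthetic dataset with 12 random clusters
X, y = make_blobs(n_samples=1000, n_features=12, centers=12, random_state=42)

# Instantiate the clustering model and visualizer;
# the number of clusters belongs to KMeans, not to InterclusterDistance
model = KMeans(9)
visualizer = InterclusterDistance(model)

visualizer.fit(X)        # Fit the data to the visualizer
visualizer.poof()        # Draw/show/poof the data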

Since implementing the new plot directive, this script is no
longer needed to generate images in the SilhouetteVisualizer
documentation.

Issue DistrictDataLabs#687
Updates the code block in the SilhouetteVisualizer documentation so that it can utilize the new plot directive from PR DistrictDataLabs#446 to autogenerate the images.

Issue DistrictDataLabs#687
I added :alt: and :context: close-figs settings for both plots in the
KElbowVisualizer documentation.

See also: DistrictDataLabs#687, DistrictDataLabs#446
Add new plot directive settings from PR DistrictDataLabs#446 to Intercluster Distance Maps
Visualizer documentation. In addition, I deleted the original script
that had previously been used to generate the image in the documentation.

See also: DistrictDataLabs#687
@Kautumn06 added the "review" (PR is open) and "type: documentation" (writing and editing tasks for RTD) labels and removed the "review" label on Feb 12, 2019
@@ -8,21 +8,23 @@ The ``KElbowVisualizer`` implements the "elbow" method to help data scientists s
To demonstrate, in the following example the ``KElbowVisualizer`` fits the ``KMeans`` model for a range of :math:`K` values from 4 to 11 on a sample two-dimensional dataset with 8 random clusters of points. When the model is fit with 8 clusters, we can see an "elbow" in the graph, which in this case we know to be the optimal number.

.. plot::
:context: close-figs
@Kautumn06 (Contributor Author):
Added the recommended :context: and :alt: settings to the plot directive.
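
For readers who haven't seen the directive before, the assembled block in the .rst ends up looking roughly like this (a sketch; the :alt: text here is illustrative rather than copied from the PR):

.. plot::
    :context: close-figs
    :alt: KElbowVisualizer fit on a synthetic dataset with 8 random clusters

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from yellowbrick.cluster import KElbowVisualizer

    # Generate synthetic dataset with 8 random clusters
    X, y = make_blobs(n_samples=1000, n_features=12, centers=8, random_state=42)

    # Instantiate the clustering model and visualizer
    model = KMeans()
    visualizer = KElbowVisualizer(model, k=(4,12))

    visualizer.fit(X)        # Fit the data to the visualizer
    visualizer.poof()        # Draw/show/poof the data

When Sphinx builds the docs, the directive executes the code and injects the resulting figure, which is what makes the old image-generation scripts unnecessary.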

- # Create synthetic dataset with 8 random clusters
- X, y = make_blobs(centers=8, n_features=12, shuffle=True, random_state=42)
+ # Generate synthetic dataset with 8 random clusters
+ X, y = make_blobs(n_samples=1000, n_features=12, centers=8, random_state=42)
@Kautumn06 (Contributor Author):
I removed shuffle=True since it is True by default.

In addition, I added n_samples=1000 since that is what it was set to in the original script used to generate the image; if not included, n_samples defaults to only 100.
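
A quick way to confirm that default (this reflects scikit-learn's behavior; the snippet is a sanity check, not code from the PR):

from sklearn.datasets import make_blobs

# With no arguments, make_blobs falls back to its defaults
X, y = make_blobs()
print(X.shape)    # (100, 2): n_samples defaults to 100 and n_features to 2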

@@ -31,21 +33,25 @@ However, two other metrics can also be used with the ``KElbowVisualizer`` -- ``s
The ``KElbowVisualizer`` also displays the amount of time to train the clustering model per :math:`K` as a dashed green line, but it can be hidden by setting ``timings=False``. In the following example, we'll use the ``calinski_harabaz`` score and hide the time to fit the model.

.. plot::
:context: close-figs
@Kautumn06 (Contributor Author):
Added new plot directive settings.

- # Create synthetic dataset with 8 random clusters
- X, _ = make_blobs(centers=8, n_features=12, shuffle=True, random_state=42)
+ # Generate synthetic dataset with 8 random clusters
+ X, y = make_blobs(n_samples=1000, n_features=12, centers=8, random_state=42)
@Kautumn06 (Contributor Author):
I replaced _ with y here: even though the target variable isn't needed and doesn't affect the final plot, I thought _ could be confusing for new users who may not have seen it before. In addition, it isn't used in the other examples in our documentation where make_blobs is used to create the dataset.

I also removed shuffle=True and added n_samples=1000 for the same reasons I described above in a previous comment.

@rebeccabilbro (Member):
Thanks @Kautumn06 for keeping an eye out for consistency across the docs! One note: while I think it's fine to use y instead of _ in this case here, keep in mind that using the underscore for dummy (i.e. unused) variables is not always/only a stylistic choice; some linters and code editors will actually raise a warning if you name a variable and don't end up using it!
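
To illustrate the point (a hypothetical side-by-side, not code from the PR; whether a warning fires depends on the linter and context):

from sklearn.datasets import make_blobs

# Using _ signals "intentionally ignored" to readers and to linters
X, _ = make_blobs(n_samples=1000, n_features=12, centers=8, random_state=42)

# Naming the value also works, but if y is never used afterwards,
# some linters and editors may flag it as an unused variable
X, y = make_blobs(n_samples=1000, n_features=12, centers=8, random_state=42)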


- visualizer.fit(X)    # Fit the data to the visualizer
- visualizer.poof()    # Draw/show/poof the data
+ visualizer.fit(X)    # Fit the data to the visualizer
@Kautumn06 (Contributor Author):
Added additional spacing.


# Instantiate the clustering model and visualizer
model = KMeans()
- visualizer = KElbowVisualizer(model, k=(4,12), metric='calinski_harabaz', timings=False)
+ visualizer = KElbowVisualizer(
+     model, k=(4,12), metric='calinski_harabaz', timings=False
+ )
@Kautumn06 (Contributor Author):
Small formatting change since the line was quite long.

@@ -5,26 +5,24 @@ Intercluster Distance Maps

Intercluster distance maps display an embedding of the cluster centers in 2 dimensions with the distance to other centers preserved. E.g. the closer two centers are in the visualization, the closer they are in the original feature space. The clusters are sized according to a scoring metric. By default, they are sized by membership, e.g. the number of instances that belong to each center. This gives a sense of the relative importance of clusters. Note, however, that two clusters overlapping in the 2D space does not imply that they overlap in the original feature space.

- .. code:: python
+ .. plot::
@Kautumn06 (Contributor Author):
Previously, this example in the documentation had been broken out into two blocks—one to generate the dataset and one to fit the model and poof the visualizer. So in order to have it work with the new plot directive, I combined them into one and moved all of the import statements to the top of the block.

from yellowbrick.cluster import InterclusterDistance

# Instantiate the clustering model and visualizer
visualizer = InterclusterDistance(KMeans(9))
@Kautumn06 (Contributor Author) Feb 12, 2019:
As you'll see below, I broke this code out into two lines to keep it consistent with the rest of the examples in the documentation.


# Instantiate the clustering model and visualizer
@Kautumn06 (Contributor Author):
This code had previously been combined into a single line, and while that does work, I think it's clearer to have it on two. In addition, this is the format we use throughout the rest of the documentation, so I wanted to be consistent.

visualizer.fit(X) # Fit the training data to the visualizer
visualizer.poof() # Draw/show/poof the data
# Generate synthetic dataset with 12 random clusters
X, y = make_blobs(n_samples=1000, n_features=12, centers=12, random_state=42)
@Kautumn06 (Contributor Author):
Originally, the .rst file showed n_features=16; however, in the script used to generate the image, the parameter was set to n_features=12, so I changed it here to reflect that.

from yellowbrick.cluster import SilhouetteVisualizer

# Generate synthetic dataset with 8 random clusters
X, y = make_blobs(n_samples=1000, n_features=12, centers=8, random_state=42)
@Kautumn06 (Contributor Author):
Added the parameters used in the original script to generate the image.

# Instantiate the clustering model and visualizer
- model = MiniBatchKMeans(6)
+ model = KMeans(6)
@Kautumn06 (Contributor Author):
Using KMeans instead of MiniBatchKMeans didn't affect the plot, so I changed it here to make it consistent with our other clustering visualizers. In addition, I was worried that new users might be confused about why it is used here and not in the other clustering examples.

@rebeccabilbro (Member):
@Kautumn06 thanks for keeping an eye out for consistency! However, I'm wondering if you would please consider changing this back to MiniBatchKMeans, for two reasons. First, MiniBatchKMeans can often converge significantly faster than KMeans, which can be a big advantage when you have to wait for the model to converge before you get your plot! Second, I think it's actually better to show a lot of different clustering, classification, and regression models throughout our documentation because it shows people that YB works on any estimator, not just certain ones. In fact, as you're going through the docs, if you see interesting opportunities to introduce new estimators, feel free to take the lead in changing the currently demonstrated models to more interesting/diverse ones so that we can show off what YB can really do!
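
For reference, the reverted example would look something like this sketch (same synthetic dataset as elsewhere in this PR; MiniBatchKMeans(6) sets n_clusters positionally):

from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs
from yellowbrick.cluster import SilhouetteVisualizer

# Generate synthetic dataset with 8 random clusters
X, y = make_blobs(n_samples=1000, n_features=12, centers=8, random_state=42)

# Instantiate the clustering model and visualizer;
# MiniBatchKMeans fits on mini-batches, so it often converges
# faster than full-batch KMeans on larger datasets
model = MiniBatchKMeans(6)
visualizer = SilhouetteVisualizer(model)

visualizer.fit(X)        # Fit the data to the visualizer
visualizer.poof()        # Draw/show/poof the data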

Now that we're implementing the new plot directive, the scripts
previously used to generate images in the documentation are no longer
needed, so I've updated my PR (DistrictDataLabs#742) by deleting them.

See also: DistrictDataLabs#687, DistrictDataLabs#446
Originally, in my PR (DistrictDataLabs#742) I had set the parameter k in the Intercluster Distance
visualizer, similar to how k is set in KElbowVisualizer. And while it
was still able to generate the image, I believe this is what caused the
test to fail.
@rebeccabilbro added this to the v1.0 milestone Feb 13, 2019
@rebeccabilbro (Member) commented Feb 13, 2019

Hi @Kautumn06 — sorry that I didn't get to this earlier today, but we've been sorting out some problems with our automated tests. Now that everything seems to be fixed, I've updated your branch so that the tests can run again (note that this means you'll need to do a pull first if you need to push anything else to this branch); I'll take a look at your updates tomorrow!

@Kautumn06 (Contributor Author):

Hi @rebeccabilbro no problem! I haven't pushed any changes since yesterday afternoon, so hopefully there shouldn't be any more problems. However, just let me know if you have any questions or if there is anything else you need me to help with!

@rebeccabilbro (Member) commented Feb 13, 2019

Hi @Kautumn06 - I'm starting to take a look at your updates to the clustering docs and I had a quick question — in your above message, you said:

when I ran make html, the tests failed.

and I just wanted to make sure I understood what behavior you were experiencing before proceeding.

When you run make html, it builds a copy of the docs locally on your machine, but it doesn't run any tests. However, oftentimes when we run the make html command after making experimental changes to the docs, we'll get warnings or error messages from Sphinx in the command line that tell us that there's something funky going on in the docs. When you say the tests failed after you ran make html, do you mean you got Sphinx warnings? This is probably a sign that there is some kind of formatting error in the rst, a missing reference link, etc.

Alternatively, if you mean that after making the updates to the docs and running the tests (either locally on your machine with pytest, or by opening the PR, which automatically runs the tests and shows the results here on GitHub), you observed failing tests on the command line or saw the little red X mark here on GH, that's possibly due to the miniconda/Appveyor problems we were experiencing with our continuous integration tools.

The continuous integration testing issue is now resolved (for now ;D ), so would you please let me know if what you experienced seemed instead like a Sphinx issue, so I can look into that during my code review? Thanks for working on this!

@Kautumn06 (Contributor Author):

Hi @rebeccabilbro sorry for any confusion! The test I was referring to was in fact the red X here on GitHub. Originally, I had thought that the problem occurred because of the parameter issue I mentioned above; however, it actually turned out to be from the miniconda/Appveyor problem we've been experiencing.

So please just let me know if that answers your question or if there is anything else you need me to do!

@Kautumn06 (Contributor Author):

Hi @rebeccabilbro I just wanted to let you know that I fixed the two minor merge conflicts that came up today after #729 was merged in. It was a quick fix since the two differences were from me including the new plot directive's :alt: option in the code blocks. Let me know if you have any questions!

@rebeccabilbro (Member) left a comment:

Great work @Kautumn06 — we're lucky to have someone on the team with your attention to detail! I've made a few small suggestions and requests; would you mind taking a look at those, making any final edits, and pinging me when you're ready for me to merge this in?


Kautumn06 and others added 3 commits February 15, 2019 09:52
In the SilhouetteVisualizer documentation, updated the code block to use
MiniBatchKMeans instead of KMeans.

See also: DistrictDataLabs#687
@Kautumn06 (Contributor Author):

Hi @rebeccabilbro I've replaced KMeans with MiniBatchKMeans in the SilhouetteVisualizer documentation example. I also added the plot directive :alt: option to the code block since I had originally forgotten to add it. Once the tests have passed, I'll merge it in.

@rebeccabilbro merged commit c749a94 into DistrictDataLabs:develop Feb 15, 2019
@Kautumn06 deleted the clustering-docs branch February 15, 2019 21:36