https://plotly.com/python/knn-classification/

In [11]:
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

## Display training and test splits

Using scikit-learn, we first generate synthetic data in the shape of two interleaving half-moons. We then split it into a training set and a test set. Finally, we display the ground-truth labels using a scatter plot.

In the graph, we display all the negative labels as squares, and positive labels as circles. We distinguish the training set from the test set by adding a dot to the center of each test marker.

In this example, we will use graph objects, Plotly's low-level API for building figures.
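
As a quick sanity check on the split described above, the following sketch prints the array shapes. It relies on `make_moons` generating 100 samples by default, so a `test_size` of 0.25 should leave 75 training points and 25 test points. The variable names mirror the cells below; casting the labels to strings is presumably done so that Plotly treats them as discrete categories.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

# Same settings as the cells that follow; make_moons defaults to 100 samples
X, y = make_moons(noise=0.3, random_state=0)

# Labels are cast to strings (assumed here to make Plotly treat them as categories)
X_train, X_test, y_train, y_test = train_test_split(
    X, y.astype(str), test_size=0.25, random_state=0
)

print(X.shape)        # (100, 2)
print(X_train.shape)  # (75, 2)
print(X_test.shape)   # (25, 2)
print(np.unique(y_train).tolist())  # ['0', '1']
```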

In [2]:
# Load and split data
X, y = make_moons(noise=0.3, random_state=0)

In [15]:
px.scatter(
    x=X[:, 0],
    y=X[:, 1],
    color=y,
)

In [3]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y.astype(str), test_size=0.25, random_state=0
)
trace_specs = [
    [X_train, y_train, "0", "Train", "square"],
    [X_train, y_train, "1", "Train", "circle"],
    [X_test, y_test, "0", "Test", "square-dot"],
    [X_test, y_test, "1", "Test", "circle-dot"],
]

In [7]:
fig = go.Figure(
    data=[
        go.Scatter(
            x=X[y == label, 0],
            y=X[y == label, 1],
            name=f"{split} Split, Label {label}",
            mode="markers",
            marker_symbol=marker,
        )
        for X, y, label, split, marker in trace_specs
    ]
)
fig.update_traces(marker_size=12, marker_line_width=1.5, marker_color="lightyellow")
fig.show()

## Visualize predictions on test split with plotly.express

Now, we train the kNN model on the same training data displayed in the previous graph. Then, we predict the confidence score of the model for each data point in the test set. We will use shapes to denote the true labels, and the color will indicate the score the model assigns to each point.

In this example, we will use Plotly Express, Plotly's high-level API for building figures. Notice that px.scatter requires only one function call to plot both negative and positive labels, and can additionally set a continuous color scale based on the y_score output by our kNN model.

In [16]:
from sklearn.neighbors import KNeighborsClassifier

In [18]:
clf = KNeighborsClassifier(n_neighbors=15)
clf.fit(X_train, y_train)

In [19]:
y_score = clf.predict_proba(X_test)[:, 1]
y_score

array([0.33333333, 0.73333333, 0.66666667, 0.6       , 0.93333333,
       0.93333333, 0.        , 0.93333333, 0.4       , 0.73333333,
       0.26666667, 0.13333333, 0.93333333, 0.2       , 1.        ,
       0.73333333, 0.        , 0.33333333, 0.86666667, 0.4       ,
       0.2       , 0.73333333, 0.8       , 0.06666667, 0.        ])
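
The scores above are all multiples of 1/15 because, with the default uniform weights, `predict_proba` for kNN reports the fraction of the 15 nearest training points that carry each label. The sketch below rebuilds the data and model so that it runs on its own, then reproduces the scores by hand using `kneighbors`. The name `manual_score` is introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y.astype(str), test_size=0.25, random_state=0
)

k = 15
clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
y_score = clf.predict_proba(X_test)[:, 1]

# Indices (into the training set) of the k nearest neighbors of each test point
_, neighbor_idx = clf.kneighbors(X_test)

# Fraction of those neighbors whose label is "1" (illustrative helper variable)
manual_score = (y_train[neighbor_idx] == "1").mean(axis=1)

print(np.allclose(y_score, manual_score))
# Every score times k should be a whole number of votes
print(np.allclose(y_score * k, np.round(y_score * k)))
```

With `weights="distance"`, as used later for the iris data, the votes are weighted by inverse distance, so this simple counting no longer applies.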

In [21]:
fig = px.scatter(
    data_frame=X_test,
    x=0,
    y=1,
    color=y_score,
    color_continuous_scale="RdBu",
    symbol=y_test,
    symbol_map={"0": "square-dot", "1": "circle-dot"},
    labels={"symbol": "label", "color": "score of <br>positive class"},
)
fig.update_traces(marker_size=12, marker_line_width=1.5)
fig.update_layout(legend_orientation="h")
fig.show()

## Probability Estimates with go.Contour

Just like the previous example, we will first train our kNN model on the training set.

Instead of predicting the confidence for the test set only, we can predict a confidence map for the entire area that surrounds our dataset. To do this, we use np.meshgrid to create a grid, where the spacing between adjacent points is given by the mesh_size variable.

Then, for each of those points, we will use our model to give a confidence score, and plot it with a contour plot.

In this example, we will use graph objects, Plotly's low-level API for building figures.
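
To make the grid construction concrete, here is a minimal sketch on a toy grid. The ranges `xs` and `ys` are illustrative stand-ins for `xrange` and `yrange`, and the row sum is only a placeholder for a model score.

```python
import numpy as np

# Toy ranges standing in for xrange and yrange (values are illustrative)
xs = np.array([0.0, 1.0, 2.0])
ys = np.array([10.0, 20.0])

xx, yy = np.meshgrid(xs, ys)
print(xx.shape)  # (2, 3): rows follow ys, columns follow xs

# One (x, y) row per grid point, the layout a scikit-learn model expects
grid_points = np.c_[xx.ravel(), yy.ravel()]
print(grid_points.shape)  # (6, 2)
print(grid_points[:4].tolist())
# [[0.0, 10.0], [1.0, 10.0], [2.0, 10.0], [0.0, 20.0]]

# Any per-point result can be folded back into the grid shape
values = grid_points.sum(axis=1)  # placeholder for a model score
Z = values.reshape(xx.shape)
print(Z.tolist())  # [[10.0, 11.0, 12.0], [20.0, 21.0, 22.0]]
```

Row `i`, column `j` of the reshaped array corresponds to `ys[i]` and `xs[j]`, which is the orientation go.Contour expects for `z` when given `x` and `y`.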

In [22]:
mesh_size = 0.02
margin = 0.25

In [23]:
# Create a mesh grid on which we will run our model
x_min, x_max = X[:, 0].min() - margin, X[:, 0].max() + margin
y_min, y_max = X[:, 1].min() - margin, X[:, 1].max() + margin
xrange = np.arange(x_min, x_max, mesh_size)
yrange = np.arange(y_min, y_max, mesh_size)
xx, yy = np.meshgrid(xrange, yrange)

In [24]:
# Create classifier, run predictions on grid
clf = KNeighborsClassifier(n_neighbors=15, weights="uniform")
clf.fit(X_train, y_train)

In [25]:
Z = clf.predict_proba(np.c_[xx.flatten(), yy.flatten()])[:, 1]
Z = Z.reshape(xx.shape)

In [26]:
fig = go.Figure(data=[go.Contour(x=xrange, y=yrange, z=Z, colorscale="RdBu")])
fig.show()

Now, let's try to combine our go.Contour plot with the first scatter plot of our data points, so that we can visually compare the confidence of our model with the true labels.

In [28]:
# Create classifier, run predictions on grid
clf = KNeighborsClassifier(n_neighbors=15, weights="uniform")
clf.fit(X_train, y_train)

In [29]:
Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
Z = Z.reshape(xx.shape)

In [30]:
trace_specs = [
    [X_train, y_train, "0", "Train", "square"],
    [X_train, y_train, "1", "Train", "circle"],
    [X_test, y_test, "0", "Test", "square-dot"],
    [X_test, y_test, "1", "Test", "circle-dot"],
]

In [31]:
fig = go.Figure(
    data=[
        go.Scatter(
            x=X[y == label, 0],
            y=X[y == label, 1],
            name=f"{split} Split, Label {label}",
            mode="markers",
            marker_symbol=marker,
        )
        for X, y, label, split, marker in trace_specs
    ]
)
fig.update_traces(marker_size=12, marker_line_width=1.5, marker_color="lightyellow")
fig.add_trace(
    go.Contour(
        x=xrange,
        y=yrange,
        z=Z,
        showscale=False,
        colorscale="RdBu",
        opacity=0.4,
        name="Score",
        hoverinfo="skip",
    )
)
fig.show()

## Multi-class prediction confidence with go.Heatmap

It is also possible to visualize the prediction confidence of the model using heatmaps. In this example, we compute how confident the model is about its prediction at every point of a 2D grid. Here, we define the confidence at a given point as the highest class score minus the sum of the scores of all the other classes.

In this example, we will use Plotly Express, Plotly's high-level API for building figures, for the scatter plot, and then add a go.Heatmap trace on top of it.
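
As a small worked example of this confidence definition, the sketch below applies it to four hypothetical probability rows. The values are made up for illustration and are not model output.

```python
import numpy as np

# Hypothetical class-probability rows (each sums to 1), for illustration only
proba = np.array([
    [1.0, 0.0, 0.0],        # one class takes all the mass
    [0.75, 0.125, 0.125],   # clear winner
    [0.5, 0.5, 0.0],        # two-way tie
    [0.375, 0.375, 0.25],   # nearly uniform
])

top = proba.max(axis=-1)
others = proba.sum(axis=-1) - top
diff = top - others

print(diff.tolist())  # [1.0, 0.5, 0.0, -0.25]

# Because each row sums to 1, the same quantity is 2 * top - 1
print(np.allclose(diff, 2 * top - 1))  # True
```

Since the top probability among three classes is at least 1/3, this confidence ranges from -1/3 (uniform scores) to 1 (all mass on one class).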

In [32]:
mesh_size = 0.02
margin = 1

In [33]:
# We will use the iris data, which is included in px
iris = px.data.iris()
iris.head()

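| | sepal_length | sepal_width | petal_length | petal_width | species | species_id |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 1 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 1 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 1 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 1 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 1 |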

In [34]:
iris_train, iris_test = train_test_split(iris, test_size=0.25, random_state=0)

In [36]:
X_train = iris_train[["sepal_length", "sepal_width"]]
y_train = iris_train["species_id"]

In [37]:
# Create a mesh grid on which we will run our model
l_min, l_max = iris.sepal_length.min() - margin, iris.sepal_length.max() + margin
w_min, w_max = iris.sepal_width.min() - margin, iris.sepal_width.max() + margin
lrange = np.arange(l_min, l_max, mesh_size)
wrange = np.arange(w_min, w_max, mesh_size)
ll, ww = np.meshgrid(lrange, wrange)

In [38]:
# Create classifier, run predictions on grid
clf = KNeighborsClassifier(n_neighbors=15, weights="distance")
clf.fit(X_train, y_train)

In [39]:
Z = clf.predict(np.c_[ll.ravel(), ww.ravel()])
Z = Z.reshape(ll.shape)


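Warning: X does not have valid feature names, but KNeighborsClassifier was fitted with feature names

This warning appears because the classifier was fitted on a DataFrame with named columns but is queried here with a plain NumPy array. The grid columns are in the same order as the training features, so the predictions are unaffected.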



In [40]:
proba = clf.predict_proba(np.c_[ll.ravel(), ww.ravel()])
proba = proba.reshape(ll.shape + (3,))


In [41]:
# Compute the confidence: the highest class probability minus the sum of the others
diff = proba.max(axis=-1) - (proba.sum(axis=-1) - proba.max(axis=-1))
diff

array([[0.18501033, 0.18387614, 0.18271729, ..., 0.64328414, 0.64331738,
        0.64334429],
       [0.30438052, 0.18480458, 0.1836432 , ..., 0.64381981, 0.64384725,
        0.64386834],
       [0.30516186, 0.3040119 , 0.18459416, ..., 0.64436151, 0.64438295,
        0.64439804],
       ...,
       [1.        , 1.        , 1.        , ..., 0.87647455, 0.87647815,
        0.87648609],
       [1.        , 1.        , 1.        , ..., 0.87635737, 0.8763622 ,
        0.87636565],
       [1.        , 1.        , 1.        , ..., 0.87624188, 0.87624788,
        0.8762525 ]])

In [42]:
fig = px.scatter(
    data_frame=iris_test,
    x="sepal_length",
    y="sepal_width",
    symbol="species",
    symbol_map={
        "setosa": "square-dot",
        "versicolor": "circle-dot",
        "virginica": "diamond-dot",
    },
)

fig.update_traces(marker_size=12, marker_line_width=1.5, marker_color="lightyellow")

fig.add_trace(
    go.Heatmap(
        x=lrange,
        y=wrange,
        z=diff,
        opacity=0.25,
        customdata=proba,
        colorscale="RdBu",
        hovertemplate=(
            "sepal length: %{x} <br>"
            "sepal width: %{y} <br>"
            "p(setosa): %{customdata[0]:.3f}<br>"
            "p(versicolor): %{customdata[1]:.3f}<br>"
            "p(virginica): %{customdata[2]:.3f}<extra></extra>"
        ),
    )
)
fig.update_layout(legend_orientation="h", title="Prediction Confidence on Test Split")
fig.show()