# Classifying iris plants according to their measurements

We use sklearn to classify the iris data we saw back in Week 1.

The data is included in scikit-learn, so we can get it directly.

In [None]:
import itertools, os, re, string
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import neighbors, datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from IPython.display import display, Image
import pydotplus as pdp

In [None]:
iris = datasets.load_iris()
print(dir(iris)) # Note that it is not a dataframe, more a generalised data object
print(iris.feature_names)
X, y = iris.data, iris.target

# create the model
knn = neighbors.KNeighborsClassifier(n_neighbors=5)

# fit the model
knn.fit(X, y)

With k-nearest neighbours, you can determine membership probabilities for each of the 3 labels with `predict_proba()`. The `predict()` method simply picks the most likely label.

In [None]:
# What kind of iris has 3cm x 5cm sepal and 4cm x 2cm petal?
# call the "predict" method:
result = knn.predict([[3, 5, 4, 2],])

print(iris.target_names[result])

In [None]:
knn.predict_proba([[3, 5, 4, 2],]) 
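
The relationship between the two calls can be checked directly. The following is a minimal sketch, assuming the same 5-neighbour model as above; it refits the model so that it runs on its own, and the names `query`, `proba` and `label` are illustrative only.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
knn = KNeighborsClassifier(n_neighbors=5).fit(iris.data, iris.target)

query = [[3, 5, 4, 2]]
proba = knn.predict_proba(query)[0]  # one probability per class
label = knn.predict(query)[0]        # a single class index

# With the default uniform weights, each probability is the fraction of
# the 5 nearest neighbours carrying that label, so the values sum to 1.
for name, p in zip(iris.target_names, proba):
    print(f"{name:>10}: {p:.1f}")
print("predicted:", iris.target_names[label])
```

Because only 5 neighbours vote, every probability is a multiple of 0.2, and the predicted label is the class with the largest share of the votes.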

Make sure the output directory for the generated plots exists before any plots are saved.

In [None]:
picDir = "output/pics"
os.makedirs(picDir, exist_ok=True) # create the directory only if it is missing

The following utility function plots a coloured mesh showing regions where
a label would be chosen, overlaid with the training and test points.

In [None]:
def plot_2d_class(X, y, nTrain, model, plotTitle, fileTitle, cmap_area, cmap_pts):
  predNames=list(X.columns)
  c1 = predNames[:1] # first of 2
  c2 = predNames[-1:] # last of 2
  x_min, x_max = X[c1].min() - .1, X[c1].max() + .1
  y_min, y_max = X[c2].min() - .1, X[c2].max() + .1
  xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                       np.linspace(y_min, y_max, 100))
  # Wrap numpy array in a dataframe, to avoid "UserWarning: X does not have valid feature names, but [..]Classifier was fitted with feature names"
  Xgrid = pd.DataFrame(np.c_[xx.ravel(), yy.ravel()], columns=[c1[0], c2[0]])
  Z = model.predict(Xgrid)

  # Add the areas coloured by the fit
  Z = Z.reshape(xx.shape)
  
  fig, ax = plt.subplots()
  ax.pcolormesh(xx, yy, Z, cmap=cmap_area, shading='auto')

  x1 = list(itertools.chain.from_iterable(X[c1].values)) # https://stackoverflow.com/a/11264799
  x2 = list(itertools.chain.from_iterable(X[c2].values))

  # Plot the points with colours in the same colour segment as the area
  ax.scatter(x1[:nTrain], x2[:nTrain], c=y[:nTrain], cmap=cmap_pts) # training data
  ax.scatter(x1[nTrain:], x2[nTrain:], c=y[nTrain:], cmap=cmap_pts, edgecolors="black") # test data

  ax.set_title(plotTitle)
  ax.set_xlabel(c1[0])
  ax.set_ylabel(c2[0])
  fig.tight_layout()
  fig.savefig(fileTitle,bbox_inches='tight')
  #plt.close(fig) # See https://stackoverflow.com/a/37336028/1988855

Define a function that prepares 2-feature models for display, then calls `plot_2d_class` to display them in context

In [None]:
def dispTwoFeatureModels(predNames, df, pattern, model, modelType, hpName, hp, cmap_light, cmap_bold):
  for twoCols in itertools.combinations(predNames, 2): # https://stackoverflow.com/a/374645
    X = df[list(twoCols)]  # we only take two features at a time
    colNames = X.columns
    c1 = colNames[:1][0] # first of 2
    c2 = colNames[-1:][0] # last of 2
    c1 = pattern.sub("",c1.title()) # Make titlecase, then remove non-alphanumeric characters
    c2 = pattern.sub("",c2.title())
    model.fit(X, y)
    plotTitle = f"{hpName} = {hp} {modelType} fit to the Iris dataset"
    fileTitle = picDir + f"/{hpName}_{hp}_{modelType}_Iris_{c1}_{c2}.pdf"
    print("Plotting file %s" % (fileTitle))
    plot_2d_class(X, y, nTrain, model, plotTitle, fileTitle, cmap_light, cmap_bold) # use the model passed in, not the global knn

Remember: the labels (0, 1, 2) map to setosa, versicolor and virginica. In the `predict_proba` output for the query point earlier, the probabilities of setosa, versicolor and virginica are therefore 0, 0.8 and 0.2, respectively, so versicolor is chosen.

In the next block of code, we take each pair of predictors from the four available in the Iris data set, and use the k-nearest-neighbour algorithm with k=3,5,7. 

In [None]:
# the following import is no longer needed, because plot_2d_class has been added above
#from plotSupp import plot_2d_class

# Create color maps for 3-class classification problem, as with iris
cmap_light = ListedColormap(['#FFDDDD', '#DDFFDD', '#DDDDFF'])
cmap_bold = ListedColormap(['#FF2222', '#22FF22', '#8888FF'])

#predNames = list(iris.data) # https://stackoverflow.com/a/19483025, except iris.data is an array, not a dataframe
predNames = iris.feature_names
df=pd.DataFrame(iris.data, columns=predNames)
nTrain = df.shape[0]
y = iris.target
pattern = re.compile(r'[\W_]+', re.UNICODE) # raw string avoids an invalid-escape warning; see https://stackoverflow.com/a/1277047

for neighborCnt in range(3,8,2): # from 3 up to (but not including) 8, in steps of 2, so 3,5,7
  knn = neighbors.KNeighborsClassifier(n_neighbors=neighborCnt)
  dispTwoFeatureModels(predNames, df, pattern, model=knn, modelType="KNN", hpName="k", hp=neighborCnt, cmap_light=cmap_light, cmap_bold=cmap_bold)
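
As a rough check on how many plots the loop above produces, the feature pairs can be enumerated on their own. This is only a sketch; `pairs` is an illustrative name that does not appear in the notebook.

```python
import itertools
from sklearn import datasets

feature_names = datasets.load_iris().feature_names

# All unordered pairs drawn from the four predictors
pairs = list(itertools.combinations(feature_names, 2))
for first, second in pairs:
    print(first, "vs", second)

# One plot per pair for each of k = 3, 5, 7
print(len(pairs), "pairs x 3 values of k =", len(pairs) * 3, "plots")
```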

## Model Validation

The k-nearest-neighbours classification "model" should be validated. Clearly, the parameter $k$ is critical to its performance. Generally, smaller values of $k$ fit the training set more accurately (less bias) but generalise less well to test data (more variance). The opposite applies to larger values of $k$.
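
One way to see this trade-off is to compare training and test accuracy over a range of $k$. The sketch below assumes an 80/20 stratified split with `random_state=0`, an arbitrary choice made only for repeatability, and the list of $k$ values is illustrative. On a dataset as small as iris the trend may be noisy.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_iris(return_X_y=True)
# random_state=0 is an assumption, used only to make the run repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

scores = {}
for k in (1, 3, 5, 7, 15, 31):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # (training accuracy, test accuracy) for this value of k
    scores[k] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(f"k={k:2d}  train={scores[k][0]:.3f}  test={scores[k][1]:.3f}")
```

Training accuracy is perfect at $k = 1$ because each training point is its own nearest neighbour; the test column is the one that indicates how well each setting generalises.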

With $k$ set to its minimum value ($k = 1$), it fits the training set exactly and the confusion matrix is optimal:

In [None]:
X, y = iris.data, iris.target
knn1 = KNeighborsClassifier(n_neighbors=1)
knn1.fit(X, y)
y_pred1 = knn1.predict(X)
print(np.all(y == y_pred1))

The *confusion matrix* highlights where classification differences arise, as these occur on the off-diagonal elements of the matrix:

In [None]:
print(accuracy_score(y, y_pred1))
print(confusion_matrix(y, y_pred1))
print(classification_report(y, y_pred1, digits=3))

All 50 training samples for each class are identified correctly, as expected when $k = 1$: the accuracy score is 1, the off-diagonal terms are 0, and the classification report (relative to the training set) is "too good to be true".

Note:

1. The _Recall_ of the $i^{\mbox{th}}$ class is $R_i \equiv c_{ii} / \sum_j c_{ij}$, which is the ratio of the $i^{\mbox{th}}$ diagonal element to the sum of the elements of the confusion matrix $C = \{c_{ij}\}$ in that _row_ (all observations whose true label is $i$).
2. The _Precision_ of the $j^{\mbox{th}}$ class is $P_j \equiv c_{jj} / \sum_i c_{ij}$, which is the ratio of the $j^{\mbox{th}}$ diagonal element to the sum of the elements of the confusion matrix $C = \{c_{ij}\}$ in that _column_ (all observations predicted to have label $j$).
3. The $F_1$-score of the $i^{\mbox{th}}$ class is defined as $F_1 = 2\frac{R_i P_i}{R_i + P_i}$, the harmonic mean of its recall and precision.
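
These formulas can be checked against scikit-learn's own metric functions. The sketch below uses made-up labels (not iris results) and relies on scikit-learn's convention that rows of the confusion matrix are true labels and columns are predicted labels, so $\sum_j c_{ij}$ is a row sum and $\sum_i c_{ij}$ is a column sum.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up labels, chosen only so that every class has some errors
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 1])

C = confusion_matrix(y_true, y_pred)  # rows: true, columns: predicted
diag = np.diag(C)
recall = diag / C.sum(axis=1)         # R_i = c_ii / sum_j c_ij (row sums)
precision = diag / C.sum(axis=0)      # P_j = c_jj / sum_i c_ij (column sums)
f1 = 2 * recall * precision / (recall + precision)

print(C)
print("recall   ", np.round(recall, 3))
print("precision", np.round(precision, 3))
print("F1       ", np.round(f1, 3))
```

The per-class values agree with `recall_score`, `precision_score` and `f1_score` called with `average=None`, which are the numbers that `classification_report` tabulates.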

To test how the model generalizes to unseen data, we hold back some of the data by splitting it into a _training set_ and a _testing set_. We hold back 20% and stratify based on the data labels $y$, so each of the row counts in the confusion matrix should be $0.2 \times 50 = 10$.
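
The effect of stratification on the test-set class counts can be sketched as follows. The `random_state=0` seed is an assumption added only so that the counts are repeatable; the notebook's own split does not fix a seed.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

# Stratified: class proportions in y are preserved in both subsets
_, _, _, y_test_s = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
# Unstratified: class counts in the test set are left to chance
_, _, _, y_test_u = train_test_split(
    X, y, test_size=0.2, random_state=0)

print("stratified test counts:  ", np.bincount(y_test_s, minlength=3))
print("unstratified test counts:", np.bincount(y_test_u, minlength=3))
```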

In [None]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, stratify=y)
knn1.fit(Xtrain, ytrain)
ypred1s = knn1.predict(Xtest)
print(accuracy_score(ytest, ypred1s))
print(confusion_matrix(ytest, ypred1s))
print(classification_report(ytest, ypred1s, digits=3))

Note the confusion (off-diagonal nonzero elements) between the second and third Iris species, versicolor and virginica (labels 1 and 2). For comparison, we look at the confusion matrix when $k = 3$. Firstly, we try with all the data (not holding any observations back for a test set).

In [None]:
knn3 = KNeighborsClassifier(n_neighbors=3)
knn3.fit(X, y)
y_pred3 = knn3.predict(X)
print(accuracy_score(y, y_pred3))
print(confusion_matrix(y, y_pred3))
print(classification_report(y, y_pred3, digits=3))

Note that 6 observations (3 each of versicolor and virginica) are classified differently from the labels assigned by the human experts. However, this might also indicate something interesting about those observations. They could be outliers (not classified correctly) but, at the very least, they are extreme observations.
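
To look at those observations individually, the mismatches can be listed by index. This is a sketch that refits the $k = 3$ model on all the data; the name `wrong` is illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X, y = iris.data, iris.target
y_pred = KNeighborsClassifier(n_neighbors=3).fit(X, y).predict(X)

# Row indices where the prediction disagrees with the recorded label
wrong = np.flatnonzero(y != y_pred)
for i in wrong:
    print(i, X[i],
          "labelled", iris.target_names[y[i]],
          "predicted", iris.target_names[y_pred[i]])
```

The printed measurements can then be compared with the scatter plots to see whether these points sit near the boundary between the two species.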

Now we try holding back 20% of the training set for use as test observations, leaving 80% of the training data to train the classifier. We then look at what happens to the confusion matrix. Note that sampling the data like this could result in *better* relative performance, depending on what happens to the 6 problematic observations.

In [None]:
knn3.fit(Xtrain, ytrain)
ypred3s = knn3.predict(Xtest)
print(accuracy_score(ytest, ypred3s))
print(confusion_matrix(ytest, ypred3s))
print(classification_report(ytest, ypred3s, digits=3))

Now we try a Decision Tree Classifier from sklearn on the same Iris data. It uses the same interface as the k-nearest-neighbours classifier.

Again, we separate the data into training and test data.

In [None]:
# Derive Xtrain2, which is the training data restricted to the two petal features
XtrainDf = pd.DataFrame(data=Xtrain, columns=predNames)
c1 = 'petal length (cm)'
c2 = 'petal width (cm)'
colNames = [c1, c2]
Xtrain2 = XtrainDf[colNames]
nTrain = Xtrain2.shape[0]

XtestDf = pd.DataFrame(data=Xtest, columns=predNames)
Xtest2 = XtestDf[colNames]
Xcombined2 = pd.concat([Xtrain2, Xtest2])
ycombined = np.hstack((ytrain, ytest))

We also compare decision trees fitted to the `PetalWidth` $\times$ `PetalLength` data, varying the following settings:

1. maximum tree depth (2,3,4,5)
2. choice of tree impurity criterion (`gini` or `entropy`)

which is 8 combinations in all.
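
The same grid can be enumerated with `itertools.product` and the scores gathered in one table. This is a sketch rather than the notebook's own loop: it selects the petal columns by position, fixes `random_state=0` for the split (an assumption), and the `summary` table is illustrative.

```python
import itertools
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
X_petal = iris.data[:, 2:4]  # petal length and petal width
X_tr, X_te, y_tr, y_te = train_test_split(
    X_petal, iris.target, test_size=0.2,
    stratify=iris.target, random_state=0)

rows = []
for depth, criterion in itertools.product(range(2, 6), ["gini", "entropy"]):
    model = DecisionTreeClassifier(
        criterion=criterion, max_depth=depth, random_state=0).fit(X_tr, y_tr)
    rows.append({"max_depth": depth, "criterion": criterion,
                 "test_accuracy": model.score(X_te, y_te)})

summary = pd.DataFrame(rows)  # one row per (depth, criterion) combination
print(summary)
```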

In [None]:
c1 = pattern.sub("",c1.title()) # Make titlecase, then remove non-alphanumeric characters
c2 = pattern.sub("",c2.title())

for treeDepth in range(2,6):
  for criterion in ["gini","entropy"]:
    tree2 = DecisionTreeClassifier(criterion=criterion, max_depth=treeDepth, random_state=0)

    tree2.fit(Xtrain2, ytrain)

    plotTitle = "depth = %i %s %s fit to the %s dataset" % (treeDepth, criterion, "DT", "Iris")
    fileTitle = picDir + "/depth_%i_%s_%s_%s_%s_%s.pdf" % (treeDepth, criterion, "DT", "Iris", c1, c2)

    print("Plotting "+fileTitle)
    plot_2d_class(Xcombined2, ycombined, nTrain, tree2, plotTitle, fileTitle, cmap_light, cmap_bold)

    ytree2 = tree2.predict(Xtest2)
    print(accuracy_score(ytest, ytree2))
    print(confusion_matrix(ytest, ytree2))
    print(classification_report(ytest, ytree2, digits=3))

Previous decision trees were based on just two predictors (`PetalWidth` and `PetalLength`) as this made visualisation easier. However, if we include all 4 predictors, we see that the fit can improve (in the run reported here, the score improves to 0.97 from 0.93; the exact values depend on the random train/test split).

In [None]:
criterion = "entropy"
treeDepth = 5
tree = DecisionTreeClassifier(criterion=criterion, max_depth=treeDepth, random_state=0)
tree.fit(Xtrain, ytrain)
y_treeTest = tree.predict(Xtest)
print(accuracy_score(ytest, y_treeTest))
print(confusion_matrix(ytest, y_treeTest))
print(classification_report(ytest, y_treeTest, digits=3))

One of the main advantages of decision trees is the fact that they provide easily interpreted models for prediction. Indeed, the rules encoded in the tree can help to understand how the predictors combine and contribute to explaining the classification. As such, decision trees are often described as _white box_, where other algorithms (in particular, neural networks) are best seen as _black box_.
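
As a quick illustration of this readability, the rules of a fitted tree can be printed as indented text with `sklearn.tree.export_text`. The sketch below fits a deliberately shallow tree (depth 2, on all the data) so the output stays short; it is not the depth-5 tree fitted above.

```python
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier, export_text

iris = datasets.load_iris()
small_tree = DecisionTreeClassifier(
    criterion="entropy", max_depth=2, random_state=0
).fit(iris.data, iris.target)

# Each indented line is a threshold test on one predictor;
# "class:" lines give the label index predicted at a leaf.
rules = export_text(small_tree, feature_names=list(iris.feature_names))
print(rules)
```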

To aid interpretation, `scikit-learn` can output the model in a graph description language such as [dot](https://www.graphviz.org/pdf/dotguide.pdf) using the `export_graphviz` function. If you wish, you can export the `dot` file and process it using tools, both command line such as [dotty](https://www.graphviz.org/pdf/dottyguide.pdf) and more general tools such as those listed [here](https://en.wikipedia.org/wiki/Graphviz). However, it is probably more convenient to use a `dot` postprocessor (`pydotplus`) directly from within the notebook to create an object that can be displayed in the notebook, or saved to a file as below.

In [None]:
# See https://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html
dot_data = export_graphviz(
    tree, 
    out_file=None,
    feature_names=predNames,
    class_names=['setosa', 'versicolor', 'virginica'],  
    filled=True,
    rounded=True)

# Write dot_data to file, for optional postprocessing
with open(picDir+"/tree.dot","w") as f:
  f.write(dot_data) # dot_data is a single string, so write() is the appropriate call

graph = pdp.graph_from_dot_data(dot_data)
# create a PNG from the graph and display it in the notebook
display(Image(graph.create_png()))
graph.write_pdf(picDir+"/tree.pdf")

The nodes in the tree are coloured according to their function and impurity. The root is white, reflecting the fact that it has the highest impurity (mix of labels). The _setosa_ instances are relatively easily identified, because they share the condition that their `PetalWidth <= 0.8cm`. The remaining 80 instances are then split on `PetalLength <= 4.95cm`. While this improves the entropy (the child nodes have entropy 0.446 and 0.179), each of them requires further recursive splitting. Also note that predictors such as `PetalLength` can reappear in further nodes, though the split is on a different value of the predictor (5.05cm instead of 4.95cm).

The terminal nodes (leaves) are all pure, with 40, 37, 39, 3 and 1 training instances in each.
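
The entropy values shown in the nodes follow from the class counts via $H = -\sum_i p_i \log_2 p_i$. The sketch below defines a helper, `node_entropy`, which is a hypothetical name and not part of scikit-learn, and compares it with the root impurity that scikit-learn stores for a tree fitted on all 150 observations (an assumption made to keep the example self-contained).

```python
import numpy as np
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier

def node_entropy(counts):
    """Shannon entropy (base 2) of a node with the given class counts."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()  # drop empty classes
    return float(np.sum(p * np.log2(1.0 / p)))

print(node_entropy([40, 40, 40]))  # evenly mixed node: log2(3)
print(node_entropy([40, 0, 0]))    # pure node: 0

iris = datasets.load_iris()
full_tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
full_tree.fit(iris.data, iris.target)

# tree_.impurity[0] is the impurity of the root node (50/50/50 here)
print(full_tree.tree_.impurity[0], node_entropy([50, 50, 50]))
```

A pure leaf has entropy 0, while an evenly mixed three-class node such as the root has the maximum value $\log_2 3 \approx 1.585$.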