# Going Beyond Binary
*Curtis Miller*

I have emphasized binary classification because it is the simplest form of classification, and binary classifiers are easier to develop than classifiers that predict one of more than two labels (a task called **multiclass classification**). That said, multiclass problems certainly arise in practice. What can we do then?

Take, for example, predicting the species of flowers in the iris dataset, which I load below.

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.metrics import classification_report

In [None]:
iris_obj = load_iris()
flower, species = iris_obj.data, iris_obj.target
flower_train, flower_test, species_train, species_test = train_test_split(
    flower, species, test_size=0.1)
flower_train[:5, :]

In [None]:
species_train[:5]

## Inherently Multiclass Classifiers

Some classifiers don't rely on the binary assumption and can predict one of many labels as they are. The inherently multiclass classifiers we've seen include:

* KNN
* Decision trees
* Random forests
* Naive Bayes

### KNN

We already saw KNN applied to this dataset and its ability to predict one of many labels.
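As a brief reminder, a minimal sketch of KNN on this data might look like the following. The cell reloads the data so it runs on its own; the names `X_train`, `knn`, and `knn_pred`, the fixed `random_state=0`, and the choice of `n_neighbors=5` are illustrative assumptions, not tuned or taken from the earlier example.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Reload and split the data so this cell is self-contained
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0)

# n_neighbors=5 is an illustrative choice, not a tuned value
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# KNN predicts one of the three species directly; no binary scheme is needed
knn_pred = knn.predict(X_test)
print(classification_report(y_test, knn_pred))
```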

### Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier
from io import StringIO    # sklearn.externals.six was removed from recent scikit-learn releases
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus

In [None]:
tree = DecisionTreeClassifier(max_depth=3)
tree = tree.fit(flower_train, species_train)
print(classification_report(species_test, tree.predict(flower_test)))

In [None]:
dot_data = StringIO()

export_graphviz(tree,    # Function for exporting a visualization of the tree
                out_file=dot_data,
                # Data controlling the display of the graph
                filled=True, rounded=True,
                special_characters=True,
                feature_names=["Sepal Length", "Sepal Width",
                               "Petal Length", "Petal Width"],    # Use the name of the features
                proportion=True)    # Show proportions for labels

# Display graph in Jupyter notebook
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
Image(graph.create_png())

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
forest = RandomForestClassifier(n_estimators=20, max_depth=2)
forest.fit(flower_train, species_train)
print(classification_report(species_test, forest.predict(flower_test)))

### Naive Bayes

In this case, I will use the Gaussian variant of the naive Bayes classifier, implemented in `GaussianNB`, since all variables in the iris dataset are continuous.

In [None]:
from sklearn.naive_bayes import GaussianNB

In [None]:
nb = GaussianNB()
nb = nb.fit(flower_train, species_train)
print(classification_report(species_test, nb.predict(flower_test)))

## One vs. All and One vs. One Classification

After we exhaust classifiers that are inherently multiclass, we are forced to combine binary classifiers in such a way that they can predict multiple labels. SVMs and logistic regression are examples of classifiers that are not inherently multiclass and need to be handled this way.

**One vs. all** (also called one vs. rest) classification trains one classifier per class. For each classifier, the data points in its class are the "successes" and all data points not in that class are the "failures". At prediction time, every classifier scores the new data point. If exactly one classifier predicts "success", the class associated with that classifier is the predicted class; if several classifiers (or none) predict "success", the class whose classifier is most confident is chosen.

One vs. all classification is simple: we need only as many classifiers as we have classes, so it can be done relatively quickly. Every classifier is trained on the full training set, so the scheme works well as long as the training set is not so large that this becomes slow. Thus this scheme is popular. However, when the underlying classifiers are linear, the scheme assumes that every class can be separated from the rest by a single hyperplane; this may not be true, in which case learning fails.
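The one vs. all scheme described above can be sketched by hand. This is a minimal illustration under stated assumptions, not how **scikit-learn** implements it internally: it uses `LogisticRegression` as the binary classifier, picks the class whose classifier reports the highest "success" probability, and evaluates on the training data only to keep the cell short. The names `binary_models`, `scores`, and `ovr_pred` are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# Train one binary classifier per class: that class (1) against the rest (0)
binary_models = []
for k in classes:
    model = LogisticRegression(max_iter=1000)   # max_iter raised to help convergence
    model.fit(X, (y == k).astype(int))
    binary_models.append(model)

# Column j holds classifier j's estimated probability of "success"
scores = np.column_stack([m.predict_proba(X)[:, 1] for m in binary_models])

# Predict the class whose classifier is most confident
ovr_pred = classes[np.argmax(scores, axis=1)]
print("Training accuracy:", np.mean(ovr_pred == y))
```

Taking the most confident classifier resolves the cases where several classifiers, or none of them, predict "success".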

**One vs. one** classification trains a classifier for every *pair* of classes. For each pair, the training dataset is restricted to observations from those two classes, and a classifier is trained to distinguish them. In prediction, each classifier trained this way casts a vote for one of its two classes. The class that receives the most votes is the class finally predicted.

This mode of classification requires more classifiers; if there are $K$ classes, $\frac{K(K-1)}{2}$ classifiers are needed, a number that grows on the order of $K^2$. This slows down prediction as well. The scheme does work well, though, when the cost of training a classifier grows quickly with the size of the dataset, since each classifier is trained only on the data from two classes.
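The pairwise voting described above can also be sketched by hand. Again this is only an illustration under assumptions: `LogisticRegression` stands in for the binary classifier, ties are broken in favor of the lowest class label (a side effect of `argmax`), and the code relies on the iris labels being the integers 0, 1, 2 so that a label can index a column of `votes`. The names `pairs`, `votes`, and `ovo_pred` are hypothetical.

```python
import numpy as np
from itertools import combinations
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
classes = np.unique(y)                    # labels are 0, 1, 2 for iris

# One classifier per pair of classes: K(K-1)/2 pairs in total
pairs = list(combinations(classes, 2))
print("Number of pairwise classifiers:", len(pairs))

votes = np.zeros((X.shape[0], len(classes)), dtype=int)
for a, b in pairs:
    # Restrict training data to observations from the two classes
    mask = (y == a) | (y == b)
    model = LogisticRegression(max_iter=1000).fit(X[mask], y[mask])

    # Every data point receives one vote from this classifier
    pred = model.predict(X)
    for k in (a, b):
        votes[pred == k, k] += 1

# The class with the most votes wins (ties go to the lowest label)
ovo_pred = classes[np.argmax(votes, axis=1)]
print("Training accuracy:", np.mean(ovo_pred == y))
```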

All classifiers implemented in **scikit-learn** support multiclass classification out of the box; `SVC` and `LogisticRegression`, in particular, already support these schemes. However, the **multiclass** module includes objects that allow for manual implementation of these schemes: `OneVsRestClassifier` for the one vs. all scheme, and `OneVsOneClassifier` for the one vs. one scheme.

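`SVC` by default implements the one vs. one method. `LogisticRegression` used the one vs. all method by default in older versions of **scikit-learn**; since version 0.22 the default depends on the solver, and with the default `lbfgs` solver it fits a single multinomial model instead.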
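Because defaults can differ between **scikit-learn** versions, wrapping a classifier in `OneVsRestClassifier` or `OneVsOneClassifier` makes the scheme explicit. The following is a short sketch; the variable names, `random_state=0`, and `max_iter=1000` are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0)

# One vs. all: one logistic regression per class
ovr_logit = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr_logit.fit(X_train, y_train)

# One vs. one: one SVM per pair of classes
ovo_svm = OneVsOneClassifier(SVC())
ovo_svm.fit(X_train, y_train)

# The fitted binary classifiers are stored in the estimators_ attribute
print("One vs. all classifiers:", len(ovr_logit.estimators_))
print("One vs. one classifiers:", len(ovo_svm.estimators_))
print("One vs. all test accuracy:", ovr_logit.score(X_test, y_test))
print("One vs. one test accuracy:", ovo_svm.score(X_test, y_test))
```

With three classes both schemes happen to need three binary classifiers, since $\frac{3 \cdot 2}{2} = 3$; the counts diverge once there are four or more classes.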

### SVM (One vs. One)

In [None]:
from sklearn.svm import SVC

In [None]:
svm = SVC()
svm.fit(flower_train, species_train)
print(classification_report(species_test, svm.predict(flower_test)))

### Logistic Regression (One vs. All)

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
logit = LogisticRegression()
logit.fit(flower_train, species_train)
print(classification_report(species_test, logit.predict(flower_test)))