In [6]:
from sklearn import datasets
iris = datasets.load_iris()
digits = datasets.load_digits()

The **iris** dataset contains information about flowers, with 4 features per flower and its class. The **digits** dataset contains 8x8 pixel data (64 features) for handwritten digits. For the digits dataset, you will find that `images` is organized as 8x8 arrays, while `data` is a vector of 64 values per image.
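To make the relationship between `images` and `data` concrete, here is a minimal sketch that checks the shapes and confirms that flattening each image gives the matching row of `data`. The variable names `flattened` and `same_pixels` are introduced only for this illustration.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# images: one 8x8 array per sample; data: the same pixels flattened to 64 values
print(digits.images.shape)
print(digits.data.shape)

# Flattening each image row by row should reproduce the corresponding data row
flattened = digits.images.reshape(len(digits.images), -1)
same_pixels = np.array_equal(flattened, digits.data)
print(same_pixels)
```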

#### We will be using Support Vector Classification (SVC)

In [5]:
from sklearn import svm
clf = svm.SVC(gamma=0.0001, C=100)

In [7]:
clf.fit(digits.data[:-1], digits.target[:-1])

SVC(C=100, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=0.0001, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [8]:
clf.predict(digits.data[-1:])

array([8])
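Predicting a single held-out digit says little about how well the model generalizes. As a rough sketch, the same classifier can be trained on all but the last few hundred samples and scored on the rest. The hold-out size `n_test` is an arbitrary value picked for illustration, not something from the cells above.

```python
from sklearn import datasets, svm

digits = datasets.load_digits()
n_test = 300  # hypothetical hold-out size, chosen only for illustration

# Same hyperparameters as above, but trained without the last n_test samples
clf = svm.SVC(gamma=0.0001, C=100)
clf.fit(digits.data[:-n_test], digits.target[:-n_test])

# Compare predictions on the held-out samples with their true labels
predicted = clf.predict(digits.data[-n_test:])
expected = digits.target[-n_test:]
accuracy = (predicted == expected).mean()
print(accuracy)
```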

#### Saving and loading models
**Pickle** is Python's built-in way to serialize objects. scikit-learn models can also be saved with joblib, which builds on pickle but is optimized for objects that carry large NumPy arrays, as many fitted scikit-learn estimators do.

In [9]:
from sklearn import svm
from sklearn import datasets
clf = svm.SVC()
iris = datasets.load_iris()
X, y = iris.data, iris.target
clf.fit(X, y)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

pickle serializes Python objects to a byte stream. It's very useful, but be aware that the protocols differ between Python 2 and Python 3, so a pickle written by one may not load in the other.

In [10]:
import pickle
s = pickle.dumps(clf)
clf2 = pickle.loads(s)
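clf2.predict(X[0:1])  # prediction from the restored model (not displayed)
y[0]  # true label of the first sample; only this last expression is displayed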

0
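For the joblib route mentioned at the start of this section, a minimal sketch might look like the following. It assumes the standalone `joblib` package is installed (older scikit-learn releases bundled it as `sklearn.externals.joblib`, which has since been removed), and the file name `clf.joblib` is just a placeholder.

```python
import os
import tempfile

import joblib
from sklearn import datasets, svm

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC().fit(X, y)

# Write the fitted model to disk and read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "clf.joblib")  # hypothetical file name
    joblib.dump(clf, model_path)
    clf_restored = joblib.load(model_path)

# The restored model should predict exactly like the original
same_predictions = bool((clf_restored.predict(X) == clf.predict(X)).all())
print(same_predictions)
```

As with pickle, only load files you trust, since loading can execute arbitrary code.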

#### Multiclass and Multilabel classification

**Multiclass:** Use this when there are more than two possible classes and each record belongs to exactly one of them. For instance, predicting spam vs. ham is a binary problem: one prediction is enough, because anything with a low spam probability is classified as ham. But if you are predicting bananas, oranges, and apples, you will need a multiclass classifier.

**Multilabel:** This is when the same record might be assigned several classes at once. For instance, when labeling a movie, one film could carry both the labels romantic and comedic.
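One way to see the difference is to ask scikit-learn how it interprets a target. The sketch below uses `sklearn.utils.multiclass.type_of_target`; the toy targets (`spam_vs_ham`, `fruit`, `movie_tags`) are made up for illustration.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

spam_vs_ham = [0, 1, 1, 0]                       # two classes, one per record
fruit = [0, 1, 2, 1]                             # three classes, one per record
movie_tags = np.array([[1, 1], [0, 1], [1, 0]])  # columns: romantic, comedic

kinds = [type_of_target(t) for t in (spam_vs_ham, fruit, movie_tags)]
print(kinds)
```

The three targets are reported as `binary`, `multiclass`, and `multilabel-indicator` respectively.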

In [23]:
# multiclass examples
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import LabelBinarizer

X = [[1, 2], [2, 4], [4, 5], [3, 2], [3, 1]]
y = [0, 0, 1, 1, 2]

classif = OneVsRestClassifier(estimator=SVC(random_state=0, probability=True))

classif.fit(X, y).predict(X)

array([0, 0, 1, 1, 2])

In [24]:
# multilabel example (one real target)
y = LabelBinarizer().fit_transform(y)
# Notice the values are now a binary indicator matrix with one column per class.
# Here every record has exactly one positive column, but in this format a single record could be positive in two or more columns
y

array([[1, 0, 0],
       [1, 0, 0],
       [0, 1, 0],
       [0, 1, 0],
       [0, 0, 1]])

In [25]:
# With a multilabel classifier, a record can end up with no label at all, as in the 4th and 5th records.
# This is because a separate prediction is made for each label
classif.fit(X, y).predict(X)

array([[1, 0, 0],
       [1, 0, 0],
       [0, 1, 0],
       [0, 0, 0],
       [0, 0, 0]])
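The per-label behaviour can be inspected directly: `OneVsRestClassifier` fits one binary estimator per column and records whether it was trained in multilabel mode. This is a small sketch that uses a plain `SVC()` (without `probability=True`) to keep it short; the names `ovr_multiclass` and `ovr_multilabel` are introduced here for illustration.

```python
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import LabelBinarizer
from sklearn.svm import SVC

X = [[1, 2], [2, 4], [4, 5], [3, 2], [3, 1]]
y_multiclass = [0, 0, 1, 1, 2]
y_indicator = LabelBinarizer().fit_transform(y_multiclass)

# Same data, two target formats: a 1-D class vector vs. a 2-D indicator matrix
ovr_multiclass = OneVsRestClassifier(SVC()).fit(X, y_multiclass)
ovr_multilabel = OneVsRestClassifier(SVC()).fit(X, y_indicator)

# One binary estimator per label column
print(len(ovr_multilabel.estimators_))
# The 2-D target switches the wrapper into multilabel mode
print(ovr_multiclass.multilabel_, ovr_multilabel.multilabel_)
# Multiclass predictions are 1-D; multilabel predictions are one row of 0/1 flags per record
print(ovr_multiclass.predict(X).shape, ovr_multilabel.predict(X).shape)
```

In multilabel mode each column is decided independently, which is why a row of all zeros is a valid prediction.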

In [26]:
# multilabel example (multiple labels)
# in this case, the same record can belong to several of the 5 possible classes
from sklearn.preprocessing import MultiLabelBinarizer
y = [[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]]
y = MultiLabelBinarizer().fit_transform(y)
y

array([[1, 1, 0, 0, 0],
       [1, 0, 1, 0, 0],
       [0, 1, 0, 1, 0],
       [1, 0, 1, 1, 0],
       [0, 0, 1, 0, 1]])
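To read an indicator matrix like this one back as label sets, the fitted binarizer can be kept around instead of discarded. A minimal sketch, with `label_sets`, `mlb`, and `recovered` as illustrative names:

```python
from sklearn.preprocessing import MultiLabelBinarizer

label_sets = [[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]]

# Keep the fitted binarizer so the column order can be looked up later
mlb = MultiLabelBinarizer()
indicator = mlb.fit_transform(label_sets)

# classes_ gives the label that each column stands for
print(mlb.classes_)

# inverse_transform maps each indicator row back to a tuple of labels
recovered = mlb.inverse_transform(indicator)
print(recovered)
```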

In [35]:
y_predicted = classif.fit(X, y).predict(X)

In [36]:
y_predicted_prob = classif.fit(X,y).predict_proba(X)

In [34]:
y_predicted_prob

array([[ 0.21118603,  0.07390266,  0.92628847,  0.75593332,  0.33777743],
       [ 0.21118603,  0.6913297 ,  0.30895152,  0.75593332,  0.33793175],
       [ 0.91359643,  0.06198249,  0.94083921,  0.1373745 ,  0.33780267],
       [ 0.24110933,  0.69103373,  0.30860089,  0.39687795,  0.33793176],
       [ 0.71203383,  0.6913297 ,  0.30860089,  0.75505639,  0.14706887]])
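Per-label probabilities like these can be turned into 0/1 labels by applying a cutoff to each column. The sketch below copies the matrix shown above and uses a cutoff of 0.5, which is an assumption made for illustration; it is not necessarily the rule `classif.predict` applies internally.

```python
import numpy as np

# Probabilities copied from the output shown above
proba = np.array([
    [0.21118603, 0.07390266, 0.92628847, 0.75593332, 0.33777743],
    [0.21118603, 0.6913297, 0.30895152, 0.75593332, 0.33793175],
    [0.91359643, 0.06198249, 0.94083921, 0.1373745, 0.33780267],
    [0.24110933, 0.69103373, 0.30860089, 0.39687795, 0.33793176],
    [0.71203383, 0.6913297, 0.30860089, 0.75505639, 0.14706887],
])

threshold = 0.5  # assumed cutoff, chosen only for illustration

# Each label is switched on independently when its probability reaches the cutoff
labels = (proba >= threshold).astype(int)
print(labels)
```

Raising or lowering `threshold` trades missed labels against spurious ones.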

#### Multilabel Metrics Documentation:
http://scikit-learn.org/stable/modules/model_evaluation.html#multilabel-ranking-metrics

In [43]:
from sklearn.metrics import coverage_error, label_ranking_average_precision_score, label_ranking_loss

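**Coverage Error**: how far down the ranked scores we must go, on average, to cover every true label of a sample. Lower is better; the best possible value is the average number of true labels per sample.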

In [42]:
coverage_error(y, y_predicted_prob)

5.0
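To show what coverage error measures, here is a hand-rolled version checked against scikit-learn on a tiny example. The helper `manual_coverage` and the toy arrays are illustrative and not part of the notebook above.

```python
import numpy as np
from sklearn.metrics import coverage_error


def manual_coverage(y_true, y_score):
    """Average ranking depth needed to include every true label of a sample."""
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    # Lowest score among the true labels of each sample
    min_true_score = np.where(y_true, y_score, np.inf).min(axis=1)
    # Number of labels scoring at least that much = depth to reach the last true label
    depth = (y_score >= min_true_score[:, None]).sum(axis=1)
    return depth.mean()


y_true = np.array([[1, 0, 0], [0, 0, 1]])
y_score = np.array([[0.75, 0.5, 1.0], [1.0, 0.2, 0.1]])

# Sample 1: the true label is ranked 2nd; sample 2: the true label is ranked 3rd
by_hand = manual_coverage(y_true, y_score)
by_sklearn = coverage_error(y_true, y_score)
print(by_hand, by_sklearn)
```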

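**Label Ranking Average Precision Score**: for each true label, the fraction of labels ranked at or above it that are also true, averaged over labels and samples. The score is greater than 0 and at most 1; higher is better.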

In [40]:
label_ranking_average_precision_score(y, y_predicted_prob)

0.36666666666666664

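**Label Ranking Loss**: the average fraction of (true label, false label) pairs that are ordered incorrectly, that is, where a true label scores lower than a false one. The best value is 0.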

In [41]:
label_ranking_loss(y, y_predicted_prob)

0.96666666666666679
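To get a feel for the range of these three metrics, the sketch below scores a perfect ranking and a fully reversed one on a made-up two-sample problem. The arrays `perfect` and `reversed_scores` are illustrative assumptions, not outputs of the classifier above.

```python
import numpy as np
from sklearn.metrics import (
    coverage_error,
    label_ranking_average_precision_score,
    label_ranking_loss,
)

y_true = np.array([[1, 0, 0], [0, 1, 1]])

# Every true label outranks every false label
perfect = np.array([[0.9, 0.2, 0.1], [0.1, 0.8, 0.7]])
# Every false label outranks every true label
reversed_scores = np.array([[0.1, 0.8, 0.9], [0.9, 0.2, 0.3]])

results = {}
for name, scores in (("perfect", perfect), ("reversed", reversed_scores)):
    results[name] = (
        coverage_error(y_true, scores),
        label_ranking_average_precision_score(y_true, scores),
        label_ranking_loss(y_true, scores),
    )
    print(name, results[name])
```

The perfect ranking reaches the best value of each metric (coverage 1.5, which is the average number of true labels per sample, precision 1.0, loss 0.0), while the reversed ranking gives coverage 3.0 and loss 1.0.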