# Machine learning: An introduction

Terms:

* We work with *n* samples
* We try to predict properties of unknown data
* If each sample is more than a single number, we say it has several attributes or *features*
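
To make these terms concrete, here is a minimal sketch that inspects the shape of the digits dataset used later in these notes. The names `n_samples` and `n_features` are chosen here for illustration only.

```python
from sklearn import datasets

digits = datasets.load_digits()

# data is a 2D array: one row per sample, one column per feature
n_samples, n_features = digits.data.shape
print(n_samples, n_features)

# Each sample is an 8x8 image flattened into 64 features
print(digits.images.shape)
```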

Large categories:

* Supervised learning
    * Classification: Samples belong to two or more classes, and we want to predict the class of unlabeled data
    * Regression: Output consists of one or more continuous variables
* Unsupervised learning: Discover groups of similar examples within the data (clustering), determine the distribution of data within the input space (density estimation), or project the data to fewer dimensions (visualization)
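
The difference between the two categories can be sketched as follows. The choice of `SVC` and `KMeans`, and the parameters `n_clusters=3`, `n_init=10` and `random_state=0`, are assumptions made for this illustration, not part of the original notes.

```python
from sklearn import datasets, svm
from sklearn.cluster import KMeans

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Supervised: the labels y are used during fitting
classifier = svm.SVC()
classifier.fit(X, y)
predicted = classifier.predict(X[:3])

# Unsupervised: only X is given; KMeans groups similar samples on its own
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

print(predicted)
print(cluster_ids[:3])
```

Note that the cluster ids are arbitrary group numbers and need not match the class labels in `y`.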


In [2]:
from sklearn import datasets
iris = datasets.load_iris()
digits = datasets.load_digits()

Data overview:

In [3]:
digits.data

array([[  0.,   0.,   5., ...,   0.,   0.,   0.],
       [  0.,   0.,   0., ...,  10.,   0.,   0.],
       [  0.,   0.,   0., ...,  16.,   9.,   0.],
       ..., 
       [  0.,   0.,   1., ...,   6.,   0.,   0.],
       [  0.,   0.,   2., ...,  12.,   0.,   0.],
       [  0.,   0.,  10., ...,  12.,   1.,   0.]])

Ground truth for data:

In [5]:
digits.target

array([0, 1, 2, ..., 8, 9, 8])

# Learning and predicting

Predicting numbers:

* We are given samples of each of the 10 possible classes (the digits 0 through 9)
* We fit an *estimator* so that it can *predict* the classes to which unseen samples belong

In [6]:
from sklearn import svm
# The classifier, ready to train / learn from some data
clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(digits.data[:-1], digits.target[:-1])

SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=0.001, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [9]:
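# Predict the class of the held-out last sample
clf.predict(digits.data[-1:])
# Only this last expression is displayed: the sample's features, not the prediction
digits.data[-1:]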

array([[  0.,   0.,  10.,  14.,   8.,   1.,   0.,   0.,   0.,   2.,  16.,
         14.,   6.,   1.,   0.,   0.,   0.,   0.,  15.,  15.,   8.,  15.,
          0.,   0.,   0.,   0.,   5.,  16.,  16.,  10.,   0.,   0.,   0.,
          0.,  12.,  15.,  15.,  12.,   0.,   0.,   0.,   4.,  16.,   6.,
          4.,  16.,   6.,   0.,   0.,   8.,  16.,  10.,   8.,  16.,   8.,
          0.,   0.,   1.,   8.,  12.,  14.,  12.,   1.,   0.]])
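
To see the prediction itself next to the ground truth, a minimal self-contained sketch could look like this. It repeats the training step from above; the variable names `predicted` and `actual` are introduced here for illustration.

```python
from sklearn import datasets, svm

digits = datasets.load_digits()

# Train on every sample except the last one
clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(digits.data[:-1], digits.target[:-1])

# Predict the held-out sample and compare with its known label
predicted = clf.predict(digits.data[-1:])
actual = digits.target[-1]
print("predicted:", predicted[0], "actual:", actual)
```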

# Model persistence

We can save a model object by using pickle.

In [10]:
from sklearn import svm
from sklearn import datasets
clf = svm.SVC()
iris = datasets.load_iris()
X = iris.data
y = iris.target
clf.fit(X, y)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [13]:
import pickle
s = pickle.dumps(clf)
clf2 = pickle.loads(s)
print(clf2.predict(X[0:1]))
print(y[0])

[0]
0


In [14]:
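# sklearn.externals.joblib was removed in scikit-learn 0.23;
# on newer versions use `import joblib` instead
from sklearn.externals import joblib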
joblib.dump(clf, 'filename.pkl')

['filename.pkl',
 'filename.pkl_01.npy',
 'filename.pkl_02.npy',
 'filename.pkl_03.npy',
 'filename.pkl_04.npy',
 'filename.pkl_05.npy',
 'filename.pkl_06.npy',
 'filename.pkl_07.npy',
 'filename.pkl_08.npy',
 'filename.pkl_09.npy',
 'filename.pkl_10.npy',
 'filename.pkl_11.npy']

In [15]:
# Later you can load:
clf = joblib.load('filename.pkl')
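
A full save-and-load round trip might be sketched as below. This assumes the standalone `joblib` package is importable (it is installed as a dependency of recent scikit-learn releases); the file name `model.joblib` and the use of a temporary directory are hypothetical choices for this example.

```python
import os
import tempfile

import joblib  # standalone package, assumed to be installed
from sklearn import datasets, svm

iris = datasets.load_iris()
X, y = iris.data, iris.target
clf = svm.SVC().fit(X, y)

# Write the model to a temporary directory so no stray files are left behind
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.joblib")  # hypothetical file name
    joblib.dump(clf, model_path)
    restored = joblib.load(model_path)

# The restored model should make the same predictions as the original
same = bool((restored.predict(X) == clf.predict(X)).all())
print(same)
```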

# Conventions

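By default, regression targets are cast to `float64`. Classification targets are maintained as given, so `predict()` returns the same kind of labels (for example integers or strings) that were passed to `fit()`.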
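
These conventions can be checked with a short sketch. Using `LinearRegression` and treating the integer iris labels as a regression target are assumptions made purely for illustration.

```python
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVC

iris = datasets.load_iris()
X = iris.data

# Regression: integer targets go in, float64 predictions come out
reg = LinearRegression().fit(X, iris.target)
reg_pred = reg.predict(X[:3])
print(reg_pred.dtype)

# Classification with integer labels returns integer labels
clf = SVC().fit(X, iris.target)
int_pred = clf.predict(X[:3])
print(list(int_pred))

# Classification with string labels returns string labels
clf.fit(X, iris.target_names[iris.target])
str_pred = clf.predict(X[:3])
print(list(str_pred))
```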

# Refitting and updating parameters

*Hyperparameters* can be updated after construction with the `set_params()` method, which is available on every estimator (for example `sklearn.pipeline.Pipeline.set_params`). Calling `fit()` more than once overwrites whatever was learned by the previous `fit()`.
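
As a minimal sketch of this behavior, the example below switches the kernel of an `SVC` between two fits. The choice of the iris data and of the `kernel` parameter is an assumption for illustration.

```python
from sklearn import datasets
from sklearn.svm import SVC

iris = datasets.load_iris()
X, y = iris.data, iris.target

clf = SVC()

# Switch to a linear kernel, then fit
clf.set_params(kernel="linear").fit(X, y)
linear_pred = clf.predict(X[:5])

# Switch back to the RBF kernel; this second fit() replaces the first model
clf.set_params(kernel="rbf").fit(X, y)
rbf_pred = clf.predict(X[:5])

print(clf.get_params()["kernel"])
```

`set_params()` returns the estimator itself, which is why the call can be chained directly with `fit()`.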

# Multiclass vs. multilabel fitting

