# Machine learning: the problem setting

* supervised learning, in which the data comes with additional attributes that we want to predict
    * classification: label data with the correct category or class
    * regression: the desired output consists of one or more continuous variables
* unsupervised learning, in which the training data consists of a set of input vectors x without any corresponding target values
    * clustering: discover groups of similar examples within the data
    * density estimation: determine the distribution of data within the input space
    * dimensionality reduction: project the data from a high-dimensional space down to two or three dimensions for the purpose of visualization

# Loading an example dataset
Data is split into a training set, used to learn, and a testing set, used to evaluate what was learned; data.shape = (n_samples, n_features)
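
The split into a training set and a testing set can be sketched without any library. The helper `split_train_test` and the toy data below are hypothetical and only illustrate the idea; for real work scikit-learn offers `sklearn.model_selection.train_test_split`.

```python
import random


def split_train_test(samples, labels, test_fraction=0.25, seed=0):
    """Shuffle the sample indices and hold out a fraction for testing."""
    if len(samples) != len(labels):
        raise ValueError("samples and labels must have the same length")
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    n_test = int(round(len(samples) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [samples[i] for i in train_idx]
    x_test = [samples[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


# Toy data with shape (n_samples, n_features) = (8, 2)
data = [[i, i * 2] for i in range(8)]
target = [i % 2 for i in range(8)]

x_train, x_test, y_train, y_test = split_train_test(data, target)
print(len(x_train), len(x_test))
```

Every sample ends up in exactly one of the two sets, so the model is never evaluated on data it was fitted on.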

In [1]:
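# load the sklearn datasets
import sklearn.datasets as sk_datasets
# iris dataset: load_iris()
# handwritten digits dataset: load_digits()
# diabetes dataset: load_diabetes()
# breast cancer dataset: load_breast_cancer()
# Boston house prices dataset: load_boston() (removed in scikit-learn 1.2)
# physical exercise (Linnerud) dataset: load_linnerud()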
iris = sk_datasets.load_iris()
iris_x = iris.data  # load the data
iris_y = iris.target # load the labels
digits = sk_datasets.load_digits()
# the data is always a 2D array of shape (n_samples, n_features)
# (datasets can also be loaded from external sources)

# Learning and predicting
An estimator for classification is a Python object that implements the methods fit(x, y) and predict(T).
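
To make the fit/predict contract concrete, here is a minimal sketch of an object that satisfies it. `MajorityClassifier` is a hypothetical toy class, not part of scikit-learn: it learns the most frequent label and predicts it for every input.

```python
from collections import Counter


class MajorityClassifier:
    """Toy estimator: always predicts the most frequent training label."""

    def fit(self, x, y):
        # "Learning" here is just counting the labels; x is ignored.
        self.most_common_ = Counter(y).most_common(1)[0][0]
        return self  # returning self allows fit(...).predict(...)

    def predict(self, t):
        # One prediction per row of t.
        return [self.most_common_ for _ in t]


model = MajorityClassifier().fit([[0], [1], [2]], ["a", "b", "b"])
predictions = model.predict([[5], [6]])
print(predictions)
```

Any object with these two methods can be used the same way as the SVC estimator in the cells that follow, which is what lets an estimator be treated as a black box.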

In [2]:
# an estimator as a black box
from sklearn import svm
clf = svm.SVC(gamma=0.001, C=100.)

In [3]:
clf.fit(digits.data[:-1], digits.target[:-1])

SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.001, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [4]:
clf.predict(digits.data[-1:])

array([8])

# Model persistence
It is possible to save a model by using Python's pickle. For scikit-learn models it can be more interesting to use joblib's replacement for pickle (joblib.dump and joblib.load), which is more efficient on big data, but can only pickle to the disk and not to a string.

In [5]:
from sklearn import svm
from sklearn import datasets
clf = svm.SVC()
iris = datasets.load_iris()
x, y = iris.data, iris.target
clf.fit(x, y)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [6]:
import pickle
s = pickle.dumps(clf)
clf2 = pickle.loads(s)
clf2.predict(x[0:1]), y[0]

(array([0]), 0)

In [7]:
# use joblib
import joblib  # older scikit-learn versions: from sklearn.externals import joblib
joblib.dump(clf, 'clf.pkl')
# possible in another Python process
clf3 = joblib.load('clf.pkl')
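
The difference between pickling to a string and pickling to disk can be sketched with the standard library alone. In this sketch a plain dictionary of made-up parameters (`learned`) stands in for a fitted estimator, and the file name `model.pkl` is arbitrary.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted estimator: a dict of learned parameters.
learned = {"weights": [0.5, -1.2], "bias": 0.1}

# In memory: dumps() returns a bytes object, loads() rebuilds the object from it.
payload = pickle.dumps(learned)
restored_from_bytes = pickle.loads(payload)

# On disk: dump() writes to an open binary file, load() reads it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    with open(path, "wb") as handle:
        pickle.dump(learned, handle)
    with open(path, "rb") as handle:
        restored_from_disk = pickle.load(handle)

print(restored_from_bytes == learned, restored_from_disk == learned)
```

Loading a pickle can execute arbitrary code, so only unpickle files that come from a trusted source.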

# Conventions
scikit-learn estimators follow certain rules to make their behavior more predictable.

### Type casting

In [8]:
# Type casting
# unless otherwise specified, input will be cast to float64
import numpy as np
from sklearn import random_projection
rng = np.random.RandomState(0)
x = rng.rand(10, 2000)
x = np.array(x,dtype='float32')
x.dtype

dtype('float32')

In [9]:
transformer = random_projection.GaussianRandomProjection()
x_new = transformer.fit_transform(x)
x_new.dtype

dtype('float64')

In [10]:
# Regression targets are cast to float64, classification targets are maintained:
from sklearn import datasets
from sklearn.svm import SVC
iris = datasets.load_iris()
clf = SVC()
clf.fit(iris.data, iris.target)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [11]:
clf.predict(iris.data[:3])

array([0, 0, 0])

In [12]:
clf.fit(iris.data, iris.target_names[iris.target])
list(clf.predict(iris.data[:3]))

['setosa', 'setosa', 'setosa']

### Refitting and updating parameters
Hyper-parameters of an estimator can be updated after it has been constructed via the set_params() method. Calling fit() more than once will overwrite what was learned by any previous fit().

In [13]:
import numpy as np
from sklearn.svm import SVC
rng = np.random.RandomState(0)
x = rng.rand(100, 10)
y = rng.binomial(1, 0.5, 100)
x_test = rng.rand(5, 10)

clf = SVC()
clf.set_params(kernel='linear').fit(x,y)
clf.predict(x_test)

array([1, 0, 1, 1, 0])

In [14]:
clf.set_params(kernel='rbf').fit(x, y)
clf.predict(x_test)

array([0, 0, 0, 1, 0])
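
The convention itself can be sketched with a toy class. `ShiftedMean` is hypothetical and far simpler than scikit-learn's implementation; it only shows that set_params() changes a hyper-parameter in place and that a second fit() replaces the learned state rather than adding to it.

```python
class ShiftedMean:
    """Toy regressor: predicts the training mean plus a fixed shift."""

    def __init__(self, shift=0.0):
        self.shift = shift  # hyper-parameter, chosen by the user

    def set_params(self, **params):
        for name, value in params.items():
            if name != "shift":
                raise ValueError(f"unknown parameter: {name}")
            setattr(self, name, value)
        return self  # returning self allows set_params(...).fit(...)

    def fit(self, x, y):
        # Learned state is recomputed from scratch on every call.
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, x):
        return [self.mean_ + self.shift for _ in x]


model = ShiftedMean().fit([[0], [1]], [2.0, 4.0])
first = model.predict([[9]])  # mean 3.0, shift 0.0

model.set_params(shift=1.0).fit([[0], [1], [2]], [10.0, 20.0, 30.0])
second = model.predict([[9]])  # mean 20.0, shift 1.0; the old mean is gone
print(first, second)
```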

### Multiclass vs. multilabel fitting
When using multiclass classifiers, the learning and prediction task that is performed depends on the format of the target data fit upon.

In [15]:
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import LabelBinarizer

x = [[1, 2], [2, 4], [4, 5], [3, 2], [3, 1]]
y = [0, 0, 1, 1, 2]

classif = OneVsRestClassifier(estimator=SVC(random_state=0))
classif.fit(x, y).predict(x)

array([0, 0, 1, 1, 2])

In [16]:
y = LabelBinarizer().fit_transform(y)
classif.fit(x, y).predict(x)

array([[1, 0, 0],
       [1, 0, 0],
       [0, 1, 0],
       [0, 0, 0],
       [0, 0, 0]])

In [17]:
from sklearn.preprocessing import MultiLabelBinarizer
y = [[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]]
y = MultiLabelBinarizer().fit_transform(y)
classif.fit(x, y).predict(x)

array([[1, 1, 0, 0, 0],
       [1, 0, 1, 0, 0],
       [0, 1, 0, 1, 0],
       [1, 0, 1, 0, 0],
       [1, 0, 1, 0, 0]])

In [18]:
y

array([[1, 1, 0, 0, 0],
       [1, 0, 1, 0, 0],
       [0, 1, 0, 1, 0],
       [1, 0, 1, 1, 0],
       [0, 0, 1, 0, 1]])
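
The target formats used above can be reproduced by hand, which may help explain why predict() returns a 1D array in one case and a 2D array in the others. The helper `indicator_matrix` is a hypothetical pure-Python sketch of the binarization step, not the scikit-learn implementation: each row marks with a 1 the classes assigned to that sample.

```python
def indicator_matrix(label_sets):
    """Turn a list of label collections into a 0/1 matrix, one column per class."""
    classes = sorted({label for labels in label_sets for label in labels})
    return [[int(c in labels) for c in classes] for labels in label_sets]


multiclass_y = [0, 0, 1, 1, 2]
multilabel_y = [[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]]

# Multiclass: exactly one label per sample, so each row has a single 1.
one_hot = indicator_matrix([[label] for label in multiclass_y])

# Multilabel: several labels per sample, so a row can contain several 1s.
multi_hot = indicator_matrix(multilabel_y)

for row in multi_hot:
    print(row)
```

When the classifier is fitted on a 2D indicator matrix like these, its predictions use the same 2D layout, and a predicted row may be all zeros if no class is selected for that sample.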