In [1]:
import sklearn as sk

In [2]:
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import tensorflow as tf

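**`Basic machine learning workflow`**
+ **`Feature engineering`**: data cleaning, data standardization, feature selection, dimensionality reduction
+ **`Model selection`**: choosing hyperparameters
+ **`Model validation`**: evaluating model performance with a range of metrics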

In [3]:
# example
from sklearn import neighbors, datasets, preprocessing
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.metrics import accuracy_score
iris = datasets.load_iris() # load the dataset
X, y = iris.data[:, :2], iris.target # keep only the first two features
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33) # split into training and test sets
scaler = preprocessing.StandardScaler().fit(X_train) # fit the standardizer on the training set only
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
knn = neighbors.KNeighborsClassifier(n_neighbors=5) # create an untrained model
knn.fit(X_train, y_train) # train the model
y_pred = knn.predict(X_test) # predict labels for the test set
accuracy_score(y_test, y_pred) # compute the accuracy



0.631578947368421


**`Model selection`**

In [4]:
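# Hyperparameter search
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
# Candidate values for the hyperparameter alpha
alphas = np.array([1,0.1,0.01,0.001,0.0001,0])
# Define the model
model = Ridge()
# Try every candidate alpha with cross-validation
grid = GridSearchCV(estimator=model, param_grid=dict(alpha=alphas))
grid.fit(X, y)
print(grid)
# Inspect the best score and the alpha that achieved it
print(grid.best_score_)
print(grid.best_estimator_.alpha)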



GridSearchCV(cv=None, error_score='raise',
       estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'alpha': array([1.e+00, 1.e-01, 1.e-02, 1.e-03, 1.e-04, 0.e+00])},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)
0.0
1.0
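
Ridge is a regression model, so a search over a classifier may be a closer fit for the iris labels used above. The following is a minimal sketch, assuming a scikit-learn version where `GridSearchCV` lives in `sklearn.model_selection`. The candidate values for `n_neighbors`, the `cv=5` setting, and the names `X_iris`, `y_iris`, and `search` are illustrative choices, not part of the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Same two-feature slice of iris as in the first example
iris = load_iris()
X_iris, y_iris = iris.data[:, :2], iris.target

# Candidate values for n_neighbors (an illustrative assumption)
param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}

# 5-fold cross-validated search; for classifiers an integer cv uses stratified folds
search = GridSearchCV(KNeighborsClassifier(), param_grid=param_grid, cv=5)
search.fit(X_iris, y_iris)

print(search.best_params_)  # best value found among the candidates
print(search.best_score_)   # mean cross-validated accuracy for that value
```

By default `best_estimator_` is refit on the full data passed to `fit`, so it can be used for prediction directly.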


![``sklearn model select``](dataset/sklearn_map.png)

In [None]:
![``numpy-cheat-sheet_``](numpy-cheat-sheet_.png)

**`Feature engineering`**

In [14]:
# Standardization (zero mean, unit variance per feature)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X_train)
standardized_X = scaler.transform(X_train)
standardized_X_test = scaler.transform(X_test)
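
To make the effect of `StandardScaler` concrete, here is a small sketch on made-up data. The array `X_demo` and the other names are hypothetical. It checks that the transform matches subtracting the column mean and dividing by the column standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up matrix: 4 samples, 2 features on very different scales
X_demo = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])

scaler_demo = StandardScaler().fit(X_demo)
X_scaled = scaler_demo.transform(X_demo)

# Equivalent manual computation: (x - column mean) / column standard deviation
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(scaler_demo.mean_)       # per-feature means learned from X_demo
print(X_scaled.mean(axis=0))   # approximately 0 for each feature
print(X_scaled.std(axis=0))    # approximately 1 for each feature
print(np.allclose(X_scaled, X_manual))
```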

In [18]:
# Normalization (rescale each sample to unit norm)
# preprocessing.normalize(X, norm='l2')

from sklearn.preprocessing import Normalizer
normalizer = Normalizer().fit(X_train)
normalized_X = normalizer.transform(X_train)
normalized_X_test = normalizer.transform(X_test)
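
Unlike standardization, `Normalizer` works row by row. A minimal sketch on made-up data (the names `X_demo`, `X_unit`, and `row_norms` are hypothetical) shows that each sample ends up with L2 norm 1, so samples pointing in the same direction become identical.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Rows 0 and 2 point in the same direction but differ in length
X_demo = np.array([[3.0, 4.0], [1.0, 0.0], [6.0, 8.0]])

X_unit = Normalizer(norm="l2").fit_transform(X_demo)
row_norms = np.linalg.norm(X_unit, axis=1)

print(X_unit)     # each row divided by its own L2 norm
print(row_norms)  # approximately 1 for every row
```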

In [19]:
# One-hot encoding

In [21]:
data = [[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]
encoder = preprocessing.OneHotEncoder().fit(data)
encoder.transform(data).toarray()

array([[1., 0., 1., 0., 0., 0., 0., 0., 1.],
       [0., 1., 0., 1., 0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 1., 0., 1., 0., 0.],
       [0., 1., 1., 0., 0., 0., 0., 1., 0.]])
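
The nine output columns come from 2 + 3 + 4 distinct values in the three input columns. The sketch below, assuming scikit-learn 0.20 or later (where `categories_` and `inverse_transform` are available on `OneHotEncoder`), shows the learned categories and decodes the one-hot rows back to the original values.

```python
from sklearn.preprocessing import OneHotEncoder

data = [[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]
encoder = OneHotEncoder().fit(data)
encoded = encoder.transform(data).toarray()

# categories_ lists, per input column, the values that each get an output column
print(encoder.categories_)

# inverse_transform maps the one-hot rows back to the original category values
decoded = encoder.inverse_transform(encoded)
print(decoded)
```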

**`Evaluation methods`**

In [5]:
# sklearn.metrics
# Built-in score method of the estimator
knn.score(X_test, y_test)

# Using the metrics module
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.631578947368421

In [6]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

             precision    recall  f1-score   support

          0       1.00      1.00      1.00         8
          1       0.42      0.73      0.53        11
          2       0.73      0.42      0.53        19

avg / total       0.70      0.63      0.63        38
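
The per-class precision and recall in a report like this can be read off a confusion matrix. Here is a minimal sketch on made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical and unrelated to the iris run above): precision divides the diagonal by the column sums, recall by the row sums.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration only
y_true_demo = [0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred_demo = [0, 0, 1, 1, 2, 1, 2, 2, 2]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)
true_pos = np.diag(cm)

precision = true_pos / cm.sum(axis=0)  # correct / everything predicted as the class
recall = true_pos / cm.sum(axis=1)     # correct / everything truly in the class

print(cm)
print(precision, precision_score(y_true_demo, y_pred_demo, average=None))
print(recall, recall_score(y_true_demo, y_pred_demo, average=None))
```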



**`M.L. 6 Steps`**
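+ First, load the dataset that will be used to train the model
+ Split the dataset into a training set and a test set in a suitable ratio
+ Choose an existing model that fits the task, or build a suitable one
+ Feed the training set into the model to train it
+ Use the training in step four to settle on roughly reasonable parameters for the model
+ Feed the test set into the model, compare its predictions with the true results, and adjust the parameters again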
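
The steps above can be sketched as follows. This is one possible reading, not the only one: it assumes kNN as the model, picks `n_neighbors` by cross-validation on the training set (an addition to the steps as written), and keeps the test set for the final comparison. All variable names and candidate values are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Step 1: load the dataset
X_all, y_all = load_iris(return_X_y=True)

# Step 2: split into training and test sets (75% / 25%, an illustrative ratio)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.25, random_state=0, stratify=y_all
)

# Steps 3-5: choose a model family and pick a reasonable parameter
# using cross-validation on the training set only
best_k, best_cv = None, -1.0
for k in (1, 3, 5, 7):
    cv_score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_tr, y_tr, cv=5).mean()
    if cv_score > best_cv:
        best_k, best_cv = k, cv_score

# Step 6: train with the chosen parameter and compare predictions with the true labels
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr)
test_acc = final_model.score(X_te, y_te)

print(best_k, best_cv, test_acc)
```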

# Sklearn six parts

**`Classification`**
+ Identifying which class an object belongs to
+ Applications: spam detection, image recognition
+ Algorithms: SVM, nearest neighbors, random forest

**`Regression`**
+ Predicting a continuous-valued attribute associated with an object
+ Applications: drug response, stock prices
+ Algorithms: SVR, ridge regression, Lasso

**`Clustering`**
+ Automatically grouping similar objects
+ Applications: customer segmentation, grouping experiment outcomes
+ Algorithms: k-means, spectral clustering, mean-shift
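
As a small illustration of clustering, the sketch below runs k-means on six made-up 2-D points (the array `points` is hypothetical) that form two well-separated groups. No labels are given to the algorithm; the cluster ids it assigns are arbitrary, only the grouping matters.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated made-up groups of 2-D points
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print(kmeans.labels_)           # cluster id per point
print(kmeans.cluster_centers_)  # one centroid per cluster
```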

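**`Dimensionality reduction`**
+ Reducing the number of random variables to consider
+ Applications: visualization, improved efficiency
+ Algorithms: PCA, feature selection, non-negative matrix factorization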
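
For the visualization use case, a minimal PCA sketch projects the four iris features onto two components. The names `X_iris4` and `X_2d` are hypothetical; `explained_variance_ratio_` reports how much of the variance each component retains.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_iris4 = load_iris().data           # 150 samples, 4 features

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_iris4)    # 150 samples, 2 components

print(X_2d.shape)
print(pca.explained_variance_ratio_)  # share of variance kept by each component
```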

**`Model selection`**
+ Comparing, validating, and choosing parameters and models
+ Goal: improved accuracy through parameter tuning
+ Modules: grid search, cross validation, metrics

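**`Preprocessing`**
+ Feature extraction and normalization
+ Applications: transforming input data (such as text) into data that machine learning algorithms can use
+ Modules: preprocessing, feature extraction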

**`dataset`**
+ http://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets

**`3 Parts`**
+ Data preparation and preprocessing
+ Model selection and training
+ Model validation and parameter tuning

**`Model persistence`**

In [11]:
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23; use the standalone joblib package

In [13]:
# Save the model
joblib.dump(model, 'dataset/model.pkl')

# Load the model
model = joblib.load('dataset/model.pkl')
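
A self-contained round trip may help confirm that a restored model behaves like the original. This sketch assumes the standalone `joblib` package is installed and writes to a temporary directory rather than `dataset/`; the classifier and the variable names are illustrative.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)
clf = KNeighborsClassifier(n_neighbors=5).fit(X_iris, y_iris)

# Save to a temporary file and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    joblib.dump(clf, path)
    restored = joblib.load(path)

# The restored estimator should reproduce the original predictions
same_predictions = bool((restored.predict(X_iris) == clf.predict(X_iris)).all())
print(same_predictions)
```

Pickled models are tied to the library versions that produced them, so loading with a different scikit-learn version is not guaranteed to work.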

**`level`**
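+ Use: know the basic idea behind an algorithm and be able to apply an existing library to test it. Put simply, understand what kNN does and know how to call the kNN implementation in sklearn.
+ Tune: know the parameters that most affect the algorithm and be able to tune them.
+ Master: understand the implementation details of the algorithm and be able to implement it in code.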

**`API Reference`**

**`sklearn.calibration`**: Probability Calibration
+ Calibration of predicted probabilities.

**`sklearn.cluster`**: Clustering
+ The sklearn.cluster module gathers popular unsupervised clustering algorithms.

**`sklearn.cluster.bicluster`**: Biclustering
+ Spectral biclustering algorithms.

**`sklearn.compose`**: Composite Estimators
+ Meta-estimators for building composite models with transformers

**`sklearn.covariance`**: Covariance Estimators
+ The sklearn.covariance module includes methods and algorithms to robustly estimate the covariance of features given a set of points. The precision matrix defined as the inverse of the covariance is also estimated. Covariance estimation is closely related to the theory of Gaussian Graphical Models.

**`sklearn.cross_decomposition`**: Cross decomposition

**`sklearn.datasets`**: Datasets
+ The sklearn.datasets module includes utilities to load datasets, including methods to load and fetch popular reference datasets. It also features some artificial data generators.

**`sklearn.decomposition`**: Matrix Decomposition
+ The sklearn.decomposition module includes matrix decomposition algorithms, including among others PCA, NMF or ICA. Most of the algorithms of this module can be regarded as dimensionality reduction techniques.

**`sklearn.discriminant_analysis`**: Discriminant Analysis
+ Linear Discriminant Analysis and Quadratic Discriminant Analysis

**`sklearn.dummy`**: Dummy estimators

**`sklearn.ensemble`**: Ensemble Methods
+ The sklearn.ensemble module includes ensemble-based methods for classification, regression and anomaly detection.

**`sklearn.exceptions`**: Exceptions and warnings
+ The sklearn.exceptions module includes all custom warnings and error classes used across scikit-learn.

**`sklearn.feature_extraction`**: Feature Extraction
+ The sklearn.feature_extraction module deals with feature extraction from raw data. It currently includes methods to extract features from text and images.

**`sklearn.gaussian_process`**: Gaussian Processes
+ The sklearn.gaussian_process module implements Gaussian Process based regression and classification.

**`sklearn.isotonic`**: Isotonic regression

**`sklearn.impute`**: Impute
+ Transformers for missing value imputation

**`sklearn.kernel_approximation`**: Kernel Approximation
+ The sklearn.kernel_approximation module implements several approximate kernel feature maps based on Fourier transforms.

**`sklearn.kernel_ridge`**: Kernel Ridge Regression
+ Module sklearn.kernel_ridge implements kernel ridge regression.

**`sklearn.linear_model`**: Generalized Linear Models
+ The sklearn.linear_model module implements generalized linear models. It includes Ridge regression, Bayesian Regression, Lasso and Elastic Net estimators computed with Least Angle Regression and coordinate descent. It also implements Stochastic Gradient Descent related algorithms.

**`sklearn.manifold`**: Manifold Learning
+ The sklearn.manifold module implements data embedding techniques.

**`sklearn.metrics`**: Metrics
+ The sklearn.metrics module includes score functions, performance metrics and pairwise metrics and distance computations.

**`sklearn.mixture`**: Gaussian Mixture Models
+ The sklearn.mixture module implements mixture modeling algorithms.

**`sklearn.model_selection`**: Model Selection

**`sklearn.multiclass`**: Multiclass and multilabel classification

This module implements multiclass learning algorithms:

+ one-vs-the-rest / one-vs-all
+ one-vs-one
+ error correcting output codes

The estimators provided in this module are meta-estimators: they require a base estimator to be provided in their constructor. For example, it is possible to use these estimators to turn a binary classifier or a regressor into a multiclass classifier. It is also possible to use these estimators with multiclass estimators in the hope that their accuracy or runtime performance improves.

All classifiers in scikit-learn implement multiclass classification; you only need to use this module if you want to experiment with custom multiclass strategies.

The one-vs-the-rest meta-classifier also implements a predict_proba method, so long as such a method is implemented by the base classifier. This method returns probabilities of class membership in both the single label and multilabel case. Note that in the multilabel case, probabilities are the marginal probability that a given sample falls in the given class. As such, in the multilabel case the sum of these probabilities over all possible labels for a given sample will not sum to unity, as they do in the single label case.
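
A short sketch of the one-vs-the-rest strategy on iris, assuming `LogisticRegression` as the base estimator (any classifier with `predict_proba` would do). It shows one binary estimator per class and, in this single-label case, probabilities that sum to one per sample.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X_iris, y_iris = load_iris(return_X_y=True)

# The base estimator is passed to the meta-estimator's constructor
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_iris, y_iris)

print(len(ovr.estimators_))        # one fitted binary classifier per class

proba = ovr.predict_proba(X_iris[:3])
print(proba.sum(axis=1))           # rows sum to 1 in the single-label case
```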

**`sklearn.multioutput`**: Multioutput regression and classification
+ This module implements multioutput regression and classification.

The estimators provided in this module are meta-estimators: they require a base estimator to be provided in their constructor. The meta-estimator extends single output estimators to multioutput estimators.
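
The sketch below wraps a single-output regressor in `MultiOutputRegressor` on made-up data with two targets. `LinearRegression` already handles several targets on its own, so it is used here only to keep the arithmetic easy to check; the wrapper matters for estimators that lack that support. All names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import MultiOutputRegressor

# Made-up data: target 0 is 2x + 1, target 1 is -x
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
Y_demo = np.column_stack([2 * X_demo[:, 0] + 1, -X_demo[:, 0]])

# One copy of the base estimator is fitted per target column
multi = MultiOutputRegressor(LinearRegression()).fit(X_demo, Y_demo)

pred = multi.predict(np.array([[4.0]]))
print(len(multi.estimators_))
print(pred)  # approximately [[9, -4]] since both targets are exactly linear
```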

**`sklearn.naive_bayes`**: Naive Bayes
+ The sklearn.naive_bayes module implements Naive Bayes algorithms. These are supervised learning methods based on applying Bayes’ theorem with strong (naive) feature independence assumptions.

**`sklearn.neighbors`**: Nearest Neighbors
+ The sklearn.neighbors module implements the k-nearest neighbors algorithm.

**`sklearn.neural_network`**: Neural network models
+ The sklearn.neural_network module includes models based on neural networks.

**`sklearn.pipeline`**: Pipeline
+ The sklearn.pipeline module implements utilities to build a composite estimator, as a chain of transforms and estimators.
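
As an illustration, the standardize-then-classify steps can be chained into one estimator with `make_pipeline`. This is a minimal sketch on the full iris data with hypothetical variable names; the scaler is fitted on the training split only when `fit` is called on the pipeline.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, random_state=33)

# Transformers come first, the final estimator last
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
pipe.fit(X_tr, y_tr)

step_names = [name for name, _ in pipe.steps]
pipe_acc = pipe.score(X_te, y_te)  # scaling is applied to X_te automatically

print(step_names)
print(pipe_acc)
```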

**`sklearn.preprocessing`**: Preprocessing and Normalization
+ The sklearn.preprocessing module includes scaling, centering, normalization, binarization and imputation methods.

**`sklearn.random_projection`**: Random projection
+ Random Projection transformers

Random Projections are a simple and computationally efficient way to reduce the dimensionality of the data by trading a controlled amount of accuracy (as additional variance) for faster processing times and smaller model sizes.

The dimensions and distribution of Random Projections matrices are controlled so as to preserve the pairwise distances between any two samples of the dataset.

The main theoretical result behind the efficiency of random projection is the Johnson-Lindenstrauss lemma (quoting Wikipedia):

    In mathematics, the Johnson-Lindenstrauss lemma is a result concerning low-distortion embeddings of points from high-dimensional into low-dimensional Euclidean space. The lemma states that a small set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved. The map used for the embedding is at least Lipschitz, and can even be taken to be an orthogonal projection.
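
A rough sketch of the idea on random made-up data: `johnson_lindenstrauss_min_dim` gives a conservative target dimension for a chosen distortion `eps`, and `GaussianRandomProjection` performs the embedding. The sizes, the seed, and `n_components=500` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.random_projection import (
    GaussianRandomProjection,
    johnson_lindenstrauss_min_dim,
)

# Conservative number of dimensions for 1000 samples at 50% allowed distortion
min_dim = johnson_lindenstrauss_min_dim(n_samples=1000, eps=0.5)
print(min_dim)

# Made-up high-dimensional data: 100 samples, 5000 features
rng = np.random.RandomState(0)
X_high = rng.rand(100, 5000)

projector = GaussianRandomProjection(n_components=500, random_state=0)
X_low = projector.fit_transform(X_high)

# Compare one pairwise distance before and after projection
d_high = np.linalg.norm(X_high[0] - X_high[1])
d_low = np.linalg.norm(X_low[0] - X_low[1])
ratio = d_low / d_high
print(X_low.shape, ratio)  # the ratio should stay close to 1
```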

**`sklearn.semi_supervised`**: Semi-Supervised Learning
+ The sklearn.semi_supervised module implements semi-supervised learning algorithms. These algorithms utilize small amounts of labeled data and large amounts of unlabeled data for classification tasks. This module includes Label Propagation.

**`sklearn.svm`**: Support Vector Machines
+ The sklearn.svm module includes Support Vector Machine algorithms.

**`sklearn.tree`**: Decision Trees
+ The sklearn.tree module includes decision tree-based models for classification and regression.

**`sklearn.utils`**: Utilities
+ The sklearn.utils module includes various utilities.