# Scikit-Learn
<table align="left">
  <td>
    <a target="_blank" href="http://nbviewer.ipython.org/github/ShowMeAI-Hub/awesome-AI-cheatsheets/blob/main/Scikit-Learn/Scikit-Learn-cheatsheet-code.ipynb"><img src="https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg" />View the notebook on nbviewer</a>
  </td>
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/ShowMeAI-Hub/awesome-AI-cheatsheets/blob/main/Scikit-Learn/Scikit-Learn-cheatsheet-code.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/ShowMeAI-Hub/awesome-AI-cheatsheets/tree/main/Scikit-Learn/Scikit-Learn-cheatsheet-code.ipynb"><img src="https://badgen.net/badge/open/github/color=cyan?icon=github" />View the source on GitHub</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/ShowMeAI-Hub/awesome-AI-cheatsheets/Scikit-Learn/Scikit-Learn速查表.pdf"><img src="https://badgen.net/badge/download/pdf/color=white?icon=github"/>Download the cheat sheet</a>
  </td>
</table>

## About
**notebook by [韩信子](https://github.com/HanXinzi-AI)@[ShowMeAI](https://github.com/ShowMeAI-Hub)**

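For more AI cheat sheets, see the [cheat sheet collection](https://github.com/ShowMeAI-Hub/awesome-AI-cheatsheets).

Scikit-learn is an open-source Python library that provides machine learning, preprocessing, cross-validation, and visualization algorithms through a unified interface.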
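
As a minimal sketch of what the unified interface means in practice, the snippet below trains three classifiers through the same `fit` and `score` calls. The choice of estimators and the use of the bundled iris dataset are assumptions made only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Every estimator exposes the same fit / predict / score methods,
# so swapping models does not change the surrounding code.
estimators = [
    LogisticRegression(max_iter=1000),
    KNeighborsClassifier(n_neighbors=5),
    DecisionTreeClassifier(random_state=0),
]

scores = {}
for est in estimators:
    est.fit(X, y)
    scores[type(est).__name__] = est.score(X, y)  # accuracy on the training data

print(scores)
```

The scores here are training accuracies, so they only show that the calls are interchangeable, not how well each model generalizes.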

## Quick example

In [73]:
# Import libraries
from sklearn import neighbors, datasets, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the data (only the first two features are used)
iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33)

# Preprocess: fit the scaler on the training set, then apply it to both sets
scaler = preprocessing.StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Train and predict
knn = neighbors.KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Evaluate
accuracy_score(y_test, y_pred)

0.631578947368421

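## Loading data

Scikit-learn works with numeric data stored as NumPy arrays or SciPy sparse matrices. Other types that can be converted to numeric arrays, such as Pandas DataFrames, are also supported.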

In [2]:
import numpy as np

In [3]:
X = np.random.random((10,5))

In [4]:
y = np.array(['M','M','F','F','M','F','M','M','F','F'])

In [5]:
X[X < 0.7] = 0
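
The following sketch, written under the assumption that a `KNeighborsClassifier` is a reasonable stand-in for any estimator, shows the same mostly-zero data being passed once as a dense NumPy array and once as a SciPy CSR matrix. The seed and the variable names are hypothetical.

```python
import numpy as np
from scipy import sparse
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)            # fixed seed, chosen arbitrarily
X_dense = rng.random_sample((10, 5))
X_dense[X_dense < 0.7] = 0                # most entries become zero
y = np.array(['M', 'M', 'F', 'F', 'M', 'F', 'M', 'M', 'F', 'F'])

# The same data in compressed sparse row format
X_sparse = sparse.csr_matrix(X_dense)

preds = []
for data in (X_dense, X_sparse):
    model = KNeighborsClassifier(n_neighbors=3).fit(data, y)
    preds.append(model.predict(data))

print(preds)
```

Sparse storage mainly pays off when most entries are zero, as in text or one-hot encoded features.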

## Train/test split

In [6]:
from sklearn.model_selection import train_test_split

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
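
By default `train_test_split` holds out 25% of the samples. The sketch below, using a small made-up dataset, shows how `test_size` changes that fraction and how `stratify` keeps the class proportions the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples, two balanced classes
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 10 + [1] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

print(X_train.shape, X_test.shape)                # sizes of the two parts
print(np.bincount(y_train), np.bincount(y_test))  # class counts in each part
```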

## Data preprocessing

### Standardization

In [8]:
from sklearn.preprocessing import StandardScaler

In [9]:
scaler = StandardScaler().fit(X_train)     # fit on the training set

In [10]:
standardized_X = scaler.transform(X_train)   # transform the training set

In [11]:
standardized_X_test = scaler.transform(X_test)   # transform the test set
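
A short check, on synthetic data that is assumed here purely for illustration, of what standardization does: after the transform each training column has mean 0 and standard deviation 1, while the test set is scaled with the training statistics and therefore only lands close to those values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X_train = rng.normal(loc=10.0, scale=3.0, size=(50, 2))
X_test = rng.normal(loc=10.0, scale=3.0, size=(20, 2))

scaler = StandardScaler().fit(X_train)      # learns the per-column mean and std
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)       # reuses the training statistics

print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
print(X_test_std.mean(axis=0))
```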

### Normalization

In [12]:
from sklearn.preprocessing import Normalizer

In [13]:
scaler = Normalizer().fit(X_train)     # fit

In [14]:
normalized_X = scaler.transform(X_train)   # transform the training set

In [15]:
normalized_X_test = scaler.transform(X_test)   # transform the test set
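
Unlike `StandardScaler`, which works column by column, `Normalizer` rescales each sample (row) to unit norm. A small sketch with hand-picked values makes this visible:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[3.0, 4.0],
              [1.0, 0.0],
              [0.5, 0.5]])

X_l2 = Normalizer(norm="l2").fit_transform(X)   # each row divided by its Euclidean length
X_l1 = Normalizer(norm="l1").fit_transform(X)   # each row divided by the sum of absolute values

print(X_l2)
print(np.linalg.norm(X_l2, axis=1))   # row lengths after L2 normalization
print(np.abs(X_l1).sum(axis=1))       # row sums after L1 normalization
```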

### Binarization

In [16]:
from sklearn.preprocessing import Binarizer

In [17]:
binarizer = Binarizer(threshold=0.0).fit(X)     # fit

In [18]:
binary_X = binarizer.transform(X)     # transform
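
To make the role of `threshold` concrete, here is a sketch with made-up values: entries strictly greater than the threshold become 1, everything else becomes 0.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

X_demo = np.array([[0.2, 0.5, 0.9],
                   [0.0, 0.7, 0.5]])

# Values equal to the threshold (0.5 here) map to 0
binary_demo = Binarizer(threshold=0.5).fit_transform(X_demo)
print(binary_demo)
```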

### Encoding categorical features

In [19]:
from sklearn.preprocessing import LabelEncoder

In [20]:
enc = LabelEncoder()

In [21]:
y = enc.fit_transform(y)
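
The sketch below repeats the encoding on a copy of the labels (the names `labels` and `encoded` are introduced here for illustration) and shows how to inspect the learned mapping and reverse it.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(['M', 'M', 'F', 'F', 'M', 'F', 'M', 'M', 'F', 'F'])

enc = LabelEncoder()
encoded = enc.fit_transform(labels)      # classes are sorted, so 'F' -> 0 and 'M' -> 1

print(enc.classes_)                      # the learned classes, in code order
print(encoded)
print(enc.inverse_transform([0, 1]))     # map integer codes back to the original labels
```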

### Imputing missing values

In [22]:
from sklearn.impute import SimpleImputer

In [23]:
imp = SimpleImputer(missing_values=0, strategy='mean')    # mean imputer; zeros are treated as missing

In [24]:
imp.fit_transform(X_train)   # replace missing values with the column mean

array([[0.71335359, 0.8098837 , 0.75751865, 0.87692322, 0.72409291],
       [0.7514908 , 0.8098837 , 0.75751865, 0.87692322, 0.81120963],
       [0.77892177, 0.8098837 , 0.75751865, 0.78269117, 0.75448003],
       [0.79837969, 0.73368828, 0.75751865, 0.87692322, 0.81120963],
       [0.94975258, 0.8098837 , 0.75751865, 0.87692322, 0.81120963],
       [0.79837969, 0.8098837 , 0.75751865, 0.97115527, 0.81120963],
       [0.79837969, 0.88607912, 0.75751865, 0.87692322, 0.95505595]])
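
The cell above marks zeros as missing. A more common setup, sketched here on a tiny hand-made array, marks missing entries with `np.nan`. Each gap is filled with the mean of the observed values in its column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_missing = np.array([[1.0, 2.0],
                      [np.nan, 4.0],
                      [3.0, np.nan]])

imp_nan = SimpleImputer(missing_values=np.nan, strategy='mean')
filled = imp_nan.fit_transform(X_missing)

# Column means of the observed values: (1 + 3) / 2 = 2 and (2 + 4) / 2 = 3
print(filled)
```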

### Generating polynomial features

In [25]:
from sklearn.preprocessing import PolynomialFeatures

In [26]:
poly = PolynomialFeatures(5)

In [27]:
poly.fit_transform(X)

array([[1.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [1.        , 0.7514908 , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [1.        , 0.92223701, 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [1.        , 0.        , 0.73368828, ..., 0.        , 0.        ,
        0.        ],
       [1.        , 0.        , 0.86963975, ..., 0.        , 0.        ,
        0.95458598],
       [1.        , 0.71335359, 0.        , ..., 0.        , 0.        ,
        0.19905426]])
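
The output above is hard to read because degree 5 on 5 input features produces many columns. A smaller sketch shows the pattern: for two features `a` and `b`, degree 2 yields the columns `1, a, b, a^2, a*b, b^2`. The second part counts the columns for the degree-5 case.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Degree 2 on a single sample with a = 2 and b = 3
expanded = PolynomialFeatures(degree=2).fit_transform(np.array([[2.0, 3.0]]))
print(expanded)          # columns: 1, a, b, a^2, a*b, b^2

# Degree 5 on 5 features: all monomials of total degree <= 5
n_out = PolynomialFeatures(degree=5).fit_transform(np.ones((1, 5))).shape[1]
print(n_out)
```

The column count grows combinatorially with the degree and the number of features, so high degrees are usually paired with regularization or feature selection.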

## Creating models

### Supervised learning estimators

**Linear regression**

In [28]:
from sklearn.linear_model import LinearRegression

In [29]:
lr = LinearRegression()   # normalize=True, used when this notebook was recorded, was removed in scikit-learn 1.2

**Support vector machine (SVM)**

In [30]:
from sklearn.svm import SVC

In [31]:
svc = SVC(kernel='linear')

**Naive Bayes**

In [32]:
from sklearn.naive_bayes import GaussianNB

In [33]:
gnb = GaussianNB()

**KNN**

In [34]:
from sklearn import neighbors

In [35]:
knn = neighbors.KNeighborsClassifier(n_neighbors=5)

### Unsupervised learning estimators

**Principal component analysis (PCA)**

In [36]:
from sklearn.decomposition import PCA

In [37]:
pca = PCA(n_components=0.95)
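
When `n_components` is a float between 0 and 1, PCA keeps the smallest number of components whose cumulative explained variance reaches that fraction. The sketch below uses synthetic data, an assumption made for illustration, in which 10 features are driven by 3 underlying directions plus a little noise.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
latent = rng.normal(size=(200, 3))            # 3 hidden directions
mixing = rng.normal(size=(3, 10))             # spread over 10 observed features
X_demo = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

pca_demo = PCA(n_components=0.95).fit(X_demo)

print(pca_demo.n_components_)                     # number of components actually kept
print(pca_demo.explained_variance_ratio_.sum())   # cumulative explained variance
```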

**K-Means clustering**

In [38]:
from sklearn.cluster import KMeans

In [39]:
k_means = KMeans(n_clusters=3, random_state=0)

## Model fitting

### Supervised learning

In [41]:
lr.fit(X, y)   # fit the model to the data

LinearRegression(normalize=True)

In [42]:
knn.fit(X_train, y_train)

KNeighborsClassifier()

In [43]:
svc.fit(X_train, y_train)

SVC(kernel='linear')

### Unsupervised learning

In [44]:
k_means.fit(X_train)   # fit the model to the data

KMeans(n_clusters=3, random_state=0)

In [45]:
pca_model = pca.fit_transform(X_train)   # fit, then transform the data

## Prediction

### Supervised estimators

In [46]:
y_pred = svc.predict(np.random.random((2,5)))   # predict labels

In [47]:
y_pred = lr.predict(X_test)   # predict target values

In [48]:
y_pred = knn.predict_proba(X_test)   # estimate class probabilities

### Unsupervised estimators

In [49]:
y_pred = k_means.predict(X_test)   # predict cluster labels
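
To illustrate how `predict` and `predict_proba` relate, the sketch below fits a KNN classifier on two classes of the iris dataset (a choice made here so that 5 neighbor votes cannot tie). Each row of the probability matrix sums to 1, and its columns follow the order of `classes_`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)
X_two, y_two = X_iris[y_iris < 2], y_iris[y_iris < 2]   # keep classes 0 and 1 only

knn_demo = KNeighborsClassifier(n_neighbors=5).fit(X_two, y_two)

proba = knn_demo.predict_proba(X_two[:3])   # one row per sample, one column per class
labels = knn_demo.predict(X_two[:3])        # the most probable class for each sample

print(proba)
print(labels)
```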

## Evaluating model performance

### Classification metrics

**Accuracy**

In [50]:
svc.fit(X_train, y_train)
svc.score(X_test, y_test)   # the estimator's own score method

0.3333333333333333

In [51]:
from sklearn.metrics import accuracy_score   # metric scoring function

In [52]:
y_pred = svc.predict(X_test)
accuracy_score(y_test, y_pred)  # compute accuracy

0.3333333333333333

**Classification report**

In [53]:
from sklearn.metrics import classification_report   # precision, recall, F1 score, and support

In [54]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           F       0.50      0.50      0.50         2
           M       0.00      0.00      0.00         1

    accuracy                           0.33         3
   macro avg       0.25      0.25      0.25         3
weighted avg       0.33      0.33      0.33         3



**Confusion matrix**

In [55]:
from sklearn.metrics import confusion_matrix

In [56]:
print(confusion_matrix(y_test, y_pred))

[[1 1]
 [1 0]]
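
In a confusion matrix, rows are the true classes and columns are the predicted classes. The sketch below uses three hand-written labels that reproduce the 2x2 matrix shown above. The names `true_labels` and `pred_labels` are hypothetical.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

true_labels = ['F', 'F', 'M']
pred_labels = ['F', 'M', 'F']

# Fixing the label order makes the row/column meaning explicit
cm = confusion_matrix(true_labels, pred_labels, labels=['F', 'M'])
print(cm)
# Row 'F': one F predicted as F, one F predicted as M
# Row 'M': one M predicted as F, no M predicted as M

acc = accuracy_score(true_labels, pred_labels)   # diagonal sum divided by the total
print(acc)
```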


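### Regression metrics

**Mean absolute error**

In [57]:
from sklearn.metrics import mean_absolute_error

In [63]:
from sklearn import datasets
# load_boston was removed in scikit-learn 1.2; this cell requires an older release
house_price = datasets.load_boston()
X, y = house_price.data, house_price.target
house_X_train, house_X_test, house_y_train, house_y_test = train_test_split(X, y, random_state=0)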

In [64]:
from sklearn.tree import DecisionTreeRegressor
dt = DecisionTreeRegressor().fit(house_X_train, house_y_train)
house_y_pred = dt.predict(house_X_test)
mean_absolute_error(house_y_test, house_y_pred)

3.406299212598425

**Mean squared error**

In [65]:
from sklearn.metrics import mean_squared_error

In [66]:
mean_squared_error(house_y_test, house_y_pred)

29.29307086614173

**R² score**

In [67]:
from sklearn.metrics import r2_score

In [69]:
r2_score(house_y_test, house_y_pred)
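
As a worked example of the three regression metrics, the sketch below computes each one by hand with NumPy on four made-up values and compares the results with the scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

actual = np.array([3.0, 5.0, 2.0, 7.0])
predicted = np.array([2.0, 5.0, 4.0, 6.0])

errors = actual - predicted                       # [1, 0, -2, 1]
mae = np.mean(np.abs(errors))                     # (1 + 0 + 2 + 1) / 4
mse = np.mean(errors ** 2)                        # (1 + 0 + 4 + 1) / 4
ss_res = np.sum(errors ** 2)                      # residual sum of squares
ss_tot = np.sum((actual - actual.mean()) ** 2)    # total sum of squares
r2 = 1 - ss_res / ss_tot

print(mae, mean_absolute_error(actual, predicted))
print(mse, mean_squared_error(actual, predicted))
print(r2, r2_score(actual, predicted))
```

Because R² compares the model with always predicting the mean, it can be negative when the model does worse than that baseline.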

0.6414513601040801

### Clustering metrics

**Adjusted Rand index**

In [None]:
from sklearn.metrics import adjusted_rand_score

In [None]:
adjusted_rand_score(y_true, y_pred)   # y_true: ground-truth labels, y_pred: predicted cluster labels

**Homogeneity**

In [None]:
from sklearn.metrics import homogeneity_score

In [None]:
homogeneity_score(y_true, y_pred)

**V-measure**

In [None]:
from sklearn.metrics import v_measure_score

In [None]:
v_measure_score(y_true, y_pred)
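
The cells above leave `y_true` and `y_pred` undefined. The sketch below supplies small hypothetical label arrays and shows that these metrics compare groupings rather than label values: a clustering that matches the ground truth up to a renaming of the clusters still scores 1.0.

```python
from sklearn.metrics import adjusted_rand_score, homogeneity_score, v_measure_score

truth = [0, 0, 1, 1, 2, 2]
clusters = [1, 1, 2, 2, 0, 0]    # same grouping, different cluster ids

ari = adjusted_rand_score(truth, clusters)
hom = homogeneity_score(truth, clusters)
vms = v_measure_score(truth, clusters)

print(ari, hom, vms)
```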

### Cross-validation

In [71]:
from sklearn.model_selection import cross_val_score

In [74]:
print(cross_val_score(knn, X_train, y_train, cv=4))

[0.78571429 0.75       0.85714286 0.82142857]


In [75]:
print(cross_val_score(lr, X, y, cv=2))

[-4.31567384 -1.89773191]
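
`cross_val_score` uses the estimator's default `score` method, which is accuracy for classifiers and R² for regressors. The `scoring` parameter selects a different metric. The sketch below assumes the bundled diabetes dataset and a plain `LinearRegression`, and reports both R² and negated mean absolute error.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X_diab, y_diab = load_diabetes(return_X_y=True)
reg = LinearRegression()

r2_scores = cross_val_score(reg, X_diab, y_diab, cv=5)    # default scorer for regressors: R²
neg_mae = cross_val_score(reg, X_diab, y_diab, cv=5,
                          scoring="neg_mean_absolute_error")

print(r2_scores)
print(-neg_mae)    # flip the sign to read the values as ordinary MAE
```

Error metrics are negated so that larger values always mean a better model, which is the convention the search utilities rely on.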


## Model tuning and optimization

### Grid search for hyperparameters

In [76]:
from sklearn.model_selection import GridSearchCV

params = {"n_neighbors": np.arange(1,3), "metric": ["euclidean", "cityblock"]}
grid = GridSearchCV(estimator=knn, param_grid=params)

grid.fit(X_train, y_train)

print(grid.best_score_)
print(grid.best_estimator_.n_neighbors)

0.8031620553359684
2
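
A slightly fuller sketch of grid search, assuming the iris dataset and a hypothetical two-step pipeline: wrapping the scaler and the classifier in a `Pipeline` means the scaler is refit inside every cross-validation fold, and `best_params_` reports the whole winning combination rather than a single attribute.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)

pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {
    "knn__n_neighbors": [1, 3, 5],
    "knn__metric": ["euclidean", "cityblock"],
}

grid_demo = GridSearchCV(pipe, param_grid=param_grid, cv=5)
grid_demo.fit(X_iris, y_iris)

print(grid_demo.best_params_)
print(grid_demo.best_score_)
print(len(grid_demo.cv_results_["params"]))   # 3 x 2 = 6 combinations evaluated
```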


### Randomized search for hyperparameters

In [77]:
from sklearn.model_selection import RandomizedSearchCV

params = {"n_neighbors": range(1, 5),
          "weights": ["uniform", "distance"]}

rsearch = RandomizedSearchCV(estimator=knn,
                             param_distributions=params,
                             cv=4,
                             n_iter=8,
                             random_state=5)

rsearch.fit(X_train, y_train)
print(rsearch.best_score_)

0.8303571428571428
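
Randomized search can also sample from distributions instead of fixed lists, which helps when a parameter has a wide range. The sketch below assumes the iris dataset and draws `n_neighbors` from `scipy.stats.randint`. The range 1 to 29 is an arbitrary choice for illustration.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)

param_distributions = {
    "n_neighbors": randint(1, 30),            # integers from 1 to 29 inclusive
    "weights": ["uniform", "distance"],       # lists are sampled uniformly
}

rsearch_demo = RandomizedSearchCV(
    estimator=KNeighborsClassifier(),
    param_distributions=param_distributions,
    n_iter=10,          # number of sampled parameter settings
    cv=4,
    random_state=5,
)
rsearch_demo.fit(X_iris, y_iris)

print(rsearch_demo.best_params_)
print(rsearch_demo.best_score_)
```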
