# Learning the scikit-learn Library

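`scikit-learn` is a very popular machine learning library for Python. It provides simple and efficient tools for data mining and data analysis. The sections below walk through its basic usage and some common operations.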

---

### 1. Installing scikit-learn

In [None]:
%pip install scikit-learn

---

### 2. Importing scikit-learn

In [1]:
import sklearn
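
A quick way to confirm that the import worked is to print the installed version. This is only a small sketch; the version string you see depends on your environment.

```python
import sklearn

# __version__ is a plain string; its exact value depends on the installed release
version = sklearn.__version__
print("scikit-learn version:", version)
```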

---

### 3. Datasets

In [6]:
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()
print("Dataset description:", iris.DESCR)
print("Feature names:", iris.feature_names)
print("Target classes:", iris.target_names)


Dataset description: .. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
    - class:
            - Iris-Setosa
            - Iris-Versicolour
            - Iris-Virginica

:Summary Statistics:

                Min  Max   Mean    SD   Class Correlation
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
...

In [4]:
import pandas as pd

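# Load a CSV dataset with pandas
# A raw string (r'...') keeps the backslashes in the Windows path from being read as escape sequences
df = pd.read_csv(r'G:\VScode project\Python\iris_dataset.csv')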
print(df.head())

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

  species  
0  setosa  
1  setosa  
2  setosa  
3  setosa  
4  setosa  
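
If no CSV file is at hand, a similar table can be built directly from the bundled dataset. The sketch below assumes scikit-learn 0.23 or newer (where `load_iris` accepts `as_frame=True`) and an installed pandas; the variable names `iris_frame` and `df_iris` are chosen here for illustration.

```python
from sklearn.datasets import load_iris

# as_frame=True asks scikit-learn to return pandas objects instead of NumPy arrays
iris_frame = load_iris(as_frame=True)

# .frame holds the four feature columns plus an integer "target" column
df_iris = iris_frame.frame
print(df_iris.head())
print(df_iris.shape)
```

Unlike the CSV above, this frame stores the class as an integer `target` column rather than a `species` string.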


---

### 4. Data Preprocessing

In [7]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_data = scaler.fit_transform(iris.data)
print("Standardized data:\n", scaled_data[:5])

Standardized data:
 [[-0.90068117  1.01900435 -1.34022653 -1.3154443 ]
 [-1.14301691 -0.13197948 -1.34022653 -1.3154443 ]
 [-1.38535265  0.32841405 -1.39706395 -1.3154443 ]
 [-1.50652052  0.09821729 -1.2833891  -1.3154443 ]
 [-1.02184904  1.24920112 -1.34022653 -1.3154443 ]]
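
To make clear what standardization computes, the sketch below reproduces it by hand with NumPy: each column has its mean subtracted and is divided by its standard deviation. The variable names `manual` and `scaled` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris().data

# Manual standardization: subtract the column mean, divide by the column standard deviation.
# StandardScaler uses the population standard deviation (ddof=0), which is also NumPy's default.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

scaled = StandardScaler().fit_transform(X)

# Compare the two results element by element
matches = np.allclose(manual, scaled)
print("Manual result matches StandardScaler:", matches)
```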


In [7]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(iris.data)
print("Normalized data:\n", normalized_data[:5])


Normalized data:
 [[0.22222222 0.625      0.06779661 0.04166667]
 [0.16666667 0.41666667 0.06779661 0.04166667]
 [0.11111111 0.5        0.05084746 0.04166667]
 [0.08333333 0.45833333 0.08474576 0.04166667]
 [0.19444444 0.66666667 0.06779661 0.04166667]]
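
Min-max normalization rescales each column to the range [0, 1] using its minimum and maximum. The following sketch computes the same quantity by hand; the names `manual` and `normalized` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

X = load_iris().data

# Manual min-max scaling: (x - column minimum) / (column maximum - column minimum)
col_min = X.min(axis=0)
col_max = X.max(axis=0)
manual = (X - col_min) / (col_max - col_min)

normalized = MinMaxScaler().fit_transform(X)

# Compare the two results element by element
matches = np.allclose(manual, normalized)
print("Manual result matches MinMaxScaler:", matches)
```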


In [8]:
from sklearn.preprocessing import LabelEncoder

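encoder = LabelEncoder()
# iris.target is already integer-coded (0, 1, 2), so encoding leaves it unchanged here;
# LabelEncoder is mainly useful when the labels are strings
encoded_target = encoder.fit_transform(iris.target)
print("Encoded target:", encoded_target)


Encoded target: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0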
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
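
`LabelEncoder` is easier to follow with string labels. The sketch below uses a short, made-up list of species names (an assumption for illustration, not data from the notebook) and shows the mapping in both directions.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical string labels, used only for illustration
species = ["setosa", "virginica", "versicolor", "setosa"]

encoder = LabelEncoder()
codes = encoder.fit_transform(species)

# classes_ stores the distinct labels in sorted order; each code is an index into it
print("Classes:", encoder.classes_)
print("Codes:", codes)

# inverse_transform maps the integer codes back to the original strings
decoded = encoder.inverse_transform(codes)
print("Decoded:", decoded)
```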


---

### 5. Splitting into Training and Test Sets

In [11]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=42)
print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)


Training set shape: (105, 4)
Test set shape: (45, 4)
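
A plain random split does not guarantee that each class appears in the same proportion in both parts. As a sketch of one common option, passing `stratify` to `train_test_split` keeps the class ratios equal; the variable names `train_counts` and `test_counts` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# stratify=iris.target keeps the three classes in equal proportion in both parts
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42, stratify=iris.target
)

# Count how many samples of each class ended up in each part
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print("Per-class counts in the training set:", train_counts)
print("Per-class counts in the test set:", test_counts)
```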


---

### 6. Training Models

In [10]:
from sklearn.linear_model import LinearRegression

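# Create a linear regression model
model = LinearRegression()

# Train the model
# Note: LinearRegression treats the class labels 0/1/2 as continuous numbers,
# so it outputs real values rather than class labels
model.fit(X_train, y_train)

# Predict
predictions = model.predict(X_test)
print("Predictions:", predictions)


Predictions: [ 1.24069097 -0.04537609  2.24501083  1.35143666  1.29775083  0.01024241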
  1.05031173  1.82525399  1.37084413  1.06699186  1.70363485 -0.08712067
 -0.165166   -0.07724353 -0.03380619  1.40167699  2.00651252  1.04725931
  1.28368327  1.97600474  0.01782354  1.59952875  0.079732    1.92307532
  1.8621986   1.8790815   1.80251247  2.04196713  0.01873817  0.01291496
 -0.15365607 -0.08046738  1.18506728 -0.00461982 -0.02934265  1.68665136
  1.29088786 -0.07995434 -0.09076782 -0.16795331  1.75520461  1.37514144
  1.3174234  -0.07193336 -0.1131512 ]


In [11]:
from sklearn.linear_model import LogisticRegression

# Create a logistic regression model
model = LogisticRegression(max_iter=200)

# Train the model
model.fit(X_train, y_train)

# Predict
predictions = model.predict(X_test)
print("Predictions:", predictions)


Predictions: [1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0 0 0 1 0 0 2 1
 0 0 0 2 1 1 0 0]
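
Besides hard class labels, a logistic regression model can report how confident it is. The sketch below repeats the split used above and calls `predict_proba` on the first five test samples; the names `clf`, `proba`, and `labels` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

clf = LogisticRegression(max_iter=200).fit(X_train, y_train)

# predict_proba returns one row per sample and one column per class; each row sums to 1
proba = clf.predict_proba(X_test[:5])
labels = clf.predict(X_test[:5])

print("Class probabilities:\n", np.round(proba, 3))
print("Predicted labels:", labels)
```

The predicted label for each sample is the class whose column holds the largest probability.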


In [12]:
from sklearn.tree import DecisionTreeClassifier

# Create a decision tree model
model = DecisionTreeClassifier()

# Train the model
model.fit(X_train, y_train)

# Predict
predictions = model.predict(X_test)
print("Predictions:", predictions)


Predictions: [1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0 0 0 1 0 0 2 1
 0 0 0 2 1 1 0 0]
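
One advantage of a decision tree is that the learned rules can be read directly. The sketch below is an illustration under a few assumptions that are not in the notebook: `max_depth=3` keeps the printout short, and `random_state=42` makes the tree reproducible.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

# max_depth and random_state are illustrative choices, not settings from the notebook
tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

# export_text renders the learned splits as indented if/else rules
rules = export_text(tree, feature_names=iris.feature_names)
print(rules)

# feature_importances_ gives the relative contribution of each feature (the values sum to 1)
for name, importance in zip(iris.feature_names, tree.feature_importances_):
    print(f"{name}: {importance:.3f}")
```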


---

### 7. Model Evaluation

In [13]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

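# `predictions` holds the output of the most recently trained model (the decision tree above)

# Accuracy
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)

# Precision (macro-averaged over the three classes)
precision = precision_score(y_test, predictions, average='macro')
print("Precision:", precision)

# Recall (macro-averaged)
recall = recall_score(y_test, predictions, average='macro')
print("Recall:", recall)

# F1 score (macro-averaged)
f1 = f1_score(y_test, predictions, average='macro')
print("F1 score:", f1)


Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1 score: 1.0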
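
Single-number metrics hide which classes get confused with each other. As a sketch, the confusion matrix and the per-class report from `sklearn.metrics` give a more detailed view; the example retrains a logistic regression model on the same split so that it runs on its own, and the names `clf`, `cm`, and `report` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

clf = LogisticRegression(max_iter=200).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are true classes, columns are predicted classes; the diagonal counts correct predictions
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Per-class precision, recall, F1 score, and support in one text table
report = classification_report(y_test, y_pred, target_names=iris.target_names)
print(report)
```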


---

### 8. Complete Example

In [19]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Preprocess the data
# Note: the scaler is fitted on all rows before the split, so test-set statistics
# influence the scaling; acceptable for a demo, but usually avoided in practice
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)

# Create and train the logistic regression model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Predict
predictions = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, predictions)
precision = precision_score(y_test, predictions, average='macro')
recall = recall_score(y_test, predictions, average='macro')
f1 = f1_score(y_test, predictions, average='macro')

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 score:", f1)


Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1 score: 1.0
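
A single train/test split on 150 samples can give an optimistic score. One possible refinement, sketched below, wraps the scaler and the classifier in a `Pipeline` so that scaling is fitted only on the training portion of each fold, and then estimates accuracy with 5-fold cross-validation. The use of `make_pipeline`, `cross_val_score`, and `cv=5` is a choice made for this illustration, not part of the original example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X, y = iris.data, iris.target

# The pipeline refits the scaler inside every fold, using that fold's training rows only
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))

# cv=5 is an illustrative choice; for classifiers, scores are accuracies by default
scores = cross_val_score(pipe, X, y, cv=5)
print("Accuracy per fold:", scores)
print("Mean accuracy:", scores.mean())
```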
