# 机器学习第四次实验

## 前置准备

本次实验需要用到高性能GPU故选择使用**Google Colab**运行代码内容，需要进行一些简单配置。

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## 任务一：鸢尾花分类

>note：我们选择使用决策树（随机森林）算法与支持向量机（SVM）实现，多种机器学习方法进行对比实验。

In [None]:
csv_path = '/content/drive/My Drive/Colab Notebooks/data/iris.csv'
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
iris_df = pd.read_csv(csv_path)
print(iris_df.head())

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
0                5.1               3.5                1.4               0.2       0
1                4.9               3.0                1.4               0.2       0
2                4.7               3.2                1.3               0.2       0
3                4.6               3.1                1.5               0.2       0
4                5.0               3.6                1.4               0.2       0


In [None]:
X, y = iris_df.iloc[:, :-1].values, iris_df.iloc[:, -1].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)
print(accuracy_score(y_test, y_pred))

1.0


In [None]:
# 下面使用随机森林算法对鸢尾花数据集进行分类
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier()
forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)
print(accuracy_score(y_test, y_pred))

1.0


In [None]:
# 使用网格搜索调整随机森林的参数进行模型优化
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [10, 20, 30, 40, 50],
    'max_depth': [3, 5, 7, 9]
}

grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
print(grid_search.best_score_)
y_pred = grid_search.predict(X_test)
print(accuracy_score(y_test, y_pred))

{'max_depth': 3, 'n_estimators': 50}
0.9583333333333334
1.0


In [None]:
# 使用SVM算法对鸢尾花数据集进行分类
from sklearn.svm import SVC

svm = SVC()
svm.fit(X_train, y_train)
y_pred = svm.predict(X_test)
print(accuracy_score(y_test, y_pred))

1.0


### 任务一代码运行结论

1. 首先调用**决策树分类器**对鸢尾花数据集进行分类，分类准确率为 $ 100\% $
2. 然后使用**随机森林**算法再次进行分类，分类准确率仍为 $ 100\% $
3. 接着使用**网格搜索**调整随机森林的参数，得出最优参数为：最大深度 $3$，决策树个数 $50$；最优得分为 $0.96$。在此参数下，分类准确率仍为 $ 100\% $
4. 最后使用**支持向量机**算法进行对照实验，分类准确率仍为 $ 100\% $，说明原数据集数据质量较高。

## 任务二：花卉图像识别

>note：我们选择使用**深度学习框架Tensorflow**设计一个**卷积神经网络CNN**来实现。

### 第一步：获取图像数据并进行数据预处理

数据预处理包括以下：

1. **调整图像大小**：确保所有图像都有统一的尺寸。
2. **图像像素归一化**：将图像的像素值缩放到 $0$ 到 $1$ 之间，帮助模型更快地收敛。
3. **数据增强**：通过随机变换如旋转、缩放等增加模型的泛化能力。

In [None]:
import os
import shutil
from PIL import Image
from concurrent.futures import ThreadPoolExecutor

def verify_image(path, filename, bad_folder):
    """检查图像文件是否损坏并处理损坏的文件。"""
    if not filename.endswith(('.jpg')):  # 检查文件扩展名
        return
    try:
        with Image.open(os.path.join(path, filename)) as img:
            img.verify()  # 验证文件是否损坏
    except (IOError, SyntaxError) as e:
        print('Bad file:', filename)  # 输出损坏的文件名
        # 将损坏的文件移动到指定目录
        shutil.move(os.path.join(path, filename), os.path.join(bad_folder, filename))

def process_directory(directory, bad_folder):
    """处理单个目录中的所有图像文件。"""
    with ThreadPoolExecutor(max_workers=5) as executor:  # 调整线程数以适应你的系统
        for filename in os.listdir(directory):
            executor.submit(verify_image, directory, filename, bad_folder)

if __name__ == "__main__":
    directories = ['/content/drive/My Drive/Colab Notebooks/flower/flowers/daisy',
                   '/content/drive/My Drive/Colab Notebooks/flower/flowers/dandelion',
                   '/content/drive/My Drive/Colab Notebooks/flower/flowers/rose',
                   '/content/drive/My Drive/Colab Notebooks/flower/flowers/sunflower',
                   '/content/drive/My Drive/Colab Notebooks/flower/flowers/tulip']  # 处理多个目录
    bad_folder = '/content/drive/My Drive/Colab Notebooks/flower/bad images'  # 损坏文件存放的目录

    for directory in directories:
        process_directory(directory, bad_folder)

>note：必须要有上面这一步，因为提供的数据集中会发现有不符合要求的文件以及损坏的文件，可能导致模型训练失败。

In [None]:
from keras.preprocessing.image import ImageDataGenerator
import warnings
warnings.filterwarnings('ignore')

datagen = ImageDataGenerator(
    rescale=1.0 / 255, # 归一化
    rotation_range=40, # 随机旋转角度范围
    width_shift_range=0.2, # 水平偏移的范围（作为总宽度的比例）
    height_shift_range=0.2, # 垂直偏移的范围（作为总高度的比例）
    shear_range=0.2, # 随机错切变换的角度
    zoom_range=0.2, # 随机缩放范围
    horizontal_flip=True, # 随机水平翻转
    fill_mode='nearest', # 填充像素的方法
    validation_split=0.2  # 设置验证集分割比例
)

validation_datagen = ImageDataGenerator(
    rescale=1.0 / 255
)

train_generator = datagen.flow_from_directory(
    '/content/drive/My Drive/Colab Notebooks/flower/flowers',
    target_size=(224, 224),
    batch_size=64,
    class_mode='sparse',
    shuffle=True,
    subset='training'
)

validation_generator = datagen.flow_from_directory(
    '/content/drive/My Drive/Colab Notebooks/flower/flowers',
    target_size=(224, 224),
    batch_size=64,
    class_mode='sparse',
    shuffle=False,
    subset='validation'
)

Found 3461 images belonging to 5 classes.
Found 861 images belonging to 5 classes.


### 第二步：构建CNN模型

>note：该模型通过**多层卷积**和**池化**学习图像的高级特征，然后通过**全连接层**进行分类。

In [None]:
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(224, 224, 3)),
    tf.keras.layers.MaxPooling2D((2, 2)), # 池化层，将卷积层输出的特征图进行降维
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(128, (3, 3), activation='relu'),
    tf.keras.layers.Flatten(), # 扁平化层，将卷积层输出的多维特征图转换为一维，使其可以被后续的全连接层处理
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dense(5, activation='softmax') # （最终）输出层，输出5个类别的概率分布
])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d (Conv2D)             (None, 222, 222, 32)      896       
                                                                 
 max_pooling2d (MaxPooling2  (None, 111, 111, 32)      0         
 D)                                                              
                                                                 
 conv2d_1 (Conv2D)           (None, 109, 109, 64)      18496     
                                                                 
 max_pooling2d_1 (MaxPoolin  (None, 54, 54, 64)        0         
 g2D)                                                            
                                                                 
 conv2d_2 (Conv2D)           (None, 52, 52, 128)       73856     
                                                                 
 flatten (Flatten)           (None, 346112)            0

### 第三步：训练CNN模型

In [None]:
model.fit(train_generator, epochs=10, validation_data=validation_generator)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x7dab6a4a6620>

In [14]:
model.fit(train_generator, epochs=100, validation_data=validation_generator)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

<keras.src.callbacks.History at 0x7dab681e90c0>

>note：在经过110轮训练后，训练集准确率高达 $90\%$，验证集准确率高达 $78\%$，已经初步达到训练效果，综合考虑认为模型训练成功。

### 第四步：模型评估

>note：评估指标我们选择更全面的精确率，召回率以及F1分数。

In [15]:
# 准确率，精确率，召回率以及F1分数
from sklearn.metrics import classification_report

validation_generator.reset()
y_pred = model.predict(validation_generator)
y_pred = [list(x).index(max(x)) for x in y_pred]
y_true = validation_generator.classes
target_names = ['daisy', 'dandelion', 'rose', 'sunflower', 'tulip']
print(classification_report(y_true, y_pred, target_names=target_names))

              precision    recall  f1-score   support

       daisy       0.87      0.79      0.83       153
   dandelion       0.86      0.78      0.82       210
        rose       0.66      0.75      0.70       156
   sunflower       0.80      0.88      0.84       146
       tulip       0.70      0.69      0.69       196

    accuracy                           0.77       861
   macro avg       0.78      0.78      0.78       861
weighted avg       0.78      0.77      0.77       861



通过运行代码，我们发现训练的模型非常好，各个评估指标均很出色，特别是平均f1-score高达 $0.77$