# **Machine Learning — Project1**
刘蔚璁 10225501443 

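## **Background**

This experiment uses the Mushroom dataset from the UCI Machine Learning Repository. Created in 1981, it is a multivariate binary-classification dataset from the biology domain. It contains 8124 samples and 22 categorical features that describe the physical characteristics of gilled mushrooms in the Agaricus and Lepiota family (cap shape, color, odor, and so on). The target variable is edibility: edible or poisonous.

The features are listed in the table below: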

| Feature                    | Description                                    |
|----------------------------|------------------------------------------------|
| cap-shape                  | Shape of the cap                               |
| cap-surface                | Surface texture of the cap                     |
| cap-color                  | Color of the cap                               |
| bruises                    | Whether the mushroom bruises                   |
| odor                       | Smell of the mushroom                          |
| gill-attachment            | Attachment type of the gills                   |
| gill-spacing               | Spacing between the gills                      |
| gill-size                  | Size of the gills                              |
| gill-color                 | Color of the gills                             |
| stalk-shape                | Shape of the stalk                             |
| stalk-root                 | Root type of the stalk                         |
| stalk-surface-above-ring   | Surface texture of stalk above the ring        |
| stalk-surface-below-ring   | Surface texture of stalk below the ring        |
| stalk-color-above-ring     | Color of the stalk above the ring              |
| stalk-color-below-ring     | Color of the stalk below the ring              |
| veil-type                  | Type of the veil                               |
| veil-color                 | Color of the veil                              |
| ring-number                | Number of rings on the stalk                   |
| ring-type                  | Type of rings on the stalk                     |
| spore-print-color          | Color of the spore print                       |
| population                 | Size of the mushroom population                |
| habitat                    | Natural habitat of the mushroom                |

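## **Objective**

Classify the mushroom dataset with logistic regression, a decision tree, and an SVM, and compare the three methods.

## **Setup**

Import the required Python packages: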

In [53]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from ucimlrepo import fetch_ucirepo 
import pandas as pd
import numpy as np
import torch
from torch import nn
from torch.utils.data import random_split

Set the random seed:

In [54]:
seed = 622

Load the dataset:

In [55]:
mushroom = fetch_ucirepo(id=73) 
  
# data (as pandas dataframes) 
X = mushroom.data.features 
y = mushroom.data.targets 

Inspect the dataset metadata:

In [56]:
mushroom.metadata

{'uci_id': 73,
 'name': 'Mushroom',
 'repository_url': 'https://archive.ics.uci.edu/dataset/73/mushroom',
 'data_url': 'https://archive.ics.uci.edu/static/public/73/data.csv',
 'abstract': 'From Audobon Society Field Guide; mushrooms described in terms of physical characteristics; classification: poisonous or edible',
 'area': 'Biology',
 'tasks': ['Classification'],
 'characteristics': ['Multivariate'],
 'num_instances': 8124,
 'num_features': 22,
 'feature_types': ['Categorical'],
 'demographics': [],
 'target_col': ['poisonous'],
 'index_col': None,
 'has_missing_values': 'yes',
 'missing_values_symbol': 'NaN',
 'year_of_dataset_creation': 1981,
 'last_updated': 'Thu Aug 10 2023',
 'dataset_doi': '10.24432/C5959T',
 'creators': [],
 'intro_paper': None,
 'additional_info': {'summary': "This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family (pp. 500-525).  Each species is identified as definitely 

In [57]:
X.head(5)

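|  | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | stalk-shape | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | x | s | n | t | p | f | c | n | k | e | ... | s | w | w | p | w | o | p | k | s | u |
| 1 | x | s | y | t | a | f | c | b | k | e | ... | s | w | w | p | w | o | p | n | n | g |
| 2 | b | s | w | t | l | f | c | b | n | e | ... | s | w | w | p | w | o | p | n | n | m |
| 3 | x | y | w | t | p | f | c | n | n | e | ... | s | w | w | p | w | o | p | k | s | u |
| 4 | x | s | g | f | n | f | w | b | k | t | ... | s | w | w | p | w | o | e | n | a | g |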


Check the distribution of the target variable:

In [58]:
y.value_counts()

poisonous
e            4208
p            3916
Name: count, dtype: int64
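
As a quick sanity check on these counts, the following sketch works out the class balance and the accuracy that a trivial majority-class predictor would reach. The `counts` dictionary only restates the `value_counts()` output; the variable names are illustrative and not part of the original notebook.

```python
# Class counts copied from the value_counts() output above.
counts = {'e': 4208, 'p': 3916}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# A classifier that always predicts the most frequent class is right
# exactly as often as that class occurs.
majority_label = max(counts, key=counts.get)
majority_baseline = shares[majority_label]

print(total)                        # 8124
print(round(majority_baseline, 4))  # 0.518
```

The two classes are close to balanced, so accuracy is a reasonable headline metric here, and any useful model should clear roughly 52% accuracy.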

## **Experiments**

### **Data preprocessing**

Handling missing values:

In [59]:
missing_values = X.isnull().sum()
print("Missing values per column:")
print(missing_values)

total_missing = missing_values.sum()
print(f"Total missing values: {total_missing}")

Missing values per column:
cap-shape                      0
cap-surface                    0
cap-color                      0
bruises                        0
odor                           0
gill-attachment                0
gill-spacing                   0
gill-size                      0
gill-color                     0
stalk-shape                    0
stalk-root                  2480
stalk-surface-above-ring       0
stalk-surface-below-ring       0
stalk-color-above-ring         0
stalk-color-below-ring         0
veil-type                      0
veil-color                     0
ring-number                    0
ring-type                      0
spore-print-color              0
population                     0
habitat                        0
dtype: int64
Total missing values: 2480


In [60]:
print(X['stalk-root'].unique())
print(X['stalk-root'].value_counts())

['e' 'c' 'b' 'r' nan]
stalk-root
b    3776
e    1120
c     556
r     192
Name: count, dtype: int64


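The counts show that every missing value comes from stalk-root, the feature that describes the root type of the stalk. Missing entries make up a large share of that column (2480 of 8124 rows, about 30.5%), so the missingness is treated as informative in its own right: the affected rows are neither dropped nor imputed.

All features in the dataset are non-numeric, so one-hot encoding is used to convert the categorical variables into a numeric form the models can consume. Each category of each feature becomes one binary column of the feature matrix.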

In [61]:
encoder = OneHotEncoder(sparse_output=False)
X_encoded = encoder.fit_transform(X)
X_encoded[0,:]

array([0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0.,
       0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 1.,
       0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
       0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0.,
       0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0.,
       1., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0.])
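
To make the encoding step and the decision to keep missing values more concrete, here is a minimal pure-Python sketch of one-hot encoding in which a missing entry gets a column of its own. The helper `one_hot` and the toy `stalk_root` list are hypothetical illustrations, not scikit-learn's implementation; the assumption is that `OneHotEncoder` treats `NaN` in a comparable way, as an extra category.

```python
def one_hot(column):
    """One-hot encode a list of labels; None (missing) becomes its own category."""
    # Known categories in sorted order, with the missing marker placed last.
    categories = sorted({v for v in column if v is not None})
    if any(v is None for v in column):
        categories.append(None)
    index = {cat: i for i, cat in enumerate(categories)}

    rows = []
    for v in column:
        row = [0.0] * len(categories)
        row[index[v]] = 1.0  # exactly one active column per sample
        rows.append(row)
    return categories, rows


# Toy column using the stalk-root codes seen above plus missing entries.
stalk_root = ['e', 'c', 'b', None, 'r', None]
categories, encoded = one_hot(stalk_root)

print(categories)  # ['b', 'c', 'e', 'r', None]
print(encoded[3])  # [0.0, 0.0, 0.0, 0.0, 1.0]
```

Encoding missingness as a category lets every model decide for itself whether "stalk-root unknown" carries signal, which is the intent stated in the preprocessing notes.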

Split into training and test sets:

In [62]:
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.3, random_state=seed)
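
For reference, the split sizes implied by `test_size=0.3` can be worked out by hand. The sketch below assumes that the test-set size is rounded up and the training set takes the remainder; under that assumption the training size agrees with the `(5686,)` label shape printed later in this report.

```python
import math

n_samples = 8124
test_size = 0.3

# Assumption: the test-set size is rounded up, and the training set
# receives all remaining samples.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 5686 2438
```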

### **First, check test-set accuracy with the scikit-learn implementations**

In [77]:
# Train and evaluate the logistic regression model
# (.values.ravel() turns the one-column label DataFrame into the 1-D array sklearn expects)
log_reg = LogisticRegression(max_iter=200)
log_reg.fit(X_train, y_train.values.ravel())
y_pred_lr = log_reg.predict(X_test)
accuracy_lr = accuracy_score(y_test, y_pred_lr)
precision_lr = precision_score(y_test, y_pred_lr, pos_label='e')
recall_lr = recall_score(y_test, y_pred_lr, pos_label='e')
f1_lr = f1_score(y_test, y_pred_lr, pos_label='e')
print(f'Logistic Regression\n\tAccuracy: {accuracy_lr:.4f}')
print(f'\tPrecision: {precision_lr:.4f}')
print(f'\tRecall: {recall_lr:.4f}')
print(f'\tF1: {f1_lr:.4f}')

# Train and evaluate the decision tree model
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
accuracy_dt = accuracy_score(y_test, y_pred_dt)
precision_dt = precision_score(y_test, y_pred_dt, pos_label='e')
recall_dt = recall_score(y_test, y_pred_dt, pos_label='e')
f1_dt = f1_score(y_test, y_pred_dt, pos_label='e')
print(f'Decision Tree\n\tAccuracy: {accuracy_dt:.4f}')
print(f'\tPrecision: {precision_dt:.4f}')
print(f'\tRecall: {recall_dt:.4f}')
print(f'\tF1: {f1_dt:.4f}')

# Train and evaluate the SVM model
svm = SVC(kernel='linear', gamma='scale', C=1.0, random_state=42)
svm.fit(X_train, y_train.values.ravel())
y_pred_svm = svm.predict(X_test)
accuracy_svm = accuracy_score(y_test, y_pred_svm)
precision_svm = precision_score(y_test, y_pred_svm, pos_label='e')
recall_svm = recall_score(y_test, y_pred_svm, pos_label='e')
f1_svm = f1_score(y_test, y_pred_svm, pos_label='e')
print(f'SVM\n\tAccuracy: {accuracy_svm:.4f}')
print(f'\tPrecision: {precision_svm:.4f}')
print(f'\tRecall: {recall_svm:.4f}')
print(f'\tF1: {f1_svm:.4f}')

Logistic Regression
	Accuracy: 0.9996
	Precision: 0.9992
	Recall: 1.0000
	F1: 0.9996
Decision Tree
	Accuracy: 1.0000
	Precision: 1.0000
	Recall: 1.0000
	F1: 1.0000
SVM
	Accuracy: 1.0000
	Precision: 1.0000
	Recall: 1.0000
	F1: 1.0000
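
The four metrics reported above all derive from confusion counts relative to the positive label (`'e'`, edible). The sketch below recomputes them by hand on a small made-up label list; `binary_metrics` is a hypothetical helper written for illustration, not the scikit-learn code.

```python
def binary_metrics(y_true, y_pred, pos_label):
    """Return (accuracy, precision, recall, f1) for one positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)

    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    # Guard against empty denominators on degenerate inputs.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1


# Made-up labels: one poisonous sample is wrongly predicted as edible.
y_true = ['e', 'e', 'p', 'p', 'e', 'p']
y_pred = ['e', 'e', 'e', 'p', 'e', 'p']

accuracy, precision, recall, f1 = binary_metrics(y_true, y_pred, pos_label='e')
print(f'{accuracy:.4f} {precision:.4f} {recall:.4f} {f1:.4f}')
```

With `'e'` as the positive class, a recall of 1 combined with a precision below 1 (the pattern in the logistic regression row) means that no edible mushroom was rejected but at least one poisonous sample was predicted edible, which is the more costly kind of error for this task.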


### **Hand-written logistic regression**

In [78]:
y_train_logistic = y_train.copy()
y_test_logistic = y_test.copy()
y_train_logistic['poisonous'] = y_train_logistic['poisonous'].map({'e': 1.0, 'p': 0.0})
y_test_logistic['poisonous'] = y_test_logistic['poisonous'].map({'e': 1.0, 'p': 0.0})

In [None]:
# Convert to PyTorch tensors
X_train_t = torch.tensor(X_train, dtype=torch.float32)
y_train_t = torch.tensor(y_train_logistic.values, dtype=torch.float32).view(-1, 1)
X_test_t = torch.tensor(X_test, dtype=torch.float32)
y_test_t = torch.tensor(y_test_logistic.values, dtype=torch.float32).view(-1, 1)

# Logistic regression = Linear + Sigmoid
model = nn.Sequential(
    nn.Linear(X_train_t.shape[1], 1),
    nn.Sigmoid()
)

# Define the loss function and the optimizer
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# Train the model (full batch: each epoch is a single gradient step)
epochs = 300
for epoch in range(epochs):
    model.train()
    optimizer.zero_grad()
    y_pred = model(X_train_t)
    loss = criterion(y_pred, y_train_t)
    loss.backward()
    optimizer.step()

    # Lightweight monitoring of the training process
    if (epoch+1) % 2 == 0:
        with torch.no_grad():
            # Training-set accuracy at a 0.5 decision threshold
            train_pred = (y_pred > 0.5).float()
            train_acc = (train_pred == y_train_t).float().mean().item()
        print(f'Epoch [{epoch+1}/{epochs}], Loss: {loss.item():.4f}, Train Acc: {train_acc:.4f}')

# Evaluate the model on the test set
model.eval()
with torch.no_grad():
    y_test_pred = model(X_test_t)
    test_pred = (y_test_pred > 0.5).float()
    # test_acc = (test_pred == y_test_t).float().mean().item()
    test_acc = accuracy_score(y_test_t, test_pred)
    test_prec = precision_score(y_test_t, test_pred)
    test_rec = recall_score(y_test_t, test_pred)
    test_f1 = f1_score(y_test_t, test_pred)

print(f'Test Accuracy: {test_acc:.4f}')
print(f'Test Precision: {test_prec:.4f}')
print(f'Test Recall: {test_rec:.4f}')
print(f'Test F1: {test_f1:.4f}')

Epoch [2/300], Loss: 0.6739, Train Acc: 0.6082
Epoch [4/300], Loss: 0.6025, Train Acc: 0.8120
Epoch [6/300], Loss: 0.5394, Train Acc: 0.8903
Epoch [8/300], Loss: 0.4846, Train Acc: 0.9019
Epoch [10/300], Loss: 0.4376, Train Acc: 0.9013
Epoch [12/300], Loss: 0.3976, Train Acc: 0.9012
Epoch [14/300], Loss: 0.3636, Train Acc: 0.9024
Epoch [16/300], Loss: 0.3349, Train Acc: 0.9031
Epoch [18/300], Loss: 0.3106, Train Acc: 0.9045
Epoch [20/300], Loss: 0.2899, Train Acc: 0.9068
Epoch [22/300], Loss: 0.2720, Train Acc: 0.9098
Epoch [24/300], Loss: 0.2565, Train Acc: 0.9129
Epoch [26/300], Loss: 0.2428, Train Acc: 0.9158
Epoch [28/300], Loss: 0.2307, Train Acc: 0.9195
Epoch [30/300], Loss: 0.2197, Train Acc: 0.9228
Epoch [32/300], Loss: 0.2098, Train Acc: 0.9282
Epoch [34/300], Loss: 0.2008, Train Acc: 0.9353
Epoch [36/300], Loss: 0.1924, Train Acc: 0.9390
Epoch [38/300], Loss: 0.1848, Train Acc: 0.9418
Epoch [40/300], Loss: 0.1777, Train Acc: 0.9444
Epoch [42/300], Loss: 0.1712, Train Acc: 0.9
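
The `Linear + Sigmoid` model with binary cross-entropy can also be written out without a framework. The sketch below is a simplified illustration on a four-sample toy dataset: it swaps in plain gradient descent for the Adam optimizer used in the notebook, and all names (`sigmoid`, `bce_loss`, `gradient_step`) are hypothetical helpers rather than PyTorch APIs.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x):
    return sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x)) + b)


def bce_loss(w, b, X, y):
    """Mean binary cross-entropy over the dataset."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        p = predict_proba(w, b, x_i)
        total -= y_i * math.log(p) + (1.0 - y_i) * math.log(1.0 - p)
    return total / len(y)


def gradient_step(w, b, X, y, lr):
    """One full-batch step; the BCE gradient w.r.t. the logit is (p - y)."""
    n = len(y)
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for x_i, y_i in zip(X, y):
        err = predict_proba(w, b, x_i) - y_i
        for j, x_j in enumerate(x_i):
            grad_w[j] += err * x_j / n
        grad_b += err / n
    return [w_j - lr * g_j for w_j, g_j in zip(w, grad_w)], b - lr * grad_b


# Toy one-hot data: the first column marks class 1, the second class 0.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
y = [1.0, 0.0, 1.0, 0.0]

w, b = [0.0, 0.0], 0.0
initial_loss = bce_loss(w, b, X, y)  # ln(2): every prediction starts at 0.5
for _ in range(100):
    w, b = gradient_step(w, b, X, y, lr=0.5)
final_loss = bce_loss(w, b, X, y)

print(f'{initial_loss:.4f} -> {final_loss:.4f}')
```

The weight on the first column grows positive and the weight on the second grows negative, mirroring how the full model learns one weight per one-hot category.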

In [85]:
y_train_tree = (y_train_logistic.values).ravel()
y_test_tree = (y_test_logistic.values).ravel()

In [86]:
print(y_train_tree.shape)

(5686,)


### **Hand-written decision tree**

In [87]:
def entropy(labels):
    from math import log2
    counts = {}
    for l in labels:
        counts[l] = counts.get(l, 0) + 1
    total = len(labels)
    ent = 0.0
    for c in counts.values():
        p = c/total
        ent -= p * log2(p)
    return ent

def information_gain(X, y, feature_index):
    base_entropy = entropy(y)
    values = X[:, feature_index]
    unique_vals = np.unique(values)
    new_entropy = 0.0
    total = len(y)
    for val in unique_vals:
        subset_y = y[values == val]
        prob = len(subset_y) / total
        new_entropy += prob * entropy(subset_y)
    return base_entropy - new_entropy

def majority_class(labels):
    from collections import Counter
    counter = Counter(labels)
    return counter.most_common(1)[0][0]

# Build the decision tree recursively
def build_tree(X, y, feature_indices):
    # If all labels are identical, return that label as a leaf
    if len(set(y)) == 1:
        return y[0]
    
    if len(feature_indices) == 0:
        # No features left to split on: return the majority class
        return majority_class(y)
    
    # Pick the feature with the largest information gain
    gains = [information_gain(X, y, fi) for fi in feature_indices]
    best_feature = feature_indices[np.argmax(gains)]
    
    # If the best gain is 0, no split helps: return the majority class
    if max(gains) == 0:
        return majority_class(y)
    
    tree = {best_feature: {}}
    values = X[:, best_feature]
    unique_vals = np.unique(values)
    # Partition the samples by feature value and recurse on each subset
    new_feature_indices = [fi for fi in feature_indices if fi != best_feature]
    for val in unique_vals:
        subset_mask = (values == val)
        subtree = build_tree(X[subset_mask], y[subset_mask], new_feature_indices)
        tree[best_feature][val] = subtree
    return tree

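def tree_predict(tree, x):
    if not isinstance(tree, dict):
        # Leaf node: the stored label is the prediction
        return tree
    # Tree structure: {feature_index: {value: subtree, ...}}
    feature_index = list(tree.keys())[0]
    feature_val = x[feature_index]
    branches = tree[feature_index]
    if feature_val in branches:
        return tree_predict(branches[feature_val], x)
    # No branch for this value (unseen during training). Heuristic fallback:
    # take the majority label among this node's leaf children. Subtrees are
    # dicts, which are unhashable and must be kept out of majority_class.
    leaf_labels = [child for child in branches.values() if not isinstance(child, dict)]
    if leaf_labels:
        return majority_class(leaf_labels)
    # Every child is a subtree: descend into the first one as a last resort
    return tree_predict(next(iter(branches.values())), x)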

# Build the decision tree
feature_indices = list(range(X_train.shape[1]))
decision_tree = build_tree(X_train, y_train_tree, feature_indices)

# Predict with the decision tree and compute the metrics
dt_preds = [tree_predict(decision_tree, x) for x in X_test]

test_acc = accuracy_score(y_test_tree, dt_preds)
test_prec = precision_score(y_test_tree, dt_preds)
test_rec = recall_score(y_test_tree, dt_preds)
test_f1 = f1_score(y_test_tree, dt_preds)

print(f'Test Accuracy: {test_acc:.4f}')
print(f'Test Precision: {test_prec:.4f}')
print(f'Test Recall: {test_rec:.4f}')
print(f'Test F1: {test_f1:.4f}')
#dt_accuracy = np.mean(dt_preds == y_test_tree)
#print(f"Decision Tree Accuracy: {dt_accuracy:.4f}")

Test Accuracy: 1.0000
Test Precision: 1.0000
Test Recall: 1.0000
Test F1: 1.0000
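
The split criterion used by `build_tree` is easiest to see on a tiny example. The sketch below restates entropy and information gain for plain Python lists and applies them to two made-up binary columns; the column names `odor_none` and `cap_flat` are hypothetical stand-ins for one-hot features, chosen only for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(values, labels):
    """Entropy reduction obtained by splitting `labels` on `values`."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [l for x, l in zip(values, labels) if x == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


labels = [1, 1, 0, 0]     # 1 = edible, 0 = poisonous
odor_none = [1, 1, 0, 0]  # separates the classes perfectly
cap_flat = [1, 0, 1, 0]   # independent of the class

gain_odor = information_gain(odor_none, labels)
gain_cap = information_gain(cap_flat, labels)
print(gain_odor, gain_cap)
```

A feature that separates the classes perfectly recovers the full 1 bit of label entropy, while an uninformative one gains nothing, so the tree would split on `odor_none` first. When a single split already yields pure subsets, the recursion stops immediately, which helps explain how a shallow tree can reach perfect scores on data like this.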


### **Hand-written SVM**

In [92]:
y_train_SVM = y_train_tree.copy()
y_test_SVM = y_test_tree.copy()

In [93]:
# Check the shapes of the inputs (X_train has not been standardized yet)
print(f"X_train shape: {X_train.shape}")
print(f"y_train_SVM shape: {y_train_SVM.shape}")

X_train shape: (5686, 117)
y_train_SVM shape: (5686,)


In [94]:
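# Linear SVM trained with per-sample subgradient descent
# C: regularization parameter (weight of the hinge-loss term)
# decay_rate: learning-rate decay rate
def linear_svm_train(X, y, C=1.0, lr=0.001, epochs=10, decay_rate=0.01):
    N, d = X.shape
    w = np.zeros(d)
    b = 0.0
    
    for epoch in range(epochs):
        for i in range(N):
            # Check the margin condition y_i * (w·x_i + b) >= 1
            if y[i] * (np.dot(w, X[i]) + b) < 1:
                # Margin violated: update both w and b
                w = w - lr * (w - C * y[i] * X[i])
                b = b + lr * C * y[i]
            else:
                # Margin satisfied: only shrink w (regularization term)
                w = w - lr * w

        # Decay the learning rate; the division is applied to the already
        # decayed value, so the decay compounds across epochs
        lr = lr / (1 + epoch * decay_rate)

        # Optional: print the objective value after each epoch
        hinge_loss = np.maximum(0, 1 - y * (np.dot(X, w) + b)).sum()
        total_loss = 0.5 * np.dot(w, w) + C * hinge_loss
        print(f"Epoch {epoch + 1}/{epochs}, Loss: {total_loss:.4f}")

    return w, b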

def linear_svm_predict(X, w, b):
    return np.sign(np.dot(X, w) + b)

# Standardize the features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Make sure the labels are in {-1, 1}
y_train_SVM = np.where(y_train_SVM == 0, -1, y_train_SVM)
y_test_SVM = np.where(y_test_SVM == 0, -1, y_test_SVM)

# print(f"X_train_scaled shape: {X_train_scaled.shape}")
# print(f"y_train_SVM shape: {y_train_SVM.shape}")

# Train the linear SVM
w_svm, b_svm = linear_svm_train(X_train_scaled, y_train_SVM, C=1.0, lr=0.001, epochs=20)

# Predict and compute the metrics
svm_preds = linear_svm_predict(X_test_scaled, w_svm, b_svm)

test_acc = accuracy_score(y_test_SVM, svm_preds)
test_prec = precision_score(y_test_SVM, svm_preds)
test_rec = recall_score(y_test_SVM, svm_preds)
test_f1 = f1_score(y_test_SVM, svm_preds)

print(f'Test Accuracy: {test_acc:.4f}')
print(f'Test Precision: {test_prec:.4f}')
print(f'Test Recall: {test_rec:.4f}')
print(f'Test F1: {test_f1:.4f}')

#svm_accuracy = np.mean(svm_preds == y_test_SVM)
#print(f"SVM Accuracy: {svm_accuracy:.4f}")

Epoch 1/20, Loss: 286.7653
Epoch 2/20, Loss: 284.4900
Epoch 3/20, Loss: 290.6429
Epoch 4/20, Loss: 294.0821
Epoch 5/20, Loss: 278.3218
Epoch 6/20, Loss: 280.7040
Epoch 7/20, Loss: 277.2778
Epoch 8/20, Loss: 289.7039
Epoch 9/20, Loss: 278.0973
Epoch 10/20, Loss: 288.2401
Epoch 11/20, Loss: 277.3806
Epoch 12/20, Loss: 280.3890
Epoch 13/20, Loss: 278.7140
Epoch 14/20, Loss: 278.8103
Epoch 15/20, Loss: 276.1944
Epoch 16/20, Loss: 275.0444
Epoch 17/20, Loss: 273.3098
Epoch 18/20, Loss: 276.3710
Epoch 19/20, Loss: 281.3475
Epoch 20/20, Loss: 278.3885
Test Accuracy: 0.9963
Test Precision: 0.9928
Test Recall: 1.0000
Test F1: 0.9964
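
The update rule inside `linear_svm_train` can be traced on a toy problem. The sketch below is a dependency-free restatement of the same per-sample step, using the objective 0.5·||w||² + C·Σ max(0, 1 − yᵢ(w·xᵢ + b)) printed in the training log. The four 2-D points, the constant learning rate, and the helper names are assumptions made for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * (sum of hinge losses)."""
    hinge = sum(max(0.0, 1.0 - y_i * (dot(w, x_i) + b)) for x_i, y_i in zip(X, y))
    return 0.5 * dot(w, w) + C * hinge


def sgd_step(w, b, x_i, y_i, C, lr):
    """One per-sample subgradient step with the same branches as the notebook."""
    if y_i * (dot(w, x_i) + b) < 1:
        # Margin violated: regularization shrink plus a push toward y_i * x_i.
        w = [w_j - lr * (w_j - C * y_i * x_j) for w_j, x_j in zip(w, x_i)]
        b = b + lr * C * y_i
    else:
        # Margin satisfied: only the regularization term acts on w.
        w = [w_j - lr * w_j for w_j in w]
    return w, b


# Linearly separable toy data with labels in {-1, 1}.
X = [[2.0, 0.0], [1.0, 1.0], [-2.0, 0.0], [-1.0, -1.0]]
y = [1, 1, -1, -1]

w, b = [0.0, 0.0], 0.0
initial = objective(w, b, X, y, C=1.0)  # 4.0: every hinge term equals 1
for _ in range(200):
    for x_i, y_i in zip(X, y):
        w, b = sgd_step(w, b, x_i, y_i, C=1.0, lr=0.01)
final = objective(w, b, X, y, C=1.0)
predictions = [1 if dot(w, x_i) + b >= 0 else -1 for x_i in X]

print(initial, round(final, 4), predictions)
```

Because each step follows a single sample's subgradient rather than the full objective, the objective is not guaranteed to fall monotonically, which is consistent with the fluctuating loss values in the training log above.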
