# 葡萄酒质量

`Dataset Characteristics  数据集特征:`Multivariate  多元变量

`Subject Area  主题领域:`Business  商业

`Associated Tasks  相关任务:`Classification, Regression 分类，回归

`Feature Type  功能类型:`Real  实数

`# Instances  样本数量:`4898

`# Features  特征数量:`11

`Has Missing Values?  是否有缺失值？:`No  没有

## 变量表
| Variable Name         | Role    | Type        | Description | Units | Missing Values |
|----------------------|--------|------------|-------------|-------|---------------|
| fixed_acidity       | Feature | Continuous |             |       | no            |
| volatile_acidity    | Feature | Continuous |             |       | no            |
| citric_acid        | Feature | Continuous |             |       | no            |
| residual_sugar     | Feature | Continuous |             |       | no            |
| chlorides         | Feature | Continuous |             |       | no            |
| free_sulfur_dioxide | Feature | Continuous |             |       | no            |
| total_sulfur_dioxide | Feature | Continuous |             |       | no            |
| density          | Feature | Continuous |             |       | no            |
| pH               | Feature | Continuous |             |       | no            |
| sulphates        | Feature | Continuous |             |       | no            |


| 变量名称            | 角色  | 类型  | 描述  | 单位 | 缺失值 |
|-------------------|------|------|------|------|------|
| 固定酸度         | 特征  | 持续  |      |      | 没有 |
| 挥发酸          | 特征  | 持续  |      |      | 没有 |
| 柠檬酸          | 特征  | 持续  |      |      | 没有 |
| 残留糖          | 特征  | 持续  |      |      | 没有 |
| 氯化物          | 特征  | 持续  |      |      | 没有 |
| 游离二氧化硫     | 特征  | 持续  |      |      | 没有 |
| 总二氧化硫       | 特征  | 持续  |      |      | 没有 |
| 密度            | 特征  | 持续  |      |      | 没有 |
| pH值           | 特征  | 持续  |      |      | 没有 |
| 硫酸盐          | 特征  | 持续  |      |      | 没有 |


### 📌 导入必要的库
本节导入了 `pandas`、`numpy`、`pygwalker`、`torch` 和 `sklearn` 相关库，分别用于：
- **数据处理**（pandas、numpy）
- **数据可视化**（pygwalker）
- **深度学习**（torch）
- **模型评估**（sklearn）

此外，还检测了当前是否可用 `CUDA` 进行 GPU 计算。


In [1]:
import pandas as pd
import numpy as np
import pygwalker as pyg
import torch
from torch import nn
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

device = torch.device('cuda' if torch.cuda.is_available()
                      else 'mps' if torch.mps.is_available()
                      else 'cpu')
print(torch.__version__)
print(device)



2.6.0
mps


### 📌 读取数据集
- 这里分别加载了**白葡萄酒**和**红葡萄酒**的质量数据集。
- 由于数据以分号 (`;`) 分隔，因此 `sep=';'` 需要显式指定。
- 读取后，`quality` 列作为目标变量（标签 `y`），其余列作为特征（`X`）。


In [2]:
# Read the CSV files with semicolon separator
white_wine = pd.read_csv("data/winequality-white.csv", sep=';')
red_wine = pd.read_csv("data/winequality-red.csv", sep=';')

print("White wine columns:", white_wine.columns)
print("Red wine columns:", red_wine.columns)

# Assuming the 'quality' column exists, proceed with the original code
X_white = white_wine.drop("quality", axis=1)
y_white = white_wine["quality"]
X_red = red_wine.drop("quality", axis=1)
y_red = red_wine["quality"]

X_white.shape, y_white.shape, X_red.shape, y_red.shape

White wine columns: Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')
Red wine columns: Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')


((4898, 11), (4898,), (1599, 11), (1599,))

In [3]:
X_white = np.array(X_white).astype(np.float32)
y_white = np.array(y_white).astype(np.float32)
X_red = np.array(X_red).astype(np.float32)
y_red = np.array(y_red).astype(np.float32)
X_white,y_white,X_red,y_red

(array([[ 7.  ,  0.27,  0.36, ...,  3.  ,  0.45,  8.8 ],
        [ 6.3 ,  0.3 ,  0.34, ...,  3.3 ,  0.49,  9.5 ],
        [ 8.1 ,  0.28,  0.4 , ...,  3.26,  0.44, 10.1 ],
        ...,
        [ 6.5 ,  0.24,  0.19, ...,  2.99,  0.46,  9.4 ],
        [ 5.5 ,  0.29,  0.3 , ...,  3.34,  0.38, 12.8 ],
        [ 6.  ,  0.21,  0.38, ...,  3.26,  0.32, 11.8 ]], dtype=float32),
 array([6., 6., 6., ..., 6., 7., 6.], dtype=float32),
 array([[ 7.4  ,  0.7  ,  0.   , ...,  3.51 ,  0.56 ,  9.4  ],
        [ 7.8  ,  0.88 ,  0.   , ...,  3.2  ,  0.68 ,  9.8  ],
        [ 7.8  ,  0.76 ,  0.04 , ...,  3.26 ,  0.65 ,  9.8  ],
        ...,
        [ 6.3  ,  0.51 ,  0.13 , ...,  3.42 ,  0.75 , 11.   ],
        [ 5.9  ,  0.645,  0.12 , ...,  3.57 ,  0.71 , 10.2  ],
        [ 6.   ,  0.31 ,  0.47 , ...,  3.39 ,  0.66 , 11.   ]],
       dtype=float32),
 array([5., 5., 5., ..., 6., 5., 6.], dtype=float32))

### 📌 处理类别不平衡（SMOTE&RandomUnderSampler）
- **过采样（Over-sampling）**：当某些类别样本数量较少时，我们使用 `SMOTE` 生成新的合成样本，以平衡数据集。
- 这样可以防止模型偏向于高频类别，提高分类性能。


In [4]:
smote = SMOTE(random_state=42, k_neighbors=3)
X_white, y_white = smote.fit_resample(X_white, y_white)
X_red, y_red = smote.fit_resample(X_red, y_red)



- **下采样(Under-sampling):** 当某些类别样本数量过多时，我们使用RandomUnderSampler随机删除一些过多样本，以平衡数据集

In [5]:
rus = RandomUnderSampler(random_state=42)
X_white, y_white = rus.fit_resample(X_white, y_white)
X_red, y_red = rus.fit_resample(X_red, y_red)



### 📌 归一化数据
- 由于不同特征的数值范围不同，我们使用 `StandardScaler` 进行标准化，使所有特征均值为 `0`，标准差为 `1`。
- 这有助于梯度下降更稳定，提高模型训练效果。


In [6]:
scaler = StandardScaler()

# 对特征进行归一化 (只对 X 进行)
X_white = scaler.fit_transform(X_white)
X_red = scaler.fit_transform(X_red)

### 📌 将数据转换为 PyTorch 张量
- PyTorch 训练时需要 `Tensor` 格式，因此将 NumPy 数组转换为 `torch.tensor`。
- `dtype=torch.float` 确保数据类型正确，`to(device)` 允许数据在 CPU/GPU 间切换。


In [7]:
X_white = torch.tensor(X_white,dtype=torch.float).to(device)
y_white = torch.tensor(y_white,dtype=torch.long).to(device)
X_red = torch.tensor(X_red,dtype=torch.float).to(device)
y_red = torch.tensor(y_red,dtype=torch.long).to(device)
X_white,y_white,X_red,y_red

(tensor([[ 1.5137, -0.3900, -1.2357,  ..., -1.3194,  0.1758, -0.8137],
         [-1.3460, -0.5811,  0.9931,  ...,  2.2624, -0.4879,  0.6576],
         [ 2.1492,  2.7623,  0.4117,  ...,  0.1555, -0.9620, -1.8204],
         ...,
         [ 0.8850,  0.1719,  0.5117,  ...,  0.2496, -0.7179,  0.6173],
         [ 0.1482, -0.4606,  1.0124,  ...,  0.9053, -0.3727,  1.4725],
         [ 0.0398, -0.3954,  1.4424,  ...,  1.1210, -0.5669,  1.5783]],
        device='mps:0'),
 tensor([3, 3, 3,  ..., 9, 9, 9], device='mps:0'),
 tensor([[ 1.9706, -0.0062,  1.9173,  ..., -0.5073, -0.5475, -1.4871],
         [ 1.2443,  0.1155,  1.0869,  ..., -1.0883, -0.1766, -2.0065],
         [-0.5713,  2.4480, -1.3064,  ...,  1.9456, -0.7330, -0.0155],
         ...,
         [-0.0713, -0.3228,  0.5667,  ..., -0.8125,  0.3295,  1.9209],
         [ 0.6117, -1.0809,  1.3846,  ..., -1.0944,  1.4020,  0.9546],
         [-1.9425, -0.6652, -0.1133,  ...,  2.3334,  0.5111,  2.6984]],
        device='mps:0'),
 tensor([3, 3, 3,

### 📌 设定损失函数（带权重）
- 由于类别不平衡，使用 `CrossEntropyLoss` 并为罕见类别分配更高权重，以增强模型对少数类的关注。


In [8]:
unique_classes_white, counts = torch.unique(y_white, return_counts=True)
print(counts)
class_weights_white = 1.0 / counts.float()  # 逆频率权重
class_weights_white = class_weights_white / class_weights_white.sum()  # 归一化

# 转换为 tensor 并移动到 GPU/CPU
class_weights_white = class_weights_white.to(device)


unique_classes_red, counts = torch.unique(y_red, return_counts=True)
print(counts)
class_weights_red = 1.0 / counts.float()  # 逆频率权重
class_weights_red = class_weights_red / class_weights_red.sum()  # 归一化

# 转换为 tensor 并移动到 GPU/CPU
class_weights_red = class_weights_red.to(device)

tensor([2198, 2198, 2198, 2198, 2198, 2198, 2198], device='mps:0')
tensor([681, 681, 681, 681, 681, 681], device='mps:0')


通过打印count我们可以得知white的评分总数共有7种，而red的评分总数只有6种，因此在构建神经网络的时候要有不同的输出

### 📌 定义网络
- 由于两种酒的评价种数不同，我们可以构建两种神经网络（直接def一个，然后把输出个数作为输入也可以）
- 由于类别不平衡，使用 `CrossEntropyLoss` 并为罕见类别分配更高权重，以增强模型对少数类的关注。


In [9]:
class WineEvaluator_white(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(11,20)
        self.fc2 = nn.Linear(20,7)
        self.relu = nn.ReLU()

    def forward(self,x):
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return x
net_white = WineEvaluator_white().to(device)
loss_fn_white = nn.CrossEntropyLoss(weight=class_weights_white)
optimizer_white = torch.optim.Adam(params=net_white.parameters(),lr=0.1)

In [10]:
class WineEvaluator_red(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(11,20)
        self.fc2 = nn.Linear(20,6)
        self.relu = nn.ReLU()

    def forward(self,x):
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return x
net_red = WineEvaluator_red().to(device)
loss_fn_red = nn.CrossEntropyLoss(weight=class_weights_red)
optimizer_red = torch.optim.Adam(params=net_red.parameters(),lr=0.1)

### 📌 训练循环
- 注意此处我直接将两个网络放在一个循环里面进行训练了

In [11]:
epochs = 1000
for epoch in range(epochs):

    net_white.train()
    net_red.train()

    y_pred_logit_white = net_white(X_white)
    y_pred_logit_red = net_red(X_red)
    y_pred_prob_white = torch.softmax(y_pred_logit_white,dim=1)
    y_pred_prob_red = torch.softmax(y_pred_logit_red,dim=1)
    y_pred_label_white = torch.argmax(y_pred_prob_white,dim=1)
    y_pred_label_red = torch.argmax(y_pred_prob_red,dim=1)

    loss_white = loss_fn_white(y_pred_logit_white,y_white)
    loss_red = loss_fn_red(y_pred_logit_red,y_red)

    optimizer_white.zero_grad()
    optimizer_red.zero_grad()

    loss_red.backward()
    loss_white.backward()

    optimizer_red.step()
    optimizer_white.step()

    net_white.eval()
    net_red.eval()

    with torch.inference_mode():
        val_accuracy_white = accuracy_score(y_white.cpu().numpy(), y_pred_label_white.cpu().numpy())
        val_accuracy_red = accuracy_score(y_red.cpu().numpy(),y_pred_label_red.cpu().numpy())
    if epoch%100==0:
        #print(y_pred_logit_red)
        #print(y_pred_prob_red)
        #print(y_pred_label_red)
        #print(y_pred_label_red.shape)
        print(f"Epoch:{epoch}  Loss of White Wine:{loss_white.item()}   Accuracy of White Wine:{val_accuracy_white}")
        print(f"Epoch:{epoch}  Loss of Red Wine:{loss_red.item()}   Accuracy of Red Wine:{val_accuracy_red}")

Epoch:0  Loss of White Wine:1.980773687362671   Accuracy of White Wine:0.05245027947484726
Epoch:0  Loss of Red Wine:1.7455247640609741   Accuracy of Red Wine:0.0763582966226138
Epoch:100  Loss of White Wine:0.5491862893104553   Accuracy of White Wine:0.43845053945144935
Epoch:100  Loss of Red Wine:0.1352078765630722   Accuracy of Red Wine:0.47626040137053355
Epoch:200  Loss of White Wine:0.5183589458465576   Accuracy of White Wine:0.4445599896009359
Epoch:200  Loss of Red Wine:0.07343588769435883   Accuracy of Red Wine:0.4887420460107685
Epoch:300  Loss of White Wine:0.5179369449615479   Accuracy of White Wine:0.4425451709346159
Epoch:300  Loss of Red Wine:0.06084321066737175   Accuracy of Red Wine:0.49045521292217326
Epoch:400  Loss of White Wine:0.49293985962867737   Accuracy of White Wine:0.44728974392304693
Epoch:400  Loss of Red Wine:0.03377864137291908   Accuracy of Red Wine:0.49632892804698975
Epoch:500  Loss of White Wine:0.5385504364967346   Accuracy of White Wine:0.434095931

准确率尚可，达到了70.8%和80%