# Wind Power Forecasting: Design Phase (Part 1)

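In the previous lab, you performed an exploratory analysis of the [SDWPF dataset](https://arxiv.org/abs/2208.04360), which contains data from 134 wind turbines at a wind farm in China. The SDWPF data was provided by the Longyuan Power Group.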

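In this lab, you will start designing a solution for wind power forecasting. The steps you will complete are:

1. Import Python packages
2. Load the dataset
3. Catalog abnormal values
4. Establish a baseline for wind power estimation
5. Perform feature engineering \
    5.1 Delete redundant features - Pab \
    5.2 Transform angle features \
    5.3 Fix temperature and active power features \
    5.4 Create time features
6. Update the linear model baseline with more features
7. Use a neural network to improve wind power estimation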

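## 1. Import Python packages

Run the next cell to import the Python packages you need for this lab.

Note the `import utils` line. It imports functions written specifically for this lab. If you want to look at them, go to `File -> Open...` and open the `utils.py` file.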

In [None]:
import numpy as np 
import pandas as pd 
import utils 

print('All packages imported successfully!')

All packages imported successfully!


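## 2. Load the dataset

The original dataset contains information on 134 turbines. When you run the next cell, you will read in the data and then repeat the steps from the previous lab: select the 10 turbines with the highest average power output, and convert the day and timestamp columns into a single datetime column.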

In [2]:
# Load the data from the csv file
raw_data = pd.read_csv("./data/wtbdata_245days.csv")

display(raw_data.head())
# Select only the top 10 turbines
top_turbines = utils.top_n_turbines(raw_data, 10)
# Format the datetime
top_turbines = utils.format_datetime(top_turbines, initial_date_str="01 05 2020")

# Print the first few rows of data
top_turbines.head()

Unnamed: 0,TurbID,Day,Tmstamp,Wspd,Wdir,Etmp,Itmp,Ndir,Pab1,Pab2,Pab3,Prtv,Patv
0,1,1,00:10,6.17,-3.99,30.73,41.8,25.92,1.0,1.0,1.0,-0.25,494.66
1,1,1,00:20,6.27,-2.18,30.6,41.63,20.91,1.0,1.0,1.0,-0.24,509.76
2,1,1,00:30,6.42,-0.73,30.52,41.52,20.91,1.0,1.0,1.0,-0.26,542.53
3,1,1,00:40,6.25,0.89,30.49,41.38,20.91,1.0,1.0,1.0,-0.23,509.36
4,1,1,00:50,6.1,-1.03,30.47,41.22,20.91,1.0,1.0,1.0,-0.27,482.21


Original data has 4727519 rows from 134 turbines.

Sliced data has 352799 rows from 10 turbines.



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



Unnamed: 0,Datetime,TurbID,Wspd,Wdir,Etmp,Itmp,Ndir,Pab1,Pab2,Pab3,Prtv,Patv
0,2020-05-01 00:10:00,1,6.17,-3.99,30.73,41.8,25.92,1.0,1.0,1.0,-0.25,494.66
1,2020-05-01 00:20:00,1,6.27,-2.18,30.6,41.63,20.91,1.0,1.0,1.0,-0.24,509.76
2,2020-05-01 00:30:00,1,6.42,-0.73,30.52,41.52,20.91,1.0,1.0,1.0,-0.26,542.53
3,2020-05-01 00:40:00,1,6.25,0.89,30.49,41.38,20.91,1.0,1.0,1.0,-0.23,509.36
4,2020-05-01 00:50:00,1,6.1,-1.03,30.47,41.22,20.91,1.0,1.0,1.0,-0.27,482.21
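The two loading steps described above can be sketched with plain pandas. This is only an illustration of the idea and not the code in `utils.py`: the function names `top_n_turbines_sketch` and `format_datetime_sketch` are hypothetical, the demo frame is made up, and the date format `"%d %m %Y"` is an assumption inferred from the output above (day 1 maps to 2020-05-01).

```python
import pandas as pd


def top_n_turbines_sketch(df: pd.DataFrame, n: int) -> pd.DataFrame:
    """Keep only the n turbines with the highest mean active power (Patv)."""
    mean_power = df.groupby("TurbID")["Patv"].mean()
    keep_ids = mean_power.nlargest(n).index
    # Returning an explicit copy avoids assigning to a slice of the original frame
    return df[df["TurbID"].isin(keep_ids)].copy()


def format_datetime_sketch(df: pd.DataFrame, initial_date_str: str = "01 05 2020") -> pd.DataFrame:
    """Merge the Day and Tmstamp columns into a single Datetime column."""
    start = pd.to_datetime(initial_date_str, format="%d %m %Y")  # assumed day-month-year
    out = df.copy()
    out["Datetime"] = (
        start
        + pd.to_timedelta(out["Day"] - 1, unit="D")      # Day is 1-based
        + pd.to_timedelta(out["Tmstamp"] + ":00")        # "HH:MM" -> "HH:MM:SS"
    )
    return out.drop(columns=["Day", "Tmstamp"])


# Tiny made-up frame with three turbines
demo = pd.DataFrame({
    "TurbID": [1, 1, 2, 2, 3, 3],
    "Day": [1, 2, 1, 2, 1, 2],
    "Tmstamp": ["00:10", "00:20", "00:10", "00:20", "00:10", "00:20"],
    "Patv": [500.0, 520.0, 100.0, 120.0, 300.0, 310.0],
})
top = format_datetime_sketch(top_n_turbines_sketch(demo, 2))
print(top)
```

Working on an explicit copy is one common way to avoid the `SettingWithCopyWarning` shown in the cell output above.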


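## 3. Catalog abnormal values

If you read the paper associated with this dataset, you will see a section called "Caveats about the data", which mentions that some values should be excluded from the analysis because they are either "missing", "unknown" or "abnormal".

"Missing" values are self-explanatory, but here are the definitions of the other two types:

"Unknown":
- if `Patv` ≤ 0 and `Wspd` > 2.5
- if `Pab1` > 89° or `Pab2` > 89° or `Pab3` > 89°

"Abnormal":
- if `Ndir` < -720 or `Ndir` > 720
- if `Wdir` < -180 or `Wdir` > 180

When you run the next cell, you will create a new column called `Include` and set it to False for every row that contains a missing, unknown or abnormal value: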

In [3]:
# Initially mark every row as included
top_turbines["Include"] = True

# Define the conditions that identify rows to exclude
conditions = [
    np.isnan(top_turbines.Patv),  # Patv is NaN (missing)
    (top_turbines.Pab1 > 89) | (top_turbines.Pab2 > 89) | (top_turbines.Pab3 > 89),  # any Pab value above 89°
    (top_turbines.Ndir < -720) | (top_turbines.Ndir > 720),  # Ndir outside [-720, 720]
    (top_turbines.Wdir < -180) | (top_turbines.Wdir > 180),  # Wdir outside [-180, 180]
    (top_turbines.Patv <= 0) & (top_turbines.Wspd > 2.5)  # Patv ≤ 0 while Wspd > 2.5
]

# Loop over the conditions and tag the matching rows for exclusion
for condition in conditions:
    top_turbines = utils.tag_abnormal_values(top_turbines, condition)

# Show the first few rows of the tagged data
top_turbines.head()

Unnamed: 0,Datetime,TurbID,Wspd,Wdir,Etmp,Itmp,Ndir,Pab1,Pab2,Pab3,Prtv,Patv,Include
0,2020-05-01 00:10:00,1,6.17,-3.99,30.73,41.8,25.92,1.0,1.0,1.0,-0.25,494.66,True
1,2020-05-01 00:20:00,1,6.27,-2.18,30.6,41.63,20.91,1.0,1.0,1.0,-0.24,509.76,True
2,2020-05-01 00:30:00,1,6.42,-0.73,30.52,41.52,20.91,1.0,1.0,1.0,-0.26,542.53,True
3,2020-05-01 00:40:00,1,6.25,0.89,30.49,41.38,20.91,1.0,1.0,1.0,-0.23,509.36,True
4,2020-05-01 00:50:00,1,6.1,-1.03,30.47,41.22,20.91,1.0,1.0,1.0,-0.27,482.21,True
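To make the tagging step concrete, here is a minimal sketch of what a tagging helper could look like. The name `tag_abnormal_values_sketch` and the demo rows are hypothetical, and the real `utils.tag_abnormal_values` may be implemented differently; the sketch only shows the boolean-mask idea.

```python
import numpy as np
import pandas as pd


def tag_abnormal_values_sketch(df: pd.DataFrame, condition: pd.Series) -> pd.DataFrame:
    """Set Include to False on every row where `condition` is True."""
    out = df.copy()
    out.loc[condition, "Include"] = False
    return out


# Made-up rows: one valid, one missing, one "unknown" (Patv <= 0 with wind),
# one with a pitch angle above 89 degrees, one with Wdir out of range
demo = pd.DataFrame({
    "Patv": [494.66, np.nan, -0.3, 120.0, 50.0],
    "Wspd": [6.17, 5.0, 4.0, 3.0, 2.0],
    "Pab1": [1.0, 1.0, 1.0, 90.0, 1.0],
    "Wdir": [-3.99, 0.0, 0.0, 0.0, 200.0],
})
demo["Include"] = True

conditions = [
    demo["Patv"].isna(),
    demo["Pab1"] > 89,
    (demo["Wdir"] < -180) | (demo["Wdir"] > 180),
    (demo["Patv"] <= 0) & (demo["Wspd"] > 2.5),
]
for condition in conditions:
    demo = tag_abnormal_values_sketch(demo, condition)

print(demo)
```

Keeping a flag column instead of dropping rows immediately lets you inspect how many rows each condition excludes before committing to the filter.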


Now run the next cell to create the `clean_data` dataframe, which keeps only the rows that were not tagged for exclusion:

In [9]:
# Remove all rows tagged as abnormal
clean_data = top_turbines[top_turbines.Include].drop(["Include"], axis=1)

clean_data.head()

Unnamed: 0,Datetime,TurbID,Wspd,Wdir,Etmp,Itmp,Ndir,Pab1,Pab2,Pab3,Prtv,Patv
0,2020-05-01 00:10:00,1,6.17,-3.99,30.73,41.8,25.92,1.0,1.0,1.0,-0.25,494.66
1,2020-05-01 00:20:00,1,6.27,-2.18,30.6,41.63,20.91,1.0,1.0,1.0,-0.24,509.76
2,2020-05-01 00:30:00,1,6.42,-0.73,30.52,41.52,20.91,1.0,1.0,1.0,-0.26,542.53
3,2020-05-01 00:40:00,1,6.25,0.89,30.49,41.38,20.91,1.0,1.0,1.0,-0.23,509.36
4,2020-05-01 00:50:00,1,6.1,-1.03,30.47,41.22,20.91,1.0,1.0,1.0,-0.27,482.21


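## 4. Establish a baseline for wind power estimation

Before going any further, you will create a baseline for wind power estimation using a linear regression model that fits the relationship between wind speed and power output.

You can use the dropdown menu to train a linear model for one of the turbines, then assess its performance by looking at the plot of predicted versus actual power output and at the model's mean absolute error (MAE).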

In [5]:
utils.linear_univariate_model(clean_data)

interactive(children=(Dropdown(description='Turbine', options=(1, 3, 4, 5, 6, 9, 10, 11, 12, 70), value=1), Ou…
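For readers who want to see the baseline idea without the interactive widget, the following is a minimal sketch of a univariate linear fit evaluated with mean absolute error. It uses synthetic data rather than the SDWPF values, and the 80/20 split is an assumption, not necessarily what `utils.linear_univariate_model` does.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for one turbine: power grows with wind speed, clipped at zero
wspd = rng.uniform(0.0, 15.0, size=500)
patv = np.clip(100.0 * wspd - 200.0 + rng.normal(0.0, 50.0, size=500), 0.0, None)

# Hold out the last 20% of the points for evaluation (assumed split)
split = int(0.8 * len(wspd))
slope, intercept = np.polyfit(wspd[:split], patv[:split], deg=1)

# Predict on the held-out points and compute the mean absolute error
pred = slope * wspd[split:] + intercept
mae = float(np.mean(np.abs(patv[split:] - pred)))

print(f"Patv ~ {slope:.1f} * Wspd + {intercept:.1f}, MAE = {mae:.1f}")
```

A straight line cannot follow the flat regions of a real power curve at very low and very high wind speeds, which is one reason to treat this model only as a baseline.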

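## 5. Feature engineering

Before building a model that estimates power output from the other features, you need to do some feature engineering. In this process you will transform existing features into better representations, combine features, fix problems with them and create new features.

### 5.1 Delete redundant features - Pab

In the previous lab, you saw that all of the `Pab#` features (which stand for `pitch angle of blade #`) were perfectly correlated, which means they are redundant. You can keep just one of them and rename it `Pab`. Run the next cell to keep a single `Pab` column.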

In [10]:
# Collapse the Pab features into a single column, dropping the redundant ones
clean_data = utils.cut_pab_features(clean_data)

# Show the first 5 rows to verify the result
clean_data.head(5)

Unnamed: 0,Datetime,TurbID,Wspd,Wdir,Etmp,Itmp,Ndir,Pab,Prtv,Patv
0,2020-05-01 00:10:00,1,6.17,-3.99,30.73,41.8,25.92,1.0,-0.25,494.66
1,2020-05-01 00:20:00,1,6.27,-2.18,30.6,41.63,20.91,1.0,-0.24,509.76
2,2020-05-01 00:30:00,1,6.42,-0.73,30.52,41.52,20.91,1.0,-0.26,542.53
3,2020-05-01 00:40:00,1,6.25,0.89,30.49,41.38,20.91,1.0,-0.23,509.36
4,2020-05-01 00:50:00,1,6.1,-1.03,30.47,41.22,20.91,1.0,-0.27,482.21
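The redundancy claim can be checked with a correlation matrix before dropping anything. The sketch below uses made-up values, and `cut_pab_features_sketch` is a hypothetical stand-in; keeping `Pab1` rather than one of the other two columns is an arbitrary choice for illustration.

```python
import pandas as pd

# Made-up rows where the three pitch angles always agree
demo = pd.DataFrame({
    "Pab1": [1.0, 1.0, 5.2, 30.0],
    "Pab2": [1.0, 1.0, 5.2, 30.0],
    "Pab3": [1.0, 1.0, 5.2, 30.0],
    "Patv": [494.66, 509.76, 300.0, 20.0],
})

pab_cols = ["Pab1", "Pab2", "Pab3"]

# Pairwise Pearson correlations; values of 1.0 mean the columns carry the same signal
corr = demo[pab_cols].corr()
print(corr)


def cut_pab_features_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Keep a single pitch-angle column and rename it to Pab."""
    return df.drop(columns=["Pab2", "Pab3"]).rename(columns={"Pab1": "Pab"})


reduced = cut_pab_features_sketch(demo)
print(reduced.columns.tolist())
```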


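### 5.2 Transform angle features

Three features (`Wdir`, `Ndir`, `Pab`) are expressed in degrees. This is a problem because the model has no way of knowing that different angle values such as 0° and 360° are actually very similar (in this case, identical). To address this, you can convert these features to their `sine`/`cosine` representation.

Run the next cell to convert the angle features to their `sin`/`cos` representation.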

In [11]:
# Transform all angle features (replace each angle with its sine and cosine)
for feature in ["Wdir", "Ndir", "Pab"]:
    utils.transform_angles(clean_data, feature)

# Show the first 5 rows to verify the transformation
clean_data.head(5)

Unnamed: 0,Datetime,TurbID,Wspd,Etmp,Itmp,Prtv,Patv,WdirCos,WdirSin,NdirCos,NdirSin,PabCos,PabSin
0,2020-05-01 00:10:00,1,6.17,30.73,41.8,-0.25,494.66,0.997576,-0.069582,0.899405,0.437116,0.999848,0.017452
1,2020-05-01 00:20:00,1,6.27,30.6,41.63,-0.24,509.76,0.999276,-0.038039,0.934142,0.356901,0.999848,0.017452
2,2020-05-01 00:30:00,1,6.42,30.52,41.52,-0.26,542.53,0.999919,-0.012741,0.934142,0.356901,0.999848,0.017452
3,2020-05-01 00:40:00,1,6.25,30.49,41.38,-0.23,509.36,0.999879,0.015533,0.934142,0.356901,0.999848,0.017452
4,2020-05-01 00:50:00,1,6.1,30.47,41.22,-0.27,482.21,0.999838,-0.017976,0.934142,0.356901,0.999848,0.017452
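The transformation itself is short. The sketch below is a hypothetical in-place helper (`transform_angles_sketch` is not the lab's function) that converts degrees to radians and stores the cosine and sine, following the `WdirCos`/`WdirSin` column naming visible in the output above.

```python
import numpy as np
import pandas as pd


def transform_angles_sketch(df: pd.DataFrame, feature: str, drop_original: bool = True) -> None:
    """Add <feature>Cos and <feature>Sin columns in place; optionally drop the degrees column."""
    radians = np.deg2rad(df[feature])
    df[f"{feature}Cos"] = np.cos(radians)
    df[f"{feature}Sin"] = np.sin(radians)
    if drop_original:
        df.drop(columns=feature, inplace=True)


# -3.99 is the first Wdir value in the table above; 0 and 360 are the same direction
demo = pd.DataFrame({"Wdir": [-3.99, 0.0, 360.0]})
transform_angles_sketch(demo, "Wdir")
print(demo)
```

In this representation 0° and 360° land on the same point of the unit circle, so a model sees them as identical instead of 360 units apart.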


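### 5.3 Fix temperature and active power features

In the previous lab, you saw that both `Etmp` and `Itmp` had very negative values. In fact, their minimum values are very close to absolute zero (-273.15 °C), which is clearly an error. Here, linear interpolation is used to correct those values.

Negative active power values make no sense in the context of this problem, so all negative values should be treated as zero.

You can apply these changes by running the following cell: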

In [12]:
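# Fix the temperature data (correct the erroneous temperature values)
clean_data = utils.fix_temperatures(clean_data)

# Fix negative active power values by setting them to zero
clean_data["Patv"] = clean_data["Patv"].apply(lambda x: max(0, x))

# Show the first 5 rows to verify the result
clean_data.head(5)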

Unnamed: 0,Datetime,TurbID,Wspd,Etmp,Itmp,Prtv,Patv,WdirCos,WdirSin,NdirCos,NdirSin,PabCos,PabSin
0,2020-05-01 00:10:00,1,6.17,30.73,41.8,-0.25,494.66,0.997576,-0.069582,0.899405,0.437116,0.999848,0.017452
1,2020-05-01 00:20:00,1,6.27,30.6,41.63,-0.24,509.76,0.999276,-0.038039,0.934142,0.356901,0.999848,0.017452
2,2020-05-01 00:30:00,1,6.42,30.52,41.52,-0.26,542.53,0.999919,-0.012741,0.934142,0.356901,0.999848,0.017452
3,2020-05-01 00:40:00,1,6.25,30.49,41.38,-0.23,509.36,0.999879,0.015533,0.934142,0.356901,0.999848,0.017452
4,2020-05-01 00:50:00,1,6.1,30.47,41.22,-0.27,482.21,0.999838,-0.017976,0.934142,0.356901,0.999848,0.017452
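One way to implement both fixes is to mask implausible temperatures as missing and let pandas interpolate across the gap. This is a sketch under assumptions: `fix_temperatures_sketch` is a hypothetical name, the `-50.0` °C cutoff is an arbitrary threshold chosen for illustration, and the data is made up. A per-turbine version would apply the same logic within each `TurbID` group.

```python
import pandas as pd


def fix_temperatures_sketch(df: pd.DataFrame, min_valid: float = -50.0) -> pd.DataFrame:
    """Replace implausibly low temperatures with linearly interpolated values."""
    out = df.copy()
    for col in ["Etmp", "Itmp"]:
        # Values below the (assumed) cutoff become NaN, then get filled from their neighbors
        masked = out[col].where(out[col] >= min_valid)
        out[col] = masked.interpolate(method="linear", limit_direction="both")
    return out


demo = pd.DataFrame({
    "Etmp": [30.0, -273.0, 32.0, 33.0],
    "Itmp": [40.0, 41.0, -270.0, 43.0],
    "Patv": [-0.3, 10.0, -5.0, 200.0],
})

fixed = fix_temperatures_sketch(demo)
# Vectorized way to floor negative active power at zero
fixed["Patv"] = fixed["Patv"].clip(lower=0)
print(fixed)
```

For columns without missing values, `clip(lower=0)` gives the same result as applying `max(0, x)` row by row, and it is usually faster on large frames.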


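### 5.4 Create time features

You will create features that encode the time signal of each data point in the dataset.

If you are curious about how this encoding works, be sure to check out this [post](https://developer.nvidia.com/blog/three-approaches-to-encoding-time-information-as-features-for-ml-models/).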

In [13]:
# Generate the time signal features
clean_data = utils.generate_time_signals(clean_data)

# Show the first 5 rows to verify the result
clean_data.head(5)

Unnamed: 0,Datetime,TurbID,Wspd,Etmp,Itmp,Prtv,Patv,WdirCos,WdirSin,NdirCos,NdirSin,PabCos,PabSin,Time-of-day sin,Time-of-day cos
0,2020-05-01 00:10:00,1,6.17,30.73,41.8,-0.25,494.66,0.997576,-0.069582,0.899405,0.437116,0.999848,0.017452,0.043619,0.999048
1,2020-05-01 00:20:00,1,6.27,30.6,41.63,-0.24,509.76,0.999276,-0.038039,0.934142,0.356901,0.999848,0.017452,0.087156,0.996195
2,2020-05-01 00:30:00,1,6.42,30.52,41.52,-0.26,542.53,0.999919,-0.012741,0.934142,0.356901,0.999848,0.017452,0.130526,0.991445
3,2020-05-01 00:40:00,1,6.25,30.49,41.38,-0.23,509.36,0.999879,0.015533,0.934142,0.356901,0.999848,0.017452,0.173648,0.984808
4,2020-05-01 00:50:00,1,6.1,30.47,41.22,-0.27,482.21,0.999838,-0.017976,0.934142,0.356901,0.999848,0.017452,0.21644,0.976296
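A cyclical time-of-day encoding maps each timestamp to an angle on a 24-hour circle and stores its sine and cosine. The sketch below is a hypothetical helper (`generate_time_signals_sketch`), not the lab's implementation, although it reproduces the first values shown in the output above.

```python
import numpy as np
import pandas as pd


def generate_time_signals_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Add sine/cosine features encoding the time of day on a 24-hour cycle."""
    out = df.copy()
    seconds = (
        out["Datetime"].dt.hour * 3600
        + out["Datetime"].dt.minute * 60
        + out["Datetime"].dt.second
    )
    angle = 2 * np.pi * seconds / 86400  # fraction of the day as an angle in radians
    out["Time-of-day sin"] = np.sin(angle)
    out["Time-of-day cos"] = np.cos(angle)
    return out


demo = pd.DataFrame({
    "Datetime": pd.to_datetime([
        "2020-05-01 00:10:00",
        "2020-05-01 00:20:00",
        "2020-05-01 12:00:00",
    ])
})
signals = generate_time_signals_sketch(demo)
print(signals)
```

With this encoding, 23:50 and 00:10 end up close together, which a raw hour-of-day number would not capture.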


Run the next cell to perform the last step of preparing your data for modeling.

In [14]:
# Define the predictor features: every column except "Datetime", "TurbID" and "Patv"
predictors = [f for f in clean_data.columns if f not in ["Datetime", "TurbID", "Patv"]]

# Define the target feature: the single column "Patv"
target = ["Patv"]

# Reorder the columns: "TurbID" first, then the predictors, then the target
model_data = clean_data[["TurbID"] + predictors + target]

# Show the first 5 rows of the resulting data
model_data.head(5)

Unnamed: 0,TurbID,Wspd,Etmp,Itmp,Prtv,WdirCos,WdirSin,NdirCos,NdirSin,PabCos,PabSin,Time-of-day sin,Time-of-day cos,Patv
0,1,6.17,30.73,41.8,-0.25,0.997576,-0.069582,0.899405,0.437116,0.999848,0.017452,0.043619,0.999048,494.66
1,1,6.27,30.6,41.63,-0.24,0.999276,-0.038039,0.934142,0.356901,0.999848,0.017452,0.087156,0.996195,509.76
2,1,6.42,30.52,41.52,-0.26,0.999919,-0.012741,0.934142,0.356901,0.999848,0.017452,0.130526,0.991445,542.53
3,1,6.25,30.49,41.38,-0.23,0.999879,0.015533,0.934142,0.356901,0.999848,0.017452,0.173648,0.984808,509.36
4,1,6.1,30.47,41.22,-0.27,0.999838,-0.017976,0.934142,0.356901,0.999848,0.017452,0.21644,0.976296,482.21


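## 6. Update the linear model baseline with more features

Now that you have completed some feature engineering, it is time to do more modeling with your new feature set. You can use the dropdown menu to select the turbine to model, and pick the features you want to include in the model from the list. Use the Shift key and arrow keys on your keyboard to select the features you want to include, then click the `Run Interact` button to train your model.

Note that because you are including more features, the fitted model can no longer be visualized in two dimensions. With that in mind, the plot is replaced by one showing the mean feature importance of each feature you included: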

In [15]:
# Create a linear model with more features
utils.linear_multivariate_model(model_data, predictors)
# The interaction below may take some time to run

interactive(children=(Dropdown(description='Turbine', options=(1, 3, 4, 5, 6, 9, 10, 11, 12, 70), value=1), Se…
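As a rough illustration of a multivariate linear fit together with a feature importance measure, the sketch below fits ordinary least squares with NumPy and estimates permutation importance, meaning the increase in MAE when one feature is shuffled. The data is synthetic, the `Noise` feature is made up, and permutation importance is an assumed choice: the lab's `utils.linear_multivariate_model` may compute importance differently.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
feature_names = ["Wspd", "Etmp", "Noise"]

# Synthetic predictors; the target depends strongly on Wspd, weakly on Etmp, not on Noise
X = np.column_stack([
    rng.uniform(0.0, 15.0, n),
    rng.normal(25.0, 5.0, n),
    rng.normal(0.0, 1.0, n),
])
y = 100.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(0.0, 20.0, n)


def fit_linear(X, y):
    """Ordinary least squares with an intercept (last coefficient)."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict(coef, X):
    return X @ coef[:-1] + coef[-1]


def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))


split = int(0.8 * n)
coef = fit_linear(X[:split], y[:split])
X_test, y_test = X[split:], y[split:]
baseline_mae = mae(y_test, predict(coef, X_test))

# Permutation importance: average MAE increase over 10 shuffles of each column
importance = {}
for j, name in enumerate(feature_names):
    increases = []
    for _ in range(10):
        X_perm = X_test.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])
        increases.append(mae(y_test, predict(coef, X_perm)) - baseline_mae)
    importance[name] = float(np.mean(increases))

print(f"baseline MAE: {baseline_mae:.2f}")
print(importance)
```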

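## 7. Use a neural network to improve wind power estimation

Now you will train a neural network model for comparison. As in the previous section, you can use the dropdown menu to select the turbine to model and pick the features to include from the list. Click the `Run Interact` button to train the network and display the results.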

In [16]:
# Train the neural network model
utils.neural_network(model_data, predictors)

interactive(children=(Dropdown(description='Turbine', options=(1, 3, 4, 5, 6, 9, 10, 11, 12, 70), value=1), Se…
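To give a feel for what a small neural network does on this kind of problem, here is a minimal NumPy sketch of a one-hidden-layer network trained with full-batch gradient descent on a synthetic, cubic-shaped power curve. Everything in it is an assumption made for illustration (architecture, learning rate, data, in-sample evaluation); the architecture and training setup of `utils.neural_network` are not shown in this lab and may be quite different.

```python
import numpy as np

rng = np.random.default_rng(0)
n, hidden = 500, 8

# Synthetic power curve: zero below 3 m/s, cubic rise, flat at rated power (scaled to 1)
wspd = rng.uniform(0.0, 15.0, size=(n, 1))
y = np.clip(((wspd - 3.0) / 9.0) ** 3, 0.0, 1.0) + rng.normal(0.0, 0.02, size=(n, 1))

# Standardize the input so gradient descent behaves well
X = (wspd - wspd.mean()) / wspd.std()

# Parameters of a 1 -> hidden -> 1 network with tanh activation
W1 = rng.normal(0.0, 0.5, size=(1, hidden))
b1 = np.zeros(hidden)
W2 = rng.normal(0.0, 0.5, size=(hidden, 1))
b2 = np.zeros(1)


def forward(X):
    h = np.tanh(X @ W1 + b1)
    return h, h @ W2 + b2


learning_rate = 0.05
losses = []
for step in range(2000):
    h, pred = forward(X)
    err = pred - y
    losses.append(float(np.mean(err ** 2)))  # mean squared error

    # Backpropagation of the MSE loss
    grad_pred = 2.0 * err / n
    grad_W2 = h.T @ grad_pred
    grad_b2 = grad_pred.sum(axis=0)
    grad_z = (grad_pred @ W2.T) * (1.0 - h ** 2)  # derivative of tanh
    grad_W1 = X.T @ grad_z
    grad_b1 = grad_z.sum(axis=0)

    # Gradient descent update
    W1 -= learning_rate * grad_W1
    b1 -= learning_rate * grad_b1
    W2 -= learning_rate * grad_W2
    b2 -= learning_rate * grad_b2

_, nn_pred = forward(X)
nn_mae = float(np.mean(np.abs(nn_pred - y)))

# Straight-line fit on the same data, for comparison (in-sample only)
slope, intercept = np.polyfit(X.ravel(), y.ravel(), deg=1)
linear_mae = float(np.mean(np.abs(slope * X + intercept - y)))

print(f"network MAE: {nn_mae:.4f}, linear MAE: {linear_mae:.4f}")
```

The hidden layer lets the model bend its prediction, so it can in principle follow the flat and curved parts of a power curve that a single straight line cannot. A real comparison should use held-out data rather than the in-sample errors printed here.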