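## Step 0: Import the Analysis Library and Set the Analysis Target
To make data processing and analysis more convenient, we define a base class, `DataframeAnalysis`, that converts all input data into a `pandas.DataFrame`.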

Currently supported data formats:
- `csv`
- `xlsx`
- `parquet`

Input parameters:
- `root_path`: the root directory under which all data is stored
- `data_path`: the path of each dataset relative to the root directory
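
The internals of `DataframeAnalysis` are not shown in this notebook. As a rough illustration of how the two parameters and the three supported formats could fit together, here is a minimal sketch that dispatches on the file extension. The function `load_dataframe` and the `_READERS` table are hypothetical names introduced for this example, not part of the library.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Hypothetical mapping from file extension to a pandas reader (an assumption,
# not the library's actual implementation).
_READERS = {
    ".csv": pd.read_csv,
    ".xlsx": pd.read_excel,
    ".parquet": pd.read_parquet,
}


def load_dataframe(root_path: str, data_path: str) -> pd.DataFrame:
    """Join root_path and data_path, then read the file into a DataFrame."""
    full_path = Path(root_path) / data_path
    suffix = full_path.suffix.lower()
    if suffix not in _READERS:
        raise ValueError(f"Unsupported data format: {suffix!r}")
    return _READERS[suffix](full_path)


# Usage with a throwaway CSV file.
with tempfile.TemporaryDirectory() as tmp:
    sample = pd.DataFrame({"time": ["2023/12/31 8:00:00"], "power": [0.0]})
    sample.to_csv(Path(tmp) / "sample.csv", index=False)
    df_demo = load_dataframe(tmp, "sample.csv")

print(df_demo.shape)
```

Reading `.xlsx` and `.parquet` files with pandas requires an optional engine (for example `openpyxl` or `pyarrow`) to be installed.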

In [1]:
pwd

'/Users/qiuyz24/PycharmProjects/Data-Process-Library/example/XiexinForecasting'

In [1]:
from Analysis.DataframeAnalysis import DataframeAnalysis
import plotly.express as px
import pandas as pd
import plotly.graph_objects as go
# Load the Xiexin photovoltaic power plant data
root_path = '/data/Xiexindata/photovoltaic_power_plant_data'
data_path = '119.6661600_33.6244280_24_data.csv'
DA = DataframeAnalysis(root_path,data_path)

DataAnalysis loading data from: /Users/qiuyz24/PycharmProjects/Data-Process-Library/data/Xiexindata/photovoltaic_power_plant_data/119.6661600_33.6244280_24_data.csv


In [2]:
# Get all column names
DA.df_raw.columns

Index(['time', '温度（℃）', '相对温度（%）', '大气压力（hpa）', '平均风向（°）', '平均风速（m/s)',
       '累积降水量（mm）', '平均总辐射（W/m2）', '平均直接辐射（W/m2）', '平均散射辐射（W/m2）',
       '有效云量（百分比%）:p', '屋顶类型', '倾角', '组件功率', 'power'],
      dtype='object')

In [4]:
DA.df_raw

Unnamed: 0,time,温度（℃）,相对温度（%）,大气压力（hpa）,平均风向（°）,平均风速（m/s),累积降水量（mm）,平均总辐射（W/m2）,平均直接辐射（W/m2）,平均散射辐射（W/m2）,有效云量（百分比%）:p,屋顶类型,倾角,组件功率,power
0,2023/12/31 8:00:00,0.1,88.2,1027,330.6,3.2,0.0,69.3,26.6,42.7,12.7,斜屋顶,15,550,0.0
1,2023/12/31 8:15:00,0.5,85.9,1027,333.5,3.3,0.0,100.5,42.1,58.4,28.4,斜屋顶,15,550,0.0
2,2023/12/31 8:30:00,0.9,83.8,1027,337.4,3.4,0.0,138.5,62.7,75.9,44.0,斜屋顶,15,550,0.0
3,2023/12/31 8:45:00,1.2,81.7,1028,341.0,3.6,0.0,174.4,80.5,94.0,59.7,斜屋顶,15,550,0.0
4,2023/12/31 9:00:00,1.6,79.6,1028,344.4,3.7,0.0,205.9,92.7,113.2,75.4,斜屋顶,15,550,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35228,2025/1/1 7:00:00,-1.8,60.3,1024,160.9,2.0,0.0,0.0,0.0,0.0,0.0,斜屋顶,15,550,0.0
35229,2025/1/1 7:15:00,-1.4,59.9,1024,163.5,2.1,0.0,3.6,1.4,2.2,0.0,斜屋顶,15,550,0.0
35230,2025/1/1 7:30:00,-1.0,59.5,1024,166.8,2.3,0.0,18.7,7.3,11.4,0.0,斜屋顶,15,550,0.0
35231,2025/1/1 7:45:00,-0.6,59.0,1024,169.5,2.5,0.0,46.2,19.7,26.5,0.0,斜屋顶,15,550,0.0


In [52]:
DA.plot_column_plotly(columns=['power'], start_point=0, length=36000)

## Step 1: Missing Value Analysis

Check whether the dataset contains missing values, including:
- missing feature values
- missing timestamps
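
Both checks can be approximated with plain pandas. The sketch below is an illustration on a toy 15-minute series and is not the implementation behind `getNanIndex` or `checkDateContinuity`; the toy data and variable names are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy 15-minute series with one NaN feature value and one missing timestamp (08:30).
toy = pd.DataFrame({
    "time": ["2023/12/31 8:00:00", "2023/12/31 8:15:00",
             "2023/12/31 8:45:00", "2023/12/31 9:00:00"],
    "power": [0.0, np.nan, 0.3, 0.5],
})

# Rows that contain at least one missing feature value.
nan_index = toy.index[toy.isna().any(axis=1)]

# Timestamps expected on a regular 15-minute grid but absent from the data.
timestamps = pd.to_datetime(toy["time"], format="%Y/%m/%d %H:%M:%S")
expected_range = pd.date_range(timestamps.min(), timestamps.max(), freq="15min")
missing_dates = expected_range.difference(pd.DatetimeIndex(timestamps))

print(list(nan_index))
print(missing_dates)
```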

In [8]:
# Get the index entries of rows that contain missing values
DA.getNanIndex(start_col='温度（℃）',end_col='power')

(Index([], dtype='int64'), False)

In [13]:
missing_dates, flag, timestamps, expected_range = DA.checkDateContinuity('time', freq='15min')
print(f"Missing dates {flag}: {missing_dates}")

Missing dates True: DatetimeIndex([], dtype='datetime64[ns]', freq='15min')


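## Step 2: Basic Statistics
Summary statistics give the most direct picture of how the data is distributed and lay the groundwork for the later analysis.

The statistics currently covered are:
- data shape
- mean of each column
- variance of each column
- standard deviation of each column
- maximum of each column
- minimum of each column
- median of each column
- quantiles of each column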
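
For comparison, the same statistics can be obtained directly from pandas. This is a sketch on toy data with made-up column names, not the code behind the `get*Column` methods. Note that pandas computes the sample variance and standard deviation (`ddof=1`) by default; whether `DataframeAnalysis` does the same is not stated here.

```python
import pandas as pd

# Toy data; the column names are placeholders for this example.
toy = pd.DataFrame({
    "temperature": [0.0, 1.0, 2.0, 3.0],
    "power": [0.0, 0.0, 2.0, 6.0],
    "roof_type": ["pitched"] * 4,  # non-numeric, excluded below
})

# Keep numeric columns only so string columns do not break the statistics.
numeric = toy.select_dtypes(include="number")

shape = numeric.shape
summary = numeric.agg(["mean", "var", "std", "max", "min", "median"])
quantiles = numeric.quantile([1 / 3, 2 / 3])

print(shape)
print(summary)
print(quantiles)
```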

In [49]:
# Get the data shape: (sequence length, number of variables)
DA.getShape()

(35233, 14)

In [18]:
# Drop non-numeric variables so they do not interfere with the statistics
DA.df_raw.drop(columns=['屋顶类型'], inplace=True)

In [19]:
# Get the mean of each column
Average = DA.getAverageColumn(start_col='温度（℃）',end_col='power')
Average

Unnamed: 0,feature,average
0,温度（℃）,15.989853
1,相对温度（%）,72.330091
2,大气压力（hpa）,1016.051287
3,平均风向（°）,158.953578
4,平均风速（m/s),2.794369
5,累积降水量（mm）,0.029933
6,平均总辐射（W/m2）,166.646462
7,平均直接辐射（W/m2）,92.137357
8,平均散射辐射（W/m2）,74.508986
9,有效云量（百分比%）:p,45.921222


In [20]:
fig = px.bar(Average, x='feature', y='average',color='average')
fig.show()

In [21]:
# Get the variance of each column
Variance = DA.getVarianceColumn(start_col='温度（℃）',end_col='power')
Variance

Unnamed: 0,feature,variance
0,温度（℃）,111.081641
1,相对温度（%）,327.936504
2,大气压力（hpa）,95.296444
3,平均风向（°）,11309.851333
4,平均风速（m/s),1.661655
5,累积降水量（mm）,0.025686
6,平均总辐射（W/m2）,58681.454669
7,平均直接辐射（W/m2）,26790.828182
8,平均散射辐射（W/m2）,9462.581144
9,有效云量（百分比%）:p,1408.194616


In [22]:
fig = px.bar(Variance, x='feature', y='variance',color='variance')
fig.show()

In [23]:
# Get the standard deviation of each column
Std = DA.getStdColumn(start_col='温度（℃）',end_col='power')
Std

Unnamed: 0,feature,standard deviation
0,温度（℃）,10.539528
1,相对温度（%）,18.109017
2,大气压力（hpa）,9.76199
3,平均风向（°）,106.347785
4,平均风速（m/s),1.289052
5,累积降水量（mm）,0.160269
6,平均总辐射（W/m2）,242.242553
7,平均直接辐射（W/m2）,163.67904
8,平均散射辐射（W/m2）,97.275799
9,有效云量（百分比%）:p,37.525919


In [24]:
fig = px.bar(Std, x='feature', y='standard deviation',color='standard deviation')
fig.show()

In [25]:
# Get the maximum of each column
maxval = DA.getMaxColumn(start_col='温度（℃）',end_col='power')
maxval

Unnamed: 0,feature,max value
0,温度（℃）,37.1
1,相对温度（%）,99.8
2,大气压力（hpa）,1041.0
3,平均风向（°）,360.0
4,平均风速（m/s),8.3
5,累积降水量（mm）,4.08
6,平均总辐射（W/m2）,931.7
7,平均直接辐射（W/m2）,712.1
8,平均散射辐射（W/m2）,425.3
9,有效云量（百分比%）:p,100.0


In [26]:
fig = px.bar(maxval, x='feature', y='max value',color='max value')
fig.show()

In [28]:
# Get the minimum of each column
minval = DA.getMinColumn(start_col='温度（℃）',end_col='power')
minval

Unnamed: 0,feature,min value
0,温度（℃）,-7.8
1,相对温度（%）,15.4
2,大气压力（hpa）,993.0
3,平均风向（°）,0.0
4,平均风速（m/s),0.1
5,累积降水量（mm）,0.0
6,平均总辐射（W/m2）,0.0
7,平均直接辐射（W/m2）,0.0
8,平均散射辐射（W/m2）,0.0
9,有效云量（百分比%）:p,0.0


In [29]:
fig = px.bar(minval, x='feature', y='min value',color='min value')
fig.show()

In [30]:
# Get the median of each column
median = DA.getMedianColumn(start_col='温度（℃）',end_col='power')
median

Unnamed: 0,feature,median
0,温度（℃）,16.5
1,相对温度（%）,75.4
2,大气压力（hpa）,1016.0
3,平均风向（°）,140.1
4,平均风速（m/s),2.6
5,累积降水量（mm）,0.0
6,平均总辐射（W/m2）,2.7
7,平均直接辐射（W/m2）,0.0
8,平均散射辐射（W/m2）,2.3
9,有效云量（百分比%）:p,48.6


In [31]:
fig = px.bar(median, x='feature', y='median',color='median')
fig.show()

In [33]:
# Get the quantiles of each column; set `percent` to choose the quantile levels
DA.getQuantileColumn(percent=[1/3,2/3],start_col='温度（℃）',end_col='power')

Unnamed: 0,温度（℃）,相对温度（%）,大气压力（hpa）,平均风向（°）,平均风速（m/s),累积降水量（mm）,平均总辐射（W/m2）,平均直接辐射（W/m2）,平均散射辐射（W/m2）,有效云量（百分比%）:p,倾角,组件功率,power
0.333333,10.7,65.5,1010.0,96.2,2.1,0.0,0.0,0.0,0.0,17.1,15.0,550.0,0.0
0.666667,22.8,83.6,1022.0,199.8,3.2,0.0,161.5,31.7,106.9,61.2,15.0,550.0,1.57


## 第三步：变量相关性分析
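Real-world industrial AI application data is usually multivariate with complex correlations. Mining the correlations between different variables, as well as the autocorrelation of each variable, yields valuable prior knowledge about the data for downstream tasks.

The correlation analyses currently covered are:
- cross-correlation analysis
- autocorrelation analysis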
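
Both quantities have direct pandas counterparts. The sketch below uses synthetic columns (assumed names) and is not the implementation of `getCorr` or `getSelfCorr`. It also shows why a constant column produces `NaN` correlations: its variance is zero, so the correlation is undefined. At a 15-minute sampling interval, a lag of 96 corresponds to one day.

```python
import numpy as np
import pandas as pd

t = np.arange(960)
toy = pd.DataFrame({
    "radiation": np.sin(2 * np.pi * t / 96),    # period of 96 samples
    "power": 2.0 * np.sin(2 * np.pi * t / 96),  # perfectly correlated with radiation
    "tilt": np.full(t.size, 15.0),              # constant column
})

# Pairwise cross-correlation matrix; method may be 'pearson', 'kendall' or 'spearman'.
cross_corr = toy.corr(method="pearson")

# Autocorrelation of one series with a copy of itself shifted by 96 steps.
self_corr = toy["radiation"].autocorr(lag=96)

print(cross_corr)
print(self_corr)
```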

In [34]:
# Compute pairwise cross-correlations between all series; `method` selects the correlation measure ('pearson' | 'kendall' | 'spearman')
CrossCorr = DA.getCorr(method='pearson', start_col='温度（℃）', end_col='power')
CrossCorr

Unnamed: 0,温度（℃）,相对温度（%）,大气压力（hpa）,平均风向（°）,平均风速（m/s),累积降水量（mm）,平均总辐射（W/m2）,平均直接辐射（W/m2）,平均散射辐射（W/m2）,有效云量（百分比%）:p,倾角,组件功率,power
温度（℃）,1.0,-0.025424,-0.856611,-0.221332,0.050473,0.090026,0.378898,0.33316,0.382972,0.006462,,,0.266122
相对温度（%）,-0.025424,1.0,-0.198549,-0.077538,-0.168207,0.178411,-0.537964,-0.520792,-0.463375,0.217568,,,-0.554996
大气压力（hpa）,-0.856611,-0.198549,1.0,0.155286,-0.041963,-0.150505,-0.149104,-0.111243,-0.184126,-0.047026,,,-0.037188
平均风向（°）,-0.221332,-0.077538,0.155286,1.0,-0.099432,-0.005357,-0.017403,0.000433,-0.044068,-0.117639,,,0.011239
平均风速（m/s),0.050473,-0.168207,-0.041963,-0.099432,1.0,0.037367,0.102927,0.058923,0.157171,0.212212,,,0.088622
累积降水量（mm）,0.090026,0.178411,-0.150505,-0.005357,0.037367,1.0,-0.053911,-0.063485,-0.027434,0.219427,,,-0.065012
平均总辐射（W/m2）,0.378898,-0.537964,-0.149104,-0.017403,0.102927,-0.053911,1.0,0.958508,0.877456,-0.143688,,,0.905159
平均直接辐射（W/m2）,0.33316,-0.520792,-0.111243,0.000433,0.058923,-0.063485,0.958508,1.0,0.704314,-0.259857,,,0.885167
平均散射辐射（W/m2）,0.382972,-0.463375,-0.184126,-0.044068,0.157171,-0.027434,0.877456,0.704314,1.0,0.079414,,,0.76468
有效云量（百分比%）:p,0.006462,0.217568,-0.047026,-0.117639,0.212212,0.219427,-0.143688,-0.259857,0.079414,1.0,,,-0.1806


In [35]:
fig = px.imshow(CrossCorr)
fig.show()

In [53]:
# Compute the autocorrelation coefficient of every series; `lag` sets the number of lag periods (time steps)
SelfCorr = DA.getSelfCorr(lag=96, start_col='温度（℃）',end_col='power')
SelfCorr


invalid value encountered in divide



Unnamed: 0,feature,self correlation
0,温度（℃）,0.963645
1,相对温度（%）,0.714909
2,大气压力（hpa）,0.913194
3,平均风向（°）,0.252281
4,平均风速（m/s),0.284262
5,累积降水量（mm）,0.079457
6,平均总辐射（W/m2）,0.866532
7,平均直接辐射（W/m2）,0.713829
8,平均散射辐射（W/m2）,0.91147
9,有效云量（百分比%）:p,0.297664


In [39]:
fig = px.bar(SelfCorr, x='feature', y='self correlation',color='self correlation')
fig.show()

## Step 4: Periodicity Analysis

In [41]:
# Get the 3 most dominant periods
topk, sample_freq = DA.getFFTtopk('power')
topk

{'top_k_power': array([ 7958.15880727, 35059.3362724 , 68929.53829941]),
 'fft_periods': array([35233,    48,    96])}
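
At the 15-minute sampling interval, periods of 96 and 48 samples correspond to 24 hours and 12 hours, while a period of 35233 is the lowest non-zero frequency of the whole series rather than a repeating cycle. A common way to obtain such a ranking is to take the FFT amplitude spectrum, pick the k largest components, and convert each frequency index to a period. The sketch below follows that approach with NumPy; the function name `fft_top_k_periods` is hypothetical, and the return shape merely imitates the output above, so it should not be read as the code of `getFFTtopk`.

```python
import numpy as np


def fft_top_k_periods(values, k=3):
    """Return the k strongest FFT components and their periods (in samples)."""
    values = np.asarray(values, dtype=float)
    amplitude = np.abs(np.fft.rfft(values))
    # Skip index 0 (the mean component) and rank the rest in ascending order.
    top_freqs = np.argsort(amplitude[1:])[-k:] + 1
    periods = len(values) // top_freqs
    return {"top_k_power": amplitude[top_freqs], "fft_periods": periods}, amplitude


# Synthetic signal: a strong 96-sample cycle plus a weaker 48-sample cycle.
t = np.arange(960)
signal = np.sin(2 * np.pi * t / 96) + 0.5 * np.sin(2 * np.pi * t / 48)
result, spectrum = fft_top_k_periods(signal, k=2)
print(result["fft_periods"])
```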

In [42]:
fig = px.line(sample_freq)
fig.show()

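## Step 5: Stationarity Analysis
Stationarity analysis of the variables can reveal abnormal changes and fluctuations in the data.

The stationarity tests currently covered are:
- ADF
- Phillips-Perron
- DF-GLS
- KPSS
- Zivot-Andrews
- Variance Ratio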
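
To give a feel for what a unit-root test computes, here is a minimal NumPy sketch of the plain Dickey-Fuller regression with a constant and no lagged differences. It is a simplification of ADF, not the routine this library calls, and the function name `dickey_fuller_stat` is made up for the example. The test statistic is the t-statistic of the coefficient on the lagged level; values below the critical value reject the unit-root null hypothesis.

```python
import numpy as np


def dickey_fuller_stat(values):
    """t-statistic of gamma in: diff(y)_t = alpha + gamma * y_{t-1} + e_t."""
    y = np.asarray(values, dtype=float)
    dy = np.diff(y)
    X = np.column_stack([np.ones(len(dy)), y[:-1]])  # constant and lagged level
    coef, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ coef
    sigma2 = resid @ resid / (len(dy) - X.shape[1])  # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)            # OLS coefficient covariance
    return coef[1] / np.sqrt(cov[1, 1])


rng = np.random.default_rng(0)
white_noise = rng.normal(size=500)              # stationary by construction
random_walk = np.cumsum(rng.normal(size=500))   # contains a unit root

# 5% critical value for the constant-only case, as reported by the ADF and
# Phillips-Perron summaries in this section.
CRITICAL_5PCT = -2.86
wn_stat = dickey_fuller_stat(white_noise)
rw_stat = dickey_fuller_stat(random_walk)
print(wn_stat < CRITICAL_5PCT)
```

Unlike the unit-root tests, KPSS takes stationarity as its null hypothesis, so its p-values are read in the opposite direction.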

In [43]:
# Get the ADF stationarity test result
DA.getADF(start_col='power',end_col='power')

{'power': {'Test Statistic': -37.55249274211643,
  'P-value': 0.0,
  'Lags': 52,
  'Trend': 'c',
  'Summary': <class 'statsmodels.iolib.summary.Summary'>
  """
     Augmented Dickey-Fuller Results   
  Test Statistic                -37.552
  P-value                         0.000
  Lags                               52
  -------------------------------------
  
  Trend: Constant
  Critical Values: -3.43 (1%), -2.86 (5%), -2.57 (10%)
  Null Hypothesis: The process contains a unit root.
  Alternative Hypothesis: The process is weakly stationary.
  """}}

In [44]:
# Get the Phillips-Perron stationarity test result
DA.getPhillipsPerron(start_col='power',end_col='power')

{'power': {'Test Statistic': -29.706625650935326,
  'P-value': 0.0,
  'Lags': 52,
  'Trend': 'c',
  'Summary': <class 'statsmodels.iolib.summary.Summary'>
  """
       Phillips-Perron Test (Z-tau)    
  Test Statistic                -29.707
  P-value                         0.000
  Lags                               52
  -------------------------------------
  
  Trend: Constant
  Critical Values: -3.43 (1%), -2.86 (5%), -2.57 (10%)
  Null Hypothesis: The process contains a unit root.
  Alternative Hypothesis: The process is weakly stationary.
  """}}

In [45]:
# Get the DF-GLS stationarity test result
DA.getDFGLS(start_col='power',end_col='power')

{'power': {'Test Statistic': -20.19754520900222,
  'P-value': 0.0,
  'Lags': 52,
  'Trend': 'c',
  'Summary': <class 'statsmodels.iolib.summary.Summary'>
  """
        Dickey-Fuller GLS Results      
  Test Statistic                -20.198
  P-value                         0.000
  Lags                               52
  -------------------------------------
  
  Trend: Constant
  Critical Values: -2.57 (1%), -1.94 (5%), -1.62 (10%)
  Null Hypothesis: The process contains a unit root.
  Alternative Hypothesis: The process is weakly stationary.
  """}}