<a href="https://colab.research.google.com/github/Ahirvoas/Training-Day1/blob/main/Tutorial_day1-Anomaly.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Wind Turbine Modeling Workshop

In [None]:
!pip install gdown &> /dev/null
!gdown https://drive.google.com/uc?id=1MAAFJDEslGNRUQrVEH_leK1R-Tbo1n4M
!gdown https://drive.google.com/uc?id=1tGANUJSnGvMd3NPfcpKINCXkImjwh97m
!gdown https://drive.google.com/uc?id=1K8YVo_dctLD9-jK2c2gQv2PFZkNcGirf
!gdown https://drive.google.com/uc?id=1CA1FbRehTR5XCuTkBY0OmhxK6pMZ4Xe4

In [None]:

!ls

This notebook uses three datasets describing wind turbine operation:

* `scada_data.csv`: over 60 parameters and statuses of the wind turbine components, recorded by the Supervisory Control and Data Acquisition (SCADA) system. They describe the operating conditions and performance of the turbines.

* `fault_data.csv`: the fault types (modes) that can occur in the wind turbines, used to diagnose and understand the failure mechanisms that affect turbine performance and reliability.

* `status_data.csv`: descriptions of the operational statuses of the wind turbines, that is, the states and conditions under which they operate.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt  # used by the plotting cells below

#### 1. Read data

In [None]:
scada_df = pd.read_csv('scada_data.csv')
scada_df['DateTime'] = pd.to_datetime(scada_df['DateTime'], format='%m-%d-%Y %H:%M')

In [None]:
fault_df = pd.read_csv('fault_data.csv')
fault_df['DateTime'] = pd.to_datetime(fault_df['DateTime'], format='%Y-%m-%d %H:%M:%S')

In [None]:
status_df = pd.read_csv('status_data.csv')
status_df['DateTime'] = pd.to_datetime(status_df['DateTime'], format='%d-%m-%Y %H:%M')
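
Each file stores its timestamps in a different layout, so each `pd.to_datetime` call above passes its own `format` string. The sketch below uses a made-up timestamp (not taken from the workshop files) to show why the format matters: an ambiguous string parses without error under both a month-first and a day-first format, but yields two different dates.

```python
import pandas as pd

# Hypothetical timestamp string, chosen so that day and month are both <= 12.
raw = pd.Series(['03-04-2014 10:30'])

# Month-first layout, as used for scada_data.csv above.
month_first = pd.to_datetime(raw, format='%m-%d-%Y %H:%M')

# Day-first layout, as used for status_data.csv above.
day_first = pd.to_datetime(raw, format='%d-%m-%Y %H:%M')

print(month_first.iloc[0])  # parsed as 4 March 2014
print(day_first.iloc[0])    # parsed as 3 April 2014
```

A quick sanity check on real data is to compare the minimum and maximum of the parsed column with the expected recording period.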

**Exercise**

Display the first 5 rows of each dataframe.

In [None]:
fault_df.Fault.unique()

The fault dataset includes three types of faults:

* GF: Generator Heating Fault
* AF: Timeout Warning Message - Malfunction Air Cooling
* EF: Excitation Error - Overvoltage DC-Link

#### 2. Time series analysis

Plot the time span covered by each of the three dataframes: `scada_df`, `fault_df` and `status_df`.

Plot the maximum power, named 'WEC: max. Power', over time from the SCADA data stored in the `scada_df` dataframe.

Repeat the plot after resampling the data weekly.

Plot the number of faults per month, using monthly resampled data.
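
One possible way to approach the resampling exercises, sketched on small synthetic frames rather than the workshop data: set `DateTime` as the index, then call `resample` with a weekly (`'W'`) or month-start (`'MS'`) rule. The `power` column, the `toy_` names and all values are placeholders.

```python
import pandas as pd

# Synthetic daily measurements covering four full weeks (Monday to Sunday).
toy_scada = pd.DataFrame({
    'DateTime': pd.date_range('2014-01-06', periods=28, freq='D'),
    'power': range(28),
})

# Weekly mean: index by time, then aggregate each calendar week.
weekly_power = toy_scada.set_index('DateTime')['power'].resample('W').mean()

# Synthetic fault log: three faults in January, one in February.
toy_faults = pd.DataFrame({
    'DateTime': pd.to_datetime(['2014-01-03', '2014-01-15',
                                '2014-01-28', '2014-02-10']),
    'Fault': ['GF', 'AF', 'EF', 'EF'],
})

# Monthly fault count: count the rows that fall in each month.
monthly_faults = toy_faults.set_index('DateTime')['Fault'].resample('MS').count()

print(weekly_power)
print(monthly_faults)
```

Calling `.plot()` on `weekly_power`, or `.plot(kind='bar')` on `monthly_faults`, would then draw the corresponding figure.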

#### 3. Combine SCADA and faults data

We combine the SCADA and fault data to pair each measurement with its associated fault, if any. To do so, we use the pandas `merge` function on the `Time` column with the `how='outer'` argument.

In [None]:
df_combine = scada_df.merge(fault_df, on='Time', how='outer')

The merged dataframe contains many NaNs: most SCADA timestamps have no matching fault timestamp, simply because no fault occurred at those times. We replace these NaNs with "NF", which stands for No Fault (normal condition).
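
A minimal sketch of this step on toy data: the frames and the `value` column are made up, and only the `Time`, `Fault` and `NF` names come from the text above. An outer merge keeps every row from both sides, and `fillna` then labels the unmatched SCADA rows as `NF`.

```python
import pandas as pd

# Toy stand-ins for the SCADA and fault tables.
toy_scada = pd.DataFrame({'Time': [1, 2, 3, 4],
                          'value': [0.5, 0.7, 0.6, 0.9]})
toy_faults = pd.DataFrame({'Time': [2, 4], 'Fault': ['GF', 'EF']})

# Outer merge: rows without a matching fault get NaN in 'Fault'.
toy_combined = toy_scada.merge(toy_faults, on='Time', how='outer')

# Label the fault-free rows as 'NF' (No Fault).
toy_combined['Fault'] = toy_combined['Fault'].fillna('NF')

print(toy_combined)
```

On the workshop data, the equivalent call would presumably be `df_combine['Fault'] = df_combine['Fault'].fillna('NF')`.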

Next, we exclude storm and no-wind periods, as well as records with zero wind speed or zero average power.

In [None]:

# Timestamps at which the status log reports a storm or a lack of wind
mask = status_df['Status Text'].isin(['Storm : Average windspeed - (10min)', 'Lack of wind : Wind speed to low'])
indexes = pd.to_datetime(status_df.DateTime[mask])  # Convert once

# Drop rows whose SCADA or fault timestamp falls in one of those periods
mask = ~df_combine['DateTime_x'].isin(indexes) & ~df_combine['DateTime_y'].isin(indexes)
df_combine = df_combine[mask]

# Drop rows with zero wind speed (average, max or min) or zero average power
mask = (df_combine['WEC: ava. windspeed'] != 0) & \
       (df_combine['WEC: max. windspeed'] != 0) & \
       (df_combine['WEC: min. windspeed'] != 0) & \
       (df_combine['WEC: ava. Power'] != 0)
df_combine = df_combine[mask]

print(df_combine.shape)  # print explicitly: only the last expression of a cell is displayed
df_combine.head()

Print the averages of SCADA values grouped by fault modes.
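
A possible approach, sketched on synthetic data: group by the `Fault` column and take the mean of the numeric columns. The `temp` and `power` columns and their values are placeholders, not workshop data.

```python
import pandas as pd

# Toy frame with a fault label and two numeric sensor columns.
toy = pd.DataFrame({
    'Fault': ['NF', 'NF', 'GF', 'GF', 'EF'],
    'temp': [40.0, 42.0, 80.0, 90.0, 55.0],
    'power': [100.0, 120.0, 30.0, 50.0, 0.0],
})

# Mean of every numeric column, one row per fault mode.
fault_means = toy.groupby('Fault').mean(numeric_only=True)
print(fault_means)
```

Passing `numeric_only=True` skips non-numeric columns such as timestamps, which cannot be averaged alongside the sensor readings.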

#### 4. Data preparation for ML

In [None]:
df_combine.Fault.value_counts()

There are far more NF (normal condition) records than faulty records, so the dataset is imbalanced. To rebalance it, we sample the No Fault records and keep only 300 of them.

In [None]:
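# Sample 300 No Fault records (random_state is an arbitrary choice, for reproducibility)
df_nf = df_combine[df_combine.Fault == 'NF'].sample(n=300, random_state=42)
# Keep all faulty records
df_f = df_combine[df_combine.Fault != 'NF']

# Combine no fault and faulty dataframes
df_proportional = pd.concat((df_nf, df_f), axis=0).reset_index(drop=True)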

To prepare the training dataset, we **drop irrelevant features**. First, we drop the datetime, time, and error columns. Next, we drop features that are de facto outputs of the wind turbine, such as power from wind, operating hours, and kWh production. Also, climatic variables such as wind speed are not considered useful.

In [None]:
df_final = df_proportional.loc[:, ['WEC: ava. windspeed',
                                   'WEC: ava. Rotation',
                                   'WEC: ava. Power',
                                   'WEC: ava. reactive Power',
                                   'WEC: ava. blade angle A',
                                   'Spinner temp.',
                                   'Front bearing temp.',
                                   'Rear bearing temp.',
                                   'Pitch cabinet blade A temp.',
                                   'Pitch cabinet blade B temp.',
                                   'Pitch cabinet blade C temp.',
                                   'Rotor temp. 1',
                                   'Rotor temp. 2',
                                   'Stator temp. 1',
                                   'Stator temp. 2',
                                   'Nacelle ambient temp. 1',
                                   'Nacelle ambient temp. 2',
                                   'Nacelle temp.',
                                   'Nacelle cabinet temp.',
                                   'Main carrier temp.',
                                   'Rectifier cabinet temp.',
                                   'Yaw inverter cabinet temp.',
                                   'Fan inverter cabinet temp.',
                                   'Ambient temp.',
                                   'Tower temp.',
                                   'Control cabinet temp.',
                                   'Transformer temp.',
                                   'Inverter averages',
                                   'Inverter std dev',
                                   'Fault']]

In [None]:
df_final.Fault.value_counts().plot.pie(title='Fault Modes')

In [None]:
plt.figure(figsize=(8, 6))

# Plot NF (True)
plt.scatter(
    df_final[df_final['Fault'] == 'NF']['WEC: ava. Rotation'],
    df_final[df_final['Fault'] == 'NF']['WEC: ava. reactive Power'],
    color='green', label='True (NF)', alpha=0.7
)

# Plot EF (Anomalies)
plt.scatter(
    df_final[df_final['Fault'] == 'EF']['WEC: ava. Rotation'],
    df_final[df_final['Fault'] == 'EF']['WEC: ava. reactive Power'],
    color='red', label='Anomalies (EF)', alpha=0.7
)

plt.xlabel('Average Rotation')
plt.ylabel('Average Reactive Power')
plt.title('Scatter Plot: NF vs EF')
plt.legend()
plt.grid(True)
plt.show()


In [None]:
from google.colab import drive

drive.mount('/content/drive')

In [None]:
df_final.to_csv('/content/drive/My Drive/dataset_anomalies.csv', index=False)