<a href="https://colab.research.google.com/github/Ahirvoas/Training-Day1/blob/main/Tutorial_day1-Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Wind Turbine Modeling Workshop

This notebook uses three datasets that describe the operation of wind turbines:

* `scada_data.csv`: Over 60 parameters and statuses of wind turbine components, recorded by the Supervisory Control and Data Acquisition (SCADA) system. They describe the operating conditions and performance of the turbines.

* `fault_data.csv`: The fault types (or modes) that occur in the wind turbines. It is used to diagnose and understand the failure mechanisms that affect turbine performance and reliability.

* `status_data.csv`: Descriptions of the operational statuses of the wind turbines, that is, the states and conditions under which they operate.

In [None]:
import pandas as pd 

### 1. Read data

In [None]:
scada_df = pd.read_csv('https://raw.githubusercontent.com/Ahirvoas/Training-Day1/refs/heads/main/data/scada_data.csv')
scada_df['DateTime'] = pd.to_datetime(scada_df['DateTime'])

In [None]:
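# Unlike the other two files, this one is read from the working directory,
# so fault_data.csv must be present there.
fault_df = pd.read_csv('fault_data.csv')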
fault_df['DateTime'] = pd.to_datetime(fault_df['DateTime'])

In [None]:
status_df = pd.read_csv('https://raw.githubusercontent.com/Ahirvoas/Training-Day1/refs/heads/main/data/status_data.csv')
status_df['DateTime'] = pd.to_datetime(status_df['DateTime'])

In [None]:
scada_df.head()

**Exercise** 

Display the first 5 rows of each dataframe.

In [None]:
fault_df.Fault.unique()

The fault dataset includes five types of faults, identified by the codes in the `Fault` column:

* gf: Generator Heating Fault
* mf: Mains Failure Fault
* ff: Feeding Fault
* af: Timeout Warning Message - Malfunction Air Cooling
* ef: Excitation Error - Overvoltage DC-Link

This information is essential for diagnosing and understanding the different failure mechanisms that may affect turbine performance and reliability.

### 2. Time series analysis

Plot the different time spans corresponding to the three previous datasets stored in the `scada_df`, `fault_df` and `status_df` dataframes.
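Before plotting, it can help to tabulate the first and last timestamp of each dataframe. The following is a minimal sketch on toy data: the three small frames and their dates are hypothetical stand-ins for `scada_df`, `fault_df` and `status_df`, and only the `DateTime` column name is taken from this notebook.

```python
import pandas as pd

# Hypothetical stand-ins for scada_df, fault_df and status_df (made-up dates)
toy_frames = {
    "scada": pd.DataFrame({"DateTime": pd.to_datetime(["2020-01-01", "2020-01-11"])}),
    "fault": pd.DataFrame({"DateTime": pd.to_datetime(["2020-01-03", "2020-01-08"])}),
    "status": pd.DataFrame({"DateTime": pd.to_datetime(["2019-12-30", "2020-01-12"])}),
}

# One row per dataset: first timestamp and last timestamp
spans = pd.DataFrame(
    [(name, df["DateTime"].min(), df["DateTime"].max()) for name, df in toy_frames.items()],
    columns=["dataset", "start", "end"],
).set_index("dataset")

# Subtracting two datetime columns gives the covered duration as a Timedelta
spans["duration"] = spans["end"] - spans["start"]
print(spans)
```

From a table like this, each time span can be drawn as one horizontal bar per dataset.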

Plot the maximum power together with the corresponding maximum wind speed from the SCADA data stored in the `scada_df` dataframe.

Do the same after resampling the data weekly.
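Weekly resampling can be sketched as follows. This is an illustration on toy data, not the workshop solution: the column names `max_windspeed` and `max_power` are hypothetical placeholders for the corresponding SCADA columns, and taking the weekly maximum is one possible choice of aggregation.

```python
import pandas as pd

# Toy daily series; the column names are hypothetical placeholders
toy = pd.DataFrame({
    "DateTime": pd.date_range("2020-01-01", periods=28, freq="D"),
    "max_windspeed": range(28),
    "max_power": [10 * v for v in range(28)],
})

# resample() needs a datetime index; "W" groups the rows into weeks ending on Sunday
weekly_max = toy.set_index("DateTime").resample("W").max()
print(weekly_max)
```

Calling `.plot()` on the resampled dataframe draws one line per column (this requires matplotlib).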

Plot the number of faults per month, using monthly resampled data.
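Counting faults per month follows the same resampling pattern. A minimal sketch on toy data, assuming the fault dataframe has the `DateTime` and `Fault` columns used earlier in this notebook (the rows themselves are made up):

```python
import pandas as pd

# Made-up fault records for illustration
toy_faults = pd.DataFrame({
    "DateTime": pd.to_datetime(["2020-01-03", "2020-01-20", "2020-02-11", "2020-04-02"]),
    "Fault": ["gf", "mf", "ff", "gf"],
})

# "MS" bins by calendar month (labelled by month start);
# count() returns 0 for months that contain no fault
monthly_counts = toy_faults.set_index("DateTime")["Fault"].resample("MS").count()
print(monthly_counts)
```

A bar chart of this series, for example via `monthly_counts.plot.bar()`, shows how fault frequency changes over time.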

### 3. Combine SCADA and faults data

Combine the SCADA and fault data to pair each measurement with its associated fault, if any. To do so, we will use the pandas `merge` function on the `Time` column with the `how='outer'` argument.
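To see what an outer merge does before applying it to the real data, here is a small sketch on toy frames. The frames, their values, and the `Power` column are hypothetical; the shared key is assumed to be a column named `Time`.

```python
import pandas as pd

# Hypothetical measurements and faults sharing a "Time" key
toy_scada = pd.DataFrame({"Time": [100, 200, 300, 400], "Power": [1.0, 2.0, 3.0, 4.0]})
toy_fault = pd.DataFrame({"Time": [200, 400], "Fault": ["gf", "mf"]})

# how="outer" keeps every Time value found in either frame;
# rows without a matching fault get NaN in the Fault column
merged = toy_scada.merge(toy_fault, on="Time", how="outer")
print(merged)
```

An outer merge keeps all rows from both sides, so no measurement and no fault record is lost.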

In [None]:
df_combine =

After the merge, the `Fault` column contains many NaNs: these are SCADA timestamps with no matching fault timestamp, simply because no fault occurred at those times. We will replace these NaNs with "NF", which stands for No Fault (normal condition).
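Replacing missing labels can be sketched with `fillna` on a toy column (the data below is made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy merged frame: NaN marks timestamps with no recorded fault
toy = pd.DataFrame({"Time": [100, 200, 300], "Fault": [np.nan, "gf", np.nan]})

# fillna replaces only the missing entries and leaves existing labels untouched
toy["Fault"] = toy["Fault"].fillna("NF")
print(toy)
```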

In [None]:
df_combine['Fault'] =

Print the averages of SCADA values grouped by fault modes.
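One way to compute per-group averages is `groupby` followed by `mean`. A minimal sketch on toy data, where the `Power` and `Temp` columns and their values are hypothetical:

```python
import pandas as pd

# Toy frame with a fault label and two made-up numeric measurements
toy = pd.DataFrame({
    "Fault": ["NF", "NF", "gf", "gf"],
    "Power": [100.0, 200.0, 10.0, 30.0],
    "Temp": [40.0, 42.0, 80.0, 90.0],
})

# numeric_only=True skips non-numeric columns such as datetimes or text
means = toy.groupby("Fault").mean(numeric_only=True)
print(means)
```

The result has one row per fault mode and one column per numeric variable, which makes it easy to compare faulty and normal operation.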

### 4. Data preparation for ML

In [None]:
df_combine.Fault.value_counts().plot.pie(title='Fault Modes')

In [None]:
df_combine.Fault.value_counts()

There are far more NF (normal condition) records than faulty records, so the dataset is imbalanced. To reduce the imbalance, we will sample the No Fault dataframe and keep only 300 records.
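The undersampling idea can be sketched on a toy frame as follows. The class sizes and the `value` column are made up, and fixing `random_state` is an assumption added here so that the sample is reproducible.

```python
import pandas as pd

# Toy imbalanced frame: 1000 normal records versus 60 faulty ones
toy = pd.DataFrame({
    "Fault": ["NF"] * 1000 + ["gf"] * 40 + ["mf"] * 20,
    "value": range(1060),
})

# Randomly keep 300 of the normal records (random_state is an assumption for reproducibility)
toy_nf = toy[toy["Fault"] == "NF"].sample(n=300, random_state=42)

# Keep every faulty record
toy_f = toy[toy["Fault"] != "NF"]

# Stack both parts and renumber the rows
toy_balanced = pd.concat([toy_nf, toy_f], axis=0).reset_index(drop=True)
print(toy_balanced["Fault"].value_counts())
```

Only the majority class is reduced; all faulty records are kept because they are the scarce examples.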

In [None]:
# No fault mode dataframe
df_nf =

In [None]:
# With fault mode dataframe
df_f =

In [None]:
# Combine no fault and faulty dataframes

df_proportional = pd.concat((df_nf, df_f), axis=0).reset_index(drop=True)

To prepare the training dataset, we **drop irrelevant features**. First, we drop the datetime, time, and error columns. Next, we drop features that are "de facto" outputs of the wind turbine, such as power from wind, operating hours, and kWh production. Climatic variables such as wind speed are also not useful.

In [None]:
df_final = df_proportional.loc[:, [
    'WEC: ava. windspeed',
    'WEC: ava. Rotation',
    'WEC: ava. Power',
    'WEC: ava. reactive Power',
    'WEC: ava. blade angle A',
    'Spinner temp.',
    'Front bearing temp.',
    'Rear bearing temp.',
    'Pitch cabinet blade A temp.',
    'Pitch cabinet blade B temp.',
    'Pitch cabinet blade C temp.',
    'Rotor temp. 1',
    'Rotor temp. 2',
    'Stator temp. 1',
    'Stator temp. 2',
    'Nacelle ambient temp. 1',
    'Nacelle ambient temp. 2',
    'Nacelle temp.',
    'Nacelle cabinet temp.',
    'Main carrier temp.',
    'Rectifier cabinet temp.',
    'Yaw inverter cabinet temp.',
    'Fan inverter cabinet temp.',
    'Ambient temp.',
    'Tower temp.',
    'Control cabinet temp.',
    'Transformer temp.',
    'Inverter cabinet temp. averages',
    'Inverter cabinet temp. std dev',
    'Fault']]

In [None]:
df_final.Fault.value_counts().plot.pie(title='Fault Modes')