## This notebook prepares buoy data extracted from the [CDIP website](https://cdip.ucsd.edu/) for a rogue wave forecasting task.

### The required libraries are imported here.

In [4]:
import os
import IPython
import IPython.display
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import time
import math
from tensorflow import keras
import seaborn as sns
import random
import tensorflow as tf
from sklearn.metrics import confusion_matrix

- **A function to extract the wave heights and their respective indices in a given time series window. It locates the zero crossings in the window, takes the extremum (crest or trough) between each pair of consecutive crossings, and returns the largest height between neighbouring extrema together with its index.**
- **This identifies the index at which the maximum wave height is achieved, so that the data window can be shifted accordingly to fit our neural network training process.**

**As input for neural network training, the wave heights are normalized by the significant wave height of the window.**
**A sample data window is displayed here.**
![random wave window](random_wave_window_manyOne.jpg)
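
As a minimal sketch of this normalization, the snippet below estimates the significant wave height as four times the standard deviation of the surface displacement, which is the estimate used by the loading functions in this notebook, and divides the window by it. The synthetic signal and the names `z`, `hs`, and `z_norm` are illustrative assumptions, not buoy data.

```python
import numpy as np

rng = np.random.default_rng(0)
samplerate = 1.28                               # Hz, the value used in this notebook
t = np.arange(0, 20 * 60, 1 / samplerate)       # time axis of a 20-minute window

# Synthetic displacement: a 10 s swell plus noise (assumption for illustration only).
z = 0.8 * np.sin(2 * np.pi * t / 10.0) + 0.3 * rng.standard_normal(t.size)

hs = 4 * np.std(z)        # significant wave height estimate
z_norm = z / hs           # normalized displacement fed to the network

print(f"Hs = {hs:.3f}, 4*std of normalized window = {4 * np.std(z_norm):.3f}")
```

After normalization, four times the standard deviation of the window equals one, so windows recorded in different sea states share a common scale.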

In [5]:
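def find_max_wave_height(zdisp_window):
    """Return the largest height between neighbouring extrema and the index of its first extremum."""
    # Indices where the displacement changes sign (zero crossings).
    zero_crossings = np.where(np.diff(np.sign(zdisp_window)))[0]
    # Pad with the window boundaries so the first and last segments are included.
    zero_crossings = np.append(zero_crossings, len(zdisp_window)-1)
    zero_crossings = np.append(-1, zero_crossings)
    h_wave = np.zeros(len(zero_crossings)-1)
    t_wave = np.zeros(len(zero_crossings)-1)

    # Signed extremum (crest or trough) and its index between consecutive crossings.
    for iter_zero_crossing in range(len(zero_crossings)-1):
        segment_start = zero_crossings[iter_zero_crossing]+1
        segment_end = zero_crossings[iter_zero_crossing+1]+1
        peak_idx = np.argmax(np.abs(zdisp_window[segment_start:segment_end]))
        h_wave[iter_zero_crossing] = zdisp_window[segment_start+peak_idx]
        t_wave[iter_zero_crossing] = segment_start+peak_idx

    # Wave height is the absolute difference between neighbouring extrema.
    wave_heights = np.abs(np.diff(h_wave))
    max_wave_height = max(wave_heights)
    max_index = int(t_wave[np.argmax(wave_heights)])

    return max_wave_height, max_index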
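
To make the return values concrete, here is a small usage sketch on a hand-made displacement series. The helper `find_max_wave_height_demo` is a condensed restatement of the function above, included only so the example runs on its own; its name and the sample values are assumptions for illustration.

```python
import numpy as np

def find_max_wave_height_demo(zdisp_window):
    # Condensed restatement of find_max_wave_height so this example is self-contained.
    crossings = np.where(np.diff(np.sign(zdisp_window)))[0]
    crossings = np.concatenate(([-1], crossings, [len(zdisp_window) - 1]))
    extrema, positions = [], []
    for left, right in zip(crossings[:-1], crossings[1:]):
        segment = zdisp_window[left + 1:right + 1]
        peak = int(np.argmax(np.abs(segment)))
        extrema.append(segment[peak])        # signed crest or trough value
        positions.append(left + 1 + peak)    # its index in the window
    heights = np.abs(np.diff(extrema))       # heights between neighbouring extrema
    return float(heights.max()), int(positions[int(np.argmax(heights))])

# Crest of 2.0 at index 1, trough of -3.0 at index 4, then smaller oscillations.
zdisp = np.array([1.0, 2.0, 1.0, -1.0, -3.0, -1.0, 1.0, 1.5, 1.0, -1.0, -0.5])
max_height, max_idx = find_max_wave_height_demo(zdisp)
print(max_height, max_idx)  # 5.0 1
```

The returned index points at the first extremum of the largest wave (the crest at index 1 here), not at the trough that completes it.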

- **The objective of the rogue wave forecasting task is as follows.**
- **Given a window of time series data extracted from a buoy, the purpose of the task is to predict whether there will be a rogue wave within some fixed time horizon. The training data is prepared such that there are equal proportions of wave data windows leading to a rogue wave in the horizon and those that do not lead up to a rogue wave in the horizon.**
- **The training input is thus each such data window, while the output is determined by the presence or absence of a rogue wave at the end of the fixed forecasting horizon.**
- **Experiments have been carried out to observe how both the length of the training data window and the forecast horizon used in training affect the rogue wave forecasting accuracy of the trained neural network models.**

**An overview of the data window and the subsequent rogue wave to be forecast is shown in the illustration below. The forecast horizon $t_{horizon}$ is varied between 3, 5, and 10 minutes, and the length of the training window $t_{window}$ is varied between 15 and 20 minutes, to investigate their respective effects on the rogue wave forecasting accuracy.**
<div style="text-align: center;">
  <img src="Slide3.JPG" width="600">
</div>

**The functions below prepare, separately, the data windows that lead up to rogue waves and those that do not. These will be used to train our neural networks.**

In [6]:
def populate_rw_arrays(dir, array, start_idx, end_idx):
    # Append every rogue wave window found under dir/<buoy folder>/*.npz to `array`.
    for folder in os.listdir(dir):
        print("Processing: " + folder)
        
        start_time = time.time()
        for file in os.listdir(dir+ "/" + folder):
            if file.endswith(".npz"):
                data=np.load(dir+"/"+ folder+"/"+file)
                z_tmp=data['zdisp'][start_idx:end_idx]
                # Significant wave height: 4x the standard deviation of the full record.
                significant_wave_height=4*np.std(data['zdisp'])
                # Append to the list passed in, rather than to a global variable.
                array.append(z_tmp/significant_wave_height)
        print("--- %s seconds ---" % (time.time() - start_time))
    return

In [7]:
def populate_norw_arrays(dir, array, end_idx):
    # Append every non-rogue wave window found under dir/<buoy folder>/*.npz to `array`.
    for folder in os.listdir(dir):
        print("Processing: " + folder)
        
        start_time = time.time()
        for file in os.listdir(dir+ "/" + folder):
            if file.endswith(".npz"):
                data=np.load(dir+"/"+ folder+"/"+file)
                z_tmp=data['zdisp_norw'][0:end_idx]
                # Significant wave height: 4x the standard deviation of the full record.
                significant_wave_height=4*np.std(data['zdisp_norw'])
                # Append to the list passed in, rather than to a global variable.
                array.append(z_tmp/significant_wave_height)

        print("--- %s seconds ---" % (time.time() - start_time))
    return
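
The loaders above expect one sub-folder per buoy, each holding `.npz` files with a displacement array. As a rough sketch of that layout, the snippet below writes one synthetic file into a temporary directory and reads it back the same way. The folder name `Buoy_demo`, the file name, the record length, and the random data are assumptions for illustration; only the `zdisp` key and the index values come from this notebook.

```python
import os
import tempfile
import numpy as np

# Hypothetical layout: <root>/<buoy folder>/<sample>.npz containing a 'zdisp' array.
root = tempfile.mkdtemp()
buoy_dir = os.path.join(root, "Buoy_demo")
os.makedirs(buoy_dir)

rng = np.random.default_rng(1)
record = rng.standard_normal(2304)        # synthetic stand-in for one buoy record
np.savez(os.path.join(buoy_dir, "sample_0.npz"), zdisp=record)

windows = []
start_idx, end_idx = 384, 1536            # a 15-minute slice at 1.28 Hz
for folder in sorted(os.listdir(root)):
    for file in sorted(os.listdir(os.path.join(root, folder))):
        if file.endswith(".npz"):
            data = np.load(os.path.join(root, folder, file))
            hs = 4 * np.std(data["zdisp"])                    # significant wave height
            windows.append(data["zdisp"][start_idx:end_idx] / hs)

print(len(windows), windows[0].shape)  # 1 (1152,)
```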

### The wave arrays are being created here.
- **The folders containing the extracted rogue wave and non-rogue wave windows are accessed here. They hold gigabytes of data extracted from CDIP buoys all along the coast of the USA, covering the entire duration of wave monitoring by CDIP.**
- **Every extracted data window has a duration of 20 minutes. The parameters below control the training window length and the forecast horizon.**
- **The length of the training window, $t_{window}$, is given by `window_length_in_min`. It is varied between 15 and 20 minutes.**
- **The forecast horizon $t_{horizon}$ is set through `window_start_min_rel_to_rw`, the start of the window in minutes relative to the rogue wave. It is varied to investigate forecasting accuracy for horizons of 0, 3, 5, and 10 minutes.**
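
The relation between these two parameters, the sample indices, and the resulting horizon can be sketched as follows. The helper `window_indices` and its parameter `rw_min` are hypothetical names; the 25-minute rogue wave position and the 1.28 Hz sample rate are the values used in the cell below.

```python
def window_indices(window_start_min_rel_to_rw, window_length_in_min,
                   samplerate=1.28, rw_min=25):
    """Return (start_idx, end_idx, horizon_min) for one window configuration."""
    # Window start, counted in samples from the beginning of the record.
    start_idx = round((rw_min + window_start_min_rel_to_rw) * 60 * samplerate)
    end_idx = start_idx + round(window_length_in_min * 60 * samplerate)
    # The horizon is the gap between the end of the window and the rogue wave.
    horizon_min = -(window_start_min_rel_to_rw + window_length_in_min)
    return start_idx, end_idx, horizon_min

# First entry is the configuration used below; the others are illustrative.
for start_min, length_min in [(-20, 15), (-25, 15), (-25, 20)]:
    print(start_min, length_min, window_indices(start_min, length_min))
```

With a start of -20 minutes and a length of 15 minutes, the window covers samples 384 to 1536 (1152 samples) and ends 5 minutes before the rogue wave.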

In [5]:
root_folder_rw = os.getcwd()+"/wave_height_g_2"
root_folder_norw = os.getcwd()+"/wave_height_g_2"

window_start_min_rel_to_rw=-20
window_length_in_min=15

samplerate=1.28

rw_idx=round(25*60*samplerate)
start_idx=round((25+window_start_min_rel_to_rw)*60*samplerate)
end_idx=start_idx+round(window_length_in_min*60*samplerate)

rw_dir=root_folder_rw + '/rw_samples'
z_disp_rw=[] 
populate_rw_arrays(rw_dir, z_disp_rw, start_idx, end_idx)

z_disp_norw=[]
norw_dir=root_folder_norw + '/norw_samples'
end_idx=end_idx-start_idx
populate_norw_arrays(norw_dir, z_disp_norw, end_idx)

Processing: Buoy_028
--- 3.081294536590576 seconds ---
Processing: Buoy_029
--- 5.4089508056640625 seconds ---
Processing: Buoy_036
--- 5.466543674468994 seconds ---
Processing: Buoy_043
--- 1.8274195194244385 seconds ---
Processing: Buoy_045
--- 3.488495111465454 seconds ---
Processing: Buoy_067
--- 3.6028573513031006 seconds ---
Processing: Buoy_071
--- 4.27984619140625 seconds ---
Processing: Buoy_076
--- 4.5134406089782715 seconds ---
Processing: Buoy_081
--- 0.0 seconds ---
Processing: Buoy_087
--- 0.0 seconds ---
Processing: Buoy_088
--- 0.0 seconds ---
Processing: Buoy_089
--- 0.0 seconds ---
Processing: Buoy_090
--- 0.0 seconds ---
Processing: Buoy_091
--- 0.439974308013916 seconds ---
Processing: Buoy_092
--- 4.199975967407227 seconds ---
Processing: Buoy_093
--- 1.672478437423706 seconds ---
Processing: Buoy_094
--- 3.560746908187866 seconds ---
Processing: Buoy_095
--- 0.914783239364624 seconds ---
Processing: Buoy_096
--- 2.2710278034210205 seconds ---
Processing: Buoy_097


### The arrays for training the model are created using NumPy

In [15]:
z_disp_rw=np.vstack(z_disp_rw)
print(z_disp_rw.shape)

z_disp_norw=np.vstack(z_disp_norw)
print(z_disp_norw.shape)

(169961, 1152)
(169961, 1152)


In [16]:
possible_total_waves = round(z_disp_rw.shape[0] + z_disp_norw.shape[0])
print(f"Total possible waves in our study: {possible_total_waves}")

Total possible waves in our study: 339922


### The training and validation datasets are generated for each of the rogue and non-rogue groups

In [17]:
np.random.seed(5)  
len_array_rw = len(z_disp_rw) 
len_array_norw = len(z_disp_norw)

indices_rw_train = np.random.choice(len_array_rw, round(0.8*len_array_rw), replace=False)
indices_rw_test = [ind not in indices_rw_train for ind in range(0,len_array_rw)]

indices_norw_train = np.random.choice(len_array_norw, round(0.8*len_array_norw), replace=False)
indices_norw_test = [ind not in indices_norw_train for ind in range(0,len_array_norw)]

z_disp_rw_train = z_disp_rw[indices_rw_train]
labels_rw_train = np.ones(len(z_disp_rw_train))
z_disp_rw_test = z_disp_rw[indices_rw_test]
labels_rw_test = np.ones(len(z_disp_rw_test))

z_disp_norw_train = z_disp_norw[indices_norw_train]
labels_norw_train = np.zeros(len(z_disp_norw_train))
z_disp_norw_test = z_disp_norw[indices_norw_test]
labels_norw_test = np.zeros(len(z_disp_norw_test))
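
Building the test selection with `ind not in indices_rw_train` scans the training indices once per sample, which can be slow for arrays of this size. A possible alternative, sketched here on a small synthetic array, is to start from an all-`True` boolean mask and switch off the training indices. The array `demo`, its size, and the use of `np.random.default_rng` are assumptions for illustration, so the drawn indices differ from those produced by `np.random.seed(5)` above.

```python
import numpy as np

rng = np.random.default_rng(5)
n_samples = 1000                          # illustrative size, not the real dataset
demo = rng.standard_normal((n_samples, 8))

# Draw 80% of the rows for training without replacement.
train_idx = rng.choice(n_samples, round(0.8 * n_samples), replace=False)

# Boolean mask that is True only for rows not drawn for training.
test_mask = np.ones(n_samples, dtype=bool)
test_mask[train_idx] = False

demo_train, demo_test = demo[train_idx], demo[test_mask]
print(demo_train.shape, demo_test.shape)  # (800, 8) (200, 8)
```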

### The model training data and the test data are generated using equal proportions of the rogue wave and non-rogue wave groups.

In [18]:
wave_data_train = np.concatenate((z_disp_rw_train, z_disp_norw_train), axis=0)
label_train = np.concatenate((labels_rw_train, labels_norw_train), axis=0)

idx = np.random.permutation(len(wave_data_train))
wave_data_train = wave_data_train[idx]
label_train=label_train[idx]

wave_data_test = np.concatenate((z_disp_rw_test, z_disp_norw_test), axis=0)
label_test = np.concatenate((labels_rw_test, labels_norw_test), axis=0)

idx_test = np.random.permutation(len(wave_data_test))
wave_data_test = wave_data_test[idx_test]
label_test=label_test[idx_test]

wave_data_train = wave_data_train.reshape((wave_data_train.shape[0], wave_data_train.shape[1], 1))
wave_data_test = wave_data_test.reshape((wave_data_test.shape[0], wave_data_test.shape[1], 1))

num_waves_total = len(wave_data_train)+len(wave_data_test)

print(f"The total number of wave samples in the case with relative rogue wave ratio 0.5 is {num_waves_total}.", end='\n')

np.savez(os.getcwd()  +'/DataPrepared2/RWs_H_g_2_tadv_5min_rw_smallWindow_0.5',wave_data_train=wave_data_train, wave_data_test=wave_data_test,label_train=label_train,label_test=label_test)

The total number of wave samples in the case with relative rogue wave ratio 0.5 is 339922.
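
For reading the prepared data back, a minimal round-trip sketch is given below. It uses a temporary directory and tiny placeholder arrays; the file name `demo_ratio_0.5` is hypothetical. Note that `np.savez` appends `.npz` when the given name does not already end with it, so a name ending in `_0.5` is stored as `..._0.5.npz`.

```python
import os
import tempfile
import numpy as np

out_dir = tempfile.mkdtemp()
base = os.path.join(out_dir, "demo_ratio_0.5")     # hypothetical file name

# Tiny placeholders with the same layout as the arrays saved above.
x_train = np.zeros((4, 1152, 1))
y_train = np.array([1.0, 0.0, 1.0, 0.0])
np.savez(base, wave_data_train=x_train, label_train=y_train)

# The file on disk carries the appended ".npz" extension.
with np.load(base + ".npz") as data:
    loaded_x = data["wave_data_train"]
    loaded_y = data["label_train"]

print(sorted(os.listdir(out_dir)), loaded_x.shape, loaded_y.mean())
```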


In [19]:
print(label_train.shape)
print(wave_data_train.shape)

(271938,)
(271938, 1152, 1)


### Training data is also generated for experiments using different proportions of rogue wave and non-rogue wave data in the training process.
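
The sample counts for a target rogue wave ratio follow from simple proportions: the larger class keeps all of its samples and the other class is subsampled. A minimal sketch of that arithmetic, mirroring the cell below, is shown here. The helper name `subsample_counts` is hypothetical; the count of 135969 training windows per class corresponds to the 80% split of 169961 windows.

```python
def subsample_counts(ratio, n_rw, n_norw):
    """Return (num_rw, num_norw) so that num_rw / (num_rw + num_norw) is close to ratio."""
    if ratio >= 0.5:
        # Keep all rogue wave samples and subsample the non-rogue class.
        num_rw = n_rw
        num_norw = round(((1 - ratio) / ratio) * num_rw)
    else:
        # Keep all non-rogue samples and subsample the rogue wave class.
        num_norw = n_norw
        num_rw = round((ratio / (1 - ratio)) * num_norw)
    return num_rw, num_norw

n_train = 135969   # training windows per class after the 80% split
for ratio in [0.2, 0.3, 0.6]:
    num_rw, num_norw = subsample_counts(ratio, n_train, n_train)
    print(ratio, num_rw, num_norw, round(num_rw / (num_rw + num_norw), 4))
```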

In [20]:
relative_rw = [0.2, 0.3, 0.4, 0.6 ,0.7, 0.8]

for i in range(len(relative_rw)):
    ratio = relative_rw[i]
    if ratio >= 0.5:
        num_rw_train = z_disp_rw_train.shape[0]
        num_rw_test = z_disp_rw_test.shape[0]
        
        num_norw_train = round(((1-ratio)/ratio) * num_rw_train)
        num_norw_test = round(((1-ratio)/ratio) * num_rw_test)
    else:
        num_norw_train = z_disp_norw_train.shape[0]
        num_norw_test = z_disp_norw_test.shape[0]
        
        num_rw_train = round((ratio / (1-ratio))*num_norw_train)
        num_rw_test = round((ratio / (1-ratio))*num_norw_test)

    num_waves_total = num_norw_train + num_norw_test + num_rw_train + num_rw_test

    print(f"The number of training rogue wave samples in the case with relative rogue wave ratio {ratio} is {num_rw_train}.")
    print(f"The number of training non-rogue wave samples in the case with relative rogue wave ratio {ratio} is {num_norw_train}.")
    print(f"The number of testing rogue wave samples in the case with relative rogue wave ratio {ratio} is {num_rw_test}.")
    print(f"The number of testing non-rogue wave samples in the case with relative rogue wave ratio {ratio} is {num_norw_test}.")
    print(f"The total number of wave samples in the case with relative rogue wave ratio {ratio} is {num_waves_total}.", end='\n\n\n')

    np.random.seed(5)  
    len_array_rw_train = len(z_disp_rw_train) 
    len_array_norw_train = len(z_disp_norw_train)

    len_array_rw_test = len(z_disp_rw_test) 
    len_array_norw_test = len(z_disp_norw_test)

    indices_rw_train = np.random.choice(len_array_rw_train, num_rw_train, replace=False)
    indices_rw_test = np.random.choice(len_array_rw_test, num_rw_test, replace=False)

    indices_norw_train = np.random.choice(len_array_norw_train, num_norw_train, replace=False)
    indices_norw_test = np.random.choice(len_array_norw_test, num_norw_test, replace=False)

    z_disp_rw_train_altered = z_disp_rw_train[indices_rw_train]
    labels_rw_train_altered = np.ones(len(z_disp_rw_train_altered))
    z_disp_rw_test_altered = z_disp_rw_test[indices_rw_test]
    labels_rw_test_altered = np.ones(len(z_disp_rw_test_altered))

    z_disp_norw_train_altered = z_disp_norw_train[indices_norw_train]
    labels_norw_train_altered = np.zeros(len(z_disp_norw_train_altered))
    z_disp_norw_test_altered = z_disp_norw_test[indices_norw_test]
    labels_norw_test_altered = np.zeros(len(z_disp_norw_test_altered))

    wave_data_train = np.concatenate((z_disp_rw_train_altered, z_disp_norw_train_altered), axis=0)
    label_train = np.concatenate((labels_rw_train_altered, labels_norw_train_altered), axis=0)

    idx = np.random.permutation(len(wave_data_train))
    wave_data_train = wave_data_train[idx]
    label_train=label_train[idx]

    wave_data_test = np.concatenate((z_disp_rw_test_altered, z_disp_norw_test_altered), axis=0)
    label_test = np.concatenate((labels_rw_test_altered, labels_norw_test_altered), axis=0)

    idx_test = np.random.permutation(len(wave_data_test))
    wave_data_test = wave_data_test[idx_test]
    label_test=label_test[idx_test]

    wave_data_train = wave_data_train.reshape(wave_data_train.shape[0], wave_data_train.shape[1], 1)
    wave_data_test = wave_data_test.reshape(wave_data_test.shape[0], wave_data_test.shape[1], 1)

    print(label_train.shape)
    print(wave_data_train.shape)
    print(label_test.shape)
    print(wave_data_test.shape)

    np.savez(os.getcwd()  +f'/DataPrepared2/RWs_H_g_2_tadv_5min_rw_smallWindow_{relative_rw[i]}',wave_data_train=wave_data_train, wave_data_test=wave_data_test,label_train=label_train,label_test=label_test)

The number of training rogue wave samples in the case with relative rogue wave ratio 0.2 is 33992.
The number of training non-rogue wave samples in the case with relative rogue wave ratio 0.2 is 135969.
The number of testing rogue wave samples in the case with relative rogue wave ratio 0.2 is 8498.
The number of testing non-rogue wave samples in the case with relative rogue wave ratio 0.2 is 33992.
The total number of wave samples in the case with relative rogue wave ratio 0.2 is 212451.


(169961,)
(169961, 1152, 1)
(42490,)
(42490, 1152, 1)
The number of training rogue wave samples in the case with relative rogue wave ratio 0.3 is 58272.
The number of training non-rogue wave samples in the case with relative rogue wave ratio 0.3 is 135969.
The number of testing rogue wave samples in the case with relative rogue wave ratio 0.3 is 14568.
The number of testing non-rogue wave samples in the case with relative rogue wave ratio 0.3 is 33992.
The total number of wave samples in the case wit

### Plotting some of the samples from the created training data

In [None]:
import random

N_plots=3
selected_samples=random.sample(range(0, len(z_disp_rw_train_altered)), N_plots)

plt.figure()
for iter_plot in selected_samples:
    print(iter_plot)
    # Plot from the same array the sample indices were drawn from.
    plt.plot(z_disp_rw_train_altered[iter_plot,:], label='Sample '+str(iter_plot))

plt.legend(loc='best')    
plt.show()