### Loading Experimental Data

First, the laboratory corrosion data is loaded so that it can be used to train the artificial neural network (ANN). Each record gives the measured corrosion rate of a sample in seawater of known temperature, salinity, dissolved oxygen and pH.

In [1]:
import os
import pandas as pd

def read_xlsx(filename="new_run.xlsx"):
    """
    reads an excel file stored within the "data" folder of the current working directory

    INPUTS:
        filename:  name of the xlsx file
    """
    current_dir = os.getcwd()  # Get the current working directory
    target_dir = os.path.join(current_dir, "data")  # Join the target folder to the current directory
    imm = os.path.join(target_dir, filename)
    imm_df = pd.read_excel(imm)
    return imm_df

df = read_xlsx()
# df
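
Before going further, it can help to confirm that the spreadsheet contains the columns the later cells rely on. The sketch below is one possible check, not part of the original workflow: `check_columns` is a hypothetical helper, the column names are taken from the cells further down this notebook, and the toy dataframe stands in for the real file.

```python
import pandas as pd

# Column names assumed from the later cells of this notebook
REQUIRED_COLS = ["Temperature", "Salinity", "Dissolved oxygen", "pH", "Corrosion rate"]

def check_columns(df, required=REQUIRED_COLS):
    """Return the required column names that are missing from df (hypothetical helper)."""
    return [col for col in required if col not in df.columns]

# Toy frame standing in for the real spreadsheet (illustrative values only)
toy = pd.DataFrame({"Temperature": [20.0], "Salinity": [35.0], "pH": [8.1]})
missing = check_columns(toy)
print(missing)
```

An empty list means every expected column is present. On the real data the call would be `check_columns(df)`.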

#### Filtering Data to the Global Seawater Range

Next, the data is filtered so that only rows with every parameter inside the global seawater range [1] are kept:

| Parameter               | Global Seawater Range   |
| ----------------------- | :---------------------: |
| Temperature (°C)        | -2 to 35                |
| Dissolved Oxygen (mg/L) | 4 to 10.4               |
| Salinity (ppt)          | 27 to 40                |
| pH                      | 7 to 8.4                |


<small>[1] Wang, Z., Sobey, A. J., & Wang, Y. (2021). Corrosion prediction for bulk carrier via data fusion of survey and experimental measurements. Materials & Design, 208, 109910.</small>

In [2]:
def filter_df(df):
    """
    filters the dataframe to the global seawater upper and lower bounds for each column
    """
    # Dictionary of column names and their bounds
    col_bounds = {
        'Temperature': (-2, 35),
        'Dissolved oxygen': (4, 10.4),
        'Salinity': (27, 40),
        'pH': (7, 8.4)
    }

    df_filt = df.copy()
    
    for col_name, (lower_bd, upper_bd) in col_bounds.items():
        df_filt = df_filt[
            (df_filt[col_name] >= lower_bd) & (df_filt[col_name] <= upper_bd)
        ]
    
    return df_filt

df_filt = filter_df(df)
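
To see which parameter is responsible for most of the discarded rows, the out-of-range rows can be counted per column. The sketch below is an illustrative addition: `count_out_of_range` is a hypothetical helper, the bounds are copied from the table above, and the toy dataframe uses made-up values rather than the lab data.

```python
import pandas as pd

# Bounds copied from the global seawater range table
col_bounds = {
    "Temperature": (-2, 35),
    "Dissolved oxygen": (4, 10.4),
    "Salinity": (27, 40),
    "pH": (7, 8.4),
}

def count_out_of_range(df, bounds=col_bounds):
    """Count, for each parameter, the rows lying outside its bounds (hypothetical helper)."""
    return {
        col: int((~df[col].between(lower_bd, upper_bd)).sum())
        for col, (lower_bd, upper_bd) in bounds.items()
    }

# Toy data: row 2 is too warm, row 3 has low dissolved oxygen and low pH
toy = pd.DataFrame({
    "Temperature": [20.0, 40.0, 25.0],
    "Dissolved oxygen": [6.0, 6.0, 2.0],
    "Salinity": [35.0, 35.0, 35.0],
    "pH": [8.1, 8.1, 6.5],
})
report = count_out_of_range(toy)
print(report)
```

A row can fail more than one bound, so the counts may add up to more than the number of rows removed by the filter.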


Issues with the filtered corrosion data:
- Source [44] in <small>[1] Wang, Z., Sobey, A. J., & Wang, Y. (2021)</small> specifies a temperature range of 25 ± 5 °C and constant pH, dissolved oxygen and salinity.
- Source [48] does not provide salinity measurements. Paper [1] assumes a salinity of 27.15 ppt, but this has not been verified: the concentration of sea salts changes, yet a constant salinity value is assigned.

Because these two data sources hold the seawater parameters constant throughout, they can leave the dataset highly skewed towards a few parameter values.
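
One way to gauge this skew is to look at how much of each column is taken up by its single most common value. The sketch below is an assumption about how such a check could be done, not something from the original analysis: `dominant_value_share` is a hypothetical helper and the toy values are illustrative only.

```python
import pandas as pd

def dominant_value_share(df, cols):
    """For each column, return (most common value, fraction of rows holding it)."""
    shares = {}
    for col in cols:
        # value_counts sorts by frequency, so the first entry is the most common value
        counts = df[col].value_counts(normalize=True)
        shares[col] = (counts.index[0], float(counts.iloc[0]))
    return shares

# Toy data: salinity is dominated by one assigned value, pH is not
toy = pd.DataFrame({
    "Salinity": [27.15, 27.15, 27.15, 35.0],
    "pH": [8.1, 8.0, 8.2, 7.9],
})
shares = dominant_value_share(toy, ["Salinity", "pH"])
print(shares["Salinity"])
```

A share close to 1 means the column is effectively constant, so the ANN has little information from which to learn the effect of that parameter.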


In [3]:
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split

def extract_Xy(df):
    """
    extracts the input (X) and output (y) from a dataframe
    """
    input_cols = ["Temperature", "Salinity", "Dissolved oxygen", "pH"]
    output_col = ["Corrosion rate"]
    X = np.array(df[input_cols])
    y = np.array(df[output_col])
    return X, y

X, y = extract_Xy(df_filt)

### Preparing the data for the ANN

Next, the inputs are normalised to between 0 and 1 using the global seawater bounds, as this helps the performance of the ANN. The global seawater bounds are used instead of the minimum and maximum of the input data because the ANN will later be applied to global seawater datasets, so the normalisation of the lab data has to cover the full range of that data.

In [4]:
def norm_input(X):
    """
    Normalise each feature (between 0 and 1) based on the global seawater bounds
    """

    # Bounds listed in the same column order as input_cols in extract_Xy
    col_limits = [
        (-2, 35),  # temperature
        (27, 40),  # salinity
        (4, 10.4), # dissolved oxygen
        (7, 8.4)   # pH
    ]

    normalised_features = []

    for col in range(X.shape[1]):
        min_limit, max_limit = col_limits[col]
        feature = X[:, col]
        normalised_feature = (feature - min_limit) / (max_limit - min_limit)
        normalised_features.append(normalised_feature)

    normalised_data = np.column_stack(normalised_features)
    return normalised_data

X_n = norm_input(X)
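
As a rough illustration of the difference between the two approaches, the sketch below scales two made-up samples using the fixed global bounds and, for comparison, the minimum and maximum of the samples themselves. The sample values are illustrative and not taken from the dataset. The bounds are written in the order Temperature, Salinity, Dissolved oxygen, pH, which is assumed to match `input_cols` in `extract_Xy`.

```python
import numpy as np

# Global seawater bounds, ordered as Temperature, Salinity, Dissolved oxygen, pH
lower = np.array([-2.0, 27.0, 4.0, 7.0])
upper = np.array([35.0, 40.0, 10.4, 8.4])

# Two toy lab samples (illustrative values only)
X_toy = np.array([
    [25.0, 35.0, 6.0, 8.0],
    [30.0, 36.0, 8.0, 8.2],
])

# Scaling with the fixed global bounds
X_fixed = (X_toy - lower) / (upper - lower)

# Scaling with the min and max of the samples themselves, for comparison
X_minmax = (X_toy - X_toy.min(axis=0)) / (X_toy.max(axis=0) - X_toy.min(axis=0))

print(X_fixed[:, 0])   # temperatures placed within the global -2 to 35 range
print(X_minmax[:, 0])  # temperatures stretched to the ends of the 0 to 1 range
```

With the fixed bounds, 25 °C maps to (25 + 2) / 37 ≈ 0.73 whatever else is in the dataset, so a seawater measurement of 25 °C from another source receives the same scaled value. With data-based scaling the same temperature maps to 0 here, only because it happens to be the smallest value in the toy sample.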

Finally, the dataset is split into a training set (70%), a validation set (10%) and a test set (20%).

In [5]:
def split_dataset(X, y, random_seed=1, train_r=0.7, val_r=0.1, test_r=0.2):
    """
    splits X and y into training, validation and test sets using the given ratios
    """
    # First separate the training set, then divide the remainder into validation and test sets
    X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=1 - train_r, random_state=random_seed)
    X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=test_r / (test_r + val_r), random_state=random_seed)
    return X_train, X_val, X_test, y_train, y_val, y_test
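
A quick sanity check on the two-stage split is to compare the resulting set sizes with the intended ratios. The sketch below repeats the same two `train_test_split` calls on placeholder arrays of 100 samples, an assumption made only for illustration. Because set sizes are rounded to whole samples, the counts may differ by one from the nominal 70/10/20.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 samples with 4 inputs and 1 output (illustrative only)
X_toy = np.arange(400, dtype=float).reshape(100, 4)
y_toy = np.arange(100, dtype=float).reshape(100, 1)

train_r, val_r, test_r = 0.7, 0.1, 0.2

# Stage 1: separate the training set from the remainder
X_train, X_temp, y_train, y_temp = train_test_split(
    X_toy, y_toy, test_size=1 - train_r, random_state=1
)
# Stage 2: divide the remainder into validation and test sets
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=test_r / (test_r + val_r), random_state=1
)

sizes = (len(X_train), len(X_val), len(X_test))
print(sizes)
```

The second call uses `test_r / (test_r + val_r)` because it operates on the 30% left over after the first call, so the test set needs two thirds of that remainder to reach 20% of the full dataset.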