## Raw data processing

The goal is to:
- load the raw data
- collect each sensor into a single series
- process the manual removals
- save the result to the h5py file


TODO
- manual removals
- save to the h5py file

## Combine a sensor's raw data files into one

In [None]:

def load_raw_data_file(path, source, conversion):
    """Load one raw data file with the importer matching its source system."""
    if source == 'System2000':
        raw_data = import_system2000(path, conversion)
    elif source == 'iFix':
        raw_data = import_ifix(path, conversion)
    elif source == 'Danova':
        raw_data = import_danova(path, conversion)
    else:
        # an unknown source cannot be imported, so fail loudly
        raise ValueError(f"Unknown source {source!r}; expected 'System2000', 'iFix' or 'Danova'")
    return raw_data


def load_raw_data(metadata, sensor_data_path, save_path):
    """Load raw data files based on metadata."""

    raw_data_paths = {}

    # iterate each sensor group
    n_groups = metadata['IdMeasurement'].nunique()
    i_group = 0
    for sensor_name, sensor_group in metadata.groupby('IdMeasurement'):
        print(f"({i_group+1}/{n_groups}) Loading {sensor_name}")
        # accumulate the data from all files of this sensor in one DataFrame
        raw_sensor_data = pd.DataFrame()
        # iterate each row in the sensor group
        sensor_group = sensor_group.reset_index(drop=True)
        for i, row in sensor_group.iterrows():
            print(f"    {i+1}/{sensor_group.shape[0]}")
            # get the file path
            file_path = sensor_data_path / row['Folderpath'] / row['Filename']
            # load the raw data file: currently only adjust the datetime column
            sensor_data = load_raw_data_file(file_path, row['Source'], row['Conversion'])
            # sort by time
            sensor_data = sensor_data.sort_values(by='time')
            # append this file's data to the sensor's DataFrame
            raw_sensor_data = pd.concat([raw_sensor_data, sensor_data])
        # remove duplicated time
        raw_sensor_data = raw_sensor_data.drop_duplicates(subset=['time'])
        # sort by time
        raw_sensor_data = raw_sensor_data.sort_values(by='time')
        # remove nan values
        raw_sensor_data = raw_sensor_data.dropna(subset=['value'])
        # save the raw data as a pickle file
        file_path = save_path / f'{sensor_name}.pkl'
        raw_sensor_data.to_pickle(file_path)
        raw_data_paths[sensor_name] = file_path
        i_group += 1
        print(f"Saved {sensor_name} to {save_path / f'{sensor_name}.pkl'}")
        print('')
    # save the raw data paths
    with open(save_path / 'raw_data_paths.pkl', 'wb') as f:
        pickle.dump(raw_data_paths, f)
    return raw_data_paths
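
The per-sensor clean-up inside `load_raw_data` (concatenate, drop duplicate timestamps, sort, drop NaN values) can be checked on toy data. The sketch below is only an illustration: `combine_sensor_frames` and the two small frames are hypothetical, and it assumes the `time` and `value` column names used above.

```python
import pandas as pd


def combine_sensor_frames(frames):
    """Merge the per-file frames of one sensor into a single clean series."""
    # one concat over a list avoids repeatedly copying a growing DataFrame
    combined = pd.concat(frames, ignore_index=True)
    # same order of operations as in load_raw_data
    combined = combined.drop_duplicates(subset=["time"])
    combined = combined.sort_values(by="time")
    combined = combined.dropna(subset=["value"])
    return combined.reset_index(drop=True)


# two hypothetical files that overlap at 00:01 and contain one NaN reading
file_a = pd.DataFrame({
    "time": pd.to_datetime(["2020-01-01 00:00", "2020-01-01 00:01"]),
    "value": [1.0, 2.0],
})
file_b = pd.DataFrame({
    "time": pd.to_datetime(["2020-01-01 00:01", "2020-01-01 00:03", "2020-01-01 00:02"]),
    "value": [2.5, 4.0, float("nan")],
})
combined = combine_sensor_frames([file_a, file_b])
print(combined)
```

Because duplicates are dropped before NaN values, a NaN row that comes first would shadow a later valid reading at the same timestamp; dropping NaN values first would avoid that.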


RUNTIME: 20 minutes

GOAL: save as pickle for faster load time

In [None]:
save_path = INTERIM_DATA_DIR / 'Bellinge' / 'sensor-data'
# create the save path if it does not exist
save_path.mkdir(parents=True, exist_ok=True)

In [None]:
#raw_data_paths = load_raw_data(metadata, sensor_data_path, save_path)

In [None]:
# load the raw data paths
with open(save_path / 'raw_data_paths.pkl', 'rb') as f:
    raw_data_paths = pickle.load(f)

## Custom Pre-Processing

The goal is to:
- Resample data into 1-minute intervals

### TODO:
- save to the h5py file
- make for each individual sensor, not all combined
- then make for combined

In [None]:
def create_subset_data(raw_data_paths, min_time, max_time, resample_freq):
    """Goal: Create a subset of the raw data based on time and within a single dataframe."""
    sensor_names = list(raw_data_paths.keys())
    # create an empty DataFrame indexed by the target time range
    time_range = pd.date_range(start=min_time, end=max_time, freq=resample_freq)
    subset_data = pd.DataFrame(index=time_range)
    for i, sensor_name in enumerate(sensor_names):
        print(f"({i+1}/{len(sensor_names)}) Loading {sensor_name}")
        raw_data = pd.read_pickle(raw_data_paths[sensor_name])
        # make sure time is of the correct type
        raw_data['time'] = pd.to_datetime(raw_data['time'])
        # set time as index
        raw_data = raw_data.set_index('time')
        # extract the value column
        raw_data = raw_data[['value']]
        # rename the value column to the sensor name
        raw_data = raw_data.rename(columns={'value': sensor_name})

        ### Regrid onto the shared time range (`resample_freq`, e.g. 1 minute)
        # drop duplicate timestamps so the index can be reindexed
        raw_data = raw_data[~raw_data.index.duplicated(keep='first')]
        # add the regular grid points to the irregular raw index (new points are NaN)
        expanded_data = raw_data.reindex(raw_data.index.union(time_range))
        # interpolate in time, filling at most 2 consecutive missing points per gap
        # (longer gaps only get their first 2 points filled) and never
        # extrapolating beyond the first/last raw observation
        interpolated_data = expanded_data.interpolate(method='time', limit=2, limit_area='inside')
        # keep only the regular grid points between min_time and max_time
        regular_data = interpolated_data.loc[time_range]

        # add the sensor column to the subset data (the indices are identical)
        subset_data = pd.concat([subset_data, regular_data], axis=1)

    # sort by time
    subset_data = subset_data.sort_index()
    # save the subset data as a pickle file
    subset_data_path = save_path / 'subset_data.pkl'
    subset_data.to_pickle(subset_data_path)
    return subset_data_path
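
The regridding step in `create_subset_data` can be illustrated in isolation. This is a minimal sketch on made-up readings, not the pipeline itself: `to_regular_grid` is a hypothetical helper, and `limit_area="inside"` is an assumption added here so that nothing is filled before the first or after the last raw observation.

```python
import pandas as pd


def to_regular_grid(series, grid, limit=2):
    """Interpolate an irregular series onto `grid`, bridging only short gaps."""
    # union of raw timestamps and grid points; the new points start as NaN
    expanded = series.reindex(series.index.union(grid))
    # time-weighted linear interpolation, at most `limit` consecutive NaNs
    filled = expanded.interpolate(method="time", limit=limit, limit_area="inside")
    # keep only the regular grid points
    return filled.reindex(grid)


# made-up readings with a 3-minute and a 7-minute gap
raw = pd.Series(
    [0.0, 3.0, 10.0],
    index=pd.to_datetime([
        "2020-01-01 00:00:00",
        "2020-01-01 00:03:00",
        "2020-01-01 00:10:00",
    ]),
)
grid = pd.date_range("2020-01-01 00:00", "2020-01-01 00:10", freq="1min")
regular = to_regular_grid(raw, grid)
print(regular)
```

In this example the short gap is bridged completely, while the long gap has only its first two grid points filled and the rest left as NaN. `limit` therefore caps how many points are filled per gap; it does not skip long gaps altogether.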


In [None]:
# create a subset of the data
min_time = '2020-01-01 00:00:00'
max_time = '2020-12-31 23:59:59'
# resample seems to be 1 minute
resample_freq = '1min'
# RUNTIME: 2 minutes
# MEMORY: ~80 MB
subset_data_path = create_subset_data(raw_data_paths, min_time, max_time, resample_freq)

# TODO:

## Custom Processing

The goal is to:
- Handle errors within the data, using the pre-found outliers/errors listed below
- Compare quality with the other processing pipeline


### Pre-found outliers/errors
- *Manufacturer quality stamp*. These data were stamped with “low quality” in the iFIX SCADA system.
- *Manual remove*. These are data that for some reason were deemed untrustworthy, for instance observation values during maintenance or start-up periods.
- *Out of bounds*. These are data outside a defined physically meaningful range of possible values (e.g. bottom and top levels of a pipe/basin).
- *Frozen sensor*. These data do not change during a time period of e.g. 20 min.
- *Outlier*. These are data with spikes with a manually chosen height and duration; in our case this category is only applicable to interim Danova sensor data, which occasionally showed spike patterns which are probably not correct.
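
Two of the categories above, *Out of bounds* and *Frozen sensor*, are simple enough to sketch as boolean masks. The code below is only an illustration under stated assumptions: the function names, the bounds `0.0` and `2.0`, and the toy series are hypothetical, and the frozen check assumes a regular 1-minute grid so that 20 samples correspond to the 20 min mentioned above. It is not the rule set used to produce the pre-found flags.

```python
import pandas as pd


def flag_out_of_bounds(series, lower, upper):
    """True where a value lies outside the physically meaningful range."""
    return (series < lower) | (series > upper)


def flag_frozen(series, n_samples=20):
    """True for every sample in a run of at least `n_samples` identical values."""
    # a new run starts whenever the value differs from the previous one
    run_id = (series != series.shift()).cumsum()
    # length of the run each sample belongs to
    run_length = run_id.groupby(run_id).transform("count")
    return run_length >= n_samples


# toy 1-minute series: 5 varying values, 22 identical values, 3 more values
values = [0.1, 0.2, 0.3, 0.2, 0.1] + [0.5] * 22 + [0.6, 9.9, 0.7]
index = pd.date_range("2020-01-01 00:00", periods=len(values), freq="1min")
level = pd.Series(values, index=index)

out_of_bounds = flag_out_of_bounds(level, lower=0.0, upper=2.0)
frozen = flag_frozen(level, n_samples=20)
print(out_of_bounds.sum(), frozen.sum())
```

NaN values never compare equal, so each missing sample forms its own run of length one and is never flagged as frozen.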


## Notes on the pre-defined processing step

- The interpolation seems to be ill-defined
    - TODO: check how many missing values it can interpolate
- Not all files have the frozen_high column
- scaling factor comments?
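
One way to approach the interpolation TODO is to measure the lengths of the NaN runs in a series and compare them with the interpolation limit. This is a minimal sketch on made-up values; `nan_run_lengths` is a hypothetical helper and the limit of 2 is taken from the resampling code earlier in this notebook, not from the pre-defined processing step.

```python
import pandas as pd


def nan_run_lengths(series):
    """Return the length of every run of consecutive NaNs in `series`."""
    is_nan = series.isna()
    # a new run starts whenever the NaN state changes
    run_id = (is_nan != is_nan.shift()).cumsum()
    # count the samples in each NaN run
    return is_nan[is_nan].groupby(run_id[is_nan]).size().reset_index(drop=True)


nan = float("nan")
s = pd.Series([1.0, nan, nan, 4.0, nan, nan, nan, nan, nan, 10.0])

runs = nan_run_lengths(s)
limit = 2
# with a limit, each gap contributes at most `limit` filled points
n_filled = int(runs.clip(upper=limit).sum())
n_remaining = int(s.interpolate(limit=limit).isna().sum())
print(runs.tolist(), n_filled, n_remaining)
```

Comparing `n_filled` with the total number of missing values shows how much of the data an interpolation limit actually recovers, and the distribution in `runs` shows which limit would be needed to bridge most gaps.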