# **Preprocessing Notebook Description**

Description: This notebook contains functions that window, normalize, and smooth the data. At the bottom, a sequence of steps that uses them to preprocess the data is demonstrated. First, the time series are subset from 30 seconds before sample detect time to 40 seconds after sample detect time. If this full time range is not present in a reading, that reading is dropped. Next, the time series are normalized so that their values lie between 0 and 1. Finally, a convolution smoother with a Bartlett window of length 50 is applied to smooth the time series. The time series are then written to new CSV files.

## **Imports**
Here we have all of the import statements needed for the packages used for preprocessing. 

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import signal
from tsmoothie.smoother import ConvolutionSmoother
from scipy.signal import savgol_filter

## **Load Raw Original Data**

Here all of the original raw time series and predictor files are read into pandas data frames. 

**You will need to change the paths to match where you have these files stored**

In [None]:
# Import the original time series data files. 
ecd_ts = pd.read_csv('../Data/RawData/TimeSeries/PC_TS.csv')
un_ts = pd.read_csv('../Data/RawData/TimeSeries/US_TS.csv')
syn_ts = pd.read_csv('../Data/RawData/TimeSeries/PC_TS_Synthetic.csv')
con_ts =  pd.read_csv('../Data/RawData/TimeSeries/PC_TS_Contaminated.csv')

# Import the original Predictors files
syn_pred = pd.read_csv('../Data/RawData/Predictors/SyntheticPC.csv')
con_pred =  pd.read_csv('../Data/RawData/Predictors/PCAggContaminated.csv')
rem_pred = pd.read_csv('../Data/RawData/Predictors/RawAggData2021-2022.csv')


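# Import the original predictor ID files
# Note: the `squeeze` argument was removed from read_csv in pandas 2.0. It is omitted here on the
# assumption that each file is a single comma-separated row of IDs, in which case it had no effect.
ecd_id = pd.read_csv('../Data/RawData/Predictors/TestID_PC.txt', sep = ',', header = None).transpose()
un_id = pd.read_csv('../Data/RawData/Predictors/TestID_Unsuccessful.txt', sep = ',', header = None).transpose()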

# Drop the duplicated predictors (125 of the records were duplicated but with different data formats; all other info is the same).
dups = rem_pred['TestID'].duplicated()
# Remove the 125 duplicated IDs, keeping the first occurrence of each.
rem_pred = rem_pred[~dups]

## **Separate the predictors out for the different categories**

In [None]:
# Separate out the wild ECD and unsuccessful predictors. 
ecd_pred = rem_pred[rem_pred['TestID'].isin(ecd_id[0])]
un_pred = rem_pred[rem_pred['TestID'].isin(un_id[0])]

# Save the separated predictor files for future use. 
ecd_pred.to_csv('../Data/PreprocessedData/Predictors/ECD.csv', index = False)
un_pred.to_csv('../Data/PreprocessedData/Predictors/Unsuccessful.csv', index = False)

## **Windowing Functions**
### Function to window and zero the original data
This function takes the start time with respect to sample detect time (-30 means 30 seconds before sample detect time), the end time (40 means 40 seconds after sample detect time), and a data frame of time series where each row is a time series and each column is a time point. It assumes that the sampling rate is 5 Hz and that the data frame contains a 'TestId' column. It also assumes that the time series passed in have not already been altered, so the first time point is 0.2 (corresponding to the first sample). Finally, the data frame with the aggregate predictors for the time series must also be passed in so that the sample detect time can be looked up for each reading. The returned data frame has the time series zeroed with respect to sample detect time and subset to the times between start and end. The columns are also renamed so that the column names are the time in seconds with respect to sample detect time (e.g. '-30.0', '-29.8', ..., '39.8'; the end time itself is excluded).

Note that this function also drops any readings with a sample detect time at 0. When we looked at the data, all of these readings were unsuccessful with the return code `CannotCalculate`.
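
The index arithmetic described above can be sketched on a toy array. This is a minimal illustration rather than the notebook's function: the helper name `window_indices` and the toy values are assumptions, and it rounds to the nearest sample, whereas the function below truncates with `astype(int)`.

```python
import numpy as np

SAMPLE_PERIOD = 0.2  # seconds per sample at 5 Hz


def window_indices(sample_detect_time, start, end, period=SAMPLE_PERIOD):
    """Hypothetical helper: convert a window given in seconds relative to
    sample detect time into (first, last) positions measured in samples."""
    detect = int(round(sample_detect_time / period))
    first = detect + int(round(start / period))
    last = detect + int(round(end / period))
    return first, last


# Toy reading: 100 samples, with sample detect at 10 seconds.
reading = np.arange(100)
first, last = window_indices(10.0, start=-2, end=4)
window = reading[first:last]  # the end position is exclusive

print(first, last, len(window))
```

Because the end position is exclusive, a window from -2 to 4 seconds contains 30 samples, and the sample at exactly +4 seconds is not included.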

In [None]:
# Subset the time series based on start and end times relative to sample detect time. 
# start - seconds to start from relative to sample detect (eg. -5 would start 5 seconds before, 5 would start 5 seconds after)
# end - seconds to end from relative to sample detect. 
# ts - data frame where each row is a time series
# pred - data frame where each row is a reading, has the sample detect time column. 
def window_and_zero(start, end, ts, pred):
    #Remove records with sample detect time at 0. 
    ts = ts[ts['TestId'].isin(pred['TestID'][pred['SampleDetectTime']!=0])]
    
    # Get the list of IDs for both data sources. 
    ts_ids = ts['TestId']
    pred_ids = pred['TestID']
    
    #Drop ids from the time series. 
    ts = ts.drop('TestId', axis = 1)
    
    # Convert sample detect time from seconds to samples
    sample_detect = (pred['SampleDetectTime']/0.2).astype(int)
    
    # Make a data frame with test ids, and start and end indices to window based on. 
    index = pd.concat([pred_ids, int(start/0.2)+sample_detect, int(end/0.2) +sample_detect], axis = 1)
    index.columns = ["TestId", "Start", "End"]
    # Merge start and end indices into time series data frame based on ids. 
    ts['TestId'] = ts_ids
    ts = ts.merge(index, how = 'left', on = 'TestId')
    # Save the order of ids for later use. 
    ids = ts['TestId']
    
    #Subset each time series based on its start and end indices. Uses a list comprehension since the indices will be different for each time series. 
    subsets = [ts.iloc[row, ts['Start'][row]:ts['End'][row]].reset_index(drop = True) for row in range(len(ts))]
    
    #Now that all time series are the same length, and zeroed w.r.t sample detect, convert back to a dataframe. 
    subsets = pd.DataFrame(subsets)
    
    #Rename the columns to match the zeroed time stamps. 
    subsets.columns =  [str(round(m,1)) for m in np.arange(start,end, 0.2)]
    
    #Return a dataframe with test ids reattached. 
    return pd.concat([pd.DataFrame(ids),subsets], axis = 1)

### Function to window zeroed data 

This function can be used to subset the data frames that are output by the window_and_zero function. It is useful if you want to find something like the calibration/post/sample windows. 
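
As a small illustration of the label-based slicing this relies on, the sketch below builds a toy frame with string time labels and slices it with `.loc`. The toy labels and values are assumptions made for the example. Note that label slices include both endpoints, unlike positional slices.

```python
import pandas as pd

# Toy zeroed frame: string time labels plus a TestId column (made-up values).
labels = ['-0.4', '-0.2', '0.0', '0.2', '0.4']
toy = pd.DataFrame([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]], columns=labels)
toy.insert(0, 'TestId', [101, 102])

# .loc slices by label and keeps BOTH the start and the end column.
sub = toy.loc[:, '-0.2':'0.2']
result = pd.concat([toy['TestId'], sub], axis=1)

print(list(result.columns))
```

Both labels must exist as column names for the slice to work, which is why the start and end times need to be multiples of the 0.2 second sampling period.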

In [None]:
# Subset the zeroed data. Useful if you want to pull out calibration/sample/post windows. 
# start - the start time you want to subset w.r.t sample detect time, in seconds. Must be evenly divisible by 0.2 since that's the sampling period of the data. 
#
# end - the end time you want to subset w.r.t sample detect time, in seconds. It must be evenly divisible by 0.2 since that's the sampling period of the data. 
#
# zeroed_ts - a dataframe output by the window_and_zero() function. 

def window_zeroed(start, end, zeroed_ts):
    # Make sure there's only one decimal place. Also that they are in string format. 
    start = str(round(float(start), 1))
    end = str(round(float(end), 1))
    
    #Subset and return the data frame. 
    res = zeroed_ts.loc[:,start:end]
    res = pd.concat([zeroed_ts['TestId'], res], axis = 1)
    return res

## **Scaling Functions**

### Function to normalize

This function will normalize the data. This means that each time series will be scaled so that its values lie between 0 and 1. 
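
For a single series, min-max scaling amounts to subtracting the minimum and dividing by the range. The sketch below shows this on a plain NumPy array; the function name `min_max_scale` is an assumption for illustration, and flat series are returned unchanged to mirror the behaviour described in the code comments below.

```python
import numpy as np


def min_max_scale(x):
    """Scale a 1-D array to [0, 1]; a completely flat series is returned unchanged."""
    x = np.asarray(x, dtype=float)
    lo = x.min()
    span = x.max() - lo
    if span == 0:
        # min == max, so dividing by the range would divide by zero.
        return x.copy()
    return (x - lo) / span


scaled = min_max_scale([2.0, 4.0, 6.0, 10.0])
flat = min_max_scale([3.0, 3.0, 3.0])

print(scaled)
print(flat)
```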

In [None]:
# Scales the time series in a dataframe so that their values are between 0 and 1. 
# ts_df: A frame where each row is a time series and each column is a time point, with an additional `TestId` column. 
# Returns a data frame where each row is a time series with values scaled between 0 and 1. 
def normalize(ts_df):
    ids = ts_df['TestId'] #Save the ids for later use. 
    subset = ts_df.drop('TestId', axis = 1)
    m = subset.min(axis = 1) #Series with the min value for each time series.
    r = subset.max(axis = 1) - m #Series with the range (max - min) of each time series.

    #When r==0, min = max, so dividing by r would divide by 0 (creating nans). Instead, set m to 0 and r to 1 so that readings that are completely flat are unchanged.
    m[r==0] = 0
    r[r==0] = 1

    #Subtract the minimum of the time series from each time point.
    st1 = subset.sub(m, axis = 0)
    #Divide each time point by the range of the time series.
    norm = st1.div(r, axis = 0)

    #Return the normalized time series with test ids back in the first column.
    return pd.concat([ids, norm], axis = 1)

### Function to standardize

This function will standardize the data. This means that each time series will be scaled to have mean 0 and standard deviation 1. One caveat is that for time series with a standard deviation of 0, the divisor is set to 1 to avoid division by 0, so these flat series become all zeros rather than having standard deviation 1.
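
The row-wise arithmetic can be checked on a tiny frame. The toy values below are assumptions chosen so the result is easy to verify by hand. Note that pandas computes the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Two toy readings: one varying, one completely flat.
toy = pd.DataFrame({
    'TestId': [1, 2],
    '0.0': [1.0, 5.0],
    '0.2': [2.0, 5.0],
    '0.4': [3.0, 5.0],
})

values = toy.drop('TestId', axis=1)
mean = values.mean(axis=1)
std = values.std(axis=1)   # sample standard deviation (ddof=1)
std[std == 0] = 1          # avoid dividing flat readings by zero

z = values.sub(mean, axis=0).div(std, axis=0)
print(z)
```

The first reading becomes -1, 0, 1 and the flat reading becomes all zeros.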

In [None]:
# Standardize all the time series in a dataframe (transform each time series to have mean 0 and standard deviation 1).
# ts_df - data frame where each row is a time series, has columns for time points, and one column with TestId.
# Returns a new data frame where each time series is standardized to have mean 0 and standard deviation 1. 
def standardize(ts_df):
    ids = ts_df['TestId']
    subset = ts_df.drop('TestId', axis = 1)
    m = subset.mean(axis = 1) #Series with the mean value for each time series.
    r = subset.std(axis = 1) #Series with the standard deviation of each time series. 

    #Set r to 1 when the standard deviation is 0 so that readings that are completely flat don't become nans when divided by 0. 
    r[r==0] = 1

    #Subtract the mean of the time series from each time point. 
    st1 = subset.sub(m, axis = 0) 
    #Divide each time point by the standard deviation of the time series. 
    norm = st1.div(r, axis = 0)

    #Return the standardized time series with test ids back in the first column. 
    return pd.concat([ids, norm], axis = 1) 

## **Noise Reduction Functions**

### Function to smooth with a moving average 

Uses a convolution smoother with a uniform rectangular kernel, which is equivalent to smoothing with a moving average. 
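
To make the link between convolution and a moving average concrete, here is a minimal NumPy sketch. It is not the tsmoothie implementation: the function name `moving_average` is an assumption, and `mode='valid'` drops the edges so the output is shorter than the input. The smoother used in this notebook returns an output of the same length as its input, so its edge handling differs from this sketch.

```python
import numpy as np


def moving_average(x, window_len):
    """Moving average as a convolution with a normalized rectangular kernel."""
    kernel = np.ones(window_len) / window_len
    # 'valid' keeps only positions where the kernel fully overlaps the data.
    return np.convolve(x, kernel, mode='valid')


x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
smoothed = moving_average(x, 3)

print(smoothed)  # approximately [2, 3, 4, 5]
```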

In [None]:
# Smooth all the waveforms in a dataframe using a moving average via the convolution smoother from tsmoothie along with a 
# rectangular window. 
# ts - data frame where each row is a time series, has columns for time points, and one column with TestId. Must not contain any NaNs. 
# wl - the length of the window used to convolve.
# start - the time in seconds relative to sample detect where the readings start (negative means before).
# end - the time in seconds relative to sample detect where the readings end.
def moving_average_smooth(ts, wl = 30, start = -30, end = 40):
    #Save ids
    ids = ts['TestId'].reset_index(drop = True)
    # operate smoothing
    smoother = ConvolutionSmoother(window_len=wl, window_type='ones')
    smoother.smooth(ts.drop('TestId', axis = 1))
    smooth = pd.DataFrame(smoother.smooth_data)
    smooth.columns = [str(round(m,1)) for m in np.arange(start,end, 0.2)]
    smooth['TestId'] = ids
    return smooth

### Function to smooth using a convolution smoother with a bartlett window

Using the Bartlett window should lead to less distortion at the edges. 
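
For intuition about how the two kernels differ, the sketch below compares a normalized rectangular window with a normalized Bartlett (triangular) window from NumPy, and convolves each with a single spike. The window length of 5 and the spike signal are assumptions chosen to keep the numbers readable. This is not how tsmoothie is called.

```python
import numpy as np

window_len = 5

# Rectangular window: every point in the window gets equal weight.
rect = np.ones(window_len) / window_len

# Bartlett window: triangular weights that taper to zero at both ends.
tri = np.bartlett(window_len)
tri = tri / tri.sum()

# Convolve a single spike to see the shape each kernel imposes.
spike = np.zeros(9)
spike[4] = 1.0
rect_out = np.convolve(spike, rect, mode='same')
tri_out = np.convolve(spike, tri, mode='same')

print(rect)
print(tri)
```

The triangular kernel weights points near the center of the window more heavily, so distant points contribute less to each smoothed value.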

In [None]:
# Smooth all the waveforms in a dataframe using the convolution smoother from tsmoothie along with a 
# Bartlett window. 
# ts - data frame where each row is a time series, has columns for time points, and one column with TestId. Must not contain any NaNs. 
# wl - the length of the window used to convolve.
# start - the time in seconds relative to sample detect where the readings start (negative means before).
# end - the time in seconds relative to sample detect where the readings end.
def bartlett_convolve_smooth(ts, wl = 50, start = -30, end = 40):
    #Save ids
    ids = ts['TestId'].reset_index(drop = True)
    # operate smoothing
    smoother = ConvolutionSmoother(window_len=wl, window_type='bartlett')
    smoother.smooth(ts.drop('TestId', axis = 1))
    smooth = pd.DataFrame(smoother.smooth_data)
    smooth.columns = [str(round(m,1)) for m in np.arange(start,end, 0.2)]
    smooth['TestId'] = ids
    return smooth

### Function to smooth using a Savitzky-Golay filter

Works similarly to a moving average, but instead of averaging the points in each window, it fits a low-order polynomial to them and uses the fitted value as the smoothed point. 
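
One way to see the polynomial-fitting behaviour is to filter a signal that is itself a quadratic: a filter with `polyorder=2` should return it essentially unchanged, while noise added on top is reduced. The sketch below uses `scipy.signal.savgol_filter` directly. The test signal, window length of 11, and noise level are assumptions for illustration.

```python
import numpy as np
from scipy.signal import savgol_filter

t = np.linspace(-1, 1, 41)
clean = 3 * t**2 - t + 0.5  # an exact quadratic

# A local quadratic fit reproduces a quadratic signal (up to rounding error).
recovered = savgol_filter(clean, window_length=11, polyorder=2)

# With noise added, the filter returns a smoothed estimate of the same length.
rng = np.random.default_rng(0)
noisy = clean + rng.normal(scale=0.05, size=t.size)
smoothed = savgol_filter(noisy, window_length=11, polyorder=2)

print(np.max(np.abs(recovered - clean)))
```

A plain moving average of the same length would flatten the curvature of the quadratic, which is the main practical difference between the two smoothers.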

In [None]:
def savgol_smooth(ts, wl = 51, polyorder = 2, start = -30, end = 40):
    #Save ids
    ids = ts['TestId'].reset_index(drop = True)
    # Apply the Savitzky-Golay filter along each row (one row per time series).
    smooth = savgol_filter(ts.drop('TestId', axis = 1).to_numpy(),  wl, polyorder = polyorder, deriv=0, axis = 1)
    smooth = pd.DataFrame(smooth)
    smooth.columns = [str(round(m,1)) for m in np.arange(start,end, 0.2)]
    smooth['TestId'] = ids
    return smooth

## **Diagnostic Functions**

### Function to plot 5 random smoothed waveforms on top of the original ones, and the associated periodogram below. 
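
As background for reading the periodogram panels, the sketch below computes a periodogram for a synthetic 1 Hz sine wave sampled at 5 Hz, the sampling rate assumed throughout this notebook. The test signal is an assumption made for illustration. The highest frequency that can be shown is half the sampling rate (2.5 Hz).

```python
import numpy as np
from scipy import signal

fs = 5                       # sampling rate in Hz
t = np.arange(100) / fs      # 100 samples, i.e. 20 seconds
x = np.sin(2 * np.pi * 1.0 * t)

# f holds the frequencies in Hz, power the estimated power at each frequency.
f, power = signal.periodogram(x, fs)
peak = f[np.argmax(power)]

print(peak, f[-1])
```

Smoothing should show up in these plots as reduced power at higher frequencies in the filtered curve relative to the unfiltered one.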

In [None]:
def show_smooth(unfiltered_ts, filtered_ts):
    # get 5 random test ids
    ides = list(unfiltered_ts['TestId'].sample(n=5))
    
    # get first and last timestamp to label the x-axis in plot
    start = unfiltered_ts.columns[~unfiltered_ts.columns.str.isalpha()][0]
    end = unfiltered_ts.columns[~unfiltered_ts.columns.str.isalpha()][-1]
    
    ## Plot the overlaid waveforms
    
    # define subplot grid
    fig, axs = plt.subplots(nrows=1, ncols=5, figsize=(30, 5))

    # loop through test ids and axes
    for ide, ax in zip(ides, axs.ravel()):
        # plot the filtered and unfiltered versions for the test id. 
        ax.plot(unfiltered_ts[unfiltered_ts['TestId'] == ide].drop('TestId', axis = 1).transpose(), c = 'b', alpha = 0.3, label = 'unfiltered')
        ax.plot(filtered_ts[filtered_ts['TestId'] == ide].drop('TestId', axis = 1).transpose(), c = 'b', label = 'filtered')
        ax.xaxis.set_ticks([start, '-0.0', end])
        ax.set_xlabel('time')
        ax.set_ylabel('signal')

    plt.legend()
    plt.show()
    
    ## Plot the overlaid periodograms
    
    # define subplot grid
    fig, axs = plt.subplots(nrows=1, ncols=5, figsize=(30, 5))

    # loop through test ids and axes
    for ide, ax in zip(ides, axs.ravel()):
        # plot the filtered and unfiltered versions for the test id. 
        f, ps = signal.periodogram(unfiltered_ts[unfiltered_ts['TestId'] == ide].drop('TestId', axis = 1), 5)
        f1, ps1 = signal.periodogram(filtered_ts[filtered_ts['TestId'] == ide].drop('TestId', axis = 1), 5)
        
        ax.plot(f, ps[0], c = 'b', alpha = 0.3, label = 'unfiltered')
        ax.plot(f1, ps1[0], c = 'b', label = 'filtered')
        ax.set_xlabel('frequency (Hz)')
        ax.set_ylabel('power')

    plt.legend()
    plt.show()

## **Demo of a preprocessing pipeline**

In this preprocessing pipeline, the time series are first subset from 30 seconds before sample detect time to 40 seconds after sample detect time. If this full time range is not present in a reading, that reading is dropped. Next, the time series are normalized so that their values lie between 0 and 1. Finally, a convolution smoother with a Bartlett window of length 50 is applied to smooth the time series. The time series are then written to new CSV files. 
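
Dropping readings that lack the full time range relies on how pandas pads short rows: when rows of unequal length are stacked into a data frame, the missing positions become NaN, and `dropna()` then removes those rows. The sketch below shows this on toy rows, which are assumptions made for illustration.

```python
import pandas as pd

# Three toy readings; the second is too short to cover the full window.
rows = [
    pd.Series([1.0, 2.0, 3.0]),
    pd.Series([4.0, 5.0]),
    pd.Series([6.0, 7.0, 8.0]),
]

frame = pd.DataFrame(rows)   # the short row is padded with NaN
complete = frame.dropna()    # rows containing any NaN are removed

print(frame)
print(complete)
```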

In [None]:
# Defining 'non wet-up' as 30 seconds before to 40 seconds after sample detect
start = -30
end = 40

# Subset to 30 seconds before sample detect through 40 seconds after. This gets rid of wet-up and makes sure everything is aligned. 
ecd_w = window_and_zero(start, end, ecd_ts, ecd_pred)
syn_w = window_and_zero(start, end, syn_ts, syn_pred)
con_w = window_and_zero(start, end, con_ts, con_pred)
un_w = window_and_zero(start, end, un_ts, un_pred)

# Normalize to scale all the waveforms between 0 and 1. 
ecd_norm = normalize(ecd_w).dropna()
syn_norm = normalize(syn_w).dropna()
con_norm = normalize(con_w).dropna()
un_norm = normalize(un_w).dropna()

#Smooth with a Bartlett window convolution
ecd_smooth = bartlett_convolve_smooth(ecd_norm)
un_smooth = bartlett_convolve_smooth(un_norm)
con_smooth = bartlett_convolve_smooth(con_norm)
syn_smooth = bartlett_convolve_smooth(syn_norm)

#Get rid of the weird outliers for now. 
ecd_smooth = ecd_smooth[~ecd_smooth['TestId'].isin([9610647, 9610462])]

### Plot some of the ECD errors after filtering on top of the normalized ones. 

In [None]:
show_smooth(ecd_norm, ecd_smooth)

### Save the processed waveforms to new files

**You will need to make a folder for them or change the path**

In [None]:
ecd_smooth.to_csv('../Data/PreprocessedData/TimeSeries/ecd_smooth.csv', index = False)
syn_smooth.to_csv('../Data/PreprocessedData/TimeSeries/syn_smooth.csv', index = False)
con_smooth.to_csv('../Data/PreprocessedData/TimeSeries/cont_smooth.csv', index = False)
un_smooth.to_csv('../Data/PreprocessedData/TimeSeries/un_smooth.csv', index = False)