## 3: Pre-processing
Pre-processing of the PPG signal segments includes smoothing the signal, removing the baseline trend, detecting cycles from the peak of each cycle, and two-dimensional normalization of the signal. Low-quality segments are rejected, and the remaining segments are trimmed so that they contain only complete waves.

#### Plot examples of the original PPG signals (before pre-processing)

In [1]:
def plot_raws_PPG_signal(PPG, num):
    """
    Plot examples of raw PPG signals.

    Parameters:
    - PPG (DataFrame): The PPG data containing multiple segments.
    - num (int): The number of segment examples to plot.

    Returns:
    None
    """
    
    for i in range(0, num):
        # Extract the data for the current example
        data = (PPG.iloc[i * 700: (i + 1) * 700].values).copy()
        
        # Plot the PPG data 
        plt.figure(figsize=(7, 3))
        plt.title(f'Raw PPG')
        plt.plot(data)
        plt.xlabel('Number of samples')
        plt.ylabel('Amplitude (volts)')
        plt.show()

In [2]:
# Plot the train data before pre-processing
# plot_raws_PPG_signal(train_df['PLETH'], 50)

# Plot the test data before pre-processing
# plot_raws_PPG_signal(test_df['PLETH'], 30)

#### A: Noise filtering
Noise is reduced with a fourth-order Savitzky–Golay filter with a window size of 19 samples. Unlike a plain moving average, this filter fits a low-degree polynomial to each window by least squares, which smooths the PPG signal and reduces noise while better preserving the shape of its peaks.
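
As a rough illustration of the filter settings used in this step, the sketch below applies `savgol_filter` to a synthetic noisy sine wave. The 1.2 Hz wave, the noise level, and the 100 Hz sampling rate are assumptions chosen for the example, not values taken from the data set.

```python
import numpy as np
from scipy.signal import savgol_filter

# Synthetic stand-in for a 700-sample PPG segment at 100 Hz (assumed values).
rng = np.random.default_rng(0)
t = np.arange(700) / 100.0
clean = np.sin(2 * np.pi * 1.2 * t)
noisy = clean + 0.1 * rng.standard_normal(t.size)

# Same settings as in this section: 19-sample window, 4th-order polynomial.
smoothed = savgol_filter(noisy, window_length=19, polyorder=4)

# Compare the error against the clean wave before and after filtering.
rmse_noisy = np.sqrt(np.mean((noisy - clean) ** 2))
rmse_smoothed = np.sqrt(np.mean((smoothed - clean) ** 2))
print(smoothed.shape, rmse_noisy, rmse_smoothed)

# A polynomial of degree <= polyorder passes through the filter unchanged,
# which a plain moving average would not guarantee.
cubic = t ** 3 - t
cubic_filtered = savgol_filter(cubic, window_length=19, polyorder=4)
```

The filtered output has the same length as the input, so it can be written back next to the raw column as done in `Filter_Savitzky_Golay`.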

In [3]:
def Filter_Savitzky_Golay(data):
    """
    Filter the PLETH signal by segments using the Savitzky-Golay filter.

    Parameters:
    - data (DataFrame): The DataFrame containing the PLETH signal and segment numbers.

    Returns:
    None

    Modifies the input DataFrame by adding a new column 'PLETH_filtered' containing the filtered signal.
    """

    # Group the PLETH signal by segments
    grouped_df = data['PLETH'].groupby(data['Part_Number'])
    PLETH_filtered = np.array([])

    # Apply the Savitzky-Golay filter to each segment of the PLETH signal
    for part in grouped_df:
        part_filtered = savgol_filter(part[1].values, window_length=19, polyorder=4)
        PLETH_filtered = np.concatenate((PLETH_filtered.ravel(), part_filtered.ravel()))

    # Add a new column to the DataFrame with the filtered signal
    data['PLETH_filtered'] = PLETH_filtered

In [18]:
# Apply the filter on the train & test segments
Filter_Savitzky_Golay(train_df)
Filter_Savitzky_Golay(test_df)

##### A.1: Plot examples of segments after filter

In [38]:
def plot_filtered_PPG_signal(PPG, num):
    """
    Plot examples of filtered PPG signal.

    Parameters:
    - PPG (DataFrame): The PPG data containing multiple segments.
    - num (int): The number of segment examples to plot.

    Returns:
    None
    """
    
    for i in range(0, num):
        # Extract the data for the current example
        data = (PPG.iloc[i * 700: (i + 1) * 700]).copy()
        
        # Plot the PPG data 
        plt.figure(figsize=(10, 5))
        
        plt.subplot(121)
        plt.title(f'Raw PPG segment')
        plt.plot(data['PLETH'].values)
        plt.xlabel('Number of samples')
        plt.ylabel('Amplitude (volts)')
        
        plt.subplot(122)
        plt.title('Filtered PPG segment')
        plt.plot(data['PLETH_filtered'].values)
        plt.xlabel('Number of samples')
        plt.ylabel('Amplitude (volts)')
        
        plt.tight_layout()
        plt.show()

In [4]:
# Plot train example segments before & after noise filtering 
# plot_filtered_PPG_signal(train_df, 10)

# Plot test example segments before & after noise filtering 
# plot_filtered_PPG_signal(test_df, 10)

#### B: Trend removal
The trend line caused by breathing activity is also removed from the segments.
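
The effect of this step can be sketched on synthetic data. The example below assumes a purely linear drift added to a sine wave; `scipy.signal.detrend` with its default settings subtracts a least-squares straight line, so a curved baseline would only be partly removed.

```python
import numpy as np
from scipy import signal

# Synthetic pulse wave with an assumed linear baseline drift.
t = np.arange(700) / 100.0
pulse = np.sin(2 * np.pi * 1.2 * t)
drift = 0.5 * t
detrended = signal.detrend(pulse + drift)

# After linear detrending, the residual has no remaining straight-line component.
slope = np.polyfit(t, detrended, 1)[0]
print(slope, detrended.mean())
```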

In [19]:
def Trend_Removel(data):
    """
    Apply trend removal to the PLETH signal by parts.

    Parameters:
    - data (DataFrame): The DataFrame containing the PLETH signal and part numbers.

    Returns:
    None

    Modifies the input DataFrame by adding a new column 'PLETH_trend_removel' containing the trend-removed signal.
    """

    # Group the filtered PLETH signal by part number
    grouped_df = data['PLETH_filtered'].groupby(data['Part_Number'])
    PLETH_trend_removal = np.array([])

    # Apply trend removal to each part of the filtered PLETH signal
    for part in grouped_df:
        part_detrend = signal.detrend(part[1])
        PLETH_trend_removal = np.concatenate((PLETH_trend_removal.ravel(), part_detrend.ravel()))

    # Add a new column 'PLETH_trend_removel' with the trend-removed signal
    data['PLETH_trend_removel'] = PLETH_trend_removal

In [20]:
# Apply the trend removal on the train & test segments
Trend_Removel(train_df)
Trend_Removel(test_df)

##### B.1: Plot examples of segments after trend removal

In [54]:
def plot_trend_removel_PPG_signal(PPG, num):
    """
    Plot examples of the trend-removed PPG signal compared to the filtered signal.

    Parameters:
    - PPG (DataFrame): The PPG data containing multiple segments.
    - num (int): The number of segment examples to plot.

    Returns:
    None
    """
    
    for i in range(0, num):
        # Extract the data for the current example
        data = (PPG.iloc[i * 700: (i + 1) * 700]).copy()
        
        # Plot the PPG data 
        plt.figure(figsize=(7, 3))
        
        plt.title(f'PPG segment')
        plt.plot(data['PLETH_filtered'].values, label='Filtered PPG')
        plt.plot(data['PLETH_trend_removel'].values, label='Trend-removed PPG')
        plt.xlabel('Number of samples')
        plt.ylabel('Amplitude (volts)')
        
        plt.legend(loc='lower right', fontsize=8)
        plt.tight_layout()
        plt.show()

In [5]:
# Plot train example segments before & after trend removal
# plot_trend_removel_PPG_signal(train_df, 10)

# Plot test example segments before & after trend removal
# plot_trend_removel_PPG_signal(test_df, 10)

#### C: Cycle detection
Using the heartpy library, the cycles in each segment were defined by finding the peak point of each cycle in the signal.

In [21]:
def get_max_peaks(data):
    """
    Extract the indices of the maximum peaks from the input data using the heartpy library.

    Parameters:
    - data (array-like): The input data for peak extraction.

    Returns:
    - max_peaks (numpy array): An array of indices representing the positions of the maximum peaks.
    """
    
    # The input data is processed using the heartpy library's `process()` function with a sampling rate of 100.0 (100 Hz).
    working_data, measures = hp.process(data, 100.0)
    
    # Extract the 'peaklist' from the 'working_data' dictionary.
    max_peaks = np.array(working_data['peaklist'])
    
    return max_peaks

In [22]:
def get_min_peaks(data, max_peaks):
    """
    Extract the indices of the minimum peaks from the input data based on the indices of the maximum peaks.

    Parameters:
    - data (array-like): The input data for peak extraction.
    - max_peaks (array-like): An array of indices representing the positions of the maximum peaks.

    Returns:
    - min_peaks (numpy array): An array of indices representing the positions of the minimum peaks.
    """
    
    # Initialized 'min_peaks' list to store the indices of the minimum peaks.
    min_peaks = []

    # Check whether a trough precedes the first maximum peak, i.e. whether the minimum of the samples before that peak is lower than the first data point.
    if np.min(data[1:max_peaks[0]]) < data[0]:
        # If true, the index of the minimum value is appended to 'min_peaks'.
        min_peaks.append(np.argmin(data[0:max_peaks[0]]))

    # Iterates over pairs of adjacent maximum peaks using a loop. 
    for peak0, peak1 in zip(max_peaks[:-1], max_peaks[1:]):
        # Find the index of the minimum value within the corresponding data range between the two maximum peaks and appends it to 'min_peaks'.
        min_peaks.append(peak0 + np.argmin(data[peak0:peak1]))
        
    # 'min_peaks' list is converted to a numpy array.
    min_peaks = np.array(min_peaks)
    
    return min_peaks

In [23]:
def get_peaks(data):
    """
    Calculate the indices of the minimum and maximum peaks in the input data using the `get_max_peaks()` and `get_min_peaks()` functions.

    Parameters:
    - data (array-like): The input data for peak extraction.

    Returns:
    - max_peaks (numpy array): An array of indices representing the positions of the maximum peaks.
    - min_peaks (numpy array): An array of indices representing the positions of the minimum peaks.
    """
    
    # Use the `get_max_peaks()` function to calculate the indices of the maximum peaks in the input data.
    max_peaks = get_max_peaks(data)
   
    # Use `get_min_peaks()` with the input data and the indices of the maximum peaks to calculate the indices of the corresponding minimum peaks.
    min_peaks = get_min_peaks(data, max_peaks)
    
    # Ensure both arrays have the same number of elements. When no trough precedes the first maximum peak,
    # there is one more maximum than minimum, so the first maximum is dropped.
    if max_peaks.size > min_peaks.size:
        max_peaks = max_peaks[1:]
    elif max_peaks.size < min_peaks.size:
        # More minima than maxima: drop the first minimum instead.
        min_peaks = min_peaks[1:]

    return max_peaks, min_peaks
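
The trough search between consecutive peaks can be tried on a synthetic wave. In this sketch `scipy.signal.find_peaks` is used as a stand-in for the heartpy peak list, which is an assumption made only to keep the example self-contained; the 1 Hz sine wave and the 0.4 s minimum peak distance are likewise illustrative.

```python
import numpy as np
from scipy.signal import find_peaks

fs = 100.0                                   # sampling rate in Hz, as in this section
t = np.arange(700) / fs
wave = np.sin(2 * np.pi * 1.0 * t)           # synthetic 1 Hz wave

# Stand-in for the heartpy peak list: local maxima at least 0.4 s apart.
max_peaks, _ = find_peaks(wave, distance=int(0.4 * fs))

# Same idea as get_min_peaks(): the trough is the lowest sample between two adjacent peaks.
min_peaks = np.array([
    p0 + np.argmin(wave[p0:p1])
    for p0, p1 in zip(max_peaks[:-1], max_peaks[1:])
])

print(max_peaks)
print(min_peaks)
```

Each trough index falls strictly between the two peaks that enclose it, which is the ordering the later normalization step relies on.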

##### C.1: Plot examples of segments for which peaks were not found

In [24]:
def plot_examples_without_peaks(df, num):
    """
    Plots PPG signal examples without any peaks found.

    Parameters:
        df (DataFrame): DataFrame containing the PPG signal data.
        num (int): Number of examples to plot.
    """

    for i in range(0, num):
        data = (df['PLETH_trend_removel'].iloc[i*700:(i+1)*700].values).copy()
        try:
            max_peaks, min_peaks = get_peaks(data)
        except Exception:
            # Peak detection failed for this segment, so plot it as an example without peaks
            plt.figure(figsize=(7, 3))
            plt.title('Bad PPG segment (no peaks found)')
            plt.plot(data)
            plt.xlabel('Number of samples')
            plt.ylabel('Amplitude (volts)')
            plt.show()


In [6]:
# Plot examples of segments using plot_examples_without_peaks function
# plot_examples_without_peaks(train_df, 1000)

#### D: Disqualification of segments
Artifacts were filtered out according to two requirements whose thresholds were tuned so that only high-quality signals were retained while leaving enough data for training. A segment is rejected based on the variability of its cycle durations (time, X) and of its cycle amplitudes (Y). Each is scored as the mean divided by the standard deviation, which is the reciprocal of the coefficient of variation, so a higher score means a more regular segment.

In [54]:
def Conditions_PPG(data, max_peaks, min_peaks, amp_cv=4, time_cv=12):
    """
    Apply filtering conditions to identify "good" segments of the signal based on amplitude and time criteria.

    Parameters:
    - data (array-like): The input PPG signal data.
    - max_peaks (numpy array): An array of indices representing the positions of the maximum peaks.
    - min_peaks (numpy array): An array of indices representing the positions of the minimum peaks.
    - amp_cv (float): Threshold for the amplitude score (mean / standard deviation of the cycle amplitudes). Default is 4.
    - time_cv (float): Threshold for the time score (mean / standard deviation of the cycle durations). Default is 12.

    Returns:
    - amp_cv_score (float): The amplitude score (mean / standard deviation).
    - time_cv_score (float): The time score (mean / standard deviation).
    - is_good_part (bool): Indicates whether the part of the signal satisfies the filtering conditions.
    """
    
    # Calculate the time differences between consecutive minimum and maximum peaks in the PPG signal.
    time_min = min_peaks[1:] - min_peaks[:-1]
    time_max = max_peaks[1:] - max_peaks[:-1]

    # Average the two interval series and score their regularity as mean / standard deviation
    # (the reciprocal of the coefficient of variation, so higher means more regular).
    mean_time = (time_min + time_max) / 2
    time_cv_score = mean_time.mean() / mean_time.std()

    # Calculate the amplitude difference between each maximum peak and its matching minimum peak, and score it the same way.
    amp = data[max_peaks] - data[min_peaks]
    amp_cv_score = amp.mean() / amp.std()

    # Check that the amplitude score is greater than amp_cv and the time score is greater than time_cv.
    if (amp_cv_score > amp_cv) and (time_cv_score > time_cv):
        # If both conditions are satisfied, the segment of the signal is considered "good".
        return amp_cv_score, time_cv_score, True
    
    else:
        # If the conditions are not met, the segment of the signal is considered "not good".
        return amp_cv_score, time_cv_score, False
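
To see how the mean / standard deviation score separates regular from irregular segments, here is a small worked sketch. The interval values are made up for illustration, and `mean_over_std` is a hypothetical helper that mirrors the ratio computed in `Conditions_PPG`.

```python
import numpy as np

def mean_over_std(values):
    """Mean divided by standard deviation (reciprocal of the coefficient of variation)."""
    values = np.asarray(values, dtype=float)
    return values.mean() / values.std()

# Made-up peak-to-peak intervals, in samples.
regular_intervals = [80, 81, 79, 80, 82, 80]
irregular_intervals = [80, 55, 110, 70, 95, 60]

regular_score = mean_over_std(regular_intervals)
irregular_score = mean_over_std(irregular_intervals)
print(regular_score, irregular_score)

# With the default time threshold of 12, only the regular series would pass.
time_cv = 12
print(regular_score > time_cv, irregular_score > time_cv)
```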

##### D.1: Plot examples of valid & invalid segments

In [212]:
def plot_examples_valid_and_invalid_segments(df):
    """
    Plots examples of PPG signals that are valid & invalid segments.

    Args:
        df (DataFrame): DataFrame containing the PPG signal data.
    """
    
    # Grouped the data frame by segments
    grouped_df = df.groupby(df['Part_Number'])

    # Iterate over each segment in the grouped data frame
    for part in grouped_df:
        # Extract the PPG signal for the segment
        data = (part[1]['PLETH_trend_removel'].values).copy()
        
        try:
            # Attempt to find the max and min peaks using the 'get_peaks' function
            max_peaks, min_peaks = get_peaks(data)
        except Exception:
            # Skip segments for which peak detection fails
            pass
        else:
            # Evaluate the conditions on the PPG data using the 'Conditions_PPG' function
            amp_cv_score, time_cv_score, booli = Conditions_PPG(data, max_peaks, min_peaks)
            if booli:
                # Plot the valid segment with its detected peaks
                plt.figure(figsize=(7, 4))
                plt.title(f'Valid PPG segment')
                plt.plot(data)
                plt.plot(max_peaks, data[max_peaks], 'or')
                plt.plot(min_peaks, data[min_peaks], 'oy')
                plt.ylabel('Amplitude (volts)')
                plt.xlabel('Number of samples')
                plt.show()
                
            else: 
                plt.figure(figsize=(7, 4))
                plt.title(f'Invalid PPG segment')
                plt.ylabel('Amplitude (volts)')
                plt.xlabel('Number of samples')
                plt.plot(data)
                plt.show()

In [7]:
# Plot train example valid & invalid segments
# plot_examples_valid_and_invalid_segments(train_df)

#### E: 2-dimensional normalization
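For the signal segments that passed the above conditions, normalization was performed in width and amplitude: each half-cycle between two consecutive peaks (minimum or maximum) is rescaled to a width of 0.5 on the time axis and to the range 0 to 1 in amplitude.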

In [56]:
def normalize_data(data, max_peaks, min_peaks):
    """
    Normalize the PPG signal data between peaks by performing linear normalization on the corresponding data points 
    within each peak window.

    Parameters:
    - data (array-like): The input PPG signal data.
    - max_peaks (numpy array): An array of indices representing the positions of the maximum peaks.
    - min_peaks (numpy array): An array of indices representing the positions of the minimum peaks.

    Returns:
    - norm_data (numpy array): The normalized amplitude (y) coordinates between peaks.
    - norm_time (numpy array): The normalized time (x) coordinates between peaks.
    - all_peaks (numpy array): An array of indices representing the positions of all peaks.
    - max_peaks (numpy array): The original positions of the maximum peaks.
    - min_peaks (numpy array): The original positions of the minimum peaks.
    """

    # Initializes arrays for storing the normalized time and amplitude values.
    norm_data = np.zeros_like(data)
    norm_time = norm_data.copy()
    
    # Concatenates the positions of maximum and minimum peaks into a single array and sorts them in ascending order.
    all_peaks = np.sort(np.concatenate([max_peaks, min_peaks], axis=0))

    # Iterates over each pair of consecutive peaks and performs linear normalization on the corresponding data points 
    # within the window.
    for i, (peak0, peak1) in enumerate(zip(all_peaks[:-1], all_peaks[1:])):
        window = data[peak0:peak1].copy()
        if peak0 == peak1:
            window = np.append(window, peak0)
        window -= window.min()
        window /= window.max()
        norm_data[peak0:peak1] = window
        norm_time[peak0:peak1] = np.linspace(0.5 * i, 0.5 * (i + 1), window.size + 1)[1:]
    
    return norm_data, norm_time, all_peaks, max_peaks, min_peaks
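
The per-window arithmetic can be checked by hand on a tiny example. The sketch below isolates the body of the loop in a hypothetical helper, `normalize_window`, and applies it to four made-up samples treated as the first half-cycle (`i=0`).

```python
import numpy as np

def normalize_window(window, i):
    """Rescale one half-cycle to amplitude [0, 1] and to the time span (0.5*i, 0.5*(i+1)]."""
    window = np.asarray(window, dtype=float).copy()
    window -= window.min()
    window /= window.max()
    time = np.linspace(0.5 * i, 0.5 * (i + 1), window.size + 1)[1:]
    return window, time

amp, time = normalize_window([2.0, 3.0, 5.0, 6.0], i=0)
print(amp)    # amplitudes rescaled to 0, 0.25, 0.75, 1
print(time)   # times 0.125, 0.25, 0.375, 0.5
```

Because every half-cycle is mapped to the same width, cycles of different durations become directly comparable. A window whose samples are all equal would divide by zero here, as it would in `normalize_data`.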

#### F: Cutting into whole cycles
The original and the normalized segments were trimmed so that each segment contains only complete waves: everything before the first detected peak point and after the last detected peak point is discarded.

In [57]:
def cutting_into_whole_cycles(data, time, max_peaks, min_peaks):
    """
    Cuts the data into whole cycles by trimming the time and amplitude arrays to exclude values outside the range of the concatenated peaks.

    Parameters:
    - data (numpy array): The amplitude (y) coordinates.
    - time (numpy array): The time (x) coordinates.
    - max_peaks (numpy array): The original positions of the maximum peaks.
    - min_peaks (numpy array): The original positions of the minimum peaks.

    Returns:
    - time (numpy array): The trimmed time (x) coordinates.
    - data (numpy array): The trimmed amplitude (y) coordinates.
    - fix_min_peaks (numpy array): The adjusted positions of the fixed minimum peaks.
    - fix_max_peaks (numpy array): The adjusted positions of the fixed maximum peaks.
    """

    # Trim the normalized time and amplitude arrays to exclude any values outside the range of the concatenated peaks.
    all_peaks = np.sort(np.concatenate([max_peaks, min_peaks], axis=0))
    data = data[all_peaks[0]:all_peaks[-1]]
    time = time[all_peaks[0]:all_peaks[-1]]
    time[0] = 0
    
    # Adjusts the positions of the normalized maximum and minimum peaks to match the trimmed arrays.
    fix_max_peaks = max_peaks - all_peaks[0]
    fix_min_peaks = min_peaks - all_peaks[0]

    # If the last position of the normalized maximum peaks is greater than the last position of the normalized minimum peaks, it decrements the last position of the normalized maximum peaks by 1.
    if fix_max_peaks[-1] > fix_min_peaks[-1]:
        fix_max_peaks[-1] -= 1
        
    # If the last position of the normalized maximum peaks is less than the last position of the normalized minimum peaks, it decrements the last position of the normalized minimum peaks by 1.
    elif fix_max_peaks[-1] < fix_min_peaks[-1]:
        fix_min_peaks[-1] -= 1

    return time, data, fix_min_peaks, fix_max_peaks
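
The index bookkeeping in this step is easier to follow on a toy array. In the sketch below the signal and the peak positions are made up. It shows why the last peak index is decremented: the slice excludes its end index, so the shifted position of the last peak would otherwise point one past the end of the trimmed array.

```python
import numpy as np

data = np.arange(10, dtype=float)      # toy signal, values equal their indices
min_peaks = np.array([2, 6])           # made-up trough positions
max_peaks = np.array([4, 8])           # made-up peak positions

all_peaks = np.sort(np.concatenate([max_peaks, min_peaks]))
trimmed = data[all_peaks[0]:all_peaks[-1]]   # keeps indices 2..7, the end index is exclusive

fix_min_peaks = min_peaks - all_peaks[0]     # [0, 4]
fix_max_peaks = max_peaks - all_peaks[0]     # [2, 6], but 6 is one past the end of `trimmed`

# Pull the last peak back by one sample so it stays inside the trimmed array.
if fix_max_peaks[-1] > fix_min_peaks[-1]:
    fix_max_peaks[-1] -= 1
elif fix_max_peaks[-1] < fix_min_peaks[-1]:
    fix_min_peaks[-1] -= 1

print(trimmed.size, fix_min_peaks, fix_max_peaks)
```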

##### F.1: Plot examples of processed segments & normalized processed segments

In [182]:
def plot_examples_with_peaks_and_norm(df, num):
    """
    Plots examples of PPG signals with peaks and their corresponding normalized segments.

    Args:
        df (DataFrame): DataFrame containing the PPG signal data.
        num (int): Number of examples to plot.
    """
    
    # Grouped the data frame by segments
    grouped_df = df.groupby(df['Part_Number'])

    # Iterate over each segment in the grouped data frame
    for part in grouped_df:
        # Extract the PPG signal for the segment
        data = (part[1]['PLETH_trend_removel'].values).copy()
        
        try:
            # Attempt to find the max and min peaks using the 'get_peaks' function
            max_peaks, min_peaks = get_peaks(data)
        except Exception:
            # Skip segments for which peak detection fails
            pass
        else:
            # Evaluate the conditions on the PPG data using the 'Conditions_PPG' function
            amp_cv_score, time_cv_score, booli = Conditions_PPG(data, max_peaks, min_peaks)
            if booli:
                # Plot the original and the normalized signals with peaks
                plt.figure(figsize=(10, 5))
        
                plt.subplot(121)
                plt.title(f'Processed original PPG signal')
                plt.plot(data)
                plt.plot(max_peaks, data[max_peaks], 'or')
                plt.plot(min_peaks, data[min_peaks], 'oy')

                norm_data, norm_time, all_peaks, max_peaks, min_peaks = normalize_data(data, max_peaks, min_peaks)
                norm_time, norm_data, norm_min_peaks, norm_max_peaks = cutting_into_whole_cycles(norm_data, norm_time, max_peaks, min_peaks)
                plt.subplot(122)
                plt.title(f'Processed normalized PPG segment')
                plt.plot(norm_time, norm_data)
                plt.plot(norm_time[norm_max_peaks], norm_data[norm_max_peaks], 'or')
                plt.plot(norm_time[norm_min_peaks], norm_data[norm_min_peaks], 'oy')
                plt.show()

In [8]:
# Plot the train data after pre-processing
# plot_examples_with_peaks_and_norm(train_df, 10)

# Plot the test data after pre-processing
# plot_examples_with_peaks_and_norm(test_df, 10)

#### All the pre-processing
Apply all the pre-processing steps to the data (train & test).

In [159]:
def norm_df_with_peaks(data, max_peaks, min_peaks):
    """
    Creates a normalized data frame with the peaks by applying normalization and cutting into whole cycles.

    Parameters:
    - data (numpy array): The original PPG signal data.
    - max_peaks (numpy array): The positions of the maximum peaks.
    - min_peaks (numpy array): The positions of the minimum peaks.

    Returns:
    - norm_part_df (pandas DataFrame): The normalized data frame with the peaks.
    """

    # Call the normalize_data function to normalize the data based on the maximum and minimum peaks.
    norm_data, norm_time, all_peaks, max_peaks, min_peaks = normalize_data(data, max_peaks, min_peaks)
    
    # Call the cutting_into_whole_cycles function to trim the normalized data to include only whole cycles.
    norm_time, norm_data, norm_min_peaks, norm_max_peaks = cutting_into_whole_cycles(norm_data, norm_time, max_peaks, min_peaks)
    
    # Create an empty DataFrame to store the normalized data.
    norm_part_df = pd.DataFrame()
    
    # Add columns for normalized time and data.
    norm_part_df['norm_time(X)'] = norm_time
    norm_part_df['norm_data(y)'] = norm_data
    
    # Assign 'Min' and 'Max' labels to the corresponding peaks.
    norm_part_df.loc[norm_min_peaks, 'peak'] = 'Min'
    norm_part_df.loc[norm_max_peaks, 'peak'] = 'Max'
    
    return norm_part_df

In [160]:
def df_with_peaks(data, max_peaks, min_peaks):
    """
    Creates the original data frame with the peaks by extracting the peaks from the original data.

    Parameters:
    - data (tuple): A (part number, DataFrame) pair as yielded by `groupby`; the DataFrame holds the original PPG segment.
    - max_peaks (numpy array): The positions of the maximum peaks.
    - min_peaks (numpy array): The positions of the minimum peaks.

    Returns:
    - new_part_df (pandas DataFrame): The original data frame with the peaks.
    """

    # Copy the segment's DataFrame (the second element of the groupby pair) to hold the new data frame with the peaks.
    new_part_df = data[1].copy()

    # Reset the index of the new data frame to ensure sequential index values.
    new_part_df.reset_index(drop=True, inplace=True)

    # Concatenate the positions of the maximum and minimum peaks into a single array and sort them in ascending order.
    all_peaks = np.sort(np.concatenate([max_peaks, min_peaks], axis=0))

    # Assign 'Min' and 'Max' labels to the rows in the new data frame corresponding to the positions of the minimum and maximum peaks, respectively.
    new_part_df.loc[min_peaks, 'peak'] = 'Min'
    new_part_df.loc[max_peaks, 'peak'] = 'Max'
    
    # Extracts the data between the first and last peak positions to include only the relevant data within the peaks.
    new_part_df = new_part_df[all_peaks[0]:all_peaks[-1]]

    # Calculate the number of complete cycles in the segment: half the number of detected peaks (minima and maxima together), minus 1.
    cycle_nums = (len(all_peaks) / 2) - 1
    
    # Add a new column 'Num_of_Cycles_in_part' to the new data frame, indicating the number of cycles in each part.
    new_part_df['Num_of_Cycles_in_part'] = cycle_nums

    return new_part_df

In [154]:
def pre_processing1(df):
    """
    Performs the first stage of pre-processing on the given PPG data frame.
    It includes peak extraction, condition evaluation, and creating the normalized and processed data frames.
    
    Parameters:
    - df (pandas DataFrame): The input PPG data frame.

    Returns:
    - norm_df (pandas DataFrame): The normalized data frame with the peaks.
    - processed_df (pandas DataFrame): The processed data frame with the peaks and additional information.
    - parts_with_peaks (int): The number of parts that contain peaks.
    - parts_after_conditions (int): The number of parts that satisfy the conditions for further processing.
    """

    # Group the data frame by the 'Part_Number' column
    grouped_df = df.groupby(df['Part_Number'])

    # Initialize variables to track part numbers, parts with peaks, and parts after applying conditions
    part_num = 0
    parts_with_peaks = 0
    parts_after_conditions = 0

    # Create empty data frames for the normalized data and processed data
    norm_df = pd.DataFrame()
    processed_df = pd.DataFrame()
    processed_df['Num_of_Cycles_in_part'] = np.nan

    # Iterate over each segment in the grouped data frame
    for part in grouped_df:
        # Extract the PPG signal for the segment
        data = (part[1]['PLETH_trend_removel'].values).copy()
        
        try:
            # Attempt to find the max and min peaks using the 'get_peaks' function
            max_peaks, min_peaks = get_peaks(data)
            parts_with_peaks += 1
        except Exception:
            # Skip segments for which peak detection fails
            pass
        else:
            # Evaluate the conditions on the PPG data using the 'Conditions_PPG' function
            amp_coefficient_variation_score, time_coefficient_variation_score, booli = Conditions_PPG(data, max_peaks, min_peaks)
            
            if booli:
                parts_after_conditions += 1
                # Create the normalized data frame with the peaks
                norm_part_df = norm_df_with_peaks(data, max_peaks, min_peaks)
                part_num += 1
                norm_part_df['Part_Number'] = part_num
                norm_df = pd.concat([norm_df, norm_part_df])

                # Create the original data frame with the peaks
                new_part_df = df_with_peaks(part, max_peaks, min_peaks)
                new_part_df['Part_Number'] = part_num
                processed_df = pd.concat([processed_df, new_part_df])

    # Reset the index of the normalized data frame and processed data frame
    norm_df.reset_index(drop=True, inplace=True)
    processed_df.reset_index(drop=True, inplace=True)

    return norm_df, processed_df, parts_with_peaks, parts_after_conditions

In [155]:
# Apply the first pre-processing function on the train & test data
norm_train_df1, train_df1, parts_with_peaks_train, parts_after_conditions_train = pre_processing1(train_df)
norm_test_df1, test_df1, parts_with_peaks_test, parts_after_conditions_test = pre_processing1(test_df)

In [191]:
# Find the maximum value that represent the number of segments in each data frame
train_parts = train_df['Part_Number'].values[-1]
test_parts = test_df['Part_Number'].values[-1]

# Print the percentage of train segments with peaks and passing the conditions
print(f'Train parts that found peaks for them: {parts_with_peaks_train}, which are: {np.round((parts_with_peaks_train/train_parts)*100, 2)}%')
print(f'Train parts that passed the conditions: {parts_after_conditions_train}, which are: {np.round((parts_after_conditions_train/parts_with_peaks_train)*100, 2)}%')
print('\n')

# Print the percentage of test segments with peaks and passing the conditions
print(f'Test parts that found peaks for them: {parts_with_peaks_test}, which are: {np.round((parts_with_peaks_test/test_parts)*100, 2)}%')
print(f'Test parts that passed the conditions: {parts_after_conditions_test}, which are: {np.round((parts_after_conditions_test/parts_with_peaks_test)*100, 2)}%')

Train parts that found peaks for them: 5016, which are: 79.58%
Train parts that passed the conditions: 3421, which are: 68.2%


Test parts that found peaks for them: 1245, which are: 20.88%
Test parts that passed the conditions: 930, which are: 74.7%


In [192]:
# Use the count_parts function to count the number of segments in this step
count_parts(train_df1, 'train')
count_parts(test_df1, 'test')

train - total parts: 3421, invasive BP: 3000, non-invasive BP: 322
test - total parts: 930, invasive BP: 855, non-invasive BP: 35


In [193]:
def pre_processing2(norm_df, df):
    """
    Pre-processes the normalized and original data frames by matching the MAP values to the corresponding segments.

    Parameters:
        norm_df (DataFrame): Normalized data frame.
        df (DataFrame): Original data frame.

    Returns:
        Tuple: A tuple containing the pre-processed normalized data frame, pre-processed original data frame, and the number of parts with blood pressure (MAP) values.
    """

    # Group the original data frame by segments.
    grouped_df = df.groupby('Part_Number')

    # Initialize variables.
    parts_with_BP = 0
    processed_df_list = []
    processed_norm_df_list = []

    # Iterate over each segment in the original data frame.
    for part_number, data in grouped_df:
        
        # Check if there is only one non-null MAP value in the segment. If there is label of BP, save this segment.
        if data['MAP'].notna().sum() == 1:
            parts_with_BP += 1  
            data['Part_Number'] = parts_with_BP  
            processed_df_list.append(data)  

            # Find the matching part in the normalized data frame.
            data_norm = norm_df[norm_df['Part_Number'] == part_number].copy()
            data_norm['Part_Number'] = parts_with_BP 
            
            # Extract the MAP value from the original data frame.
            MAP_val = data['MAP'].dropna().values[0]  
            data_norm['MAP'] = MAP_val  
            processed_norm_df_list.append(data_norm)  

    # Concatenate the processed normalized segments and processed original segments into separate data frames
    processed_norm_df = pd.concat(processed_norm_df_list).reset_index(drop=True)
    processed_df = pd.concat(processed_df_list).reset_index(drop=True)

    return processed_norm_df, processed_df, parts_with_BP

In [194]:
# Apply the second pre-processing function on the train & test data
norm_train_df2, train_df2, parts_with_BP_train = pre_processing2(norm_train_df1, train_df1)
norm_test_df2, test_df2, parts_with_BP_test = pre_processing2(norm_test_df1, test_df1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['Part_Number'] = parts_with_BP
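
The warning above points at the assignment to a group yielded by `groupby`. One common way to avoid it is to take an explicit copy of the group before relabelling it. The sketch below shows that pattern on a tiny made-up frame; it is an illustration of the idea, not the code that produced the results in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up frame: parts 1 and 3 have exactly one MAP value, part 2 has none.
df = pd.DataFrame({
    "Part_Number": [1, 1, 2, 2, 3, 3],
    "MAP": [np.nan, 80.0, np.nan, np.nan, 95.0, np.nan],
})

kept = []
for part_number, group in df.groupby("Part_Number"):
    if group["MAP"].notna().sum() == 1:
        group = group.copy()                  # explicit copy, so the assignment below is unambiguous
        group["Part_Number"] = len(kept) + 1  # renumber the kept parts consecutively
        kept.append(group)

relabelled = pd.concat(kept).reset_index(drop=True)
print(relabelled)
```

The original frame keeps its part numbers, and only the copies are renumbered.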


In [195]:
# Print the percentage of train segments that passed all the pre-processing
print(f'Train parts with BP: {parts_with_BP_train}, which are: {np.round((parts_with_BP_train/parts_after_conditions_train)*100,2)}')

# Print the percentage of test segments that passed all the pre-processing
print(f'Test parts with BP: {parts_with_BP_test}, which are: {np.round((parts_with_BP_test/parts_after_conditions_test)*100,2)}')

Train parts with BP: 3322, which are: 97.11
Test parts with BP: 890, which are: 95.7


In [196]:
# Use the count_parts function to count the number of segments in this step
count_parts(train_df2, 'train')
count_parts(test_df2, 'test')

train - total parts: 3322, invasive BP: 3000, non-invasive BP: 322
test - total parts: 890, invasive BP: 855, non-invasive BP: 35


## Save the processed df & processed norm df to disk

In [197]:
train_df = train_df2.copy()
train_norm_df = norm_train_df2.copy()

test_df = test_df2.copy()
test_norm_df = norm_test_df2.copy()

run = '8'

In [198]:
# Save the processed train df in 2 separate files (because it is too big for one file)
file_path1 = f'data/train_proccessed_df1_run{run}.xlsx'
file_path2 = f'data/train_proccessed_df2_run{run}.xlsx'

cut_ind = train_df.shape[0]//2

train_df[:cut_ind].to_excel(file_path1, index=False)
train_df[cut_ind:].to_excel(file_path2, index=False)

In [199]:
# Save the processed train norm df in 2 separate files (because it is too big for one file)
file_path1_norm = f'data/train_proccessed_norm_df1_run{run}.xlsx'
file_path2_norm = f'data/train_proccessed_norm_df2_run{run}.xlsx'

cut_ind_norm = train_norm_df.shape[0]//2

train_norm_df[:cut_ind_norm].to_excel(file_path1_norm, index=False)
train_norm_df[cut_ind_norm:].to_excel(file_path2_norm, index=False)

In [200]:
# Save the processed test df
test_file_path = f'C:data/test_proccessed_df_run{run}.xlsx'

test_df.to_excel(test_file_path, index=False)

In [201]:
# Save the processed test norm df
test_file_path_norm = f'C:data/test_proccessed_norm_df_run{run}.xlsx'

test_norm_df.to_excel(test_file_path_norm, index=False)
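
If the reason for splitting the train files is the worksheet row limit of the `.xlsx` format (1,048,576 rows including the header), the split can be generalized to any number of chunks. The sketch below is one possible approach under that assumption; `split_for_excel` is a hypothetical helper, and the file name in the comment is only a placeholder.

```python
import pandas as pd

EXCEL_MAX_ROWS = 1_048_576  # row limit of an .xlsx worksheet, header row included

def split_for_excel(df, max_rows=EXCEL_MAX_ROWS - 1):
    """Split a DataFrame into consecutive chunks of at most `max_rows` data rows."""
    return [df.iloc[start:start + max_rows] for start in range(0, len(df), max_rows)]

# Small demonstration with a tiny limit so the chunking is visible.
demo = pd.DataFrame({"x": range(10)})
chunks = split_for_excel(demo, max_rows=4)
print([len(chunk) for chunk in chunks])

# Each chunk could then be written separately, for example:
# for i, chunk in enumerate(chunks, start=1):
#     chunk.to_excel(f'data/example_part{i}.xlsx', index=False)
```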