## 🛠️ **Data Loading, Processing and Feature Extraction**  

This notebook processes raw sensor data from the **Open Seizure Database (OSDB)** to extract **metadata**, perform **frequency domain analysis** using the **FFT**, and compute statistical and physical features such as skewness, kurtosis, and total distance traveled. It builds two DataFrames, `df_metadata` and `df_sensordata`, and writes four CSV files to `generatedCsvDatasets`: `metadata.csv`, `sensordata.csv`, `sensordata_fft.csv`, and `features.csv`.

In [37]:
import json

import librosa.display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.signal as signal
from scipy.stats import kurtosis, skew


# Load the JSON file
file_path = '../../tests/testData/testDataVisualisation.json'  # Replace with your JSON file path
with open(file_path, 'r') as file:
    raw_json = json.load(file)

In [38]:
# Flatten the JSON into one metadata row per event
flattened_data = []

for attribute in raw_json:
    # Event-level fields
    user_id = attribute.get('userId', None)
    datapoints = attribute.get('datapoints', [])
    subtype = attribute.get('subType', None)
    stype = attribute.get('type', None)
    desc = attribute.get('desc', None)
    seizureTimes = attribute.get('seizureTimes', [])
    sampleFreq = attribute.get('sampleFreq', 25)  # defaults to 25 Hz when missing
    watchSdName = attribute.get('watchSdName', None)
    #hrAlarmActive = attribute.get('hrAlarmActive', None)
    #o2SatAlarmActive = attribute.get('o2SatAlarmActive', None)
    #dataSourceName = attribute.get('dataSourceName', None)

    # Take the eventId from the event's last datapoint; None if the event has no datapoints
    eventId = datapoints[-1].get('eventId', None) if datapoints else None

    # Append one row per event
    flattened_data.append({
        'eventId': eventId,
        'userId': user_id,
        'subtype': subtype,
        'type': stype,
        'desc': desc,
        'seizureTimes': seizureTimes,
        'sampleFreq': sampleFreq,
        #"hrAlarmActive": hrAlarmActive,
        #"o2SatAlarmActive": o2SatAlarmActive,
        'watchSdName': watchSdName,
    })

# Create a DataFrame and save it as CSV
df_metadata = pd.DataFrame(flattened_data)
df_metadata.to_csv('generatedCsvDatasets/metadata.csv', index=False)

# Display the DataFrame
df_metadata.head()

| | eventId | userId | subtype | type | desc | seizureTimes | sampleFreq | watchSdName |
|---|---|---|---|---|---|---|---|---|
| 0 | 407 | 39 | Other | Seizure | twisting to left. right arm flapping. left a... | [-38.0, 76.0] | 25 | |
| 1 | 764 | 39 | Other | Seizure | kneeling up, looking to right | [-35.0, 75.0] | 25 | |
| 2 | 4924 | 39 | Other | Seizure | on back. left arm flapping | [-25.0, 60.0] | 25 | |
| 3 | 5483 | 39 | Tonic-Clonic | Seizure | | [-45.0, 35.0] | 25 | GarminSD |
| 4 | 5486 | 39 | Tonic-Clonic | Seizure | | [-15.0, 60.0] | 25 | GarminSD |


In [39]:
# Flatten the JSON and extract relevant fields
flattened_data = []

for attribute in raw_json:
    user_id = attribute.get('userId', None)
    seizure_times = attribute.get('seizureTimes', [])
    datapoints = attribute.get('datapoints', [])

    for point in datapoints:
        event_id = point.get('eventId', None)
        hr = point.get('hr', [])
        o2Sat = point.get('o2Sat', [])
        rawData = point.get('rawData', [])
        rawData3D = point.get('rawData3D', [])
        # Append every datapoint as a row
        flattened_data.append({
            'eventId': event_id,
            'userId': user_id,
            'hr': hr,
            'o2Sat': o2Sat,
            'rawData': rawData,
            'rawData3D': rawData3D,

        })

# Create a DataFrame
df_sensordata = pd.DataFrame(flattened_data)
df_sensordata.to_csv('generatedCsvDatasets/sensordata.csv', index=False)

# Display the DataFrame
df_sensordata.head()


| | eventId | userId | hr | o2Sat | rawData | rawData3D |
|---|---|---|---|---|---|---|
| 0 | 407 | 39 | 67 | -1 | [1496, 1480, 1500, 1492, 1496, 1484, 1500, 149... | [] |
| 1 | 407 | 39 | 67 | -1 | [1492, 1508, 1496, 1476, 1484, 1476, 1496, 150... | [] |
| 2 | 407 | 39 | 68 | -1 | [1488, 1496, 1484, 1492, 1492, 1508, 1504, 148... | [] |
| 3 | 407 | 39 | 69 | -1 | [1488, 1476, 1480, 1504, 1496, 1508, 1484, 148... | [] |
| 4 | 407 | 39 | 69 | -1 | [1504, 1488, 1504, 1492, 1484, 1500, 1496, 149... | [] |


In [40]:
# Sampling frequency in Hz (matches the default `sampleFreq` of 25 used above)
sampling_rate = 25

# Define FFT calculation function for each row
def calculate_fft(raw_data):
    # Remove the DC component (mean of the signal)
    raw_data = raw_data - np.mean(raw_data)
    
    # Compute the Fourier Transform (FFT) for the entire signal
    fft_result = np.fft.fft(raw_data)
    
    # Compute the frequencies corresponding to the FFT result
    frequencies = np.fft.fftfreq(len(raw_data), d=1/sampling_rate)
    
    # Compute the magnitude of the FFT (absolute value)
    fft_magnitude = np.abs(fft_result)
    
    # Keep only the first half of the bins (the magnitude spectrum of a real signal is symmetric)
    positive_frequencies = frequencies[:len(frequencies)//2]
    positive_fft_magnitude = fft_magnitude[:len(frequencies)//2]
    
    return positive_frequencies, positive_fft_magnitude

# Add a new column for FFT data for all rows in the DataFrame
fft_results = []

for _, row in df_sensordata.iterrows():
    # Extract rawData for the row
    raw_data = np.array(row['rawData'])
    
    # Calculate the FFT for the current row
    positive_frequencies, positive_fft_magnitude = calculate_fft(raw_data)
    
    # Store the result as a list of FFT magnitudes
    fft_results.append(list(positive_fft_magnitude))  # Modify if needed to store specific frequency ranges

# Add the FFT results to the DataFrame as a new column
df_sensordata['FFT'] = fft_results

df_sensordata.to_csv('generatedCsvDatasets/sensordata_fft.csv', index=False)

df_sensordata.head()


| | eventId | userId | hr | o2Sat | rawData | rawData3D | FFT |
|---|---|---|---|---|---|---|---|
| 0 | 407 | 39 | 67 | -1 | [1496, 1480, 1500, 1492, 1496, 1484, 1500, 149... | [] | [1.2960299500264227e-11, 143.05125737182817, 5... |
| 1 | 407 | 39 | 67 | -1 | [1492, 1508, 1496, 1476, 1484, 1476, 1496, 150... | [] | [9.094947017729282e-13, 75.0235079481899, 31.0... |
| 2 | 407 | 39 | 68 | -1 | [1488, 1496, 1484, 1492, 1492, 1508, 1504, 148... | [] | [2.2737367544323206e-13, 91.25440903139302, 81... |
| 3 | 407 | 39 | 69 | -1 | [1488, 1476, 1480, 1504, 1496, 1508, 1484, 148... | [] | [1.3642420526593924e-11, 101.37768172754973, 7... |
| 4 | 407 | 39 | 69 | -1 | [1504, 1488, 1504, 1492, 1484, 1500, 1496, 149... | [] | [7.275957614183426e-12, 116.42740204040987, 77... |


In [41]:
# Sampling frequency (25 Hz)
sampling_rate = 25  # in Hz

# Function to calculate additional features
def calculate_additional_features(raw_data):
    # Calculate skewness (asymmetry of the sample distribution)
    skewness = skew(raw_data)

    # Calculate kurtosis (SciPy returns Fisher/excess kurtosis by default)
    kurt = kurtosis(raw_data)

    # Calculate the population standard deviation (NumPy default, ddof=0)
    std_dev = np.std(raw_data)

    # Calculate the maximum absolute sample value
    max_acceleration = np.max(np.abs(raw_data))

    # Calculate a "total distance" proxy: each sample contributes 0.5 * a * dt^2,
    # the displacement of constant acceleration from rest over one time step.
    # This is not a true double integration, and rawData is used as-is
    # (no offset removal), so the value scales with the mean signal level.
    time_step = 1 / sampling_rate  # time per sample in seconds
    distance_traveled = 0.5 * raw_data * time_step**2
    total_distance = np.sum(np.abs(distance_traveled))  # sum of absolute per-sample terms
    
    return skewness, kurt, std_dev, max_acceleration, total_distance

# Add new columns for each row in the DataFrame
additional_features = []

for _, row in df_sensordata.iterrows():
    # Extract rawData for the row
    raw_data = np.array(row['rawData'])
    
    # Calculate additional features
    skewness, kurt, std_dev, max_acceleration, total_distance = calculate_additional_features(raw_data)
    
    # Store the result as a list of additional features
    additional_features.append([skewness, kurt, std_dev, max_acceleration, total_distance])

# Convert the list of additional features to a DataFrame
additional_features_df = pd.DataFrame(additional_features, columns=['Skewness', 'Kurtosis', 'StdDev', 'MaxAcceleration', 'TotalDistance'])

# Concatenate the new features with the original DataFrame, making sure there are no duplicates
df_sensordata = pd.concat([df_sensordata.drop(columns=['Skewness', 'Kurtosis', 'StdDev', 'MaxAcceleration', 'TotalDistance'], errors='ignore'), additional_features_df], axis=1)

# Save the enriched DataFrame as CSV
df_sensordata.to_csv('generatedCsvDatasets/features.csv', index=False)

# Display the first rows to confirm the new columns have been added
df_sensordata.head(5)


| | eventId | userId | hr | o2Sat | rawData | rawData3D | FFT | Skewness | Kurtosis | StdDev | MaxAcceleration | TotalDistance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 407 | 39 | 67 | -1 | [1496, 1480, 1500, 1492, 1496, 1484, 1500, 149... | [] | [1.2960299500264227e-11, 143.05125737182817, 5... | 0.361479 | 0.259783 | 10.162003 | 1520.0 | 148.8736 |
| 1 | 407 | 39 | 67 | -1 | [1492, 1508, 1496, 1476, 1484, 1476, 1496, 150... | [] | [9.094947017729282e-13, 75.0235079481899, 31.0... | -0.154103 | -0.599913 | 9.286288 | 1508.0 | 148.8192 |
| 2 | 407 | 39 | 68 | -1 | [1488, 1496, 1484, 1492, 1492, 1508, 1504, 148... | [] | [2.2737367544323206e-13, 91.25440903139302, 81... | 0.029643 | -0.209197 | 10.692694 | 1516.0 | 148.9952 |
| 3 | 407 | 39 | 69 | -1 | [1488, 1476, 1480, 1504, 1496, 1508, 1484, 148... | [] | [1.3642420526593924e-11, 101.37768172754973, 7... | 0.497013 | 0.355193 | 10.101564 | 1520.0 | 148.912 |
| 4 | 407 | 39 | 69 | -1 | [1504, 1488, 1504, 1492, 1484, 1500, 1496, 149... | [] | [7.275957614183426e-12, 116.42740204040987, 77... | -0.316861 | -0.08601 | 9.912048 | 1512.0 | 149.0464 |


## **Step 1: Data Loader** 📂 
**Description**:  
In the first step, the script loads the raw JSON data file, which contains sensor data, event information, and metadata for each user.  
- **Purpose**: To make the data accessible for processing and analysis.  
- **Process**: The JSON file is read with `json.load` and stored in the variable `raw_json` for further use.  
- **Input**: A JSON file located at the path given in `file_path`.  
- **Output**: Raw nested JSON data stored in memory.  
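
A minimal sketch of this loading step with a basic shape check, assuming the file holds a top-level list of event objects (as the loop over `raw_json` implies). The helper name `load_events` is hypothetical and not part of the notebook.

```python
import json
from pathlib import Path


def load_events(file_path):
    """Load an OSDB-style JSON export and return its list of events.

    Hypothetical helper: assumes the top level of the file is a list of
    event dictionaries, which is how the notebook iterates over raw_json.
    """
    with Path(file_path).open('r') as file:
        raw_json = json.load(file)
    if not isinstance(raw_json, list):
        raise ValueError(f'expected a list of events, got {type(raw_json).__name__}')
    return raw_json
```

Failing early on an unexpected top-level type gives a clearer error than the `AttributeError` that would otherwise surface later, inside the flattening loops.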

---

## **Step 2: Metadata Processing** 🗂️  
**Description**:  
The second step processes and **flattens the JSON data** to extract essential metadata fields.  
- **Purpose**: Organize relevant details such as `eventId`, `userId`, `type`, `subtype`, `desc`, `seizureTimes`, `sampleFreq`, and `watchSdName` into a structured format.  
- **Process**:  
   - Loops through the events in `raw_json` to extract key fields, producing one row per event.  
   - Organizes these fields into a tabular structure.  
   - Saves the metadata as a **CSV file** for easy access and storage.  
- **Output**:  
   - A clean DataFrame (`df_metadata`) containing all metadata information.  
   - A CSV file (`metadata.csv`) stored in the `generatedCsvDatasets` directory.  
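
The flattening described above can be sketched as a small function. This is an illustration under assumptions, not the notebook's exact code: `flatten_metadata` is a hypothetical name, the sample events are synthetic, and all datapoints of an event are assumed to share one `eventId`.

```python
import pandas as pd


def flatten_metadata(raw_json):
    """Build one metadata row per event (hypothetical helper)."""
    rows = []
    for event in raw_json:
        datapoints = event.get('datapoints', [])
        rows.append({
            # eventId is read from the datapoints; None when there are none
            'eventId': datapoints[-1].get('eventId') if datapoints else None,
            'userId': event.get('userId'),
            'subtype': event.get('subType'),
            'type': event.get('type'),
            'desc': event.get('desc'),
            'seizureTimes': event.get('seizureTimes', []),
            'sampleFreq': event.get('sampleFreq', 25),
            'watchSdName': event.get('watchSdName'),
        })
    return pd.DataFrame(rows)


# Synthetic events for illustration only (not taken from OSDB)
example = [
    {'userId': 1, 'subType': 'Other', 'type': 'Seizure',
     'datapoints': [{'eventId': 10}, {'eventId': 10}]},
    {'userId': 2, 'type': 'Seizure', 'datapoints': []},
]
df_example = flatten_metadata(example)
print(df_example[['eventId', 'userId', 'sampleFreq']])
```

The second synthetic event has no datapoints, so its `eventId` is missing rather than borrowed from another event.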

---

## **Step 3: Sensor Data Processing** ⚙️  
**Description**:  
The third step extracts and organizes the **raw sensor data** from the JSON file, one row per datapoint.  
- **Purpose**: Transform time series sensor readings into a format that can be analyzed and visualized.  
- **Process**:  
   - Iterates through each event in `raw_json` and each of its `datapoints`.  
   - Extracts `eventId`, `hr`, `o2Sat`, `rawData`, and `rawData3D` from every datapoint.  
   - Collects the rows into a structured DataFrame and saves it as `sensordata.csv`.  
- **Output**: A DataFrame (`df_sensordata`) with raw sensor readings mapped to corresponding events.
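
One practical consequence of saving this DataFrame with `to_csv`: list-valued columns such as `rawData` are written as their string representation, so they come back as strings on reload. Below is a minimal sketch of restoring them, assuming the lists contain only plain numbers (so the text is also valid JSON). The in-memory buffer stands in for `sensordata.csv`, and the sample values are made up.

```python
import io
import json

import pandas as pd

# Made-up rows shaped like df_sensordata (not real OSDB data)
df = pd.DataFrame({
    'eventId': [407, 407],
    'hr': [67, 68],
    'rawData': [[1496, 1480, 1500], [1492, 1508, 1496]],
    'rawData3D': [[], []],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# Without converters, rawData would be read back as text like '[1496, 1480, 1500]'
restored = pd.read_csv(
    buffer,
    converters={'rawData': json.loads, 'rawData3D': json.loads},
)
print(type(restored.loc[0, 'rawData']))
```

If the lists could contain values such as `nan`, `json.loads` would reject them and a different parser would be needed.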

---

## **Step 4: Frequency Domain Analysis (FFT)** 🔍  
**Description**:  
The fourth step applies the **Fast Fourier Transform (FFT)** to the `rawData`.  
- **Purpose**: Analyze the frequency components of the sensor data to uncover signal patterns.  
- **Process**:  
   - Removes the **DC component** (signal mean).  
   - Computes the FFT and retains **positive frequencies** only.  
   - Adds the FFT magnitude results to a new column (`FFT`).  
- **Output**: An `FFT` column on `df_sensordata` holding the magnitude spectrum of each datapoint, saved to `sensordata_fft.csv`.
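
For real-valued signals, NumPy also offers `np.fft.rfft` and `np.fft.rfftfreq`, which compute only the non-negative frequency bins. The sketch below is an alternative formulation, not the notebook's code: `magnitude_spectrum` is a hypothetical name, and the test signal is synthetic. Note that `rfft` returns `n // 2 + 1` bins, one more than the `[:n // 2]` slice used in the notebook.

```python
import numpy as np


def magnitude_spectrum(raw_data, sampling_rate=25):
    """Return (frequencies, magnitudes) for a real-valued signal."""
    samples = np.asarray(raw_data, dtype=float)
    samples = samples - samples.mean()           # remove the DC component
    magnitudes = np.abs(np.fft.rfft(samples))    # non-negative frequencies only
    frequencies = np.fft.rfftfreq(len(samples), d=1 / sampling_rate)
    return frequencies, magnitudes


# Synthetic check: a 5 Hz sine on a constant offset, 5 s at 25 Hz (125 samples)
t = np.arange(125) / 25
signal_5hz = 1500 + 10 * np.sin(2 * np.pi * 5 * t)
freqs, mags = magnitude_spectrum(signal_5hz)
peak_hz = freqs[np.argmax(mags)]
print(peak_hz)
```

Because the sine completes a whole number of cycles in the window, its energy lands in a single bin and the reported peak sits at the sine's frequency.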

---

## **Step 5: Feature Extraction** 📈  
**Description**:  
The final step extracts meaningful **statistical and physical features** from the time-domain sensor data.  
- **Purpose**: Generate features that can be used for downstream tasks like analysis, visualization, or machine learning.  
- **Features Extracted**:  
   - **Skewness**: Measure of asymmetry.  
   - **Kurtosis**: Tailedness of the sample distribution (SciPy's default excess kurtosis, where a normal distribution scores 0).  
   - **Standard Deviation**: Spread of values.  
   - **Max Acceleration**: Largest absolute sample value.  
   - **Total Distance**: A rough proxy that sums `0.5 * a * dt^2` over all samples, computed on the raw readings without offset removal.  
- **Output**:  
   - `df_sensordata` with the new feature columns added.  
   - A CSV file (`features.csv`) containing the final enriched dataset.
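
The five features can be sketched as a single function that returns a dictionary. The name `extract_features` is hypothetical and the input signal is made up; the formulas follow the notebook's `calculate_additional_features`.

```python
import numpy as np
from scipy.stats import kurtosis, skew


def extract_features(raw_data, sampling_rate=25):
    """Compute the five per-datapoint features as a dictionary."""
    samples = np.asarray(raw_data, dtype=float)
    time_step = 1 / sampling_rate
    return {
        'Skewness': skew(samples),
        'Kurtosis': kurtosis(samples),  # Fisher (excess) kurtosis by default
        'StdDev': np.std(samples),      # population standard deviation (ddof=0)
        'MaxAcceleration': np.max(np.abs(samples)),
        # Same proxy as the notebook: sum of |0.5 * a * dt^2| over all samples
        'TotalDistance': np.sum(np.abs(0.5 * samples * time_step**2)),
    }


# Made-up signal: 100 samples alternating between 1490 and 1510
oscillating = np.tile([1490.0, 1510.0], 50)
features = extract_features(oscillating)
print(features)
```

Because the proxy sums absolute raw values, `TotalDistance` here is driven by the offset near 1500 rather than by the ±10 oscillation. Subtracting the mean first would be one way to change that, though it is not what the notebook does.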

---

# 👤 Author  
Developed for the Open Seizure Database by **Jamie Pordoy**.