# SV3 Data Preprocessing 

This notebook walks the user through selecting a campaign and preprocessing its raw data into the inputs needed to run GARPOS.

It is specific to the steps for processing SV3 data.  

## Import modules

In [None]:
import os
from pathlib import Path

from es_sfgtools.workflows.workflow_handler import WorkflowHandler

### Browse available campaigns from the community archive and select target
- Locate the campaign of interest in https://gage-data.earthscope.org/archive/seafloor, and note the `network`, `station`, and `campaign` names, which will be input in the cell below.  

- Note: the cascadia-gorda raw data is currently hidden (by request) but still usable. The available campaigns are:

    | Year | GCC1 | NBR1 | NCC1 |
    |---|---|---|---|
    | **2022** | 2022_A_1065 | 2022_A_1065 | 2022_A_1065 |
    | **2023** | 2023_A_1063 | 2023_A_1063 | 2023_A_1063 |
    | **2024** | 2024_A_1126 | 2024_A_1126 | 2024_A_1126 |


- To process a new campaign with this notebook, its data must first be submitted to, and made available from, the community archive.
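
As a quick sanity check before filling in the next cell, the table above can be encoded as a small lookup. This is only a sketch: `CASCADIA_GORDA_CAMPAIGNS` and `is_listed` are hypothetical names introduced here, not part of `es_sfgtools`, and the mapping covers only the cascadia-gorda campaigns listed in the table.

```python
# Campaign names copied from the table above (cascadia-gorda network only).
CASCADIA_GORDA_CAMPAIGNS = {
    station: ("2022_A_1065", "2023_A_1063", "2024_A_1126")
    for station in ("GCC1", "NBR1", "NCC1")
}


def is_listed(station: str, campaign: str) -> bool:
    """Return True if the station/campaign pair appears in the table above."""
    return campaign in CASCADIA_GORDA_CAMPAIGNS.get(station, ())


print(is_listed("NCC1", "2023_A_1063"))  # True
print(is_listed("NCC1", "2021_A_0000"))  # False
```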

In [None]:
# Input survey parameters
network='cascadia-gorda'
site='NCC1'
campaign='2023_A_1063'

# Set data directory path for local environment
data_dir = Path("~/data/sfg").expanduser()
raw_data_dir = data_dir / network / site / campaign / "raw"


#### Defaults below: change only if needed ####
os.makedirs(data_dir, exist_ok=True)
workflow = WorkflowHandler(directory=data_dir)
workflow.set_network_station_campaign(network_id=network, 
                                      station_id=site, 
                                      campaign_id=campaign)
                                      
print(f"Workflow directory: {workflow.directory}")
print(f"Raw data directory for campaign: {raw_data_dir}")


## Optional Steps: Ingest Local Raw Data or Download It from the Community Archive
### Option 1: Ingest Local Raw Data

If you already have raw data files downloaded on your local machine, use this option to register them with the workflow catalog.

The code below will scan the `raw_data_dir` directory (defined above) and add any existing raw data files to the internal catalog, making them available for processing. This is useful when you've manually downloaded files or are reusing data from a previous session.

In [None]:
# Ingest raw data from the local raw data directory
workflow.ingest_add_local_data(raw_data_dir)
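
Before ingesting, it can help to see what is actually in the raw data directory. The sketch below uses only the standard library to count files by extension. `summarize_raw_files` is a hypothetical helper, not an `es_sfgtools` function, and it makes no claim about which file types the workflow recognizes.

```python
from collections import Counter
from pathlib import Path


def summarize_raw_files(directory: Path) -> Counter:
    """Count files under `directory` (recursively) by lowercase extension."""
    counts = Counter()
    if not directory.exists():
        return counts
    for path in directory.rglob("*"):
        if path.is_file():
            counts[path.suffix.lower() or "<no extension>"] += 1
    return counts


# Example usage with the directory defined earlier in this notebook:
# for ext, n in sorted(summarize_raw_files(raw_data_dir).items()):
#     print(f"{ext}: {n} file(s)")
```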

### Option 2: Download Data from Community Archive

If you don't have data downloaded locally, this option retrieves raw data from the EarthScope community archive (https://gage-data.earthscope.org/archive/seafloor).

**Two-step process:**
1. **Catalog the available data**: `workflow.ingest_catalog_archive_data()` queries the archive and creates an inventory of available files for your selected campaign

2. **Download the raw data**: `workflow.ingest_download_archive_data()` downloads the necessary raw data files to your local directory

The workflow automatically identifies and downloads the appropriate file types. Files already present locally are skipped.

In [None]:
# Ingest catalog data
workflow.ingest_catalog_archive_data()

# Download data
workflow.ingest_download_archive_data()
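
The skip-if-present behaviour described above can be pictured with a few lines of standard-library Python. This illustrates the idea only; it is not how `es_sfgtools` implements it. `missing_files` is a hypothetical helper, and comparing by file name alone is an assumption made for the sketch.

```python
from pathlib import Path
from typing import Iterable, List


def missing_files(remote_names: Iterable[str], local_dir: Path) -> List[str]:
    """Return the names in `remote_names` with no same-named file under `local_dir`."""
    if local_dir.exists():
        local_names = {p.name for p in local_dir.rglob("*") if p.is_file()}
    else:
        local_names = set()
    # Preserve the order of the remote listing.
    return [name for name in remote_names if name not in local_names]


# Example usage (the remote listing here is made up for illustration):
# to_fetch = missing_files(["file_a.raw", "file_b.bin"], raw_data_dir)
```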

## Configure Processing Parameters

The `global_config` dictionary below defines settings for each stage of the data processing pipeline. These settings control how raw data is converted into GARPOS-ready inputs.

### Configuration Sections:

- **`dfop00_config`**: Settings for processing acoustic ping/reply sequences from DFOP00 files
  - `override`: Set to `True` to reprocess even if data already exists

- **`novatel_config`**: Settings for processing GNSS range observations from Novatel receivers
  - `n_processes`: Number of parallel processes to use (adjust based on your CPU cores)
  - `override`: Set to `True` to reprocess existing data

- **`position_update_config`**: Settings for interpolating waveglider positions to ping times
  - `lengthscale`: Gaussian process lengthscale parameter for smoothing (in seconds)
  - `plot`: Set to `True` to generate diagnostic plots
  - `override`: Set to `True` to reprocess existing data

- **`pride_config`**: Settings for PRIDE-PPPAR precise point positioning
  - `cutoff_elevation`: Minimum satellite elevation angle (degrees)
  - `system`: GNSS constellation(s) to process; "GREC23J" selects GPS, GLONASS, Galileo, BeiDou (BDS-2 and BDS-3), and QZSS
  - `frequency`: GNSS frequency bands to use for each system
  - `loose_edit`: Use relaxed editing for high-dynamic waveglider data
  - `sample_frequency`: Output position sampling rate (Hz)
  - `tides`: Tide corrections to apply; "SOP" = solid Earth, ocean, and pole tides
  - `override`: Set to `True` to reprocess existing solutions
  - `override_products_download`: Set to `True` to re-download orbit/clock products

- **`rinex_config`**: Settings for generating RINEX observation files
  - `n_processes`: Number of parallel processes for RINEX generation
  - `time_interval`: Length of each RINEX file in hours (24 = daily files)
  - `override`: Set to `True` to regenerate existing RINEX files

**Note**: Set `override=False` to skip processing steps where outputs already exist. This is useful for resuming interrupted workflows.

In [None]:
global_config = {
    "dfop00_config": {
        "override": True
        },
    "novatel_config": {
        "n_processes": 14,
        "override": False
        },
    "position_update_config": {
        "override": True,
        "lengthscale": 0.1,
        "plot": False
        },
    "pride_config": {
        "cutoff_elevation": 7,
        "end": None,
        "frequency": ["G12", "R12", "E15", "C26", "J12"],
        "high_ion": None,
        "interval": None,
        "local_pdp3_path": None,
        "loose_edit": True,
        "sample_frequency": 1,
        "start": None,
        "system": "GREC23J",
        "tides": "SOP",
        "override_products_download": False,
        "override": False
        },
    "rinex_config": {
        "n_processes": 14,
        "time_interval": 24,
        "override": False
        }
    }   
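
The configuration above hard-codes `n_processes` to 14, which may not suit every machine. One way to derive a variant without editing the dictionary in place is a small deep-merge helper, sketched below. `with_overrides` and `example_config` are hypothetical names introduced for illustration, and leaving two cores free is an arbitrary choice rather than a library recommendation.

```python
import copy
import os


def with_overrides(base: dict, overrides: dict) -> dict:
    """Return a deep copy of `base` with nested keys from `overrides` applied."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = with_overrides(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Leave two cores free for the rest of the system (an arbitrary choice).
n_workers = max(1, (os.cpu_count() or 1) - 2)

# A two-section stand-in for the `global_config` defined above.
example_config = {
    "novatel_config": {"n_processes": 14, "override": False},
    "rinex_config": {"n_processes": 14, "time_interval": 24, "override": False},
}

tuned = with_overrides(example_config, {
    "novatel_config": {"n_processes": n_workers},
    "rinex_config": {"n_processes": n_workers},
})

print(tuned["rinex_config"]["time_interval"])  # 24: untouched keys are preserved
```

Because the helper works on a copy, the original dictionary stays available as a baseline to compare against.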

## Run the Complete SV3 Processing Pipeline

This single command executes the entire data processing workflow, transforming raw SV3 data into GARPOS-ready observation files.

**What it does:**
1. `process_novatel`: Preprocesses Novatel 770 and 000 binary files for the current context.
    1. **Novatel 770**: Extracts GNSS observations to the primary TileDB array
    2. **Novatel 000**: Extracts GNSS observations to the secondary array, plus IMU positions

2. `build_rinex`: Generates and catalogs daily RINEX files for the current campaign.
    1. Consolidates GNSS observation data
    2. Determines processing year from config or campaign name
    3. Invokes tile2rinex to generate daily RINEX files
    4. Creates an AssetEntry for each RINEX file
    5. Updates the asset catalog with the merge job

3. `run_pride`: Runs PRIDE-PPPAR on RINEX files to generate KIN and residual files.
    1. Retrieves RINEX files needing processing
    2. Downloads GNSS product files (SP3, OBX, ATT) for each unique DOY
    3. Runs PRIDE-PPPAR in parallel to convert RINEX to KIN format
    4. Adds KIN and residual files to the asset catalog

4. `process_kinematic`: Processes KIN files to generate kinematic position dataframes.
    1. Retrieves KIN files needing processing
    2. Converts each KIN file to a structured dataframe
    3. Writes dataframes to the kinematic position TileDB array
    4. Marks files as processed in the asset catalog

5. `process_dfop00`: Processes Sonardyne DFOP00 files to generate preliminary shotdata.
    1. Retrieves DFOP00 files needing processing
    2. Converts each file to a shotdata dataframe (acoustic ping-reply sequences)
    3. Writes dataframes to the preliminary shotdata TileDB array
    4. Marks files as processed in the asset catalog

6. `update_shotdata`: Refines shotdata with interpolated high-precision kinematic positions. This step improves position accuracy by replacing the GNSS positions in the preliminary shotdata with interpolated PRIDE-PPPAR solutions.
    1. Gets the merge signature from the preliminary shotdata and kinematic position arrays
    2. Checks whether refinement is needed (via override or merge status)
    3. Merges shotdata with interpolated kinematic positions
    4. Writes refined shotdata to the final TileDB array
    5. Records the merge job in the asset catalog

7. `process_svp`: Processes CTD and Seabird files to generate sound velocity profiles (SVP).
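
To make the `update_shotdata` step more concrete, the sketch below evaluates a position at a ping time from a series of kinematic samples. It deliberately swaps in plain linear interpolation as a stand-in; the pipeline itself is configured with a Gaussian process `lengthscale` (see `position_update_config`), which this example does not reproduce. `interpolate_position` and the sample values are hypothetical.

```python
from bisect import bisect_left
from typing import Sequence, Tuple

Position = Tuple[float, float, float]


def interpolate_position(
    times: Sequence[float], positions: Sequence[Position], t: float
) -> Position:
    """Linearly interpolate a position at time `t` from samples sorted by time."""
    if not times or len(times) != len(positions):
        raise ValueError("times and positions must be non-empty and the same length")
    if t < times[0] or t > times[-1]:
        raise ValueError("ping time lies outside the kinematic solution")
    i = bisect_left(times, t)
    if times[i] == t:
        # The ping coincides with a sample; no interpolation needed.
        return positions[i]
    t0, t1 = times[i - 1], times[i]
    w = (t - t0) / (t1 - t0)
    return tuple(a + w * (b - a) for a, b in zip(positions[i - 1], positions[i]))


# Two made-up samples one second apart, queried a quarter of the way between them.
print(interpolate_position([0.0, 1.0], [(0.0, 0.0, 0.0), (2.0, 4.0, 6.0)], 0.25))
# (0.5, 1.0, 1.5)
```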
    
**Parameters:**
- `job='all'`: Runs all processing stages sequentially. You can also specify individual stages like `process_novatel`, `build_rinex`, `run_pride`, `process_kinematic`, `process_dfop00`, `refines_shotdata`, `process_svp`
- `primary_config=global_config`: (optional) Uses the configuration settings defined above 
- `secondary_config`: (optional) Station-specific config overrides - uncomment and define if needed for special cases

**Processing time:** This can take some time depending on campaign length. Progress will be displayed as each stage completes.

In [None]:
workflow.preprocess_run_pipeline_sv3(
                job='all',
                primary_config=global_config,
            )
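
To rerun only part of the pipeline, the stage names listed under **Parameters** can be passed one at a time. The sketch below wraps that pattern with simple timing. `run_stages` is a hypothetical helper, and it relies on `job` accepting a single stage name as described above; the usage example is left commented out because it needs the `workflow` and `global_config` objects from this notebook.

```python
import time
from typing import Callable, Dict, Iterable


def run_stages(run_stage: Callable[[str], None], stages: Iterable[str]) -> Dict[str, float]:
    """Call `run_stage(name)` for each stage in order; return elapsed seconds per stage."""
    elapsed = {}
    for name in stages:
        start = time.perf_counter()
        run_stage(name)
        elapsed[name] = time.perf_counter() - start
        print(f"{name}: {elapsed[name]:.1f} s")
    return elapsed


# Example usage with the objects defined in this notebook:
# run_stages(
#     lambda job: workflow.preprocess_run_pipeline_sv3(job=job, primary_config=global_config),
#     ["process_novatel", "build_rinex", "run_pride"],
# )
```

Running stages separately makes it easier to see which one dominates the processing time, and to resume from a failed stage without repeating earlier ones.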

## Optional: Clean Up Raw Data Files

After successful processing, large raw data files can be safely removed to free up disk space. All essential data has been normalized and stored in the TileDB arrays and intermediate products.

**What will be deleted:**
- Files with `.raw` extension (GNSS observation binaries)
- Files with `.bin` extension (acoustic data binaries)

**⚠️ Warning:** Deletion is permanent on the local machine. Run this only after confirming that processing completed successfully. If the raw binary files are needed later, they can be re-downloaded from the archive.

In [None]:
# Check raw data directory and find .raw and .bin files
if raw_data_dir.exists():
    # Find all .raw and .bin files
    raw_files = list(raw_data_dir.rglob('*.raw'))
    bin_files = list(raw_data_dir.rglob('*.bin'))
    all_files_to_delete = raw_files + bin_files
    
    if all_files_to_delete:
        # Calculate total size of files to be deleted
        total_size = sum(f.stat().st_size for f in all_files_to_delete)
        size_gb = total_size / (1024**3)
        
        print(f"Raw data directory: {raw_data_dir}")
        print("\nFiles to be deleted:")
        print(f"  - {len(raw_files)} .raw files")
        print(f"  - {len(bin_files)} .bin files")
        print(f"  - Total size: {size_gb:.2f} GB")
        
        print("\nDetailed list:")
        for f in all_files_to_delete:
            file_size = f.stat().st_size / (1024**2)
            print(f"  - {f.relative_to(raw_data_dir)}: {file_size:.2f} MB")
        
        # Uncomment the following lines to actually delete the files
        # print(f"\n⚠️  Deleting {len(all_files_to_delete)} files ({size_gb:.2f} GB)...")
        # for f in all_files_to_delete:
        #     f.unlink()
        # print("✓ Raw binary files deleted successfully")
        
        print("\n⚠️  File deletion is COMMENTED OUT for safety.")
        print("    Uncomment the deletion lines above to proceed with cleanup.")
    else:
        print(f"No .raw or .bin files found in: {raw_data_dir}")
else:
    print(f"Raw data directory not found: {raw_data_dir}")