# Data Preprocessing
This notebook walks the user through selecting a survey and preprocessing its raw data into the inputs required to run GARPOS.

In [1]:
import os
import sys
from pathlib import Path
import pandas as pd

from es_sfgtools.processing.pipeline import DataHandler
from es_sfgtools.utils.archive_pull import list_survey_files

from es_sfgtools.utils.loggers import BaseLogger

BaseLogger.route_to_console()

os.environ['DYLD_LIBRARY_PATH'] = "/Users/gottlieb/miniconda3/envs/seafloor_geodesy/lib"
os.environ["PATH"] += os.pathsep + "/Users/gottlieb/.PRIDE_PPPAR_BIN"

## Step 1. Initial Setup


#### Browse available surveys from the community archive and select target
- Locate the survey of interest at https://gage-data.earthscope.org/archive/seafloor and note its `network`, `station`, and `survey` names, which are entered in the cell below. Leave `vessel_type` as `SV3` unless you know you are working with older SV2 data.
- To process a new survey with this notebook, its data must first be submitted to, and made available from, the community archive.
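
As a rough illustration of how the three names map onto the archive, the sketch below builds the URL of a survey's `raw` folder. The `network/station/survey/raw` layout is inferred from the URLs printed in the outputs of this notebook, and `survey_raw_url` is a hypothetical helper, not part of `es_sfgtools`.

```python
from urllib.parse import quote

# Base URL taken from the archive link above.
ARCHIVE_ROOT = "https://gage-data.earthscope.org/archive/seafloor"


def survey_raw_url(network: str, station: str, survey: str) -> str:
    """Hypothetical helper: build the URL of a survey's raw-data folder.

    Assumes the archive is laid out as <root>/<network>/<station>/<survey>/raw.
    """
    # Percent-encode each path component so unusual names cannot break the URL.
    parts = [quote(part, safe="") for part in (network, station, survey)]
    return "/".join([ARCHIVE_ROOT, *parts, "raw"])


print(survey_raw_url("cascadia-gorda", "NCC1", "2024_A_1126"))
```

Opening the printed URL in a browser is a quick way to confirm that the names were typed correctly before running the cells below.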

In [28]:
# Input survey parameters
network = 'cascadia-gorda'
site = 'NCC1'
survey = '2024_A_1126'
vessel_type = 'SV3'

# Set data directory path for local environment
data_dir = Path("~/data/sfg").expanduser()
data_dir.mkdir(parents=True, exist_ok=True)

# Use the following defaults unless you need different settings
data_handler = DataHandler(directory=data_dir)
data_handler.change_working_station(network=network, station=site, survey=survey)
BaseLogger.set_dir(data_handler.station_log_dir)

if vessel_type == 'SV3':
    pipeline, config = data_handler.get_pipeline_sv3()
elif vessel_type == 'SV2':
    pipeline, config = data_handler.get_pipeline_sv2()
else:
    raise ValueError(f"Vessel type {vessel_type} not recognized")



Building directory structure for cascadia-gorda NCC1 2024_A_1126
Building TileDB arrays for NCC1
Changed working station to cascadia-gorda NCC1


/Users/gottlieb/data/sfg


## Step 2. Inventory available data and its location

In [12]:
# Generate a list of the files available from the remote archive
# TODO: implement options for raw vs intermediate vs processed
remote_filepaths = list_survey_files(network=network, station=site, survey=survey, show_details=True)

Found under https://gage-data.earthscope.org/archive/seafloor/cascadia-gorda/NCC1/2024_A_1126/raw:
    136 NOV000 file(s)
    40 NOV770 file(s)
    25 DFOP00 file(s)
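
The per-type counts above can be approximated directly from the file URLs. The sketch below assumes that the file type is the last underscore-separated token of the file name (as in `..._NOV770.raw`), which is inferred from the URLs logged in this notebook. `file_type_token` and `count_by_type` are hypothetical helpers and not necessarily how `list_survey_files` classifies files.

```python
from collections import Counter
from pathlib import PurePosixPath
from urllib.parse import urlparse


def file_type_token(url: str) -> str:
    """Return the last underscore-separated token of the file stem, e.g. 'NOV770'."""
    stem = PurePosixPath(urlparse(url).path).stem
    return stem.rsplit("_", 1)[-1]


def count_by_type(urls):
    """Count URLs per file-type token."""
    return Counter(file_type_token(url) for url in urls)


# Three URLs copied from the log output of this notebook.
base = "https://gage-data.earthscope.org/archive/seafloor/cascadia-gorda/NCC1/2024_A_1126/raw/"
example_urls = [
    base + "323843_001_20240921_225319_00110_AHRS00.bin",
    base + "323843_001_20240921_225319_00110_TCVR00.bin",
    base + "323843_001_20240921_225330_00110_NOV770.raw",
]

for token, count in sorted(count_by_type(example_urls).items()):
    print(f"    {count} {token} file(s)")
```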


In [13]:
# See what files exist locally
data_type_counts = data_handler.get_dtype_counts()
print("Local data directory contains the following:")
for dtype, count in data_type_counts.items():
    print(f"    {dtype}: {count}")

Local data directory contains the following:
    dfop00: 25
    kin: 21
    kinresiduals: 18
    novatel770: 40
    rinex: 22


## Step 3. Pull data from remote archive

In [14]:
# Add the remote files found above to the local catalog. This only builds an
# inventory; the files themselves are downloaded in a later step.
# TODO: Detail counts of files local vs only remote
data_handler.add_data_remote(remote_filepaths=remote_filepaths)


File type not recognized for https://gage-data.earthscope.org/archive/seafloor/cascadia-gorda/NCC1/2024_A_1126/raw/323843_001_20240921_225319_00110_AHRS00.bin
File type not recognized for https://gage-data.earthscope.org/archive/seafloor/cascadia-gorda/NCC1/2024_A_1126/raw/323843_001_20240921_225319_00110_TCVR00.bin
File https://gage-data.earthscope.org/archive/seafloor/cascadia-gorda/NCC1/2024_A_1126/raw/323843_001_20240921_225330_00110_NOV770.raw already exists in the catalog
File type not recognized for https://gage-data.earthscope.org/archive/seafloor/cascadia-gorda/NCC1/2024_A_1126/raw/323843_001_20240921_230308_00110_INOUT0.bin
File type not recognized for https://gage-data.earthscope.org/archive/seafloor/cascadia-gorda/NCC1/2024_A_1126/raw/323843_001_20240922_000000_00111_AHRS00.bin
File type not recognized for https://gage-data.earthscope.org/archive/seafloor/cascadia-gorda/NCC1/2024_A_1126/raw/323843_001_20240922_000000_00111_GNSS00.bin
File https://gage-data.earthscope.org/ar

#### Select file types for downloading
The file types available depend on whether the data were collected with an SV2 or an SV3 Wave Glider.

![GARPOS processing flow](garpos_flow.jpg)

In [15]:
# Download the files by type
# data_handler.download_data(file_type='sonardyne', show_details=False)
# data_handler.download_data(file_type='novatel', show_details=False)
# data_handler.download_data(file_type='master', show_details=False)
# data_handler.download_data(file_type='svpavg', show_details=False)
# data_handler.download_data(file_type='leverarm', show_details=False)

data_handler.download_data(file_types='dfop00')
data_handler.download_data(file_types='novatel770')

No new files to download
No new files to download


Getting assets for cascadia-gorda NCC1 2024_A_1126 AssetType.DFOP00
Getting assets for cascadia-gorda NCC1 2024_A_1126 AssetType.NOVATEL770


## Step 4. Parse and process raw data into processing input schemas

- 4.1 Parse acoustic observations into AcousticDataFrames
- 4.2 Parse IMU observations into IMUDataFrames
- 4.3 Process GNSS observables to generate PositionDataFrames
    - Parse RANGE-A NovAtel messages, build RINEX files
    - Run PRIDE-PPP-AR on RINEX, generate Kin files
    - Parse Kin files into PositionDataFrames
- 4.4 Parse metadata files into SiteConfig

### 4.1 Process and read DFOP00 files 

In [16]:
# config.dfop00_config.override = True
config.dfop00_config.show_details = True
pipeline.config = config
pipeline.process_dfop00()

Getting assets for cascadia-gorda NCC1 2024_A_1126 AssetType.DFOP00
Found 2 DFOP00 Files to Process


Processing DFOP00 Files:   0%|          | 0/2 [00:00<?, ?it/s]

Generated 0 ShotData dataframes From 2 DFOP00 Files


### 4.3 Take all GNSS parent files and generate GNSS DataFrames

In [17]:
pipeline.pre_process_novatel()

Processing Novatel 770 data for cascadia-gorda NCC1 2024_A_1126
Getting local assets for cascadia-gorda NCC1 2024_A_1126 AssetType.NOVATEL770
Processing Novatel 000 data for cascadia-gorda NCC1 2024_A_1126
Getting local assets for cascadia-gorda NCC1 2024_A_1126 AssetType.NOVATEL000
No Novatel 000 Files Found to Process for cascadia-gorda NCC1 2024_A_1126


In [18]:
pipeline.get_rinex_files()

In [21]:
config.rinex_config.override = False
pipeline.config = config
pipeline.process_rinex()

Getting assets for cascadia-gorda NCC1 2024_A_1126 AssetType.RINEX
Getting assets for cascadia-gorda NCC1 2024_A_1126 AssetType.KIN


  warn(response)


Processing Rinex Files:   0%|          | 0/1 [00:00<?, ?it/s]

In [22]:
pipeline.process_kin()

Getting assets for cascadia-gorda NCC1 2024_A_1126 AssetType.KIN
Getting assets for cascadia-gorda NCC1 2024_A_1126 AssetType.KINRESIDUALS


Processing Kin Files:   0%|          | 0/4 [00:00<?, ?it/s]

In [23]:
pipeline.update_shotdata()

Updating shotdata with interpolated gnss data
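
The log message only says that the shot data are updated with interpolated GNSS data; it does not say how. As an illustration of the general idea, the sketch below linearly interpolates one GNSS coordinate onto acoustic shot times using only the standard library. The function name, the sample values, and the choice of linear interpolation are all assumptions, not the method used by `update_shotdata`.

```python
from bisect import bisect_right


def interpolate_at(t, times, values):
    """Linearly interpolate `values` (sampled at sorted `times`) at time `t`.

    Returns None when `t` lies outside the sampled interval, so that shots
    without GNSS coverage are flagged rather than extrapolated.
    """
    if not times or t < times[0] or t > times[-1]:
        return None
    i = bisect_right(times, t)
    if i == len(times):  # t coincides with the last sample
        return values[-1]
    t0, t1 = times[i - 1], times[i]
    weight = (t - t0) / (t1 - t0)
    return values[i - 1] + weight * (values[i] - values[i - 1])


# Made-up GNSS samples (seconds, metres) and shot times for illustration only.
gnss_times = [0.0, 1.0, 2.0]
gnss_east = [10.0, 12.0, 16.0]
shot_times = [0.5, 1.25, 3.0]

for t in shot_times:
    print(t, interpolate_at(t, gnss_times, gnss_east))
```

The last shot time falls outside the GNSS record, so the sketch returns `None` for it instead of a position.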
