# Data Preprocessing

This notebook walks a user through selecting a survey and preprocessing its raw data into the inputs needed to run GARPOS.

In [2]:
import os
from pathlib import Path

from es_sfgtools.processing.pipeline import DataHandler
from es_sfgtools.utils.archive_pull import (
    list_survey_files
    )

from es_sfgtools.utils.loggers import BaseLogger


### Confirm required environment variables are set

In [3]:
# This must be set correctly so the Go executables can translate NovAtel data to RINEX

#Linux
#!echo $LD_LIBRARY_PATH

#Mac
os.environ['DYLD_LIBRARY_PATH'] = "/Users/gottlieb/miniconda3/envs/seafloor_geodesy_mac/lib"
os.getenv('DYLD_LIBRARY_PATH')

'/Users/gottlieb/miniconda3/envs/seafloor_geodesy_mac/lib'

In [4]:
# This confirms that the PRIDE PPP-AR executable (pdp3) is in the PATH
!which pdp3

/Users/gottlieb/.PRIDE_PPPAR_BIN/pdp3
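
The two checks above can also be done without shell magics or a hard-coded path. The following is a minimal sketch, not part of `es_sfgtools`: the helper names `library_path_variable` and `check_environment` are hypothetical, and it assumes that only macOS (`Darwin`) uses `DYLD_LIBRARY_PATH` while other systems use `LD_LIBRARY_PATH`.

```python
import os
import platform
import shutil


def library_path_variable(system=None):
    """Return the dynamic-library search variable for the given OS name."""
    system = system or platform.system()
    # Assumption: macOS uses DYLD_LIBRARY_PATH, everything else LD_LIBRARY_PATH.
    return "DYLD_LIBRARY_PATH" if system == "Darwin" else "LD_LIBRARY_PATH"


def check_environment(executable="pdp3", environ=None, system=None):
    """Report the library path variable and where the executable is found."""
    environ = os.environ if environ is None else environ
    variable = library_path_variable(system)
    return {
        "library_variable": variable,
        "library_path": environ.get(variable),      # None if unset
        "executable_path": shutil.which(executable),  # None if not on PATH
    }


report = check_environment()
for key, value in report.items():
    print(f"{key}: {value}")
```

A value of `None` for either `library_path` or `executable_path` suggests the environment needs attention before the GNSS steps are run.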


## Step 1. Initial Setup


#### Browse available surveys from the community archive and select target
- Locate the survey of interest at https://gage-data.earthscope.org/archive/seafloor and note its `network`, `station`, and `survey` names, which are entered in the cell below. Leave `vessel_type` as `SV3` unless you know you are working with older SV2 data.
- To process a new survey with this notebook, its data must first be submitted to, and made available from, the community archive.
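
As a quick sanity check on the three names, the archive location of a survey's raw files can be assembled by hand. This is a minimal sketch that assumes the `<root>/<network>/<station>/<survey>/raw` layout printed by `list_survey_files` later in this notebook. `survey_raw_url` is a hypothetical helper, not an `es_sfgtools` function.

```python
ARCHIVE_ROOT = "https://gage-data.earthscope.org/archive/seafloor"


def survey_raw_url(network, station, survey, root=ARCHIVE_ROOT):
    """Join the archive root and survey identifiers into the raw-data URL."""
    # Assumed layout: <root>/<network>/<station>/<survey>/raw
    return "/".join([root.rstrip("/"), network, station, survey, "raw"])


print(survey_raw_url("cascadia-gorda", "NCC1", "2023_A_1063"))
# -> https://gage-data.earthscope.org/archive/seafloor/cascadia-gorda/NCC1/2023_A_1063/raw
```

Opening the printed URL in a browser is an easy way to confirm that the names were copied correctly.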

In [5]:
# Input survey parameters
network='cascadia-gorda'
site='NCC1'
#survey='2024_A_1126'
survey='2023_A_1063'
vessel_type = 'SV3'

# Set data directory path for local environment
data_dir = Path("~/data/sfg").expanduser()
data_dir.mkdir(parents=True, exist_ok=True)

#### USE THE FOLLOWING DEFAULTS UNLESS YOU NEED TO CHANGE THEM ####
data_handler = DataHandler(directory=data_dir)
data_handler.change_working_station(network=network, station=site, campaign=survey)
BaseLogger.set_dir(data_handler.station_log_dir)

if vessel_type == 'SV3':
    pipeline, config = data_handler.get_pipeline_sv3()
elif vessel_type == 'SV2':
    pipeline, config = data_handler.get_pipeline_sv2()
else:
    raise ValueError(f"Vessel type {vessel_type} not recognized")



Building directory structure for cascadia-gorda NCC1 2023_A_1063
No date range set for cascadia-gorda, NCC1, 2023_A_1063
Building TileDB arrays for NCC1
Changed working station to cascadia-gorda NCC1


## Step 2. Inventory available data and its location

In [7]:
# Generate a list of files available from remote archive
#TODO: implement options for raw vs intermediate vs processed 
remote_filepaths = list_survey_files(network=network, station=site, survey=survey, show_details=True)

Listing survey files from url https://gage-data.earthscope.org/archive/seafloor/cascadia-gorda/NCC1/2023_A_1063/raw
Found under https://gage-data.earthscope.org/archive/seafloor/cascadia-gorda/NCC1/2023_A_1063/raw:
    7 DFOP00 file(s)
    90 NOV000 file(s)
    38 NOV770 file(s)


In [8]:
# See what files exist locally
data_type_counts = data_handler.get_dtype_counts()
print("Local data directory contains the following:")
for data_type, count in data_type_counts.items():
    print(f"    {data_type}: {count}")

Local data directory contains the following:
    dfop00: 7
    novatel770: 38
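
The remote and local inventories above use different labels, so comparing them takes a small mapping. The sketch below uses the counts printed in the two outputs above. The mapping between labels is an assumption, and `novatel000` in particular is a guessed local label that does not appear in the output. `missing_locally` is a hypothetical helper.

```python
# Counts copied from the outputs above.
remote_counts = {"DFOP00": 7, "NOV000": 90, "NOV770": 38}
local_counts = {"dfop00": 7, "novatel770": 38}

# Assumed correspondence between archive labels and local data types.
REMOTE_TO_LOCAL = {
    "DFOP00": "dfop00",
    "NOV770": "novatel770",
    "NOV000": "novatel000",  # guessed label, not confirmed by the output
}


def missing_locally(remote, local, mapping):
    """Return {local_label: shortfall} for types with fewer local files."""
    missing = {}
    for remote_label, n_remote in remote.items():
        local_label = mapping.get(remote_label, remote_label.lower())
        shortfall = n_remote - local.get(local_label, 0)
        if shortfall > 0:
            missing[local_label] = shortfall
    return missing


print(missing_locally(remote_counts, local_counts, REMOTE_TO_LOCAL))
# -> {'novatel000': 90}
```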


## Step 3. Pull data from remote archive

In [9]:
# Add the remote files found above to the local catalog. Note that this builds an inventory
# but does not download anything until a later step.
# TODO: Detail counts of files local vs only remote
data_handler.add_data_remote(remote_filepaths=remote_filepaths)


21 files not recognized and skipped
Added 0 out of 90 files to the catalog


#### Select file types for downloading
The observable file types depend on whether the data were collected with an SV2 or an SV3 waveglider.

![GARPOS processing flow](garpos_flow.jpg)

In [10]:
# Download the files by type
# data_handler.download_data(file_type='sonardyne', show_details=False)
# data_handler.download_data(file_type='novatel', show_details=False)
# data_handler.download_data(file_type='master', show_details=False)
# data_handler.download_data(file_type='svpavg', show_details=False)
# data_handler.download_data(file_type='leverarm', show_details=False)

data_handler.download_data(file_types='dfop00')
data_handler.download_data(file_types='novatel770')

No new AssetType.DFOP00 files to download
No new AssetType.NOVATEL770 files to download


## Step 4. Parse and process raw data into processing input schemas

- 4.1 Parse acoustic observations into AcousticDataFrames
- 4.2 Parse IMU observations into IMUDataFrames
- 4.3 Process GNSS observables to generate PositionDataFrames
    - Parse RANGE-A NovAtel messages and build RINEX files
    - Run PRIDE PPP-AR on the RINEX files to generate Kin files
    - Parse Kin files into PositionDataFrames
- 4.4 Parse metadata files into SiteConfig

### 4.1 Process and read DFOP00 files 

In [11]:
#config.dfop00_config.override=True
#config.dfop00_config.show_details=True
#pipeline.config = config
pipeline.process_dfop00()

No DFOP00 Files Found to Process for cascadia-gorda NCC1 2023_A_1063


### 4.3 Take all GNSS parent files and generate GNSS DataFrames

In [14]:
#config.novatel_config.override=True
#pipeline.config = config
pipeline.pre_process_novatel()

Processing 38 Novatel 770 files for cascadia-gorda NCC1 2023_A_1063. This may take a few minutes...
Running NOVB2TILE on 38 files
Added 38 Novatel 770 Entries to the catalog
Processing Novatel 000 data for cascadia-gorda NCC1 2023_A_1063
No Novatel 000 Files Found to Process for cascadia-gorda NCC1 2023_A_1063


In [6]:
# config.rinex_config.n_processes=2
config.rinex_config.override=True
pipeline.config = config
pipeline.get_rinex_files()

Gathering Rinex Files for cascadia-gorda NCC1 2023_A_1063. This may take a few minutes...


Command
/Users/gottlieb/GIT/es_sfgtools/src/golangtools/build/tdb2rnx_darwin_arm64 -tdb /Users/gottlieb/data/sfg/cascadia-gorda/NCC1/TileDB/rangea_db.tdb -settings /Users/gottlieb/data/sfg/cascadia-gorda/NCC1/rinex_metav2.json -timeint 1 -year 0


"Time Range: 2023-09-08 14:26:07.7 +0000 UTC - 2024-09-22 00:00:18.3 +0000 UTC Found At /Users/gottlieb/data/sfg/cascadia-gorda/NCC1/TileDB/rangea_db.tdb"
"Found 20320 Epochs From Array Within Timespan: {2023-09-08 14:00:00 +0000 UTC 2023-09-08 15:00:00 +0000 UTC}"
"Generating Daily RINEX File For Year 2023, Month 9, Day 8 To NCC12510.23o"
"Found 36000 Epochs From Array Within Timespan: {2023-09-08 15:00:00 +0000 UTC 2023-09-08 16:00:00 +0000 UTC}"
"Found 36000 Epochs From Array Within Timespan: {2023-09-08 16:00:00 +0000 UTC 2023-09-08 17:00:00 +0000 UTC}"
"Found 36000 Epochs From Array Within Timespan: {2023-09-08 17:00:00 +0000 UTC 2023-09-08 18:00:00 +0000 UTC}"
"Found 36000 Epochs From Array Within Timespan: {2023-09-08 18:00:00 +0000 UTC 2023-09-08 19:00:00 +0000 UTC}"
"Found 36000 Epochs From Array Within Timespan: {2023-09-08 19:00:00 +0000 UTC 2023-09-08 20:00:00 +0000 UTC}"
"Found 36000 Epochs From Array Within Timespan: {2023-09-08 20:00:00 +0000 UTC 2023-09-08 21:00:00 +000

In [8]:
config.rinex_config.override=True
pipeline.config = config   
pipeline.process_rinex()

Processing Rinex Data for cascadia-gorda NCC1 2023_A_1063. This may take a few minutes...
Found 9 Rinex Files to Process
Converting RINEX file /Users/gottlieb/data/sfg/cascadia-gorda/NCC1/2023_A_1063/intermediate/NCC12560.23o to kin file
Converting RINEX file /Users/gottlieb/data/sfg/cascadia-gorda/NCC1/2023_A_1063/intermediate/NCC12550.23o to kin file
Converting RINEX file /Users/gottlieb/data/sfg/cascadia-gorda/NCC1/2023_A_1063/intermediate/NCC12510.23o to kin file
Converting RINEX file /Users/gottlieb/data/sfg/cascadia-gorda/NCC1/2023_A_1063/intermediate/NCC12660.24o to kin file
Converting RINEX file /Users/gottlieb/data/sfg/cascadia-gorda/NCC1/2023_A_1063/intermediate/NCC12570.23o to kin file
Converting RINEX file /Users/gottlieb/data/sfg/cascadia-gorda/NCC1/2023_A_1063/intermediate/NCC12520.23o to kin file
Converting RINEX file /Users/gottlieb/data/sfg/cascadia-gorda/NCC1/2023_A_1063/intermediate/NCC12540.23o to kin file
Converting RINEX file /Users/gottlieb/data/sfg/cascadia-gord
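
The file names in these logs encode the observation date as a day of year. For example, `NCC12510.23o` follows the RINEX 2 short-name convention (four-character site, three-digit day of year, session `0`, two-digit year, type `o`). The sketch below reproduces that name and the kin file name seen in the next cell's output. The kin pattern is inferred from the log lines only, and both helper functions are hypothetical.

```python
from datetime import date


def rinex2_obs_name(site, day):
    """Build a daily RINEX 2 observation file name such as NCC12510.23o."""
    doy = day.timetuple().tm_yday
    return f"{site.upper()[:4]}{doy:03d}0.{day.year % 100:02d}o"


def kin_name(site, day):
    """Build a kin file name; pattern inferred from this notebook's logs."""
    doy = day.timetuple().tm_yday
    return f"kin_{day.year}{doy:03d}_{site.lower()}.kin"


first_day = date(2023, 9, 8)
print(rinex2_obs_name("NCC1", first_day))  # -> NCC12510.23o
print(kin_name("NCC1", first_day))         # -> kin_2023251_ncc1.kin
```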

In [9]:
pipeline.process_kin()

Looking for Kin Files to Process for cascadia-gorda NCC1 2023_A_1063
Found 8 Kin Files to Process: processing
Processing Kin Files:   0%|          | 0/8 [00:00<?, ?it/s]GNSS Parser: 34425 shots from FILE /Users/gottlieb/data/sfg/cascadia-gorda/NCC1/2023_A_1063/intermediate/kin_2023251_ncc1.kin
Processing Kin Files:  12%|█▎        | 1/8 [00:00<00:06,  1.09it/s]GNSS Parser: 86180 shots from FILE /Users/gottlieb/data/sfg/cascadia-gorda/NCC1/2023_A_1063/intermediate/kin_2023256_ncc1.kin
Processing Kin Files:  25%|██▌       | 2/8 [00:02<00:07,  1.19s/it]GNSS Parser: 13935 shots from FILE /Users/gottlieb/data/sfg/cascadia-gorda/NCC1/2023_A_1063/intermediate/kin_2023257_ncc1.kin
Processing Kin Files:  38%|███▊      | 3/8 [00:02<00:04,  1.12it/s]GNSS Parser: 86381 shots from FILE /Users/gottlieb/data/sfg/cascadia-gorda/NCC1/2023_A_1063/intermediate/kin_2023254_ncc1.kin
Processing Kin Files:  50%|█████     | 4/8 [00:04<00:04,  1.05s/it]GNSS Parser: 520 shots from FILE /Users/gottlieb/data/sfg/c

In [10]:
pipeline.update_shotdata()

Updating shotdata with interpolated gnss data
Merging shotdata and gnss data
Interpolating shotdata for date 2023-09-08
Interpolating 2955 points
Interpolating ENU values
Interpolation took 16.535 seconds for 2955 x 4 points
Interpolating shotdata for date 2023-09-09
Interpolating 5758 points
Interpolating ENU values


KeyboardInterrupt: 
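
The log above reports that GNSS positions are interpolated onto the acoustic shot times. This notebook does not show how `es_sfgtools` does that, so the following is only an illustration of the general idea using plain linear interpolation. The function name, the tuple layout, and the choice to return `None` outside the GNSS time span are all assumptions.

```python
from bisect import bisect_right


def interpolate_enu(gnss_times, gnss_enu, shot_times):
    """Linearly interpolate (east, north, up) tuples onto shot_times.

    gnss_times must be sorted ascending. Shot times outside the GNSS
    time span yield None rather than an extrapolated value.
    """
    results = []
    for t in shot_times:
        if t < gnss_times[0] or t > gnss_times[-1]:
            results.append(None)
            continue
        hi = bisect_right(gnss_times, t)
        if hi == len(gnss_times):
            # t coincides with the last GNSS sample.
            results.append(tuple(gnss_enu[-1]))
            continue
        lo = hi - 1
        weight = (t - gnss_times[lo]) / (gnss_times[hi] - gnss_times[lo])
        results.append(
            tuple(a + weight * (b - a) for a, b in zip(gnss_enu[lo], gnss_enu[hi]))
        )
    return results


# Synthetic values chosen only to make the arithmetic easy to follow.
times = [0.0, 1.0, 2.0]
enu = [(0.0, 0.0, 0.0), (10.0, 20.0, 30.0), (20.0, 40.0, 60.0)]
print(interpolate_enu(times, enu, [0.5, 2.0, 3.0]))
# -> [(5.0, 10.0, 15.0), (20.0, 40.0, 60.0), None]
```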