# Data Preprocessing
This notebook walks the user through selecting a survey and preprocessing its raw data into the inputs required to run GARPOS.

In [1]:
import os
from pathlib import Path
import pandas as pd

from es_sfgtools.processing.pipeline import DataHandler
from es_sfgtools.utils.archive_pull import (
    list_survey_files
    )

from es_sfgtools.utils.loggers import BaseLogger

BaseLogger.route_to_console()

os.environ['DYLD_LIBRARY_PATH'] = "/Users/gottlieb/miniconda3/envs/seafloor_geodesy/lib"
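
The hard-coded `DYLD_LIBRARY_PATH` above is specific to one machine. A minimal sketch of deriving the same directory from the active environment instead, assuming the notebook runs inside the conda environment that holds the required libraries. `env_lib_dir` is a hypothetical helper, not part of es_sfgtools.

```python
import os
import sys
from pathlib import Path
from typing import Optional


def env_lib_dir(prefix: Optional[str] = None) -> Path:
    """Return the lib/ directory of an environment prefix.

    Falls back to CONDA_PREFIX, then to the running interpreter's prefix.
    """
    root = prefix or os.environ.get("CONDA_PREFIX") or sys.prefix
    return Path(root) / "lib"


# Assumption: the shared libraries live in <env>/lib, as in the path used above.
os.environ["DYLD_LIBRARY_PATH"] = str(env_lib_dir())
```

Variables set through `os.environ` are inherited by subprocesses started afterwards, which is presumably why the notebook sets this before running the pipeline.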

## Step 1. Initial Setup


#### Browse available surveys from the community archive and select a target
- Locate the survey of interest at https://gage-data.earthscope.org/archive/seafloor and note its `network`, `station`, and `survey` names, which are entered in the cell below (the station name is assigned to the `site` variable).
- To process a new survey with this notebook, its data must first be submitted to, and made available from, the community archive.
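
As a rough illustration of how the three names map onto the archive, the sketch below builds the URL of a survey's raw-data folder. The layout (`<network>/<station>/<survey>/raw`) is taken from the listing output shown later in this notebook; that other surveys follow the same layout is an assumption, and `survey_raw_url` is a hypothetical helper rather than an es_sfgtools function.

```python
ARCHIVE_ROOT = "https://gage-data.earthscope.org/archive/seafloor"


def survey_raw_url(network: str, station: str, survey: str) -> str:
    """Join the archive root and the three survey identifiers into a raw-data URL."""
    return "/".join([ARCHIVE_ROOT, network, station, survey, "raw"])


print(survey_raw_url("cascadia-gorda", "NCC1", "2024_A_1126"))
```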

## Step 2. Inventory available data and its location

In [None]:
network='cascadia-gorda'
site='NCC1'
survey='2024_A_1126'
#site='SEM1'
#survey='2022_A_1049'

#### Use the following defaults unless you need different settings ####

# Set the data directory
data_dir = Path("/Users/gottlieb/data/sfg")

data_handler = DataHandler(directory=data_dir)
data_handler.change_working_station(network=network, station=site, survey=survey)
BaseLogger.set_dir(data_handler.station_log_dir)

Building directory structure for cascadia-gorda NCC1 2024_A_1126
Building TileDB arrays for NCC1
Changed working station to cascadia-gorda NCC1


In [8]:
# Generate a list of files available from remote archive
#TODO: implement options for raw vs intermediate vs processed 
remote_filepaths = list_survey_files(network=network, station=site, survey=survey, show_details=True)

Found under https://gage-data.earthscope.org/archive/seafloor/cascadia-gorda/NCC1/2024_A_1126/raw:
    136 NOV000 file(s)
    40 NOV770 file(s)
    25 DFOP00 file(s)


In [9]:
# See which files exist locally
data_type_counts = data_handler.get_dtype_counts()
print("Local data directory contains the following:")
for data_type, count in data_type_counts.items():
    print(f"    {data_type}: {count}")

Local data directory contains the following:
    dfop00: 25
    novatel770: 40
    rinex: 22


## Step 3. Pull data from remote archive

In [10]:
# Add the remote files found above to the local catalog. This only builds an inventory;
# the files themselves are downloaded in a later step.
# TODO: Detail counts of files local vs only remote
data_handler.add_data_remote(remote_filepaths=remote_filepaths)


2025-01-06 13:47:27,322 - data_handler.py:301 - INFO - File https://gage-data.earthscope.org/archive/seafloor/cascadia-gorda/NCC1/2024_A_1126/raw/323843_001_20240921_225330_00110_NOV770.raw already exists in the catalog 
2025-01-06 13:47:27,328 - data_handler.py:301 - INFO - File https://gage-data.earthscope.org/archive/seafloor/cascadia-gorda/NCC1/2024_A_1126/raw/323843_001_20240922_000000_00111_NOV770.raw already exists in the catalog 
2025-01-06 13:47:27,334 - data_handler.py:301 - INFO - File https://gage-data.earthscope.org/archive/seafloor/cascadia-gorda/NCC1/2024_A_1126/raw/323843_001_20240922_140341_00112_NOV770.raw already exists in the catalog 
2025-01-06 13:47:27,336 - data_handler.py:301 - INFO - File https://gage-data.earthscope.org/archive/seafloor/cascadia-gorda/NCC1/2024_A_1126/raw/323843_001_20240922_143036_00112_DFOP00.raw already exists in the catalog 
2025-01-06 13:47:27,342 - data_handler.py:301 - INFO - File https://gage-data.earthscope.org/archive/seafloor/cascad
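
The file names in the log above end in a type tag (`NOV770`, `DFOP00`) before the `.raw` extension. A small sketch of tallying a list of archive URLs by that tag follows. It assumes the tag is always the last underscore-separated field of the file name; `file_type_tag` and `count_by_type` are hypothetical helpers.

```python
from collections import Counter
from pathlib import PurePosixPath
from typing import Iterable
from urllib.parse import urlparse

BASE = "https://gage-data.earthscope.org/archive/seafloor/cascadia-gorda/NCC1/2024_A_1126/raw/"

# File names copied from the log output above.
example_urls = [
    BASE + "323843_001_20240921_225330_00110_NOV770.raw",
    BASE + "323843_001_20240922_000000_00111_NOV770.raw",
    BASE + "323843_001_20240922_140341_00112_NOV770.raw",
    BASE + "323843_001_20240922_143036_00112_DFOP00.raw",
]


def file_type_tag(url: str) -> str:
    """Return the last underscore-separated field of the file name, without extension."""
    stem = PurePosixPath(urlparse(url).path).stem
    return stem.rsplit("_", 1)[-1]


def count_by_type(urls: Iterable[str]) -> Counter:
    """Tally URLs by their file-type tag."""
    return Counter(file_type_tag(u) for u in urls)


print(count_by_type(example_urls))
```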

#### Select file types for downloading
The file types to download depend on whether the data were collected with an SV2 or an SV3 Wave Glider.

![GARPOS data flow](garpos_flow.jpg)

In [5]:
# Download the files by type
# data_handler.download_data(file_type='sonardyne', show_details=False)
# data_handler.download_data(file_type='novatel', show_details=False)
# data_handler.download_data(file_type='master', show_details=False)
# data_handler.download_data(file_type='svpavg', show_details=False)
# data_handler.download_data(file_type='leverarm', show_details=False)

data_handler.download_data(file_types='dfop00')
data_handler.download_data(file_types='novatel770')

2025-01-06 14:01:47,401 - data_handler.py:359 - INFO - No new files to download 
2025-01-06 14:01:47,405 - data_handler.py:359 - INFO - No new files to download 


Getting assets for cascadia-gorda NCC1 2024_A_1126 AssetType.DFOP00
Getting assets for cascadia-gorda NCC1 2024_A_1126 AssetType.NOVATEL770


## Step 4. Parse and process raw data into the processing input schemas

- 4.1 Parse acoustic observations into AcousticDataFrames
- 4.2 Parse IMU observations into IMUDataFrames
- 4.3 Process GNSS observables to generate PositionDataFrames
    - Parse RANGE-A novatel messages, build RINEX files
    - Run PRIDE-PPP-AR on RINEX, generate Kin files
    - Parse Kin files into PositionDataFrames
- 4.4 Parse metadata files into SiteConfig
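
The sub-steps listed above build on one another, so when one fails there is little point in running the rest. The sketch below shows one possible way to run a sequence of steps in order and stop at the first failure. `run_steps` is a hypothetical helper, and the stand-in callables only imitate pipeline methods so that the example runs on its own.

```python
from typing import Callable, List, Sequence, Tuple

Step = Tuple[str, Callable[[], object]]


def run_steps(steps: Sequence[Step]) -> List[str]:
    """Run steps in order; stop at the first failure and return the names that completed."""
    completed: List[str] = []
    for name, step in steps:
        try:
            step()
        except Exception as exc:
            print(f"Step '{name}' failed ({exc!r}); skipping the remaining steps")
            break
        completed.append(name)
    return completed


# Stand-in callables. In the notebook these would be the pipeline methods
# called in the cells below, for example pipeline.process_dfop00.
def ok() -> None:
    pass


def missing_binary() -> None:
    raise FileNotFoundError("PRIDE-PPP binary 'pdp3' not found in path")


print(run_steps([("acoustic", ok), ("rinex", missing_binary), ("kin", ok)]))
```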

In [3]:
pipeline, config = data_handler.get_pipeline_sv3()

### 4.1 Process and read DFOP00 files 

In [8]:
config.dfop00_config.override=True
config.dfop00_config.show_details=True
pipeline.config = config
pipeline.process_dfop00()

Getting assets for cascadia-gorda NCC1 2024_A_1126 AssetType.DFOP00
Found 25 DFOP00 Files to Process


Processing DFOP00 Files:   0%|          | 0/25 [00:00<?, ?it/s]

Generated 23 ShotData dataframes From 25 DFOP00 Files


### 4.3 Generate GNSS DataFrames from all GNSS parent files

In [4]:
pipeline.pre_process_novatel()

Processing Novatel 770 data for cascadia-gorda NCC1 2024_A_1126
Getting assets for cascadia-gorda NCC1 2024_A_1126 AssetType.NOVATEL770
Processing Novatel 000 data for cascadia-gorda NCC1 2024_A_1126
Getting assets for cascadia-gorda NCC1 2024_A_1126 AssetType.NOVATEL000
No Novatel 000 Files Found to Process for cascadia-gorda NCC1 2024_A_1126


In [7]:
pipeline.get_rinex_files()

In [8]:
pipeline.process_rinex()

Getting assets for cascadia-gorda NCC1 2024_A_1126 AssetType.RINEX
Getting assets for cascadia-gorda NCC1 2024_A_1126 AssetType.KIN


FileNotFoundError: PRIDE-PPP binary 'pdp3' not found in path
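
The error above means the `pdp3` executable could not be located. A minimal sketch of checking for it before running the RINEX step, using the standard library's `shutil.which`. It assumes the pipeline looks the binary up on `PATH`, as the error message suggests; `missing_binaries` is a hypothetical helper.

```python
import shutil
from typing import Iterable, List


def missing_binaries(names: Iterable[str]) -> List[str]:
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


missing = missing_binaries(["pdp3"])
if missing:
    print(f"Not found on PATH: {', '.join(missing)}. Install PRIDE-PPP-AR or add its directory to PATH.")
else:
    print("pdp3 found at", shutil.which("pdp3"))
```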

In [6]:
pipeline.process_kin()

Getting assets for cascadia-gorda NCC1 2024_A_1126 AssetType.KIN
Getting assets for cascadia-gorda NCC1 2024_A_1126 AssetType.KINRESIDUALS


In [None]:
pipeline.update_shotdata()