### Notes - Oct 21, 2024 ###
This is a copy from 10.FIM Version 4.5.2.11 which included hand data loads plus ras2fim data. 
We will remove all ras2fim stuff here knowing that sometimes ras2fim will be uploaded on its own.
However.. when ras2fim is loaded, some steps here will need to be re-run. Those steps will be
duplicated when we build our next ras2fim load. This hand release does not have a ras2fim update so
we will keep the one in place.
</br></br>
All code in here will be reviewed and adjusted as the loads progress. Consider each step to be
WIP until you see a load date below.


### Load Status for hand 4.5.11.1 - Started Oct 31, 2024 (well... restarted from the 21st)

#### Add dates to each line as they have been loaded

1. `Crosswalk` :  ---  Done: Nov 5
2. `Lambda FIM_PREFIX` :   ---  Done: Nov 5
3. `Lambda FIM_VERSION and Memory` :   ---  Done: Nov 5
4. `ras2fim` :  No update in this release. But a few adjustments for fim_version and model_version here :  -- Done: Nov 5
5. `AEP`   --- (HOLD) - needs lambda and image updates (via tf)
    - `2 year` :  -- 
    - `5 year` :  -- 
    - `10 year` :  -- 
    - `25 year` :  -- 
    - `50 year` :  -- 
    - `HW / High Water` :  -- 
    - `Change the hv-vpp-ti-viz-fim-data-prep Lambda memory back to 2048mb` :  -- 
6. `Catchments`  --- (HOLD) - needs lambda and image updates (via tf)
    - `Branch 0` :  -- 
    - `GMS` :  -- 
7. `usgs_elev_table` :  --  Done Nov 7
8. `hydrotable / hydrotable_staggered` : -- 
9. `usgs_rating_curve / usgs_rating_curves staggered` : -- 
10. `Skills Metrics` :  -- 
11. `FIM Performance` :  -- 
12. `CatFIM`
    - `Stage Based CatFIM` :  -- 
    - `Flow Based CatFIM` :   -- 
    - `CatFIM FIM 30` : Stage based only? flow not needed but confirm this.
13. `Clear HAND cache` :
14. `GIT` and `Terraform ??` : We can now do GitHub check in from here. Watch for branches.



In [9]:
# Cell to manually pip reload a packages that the Jupyter engine not retained
# !pip install numpy
# !pip install geopandas
# !pip install pyarrow
# !pip install xarray
# !pip install geoalchemy2
# !pip install contextily
# !pip install rioxarray

!pip install python-dotenv
print("All loaded")


Collecting python-dotenv
  Downloading python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Downloading python_dotenv-1.0.1-py3-none-any.whl (19 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.0.1
All loaded


In [None]:
# pd.set_option("max_info_rows", 100000) # override  

In [2]:
import os
import codecs
import csv

import sys

from datetime import datetime
from pathlib import Path

import boto3
import geopandas as gpd
import json
import pandas as pd
import s3fs
import sqlalchemy
import xarray as xr

from geopandas import GeoDataFrame
from io import StringIO
from geoalchemy2 import Geometry
from shapely import wkt
from shapely.geometry import Polygon
from sqlalchemy.exc import DataError   # yes, reduntant, fix it later
from sqlalchemy.types import Text    # yes, reduntant, fix it later


sys.path.append(os.path.join(os.path.abspath(''), '..'))

import helper_functions.shared_functions as sf
import helper_functions.s3_shared_functions as s3_sf

from helper_functions.viz_classes import database

print("imports loaded")


imports loaded


In [None]:
# Load AWS Keys
from dotenv import load_dotenv
aws_keys_path = os.path.join(Path.home(),"SageMaker", "AWS_keys.env")
print(f"aws_keys are at {aws_keys_path}")
load_dotenv(aws_keys_path)

TI_ACCESS_KEY = os.environ['WF_TI_ACCESS_KEY']
TI_SECRET_KEY = os.environ['WF_TI_SECRET_KEY']
TI_TOKEN = os.environ['WF_TI_TOKEN']

# I updated the file but it is not being honored in the enviro values

# print(TI_ACCESS_KEY)
# print(TI_SECRET_KEY)
# print(TI_TOKEN)

print("aws_keys loaded")

In [17]:

# we won't load this into any tables at this time
# The phrase of FIM 5.1.0 will be embedded in config files
#PU0LIC_FIM_VERSION = "FIM 5.1.0"
HAND_MODEL_VERSION = "4.5.11.1"
RAS2FIM_MODEL_VERSION = "2.0"

HAND_ROOT_DPATH = "fim/hand_4_5_11_1"
HAND_DATASETS_DPATH = f"{FIM_ROOT_DPATH}/hand_datasets"
QA_DATASETS_DPATH = f"{FIM_ROOT_DPATH}/qa_datasets"

FIM_BUCKET = "hydrovis-ti-deployment-us-east-1"
FIM_CROSSWALK_FPATH = os.path.join(HAND_DATASETS_DPATH, "crosswalk_table.csv")
PIPELINE_ARN = 'arn:aws:states:us-east-1:526904826677:stateMachine:hv-vpp-ti-viz-pipeline'

COLUMN_NAME_MODEL_VERSION = "model_version"

# Sometimes these credential values get updated. To find the latest correct values, go to your AWS Console log page and click on the "Access Key"
# link to get the latest valid set. Using the "AWS environment variables" values.
# If this is not set correctly, you will get an HTTP error 400 when you call S3 lower.
# You might also see an error of 'An error occurred (NoSuchKey) when calling the GetObject operation:
# The specified key does not exist." the creds are not correct"

S3_CLIENT = boto3.client("s3")
STEPFUNCTION_CLIENT = boto3.client('stepfunctions')
VIZ_DB_ENGINE = sf.get_db_engine('viz')

print("Global Variables loaded")

Global Variables loaded


<h2>1 - UPLOAD FIM4 HYDRO ID/FEATURE ID CROSSWALK</h2>

In [None]:

print(f"Getting column name from {FIM_CROSSWALK_FPATH}")

data = S3_CLIENT.get_object(Bucket=FIM_BUCKET, Key=FIM_CROSSWALK_FPATH)
d_reader = csv.DictReader(codecs.getreader("utf-8")(data["Body"]))
headers = d_reader.fieldnames


header_str = "("
for header in headers:
    header_str += header
    if header in ['hand_id', 'hydro_id', 'lake_id']:
        header_str += ' integer,'
    elif header in ['branch_id', 'feature_id']:
        header_str += ' bigint,'
    else:
        header_str += ' TEXT,'
header_str = header_str[:-1] + ")"
print(header_str)

db = database(db_type="viz")
with db.get_db_connection() as conn, conn.cursor() as cur:
    
    print(f"Deleting/Creating derived.fim4_featureid_crosswalk using columns {header_str}")
    sql = f"DROP TABLE IF EXISTS derived.fim4_featureid_crosswalk; CREATE TABLE derived.fim4_featureid_crosswalk {header_str};"
    cur.execute(sql)
    conn.commit()
    
    # TODO: Nov: Drop the other 2 tables? No. ignore featureid_huc_crosswalk and featureid_huc_crosswalk_ak (not ours)
    

    print(f"Importing {FIM_CROSSWALK_FPATH} to derived.fim4_featureid_crosswalk")
    sql = f"""
        SELECT aws_s3.table_import_from_s3(
           'derived.fim4_featureid_crosswalk',
           '', 
           '(format csv, HEADER true)',
           (SELECT aws_commons.create_s3_uri(
               '{FIM_BUCKET}',
               '{FIM_CROSSWALK_FPATH}',
               'us-east-1'
                ) AS s3_uri
            ),
            aws_commons.create_aws_credentials('{TI_ACCESS_KEY}', '{TI_SECRET_KEY}', '{TI_TOKEN}')
           );
        """
    cur.execute(sql)
    conn.commit()
   
    
    print(f"Adding {COLUMN_NAME_FIM_VERSION} column to derived.fim4_featureid_crosswalk")
    sql = f"ALTER TABLE derived.fim4_featureid_crosswalk ADD COLUMN IF NOT EXISTS {COLUMN_NAME_FIM_VERSION} text DEFAULT '{PUBLIC_FIM_VERSION}';"
    cur.execute(sql)
    conn.commit()
    
    print(f"Adding {COLUMN_NAME_MODEL_VERSION} column to derived.fim4_featureid_crosswalk")
    sql = f"ALTER TABLE derived.fim4_featureid_crosswalk ADD COLUMN IF NOT EXISTS {COLUMN_NAME_MODEL_VERSION} text DEFAULT '{FIM_MODEL_VERSION}';"
    cur.execute(sql)
    conn.commit()
    
    print("Adding feature id index to derived.fim4_featureid_crosswalk")
    # Drop it already exists
    sql = "DROP INDEX IF EXISTS derived.fim4_crosswalk_feature_id"
    cur.execute(sql)
    conn.commit()    
    sql = "CREATE INDEX fim4_crosswalk_feature_id ON derived.fim4_featureid_crosswalk USING btree (feature_id)"
    cur.execute(sql)
    conn.commit()

    print("Adding hydro id index to derived.fim4_featureid_crosswalk")
    # Drop it already exists
    sql = "DROP INDEX IF EXISTS derived.fim4_crosswalk_hydro_id"
    cur.execute(sql)
    conn.commit()    
    sql = "CREATE INDEX fim4_crosswalk_hydro_id ON derived.fim4_featureid_crosswalk USING btree (hydro_id)"
    cur.execute(sql)
    conn.commit()

print("")
print("Successully loaded derived.fim4_featureid_crosswalk and updated it")
print("... Estimated time to completion is just a few mins")


<h2>2 - UPDATE FIM HAND PROCESSING LAMBDA ENV VARIABLE WITH NEW FIM PREFIX</h2>

https://us-east-1.console.aws.amazon.com/lambda/home?region=us-east-1#/functions/hv-vpp-ti-viz-hand-fim-processing?tab=configure

Lambda name: hv-vpp-ti-viz-hand-fim-processing

In the Configuration Tab, click on the `Environment variables` (left menu), then change the `FIX_PREFIX` to location of the latest hand_dataset you are working on. Referencial to S3 Bucket name.
<br>
ie) fim/fim_4_5_11_1/hand_datasets


<h2>3 - UPDATE FIM DATA PREP LAMBDA ENV VARIABLE WITH NEW FIM VERSION AND MEMORY</h2>

https://us-east-1.console.aws.amazon.com/lambda/home?region=us-east-1#/functions/hv-vpp-ti-viz-fim-data-prep?tab=code

Lambda name: hv-vpp-ti-viz-hand-fim-processing

In the `Configuration` Tab, click on the `Environment variables` (left menu), then change the `FIM_VERSION` to the latest fim model version. 
<br>
ie) 4.5.11.1
<br><br>
<b>Then:</b> Still in the Configuration Tab, now click on the `General Configuration` (left menu), followed 
by the `edit` button on the far right side, to get into the `General Configuration` page details.
<br>Change (if they are not already there)
<br>Memory (text field) to 4096 (MB)  and
<br>Emphermeral Storage tp 1024 (MB)
<br>

#### Note: Later in these steps we will change the Memory and Emphermal Storage back to default values, see below ####
Nov 5, 2024: Added new variable called "MODEL_VERSION" : "HAND 4.5.11.1".  FIM_VERSION IS NOW: 5.1.0





<h2>4 - UPDATE RAS2FIM DATA (inc ras2fim boundaries) IN DB</h2>

As of Oct 2024, we have a new fim (hand) release covered in this file, but ras2fim does not have a new
release. ras2fim will likely be loaded as new datasets become available. 

***The code for ras2fim is removed here from the 4.5.2.11 set and will rebuilt as it''s own new separate load script when that happens.***

However, we will have a few modifications for ras2fim data (not a reload) to help bring in the new
fim_version and model_version columns. Those changes are included here.


In [11]:

# Update "geocurves" to update the "fim_version" field to "FIM 5.1.0:

print(f"Updating geocurves table to fim_version of {PUBLIC_FIM_VERSION}"

sf.execute_sql(f'''
UPDATE
    ras2fim.geocurves
SET
    fim_version = '{PUBLIC_FIM_VERSION}';
''', db_type="viz")

print("Updating done for geocurves")



Updated geocurves table to new FIM version value


<h2>5 - Run AEP FIM Pipelines.</h2>
Updated Documentation from Tyler Early 2024: This can be done in a couple of diferent ways.

1) One option is to use the pipeline_input code created below by Corey to start the AEP pipelines directly from this notebook.<br>
   However, those pipeline_input dictionaries may very well be be out of date, pending more recent updates to the pipelines.<br?


2) The other option, which I prefer, is to setup a manual test event in the initialize_pipeline lambda function to trigger an AEP pipeline like this:</b>
{
  "configuration": "reference",
  "products_to_run": "static_nwm_aep_inundation_extent_library",
  "invoke_step_function": false
}

Using this test event will produce the pipeline instructions, printing any errors that come up, and you can simply change the invoke_step_function flag to True when you're ready to actually invoke a pipeline run (which you can monitor/manage in the step function gui). You will need to manually update the static_nwm_aep_inundation_extent_library.yml product config file to only run 1 aep configuration at a time, and work through the configs as the pipelines finish (takes about an hour each). I've also found that the fim_data_prep lambda function needs to be temporarilly increased to ~4,500mb of memory to run these pipelines. It's also worth noting that these are very resource intesive pipelines, as FIM is calculated for every reach in the nation. AWS costs can amount to hundreds or even thousands of dollars by running these pipelines, so use responsibly.

A couple other important notes:
- These AEP configurations write data directly to the aep_fim schema in the egis RDS database, instead of the viz database.
- <b>You'll need to dump the aep_fim schema after that is complete for backup / deployment into other environments.</b>
- This process has not been tested with new NWM 3.0 Recurrence Flows, and a good thorough audit / QC check of output data is warranted, given those changes and the recent updates to the pipelines.


In [None]:

# Aug 6, 2024: Note: This was created after all intervals were created, so only HW was tested against

def get_aep_pipeline_input(stage_interval):
    pipeline_input = {
      "configuration": "reference",
      "job_type": "auto",
      "data_type": "channel",
      "keep_raw": False,
      "reference_time": datetime.now().strftime('%Y-%m-%d 00:00:00'),
      "configuration_data_flow": {
        "db_max_flows": [],
        "db_ingest_groups": [],
        "python_preprocessing": []
      },
      "pipeline_products": [
        {
          "product": "static_nwm_aep_inundation_extent_library",
          "configuration": "reference",
          "product_type": "fim",
          "run": True,
          "fim_configs": [
            {
              "name": f"rf_{stage_interval}_inundation",
              "target_table": f"aep_fim.rf_{stage_interval}_inundation",
              "fim_type": "hand",
              "sql_file": f"rf_{stage_interval}_inundation"
            }
          ],
          "services": [
            "static_nwm_aep_inundation_extent_library_noaa"
          ],
          "raster_outputs": {
            "output_bucket": "",
            "output_raster_workspaces": []
          },
          "postprocess_sql": [],
          "product_summaries": [],
          "python_preprocesing_dependent": False
        }
      ],
      "sql_rename_dict": {},
      "logging_info": {
          "Timestamp": int(datetime.now().timestamp())
      }
    }

    return pipeline_input

print("function: get_aep_pipeline_input loaded")


In [None]:

#### 2 Year Flow
pipeline_input = get_aep_pipeline_input("2")

# notice, slightly different object name
pipeline_name = f"sagemaker_aep_2_{datetime.now().strftime('%Y%m%dT%H%M')}"

STEPFUNCTION_CLIENT.start_execution(
    stateMachineArn = PIPELINE_ARN,
    name = pipeline_name,
    input= json.dumps(pipeline_input)
)

print(f"AEP : 2 year flows ie: rf_2_inundation kicked off. Can take 45 mins. Pipeline : hv-vpp-ti-viz-pipeline - {pipeline_name}")


In [None]:

#### 5 Year Flow
pipeline_input = get_aep_pipeline_input("5")

# notice, slightly different object name
pipeline_name = f"sagemaker_aep_5_{datetime.now().strftime('%Y%m%dT%H%M')}"

STEPFUNCTION_CLIENT.start_execution(
    stateMachineArn = PIPELINE_ARN,
    name = pipeline_name,
    input= json.dumps(pipeline_input)
)

print(f"AEP : 5 year flows ie: rf_5_inundation kicked off. Can take 45 mins. Pipeline : hv-vpp-ti-viz-pipeline  - {pipeline_name}")


In [None]:

#### 10 Year Flow
pipeline_input = get_aep_pipeline_input("10")

# notice, slightly different object name
pipeline_name = f"sagemaker_aep_10_{datetime.now().strftime('%Y%m%dT%H%M')}"

STEPFUNCTION_CLIENT.start_execution(
    stateMachineArn = PIPELINE_ARN,
    name = pipeline_name,
    input= json.dumps(pipeline_input)
)

print(f"AEP : 10 year flows ie: rf_10_inundation kicked off. Can take 45 mins. Pipeline : hv-vpp-ti-viz-pipeline  - {pipeline_name}")


In [None]:

#### 25 Year Flow
pipeline_input = get_aep_pipeline_input("25")

# notice, slightly different object name
pipeline_name = f"sagemaker_aep_25_{datetime.now().strftime('%Y%m%dT%H%M')}"

STEPFUNCTION_CLIENT.start_execution(
    stateMachineArn = PIPELINE_ARN,
    name = pipeline_name,
    input= json.dumps(pipeline_input)
)

print(f"AEP : 25 year flows ie: rf_25_inundation kicked off. Can take 45 mins. Pipeline : hv-vpp-ti-viz-pipeline  - {pipeline_name}")


In [None]:

#### 50 Year Flow
pipeline_input = get_aep_pipeline_input("50")

# notice, slightly different object name
pipeline_name = f"sagemaker_aep_50_{datetime.now().strftime('%Y%m%dT%H%M')}"

STEPFUNCTION_CLIENT.start_execution(
    stateMachineArn = PIPELINE_ARN,
    name = pipeline_name,
    input= json.dumps(pipeline_input)
)

print(f"AEP : 50 year flows ie: rf_50_inundation kicked off. Can take 45 mins. Pipeline : hv-vpp-ti-viz-pipeline  - {pipeline_name}")


In [None]:

#### HW (High Water) Flow
pipeline_input = get_aep_pipeline_input("high_water")

# notice, slightly different object name
pipeline_name = f"sagemaker_aep_hw_{datetime.now().strftime('%Y%m%dT%H%M')}"

STEPFUNCTION_CLIENT.start_execution(
     stateMachineArn = PIPELINE_ARN,
     name = pipeline_name,
     input= json.dumps(pipeline_input)
)

print(f"AEP : High Water year flows ie: rf_hw_inundation kicked off. Can take 45 mins. Pipeline : hv-vpp-ti-viz-pipeline  - {pipeline_name}")
print("")

<h3>IMPORTANT: Return hv-vpp-ti-viz-fim-data-prep Lambda memory to 2048mb</h3>

https://us-east-1.console.aws.amazon.com/lambda/home?region=us-east-1#/functions/hv-vpp-ti-viz-fim-data-prep?tab=code

Lambda name: hv-vpp-ti-viz-hand-fim-processing


<h2>6 - RUN CATCHMENT WORKFLOWS 2 CONFIGS AT A TIME. CHECK FOR STEP FUNCTION FINISHING BEFORE STARTING NEW ONE</h2>

### 6a - Branch 0 Catchments. Wait until it is done before kicking off the next GMS (Level Path) catchments load a bit lower. ###

In [None]:
# TODO: Add backups to these (4.4.0.0)  (already not available for 4.4.0.0)


# TODO: We likely need to keep the schema, so trun is fine for now, but eventually, get a lsit of the indexes and re-build 
# indexes each time as/if needed. Granted these tables are loaded via Lambdas, so I am not sure how indexes will play into that

sf.execute_sql('''
TRUNCATE 
    fim_catchments.branch_0_catchments, 
    fim_catchments.branch_0_catchments_hi, 
    fim_catchments.branch_0_catchments_prvi;
''', db_type="egis")

print("Catchment Truncation for Branch 0 Done")
print("")

In [None]:
pipeline_input = {
  "configuration": "reference",
  "job_type": "auto",
  "data_type": "channel",
  "keep_raw": False,
  "reference_time": datetime.now().strftime('%Y-%m-%d 00:00:00'),
  "configuration_data_flow": {
    "db_max_flows": [],
    "db_ingest_groups": [],
    "python_preprocessing": []
  },
  "pipeline_products": [
    {
      "product": "static_hand_catchments_0_branches",
      "configuration": "reference",
      "product_type": "fim",
      "run": True,
      "fim_configs": [
        {
          "name": "catchments_0_branches",
          "target_table": "fim_catchments.branch_0_catchments",
          "fim_type": "hand",
          "sql_file": "catchments_0_branches"
        }
      ],
      "services": [
        "static_hand_catchments_0_branches_noaa"
      ],
      "raster_outputs": {
        "output_bucket": "",
        "output_raster_workspaces": []
      },
      "postprocess_sql": [],
      "product_summaries": [],
      "python_preprocesing_dependent": False
    },
    {
      "product": "static_hand_catchments_0_branches_hi",
      "configuration": "reference",
      "product_type": "fim",
      "run": True,
      "fim_configs": [
        {
          "name": "catchments_0_branches_hi",
          "target_table": "fim_catchments.branch_0_catchments_hi",
          "fim_type": "hand",
          "sql_file": "catchments_0_branches_hi"
        }
      ],
      "services": [
        "static_hand_catchments_0_branches_hi_noaa"
      ],
      "raster_outputs": {
        "output_bucket": "",
        "output_raster_workspaces": []
      },
      "postprocess_sql": [],
      "product_summaries": [],
      "python_preprocesing_dependent": False
    },
    {
      "product": "static_hand_catchments_0_branches_prvi",
      "configuration": "reference",
      "product_type": "fim",
      "run": True,
      "fim_configs": [
        {
          "name": "catchments_0_branches_prvi",
          "target_table": "fim_catchments.branch_0_catchments_prvi",
          "fim_type": "hand",
          "sql_file": "catchments_0_branches_prvi"
        }
      ],
      "services": [
        "static_hand_catchments_0_branches_prvi_noaa"
      ],
      "raster_outputs": {
        "output_bucket": "",
        "output_raster_workspaces": []
      },
      "postprocess_sql": [],
      "product_summaries": [],
      "python_preprocesing_dependent": False
    }
  ],
  "sql_rename_dict": {},
  "logging_info": {
      "Timestamp": int(datetime.now().timestamp())
  }
}

pipeline_name = f"sagemaker_0_catchments_{datetime.now().strftime('%Y%m%dT%H%M')}"

# TODO: For later... fix fim_version value and add model_version column. current fim_version vlaue is showing 4.5.2.11

STEPFUNCTION_CLIENT.start_execution(
    stateMachineArn = PIPELINE_ARN,
    name = pipeline_name,
    input= json.dumps(pipeline_input)
)

print("Catchments Branch 0 load kicked off. Last runtime: 23:38.019. "
      f"Pipeline : hv-vpp-ti-viz-pipeline  - {pipeline_name}")


### 6b - GMS (Level Paths / non branch 0) catchments ###

In [None]:

# TODO: Add backups to these

sf.execute_sql('''
TRUNCATE
    fim_catchments.branch_gms_catchments,
    fim_catchments.branch_gms_catchments_hi,
    fim_catchments.branch_gms_catchments_prvi;
''', db_type="egis")

print("Catchment Truncation for GMS (Level Path) Branchs Done")

In [None]:
pipeline_input = {
  "configuration": "reference",
  "job_type": "auto",
  "data_type": "channel",
  "keep_raw": False,
  "reference_time": datetime.now().strftime('%Y-%m-%d 00:00:00'),
  "configuration_data_flow": {
    "db_max_flows": [],
    "db_ingest_groups": [],
    "python_preprocessing": []
  },
  "pipeline_products": [
    {
      "product": "static_hand_catchments_gms_branches",
      "configuration": "reference",
      "product_type": "fim",
      "run": True,
      "fim_configs": [
        {
          "name": "catchments_gms_branches",
          "target_table": "fim_catchments.branch_gms_catchments",
          "fim_type": "hand",
          "sql_file": "catchments_gms_branches"
        }
      ],
      "services": [
        "static_hand_catchments_gms_branches_noaa"
      ],
      "raster_outputs": {
        "output_bucket": "",
        "output_raster_workspaces": []
      },
      "postprocess_sql": [],
      "product_summaries": [],
      "python_preprocesing_dependent": False
    },
    {
      "product": "static_hand_catchments_gms_branches_hi",
      "configuration": "reference",
      "product_type": "fim",
      "run": True,
      "fim_configs": [
        {
          "name": "catchments_gms_branches_hi",
          "target_table": "fim_catchments.branch_gms_catchments_hi",
          "fim_type": "hand",
          "sql_file": "catchments_gms_branches_hi"
        }
      ],
      "services": [
        "static_hand_catchments_gms_branches_hi_noaa"
      ],
      "raster_outputs": {
        "output_bucket": "",
        "output_raster_workspaces": []
      },
      "postprocess_sql": [],
      "product_summaries": [],
      "python_preprocesing_dependent": False
    },
    {
      "product": "static_hand_catchments_gms_branches_prvi",
      "configuration": "reference",
      "product_type": "fim",
      "run": True,
      "fim_configs": [
        {
          "name": "catchments_gms_branches_prvi",
          "target_table": "fim_catchments.branch_gms_catchments_prvi",
          "fim_type": "hand",
          "sql_file": "catchments_gms_branches_prvi"
        }
      ],
      "services": [
        "static_hand_catchments_gms_branches_prvi_noaa"
      ],
      "raster_outputs": {
        "output_bucket": "",
        "output_raster_workspaces": []
      },
      "postprocess_sql": [],
      "product_summaries": [],
      "python_preprocesing_dependent": False
    }
  ],
  "sql_rename_dict": {},
  "logging_info": {
      "Timestamp": int(datetime.now().timestamp())
  }
}

pipeline_name = f"sagemaker_gms_catchments_{datetime.now().strftime('%Y%m%dT%H%M')}"

# TODO: For later... fix fim_version value and add model_version column. current fim_version vlaue is showing 4.5.2.11

STEPFUNCTION_CLIENT.start_execution(
    stateMachineArn = PIPELINE_ARN,
    name = pipeline_name,
    input= json.dumps(pipeline_input)
)

print("Catchments GMS Branches (Level Paths / non branch 0) load kicked off."
      f" Last runtime: 24:45.150. Pipeline : hv-vpp-ti-viz-pipeline  - {pipeline_name}")


<h2>7 - Recreate derived.usgs_elev_table</h2>

In [15]:

# Has appx 2,150 HUCs to process, but this section goes quickly.

sf.execute_sql('DROP TABLE IF EXISTS derived.usgs_elev_table;')

uet_usecols = ['location_id', 'HydroID', 'dem_adj_elevation', 'nws_lid', 'levpa_id']

paginator = S3_CLIENT.get_paginator('list_objects')
operation_parameters = {'Bucket': FIM_BUCKET,
                        'Prefix': f'{HAND_DATASETS_DPATH}/',
                        'Delimiter': '/'}
page_iterator = paginator.paginate(**operation_parameters)
page_count = 0
for page in page_iterator:
    
    prefix_objects = page['CommonPrefixes']
    for i, prefix_obj in enumerate(prefix_objects):
        print(f"Processing {(i+1) + (1000 * page_count)} of"
              f"{len(prefix_objects) * page_count} on page {page_count + 1} (1000 per page)")
        huc_prefix = prefix_obj.get("Prefix")
        usgs_elev_table_key = f'{huc_prefix}usgs_elev_table.csv'
        try:
            uet = S3_CLIENT.get_object(
                Bucket=FIM_BUCKET, 
                Key=usgs_elev_table_key
            )['Body']
            uet_df = pd.read_csv(uet, header=0, usecols=uet_usecols)
            uet_df['fim_version'] = PUBLIC_FIM_VERSION
            uet_df[COLUMN_NAME_MODEL_VERSION] = FIM_MODEL_VERSION
            uet_df.to_sql(
                con=VIZ_DB_ENGINE,
                dtype={
                    "location_id": Text(),
                    "nws_data_huc": Text()
                },
                schema='derived',
                name='usgs_elev_table',
                index=False, 
                if_exists='append'
            )
        except Exception as e:
            if "NoSuchKey" in str(e):
                pass
            else:
                raise e

    page_count += 1
                                     
                
print("usgs_elev_tables load completed")


Processing 1 of 1000 on page 1
Processing 2 of 1000 on page 1
Processing 3 of 1000 on page 1
Processing 4 of 1000 on page 1
Processing 5 of 1000 on page 1
Processing 6 of 1000 on page 1
Processing 7 of 1000 on page 1
Processing 8 of 1000 on page 1
Processing 9 of 1000 on page 1
Processing 10 of 1000 on page 1
Processing 11 of 1000 on page 1
Processing 12 of 1000 on page 1
Processing 13 of 1000 on page 1
Processing 14 of 1000 on page 1
Processing 15 of 1000 on page 1
Processing 16 of 1000 on page 1
Processing 17 of 1000 on page 1
Processing 18 of 1000 on page 1
Processing 19 of 1000 on page 1
Processing 20 of 1000 on page 1
Processing 21 of 1000 on page 1
Processing 22 of 1000 on page 1
Processing 23 of 1000 on page 1
Processing 24 of 1000 on page 1
Processing 25 of 1000 on page 1
Processing 26 of 1000 on page 1
Processing 27 of 1000 on page 1
Processing 28 of 1000 on page 1
Processing 29 of 1000 on page 1
Processing 30 of 1000 on page 1
Processing 31 of 1000 on page 1
Processing 32 of 

<h2>8 - Recreate derived.hydrotable_staggered</h2>

In [16]:

# Takes appx 5.75 to 6 hrs to run

print("hydrotable reloaded - started")
start_dt = datetime.now()

sf.execute_sql('DROP TABLE IF EXISTS derived.hydrotable;')
sql = '''
SELECT distinct LPAD(huc8::text, 8, '0') as huc8 FROM derived.featureid_huc_crosswalk WHERE huc8 is not null;
'''
df = sf.sql_to_dataframe(sql)
ht_usecols = ['HydroID', 'feature_id', 'stage', 'discharge_cms']

paginator = S3_CLIENT.get_paginator('list_objects')
operation_parameters = {'Bucket': FIM_BUCKET,
                        'Prefix': f'{HAND_DATASETS_DPATH}/',
                        'Delimiter': '/'}
page_iterator = paginator.paginate(**operation_parameters)
page_count = 0
for page in page_iterator:

    prefix_objects = page['CommonPrefixes']
    for i, prefix_obj in enumerate(prefix_objects):
        
        print(f"Processing {(i+1) + (1000 * page_count)} of"
              f"{len(prefix_objects) * page_count} on page {page_count + 1} (1000 per page)")        
        
        print(f"Processing {(i+1) + (1000 * page_count)} of"
              f" {len(prefix_objects) * page_count} on page {page_count + 1} (1000 per page)")
        branch_prefix = f'{prefix_obj.get("Prefix")}branches/0/'
        branch_files_result = S3_CLIENT.list_objects(
            Bucket=FIM_BUCKET, 
            Prefix=branch_prefix, 
            Delimiter='/'
        )
        hydro_table_key = None
        for content_obj in branch_files_result.get('Contents'):
            branch_file_prefix = content_obj['Key']
            if 'hydroTable' in branch_file_prefix:
                hydro_table_key = branch_file_prefix

        if hydro_table_key:
            # print(f"Found usgs_elev_table and hydroTable in {branch_prefix}")
            try:
                # print("...Fetching csvs...")
                ht = S3_CLIENT.get_object(
                    Bucket=FIM_BUCKET,
                    Key=hydro_table_key
                )['Body']
                # print("...Reading with pandas...")
                ht_df = pd.read_csv(ht, header=0, usecols=ht_usecols)
                # print('...Writing to db...')
                ht_df['fim_version'] = PUBLIC_FIM_VERSION
                ht_df[COLUMN_NAME_MODEL_VERSION] = FIM_MODEL_VERSION
                ht_df.to_sql(
                    con=VIZ_DB_ENGINE, 
                    schema='derived',
                    name='hydrotable',
                    index=False,
                    if_exists='append'
                )
            except Exception as e:
                raise e
                print(f'Fetch failed: {e}')
                
        page_count += 1
                
                
end_dt = datetime.now()
time_duration = end_dt - start_dt
print("hydrotable reload done")
print(f"... duration was  {str(time_duration).split('.')[0]}")


hydrotable reloaded - started
Processing 1 of 0 on page 1 (1000 per page)
Processing 1002 of 1000 on page 2 (1000 per page)
Processing 2003 of 2000 on page 3 (1000 per page)


KeyboardInterrupt: 

In [None]:

print("hydrotable_staggered started")

start_dt = datetime.now()

sql = '''
DROP TABLE IF EXISTS derived.hydrotable_staggered;
SELECT
    et.location_id,
    ht.feature_id,
    (stage + et.dem_adj_elevation) * 3.28084 as elevation_ft,
    LEAD((stage + et.dem_adj_elevation) * 3.28084) OVER (PARTITION BY ht.feature_id ORDER BY ht.feature_id, stage) as next_elevation_ft,
    discharge_cms * 35.3147 as discharge_cfs,
    LEAD(discharge_cms * 35.3147) OVER (PARTITION BY ht.feature_id ORDER BY ht.feature_id, stage) as next_discharge_cfs
INTO derived.hydrotable_staggered
FROM derived.hydrotable AS ht
JOIN derived.usgs_elev_table AS et ON ht."HydroID" = et."HydroID" AND et.location_id IS NOT NULL;
'''
sf.execute_sql(sql)

print("hydrotable_staggered reload done")
end_dt = datetime.now()
time_duration = end_dt - start_dt
print(f"... duration was  {str(time_duration).split('.')[0]}")



In [None]:

# we don't need the hydrotable anymore as it has been reloaded and adjusted above in hydrotable_staggered
sf.execute_sql('DROP TABLE IF EXISTS derived.hydrotable;')
print("Done dropping derived.hydrotable, post hydrotable_staggered load")


<h2>9 - Recreate derived.usgs_rating_curves_staggered</h2>

In [None]:

# Aug 16, 2024 - done for 4.4.0.0 (4.5.2.11)

# TODO: Aug 2024: Change this to a backup without indexes and not rename
# Aug 27, 2024: This needs to be redone so we don't rename tables, it messes up indexes and index names when we use _to_sql commands later

# sf.execute_sql(f'ALTER TABLE IF EXISTS derived.usgs_rating_curves RENAME TO usgs_rating_curves_{OLD_FIM_TAG};')
# sf.execute_sql(f'ALTER TABLE IF EXISTS derived.usgs_rating_curves_staggered RENAME TO usgs_rating_curves_staggered_{OLD_FIM_TAG};')
# print("usgs rating curve tables renamed and cleaned")


In [None]:

sql = '''
    DROP TABLE IF EXISTS derived.usgs_rating_curves;
    DROP TABLE IF EXISTS derived.usgs_rating_curves_staggered;
'''
sf.execute_sql(sql)

print("Done dropping usgs_rating_curves and usgs_rating_curves_staggered")


In [None]:
# run the script to load the usgs_rating_curve.csv. Exact duration not yet known. Appx 30 min (??)

start_dt = datetime.now()
event = {
    'target_table': 'derived.usgs_rating_curves',
    'target_cols': ['location_id', 'flow', 'stage', 'navd88_datum', 'elevation_navd88'],
    'file': f'{QA_DATASETS_DPATH}/usgs_rating_curves.csv',
    'bucket': FIM_BUCKET,
    'reference_time': '2023-08-23 00:00:00',
    'keep_flows_at_or_above': 0,
    'iteration_index': 0
}

sf.execute_db_ingest(event, None)

print("done loading usgs_rating_curves")
end_dt = datetime.now()
time_duration = end_dt - start_dt
print(f"... duration was  {str(time_duration).split('.')[0]}")


In [None]:

# Takes under a minute
print("Starting usgs_rating_curves_staggered build based on usgs_rating_curve table")

sql = '''
SELECT 
    location_id,
    flow as discharge_cfs, 
    LEAD(flow) OVER (PARTITION BY location_id ORDER BY location_id, stage) as next_discharge_cfs,
    stage,
    navd88_datum,
    elevation_navd88 as elevation_ft,
    LEAD(elevation_navd88) OVER (PARTITION BY location_id ORDER BY location_id, stage) as next_elevation_ft
INTO derived.usgs_rating_curves_staggered
FROM derived.usgs_rating_curves;
'''

sf.execute_sql(sql)

print("Done loading usgs_rating_curves_staggered")


In [None]:

# usgs_rating_curves is a temp table and is loaded with some changes into the usgs_rating_curves_staggered
sf.execute_sql('DROP TABLE IF EXISTS derived.usgs_rating_curves;')
print("Done dropping derived.usgs_rating_curves, post loading usgs_rating_curves_staggered")

<h2>10 - UPDATE SRC SKILL METRICS IN DB</h2>

In [None]:
# Already run for 4.4.0.0 (4.5.2.11)

'''
Be Very Careful to just rename tables. If they have indexes, the index will now point to the new
table names but maintain the original index name. Those index names can really mess stuff up.
Best to never rename unless you rename indexes as well. This particular on is ok. 
Note: When various '"to_sql" tools are run which have GIST indexes, this index column name issue
will be the problem.

Why Drop instead of Truncate? if the schema changes for the incoming, truncate will have column
missmatches.

We really should be backing up indexes and constraints as well.

'''

# TODO: Aug 2024: Change this away from "rename" to copy / drop. 
# sf.execute_sql(f'ALTER TABLE IF EXISTS derived.src_skill_temp RENAME TO src_skill_temp_{OLD_FIM_TAG};')
# sf.execute_sql(f'ALTER TABLE IF EXISTS derived.src_skill RENAME TO src_skill_{OLD_FIM_TAG};')

# print("src_skill and src_skill_temps db renamed")


# TODO: Rob Aug 2024: change this to backup of table and not rename as it messses with indexes
# Don't need a copy of the reference src_skill table , so just drop it.
new_table_name = f"derived.src_skill_temp_{OLD_FIM_TAG}"
sql = f'''
   CREATE TABLE IF NOT EXISTS {new_table_name} AS TABLE derived.src_skill_temp;
'''


#print("src_skill and src_skill_temps db renamed")


In [None]:
# Prep the dbs for the new load
#sf.execute_sql('DROP TABLE IF EXISTS derived.src_skill_temp;')
#sf.execute_sql('DROP TABLE IF EXISTS reference.src_skill;', db_type='egis')
#print("Done dropping src_skill and src_skill_temp tables")

In [None]:

# Load the src_skill_temp table
start_dt = datetime.now()

event = {
    'target_table': 'derived.src_skill_temp',
    'target_cols': None,  # This means "all"
    'file': f'{QA_DATASETS_DPATH}/agg_nwm_recurr_flow_elev_stats_location_id.csv',
    'bucket': FIM_BUCKET,
    'reference_time': '2023-08-23 00:00:00',
    'keep_flows_at_or_above': 0,
    'iteration_index': 0,
    'db_type': 'viz'
}

execute_db_ingest(event, None)
print("Done loading derived.src_skill_temp table")
end_dt = datetime.now()
time_duration = end_dt - start_dt
print(f"... duration was  {str(time_duration).split('.')[0]}")


In [None]:

# Load into src_skill table adding geometry to it from external.usgs_gage. Yes.. more/less straight from WRDS tables
# Some recs appear to be in error in the csv. location id = 394220106431500 (those are dropped below)

start_dt = datetime.now()

sf.execute_sql('DROP TABLE IF EXISTS derived.src_skill;')

sql = f"""
SELECT
	(row_number() OVER ())::int as oid,
	gage.name,
	LPAD(skill.location_id::text, 8, '0') as location_id,
	skill.nrmse,
	skill.mean_abs_y_diff_ft,
	skill.mean_y_diff_ft,
	skill.percent_bias,
    '{PUBLIC_FIM_VERSION}' as {COLUMN_NAME_FIM_VERSION},
    '{FIM_MODEL_VERSION}' as {COLUMN_NAME_MODEL_VERSION},
	gage.geo_point as geom
INTO derived.src_skill
FROM derived.src_skill_temp skill
JOIN external.usgs_gage AS gage ON LPAD(gage.usgs_gage_id::text, 8, '0') = LPAD(skill.location_id::text, 8, '0')
"""

sf.execute_sql(sql)

print("Done loading derived.src_skill table")
end_dt = datetime.now()
time_duration = end_dt - start_dt
print(f"... duration was  {str(time_duration).split('.')[0]}")



<h4>Then export the derived.src_skill table and import it into the EGIS reference.src_skill table</h4>

In [None]:

sf.move_data_from_viz_to_egis("derived.src_skill", "reference.src_skill")
print("Done")


<h2>11 - UPDATE FIM PERFORMANCE METRICS IN DB</h2>

In [None]:

# Make copies of current dbs for 4.4.0.0 (4.5.2.11)
# DONE: for 4.4.0.0 (4.5.2.11)

# NOTE: Aug 2024: The problem with not droppign them and rebuilding them with indexes, is that if the table schema
# changes it is not reflected


# Points
new_table_name = f"reference.fim_performance_points_{OLD_FIM_TAG}"
sql = f'''
    CREATE TABLE IF NOT EXISTS {new_table_name} AS TABLE reference.fim_performance_points;
'''
sf.execute_sql(sql, db_type='egis')
print(f"fim_performance_points copied to {new_table_name} if it does not already exists")


# Catchments
new_table_name = f"reference.fim_performance_catchments_{OLD_FIM_TAG}"
sql = f'''
   CREATE TABLE IF NOT EXISTS {new_table_name} AS TABLE reference.fim_performance_catchments;
'''
sf.execute_sql(sql, db_type='egis')
print(f"fim_performance_catchments copied to {new_table_name} if it does not already exists")


# Polys
new_table_name = f"reference.fim_performance_polys_{OLD_FIM_TAG}"
sql = f'''
   CREATE TABLE IF NOT EXISTS {new_table_name} AS TABLE reference.fim_performance_polys;
'''
sf.execute_sql(sql, db_type='egis')
print(f"fim_performance_polys copied to {new_table_name} if it does not already exists")

print("Done making backups of the FIM performance tables")



In [None]:
# clean up tables for new load

# TODO: Aug 2024: Add postgresql if / else. Truncate "if exists" doesn't exist. :)

table_names = [
    "reference.fim_performance_points",
    "reference.fim_performance_polys",
    "reference.fim_performance_catchments"
]

for tb_name in table_names:
    sql = f"TRUNCATE TABLE {tb_name}"
#    print(sql)
    sf.execute_sql(sql,db_type='egis')


print(f"All fim_performance tables trunated if they exist")



In [None]:

# Load the new fim performance tables

start_dt = datetime.now()

# os.environ['EGIS_DB_HOST'] =''  #TI DB

db_type = "egis"
db_engine = sf.get_db_engine(db_type)
s3 = boto3.client('s3')

# Define bucket and parent directories.
bucket = "hydrovis-ti-deployment-us-east-1"

# file_handles = ['fim_performance_points.csv']
# file_handles = ['fim_performance_points.csv', 'fim_performance_polys.csv', 'fim_performance_catchments_dissolved.csv']
# file_handles = ['fim_performance_points.csv', 'fim_performance_polys.csv']
file_handles = ['fim_performance_catchments.csv']

for file_handle in file_handles:

    print("Reading file...")
    # df = pd.read_csv(local_download_path)
    file_to_download = f"{QA_DATASETS_DPATH}/{file_handle}"
    df = s3_sf.download_S3_csv_files_to_df_from_list(FIM_BUCKET, [file_to_download], True)
    print("File read.")

    # Rename headers.

    if file_handle == 'fim_performance_points.csv':
        df = df.rename(columns={'Unnamed: 0': 'oid', 'geometry': 'geom'})
    else:
        df = df.rename(columns={'Unnamed: 0': 'oid', 'geometry': 'geom', 'huc':'huc8'})

    print(df.dtypes)
    # Convert all field names to lowercase (needed for ArcGIS Pro).
    df.columns = df.columns.str.lower()

    # Enforce data types on df before loading in DB (TODO: need to create special cases for each layer).
    if file_handle == 'fim_performance_points.csv':
        df = df.astype({'huc': 'str'})
    else:
        df = df.astype({'huc8': 'str'})
    df = df.fillna(0)
    try:
        df = df.astype({'feature_id': 'int'})
        df = df.astype({'feature_id': 'str'})
        df = df.astype({'oid': 'int'})
    except KeyError:  # If there is no feature_id field
        pass
    try:
        df = df.astype({'nwm_seg': 'int'})
        df = df.astype({'nwm_seg': 'str'})
    except KeyError:  # If there is no nwm_seg field
        pass
    try:
        df = df.astype({'usgs_gage': 'int'})
        df = df.astype({'usgs_gage': 'str'})
    except KeyError:  # If there is no usgs_gage field
        pass

    # zfill HUC8 field.
    if file_handle == 'fim_performance_points.csv':
        df['huc'] = df['huc'].apply(lambda x: x.zfill(8))
    else:
        df['huc8'] = df['huc8'].apply(lambda x: x.zfill(8))

    df['version'] = PUBLIC_FIM_VERSION
    df[COLUMN_NAME_MODEL_VERSION] = FIM_MODEL_VERSION

    # Upload df to database.
    stripped_layer_name = file_handle.replace(".csv", "")
    table_name = "reference." + stripped_layer_name
    print("Loading data into DB...")

    # Chunk load data into DB

    if file_handle in ['fim_performance_catchments.csv']:

        print("Chunk loading...")
        # Create list of df chunks
        n = 10000  # chunk row size
        list_df = [df[i:i+n] for i in range(0,df.shape[0],n)]
        # geometry = 'MULTIPOLYGON'
        # Load the first chunk into the DB as a new table
        first_chunk_df = list_df[0]
        print(first_chunk_df.shape[0])

        first_chunk_df.to_sql(
            name=stripped_layer_name, 
            con=db_engine, 
            schema='reference',
            if_exists='replace', 
            index=False,
            dtype={'oid': sqlalchemy.types.Integer(),
                   'version': sqlalchemy.types.String(),
                   'geom': Geometry('MULTIPOLYGON', srid=3857)
                  }
        )
        # Load remaining chunks into newly created table

        for remaining_chunk_df in list_df[1:]:
            print(remaining_chunk_df.shape[0])
            remaining_chunk_df.to_sql(
                name=stripped_layer_name,
                con=db_engine,
                schema='reference',
                if_exists='append',
                index=False,
                dtype={'oid': sqlalchemy.types.Integer(),
                       'version': sqlalchemy.types.String(),
                       'geom': Geometry('MULTIPOLYGON', srid=3857)
                      }
            )
    else:
        if 'points' in stripped_layer_name: geometry = 'POINT'
        if 'polys' in stripped_layer_name: geometry = 'POLYGON'
        # print("GEOMETRY")
        # print(geometry)
        df.to_sql(
            name=stripped_layer_name,
            con=db_engine,
            schema='reference',
            if_exists='replace',
            index=False,
            dtype={'oid': sqlalchemy.types.Integer(),
                   'version': sqlalchemy.types.String(),
                   'geom': Geometry(geometry, srid=3857)
                  }
        )

    print(f">>> {file_handle} downloaded and loaded")

    # deleted the downloaded file that was just processed.
    # if os.path.exists(local_download_path):


print("")

end_dt = datetime.now()
time_duration = end_dt - start_dt
# print("All FIM Performance files loaded")
print(f"... duration was  {str(time_duration).split('.')[0]}")


<h2>12 - CatFIM (Stage-Based and Flow-Based)</h2>

<h4>Function to load CatFIM Data (Non Public)</h4>

In [None]:
''' Function to load CatFIM data (for any flow / stage / library / sites but non public)'''


def load_catfim_table(catfim_type):

    '''
    Inputs:
        - catfim_type: name identififer for the set, such as "flow_based_catfim" or "flow_based_catfim_sites", etc
              Sometimes the file_handle name can be the name of the s3 file (without extension) and/or the table
              name.
              Options: flow_based_catfim, flow_based_catfim_sites, stage_based_catfim, stage_based_catfim_sites
    '''

    db_type = "egis"
    db_engine = sf.get_db_engine(db_type)
    src_crs = "3857"

    # --------------------------------------
    # Drop the original Db if already in place
    table_name = catfim_type  # yes, dup variable for now

    sf.execute_sql(f"DROP TABLE IF EXISTS reference.{table_name};", db_type=db_type)
    print(f"Dropping reference.{table_name} table if it existed")
    print("")

    # --------------------------------------
    # Get the data from S3 and load it into a df
    if catfim_type in ['flow_based_catfim', 'stage_based_catfim']:
        file_to_download = f"{QA_DATASETS_DPATH}/{catfim_type}_library.csv"
    else:
        file_to_download = f"{QA_DATASETS_DPATH}/{catfim_type}.csv"

    # print(f"Downloading {file_to_download} ... ")

    df = s3_sf.download_S3_csv_files_to_df_from_list(FIM_BUCKET, [file_to_download], True)
    num_recs = len(df)
    print(f"File read. {num_recs} records to load")

    # --------------------------------------
    # Adjusting Columns and data
    # Rename headers. All files this name
    df = df.rename(columns={'Unnamed: 0': 'oid',
                            'geometry': 'geom',
                            'huc': 'huc8'})

    # Convert all field names to lowercase (needed for ArcGIS Pro).
    df.columns = df.columns.str.lower()

    # Enforce data types on df before loading in DB (TODO: need to create special cases for each layer).
    df = df.astype({'huc8': 'str'})
    df = df.fillna(0)
    try:
        df = df.astype({'feature_id': 'int'})
        df = df.astype({'feature_id': 'str'})
    except KeyError:  # If there is no feature_id field
        pass
    try:
        df = df.astype({'nwm_seg': 'int'})
        df = df.astype({'nwm_seg': 'str'})
    except KeyError:  # If there is no nwm_seg field
        pass
    try:
        df = df.astype({'usgs_gage': 'int'})
        df = df.astype({'usgs_gage': 'str'})
    except KeyError:  # If there is no usgs_gage field
        pass

    # zfill HUC8 field.
    df['huc8'] = df['huc8'].apply(lambda x: x.zfill(8))

    if '_sites' in catfim_type:
        df = df.astype({'nws_data_rfc_forecast_point': 'str'})
        df = df.astype({'nws_data_rfc_defined_fcst_point': 'str'})
        df = df.astype({'nws_data_riverpoint': 'str'})


    # As of Nov 1, 2024: Ignore the incoming "version" from dataset
    # df['version'] = PUBLIC_FIM_VERSION
    df[COLUMN_NAME_FIM_VERSION] = PUBLIC_FIM_VERSION
    df[COLUMN_NAME_MODEL_VERSION] = FIM_MODEL_VERSION

    # --------------------------------------
    # Load to DB
    # Chunk load data into DB
    if catfim_type in ['flow_based_catfim', 'stage_based_catfim']:

        # Create list of df chunks
        n = 1000  # chunk row size
        print(f"Chunk loading... into {table_name} -- {n} records at a time")
        print("")
        chunk_df = [df[i:i+n] for i in range(0, df.shape[0], n)]

        # Load the first chunk into the DB as a new table
        first_chunk_df = chunk_df[0]
        num_chunks = len(chunk_df)

        print(f" ... loading chunk 1 of {num_chunks}")

        first_chunk_df.to_sql(
            name=table_name,
            con=db_engine,
            schema='reference',
            if_exists='replace',
            index=False,
            dtype={'oid': sqlalchemy.types.Integer(),
                   'geom': Geometry('MULTIPOLYGON', srid=src_crs)}
        )

        # Load remaining chunks into newly created table
        ctr = 1  # Already loaded one
        for remaining_chunk in chunk_df[1:]:
            # print(remaining_chunk.shape[0])
            ctr += 1
            print(f" ... loading chunk {ctr} of {num_chunks}")
            remaining_chunk.to_sql(
                        name=table_name,
                        con=db_engine,
                        schema='reference',
                        if_exists='append',
                        index=False,
                        dtype={'oid': sqlalchemy.types.Integer(),
                               'geom': Geometry('MULTIPOLYGON', srid=src_crs)
                              }
                    )
        # end for
    else:  # sites tables
        print(f"Loading data into {table_name} ...")

        df.to_sql(
            name=table_name,
            con=db_engine,
            schema='reference',
            if_exists='replace',
            index=False,
            dtype={'oid': sqlalchemy.types.Integer(),
                   'geom': Geometry('POINT', srid=src_crs)}
        )

    # This should auto create a gist index against the geometry column
    # if that index name already exists, the upload will fail, the index can not pre-exist
    # Best to drop the table before loading.

    # return

print("load_catfim_table function loaded")


<h3>12.a - Backup old DBs and prepare new databases (but not the "public" FIM 10/30 db's)</h3>

In [None]:
# This covers both Stage Based and Flow Based (but not the "public" catfim db's)

# The "Public" db backups ana loads are in cells lower (12.d and higher)

# DONE for 4.4.0.0.  (4.5.2.11)

# # print("Starting Data Backups and table drops for stage and flow based catfim")
# db_names = ["stage_based_catfim", "stage_based_catfim_sites",
#             "flow_based_catfim", "flow_based_catfim_sites"]

# for db_name in db_names:
#     new_table_name = f"reference.{db_name}_{OLD_FIM_TAG}"
#     sql = f'''
#         CREATE TABLE IF NOT EXISTS {new_table_name} AS TABLE reference.{db_name};
#     '''
#     sf.execute_sql(sql, db_type='egis')
#     print(f"{db_name} copied to {new_table_name} if it does not already exist")


# Aug 2024: Now we can drop the tables as we don't have any indexes on them at this time other than the gist geom index.
# By dropping them, we can auto adjust the tables schema. (don't truncate)

# for db_name in db_names:
#     sf.execute_sql(f"DROP TABLE IF EXISTS reference.{db_name};", db_type='egis')
#     print(f"reference.{db_name} table dropped if it existed")


# print("Data Backups of flow based catfim are complete")


<h3>12.b - Updated Flow and Stage Based CatFIM Data (Non Public)</h3>

<h3>AUG 2024: IMPORTANT NOTE:</h3>
The stage based catfim (library) csv has grown to appx 10 GiB. Our current notebook, hv-vpp-ti-viz-notebook only has 15 GiB memory.
Running tool can easily overwhelm the notebook server and freeze it up forcing a reboot.
Sometimes when the notebook instance comes back up, it no longer has ths swap system in place. You will need most of the memory
and some swap to load it.  Keep an eye a "terminal" windows and keep entering `free -h` to keep an eye on it's usage.
</br>
We will need to review to see if we want to:

1. Upgrade this notebook server with more memory (and harddrive space would be good)

2. Change the load of the catfim library (non sites) data to another system. Maybe we can load it via a lambda to an EC2 or something?

3. Get the FIM Team to break it to smaller pieces, but watch carefully for the OID system (unique id for all records)

**When you are done running this script, Please restart this kernal as it does not appear to be releasing all memory. (memory leak?)**


Also looks like Tyler has some notebooks where he was moving this into a lambda load? We need to look into that


In [None]:

print("Starting of CatFIM data")

# catfim_types = ['flow_based_catfim', 'flow_based_catfim_sites']
# catfim_types =  ['stage_based_catfim', 'stage_based_catfim_sites']
catfim_types = ['stage_based_catfim_sites']
# catfim_types = ['stage_based_catfim']

start_dt = datetime.now()

for catfim_type in catfim_types:
    print(f"Loading {catfim_type} data")
    load_catfim_table(catfim_type)

print("")
end_dt = datetime.now()
time_duration = end_dt - start_dt
print(f"... duration was  {str(time_duration).split('.')[0]}")



<h3>12.c - CatFIM Backup old "public" FIM 10 / 30 DBs and prepare new databases</h3>

In [None]:
'''
This covers ONLY Catfim public FIM 10/30 for both flow based and stage based
'''

''' DONE for 4.4.0.0.  (4.5.2.11)'''

# db_name_appendix = f"{OLD_FIM_TAG}_fim_10"

# print("Starting Data Backups and table drops for stage and flow based PUBLIC catfim")
# # db_names = ["stage_based_catfim_public", "stage_based_catfim_sites_public",
# #              "flow_based_catfim_public", "flow_based_catfim_sites_public"]

# # stage_based_catfim_sites_public didn't exist for fim 10 but should have in TI (does in other enviros likely)
# db_names = ["stage_based_catfim_public", 
#              "flow_based_catfim_public", "flow_based_catfim_sites_public"]

# for db_name in db_names:
#     new_table_name = f"reference.{db_name}_{db_name_appendix}"
#     sql = f"CREATE TABLE IF NOT EXISTS {new_table_name} AS TABLE reference.{db_name}"
#     sf.execute_sql(sql, db_type='egis')
#     print(f"{db_name} copied to {new_table_name} if it does not already exist")

    
# # Aug 2024: Now we can drop the tables as we don't have any indexes on them at this time other than the gist geom index.
# # By dropping them, we can auto adjust the tables schema. (don't truncate)

# for db_name in db_names:
#     sf.execute_sql(f"DROP TABLE IF EXISTS reference.{db_name};", db_type='egis')
#     print(f"reference.{db_name} table dropped if it existed")

# print("Data Backups of flow based catfim are complete")


<h3>12.d - Load CatFIM "public" FIM 30 DBs</h3>

In [None]:


print("Loading CatFIM Public datasets (FIM 30)")

catfim_types = ["stage_based_catfim", "stage_based_catfim_sites",
                "flow_based_catfim", "flow_based_catfim_sites"]

__public_fim_release = "fim_30"  # The new fim public release being loaded (ie. fim_10, fim_30, fim_60..)

start_dt = datetime.now()

for catfim_type in catfim_types:
    print("")
    sql = f'''
    DROP TABLE IF EXISTS reference.{catfim_type}_public;

    SELECT
        catfim.*,
        '{__public_fim_release}' as public_fim_release
    INTO reference.{catfim_type}_public
    FROM reference.{catfim_type} as catfim
    JOIN reference.public_fim_domain as fim_domain ON ST_Intersects(catfim.geom, fim_domain.geom)
    '''
    print(sf.execute_sql(sql, db_type='egis'))
    print(f"public {__public_fim_release} data load for {catfim_type} is complete")

# what about indexes again?

# for db_name in db_names:
#     new_table_name = f"reference.{db_name}_{db_name_appendix}"
#     sql = f"CREATE TABLE IF NOT EXISTS {new_table_name} AS TABLE reference.{db_name}"
#     sf.execute_sql(sql, db_type='egis')
#     print(f"{db_name} copied to {new_table_name} if it does not already exist")

print("")
end_dt = datetime.now()
time_duration = end_dt - start_dt
print(f"... duration was  {str(time_duration).split('.')[0]}")


<h2>13 - Clear the HAND Cache</h2>

In [None]:
sql = """
TRUNCATE TABLE fim_cache.hand_hydrotable_cached;
TRUNCATE TABLE fim_cache.hand_hydrotable_cached_max;
TRUNCATE TABLE fim_cache.hand_hydrotable_cached_geo;
TRUNCATE TABLE fim_cache.hand_hydrotable_cached_zero_stage;
"""
sf.execute_sql(sql)

<h2>14 - SAVE TO REPO (AND REDEPLOY TO TI WITH NEW VERSION VARIABLE IN TERRAFORM ??)</h2>

Oct 21, 2024: We don''t have a system per-say to update for Terraform, but we now have github hooks
built right into JupyterHub. We need to figure out how to work with multiple branches and "getting latest"
but this gives us source control management now.


Note from Rob: While, un-elegant, there so much quick evolution here that I recommend we even keep seperate named load scripts in GIT
ie) one for FIM Version 4.4.0.0 and one for 4.5.2.11, etc. So many changes for each edition and very fast script changes WIP may 
make it smarter to keep each script seperately (ie. 4.4.0.0, 4.5.2.11, etc)

<h4>Make sure to Publish the changes to git and add a PR</h4>

