# Download USGS flow data
It is difficult to determine automatically how much data is available for a given USGS station, so we instead request a very long time period. The server returns only the data that is actually available.
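
As a rough sketch of what such a request could look like, the snippet below builds an Instantaneous Values URL with a deliberately wide time window. The query parameter names match those used in the download loop later in this notebook. The helper name `build_iv_url` and the example dates are illustrative assumptions, not values taken from the config file.

```python
from urllib.parse import urlencode


def build_iv_url(site, start, end, var='00060',
                 base='https://nwis.waterservices.usgs.gov/nwis/iv/'):
    """Build a request URL for the USGS Instantaneous Values service (hypothetical helper)."""
    query = urlencode({
        'format': 'rdb',       # tab-separated response with '#' header lines
        'sites': site,
        'startDT': start,
        'endDT': end,
        'parameterCd': var,    # 00060 = streamflow
        'siteStatus': 'all',
    })
    return f'{base}?{query}'


# The dates are assumptions, chosen only to be wider than any likely record
url = build_iv_url('01013500', '1950-01-01', '2023-01-01')
print(url)
```

Building the query with `urlencode` rather than string formatting keeps the parameters in one place and escapes any special characters.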

Workflow:
- Save the server response as a temporary file
- Separate the response into a `.csv` file containing the data and a `.txt` file containing the header (meta) info

We need the data as `.csv` for later processing, but we cannot store the raw response directly as a `.csv` file because its line structure does not match: it mixes `#` comment lines with tab-separated data. We therefore process it here.
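
A minimal sketch of how the separation could work, assuming the response layout implied by the checks further down: `#` comment lines, then a tab-separated row of column names, then a row of column format codes, then the data. The helper name `split_rdb`, the column names and the values in the sample are made up for illustration. This is not the function used elsewhere in this repository.

```python
import csv
import io


def split_rdb(text):
    """Split a response into (header_text, csv_text). Hypothetical helper."""
    header_lines, rows = [], []
    for line in text.splitlines():
        if line.startswith('#'):
            header_lines.append(line)      # meta info, destined for the .txt file
        elif line.strip():
            rows.append(line.split('\t'))  # tabular part, destined for the .csv file
    # rows[0] holds the column names and rows[1] the format codes (e.g. '5s');
    # drop the format row so that only names and data remain
    if len(rows) > 1:
        del rows[1]
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator='\n').writerows(rows)
    return '\n'.join(header_lines) + '\n', buffer.getvalue()


# Tiny made-up response in the assumed layout (not real data)
sample = (
    '# Example header line\n'
    '# Another header line\n'
    'agency_cd\tsite_no\tdatetime\tvalue\n'
    '5s\t15s\t20d\t14n\n'
    'USGS\t01013500\t2020-01-01 00:00\t100\n'
)
header_text, csv_text = split_rdb(sample)
print(header_text)
print(csv_text)
```

Writing the rows through `csv.writer` instead of replacing tabs with commas means any field that itself contains a comma is quoted correctly.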

Download info source: https://waterservices.usgs.gov/rest/IV-Service.html#Specifying

In [2]:
import sys
import time
import pandas as pd
import urllib.request
from pathlib import Path
sys.path.append(str(Path().absolute().parent))
import python_cs_functions as cs

### Config handling

In [3]:
# Specify where the config file can be found
config_file = '../0_config/config.txt'

In [5]:
# Get the required info from the config file
data_path = cs.read_from_config(config_file,'data_path')

# CAMELS-spat metadata
cs_meta_path = cs.read_from_config(config_file,'cs_basin_path')
cs_meta_name = cs.read_from_config(config_file,'cs_meta_name')
cs_unusable_name = cs.read_from_config(config_file,'cs_unusable_name')

# Basin folder
cs_basin_folder = cs.read_from_config(config_file, 'cs_basin_path')
basins_path = Path(data_path) / cs_basin_folder

# Data period
time_s = cs.read_from_config(config_file, 'usgs_start_t')
time_e = cs.read_from_config(config_file, 'usgs_start_e')

### Data loading

In [7]:
# CAMELS-spat metadata file
cs_meta_path = Path(data_path) / cs_meta_path
cs_meta = pd.read_csv(cs_meta_path / cs_meta_name)

### Loop over sites and download the flow record

In [6]:
# General settings
var = '00060' # streamflow; 00065 for gage height
main_url = 'https://nwis.waterservices.usgs.gov/nwis/iv/' # (i)nstantaneous (v)alues

In [8]:
# Loop over the USA stations only
dnf = [] # List of incomplete stations, retaining these for easier printout and checking later
for ix,row in cs_meta.iterrows():
    if row.Country == 'USA':
        
        # Get paths, etc
        site, _, raw_path, _, _,_ = cs.prepare_flow_download_outputs(cs_meta, ix, basins_path)
        
        # Skip to next if we already have the data for this station
        if raw_path.is_file():
            continue
        
        # Construct the download URL
        url = f'{main_url}?format=rdb&sites={site}&startDT={time_s}&endDT={time_e}&parameterCd={var}&siteStatus=all'
        time.sleep(1) # pause for a second so we don't bombard the server with requests
        
        # Download the URL to a temporary location
        urllib.request.urlretrieve(url, raw_path)      
        
        # Checks
        # Skip the '#' header lines; low_memory=False prevents the mixed-dtype warning
        df = pd.read_csv(raw_path, delimiter='\t', comment='#', low_memory=False)
        if len(df) < 3: # Sites with no data still have a 2-line df. 1st: data format. 2nd: NaNs
            print(f'No data downloaded for {site}')
            dnf.append(site)
        else:
            print(f'Completed {site}')

Completed 01013500
Completed 01022500
Completed 01030500
Completed 01031500
Completed 01047000
Completed 01052500
Completed 01054200
Completed 01055000
Completed 01057000
Completed 01073000
Completed 01078000
Completed 01118300
Completed 01121000
Completed 01123000
Completed 01134500
Completed 01137500
Completed 01139000
Completed 01139800
Completed 01142500
Completed 01144000
Completed 01162500
Completed 01169000
Completed 01170100
Completed 01181000
Completed 01187300
Completed 01195100
Completed 01333000
Completed 01350000
Completed 01350080
Completed 01350140
Completed 01365000
Completed 01411300
Completed 01413500
Completed 01414500
Completed 01415000
Completed 01423000
Completed 01434025
Completed 01435000
Completed 01439500
Completed 01440000
Completed 01440400
Completed 01451800
Completed 01466500
Completed 01484100
Completed 01485500
Completed 01486000
Completed 01487000
Completed 01491000
Completed 01510000
Completed 01516500
Completed 01518862
Completed 01532000
Completed 01

Completed 07335700
Completed 07340300
Completed 07346045
Completed 07359610
Completed 07362100
Completed 07362587
Completed 07373000
Completed 07375000
Completed 07376000
Completed 08013000
Completed 08014500
Completed 08023080
Completed 08025500
Completed 08029500
Completed 08050800
Completed 08066200
Completed 08066300
Completed 08070000
Completed 08070200
Completed 08079600
Completed 08082700
Completed 08086212
Completed 08086290
Completed 08101000
Completed 08103900
Completed 08104900
Completed 08109700
Completed 08150800
Completed 08155200
Completed 08158700
Completed 08158810
Completed 08164000
Completed 08164300
Completed 08164600
Completed 08165300
Completed 08171300
Completed 08175000
Completed 08176900
Completed 08178880
Completed 08189500
Completed 08190000
Completed 08190500
Completed 08194200
Completed 08195000
Completed 08196000
Completed 08198500
Completed 08200000
Completed 08202700
Completed 08267500
Completed 08269000
Completed 08271000
Completed 08324000
Completed 08

# TO DO: re-run this with the new `dnf` list

In [12]:
# Print which basins we need to check
for entry in dnf:
    print(f'No data downloaded for {entry}')

No data downloaded for 02342933
No data downloaded for 02464360
No data downloaded for 04233000
No data downloaded for 11230500
No data downloaded for 11237500
No data downloaded for 12178100


Manual checks indicate that no Instantaneous Value (IV) discharge data is available for these stations. Checked on 2023-02-27.
- 02342933: https://waterdata.usgs.gov/monitoring-location/02342933/#period=P1Y
- 02464360: https://waterdata.usgs.gov/monitoring-location/02464360/#period=P1Y
- 04233000: https://waterdata.usgs.gov/monitoring-location/04233000/#period=P1Y
- 11230500: https://waterdata.usgs.gov/monitoring-location/11230500/#period=P1Y
- 11237500: https://waterdata.usgs.gov/monitoring-location/11237500/#period=P1Y
- 12178100: https://waterdata.usgs.gov/monitoring-location/12178100/#period=P1Y (IV gauge height but no discharge)
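
The check links above follow a fixed pattern, so they can be generated from the list of stations instead of being typed by hand. A small sketch, with the station IDs copied from the list above and the variable names chosen for illustration:

```python
# Stations for which no data was downloaded (copied from the list above)
dnf = ['02342933', '02464360', '04233000', '11230500', '11237500', '12178100']

# URL pattern taken from the manual-check links above
base = 'https://waterdata.usgs.gov/monitoring-location'
check_urls = [f'{base}/{site}/#period=P1Y' for site in dnf]

for check_url in check_urls:
    print(check_url)
```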

In [15]:
country = 'USA'

In [11]:
reason = 'No Instantaneous Values of discharge available'

In [19]:
# Make a dataframe that lists the basins we cannot use
cs_unusable = pd.DataFrame({'Country': country,
                            'Station_id': dnf,
                            'Reason': reason})

In [20]:
cs_unusable

| Unnamed: 0 | Country | Station_id | Reason |
|---|---|---|---|
| 0 | USA | 2342933 | No Instantaneous Values of discharge available |
| 1 | USA | 2464360 | No Instantaneous Values of discharge available |
| 2 | USA | 4233000 | No Instantaneous Values of discharge available |
| 3 | USA | 11230500 | No Instantaneous Values of discharge available |
| 4 | USA | 11237500 | No Instantaneous Values of discharge available |
| 5 | USA | 12178100 | No Instantaneous Values of discharge available |


In [21]:
cs_unusable.to_csv(cs_meta_path / cs_unusable_name)
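
The table shown above lists station IDs without their leading zeros (for example `2342933` rather than `02342933`), which is what pandas produces when an ID column is parsed as integers. Below is a minimal sketch of one way to keep the IDs intact when the file is read back. It assumes the column layout shown above and uses an in-memory buffer in place of the real file, with a shortened station list for illustration.

```python
import io

import pandas as pd

unusable = pd.DataFrame({'Country': 'USA',
                         'Station_id': ['02342933', '11230500'],
                         'Reason': 'No Instantaneous Values of discharge available'})

# Same to_csv call as above, but written to memory instead of disk
buffer = io.StringIO()
unusable.to_csv(buffer)

# Default parsing turns the IDs into integers and drops the leading zero
buffer.seek(0)
as_int = pd.read_csv(buffer, index_col=0)

# Forcing the column to str keeps the IDs exactly as written
buffer.seek(0)
as_str = pd.read_csv(buffer, index_col=0, dtype={'Station_id': str})

print(as_int['Station_id'].tolist())
print(as_str['Station_id'].tolist())
```

Passing `index_col=0` also avoids the extra `Unnamed: 0` column that appears when the saved index is read back as data.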