## Code and methods for downloading data from USGS

We will explore two methods:  
1. Manually downloading the dataset from USGS  
2. Downloading the dataset using an API  

We start with the API (method 2) and then repeat the download manually (method 1).

To proceed, we need to install the `dataretrieval` package.

Execute the following command: `!pip install dataretrieval`

In [1]:
!pip install dataretrieval

Collecting dataretrieval
  Downloading dataretrieval-1.0.12-py3-none-any.whl.metadata (9.2 kB)
Downloading dataretrieval-1.0.12-py3-none-any.whl (38 kB)
Installing collected packages: dataretrieval
Successfully installed dataretrieval-1.0.12


In [3]:
import pandas as pd  # pandas handles 1D (Series) and 2D (DataFrame) data

# import the module with functions for downloading data from NWIS
import dataretrieval.nwis as nwis  # nwis is a module, not a class

# specify the USGS site code for which we want data
site = '02336490'

# get instantaneous values (service='iv')
df = nwis.get_record(sites=site, service='iv', start='2020-01-01', end='2024-01-01')

# get basic info about the site
df3 = nwis.get_record(sites=site, service='site')

# Information about the parameter codes can be found at
# https://help.waterdata.usgs.gov/parameter_cd?group_cd=PHY

# 00060 - Discharge, cubic feet per second
# 00065 - Gauge height, feet

In [4]:
df

datetime,site_no,00060,00060_cd,00065,00065_cd
2020-01-01 05:00:00+00:00,02336490,1920.0,A,5.90,A
2020-01-01 05:15:00+00:00,02336490,1890.0,A,5.85,A
2020-01-01 05:30:00+00:00,02336490,1880.0,A,5.81,A
2020-01-01 05:45:00+00:00,02336490,1860.0,A,5.76,A
2020-01-01 06:00:00+00:00,02336490,1840.0,A,5.72,A
...,...,...,...,...,...
2024-01-02 03:45:00+00:00,02336490,1040.0,A,4.14,A
2024-01-02 04:00:00+00:00,02336490,1040.0,A,4.14,A
2024-01-02 04:15:00+00:00,02336490,1040.0,A,4.14,A
2024-01-02 04:30:00+00:00,02336490,1040.0,A,4.13,A
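
The `+00:00` suffix in the output above shows that the timestamps are in UTC, which likely explains why the last rows fall on 2024-01-02 even though the requested end date was 2024-01-01. Below is a minimal sketch of converting such timestamps to local time. The time zone name `America/New_York` is an assumption based on the station being near Atlanta, and the small frame is built by hand from the first rows shown above rather than downloaded.

```python
import pandas as pd

# Hand-built sample mirroring the first two rows of the output above.
sample = pd.DataFrame(
    {"00065": [5.90, 5.85]},
    index=pd.to_datetime(
        ["2020-01-01 05:00:00+00:00", "2020-01-01 05:15:00+00:00"], utc=True
    ),
)
sample.index.name = "datetime"

# Convert the UTC index to US Eastern time (assumed zone for this station).
local = sample.tz_convert("America/New_York")
print(local)
```

The same call could be applied to `df` before the index is reset, if local timestamps are preferred.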


In [5]:
df3

Unnamed: 0,agency_cd,site_no,station_nm,site_tp_cd,lat_va,long_va,dec_lat_va,dec_long_va,coord_meth_cd,coord_acy_cd,...,reliability_cd,gw_file_cd,nat_aqfr_cd,aqfr_cd,aqfr_type_cd,well_depth_va,hole_depth_va,depth_src_cd,project_no,geometry
0,USGS,2336490,"CHATTAHOOCHEE RIVER AT GA 280, NEAR ATLANTA, GA",ST,334902.7,842849.2,33.817417,-84.480333,N,S,...,,NNNNNNNN,,,,,,,,POINT (-84.48033 33.81742)


In [6]:
# datetime is currently the index of the DataFrame
# reset the index so that datetime becomes a regular column we can select by name

df.reset_index(inplace=True)

# select relevant columns
df = df[['datetime', '00065']]
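
To see why the index has to be reset first, here is a small sketch on a hand-built frame (the values are copied from the output above; the variable names are illustrative). While `datetime` is the index, it is not among the columns, so selecting it by name would raise a `KeyError`.

```python
import pandas as pd

# Hand-built frame with 'datetime' as the index, like the API result.
demo = pd.DataFrame(
    {"00060": [1920.0, 1890.0], "00065": [5.90, 5.85]},
    index=pd.to_datetime(
        ["2020-01-01 05:00:00+00:00", "2020-01-01 05:15:00+00:00"], utc=True
    ),
)
demo.index.name = "datetime"

# As an index, 'datetime' is not listed among the columns.
in_columns_before = "datetime" in demo.columns

# reset_index() moves the index into a regular column.
demo_flat = demo.reset_index()
in_columns_after = "datetime" in demo_flat.columns

print(in_columns_before, in_columns_after)
```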

In [7]:
df = df.rename(columns={'datetime': 'DATE', '00065': 'gauge_height'})
df

Unnamed: 0,DATE,gauge_height
0,2020-01-01 05:00:00+00:00,5.90
1,2020-01-01 05:15:00+00:00,5.85
2,2020-01-01 05:30:00+00:00,5.81
3,2020-01-01 05:45:00+00:00,5.76
4,2020-01-01 06:00:00+00:00,5.72
...,...,...
139793,2024-01-02 03:45:00+00:00,4.14
139794,2024-01-02 04:00:00+00:00,4.14
139795,2024-01-02 04:15:00+00:00,4.14
139796,2024-01-02 04:30:00+00:00,4.13


In [8]:
df.to_csv(f"{site}_raw_data_api.csv")

In [9]:
df.isnull().sum()

Unnamed: 0,0
DATE,0
gauge_height,63
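
The count alone does not say when the gaps occur. A possible next step, sketched here on a hand-built frame with made-up values, is to list the timestamps whose `gauge_height` is missing. Whether to drop or fill those rows depends on the analysis; dropping is shown only as one option.

```python
import pandas as pd

# Hand-built example with one missing reading (values are illustrative).
demo = pd.DataFrame(
    {
        "DATE": pd.to_datetime(
            ["2020-01-01 05:00", "2020-01-01 05:15", "2020-01-01 05:30"], utc=True
        ),
        "gauge_height": [5.90, None, 5.81],
    }
)

# Boolean mask selecting the rows where the reading is missing.
missing = demo[demo["gauge_height"].isnull()]
print(missing["DATE"].tolist())

# One option: drop the incomplete rows before further analysis.
complete = demo.dropna(subset=["gauge_height"])
```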


## Downloading Data Manually

1. Go to the <a href="https://dashboard.waterdata.usgs.gov/app/nwd/en/">USGS Dashboard</a> and search for the gauging station, in our case "02336490".
2. Go to the **Data** page and click on **Current/Historical Observations**.
3. Navigate to the **Legacy real-time page**.
4. Select the data you need, in this case, Gauge Height Data.
5. Choose "Tab-separated" as the format, then click the **GO** button on the right.

- By default, it displays data for about one week. If you need to download data spanning multiple years, you’ll need to use a different URL.
- If you try to load large data volumes, the page will notify you of the alternate URL: **https://nwis.waterdata.usgs.gov/usa/nwis/uv/**. The rest of the URL structure remains the same.
- You must specify the time range directly in the URL, like this: `period=&begin_date=2008-01-01&end_date=2024-05-29`.
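
The query string can also be assembled programmatically, which makes the date range easier to change. The sketch below uses Python's standard `urllib.parse.urlencode`. The helper name `build_uv_url` is hypothetical, and the parameter names are copied from the URL used with `curl` in this notebook rather than from official API documentation.

```python
from urllib.parse import urlencode

BASE_URL = "https://nwis.waterdata.usgs.gov/usa/nwis/uv/"


def build_uv_url(site_no, begin_date, end_date, parameter_cd="00065"):
    """Hypothetical helper: build the legacy download URL for one site."""
    params = {
        f"cb_{parameter_cd}": "on",  # checkbox for the selected parameter
        "format": "rdb",             # tab-separated output
        "site_no": site_no,
        "legacy": "1",
        "period": "",                # left empty when explicit dates are given
        "begin_date": begin_date,
        "end_date": end_date,
    }
    return f"{BASE_URL}?{urlencode(params)}"


url = build_uv_url("02336490", "2008-01-01", "2024-05-29")
print(url)
```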
  
Finally, you can use the `curl` command-line tool to download the response to your computer. We save it as `usgs_data.html`, although with `format=rdb` the content is tab-separated text rather than an HTML page.

In [10]:
## !curl "https://nwis.waterdata.usgs.gov/usa/nwis/uv/?cb_00065=on&format=rdb&site_no=02336490&legacy=1&period=&begin_date=2008-01-01&end_date=2024-05-29" > "usgs_data.html"

In [13]:
# specify the raw html file path
file_path = 'usgs_data.html'

# open the file and print the first 28 lines
# (the with-block closes the file automatically, so no explicit close() is needed)
with open(file_path, 'r') as file:
    for line in file.readlines()[:28]:
        print(line)


# Some of the data that you have obtained from this U.S. Geological Survey database

# may not have received Director's approval. Any such data values are qualified

# as provisional and are subject to revision. Provisional data are released on the

# condition that neither the USGS nor the United States Government may be held liable

# for any damages resulting from its use.

#

# Additional info: https://waterdata.usgs.gov/provisional-data-statement/

#

# Contact:   gs-w_waterdata_support@usgs.gov

# retrieved: 2024-09-04 18:01:16 EDT       (nadww02)

#

# Data for the following 1 site(s) are contained in this file

#    USGS 02336490 CHATTAHOOCHEE RIVER AT GA 280, NEAR ATLANTA, GA

# -----------------------------------------------------------------------------------

#

# Data provided for site 02336490

#            TS   parameter     Description

#         39679       00065     Gage height, feet

#

# Data-value qualification codes included in this output:

#     A  Approved for

In [14]:
# Parse the file and save the data as a CSV file

# load relevant libraries
import datetime
import pandas as pd

# lists to collect dates and heights
dates = []
heights = []

# open the file; the with-block closes it automatically
with open(file_path, 'r') as file:
    # read all lines from the file
    all_lines = file.readlines()

# skip the metadata header; the data records come after the first 26 lines
for line in all_lines[26:]:
    # the data is tab-separated, so split on the tab character '\t'
    fields = line.rstrip('\n').split('\t')

    # keep only records whose last column is 'A' (approved for publication);
    # see the file metadata printed above
    if fields[-1] == 'A':
        # index 2 (third column) holds the timestamp
        # note: the parsed timestamps are naive; unlike the API data, they carry no UTC offset
        dates.append(datetime.datetime.strptime(fields[2], '%Y-%m-%d %H:%M'))
        # index 4 (fifth column) holds the gauge height
        heights.append(float(fields[4]))

# Save the data as dataframe with columns DATE and gauge_height
data_dict = {'DATE': dates, 'gauge_height': heights}
df = pd.DataFrame(data_dict)

# save data to a csv file
df.to_csv(f'{site}_raw_data_manual.csv')
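
As an alternative to splitting lines by hand, pandas can read the tab-separated file directly. The sketch below is an illustration under several assumptions: it uses a short hand-written sample in place of `usgs_data.html`, it assumes the data columns are named after the TS id and parameter code listed in the file header (`39679_00065` and `39679_00065_cd`), and it assumes the row after the column names holds format codes that must be dropped. The readings in the sample are made up.

```python
import io
import pandas as pd

# Hand-written stand-in for the downloaded file (values are illustrative).
sample_rdb = (
    "# Data provided for site 02336490\n"
    "agency_cd\tsite_no\tdatetime\ttz_cd\t39679_00065\t39679_00065_cd\n"
    "5s\t15s\t20d\t6s\t14n\t10s\n"
    "USGS\t02336490\t2008-01-01 00:00\tEST\t1.50\tA\n"
    "USGS\t02336490\t2008-01-01 00:15\tEST\t1.49\tP\n"
)

# comment='#' skips the metadata header; dtype=str keeps the leading zero in site_no.
raw = pd.read_csv(io.StringIO(sample_rdb), sep="\t", comment="#", dtype=str)

# Drop the format row, then keep only approved ('A') records.
records = raw.iloc[1:]
approved = records[records["39679_00065_cd"] == "A"]

parsed = pd.DataFrame(
    {
        "DATE": pd.to_datetime(approved["datetime"], format="%Y-%m-%d %H:%M"),
        "gauge_height": approved["39679_00065"].astype(float),
    }
).reset_index(drop=True)
print(parsed)
```

For the real file, `io.StringIO(sample_rdb)` would be replaced by `file_path`. This avoids hard-coding the number of header lines, at the cost of relying on the assumed column names.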