# USGS dataretrieval Python Package `get_usgs_samples()` Examples

This notebook provides examples of using the Python dataretrieval package to retrieve water quality sample data for United States Geological Survey (USGS) monitoring sites. The dataretrieval package provides a collection of functions to get data from the USGS Samples database and other online sources of hydrology and water quality data, including the United States Environmental Protection Agency (USEPA).

### Install the Package

Use the following code to install the package if it doesn't exist already within your Jupyter Python environment.

In [None]:
!pip install dataretrieval

Load the package so you can use it along with other packages used in this notebook.

In [None]:
from dataretrieval import samples
from IPython.display import display

### Basic Usage

The dataretrieval package has several functions that allow you to retrieve data from different web services. This example uses the `get_usgs_samples()` function to retrieve water quality sample data for USGS monitoring sites from the Samples database. The following arguments are supported:

* **ssl_check** : boolean, optional
        Check the SSL certificate.
* **service** : string
        One of the available Samples services: "results", "locations", "activities",
        "projects", or "organizations". Defaults to "results".
* **profile** : string
        One of the available profiles associated with a service. Options for each
        service are:
        results - "fullphyschem", "basicphyschem",
                    "fullbio", "basicbio", "narrow",
                    "resultdetectionquantitationlimit",
                    "labsampleprep", "count"
        locations - "site", "count"
        activities - "sampact", "actmetric",
                        "actgroup", "count"
        projects - "project", "projectmonitoringlocationweight"
        organizations - "organization", "count"
* **activityMediaName** : string or list of strings, optional
        Name or code indicating environmental medium in which sample was taken.
        Check the `activityMediaName_lookup()` function in this module for all
        possible inputs.
        Example: "Water".
* **activityStartDateLower** : string, optional
        The start date if using a date range. Takes the format YYYY-MM-DD.
        The logic is inclusive, i.e. it will also return results that
        match the date. If left as None, will pull all data on or before
        activityStartDateUpper, if populated.
* **activityStartDateUpper** : string, optional
        The end date if using a date range. Takes the format YYYY-MM-DD.
        The logic is inclusive, i.e. it will also return results that
        match the date. If left as None, will pull all data after
        activityStartDateLower up to the most recent available results.
* **activityTypeCode** : string or list of strings, optional
        Text code that describes type of field activity performed.
        Example: "Sample-Routine, regular".
* **characteristicGroup** : string or list of strings, optional
        Characteristic group is a broad category of characteristics
        describing one or more results. Check the `characteristicGroup_lookup()`
        function in this module for all possible inputs.
        Example: "Organics, PFAS"
* **characteristic** : string or list of strings, optional
        Characteristic is a specific category describing one or more results.
        Check the `characteristic_lookup()` function in this module for all
        possible inputs.
        Example: "Suspended Sediment Discharge"
* **characteristicUserSupplied** : string or list of strings, optional
        A user supplied characteristic name describing one or more results.
* **boundingBox**: list of four floats, optional
        Filters on the associated monitoring location's point location
        by checking if it is located within the specified geographic area.
        The logic is inclusive, i.e. it will include locations that overlap
        with the edge of the bounding box. Values are separated by commas,
        expressed in decimal degrees, NAD83, and longitudes west of Greenwich
        are negative.
        The format is a list consisting of:
        - Western-most longitude
        - Southern-most latitude
        - Eastern-most longitude
        - Northern-most latitude
        Example: [-92.8,44.2,-88.9,46.0]
* **countryFips** : string or list of strings, optional
        Example: "US" (United States)
* **stateFips** : string or list of strings, optional
        Check the `stateFips_lookup()` function in this module for all
        possible inputs.
        Example: "US:15" (United States: Hawaii)
* **countyFips** : string or list of strings, optional
        Check the `countyFips_lookup()` function in this module for all
        possible inputs.
        Example: "US:15:001" (United States: Hawaii, Hawaii County)
* **siteTypeCode** : string or list of strings, optional
        An abbreviation for a certain site type. Check the `siteType_lookup()`
        function in this module for all possible inputs.
        Example: "GW" (Groundwater site)
* **siteTypeName** : string or list of strings, optional
        A full name for a certain site type. Check the `siteType_lookup()`
        function in this module for all possible inputs.
        Example: "Well"
* **usgsPCode** : string or list of strings, optional
        5-digit number used in the US Geological Survey computerized
        data system, National Water Information System (NWIS), to
        uniquely identify a specific constituent. Check the 
        `characteristic_lookup()` function in this module for all possible
        inputs.
        Example: "00060" (Discharge, cubic feet per second)
* **hydrologicUnit** : string or list of strings, optional
        Max 12-digit number used to describe a hydrologic unit.
        Example: "070900020502"
* **monitoringLocationIdentifier** : string or list of strings, optional
        A monitoring location identifier has two parts: the agency code
        and the location number, separated by a dash (-).
        Example: "USGS-040851385"
* **organizationIdentifier** : string or list of strings, optional
        Designator used to uniquely identify a specific organization.
        Currently only accepting the organization "USGS".
* **pointLocationLatitude** : float, optional
        Latitude for a point/radius query (decimal degrees). Must be used
        with pointLocationLongitude and pointLocationWithinMiles.
* **pointLocationLongitude** : float, optional
        Longitude for a point/radius query (decimal degrees). Must be used
        with pointLocationLatitude and pointLocationWithinMiles.
* **pointLocationWithinMiles** : float, optional
        Radius for a point/radius query. Must be used with
        pointLocationLatitude and pointLocationLongitude.
* **projectIdentifier** : string or list of strings, optional
        Designator used to uniquely identify a data collection project. Project
        identifiers are specific to an organization (e.g. USGS).
        Example: "ZH003QW03"
* **recordIdentifierUserSupplied** : string or list of strings, optional
        Internal AQS record identifier that returns 1 entry. Only available
        for the "results" service.
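
As a minimal sketch of how a few of these arguments fit together, the snippet below assembles a dictionary of keyword arguments using parameter names from the list above. The helper names `validate_bounding_box()` and `build_query()` are hypothetical and are not part of dataretrieval. The check that west is less than east is also an assumption, and it would reject a box that crosses the 180° meridian.

```python
from datetime import date


def validate_bounding_box(bbox):
    """Check a [west, south, east, north] box given in decimal degrees.

    Hypothetical helper, not part of dataretrieval.
    """
    if len(bbox) != 4:
        raise ValueError("boundingBox needs four values: west, south, east, north")
    west, south, east, north = (float(v) for v in bbox)
    if not (-180.0 <= west <= 180.0 and -180.0 <= east <= 180.0):
        raise ValueError("longitudes must be between -180 and 180")
    if not (-90.0 <= south <= 90.0 and -90.0 <= north <= 90.0):
        raise ValueError("latitudes must be between -90 and 90")
    # Assumption: the box does not cross the 180-degree meridian.
    if west >= east or south >= north:
        raise ValueError("expected west < east and south < north")
    return [west, south, east, north]


def build_query(start=None, end=None, bbox=None, **filters):
    """Assemble keyword arguments named after the parameters listed above."""
    query = dict(filters)
    if start is not None:
        query["activityStartDateLower"] = start.isoformat()  # YYYY-MM-DD
    if end is not None:
        query["activityStartDateUpper"] = end.isoformat()  # YYYY-MM-DD
    if bbox is not None:
        query["boundingBox"] = validate_bounding_box(bbox)
    return query


query = build_query(
    start=date(2012, 1, 1),
    bbox=[-92.8, 44.2, -88.9, 46.0],
    characteristicGroup="Organics, PFAS",
)
print(query)
```

A dictionary built this way could then be unpacked into the call, for example `samples.get_usgs_samples(**query)`.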

#### Example 1: Get all water quality sample data for a single monitoring site

In [None]:
siteID = 'USGS-10109000'
wq_data = samples.get_usgs_samples(monitoringLocationIdentifier=siteID)
print('Retrieved data for ' + str(len(wq_data[0])) + ' samples.')

### Interpreting the Result

The result of calling the `get_usgs_samples()` function is a tuple that contains a Pandas data frame and an associated metadata object. The data frame contains the water quality sample data for the requested sites, observed variables, and time frame.

Once you have the data frame, there are several useful things you can do to explore the data.

Display the data frame as a table. The default data frame for this function is a long, flat table, with a row for each observed variable at a given site and date/time.

In [None]:
display(wq_data[0])

Show the data types of the columns in the resulting data frame.

In [None]:
print(wq_data[0].dtypes)
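
If a column you want to analyze arrives as text, it can be converted after inspecting the data types. The sketch below uses a small made-up data frame that reuses two column names appearing elsewhere in this notebook (`Activity_StartDate` and `Result_Measure`). Whether the real columns need conversion is an assumption, so check the `dtypes` output first.

```python
import pandas as pd

# Made-up rows for illustration only; these are not real USGS results.
sample = pd.DataFrame({
    "Activity_StartDate": ["2020-05-01", "2020-06-01", "2020-07-01"],
    "Result_Measure": ["7.9", "8.1", None],
})
print(sample.dtypes)

# Parse the YYYY-MM-DD strings into datetime values.
sample["Activity_StartDate"] = pd.to_datetime(
    sample["Activity_StartDate"], format="%Y-%m-%d"
)
# Convert measurements to numbers; unparseable or missing values become NaN.
sample["Result_Measure"] = pd.to_numeric(sample["Result_Measure"], errors="coerce")
print(sample.dtypes)
```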

The other part of the result returned from the `get_usgs_samples()` function is a metadata object that contains information about the query that was executed to return the data. For example, you can access the URL that was assembled to retrieve the requested data from the USGS web service. The USGS web service responses contain a descriptive header that defines the contents of the response and can help in interpreting them.

In [None]:
print('The query URL used to retrieve the data from USGS Samples was: ' + wq_data[1].url)
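
To see which filters ended up in a query URL, the query string can be split into its parameters with the standard library. The URL below is a placeholder written for illustration. It is not the real Samples endpoint, and the exact way the service encodes lists is an assumption. In practice you would pass `wq_data[1].url` instead.

```python
from urllib.parse import parse_qs, urlsplit

# Placeholder URL for illustration only; substitute wq_data[1].url in practice.
example_url = (
    "https://example.invalid/results"
    "?monitoringLocationIdentifier=USGS-04024430"
    "&monitoringLocationIdentifier=USGS-04024000"
    "&usgsPCode=00065"
)

parts = urlsplit(example_url)
params = parse_qs(parts.query)  # maps each parameter name to a list of values

print("Path:", parts.path)
for name, values in sorted(params.items()):
    print(f"{name}: {', '.join(values)}")
```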

### Additional Examples

#### Example 2: Get water quality sample data for multiple sites for a single parameter

In [None]:
site_ids = ['USGS-04024430', 'USGS-04024000']
parameter_code = '00065'
wq_multi_site = samples.get_usgs_samples(monitoringLocationIdentifier=site_ids, usgsPCode=parameter_code)
print('Retrieved data for ' + str(len(wq_multi_site[0])) + ' samples.')
display(wq_multi_site[0])

#### Example 3: Retrieve water quality sample data for multiple sites, including a list of parameters, within a time period defined by start date until present

In [None]:
site_ids = ['USGS-04024430', 'USGS-04024000']
parameterCd = ['34247', '30234', '32104', '34220']
startDate = '2012-01-01'
wq_data2 = samples.get_usgs_samples(monitoringLocationIdentifier=site_ids, usgsPCode=parameterCd,
                           activityStartDateLower=startDate)
print('Retrieved data for ' + str(len(wq_data2[0])) + ' samples.')
display(wq_data2[0])


#### Example 4: Retrieve water quality sample data for one site and convert to a wide format

Note that the USGS Samples database returns multiple parameters in a "long" format: each row in the resulting table represents a single observation of a single parameter. Furthermore, every observation has 181 fields of metadata. If you want to place your water quality data into a "wide" format, where each column represents one water quality characteristic and its unit, the code below shows one solution.

In [None]:
siteID = 'USGS-10109000'
wq_data,_ = samples.get_usgs_samples(monitoringLocationIdentifier=siteID)
print('Retrieved data for ' + str(len(wq_data)) + ' sample results.')

wq_data["characteristic_unit"] = wq_data["Result_Characteristic"] + ", " + wq_data["Result_MeasureUnit"]
wq_data_wide = wq_data.pivot_table(index=['Location_Identifier', 'Activity_StartDate', 'Activity_StartTime'], columns="characteristic_unit", values="Result_Measure", aggfunc='first')
display(wq_data_wide)
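
Because the pivot uses `aggfunc='first'`, only the first value is kept when several results share the same site, date, time, and characteristic. The sketch below reproduces the pivot on a few made-up rows (not real USGS results) and counts such duplicates beforehand, so you can judge whether keeping the first value is acceptable for your data.

```python
import pandas as pd

# Made-up rows for illustration only; the column names match the pivot above.
long_df = pd.DataFrame({
    "Location_Identifier": ["USGS-10109000"] * 4,
    "Activity_StartDate": ["2020-05-01", "2020-05-01", "2020-06-01", "2020-06-01"],
    "Activity_StartTime": ["10:00:00"] * 4,
    "Result_Characteristic": ["pH", "Temperature, water", "pH", "pH"],
    "Result_MeasureUnit": ["std units", "deg C", "std units", "std units"],
    "Result_Measure": [7.9, 11.5, 8.1, 8.3],
})
long_df["characteristic_unit"] = (
    long_df["Result_Characteristic"] + ", " + long_df["Result_MeasureUnit"]
)

keys = ["Location_Identifier", "Activity_StartDate", "Activity_StartTime"]

# Count rows that compete for the same cell of the wide table.
dupes = long_df.duplicated(subset=keys + ["characteristic_unit"], keep=False)
n_dupes = int(dupes.sum())
print(f"{n_dupes} rows share a site, date, time, and characteristic.")

# Same pivot as above: the first value wins where duplicates exist,
# and combinations with no result are left as NaN.
wide_df = long_df.pivot_table(
    index=keys, columns="characteristic_unit", values="Result_Measure", aggfunc="first"
)
print(wide_df)
```

If the duplicate count is not zero, another aggregation such as `aggfunc='mean'` may suit your analysis better than keeping the first value.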
