# <u>Background</u>: Impacts of Falls Lake on Stream Flow
These exercises build on the Excel Water Flow examples covered in the last data analytics session. As a reminder, in those exercises we wanted to answer the general question: **How did construction of Falls Lake reservoir affect downstream flow?** In particular, we refined our question to address the following: 
* Do plots of monthly average streamflow look different for the period before Falls Lake (1930-1980) than for the period after (1984-2017)?

* What are the streamflows associated with 100, 500, and 1000 year floods? Have they changed since the reservoir was constructed?

* Has Falls Lake succeeded in minimizing the frequency of low flow events? 

## The analytical workflow
To answer these questions, we walk through a series of steps - *the analytical workflow* - which includes:
* locating the data
* getting the data into our analytical environment
* tidying the data for efficient analysis
* executing our analysis
* visualizing output
* evaluating our results and repeating as necessary

Here in this notebook, we begin this process using Python, demonstrating how select Python packages (also called "modules" or "libraries") help us in downloading and formatting our data. 

---
# Importing data
This notebook demonstrates how Python, using the `requests`, `io`, and `pandas` libraries, can generate a working dataset from on-line resources. Specifically, this notebook forms a request to a web service, here the NWIS API, and handles the response in such a way as to form a tidy data frame, which is saved as a comma separated value (CSV) formatted file on the local disk. 

### Water flow data for the Neuse River near Clayton gage site. 
The data we want, namely mean daily stream flow for a site downstream of Falls Lake, resides on the National Water Information System's (NWIS) servers. Recall the procedure we used in the Excel exercise to locate and query this server to reveal data for the Neuse River near Clayton gage site. The result was presented as a web page, and the URL of that web page (provided below) actually contained all the instructions necessary to generate the data we wanted. In other words, we can manipulate this address and easily get data for a different site or for a different time range. 

All this is by design (something called a *REST API* or *web service*, a topic beyond the scope of today), and it allows us to leverage Python tools to programmatically pull data into our analysis environment. Examination of the service documentation reveals we can pull a lot more than daily stream flow, and we'll revisit that idea later...

Service Documentation:
https://waterservices.usgs.gov/rest/DV-Test-Tool.html

Example request in URL format:<br>
http://waterservices.usgs.gov/nwis/dv/?format=rdb&sites=02089000&startDT=2010-10-01&endDT=2017-09-30&statCd=00003&parameterCd=00060&siteStatus=all
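To see how the example URL carries its instructions, here is a minimal sketch that splits it into a base address and its query parameters using Python's standard `urllib.parse` module. The variable names (`example_url`, `base`, `query`) are chosen for illustration only and are not used elsewhere in this notebook.

```python
from urllib.parse import urlsplit, parse_qs

# The example request shown above
example_url = ('http://waterservices.usgs.gov/nwis/dv/?format=rdb&sites=02089000'
               '&startDT=2010-10-01&endDT=2017-09-30&statCd=00003'
               '&parameterCd=00060&siteStatus=all')

# Split the address into its components
parts = urlsplit(example_url)

# Everything before the '?' identifies the service itself
base = f"{parts.scheme}://{parts.netloc}{parts.path}"

# Everything after the '?' is a set of key=value pairs; parse_qs returns
# a list of values per key, so keep the single value for each
query = {key: values[0] for key, values in parse_qs(parts.query).items()}

print(base)
for key, value in query.items():
    print(f"{key:12} = {value}")
```

Changing the value attached to `sites`, `startDT`, or `endDT` is all it takes to ask the same service for a different gage or time range.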

#### Import the Python libraries required to run the script
Like *R*, Python has a huge developer community, and these developers are constantly creating new libraries that perform specific tasks. Here we load the ones we need into our current scripting environment.<br><br>

<font color=#767676 size="2">*Note: While many 3rd party libraries may be included in your default Python installation, some are not and need to be **installed** prior to importing them in a script. We examined how the `requests` library was manually installed in the setup for this exercise.*</font>

In [2]:
#Import libraries
import requests
import io
import pandas as pd

#### Assemble the parameters that will be used in the data request
To make our script more dynamic, we store key values such as the site number and the specific parameters we want to fetch as variables. That way we can tweak these values and easily generate results for a different site or parameter. 

In [3]:
#Set site, parameter, and stat codes
siteNo = '02087500' #Neuse R. Near Clayton 
pcode = '00060'     #Discharge (cfs)
scode = '00003'     #Daily mean

In [4]:
#Set start and end dates
startDate = '1930-10-01'
endDate = '2017-09-03'

#### Format the request, then send it and store the response as a variable
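Here we construct the URL used to retrieve the data. We could store it as a single string, but instead we keep the base address and the query parameters separate (the latter in a dictionary) and let `requests` assemble the full URL for us.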

In [5]:
#Construct the service URL and parameters
#https://waterdata.usgs.gov/nwis/dv?
url =  'https://waterservices.usgs.gov/nwis/dv'
params = {'sites':siteNo,
          'parameterCd':pcode,
          'statCd':scode,
          'startDT':startDate,
          'endDT':endDate,
          'format':'rdb',
          'siteStatus':'all'
         }

In [6]:
#Send the request and decode the response body as text
response_raw = requests.get(url, params=params)
response_raw.raise_for_status()  #Stop here if the server returned an error status
response_clean = response_raw.content.decode('utf-8')

In [7]:
response_raw.url

'https://waterservices.usgs.gov/nwis/dv/?endDT=2017-09-03&sites=02087500&format=rdb&statCd=00003&parameterCd=00060&startDT=1930-10-01&siteStatus=all'
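The full address shown above is the base URL with the dictionary entries appended as `key=value` pairs. As a rough sketch of that assembly step (an approximation using the standard library's `urlencode`, not the internals of `requests`), the same pieces can be joined by hand. The name `full_url` is introduced here for illustration.

```python
from urllib.parse import urlencode

# Same base address and parameters as in the cells above
url = 'https://waterservices.usgs.gov/nwis/dv'
params = {'sites': '02087500',
          'parameterCd': '00060',
          'statCd': '00003',
          'startDT': '1930-10-01',
          'endDT': '2017-09-03',
          'format': 'rdb',
          'siteStatus': 'all'}

# urlencode turns the dictionary into 'key=value' pairs joined by '&'
query_string = urlencode(params)

# The '?' separates the service address from the query
full_url = url + '?' + query_string
print(full_url)
```

The order of the parameters in the address does not matter to the service, which is why the URL reported by `response_raw.url` can list them in a different order than the dictionary.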

#### Clean up the response and read it into a data frame

In [6]:
#Create a list of metadata rows to skip: rows 1-28 (zero-based indices 0-27)
rowsToSkip = list(range(28))
#Append index 29 (row 30, the data spec line just below the column names)
rowsToSkip.append(29)
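The hard-coded skip list above works only as long as the metadata header is exactly 28 lines long. As a hedged alternative, the sketch below works out the rows to skip from the text itself, assuming (as in the RDB format used here) that metadata lines start with `#` and that a single data spec line follows the column names. The `sample_rdb` text is a made-up, shortened stand-in for a real response, and `header_index` and `rows_to_skip` are illustrative names.

```python
# A tiny, made-up stand-in for an RDB response (the real header is much longer)
sample_rdb = (
    "# metadata line 1\n"
    "# metadata line 2\n"
    "agency_cd\tsite_no\tdatetime\n"
    "5s\t15s\t20d\n"
    "USGS\t02087500\t1930-10-01\n"
)

lines = sample_rdb.splitlines()

# Index of the first line that is not a '#' comment: the column name row
header_index = next(i for i, line in enumerate(lines)
                    if not line.startswith('#'))

# Skip every comment line, plus the data spec line right after the column names
rows_to_skip = list(range(header_index)) + [header_index + 1]
print(rows_to_skip)  # [0, 1, 3]
```

A list built this way could be passed to `skiprows` in place of `rowsToSkip`, so the import would keep working if the length of the header changes.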

In [7]:
#Convert the data into a data frame
df = pd.read_csv(io.StringIO(response_clean),
                 skiprows=rowsToSkip,     #Skip metadata and data spec lines
                 delimiter='\t',          #Set to tab delimited
                 dtype={'site_no':'str'}) #Set site_no to a string datatype

#### Examine the resulting data frame

In [8]:
#Display the first 5 rows
df.head()

| | agency_cd | site_no | datetime | 85235_00060_00003 | 85235_00060_00003_cd |
|---|---|---|---|---|---|
| 0 | USGS | 2087500 | 1930-10-01 | 347.0 | A |
| 1 | USGS | 2087500 | 1930-10-02 | 173.0 | A |
| 2 | USGS | 2087500 | 1930-10-03 | 132.0 | A |
| 3 | USGS | 2087500 | 1930-10-04 | 125.0 | A |
| 4 | USGS | 2087500 | 1930-10-05 | 125.0 | A |


In [9]:
#Rename the last two fields
df.rename(columns={'85235_00060_00003':'MeanFlow_cfs','85235_00060_00003_cd':'Confidence'},inplace=True)
df.head()

| | agency_cd | site_no | datetime | MeanFlow_cfs | Confidence |
|---|---|---|---|---|---|
| 0 | USGS | 2087500 | 1930-10-01 | 347.0 | A |
| 1 | USGS | 2087500 | 1930-10-02 | 173.0 | A |
| 2 | USGS | 2087500 | 1930-10-03 | 132.0 | A |
| 3 | USGS | 2087500 | 1930-10-04 | 125.0 | A |
| 4 | USGS | 2087500 | 1930-10-05 | 125.0 | A |


In [10]:
df.dtypes

agency_cd        object
site_no          object
datetime         object
MeanFlow_cfs    float64
Confidence       object
dtype: object
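The listing above shows the `datetime` column stored as `object`, meaning pandas is treating the dates as plain text. Grouping flows by month or year is easier once the column holds real dates. The sketch below shows one possible conversion on a three-row sample copied from the output above. The `Year` and `Month` column names are illustrative and not part of the original workflow.

```python
import pandas as pd

# A small sample with the same column names as the data frame above
sample = pd.DataFrame({'datetime': ['1930-10-01', '1930-10-02', '1930-10-03'],
                       'MeanFlow_cfs': [347.0, 173.0, 132.0]})

# Convert the text dates into pandas datetime values
sample['datetime'] = pd.to_datetime(sample['datetime'], format='%Y-%m-%d')

# With real dates, calendar parts are available through the .dt accessor
sample['Year'] = sample['datetime'].dt.year
sample['Month'] = sample['datetime'].dt.month

print(sample.dtypes)
```

The same two steps could be applied to `df` before saving, although dates written to a CSV file are stored as text again and need converting when the file is read back.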

#### Save the dataframe to a csv file

In [11]:
#Save to a csv file
df.to_csv('GageData.csv',index=False)
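One thing to keep in mind when reading the saved file back: a CSV file does not record data types, so pandas will guess them again. The sketch below shows, on a one-row sample held in memory rather than in `GageData.csv`, how the site number loses its leading zero unless `site_no` is again declared as a string. The names `csv_text`, `guessed`, and `as_text` are illustrative.

```python
import io
import pandas as pd

# One row in the same layout as the saved file
csv_text = ("agency_cd,site_no,datetime,MeanFlow_cfs,Confidence\n"
            "USGS,02087500,1930-10-01,347.0,A\n")

# Letting pandas guess: site_no is read as an integer and the leading zero is lost
guessed = pd.read_csv(io.StringIO(csv_text))

# Declaring site_no as a string keeps the code exactly as written
as_text = pd.read_csv(io.StringIO(csv_text), dtype={'site_no': str})

print(guessed['site_no'][0])   # 2087500
print(as_text['site_no'][0])   # 02087500
```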