# A Baseline Workflow
![Our Training Data Workflow](images/DataWorkflowTraining.png "Our Training Data Workflow")

In [2]:
import dataretrieval.nwis as nwis
# https://github.com/DOI-USGS/dataretrieval-python

# Acquire / Filter
df = nwis.get_record(sites='04294000', service='iv', start='2022-06-01', end='2022-11-01', parameterCD='00060')
df

| datetime | site_no | 00060 | 00060_cd |
|---|---|---|---|
| 2022-06-01 04:00:00+00:00 | 04294000 | 2240.0 | A |
| 2022-06-01 04:15:00+00:00 | 04294000 | 2210.0 | A |
| 2022-06-01 04:30:00+00:00 | 04294000 | 2210.0 | A |
| 2022-06-01 04:45:00+00:00 | 04294000 | 2190.0 | A |
| 2022-06-01 05:00:00+00:00 | 04294000 | 2190.0 | A |
| ... | ... | ... | ... |
| 2022-11-02 02:45:00+00:00 | 04294000 | 914.0 | A |
| 2022-11-02 03:00:00+00:00 | 04294000 | 860.0 | A |
| 2022-11-02 03:15:00+00:00 | 04294000 | 780.0 | A |
| 2022-11-02 03:30:00+00:00 | 04294000 | 718.0 | A |


## Quick Sidebar: pandas

If we look at the type of variable that df is, we'll see it's a pandas DataFrame.


In [3]:
type(df)

pandas.core.frame.DataFrame

Show of hands, who's used pandas and feels like they have a good grasp of it?

pandas is one of the most widely used libraries in the Python ecosystem. Its main draw is its core data structure, the DataFrame. If you're familiar with R, the pandas DataFrame is very similar to R's data frame. You can think of a DataFrame as a table, with columns of variables and rows of data records. Every DataFrame has an index, and choosing an index that suits your data (for example, timestamps for a time series) lets you use the pandas library of functions most effectively. That library of functions (its API, or Application Programming Interface) is massive, allowing you to filter and manipulate your data in just a few lines of code. More info here: [https://pandas.pydata.org/docs/index.html](https://pandas.pydata.org/docs/index.html)
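
As a minimal sketch of the ideas above, the following builds a small DataFrame by hand, sets a datetime index, and then filters rows in a line or two. The values and the `discharge` column name are invented for illustration; they are not real gauge readings.

```python
import pandas as pd

# Hypothetical readings, invented for illustration only
records = pd.DataFrame(
    {
        "datetime": pd.to_datetime(
            [
                "2022-06-01 04:00",
                "2022-06-01 04:15",
                "2022-06-02 04:00",
                "2022-06-02 04:15",
            ],
            utc=True,
        ),
        "discharge": [2240.0, 2210.0, 1900.0, 1850.0],
    }
)

# Using the timestamps as the index enables date-based selection
records = records.set_index("datetime")

# Select every row from a single day using a partial date string
june_first = records.loc["2022-06-01"]

# Boolean filtering: keep only rows where discharge exceeds 2000
high_flow = records[records["discharge"] > 2000]

print(june_first)
print(high_flow)
```

Because the index holds timestamps, selecting a whole day needs only a date string rather than an explicit comparison against start and end times.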

## Another Quick Sidebar: NumPy

Another quiz: Who knows what NumPy is?

NumPy is another widely used Python library. Whereas pandas operates on the rows and columns of a table, NumPy defines the data types of the values in the table, such as integers and floats. NumPy data types are the default data types in pandas DataFrames (although there is a long-term plan to shift to pyarrow, but that is yet another digression). Because pandas is built on NumPy in this way, you can easily retrieve columns and DataFrames as NumPy arrays and operate on them with the highly optimized NumPy mathematical operators. More info here: [https://numpy.org/](https://numpy.org/)
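
To make the pandas and NumPy relationship concrete, here is a small sketch that pulls a column out as a NumPy array and applies vectorized arithmetic to it. The values and the `discharge_cfs` column name are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical discharge values in cubic feet per second, invented for illustration
table = pd.DataFrame({"discharge_cfs": [2240.0, 2210.0, 2190.0]})

# The column's dtype is a NumPy data type
print(table["discharge_cfs"].dtype)

# Retrieve the column as a NumPy array
values = table["discharge_cfs"].to_numpy()
print(type(values))

# Vectorized arithmetic: convert to cubic meters per second
# (1 cubic foot is approximately 0.0283168 cubic meters)
values_cms = values * 0.0283168
print(values_cms)

# NumPy functions operate on the whole array at once
print(np.mean(values))
```

The multiplication runs over every element without an explicit Python loop, which is where much of NumPy's speed comes from.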

## Back to our Workflow...

In [4]:
# Manipulate
daily = df['00060'].resample('1D').mean()
daily

datetime
2022-06-01 00:00:00+00:00    1628.875000
2022-06-02 00:00:00+00:00    2029.770833
2022-06-03 00:00:00+00:00    2553.333333
2022-06-04 00:00:00+00:00    2503.229167
2022-06-05 00:00:00+00:00    1667.802083
                                ...     
2022-10-29 00:00:00+00:00     898.187500
2022-10-30 00:00:00+00:00     796.343750
2022-10-31 00:00:00+00:00     762.520833
2022-11-01 00:00:00+00:00     650.510417
2022-11-02 00:00:00+00:00    1238.375000
Freq: D, Name: 00060, Length: 155, dtype: float64
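
To see what `resample('1D').mean()` is doing, here is a sketch on synthetic data (not real gauge readings): two days of 15-minute values with a known average for each day. The series name `flow` and the values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Two days of synthetic 15-minute readings (96 per day), not real gauge data
index = pd.date_range("2022-06-01", periods=192, freq="15min", tz="UTC")
values = np.concatenate([np.full(96, 100.0), np.full(96, 200.0)])
series = pd.Series(values, index=index, name="flow")

# Group readings into calendar-day bins (in the index's time zone)
# and average the readings inside each bin
daily_mean = series.resample("1D").mean()

# Count the readings per bin; a partial day would show fewer than 96
daily_count = series.resample("1D").count()

print(daily_mean)
print(daily_count)
```

Judging from the rows displayed earlier, the real data begins at 04:00 UTC on 2022-06-01 and ends at 03:30 UTC on 2022-11-02, so the first and last daily bins average only part of a day. Checking the per-bin count as above is one way to spot such partial days.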

In [5]:
# Visualize
daily.describe()

count     155.000000
mean      989.816458
std      1278.847398
min       110.748958
25%       306.796875
50%       549.020833
75%      1064.348958
max      9091.354167
Name: 00060, dtype: float64

Now, putting it all together...

![Our Training Data Workflow](images/DataWorkflowTraining.png "Our Training Data Workflow")

In [6]:
import dataretrieval.nwis as nwis

# Acquire / Filter
df = nwis.get_record(sites='04294000', service='iv', start='2022-06-01', end='2022-11-01', parameterCD='00060')

# Manipulate
daily = df['00060'].resample('1D').mean()

# Visualize
daily.describe()

count     155.000000
mean      989.816458
std      1278.847398
min       110.748958
25%       306.796875
50%       549.020833
75%      1064.348958
max      9091.354167
Name: 00060, dtype: float64