## LESSON OBJECTIVES
1. Set up a data analysis session in a Jupyter notebook
2. Import and explore datasets in Python with pandas
3. Apply data exploration skills to a real-world example dataset

## OPENING DISCUSSION: WHY DO WE EXPLORE OUR DATA?

Why is data exploration our first step in analyzing a dataset? What information do we gain? How does data exploration aid in our decision-making for data analysis steps further down the pipeline?

## IMPORT DATA AND VIEW SUMMARIES

In [74]:
#Import the pandas package and Python's datetime tools
import pandas as pd
from datetime import datetime, timedelta

In [2]:
#Read the USGS flow data into a Pandas dataframe
df_USGS = pd.read_csv('./data/Raw/USGS_Site02085000_Flow_Raw.csv')

In [3]:
#Rename columns -- R: colnames(df_USGS) <- c(...)
df_USGS.columns = ("agency_cd", "site_no", "datetime", 
                   "discharge_max", "discharge_max_approval", 
                   "discharge_min", "discharge_min_approval", 
                   "discharge_mean", "discharge_mean_approval", 
                   "gage_height_max", "gage_height_max_approval", 
                   "gage_height_min", "gage_height_min_approval", 
                   "gage_height_mean", "gage_height_mean_approval")

In [4]:
#Examine the data
df_USGS.head()

| | agency_cd | site_no | datetime | discharge_max | discharge_max_approval | discharge_min | discharge_min_approval | discharge_mean | discharge_mean_approval | gage_height_max | gage_height_max_approval | gage_height_min | gage_height_min_approval | gage_height_mean | gage_height_mean_approval |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | USGS | 2085000 | 1/1/28 | 74.0 | A | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | USGS | 2085000 | 1/2/28 | 61.0 | A | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | USGS | 2085000 | 1/3/28 | 56.0 | A | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | USGS | 2085000 | 1/4/28 | 54.0 | A | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | USGS | 2085000 | 1/5/28 | 48.0 | A | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
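Beyond `head()`, a few other calls give quick summaries of a dataframe. The sketch below uses a small hypothetical dataframe (`df_demo`, built from the first few discharge values shown above) rather than the full USGS file, so it can be run on its own; the same calls should work on `df_USGS`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the flow table: early records hold only discharge_max
df_demo = pd.DataFrame({
    "datetime": ["1/1/28", "1/2/28", "1/3/28", "1/4/28"],
    "discharge_max": [74.0, 61.0, 56.0, 54.0],
    "discharge_mean": [np.nan, np.nan, np.nan, np.nan],
})

# Dimensions of the table (rows, columns), similar to dim() in R
n_rows, n_cols = df_demo.shape

# Number of missing values in each column
missing = df_demo.isna().sum()

# Summary statistics for one numeric column, similar to summary() in R
stats = df_demo["discharge_max"].describe()

print(n_rows, n_cols)
print(missing)
print(stats)
```

Counting missing values early is useful here because, as the first rows suggest, many columns are empty for the oldest records.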


## ADJUSTING DATASETS

### Formatting dates
As in R, dates in imported CSV files aren't recognized automatically, because dates can be written in many formats. Inspecting the `datetime` column of our dataframe shows it was imported with the "object" dtype (`O`), the generic dtype pandas uses for Python objects, which here means raw strings.

In [5]:
#Reveal the data type of the 'datetime' column
df_USGS['datetime'].dtype

dtype('O')
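One way to confirm that the values really are strings, and to see why that matters, is sketched below. The series `dates_raw` is a hypothetical stand-in for the `datetime` column; depending on the pandas version, the dtype of such a column may be reported differently, so the check looks at an individual element instead.

```python
import pandas as pd

# Hypothetical stand-in for the 'datetime' column as read from the CSV
dates_raw = pd.Series(["1/10/28", "1/2/28", "1/1/28"])

# Each element is a plain Python string, not a date
first_value = dates_raw[0]
is_string = isinstance(first_value, str)

# Strings sort character by character, so "1/10/28" lands before "1/2/28"
# even though January 2 comes before January 10
sorted_as_text = sorted(dates_raw)

print(type(first_value), is_string)
print(sorted_as_text)
```

Until the column is converted to a true date type, sorting, filtering by date range, and date arithmetic will not behave chronologically.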

Dates are tricky. Python's `datetime` module helps, but it too can be confusing. First, we can use it to get today's date with its `today` (or `now`) method, which returns a peculiar object:

In [26]:
today = datetime.today()
type(today)

datetime.datetime

We can print that object, and the result is intuitive (and very precise!):

In [27]:
print(today)

2019-01-10 16:40:54.435637


And we can use the `strftime` method to convert our datetime object into strings in various formats. The website http://strftime.org/ offers a nice overview of the format codes.

In [29]:
#Format the 'today' datetime object as a string
today.strftime('%Y-%m-%d')

'2019-01-10'

See if you can get it to print today's date as "Thursday, January 10, 2019"

In [39]:
today.strftime('%A, %B %d, %Y')

'Thursday, January 10, 2019'
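To see how individual format codes behave, here is a small sketch that uses a fixed date (the variable `fixed`, chosen to match the timestamp printed earlier) instead of `today`, so the results are repeatable. It also shows `strptime`, the reverse operation that parses a string into a datetime. The weekday and month names produced by `%A` and `%B` depend on the locale; the comments assume the default English names.

```python
from datetime import datetime

# A fixed datetime so the output does not change from day to day
fixed = datetime(2019, 1, 10, 16, 40, 54)

# %Y = four-digit year, %m = month number, %d = day of month
iso_date = fixed.strftime('%Y-%m-%d')

# %y = two-digit year, as used in the USGS file
us_short = fixed.strftime('%m/%d/%y')

# %A = weekday name, %B = month name (locale dependent)
long_form = fixed.strftime('%A, %B %d, %Y')

# strptime goes the other way: string + format -> datetime
parsed = datetime.strptime(us_short, '%m/%d/%y')

print(iso_date)
print(us_short)
print(long_form)
print(parsed)
```

Note that `%m` is the month and `%d` is the day; the two are easy to mix up when both happen to be small numbers.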

Back to our task at hand:

In [48]:
#Convert our datetime string values to datetime values
newDate = pd.to_datetime(df_USGS['datetime'],format='%m/%d/%y')
newDate.dtype

dtype('<M8[ns]')
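Once a column holds true datetime values, date-aware operations become available. The sketch below converts a small hypothetical series (`dates_raw`, using dates from the end of the record) and then pulls out the year and computes a time span; the names are illustrative and not part of the lesson's dataframe.

```python
import pandas as pd

# Hypothetical sample of date strings in the same month/day/two-digit-year layout
dates_raw = pd.Series(["11/28/18", "11/29/18", "11/30/18"])

# Parse the strings using an explicit format
dates_parsed = pd.to_datetime(dates_raw, format='%m/%d/%y')

# The .dt accessor exposes date components for the whole series
years = dates_parsed.dt.year

# Subtracting two timestamps gives a Timedelta; .days is its length in days
span_days = (dates_parsed.max() - dates_parsed.min()).days

print(dates_parsed.dtype)
print(years.tolist())
print(span_days)
```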

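Note that every two-digit year from 00 to 68 has been assigned to the 2000s rather than the 1900s, so dates before 1969 now appear to lie in the future. In R this would be fixed with an `ifelse` statement inside a function; in pandas, we can select the affected dates with a Boolean mask and shift them back 100 years. Run through the code below and write what is happening in a comment above each line.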

In [63]:
newDate[0].strftime('%y%m%d')

'280101'

In [67]:
x = newDate[0] 
#if x > today:

In [72]:
x.year-100

1928

In [77]:
x - timedelta(days=36525)

Timestamp('1928-01-01 00:00:00')

In [79]:
newDate[newDate > today] = newDate - timedelta(days=36525)
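The same correction can be tried on a tiny example. The sketch below first shows where the two-digit-year cutoff falls, then repairs a small hypothetical series. It makes two substitutions, both assumptions rather than the lesson's method: a fixed `cutoff` timestamp stands in for `today`, and `pd.DateOffset(years=100)` stands in for `timedelta(days=36525)`, so the shift is expressed in calendar years rather than a day count.

```python
from datetime import datetime

import pandas as pd

# The %y code maps 00-68 to 2000-2068 and 69-99 to 1969-1999
y68 = datetime.strptime('68', '%y').year
y69 = datetime.strptime('69', '%y').year

# Hypothetical sample spanning the cutoff, plus one recent record
dates_raw = pd.Series(["1/1/28", "12/31/68", "1/1/69", "11/30/18"])
dates = pd.to_datetime(dates_raw, format='%m/%d/%y')

# Assumed cutoff: any date later than this cannot be a real observation
cutoff = pd.Timestamp("2019-01-10")
in_future = dates > cutoff

# Keep dates that are not in the future; shift the rest back 100 calendar years
dates_fixed = dates.where(~in_future, dates - pd.DateOffset(years=100))

print(y68, y69)
print(dates_fixed)
```

This approach relies on the record containing no genuine dates after the cutoff and no dates before 1919, since a two-digit year cannot distinguish between centuries on its own.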

In [80]:
newDate

0       1928-01-01
1       1928-01-02
2       1928-01-03
3       1928-01-04
4       1928-01-05
5       1928-01-06
6       1928-01-07
7       1928-01-08
8       1928-01-09
9       1928-01-10
10      1928-01-11
11      1928-01-12
12      1928-01-13
13      1928-01-14
14      1928-01-15
15      1928-01-16
16      1928-01-17
17      1928-01-18
18      1928-01-19
19      1928-01-20
20      1928-01-21
21      1928-01-22
22      1928-01-23
23      1928-01-24
24      1928-01-25
25      1928-01-26
26      1928-01-27
27      1928-01-28
28      1928-01-29
29      1928-01-30
           ...    
33186   2018-11-10
33187   2018-11-11
33188   2018-11-12
33189   2018-11-13
33190   2018-11-14
33191   2018-11-15
33192   2018-11-16
33193   2018-11-17
33194   2018-11-18
33195   2018-11-19
33196   2018-11-20
33197   2018-11-21
33198   2018-11-22
33199   2018-11-23
33200   2018-11-24
33201   2018-11-25
33202   2018-11-26
33203   2018-11-27
33204   2018-11-28
33205   2018-11-29
33206   2018-11-30
33207   2018