# Exploratory Data Analysis for the Physical Properties of Lakes

This lesson was adapted from educational material written by [Dr. Kateri Salk](https://www.hydroshare.org/user/4912/) for her Fall 2019 Hydrologic Data Analysis course at Duke University. This is the first part of a two-part exercise focusing on the physical properties of lakes. 

## Introduction

Lakes are dynamic, nonuniform bodies of water in which the physical, biological, and chemical properties interact. Lakes also contain the majority of Earth's liquid surface fresh water. This lesson introduces exploratory data analysis in the context of the physical properties of lakes. 

## Learning Objectives

After successfully completing this exercise, you will be able to:

1. Apply exploratory data analysis skills to applied questions about the physical properties of lakes


## Requirements to Complete Lesson 

### Packages
This lesson requires the installation of the following R packages to run the provided script:
- `tidyverse` (version 1.3.0): a collection of R packages designed for data science.
- `lubridate` (version 1.7.9): functions for working with dates and times.


### Data and Code

The input files are made available to you in a public folder. 

This lesson also requires the dataset <span style="color:blue">NTL-LTER_Lake_ChemistryPhysics_Raw.csv</span>.<br> 

The dataset used in this lesson contains data from studies on several lakes in the North Temperate Lakes District in the state of Wisconsin. The data were collected as part of the Long Term Ecological Research station created by the National Science Foundation.  More information can be found here: https://lter.limnology.wisc.edu/about/overview. 

The code provided in this resource was developed using R version 3.6.1. 

### Set Working Directory



In R, the working directory is the directory where R starts when looking for any file to open (as directed by a file path) and where it saves any output. This lesson assumes that you have set your working directory to the folder location of the downloaded and unzipped data subsets. 
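
The working directory can be checked and changed from the R console. The following is a minimal sketch; the folder path `path/to/lake_data` is a placeholder and not a location defined by this lesson, so replace it with wherever you unzipped the data.

```r
# Show the directory R currently treats as the working directory
getwd()

# Hypothetical path: replace with the folder that holds the unzipped data
data_dir <- "path/to/lake_data"

# Change the working directory only if that folder actually exists
if (dir.exists(data_dir)) {
  setwd(data_dir)
}

# Check whether the lesson's input file is visible from the working directory
file.exists("NTL-LTER_Lake_ChemistryPhysics_Raw.csv")
```

If the last line returns `FALSE`, the `read.csv` call later in the lesson will not find the file from the current working directory.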

### Load Packages


In [2]:
options(warn=-1)

library(tidyverse)
library(lubridate)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✓ ggplot2 3.3.3     ✓ purrr   0.3.4
✓ tibble  3.0.5     ✓ dplyr   1.0.3
✓ tidyr   1.1.2     ✓ stringr 1.4.0
✓ readr   1.4.0     ✓ forcats 0.5.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()


Attaching package: ‘lubridate’


The following objects are masked from ‘package:base’:

    date, intersect, setdiff, union




### Load Data

In [3]:
NTLdata <- read.csv("NTL-LTER_Lake_ChemistryPhysics_Raw.csv")

## Data Wrangling and Exploration

The purpose of this section is to perform exploratory data analysis to investigate the structure of the provided dataset. 

## `NTLdata`

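To print the contents of a data frame, execute its name. In a Jupyter notebook the result is rendered as a table with the column types listed under the column names; for a spreadsheet-style viewer in RStudio, use `View(NTLdata)` instead. Only the first rows of the output are reproduced here.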

In [4]:
NTLdata

lakeid,lakename,year4,daynum,sampledate,depth,temperature_C,dissolvedOxygen,irradianceWater,irradianceDeck,comments
<chr>,<chr>,<int>,<int>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
L,Paul Lake,1984,148,5/27/84,0.00,14.5,9.5,1750.0,1620,
L,Paul Lake,1984,148,5/27/84,0.25,,,1550.0,1620,
L,Paul Lake,1984,148,5/27/84,0.50,,,1150.0,1620,
L,Paul Lake,1984,148,5/27/84,0.75,,,975.0,1620,
L,Paul Lake,1984,148,5/27/84,1.00,14.5,8.8,870.0,1620,
L,Paul Lake,1984,148,5/27/84,1.50,,,610.0,1620,
L,Paul Lake,1984,148,5/27/84,2.00,14.2,8.6,420.0,1620,
L,Paul Lake,1984,148,5/27/84,3.00,11.0,11.5,220.0,1620,
L,Paul Lake,1984,148,5/27/84,4.00,7.0,11.9,100.0,1620,
L,Paul Lake,1984,148,5/27/84,5.00,6.1,2.5,34.0,1620,
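
Printing a whole data frame becomes unwieldy for large datasets, so it is often more practical to inspect a few rows and the column types. The sketch below uses a small made-up data frame (`toy`, not part of the lesson data) that borrows three column names from `NTLdata`.

```r
# Small illustrative data frame; values are made up for this example
toy <- data.frame(
  lakename = c("Paul Lake", "Paul Lake", "Peter Lake"),
  depth = c(0, 1, 0),
  temperature_C = c(14.5, 14.5, NA)
)

# Show only the first two rows
head(toy, n = 2)

# Compact overview: number of rows, column names, and column types
str(toy)

# In RStudio, View(toy) opens the spreadsheet-style viewer (interactive only)
```

The same calls work on `NTLdata` once it has been loaded.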


## Date Formatting

### `class`

Use the `class` function to return the class (or classes) that an object inherits from.

In [19]:
class(NTLdata$sampledate)

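The result indicates how R stored the `sampledate` column of the NTLdata dataframe when the file was read. With R versions before 4.0.0 (including the 3.6.1 release noted above), `read.csv` converts strings to factors by default, so the column is a factor. With R 4.0.0 and later it is a character vector, which matches the `<chr>` type shown in the table output above. In either case the column is not yet a date.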
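
Whether a text column arrives as a factor or as a character vector depends on the `stringsAsFactors` argument of `read.csv`. The sketch below reads a tiny made-up CSV from a string (via the `text` argument) rather than from the lesson's file, so it runs without any data on disk.

```r
# Two made-up rows in the same layout as the lakename and sampledate columns
csv_text <- "lakename,sampledate\nPaul Lake,5/27/84\nPeter Lake,5/28/84"

# Strings kept as character vectors (the default from R 4.0.0 onward)
as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Strings converted to factors (the default before R 4.0.0)
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_chr$sampledate)  # "character"
class(as_fct$sampledate)  # "factor"
```

Setting `stringsAsFactors` explicitly makes a script behave the same way regardless of the installed R version.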

### `as.Date`

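Use the `as.Date` function to convert a string to a date. The format string `"%m/%d/%y"` tells R that each value is written as month/day/two-digit year, as in `5/27/84`.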

In [20]:
NTLdata$sampledate <- as.Date(NTLdata$sampledate, "%m/%d/%y")

In [21]:
# Check the class of the sampledate column now 
class(NTLdata$sampledate)
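
To see what the conversion does, it can help to try it on a few strings first. This is a small sketch on made-up values; the third string is deliberately invalid to show that unparseable dates become `NA` rather than raising an error. Note that with `%y`, R maps two-digit years 69 to 99 to the 1900s and 00 to 68 to the 2000s.

```r
# Made-up date strings in month/day/two-digit-year form; the last is invalid
raw_dates <- c("5/27/84", "6/3/84", "13/45/84")

# Parse using the same format string as the lesson
parsed <- as.Date(raw_dates, format = "%m/%d/%y")

class(parsed)   # "Date"
parsed          # dates print in ISO form (YYYY-MM-DD)
is.na(parsed)   # TRUE marks strings that could not be parsed

# Dates can be written back out in any format
format(parsed[1], "%Y-%m-%d")  # "1984-05-27"
```

Counting `NA` values before and after the conversion, for example with `sum(is.na(...))`, is a quick way to check that no dates were lost to a mismatched format string.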

## Addressing NAs

### `dim`

Use the `dim` function to retrieve the dimensions of an object. 

In [22]:
dim(NTLdata)
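
Alongside the dimensions, it is useful to know how many values are missing in each column before deciding which rows to drop. A minimal sketch on a made-up data frame (`toy` is illustrative, not the lesson data):

```r
# Made-up data frame with some missing values
toy <- data.frame(
  depth = c(0, 0.25, 1),
  temperature_C = c(14.5, NA, 14.5),
  dissolvedOxygen = c(9.5, NA, 8.8)
)

dim(toy)    # rows, then columns: 3 3
nrow(toy)   # number of rows only
ncol(toy)   # number of columns only

# Number of missing values in each column
colSums(is.na(toy))
```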

### `drop_na`

Use the `drop_na` function to drop any rows with a missing value in the `temperature_C` column. Rows with missing values in other columns are kept.

In [23]:
NTLdata <- NTLdata %>%
  drop_na(temperature_C)

In [24]:
# Check the dimensions of the main data frame now
dim(NTLdata)
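
The column argument matters: `drop_na` with a column name removes rows only where that column is missing, while `drop_na` with no columns removes rows with a missing value anywhere. The sketch below illustrates the difference on a made-up data frame and assumes the `tidyr` package (part of the tidyverse) is installed.

```r
library(tidyr)

# Made-up data frame: row 2 lacks both measurements, row 4 lacks oxygen only
toy <- data.frame(
  depth = c(0, 0.25, 1, 2),
  temperature_C = c(14.5, NA, 14.5, 14.2),
  dissolvedOxygen = c(9.5, NA, 8.8, NA)
)

# Drop rows where temperature_C is missing (keeps rows 1, 3, 4)
temp_only <- drop_na(toy, temperature_C)

# Drop rows with a missing value in any column (keeps rows 1, 3)
all_cols <- drop_na(toy)

nrow(temp_only)  # 3
nrow(all_cols)   # 2

# Base R equivalent of the first call
base_equiv <- toy[!is.na(toy$temperature_C), ]
```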

### `summary`

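Use the `summary` function to compute summary statistics of data and model objects. In this lesson, `summary` is applied to the `lakename` column as a step toward determining how many observations are present for each lake. Note that when the column is a character vector, `summary` reports only its length, class, and mode; per-lake counts require a factor, for example `summary(as.factor(NTLdata$lakename))`, or a call to `table(NTLdata$lakename)`.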

In [25]:
summary(NTLdata$lakename)

   Length     Class      Mode 
    34756 character character 
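
The difference between summarizing a character vector and a factor can be seen on a short made-up vector of lake names (illustrative values, not counts from the dataset):

```r
# Made-up vector of lake names
lakes <- c("Paul Lake", "Peter Lake", "Paul Lake", "Tuesday Lake", "Paul Lake")

# On a character vector, summary() gives only length, class, and mode
summary(lakes)

# On a factor, summary() gives the number of observations per level
summary(as.factor(lakes))

# table() counts observations per value for either type
counts <- table(lakes)

# Sort so the lakes with the most observations come first
sort(counts, decreasing = TRUE)
```

Sorting the counts in decreasing order is one way to identify the lakes with the most observations before filtering.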

## Subsetting and Filtering 

### `filter`

Use the `filter` function to choose rows/cases where conditions are true. In this lesson, we will use `filter` to select the three lakes with the most observations. 

In [26]:
NTLdata <- NTLdata %>%
  filter(lakename %in% c("Paul Lake", "Peter Lake", "Tuesday Lake"))

In [27]:
## You could also filter the data frame using this notation to get the same results
NTLdata <- NTLdata %>%
  filter(lakename == "Paul Lake" | lakename == "Peter Lake" | 
           lakename == "Tuesday Lake")
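
A common mistake when matching several values is to write `lakename == c(...)` instead of `lakename %in% c(...)`. The sketch below shows why the two differ, using a made-up data frame ("Other Lake" is an invented name) and assuming the `dplyr` package is installed.

```r
library(dplyr)

# Made-up data frame with six rows
toy <- data.frame(
  lakename = c("Paul Lake", "Peter Lake", "Tuesday Lake",
               "Other Lake", "Peter Lake", "Paul Lake"),
  depth = 0:5
)

keep <- c("Paul Lake", "Peter Lake")

# %in% tests each row against the whole set of names: 4 matching rows
with_in <- filter(toy, lakename %in% keep)
nrow(with_in)

# == recycles `keep` along the column (row 1 vs "Paul Lake", row 2 vs
# "Peter Lake", row 3 vs "Paul Lake", ...), so matches depend on row order
sum(toy$lakename == keep)  # only 2 rows match
```

This is why the second notation in the lesson chains single-value `==` comparisons with `|` rather than comparing against a vector.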

### Create a separate data frame for each lake

In [28]:
Pauldata <- filter(NTLdata, lakename == "Paul Lake")
Peterdata <- filter(NTLdata, lakename == "Peter Lake")
Tuesdaydata <- filter(NTLdata, lakename == "Tuesday Lake")

### Determine the beginning and end dates of each lake's period of record

### `min` and `max`

Use the `min` and `max` functions to return the minimum and maximum values of a vector or column. They work on `Date` columns as well as numeric ones, returning the earliest and latest dates.

In [29]:
min(Pauldata$sampledate)
max(Pauldata$sampledate)
min(Peterdata$sampledate)
max(Peterdata$sampledate)
min(Tuesdaydata$sampledate)
max(Tuesdaydata$sampledate)
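
The same information can be obtained for all lakes in one step by grouping, which avoids repeating `min` and `max` for each data frame. This is a sketch of an alternative approach, not the lesson's method; it uses made-up dates, hypothetical column names `first_sample` and `last_sample`, and assumes `dplyr` is installed.

```r
library(dplyr)

# Made-up sampling dates for two lakes
toy <- data.frame(
  lakename = c("Paul Lake", "Paul Lake", "Peter Lake", "Peter Lake"),
  sampledate = as.Date(c("1984-05-27", "1984-08-01",
                         "1984-05-28", "1984-07-15"))
)

# One row per lake with the first and last sampling date
record <- toy %>%
  group_by(lakename) %>%
  summarise(first_sample = min(sampledate),
            last_sample = max(sampledate))
record

# range() returns the minimum and maximum of a vector in a single call
range(toy$sampledate)
```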

### Determine the unique depth values (meters) sampled in each lake

### `unique`

Use the `unique` function to extract unique elements by removing duplicate elements/rows from a vector, data frame or array. 

In [30]:
unique(Pauldata$depth)
unique(Peterdata$depth)
unique(Tuesdaydata$depth)
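
`unique` returns values in order of first appearance, so sorting the result makes the sampled depths easier to compare between lakes. A minimal sketch on made-up values; the `tapply` call is one base R option for doing this per lake without separate data frames.

```r
# Made-up depths (meters) for two lakes
toy <- data.frame(
  lakename = c("Paul Lake", "Paul Lake", "Paul Lake",
               "Peter Lake", "Peter Lake"),
  depth = c(1, 0, 1, 0.5, 0)
)

# Values appear in order of first occurrence: 1.0 0.0 0.5
unique(toy$depth)

# Sorted from shallowest to deepest: 0.0 0.5 1.0
sort(unique(toy$depth))

# Sorted unique depths for each lake, returned as a list indexed by lake name
depths_by_lake <- tapply(toy$depth, toy$lakename,
                         function(d) sort(unique(d)))
depths_by_lake[["Paul Lake"]]   # 0 1
depths_by_lake[["Peter Lake"]]  # 0.0 0.5
```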