<a href="https://colab.research.google.com/github/MCRLdata-Sandbox/tutorials/blob/main/ML_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 1. Welcome!

This tutorial is designed to provide an introduction to machine learning (ML) for users with any level of experience with coding. All code is written in R, but you do NOT need to know or learn R to complete this tutorial!

ML is a subset of artificial intelligence (AI) in which a computer learns patterns from data and improves its performance on a task without being explicitly programmed with the rules. It is useful for a wide variety of data-based tasks, and many different algorithms exist for different types of tasks.

For this tutorial we will focus on one ML algorithm: Random Forests (RF). RF builds many decision trees and combines their predictions, and we can use it to predict one variable from a set of others. RF is relatively robust to many of the factors that can cause problems in conventional statistical models (things like correlated predictors, non-normal distributions, and non-linear relationships).
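To make the point about non-linear relationships concrete, here is a minimal sketch that compares a straight-line model with a random forest on a deliberately curvy relationship. The data are synthetic and the names (`toy`, `rmse`, `lm_rmse`, `rf_rmse`) are made up for illustration; it assumes the `ranger` package is installed, and you do not need to run it to follow the tutorial.

```r
library(ranger)

set.seed(1)

# Synthetic data: y follows a sine wave in x, plus some noise.
# These values are invented for illustration only.
n <- 500
x <- runif(n, min = -3, max = 3)
toy <- data.frame(x = x, y = sin(2 * x) + rnorm(n, sd = 0.2))

# Hold out the last 100 rows to check predictions on unseen data
train <- toy[1:400, ]
test  <- toy[401:500, ]

# A linear model assumes a straight-line relationship
lm_fit <- lm(y ~ x, data = train)

# A random forest makes no such assumption
rf_fit <- ranger(y ~ x, data = train, num.trees = 500, seed = 1)

# Root mean squared error: lower means better predictions
rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))

lm_rmse <- rmse(test$y, predict(lm_fit, newdata = test))
rf_rmse <- rmse(test$y, predict(rf_fit, data = test)$predictions)

print(c(linear_model = lm_rmse, random_forest = rf_rmse))
```

On data like this the random forest should have the lower error, because a straight line cannot follow a wave. Real datasets are rarely this clear-cut, so treat this as an illustration of the idea rather than a general ranking of the two methods.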



## 2. Setup

Before we start anything, we need to set up our coding environment. Because getting a coding language like R or Python running on your computer is often an involved process, this tutorial takes advantage of Google Colab, which pre-loads all of the software you need. You do, however, need to install packages and set up your environment. We'll do this by running the code chunk below (press the play button in the upper-left).

**IMPORTANT: from here on, when you see a code chunk with a play button, press play! Please do not skip code chunks, because the code below generally depends on the code above.**

This code chunk will take a couple minutes (~3 minutes on my machine) because R needs to install and load several libraries. All other code chunks should be much faster!


In [1]:
## I want to understand how long things take
install.packages('tictoc')
library(tictoc)

## Install and load required packages
tic("install and load packages")
install.packages(c('tidyverse', 'rsample', 'cowplot', 'ranger'))

library(tidyverse)
library(rsample)
library(cowplot)
library(ranger)
toc()

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Installing packages into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘listenv’, ‘parallelly’, ‘future’, ‘globals’, ‘warp’, ‘furrr’, ‘slider’, ‘RcppEigen’


── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ U

install and load packages: 196.853 sec elapsed
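Most of that time goes to installation. If you run this notebook somewhere that keeps packages between sessions, one possible shortcut is to install only what is missing. The sketch below is an optional alternative, not part of the original setup; `find_missing()` and `install_if_missing()` are hypothetical helper names introduced here.

```r
# Return the packages from `pkgs` that are not currently installed.
# (Hypothetical helper, not part of any package.)
find_missing <- function(pkgs) {
  is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!is_installed]
}

# Install only the packages that are missing, then report what was done.
# (Hypothetical helper, not part of any package.)
install_if_missing <- function(pkgs) {
  missing_pkgs <- find_missing(pkgs)
  if (length(missing_pkgs) > 0) {
    install.packages(missing_pkgs)
  } else {
    message("All requested packages are already installed.")
  }
  invisible(missing_pkgs)
}

# Example call (uncomment to run):
# install_if_missing(c("tidyverse", "rsample", "cowplot", "ranger"))
```

On a fresh Colab session the packages are usually not there yet, so this will not save time on a first run; you would still need the `library()` calls afterwards either way.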


While that's loading, let's start with the question we want our model to answer: **Can we use basic hydrology and water quality data to predict aquatic carbon dioxide concentrations in Sequim Bay?**

This is a common type of question to ask of ML, and while it seems simple, finding the answer can be a complicated process. This tutorial leverages the awesome datasets being collected off the MCRL dock by the [MCRLdata](https://mcrldata.pnnl.gov/) pipeline. We will use the partial pressure of carbon dioxide in water (pCO2) as the variable we want to predict, and a range of parameters, including water temperature, tidal stage, wind speed, and others, as our predictors.
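In code, "the variable we want to predict" and "our predictors" usually become the two sides of a model formula. The sketch below shows that framing on a small synthetic table, and it assumes the `rsample` package is installed. The column names (`temp_c`, `tide_m`, `windspeed`, `pco2`) and all values are hypothetical stand-ins, not the actual MCRLdata columns or measurements.

```r
library(rsample)

set.seed(42)

# Synthetic stand-in data. Column names and values are hypothetical
# and are NOT taken from MCRLdata.
n <- 200
toy <- data.frame(
  temp_c    = runif(n, min = 8, max = 16),   # water temperature
  tide_m    = runif(n, min = -1, max = 3),   # tidal stage
  windspeed = runif(n, min = 0, max = 12)    # wind speed
)
toy$pco2 <- 400 + 15 * toy$temp_c - 20 * toy$tide_m + rnorm(n, sd = 10)

# The target is what we predict; every other column is a predictor
target <- "pco2"
predictors <- setdiff(names(toy), target)

# Build the formula: pco2 ~ temp_c + tide_m + windspeed
model_formula <- reformulate(predictors, response = target)
print(model_formula)

# Keep some rows aside so the model can later be checked on unseen data
data_split <- initial_split(toy, prop = 0.8)
train <- training(data_split)
test  <- testing(data_split)
```

A formula built this way can be passed to a modeling function such as `ranger()` along with the training data, while the held-out rows are kept for checking predictions.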