# Reproducible Science in R and Figures (Part 1)

<img src="https://zenodo.org/record/3332808/files/1728_TURI_Book%20sprint_5%20turing%20way_040619.jpg" alt="Drawing" style="display:block; margin:auto; width:30%">


### Skills you will have by the end of the lesson...

- Creating a data pipeline
- Writing modular code
- Keeping your data safe
- Keeping your code sane
- Code that runs on other people's computers!


> Images from [The Turing Way](https://the-turing-way.netlify.app/reproducible-research/reproducible-research.html)! 


<img src="https://github.com/the-turing-way/the-turing-way/blob/main/book/website/figures/reusable-code-garden.png?raw=true" alt="Drawing" style="display:block; margin:auto; width: 600px;"/>

## What is Reproducible Research?

<img src="https://the-turing-way.netlify.app/_images/reproducibility.jpg" alt="Drawing" style="display:block; margin:auto; width: 500px;"/>


Scientific results and evidence are strengthened if those results can be replicated and confirmed by several independent researchers. 

What you do on your computer is as much a part of scientific methodology as fieldwork and lab work.

<img src="image/img_crisis.png" alt="Drawing" style="display:block; margin:auto; width: 500px;"/>

*Major media outlets have reported on investigations showing that a significant percentage of scientific studies cannot be reproduced. This leads to other academics and society losing trust in scientific results.*
https://www.nature.com/articles/533452a

### Reproducible

<img src="image/img_fail_reproduce.png" alt="Drawing" style="display:block; margin:auto; width: 400px;"/>


 - Authors provide all the necessary data and the computer code to run the analysis again, re-creating the results.

### Replicable

- A study that arrives at the same scientific findings as another study, collecting new data (possibly with different methods) and completing new analyses.

<img src="https://book.the-turing-way.org/_images/reproducible-definition-grid.svg" alt="Drawing" style="display:block; margin:auto; width: 600px;"/>

# Why bother?

### Helps YOU

<img src="https://github.com/the-turing-way/the-turing-way/blob/main/book/website/figures/help-you-of-the-future.png?raw=true" alt="Drawing" style="display:block; margin:auto; width:400px;"/>

Usually the person trying to redo your work and analysis is yourself at a later date!


<img src="https://book.the-turing-way.org/_images/project-history.svg" alt="Drawing" style="display:block; margin:auto; width: 600px;"/>


A lot of time is wasted redoing analyses because you can't read or understand your own code from a few weeks or years ago, or redoing figures because you've lost the code that generated them.

It also acts as a fail-safe backup for your work!

### Helps open science

Open research aims to transform research by making it more reproducible, transparent, reusable, collaborative, accountable, and accessible to society. 

<img src="https://book.the-turing-way.org/_images/evolution-open-research.png" alt="Drawing" style="display:block; margin:auto; width: 500px;"/>



- Publicly available: It is difficult to use and benefit from knowledge hidden behind barriers such as passwords and paywalls.

- Reusable: Research outputs need to be licensed appropriately, so that prospective users know any limitations on re-use.

- Transparent: With appropriate metadata to provide clear statements of how research output was produced and what it contains.

<img src="https://raw.githubusercontent.com/the-turing-way/the-turing-way/0e5c59a66a2665b056b6f5471e4f54e02b443d95/book/website/figures/fair-principles.svg" alt="Drawing" style="display:block; margin:auto; width: 500px;"/>

### Helps you publish
<img src="https://www.turing.ac.uk/sites/default/files/inline-images/Culture%20shift.jpg" alt="Drawing" style="display:block; margin:auto; width: 500px;"/>

It is becoming more common for code release to be part of the scientific publishing and grant approval process. 

### Prevents mistakes

<img src="https://github.com/the-turing-way/the-turing-way/blob/main/book/website/figures/testing-motivation1.png?raw=true" alt="Drawing" style="display:block; margin:auto; width: 700px;"/>   

# Penguin Data


This data set on Palmer penguins has nice examples all over the internet. It contains morphometric data from three species of penguin.

<img src="https://allisonhorst.github.io/palmerpenguins/reference/figures/lter_penguins.png" alt="Drawing" style="display:block; margin:auto; width: 800px;"/>

# Creating a Project Folder


One of the kindest things you can do for yourself when using a computer is to organise your files carefully. This means...

### ⚠️ **DO NOT save to Desktop!**
> - If your computer breaks, you will lose all your files on Desktop.

The University gives you OneDrive storage. Saving there means your files are in the cloud, so you can access them from any computer, and the University will also back up your files for you.

## (01) In RStudio, press "New Project"

<img src="image/directory_000.png" alt="Drawing" style="display:block; margin:auto; width: 300px;"/>

A pop-up will appear. Select "New Directory", then "New Project".

<img src="image/directory_000a.png" alt="Drawing" style="display:block; margin:auto; width: 300px;"/>

<img src="image/directory_000b.png" alt="Drawing" style="display:block; margin:auto; width: 285px;"/>

## (02) Navigate to your OneDrive and call your project "PenguinProject"

<img src="image/directory_000c.png" alt="Drawing" style="display:block; margin:auto; width: 500px;"/>

## (03) Your window in RStudio will now look like this:

You can see your project name in the top left, and in the Files pane on the bottom right.


<img src="image/directory_000d.png" alt="Drawing" style="display:block; margin:auto; width: 600px;"/>


We now have a ".Rproj" file in our project folder which will allow us to open this project in the future. If you click it, you will get settings for your project. 

We now have the top level of our project, which you can also find in Windows Explorer.

<img src="image/directory_001.png" alt="Drawing" style="display:block; margin:auto; width: 500px;"/>

## (04) Create a new R script, or R Markdown file, or R Notebook file

I don't mind which you choose. I will use an R Markdown file. In a future session you will learn about Quarto, which is a more powerful way to write documents.



For now, call this "penguin_analysis.Rmd"

<img src="image/img_001_newfile.png" alt="Drawing" style="width: 300px;"/> <img src="image/img_002_newrmd.png" alt="Drawing" style="width: 300px;"/>

Press Save and now you can check the files pane on the bottom right.

<img src="image/img_003_dirInR.png" alt="Drawing" style="display:block; margin:auto; width: 500px;"/>

Here are all our files so far:

<img src="image/directory_002.png" alt="Drawing" style="display:block; margin:auto; width: 500px;"/>

## (05) Create a subfolder for the data

We will now make a new subfolder called "data". You can do this in RStudio using the file panel on the bottom right.

<img src="image/directory_000e.png" alt="Drawing" style="display:block; margin:auto; width: 300px;"/>
<img src="image/directory_000f.png" alt="Drawing" style="display:block; margin:auto; width: 300px;"/>

If you prefer, you can do this within Windows Explorer, and then refresh the RStudio file panel.

We now have a new subfolder. 

<img src="image/directory_003.png" alt="Drawing" style="display:block; margin:auto; width: 300px;"/>

## (06) Installing Libraries

A library is a collection of functions that we can use in our code. We will use the `tidyverse` library, which contains many helpful functions.

We need to install a library the first time we use it, which means downloading it to our computer. You can use the "Packages" pane on the bottom right to do this, next to the "Files" pane.

#### ⚠️ **Do not start your code with `install.packages()`** 
>
>We will discuss why later, but this is bad practice.

<img src="image/img_003_packages.png" alt="Drawing" style="display:block; margin:auto; width: 500px;"/>

First type "tidyverse" into the search bar, then click Install.

<img src="image/img_003_install.png" alt="Drawing" style="display:block; margin:auto; width: 500px;"/>

### (07) Loading Libraries

After that, we need to load the library into our R session. We can do this using the `library()` function. Afterwards, you will see it in the "Packages" pane.

<img src="image/img_003_library.png" alt="Drawing" style="display:block; margin:auto; width: 400px;"/> <img src="image/img_003_tidypane.png" alt="Drawing" style="display:block; margin:auto; width: 400px;"/>

### (08) Installing and Loading More Libraries

For today, we will need to install and load the following libraries: 

<img src="https://m-clark.github.io/data-processing-and-visualization/img/hex_tidyverse.svg" alt="Drawing" style="display:block; margin:auto; width: 100px;"/>

- `tidyverse`
- `palmerpenguins`
- `here`
- `janitor`

Please install and load these libraries now.
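
Before loading, it can help to check which of these packages are already installed. The following is a minimal sketch using base R's `requireNamespace()`. It only reports what is missing and does not install anything, which keeps `install.packages()` out of your analysis code. The variable names are illustrative.

```r
# Packages needed for this lesson
needed <- c("tidyverse", "palmerpenguins", "here", "janitor")

# requireNamespace() returns TRUE if a package is installed and can be loaded,
# and FALSE otherwise
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Names of any packages that still need installing via the Packages pane
missing_packages <- needed[!is_installed]
print(missing_packages)
```

If `missing_packages` is empty, everything is ready to be loaded with `library()`.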



<img src="image/img_003_more.png" alt="Drawing" style="display:block; margin:auto; width: 400px;"/>

## (09) Making Sure R knows where it is

When we run R code, we want it to be able to read and write the files in our project, so R needs to know where the project is. Because we created an RStudio Project, this is already built in. You can check this by running the following code in your console:


`here::here()`


This should return the path to your project.

<img src="https://raw.githubusercontent.com/allisonhorst/stats-illustrations/master/rstats-artwork/here.png" alt="Drawing" style="display:block; margin:auto; width: 350px;"/>



### ⚠️ **Do not use `setwd()`!** It is a bad habit to get into. 
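
To see why, compare a hard-coded path with one built from pieces. This is a sketch in base R: the absolute path is a made-up example, and `file.path()` stands in for `here()`, which joins the pieces in the same way but starts from the project folder.

```r
# A hard-coded absolute path (hypothetical) only exists on one person's computer
absolute_path <- "C:/Users/someone/OneDrive/PenguinProject/data/penguins_raw.csv"

# A path built from folder and file names, relative to the project folder
relative_path <- file.path("data", "penguins_raw.csv")
print(relative_path)

# Both paths point at a file with the same name
basename(absolute_path) == basename(relative_path)
```

Code that uses `setwd()` with an absolute path breaks as soon as the project is moved or shared, whereas the relative version keeps working.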

## (10) Loading Penguin Data

We will now load the penguin data from the `palmerpenguins` library. You may be used to loading functions from libraries, but libraries can also contain data.

<img src="https://allisonhorst.github.io/palmerpenguins/reference/figures/culmen_depth.png" alt="Drawing" style="display:block; margin:auto; width: 350px;"/>



In [25]:
# I'm installing to a jupyter notebook instance
install.packages("palmerpenguins")
install.packages("here")
install.packages("janitor")
install.packages("tidyverse")


Installing package into ‘/opt/homebrew/lib/R/4.4/site-library’
(as ‘lib’ is unspecified)




In [26]:
library(palmerpenguins)
library(here)
library(janitor)
library(tidyverse)

### There is a variable called penguins_raw...

As a reminder, the `head()` function shows the first 6 rows of a data frame.


In [27]:
head(penguins_raw)

studyName,Sample Number,Species,Region,Island,Stage,Individual ID,Clutch Completion,Date Egg,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex,Delta 15 N (o/oo),Delta 13 C (o/oo),Comments
<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<date>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<chr>
PAL0708,1,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A1,Yes,2007-11-11,39.1,18.7,181.0,3750.0,MALE,,,Not enough blood for isotopes.
PAL0708,2,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A2,Yes,2007-11-11,39.5,17.4,186.0,3800.0,FEMALE,8.94956,-24.69454,
PAL0708,3,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A1,Yes,2007-11-16,40.3,18.0,195.0,3250.0,FEMALE,8.36821,-25.33302,
PAL0708,4,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A2,Yes,2007-11-16,,,,,,,,Adult not sampled.
PAL0708,5,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N3A1,Yes,2007-11-16,36.7,19.3,193.0,3450.0,FEMALE,8.76651,-25.32426,
PAL0708,6,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N3A2,Yes,2007-11-16,39.3,20.6,190.0,3650.0,MALE,8.66496,-25.29805,



## (11) Checking Column Names

One problem we can see immediately is that the column names are badly formatted. Let us take a look at them:


In [28]:
colnames(penguins_raw)

They mix capital and lower case letters. Some have spaces between words, which can cause lots of issues in code, and others have no spaces at all.

- `studyName` is in camel case
- `Sample Number` is in title case, with a space
- `Culmen Length (mm)` has spaces and also brackets, which can cause issues too.
- `Delta 15 N (o/oo)` has spaces, brackets, and a slash!
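
The problem with spaces and brackets is easy to see in a toy example. This sketch uses a small made-up data frame (not the penguin data) in base R: a column whose name contains spaces must be wrapped in backticks every time it is used, while a clean name can be typed directly.

```r
# check.names = FALSE stops R from silently changing the awkward name
toy <- data.frame(`Body Mass (g)` = c(3750, 3800), check.names = FALSE)

# The awkward name only works inside backticks
mean(toy$`Body Mass (g)`)

# After renaming, the column can be used without any special quoting
names(toy) <- "body_mass_g"
mean(toy$body_mass_g)
```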

When you see these kinds of problems, it may be tempting to open the file in Excel and edit the column names by hand, along with any other issues you see.

### ⚠️ **Never do this.**
> Editing a file in Excel is completely unreproducible: no one has any record of what you changed, and mistakes can easily go unnoticed.

## (12) Preserving our Raw Data

A best practice before we do anything is to save this file as `penguins_raw.csv` and consider it "read only"! That means it is preserved exactly as is before we start meddling with it. 

<img src="https://book.the-turing-way.org/_images/rdm-storage.jpg" alt="Drawing" style="display:block; margin:auto; width: 500px;"/>


Using the following command, we are going to save the variable `penguins_raw` as a `.csv` file.

In [29]:

write_csv(penguins_raw, here("data", "penguins_raw.csv"))


We are using the `here` library to help us with the file path. We tell it to use the `data` folder and the filename `penguins_raw.csv`. If we didn't use the `here` library, we would have to type out a full file path, which only works on one computer and is not very reproducible.

Now we have a new file in our data folder.

<img src="image/directory_004.png" alt="Drawing" style="display:block; margin:auto; width: 400px;"/>

## (13) Removing Columns

We can use `select()` with a minus sign `-` in front of a column name to remove that column. Here we remove the `Comments` column.

In [30]:
penguins_raw <- select(penguins_raw, -Comments)

colnames(penguins_raw)

You can see that the "Comments" column is now gone. 

We can do the same for two columns at once, using the function `starts_with()`.

In [31]:
penguins_raw <- select(penguins_raw, -starts_with("Delta"))
colnames(penguins_raw)


## ⚠️ **This code is bad practice.**

We are overwriting the original data frame, and then overwriting it again. If you run the first line twice, you will get an error, because the `Comments` column has already been removed.
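
The error is easy to reproduce on a toy example. This is a minimal sketch, assuming a small made-up data frame in place of `penguins_raw`; `tryCatch()` is only used so that the error can be inspected instead of stopping the script.

```r
library(dplyr)

# A small made-up data frame standing in for penguins_raw
toy <- tibble(Species = c("Adelie", "Gentoo"), Comments = c("note a", "note b"))

# First run: the Comments column is removed and toy is overwritten
toy <- select(toy, -Comments)

# Second run: Comments no longer exists, so select() raises an error.
# tryCatch() captures the error object so the script can carry on.
second_run <- tryCatch(select(toy, -Comments), error = function(e) e)
inherits(second_run, "error")
```

Because `toy` was overwritten, the same line of code behaves differently depending on how many times it has been run, which is exactly what we want to avoid.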

This is our first mistake. However, we have saved a safe copy of the original data frame as `penguins_raw.csv` in our data folder. We can load it again now. 

In [33]:
penguins_raw <- read_csv(here("data", "penguins_raw.csv"), show_col_types = FALSE)
colnames(penguins_raw)
# show_col_types = FALSE stops read_csv() printing the column types after loading

In [34]:
# ------- All the code so far: -------
# library(tidyverse)
# library(palmerpenguins)
# library(here)
# library(janitor)
#
# head(penguins_raw)
#
# write_csv(penguins_raw, here("data", "penguins_raw.csv"))
# ------------------------------

# ------- Previous version: -------
# names(penguins_raw)
# penguins_raw <- select(penguins_raw, -Comments)
# penguins_raw <- select(penguins_raw, -starts_with("Delta"))
# names(penguins_raw)
# ------------------------------

# Loading the data because we overwrote it 
colnames(penguins_raw)
penguins_raw <- read_csv(here("data", "penguins_raw.csv"), show_col_types = FALSE)
penguins_clean <- select(penguins_raw, -Comments)
penguins_clean <- select(penguins_clean, -starts_with("Delta"))
colnames(penguins_clean)



### ⚠️ **We are still overwriting**.

We overwrite "penguins_clean" in the second line. 

## Using Piping from Tidyverse

Instead we can use piping from the `tidyverse` library. The `%>%` means "and then" and we can use it to chain commands together. Use `Ctrl` + `Shift` + `M` (`Cmd` + `Shift` + `M` on a Mac) to write one quickly. 
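
As a quick illustration of what "and then" means, here is a small sketch on plain numbers rather than the penguin data. The pipe passes the result on its left as the first argument of the function on its right, so the two versions below do the same thing.

```r
library(dplyr)  # loading dplyr (part of the tidyverse) makes %>% available

# Nested calls are read from the inside out
nested <- sum(sqrt(c(1, 4, 9)))

# The same steps with pipes: take the numbers, and then sqrt(), and then sum()
piped <- c(1, 4, 9) %>%
  sqrt() %>%
  sum()

nested == piped
```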


<img src="https://upload.wikimedia.org/wikipedia/commons/8/89/Dplyr_hex_logo.svg" alt="Drawing" style="display:block; margin:auto; width: 200px;"/>



In [35]:
# ------- Previous version: -------
# penguins_clean <- select(penguins_raw, -Comments)
# penguins_clean <- select(penguins_clean, -starts_with("Delta"))
# names(penguins_raw)
# ------------------------------

# Using piping
colnames(penguins_raw)
penguins_clean <- penguins_raw %>%
  select(-Comments) %>%
  select(-starts_with("Delta"))

colnames(penguins_clean)


## (14) Cleaning Column Names

Now we have removed a few extra columns, but we still have a problem with the names of the columns. 


The `janitor` library has a function called `clean_names()` which will clean the column names for us. We can change our pipe to include it.
<img src="https://raw.githubusercontent.com/sfirke/janitor/refs/heads/main/man/figures/logo_small.png" alt="Drawing" style="display:block; margin:auto"/>



In [22]:
# ------- Previous version: -------
# names(penguins_raw)
# 
# penguins_clean <- penguins_raw %>%
#   select(-Comments) %>%
#   select(-starts_with("Delta"))
# 
# names(penguins_clean)

colnames(penguins_raw)

penguins_clean <- penguins_raw %>%
  select(-Comments) %>%
  select(-starts_with("Delta")) %>%
  clean_names() # Uses janitor package

colnames(penguins_clean)

Our column names have now been altered. There are no spaces, and they are all in lower case.

This means the columns are now "computer readable" and also "human readable". In R, we should use underscores to separate words in the names we use in code (snake_case), or use camel case (whichIsLikeThis). Avoid hyphens in column and variable names, because R reads `-` as a minus sign. Special characters like brackets and slashes can also cause problems in code.
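
To see `clean_names()` on its own, here is a minimal sketch on a made-up one-row data frame that borrows three of the raw column names and the first values from the table above. The data frame `toy` is only for illustration.

```r
library(janitor)

# A made-up one-row data frame using three of the raw column names
toy <- data.frame(
  `studyName` = "PAL0708",
  `Sample Number` = 1,
  `Culmen Length (mm)` = 39.1,
  check.names = FALSE
)

# clean_names() converts every column name to lower case with underscores
cleaned <- clean_names(toy)
names(cleaned)
```

The names printed here should match the corresponding names in `penguins_clean`.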

<img src="https://khalilstemmler.com/img/blog/camel-snake-pascal-case/camel-case-snake-case-pascal-case.png" alt="Drawing" style="display:block; margin:auto; width: 400px;"/>


Here is all our code together:

In [None]:
library(tidyverse)
library(palmerpenguins)
library(here)
library(janitor)

head(penguins_raw)

write_csv(penguins_raw, here("data", "penguins_raw.csv"))

penguins_clean <- penguins_raw %>%
  select(-Comments) %>%
  select(-starts_with("Delta")) %>%
  clean_names() # Uses janitor package

colnames(penguins_clean)



## Data Pipelines

As things currently stand in research papers, generally the methodology looks like this:

<img src="https://github.com/LydiaFrance/Reproducible_Figures_R/blob/lessons/unknown_methods.png?raw=true" alt="Drawing" style="display:block; margin:auto; width: 800px;"/>

We know where the research questions come from, we can read those in the introduction, and we can usually find detailed methodology about data collection in the lab or field. But when it comes to what steps happened for the analysis and producing figures, we get no information! Even the data used for the analysis is unavailable most of the time. 

<img src="https://github.com/LydiaFrance/Reproducible_Figures_R/blob/lessons/meme.png?raw=true" alt="Drawing" style="display:block; margin:auto; width: 300px;"/>

What we should be doing is filling in the blanks:

<img src="https://github.com/LydiaFrance/Reproducible_Figures_R/blob/lessons/data_pipeline.png?raw=true" alt="Drawing" style="display:block; margin:auto; width: 1000px;"/>

And in closer detail:

<img src="https://github.com/LydiaFrance/Reproducible_Figures_R/blob/lessons/data_pipeline_detail.png?raw=true" alt="Drawing" style="display:block; margin:auto; width: 900px;"/>

We can also make sure the raw and cleaned data are available, for example on Zenodo. This helps when scientific results need to be held up to scrutiny, and also for collaboration! Other people may have exciting new uses for the computational work you have worked hard on.

## (15) Reusable Code through Functions

We often need the same code over and over again. Part of building **data pipelines** is making the parts **reusable** for other purposes. These might be code blocks that make a figure or run a specific model that you know you'll need multiple times.

Something that is extremely helpful in making reusable code is making a **function**. 


<img src="https://book.the-turing-way.org/_images/reproducible-definition-grid.svg" alt="Drawing" style="display:block; margin:auto; width: 400px;"/>

### Creating a Function in R

We use functions all the time in R, and we usually load them from libraries. We can also make our own.

We start with the name of our function, which we're going to call `cleaning_penguin_columns`. In the brackets of `function()` we specify what is fed into the function, in this case the raw data. 

> Note, it doesn't matter what we call the function! You might want to call it `cleaning_up_column_names`.

Then as before, we put the pipe inside. We're keeping this function generally applicable, so the input can just have a generic name like `data` or `raw_data`. Unlike before, we don't specify what the new variable will be called, as the function will simply output it. The `{}` brackets specify what is inside the function.

In [23]:
cleaning_penguin_columns <- function(raw_data){
  raw_data %>%
    select(-Comments) %>%
    select(-starts_with("Delta")) %>%
    clean_names()
}

# Fun fact, you can call the outputs and inputs anything you want, as long as it is internally consistent:

# cleaning_penguin_columns <- function(abcdefg){
#   abcdefg %>%
#     clean_names()
# }

# However, this is not human readable. We write code for humans, not computers. 


Now we've written this function, absolutely nothing happens. We need to **run** the function. 

### Running the Function

Here I will edit the function to add more steps. 

In [24]:
# Defining the function:
cleaning_penguin_columns <- function(raw_data){
  raw_data %>%
    clean_names() %>%
    remove_empty(c("rows", "cols")) %>% # This removes rows and columns that are empty
    select(-starts_with("delta")) %>%   # Why is this row now different to before?
    select(-comments)                   # Why is this row now different to before?
}

# Loading the raw data again
penguins_raw <- read_csv(here("data", "penguins_raw.csv"), show_col_types = FALSE)

# Check the column names
colnames(penguins_raw)

# Running the function:
penguins_clean <- cleaning_penguin_columns(penguins_raw)

# Checking the output
colnames(penguins_clean)

# Fun fact Part 2: It doesn't matter if the name of the input
# when you call the function is different to the name in the definition.
# penguins_raw is different to raw_data, but R knows what we mean. This
# is helpful because we don't need to remember the exact name we used
# in the function definition. And it makes it reusable!


### To be extra clear, we can add a print statement inside the function.

In [None]:

# ---- Previous version: ----
# cleaning_penguin_columns <- function(raw_data){
#   raw_data %>%
#     clean_names() %>%
#     remove_empty(c("rows", "cols")) %>%
#     select(-starts_with("delta")) %>%
#     select(-comments)
# }

# To be extra clear, we can print a message inside the function to show what is happening.
# The print() goes before the pipe: a function returns its last expression, so a
# print() at the end would be returned instead of the cleaned data.
cleaning_penguin_columns <- function(raw_data){
  print("Cleaning column names, removing empty rows and columns, removing delta and comments columns")
  raw_data %>%
    clean_names() %>%
    remove_empty(c("rows", "cols")) %>% # This removes rows and columns that are empty
    select(-starts_with("delta")) %>%   # Why is this row now different to before?
    select(-comments)                   # Why is this row now different to before?
}




## (16) Saving the Clean Data

We can again use the `here` library to save the data.

In [None]:
write_csv(penguins_clean, here("data", "penguins_clean.csv"))

We can now look at our files:

<img src="image/directory_005.png" alt="Drawing" style="display:block; margin:auto; width: 400px;"/>

## Moving the Cleaning Function to a Separate Script


So far in our code, we have been careful to save a copy of our raw data and clean data, but we also want a safe copy of our cleaning code. It is critical to save the steps we took to clean the data, so that they are open to scrutiny and checking. If we wrote the cleaning code in the R console, or across multiple files that we might lose, there would be no way of reproducing it exactly later.

In this toy example, we have also talked about this cleaning function being reusable. What we should always avoid is copying and pasting our code everywhere. 

Why?

- If you make a mistake, then you have to hunt down everywhere you pasted it
- It's easy to have multiple variations accidentally and confuse yourself
- No record of how that code has changed over time (remember your version control lesson)

### (17) Making a new subfolder

We're going to first make a new subfolder called "functions" where we can put our reusable code. You can do this as before by clicking the "New Folder" button in the bottom right of your screen. Or you can do it in the console:

`dir.create(here("functions"))`

Only run this once!
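
Running `dir.create()` on a folder that already exists gives a warning. One way to make the command safe to re-run is to check for the folder first. This sketch uses a temporary folder as a stand-in for the project folder, so `project_root` is an assumption for illustration; in your project you would build the path with `here("functions")` instead.

```r
# Stand-in for the project folder, so this example does not touch real files
project_root <- tempdir()
functions_dir <- file.path(project_root, "functions")

# Only create the folder if it is not already there
if (!dir.exists(functions_dir)) {
  dir.create(functions_dir)
}

dir.exists(functions_dir)
```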

<img src="image/directory_006.png" alt="Drawing" style="display:block; margin:auto; width: 400px;"/>

### (18) Making a new script

We're now going to make a new R script (not R Markdown) in our "functions" folder. You can do this by clicking the "New File" button in the bottom right of your screen. Or you can do it in the console:

`file.create(here("functions", "cleaning.R"))`

<img src="image/directory_007.png" alt="Drawing" style="display:block; margin:auto; width: 400px;"/>


Inside this R file we can paste our function. 

In [None]:

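# This is what is inside cleaning.R:
# -----------------------------------------------------------
# # Clean column names, remove empty rows, remove columns called comment and delta
# cleaning_penguin_columns <- function(raw_data){
#   raw_data %>%
#     clean_names() %>%
#     remove_empty(c("rows", "cols")) %>%
#     select(-starts_with("delta")) %>%
#     select(-comments)
# }
# -----------------------------------------------------------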


### (19) Loading the Function file

Now we are going to load this function into our current script. We do this by using the `source()` function. This is like the library function, but for our own R scripts and functions we have made. 
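
To see what `source()` does in isolation, here is a small sketch that writes a one-function script to a temporary file and then sources it. The file and the `double_it()` function are made up for this example and stand in for `functions/cleaning.R`.

```r
# Write a one-function script to a temporary file (a stand-in for functions/cleaning.R)
script_path <- tempfile(fileext = ".R")
writeLines("double_it <- function(x) { x * 2 }", script_path)

# source() runs the script, which defines double_it() in the current session
source(script_path)

double_it(21)
```

After `source()` has run, the function can be used just as if it had been typed into the current file.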



In [1]:
# ---- The top of our RMarkdown file ----
library(tidyverse)
library(palmerpenguins)
library(here)
library(janitor)

source(here("functions", "cleaning.R"))
# -------------------------------------

# ---- Load the raw data ----
penguins_raw <- read_csv(here("data", "penguins_raw.csv"), show_col_types = FALSE)
# -------------------------------------

# ---- Using our functions from the functions script ----
penguins_clean <- cleaning_penguin_columns(penguins_raw)
# ------------------------------------------------------

# ---- Save the clean data ----
write_csv(penguins_clean, here("data", "penguins_clean.csv"))
# -------------------------------------

# ---- Check the output ----
names(penguins_clean)
# -------------------------------------


# Why bother?

Having copies of the same code everywhere is a bad idea. 

We're building our data pipeline, and keeping these blocks safe and version controlled is really important. For this workshop we're only doing minor steps and so it will seem like overkill. 

<img src="https://imgs.xkcd.com/comics/fixing_problems.png" alt="Drawing" style="display:block; margin:auto; width: 200px;"/>



If you keep all the code you use for cleaning data in one place, you can then call it in multiple places whenever you need it.

For example, you might want to have a script that looks at the flippers, and another script that looks at the egg laying. If you copied and pasted the cleaning data function into both scripts, and then later on realise you made a mistake in that function, you'll have to hunt down every copy you have. 

Instead you can save it to a file called `functions/cleaning.R` and every time you use it, you refer to that copy only. Then any changes you make to it are consistently applied everywhere. 

You can also use your git commits (your GitHub time machine) to keep a clear record of what has happened to these functions over time, rather than having to look through all your scripts manually. 

<img src="https://book.the-turing-way.org/_images/project-history.svg" alt="Drawing" style="display:block; margin:auto; width: 600px;"/>

Also, if someone was quickly looking at your code folder, they can quickly find all the methods you used on your data to check what is going on! This is a lot better than a 2000 line `Rmd` file to read through and try to figure out what your methods were. 

<img src="https://book.the-turing-way.org/_images/readable-code.svg" alt="Drawing" style="display:block; margin:auto; width: 600px;"/>

So, just like we keep a safe record of our data, we keep a safe record of our **data protocols**.

<img src="https://the-turing-way.netlify.app/_images/reproducibility.jpg" alt="Drawing" style="display:block; margin:auto; width: 600px;"/>

Also, it takes a bit of time right now, but you're saving yourself a serious number of headaches later on...

<img src="https://raw.githubusercontent.com/LydiaFrance/Reproducible_Figures_R/refs/heads/lessons/CodeFast.png" alt="Drawing" style="display:block; margin:auto;width: 700px;"/>

## (20) Adding more functions

Now go to Canvas and copy over the functions from the `cleaning.R` file to your `cleaning.R` file. 


Before, we had a function which did multiple things. You can see in this code that it has been broken up into multiple functions. 

Add print statements to each of these functions to tell the user what is happening. 
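
The Canvas file is not reproduced in these notes. As a rough guide only, here is a sketch of how functions with these names could be written, with a print statement in each. The function bodies are assumptions based on how the functions are called in this lesson, not the official course code, and `toy_raw` is made-up data.

```r
library(dplyr)
library(janitor)

# Make column names lower case with underscores
clean_column_names <- function(penguins_data) {
  print("Cleaned column names")
  penguins_data %>%
    clean_names()
}

# Remove every column whose name starts with one of the given prefixes
remove_columns <- function(penguins_data, column_names) {
  print("Removed unwanted columns")
  penguins_data %>%
    select(-starts_with(column_names))
}

# Keep only the first word of the species name, e.g. "Adelie"
shorten_species <- function(penguins_data) {
  print("Shortened species names")
  penguins_data %>%
    mutate(species = sub(" .*", "", species))
}

# Remove rows and columns that are completely empty
remove_empty_columns_rows <- function(penguins_data) {
  print("Removed empty columns and rows")
  penguins_data %>%
    remove_empty(c("rows", "cols"))
}

# Made-up data with the same kinds of problems as penguins_raw
toy_raw <- tibble(
  Species = c("Adelie Penguin (Pygoscelis adeliae)", "Gentoo penguin (Pygoscelis papua)"),
  `Body Mass (g)` = c(3750, 5000),
  Comments = c(NA, NA)
)

toy_clean <- toy_raw %>%
  clean_column_names() %>%
  remove_columns(c("comments", "delta")) %>%
  shorten_species() %>%
  remove_empty_columns_rows()

toy_clean
```

Each print statement sits before the pipe, so the function still returns the data frame rather than the message.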



Now we can use these functions in our `rmd` file. 


In [3]:
# ---- The top of our RMarkdown file ----
library(tidyverse)
library(palmerpenguins)
library(here)
library(janitor)

source(here("functions", "cleaning.R"))
# -------------------------------------

# ---- Load the raw data ----
penguins_raw <- read_csv(here("data", "penguins_raw.csv"), show_col_types = FALSE)
# -------------------------------------

# ---- Using our functions for cleaning ----
penguins_clean <- clean_column_names(penguins_raw)
penguins_clean <- remove_columns(penguins_clean, c("comments", "delta"))
penguins_clean <- shorten_species(penguins_clean)
penguins_clean <- remove_empty_columns_rows(penguins_clean)
# -------------------------------------



── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
here() starts at /Users/lfrance/Library/CloudStorage/OneDrive-Nexus365/000_AlanTuring/004_REG/001_Teaching/R_teaching


Attaching package: ‘janitor’


The following obje

### ⚠️ **We are overwriting**.

Take some time to now make a pipe with these functions instead.

If you prefer, you can also make a function called "cleaning_penguins" with the pipes inside. 


In [4]:
# ---- The top of our RMarkdown file ----
library(tidyverse)
library(palmerpenguins)
library(here)
library(janitor)

source(here("functions", "cleaning.R"))
# -------------------------------------

# ---- Save the raw data ----
# write_csv(penguins_raw, here("data", "penguins_raw.csv"))
# -------------------------------------

# ---- Load the raw data ----
penguins_raw <- read_csv(here("data", "penguins_raw.csv"), show_col_types = FALSE)
# -------------------------------------

colnames(penguins_raw)

# ---- Using our functions for cleaning from cleaning.R ----
penguins_clean <- penguins_raw %>%
  clean_column_names() %>%
  remove_columns(c("comments", "delta")) %>%
  shorten_species() %>%
  remove_empty_columns_rows()
# -------------------------------------

# ---- Check the output ----
colnames(penguins_clean)
# -------------------------------------

# ---- Save the clean data ----
write_csv(penguins_clean, here("data", "penguins_clean.csv"))
# -------------------------------------
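
If you chose to wrap the pipe in a `cleaning_penguins()` function, it could look roughly like the sketch below. The four helper functions here are hypothetical stand-ins, written so that the example runs on its own; the real definitions in your `cleaning.R` may differ. The `toy_raw` data frame is made up and only mimics the style of the raw column names.

```r
library(dplyr)
library(janitor)

# Hypothetical stand-ins for the helpers in cleaning.R.
# The real definitions in your file may differ.
clean_column_names <- function(data) {
  data %>% clean_names()
}

remove_columns <- function(data, column_names) {
  data %>% select(-starts_with(column_names))
}

shorten_species <- function(data) {
  # Keep only the first word, e.g. "Adelie Penguin (...)" becomes "Adelie"
  data %>% mutate(species = sub(" .*$", "", species))
}

remove_empty_columns_rows <- function(data) {
  data %>% remove_empty(c("rows", "cols"))
}

# One function that wraps the whole cleaning pipe
cleaning_penguins <- function(raw_data) {
  raw_data %>%
    clean_column_names() %>%
    remove_columns(c("comments", "delta")) %>%
    shorten_species() %>%
    remove_empty_columns_rows()
}

# A tiny made-up data frame; the second row is completely empty
toy_raw <- data.frame(
  "Species" = c("Adelie Penguin (Pygoscelis adeliae)", NA, "Gentoo penguin (Pygoscelis papua)"),
  "Body Mass (g)" = c(3750, NA, 5000),
  "Comments" = c("Not enough blood for isotopes.", NA, NA),
  "Delta 15 N (o/oo)" = c(NA, NA, 8.4),
  check.names = FALSE
)

toy_clean <- cleaning_penguins(toy_raw)
colnames(toy_clean)
nrow(toy_clean)
```

With a function like this, the cleaning step in the RMarkdown file shrinks to one line, `penguins_clean <- cleaning_penguins(penguins_raw)`, and the details stay in `cleaning.R`.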




## (21) Subset the Data

As part of data pipelines, we often want to subset the data for plotting or analysis. We may only want to look at two columns, the species and the body mass. We can do this by using the `select()` function. 



In [7]:
body_mass <- penguins_clean %>%
  select(species, body_mass_g)

head(body_mass)


| species | body_mass_g |
| --- | --- |
| `<chr>` | `<dbl>` |
| Adelie | 3750.0 |
| Adelie | 3800.0 |
| Adelie | 3250.0 |
| Adelie | NA |
| Adelie | 3450.0 |
| Adelie | 3650.0 |


We can see there is an `NA` in the fourth row, which means a value is missing. 

We have already removed completely empty rows, but here only a single value is missing: the rest of the row still contains data. 


In [25]:
# Count the number of rows and the number of missing values
print(paste("Number of rows:", nrow(body_mass)))
print(paste("Number of missing values:", sum(is.na(body_mass))))



[1] "Number of rows: 344"
[1] "Number of missing values: 2"
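
Note that `sum(is.na(...))` counts missing cells across the whole table, not rows. The sketch below uses a small made-up data frame (`toy`, not the penguin data) to show how to count missing values per column, and how to tell a row with one missing value apart from a completely empty row.

```r
# A made-up data frame: row 2 has one missing value, row 3 is completely empty
toy <- data.frame(
  species = c("Adelie", "Adelie", NA, "Gentoo"),
  body_mass_g = c(3750, NA, NA, 5000)
)

# Total number of missing cells in the table
total_missing <- sum(is.na(toy))

# Missing values per column
missing_per_column <- colSums(is.na(toy))

# Rows with at least one missing value
rows_with_na <- sum(!complete.cases(toy))

# Rows where every value is missing
empty_rows <- sum(rowSums(is.na(toy)) == ncol(toy))

print(missing_per_column)
print(c(total = total_missing, rows_with_na = rows_with_na, empty_rows = empty_rows))
```

Running `colSums(is.na(body_mass))` on the lesson data in the same way shows which column the missing values sit in.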


## (22) Removing Missing Values

We can remove rows that contain missing values by using the `na.omit()` function. Again, you will find this in the `cleaning.R` file, which is where the `remove_NA()` function used below comes from. 
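
As a reminder of what such helpers can look like, here is a minimal sketch. Both definitions are assumptions made for illustration (the versions in your `cleaning.R` may differ), and `toy_clean` is a made-up stand-in for `penguins_clean`.

```r
library(dplyr)

# Hypothetical definitions; the versions in your cleaning.R may differ
subset_columns <- function(data, column_names) {
  data %>% select(all_of(column_names))
}

remove_NA <- function(data) {
  # na.omit() drops every row that contains at least one NA
  data %>% na.omit()
}

# A made-up data frame standing in for penguins_clean
toy_clean <- data.frame(
  species = c("Adelie", "Adelie", "Gentoo"),
  island = c("Torgersen", "Torgersen", "Biscoe"),
  body_mass_g = c(3750, NA, 5000)
)

toy_body_mass <- toy_clean %>%
  subset_columns(c("species", "body_mass_g")) %>%
  remove_NA()

nrow(toy_body_mass)
```

Because `na.omit()` drops whole rows, it is usually worth selecting the columns you need first. Otherwise a missing value in a column you do not care about would remove the row as well.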


In [None]:
body_mass <- subset_columns(penguins_clean, c("species", "body_mass_g"))
body_mass <- remove_NA(body_mass)

### ⚠️ **We are overwriting**.


In [27]:
body_mass <- penguins_clean %>%
  select(species, body_mass_g) %>%
  remove_NA()

print(paste("Number of rows:", nrow(body_mass)))
print(paste("Number of missing values:", sum(is.na(body_mass))))


[1] "Number of rows: 342"
[1] "Number of missing values: 0"


## (23) Filter by species

Another kind of subsetting is to filter by a specific value. In this case we might want to look at just the Adelie penguins. 

In [31]:
adelie_data <- penguins_clean %>%
  filter(species == "Adelie")

head(adelie_data)



| study_name | sample_number | species | region | island | stage | individual_id | clutch_completion | date_egg | culmen_length_mm | culmen_depth_mm | flipper_length_mm | body_mass_g | sex |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| `<chr>` | `<dbl>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<date>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` |
| PAL0708 | 1 | Adelie | Anvers | Torgersen | Adult, 1 Egg Stage | N1A1 | Yes | 2007-11-11 | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
| PAL0708 | 2 | Adelie | Anvers | Torgersen | Adult, 1 Egg Stage | N1A2 | Yes | 2007-11-11 | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |
| PAL0708 | 3 | Adelie | Anvers | Torgersen | Adult, 1 Egg Stage | N2A1 | Yes | 2007-11-16 | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |
| PAL0708 | 4 | Adelie | Anvers | Torgersen | Adult, 1 Egg Stage | N2A2 | Yes | 2007-11-16 | NA | NA | NA | NA | NA |
| PAL0708 | 5 | Adelie | Anvers | Torgersen | Adult, 1 Egg Stage | N3A1 | Yes | 2007-11-16 | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE |
| PAL0708 | 6 | Adelie | Anvers | Torgersen | Adult, 1 Egg Stage | N3A2 | Yes | 2007-11-16 | 39.3 | 20.6 | 190.0 | 3650.0 | MALE |


We can also combine to get just the body mass of Adelie penguins. 

In [30]:
adelie_body_mass <- penguins_clean %>%
  filter(species == "Adelie") %>%
  select(species, body_mass_g) %>%
  remove_NA()

head(adelie_body_mass)



| species | body_mass_g |
| --- | --- |
| `<chr>` | `<dbl>` |
| Adelie | 3750 |
| Adelie | 3800 |
| Adelie | 3250 |
| Adelie | 3450 |
| Adelie | 3650 |
| Adelie | 3625 |


Every time, we start with `penguins_clean` so we do not overwrite our data. 
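
To keep things modular, the filtering step could also be wrapped in a small function, as in the sketch below. `filter_by_species()` is a hypothetical helper, not something from the lesson files, and `toy_clean` is a made-up stand-in for `penguins_clean`. The example also shows that each subset is a new object and the starting data stays the same.

```r
library(dplyr)

# Hypothetical helper, in the same style as the functions in cleaning.R
filter_by_species <- function(data, selected_species) {
  data %>% filter(species == selected_species)
}

# A made-up data frame standing in for penguins_clean
toy_clean <- data.frame(
  species = c("Adelie", "Adelie", "Gentoo", "Chinstrap"),
  body_mass_g = c(3750, 3800, 5000, 3500)
)

# Each subset starts from the same clean data and gets its own name
adelie_toy <- filter_by_species(toy_clean, "Adelie")
gentoo_toy <- filter_by_species(toy_clean, "Gentoo")

nrow(adelie_toy)
nrow(gentoo_toy)
nrow(toy_clean)  # the starting data still has all of its rows
```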

## Installing libraries the reproducible way

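Until now, we have been using `install.packages()` to install the libraries. This is not the best way to do it, because it is not reproducible: it installs whichever version is current on the day you run it, so two people can end up with different versions. If we put it in a script, it will also reinstall the libraries every time we run the script, which is very slow. 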
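
One common stopgap for the speed problem is to install only what is missing, as sketched below. `find_missing_packages()` is a hypothetical helper written for this example. Note that this only avoids reinstalling: it does nothing to record or fix package versions, which is the part `renv` handles.

```r
# Return the packages from `packages` that are not installed yet
find_missing_packages <- function(packages) {
  is_installed <- vapply(packages, requireNamespace, logical(1), quietly = TRUE)
  packages[!is_installed]
}

# "stats" and "utils" ship with R, so nothing is missing here
missing <- find_missing_packages(c("stats", "utils"))

# Only install when something is actually missing
if (length(missing) > 0) {
  install.packages(missing)
}
```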

<img src="https://rstudio.github.io/renv/logo.svg" alt="Drawing" style="display:block; margin:auto;width: 200px;"/>


### (24) Install `renv`

Check your files and see if you already have an `renv` folder. If you do, you can skip this step. If you don't, install `renv` by running the following code:

In [None]:
install.packages("renv")

We will now initialise renv:

In [None]:
renv::init()

You will now see a `renv` folder in your working directory. We're going to use this to keep a record of the libraries we have installed. 
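
Besides the `renv` folder, `renv::init()` also creates an `renv.lock` file and an `.Rprofile` file in the project. A small sketch for checking which of these exist is shown below; `renv_project_state()` is a hypothetical helper written for this example, not part of `renv`.

```r
# Report which renv files exist in a project folder
renv_project_state <- function(project_dir = ".") {
  c(
    renv_folder = dir.exists(file.path(project_dir, "renv")),
    lockfile = file.exists(file.path(project_dir, "renv.lock")),
    rprofile = file.exists(file.path(project_dir, ".Rprofile"))
  )
}

# Check the current working directory
renv_project_state()
```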

Now try installing a new library. You might want to pick `modelsummary` or `table1` as an example, but you can pick any library. 

In [None]:
install.packages("modelsummary")

### (25) Create a Snapshot

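We can now create a snapshot of the libraries our project uses. This writes them to a file called `renv.lock`, which records each library and its version. By default, `renv::snapshot()` only records libraries that your code actually uses (for example through `library()`), so a library you installed but never loaded in a script may not appear. 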


In [None]:
renv::snapshot()

### (26) Look at the lock file

Open the `renv.lock` file and look at the libraries that have been recorded. You'll notice that it also lists many other libraries: these are the dependencies of the ones you asked for. You can also print a report on the state of the project and its libraries:

In [None]:
renv::diagnostics()
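
The lockfile is plain JSON, so it can also be read into R. The sketch below assumes the `jsonlite` package is installed and uses a made-up, minimal lockfile written to a temporary file; a real `renv.lock` has the same overall shape but many more packages and fields. To try it on your own project, replace `lock_path` with `"renv.lock"`.

```r
library(jsonlite)

# A made-up, minimal lockfile for illustration only
example_lock <- '{
  "R": { "Version": "4.4.1" },
  "Packages": {
    "here": { "Package": "here", "Version": "1.0.1", "Source": "Repository" },
    "janitor": { "Package": "janitor", "Version": "2.2.0", "Source": "Repository" }
  }
}'
lock_path <- tempfile(fileext = ".lock")
writeLines(example_lock, lock_path)

# Read the JSON as nested lists
lock <- fromJSON(lock_path, simplifyVector = FALSE)

# Build a small table of package names and versions
lock_table <- data.frame(
  package = names(lock$Packages),
  version = unname(vapply(lock$Packages, function(p) p$Version, character(1)))
)

lock_table
```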

### (27) Restore the snapshot

When someone else wants to run our code, they can download all our folders and subfolders, including our new lockfile. They can then restore the snapshot to ensure they have the same libraries installed. 

In [None]:
renv::restore()

This is faster and more reliable than installing the libraries manually, because `renv::restore()` installs the versions recorded in the lockfile. 

### (28) Reopen RProject

Save everything and close RStudio. Reopen your project by opening its `.Rproj` file. 

In the bottom left-hand corner you'll notice it says `renv` connected, which means that `renv` is active. Run the following to check:

In [None]:
renv::status()

# Conclusion

In this lesson, we've covered several key aspects of reproducible research and good coding practices in R:

1. Creating a structured project folder
2. Using relative file paths with the `here` package
3. Writing modular, reusable code with functions
4. Separating data cleaning steps into a dedicated script
5. Managing package dependencies with `renv`

By implementing these practices, we've built a robust, reproducible data pipeline for our penguin data analysis project. This approach offers several benefits:

- **Reproducibility**: Anyone can now run our analysis and get the same results.
- **Collaboration**: Our well-organised project structure makes it easier for others to understand and contribute to our work.
- **Maintainability**: Modular code and clear documentation make it easier to update or extend our analysis in the future.
- **Portability**: Using `renv` ensures our code will work across different environments and over time.

Remember, reproducible research is not just about following a set of rules, but about cultivating a mindset of transparency and rigour in your scientific work. Keep refining these practices and exploring new tools that can enhance the reproducibility of your work.

By making your research reproducible, you're not only improving the quality and credibility of your own work, but also contributing to the broader scientific community by enabling others to build upon your findings more easily.
