# Reproducible Research


Images from [The Turing Way](https://the-turing-way.netlify.app/reproducible-research/reproducible-research.html)! 

<img src="https://zenodo.org/record/3332808/files/1728_TURI_Book%20sprint_5%20turing%20way_040619.jpg" alt="Drawing" style="width: 300px;"/>

## What is Reproducible Research?

<img src="https://the-turing-way.netlify.app/_images/reproducibility.jpg" alt="Drawing" style="width: 600px;"/>


Scientific results and evidence are strengthened if those results can be replicated and confirmed by several independent researchers. 

What you do on your computer is as much a part of the scientific method as fieldwork and lab work.

> Major media outlets have reported on investigations showing that a significant percentage of scientific studies cannot be reproduced. 
> This leads to other academics and society losing trust in scientific results.
>
> https://www.nature.com/articles/533452a

**Reproducible**

- Authors provide all the necessary data and computer code to run the analysis again, re-creating the results.

**Replicable**

- A study that arrives at the same scientific findings as another study, collecting new data (possibly with different methods) and completing new analyses.

<img src="https://oliviergimenez.github.io/reproducible-science-workshop/slides/assets/definitions.jpg" alt="Drawing" style="width: 600px;"/>




# Why bother?

### Helps YOU

<img src="https://zenodo.org/record/3332808/files/1728_TURI_Book sprint_19 readable code_040619.jpg" alt="Drawing" style="width: 500px;"/>


Usually the person trying to redo your work and analysis is yourself at a later date!

A lot of time is wasted redoing analyses because you can't read or understand your own code from a few weeks or years ago, or redoing figures because you've lost the code that generated them. 

It also acts as a fail-safe backup for your work!

### Helps open science

<img src="https://the-turing-way.netlify.app/_images/evolution-open-research.png" alt="Drawing" style="width: 500px;"/>


Open research aims to transform research by making it more reproducible, transparent, reusable, collaborative, accountable, and accessible to society. To achieve this, research outputs should:


- Be publicly available: It is difficult to use and benefit from knowledge hidden behind barriers such as passwords and paywalls.

- Be reusable: Research outputs need to be licensed appropriately, so that prospective users know any limitations on re-use.

- Be transparent: With appropriate metadata to provide clear statements of how research output was produced and what it contains.

Open practices can make it easier for researchers to connect by increasing the discoverability and visibility of one’s work, facilitating rapid access to novel data and software resources, and creating new opportunities to interact with and contribute to ongoing communal projects.

### Helps you publish
<img src="https://www.turing.ac.uk/sites/default/files/inline-images/Culture%20shift.jpg" alt="Drawing" style="width: 400px;"/>



It is becoming more common for code release to be part of the scientific publishing and grant approval process. 

# Palmer Penguins


This data set on Palmer penguins has nice examples all over the internet. It contains morphometric data from three species of penguin.

<img src="https://allisonhorst.github.io/palmerpenguins/reference/figures/lter_penguins.png" alt="Drawing" style="width: 400px;"/>

<img src="https://allisonhorst.github.io/palmerpenguins/reference/figures/culmen_depth.png" alt="Drawing" style="width: 350px;"/>

# Quick Setup for your new RProject

We are creating a **Data Pipeline**. 

<img src="https://i.imgur.com/AIwZP7a.png" alt="Drawing" style="width: 900px;"/>

In our last lesson, we went through the steps of creating a new R project, saving raw data, cleaning data, and saving the clean data. 

Here are the steps to do that again:

## `00` New Project


In RStudio, create a "New Project..."

<img src="https://d33wubrfki0l68.cloudfront.net/87562c4851bf4d0b0c8415dc9d6690493572e362/a6685/screenshots/rstudio-project-1.png" alt="Drawing" style="width: 400px;"/>

<img src="https://d33wubrfki0l68.cloudfront.net/0fa791e2621be297cb9c5cac0b2802223e3d7714/57d89/screenshots/rstudio-project-2.png" alt="Drawing" style="width: 400px;"/>

<img src="https://d33wubrfki0l68.cloudfront.net/dee6324df1f5c5121d1c1e9eed822ee52c87167b/1f325/screenshots/rstudio-project-3.png" alt="Drawing" style="width: 400px;"/>

### DO NOT SAVE IT TO YOUR DESKTOP

Instead navigate somewhere sensible (hint! you are strongly advised to use your university OneDrive account). 

`C:/Users/lfrance/OneDrive/BiologyComputerSkills/`

You will then be asked for a name of your Project which will become the name of your new folder. 

### Make sure your file and folder names are readable for computers and humans

Not human readable folder names:

- `shfgljshdfg`
- `thingy`
- `Untitled22837682673`
- `code`

These have zero information for humans about what the files are about. 

Not computer readable folder names:

- `My Project in R`
- `PenguinProject(new)`
- `PenguinProjectThisOne!!!`

Punctuation marks will cause problems for your computer reading these files. Spaces also cause problems. 

Use dashes or underscores or capital letters rather than spaces. 

- `PenguinProject`
- `Penguin-Project`
- `Penguin_Project`
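
As a quick check, the naming rules above can be tested in R. This is a minimal sketch using base R only. `is_computer_readable` is a hypothetical helper name, and the rule it encodes (letters, digits, dashes and underscores only) is an assumption that is stricter than most computers actually require.

```r
# Hypothetical helper: TRUE when a name contains only letters, digits,
# dashes or underscores (so no spaces and no other punctuation).
is_computer_readable <- function(folder_names) {
  grepl("^[A-Za-z0-9_-]+$", folder_names)
}

folder_names <- c("My Project in R", "PenguinProject(new)",
                  "Penguin-Project", "Penguin_Project")

# Named logical vector: one TRUE/FALSE per folder name
setNames(is_computer_readable(folder_names), folder_names)
```

A check like this only covers the computer side. A name such as `thingy` passes it, but still tells a human nothing about what is inside.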

## `01` Your New Project

You now have a folder which looks like this:

<img src="image/directory_001.png" alt="Drawing" style="width: 400px;"/>

### What is this file?

A project file saves your settings and configurations for you so you can quit R and come back to this folder and continue from where you left off.  

For now, ignore it. 

## `02` Create an R Script

An R file is where we write our code. Create a new file by pressing "New". It's your choice whether you want to use RMarkdown, Quarto, or R files. I will use an RMarkdown file, which has the extension `.rmd`. 

<img src="image/img_001_newfile.png" alt="Drawing" style="width: 300px;"/>

Give this a human and computer readable name. 

<img src="image/img_002_newrmd.png" alt="Drawing" style="width: 400px;"/>

Save your file so that your folder looks like this:

<img src="image/directory_002.png" alt="Drawing" style="width: 400px;"/>

<img src="image/img_003_dirInR.png" alt="Drawing" style="width: 500px;"/>


## `03` Code is for Humans

Your `rmd` file is a document written for other people. You should describe what it contains and what it does clearly. If you open your work from last term you will have forgotten what you were doing, and so writing out what you are doing and why is always critically important. 

Use the top of your document to describe what you are doing today. Hint: we will be doing data exploration of the Palmer Penguins data set. 

You can press "visual" near the top left corner of your open script to make the page more readable if you prefer.

*If you are using an R file, make sure you use `#` to write comments.*

<img src="image/img_004_mdText.png" alt="Drawing" style="width: 500px;"/>

To create a readable document, press the "Knit" button and select PDF. 

<img src="image/img_007_knit.png" alt="Drawing" style="width: 300px;"/>

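If you get an error, run the following code in your console:

`install.packages("tinytex")`

`tinytex::install_tinytex()`

The first line installs the `tinytex` R package. The second uses it to install TinyTeX, the LaTeX distribution that is needed to make PDFs.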

A window will pop up with your new PDF version of your work. This will also be visible as a PDF file in your folder. Remember, the PDF is not updated automatically: you need to press the Knit button again each time you want to update it. 

<img src="image/img_008_pdf.png" alt="Drawing" style="width: 800px;"/>

This is the document version of your work, which you should try to produce for your assignment of this session. 

`QUESTION:` What does echo=FALSE do on line 33 in the screenshot?

## `04` Installing Packages

Feel free to use the interface if it makes life easier. 

<img src="image/img_005_package.png" alt="Drawing" style="width: 500px;"/>


We will now write some R code to install some libraries and load them. If you are using `rmd`, your code block starts with 

` ```{r name-of-code-block}`

and ends with 

` ``` `

You will need the following packages:



In [12]:
install.packages("ggplot2")
install.packages("palmerpenguins")
install.packages("janitor")
install.packages("dplyr")

# There is a faster way to do this if you prefer:
install.packages(c("ggplot2", "palmerpenguins", "janitor", "dplyr"))



Installing package into ‘/opt/homebrew/lib/R/4.3/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘forcats’, ‘plyr’, ‘progress’, ‘reshape’


Installing packages into ‘/opt/homebrew/lib/R/4.3/site-library’
(as ‘lib’ is unspecified)
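
Installing packages again every time the code runs is slow. One option, sketched here with base R only, is to first work out which packages are actually missing. `find_missing_packages` is a hypothetical helper, not part of any package, and the install line is left commented out so that running the example changes nothing on your computer.

```r
# Hypothetical helper: which of the wanted packages are not yet installed?
# By default it compares against the packages R can currently find.
find_missing_packages <- function(wanted,
                                  installed = rownames(installed.packages())) {
  setdiff(wanted, installed)
}

wanted <- c("ggplot2", "palmerpenguins", "janitor", "dplyr")
missing_packages <- find_missing_packages(wanted)

# An empty result means everything is already installed
missing_packages

# To install only the missing ones, uncomment:
# if (length(missing_packages) > 0) install.packages(missing_packages)
```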



## `05` Loading Packages

Now they are installed, the packages need to be loaded. 

In [18]:
# Load the packages:
library(ggplot2)
library(palmerpenguins)
library(janitor)
library(dplyr)


Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2



## `06` Load Data

The `palmerpenguins` package contains a data set called `penguins_raw`. 

`QUESTION:` What problems can you see in this data set?

In [9]:
# Look at the table with the raw data

head(penguins_raw)



studyName,Sample Number,Species,Region,Island,Stage,Individual ID,Clutch Completion,Date Egg,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex,Delta 15 N (o/oo),Delta 13 C (o/oo),Comments
<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<date>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<chr>
PAL0708,1,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A1,Yes,2007-11-11,39.1,18.7,181.0,3750.0,MALE,,,Not enough blood for isotopes.
PAL0708,2,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A2,Yes,2007-11-11,39.5,17.4,186.0,3800.0,FEMALE,8.94956,-24.69454,
PAL0708,3,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A1,Yes,2007-11-16,40.3,18.0,195.0,3250.0,FEMALE,8.36821,-25.33302,
PAL0708,4,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A2,Yes,2007-11-16,,,,,,,,Adult not sampled.
PAL0708,5,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N3A1,Yes,2007-11-16,36.7,19.3,193.0,3450.0,FEMALE,8.76651,-25.32426,
PAL0708,6,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N3A2,Yes,2007-11-16,39.3,20.6,190.0,3650.0,MALE,8.66496,-25.29805,


## `07` Clean the Data

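We can see several problems with the data: the column names contain spaces, capital letters and brackets, the species names are very long, and some values are missing. We will use the `janitor` package to help us clean the data. Remember, we need to save the raw data first. 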

Create a new folder called `data` in your project folder:

<img src="image/img_006_dataDir.png" alt="Drawing" style="width: 400px;"/>

Now you have a safe place to save your raw data before we start cleaning it. 



In [None]:
write.csv(penguins_raw, "data/penguins_raw.csv")
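
By default `write.csv()` also writes the row numbers as an extra, unnamed column, which comes back as a column called `X` when the file is read again. The sketch below shows how `row.names = FALSE` avoids this, and how to create the `data` folder from R if it is missing. It uses a small made-up data frame and a temporary folder, both assumptions for illustration rather than the penguin data.

```r
# A small made-up data frame standing in for the raw data
toy_data <- data.frame(species = c("Adelie", "Gentoo"),
                       body_mass_g = c(3750, 4500))

# Work in a temporary folder so this example does not touch your project
data_dir <- file.path(tempdir(), "data")
if (!dir.exists(data_dir)) {
  dir.create(data_dir)
}

with_row_names    <- file.path(data_dir, "toy_with_row_names.csv")
without_row_names <- file.path(data_dir, "toy_without_row_names.csv")

write.csv(toy_data, with_row_names)                        # default behaviour
write.csv(toy_data, without_row_names, row.names = FALSE)  # no extra column

names(read.csv(with_row_names))     # includes an extra "X" column
names(read.csv(without_row_names))  # same columns as toy_data

# Check the files really were saved
file.exists(c(with_row_names, without_row_names))
```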


## <span style='background:yellow'> Exercise 01 (5 mins) </span>


### CHECK YOUR DATA IS SAVED

<img src="image/directory_004.png" alt="Drawing" style="width: 400px;"/>


Now we can reuse the code we created last time to clean this data set. It did the following steps:


In [14]:
# Loading the data from the saved version is good practice here. 
# penguins_raw <- read.csv("data/penguins_raw.csv")

# Check what the column names look like:
names(penguins_raw)

# Create a new variable called penguins_clean.
# We will remove the two Delta columns and the Comments column.
penguins_clean <- select(penguins_raw, -starts_with("Delta"))
penguins_clean <- select(penguins_clean, -Comments)

# Check the column names in the new data frame:
names(penguins_clean)

## `08` Piping

As a recap, we can use pipes to apply several steps to a data frame in one go. This avoids retyping and overwriting the intermediate variable at each step, which is an easy place to make mistakes. 

Read the code below as: "make a new variable called penguins_clean from penguins_raw after doing the following steps to it". 

In [15]:
# Check what the column names look like:
names(penguins_raw)

#  ---- OLD VERSION OF THE CODE ----
# penguins_clean <- select(penguins_raw,-starts_with("Delta"))
# penguins_clean <- select(penguins_clean,-Comments)
# ----------------------------------

penguins_clean <- penguins_raw %>%
  select(-starts_with("Delta")) %>%
  select(-Comments) %>%
  clean_names()

# Check the column names in the new data frame:
names(penguins_clean)

The `clean_names()` function comes from `janitor` and fixes all the problems with spaces and capital letters for us. 
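
To see what a pipe does, it can help to compare nested function calls with the piped version. This sketch uses the native `|>` pipe that is built into R 4.1 and later, so it runs without loading any packages. For simple chains like this one it behaves the same way as `%>%`. The numbers are made up.

```r
flipper_lengths <- c(181, 186, 195, 193)

# Nested calls are read from the inside out
nested <- round(sqrt(mean(flipper_lengths)), 1)

# A pipe is read top to bottom: each step feeds into the next
piped <- flipper_lengths |>
  mean() |>
  sqrt() |>
  round(1)

# Both versions give the same answer
identical(nested, piped)
```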

## `09` Functions

Instead of writing the same lines of code repeatedly, we can package them into a function and just call that function whenever needed. We use functions from packages (like the `clean_names` and `select` functions) but we can also write our own custom ones. 

You may want to run different analyses on the same data set. Rather than writing out the same code repeatedly across different `rmd` files, it makes more sense to write functions which can be shared across those files. 



In [34]:
# A function to remove the Delta and Comments columns and make sure
# the column names are cleaned up, e.g. lower case and snake case
clean_column_names <- function(penguins_data) {
    penguins_data %>%
        select(-starts_with("Delta")) %>%
        select(-Comments) %>%
        clean_names()
}


### Defining a Function

The anatomy of this function is as follows:

`clean_column_names` : The name of your function, which should describe what it does

`function` : Tells R you are defining a function

`(penguins_data)` : What the function needs to work. In this case it is expecting a data frame. The name doesn't matter: you could call it `x`, as long as you use the same name inside the function. When you use the function, whatever you give it will be treated as `penguins_data`

`{` : Starts the function definition

`penguins_data %>%` : Starts the pipe

`clean_names()` : The cleaning function from `janitor`

`}` : The end of the function definition
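
The same anatomy applies to any function. As a small sketch with a made-up data frame, the two hypothetical functions below do the same job. Only the argument name differs, which shows that the name is just a label used inside the function.

```r
# Same body, different argument names
count_rows_a <- function(penguins_data) {
  nrow(penguins_data)
}

count_rows_b <- function(x) {
  nrow(x)
}

# A made-up data frame to try them on
toy_data <- data.frame(species = c("Adelie", "Gentoo", "Chinstrap"))

count_rows_a(toy_data)
count_rows_b(toy_data)
```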

### Calling a Function

We have defined the function, but we haven't actually used it yet. We need to **call** the function next. 

In [52]:
# Check what the column names look like:
names(penguins_raw)

#  ---- OLD VERSION OF THE CODE ----
# penguins_clean <- select(penguins_raw,-starts_with("Delta"))
# penguins_clean <- select(penguins_clean,-Comments)
# ----------------------------------

#  ---- OLD VERSION OF THE CODE ----
# penguins_clean <- penguins_raw %>%
#   select(-starts_with("Delta")) %>%
#   select(-Comments) %>%
#   clean_names()
# ----------------------------------

# Define Function
clean_column_names <- function(penguins_data) {
    penguins_data %>%
        select(-starts_with("Delta")) %>%
        select(-Comments) %>%
        clean_names()
}

# Call Function
penguins_clean <- clean_column_names(penguins_raw)

# Check the column names in the new data frame:
names(penguins_clean)




## `10` Cleaning File

Writing functions is like creating tools that you can easily reuse and apply to your data. It's often helpful to store your tools in a place that means they can be used in multiple analysis files. 

## <span style='background:yellow'> Exercise 02 (5 mins) </span>


### Create a `functions` subfolder

### Inside, create an R file called `cleaning.r`. 

<img src="image/directory_007.png" alt="Drawing" style="width: 400px;"/>


This time we are writing an R file rather than `.rmd`. I have written a series of functions that clean up the Palmer Penguins dataset. [Open this file and copy it into your project.](https://github.com/LydiaFrance/Reproducible_Figures_R/blob/recap_lessons/PenguinProjectExample/functions/cleaning.r) 


### Paste in the contents 
Copy and paste the contents into your file.  

### Load this file into your `.rmd` file


In [50]:
source("functions/cleaning.r")
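
`source()` runs every line of the file it is given, so any functions defined in that file become available in your session. A minimal sketch of this, using a temporary file and a hypothetical function `say_hello` rather than the real cleaning file:

```r
# Write a tiny R file containing one function definition
function_file <- tempfile(fileext = ".r")
writeLines("say_hello <- function(name) paste('Hello', name)", function_file)

# Running the file defines say_hello() in the current session
source(function_file)

say_hello("penguins")
```

Note that the file name must match exactly, including upper and lower case, on computers where file names are case sensitive.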

Test out one of the functions from the file. 

In [51]:

names(penguins_raw)

penguins_clean <- clean_column_names(penguins_raw)

names(penguins_clean)



Test out another function. This one, `shorten_species`, makes the species names easier to read. 

In [None]:

# Another function fixes the species names. They're currently too long:
head(penguins_clean$species)

penguins_clean_species <- shorten_species(penguins_clean)

head(penguins_clean_species$species)


We would like to use multiple functions from this file, so it is best practice to pipe them. 

Because your functions have sensible names, and each one only does one thing, someone reading your code can understand what is happening without having to see all the gory details. However, they can easily find these functions within your cleaning file if they want to see what each one is doing. 

In [54]:

# Because we want to use multiple functions, we should pipe them. 
penguins_clean <- penguins_raw %>%
    clean_column_names() %>%
    shorten_species() %>%
    remove_empty_columns_rows()


## `11` Subset the Data

There are functions in the cleaning file that filter and subset the data. For example there is one called:

`filter_by_species <- function(penguins_data, selected_species)`

You can see there is now an additional input in the brackets. To run this function, you need to give it the `penguins_clean` data but also the name of the species you want, for example "Gentoo". 

In [59]:
# ---- The code inside this function: ---
# filter_by_species <- function(penguins_data, selected_species) {
#     penguins_data %>%
#         filter(species == selected_species)
# }
# ---------------------------------------

gentoo_only <- filter_by_species(penguins_clean, "Gentoo")

head(gentoo_only)

study_name,sample_number,species,region,island,stage,individual_id,clutch_completion,date_egg,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex,delta_15_n_o_oo,delta_13_c_o_oo,comments
<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<date>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<chr>
PAL0708,1,Gentoo,Anvers,Biscoe,"Adult, 1 Egg Stage",N31A1,Yes,2007-11-27,46.1,13.2,211,4500,FEMALE,7.993,-25.5139,
PAL0708,2,Gentoo,Anvers,Biscoe,"Adult, 1 Egg Stage",N31A2,Yes,2007-11-27,50.0,16.3,230,5700,MALE,8.14756,-25.39369,
PAL0708,3,Gentoo,Anvers,Biscoe,"Adult, 1 Egg Stage",N32A1,Yes,2007-11-27,48.7,14.1,210,4450,FEMALE,8.14705,-25.46172,
PAL0708,4,Gentoo,Anvers,Biscoe,"Adult, 1 Egg Stage",N32A2,Yes,2007-11-27,50.0,15.2,218,5700,MALE,8.2554,-25.40075,
PAL0708,5,Gentoo,Anvers,Biscoe,"Adult, 1 Egg Stage",N33A1,Yes,2007-11-18,47.6,14.5,215,5400,MALE,8.2345,-25.54456,
PAL0708,6,Gentoo,Anvers,Biscoe,"Adult, 1 Egg Stage",N33A2,Yes,2007-11-18,46.5,13.5,210,4550,FEMALE,7.9953,-25.32829,


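If you want both Gentoo and Adelie, you can pass a character vector of species names into the function. Be careful: because the function compares with `==`, R recycles the vector along the rows and silently drops some of them (the output below only contains even sample numbers). To match any of several species, the comparison inside the function needs to be `species %in% selected_species`. 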

In [60]:
gentoo_adelie_only <- filter_by_species(penguins_clean, c("Gentoo","Adelie"))

head(gentoo_adelie_only)


study_name,sample_number,species,region,island,stage,individual_id,clutch_completion,date_egg,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex,delta_15_n_o_oo,delta_13_c_o_oo,comments
<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<date>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<chr>
PAL0708,2,Adelie,Anvers,Torgersen,"Adult, 1 Egg Stage",N1A2,Yes,2007-11-11,39.5,17.4,186.0,3800.0,FEMALE,8.94956,-24.69454,
PAL0708,4,Adelie,Anvers,Torgersen,"Adult, 1 Egg Stage",N2A2,Yes,2007-11-16,,,,,,,,Adult not sampled.
PAL0708,6,Adelie,Anvers,Torgersen,"Adult, 1 Egg Stage",N3A2,Yes,2007-11-16,39.3,20.6,190.0,3650.0,MALE,8.66496,-25.29805,
PAL0708,8,Adelie,Anvers,Torgersen,"Adult, 1 Egg Stage",N4A2,No,2007-11-15,39.2,19.6,195.0,4675.0,MALE,9.4606,-24.89958,Nest never observed with full clutch.
PAL0708,10,Adelie,Anvers,Torgersen,"Adult, 1 Egg Stage",N5A2,Yes,2007-11-09,42.0,20.2,190.0,4250.0,,9.13362,-25.09368,No blood sample obtained for sexing.
PAL0708,12,Adelie,Anvers,Torgersen,"Adult, 1 Egg Stage",N6A2,Yes,2007-11-09,37.8,17.3,180.0,3700.0,,,,No blood sample obtained.
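
The output above only contains even sample numbers, which is a sign that rows have gone missing. When `==` is given a vector of two species, R recycles that vector along the rows, so each row is compared with only one of the two names. Here is a sketch of the problem and a possible fix using `%in%`, with a made-up data frame and base R only. `filter_by_species_in` is a hypothetical name, not a function from the cleaning file.

```r
# A made-up data frame with two rows per species
toy_data <- data.frame(
  sample_number = 1:6,
  species = c("Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap", "Chinstrap")
)
wanted <- c("Gentoo", "Adelie")

# `==` recycles `wanted` along the rows:
# row 1 vs "Gentoo", row 2 vs "Adelie", row 3 vs "Gentoo", and so on
recycled <- toy_data[toy_data$species == wanted, ]
nrow(recycled)  # fewer rows than there are Adelie and Gentoo penguins

# Hypothetical variant: `%in%` asks "is this species any of the wanted ones?"
filter_by_species_in <- function(penguins_data, selected_species) {
  penguins_data[penguins_data$species %in% selected_species, ]
}

both_species <- filter_by_species_in(toy_data, wanted)
nrow(both_species)  # every Adelie and Gentoo row is kept
```

In the `dplyr` version of the function, the equivalent change would be `filter(species %in% selected_species)`. This also still works when only one species is given.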


There is another function called `subset_columns`. It takes the names of the columns you want and keeps only those:

In [64]:
# ---- The function inside the cleaning.r file: ----
# subset_columns <- function(penguins_data, column_names) {
#     penguins_data %>%
#         select(all_of(column_names))
# }
# --------------------------------------------------

species_flipper_length_only <- subset_columns(penguins_clean, c("flipper_length_mm", "species"))

head(species_flipper_length_only)

flipper_length_mm,species
<dbl>,<chr>
181.0,Adelie
186.0,Adelie
195.0,Adelie
,Adelie
193.0,Adelie
190.0,Adelie


We can see this subset contains an `NA`, or missing value. There is a function in the file, `remove_NA`, which removes any rows with missing values. It would therefore be helpful to create a pipe. 
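
As a sketch of what removing missing values does, here is a base R example on a made-up data frame. It uses `na.omit()`, which drops every row containing at least one `NA`. Whether the function in the cleaning file works this way internally is an assumption.

```r
# A made-up data frame with one missing flipper length
toy_data <- data.frame(
  flipper_length_mm = c(181, 186, NA, 193),
  species = c("Adelie", "Adelie", "Adelie", "Adelie")
)

# How many missing values are there in each column?
colSums(is.na(toy_data))

# Drop any row that has a missing value
toy_complete <- na.omit(toy_data)
nrow(toy_complete)
```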

## <span style='background:yellow'> Exercise 03 (5 mins) </span>

Create a variable called `gentoo_data` using a pipe. It should only have the culmen length and species columns, the rows should only be from the Gentoo penguins, and the missing values should be removed. 

In [None]:

# If we just want Gentoo culmen lengths:
gentoo_data <- penguins_clean %>%
    filter_by_species("Gentoo") %>%
    subset_columns(c("culmen_length_mm", "species")) %>%
    remove_NA()

head(gentoo_data)


## THE CODE SO FAR

In [66]:
# Install the packages (only needed once; comment this line out afterwards):
install.packages(c("ggplot2", "palmerpenguins", "janitor", "dplyr"))

# Load the packages:
library(ggplot2)
library(palmerpenguins)
library(janitor)
library(dplyr)

# Load the function definitions
source("functions/cleaning.r")

# Save the raw data:
write.csv(penguins_raw, "data/penguins_raw.csv")

# Check the raw data:
names(penguins_raw)

# Clean the data:
penguins_clean <- penguins_raw %>%
    clean_column_names() %>%
    shorten_species() %>%
    remove_empty_columns_rows()

# Check the cleaned data:
names(penguins_clean)

# Save cleaned data:
write.csv(penguins_clean, "data/penguins_clean.csv")

# Subset the columns and remove missing values:
flipper_data <- penguins_clean %>%
    subset_columns(c("flipper_length_mm", "species","sex")) %>%
    remove_NA()

head(flipper_data)



Installing packages into ‘/opt/homebrew/lib/R/4.3/site-library’
(as ‘lib’ is unspecified)



flipper_length_mm,species,sex
<dbl>,<chr>,<chr>
181,Adelie,MALE
186,Adelie,FEMALE
195,Adelie,FEMALE
193,Adelie,FEMALE
190,Adelie,MALE
181,Adelie,FEMALE
