# Reproducible Science in R and Figures (Part 1)

### What you should know by the end of this lesson...

- How to load data and clean it 
- Create an exploratory figure
- Communicate with a figure by changing its appearance
- Remake the same figure for a different purpose
- Save a figure for use in a report and poster

### Skills you will have by the end of the lesson...

- Using `ggplot2`
- Changing the elements in a figure
- Creating a custom function
- Reading from other R files
- Creating `png` or `svg` (vector) image files
- Functional, modular, reproducible code

Images from [The Turing Way](https://the-turing-way.netlify.app/reproducible-research/reproducible-research.html)! 

<img src="https://zenodo.org/record/3332808/files/1728_TURI_Book%20sprint_5%20turing%20way_040619.jpg" alt="Drawing" style="width: 300px;"/>


## What is Reproducible Research?

<img src="https://the-turing-way.netlify.app/_images/reproducibility.jpg" alt="Drawing" style="width: 600px;"/>


Scientific results and evidence are strengthened if those results can be replicated and confirmed by several independent researchers. 

What you do on your computer is a part of scientific methodology like fieldwork and labwork.

> Major media outlets have reported on investigations showing that a significant percentage of scientific studies cannot be reproduced. 
> This leads to other academics and society losing trust in scientific results.
>
> https://www.nature.com/articles/533452a

**Reproducible**

- Authors provide all the necessary data and computer code to run the analysis again, re-creating the results.

**Replicable**

- A study that arrives at the same scientific findings as another study, collecting new data (possibly with different methods) and completing new analyses.

<img src="https://oliviergimenez.github.io/reproducible-science-workshop/slides/assets/definitions.jpg" alt="Drawing" style="width: 600px;"/>


## Why bother?

### Helps YOU

<img src="https://the-turing-way.netlify.app/_images/project-history.jpg" alt="Drawing" style="width: 300px;"/>



Usually the person trying to redo your work and analysis is yourself at a later date!

A lot of time is wasted redoing analyses because you can't read or understand your own code from a few weeks or years ago, or redoing figures because you've lost the code that generated them.

It also acts as a fail-safe backup for your work!

### Helps open science

<img src="https://the-turing-way.netlify.app/_images/evolution-open-research.jpg" alt="Drawing" style="width: 500px;"/>


Open research aims to transform research by making it more reproducible, transparent, reusable, collaborative, accountable, and accessible to society. To achieve this, research outputs should:


- Be publicly available: It is difficult to use and benefit from knowledge hidden behind barriers such as passwords and paywalls.

- Be reusable: Research outputs need to be licensed appropriately, so that prospective users know any limitations on re-use.

- Be transparent: With appropriate metadata to provide clear statements of how research output was produced and what it contains.

Open practices can make it easier for researchers to connect by increasing the discoverability and visibility of one’s work, facilitating rapid access to novel data and software resources, and creating new opportunities to interact with and contribute to ongoing communal projects.

### Helps you publish
<img src="https://www.turing.ac.uk/sites/default/files/inline-images/Culture%20shift.jpg" alt="Drawing" style="width: 400px;"/>



It is becoming more common for code release to be part of the scientific publishing and grant approval process. 








# Penguin Data

This data set on Palmer penguins has nice examples all over the internet. It contains morphometric data from three species of penguin.

<img src="https://allisonhorst.github.io/palmerpenguins/reference/figures/lter_penguins.png" alt="Drawing" style="width: 400px;"/>

<img src="https://allisonhorst.github.io/palmerpenguins/reference/figures/culmen_depth.png" alt="Drawing" style="width: 350px;"/>


First we need to load the data from a library called `palmerpenguins` and some libraries that will help. 

In [1]:
library(palmerpenguins)
library(ggplot2)
library(janitor)
library(dplyr)


Attaching package: ‘janitor’


The following objects are masked from ‘package:stats’:

    chisq.test, fisher.test



Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




## Looking at the Data

We can look at the data as a table in order to see what issues it might have. This table has 344 rows, so we are just going to look at the first 6.

In [2]:
head(penguins_raw)

studyName,Sample Number,Species,Region,Island,Stage,Individual ID,Clutch Completion,Date Egg,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex,Delta 15 N (o/oo),Delta 13 C (o/oo),Comments
<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<date>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<chr>
PAL0708,1,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A1,Yes,2007-11-11,39.1,18.7,181.0,3750.0,MALE,,,Not enough blood for isotopes.
PAL0708,2,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A2,Yes,2007-11-11,39.5,17.4,186.0,3800.0,FEMALE,8.94956,-24.69454,
PAL0708,3,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A1,Yes,2007-11-16,40.3,18.0,195.0,3250.0,FEMALE,8.36821,-25.33302,
PAL0708,4,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A2,Yes,2007-11-16,,,,,,,,Adult not sampled.
PAL0708,5,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N3A1,Yes,2007-11-16,36.7,19.3,193.0,3450.0,FEMALE,8.76651,-25.32426,
PAL0708,6,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N3A2,Yes,2007-11-16,39.3,20.6,190.0,3650.0,MALE,8.66496,-25.29805,


One problem we can see immediately is the column names.
Let's look at just the column names:

In [3]:
names(penguins_raw)


They mix capitals and lower-case letters. Some have spaces between words, which can cause lots of issues in code, and others have no spaces at all.

- `studyName` is in camel case
- `Sample Number` is in title case, with a space
- `Culmen Length (mm)` has spaces and also brackets, which can cause further issues
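
To see why spaces and brackets are awkward, here is a minimal sketch using a small made-up data frame (`toy`, not the real penguin data). Names that are not valid R names have to be wrapped in backticks every time they are used.

```r
# A made-up data frame for illustration; check.names = FALSE keeps the
# awkward names exactly as written instead of letting R alter them.
toy <- data.frame(
  "studyName" = c("PAL0708", "PAL0708"),
  "Sample Number" = c(1, 2),
  "Culmen Length (mm)" = c(39.1, 39.5),
  check.names = FALSE
)

# A name without spaces can be used directly.
toy$studyName

# Names with spaces or brackets need backticks around them.
toy$`Sample Number`
mean_culmen <- mean(toy$`Culmen Length (mm)`)
mean_culmen

# make.names() returns the valid version of each name, so comparing it
# with the original shows which names are invalid as written.
valid_as_written <- names(toy) == make.names(names(toy))
valid_as_written
```

Only `studyName` passes the check; the other two would need backticks everywhere they appear.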

It's maybe tempting when you see these kinds of problems to go and edit the column names in Excel, along with any other issues you see.

> **Never do this.**

Editing a file in Excel is not reproducible: there is no record of what you changed, so mistakes can easily go unnoticed.

A best practice before we do anything is to save this data in a `data_raw` folder and consider it "read only"! That means it is preserved exactly as it is before we start meddling with it.



In [4]:
# row.names = FALSE avoids writing an extra column of row numbers.
write.csv(penguins_raw, "PenguinProject/data_raw/penguins_raw.csv", row.names = FALSE)

You'll see I have a directory structure that will help me keep track of what I am doing. So I made an R project in a new directory and made sure it includes a git repository for version control. 
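
A minimal sketch of this step, assuming you want the script itself to create the folder. A temporary directory and a made-up data frame (`toy_raw`) stand in for the real project folder and `penguins_raw`; both are placeholders for illustration only.

```r
# Stand-ins for the project folder and the raw data (both hypothetical).
project_dir <- file.path(tempdir(), "PenguinProject")
toy_raw <- data.frame(Species = c("Adelie", "Gentoo"), Mass = c(3750, 5000))

# Create data_raw/ if it does not exist yet.
raw_dir <- file.path(project_dir, "data_raw")
dir.create(raw_dir, recursive = TRUE, showWarnings = FALSE)

# Save the untouched copy, without an extra column of row numbers.
raw_file <- file.path(raw_dir, "toy_raw.csv")
write.csv(toy_raw, raw_file, row.names = FALSE)

# Reading it back gives the same columns we saved.
reloaded <- read.csv(raw_file)
names(reloaded)
```

Building paths with `file.path()` keeps the script working on different operating systems.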

---

## <span style='background:yellow'> Exercise 01 (5mins)  </span>

Create a new project in RStudio. Give it a name without spaces.

Open a new script, call it something like `Lab_Tutorial.R`.

Install and load the penguin library.

Load the penguin data and save it in a folder within your project called `data_raw`.

---

Once we've got that working, we can now start to fix the data set. We now have a safe, read-only copy in our directory of the data. Within R, there is a working copy of the data called `penguins_raw` which we can use for cleaning.



In [5]:
# The hyphen '-' means we want to remove that column.

penguins_raw <- select(penguins_raw, -starts_with("delta"))
penguins_raw <- select(penguins_raw, -Comments)

names(penguins_raw)


If we look at how this little bit of code works, we start with `penguins_raw`, apply a function to it (in this case, removing the columns that start with "delta"), and overwrite `penguins_raw`. Then we overwrite it again once we have removed the `Comments` column.

This is a bit of a problem. It's not very robust: if I run it again it will throw an error, because the `Comments` column has already been removed and can no longer be found.

Here's our first mistake! But that's okay, we can reload the safe copy of `penguins_raw`:

In [9]:
# Reloading the saved raw data because I overwrote penguins_raw.
penguins_raw <- read.csv("PenguinProject/data_raw/penguins_raw.csv")
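
The failure described above can be reproduced on a small made-up data frame (`toy`, hypothetical). This sketch assumes `dplyr` is installed and uses `tryCatch()` so the error is captured rather than stopping the script.

```r
library(dplyr)

toy <- data.frame(Species = c("Adelie", "Gentoo"), Comments = c("ok", NA))

# First run: the Comments column exists, so it is removed.
toy <- select(toy, -Comments)
names(toy)

# Second run: Comments is already gone, so select() raises an error.
# tryCatch() swaps the error for a short message of our own.
second_run <- tryCatch(
  select(toy, -Comments),
  error = function(e) "error: column not found"
)
second_run
```

Overwriting a variable with a changed version of itself is what makes the code unsafe to re-run.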

A better way of doing this is called piping, which is what the symbol `%>%` means. The pipe comes from the `magrittr` package and is available once `dplyr` is loaded. It means: take the first variable, `penguins_raw`, do the following steps to it in order, then save the result as something else, which in this case we're calling `penguins_clean`. Because `penguins_raw` is never overwritten, I can run this multiple times and nothing goes wrong.

In [14]:
names(penguins_raw)

penguins_clean <- penguins_raw %>%
  select(-starts_with("delta")) %>%
  select(-Comments)

names(penguins_clean)
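
To make the pipe less mysterious: `x %>% f()` is another way of writing `f(x)`. The sketch below uses a made-up data frame (`toy`, hypothetical) and assumes `dplyr` is installed. It shows that the piped and nested versions give the same result, and that the original is left untouched.

```r
library(dplyr)

toy <- data.frame(
  Species = c("Adelie", "Gentoo"),
  Delta_N = c(8.9, 8.4),
  Comments = c("ok", NA)
)

# Nested calls: read from the inside out.
nested <- select(select(toy, -starts_with("delta")), -Comments)

# Piped calls: read from top to bottom, one step per line.
piped <- toy %>%
  select(-starts_with("delta")) %>%
  select(-Comments)

identical(nested, piped)

# toy itself was never overwritten, so this can be re-run safely.
names(toy)
```

The piped version becomes easier to read than the nested one as more steps are added.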



We could take time to edit all these column names individually, but one important rule in coding is that whatever you're trying to do, someone has probably done it before. The `janitor` library has a very handy function called `clean_names()`.

If I remake our pipe with this addition:

In [16]:
names(penguins_raw)

penguins_clean <- penguins_raw %>%
    clean_names() %>%
    select(-starts_with("delta")) %>%
    select(-comments)

names(penguins_clean)


We now have those columns removed as before, but also all the column names are now standardised and suitable for R. Using other people's libraries saves a lot of time!
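
As a quick check of what the cleaning does, `janitor` also provides `make_clean_names()`, which applies the same kind of tidying to a plain character vector. A small sketch, assuming `janitor` is installed, using three of the column names discussed earlier:

```r
library(janitor)

messy <- c("studyName", "Sample Number", "Culmen Length (mm)")

# Convert each name to lower-case words separated by underscores (snake case).
tidy <- make_clean_names(messy)
tidy
```

All three names come back in snake case, with the spaces and brackets gone.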

We have been careful to save a copy of our raw data, but it is just as important to save a copy of the steps we took to clean the data, so that they are open to scrutiny and checking. If we typed all this directly into the R console, there would be no way of reproducing it exactly later.

Something that is extremely helpful is making a function. 

## Creating a Function in R

We start with the name of our function, which we're going to call `cleaning`. In the brackets of `function()` we specify what is fed into the function, in this case the raw data.

> Note, it doesn't matter what we call this argument! The name only exists inside the function.

Then as before, we put the pipe inside starting with the initial variable `penguins_raw`. Unlike before, we don't specify what the new variable will be called, as the function will simply output it. The `{}` brackets specify what is inside the function. 


In [17]:
cleaning <- function(penguins_raw){
  penguins_raw %>%
    clean_names()
}


Now that we've defined it, absolutely nothing happens to our data.

In [18]:
names(penguins_raw)

We have to actually call the function and create a new variable using it for it to actually work. First we define the function, and then we call it. 

### Calling the Function

In [19]:
penguins_now_clean <- cleaning(penguins_raw)
names(penguins_now_clean)
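
The note about naming can be checked with a sketch: the two made-up functions below differ only in what the argument is called, and they behave identically. `toy`, `shout_a`, and `shout_b` are hypothetical names used only for illustration.

```r
toy <- data.frame(
  Species = c("Adelie", "Gentoo"),
  Island = c("Torgersen", "Biscoe")
)

# Same body, different argument names.
shout_a <- function(data) {
  names(data) <- toupper(names(data))
  data
}

shout_b <- function(anything_you_like) {
  names(anything_you_like) <- toupper(names(anything_you_like))
  anything_you_like
}

# Defining the functions changed nothing; calling them returns a new object.
result_a <- shout_a(toy)
result_b <- shout_b(toy)

identical(result_a, result_b)
names(toy)       # the input is unchanged
names(result_a)  # the output has the new names
```

The function works on a copy of whatever is passed in, so the original only changes if you assign the result back to it.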

--- 
## <span style='background:yellow'> Exercise 02 (5mins) </span>

Write out your cleaning function.

Fill in the rest of the pipes. 

Check it works.

Save your clean data to a `Project/data_clean/` folder with a suitable name. 


---




Now that we have a function, it acts as a building block we can call on our data wherever we need to. We can save it in our project folder in an R file.

I'm going to save it to `Project/functions/cleaning.R`.

### Why bother?

Having copies of the same code everywhere is a bad idea. 

We're building our data pipeline, and keeping these blocks safe and version controlled is really important. For this workshop we're only doing minor steps and so it will seem like overkill. If you keep all the code you use for cleaning data in one place, you can then call it in multiple places whenever you need it.

For example, you might want to have a script that looks at the flippers, and another script that looks at the egg laying. If you copied and pasted the cleaning data function into both scripts, and then later on realise you made a mistake in that function, you'll have to hunt down every copy you have. 

Instead you can save it to a file called `functions/cleaning.R` and every time you use it, you refer to that copy only. Then any changes you make to it are consistently applied everywhere.
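
Referring to that single copy is done with `source()`, which runs an R file and makes the functions it defines available. A minimal sketch, using a temporary folder in place of the project and a made-up one-line function (`double_it`, hypothetical) in place of `cleaning`:

```r
# A temporary stand-in for the project's functions/ folder (hypothetical).
functions_dir <- file.path(tempdir(), "functions")
dir.create(functions_dir, recursive = TRUE, showWarnings = FALSE)

# Normally you would write this file by hand in RStudio;
# here it is written from code so the sketch runs on its own.
function_file <- file.path(functions_dir, "double_it.R")
writeLines("double_it <- function(x) { x * 2 }", function_file)

# Any script that needs the function loads the one shared copy.
source(function_file)
doubled <- double_it(21)
doubled
```

In the project itself, the equivalent would be a line such as `source("functions/cleaning.R")` near the top of each script that needs the cleaning function.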

You can also use your git commits to keep a clear record of what has happened to these functions over time, rather than having to look through all your scripts. 

So, just like we keep a safe record of our data, we keep a safe record of our data protocols. 

<img src="CodeFast.png" alt="Drawing" style="width: 700px;"/>

