# Reproducible Science in R and Figures (Part 1)

### What you should know by the end of this lesson...

- How to load data and clean it 
- Create an exploratory figure
- Communicate with a figure by changing its appearance
- Remake the same figure for a different purpose
- Save a figure for use in a report and poster

### Skills you will have by the end of the lesson...

- Using `ggplot2`
- Changing the elements in a figure
- Creating a custom function
- Reading from other R files
- Creating `png` or `svg` (vector) image files
- Functional, modular, reproducible code

Images from [The Turing Way](https://the-turing-way.netlify.app/reproducible-research/reproducible-research.html)! 

<img src="https://zenodo.org/record/3332808/files/1728_TURI_Book%20sprint_5%20turing%20way_040619.jpg" alt="Drawing" style="width: 300px;"/>


## What is Reproducible Research?

<img src="https://the-turing-way.netlify.app/_images/reproducibility.jpg" alt="Drawing" style="width: 600px;"/>


Scientific results and evidence are strengthened if those results can be replicated and confirmed by several independent researchers. 

What you do on your computer is a part of scientific methodology like fieldwork and labwork.

> Major media outlets have reported on investigations showing that a significant percentage of scientific studies cannot be reproduced. 
> This leads to other academics and society losing trust in scientific results.
>
> https://www.nature.com/articles/533452a

**Reproducible**

- Authors provide all the necessary data and the computer code to run the analysis again, re-creating the results.

**Replicable**

- A study that arrives at the same scientific findings as another study, collecting new data (possibly with different methods) and completing new analyses.

<img src="https://oliviergimenez.github.io/reproducible-science-workshop/slides/assets/definitions.jpg" alt="Drawing" style="width: 600px;"/>


# Why bother?

### Helps YOU

<img src="https://the-turing-way.netlify.app/_images/project-history.jpg" alt="Drawing" style="width: 300px;"/>



Usually the person trying to redo your work and analysis is yourself at a later date!

A lot of time is wasted redoing analyses because you can't read or understand your own code from a few weeks or years ago, or remaking figures because you've lost the code that generated them.

It also acts as a fail safe backup for your work!

### Helps open science

<img src="https://the-turing-way.netlify.app/_images/evolution-open-research.jpg" alt="Drawing" style="width: 500px;"/>


Open research aims to transform research by making it more reproducible, transparent, reusable, collaborative, accountable, and accessible to society. To achieve this, research outputs should:


- Be publicly available: It is difficult to use and benefit from knowledge hidden behind barriers such as passwords and paywalls.

- Be reusable: Research outputs need to be licensed appropriately, so that prospective users know any limitations on re-use.

- Be transparent: With appropriate metadata to provide clear statements of how research output was produced and what it contains.

Open practices can make it easier for researchers to connect by increasing the discoverability and visibility of one’s work, facilitating rapid access to novel data and software resources, and creating new opportunities to interact with and contribute to ongoing communal projects.

### Helps you publish
<img src="https://www.turing.ac.uk/sites/default/files/inline-images/Culture%20shift.jpg" alt="Drawing" style="width: 400px;"/>



It is becoming more common for code release to be part of the scientific publishing and grant approval process. 








# Penguin Data

## *Note from after the lab practical* 

*Thank you everyone for being understanding about the pacing and difficulty of today's session. I had to modify the lesson as I was going and I realise that meant some aspects were clumsy and clunky. I've edited the following notes so you can follow the logic of the lesson.* 

This data set on Palmer penguins has nice examples all over the internet. It contains morphometric data from three species of penguin.

<img src="https://allisonhorst.github.io/palmerpenguins/reference/figures/lter_penguins.png" alt="Drawing" style="width: 400px;"/>

<img src="https://allisonhorst.github.io/palmerpenguins/reference/figures/culmen_depth.png" alt="Drawing" style="width: 350px;"/>


First we need to create a new project folder (also known as directory) for our work. 

Create a new folder called `PenguinProject` in a sensible place on your computer. We call the location of this folder the `path` and it looks something like this:

`C:/Users/lfrance/Documents/BiologyComputerSkills/PenguinProject`

Create your .r or .rmd file inside this folder:

```
- PenguinProject
  - penguin_analysis.rmd
```

It doesn't matter what you call it. Later we are going to create more folders within this folder.

When you run code, your file needs to be able to see this folder. It will look for everything it needs here, such as data, other R files, and folders to save things in.

## Installing and Loading Libraries

The first thing we're going to write in our code file is to load some libraries. 

We want to load the data from a library called `palmerpenguins`, along with some other libraries that will help.

Please install these in your RStudio unless you already have them. 

In [1]:

library(palmerpenguins)
library(ggplot2)
library(janitor)
library(dplyr)

# setwd('a path in here')
setwd('PenguinProject') # This is to make it work on my machine! Yours will be different.


Attaching package: ‘janitor’


The following objects are masked from ‘package:stats’:

    chisq.test, fisher.test



Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




Try not to start your scripts with `install.packages("somepackage")`, because that will wreak havoc: it reinstalls the package, overwriting your existing copy, every time you run the code, and it's very slow. Instead, install once, then just load the library when you need it.
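
One way to follow this advice is to check which packages are missing before deciding whether to install anything. This is only a sketch: the package list is taken from the `library()` calls in this lesson, and the variable names are made up for illustration.

```r
# Packages this lesson loads
needed_packages <- c("palmerpenguins", "ggplot2", "janitor", "dplyr")

# requireNamespace() returns TRUE if a package is installed, without attaching it
is_installed <- vapply(needed_packages, requireNamespace,
                       FUN.VALUE = logical(1), quietly = TRUE)

# Names of any packages that still need installing (empty if none)
missing_packages <- needed_packages[!is_installed]
missing_packages

# Run this once in the console, not in your script:
# install.packages(missing_packages)
```

Nothing is installed by these lines, so they are safe to run as often as you like.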

We also need to tell this file where the project folder is. This is called "setting the working directory" or "setwd". So for example, you put inside the path:

`setwd("C:/Users/lfrance/Documents/BiologyComputerSkills/PenguinProject")`

**MAKE SURE** you are putting the right location in here! If RStudio is looking inside `PenguinProject` for a folder called `PenguinProject`, it won't find it and will throw an error.
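
If you are not sure where R is currently looking, you can ask it. A minimal sketch using base R only; `data_raw` is the folder name used later in this lesson, so the last check will be `FALSE` until that folder exists.

```r
# Where is R currently looking?
current_directory <- getwd()
current_directory

# What can R see in that folder?
list.files()

# Is the folder we expect visible from here?
dir.exists("data_raw")
```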

## Looking at the Data


We have installed and loaded `palmerpenguins` which actually contains data inside it. We will be using data stored inside a variable called `penguins_raw`.

We can look at the data as a table in order to see what issues it might have. This table has over 300 rows so we are just going to look at the first 6. 

In [2]:
head(penguins_raw)

studyName,Sample Number,Species,Region,Island,Stage,Individual ID,Clutch Completion,Date Egg,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex,Delta 15 N (o/oo),Delta 13 C (o/oo),Comments
<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<date>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<chr>
PAL0708,1,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A1,Yes,2007-11-11,39.1,18.7,181.0,3750.0,MALE,,,Not enough blood for isotopes.
PAL0708,2,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A2,Yes,2007-11-11,39.5,17.4,186.0,3800.0,FEMALE,8.94956,-24.69454,
PAL0708,3,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A1,Yes,2007-11-16,40.3,18.0,195.0,3250.0,FEMALE,8.36821,-25.33302,
PAL0708,4,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A2,Yes,2007-11-16,,,,,,,,Adult not sampled.
PAL0708,5,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N3A1,Yes,2007-11-16,36.7,19.3,193.0,3450.0,FEMALE,8.76651,-25.32426,
PAL0708,6,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N3A2,Yes,2007-11-16,39.3,20.6,190.0,3650.0,MALE,8.66496,-25.29805,
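
To see how big the table is without printing all of it, base R has a couple of quick summaries. A small sketch, assuming `palmerpenguins` is installed:

```r
library(palmerpenguins)

# Number of rows, then number of columns, in the raw data
dim(penguins_raw)
# 344 rows and 17 columns

# A compact overview of every column and its type
str(penguins_raw)
```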


One problem we can see immediately is the column names.
Let's look at just the column names:

In [5]:
names(penguins_raw)


They mix capitals and lower case letters. Some have spaces between words, which can cause lots of issues within code, and others have no spaces at all...

- `studyName` is called camel case
- `Sample Number` is title case with a space
- `Culmen Length (mm)` has spaces and also brackets, which could cause further issues.
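
To see why such names are awkward, here is a small sketch with a made-up one-column table (the values are just the first two culmen lengths from the table above):

```r
# A tiny made-up table with one awkward column name.
# check.names = FALSE stops R from changing the name for us.
example <- data.frame(`Culmen Length (mm)` = c(39.1, 39.5), check.names = FALSE)

# Names with spaces or brackets must be wrapped in backticks every time
mean_length <- mean(example$`Culmen Length (mm)`)
mean_length

# What base R turns the name into when it is asked to make it "safe"
safe_name <- make.names("Culmen Length (mm)")
safe_name
# → "Culmen.Length..mm."
```

Neither the backticks nor the dotted version is pleasant to type, which is why we will clean the names properly later.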

When you see these kinds of problems, it may be tempting to go and edit the column names in Excel, along with any other issues you see.

> **Never do this.**

Editing a file in Excel is completely unreproducible: no one has any record of what you edited, and mistakes can easily be missed.

## Preserving our Raw Data

A best practice before we do anything is to save this data to a file in a folder called `data_raw` and consider it "read only"! That means it is preserved exactly as is before we start meddling with it.

What we need to do is create a new folder called `data_raw` inside our project folder:

```
- PenguinProject/
    - data_raw/
    - penguin_analysis.rmd
```

You can create the `data_raw/` folder manually, or use this command in R: `dir.create("data_raw")`. Be careful about running it multiple times: if the folder already exists, R gives a warning.
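
If you would like the folder creation to be safe to re-run, one option is to check whether the folder exists first. This is a sketch rather than part of the original lesson code, and it uses a temporary location so it can be run anywhere; in your project you would simply use `"data_raw"`.

```r
# A temporary location stands in for the project folder here
raw_folder <- file.path(tempdir(), "data_raw")

# Only create the folder if it is not already there
if (!dir.exists(raw_folder)) {
  dir.create(raw_folder)
}

# Running the same lines again does nothing, and gives no warning
if (!dir.exists(raw_folder)) {
  dir.create(raw_folder)
}

dir.exists(raw_folder)
# → TRUE
```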


And using the following command we are going to save the variable `penguins_raw` as a `.csv` file. 


In [6]:
write.csv(penguins_raw, "data_raw/penguins_raw.csv")

# This will break if you haven't set the 
# working directory carefully.

# The folder needs to exist first. 
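
One detail worth knowing: by default `write.csv()` also writes the row numbers, and `read.csv()` then reads them back in as an extra column called `X`. The sketch below shows the difference on a made-up two-row table; `row.names = FALSE` is a standard argument of `write.csv()`, and using it is optional.

```r
# A tiny made-up table
example <- data.frame(species = c("Adelie", "Gentoo"),
                      body_mass_g = c(3750, 5000))

# Temporary files so the sketch does not touch your project
with_row_names <- tempfile(fileext = ".csv")
without_row_names <- tempfile(fileext = ".csv")

write.csv(example, with_row_names)                        # default behaviour
write.csv(example, without_row_names, row.names = FALSE)  # skip the row numbers

names(read.csv(with_row_names))
# → "X" "species" "body_mass_g"

names(read.csv(without_row_names))
# → "species" "body_mass_g"
```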

You'll see I have a directory structure that will help me keep track of what I am doing. 

```
- PenguinProject/
    - data_raw/
        - penguins_raw.csv
    - penguin_analysis.rmd
```

---

## <span style='background:yellow'> Exercise 01 (5mins)  </span>

Create a new project in RStudio. Give it a name without spaces.

Install and load the following libraries:

- palmerpenguins
- ggplot2
- dplyr
- janitor

Open a new script, call it something like `Lab_Tutorial.R`. You may use RMarkdown if you prefer. Make sure the `setwd()` is applied to your project directory. 

Load the penguin data and save it in a folder *within* your project called `data_raw/`. 

You should have the following:

```
- PenguinProject/
    - data_raw/
        - penguins_raw.csv
    - penguin_analysis.rmd
```

---


## Cleaning the Column Names

Once we've got that working, we can start to fix the data set. We now have a safe, read-only copy of the data in a file in our folder. Within R, there is a variable called `penguins_raw` which we can use for cleaning.



In [7]:
# The hyphen '-' means we want to remove that column.

penguins_raw <- select(penguins_raw,-starts_with("Delta"))
penguins_raw <- select(penguins_raw,-Comments)

names(penguins_raw)


If we look at how this little bit of code works, we start with `penguins_raw`, apply a function to it (in this case, remove columns that start with "Delta") and overwrite `penguins_raw`. Then we overwrite it again once we have removed `Comments`.

This is a problem. It's not very robust: if I run it again, it will throw an error because the `Comments` column no longer exists. You can try this by running these lines twice.
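
The failure can be reproduced on a tiny made-up table. This sketch assumes `dplyr` is installed and wraps the second run in `tryCatch()` so that the error is captured instead of stopping the script.

```r
library(dplyr)

# A tiny made-up table standing in for penguins_raw
example <- data.frame(Species = "Adelie", Island = "Torgersen", Comments = "none")

# First run: works, and overwrites the original
example <- select(example, -Comments)

# Second run: Comments is already gone, so select() fails.
# tryCatch() hands back the error object instead of halting.
second_run <- tryCatch(
  select(example, -Comments),
  error = function(e) e
)

inherits(second_run, "error")
# → TRUE
```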

### This is our first mistake! 

But that's okay, we can reload the safe copy of `penguins_raw` using

`penguins_raw <- read.csv("data_raw/penguins_raw.csv")`

and instead of overwriting in the cleaning steps create a variable called `penguins_clean`.

In [13]:
# ---- Previous version: -----
# penguins_raw <- select(penguins_raw,-starts_with("Delta"))
# penguins_raw <- select(penguins_raw,-Comments)
# ----------------------------

# Reloading the raw data from the saved file because I overwrote penguins_raw.
penguins_raw <- read.csv("data_raw/penguins_raw.csv")

names(penguins_raw)

penguins_clean <- select(penguins_raw,-starts_with("Delta"))
penguins_clean <- select(penguins_clean,-Comments)

names(penguins_clean)

Now we have made a new variable called `penguins_clean`. However, this still involves overwriting in the second cleaning step. It is okay to overwrite if these lines are right next to each other. If you accidentally moved that line a lot further down and had code in between, you could easily get errors when you run it multiple times.

## ...Overwriting is bad

The `dplyr` library we loaded has a better way of doing this called piping, which is what the symbol `%>%` means. It means: take the first variable, `penguins_raw`, do the following steps to it in order, then save the result as something else, in this case `penguins_clean`. I can run this multiple times and nothing goes wrong.

In [29]:

# ---- Previous versions: -----
# penguins_raw <- select(penguins_raw,-starts_with("Delta"))
# penguins_raw <- select(penguins_raw,-Comments)
#
# penguins_clean <- select(penguins_raw,-starts_with("Delta"))
# penguins_clean <- select(penguins_clean,-Comments)

# ----------------------------

# Reloading the raw data from the saved file because I overwrote penguins_raw.
penguins_raw <- read.csv("data_raw/penguins_raw.csv")

names(penguins_raw)

penguins_clean <- penguins_raw %>%
  select(-starts_with("Delta")) %>%
  select(-Comments)

names(penguins_clean) 
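
To make the pipe less mysterious, the sketch below runs the same two steps written as nested function calls and as a pipe, on a made-up one-row table, and checks that the results match. It assumes `dplyr` is installed; the table and variable names are invented for illustration.

```r
library(dplyr)

# A tiny made-up table standing in for penguins_raw
example_raw <- data.frame(
  Species = "Adelie",
  Delta.15.N = 8.95,
  Comments = "none"
)

# Nested calls: read from the inside out
nested <- select(select(example_raw, -starts_with("Delta")), -Comments)

# Piped: read from top to bottom; x %>% f(y) means f(x, y)
piped <- example_raw %>%
  select(-starts_with("Delta")) %>%
  select(-Comments)

identical(nested, piped)
# → TRUE

# example_raw itself is untouched, so these lines can be re-run safely
names(example_raw)
# → "Species" "Delta.15.N" "Comments"
```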



Now I've removed the columns I don't want, but the names are still a problem.

We could take time to edit all these column names individually, but one important rule in coding is that anything you're trying to do, someone has probably done before. We installed and loaded a library called `janitor`, which has a very handy function called `clean_names()`.

If I remake our pipe with this addition:

In [None]:
names(penguins_raw)

penguins_clean <- penguins_raw %>%
    select(-starts_with("Delta")) %>%
    select(-Comments) %>%
    clean_names()

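# Note, the order here is very important.
# clean_names() makes the names lower case, and bare column names in select() are case sensitive.
# You would need to change Comments to comments if clean_names() came first.
# (starts_with() ignores case by default, so "Delta" would still match.)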

names(penguins_clean)


We now have those columns removed as before, but also all the column names are now standardised and suitable for R. Using other people's libraries saves a lot of time!
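
To see what the cleaning does to individual names, `janitor` also provides `make_clean_names()`, which works on a plain character vector. A small sketch using the three problem names from earlier, assuming `janitor` is installed:

```r
library(janitor)

# The three problem names discussed earlier
messy_names <- c("studyName", "Sample Number", "Culmen Length (mm)")

# janitor's helper for cleaning a plain character vector of names
tidy_names <- make_clean_names(messy_names)
tidy_names
# → "study_name" "sample_number" "culmen_length_mm"
```

All three end up in the same style (lower case with underscores, often called snake case), with no spaces or brackets left.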

What we're starting to do is produce a **Data Pipeline**.

In [15]:
# Our FULL code now:

library(palmerpenguins)
library(ggplot2)
library(janitor)
library(dplyr)

# Loading the raw penguin data
penguins_raw <- read.csv("data_raw/penguins_raw.csv")

# Look at the column names
names(penguins_raw)

# Pipe to clean up the column names
penguins_clean <- penguins_raw %>%
  select(-starts_with("Delta")) %>%
  select(-Comments) %>%
  clean_names()

# Look at the table, first few rows
head(penguins_clean) 


,x,study_name,sample_number,species,region,island,stage,individual_id,clutch_completion,date_egg,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
,<int>,<chr>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<int>,<int>,<chr>
1,1,PAL0708,1,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A1,Yes,2007-11-11,39.1,18.7,181.0,3750.0,MALE
2,2,PAL0708,2,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A2,Yes,2007-11-11,39.5,17.4,186.0,3800.0,FEMALE
3,3,PAL0708,3,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A1,Yes,2007-11-16,40.3,18.0,195.0,3250.0,FEMALE
4,4,PAL0708,4,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A2,Yes,2007-11-16,,,,,
5,5,PAL0708,5,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N3A1,Yes,2007-11-16,36.7,19.3,193.0,3450.0,FEMALE
6,6,PAL0708,6,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N3A2,Yes,2007-11-16,39.3,20.6,190.0,3650.0,MALE


## Data Pipelines

As things currently stand in research papers, generally the methodology looks like this:

<img src="https://i.imgur.com/wbgVDSZ.png" alt="Drawing" style="width: 800px;"/>

We know where the research questions come from, we can read those in the introduction, and we can usually find detailed methodology about data collection in the lab or field. But when it comes to what steps happened for the analysis and producing figures, we get no information! Even the data used for the analysis is unavailable most of the time. 

<img src="https://i.imgur.com/xZmsFXP.png" alt="Drawing" style="width: 300px;"/>

What we should be doing is filling in the blanks:

<img src="https://i.imgur.com/7LHRrh1.png" alt="Drawing" style="width: 1000px;"/>

And in closer detail:

<img src="https://i.imgur.com/AIwZP7a.png" alt="Drawing" style="width: 900px;"/>

This way we can make sure the raw and cleaned data are also available, for example on Zenodo. This can help in situations where scientific results need to stand up to scrutiny, and also for collaborative purposes! Other people may have exciting new uses for the computational work you have worked hard on.

## Reusable Code

We often need the same code over and over again. One aspect of building **data pipelines** is making the parts **reusable** for other purposes. These might be code blocks that make a figure or run a specific model, which you know you'll need multiple times.

Something that is extremely helpful in making reusable code is making a **function**.

## Creating a Function in R

We use functions all the time in R, usually loading them from libraries. We can also make our own.

We start with the name of our function, we're going to call it `cleaning`. In the brackets of `function()` we specify what is getting fed into the function, in this case the raw data. 

> Note, it doesn't matter what we call the function! You might want to call it `cleaning_columns`.

Then as before, we put the pipe inside. We're keeping this function generally applicable, so the input can just have a generic name like `data` or `data_raw`. Unlike before, we don't specify what the new variable will be called, as the function will simply output it. The `{}` brackets specify what is inside the function.

In [17]:
cleaning <- function(data_raw){
  data_raw %>%
    clean_names()
}

# Fun fact, you can call the outputs and inputs anything 
# you want. As long as it is internally consistent:

# cleaning <- function(abcdefg){
#   abcdefg %>%
#     clean_names()
# }

# For readability, choose variable names that make sense and can be read!
# Write code for humans, not computers. 

Now we've defined it, absolutely nothing happens. 

In [18]:
names(penguins_raw)

We have to actually **call** the function to make it do something. First we define the function, and then we call it. 

### Calling the Function

Here I can put the rest of the pipe inside, and I added another line which removes any empty rows or columns:

In [26]:
# Defining the function
cleaning <- function(data_raw){
  data_raw %>%
    clean_names() %>%
    remove_empty(c("rows", "cols")) %>%
    select(-starts_with("delta")) %>%
    select(-comments)
}
 
# Calling the function
penguins_now_clean <- cleaning(penguins_raw)

# Checking the results
names(penguins_now_clean)

# Fun fact Part II: it doesn't matter if the input here has a different name to the input 
# we used when creating the function. 
# penguins_raw is different to data_raw and yet the function will work fine.
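
Rather than checking the names by eye, you can ask R to confirm that the cleaning did what you expected. The checks below are a sketch of one way to do this, not part of the original lesson code; they assume `palmerpenguins`, `janitor` and `dplyr` are installed and repeat the function definition so the block runs on its own.

```r
library(palmerpenguins)
library(janitor)
library(dplyr)

# Same function as defined above
cleaning <- function(data_raw) {
  data_raw %>%
    clean_names() %>%
    remove_empty(c("rows", "cols")) %>%
    select(-starts_with("delta")) %>%
    select(-comments)
}

penguins_now_clean <- cleaning(penguins_raw)

# stopifnot() stays silent when every check passes and throws an error otherwise
stopifnot(
  !any(startsWith(names(penguins_now_clean), "delta")),  # delta columns are gone
  !"comments" %in% names(penguins_now_clean),            # comments column is gone
  identical(names(penguins_now_clean),
            make_clean_names(names(penguins_now_clean))) # names are already clean
)

# Calling the function again on the raw data gives the same result,
# because the function never overwrites its input
identical(cleaning(penguins_raw), penguins_now_clean)
# → TRUE
```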

# Save the Clean Data

It makes sense to save our now clean data to a csv file. Create a new folder called `data_clean`:

```
- PenguinProject/
    - data_raw/
        - penguins_raw.csv
    - data_clean/
    - penguin_analysis.rmd
```
And as before we will save a `.csv` with our newly cleaned data.

In [None]:
write.csv(penguins_now_clean, "data_clean/penguins_clean.csv")


```
- PenguinProject/
    - data_raw/
        - penguins_raw.csv
    - data_clean/
        - penguins_clean.csv
    - penguin_analysis.rmd
```

--- 
## <span style='background:yellow'> Exercise 02 (10mins) </span>

Write out your cleaning function.

Fill in the rest of the pipeline...
- Starts with the `penguins_raw` data
- It should remove those two columns
- It should clean the column names

Check it works.

Save your clean data to a `Project/data_clean/` folder with a suitable name. 


---




# Putting the Function in a File

## Keeping the cleaning code safe

So far in our code, we have been careful to save a copy of our raw data and clean data, but we also want a safe copy of our cleaning code. It is critical to save a copy of the steps we took to clean the data, so that it is open to scrutiny and checking. If we wrote the cleaning code in the R console, or scattered it across multiple files that we might lose, there would be no way of reproducing it exactly later.

In this toy example, we have also talked about this cleaning function being reusable. What is always bad is copying and pasting our code everywhere.

Why?

- If you make a mistake, then you have to hunt down everywhere you pasted it
- It's easy to have multiple variations accidentally and confuse yourself
- No record of how that code has changed over time (remember your version control lesson)


Now we have a function, this acts as a building block we can call on our data where we need to. We can also save it in our project folder in an R file. 

I'm going to save it to a new folder called `functions/` in an R file called `cleaning.r`.

```
- PenguinProject/
    - data_raw/
        - penguins_raw.csv
    - data_clean/
        - penguins_clean.csv
    - functions/
        - cleaning.r
    - penguin_analysis.rmd
```
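
Once the function lives in its own file, the analysis file can load it with base R's `source()`. The sketch below is an illustration only: it writes the function into a temporary folder so that it can be run anywhere, whereas in your project the file would already exist and the call would simply be `source("functions/cleaning.r")`. It assumes `janitor` and `dplyr` are installed.

```r
library(janitor)
library(dplyr)

# A temporary folder stands in for PenguinProject so the sketch runs anywhere
project_folder <- file.path(tempdir(), "PenguinProject")
dir.create(file.path(project_folder, "functions"),
           recursive = TRUE, showWarnings = FALSE)

# Normally you would write cleaning.r in RStudio; writeLines() fakes that step here
function_file <- file.path(project_folder, "functions", "cleaning.r")
writeLines(c(
  "cleaning <- function(data_raw) {",
  "  data_raw %>%",
  "    clean_names() %>%",
  "    remove_empty(c('rows', 'cols')) %>%",
  "    select(-starts_with('delta')) %>%",
  "    select(-comments)",
  "}"
), function_file)

# source() runs the file, which defines cleaning() in the current session
source(function_file)

# A tiny made-up table to confirm the function was loaded
example_raw <- data.frame(Species = "Adelie", Delta.15.N = 8.95, Comments = "none")
names(cleaning(example_raw))
# → "species"
```

The libraries are loaded in the analysis file before `source()` is called, so the function file itself only needs to contain the function.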

# Why bother?

Having copies of the same code everywhere is a bad idea. 

We're building our data pipeline, and keeping these blocks safe and version controlled is really important. For this workshop we're only doing minor steps and so it will seem like overkill. 

If you keep all the code you use for cleaning data in one place, you can then call it in multiple places whenever you need it.

For example, you might want to have a script that looks at the flippers, and another script that looks at the egg laying. If you copied and pasted the cleaning data function into both scripts, and then later on realise you made a mistake in that function, you'll have to hunt down every copy you have. 

Instead you can save it to a file called `functions/cleaning.r` and every time you use it, you refer to that copy only. Then any changes you make to it are consistently applied everywhere.

You can also use your git commits (your GitHub time machine) to keep a clear record of what has happened to these functions over time, rather than having to look through all your scripts manually.

Also, if someone was quickly looking at your code folder, they can quickly find all the methods you used on your data to check what is going on! This is a lot better than a 2000 line `Rmd` file to read through and try to figure out what your methods were. 

So, just like we keep a safe record of our data, we keep a safe record of our **data protocols**. 


Also, it takes a bit of time right now, but you're saving a serious amount of headaches later on...

<img src="https://i.imgur.com/zQGLO2E.png" alt="Drawing" style="width: 700px;"/>



Here's our full code now. We generally put our function definitions at the top of the file. 

In [31]:

library(palmerpenguins)
library(ggplot2)
library(janitor)
library(dplyr)

# Defining the cleaning function
cleaning <- function(data_raw){
  data_raw %>%
    clean_names() %>%
    remove_empty(c("rows", "cols")) %>%
    select(-starts_with("delta")) %>%
    select(-comments)
}
# -------------- 

# Loading the raw penguin data
penguins_raw <- read.csv("data_raw/penguins_raw.csv")

# Look at the column names
names(penguins_raw)

# Calling the function to clean it
penguins_now_clean <- cleaning(penguins_raw)

# Look at the table, first few rows
head(penguins_now_clean)

# Save the clean data
write.csv(penguins_now_clean, "data_clean/penguins_clean.csv")



,x,study_name,sample_number,species,region,island,stage,individual_id,clutch_completion,date_egg,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
,<int>,<chr>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<int>,<int>,<chr>
1,1,PAL0708,1,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A1,Yes,2007-11-11,39.1,18.7,181.0,3750.0,MALE
2,2,PAL0708,2,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A2,Yes,2007-11-11,39.5,17.4,186.0,3800.0,FEMALE
3,3,PAL0708,3,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A1,Yes,2007-11-16,40.3,18.0,195.0,3250.0,FEMALE
4,4,PAL0708,4,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A2,Yes,2007-11-16,,,,,
5,5,PAL0708,5,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N3A1,Yes,2007-11-16,36.7,19.3,193.0,3450.0,FEMALE
6,6,PAL0708,6,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N3A2,Yes,2007-11-16,39.3,20.6,190.0,3650.0,MALE


# [Now go to the next part about plotting](https://github.com/LydiaFrance/Reproducible_Figures_R/blob/lessons/lesson_notebook02_figures.ipynb). 