# Intro to chemistRy

Welcome to *chemistRy*, a product of the JMU Department of Chemistry and Biochemistry. This is the second lesson in the series, which aims to teach you about data visualization, data tidying, statistics, and a bit of R coding. If you don't know how to code, don't worry! These lessons assume no prior knowledge of code or R.

A few things to start:

1.   These lessons only work in Google Chrome.
2.   Make your own copy by going to File > Save a Copy in Drive, then choose a location in your Drive folder.
3.   This is a special Google Colab notebook for R; regular Colab runs Python. If you want to make your own R notebook, follow this [link](https://colab.research.google.com/drive/16u6W4rl-AHX_yXlQx7HcgjAuNimfqwj_).

If you have questions, feel free to contact Dr. Chris Berndsen in the JMU Chemistry Department.


In [None]:
#@title Run this block to set up the system
#@markdown It may take up to a minute to complete (usually much less)
install.packages("modeldata")
## load packages
library(tidyverse)
library(modeldata)

# Visualizing data

Data come in many types and can be analyzed in many ways. In Chemistry and Biochemistry, we sometimes use statistics to summarize the measurements, while other times we show the raw data, such as the measured absorbance of a sample. In most instances, showing both a plot and a summary table of statistics is helpful, especially when you have complex data. In this module, we will introduce the basics of plotting data in R using the `ggplot2` package. This package will be the primary package for data visualization in later modules.

In the next sections you will be plotting a data set in `ggplot2` and then trying a few variations on plotting the data set. Finally, you will be shown the elements of a good plot and be tasked with constructing a good plot.

---

## Why visualize data?
Chemistry data often come in the form of numbers. Think about some of the values you may have determined in lab: mass, density, rate, absorbance, pH, etc. Comparing a few measurements is easy; however, as the size of the data set increases, so does the complexity of the analysis required to make comparisons.

The table below shows part of a data set on the composition of 215 meat samples. The data were collected by spectroscopy, and those measurements were used to calculate the fat, water, and protein content of each sample. Looking at the data, what is the relationship between the fat, water, and protein content?




In [None]:
#@title Run this block to see the table
library(modeldata)
data(meats)

# keep only the water, fat, and protein columns (columns 101 to 103)
tbl <- meats %>%
  select(101:103)

colnames(tbl) <- c("% water", "% fat", "% protein")


head(tbl)

Using the numbers alone, the relationships between these parameters are not clear. This is when making a picture of the data in the form of a plot is valuable. Let's take a look at a plot showing the percent fat vs. the percent protein.

In [None]:
#@title Run this block to see the plot
data("meats")

# keep only the water, fat, and protein columns (columns 101 to 103)
tbl <- meats %>%
  select(101:103)

ggplot(tbl, aes(x = protein, y = fat)) +
  geom_point(alpha = 0.5) +
  theme_classic() +
  theme(legend.position = "none",
        axis.text = element_text(size = 20),
        axis.title = element_text(size = 24)) +
  labs(x = "% protein", y = "% fat")


Plotting the data reveals a strong trend: as protein content increases, fat content decreases. Without plotting the data, this trend would not be obvious. One caveat is that while this plot shows the trend, it is not very effective at communicating the data. In the next two sections, you will explore both plotting data to observe trends and formatting the plot to communicate data effectively.
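
One way to put a number on the trend seen in the plot is a correlation coefficient. The short sketch below is an optional extra rather than part of the original exercise. It assumes the `modeldata` package from the setup block is installed and uses base R's `cor()` function on the `protein` and `fat` columns.

```r
library(modeldata)
data(meats)

# Pearson correlation between percent protein and percent fat
# a negative value means fat tends to fall as protein rises
protein_fat_cor <- cor(meats$protein, meats$fat)
protein_fat_cor
```

A single number like this summarizes the direction and strength of a straight-line trend, but it does not replace looking at the plot.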

# Using the ggplot2 package

`ggplot2` is a popular **package** in R. A package is a collection of functions or code that runs within R, sort of like Microsoft Excel runs within Windows or Safari runs in macOS. Functions are usually given a name followed by `()`. In the previous module, you used `mean()` to calculate an average value. Inside the parentheses, you put your data or the numbers. In the case of `ggplot2`, these functions create the various aspects of the plot, like the color scheme, the type of plot, or whether there should be a title. We combine the functions using a `+` sign between them. When we combine all these individual tasks, we get a beautiful plot.
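
As a small sketch of these ideas, the code below first calls a function on some numbers and then builds a plot one piece at a time with `+`. The masses, the data frame `df`, and its column names `conc` and `absorbance` are all made up for illustration and are not part of the lesson data.

```r
library(ggplot2)

# a function acts on whatever is placed inside its parentheses
# these masses are made-up numbers for illustration only
masses <- c(1.02, 0.98, 1.05)
avg_mass <- mean(masses)
avg_mass

# made-up data frame for illustration only
df <- data.frame(conc = c(1, 2, 3, 4),
                 absorbance = c(0.11, 0.19, 0.32, 0.41))

# start with the ggplot() function alone: this draws empty axes
p <- ggplot(df, aes(x = conc, y = absorbance))

# use + to add more functions, one piece of the plot at a time
p <- p + geom_point()
p <- p + labs(title = "Absorbance vs. concentration")
p
```

Storing the plot in `p` is optional; it is done here only to show that each `+` adds to what came before.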

There are a few terms to know before we look at the code:

| term  | definition  | code  | code example  |
|---|---|---|---|
| function  | a group of commands that acts on a piece of data  | ...()  | ggplot(df) or geom_point()  |
|  aes |  aesthetic or features of the plot that change with the data |  aes(...) | aes(x = temp, y = rate, fill = species)  |
| theme  | changes the non-data parts of the plot  |  theme() | theme(axis.text = element_text(size = 20))  |
|  geom | the type of plot  | geom_  | geom_point()  |
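
To see the four terms working together, here is a minimal sketch. The data frame is made up for illustration only; `temp`, `rate`, and `species` are hypothetical column names borrowed from the code examples in the table. The choice of `shape = 21` is an assumption made here so that `fill` is visible, since the default point shape in `ggplot2` does not use a fill color.

```r
library(ggplot2)

# made-up data for illustration only
df <- data.frame(
  temp    = c(10, 20, 30, 10, 20, 30),
  rate    = c(0.5, 1.1, 2.3, 0.3, 0.8, 1.6),
  species = c("A", "A", "A", "B", "B", "B")
)

# function: ggplot() starts the plot
# aes: x, y, and fill change with the data
p <- ggplot(df, aes(x = temp, y = rate, fill = species)) +
  # geom: the type of plot; shape 21 is a circle that can be filled
  geom_point(shape = 21, size = 4) +
  # theme: non-data parts, such as the size of the axis text
  theme(axis.text = element_text(size = 20))
p
```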


Now, let's look at the code for the plot above on meats and how the functions in the ggplot2 package combine. Pay attention to where and how the terms and functions are used to make the plot, starting at the `# making the ggplot!` comment on line 10.

In [None]:
# this line selects the data set to use
data("meats")

# you can ignore these lines for now
# lines 7 and 8 trim down the meats data set using a pipe
# we will learn about piping later
tbl <- meats %>%
  select(101:103)

# making the ggplot!

# choose the ggplot function
# set data to plot to tbl, and choose the x and y variables, these are the column names in a spreadsheet
# the aes means aesthetic and chooses the variables
ggplot(data = tbl, aes(x = protein, y = fat)) +

  # choose the geom or type of plot, in this case a point or scatter plot
  geom_point() +

  # define how the plot looks, classic theme in this case
  theme_classic() +

  # hide the legend
  theme(legend.position = "none",
        # change the axis label text to size 20
        axis.text = element_text(size = 20),
        # change the axis title text to size 24
        axis.title = element_text(size = 24)) +

  # label the axes, with units!
  labs(x = "% protein", y = "% fat")

Whew! That took a lot of code to make the plot! Why would you ever use ggplot rather than using the chart button in Microsoft Excel or Google Sheets? Here are some reasons related to science and communication:

*   Once you make the plot, you can use the same code for other data sets and have them all look the same --> **makes it easier to compare data sets**
*   Easy to find where to customize an aspect of the plot
*   You have full control over every part of the plot
*   Many more ways to show data and communicate your data
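
The first point, reusing the same code for other data sets, can be sketched by wrapping the plot code in a function. The helper `scatter_plot()` and its argument names are hypothetical and written only for this illustration. It uses `.data[[...]]`, a `ggplot2` feature for looking up a column by its name as text, and two data sets that come built into R (`mtcars` and `faithful`).

```r
library(ggplot2)

# hypothetical helper: the name and arguments are illustrative only
scatter_plot <- function(df, x_col, y_col, x_label, y_label) {
  ggplot(df, aes(x = .data[[x_col]], y = .data[[y_col]])) +
    geom_point() +
    theme_classic() +
    theme(axis.text = element_text(size = 20),
          axis.title = element_text(size = 24)) +
    labs(x = x_label, y = y_label)
}

# the same code gives two different data sets an identical look
p1 <- scatter_plot(mtcars, "wt", "mpg",
                   "weight (1000 lb)", "miles per gallon")
p2 <- scatter_plot(faithful, "waiting", "eruptions",
                   "waiting time (min)", "eruption time (min)")
p1
p2
```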

---

## Changing a Plot

Now, let's play with customizing a ggplot a bit and compare what each change does.

Below is a series of tasks to complete. For each task, make the change to the specified line of code, then run the code. Copy the resulting image to a document, in a section called Changing a Plot, and describe how the plot looks different with each change. You will turn this in on Canvas after completing the module.

*If the code fails and you cannot figure out what went wrong, copy the code from above and overwrite the problem code. Then try again.*



1. Switch the variables on x and y, change x to fat, and y to protein
2. Correct the labeling of the axes to reflect the change in task 1
3. Inside the aes(...) in line 9, add `, color = protein`
4. Inside geom_point(...) on line 11, type `aes(size = fat)`
5. Change the theme on line 13 from classic to bw
6. Change the plot type to geom_line (but don't change the aes(...) from line 4)






In [None]:
# this line selects the data set to use
data("meats")
# lines 5 and 6 trim down the meats data set using a pipe
# we will learn about piping later
tbl <- meats %>%
  select(101:103)

## make the changes below only!
ggplot(data = tbl, aes(x = protein, y = fat)) +

  geom_point() +

  theme_classic() +

  theme(legend.position = "none",
        axis.text = element_text(size = 20),
        axis.title = element_text(size = 24)) +

  labs(x = "% protein", y = "% fat")

An answer key will be available after this assignment has closed so that you can see what the plot should have looked like if you completed all 6 tasks correctly.

# Exploring your data
Sometimes, you don't know how the data should be plotted until after you have explored different plotting styles and approaches. This is especially true with larger data sets. In the final section, you are going to make a plot of a data set from [Brodnjak-Vončina, D., Kodba, Z. C., & Novič, M. (2005). Multivariate data analysis in classification of vegetable oils characterized by the content of fatty acids. Chemometrics and Intelligent Laboratory Systems, 75(1), 31–43.](https://www.sciencedirect.com/science/article/abs/pii/S0169743904001200). The scientists measured the different types of fat molecules (fatty acids) in 96 vegetable oil samples and how much of each type was present. A short table of the data will be shown after you run the code box below.

In [None]:
data(oils)

head(oils)
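
Before plotting, a quick summary table can help you decide what to compare. The sketch below is one possible way to explore and is not a required step. It assumes the `oils` data from `modeldata` has a `class` column for the type of oil and columns named `oleic` and `linoleic`; check these names against the table printed by `head(oils)` and swap in any columns you like.

```r
library(dplyr)
library(modeldata)
data(oils)

# count how many samples there are of each type of oil
count(oils, class)

# average oleic and linoleic acid content for each type of oil
oil_means <- oils %>%
  group_by(class) %>%
  summarise(mean_oleic = mean(oleic),
            mean_linoleic = mean(linoleic))
oil_means
```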

Your assignment is to create **two** plots from these data. The first should compare fatty acid content among the vegetables and the second should show a relationship you identified by exploring the data. In both cases, the plot should be clear, well-labeled, and accessible.

To help you, two code blocks are set up below. The first gives you a sense of how to construct the comparison of fatty acids in each vegetable. The second code block provides skeleton ggplot code, which you can customize. There are many ways to get credit for these problems.

# Comparison of fatty acid content in vegetables

In [None]:
data(oils)

# use ggplot, use the table above to choose which columns to use and put on x or y
ggplot(data = oils, aes(x = _____, y = ______)) +

  # pick a geom: geom_point(), geom_col(), geom_histogram(), geom_jitter()
  # note: geom_histogram() uses only one of x or y, not both
  # don't forget to include the parentheses!
  geom_   +

  # set the axis.title font size to at least 20 and axis.text to at least 16
  # go back and review how we did this in the meat plot if you forgot the code
  theme() +

  # add the axis title labels for x and y
  # note how "" are used in the meat plot labels
  labs()

# Exploring the data!
Remember, in the second plot you should explore the data to look for something interesting or just make a fun plot. An outline of code is provided in the code block below, and a table below the code shows some extra customizations that you can play with.