<img src="images/RIINBRE-Logo.jpg" width="400" height="400"><img src="images/MIC_Logo.png" width="600" height="600">

# Analysis of Biomedical Data for Biomarker Discovery
<a id="top2"></a>
## Submodule 2: Introduction to R Data Structures
### Dr. Christopher L. Hemme
### Director, [RI-INBRE Molecular Informatics Core](https://web.uri.edu/riinbre/mic/)
### The University of Rhode Island College of Pharmacy

---

## Overview

This Jupyter Notebook provides an introduction to R data structures, serving as Submodule 2 of a larger course on analyzing biomedical data for biomarker discovery. It covers basic R data types (numeric, integer, character, logical, NA, NaN, Inf), explaining how to check and convert between them. The notebook then focuses on core data structures: vectors, lists, matrices, arrays, and data frames. It emphasizes the concepts of dimensionality and homogeneity/heterogeneity for each structure, and provides examples of how to create and manipulate them using functions like c(), list(), matrix(), and data.frame(). It highlights best practices for handling missing data (NA) and potential pitfalls like data type coercion. The notebook also includes embedded quizzes and concludes by explaining the importance of understanding R data structures for tasks like regression analysis in bioinformatics. It's targeted towards beginners in R, specifically those interested in biomedical data analysis, and suggests users adopt tibbles (from the tidyverse) for a more modern approach to data frames.

## Learning Objectives

+ **Understanding R Data Types:**  Learn to identify and convert between different data types like numeric, integer, character, logical, and special values like NA, NaN, and Inf.
+ **Working with Missing Values (NA):**  Understand the difference between NA and zero, and learn how to handle missing data in calculations using `na.rm`.
+ **Understanding R Data Structures:** Learn about the different data structures available in R (vectors, lists, matrices, arrays, and data frames) and their properties of dimensionality (1D, 2D, multi-dimensional) and homogeneity (same or different data types within the structure).
+ **Creating and Manipulating Vectors:**  Learn how to create vectors using `c()`, determine their length, access elements using bracket notation, including subsetting with logical vectors and conditional statements.
+ **Creating and Manipulating Lists:**  Learn how to create lists using `list()`, understand their heterogeneous nature, access elements using single and double bracket notation, and understand how to work with nested data structures within lists.
+ **Creating and Manipulating Matrices and Arrays:**  Learn how to create matrices using `matrix()`, control their dimensions with `nrow`, `ncol`, and `byrow`, access elements using bracket notation, and understand the difference between `dim` and `length`.
+ **Creating and Manipulating Data Frames:**  Learn how to create data frames using `data.frame()`, understand their structure in relation to spreadsheet-like data, access columns using bracket and `$` notation, convert columns to factors, and understand common data import issues and coercion.
+ **Best Practices for Data Frames:** Be introduced to the concept of tibbles as a modern alternative to data frames.
+ **Relevance to Bioinformatics and Regression Analysis:**  Understand why data structures are important in bioinformatics analyses, especially in the context of regression models where specific data structures (matrices, vectors) are required for operations.

## Get Started

Before we begin working with data, we need to discuss some fundamental topics relevant to data processing and analysis.  We will start with the common data structures in R.  These data structures will be important for building our experimental object and for regression analysis.  This is not intended to be a comprehensive review of R.  If you are already familiar with R, you can skip to Submodule 3: Introduction to Linear Models.

<div class="alert alert-block alert-info">
<b>&#9995; Tip:</b> Blue boxes will indicate helpful tips.</div>

<div class="alert alert-block alert-warning">
<b>&#127891; Note:</b> Used for interesting asides or notes.
</div>

<div class="alert alert-block alert-success">
<b>&#9997; Reference:</b> This box indicates a reference for an attached figure or table.
</div>

<div class="alert alert-block alert-danger">
<b>&#128721; Caution:</b> A red box indicates potential hazards or pitfalls you may encounter.
</div>

---

## Data Structures in R

When working with data in R, we need to know two things: what type of data are we working with, and what data structure should be used?  Data types in R include **numeric** (floating point numbers), **integer**, **character** (text), and **logical** (TRUE/FALSE).  There are several ways to determine the type of your data.  The first is with the *class* function:

In [None]:
vec <- c(1:10)
vec
class(vec)

Here we create a vector of integers from 1 to 10 (we'll discuss what a vector is in a moment).  The __*class*__ function tells us that this data is of type **integer**.

A second way to determine the data type is with the __*str*__ function:

In [None]:
str(vec)

__*str*__ gives us the structure of the data.  In this case, it's a simple vector of integers.  For more complex data structures, __*str*__ is a valuable tool for understanding what types of data are in your data structure and how it is organized.  For now, let's look at the third way to identify the data type.  Most data types in R have a corresponding __*is*__ function that indicates using logical values whether the data is or is not of a specific type.

In [None]:
is.integer(vec)

In [None]:
is.numeric(vec)

In [None]:
is.character(vec)

<div class="alert alert-block alert-info">
<b>&#9995; Tip:</b> Notice that vec is both integer and numeric.  Integer values are numeric, but numeric values are not necessarily integers.</div>
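
The tip above can be illustrated with a minimal sketch.  The names `x_dbl` and `x_int` are illustrative and not part of the notebook.  Numbers typed without a suffix are stored as doubles (class `numeric`), while the `L` suffix or the `:` operator produces integers.

```r
x_dbl <- c(1, 2, 3)      # plain numbers are stored as doubles
x_int <- c(1L, 2L, 3L)   # the L suffix requests integers

class(x_dbl)        # "numeric"
class(x_int)        # "integer"
is.numeric(x_int)   # TRUE: integers count as numeric
is.integer(x_dbl)   # FALSE: doubles are not integers, even when whole
typeof(x_dbl)       # "double"
```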

For converting between data types, each type has a corresponding *as* function:

In [None]:
is.integer(vec)

In [None]:
vec_char <- as.character(vec)
is.integer(vec_char)
is.character(vec_char)
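
A few more conversions are sketched here to show what the __*as*__ functions do when a value cannot be represented in the target type.  The variable names are illustrative only.

```r
# Character strings that look like numbers convert cleanly
as.numeric(c("1", "2.5", "10"))

# Strings that cannot be parsed become NA (R also emits a warning,
# suppressed here to keep the output readable)
chr_mixed <- c("1", "2", "abc")
num_mixed <- suppressWarnings(as.numeric(chr_mixed))
num_mixed                    # 1 2 NA

# Converting doubles to integers truncates toward zero
int_trunc <- as.integer(c(3.9, -3.9))
int_trunc                    # 3 -3

# Zero becomes FALSE; any other number becomes TRUE
lgl_conv <- as.logical(c(0, 1, 2))
lgl_conv                     # FALSE TRUE TRUE
```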

There are a few special data types we should cover: **NA**, **NaN** and **+/-Inf**.

**NA** means 'not available' and is the most common method of displaying missing data.

In [None]:
vec_zero <- c(1,3,4,0,7,9)
vec_zero

In [None]:
vec_na <- c(1,3,4,NA,7,9)
vec_na

In [None]:
mean(vec_zero)
mean(vec_na)
mean(vec_na, na.rm = TRUE)

<div class="alert alert-block alert-danger">
<b>&#128721; Caution:</b> Notice that NA is not the same as zero.  In vec_zero, the mean function calculates the mean based on six values including zero, returning a mean of 4.  When we replace the zero with NA, mean returns NA unless we add the argument na.rm = TRUE (meaning remove NA).  The mean of vec_na is then calculated based on the five remaining values, returning a mean of 4.8.  When working with real data, we will almost always encounter missing values, so it's important to know how to handle them.
</div>

<div class="alert alert-block alert-info">
<b>&#9995; Tip:</b> 'na.rm' is a common argument in many R functions.  Later we will show ways to work with NA values during import.</div>
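
As a brief sketch of the same idea with other base R functions, the example below reuses the values of `vec_na` from above.  The name `vec_complete` is introduced here for illustration.

```r
vec_na <- c(1, 3, 4, NA, 7, 9)

sum(is.na(vec_na))          # number of missing values: 1
sum(vec_na)                 # NA, because one value is unknown
sum(vec_na, na.rm = TRUE)   # 24
max(vec_na, na.rm = TRUE)   # 9

# Alternatively, drop the missing values before calculating
vec_complete <- vec_na[!is.na(vec_na)]
length(vec_complete)        # 5
```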

Now consider the following examples:

In [None]:
0/0
1/0
-1/0

In many programming languages, attempting to carry out an improper mathematical operation such as dividing by zero will return an error.  In R, we instead get the following values:
<ul>
    <li><b>NaN</b> - Not a Number</li>
    <li><b>Inf</b> - Positive Infinity</li>
    <li><b>-Inf</b> - Negative Infinity</li>
</ul>

This functionality allows you to work with these types of situations without requiring extra error checking routines.

In [None]:
vec_nan <- c(1, 3, NA, NaN, Inf, -Inf, 10)
vec_nan

In [None]:
is.na(vec_nan)
is.nan(vec_nan)
is.infinite(vec_nan)
is.finite(vec_nan)

The __*is*__ functions return logical vectors indicating which elements of the vector (or other data structure) meet the specified condition.

<div class="alert alert-block alert-info">
<b>&#9995; Tip:</b> Notice that NaN is also considered NA.</div>
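
The relationship is one-directional, which the following sketch makes explicit.  The vector `x` is illustrative.

```r
is.na(NaN)    # TRUE: NaN is treated as a kind of missing value
is.nan(NA)    # FALSE: NA is not NaN
is.na(Inf)    # FALSE: infinity is a defined value, not a missing one

# Counting NA and NaN separately in a vector
x <- c(1, NA, NaN, Inf)
sum(is.na(x))                 # 2 (NA and NaN)
sum(is.nan(x))                # 1 (NaN only)
sum(is.na(x) & !is.nan(x))    # 1 (NA only)
```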

You can negate these operations by using the __*!*__ operator.  For instance, to identify which values in *vec_nan* are NOT **NA**, use:

In [None]:
!is.na(vec_nan)

Logical vectors are also very useful for subsetting data structures.

In [None]:
vec_nan[is.finite(vec_nan)]
vec_nan[!is.na(vec_nan)]

<div class="alert alert-block alert-info">
<b>&#9995; Tip:</b> Even though NA, NaN and Inf look like character data types, R recognizes them as special cases and will not coerce numeric data if it sees these values.</div>
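
A short sketch of this behavior follows.  It also shows the contrast with a quoted `"NA"`, which R treats as ordinary text.

```r
class(c(1, NA, NaN, Inf))   # "numeric": the special values do not force coercion
class(c(1, "NA"))           # "character": the quoted "NA" is ordinary text

# A quoted "NA" is not recognized as missing
is.na(c("1", "NA"))         # FALSE FALSE
is.na(c("1", NA))           # FALSE TRUE

# NA adopts the type of the vector it is placed in
class(c("a", NA))           # "character"
class(c(TRUE, NA))          # "logical"
```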

---

Now that we understand data types, let's discuss data structures.  A data structure is a way to organize our data in a consistent form that can be easily manipulated by other R functions.  When we store data in a variable, we must either explicitly define the data structure or trust R to define it for us.  For example, when we load a text file into R, it will typically be loaded as a data frame unless we explicitly define it as, for example, a matrix.

Data structures in R are typically defined by two properties: dimensionality and homogeneity. A homogeneous data structure means that all values in the data structure are of the same data type, e.g., all numeric.  A heterogeneous data structure allows for each element to be its own type, though there may still be restrictions on data types within that element.

Dimensionality simply means how many dimensions the data structure is composed of.  A 1-dimensional data structure is an ordered sequence of values (e.g., 1,2,3,4,5).  Multi-dimensional data structures add one or more dimensions to the data (e.g., a 2-dimensional data structure is similar to how data is organized in a spreadsheet).  Below is a table of the most common base R data structures.

<table>
<thead>
    <tr><th>Dimensionality</th><th>Homogeneous</th><th>Heterogeneous</th></tr>
</thead>
<tbody>
    <tr><td>1-Dimensional</td><td>Vector</td><td>List</td></tr>
    <tr><td>2-Dimensional</td><td>Matrix</td><td>Data Frame</td></tr>
    <tr><td>Multi-Dimensional</td><td>Array</td><td>-</td></tr>
</tbody>
</table>

<div class="alert alert-block alert-info">
<b>&#9995; Tip:</b> These are the base R data structures, but you will often see package developers create new data structures for specific applications.  These are usually built on the existing base R data structures but allow for additional functionality, specific storage patterns, etc.  Some common enhanced data structures you might see include:
<ul>
 <li>Data Table - A modified data frame used in the data.table package that is optimized for large datasets and mimics the functionality of databases</li>
 <li>Tibble - A modified data frame built along tidy principles and implemented in the tidyverse suite of packages</li>
 </ul>
</div>

As with data types, R data structures usually have associated __is__ and __as__ functions for identification and conversion.
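
As a rough sketch, the structures in the table above can be converted into one another as shown here.  The names `v`, `m`, `d`, and `v_list` are illustrative and chosen so they do not overwrite objects used elsewhere in this notebook.

```r
v <- 1:6

m <- matrix(v, nrow = 2)    # vector -> 2x3 matrix
is.matrix(m)                # TRUE

d <- as.data.frame(m)       # matrix -> data frame with columns V1, V2, V3
is.data.frame(d)            # TRUE

v_list <- as.list(v)        # vector -> list with six elements
is.list(v_list)             # TRUE

as.vector(m)                # matrix -> vector, read column by column: 1 2 3 4 5 6
```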

### Vectors

The vector is the fundamental data structure in R and is a 1-dimensional homogeneous sequence of values.  We have already seen how to create a vector using the __c()__ function.  A single value is called a scalar, but technically speaking, a scalar in R is just a vector of length 1.

In [None]:
vec

In [None]:
length(vec)

In [None]:
is.vector(vec)

In [None]:
is.list(vec)

You might have encountered vectors in previous math classes, where they describe the magnitude and direction of an arrow.  R vectors generalize that idea to an ordered sequence of values of any length.  R is specifically designed to carry out mathematical operations on vectors and matrices.  In fact, R uses a process called vectorization to apply an operation to every element of a data structure at once, meaning that we can carry out simple or complex mathematical operations without writing explicit loops.

In [None]:
2*vec

In [None]:
vec^2

In [None]:
vec * t(vec)
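
To make the benefit of vectorization concrete, the sketch below compares an explicit loop with the vectorized form and contrasts the element-wise product with the dot product.  The names `x`, `sq_loop`, and `sq_vec` are illustrative.

```r
x <- 1:10

# Explicit loop: square each element one at a time
sq_loop <- numeric(length(x))
for (i in seq_along(x)) {
  sq_loop[i] <- x[i]^2
}

# Vectorized equivalent: one expression, same result
sq_vec <- x^2
identical(sq_loop, sq_vec)   # TRUE

# Element-wise product versus dot product
x * x             # element-wise: 1 4 9 ... 100
sum(x * x)        # dot product as a single number: 385
drop(x %*% x)     # same value via matrix multiplication

# A shorter vector is recycled to match the longer one
recycled <- 1:6 + c(10, 20)
recycled          # 11 22 13 24 15 26
```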

As previously mentioned, a vector is homogeneous, meaning all values must be of the same type.  This is important to remember, because in the case of ambiguous data, R will attempt to coerce the data to a single type which can cause problems for your analysis.  Consider the following cases:

In [None]:
vec_num <- c(1,3,4,7,9,12)
class(vec_num)
mean(vec_num)

In [None]:
vec_char <- c(1,3,4,"7",9,12)
class(vec_char)
mean(vec_char)

In __*vec_char*__, a single character value forces every element of the vector to be coerced to the character type.  When we try to carry out a mathematical operation on this vector, __*mean*__ returns NA with a warning because it expects numeric data.

This is a very common problem when importing data from external sources such as spreadsheets, so it's important to always check the structure of your data when importing to ensure that your data is what you expect it to be.
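
One possible way to track down the offending entries is sketched below.  The vector `raw` is a hypothetical stand-in for a column imported from a spreadsheet, and this is only one of several approaches.

```r
# Hypothetical column as it might arrive from a spreadsheet
raw <- c("1", "3", "4", "seven", "9", "12")
class(raw)                       # "character"

# Attempt the conversion; entries that cannot be parsed become NA
num <- suppressWarnings(as.numeric(raw))

# Locate the entries that failed to convert
raw[is.na(num)]                  # "seven"
which(is.na(num))                # 4

mean(num, na.rm = TRUE)          # 5.8
```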

To extract elements from a vector, we use square bracket notation.  There are a variety of ways to extract specific elements from a vector, some of which are shown below.

In [None]:
vec
vec[1]
vec[-1]
vec[1:3]
vec[c(1,3)]

In [None]:
vec_log <- c(TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE)
vec[vec_log]

In [None]:
vec[vec > 6]
vec[vec > 2 & vec <= 7]
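
A few further subsetting patterns are sketched here.  The names `x`, `y`, and `conc` are illustrative, and the values in `conc` are made up.

```r
x <- 1:10

which(x > 6)          # positions that satisfy the condition: 7 8 9 10
x[x %% 2 == 0]        # even values: 2 4 6 8 10
x[x < 3 | x > 8]      # values at either extreme: 1 2 9 10

# Subsetting on the left side of an assignment replaces values in place
y <- x
y[y > 8] <- 0L
y                     # 1 2 3 4 5 6 7 8 0 0

# Named vectors can be subset by name instead of position
conc <- c(S1 = 2.5, S2 = 3.1, S3 = 1.8)   # hypothetical sample values
conc["S2"]
conc[c("S1", "S3")]
```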

---

### Lists

A list is a heterogeneous version of a vector.  In other words, each element of a list can be any data type.  While this may seem chaotic compared to the cleaner vector, it allows us to group together related data that might not necessarily be of the same type.  We create lists using the list function.

In [None]:
l <- list("scalar" = 1, "numeric" = 1:10, "character" = c("a", "b", "c"), "logical" = c(TRUE, FALSE, TRUE, FALSE))
l
str(l)

In [None]:
is.list(l)
is.data.frame(l)

<div class="alert alert-block alert-info">
<b>&#9995; Tip:</b> Most data structures in R allow you to name sets of data which you can then refer to instead of the numeric index.  This simplifies manipulating large data sets.</div>

Notice a few properties of our list.  First, each element of the list is a different length.  In a list, this is perfectly acceptable, but it means that it's up to you to process each element of the list properly.  Second, the final three elements of the list are themselves vectors.  In fact, we can add any data structure to a list, including other lists.  This allows us to create very complex multidimensional data structures even though a list is technically only one dimension.  This is a powerful feature of lists, but again, it is up to you to understand the structure of your list and how to get the relevant data out of it.

Using the single bracket notation that we learned with vectors will return another list.  This isn't always what we want.  To get the data itself from a list, we use the double bracket notation.  In the example below, the double bracket notation returns the vector stored in the second element of the list.  We can then chain that operation with standard single bracket notation to get the second element of the vector stored in the second element of the list.  This chaining is how we can access data that is stored deeply in a complex list.

In [None]:
l[2]
l[[2]]
l[[2]][2]
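
The difference between the two bracket styles, and the chaining of accessors through a nested list, can be sketched as follows.  The objects `lst` and `sample_info` are hypothetical and exist only for this illustration.

```r
lst <- list(scalar = 1, numeric = 1:10, character = c("a", "b", "c"))

class(lst[2])          # "list": single brackets keep the list wrapper
class(lst[[2]])        # "integer": double brackets return the contents

# Two equivalent ways to reach the same element by name
lst$character[2]           # "b"
lst[["character"]][2]      # "b"

# Hypothetical nested list: chain the accessors to reach inner values
sample_info <- list(
  id     = "S1",
  meta   = list(treatment = "Control", replicates = 3),
  values = c(2.5, 3.1, 1.8)
)
sample_info$meta$treatment               # "Control"
sample_info[["meta"]][["replicates"]]    # 3
sample_info$values[2]                    # 3.1
```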

---

### Matrices and Arrays

An array is a multi-dimensional data structure that holds data of the same type.  A vector is a one-dimensional array, and a two dimensional array is called a matrix.

Vector (dimensions 1x4): $$\begin{bmatrix} 6 & 4 & 3 & 7 \end{bmatrix}$$

Matrix (dimensions 4x4): $$\begin{bmatrix} 6 & 4 & 3 & 7 \\ 2 & 1 & 1 & 4 \\ 9 & 1 & 3 & 8 \\ 6 & 3 & 2 & 1 \end{bmatrix}$$ 

Array (dimensions 4x4x3): $$\begin{bmatrix} 6 & 4 & 3 & 7 \\ 2 & 1 & 1 & 4 \\ 9 & 1 & 3 & 8 \\ 6 & 3 & 2 & 1 \end{bmatrix}_1$$
$$\begin{bmatrix} 3 & 2 & 8 & 1 \\ 3 & 3 & 9 & 7 \\ 3 & 6 & 5 & 4 \\ 7 & 8 & 9 & 2 \end{bmatrix}_2$$ 
$$\begin{bmatrix} 1 & 9 & 0 & 1 \\ 4 & 2 & 9 & 7 \\ 5 & 3 & 5 & 3 \\ 2 & 7 & 8 & 9 \end{bmatrix}_3$$ 
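
An array with the same 4x4x3 shape can be built in R with the `array` function.  This is a minimal sketch: the array is filled with the integers 1 to 48 rather than the values displayed above, and the name `arr` is illustrative.

```r
# Fill a 4x4x3 array column by column, then layer by layer
arr <- array(1:48, dim = c(4, 4, 3))

dim(arr)                # 4 4 3
length(arr)             # 48

arr[, , 2]              # the second 4x4 layer (values 17 to 32)
arr[1, 2, 3]            # row 1, column 2, layer 3: 37
is.matrix(arr[, , 1])   # TRUE: each layer is a matrix
```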

Matrices are the most common type of multidimensional array we deal with, so we'll focus on those going forward.  We can build a matrix using the __*matrix*__ function.

In [None]:
mat <- matrix(1:16, nrow = 4, ncol = 4)
mat
is.matrix(mat)
mat_byrow <- matrix(1:16, nrow = 4, ncol = 4, byrow = TRUE)
mat_byrow
is.matrix(mat_byrow)

The first argument to __*matrix*__ is the data itself.  We use __*nrow*__ and __*ncol*__ to define the dimensions of the matrix, and the __*byrow*__ argument determines whether the matrix is populated by row or by column.  We can check the dimensions of our matrix using the __*dim*__ function.

In [None]:
dim(mat)

<div class="alert alert-block alert-danger">
    <b>&#128721; Caution:</b> <i>dim</i> returns NULL when called on a vector because a vector has no dimension attribute.  Use <i>length</i> instead.  <i>length</i> also works on a matrix, but it returns the total number of elements in the matrix (e.g., a 4x4 matrix has length = 16).
</div>
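
The behavior of `dim` and `length` on vectors and matrices can be checked with a short sketch.  The names `v` and `m` are illustrative.

```r
v <- 1:16
m <- matrix(v, nrow = 4, ncol = 4)

dim(v)       # NULL: a plain vector has no dim attribute
length(v)    # 16

dim(m)       # 4 4
nrow(m)      # 4
ncol(m)      # 4
length(m)    # 16: total number of elements

# Assigning dimensions turns a vector into a matrix
dim(v) <- c(2, 8)
is.matrix(v) # TRUE
```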

To access elements of a matrix, we use the single bracket notation of vectors but with a second dimension, with the first dimension representing rows and the second representing columns.  By leaving one or the other blank, we can extract all rows or columns (just don't forget the comma).

In [None]:
mat[1,1]

In [None]:
mat[1,]

In [None]:
mat[,1]

In [None]:
mat[1:3, 1:3]

In [None]:
mat[mat > 5]
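
Two details of matrix subsetting are worth a sketch: extracting a single row or column drops the result to a vector unless `drop = FALSE` is given, and rows and columns can be addressed by name.  The matrix `m` and its row and column labels are hypothetical.

```r
m <- matrix(1:16, nrow = 4, ncol = 4)

# Extracting a single row returns a plain vector by default
r1 <- m[1, ]
dim(r1)                      # NULL

# drop = FALSE keeps the result as a 1x4 matrix
r1_mat <- m[1, , drop = FALSE]
dim(r1_mat)                  # 1 4

# Logical subsetting flattens the result to a vector
length(m[m > 5])             # 11

# Row and column names allow subsetting by name
rownames(m) <- paste0("S", 1:4)   # hypothetical sample labels
colnames(m) <- paste0("P", 1:4)   # hypothetical protein labels
m["S2", "P3"]                # 10
```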

We will cover matrices and matrix operations in more detail in <b>Submodule 3: Introduction to Linear Models</b>.

---

### Data Frames

A **data frame** is a two-dimensional data structure analogous to a spreadsheet.  The rows of a data frame represent observations (e.g., samples) while the columns represent variables that describe them (e.g., treatment state, time points, etc.).  Data frames are heterogeneous in the sense that each column can be its own data type, but within a column, all data must be of the same type.  For example, if you have a column "Time Point" that consists of numeric values, then all elements of that column will be numeric unless coerced to something else.

The data frame is one of the most common data structures and will almost always be the default data structure when importing external data (such as a spreadsheet).  One common use of data frames in bioinformatics is for phenotypic data (also called metadata), that is, data used to describe samples (as opposed to experimental data).

<div class="alert alert-block alert-danger">
    <b>&#128721; Caution:</b> One of the most common problems encountered with data frames is coercion of your data because the imported data isn't sufficiently cleaned (especially when importing data from Excel).  Issues include unexpected text or invisible control codes in numeric columns that coerce the data to character, or improperly defined missing data ('0' or '-' instead of blank cells or NA).
</div>
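
The sketch below shows how a single stray symbol changes the type of an imported column, and one way to guard against it using the `na.strings` argument of `read.csv`.  The data in `csv_text` are made up, and the `text` argument is used only so that the example does not depend on an external file.

```r
# Hypothetical file contents: "-" was used to mark a missing measurement
csv_text <- "Sample,Conc
S1,1.5
S2,-
S3,2.7"

# Default import: the "-" forces the whole column to character
raw_df <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(raw_df$Conc)               # "character"

# Declaring "-" as a missing-value code keeps the column numeric
clean_df <- read.csv(text = csv_text, na.strings = c("", "NA", "-"))
class(clean_df$Conc)             # "numeric"
clean_df$Conc                    # 1.5 NA 2.7
mean(clean_df$Conc, na.rm = TRUE)   # 2.1
```

Checking `str()` on the imported object immediately after reading it is a simple habit that catches most of these problems early.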

<div class="alert alert-block alert-info">
<b>&#9995; Tip:</b> While it's useful to know how to manipulate data frames in base R, we strongly recommend users learn how to use the <b>tibble</b> instead.  The tibble is a modern implementation of the data frame that has most of the same functionality but removes some of the weird quirks that have plagued data frames since their inception.  Tibbles are implemented in the <i>tibble</i> package which is part of the <i>tidyverse</i> suite of packages.  This makes tibbles fully compatible with other <i>tidyverse</i> packages such as <i>ggplot2</i> and <i>dplyr</i> and greatly simplifies tasks such as data subsetting.</div>

Data frames are constructed using the __<i>data.frame</i>__ function.

In [None]:
df <- data.frame(
    "Sample" = c("S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8", "S9"),
    "Treatment" = c(rep("Control", 3), rep("Placebo", 3), rep("Treatment", 3))
)
df$Treatment <- factor(df$Treatment)
df
dim(df)
str(df)

We can extract a specific column from a data frame using either the matrix notation we learned earlier or the __<i>$</i>__ notation.  In the example above, we use the __<i>$</i>__ notation to convert the Treatment column to a factor (since this will be a covariate in a regression model).
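
The different ways of reaching a column, and of filtering rows by a column's values, are sketched here.  The data frame `d` mirrors the structure of `df` but uses a separate name so the original object is left untouched.

```r
d <- data.frame(
  Sample    = paste0("S", 1:9),
  Treatment = factor(rep(c("Control", "Placebo", "Treatment"), each = 3))
)

d$Treatment              # factor vector
d[["Treatment"]]         # same factor vector
d[, "Treatment"]         # same again, using matrix-style notation
class(d["Treatment"])    # "data.frame": single brackets keep the data frame

levels(d$Treatment)      # "Control" "Placebo" "Treatment"
table(d$Treatment)       # 3 samples per group

# Rows can be filtered with a logical condition on a column
d[d$Treatment == "Control", ]
d$Sample[d$Treatment == "Placebo"]   # "S4" "S5" "S6"
```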

---

<p><span style="font-size: 30px"><b>Quizzes</b></span> <span style="float : inline;">(run the command below to display the quizzes)</span> </p>

In [None]:
IRdisplay::display_html('<iframe src="quizes/Chapter2_Quizes.html" width=100% height=450></iframe>')

## Why is this important?

In addition to being able to organize and tidy our data using different data structures, many of the functions in R require specific data structures.  For example, matrix operations tend to work on matrices and vectors, and you might get errors or strange results if you try those operations on data frames.  When we begin discussing regression, it will be important to know what types of data structures we are working with.  In proteomics analysis, for example, we will use an input matrix of proteomic signal intensities, a design matrix that defines our model, and an error vector.  The output of our regression model is a vector of coefficients, which is itself a weighting vector that indicates which covariate has the greatest effect on our dependent variable.  Understanding how these data structures interact is an important skill in bioinformatics.
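
As a rough preview of how these structures fit together, the sketch below builds a design matrix from a factor column and estimates coefficients with matrix operations, then compares the result with `lm`.  The response `y` is simulated purely for illustration, the names `d`, `X`, `beta`, and `fit` are hypothetical, and this is not the analysis used later in the course.

```r
set.seed(1)   # simulated data, for illustration only

d <- data.frame(
  Treatment = factor(rep(c("Control", "Placebo", "Treatment"), each = 3))
)
# Made-up signal intensities: the Treatment group is shifted upward
y <- c(rnorm(3, mean = 10), rnorm(3, mean = 10), rnorm(3, mean = 13))

# Design matrix: a numeric matrix with 9 rows and 3 columns
X <- model.matrix(~ Treatment, data = d)
dim(X)
colnames(X)

# Least-squares coefficients from the normal equations (X'X) b = X'y
beta <- solve(t(X) %*% X, t(X) %*% y)

# lm() fits the same model starting from the data frame
fit <- lm(y ~ Treatment, data = d)
cbind(manual = drop(beta), lm = coef(fit))
```

The data frame holds the sample descriptions, `model.matrix` turns them into a numeric matrix, and the coefficients come back as a vector, which is the same division of roles described in the paragraph above.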

---

In [None]:
sessionInfo()

---

## Conclusion

This submodule provided a foundational overview of R data structures, crucial for effective biomedical data analysis and biomarker discovery.  We explored fundamental data types like numeric, integer, character, and logical, along with special values like NA, NaN, and Inf, emphasizing the importance of recognizing and handling them correctly.  The core data structures—vectors, lists, matrices, arrays, and data frames—were explained in terms of their dimensionality and homogeneity. We learned how to create, manipulate, and access elements within these structures, highlighting the benefits of vectorization in R for efficient computations.  Understanding these principles is essential for building experimental objects, performing regression analysis, and interpreting the results in the context of biomarker discovery, as demonstrated by the example of proteomics data analysis.  Proper data structure management ensures compatibility with R functions and allows for a clear understanding of data organization and relationships, ultimately facilitating insightful data exploration and analysis in biomedical research.

## Clean up

Remember to move to the next notebook or shut down your instance if you are finished.

<div style="display: flex; justify-content: center; margin-top: 20px; width: 100%;"> 
    <div style="display: flex; justify-content: space-between; width: 50%;"> 
        <div> 
            <a href=https://github.com/NIGMS/Analysis-of-Biomedical-Data-for-Biomarker-Discovery/blob/master/GoogleCloud/Submodule01_Biomarker_Concepts.ipynb#overview>Previous section</a>                                            
        </div> 
        <div> 
            <a href="#top2">Top of this page</a>                                                      
        </div> 
        <div> 
            <a href=https://github.com/NIGMS/Analysis-of-Biomedical-Data-for-Biomarker-Discovery/blob/master/GoogleCloud/Submodule03_Intro_to_Linear_Models.ipynb#overview>Next section</a>
        </div> 
    </div>
</div>