<img src="./intro_images/MIE.PNG" width="100%" align="left" />

<table style="float:right;">
    <tr>
        <td>                      
            <div style="text-align: right"><a href="https://alandavies.netlify.com" target="_blank">Dr Alan Davies</a></div>
            <div style="text-align: right">Senior Lecturer Health Data Science</div>
            <div style="text-align: right">University of Manchester</div>
         </td>
         <td>
             <img src="./intro_images/alan.PNG" width="30%" />
         </td>
     </tr>
</table>

# 5.0 Data structures
****

#### About this Notebook
This notebook introduces <code>data structures</code> in R. These can be used to store data in more complex arrangements than single variables.

<div class="alert alert-block alert-warning"><b>Learning Objectives:</b> 
<br/> At the end of this notebook you will be able to:
    
- Explore and practice using the main data structures available in R 

- Investigate how data can be stored and accessed in these structures 

</div> 

<a id="top"></a>

<b>Table of contents</b><br>

5.1 [Atomic vectors](#vec)

5.2 [Lists](#lists)

5.3 [Matrices](#matrix)

5.4 [Data frames](#df)

So far we have been using simple variables to store single items of data for our programs. R supports more advanced data structures for storing and organising data. These include <code>vectors</code>, <code>lists</code>, <code>matrices</code> and <code>data frames</code>. We will examine each in turn.

<a id="vec"></a>
#### 5.1 Atomic vectors

Vectors can contain multiple values. Below we create a vector called <code>fruit</code> using the <code>c()</code> (combine) function, placing three items (in this case strings) inside the brackets, separated by commas. We can use the <code>length()</code> function to see how many items (<code>elements</code>) are contained in our vector.

In [1]:
fruit <- c('apple', 'pear', 'banana')

In [2]:
fruit

In [3]:
length(fruit)

To access the individual elements (items) of our vector, we can use the index of the element. 

In [5]:
fruit[1]

In [6]:
fruit[2]

<div class="alert alert-danger">
<b>Note:</b> In R the first element of a vector is <code>1</code> and not 0.
</div>

<img src="./intro_images/list.PNG" width="500" />

The image above shows another way of looking at our fruit vector. You can imagine a vector as a series of connected boxes that can contain some data (in this case the names of some fruit). We can now pass this whole vector around like a single variable. This is very useful if you want to store a lot of data (for example all the players in a football team or a list of patients having a knee replacement operation). To access the individual data in the boxes you use the index number (the numbers on the bottom of the boxes in the image above, starting at 1).

An empty vector can be defined by using the <code>vector()</code> function. All the elements of an atomic vector must be of the same type, for example all numbers or all text. If you mix types, R converts (coerces) the elements to a common type rather than raising an error.

In [7]:
my_vec <- vector()

In [8]:
my_vec

In [9]:
length(my_vec)
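
The rule that a vector holds a single type can be seen with a small sketch. The variable names <code>mixed_vec</code> and <code>num_vec</code> are made up for this illustration; the point is only that R quietly converts mixed values to one common type.

```r
# Mixing types in c() does not raise an error; R coerces every element
# to the most flexible type present (here, character).
mixed_vec <- c(1, "two", TRUE)
print(mixed_vec)          # all three elements are now strings
print(class(mixed_vec))   # "character"

# A vector built only from numbers stays numeric.
num_vec <- c(1, 2, 3)
print(class(num_vec))     # "numeric"
```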

We can create vectors of a specific type such as <code>numeric</code> or <code>logical</code> by specifying a number of elements in brackets.

In [11]:
numeric(5)

In [12]:
logical(5)

We can also create vectors by generating a sequence of numbers, for example the numbers from 1 to 10 in steps of 0.1.

In [15]:
seq_vec <- seq(1, 10, by=0.1)

In [16]:
seq_vec
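
As a rough illustration of how <code>seq()</code> behaves, the sketch below checks how many elements the sequence above contains and shows the alternative <code>length.out</code> argument, which asks for a number of evenly spaced values instead of a step size. The name <code>evenly_spaced</code> is invented for this example.

```r
# Stepping from 1 to 10 in increments of 0.1 gives 91 values.
seq_vec <- seq(1, 10, by = 0.1)
print(length(seq_vec))

# Instead of a step size we can request a fixed number of values.
evenly_spaced <- seq(0, 1, length.out = 5)
print(evenly_spaced)   # 0.00 0.25 0.50 0.75 1.00
```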

There are also some inbuilt helper vectors such as the upper and lower case letters.

In [17]:
LETTERS[1:5]

In [18]:
letters[1:8]

In [19]:
another_vec <- c(23, 13.3, 3, 6, 28)

<div class="alert alert-block alert-info">
<b>Task 1:</b>
<br> 
1. Print the value <code>13.3</code> contained in element 2 of <code>another_vec</code><br>
2. What element is the value <code>6</code> stored at in <code>another_vec</code>?
</div>

In [20]:
print(another_vec[2])

[1] 13.3


It is stored in element 4.

We can add items to the vector using the <code>append()</code> function. In the example below we use this function to add the number 5 to the end of the vector.

In [29]:
my_vec <- c(1, 2, 3, 4)
my_vec <- append(my_vec, 5) 
print(my_vec)

[1] 1 2 3 4 5


We can also remove items from a vector by value using the <code>%in%</code> operator. Here we essentially keep all the items apart from the ones we want to remove, which we specify in <code>elements_to_remove</code>.

In [30]:
elements_to_remove <- c(1)
my_vec <- my_vec[!(my_vec %in% elements_to_remove)]
print(my_vec)

[1] 2 3 4 5
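
Removing by value is not the only option. As a minimal sketch, elements can also be dropped by position using a negative index. The names <code>without_first</code>, <code>without_some</code>, <code>dup_vec</code> and <code>no_ones</code> are made up for this illustration.

```r
my_vec <- c(1, 2, 3, 4, 5)

# A negative index drops the element at that position.
without_first <- my_vec[-1]
print(without_first)      # 2 3 4 5

# Several positions can be dropped at once.
without_some <- my_vec[-c(2, 4)]
print(without_some)       # 1 3 5

# Removing by value with %in% drops every matching element,
# which matters when a value appears more than once.
dup_vec <- c(1, 2, 1, 3)
no_ones <- dup_vec[!(dup_vec %in% c(1))]
print(no_ones)            # 2 3
```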


<a id="lists"></a>
#### 5.2 Lists

Lists in R can contain data of different types. Lists can also contain sub lists.

In [22]:
my_list <- list(42, "Paul", 56, TRUE, 11)

In [24]:
print(my_list)

[[1]]
[1] 42

[[2]]
[1] "Paul"

[[3]]
[1] 56

[[4]]
[1] TRUE

[[5]]
[1] 11



We can select elements of a list using square brackets, as we do with vectors. Note that this returns a list containing the selected element rather than the value itself.

In [25]:
print(my_list[1])

[[1]]
[1] 42



To get at the actual values we need to use double square brackets <code>[[]]</code>.

In [26]:
print(my_list[[1]])

[1] 42
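
The difference between single and double square brackets can be checked with <code>class()</code>. This is only a small sketch; the variable names <code>single</code> and <code>double</code> are invented for the example.

```r
my_list <- list(42, "Paul", 56, TRUE, 11)

# Single brackets return a (shorter) list.
single <- my_list[1]
print(class(single))    # "list"

# Double brackets return the value stored in the element.
double <- my_list[[1]]
print(class(double))    # "numeric"

# Only the extracted value can be used in arithmetic.
print(double + 1)       # 43
```

Trying <code>my_list[1] + 1</code> would raise an error, because a list cannot be used directly in arithmetic.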


We can combine lists using <code>c()</code>. Here we have one list <code>lst1</code> with the values <code>2, 4 and 5</code>. We can combine this with a second list <code>lst2</code> and store the combined list in <code>lst3</code>. When we output the third list's contents we see it contains the values of the two previous lists.

In [32]:
lst1 <- list(2, 4, 5)
lst2 <- list(2, 5)
lst3 <- c(lst1, lst2)
print(lst3)

[[1]]
[1] 2

[[2]]
[1] 4

[[3]]
[1] 5

[[4]]
[1] 2

[[5]]
[1] 5



We can also remove elements from a list by assigning <code>NULL</code> to the element we want to remove. For example, in the list below we remove the value <code>5</code>, which is also the fifth element. When we output the list after this operation we see the last element has been removed.

In [34]:
my_list <- list(1, 2, 3, 4, 5)
my_list[[5]] <- NULL
print(my_list)

[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 3

[[4]]
[1] 4



<div class="alert alert-block alert-info">
<b>Task 2:</b>
<br> 
1. Make a new vector with at least 5 text (string) items<br>
2. Use the <code>sort()</code> function to put your vector into alphabetical order<br>
<strong>HINT:</strong> you may want to look up the sort function in the <a href="https://stat.ethz.ch/R-manual/R-devel/library/base/html/sort.html" target="_blank">documentation</a>.
</div>

In [37]:
new_vec <- c('apples', 'oranges', 'bananas', 'pears', 'grapes')
sort(new_vec, decreasing = FALSE)

Lists can also be used to create more complex structures by combining them together. Consider the following example that stores some medical information about a patient.

<img src="./intro_images/dr.jpg" width="500" />

In [40]:
med_data <- list(name="Mike Smith",
                dob="13/12/1979",
                age=40,
                BP="120/80",
                HR=76,
                PMH=c("diabetes", "hypertension", "atrial fibrillation"))

In [41]:
print(med_data)

$name
[1] "Mike Smith"

$dob
[1] "13/12/1979"

$age
[1] 40

$BP
[1] "120/80"

$HR
[1] 76

$PMH
[1] "diabetes"            "hypertension"        "atrial fibrillation"



Here we can store multiple items of information with associated names in a single data structure. Each name is followed by an equals sign <code>=</code> and its associated value. We can access the data stored within using the double square bracket notation with the name in quotes, like so:

In [42]:
print(med_data[["BP"]])

[1] "120/80"


In [43]:
print(med_data[["name"]])

[1] "Mike Smith"


In [44]:
print(med_data[["PMH"]])

[1] "diabetes"            "hypertension"        "atrial fibrillation"


In [46]:
print(med_data[["PMH"]][3])

[1] "atrial fibrillation"


Another way of accessing the data by its name is with the dollar sign, in the form <code>&lt;name of list&gt;$&lt;name of list item&gt;</code>, like so.

In [47]:
print(med_data$BP)

[1] "120/80"


We can also change the item stored under a name by assigning a new value to it. For example, changing the blood pressure (BP) value:

In [48]:
med_data$BP <- "132/76"
print(med_data$BP)

[1] "132/76"
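
The same assignment syntax can also create a new named item. The sketch below uses a shortened copy of the patient list; the <code>RR</code> (respiratory rate) item and its value are hypothetical and not part of the original example.

```r
# A shortened copy of the patient list, for illustration only.
med_data <- list(name = "Mike Smith", age = 40, BP = "120/80", HR = 76)

# names() lists the labels currently stored in the list.
print(names(med_data))

# Assigning to a name that does not exist yet adds a new item.
med_data$RR <- 16            # hypothetical respiratory rate
print(names(med_data))       # "name" "age" "BP" "HR" "RR"

# Asking for a name that is not in the list returns NULL, not an error.
print(is.null(med_data$weight))   # TRUE
```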


<div class="alert alert-block alert-info">
<b>Task 3:</b>
<br> 
1. Add the medical condition <code>irritable bowel syndrome (IBS)</code> to the past medical history <code>(PMH)</code> in the list and print the result
</div>

In [49]:
med_data[["PMH"]][4] <- "irritable bowel syndrome (IBS)"
print(med_data)

$name
[1] "Mike Smith"

$dob
[1] "13/12/1979"

$age
[1] 40

$BP
[1] "132/76"

$HR
[1] 76

$PMH
[1] "diabetes"                       "hypertension"                  
[3] "atrial fibrillation"            "irritable bowel syndrome (IBS)"



<a id="matrix"></a>
#### 5.3 Matrices

Matrices (singular matrix) are used to store data in a 2D arrangement with rows and columns. We often use matrices to store information for Machine Learning tasks, for example to organise training data before applying an ML model. Matrices are also central to linear algebra, where complex problems in higher dimensions can be reduced to simpler linear problems that are easier to compute and solve, using what's called <code>linear approximation</code>. Let's take a look at how we can create a matrix using R. Note that we pass <code>byrow=TRUE</code> so that the values fill the matrix row by row; by default R fills a matrix column by column.

In [50]:
M <- matrix(c(1,2,3,4,2,4,5,2,5), nrow=3, ncol=3, byrow=TRUE)

In [52]:
print(M)

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    2    4
[3,]    5    2    5


We would represent such a matrix in math notation as follows:

$$ M = \begin{bmatrix} 1 & 2 & 3 \\[0.3em] 4 & 2 & 4 \\[0.3em] 5 & 2 & 5 \end{bmatrix}   $$

We can then refer to specific values using a subscript notation, where $i$ is the row index and $j$ is the column index.

$$ M_{i,j} $$

$$
\begin{bmatrix} M_{1,1} & M_{1,2} & M_{1,3} \\[0.3em] M_{2,1} & M_{2,2} & M_{2,3} \\[0.3em] M_{3,1} & M_{3,2} & M_{3,3} \end{bmatrix}  
$$

We can access individual elements of a matrix in R using the row then column index specified in square brackets. For example, to get the number <code>4</code>, which is in the 2nd row and 1st column, we write:

In [53]:
print(M[2, 1])

[1] 4
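
Leaving the row or column index blank selects a whole column or row. As a small sketch on the same matrix (the names <code>second_row</code> and <code>third_col</code> are made up for this example):

```r
M <- matrix(c(1,2,3,4,2,4,5,2,5), nrow=3, ncol=3, byrow=TRUE)

# Leave the column index empty to get an entire row.
second_row <- M[2, ]
print(second_row)    # 4 2 4

# Leave the row index empty to get an entire column.
third_col <- M[, 3]
print(third_col)     # 3 4 5

# dim() reports the number of rows and columns.
print(dim(M))        # 3 3
```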


If two matrices have the same shape they can be summed. To do this we simply add the numbers in the corresponding positions, i.e.

$$ \begin{bmatrix} 1 & 0 \\[0.3em] 2 & 4 \end{bmatrix} + \begin{bmatrix} 3 & 6 \\[0.3em] 3 & 4 \end{bmatrix} = \begin{bmatrix} 1 + 3 & 0 + 6 \\[0.3em] 2 + 3 & 4 + 4 \end{bmatrix} = \begin{bmatrix} 4 & 6 \\[0.3em] 5 & 8 \end{bmatrix}$$

We can recreate the two matrices shown above in R like so and then output them.

In [54]:
M_1 <- matrix(c(1,0,2,4), nrow=2, ncol=2, byrow=TRUE)
M_2 <- matrix(c(3,6,3,4), nrow=2, ncol=2, byrow=TRUE)

In [55]:
print(M_1)
print(M_2)

     [,1] [,2]
[1,]    1    0
[2,]    2    4
     [,1] [,2]
[1,]    3    6
[2,]    3    4


We can use the conventional plus operator <code>+</code> to add the two matrices together.

In [56]:
print(M_1 + M_2)

     [,1] [,2]
[1,]    4    6
[2,]    5    8


Unfortunately matrix multiplication works differently. We first need to ensure that the number of columns in the first matrix is equal to the number of rows in the second. Each element of the result is then found by multiplying the values in a row of the first matrix by the corresponding values in a column of the second and adding the products together.

$$
\begin{bmatrix} 1 & 0 \\[0.3em] 2 & 4 \\[0.3em] \end{bmatrix} \times
\begin{bmatrix} 3 & 6 \\[0.3em] 3 & 4 \\[0.3em] \end{bmatrix} =
\begin{bmatrix} (1 \times 3) + (0 \times 3) & (1 \times 6) + (0 \times 4) \\[0.3em] 
(2 \times 3) + (4 \times 3) & (2 \times 6) + (4 \times 4)
\\[0.3em] \end{bmatrix} = 
\begin{bmatrix} 3 & 6 \\[0.3em] 18 & 28 \\[0.3em] \end{bmatrix}
$$

If we use the conventional multiplication operator <code>*</code>, R multiplies the matrices element by element, which is not the matrix product shown above.

In [57]:
print(M_1 * M_2)

     [,1] [,2]
[1,]    3    0
[2,]    6   16


The matrix multiplication operator <code>%*%</code> needs to be used instead, like so:

In [58]:
print(M_1 %*% M_2)

     [,1] [,2]
[1,]    3    6
[2,]   18   28
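
The requirement that the columns of the first matrix match the rows of the second can be sketched with two non-square matrices. The matrices <code>P</code> and <code>Q</code> below are invented for this illustration, and <code>tryCatch()</code> is used only so that the failing case does not stop the notebook.

```r
# P has 2 rows and 3 columns; Q has 3 rows and 2 columns.
P <- matrix(1:6, nrow=2, ncol=3, byrow=TRUE)
Q <- matrix(1:6, nrow=3, ncol=2, byrow=TRUE)

# 3 columns in P match 3 rows in Q, so the product is a 2 x 2 matrix.
PQ <- P %*% Q
print(PQ)
print(dim(PQ))   # 2 2

# P has 3 columns but only 2 rows, so P %*% P is not defined
# and R raises a "non-conformable arguments" error.
result <- tryCatch(P %*% P, error = function(e) "non-conformable")
print(result)
```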


<div class="alert alert-block alert-info">
<b>Task 4:</b>
<br> 
1. Create and output the following two matrices<br>
$$
A = \begin{bmatrix} 1 & 4 & 4 & 2 \\[0.3em] 6 & 3 & 2 & 6 \\[0.3em] 4 & 4 & 2 & 5 \\[0.3em] 6 & 7 & 1 & 4 \end{bmatrix}
B = \begin{bmatrix} 5 & 6 & 2 & 1 \\[0.3em] 6 & 3 & 1 & 4 \\[0.3em] 5 & 2 & 5 & 2 \\[0.3em] 6 & 4 & 5 & 2 \end{bmatrix}
$$
2. Then output the sum and the matrix product of the two matrices.
</div>

In [59]:
A <- matrix(c(1,4,4,2,6,3,2,6,4,4,2,5,6,7,1,4), nrow=4, ncol=4, byrow=TRUE)
B <- matrix(c(5,6,2,1,6,3,1,4,5,2,5,2,6,4,5,2), nrow=4, ncol=4, byrow=TRUE)
print(A)
print(B)

     [,1] [,2] [,3] [,4]
[1,]    1    4    4    2
[2,]    6    3    2    6
[3,]    4    4    2    5
[4,]    6    7    1    4
     [,1] [,2] [,3] [,4]
[1,]    5    6    2    1
[2,]    6    3    1    4
[3,]    5    2    5    2
[4,]    6    4    5    2


In [60]:
print(A + B)
print(A %*% B)

     [,1] [,2] [,3] [,4]
[1,]    6   10    6    3
[2,]   12    6    3   10
[3,]    9    6    7    7
[4,]   12   11    6    6
     [,1] [,2] [,3] [,4]
[1,]   61   34   36   29
[2,]   94   73   55   34
[3,]   84   60   47   34
[4,]  101   75   44   44


<a id="df"></a>
#### 5.4 Data frames

As R was designed for statistical analysis rather than general purpose programming, it comes with a built-in data structure for representing data in tabular format. This is called a <code>data frame</code>.

Here we can define column names such as <code>id, patient_name and appointment_date</code>. We can then create vectors of data for the columns using <code>c</code> and providing the data itself.

In [66]:
df <- data.frame(
    id=c("123532", "454264", "564263", "675432", "853243"),
    patient_name=c("Paul Smith", "Dan Anders", "Suzzane Mills", "Jane Symoore", "Oludamillarie Samuals"),
    appointment_date=c("13/07/2022", "22/07/2022", "01/08/2022", "12/08/2022", "12/08/2022")
)

In [67]:
print(df)

      id          patient_name appointment_date
1 123532            Paul Smith       13/07/2022
2 454264            Dan Anders       22/07/2022
3 564263         Suzzane Mills       01/08/2022
4 675432          Jane Symoore       12/08/2022
5 853243 Oludamillarie Samuals       12/08/2022


<div class="alert alert-success">
<b>Note:</b> There are functions that can load data from files (e.g. Excel or CSV). This data can then be loaded directly into a data frame object.
</div>
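
As a rough sketch of loading a file, the example below writes a tiny CSV to a temporary file and reads it back with the base R function <code>read.csv()</code>. The file contents and the name <code>patients</code> are made up for this illustration; in practice you would pass the path of your own file.

```r
# Create a small CSV file in a temporary location (illustration only).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,patient_name",
             "123532,Paul Smith",
             "454264,Dan Anders"), csv_path)

# read.csv() returns a data frame. colClasses keeps the id column
# as text rather than converting it to a number.
patients <- read.csv(csv_path, colClasses = c(id = "character"))
print(patients)

# Tidy up the temporary file.
unlink(csv_path)
```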

In a similar way to lists we can access individual columns in several different ways, like so.

In [68]:
df$patient_name

In [69]:
df[["patient_name"]]

In [70]:
df[2]

           patient_name
1            Paul Smith
2            Dan Anders
3         Suzzane Mills
4          Jane Symoore
5 Oludamillarie Samuals


You can also specify columns by their index (e.g. the second and third columns) and all rows. 

In [71]:
df[ ,c(2,3)]

           patient_name appointment_date
1            Paul Smith       13/07/2022
2            Dan Anders       22/07/2022
3         Suzzane Mills       01/08/2022
4          Jane Symoore       12/08/2022
5 Oludamillarie Samuals       12/08/2022


You can also select a specific row or rows, as here where we retrieve the first row and first three columns.

In [75]:
df[1, c(1,2,3)]

      id patient_name appointment_date
1 123532   Paul Smith       13/07/2022


Other useful functions include <code>nrow</code> and <code>ncol</code>, which return the number of rows and the number of columns.

In [77]:
ncol(df)

In [78]:
nrow(df)

We can also list the column names with the <code>colnames</code> function.

In [79]:
colnames(df)

You can use this to overwrite one or more of the column names, as in this example where we rename the first column <code>id</code> to <code>Patient_id</code>.

In [80]:
colnames(df)[1] <- "Patient_id"

In [81]:
print(df)

  Patient_id          patient_name appointment_date
1     123532            Paul Smith       13/07/2022
2     454264            Dan Anders       22/07/2022
3     564263         Suzzane Mills       01/08/2022
4     675432          Jane Symoore       12/08/2022
5     853243 Oludamillarie Samuals       12/08/2022


Extra columns can be added by specifying the name of the data frame (in this case <code>df</code>) then a dollar sign <code>$</code> and the new column name. Here we add two new columns to represent the patients' systolic and diastolic blood pressures. Note that the number of items you add should equal the number of existing rows. A single value is repeated for every row, but a length that does not fit the number of rows generates an error.

In [82]:
df$BP_sys <- c(122, 130, 155, 142, 101)
df$BP_dia <- c(70, 82, 76, 55, 66)

In [83]:
print(df)

  Patient_id          patient_name appointment_date BP_sys BP_dia
1     123532            Paul Smith       13/07/2022    122     70
2     454264            Dan Anders       22/07/2022    130     82
3     564263         Suzzane Mills       01/08/2022    155     76
4     675432          Jane Symoore       12/08/2022    142     55
5     853243 Oludamillarie Samuals       12/08/2022    101     66
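
The length rule for new columns can be sketched as follows. The small data frame, the <code>clinic</code> column and the variable <code>outcome</code> are hypothetical, and <code>tryCatch()</code> is used only to capture the error so the code keeps running.

```r
# A small data frame with five rows (illustration only).
df <- data.frame(id = c("a", "b", "c", "d", "e"))

# Five values for five rows works as expected.
df$BP_sys <- c(122, 130, 155, 142, 101)

# Three values for five rows does not fit, so R raises an error.
outcome <- tryCatch({
    df$BP_dia <- c(70, 82, 76)
    "added"
}, error = function(e) "error")
print(outcome)    # "error"

# A single value is repeated for every row.
df$clinic <- "Clinic A"
print(df)
```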


The <code>summary</code> function can be used to provide statistical information for numerical fields such as the <code>mean</code>, <code>median</code>, <code>max</code> and <code>min</code> and quartiles.

In [84]:
print(summary(df))

  Patient_id                patient_name   appointment_date     BP_sys   
 123532:1    Dan Anders           :1     01/08/2022:1       Min.   :101  
 454264:1    Jane Symoore         :1     12/08/2022:2       1st Qu.:122  
 564263:1    Oludamillarie Samuals:1     13/07/2022:1       Median :130  
 675432:1    Paul Smith           :1     22/07/2022:1       Mean   :130  
 853243:1    Suzzane Mills        :1                        3rd Qu.:142  
                                                            Max.   :155  
     BP_dia    
 Min.   :55.0  
 1st Qu.:66.0  
 Median :70.0  
 Mean   :69.8  
 3rd Qu.:76.0  
 Max.   :82.0  
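
The figures reported by <code>summary</code> can also be computed one at a time with base R functions. The sketch below does this for the systolic values on their own; the vector name <code>bp_sys</code> is made up for the example.

```r
# The systolic blood pressures from the data frame above.
bp_sys <- c(122, 130, 155, 142, 101)

print(mean(bp_sys))      # 130
print(median(bp_sys))    # 130
print(range(bp_sys))     # 101 155 (minimum and maximum)

# The first and third quartiles.
quartiles <- quantile(bp_sys, c(0.25, 0.75))
print(quartiles)         # 122 and 142
```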


There are also several ways to extract subsets of data from a data frame. The following two examples show two ways of achieving the same result. In both cases we are interested in extracting all the rows of the data frame where the systolic blood pressure is more than <code>130 mmHg</code>.

In [88]:
df[which(df$BP_sys > 130), ]

  Patient_id  patient_name appointment_date BP_sys BP_dia
3     564263 Suzzane Mills       01/08/2022    155     76
4     675432  Jane Symoore       12/08/2022    142     55


In [93]:
subset(df, BP_sys > 130, select=c(1:5))

  Patient_id  patient_name appointment_date BP_sys BP_dia
3     564263 Suzzane Mills       01/08/2022    155     76
4     675432  Jane Symoore       12/08/2022    142     55
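
A third option, sketched below, is to place the logical condition directly inside the square brackets, and to combine conditions with <code>&amp;</code>. The data frame here is a shortened copy for illustration, and the names <code>high_sys</code> and <code>both</code> are invented for the example.

```r
# A shortened copy of the data frame (illustration only).
df <- data.frame(
    patient_name = c("Paul Smith", "Dan Anders", "Suzzane Mills",
                     "Jane Symoore", "Oludamillarie Samuals"),
    BP_sys = c(122, 130, 155, 142, 101),
    BP_dia = c(70, 82, 76, 55, 66)
)

# The condition produces TRUE/FALSE for each row; TRUE rows are kept.
high_sys <- df[df$BP_sys > 130, ]
print(high_sys)

# Conditions can be combined with & (and); columns can be chosen by name.
both <- df[df$BP_sys > 130 & df$BP_dia < 70, c("patient_name", "BP_sys", "BP_dia")]
print(both)
```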


<div class="alert alert-block alert-info">
<b>Task 5:</b>
<br> 
Using one of the methods for sub-setting, extract the data where the diastolic pressure is <code>70 or less</code> and only retrieve the patient's name and diastolic BP.
</div>

In [95]:
subset(df, BP_dia <= 70, select=c(2,5))

           patient_name BP_dia
1            Paul Smith     70
4          Jane Symoore     55
5 Oludamillarie Samuals     66


In the next notebook we look at <code>iteration</code>, which can be used to run code multiple times or until one or more conditions are met. This is often used in conjunction with various data structures.

### Notebook details
<br>
<i>Notebook created by <strong>Dr. Alan Davies</strong>.
<br>
&copy; Alan Davies 2022

## Notes: