<a href="https://colab.research.google.com/github/ARU-Bioinf-MSB-2020/week_1/blob/main/R_programming_language_vectors.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# R programming language - Vectors

Understanding vectors is crucial before learning about the other data structures in R. The tutorials below will walk you through the basic concepts you need to understand before progressing any further.

## Creating vectors

To create a vector manually, use the `c()` function (short for combine), which takes an arbitrary number of arguments and combines them into a single vector. The only arguments that need to be passed to `c()` here are the data values that we want to combine. For example, let's make a vector containing a set of chromosome numbers.

In [None]:
Chr <- c(19, 22, 21, 18) 
Chr

If words are used as data values they must be surrounded by quotes (either single or double - there's no difference in function, though pairs of quotes must be of the same type).

In [None]:
Ref <- c("A","G","A","T")
Ref

#### *Exercise: If quotes are not used, R will try to find an object with the name you have typed. Remove the quotes from the example above and see what happens.*

To view the whole data structure, just type its name as before, or use `head()` to see only the first few elements of the vector.

In [None]:
Ref <- c("C","G","A","T","C","G","A","T","C","G","A","T","C","G","A","T","C","G","A","T","C","G","A","T","C","G","A","T","C","G","A","T","C","G","A","T","C","G","A","T")
head(Ref)


## Data types in vectors
Within a vector all of the values must be of the same ‘type’.  There are four basic data types in R:


*   Numeric - An integer or floating point number
*   Character - Any amount of text from single letter to whole essay
*   Logical - TRUE or FALSE values
*   Factor - A categorised set of character values

Try and think about what the data categories might mean.

Factors were, before R version 4.0, the default way in which R stored many pieces of text (for example, text columns read into a data frame). They are used when grouping data for statistical or plotting operations and in many cases are interchangeable with characters, but there are differences, especially when merging or sorting data, which can cause problems if you use the wrong type for this kind of data.



To see what type of data you’re storing in a vector you can either look in the workspace tab of RStudio or use the `class()` function.



In [None]:
class(c(1,2,3))
class(c("A","G","T"))
class(c(TRUE,TRUE,FALSE))
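
Because every value in a vector must share one type, R converts mixed input to a common type rather than raising an error. The short sketch below illustrates this, together with the class reported for a factor. The variable names `mixed` and `bases` are made up for this example and do not appear elsewhere in the tutorial.

```r
# Mixing numbers, text and logicals: everything is coerced to character
mixed <- c(1, "A", TRUE)
mixed          # "1" "A" "TRUE"
class(mixed)   # "character"

# A factor stores text as a fixed set of categories (levels)
bases <- factor(c("A", "G", "A", "T"))
class(bases)   # "factor"
levels(bases)  # "A" "G" "T"
```

If `class()` reports a type you did not expect, an accidental mix of types in the input is a likely cause.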

## Functions for making vectors
Although you can make vectors manually using the `c()` function, there are also some specialised functions for making vectors.  These provide a quick and easy way to make up commonly used series of values.



The `seq()` function can be used to make up arithmetic series of values.  You can specify either a start, end and increment (by) value, or a start, increment (by) and length (length.out) and the function will make up an appropriate vector for you. 

In [None]:
seq(from=5,to=10,by=0.5)
seq(from=1,by=2,length.out=10)

The `rep()` function simply repeats a value a specified number of times.

In [None]:
rep("1",5)
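
`rep()` also accepts a vector as its first argument. As a small sketch (the names `whole` and `element_wise` are illustrative only), the `times` argument repeats the whole vector, while `each` repeats every element in turn:

```r
# Repeat the whole vector twice
whole <- rep(c("A", "T"), times = 2)
whole          # "A" "T" "A" "T"

# Repeat each element twice before moving to the next
element_wise <- rep(c("A", "T"), each = 2)
element_wise   # "A" "A" "T" "T"
```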

Finally, there is a special operator for creating vectors of sequential integers.  You simply separate the lower and higher values by a colon to generate a vector of the intervening values.

In [None]:
10:20

You can also combine these functions with `c()` to make up more complicated vectors.
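
As a minimal sketch of combining these helpers, the example below joins a colon range, a `seq()` series and a `rep()` run into one vector. The values and the name `combined` are arbitrary choices for illustration.

```r
# Each argument to c() is itself a vector; c() joins them end to end
combined <- c(1:3, seq(from = 10, to = 20, by = 5), rep(0, 2))
combined          # 1 2 3 10 15 20 0 0
length(combined)  # 8
```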

## Accessing vector subsets
To access specific positions in a vector you can put square brackets after it, and then use a vector of the index positions you want to retrieve.  Note that, unlike most other programming languages, index counts in R start at 1 and not 0. To view the 2nd value in the data structure:

In [None]:
Ref <- c("A","G","A","T")
Ref[2]    

Note that in this instance the number 2 in the above expression is actually just a shortcut for `c(2)`, so what we are supplying inside the brackets is a vector of index positions.  We can therefore also use the automated ways of making vectors of integers to pull out larger subsets easily.  To view a range of values we could use the lower:higher notation we saw above:

In [None]:
Ref <- c("A","G","A","T")
Ref[2:4]

We should think of the statement above as two separate operations.  We use 2:4 to make a vector with 2,3,4 in it, and then we put that into square brackets to select the corresponding values from Ref. To view or select non-adjacent values the c() function can be used again. To view the 2nd and the 4th values:

In [None]:
Ref <- c("A","G","A","T")
Ref[c(2,4)]
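
The two-step view of subsetting described above can be made explicit by storing the index vector first. This is only a sketch; the name `idx` is introduced here for illustration.

```r
Ref <- c("A", "G", "A", "T")

# Step 1: build the vector of index positions
idx <- 2:4
idx        # 2 3 4

# Step 2: use that vector inside the square brackets
Ref[idx]   # "G" "A" "T"

# A single number is just a vector of length one
identical(Ref[2], Ref[c(2)])   # TRUE
```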

## Accessing vectors using names
In all R data structures you have the option of assigning names to numeric positions so that you can use the name to access the data instead of the position.  For vectors you read and assign names using the `names()` function.  When you first create a vector there won’t be any names associated with it, but you can assign some and then use them in the places where you would otherwise use the positions, using the same square bracket notation.

We assign names by treating the function as a variable and assigning data to it.  This style of assignment is very common in R. To illustrate this let's take some base positions and assign a reference allele to them.

In [None]:
Position <- c(112,134,157,187)
Refs <- c("A","G","C","T")
names(Position) <- Refs
Position

We now see that the names are associated with the values in the vector.

We can also use the names to retrieve the corresponding values.  Even though names are assigned we can still use index positions too.
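
To illustrate both styles of access without touching the exercise below, here is a small sketch using a separate, made-up vector. The name `Depth` and its values are hypothetical and are not part of the tutorial data.

```r
# Hypothetical read depths, named by chromosome
Depth <- c(35, 42, 18)
names(Depth) <- c("chr19", "chr22", "chr21")

Depth["chr22"]   # access by name: returns the value 42
Depth[2]         # access by position: the same element
names(Depth)     # read the names back: "chr19" "chr22" "chr21"
```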

#### *Question: Fill in the missing space to retrieve the value 112 from the vector Position by name, using the code above*



In [None]:
 [" "]

## Vectorised operations
The other big difference between R and other programming languages is that normal operations are designed to be applied to whole vectors rather than individual values.  This means that you can very quickly and easily apply changes to whole sets of data without having to write complex code to loop through individual values.

Imagine we wanted to add a single bp to the position of every variant in the dataset from the last tutorial. We can do this using a single operation.

In [None]:
Position <- c(112,134,157,187)
Position + 1

You can also use two vectors in any mathematical operation and the calculation will be performed on the equivalent positions between the two vectors.  If one vector is shorter than the other then the calculation will ‘wrap round’ and start again at the beginning (but you will get a warning if the longer vector’s length isn’t a multiple of the shorter vector’s length).

In [None]:
x <- 1:10
y <- 21:30
x+y
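
The 'wrap round' behaviour described above can be seen with vectors of unequal length. In this sketch the names `short`, `recycled` and `msg` are illustrative; `tryCatch()` is used only to capture the warning text instead of printing it.

```r
short <- c(10, 20)

# Length 6 is a multiple of length 2, so the short vector is reused silently
recycled <- 1:6 + short
recycled   # 11 22 13 24 15 26

# Length 5 is not a multiple of 2: R still computes a result but warns
msg <- tryCatch(1:5 + short, warning = function(w) conditionMessage(w))
msg        # the text of the warning
```

Silent recycling is convenient for cases such as `Position + 1`, but it can hide mistakes when two vectors were meant to be the same length, so it is worth checking lengths with `length()` when results look odd.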


It is important to understand how operations involving two vectors work since this ends up being a critical aspect of many parts of R.  The basic rules are that if two vectors are the same length, then equivalent indices are paired together.