# Introduction to R Part 5: Vectors

In data analysis, you typically work with large collections of related values rather than single values. As a language built for statistics and data analysis, R has data structures designed to make it easy to perform operations on many data values at the same time. R's most basic data structure is the vector. In R, a vector is a sequence of data elements of the same atomic type. You can have numeric vectors, logical vectors, character vectors and so on.

To create a vector with specific values, use the c() function. c() takes a comma-separated sequence of elements as input and combines them into a vector:

In [1]:
x <- c(1,2,3)  #Create a numeric vector and assign it to x

print(x)  #Print the value of x to the screen

y <- c("Life","Is","Study")  #Create a character vector

print(y)  #Print y to the screen

[1] 1 2 3
[1] "Life"  "Is"    "Study"


You can also combine two vectors using c():

In [2]:
z <- c(x,y) #combine vectors x and y

print(z)

[1] "1"     "2"     "3"     "Life"  "Is"    "Study"


If you combine vectors of different types, as shown above, R automatically converts (coerces) the elements to a single type that can represent all of them. In this case, the numbers are converted into their character equivalents.
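
The conversion follows a fixed ordering of atomic types: logical values can become numbers, and anything can become a character string. Here is a minimal sketch of that behavior (the variable names mixed_num and mixed_chr are illustrative and not part of the examples above):

```r
# Logical combined with numeric: TRUE becomes 1
mixed_num <- c(TRUE, 2, 3)
print(mixed_num)
typeof(mixed_num)   # "double"

# Logical combined with character: TRUE becomes "TRUE"
mixed_chr <- c(TRUE, "a")
print(mixed_chr)
typeof(mixed_chr)   # "character"
```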


### Vector Indexing


When you create a vector, each element in the vector is assigned an index based on its position in the vector. The first element is at index position 1, the second element is at index position 2 and so on. (Note that unlike many other programming languages, indexes in R start at 1 instead of 0.)

When you print a vector to the screen, each line starts with a number in square brackets followed by vector values. The number in square brackets indicates the index of the first value listed on that line. For large vectors, this labeling can be helpful. For instance, consider a vector consisting of 100 random numbers between 0 and 1:

In [3]:
random_data <- runif(100)  #Create a vector of 100 random numbers

print(random_data) #Print the vector

  [1] 0.011052210 0.490145197 0.618312998 0.490490869 0.553500076 0.419672743
  [7] 0.273818429 0.483188420 0.356707835 0.106790057 0.454534974 0.988843087
 [13] 0.100713713 0.711968180 0.836512019 0.702161684 0.348264976 0.632885545
 [19] 0.070282435 0.208262876 0.468534216 0.110959568 0.211663193 0.950492488
 [25] 0.045724627 0.069961404 0.981119999 0.254390155 0.773755667 0.033159428
 [31] 0.370948425 0.065027314 0.354573011 0.021361357 0.057543301 0.628536676
 [37] 0.005561950 0.862910367 0.510286567 0.819589180 0.322644416 0.249997037
 [43] 0.322395303 0.287262184 0.201719848 0.172489972 0.294691101 0.026757865
 [49] 0.996838058 0.202869637 0.756352332 0.079539363 0.958007912 0.867734396
 [55] 0.552013245 0.379368102 0.459864587 0.149453555 0.470665427 0.991903896
 [61] 0.307580338 0.610368726 0.807964261 0.009829363 0.989950905 0.512811468
 [67] 0.090756134 0.652818237 0.613619230 0.911626308 0.315600886 0.122582456
 [73] 0.819179695 0.928432825 0.008830840 0.436736538 0.61037360

In this case, the index counters on the left-hand side are more useful: they give an immediate sense of the vector's size and of where each value sits in it.

You can access a specific value in a vector by typing the name of the vector and then wrapping the index associated with the value you want to access in square brackets:

In [4]:
random_data[7]  #Get the value at index 7

Attempting to access an index beyond the end of the vector returns NA, which denotes a missing value.

In [5]:
random_data[200]  

[1] NA
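
Because NA stands in for a missing value, it is often worth checking for it before using the result. Below is a short sketch using is.na() (the vector small_vec is made up for illustration):

```r
small_vec <- c(10, 20, 30)

out_of_range <- small_vec[5]   # Index 5 does not exist, so this is NA
print(out_of_range)

is.na(out_of_range)            # TRUE: the value is missing
```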

You can access ranges of values by placing a colon between the starting and ending indices of the range:

In [6]:
subset1 <- random_data[7:14]  #Get values from index 7 to 14

print(subset1)

[1] 0.2738184 0.4831884 0.3567078 0.1067901 0.4545350 0.9888431 0.1007137
[8] 0.7119682


You can even access a specific subset of values by placing a vector of indexes inside the square brackets:

In [7]:
subset2 <- random_data[c(1,10,100)] #Get the first, tenth and 100th value

print(subset2)

[1] 0.01105221 0.10679006 0.14817441


A subset of a vector is just a shorter vector. In fact, singular values are technically vectors of length 1, so all numbers and other atomic data types we've used up till now were vectors all along! You can check the length of a vector with the length() function:

In [8]:
length(10)  #A singular value is a vector of length 1

length(random_data) 
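
For reference, these are the lengths to expect. In this sketch, runif(100) stands in for random_data so that the snippet runs on its own:

```r
length(10)                 # 1: a single number is a vector of length 1
length(c("a", "b", "c"))   # 3
length(runif(100))         # 100
```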

Here are a few other useful ways to index into vectors:

In [9]:
#Adding a minus sign excludes a given index:

y <- c("Life","Is","Study")
y <- y[-2]                   #Exclude index 2
print(y)

#A minus sign can also exclude a given range of indices:

random_data <- runif(50)            #Generate 50 random numbers
random_data_sub <- random_data[-(2:49)] #Exclude the range 2 through 49
print(random_data_sub)

[1] "Life"  "Study"
[1] 0.7264720 0.6743836
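
The excluded indexes do not have to be contiguous. As a small sketch, a minus sign in front of c() drops an arbitrary set of positions (letters_vec is a hypothetical vector used only here):

```r
letters_vec <- c("a", "b", "c", "d", "e")

dropped <- letters_vec[-c(1, 3)]   # Exclude indexes 1 and 3
print(dropped)                     # "b" "d" "e"
```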


You can also index a vector with a logical vector of the same length. In this case the subset is created from each index where the corresponding logical vector is TRUE. Indexing with a logical vector is a common way to filter a numeric or character vector for values that fulfill certain criteria:

In [10]:
#Create a logical vector identifying values over 0.5 in random_data

logical_over_half <- (random_data > 0.5)
print(logical_over_half)

 [1]  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE
[13]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
[25] FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE
[37]  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE
[49] FALSE  TRUE


In [11]:
#Use the logical vector to create a subset of the values over 0.5
over_half <- random_data[logical_over_half]

print(over_half)

 [1] 0.7264720 0.9341863 0.7109709 0.8635124 0.6350620 0.5538607 0.8724669
 [8] 0.5468323 0.6934848 0.6939485 0.8385381 0.7443223 0.5381606 0.7729469
[15] 0.9077848 0.5292942 0.6852107 0.8276365 0.8546837 0.6960142 0.5445985
[22] 0.7261246 0.7140555 0.5903167 0.9449168 0.6743836


In [12]:
#Use the logical vector and the not symbol (!) to get values under 0.5

under_half <- random_data[!logical_over_half]

print(under_half)

 [1] 0.40734587 0.31231468 0.23277656 0.02317402 0.31814147 0.46354222
 [7] 0.36133522 0.06569161 0.12635459 0.30916333 0.39621222 0.43596518
[13] 0.01944562 0.25662792 0.08924450 0.20461418 0.46867852 0.41329992
[19] 0.47674342 0.10402504 0.05368797 0.38420069 0.22512529 0.39159728


In [13]:
#You can perform logical indexing all in one step:

random_data[random_data > 0.5]

In [14]:
#You can also use more complicated logical expressions.
#In this case we grab all values between 0.4 and 0.6:

random_data[(random_data < 0.6) & (random_data > 0.4)]

Finally, you can use %in% to create a subset of elements that are contained within some other vector:

In [15]:
my_vector <- c("a","b","c","d","a","a","f")

my_vector[my_vector %in% c("a","c")]
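
It can help to look at the intermediate step: %in% itself returns a logical vector, which is then used as the index. The sketch below splits the expression in two (is_match and matches are names chosen for illustration):

```r
my_vector <- c("a","b","c","d","a","a","f")

is_match <- my_vector %in% c("a","c")   # One TRUE/FALSE per element of my_vector
print(is_match)

matches <- my_vector[is_match]          # Keep only the matching elements
print(matches)                          # "a" "c" "a" "a"
```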


### Vectorized Operations


One of the biggest benefits of R is that it is built around performing operations on vectors. Many R functions and operations behave in a "vectorized" manner, meaning they act upon each element of a vector and return the result in a new vector. Vectorized operations simplify the process of performing the same calculations on related data. All the basic operators and functions we've learned so far that operate on single values also work on vectors of length greater than 1.

In [16]:
example_vector <- c(1,2,3)

# + adds to each value in the vector
example_vector + 10

# - performs element-wise subtraction
example_vector - 10

Other math operators like *, /, ^ and %% work the same way, as do functions like round(), floor() and ceiling():

In [17]:
example_vector2 <- c(1.6, 2.5, 3.5)

round(example_vector2)
      
floor(example_vector2)
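
One detail worth knowing about round(): R rounds exact halves to the nearest even number, so 2.5 and 3.5 do not both round up. Here is a small sketch comparing the three functions (rounding_demo is a name chosen for illustration and holds the same values as example_vector2):

```r
rounding_demo <- c(1.6, 2.5, 3.5)

round(rounding_demo)     # 2 2 4: halves go to the nearest even number
floor(rounding_demo)     # 1 2 3: always rounds down
ceiling(rounding_demo)   # 2 3 4: always rounds up
```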

Vectorized operations make it easy to carry out vector transformations quickly without worrying about programming constructs like for and while loops (we'll discuss those more later).

Vector operations that involve two or more vectors are typically executed in an element-wise fashion. For example, if you take two numeric vectors of the same length and add them, the result is a new vector containing the sums of the values at each index:

In [18]:
vector1 <- c(1,2,3,4)
vector2 <- c(10,20,30,40)

print( vector1+vector2 )

[1] 11 22 33 44


In [19]:
#Other math operations also work in this way:

vector1*vector2  #element-wise multiplication

vector1/vector2  #element-wise division

vector1 %% vector2  #element-wise modulus
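
For these two vectors, the results are easy to check by hand. The sketch below repeats the definitions so that it runs on its own:

```r
vector1 <- c(1,2,3,4)
vector2 <- c(10,20,30,40)

vector1 * vector2    # 10 40 90 160
vector1 / vector2    # 0.1 0.1 0.1 0.1
vector1 %% vector2   # 1 2 3 4 (each value is smaller than its divisor)
```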

In [20]:
#If you want a vector inner product, use %*%

vector1 %*% vector2

     [,1]
[1,]  300


*Note: An inner product is the sum of the element-wise products of two vectors. It is always a single value, which %*% returns as a 1x1 matrix.
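
To make the definition concrete, the inner product can also be computed by hand with sum(). This is a sketch that repeats the vector definitions so it runs on its own; manual_inner and matrix_inner are illustrative names:

```r
vector1 <- c(1,2,3,4)
vector2 <- c(10,20,30,40)

manual_inner <- sum(vector1 * vector2)     # Multiply element-wise, then add up
print(manual_inner)                        # 300

matrix_inner <- vector1 %*% vector2        # Same value, stored as a 1x1 matrix
dim(matrix_inner)                          # 1 1

as.numeric(matrix_inner) == manual_inner   # TRUE
```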

Vectorized operations can also work on character vectors. Let's consider the function paste(), which takes two or more objects as input and concatenates them into a character vector. If you pass paste() character vectors longer than length 1, it combines them in an element-wise fashion:

In [21]:
x <- c("Life","Is","Study")
y <- c("Blogging","Is","Fun")

paste(x,y)
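
Here is a sketch of what to expect from paste(), using the same words under different variable names. The sep and collapse arguments go beyond the example above: sep sets the separator placed between paired elements, and collapse joins the whole result into one string.

```r
first_words <- c("Life","Is","Study")
second_words <- c("Blogging","Is","Fun")

paste(first_words, second_words)              # "Life Blogging" "Is Is" "Study Fun"
paste(first_words, second_words, sep = "-")   # "Life-Blogging" "Is-Is" "Study-Fun"
paste(first_words, collapse = " ")            # "Life Is Study"
```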

The data type conversion functions we discussed in the atomic data types section also work on longer vectors.

In [22]:
x <- c(1,2,3)
print(x)
typeof(x)

x <- as.character(x)
print(x)
typeof(x)

[1] 1 2 3


[1] "1" "2" "3"


### Generating Vectors

Creating vectors by hand with the c() function works fine for short vectors, but it becomes cumbersome quickly when you're working with longer vectors. R includes a variety of convenience functions to generate vectors.

You can generate all whole numbers in a range using a colon:

In [23]:
x <- 1:20 
print(x)

 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20


You can also generate sequences using the seq() function. seq() takes the arguments from, to and by, which specify the starting point, stopping point and size of the sequence increment:

In [24]:
y <- seq(from = 1, to = 20, by = 1)
print(y)

 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20


In [25]:
z <- seq(0, 100, 10)   #You can omit the argument names
print(z)

 [1]   0  10  20  30  40  50  60  70  80  90 100
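
The increment does not have to be a positive whole number. The sketch below shows a fractional step and a negative step, plus the length.out argument (not used above), which asks for a given number of evenly spaced values instead of a step size:

```r
seq(from = 0, to = 1, by = 0.25)   # 0.00 0.25 0.50 0.75 1.00
seq(from = 10, to = 1, by = -3)    # 10 7 4 1
seq(0, 1, length.out = 5)          # 5 evenly spaced values from 0 to 1
```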


Use rep() to create a vector of the same value repeated a specified number of times:

In [26]:
r <- rep(x=1, times=20)
print(r)

 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
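
rep() also accepts a vector longer than length 1. As a brief sketch, times repeats the whole vector, while the each argument (not used above) repeats every element in turn:

```r
rep(c("a", "b"), times = 3)   # "a" "b" "a" "b" "a" "b"
rep(c("a", "b"), each = 2)    # "a" "a" "b" "b"
```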


As we saw earlier, you can use the runif() function to draw random values from a specified range:

In [27]:
x <- runif(n=20, min=0, max=100)
print(x)

 [1] 91.373091  2.302970 21.632559 58.908381 16.419712 27.170919 15.427215
 [8]  5.048896  2.292052 81.036314 72.759988 53.785448 63.399075  4.636821
[15] 45.166667 72.805486 72.470191 62.684585  8.123387  5.034811


The function runif() draws numbers from a uniform distribution, so all values within the range are equally likely. R also has functions for drawing random numbers from other types of distributions, such as rnorm() for the normal distribution, rexp() for the exponential distribution and rbinom() for the binomial distribution. We won't go into these any further right now, but suffice it to say R is very useful if you have to deal with probability distributions.
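
As a quick sketch of how these functions are called, the block below draws five values from each distribution. The parameter values are arbitrary choices for illustration. It also uses set.seed(), which is not covered above, to show one way of making random draws repeatable:

```r
set.seed(42)                         # Fix the random number generator's starting state
first_draw <- runif(3)

set.seed(42)                         # Same seed, so the same values come out
second_draw <- runif(3)
identical(first_draw, second_draw)   # TRUE

normal_draws <- rnorm(n = 5, mean = 0, sd = 1)        # Normal distribution
exp_draws <- rexp(n = 5, rate = 1)                    # Exponential distribution
binom_draws <- rbinom(n = 5, size = 10, prob = 0.5)   # Binomial: successes out of 10 trials
```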

You can accomplish a surprising amount in R using only vectors and vector commands in the console, but real-world data is usually structured in two-dimensional tables. Next time we'll learn about R's simplest multi-dimensional object, the matrix.

### Next Time: Introduction to R Part 6: Matrices