# Basic building blocks

- Any object that contains data is called a data structure.
- A single number is considered a vector of length 1.
- Use c() to create a vector.
- c stands for concatenate/combine.

In [2]:
z <- c(1.1,9,3.4)

In [3]:
?c

In [4]:
z

In [5]:
z <- c(z, 555)

In [6]:
z

In [7]:
z * 2 + 100

In [8]:
a = c(100,4,16,9)

In [9]:
sqrt(a)

In [10]:
a ^ 2

In [11]:
x = 10

In [12]:
b = a/10

In [13]:
b

In [14]:
c = x/b

In [15]:
c

In [17]:
# Recycling: the shorter vector is repeated until it matches the length of the longer one.

c(1,2,3,4) + c(0,10)

In [19]:
c(1,2,3,4,8) + c(0,10,5)

“longer object length is not a multiple of shorter object length”
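Recycling can be made explicit with rep_len(), which repeats a vector until it reaches a requested length. The following is a minimal sketch; the variable names are illustrative and not part of the notebook.

```r
long_vec  <- c(1, 2, 3, 4)
short_vec <- c(0, 10)

# What R does implicitly: repeat the shorter vector to the length of the longer one
recycled <- rep_len(short_vec, length(long_vec))   # 0 10 0 10

implicit <- long_vec + short_vec   # relies on recycling
explicit <- long_vec + recycled    # same computation, spelled out

identical(implicit, explicit)      # TRUE
```

When the longer length is not a multiple of the shorter one, the last repetition is cut short, which is what the warning above reports.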

# Workspace and Files

R provides a common API for interacting with files. This ensures that the same code will work across different platforms.

In [21]:
# Get working directory

getwd()

In [24]:
# List all objects in the local workspace

ls()

In [27]:
# List files in the working directory

list.files()

In [29]:
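# Get the arguments to list.files()
# Pass the function itself, without parentheses: args(list.files()) would call the function first and return NULL

args(list.files)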

In [30]:
old.dir = getwd()

In [31]:
# Create a new directory in the current directory

dir.create('testdir')

In [33]:
# List files to confirm that the directory was created

list.files()

In [34]:
# Set working directory

setwd('testdir')

In [36]:
getwd()

In [37]:
# Create a file

file.create("myTest.R")

In [38]:
list.files()

In [39]:
# Check if file exists
# Useful when processing a list of files: check that each file exists before you start processing it.

file.exists("myTest.R")

In [40]:
# Access information

file.info("myTest.R")

Unnamed: 0,size,isdir,mode,mtime,ctime,atime,uid,gid,uname,grname
myTest.R,0,False,664,2018-12-25 20:52:55,2018-12-25 20:52:55,2018-12-25 20:52:55,1000,1000,tushar,tushar


In [43]:
# To grab specific things, you could use $size, $isdir, etc.

file.info("myTest.R")$size
file.info("myTest.R")$isdir
file.info("myTest.R")$mode
file.info("myTest.R")$mtime

[1] "664"

[1] "2018-12-25 20:52:55 EST"

In [44]:
# Rename the file

file.rename("myTest.R","myTest2.R")

In [45]:
list.files()

In [46]:
# Make a copy of file

file.copy("myTest2.R","myTest3.R")

In [47]:
list.files()

In [48]:
file.path('myTest3.R')

In [49]:
# file.path can be used to construct file and directory paths that are independent of the OS

file.path('folder1','folder2')

In [50]:
?dir.create

In [53]:
# Create a directory d3 and another directory inside it, also called d3.

dir.create(file.path('d3','d3'), recursive = TRUE)

In [54]:
# Delete the directory and all the contents inside it recursively

unlink('d3', recursive = TRUE)

In [55]:
list.files()

In [57]:
setwd(old.dir)

In [58]:
list.files()

In [60]:
# recursive = TRUE is required; without it unlink() does not delete directories

unlink('testdir', recursive = TRUE)

In [61]:
list.files()
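The steps in this section can also be tried without touching the working directory by doing everything under tempdir(). This is only a sketch: the directory name testdir_demo and the variable names are made up for illustration.

```r
# Work inside R's per-session temporary directory
base_dir <- file.path(tempdir(), "testdir_demo")   # hypothetical name
dir.create(base_dir)

# Create a file and check that it exists
script <- file.path(base_dir, "myTest.R")
file.create(script)
created <- file.exists(script)

# Rename the file and list the directory contents
renamed <- file.path(base_dir, "myTest2.R")
file.rename(script, renamed)
files_after_rename <- list.files(base_dir)

# Remove the directory and everything in it
unlink(base_dir, recursive = TRUE)
removed <- !dir.exists(base_dir)
```

Building every path with file.path() keeps the code independent of the platform's path separator.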

# Sequences of numbers

In [62]:
1:20

In [63]:
pi:10

In [64]:
15:1

In [66]:
# Help for operators. Enclose them in backticks

?`:`

In [67]:
# To have more control over creating sequences, use seq()
seq(1, 20)

In [68]:
seq(1,20,0.5)

In [71]:
my_seq = seq(5,10, length.out = 30)
my_seq

In [72]:
length(my_seq)

In [82]:
# Create a sequence starting from 1 but with length of another sequence

1:length(my_seq)

In [74]:
seq(1,length(my_seq))

In [75]:
seq(along.with = my_seq)

In [76]:
seq_along(my_seq)
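The forms above give the same result for a non-empty vector, but they differ when the vector is empty. A small sketch of why seq_along() is usually the safer choice in loops (the variable names are illustrative):

```r
empty <- numeric(0)

colon_result <- 1:length(empty)    # 1 0, because 1:0 counts down
along_result <- seq_along(empty)   # integer(0)

# A loop over seq_along() runs zero times for an empty vector
iterations <- 0
for (i in seq_along(empty)) iterations <- iterations + 1
iterations                         # 0
```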

In [78]:
# Replicate

rep(0, times = 20)

In [80]:
# Replicate the vector 20 times

rep(c(0,1,2), times = 20)

In [81]:
# replicate each element of the vector consecutively

rep(c(0,1,2), each = 10)

In [83]:
?seq_along

# Vectors

Vectors are the simplest and most common data structure.
<br><br>
Vectors come in 2 flavors:<br>
- Atomic vector - contains data of a single type (numeric, character, complex, logical)
- List - can contain multiple data types
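The difference between the two flavors shows up when types are mixed. In the sketch below (variable names are illustrative), c() coerces everything to one common type, while list() keeps each element as it is.

```r
mixed_atomic <- c(1, "a", TRUE)      # coerced to a single type
typeof(mixed_atomic)                 # "character"
mixed_atomic                         # "1" "a" "TRUE"

mixed_list <- list(1, "a", TRUE)     # each element keeps its own type
element_types <- sapply(mixed_list, typeof)
element_types                        # "double" "character" "logical"

logical_and_int <- c(TRUE, 2L)       # logical is coerced to integer: 1 2
```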

In [86]:
# Comparison operators return logical vectors

n <- c(0.5, 55, -10, 6)
n

In [88]:
a <- n < 1
a

In [89]:
n >= 6

In [90]:
x = c("My","name","is","Tushar")
x

In [91]:
length(x)

In [93]:
# Join the elements of the character vector into a single string
# collapse specifies the separator

paste(x, collapse = " ")

In [94]:
x = c(x, ".")

In [95]:
x

In [96]:
paste(x, collapse = " ")

In [98]:
paste("Hello", "World", sep = " ")

In [100]:
# Here the shorter vector gets recycled.

paste(c(1:4),c("X","Y","Z"), sep = "")

In [101]:
# Here numeric vector gets coerced into character vector.

paste(LETTERS, 1:4, sep = "-")

# Missing Values

In R, NA is used to represent a value that is 'Not Available' or 'Missing'. Operations involving NA generally yield NA.

In [102]:
x <- c(33, NA, 5, NA)

In [103]:
x

In [104]:
x * 3

In [105]:
y <- rnorm(1000)

In [106]:
y

In [107]:
z = rep(NA,1000)

In [108]:
z

In [110]:
# Combine the two vectors and draw 100 random elements from the result.

m = sample(c(y, z), 100)

In [111]:
m

In [118]:
# Locate the NA

test = is.na(m)

In [119]:
test

In [114]:
m == NA

This happens because NA is not a value. It is a placeholder for a quantity that is not available, so the result of comparing anything with it is also unknown. m == NA therefore returns a vector of all NAs; use is.na() to locate missing values.
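A short sketch of how missing values behave in comparisons and summaries. The vector v is illustrative; na.rm is a standard argument of mean().

```r
v <- c(33, NA, 5, NA)

compared <- v == NA             # NA NA NA NA: every comparison is unknown
missing  <- is.na(v)            # FALSE TRUE FALSE TRUE
n_missing <- sum(missing)       # 2

plain_mean <- mean(v)                  # NA, the missing values propagate
clean_mean <- mean(v, na.rm = TRUE)    # 19, computed from 33 and 5 only
```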

In [120]:
sum(test)

In [123]:
# NaN - Not a Number

0/0

In [124]:
Inf - Inf

# Subsetting vectors

In [125]:
x = rnorm(20)
x

In [127]:
y = rep(NA,20)
y

In [129]:
z = sample(c(x, y))
z

In [130]:
# Get first 10 elements
z[1:10]

    Index vectors come in 4 flavors:
    1. Logical vectors
    2. Vectors of positive integers
    3. Vectors of negative integers
    4. Vectors of character strings
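The four kinds of index vector can be compared side by side on a small named vector. This is a sketch with made-up data (scores is not part of the notebook).

```r
scores <- c(a = 10, b = 20, c = 30, d = 40)

by_logical  <- scores[c(TRUE, FALSE, TRUE, FALSE)]  # keeps a and c
by_positive <- scores[c(1, 4)]                      # keeps a and d
by_negative <- scores[-c(1, 4)]                     # drops a and d, keeps b and c
by_name     <- scores[c("b", "d")]                  # selects by name
```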

In [133]:
# is.na() returns a logical vector that marks the positions of the NAs

is.na(z)

In [135]:
# Get a vector of all NAs

z[is.na(z)]

In [136]:
# Get a vector of everything other than NAs.

z[!is.na(z)]

In [138]:
z>0

In [140]:
length(z[z>0])

In [141]:
z[z>0]

In [144]:
# Get positive elements excluding the missing values

z[z > 0 & !is.na(z)]

#### Note: R uses one-based indexing

In [145]:
# Getting specific index elements

z[c(1,3,6)]

In [146]:
z[0]

In [147]:
z[9293847239]

In [150]:
# Get elements excluding a few indices.

z[c(1,3,4)]

z[c(-1,-3,-4)]

# Alternative
z[-c(1,3,4)]

In [1]:
# Named indices

q = c(foo = 11, bar = 2, norf = NA)

In [2]:
q

In [3]:
# We can get the names by using names()

names(q)


In [4]:
# We can create unnamed vector and then add names

r = c(11,2,NA)
names(r) <- c("foo","bar", "norf")

In [5]:
r

In [6]:
# We can check if 2 vectors are identical by using identical()

identical(q, r)

In [7]:
q["bar"]

In [8]:
q[c("foo","bar")]

# Matrices and Data Frames

Both represent rectangular data, i.e. they are used to store tabular data.

Matrices - a single class of data<br>
Data frames - many classes of data

In [9]:
a <- 1:20

In [10]:
a

In [12]:
# Dimensions of an object

dim(a)

NULL

In [13]:
length(a)

In [14]:
dim(a) <- c(4,5)

In [15]:
dim(a)

In [17]:
# Alternate way to get dimensions

attributes(a)

In [18]:
a

1,5,9,13,17
2,6,10,14,18
3,7,11,15,19
4,8,12,16,20


In [19]:
class(a)

In [20]:
my_matrix <- a

A matrix is simply an atomic vector with a dimension attribute.
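This can be checked directly: adding a dim attribute turns a vector into a matrix, and removing it gives the original vector back. A minimal sketch with an illustrative vector v:

```r
v <- 1:6
before <- is.matrix(v)            # FALSE

dim(v) <- c(2, 3)                 # values are laid out column by column
after <- is.matrix(v)             # TRUE
by_row_col <- v[2, 3]             # 6
by_position <- v[6]               # 6, the underlying vector is still there
attr_names <- names(attributes(v))   # "dim"

dim(v) <- NULL                    # drop the attribute
restored <- identical(v, 1:6)     # TRUE
```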

In [21]:
?matrix

In [22]:
my_matrix2 = matrix(1:20, nrow = 4, ncol=5)

In [23]:
my_matrix2

1,5,9,13,17
2,6,10,14,18
3,7,11,15,19
4,8,12,16,20


In [24]:
identical(my_matrix,my_matrix2)

In [27]:
# We may want to label the rows. One way of doing this is to add a column to the matrix.

patients <- c("A","B","C","D")

cbind(patients, my_matrix)

patients,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5
A,1,5,9,13,17
B,2,6,10,14,18
C,3,7,11,15,19
D,4,8,12,16,20


Because a matrix can hold only one type of data, the numeric values were implicitly coerced to character strings.
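The coercion can be confirmed with typeof(). The sketch below uses small made-up inputs (ids and vals) and contrasts cbind() with data.frame(), which keeps each column's own type.

```r
ids  <- c("A", "B")
vals <- matrix(1:4, nrow = 2)

as_matrix <- cbind(ids, vals)
matrix_type <- typeof(as_matrix)        # "character"
first_value <- unname(as_matrix[1, 2])  # "1", now a string

as_df <- data.frame(ids, vals)
numeric_kept <- is.integer(as_df$X1)    # TRUE, the numeric column is untouched
```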

In [28]:
# We could use data.frame to avoid this problem.

my_data = data.frame(patients, my_matrix)

In [29]:
my_data

patients,X1,X2,X3,X4,X5
A,1,5,9,13,17
B,2,6,10,14,18
C,3,7,11,15,19
D,4,8,12,16,20


In [30]:
class(my_data)

In [31]:
# Assigning names to the columns

cnames <- c("P","Q","R","S","T","U")

In [32]:
# use colnames()

colnames(my_data) <- cnames

In [33]:
my_data

P,Q,R,S,T,U
A,1,5,9,13,17
B,2,6,10,14,18
C,3,7,11,15,19
D,4,8,12,16,20


# Lists

1. Vector - All data must be of the same type (numeric, character, logical).
2. Matrix - All data must be of the same type (numeric, character, logical).
3. Data frame - All data within a column must be of the same type (numeric, character, logical).

<br>

A list in R lets you gather different objects under the same name in an ordered way. These objects can be vectors, matrices, data frames and other lists too.


In [34]:
my_v = c(1,2)

In [36]:
my_m = matrix(c(1:12), nrow = 3, ncol = 4)

In [37]:
my_df = mtcars[1:10,]

In [39]:
my_list <- list(my_v, my_m, my_df)

In [40]:
my_list

1,4,7,10
2,5,8,11
3,6,9,12

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2
Valiant,18.1,6,225.0,105,2.76,3.46,20.22,1,0,3,1
Duster 360,14.3,8,360.0,245,3.21,3.57,15.84,0,0,3,4
Merc 240D,24.4,4,146.7,62,3.69,3.19,20.0,1,0,4,2
Merc 230,22.8,4,140.8,95,3.92,3.15,22.9,1,0,4,2
Merc 280,19.2,6,167.6,123,3.92,3.44,18.3,1,0,4,4


We should give names to the components so that we know what they stand for.

In [42]:
my_list <- list(vec = my_v, mat = my_m, df = my_df)

In [43]:
# To change names of the components of the list, use names()

names(my_list) <- c("vec2","mat2","df2")

In [44]:
my_list

1,4,7,10
2,5,8,11
3,6,9,12

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2
Valiant,18.1,6,225.0,105,2.76,3.46,20.22,1,0,3,1
Duster 360,14.3,8,360.0,245,3.21,3.57,15.84,0,0,3,4
Merc 240D,24.4,4,146.7,62,3.69,3.19,20.0,1,0,4,2
Merc 230,22.8,4,140.8,95,3.92,3.15,22.9,1,0,4,2
Merc 280,19.2,6,167.6,123,3.92,3.44,18.3,1,0,4,4


In [45]:
# Grabbing components of the list

my_list[[1]]

In [46]:
my_list[1]

In [47]:
my_list["vec2"]

In [48]:
my_list$vec2
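The access forms above differ in what they return: single brackets give back a list, while double brackets and $ give back the component itself. A sketch on a small hypothetical list (demo is not part of the notebook):

```r
demo <- list(vec = c(1, 2), label = "x")

single_class <- class(demo[1])       # "list": a sub-list with one component
double_class <- class(demo[[1]])     # "numeric": the component itself
same_component <- identical(demo[["vec"]], demo$vec)   # TRUE

# Single brackets can select several components at once
two_components <- length(demo[c("vec", "label")])      # 2
```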

In [49]:
dim(my_list$mat2)

In [50]:
# Selecting specific elements from the components

my_list[[3]][,1:3]

Unnamed: 0,mpg,cyl,disp
Mazda RX4,21.0,6,160.0
Mazda RX4 Wag,21.0,6,160.0
Datsun 710,22.8,4,108.0
Hornet 4 Drive,21.4,6,258.0
Hornet Sportabout,18.7,8,360.0
Valiant,18.1,6,225.0
Duster 360,14.3,8,360.0
Merc 240D,24.4,4,146.7
Merc 230,22.8,4,140.8
Merc 280,19.2,6,167.6


In [51]:
# Adding more components to the list; c() returns a new list, so assign the result to keep it

c(my_list, author = "xyz")

1,4,7,10
2,5,8,11
3,6,9,12

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2
Valiant,18.1,6,225.0,105,2.76,3.46,20.22,1,0,3,1
Duster 360,14.3,8,360.0,245,3.21,3.57,15.84,0,0,3,4
Merc 240D,24.4,4,146.7,62,3.69,3.19,20.0,1,0,4,2
Merc 230,22.8,4,140.8,95,3.92,3.15,22.9,1,0,4,2
Merc 280,19.2,6,167.6,123,3.92,3.44,18.3,1,0,4,4


In [53]:
# Structure of the list

str(my_list)

List of 3
 $ vec2: num [1:2] 1 2
 $ mat2: int [1:3, 1:4] 1 2 3 4 5 6 7 8 9 10 ...
 $ df2 :'data.frame':	10 obs. of  11 variables:
  ..$ mpg : num [1:10] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2
  ..$ cyl : num [1:10] 6 6 4 6 8 6 8 4 4 6
  ..$ disp: num [1:10] 160 160 108 258 360 ...
  ..$ hp  : num [1:10] 110 110 93 110 175 105 245 62 95 123
  ..$ drat: num [1:10] 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92
  ..$ wt  : num [1:10] 2.62 2.88 2.32 3.21 3.44 ...
  ..$ qsec: num [1:10] 16.5 17 18.6 19.4 17 ...
  ..$ vs  : num [1:10] 0 0 1 1 0 1 0 1 1 1
  ..$ am  : num [1:10] 1 1 1 0 0 0 0 0 0 0
  ..$ gear: num [1:10] 4 4 4 3 3 3 3 4 4 4
  ..$ carb: num [1:10] 4 4 1 1 2 1 4 2 2 4


Lists are fundamental when your function must return more than one object.
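As a sketch of that idea, the hypothetical function below (summarise_vector is an illustrative name) returns three results of different shapes bundled into one named list.

```r
summarise_vector <- function(x) {
  # Bundle several results of different lengths into one return value
  list(mean = mean(x), range = range(x), n = length(x))
}

res <- summarise_vector(c(4, 8, 15))
res$mean    # 9
res$range   # 4 15
res$n       # 3
```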

# Functions

Functions are the fundamental building blocks of R: small pieces of reusable code that can be treated like any other R object.

In [54]:
Sys.Date()

In [57]:
mean(c(1,2,3))

In [58]:
boring_function <- function(x){
    x
}

In [59]:
boring_function("Hello world")

In [60]:
boring_function

In [62]:
my_mean <- function(x){
    s = sum(x)
    l = length(x)
    mean = s/l
    return(mean)
}

In [63]:
my_mean(c(1,2,3))

In [65]:
increment <- function(n, by = 1){
    return(n + by)
}

In [66]:
increment(3)

In [67]:
increment(3,2)

In [69]:
remainder <- function(num, divisor = 2){
    # Use the divisor argument rather than a hard-coded 2
    return(num %% divisor)
}

In [70]:
remainder(10)

In [71]:
remainder(11)

In [72]:
remainder(13)

In [75]:
remainder(divisor = 9, num = 19)

In [77]:
# See the function's arguments
# Here we just passed a function as an argument to another function

args(remainder)

In [78]:
evaluate <- function(func, data){
    func(data)
}

In [79]:
evaluate(sum, c(1,2,3))

In [81]:
evaluate(mean, c(1,2,3))

In [82]:
evaluate(sd, c(1,2,3))

We can also pass functions that do not have a name, i.e. anonymous functions.

In [85]:
evaluate(function(x){x + 1}, 2)

In [86]:
evaluate(function(x){x[length(x)]}, c(1,2,3))

In [87]:
?paste

### Arguments after the ellipsis must be passed by name, so they usually have a default value

In [88]:
telegram <- function(...){
    paste("START", ..., "STOP")
}

In [89]:
telegram()

In [90]:
telegram("Hello there")
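Arguments declared after ... can only be matched by name, which is why they are normally given defaults. The function below is a hypothetical variation on telegram (shout, sep and end are illustrative names) that shows what happens when such an argument is passed by position instead.

```r
shout <- function(..., sep = " ", end = "!") {
  # Everything unnamed is collected by ...; sep and end must be named
  paste0(paste(..., sep = sep), end)
}

default_end <- shout("Hello", "there")              # "Hello there!"
named_end   <- shout("Hello", "there", end = "?")   # "Hello there?"
positional  <- shout("Hello", "there", "?")         # "Hello there ?!"
```

In the last call the "?" is swallowed by ... and treated as one more word, so end keeps its default.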

### Unpacking arguments from the ellipses

In [91]:
add_some_stuff <- function(...){
    args <- list(...)
    return(sum(args$a,args$b))
}

In [92]:
add_some_stuff(a = 10, b = 19)

### Making custom binary operators

In [96]:
# Note: base R already defines %x% (the Kronecker product); this definition masks it in the workspace
"%x%" <- function(a, b){
    (a + b)*(a - b)
}

In [97]:
2 %x% 3

# Logic

- There are 2 logical values: TRUE and FALSE.
- Logical expressions can be constructed that evaluate to TRUE or FALSE.

In [98]:
TRUE == TRUE

In [99]:
TRUE == FALSE

In [100]:
6 == 7

In [101]:
6 < 6

In [102]:
6 <= 6

In [103]:
5 != 6

In [104]:
!FALSE

In [105]:
FALSE & FALSE

In [106]:
TRUE && TRUE

In [109]:
# Recycling happens here: TRUE is combined with each element of the vector

TRUE & c(TRUE, FALSE, TRUE)

In [112]:
# With &&, older versions of R used only the first element of the right-hand vector; since R 4.3.0 a vector longer than 1 is an error

TRUE && c(FALSE, FALSE, TRUE)

### AND operators have higher precedence than OR operators, so all AND operations are evaluated first.

In [114]:
5 > 8 || 6 != 8 && 4 > 3.9
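The precedence rule can be made visible with parentheses. In this sketch the first two lines are equivalent, and the third shows that forcing the OR first changes the result.

```r
implicit_order <- TRUE || TRUE && FALSE      # TRUE: read as TRUE || (TRUE && FALSE)
and_first      <- TRUE || (TRUE && FALSE)    # TRUE: the same grouping, written out
or_first       <- (TRUE || TRUE) && FALSE    # FALSE: parentheses force the OR first

# The expression from the cell above, with the implied grouping
grouped <- 5 > 8 || (6 != 8 && 4 > 3.9)      # TRUE
```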

In [116]:
isTRUE(4 > 9)

In [117]:
isFALSE(4>9)

In [119]:
isTRUE(3)

In [120]:
isTRUE(NA)

In [121]:
identical("a","a")

### XOR - TRUE if exactly one of the arguments is TRUE

In [123]:
8 != 8.0

In [124]:
xor(5==5, FALSE)

In [125]:
ints <- c(1:10)

In [126]:
ints

In [127]:
ints > 5

In [128]:
# To get the indices of the elements that satisfy the condition, use which()

which(ints > 5)

In [133]:
# Check if any of them satisfies the condition

any(ints < 1)

In [134]:
any(ints < 2)

In [135]:
# Check if all satisfy the condition

all(ints > 0)

In [136]:
ints == 10

# Control Flow

In [137]:
even_odd <- function(n){
    op = n %% 2
    
    if(op == 0){
        print("Even Number")
    } else {
        print("Odd Number")
    }
}

In [138]:
even_odd(19)

[1] "Odd Number"
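An if statement expects a single TRUE or FALSE, so the function above handles one number at a time. For a whole vector, ifelse() applies the same test element by element. The function below is a hypothetical vectorized variant (even_odd_vec is an illustrative name) that returns the labels instead of printing them.

```r
even_odd_vec <- function(n) {
  # ifelse() evaluates the condition for every element of n
  ifelse(n %% 2 == 0, "Even Number", "Odd Number")
}

labels <- even_odd_vec(c(19, 4, 7))
labels   # "Odd Number" "Even Number" "Odd Number"
```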


# lapply and sapply

- Commonly known as loop functions.
- Offer a concise way of implementing the Split-Apply-Combine strategy for data analysis.
- Split the data into smaller pieces, apply an operation to each piece, and then combine the results.
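The three steps can be written out one at a time on a small made-up example. This is only a sketch; values and groups are illustrative data, not part of the flags dataset used below.

```r
values <- c(10, 20, 30, 40, 50, 60)
groups <- c("a", "b", "a", "b", "a", "b")

pieces <- split(values, groups)    # split: a list with components a and b
means  <- lapply(pieces, mean)     # apply: one result per piece
result <- unlist(means)            # combine: a named numeric vector
result                             # a = 30, b = 40
```

sapply() performs the apply and combine steps in a single call, so sapply(pieces, mean) gives the same named vector.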

In [4]:
# Loading the flags dataset

y <- "https://archive.ics.uci.edu/ml/machine-learning-databases/flags/flag.data"

In [5]:
flags <- read.table(y, header = FALSE, sep = ",")

In [6]:
names(flags) <- c("country", "landmass", "zone", "area", "population", "language", 
              "religion", "bars", "stripes", "colors", "red", "green", "blue",
              "gold", "white", "black", "orange","mainhue", "circles", "crosses", 
              "saltires", "quarters", "sunstars", "crescents", "triangle", "icon",
              "animate", "text", "topleft", "botright")

In [151]:
head(flags)

country,landmass,zone,area,population,language,religion,bars,stripes,colors,⋯,saltires,quarters,sunstars,crescents,triangle,icon,animate,text,topleft,botright
Afghanistan,5,1,648,16,10,2,0,3,5,⋯,0,0,1,0,0,1,0,0,black,green
Albania,3,1,29,3,6,6,0,0,3,⋯,0,0,1,0,0,0,1,0,red,red
Algeria,4,1,2388,20,8,2,2,0,3,⋯,0,0,1,1,0,0,0,0,green,white
American-Samoa,6,3,0,0,1,1,0,0,5,⋯,0,0,0,0,1,1,1,0,blue,red
Andorra,3,1,0,0,6,0,3,0,3,⋯,0,0,0,0,0,0,0,0,blue,red
Angola,4,2,1247,7,10,5,0,2,3,⋯,0,0,1,0,0,1,0,0,red,black


In [152]:
dim(flags)

In [155]:
class(flags)

In [156]:
?lapply

In [157]:
cls_list = lapply(flags, class)

In [158]:
cls_list

In [159]:
class(cls_list)

In [160]:
as.character(cls_list)

In [161]:
cls_vector <- sapply(flags, class)

In [162]:
cls_vector

In [163]:
class(cls_vector)

In [164]:
names(flags)

In [165]:
flags$orange

In [166]:
sum(flags$orange)

In [169]:
flag_colors = flags[,11:17]

In [171]:
head(flag_colors)

red,green,blue,gold,white,black,orange
1,1,0,1,1,1,0
1,0,0,1,0,1,0
1,1,0,0,1,0,0
1,0,1,1,1,0,1
1,0,1,1,0,0,0
1,0,0,1,0,1,0


In [175]:
x = sapply(flag_colors, sum)

In [177]:
x

### Here sapply returns a vector because each element of the result is a vector of length 1

In [179]:
flag_colors_mean = sapply(flag_colors, mean)

In [180]:
flag_colors_mean

In [7]:
flag_shapes <- flags[,19:23]

In [8]:
head(flag_shapes)

circles,crosses,saltires,quarters,sunstars
0,0,0,0,1
0,0,0,0,1
0,0,0,0,1
0,0,0,0,0
0,0,0,0,0
0,0,0,0,1


### Here sapply returns a matrix because each element of the result is a vector of the same length, greater than 1. We are calculating the minimum and maximum

In [14]:
# We want to extract the minimum and maximum number of times the shapes appear in the flags

shape_mat <- sapply(flag_shapes, range)

In [12]:
shape_mat

In [15]:
class(shape_mat)

### Example of when sapply cannot figure out how to simplify the result and instead returns a list.

In [16]:
# We need unique values for each variable in the flags database. 

unique_vals = lapply(flags, unique)

In [17]:
unique_vals

### Here the components are vectors of different lengths, so there is no way for sapply to simplify the result. A vector or a matrix is not possible, and sapply returns the result as a list.

In [18]:
unique_vals = sapply(flags, unique)

In [19]:
unique_vals

In [20]:
class(unique_vals)
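Since the shape of sapply()'s result depends on the data, base R's vapply() can be used when a fixed shape is required: it takes the expected type and length of each result and stops with an error if they do not match. The sketch below uses small made-up lists rather than the flags data.

```r
same_length <- list(a = 1:3, b = 4:6)            # illustrative data
ragged      <- list(a = c(1L, 1L, 2L), b = 5L)

totals  <- sapply(same_length, sum)     # length-1 results  -> named vector
ranges  <- sapply(same_length, range)   # length-2 results  -> matrix
uniques <- sapply(ragged, unique)       # differing lengths -> list

# vapply() states the expected result up front
checked <- vapply(same_length, sum, FUN.VALUE = integer(1))

# With results of differing lengths vapply() fails instead of changing shape
outcome <- tryCatch(
  vapply(ragged, unique, FUN.VALUE = integer(2)),
  error = function(e) "error"
)
```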

In [22]:
# Find the length of each component of the list.

z = sapply(unique_vals, length)

In [23]:
z

In [24]:
class(z)

### You may need to apply a function that is not defined yet.

#### Suppose you are interested in only the second item of each element of the unique_vals list that you just created. Each element of unique_vals is a vector, and we’re not aware of any built-in function in R that returns the second element of a vector, so we will construct our own function.

In [27]:
lapply(unique_vals, function(element) element[2])

### Here we defined the function inline, at the point where it is applied. It has no name, so it is an anonymous function.