# Variables, data frames, indexing, functions etc

In [None]:
#setwd("/maths/research/nha20/Teach/RIntroCourse") Set working directory & ensure any files used are in this directory.

## Computer representation of numbers

Real numbers are not stored exactly on computers. Use binary version of ``scientific'' notation e.g. $1.234 \times 10^2$. This needs care e.g.

In [1]:

x <- seq(0, 0.5, 0.1) ##generate a sequence from 0 to 0.5 in steps of 0.1
x ##Look at x


Is x equal to (0, 0.1, 0.2, 0.3, 0.4, 0.5)? To find out, type

x == c(0, 0.1, 0.2, 0.3, 0.4, 0.5)


This is known as _FAQ 7.31_: see [https://cran.r-project.org/doc/FAQ/R-FAQ.html].

## Rounding problems

Tiny inaccuracies can accumulate:
  
The sample variance of a vector ```x``` is often calculated as  
  
   $var (x) = (\sum x^2-n\bar{x}^2) / (n-1)$

Try it out and compare it with ```var()```: 

 
myvar <- function(x) (sum(x^2) - length(x) * mean(x)^2) / (length(x) - 1)
x <- seq(1:100)
myvar(x)
var(x)
x <- seq(1:100) + 10000000000
myvar(x)
var(x)

Can you see why there was a problem?

## Variables

Basic Types of Variables

Variables are the equivalent of memories in your calculator. But you can have unlimited (almost!) quantities of them and they have names of your choosing. And different   types. The basic types are

 + integer
 + double
 + character
 + logical: these take one of the two values ```TRUE``` or ```FALSE```
    (or ```NA```, see later)
 + factor or categorical
 

## More Specialised Classes

As well as the basic types of variables, R recognises many more complicated  objects such as 

* **vectors**, **matrices**, **arrays**: groups of objects all of the same type
 
* **lists** of other objects which may be of different types

*  Specialised objects such as **Dates**, **Linear Model fits**

  
## Special values of objects

There are some types of data which need to be treated specially in calculations:

* ```NA``` The value ```NA``` is given to any data which R
    knows to be missing. This is not a character string, so a
    character string with value ```"NA"``` will be treated differently
    from one with the value ```NA```.

*  ```Inf``` The result of e.g. dividing any non-zero number by zero

*  ```NaN``` The result of e.g. attempting to find the logarithm of a
 negative number.

In [None]:
 
as.numeric(c("a", "1"))
x / 0
log(-x)

   
## Factors

Factors are variables which can only take one of a finite set of discrete values. They naturally occur as vectors, and can be

* **numeric** e.g. drug doses with values 1mg, 2mg, 5mg
* or **character** e.g. voting intention with values Liberal Democrat, Conservative, Labour, Other
  
Although factors are stored as numbers, along with the label
corresponding to each number, they cannot be treated as numeric. Would
it make sense to ask R to calculate ```mean(voting intention)```?

A more useful function for factors is ```table``` which will count
how many of each value occur in the vector.


## Ordered Factors

Some factor variables have a natural ordering. Drug doses do, but
voting intentions usually do not. R will treat the two types
differently. It is important not to allow R to treat non-ordered
factors as ordered ones, since the results could be meaningless.

## Creating factors

Use ```cut``` to create factor variables from continuous ones:

age <- runif(100) * 50
table(cut(age, c(0, 10, 20, 30, 40, 50)))

The function `factor()` can be used to create factor variables from characters.

## Data frames 
* For storing data which is a collection of observations (rows) of a set of variables (columns). E.g. book titles and prices.
* Similar to a matrix but variables in different columns can have different types.
* Always the same number of entries in each row, although some may be missing (```NA```).
* Can be formed by reading in data e.g. from a spreadsheet, or constructed using the function ```data.frame```.

You will need the following file [Example Chicken weights csv](../data/chickwt.csv)

Download the file from the link to the folder where you are running R.

In [None]:
head(mydata <- read.csv('chickwt.csv'))

## Lists

A data frame is a kind of list, which is a vector
of objects of possibly different types. For example, an employee
record might be created by

In [None]:
Empl <- list(employee = "Anna", spouse = "Fred", children = 3,
             child.ages = c(4, 7, 9))
Empl

Try the following:

In [None]:
Empl[[1]]
Empl$spouse

Empl[[4]]
length(Empl[[4]]) # numeric vector of length 3

Empl[4]
length(Empl[4]) # list of length 1
Empl[[2:4]] # note that this will produce an error
Empl[2:4]  # this works

Components are always numbered and may be referred to by
number. e.g. ```Empl[[1]]```. If they are named, can also be
referred to by name using the \$ operator eg. ```Empl$spouse```

## Keeping track of objects

Once you have created some objects, how do you remind yourself what you called them? 

In [None]:
ls()
str(Empl)

```ls()``` lists the names of all the objects in you workspace and ```str()`` gives information about a specific object.

## Operators

Arithmetic Operators: ```+, - , /, *, ^```.

In [None]:
3^2
10 %% 3 # modulo reduction
10 %/% 3 # integer division

a <- matrix(1:4, nrow=2)
b <- matrix(c(2, 1, 2, 4), nrow=2)
a %*% b # matrix multiplication

Operators can also be used on vectors, with recycling if necessary. 

```R
x <- c(1, 2, 3, 4)
y <- c(5, 6)
x + 3
#(4,5,6,7) and
x + y
#(6,8,8,10)
```

## Logical operators

* ```==``` (equal), ```!=``` (not equal), ```>, <, >=, <=```

* ```!``` (not), ```|``` (or), ```||``` (or), ```&``` (and) ```&&``` (and)

* ```|```, ```&``` work on vectors 

* ```||``` and ```&&``` consider only one element

Examples:

In [None]:
x <- c(TRUE, FALSE, TRUE)
y <- c(FALSE, TRUE, TRUE)

x | y
x || y

x & y
x && y
x[3] && y[3]

The && is also a "short circuit and", i.e. it won't evaluate its second argument if the first argument is FALSE. Compare the following

In [None]:
x <- -1

(x > 0) & (log(x) > 0)

(x > 0) && (log(x) > 0)

## Dates
Example:

In [None]:
myDate <- as.Date('10-Jan-1993', format="%d-%b-%Y")

class(myDate)
as.numeric(myDate)

myDate2 <- as.Date('10-Jan-1994', format="%d-%b-%Y")

myDate2-myDate # can substract two dates

```Date```class does not deal with times. Classes ```POSICXct``` and ```POSIXlt``` deal with time zones.  


## Some useful functions
Look at the following functions:

In [None]:
c(1, "a")
1:5
c(1,2,3,4,5)
seq(1, 10, by=2)
rep(c(1, 2), times=3) 
rep(c(1, 2), each=3) 
rep(c(1, 2), c(2, 3)) 


paste(c(1, 2), c('x', 'y', 'z'))
paste(c(1, 2), c('x', 'y', 'z'), collapse=' ')

sort(mydata$weight)
sort(mydata$weight, decreasing=TRUE)

table(rpois(20, 5))

## Matrices, arrays and indexing

In [None]:
(mymat <- matrix(1:12, 3, 4)) # Entries go down columns unless you specify byrow=TRUE.
dim(mymat)
myarr<-mymat
dim(myarr) <-c(3,2,2) # creating an array
myarr
myarr[,,1] # using indexing to select the entries of the arrays third dimension y

x <-c(2,4,6,8,10,12)
names(x) <- c("a", "b", "c", "d", "e", "f")

## Use different types of indices
x[c(1,3,6,5)] 
x[c("a","c","f", "e")] ## gives same as above
x[c(TRUE,FALSE,TRUE,rep(FALSE,3))] # can also use logical vectors to select entries
x[c(-1,-4)] ## exclude first and fourth entry

y<-x
(y[] <- 0 ) ## Empty. Select all, useful to replace all vector entries
names(y) ## will be the same as before

# compare with:
y<-x
(y<-0) 

# Recycling:
x[c(1,3)] <-4.5 # recycling will be used if sub-vector selected for replacement is longer than the right-hand side.

(x[10] <- 8) # replacing to an index greater than the length of the vector extends it, filling in with NA's

x[11] # returns NA

# indexing matrices and arrays
mymat[1:2, -2]
mymat[mymat>1] <- NA # note: no comma
mymat[cbind(rep(1,3), c(2,3,4))] <- NA

# If the result has length 1 in any dimension, this is dropped unless you use the argument drop=FALSE:
mydata[1:2, 1] #is a vector
mydata[1, 1:2] #is a data.frame
class(mydata[1:2, 1])
class(mydata[1, 1:2])
mydata[mydata$weight > 400,]

## Indexing data frames

* Data frames can be indexed like matrices, but only ```drop``` dimensions if you select from a single column, not if you select from a single row.

* If you select rows from a data frame with only one column the result will be a vector unless you use ```drop=FALSE```

* Often want to select the rows of a data frame which meet some criterion.

* use logical indexing

In [None]:
mydata[1,]
attributes(mydata[,2])

mydata[mydata$weight> 400,]

## More Examples on matrices: Lower triangular, adding matrices, eigenvalues, column sums, ...

Check out the following:

In [None]:
mymat <- matrix(1:12, nrow=3, )
mymat
mymat2 <- matrix(1:12, nrow=3, byrow=TRUE)
mymat2
mymat + mymat2
mymat %*% t(mymat2)

mysq <- matrix(rnorm(9), nrow=3)
solve(mysq)

mysym <- mysq
mysym[lower.tri(mysym)] <-
	mysym[upper.tri(mysym)]
eigen(mysym)
colSums(mymat)

## Functions

Writing simple functions:

In [None]:
x <- rnorm(100, mean=0.3, sd=1.2)


std.dev <- function(x) sqrt(var(x)) # function to calculate the standard deviation of a vector

# function to calculate the two-tailed p-value of a t.test
# note that function arguments can have default values
t.test.p <- function(x, mu=0)  {
    n <- length(x)
    t <- sqrt(n) * (mean(x) - mu) /
		std.dev(x)
    2 * (1 - pt(abs(t), n - 1)) # the object of the final line will be returned
}

std.dev(x)
t.test.p(x) # this will use the default value for mu
t.test.p(mu=1, x=x) 
t.test.p(x, 1)

## Flow control: if, for , while, repeat


In [None]:
myfn <- function(n=100)
{
    tmp <- rep(NA, 3)
    tmp[1] <- mean(runif(n))
    tmp[2] <- mean(runif(n))
    tmp[3] <- mean(runif(n))
    mean(tmp[tmp > .2])
}
set.seed(1)
myfn()
myfn(1000)

## Control flow: if
Example

In [None]:
myfna <- function(n=100)
{
    tmp <- rep(NA, 3)
    x <- mean(runif(n))
    if (x > 0.2)
        tmp[1] <- x
    x <- mean(runif(n))
    if (x > 0.2)
        tmp[2] <- x
    x <- mean(runif(n))
    if (x > 0.2)
        tmp[3] <- x
    mean(tmp, na.rm=TRUE)
}
set.seed(1)
myfna()
myfna(1000)

## Control Flow: For 

In [None]:
myfn1 <- function(obs=10, n=100)
{
    x <- rep(NA, n)
    for (i in 1:n)
    {
        tmp <- runif(obs)
        x[i] <- mean(tmp)
    }
    c(mn=mean(x), std=sd(x))
}
set.seed(1)
myfn1()
myfn1(1000)

The functions ```while``` and ```repeat``` don't require loop variables.

## The function ifelse

The function ```ifelse``` reduces the need for loops and can make code more efficient. Example:
```{r}
x <- c(0, 1, 1, 2)
y <- c(44, 45, 56, 77)

z <- rep(NA, 4)
for (i in 1:length(x))
{
    if (x[i] > 0)
        z[i] <- y[i] / x[i]
    else
        z[i] <- y[i] / 99
}
z
```
This can be replaced by:
```{r}
(z <- ifelse(x > 0, y / x, y / 99))
```

## Exercise: 1

*  Create a vector containing all the dates in 2007, using ```seq``` and ```as.Date```.

* There is a version of ```cut``` for dates, called ```cut.Date```. Use this to create a factor with values corresponding to the date of the first day of the week in which each of these dates falls. Start the weeks on Sundays.

* Create ```x```, a vector of length 100, with integer values in the range $1:5$, randomly ordered. (Hint: look at the function ```sample```.)

* Use ```paste``` to create a vector of labels: `("Colour 1", "Colour 2", "Colour 3", "Colour 4", "Colour 5")`

*  Use the ```factor``` command to create a factor from the vector ```x```,  with the labels created above.

* Create a data frame with 100 rows and two columns, one containing a random sample of the vector of dates created above, and the other containing the factor vector of colour names.
  
* Select the rows for which the date is after 1st June 2007.

[Solution]()

## Solution+: 1
*  Create a vector containing all the dates in 2007, using ```seq``` and ```as.Date```.

In [3]:
myDate1 <- as.Date('1-Jan-2007', format="%d-%b-%Y")
myDate2 <- myDate1+364
myDate2

In [4]:
dates2007<-seq(myDate1,myDate2,1)

* There is a version of ```cut``` for dates, called ```cut.Date```. Use this
  to create a factor with values corresponding to the date of the first day of
  the week in which each of these dates falls. Start the weeks on Sundays.

In [5]:
fdate<-cut.Date(dates2007, "weeks",start.on.monday=FALSE)
table(fdate)

fdate
2006-12-31 2007-01-07 2007-01-14 2007-01-21 2007-01-28 2007-02-04 2007-02-11 
         6          7          7          7          7          7          7 
2007-02-18 2007-02-25 2007-03-04 2007-03-11 2007-03-18 2007-03-25 2007-04-01 
         7          7          7          7          7          7          7 
2007-04-08 2007-04-15 2007-04-22 2007-04-29 2007-05-06 2007-05-13 2007-05-20 
         7          7          7          7          7          7          7 
2007-05-27 2007-06-03 2007-06-10 2007-06-17 2007-06-24 2007-07-01 2007-07-08 
         7          7          7          7          7          7          7 
2007-07-15 2007-07-22 2007-07-29 2007-08-05 2007-08-12 2007-08-19 2007-08-26 
         7          7          7          7          7          7          7 
2007-09-02 2007-09-09 2007-09-16 2007-09-23 2007-09-30 2007-10-07 2007-10-14 
         7          7          7          7          7          7          7 
2007-10-21 2007-10-28 2007-11-04 2007-11-11 2007-11-18 200

* Create ```x```, a vector of length 100, with integer values in the range
  $1:5$, randomly ordered. (Hint: look at the function ```sample```.)

In [6]:
sample(1:5,100,replace=TRUE)

* Use ```paste``` to create a vector of labels:
    ("Colour 1", "Colour 2", "Colour 3", "Colour 4", "Colour 5")

In [7]:
x <-paste("Colour",1:5)
x

*  Use the ```factor``` command to create a factor from the vector ```x```,
  with the labels created above.

In [8]:
xf<-factor(x)
table(xf)

xf
Colour 1 Colour 2 Colour 3 Colour 4 Colour 5 
       1        1        1        1        1 

In [9]:
xf

* Create a data frame with 100 rows and two columns, one containing a random
  sample of the vector of dates created above, and the other containing the
  factor vector of colour names.

In [10]:
(datf<-data.frame(d=sample(dates2007,100,replace=TRUE),c=xf)  )

d,c
2007-01-28,Colour 1
2007-12-06,Colour 2
2007-03-13,Colour 3
2007-06-18,Colour 4
2007-11-21,Colour 5
2007-10-07,Colour 1
2007-12-17,Colour 2
2007-12-29,Colour 3
2007-03-19,Colour 4
2007-12-02,Colour 5


  
* Select the rows for which the date is after 1st June 2007.

In [11]:
datf[datf$d-as.Date('1-June-2007', format="%d-%b-%Y")>0,]

Unnamed: 0,d,c
2,2007-12-06,Colour 2
4,2007-06-18,Colour 4
5,2007-11-21,Colour 5
6,2007-10-07,Colour 1
7,2007-12-17,Colour 2
8,2007-12-29,Colour 3
10,2007-12-02,Colour 5
11,2007-06-13,Colour 1
13,2007-12-31,Colour 3
15,2007-06-02,Colour 5


:Solution+

## Exercise: 2

* Generate a matrix with 10 rows and 5 columns, with random entries between 0 and 10. (Hint: look at ```runif```)

*  Write a function using ```for``` to calculate the column means of the matrix.

*  Extract the even rows from the matrix.

[Solution]()

## Solution+: 2

* Generate a matrix with 10 rows and 5 columns, with random entries
  between 0 and 10. (Hint: look at ```runif```)

In [12]:
matti<-matrix(runif(50,0,10),nrow=10)

*  Write a function using ```for``` to calculate the column means of
  the matrix.

In [13]:
for (i in 1:5) mean(matti[,i])
apply(matti,2,mean)

*  Extract the even rows from the matrix.

In [14]:
matti[seq(2,10,2),]

0,1,2,3,4
1.809024,7.4299051,2.914167,7.9939373,3.804061
9.422536,2.1948195,9.004747,1.7392336,7.476902
4.024851,9.8905019,3.193427,6.3106328,7.187629
5.633283,5.6588363,2.141353,2.4566433,7.913578
6.239367,0.6260394,8.586384,0.7935159,1.967952


: solution+