# Basic Data Types
* Author: Johannes Maucher
* Last Update: 2017-10-03, some modifications by OK in 2019



## Different R Syntax Systems

As the [R Syntax Comparison](https://github.com/rstudio/cheatsheets/raw/master/syntax.pdf) cheat sheet shows, R has several syntax systems. One of the two most important ones is the plain Base R convention, which needs no additional packages but is limited and sometimes cumbersome. 

The other one is the [**tidyverse**](https://www.tidyverse.org/packages), an ecosystem with a well-documented workflow. It includes additional libraries like *ggplot2, dplyr, tidyr, readr, purrr* and *tibble* to visualize and manipulate data in a comfortable way. 


Some examples:

* Manipulation of data ([dplyr](https://dplyr.tidyverse.org/), [tidyr](https://tidyr.tidyverse.org/))
* Better support of different data types ([stringr](https://stringr.tidyverse.org/) for strings, [lubridate](https://lubridate.tidyverse.org/) for date and datetime, [forcats](https://forcats.tidyverse.org/) for categorical/factors)
* Data visualization ([ggplot2](https://ggplot2.tidyverse.org/))
* Data-oriented programming ([purrr](https://purrr.tidyverse.org/))


It is built around the concept of **tidy data**, which we will discuss in detail in our section "Machine Learning".

The tidyverse library is developed by RStudio’s chief scientist Hadley Wickham, also known as the author of the excellent book [R for Data Science](https://r4ds.had.co.nz) [WH19].

Most tidyverse functions **work only with data frames** (the most important data type in R, see below) and **tibbles**, a special form of data frame in the tidyverse (`tribble()` creates a tibble row by row). We will not use tibbles in these lectures (see [The Trouble with Tibbles](https://www.r-bloggers.com/the-trouble-with-tibbles/)). In the sections of *Basic R*, we will describe both ways. In the main sections we will mostly use the tidyverse conventions, but selectively: for example, I prefer data frames to tibbles and Jupyter Notebooks to R Notebooks.


`Discussion:` Why should we use tidyverse functions? The main reasons are speed, function chaining and a simple, logical syntax. Some people prefer Base R, data.table or something else. There are a lot of references, e.g. [dplyr and a very basic benchmark](http://datascience.la/dplyr-and-a-very-basic-benchmark/) and [Four reasons why you should check out the R package dplyr](http://www.zevross.com/blog/2014/03/26/four-reasons-why-you-should-check-out-the-r-package-dplyr-3/).


`Keep in mind:` There is another library which builds on tidyverse called **[tidyquant](https://cran.r-project.org/web/packages/tidyquant/vignettes/TQ00-introduction-to-tidyquant.html)** which additionally includes functions for financial analysis.


In [1]:
library(tidyverse)

-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
v ggplot2 3.2.0     v purrr   0.3.2
v tibble  2.1.3     v dplyr   0.8.3
v tidyr   0.8.3     v stringr 1.4.0
v readr   1.3.1     v forcats 0.4.0
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()


## Vectors

In R a vector is a sequence of elements. All elements must be of the same basic type. The elements of a vector are called *components*.

### Define vectors
Assign single numeric value to a variable:

In [2]:
(x <- 6)
class(x)
print(x)

[1] 6


Assign sequence of integers to a variable:

In [3]:
(x <- 1:13)
class(x)

Assign all-zero vector of defined length and type:

In [4]:
(x <- integer(5))
class(x)
 
(x <- numeric(5))
class(x)

Assign regular numeric sequence with [`seq()`](https://www.rdocumentation.org/packages/base/versions/3.6.1/topics/seq). Some arguments are

* 'from'
* 'to'
* 'by' defines the increment (1 if only 'from' and 'to' are given).
* 'length' (short for 'length.out') defines the number of elements.

In [5]:
(x <- seq(from=1, to=30, by=4)) #or x <- seq(1, 30, 4)
(x <- seq(from=1, to=30, length=11))
class(x)
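
As a small complement to the cell above, the following sketch shows how `by` and `length.out` relate to each other, and two helper functions, `seq_len()` and `seq_along()`, that are often safer than `1:n`. The variable names `x_len`, `x_by` and `step` are only chosen for illustration.

```r
# With length.out the increment is (to - from) / (length.out - 1)
x_len <- seq(from = 1, to = 30, length.out = 11)
step  <- (30 - 1) / (11 - 1)          # 2.9
x_by  <- seq(from = 1, to = 30, by = step)
all.equal(x_len, x_by)                # TRUE

# With 'by' the sequence stops at the last value that does not exceed 'to'
seq(1, 30, by = 4)                    # 1 5 9 13 17 21 25 29

# seq_len() and seq_along() handle the empty case gracefully
seq_len(0)                            # integer(0), whereas 1:0 gives 1 0
seq_along(c("a", "b", "c"))           # 1 2 3
```

`all.equal()` is used instead of `==` because the two sequences are computed with floating-point arithmetic and may differ in the last digits.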


Generate vectors with repeated values and sequences with [`rep()`](https://www.rdocumentation.org/packages/base/versions/3.6.1/topics/rep). Some arguments are

* 'x' vector of values.
* 'each' defines how many times each element of x is repeated. 
* 'times' defines how many times the whole sequence x is repeated (default: 1).

In [6]:
(x <- rep(x=3, times=5))  #or only x <- rep(3, 5)
(x <- rep(x=1:2, each=5))
(x <- rep(x=1:2, times=5))

Generate arbitrary vector with [`c()`](https://www.rdocumentation.org/packages/base/versions/3.6.1/topics/c)

In [7]:
y <- c(100, 200, 50, -10)
y
class(y)

In [8]:
z <- c("last", "in", "first", "out")
class(z)

The first component of a vector is at index 1, the second at index 2 and so on. However, instead of using this standard integer-index, it is possible to define index-names explicitly. This is shown in the following code snippet:

In [9]:
customer <- c(37, 61000, 250000, 70000)

featureNames <- c("age", "income", "capital", "credit")
names(customer) <- featureNames

customer

### Access vector components

Filter by index:

In [10]:
z[1]

Get all elements except those specified in a negative index:

In [11]:
z[-1]
z[-3]

In [12]:
print(z[-3])

[1] "last" "in"   "out" 


Get type of vector components:

In [13]:
print(z[0])
class(z)

character(0)


Filter by explicit index name:

In [14]:
customer["age"]

Filter by set of indices:

In [15]:
customer[2:3]


In [16]:
customer[c(1, 4)]

In [17]:
customer[c("age", "credit")]


Filter by mask

In [18]:
rel <- c(FALSE, TRUE, TRUE, FALSE)
customer[rel]

Filter by value:

In [19]:
y

y[y < 100]

Elements of different type can be assigned to a vector. However, all different types are mapped to a single type, i.e. after assignment all elements of the vector are of same type. 

In [20]:
mix <- c(1, 6, 8, TRUE)
mix
class(mix)

mix2 <- c(1, 6, 8, TRUE, "word")
mix2
class(mix2)
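
The mapping to a single type follows a fixed order: logical, integer, double, character. The sketch below illustrates this order and the explicit conversion back with `as.numeric()`; the names `chars` and `nums` are illustrative only.

```r
# Coercion follows the hierarchy logical < integer < double < character
class(c(TRUE, 2L))        # "integer"
class(c(TRUE, 2L, 3.5))   # "numeric"
class(c(TRUE, 2L, "a"))   # "character"

# Explicit conversion back; strings that are not numbers become NA (with a warning)
chars <- c("1", "6", "word")
nums  <- suppressWarnings(as.numeric(chars))
nums                      # 1 6 NA
is.na(nums)               # FALSE FALSE TRUE
```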

### Modification of vectors
#### Append vectors:


In [21]:
(mix <- c(mix, 99))

Alternative way to append vectors:

In [22]:
mix[length(mix)+1] <- 999
mix

Append multiple elements:

In [23]:
(mix <- c(mix, mix2))

>Note what happened to the type of vector elements

#### Insert elements into vectors

Create a vector by specifying its first and last element and its length:

In [24]:
(initVec <- seq(2, 3, length=11))

Insert with [`append()`](https://www.rdocumentation.org/packages/base/versions/3.6.1/topics/append)

In [25]:
insVec1 <- append(initVec, 0, after=2)

In [26]:
initVec
insVec1

In [27]:
(insVec2 <- append(initVec, 10:12, after=2))

### Simple operations on vectors

#### Arithmetic Operations

In [28]:
(a <- c(3, 9, 18))

(f <- c(1/3, 2/3, 5/3))
(r <- a*f)

In [29]:
(t <- a+r)
sum(t)

#### Boolean Operations


In [30]:
(u <- rep(c(1, 5, 6), 3))
(v <- rep(c(4, 5, 7), 3))
(w <- 6:10)

Compare two vectors componentwise:

In [31]:
u == v

Determine the positions, where two vectors have the same components:

With [`which()`](https://www.rdocumentation.org/packages/base/versions/3.6.1/topics/which) we can get the indices of a vector or the row numbers of a data frame, depending on the logical condition. 

In [32]:
which(u == v)

If the vectors have different length:

In [33]:
u
w

(u == w)


(which(u == w))

"longer object length is not a multiple of shorter object length"

"longer object length is not a multiple of shorter object length"
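
The warning comes from R's recycling rule: the shorter vector is repeated until it matches the length of the longer one. The following sketch, with the illustrative names `s3` and `same_pos`, shows that no warning appears when the longer length is a multiple of the shorter one, and one possible way to guard against accidental recycling.

```r
u  <- rep(c(1, 5, 6), 3)  # length 9
s3 <- c(1, 5, 7)          # length 3, recycled three times without a warning
u == s3                   # TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE FALSE

# Guard against accidental recycling: compare only the common positions
w <- 6:10
if (length(u) != length(w)) {
  n <- min(length(u), length(w))
  same_pos <- which(u[seq_len(n)] == w[seq_len(n)])
}
same_pos                  # integer(0): no position with equal components
```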

### Set operations

In [34]:
(a <- seq(1, 20, 3))
(b <- 1:10)
a
(c <- rep(c(1, 5, 8), 3))

In [35]:
a
b

print("Elements of a, which are contained in b:")
a %in% b

print("Elements of b, which are contained in a:")
b %in% a

[1] "Elements of a, which are contained in b:"


[1] "Elements of b, which are contained in a:"


In [36]:
# a=1 4 7 10 13 16 19,  b=1 2 3 4 5 6 7 8 9 10,  c=1 5 8 1 5 8 1 5 8 
print("Unique elements of c:")
unique(c)

print("Union of a and b:")
union(a, b)

print("Intersection of a and b:")
intersect(a, b)

print("Elements, which are in b but not in a:")
setdiff(b, a)

[1] "Unique elements of c:"


[1] "Union of a and b:"


[1] "Intersection of a and b:"


[1] "Elements, which are in b but not in a:"


### Sampling and sorting

In [37]:
d <- 1:48

set.seed(213)
print("Sample 6 out of 48 without replacement:")
sample(d, 6)

print("Sample 6 out of 48 with replacement:")
sample(d, 6, replace=T)  # same as replace=TRUE

[1] "Sample 6 out of 48 without replacement:"


[1] "Sample 6 out of 48 with replacement:"


In [38]:
s <- sample(d, 15, replace=TRUE)
print("Sample 15 out of 48 with replacement:")
s

print("Reverse order of elements:")
rev(s)

print("Sort sample in increasing order:")
sort(s)

print("Sort sample in decreasing order:")
sort(s, decreasing=TRUE)

print("Original position of items in increasing order")
order(s)

print("Sort sample via order-function (same result as above):")
s[order(s)]


[1] "Sample 15 out of 48 with replacement:"


[1] "Reverse order of elements:"


[1] "Sort sample in increasing order:"


[1] "Sort sample in decreasing order:"


[1] "Original position of items in increasing order"


[1] "Sort sample via order-function (same result as above):"
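
A typical use of `order()` is to sort one vector by the values of another one. The sketch below assumes two small example vectors, `temps` and `days`, that are not part of the cells above.

```r
temps <- c(17, 21, 23, 14)
days  <- c("monday", "tuesday", "wednesday", "thursday")

idx <- order(temps)                      # 4 1 2 3
days[idx]                                # "thursday" "monday" "tuesday" "wednesday"
days[order(temps, decreasing = TRUE)]    # "wednesday" "tuesday" "monday" "thursday"
```

The same index vector can be reused to reorder any number of vectors of equal length, which is also how the rows of a data frame are sorted in Base R.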


## Matrices
In R a matrix is a 2-dimensional collection of elements. All elements must be of the same basic type.

### Matrix construction 1
First define a vector, which is then passed to the [`matrix(vector, nrow, ncol)`](https://www.rdocumentation.org/packages/base/versions/3.6.1/topics/matrix)-function. The number of rows and columns of the matrix is defined by the arguments `nrow` and `ncol` in this function. By default the matrix is filled column by column.

In [39]:
vec <- 20:39
cat("Length of vector vec:", length(vec))

Length of vector vec: 20

In [40]:
vec
(matC <- matrix(vec, nrow=4, ncol=5))

     [,1] [,2] [,3] [,4] [,5]
[1,]   20   24   28   32   36
[2,]   21   25   29   33   37
[3,]   22   26   30   34   38
[4,]   23   27   31   35   39


In [41]:
(matR <- matrix(vec, nrow=4, ncol=5, byrow = TRUE))

(matOwn <- matrix(c(1, 2, 3, 4, 5,
                    6, 7, 8, 9, 10), 
                  nrow=2, ncol=5, byrow = TRUE))    # with byrow=TRUE the code layout mirrors the matrix

     [,1] [,2] [,3] [,4] [,5]
[1,]   20   21   22   23   24
[2,]   25   26   27   28   29
[3,]   30   31   32   33   34
[4,]   35   36   37   38   39


     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    6    7    8    9   10


In [42]:
(matO <- matrix(vec, nrow=4, ncol=10))
class(matO)

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]   20   24   28   32   36   20   24   28   32    36
[2,]   21   25   29   33   37   21   25   29   33    37
[3,]   22   26   30   34   38   22   26   30   34    38
[4,]   23   27   31   35   39   23   27   31   35    39


In [43]:
matP <- matrix(vec, nrow=4, ncol=3)
matP

"data length [20] is not a sub-multiple or multiple of the number of columns [3]"

     [,1] [,2] [,3]
[1,]   20   24   28
[2,]   21   25   29
[3,]   22   26   30
[4,]   23   27   31


### Accessing matrix elements

In [44]:
matR
matR[2, 4]
matR[1:2, 3:5]
matR[1:2, c(1, 3, 4)]
matR[, 2]
matR[3,]
matR[c(4, 1),]

     [,1] [,2] [,3] [,4] [,5]
[1,]   20   21   22   23   24
[2,]   25   26   27   28   29
[3,]   30   31   32   33   34
[4,]   35   36   37   38   39


     [,1] [,2] [,3]
[1,]   22   23   24
[2,]   27   28   29


     [,1] [,2] [,3]
[1,]   20   22   23
[2,]   25   27   28


     [,1] [,2] [,3] [,4] [,5]
[1,]   35   36   37   38   39
[2,]   20   21   22   23   24


In [45]:
print(matR)

     [,1] [,2] [,3] [,4] [,5]
[1,]   20   21   22   23   24
[2,]   25   26   27   28   29
[3,]   30   31   32   33   34
[4,]   35   36   37   38   39


### Define row- and column-names

In [46]:
dimnames(matR) <- list(c("r1", "r2", "r3", "r4"), c("c1", "c2", "c3", "c4", "c5"))
print(matR)

   c1 c2 c3 c4 c5
r1 20 21 22 23 24
r2 25 26 27 28 29
r3 30 31 32 33 34
r4 35 36 37 38 39


In [47]:
matR["r1", "c3"]

### Matrix construction 2
Stack matrices vertically (row-wise) with *rbind( )* and combine them horizontally (column-wise) with *cbind( )*. 

In [48]:
customer <- c(37, 61000, 250000, 70000)
featureNames <- c("age", "income", "capital", "credit")
names(customer) <- featureNames
customer

In [49]:
custMat <- matrix(customer, nrow=1, ncol=4, byrow = TRUE)
print(custMat)

     [,1]  [,2]   [,3]  [,4]
[1,]   37 61000 250000 70000


In [50]:
customer2 <- c(51, 134600, 750000, 90000)
names(customer2) <- featureNames
customer2
custMat2 <- matrix(customer2, nrow=1, ncol=4, byrow = TRUE)

In [51]:
allCust <- rbind(custMat, custMat2)
print(allCust)

     [,1]   [,2]   [,3]  [,4]
[1,]   37  61000 250000 70000
[2,]   51 134600 750000 90000


In [52]:
allCust <- rbind(allCust, allCust)
print(allCust)

     [,1]   [,2]   [,3]  [,4]
[1,]   37  61000 250000 70000
[2,]   51 134600 750000 90000
[3,]   37  61000 250000 70000
[4,]   51 134600 750000 90000
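
The cells above only use `rbind()`. As a counterpart, the following sketch appends a column with `cbind()`. The vector `score` and its values are hypothetical and serve only to illustrate the call.

```r
custMat  <- matrix(c(37, 61000, 250000, 70000), nrow = 1)
custMat2 <- matrix(c(51, 134600, 750000, 90000), nrow = 1)
allCust  <- rbind(custMat, custMat2)        # 2 x 4: rows stacked vertically

# Hypothetical extra feature, appended side by side as a new column
score     <- c(0.7, 0.4)
withScore <- cbind(allCust, score)          # 2 x 5
dim(withScore)                              # 2 5

# Column names can be set afterwards and then used for access
colnames(withScore) <- c("age", "income", "capital", "credit", "score")
withScore[, "income"]                       # 61000 134600
```

For `cbind()` the number of rows has to match, for `rbind()` the number of columns.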


Serialization of a matrix, column-by-column:

In [53]:
x1 <- c(allCust)

class(x1)

### Operations on matrices

Define some matrices for demonstration of matrix-operations:

In [54]:
(A <- matrix(seq(1, 6), nrow=2, ncol=3))

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6


In [55]:
(B <- matrix(seq(1, 6),nrow=2, ncol=3, byrow = T))

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6


In [56]:
(C <- matrix(seq(1, 6), nrow=3, ncol=2, byrow = T))

     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6


In [57]:
(D <- matrix(seq(1, 9), nrow=3))

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9


#### Elementwise multiplication

In [58]:
A
B
A * B

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6


     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6


     [,1] [,2] [,3]
[1,]    1    6   15
[2,]    8   20   36


#### Matrix multiplication 

In [59]:
A
C
A %*% C

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6


     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6


     [,1] [,2]
[1,]   35   44
[2,]   44   56


#### Transpose of a matrix

In [60]:
A
t(A)

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6


     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6


#### Extract diagonal elements of a matrix

In [61]:
D
diag(D)

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9


#### Create diagonal matrix

In [62]:
diag(rep(1, 4))

     [,1] [,2] [,3] [,4]
[1,]    1    0    0    0
[2,]    0    1    0    0
[3,]    0    0    1    0
[4,]    0    0    0    1


#### Inverse Matrix
The inverse of a ($d \times d$)-matrix $E$ is the matrix $E^{-1}$ such that the matrix product $E \cdot E^{-1}$ yields the ($d \times d$) identity matrix (diagonal matrix with ones on the diagonal). 

For the example: <BR>
[`floor(x)`](https://www.rdocumentation.org/packages/base/versions/3.6.1/topics/Round) returns the largest integer not greater than x.<BR>
[`runif(n, min, max)`](https://www.rdocumentation.org/packages/stats/versions/3.6.1/topics/Uniform) creates 'n' random numbers from a uniform distribution between 'min' (default=0) and 'max' (default=1).

In [63]:
set.seed(123)
(E <- matrix(floor(10*runif(9)), nrow=3))

runif(9)

     [,1] [,2] [,3]
[1,]    2    8    5
[2,]    7    9    8
[3,]    4    0    5


In [64]:
(F <- solve(E))

            [,1]       [,2]       [,3]
[1,] -0.39473684  0.3508772 -0.1666667
[2,]  0.02631579  0.0877193 -0.1666667
[3,]  0.31578947 -0.2807018  0.3333333


In [65]:
E %*% F

              [,1]          [,2] [,3]
[1,]  1.000000e+00  0.000000e+00    0
[2,]  0.000000e+00  1.000000e+00    0
[3,] -2.220446e-16 -2.220446e-16    1
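
The tiny non-zero entries in the product are rounding errors of floating-point arithmetic. A minimal sketch of how such a result can be checked, and of what happens for a matrix without an inverse, is given below. The names `Einv`, `P`, `S` and `res` are illustrative, and the values of `E` are copied from the output above.

```r
E <- matrix(c(2, 7, 4,
              8, 9, 0,
              5, 8, 5), nrow = 3)      # same values as E above, filled column-wise
Einv <- solve(E)

P <- E %*% Einv
identical(P, diag(3))                  # may be FALSE because of rounding errors
isTRUE(all.equal(P, diag(3)))          # TRUE: equal up to numerical tolerance

# A singular matrix has no inverse; solve() then signals an error
S <- matrix(c(1, 2, 2, 4), nrow = 2)   # second column = 2 * first column
res <- tryCatch(solve(S), error = function(e) "singular")
res                                    # "singular"
```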


#### Solve system of linear equations
A system of linear equations can be written as a matrix multiplication with known matrix $C$, known vector $a$ and unknown vector $x$:

$$
\mathbf{a}=C \mathbf{x}
$$

$$
\left(
\begin{array}{c}
a_1\\
a_2\\
a_3\\
\end{array}
\right)
=
\left(
\begin{array}{ccc}
C_{11} & C_{12} & C_{13} \\
C_{21} & C_{22} & C_{23} \\
C_{31} & C_{32} & C_{33} \\
\end{array}
\right)
\cdot
\left(
\begin{array}{c}
x_1\\
x_2\\
x_3\\
\end{array}
\right)
$$

Vector $x$ can be determined by:

In [66]:
C <- E
C

a <- c(7, 15, 9)
as.data.frame(a)  #different display

x <- solve(C, a)
as.data.frame(x)  #different display

     [,1] [,2] [,3]
[1,]    2    8    5
[2,]    7    9    8
[3,]    4    0    5


   a
1  7
2 15
3  9


  x
1 1
2 0
3 1


## Factors
In data analysis, categorical variables usually have to be treated differently from numerical variables. In R the values of categorical variables are typically represented not as plain vectors but as so-called *factors*. Factors provide a very efficient way of processing categorical variables. Another advantage is that some algorithms, e.g. machine learning algorithms like decision trees, are implemented such that they handle factor variables differently from plain vectors. This means that no explicit preprocessing, such as one-hot encoding, has to be implemented for categorical variables.

In [67]:
vect <- c('red', 'blue', 'red', 'green', 'yellow', 'green', 
          'green', 'blue') #create vector
print(vect) 
print(class(vect))

fact <- factor(vect) #convert vector into factor
print(fact)

print(class(fact))

levels(fact)

[1] "red"    "blue"   "red"    "green"  "yellow" "green"  "green"  "blue"  
[1] "character"
[1] red    blue   red    green  yellow green  green  blue  
Levels: blue green red yellow
[1] "factor"


Each possible value of a factor-variable is called a *level*. The set of *levels* can easily be modified, as shown in the code cell below: 

In [68]:
print(levels(fact))
levels(fact) <- c('blau', 'grün', 'rot', 'gelb')
print(fact)

[1] "blue"   "green"  "red"    "yellow"
[1] rot  blau rot  grün gelb grün grün blau
Levels: blau grün rot gelb


The frequency of each level (value) in a factor-variable can be determined by the `table()`-function:

In [69]:
table(fact)

fact
blau grün  rot gelb 
   2    3    2    1 
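
Internally a factor stores one integer code per element plus the vector of levels, which is what makes it efficient. The following sketch uses a made-up vector `sizes` to show these codes and how an explicit, ordered set of levels can be declared.

```r
sizes <- c("small", "large", "medium", "small")

# Default: levels are sorted alphabetically
f_default <- factor(sizes)
levels(f_default)            # "large" "medium" "small"
as.integer(f_default)        # 3 1 2 3  (positions within levels)

# Explicit level order, declared as ordered so that comparisons work
f_ordered <- factor(sizes, levels = c("small", "medium", "large"), ordered = TRUE)
f_ordered < "large"          # TRUE FALSE TRUE TRUE
table(f_ordered)             # counts in level order: small 2, medium 1, large 1
```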

We rename levels with the [`revalue()`](https://www.rdocumentation.org/packages/h2o/versions/2.8.4.4/topics/Revalue) function from the "plyr" package. Note that plyr is a separate package that is not attached by `library(tidyverse)`; it has to be installed and is called below with the `plyr::` prefix. Some arguments are:

* 'x' defines the factor vector
* 'replace' defines a named character vector with the new values as values and the old values as names, e.g. *revalue(x, c("old"="new", "older"="newer"))*. If NULL, no replacement is performed.

See also [Renaming levels of a factor](http://www.cookbook-r.com/Manipulating_data/Renaming_levels_of_a_factor/).


In [70]:
#library(plyr)

fact.new <- fact
table(fact.new)

fact.new <- plyr::revalue(fact.new, c("blau"="blue", "grün"="green"))

levels(fact.new)
table(fact.new)


fact.new
blau grün  rot gelb 
   2    3    2    1 

fact.new
 blue green   rot  gelb 
    2     3     2     1 
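
If plyr is not available, levels can also be renamed by name with Base R only. The following is a minimal sketch on a small made-up factor `fact.de`; the helper names `lookup` and `hit` are illustrative.

```r
fact.de <- factor(c("rot", "blau", "rot", "gelb"))
levels(fact.de)                                   # "blau" "gelb" "rot"

# Rename one level by name, independent of its position
levels(fact.de)[levels(fact.de) == "rot"] <- "red"

# Rename several levels via a named lookup vector (old name = new name)
lookup <- c(blau = "blue", gelb = "yellow")
hit <- levels(fact.de) %in% names(lookup)
levels(fact.de)[hit] <- unname(lookup[levels(fact.de)[hit]])

levels(fact.de)                                   # "blue" "yellow" "red"
as.character(fact.de)                             # "red" "blue" "red" "yellow"
```

Renaming by name avoids the risk of assigning new labels in the wrong order, which can happen when all levels are overwritten at once.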

## Lists
In R, lists are vectors that can contain elements of different types. List elements can be basic types or more complex types such as lists, matrices or data frames; even functions are possible. Since list elements can be named, a list in R is similar to a **dictionary** in other programming languages.


### List Creation
In the code cell below, `l1` is a quite simple list that contains only numbers. List `l2` contains different types, list `l3` has named components and list `l4` contains a function as an element: 

In [71]:
l1 <- list(3, 7, 8)
l2 <- list(3, "drei", 1:3)
l3 <- list(ID=100, firstname="paul", lastname="major")
l4 <- list(data=2:10, statfunc=mean)

l1
l2
l3
l4

l4$statfunc(l4$data)

In the following cell a list, with named elements is defined. Each element itself is a vector. 

In [72]:
tvec <- c(17, 21, 23, 14)
wvec <- c("rainy", "sunny", "sunny", "cloudy")
dvec <- c("monday", "tuesday", "wednesday", "thursday")
yvec  <- 2016
wlist <- list(temp=tvec, weather=wvec, day=dvec, year=yvec)

print(wlist)
class(wlist)

$temp
[1] 17 21 23 14

$weather
[1] "rainy"  "sunny"  "sunny"  "cloudy"

$day
[1] "monday"    "tuesday"   "wednesday" "thursday" 

$year
[1] 2016



### Accessing List Elements
There are different options for accessing list elements:

In [73]:
wlist[1]
class(wlist[1])

wlist["temp"]
class(wlist["temp"])

wlist$temp
class(wlist$temp)

**Note** that if only a single square-bracket is applied in the list access, the result is always a (sub-)list. This is true even in the case that the list element at the specified position is a single value as shown in the code-snippet below:

In [74]:
l1 <- list(3, 7, 8)
l1[2] 
class(l1[2])


This property of list access is a frequent cause of errors. For example, simple operations like addition cannot be performed on lists. Hence an attempt like the one in the following code cell (commented out) yields an error.

In [75]:
# l1[2] + l1[3]

If the list access shall return single elements, not (sub-)lists, double square-brackets must be applied as shown below:

In [76]:
l1[[2]]
class(l1[[2]])

In [77]:
l1[[2]] + l1[[3]]
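
The difference between the two access forms can be made visible without stopping the notebook by catching the error. This is only a sketch; the message text stored in `msg` is chosen here and is not R's original error message.

```r
l1 <- list(3, 7, 8)

# Single brackets return lists, so arithmetic on them fails
msg <- tryCatch(l1[2] + l1[3], error = function(e) "error: non-numeric argument")
msg

# Double brackets return the elements themselves
l1[[2]] + l1[[3]]             # 15

# To sum all elements, convert the list to an atomic vector first
sum(unlist(l1))               # 18
```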

Access vector-components of list-components: 

In [78]:
wlist[["temp"]][2] #returns the second component of the vector wlist[["temp"]], 
                   #which is the only element of the list wlist["temp"]
class(wlist[["temp"]][2])

wlist[["day"]][3]
class(wlist[["day"]][3])

In [79]:
wlist$temp
wlist$day[2]
wlist$year

### Flatten nested lists
Flattening means that a nested (multi-dimensional) object is transformed into a one-dimensional representation. For example, a list whose elements are vectors or lists can be turned into a one-dimensional vector by applying the `unlist()`-function. If the elements have different types, they are coerced to a common type:

In [80]:
flist <- unlist(wlist)
flist

class(flist)
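
What `unlist()` does to names and types can be seen on a smaller list. The sketch below uses a shortened, redefined `wlist` so that it runs on its own; `flat` is an illustrative name.

```r
wlist <- list(temp = c(17, 21), weather = c("rainy", "sunny"), year = 2016)

flat <- unlist(wlist)
class(flat)                   # "character": the numbers were coerced to strings
names(flat)                   # "temp1" "temp2" "weather1" "weather2" "year"

# Numeric parts have to be converted back explicitly
as.numeric(flat[c("temp1", "temp2")])     # 17 21

# Flattening only the numeric components keeps the numeric type
class(unlist(wlist[c("temp", "year")]))   # "numeric"
```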

### Removing elements from a list
List elements can be removed by assigning `NULL` to the corresponding element:

In [81]:
l1
l1[2] <- NULL
l1
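
Two related cases are worth knowing: storing `NULL` as a value without removing the element, and removing several elements at once. The following sketch uses new example lists `l` and `l2`; `len_keep` is an illustrative helper variable.

```r
l <- list(a = 1, b = "two", c = 3)

l["b"] <- list(NULL)   # keeps the slot, its content is now NULL
len_keep <- length(l)  # 3
names(l)               # "a" "b" "c"

l[["b"]] <- NULL       # removes the element; later elements move up
length(l)              # 2
names(l)               # "a" "c"

# Remove several elements at once with negative indices
l2 <- list(3, 7, 8, 9)
l2 <- l2[-c(1, 3)]
unlist(l2)             # 7 9
```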

## Data Frame

### Generate Data Frame and Access Elements

Data frames store data tables and are **the most important data type in R**. A data frame is a list of vectors of equal length.

Initialize dataframe from column data:

In [82]:
tvec <- c(17, 21, 23, 14)
wvec <- c("rainy", "sunny", "sunny", "cloudy")
dvec <- c("monday", "tuesday", "wednesday", "thursday")

#Name the columns explicitly; otherwise the vector variable names are used as column names
df <- data.frame("degree" = tvec, "weather" = wvec, "weekday" = dvec)
print(df)

cat("\n#Tidyverse syntax:\n")
glimpse(df)   #Short overview of df

  degree weather   weekday
1     17   rainy    monday
2     21   sunny   tuesday
3     23   sunny wednesday
4     14  cloudy  thursday

#Tidyverse syntax:
Observations: 4
Variables: 3
$ degree  <dbl> 17, 21, 23, 14
$ weather <fct> rainy, sunny, sunny, cloudy
$ weekday <fct> monday, tuesday, wednesday, thursday


Get number of rows and columns of a dataframe:

In [83]:
dim(df)
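
That a data frame is a list of equally long vectors can be checked directly. The sketch below builds a reduced data frame so that it runs on its own; `bad` and its message text are illustrative.

```r
df <- data.frame(degree  = c(17, 21, 23, 14),
                 weather = c("rainy", "sunny", "sunny", "cloudy"),
                 stringsAsFactors = FALSE)

is.list(df)            # TRUE
length(df)             # 2: the number of columns, not of rows
sapply(df, class)      # degree: "numeric", weather: "character"

# Columns whose lengths do not fit together are rejected
bad <- tryCatch(data.frame(a = 1:3, b = 1:2), error = function(e) "unequal lengths")
bad                    # "unequal lengths"
```

Because of this list structure, the list access forms `[ ]`, `[[ ]]` and `$` also work on data frames.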

### Access, select and filter elements of a dataframe

Some functions from **tidyverse** (more exactly from the dplyr package) and some symbols:

* `select()` for include/exclude columns - see [select()](https://dplyr.tidyverse.org/reference/select.html) and [examples](https://r4ds.had.co.nz/transform.html#select)

* `filter()` for filtering of rows - see [filter()](https://dplyr.tidyverse.org/reference/filter.html) and [examples](https://r4ds.had.co.nz/transform.html#filter-rows-with-filter)

* `%>%` chains functions into a pipeline in which the first argument of each function is automatically filled with the data coming from the left - see our example below


#### Some tips for piping with %>%
To refer explicitly to the data as an argument, use the symbol `.` (a dot), e.g. 

    lm(mpg ~ wt, data = .)

To prevent the data from being inserted automatically as the first argument of a function, use braces `{...}` or the operator `%$%` (from package "magrittr"), e.g. 

    iris %>% {cor(x = .$Sepal.Length, y = .$Sepal.Width)}
    
See [Use pipe without feeding first argument](https://stackoverflow.com/questions/38717657/use-pipe-without-feeding-first-argument)
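
The variants mentioned above can be compared side by side. The following sketch assumes that the magrittr package is installed and uses the built-in datasets `mtcars` and `iris`; `fit`, `r1` and `r2` are illustrative names.

```r
library(magrittr)   # provides %>% and %$% (%>% is also available after library(tidyverse))

# 1) Default: the left-hand side becomes the first argument
mtcars %>% head(3)                          # same as head(mtcars, 3)

# 2) Placeholder: the data goes where the dot is
fit <- mtcars %>% lm(mpg ~ wt, data = .)    # same as lm(mpg ~ wt, data = mtcars)
coef(fit)

# 3) Braces: no automatic first argument, the dot is used explicitly
r1 <- iris %>% {cor(x = .$Sepal.Length, y = .$Sepal.Width)}

# 4) Exposition pipe: column names are visible directly
r2 <- iris %$% cor(Sepal.Length, Sepal.Width)

all.equal(r1, r2)                           # TRUE
```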

In [84]:
#row, column
df[2,3]
df[2,"weekday"]
df["2","weekday"]


cat("\n#Tidyverse syntax:\n")
df %>%
    select(weekday) %>%       
    filter(row_number() == 2)  



#Tidyverse syntax:


  weekday
1 tuesday


In [85]:
#All columns except from 'weather' to 'weekday'
select(df, -(weather:weekday))   

#----Example
#All rows where 'weather' == "rainy" AND 'weekday' == "monday"
filter(df, weather == "rainy", weekday == "monday")  

cat("\nLook at the pipe symbol %>% above. It is the same as:\n")
df.select <- select(df, "weekday")
filter(df.select, row_number() == 2)

  degree
1     17
2     21
3     23
4     14


  degree weather weekday
1     17   rainy  monday



Look at the pipe symbol %>% above. It is the same as:


  weekday
1 tuesday


In [86]:
#Number of rows and columns 
nrow(df)

ncol(df)

In [87]:
(x1 <- df[[3]])
class(df[[3]])

(x2 <- df[3])
class(df[3])

#Here the objects are identical
#identical(df[[3]], df[, 3])


    weekday
1    monday
2   tuesday
3 wednesday
4  thursday


In [88]:
df$weekday

cat("\n#Tidyverse syntax:\n")
df %>%
    select("weekday")


#Tidyverse syntax:


    weekday
1    monday
2   tuesday
3 wednesday
4  thursday


In [89]:
df[, 3]   #For tidyverse syntax - see above 

In [90]:
df[3]

class(df[3])

    weekday
1    monday
2   tuesday
3 wednesday
4  thursday


In [91]:
df[c("degree", "weekday")]

cat("\n#Tidyverse syntax:\n")
df %>%
    select("degree", "weekday")

  degree   weekday
1     17    monday
2     21   tuesday
3     23 wednesday
4     14  thursday



#Tidyverse syntax:


  degree   weekday
1     17    monday
2     21   tuesday
3     23 wednesday
4     14  thursday


In [92]:
df[2,]

cat("\n#Tidyverse syntax:\n")
df %>%
    filter(row_number() == 2)

  degree weather weekday
2     21   sunny tuesday



#Tidyverse syntax:


  degree weather weekday
1     21   sunny tuesday


In [93]:
df[c(2, 4),]

cat("\n#Tidyverse syntax:\n")
df %>%
    filter(row_number() == 2 | row_number() == 4)   # self-explanatory; same as row_number() %in% c(2, 4)

  degree weather  weekday
2     21   sunny  tuesday
4     14  cloudy thursday



#Tidyverse syntax:


  degree weather  weekday
1     21   sunny  tuesday
2     14  cloudy thursday


In [94]:
df["1",]   #For tidyverse syntax - see above 

  degree weather weekday
1     17   rainy  monday


### Select rows by value
The next two cells demonstrate how dataframe elements can be filtered by value. In this example only the rows (days) with sunny weather shall be determined:

In [95]:
F <- df$weather=="sunny"
F


In [96]:
df[F,]

cat("\n#Tidyverse syntax:\n")
df %>%
    filter(weather == "sunny")

  degree weather   weekday
2     21   sunny   tuesday
3     23   sunny wednesday



#Tidyverse syntax:


  degree weather   weekday
1     21   sunny   tuesday
2     23   sunny wednesday
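
Filtering by value in Base R can also combine several conditions, and missing values need some care. The sketch below rebuilds the small data frame so that it runs on its own; `warm_sunny` and `not_sunny` are illustrative names, and the `NA` is inserted only for demonstration.

```r
df <- data.frame(degree  = c(17, 21, 23, 14),
                 weather = c("rainy", "sunny", "sunny", "cloudy"),
                 weekday = c("monday", "tuesday", "wednesday", "thursday"))

# Combine conditions with & (and) or | (or)
warm_sunny <- df[df$weather == "sunny" & df$degree > 22, ]
as.character(warm_sunny$weekday)        # "wednesday"

# subset() evaluates the condition inside the data frame
not_sunny <- subset(df, weather != "sunny", select = c(weekday, degree))
nrow(not_sunny)                         # 2

# which() drops NA results, which plain logical indexing keeps as all-NA rows
df$degree[2] <- NA
nrow(df[df$degree > 15, ])              # 3 (includes one all-NA row)
nrow(df[which(df$degree > 15), ])       # 2
```

`dplyr::filter()` behaves like the `which()` variant here: rows for which the condition is `NA` are dropped.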


### Import data into data frame

In [97]:
energyData <- read.csv(file="../data/EnergyMixGeoClust.csv", header=TRUE, 
                       sep=",",row.names=1)

#str(energyData)

numObs <- dim(energyData)[1]
numObs

#To prevent that strings are converted to factors by read.csv(), 
#we can use argument 'stringsAsFactors=FALSE'
#   energyData$Country <- as.factor(energyData$Country)

glimpse(energyData)

Observations: 65
Variables: 11
$ Country   <fct> US, Canada, Mexico, Argentina, Brazil, Chile, Colombia, E...
$ Oil       <dbl> 842.9, 97.0, 85.6, 22.3, 104.3, 15.4, 8.8, 9.9, 8.5, 27.4...
$ Gas       <dbl> 588.7, 85.2, 62.7, 38.8, 18.3, 3.0, 7.8, 0.4, 3.1, 26.8, ...
$ Coal      <dbl> 498.0, 26.5, 6.8, 1.1, 11.7, 4.1, 3.1, 0.0, 0.5, 0.0, 2.3...
$ Nuclear   <dbl> 190.2, 20.3, 2.2, 1.8, 2.9, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,...
$ Hydro     <dbl> 62.2, 90.2, 6.0, 9.2, 88.5, 5.6, 9.3, 2.1, 4.5, 19.5, 8.3...
$ Total2009 <dbl> 2182.0, 319.2, 163.2, 73.3, 225.7, 28.1, 29.0, 12.4, 16.6...
$ CO2Emm    <dbl> 5941.9, 602.7, 436.8, 164.2, 409.4, 70.3, 57.9, 31.3, 35....
$ Lat       <dbl> 37.090240, 56.130366, 23.634501, -38.416097, -14.235004, ...
$ Long      <dbl> -95.712891, -106.346771, -102.552784, -63.616672, 
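
The effect of the arguments `row.names` and `stringsAsFactors` can be tried without the course data. The following sketch writes a tiny CSV file with made-up values to a temporary location and reads it back in two ways. Note that since R 4.0.0 the default of `stringsAsFactors` is `FALSE`, whereas the output above (column type `<fct>`) comes from an older R version with default `TRUE`.

```r
# Write a tiny CSV file to a temporary location (made-up values, for illustration only)
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,Country,Oil",
             "1,A,10.5",
             "2,B,3.0"), tmp)

# row.names = 1 turns the first column into row names
d_chr <- read.csv(tmp, header = TRUE, sep = ",", row.names = 1,
                  stringsAsFactors = FALSE)
d_fct <- read.csv(tmp, header = TRUE, sep = ",", row.names = 1,
                  stringsAsFactors = TRUE)

class(d_chr$Country)   # "character"
class(d_fct$Country)   # "factor"
dim(d_chr)             # 2 2
rownames(d_chr)        # "1" "2"

unlink(tmp)            # remove the temporary file
```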

Show data structure

In [None]:
#Tidyverse function
glimpse(energyData)

#str(energyData)

Observations: 65
Variables: 11
$ Country   <fct> US, Canada, Mexico, Argentina, Brazil, Chile, Colombia, E...
$ Oil       <dbl> 842.9, 97.0, 85.6, 22.3, 104.3, 15.4, 8.8, 9.9, 8.5, 27.4...
$ Gas       <dbl> 588.7, 85.2, 62.7, 38.8, 18.3, 3.0, 7.8, 0.4, 3.1, 26.8, ...
$ Coal      <dbl> 498.0, 26.5, 6.8, 1.1, 11.7, 4.1, 3.1, 0.0, 0.5, 0.0, 2.3...
$ Nuclear   <dbl> 190.2, 20.3, 2.2, 1.8, 2.9, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,...
$ Hydro     <dbl> 62.2, 90.2, 6.0, 9.2, 88.5, 5.6, 9.3, 2.1, 4.5, 19.5, 8.3...
$ Total2009 <dbl> 2182.0, 319.2, 163.2, 73.3, 225.7, 28.1, 29.0, 12.4, 16.6...
$ CO2Emm    <dbl> 5941.9, 602.7, 436.8, 164.2, 409.4, 70.3, 57.9, 31.3, 35....
$ Lat       <dbl> 37.090240, 56.130366, 23.634501, -38.416097, -14.235004, ...
$ Long      <dbl> -95.712891, -106.346771, -102.552784, -63.616672, 

In [99]:
energyData$Coal[1:10]              # Show only the first 10 rows

cat("\n#Tidyverse syntax:\n")
energyData %>%
    select(Coal) %>%
    filter(row_number() <= 10)     # Show only the first 10 rows



#Tidyverse syntax:


Coal
<dbl>
498.0
26.5
6.8
1.1
11.7
4.1
3.1
0.0
0.5
0.0


In [100]:
help(read.csv)

## Exercises


[Exercise on basic data types in R](../exercises/Ass02DataTypesR.ipynb)