# **Lab 8: Tibble, Data import**

In [2]:
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.2.1     ✔ purrr   0.3.3
✔ tibble  2.1.3     ✔ dplyr   0.8.4
✔ tidyr   1.0.2     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.4.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



## **Creating tibbles**

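Creating a tibble is similar to creating a data.frame, but `tibble()` imposes no strict rules on column names (non-syntactic names such as `.2way` are allowed when wrapped in backticks), and a column can refer to columns defined before it: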

In [2]:
tb <- tibble(x = 1:5, y = 1, z = x^2 + y, `.2way` = 2)
print(tb)

# A tibble: 5 x 4
      x     y     z `.2way`
  <int> <dbl> <dbl>   <dbl>
1     1     1     2       2
2     2     1     5       2
3     3     1    10       2
4     4     1    17       2
5     5     1    26       2


Tibbles are built on top of data frames: a tibble is a data frame with some additional functionality and slightly different defaults. To coerce a classical data frame into a tibble, use `as_tibble()`:

In [3]:
class(iris)

In [4]:
class(as_tibble(iris))

In [5]:
iris_tbl <- as_tibble(iris)
print(iris_tbl)

# A tibble: 150 x 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
 6          5.4         3.9          1.7         0.4 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          5           3.4          1.5         0.2 setosa 
 9          4.4         2.9          1.4         0.2 setosa 
10          4.9         3.1          1.5         0.1 setosa 
# … with 140 more rows


Another way to create a tibble is with `tribble()`, short for transposed tibble. `tribble()` is customised for data entry in code: column headings are defined by formulas (i.e. they start with ~), and entries are separated by commas. This makes it possible to lay out small amounts of data in easy to read form.

In [6]:
tribble(
  ~x, ~y, ~z,
  "a", 2, 3.6,
  "b", 1, 8.5
)

x      y      z
<chr>  <dbl>  <dbl>
a      2      3.6
b      1      8.5


The benefit of a tibble over a data frame is that it is easier to inspect and to subset: printing shows only the first 10 rows together with each column's type, and subsetting behaves more predictably.

## **Subsetting**

Subsetting tibbles is stricter than subsetting data.frames, and **ALWAYS** returns
an object of the expected class: a single `[` returns a tibble, while a double `[[` (like `$`) returns a vector.

In [7]:
class(diamonds$carat)

In [8]:
class(diamonds[["carat"]])

In [9]:
class(diamonds[,"carat"])

In [10]:
diamonds[[carat]]  # fails: carat is unquoted, so R looks for an object named carat

ERROR: ignored

## **More on tibbles**

You can read more about other tibble features by running the following in your R console:

In [0]:
vignette("tibble")

## **Exercise**

*   Compare and contrast the following operations on a data.frame and equivalent tibble. What is different? Why might the default data frame behaviours cause you frustration?

In [0]:
df <- data.frame(abc = 1, xyz = "a")
df$x
df[, "xyz"]
df[, c("abc", "xyz")]

*   Practice referring to non-syntactic names in the following data frame by:
    *   Extracting the variable called 1.
    *   Plotting a scatterplot of 1 vs 2.
    *   Creating a new column called 3 which is 2 divided by 1.
    *   Renaming the columns to one, two and three.  



In [0]:
annoying <- tibble(
  `1` = 1:10,
  `2` = `1` * 2 + rnorm(length(`1`))
)

*   What does `tibble::enframe()` do? When might you use it?

## **Data import**


*   The current working directory (cwd) is the location which R is
currently pointing to.
*   Whenever you try to read or save a file without specifying the path explicitly, the cwd will be used by default.
*   When you are executing code from an R markdown/notebook code chunk, the cwd is the location of the document.
*   To see the current working directory use `getwd()`:

In [0]:
getwd()

To change the working directory use setwd(path_name) with a specified
path as an argument:

In [0]:
setwd("path/to/directory")
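Because `setwd()` changes the working directory for the whole session, one cautious pattern is to remember the old directory and restore it afterwards. The following is only a minimal sketch: `tempdir()` stands in for a real project folder, and `example.csv` is a hypothetical file name.

```r
# setwd() invisibly returns the previous working directory, so keep it.
old_wd <- setwd(tempdir())

# A relative path is now resolved against the new working directory.
writeLines("x,y\n1,2", "example.csv")   # hypothetical file name
found <- file.exists(file.path(getwd(), "example.csv"))
print(found)

# Clean up and restore the original working directory.
unlink("example.csv")
setwd(old_wd)
```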

## **Importing text data**

*   Text files in a table format can be read and stored in a
variable with the `read.table()` function. Use `?read.table` to learn
more about the function.
*   To read such a file, use one of the following commands:

In [0]:
mydata <- read.table("path/to/filename.csv", header=TRUE, sep = ",")
# read.csv() has convenient argument defaults for '.csv' files
mydata <- read.csv("path/to/filename.csv")

## **`read.csv()` vs `read_csv()`**
*   `read_csv()` in the `readr` package is roughly 2 to 3 times faster than `read.csv()`, which makes it better suited to large data sets.
*   The readr functions produce tibbles, and they don’t convert character vectors to factors, use row names, or munge the column names. These are common sources of frustration with the base R functions.
*   They are more reproducible. Base R functions inherit some behaviour from your operating system and environment variables, so import code that works on your computer might not work on someone else’s.
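The difference in column-name handling can be seen on a tiny file. This is only a sketch: the temporary file and its contents are made up for illustration, and `col_types = cols()` is used simply to let readr guess the column types without printing a message.

```r
library(readr)

# Write a small CSV whose column names are not syntactic R names.
path <- tempfile(fileext = ".csv")   # temporary file, for illustration only
writeLines(c("my col,2nd", "a,1", "b,2"), path)

base_df  <- read.csv(path)                       # base R
readr_tb <- read_csv(path, col_types = cols())   # readr

names(base_df)    # base R munges the names into syntactic ones
names(readr_tb)   # readr keeps the names as written in the file
class(readr_tb)   # a tibble, which is still a data.frame underneath
```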

## **The `readr` package**

*   `read_csv()`: reads comma delimited files,
*   `read_tsv()`: tab-separated files
*   `read_delim()`: general delimited files
*   `read_fwf()`: fixed-width files. You can specify fields either by their widths with  `fwf_widths()`  or their position with  `fwf_positions()` .  
*   `read_table()`:  reads a common variation of fixed width files where columns are separated by white space.
*   `read_log()`: web log files
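All of these functions are called in much the same way. As a small sketch, here is `read_tsv()` applied to a made-up tab-separated file (the file and its contents are assumptions for illustration):

```r
library(readr)

# Create a tiny tab-separated file in a temporary location.
path <- tempfile(fileext = ".tsv")
writeLines(c("name\tscore", "ann\t3.5", "bob\t4"), path)

# read_tsv() splits fields on tabs and returns a tibble.
scores <- read_tsv(path, col_types = cols())
print(scores)
```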

In [0]:
library(MASS)

In [0]:
head(Boston)

In [0]:
write.csv(Boston, file = "~/Desktop/boston.csv", row.names = FALSE)

In [0]:
boston <- read_csv("~/Desktop/boston.csv")
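The `~/Desktop` folder may not exist on every machine. A hedged variant of the same round trip writes to a temporary file instead; it assumes the MASS package is installed, and `boston_tmp` is just an illustrative name.

```r
library(readr)

# tempfile() gives a writable path on any machine (it stands in for ~/Desktop).
path <- tempfile(fileext = ".csv")

# MASS::Boston uses the data without attaching MASS,
# whose select() would otherwise mask dplyr::select().
write.csv(MASS::Boston, file = path, row.names = FALSE)

boston_tmp <- read_csv(path, col_types = cols())
dim(boston_tmp)   # should match dim(MASS::Boston)
```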

Notice that readr takes its column names from the first row of the CSV. If the first line (or the first several lines) holds metadata instead of column names, use `skip = n` to skip the first `n` lines so that the next line is read as the header:

In [3]:
read_csv("The first line of metadata
  The second line of metadata
  x,y,z
  1,2,3", skip = 2)

x      y      z
<dbl>  <dbl>  <dbl>
1      2      3


If your data does not have column names, use `col_names = FALSE`; readr then labels the columns sequentially as `X1`, `X2`, and so on:

In [4]:
read_csv(
  "1,2,3\n4,5,6", col_names = FALSE
)

X1     X2     X3
<dbl>  <dbl>  <dbl>
1      2      3
4      5      6


Lines that begin with a given prefix can be dropped with the `comment` argument:

In [5]:
read_csv("# A comment I want to skip
  x,y,z
  1,2,3", comment = "#")

x      y      z
<dbl>  <dbl>  <dbl>
1      2      3


Alternatively, instead of `col_names = FALSE`, you can pass `col_names` a character vector, which will be used as the column names:

In [6]:
read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))

x      y      z
<dbl>  <dbl>  <dbl>
1      2      3
4      5      6


Another option that commonly needs tweaking is `na`: this specifies the value (or values) that are used to represent missing values in your file:

In [7]:
read_csv("a,b,c\n1,2,.", na = ".")

a      b      c
<dbl>  <dbl>  <lgl>
1      2      NA
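Since `na` accepts several values, a file that mixes missing-value markers can be handled in one call. The sketch below uses a made-up file and made-up markers (`.`, `-99`, `N/A`). Note that supplying `na` replaces the default `c("", "NA")`, so add those back if you still need them.

```r
library(readr)

# A small file that uses three different markers for missing values.
path <- tempfile(fileext = ".csv")
writeLines(c("a,b,c", "1,.,x", "-99,2,N/A"), path)

# Every string listed in `na` is read as a missing value.
df_na <- read_csv(path, na = c(".", "-99", "N/A"), col_types = cols())
is.na(df_na)
```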


## **Exercise**

*   What function would you use to read a file where fields were separated with “|”? Create such a file and make sure it works.

*   Apart from file, skip, and comment, what other arguments do `read_csv()` and `read_tsv()` have in common?

*   What are the most important arguments to `read_fwf()`?

*  Sometimes strings in a CSV file contain commas. To prevent them from causing problems they need to be surrounded by a quoting character, like " or '. By convention, read_csv() assumes that the quoting character will be ". If you want to change it, what arguments do you need to specify to read the following text into a data frame?

In [0]:
x <- "x,y\n1,'a,b'"