# Reading Data

There are a few principal functions for reading data into R:
* `read.table` and `read.csv` for reading tabular data _(inverse of `write.table`)_
* `readLines` for reading lines of a text file  _(inverse of `writeLines`)_
* `source` for reading in R code files _(inverse of `dump`)_
* `dget` for reading in R objects that have been deparsed into text files _(inverse of `dput`)_
* `load` for reading in saved workspaces  _(inverse of `save`)_
* `unserialize` for reading single R objects in binary form  _(inverse of `serialize`)_
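As a rough illustration of how a few of these reader/writer pairs fit together, the sketch below round-trips some small objects. The temporary file paths and object names are hypothetical and chosen only for the example.

```r
# Hypothetical temporary files, used only for illustration
txt_file <- tempfile(fileext = ".txt")
rda_file <- tempfile(fileext = ".rda")

# writeLines / readLines: plain text, one vector element per line
writeLines(c("first line", "second line"), txt_file)
lines <- readLines(txt_file)

# save / load: one or more named objects in a binary file
x <- 1:5
save(x, file = rda_file)
rm(x)
load(rda_file)   # restores `x` under its original name

# serialize / unserialize: a single object as a raw (binary) vector
bytes <- serialize(list(a = 1, b = "two"), connection = NULL)
obj <- unserialize(bytes)
```

Note that `load` restores objects under the names they were saved with, whereas `unserialize` returns a value that you assign yourself.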

### Reading Data Files with `read.table`

The `read.table` function is one of the most commonly used functions for reading data. It has a few important arguments:
* `file` - the name of a file
* `header` - logical (`TRUE`/`FALSE`) indicating whether the file has a header line (that is, whether the first line holds variable names or data)
* `sep` - a string indicating how the columns are separated (',' or ';' or ' ')
* `colClasses` - a character vector indicating the class of each column in the dataset
* `nrows` - the number of rows in the dataset
* `comment.char` - a character string indicating the comment character (anything after this character will be treated as a comment)
* `skip` - the number of lines to skip from the beginning
* `stringsAsFactors` - should character variables be coded as factors?
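To make the arguments concrete, here is a minimal sketch that writes a small sample file and reads it back with every argument from the list spelled out. The file contents and the names `path` and `scores` are made up for the example.

```r
# Hypothetical sample file written to a temporary location
path <- tempfile(fileext = ".txt")
writeLines(c(
  "# exported from a hypothetical instrument",
  "id,name,score",
  "1,alice,9.5",
  "2,bob,7.25",
  "3,carol,8"
), path)

scores <- read.table(
  file = path,
  header = TRUE,                # first non-comment line holds the variable names
  sep = ",",                    # columns are comma-separated
  colClasses = c("integer", "character", "numeric"),
  nrows = 3,                    # number of data rows to read
  comment.char = "#",           # lines starting with # are ignored
  skip = 0,                     # do not skip any leading lines
  stringsAsFactors = FALSE      # keep strings as character vectors
)
str(scores)
```

Because `colClasses` is given, R does not have to guess the column types, and `name` stays a character vector regardless of the R version.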

For small to moderately sized datasets you can call `read.table` without specifying any other arguments:

`data <- read.table("foo.txt")`

R will automatically:
* return a data frame
* skip lines that begin with a #
* figure out how many rows there are (and how much memory needs to be allocated)
* figure out what type of variable is in each column of the table

Telling R all these things directly, however, makes it run faster and more efficiently.

`read.csv` is identical to `read.table` except that the default separator is a comma and `header` defaults to `TRUE`.


### Reading in Larger Datasets with `read.table`

With much larger datasets you can do the following things to prevent R from choking:
* Read the help page for `read.table` (`?read.table`), which contains many hints that are particularly helpful when optimising reads of large datasets
* Make a rough calculation of the memory required to store your dataset. If the dataset is larger than the amount of RAM on your computer, you can probably stop here
* Set `comment.char = ""` if there are no commented lines in your file
* Use the `colClasses` argument. Specifying the column classes up front, instead of letting R infer them, can make `read.table` much faster (up to twice as fast). A quick and dirty way to figure out the classes of each column is the following:
    * `initial <- read.table("datatable.txt", nrows = 100)`
    * `classes <- sapply(initial, class)`
    * `tabAll <- read.table("datatable.txt", colClasses = classes)`
* Set `nrows`. This doesn't make R run any faster, but it helps with memory usage; a mild overestimate is okay. You can use the Unix tool `wc` to count the number of lines in a file
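The tips in this list can be combined into a single read. The following is a sketch only: the 1,000-row temporary file stands in for a genuinely large dataset, and the names `path`, `demo` and the `nrows` value are assumptions made for the example.

```r
# Hypothetical stand-in for a large file: 1,000 rows written to a temp file
path <- tempfile(fileext = ".txt")
demo <- data.frame(
  id = 1:1000,
  value = rnorm(1000),
  group = sample(letters, 1000, replace = TRUE)
)
write.table(demo, path, row.names = FALSE)

# Step 1: infer the column classes from the first 100 rows only
initial <- read.table(path, header = TRUE, nrows = 100)
classes <- sapply(initial, class)

# Step 2: read the whole file with the hints supplied up front
tabAll <- read.table(
  path,
  header = TRUE,
  colClasses = classes,
  nrows = 1100,        # mild overestimate of the true row count
  comment.char = ""    # the file has no commented lines
)
```

One caveat with the sampling step: classes inferred from the first 100 rows may not hold for the rest of the file (for example, a column that looks like integers at first but contains decimals later), in which case the full read can fail and the classes need adjusting by hand.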

##### Know thy system

In general, when using R with larger datasets, it's useful to know a few things about your system:
* How much memory is available?
* What other applications are in use?
* Is anyone else logged in?
* What is the OS?
* Is the OS 32 or 64 bit? You can access more memory on a 64 bit system
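Some of these questions can be answered from within R itself. The sketch below uses base R only; note that it reports whether the running build of R is 32 or 64 bit, which is what limits the memory R can address. Available memory and other logged-in users are operating-system specific and are not covered here.

```r
info <- Sys.info()                 # named character vector describing the platform
os_name <- info[["sysname"]]       # e.g. "Linux", "Darwin" or "Windows"
current_user <- info[["user"]]

# Pointer size in bytes: 8 on a 64-bit build of R, 4 on a 32-bit build
pointer_bits <- .Machine$sizeof.pointer * 8

cat("OS:", os_name, "\n")
cat("R build:", pointer_bits, "bit,", R.version$arch, "\n")
cat("User:", current_user, "\n")
```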

#### Calculating Memory requirements

Suppose I have a data frame with 1,500,000 rows and 120 columns, all of which are numeric data. Roughly how much memory is required to store this data frame?

* 1,500,000 x 120 x 8 bytes/numeric = 1,440,000,000 bytes
* 1,440,000,000 bytes / 2^20 bytes/MB = 1,373.29 MB
* 1,373.29 MB / 2^10 MB/GB = 1.34 GB

A rule of thumb is that you need about twice as much RAM as the data itself occupies in order to read it into R.
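The same arithmetic can be done in R, which makes it easy to rerun for a different number of rows or columns. The variable names below are chosen for the example; the 8 bytes per numeric value is the figure used in the calculation above.

```r
n_rows <- 1500000
n_cols <- 120
bytes_per_numeric <- 8

bytes <- n_rows * n_cols * bytes_per_numeric   # raw size of the numeric data
mb <- bytes / 2^20
gb <- mb / 2^10

round(mb, 2)   # 1373.29
round(gb, 2)   # 1.34

# Rule of thumb: budget roughly twice the raw size when reading the data in
round(2 * gb, 2)

# Cross-check on a small object: 1,000 doubles take about 8,000 bytes plus a small header
object.size(numeric(1000))
```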

## Textual Formats

* `dump`ing and `dput`ing are useful because the resulting textual format is editable and, in case of corruption, potentially recoverable
* Unlike writing out a table or CSV file, `dump` and `dput` preserve the _metadata_ (sacrificing some readability), so that another user doesn't have to specify it all over again
* Textual formats can work much better with version control systems (VCS) such as git
* Textual formats adhere to the "Unix philosophy"
* Downside: not very space efficient
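The point about metadata can be seen by round-tripping the same data frame through a CSV file and through `dput`/`dget`. This is a sketch under assumptions: the data frame `df` and the temporary file paths are invented for the example.

```r
# A data frame with a factor column, which carries metadata (its levels and their order)
df <- data.frame(
  id = 1:3,
  size = factor(c("small", "large", "small"), levels = c("small", "large"))
)

csv_file <- tempfile(fileext = ".csv")   # hypothetical temporary paths
r_file <- tempfile(fileext = ".R")

# CSV round trip: the factor class and its level order are not stored in the file
write.csv(df, csv_file, row.names = FALSE)
from_csv <- read.csv(csv_file)

# dput round trip: classes and levels are written out and restored
dput(df, file = r_file)
from_dput <- dget(r_file)

levels(from_csv$size)    # NULL in R >= 4.0.0; alphabetical order in older versions
levels(from_dput$size)   # "small" "large"
```

With the CSV file, whoever reads the data back has to re-declare `size` as a factor with the intended level order; with `dput` that information travels with the file.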

**dput-ing R objects:**

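`dput` can only be used on a single R object:

    y <- data.frame(a = 1, b = "a")
    dput(y)

In versions of R before 4.0.0, where `data.frame` converts strings to factors by default, this prints the following (in later versions `b` stays a character vector):

    structure(list(a = 1, b = structure(1L, .Label = "a", class = "factor")), .Names = c("a", 
    "b"), row.names = c(NA, -1L), class = "data.frame")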


    # Save the deparsed object to a file, then read it back in
    dput(y, file = "y.R")
    new.y <- dget("y.R")
    new.y

**dumping R objects:**

`dump` can deparse multiple objects at once; they are read back in using `source`:

    x <- "foo"
    y <- data.frame(a = 1, b = "a")
    dump(c("x", "y"), file = "data.R")
    rm(x, y)          # remove x and y from the workspace
    source("data.R")  # recreate both objects from the file
    y
    x