# Appendix C Data 

## C.1 Data importation

The first step in data analysis is to load the data set into your workspace. In this section we examine comma-separated value (CSV) data and data from the Internet.

### C.1.1 Comma-separated value

CSV is one of the most common formats for sharing data. It has the advantage of being human-readable. The disadvantage is that there is no universally followed standard for reading or writing these files, so details such as quoting and headers vary from file to file.
Here's an example of CSV data on heights:
    
    "earn","height","sex","ed","age","race"
    50000,74.4244387818035,"male",16,45,"white"
    60000,65.5375428255647,"female",16,58,"white"
    30000,63.6291977374349,"female",16,29,"white"
    50000,63.1085616752971,"female",16,91,"other"
    51000,63.4024835710879,"female",17,39,"white"
    9000,64.3995075440034,"female",15,26,"white"
    
The first row (usually) has a *header* giving the column names. Subsequent rows give the actual data. Strings are (usually) quoted.

We might also see these data come in the format:
    
    earn,height,sex,ed,age,race
    50000,74.4244387818035,male,16,45,white
    60000,65.5375428255647,female,16,58,white
    30000,63.6291977374349,female,16,29,white
    50000,63.1085616752971,female,16,91,other
    51000,63.4024835710879,female,17,39,white
    9000,64.3995075440034,female,15,26,white
    
There are no quotes!

Or even:

    50000,74.4244387818035,male,16,45,white
    60000,65.5375428255647,female,16,58,white
    30000,63.6291977374349,female,16,29,white
    50000,63.1085616752971,female,16,91,other
    51000,63.4024835710879,female,17,39,white
    9000,64.3995075440034,female,15,26,white
    
No column names!

The `read_csv()` function is designed to read this type of file. Note that it comes from the `readr` package, which is part of the `tidyverse`, and is **not** the same as base `R`'s `read.csv()`. We generally prefer `read_csv()` over `read.csv()` because (i) it is typically much faster and (ii) it returns a nicely formatted `tibble` that you can pass to other `tidyverse` functions.
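
As a small illustration of the difference in return types, the sketch below reads the same inline CSV text with both functions and compares the classes of the results. The inline data is a shortened, rounded excerpt of the rows shown above, and `col_types = cols()` is used here only to silence the column-specification message; both are choices made for this example.

```r
library(readr)

# Inline CSV text: a shortened, rounded excerpt used only for illustration.
csv_text <- "earn,height,sex
50000,74.4,male
60000,65.5,female
"

# readr's read_csv() returns a tibble.
tidy_df <- read_csv(csv_text, col_types = cols())
class(tidy_df)

# Base R's read.csv() returns a plain data.frame.
base_df <- read.csv(text = csv_text)
class(base_df)

# The quoted variant of the same data parses to the same values.
quoted_df <- read_csv(
  '"earn","height","sex"\n50000,74.4,"male"\n60000,65.5,"female"\n',
  col_types = cols()
)
identical(quoted_df$sex, tidy_df$sex)
```

Comparing a single column with `identical()` avoids tripping over the extra attributes that `read_csv()` attaches to its result.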

    library(tidyverse)
    heights <- read_csv("../Data/Heights.csv") %>% print

    -- Column specification --------------------------------------------------------
    cols(
      earn = col_double(),
      height = col_double(),
      sex = col_character(),
      ed = col_double(),
      age = col_double(),
      race = col_character()
    )

    # A tibble: 1,192 x 6
        earn height sex       ed   age race
       <dbl>  <dbl> <chr>  <dbl> <dbl> <chr>
     1 50000   74.4 male      16    45 white
     2 60000   65.5 female    16    58 white
     3 30000   63.6 female    16    29 white
     4 50000   63.1 female    16    91 other
     5 51000   63.4 female    17    39 white
     6  9000   64.4 female    15    26 white
     7 29000   61.7 female    12    49 white
     8 32000   72.7 male      17    46 white
     9  2000   72.0 male      15    21 hispanic
    10 27000   72.2 male      12    26 white
    # ... with 1,182 more rows


Here `read_csv()` has reported which columns it found and the data type it guessed for each. These guesses are usually correct, but we will see examples later where it guesses wrongly and we have to override them manually.
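
A minimal sketch of such an override, assuming we want `ed` stored as an integer and `sex` as a factor (these target types are chosen purely for illustration and are not taken from the text). The `col_types` argument accepts a `cols()` specification; columns not named in it are still guessed.

```r
library(readr)

# A shortened, rounded excerpt of the heights data, used only for illustration.
csv_text <- "earn,height,sex,ed
50000,74.4,male,16
60000,65.5,female,16
"

# Override two of the guesses; earn and height are still guessed.
heights_typed <- read_csv(
  csv_text,
  col_types = cols(
    ed  = col_integer(),
    sex = col_factor()
  )
)

# Inspect the resulting column types.
sapply(heights_typed, class)
```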

Here is another version of `heights`, where we are not fortunate enough to have a header row telling us the column names.

    heights_no_hdr <- read_csv("../Data/Heights_no_hdr.csv") %>% print

    -- Column specification --------------------------------------------------------
    cols(
      `50000` = col_double(),
      `74.4244387818035` = col_double(),
      male = col_character(),
      `16` = col_double(),
      `45` = col_double(),
      white = col_character()
    )

    # A tibble: 1,191 x 6
       `50000` `74.4244387818035` male    `16`  `45` white
         <dbl>              <dbl> <chr>  <dbl> <dbl> <chr>
     1   60000               65.5 female    16    58 white
     2   30000               63.6 female    16    29 white
     3   50000               63.1 female    16    91 other
     4   51000               63.4 female    17    39 white
     5    9000               64.4 female    15    26 white
     6   29000               61.7 female    12    49 white
     7   32000               72.7 male      17    46 white
     8    2000               72.0 male      15    21 hispanic
     9   27000               72.2 male

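Now `read_csv()` has erroneously treated the first row of data as the header, so that row is missing from the data (the tibble has 1,191 rows instead of 1,192). To override this behavior, we can set `col_names = FALSE`, which names the columns `X1`, `X2`, and so on.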

    heights_no_hdr <- read_csv("../Data/Heights_no_hdr.csv", col_names = FALSE) %>% print

    -- Column specification --------------------------------------------------------
    cols(
      X1 = col_double(),
      X2 = col_double(),
      X3 = col_character(),
      X4 = col_double(),
      X5 = col_double(),
      X6 = col_character()
    )

    # A tibble: 1,192 x 6
          X1    X2 X3        X4    X5 X6
       <dbl> <dbl> <chr>  <dbl> <dbl> <chr>
     1 50000  74.4 male      16    45 white
     2 60000  65.5 female    16    58 white
     3 30000  63.6 female    16    29 white
     4 50000  63.1 female    16    91 other
     5 51000  63.4 female    17    39 white
     6  9000  64.4 female    15    26 white
     7 29000  61.7 female    12    49 white
     8 32000  72.7 male      17    46 white
     9  2000  72.0 male      15    21 hispanic
    10 27000  72.2 male      12    26 white
    # ... with 1,182 more rows


Better still, we can pass the column names ourselves as a character vector:

    heights_no_hdr <- read_csv("../Data/Heights_no_hdr.csv",
                               col_names = c("earn", "height", "sex", "ed", "age", "race")) %>% print

    -- Column specification --------------------------------------------------------
    cols(
      earn = col_double(),
      height = col_double(),
      sex = col_character(),
      ed = col_double(),
      age = col_double(),
      race = col_character()
    )

    # A tibble: 1,192 x 6
        earn height sex       ed   age race
       <dbl>  <dbl> <chr>  <dbl> <dbl> <chr>
     1 50000   74.4 male      16    45 white
     2 60000   65.5 female    16    58 white
     3 30000   63.6 female    16    29 white
     4 50000   63.1 female    16    91 other
     5 51000   63.4 female    17    39 white
     6  9000   64.4 female    15    26 white
     7 29000   61.7 female    12    49 white
     8 32000   72.7 male      17    46 white
     9  2000   72.0 male      15    21 hispanic
    10 27000   72.2 male      12    26 white
    # ... with 1,182 more rows


To create short examples illustrating `read_csv()`'s behavior, we can specify the contents of a CSV file inline.

    read_csv(
    "a, b, c
    1, 2, 3
    4, 5, 6
    "
    )

    a     b     c
    <dbl> <dbl> <dbl>
    1     2     3
    4     5     6


We might want to skip the first few rows of a file when they contain metadata. The `skip` argument gives the number of lines to skip before reading begins.

    read_csv(
    "# Title
    // date
    % comments
    a, b, c
    1, 2, 3
    4, 5, 6
    ", skip = 3
    )

    a     b     c
    <dbl> <dbl> <dbl>
    1     2     3
    4     5     6


Some CSVs come with comments, typically in the form of lines prefaced by `#`. We can skip comment lines by specifying a comment character with the `comment` argument.

    read_csv(
    "# comment 1
    # comment 2
    a, b, c
    1, 2, 3
    # comment 3
    4, 5, 6
    # comment 4
    ", comment = "#"
    )

    a     b     c
    <dbl> <dbl> <dbl>
    1     2     3
    4     5     6


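Set `col_names = FALSE` when the file has no column names; the columns are then named `X1`, `X2`, and so on. In the example below the first line, `a, b, c`, is therefore read as data, which makes every column a character column.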

    read_csv(
    "a, b, c
    1, 2, 3
    4, 5, 6
    ", col_names = FALSE
    )

    X1    X2    X3
    <chr> <chr> <chr>
    a     b     c
    1     2     3
    4     5     6


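We can also specify our own column names. As with `col_names = FALSE`, the first line of the file is then treated as data rather than as a header.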

    read_csv(
    "a, b, c
    1, 2, 3
    4, 5, 6
    ", col_names = c('name 1', 'name 2', 'name 3')
    )

    name 1  name 2  name 3
    <chr>   <chr>   <chr>
    a       b       c
    1       2       3
    4       5       6
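
If the file already has a header that we want to replace rather than keep as a data row, one option is to combine `col_names` with `skip = 1`. The sketch below assumes that is the goal; `col_types = cols()` is added only to suppress the column-specification message.

```r
library(readr)

renamed <- read_csv(
"a, b, c
1, 2, 3
4, 5, 6
",
  col_names = c("name 1", "name 2", "name 3"),
  skip = 1,           # drop the original header line
  col_types = cols()  # guess the column types quietly
)

# The header line is gone, so the columns are numeric again.
renamed
```

Because the names contain spaces, refer to a column with backticks, for example ``renamed$`name 1` ``.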


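We can specify how missing values are represented in the file with the `na` argument. By default, empty fields and the string `NA` are read as missing, as in this example: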

    read_csv(
    "a, b, c
    1, 2, 3
    4, , 6
    "
    )
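
Files often use other markers for missing values. As a sketch, suppose a file uses `.` and the sentinel `-999` (both markers are assumptions made for this example, not taken from the heights data). Passing them in `na` should turn them into `NA` while the columns stay numeric; the compact specification `"ddd"` requests three double columns.

```r
library(readr)

with_na <- read_csv(
"a,b,c
1,2,3
4,.,6
-999,8,9
",
  na = c("", "NA", ".", "-999"),  # strings to treat as missing
  col_types = "ddd"               # three double columns
)

# Locate the missing entries.
is.na(with_na)
```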

We can save a `data.frame` to a `.csv` file using `write_csv()`.
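
A minimal round-trip sketch: the data frame and the temporary file path below are made up for illustration. We write the data with `write_csv()`, look at the raw lines, and read the file back with `read_csv()`.

```r
library(readr)

# A small data frame to save (values chosen only for illustration).
df <- data.frame(a = c(1, 4), b = c(2, 5), c = c(3, 6))

# Write to a temporary file so the example leaves nothing behind.
path <- tempfile(fileext = ".csv")
write_csv(df, path)

# Inspect the raw text: a header line followed by one line per row.
readLines(path)

# Read the file back and check that a column survived the round trip.
df_back <- read_csv(path, col_types = "ddd")
identical(df_back$a, df$a)
```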