# Appendix C Data 

Data visualization is an important skill for a data scientist, and several tools are available for it. Companies such as [Tableau](https://www.tableau.com/) offer paid products that let people generate high-quality visualizations from data stored in spreadsheets and databases. [D3.js](https://d3js.org/) is a JavaScript library that uses a browser to display high-quality, interactive graphics. Spreadsheet programs, such as Microsoft Excel, also offer visualization tools.

Since this course is based on the `R` language, we will explore the visualization tools provided by `R` and its packages. Even if we restrict ourselves to `R`, we have a few choices. The [R base graphics package](https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/00Index.html) provides basic plotting tools that may be sufficient for many purposes. We will also look at the [ggplot2 package](http://ggplot2.org/), which offers a higher level of abstraction for creating graphics. For an interesting comparison between base R graphics and ggplot2, see this [blog post](https://flowingdata.com/2016/03/22/comparing-ggplot2-and-r-base-graphics/).




In [2]:
# load tidyverse
library(tidyverse)

## C.1 Data importation

The first step in data analysis is to load the data set into your workspace. We will examine comma-separated value (CSV) data in this section, as it is one of the most common formats for sharing data. CSV data has the advantage of being human-readable. The disadvantage is that there is no single standard that every program follows when reading or writing these files, so details such as quoting and headers vary from file to file.

Here's an example of CSV data on heights:
    
    "earn","height","sex","ed","age","race"
    50000,74.4244387818035,"male",16,45,"white"
    60000,65.5375428255647,"female",16,58,"white"
    30000,63.6291977374349,"female",16,29,"white"
    50000,63.1085616752971,"female",16,91,"other"
    51000,63.4024835710879,"female",17,39,"white"
    9000,64.3995075440034,"female",15,26,"white"
    
The first row (usually) has a *header* giving the column names. Subsequent rows give the actual data. Strings are (usually) quoted.

We might also see these data come in the format:
    
    earn,height,sex,ed,age,race
    50000,74.4244387818035,male,16,45,white
    60000,65.5375428255647,female,16,58,white
    30000,63.6291977374349,female,16,29,white
    50000,63.1085616752971,female,16,91,other
    51000,63.4024835710879,female,17,39,white
    9000,64.3995075440034,female,15,26,white
    
There are no quotes!

Or even:

    50000,74.4244387818035,male,16,45,white
    60000,65.5375428255647,female,16,58,white
    30000,63.6291977374349,female,16,29,white
    50000,63.1085616752971,female,16,91,other
    51000,63.4024835710879,female,17,39,white
    9000,64.3995075440034,female,15,26,white
    
No column names!

The `read_csv` command is designed to read this type of file. Note that this command comes from the `readr` package, which is part of `tidyverse`, and is **not** the same as the `read.csv` function in base `R`. We generally prefer `read_csv` over `read.csv` since (i) it is much faster and (ii) it outputs nicely formatted `tibble`s, which you can pass into other `tidyverse` functions.
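
To make the difference concrete, here is a minimal sketch that reads the same small CSV text with both functions. The `csv_text` string is made up for illustration, and `readr` is loaded directly instead of the whole `tidyverse`. A string containing a newline is treated by `read_csv` as literal data; newer versions of `readr` recommend wrapping such text in `I()`.

```r
library(readr)

# A tiny, made-up CSV used only for illustration.
csv_text <- "earn,height,sex
50000,74.4,male
60000,65.5,female
"

# Base R: returns a plain data.frame.
base_df <- read.csv(text = csv_text)

# readr: returns a tibble.
tidy_tbl <- read_csv(csv_text)

class(base_df)
class(tidy_tbl)
```

Both objects hold the same two rows; only the class of the result (and how it prints) differs.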

In [3]:
heights <- read_csv("../Data/heights.csv") %>% print


    -- Column specification ------------------------------------------------------------------------------------------------
    cols(
      earn = col_double(),
      height = col_double(),
      sex = col_character(),
      ed = col_double(),
      age = col_double(),
      race = col_character()
    )

    # A tibble: 1,192 x 6
        earn height sex       ed   age race
       <dbl>  <dbl> <chr>  <dbl> <dbl> <chr>
     1 50000   74.4 male      16    45 white
     2 60000   65.5 female    16    58 white
     3 30000   63.6 female    16    29 white
     4 50000   63.1 female    16    91 other
     5 51000   63.4 female    17    39 white
     6  9000   64.4 female    15    26 white
     7 29000   61.7 female    12    49 white
     8 32000   72.7 male      17    46 white
     9  2000   72.0 male      15    21 hispanic
    10 27000   72.2 male      12    26 white
    # ... with 1,182 more rows


Here `read_csv` has told us which columns it found and which data type it guessed for each. Generally these guesses will be correct, but we will see examples later where it guesses wrongly and we have to override them manually.
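
As a rough sketch of what such an override can look like, the `col_types` argument accepts a specification built with `cols()`. The two-row `csv_text` below is copied from the sample above, and reading `ed` and `age` as integers is only an illustrative choice, not something the data requires.

```r
library(readr)

# Two rows taken from the heights sample, supplied as literal text.
csv_text <- "earn,height,sex,ed,age,race
50000,74.4244387818035,male,16,45,white
60000,65.5375428255647,female,16,58,white
"

heights_typed <- read_csv(
  csv_text,
  col_types = cols(
    ed  = col_integer(),  # read as integer instead of the guessed double
    age = col_integer()   # columns not listed here are still guessed
  )
)

print(heights_typed)
```

Because a column specification is supplied explicitly, `read_csv` does not print the column specification message for this call.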

Here is another version of `heights`, where we are not fortunate enough to have a header telling us what each column contains.

In [5]:
read_csv("../Data/heights_no_hdr.csv") %>% print


    -- Column specification ------------------------------------------------------------------------------------------------
    cols(
      `50000` = col_double(),
      `74.4244387818035` = col_double(),
      male = col_character(),
      `16` = col_double(),
      `45` = col_double(),
      white = col_character()
    )

    # A tibble: 1,191 x 6
       `50000` `74.4244387818035` male    `16`  `45` white
         <dbl>              <dbl> <chr>  <dbl> <dbl> <chr>
     1   60000               65.5 female    16    58 white
     2   30000               63.6 female    16    29 white
     3   50000               63.1 female    16    91 other
     4   51000               63.4 female    17    39 white
     5    9000               64.4 female    15    26 white
     6   29000               61.7 female    12    49 white
     7   32000               72.7 male      17    46 white
     8    2000               72.0 male      15    21 hispanic
     9   27000               72.2 male

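Now `read_csv()` has erroneously assumed that the first row of data contains the header names, so one observation is lost and the column names are meaningless. To override this behavior, we can either tell `read_csv()` that there is no header or specify the column names by hand.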

In [8]:
read_csv("../Data/heights_no_hdr.csv", col_names = FALSE) %>% print


    -- Column specification ------------------------------------------------------------------------------------------------
    cols(
      X1 = col_double(),
      X2 = col_double(),
      X3 = col_character(),
      X4 = col_double(),
      X5 = col_double(),
      X6 = col_character()
    )

    # A tibble: 1,192 x 6
          X1    X2 X3        X4    X5 X6
       <dbl> <dbl> <chr>  <dbl> <dbl> <chr>
     1 50000  74.4 male      16    45 white
     2 60000  65.5 female    16    58 white
     3 30000  63.6 female    16    29 white
     4 50000  63.1 female    16    91 other
     5 51000  63.4 female    17    39 white
     6  9000  64.4 female    15    26 white
     7 29000  61.7 female    12    49 white
     8 32000  72.7 male      17    46 white
     9  2000  72.0 male      15    21 hispanic
    10 27000  72.2 male      12    26 white
    # ... with 1,182 more rows


In [9]:
read_csv("../Data/heights_no_hdr.csv", 
         col_names = c("earn", "height", "sex", "ed", "age", "race")) %>% print


    -- Column specification ------------------------------------------------------------------------------------------------
    cols(
      earn = col_double(),
      height = col_double(),
      sex = col_character(),
      ed = col_double(),
      age = col_double(),
      race = col_character()
    )

    # A tibble: 1,192 x 6
        earn height sex       ed   age race
       <dbl>  <dbl> <chr>  <dbl> <dbl> <chr>
     1 50000   74.4 male      16    45 white
     2 60000   65.5 female    16    58 white
     3 30000   63.6 female    16    29 white
     4 50000   63.1 female    16    91 other
     5 51000   63.4 female    17    39 white
     6  9000   64.4 female    15    26 white
     7 29000   61.7 female    12    49 white
     8 32000   72.7 male      17    46 white
     9  2000   72.0 male      15    21 hispanic
    10 27000   72.2 male      12    26 white
    # ... with 1,182 more rows


To create short examples illustrating `read_csv`'s behavior, we can specify the contents of a CSV file inline.

In [9]:
read_csv(
    "a, b, c
     1, 2, 3
     4, 5, 6
")

    # A tibble: 2 x 3
          a     b     c
      <dbl> <dbl> <dbl>
    1     1     2     3
    2     4     5     6


We might want to skip a few rows at the beginning of the file that contain metadata.

In [6]:
read_csv(
"# First row to skip
// Second row to skip
% Third row to skip
a, b, c
1, 2, 3
4, 5, 6
", skip = 3)

    # A tibble: 2 x 3
          a     b     c
      <dbl> <dbl> <dbl>
    1     1     2     3
    2     4     5     6


Some CSVs will come with comments, typically in the form of lines prefaced by `#`. We can skip comment lines by specifying a comment character.

In [13]:
read_csv("
# First comment line
a, b, c
# This separates the header from the data
1, 2, 3
4, 5, 6
# Another comment line
", comment = '#')

    # A tibble: 2 x 3
          a     b     c
      <dbl> <dbl> <dbl>
    1     1     2     3
    2     4     5     6


If the file has no column names, `read_csv` will by default still treat the first row as the header, as the next cell shows. Set `col_names = FALSE` to avoid this; the columns are then named `X1`, `X2`, ...

In [14]:
read_csv("
1, 2, 3
4, 5, 6
") %>% print

    # A tibble: 1 x 3
        `1`   `2`   `3`
      <dbl> <dbl> <dbl>
    1     4     5     6


We can specify our own column names.

In [15]:
read_csv("
1, 2, 3
4, 5, 6
", col_names = c("a", "b", "c"))

    # A tibble: 2 x 3
          a     b     c
      <dbl> <dbl> <dbl>
    1     1     2     3
    2     4     5     6
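
The `skip` and `col_names` arguments can also be combined, for instance to replace an unwieldy header with names of our own. The sketch below assumes a made-up file whose first line is a verbose header; the header text and the replacement names are hypothetical.

```r
library(readr)

# Made-up CSV whose header line we would rather not keep.
csv_text <- "Yearly earnings,Height in inches
50000,74.4
60000,65.5
"

# skip = 1 drops the original header line;
# col_names supplies the names to use instead.
renamed <- read_csv(csv_text, skip = 1, col_names = c("earn", "height"))

print(renamed)
```

Without `skip = 1`, the old header line would be read as a row of data, since supplying `col_names` tells `read_csv` not to look for a header in the file.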


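We can specify how missing values are represented in the file. By default, an empty field is read as `NA`; the `na` argument lets us name other values that should be treated as missing.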

In [16]:
read_csv(
    "a, b, c
     1, 2, 3
     4,  , 6
") %>% print

    # A tibble: 2 x 3
          a     b     c
      <dbl> <dbl> <dbl>
    1     1     2     3
    2     4    NA     6


In [17]:
read_csv(
    "a, b, c
     1, 2, 3
     4, -1, 6
", na = "-1") %>% print

    # A tibble: 2 x 3
          a     b     c
      <dbl> <dbl> <dbl>
    1     1     2     3
    2     4    NA     6

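The `na` argument also accepts a character vector, which helps when a file marks missing values in more than one way. The following is a small sketch with made-up data; the markers `-1` and `missing` are illustrative assumptions. Note that passing `na` replaces the default, so `""` and `"NA"` are listed again to keep empty fields missing.

```r
library(readr)

# Made-up data that encodes missing values in several different ways.
csv_text <- "a,b,c
1,2,3
4,-1,6
7,missing,9
,11,12
"

# Every string listed in `na` is read as NA, in every column.
with_na <- read_csv(csv_text, na = c("", "NA", "-1", "missing"))

print(with_na)
```

Because the markers apply to all columns, a value such as `-1` should only be listed when it never occurs as a legitimate observation.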

We can save a tibble to a `.csv` file using `write_csv()`.

In [24]:
cubes <- data.frame(cbind(1:10,(1:10)^2,(1:10)^3))
colnames(cubes) <- c("1st","2nd","3rd")

In [25]:
cubes %>% print
write_csv(cubes, "../Data/cubes.csv")

   1st 2nd  3rd
1    1   1    1
2    2   4    8
3    3   9   27
4    4  16   64
5    5  25  125
6    6  36  216
7    7  49  343
8    8  64  512
9    9  81  729
10  10 100 1000


In [26]:
cat(read_file('../Data/cubes.csv'))

1st,2nd,3rd
1,1,1
2,4,8
3,9,27
4,16,64
5,25,125
6,36,216
7,49,343
8,64,512
9,81,729
10,100,1000


In [27]:
cubes2 <- read_csv("../Data/cubes.csv")
print(cubes2)

    Parsed with column specification:
    cols(
      `1st` = col_double(),
      `2nd` = col_double(),
      `3rd` = col_double()
    )

    # A tibble: 10 x 3
       `1st` `2nd` `3rd`
       <dbl> <dbl> <dbl>
     1     1     1     1
     2     2     4     8
     3     3     9    27
     4     4    16    64
     5     5    25   125
     6     6    36   216
     7     7    49   343
     8     8    64   512
     9     9    81   729
    10    10   100  1000

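The same round trip can be tried without touching the `../Data` directory by writing to a temporary file. This is only a sketch: the `squares` data frame and its column names are made up, and `tempfile()` is used so that nothing is left behind.

```r
library(readr)

# A small made-up data frame to write out and read back.
squares <- data.frame(n = 1:5, n_squared = (1:5)^2)

# Write to a temporary file so the example does not leave files behind.
tmp_path <- tempfile(fileext = ".csv")
write_csv(squares, tmp_path)

# Read the file back; column types are guessed again from the text.
squares_back <- read_csv(tmp_path)

# The values survive the round trip.
all(squares_back$n == squares$n)
all(squares_back$n_squared == squares$n_squared)

unlink(tmp_path)
```

A CSV file stores only text, so column types are re-guessed when the file is read back and need not match the classes of the original columns exactly.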

These days, it's increasingly common to pull data from online sources. For example, say we wanted to know the population of European countries. This is [easily found](https://en.wikipedia.org/wiki/Demographics_of_Europe#Population_by_country) on Wikipedia. We may want to analyze this kind of data in `R`. We can use the package `htmltab` to scrape data from the Internet.

In [5]:
library(htmltab)

The syntax of this command is:

```
htmltab(<url>, <table identifier>)
```
Let's try it with the Wikipedia page above.

In [1]:
url <- "http://en.wikipedia.org/wiki/Demographics_of_Europe#Population_by_country"



This did not produce what we want. The reason is that there are many tables on this page, and by default `htmltab()` just takes the first one it finds. We can pass a number as the second argument in order to take the second, third, etc.

To get `europe.pop` into a usable format we need to do a bit more work. 

In making the plot above, we used quite a few new commands. We will learn more about data manipulation in the following sections, and we will see more of `ggplot2` in Chapter 2.