# Tidyverse Examples

# Core Tidy

There are a lot of packages in the Tidyverse. We're only going to focus on Core Tidy.
That includes:
* ggplot2
* dplyr
* tidyr (We're going to skip tidyr)
* readr
* purrr (We're going to skip purrr)
* tibble

Tidyverse website: https://www.tidyverse.org/

Code samples and some text pulled from R For Data Science by Wickham & Grolemund, 2016. ISBN-13: 978-1491910399

In [None]:
library(tidyverse)


In [None]:
#install.packages("nycflights13", repos = "http://cran.us.r-project.org")

Note that loading the Tidyverse masks `filter()` and `lag()` from base R's stats package with the dplyr versions. If you want to use the originals, you have to access them by their full names:
* stats::filter()
* stats::lag()
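
Here is a small sketch of how different the masked functions are. The vector `x` is made up for illustration and is not part of the flights data. `stats::filter()` applies a linear filter to a series, while `dplyr::lag()` shifts a vector by one position.

```r
library(dplyr)

# Hypothetical example vector, used only for illustration
x <- c(1, 2, 3, 4, 5)

# Base R: stats::filter() applies a linear filter, here a centered 3-point moving average
moving_avg <- stats::filter(x, rep(1 / 3, 3))

# dplyr: lag() shifts the values one position to the right, padding with NA
shifted <- dplyr::lag(x)

print(as.numeric(moving_avg))
print(shifted)
```

Writing the package prefix explicitly is the safest habit whenever both versions might be in play.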

# dplyr For Data Manipulation

dplyr provides a more natural way to manipulate data than base R.

Five key functions:
* filter()
* arrange()
* select()
* mutate()
* summarize()

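All of the functions (called verbs in Tidyverse speak) work the same way: the first argument is a data frame, the remaining arguments describe what to do with it, and the result is a new data frame.

`new_data_frame = verb(existing_data_frame, action_to_take)`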
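
As a minimal sketch of this pattern, the example below uses the built-in `mtcars` data set as a stand-in (an assumption made so the snippet runs without any extra data). The point is that a verb returns a new data frame and leaves its input alone.

```r
library(dplyr)

# verb(existing_data_frame, action_to_take) returns a new data frame
four_cyl <- filter(mtcars, cyl == 4)

# The input is not modified; only the returned copy is filtered
nrow(mtcars)
nrow(four_cyl)
```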

In [None]:
library(nycflights13) #336,776 flights that departed NYC in 2013

In [None]:
?flights

In [None]:
head(flights)

In [None]:
september_flights = filter(flights, month == 9, distance >= 1000)
head(september_flights)

In [None]:
nov_dec = filter(flights, month %in% c(11,12))
nov_dec

In [None]:
# Sort by air_time, breaking ties with distance (ascending by default)
flights_ordered_by = arrange(flights, air_time, distance)
head(flights_ordered_by)

In [None]:
# desc() takes a single column, so wrap each sort key separately
longest_flights = arrange(flights, desc(air_time), desc(distance))
head(longest_flights)

In [None]:
# Select columns by name
cols_by_name = select(flights, year, month, day)
head(cols_by_name)


In [None]:
# Select all columns between year and day (inclusive)
cols_between_year_and_day = select(flights, year:day)
head(cols_between_year_and_day)

In [None]:
# Select all columns except those from year to day (inclusive)
cols_except_from_year_to_day = select(flights, -(year:day))
head(cols_except_from_year_to_day)

In [None]:
#select() can be used to rename variables, 
#but it's rarely useful because it drops all of the variables not explicitly mentioned. 
#Instead, use `rename()`, which is a variant of `select()` 
#that keeps all the variables that aren't explicitly mentioned

column_rename = rename(flights, tail_num = tailnum) #Note the syntax: new_name = old_name.
head(column_rename)


In [None]:
version_of_star = select(flights, everything()) #SELECT *
head(version_of_star)


In [None]:
#everything() also lets you move a few columns to the front and keep the rest
version_of_star = select(flights, time_hour, air_time, everything())
head(version_of_star)

In [None]:
#Create a narrower data frame so we can see our work
flights_sml <- select(flights, year:day, ends_with("delay"), distance, air_time)

In [None]:
calculation = mutate(flights_sml,
  gain = arr_delay - dep_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)
head(calculation)

In [None]:
#If you only want to keep the new variables, use transmute():
only_calc_results = transmute(flights,
  gain = arr_delay - dep_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)
head(only_calc_results)

In [None]:
#summarise() collapses a data frame to a single row, so it is most useful in conjunction with group_by()
by_day = group_by(flights, year, month, day)
summarised_data = summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))
head(summarised_data)
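
To see what grouping changes, here is a small sketch that runs `summarise()` with and without `group_by()`. It uses the built-in `mtcars` data set as a stand-in (an assumption, chosen so the snippet is self-contained).

```r
library(dplyr)

# Ungrouped: summarise() collapses the whole data frame to one row
overall <- summarise(mtcars, avg_mpg = mean(mpg, na.rm = TRUE))

# Grouped: one row per group (here, per number of cylinders)
by_cyl <- group_by(mtcars, cyl)
per_group <- summarise(by_cyl, avg_mpg = mean(mpg, na.rm = TRUE))

nrow(overall)
nrow(per_group)
```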

In [None]:
#Who is guilty of this nonsense!? Every step needs its own intermediate variable.
by_dest = group_by(flights, dest) #Group

delay = summarise(
  by_dest,
  count = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE)
) #Summarise

final = filter(delay, count > 20, dest != "HNL") #Filter

head(final)

In [None]:
#The pipe operator %>% fixes that: each step feeds straight into the next
delays = flights %>%
  group_by(dest) %>%
  summarise(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  filter(count > 20, dest != "HNL")
head(delays)
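
The pipe simply passes the left-hand side in as the first argument of the next call, so `x %>% f(y)` means `f(x, y)`. A small sketch of that equivalence, again using the built-in `mtcars` data as a stand-in (an assumption for the sake of a self-contained example):

```r
library(dplyr)

# Nested calls read inside-out
nested <- arrange(filter(mtcars, cyl == 4), desc(mpg))

# The same steps with the pipe read top to bottom
piped <- mtcars %>%
  filter(cyl == 4) %>%
  arrange(desc(mpg))

identical(nested, piped)
```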

# Joins with dplyr

Other tables in NYC flights data

In [None]:
head(airlines)

In [None]:
head(airports)

In [None]:
head(planes)

In [None]:
head(weather)

In [None]:
#Reduce our data set for this exercise
flights2 = flights %>% select(year:day, hour, origin, dest, tailnum, carrier)
head(flights2)

In [None]:
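#Both tables have a year column, so the join suffixes them:
#year.x is the flight year, year.y is the year the plane was manufactured
flights_with_plane_info = flights2 %>%
  inner_join(planes, by = "tailnum") %>%
  rename(year = year.x) %>%
  select(-year.y)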
head(flights_with_plane_info)

SQL is the inspiration for dplyr's conventions, so the translation is straightforward:  

| dplyr                      |                      SQL                     |
|----------------------------|:--------------------------------------------:|
| inner_join(x, y, by = "z") | SELECT * FROM x INNER JOIN y USING (z)       |
| left_join(x, y, by = "z")  | SELECT * FROM x LEFT OUTER JOIN y USING (z)  |
| right_join(x, y, by = "z") | SELECT * FROM x RIGHT OUTER JOIN y USING (z) |
| full_join(x, y, by = "z")  | SELECT * FROM x FULL OUTER JOIN y USING (z)  |
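
To make the four joins concrete, here is a sketch with two toy tables. The tables `x` and `y` and their columns are hypothetical and exist only to show which keys survive each join.

```r
library(dplyr)

# Hypothetical toy tables sharing the key column z
x <- tibble(z = c(1, 2, 3), x_val = c("a", "b", "c"))
y <- tibble(z = c(2, 3, 4), y_val = c("B", "C", "D"))

inner <- inner_join(x, y, by = "z")  # keys 2, 3
left  <- left_join(x, y, by = "z")   # keys 1, 2, 3; y_val is NA for key 1
right <- right_join(x, y, by = "z")  # keys 2, 3, 4; x_val is NA for key 4
full  <- full_join(x, y, by = "z")   # keys 1, 2, 3, 4

print(full)
```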

# ggplot2 for Data Vis

Generalized use of ggplot  

ggplot(data = `<DATA>`) + `<GEOM_FUNCTION>`(mapping = aes(`<MAPPINGS>`))  

`<DATA>` = your dataset  

`<GEOM_FUNCTION>` =  A __geom__ is the geometrical object that a plot uses to represent data. People often describe plots by the type of geom that the plot uses. For example, bar charts use bar geoms, line charts use line geoms, boxplots use boxplot geoms, and so on.  

`<MAPPINGS>` = An aesthetic is a visual property of the objects in your plot that you map to your data.
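
Filling in the template might look like the sketch below. It assumes the `mpg` data set that ships with ggplot2, with `geom_point` as the geom and the x and y positions as the mappings.

```r
library(ggplot2)

# <DATA> = mpg, <GEOM_FUNCTION> = geom_point, <MAPPINGS> = x and y positions
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

print(p)
```

Swapping in a different geom or a different set of aesthetics changes the plot without changing the overall shape of the call.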

In [None]:
head(diamonds) #part of the ggplot2 package

In [None]:
ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut))

In [None]:
ggplot(data = diamonds) + geom_histogram(mapping = aes(x = carat), binwidth = 0.5)

In [None]:
ggplot(data = mpg) + geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy))

In [None]:
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) + facet_wrap(~ class, nrow = 2)

In [None]:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(color = class)) + 
  geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE)

# Tibble  

"Tibbles are data.frames that are lazy and surly: they do less and complain more forcing you to confront problems earlier, typically leading to cleaner, more expressive code."  

### Why I Love Tibbles Over Dataframes  

* It never changes the type of the inputs (e.g. it never converts strings to factors!).
* It never changes the names of variables.
* It never creates row names.
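
A quick sketch of the name and type behaviour, using a made-up column called `flight number` (hypothetical, chosen because it is not a syntactically valid R name):

```r
library(tibble)

# data.frame() rewrites non-syntactic names because check.names = TRUE by default
df <- data.frame(`flight number` = c("AA1", "UA2"))
tb <- tibble(`flight number` = c("AA1", "UA2"))

names(df)  # "flight.number"
names(tb)  # "flight number"

# Strings stay character vectors in a tibble
class(tb$`flight number`)  # "character"
```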

In [None]:
iris_as_tibble = as_tibble(iris)
print(iris_as_tibble)

Sometimes older code doesn't work with Tibbles. From the book R For Data Science:  

"The main reason that some older functions don't work with tibble is the `[` function.  We don't use `[` much in this book because `dplyr::filter()` and `dplyr::select()` allow you to solve the same problems with clearer code. With base R data frames, `[` sometimes returns a data frame, and sometimes returns a vector. With tibbles, `[` always returns another tibble."

In [None]:
#If you need to work with older code that doesn't work with Tibbles,
#you can coerce Tibbles back to dataframes.
class(as.data.frame(iris_as_tibble)) #show us the type of the object
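
The `[` difference described in the quote above is easy to see directly. This is a small sketch with a made-up two-column data frame:

```r
library(tibble)

# Hypothetical data, used only for illustration
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb <- as_tibble(df)

# Base data frame: selecting one column with [ drops to a vector
class(df[, "x"])   # "integer"

# Tibble: the same call still returns a tibble
class(tb[, "x"])   # "tbl_df" "tbl" "data.frame"

# To pull a vector out of a tibble, use [[ or $
tb[["x"]]
```

Older code that expects `df[, "x"]` to be a vector is the kind that tends to break when handed a tibble.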

# Readr

Turn flat files into data frames  

* `read_csv()` reads comma delimited files, `read_csv2()` reads semicolon
  separated files (common in countries where `,` is used as the decimal place),
  `read_tsv()` reads tab delimited files, and `read_delim()` reads in files
  with any delimiter.

* `read_fwf()` reads fixed width files. You can specify fields either by their
  widths with `fwf_widths()` or their position with `fwf_positions()`.
  `read_table()` reads a common variation of fixed width files where columns
  are separated by white space.

* `read_log()` reads Apache style log files. (But also check out
  [webreadr](https://github.com/Ironholds/webreadr) which is built on top
  of `read_log()` and provides many more helpful tools.)
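
As a sketch of the "any delimiter" case, the example below writes a tiny pipe-delimited file to a temporary path and reads it back with `read_delim()`. The file contents and column names are made up for illustration.

```r
library(readr)

# Write a tiny pipe-delimited file to a temporary location (made-up data)
path <- tempfile(fileext = ".txt")
writeLines(c("name|height", "Ann|1.70", "Bob|1.82"), path)

# read_delim() takes the delimiter explicitly; col_types pins down the column types
people <- read_delim(
  path,
  delim = "|",
  col_types = cols(name = col_character(), height = col_double())
)

print(people)
```

Spelling out `col_types` is optional, but it keeps the import from depending on readr's type guessing.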


In [None]:
heights = read_csv("../../Data/read_data/heights.csv", na="")

### Compared to base R

If you've used R before, you might wonder why we're not using `read.csv()`. There are a few good reasons to favour readr functions over the base equivalents:

* They are typically much faster (~10x) than their base equivalents.
  Long running jobs have a progress bar, so you can see what's happening. 
  If you're looking for raw speed, try `data.table::fread()`. It doesn't fit 
  quite so well into the tidyverse, but it can be quite a bit faster.

* They produce tibbles: they don't convert character vectors to factors,
  use row names, or munge the column names. These are common sources of
  frustration with the base R functions.

* They are more reproducible. Base R functions inherit some behaviour from
  your operating system and environment variables, so import code that works 
  on your computer might not work on someone else's.

In [None]:
write_csv(heights, "../../Data/read_data/heights_out.csv") 

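If you're just saving the file to open it in Excel, use `write_excel_csv()`. It writes a byte order mark at the start of the file, which tells Excel that the file is UTF-8 encoded.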

In [None]:
write_excel_csv(heights, "../../Data/read_data/heights_out_for_excel.csv") 

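You lose type information when you write to CSV, because everything is stored as plain text. To keep it, you can write to RDS (R's custom binary format) with `write_rds()` and `read_rds()`, or use a package called feather to store the data in a binary format that can be shared across programming languages.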
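
Here is a sketch of that round trip using a made-up tibble with a factor column. The data and the temporary file paths are assumptions for illustration only.

```r
library(readr)
library(tibble)

# Hypothetical data with a factor column whose level order matters
df <- tibble(
  id = 1:3,
  grade = factor(c("low", "high", "low"), levels = c("low", "high"))
)

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

# CSV round trip: the factor comes back as a plain character column
write_csv(df, csv_path)
from_csv <- read_csv(csv_path, col_types = cols())

# RDS round trip: the factor and its levels survive
write_rds(df, rds_path)
from_rds <- read_rds(rds_path)

class(from_csv$grade)
class(from_rds$grade)
```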

## Other types of data

To get other types of data into R, we recommend starting with the tidyverse packages listed below. They're certainly not perfect, but they are a good place to start. For rectangular data:

* __haven__ reads SPSS, Stata, and SAS files.

* __readxl__ reads Excel files (both `.xls` and `.xlsx`).

* __DBI__, along with a database specific backend (e.g. __RMySQL__, 
  __RSQLite__, __RPostgreSQL__ etc) allows you to run SQL queries against a 
  database and return a data frame.
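
A minimal sketch of the DBI workflow, assuming the RSQLite package is installed. It uses an in-memory SQLite database and the built-in `mtcars` data, so the table name `cars` and the query are purely illustrative.

```r
library(DBI)

# Assumes RSQLite is installed; ":memory:" creates a throwaway database
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Copy a data frame into the database as a table named "cars"
dbWriteTable(con, "cars", mtcars)

# Run a SQL query and get the result back as a data frame
result <- dbGetQuery(
  con,
  "SELECT cyl, COUNT(*) AS n FROM cars GROUP BY cyl ORDER BY cyl"
)

dbDisconnect(con)

print(result)
```

Other backends follow the same shape; only the driver passed to `dbConnect()` and its connection arguments change.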

For hierarchical data: use __jsonlite__ (by Jeroen Ooms) for JSON, and __xml2__ for XML. Jenny Bryan has some excellent worked examples at <https://jennybc.github.io/purrr-tutorial/>.
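
As a small sketch of the JSON case, `jsonlite::fromJSON()` turns an array of flat records into a data frame. The JSON string below is made up for illustration.

```r
library(jsonlite)

# Hypothetical JSON: an array of records with the same fields
json_text <- '[{"name": "Ann", "height": 1.70}, {"name": "Bob", "height": 1.82}]'

# An array of flat records is simplified to a data frame by default
people <- fromJSON(json_text)

str(people)
```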

For other file types, try the [R data import/export manual](https://cran.r-project.org/doc/manuals/r-release/R-data.html) and the [__rio__](https://github.com/leeper/rio) package.