In [1]:
library(tidyverse)

-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
v ggplot2 3.2.1     v purrr   0.3.3
v tibble  2.1.3     v dplyr   0.8.3
v tidyr   1.0.0     v stringr 1.4.0
v readr   1.3.1     v forcats 0.4.0
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()


# GETTING STARTED

What function would you use to read a file where fields were separated with
“|”?
###### Use the `read_delim()` function with the argument `delim="|"`.

In [2]:
read_delim("a|b|c\n1|2|3\n4|5|6", delim = "|")

a,b,c
<dbl>,<dbl>,<dbl>
1,2,3
4,5,6


Apart from file, skip, and comment, what other arguments do read_csv() and read_tsv() have in common?

###### help(read_csv)
read_csv(file, col_names = TRUE, col_types = NULL,
  locale = default_locale(), na = c("", "NA"), quoted_na = TRUE,
  quote = "\"", comment = "", trim_ws = TRUE, skip = 0,
  n_max = Inf, guess_max = min(1000, n_max),
  progress = show_progress(), skip_empty_rows = TRUE)
###### help(read_tsv)
read_tsv(file, col_names = TRUE, col_types = NULL,
  locale = default_locale(), na = c("", "NA"), quoted_na = TRUE,
  quote = "\"", comment = "", trim_ws = TRUE, skip = 0,
  n_max = Inf, guess_max = min(1000, n_max),
  progress = show_progress(), skip_empty_rows = TRUE)

###### Both functions take exactly the same arguments. The only difference is the delimiter: read_csv() splits fields on commas, read_tsv() on tabs.
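
The comparison can also be made programmatically instead of by reading both help pages. This is a minimal sketch, assuming readr is installed; the names `csv_args`, `tsv_args`, and `shared_args` are illustrative only, and the exact argument list depends on the readr version.

```r
library(readr)

# Argument names of each reader, taken from the function definitions.
csv_args <- names(formals(read_csv))
tsv_args <- names(formals(read_tsv))

# Arguments the two functions have in common.
shared_args <- intersect(csv_args, tsv_args)
shared_args

# Arguments that appear in only one of the two functions, if any.
setdiff(union(csv_args, tsv_args), shared_args)
```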


In [6]:
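# What are the most important arguments to read_fwf()?
# Besides file, the most important argument is col_positions, which tells read_fwf() where each
# column starts and ends. It is usually built with fwf_widths(), fwf_positions(), or fwf_empty().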
help(read_fwf)
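
To make the role of col_positions concrete, here is a small sketch that reads fixed-width text using fwf_widths(). The sample text, the column names, and the widths are invented for illustration and are not part of the exercise.

```r
library(readr)

# Hypothetical fixed-width data: name (6 chars), state (3 chars), age (2 chars).
fwf_text <- "John  WA 42\nMaria CA 37\n"

people <- read_fwf(
  fwf_text,
  col_positions = fwf_widths(c(6, 3, 2), col_names = c("name", "state", "age"))
)
people
```

fwf_positions() does the same job when the start and end offsets are known rather than the widths.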

Sometimes strings in a CSV file contain commas. To prevent them from causing problems they need to be surrounded by a quoting character, like " or '. By convention, read_csv() assumes that the quoting character will be ", and if you want to change it you’ll need to use read_delim() instead. What arguments do you need to specify to read the following text into a data frame?
"x,y\n1,'a,b'"

In [7]:
# The argument is quote. It is available in read_delim(), and also in read_csv(), read_csv2(), and read_tsv(). For example:

read_csv("x,y\n1,'a,b'", quote = "'")

x,y
<dbl>,<chr>
1,"a,b"


Identify what is wrong with each of the following inline CSV files. What happens when you run the code?

In [9]:
read_csv("a,b\n1,2,3\n4,5,6")
# Only two column names are provided, so the third value in each row is dropped.

"2 parsing failures.
row col  expected    actual         file
  1  -- 2 columns 3 columns literal data
  2  -- 2 columns 3 columns literal data
"

a,b
<dbl>,<dbl>
1,2
4,5
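
The parsing failures reported in the warning can also be retrieved as a data frame with problems(). This is a minimal sketch using the same malformed input; the exact rows and columns returned depend on the readr version (the output above comes from readr 1.3.1).

```r
library(readr)

# Same malformed input as above: two column names, three values per row.
df <- suppressWarnings(read_csv("a,b\n1,2,3\n4,5,6"))

# problems() returns a tibble describing each parsing failure.
problems(df)
```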


In [10]:
read_csv("a,b,c\n1,2\n1,2,3,4")
# Only three column names are provided. The first data row has only two values, so its third column is NA.
# The second data row has four values, so the fourth value is dropped because there is no column for it.

"2 parsing failures.
row col  expected    actual         file
  1  -- 3 columns 2 columns literal data
  2  -- 3 columns 4 columns literal data
"

a,b,c
<dbl>,<dbl>,<dbl>
1,2,
1,2,3.0


In [11]:
read_csv("a,b\n\"1")
# The opening quote \" is dropped because it has no matching closing quote.
# The data row contains only one value, so the second column is filled with NA.

"2 parsing failures.
row col                     expected    actual         file
  1  a  closing quote at end of file           literal data
  1  -- 2 columns                    1 columns literal data
"

a,b
<dbl>,<chr>
1,


In [13]:
read_csv("a,b\n1,2\na,b")
# Since the second data row contains strings, both columns are parsed as character vectors.

a,b
<chr>,<chr>
1,2
a,b


In [15]:
read_csv("a;b\n1;3")
#read_csv() looks for commas, not semi-colons. Everything is treated as one column name and one value.

a;b
<chr>
1;3
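
For comparison, the same text reads as intended once the delimiter is stated explicitly. A minimal sketch, assuming readr is loaded; the variable name `semi` is illustrative.

```r
library(readr)

# Tell readr that fields are separated by semicolons.
semi <- read_delim("a;b\n1;3", delim = ";")
semi
```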


# PARSING A VECTOR

In [45]:
# What are the most important arguments to locale()?
# All arguments are important; which ones matter most depends on the data being read.
help(locale)
# locale(date_names = "en", date_format = "%AD", time_format = "%AT",
 # decimal_mark = ".", grouping_mark = ",", tz = "UTC",
 # encoding = "UTF-8", asciify = FALSE)

In [24]:
# What happens if you try and set decimal_mark and grouping_mark to the same character?  
locale(decimal_mark = ',', grouping_mark=',')
# Explicitly setting decimal_mark and grouping_mark to the same character results in an error

ERROR: Error: `decimal_mark` and `grouping_mark` must be different


In [25]:
# What happens to the default value of grouping_mark when you set decimal_mark to “,”?
locale(decimal_mark = ',')
# If decimal_mark is set to ',', then grouping_mark is set to '.' automatically.

<locale>
Numbers:  123.456,78
Formats:  %AD / %AT
Timezone: UTC
Encoding: UTF-8
<date_names>
Days:   Sunday (Sun), Monday (Mon), Tuesday (Tue), Wednesday (Wed), Thursday
        (Thu), Friday (Fri), Saturday (Sat)
Months: January (Jan), February (Feb), March (Mar), April (Apr), May (May),
        June (Jun), July (Jul), August (Aug), September (Sep), October
        (Oct), November (Nov), December (Dec)
AM/PM:  AM/PM

In [26]:
# What happens to the default value of decimal_mark when you set the grouping_mark to “.”?
locale(grouping_mark = '.')
# If grouping_mark is set to '.', then decimal_mark is set to ',' automatically.

<locale>
Numbers:  123.456,78
Formats:  %AD / %AT
Timezone: UTC
Encoding: UTF-8
<date_names>
Days:   Sunday (Sun), Monday (Mon), Tuesday (Tue), Wednesday (Wed), Thursday
        (Thu), Friday (Fri), Saturday (Sat)
Months: January (Jan), February (Feb), March (Mar), April (Apr), May (May),
        June (Jun), July (Jul), August (Aug), September (Sep), October
        (Oct), November (Nov), December (Dec)
AM/PM:  AM/PM
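
The effect of these marks is easiest to see by parsing a number. A minimal sketch, assuming readr is loaded; the variable names `eu` and `us` are illustrative.

```r
library(readr)

# European style: '.' groups thousands and ',' marks the decimals.
eu <- parse_number("1.234,56", locale = locale(decimal_mark = ","))

# Default (US style): ',' groups thousands and '.' marks the decimals.
us <- parse_number("1,234.56")

c(eu, us)
```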

I didn’t discuss the date_format and time_format options to locale(). What do they do?  _They provide default date and time formats._  
Construct an example that shows when they might be useful.

In [58]:
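# French month names, full (%B) and abbreviated (%b)
parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))
parse_date("14 oct. 1979", "%d %b %Y", locale = locale("fr"))
# No format given, so the locale's default date_format is used
parse_date("01/02/15", locale = locale(date_format = "%d/%m/%y"))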
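
A default date_format is most useful when reading a whole file, because every date column picks it up without a per-column format. The sketch below assumes readr is loaded; the inline data and the column names `day` and `value` are made up for illustration.

```r
library(readr)

# Hypothetical file contents with day/month/year dates.
csv_text <- "day,value\n31/12/2019,1\n01/01/2020,2"

df <- read_csv(
  csv_text,
  col_types = cols(day = col_date(), value = col_double()),
  locale = locale(date_format = "%d/%m/%Y")
)
df
```

Here col_date() is called without a format, so it falls back to the date_format stored in the locale.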

If you live outside the US, create a new locale object that encapsulates the settings for the types of file you read most commonly.

In [67]:
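# Fails with the default locale, which expects dates like 2007-01-02
parse_date("02/01/2007")
# A locale whose default date format is day/month/year
ken_locale <- locale(date_format = "%d/%m/%Y")
parse_date("02/01/2007", locale = ken_locale)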

"1 parsing failure.
row col   expected     actual
  1  -- date like  02/01/2007
"

What’s the difference between read_csv() and read_csv2()?
###### read_csv() expects comma-delimited fields, while read_csv2() expects semicolon-delimited fields and uses a comma as the decimal mark.
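
A short sketch of read_csv2() in action, assuming readr is loaded. It splits fields on semicolons and, by default, reads a comma as the decimal mark. The input string and the names `semi_text` and `df2` are invented for illustration.

```r
library(readr)

# Semicolon-separated fields with decimal commas.
semi_text <- "a;b\n1,5;2,25"

df2 <- read_csv2(semi_text)
df2
```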

Generate the correct format string to parse each of the following dates and times:

In [68]:
d1 <- "January 1, 2010"

parse_date(d1, "%B %d, %Y")

In [69]:
d2 <- "2015-Mar-07"
parse_date(d2, "%Y-%b-%d")

In [70]:
d3 <- "06-Jun-2017"
parse_date(d3, "%d-%b-%Y")

In [71]:
d4 <- c("August 19 (2015)", "July 1 (2015)")
parse_date(d4, "%B %d (%Y)")

In [72]:
d5 <- "12/30/14" # Dec 30, 2014
parse_date(d5, "%m/%d/%y")

In [73]:
t1 <- "1705"
parse_time(t1, "%H%M")

17:05:00

In [74]:
t2 <- "11:15:10.12 PM"
parse_time(t2, "%I:%M:%OS %p")

23:15:10.12
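
The date and time pieces used above can be combined into a single format string for parse_datetime(). This is a minimal sketch, assuming readr is loaded; the input string is made up by joining the style of d2 with a 12-hour clock time.

```r
library(readr)

# Year, abbreviated month name, day, then a 12-hour time with AM/PM.
dt <- parse_datetime("2015-Mar-07 11:15:10 PM", "%Y-%b-%d %I:%M:%S %p")
dt
```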