# Data access

## CSV files

The `readr` package provides several functions for reading CSV files. Let's use them on a Kaggle dataset of homemade beer recipes from Brewer's Friend [[1]](https://www.kaggle.com/jtrofe/beer-recipes).

Let's check the first few lines of the data with base R's `file` and `readLines`.

In [73]:
example_file <- file("beers/recipeData.csv",'r')
first_lines <- readLines(example_file,n=2)
close(example_file)

print(first_lines)

[1] "BeerID,Name,URL,Style,StyleID,Size(L),OG,FG,ABV,IBU,Color,BoilSize,BoilTime,BoilGravity,Efficiency,MashThickness,SugarScale,BrewMethod,PitchRate,PrimaryTemp,PrimingMethod,PrimingAmount"    
[2] "1,Vanilla Cream Ale,/homebrew/recipe/view/1633/vanilla-cream-ale,Cream Ale,45,21.77,1.055,1.013,5.48,17.65,4.83,28.39,75,1.038,70,N/A,Specific Gravity,All Grain,N/A,17.78,corn sugar,4.5 oz"


Before choosing a reader we need to know the format of the data. The lines above show a header row followed by comma-separated fields.

`readr` has predefined functions for the following data formats [[1]](http://readr.tidyverse.org/reference/read_delim.html):
- `read_delim` parses generic data delimited by a character
- `read_csv` assumes that the data is delimited by commas
- `read_csv2` assumes that the data is delimited by semicolons
- `read_tsv` assumes that the data is delimited by tabs
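
The relationship between these readers can be sketched on a tiny example. The sample rows and temporary file names below are made up for illustration and are not part of the beer dataset; the compact `col_types` string `"icd"` is only used here to keep the output quiet.

```r
library(readr)

# Hypothetical sample rows, written to temporary files for illustration only.
csv_path <- tempfile(fileext = ".csv")
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("id,style,abv", "1,Cream Ale,5.48", "2,American IPA,5.91"), csv_path)
writeLines(c("id\tstyle\tabv", "1\tCream Ale\t5.48", "2\tAmerican IPA\t5.91"), tsv_path)

# "icd" declares the three columns as integer, character and double.
from_csv   <- read_csv(csv_path, col_types = "icd")
# read_delim needs the delimiter spelled out; read_csv fixes it to a comma.
from_delim <- read_delim(csv_path, delim = ",", col_types = "icd")
# read_tsv does the same for tab-separated data.
from_tsv   <- read_tsv(tsv_path, col_types = "icd")
```

All three calls should return the same two rows; only the delimiter handling differs.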

In this case we want to use `read_csv`.

In [68]:
library(tidyverse)

beer_recipes <- read_csv("beers/recipeData.csv")

Parsed with column specification:
cols(
  .default = col_character(),
  BeerID = col_integer(),
  StyleID = col_integer(),
  `Size(L)` = col_double(),
  OG = col_double(),
  FG = col_double(),
  ABV = col_double(),
  IBU = col_double(),
  Color = col_double(),
  BoilSize = col_double(),
  BoilTime = col_integer(),
  Efficiency = col_double()
)
See spec(...) for full column specifications.


From the output one can see that `read_csv` tries to guess the data type of each column automatically.

Running `spec` shows the full column definitions.

In [69]:
spec(beer_recipes)

cols(
  BeerID = col_integer(),
  Name = col_character(),
  URL = col_character(),
  Style = col_character(),
  StyleID = col_integer(),
  `Size(L)` = col_double(),
  OG = col_double(),
  FG = col_double(),
  ABV = col_double(),
  IBU = col_double(),
  Color = col_double(),
  BoilSize = col_double(),
  BoilTime = col_integer(),
  BoilGravity = col_character(),
  Efficiency = col_double(),
  MashThickness = col_character(),
  SugarScale = col_character(),
  BrewMethod = col_character(),
  PitchRate = col_character(),
  PrimaryTemp = col_character(),
  PrimingMethod = col_character(),
  PrimingAmount = col_character()
)

Many of the columns were parsed as characters instead of numbers. Let's use the `col_types` argument to give a better specification.

In [76]:

beer_recipes <- read_csv("beers/recipeData.csv",
                        col_types=cols(
                          BeerID = col_integer(),
                          Name = col_character(),
                          URL = col_character(),
                          Style = col_character(),
                          StyleID = col_integer(),
                          `Size(L)` = col_double(),
                          OG = col_double(),
                          FG = col_double(),
                          ABV = col_double(),
                          IBU = col_double(),
                          Color = col_double(),
                          BoilSize = col_double(),
                          BoilTime = col_double(),
                          BoilGravity = col_double(),
                          Efficiency = col_double(),
                          MashThickness = col_double(),
                          SugarScale = col_character(),
                          BrewMethod = col_character(),
                          PitchRate = col_double(),
                          PrimaryTemp = col_double(),
                          PrimingMethod = col_character(),
                          PrimingAmount = col_character()
                        )
                        )

94768 parsing failures.
row col           expected actual file
  1 MashThickness a double N/A    'beers/recipeData.csv'
  1 PitchRate     a double N/A    'beers/recipeData.csv'
  2 MashThickness a double N/A    'beers/recipeData.csv'
  2 PitchRate     a double N/A    'beers/recipeData.csv'
  2 PrimaryTemp   a double N/A    'beers/recipeData.csv'
... ............. ........ ...... ......................

This produced a lot of parsing failures. Let's inspect them with `problems`.

In [77]:
problems(beer_recipes)

row,col,expected,actual,file
1,MashThickness,a double,,'beers/recipeData.csv'
1,PitchRate,a double,,'beers/recipeData.csv'
2,MashThickness,a double,,'beers/recipeData.csv'
2,PitchRate,a double,,'beers/recipeData.csv'
2,PrimaryTemp,a double,,'beers/recipeData.csv'
3,BoilGravity,a double,,'beers/recipeData.csv'
3,MashThickness,a double,,'beers/recipeData.csv'
3,PitchRate,a double,,'beers/recipeData.csv'
3,PrimaryTemp,a double,,'beers/recipeData.csv'
4,BoilGravity,a double,,'beers/recipeData.csv'


Most of the problems seem to come from _N/A_ not being recognized as a missing value (`NA`). Let's add it to the list of missing-value strings in the original read call with the `na` argument.

In [80]:
beer_recipes <- read_csv("beers/recipeData.csv",na=c("","NA","N/A"))

spec(beer_recipes)

Parsed with column specification:
cols(
  .default = col_double(),
  BeerID = col_integer(),
  Name = col_character(),
  URL = col_character(),
  Style = col_character(),
  StyleID = col_integer(),
  BoilTime = col_integer(),
  SugarScale = col_character(),
  BrewMethod = col_character(),
  PrimingMethod = col_character(),
  PrimingAmount = col_character()
)
See spec(...) for full column specifications.


cols(
  BeerID = col_integer(),
  Name = col_character(),
  URL = col_character(),
  Style = col_character(),
  StyleID = col_integer(),
  `Size(L)` = col_double(),
  OG = col_double(),
  FG = col_double(),
  ABV = col_double(),
  IBU = col_double(),
  Color = col_double(),
  BoilSize = col_double(),
  BoilTime = col_integer(),
  BoilGravity = col_double(),
  Efficiency = col_double(),
  MashThickness = col_double(),
  SugarScale = col_character(),
  BrewMethod = col_character(),
  PitchRate = col_double(),
  PrimaryTemp = col_double(),
  PrimingMethod = col_character(),
  PrimingAmount = col_character()
)
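
The effect of the `na` argument can also be seen in isolation. This is a minimal sketch on a made-up two-row file (the file and its values are assumptions for illustration); `col_types = cols()` just asks `readr` to guess every column.

```r
library(readr)

# Hypothetical two-row file where one value is written as "N/A".
na_path <- tempfile(fileext = ".csv")
writeLines(c("BeerID,PitchRate", "1,N/A", "2,0.75"), na_path)

# With the default na strings, "N/A" is ordinary text, so the column is guessed as character.
default_na <- read_csv(na_path, col_types = cols())

# Once "N/A" is declared a missing value, the remaining values are all numeric.
custom_na <- read_csv(na_path, col_types = cols(), na = c("", "NA", "N/A"))

class(default_na$PitchRate)  # "character"
class(custom_na$PitchRate)   # "numeric"
```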

Now most of the columns look correct. The last column, however, includes units (_oz_). Using `mutate` is probably the easiest way of getting rid of them.

Let's use `gsub` to remove the unit with a regular expression [[gsub]](http://stat.ethz.ch/R-manual/R-devel/library/base/html/grep.html).

After that we can use `str` to check that our dataset looks fine.

In [89]:
beer_recipes <- beer_recipes %>%
    mutate(PrimingAmount=as.double(gsub(' oz$','',PrimingAmount)))

str(beer_recipes)

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	73861 obs. of  22 variables:
 $ BeerID       : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Name         : chr  "Vanilla Cream Ale" "Southern Tier Pumking clone" "Zombie Dust Clone - EXTRACT" "Zombie Dust Clone - ALL GRAIN" ...
 $ URL          : chr  "/homebrew/recipe/view/1633/vanilla-cream-ale" "/homebrew/recipe/view/16367/southern-tier-pumking-clone" "/homebrew/recipe/view/5920/zombie-dust-clone-extract" "/homebrew/recipe/view/5916/zombie-dust-clone-all-grain" ...
 $ Style        : chr  "Cream Ale" "Holiday/Winter Special Spiced Beer" "American IPA" "American IPA" ...
 $ StyleID      : int  45 85 7 7 20 10 86 45 129 86 ...
 $ Size(L)      : num  21.8 20.8 18.9 22.7 50 ...
 $ OG           : num  1.05 1.08 1.06 1.06 1.06 ...
 $ FG           : num  1.01 1.02 1.02 1.02 1.01 ...
 $ ABV          : num  5.48 8.16 5.91 5.8 6.48 5.58 7.09 5.36 5.77 8.22 ...
 $ IBU          : num  17.6 60.6 59.2 54.5 17.8 ...
 $ Color        : num  4.83 15.64 8.98 8.5 4.57 ...
 $
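
One thing to watch for with the `gsub` approach: any entry that does not end in `oz` is left untouched and then becomes `NA` in `as.double`. The sketch below uses a made-up vector (the `"1 cup"` entry is an assumption, not taken from the dataset) and contrasts it with `readr`'s `parse_number`, which extracts the first number it finds but silently discards the unit.

```r
library(readr)

# Hypothetical priming amounts; only the first one uses the "oz" unit.
amounts <- c("4.5 oz", "1 cup", NA)

# gsub only strips a trailing " oz", so "1 cup" stays as it is.
stripped <- gsub(" oz$", "", amounts)

# as.double cannot convert "1 cup" and returns NA for it (with a warning).
via_gsub <- suppressWarnings(as.double(stripped))

# parse_number pulls out the leading number regardless of the unit.
via_parse_number <- parse_number(amounts)
```

Here `via_gsub` is `4.5 NA NA` while `via_parse_number` is `4.5 1 NA`. Which behaviour is preferable depends on whether values in other units should be kept or treated as missing.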

Let's say that we want to write the resulting `tibble` in a format that is easily readable in Excel.

For this we'd want to use `write_excel_csv` (there are similar functions for ordinary CSV, TSV, etc.) [[write_excel_csv]](http://readr.tidyverse.org/reference/write_delim.html).

In [94]:
write_excel_csv(beer_recipes, 'beer-recipes-excel-format.csv')
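
What sets `write_excel_csv` apart from `write_csv` is that it starts the file with a UTF-8 byte order mark, which tells Excel to read the file as UTF-8. A small sketch, using a made-up data frame and temporary files, makes the difference visible:

```r
library(readr)

# Hypothetical recipes with non-ASCII names, for illustration only.
recipes <- data.frame(
  Name = c("K\u00f6lsch", "Bi\u00e8re de Garde"),
  ABV = c(4.8, 7.1),
  stringsAsFactors = FALSE
)

plain_path <- tempfile(fileext = ".csv")
excel_path <- tempfile(fileext = ".csv")

write_csv(recipes, plain_path)
write_excel_csv(recipes, excel_path)

# Inspect the first three bytes of each file.
plain_start <- readBin(plain_path, what = "raw", n = 3)
excel_start <- readBin(excel_path, what = "raw", n = 3)
```

`excel_start` holds the bytes `ef bb bf` (the byte order mark), whereas `plain_start` holds the first characters of the header.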

## Feather

Let's say you have a big dataset that you have pre-processed with R but want to analyze with Python. The Feather format, which uses Apache Arrow's data specification, was created by the authors of the tidyverse and pandas, and it should be interoperable with both of them [[feather's page in Github]](https://github.com/wesm/feather).

What matters most is that it is fast and compact (because it is a binary data format).

Using it is simple: just load the `feather` library and write the data with `write_feather` [[write_feather]](https://cran.r-project.org/web/packages/feather/feather.pdf).

Loading data is done with `read_feather`.

Do note that more complex structures, such as nested tibbles, do not necessarily fit into a Feather file.

In [98]:
library(feather)

write_feather(beer_recipes,'beer_recipes.feather')

beer_recipes2 <- read_feather('beer_recipes.feather')
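
A quick way to gain confidence in the round trip is to compare what was written with what comes back. The following is a minimal sketch on a made-up data frame and a temporary file, assuming the `feather` package is installed:

```r
library(feather)

# Hypothetical small dataset, for illustration only.
recipes <- data.frame(
  BeerID = 1:3,
  Style = c("Cream Ale", "American IPA", "American IPA"),
  ABV = c(5.48, 5.91, 5.80),
  stringsAsFactors = FALSE
)

feather_path <- tempfile(fileext = ".feather")
write_feather(recipes, feather_path)
recipes_back <- read_feather(feather_path)

# read_feather returns a tibble, so compare the contents as plain data frames.
round_trip_ok <- isTRUE(all.equal(recipes, as.data.frame(recipes_back)))
```

The same check can be run on `beer_recipes` and `beer_recipes2` from the cell above.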

## Database access

The `DBI` package defines a common interface for accessing many different databases.

We won't go through it here, but if you are going to work with, e.g., big data owned by a company, this package might interest you.

# Exercise time:

1. Modify column specifications for FIFA World Cup match data [[1]](https://www.kaggle.com/abecklas/fifa-world-cup). Use `col_datetime` in `col_types` to get a good specification for column _DateTime_ [[col_datetime]](http://readr.tidyverse.org/reference/parse_datetime.html). Use `col_factor` to make columns _Round_, _Stadium_, _City_, _HomeTeam_ and _AwayTeam_ into factors.
2. Store the resulting tibble as a feather.

In [129]:
fifa_matches <- read_csv("fifa/WorldCupMatches.csv")

Parsed with column specification:
cols(
  Year = col_integer(),
  DateTime = col_character(),
  Round = col_character(),
  Stadium = col_character(),
  City = col_character(),
  HomeTeam = col_character(),
  HomeGoals = col_integer(),
  AwayGoals = col_integer(),
  AwayTeam = col_character(),
  Observation = col_character()
)
