# Data access

## CSV files

The `readr` package provides several functions for reading CSV files. Let's use them on a dataset available on Kaggle that contains homebrew beer recipes from Brewer's Friend [[1]](https://www.kaggle.com/jtrofe/beer-recipes).

Let's check the first few lines of the data with base R's `file` and `readLines`.

In [None]:
example_file <- file("beers/recipeData.csv",'r')
first_lines <- readLines(example_file,n=2)
close(example_file)

print(first_lines)

Before we choose which reader we want to use we need to check the format of the data. 

`readr` has predefined functions for the following data formats [[1]](http://readr.tidyverse.org/reference/read_delim.html):
- `read_delim` parses generic data delimited by a character
- `read_csv` assumes that the data is delimited by commas
- `read_csv2` assumes that the data is delimited by semicolons and uses a comma as the decimal mark
- `read_tsv` assumes that the data is delimited by tabs
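
As a rough illustration of how these readers relate to each other, the sketch below writes two tiny made-up files (the file contents and variable names are assumptions, not part of the beer dataset) and reads them back with different functions.

```r
library(readr)

# Hypothetical toy data written to temporary files for illustration.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,abv", "Pale Ale,5.2", "Stout,6.8"), csv_path)

csv2_path <- tempfile(fileext = ".csv")
writeLines(c("name;abv", "Pale Ale;5,2", "Stout;6,8"), csv2_path)

# read_csv() is read_delim() with the delimiter fixed to a comma.
from_csv   <- read_csv(csv_path)
from_delim <- read_delim(csv_path, delim = ",")

# read_csv2() expects semicolons between fields and a decimal comma.
from_csv2 <- read_csv2(csv2_path)

from_csv
from_csv2
```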

In this case we want to use `read_csv`.

In [None]:
library(tidyverse)

beer_recipes <- read_csv("beers/recipeData.csv")

From the output one can see that `read_csv` tries to guess the data type of each column automatically.

By running `spec` one can see the full column definitions.

In [None]:
spec(beer_recipes)

Many of the columns were parsed as characters instead of numbers. Let's use the `col_types` argument to give a better specification.

In [None]:

beer_recipes <- read_csv("beers/recipeData.csv",
                        col_types=cols(
                          BeerID = col_integer(),
                          Name = col_character(),
                          URL = col_character(),
                          Style = col_character(),
                          StyleID = col_integer(),
                          `Size(L)` = col_double(),
                          OG = col_double(),
                          FG = col_double(),
                          ABV = col_double(),
                          IBU = col_double(),
                          Color = col_double(),
                          BoilSize = col_double(),
                          BoilTime = col_double(),
                          BoilGravity = col_double(),
                          Efficiency = col_double(),
                          MashThickness = col_double(),
                          SugarScale = col_character(),
                          BrewMethod = col_character(),
                          PitchRate = col_double(),
                          PrimaryTemp = col_double(),
                          PrimingMethod = col_character(),
                          PrimingAmount = col_character()
                        )
                        )

This produced a lot of parsing problems. Let's inspect them with `problems`.

In [None]:
problems(beer_recipes)

Most of the problems seem to be caused by _N/A_ not being recognized as a marker for `NA`. Let's add it to the initial read call with the `na` argument.
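
To see the effect of `na` in isolation, here is a minimal sketch on a made-up two-row file (the file and its values are assumptions chosen for illustration).

```r
library(readr)

# Hypothetical toy file in which a numeric column uses "N/A" for missing values.
toy_path <- tempfile(fileext = ".csv")
writeLines(c("Name,BoilGravity", "Pale Ale,1.038", "Stout,N/A"), toy_path)

# With the default na = c("", "NA"), "N/A" is ordinary text,
# so the whole column is guessed as character.
default_na <- read_csv(toy_path)

# Declaring "N/A" as a missing-value marker lets the column be guessed as double.
custom_na <- read_csv(toy_path, na = c("", "NA", "N/A"))

class(default_na$BoilGravity)
class(custom_na$BoilGravity)
```

The same idea applied to the real dataset: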

In [None]:
beer_recipes <- read_csv("beers/recipeData.csv",na=c("","NA","N/A"))

spec(beer_recipes)

Now most of the columns seem correct. The last column, `PrimingAmount`, still includes units (_oz_). Using `mutate` is probably the easiest way of getting rid of them.

Let's use `gsub` to remove the unit with a regular expression [[gsub]](http://stat.ethz.ch/R-manual/R-devel/library/base/html/grep.html).

After that we can use `str` to check that our dataset looks fine.

In [None]:
beer_recipes <- beer_recipes %>%
    mutate(PrimingAmount=as.double(gsub(' oz$','',PrimingAmount)))

str(beer_recipes)
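
The sketch below shows what this conversion does on a few made-up values (the sample vector is an assumption, not taken from the dataset). It also shows `readr::parse_number` as one possible alternative, which keeps the first number it finds regardless of the unit that follows.

```r
library(readr)

# Hypothetical sample values resembling the PrimingAmount column.
amounts <- c("4.5 oz", "1 cup", NA)

# gsub() strips only a trailing " oz"; any other text is left untouched.
stripped <- gsub(" oz$", "", amounts)

# as.double() turns everything that is still not a plain number into NA
# (normally with a warning, silenced here).
as_numbers <- suppressWarnings(as.double(stripped))

# parse_number() extracts the first number from each string instead.
extracted <- parse_number(amounts)

stripped
as_numbers
extracted
```

Note that the two approaches treat values with other units differently, so which one is appropriate depends on whether mixing units in one numeric column is acceptable.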

Let's say that we want to write the resulting `tibble` in a format that Excel can read easily.

For this we'd want to use `write_excel_csv` (there are similar functions for plain CSV, TSV, etc.) [[write_excel_csv]](http://readr.tidyverse.org/reference/write_delim.html).

In [None]:
write_excel_csv(beer_recipes, 'beer-recipes-excel-format.csv')
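
What distinguishes `write_excel_csv` from `write_csv` is that it prepends a UTF-8 byte order mark, which tells Excel how the file is encoded. A small sketch on a made-up data frame (names and values are assumptions) makes this visible:

```r
library(readr)

# Hypothetical toy data for illustration.
toy <- data.frame(name = c("Pale Ale", "Stout"), abv = c(5.2, 6.8))

plain_path <- tempfile(fileext = ".csv")
excel_path <- tempfile(fileext = ".csv")
write_csv(toy, plain_path)
write_excel_csv(toy, excel_path)

# Compare the first three bytes of each file. The Excel variant starts
# with the UTF-8 byte order mark (EF BB BF); the plain one does not.
plain_head <- readBin(plain_path, what = "raw", n = 3)
excel_head <- readBin(excel_path, what = "raw", n = 3)

plain_head
excel_head
```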

## Feather

Let's say you have a big dataset that you have pre-processed with R but want to analyze with Python. The Feather format, which uses Apache Arrow's data specification, was created by the authors of the tidyverse and pandas, and it should be interoperable with both of them [[feather's page in Github]](https://github.com/wesm/feather).

What matters the most is that it is fast and compact (because it is a binary data format).

Using it is simple: just load the `feather` library and write data with `write_feather` [[write_feather]](https://cran.r-project.org/web/packages/feather/feather.pdf).

Loading data is done with `read_feather`.

Do note that more complex structures, such as nested tibbles, do not necessarily fit into a Feather file.

In [None]:
library(feather)

write_feather(beer_recipes,'beer_recipes.feather')

beer_recipes2 <- read_feather('beer_recipes.feather')
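
A quick way to convince yourself that nothing is lost on the way is a round trip on a small made-up data frame (the data and variable names below are assumptions for illustration):

```r
library(feather)

# Hypothetical toy data for illustration.
toy <- data.frame(
  name = c("Pale Ale", "Stout"),
  abv = c(5.2, 6.8),
  stringsAsFactors = FALSE
)

feather_path <- tempfile(fileext = ".feather")
write_feather(toy, feather_path)

# read_feather() returns a tibble; convert it back to a plain
# data frame before comparing it with the original.
restored <- read_feather(feather_path)
all.equal(as.data.frame(restored), toy)
```

On the Python side, the same file can then be opened with `pandas.read_feather`.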

## Database access

The `DBI` package defines a common interface for accessing many different databases.

We won't go through it in detail, but if you're going to work with, for example, large datasets stored in a company database, this package might interest you.
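
Just to give a flavour of the interface, here is a minimal sketch that uses an in-memory SQLite database. It assumes the `RSQLite` package is installed as the backend; the table name and data are made up for illustration.

```r
library(DBI)

# Assumption: RSQLite is installed. ":memory:" creates a throwaway
# database that disappears when the connection is closed.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical toy table for illustration.
toy <- data.frame(name = c("Pale Ale", "Stout"), abv = c(5.2, 6.8))
dbWriteTable(con, "recipes", toy)

# Queries are sent as SQL strings and come back as data frames.
strong <- dbGetQuery(con, "SELECT name, abv FROM recipes WHERE abv > 6")

dbDisconnect(con)

strong
```

Other backends plug into the same `dbConnect`, `dbGetQuery` and `dbDisconnect` calls; only the driver passed to `dbConnect` changes.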

# Exercise time:

1. Modify column specifications for FIFA World Cup match data [[1]](https://www.kaggle.com/abecklas/fifa-world-cup). Use `col_datetime` in `col_types` to get a good specification for column _DateTime_ [[col_datetime]](http://readr.tidyverse.org/reference/parse_datetime.html). Use `col_factor` to make columns _Round_, _Stadium_, _City_, _HomeTeam_ and _AwayTeam_ into factors.
2. Store the resulting tibble as a feather.

In [None]:
fifa_matches <- read_csv("fifa/WorldCupMatches.csv")

# Solutions:

## 1.

In [None]:
fifa_matches <- read_csv("fifa/WorldCupMatches.csv",
                         col_types=cols(
                             DateTime=col_datetime('%d%.%b%+%Y%+%R'),
                             Round=col_factor(levels=NULL),
                             Stadium=col_factor(levels=NULL),
                             City=col_factor(levels=NULL),
                             HomeTeam=col_factor(levels=NULL),
                             AwayTeam=col_factor(levels=NULL)
                         )
                )

str(fifa_matches)
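
The format string is easier to follow when tried on a single value. The sketch below uses `parse_datetime` and `parse_factor`, the vector counterparts of `col_datetime` and `col_factor`; the sample values are assumptions written in the same "day month year - time" layout, not rows copied from the file.

```r
library(readr)

# Hypothetical sample value in the "day month year - time" layout.
sample_value <- "13 Jul 1930 - 15:00"

# %d  day of month
# %.  skip one non-digit character (the space)
# %b  abbreviated month name
# %+  skip one or more non-digit characters
# %Y  four-digit year
# %R  hours and minutes, same as %H:%M
kickoff <- parse_datetime(sample_value, format = "%d%.%b%+%Y%+%R")
kickoff

# With levels = NULL the factor levels are taken from the data itself.
rounds <- parse_factor(c("Final", "Group 1", "Final"), levels = NULL)
levels(rounds)
```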

## 2.

In [None]:
write_feather(fifa_matches,'fifa_matches.feather')