# Importing datasets to R

When you work with many data sets, often stored in the cloud, it is cumbersome to
manually download each one and store it on your computer before loading it in R for analysis. In those
cases, reading the files directly into R is usually the better practice, since it saves time and simplifies file management on your computer.

In this exercise set, we'll see how to use the <code>httr</code> and <code>readr</code> packages, and how common base R functions deal with data from the web.

<strong>Exercise 1</strong>
The government of the United States has a portal where you can download large data sets on just about every aspect of life in the USA. Load the
<a href="https://inventory.data.gov/dataset/032e19b4-5a90-41dc-83ff-6e4cd234f565/resource/38625c3d-5388-4c16-a30f-d105432553a4/download/postscndryunivsrvy2013dirinfo.csv">Integrated Postsecondary Education Data System</a>
data set, which gives information about every American college and university, with the <code>read.delim()</code> function, the <code>read.table()</code> function with and without the <code>header</code> parameter set to <code>TRUE</code>,
and the <code>read.csv()</code> function. Look at the structure of each output.

In [1]:
university.url<-"https://inventory.data.gov/dataset/032e19b4-5a90-41dc-83ff-6e4cd234f565/resource/38625c3d-5388-4c16-a30f-d105432553a4/download/postscndryunivsrvy2013dirinfo.csv"

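# read.delim() expects tab-separated data by default, so the separator is set explicitly
US.uni.delim <- read.delim(file = university.url, sep = ",")
str(US.uni.delim)

# read.table() without header = TRUE treats the first row as data
US.uni.table.1 <- read.table(file = university.url, sep = ",")
str(US.uni.table.1)

# With header = TRUE, the first row supplies the column names
US.uni.table.2 <- read.table(file = university.url, sep = ",", header = TRUE)
str(US.uni.table.2)

# read.csv() already defaults to sep = "," and header = TRUE
US.uni.rcsv <- read.csv(file = university.url)
str(US.uni.rcsv)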

'data.frame':	7769 obs. of  66 variables:
 $ UNITID  : int  100654 100663 100690 100706 100724 100733 100751 100760 100812 100830 ...
 $ INSTNM  : Factor w/ 7626 levels "A & W Healthcare Educators",..: 102 6791 246 6792 105 6793 6528 1126 413 432 ...
 $ ADDR    : Factor w/ 7675 levels " ","#1 Campus View Drive",..: 5058 7272 906 3667 7064 4525 6393 1893 3591 6415 ...
 $ CITY    : Factor w/ 2543 levels "Aberdeen","Abilene",..: 1557 190 1432 1008 1432 2284 2284 27 92 1432 ...
 $ STABBR  : Factor w/ 59 levels "AK","AL","AR",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ ZIP     : Factor w/ 6517 levels "00602","00602-0960",..: 2518 2504 2542 2525 2533 2505 2510 2484 2512 2543 ...
 $ FIPS    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ OBEREG  : int  5 5 5 5 5 5 5 5 5 5 ...
 $ CHFNM   : Factor w/ 6715 levels " ","(Ruby) Elaine Cue",..: 1422 5355 4736 5550 2762 5658 1827 6130 5592 3431 ...
 $ CHFTITLE: Factor w/ 646 levels " ","Academic Affairs",..: 427 427 427 427 427 121 427 427 427 121 ...
 $ GENTELE : num  2.56e+0
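
To see how these readers differ without touching the network, here is a minimal sketch that parses a few lines of inline text instead of the URL. The three columns and the two records are copied from the output shown in this exercise set; passing the data through the `text` argument is just a convenience for the illustration.

```r
# Three lines of CSV text: a header row plus two records.
csv_text <- "UNITID,INSTNM,STABBR
100654,Alabama A & M University,AL
100663,University of Alabama at Birmingham,AL"

# read.table() defaults to header = FALSE, so the header row is read as data
# and the columns receive the generic names V1, V2 and V3.
no_header <- read.table(text = csv_text, sep = ",")

# With header = TRUE, the first row supplies the column names.
with_header <- read.table(text = csv_text, sep = ",", header = TRUE)

# read.csv() and read.delim() are wrappers around read.table() that
# default to header = TRUE; read.delim() still needs sep = "," here.
via_csv <- read.csv(text = csv_text)
via_delim <- read.delim(text = csv_text, sep = ",")

str(no_header)
str(with_header)
```

Comparing the two `str()` outputs shows the extra row and the generic column names that appear when the header is not declared.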

<strong>Exercise 2</strong>
Repeat the process of Exercise 1, but this time use the <code>read_csv()</code> function from the <code>readr</code> package.

In [2]:
library("readr")
US.uni_csv<-read_csv(file=university.url)
str(US.uni_csv)

Parsed with column specification:
cols(
  .default = col_integer(),
  INSTNM = col_character(),
  ADDR = col_character(),
  CITY = col_character(),
  STABBR = col_character(),
  ZIP = col_character(),
  CHFNM = col_character(),
  CHFTITLE = col_character(),
  GENTELE = col_double(),
  FAXTELE = col_double(),
  EIN = col_character(),
  OPEID = col_character(),
  WEBADDR = col_character(),
  ADMINURL = col_character(),
  FAIDURL = col_character(),
  APPLURL = col_character(),
  NPRICURL = col_character(),
  ACT = col_character(),
  CLOSEDAT = col_character(),
  IALIAS = col_character(),
  F1SYSNAM = col_character()
  # ... with 3 more columns
)
See spec(...) for full column specifications.


Classes 'tbl_df', 'tbl' and 'data.frame':	7769 obs. of  66 variables:
 $ UNITID  : int  100654 100663 100690 100706 100724 100733 100751 100760 100812 100830 ...
 $ INSTNM  : chr  "Alabama A & M University" "University of Alabama at Birmingham" "Amridge University" "University of Alabama in Huntsville" ...
 $ ADDR    : chr  "4900 Meridian Street" "Administration Bldg Suite 1070" "1200 Taylor Rd" "301 Sparkman Dr" ...
 $ CITY    : chr  "Normal" "Birmingham" "Montgomery" "Huntsville" ...
 $ STABBR  : chr  "AL" "AL" "AL" "AL" ...
 $ ZIP     : chr  "35762" "35294-0110" "36117-3553" "35899" ...
 $ FIPS    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ OBEREG  : int  5 5 5 5 5 5 5 5 5 5 ...
 $ CHFNM   : chr  "Dr. Andrew Hugine, Jr." "Ray L. Watts" "Michael Turner" "Robert A. Altenkirch" ...
 $ CHFTITLE: chr  "President" "President" "President" "President" ...
 $ GENTELE : num  2.56e+09 2.06e+09 3.34e+13 2.57e+09 3.34e+09 ...
 $ FAXTELE : num  2.56e+09 2.06e+09 3.34e+09 NA 3.35e+09 ...
 $ EIN     : chr  "
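
The column specification above shows that `read_csv()` guessed the telephone columns as doubles. If you would rather keep such identifiers as text, one option is to pass an explicit `col_types` specification. The sketch below assumes a tiny local file with made-up phone numbers; only the column names come from the data set.

```r
library(readr)

# Hypothetical sample rows written to a temporary file;
# the phone numbers are invented for illustration.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "UNITID,ZIP,GENTELE",
  "100654,35762,2565550100",
  "100663,35294-0110,2055550100"
), tmp)

# Declare the types instead of letting readr guess them.
uni <- read_csv(tmp, col_types = cols(
  UNITID  = col_integer(),
  ZIP     = col_character(),
  GENTELE = col_character()
))

str(uni)
```

The same `col_types` argument should work when `file` is a URL, since `read_csv()` accepts both.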

<strong>Exercise 3</strong>
Load this
<a href="http://donnees.ville.montreal.qc.ca/dataset/ceb2427e-aa50-4d06-b13a-d1b21e2702b9/resource/cbcce53f-d6a0-4b08-b481-967b0490cd40/download/lieuxculturels.xls">Excel file</a>
which lists the cultural sites of the city of Montréal, using the <code>read.xls()</code> function from the <code>gdata</code> package, and look at the structure of the output.
Note that the commonly used <code>read.xlsx()</code> function doesn't read files from a URL. Also, Windows users will have to install a <a href="https://www.activestate.com/activeperl/downloads">Perl distribution</a> to make
the <code>gdata</code> package work.

In [4]:
mtl.url<-"http://donnees.ville.montreal.qc.ca/dataset/ceb2427e-aa50-4d06-b13a-d1b21e2702b9/resource/cbcce53f-d6a0-4b08-b481-967b0490cd40/download/lieuxculturels.xls"

library(gdata)
mtl.gdata<-read.xls(xls=mtl.url)
str(mtl.gdata)


"package 'gdata' was built under R version 3.3.3"
gdata: read.xls support for 'XLS' (Excel 97-2004) files ENABLED.

gdata: read.xls support for 'XLSX' (Excel 2007+) files ENABLED.

Attaching package: 'gdata'

The following object is masked from 'package:stats':

    nobs

The following object is masked from 'package:utils':

    object.size

The following object is masked from 'package:base':

    startsWith



'data.frame':	101 obs. of  12 variables:
 $ Arrondissement                : Factor w/ 22 levels " elle tourne\\, une crÃ©ation de MichÃ¨le Lapointe et de RenÃ© Rioux. Les artistes expliquent ainsi leur dÃ©marche: Â« La grand"| __truncated__,..: 10 21 2 12 4 18 2 12 4 12 ...
 $ Noms.du.réseau                : Factor w/ 15 levels "","Ãglise","BibliothÃ¨que",..: 10 3 5 5 3 5 5 5 3 5 ...
 $ Nom.du.lieu.culturel.municipal: Factor w/ 101 levels "","Ãglise Ã©vangÃ©lique armÃ©nienne",..: 56 72 31 32 11 43 33 30 13 34 ...
 $ Adresse                       : Factor w/ 85 levels "","1, chemin du MusÃ©e",..: 52 51 5 7 63 31 59 50 56 12 ...
 $ Code.postal                   : Factor w/ 81 levels "","H1A 1T9","H1A 1W1",..: 21 37 51 12 57 68 64 10 55 9 ...
 $ Ville                         : Factor w/ 6 levels "","Anjou","Lachine",..: 4 4 6 6 4 6 6 6 4 6 ...
 $ Province                      : Factor w/ 2 levels "","QC": 2 2 2 2 2 2 2 2 2 2 ...
 $ Téléphone.général             : Factor w/ 73 levels "",
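
For readers that only accept local paths, a common workaround is to download the file to a temporary location first and read it from there. The sketch below is one way to do that; `read_remote()` is a hypothetical helper, not part of any package, and the demonstration uses a local `file://` URL with a small made-up CSV so that it runs offline.

```r
# Hypothetical helper: download `url` to a temporary file, then hand the
# local path to `reader` (any function that takes a file path).
read_remote <- function(url, reader, ext = "") {
  tmp <- tempfile(fileext = ext)
  on.exit(unlink(tmp))
  # mode = "wb" keeps binary files such as .xls intact on Windows
  download.file(url, destfile = tmp, mode = "wb", quiet = TRUE)
  reader(tmp)
}

# Offline demonstration: a local CSV exposed through a file:// URL.
src <- tempfile(fileext = ".csv")
writeLines(c("Ville,Province", "Lachine,QC", "Anjou,QC"), src)
src_path <- normalizePath(src, winslash = "/")
src_url <- paste0(if (grepl("^/", src_path)) "file://" else "file:///", src_path)

demo <- read_remote(src_url, read.csv, ext = ".csv")
str(demo)

# Assuming the readxl package is installed, the Excel file could be read with:
# mtl.readxl <- read_remote(mtl.url, readxl::read_excel, ext = ".xls")
```

Because the temporary file is removed when the helper returns, this pattern suits readers that load the whole file eagerly.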

<strong>Exercise 4</strong>
So far, we've been downloading clean data sets in common file types, but we can do a lot more with R. For example, try to download the front page of <a href="http://www.r-exercises.com/">r-exercises</a> by
using the <code>GET()</code> function from the <code>httr</code> package. Then look at the structure of the output and print the HTML file.

In [5]:
library("httr")
url <- "http://www.r-exercises.com/"
response<-GET(url)
str(response)
print(response)

List of 10
 $ url        : chr "http://www.r-exercises.com/"
 $ status_code: int 200
 $ headers    :List of 19
  ..$ date             : chr "Fri, 09 Jun 2017 17:27:12 GMT"
  ..$ content-type     : chr "text/html; charset=UTF-8"
  ..$ transfer-encoding: chr "chunked"
  ..$ connection       : chr "keep-alive"
  ..$ set-cookie       : chr "__cfduid=d1bfad783c3e3fe94496863c3f560a6c21497029231; expires=Sat, 09-Jun-18 17:27:11 GMT; path=/; domain=.r-exercises.com; Htt"| __truncated__
  ..$ cache-control    : chr "no-store, no-cache, must-revalidate"
  ..$ cf-railgun       : chr "4624d14858 stream 0.000000 0210 206c"
  ..$ expires          : chr "Thu, 19 Nov 1981 08:52:00 GMT"
  ..$ host-header      : chr "192fc2e7e50945beb8231a492d6a8024"
  ..$ link             : chr "<http://www.r-exercises.com/wp-json/>; rel=\"https://api.w.org/\""
  ..$ pragma           : chr "no-cache"
  ..$ set-cookie       : chr "wpSGCacheBypass=0; expires=Fri, 09-Jun-2017 16:27:12 GMT; Max-Age=0; path=/"
  ..$ set-coo
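
Since the `str()` output shows that the result is a list, its fields can be inspected with `$` before doing anything with the body. The sketch below uses a plain list as a stand-in, holding only fields and values visible in the output above, so it runs without a network connection; `is_html_ok()` is a hypothetical helper.

```r
# Stand-in for the object returned by GET(): a plain list holding only
# fields visible in the str() output. It is NOT a real httr response.
response_like <- list(
  url = "http://www.r-exercises.com/",
  status_code = 200L,
  headers = list(`content-type` = "text/html; charset=UTF-8")
)

# Hypothetical helper: TRUE when the request succeeded and returned HTML.
is_html_ok <- function(resp) {
  resp$status_code == 200L &&
    grepl("text/html", resp$headers[["content-type"]], fixed = TRUE)
}

is_html_ok(response_like)
```

On a real response, `httr` also provides the accessors `status_code()`, `headers()` and `content()`, which are usually preferable to reaching into the list directly.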

<strong>Exercise 5</strong>
The <code>GET()</code> function is quite versatile and can download whatever file is at the given URL. The City of New York published a data set of the total crime index for the largest American cities as a <code>json</code> file.

Download this file <a href="https://data.cityofnewyork.us/api/views/3h6b-pt5u/rows.json?accessType=DOWNLOAD">here</a> and repeat the steps of Exercise 4.

In [6]:
NY.json.url <- "https://data.cityofnewyork.us/api/views/3h6b-pt5u/rows.json?accessType=DOWNLOAD"
NY.response<-GET(NY.json.url)
str(NY.response)
print(NY.response)

List of 10
 $ url        : chr "https://data.cityofnewyork.us/api/views/3h6b-pt5u/rows.json?accessType=DOWNLOAD"
 $ status_code: int 200
 $ headers    :List of 12
  ..$ server                     : chr "nginx"
  ..$ date                       : chr "Fri, 09 Jun 2017 17:27:13 GMT"
  ..$ content-type               : chr "application/json; charset=utf-8"
  ..$ transfer-encoding          : chr "chunked"
  ..$ connection                 : chr "keep-alive"
  ..$ x-socrata-requestid        : chr "93yw5z2n638qm9qkl9k4xsknp"
  ..$ access-control-allow-origin: chr "*"
  ..$ etag                       : chr "W/\"a736a72e96dd1502cf0d2ee83ab68894\""
  ..$ last-modified              : chr "Sat, 25 Oct 2014 18:01:15 UTC"
  ..$ age                        : chr "0"
  ..$ x-socrata-region           : chr "aws-us-east-1-fedramp-prod"
  ..$ content-encoding           : chr "gzip"
  ..- attr(*, "class")= chr [1:2] "insensitive" "list"
 $ all_headers:List of 1
  ..$ :List of 3
  .. ..$ status : int 200
  ..
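
Once the response is downloaded, the JSON body still has to be parsed into R objects. A minimal sketch of that step follows, assuming the `jsonlite` package is available; the JSON text and its field names are made up for illustration and do not reflect the layout of the real file.

```r
library(jsonlite)

# Made-up JSON text standing in for the downloaded body; the field names
# are hypothetical and do not describe the real data set.
json_text <- '{"cities": [
  {"name": "City A", "crime_index": 1200},
  {"name": "City B", "crime_index": 950}
]}'

# fromJSON() turns an array of similar objects into a data frame.
parsed <- fromJSON(json_text)
str(parsed)
parsed$cities$crime_index

# With a real response, the text would come from httr:
# json_text <- content(NY.response, as = "text", encoding = "UTF-8")
```

Extracting the body as text first and parsing it explicitly keeps the two steps separate, which makes it easier to inspect the raw JSON when the parsed structure is not what you expected.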