### Data
- values of *qualitative* or *quantitative* variables, belonging to a set of items

**Processed data** is data that is *ready for analysis*. Processing can include *merging, subsetting, transforming*, etc.
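As a rough illustration of those processing steps, here is a minimal base R sketch. The data frames `people` and `scores` and their columns are invented for this example and do not come from the course material.

```r
# Hypothetical raw inputs (invented for illustration)
people <- data.frame(id = 1:4, group = c("a", "a", "b", "b"))
scores <- data.frame(id = c(1, 2, 3, 5), score = c(10, 20, 30, 50))

# Merging: keep only the ids present in both tables (inner join)
merged <- merge(people, scores, by = "id")

# Subsetting: keep rows with a score above 10
subsetted <- merged[merged$score > 10, ]

# Transforming: add a derived column
processed <- transform(subsetted, score_scaled = score / 10)
processed
```

Each step returns a new data frame, so the raw inputs stay untouched and the processing can be re-run from the start.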

In [1]:
# getting the current directory
getwd()

### Setting Working Directory

![image.png](attachment:image.png)
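The working directory is changed with `setwd()`, which accepts either an absolute or a relative path. A minimal sketch, using a temporary directory as a stand-in for a real project folder (the `data` sub-directory name is an assumption for illustration):

```r
old_wd <- getwd()            # remember where we started
target <- tempdir()          # stand-in for a real project directory

setwd(target)                # absolute path
getwd()

dir.create("data", showWarnings = FALSE)
setwd("./data")              # relative path: move into the sub-directory
setwd("../")                 # relative path: move up one level

setwd(old_wd)                # restore the original working directory
```

Restoring the original directory at the end keeps later relative paths in the session from silently pointing somewhere else.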

### Checking for and creating directories
- ``file.exists("directoryName")`` will check if a directory exists
- ``dir.create("directoryName")`` will create a directory

In [2]:
if (!file.exists("createme")) {
    dir.create("createme")
}

### Getting data from the internet
- ``download.file()``
- parameters include *url, destfile, method*

**Code**
```
fileURL <- "url"
download.file(fileURL, destfile = "filepath", method="curl")
list.files("folderpath")

dateDownloaded <- date()
dateDownloaded
```

### Loading flat files - ``read.table()``
- data is read into RAM
- file parameters include *file, header, sep, row.names, nrows*
- related to ``read.csv()`` or ``read.csv2()``

![image.png](attachment:image.png)
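To make the parameters above concrete, here is a small sketch that writes a made-up comma-separated file to a temporary path and reads it back. The file contents and column names are assumptions for illustration only.

```r
# Write a small comma-separated file to a temporary location (made-up data)
path <- tempfile(fileext = ".csv")
writeLines(c("id,name,value", "1,a,10.5", "2,b,20.0", "3,c,30.25"), path)

# read.table needs the separator and header stated explicitly
dat <- read.table(path, sep = ",", header = TRUE)

# read.csv sets sep = "," and header = TRUE by default
dat_csv <- read.csv(path)

# nrows limits how many data rows are read
first_two <- read.table(path, sep = ",", header = TRUE, nrows = 2)

str(dat)
```

For files that are too large to load comfortably into RAM, `nrows` is a simple way to inspect the first rows before committing to a full read.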

### Reading Excel files
- using the *xlsx* package
    - ``library(xlsx)``
- ``data <- read.xlsx("path", sheetIndex=1, header=TRUE, colIndex, rowIndex)``
- ``head(data)``
- ``write.xlsx`` writes out an Excel file with similar arguments
- *XLConnect* package has more options for writing and manipulating Excel files
- *XLConnect vignette* is a good place to start for that package

### Reading XML
- extensible markup language
- components include markup (labels that give the text structure) and content (the actual text of the document)

**Tags** correspond to general labels
- start tags: ``<section>``
- end tags: ``</section>``
- empty tags: ``<line-break />``

**Elements** are specific examples of tags
- ``<Greeting> Hello, world </Greeting>``

**Attributes** are components of the labels
- ``<img src="imgname.jpg" alt="instructor"/>``
- ``<step number="3"> Connect A to B. </step>``

**Reading XML file into R**
```
library(XML)
fileURL <- "URL"
# useInternalNodes = TRUE keeps the parsed tree as internal C-level nodes
doc <- xmlTreeParse(fileURL, useInternalNodes = TRUE)
rootNode <- xmlRoot(doc)
xmlName(rootNode)
names(rootNode)

# accessing parts with [[ ]] indexing, as with a list
rootNode[[1]][[1]]

# programmatically extract parts of the file
xmlSApply(rootNode, xmlValue)
```
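The same calls can be tried without a network connection by parsing XML from a string. The sketch below uses a tiny made-up `<menu>` document (an assumption for illustration) and adds `xpathSApply()` and `xmlGetAttr()` from the *XML* package to pull out element values and attributes.

```r
library(XML)

# A tiny made-up document, parsed from a string instead of a URL
xmlText <- '<menu>
  <food id="1"><name>Waffles</name><price>5.95</price></food>
  <food id="2"><name>Toast</name><price>4.50</price></food>
</menu>'
doc <- xmlTreeParse(xmlText, asText = TRUE, useInternalNodes = TRUE)
rootNode <- xmlRoot(doc)

xmlName(rootNode)            # name of the root element

# XPath: "//name" matches every <name> node at any depth
foodNames <- xpathSApply(rootNode, "//name", xmlValue)

# Element content is always text, so convert explicitly when numbers are needed
prices <- as.numeric(xpathSApply(rootNode, "//price", xmlValue))

# Attributes are read with xmlGetAttr; the extra argument is the attribute name
foodIds <- xpathSApply(rootNode, "//food", xmlGetAttr, "id")
```

XPath expressions tend to be more robust than positional indexing such as `rootNode[[1]][[1]]`, because they do not depend on the order of nodes in the file.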

### Reading JSON file

![image.png](attachment:image.png)

```
library(jsonlite)
jsonData <- fromJSON("URL")
names(jsonData)
names(jsonData$owner)
```

**Writing data frames to JSON**
```
myjson <- toJSON(dataframe, pretty=TRUE)
cat(myjson)
head(myjson)
```
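A short round trip shows how the two directions fit together. The data frame and the nested JSON string below are made up for illustration; only `toJSON()` and `fromJSON()` from *jsonlite* are used.

```r
library(jsonlite)

# Made-up data frame for illustration
df <- data.frame(id = 1:3, name = c("a", "b", "c"), stringsAsFactors = FALSE)

json <- toJSON(df, pretty = TRUE)   # data frame -> JSON array of objects
cat(json)

back <- fromJSON(json)              # JSON text -> data frame again
str(back)

# Nested objects come back as nested lists, reached with $
nested <- fromJSON('{"name": "repo", "owner": {"login": "someone", "id": 42}}')
names(nested)
nested$owner$login
```

This is why `names(jsonData$owner)` works in the block above: a nested JSON object becomes a nested structure that can be explored level by level.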

### data.table
- inherits from data.frame
    - all functions that accept data.frame work on data.table
- written in C so it is faster at subsetting, grouping, and updating

```
library(data.table)
DT = data.table(x=.., y=.., z=..)
head(DT, 3)
```
**To see all the data tables in memory**: ``tables()``
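
The subsetting, grouping, and updating mentioned above can be sketched as follows. The table contents and the column names `x`, `y`, `z`, `w` are made up for illustration.

```r
library(data.table)

# Made-up table for illustration
DT <- data.table(x = c(1, 2, 3, 4, 5, 6),
                 y = rep(c("a", "b", "c"), each = 2),
                 z = c(10, 20, 30, 40, 50, 60))

DT[2, ]                      # subset rows by position
DT[y == "a"]                 # subset rows by condition; columns are visible inside []
DT[, .(mean_z = mean(z))]    # the second argument computes on columns

# Grouped summary: mean of z and row count (.N) for each value of y
byGroup <- DT[, .(mean_z = mean(z), n = .N), by = y]

# := adds or updates a column by reference, without copying the table
DT[, w := z^2]

tables()                     # list the data.tables currently in memory
```

Because `:=` modifies the table in place, any other variable pointing at the same data.table sees the change too; `copy(DT)` makes an independent copy when that is not wanted.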


### Quiz

In [None]:
# Question 1
# fread url requires curl package on mac 
# install.packages("curl")

library(data.table)
housing <- data.table::fread("https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv")

# VAL attribute says how much property is worth, .N is the number of rows
# VAL == 24 means more than $1,000,000
housing[VAL == 24, .N]

# Answer: 
# 53

In [None]:
# question 3
fileUrl <- "http://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FDATA.gov_NGAP.xlsx"
download.file(fileUrl, destfile = paste0(getwd(), '/getdata%2Fdata%2FDATA.gov_NGAP.xlsx'), method = "curl")

dat <- xlsx::read.xlsx(file = "getdata%2Fdata%2FDATA.gov_NGAP.xlsx", sheetIndex = 1, rowIndex = 18:23, colIndex = 7:15)
sum(dat$Zip * dat$Ext, na.rm = TRUE)

# Answer:
# 36534720

In [None]:
# question 4
# install.packages("XML")
library("XML")
fileURL<-"https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml"
# sub("s", "", fileURL) removes the first "s", turning "https" into "http" before parsing
doc <- XML::xmlTreeParse(sub("s", "", fileURL), useInternalNodes = TRUE)
rootNode <- XML::xmlRoot(doc)

zipcodes <- XML::xpathSApply(rootNode, "//zipcode", XML::xmlValue)
xmlZipcodeDT <- data.table::data.table(zipcode = zipcodes)
xmlZipcodeDT[zipcode == "21231", .N]

# Answer: 
# 127

In [None]:
# question 5
DT <- data.table::fread("https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv")

# Answer (fastest):
system.time(DT[,mean(pwgtp15),by=SEX])

# Answer Question 5
# DT[,mean(pwgtp15),by=SEX]