# R BASICS <a name=top></a>
Many software packages and libraries are available to the data analyst. R not only makes its many packages easy to use, it also provides enough flexibility for the analyst who wants to get their hands dirty with the data. It's also widely used, and thus fairly portable: most analysts speak some level of R (or something that sounds and looks an awful lot like R).

In this notebook, you will find examples and tips that highlight R's data manipulation features. It is not meant to be a complete introduction, or even a showcase of good programming practices.

#### TUTORIAL OUTLINE

1. [Installing Packages and Libraries](#packages)
2. [Commonly Used Libraries](#libraries)
3. [Help & Documentation](#help)
4. The R Workspace: 
 - [Loading a Built-in Dataset](#workspace_load_internal)
 - [Loading an External Dataset](#workspace_load_external)
 - [Removing and Saving Workspace Elements](#workspace_removing)
5. Simple Data Manipulation
  - [Assigning Data](#data_manip_ass)
  - [Data Types and Conversion](#data_manip_conv)
6. [Writing Functions](#functions)
7. Exploring Data
  - [`swiss` Dataset](#exploring_swiss)
8. [A Word About NAs](#exploring_NAs)
9. [Data Wrangling](#wrangling)

---
[Back to top](#top)
## 1. PACKAGES AND LIBRARIES <a name=packages></a>

While it is possible to write command line functions in R (we'll have a few in subsequent modules), we will mostly use routines and functions which are available through various packages and libraries. 

With an Internet connection, it is fairly straightforward to install and/or update R packages.

In [1]:
installed.packages() # see what packages are currently installed
# update.packages() # to update the currently installed packages

# install.packages("specific_package_name") to install a specific package
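
A common pattern is to install a package only when it is missing. The sketch below is one way to do this; the helper name `missing_packages` is hypothetical (it is not part of base R), and the install call is left commented out so that nothing is downloaded by accident.

```r
# Hypothetical helper: return the names in `pkgs` that are not installed yet.
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

# Packages used in this notebook
wanted <- c("stats", "psych", "dplyr", "tidyr")
to_install <- missing_packages(wanted)
to_install   # character(0) if everything is already installed

# Uncomment to install whatever is missing (requires an Internet connection)
# if (length(to_install) > 0) install.packages(to_install)
```

Checking first avoids re-downloading packages every time the notebook is run.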



---
[Back to top](#top)
## 2. COMMONLY USED LIBRARIES <a name=libraries></a>

- Outlier Detection: outliers, evir
- Feature Selection: Features, RRF
- Data Transformation: plyr, data.table
- Data Visualization: ggplot2, googleVis, graphics, GGally
- Text Mining: tm, wordcloud
- Dimension Reduction: FactoMineR, CCP
- Imputation: missForest, missMDA
- Association Rules: arules, arulesViz
- Decision Trees: rpart, party, rattle, rpart.plot, randomForest, RGtk2, ctree
- Clustering: stats, cluster, apcluster
- ANNs: nnet, neuralnet
- SVMs: e1071, libsvm, kernlab 
- Summary Statistics: psych
- Analysis: stats
- Baseball: Lahman
- Other: stringr

---
[Back to top](#top)
## 3. HELP & DOCUMENTATION <a name=help></a>

R's various help files and demos can be accessed using the following commands (where function_name and search_term correspond to the desired function and/or term): 

- `?function_name`
- `example(function_name)`
- `args(function_name)`
- `??search_term`

In [2]:
?glm

0,1
glm {stats},R Documentation

0,1
formula,"an object of class ""formula"" (or one that can be coerced to that class): a symbolic description of the model to be fitted. The details of model specification are given under ‘Details’."
family,"a description of the error distribution and link function to be used in the model. For glm this can be a character string naming a family function, a family function or the result of a call to a family function. For glm.fit only the third option is supported. (See family for details of family functions.)"
data,"an optional data frame, list or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model. If not found in data, the variables are taken from environment(formula), typically the environment from which glm is called."
weights,an optional vector of ‘prior weights’ to be used in the fitting process. Should be NULL or a numeric vector.
subset,an optional vector specifying a subset of observations to be used in the fitting process.
na.action,"a function which indicates what should happen when the data contain NAs. The default is set by the na.action setting of options, and is na.fail if that is unset. The ‘factory-fresh’ default is na.omit. Another possible value is NULL, no action. Value na.exclude can be useful."
start,starting values for the parameters in the linear predictor.
etastart,starting values for the linear predictor.
mustart,starting values for the vector of means.
offset,"this can be used to specify an a priori known component to be included in the linear predictor during fitting. This should be NULL or a numeric vector of length equal to the number of cases. One or more offset terms can be included in the formula instead or as well, and if more than one is specified their sum is used. See model.offset."

0,1
coefficients,a named vector of coefficients
residuals,"the working residuals, that is the residuals in the final iteration of the IWLS fit. Since cases with zero weights are omitted, their working residuals are NA."
fitted.values,"the fitted mean values, obtained by transforming the linear predictors by the inverse of the link function."
rank,the numeric rank of the fitted linear model.
family,the family object used.
linear.predictors,the linear fit on link scale.
deviance,"up to a constant, minus twice the maximized log-likelihood. Where sensible, the constant is chosen so that a saturated model has deviance zero."
aic,"A version of Akaike's An Information Criterion, minus twice the maximized log-likelihood plus twice the number of parameters, computed by the aic component of the family. For binomial and Poisson families the dispersion is fixed at one and the number of parameters is the number of coefficients. For gaussian, Gamma and inverse gaussian families the dispersion is estimated from the residual deviance, and the number of parameters is the number of coefficients plus one. For a gaussian family the MLE of the dispersion is used so this is a valid value of AIC, but for Gamma and inverse gaussian families it is not. For families fitted by quasi-likelihood the value is NA."
null.deviance,"The deviance for the null model, comparable with deviance. The null model will include the offset, and an intercept if there is one in the model. Note that this will be incorrect if the link function depends on the data other than through the fitted mean: specify a zero offset to force a correct calculation."
iter,the number of iterations of IWLS used.


In [3]:
example(glm)


glm> ## Dobson (1990) Page 93: Randomized Controlled Trial :
glm> counts <- c(18,17,15,20,10,20,25,13,12)

glm> outcome <- gl(3,1,9)

glm> treatment <- gl(3,3)

glm> print(d.AD <- data.frame(treatment, outcome, counts))
  treatment outcome counts
1         1       1     18
2         1       2     17
3         1       3     15
4         2       1     20
5         2       2     10
6         2       3     20
7         3       1     25
8         3       2     13
9         3       3     12

glm> glm.D93 <- glm(counts ~ outcome + treatment, family = poisson())

glm> anova(glm.D93)
Analysis of Deviance Table

Model: poisson, link: log

Response: counts

Terms added sequentially (first to last)


          Df Deviance Resid. Df Resid. Dev
NULL                          8    10.5814
outcome    2   5.4523         6     5.1291
treatment  2   0.0000         4     5.1291

glm> ## No test: 
glm> ##D summary(glm.D93)
glm> ## End(No test)
glm> 
glm> ## No test: 
glm> ##D ## an example with offsets fro

In [4]:
counts <- c(18,17,15,20,10,20,25,13,12)
outcome <- gl(3,1,9)
treatment <- gl(3,3)
print(d.AD <- data.frame(treatment, outcome, counts))
glm.D93 <- glm(counts ~ outcome + treatment, family = poisson())
anova(glm.D93)
summary(glm.D93)

  treatment outcome counts
1         1       1     18
2         1       2     17
3         1       3     15
4         2       1     20
5         2       2     10
6         2       3     20
7         3       1     25
8         3       2     13
9         3       3     12


Unnamed: 0,Df,Deviance,Resid. Df,Resid. Dev
,,,8,10.581446
outcome,2.0,5.452305,6,5.129141
treatment,2.0,7.105427e-15,4,5.129141



Call:
glm(formula = counts ~ outcome + treatment, family = poisson())

Deviance Residuals: 
       1         2         3         4         5         6         7         8  
-0.67125   0.96272  -0.16965  -0.21999  -0.95552   1.04939   0.84715  -0.09167  
       9  
-0.96656  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  3.045e+00  1.709e-01  17.815   <2e-16 ***
outcome2    -4.543e-01  2.022e-01  -2.247   0.0246 *  
outcome3    -2.930e-01  1.927e-01  -1.520   0.1285    
treatment2   1.011e-15  2.000e-01   0.000   1.0000    
treatment3   7.105e-16  2.000e-01   0.000   1.0000    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 10.5814  on 8  degrees of freedom
Residual deviance:  5.1291  on 4  degrees of freedom
AIC: 56.761

Number of Fisher Scoring iterations: 4


In [5]:
args(glm)

In [6]:
??neural



---
[Back to top](#top)
## 4. THE R WORKSPACE
How do we arrange for data to be made available in the R workspace? 

We can either use built-in datasets, or we can load data from external sources. 

### 4.1 LOADING A BUILT-IN DATASET <a name=workspace_load_internal></a>

In [7]:
data() # lists datasets in the datasets package



In [8]:
data(package = .packages(all.available = TRUE)) # lists datasets in all available packages

“datasets have been moved from package 'base' to package 'datasets'”

“datasets have been moved from package 'stats' to package 'datasets'”



Let's look at three datasets:

- swiss
- volcano
- InsectSprays

In [9]:
swiss

Unnamed: 0,Fertility,Agriculture,Examination,Education,Catholic,Infant.Mortality
Courtelary,80.2,17.0,15,12,9.96,22.2
Delemont,83.1,45.1,6,9,84.84,22.2
Franches-Mnt,92.5,39.7,5,5,93.4,20.2
Moutier,85.8,36.5,12,7,33.77,20.3
Neuveville,76.9,43.5,17,15,5.16,20.6
Porrentruy,76.1,35.3,9,7,90.57,26.6
Broye,83.8,70.2,16,7,92.85,23.6
Glane,92.4,67.8,14,8,97.16,24.9
Gruyere,82.4,53.3,12,7,97.67,21.0
Sarine,82.9,45.2,16,13,91.38,24.4


In [10]:
head(swiss,10)
?swiss

Unnamed: 0,Fertility,Agriculture,Examination,Education,Catholic,Infant.Mortality
Courtelary,80.2,17.0,15,12,9.96,22.2
Delemont,83.1,45.1,6,9,84.84,22.2
Franches-Mnt,92.5,39.7,5,5,93.4,20.2
Moutier,85.8,36.5,12,7,33.77,20.3
Neuveville,76.9,43.5,17,15,5.16,20.6
Porrentruy,76.1,35.3,9,7,90.57,26.6
Broye,83.8,70.2,16,7,92.85,23.6
Glane,92.4,67.8,14,8,97.16,24.9
Gruyere,82.4,53.3,12,7,97.67,21.0
Sarine,82.9,45.2,16,13,91.38,24.4


0,1
swiss {datasets},R Documentation

0,1,2
"[,1]",Fertility,"Ig, ‘common standardized fertility measure’"
"[,2]",Agriculture,% of males involved in agriculture  as occupation
"[,3]",Examination,% draftees receiving highest mark  on army examination
"[,4]",Education,% education beyond primary school for draftees.
"[,5]",Catholic,% ‘catholic’ (as opposed to ‘protestant’).
"[,6]",Infant.Mortality,live births who live less than 1  year.


In [11]:
?volcano

0,1
volcano {datasets},R Documentation


In [12]:
volcano

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20
100,100,101,101,101,101,101,100,100,100,⋯,107,107,107,106,106,105,105,104,104,103
101,101,102,102,102,102,102,101,101,101,⋯,108,108,107,107,106,106,105,105,104,104
102,102,103,103,103,103,103,102,102,102,⋯,109,108,108,107,107,106,106,105,105,104
103,103,104,104,104,104,104,103,103,103,⋯,109,109,108,108,107,107,106,106,105,105
104,104,105,105,105,105,105,104,104,103,⋯,110,109,109,108,107,107,107,106,106,105
105,105,105,106,106,106,106,105,105,104,⋯,110,110,109,108,108,108,107,107,106,106
105,106,106,107,107,107,107,106,106,105,⋯,110,111,110,109,109,108,108,107,107,106
106,107,107,108,108,108,108,107,107,106,⋯,113,112,110,110,109,109,108,108,107,106
107,108,108,109,109,109,109,108,108,107,⋯,115,114,112,110,110,109,109,108,107,107
108,109,109,110,110,110,110,109,109,108,⋯,117,115,113,111,110,110,109,108,107,107


In [13]:
?InsectSprays

0,1
InsectSprays {datasets},R Documentation

0,1,2,3
"[,1]",count,numeric,Insect count
"[,2]",spray,factor,The type of spray


In [14]:
InsectSprays

count,spray
10,A
7,A
20,A
14,A
14,A
12,A
10,A
23,A
17,A
20,A


### 4.2 LOADING AN EXTERNAL DATASET <a name=workspace_load_external></a>

`Data <- read.csv("path_name/file_name", header=TRUE, sep=",")` #CSV file

`Data <- read.table("path_name/file_name", sep="\t", header=TRUE)` #tab separated

`Data <- read.table(file = "clipboard", sep="\t", header=TRUE)` #clipboard

`Data <- read.csv("http://dns/path_name/file")` #web

In [1]:
# Read in the file car.csv found in the folder 'Data' and save to: car.data
car.data <- read.csv("Data/car.csv", header=TRUE, sep=",")

head(car.data)   # inspect the first few rows (?car.data would search the help pages instead)

“cannot open file 'Data/car.csv': No such file or directory”

ERROR: Error in file(file, "rt"): cannot open the connection
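
The error above simply means that the file is not where R expects it to be. As a minimal sketch of a full round trip that does not depend on any external file, we can write the built-in `swiss` dataset to a temporary CSV file and read it back; the temporary path and the name `swiss_csv` are assumptions made for this illustration.

```r
# Write a built-in dataset to a temporary CSV file, then read it back.
tmp <- tempfile(fileext = ".csv")          # temporary path, only for this sketch
write.csv(swiss, tmp, row.names = TRUE)    # row names are written as the first column

swiss_csv <- read.csv(tmp, header = TRUE, sep = ",", row.names = 1)
dim(swiss_csv)                             # same dimensions as swiss
head(swiss_csv, 3)

# Check that a file exists before reading it, to avoid the "cannot open file" error
path <- "Data/car.csv"
if (file.exists(path)) {
  car.data <- read.csv(path, header = TRUE, sep = ",")
} else {
  message("File not found: ", path)
}
```

The `row.names = 1` argument tells `read.csv()` to use the first column of the file as row names rather than as a regular variable.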


### 4.3 REMOVING AND SAVING WORKSPACE ELEMENTS <a name=workspace_removing></a>

`rm(variable_x)`   #removing variable_x from the workspace

`save.image()`   #saving entire workspace

`save(variable_name, file="file_name.rda")`  #saving a specific object

`load("file_name.rda")`   #loading a previously saved object
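
To see these commands working together, here is a small sketch that saves an object, removes it, and loads it back. The file is a temporary one and the object name `my_vector` is made up for the example.

```r
# Save one object to a temporary .rda file, remove it, then load it back.
rda_file <- tempfile(fileext = ".rda")   # temporary path, only for this sketch
my_vector <- c(2, 3, 5, 7)

save(my_vector, file = rda_file)         # the object is stored under its own name
rm(my_vector)                            # my_vector is now gone from the workspace
exists("my_vector")                      # check whether the object is still around

loaded <- load(rda_file)                 # load() restores the object and returns its name
loaded
my_vector
```

Note that `load()` restores objects under the names they had when they were saved, which can overwrite existing objects with the same name.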

---
[Back to top](#top)
## 5. SIMPLE DATA MANIPULATION
So what can we actually do with R?

### 5.1 ASSIGNING DATA <a name=data_manip_ass></a>

In [16]:
x<-1:3   # assigning a sequence of numbers to the variable x (nothing is displayed)

In [17]:
(x <- 1:3)   # wrapping the assignment in parentheses also displays the result

In [18]:
x   # displaying the vector

In [19]:
y = 4:6   # another assignment

In [20]:
(z = 7:9)   # another way to display, with yet another assignment

In [21]:
(w <- c(12,-9))   # assignment of non-sequential numbers

In [22]:
(v = c(w,"pamplemousse"))   # assignment of mixed objects: the numbers are coerced to character strings

In [23]:
(u = t(matrix(1:10,ncol=5)))   # assignment of a matrix

0,1
1,2
3,4
5,6
7,8
9,10
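
Two of the assignments above deserve a closer look: mixing numbers and text in `c()`, and building a matrix with `matrix()` and `t()`. The short sketch below repeats them and inspects the results with `class()` and `dim()`.

```r
w <- c(12, -9)
v <- c(w, "pamplemousse")   # mixing numbers and text in one vector
class(w)                    # "numeric"
class(v)                    # "character": a vector holds a single type
v                           # "12" "-9" "pamplemousse"

u <- t(matrix(1:10, ncol = 5))   # 2 x 5 matrix filled column by column, then transposed
dim(u)                           # 5 2
u[1, ]                           # 1 2
```

Since a vector can only hold one type, R silently converts the numbers to strings; this is worth remembering when a numeric column is unexpectedly read in as text.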


### 5.2 DATA TYPES AND CONVERSION <a name=data_manip_conv></a>

In [24]:
# test if an object is of a certain type I
is.numeric(x)
is.character(x)
is.vector(x)
is.matrix(x)
is.data.frame(x)

In [25]:
# test if an object is of a certain type II
is.character(w)
is.character(v)
is.data.frame(swiss)

In [26]:
# convert an object to a specific type
as.numeric(x)
as.character(x)
as.vector(x)
as.matrix(x)
as.data.frame(x)

0
1
2
3


x
1
2
3


In [27]:
# combine vectors into single vector
c(y,z)

# convert vectors to matrix
cbind(x,y)
rbind(x,y)

# convert vectors to data.frame
data.frame(x,y)

x,y
1,4
2,5
3,6


0,1,2,3
x,1,2,3
y,4,5,6


x,y
1,4
2,5
3,6


In [28]:
# convert matrix to vector
as.vector(u)

# convert matrix to data frame
as.data.frame(u)

V1,V2
1,2
3,4
5,6
7,8
9,10


In [29]:
# convert data frame to matrix
(swiss_matrix=as.matrix(swiss))

Unnamed: 0,Fertility,Agriculture,Examination,Education,Catholic,Infant.Mortality
Courtelary,80.2,17.0,15,12,9.96,22.2
Delemont,83.1,45.1,6,9,84.84,22.2
Franches-Mnt,92.5,39.7,5,5,93.4,20.2
Moutier,85.8,36.5,12,7,33.77,20.3
Neuveville,76.9,43.5,17,15,5.16,20.6
Porrentruy,76.1,35.3,9,7,90.57,26.6
Broye,83.8,70.2,16,7,92.85,23.6
Glane,92.4,67.8,14,8,97.16,24.9
Gruyere,82.4,53.3,12,7,97.67,21.0
Sarine,82.9,45.2,16,13,91.38,24.4


---
[Back to top](#top)
## 6. WRITING FUNCTIONS <a name=functions></a>
What if we're interested in writing our own functions in R?

The template for all functions is a block of code that looks like: 

`my.function <- function(arg1,arg2, ..., argn) {`
     `# what my.function does, typically involving the arguments`
`}`

Here are some simple examples:

In [30]:
# Function my.product which computes the product of two arguments x and y
my.product <- function (x,y) {
    x*y
}

# call my.product for x=12 and y=-2
my.product(12,-2)
my.product(x=12,y=-2)
my.product(y=-2,x=12)
my.product(-2,12) ## ok, because the product is commutative

In [31]:
# Function my.quotient which computes the quotient x / y
my.quotient <- function (x,y) {
    x/y
}

# call my.quotient for x=12 and y=-2
my.quotient(12,-2)
my.quotient(x=12,y=-2)
my.quotient(y=-2,x=12)
my.quotient(-2,12) ## what's happening here?

# call my.quotient for x=12 and y=0
my.quotient(12,0)
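
The calls above show two things: unnamed arguments are matched by position (so `my.quotient(-2,12)` computes -2/12), and dividing by zero returns `Inf` instead of raising an error. A minimal sketch of a stricter variant follows; the name `safe.quotient`, the default value `y = 1`, and the error message are hypothetical choices made for illustration.

```r
safe.quotient <- function(x, y = 1) {
  # stop with an informative message instead of silently returning Inf or NaN
  if (any(y == 0)) stop("division by zero: y must be non-zero")
  x / y
}

safe.quotient(12, -2)           # -6
safe.quotient(y = -2, x = 12)   # -6, named arguments can be given in any order
safe.quotient(12)               # 12, since y falls back to its default value 1

# tryCatch() lets us capture the error message instead of halting the notebook
result <- tryCatch(safe.quotient(12, 0), error = function(e) conditionMessage(e))
result
```

Whether a function should fail loudly or return `Inf` depends on the analysis; the point is that the choice can be made explicit.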

---
[Back to top](#top)
## 7. EXPLORING DATA
Let's take a look at the swiss dataset in detail.

### 7.1 `swiss` DATASET  <a name=exploring_swiss></a>

In [32]:
# Display the first few entries of the dataset
head(swiss)   # default is 6 observations
head(swiss,10)   # setting a different number of observations, 10 in this case

Unnamed: 0,Fertility,Agriculture,Examination,Education,Catholic,Infant.Mortality
Courtelary,80.2,17.0,15,12,9.96,22.2
Delemont,83.1,45.1,6,9,84.84,22.2
Franches-Mnt,92.5,39.7,5,5,93.4,20.2
Moutier,85.8,36.5,12,7,33.77,20.3
Neuveville,76.9,43.5,17,15,5.16,20.6
Porrentruy,76.1,35.3,9,7,90.57,26.6


Unnamed: 0,Fertility,Agriculture,Examination,Education,Catholic,Infant.Mortality
Courtelary,80.2,17.0,15,12,9.96,22.2
Delemont,83.1,45.1,6,9,84.84,22.2
Franches-Mnt,92.5,39.7,5,5,93.4,20.2
Moutier,85.8,36.5,12,7,33.77,20.3
Neuveville,76.9,43.5,17,15,5.16,20.6
Porrentruy,76.1,35.3,9,7,90.57,26.6
Broye,83.8,70.2,16,7,92.85,23.6
Glane,92.4,67.8,14,8,97.16,24.9
Gruyere,82.4,53.3,12,7,97.67,21.0
Sarine,82.9,45.2,16,13,91.38,24.4


In [33]:
# Display the last few entries of the dataset
tail(swiss)
tail(swiss,10)
str(swiss)

Unnamed: 0,Fertility,Agriculture,Examination,Education,Catholic,Infant.Mortality
Neuchatel,64.4,17.6,35,32,16.92,23.0
Val de Ruz,77.6,37.6,15,7,4.97,20.0
ValdeTravers,67.6,18.7,25,7,8.65,19.5
V. De Geneve,35.0,1.2,37,53,42.34,18.0
Rive Droite,44.7,46.6,16,29,50.43,18.2
Rive Gauche,42.8,27.7,22,29,58.33,19.3


Unnamed: 0,Fertility,Agriculture,Examination,Education,Catholic,Infant.Mortality
Sion,79.3,63.1,13,13,96.83,18.1
Boudry,70.4,38.4,26,12,5.62,20.3
La Chauxdfnd,65.7,7.7,29,11,13.79,20.5
Le Locle,72.7,16.7,22,13,11.22,18.9
Neuchatel,64.4,17.6,35,32,16.92,23.0
Val de Ruz,77.6,37.6,15,7,4.97,20.0
ValdeTravers,67.6,18.7,25,7,8.65,19.5
V. De Geneve,35.0,1.2,37,53,42.34,18.0
Rive Droite,44.7,46.6,16,29,50.43,18.2
Rive Gauche,42.8,27.7,22,29,58.33,19.3


'data.frame':	47 obs. of  6 variables:
 $ Fertility       : num  80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
 $ Agriculture     : num  17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
 $ Examination     : int  15 6 5 12 17 9 16 14 12 16 ...
 $ Education       : int  12 9 5 7 15 7 7 8 7 13 ...
 $ Catholic        : num  9.96 84.84 93.4 33.77 5.16 ...
 $ Infant.Mortality: num  22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...


In [34]:
# Display a specific column of a data frame
swiss$Education   # extracting a specific column with the $ operator (returns a vector)
#swiss_matrix$Education   # this cannot be done to a matrix
swiss_matrix[,4]   # for a matrix, use the column index (or name) instead

In [35]:
# Displaying specific entries, rows, and columns using matrix notation
swiss[1,1] # 1st row, 1st column
swiss[1,] # 1st row
swiss[,2] # 2nd column
swiss[c(2,4),] # 2nd and 4th rows
swiss[,c(2,4)] # 2nd and 4th columns
swiss[,-2] # all columns except the 2nd
swiss[-3,] # all rows except the 3rd

Unnamed: 0,Fertility,Agriculture,Examination,Education,Catholic,Infant.Mortality
Courtelary,80.2,17,15,12,9.96,22.2


Unnamed: 0,Fertility,Agriculture,Examination,Education,Catholic,Infant.Mortality
Delemont,83.1,45.1,6,9,84.84,22.2
Moutier,85.8,36.5,12,7,33.77,20.3


Unnamed: 0,Agriculture,Education
Courtelary,17.0,12
Delemont,45.1,9
Franches-Mnt,39.7,5
Moutier,36.5,7
Neuveville,43.5,15
Porrentruy,35.3,7
Broye,70.2,7
Glane,67.8,8
Gruyere,53.3,7
Sarine,45.2,13


Unnamed: 0,Fertility,Examination,Education,Catholic,Infant.Mortality
Courtelary,80.2,15,12,9.96,22.2
Delemont,83.1,6,9,84.84,22.2
Franches-Mnt,92.5,5,5,93.4,20.2
Moutier,85.8,12,7,33.77,20.3
Neuveville,76.9,17,15,5.16,20.6
Porrentruy,76.1,9,7,90.57,26.6
Broye,83.8,16,7,92.85,23.6
Glane,92.4,14,8,97.16,24.9
Gruyere,82.4,12,7,97.67,21.0
Sarine,82.9,16,13,91.38,24.4


Unnamed: 0,Fertility,Agriculture,Examination,Education,Catholic,Infant.Mortality
Courtelary,80.2,17.0,15,12,9.96,22.2
Delemont,83.1,45.1,6,9,84.84,22.2
Moutier,85.8,36.5,12,7,33.77,20.3
Neuveville,76.9,43.5,17,15,5.16,20.6
Porrentruy,76.1,35.3,9,7,90.57,26.6
Broye,83.8,70.2,16,7,92.85,23.6
Glane,92.4,67.8,14,8,97.16,24.9
Gruyere,82.4,53.3,12,7,97.67,21.0
Sarine,82.9,45.2,16,13,91.38,24.4
Veveyse,87.1,64.5,14,6,98.61,24.5


In [36]:
# Summary statistics
colnames(swiss)   # column names
rownames(swiss)   # row names
str(swiss)   # structure of the data frame
summary(swiss)   # summary statistics of the data frame (5pt-summary + mean for numeric variables )

library(psych)
describe(swiss)   # matrix of data frame statistics: n, mean, sd, median, min, max, range, skew, kurtosis, se, + others
cor(swiss)   # correlation matrix of the data

'data.frame':	47 obs. of  6 variables:
 $ Fertility       : num  80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
 $ Agriculture     : num  17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
 $ Examination     : int  15 6 5 12 17 9 16 14 12 16 ...
 $ Education       : int  12 9 5 7 15 7 7 8 7 13 ...
 $ Catholic        : num  9.96 84.84 93.4 33.77 5.16 ...
 $ Infant.Mortality: num  22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...


   Fertility      Agriculture     Examination      Education    
 Min.   :35.00   Min.   : 1.20   Min.   : 3.00   Min.   : 1.00  
 1st Qu.:64.70   1st Qu.:35.90   1st Qu.:12.00   1st Qu.: 6.00  
 Median :70.40   Median :54.10   Median :16.00   Median : 8.00  
 Mean   :70.14   Mean   :50.66   Mean   :16.49   Mean   :10.98  
 3rd Qu.:78.45   3rd Qu.:67.65   3rd Qu.:22.00   3rd Qu.:12.00  
 Max.   :92.50   Max.   :89.70   Max.   :37.00   Max.   :53.00  
    Catholic       Infant.Mortality
 Min.   :  2.150   Min.   :10.80   
 1st Qu.:  5.195   1st Qu.:18.15   
 Median : 15.140   Median :20.00   
 Mean   : 41.144   Mean   :19.94   
 3rd Qu.: 93.125   3rd Qu.:21.70   
 Max.   :100.000   Max.   :26.60   

Unnamed: 0,vars,n,mean,sd,median,trimmed,mad,min,max,range,skew,kurtosis,se
Fertility,1,47,70.14255,12.491697,70.4,70.658974,10.22994,35.0,92.5,57.5,-0.4556871,0.2599542,1.8221013
Agriculture,2,47,50.65957,22.711218,54.1,51.15641,23.86986,1.2,89.7,88.5,-0.3203637,-0.8855271,3.3127716
Examination,3,47,16.48936,7.977883,16.0,16.076923,7.413,3.0,37.0,34.0,0.4463996,-0.1369364,1.1636939
Education,4,47,10.97872,9.615407,8.0,9.384615,5.9304,1.0,53.0,52.0,2.2684389,6.1397347,1.4025513
Catholic,5,47,41.14383,41.70485,15.14,39.116154,18.65111,2.15,100.0,97.85,0.4789257,-1.6654195,6.0832776
Infant.Mortality,6,47,19.94255,2.912697,20.0,19.984615,2.81694,10.8,26.6,15.8,-0.3314326,0.7772868,0.4248605


Unnamed: 0,Fertility,Agriculture,Examination,Education,Catholic,Infant.Mortality
Fertility,1.0,0.35307918,-0.6458827,-0.66378886,0.4636847,0.41655603
Agriculture,0.3530792,1.0,-0.6865422,-0.63952252,0.4010951,-0.06085861
Examination,-0.6458827,-0.68654221,1.0,0.6984153,-0.5727418,-0.1140216
Education,-0.6637889,-0.63952252,0.6984153,1.0,-0.1538589,-0.09932185
Catholic,0.4636847,0.40109505,-0.5727418,-0.15385892,1.0,0.17549591
Infant.Mortality,0.416556,-0.06085861,-0.1140216,-0.09932185,0.1754959,1.0


In [37]:
# Contrast: dataset with categorical variables
summary(InsectSprays)   # count for categorical variables
table(InsectSprays)   # joint empirical distribution
str(InsectSprays)
describe(InsectSprays)   # look at the statistics for the categorical variable
cor(InsectSprays)   # what happens if there are categorical variables?

     count       spray 
 Min.   : 0.00   A:12  
 1st Qu.: 3.00   B:12  
 Median : 7.00   C:12  
 Mean   : 9.50   D:12  
 3rd Qu.:14.25   E:12  
 Max.   :26.00   F:12  

     spray
count A B C D E F
   0  0 0 2 0 0 0
   1  0 0 4 0 2 0
   2  0 0 2 1 1 0
   3  0 0 2 2 4 0
   4  0 0 1 2 1 0
   5  0 0 0 5 2 0
   6  0 0 0 1 2 0
   7  1 1 1 0 0 0
   9  0 0 0 0 0 1
   10 2 0 0 0 0 1
   11 0 2 0 0 0 1
   12 1 0 0 1 0 0
   13 1 1 0 0 0 2
   14 3 1 0 0 0 0
   15 0 0 0 0 0 2
   16 0 1 0 0 0 1
   17 1 3 0 0 0 0
   19 0 1 0 0 0 0
   20 2 0 0 0 0 0
   21 0 2 0 0 0 0
   22 0 0 0 0 0 1
   23 1 0 0 0 0 0
   24 0 0 0 0 0 1
   26 0 0 0 0 0 2

'data.frame':	72 obs. of  2 variables:
 $ count: num  10 7 20 14 14 12 10 23 17 20 ...
 $ spray: Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...


Unnamed: 0,vars,n,mean,sd,median,trimmed,mad,min,max,range,skew,kurtosis,se
count,1,72,9.5,7.203286,7.0,8.896552,7.413,0,26,26,0.5590721,-0.8356673,0.8489154
spray*,2,72,3.5,1.71981,3.5,3.5,2.2239,1,6,5,0.0,-1.3163327,0.2026816


ERROR: Error in cor(InsectSprays): 'x' must be numeric
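
The error arises because `cor()` only accepts numeric input and `spray` is a factor. One possible workaround, sketched below, is to restrict the computation to the numeric columns and to summarise the numeric variable by factor level instead; the names `numeric_cols` and `mean_by_spray` are introduced here for illustration.

```r
# Identify the numeric columns of the data frame
numeric_cols <- sapply(InsectSprays, is.numeric)
numeric_cols

# With a single numeric column, the correlation matrix is 1 x 1
cor(InsectSprays[, numeric_cols, drop = FALSE])

# For a numeric variable against a factor, group summaries are often more informative
mean_by_spray <- tapply(InsectSprays$count, InsectSprays$spray, mean)
mean_by_spray
```

With only one numeric variable there is nothing to correlate, which is why the group means are the more useful summary for this dataset.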


In [38]:
# number of rows/observations 
nrow(swiss)

In [39]:
# summary of a single feature
summary(swiss$Fertility)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  35.00   64.70   70.40   70.14   78.45   92.50 

In [40]:
# finding all observations for which a feature takes on a value greater than a threshold
swiss$Fertility>50

In [41]:
#summary of a logical vector
summary(swiss$Fertility>50)

   Mode   FALSE    TRUE 
logical       3      44 

In [42]:
# historical cantons for which Fertility was > 50
swiss[swiss$Fertility>50,]

# number of such historical cantons
nrow(swiss[swiss$Fertility>50,])   # should be at most as large as the number of observations

Unnamed: 0,Fertility,Agriculture,Examination,Education,Catholic,Infant.Mortality
Courtelary,80.2,17.0,15,12,9.96,22.2
Delemont,83.1,45.1,6,9,84.84,22.2
Franches-Mnt,92.5,39.7,5,5,93.4,20.2
Moutier,85.8,36.5,12,7,33.77,20.3
Neuveville,76.9,43.5,17,15,5.16,20.6
Porrentruy,76.1,35.3,9,7,90.57,26.6
Broye,83.8,70.2,16,7,92.85,23.6
Glane,92.4,67.8,14,8,97.16,24.9
Gruyere,82.4,53.3,12,7,97.67,21.0
Sarine,82.9,45.2,16,13,91.38,24.4


In [43]:
# historical cantons data where Fertility is in the top 50%
swiss[swiss$Fertility>median(swiss$Fertility),]

# Fertility and Education variables for historical cantons where Fertility is in the top 50%
swiss[swiss$Fertility>median(swiss$Fertility),c(1,4)]   # matrix option
swiss[swiss$Fertility>median(swiss$Fertility),c("Fertility","Education")]   # data frame call

# historical canton(s) data where Fertility is maximal
swiss[swiss$Fertility == max(swiss$Fertility),]

Unnamed: 0,Fertility,Agriculture,Examination,Education,Catholic,Infant.Mortality
Courtelary,80.2,17.0,15,12,9.96,22.2
Delemont,83.1,45.1,6,9,84.84,22.2
Franches-Mnt,92.5,39.7,5,5,93.4,20.2
Moutier,85.8,36.5,12,7,33.77,20.3
Neuveville,76.9,43.5,17,15,5.16,20.6
Porrentruy,76.1,35.3,9,7,90.57,26.6
Broye,83.8,70.2,16,7,92.85,23.6
Glane,92.4,67.8,14,8,97.16,24.9
Gruyere,82.4,53.3,12,7,97.67,21.0
Sarine,82.9,45.2,16,13,91.38,24.4


Unnamed: 0,Fertility,Education
Courtelary,80.2,12
Delemont,83.1,9
Franches-Mnt,92.5,5
Moutier,85.8,7
Neuveville,76.9,15
Porrentruy,76.1,7
Broye,83.8,7
Glane,92.4,8
Gruyere,82.4,7
Sarine,82.9,13


Unnamed: 0,Fertility,Education
Courtelary,80.2,12
Delemont,83.1,9
Franches-Mnt,92.5,5
Moutier,85.8,7
Neuveville,76.9,15
Porrentruy,76.1,7
Broye,83.8,7
Glane,92.4,8
Gruyere,82.4,7
Sarine,82.9,13


Unnamed: 0,Fertility,Agriculture,Examination,Education,Catholic,Infant.Mortality
Franches-Mnt,92.5,39.7,5,5,93.4,20.2


In [44]:
swiss$var1 <- swiss[,1]>median(swiss[,1])   # find the historical cantons for which the first variable is in the top 50%
swiss$var4 <- swiss[,4]>median(swiss[,4])   # find the historical cantons for which the fourth variable is in the top 50%
table(swiss$var1)   # distribution of cantons about the median of the first variable
table(swiss$var4)   # distribution of cantons about the median of the fourth variable
table(swiss$var1,swiss$var4)   # what's going on here? rows = first variable, columns = fourth variable


FALSE  TRUE 
   24    23 


FALSE  TRUE 
   25    22 

       
        FALSE TRUE
  FALSE     8   16
  TRUE     17    6

---
[Back to top](#top)
## 8. A WORD ABOUT NAs <a name=exploring_NAs></a>
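NA values in R can create some havoc: many functions, such as `mean()`, return `NA` as soon as a single value is missing, unless told otherwise with `na.rm=TRUE`. Be careful!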

In [45]:
test = sample(c(1:4,NA),100, replace=TRUE)   # pick 100 values (with replacement) among the values {1,2,3,4,NA}
summary(test)   # 5pt summary + mean + number of NAs
mean(test)   # mean of test data without removal of the NAs
mean(test, na.rm=TRUE)   # mean of test data with removal of the NAs

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  1.000   1.000   2.000   2.506   4.000   4.000      21 

In [46]:
median(test, na.rm=TRUE)   # median of test data with removal of the NAs
min(test, na.rm=TRUE)   # minimum of test data with removal of the NAs
max(test, na.rm=TRUE)   # maximum of test data with removal of the NAs
quantile(test, na.rm=TRUE)   # quantiles of test data with removal of the NAs
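
Because `sample()` gives different values on each run, a small fixed vector makes the behaviour of NAs easier to follow. The sketch below (with made-up values) shows how to detect, count, and drop missing values.

```r
x <- c(4, NA, 1, 3, NA, 2)   # small deterministic example

is.na(x)                 # FALSE TRUE FALSE FALSE TRUE FALSE
sum(is.na(x))            # 2 missing values
mean(x)                  # NA
mean(x, na.rm = TRUE)    # 2.5

x_clean <- x[!is.na(x)]  # drop the missing values explicitly
x_clean                  # 4 1 3 2

# For data frames, complete.cases() flags the rows without any NA
df <- data.frame(a = c(1, NA, 3), b = c("u", "v", NA))
df[complete.cases(df), ]   # only the first row remains
```

Counting the NAs before removing them is a good habit: it shows how much data a `na.rm=TRUE` quietly ignores.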

---
[Back to top](#top)
## 9. DATA WRANGLING <a name=wrangling></a>
This section is based on _Data Wrangling with R: How to work with the structures of your data_ by G. Grolemund (slides available at: bit.ly/wrangling-webinar).

In [34]:
library(tidyr)
library(dplyr)
# library(EDAWR) 
# the datasets that we use are found in the package EDAWR, but that package is not available on cocalc 

In particular, we will work with the following datasets:

- `storms`
- `cases`
- `pollution`
- `tb`

In [35]:
str(storms)

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	10010 obs. of  13 variables:
 $ name       : chr  "Amy" "Amy" "Amy" "Amy" ...
 $ year       : num  1975 1975 1975 1975 1975 ...
 $ month      : num  6 6 6 6 6 6 6 6 6 6 ...
 $ day        : int  27 27 27 27 28 28 28 28 29 29 ...
 $ hour       : num  0 6 12 18 0 6 12 18 0 6 ...
 $ lat        : num  27.5 28.5 29.5 30.5 31.5 32.4 33.3 34 34.4 34 ...
 $ long       : num  -79 -79 -79 -79 -78.8 -78.7 -78 -77 -75.8 -74.8 ...
 $ status     : chr  "tropical depression" "tropical depression" "tropical depression" "tropical depression" ...
 $ category   : Ord.factor w/ 7 levels "-1"<"0"<"1"<"2"<..: 1 1 1 1 1 1 1 1 2 2 ...
 $ wind       : int  25 25 25 25 25 25 25 30 35 40 ...
 $ pressure   : int  1013 1013 1013 1013 1012 1012 1011 1006 1004 1002 ...
 $ ts_diameter: num  NA NA NA NA NA NA NA NA NA NA ...
 $ hu_diameter: num  NA NA NA NA NA NA NA NA NA NA ...


In [36]:
cases <- read.csv("Data/cases.csv")
cases

country,X2011,X2012,X2013
FR,7000,6900,7000
DE,5800,6000,6200
US,15000,14000,13000


In [54]:
pollution <- read.csv("Data/pollution.csv")
pollution

city,size,amount
New York,large,23
New York,small,14
London,large,22
London,small,16
Beijing,large,121
Beijing,small,56


In [55]:
tb <- read.csv("Data/tb.csv")
str(tb)
head(tb)

#' TB data
#'
#' A subset of data from the World Health Organization Global 
#' Tuberculosis Report.
#'
#' @format A dataset with the variables
#' \describe{
#' \item{country}{}
#' \item{year}{}
#' \item{sex}{}
#' \item{child}{Number of new cases reported among people 0 - 14 years of age.}
#' \item{adult}{Number of new cases reported among people 15 - 64 years of age.}
#' \item{elderly}{Number of new cases reported among people over 64 years of age.}
#' }
#' 
#' @source \url{http://www.who.int/tb/country/data/download/en/}
#'


'data.frame':	3800 obs. of  6 variables:
 $ country: Factor w/ 100 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ year   : int  1995 1995 1996 1996 1997 1997 1998 1998 1999 1999 ...
 $ sex    : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
 $ child  : int  NA NA NA NA 5 0 45 30 25 8 ...
 $ adult  : int  NA NA NA NA 96 26 1142 500 484 212 ...
 $ elderly: int  NA NA NA NA 1 0 20 41 8 8 ...


country,year,sex,child,adult,elderly
Afghanistan,1995,female,,,
Afghanistan,1995,male,,,
Afghanistan,1996,female,,,
Afghanistan,1996,male,,,
Afghanistan,1997,female,5.0,96.0,1.0
Afghanistan,1997,male,0.0,26.0,0.0


### 9.1 PIPELINE OPERATOR `%>%`

R is a functional language, which means nested parentheses, which can make code difficult to read. The **pipeline operator** `%>%` and the package `dplyr` can be used to remedy the situation.

Hadley Wickham provided an example in 2014 to illustrate how it works (don't run the next bit of code, it won't execute):



In [39]:
hourly_delay <- filter(
  summarise(
    group_by( 
      filter(
        flights, 
        !is.na(dep_delay)
      ), 
      date, hour
    ), 
    delay = mean(dep_delay), 
    n = n()
  ), 
  n > 10 
)

ERROR: Error in filter(flights, !is.na(dep_delay)): object 'flights' not found


Take some time to figure out what is supposed to be happening here. 

The pipeline operator eschews nesting function calls in favor of passing data from one function to the next (also won't run):


In [40]:
hourly_delay <- flights %>% 
   filter(!is.na(dep_delay)) %>% 
   group_by(date, hour) %>% 
   summarise(delay = mean(dep_delay),n = n()) %>% 
   filter(n > 10)

ERROR: Error in eval(lhs, parent, parent): object 'flights' not found


The beauty of this approach is that it can be 'read' aloud to discover what the block of code is meant to do. 

The `flights` data frame is 

1. filtered (to remove missing values of the `dep_delay` variable),
2. grouped by hours within days,
3. summarised (the mean delay and the number of flights are calculated within each group), and
4. filtered again (only the hours with more than 10 flights are kept).
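
The two cells above fail only because there is no `flights` object in the workspace. As a minimal sketch, we can build a small stand-in and run the pipeline on it. The data frame below is made up for illustration (it is not the data from the 2014 example); only the column names `date`, `hour` and `dep_delay` are taken from the code above.

```r
library(dplyr)

# Hypothetical stand-in for `flights`: 12 flights in one hour, 6 in another
flights <- data.frame(
  date      = rep(c("2014-01-01", "2014-01-02"), times = c(12, 6)),
  hour      = rep(c(8, 9), times = c(12, 6)),
  dep_delay = c(seq(1, 11), NA, 5, 10, 15, 20, 25, 30)
)

hourly_delay <- flights %>%
  filter(!is.na(dep_delay)) %>%                   # drop the flight with a missing delay
  group_by(date, hour) %>%                        # one group per hour within each day
  summarise(delay = mean(dep_delay), n = n()) %>% # mean delay and number of flights
  filter(n > 10)                                  # keep hours with more than 10 flights

hourly_delay
```

Only the first hour survives the last step: it has 11 flights with a recorded delay, whereas the second hour has 6.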

The **pipeline rules** are simple: the object on the left hand side is passed as the *first* argument to the function on the right hand side. 

- `data %>% function` is the same as `function(data)`
- `data %>% function(arg=value)` is the same as `function(data, arg=value)`
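
Both rules can be checked directly on small objects. The sketch below loads `magrittr` for the operator (loading `dplyr` would work as well); the vector `x` is an arbitrary example.

```r
library(magrittr)

x <- c(1, 4, 9)

# data %>% function  is the same as  function(data)
a <- x %>% sqrt
b <- sqrt(x)
identical(a, b)

# data %>% function(arg = value)  is the same as  function(data, arg = value)
c1 <- pi %>% round(digits = 2)
c2 <- round(pi, digits = 2)
identical(c1, c2)
```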

**References:** https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html

### 9.2 TIDY DATA

Tidy data has a specific structure:
- Each variable is a column
- Each observation is a row
- Each type of observational unit is a table

There are two functions in package `tidyr` to reshape tables to a tidy format: `gather()` and `spread()`. Recent versions of `tidyr` keep both for backward compatibility, but recommend `pivot_longer()` and `pivot_wider()` for new code. 

`gather()` requires a data frame to reshape, the name of a new **key** column (which will hold the former column names), the name of a new **value** column (which will hold the corresponding cell values), and the names or indices of the columns that need to be collapsed. For instance, in a tidy format, the `cases` dataset would look like:

In [41]:
gather(cases,"year","n",2:4)

country,year,n
FR,X2011,7000
DE,X2011,5800
US,X2011,15000
FR,X2012,6900
DE,X2012,6000
US,X2012,14000
FR,X2013,7000
DE,X2013,6200
US,X2013,13000
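
For readers who do not have the `cases` object at hand, here is a self-contained sketch. The data frame is reconstructed from the output above, so its wide layout (one `X2011` to `X2013` column per year) is an assumption. The second call uses `pivot_longer()`, the newer `tidyr` function that does the same job.

```r
library(tidyr)

# Wide version of `cases`, reconstructed from the output above (assumed layout)
cases <- data.frame(
  country = c("FR", "DE", "US"),
  X2011   = c(7000, 5800, 15000),
  X2012   = c(6900, 6000, 14000),
  X2013   = c(7000, 6200, 13000)
)

# Collapse columns 2 to 4 into a key column `year` and a value column `n`
long_old <- gather(cases, "year", "n", 2:4)

# Same reshaping with the newer interface
long_new <- pivot_longer(cases, cols = X2011:X2013,
                         names_to = "year", values_to = "n")
```

Both results contain the same nine observations; only the row order differs (`gather()` goes column by column, `pivot_longer()` row by row).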


`spread()`, on the other hand, generates multiple columns from two columns; it requires a data frame to reshape, a key column (whose values become the names of the new columns), and a value column (whose values fill the new columns). For instance, in a tidy format, the `pollution` dataset would look like:

In [42]:
spread(pollution,size,amount)

city,large,small
Beijing,121,56
London,22,16
New York,23,14
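
Again, a self-contained sketch for readers without the `pollution` object. The long data frame is reconstructed from the output above; its row order is an assumption. `pivot_wider()` is the newer `tidyr` counterpart of `spread()`.

```r
library(tidyr)

# Long version of `pollution`, reconstructed from the output above (assumed row order)
pollution <- data.frame(
  city   = rep(c("New York", "London", "Beijing"), each = 2),
  size   = rep(c("large", "small"), times = 3),
  amount = c(23, 14, 22, 16, 121, 56),
  stringsAsFactors = FALSE
)

# One new column per value of `size`, filled with the matching `amount`
wide_old <- spread(pollution, size, amount)

# Same reshaping with the newer interface
wide_new <- pivot_wider(pollution, names_from = size, values_from = amount)

# Going back to the long format recovers the six original observations
long_again <- gather(wide_old, "size", "amount", large:small)
```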


`gather()` and `spread()` are inverses of one another. Other useful wrangling functions include `separate()` and `unite()`. What do you think these do? 
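
One way to check your guess is to try both functions on a tiny table. The data frame `visits` below is hypothetical (made-up values, for illustration only).

```r
library(tidyr)

# Hypothetical toy data frame
visits <- data.frame(
  date = c("2014-01-01", "2014-02-15"),
  n    = c(3, 5),
  stringsAsFactors = FALSE
)

# separate() splits one column into several, here at each "-"
parts <- separate(visits, date, into = c("year", "month", "day"), sep = "-")
parts

# unite() pastes several columns back into one
whole <- unite(parts, "date", year, month, day, sep = "-")
whole
```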

### 9.3 THE `dplyr` PACKAGE

The `dplyr` package is useful when it comes to transforming tabular data, and is compatible with the pipeline operator `%>%`. Its most useful functions are:
- `select()`: to extract a subset of variables from the data frame
- `filter()`: to extract a subset of observations from the data frame
- `arrange()`: to sort the data frame
- `mutate()`: to create new variables from existing variables
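- `summarise()`: to collapse many values into summary statistics (so-called **pivot tables**)
- `group_by()`: to split the data frame into groups, so that subsequent operations are carried out group by group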

We will showcase these functions with the help of various examples.

In [43]:
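storms.2 <- select(storms, name, pressure)  # keep only name and pressure
storms.3 <- select(storms, -name)           # drop name, keep everything else
storms.4 <- select(storms, lat:pressure)    # keep all columns from lat to pressure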
head(storms)
head(storms.2)
head(storms.3)
head(storms.4)

name,year,month,day,hour,lat,long,status,category,wind,pressure,ts_diameter,hu_diameter
Amy,1975,6,27,0,27.5,-79.0,tropical depression,-1,25,1013,,
Amy,1975,6,27,6,28.5,-79.0,tropical depression,-1,25,1013,,
Amy,1975,6,27,12,29.5,-79.0,tropical depression,-1,25,1013,,
Amy,1975,6,27,18,30.5,-79.0,tropical depression,-1,25,1013,,
Amy,1975,6,28,0,31.5,-78.8,tropical depression,-1,25,1012,,
Amy,1975,6,28,6,32.4,-78.7,tropical depression,-1,25,1012,,


name,pressure
Amy,1013
Amy,1013
Amy,1013
Amy,1013
Amy,1012
Amy,1012


year,month,day,hour,lat,long,status,category,wind,pressure,ts_diameter,hu_diameter
1975,6,27,0,27.5,-79.0,tropical depression,-1,25,1013,,
1975,6,27,6,28.5,-79.0,tropical depression,-1,25,1013,,
1975,6,27,12,29.5,-79.0,tropical depression,-1,25,1013,,
1975,6,27,18,30.5,-79.0,tropical depression,-1,25,1013,,
1975,6,28,0,31.5,-78.8,tropical depression,-1,25,1012,,
1975,6,28,6,32.4,-78.7,tropical depression,-1,25,1012,,


lat,long,status,category,wind,pressure
27.5,-79.0,tropical depression,-1,25,1013
28.5,-79.0,tropical depression,-1,25,1013
29.5,-79.0,tropical depression,-1,25,1013
30.5,-79.0,tropical depression,-1,25,1013
31.5,-78.8,tropical depression,-1,25,1012
32.4,-78.7,tropical depression,-1,25,1012


In [44]:
storms.5 <- filter(storms, wind > 40)  # observations with wind above 40
# conditions separated by commas must all hold (logical AND)
storms.6 <- filter(storms, wind > 40, name %in% c("Amy", "Alberto", "Alexis", "Allison"))
nrow(storms)
nrow(storms.5)
nrow(storms.6)

In [45]:
storms.7 <- mutate(storms, quotient = wind / pressure)
# a variable created in mutate() can be reused later in the same call
storms.8 <- mutate(storms, quotient = wind / pressure, inv = 1 / quotient)
head(storms.7)
head(storms.8)

name,year,month,day,hour,lat,long,status,category,wind,pressure,ts_diameter,hu_diameter,quotient
Amy,1975,6,27,0,27.5,-79.0,tropical depression,-1,25,1013,,,0.02467917
Amy,1975,6,27,6,28.5,-79.0,tropical depression,-1,25,1013,,,0.02467917
Amy,1975,6,27,12,29.5,-79.0,tropical depression,-1,25,1013,,,0.02467917
Amy,1975,6,27,18,30.5,-79.0,tropical depression,-1,25,1013,,,0.02467917
Amy,1975,6,28,0,31.5,-78.8,tropical depression,-1,25,1012,,,0.02470356
Amy,1975,6,28,6,32.4,-78.7,tropical depression,-1,25,1012,,,0.02470356


name,year,month,day,hour,lat,long,status,category,wind,pressure,ts_diameter,hu_diameter,quotient,inv
Amy,1975,6,27,0,27.5,-79.0,tropical depression,-1,25,1013,,,0.02467917,40.52
Amy,1975,6,27,6,28.5,-79.0,tropical depression,-1,25,1013,,,0.02467917,40.52
Amy,1975,6,27,12,29.5,-79.0,tropical depression,-1,25,1013,,,0.02467917,40.52
Amy,1975,6,27,18,30.5,-79.0,tropical depression,-1,25,1013,,,0.02467917,40.52
Amy,1975,6,28,0,31.5,-78.8,tropical depression,-1,25,1012,,,0.02470356,40.48
Amy,1975,6,28,6,32.4,-78.7,tropical depression,-1,25,1012,,,0.02470356,40.48


In [46]:
pollution %>% summarise(median=median(amount), variance=var(amount))
pollution %>% summarise(mean=mean(amount), sum=sum(amount), n=n()) # n() doesn't take arguments

median,variance
22.5,1731.6


mean,sum,n
42,252,6


In [47]:
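storms.9 <- arrange(storms, wind)         # ascending wind
storms.10 <- arrange(storms, desc(wind))  # descending wind
# ties in wind are broken by descending year, then by month and day
storms.11 <- arrange(storms, wind, desc(year), month, day)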
head(storms.9)
head(storms.10)
head(storms.11)

name,year,month,day,hour,lat,long,status,category,wind,pressure,ts_diameter,hu_diameter
Bonnie,1986,6,28,6,36.5,-91.3,tropical depression,-1,10,1013,,
Bonnie,1986,6,28,12,37.2,-90.0,tropical depression,-1,10,1012,,
AL031987,1987,8,16,18,30.9,-83.2,tropical depression,-1,10,1014,,
AL031987,1987,8,17,0,31.4,-82.9,tropical depression,-1,10,1015,,
AL031987,1987,8,17,6,31.8,-82.3,tropical depression,-1,10,1015,,
Alberto,1994,7,7,0,32.7,-86.3,tropical depression,-1,10,1012,,


name,year,month,day,hour,lat,long,status,category,wind,pressure,ts_diameter,hu_diameter
Gilbert,1988,9,14,0,19.7,-83.8,hurricane,5,160,888,,
Wilma,2005,10,19,12,17.3,-82.8,hurricane,5,160,882,304.9567,74.8007
Gilbert,1988,9,14,6,19.9,-85.3,hurricane,5,155,889,,
Mitch,1998,10,26,18,16.9,-83.1,hurricane,5,155,905,,
Mitch,1998,10,27,0,17.2,-83.8,hurricane,5,155,910,,
Rita,2005,9,22,3,24.7,-87.3,hurricane,5,155,895,,


name,year,month,day,hour,lat,long,status,category,wind,pressure,ts_diameter,hu_diameter
Alberto,1994,7,7,0,32.7,-86.3,tropical depression,-1,10,1012,,
Alberto,1994,7,7,6,32.7,-86.6,tropical depression,-1,10,1012,,
Alberto,1994,7,7,12,32.8,-86.8,tropical depression,-1,10,1012,,
Alberto,1994,7,7,18,33.0,-87.0,tropical depression,-1,10,1013,,
AL031987,1987,8,16,18,30.9,-83.2,tropical depression,-1,10,1014,,
AL031987,1987,8,17,0,31.4,-82.9,tropical depression,-1,10,1015,,


In [48]:
storms %>% select(name, pressure)

name,pressure
Amy,1013
Amy,1013
Amy,1013
Amy,1013
Amy,1012
Amy,1012
Amy,1011
Amy,1006
Amy,1004
Amy,1002


In [49]:
storms %>% filter(wind>40)

name,year,month,day,hour,lat,long,status,category,wind,pressure,ts_diameter,hu_diameter
Amy,1975,6,29,12,33.8,-73.8,tropical storm,0,45,1000,,
Amy,1975,6,29,18,33.8,-72.8,tropical storm,0,50,998,,
Amy,1975,6,30,0,34.3,-71.6,tropical storm,0,50,998,,
Amy,1975,6,30,6,35.6,-70.8,tropical storm,0,55,998,,
Amy,1975,6,30,12,35.9,-70.5,tropical storm,0,60,987,,
Amy,1975,6,30,18,36.2,-70.2,tropical storm,0,60,987,,
Amy,1975,7,1,0,36.2,-69.8,tropical storm,0,60,984,,
Amy,1975,7,1,6,36.2,-69.4,tropical storm,0,60,984,,
Amy,1975,7,1,12,36.2,-68.3,tropical storm,0,60,984,,
Amy,1975,7,1,18,36.7,-67.2,tropical storm,0,60,984,,


In [50]:
storms %>% filter(wind>40) %>% select(name,pressure)

name,pressure
Amy,1000
Amy,998
Amy,998
Amy,998
Amy,987
Amy,987
Amy,984
Amy,984
Amy,984
Amy,984


In [51]:
storms %>% mutate(quotient = wind / pressure) %>% select(name, quotient) 

name,quotient
Amy,0.02467917
Amy,0.02467917
Amy,0.02467917
Amy,0.02467917
Amy,0.02470356
Amy,0.02470356
Amy,0.02472799
Amy,0.02982107
Amy,0.03486056
Amy,0.03992016


In [52]:
storms %>%
  group_by(name) %>%
  summarise(
    year = max(year),
    mean_wind = mean(wind, na.rm = TRUE),
    mean_pressure = mean(pressure, na.rm = TRUE),
    mean_ts_diameter = mean(ts_diameter, na.rm = TRUE),
    mean_hu_diameter = mean(hu_diameter, na.rm = TRUE)
  ) %>%
  arrange(year, name)

name,year,mean_wind,mean_pressure,mean_ts_diameter,mean_hu_diameter
Amy,1975,46.50000,995.1333,,
Caroline,1975,38.93939,1002.1212,,
Doris,1975,73.69565,983.2174,,
Belle,1976,68.05556,981.5556,,
Anita,1977,72.00000,981.5000,,
Clara,1977,40.00000,1004.7083,,
Evelyn,1977,51.11111,1001.2222,,
Amelia,1978,34.16667,1007.6667,,
Bess,1978,33.46154,1008.5385,,
Cora,1978,47.63158,1002.4737,,


In [62]:
tb %>% group_by(country, year) %>% summarise(cases = sum(child,adult,elderly, na.rm=TRUE))

country,year,cases
Afghanistan,1995,0
Afghanistan,1996,0
Afghanistan,1997,128
Afghanistan,1998,1778
Afghanistan,1999,745
Afghanistan,2000,2666
Afghanistan,2001,4639
Afghanistan,2002,6509
Afghanistan,2003,6528
Afghanistan,2004,8245


In [63]:
tb %>% group_by(country, year) %>% summarise(cases = sum(child,adult,elderly, na.rm = TRUE)) %>% summarise(cases = sum(cases))

country,cases
Afghanistan,140225
Algeria,128119
Angola,308365
Argentina,117156
Azerbaijan,29965
Bangladesh,1524034
Belarus,37185
Benin,48821
Bolivia (Plurinational State of),122555
Botswana,71470


In [64]:
tb %>% group_by(country, year) %>% summarise(cases = sum(child,adult,elderly, na.rm = TRUE)) %>% summarise(cases = sum(cases)) %>% summarise(cases = sum(cases))

cases
42718969
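
The last three cells work because each call to `summarise()` removes the innermost grouping level: after grouping by `country` and `year`, the first call leaves the result grouped by `country`, the second leaves it ungrouped, and the third collapses everything into a single row. The sketch below makes this visible on a hypothetical miniature version of `tb` (made-up numbers, for illustration only), using `group_vars()` to list the grouping variables that remain.

```r
library(dplyr)

# Hypothetical miniature version of `tb`
tb_toy <- data.frame(
  country = c("A", "A", "A", "A", "B", "B"),
  year    = c(2000, 2000, 2001, 2001, 2000, 2000),
  sex     = c("female", "male", "female", "male", "female", "male"),
  child   = c(1, 2, NA, 4, 5, 6),
  adult   = c(10, 20, 30, NA, 50, 60),
  elderly = c(0, 1, 2, 3, NA, 5)
)

by_country_year <- tb_toy %>%
  group_by(country, year) %>%
  summarise(cases = sum(child, adult, elderly, na.rm = TRUE))
group_vars(by_country_year)  # only "country" is left

by_country <- by_country_year %>% summarise(cases = sum(cases))
group_vars(by_country)       # no grouping left

total <- by_country %>% summarise(cases = sum(cases))
total
```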


---
`dplyr` also comes with "database" functionality (`bind_cols()`, `bind_rows()`, `union()`, `intersect()`, `setdiff()`, `left_join()`, `inner_join()`, `semi_join()`, `anti_join()`, etc.). Consult Grolemund's slides or the `dplyr` documentation for more details (a cheatsheet is available at https://www.rstudio.com/resources/cheatsheets/).
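
As a small taste of that functionality, here is a minimal sketch of a few of these functions. The two tables are hypothetical (made-up keys and values, for illustration only).

```r
library(dplyr)

# Hypothetical toy tables sharing the key column `id`
left_tbl  <- data.frame(id = c("a", "b", "c"), x = c(1, 2, 3))
right_tbl <- data.frame(id = c("a", "b", "d"), y = c(10, 20, 40))

left  <- left_join(left_tbl, right_tbl, by = "id")   # all rows of left_tbl; NA where there is no match
inner <- inner_join(left_tbl, right_tbl, by = "id")  # only the ids found in both tables
anti  <- anti_join(left_tbl, right_tbl, by = "id")   # rows of left_tbl without a match
both  <- bind_rows(left_tbl, right_tbl)              # stack the rows; missing columns are filled with NA
```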