# Lecture 7: Functional-style programming and Hypothesis testing

#### FROM THE PREVIOUS CLASS:


In [33]:
square <- function(x) {
    x^2
}
deviation <- function(x, y=0) {
    -y + y + x - mean(x)
}
x <- runif(100) # runif(n) generates n random deviates from the uniform distribution on [0, 1]
y <- runif(100)

In [34]:
library(tidyverse)
x %>%
  deviation(y) %>%
  square() %>%
  mean() %>%
  sqrt()
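
The pipeline above is equivalent to nesting the calls inside-out. A minimal sketch, redefining `square` and `deviation` so it runs on its own (here `magrittr` is loaded directly for `%>%`, which `tidyverse` also provides; the seed is arbitrary):

```r
library(magrittr)  # provides %>%; also available after library(tidyverse)

square <- function(x) x^2
deviation <- function(x, y = 0) -y + y + x - mean(x)

set.seed(1)  # arbitrary seed so the comparison is reproducible
x <- runif(100)
y <- runif(100)

# Piped version: read left to right
piped <- x %>% deviation(y) %>% square() %>% mean() %>% sqrt()

# Nested version: read inside-out
nested <- sqrt(mean(square(deviation(x, y))))

all.equal(piped, nested)
```

Both forms compute the root mean squared deviation of `x`; the pipe only changes how the code reads.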

## Reading in functions from an R script

Usually, the step before packaging your code is keeping some functions in a separate script that you want to read into your analysis. We use the `source` function to do this:

In [2]:
source("convertemp.R")

Once you do this, you have access to all functions contained within that script:

In [3]:
celsius_to_kelvin(0)
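
The contents of `convertemp.R` are not shown here. As a hypothetical sketch, such a script could hold a single function definition like the one below (the function body is an assumption, based on the standard Celsius-to-Kelvin offset of 273.15):

```r
# Hypothetical contents of a script such as convertemp.R
celsius_to_kelvin <- function(temp) {
    # Kelvin = Celsius + 273.15
    temp + 273.15
}

celsius_to_kelvin(0)
```

After `source()` runs a file like this, the function is defined in your workspace just as if you had typed it in.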

## Introduction to R packages

- `source("script_with_functions.R")` is useful, but once you start using these functions in different projects you have to keep copying the script or rely on overly specific paths...
- The next step is packaging your R code so that it can be installed and then used across multiple projects on your (and others') machines. You no longer point directly to where the code is stored; instead you access it with the `library` function.
- Let's tour a simple R package to get a better understanding of what they are: https://github.com/ttimbers/convertemp

### Install the convertemp R package:

In RStudio, type: `devtools::install_github("ttimbers/convertemp")`

In [1]:
library(convertemp)

In [2]:
?celsius_to_kelvin

celsius_to_kelvin          package:convertemp          R Documentation

_C_o_n_v_e_r_t _C_e_l_s_i_u_s _t_o _K_e_l_v_i_n

_D_e_s_c_r_i_p_t_i_o_n:

     Convert a temperature from Celsius to Kelvin

_U_s_a_g_e:

     celsius_to_kelvin(temp)
     
_A_r_g_u_m_e_n_t_s:

    temp: numeric

_V_a_l_u_e:

     numeric

_E_x_a_m_p_l_e_s:

     celsius_to_kelvin(0)
     

In [3]:
celsius_to_kelvin(0)

### Packages and environments

- Each package attached by library() becomes one of the parents of the global environment
- The immediate parent of the global environment is the last package you attached, the parent of that package is the second to last package you attached, …
- This is known as the search path because all objects in these environments can be found from the top-level interactive workspace

<img src="https://d33wubrfki0l68.cloudfront.net/038b2da4f5db1d2a8acaf4ee1e7d08d04ab36ebc/ac22a/diagrams/environments/search-path.png" width=800>

*Source: [Advanced R](https://adv-r.hadley.nz/) by Hadley Wickham*

In [114]:
# You can see the names of these environments with 
base::search()

### Packages and environments

When you attach another package with library(), the parent environment of the global environment changes:

<img src="https://d33wubrfki0l68.cloudfront.net/7c87a5711e92f0269cead3e59fc1e1e45f3667e9/0290f/diagrams/environments/search-path-2.png" width=800>

*Source: [Advanced R](https://adv-r.hadley.nz/) by Hadley Wickham*
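
One way to watch the search path change is to compare `search()` before and after attaching a package. This sketch uses `tools`, a package that ships with R but is usually not attached by default (if it is already attached in your session, the two vectors will simply be identical):

```r
before <- search()
library(tools)
after <- search()

# Entries that appeared on the search path after attaching
setdiff(after, before)

# The first entry is the global environment; the second is its parent
after[1:2]
```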

## Functional style programming in R with `purrr` (continued)


![](https://ih1.redbubble.net/image.329884292.2339/sticker,375x360-bg,ffffff.u1.png)

https://purrr.tidyverse.org/

### Let's start at the beginning with the most general `purrr` function: `map`

```
map(.x, .f, ...)
```

Above reads as: `for` every element of `.x` apply `.f`.

It takes a vector and a function, calls the function once for each element of the vector, and returns the results in a list. 

and can be pictured as:

<img src="https://d33wubrfki0l68.cloudfront.net/12f6af8404d9723dff9cc665028a35f07759299d/d0d9a/diagrams/functionals/map-list.png" width=500>

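All map functions return an output the same length as the input. For the variants that return an atomic vector (such as `map_dbl()`), this implies that each call to `.f` must return a single value; if it does not, you'll get an error. Plain `map()` returns a list, so each element can hold a result of any length: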

In [122]:
pair <- function(x) c(x, x)
map(1:3, pair)

In [123]:
#The base equivalent to map() is lapply()
lapply(1:3, pair)
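
To see the difference between `map()` and its typed variants, here is a small sketch: `map()` accepts a function that returns two values per element, while `map_dbl()` refuses it. The `tryCatch` wrapper is only there so the script keeps running after the error.

```r
library(purrr)

pair <- function(x) c(x, x)

# map() returns a list, so length-2 results are fine
as_list <- map(1:3, pair)
lengths(as_list)

# map_dbl() needs exactly one value per element, so this fails
outcome <- tryCatch(
    map_dbl(1:3, pair),
    error = function(e) "map_dbl() raised an error"
)
outcome
```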

- Sometimes it is inconvenient to return a list when a simpler data structure would do.
- There are four more specific variants: map_lgl(), map_int(), map_dbl(), and map_chr():

In [52]:
# map_chr() always returns a character vector
map_chr(mtcars, typeof)

In [53]:
# map_lgl() always returns a logical vector
map_lgl(mtcars, is.double)

In [54]:
# map_dbl() always returns a double vector
map_dbl(mtcars, mean)

### Back to anonymous function calls within `purrr::map*`

Long form:

```
map_dbl(mtcars, function(x) median(x, na.rm  = TRUE))
```

Short form:
```
map_dbl(mtcars, ~ median(., na.rm  = TRUE))
```

In the shortcut we replace `function(VARIABLE)` with a `~` and replace the `VARIABLE` in the function call with a `.`

In [57]:
#Short form:
map_dbl(mtcars, ~ median(., na.rm  = TRUE))
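
As a quick check, the long and short forms can be compared directly. The sketch below writes the placeholder as `.x`, which `purrr` accepts as a synonym for `.` in one-argument formulas:

```r
library(purrr)

long_form  <- map_dbl(mtcars, function(x) median(x, na.rm = TRUE))
short_form <- map_dbl(mtcars, ~ median(.x, na.rm = TRUE))

# Same values and same column names either way
identical(long_form, short_form)
```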

### Mapping with > 1 data objects

What if the function you want to map takes in > 1 data objects?

- `map2*` and `pmap*` are your friends here!

### `purrr::map2*`

```
map2*(.x, .y, .f, ...)
```

Above reads as: `for` every element of `.x` and `.y` apply `.f` 
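
A minimal sketch of the idea, using two made-up vectors: `map2_dbl` walks through both in parallel and hands the matching elements to the function as `.x` and `.y`.

```r
library(purrr)

# Made-up example inputs
lengths_cm <- c(1, 2, 3)
widths_cm  <- c(10, 20, 30)

# First call gets (1, 10), second (2, 20), third (3, 30)
areas <- map2_dbl(lengths_cm, widths_cm, ~ .x * .y)
areas  # 10 40 90
```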

### `purrr::map2_df` example:


For example, say you want to calculate weighted means (using `weighted.mean`) for the columns of a data frame, where another data frame contains the weights.

Let's make some data:

In [71]:
#library(dplyr, quietly = TRUE)
data <- tibble(x1 = runif(10),
               x2 = runif(10),
               x3 = runif(10))
data[1, 1] <- NA
weights <- tibble(x1 = rpois(10, 5) + 1,
                  x2 = rpois(10, 5) + 1,
                  x3 = rpois(10, 5) + 1)

data
weights

x1,x2,x3
,0.6762564,0.81602506
0.54606039,0.41532448,0.46129659
0.08700938,0.08818477,0.23009321
0.01897722,0.80902725,0.57036964
0.35342173,0.27682371,0.15864191
0.16385471,0.78204922,0.05006447
0.18127121,0.74944675,0.53457803
0.06350928,0.4507504,0.14705411
0.79162966,0.03095526,0.29811671
0.88780919,0.89267576,0.43712769


x1,x2,x3
6,8,7
3,6,11
4,6,6
6,6,8
8,3,6
5,8,6
4,2,6
7,3,4
3,6,6
7,4,2


### `purrr::map2_df` example:

Let's use `map2_df` to calculate the weighted mean using these two data frames.

In [72]:
?weighted.mean

weighted.mean              package:stats               R Documentation

_W_e_i_g_h_t_e_d _A_r_i_t_h_m_e_t_i_c _M_e_a_n

_D_e_s_c_r_i_p_t_i_o_n:

     Compute a weighted mean.

_U_s_a_g_e:

     weighted.mean(x, w, ...)
     
     ## Default S3 method:
     weighted.mean(x, w, ..., na.rm = FALSE)
     
_A_r_g_u_m_e_n_t_s:

       x: an object containing the values whose weighted mean is to be
          computed.

       w: a numerical vector of weights the same length as ‘x’ giving
          the weights to use for elements of ‘x’.

     ...: arguments to be passed to or from methods.

   na.rm: a logical value indicating whether ‘NA’ values in ‘x’ should
          be stripped before the computation proceeds.

_D_e_t_a_i_l_s:

     This is a generic function and methods can be defined for the
     first argument ‘x’: apart from the default methods there are
     methods for the date-time classes ‘"POSIXct"’, ‘"POSIXlt"’,
     ‘"diffti

In [73]:
map2_df(data, weights, weighted.mean)

x1,x2,x3
,0.5188407,0.394207


Ah! That NA got us again! We need to write this as an anonymous function so that we can pass in `na.rm = TRUE`

### `purrr::map2_df` example:

Now using an anonymous function with the long form:

In [74]:
map2_df(data, weights, function(x, y) weighted.mean(x, y, na.rm = TRUE))

x1,x2,x3
0.3299135,0.5188407,0.394207


In [84]:
#Now with the short form:
map2_df(data, weights, ~ weighted.mean(.x, .y, na.rm = TRUE))

x1,x2,x3
0.3299135,0.5188407,0.394207


### `purrr::map2*`

Also, if `.y` has fewer elements than `.x`, `map2` recycles `.y`:

<img src="https://d33wubrfki0l68.cloudfront.net/55032525ec77409e381dcd200a47e1787e65b964/dcaef/diagrams/functionals/map2-recycle.png" width=400>

In `purrr`, only vectors of length one are recycled, so this is most useful when `.y` has a single element.
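
A short sketch of recycling in practice: a single value for the second input is reused for every element of the first, whereas two inputs of lengths 3 and 2 are rejected. The `tryCatch` is only there to capture the error.

```r
library(purrr)

# The single value 10 is reused for each element of 1:3
recycled <- map2_dbl(1:3, 10, ~ .x + .y)
recycled  # 11 12 13

# Lengths 3 and 2 cannot be matched up, so purrr raises an error
mismatch <- tryCatch(
    map2_dbl(1:3, 1:2, ~ .x + .y),
    error = function(e) "lengths 3 and 2 cannot be recycled"
)
mismatch
```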

### `purrr::pmap*`

```
pmap*(list(.x1, .x2, ... .xn), .f, ...)
```

Above reads as: `for` every element in the **list** (that contains `.x1, .x2, ... .xn`) apply `.f` 

### Example of using `pmap_df` to calculate the weighted means:

In [86]:
pmap_df(list(data, weights), ~ weighted.mean(.x, .y, na.rm = TRUE))

x1,x2,x3
0.6172998,0.6851068,0.5054811


But what happens when you have > 2 arguments?

### More than two arguments

Without an anonymous function, it works like so:

In [90]:
f1 <- function(x, y, z) {
    x + y + z
}

pmap_dbl(list(c(1, 1), c(1, 2), c(2, 2)), f1)

If you want to use an anonymous function, then use `..1`, `..2`, `..3`, and so on to specify where the mapped objects go in your function:

In [89]:
f2 <- function(x, y, z, a = 0) {
    x + y + z + a
}

pmap_dbl(list(c(1, 1), c(1, 2), c(2, 2)), ~ f2(..1, ..2, ..3, a = -1))

We only used three inputs to our function here, but we can use any number with `pmap`; we just need to add them to our list!
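
If the list is named, `pmap` matches the list names to the function's argument names, which can be easier to read than positional placeholders. A small sketch with a made-up function `f3`, whose result makes the matching visible because the order of the arguments matters:

```r
library(purrr)

# Hypothetical helper: builds a three-digit number from x, y, z
f3 <- function(x, y, z) x * 100 + y * 10 + z

# List entries are deliberately out of order; names decide where they go
by_name <- pmap_dbl(list(z = c(3, 6), x = c(1, 4), y = c(2, 5)), f3)
by_name  # 123 456
```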

### Want to iterate row-wise, instead of column-wise?

Here you can use `purrr::pmap` on a single data frame!

This: ```purrr::pmap(df, .f)```

reads as: `for` every tuple in `.l` (*i.e.*, each row of `df`) apply `.f`

The key point is that `pmap()` iterates over tuples = the collection of `i`-th elements of `k` lists. A data frame row is an interesting special case.

### Here's an example of row-wise iteration 

Here we calculate the sum for each row in the `mtcars` data frame:

In [91]:
pmap(mtcars, sum)
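
The `mtcars` output is long, so here is the same idea on a tiny made-up data frame. Each row is passed to the function as named arguments (`a` and `b` here), and `pmap_dbl` collects the results into a numeric vector instead of a list:

```r
library(purrr)

# Made-up data frame with three rows
df <- data.frame(a = 1:3, b = 4:6)

# One call per row: (a = 1, b = 4), (a = 2, b = 5), (a = 3, b = 6)
row_sums <- pmap_dbl(df, function(a, b) a + b)
row_sums  # 5 7 9

# For plain sums, base R's rowSums() gives the same numbers
all.equal(row_sums, unname(rowSums(df)))
```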

## R Hypothesis Testing and Linear Regression

### HYPOTHESIS TESTING: 
It is used to determine whether the data provide evidence for an effect or relationship (for example, between two sets of data) and to make decisions/conclusions about it.

#### Useful for:
1. business - determining effectiveness of marketing, identifying customer buying properties, online advertising optimization
2. science/social science - determining if data sets match a model, understanding scientific process based on collected data values, analysis of study data


#### STEPS

1. State the null hypothesis and the alternative hypothesis
2. Decide on a test statistic
3. Use the p-value and/or confidence interval to make a decision/conclusion

- A p-value of 0.05 “signifies that if the null hypothesis is true, and all other assumptions made are valid, there is a 5% chance of obtaining a result at least as extreme as the one observed” (http://www.nature.com/news/statisticians-issue-warning-over-misuse-of-p-values-1.19503)
- Data is used as evidence. Perform a test in order to make a decision: reject the null hypothesis or fail to reject the null hypothesis.
- NOTE: We cannot prove if the null hypothesis is true or false. We can only show that there is evidence to suggest one conclusion or another. 
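
The steps above can be sketched by hand for a one-sample, two-sided t-test. The data vector and the test value of 10 are made up for illustration; the point is only that the manual test statistic and p-value agree with what `t.test` reports.

```r
# Hypothetical sample of mileage-like values
mileage <- c(10.9, 11.0, 10.7, 9.5, 9.9, 10.5, 10.6, 10.6, 10.7, 9.4)

# Step 1: H0: mean = 10, HA: mean != 10
mu0 <- 10

# Step 2: t statistic = (sample mean - mu0) / standard error
n <- length(mileage)
t_stat <- (mean(mileage) - mu0) / (sd(mileage) / sqrt(n))

# Step 3: two-sided p-value from the t distribution with n - 1 df
p_value <- 2 * pt(-abs(t_stat), df = n - 1)
decision <- if (p_value < 0.05) "reject H0" else "fail to reject H0"

# Compare with the built-in test
builtin <- t.test(mileage, mu = mu0)
c(manual = p_value, builtin = builtin$p.value)
decision
```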

#### Assumptions

There are assumptions that need to be met before performing statistical tests. 

- For the one sample case
    - Population of interest is normally distributed
    - Independent random samples are taken

- For the two sample case:
    - The two samples are independent
    - Populations of interest are normally distributed


### One Sample Test

A one sample test is used when a sample is compared to a model or known population/estimate. For example, it is used to determine whether the sample mean is different from a specific value.

As an example, using the car data, test whether the average mileage is different from 10 km/L. 

In [103]:
car_data <- read.csv("car_data.csv")
car_data


t.test(x = car_data$km.L, alternative = c("two.sided"), mu = 10)


Total.km,Distance,Litres,Price,Total,km.L,mi.gal
,487,44.29,1.169,51.78,10.99,25.86
,304,27.59,1.109,30.6,11.02,25.91
244270.0,290,26.97,1.219,32.88,10.75,25.29
244728.0,458,48.34,1.219,58.92,9.48,22.29
245155.0,427,42.94,1.259,54.06,9.94,23.39
245631.0,476,45.21,1.259,56.92,10.53,24.76
246131.0,500,47.38,1.299,61.54,10.55,24.82
246515.0,384,36.22,1.249,45.24,10.6,24.94
247019.0,504,47.28,1.269,60.0,10.66,25.07
247372.0,353,37.41,1.209,45.23,9.44,22.19



	One Sample t-test

data:  car_data$km.L
t = 1.608, df = 29, p-value = 0.1187
alternative hypothesis: true mean is not equal to 10
95 percent confidence interval:
  9.90338 10.80729
sample estimates:
mean of x 
 10.35533 


- If p-value > 0.05, a sample mean at least as extreme as the one observed is not that unlikely when the null hypothesis is true. 
    1. Fail to reject the null hypothesis 
    2. There is not enough evidence to suggest that the mean value of VARIABLE is less than, greater than, or different from the test value (depending on the alternative hypothesis).

- If p-value < 0.05, 
    1. Reject the null hypothesis
    2. There is evidence to suggest that the mean value of VARIABLE is less than, greater than, or different from the test value (depending on the alternative hypothesis).


### Two Sample Unpaired

An unpaired (independent) two sample test compares two independent samples to determine if there is a difference between the groups. 

Examples:
- Compare effectiveness of two different drugs tested on two sets of patients


In [106]:
library(datasets)

# Test the hypothesis that there is no difference between the mean active temperature and the mean non-active temperatures.
#H0: µ1 = µ2 → µ1 - µ2 = 0
#HA: µ1 ≠ µ2 → µ1 - µ2 ≠ 0

beaver2

t.test(temp~activ, data=beaver2, 
  alternative=c("two.sided"), mu=0, 
  paired=FALSE)


day,time,temp,activ
307,930,36.58,0
307,940,36.73,0
307,950,36.93,0
307,1000,37.15,0
307,1010,37.23,0
307,1020,37.24,0
307,1030,37.24,0
307,1040,36.90,0
307,1050,36.95,0
307,1100,36.89,0



	Welch Two Sample t-test

data:  temp by activ
t = -18.548, df = 80.852, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.8927106 -0.7197342
sample estimates:
mean in group 0 mean in group 1 
       37.09684        37.90306 


The p-value << 0.05.

Reject the null hypothesis. There is evidence to suggest that there is a difference between active and non active temperatures. 
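
Rather than reading the numbers off the printout, you can pull them out of the object that `t.test` returns. A short sketch using the same `beaver2` test (two-sided, unpaired, and `mu = 0` are the defaults, so they are omitted here):

```r
# beaver2 ships with the datasets package, which is attached by default
result <- t.test(temp ~ activ, data = beaver2)

# Decision at the 5% level
result$p.value < 0.05

# Confidence interval for the difference in means (group 0 minus group 1)
result$conf.int

# Group means
result$estimate
```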


### Two Sample Paired

A paired (dependent) two sample test compares two dependent samples to see if there is a difference between the groups. This test typically uses multiple measurements on one subject. Also called a "repeated measures" test.

Examples:
- Effect of treatment on a patient (before and after)
- Apply something to test subjects to see if there is an effect
- Car example: Do cars get better mileage with different grades of gasoline?

In [109]:
# The athletes.csv dataset contains data on ten athletes and their times for the 100m dash before training (Training = 0) and after (Training = 1). 
# Test the hypothesis that their training has no effect on the times of the athletes. Test to see if the mean of the differences is different from 0.
# H0: d = 0
# HA: d ≠ 0


athletes_data <- read.csv("athletes.csv")
athletes_data

t.test(Time~Training, data = athletes_data, alternative=c("two.sided"), mu=0, paired=TRUE)

Athlete,Time,Training
1,12.9,0
2,13.5,0
3,12.8,0
4,15.6,0
5,17.3,0
6,19.32,0
7,12.6,0
8,15.3,0
9,14.4,0
10,11.3,0



	Paired t-test

data:  Time by Training
t = -0.12031, df = 9, p-value = 0.9069
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.5544647  0.4984647
sample estimates:
mean of the differences 
                 -0.028 


The two sample case tests for a difference between the groups (d ≠ 0). The CI is for the difference.

Fail to reject the null hypothesis, p-value >> 0.05
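
A paired test is the same as a one-sample test on the within-subject differences. The sketch below shows this with hypothetical before/after times (not the values from `athletes.csv`). It passes the two measurements as separate vectors, which also sidesteps the formula interface; newer R releases may reject `paired = TRUE` when a formula is used.

```r
# Hypothetical 100m times (seconds) for five athletes, same order in both vectors
before <- c(12.9, 13.5, 12.8, 15.6, 17.3)
after  <- c(12.7, 13.6, 12.6, 15.5, 17.4)

# Paired test on the two vectors
paired_test <- t.test(before, after, paired = TRUE)

# One-sample test on the differences, H0: mean difference = 0
diff_test <- t.test(before - after, mu = 0)

all.equal(paired_test$p.value, diff_test$p.value)
```

The vector form relies on both vectors listing the subjects in the same order, so it is worth sorting by subject first.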

### Question: How many of the following hypothesis questions should use two sample unpaired tests?

1. Is the average student mark in courses 70%?
2. Does a student's mark improve after studying?
3. Has the average student height increased since 1990?
4. Does radiation reduce the size of tumors when used to treat patients?
5. Is aspirin more effective than Tylenol for treating headaches?
6. Are college graduates better than high school graduates at standardized tests? 


### Try It: Hypothesis Testing

Using the airquality dataset that contains a random sample of air quality measurements in New York City:
1. An official claims that the average wind speed in the city is 9 miles per hour. Is that plausible?
2. A certain solar array will only be cost effective if mean solar radiation is over 175 Langleys. Would it be a sound investment in light of this data?


In [130]:
head(airquality,3)

Ozone,Solar.R,Wind,Temp,Month,Day
41,190,7.4,67,5,1
36,118,8.0,72,5,2
12,149,12.6,74,5,3


### Fitting a Linear Model
- We use the function ``lm(y~x, data=dataset)`` 
- ``lm`` stands for linear model
- ``y ~ x``, Formula: ``y-values = y-intercept + slope * x-values``
- The ``lm`` function then calculates the least-squares estimates for the ``y-intercept`` and the ``slope``
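
To make "least-squares estimates" concrete, here is a sketch on the built-in `mtcars` data (chosen only because it is always available; it is not the car data used in this lecture). For a single predictor, the slope is the covariance of `x` and `y` divided by the variance of `x`, and the fitted line passes through the point of means:

```r
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)

# Closed-form least-squares estimates for simple linear regression
slope <- cov(mtcars$wt, mtcars$mpg) / var(mtcars$wt)
intercept <- mean(mtcars$mpg) - slope * mean(mtcars$wt)

all.equal(unname(coef(fit)), c(intercept, slope))
```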


In [135]:
car.regression <- lm(km.L~Litres, data = car_data)
car.regression


Call:
lm(formula = km.L ~ Litres, data = car_data)

Coefficients:
(Intercept)       Litres  
   12.06973     -0.04455  


In [136]:
summary(car.regression)



Call:
lm(formula = km.L ~ Litres, data = car_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.58220 -0.55879  0.03717  0.45726  3.02414 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 12.06973    0.84075  14.356 1.95e-14 ***
Litres      -0.04455    0.02116  -2.105   0.0444 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.145 on 28 degrees of freedom
Multiple R-squared:  0.1367,	Adjusted R-squared:  0.1058 
F-statistic: 4.432 on 1 and 28 DF,  p-value: 0.04437


- Residuals: Summary of the residuals (the distances from the data to the fitted line). Ideally they should be symmetrically distributed around the line, meaning that the min and max values should be roughly the same distance from 0.
- Coefficients: The least-squares estimates for the fitted line. The Std. Error and t values show how the p-values were calculated. A p-value < 0.05 means that our ``x`` is statistically significant at the 5% level.
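
As a sketch of how those columns fit together, the t value and p-value for `Litres` can be recomputed from the rounded numbers printed in the summary above (estimate, standard error, and 28 residual degrees of freedom). Because the inputs are rounded, the results match the table only approximately.

```r
# Values copied from the summary output for the Litres coefficient
estimate  <- -0.04455
std_error <- 0.02116
resid_df  <- 28

# t value = estimate / standard error
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = resid_df)

round(t_value, 3)  # -2.105
p_value            # close to the 0.0444 shown in the table
```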