Skip to content
Permalink
Branch: master
Find file Copy path
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
227 lines (198 sloc) 8.11 KB
---
title: "HW 2: From correlation to linear mixed-effect models"
knit: (function(inputFile, encoding) {
out_dir <- './../docs';
rmarkdown::render(inputFile,
encoding=encoding,
output_file=file.path(dirname(inputFile), out_dir, 'HW2.html')) })
editor_options:
chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, eval = FALSE)
library(tidyverse)
library(lme4)
```
## 1. Vowel reduction in Russian
Pavel Duryagin ran an experiment on perception of vowel reduction in Russian language. The dataset `shva` includes the following variables:
* `time1` - reaction time 1
* `duration` - duration of the vowel in the stimuly (in milliseconds, ms)
* `time2` - reaction time 2
* `f1`, `f2`, `f3` - the 1st, 2nd and 3rd formant of the vowel measured in Hz (for a short introduction into formants, see [here](https://home.cc.umanitoba.ca/~krussll/phonetics/acoustic/formants.html))
* `vowel` - vowel classified according the 3-fold classification (_A_ - _a_ under stress, _a_ - _a/o_ as in the first syllable before the stressed one, _y_ (stands for shva) - _a/o_ as in the second etc. syllable before the stressed one or after the stressed syllable, cf. _g_[_y_]_g_[_a_]_t_[_A_]_l_[_y_] _gogotala_ `guffawed').
In this part, we will ask you to analyse correlation between f1, f2, and duration.
The dataset is available [https://raw.githubusercontent.com/agricolamz/2018-MAG_R_course/master/data/duryagin_ReductionRussian.txt](here).
### 1.0 Read the data from file to the variable `shva`.
```{r 1.0, eval = TRUE, include=FALSE}
shva <- read_tsv("https://raw.githubusercontent.com/agricolamz/2018-MAG_R_course/master/data/duryagin_ReductionRussian.txt")
```
### 1.1 Scatterplot `f1` and `f2` using `ggplot()`.
Design it to look like the following:
```{r 1.1, eval = TRUE}
shva %>%
ggplot(aes(x=f2, y=f1, color=vowel)) +
geom_point(show.legend = FALSE) +
scale_x_reverse()+
scale_y_reverse()+
labs(x = "f2",
y = "f1",
title = "f2 and f1 of the reduced and stressed vowels",
caption = "Data from Duryagin 2018")
```
### 1.2 Plot the boxplots of `f1` and `f2` for each vowel using `ggplot()`.
<!--- Note that you can use `coord_flip()` parameter to flip the coordinates if needed (see R code [here](http://ggplot2.tidyverse.org/reference/geom_boxplot.html)). --->
```{r 1.2, eval = TRUE}
shva %>%
ggplot(aes(x=vowel, y=f1, fill=vowel)) +
geom_boxplot(show.legend = FALSE) +
coord_flip() +
guides(color = guide_legend()) +
labs(x = "",
y = "f1",
title = "f1 distribution in each vowel",
caption = "Data from Duryagin 2018")
shva %>%
ggplot(aes(x=vowel, y=f2, fill=vowel)) +
geom_boxplot(show.legend = FALSE) +
coord_flip() +
guides(color = guide_legend()) +
labs(x = "",
y = "f2",
title = "f2 distribution in each vowel",
caption = "Data from Duryagin 2018")
```
### 1.3 Which `f1` can be considered outliers in _a_ vowel?
We assume outliers to be those observations that lie outside 1.5 * IQR, where IQR, the 'Inter Quartile Range', is the difference between the 1st and the 3rd quartile (= 25% and 75% percentile).
```{r 1.3}
shva %>%
group_by(vowel) %>%
summarise(quantile(f1, 0.75), quantile(f1, 0.25), IQR(f1))
# provide your code below:
boxplot.stats(shva[shva$vowel == "a",]$f1)$out
```
### 1.4 Calculate Pearson's correlation of `f1` and `f2` (all data)
```{r 1.4}
shva %>%
select(f1,f2) %>%
cor()
```
### 1.5 Calculate Pearson's correlation of `f1` and `f2` for each vowel
```{r 1.5}
shva %>%
group_by(vowel) %>%
do(data.frame(Cor=t(cor(.[,5], .[,4]))))
```
### 1.6 Use the linear regression model to predict `f2` by `f1`.
#### 1.6.1 Provide the result regression formula
#### 1.6.2 Provide the adjusted R$^2$
#### 1.6.3 Add the regression line in scatterplot 1.1
```{r 1.6, eval = TRUE}
fit1 <- lm(f1~f2, data = shva)
shva %>%
ggplot(aes(x=f2, y=f1, color=vowel)) +
geom_point(show.legend = FALSE) +
scale_x_reverse()+
scale_y_reverse()+
geom_line(aes(f2, predict(fit1)), color = "darkgray") +
labs(x = "f2",
y = "f1",
title = "f2 and f1 of the reduced and stressed vowels",
caption = "Data from Duryagin 2018")
```
### 1.7 Use the mixed-efects model to predict `f2` by `f1` using `vowel` intercept as a random effect
#### 1.7.1 Provide the fixed effects formula
#### 1.7.2 Provide the variance for intercept argument for vowel random effects
#### 1.7.3 Add the regression line in scatterplot 1.1
```{r 1.7, eval = TRUE}
fit2 <- lmer(f2~f1+(1|vowel), data = shva)
shva$model2 <- predict(fit2)
shva %>%
ggplot(aes(x=f2, y=f1, color=vowel)) +
geom_point(show.legend = FALSE) +
scale_x_reverse()+
scale_y_reverse()+
geom_line(aes(f2, model2), show.legend = FALSE) +
labs(x = "f2",
y = "f1",
title = "f2 and f1 of the reduced and stressed vowels",
caption = "Data from Duryagin 2018")
```
## 2. English Lexicon Project data
880 nouns, adjectives and verbs from the English Lexicon Project data (Balota et al. 2007).
* `Format` -- A data frame with 880 observations on the following 5 variables.
* `Word` -- a factor with lexical stimuli.
* `Length` -- a numeric vector with word lengths.
* `SUBTLWF` -- a numeric vector with frequencies in film subtitles.
* `POS` -- a factor with levels JJ (adjective) NN (noun) VB (verb)
* `Mean_RT` -- a numeric vector with mean reaction times in a lexical decision task
Source (http://elexicon.wustl.edu/WordStart.asp)
Data from Natalya Levshina's `RLing` package available (here)[https://raw.githubusercontent.com/agricolamz/2018-MAG_R_course/master/data/ELP.csv]
### 2.0 Read the data from file to the variable `elp`.
```{r, include=FALSE, eval=TRUE}
elp <- read_csv("https://raw.githubusercontent.com/agricolamz/2018-MAG_R_course/master/data/ELP.csv")
```
### 2.1 Which two variables have the highest Pearson's correlaton value.
```{r}
cor(elp[, -c(1, 4)])
```
### 2.2 Group your data by parts of speech and make a scatterplot of SUBTLWF and Mean_RT.
```{r, eval= TRUE}
elp %>%
ggplot(aes(SUBTLWF, Mean_RT, color = Length))+
geom_point()+
scale_x_log10()+
facet_wrap(~POS)+
theme_bw()+
scale_color_continuous(low = "lightblue", high = "red")+
labs(caption = "data from (Balota et al. 2007)")
```
I've used `scale_color_continuous(low = "lightblue", high = "red")`
### 2.3 Use the linear regression model to predict Mean_RT by log(SUBTLWF) and POS.
#### 2.3.1 Provide the result regression formula
#### 2.3.2 Provide the adjusted R$^2$
#### 2.3.3 Add the regression line in scatterplot 1.1
```{r, eval = TRUE}
fit3 <- lm(Mean_RT ~ log(SUBTLWF), data = elp)
elp$model3 <- predict(fit3)
elp %>%
ggplot(aes(log(SUBTLWF), Mean_RT))+
geom_point(aes(color = Length))+
theme_bw()+
scale_color_continuous(low = "lightblue", high = "red")+
labs(caption = "data from (Balota et al. 2007)")+
geom_line(aes(log(SUBTLWF), model3))
```
### 2.4 Use the mixed-efects model to predict `Mean_RT` by `log(SUBTLWF)` using POS intercept as a random effect
#### 2.4.1 Provide the fixed effects formula
#### 2.4.2 Provide the variance for intercept argument for POS random effects
#### 2.4.3 Add the regression line to scatterplot
```{r, eval = TRUE}
fit4 <- lmer(Mean_RT ~ log(SUBTLWF)+(1|POS), data = elp)
elp$model4 <- predict(fit4)
elp %>%
ggplot(aes(x=log(SUBTLWF), y=Mean_RT)) +
geom_point(aes(color=POS), show.legend = FALSE) +
labs(caption = "data from (Balota et al. 2007)")+
geom_line(aes(log(SUBTLWF), model4))+
facet_wrap(~POS)
```
```{r, eval = TRUE}
fit4 <- lmer(Mean_RT ~ log(SUBTLWF)+(1 + log(SUBTLWF)|POS), data = elp)
elp$model4 <- predict(fit4)
elp %>%
ggplot(aes(x=log(SUBTLWF), y=Mean_RT)) +
geom_point(aes(color=POS), show.legend = FALSE) +
labs(caption = "data from (Balota et al. 2007)")+
geom_line(aes(log(SUBTLWF), model4))+
facet_wrap(~POS)
```
```{r, eval = TRUE}
fit4 <- lmer(Mean_RT ~ log(SUBTLWF)+(0 + log(SUBTLWF)|POS), data = elp)
elp$model4 <- predict(fit4)
elp %>%
ggplot(aes(x=log(SUBTLWF), y=Mean_RT)) +
geom_point(aes(color=POS), show.legend = FALSE) +
labs(caption = "data from (Balota et al. 2007)")+
geom_line(aes(log(SUBTLWF), model4))+
facet_wrap(~POS)
```
You can’t perform that action at this time.