Skip to content
Permalink
Branch: master
Find file Copy path
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
114 lines (76 sloc) 6.04 KB
---
title: "Lab12. Decision Trees and Forests. Variable importance"
output: html_document
editor_options:
chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
We will use the library `party`. However, there is a number of other packages for classification and regression tree-based approach (CART): `randomForest`, `rpart`, `crat`, `maptree`, `partykit` and other.
```{r, message=FALSE}
library(party)
```
### 1. Vowel elision
Our student Varvara Sveshnikova wrote her BA paper on two cases of the consonant drop: (a) when in the complex -stvov- (like in _beschinstVovat'_ 'to riot') another labial consonant is pronounced after it, and (b) when no consonant follows (in two contexts: _beschinstVuju_ 'I riot', beschinstVo_ 'roistering'). The dataset [](https://raw.githubusercontent.com/agricolamz/r_on_line_course_data/master/Sveshnikova.2016.v.elision.csv) includes the following data:
`v.elision` --- elision of [v] / no elision;
`group` --- a group of test words, first (beschinstvovat'), second (beschinstvuju), third (beschinstvo);
`word` --- root under analysis;
`position` --- phrase position: strong, under logical stress (_I am not *CRYING*, I resent), weak (_He ALWAYS likes to *cry*).
Fit a CART model, using ctree() function, predicting v.elision variable by all others.
1.1 Visualize a model using plot() function. What is the number of observation in node 6?
1.2 Visualize a model using print() function. Which split have a statistic 14.01?
1.3 Predict a value of v.elision for word with a root "попеч" in a third group, in a strong position.
Fit a cforest model using additional argument controls=cforest_unbiased(ntree=1000, mtry=3).
1.4 Predict a value of v.elision for word with a root "попеч" in a third group, in a strong position using `cforest` model.
You need to add an argument OOB=TRUE. e. g. yes
1.5 Calculate a variable importance for a group variable in the random forest model using varimp() function.
Code to use:
```{r}
df <- read.csv("https://raw.githubusercontent.com/agricolamz/r_on_line_course_data/master/Sveshnikova.2016.v.elision.csv")
fit <- ctree(v.elision~., data = df)
plot(fit)
print(fit)
predict(fit, df[45,-1], response = TRUE)
fit2 <- cforest(v.elision~., data = df, controls=cforest_unbiased(ntree=1000, mtry=2))
predict(fit2, df[45,-1],OOB=TRUE)
vi <- as.data.frame(sort(varimp(fit2), decreasing=TRUE))
vi
```
### 2. /S/ deletion in Panamanian Spanish
Here's some data from Henrietta Cedergren's 1973 study of /s/-deletion in Panamanian Spanish (via Greg Guy and Scott Kiesling). Cedergren had noticed that speakers in Panama City, like in many dialects of Spanish, variably deleted the /s/ at the end of words. She undertook a study to find out if there was a change in progress: if final /s/ was systematically dropping out of Panamanian Spanish. The attached data are from interviews she performed across the city in four different social classes (1=highest, 2=second highest, 3=second lowest, 4= lowest), to see how the variation was structured in the community. She also investigated the linguistic constraints on deletion, so she coded for a phonetic constraint — whether the following segment was consonant, vowel, or pause — and the grammatical category of word that the /s/ is part of a:
monomorpheme, where the s is part of the free morpheme (eg, menos)
verb, where the s is the second singular inflection (eg, tu tienes, el tienes)
determiner, where s is plural marked on a determiner (eg, los, las)
adjective, where s is a nominal plural agreeing with the noun (eg, buenos)
noun, where s marks a plural noun (eg, amigos)
Fit the CART model predicting s.deletion by phonetic environment and social class
Data: [https://raw.githubusercontent.com/LingData2019/LingData/master/data/cedergren73.csv](https://raw.githubusercontent.com/LingData2019/LingData/master/data/cedergren73.csv)
2.1 Visualize a model using plot() function. What is the number of observation in node 6?
2.2 Visualize a model using print() function. Which split have a statistic 61.559 (e. g. pause, vowel vs. consonant)?
2.3 Predict a value of s.delition for word said by person from 1 class, before consonant.
Fit a `cforest` model using additional argument controls=cforest_unbiased(ntree=100, mtry=2).
2.4 Calculate a variable importance for the random forest model using varimp() function. Which of the variable is more important?
```{r}
df <- read.csv("https://raw.githubusercontent.com/LingData2019/LingData/master/data/cedergren73.csv")
```
### 3.
Pavel Duryagin ran an experiment on perception of vowel reduction in Russian language. The dataset shva includes the following variables:
* `time1` - reaction time 1
* `duration` - duration of the vowel in the stimuly (in milliseconds, ms)
* `time2` - reaction time 2
* `f1`, `f2`, `f3` - the 1st, 2nd and 3rd formant of the vowel measured in Hz
* `vowel` - vowel classified according the 3-fold classification (A - a under stress, a - a/o as in the first syllable before the stressed one, y (stands for shva) - a/o as in the second etc. syllable before the stressed one or after the stressed syllable, cf. g[y]g[a]t[A]l[y] gogotala `guffawed’).
The dataset is available https://raw.githubusercontent.com/agricolamz/2018-MAG_R_course/master/data/duryagin_ReductionRussian.txt.
Fit the CART model predicting vowel by f1 and f2
3.1 Visualize a model using plot() function. What is the number of observation in node 9?
3.2 Predict a value of vowel for sound with f1 = 600, f2 = 1300?
Fit a cforest model using additional argument controls=cforest_unbiased(ntree=100, mtry=2).
3.3 Predict a value of vowel for sound with f1 = 600, f2 = 1300?
You need to add an argument OOB=TRUE.
3.4 Calculate a variable importance for the random forest model using varimp() function. Which of the variable is more important?
```{r}
shva <- read.csv("https://raw.githubusercontent.com/agricolamz/2018-MAG_R_course/master/data/duryagin_ReductionRussian.txt", sep = "\t")
#predict(fit, newdata = data.frame(f1 = as.integer(600),
# f2 = as.integer(1300)), response = TRUE)
```
You can’t perform that action at this time.