# Data preparation

This notebook prepares the data from the British Lexicon Project (BLP) and the Calgary Semantic Decision Project (CSDP) for the response time analyses.

## Preliminaries

Load the libraries required for data preprocessing.

In [1]:
# Load libraries
suppressMessages(library(vwr))
suppressMessages(library(ndl))
library(MASS)
library(readxl)
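
Before loading, it can help to confirm that the four packages are installed. The snippet below is a minimal sketch using base R only; the variable names `required` and `not_installed` are illustrative and not part of the original notebook.

```r
# Packages loaded by this notebook
required = c("vwr", "ndl", "MASS", "readxl")

# requireNamespace() returns FALSE (instead of raising an error) when a package is absent
installed = vapply(required, requireNamespace, logical(1), quietly = TRUE)
not_installed = required[!installed]

if (length(not_installed) > 0) {
  message("Missing packages: ", paste(not_installed, collapse = ", "))
}
```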

## British Lexicon Project

Preparation of the BLP data. We read in the response time data, select the relevant subset, and compute the lexical-distributional variables for the response time analyses.

In [2]:
# Read in data
blp = read.table("input/blp-items.txt", header = TRUE)

# Restrict to relevant columns
blp = blp[,c("spelling","rt", "accuracy")]
colnames(blp) = c("Word", "RT", "Accuracy")

# Remove words without response time
blp = blp[which(!is.na(blp$RT)),]

# Limit to words with a minimum accuracy of 0.75 in the BLP
blp = blp[which(blp$Accuracy >= 0.75),]

# Read in word frequencies
frequencies = read.table("input/blp-stimuli.txt", sep = "\t", header = TRUE)
frequencies = frequencies[,c("spelling", "subtlex.frequency")]
colnames(frequencies) = c("Word", "Frequency")

# Combine 
blp = merge(blp, frequencies, by = "Word")

# Limit to words with a minimum frequency of 10 in SUBTLEX-UK
blp = blp[which(blp$Frequency >= 10),]

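# Define word length
# (as.character guards against Word being read in as a factor in R < 4.0)
blp$Word = as.character(blp$Word)
blp$Length = nchar(blp$Word)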

# Define OLD20
blp$OLD20 = as.numeric(old20(blp$Word,blp$Word))
blp$OLD20Norm = blp$OLD20 / blp$Length

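# Define mean bigram frequency
# Split each word into its letter bigrams (orthoCoding separates them with "_")
bigram_list = strsplit(orthoCoding(blp$Word), "_")
bigrams = unique(unlist(bigram_list))
# Frequency of a bigram: summed frequency of all words containing it at least once
bigram_freqs = sapply(bigrams, FUN = function(x) {
  sum(blp$Frequency[sapply(bigram_list, function(y) x %in% y)])
})
# Average the bigram frequencies over the bigrams of each word
blp$MeanBigramFrequency = unlist(lapply(bigram_list,
  function(x) sum(bigram_freqs[x]) / length(x)))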

# Run Box-Cox tests to determine appropriate transformations
# par(mfrow=c(2,3))
# for(pred in c("RT", "Frequency", "Length", "OLD20Norm", "MeanBigramFrequency")) {
#   boxcox(blp[,pred] ~ 1)
# }

# Apply transformations
blp$RTInv = -1000 / blp$RT
blp$LogFrequency = log(blp$Frequency + 1)
blp$LogOLD20Norm = log(blp$OLD20Norm)
blp$LogMeanBigramFrequency = log(blp$MeanBigramFrequency + 1)

# Restrict to relevant columns
blp = blp[,c("Word", "RT", "RTInv", "Frequency", "LogFrequency", "Length", 
  "LogOLD20Norm", "LogMeanBigramFrequency")]

# Rename rows
rownames(blp) = 1:nrow(blp)

# Save data
saveRDS(blp, file = "data/blp.rds")
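
To make the mean bigram frequency step in the cell above easier to follow, here is a small worked example on toy data. It is a sketch under stated assumptions: the three words and their frequencies are made up, and `to_bigrams` is a hypothetical base R helper that imitates letter-bigram coding with `#` as the word boundary, so the example runs without `ndl`.

```r
# Toy data: three words with made-up frequencies (not taken from the BLP)
toy = data.frame(Word = c("hand", "band", "hat"),
                 Frequency = c(100, 20, 50),
                 stringsAsFactors = FALSE)

# Hypothetical helper: split a word into letter bigrams, with "#" marking word edges
to_bigrams = function(word) {
  padded = paste0("#", word, "#")
  substring(padded, 1:(nchar(padded) - 1), 2:nchar(padded))
}

bigram_list = lapply(toy$Word, to_bigrams)

# Frequency of a bigram: summed frequency of all words containing it at least once
bigrams = unique(unlist(bigram_list))
bigram_freqs = sapply(bigrams, function(b) {
  sum(toy$Frequency[sapply(bigram_list, function(w) b %in% w)])
})

# Mean bigram frequency: average over the bigrams of each word
toy$MeanBigramFrequency = sapply(bigram_list, function(w) mean(bigram_freqs[w]))
```

In this toy set, `an`, `nd`, and `d#` each occur in "hand" and "band" and therefore have a frequency of 120, while `#h` and `ha` occur in "hand" and "hat" and have a frequency of 150. The resulting mean bigram frequencies are 132 for "hand", 80 for "band", and 100 for "hat".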

## Calgary Semantic Decision Project

Preparation of the CSDP data. We read in the response time data, select the relevant subset, and add the lexical-distributional variables computed for the BLP above.

In [3]:
# Read in data
csdp = read_excel("input/CSD.xlsx")

# Restrict to relevant columns
csdp = csdp[,c("Word","RTclean_mean", "WordType", "Block")]
colnames(csdp) = c("Word", "RT", "Type", "Block")

# Remove words without response time
csdp = csdp[which(!is.na(csdp$RT)),]

# Get lexical-distributional variables from the blp data frame
blp_var = blp[,c("Word", "Frequency", "LogFrequency", "Length", "LogOLD20Norm", 
  "LogMeanBigramFrequency")]
csdp = merge(csdp, blp_var, by = "Word")

# Limit to words with a minimum frequency of 10 in SUBTLEX-UK
# (frequencies are inherited from the blp data frame)
csdp = csdp[which(csdp$Frequency >= 10),]

# Run Box-Cox test to determine appropriate transformation for response times
# boxcox(csdp$RT ~ 1)

# Transform response times
csdp$LogRT = log(csdp$RT)

# Rename rows
rownames(csdp) = 1:nrow(csdp)

# Save data
saveRDS(csdp, file = "data/csdp.rds")
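
The commented-out `boxcox()` calls in the cells above only draw the profile likelihood. As a rough sketch of how a transformation can be read off numerically, the example below extracts the lambda with the highest log-likelihood. It assumes simulated log-normal values in place of real response times; the names `rt_sim`, `bc`, and `lambda_hat` are illustrative.

```r
library(MASS)

set.seed(1)
# Simulated, right-skewed "response times" (log-normal); not real BLP or CSDP data
rt_sim = rlnorm(2000, meanlog = log(600), sdlog = 0.5)

# Profile log-likelihood over a grid of lambda values, without plotting
bc = boxcox(rt_sim ~ 1, lambda = seq(-2, 2, by = 0.1), plotit = FALSE)

# Grid value of lambda that maximises the log-likelihood
lambda_hat = bc$x[which.max(bc$y)]
```

A `lambda_hat` close to 0 points to a log transformation, as applied to the CSDP response times, and a value close to -1 points to an inverse transformation, as applied to the BLP response times.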