In many situations, the dependent variable is not continuous, nor is it even normally distributed.
3.1 Bounded Variables
While one might be tempted to believe that most of the data that we collect in psychological science are truly continuous variables, this is often not the case. In fact, many variables are bounded: their values are delimited by hard bounds. This is typically the case for sliders (aka “analog” scales), dimensional scores (averages or sums) computed from multiple Likert scales, percentages, proportions, etc.
Most psychometric indices are bounded. For instance, the minimum and maximum values for the IQ test (WAIS-IV) are 45 and 155; they are 0 and 63 for the depression scale BDI-II, and 20 and 80 for the STAI.
Despite this fact, we still most often use linear models to analyze these data, which is not ideal, as these models assume that the dependent variable is continuous and normally distributed.
3.1.1 The Problem with Linear Models
Let’s take the data from Makowski et al. (2023), which contains data from participants who underwent the Mini-IPIP6 personality test and the PID-5 BF questionnaire for “maladaptive” personality. We will focus on the “Disinhibition” trait from the PID-5 BF questionnaire. Note that although it is usually computed as the average of items on 4-point Likert scales (0-3), this study used analog sliders to obtain finer-grained scores.

using Downloads, CSV, DataFrames, Random
using Turing, Distributions, SequentialSamplingModels
using GLMakie

Random.seed!(123) # For reproducibility

df = CSV.read(Downloads.download("https://raw.githubusercontent.com/DominiqueMakowski/CognitiveModels/main/data/makowski2023.csv"), DataFrame)

# Show the first 10 rows
first(df, 10)

# Plot the distribution of "Disinhibition"
hist(df.Disinhibition, normalization=:pdf, color=:darkred, bins=40)
We will then fit a simple Gaussian model (an “intercept-only” linear model) that estimates the mean and the standard deviation of our variable of interest.
@model function model_Gaussian(x)

    # Priors
    σ ~ truncated(Normal(0, 1); lower=0)  # Strictly positive half-normal distribution
    μ ~ Normal(0, 3)

    # Iterate through every observation
    for i in 1:length(x)
        # Likelihood family
        x[i] ~ Normal(μ, σ)
    end
end

# Fit the model with the data
fit_Gaussian = model_Gaussian(df.Disinhibition)
# Sample results using MCMC
chain_Gaussian = sample(fit_Gaussian, NUTS(), 400)
Let’s see if the model managed to recover the mean and standard deviation of the data:
println("Mean of the data: $(round(mean(df.Disinhibition); digits=3)) vs. mean from the model: $(round(mean(chain_Gaussian[:μ]); digits=3))")
println("SD of the data: $(round(std(df.Disinhibition); digits=3)) vs. SD from the model: $(round(mean(chain_Gaussian[:σ]); digits=3))")
Mean of the data: 1.064 vs. mean from the model: 1.066
SD of the data: 0.616 vs. SD from the model: 0.618
Impressive! The model managed to almost perfectly recover the mean and standard deviation of the data. That means we must have a good model, right? Not so fast!
Linear models are, by definition, designed to recover the mean of the outcome variable (and its SD, assuming it is invariant across groups). That does not mean that they can capture the full complexity of the data.
Let us then jump straight into generating predictions from the model and plotting the results against the actual data to see how well the model fits the data (a procedure called the posterior predictive check).
pred = predict(model_Gaussian([missing for i in 1:length(df.Disinhibition)]), chain_Gaussian)
pred = Array(pred)

fig = hist(df.Disinhibition, normalization=:pdf, color=:darkred, bins=40)
for i in 1:length(chain_Gaussian)
    lines!(Makie.KernelDensity.kde(pred[i, :]), alpha=0.1, color=:black)
end
fig
As we can see, the model assumes that the data is normally distributed, in a way that allows for negative values and values above 3, which are not possible: our variable is, by design, bounded between 0 and 3 (as it is the mean of variables between 0 and 3). A linear model might thus not be the best choice for this type of data.
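To make this concrete, we can compute the share of posterior predictive draws that fall outside the possible range (a quick sketch reusing the pred matrix generated above):

# Proportion of predicted values outside the possible 0-3 range
out_of_bounds = mean((pred .< 0) .| (pred .> 3))
println("Proportion of impossible predicted values: $(round(out_of_bounds * 100; digits=1))%")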
3.1.2 Rescaling
Continuous variables can be trivially rescaled, which is often done to improve the interpretability of the results. For instance, a z-score is a rescaled variable with a mean of 0 and a standard deviation of 1. Importantly, rescaling a variable does not change its distribution or the absolute relationship between variables: it does not alter the fundamental conclusions of an analysis, but it can help with the interpretation of the results (or with computational performance).
Two common rescalings are standardization (mean=0, SD=1) and normalization (range 0-1). The benefits of standardization are the interpretation in terms of deviation (which can be compared across variables). The benefits of normalization are the interpretation in terms of “proportion” (percentage): a value of \(0.5\) (i.e., \(50\%\)) means that the value is in the middle of the range. The latter is particularly useful for bounded variables, as it redefines their bounds to 0 and 1 and allows for a more intuitive interpretation (e.g., “Condition B led to an increase of 20% of the rating on that scale”).
Let’s rescale the “Disinhibition” variable to a 0-1 range:
function data_rescale(x; old_range=[minimum(x), maximum(x)], new_range=[0, 1])
    return (x .- old_range[1]) ./ (old_range[2] - old_range[1]) .* (new_range[2] - new_range[1]) .+ new_range[1]
end

# Rescale the variable using its theoretical bounds (0-3)
df.Disinhibition2 = data_rescale(df.Disinhibition; old_range=[0, 3], new_range=[0, 1])

# Visualize
fig = hist(df.Disinhibition2, normalization=:pdf, color=:darkred, bins=40)
xlims!(-0.1, 1.1)
fig
3.1.3 Beta Models
The Beta distribution is a natural candidate for variables bounded between 0 and 1. However, its standard parameterization in terms of \(\alpha\) and \(\beta\) is not very intuitive for modeling: it is convenient to reparameterize it in terms of its mean \(\mu\) and standard deviation \(\sigma\), using \(\nu = \mu(1-\mu)/\sigma^2 - 1\), \(\alpha = \mu\nu\) and \(\beta = (1-\mu)\nu\).

# Reparameterized Beta distribution
function BetaMean(μ, σ)
    if σ <= 0 || σ >= sqrt(μ * (1 - μ))
        error("Standard deviation σ must be in the interval (0, sqrt(μ*(1-μ))=$(sqrt(μ*(1-μ)))).")
    end
    ν = μ * (1 - μ) / σ^2 - 1
    α = μ * ν
    β = (1 - μ) * ν

    return Beta(α, β)
end
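As a quick sanity check (a hypothetical snippet, not part of the original analysis), the reparameterized distribution should return the requested moments:

d = BetaMean(0.3, 0.1)  # ν = 20, i.e., Beta(6, 14)
println(mean(d))  # ≈ 0.3
println(std(d))   # ≈ 0.1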
Caution

TODO: Replace if it is supported somewhere (e.g., Distributions.jl)

# @model function model_Beta(x)
#     μ ~ Beta(1, 1)
#     σ ~ Uniform(eps(typeof(μ)), μ * (1 - μ) - eps(typeof(μ)))
#     for i in 1:length(x)
#         x[i] ~ MeanVarBeta(μ, σ)
#     end
# end
# chains = sample(model_Beta(rand(MeanVarBeta(0.5, 0.2), 200)), NUTS(), 500;
#                 initial_params=[0.5, 0.1])

3.1.4 OrdBeta Models

Need help to implement this in Turing!

https://cran.r-project.org/web/packages/ordbetareg/vignettes/package_introduction.html
https://stats.andrewheiss.com/compassionate-clam/notebook/ordbeta.html#ordered-beta-regression
https://www.robertkubinec.com/post/limited_dvs/