
Quick Excel trick to round to a list

Useful for resampling issues in volumes/surfaces

http://www.mrexcel.com/forum/excel-questions/379391-round-up-nearest-number-within-list-2.html#post3972903

How to summarize a dataframe by arbitrary groups

library(doBy)
#> Loading required package: survival
#> Loading required package: splines

# Run the functions length, mean, and sd on the value of "change" for each group,
# broken down by sex + condition
cdata <- summaryBy(change ~ sex + condition, data=data, FUN=c(length,mean,sd))
cdata
#>   sex condition change.length change.mean change.sd
#> 1   F   aspirin             5   -3.420000 0.8642916
#> 2   F   placebo            12   -2.058333 0.5247655
#> 3   M   aspirin             9   -5.411111 1.1307569
#> 4   M   placebo             4   -0.975000 0.7804913
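
The same summary can also be produced with dplyr, if you prefer it over doBy; a minimal sketch, assuming the same "data" frame:

library(dplyr)

# group by sex + condition and compute the length, mean, and sd of "change"
cdata <- data %>%
  group_by(sex, condition) %>%
  summarise(change.length = n(),
            change.mean = mean(change),
            change.sd = sd(change))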

How to make histograms to visualise your data

#load required packages
library(RMINC)
library(ggplot2)

Create a histogram of age distribution

qplot(x=Age,geom="histogram",data=data,binwidth = 1,main = "Histogram for Age",xlab = "Age", ylab = "Count", fill=I("blue"),col=I("red"),alpha=I(.2),xlim=c(5,40))

Same histogram, but this time split by DX (Autism and Controls)

qplot(x=Age, fill=DX, geom="histogram",data=data,binwidth = 1,main = "Histogram for Age",xlab = "Age", ylab = "Count",col=I("red"),alpha=I(.5),xlim=c(5,40))

geom="histogram" creates a histogram for Age (x=Age)

binwidth = 1 : set each bin to represent 1 year (if you don't specify anything, ggplot2 defaults to 30 bins and warns you to pick a better value). You could also specify bins = 10, and instead it will give you 10 equal-sized bins

main = "Histogram for Age" : Creates a title for your histogram

xlab = "Age", ylab = "Count" : labels for x and y axis

fill = DX : Allows you to visualise how much of each bin is made up by each group, in this case Control or Autism (each group is assigned its own colour)

OR, if not splitting by group, fill=I("blue") : choose the colour of your bars

col=I("red") : Choose the colour of the border of your bars

alpha=I(.2) : Transparency of the fill [0 (fully transparent) - 1 (opaque)]

xlim=c(5,40) : Restrict the x-axis to the range 5 to 40 (matching the calls above)
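
Note that qplot() is deprecated in recent versions of ggplot2; the same split histogram can be built with ggplot() itself. A sketch:

ggplot(data, aes(x=Age, fill=DX)) +
  geom_histogram(binwidth = 1, colour = "red", alpha = 0.5) +
  labs(title = "Histogram for Age", x = "Age", y = "Count") +
  xlim(5, 40)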

How to merge two (or more) data frames

#Load in your csvs
data <- read.csv('mydata.csv')
meanCT_ASD <- read.csv('meanCT_ASD.csv')
meanCT_ADHD <- read.csv('meanCT_ADHD.csv')

#Merge vertically (add rows):
total_meanCT <- rbind(meanCT_ASD, meanCT_ADHD)

#Merge horizontally (add columns) by Subject ID:
total <- merge(data, total_meanCT, by="Subject_ID", all.x=TRUE)
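
If you have more than two data frames to merge horizontally, Reduce() can chain the merges. A sketch, where behaviour_scores is a hypothetical third csv that also has a Subject_ID column:

#hypothetical third table sharing the Subject_ID column
behaviour_scores <- read.csv('behaviour_scores.csv')
total <- Reduce(function(x, y) merge(x, y, by="Subject_ID", all.x=TRUE),
                list(data, total_meanCT, behaviour_scores))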

So you wanna build a bootstrapped & residualized correlation matrix?

The following contains an explanation and example of how to create a structural correlation plot. This means that we have a bunch of structures (or other variables of interest) that we want to correlate together. In addition, we will do this while covarying for confounding variables and bootstrapping to ensure our results are robust. For this example, we will build such a plot using data from many subjects who each have volumes from many different regions in the brain. We want to correlate all these volumes together to see if there is any correlational relationship between the volumes of various structures. We will also be residualizing this specific data for sex and total brain volume.

#first load the required packages
library(corrplot)
library(ggplot2)
library(Hmisc)
library(reshape2)
library(plyr)
library(gplots)
options(digits=3)
theme_set(theme_bw(base_size = 20))

The first thing you must do to your data is residualize for your covariates. For example, if you are running a General Linear Model (GLM), typically you would run your GLM while simultaneously covarying for the effects of age, sex, education, etc. Here, since we will be working only with correlations, we want to residualize (or remove) the factors of age, sex, etc. from our data beforehand. You can skip this step if your data does not require any residualizing.

#to residualize, we need to first make a residualizing function
#here we are residualizing for total brain volume (eTIV) and sex (M1.F0)

f <- function(name) {
  linmod <- lm(as.formula(paste(name, "~ M1.F0 + eTIV")), data = OASIS_master)
  results <- residuals(linmod)
  return(results)
}

After completing the above step you will notice you have a new function called "f" in the functions pane (if you are using RStudio). This function will do the residualization of your dataset. Now, before we continue, we must first create a list of the column names for the new dataset that the "f" function will fill with residualized data. Remember, in this example we have our OASIS_master file that has volumes of brain structures (as well as other data like age, education, scan type, QC rating, etc.) and we will change these volumes into different numbers that are "residualized" for our covariates. Yes! That's right, your numbers will change after residualizing! For example, a volume of 1000mm^3 could now be a value of -1.342. I know, weird right?!

#lets create a list of the column names for input to the function "f"
my_column_names <- colnames(OASIS_master)

Now under the values pane you will notice we have created a list called my_column_names, containing all the headings of our original dataset (OASIS_master). BUT REMEMBER! We only want to correlate the structure volumes. We don't want things like education or scan type or QC ratings in there! So our end result data sheet should only have structure names; let's subset the column names to include columns 2 through 23 (these are the ones that are for structures).

#subsetting the names to include only the subfields
my_column_names_sub <- my_column_names[2:23]

Now that we have this all set up, we can apply the function to work on the subset of data.

#applying the function to work on the subset of data
results = lapply(my_column_names_sub, f)

After doing this, you will see that there is a new entry under values called "results". This is a list of all the residualized data (produced by the f function that we created). So now what do we do? Well, we can't easily work with that data yet, so let's write out a csv version of all those results:

#writing it out to csv file format
write.csv(results, file="residualstest.csv")
#here the results are written out as residualstest.csv

NOTE that we need to manually change the headings in this new residualstest.csv. So let's go into our LibreOffice suite (or other editor) and change the column names of residualstest.csv. Now let's call it residualstestmatrix.csv to show that it is a matrix.
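
Alternatively, you can skip the manual header editing by naming the results list before writing it out; a sketch under the same assumptions as above:

#name each residual vector after its structure so write.csv produces proper headers
names(results) <- my_column_names_sub
write.csv(as.data.frame(results), file="residualstestmatrix.csv", row.names=FALSE)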

Now you might be thinking we are going to correlate all the structures together... Wait! To do this properly we can't simply correlate. The most thorough approach is to correlate with bootstrapping. What is bootstrapping though? The bootstrapping procedure resamples subjects with replacement (some rows are drawn more than once, others not at all) and re-computes the correlation matrix on each resample, many times over. Let's go ahead and do this:

#doing the bootstrapping for residualstestmatrix.csv
#load the library
library(boot)

We will need to build the bootstrapping function around the correlation function. So lets go ahead and do that:

#Create a function to wrap around cor, to allow boot to resample the rows
boot.cor = function(data, i) {
  cor(data[i,])
}

#read the residualized matrix back in before bootstrapping
residualstestmatrix <- read.csv("residualstestmatrix.csv")
bootresults = boot(residualstestmatrix, boot.cor, R = 100000)

The resulting bootstrap runs are in bootresults$t. Note that each run is stored as a single flattened row rather than a matrix, so we must reshape the column means back into a matrix based on our structures. We correlated 22 structures, so we should end up with a 22x22 matrix: 22 columns and 22 rows.

#There are 22 structures in the original input so:
bootstrapmeans = matrix(colMeans(bootresults$t), nrow=22, ncol=22)

If you try to plot it like so:

#plotting it without names
corrplot(bootstrapmeans, method="square", type = "lower")

By default the colours are blue for positive and red for negative (the opposite of what we usually do). To change this, create a custom colour palette (the inversion of the default) and pass it to the corrplot call.

mypalette = colorRampPalette(rev(c("#67001F", "#B2182B", "#D6604D", "#F4A582",
                                   "#FDDBC7", "#FFFFFF", "#D1E5F0", "#92C5DE",
                                   "#4393C3", "#2166AC", "#053061")))

corrplot(..., col=mypalette(200))

You will notice that there are no names. You can either enter the names manually or specify the names in R as such:

#fixing the names
#you must know the order of your structures!
colnames(bootstrapmeans) <- c("R_CA1", "R_CA2/3", "R_CA4/DG", "R_SLRM", "R_Sub", "R_Alv", "R_Fimb", "R_Fornix", "R_Mam", "R_WMTot", "R_HippTotal", "L_CA1", "L_CA2/3", "L_CA4/DG", "L_SLRM", "L_Sub", "L_Alv", "L_Fimb", "L_Fornix", "L_Mam", "L_WMTot", "L_HippTotal")
rownames(bootstrapmeans) <- colnames(bootstrapmeans)

Now we can plot this all with the names!

#plotting it all again
corrplot(bootstrapmeans, method="square", type = "lower")

You can change the details of the corrplot function. For example, instead of "square" you can choose other shapes and colours; the values shown here are the defaults.
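
Putting the pieces together, one possible final call that combines the named matrix with the custom palette defined above:

corrplot(bootstrapmeans, method="square", type = "lower", col=mypalette(200))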

Create Attractive XY Scatter Plots

This code will show you options for creating and styling a simple XY scatter plot. In this example we will create a graph with age plotted on the x axis and volume on the y axis. We will also colour each point by age, and add a line of best fit with a confidence band.

#load required packages for the following
library(corrplot)
library(ggplot2)
library(Hmisc)
library(reshape2)
library(plyr)
library(gplots)
options(digits=3)
theme_set(theme_bw(base_size = 20))

The full code:

ggplot(OASIS_masterQC, aes(x=Age, y=L_HippTotal)) +
  geom_point(aes(colour = Age)) +
  scale_colour_gradient(guide = "colourbar", low="red", high = "black") +
  theme_gray() +
  xlab(expression(paste("Age (", years, ")"))) +
  ylab(expression(paste("Left Hippocampus Total Volume (", mm^3, ")"))) +
  ggtitle("Left Hippocampus Total Volume vs. Age") +
  geom_smooth(fill = "grey20", colour="blue", method=lm) +
  scale_y_continuous(limits=c(1800,3200))

In the following paragraphs we will deconstruct all of the components of the graph.

The first function is ggplot, which asks you to specify the data frame you are using (here OASIS_masterQC), as well as the x and y variables (here Age and left hippocampal volume respectively) that you want for the graph.

The geom_point function calls the specific kind of plot you would like to create; here, a point (scatter) plot. Other geoms create different plots (e.g. geom_bar for a bar graph). Inside the geom_point function we have an option stating geom_point(aes(colour = Age)) + scale_colour_gradient(guide = "colourbar", low="red", high = "black"). The first part says that you want to colour each dot by "Age". Note you can colour by a specific colour like "red" or "green", but here "Age" is written to colour by the variable "Age". But what colour will the dots be then? This is where we write scale_colour_gradient(guide = "colourbar", low="red", high = "black"), which means the lower the Age, the more red the dot, and the higher the Age, the more black the dot. Feel free to choose your own colours here (http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf).

The theme_gray() call sets the theme. Themes automatically change a bunch of stylistic components of the graph, including the background colour, font size, type, etc. Try typing theme_ and double-tapping the Tab key and you will see many options (theme_classic, theme_minimal, etc.).

The xlab(expression(paste("Age (", years, ")"))) component types out the x label you wish to have. Note that insertion of the brackets can be difficult; follow the format given here and switch in what you wish. Note that the y label, ylab(expression(paste("Left Hippocampus Total Volume (", mm^3, ")"))), also has a superscript. Follow this syntax to create your own superscript.

The ggtitle("Left Hippocampus Total Volume vs. Age") section does exactly as it implies; that is, it places a title of your choosing. Note that the title can be placed in other areas of the graph with additional functions.

Perhaps the most important part of this code is the line that says geom_smooth(fill = "grey20", colour="blue", method=lm). This adds a line of best fit (based on a linear model, i.e. method=lm) drawn in blue, with a confidence band filled with the colour "grey20". Change these options to create your own style and colour.

The scale_y_continuous(limits=c(1800,3200)) call places limits on the y axis only, so the y axis is restricted to volumes between 1800 and 3200 mm^3. Be wary that any points falling outside these limits will be excluded. The same can also be done for the x axis.
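
If you would rather zoom in without excluding any points from the fit, coord_cartesian() clips the view instead of dropping the data; a sketch of the same plot:

#zooms the y axis without removing points from geom_smooth's fit
ggplot(OASIS_masterQC, aes(x=Age, y=L_HippTotal)) +
  geom_point(aes(colour = Age)) +
  geom_smooth(fill = "grey20", colour="blue", method=lm) +
  coord_cartesian(ylim=c(1800,3200))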

Generating Boxplot from Peak Voxel

This code example is useful for generating boxplots at your peak voxel, if you want to highlight group differences (ex: between genotypes as shown below). You will need the x, y, and z coordinates of your peak voxel (in world coordinates).

library(RMINC)
library(ggplot2)

You will need to use original jacobians for this as you want the precise location of your voxel. From your t-statistic maps viewed in Display, you must input the coordinates of your peak voxel (ensure that they are world coordinates). To view most effectively, display the population average, and then the statistics map twice. We are going to load custom color maps. For one of your statistics maps, load the redhot color map: from the pop up menu, go to File (T). Go to Load UserDef ColC (Y). The GUI should open. Navigate to /opt/quarantine/resources/Display-colormaps. Choose hotred. This is your positive map. If you have significant z-scores at 1% FDR, set your upper bound to the 1% threshold and your lower bound to the 5% threshold. The areas in white are your peak volumes. Then, change your slice view (S-T). For the population average, set the color to grayscale (D-D). For the second stats map, load hot blue (T-Y-select hotblue). This time, you will use the negative z-scores to set your scale, so you are looking at areas of decrease in volume. To smooth everything, go to Q (volume config) and S (Interp:). In the mincGetWorldVoxel call below, X is v1, Y is v2, and Z is v3:

filtered.data$stri_voxel = mincGetWorldVoxel(filenames = filtered.data$orig_jacobians, v1 = -2.8, v2 = 2.3, v3 = 2.0 )
str.vox.genotype = subset(filtered.data, filtered.data$QC == 1 )

"Genotype" is the variable I am interested in comparing, so a box will be made for each genotype. "main": title, "theme_grey": grey color scheme, "base_size=7" for font size, ylab and xlab will let you label the x and y axes.

figure = qplot(Genotype, stri_voxel,geom="boxplot", data=str.vox.genotype, ylab="Relative Volume", xlab=NULL, main='Left Primary \nSomatosensory Cortex') + theme_grey(base_size=7)
figure

Since you've created your figure as "figure" you can now use the "ggsave" command to save the image as a .pdf; you can choose the width, height, and font size of your figure.

ggsave(filename="/data/scratch/gumeli/D2-basline-L-PSMC.pdf", device=cairo_pdf, plot=figure, width=35, height=26.25, units="mm", pointsize=7 )

Example of plotting two-level timepoints by faceting.

ggplot(squeakmean, aes(x=trial, y=mean, colour=treat_group)) +
  geom_errorbar(aes(ymin=mean-se, ymax=mean+se), width=.1) +
  geom_line() +
  geom_point() +
  facet_wrap(~timepoint, nrow=2) +
  labs(x="Trial#", y="Mean Latency") +
  theme(legend.title=element_blank())
#Plot using squeakmean's data (which has the computed means, sd's, se's, etc.), with x axis as trial number, y axis as the mean of the latencies.
#Plot error bars that are mean -/+ the standard error, with a certain width.
#Facet (or breakdown) the data by timepoint. This creates a new chart for every timepoint (timepoint must be a categorical variable). We want two rows of graphs.
#X axis label is Trial#. Y axis label is Mean Latency.
#Get rid of legend title.

Analysing multiple structure volumes per subject at once and doing FDR correction across multiple volumes at once

This code example is useful to analyse the effect of 'group' across a number of MAGeT volumes.

library(RMINC)
## Given a standard dataframe with columns 64:128 containing volume measures for subjects (HC and so on)
data$vols<-data[c(64:128)]
## this will write the various structure volumes found in variables 64 to 128 into a matrix called 'vols'

ana<-anatLm(~group, data, data$vols)
## runs a linear model analysis across all those volumes and saves 'F', 'R-squared', 'beta' ,and 't' values.

anatFDR(ana)
## does multiple comparison correction across the entire analysis and determines thresholds for
## the t-statistics for every model component.

write.csv(ana, "Analysis_by_group.csv")
## writes text file of analysis for further use (to send to collaborator for example)

Plotting the value of a given vertex across subjects

This code example is useful for plotting thicknesses and other vertex-based measurements using R/RMINC.

library(RMINC)

##Your main datafile, this needs to include subjects identifiers that you used in CIVET
gf = read.csv("path/to/my/data.csv")
gf$left_thickness = paste("path/to/civet/",gf$ID, "/thickness/PREFIX_",gf$ID,"_native_rms_rsl_tlink_20mm_left.txt", sep="")
##This command reads every file listed in gf$left_thickness and loads the data into a matrix
left.vertex.table = vertexTable(gf$left_thickness)

##Plot vertex 10495 thickness versus Age
n = 10495
plot(gf$Age, left.vertex.table[n+1,])

Plotting histograms of t-values with FDR thresholds

A nice way to visualize the t-values and FDR levels for a vertexLm run is to plot them as a histogram: this shows the magnitude of the effects, and at the same time displays how many vertices exceed the FDR thresholds.

The highlighted areas mark the FDR thresholds: 1% (dark green), 5% (light green), 10% (yellow) and 20% (red). The legend shows the FDR 5% threshold and the max and min t-values.

vertexHistogram = function(linmodel, fdrmodel) {
  columns = grep("Intercept", invert=TRUE, value=TRUE, grep("tvalue", colnames(fdrmodel), value=TRUE))
  for (column in columns) {
  #Generates a nice histogram of vertexLm t-values, with FDR levels labelled
  legendline = c(c("FDR 5%", attributes(fdrmodel)$thresholds["0.05", column]),
                 c("MAX", max(linmodel[, column])),
                 c("MIN", min(linmodel[, column])))
  hist(
    linmodel[, column],
    breaks = seq(floor(min(linmodel[, column])),
                 ceiling(max(linmodel[, column])), by = 0.25),
    xlab = paste(deparse(substitute(linmodel)), ",", column),
    main = paste(deparse(substitute(linmodel)), ",", column)
  )
  legend("topleft", legend = legendline)
  lim <- par("usr")
  rect(
    lim[1],
    lim[3],
    -attributes(fdrmodel)$thresholds["0.01", column],
    lim[4],
    border = '#00FF0055',
    col = '#00FF0077'
  )
  rect(
    attributes(fdrmodel)$thresholds["0.01", column],
    lim[3],
    lim[2],
    lim[4],
    border = '#00FF0055',
    col = '#00FF0077'
  )

  rect(
    -attributes(fdrmodel)$thresholds["0.01", column],
    lim[3],
    -attributes(fdrmodel)$thresholds["0.05", column],
    lim[4],
    border = '#00FF0055',
    col = '#00FF0055'
  )
  rect(
    attributes(fdrmodel)$thresholds["0.05", column],
    lim[3],
    attributes(fdrmodel)$thresholds["0.01", column],
    lim[4],
    border = '#00FF0055',
    col = '#00FF0055'
  )

  rect(
    -attributes(fdrmodel)$thresholds["0.05", column],
    lim[3],
    -attributes(fdrmodel)$thresholds["0.1", column],
    lim[4],
    border = '#FFFF0055',
    col = '#FFFF0055'
  )
  rect(
    attributes(fdrmodel)$thresholds["0.1", column],
    lim[3],
    attributes(fdrmodel)$thresholds["0.05", column],
    lim[4],
    border = '#FFFF0055',
    col = '#FFFF0055'
  )

  rect(
    -attributes(fdrmodel)$thresholds["0.1", column],
    lim[3],
    -attributes(fdrmodel)$thresholds["0.2", column],
    lim[4],
    border = '#FF000055',
    col = '#FF000055'
  )
  rect(
    attributes(fdrmodel)$thresholds["0.2", column],
    lim[3],
    attributes(fdrmodel)$thresholds["0.1", column],
    lim[4],
    border = '#FF000055',
    col = '#FF000055'
  )
  }
}

#vs.lin.left is the output of a vertexLm run
#fdr.lin.left is the output of a vertexFDR on vs.lin.left
#Age is the original name of the model variable

vs.lin.left = vertexLm(left_thickness ~ M.F + Age, data )
fdr.lin.left = vertexFDR(vs.lin.left)


vertexHistogram(vs.lin.left, fdr.lin.left)

Random Effects Syntax in lme

In a mixed effects model, you can specify random effects. Whether you are new to R or not, you may find the syntax for specifying random effects a bit funky. Here is a breakdown (a combined sketch follows the three forms below):

Intercept Only Model - Allow each Group to have its own intercept

(1|Group)

Correlated Intercept and Slope Model - Allow each Group to have its own intercept, and allow the effect of Age to vary by Group. In this notation the intercept and slope are allowed to correlate, i.e. the model also estimates whether Groups with higher intercepts tend to have steeper (or shallower) slopes

(1+Age|Group)

Uncorrelated Intercept and Slope Model - Allow each Group to have its own intercept, and allow the effect of Age to vary by Group. Here, the correlation between intercept and slope is constrained to zero (no correlation parameter is estimated).

(1|Group) + (Age-1|Group)
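
A minimal lme4 sketch putting the three forms side by side; df, Score, Age, and Group are hypothetical names:

library(lme4)

m1 <- lmer(Score ~ Age + (1|Group), data=df)                 #intercept only
m2 <- lmer(Score ~ Age + (1+Age|Group), data=df)             #correlated intercept and slope
m3 <- lmer(Score ~ Age + (1|Group) + (Age-1|Group), data=df) #uncorrelated; (1+Age||Group) is shorthand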

The order of predictors matters: How Type I/II/III sums of squares can influence your regression results with lm in R

Hello all, after a short discussion with a few people in the lab, we thought that it would be a good idea to circulate this information. This is important for anyone who is currently running linear models (lm) in R or will be doing so at any point in the future. (I provide a bit of a summary here, but I definitely do not touch everything that is covered in the link included below, so please refer to it for the details.)

The crux of the issue is that R treats predictor variables in a linear regression in an ordered fashion, such that the first predictor takes up as much variance as possible, the next then takes from what is left, and so on. This means that the position of predictors in the equation matters (Type I SS). I.e., if you want a linear regression that controls for the effects of age and sex on the relationship between behaviour and brain volume, then you would need to enter age and sex first and behaviour last. Note that order will not always have a big effect; from what I have read it is primarily an issue when the design of the experiment (or test) is unbalanced and the predictors are not orthogonal. In many cases, the results will not change at all.

e.g.: volume ~ age + sex + behaviour (correct way to run an analysis in R that controls for age and sex). Note that if you ran this analysis: volume ~ behaviour + age + sex you would not necessarily be controlling for the effects of age and sex (again, depending on how balanced the design is and whether the variables are orthogonal, this could have a large or negligible effect).

If you are using and/or interested in comparing your results with additional tools (such as SPSS and SAS), their defaults are that order does not matter (type III SS). There are, of course, times when you have a specific hypothesis that you can test with ordered entry of variables.

If you want to be able to talk about the unique effects of all of the predictors in your model, then you are probably interested in running a lm with type III SS. See the link for details though, there are more issues to consider :-/.

So you know, there are ways to calculate the regression based on Type II and Type III SS in R. I have taken one method for calculating results based on I and III and implemented it below. If anyone has experience with setting/unsetting these options I would appreciate if you could confirm that this does what I expect it to do when implemented in RStudio (the R universe seems to do random things sometimes...)

Reference: http://goanna.cs.rmit.edu.au/~fscholer/anova.php

lm_type1_and_3 <- function(model_formula,data){
  # function to calculate lm regression based on type I and type III sum of squares
  # both stored in output variable and accessible as usual
  # for details, see: http://goanna.cs.rmit.edu.au/~fscholer/anova.php

  my_contrasts = getOption("contrasts")   #get the default contrast option that has been set (or was the original default)
  #print(my_contrasts)
  new_contrasts = my_contrasts
  new_contrasts[1] = "contr.sum"
  new_contrasts[2] = "contr.poly"

  res_out = list()
  res_orig <- lm(model_formula, data=data)

  options(contrasts = new_contrasts)
  res <- lm(model_formula, data=data)
  res_type3 <- drop1(res, .~., test="F")

  res_out$res_type1 = res_orig
  res_out$res_type3 = res_type3
  #print(options()$contrasts)
  options(contrasts=my_contrasts) # set the contrasts back to what they were originally
  #print(options()$contrasts)
  return(res_out)
}
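
A hypothetical usage, assuming a dataframe df with volume, age, sex, and behaviour columns:

res <- lm_type1_and_3(volume ~ age + sex + behaviour, df)
anova(res$res_type1)   #type I (sequential) table for the original fit
res$res_type3          #type III table from drop1
summary(res$res_type1) #coefficients as usual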

Regressing vertex-wise data against vertex-wise data

You may have vertex-wise data (e.g. cortical thickness, tissue contrast, T1 intensity samples) and want to take other vertex-wise measurements into account in your analysis by residualizing. For example, Waehnert and colleagues (2014, 2016) have found that T1 intensity values vary as a function of local curvature; if you are working with vertex-wise data that is related to intensity, you may want to take curvature into account when performing regression models on your data.

Conventional linear regression tools like mincLm, vertexLm, vertexLmer, etc. do not offer the ability to extract residuals after a regression. The proposed residualizing method is equivalent to something like vertexLm that runs a linear model at each vertex, except this method returns the residuals as output.

# This loop conducts a linear model at each vertex.
# The response variable is the distribution of response values at vertex i across your subjects, and your covariate is the distribution of covariate values at vertex i across the same subjects.
# Tables of this format (rows = vertices, columns = subjects) can be easily created by calling the vertexTable function (from the RMINC package) on an array of filenames.

# preallocate the output matrix with the same dimensions (and subject columns) as the input
residualized_vertexwise_data <- matrix(nrow = nrow(vertexwise_response_variable),
                                       ncol = ncol(vertexwise_response_variable),
                                       dimnames = dimnames(vertexwise_response_variable))

for (i in 1:40962){
  residualized_vertexwise_data[i,] <- residuals(lm(vertexwise_response_variable[i,] ~ vertexwise_covariate[i,]))
}

The residuals are now in a giant table with subjects as columns and vertex-wise residuals as rows, and can be parsed back into single-subject files for future regression analyses.

for (i in 1:ncol(residualized_vertexwise_data)){
  txt <- residualized_vertexwise_data[,i]
  filename <- colnames(residualized_vertexwise_data)[i] # subject ID's go here
  filename <- paste(filename, "_vertexdata.txt", sep = "")
  path <- "/path/to/subjectwise/files/"
  filename <- paste(path, filename, sep = "")
  write.table(txt,filename,sep="\t",row.names=FALSE, col.names =  FALSE)
}
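
Later, those per-subject files can be loaded back into a vertex-by-subject matrix, e.g. with RMINC's vertexTable as shown earlier on this page; a sketch assuming a gf dataframe whose gf$ID values match the subject IDs used in the filenames:

gf$residual_files <- paste("/path/to/subjectwise/files/", gf$ID, "_vertexdata.txt", sep="")
residual_table <- vertexTable(gf$residual_files)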

Dark mode for ggplots (black background and white titles)

#new way to make a dark theme - need to manually set parameters (see theme commands) - since ggdark does not work for the latest R versions
library(ggplot2)
library(ggsignif) #needed for geom_signif below
ggplot(df, aes(x=group, y=C4)) +  geom_boxplot(aes(fill = group), colour="white") +
  # scale_fill_manual(values = c(brewer.pal(3, "Set1")[2],brewer.pal(3, "Set1")[1])) +
  scale_fill_manual(values = c("#619CFF", "#F564E3")) +
  geom_jitter(color="white", size=0.6, alpha=0.9) +
  geom_signif(annotations = c(paste("p=",round(summary(lm(C4~group*sex, data=df))$coefficients[2,4],digits=3))),
              y_position = c(max(df$C4)+5), xmin=c(1), xmax=c(2), textsize=6, colour = "white") +  # manually add position using x and y coordinates
  ylab("Subject Component Weights") + xlab ("Group") +
  theme(legend.position="none") +
  # theme(legend.title=element_blank(),legend.background = element_rect(colour="black"), legend.text=element_text(colour="white")) +
  theme(panel.background=element_rect(fill="grey10"), panel.grid = element_line(colour="grey30")) +
  theme(plot.background=element_rect(fill="black", colour="black"), text=element_text(colour="white"), axis.text = element_text(colour="white")) +
  theme(text = element_text(size=20)) +
  theme(plot.title = element_text(hjust = 0.5)) +
  ggtitle('Component 4')
#old way, using the ggdark package (ggdark only works with older R versions)
library(ggdark)

ggplot(data = DABcellTissuearea, aes(y = pSyn.str.right, x = group)) +
  geom_boxplot(aes(fill = group)) +                                     # create boxplots and colour code the groups
  ylab("DAB cell per tissue area") + xlab ("Group") +                   # label axes
  ggtitle("Right Striatum (p-aSyn)") +                                  # add title
  theme(text = element_text(size=20)) +                                 # set size of text
  dark_theme_gray() +                                                   # dark theme
  theme(panel.grid.major = element_line(color = "grey30", size = 0.2),
        panel.grid.minor = element_line(color = "grey30", size = 0.2))  # makes background grid gray (instead of black)

Add significance notation to plot using geom_signif

library(ggsignif)

ggplot(data = DABcellTissuearea, aes(y = pSyn.str.right, x = group)) +
  geom_boxplot(aes(colour = group, fill = sex)) +                       # create boxplots for the different groups and split by sex
  scale_colour_manual(values = c("black", "black")) +                   # set coloured outline for group
  scale_fill_manual(values = c("#F8766D","#00BFC4")) +                  # set coloured fill for sex (red and blue)
  geom_signif(annotations = c("p=0.018","p=0.01"),                      # add significance notation on plot (can add the p value or just the * notation)
              y_position = c(3.05e-05, 3.25e-05),                       # manually add position using x and y coordinates
              xmin=c(1.80,1.20), xmax=c(2.20,2.20),
              colour = "white") +
  ylab("DAB cell per tissue area") + xlab ("Group") +                   # label axes
  ggtitle("Right Striatum (p-aSyn)") +                                  # add title
  theme(text = element_text(size=20)) +                                 # set size of text
  dark_theme_gray() +                                                   # dark theme
  theme(panel.grid.major = element_line(color = "grey30", size = 0.2),
        panel.grid.minor = element_line(color = "grey30", size = 0.2))  # makes background grid gray (instead of black)

Visualising subject lines on plot

An example of code to plot weights across time (number of days post-injection). This code plots the weight of mice of two genotypes (WT and hemizygous) that were injected with either PBS, Hu-PFF or Ms-PFF, using facet_wrap to visualize the males and the females on separate subplots. Note the colour coding of the points and lines according to injection group, and the linestyle corresponding to genotype. You can use alpha to set the opacity. The code below uses the approach written up in a separate wiki page on properly plotting lm and lmer models.

  #assumed packages: lme4 for lmer, effects for Effect, RColorBrewer for brewer.pal
  library(lme4)
  library(effects)
  library(RColorBrewer)
  library(ggplot2)

  model = lmer(weight ~ genotype*injection*poly(X.dpi,3)*sex +
                       (1|litter_size) + (1|X.mice.in.cage) +
                       (1|batch) + (1|cohort) + (1|animal_ID), data= subset(data, !is.na(data$X.dpi)&!is.na(data$weight)))

  fitdata_jacob= as.data.frame(Effect(c("genotype", "injection", "X.dpi"),model,xlevels=list(X.dpi=seq(-12,126,1))))

  print(
  ggplot(data = subset(data, !is.na(data$X.dpi)&!is.na(data$weight)),
         aes(x=X.dpi, y=weight, group=injection:genotype, colour = injection, linetype=genotype)) +
    geom_point(aes(y = weight, x = X.dpi, fill=sex, shape=genotype, colour=injection, alpha=0.8)) +
    geom_line(data = fitdata_jacob, aes(y=fit),size=1, alpha=0.8) +
    geom_line(aes(group = animal_ID, linetype=genotype, alpha=0.7)) + #PLOTTING SUBJECT LINES
    geom_ribbon(data = fitdata_jacob, aes(y=fit, ymin=lower, ymax=upper, color=NULL), alpha=0.1) +
    scale_colour_manual(values = c(brewer.pal(3, "Set1")[2],brewer.pal(3, "Set1")[1],brewer.pal(3, "Set1")[3])) +
    scale_fill_manual(values = c("#619CFF", "#F564E3")) +
    scale_shape_manual(values=c(21, 24))+
    facet_wrap(~sex) + # makes two subplots; one for each sex
    labs(x = "Days post-injection", y= "Weight (grams)", title = paste())+
    theme(plot.title = element_text(hjust = 0.5), text = element_text(size=15))
  )

(figure: plot-weight-longitudinal-sexeffects)

Chord Diagram in R

A chord diagram can be useful for displaying relational/network type data in which you want to show how different data points relate to each other. For example, you may have regions as data points and connectivity type metrics for different pairs of regions, and you want to visualize connectivity for each pair of regions. Your data points/nodes/regions are arranged in a circle, and edges are drawn between regions according to the connectivity value. You can colour code both nodes and edges according to any grouping of interest.

You'll require the tidygraph and ggraph libraries. You'll need two .csv files - one listing your nodes and their attributes and another listing your edges/connections and their attributes. Attributes refer to any grouping information that you want encoded into the graph. For example, if you want nodes to have certain colours, that needs to be included in the nodes spreadsheet. The code below is taken from a sample script; you can find this script as well as the edges and nodes csv at /data/chamal/projects/raihaan/projects/misc/wikisamples/chord_diagram if you want to grab a copy and experiment.

At the most basic level, your nodes data needs to contain the name or identifier of each node. You can also include colour or size variables to specify in the graph:

library(tidygraph)
library(ggraph)

my_nodes<-read.csv('nodes.csv')
my_nodes[1:3,]
#      name               network_name network_num R G   B
#1 R_V3_ROI 2_Early_Visual_Cortex__ROI           2 0 0 255
#2 R_V4_ROI 2_Early_Visual_Cortex__ROI           2 0 0 255
#3 R_V2_ROI 2_Early_Visual_Cortex__ROI           2 0 0 255

my_edges<-read.csv('edges.csv')
my_edges[1:3,]
#      from        to     weight component
#1 R_V4_ROI  R_V2_ROI 0.03273383         4
#2 R_V3_ROI  R_V2_ROI 0.03266655         4
#3 R_V3_ROI R_MST_ROI 0.03261715         4

Using the tbl_graph function, you can make a basic diagram as follows:

my_nodes<-read.csv('nodes.csv')
my_edges<-read.csv('edges.csv')

# my_nodes$name my_edges$from my_edges$to should be characters, not factors
my_nodes$name<-as.character(my_nodes$name)
my_edges$from<-as.character(my_edges$from)
my_edges$to<-as.character(my_edges$to)

tbl_graph(nodes = my_nodes, edges = my_edges, directed=TRUE) %>%
  ggraph(layout = 'linear', circular = TRUE) +
  geom_edge_arc() +
  geom_node_point(size = 7) +
  theme_graph() +
  theme(legend.position = "none")

(figure: basic chord diagram)

There are a variety of different layout options. See ?ggraph for more. You can also specify geom_edge_link() instead of geom_edge_arc() to get straight lines instead of smooth arcs as edges. A note of caution on the legend - it is hidden in the above code. If you have many nodes each with unique names, turning on the legend will show a symbol for each node. Probably not ideal if you have, say, 100 nodes. Later examples show how to select what to show in the legend.

You can colour code the nodes so long as you have some colour info for them in your nodes spreadsheet. In this example I have rgb codes which first need to be converted to hex codes.

my_nodes<-read.csv('nodes.csv')
my_edges<-read.csv('edges.csv')

#my_nodes$name my_edges$from my_edges$to should be characters, not factors
my_nodes$name<-as.character(my_nodes$name)
my_edges$from<-as.character(my_edges$from)
my_edges$to<-as.character(my_edges$to)

my_nodes$R<-my_nodes$R/255
my_nodes$G<-my_nodes$G/255
my_nodes$B<-my_nodes$B/255

node_colours<-vector(mode='character')

for (x in seq(1,length(my_nodes$name))) {
  node_colours<-append(node_colours, rgb(my_nodes[x,'R'], my_nodes[x,'G'], my_nodes[x,'B']))
}
my_nodes$node_colour<-node_colours
names(node_colours)<-my_nodes$network_name #makes it a named vector. Seems to be required

tbl_graph(nodes = my_nodes, edges = my_edges, directed=TRUE) %>%
  ggraph(layout = 'linear', circular = TRUE) +
  geom_edge_arc() +
  geom_node_point(aes(colour = network_name), size = 7) +
  scale_color_manual(values = node_colours, guide=FALSE) +
  theme_graph() +
  theme(legend.position = "none")

As shown above, colour coding requires specifying aes(colour=) in the geom specification, but also using the scale_color set of functions. Here, guide=FALSE specifies that we don't want this colour coding to show up in the legend. Adjust if applicable.

(figure: chord diagram with coloured nodes)

Similarly, you can colour code edges by modifying the aes setting of geom_edge_arc:

tbl_graph(nodes = my_nodes, edges = my_edges, directed=TRUE) %>%
  ggraph(layout = 'linear', circular = TRUE) +
  geom_edge_arc(aes(edge_colour = as.factor(component))) +
  geom_node_point(aes(colour = network_name), size = 7) +
  scale_color_manual(values = node_colours, guide=FALSE) +
  scale_edge_colour_manual(values=c("#883d0c",
                                    "#036eec",
                                    "#f3be5b",
                                    "#eb5bdb",
                                    "#01c6ac"),
                           name='Component',guide = guide_legend(override.aes = list(edge_width = 3) )) +
  theme_graph() +
  theme(legend.position = "right", legend.text = element_text(size=20), legend.key.width = unit(1,"cm"))

In the above, scale_edge_colour_manual() is used to manually colour code the 5 levels of the component variable. You could replace this with predefined colours if appropriate, more like what is done for the node colours. Note that guide=FALSE is not passed to scale_edge_colour_manual(), so the legend does show this info.

(figure: chord diagram with coloured nodes and edges)

This describes how to make a basic chord diagram, but tidygraph and ggraph have a huge array of different layouts to choose from for network/relational data. Some references for more info and options:
https://www.r-graph-gallery.com/309-intro-to-hierarchical-edge-bundling.html
https://kmlawson.github.io/dh-tutorials/network.html
https://rstudio-pubs-static.s3.amazonaws.com/341807_7b9d1a4e787146d492115f90ad75cd2d.html

Extract subject specific intercepts and coefficients

When running a mixed effect model looking at a measure in relation to time in R, you can use the coef function in R to extract an intercept and coefficient for each subject which describes their modelled baseline and change over time. This can be useful for looking at how each person declines over time, for example.

library(lme4)

#run your mixed effects model
results<-lmer(volume~time + (1+time|subjectID), data=data)
#extract the subject specific coefficients
subj_coef <- coef(results)$subjectID

subj_coef is now a dataframe with 1 row for every subject, a column for their intercept, and a column for the coefficient of the time variable, which describes the effect of time in each individual. If you have more covariates in your model, coef will return a coefficient for each of them.

You can access both the fixed (global) effects and random (subject deviations from global) effects in R. coef gives you the global + subject specific effects, while fixef and ranef can be used to extract only the fixed or random effects.

> fixef(results) #global effect
(Intercept)     time
18.14759513 -0.09014879
> ranef(results)$subjectID[1:3,] #estimated deviation per subj. the indexing is just to show first 3 subjects
   (Intercept)      time
13    1.149758 -0.006270924
15   -4.810939  0.076446363
18   -9.569530 -0.007058805
> coef(results)$subjectID[1:3,] #global + subj deviation
   (Intercept)     time
13   19.297354 -0.09641971
15   13.336657 -0.01370242
18    8.578066 -0.09720759

Note: depending on the package you are using to run the mixed effects model, the output format of coef may be slightly different, and you may not have to append $subjectID to the call. Run a test of coef(results) to see what the output format is for your specific package.
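
For example, a quick way to inspect that structure:

str(coef(results)) #shows how the returned object is organized (e.g. a list indexed by subjectID)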

Saving the output of a function to a text file

In this section, you'll learn how to save the output of a variable to a text file. In my case, I saved the output of the minc false discovery rate function (mincFDR).

To show you a little bit where my code is coming from, let me show you the linear mixed-effects model I ran:

s2mlmer <- mincLmer(resampled_log_full_det ~ TreatmentGroup*Age_days + (1|ID),
                   data = s2m,
                   mask = mask,
                   parallel = c("slurm", 200),
                   REML = TRUE)

After this was executed, I used mincFDR (plus mincLmerEstimateDF, the function to calculate degrees of freedom) to control for false positives, so that I can identify significant results:

fdrs2m <- mincFDR(mincLmerEstimateDF(model = s2mlmer), mask = mask)

Now the variable fdrs2m contains my FDR results; if I printed them in the terminal, they would look like this:

>fdrs2m
Multidimensional MINC volume
Columns:       qvalue-(Intercept) qvalue-TreatmentGroupHFD_Tg qvalue-Age_days qvalue-TreatmentGroupHFD_Tg:Age_days
[1] "/data/chamal/projects/chloe/2017_lifestyle/Chloe/3_derivatives/scan/Pydpiper/RCA_25August2019_first_level/10_001_processed/mch_rca_1210_001_20180905_114432_131073_mri_lsq6/mch_rca_1210_001_20180905_114432_131073_mri_lsq6_I_lsq6_lsq12_and_nlin_inverted_displ_log_det_abs_fwhm0.2_res_to_RCA_25August2019_second_level-nlin-3.mnc"
Degrees of Freedom: 23.999999999565 23.9999999994728 23.9999999994992 23.9999999995242
FDR Thresholds:
     tvalue-(Intercept) tvalue-TreatmentGroupHFD_Tg tvalue-Age_days
0.01           4.494333                    5.609232        4.344495
0.05           3.326460                    3.930236        3.223780
0.1            2.772412                    3.158971        2.689660
0.15           2.426796                    2.714389        2.353925
0.2            2.173277                    2.408499        2.102806
     tvalue-TreatmentGroupHFD_Tg:Age_days
0.01                             5.632350
0.05                             3.800115
0.1                              3.026946
0.15                             2.600388
0.2                              2.307256
>

Since I wanted to run several models overnight, I used the following lines of code to save the outputs in a text file, and look at my results in the morning (explanation follows):

#save FDR Results
cat("fdrs2m", file = "HFD_FDR_Results.txt", append = TRUE) #add subtitle
cat("\n", file = "HFD_FDR_Results.txt", append = TRUE) #add a newline
capture.output(fdrs2m, file="HFD_FDR_Results.txt", append=TRUE) #add results

cat("\n\n", file = "HFD_FDR_Results.txt", append = TRUE) #add two newlines

cat("fdrs2f", file = "HFD_FDR_Results.txt", append = TRUE) #add subtitle
cat("\n", file = "HFD_FDR_Results.txt", append = TRUE) #add a newline
capture.output(fdrs2f, file="HFD_FDR_Results.txt", append=TRUE) #add results
cat("\n", file = "HFD_FDR_Results.txt", append = TRUE) #add a newline

The lines above tell R to append the text "fdrs2m" to the file "HFD_FDR_Results.txt" (you don't need to create this file; R will create it if it doesn't exist). The append argument tells the cat function whether the information should be added to the end of the file; if append is set to FALSE, new information will overwrite the current content of the indicated file (be careful). Then "\n" adds a newline. Then the actual output of the variable fdrs2m is added with the capture.output function. Then two newlines ("\n\n") are added, and the next FDR analysis (fdrs2f) is pasted in the same way.

If you fail to add subtitles, you will have the results but you won't be able to identify which analysis they came from. Do your own tests before you rely on these lines.
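
An alternative sketch using base R's sink(), which redirects all console output into a file until it is turned off:

#everything printed between the two sink() calls lands in the file
sink("HFD_FDR_Results.txt", append = TRUE)
cat("fdrs2m\n")
print(fdrs2m)
cat("\n\nfdrs2f\n")
print(fdrs2f)
sink() #restore output to the console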
