Skip to content
Fetching contributors… Cannot retrieve contributors at this time
143 lines (109 sloc) 5.36 KB
 --- output: pdf_document --- # Solution of 8.23.4, Titles and citations --- Letchford et al. 2015 > Letchford et al. (2015) found an interesting pattern: papers that have shorter titles tend to fare better in terms of citations. They took top-cited papers from a variety of journals, and ranked them by title length (in number of characters), and number citations received (as of November 2014). Then they performed a Kendall's tau test to see whether these rankings are correlated. A negative correlation would mean that the articles with longer titles tend to be ranked low for citations. > The file `Letchford2015_data.csv` contains the data needed to replicate their results. > 1. Write a program that performs the test described above using all the papers published in 2010. The program should do the following: 1) read the data; 2) extract all the papers published in 2010; 3) rank the articles by citations, and by title length; 4) compute the Kendall's tau expressing the correlation between the two rankings. For this dataset, the Authors got a tau of about -0.07 with a significant p-value. First, we need to read the data: ```{r} # Read the data l2015 <- read.csv("../data/Letchford2015_data.csv", stringsAsFactors = FALSE) # Check dimensions dim(l2015) # Print first few lines head(l2015) ``` Now extract only the papers published in 2010: ```{r} p2010 <- l2015[l2015\$year == 2010, ] ``` To rank the manuscripts according to title length, we can use the function `rank` (check `?rank`), which ranks the entries, resolving possible ties: ```{r} # Example of use of rank rank(c(1, 2, 3, 2, 1, 5, 3, 4)) ``` Let's store the ranking of title lengths and citations separately: ```{r} rank_titlelength <- rank(p2010\$title_length) rank_citations <- rank(p2010\$cites) ``` Finally, let's calculate Kendall's tau. A simple way is to invoke `cor`: ```{r} cor(rank_citations, rank_titlelength, method = "kendall", use = "pairwise") ``` Which is similar to that reported by the Authors (they peformed some filtering of the articles before analyzing them, so we don't expect a perfect match). To get also a *p*-value, you can use `cor.test`: ```{r} cor.test(rank_citations, rank_titlelength, method = "kendall", use = "pairwise") ``` > 2. Write a function that repeats the analysis for a particular journal-year combination. Try to run the function for the top scientific publications `Nature` and `Science`, and for the top medical journals `The Lancet` and `New Eng J Med`, for all years in the data (2007-2013). Do you always find a negative, significant correlation (i.e., negative tau with low *p*-value)? For this point, we need to write a function, that takes as input the data, a `journal` and a `year`. The function will then extract the relevant data, and run the test. We can start building the function from this skeleton: ```{r} compute_tau_journal_year <- function(my_data, my_journal, my_year) { # First, filter the data my_subset <- my_data[my_data\$journal == my_journal & my_data\$year == my_year, ] print(c(my_journal, my_year, "Articles:", dim(my_subset))) } ``` and try running it to make sure everything is good: ```{r} compute_tau_journal_year(l2015, "Nature", 2010) ``` Next, we write the analysis proper: ```{r} compute_tau_journal_year <- function(my_data, my_journal, my_year) { # First, filter the data my_subset <- my_data[my_data\$journal == my_journal & my_data\$year == my_year, ] # Rank by title length and citations rank_titlelength <- rank(my_subset\$title_length) rank_citations <- rank(my_subset\$cites) # Return the value of tau return(data.frame(Journal = my_journal, Year = my_year, tau = cor(rank_citations, rank_titlelength, method = "kendall", use = "pairwise"))) } # Try running it compute_tau_journal_year(l2015, "Nature", 2010) ``` We can write a fancier version that stores also the *p*-values, and checks that there are enough articles: ```{r} compute_tau_journal_year <- function(my_data, my_journal, my_year) { # First, filter the data my_subset <- my_data[my_data\$journal == my_journal & my_data\$year == my_year, ] if (dim(my_subset) < 2) { tau <- NA p.value <- NA } else { # Rank by title length and citations rank_titlelength <- rank(my_subset\$title_length) rank_citations <- rank(my_subset\$cites) # Run the test my_test <- cor.test(rank_citations, rank_titlelength, method = "kendall", use = "pairwise") tau <- as.numeric(my_test\$estimate) p.value <- as.numeric(my_test\$p.value) } return(data.frame(Journal = my_journal, Year = my_year, tau = tau, p.value = p.value)) } compute_tau_journal_year(l2015, "Nature", 2010) ``` Now let's run it for all years and a few journals: ```{r} results <- data.frame() for (year in 2007:2013){ for (jr in c("Nature", "Science", "The Lancet", "New Eng J Med")) { results <- rbind(results, compute_tau_journal_year(l2015, jr, year)) } } ``` We can see that we can get both positive or negative tau(s): ```{r} results ``` However, to be sure we have a meaningful result, we should correct for multiple testing (when trying very many tests, we can obtain a number of significant *p*-values just by chance), either applying Bonferroni's correction, or using more sophisticated false-discovery-rate approaches.
You can’t perform that action at this time.