Skip to content
Permalink
Branch: master
Find file Copy path
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
143 lines (109 sloc) 5.36 KB
---
output: pdf_document
---
# Solution of 8.23.4, Titles and citations --- Letchford et al. 2015
> Letchford et al. (2015) found an interesting pattern: papers that have shorter titles tend to fare better in terms of citations. They took top-cited papers from a variety of journals, and ranked them by title length (in number of characters), and number citations received (as of November 2014). Then they performed a Kendall's tau test to see whether these rankings are correlated. A negative correlation would mean that the articles with longer titles tend to be ranked low for citations.
> The file `Letchford2015_data.csv` contains the data needed to replicate their results.
> 1. Write a program that performs the test described above using all the papers published in 2010. The program should do the following: 1) read the data; 2) extract all the papers published in 2010; 3) rank the articles by citations, and by title length; 4) compute the Kendall's tau expressing the correlation between the two rankings. For this dataset, the Authors got a tau of about -0.07 with a significant p-value.
First, we need to read the data:
```{r}
# Read the data
l2015 <- read.csv("../data/Letchford2015_data.csv", stringsAsFactors = FALSE)
# Check dimensions
dim(l2015)
# Print first few lines
head(l2015)
```
Now extract only the papers published in 2010:
```{r}
p2010 <- l2015[l2015$year == 2010, ]
```
To rank the manuscripts according to title length, we can use the function `rank` (check `?rank`), which ranks the entries, resolving possible ties:
```{r}
# Example of use of rank
rank(c(1, 2, 3, 2, 1, 5, 3, 4))
```
Let's store the ranking of title lengths and citations separately:
```{r}
rank_titlelength <- rank(p2010$title_length)
rank_citations <- rank(p2010$cites)
```
Finally, let's calculate Kendall's tau. A simple way is to invoke `cor`:
```{r}
cor(rank_citations, rank_titlelength, method = "kendall", use = "pairwise")
```
Which is similar to that reported by the Authors (they peformed some filtering of the articles before analyzing them, so we don't expect a perfect match). To get also a *p*-value, you can use `cor.test`:
```{r}
cor.test(rank_citations, rank_titlelength, method = "kendall", use = "pairwise")
```
> 2. Write a function that repeats the analysis for a particular journal-year combination. Try to run the function for the top scientific publications `Nature` and `Science`, and for the top medical journals `The Lancet` and `New Eng J Med`, for all years in the data (2007-2013). Do you always find a negative, significant correlation (i.e., negative tau with low *p*-value)?
For this point, we need to write a function, that takes as input the data, a `journal` and a `year`. The function will then extract the relevant data, and run the test.
We can start building the function from this skeleton:
```{r}
compute_tau_journal_year <- function(my_data, my_journal, my_year) {
# First, filter the data
my_subset <- my_data[my_data$journal == my_journal & my_data$year == my_year, ]
print(c(my_journal, my_year, "Articles:", dim(my_subset)[1]))
}
```
and try running it to make sure everything is good:
```{r}
compute_tau_journal_year(l2015, "Nature", 2010)
```
Next, we write the analysis proper:
```{r}
compute_tau_journal_year <- function(my_data, my_journal, my_year) {
# First, filter the data
my_subset <- my_data[my_data$journal == my_journal & my_data$year == my_year, ]
# Rank by title length and citations
rank_titlelength <- rank(my_subset$title_length)
rank_citations <- rank(my_subset$cites)
# Return the value of tau
return(data.frame(Journal = my_journal,
Year = my_year,
tau = cor(rank_citations, rank_titlelength,
method = "kendall", use = "pairwise")))
}
# Try running it
compute_tau_journal_year(l2015, "Nature", 2010)
```
We can write a fancier version that stores also the *p*-values, and checks that there are enough articles:
```{r}
compute_tau_journal_year <- function(my_data, my_journal, my_year) {
# First, filter the data
my_subset <- my_data[my_data$journal == my_journal & my_data$year == my_year, ]
if (dim(my_subset)[1] < 2) {
tau <- NA
p.value <- NA
} else {
# Rank by title length and citations
rank_titlelength <- rank(my_subset$title_length)
rank_citations <- rank(my_subset$cites)
# Run the test
my_test <- cor.test(rank_citations, rank_titlelength,
method = "kendall", use = "pairwise")
tau <- as.numeric(my_test$estimate)
p.value <- as.numeric(my_test$p.value)
}
return(data.frame(Journal = my_journal,
Year = my_year,
tau = tau,
p.value = p.value))
}
compute_tau_journal_year(l2015, "Nature", 2010)
```
Now let's run it for all years and a few journals:
```{r}
results <- data.frame()
for (year in 2007:2013){
for (jr in c("Nature", "Science", "The Lancet", "New Eng J Med")) {
results <- rbind(results,
compute_tau_journal_year(l2015, jr, year))
}
}
```
We can see that we can get both positive or negative tau(s):
```{r}
results
```
However, to be sure we have a meaningful result, we should correct for multiple testing (when trying very many tests, we can obtain a number of significant *p*-values just by chance), either applying Bonferroni's correction, or using more sophisticated false-discovery-rate approaches.
You can’t perform that action at this time.