Measuring the most prominent people on Wikipedia
wikipedia-rankings

Support files for TIME's ranking of the prominent people on Wikipedia.

Data was collected over several days in May using node-wikipedia, a Node.js module maintained by @wilson428.

We considered eight data points for each entry:

  • Number of words
  • Number of links to other Wikipedia pages
  • Number of external links (which are typically references)
  • Number of categories the person is in
  • Total number of revisions to the page
  • Number of unique individuals who have edited the page as signed-in editors
  • Number of anonymous edits
  • Number of vandalisms, as identified in editing notes

Data for the top 100,000-or-so people is available as a 15MB CSV file (people.csv in this repository).
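
As a rough sketch of how a file like this might be loaded and sanity-checked in R: the column names below are hypothetical stand-ins for the eight variables, not the actual headers of people.csv, and the two rows are placeholders written to a temp file so the example runs without the real download.

```r
# Hypothetical column names for the eight variables; check the real header
# of people.csv before relying on any of these.
expected_cols <- c("words", "links", "external_links", "categories",
                   "revisions", "editors", "anonymous_edits", "vandalisms")

# Stand-in for people.csv: two placeholder rows (not real data).
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  paste(c("name", expected_cols), collapse = ","),
  "Person A,1200,340,85,12,2500,900,400,35",
  "Person B,800,150,40,7,600,250,90,4"
), csv_path)

people <- read.csv(csv_path, stringsAsFactors = FALSE)

# Fail early if a variable is missing or was not parsed as a number.
missing_cols <- setdiff(expected_cols, names(people))
stopifnot(length(missing_cols) == 0)
stopifnot(all(sapply(people[expected_cols], is.numeric)))

str(people)
```

To try this against the real file, point `csv_path` at people.csv and replace `expected_cols` with whatever the header actually contains.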

Analysis

Using out-of-the-box R functions, we reduced these eight variables to their principal components (using this handy guide). As you can see, a huge amount of the variance is contained in the first PC:

[Figure: variance.png, share of variance explained by each principal component]
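
For readers who want to see the shape of this step without opening wikipedia.r, here is a minimal sketch using base R's `prcomp`. It runs on synthetic data, not people.csv, and the choice to center and scale the columns is an assumption; the actual script may preprocess the variables differently.

```r
set.seed(1)

# Synthetic stand-in for the eight page metrics: one shared latent factor
# plus noise, so the columns are strongly correlated with each other.
n <- 200
latent <- rnorm(n)
metrics <- sapply(1:8, function(i) latent + rnorm(n, sd = 0.5))
colnames(metrics) <- paste0("metric_", 1:8)  # hypothetical names

# Center and scale each column so no metric dominates by units alone
# (an assumption; wikipedia.r may do this differently).
pca <- prcomp(metrics, center = TRUE, scale. = TRUE)

# Share of total variance carried by each principal component.
variance_share <- pca$sdev^2 / sum(pca$sdev^2)
print(round(variance_share, 3))

# Uncomment to draw a scree plot of the same information:
# screeplot(pca, type = "lines")
```

Because the synthetic columns share one latent factor, the first component takes most of the variance here too, which is the pattern the figure describes for the real data.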

You can rerun the principal component analysis like so:

Rscript wikipedia.r

(This may require installing the relevant libraries first.)

By trial and error, the ranking that most satisfied our anecdotal sense for "influence" in the real world was PC1 + PC2, which becomes the score for each person.
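
The PC1 + PC2 score can be sketched as follows. This is an illustration on synthetic data with placeholder names, assuming a centered and scaled `prcomp` fit; it is not a copy of what wikipedia.r does.

```r
set.seed(2)

# Synthetic stand-in data; names and values are placeholders, not people.csv.
n <- 50
latent <- rnorm(n)
metrics <- sapply(1:8, function(i) latent + rnorm(n, sd = 0.5))
people <- data.frame(name = paste("Person", seq_len(n)), metrics)

# Fit the PCA on the eight numeric columns (everything except the name).
pca <- prcomp(people[, -1], center = TRUE, scale. = TRUE)

# pca$x holds each row's coordinates on the principal components;
# the score is the sum of the first two.
people$score <- pca$x[, "PC1"] + pca$x[, "PC2"]

# The sign of a principal component is arbitrary, so check that a high
# score really corresponds to larger raw metrics before ranking on it.
ranked <- people[order(people$score, decreasing = TRUE), c("name", "score")]
head(ranked, 5)
```

Since the direction of each component is only defined up to sign, it is worth confirming the orientation (for example, by checking the loadings in `pca$rotation`) before reading the top of the list as "most prominent".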
