GitHub - Graendal/Wikipedia-Philosophy: Graphing Wikipedia chains according to xkcd/903

Disclaimer: I'm not really a programmer, so I'm sure most of this code is really ugly. Sorry! I learned almost everything I needed as I went along, so I apologize for the non-standard comments and whatever other etiquette I'm breaching.

This program is inspired by the alt-text of the xkcd comic www.xkcd.com/903. The idea is that if you pick a random Wikipedia page and keep clicking the first non-italic, non-parenthetical link in the article text, you will eventually reach the page for Philosophy. I was curious about how often this happens and what kinds of chains there are, so I wrote this program to pick random Wikipedia pages and record graph data in a .csv file about which pages link to which other pages in this way. I then used Gephi to interpret the graph data and make some pretty graphs, but that is way beyond my ability to include in the code itself.

On each page, the program grabs the HTML and cuts out all the pieces where it might find inappropriate links, such as parentheticals, italics, tables, and sidebars. Then it picks the first link it finds, records the new edge, goes to that page, and does it again. It stops going down a chain when it hits a page it has already seen (or a page without any links), and it stops altogether when it has found the specified number of chains.

If you keep using the same .csv file, it will also keep track of the pages it recorded on previous runs, so you can run it again to gather more data without repeating anything you've already done.
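A rough sketch of the loop described above, in Python 3. This is not the actual code in wikipage.py: the names strip_unwanted, first_link, and follow_chain are made up for illustration, the regular expressions are a simplification (sidebars are not handled, and links to namespaced pages such as File: are skipped as an assumption), and a small fake `pages` dictionary stands in for fetching real Wikipedia HTML.

```python
import re


def strip_unwanted(html):
    """Cut out regions whose links should not count (hypothetical helper)."""
    # Remove tables and italic spans; non-greedy so each block goes separately.
    html = re.sub(r"<table.*?</table>", "", html, flags=re.S | re.I)
    html = re.sub(r"<i>.*?</i>", "", html, flags=re.S | re.I)
    # Remove parenthesised text. Parentheses inside a tag are ignored so that
    # hrefs such as /wiki/Mercury_(planet) do not confuse the depth counter.
    out, depth, in_tag = [], 0, False
    for ch in html:
        if ch == "<":
            in_tag = True
        if not in_tag:
            if ch == "(":
                depth += 1
                continue
            if ch == ")" and depth > 0:
                depth -= 1
                continue
        if depth == 0:
            out.append(ch)
        if ch == ">":
            in_tag = False
    return "".join(out)


def first_link(html):
    """Return the first remaining /wiki/ article name, or None if there is none."""
    match = re.search(r'href="/wiki/([^":#]+)"', strip_unwanted(html))
    return match.group(1) if match else None


def follow_chain(start, fetch, seen):
    """Follow first links from start until a seen page or a page with no link.

    fetch is any callable mapping a page name to its HTML; seen is a set that
    is updated in place so later chains stop where earlier ones already went.
    """
    edges, page = [], start
    while page not in seen:
        seen.add(page)
        target = first_link(fetch(page))
        if target is None:
            break
        edges.append((page, target))
        page = target
    return edges


# Fake pages for illustration only; a real run would download the HTML.
pages = {
    "Cat": '<p>The cat (<i>Felis catus</i>, see <a href="/wiki/Felis">Felis</a>) '
           'is a <a href="/wiki/Mammal">mammal</a>.</p>',
    "Dog": '<p>A <a href="/wiki/Mammal">mammal</a>.</p>',
    "Mammal": '<table><tr><td><a href="/wiki/Taxobox">x</a></td></tr></table>'
              '<p>An <a href="/wiki/Animal">animal</a>.</p>',
    "Animal": "<p>No links here.</p>",
}
seen = set()
edges = follow_chain("Cat", pages.__getitem__, seen)
```

Passing the same `seen` set to a later call (for example, starting from "Dog") makes that chain stop as soon as it reaches a page an earlier chain already visited, while still recording the edge into it. Loading `seen` from the existing .csv file before starting would give the "pick up where you left off" behavior described above.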

Here are a couple of Gephi files for the graph projects I've made with this:
http://dl.dropbox.com/u/22756010/bigwikiproject.gephi
http://dl.dropbox.com/u/22756010/wikiproject.gephi

And for pictures of the graphs themselves:
http://dl.dropbox.com/u/22756010/bigwikiproject.svg
http://dl.dropbox.com/u/22756010/wikiproject.svg

Known problems:
1) If a Wikipedia page has a link to "Coordinates" at the top left, with the geographic coordinates of the location the page is about, it thinks the "Coordinates" link is an appropriate choice and goes to the Wikipedia page Geographic Coordinate System.
2) Sometimes when it does response.read() for certain pages, it ends up with something that isn't actually the HTML of the page. I have no idea if this is a problem with the code or what. It's pretty rare and it accounts for almost all of the "no link" occasions. What's especially weird is that for the same article, it will sometimes happen and sometimes not.
3) It doesn't convert URL encodings in the names of the articles, but this only matters if you care about making the labels on your graph look nice.
4) Gephi is somewhat picky about characters in EdgeFile. It will interpret commas and spaces as node separators, so some post-processing on the .csv file is necessary but pretty easy.
5) This one is more of an inefficiency problem, but it might be faster to find a link and check if it's appropriate, rather than cutting out everything inappropriate and then finding the first leftover link. I couldn't think of a nice way to check the appropriateness of each link without doing other inefficient things, though.
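
For problems 3 and 4, here is one possible way to clean up article names before they are written to the edge file. This is only a sketch under assumptions: clean_label and write_edges are hypothetical names (they are not in wikipage.py), the edges are assumed to be (source, target) pairs of article names as they appear in the URL, and replacing commas and spaces with underscores is just one choice of substitute character.

```python
import csv
import io
import re
from urllib.parse import unquote


def clean_label(name):
    """Decode URL escapes, then remove the characters Gephi would split on."""
    decoded = unquote(name)  # percent-escapes are decoded as UTF-8 by default
    # Collapse any run of commas, whitespace, and underscores into one underscore.
    return re.sub(r"[,\s_]+", "_", decoded)


def write_edges(edges, out):
    """Write cleaned (source, target) pairs as CSV rows to a file-like object."""
    writer = csv.writer(out)
    for source, target in edges:
        writer.writerow([clean_label(source), clean_label(target)])


# Example with made-up edges, written to an in-memory buffer instead of a file.
buffer = io.StringIO()
write_edges([("Caf%C3%A9", "Paris,_Texas"), ("Rock_%26_roll", "Music")], buffer)
rows = buffer.getvalue().splitlines()
```

One thing to watch for: two different articles whose names differ only by a comma or a space would end up with the same label after this cleanup, so Gephi would merge them into one node.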