This repository is private.
All pages are served over SSL and all pushing and pulling is done over SSH.
No one may fork, clone, or view it unless they are added as a member.
Every repository with this icon (
) is private.
Every repository with this icon (
This repository is public.
Anyone may fork, clone, or view it.
Every repository with this icon (
) is public.
Every repository with this icon (
| name | age | message | |
|---|---|---|---|
| |
.gitignore | ||
| |
README | ||
| |
Rakefile | ||
| |
bin/ | ||
| |
config/ | ||
| |
configuration.example.yml | ||
| |
db/ | ||
| |
lib/ | ||
| |
log/ | ||
| |
temp/ | ||
| |
test/ |
README
Parses Wikipedia dumps for the Get to Philosophy game, as seen at http://en.wikipedia.org/wiki/Wikipedia:Get_to_Philosophy Ultimately this will save data in a format suitable for loading into a Rails website, but until then, run bin/page_xml_parser_interface.rb on an XML file like simplewiki-20081029-pages-articles.xml To get such an XML file, go to http://download.wikimedia.org/backup-index.html and then choose your language edition, and download a dump that contains the latest revision (as opposed to all revisions) of all articles. To do: Speed up analysis, preferably using less memory. Ensure iteration over pages doesn't load all of them at once. Handle mainspace articles that have colons in their names. Consider how to handle footnotes. Should Philosophy count as linking to A.C. Grayling, the author of a cited publication? What about an explanatory footnote in the United Kingdom? Find the longest chain from any article to its loop. Find out how long the terminating loops are. Add the option to ignore certain links such as dates or Greek Language. Handle templates containing text content, such as Template:Day. Distinguish between articles and redirects?








