README.md
RESULT-frwiki-latest-pages-meta-current.xml.txt
outputcmd.txt
wiki-analysis.py


Introduction

I wanted to run a frequency analysis of Wikipedia because I could not find a good frequency table for French that covered every character (space, tab, LF, etc.). The goal is to use the table for frequency analysis in crypto challenges. I chose Wikipedia content because it seems to be the largest source of text that is easy to access and download.

Dumps can be downloaded from Wikimedia. The one I used for this test is the release of 18 February 2012, which is 12.5 GB uncompressed.

The code is quite simple; the only things I had to deal with were the XML and the wiki syntax. I wanted only the article content, not the XML tags, which would have skewed the results.

That is why I keep only the text found between '<text xml:space="preserve">' and '</text>', the two markers for the beginning and the end of an article. Within an article I also do some cleanup: I remove all wiki syntax tags (in a dirty way), because they would skew the results too ('[' would, for instance, rank high when it should not).
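The filtering described above can be sketched roughly as follows. This is not the code of wiki-analysis.py, only a minimal illustration that assumes the dump is read line by line; the names `count_characters`, `strip_wiki_syntax` and the `WIKI_SYNTAX` pattern are hypothetical, and the cleanup rules are a guess at what a "dirty" removal could look like.

```python
import re
from collections import Counter

OPEN_MARKER = '<text xml:space="preserve">'
CLOSE_MARKER = '</text>'

# Hypothetical, deliberately crude cleanup: single-line templates, link
# brackets (keeping the displayed label), bold/italic quotes, heading signs.
WIKI_SYNTAX = re.compile(r"\{\{[^{}]*\}\}|\[\[(?:[^\[\]|]*\|)?|\]\]|'{2,}|={2,}")


def strip_wiki_syntax(text):
    """Remove the wiki markup matched by WIKI_SYNTAX from a piece of text."""
    return WIKI_SYNTAX.sub("", text)


def count_characters(lines):
    """Count characters found between the two <text> markers.

    `lines` is any iterable of str, for example an open file.
    Returns (Counter of characters, number of articles seen).
    """
    counts = Counter()
    articles = 0
    inside = False
    for line in lines:
        if not inside:
            start = line.find(OPEN_MARKER)
            if start == -1:
                continue  # still outside an article: skip XML metadata
            line = line[start + len(OPEN_MARKER):]
            inside = True
            articles += 1
        end = line.find(CLOSE_MARKER)
        if end != -1:
            line = line[:end]  # drop the closing marker and what follows
            inside = False
        counts.update(strip_wiki_syntax(line))
    return counts, articles
```

To run it on a dump, one could pass an open file, for example `count_characters(open(path, encoding="utf-8"))`. The sketch does not decode XML entities such as `&lt;` and misses templates that span several lines, so its counts would differ from the results reported below.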

Note: I wrote this script in Python 3, which supports Unicode natively; that is crucial here.

Results

Running the script took a little under 3 hours, and the output is:

bash$ ./wiki-analysis.py frwiki-latest-pages-meta-current.xml
Lines: 222669441
Execution time: 10033.887793064117 secondes                                                                                  
Articles: 3714740       Lines: 222669441        Number characters: 9217905119   Different characters: 27605

Results in: RESULT-frwiki-latest-pages-meta-current.xml.txt

You can download the output file with the complete table here.

As expected, the space comes first with 1329412224 occurrences (about 14.4% of all characters), followed by "e" and "t". The LF comes in 16th position.
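For crypto work, relative frequencies are usually more convenient than raw counts. As a small worked example using only the two figures reported above (the helper name `relative_frequency` is hypothetical):

```python
# Figures taken from the script output and the results above.
total_characters = 9217905119
space_count = 1329412224


def relative_frequency(count, total):
    """Return the share of `count` in `total`, as a percentage."""
    return 100.0 * count / total


print(f"space: {relative_frequency(space_count, total_characters):.2f}%")
# prints "space: 14.42%"
```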