=========================
wahatttt - wh4t software:
=========================

Goals:
------

The goal of the project "wh4t" (short for "wahatttt") is to create a
text-technologically traversed archive of texts written by Herwart
Holland-Moritz (also known as Wau Holland, or simply "[Ww]au"; born 1951,
died 2001). The idea is to classify Wau's texts by topic and (in a later
phase) to visualize this information in a way suitable for the interested
public. What should emerge is a structured picture of the (varying) areas
Wau was active in.

Wau was one of the founding figures of the Chaos Computer Club (CCC), the
largest hacker association in Europe. Today, hackerspaces all over Europe
are directly or indirectly connected to the CCC. After his (early) death
in 2001, the Wau Holland Foundation (German: Wau-Holland-Stiftung, or WHS)
was established to make his works public and to keep his ideas and ideals
alive.

Setting of this project:
------------------------

This project grew out of an idea and some prior work by Bernd Fix
<firstname.lastname@example.org> of the WHS. However, it is now carried
out (independently) as a student programming project at the University of
Zurich (UZH) by Hernani Marques <email@example.com>.

Data/source text:
-----------------

The text material used for the time being comes from the debate@ FITUG
mailing list. FITUG stands for "Förderverein Informationstechnik und
Gesellschaft", roughly meaning "Booster club (for) information technology
and society". From a corpus point of view, the texts are likely to be full
of technical computing terms, but also enriched with expressions from
political and sociological fields. The list's (prior) archives are
publicly available:

* http://www.fitug.de/debate/index.html

Wau posted on debate@ between 1996 and 2000.
Further data may be gathered from the (Google) USENET archives or from
material held in Berlin, Germany, that is not yet publicly available. This
additional text material will be included in later project phases.

what or wahatttt is this name about?
------------------------------------

"wahatttt" stands for "what" in a longish version that the average English
speaker would not consider correctly spelled; however, it is actually
listed as a word in the (not quite representative) Urban Dictionary:

* http://www.urbandictionary.com/define.php?term=Wahatttt

Alternatively, it may be an acronym for the following (longish)
constructions in German:

(1) WAu Holland-Archiv - TexT-Technologisch(-)Traversiert
(2) WAu Hacker-Archiv - TexT-Technologisch(-)Traversiert

Or, put in English, it may (further) stand for:

(3) WAu Holland Archive -- TexT-Technologically(-)Traversed
(4) WAu Hacker Archive -- TexT-Technologically(-)Traversed

All these ambiguities, which start with the very name of the project,
symbolize the difficulties this project (naturally) faces, as it deals
with natural language processing.

And, most importantly: this project aims to maximize not only the outcome,
but also the fun factor.
Software requirements:
----------------------

wh4t was tested with the following software:

    $ java -version
    java version "1.6.0_18"
    OpenJDK Runtime Environment (IcedTea6 1.8.13) (6b18-1.8.13-0+squeeze1)
    OpenJDK 64-Bit Server VM (build 14.0-b16, mixed mode)

    $ python -V  # Using nltk-2.0.1rc4 and some other libraries (to be listed)
    Python 2.6.6

First exploration:
------------------

Invoke the following commands:

    git clone https://github.com/2mh/wahatttt.git
    cd wahatttt/resources
    ./create_nouns_file.sh
    cd ../src/
    javac GrabFitug.java
    java GrabFitug
    ./wh4t_xml_checker.py
    ./wh4t_explore.py

Neural clustering:
------------------

Invoke the following commands:

    git clone https://github.com/2mh/wahatttt.git
    cd wahatttt/resources
    ./create_nouns_file.sh
    cd ../src/
    javac GrabFitug.java
    java GrabFitug
    ./wh4t_nn_classificator.py  # Caution: May need several gigs of RAM

Common term-based clustering:
-----------------------------

Invoke the following commands:

    git clone https://github.com/2mh/wahatttt.git
    cd wahatttt/resources
    ./create_nouns_file.sh
    cd ../src/
    javac GrabFitug.java
    java GrabFitug
    ./wh4t_stems_classificator.py  # Works quite fast
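
To give a rough idea of what clustering by common terms can look like,
here is a minimal, stand-alone sketch. It is NOT the code of
wh4t_stems_classificator.py, whose internals are not described here. The
suffix list, the similarity threshold and all function names below are
assumptions made purely for illustration, and the sketch targets Python 3
rather than the Python 2.6.6 listed above. Texts are reduced to sets of
crudely stemmed words, and a text joins the first cluster whose seed text
shares enough stems with it (Jaccard similarity).

```python
import re

# Hypothetical suffix list, for illustration only. A real stemmer
# (for example one shipped with NLTK) would do a far better job.
SUFFIXES = ("ungen", "ung", "en", "er", "es", "e", "s", "n")


def crude_stem(word):
    """Strip at most one suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_set(text):
    """Return the set of stems of all words longer than 3 letters."""
    words = re.findall(r"[a-zäöüß]+", text.lower())
    return {crude_stem(w) for w in words if len(w) > 3}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 if both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster(texts, threshold=0.2):
    """Greedy single-pass clustering over the given list of texts.

    Each text is compared with the seed (first member) of every
    existing cluster and joins the first one that is similar enough;
    otherwise it starts a new cluster. Returns lists of text indices.
    """
    clusters = []  # list of (seed_stems, member_indices)
    for idx, text in enumerate(texts):
        stems = stem_set(text)
        for seed, members in clusters:
            if jaccard(seed, stems) >= threshold:
                members.append(idx)
                break
        else:
            clusters.append((stems, [idx]))
    return [members for _, members in clusters]


if __name__ == "__main__":
    sample = [
        "Encryption keys protect private mail",
        "Private keys and mail encryption",
        "Hackers meeting in Hamburg congress",
    ]
    print(cluster(sample))
```

The greedy single-pass approach is chosen only because it is short and
fast; its result depends on the order of the input texts, so it should be
read as a toy model of the idea, not as a recommendation.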