
Documentation.

1 parent 9bf9894 commit b6da2676078b34572e42a8365e6d4bdd11490c32 @joshuaeckroth joshuaeckroth committed Jul 16, 2011
Showing with 433 additions and 19 deletions.
  1. +9 −19 AINews.py
  2. +52 −0 README.corpus
  3. +142 −0 README.md
  4. +30 −0 corpus-mds.r
  5. +12 −0 process-results.pl
  6. +188 −0 tables.sql
AINews.py
@@ -50,21 +50,18 @@ def usage():
-f, --file crawl target URLs stored in the file
-u, --url crawl one target URL
- (2) categorize:
- choose topics for all the news.
-
- (3) train:
+ (2) train:
      train news classifiers based on human ratings.
- (4) rank:
+ (3) rank:
rank the latest news and generate output files.
- (5) publish:
+ (4) publish:
publish news from output files to Pmwiki site and send emails.
      It is published weekly to the public.
- (6) all:
- Automatically processing crawl, categorize, train, rank, and publish tasks.
+ (5) all:
+      Automatically perform the train, crawl, rank, and publish tasks.
View Latest news at:
http://www.aaai.org/AITopics/pmwiki/pmwiki.php/AITopics/AINews
@@ -96,10 +93,6 @@ def crawl(opts):
if rss_flag:
crawler.crawl()
-def categorize():
- centroid = AINewsCentroidClassifier()
- centroid.categorize_all()
-
def train():
svm = AINewsSVM()
svm.collect_feedback()
@@ -128,7 +121,7 @@ def main():
# Set en_US, UTF8
locale.setlocale(locale.LC_ALL,'en_US.UTF-8')
- commands_list = ("crawl", "categorize", "train", "rank", "publish", "all", "help")
+ commands_list = ("train", "crawl", "rank", "publish", "all", "help")
try:
if len(sys.argv) < 2 or sys.argv[1] not in commands_list:
usage()
@@ -141,15 +134,12 @@ def main():
usage()
sys.exit(2)
-    if command == "crawl":
-        crawl(opts)
-
-    elif command == "categorize":
-        categorize()
-
-    elif command == "train":
+    if command == "train":
         train()
 
+    elif command == "crawl":
+        crawl(opts)
+
elif command == "rank":
rank()
README.corpus
@@ -0,0 +1,52 @@
+From: http://www-users.cs.umn.edu/~han/data/tmdata.tar.gz
+
+As mentioned in:
+
+@article{han2000centroid,
+ title={Centroid-based document classification: Analysis and experimental results},
+ author={Han, E.H. and Karypis, G.},
+ journal={Principles of Data Mining and Knowledge Discovery},
+ pages={116--123},
+ year={2000},
+ publisher={Springer}
+}
+
+http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.34.1097&rep=rep1&type=pdf
+
+
+1. This directory contains 20 data sets used in the experiments reported
+   in the paper titled "Centroid-Based Document Classification: Analysis &
+ Experimental Results" by Eui-Hong (Sam) Han and George Karypis.
+ Note that we did not include three data sets (west1, west2, and west3)
+ from West Group, because they are not available for public use.
+
+2. For each data set, there are 3 files. For example, for the oh0 data set:
+
+ oh0.mat : sparse vector space representation of documents where rows
+ are documents and columns are words
+ oh0.mat.rlabel: row label (the class name of each document)
+ oh0.mat.clabel: column label (stemmed words)
+
+3. Format of the *.mat file is as follows:
+
+ no-of-rows no-of-columns no-of-non-zero-elements
+ doc-1
+ doc-2
+ .
+ .
+ .
+ doc-n
+
+ Each row corresponds to a document and has the following format:
+
+ c1 f1 c2 f2 c3 f3 ..... cm fm
+
+ where c? represents a word id and f? represents the text frequency
+ of c? in this document.
+
+4. The *.mat.rlabel file contains the class labels of the documents. The
+   first line corresponds to the class label of the first document in the
+   *.mat file, and so forth.
+
+5. The *.mat.clabel file contains the stemmed words of the documents. The
+   first line corresponds to word-id 1 in the *.mat file, and so forth.
README.md
@@ -0,0 +1,142 @@
+# AINews
+
+The main script, `AINews.py`, performs the training/crawling/publishing
+process. The script can be called with the `all` option to perform all steps:
+
+<pre>
+python AINews.py all
+</pre>
+
+Or, one of `train`, `crawl`, `rank`, or `publish` can be used to
+execute just one of the subprocesses, e.g.:
+
+<pre>
+python AINews.py crawl
+</pre>
+
+## Database
+
+Use `tables.sql` to create the tables needed by AINews. The table-creation
+code in that file is documented as well.
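+
+For example, assuming a MySQL backend and a database named `ainews` (both
+placeholders; use whatever your `db.ini` points at):
+
+<pre>
+mysql -u youruser -p ainews < tables.sql
+</pre>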
+
+## Configuration
+
+All configuration is provided in the `config/*.ini` files:
+
+ - `config.ini` has miscellaneous configuration options, described in the file
+ - `db.ini` has database connectivity information (host, username, password,
+ etc.)
+ - `paths.ini` stores the paths of various data and output directories used by
+ AINews. This file allows you to store data and output in whatever way makes
+ sense for a particular installation (for example, data files should be outside
+ of a webserver root)
+
+Make a copy of the `.ini.sample` files (rename to `.ini`) to build your own
+configuration.
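+
+For example (assuming each sample file is named after its target, e.g.
+`db.ini.sample`):
+
+<pre>
+cp config/config.ini.sample config/config.ini
+cp config/db.ini.sample config/db.ini
+cp config/paths.ini.sample config/paths.ini
+</pre>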
+
+## Train
+
+The trainer collects user ratings and finds the best support vector machines
+for rating news articles as irrelevant, somewhat relevant, or relevant. Output
+is saved to the `svm_data` path (defined in `config/paths.ini`).
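+
+To run just this step:
+
+<pre>
+python AINews.py train
+</pre>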
+
+## Crawl
+
+The crawler can be invoked by:
+
+<pre>
+python AINews.py crawl
+</pre>
+
+The crawler will read the `sources` table in the database and crawl
+each source. In the `sources` table, each source has a parser associated with
+it. You can add new parsers (and thus new sources) by adding to
+`AINewsSourceParser.py`.
+
+Articles are processed as they are crawled. Results are stored in the `urllist`
+table. Also, article descriptions are stored in the `news_data/desc/`
+directory, article text is stored in `news_data/text/`, and article metadata
+(title, publication date, topic, etc.) is stored in `news_data/meta/`; in each
+case, the urlid (as found from the `urllist` table) is the name of the file,
+and the extension is `.pkl` (Python "pickle" file).
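+
+The exact fields stored in the metadata pickles are not listed here; a minimal
+sketch that just loads one (urlid 1234 is a placeholder):
+
+<pre>
+import pickle
+
+# load the metadata pickle for a crawled article by its urlid
+f = open('news_data/meta/1234.pkl', 'rb')
+meta = pickle.load(f)
+f.close()
+print meta
+</pre>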
+
+## Rank
+
+The ranker finds recent articles saved in the database (by the crawler, in the
+table `urllist`) and chooses the highest-scoring news, while also trying to
+select news from multiple categories. The "top news" is saved into the `output`
+folder (from `paths.ini`) in a Python "pickle" file called `topnews.pkl`. This
+is the file that the publisher (described below) uses to generate HTML files
+and RSS feeds for the news.
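+
+To peek at the ranker's output (assuming `topnews.pkl` unpickles to a list of
+articles; adjust the path to match your `paths.ini`):
+
+<pre>
+import pickle
+
+# count the articles chosen by the ranker
+f = open('output/topnews.pkl', 'rb')
+topnews = pickle.load(f)
+f.close()
+print len(topnews)
+</pre>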
+
+## Publish
+
+The publisher reads the news articles stored in `topnews.pkl` (see "Rank"
+above), and writes HTML and RSS files with the appropriate formatting. The
+formats are given by Cheetah templates (the `compiled` path, in the section
+`templates`, in `config/paths.ini`). Pmwiki is assumed to be the content
+management system providing the AINews website, so Pmwiki-styled output is
+produced by the publisher, as well as RSS feeds and an email designed for
+sending to the AIAlert mailing list.
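+
+A hypothetical Cheetah fragment, just to show the flavor (the real templates
+and their variable names live under the `templates` paths and are not
+reproduced here):
+
+<pre>
+#for $article in $articles
+* [[$article.url | $article.title]]
+#end for
+</pre>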
+
+# Classifier
+
+The classifier is not executed by `AINews.py`. Rather, it is used by the
+crawling process to classify the news. The classifier can be trained and
+evaluated by running the `AINewsCentroidClassifier.py` script, as described
+below.
+
+## Evaluating the classifier
+
+The classifier can be trained/evaluated on a file-based corpus (described in
+the README.corpus file) or a database corpus (essentially described by the
+`cat_corpus` and `cat_corpus_cats` tables in `tables.sql`). A file-based
+corpus can be indicated by the format `file:X` where `X` is a corpus name. The
+corpus is read from the file `corpus_other/X.mat` and related files
+(`X.mat.clabel` and `X.mat.rlabel`). The directory `corpus_other` can be
+specified in `config/paths.ini`.
+
+<pre>
+python AINewsCentroidClassifier.py evaluate file:oh10
+</pre>
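+
+For reference, a `.mat` file is a sparse document-term matrix (the format is
+described in `README.corpus`). A minimal reader sketch, not part of AINews:
+
+<pre>
+# read a *.mat file: a header line, then one "id freq id freq ..." row per doc
+def read_mat(path):
+    f = open(path)
+    nrows, ncols, nonzero = [int(x) for x in f.readline().split()]
+    docs = []
+    for line in f:
+        tokens = line.split()
+        docs.append(dict((int(tokens[i]), int(tokens[i+1]))
+                         for i in range(0, len(tokens), 2)))
+    f.close()
+    return docs
+</pre>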
+
+The other option is a database corpus. The `db:X:Y` specification format is as
+follows: `X` is the table with corpus articles (format follows `cat_corpus`
+table described in `tables.sql`); `Y` is the table with corpus article
+categories (`cat_corpus_cats` described in `tables.sql`).
+
+<pre>
+python AINewsCentroidClassifier.py evaluate db:cat_corpus:cat_corpus_cats
+</pre>
+
+You will probably want to save the output (redirect stdout) to a file.
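+
+For example:
+
+<pre>
+python AINewsCentroidClassifier.py evaluate file:oh10 > evaluator-output.txt
+</pre>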
+
+## Filtering results
+
+The classifier evaluator may print many poor results, with the good results
+mixed in and thus hard to find. The script `process-results.pl` can help.
+Edit the script so that it keeps only the results that are meaningful to you,
+then execute as follows:
+
+<pre>
+perl process-results.pl < evaluator-output.txt
+</pre>
+
+
+## Exporting the corpus dissimilarity matrix
+
+<pre>
+python CorpusExport.py file:oh10 > corpus-oh10.csv
+</pre>
+
+
+Then you can graph these dissimilarities using R:
+
+<pre>
+Rscript corpus-mds.r corpus-oh10
+</pre>
+
+That command will produce the graphs `corpus-oh10-mds.png` and
+`corpus-oh10-mds-faceted.png`.
+
+
corpus-mds.r
@@ -0,0 +1,30 @@
+# Plot an MDS (multidimensional scaling) projection of a corpus dissimilarity
+# matrix, as exported by CorpusExport.py.
+# Usage: Rscript corpus-mds.r <name>
+#   (reads <name>.csv, writes <name>-mds.png and <name>-mds-faceted.png)
+library(ggplot2)
+args <- commandArgs(trailingOnly = TRUE)
+corpus <- read.csv(paste(args[1], ".csv", sep=""))
+
+# Reduce the dissimilarity matrix to two dimensions (V1, V2).
+fit <- cmdscale(corpus, k=2)
+data <- as.data.frame(fit)
+
+# Row names have the form "<urlid> <category>"; split out each part.
+data$Category <- gsub("\\d+ ", "", rownames(data))
+data$URLID <- gsub("(\\d+)?.*", "\\1", rownames(fit))
+
+png(paste(args[1], "-mds.png", sep=""), width=500, height=500, res=100)
+
+# Small colored points for rows with a urlid; large hollow circles for rows
+# without one. Hide all axis labels and the legend.
+p <- ggplot(data) +
+  geom_point(data=subset(data, URLID != ""),
+             aes(x=V1, y=V2, size=1.5, color=Category)) +
+  geom_point(data=subset(data, URLID == ""),
+             aes(x=V1, y=V2, size=7, shape=c(1), color=Category)) +
+  scale_x_continuous("", breaks=NA) +
+  scale_y_continuous("", breaks=NA) +
+  opts(axis.text.x = theme_blank(), axis.title.x=theme_blank(),
+       axis.text.y = theme_blank(), axis.title.y=theme_blank(),
+       legend.position = "none")
+
+p
+dev.off()
+
+# The same plot, faceted into one panel per category.
+png(paste(args[1], "-mds-faceted.png", sep=""), width=500, height=500, res=100)
+p + facet_wrap(~ Category)
+dev.off()
+
process-results.pl
@@ -0,0 +1,12 @@
+#!/usr/bin/perl
+# Filter evaluator output: keep only lines whose "matched avg" exceeds the
+# threshold below. Edit the regex and threshold to match the results you
+# care about.
+use strict;
+use warnings;
+
+while(<>)
+{
+    my ($pct, $icsd, $csd, $sd, $matched) =
+        (m/(\d\d)%, icsd=(-?\d\.\d\d), csd=(-?\d\.\d\d), sd=(-?\d\.\d\d) matched avg ([\d\.]+)%/);
+    if(defined($matched) && $matched > 56)
+    {
+        print "$pct, icsd=$icsd, csd=$csd, sd=$sd, matched=$matched\n";
+    }
+}
+
