
Added license info to all files and enhanced README to include better installation/customization instructions and authorship information. Also updated database tables schemas.
1 parent 221f54f commit cda21b305567fbe16e37a2b2d622dccf07beb744 @joshuaeckroth joshuaeckroth committed Aug 21, 2011
View
@@ -1,27 +1,10 @@
-# AINewsFinder
-# Copyright (C) 2010
-# Author: Liang Dong <ldong@clemson.edu>
-# URL: <http://bioinformatics.clemson.edu/ldong>
-# $Id$
-
-"""
-With funds allocated by the Executive Committee, we have hired a summer intern
-to bring AI in the News back online as a twice-monthly service. Liang Dong,
-a CS graduate student from Clemson, is automating the service, under the
-supervision of Bruce Buchanan and Reid Smith, with an AI program, called
-NewsFinder. It first pulls in RSS feeds from Google News and other reliable
-sources and filters out blogs, press releases, and advertisements. A support
-vector machine has been trained with manually scored stories from the web to
-classify each story as "not relevant to AI" (0), or "very interesting" (+5),
-"somewhat interesting" (+3), or "mildly interesting" (+1).
-
-We augment the SVM's scores with a measure of interest (frequency * inverse
-doc frequency) of selected terms and additional heuristics (using multi-word
-phrases) that indicate higher or lower interest. The sources of the articles
-will also be considered, since appearance of a story in a major news publication
-like the NY Times makes it more likely to be asked about.
-
-"""
+# This file is part of NewsFinder.
+# https://github.com/joshuaeckroth/AINews
+#
+# Copyright (c) 2011 by the Association for the Advancement of
+# Artificial Intelligence. This program and parts of it may be used and
+# distributed without charge for non-commercial purposes as long as this
+# notice is included.
import sys
import os
View
@@ -1,3 +1,11 @@
+# This file is part of NewsFinder.
+# https://github.com/joshuaeckroth/AINews
+#
+# Copyright (c) 2011 by the Association for the Advancement of
+# Artificial Intelligence. This program and parts of it may be used and
+# distributed without charge for non-commercial purposes as long as this
+# notice is included.
+
"""
AINewsConfig reads the configure file: config.ini.
It parses the config.ini as well as pre-define several static parameters.
View
@@ -1,3 +1,10 @@
+# This file is part of NewsFinder.
+# https://github.com/joshuaeckroth/AINews
+#
+# Copyright (c) 2011 by the Association for the Advancement of
+# Artificial Intelligence. This program and parts of it may be used and
+# distributed without charge for non-commercial purposes as long as this
+# notice is included.
import sys
import random
View
@@ -1,3 +1,12 @@
+# This file is part of NewsFinder.
+# https://github.com/joshuaeckroth/AINews
+#
+# Copyright (c) 2011 by the Association for the Advancement of
+# Artificial Intelligence. This program and parts of it may be used and
+# distributed without charge for non-commercial purposes as long as this
+# notice is included.
+
+
"""
Crawling major news websites for latest Artificial Intelligence related news
stories.
View
@@ -1,3 +1,11 @@
+# This file is part of NewsFinder.
+# https://github.com/joshuaeckroth/AINews
+#
+# Copyright (c) 2011 by the Association for the Advancement of
+# Artificial Intelligence. This program and parts of it may be used and
+# distributed without charge for non-commercial purposes as long as this
+# notice is included.
+
"""
The database wrapper for MySQL. It provides the fundamental functions to
access the MySQL database.
View
@@ -1,3 +1,11 @@
+# This file is part of NewsFinder.
+# https://github.com/joshuaeckroth/AINews
+#
+# Copyright (c) 2011 by the Association for the Advancement of
+# Artificial Intelligence. This program and parts of it may be used and
+# distributed without charge for non-commercial purposes as long as this
+# notice is included.
+
"""
This is a human made duplicated news testing set used for experiment.
I manually making duplicated news pair from news id 314-664
View
@@ -1,3 +1,11 @@
+# This file is part of NewsFinder.
+# https://github.com/joshuaeckroth/AINews
+#
+# Copyright (c) 2011 by the Association for the Advancement of
+# Artificial Intelligence. This program and parts of it may be used and
+# distributed without charge for non-commercial purposes as long as this
+# notice is included.
+
from datetime import date, timedelta
from AINewsCorpus import AINewsCorpus
from AINewsConfig import config
View
@@ -1,3 +1,11 @@
+# This file is part of NewsFinder.
+# https://github.com/joshuaeckroth/AINews
+#
+# Copyright (c) 2011 by the Association for the Advancement of
+# Artificial Intelligence. This program and parts of it may be used and
+# distributed without charge for non-commercial purposes as long as this
+# notice is included.
+
"""
The base parser class for extracting text for general news story.
The urllib2 and urlparse are used to download the HTML pages from the website.
View
@@ -1,3 +1,11 @@
+# This file is part of NewsFinder.
+# https://github.com/joshuaeckroth/AINews
+#
+# Copyright (c) 2011 by the Association for the Advancement of
+# Artificial Intelligence. This program and parts of it may be used and
+# distributed without charge for non-commercial purposes as long as this
+# notice is included.
+
import feedparser
import PyRSS2Gen
import sys
View
@@ -1,3 +1,10 @@
+# This file is part of NewsFinder.
+# https://github.com/joshuaeckroth/AINews
+#
+# Copyright (c) 2011 by the Association for the Advancement of
+# Artificial Intelligence. This program and parts of it may be used and
+# distributed without charge for non-commercial purposes as long as this
+# notice is included.
import sys
import re
View
@@ -1,3 +1,11 @@
+# This file is part of NewsFinder.
+# https://github.com/joshuaeckroth/AINews
+#
+# Copyright (c) 2011 by the Association for the Advancement of
+# Artificial Intelligence. This program and parts of it may be used and
+# distributed without charge for non-commercial purposes as long as this
+# notice is included.
+
"""
AINewsSourceParser includes a set of parsers that inherit from AINewsParser.
Each parser is designed specifically for one source/publisher website.
View
@@ -1,24 +1,12 @@
-# Simple Summarizer
-# Copyright (C) 2010 Tristan Havelick
-# Author: Tristan Havelick <tristan@havelick.com>
-# URL: <http://www.tristanhavelick.com>
-# $Id$
-"""
-A summarizer based on the algorithm found in Classifier4J by Nick Lothan.
-In order to summarize a document this algorithm first determines the
-frequencies of the words in the document. It then splits the document
-into a series of sentences. Then it creates a summary by including the
-first sentence that includes each of the most frequent words. Finally
-summary's sentences are reordered to reflect that of those in the original
-document.
-Original Author: Tristan Havelick
-Modified by: Liang Dong
-original code url: http://tristanhavelick.com/summarize.zip
-"""
+# This file is part of NewsFinder.
+# https://github.com/joshuaeckroth/AINews
+#
+# Copyright (c) 2011 by the Association for the Advancement of
+# Artificial Intelligence. This program and parts of it may be used and
+# distributed without charge for non-commercial purposes as long as this
+# notice is included.
from operator import itemgetter
-from nltk.probability import FreqDist
-from nltk.tokenize import RegexpTokenizer
import nltk.data
from AINewsConfig import stopwords
View
@@ -1,3 +1,11 @@
+# This file is part of NewsFinder.
+# https://github.com/joshuaeckroth/AINews
+#
+# Copyright (c) 2011 by the Association for the Advancement of
+# Artificial Intelligence. This program and parts of it may be used and
+# distributed without charge for non-commercial purposes as long as this
+# notice is included.
+
"""
AINewsTextProcessor is used to process text into bag of words.
The NLTK library is used to morph the words into original form. The
View
@@ -1,3 +1,11 @@
+# This file is part of NewsFinder.
+# https://github.com/joshuaeckroth/AINews
+#
+# Copyright (c) 2011 by the Association for the Advancement of
+# Artificial Intelligence. This program and parts of it may be used and
+# distributed without charge for non-commercial purposes as long as this
+# notice is included.
+
"""
AINewsTools provides a set of helper functions to manage save/load files
and html/text processing.
View
@@ -1,3 +1,10 @@
+# This file is part of NewsFinder.
+# https://github.com/joshuaeckroth/AINews
+#
+# Copyright (c) 2011 by the Association for the Advancement of
+# Artificial Intelligence. This program and parts of it may be used and
+# distributed without charge for non-commercial purposes as long as this
+# notice is included.
import sys
from AINewsDB import AINewsDB
View
@@ -1,3 +1,10 @@
+# This file is part of NewsFinder.
+# https://github.com/joshuaeckroth/AINews
+#
+# Copyright (c) 2011 by the Association for the Advancement of
+# Artificial Intelligence. This program and parts of it may be used and
+# distributed without charge for non-commercial purposes as long as this
+# notice is included.
import sys
import re
View
@@ -0,0 +1,4 @@
+Copyright (c) 2011 by the Association for the Advancement of
+Artificial Intelligence. This program and parts of it may be used and
+distributed without charge for non-commercial purposes as long as this
+notice is included.
View
117 README.md
@@ -1,24 +1,26 @@
# NewsFinder
-Full details of this software can be found at [the AITopics wiki](http://aaai.org/AITopics/AINewsProcedure).
+Full details of this software can be found at
+[the AITopics wiki](http://aaai.org/AITopics/AINewsProcedure).
-The main script, `AINews.py`, performs the training/crawling/publishing
-process. The script can be called with one of three arguments:
+The main script, `AINews.py`, performs the
+training/crawling/publishing process. The script can be called with
+one of three arguments:
<pre>
python AINews.py crawl
</pre>
-This crawls the sources list (database table `sources`) and stores its results
-back into the database (table `urllist`).
+This crawls the sources list (database table `sources`) and stores its
+results back into the database (table `urllist`).
<pre>
python AINews.py publish
</pre>
-This finds the articles that have been crawled but not yet processed, filters
-stories based on relevance, and publishes the resutls as wiki pages, email, and
-RSS feeds.
+This grabs the articles that have been crawled but not yet processed,
+filters stories based on relevance, and publishes the results as wiki
+pages, email, and RSS feeds.
## Installation
@@ -32,32 +34,107 @@ NewsFinder is primarily coded in Python and requires the following libraries:
- [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/)
- [PyRSS2Gen](http://www.dalkescientific.com/Python/PyRSS2Gen.html)
- [feedparser](http://www.feedparser.org/)
+
+"Installation" of AINews should only involve downloading the code in
+this repository. Assuming Python includes the AINews code in its path,
+execution of AINews should be as simple as running `python AINews.py`.
+
+There are various supplementary files that AINews expects; these
+involve publishing the news into PmWiki format and generating an
+email-formatted output. These files can be obtained from another
+repository:
+[AINewsSupplementary](https://github.com/joshuaeckroth/AINewsSupplementary)
+
+In our working configuration, we have a directory in our home called
+`NewsFinder`. In this directory are the AINewsSupplementary
+directories (e.g. `templates`, `resource`, etc.). The AINews code
+found in this repository is stored in a `code` directory in the main
+`NewsFinder` folder. In any event, the `paths.ini` file (in the
+`config` directory of this project) can be modified to point to
+wherever you keep your files.
## Database
Look at `tables.sql` to create tables needed by NewsFinder.
+## News sources
+
+Sources are specified in the `sources` table of the database. A source
+has the following properties: URL, parser identifier, description,
+status, and relevance. The URL points to either an RSS feed or a
+search results page that is to be parsed. The parser identifier is
+composed of two parts separated by `::` -- the first part is the
+publisher name, the second part is the type (e.g. `rss`); the parser
+identifier (both parts) is used in `AINewsSourceParser.py` to determine
+how the URL will be processed. If you wish to crawl a news source not
+yet represented in `AINewsSourceParser.py`, you may have to write a
+new parser (see the file for examples). The description of the source
+is just helpful text to disambiguate among different search terms used
+on the same publisher (e.g. all the different Google News
+searches). The description is not used by the AINews code. Finally,
+status is a boolean value (1 or 0) indicating whether this source
+should be crawled, and relevance is a ranking (higher = better) of how
+relevant or credible the source is. Stories from more relevant sources
+are more likely to be published.
+
## Configuration
All configuration is provided in the `config/*.ini` files:
- - `config.ini` has miscellaneous configuration options, described in the file
- - `db.ini` has database connectivity information (host, username, password,
- etc.)
- - `paths.ini` stores the paths of various data and output directories used by
- AINews. This file allows you to store data and output in whatever way makes
- sense for a particular installation (for example, data files should be outside
- of a webserver root)
+ - `config.ini` has miscellaneous configuration options, described in
+ the file
+ - `db.ini` has database connectivity information (host, username,
+ password, etc.)
+ - `paths.ini` stores the paths of various data and output
+ directories used by AINews. This file allows you to store data and
+ output in whatever way makes sense for a particular installation
+ (for example, data files should be outside of a webserver root)
-Make a copy of the `.ini.sample` files (rename to `.ini`) to build your own
-configuration.
+Make a copy of the `.ini.sample` files (rename to `.ini`) to build
+your own configuration.
## Train
<pre>
python AINews.py train
</pre>
-The trainer collects user ratings and finds the best support vector machines to
-categorize the articles and select which are relevant to AINews readers. Output
-is saved to the `svm_data` path (defined in `config/paths.ini`).
+The trainer collects user ratings and finds the best support vector
+machines to categorize the articles and select which are relevant to
+AINews readers. Output is saved to the `svm_data` path (defined in
+`config/paths.ini`).
+
+# Authors
+
+The most recent version (as of this writing) of NewsFinder was written
+by [Joshua Eckroth](http://aaai.org/AITopics/Profiles/Jeckroth). The
+prior iteration was written by
+[Liang Dong](http://aaai.org/AITopics/Profiles/Ldong). Before that,
+NewsFinder was coded by
+[Tom Charytoniuk](http://aaai.org/AITopics/Profiles/Tcharytoniuk). The
+project has been supervised by
+[Bruce Buchanan](http://aaai.org/AITopics/Profiles/Bgbuchanan) and
+[Reid Smith](http://aaai.org/AITopics/Profiles/Rgsmith).
+
+# Publications
+
+The NewsFinder software has been documented in the following
+publications:
+
+L. Dong, R. G. Smith and
+B.G. Buchanan. [NewsFinder: Automating an Artificial Intelligence News Service](http://www.aaai.org/AITopics/articles&columns/NewsFinder2011.pdf). *Twenty-Third
+IAAI Conference on Innovative Applications of Artificial Intelligence
+(IAAI11)*, July, 2011.
+
+L. Dong, R. G. Smith and
+B.G. Buchanan. [Automating the Selection of Stories for AI in the News](http://www.aaai.org/AITopics/assets/PDF/NewsFinder_IEA-AIE_2011.pdf). *Twenty-fourth
+International Conference on Industrial, Engineering and Other
+Applications of Applied Intelligent Systems (IEA/AIE 2011)*, June,
+2011.
+
+# License
+
+Copyright (c) 2011 by the Association for the Advancement of
+Artificial Intelligence. This program and parts of it may be used and
+distributed without charge for non-commercial purposes as long as this
+notice is included.
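
The `publisher::type` parser identifier and the `status` and `relevance` properties described in the README's "News sources" section can be illustrated with a short Python sketch. This is not code from the AINews repository: `split_parser_id`, `crawlable`, the dictionary keys, and the sample rows are all hypothetical, and the real column names in `tables.sql` and the dispatch logic in `AINewsSourceParser.py` may differ.

```python
from typing import NamedTuple


class ParserId(NamedTuple):
    """The two parts of a parser identifier such as 'Publisher::rss'."""
    publisher: str
    kind: str


def split_parser_id(identifier):
    """Split a 'publisher::type' string into its two parts.

    Hypothetical helper; the README only says the parts are
    separated by '::'.
    """
    publisher, sep, kind = identifier.partition("::")
    publisher = publisher.strip()
    kind = kind.strip()
    if not (sep and publisher and kind):
        raise ValueError("expected 'publisher::type', got %r" % (identifier,))
    return ParserId(publisher, kind)


def crawlable(sources):
    """Keep sources whose status is 1, most relevant first.

    The keys 'status' and 'relevance' are assumed names for the
    properties listed in the README.
    """
    active = [s for s in sources if s["status"] == 1]
    return sorted(active, key=lambda s: s["relevance"], reverse=True)


# Made-up sample rows, for illustration only.
SAMPLE_SOURCES = [
    {"parser": "ExampleNews::rss", "status": 1, "relevance": 3},
    {"parser": "ExampleWire::search", "status": 0, "relevance": 9},
    {"parser": "ExampleTimes::rss", "status": 1, "relevance": 7},
]

if __name__ == "__main__":
    for source in crawlable(SAMPLE_SOURCES):
        parsed = split_parser_id(source["parser"])
        print(parsed.publisher, parsed.kind, source["relevance"])
```

Sorting by relevance here only orders the crawl list; how NewsFinder actually weighs source relevance when choosing stories to publish is not shown in this commit.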