This is the NewsFinder software, designed to automatically crawl the web for news related to artificial intelligence, filter, categorize, and rank the news, and publish to a wiki, mailing list, and RSS feeds.
config Training and classifying with Weka seems to work.
.gitignore Added wiki to .gitignore so I can store the AINews wiki in this folder.
AINews.py Training experiments added.
AINewsConfig.py Removed a lot of unneeded code.
AINewsCorpus.py Training and classifying with Weka seems to work.
AINewsCrawler.py Training and classifying with Weka seems to work.
AINewsDB.py Major (but incomplete) updates to parsing and extraction.
AINewsDupExperiment.py OTS summarizer, parser experimenter, better SVM experimenter, feature…
AINewsDuplicates.py Showing publisher info even if it comes from Google News.
AINewsPublisher.py Small fixes for classification.
AINewsSummarizer.py Various improvements with text handling and summarizing.
AINewsTextProcessor.py Training and classifying with Weka seems to work.
AINewsTools.py Training and classifying with Weka seems to work.
AINewsWekaClassifier.py Training experiments added.
CorpusCategories.py Some code documentation and reorganization for clarity.
CorpusExport.py Added license info to all files and enhanced README to include better…
LICENSE Added license info to all files and enhanced README to include better…
README.corpus Documentation.
README.md Training and classifying with Weka seems to work.
arff.py Added ARFF file for Weka training, and fixed training to avoid the te…
corpus-mds.r CorpusExport produces matrices for each category (for a faceted grid)…
ents.py Added HTML entity conversion code.
svm-easy.py Path fixes, SVM bug fixed (w.r.t. labels), other cleanups.
svm-grid.py Added NotRelated SVM support, added transcript for article filtering …
tables.sql Updated MySQL table schemas.

README.md

NewsFinder

Full details of this software can be found at AITopics.

The main script AINews.py can be called with one of three arguments:

python AINews.py crawl

This crawls the sources list and stores its results in the database (table urllist).

python AINews.py prepare

This filters and processes the news, and creates an XML file export.

python AINews.py email

This generates the weekly email as an HTML submission form.

Our configuration has a script that runs the crawl and prepare commands (in that order) once each day, and the email command once a week. The email is sent manually using the generated submission form.
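The schedule described above could be sketched as a small wrapper that is run once a day (for example from cron). This is only an illustration, not the script we actually use: the names `commands_for` and `run_daily`, and the choice of Monday as the email day, are assumptions.

```python
import datetime
import subprocess
import sys

# Assumption: the weekly email form is generated on Mondays (weekday() == 0).
EMAIL_WEEKDAY = 0


def commands_for(day):
    """Return the AINews.py commands to run on the given date, in order."""
    commands = ["crawl", "prepare"]  # crawl must run before prepare
    if day.weekday() == EMAIL_WEEKDAY:
        commands.append("email")
    return commands


def run_daily(day=None, script="AINews.py"):
    """Run each scheduled command in turn, stopping at the first failure."""
    day = day or datetime.date.today()
    for command in commands_for(day):
        # check_call raises CalledProcessError if the command exits non-zero.
        subprocess.check_call([sys.executable, script, command])
```

Calling `run_daily()` from a daily job would then crawl and prepare every day and additionally generate the email form once a week. Sending the email remains a manual step.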

Installation

NewsFinder is primarily coded in Python and requires several libraries, which are available as the following Ubuntu packages:

sudo apt-get install python-mysqldb libsvm-tools python-libsvm \
                     python-cheetah python-nltk python-beautifulsoup \
                     python-pyrss2gen python-feedparser python-unidecode

"Installation" of NewsFinder should only involve downloading the code in this repository. Assuming Python includes the NewsFinder code in its path, execution of NewsFinder should be as simple as running python AINews.py

There are various supplementary files that NewsFinder expects. These files can be obtained from another repository: AINewsSupplementary

In our working configuration, we have a directory called NewsFinder in our home directory. It contains the AINewsSupplementary directories (e.g. templates, resource), and the NewsFinder code from this repository is stored in a code directory inside it. In any event, the paths.ini file (in the config directory of this project) can be modified to point to wherever you keep your files.

Database

Use the MySQL schemas in tables.sql to create the tables needed by NewsFinder.

Configuration

All configuration is provided in the config/*.ini files:

  • config.ini has miscellaneous configuration options, described in the file
  • db.ini has database connectivity information (host, username, password, etc.)
  • paths.ini stores the paths of various data and output directories used by NewsFinder. This file allows you to store data and output in whatever way makes sense for a particular installation (for example, data files should be outside of a webserver root)

Make a copy of the .ini.sample files (rename to .ini) to build your own configuration.
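As a rough illustration of how such an .ini file can be read, the sketch below parses a made-up db.ini with the standard `configparser` module (named `ConfigParser` in Python 2). The section name `database`, the key names, and the helper `load_db_settings` are assumptions for the example; the real names are whatever the .ini.sample files define.

```python
import configparser

# Hypothetical db.ini contents. The real section and key names come from
# config/db.ini.sample and may differ from these.
SAMPLE_DB_INI = """
[database]
host = localhost
user = newsfinder
password = secret
database = ainews
"""


def load_db_settings(text):
    """Parse .ini text and return the [database] section as a plain dict."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return dict(parser["database"])


settings = load_db_settings(SAMPLE_DB_INI)
print(settings["host"])
```

To read an actual file instead of a string, `parser.read("config/db.ini")` can be used in place of `read_string`.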

News sources

Sources are specified in a CSV file obtained from a URL provided in the paths.ini config file. Here is an example short sources CSV file:

"SourceID","Title","Link","Parser"
"62671","Forbes Technology","http://www.forbes.com/technology/index.xml","RSS"
"62673","BBC Technology","http://feeds.bbci.co.uk/news/technology/rss.xml","RSS"

Only RSS sources are supported at this time.
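A minimal sketch of reading this format with Python's standard `csv` module follows. It is not the code NewsFinder itself uses; the function name `parse_sources` is hypothetical, and skipping rows whose Parser is not RSS is an assumption based on the note above.

```python
import csv
import io

SAMPLE_SOURCES_CSV = '''"SourceID","Title","Link","Parser"
"62671","Forbes Technology","http://www.forbes.com/technology/index.xml","RSS"
"62673","BBC Technology","http://feeds.bbci.co.uk/news/technology/rss.xml","RSS"
'''


def parse_sources(text):
    """Return one dict per source row, keeping only rows with Parser == RSS."""
    reader = csv.DictReader(io.StringIO(text))
    return [row for row in reader if row["Parser"] == "RSS"]


for source in parse_sources(SAMPLE_SOURCES_CSV):
    print(source["SourceID"], source["Title"], source["Link"])
```

Each row becomes a dictionary keyed by the header names, so a crawler can look up `Link` and `Parser` without depending on column order.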

Authors

The most recent version (as of this writing) of NewsFinder was written by Joshua Eckroth. The prior iteration was written by Liang Dong. Before that, NewsFinder was coded by Tom Charytoniuk. The project has been supervised by Bruce Buchanan and Reid Smith.

License

Copyright (c) 2011 by the Association for the Advancement of Artificial Intelligence. This program and parts of it may be used and distributed without charge for non-commercial purposes as long as this notice is included.

The file arff.py is pulled from the laic-arff package, which is distributed under the MIT License.
