TwitterReporter: Breaking News Detection and Visualization through the Geo-Tagged Twitter Network
Twitter provides a constant stream of concise data, useful within both geospatial and temporal domains. This project attempts to accomplish a useful and interesting task: using live Twitter data to automatically identify breaking news events in near real-time.
Please see TwitterReporter.pdf for more details.
TwitterReporter was granted increased access to Twitter's location-based streaming filter API. With rate limiting removed, filtering on a grid system effectively gives us all live, geotagged tweets in the continental US. That elevated access matters for data quality, but the library can still be used with any account.
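As a rough illustration of the grid idea, the sketch below splits one large bounding box into equal cells that could each be passed to a location filter. The class name GridSketch, the cell layout {west, south, east, north}, and the approximate continental US bounds in the usage note are assumptions made for this example, not values taken from the project.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of TwitterReporter itself.
final class GridSketch {

    // Splits a bounding box into rows x cols equal cells.
    // Each cell is returned as {west, south, east, north} in degrees.
    static List<double[]> grid(double west, double south,
                               double east, double north,
                               int rows, int cols) {
        if (rows <= 0 || cols <= 0) {
            throw new IllegalArgumentException("rows and cols must be positive");
        }
        double lonStep = (east - west) / cols;
        double latStep = (north - south) / rows;
        List<double[]> cells = new ArrayList<>();
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                cells.add(new double[] {
                    west + c * lonStep,        // west edge of the cell
                    south + r * latStep,       // south edge of the cell
                    west + (c + 1) * lonStep,  // east edge of the cell
                    south + (r + 1) * latStep  // north edge of the cell
                });
            }
        }
        return cells;
    }
}
```

For example, GridSketch.grid(-125.0, 24.0, -66.0, 50.0, 2, 3) yields six cells covering an assumed, approximate continental US box. How the project actually sizes its grid is not stated here.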
Currently, the bot processes and cleanses incoming data through the following steps:
1.) the tweet must be geotagged
2.) skip any accounts flagged with a non-English language
3.) collapse runs of whitespace into single spaces (easier to parse)
4.) skip anything with non-printable ASCII characters
5.) remove URLs
6.) remove retweet markers ("RT @foo" syntax)
7.) remove hashtags
8.) remove XHTML-encoded characters
9.) remove non-alphanumeric characters
10.) remove stopwords
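The text-level steps above (3 through 9) can be sketched with plain regular expressions. This is a minimal illustration, not the project's actual code: the class name TweetCleanerSketch and every pattern are assumptions, step 4 is interpreted as "skip any tweet containing a character outside the printable ASCII range", and steps 1, 2 and 10 are left out because they depend on tweet metadata and on the Lucene stopword list.

```java
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical sketch of cleaning steps 3-9; patterns are illustrative assumptions.
final class TweetCleanerSketch {
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");
    private static final Pattern OUTSIDE_PRINTABLE_ASCII = Pattern.compile("[^\\x20-\\x7E]");
    private static final Pattern URL = Pattern.compile("https?://\\S+");
    private static final Pattern RT_MARKER = Pattern.compile("\\bRT @\\w+:?");
    private static final Pattern HASHTAG = Pattern.compile("#\\w+");
    private static final Pattern ENTITY = Pattern.compile("&#?\\w+;");
    private static final Pattern NON_ALPHANUMERIC = Pattern.compile("[^A-Za-z0-9 ]");

    // Returns the cleaned text, or empty if the tweet should be skipped.
    static Optional<String> clean(String raw) {
        // Step 3: collapse whitespace runs into single spaces.
        String text = WHITESPACE.matcher(raw).replaceAll(" ");
        // Step 4 (assumed interpretation): skip tweets with characters
        // outside printable ASCII.
        if (OUTSIDE_PRINTABLE_ASCII.matcher(text).find()) {
            return Optional.empty();
        }
        text = URL.matcher(text).replaceAll(" ");               // step 5
        text = RT_MARKER.matcher(text).replaceAll(" ");         // step 6
        text = HASHTAG.matcher(text).replaceAll(" ");           // step 7
        text = ENTITY.matcher(text).replaceAll(" ");            // step 8
        text = NON_ALPHANUMERIC.matcher(text).replaceAll(" ");  // step 9
        // Re-collapse the spaces left behind by the removals.
        text = WHITESPACE.matcher(text).replaceAll(" ").trim();
        return text.isEmpty() ? Optional.empty() : Optional.of(text);
    }
}
```

Order matters in this sketch: URLs are removed before hashtags so that a "#fragment" inside a link is dropped with the link, and entities are removed before the general non-alphanumeric pass so that the letters inside something like "&amp;" do not survive as a stray word.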
The stopword list is aggregated from nine different sources, plus a full geographical database. All words are compiled into a single stopwords/generated.txt file that is used by Lucene.
Eventually (this part is a work in progress), the resulting usable tweets will be run through a simple document-frequency (DF) algorithm in geospatial chunks. Other algorithms were experimented with (IDF, etc.) but were thrown out for various reasons (see the paper for explanations). If a topic is found, it is stored along with the tweet that composed it.
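To make the "DF in geospatial chunks" idea concrete, here is a minimal sketch that buckets cleaned tweets into latitude/longitude cells and counts, per cell, how many tweets contain each term. Everything in it is an assumption for illustration: the GeoDfSketch and GeoTweet names, the fixed-size degree cells, and the simple minimum-count threshold. The algorithm described in the paper may differ.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch; not the project's actual topic detection.
final class GeoDfSketch {

    // Minimal stand-in for a cleaned, geotagged tweet.
    static final class GeoTweet {
        final double lat;
        final double lon;
        final String text;
        GeoTweet(double lat, double lon, String text) {
            this.lat = lat;
            this.lon = lon;
            this.text = text;
        }
    }

    // Maps a coordinate to a "row:col" key for a cell of the given size in degrees.
    static String cellKey(double lat, double lon, double cellDegrees) {
        long row = (long) Math.floor(lat / cellDegrees);
        long col = (long) Math.floor(lon / cellDegrees);
        return row + ":" + col;
    }

    // For each cell, counts how many tweets contain each term.
    // A term repeated inside one tweet is counted once (document frequency).
    static Map<String, Map<String, Integer>> documentFrequencies(
            List<GeoTweet> tweets, double cellDegrees) {
        Map<String, Map<String, Integer>> df = new HashMap<>();
        for (GeoTweet t : tweets) {
            Map<String, Integer> cell = df.computeIfAbsent(
                cellKey(t.lat, t.lon, cellDegrees), k -> new HashMap<>());
            Set<String> unique = new HashSet<>();
            for (String term : t.text.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (!term.isEmpty()) {
                    unique.add(term);
                }
            }
            for (String term : unique) {
                cell.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    // Keeps, per cell, the terms whose document frequency reaches minDf.
    static Map<String, Set<String>> topics(
            Map<String, Map<String, Integer>> df, int minDf) {
        Map<String, Set<String>> result = new TreeMap<>();
        for (Map.Entry<String, Map<String, Integer>> cell : df.entrySet()) {
            for (Map.Entry<String, Integer> term : cell.getValue().entrySet()) {
                if (term.getValue() >= minDf) {
                    result.computeIfAbsent(cell.getKey(), k -> new TreeSet<>())
                          .add(term.getKey());
                }
            }
        }
        return result;
    }
}
```

In this sketch, three nearby tweets that all mention "fire" would surface "fire" as a candidate topic for their shared cell when minDf is 3, while a single mention in a distant cell would not.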
The hope is that stored topics can be displayed in a geographic visualization, such as a Google Maps app.
The Java/XML system is built on the following:
- Twitter4J
- Hibernate ORM
- Apache Lucene
CONTRIBUTIONS ARE WELCOME! Feel free to contact me with any questions!
Licensed under the GNU Lesser General Public License (LGPL) v3.0