TwitterReporter: Breaking News Detection and Visualization through the Geo-Tagged Twitter Network

Twitter provides a constant stream of concise data, useful within both geospatial and temporal domains.  This project attempts to accomplish a useful and interesting task: using live Twitter data to automatically identify breaking news events in near real-time.

Please see TwitterReporter.pdf for more details.

TwitterReporter was granted elevated access to the Twitter API's location-based streaming filter.  With rate limiting removed, filtering on a grid system effectively gives us all live, geotagged tweets in the continental US.  This matters for data quality, but the library can still be used with any account.
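
As a rough illustration of the grid idea, the sketch below splits a rectangle into bounding-box cells.  It is not taken from the project source: the class name GridBuilder, the cell layout {swLon, swLat, neLon, neLat}, and the cell size are all assumptions, and the continental US bounds mentioned afterwards are approximate.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not the project's code): tiles a rectangle with bounding boxes. */
final class GridBuilder {

    /**
     * Splits the given rectangle into cells of at most cellDegrees per side.
     * Each cell is {swLon, swLat, neLon, neLat}; edge cells are clipped to the rectangle.
     */
    static List<double[]> build(double west, double south, double east, double north,
                                double cellDegrees) {
        if (cellDegrees <= 0 || east <= west || north <= south) {
            throw new IllegalArgumentException("invalid bounds or cell size");
        }
        // Use integer row/column indices so that rounding errors do not accumulate.
        int cols = (int) Math.ceil((east - west) / cellDegrees);
        int rows = (int) Math.ceil((north - south) / cellDegrees);
        List<double[]> cells = new ArrayList<>();
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                double swLon = west + c * cellDegrees;
                double swLat = south + r * cellDegrees;
                cells.add(new double[] {
                    swLon,
                    swLat,
                    Math.min(swLon + cellDegrees, east),
                    Math.min(swLat + cellDegrees, north)
                });
            }
        }
        return cells;
    }
}
```

For example, GridBuilder.build(-125, 24, -66, 50, 10) covers an approximate continental US rectangle with 18 cells.  Each cell's south-west and north-east corners could then be handed to a location filter such as Twitter4J's FilterQuery.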

Currently, the bot processes and cleanses incoming data through the following steps:

1.) skip tweets that are not geotagged
2.) skip accounts flagged with a non-English language
3.) collapse runs of whitespace into single spaces (easier to parse)
4.) skip tweets containing anything other than printable ASCII characters
5.) remove URLs
6.) remove retweet markers ("RT @foo" syntax)
7.) remove hashtags
8.) remove XHTML-encoded characters
9.) remove non-alphanumeric characters
10.) remove stopwords
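
The text-only steps (3 through 9) can be sketched with plain regular expressions.  This is a minimal illustration, not the project's implementation: the class name TweetCleaner and every pattern are assumptions, and step 6 is read here as stripping the "RT @foo" marker rather than discarding the tweet.  Steps 1 and 2 depend on tweet metadata and step 10 on the stopword list, so they are left out.

```java
import java.util.Optional;
import java.util.regex.Pattern;

/** Hypothetical sketch of cleansing steps 3 through 9; not the project's actual code. */
final class TweetCleaner {
    // All patterns below are assumptions chosen for illustration.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");
    private static final Pattern NOT_PRINTABLE_ASCII = Pattern.compile("[^\\x20-\\x7E]");
    private static final Pattern URL = Pattern.compile("https?://\\S+");
    private static final Pattern RETWEET = Pattern.compile("\\bRT @\\w+:?");
    private static final Pattern HASHTAG = Pattern.compile("#\\w+");
    private static final Pattern ENTITY = Pattern.compile("&#?\\w+;");
    private static final Pattern NON_ALPHANUMERIC = Pattern.compile("[^A-Za-z0-9 ]");

    /** Returns the cleaned text, or empty if the tweet should be skipped (step 4). */
    static Optional<String> clean(String raw) {
        // Step 3: collapse all whitespace runs into single spaces.
        String text = WHITESPACE.matcher(raw).replaceAll(" ").trim();
        // Step 4: skip tweets with characters outside printable ASCII.
        if (NOT_PRINTABLE_ASCII.matcher(text).find()) {
            return Optional.empty();
        }
        text = URL.matcher(text).replaceAll(" ");              // step 5
        text = RETWEET.matcher(text).replaceAll(" ");          // step 6
        text = HASHTAG.matcher(text).replaceAll(" ");          // step 7
        text = ENTITY.matcher(text).replaceAll(" ");           // step 8
        text = NON_ALPHANUMERIC.matcher(text).replaceAll(" "); // step 9
        // Removals leave gaps, so collapse whitespace once more.
        return Optional.of(WHITESPACE.matcher(text).replaceAll(" ").trim());
    }
}
```

Returning an Optional keeps "skip this tweet" distinct from "the tweet cleaned down to an empty string"; a caller would drop empty results before stopword removal.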

The stopword list is aggregated from 9 different sources, plus a full geographical database.  All words are compiled into a single stopwords/generated.txt file that is used by Lucene.

Eventually (work in progress), the usable tweets that remain will be run through a simple document-frequency (DF) algorithm in geospatial chunks.  Other algorithms (IDF, etc.) were experimented with but thrown out for various reasons (see the paper for explanations).  If a topic is found, it is stored along with the tweet that composed it.
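
To make the DF idea concrete, here is a minimal sketch of document-frequency counting over the cleaned tweets of one geospatial chunk.  It is an illustration under stated assumptions, not the algorithm from the paper: the class name DocumentFrequency, the minTweets threshold, and treating each tweet as one document are all hypothetical choices.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical sketch of document-frequency counting for one geospatial chunk. */
final class DocumentFrequency {

    /** Counts, for each term, how many tweets contain it at least once. */
    static Map<String, Integer> count(List<String> cleanedTweets) {
        Map<String, Integer> df = new HashMap<>();
        for (String tweet : cleanedTweets) {
            // A set ensures a term repeated within one tweet is counted only once.
            Set<String> distinct =
                new HashSet<>(Arrays.asList(tweet.toLowerCase(Locale.ROOT).split(" +")));
            for (String term : distinct) {
                if (!term.isEmpty()) {
                    df.merge(term, 1, Integer::sum);
                }
            }
        }
        return df;
    }

    /** Keeps terms that occur in at least minTweets tweets (threshold is an assumption). */
    static Map<String, Integer> topics(List<String> cleanedTweets, int minTweets) {
        Map<String, Integer> result = new TreeMap<>();
        count(cleanedTweets).forEach((term, n) -> {
            if (n >= minTweets) {
                result.put(term, n);
            }
        });
        return result;
    }
}
```

Calling topics once per chunk keeps the counts local to an area, so a term only surfaces when several nearby tweets share it.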

The hope is that stored topics can be displayed on a geographic visualization, such as a Google Maps app.

The Java/XML system is built on the following:

- Twitter4J
- Hibernate ORM
- Apache Lucene

CONTRIBUTIONS ARE WELCOME!  Feel free to contact me with any questions!

Licensed under the GNU Lesser General Public License (LGPL) v3.0