df84461 Brett Meyer README updates
brmeyer authored
TwitterReporter: Breaking News Detection and Visualization through the Geo-Tagged Twitter Network

Twitter provides a constant stream of concise data, useful within both geospatial and temporal domains. This project attempts to accomplish a useful and interesting task: using live Twitter data to automatically identify breaking news events in near real-time.

Please see TwitterReporter.pdf for more details.

TwitterReporter was granted increased access to Twitter's location-based streaming filter API. With rate limiting removed, filtering on a grid system effectively gives us all live, geotagged tweets in the continental US. Although this access is important for data quality, the library can still be used with any account.

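As a rough illustration of the grid idea, the sketch below splits one bounding box into equal cells. The GeoGrid class, the 5 x 3 grid size, and the approximate continental-US bounds are assumptions made for this example; they are not taken from the project.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits a bounding box into a grid of smaller boxes. */
class GeoGrid {

    /**
     * Returns one box per grid cell as {west, south, east, north} in degrees,
     * ordered row by row starting from the south-west corner.
     */
    static List<double[]> split(double west, double south, double east, double north,
                                int cols, int rows) {
        List<double[]> boxes = new ArrayList<>();
        double width = (east - west) / cols;
        double height = (north - south) / rows;
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                double w = west + c * width;
                double s = south + r * height;
                boxes.add(new double[] { w, s, w + width, s + height });
            }
        }
        return boxes;
    }

    public static void main(String[] args) {
        // Assumed, approximate bounds for the continental US (not from the project).
        List<double[]> boxes = split(-125.0, 24.0, -66.0, 50.0, 5, 3);
        for (double[] b : boxes) {
            System.out.printf("SW=(%.2f, %.2f) NE=(%.2f, %.2f)%n", b[0], b[1], b[2], b[3]);
        }
    }
}
```

Each cell could then be handed to a streaming location filter as a south-west/north-east corner pair. How the project actually sizes and submits its grid is not described here, so check the Twitter4J documentation and the project source for the real form.
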
Currently, the bot processes and cleanses incoming data through the following steps:

1.) the tweet must be geotagged
2.) skip any account flagged with a non-English language
3.) replace whitespace with single spaces (easier to parse)
4.) skip any tweet containing non-printable ASCII characters
5.) remove URLs
6.) remove retweet markers ("RT @foo" syntax)
7.) remove hashtags
8.) remove XHTML-encoded characters
9.) remove non-alphanumeric characters
10.) remove stopwords

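The text-level steps (3 through 10) can be sketched with plain regular expressions. This is a minimal illustration, not the project's code: the TweetCleaner class, every pattern, and the reading of step 4 as "reject anything outside the printable ASCII range 0x20-0x7E" are assumptions. Steps 1 and 2 depend on tweet metadata and are left out.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical sketch of cleaning steps 3-10; all patterns are assumptions. */
class TweetCleaner {
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");
    private static final Pattern PRINTABLE_ASCII = Pattern.compile("[\\x20-\\x7E]*");
    private static final Pattern URL = Pattern.compile("https?://\\S+");
    private static final Pattern RETWEET = Pattern.compile("\\bRT @\\w+:?");
    private static final Pattern HASHTAG = Pattern.compile("#\\w+");
    private static final Pattern ENTITY = Pattern.compile("&#?\\w+;");
    private static final Pattern NON_ALNUM = Pattern.compile("[^A-Za-z0-9 ]");

    /** Returns the cleaned text, or empty if the tweet should be skipped (step 4). */
    static Optional<String> clean(String text, Set<String> stopwords) {
        String s = WHITESPACE.matcher(text).replaceAll(" ").trim();   // step 3
        if (!PRINTABLE_ASCII.matcher(s).matches()) {                  // step 4
            return Optional.empty();
        }
        s = URL.matcher(s).replaceAll(" ");                           // step 5
        s = RETWEET.matcher(s).replaceAll(" ");                       // step 6
        s = HASHTAG.matcher(s).replaceAll(" ");                       // step 7
        s = ENTITY.matcher(s).replaceAll(" ");                        // step 8
        s = NON_ALNUM.matcher(s).replaceAll(" ");                     // step 9
        StringBuilder out = new StringBuilder();
        for (String word : s.toLowerCase(Locale.ROOT).split(" +")) {  // step 10
            if (!word.isEmpty() && !stopwords.contains(word)) {
                if (out.length() > 0) out.append(' ');
                out.append(word);
            }
        }
        return Optional.of(out.toString());
    }

    public static void main(String[] args) {
        Set<String> stopwords = new HashSet<>(Arrays.asList("the", "is"));
        String raw = "RT @foo: The  fire is downtown! http://t.co/abc #breaking &amp; smoke";
        System.out.println(clean(raw, stopwords).orElse("(skipped)"));
    }
}
```

The sketch lower-cases words before the stopword lookup, which assumes a lower-case stopword list. In the project itself the stopword step is handled through Lucene.
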
The stopword list is aggregated from 9 different sources, as well as a full geological database. All words are compiled into a single file, stopwords/generated.txt, which is used by Lucene.

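A minimal sketch of that kind of aggregation is shown below. The StopwordMerger class, the inline sample lists, and the one-word-per-line output format are assumptions for illustration; the project's actual generator and sources are not described here.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical sketch: merges several word lists into one stopword file. */
class StopwordMerger {

    /** Returns a sorted, de-duplicated, lower-cased union of all source lists. */
    static SortedSet<String> merge(List<List<String>> sources) {
        SortedSet<String> merged = new TreeSet<>();
        for (List<String> source : sources) {
            for (String word : source) {
                String w = word.trim().toLowerCase(Locale.ROOT);
                if (!w.isEmpty()) {
                    merged.add(w);
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) throws IOException {
        // Sample lists standing in for the real sources (assumed data).
        List<List<String>> sources = Arrays.asList(
                Arrays.asList("the", "a", "The "),
                Arrays.asList("and", "a"));
        Path out = Paths.get("stopwords", "generated.txt");
        Files.createDirectories(out.getParent());
        // Writes one word per line (assumed format).
        Files.write(out, merge(sources), StandardCharsets.UTF_8);
    }
}
```
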
Eventually (this is a work in progress), the resulting usable tweets are run through a simple document-frequency (DF) algorithm in geospatial chunks. Other algorithms (IDF, etc.) were experimented with but thrown out for various reasons (see the paper for explanations). If a topic is found, it is stored along with the tweet that composed it.

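To make the idea concrete, here is a small sketch of document frequency computed per geographic cell: a term's DF is the number of tweets in the cell that contain it at least once. The GeoDocumentFrequency class, the Tweet stand-in type, the fixed-size degree cells, and the minDf threshold are all assumptions; see the paper for the approach the project actually takes.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of per-cell document frequency; not the project's code. */
class GeoDocumentFrequency {

    /** Minimal stand-in for a cleaned, geotagged tweet (assumed type). */
    static final class Tweet {
        final double lat;
        final double lon;
        final String text;

        Tweet(double lat, double lon, String text) {
            this.lat = lat;
            this.lon = lon;
            this.text = text;
        }
    }

    /** Maps a coordinate to a "row:col" key for square cells of the given size. */
    static String cellKey(double lat, double lon, double cellDegrees) {
        long row = (long) Math.floor(lat / cellDegrees);
        long col = (long) Math.floor(lon / cellDegrees);
        return row + ":" + col;
    }

    /** For each cell, counts how many tweets contain each term at least once. */
    static Map<String, Map<String, Integer>> documentFrequencies(List<Tweet> tweets,
                                                                 double cellDegrees) {
        Map<String, Map<String, Integer>> dfByCell = new HashMap<>();
        for (Tweet t : tweets) {
            Map<String, Integer> df = dfByCell.computeIfAbsent(
                    cellKey(t.lat, t.lon, cellDegrees), k -> new HashMap<>());
            // A set ensures a term repeated within one tweet is counted once.
            Set<String> terms = new HashSet<>(Arrays.asList(t.text.split(" ")));
            for (String term : terms) {
                if (!term.isEmpty()) {
                    df.merge(term, 1, Integer::sum);
                }
            }
        }
        return dfByCell;
    }

    /** Returns, per cell, the terms whose DF reaches minDf; cells with none are omitted. */
    static Map<String, Set<String>> topics(List<Tweet> tweets, double cellDegrees, int minDf) {
        Map<String, Set<String>> result = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> cell
                : documentFrequencies(tweets, cellDegrees).entrySet()) {
            Set<String> frequent = new TreeSet<>();
            for (Map.Entry<String, Integer> e : cell.getValue().entrySet()) {
                if (e.getValue() >= minDf) {
                    frequent.add(e.getKey());
                }
            }
            if (!frequent.isEmpty()) {
                result.put(cell.getKey(), frequent);
            }
        }
        return result;
    }
}
```

For example, with 1-degree cells and minDf set to 2, two nearby tweets that both mention "fire" would yield "fire" as a candidate topic for their shared cell, while a single tweet elsewhere would yield nothing.
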
The hope is that stored topics can be displayed in a geographic visualization. Google Maps apps and similar tools are envisioned.

The Java/XML system is built on the following:

- Twitter4J
- Hibernate ORM
- Apache Lucene

CONTRIBUTIONS ARE WELCOME! Feel free to contact me with any questions!

Licensed under the GNU Lesser General Public License (LGPL) v3.0