tweet_processor

Overview

This is a toy project that

tracks specific terms on Twitter,
streams the matching tweets to a SQL database (AWS Redshift),
queries the stored tweets,
then presents the query results on a web page.

This project is meant as a way to learn how to use

Twitter API
AWS Kinesis
AWS Redshift

Dependencies

AWS EC2 setup:

Python 3
Kinesis agent
gcc
memcached

Python scripting:

psycopg2
memcache
tweepy

Twitter / tweet tracking:

Twitter bot keys & tokens

Other AWS services:

Redshift
S3
Kinesis

Components

tweet_basketball.py and tweet_baseball.py:

Connect to the Twitter API using bot tokens.
Track the tweet stream for a specific term (basketball or baseball, one per script).
Extract the created_at and id_str fields from each tweet record.
Store the tweet record in /tmp/<term>.log.
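
The extraction step can be sketched as follows. This is a minimal illustration, not the scripts' actual code: the function names, the pipe-delimited line layout, and the reformatted timestamp are all assumptions. It also assumes each tweet arrives as a raw JSON string whose created_at field looks like "Wed Oct 10 20:19:24 +0000 2018".

```python
import json
from datetime import datetime

# Layout of the "created_at" field in classic Twitter API payloads,
# e.g. "Wed Oct 10 20:19:24 +0000 2018".
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def to_log_line(raw_tweet: str) -> str:
    """Turn one raw tweet (a JSON string) into a 'timestamp|id' log line."""
    tweet = json.loads(raw_tweet)
    created = datetime.strptime(tweet["created_at"], TWITTER_TIME_FORMAT)
    # Assumption: pipe-delimited output with a 'YYYY-MM-DD HH:MM:SS' timestamp.
    timestamp = created.strftime("%Y-%m-%d %H:%M:%S")
    return f"{timestamp}|{tweet['id_str']}\n"


def append_record(raw_tweet: str, term: str, log_dir: str = "/tmp") -> None:
    """Append the extracted fields to <log_dir>/<term>.log."""
    with open(f"{log_dir}/{term}.log", "a", encoding="utf-8") as log_file:
        log_file.write(to_log_line(raw_tweet))
```

A stream listener (for example one built with tweepy) would call append_record(raw_data, "baseball") for every incoming tweet. Appending one record per line keeps the file easy for a log-tailing agent to follow.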

Kinesis Firehose:

The Kinesis agent on the stream-source side is configured through /etc/aws-kinesis/agent.json.
The agent monitors the two /tmp/<term>.log files
and sends new records to the corresponding delivery streams.
Each delivery stream is then configured to
- store the records in S3 first,
- then use the COPY command to load the records into Redshift.
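
To illustrate the agent setup, the sketch below generates a configuration with one flow per tracked term. The "flows", "filePattern" and "deliveryStream" keys are the ones the Kinesis agent reads for Firehose delivery streams. The delivery stream names and the helper functions are hypothetical, and the repository's own agent.json may contain other settings.

```python
import json

TERMS = ["basketball", "baseball"]


def build_agent_config(terms, stream_prefix="tweets-"):
    """Build a Kinesis agent config with one Firehose flow per tracked term."""
    return {
        "flows": [
            {
                # Log file written by the matching tweet_<term>.py script.
                "filePattern": f"/tmp/{term}.log",
                # Hypothetical delivery stream name; substitute the real one.
                "deliveryStream": f"{stream_prefix}{term}",
            }
            for term in terms
        ]
    }


def render_agent_config(terms=TERMS) -> str:
    """Return the config as JSON text suitable for agent.json."""
    return json.dumps(build_agent_config(terms), indent=2)
```

Printing render_agent_config() gives JSON text that could be saved as /etc/aws-kinesis/agent.json once the stream names are adjusted.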

Redshift:

A Redshift cluster / database is preconfigured with two tables, one for each stream.

count_tweets.py:

For each table, query the database for the number of tweets that are at most 10 minutes older than the newest tweet.
Store the result in memcached.
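
A rough sketch of this step is shown below. It assumes a DB-API cursor (such as one from psycopg2) and a cache client with a set(key, value) method (such as a memcache client). The table names, the created_at column name, and the cache keys are hypothetical, so the real script may differ.

```python
ALLOWED_TABLES = ("basketball", "baseball")  # hypothetical table names


def build_count_query(table: str) -> str:
    """SQL counting tweets at most 10 minutes older than the newest one."""
    # Table names cannot be bound as query parameters, so check an allowlist.
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unexpected table: {table!r}")
    return (
        f"SELECT COUNT(*) FROM {table} "
        f"WHERE created_at >= (SELECT MAX(created_at) FROM {table}) "
        f"- INTERVAL '10 minutes'"
    )


def count_recent(cursor, table: str) -> int:
    """Run the count query on a DB-API cursor and return the single value."""
    cursor.execute(build_count_query(table))
    return cursor.fetchone()[0]


def refresh_counts(cursor, cache, tables=ALLOWED_TABLES) -> dict:
    """Count recent tweets per table and store each count in the cache."""
    counts = {table: count_recent(cursor, table) for table in tables}
    for table, count in counts.items():
        cache.set(f"count_{table}", count)  # hypothetical cache key
    return counts
```

Measuring the window from the newest stored tweet rather than from the current time means the count stays meaningful even when delivery into Redshift lags behind.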

simple_http.py:

Simple HTTP server to serve index.html for presenting the result.
index.html requests status.json for an update.
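
One way such a server could look, using only the standard library, is sketched below. This is an illustration under assumptions rather than the contents of simple_http.py: read_counts is a stand-in for the real cache lookup, and the JSON shape of status.json is invented for the example.

```python
import json
from http.server import HTTPServer, SimpleHTTPRequestHandler


def read_counts() -> dict:
    """Stand-in for the cache lookup; replace with a real memcached read."""
    return {"basketball": 0, "baseball": 0}


def status_payload(counts: dict) -> bytes:
    """Encode the counts as the JSON body returned for status.json."""
    return json.dumps(counts, sort_keys=True).encode("utf-8")


class StatusHandler(SimpleHTTPRequestHandler):
    def do_GET(self):
        # Ignore any query string the page appends to the request.
        if self.path.split("?", 1)[0] == "/status.json":
            body = status_payload(read_counts())
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            # Everything else (including index.html) is served from disk.
            super().do_GET()


def make_server(port: int = 8000) -> HTTPServer:
    """Create the server; call serve_forever() on the result to run it."""
    return HTTPServer(("", port), StatusHandler)
```

Running make_server().serve_forever() from the directory that contains index.html would serve the page at / and the current counts at /status.json.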
