The initial idea behind the project was to create a simple program for gathering data from Reddit. The initial problem was storing historical data: not only how popular a post is, but also how quickly it got there (the same goes for comments). The usual use case would be feeding this data to a data mining algorithm to find the sentiment / popularity of certain terms.
It searches through the subreddits listed in configuration.toml and checks every 15 minutes for new posts and new comments. Once a post has been checked, it moves to stage two, which means its next update will be in 1 hour. The following stages are updated in 3, 5, 8, 16, and 24 hours. The final stage is updated in 900 days.
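The staged schedule above can be sketched as a simple lookup from stage to delay. This is only an illustration, not the project's actual code: the function name `delay_for_stage`, the stage numbering (stage 1 for the 15-minute discovery check, stage 8 for the final one), and the reading of each value as "time until the next update" are all assumptions.

```rust
use std::time::Duration;

const HOUR: u64 = 60 * 60;
const DAY: u64 = 24 * HOUR;

/// Hypothetical helper: how long to wait before the next update of a post
/// that is currently in `stage`. The numbering is an assumption:
/// stage 1 is the 15-minute discovery check, stage 8 is the final stage.
fn delay_for_stage(stage: u32) -> Duration {
    let secs = match stage {
        0 | 1 => 15 * 60,  // new posts and comments are checked every 15 minutes
        2 => HOUR,         // stage two: next update in 1 hour
        3 => 3 * HOUR,
        4 => 5 * HOUR,
        5 => 8 * HOUR,
        6 => 16 * HOUR,
        7 => 24 * HOUR,
        _ => 900 * DAY,    // final stage: next update in 900 days
    };
    Duration::from_secs(secs)
}

fn main() {
    // Print the delay for every stage in the assumed numbering.
    for stage in 1..=8 {
        println!("stage {}: next update in {:?}", stage, delay_for_stage(stage));
    }
}
```

With a mapping like this, the Updates table would only need to store the current stage and the next update time for each post.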
There are 4 requirements for it to run:
- Rust and cargo
- A running PostgreSQL instance with 3 tables (based on postgresql_tables.sql):
  - Updates - stores and schedules the update times for posts
  - Posts - stores data about posts
  - Comments - stores data about comments
- Two configuration files:
  - configuration.toml: rename configuration_new.toml to configuration.toml and fill it in based on the information inside the file
  - tokens.json for storing Reddit tokens: rename tokens_new.json to tokens.json
    - Why are the tokens not stored in configuration.toml? The toml crate has very limited support for date-time objects, and its interoperability with chrono (another crate) is limited.
- Reddit credentials
  - Go to https://www.reddit.com/prefs/apps/ and create a new app (you have to be logged in)
Once those requirements are met, you can run the project with
cargo run
from the project folder (DEBUG information is displayed by default; this can be changed by setting DEBUG = false in the code)
There will soon be a post about the whole process on https://michalszwedo.com (this README will be updated then)