TweetPinna is a tweet archiver that saves tweets and metadata to MongoDB. It is designed for long-running archival projects (e.g. for academic use) and is based on Tweepy. As of now, TweetPinna is able to archive tweets based on search terms and/or hashtags as well as based on location. There is rudimentary support for archiving specific users' timelines.
TweetPinna supports Python 3 as of 1.0.9. However, the original TweetPinna, whose development started in early 2014, was written in legacy Python (2.7.x). I am in the process of refactoring the whole codebase and moving away from legacy Python entirely. The project, as it stands right now, works fine but is fairly messy. There will be (very infrequent) marginal updates to this legacy version.
Features
- Automatic image download (profile pictures, images in tweets)
- Flask-based web dashboard providing information and some basic statistics about the current archival project
- Email alerts in case of problems with the archiver
- Ability to manage multiple archival projects using configuration files
- Preview of random tweets
- Tracking users' timelines
- Tracking based on locations / bounding boxes
- Archiving replies to tweets (limited)
Installation and Usage
1. Install and configure MongoDB (currently TweetPinna does not support authentication)
2. Clone the repository into a directory
3. Either edit `cfg/TweetPinnaDefault.cfg` or create your own configuration file (see `docs/annotated-default-config.txt`). Remember to change the password for the dashboard!
4. Install all Python dependencies by running `pip install -r requirements.txt`
5. Install a cronjob that regularly runs the image downloader, `TweetPinnaImageDownloader.py` (see the example crontab below)
6. If you want to regularly fetch timelines, install a cronjob that regularly runs the timeline script (see the example crontab below)
7. Run both `TweetPinna.py` and `TweetPinnaDashboard.py` (either as a service or in a screen session). If you also want to track based on location, you have to run `TweetPinnaTrackLocation.py` in a similar fashion.
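A sketch of the cronjobs from steps 5 and 6; the schedule, the installation path, and the timeline script name (`TweetPinnaTimeline.py`) are assumptions and need to be adapted to your setup:

```
# Example crontab entries (crontab -e)

# Step 5: run the image downloader every hour
0 * * * * cd /path/to/TweetPinna && python TweetPinnaImageDownloader.py cfg/TweetPinnaDefault.cfg

# Step 6: fetch timelines every hour (hypothetical script name)
30 * * * * cd /path/to/TweetPinna && python TweetPinnaTimeline.py cfg/TweetPinnaDefault.cfg
```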
`install.sh` is an alternative to steps 4 and 5 and will use the default configuration. The installer will assume `pip` to be your preferred command.
`start.sh` will run both the archiver and the dashboard in a screen session.
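If you prefer to start things manually, detached screen sessions work as well; the session names here are purely illustrative:

```
# Start the archiver and the dashboard in detached screen sessions
screen -dmS tweetpinna python TweetPinna.py cfg/TweetPinnaDefault.cfg
screen -dmS tweetpinna-dashboard python TweetPinnaDashboard.py cfg/TweetPinnaDefault.cfg

# Optional: track based on location as well
screen -dmS tweetpinna-location python TweetPinnaTrackLocation.py cfg/TweetPinnaDefault.cfg
```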
All TweetPinna scripts require a valid configuration to run. The configuration is always passed as the first argument, e.g. `python TweetPinna.py cfg/TweetPinnaDefault.cfg`.
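Managing multiple archival projects (see the feature list above) therefore boils down to one configuration file per project; the file name below is purely illustrative:

```
# Create a dedicated configuration for a second project
cp cfg/TweetPinnaDefault.cfg cfg/MySecondProject.cfg

# Run the archiver for that project
python TweetPinna.py cfg/MySecondProject.cfg
```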
Running TweetPinna in Production
If you plan to run TweetPinna in production, it is advisable to implement the following:
- Use virtualenv
- Use a dedicated webserver/WSGI (see the sketch below)
- If you plan on harvesting large amounts of data, use `SimpleCache` and precache hashtags from time to time
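A sketch covering the first two points; it assumes the Flask application object in `TweetPinnaDashboard.py` is exposed as `app` and uses gunicorn as an example WSGI server. Since the dashboard reads its configuration from the first command-line argument, a small WSGI wrapper module may be needed in practice:

```
# Set up an isolated environment for TweetPinna
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt gunicorn

# Serve the dashboard through gunicorn instead of the Flask
# development server (assumes a module-level `app` object)
gunicorn -w 2 -b 127.0.0.1:8000 TweetPinnaDashboard:app
```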
Keep in mind that using the media/image downloader will generate a lot of traffic. Based on a sample of 600 tweets, an average tweet amounts to roughly 6 MB of image data; at, say, 500 tweets per hour, that already adds up to about 3 GB of images per hour.
If you decide not to download images immediately (`media_download_instantly : 0`), you can manually download all images by running `python TweetPinnaImageDownloader.py config.cfg`.
Since Twitter (at least with free API access) does not allow you to track/search for replies directly, TweetPinna tries to archive these via users' timelines. Due to these restrictions, tracking replies is time-sensitive. Thus, you should set up a job running `TweetPinnaReplies.py config.cfg <limit>` with a limit that matches the number of tweets tracked per job execution. Hence, if you are tracking approximately 500 tweets per hour, you should run this job at least hourly with a limit of 500.
Also, keep in mind that this method is far from perfect; in particular, it does not account for the fact that replies can be added later. In the example above, you would only catch replies that were made almost immediately.
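Following the example above, the corresponding cronjob could look like this (the installation path is a placeholder):

```
# Archive replies hourly with a limit of 500
0 * * * * cd /path/to/TweetPinna && python TweetPinnaReplies.py config.cfg 500
```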
If persistent logging/tracking is paramount, `restart.sh` can be called from time to time (cronjob) in order to restart both TweetPinna and the MongoDB service in case they are down for some reason. While this is certainly not the 'cleanest' solution, it works well in practice.
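For instance, a cronjob checking every 15 minutes (the interval is arbitrary):

```
# Restart TweetPinna and MongoDB if they are down
*/15 * * * * cd /path/to/TweetPinna && ./restart.sh
```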
Todo and Bugtracker
- Add calling module/file to the log
- AWS S3 compatibility for images
- Fetching a list of friends/relationships and retrieving their tweets (with a given level of depth)
- Fix xlabels in the dashboard
- `get_hashtags()` consumes too much memory and CPU
- Implement i18n
- Implement OSoMe's Botometer (see botometer-python)
- MongoDB auth compatibility
- Provide better installation/running routines
- Replace print/own logger with logging
- Restructuring the project / "make it more pythonic"
- Sphinx Documentation
- Testing / add tests
- Too many hits on Tweepy result in an error
- Unify the individual modules and/or write a wrapper to access them all
- Video downloader
- Separate config and Tweepy initialization into a helper function --> modularization
- Implement instant download functionality within the timeline module
- Dashboard should not start without MongoDB connection -> implement global db checks
- Before adding a tweet to the DB we should check whether it already exists
- The "Tweets over Time" graph(s) doesn't show the actual number of tweets due to scaling effects
- Switch over to f-strings
- Add full support for 'extended_tweet' objects
If the database (MongoDB) becomes unavailable for any reason, TweetPinna continues to collect tweets. Once the connection is reestablished, the tweet buffer is dumped into the database. While this behaviour can be memory-heavy, it ensures that no (or at least fewer) tweets are lost. If you want to disable this function, set `tweet_buffer : 0`.
If there is no dashboard username set, the dashboard will be unprotected.