Install from PyPI, or get the latest version directly from GitHub:
pip3 install mtweepy
pip3 install git+https://github.com/Souvic/mtweepy.git
The repo provides three functions: get_followers, get_timelines, and get_users.
All functions distribute requests across the supplied auth tokens for the fastest possible scraping.
Apart from the self-explanatory inputs:
- As auths, a list of Tweepy bearer tokens is expected if you want to use OAuth 2 rate limits for the Twitter API.
- As auths, a list of [oauth_consumer_key, oauth_consumer_secret, client_secret, oauth_token, oauth_token_secret] lists is expected if you want to use OAuth 1 rate limits for the Twitter API.
- The use_userid parameter defaults to False. If it is passed as True to get_followers, the screen_name_or_userid parameter is treated as the user ID whose followers are to be scraped.
- output_folder should be an empty folder where the output of get_timelines and get_users is saved.
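For reference, the two accepted shapes of auths might look like the sketch below (all token strings are placeholders, not real credentials):

```python
# OAuth 2: a flat list of bearer tokens (placeholder values)
auths_oauth2 = [
    "BEARER_TOKEN_1",
    "BEARER_TOKEN_2",
]

# OAuth 1: a list of 5-element credential lists, in the order
# [oauth_consumer_key, oauth_consumer_secret, client_secret,
#  oauth_token, oauth_token_secret] (placeholder values)
auths_oauth1 = [
    ["CONSUMER_KEY", "CONSUMER_SECRET", "CLIENT_SECRET",
     "OAUTH_TOKEN", "OAUTH_TOKEN_SECRET"],
]
```

The more tokens you pass, the more parallel rate-limit windows the functions can use.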
Example usage:
from mtweepy import get_followers, get_users, get_timelines

list_followers = get_followers(auths, "INCIndia", max_num=500)
# Returns the list of followers, fetched in chunks of 5000.
# Note: if max_num < 5000, the last 5000 followers are still returned.

get_users(auths, list_followers, output_folder="./testfolder1")
# The output is saved in output_folder as multiple jsonl files (one file per
# access token). Each line contains the maximally extended user object for one user.

get_timelines(auths, list_followers, output_folder="./testfolder2")
# The output is saved in output_folder as multiple jsonl files (one file per
# access token). Each line contains the last 3200 tweets of one user.
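Since the output files are JSON Lines, they can be read back with the standard library alone. A minimal sketch (the helper name is my own, not part of mtweepy):

```python
import glob
import json

def load_jsonl_folder(folder):
    """Read every record from all .jsonl files in a folder."""
    records = []
    for path in glob.glob(f"{folder}/*.jsonl"):
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:  # skip blank lines
                    records.append(json.loads(line))
    return records
```

For example, load_jsonl_folder("./testfolder1") would return one dict per collected user object.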
To monitor progress, run this in the command line at any point during data collection:
find ./testfolder1 -name '*.jsonl' | xargs wc -l
For get_users, each line contains approximately 100 users; for get_timelines, each line contains one user's timeline.
You can therefore use this command to compute an approximate collection rate and estimate when data collection will finish.
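That rate estimate can also be scripted. A minimal sketch in Python (the helper names and the two-sample approach are my own, not part of mtweepy):

```python
import glob
import time

def count_lines(folder):
    """Total number of lines across all .jsonl files in a folder."""
    total = 0
    for path in glob.glob(f"{folder}/*.jsonl"):
        with open(path, encoding="utf-8") as f:
            total += sum(1 for _ in f)
    return total

def estimate_remaining(folder, target_lines, interval=60):
    """Sample the line count twice, `interval` seconds apart, and
    estimate the seconds remaining until `target_lines` is reached."""
    before = count_lines(folder)
    time.sleep(interval)
    after = count_lines(folder)
    rate = (after - before) / interval  # lines written per second
    if rate <= 0:
        return float("inf")  # no progress observed in this window
    return (target_lines - after) / rate
```

For get_timelines, target_lines is simply the number of users you passed in; for get_users, divide the user count by roughly 100 per line.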