Souvic/mtweepy

Fastest scraping using multiple apps and user tokens for the Twitter API!


Makes Twitter scraping with multiple Twitter apps easy again!

License: MIT

Support me

Buy Me A Coffee

Install from PyPI

pip3 install mtweepy

Or install from the main branch

pip3 install git+https://github.com/Souvic/mtweepy.git

Example usage

The package exposes three functions: get_followers, get_timelines, and get_users.

All of them spread requests across every provided auth token, so scraping runs as fast as the combined rate limits allow.

Apart from the self-explanatory inputs:

  1. As auths, a list of tweepy bearer tokens is expected if you want to use the OAuth 2 rate limits of the Twitter API.
  2. As auths, a list of [oauth_consumer_key, oauth_consumer_secret, client_secret, oauth_token, oauth_token_secret] lists is expected if you want to use the OAuth 1 rate limits of the Twitter API.
  3. The use_userid parameter defaults to False. If True is passed to get_followers, the screen_name_or_userid parameter is treated as a user ID whose followers are to be scraped.
  4. output_folder should be an empty folder where get_timelines and get_users save their output.
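For illustration, the two accepted shapes of auths could be built like this. This is only a sketch of the list structure described above; every token string below is a placeholder, not a real credential.

```python
# Sketch of the two auth-list shapes mtweepy accepts.
# All token values here are placeholders.

# OAuth 2 (app-only) limits: a flat list of bearer tokens,
# one per Twitter app.
auths_oauth2 = [
    "BEARER_TOKEN_APP_1",
    "BEARER_TOKEN_APP_2",
]

# OAuth 1 (user-context) limits: one five-element list per
# app/user pair, in the order given in the list above.
auths_oauth1 = [
    ["CONSUMER_KEY_1", "CONSUMER_SECRET_1", "CLIENT_SECRET_1",
     "ACCESS_TOKEN_1", "ACCESS_TOKEN_SECRET_1"],
    ["CONSUMER_KEY_2", "CONSUMER_SECRET_2", "CLIENT_SECRET_2",
     "ACCESS_TOKEN_2", "ACCESS_TOKEN_SECRET_2"],
]
```

Whichever shape you pick, the resulting list is what gets passed as the auths argument in the examples below.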

An example usage is provided below.

Get follower user IDs for the screen name INCIndia as a list; followers arrive in chunks of 5000, so you receive 5000*ceil(max_num/5000) IDs.

from mtweepy import get_followers, get_users, get_timelines

# Followers are appended in chunks of 5000; if max_num < 5000,
# the last 5000 followers are returned.
list_followers = get_followers(auths, "INCIndia", max_num=500)

Get the maximally extended user objects for list_followers (a list of user IDs)

The output is saved in output_folder as multiple JSONL files (one file per access token). Each line of a JSONL file contains the maximally extended user object for one user.

get_users(auths, list_followers, output_folder="./testfolder1")
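Once get_users finishes, the JSONL output can be read back line by line. A minimal sketch, assuming the output folder name used above; the fields inside each record depend on the Twitter API response and are not shown here:

```python
import json
from pathlib import Path

# Sketch: parse every line of every JSONL file that get_users
# wrote into "./testfolder1" (the output_folder used above).
records = []
for path in Path("./testfolder1").glob("*.jsonl"):
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))

print(f"parsed {len(records)} JSONL records")
```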

Get all the tweets in the timelines of list_followers (a list of user IDs)

The output is saved in output_folder as multiple JSONL files (one file per access token). Each line of a JSONL file contains the last 3200 tweets of one user.

get_timelines(auths, list_followers, output_folder="./testfolder2")

To get the total number of lines written to files under ./testfolder1, run this on the command line at any point during data collection:

find ./testfolder1 -name '*.jsonl' | xargs wc -l

For get_users, each line holds approximately 100 users; for get_timelines, each line holds one user timeline.

So you can turn the line count into an approximate collection rate and estimate when data collection will finish.
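As a rough sketch of that estimate, the helper below samples the line count twice and extrapolates. The folder path, target count, and per-line multiplier are illustrative values following the notes above, not part of mtweepy's API:

```python
import time
from pathlib import Path

def count_jsonl_lines(folder):
    """Total number of lines across all .jsonl files in a folder."""
    total = 0
    for path in Path(folder).glob("*.jsonl"):
        with open(path) as f:
            total += sum(1 for _ in f)
    return total

def estimate_remaining(folder, target_items, items_per_line, interval=60):
    """Sample the line count twice, `interval` seconds apart, and
    estimate the seconds left until `target_items` are collected.
    Returns None when no progress was made between the samples."""
    first = count_jsonl_lines(folder) * items_per_line
    time.sleep(interval)
    second = count_jsonl_lines(folder) * items_per_line
    rate = (second - first) / interval  # items collected per second
    if rate <= 0:
        return None
    return (target_items - second) / rate

# Example: per the note above, get_users writes ~100 users per line
# and get_timelines writes one timeline per line.
# estimate_remaining("./testfolder1", target_items=500_000, items_per_line=100)
```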