# Article Notebook for Scraping Twitter Using Tweepy

Package Github: https://github.com/tweepy/tweepy

Package Documentation: https://tweepy.readthedocs.io/en/latest/

Article Read-Along: https://towardsdatascience.com/how-to-scrape-more-information-from-tweets-on-twitter-44fd540b8a1f

### Notebook Author: Martin Beck
#### Information current as of August 13th, 2020
<b>Dependencies:</b> Make sure Tweepy is already installed in your Python environment. If not, run `pip install tweepy` to install the package. If you want more information on setting up, I have an article [here](https://towardsdatascience.com/how-to-scrape-tweets-from-twitter-59287e20f0f1) that goes into deeper detail.

## Notebook's Table of Contents<a name="TOC"></a>

0. [Credentials and Authorization](#Section0)
<br>Setting up credentials and authorization in order to utilize Tweepy
1. [Getting More Information From Tweets](#Section1)
<br>How to scrape more information from tweets such as favorite count, retweet count, whether they're replying to someone else, the coordinates of where the tweet came from (if location sharing is turned on), etc.
2. [Getting User Information From Tweets](#Section2)
<br>How to scrape user information from tweets such as their follower count, total number of tweets, whether they're a verified user, the location set on the account's profile, etc.
3. [Scraping Tweets With Advanced Queries](#Section3)
<br>How to scrape for tweets using deeper queries such as searching by language of tweets, tweets within a certain location, tweets within specific date ranges, top tweets, etc.
4. [Putting It All Together](#Section4)
<br>Showcasing how you can mix and match the methods shown above to create queries that'll fulfill your data needs.

## Imports for Notebook

In [1]:
# Pip install Tweepy if you don't already have the package
# !pip install tweepy

# Imports
import tweepy
import pandas as pd
import time

## 0. Credentials and Authorization<a name="Section0"></a>
[Return to Table of Contents](#TOC)
<br>Tweepy requires credentials before you can utilize its API. The code below helps set up the notebook for authorization. I already have an article covering setting up Tweepy and getting credentials [here](https://towardsdatascience.com/how-to-scrape-tweets-from-twitter-59287e20f0f1) if further instructions are needed.

You don't necessarily have to create a credentials file; however, if you find yourself sharing Tweepy code with other parties, I recommend it so you don't accidentally share your credentials. Otherwise, skip the cell that loads the csv file and hardcode your credentials in the commented-out lines below.

In [11]:
# Loading in from csv file

credentials_df = pd.read_csv('credentials.csv',header=None,names=['name','key'])

credentials_df

Unnamed: 0,name,key
0,consumer_key,XXXXXXXXXXX
1,consumer_secret,XXXXXXXXXXX
2,access_token,XXXXXXXXXXX
3,access_secret,XXXXXXXXXXX


In [8]:
# Credentials from csv file

consumer_key = credentials_df.loc[credentials_df['name']=='consumer_key','key'].iloc[0]
consumer_secret = credentials_df.loc[credentials_df['name']=='consumer_secret','key'].iloc[0]
access_token = credentials_df.loc[credentials_df['name']=='access_token','key'].iloc[0]
access_token_secret = credentials_df.loc[credentials_df['name']=='access_secret','key'].iloc[0]

# Credentials hardcoded

# consumer_key = "XXXXX"
# consumer_secret = "XXXXX"
# access_token = "XXXXX"
# access_token_secret = "XXXXXX"

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth,wait_on_rate_limit=True)
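
The repeated `.loc[...]` lookups above can also be replaced by a plain dictionary. The following is a minimal sketch, not part of the original notebook: `load_credentials` is a hypothetical helper, it assumes the same header-less `name,key` layout as `credentials.csv`, and the placeholder text stands in for the real file.

```python
import csv
import io


def load_credentials(file_obj):
    """Read header-less `name,key` rows into a {name: key} dictionary."""
    return {
        row[0].strip(): row[1].strip()
        for row in csv.reader(file_obj)
        if len(row) >= 2  # skip blank or malformed lines
    }


# Placeholder text standing in for the contents of credentials.csv
sample_file = io.StringIO(
    "consumer_key,XXXXX\n"
    "consumer_secret,XXXXX\n"
    "access_token,XXXXX\n"
    "access_secret,XXXXX\n"
)
credentials = load_credentials(sample_file)

consumer_key = credentials["consumer_key"]
consumer_secret = credentials["consumer_secret"]
access_token = credentials["access_token"]
access_token_secret = credentials["access_secret"]
```

With a real file, the same helper could be called as `load_credentials(open('credentials.csv', newline=''))`, ideally inside a `with` block so the file is closed afterwards.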

## 1. Getting More Information From Tweets<a name="Section1"></a>
[Return to Table of Contents](#TOC)
<br>List of information available in tweet object with Tweepy. This is not an exhaustive list but does contain a majority of the available information. If you want an exhaustive list of everything contained in the tweet object there's documentation [here](https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/overview/tweet-object) describing all the attributes. 

String versions of Ids (e.g., id_str, in_reply_to_status_id_str) are used instead of the integer versions to best keep data integrity, as there is a possibility for Ids stored as integers to be cut off or rounded (for example, when a column is converted to floating point).
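
To make the data integrity concern concrete, here is a small sketch of what can happen when a tweet Id passes through a floating-point number, as it can when an integer column containing missing values is converted to floats. The Id is one of the sample tweet Ids shown in this notebook's output; the variable names are only for illustration.

```python
tweet_id_str = "889145385368690689"  # sample tweet Id from this notebook's output

tweet_id_int = int(tweet_id_str)

# A float only carries 53 bits of precision, which is fewer than this Id needs,
# so converting to float and back does not return the original number
round_tripped = int(float(tweet_id_int))

ids_match = round_tripped == tweet_id_int
string_preserved = str(tweet_id_int) == tweet_id_str

print("Integer survives a float round trip:", ids_match)
print("String version is unchanged:", string_preserved)
```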

* tweet.user: <b>User information is covered in part 2 in greater detail</b><br><br>

* tweet.full_text: <b>Text content of tweet when API is told to pull all contents of tweets that have more than 140 characters</b><br><br>

* tweet.text: Text content of tweet
* tweet.created_at: Date tweet was created
* tweet.id_str: Id of tweet
* tweet.user.screen_name: Username of tweet's author
* tweet.coordinates: Geographic location as reported by user or client. May be null, which is why the extract_coordinates function below was created
* tweet.place: Indicates a place associated with the tweet, like Las Vegas, NV. May be null, which is why the extract_place function below was created
* tweet.retweet_count: Count of retweets
* tweet.favorite_count: Count of favorites
* tweet.lang: Indicates a BCP 47 language identifier corresponding to machine detected language of tweet text.
* tweet.source: Source where tweet was posted through. Ex: Twitter Web Client
* tweet.in_reply_to_status_id_str: If a tweet is a reply, the original tweet's id. Can be null if tweet is not a reply
* tweet.in_reply_to_user_id_str: If a tweet is a reply, string representation of original tweet's user id
* tweet.is_quote_status: Boolean indicating whether tweet is a quote tweet

#### F0. extract_coordinates and extract_place
These functions check whether a tweet has coordinate information or place information and extract the pertinent values from the tweet's data dictionary. Because these values are sometimes null, it's important to first check that a tweet has the information available, then extract it and replace the original value in the dataframe.

In [38]:
# Function created to extract coordinates from tweet if it has coordinate info
# Tweets tend to have null so important to run check
# Make sure to run this cell as it is used in a lot of different functions below
def extract_coordinates(row):
    if row['Tweet Coordinates']:
        return row['Tweet Coordinates']['coordinates']
    else:
        return None

# Function created to extract place such as city, state or country from tweet if it has place info
# Tweets tend to have null so important to run check
# Make sure to run this cell as it is used in a lot of different functions below
def extract_place(row):
    if row['Place Info']:
        return row['Place Info'].full_name
    else:
        return None
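
As a quick check of how these two functions behave, the sketch below runs them on made-up rows. It is an illustration under assumptions: a plain dictionary stands in for a dataframe row (the functions only index the row by column name), and `SimpleNamespace` stands in for Tweepy's place object, which is only assumed here to expose `full_name`. The functions are repeated so the example runs on its own.

```python
from types import SimpleNamespace


def extract_coordinates(row):
    if row['Tweet Coordinates']:
        return row['Tweet Coordinates']['coordinates']
    else:
        return None


def extract_place(row):
    if row['Place Info']:
        return row['Place Info'].full_name
    else:
        return None


# Made-up row with location data; coordinates are stored as [longitude, latitude]
row_with_geo = {
    'Tweet Coordinates': {'type': 'Point', 'coordinates': [-115.139858, 36.169786]},
    'Place Info': SimpleNamespace(full_name='Las Vegas, NV'),
}

# Made-up row without location data, which is the more common case
row_without_geo = {'Tweet Coordinates': None, 'Place Info': None}

print(extract_coordinates(row_with_geo), extract_place(row_with_geo))
print(extract_coordinates(row_without_geo), extract_place(row_without_geo))
```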

In [39]:
username = 'random'
max_tweets = 150
 
tweets = tweepy.Cursor(api.user_timeline,id=username).items(max_tweets)
 
# Pulling information from tweets iterable object
# Add or remove tweet information you want in the below list comprehension
tweets_list = [[tweet.text, tweet.created_at, tweet.id_str, tweet.user.screen_name, tweet.coordinates, tweet.place, tweet.retweet_count, tweet.favorite_count, tweet.lang, tweet.source, tweet.in_reply_to_status_id_str, tweet.in_reply_to_user_id_str, tweet.is_quote_status] for tweet in tweets]
 
# Creation of dataframe from tweets_list
# Add or remove columns as you remove tweet information
tweets_df1 = pd.DataFrame(tweets_list,columns=['Tweet Text', 'Tweet Datetime', 'Tweet Id', 'Twitter @ Name', 'Tweet Coordinates', 'Place Info', 'Retweets', 'Favorites', 'Language', 'Source', 'Replied Tweet Id', 'Replied Tweet User Id Str', 'Quote Status Bool'])
 
# Checks if there are coordinates attached to tweets, if so extracts them
tweets_df1['Tweet Coordinates'] = tweets_df1.apply(extract_coordinates,axis=1)
 
# Checks if there is place information available, if so extracts them
tweets_df1['Place Info'] = tweets_df1.apply(extract_place,axis=1)

In [53]:
tweets_df1.head()

Unnamed: 0,Tweet Text,Tweet Datetime,Tweet Id,Twitter @ Name,Tweet Coordinates,Place Info,Retweets,Favorites,Language,Source,Replied Tweet Id,Replied Tweet User Id Str,Quote Status Bool
0,Join the cryptocurrency forum - https://t.co/F...,2018-01-05 08:38:41,949198454911262720,random,,,1,7,en,Twitter Web Client,,,False
1,@jilleduffy Thanks!!,2017-07-23 15:42:27,889148714895286272,random,,,0,0,en,Twitter for iPhone,8.891467923815629e+17,18486446.0,False
2,@jilleduffy you mentioned a VPN service with C...,2017-07-23 15:29:14,889145385368690689,random,,,0,0,en,Twitter for iPhone,,18486446.0,False
3,Who needs wealth when you can make a woman laugh.,2015-12-31 02:14:21,682384266408247297,random,,,13,36,en,Twitter Web Client,,,False
4,The struggle ends when the gratitude begins.,2015-11-16 19:36:18,666339028115943425,random,,,10,17,en,Twitter Web Client,,,False


## Allowing API to Access up to 280 Characters From Tweets

In the cursor parameters add tweet_mode='extended' to access tweet text that goes beyond Twitter's original 140 character limit.

If tweet_mode is set to extended, the tweet attribute tweet.text becomes tweet.full_text instead.
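
Because the attribute name depends on the mode, a small accessor can keep the rest of the code the same in both cases. This is a hedged sketch: `get_tweet_text` is a hypothetical helper, and the `SimpleNamespace` objects are made-up stand-ins for Tweepy status objects. It also assumes that for retweets you want the original tweet's text from `retweeted_status`, since a retweet's own `full_text` can still be truncated; that choice drops the "RT @user:" prefix.

```python
from types import SimpleNamespace


def get_tweet_text(tweet):
    """Return the fullest text available on a tweet-like object."""
    retweeted = getattr(tweet, 'retweeted_status', None)
    if retweeted is not None:
        # Design choice (assumption): use the original tweet's text for retweets
        return get_tweet_text(retweeted)
    full_text = getattr(tweet, 'full_text', None)
    if full_text is not None:
        return full_text
    # Fall back to the default-mode attribute
    return getattr(tweet, 'text', None)


# Made-up stand-ins for the three cases
default_mode_tweet = SimpleNamespace(text='Short tweet')
extended_tweet = SimpleNamespace(full_text='A tweet longer than the original limit')
extended_retweet = SimpleNamespace(
    full_text='RT @someone: The start of a long tweet that gets cut...',
    retweeted_status=SimpleNamespace(full_text='The start of a long tweet that gets cut off here'),
)

for tweet in (default_mode_tweet, extended_tweet, extended_retweet):
    print(get_tweet_text(tweet))
```

In the list comprehension below, `tweet.full_text` could then be swapped for `get_tweet_text(tweet)` if retweets matter for your analysis.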

In [9]:
username = 'billgates'
max_tweets = 150
 
tweets = tweepy.Cursor(api.user_timeline,id=username, tweet_mode='extended').items(max_tweets)
 
# Pulling information from tweets iterable object
# Add or remove tweet information you want in the below list comprehension
tweets_list = [[tweet.full_text, tweet.created_at, tweet.id_str, tweet.user.screen_name, tweet.coordinates, tweet.place, tweet.retweet_count, tweet.favorite_count, tweet.lang, tweet.source, tweet.in_reply_to_status_id_str, tweet.in_reply_to_user_id_str, tweet.is_quote_status] for tweet in tweets]
 
# Creation of dataframe from tweets_list
# Add or remove columns as you remove tweet information
tweets_df12 = pd.DataFrame(tweets_list,columns=['Tweet Text', 'Tweet Datetime', 'Tweet Id', 'Twitter @ Name', 'Tweet Coordinates', 'Place Info', 'Retweets', 'Favorites', 'Language', 'Source', 'Replied Tweet Id', 'Replied Tweet User Id Str', 'Quote Status Bool'])
 
# Checks if there are coordinates attached to tweets, if so extracts them
tweets_df12['Tweet Coordinates'] = tweets_df12.apply(extract_coordinates,axis=1)
 
# Checks if there is place information available, if so extracts them
tweets_df12['Place Info'] = tweets_df12.apply(extract_place,axis=1)

In [10]:
tweets_df12.head()

Unnamed: 0,Tweet Text,Tweet Datetime,Tweet Id,Twitter @ Name,Tweet Coordinates,Place Info,Retweets,Favorites,Language,Source,Replied Tweet Id,Replied Tweet User Id Str,Quote Status Bool
0,Mosquito City is home to the world’s largest c...,2020-08-22 17:11:15,1297219798925860864,BillGates,,,508,3516,en,Twitter Web App,,,False
1,Deaths from malaria have been cut by more than...,2020-08-20 15:48:58,1296474314221522944,BillGates,,,548,4562,en,Twitter Web App,,,False
2,RT @gavi: Vaccines prevent millions of deaths ...,2020-08-20 15:36:55,1296471280921837570,BillGates,,,167,0,en,Twitter Web App,,,False
3,"Ridding the world of preventable, treatable di...",2020-08-19 15:06:37,1296101267152859136,BillGates,,,771,4819,en,Twitter Web App,,,False
4,RT @GlobalFund: As a community health worker i...,2020-08-18 21:09:40,1295830246168526848,BillGates,,,101,0,en,Twitter Web App,,,False


## 2. Getting User Information From Tweets<a name="Section2"></a>
[Return to Table of Contents](#TOC)

<b>Tweepy excels in this category, having more access to user information than GetOldTweets3.</b>
<br>List of information available in user object with Tweepy. This is not an exhaustive list but does contain a majority of the available information. If you want an exhaustive list of everything contained in the user object there's documentation [here](https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/overview/user-object) describing all the attributes. 

String versions of Ids (e.g., id_str, user.id_str) are used instead of the integer versions to best keep data integrity, as there is a possibility for Ids stored as integers to be cut off or rounded.

* tweet.text: Text content of tweet
* tweet.created_at: Date tweet was created
* tweet.id_str: Id of tweet
* tweet.user.name: Name of the user as they've defined it
* tweet.user.screen_name: Username of tweet's author, commonly called User @ name
* tweet.user.id_str: User id of tweet's author
* tweet.user.location: User-defined location for account's profile. Can be null
* tweet.user.url: URL provided by user in bio. Can be null
* tweet.user.description: Text in user bio. Can be null
* tweet.user.verified: Boolean indicating whether user has a verified account
* tweet.user.followers_count: Count of followers user has
* tweet.user.friends_count: Count of other users that user is following
* tweet.user.favourites_count: Count of tweets user has liked in the account's lifetime
* tweet.user.statuses_count: Count of tweets (including retweets) issued by user
* tweet.user.listed_count: Count of public lists that user is member of
* tweet.user.created_at: Date that the user account was created on Twitter
* tweet.user.profile_image_url_https: HTTPS-based URL pointing to user's profile image
* tweet.user.default_profile: When true, indicates user has not altered the theme or background of user profile
* tweet.user.default_profile_image: When true, indicates user has not uploaded their own profile image and the default image is used instead

In [41]:
text_query = 'Coronavirus'
max_tweets = 150
 
# Creation of query method using parameters
tweets = tweepy.Cursor(api.search,q=text_query).items(max_tweets)
 
# Pulling information from tweets iterable object
# Add or remove tweet information you want in the below list comprehension
tweets_list = [[tweet.text, tweet.created_at, tweet.id_str, tweet.user.name, 
                tweet.user.screen_name, tweet.user.id_str, tweet.user.location, 
                tweet.user.url, tweet.user.description, tweet.user.verified, tweet.user.followers_count, 
                tweet.user.friends_count, tweet.user.favourites_count, tweet.user.statuses_count, tweet.user.listed_count, 
                tweet.user.created_at, tweet.user.profile_image_url_https, tweet.user.default_profile, tweet.user.default_profile_image] for tweet in tweets]

# Creation of dataframe from tweets_list
# Add or remove columns as you remove tweet information
tweets_df2 = pd.DataFrame(tweets_list,columns=['Tweet Text', 'Tweet Datetime', 'Tweet Id', 'Twitter Username', 'Twitter @ name',
                                             'Twitter User Id', 'Twitter User Location', 'URL in Bio', 'Twitter Bio',
                                             'User Verified Status', 'Users Following Count',
                                             'Number users this account is following', 'Users Number of Likes', 'Users Tweet Count',
                                             'Lists Containing User', 'Account Created Time', 'Profile Image URL',
                                             'User Default Profile', 'User Default Profile Image'])

In [52]:
tweets_df2.head()

Unnamed: 0,Tweet Text,Tweet Datetime,Tweet Id,Twitter Username,Twitter @ name,Twitter User Id,Twitter User Location,URL in Bio,Twitter Bio,User Verified Status,Users Following Count,Number users this account is following,Users Number of Likes,Users Tweet Count,Lists Containing User,Account Created Time,Profile Image URL,User Default Profile,User Default Profile Image
0,"RT @DWUhlfelderLaw: Pinellas County, Florida s...",2020-08-15 23:48:26,1294783037931167745,T Morgan 🇺🇸,timwmorgan,249967206,Jersey Shore,,"Money, Credit and Banking go-to guy. Former Te...",False,11788,11473,95506,85026,2,2011-02-10 04:08:33,https://pbs.twimg.com/profile_images/109652576...,False,False
1,RT @CartwheelPrint: Dan Tehan’s uni plan isn’t...,2020-08-15 23:48:26,1294783037662625792,Noisynana@waratah,Noisynanawarat1,1196590977932091392,,,"Social justice , environmental health supporte...",False,527,512,57113,64951,3,2019-11-19 00:49:38,https://abs.twimg.com/sticky/default_profile_i...,True,True
2,RT @MglMauricio: Cientos de organizaciones de ...,2020-08-15 23:48:26,1294783037150986240,Danton Reynoso 🇲🇽 #RedAMLO,58rey,472308682,,,Por un México sin PriAn,False,4239,4072,67795,227487,24,2012-01-23 20:28:28,https://pbs.twimg.com/profile_images/109697668...,False,False
3,"RT @BobjeffHD: FHC, Lula, Boulos, Haddad, Gilm...",2020-08-15 23:48:26,1294783036643573761,pedro almeida,pedroal44611264,837689044238798849,,,,False,28,205,9984,9778,0,2017-03-03 15:40:06,https://abs.twimg.com/sticky/default_profile_i...,True,True
4,RT @CanadaHimalaya: Coronavirus scientist reve...,2020-08-15 23:48:26,1294783034948952064,坐看云起时#3,yunqi1111111,1286720318954967041,"London, England",,光照在黑暗中，黑暗不能胜过光。,False,150,594,3227,1099,0,2020-07-24 17:50:16,https://pbs.twimg.com/profile_images/128672834...,True,False


## 3. Scraping Tweets With Advanced Queries<a name="Section3"></a>
[Return to Table of Contents](#TOC)
<br>List of query methods available with Tweepy. This is not an exhaustive list but does contain a majority of the methods available. If you want an exhaustive list of everything available there's documentation [here](https://developer.twitter.com/en/docs/twitter-api/v1/tweets/search/api-reference/get-search-tweets).

* q = str: Setting query based on text
* geocode = str "lat,long,radius": Setting location of query and radius
* lang = str: Setting language of query, full list of language codes [here](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes)
* result_type = str "mixed"/"recent"/"popular": Setting popularity preference of query
* until = str "yyyy-mm-dd": Setting upper bound date on query, if using standard search API be cognizant of 7-day limit
* since_id = str or int: Returns results with Ids greater than (more recent than) the given Id
* max_id = str or int: Returns results with Ids less than (older than) or equal to the given Id
* count = int: Number of tweets to return per page. Max is 100, defaults to 15
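
The parameters above can be assembled programmatically before they are handed to the cursor. The sketch below is one possible way to do that, not something from the original notebook: `build_geocode`, `until_days_ago`, and `build_search_kwargs` are hypothetical helpers. A fixed `today` is passed only so the example is reproducible; leaving it out uses the current date, which helps keep `until` inside the 7-day window.

```python
from datetime import date, timedelta


def build_geocode(latitude, longitude, radius, unit='mi'):
    """Format the "lat,long,radius" string expected by the geocode parameter."""
    if unit not in ('mi', 'km'):
        raise ValueError("unit must be 'mi' or 'km'")
    return f"{latitude},{longitude},{radius}{unit}"


def until_days_ago(days, today=None):
    """Return a "yyyy-mm-dd" string for the date `days` before `today`."""
    today = today or date.today()
    return (today - timedelta(days=days)).strftime('%Y-%m-%d')


def build_search_kwargs(**params):
    """Keep only the parameters that were actually set (not None)."""
    return {name: value for name, value in params.items() if value is not None}


search_kwargs = build_search_kwargs(
    q=None,  # no text query in this example, so it is dropped
    geocode=build_geocode(19.402833, -99.141051, 50),
    lang='es',
    result_type='recent',
    until=until_days_ago(2, today=date(2020, 8, 12)),
    count=100,
)
print(search_kwargs)

# With an authenticated `api` object, the dictionary could then be unpacked as:
# tweets = tweepy.Cursor(api.search, **search_kwargs).items(max_tweets)
```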

In [44]:
# Example may no longer show tweets if until_date falls outside 
# of 7-day period from when you run cell
coordinates = '19.402833,-99.141051,50mi'
language = 'es'
result_type = 'recent'
until_date = '2020-08-10'
max_tweets = 150
 
# Creation of query method using parameters
tweets = tweepy.Cursor(api.search, geocode=coordinates, lang=language, result_type = result_type, until = until_date, count = 100).items(max_tweets)
 
# List comprehension pulling chosen tweet information from tweets iterable object
# Add or remove tweet information you want in the below list comprehension
tweets_list = [[tweet.text, tweet.created_at, tweet.id_str, tweet.favorite_count, tweet.user.screen_name, tweet.user.id_str, tweet.user.location, tweet.user.url, tweet.user.verified, tweet.user.followers_count, tweet.user.friends_count, tweet.user.statuses_count, tweet.user.default_profile_image, 
tweet.lang] for tweet in tweets]

# Creation of dataframe from tweets_list
# Add or remove columns as you remove tweet information
tweets_df3 = pd.DataFrame(tweets_list,columns=['Tweet Text', 'Tweet Datetime', 'Tweet Id', 'Tweet Favorite Count', 'Twitter @ name',
                                             'Twitter User Id', 'Twitter User Location', 'URL in Bio','User Verified Status', 'Users Current Following Count',
                                             'Number of accounts user is following', 'Users Tweet Count',
                                             'User Default Profile Image','Tweet Language'])

In [51]:
tweets_df3.head()

Unnamed: 0,Tweet Text,Tweet Datetime,Tweet Id,Tweet Favorite Count,Twitter @ name,Twitter User Id,Twitter User Location,URL in Bio,User Verified Status,Users Current Following Count,Number of accounts user is following,Users Tweet Count,User Default Profile Image,Tweet Language
0,RT @XochimilcoAl: Durante la emergencia sanita...,2020-08-09 23:59:59,1292611617369264130,0,HetorHdz1,347258311,,,False,54,220,4672,False,es
1,RT @Josa1108: @alfredodelmazo #MujeresInvencbl...,2020-08-09 23:59:59,1292611617289633794,0,Patrici89121863,1259342866704404486,,,False,53,31,1127,False,es
2,RT @iztalccihuatl: #AyudaConUnRT Apoyemos en l...,2020-08-09 23:59:59,1292611616438091779,0,IsabelOrsini,4085117615,,,False,243,226,35989,False,es
3,Con estos stickers nadie te quitará la expresi...,2020-08-09 23:59:59,1292611616262033408,0,Wokii_int,1171150458670505984,Ciudad de México,https://t.co/YN8wlMD8ce,False,52,126,461,False,es
4,@maduradascom Mucho recontra malparado !!,2020-08-09 23:59:59,1292611616152989697,1,wilmermanuelrin,154946118,san cristóbal,,False,21,190,2652,False,es


## 4. Putting It All Together<a name="Section4"></a>
[Return to Table of Contents](#TOC)
<br>
Great, we now know how to pull more information from tweets and how to query with advanced parameters. As shown several times above, you can mix and match the information you want from the tweets and the type of queries you conduct. It's just important that you update the column names in the pandas dataframe to match the fields you pull; otherwise the dataframe creation will raise an error or the columns will end up mislabeled.
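
One way to keep the column names and the pulled fields from drifting apart is to define them side by side. This is a minimal sketch of that idea rather than the approach used in the cells of this notebook: `COLUMN_SPEC` and `tweets_to_rows` are hypothetical names, and the `SimpleNamespace` object is a made-up stand-in for a Tweepy tweet.

```python
from types import SimpleNamespace

# Each column name sits next to the function that produces its value,
# so adding or removing a field is a one-line change
COLUMN_SPEC = {
    'Tweet Text': lambda tweet: tweet.text,
    'Tweet Id': lambda tweet: tweet.id_str,
    'Twitter @ name': lambda tweet: tweet.user.screen_name,
    'Follower Count': lambda tweet: tweet.user.followers_count,
}


def tweets_to_rows(tweets, column_spec):
    """Return (rows, columns) with one row per tweet, in column_spec order."""
    columns = list(column_spec)
    rows = [[extract(tweet) for extract in column_spec.values()] for tweet in tweets]
    return rows, columns


# Made-up tweet used only to show the shape of the result
fake_tweets = [
    SimpleNamespace(
        text='Hello',
        id_str='1',
        user=SimpleNamespace(screen_name='example_user', followers_count=10),
    )
]

rows, columns = tweets_to_rows(fake_tweets, COLUMN_SPEC)
print(columns)
print(rows)
```

The two results can be passed straight to `pd.DataFrame(rows, columns=columns)`, so the number of columns always matches the number of fields.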

<br>
Below is an example of a search for 150 tweets containing 'Coronavirus' that were posted within a 50 mile radius of Las Vegas, NV, which in this case has the geo coordinates of lat 36.169786, long -115.139858.


In [48]:
text_query = 'Coronavirus'
coordinates = '36.169786,-115.139858,50mi'
max_tweets = 150
 
# Creation of query method using parameters
tweets = tweepy.Cursor(api.search, q = text_query, geocode = coordinates, count = 100).items(max_tweets)
 
# Pulling information from tweets iterable object
# Add or remove tweet information you want in the below list comprehension
tweets_list = [[tweet.text, tweet.created_at, tweet.id_str, tweet.favorite_count, tweet.user.screen_name, tweet.user.id_str, tweet.user.location, tweet.user.followers_count, tweet.coordinates, tweet.place] for tweet in tweets]

# Creation of dataframe from tweets_list
# Add or remove columns as you remove tweet information
tweets_df4 = pd.DataFrame(tweets_list,columns=['Tweet Text', 'Tweet Datetime', 'Tweet Id', 'Tweet Favorite Count', 'Twitter @ name',
                                             'Twitter User Id', 'Twitter User Location', 'Users Current Following Count', 'Tweet Coordinates', 'Place Info'])

# Checks if there are coordinates attached to tweets, if so extracts them
tweets_df4['Tweet Coordinates'] = tweets_df4.apply(extract_coordinates,axis=1)
    
# Checks if there is place information available, if so extracts them
tweets_df4['Place Info'] = tweets_df4.apply(extract_place,axis=1)

In [50]:
tweets_df4.head()

Unnamed: 0,Tweet Text,Tweet Datetime,Tweet Id,Tweet Favorite Count,Twitter @ name,Twitter User Id,Twitter User Location,Users Current Following Count,Tweet Coordinates,Place Info
0,How to recognize Covid-19 symptoms in children...,2020-08-15 23:47:36,1294782827355975681,0,ckleintv,50799343,"Las Vegas, NV",14245,,
1,RT @realkatiejow: Make no mistake.\n\nChina ha...,2020-08-15 23:45:50,1294782383409881088,0,NMTNMS,507801095,Surf City USA,612,,
2,"Good.\n\n""Following fierce public backlash, th...",2020-08-15 23:42:30,1294781541474680832,2,LisaSongSutton,28351859,Las Vegas,36175,,
3,"RT @meganmesserly: As always, check out our CO...",2020-08-15 23:39:06,1294780689208860672,0,GayCae,1085613896436600832,,332,,
4,RT @FOX5Vegas: Heads up if you were planning t...,2020-08-15 23:18:45,1294775564872515585,0,jousefarmijos,637324445,,160,,
