# Notebook for Scraping Twitter With Tweepy and GetOldTweets3

Tweepy Package Github: https://github.com/tweepy/tweepy

GetOldTweets3 Package Github: https://github.com/Mottl/GetOldTweets3

Tweepy Package Documentation: https://tweepy.readthedocs.io/en/latest/

Article Read-Along: https://towardsdatascience.com/how-to-scrape-more-information-from-tweets-on-twitter-44fd540b8a1f

### Notebook Author: Martin Beck
#### Information current as of August 13th, 2020
<b> Dependencies:</b> Make sure Tweepy and GetOldTweets3 are already installed in your Python environment. If not, you can pip install them. If you want more information on setting up, I have an article [here](https://towardsdatascience.com/how-to-scrape-tweets-from-twitter-59287e20f0f1) that goes into deeper detail.

## Notebook's Table of Contents<a name="TOC"></a>

0. [Credentials and Authorization](#Section0)
<br>Setting up credentials and authorization in order to utilize Tweepy
1. [Available Methods With Tweepy](#Section1)
<br>Methods available with Tweepy to pull more information
2. [How to Use Tweepy With GetOldTweets3](#Section2)
<br>Examples on using Tweepy's methods and how to use them on your datasets.

## Imports for Notebook

In [1]:
# Pip install Tweepy if you don't already have the package
# !pip install tweepy

# Imports
import tweepy
import pandas as pd
import GetOldTweets3 as got
import time

## 0. Credentials and Authorization<a name="Section0"></a>
[Return to Table of Contents](#TOC)
<br>Tweepy requires credentials before you can use its API. The code below sets up the notebook for authorization. I already have an article covering setting up Tweepy and getting credentials [here](https://towardsdatascience.com/how-to-scrape-tweets-from-twitter-59287e20f0f1) if further instructions are needed.

You don't necessarily have to create a credentials file. However, if you find yourself sharing Tweepy code with other parties, I recommend it so you don't accidentally share your credentials. Otherwise, skip the cell below and hardcode your credentials in the cell after it.

In [44]:
# Loading in from csv file

credentials_df = pd.read_csv('credentials.csv',header=None,names=['name','key'])

credentials_df

|   | name | key |
|---|------|-----|
| 0 | consumer_key | XXXXXX |
| 1 | consumer_secret | XXXXXX |
| 2 | access_token | XXXXXX |
| 3 | access_secret | XXXXXX |


In [3]:
# Credentials from csv file

consumer_key = credentials_df.loc[credentials_df['name']=='consumer_key','key'].iloc[0]
consumer_secret = credentials_df.loc[credentials_df['name']=='consumer_secret','key'].iloc[0]
access_token = credentials_df.loc[credentials_df['name']=='access_token','key'].iloc[0]
access_token_secret = credentials_df.loc[credentials_df['name']=='access_secret','key'].iloc[0]

# Credentials hardcoded

# consumer_key = "XXXXX"
# consumer_secret = "XXXXX"
# access_token = "XXXXX"
# access_token_secret = "XXXXXX"

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth,wait_on_rate_limit=True)
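
If you'd rather not use pandas just to read four values, the same headerless two-column file can be read with Python's built-in `csv` module. This is a minimal sketch, not part of the original notebook: `load_credentials` and `REQUIRED_KEYS` are hypothetical names, and the file is assumed to contain one `name,key` pair per row as shown in the output above.

```python
import csv

# Names the rest of the notebook expects to find in the credentials file
REQUIRED_KEYS = ("consumer_key", "consumer_secret", "access_token", "access_secret")

def load_credentials(path):
    """Read a headerless two-column CSV (name,key) into a dict."""
    with open(path, newline="") as f:
        # Skip rows that do not have both a name and a key
        creds = {row[0].strip(): row[1].strip() for row in csv.reader(f) if len(row) >= 2}
    # Fail early with a clear message if any expected entry is absent
    missing = [name for name in REQUIRED_KEYS if name not in creds]
    if missing:
        raise KeyError(f"Missing credentials: {', '.join(missing)}")
    return creds
```

Calling `creds = load_credentials('credentials.csv')` would then give you `creds['consumer_key']` and so on, which can be passed to `tweepy.OAuthHandler` and `set_access_token` exactly as in the cell above.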

## 1. Available Methods With Tweepy<a name="Section1"></a>
[Return to Table of Contents](#TOC)
<br>For the most part there are only two relevant methods. If you're curious about what else you can do with Tweepy the documentation is available [here](http://docs.tweepy.org/en/latest/api.html#search-methods). 

The relevant methods are api.get_status and api.get_user.

<b>api.get_status provides you with access to Tweepy's tweet object which by default also includes user information.</b>

<b>api.get_user only provides you with user information. </b>

You can use either if you only care about accessing user data, since they both contain it. However, if you want access to tweet information that is only available with Tweepy, such as tweet.in_reply_to_user_id_str, I'd recommend using api.get_status.
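
To make the difference concrete without calling the API, here is a small sketch using stand-in objects built with `SimpleNamespace`. The attribute values are invented, and `get_user_info` is a hypothetical helper rather than anything Tweepy provides. It only illustrates that a tweet object nests its author under `.user`, while a user object is the user.

```python
from types import SimpleNamespace

# Stand-ins shaped like what api.get_user and api.get_status return (values are invented)
fake_user = SimpleNamespace(screen_name="example_user", followers_count=42)
fake_tweet = SimpleNamespace(
    source="Twitter Web Client",
    in_reply_to_user_id_str=None,
    user=fake_user,
)

def get_user_info(obj):
    """Return the user part of either a tweet-like or a user-like object."""
    # Tweet objects carry their author under .user; user objects are already the user
    return getattr(obj, "user", obj)

print(get_user_info(fake_tweet).followers_count)  # user data reached through the tweet
print(get_user_info(fake_user).followers_count)   # the same user data, reached directly
print(hasattr(fake_user, "source"))               # tweet-only attributes are absent on a user
```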

## 2. How to Use Tweepy With GetOldTweets3<a name="Section2"></a>
[Return to Table of Contents](#TOC)

Below is a list of information accessible in Tweepy's tweet and user objects. This is not an exhaustive list for either object. For everything contained in Tweepy's tweet object, there's documentation [here](https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/overview/tweet-object). For everything contained in Tweepy's user object, there's documentation [here](https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/overview/user-object).

* tweet.coordinates: Geographic location as reported by user or client. May be null, which is why the extract_coordinates function was created
* tweet.place: Place associated with the tweet, such as Las Vegas, NV. May be null, which is why the extract_place function was created
* tweet.lang: Indicates a BCP 47 language identifier corresponding to machine detected language of tweet text.
* tweet.source: Source where tweet was posted through. Ex: Twitter Web Client
* tweet.in_reply_to_status_id_str: If a tweet is a reply, the original tweet's id. Can be null if tweet is not a reply
* tweet.in_reply_to_user_id_str: If a tweet is a reply, string representation of original tweet's user id
* tweet.user.location: User defined location for account's profile. Nullable
* tweet.user.url: URL provided by user in bio. Nullable
* tweet.user.description: Text in user bio. Nullable
* tweet.user.verified: Boolean indicating whether user has a verified account
* tweet.user.followers_count: Count of followers user has
* tweet.user.friends_count: Count of other users that user is following
* tweet.user.favourites_count: Count of tweets user has liked in the account's lifetime
* tweet.user.statuses_count: Count of tweets (including retweets) issued by user
* tweet.user.listed_count: Count of public lists that user is member of
* tweet.user.created_at: Date that the user account was created on Twitter
* tweet.user.profile_image_url_https: HTTPS-based URL pointing to user's profile image
* tweet.user.default_profile: When true, indicates user has not altered the theme or background of user profile
* tweet.user.default_profile_image: When true, indicates user has not uploaded their own profile image and the default image is used instead
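
The list above mentions extract_coordinates and extract_place helpers for the nullable fields. Their code does not appear in this notebook, so the following is only a sketch of what null-safe versions might look like. It assumes `coordinates` arrives as a GeoJSON-style dict with a `coordinates` key and that `place` exposes a `full_name` attribute, as described in Twitter's v1.1 data dictionary. The stand-in tweets are invented for the demonstration.

```python
from types import SimpleNamespace

def extract_coordinates(tweet):
    """Return the [longitude, latitude] pair, or None when the tweet has no coordinates."""
    coords = getattr(tweet, "coordinates", None)
    return coords["coordinates"] if coords is not None else None

def extract_place(tweet):
    """Return the place's full name, or None when the tweet has no place."""
    place = getattr(tweet, "place", None)
    return place.full_name if place is not None else None

# Invented stand-ins: one geotagged tweet and one without any location data
geo_tweet = SimpleNamespace(
    coordinates={"type": "Point", "coordinates": [-115.14, 36.17]},
    place=SimpleNamespace(full_name="Las Vegas, NV"),
)
plain_tweet = SimpleNamespace(coordinates=None, place=None)

print(extract_coordinates(geo_tweet), extract_place(geo_tweet))
print(extract_coordinates(plain_tweet), extract_place(plain_tweet))
```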

<b>Remember that Tweepy still has request limits, so running these requests on larger datasets may take time. I've run this workaround on a smaller dataset of 5k tweets and it took 1-2 hours to finish. It's up to you whether you'd rather let your computer spend time running for free or spend money on Twitter's Premium/Enterprise APIs to work with bigger datasets.</b>
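
A rough back-of-the-envelope estimate shows why this takes hours. The sketch below assumes a limit of 900 requests per 15-minute window. That is an assumed figure for single-tweet and single-user lookups on the standard v1.1 API, so treat it as a placeholder and check Twitter's rate-limit documentation for your access level. `estimate_wait_minutes` is a hypothetical helper, and it counts only time spent waiting on the limit, not network latency.

```python
import math

def estimate_wait_minutes(total_requests, requests_per_window=900, window_minutes=15):
    """Lower bound on minutes spent waiting for rate-limit windows to reset."""
    windows_needed = math.ceil(total_requests / requests_per_window)
    # The first window starts immediately; every further window costs one full wait
    return max(windows_needed - 1, 0) * window_minutes

# 5,000 tweets with one request each
print(estimate_wait_minutes(5000))
# 5,000 tweets with four separate requests each (one per attribute)
print(estimate_wait_minutes(5000 * 4))
```

Under these assumptions, one request per tweet means 75 minutes of waiting, while four requests per tweet means 330 minutes, so the number of requests you send per tweet matters a lot.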

### Preparation

To use Tweepy with GetOldTweets3 there is a little bit of preparation required. Depending on whether you're using the api.get_status or api.get_user method you'll need to have the relevant information available.

In the case of api.get_status make sure you use GOT3 to scrape for <b>tweet.id</b>

In the case of api.get_user make sure you use GOT3 to scrape for either <b>tweet.author_id</b> or <b>tweet.username</b>

I'll showcase this below.

In [4]:
text_query = 'Hello'
since_date = "2020-7-20"
until_date = "2020-7-21"

count = 150
 
# Creation of tweetCriteria query object with methods to specify further
tweetCriteria = got.manager.TweetCriteria()\
.setQuerySearch(text_query).setSince(since_date)\
.setUntil(until_date).setMaxTweets(count)
 
# Creation of tweets iterable containing all queried tweet data
tweets = got.manager.TweetManager.getTweets(tweetCriteria)
 
# List comprehension pulling chosen tweet information from tweets
# Add or remove tweet information you want in the below list comprehension
tweets_list = [[tweet.id, tweet.author_id, tweet.username, tweet.text, tweet.retweets, tweet.favorites, tweet.replies, tweet.date] for tweet in tweets]
 
# Creation of dataframe from tweets list
# Add or remove columns as you add or remove tweet information
tweets_df = pd.DataFrame(tweets_list, columns = ['Tweet Id', 'Tweet User Id', 'Tweet User', 'Text',
                                                  'Retweets', 'Favorites', 'Replies', 'Datetime'])

### I scraped with GetOldTweets3 making sure that I have tweet.id, and tweet.author_id or tweet.username.

In [41]:
# Taking a quick look at the data scraped
tweets_df.head()

|   | Tweet Id | Tweet User Id | Tweet User | Text | Retweets | Favorites | Replies | Datetime |
|---|----------|---------------|------------|------|----------|-----------|---------|----------|
| 0 | 1285363858832363520 | 1182717701203972096 | workinclassbird | friend..... hello | 0 | 3 | 1 | 2020-07-20 23:59:59+00:00 |
| 1 | 1285363857242947584 | 1183184898070405120 | Soap_The_Scrub | hello yes i interacted | 0 | 4 | 3 | 2020-07-20 23:59:59+00:00 |
| 2 | 1285363856202698753 | 844768299388813314 | kuroslays | Hello lew, | 0 | 0 | 0 | 2020-07-20 23:59:58+00:00 |
| 3 | 1285363856055951363 | 1214501518646247425 | bubsji | im nervous HELLO | 0 | 0 | 0 | 2020-07-20 23:59:58+00:00 |
| 4 | 1285363852851511301 | 811267164476841984 | realJakeLogan | Butt Stallion says hello neck gaiter | 0 | 0 | 0 | 2020-07-20 23:59:58+00:00 |


### Alright, now that we have our data, let's look at a single row to test how api.get_status and api.get_user work.

In [6]:
# Using iloc to show a specific row of data
tweets_df.iloc[4]

Tweet Id                           1285363852851511301
Tweet User Id                       811267164476841984
Tweet User                               realJakeLogan
Text             Butt Stallion says hello neck gaiter 
Retweets                                             0
Favorites                                            0
Replies                                              0
Datetime                     2020-07-20 23:59:58+00:00
Name: 4, dtype: object

In [7]:
# Printing out the relevant information for us (row 4, columns 0-2 by position)
print("Tweet Id: ",tweets_df.iloc[4, 0])
print("User Id: ",tweets_df.iloc[4, 1])
print("Username: ",tweets_df.iloc[4, 2])

Tweet Id:  1285363852851511301
User Id:  811267164476841984
Username:  realJakeLogan


### Perfect, now let's test get_status and get_user with the above Tweet Id, User Id, and Username.

In [48]:
api.get_status(1285363852851511301)

### There's a lot going on with that. Remember the list from above that shows the attributes of tweet and user objects? We can use that to focus on the relevant parts.

In [8]:
# Using the get_status method to request for the tweet data and pull out requested information
print(api.get_status(1285363852851511301).user.location)
print(api.get_status(1285363852851511301).user.followers_count)
print(api.get_status(1285363852851511301).source)

Salinas Valley, CA
9
WordPress.com


In [9]:
print(api.get_user(811267164476841984).location)
print(api.get_user('realJakeLogan').followers_count)

# Should throw an error because user object only has user information
print(api.get_user(811267164476841984).source)

Salinas Valley, CA
9


AttributeError: 'User' object has no attribute 'source'

As you can see, user information is available with either method. The only difference is that api.get_status requires going through the user attribute, as in .user.location, whereas api.get_user only requires .location because the returned object is the user itself. That's also why asking api.get_user for source raises the error above: a user object carries no tweet information.

Lastly, as you can see api.get_user is able to use either User Id or a Twitter Username to pull up user information.

These methods are great, but using them on a single item is only good for testing. The real power comes from wrapping them in a function that you can apply to a whole dataset.

In [10]:
# Creating copy of original df to mess around with
tweet_df_test = tweets_df.copy()

In [20]:
# Creating functions to request tweet or user information and extract them
def extract_tweepy_tweet_info(row):
    tweet = api.get_status(row['Tweet Id'])
    return tweet.source

def extract_tweepy_tweet_user_info(row):
    tweet = api.get_status(row['Tweet Id'])
    return tweet.user.statuses_count
    
def extract_tweepy_user_info1(row):
    user = api.get_user(row['Tweet User Id'])
    return user.followers_count

def extract_tweepy_user_info2(row):
    user = api.get_user(row['Tweet User'])
    return user.verified

In [21]:
# Setting new columns to be equal to the returned data from each function
tweet_df_test['Tweet Source'] = tweet_df_test.apply(extract_tweepy_tweet_info,axis=1)
tweet_df_test['Tweets Count'] = tweet_df_test.apply(extract_tweepy_tweet_user_info,axis=1)
tweet_df_test['Follower Count'] = tweet_df_test.apply(extract_tweepy_user_info1,axis=1)
tweet_df_test['Verified Status'] = tweet_df_test.apply(extract_tweepy_user_info2,axis=1)
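
One thing the functions above do not handle is a failed lookup. If a tweet has been deleted or an account suspended since it was scraped, the request raises an exception and `apply` stops partway through. A minimal sketch of a guard follows. `safe_fetch` is a hypothetical helper, and it catches a generic `Exception` only to stay independent of the Tweepy version, so you would probably narrow it to Tweepy's own error class. `fake_get_status` stands in for `api.get_status` so the example runs offline.

```python
def safe_fetch(fetch, identifier, default=None):
    """Call fetch(identifier) and return default instead of raising on failure."""
    try:
        return fetch(identifier)
    except Exception as err:  # assumption: narrow this to Tweepy's error class in real use
        print(f"Lookup failed for {identifier}: {err}")
        return default

# Offline stand-in for api.get_status: only tweet id 1 "exists"
def fake_get_status(tweet_id):
    if tweet_id != 1:
        raise ValueError("No status found with that ID.")
    return {"source": "Twitter Web Client"}

print(safe_fetch(fake_get_status, 1))
print(safe_fetch(fake_get_status, 2))
```

Inside `extract_tweepy_tweet_info` this would look like `tweet = safe_fetch(api.get_status, row['Tweet Id'])`, followed by a check for `None` before reading `tweet.source`.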

In [26]:
# Output of data
tweet_df_test.head()

|   | Tweet Id | Tweet User Id | Tweet User | Text | Retweets | Favorites | Replies | Datetime | Tweet Source | Follower Count | Verified Status | Tweets Count |
|---|----------|---------------|------------|------|----------|-----------|---------|----------|--------------|----------------|-----------------|--------------|
| 0 | 1285363858832363520 | 1182717701203972096 | workinclassbird | friend..... hello | 0 | 3 | 1 | 2020-07-20 23:59:59+00:00 | Twitter for iPhone | 1877 | False | 561 |
| 1 | 1285363857242947584 | 1183184898070405120 | Soap_The_Scrub | hello yes i interacted | 0 | 4 | 3 | 2020-07-20 23:59:59+00:00 | Twitter for iPhone | 1265 | False | 11815 |
| 2 | 1285363856202698753 | 844768299388813314 | kuroslays | Hello lew, | 0 | 0 | 0 | 2020-07-20 23:59:58+00:00 | Twitter for iPhone | 1201 | False | 7332 |
| 3 | 1285363856055951363 | 1214501518646247425 | bubsji | im nervous HELLO | 0 | 0 | 0 | 2020-07-20 23:59:58+00:00 | Twitter for Android | 568 | False | 10844 |
| 4 | 1285363852851511301 | 811267164476841984 | realJakeLogan | Butt Stallion says hello neck gaiter | 0 | 0 | 0 | 2020-07-20 23:59:58+00:00 | WordPress.com | 9 | False | 147 |


As you can see there are now four new columns added on at the end of this dataframe.

It's worth noting that the above code is not efficient in terms of time or API requests. If you need more than one piece of information for each tweet, the functions above are not the best way to get it, because they send one request per attribute instead of collecting several attributes from a single request.

If you want to access several attributes per tweet, there are a couple of ways to do so. You can create a list, store the data in it, and then add it to the dataframe. Alternatively, you can write a function that builds and returns a Series, then use pandas to apply it to the dataframe. I'll showcase the former as it's easier to grasp.

In [24]:
# Creating copy of original df to mess around with
tweets_df_test_efficient = tweets_df.copy()

In [34]:
# Creation of list to store scraped tweet data
tweets_holding_list = []

def extract_tweepy_tweet_info_efficient(row):
    # Using Tweepy API to request for tweet data
    tweet = api.get_status(row['Tweet Id'])
    
    # Storing chosen tweet data in tweets_holding_list to be used later
    tweets_holding_list.append((tweet.source, tweet.user.statuses_count, tweet.user.followers_count, tweet.user.verified))

In [None]:
# Applying the extract_tweepy_tweet_info_efficient function to store tweet data in the tweets_holding_list
tweets_df_test_efficient.apply(extract_tweepy_tweet_info_efficient, axis=1)

# Creating new columns to store the data that's currently being held in tweets_holding_list
tweets_df_test_efficient[['Tweet Source', 'User Tweet Count', 'Follower Count', 'User Verified Status']] = pd.DataFrame(tweets_holding_list)

In [43]:
# Output of data
tweets_df_test_efficient.head()

|   | Tweet Id | Tweet User Id | Tweet User | Text | Retweets | Favorites | Replies | Datetime | Tweet Source | User Tweet Count | Follower Count | User Verified Status |
|---|----------|---------------|------------|------|----------|-----------|---------|----------|--------------|------------------|----------------|----------------------|
| 0 | 1285363858832363520 | 1182717701203972096 | workinclassbird | friend..... hello | 0 | 3 | 1 | 2020-07-20 23:59:59+00:00 | Twitter for iPhone | 561 | 1878 | False |
| 1 | 1285363857242947584 | 1183184898070405120 | Soap_The_Scrub | hello yes i interacted | 0 | 4 | 3 | 2020-07-20 23:59:59+00:00 | Twitter for iPhone | 11819 | 1266 | False |
| 2 | 1285363856202698753 | 844768299388813314 | kuroslays | Hello lew, | 0 | 0 | 0 | 2020-07-20 23:59:58+00:00 | Twitter for iPhone | 7333 | 1200 | False |
| 3 | 1285363856055951363 | 1214501518646247425 | bubsji | im nervous HELLO | 0 | 0 | 0 | 2020-07-20 23:59:58+00:00 | Twitter for Android | 10861 | 568 | False |
| 4 | 1285363852851511301 | 811267164476841984 | realJakeLogan | Butt Stallion says hello neck gaiter | 0 | 0 | 0 | 2020-07-20 23:59:58+00:00 | WordPress.com | 147 | 9 | False |
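
The second option mentioned earlier, returning a Series from the function, can be sketched as follows. To keep it runnable offline, `fake_get_status` and its invented values stand in for `api.get_status`, and `extract_as_series` is a hypothetical name. With `axis=1`, pandas expands each returned Series into columns, which can then be joined onto the original dataframe.

```python
import pandas as pd
from types import SimpleNamespace

# Offline stand-in for api.get_status; every value here is invented
def fake_get_status(tweet_id):
    user = SimpleNamespace(statuses_count=10, followers_count=5, verified=False)
    return SimpleNamespace(source="Twitter Web Client", user=user)

def extract_as_series(row, get_status=fake_get_status):
    """One request per row, several attributes returned as a labelled Series."""
    tweet = get_status(row["Tweet Id"])
    return pd.Series({
        "Tweet Source": tweet.source,
        "User Tweet Count": tweet.user.statuses_count,
        "Follower Count": tweet.user.followers_count,
        "User Verified Status": tweet.user.verified,
    })

demo_df = pd.DataFrame({"Tweet Id": [101, 102]})
# apply(...) yields a dataframe of the new columns, aligned on the same index
demo_df = demo_df.join(demo_df.apply(extract_as_series, axis=1))
print(demo_df.columns.tolist())
```

In the notebook you would pass the real method through `apply`, for example `tweets_df.apply(extract_as_series, axis=1, get_status=api.get_status)`, since `apply` forwards extra keyword arguments to the function.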


There you go. That's all there is to it. It's more efficient to run the API request once per tweet and pull all the information you need than to send a separate request for each attribute. It'll save a lot of time in the long run.
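
If one request per tweet is still too slow, a further step is to ask for many tweets per request. Tweepy 3.x exposes `api.statuses_lookup`, which as far as I know accepts up to 100 tweet IDs per call (it was renamed `lookup_statuses` in Tweepy 4). Treat both the method name and the limit of 100 as assumptions to verify against your installed version. The sketch below only shows the chunking logic, with the lookup passed in as a parameter and a fake lookup used so it runs offline. `chunked` and `fetch_in_batches` are hypothetical helpers.

```python
def chunked(items, size):
    """Yield consecutive slices of items, each with at most size elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def fetch_in_batches(ids, lookup, batch_size=100):
    """Call lookup once per batch of ids and collect all results in one list."""
    results = []
    for batch in chunked(list(ids), batch_size):
        results.extend(lookup(batch))
    return results

# Offline stand-in for a batch lookup; records the size of every call it receives
calls = []
def fake_lookup(batch):
    calls.append(len(batch))
    return [{"id": tweet_id} for tweet_id in batch]

tweets = fetch_in_batches(range(250), fake_lookup)
print(len(tweets), calls)
```

A batch lookup may leave out deleted tweets and may not preserve the order of the IDs you sent, so the results would need to be matched back to the dataframe by tweet ID rather than by position.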