# Companion Notebook for Scraping Twitter Using GetOldTweets3

Package: https://github.com/Mottl/GetOldTweets3

Article Read-Along: https://towardsdatascience.com/how-to-scrape-more-information-from-tweets-on-twitter-44fd540b8a1f

### Notebook Author: Martin Beck
#### Information current as of August 13th, 2020
<b>Dependencies:</b> Make sure GetOldTweets3 is already installed in your Python environment. If not, run `pip install GetOldTweets3` to install the package. For more information on setup, I have an article [here](https://towardsdatascience.com/how-to-scrape-tweets-from-twitter-59287e20f0f1) that goes into deeper detail.

## Notebook's Table of Contents<a name="TOC"></a>
<br>
<b>This companion notebook is meant to build on the scraping article and article notebook as it covers more scenarios that may come up and provides more examples.</b>

1. [Getting More Information From Tweets](#Section1)
<br>How to scrape more information from tweets such as favorite count, retweet count, mentions, permalinks, etc.
2. [Getting User Information From Tweets](#Section2)
<br><b>GetOldTweets3 does not offer</b> any more user information than the screen name, or Twitter @ name, which is shown in section 1.
3. [Scraping Tweets With Advanced Queries](#Section3)
<br>How to scrape for tweets using deeper queries such as searching by language of tweets, tweets within a certain location, tweets within specific date ranges, top tweets, etc.
4. [Putting It All Together](#Section4)
<br>Showcasing how you can mix and match the methods shown above to create queries that'll fulfill your data needs.

## Imports for Notebook

In [3]:
# Pip install GetOldTweets3 if you don't already have the package
# !pip install GetOldTweets3

# Imports
import GetOldTweets3 as got
import pandas as pd

## 1. Getting More Information From Tweets <a name="Section1"></a>
[Return to Table of Contents](#TOC)
<br>
List of information available in the tweet object with GetOldTweets3. I included everything except geo data due to issues that are currently still open.

* tweet.geo: <b>*NOTE GEO-DATA NOT WORKING BASED ON ISSUE</b><br><br>

* tweet.id: Id of tweet
* tweet.author_id: User id of tweet's author
* tweet.username: Username of tweet's author, commonly called User's @ name
* tweet.to: If tweet is a reply, the original tweet's username
* tweet.text: Text content of tweet
* tweet.retweets: Count of retweets
* tweet.favorites: Count of favorites
* tweet.replies: Count of replies
* tweet.date: Date tweet was created
* tweet.formatted_date: Formatted version of when tweet was created
* tweet.hashtags: Hashtags that tweet contains
* tweet.mentions: Mentions of other users that tweet contains
* tweet.urls: Urls that are in the tweet
* tweet.permalink: Permalink of tweet itself
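
The attributes listed above can be collected into one row per tweet. Here is a minimal sketch, assuming a hypothetical `FIELD_COLUMNS` mapping and `tweet_to_row` helper (neither is part of GetOldTweets3). A `SimpleNamespace` with made-up values stands in for a scraped tweet so the snippet runs without network access.

```python
from types import SimpleNamespace

# Hypothetical mapping (not part of GetOldTweets3): tweet attribute -> column label.
FIELD_COLUMNS = {
    "id": "Tweet Id",
    "username": "Tweet User",
    "text": "Text",
    "retweets": "Retweets",
    "favorites": "Favorites",
    "permalink": "Permalink",
}


def tweet_to_row(tweet, field_columns=FIELD_COLUMNS):
    """Return a dict of column label -> value for one tweet-like object."""
    # getattr with a default keeps the row complete if an attribute is missing.
    return {label: getattr(tweet, field, None) for field, label in field_columns.items()}


# Stand-in for a tweet returned by the scraper; the values are made up.
fake_tweet = SimpleNamespace(id="1", username="example_user", text="hello",
                             retweets=2, favorites=5)

row = tweet_to_row(fake_tweet)
print(row)
```

A list of such dicts can be passed to `pd.DataFrame`, which takes its column names from the keys, so the field list and the column labels stay in one place.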

### Query by Username
I created three functions to build on, covering scenarios that are likely to come up when scraping tweets from users. After each function, I call it to show an example of its use.

#### F1. scrape_user_tweets
This function scrapes a single user's tweets and exports the data as a csv or excel file.

#### F2. scrape_multiple_users_multifile
This function scrapes multiple users based on a list and exports separate csv or excel files per user.

#### F3. scrape_multiple_users_singlefile
This function scrapes multiple users based on a list and exports one csv or excel file containing all tweets.
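
All three functions strip the timezone from the `Datetime` column before exporting, because pandas refuses to write timezone-aware datetimes to an Excel file. The sketch below shows what that step does to a single value, using only the standard library. The timestamp is an assumed example of a timezone-aware UTC value, not real scraped data.

```python
from datetime import datetime, timezone

# Assumed example value: a timezone-aware UTC timestamp like those in the Datetime column.
aware = datetime(2020, 8, 13, 15, 30, tzinfo=timezone.utc)

# replace(tzinfo=None) drops the timezone label without shifting the clock time.
naive = aware.replace(tzinfo=None)

print(aware.isoformat())  # 2020-08-13T15:30:00+00:00
print(naive.isoformat())  # 2020-08-13T15:30:00
```

Because the clock time is left untouched, the exported values should still be read as being in the original timezone.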

In [34]:
def scrape_user_tweets(username, max_tweets):
    # Creation of query object
    tweetCriteria = got.manager.TweetCriteria().setUsername(username)\
                                            .setMaxTweets(max_tweets)
    # Creation of list that contains all tweets
    tweets = got.manager.TweetManager.getTweets(tweetCriteria)

    # Pulling information from tweets iterable object
    # Add or remove tweet information you want in the below list comprehension
    tweets_list = [[tweet.id, tweet.author_id, tweet.username, tweet.to, tweet.text, tweet.retweets, tweet.favorites,
                    tweet.replies,tweet.date, tweet.formatted_date, tweet.hashtags, 
                    tweet.mentions, tweet.urls, tweet.permalink,] for tweet in tweets]

    # Creation of dataframe from tweets list
    # Add or remove columns as you remove tweet information
    tweets_df = pd.DataFrame(tweets_list, columns = ['Tweet Id', 'Tweet User Id', 'Tweet User','Reply to', 'Text','Retweets', 'Favorites', 'Replies', 'Datetime',
                                                     'Formatted date', 'Hashtags','Mentions','Urls','Permalink'])
    
    # Removing timezone information to allow excel file download
    tweets_df['Datetime'] = tweets_df['Datetime'].apply(lambda x: x.replace(tzinfo=None))
    
    # Uncomment/comment below lines to decide between creating csv or excel file 
    tweets_df.to_csv('{}-tweets.csv'.format(username), sep=',', index = False)
#     tweets_df.to_excel('{}-tweets.xlsx'.format(username), index = False)

In [35]:
# Creating example username to scrape from
username = 'jack'

# Max recent tweets pulls x amount of most recent tweets from that user
max_tweets = 150

# Function will scrape username, attempt to pull max_tweet amount, and create csv/excel file from data.
scrape_user_tweets(username,max_tweets)

In [36]:
def scrape_multiple_users_multifile(username_list, max_tweets_per):
    # Looping through each username in user list
    for username in username_list:
        # Creation of query object
        tweetCriteria = got.manager.TweetCriteria().setUsername(username)\
                                                .setMaxTweets(max_tweets_per)
        # Creation of list that contains all tweets
        tweets = got.manager.TweetManager.getTweets(tweetCriteria)

        # Creating list of chosen tweet data
        # Add or remove tweet information you want in the below list comprehension
        tweets_list = [[tweet.id, tweet.author_id, tweet.username, tweet.to, tweet.text, tweet.retweets, tweet.favorites,
                        tweet.replies,tweet.date, tweet.formatted_date, tweet.hashtags, 
                        tweet.mentions, tweet.urls, tweet.permalink,] for tweet in tweets]

        # Creation of dataframe from tweets list
        # Add or remove columns as you remove tweet information
        tweets_df = pd.DataFrame(tweets_list, columns = ['Tweet Id', 'Tweet User Id', 'Tweet User','Reply to', 'Text','Retweets', 'Favorites', 'Replies', 'Datetime',
                                                         'Formatted date', 'Hashtags','Mentions','Urls','Permalink'])
        
        # Removing timezone information to allow excel file download
        tweets_df['Datetime'] = tweets_df['Datetime'].apply(lambda x: x.replace(tzinfo=None))
        
        # Uncomment/comment below lines to decide between creating csv or excel file 
        tweets_df.to_csv('{}-tweets.csv'.format(username), sep=',', index = False)
#         tweets_df.to_excel('{}-tweets.xlsx'.format(username), index = False)

In [37]:
# Creating example user list with 3 users
user_name_list = ['jack','billgates','random']

# Max recent tweets pulls x amount of most recent tweets from that user
max_tweets_per = 150

# Function will scrape each user, attempting to pull max_tweet amount, and create csv/excel file per user.
scrape_multiple_users_multifile(user_name_list, max_tweets_per)

In [40]:
def scrape_multiple_users_singlefile(username_list, max_tweets_per):
    # Creating master list to contain all tweets
    master_tweets_list = []
    
    # Looping through each username in user list
    for username in username_list:
        # Creation of query object
        tweetCriteria = got.manager.TweetCriteria().setUsername(username)\
                                                .setMaxTweets(max_tweets_per)
        # Creation of list that contains all tweets
        tweets = got.manager.TweetManager.getTweets(tweetCriteria)

        # Creating list of chosen tweet data
        # Appending new tweets per user into the master tweet list
        # Add or remove tweet information you want in the below list comprehension
        for tweet in tweets:
            master_tweets_list.append((tweet.id, tweet.author_id, tweet.username, tweet.to, tweet.text, tweet.retweets, tweet.favorites,
                            tweet.replies,tweet.date, tweet.formatted_date, tweet.hashtags, 
                            tweet.mentions, tweet.urls, tweet.permalink))

    # Creation of dataframe from tweets list
    # Add or remove columns as you remove tweet information
    tweets_df = pd.DataFrame(master_tweets_list, columns = ['Tweet Id', 'Tweet User Id', 'Tweet User','Reply to', 'Text','Retweets', 'Favorites', 'Replies', 'Datetime',
                                                             'Formatted date', 'Hashtags','Mentions','Urls','Permalink'])
    
    # Removing timezone information to allow excel file download
    tweets_df['Datetime'] = tweets_df['Datetime'].apply(lambda x: x.replace(tzinfo=None))
    
    # Uncomment/comment below lines to decide between creating csv or excel file 
    tweets_df.to_csv('multi-user-tweets.csv', sep=',', index = False)
#     tweets_df.to_excel('multi-user-tweets.xlsx', index = False)

In [41]:
# Creating example user list with 3 users
user_name_list = ['jack','billgates','random']

# Max recent tweets pulls x amount of most recent tweets from that user
max_tweets_per = 150

# Function will scrape each user, attempting to pull max_tweet amount, and create one csv/excel file named multi-user-tweets containing all data.
scrape_multiple_users_singlefile(user_name_list, max_tweets_per)

### Query by Text Search
I created a function to build on for scraping tweets by text search.

#### F1. scrape_text_query
This function scrapes tweets from Twitter based on the text search and exports the data as a csv or excel file.
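
Since the search text is reused as the output file name, queries containing spaces, `#`, or `/` can produce awkward or invalid file names. One way to guard against that is sketched below. `safe_filename` is a hypothetical helper and is not used by the function in this notebook.

```python
import re


def safe_filename(text_query, suffix="-tweets.csv"):
    """Hypothetical helper: turn a search query into a filesystem-friendly file name."""
    # Replace every run of characters other than ASCII letters, digits, '-' or '_' with '_'.
    cleaned = re.sub(r"[^A-Za-z0-9_-]+", "_", text_query).strip("_")
    # Fall back to a generic name if nothing usable is left.
    return (cleaned or "query") + suffix


print(safe_filename("Coronavirus"))          # Coronavirus-tweets.csv
print(safe_filename("#covid19 OR vaccine"))  # covid19_OR_vaccine-tweets.csv
```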

In [44]:
def scrape_text_query(text_query, count):
    # Creation of query object
    tweetCriteria = got.manager.TweetCriteria().setQuerySearch(text_query)\
                            .setMaxTweets(count)
    # Creation of list that contains all tweets
    tweets = got.manager.TweetManager.getTweets(tweetCriteria)

    # Creating list of chosen tweet data
    # Add or remove tweet information you want in the below list comprehension
    tweets_list = [[tweet.id, tweet.author_id, tweet.username, tweet.to, tweet.text, tweet.retweets, tweet.favorites,
                    tweet.replies,tweet.date, tweet.formatted_date, tweet.hashtags, 
                    tweet.mentions, tweet.urls, tweet.permalink,] for tweet in tweets]

    # Creation of dataframe from tweets
    # Add or remove columns as you remove tweet information
    tweets_df = pd.DataFrame(tweets_list, columns = ['Tweet Id', 'Tweet User Id', 'Tweet User','Reply to', 'Text','Retweets', 'Favorites', 'Replies', 'Datetime',
                                                             'Formatted date', 'Hashtags','Mentions','Urls','Permalink'])
    
    # Removing timezone information to allow excel file download
    tweets_df['Datetime'] = tweets_df['Datetime'].apply(lambda x: x.replace(tzinfo=None))
    
    # Uncomment/comment below lines to decide between creating csv or excel file 
    tweets_df.to_csv('{}-tweets.csv'.format(text_query), sep=',', index = False)
#     tweets_df.to_excel('{}-tweets.xlsx'.format(text_query), index = False)

In [None]:
# Input search query to scrape tweets and name csv file
text_query = 'Coronavirus'

# Max tweets caps how many of the most recent tweets matching the search are pulled
max_tweets = 150

# Function scrapes for tweets containing text_query, attempting to pull max_tweet amount and create csv/excel file containing data.
scrape_text_query(text_query, max_tweets)

## 2. Getting User Information From Tweets<a name="Section2"></a>
[Return to Table of Contents](#TOC)
<br><b>GetOldTweets3 is limited in the user information that is accessible.</b> This library only allows access to a tweet author's username and user_id. If you want more user information, I recommend using Tweepy for all of your scraping, or using Tweepy in tandem with GetOldTweets3 to play to each library's strengths.
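
When pairing the two libraries, one possible workflow is to collect the unique usernames from the scraped tweets and hand them to a user-lookup call in batches. The sketch below covers only the de-duplication and batching. The default batch size of 100 is an assumption based on the per-request limit of Twitter's standard user-lookup endpoint, and the lookup call itself is left out because it needs API credentials.

```python
def unique_in_order(usernames):
    """Drop duplicate usernames while keeping first-seen order."""
    return list(dict.fromkeys(usernames))


def batches(items, size=100):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Made-up usernames standing in for the 'Tweet User' column of a scraped dataframe.
scraped_usernames = ["jack", "billgates", "jack", "random", "billgates"]

users = unique_in_order(scraped_usernames)
for batch in batches(users, size=2):
    # Each batch could be passed to a user-lookup call from a library such as Tweepy.
    print(batch)
```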

## 3. Scraping Tweets With Advanced Queries<a name="Section3"></a>
[Return to Table of Contents](#TOC)
<br>
List of methods available with GetOldTweets3 to refine your queries.

* setUsername(str): Setting query based on username
* setMaxTweets(int): Setting maximum number of tweets to search
* setQuerySearch(str): Setting query based on text
* setSince(str "yyyy-mm-dd"): Setting lower bound date on query
* setUntil(str "yyyy-mm-dd"): Setting upper bound date on query
* setNear(str): Setting location of query search
* setWithin(str): Setting radius of query search location
* setLang(str): Setting language of query
* setTopTweets(bool): Setting query to search only for top tweets
* setEmoji("ignore"/"unicode"/"name"): Setting query to search using emoji styles

I created two functions to build on that use the different query methods available through the TweetCriteria class. You can mix and match the above methods in any way. Keep in mind that the more restrictive you make the search, the fewer tweets are likely to come up.
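
To illustrate the mix-and-match idea, the sketch below applies only the filters a caller actually supplies. `apply_filters` is a hypothetical helper, and `FakeCriteria` is an offline stand-in for `got.manager.TweetCriteria` that implements just three of the setters listed above, so the snippet runs without the library or network access.

```python
def apply_filters(criteria, **filters):
    """Hypothetical helper: call criteria.set<Name>(value) for every filter that is not None."""
    for name, value in filters.items():
        if value is None:
            continue  # leave out filters the caller did not set
        # Each setter is expected to return the criteria object,
        # as the chained calls in this notebook suggest.
        criteria = getattr(criteria, "set" + name)(value)
    return criteria


class FakeCriteria:
    """Offline stand-in for got.manager.TweetCriteria that records what was set."""

    def __init__(self):
        self.settings = {}

    def _record(self, key, value):
        self.settings[key] = value
        return self

    def setQuerySearch(self, value):
        return self._record("QuerySearch", value)

    def setSince(self, value):
        return self._record("Since", value)

    def setMaxTweets(self, value):
        return self._record("MaxTweets", value)


criteria = apply_filters(
    FakeCriteria(),
    QuerySearch="Coronavirus",
    Since="2020-08-05",
    Lang=None,  # skipped because it is None
    MaxTweets=150,
)
print(criteria.settings)
```

With the real library, the first argument would be `got.manager.TweetCriteria()` and the result would be passed to `got.manager.TweetManager.getTweets`.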

#### F1. scrape_advanced_queries1
This function queries by using .setUsername to set the username, .setQuerySearch to set the text to query for, .setSince to set the oldest date of the tweets to query, .setUntil to set the most recent date of the tweets to query, and .setMaxTweets to set the number of tweets to query for.

#### F2. scrape_advanced_queries2
This function queries by using .setQuerySearch to set the text to query for, .setNear to set a location to query for tweets around, .setWithin to set a radius restriction around the chosen location, .setLang to scrape for tweets written in a specific language, and .setMaxTweets to set the number of tweets to query for.

In [45]:
def scrape_advanced_queries1(username, text_query, since_date, until_date, count):
    # Creation of query object with as many specific queries as you want
    tweetCriteria = got.manager.TweetCriteria().setUsername(username)\
    .setQuerySearch(text_query).setSince(since_date)\
    .setUntil(until_date).setMaxTweets(count)
    
    # Creation of list that contains all tweets
    tweets = got.manager.TweetManager.getTweets(tweetCriteria)

    # Creating list of chosen tweet data
    # Add or remove tweet information you want in the below list comprehension
    tweets_list = [[tweet.id, tweet.author_id, tweet.username, tweet.to, tweet.text, tweet.retweets, tweet.favorites,
                    tweet.replies,tweet.date, tweet.formatted_date, tweet.hashtags, 
                    tweet.mentions, tweet.urls, tweet.permalink,] for tweet in tweets]

    # Creation of dataframe from tweets list
    # Add or remove columns as you remove tweet information
    tweets_df = pd.DataFrame(tweets_list, columns = ['Tweet Id', 'Tweet User Id', 'Tweet User','Reply to', 'Text','Retweets', 'Favorites', 'Replies', 'Datetime',
                                                     'Formatted date', 'Hashtags','Mentions','Urls','Permalink'])
    
    # Removing timezone information to allow excel file download
    tweets_df['Datetime'] = tweets_df['Datetime'].apply(lambda x: x.replace(tzinfo=None))
    
    # Uncomment/comment below lines to decide between creating csv or excel file 
    tweets_df.to_csv('{}-tweets.csv'.format(username), sep=',', index = False)
#     tweets_df.to_excel('{}-tweets.xlsx'.format(username), index = False)

In [46]:
username = "BarackObama"
text_query = "Hello"
since_date = "2011-01-01"
until_date = "2016-12-20"
count = 150

scrape_advanced_queries1(username, text_query, since_date, until_date, count)

In [47]:
def scrape_advanced_queries2(text_query, location, radius, language, count):
    # Creation of query object with as many specific queries as you want
    tweetCriteria = got.manager.TweetCriteria().setQuerySearch(text_query)\
    .setNear(location).setWithin(radius).setLang(language).setMaxTweets(count)
    
    # Creation of list that contains all tweets
    tweets = got.manager.TweetManager.getTweets(tweetCriteria)

    # Creating list of chosen tweet data
    # Add or remove tweet information you want in the below list comprehension
    tweets_list = [[tweet.id, tweet.author_id, tweet.username, tweet.to, tweet.text, tweet.retweets, tweet.favorites,
                    tweet.replies,tweet.date, tweet.formatted_date, tweet.hashtags, 
                    tweet.mentions, tweet.urls, tweet.permalink,] for tweet in tweets]

    # Creation of dataframe from tweets list
    # Add or remove columns as you remove tweet information
    tweets_df = pd.DataFrame(tweets_list, columns = ['Tweet Id', 'Tweet User Id', 'Tweet User','Reply to', 'Text','Retweets', 'Favorites', 'Replies', 'Datetime',
                                                     'Formatted date', 'Hashtags','Mentions','Urls','Permalink'])
    
    # Removing timezone information to allow excel file download
    tweets_df['Datetime'] = tweets_df['Datetime'].apply(lambda x: x.replace(tzinfo=None))
    
    # Uncomment/comment below lines to decide between creating csv or excel file 
    tweets_df.to_csv('{}-tweets.csv'.format(text_query), sep=',', index = False)
#     tweets_df.to_excel('{}-tweets.xlsx'.format(text_query), index = False)

In [48]:
text_query = "Hola"
location = "Mexico"
radius = "100mi"
language = "es"  # Language code for Spanish; Twitter search filters by code rather than by language name
count = 150

scrape_advanced_queries2(text_query, location, radius, language, count)

## 4. Putting It All Together<a name="Section4"></a>
[Return to Table of Contents](#TOC)
<br>
Great, we now know how to pull more information from tweets and how to query with advanced parameters. As shown several times above, you can mix and match both the information you pull from each tweet and the query methods you use. Just remember to update the column names in the pandas dataframe to match the information you pull, so you don't get errors.

<br>
Below is an example of a search for up to 150 top tweets containing 'Coronavirus' that were posted in Washington, D.C. between August 5th and August 10th, 2020.
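
Date bounds for a search like this can be computed rather than typed by hand. The sketch below assumes the upper bound is exclusive, as it is for Twitter's `until:` search operator; that assumption is worth checking against your own results. `date_window` is a hypothetical helper, not part of GetOldTweets3.

```python
from datetime import date, timedelta


def date_window(first_day, days):
    """Hypothetical helper: return (since, until) strings covering `days` days from `first_day`."""
    start = date.fromisoformat(first_day)
    # The upper bound is the day after the last wanted day, assuming it is exclusive.
    end = start + timedelta(days=days)
    return start.isoformat(), end.isoformat()


since_date, until_date = date_window("2020-08-05", 5)
print(since_date, until_date)  # 2020-08-05 2020-08-10
```

Both strings are in the "yyyy-mm-dd" format that `setSince` and `setUntil` expect.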


In [49]:
text_query = 'Coronavirus'
since_date = '2020-08-05'
until_date = '2020-08-10'
location = 'Washington, D.C.'
top_tweets = True
count = 150

# Creation of tweetCriteria query object with methods to specify further
tweetCriteria = got.manager.TweetCriteria().setQuerySearch(text_query).setSince(since_date)\
.setUntil(until_date).setNear(location).setTopTweets(top_tweets).setMaxTweets(count)

# Creation of tweets iterable containing all queried tweet data
tweets = got.manager.TweetManager.getTweets(tweetCriteria)

# List comprehension pulling chosen tweet information per tweet from tweets
# Add or remove tweet information you want in the below list comprehension
tweets_list = [[tweet.id, tweet.author_id, tweet.username, tweet.to, tweet.text, tweet.retweets, tweet.favorites,
                tweet.replies,tweet.date, tweet.mentions, tweet.urls, tweet.permalink,] 
               for tweet in tweets]

# Creation of dataframe from tweets list
# Add or remove columns as you remove tweet information
tweets_df = pd.DataFrame(tweets_list, columns = ['Tweet Id', 'Twitter User Id', 'Twitter @ Name','Reply to', 'Text','Retweets', 'Favorites', 
                                                 'Replies', 'Datetime','Mentions','Urls','Permalink'])
# Removing timezone information to allow excel file download
tweets_df['Datetime'] = tweets_df['Datetime'].apply(lambda x: x.replace(tzinfo=None))

# Uncomment/comment below lines to decide between creating csv or excel file 
tweets_df.to_csv('put-together-tweets.csv', sep=',', index = False)
# tweets_df.to_excel('put-together-tweets.xlsx', index = False)