# Article Notebook for Scraping Twitter Using GetOldTweets3

Package: https://github.com/Mottl/GetOldTweets3

Article Read-Along: 

### Notebook Author: Martin Beck
#### Information current as of August 13th, 2020
<b>Dependencies:</b> Make sure GetOldTweets3 is already installed in your Python environment. If not, you can run pip install GetOldTweets3 to install the package. If you want more information on setting up, I have an article [here](https://towardsdatascience.com/how-to-scrape-tweets-from-twitter-59287e20f0f1) that goes into deeper detail.

## Notebook's Table of Contents<a name="TOC"></a>

1. [Getting More Information from Tweets](#Section1)
<br>How to scrape more information from tweets such as favorite count, retweet count, mentions, permalinks, etc.
2. [Getting User Information from Tweets](#Section2)
<br><b>GetOldTweets3 does not offer</b> any more user information than the author's user id and screen name (Twitter @ name), both of which are shown in section 1.
3. [Scraping Tweets with Advanced Queries](#Section3)
<br>How to scrape for tweets using deeper queries such as searching by language of tweets, tweets within a certain location, tweets within specific date ranges, top tweets, etc.
4. [Putting it All Together](#Section4)
<br>Showcasing how you can mix and match the methods shown above to create queries that'll fulfill your data needs.

## Imports for Notebook

In [27]:
# Pip install GetOldTweets3 if you don't already have the package
# !pip install GetOldTweets3

# Imports
import GetOldTweets3 as got
import pandas as pd

## 1. Getting More Information from Tweets <a name="Section1"></a>
[Return to Table of Contents](#TOC)
<br>
List of information available in the tweet object with GetOldTweets3:
* tweet.geo: <b>NOTE: geo-data is not working, based on a reported issue</b><br><br>

* tweet.id: Id of tweet
* tweet.author_id: User id of tweet's author
* tweet.username: Username of tweet's author, commonly called User @ name
* tweet.to: If tweet is a reply, the username of the user being replied to
* tweet.text: Text content of tweet
* tweet.retweets: Count of retweets
* tweet.favorites: Count of favorites
* tweet.replies: Count of replies
* tweet.date: Date tweet was created
* tweet.formatted_date: Formatted version of when tweet was created
* tweet.hashtags: Hashtags that tweet contains
* tweet.mentions: Mentions of other users that tweet contains
* tweet.urls: Urls that are in the tweet
* tweet.permalink: Permalink of tweet itself

In [28]:
username = 'jack'
count = 150
 
# Creation of tweetCriteria query object with methods to specify further
tweetCriteria = got.manager.TweetCriteria().setUsername(username)\
.setMaxTweets(count)
    
# Creation of tweets iterable containing all queried tweet data
tweets = got.manager.TweetManager.getTweets(tweetCriteria)
 
# List comprehension pulling chosen tweet information from tweets
# Add or remove tweet information you want in the below list comprehension
tweets_list = [[tweet.id, tweet.author_id, tweet.username, tweet.to, tweet.text, tweet.retweets, tweet.favorites, tweet.replies, tweet.date, tweet.formatted_date, tweet.hashtags, tweet.mentions, tweet.urls, tweet.permalink,] for tweet in tweets]
 
# Creation of dataframe from tweets_list
# Add or remove columns to match the tweet information chosen above
tweets_df1 = pd.DataFrame(tweets_list, columns = ['Tweet Id', 'Tweet User Id', 'Tweet User','Reply to', 'Text','Retweets', 'Favorites', 'Replies', 'Datetime',
                                                 'Formatted date', 'Hashtags','Mentions','Urls','Permalink'])

In [29]:
tweets_df1

Unnamed: 0,Tweet Id,Tweet User Id,Tweet User,Reply to,Text,Retweets,Favorites,Replies,Datetime,Formatted date,Hashtags,Mentions,Urls,Permalink
0,1294765289255706624,12,jack,jsngr,Jordan is incredible,82,882,48,2020-08-15 22:37:55+00:00,Sat Aug 15 22:37:55 +0000 2020,,,https://twitter.com/jsngr/status/1294635175222...,https://twitter.com/jack/status/12947652892557...
1,1293753884159234050,12,jack,SpaceForceDoD,?,741,9112,583,2020-08-13 03:38:57+00:00,Thu Aug 13 03:38:57 +0000 2020,,,https://twitter.com/spaceforcedod/status/12936...,https://twitter.com/jack/status/12937538841592...
2,1293687636675223552,12,jack,TwitterDev,Build on Twitter again!,620,4941,440,2020-08-12 23:15:42+00:00,Wed Aug 12 23:15:42 +0000 2020,,,https://twitter.com/TwitterDev/status/12935935...,https://twitter.com/jack/status/12936876366752...
3,1293641297459388416,12,jack,boardroom,Thanks for the chat @richkleiman and Gianni! G...,52,383,90,2020-08-12 20:11:34+00:00,Wed Aug 12 20:11:34 +0000 2020,,@richkleiman,https://twitter.com/boardroom/status/129356427...,https://twitter.com/jack/status/12936412974593...
4,1291956273814990848,12,jack,Mayalangersegal,Thank you. Thank you. Thank you. @RemindMe_OfT...,2,93,16,2020-08-08 04:35:53+00:00,Sat Aug 08 04:35:53 +0000 2020,,@RemindMe_OfThis,,https://twitter.com/jack/status/12919562738149...
5,1291221954699997185,12,jack,boo,Oh no,1,58,20,2020-08-06 03:57:58+00:00,Thu Aug 06 03:57:58 +0000 2020,,,,https://twitter.com/jack/status/12912219546999...
6,1291198140700270598,12,jack,kenisajerk,Good luck. Will be rooting for you from the si...,2,204,13,2020-08-06 02:23:20+00:00,Thu Aug 06 02:23:20 +0000 2020,,,,https://twitter.com/jack/status/12911981407002...
7,1291174985315352577,12,jack,kenisajerk,Do what you need to,11,263,26,2020-08-06 00:51:20+00:00,Thu Aug 06 00:51:20 +0000 2020,,,,https://twitter.com/jack/status/12911749853153...
8,1291024141273923586,12,jack,Sekani__Solomon,Design all the things,122,1041,194,2020-08-05 14:51:56+00:00,Wed Aug 05 14:51:56 +0000 2020,,,https://twitter.com/Sekani__Solomon/status/129...,https://twitter.com/jack/status/12910241412739...
9,1291017485387325442,12,jack,Vietman18,You can,10,168,24,2020-08-05 14:25:29+00:00,Wed Aug 05 14:25:29 +0000 2020,,,,https://twitter.com/jack/status/12910174853873...
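
The Hashtags and Mentions columns above come back as plain strings rather than lists. The sketch below shows one way to split them into lists for later analysis. It assumes multiple values are separated by whitespace (for example `'@a @b'`), which is not confirmed by the output above, and the helper name `split_tokens` and the sample values are hypothetical.

```python
def split_tokens(value):
    """Split a whitespace-separated string such as '@a @b' into a list.

    Empty strings and None both become an empty list.
    """
    if not value:
        return []
    return value.split()


# Hypothetical sample values shaped like the Mentions column above
sample_mentions = ["", "@richkleiman", "@RemindMe_OfThis @jack"]
parsed_mentions = [split_tokens(m) for m in sample_mentions]
print(parsed_mentions)
```

With the dataframe from this section, the same helper could be applied column-wise, for example `tweets_df1['Mentions'].apply(split_tokens)`.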


## 2. Getting User Information from Tweets<a name="Section2"></a>
[Return to Table of Contents](#TOC)
<br><b>GetOldTweets3 is limited in the user information that is accessible.</b> This library only gives access to a tweet author's username and user id. If you want more user information, I recommend using Tweepy for all of your scraping, or using Tweepy in tandem with GetOldTweets3 so that each library plays to its strengths.
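
If you go the tandem route, a first step is usually to collect the distinct usernames from the scraped tweets and group them into batches for a separate user lookup. The sketch below covers only that preparation step and makes no Tweepy calls. The stand-in tweet objects, the helper names, and the batch size of 100 are assumptions for illustration; check the lookup limits of whichever API you use.

```python
from types import SimpleNamespace


def unique_usernames(tweets):
    """Return usernames in first-seen order, without duplicates."""
    seen = {}
    for tweet in tweets:
        seen.setdefault(tweet.username, None)
    return list(seen)


def batched(items, size=100):
    """Yield consecutive slices of items, each with at most `size` entries."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Stand-in objects exposing only the .username attribute of a scraped tweet
sample_tweets = [
    SimpleNamespace(username="jack"),
    SimpleNamespace(username="ddale8"),
    SimpleNamespace(username="jack"),
]

usernames = unique_usernames(sample_tweets)
batches = list(batched(usernames, size=100))
print(usernames, len(batches))
```

Each batch could then be passed to a user lookup in Tweepy to fetch profile details that GetOldTweets3 does not expose.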

## 3. Scraping Tweets With Advanced Queries<a name="Section3"></a>
[Return to Table of Contents](#TOC)
<br>
List of methods available with GetOldTweets3 to refine your queries.

* setUsername(str): Setting query based on username
* setMaxTweets(int): Setting maximum number of tweets to retrieve
* setQuerySearch(str): Setting query based on text
* setSince(str "yyyy-mm-dd"): Setting lower bound date on query
* setUntil(str "yyyy-mm-dd"): Setting upper bound date on query
* setNear(str): Setting location of query search
* setWithin(str): Setting radius of query search location
* setLang(str): Setting language of query
* setTopTweets(bool): Setting query to search only for top tweets
* setEmoji("ignore"/"unicode"/"name"): Setting how emoji in the tweet text are handled
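
Since setSince and setUntil take dates as "yyyy-mm-dd" strings, it can help to check the two bounds before building a query. The sketch below is a standalone helper that does not call GetOldTweets3. The name `check_date_range` is hypothetical, and the rule that the lower bound must be strictly earlier than the upper bound is an assumption about what makes a useful query.

```python
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d"


def check_date_range(since, until):
    """Parse both bounds and make sure `since` is earlier than `until`.

    Raises ValueError if a string cannot be parsed as a year-month-day
    date or if the range is empty or reversed.
    """
    start = datetime.strptime(since, DATE_FORMAT).date()
    end = datetime.strptime(until, DATE_FORMAT).date()
    if start >= end:
        raise ValueError(f"since ({since}) must be earlier than until ({until})")
    return start, end


start, end = check_date_range("2011-01-01", "2016-12-20")
print(start, end)
```

The validated strings can then be handed to setSince and setUntil unchanged.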

In [30]:
username = "BarackObama"
text_query = "Hello"
since_date = "2011-01-01"
until_date = "2016-12-20"
count = 150
 
# Creation of tweetCriteria query object with methods to specify further
tweetCriteria = got.manager.TweetCriteria().setUsername(username)\
.setQuerySearch(text_query).setSince(since_date)\
.setUntil(until_date).setMaxTweets(count)
 
# Creation of tweets iterable containing all queried tweet data
tweets = got.manager.TweetManager.getTweets(tweetCriteria)
 
# List comprehension pulling chosen tweet information from tweets
# Add or remove tweet information you want in the below list comprehension
tweets_list = [[tweet.id, tweet.author_id, tweet.username, tweet.text, tweet.retweets, tweet.favorites,tweet.replies,tweet.date] for tweet in tweets]
 
# Creation of dataframe from tweets list
# Add or remove columns to match the tweet information chosen above
tweets_df3 = pd.DataFrame(tweets_list, columns = ['Tweet Id', 'Tweet User Id', 'Tweet User', 'Text','Retweets', 'Favorites', 
                                                 'Replies', 'Datetime'])

In [32]:
tweets_df3

Unnamed: 0,Tweet Id,Tweet User Id,Tweet User,Text,Retweets,Favorites,Replies,Datetime
0,682986933862154241,813286,BarackObama,"Hello, 2016.",3506,13010,760,2016-01-01 18:09:08+00:00
1,547783171199496192,813286,BarackObama,Say hello to friends you know and everyone you...,3555,9075,1087,2014-12-24 15:57:39+00:00
2,457281289351999489,813286,BarackObama,"Hello, spring.",5807,10091,1040,2014-04-18 22:15:30+00:00
3,438453976833343488,813286,BarackObama,"""Hello OFA!"" —President Obama at the #ActionSu...",134,244,57,2014-02-25 23:22:28+00:00
4,265569746991333377,813286,BarackObama,"“Hello, Columbus! Hello, Ohio! Are you fired u...",513,208,81,2012-11-05 21:42:16+00:00
5,265198579352756224,813286,BarackObama,"""Hello, Florida! Are you fired up? Are you rea...",706,266,172,2012-11-04 21:07:23+00:00
6,261624859694624768,813286,BarackObama,"President Obama: “Hello, Ohio! Are you fired u...",310,155,87,2012-10-26 00:26:41+00:00
7,261210241482518529,813286,BarackObama,"President Obama: ""Hello Colorado! Are you fire...",211,94,62,2012-10-24 20:59:09+00:00
8,260899166627176448,813286,BarackObama,"Hello, Florida:",1123,799,298,2012-10-24 00:23:03+00:00
9,258617790540435456,813286,BarackObama,"President Obama: ""Hello, Iowa! Are you fired u...",314,138,121,2012-10-17 17:17:40+00:00


## 4. Putting it All Together<a name="Section4"></a>
[Return to Table of Contents](#TOC)
<br>
Great, we now know how to pull more information from tweets and how to query with advanced parameters. As the examples above show, you can mix and match both the information you pull from each tweet and the type of query you run. Just make sure the column names in the pandas dataframe match the tweet information in the list comprehension: a mismatch in the number of columns raises an error, and a mismatch in order mislabels the data.
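
One way to keep tweet attributes and column names in sync is to define them together in a single mapping. The sketch below is a minimal illustration using stand-in objects rather than real GetOldTweets3 tweets. The names `FIELDS` and `tweets_to_rows` are hypothetical, and the attribute names are taken from the list in section 1.

```python
from types import SimpleNamespace

# Tweet attribute -> dataframe column name, defined in one place
FIELDS = {
    "id": "Tweet Id",
    "username": "Tweet User",
    "text": "Text",
    "favorites": "Favorites",
}


def tweets_to_rows(tweets, fields):
    """Build one dict per tweet, keyed by column name."""
    return [
        {column: getattr(tweet, attr) for attr, column in fields.items()}
        for tweet in tweets
    ]


# Stand-in object with the same attribute names as a scraped tweet
sample_tweets = [
    SimpleNamespace(id="1", username="jack", text="Oh no", favorites=58),
]

rows = tweets_to_rows(sample_tweets, FIELDS)
print(rows)
```

Passing `rows` to `pd.DataFrame(rows)` produces a dataframe whose columns come from the dict keys, so adding or removing an entry in `FIELDS` updates the data and the column names together.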

<br>
Below is an example of a search for up to 150 top tweets containing 'coronavirus' that were posted near Washington, D.C. from August 5th up to August 10th, 2020.

In [33]:
text_query = 'Coronavirus'
since_date = '2020-08-05'
until_date = '2020-08-10'
location = 'Washington, D.C.'
top_tweets = True
count = 150
 
# Creation of tweetCriteria query object with methods to specify further
tweetCriteria = got.manager.TweetCriteria()\
.setQuerySearch(text_query).setSince(since_date)\
.setUntil(until_date).setNear(location).setTopTweets(top_tweets)\
.setMaxTweets(count)
 
# Creation of tweets iterable containing all queried tweet data
tweets = got.manager.TweetManager.getTweets(tweetCriteria)
 
# List comprehension pulling chosen tweet information from tweets
# Add or remove tweet information you want in the below list comprehension
tweets_list = [[tweet.id, tweet.author_id, tweet.username, tweet.to, tweet.text, tweet.retweets, tweet.favorites, tweet.replies, tweet.date, tweet.mentions, tweet.urls, tweet.permalink,] for tweet in tweets]
 
# Creation of dataframe from tweets list
# Add or remove columns to match the tweet information chosen above
tweets_df4 = pd.DataFrame(tweets_list, columns = ['Tweet Id', 'Tweet User Id', 'Tweet User','Reply to', 'Text',
                                                  'Retweets', 'Favorites', 'Replies', 'Datetime', 'Mentions','Urls','Permalink'])

In [34]:
tweets_df4

Unnamed: 0,Tweet Id,Tweet User Id,Tweet User,Reply to,Text,Retweets,Favorites,Replies,Datetime,Mentions,Urls,Permalink
0,1292610170309181447,535643852,JordanSchachtel,,Fauci had a very interesting Q&A this weekend ...,276,562,92,2020-08-09 23:54:14+00:00,,https://www.cnbc.com/2020/08/07/coronavirus-va...,https://twitter.com/JordanSchachtel/status/129...
1,1292584089833349121,225265639,ddale8,,If the president confused you about what was a...,1742,3481,143,2020-08-09 22:10:36+00:00,,https://cnn.it/31zwwir,https://twitter.com/ddale8/status/129258408983...
2,1292543811235840000,53809979,davidalim,,Antigen tests have been touted as a way to sca...,164,212,27,2020-08-09 19:30:33+00:00,@rachel_roubein,https://www.politico.com/news/2020/08/09/coron...,https://twitter.com/davidalim/status/129254381...
3,1292525660422930432,18956073,dcexaminer,,"A Nashville, Tennessee, councilwoman wants tho...",315,275,390,2020-08-09 18:18:26+00:00,,https://washex.am/3kD8L1E,https://twitter.com/dcexaminer/status/12925256...
4,1292468804648394752,309822757,ryanstruyk,,The United States just reached 5 million repor...,974,1257,52,2020-08-09 14:32:30+00:00,,,https://twitter.com/ryanstruyk/status/12924688...
5,1292467612086083584,22771961,Acosta,,"CNN: There are now at least 5,000,603 cases of...",6212,14283,1280,2020-08-09 14:27:46+00:00,,,https://twitter.com/Acosta/status/129246761208...
6,1292277343944351744,59331128,PhilipRucker,,Trump recently hosted biopharma exec Andrew Wh...,878,1651,378,2020-08-09 01:51:42+00:00,,https://www.washingtonpost.com/politics/trump-...,https://twitter.com/PhilipRucker/status/129227...
7,1292212832818401281,2475407894,hugolowell,,Just in : Trump appears to have tonight actual...,2012,3144,228,2020-08-08 21:35:22+00:00,,,https://twitter.com/hugolowell/status/12922128...
8,1292210017253371905,54622050,MichaelCBender,,Trump said he was reducing additional jobless ...,1166,1663,302,2020-08-08 21:24:10+00:00,,https://www.wsj.com/articles/trump-to-sign-exe...,https://twitter.com/MichaelCBender/status/1292...
9,1292198332765331457,558040899,AnnTelnaes,,Trump is blabbing on so you'll forget over 161...,168,339,6,2020-08-08 20:37:45+00:00,,,https://twitter.com/AnnTelnaes/status/12921983...
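
A quick sanity check on a date-bounded query is to count how many returned tweets fall on each day. The sketch below does this with the standard library, using three Datetime values copied as strings from the output above. In the real dataframe the Datetime column holds datetime objects rather than strings, so the parsing step is only needed for this standalone illustration.

```python
from collections import Counter
from datetime import datetime

# Datetime values copied from rows 0, 1 and 7 of the output above
sample_datetimes = [
    "2020-08-09 23:54:14+00:00",
    "2020-08-09 22:10:36+00:00",
    "2020-08-08 21:35:22+00:00",
]

# Parse each timestamp and keep only the calendar date
tweets_per_day = Counter(
    datetime.fromisoformat(value).date().isoformat()
    for value in sample_datetimes
)

for day, n in sorted(tweets_per_day.items()):
    print(day, n)
```

On the dataframe itself, something like `tweets_df4['Datetime'].dt.date.value_counts()` should give the same per-day counts, assuming pandas has stored the column with a datetime dtype.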
