This is a tweet-scraping script based *heavily* on the example script found here: https://gist.github.com/bjmarsh/315a632aa1ab0e8436e631f8a1acf40b (originally created by Bennett Marsh).

In [1]:
from collections import defaultdict
import os, sys
import time
import pandas as pd
import GetOldTweets3 as got


In [2]:
os.makedirs('tweet_data', exist_ok=True)
users = ["elonmusk"]
username = users[0]

In [3]:
count = 10
# Creation of query object                                                                                                                                                                                      
tweetCriteria = got.manager.TweetCriteria().setUsername(username)\
                                               .setMaxTweets(count)\
                                               .setSince("2020-05-30")\
                                               .setUntil("2020-05-31")
tweets = None
for ntries in range(2):        
    try:
        tweets = got.manager.TweetManager.getTweets(tweetCriteria)
    except SystemExit:
        print("Trying again in 15 minutes.")
        time.sleep(15*60)
    else:
        break
if tweets is None:
    print("Failed after 2 tries, quitting!")
    exit(1)
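
The try/sleep/retry pattern in the cell above shows up again in the later functions, so it may be worth factoring out. The following is only a sketch: `retry_call` and `flaky_fetch` are hypothetical names, and `fetch` stands in for a zero-argument wrapper around `got.manager.TweetManager.getTweets(tweetCriteria)`. Unlike the cell above, it skips the sleep after the final failed attempt, and the `sleep` function is injectable so the sketch can be exercised without waiting 15 minutes.

```python
import time


def retry_call(fetch, tries=2, wait_seconds=15 * 60, sleep=time.sleep):
    """Call fetch() up to `tries` times, sleeping between failed attempts.

    Hypothetical helper: returns fetch()'s result, or None if every
    attempt raised SystemExit (the exception the cell above catches).
    """
    for attempt in range(tries):
        try:
            return fetch()
        except SystemExit:
            # Only wait if another attempt is still to come.
            if attempt < tries - 1:
                print(f"Trying again in {wait_seconds // 60} minutes.")
                sleep(wait_seconds)
    return None


# Stand-in fetcher that fails once and then succeeds, so the sketch
# runs without touching the network.
calls = {"n": 0}


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] == 1:
        raise SystemExit(1)
    return ["tweet-a", "tweet-b"]


waits = []  # records requested sleeps instead of actually sleeping
result = retry_call(flaky_fetch, tries=2, wait_seconds=900, sleep=waits.append)
print(result, waits)
```

With a helper like this, the caller only needs to check whether the result is `None` before quitting.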

In [4]:
len(tweets)

4

In [5]:
tweets[0]

<GetOldTweets3.models.Tweet.Tweet at 0x12337df40>

Got it: getTweets() returns a list of Tweet objects (four of them here). There is no docstring on the got Tweet object, though.

In [6]:
tweets[0].id
tweets[0].to

Bennett's original script does just fine at gathering up all of Elon's tweets. I'd like to have a record of the semantic content of the tweet/reply conversations that Elon has with his followers, and Twitter's API does not make this properly available. I'm going to have to fudge it, but I think this algorithm will do a decent job of getting at least some of the conversations Elon has.

In [7]:
def get_other_user_reply(username, t_init, t_final):
    # Searches a secondary user's tweets within a range of time and
    # returns tweets that either reply to or @-mention elonmusk.
    print(username)
    count = 0
    # Creation of query object
    tweetCriteria = got.manager.TweetCriteria().setUsername(username)\
                                               .setMaxTweets(count)\
                                               .setSince(t_init)\
                                               .setUntil(t_final)
    # Creation of list that contains all tweets
    tweets = None
    for ntries in range(5):
        try:
            tweets = got.manager.TweetManager.getTweets(tweetCriteria)
        except SystemExit:
            print("Trying again in 15 minutes.")
            time.sleep(15*60)
        else:
            break
    if tweets is None:
        print("Failed after 5 tries, quitting!")
        sys.exit(1)

    data = defaultdict(list)
    for t in tweets:
        # t.mentions can hold several space-separated handles, so test
        # membership rather than equality with '@elonmusk'.
        if t.to == 'elonmusk' or '@elonmusk' in t.mentions.split():
            data["username"].append(username)
            data["tweet_id"].append(t.id)
            data["reply_to"].append(t.to)
            data["date"].append(t.date)
            data["retweets"].append(t.retweets)
            data["favorites"].append(t.favorites)
            data["hashtags"].append(list(set(t.hashtags.split())))
            data["mentions"].append(t.mentions)
            data["text"].append(t.text)
            data["permalink"].append(t.permalink)
    return data
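
The reply-or-mention test is the heart of the function above, so here it is as a small standalone sketch. It assumes the `mentions` field is a space-separated string of `@handles`, as the sample output further down suggests. `is_directed_at` is a hypothetical name, and the case-insensitive comparison is my addition, on the grounds that Twitter handles are not case-sensitive.

```python
def is_directed_at(reply_to, mentions, target="elonmusk"):
    """Return True if a tweet replies to `target` or @-mentions them.

    Assumes `mentions` is a space-separated string such as "@NASA @elonmusk"
    and that `reply_to` is a bare handle (or None/empty for non-replies).
    """
    target = target.lower()
    # Split the mentions string into individual handles without the '@'.
    handles = {h.lstrip("@").lower() for h in (mentions or "").split()}
    return (reply_to or "").lower() == target or target in handles


print(is_directed_at("elonmusk", ""))                # direct reply
print(is_directed_at(None, "@NASA @elonmusk"))       # one mention among several
print(is_directed_at("SpaceX", "@Space_Station"))    # neither
```

Note that comparing the whole string with `==` would miss the second case, because `"@NASA @elonmusk" == "@elonmusk"` is false.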

In [2]:
elon_tweets_df = pd.read_csv('./tweet_data/elonmusk.csv')

In [3]:
elon_tweets_df.columns

Index(['Unnamed: 0', 'username', 'tweet_id', 'reply_to', 'date', 'retweets',
       'favorites', 'hashtags', 'mentions', 'text', 'permalink'],
      dtype='object')

In [4]:
# Parse the 'date' strings into a timezone-aware datetime column named 'Time'.
# The time of day is kept; appending .dt.date would strip it.
elon_tweets_df['Time'] = pd.to_datetime(elon_tweets_df['date'])#.dt.date

In [5]:
elon_tweets_df.dtypes

Unnamed: 0                  int64
username                   object
tweet_id                    int64
reply_to                   object
date                       object
retweets                    int64
favorites                   int64
hashtags                   object
mentions                   object
text                       object
permalink                  object
Time          datetime64[ns, UTC]
dtype: object

In [6]:
elon_tweets_df = elon_tweets_df.drop(['Unnamed: 0','date'],axis='columns')

In [7]:
elon_tweets_df.index

RangeIndex(start=0, stop=9807, step=1)

In [8]:
elon_tweets_df.head(15)

Unnamed: 0,username,tweet_id,reply_to,retweets,favorites,hashtags,mentions,text,permalink,Time
0,elonmusk,1267180654896254976,SpaceX,22581,250519,[],,Nine years later,https://twitter.com/elonmusk/status/1267180654...,2020-05-31 19:46:25+00:00
1,elonmusk,1267160409498357764,NASASpaceflight,81,2494,[],,Must be due to relativistic aging,https://twitter.com/elonmusk/status/1267160409...,2020-05-31 18:25:58+00:00
2,elonmusk,1267157474886455296,NASASpaceflight,708,14436,[],,Brought home by same person who placed it ther...,https://twitter.com/elonmusk/status/1267157474...,2020-05-31 18:14:19+00:00
3,elonmusk,1267156817295085575,Rogozin,1209,7558,[],,"Спасибо, сэр, ха-ха. Мы рассчитываем на взаимо...",https://twitter.com/elonmusk/status/1267156817...,2020-05-31 18:11:42+00:00
4,elonmusk,1267146619562201090,SpaceX,5576,67423,[],@Space_Station,Congratulations Bob & Doug on docking & hatch ...,https://twitter.com/elonmusk/status/1267146619...,2020-05-31 17:31:11+00:00
5,elonmusk,1267057495773675521,TeslaGong,81,3948,[],,Sure,https://twitter.com/elonmusk/status/1267057495...,2020-05-31 11:37:02+00:00
6,elonmusk,1267056905601638404,TeslaTested,1650,84762,[],,Probably,https://twitter.com/elonmusk/status/1267056905...,2020-05-31 11:34:41+00:00
7,elonmusk,1267056312497721344,SpaceX,16259,149590,[],@Space_Station,Dragon docks with @Space_Station in ~3 hours,https://twitter.com/elonmusk/status/1267056312...,2020-05-31 11:32:20+00:00
8,elonmusk,1266890648587776003,NASA,4042,64610,[],,Dragonship Endeavor,https://twitter.com/elonmusk/status/1266890648...,2020-05-31 00:34:02+00:00
9,elonmusk,1266811094527508481,,54238,862612,[],,5 mins to T-0,https://twitter.com/elonmusk/status/1266811094...,2020-05-30 19:17:55+00:00


In [18]:
elon_replies_df = elon_tweets_df.loc[elon_tweets_df['reply_to'].notna()]
elon_mentions_df = elon_tweets_df.loc[elon_tweets_df['mentions'].notna()]
elon_hashtags_df = elon_tweets_df.loc[elon_tweets_df['hashtags']!='[]']
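
The hashtag filter compares against the string `'[]'` because, after the round trip through the CSV file, each cell in the `hashtags` column appears to hold the text of a Python list rather than a list. If real lists are needed later, something like the sketch below could recover them. `parse_hashtags` is a hypothetical helper, and the sample cells are copied from the output shown in this notebook.

```python
import ast


def parse_hashtags(cell):
    """Turn a stringified list such as "['#OccupyMars']" back into a list."""
    # literal_eval only evaluates Python literals, so it is safer than eval.
    return list(ast.literal_eval(cell))


cells = ["[]", "['#OccupyMars']", "['#donotpanic']"]
parsed = [parse_hashtags(c) for c in cells]

# Equivalent of the "!= '[]'" filter, but on real lists.
with_tags = [tags for tags in parsed if tags]
print(parsed)
print(with_tags)
```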

In [20]:
elon_hashtags_df.head(50)

Unnamed: 0,username,tweet_id,reply_to,retweets,favorites,hashtags,mentions,text,permalink,Time
561,elonmusk,1251335445977403392,flcnhvy,316,2280,['#CancelNewsNetwork'],,#CancelNewsNetwork,https://twitter.com/elonmusk/status/1251335445...,2020-04-18 02:23:13+00:00
1144,elonmusk,1226132778967687170,SachaBaronCohen,8248,60748,['#DeleteFacebook'],,#DeleteFacebook It’s lame,https://twitter.com/elonmusk/status/1226132778...,2020-02-08 13:16:49+00:00
2045,elonmusk,1179957355628253185,SciGuySpace,3333,31908,['#Armageddon69'],@NASA,Excited about launching @NASA asteroid defense...,https://twitter.com/elonmusk/status/1179957355...,2019-10-04 03:12:10+00:00
2988,elonmusk,1141132845202599937,nichegamer,147,6235,['#moneygang'],,"Actually, I stole it from my secret meme deale...",https://twitter.com/elonmusk/status/1141132845...,2019-06-18 23:57:26+00:00
4294,elonmusk,1082180642937491456,,33665,312123,['#NewProfilePic'],,#NewProfilePic,https://twitter.com/elonmusk/status/1082180642...,2019-01-07 07:42:25+00:00
5084,elonmusk,1041555319166447616,,15252,83580,['#OccupyMars'],,#OccupyMars,https://twitter.com/elonmusk/status/1041555319...,2018-09-17 05:11:53+00:00
5656,elonmusk,1010431046460923905,,306,5235,['#donotpanic'],,#donotpanic,https://twitter.com/elonmusk/status/1010431046...,2018-06-23 07:55:09+00:00
5863,elonmusk,1005564275656568832,,8802,45802,['#ThrowFlamesResponsibly'],,Terms & conditions for “Not-a-Flamethrower” Pl...,https://twitter.com/elonmusk/status/1005564275...,2018-06-09 21:36:20+00:00
5927,elonmusk,1002237545273483264,paulmasonnews,387,5049,['#Pravduh'],,#Pravduh,https://twitter.com/elonmusk/status/1002237545...,2018-05-31 17:17:06+00:00
6562,elonmusk,960975644644593664,,7940,41038,['#FalconHeavy'],,Camera views from inside the payload fairing #...,https://twitter.com/elonmusk/status/9609756446...,2018-02-06 20:37:02+00:00


In [27]:
from datetime import datetime

In [28]:
datetime.utcnow()

datetime.datetime(2020, 6, 2, 2, 40, 26, 537553)

In [43]:
t = elon_tweets_df['Time'].iloc[0]
elon_tweets_df['Time'].iloc[0]

Timestamp('2020-05-31 19:46:25+0000', tz='UTC')

In [44]:
t.date().day

31

In [47]:
from collections import defaultdict
import os, sys
import time
import pandas as pd
import GetOldTweets3 as got

def get_new_tweets(t_last_tweet, username="elonmusk"):
    """Scrape the tweets posted by `username` since the day of `t_last_tweet`."""
    # t_last_tweet must be a pandas Timestamp
    os.makedirs('tweet_data', exist_ok=True)
    # Build the yyyy-mm-dd date string for setSince
    date_str = t_last_tweet.strftime("%Y-%m-%d")
    count = 0
    # Creation of query object                                                                                                                                                                                      
    tweetCriteria = got.manager.TweetCriteria().setUsername(username)\
                                               .setMaxTweets(count)\
                                               .setSince(date_str)
    # Creation of list that contains all tweets                                                                                                                                                                     
    tweets = None
    for ntries in range(5):
        try:
            tweets = got.manager.TweetManager.getTweets(tweetCriteria)
        except SystemExit:
            print("Trying again in 15 minutes.")
            time.sleep(15*60)
        else:
            break
    if tweets is None:
        print("Failed after 5 tries, quitting!")
        exit(1)

    data = defaultdict(list)
    for t in tweets:
        data["username"].append(username)
        data["tweet_id"].append(t.id)
        data["reply_to"].append(t.to)
        data["date"].append(t.date)
        data["retweets"].append(t.retweets)
        data["favorites"].append(t.favorites)
        data["hashtags"].append(list(set(t.hashtags.split())))
        data["mentions"].append(t.mentions)
        data["text"].append(t.text)
        data["permalink"].append(t.permalink)
    if len(data) == 0: #no new tweets
        return None
    else:
        #make a DataFrame out of the scraped tweets
        df = pd.DataFrame(data, columns=["username","tweet_id","reply_to","date","retweets","favorites","hashtags","mentions","text","permalink"])        
        # Parse the 'date' column into a datetime column named 'Time' (time of day is kept).
        df['Time'] = pd.to_datetime(df['date'])
        #df = df.drop(['Unnamed: 0','date'],axis='columns') #unused columns
        return df
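
The cutoff handed to `setSince` is a calendar date, not a timestamp, so two details are worth illustrating: how the date string is formatted, and the fact that tweets from earlier on the same day as the last saved tweet may come back again. The sketch below uses plain `datetime` objects as stand-ins for pandas Timestamps. It assumes `setSince` includes the given day, which I have not verified, and reuses timestamps from the output in this notebook.

```python
from datetime import datetime, timezone

utc = timezone.utc
t_last = datetime(2020, 6, 2, 2, 54, 3, tzinfo=utc)

# Plain string concatenation gives an unpadded date; strftime pads it.
manual = str(t_last.year) + "-" + str(t_last.month) + "-" + str(t_last.day)
padded = t_last.strftime("%Y-%m-%d")
print(manual)  # 2020-6-2
print(padded)  # 2020-06-02

# Keep only tweets strictly newer than the last one already saved.
last_saved = datetime(2020, 5, 31, 19, 46, 25, tzinfo=utc)
fetched = [
    datetime(2020, 5, 31, 19, 46, 25, tzinfo=utc),  # already saved
    datetime(2020, 6, 1, 10, 27, 19, tzinfo=utc),
    datetime(2020, 6, 2, 2, 54, 3, tzinfo=utc),
]
new_only = [t for t in fetched if t > last_saved]
print(len(new_only))  # 2
```

The same comparison could be applied to the `Time` column of the returned DataFrame before appending it to the saved data.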

In [48]:
test_df = get_new_tweets(elon_tweets_df['Time'].iloc[0],username = "elonmusk")

In [49]:
test_df.head()

Unnamed: 0,username,tweet_id,reply_to,date,retweets,favorites,hashtags,mentions,text,permalink,Time
0,elonmusk,1267650659320500226,,2020-06-02 02:54:03+00:00,3430,42194,[],,Off Twitter for a while,https://twitter.com/elonmusk/status/1267650659...,2020-06-02 02:54:03+00:00
1,elonmusk,1267531196751323144,PPathole,2020-06-01 18:59:21+00:00,1779,26679,[],,Starship is the key to making life multiplanet...,https://twitter.com/elonmusk/status/1267531196...,2020-06-01 18:59:21+00:00
2,elonmusk,1267415489111785472,mharrisonair,2020-06-01 11:19:34+00:00,410,10044,[],,Well said,https://twitter.com/elonmusk/status/1267415489...,2020-06-01 11:19:34+00:00
3,elonmusk,1267409179339296768,DjKeyWay,2020-06-01 10:54:30+00:00,774,4926,[#JusticeForGeorge],,Definitely not right that the other officers w...,https://twitter.com/elonmusk/status/1267409179...,2020-06-01 10:54:30+00:00
4,elonmusk,1267402337653587968,scale_banana,2020-06-01 10:27:19+00:00,1474,61511,[],,Where’s the banana!?,https://twitter.com/elonmusk/status/1267402337...,2020-06-01 10:27:19+00:00


In [50]:
elon_tweets_df.to_csv("tweet_data/elonmusk.csv")

In [51]:
# Note: after every reload we have to drop the saved index column and re-parse 'Time' as datetime.
reload_test_df = pd.read_csv('./tweet_data/elonmusk.csv').drop(['Unnamed: 0'],axis='columns')
reload_test_df['Time'] = pd.to_datetime(reload_test_df['Time'])

In [54]:
reload_test_df.head()


Unnamed: 0.1,Unnamed: 0,username,tweet_id,reply_to,retweets,favorites,hashtags,mentions,text,permalink,Time
0,0,elonmusk,1267180654896254976,SpaceX,22581,250519,[],,Nine years later,https://twitter.com/elonmusk/status/1267180654...,2020-05-31 19:46:25+00:00
1,1,elonmusk,1267160409498357764,NASASpaceflight,81,2494,[],,Must be due to relativistic aging,https://twitter.com/elonmusk/status/1267160409...,2020-05-31 18:25:58+00:00
2,2,elonmusk,1267157474886455296,NASASpaceflight,708,14436,[],,Brought home by same person who placed it ther...,https://twitter.com/elonmusk/status/1267157474...,2020-05-31 18:14:19+00:00
3,3,elonmusk,1267156817295085575,Rogozin,1209,7558,[],,"Спасибо, сэр, ха-ха. Мы рассчитываем на взаимо...",https://twitter.com/elonmusk/status/1267156817...,2020-05-31 18:11:42+00:00
4,4,elonmusk,1267146619562201090,SpaceX,5576,67423,[],@Space_Station,Congratulations Bob & Doug on docking & hatch ...,https://twitter.com/elonmusk/status/1267146619...,2020-05-31 17:31:11+00:00
