COSC2671 Social Media and Network Analytics

# Assignment 2 - Twitter posts downloader

@author Lukas Krodinger, s3961415

Note that this notebook requires the file twitterClient.py, written by Jeffrey Chan, configured with a valid Twitter bearer token (bearerToken) whose rate limit has not been exceeded.

In [15]:
import json
import math

import tweepy
from workshop03Code import twitterClient

In [16]:
def load_tweets(filename):
    """
    Loads the tweets from the file with the given name into a list of tweets.
    The file is expected to contain one JSON object (one tweet) per line.

    @param filename: The filename of the file to load the tweets from.

    @returns: A list of tweets (one dictionary per tweet).
    """
    tweets = []
    with open(filename, 'r') as f:
        for line in f:
            tweet = json.loads(line)
            tweets.append(tweet)
    return tweets
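
To illustrate the file layout that load_tweets expects (one JSON object per line), the following sketch writes two made-up tweets to a temporary file and reads them back. The sample tweets and the temporary file are assumptions for this example only; the function body repeats the logic above so that the snippet runs on its own.

```python
import json
import os
import tempfile


def load_tweets(filename):
    """Same logic as above: parse one JSON object per line."""
    tweets = []
    with open(filename, 'r') as f:
        for line in f:
            tweets.append(json.loads(line))
    return tweets


# Two made-up tweets, for illustration only
sample_tweets = [
    {"id": "1", "text": "first example tweet"},
    {"id": "2", "text": "second example tweet"},
]

# Write them in the one-object-per-line layout
with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False) as tmp:
    for tweet in sample_tweets:
        tmp.write(json.dumps(tweet) + '\n')
    tmp_name = tmp.name

# Read the file back and clean up the temporary file
loaded_tweets = load_tweets(tmp_name)
os.remove(tmp_name)

print(len(loaded_tweets), loaded_tweets[0]["text"])
```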

In [17]:
client = twitterClient.twitterClient()

Here, I define the search query, the fields to download for each tweet, the maximum number of tweets to download, and the name of the output JSON file.

My search focuses on ...
I download all tweet fields that tweepy supports and that require no user-level authentication, since the fields needed for the analysis can still be selected later on.
Note that max_tweets might not be reached, because the recent search used here only returns tweets that are at most one week old.

In [18]:
# Define which tweets to download
search_query = 'tennis -table'

# All non-authenticated tweet fields
all_tweet_fields = ['id', 'text', 'attachments', 'author_id', 'context_annotations', 'conversation_id', 'created_at', 'entities', 'geo', 'in_reply_to_user_id', 'lang', 'possibly_sensitive', 'public_metrics', 'referenced_tweets', 'reply_settings', 'source', 'withheld']

# The maximum amount of tweets to download
max_tweets = 300000  # 50000 was used here

# The filename of the file to store the tweets into
all_twitter_fields_filename = "tennis_2022_10_13_10_50.json"
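
Because every available field is downloaded, the fields needed for a particular analysis can be selected afterwards. A minimal sketch of such a filter, assuming the tweets are plain dictionaries as returned by load_tweets; the helper keep_fields and the sample tweet are hypothetical and not part of this notebook:

```python
def keep_fields(tweet, fields):
    """Return a copy of the tweet dictionary restricted to the given fields."""
    return {field: tweet[field] for field in fields if field in tweet}


# Made-up tweet, for illustration only
example_tweet = {
    "id": "1",
    "text": "example tweet about tennis",
    "lang": "en",
    "possibly_sensitive": False,
}

# 'geo' is requested but not present in this tweet, so it is skipped
reduced_tweet = keep_fields(example_tweet, ["id", "text", "geo"])
print(reduced_tweet)
```

Skipping missing fields instead of raising an error seems the safer choice here, since not every tweet carries every optional field.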

I download the tweets via the tweepy client in a paginated manner (100 at once).

In [19]:
tweets = []

# Alternative: let tweepy handle the pagination.
# try:
#     # Download the tweets paginated, 100 at once
#     for tweet in tweepy.Paginator(client.search_recent_tweets, search_query, max_results=100, tweet_fields=all_tweet_fields).flatten(limit=max_tweets):
#         tweets.append(tweet)
# finally:
#     print(len(tweets))

# Manual pagination: request up to 100 tweets per call and follow the next_token
next_token = None  # None requests the first page
while len(tweets) < max_tweets:
    print(len(tweets))  # progress: number of tweets collected so far
    try:
        twitterResponse = client.search_recent_tweets(search_query, max_results=100, tweet_fields=all_tweet_fields, next_token=next_token)
    except Exception as error:
        # Stop on errors (e.g. an exceeded rate limit), but keep what was downloaded so far
        print("Download stopped: ", error)
        break

    # Store the tweets of this page, including the very first one
    if twitterResponse.data is not None:
        tweets.extend(twitterResponse.data)

    # A missing next_token means the last page has been reached
    next_token = twitterResponse.meta.get("next_token")
    if next_token is None:
        break

print("Number of tweets downloaded: ", len(tweets))
print(next_token)

0
100
199
298
397
496
595
694
793
893
993
1093
1193
1293
1392
1491
1591
1688
1786
1884
1984
2082
2182
2281
2380
2479
2576
2674
2773
2871
2971
3070
3169
3268
3368
3468
3568
3668
3767
3866
3966
4066
4166
4266
4364
4462
4562
4660
4760
4858
4957
5056
5155
5253
5351
5450
5549
5649
5749
5849
5948
6047
6146
6245
6344
6442
6541
6636
6733
6833
6933
7032
7130
7230
7329
7428
7528
7628
7728
7828
7927
8027
8127
8226
8325
8425
8524
8624
8723
8823
8922
9022
9122
9219
9317
9417
9514
9613
9712
9810
9910
10009
10108
10208
10307
10406
10506
10605
10704
10804
10904
11003
11102
11200
11300
11400
11500
11599
11698
11798
11897
11997
12096
12196
12294
12394
12494
12593
12692
12791
12890
12988
13087
13187
13287
13387
13487
13583
13683
13783
13882
13981
14078
14178
14278
14378
14478
14576
14676
14776
14875
14974
15073
15172
15271
15371
15471
15571
15670
15770
15870
15970
16070
16170
16270
16369
16469
16568
16668
16768
16868
16968
17068
17168
17266
17366
17465
17565
17664
17764
17864
17964
18063
18162
18261
1836

b26v89c19zqg8o3fpzbngfdivtupw624idr1ro4wl0dbx
b26v89c19zqg8o3fpzbngdaan5f7ocvpscb5vemqqlif1
b26v89c19zqg8o3fpzbngdaafj8qv80nifj3dw5sl59j1

b26v89c19zqg8o3fpzbn1e117y06l7zcelrqn0zafzcl9
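
The pagination pattern used above can be tried out without API access. The following sketch replaces the Twitter client with a stub that serves made-up pages; fetch_page, collect_items, FAKE_PAGES and the page contents are hypothetical names and data, used only to show how the next_token is passed from one request to the next.

```python
# Made-up pages keyed by the token that requests them; None requests the first page
FAKE_PAGES = {
    None: {"data": ["t1", "t2", "t3"], "meta": {"next_token": "a"}},
    "a": {"data": ["t4", "t5"], "meta": {"next_token": "b"}},
    "b": {"data": ["t6"], "meta": {}},  # last page: no next_token
}


def fetch_page(next_token=None):
    """Stub standing in for a single search request."""
    return FAKE_PAGES[next_token]


def collect_items(max_items):
    """Follow next_token from page to page until max_items or the last page is reached."""
    items = []
    next_token = None
    while len(items) < max_items:
        page = fetch_page(next_token)
        if page["data"] is not None:
            items.extend(page["data"])
        next_token = page["meta"].get("next_token")
        if next_token is None:
            break
    return items


all_items = collect_items(max_items=100)
first_items = collect_items(max_items=4)
print(all_items)
print(first_items)
```

Note that the limit is only checked between pages, so the result can exceed max_items by up to one page minus one item; in the sketch, a limit of 4 yields 5 items.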

Now I store the downloaded tweets to the specified output file.

In [20]:
print(len(tweets))


44552


In [24]:
with open(all_twitter_fields_filename, 'w') as json_file:
    for tweet in tweets:
        json.dump(tweet.data, json_file)
        json_file.write('\n')

print("Tweets successfully stored to: ", all_twitter_fields_filename)

Tweets successfully stored to:  tennis_2022_10_13_10_50.json
